# Part 3 - Text analysis 

# 3.a Computing PMI

In this assessment you are tasked with discovering strong associations between concepts in Airbnb reviews. The starter code provided in this notebook is for orientation only. The imports below are sufficient to implement a valid answer.
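
As a point of reference for the later steps, here is a minimal sketch of how PMI could be computed from a dictionary of co-occurrence counts. It assumes the standard definition, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with probabilities estimated from the counts. The function name `pmi` and the toy counts are invented for illustration and are not part of the starter code.

```python
import math

def pmi(cooc, center, context):
    """Pointwise mutual information (base 2) from a dict-of-dicts of counts."""
    # Total number of (center, context) co-occurrence events.
    total = sum(sum(row.values()) for row in cooc.values())
    joint = cooc.get(center, {}).get(context, 0)
    if joint == 0:
        return float("-inf")  # the pair never co-occurs
    # Marginal counts for the center word (row sum) and context word (column sum).
    center_total = sum(cooc[center].values())
    context_total = sum(row.get(context, 0) for row in cooc.values())
    return math.log2(joint * total / (center_total * context_total))

# Toy counts, invented for illustration only.
toy = {
    "room": {"clean": 3, "great": 1},
    "host": {"clean": 1, "great": 3},
}
print(round(pmi(toy, "room", "clean"), 3))  # positive: the pair is over-represented
print(pmi(toy, "room", "great"))            # negative: the pair is under-represented
```

A positive value means the pair co-occurs more often than independence would predict, which is the kind of "strong association" the task asks for.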

### Imports, data loading and helper functions

We first connect our Google Drive, import pandas, numpy and some useful nltk and collections modules, then load the DataFrame and define a function for printing the current time, which is useful for logging progress in some of the tasks.
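
The cells below do not show the time-printing helper mentioned above. A minimal sketch of what such a helper might look like follows; the name `log_time` and the timestamp format are assumptions, not part of the original notebook.

```python
from datetime import datetime

def log_time(message=""):
    """Print the current time followed by an optional message and return the line."""
    # Hypothetical helper: the timestamp format is an arbitrary choice.
    line = f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {message}"
    print(line)
    return line

log_time("starting process_reviews")
```

Calling it before and after a long-running step, such as POS tagging the full dataset, gives a rough idea of how long that step took.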

In [1]:
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()

In [2]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [3]:
# load stopwords
sw = set(stopwords.words('english'))

In [4]:
sw

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [5]:
'a' in sw

True

In [6]:
p = './'
df = pd.read_csv(os.path.join(p,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

In [7]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [8]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A function `process_reviews(df)` that will take as input the original dataframe and will return it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [9]:
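# test on the first 100 reviews; copy so that adding columns later does not write to a slice of the original
df = df.iloc[:100,:].copy()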

In [10]:
df['comments'][0]

'Daniel is really cool. The place was nice and clean. Very quiet neighborhood. He had maps and a lonely planet guide book in the room for you to use. I didnt have any trouble finding the place from Central Station. I would defintely come back! Thanks!'

In [11]:
word_tokenize(df['comments'][0])

['Daniel',
 'is',
 'really',
 'cool',
 '.',
 'The',
 'place',
 'was',
 'nice',
 'and',
 'clean',
 '.',
 'Very',
 'quiet',
 'neighborhood',
 '.',
 'He',
 'had',
 'maps',
 'and',
 'a',
 'lonely',
 'planet',
 'guide',
 'book',
 'in',
 'the',
 'room',
 'for',
 'you',
 'to',
 'use',
 '.',
 'I',
 'didnt',
 'have',
 'any',
 'trouble',
 'finding',
 'the',
 'place',
 'from',
 'Central',
 'Station',
 '.',
 'I',
 'would',
 'defintely',
 'come',
 'back',
 '!',
 'Thanks',
 '!']

In [12]:
w = word_tokenize(df['comments'][0])
pos_tag(w)

[('Daniel', 'NNP'),
 ('is', 'VBZ'),
 ('really', 'RB'),
 ('cool', 'JJ'),
 ('.', '.'),
 ('The', 'DT'),
 ('place', 'NN'),
 ('was', 'VBD'),
 ('nice', 'JJ'),
 ('and', 'CC'),
 ('clean', 'JJ'),
 ('.', '.'),
 ('Very', 'RB'),
 ('quiet', 'JJ'),
 ('neighborhood', 'NN'),
 ('.', '.'),
 ('He', 'PRP'),
 ('had', 'VBD'),
 ('maps', 'NNS'),
 ('and', 'CC'),
 ('a', 'DT'),
 ('lonely', 'JJ'),
 ('planet', 'NN'),
 ('guide', 'NN'),
 ('book', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('room', 'NN'),
 ('for', 'IN'),
 ('you', 'PRP'),
 ('to', 'TO'),
 ('use', 'VB'),
 ('.', '.'),
 ('I', 'PRP'),
 ('didnt', 'VBP'),
 ('have', 'VBP'),
 ('any', 'DT'),
 ('trouble', 'NN'),
 ('finding', 'VBG'),
 ('the', 'DT'),
 ('place', 'NN'),
 ('from', 'IN'),
 ('Central', 'JJ'),
 ('Station', 'NNP'),
 ('.', '.'),
 ('I', 'PRP'),
 ('would', 'MD'),
 ('defintely', 'RB'),
 ('come', 'VB'),
 ('back', 'RB'),
 ('!', '.'),
 ('Thanks', 'NNS'),
 ('!', '.')]

In [13]:
df['tokenized'] = df['comments'].apply(lambda x : word_tokenize(x))
df['tagged'] = df['tokenized'].apply(lambda x : pos_tag(x))
df

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,tokenized,tagged
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...,"[Daniel, is, really, cool, ., The, place, was,...","[(Daniel, NNP), (is, VBZ), (really, RB), (cool..."
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...,"[Daniel, is, the, most, amazing, host, !, His,...","[(Daniel, NNP), (is, VBZ), (the, DT), (most, R..."
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...,"[We, had, such, a, great, time, in, Amsterdam,...","[(We, PRP), (had, VBD), (such, JJ), (a, DT), (..."
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...,"[Very, professional, operation, ., Room, is, v...","[(Very, RB), (professional, JJ), (operation, N..."
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...,"[Daniel, is, highly, recommended, ., He, provi...","[(Daniel, NNP), (is, VBZ), (highly, RB), (reco..."
...,...,...,...,...,...,...,...,...
95,2818,21664685,2014-10-21,13195901,Doruk,Daniel was a great host. He had prepared anyth...,"[Daniel, was, a, great, host, ., He, had, prep...","[(Daniel, NNP), (was, VBD), (a, DT), (great, J..."
96,2818,21787755,2014-10-24,21388205,Tuan,Daniel is nice. It was fun.,"[Daniel, is, nice, ., It, was, fun, .]","[(Daniel, NNP), (is, VBZ), (nice, JJ), (., .),..."
97,2818,22033157,2014-10-28,4592146,AmBer,Daniel was an incredible host. The room looked...,"[Daniel, was, an, incredible, host, ., The, ro...","[(Daniel, NNP), (was, VBD), (an, DT), (incredi..."
98,2818,23174233,2014-11-24,533940,Victor,A really great Airbnb experience with Daniel. ...,"[A, really, great, Airbnb, experience, with, D...","[(A, DT), (really, RB), (great, JJ), (Airbnb, ..."


In [14]:
def lower(x):
    return (x[0].lower(),x[1])
list(map(lower,df['tagged'][0]))

[('daniel', 'NNP'),
 ('is', 'VBZ'),
 ('really', 'RB'),
 ('cool', 'JJ'),
 ('.', '.'),
 ('the', 'DT'),
 ('place', 'NN'),
 ('was', 'VBD'),
 ('nice', 'JJ'),
 ('and', 'CC'),
 ('clean', 'JJ'),
 ('.', '.'),
 ('very', 'RB'),
 ('quiet', 'JJ'),
 ('neighborhood', 'NN'),
 ('.', '.'),
 ('he', 'PRP'),
 ('had', 'VBD'),
 ('maps', 'NNS'),
 ('and', 'CC'),
 ('a', 'DT'),
 ('lonely', 'JJ'),
 ('planet', 'NN'),
 ('guide', 'NN'),
 ('book', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('room', 'NN'),
 ('for', 'IN'),
 ('you', 'PRP'),
 ('to', 'TO'),
 ('use', 'VB'),
 ('.', '.'),
 ('i', 'PRP'),
 ('didnt', 'VBP'),
 ('have', 'VBP'),
 ('any', 'DT'),
 ('trouble', 'NN'),
 ('finding', 'VBG'),
 ('the', 'DT'),
 ('place', 'NN'),
 ('from', 'IN'),
 ('central', 'JJ'),
 ('station', 'NNP'),
 ('.', '.'),
 ('i', 'PRP'),
 ('would', 'MD'),
 ('defintely', 'RB'),
 ('come', 'VB'),
 ('back', 'RB'),
 ('!', '.'),
 ('thanks', 'NNS'),
 ('!', '.')]

In [15]:
df['tagged'].apply(lambda x: list(map(lower,x)))

0     [(daniel, NNP), (is, VBZ), (really, RB), (cool...
1     [(daniel, NNP), (is, VBZ), (the, DT), (most, R...
2     [(we, PRP), (had, VBD), (such, JJ), (a, DT), (...
3     [(very, RB), (professional, JJ), (operation, N...
4     [(daniel, NNP), (is, VBZ), (highly, RB), (reco...
                            ...                        
95    [(daniel, NNP), (was, VBD), (a, DT), (great, J...
96    [(daniel, NNP), (is, VBZ), (nice, JJ), (., .),...
97    [(daniel, NNP), (was, VBD), (an, DT), (incredi...
98    [(a, DT), (really, RB), (great, JJ), (airbnb, ...
99    [(daniel, NNP), (was, VBD), (a, DT), (wonderfu...
Name: tagged, Length: 100, dtype: object

In [16]:
def lower(x):
    # x is a (word, tag) pair; lowercase the word and keep the tag unchanged
    return (x[0].lower(),x[1])


def process_reviews(df):
    # tokenize each review, POS-tag the original-case tokens, then lowercase the tagged words
    df['tokenized'] = df['comments'].apply(word_tokenize)
    df['tagged'] = df['tokenized'].apply(pos_tag)
    df['lower_tagged'] = df['tagged'].apply(lambda x: list(map(lower,x)))
    return df

In [17]:
df = process_reviews(df)

In [18]:
df

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,tokenized,tagged,lower_tagged
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...,"[Daniel, is, really, cool, ., The, place, was,...","[(Daniel, NNP), (is, VBZ), (really, RB), (cool...","[(daniel, NNP), (is, VBZ), (really, RB), (cool..."
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...,"[Daniel, is, the, most, amazing, host, !, His,...","[(Daniel, NNP), (is, VBZ), (the, DT), (most, R...","[(daniel, NNP), (is, VBZ), (the, DT), (most, R..."
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...,"[We, had, such, a, great, time, in, Amsterdam,...","[(We, PRP), (had, VBD), (such, JJ), (a, DT), (...","[(we, PRP), (had, VBD), (such, JJ), (a, DT), (..."
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...,"[Very, professional, operation, ., Room, is, v...","[(Very, RB), (professional, JJ), (operation, N...","[(very, RB), (professional, JJ), (operation, N..."
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...,"[Daniel, is, highly, recommended, ., He, provi...","[(Daniel, NNP), (is, VBZ), (highly, RB), (reco...","[(daniel, NNP), (is, VBZ), (highly, RB), (reco..."
...,...,...,...,...,...,...,...,...,...
95,2818,21664685,2014-10-21,13195901,Doruk,Daniel was a great host. He had prepared anyth...,"[Daniel, was, a, great, host, ., He, had, prep...","[(Daniel, NNP), (was, VBD), (a, DT), (great, J...","[(daniel, NNP), (was, VBD), (a, DT), (great, J..."
96,2818,21787755,2014-10-24,21388205,Tuan,Daniel is nice. It was fun.,"[Daniel, is, nice, ., It, was, fun, .]","[(Daniel, NNP), (is, VBZ), (nice, JJ), (., .),...","[(daniel, NNP), (is, VBZ), (nice, JJ), (., .),..."
97,2818,22033157,2014-10-28,4592146,AmBer,Daniel was an incredible host. The room looked...,"[Daniel, was, an, incredible, host, ., The, ro...","[(Daniel, NNP), (was, VBD), (an, DT), (incredi...","[(daniel, NNP), (was, VBD), (an, DT), (incredi..."
98,2818,23174233,2014-11-24,533940,Victor,A really great Airbnb experience with Daniel. ...,"[A, really, great, Airbnb, experience, with, D...","[(A, DT), (really, RB), (great, JJ), (Airbnb, ...","[(a, DT), (really, RB), (great, JJ), (airbnb, ..."


### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 
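
One possible way to build such vocabularies is sketched below using `Counter.most_common`. This is an alternative to the sorting approach used in the cells that follow, not the notebook's own method; the function name `top_words`, its parameters and the toy data are invented for illustration. Note that `most_common` breaks ties by insertion order, so words with equal counts may come out in a different order than in the notebook's output.

```python
from collections import Counter

def top_words(tagged_reviews, tag_prefixes, stop_words, n=1000):
    """Return the n most frequent non-stopword tokens whose tag starts with a given prefix."""
    counts = Counter(
        word
        for review in tagged_reviews
        for word, tag in review
        # str.startswith accepts a tuple of prefixes, e.g. ("J", "V")
        if word not in stop_words and tag.startswith(tag_prefixes)
    )
    return [word for word, _ in counts.most_common(n)]

# Toy lowercased (word, tag) lists, invented for illustration only.
toy_reviews = [
    [("the", "DT"), ("room", "NN"), ("was", "VBD"), ("clean", "JJ")],
    [("clean", "JJ"), ("room", "NN"), ("great", "JJ"), ("host", "NN")],
]
toy_stop = {"the", "was"}

center = top_words(toy_reviews, ("N",), toy_stop)       # nouns
context = top_words(toy_reviews, ("J", "V"), toy_stop)  # adjectives or verbs
print(center, context)
```

Filtering before counting avoids the placeholder `0` entries, so there is no need to strip a dummy key from the top of the list afterwards.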

In [19]:
df['lower_tagged'][0][:10]

[('daniel', 'NNP'),
 ('is', 'VBZ'),
 ('really', 'RB'),
 ('cool', 'JJ'),
 ('.', '.'),
 ('the', 'DT'),
 ('place', 'NN'),
 ('was', 'VBD'),
 ('nice', 'JJ'),
 ('and', 'CC')]

In [20]:
list_lower_tagged = sum(df['lower_tagged'],[])
list_lower_tagged[:10]

[('daniel', 'NNP'),
 ('is', 'VBZ'),
 ('really', 'RB'),
 ('cool', 'JJ'),
 ('.', '.'),
 ('the', 'DT'),
 ('place', 'NN'),
 ('was', 'VBD'),
 ('nice', 'JJ'),
 ('and', 'CC')]

In [21]:
list_lower_tagged[0][1][1]

'N'

In [22]:
def f(x):
    if x[0] in sw:
        return 0
    if x[1][0] =="N":
        return x[0]
    else:
        return 0


In [23]:
word_list_lower_tagged = [f(list_lower_tagged[i]) for i in range(len(list_lower_tagged))]
word_list_lower_tagged[:100]

['daniel',
 0,
 0,
 0,
 0,
 0,
 'place',
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 'neighborhood',
 0,
 0,
 0,
 'maps',
 0,
 0,
 0,
 'planet',
 'guide',
 'book',
 0,
 0,
 'room',
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 'trouble',
 0,
 0,
 'place',
 0,
 0,
 'station',
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 'thanks',
 0,
 'daniel',
 0,
 0,
 0,
 0,
 'host',
 0,
 0,
 'place',
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 'everything',
 0,
 0,
 0,
 0,
 0,
 0,
 'bed',
 0,
 0,
 0,
 'maps',
 0,
 'mini-fridge',
 0,
 'towels',
 0,
 0,
 'toiletries',
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [24]:
Count_word_Noun = Counter(word_list_lower_tagged)
Count_word_Noun

Counter({'daniel': 202,
         0: 7566,
         'place': 73,
         'neighborhood': 15,
         'maps': 25,
         'planet': 1,
         'guide': 2,
         'book': 1,
         'room': 66,
         'trouble': 3,
         'station': 8,
         'thanks': 9,
         'host': 60,
         'everything': 35,
         'bed': 8,
         'mini-fridge': 2,
         'towels': 12,
         'toiletries': 3,
         'way': 10,
         'highly': 8,
         'time': 27,
         'amsterdam': 83,
         'helpful': 3,
         'coffee': 9,
         'supplies': 1,
         'bathroom': 16,
         'location': 13,
         'tram': 6,
         'stop': 4,
         'minutes': 6,
         'city': 33,
         'center': 14,
         'totally': 4,
         'operation': 1,
         'attractions': 2,
         'stay': 39,
         'necessities': 1,
         'buses': 2,
         'trams': 3,
         'clean': 6,
         'beds': 13,
         'advices': 1,
         'tips': 5,
         'tools': 1,
     

In [25]:
Count_word_Noun_list = [(l,k) for k,l in sorted([(j,i) for i,j in Count_word_Noun.items()], reverse=True)]
Count_word_Noun_list[:10]

[(0, 7566),
 ('daniel', 202),
 ('amsterdam', 83),
 ('place', 73),
 ('room', 66),
 ('host', 60),
 ('stay', 39),
 ('apartment', 39),
 ('everything', 35),
 ('city', 33)]

In [26]:
if Count_word_Noun_list[0][0] == 0:
    tmp = Count_word_Noun_list[1:1001]
else :
    tmp = Count_word_Noun_list[0:1000]
tmp

[(0, 7566),
 ('daniel', 202),
 ('amsterdam', 83),
 ('place', 73),
 ('room', 66),
 ('host', 60),
 ('stay', 39),
 ('apartment', 39),
 ('everything', 35),
 ('city', 33),
 ('time', 27),
 ('maps', 25),
 ('information', 19),
 ('home', 17),
 ('bathroom', 16),
 ('neighborhood', 15),
 ('bus', 15),
 ('center', 14),
 ('location', 13),
 ('beds', 13),
 ('towels', 12),
 ('trip', 11),
 ('tea', 11),
 ('experience', 11),
 ('bikes', 11),
 ('area', 11),
 ('way', 10),
 ('transport', 10),
 ('things', 10),
 ('guests', 10),
 ('thanks', 9),
 ('lot', 9),
 ('kind', 9),
 ('coffee', 9),
 ('station', 8),
 ('questions', 8),
 ('places', 8),
 ('highly', 8),
 ('guest', 8),
 ('friend', 8),
 ('directions', 8),
 ('bed', 8),
 ('transportation', 7),
 ('service', 7),
 ('recommendations', 7),
 ('garden', 7),
 ('daniels', 7),
 ('arrival', 7),
 ('anyone', 7),
 ('airbnb', 7),
 ('access', 7),
 ('très', 6),
 ('travel', 6),
 ('tram', 6),
 ('thing', 6),
 ('space', 6),
 ('restaurants', 6),
 ('person', 6),
 ('nights', 6),
 ('minutes'

In [27]:
cent_vocab = [tmp[i][0] for i in range(len(tmp))]
cent_vocab

[0,
 'daniel',
 'amsterdam',
 'place',
 'room',
 'host',
 'stay',
 'apartment',
 'everything',
 'city',
 'time',
 'maps',
 'information',
 'home',
 'bathroom',
 'neighborhood',
 'bus',
 'center',
 'location',
 'beds',
 'towels',
 'trip',
 'tea',
 'experience',
 'bikes',
 'area',
 'way',
 'transport',
 'things',
 'guests',
 'thanks',
 'lot',
 'kind',
 'coffee',
 'station',
 'questions',
 'places',
 'highly',
 'guest',
 'friend',
 'directions',
 'bed',
 'transportation',
 'service',
 'recommendations',
 'garden',
 'daniels',
 'arrival',
 'anyone',
 'airbnb',
 'access',
 'très',
 'travel',
 'tram',
 'thing',
 'space',
 'restaurants',
 'person',
 'nights',
 'minutes',
 'days',
 'day',
 'clean',
 'bedroom',
 'wifi',
 'walk',
 'tips',
 'thank',
 'shower',
 'people',
 'night',
 'morning',
 'minute',
 'luggage',
 'lots',
 'living',
 'house',
 'hotel',
 'food',
 'door',
 'books',
 'bit',
 'bike',
 'à',
 'window',
 'visit',
 'tourists',
 'totally',
 'séjour',
 'stops',
 'stop',
 'sehr',
 'proble

In [28]:
from collections import Counter
def f_N(x):
    if x[0] in sw:
        return 0
    elif x[1][0] =="N":
        return x[0]
    else:
        return 0

def f_JV(x):
    if x[0] in sw:
        return 0
    elif (x[1][0] =="J") | (x[1][0] =="V"):
        return x[0]
    else:
        return 0

#NOUN
word_list_lower_tagged = [f_N(list_lower_tagged[i]) for i in range(len(list_lower_tagged))]
Count_word_Noun = Counter(word_list_lower_tagged)
Count_word_Noun_list = [(l,k) for k,l in sorted([(j,i) for i,j in Count_word_Noun.items()], reverse=True)]

if Count_word_Noun_list[0][0] == 0:
    tmp = Count_word_Noun_list[1:1001]
else :
    tmp = Count_word_Noun_list[0:1000]
    
cent_vocab = [tmp[i][0] for i in range(len(tmp))]

# VERB ADJECTIVE
word_list_lower_tagged = [f_JV(list_lower_tagged[i]) for i in range(len(list_lower_tagged))]
Count_word_JV = Counter(word_list_lower_tagged)
Count_word_JV_list = [(l,k) for k,l in sorted([(j,i) for i,j in Count_word_JV.items()], reverse=True)]

if Count_word_JV_list[0][0] == 0:
    tmp = Count_word_JV_list[1:1001]
else :
    tmp = Count_word_JV_list[0:1000]
    
cont_vocab = [tmp[i][0] for i in range(len(tmp))]
#cont_vocab

In [30]:
cont_vocab[:10]

['great',
 'clean',
 'comfortable',
 'nice',
 'stay',
 'recommend',
 'good',
 'provided',
 'get',
 'staying']

In [31]:
from collections import Counter
def f_N(x):
    if x[0] in sw:
        return 0
    elif x[1][0] =="N":
        return x[0]
    else:
        return 0

def f_JV(x):
    if x[0] in sw:
        return 0
    elif (x[1][0] =="J") | (x[1][0] =="V"):
        return x[0]
    else:
        return 0

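def get_vocab(df):
    # flatten the per-review (word, tag) lists from the DataFrame argument
    # instead of relying on the global list built in an earlier cell
    list_lower_tagged = [pair for tagged in df['lower_tagged'] for pair in tagged]
    #NOUN
    word_list_lower_tagged = [f_N(list_lower_tagged[i]) for i in range(len(list_lower_tagged))]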
    Count_word_Noun = Counter(word_list_lower_tagged)
    Count_word_Noun_list = [(l,k) for k,l in sorted([(j,i) for i,j in Count_word_Noun.items()], reverse=True)]

    if Count_word_Noun_list[0][0] == 0:
        tmp = Count_word_Noun_list[1:1001]
    else :
        tmp = Count_word_Noun_list[0:1000]

    cent_vocab = [tmp[i][0] for i in range(len(tmp))]

    # VERB ADJECTIVE
    word_list_lower_tagged = [f_JV(list_lower_tagged[i]) for i in range(len(list_lower_tagged))]
    Count_word_JV = Counter(word_list_lower_tagged)
    Count_word_JV_list = [(l,k) for k,l in sorted([(j,i) for i,j in Count_word_JV.items()], reverse=True)]

    if Count_word_JV_list[0][0] == 0:
        tmp = Count_word_JV_list[1:1001]
    else :
        tmp = Count_word_JV_list[0:1000]

    cont_vocab = [tmp[i][0] for i in range(len(tmp))]

    return cent_vocab, cont_vocab

In [32]:
cent_vocab, cont_vocab = get_vocab(df)

In [33]:
cent_vocab[:10]

['daniel',
 'amsterdam',
 'place',
 'room',
 'host',
 'stay',
 'apartment',
 'everything',
 'city',
 'time']

In [34]:
cont_vocab[:10]

['great',
 'clean',
 'comfortable',
 'nice',
 'stay',
 'recommend',
 'good',
 'provided',
 'get',
 'staying']

### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

##### Comments on exceptional cases
1. The word "daniel" appears in the context vocabulary. Only the first 100 reviews are used in this question, and "Daniel" is used unusually in this sample: one review contains "Daniel Daniel Dainel" in a single sentence, which is presumably why "daniel" ended up among the context words. With a larger sample, "daniel" should no longer appear in the context vocabulary. I inspected `cont_vocab` below and concluded that most of its entries are correct.
2. Words that can be both a noun and a verb are kept as they are, with no extra conditions, since a word such as "work" is legitimately both a center word and a context word.
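
The sentence-level approach described above can be sketched as follows. This is a simplified illustration, not the implementation used in the cells below: the name `count_coocs` and the toy sentences are invented, the result is accumulated directly into one dictionary of dictionaries, and the rule that a word is never counted as its own context is an extra assumption.

```python
from collections import Counter, defaultdict

def count_coocs(sentences, center_vocab, context_vocab):
    """Count sentence-level co-occurrences between center and context words."""
    # Sets make membership tests fast compared with scanning 1,000-item lists.
    center_set, context_set = set(center_vocab), set(context_vocab)
    coocs = defaultdict(Counter)
    for sentence in sentences:
        centers = [w for w in sentence if w in center_set]
        contexts = [w for w in sentence if w in context_set]
        for c in centers:
            for x in contexts:
                if x != c:  # assumption: a word is not counted as its own context
                    coocs[c][x] += 1
    return {c: dict(row) for c, row in coocs.items()}

# Toy lowercased, tokenized sentences, invented for illustration only.
toy_sentences = [
    ["the", "place", "was", "nice", "and", "clean", "."],
    ["very", "quiet", "neighborhood", "."],
    ["clean", "place", "."],
]
result = count_coocs(toy_sentences, ["place", "neighborhood"], ["nice", "clean", "quiet"])
print(result)
```

Using the sentence as the context window keeps unrelated nouns and adjectives from different parts of a long review from being paired.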

In [35]:
df['sent_word_comments'] = df['comments'].apply(lambda x : list(map(word_tokenize,(sent_tokenize((x).lower())))))
df

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,tokenized,tagged,lower_tagged,sent_word_comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...,"[Daniel, is, really, cool, ., The, place, was,...","[(Daniel, NNP), (is, VBZ), (really, RB), (cool...","[(daniel, NNP), (is, VBZ), (really, RB), (cool...","[[daniel, is, really, cool, .], [the, place, w..."
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...,"[Daniel, is, the, most, amazing, host, !, His,...","[(Daniel, NNP), (is, VBZ), (the, DT), (most, R...","[(daniel, NNP), (is, VBZ), (the, DT), (most, R...","[[daniel, is, the, most, amazing, host, !], [h..."
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...,"[We, had, such, a, great, time, in, Amsterdam,...","[(We, PRP), (had, VBD), (such, JJ), (a, DT), (...","[(we, PRP), (had, VBD), (such, JJ), (a, DT), (...","[[we, had, such, a, great, time, in, amsterdam..."
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...,"[Very, professional, operation, ., Room, is, v...","[(Very, RB), (professional, JJ), (operation, N...","[(very, RB), (professional, JJ), (operation, N...","[[very, professional, operation, .], [room, is..."
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...,"[Daniel, is, highly, recommended, ., He, provi...","[(Daniel, NNP), (is, VBZ), (highly, RB), (reco...","[(daniel, NNP), (is, VBZ), (highly, RB), (reco...","[[daniel, is, highly, recommended, .], [he, pr..."
...,...,...,...,...,...,...,...,...,...,...
95,2818,21664685,2014-10-21,13195901,Doruk,Daniel was a great host. He had prepared anyth...,"[Daniel, was, a, great, host, ., He, had, prep...","[(Daniel, NNP), (was, VBD), (a, DT), (great, J...","[(daniel, NNP), (was, VBD), (a, DT), (great, J...","[[daniel, was, a, great, host, .], [he, had, p..."
96,2818,21787755,2014-10-24,21388205,Tuan,Daniel is nice. It was fun.,"[Daniel, is, nice, ., It, was, fun, .]","[(Daniel, NNP), (is, VBZ), (nice, JJ), (., .),...","[(daniel, NNP), (is, VBZ), (nice, JJ), (., .),...","[[daniel, is, nice, .], [it, was, fun, .]]"
97,2818,22033157,2014-10-28,4592146,AmBer,Daniel was an incredible host. The room looked...,"[Daniel, was, an, incredible, host, ., The, ro...","[(Daniel, NNP), (was, VBD), (an, DT), (incredi...","[(daniel, NNP), (was, VBD), (an, DT), (incredi...","[[daniel, was, an, incredible, host, .], [the,..."
98,2818,23174233,2014-11-24,533940,Victor,A really great Airbnb experience with Daniel. ...,"[A, really, great, Airbnb, experience, with, D...","[(A, DT), (really, RB), (great, JJ), (Airbnb, ...","[(a, DT), (really, RB), (great, JJ), (airbnb, ...","[[a, really, great, airbnb, experience, with, ..."


In [36]:
sent_word_c = df['sent_word_comments'][0]

coocs = dict()
sent_word_c[1]        

['the', 'place', 'was', 'nice', 'and', 'clean', '.']

In [37]:
sent_word_c[3]

['he',
 'had',
 'maps',
 'and',
 'a',
 'lonely',
 'planet',
 'guide',
 'book',
 'in',
 'the',
 'room',
 'for',
 'you',
 'to',
 'use',
 '.']

In [38]:
coocs = dict()
for x in sent_word_c[3]:
    if x in cent_vocab:
        n=x
        if n not in coocs.keys():
            coocs[n] = {}
coocs

{'maps': {}, 'planet': {}, 'guide': {}, 'book': {}, 'room': {}}

In [39]:
coocs = dict()
noun_list = []
for x in sent_word_c[3]:
    if x in cent_vocab:
        noun_list.append(x)
        if x not in coocs.keys():
            coocs[x] = {}
for x in sent_word_c[3]:
    if x in cont_vocab:
        print(x)
        for noun in noun_list:
            coocs[noun][x] = coocs[noun].get(x,0)+1
    
coocs

lonely
use


{'maps': {'lonely': 1, 'use': 1},
 'planet': {'lonely': 1, 'use': 1},
 'guide': {'lonely': 1, 'use': 1},
 'book': {'lonely': 1, 'use': 1},
 'room': {'lonely': 1, 'use': 1}}

In [40]:
coocs['maps']

{'lonely': 1, 'use': 1}

In [41]:
coocs['maps']['use']

1

In [42]:
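def coocs_one(data):
    # data: one lowercased, tokenized sentence; the sentence is the context window
    tmp = dict()
    noun_list = []
    # first pass: collect every center word in the sentence
    # (a center word that occurs twice is listed twice, so its counts are incremented twice)
    for x in data:
        if x in cent_vocab:
            noun_list.append(x)
            if x not in tmp.keys():
                tmp[x] = {}
    # second pass: credit each context word to every center word in the sentence
    for x in data:
        if x in cont_vocab:
            for noun in noun_list:
                tmp[noun][x] = tmp[noun].get(x,0)+1

    return tmp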

In [43]:
cont_vocab 

['great',
 'clean',
 'comfortable',
 'nice',
 'stay',
 'recommend',
 'good',
 'provided',
 'get',
 'staying',
 'go',
 'easy',
 'quiet',
 'needed',
 'helpful',
 'gave',
 'organized',
 'public',
 'friendly',
 'see',
 'need',
 'make',
 "'s",
 'sure',
 'made',
 'first',
 'fantastic',
 'stayed',
 'recommended',
 'decorated',
 'provides',
 'much',
 'little',
 'feel',
 'available',
 'accommodating',
 'professional',
 'looking',
 'located',
 'getting',
 'est',
 'best',
 'wonderful',
 'want',
 'visit',
 'use',
 'perfect',
 'many',
 'local',
 'happy',
 'come',
 'amazing',
 'super',
 'offered',
 'let',
 'know',
 'find',
 'felt',
 'beautiful',
 'useful',
 'tram',
 'spent',
 'safe',
 'open',
 'next',
 'last',
 'fresh',
 'found',
 'excellent',
 'enjoyed',
 'enjoyable',
 'convenient',
 'central',
 'able',
 'whole',
 'welcoming',
 'vous',
 'various',
 'tidy',
 'think',
 'take',
 'small',
 'short',
 'say',
 'provide',
 'pleasant',
 'plan',
 'peaceful',
 'makes',
 'loved',
 'helping',
 'helped',
 'europ

In [44]:
sum(df['sent_word_comments'],[])

[['daniel', 'is', 'really', 'cool', '.'],
 ['the', 'place', 'was', 'nice', 'and', 'clean', '.'],
 ['very', 'quiet', 'neighborhood', '.'],
 ['he',
  'had',
  'maps',
  'and',
  'a',
  'lonely',
  'planet',
  'guide',
  'book',
  'in',
  'the',
  'room',
  'for',
  'you',
  'to',
  'use',
  '.'],
 ['i',
  'didnt',
  'have',
  'any',
  'trouble',
  'finding',
  'the',
  'place',
  'from',
  'central',
  'station',
  '.'],
 ['i', 'would', 'defintely', 'come', 'back', '!'],
 ['thanks', '!'],
 ['daniel', 'is', 'the', 'most', 'amazing', 'host', '!'],
 ['his',
  'place',
  'is',
  'extremely',
  'clean',
  ',',
  'and',
  'he',
  'provides',
  'everything',
  'you',
  'could',
  'possibly',
  'want',
  '(',
  'comfy',
  'bed',
  ',',
  'guidebooks',
  '&',
  'maps',
  ',',
  'mini-fridge',
  ',',
  'towels',
  ',',
  'even',
  'toiletries',
  ')',
  '.'],
 ['he',
  'is',
  'extremely',
  'friendly',
  'and',
  'helpful',
  ',',
  'and',
  'will',
  'go',
  'out',
  'of',
  'his',
  'way',
  't

In [45]:
df['coocs'] = df['sent_word_comments'].apply(lambda x : list(map(coocs_one,x)))
df['coocs'][0]

[{'daniel': {'daniel': 1, 'cool': 1}},
 {'place': {'place': 1, 'nice': 1, 'clean': 1},
  'nice': {'place': 1, 'nice': 1, 'clean': 1},
  'clean': {'place': 1, 'nice': 1, 'clean': 1}},
 {'neighborhood': {'quiet': 1}},
 {'maps': {'lonely': 1, 'use': 1},
  'planet': {'lonely': 1, 'use': 1},
  'guide': {'lonely': 1, 'use': 1},
  'book': {'lonely': 1, 'use': 1},
  'room': {'lonely': 1, 'use': 1}},
 {'trouble': {'didnt': 1, 'finding': 1, 'place': 1, 'central': 1},
  'place': {'didnt': 1, 'finding': 1, 'place': 1, 'central': 1},
  'station': {'didnt': 1, 'finding': 1, 'place': 1, 'central': 1}},
 {'would': {'come': 1}},
 {'thanks': {}}]

In [46]:
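def get_coocs(df, cent_vocab, cont_vocab):
    # split each lowercased review into sentences, then into word tokens
    df['sent_word_comments'] = df['comments'].apply(lambda x : list(map(word_tokenize,(sent_tokenize((x).lower())))))
    # count co-occurrences sentence by sentence; note that coocs_one reads the
    # module-level cent_vocab and cont_vocab rather than the arguments passed here
    coocs = df['sent_word_comments'].apply(lambda x : list(map(coocs_one,x)))
    return coocs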

In [47]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

In [48]:
coocs

0     [{'daniel': {'daniel': 1, 'cool': 1}}, {'place...
1     [{'daniel': {'daniel': 1, 'amazing': 1}, 'host...
2     [{'great': {'great': 1, 'amsterdam': 1}, 'time...
3     [{'operation': {'professional': 1}}, {'room': ...
4     [{'daniel': {'daniel': 1, 'recommended': 1}, '...
                            ...                        
95    [{'daniel': {'daniel': 1, 'great': 1}, 'great'...
96    [{'daniel': {'daniel': 1, 'nice': 1}, 'nice': ...
97    [{'daniel': {'daniel': 1, 'incredible': 1}, 'h...
98    [{'great': {'great': 1, 'daniel': 1}, 'airbnb'...
99    [{'daniel': {'daniel': 1, 'wonderful': 1, 'des...
Name: sent_word_comments, Length: 100, dtype: object
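
The result above holds one list of per-sentence dictionaries per review, whereas the task asks for a single dictionary of dictionaries. A minimal sketch of how the per-sentence counts could be merged follows; the name `merge_coocs` and the toy input are invented for illustration. Because iterating over a pandas Series yields its values, the same function should also accept the `coocs` Series directly.

```python
from collections import Counter, defaultdict

def merge_coocs(per_review):
    """Merge lists of per-sentence {center: {context: count}} dicts into one dict of dicts."""
    merged = defaultdict(Counter)
    for review in per_review:
        for sentence_dict in review:
            for center, row in sentence_dict.items():
                # Counter.update adds counts instead of overwriting them.
                merged[center].update(row)
    return {center: dict(row) for center, row in merged.items()}

# Toy input with the same nesting as the Series above, invented for illustration only.
toy = [
    [{"daniel": {"cool": 1}}, {"place": {"nice": 1, "clean": 1}}],
    [{"place": {"clean": 1}}, {"thanks": {}}],
]
merged = merge_coocs(toy)
print(merged)
```

Center words that never co-occur with a context word, such as "thanks" here, are kept with an empty inner dictionary.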

In [49]:
sum(coocs,[])

[{'daniel': {'daniel': 1, 'cool': 1}},
 {'place': {'place': 1, 'nice': 1, 'clean': 1},
  'nice': {'place': 1, 'nice': 1, 'clean': 1},
  'clean': {'place': 1, 'nice': 1, 'clean': 1}},
 {'neighborhood': {'quiet': 1}},
 {'maps': {'lonely': 1, 'use': 1},
  'planet': {'lonely': 1, 'use': 1},
  'guide': {'lonely': 1, 'use': 1},
  'book': {'lonely': 1, 'use': 1},
  'room': {'lonely': 1, 'use': 1}},
 {'trouble': {'didnt': 1, 'finding': 1, 'place': 1, 'central': 1},
  'place': {'didnt': 1, 'finding': 1, 'place': 1, 'central': 1},
  'station': {'didnt': 1, 'finding': 1, 'place': 1, 'central': 1}},
 {'would': {'come': 1}},
 {'thanks': {}},
 {'daniel': {'daniel': 1, 'amazing': 1}, 'host': {'daniel': 1, 'amazing': 1}},
 {'place': {'place': 1,
   'clean': 1,
   'provides': 1,
   'everything': 1,
   'want': 1,
   'comfy': 1,
   'guidebooks': 1},
  'extremely': {'place': 1,
   'clean': 1,
   'provides': 1,
   'everything': 1,
   'want': 1,
   'comfy': 1,
   'guidebooks': 1},
  'clean': {'place': 1,
  

### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 
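
A minimal sketch of one way to do this conversion with pandas is shown below. The name `cooc_dict_to_frame` is hypothetical, and passing the two vocabularies as extra arguments, so that rows and columns follow a fixed order and missing words still get a row or column, is an assumption that goes beyond the single-argument signature in the task.

```python
import pandas as pd

def cooc_dict_to_frame(cooc_dict, center_vocab, context_vocab):
    """Turn {center: {context: count}} into a center-by-context DataFrame of integers."""
    # orient="index" makes the outer keys (center words) the row labels.
    frame = pd.DataFrame.from_dict(cooc_dict, orient="index")
    # reindex fixes the row/column order and adds words that never co-occurred as NaN.
    frame = frame.reindex(index=center_vocab, columns=context_vocab)
    return frame.fillna(0).astype(int)

# Toy counts and vocabularies, invented for illustration only.
toy_coocs = {"room": {"clean": 3, "great": 1}, "host": {"great": 2}}
frame = cooc_dict_to_frame(
    toy_coocs,
    center_vocab=["room", "host", "city"],
    context_vocab=["great", "clean", "quiet"],
)
print(frame)
```

Pairs that never co-occur end up as 0, as the task requires, without looping over every cell of the matrix.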

In [50]:
cent_vocab[:5]

['daniel', 'amsterdam', 'place', 'room', 'host']

In [51]:
cont_vocab[:5]

['great', 'clean', 'comfortable', 'nice', 'stay']

In [52]:
# formulate the dictionary into dataframe
vocab = sorted(cent_vocab) # sort vocab
df = pd.DataFrame(data=np.zeros((len(cent_vocab), len(cont_vocab)), dtype=np.int16),
                  index=cent_vocab, columns=cont_vocab)
df

Unnamed: 0,great,clean,comfortable,nice,stay,recommend,good,provided,get,staying,...,accepted,accept,absolute,aber,@,5-minute,30-45,2-3,....,*
daniel,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
amsterdam,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
place,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
room,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
host,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
accommadations,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
accomation,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
accessories,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
@,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [53]:
df.at['amsterdam','great']

0

In [54]:
test = coocs[99][0]
try:
    df.at['amsterdam','great'] = test['amsterdam']['great']
except KeyError:
    # 'amsterdam' is not a center word in this sentence, so the cell stays 0
    pass
df

Unnamed: 0,great,clean,comfortable,nice,stay,recommend,good,provided,get,staying,...,accepted,accept,absolute,aber,@,5-minute,30-45,2-3,....,*
daniel,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
amsterdam,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
place,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
room,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
host,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
accommadations,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
accomation,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
accessories,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
@,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [55]:
test

{'daniel': {'daniel': 1, 'wonderful': 1, 'described/pictured': 1},
 'wonderful': {'daniel': 1, 'wonderful': 1, 'described/pictured': 1},
 'host': {'daniel': 1, 'wonderful': 1, 'described/pictured': 1},
 'room': {'daniel': 1, 'wonderful': 1, 'described/pictured': 1}}

In [56]:
test = coocs[99][0]
try:
    df.at['host','daniel'] = test['host']['daniel']
except KeyError:
    # pair not present in this sentence; leave the cell unchanged
    pass
df.at['host','daniel']

1

In [57]:
coocs

0     [{'daniel': {'daniel': 1, 'cool': 1}}, {'place...
1     [{'daniel': {'daniel': 1, 'amazing': 1}, 'host...
2     [{'great': {'great': 1, 'amsterdam': 1}, 'time...
3     [{'operation': {'professional': 1}}, {'room': ...
4     [{'daniel': {'daniel': 1, 'recommended': 1}, '...
                            ...                        
95    [{'daniel': {'daniel': 1, 'great': 1}, 'great'...
96    [{'daniel': {'daniel': 1, 'nice': 1}, 'nice': ...
97    [{'daniel': {'daniel': 1, 'incredible': 1}, 'h...
98    [{'great': {'great': 1, 'daniel': 1}, 'airbnb'...
99    [{'daniel': {'daniel': 1, 'wonderful': 1, 'des...
Name: sent_word_comments, Length: 100, dtype: object

In [58]:
len(coocs)

100

In [59]:
def cooc_dict2df(coocs):
    # start from an all-zero matrix so pairs that never co-occur stay 0
    coocdf = pd.DataFrame(data=np.zeros((len(cent_vocab), len(cont_vocab)), dtype=np.int16),
                      index=cent_vocab, columns=cont_vocab)
    cent_set, cont_set = set(cent_vocab), set(cont_vocab)
    for k in range(len(coocs)):
        if not coocs[k]:
            continue  # review with no sentence dictionaries
        # NOTE: only the first sentence dictionary of each review is counted here;
        # loop over all of coocs[k] to include every sentence
        for center, contexts in coocs[k][0].items():
            if center not in cent_set:
                continue
            for context, count in contexts.items():
                if context in cont_set:
                    coocdf.at[center, context] += count
    return coocdf

In [60]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

(759, 667)

In [61]:
coocdf

Unnamed: 0,great,clean,comfortable,nice,stay,recommend,good,provided,get,staying,...,accepted,accept,absolute,aber,@,5-minute,30-45,2-3,....,*
daniel,19,1,2,3,13,1,4,1,1,4,...,0,0,1,0,0,0,0,0,1,0
amsterdam,4,1,1,0,4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
place,4,4,1,1,5,0,3,0,1,1,...,0,0,0,0,0,0,0,0,0,0
room,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
host,12,1,2,2,2,1,2,1,1,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
accommadations,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
accomation,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
accessories,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
@,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 
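
For orientation, PMI compares how often a pair is observed with how often it would be expected if the two words were independent: PMI(x, y) = log( P(x, y) / (P(x) P(y)) ). The sketch below works this out on a made-up 2x2 count table. The name `toy_counts` and its values are hypothetical, and the natural logarithm is an assumption (other bases, such as 2, are also common).

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts: rows are center words, columns are context words
toy_counts = pd.DataFrame(
    [[4, 0], [0, 4]],
    index=["host", "room"],
    columns=["friendly", "clean"],
)

total = toy_counts.to_numpy().sum()     # grand total of all counts
p_xy = toy_counts / total               # joint probabilities P(x, y)
p_x = toy_counts.sum(axis=1) / total    # center-word marginals P(x)
p_y = toy_counts.sum(axis=0) / total    # context-word marginals P(y)

# Product of the marginals: what P(x, y) would be under independence
expected = pd.DataFrame(
    np.outer(p_x, p_y), index=toy_counts.index, columns=toy_counts.columns
)

# Pairs with a zero count give log(0) = -inf, so the warning is silenced
with np.errstate(divide="ignore"):
    toy_pmi = np.log(p_xy / expected)
print(toy_pmi)
```

In this table `host` and `friendly` co-occur twice as often as independence would predict, so their PMI is log(2), while the pairs that never co-occur come out as negative infinity and need separate handling.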

In [62]:
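def cooc2pmi(df):
    # marginal counts and grand total
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    # expected count of each pair if center and context words were independent
    expected = np.outer(row_totals, col_totals) / total

    # observed / expected ratio; cells where both are 0 become NaN
    pmidf = df / expected

    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):
        pmidf = np.log(pmidf)
    pmidf[np.isinf(pmidf)] = 0.0  # log(0) = -inf for pairs that never co-occur; set to 0
    pmidf = pmidf.fillna(0.0)     # NaN from 0/0 (all-zero rows or columns) also becomes 0
    pmidf[pmidf < 0] = 0.0        # clip negative scores, i.e. positive PMI (PPMI)
    return pmidf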

In [63]:
pmidf = cooc2pmi(coocdf)
pmidf.shape

(759, 667)

In [64]:
pmidf

Unnamed: 0,great,clean,comfortable,nice,stay,recommend,good,provided,get,staying,...,accepted,accept,absolute,aber,@,5-minute,30-45,2-3,....,*
daniel,0.157439,0.000000,0.000000,0.321374,0.163005,0.401417,0.178273,0.00000,0.000000,0.340792,...,0.0,0.0,0.689099,0.0,0.0,0.0,0.0,0.0,0.178273,0.0
amsterdam,0.000000,0.000000,0.230524,0.000000,0.317535,0.000000,0.000000,0.00000,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
place,0.000000,0.723000,0.000000,0.268264,0.252997,0.000000,0.936093,0.00000,0.435318,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
room,0.000000,0.794459,1.400595,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
host,0.527186,0.000000,0.419766,0.745188,0.000000,1.230696,0.314405,0.67108,0.219095,0.000000,...,0.0,0.0,1.518378,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
accommadations,0.000000,0.000000,2.499207,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
accomation,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
accessories,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
@,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0


### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `'towels'`), and (3) an optional named argument `N` with a default value of 10; and returns a list of `N` strings, the context words with the highest PMI scores for `center_word`, in descending order of score. You do not need to handle cases in which `center_word` is not found in `df`.
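
One possible way to do this lookup is sketched below on a one-row toy table using `Series.nlargest`. The function name `toy_topk`, the table `toy_pmi` and its scores are invented for illustration only.

```python
import pandas as pd

# Hypothetical PMI scores for a single center word
toy_pmi = pd.DataFrame(
    [[0.2, 1.5, 0.0, 0.9]],
    index=["towels"],
    columns=["great", "fresh", "noisy", "clean"],
)

def toy_topk(df, center_word, N=10):
    # Select the row of scores for the center word (a Series indexed by context word)
    row = df.loc[center_word]
    # nlargest keeps the N highest scores in descending order; return the labels as a list
    return row.nlargest(N).index.tolist()

print(toy_topk(toy_pmi, "towels", N=2))
```

If `N` is larger than the number of context words, `nlargest` simply returns all of them.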

In [90]:
def topk(df, center_word, N=10):
    # PMI scores of every context word for this center word, as a one-column frame
    tmp = pd.DataFrame(df.loc[center_word,:])
    # sort by score, keep the N best, and return their labels as a plain list
    top_words = tmp.sort_values(by = center_word, ascending = False)[:N].index
    return top_words.tolist()

In [91]:
topk(pmidf, 'coffee')

['great',
 'luxurious',
 'love',
 'lose',
 'looks',
 'looked',
 'look',
 'longer',
 'lonely',
 'lock']