# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment your task is to discover strong associations between concepts in Airbnb reviews. The starter code we provide in this notebook is for orientation only. The imports below are enough to implement a valid answer.

### Imports, data loading and helper functions

We first import pandas, numpy and some useful nltk and collections modules, then build the stopword list and load the dataframe.

In [245]:
import pandas as pd
from nltk.tag import pos_tag
from nltk import RegexpParser
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()

In [246]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [247]:
# load stopwords
sw = set(stopwords.words('english'))

In [248]:
sw

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [249]:
# Find the stopwords that contain an apostrophe (e.g. "don't") so that the
# apostrophe-free spelling (e.g. "dont") can also be added to the stopwords.

pattern = r"\w+'\w+"

new_stopwords = []
for word in sw:
    # If the word contains exactly one apostrophe form, strip the apostrophe and keep it
    matches = re.findall(pattern, word)
    if len(matches) == 1:
        new_stopwords.append(matches[0].replace("'", ''))
new_stopwords

['youre',
 'havent',
 'shouldnt',
 'wouldnt',
 'shant',
 'wasnt',
 'dont',
 'werent',
 'arent',
 'neednt',
 'didnt',
 'youve',
 'youd',
 'wont',
 'youll',
 'hadnt',
 'mightnt',
 'hasnt',
 'shes',
 'couldnt',
 'isnt',
 'shouldve',
 'doesnt',
 'thatll',
 'its',
 'mustnt']
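
The same extension can be written without a regular expression. The following is a minimal sketch on a small hand-made sample (the names `sample_stopwords`, `apostrophe_free` and `extended_stopwords` are illustrative and not part of the notebook); it assumes that every stopword of interest uses the plain ASCII apostrophe.

```python
# Small illustrative sample; the notebook works on the full NLTK English list.
sample_stopwords = {"don't", "it's", "you're", "the", "and", "its"}

# Apostrophe-free spelling of every stopword that contains an apostrophe.
apostrophe_free = {w.replace("'", "") for w in sample_stopwords if "'" in w}

# The union keeps the original forms and adds the new ones; duplicates collapse.
extended_stopwords = sample_stopwords | apostrophe_free

print(sorted(apostrophe_free))
```

Building a set rather than a list means that a variant which already exists as a stopword (such as `its`) is not added twice.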

In [250]:
# After checking those "new" words, add them to the stopword set sw
sw.update(new_stopwords)
sw

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'arent',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'couldnt',
 'd',
 'did',
 'didn',
 "didn't",
 'didnt',
 'do',
 'does',
 'doesn',
 "doesn't",
 'doesnt',
 'doing',
 'don',
 "don't",
 'dont',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'hadnt',
 'has',
 'hasn',
 "hasn't",
 'hasnt',
 'have',
 'haven',
 "haven't",
 'havent',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'isnt',
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'mightnt',
 'more',
 'most',
 'mustn',
 "mustn't",
 'mustnt',
 'my',
 'myself',
 'needn',
 "needn't",
 'neednt',
 'no',
 'nor',
 'not',
 'no

In [251]:
basedir = os.getcwd()
df = pd.read_csv(os.path.join(basedir,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

In [252]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [253]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A function `process_reviews(df)` that will take as input the original dataframe and will return it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [254]:
def process_reviews(df):
    '''
    Tokenize and part-of-speech (PoS) tag every review in the given dataframe.

    Three columns are added: tokenized holds the word tokens of each comment, tagged holds the
    PoS-tagged tokens, and lower_tagged holds the PoS-tagged tokens obtained after lowercasing
    the tokens and dropping stopwords.

    Args:
        df: The reviews dataframe; it must contain a comments column.

    Returns: The same dataframe (modified in place) with three additional columns: tokenized, tagged and lower_tagged.
    '''
    
    # Initialize 3 lists one for each column we will create
    tokenized_col = []
    tagged_col = []
    lower_tagged_col = []


    mylen = len(df)
    count = 0
    
    # Iterate through the given dataframe
    for index, row in df.iterrows():
        # tokenize the words for the comments of a row
        token = word_tokenize(row.comments)
        # Append the tokenized words to the proper list
        tokenized_col.append(token)
        # Tag the tokenized words of the row and then append them to the proper list
        tagged_col.append(pos_tag(token))
        # lower_tagged.append(list(set(pos_tag([item.lower() for item in token]))))
        # Make the tagged words lowercased and then if they are not stopwords append them to the lower_tagged_col list
        lower_tagged_col.append(pos_tag([item.lower() for item in token if item.lower() not in sw]))
        count += 1

        if count % 1000 == 0:
            print(f'{count} out of {mylen}')

    # Set as values of the 3 new columns the proper list we created for each one
    df['tokenized'] = tokenized_col
    df['tagged'] = tagged_col
    df['lower_tagged'] = lower_tagged_col

    # Return the modified dataframe
    return df

In [255]:
df = process_reviews(df)
# df = process_reviews(df[:500])

1000 out of 452143
2000 out of 452143
3000 out of 452143
4000 out of 452143
5000 out of 452143
6000 out of 452143
7000 out of 452143
8000 out of 452143
9000 out of 452143
10000 out of 452143
11000 out of 452143
12000 out of 452143
13000 out of 452143
14000 out of 452143
15000 out of 452143
16000 out of 452143
17000 out of 452143
18000 out of 452143
19000 out of 452143
20000 out of 452143
21000 out of 452143
22000 out of 452143
23000 out of 452143
24000 out of 452143
25000 out of 452143
26000 out of 452143
27000 out of 452143
28000 out of 452143
29000 out of 452143
30000 out of 452143
31000 out of 452143
32000 out of 452143
33000 out of 452143
34000 out of 452143
35000 out of 452143
36000 out of 452143
37000 out of 452143
38000 out of 452143
39000 out of 452143
40000 out of 452143
41000 out of 452143
42000 out of 452143
43000 out of 452143
44000 out of 452143
45000 out of 452143
46000 out of 452143
47000 out of 452143
48000 out of 452143
49000 out of 452143
50000 out of 452143
51000 out

397000 out of 452143
398000 out of 452143
399000 out of 452143
400000 out of 452143
401000 out of 452143
402000 out of 452143
403000 out of 452143
404000 out of 452143
405000 out of 452143
406000 out of 452143
407000 out of 452143
408000 out of 452143
409000 out of 452143
410000 out of 452143
411000 out of 452143
412000 out of 452143
413000 out of 452143
414000 out of 452143
415000 out of 452143
416000 out of 452143
417000 out of 452143
418000 out of 452143
419000 out of 452143
420000 out of 452143
421000 out of 452143
422000 out of 452143
423000 out of 452143
424000 out of 452143
425000 out of 452143
426000 out of 452143
427000 out of 452143
428000 out of 452143
429000 out of 452143
430000 out of 452143
431000 out of 452143
432000 out of 452143
433000 out of 452143
434000 out of 452143
435000 out of 452143
436000 out of 452143
437000 out of 452143
438000 out of 452143
439000 out of 452143
440000 out of 452143
441000 out of 452143
442000 out of 452143
443000 out of 452143
444000 out of

### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 

In [256]:
def get_vocab(df):
    '''
    Build the center and context vocabularies from the lower_tagged column.

    Center words are tokens whose PoS tag starts with 'N' (nouns); context words are tokens
    whose PoS tag starts with 'J' (adjectives) or 'V' (verbs).

    Args:
        df: The dataframe returned by process_reviews; it must contain a lower_tagged column.

    Returns: Two lists, the 1,000 most frequent center words and the 1,000 most frequent context words,
    each sorted by decreasing frequency.
    '''
    
    cent_list, cont_list = [], []

    for review in df.lower_tagged:
        cent_list.extend([word for word in [list_of_words[0] for list_of_words in review if list_of_words[1][0] == 'N']])
        cont_list.extend([word for word in [list_of_words[0] for list_of_words in review if (list_of_words[1][0] == 'J') or (list_of_words[1][0] == 'V')]])

    cent_dict = Counter(cent_list)
    cont_dict = Counter(cont_list)

    cent_vocab = [key for key, value in sorted(cent_dict.items(), key=lambda item: item[1], reverse=True)][:1000]
    cont_vocab = [key for key, value in sorted(cont_dict.items(), key=lambda item: item[1], reverse=True)][:1000]

    return cent_vocab, cont_vocab
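
The sort-and-slice step above can also be expressed with `Counter.most_common`. Below is a minimal sketch on toy data; the helper name `top_words_by_tag` and the toy reviews are hypothetical, and the tag prefixes follow the Penn Treebank convention used in the function above (`N` for nouns, `J` for adjectives, `V` for verbs).

```python
from collections import Counter

def top_words_by_tag(tagged_reviews, tag_prefixes, k=1000):
    """Return the k most frequent words whose PoS tag starts with one of tag_prefixes."""
    counts = Counter(
        word
        for review in tagged_reviews
        for word, tag in review
        if tag.startswith(tag_prefixes)  # str.startswith accepts a tuple of prefixes
    )
    return [word for word, _ in counts.most_common(k)]

# Toy stand-in for df.lower_tagged: one list of (word, tag) pairs per review.
toy_reviews = [
    [("place", "NN"), ("great", "JJ"), ("stay", "NN")],
    [("place", "NN"), ("recommend", "VB"), ("great", "JJ")],
]

center = top_words_by_tag(toy_reviews, ("N",), k=2)
context = top_words_by_tag(toy_reviews, ("J", "V"), k=2)
print(center, context)
```

Counting inside a generator avoids materialising one long list of all nouns before counting them.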

In [257]:
cent_vocab, cont_vocab = get_vocab(df)

In [258]:
cent_vocab[:5]

['place', 'apartment', 'location', 'stay', 'amsterdam']

In [259]:
cont_vocab[:5]

['great', 'nice', 'recommend', 'clean', 'good']

In [260]:
samewords = [name for name in cent_vocab if name in cont_vocab]
len(samewords)

385

In [261]:
samewords

['place',
 'apartment',
 'location',
 'stay',
 'amsterdam',
 'host',
 'home',
 'très',
 'center',
 '’',
 'walk',
 'centre',
 'tram',
 'experience',
 'à',
 'hosts',
 'neighborhood',
 'clean',
 'bien',
 'perfect',
 'la',
 'bed',
 'bathroom',
 'et',
 'e',
 'sehr',
 'thank',
 'breakfast',
 'lot',
 'ist',
 'street',
 'boat',
 'tips',
 'der',
 'visit',
 'coffee',
 'arrival',
 'bus',
 'need',
 'muy',
 'min',
 'appartement',
 'die',
 'que',
 'transport',
 'airbnb',
 'view',
 'cozy',
 'check',
 'shops',
 'kitchen',
 'bars',
 'minute',
 'helpful',
 'get',
 'es',
 'dans',
 'bit',
 'le',
 'lots',
 'je',
 'stairs',
 'con',
 'neighbourhood',
 'convenient',
 'zu',
 'du',
 'wir',
 'houseboat',
 'beautiful',
 'pictures',
 'des',
 'super',
 'train',
 'park',
 'il',
 'door',
 'couple',
 'supermarket',
 'séjour',
 'bike',
 'recommend',
 'metro',
 'è',
 'airport',
 'situé',
 'help',
 'casa',
 'friends',
 'alles',
 'balcony',
 'photos',
 'garden',
 'comfy',
 'appartment',
 'feel',
 'corner',
 'nice',
 'stop

### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

In [262]:
# def get_coocs(df, cent_vocab, cont_vocab):
#   sentences = []
#   comments = df.comments

#   for comment in comments:
#     sentences.extend([sentence for sentence in comment.split('.')])
  
#   print('yolo')
#   # print(sentences)
  
#   coocs = {}
  
#   count = 0
#   for center_word in cent_vocab:
#     count += 1
#     words = []
#     for sentence in sentences:
#       if center_word in sentence:
#         words_in_sentence = word_tokenize(sentence)
#         words.extend([word for word in words_in_sentence if word in cont_vocab])
    
#     center_word_dict = dict(Counter(words))
#     coocs[center_word] = center_word_dict
#     print(f'{count} out of 1000')
    
#   # cent_dict = Counter(cent_list)
#   # cont_dict = Counter(cont_list)

  
#   return coocs  

In [263]:
def get_coocs(df, cent_vocab, cont_vocab):
    '''
    Count sentence-level co-occurrences between center words and context words.

    Each comment is split into sentences on '.'. For every center word, the sentences that contain
    it as a whitespace-delimited token are collected, and every context word found in those
    sentences (other than the center word itself) is counted.

    Args:
        df: The reviews dataframe; it must contain a comments column.
        cent_vocab: The list of center words.
        cont_vocab: The list of context words.

    Returns: A dictionary of dictionaries of the form {center_word: {context_word: count}}.
    '''
    
    sentences = []
    comments = df.comments

    for comment in comments:
        sentences.extend([sentence for sentence in comment.split('.')])
  
    print('yolo')
  
    sentences_per_center_word = {center_word : Filter(sentences, center_word) for center_word in cent_vocab}

    print('swag')

    
    coocs = {}

    count = 0
    count2 = 0
    diff = 0
    for center_word, sentences in sentences_per_center_word.items():
        words = []
        count += 1
        start = pd.to_datetime('today')
        count2 = 0
        
        for sentence in sentences:
            count2 += 1
#             words_of_sentence = [word for word in word_tokenize(sentence) if word in cont_vocab]
            words_of_sentence = [word for word in word_tokenize(sentence) if word in cont_vocab and word != center_word]
            if len(words_of_sentence) > 0: words.extend(words_of_sentence)
                
        end = pd.to_datetime('today')
        diff += (end-start).total_seconds()
        print(f'{center_word} - - - - {(end-start).total_seconds()} seconds')
        coocs[center_word] = dict(Counter(words))
        if count % 20 == 0 : print(f'{count} center_word out of {len(sentences_per_center_word)} in {diff/60} minutes')

            
    print(diff/60)
    return coocs 

def Filter(sentences, center_word):
    sentences_for_center_word = []
    
    for sentence in sentences:
        if center_word in sentence.split():
            sentences_for_center_word.append(sentence)
    print(f'{center_word} --- {len(sentences_for_center_word)}')
    return sentences_for_center_word
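
The approach above scans the full sentence list once per center word and tests membership against Python lists. A single pass over the sentences with set lookups is one possible alternative. The sketch below is an assumption-laden illustration, not the notebook's method: the function name `count_coocs` is hypothetical, sentences are split on `.`, `!` and `?`, text is lowercased, and tokens are taken with the simple regex `\w+` instead of `word_tokenize` (so hyphenated words such as `wi-fi` would be split).

```python
import re
from collections import Counter, defaultdict

def count_coocs(reviews, center_vocab, context_vocab):
    """Count, per sentence, how often each context word appears alongside each center word."""
    center_set, context_set = set(center_vocab), set(context_vocab)
    coocs = defaultdict(Counter)
    for review in reviews:
        # Naive sentence split and tokenisation (assumptions, see the text above).
        for sentence in re.split(r"[.!?]+", review.lower()):
            tokens = re.findall(r"\w+", sentence)
            # A center word repeated within one sentence is counted once.
            centers = {t for t in tokens if t in center_set}
            contexts = [t for t in tokens if t in context_set]
            for center in centers:
                # Skip the center word itself when it is also a context word.
                coocs[center].update(t for t in contexts if t != center)
    return {center: dict(coocs[center]) for center in center_vocab}

toy_reviews = ["Great place. The host was nice!", "Nice place, great host."]
toy_coocs = count_coocs(toy_reviews, ["place", "host"], ["great", "nice"])
print(toy_coocs)
```

Because each sentence is tokenised only once, the cost grows with the number of sentences rather than with the number of sentences times the vocabulary size.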

In [264]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

yolo
place --- 143353
apartment --- 137014
location --- 93599
stay --- 142952
amsterdam --- 4153
host --- 70952
everything --- 55826
city --- 64904
room --- 55321
time --- 46178
house --- 39941
area --- 34473
home --- 33146
très --- 33046
center --- 32159
restaurants --- 25888
’ --- 4
station --- 21166
minutes --- 27090
walk --- 43063
centre --- 31021
tram --- 38796
experience --- 22591
space --- 22309
thanks --- 5686
à --- 29368
hosts --- 18465
neighborhood --- 22779
clean --- 66383
bien --- 20151
perfect --- 51095
communication --- 14870
day --- 17173
la --- 31247
kind --- 17087
distance --- 19192
days --- 17297
bed --- 21938
trip --- 15376
bathroom --- 15332
et --- 45074
night --- 14199
e --- 15555
sehr --- 18771
people --- 13647
thank --- 4912
places --- 15591
breakfast --- 14564
lot --- 15913
ist --- 18690
street --- 13037
boat --- 12785
tips --- 14939
der --- 11717
visit --- 16380
coffee --- 11435
arrival --- 16110
bus --- 14565
need --- 26034
muy --- 12772
war --- 12255
min --- 

locals --- 1361
zeit --- 37
tolle --- 1765
roof --- 3178
tiene --- 1377
schöne --- 1613
love --- 8757
tour --- 2517
beaucoup --- 1944
frank --- 161
.... --- 0
ons --- 1339
share --- 1507
sont --- 4084
visiter --- 1580
anna --- 18
boats --- 1487
dass --- 1482
er --- 2442
nicht --- 3921
barrio --- 1358
decor --- 1445
dank --- 231
ground --- 1476
com --- 1508
era --- 1389
tres --- 1354
make --- 13102
fruit --- 1100
sofa --- 1582
laura --- 20
anne --- 57
quelques --- 1735
responses --- 1338
vor --- 1421
schnell --- 1896
fine --- 3651
stadt --- 26
faire --- 1543
quality --- 1374
patio --- 1723
kommunikation --- 51
las --- 3154
downstairs --- 1688
einem --- 2373
partner --- 1355
journey --- 1383
position --- 1117
blocks --- 1734
está --- 2634
на --- 1439
op --- 2862
bonne --- 1576
café --- 1002
zeer --- 1137
floors --- 1151
level --- 1288
trouble --- 1375
gogh --- 66
dryer --- 1100
travelers --- 1202
books --- 1141
terrasse --- 1517
staircase --- 1397
john --- 18
company --- 1247
gastgeberin

clean - - - - 19.242837 seconds
bien - - - - 6.54181 seconds
perfect - - - - 17.412425 seconds
communication - - - - 4.870008 seconds
day - - - - 7.92879 seconds
la - - - - 12.66217 seconds
kind - - - - 6.234365 seconds
distance - - - - 7.407158 seconds
days - - - - 5.545173 seconds
bed - - - - 8.22401 seconds
trip - - - - 6.216874 seconds
bathroom - - - - 5.915179 seconds
40 center_word out of 1000 in 9.256049099999997 minutes
et - - - - 16.350453 seconds
night - - - - 6.270349 seconds
e - - - - 7.015547 seconds
sehr - - - - 5.867294 seconds
people - - - - 5.301824 seconds
thank - - - - 1.899908 seconds
places - - - - 6.756928 seconds
breakfast - - - - 5.929172 seconds
lot - - - - 6.532534 seconds
ist - - - - 5.857309 seconds
street - - - - 5.60602 seconds
boat - - - - 4.911862 seconds
tips - - - - 5.919168 seconds
der - - - - 4.256613 seconds
visit - - - - 6.324092 seconds
coffee - - - - 5.231976 seconds
arrival - - - - 5.132287 seconds
bus - - - - 5.87944 seconds
need - - - - 9.8155

wonderful - - - - 9.644251 seconds
tv - - - - 0.379022 seconds
car - - - - 1.410189 seconds
privacy - - - - 1.313487 seconds
meet - - - - 3.629288 seconds
reach - - - - 2.513248 seconds
и - - - - 1.14996 seconds
merci - - - - 0.420836 seconds
building - - - - 1.517961 seconds
parfait - - - - 0.880607 seconds
te - - - - 1.625665 seconds
280 center_word out of 1000 in 22.70721156666667 minutes
instructions - - - - 1.374363 seconds
cat - - - - 1.205787 seconds
eat - - - - 2.693828 seconds
welcome - - - - 4.79319 seconds
enjoy - - - - 3.650235 seconds
see - - - - 5.026249 seconds
facilities - - - - 1.045205 seconds
plenty - - - - 2.958225 seconds
stops - - - - 1.734363 seconds
something - - - - 1.424268 seconds
respond - - - - 1.178849 seconds
nothing - - - - 1.056211 seconds
kids - - - - 1.056181 seconds
ville - - - - 2.064438 seconds
hope - - - - 1.868962 seconds
size - - - - 1.094074 seconds
warm - - - - 3.981981 seconds
chez - - - - 1.347652 seconds
dem - - - - 1.481843 seconds
details

right - - - - 8.216056 seconds
system - - - - 0.631287 seconds
lines - - - - 0.588454 seconds
phone - - - - 0.703149 seconds
spotless - - - - 0.551556 seconds
parts - - - - 0.737994 seconds
parks - - - - 0.538561 seconds
zentrum - - - - 0.009971 seconds
connection - - - - 0.54953 seconds
grande - - - - 0.662232 seconds
su - - - - 0.951455 seconds
amstel - - - - 0.031915 seconds
520 center_word out of 1000 in 28.315946533333353 minutes
entrance - - - - 0.63578 seconds
sit - - - - 1.010335 seconds
tipps - - - - 0.139639 seconds
look - - - - 1.536847 seconds
erreichen - - - - 0.501656 seconds
locals - - - - 0.628351 seconds
zeit - - - - 0.014958 seconds
tolle - - - - 0.602395 seconds
roof - - - - 1.280577 seconds
tiene - - - - 0.608372 seconds
schöne - - - - 0.513625 seconds
love - - - - 3.201589 seconds
tour - - - - 1.141945 seconds
beaucoup - - - - 0.756975 seconds
frank - - - - 0.079787 seconds
.... - - - - 0.0 seconds
ons - - - - 0.506645 seconds
share - - - - 0.700128 seconds
sont - 

oder - - - - 0.551566 seconds
offers - - - - 0.665181 seconds
sono - - - - 0.725061 seconds
braucht - - - - 0.335145 seconds
bij - - - - 0.463721 seconds
apartamento - - - - 1.243675 seconds
bons - - - - 0.441832 seconds
climb - - - - 0.669208 seconds
simple - - - - 0.968397 seconds
über - - - - 0.398931 seconds
og - - - - 0.505649 seconds
cuisine - - - - 0.468862 seconds
peaceful - - - - 1.765276 seconds
gemütlich - - - - 0.362029 seconds
juice - - - - 0.32414 seconds
760 center_word out of 1000 in 31.023621750000004 minutes
points - - - - 0.423852 seconds
bicycle - - - - 0.424494 seconds
chill - - - - 0.5415 seconds
job - - - - 0.297206 seconds
message - - - - 0.501657 seconds
können - - - - 0.404916 seconds
guys - - - - 0.365025 seconds
jours - - - - 0.456779 seconds
commerces - - - - 0.296285 seconds
let - - - - 2.715736 seconds
parking - - - - 1.049191 seconds
hôtes - - - - 0.506639 seconds
guide - - - - 0.845097 seconds
evenings - - - - 0.453826 seconds
parfaitement - - - - 0.398

travellers - - - - 0.263328 seconds
plans - - - - 0.479716 seconds
vondel - - - - 0.053857 seconds
helpfull - - - - 0.389956 seconds
все - - - - 0.363068 seconds
peut - - - - 0.568439 seconds
max - - - - 0.189501 seconds
mobility - - - - 0.499618 seconds
perfecto - - - - 0.188503 seconds
fruits - - - - 0.284203 seconds
slept - - - - 1.582769 seconds
eggs - - - - 0.34308 seconds
petits - - - - 0.53856 seconds
show - - - - 1.138943 seconds
découvrir - - - - 0.407909 seconds
wait - - - - 0.980377 seconds
visitors - - - - 0.272272 seconds
1000 center_word out of 1000 in 33.85742159999999 minutes
33.85742159999999


### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 

In [265]:
def cooc_dict2df(coocs):
    '''
    Convert the co-occurrence dictionary into a dataframe of raw counts.

    Rows are the center words (cent_vocab) and columns are the context words (cont_vocab), both
    taken from the notebook's global variables. Pairs that never co-occur are set to 0.

    Args:
        coocs: The dictionary of dictionaries returned by get_coocs.

    Returns: A dataframe with one row per center word, one column per context word and the
    co-occurrence counts as values.
    '''
    
    coocdf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

    for index, row in coocdf.iterrows():
        for word in cont_vocab:
            try:
                coocdf[word][index] = coocs[index][word]
            except: 
                coocdf[word][index] = 0

    return coocdf
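
The cell-by-cell fill above can be replaced by letting pandas build the table directly. This is a minimal sketch on toy data, assuming the same dictionary-of-dictionaries layout; the function name `cooc_dict_to_frame` and the toy vocabularies are illustrative only.

```python
import pandas as pd

def cooc_dict_to_frame(coocs, center_vocab, context_vocab):
    """Build a center-by-context count table, with 0 for pairs that never co-occur."""
    # orient="index" turns the outer keys into rows and the inner keys into columns.
    frame = pd.DataFrame.from_dict(coocs, orient="index")
    # reindex fixes the row/column order and adds any word that never co-occurred.
    frame = frame.reindex(index=center_vocab, columns=context_vocab)
    return frame.fillna(0).astype(int)

toy_coocs = {"place": {"great": 2, "nice": 1}, "host": {"nice": 2}, "tram": {}}
toy_frame = cooc_dict_to_frame(
    toy_coocs, ["place", "host", "tram"], ["great", "nice", "clean"]
)
print(toy_frame)
```

Passing the vocabularies as arguments, rather than reading them from global variables, keeps the function usable on other vocabularies.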

In [266]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

(1000, 1000)

### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 
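
For reference, the standard definition of pointwise mutual information can be written in terms of the co-occurrence table. The symbols are introduced here for illustration: $c(x, y)$ is the cell count for center word $x$ and context word $y$, $c(x)$ and $c(y)$ are the corresponding row and column sums, and $N$ is the sum of all cells.

```latex
\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}
                   = \log \frac{c(x, y)\, N}{c(x)\, c(y)},
\qquad
p(x, y) = \frac{c(x, y)}{N},\quad
p(x) = \frac{c(x)}{N},\quad
p(y) = \frac{c(y)}{N}.
```

The score is undefined when $c(x, y) = 0$, so an implementation has to choose a convention for those cells, for example assigning 0.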

In [267]:
coocdf

Unnamed: 0,great,nice,recommend,clean,good,stay,comfortable,easy,perfect,quiet,...,climb,chilled,downstairs,well-located,accommodate,based,andrea,avant,taxi,wi-fi
place,26420,14996,16646,14120,7276,36269,6695,4696,11603,6262,...,104,67,138,106,65,58,2,13,91,14
apartment,22344,15152,7999,19536,6362,16411,9446,5068,9495,6579,...,274,59,188,146,72,68,1,0,138,26
location,27734,8168,2605,6542,9425,6716,3286,5484,14100,6176,...,43,27,80,3,22,49,2,6,86,8
stay,22455,10532,15231,5843,5488,147261,6791,3394,8952,2423,...,78,60,65,31,75,49,1,0,93,11
amsterdam,545,471,279,163,242,751,91,178,299,140,...,7,2,3,1,1,1,0,2,6,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
petits,0,0,2,1,0,0,0,0,2,0,...,0,0,0,0,0,0,0,7,2,0
show,215,154,31,97,75,136,46,51,49,20,...,0,1,3,0,1,2,0,0,7,0
découvrir,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9,0,1
wait,162,89,134,72,52,488,29,42,68,10,...,1,0,3,0,0,0,0,0,7,0


In [268]:
def cooc2pmi(df):
    '''
    Convert raw co-occurrence counts into log association scores.

    Pairs with a zero count (or for which the computation fails) get a score of 0. The row and
    column labels are taken from the notebook's global cent_vocab and cont_vocab.

    Note: because of operator precedence, the expression below evaluates to
    log(count * N**2 * row_sum / col_sum). This differs from the textbook PMI,
    log(count * N / (row_sum * col_sum)), by a term that is constant within a row, so the ranking
    of context words for a given center word is unaffected, but the absolute values are not
    standard PMI values.

    Args:
        df: The co-occurrence dataframe returned by cooc_dict2df.

    Returns: A new dataframe with the same rows and columns, holding the scores instead of raw counts.
    '''
    
    pmidf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

    N = 0
    for index, row in df.iterrows():
        N += sum(row)

    for index, row in df.iterrows():
        for word in cont_vocab:
            try:
                pmi = df[word][index] / (sum(df[word])/N / sum(row)/N)
                if pmi == 0:
                    pmidf[word][index] = 0
                else:
                    pmidf[word][index] = np.log([pmi])[0] 
                    #         print(pmidf[word][index])
            except: 
                pmidf[word][index] = 0
      
    return pmidf
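
The nested loop above recomputes a column sum for each of the 1,000 x 1,000 cells. A vectorised version of the textbook PMI, log(count * N / (row_sum * col_sum)), is sketched below under stated assumptions: the function name `pmi_from_counts` is hypothetical, zero-count cells are mapped to 0 as in the notebook, and its absolute values need not match the table printed in this notebook.

```python
import numpy as np
import pandas as pd

def pmi_from_counts(counts):
    """Textbook PMI for a center-by-context count table; unseen pairs are set to 0."""
    c = counts.to_numpy(dtype=float)
    total = c.sum()                      # N: sum of all cells
    row = c.sum(axis=1, keepdims=True)   # one total per center word
    col = c.sum(axis=0, keepdims=True)   # one total per context word
    # log(0) and 0/0 are expected for unseen pairs, so silence the warnings.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(c * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0         # convention: 0 for unseen pairs
    return pd.DataFrame(pmi, index=counts.index, columns=counts.columns)

toy_counts = pd.DataFrame(
    [[2, 0], [1, 1]], index=["place", "host"], columns=["great", "nice"]
)
toy_pmi = pmi_from_counts(toy_counts)
print(toy_pmi)
```

Mapping unseen pairs to 0 places them above pairs with a negative PMI; clipping all negative values to 0 (positive PMI) is a common alternative convention.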

In [269]:
pmidf = cooc2pmi(coocdf)
pmidf.shape

(1000, 1000)

In [270]:
pmidf

Unnamed: 0,great,nice,recommend,clean,good,stay,comfortable,easy,perfect,quiet,...,climb,chilled,downstairs,well-located,accommodate,based,andrea,avant,taxi,wi-fi
place,45.149405,44.794369,45.535720,45.010148,44.638298,45.483243,44.667340,44.231537,45.087971,44.671160,...,43.852704,44.808034,43.956286,45.456813,44.319470,44.717852,44.417858,42.197828,43.574874,43.822251
apartment,44.972302,44.795179,44.793328,45.325275,44.494521,44.680693,45.002032,44.298233,44.877934,44.711005,...,44.811903,44.671340,44.255935,45.767442,44.412210,44.867378,43.715172,0,43.981730,44.431752
location,44.800611,43.789478,43.283651,43.843467,44.499751,43.399441,43.558319,43.989329,44.885550,44.260000,...,42.572182,43.501847,43.013727,41.494655,42.838794,44.151898,44.020527,41.027307,43.121031,42.865304
stay,44.866344,44.320551,45.326430,44.007347,44.235828,46.764033,44.561125,43.786383,44.708131,43.601215,...,43.444570,44.577234,43.082967,44.106909,44.342118,44.428777,43.604259,0,43.476162,43.460637
amsterdam,37.932408,37.997783,38.111101,37.212645,37.898994,38.270024,37.033178,37.622949,38.093490,37.534642,...,37.818318,37.960583,36.791739,37.457468,36.809177,37.321503,0,36.990120,37.519869,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
petits,0,0,32.293557,31.239415,0,0,0,0,32.206714,0,...,0,0,0,0,0,0,0,37.363404,35.541777,0
show,36.259349,36.136966,35.170965,35.950694,35.984633,35.818362,35.608049,35.630080,35.541955,34.845821,...,0,36.524525,36.048828,0,36.066266,37.271739,0,0,36.931108,0
découvrir,30.685534,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,37.548109,0,36.901200
wait,36.027132,35.639474,36.685642,35.703474,35.669213,37.146847,35.197527,35.486748,35.920467,34.203498,...,35.180321,0,36.099652,0,0,0,0,0,36.981932,0


### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `‘towels’`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`. 

In [271]:
# pmidf.iloc[2,:]
pmidf['place']['room']

42.02028054671144

In [277]:
pmidf['great']['place']

45.14940450039596

In [278]:
pmidf['place']['nice']

44.10072738197349

In [280]:
pmidf['based']['place']

44.71785175360374

In [272]:
def topk(df, center_word, N=10):
    '''
    Retrieve the context words with the highest score for a given center word.

    The context words are read from the notebook's global cont_vocab.

    Args:
        df: The score dataframe returned by cooc2pmi (rows are center words, columns are context words).
        center_word: The center word to look up; it must be a row label of df.
        N: The number of context words to return (default 10).

    Returns: A list of N context words, sorted by decreasing score with center_word.
    '''
    
    dicts_ = {word: df[word][center_word] for word in cont_vocab}
    top_words = [key for key, value in sorted(dicts_.items(), key=lambda item: item[1], reverse=True)][:N]

    return top_words
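
The same lookup can be done on a single row of the table. The sketch below is illustrative: the function name `top_context_words` and the `skip_self` flag are hypothetical additions, and the row is cast to float because `nlargest` requires a numeric dtype.

```python
import pandas as pd

def top_context_words(pmi, center_word, n=10, skip_self=False):
    """Return the n context words with the highest score for center_word."""
    row = pmi.loc[center_word].astype(float)
    if skip_self:
        # Optionally drop the center word when it is also a context word.
        row = row.drop(labels=center_word, errors="ignore")
    return row.nlargest(n).index.tolist()

toy_pmi = pd.DataFrame(
    [[3.0, 2.0, 1.0]], index=["stay"], columns=["stay", "enjoyed", "great"]
)
with_self = top_context_words(toy_pmi, "stay", n=2)
without_self = top_context_words(toy_pmi, "stay", n=2, skip_self=True)
print(with_self, without_self)
```

Whether a center word should be allowed to appear in its own list is a design choice; the flag makes that choice explicit.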

In [273]:
topk(pmidf, 'place')

['place',
 'recommand',
 'recomend',
 'recommendable',
 'reccomend',
 'maarten',
 'looking',
 'nicole',
 'affordable',
 'advertised']

In [274]:
topk(pmidf, 'location')

['location',
 'prime',
 'superb',
 'ideal',
 'terrific',
 'walkable',
 'convenient',
 'fantastic',
 'perfect',
 'central']

In [275]:
topk(pmidf, 'coffee')

['coffee',
 'tea',
 'nespresso',
 'microwave',
 'complimentary',
 'fridge',
 'supplied',
 'shops',
 'nick',
 'including']

In [281]:
topk(pmidf, 'stay')

['stay',
 'enjoyed',
 'enjoyable',
 'hesitate',
 'memorable',
 'letting',
 'longer',
 'future',
 'ensure',
 'pleased']

In [282]:
topk(pmidf, 'petits')

['aux',
 'ses',
 'ont',
 'sont',
 'apprécié',
 'des',
 'hôtes',
 'les',
 'cafés',
 'tous']

In [283]:
topk(pmidf, 'sauber')

['sauber',
 'zimmer',
 'sehr',
 'wie',
 'und',
 'alles',
 'zentral',
 'wohnung',
 'allem',
 'ist']

# 3.b Ethical, social and legal implications



Local authorities in tourist hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent in order to, among other things, ensure that rents stay fair for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends that you, as a new host, set the price of your flat above the limit established by the local government. Upon inspection, you realize that the inflated price you have been recommended comes from many apartments in the area only being offered during an annual event which brings many tourists, and which causes prices to rise. 

In this context, critically reflect on the compliance of this recommender system with **one of the five actions** outlined in the **UK’s Data Ethics Framework**. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three **key principles** outlined in the Framework, namely _transparency_, _accountability_ and _fairness_. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support. 

Your report should be between 500 and 750 words long.  

### Your answer here. No Python, only Markdown.

Write your answer after the line.

---

...