# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked with discovering strong associations between concepts in Airbnb reviews. The starter code provided in this notebook is for orientation only. The imports below are sufficient to implement a valid answer.

### Imports, data loading and helper functions

We first import pandas, numpy and some useful nltk and collections modules, then extend the stopword list and load the DataFrame from the current working directory. `datetime` and `tqdm` are imported to help log progress in the longer-running tasks.

In [56]:
import pandas as pd
from nltk.tag import pos_tag
from nltk import RegexpParser
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()

In [57]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [58]:
# load stopwords
sw = set(stopwords.words('english'))

In [59]:
# sw

In [60]:
# Some NLTK stopwords contain an apostrophe (e.g. "don't"). We collect the
# apostrophe-free spelling of each of them (e.g. "dont") so that it can be added
# to the stopwords as well.
pattern = r'\w+\'\w+'

new_stopwords = []
for word in sw:
    # A word such as "don't" yields exactly one match; strip its apostrophe and keep it
    matches = re.findall(pattern, word)
    if len(matches) == 1:
        new_stopwords.append(matches[0].replace('\'', ''))
new_stopwords

['youre',
 'havent',
 'shouldnt',
 'wouldnt',
 'shant',
 'wasnt',
 'dont',
 'werent',
 'arent',
 'neednt',
 'didnt',
 'youve',
 'youd',
 'wont',
 'youll',
 'hadnt',
 'mightnt',
 'hasnt',
 'shes',
 'couldnt',
 'isnt',
 'shouldve',
 'doesnt',
 'thatll',
 'its',
 'mustnt']

In [61]:
# After checking these new words, add them to the stopword set `sw`
sw.update(new_stopwords)
sw

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'arent',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'couldnt',
 'd',
 'did',
 'didn',
 "didn't",
 'didnt',
 'do',
 'does',
 'doesn',
 "doesn't",
 'doesnt',
 'doing',
 'don',
 "don't",
 'dont',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'hadnt',
 'has',
 'hasn',
 "hasn't",
 'hasnt',
 'have',
 'haven',
 "haven't",
 'havent',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'isnt',
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'mightnt',
 'more',
 'most',
 'mustn',
 "mustn't",
 'mustnt',
 'my',
 'myself',
 'needn',
 "needn't",
 'neednt',
 'no',
 'nor',
 'not',
 'no

In [62]:
basedir = os.getcwd()
df = pd.read_csv(os.path.join(basedir,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

In [63]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [64]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A function `process_reviews(df)` that takes the original DataFrame as input and returns it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [65]:
def process_reviews(df):
    '''
    Takes the given dataframe and adds three new columns: tokenized, tagged and lower_tagged.
    The tokenized column holds the word tokens of each row's comments. The tagged column holds the
    Part-of-Speech (PoS) tags of those tokens. The lower_tagged column holds the PoS tags of the
    lowercased tokens, with stopwords removed.

    Args:
        df: The dataframe we want to modify.

    Returns: The given dataframe with three additional columns: tokenized, tagged and lower_tagged.
    '''
    # Initialize 3 lists one for each column we will create
    tokenized_col = []
    tagged_col = []
    lower_tagged_col = []


    mylen = len(df)
    count = 0
    
    # Iterate through the given dataframe
    for index, row in df.iterrows():
        # tokenize the words for the comments of a row
        token = word_tokenize(row.comments)
        # Append the tokenized words to the proper list
        tokenized_col.append(token)
        # Tag the tokenized words of the row and then append them to the proper list
        tagged_col.append(pos_tag(token))
        # Lowercase the tokens, drop stopwords, then tag what remains and append to lower_tagged_col
        lower_tagged_col.append(pos_tag([item.lower() for item in token if item.lower() not in sw]))
        count += 1

        # Report progress every 1,000 reviews
        if count % 1000 == 0:
            print(f'{count} out of {mylen}')

    # Set as values of the 3 new columns the proper list we created for each one
    df['tokenized'] = tokenized_col
    df['tagged'] = tagged_col
    df['lower_tagged'] = lower_tagged_col

    # Return the modified dataframe
    return df

In [66]:
# df = process_reviews(df)
# Work on an explicit copy of the first 2,000 reviews so the new columns are not set on a slice
df = process_reviews(df[:2000].copy())

1000 out of 2000
2000 out of 2000


In [67]:
df.lower_tagged

0       [(daniel, NN), (really, RB), (cool, JJ), (., ....
1       [(daniel, NN), (amazing, VBG), (host, NN), (!,...
2       [(great, JJ), (time, NN), (amsterdam, NN), (.,...
3       [(professional, JJ), (operation, NN), (., .), ...
4       [(daniel, NN), (highly, RB), (recommended, VBD...
                              ...                        
1995    [(advertised, JJ), (!, .), (n't, RB), (miss, J...
1996    [(wonderful, JJ), (time, NN), (staying, VBG), ...
1997    [(wonderful, JJ), (stay, NN), (tanyas, RB), (b...
1998    [(tanya, NN), (made, VBD), (entire, JJ), (proc...
1999    [(stay, NN), (tanya, EX), ('s, POS), (appartme...
Name: lower_tagged, Length: 2000, dtype: object

In [68]:
# len(lowertagged)

In [69]:
# # correction with lower_tagged.append(pos_tag([item.lower() for item in token if item.lower() not in sw]))
# lowertagged = []

# pattern = r'\w+'

# for line in df.lower_tagged:
# #     print(line)
#     for word in line:
#         if word[1][0] == 'N' and len(re.findall(pattern,word[0])) == 0: 
#             pass
# #             print(word)
#         else:
#             lowertagged.append(word[0])
# #         if word[1][0] == 'N': print(word)
#             # print(re.findall(pattern,word[0]))
#             # if 
# len(lowertagged)

In [70]:
df

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,tokenized,tagged,lower_tagged
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...,"[Daniel, is, really, cool, ., The, place, was,...","[(Daniel, NNP), (is, VBZ), (really, RB), (cool...","[(daniel, NN), (really, RB), (cool, JJ), (., ...."
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...,"[Daniel, is, the, most, amazing, host, !, His,...","[(Daniel, NNP), (is, VBZ), (the, DT), (most, R...","[(daniel, NN), (amazing, VBG), (host, NN), (!,..."
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...,"[We, had, such, a, great, time, in, Amsterdam,...","[(We, PRP), (had, VBD), (such, JJ), (a, DT), (...","[(great, JJ), (time, NN), (amsterdam, NN), (.,..."
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...,"[Very, professional, operation, ., Room, is, v...","[(Very, RB), (professional, JJ), (operation, N...","[(professional, JJ), (operation, NN), (., .), ..."
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...,"[Daniel, is, highly, recommended, ., He, provi...","[(Daniel, NNP), (is, VBZ), (highly, RB), (reco...","[(daniel, NN), (highly, RB), (recommended, VBD..."
...,...,...,...,...,...,...,...,...,...
1995,44129,11523378,2014-04-06,13687490,Itai,as advertised! don't miss!\r\nWe just got back...,"[as, advertised, !, do, n't, miss, !, We, just...","[(as, IN), (advertised, JJ), (!, .), (do, VBP)...","[(advertised, JJ), (!, .), (n't, RB), (miss, J..."
1996,44129,11756211,2014-04-14,13191907,Alex,We had a wonderful time staying at Tanya's pla...,"[We, had, a, wonderful, time, staying, at, Tan...","[(We, PRP), (had, VBD), (a, DT), (wonderful, J...","[(wonderful, JJ), (time, NN), (staying, VBG), ..."
1997,44129,11902961,2014-04-18,12942622,Mikkel,We had a wonderful stay at Tanyas beautiful ap...,"[We, had, a, wonderful, stay, at, Tanyas, beau...","[(We, PRP), (had, VBD), (a, DT), (wonderful, J...","[(wonderful, JJ), (stay, NN), (tanyas, RB), (b..."
1998,44129,12000412,2014-04-21,5806581,Jill,Tanya made the entire process really easy on u...,"[Tanya, made, the, entire, process, really, ea...","[(Tanya, NNP), (made, VBD), (the, DT), (entire...","[(tanya, NN), (made, VBD), (entire, JJ), (proc..."


In [71]:
# grammar = "CHUNK: {<JJ>*<NN.>+}" 
# cp = RegexpParser(grammar)
# parsed = cp.parse(df.tagged[0])
# print(parsed)

In [72]:
# for tree in parsed.subtrees():
#     print(tree)

In [73]:
# df.lower_tagged[0]

In [74]:
# num = 0
# print(len(df.tagged[num]), len(set(df.lower_tagged[num])))
# list(set(df.lower_tagged[num]))

### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives).
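The selection rule above can be tried on a few hand-made reviews before running it on the full data. This is a minimal sketch, assuming each review is a list of `(word, tag)` pairs as in `lower_tagged`; the names `toy_reviews` and `toy_vocab` and the `size` parameter are hypothetical and only used for illustration.

```python
from collections import Counter

# Toy stand-in for the `lower_tagged` column: one list of (word, tag) pairs per review.
toy_reviews = [
    [("room", "NN"), ("great", "JJ"), ("host", "NN"), ("recommend", "VBP")],
    [("room", "NN"), ("great", "JJ"), ("clean", "JJ")],
    [("host", "NN"), ("great", "JJ"), ("clean", "JJ"), ("room", "NN")],
]

def toy_vocab(tagged_reviews, size):
    """Return the `size` most frequent nouns and the `size` most frequent adjectives/verbs."""
    center_counts, context_counts = Counter(), Counter()
    for review in tagged_reviews:
        for word, tag in review:
            if tag.startswith("N"):            # NN, NNS, NNP, ...
                center_counts[word] += 1
            elif tag.startswith(("J", "V")):   # JJ*, VB*
                context_counts[word] += 1
    center = [word for word, _ in center_counts.most_common(size)]
    context = [word for word, _ in context_counts.most_common(size)]
    return center, context

center_vocab, context_vocab = toy_vocab(toy_reviews, size=2)
print(center_vocab)   # ['room', 'host']
print(context_vocab)  # ['great', 'clean']
```

With `size=1000` the same logic yields the two vocabularies the task asks for.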

In [75]:
def get_vocab(df):
  '''
  Builds the center and context vocabularies from the lower_tagged column.
  Center words are nouns (tags starting with 'N'); context words are adjectives
  or verbs (tags starting with 'J' or 'V').

  Returns: Two lists with the 1,000 most frequent center and context words.
  '''
  cent_counts, cont_counts = Counter(), Counter()

  for review in df.lower_tagged:
    for word, tag in review:
      if tag[0] == 'N':
        cent_counts[word] += 1
      elif tag[0] in ('J', 'V'):
        cont_counts[word] += 1

  # most_common(1000) returns the 1,000 most frequent entries in descending order
  cent_vocab = [word for word, _ in cent_counts.most_common(1000)]
  cont_vocab = [word for word, _ in cont_counts.most_common(1000)]

  return cent_vocab, cont_vocab

In [76]:
cent_vocab, cont_vocab = get_vocab(df)

In [77]:
cent_vocab[:5]

['location', 'place', 'room', 'host', 'amsterdam']

In [78]:
cont_vocab[:5]

['great', 'clean', 'nice', 'edwin', 'recommend']

### 3.a3 - Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 
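One possible way to settle the design questions above is sketched here on toy data. It assumes a sentence-level context, counts a center word once per sentence, and does not let a word act as its own context. The function `toy_coocs`, the regex-based sentence split and tokenizer, and the sample reviews are all hypothetical choices for illustration; `sent_tokenize` and `word_tokenize` could be used instead.

```python
import re
from collections import Counter, defaultdict

def toy_coocs(reviews, center_vocab, context_vocab):
    """Count, per center word, the context words that share a sentence with it."""
    center_set, context_set = set(center_vocab), set(context_vocab)  # O(1) membership tests
    coocs = defaultdict(Counter)
    for review in reviews:
        # Naive sentence split on '.', '!' and '?' (an assumption made for this sketch)
        for sentence in re.split(r"[.!?]+", review.lower()):
            tokens = re.findall(r"[a-z']+", sentence)
            centers = set(tokens) & center_set              # each center word once per sentence
            contexts = [t for t in tokens if t in context_set]
            for center in centers:
                # A word is not counted as its own context
                coocs[center].update(t for t in contexts if t != center)
    return {center: dict(counts) for center, counts in coocs.items()}

reviews = ["The room was clean. Great host, great location!", "Clean room and a helpful host."]
result = toy_coocs(reviews, ["room", "host", "location"], ["clean", "great", "helpful"])
print(result["room"])      # {'clean': 2, 'helpful': 1}
print(result["location"])  # {'great': 2}
```

Lowercasing the text before matching keeps it consistent with a lowercased vocabulary, and the set lookups avoid scanning a 1,000-item list for every token.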

In [79]:
def get_coocs(df, cent_vocab, cont_vocab):
  # Context = sentence: split every review on '.' and treat each piece as one context window
  sentences = []
  for comment in df.comments:
    sentences.extend(comment.split('.'))

  coocs = {}

  for count, center_word in enumerate(cent_vocab, start=1):
    words = []
    for sentence in sentences:
      # Note: this is a substring test, so 'bed' also matches 'bedroom'
      if center_word in sentence:
        words_in_sentence = word_tokenize(sentence)
        # Keep only the tokens that belong to the context vocabulary
        words.extend([word for word in words_in_sentence if word in cont_vocab])

    coocs[center_word] = dict(Counter(words))
    print(f'{count} out of {len(cent_vocab)}')

  return coocs

In [109]:
def get_coocs(df, cent_vocab, cont_vocab):
    # Context = sentence: split every review on '.' and treat each piece as one context window
    all_sentences = []
    for comment in df.comments:
        all_sentences.extend(comment.split('.'))

    print('yolo')

    # Pre-select, for every center word, the sentences that contain it as a token
    sentences_per_center_word = {center_word : Filter(all_sentences, center_word) for center_word in cent_vocab}

    print('swag')

    coocs = {}
    diff = 0  # total processing time in seconds

    for count, (center_word, sentences) in enumerate(sentences_per_center_word.items(), start=1):
        start = datetime.now()

        # Collect every context-vocabulary token from the sentences of this center word.
        # A context word is counted once per occurrence, and a center word that is also
        # in cont_vocab counts itself (hence e.g. coocs['location']['location']).
        words = []
        for sentence in sentences:
            words.extend([word for word in word_tokenize(sentence) if word in cont_vocab])

        end = datetime.now()
        diff += (end-start).total_seconds()
        print(f'{center_word} - - - - {(end-start).total_seconds()} seconds')
        coocs[center_word] = dict(Counter(words))
        if count % 20 == 0:
            print(f'{count} center_word out of {len(sentences_per_center_word)} in {diff/60} minutes')

    print(diff/60)
    return coocs

def Filter(sentences, center_word):
    # Keep the sentences in which center_word appears as a whitespace-delimited token.
    # The match is case-sensitive, so capitalised names such as 'Tanya' are not found
    # by their lowercased vocabulary entry.
    sentences_for_center_word = [sentence for sentence in sentences if center_word in sentence.split()]
    print(f'{center_word} --- {len(sentences_for_center_word)}')
    return sentences_for_center_word

In [110]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

yolo
location --- 581
place --- 683
room --- 589
host --- 419
amsterdam --- 27
stay --- 712
everything --- 312
apartment --- 377
city --- 272
daniel --- 8
edwin --- 9
time --- 232
house --- 187
night --- 132
home --- 140
bathroom --- 142
experience --- 146
flip --- 4
walk --- 207
restaurants --- 104
area --- 123
breakfast --- 136
station --- 78
bed --- 140
street --- 116
thanks --- 22
’ --- 0
distance --- 114
clean --- 380
perfect --- 273
center --- 95
boat --- 86
bars --- 69
trip --- 87
day --- 78
la --- 124
space --- 81
neighborhood --- 90
visit --- 102
way --- 84
thank --- 26
morning --- 72
tanya --- 0
minutes --- 70
kind --- 70
très --- 70
que --- 65
tram --- 107
things --- 67
muy --- 71
bikes --- 61
lot --- 80
days --- 70
bien --- 64
centre --- 76
information --- 67
get --- 227
coffee --- 58
questions --- 60
places --- 66
e --- 70
bit --- 85
person --- 46
anyone --- 60
need --- 132
cozy --- 68
es --- 100
people --- 55
airbnb --- 21
attractions --- 51
helpful --- 197
tips --- 61
he

offers --- 8
accurate --- 8
kann --- 7
case --- 5
foto --- 4
bath --- 7
perfekt --- 7
erreichen --- 8
tipps --- 0
était --- 8
gentile --- 5
ed --- 4
quarters --- 5
chaleureux --- 6
immer --- 8
schön --- 4
sauber --- 10
email --- 11
einem --- 8
design --- 3
love --- 37
bnb --- 5
dormir --- 6
visitors --- 5
rembrandt --- 0
tiene --- 6
enter --- 7
thought --- 24
guess --- 7
più --- 6
grande --- 7
bagno --- 7
em --- 6
floors --- 4
entrance --- 6
cold --- 29
placé --- 4
response --- 5
highlight --- 7
doors --- 5
ubicado --- 7
bring --- 21
girlfriend --- 8
front --- 29
merci --- 1
cama --- 7
walls --- 7
mornings --- 5
empfehlen --- 6
flights --- 6
fault --- 8
sympa --- 3
bem --- 6
leave --- 23
climb --- 11
tolle --- 8
etwas --- 9
canoe --- 9
personnes --- 6
zeer --- 5
auf --- 10
centrale --- 4
caminando --- 5
ruido --- 9
noches --- 4
restaurantes --- 9
mobility --- 6
close --- 221
hostess --- 3
noisy --- 52
relax --- 8
toiletries --- 4
well --- 207
connection --- 3
personality --- 5
highligh

sehr - - - - 0.013962 seconds
recommendations - - - - 0.014962 seconds
market - - - - 0.016955 seconds
spot - - - - 0.012966 seconds
maps - - - - 0.014963 seconds
help - - - - 0.02493 seconds
food - - - - 0.013963 seconds
amazing - - - - 0.046874 seconds
arrival - - - - 0.017952 seconds
hospitality - - - - 0.009956 seconds
feel - - - - 0.041888 seconds
fatih - - - - 0.0 seconds
hotel - - - - 0.013962 seconds
thing - - - - 0.017952 seconds
para - - - - 0.01496 seconds
è - - - - 0.013965 seconds
guests - - - - 0.012965 seconds
anything - - - - 0.011967 seconds
water - - - - 0.012965 seconds
bedroom - - - - 0.015959 seconds
120 center_word out of 1000 in 0.07822381666666675 minutes
zu - - - - 0.017949 seconds
der - - - - 0.013963 seconds
weekend - - - - 0.009973 seconds
times - - - - 0.012965 seconds
value - - - - 0.012966 seconds
comfy - - - - 0.010971 seconds
et - - - - 0.026928 seconds
war - - - - 0.008976 seconds
alexander - - - - 0.0 seconds
shower - - - - 0.01895 seconds
luggage - -

find - - - - 0.031916 seconds
todos - - - - 0.006012 seconds
top - - - - 0.009944 seconds
flips - - - - 0.001028 seconds
system - - - - 0.004987 seconds
eat - - - - 0.004956 seconds
380 center_word out of 1000 in 0.12523414999999977 minutes
family - - - - 0.002989 seconds
dinner - - - - 0.00399 seconds
par - - - - 0.00399 seconds
tout - - - - 0.006982 seconds
lock - - - - 0.005985 seconds
job - - - - 0.00798 seconds
atento - - - - 0.005972 seconds
al - - - - 0.009969 seconds
città - - - - 0.002993 seconds
downtown - - - - 0.006982 seconds
en - - - - 0.043883 seconds
pour - - - - 0.012965 seconds
travelers - - - - 0.001994 seconds
l - - - - 0.0 seconds
side - - - - 0.005984 seconds
fue - - - - 0.006982 seconds
right - - - - 0.048893 seconds
hay - - - - 0.004987 seconds
clubs - - - - 0.004955 seconds
los - - - - 0.010971 seconds
400 center_word out of 1000 in 0.12855824999999974 minutes
mais - - - - 0.005984 seconds
george - - - - 0.0 seconds
position - - - - 0.002992 seconds
superb - - 

cold - - - - 0.011966 seconds
placé - - - - 0.001996 seconds
response - - - - 0.001995 seconds
highlight - - - - 0.002994 seconds
doors - - - - 0.001993 seconds
ubicado - - - - 0.002992 seconds
bring - - - - 0.009973 seconds
girlfriend - - - - 0.00399 seconds
660 center_word out of 1000 in 0.1544780500000001 minutes
front - - - - 0.014967 seconds
merci - - - - 0.0 seconds
cama - - - - 0.003989 seconds
walls - - - - 0.003989 seconds
mornings - - - - 0.001994 seconds
empfehlen - - - - 0.001996 seconds
flights - - - - 0.003025 seconds
fault - - - - 0.004953 seconds
sympa - - - - 0.001996 seconds
bem - - - - 0.002992 seconds
leave - - - - 0.012003 seconds
climb - - - - 0.003986 seconds
tolle - - - - 0.004951 seconds
etwas - - - - 0.003989 seconds
canoe - - - - 0.004987 seconds
personnes - - - - 0.001995 seconds
zeer - - - - 0.002992 seconds
auf - - - - 0.003989 seconds
centrale - - - - 0.001994 seconds
caminando - - - - 0.001994 seconds
680 center_word out of 1000 in 0.15585773333333347 mi

toll - - - - 0.000997 seconds
амстердама - - - - 0.0 seconds
replies - - - - 0.001996 seconds
month - - - - 0.003987 seconds
memories - - - - 0.000997 seconds
940 center_word out of 1000 in 0.17370865000000035 minutes
dat - - - - 0.001994 seconds
fantastisch - - - - 0.000997 seconds
deze - - - - 0.002994 seconds
direction - - - - 0.001993 seconds
cheese - - - - 0.000998 seconds
ms - - - - 0.0 seconds
boot - - - - 0.000998 seconds
proche - - - - 0.001995 seconds
voor - - - - 0.001996 seconds
hausboot - - - - 0.0 seconds
atmosphere - - - - 0.004986 seconds
watch - - - - 0.005985 seconds
wow - - - - 0.000991 seconds
section - - - - 0.003998 seconds
none - - - - 0.000992 seconds
taxi - - - - 0.002995 seconds
liegt - - - - 0.001991 seconds
toller - - - - 0.000998 seconds
schönen - - - - 0.001995 seconds
liebevoll - - - - 0.003956 seconds
960 center_word out of 1000 in 0.17442285000000038 minutes
unterkunft - - - - 0.0 seconds
anne - - - - 0.0 seconds
action - - - - 0.001994 seconds
quality 

In [29]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

yolo
swag
place - - - - 9.739992 seconds
apartment - - - - 9.123758 seconds
location - - - - 7.272575 seconds
stay - - - - 9.127552 seconds
amsterdam - - - - 0.307179 seconds
host - - - - 4.992666 seconds
everything - - - - 4.695553 seconds
city - - - - 5.360662 seconds
room - - - - 4.647571 seconds
time - - - - 3.827798 seconds
house - - - - 3.347038 seconds
area - - - - 2.853403 seconds
home - - - - 2.726665 seconds
très - - - - 2.555164 seconds
center - - - - 2.17319 seconds
restaurants - - - - 2.144299 seconds
’ - - - - 0.006188 seconds
station - - - - 1.810153 seconds
minutes - - - - 2.400603 seconds
walk - - - - 2.781559 seconds
20 center_word out of 1000 in 1.3648928000000002 minutes
centre - - - - 2.021594 seconds
tram - - - - 2.438515 seconds
experience - - - - 1.81614 seconds
space - - - - 1.839122 seconds
thanks - - - - 0.469934 seconds
à - - - - 2.164167 seconds
hosts - - - - 1.314486 seconds
neighborhood - - - - 1.645763 seconds
clean - - - - 4.123058 seconds
bien - - - - 

store - - - - 0.347079 seconds
jordaan - - - - 0.010971 seconds
ferry - - - - 0.339065 seconds
sur - - - - 0.389985 seconds
husband - - - - 0.314191 seconds
mais - - - - 0.439824 seconds
restaurant - - - - 0.320145 seconds
trams - - - - 0.316152 seconds
views - - - - 0.284229 seconds
tout - - - - 0.60937 seconds
par - - - - 0.337099 seconds
se - - - - 0.553492 seconds
si - - - - 0.344081 seconds
communicate - - - - 0.336118 seconds
toilet - - - - 0.3171 seconds
return - - - - 0.409942 seconds
b - - - - 0.015997 seconds
das - - - - 0.419877 seconds
était - - - - 0.407912 seconds
air - - - - 0.207442 seconds
260 center_word out of 1000 in 4.524476783333338 minutes
von - - - - 0.357032 seconds
directions - - - - 0.266316 seconds
chambre - - - - 0.302157 seconds
petit - - - - 0.293218 seconds
end - - - - 0.311346 seconds
tea - - - - 0.230373 seconds
work - - - - 0.29326 seconds
travel - - - - 0.342042 seconds
supermarkets - - - - 0.215421 seconds
wonderful - - - - 1.864968 seconds
tv - - -

ruhig - - - - 0.138632 seconds
juste - - - - 0.131683 seconds
fue - - - - 0.185525 seconds
vivement - - - - 0.15758 seconds
apartments - - - - 0.126661 seconds
bright - - - - 0.288229 seconds
walking - - - - 1.568805 seconds
ben - - - - 0.109708 seconds
deal - - - - 0.154586 seconds
case - - - - 0.125664 seconds
daniel - - - - 0.003984 seconds
comfort - - - - 0.118683 seconds
downtown - - - - 0.250362 seconds
stars - - - - 0.09774 seconds
daughter - - - - 0.143655 seconds
tourists - - - - 0.121675 seconds
den - - - - 0.334117 seconds
einer - - - - 0.168581 seconds
fact - - - - 0.124672 seconds
500 center_word out of 1000 in 5.681029266666674 minutes
comme - - - - 0.163522 seconds
summer - - - - 0.124668 seconds
unit - - - - 0.122702 seconds
mind - - - - 0.186483 seconds
stations - - - - 0.124667 seconds
gefühlt - - - - 0.132677 seconds
bnb - - - - 0.084772 seconds
aux - - - - 0.217445 seconds
right - - - - 1.304542 seconds
system - - - - 0.108678 seconds
lines - - - - 0.118826 seconds


loin - - - - 0.111702 seconds
flat - - - - 1.79018 seconds
etwas - - - - 0.127658 seconds
charming - - - - 0.330117 seconds
internet - - - - 0.081815 seconds
sie - - - - 0.092864 seconds
conseils - - - - 0.135674 seconds
reviews - - - - 0.093749 seconds
mirjam - - - - 0.001995 seconds
accomodation - - - - 0.093749 seconds
music - - - - 0.077765 seconds
locatie - - - - 0.106716 seconds
della - - - - 0.094746 seconds
bicycles - - - - 0.091756 seconds
marc - - - - 0.001994 seconds
baby - - - - 0.098773 seconds
attentions - - - - 0.084773 seconds
art - - - - 0.093752 seconds
monique - - - - 0.001028 seconds
740 center_word out of 1000 in 6.233817283333344 minutes
breeze - - - - 0.08976 seconds
côté - - - - 0.099733 seconds
action - - - - 0.072805 seconds
deux - - - - 0.137632 seconds
justice - - - - 0.072805 seconds
oder - - - - 0.110704 seconds
offers - - - - 0.127654 seconds
sono - - - - 0.140609 seconds
braucht - - - - 0.084773 seconds
bij - - - - 0.109709 seconds
apartamento - - - - 0.

960 center_word out of 1000 in 6.792878183333346 minutes
buurt - - - - 0.12068 seconds
bis - - - - 0.084803 seconds
hair - - - - 0.087802 seconds
condition - - - - 0.06785 seconds
vue - - - - 0.102726 seconds
schoon - - - - 0.062809 seconds
weeks - - - - 0.069843 seconds
matthijs - - - - 0.000997 seconds
sister - - - - 0.085767 seconds
bottle - - - - 0.324181 seconds
terms - - - - 0.073771 seconds
gift - - - - 0.074786 seconds
robert - - - - 0.00304 seconds
communications - - - - 0.070841 seconds
péniche - - - - 0.070878 seconds
pass - - - - 0.093784 seconds
plek - - - - 0.05685 seconds
kleine - - - - 0.082814 seconds
afternoon - - - - 0.063832 seconds
buen - - - - 0.062866 seconds
980 center_word out of 1000 in 6.820560183333346 minutes
feeling - - - - 0.13666 seconds
estancia - - - - 0.118682 seconds
toll - - - - 0.075798 seconds
travellers - - - - 0.056848 seconds
plans - - - - 0.075798 seconds
vondel - - - - 0.00994 seconds
helpfull - - - - 0.098717 seconds
все - - - - 0.070811 sec

In [111]:
coocs

{'location': {'location': 589,
  'next': 9,
  'tram': 19,
  'stop': 8,
  'took': 2,
  '10-15': 3,
  'get': 31,
  'center': 25,
  'great': 168,
  'host': 42,
  'like': 8,
  'kindness': 1,
  'enough': 13,
  'super': 15,
  'clean': 32,
  'provided': 2,
  'needed': 11,
  'staying': 9,
  'maps': 1,
  'tips': 4,
  'easy': 38,
  'relaxing': 4,
  'serviced': 1,
  'neighborhood': 10,
  'pleasant': 2,
  'place': 78,
  'perfect': 95,
  'minute': 11,
  'bus': 11,
  'metro': 3,
  'ride': 6,
  'downtown': 4,
  'quiet': 23,
  'evening': 2,
  'apartment': 61,
  'train': 11,
  'stops': 3,
  'nice': 41,
  'wonderful': 17,
  'looking': 5,
  'local': 2,
  'experience': 7,
  'use': 3,
  'public': 7,
  'want': 13,
  'restaurants': 14,
  'pubs': 1,
  'walking': 33,
  'distance': 29,
  "'s": 75,
  'fantastic': 27,
  'see': 9,
  'many': 12,
  'quick': 5,
  'bustle': 1,
  'noise': 11,
  'good': 33,
  'central': 65,
  'centre': 16,
  'short': 14,
  'convenient': 20,
  'close': 56,
  'stay': 52,
  'comfortable': 

### 3.a4 - Convert co-occurrence dictionary to 1000x1000 dataframe

What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 
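The conversion described above can be sketched on a small dictionary. This example assumes pandas is available and uses hypothetical toy vocabularies; `DataFrame.from_dict(..., orient="index")` turns the outer keys into rows, and `reindex` plus `fillna(0)` supplies the zero cells for pairs that never co-occur.

```python
import pandas as pd

# Hypothetical toy inputs: 'location' never co-occurs with anything
toy_coocs = {"room": {"clean": 2, "helpful": 1}, "host": {"great": 2}}
center_vocab = ["room", "host", "location"]
context_vocab = ["clean", "great", "helpful"]

table = (
    pd.DataFrame.from_dict(toy_coocs, orient="index")    # rows = center words
    .reindex(index=center_vocab, columns=context_vocab)  # align to the full vocabularies
    .fillna(0)                                           # unseen pairs get a count of 0
    .astype(int)
)
print(table)
```

Aligning to the vocabularies in one step avoids cell-by-cell assignment, which is slow on a 1000x1000 table.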

In [112]:
def cooc_dict2df(coocs):
  coocdf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

  for index, row in coocdf.iterrows():
    for word in cont_vocab:
      try:
        coocdf[word][index] = coocs[index][word]
      except: 
        coocdf[word][index] = 0

  return coocdf

In [None]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

### 3.a5 - Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 
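For reference, pointwise mutual information is usually defined as `PMI(x, y) = log( P(x, y) / (P(x) * P(y)) )`, where `P(x, y)` is the count of the pair divided by the total of all counts `N`, and `P(x)` and `P(y)` are the row and column totals divided by `N`. The worked example below is a sketch in plain Python on a hypothetical 2x2 count table; it assumes the natural logarithm and assigns 0 to pairs that never co-occur, both of which are conventions rather than requirements of the task.

```python
import math

# Hypothetical co-occurrence counts: rows are center words, columns are context words
counts = {
    "room": {"clean": 6, "great": 2},
    "host": {"clean": 2, "great": 6},
}

def toy_pmi(counts):
    """Return a dict of dicts with PMI(x, y) for every (center, context) pair."""
    total = sum(sum(row.values()) for row in counts.values())    # N
    row_sums = {x: sum(row.values()) for x, row in counts.items()}
    col_sums = {}
    for row in counts.values():
        for y, c in row.items():
            col_sums[y] = col_sums.get(y, 0) + c
    pmi = {}
    for x, row in counts.items():
        pmi[x] = {}
        for y, c in row.items():
            if c == 0:
                pmi[x][y] = 0.0   # convention used here for pairs that never co-occur
            else:
                p_xy = c / total
                p_x = row_sums[x] / total
                p_y = col_sums[y] / total
                pmi[x][y] = math.log(p_xy / (p_x * p_y))
    return pmi

scores = toy_pmi(counts)
print(round(scores["room"]["clean"], 3))  # 0.405, i.e. log(1.5)
print(round(scores["room"]["great"], 3))  # -0.693, i.e. log(0.5)
```

Here `N = 16` and every marginal probability is 0.5, so "room" and "clean" co-occur 1.5 times more often than independence would predict, while "room" and "great" co-occur half as often.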

In [None]:
coocdf

In [None]:
coocdf['location']['location']

In [None]:
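def cooc2pmi(df):
  # PMI(x, y) = log( P(x, y) / (P(x) * P(y)) ), with rows as center words x
  # and columns as context words y
  counts = df.astype(float)
  N = counts.to_numpy().sum()

  p_xy = counts / N                 # joint probability of each (center, context) pair
  p_x = counts.sum(axis=1) / N      # marginal probability of each center word
  p_y = counts.sum(axis=0) / N      # marginal probability of each context word
  expected = np.outer(p_x, p_y)     # P(x) * P(y) for every pair

  # log(0) is undefined, so silence the warnings and handle those cells below
  with np.errstate(divide='ignore', invalid='ignore'):
    pmidf = np.log(p_xy / expected)

  # Pairs that never co-occur get a score of 0
  return pmidf.where(counts > 0, 0.0)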

In [None]:
pmidf = cooc2pmi(coocdf)
pmidf.shape

In [None]:
pmidf

In [None]:
# for name in cont_vocab:
#     if len(pmidf[name][pmidf[name] > 0]) > 0:
#         print(pmidf[name][pmidf[name] > 0 ])

### 3.a6 - Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `‘towels’`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`. 
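The ranking step can be illustrated without pandas. This is a small sketch in which a plain dictionary stands in for one row of the PMI table; `toy_topk` and the scores in `pmi_for_towels` are made up for illustration.

```python
def toy_topk(pmi_row, n=10):
    """Return the n context words with the highest PMI, best first."""
    ranked = sorted(pmi_row.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Hypothetical PMI scores of four context words with the center word 'towels'
pmi_for_towels = {"fresh": 1.8, "clean": 1.2, "great": -0.3, "provided": 2.4}
print(toy_topk(pmi_for_towels, n=3))  # ['provided', 'fresh', 'clean']
```

The key point is that the function returns the words themselves, ordered by score, rather than the scores.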

In [None]:
def topk(df, center_word, N=10):
  # Sort the context words of this center word by PMI, highest first, and return their names
  scores = df.loc[center_word].astype(float)
  return scores.sort_values(ascending=False).head(N).index.tolist()

In [None]:
topk(pmidf, 'place')

In [None]:
topk(pmidf, 'location')

# 3.b Ethical, social and legal implications



Local authorities in tourist hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent to, among other things, ensure that rents remain fair for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends that you, as a new host, set the price of your flat above the official limit established by the local government. Upon inspection, you realize that the inflated price you have been recommended comes from many apartments in the area being offered only during an annual event which brings many tourists and causes prices to rise.

In this context, critically reflect on the compliance of this recommender system with **one of the five actions** outlined in the **UK’s Data Ethics Framework**. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three **key principles** outlined in the Framework, namely _transparency_, _accountability_ and _fairness_. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support. 

Your report should be between 500 and 750 words long.  

### Your answer here. No Python, only Markdown.

Write your answer after the line.

---

...