# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked with discovering strong associations between concepts in Airbnb reviews. The starter code we provide in this notebook is for orientation only. The imports below are enough to implement a valid answer.

### Imports, data loading and helper functions

We first import pandas, numpy and some useful nltk and collections modules, then extend the stopword list and load the dataframe.

In [190]:
import pandas as pd
from nltk.tag import pos_tag
from nltk import RegexpParser
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()

In [191]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [192]:
# load stopwords
sw = set(stopwords.words('english'))

In [None]:
sw

In [194]:
# Stopwords such as "don't" often appear in reviews without the apostrophe ("dont"),
# so we collect the apostrophe-free variant of each such stopword.

pattern = r'\w+\'\w+'

new_stopwords = []
for word in sw:
    # If the stopword contains an apostrophe, store it with the apostrophe removed
    matches = re.findall(pattern, word)
    if len(matches) == 1:
        new_stopwords.append(matches[0].replace('\'', ''))
new_stopwords

['youre',
 'havent',
 'shouldnt',
 'wouldnt',
 'shant',
 'wasnt',
 'dont',
 'werent',
 'arent',
 'neednt',
 'didnt',
 'youve',
 'youd',
 'wont',
 'youll',
 'hadnt',
 'mightnt',
 'hasnt',
 'shes',
 'couldnt',
 'isnt',
 'shouldve',
 'doesnt',
 'thatll',
 'its',
 'mustnt']

In [195]:
# After checking these new words, we add them to the stopword set sw
for word in new_stopwords:
    sw.add(word)
sw

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'arent',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'couldnt',
 'd',
 'did',
 'didn',
 "didn't",
 'didnt',
 'do',
 'does',
 'doesn',
 "doesn't",
 'doesnt',
 'doing',
 'don',
 "don't",
 'dont',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'hadnt',
 'has',
 'hasn',
 "hasn't",
 'hasnt',
 'have',
 'haven',
 "haven't",
 'havent',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'isnt',
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'mightnt',
 'more',
 'most',
 'mustn',
 "mustn't",
 'mustnt',
 'my',
 'myself',
 'needn',
 "needn't",
 'neednt',
 'no',
 'nor',
 'not',
 'no

In [196]:
basedir = os.getcwd()
df = pd.read_csv(os.path.join(basedir,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

In [197]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [198]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A function `process_reviews(df)` that takes the original dataframe as input and returns it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [199]:
def process_reviews(df):
    '''
    Adds three new columns to the given dataframe: tokenized, tagged and lower_tagged.
    tokenized holds the word tokens of each row's comment, tagged holds the part-of-speech (PoS)
    tags of those tokens, and lower_tagged holds the PoS tags of the lowercased tokens after
    stopword removal.

    Args:
        df: The dataframe we want to modify.

    Returns: The given dataframe with three additional columns: tokenized, tagged and lower_tagged.
    '''
    # Initialize 3 lists one for each column we will create
    tokenized_col = []
    tagged_col = []
    lower_tagged_col = []


    mylen = len(df)
    count = 0
    
    # Iterate through the given dataframe
    for index, row in df.iterrows():
        # tokenize the words for the comments of a row
        token = word_tokenize(row.comments)
        # Append the tokenized words to the proper list
        tokenized_col.append(token)
        # Tag the tokenized words of the row and then append them to the proper list
        tagged_col.append(pos_tag(token))
        # Lowercase the tokens, drop the stopwords, then tag what remains and append it to lower_tagged_col
        lower_tagged_col.append(pos_tag([item.lower() for item in token if item.lower() not in sw]))
        count += 1

        if count % 1000 == 0:
            print(f'{count} out of {mylen}')

    # Set as values of the 3 new columns the proper list we created for each one
    df['tokenized'] = tokenized_col
    df['tagged'] = tagged_col
    df['lower_tagged'] = lower_tagged_col

    # Return the modified dataframe
    return df

In [200]:
# df = process_reviews(df)
# Work on an explicit copy of the first 50,000 reviews to avoid pandas' SettingWithCopyWarning
df = process_reviews(df[:50000].copy())

50000 out of 50000


### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 
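
As a small illustration of the selection rule, the sketch below builds both vocabularies from a hand-written list of (word, tag) pairs. The toy reviews, the helper name `toy_vocab` and the cut-off `k` are assumptions made for this example; the tags follow the Penn Treebank convention used by `pos_tag`, where noun tags start with `N`, verb tags with `V` and adjective tags with `J`.

```python
from collections import Counter

# Hypothetical tagged reviews, in the same (word, tag) format as `lower_tagged`
toy_reviews = [
    [('great', 'JJ'), ('place', 'NN'), ('stay', 'VB')],
    [('place', 'NN'), ('clean', 'JJ'), ('host', 'NN'), ('great', 'JJ')],
    [('great', 'JJ'), ('host', 'NN'), ('place', 'NN'), ('clean', 'JJ')],
]

def toy_vocab(reviews, k=2):
    """Return the k most frequent nouns and the k most frequent verbs/adjectives."""
    center_counts, context_counts = Counter(), Counter()
    for review in reviews:
        for word, tag in review:
            if tag.startswith('N'):
                center_counts[word] += 1
            elif tag.startswith(('J', 'V')):
                context_counts[word] += 1
    center = [word for word, _ in center_counts.most_common(k)]
    context = [word for word, _ in context_counts.most_common(k)]
    return center, context

center, context = toy_vocab(toy_reviews)
print(center)   # ['place', 'host']
print(context)  # ['great', 'clean']
```

Note that the same word can end up in both lists when it is frequent under more than one tag, which is why the overlap between the two vocabularies is worth checking.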

In [209]:
def get_vocab(df):
  # Count nouns (center words) and verbs/adjectives (context words) separately
  cent_counts, cont_counts = Counter(), Counter()

  for review in df.lower_tagged:
    for word, tag in review:
      if tag.startswith('N'):
        cent_counts[word] += 1
      elif tag.startswith(('J', 'V')):
        cont_counts[word] += 1

  # Keep the 1,000 most frequent words of each kind
  cent_vocab = [word for word, _ in cent_counts.most_common(1000)]
  cont_vocab = [word for word, _ in cont_counts.most_common(1000)]

  return cent_vocab, cont_vocab

In [210]:
cent_vocab, cont_vocab = get_vocab(df)

In [211]:
cent_vocab[:5]

['place', 'apartment', 'location', 'amsterdam', 'stay']

In [212]:
cont_vocab[:5]

['great', 'nice', 'recommend', 'clean', 'good']

In [213]:
samewords = [name for name in cent_vocab if name in cont_vocab]
len(samewords)

394

In [214]:
samewords

['place',
 'apartment',
 'location',
 'amsterdam',
 'stay',
 'host',
 'home',
 'experience',
 'très',
 'walk',
 'center',
 'breakfast',
 'hosts',
 'boat',
 'tram',
 'centre',
 'à',
 'perfect',
 'neighborhood',
 'bathroom',
 'la',
 'clean',
 '’',
 'bien',
 'bed',
 'e',
 'street',
 'thank',
 'visit',
 'sehr',
 'et',
 'coffee',
 'lot',
 'houseboat',
 'ist',
 'der',
 'need',
 'tips',
 'muy',
 'arrival',
 'die',
 'helpful',
 'airbnb',
 'view',
 'que',
 'bus',
 'shops',
 'cozy',
 'minute',
 'kitchen',
 'dans',
 'get',
 'bars',
 'lots',
 'stairs',
 'con',
 'es',
 'min',
 'wir',
 'check',
 'le',
 'transport',
 'pictures',
 'zu',
 'bit',
 'manuel',
 'convenient',
 'train',
 'door',
 'friends',
 'è',
 'beautiful',
 'neighbourhood',
 'des',
 'appartement',
 'je',
 'help',
 'couple',
 'du',
 'supermarket',
 'il',
 'corner',
 'casa',
 'feel',
 'canal',
 'photos',
 'use',
 'alles',
 'airport',
 'shower',
 'wine',
 'super',
 'di',
 'séjour',
 'bike',
 'park',
 'situé',
 'friend',
 'cosy',
 'und',
 'r

### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 
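
One of the context definitions mentioned above, a sliding window of fixed size, can be sketched as follows. This is only an illustration under stated assumptions: the function name `window_coocs`, the window size and the toy tokens are hypothetical, and the center position itself is skipped so that an occurrence is not counted as its own context.

```python
from collections import Counter, defaultdict

def window_coocs(tokens, center_vocab, context_vocab, window=2):
    """Count context words found within `window` tokens of each center word."""
    center_set, context_set = set(center_vocab), set(context_vocab)
    coocs = defaultdict(Counter)
    for i, token in enumerate(tokens):
        if token not in center_set:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            # Skip the center position itself
            if j != i and tokens[j] in context_set:
                coocs[token][tokens[j]] += 1
    return {center: dict(counts) for center, counts in coocs.items()}

# Hypothetical, already lowercased and stopword-free tokens of one review
tokens = ['great', 'place', 'clean', 'room', 'nice', 'host']
result = window_coocs(tokens, ['place', 'room', 'host'], ['great', 'clean', 'nice'])
print(result)
```

A narrow window favours words that directly modify the center word, whereas a per-sentence or per-review context also picks up looser associations; whichever is chosen should be justified in the code comments.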

In [215]:
# def get_coocs(df, cent_vocab, cont_vocab):
#   sentences = []
#   comments = df.comments

#   for comment in comments:
#     sentences.extend([sentence for sentence in comment.split('.')])
  
#   print('yolo')
#   # print(sentences)
  
#   coocs = {}
  
#   count = 0
#   for center_word in cent_vocab:
#     count += 1
#     words = []
#     for sentence in sentences:
#       if center_word in sentence:
#         words_in_sentence = word_tokenize(sentence)
#         words.extend([word for word in words_in_sentence if word in cont_vocab])
    
#     center_word_dict = dict(Counter(words))
#     coocs[center_word] = center_word_dict
#     print(f'{count} out of 1000')
    
#   # cent_dict = Counter(cent_list)
#   # cont_dict = Counter(cont_list)

  
#   return coocs  

In [216]:
def get_coocs(df, cent_vocab, cont_vocab):
    # Context is defined per sentence: a context word co-occurs with a center word
    # when both appear in the same sentence (sentences are split naively on '.').
    # Note: sentences are not lowercased, so capitalized occurrences (e.g. 'Amsterdam')
    # are not matched against the lowercase vocabularies.
    sentences = []
    comments = df.comments

    for comment in comments:
        sentences.extend(comment.split('.'))

    print('yolo')

    # For each center word, collect the sentences that contain it
    sentences_per_center_word = {center_word : Filter(sentences, center_word) for center_word in cent_vocab}

    print('swag')

    # Set membership is much faster than searching the 1,000-item list
    cont_set = set(cont_vocab)
    coocs = {}

    count = 0
    diff = 0
    for center_word, center_sentences in sentences_per_center_word.items():
        words = []
        count += 1
        start = pd.to_datetime('today')

        for sentence in center_sentences:
            # Every occurrence of a context-vocabulary token in the sentence is counted,
            # including the center word itself when it is also a context word
            words.extend(word for word in word_tokenize(sentence) if word in cont_set)

        end = pd.to_datetime('today')
        diff += (end-start).total_seconds()
        print(f'{center_word} - - - - {(end-start).total_seconds()} seconds')
        coocs[center_word] = dict(Counter(words))
        if count % 20 == 0 : print(f'{count} center_word out of {len(sentences_per_center_word)} in {diff/60} minutes')

    print(diff/60)
    return coocs

def Filter(sentences, center_word):
    # Keep the sentences in which center_word appears as a whitespace-separated token
    sentences_for_center_word = []

    for sentence in sentences:
        if center_word in sentence.split():
            sentences_for_center_word.append(sentence)
    print(f'{center_word} --- {len(sentences_for_center_word)}')
    return sentences_for_center_word

In [217]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

yolo
place --- 15783
apartment --- 16114
location --- 11551
amsterdam --- 581
stay --- 17951
host --- 7961
room --- 9458
everything --- 7320
city --- 7336
time --- 6486
house --- 4684
home --- 4103
area --- 4067
experience --- 3490
restaurants --- 2968
très --- 3525
walk --- 5432
station --- 2351
center --- 3198
thanks --- 748
minutes --- 2973
breakfast --- 2938
hosts --- 2303
boat --- 2675
tram --- 4147
day --- 2410
space --- 2471
centre --- 3074
à --- 3308
perfect --- 6533
neighborhood --- 2616
bathroom --- 2259
la --- 3917
distance --- 2534
kind --- 2166
clean --- 7427
’ --- 0
bien --- 2305
bed --- 2834
night --- 1940
trip --- 2061
e --- 2003
street --- 1932
thank --- 643
days --- 2019
people --- 1728
places --- 2008
visit --- 2317
sehr --- 2280
et --- 4931
coffee --- 1521
lot --- 2013
houseboat --- 1606
ist --- 2278
way --- 1598
communication --- 1236
war --- 1536
der --- 1385
need --- 3217
morning --- 1411
tips --- 1694
family --- 1447
muy --- 1525
arrival --- 1850
die --- 1651
in

boyfriend --- 322
einen --- 356
angela --- 5
barrio --- 178
internet --- 169
tipps --- 42
doors --- 178
adults --- 165
zona --- 235
check-in --- 515
grande --- 188
vivement --- 202
dass --- 187
otto --- 8
transports --- 227
laura --- 2
einem --- 312
downtown --- 350
er --- 297
world --- 144
opportunity --- 184
table --- 351
su --- 255
vera --- 13
spotless --- 194
stars --- 123
amstel --- 10
erik --- 0
placé --- 137
conforme --- 193
barbara --- 6
breakfasts --- 175
check-out --- 188
michael --- 1
river --- 149
heel --- 169
beaucoup --- 248
addition --- 122
cette --- 251
talk --- 218
question --- 165
tiene --- 165
gogh --- 9
georg --- 0
surprise --- 179
years --- 136
zentrum --- 1
stations --- 161
mind --- 285
posizione --- 164
tutto --- 164
decor --- 180
list --- 224
jacob --- 3
responsive --- 705
cafe --- 341
calm --- 375
nicht --- 496
cafés --- 146
staircase --- 182
cats --- 150
treats --- 166
bicycles --- 154
make --- 1705
anche --- 167
character --- 161
connection --- 162
entrance -

war - - - - 0.496672 seconds
der - - - - 0.521601 seconds
need - - - - 1.291504 seconds
morning - - - - 0.631738 seconds
60 center_word out of 1000 in 1.4965292000000003 minutes
tips - - - - 0.694143 seconds
family - - - - 0.554509 seconds
muy - - - - 0.527598 seconds
arrival - - - - 0.623327 seconds
die - - - - 0.607415 seconds
information - - - - 0.600353 seconds
helpful - - - - 1.841108 seconds
airbnb - - - - 0.208472 seconds
view - - - - 0.542832 seconds
things - - - - 0.649262 seconds
que - - - - 0.776926 seconds
bus - - - - 0.691152 seconds
shops - - - - 0.536566 seconds
anyone - - - - 0.578451 seconds
cozy - - - - 0.592452 seconds
questions - - - - 0.569439 seconds
minute - - - - 0.684199 seconds
kitchen - - - - 0.752993 seconds
nights - - - - 0.469746 seconds
town - - - - 0.450781 seconds
80 center_word out of 1000 in 1.7123912666666672 minutes
dans - - - - 0.569477 seconds
get - - - - 2.025595 seconds
bars - - - - 0.404901 seconds
lots - - - - 0.706089 seconds
food - - - - 0.4

steps - - - - 0.197475 seconds
и - - - - 0.145572 seconds
building - - - - 0.189709 seconds
300 center_word out of 1000 in 3.192587416666664 minutes
merci - - - - 0.048868 seconds
noise - - - - 0.270295 seconds
se - - - - 0.390133 seconds
dem - - - - 0.199426 seconds
deck - - - - 0.168554 seconds
karin - - - - 0.007014 seconds
comfortable - - - - 2.041501 seconds
streets - - - - 0.144614 seconds
était - - - - 0.197517 seconds
ideal - - - - 0.265291 seconds
work - - - - 0.197479 seconds
tv - - - - 0.033904 seconds
mais - - - - 0.279328 seconds
auf - - - - 0.21639 seconds
life - - - - 0.15363 seconds
ha - - - - 0.177524 seconds
von - - - - 0.18252 seconds
bags - - - - 0.222539 seconds
lo - - - - 0.245349 seconds
shop - - - - 0.160568 seconds
320 center_word out of 1000 in 3.2859614833333306 minutes
waren - - - - 0.185543 seconds
care - - - - 0.169544 seconds
dutch - - - - 0.059838 seconds
cat - - - - 0.151594 seconds
problems - - - - 0.141665 seconds
hours - - - - 0.180475 seconds
parfai

partner - - - - 0.093738 seconds
map - - - - 0.218451 seconds
540 center_word out of 1000 in 3.876106983333328 minutes
accommodations - - - - 0.06583 seconds
locals - - - - 0.087724 seconds
einer - - - - 0.090759 seconds
carlos - - - - 0.001995 seconds
chaleureux - - - - 0.081782 seconds
riks - - - - 0.003003 seconds
eingerichtet - - - - 0.067805 seconds
sin - - - - 0.113696 seconds
zeit - - - - 0.003989 seconds
meals - - - - 0.105716 seconds
milk - - - - 0.064829 seconds
quality - - - - 0.098741 seconds
sonja - - - - 0.001998 seconds
gijs - - - - 0.001986 seconds
answer - - - - 0.210437 seconds
parts - - - - 0.106712 seconds
trouble - - - - 0.152591 seconds
nos - - - - 0.432884 seconds
patio - - - - 0.109706 seconds
com - - - - 0.093749 seconds
560 center_word out of 1000 in 3.907705849999994 minutes
start - - - - 0.172538 seconds
love - - - - 0.482736 seconds
items - - - - 0.083789 seconds
recommandons - - - - 0.054889 seconds
boyfriend - - - - 0.154548 seconds
einen - - - - 0.147609

camera - - - - 0.05685 seconds
bem - - - - 0.05984 seconds
über - - - - 0.055849 seconds
eltjo - - - - 0.002992 seconds
info - - - - 0.11968 seconds
spent - - - - 0.351061 seconds
recommendation - - - - 0.047874 seconds
meeting - - - - 0.088784 seconds
mornings - - - - 0.058815 seconds
picture - - - - 0.051861 seconds
option - - - - 0.073834 seconds
dirk - - - - 0.001999 seconds
guides - - - - 0.067814 seconds
jours - - - - 0.080751 seconds
hilfsbereit - - - - 0.036901 seconds
jederzeit - - - - 0.046875 seconds
site - - - - 0.046874 seconds
oder - - - - 0.084774 seconds
cottage - - - - 0.052859 seconds
800 center_word out of 1000 in 4.304386499999996 minutes
girlfriend - - - - 0.138628 seconds
sweet - - - - 0.211433 seconds
plaisir - - - - 0.055851 seconds
houses - - - - 0.057845 seconds
zeker - - - - 0.04488 seconds
apt - - - - 0.12766 seconds
upstairs - - - - 0.142643 seconds
verblijf - - - - 0.053831 seconds
relaxing - - - - 0.256314 seconds
offers - - - - 0.138629 seconds
ok - - - 

In [218]:
coocs

{'place': {'place': 16381,
  'nice': 1519,
  'clean': 1422,
  'finding': 31,
  'amazing': 661,
  'host': 793,
  'provides': 26,
  'want': 345,
  'comfy': 109,
  'bed': 220,
  'maps': 29,
  'located': 706,
  'quiet': 677,
  'neighbourhood': 107,
  'close': 715,
  'public': 147,
  'getting': 63,
  'easy': 522,
  'fast': 19,
  'calm': 74,
  'provided': 106,
  'much': 290,
  'enjoy': 181,
  'return': 130,
  'first': 166,
  'friendly': 299,
  'waiting': 21,
  'outside': 106,
  'arrived': 121,
  'late': 48,
  'comfortable': 757,
  'recommend': 1916,
  "'s": 3391,
  'next': 274,
  'visit': 375,
  'wonderful': 584,
  'accommodating': 72,
  'come': 323,
  'choose': 47,
  'great': 2807,
  'got': 94,
  'relaxing': 78,
  'traveling': 51,
  'apartment': 661,
  'tidy': 85,
  'organized': 27,
  'go': 343,
  'bit': 139,
  'central': 429,
  'accessible': 58,
  'tram': 351,
  'stay': 4127,
  'pleasant': 85,
  'location': 1422,
  'perfect': 1296,
  '10-15': 16,
  'minute': 143,
  'bus': 148,
  'metro': 6

### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 
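
The conversion can be tried on a toy dictionary first. The sketch below is one possible approach, assuming hypothetical three-word vocabularies; it relies on `pd.DataFrame.from_dict` with `orient='index'` and on `reindex` to add rows and columns for words that never co-occur.

```python
import pandas as pd

# Hypothetical vocabularies and co-occurrence counts
toy_center = ['place', 'host', 'tram']
toy_context = ['great', 'clean', 'kind']
toy_coocs = {'place': {'great': 3, 'clean': 2}, 'host': {'kind': 4}}

# Outer keys become rows; missing rows, columns and cells are filled with 0
toy_df = (pd.DataFrame.from_dict(toy_coocs, orient='index')
            .reindex(index=toy_center, columns=toy_context)
            .fillna(0)
            .astype(int))
print(toy_df)
```

Here `tram` has no entry in the dictionary at all, yet it still receives a row of zeros because the index is taken from the vocabulary rather than from the dictionary keys.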

In [219]:
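def cooc_dict2df(coocs):
  # Outer keys (center words) become rows, inner keys (context words) become columns
  coocdf = pd.DataFrame.from_dict(coocs, orient='index')

  # Align to the full vocabularies (globals cent_vocab and cont_vocab);
  # pairs that never co-occur get a count of 0
  coocdf = coocdf.reindex(index=cent_vocab, columns=cont_vocab).fillna(0).astype(int)

  return coocdf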

In [220]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

(1000, 1000)

### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 
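
For reference, pointwise mutual information compares how often two words co-occur with how often they would co-occur if they were independent: `PMI(x, y) = log(p(x, y) / (p(x) * p(y)))`. Estimating the probabilities from counts gives `log(c(x, y) * N / (c(x) * c(y)))`, where `N` is the total number of co-occurrences. The sketch below works through this on a hypothetical 2x2 table; the counts, the helper name `pmi` and the choice of returning 0 for pairs that never co-occur are assumptions made for this example.

```python
import math

# Hypothetical co-occurrence counts: rows are center words, columns are context words
counts = {
    'place': {'great': 8, 'clean': 2},
    'host':  {'great': 2, 'clean': 8},
}

N = sum(sum(row.values()) for row in counts.values())             # total count: 20
row_totals = {x: sum(row.values()) for x, row in counts.items()}  # c(x)
col_totals = {}
for row in counts.values():
    for y, c in row.items():
        col_totals[y] = col_totals.get(y, 0) + c                  # c(y)

def pmi(x, y):
    """PMI(x, y) = log(c(x, y) * N / (c(x) * c(y))); 0 when the pair never co-occurs."""
    c_xy = counts[x].get(y, 0)
    if c_xy == 0:
        return 0.0
    return math.log(c_xy * N / (row_totals[x] * col_totals[y]))

print(round(pmi('place', 'great'), 3))  # log(1.6), about 0.47
print(round(pmi('place', 'clean'), 3))  # log(0.4), about -0.916
```

A positive score means the pair co-occurs more often than chance would predict, and a negative score means it co-occurs less often.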

In [221]:
coocdf

Unnamed: 0,great,nice,recommend,clean,good,stay,comfortable,perfect,easy,quiet,...,responding,inmejorable,neighbour,white,keeping,contacted,fixed,supermarket,advice,francina
place,2807,1519,1916,1422,749,4127,757,1296,522,677,...,4,0,2,5,9,5,2,52,31,0
apartment,2541,1501,1022,1995,701,1967,1160,1187,556,807,...,0,0,10,6,3,5,4,100,30,0
location,3348,886,338,712,986,828,475,1976,663,832,...,4,0,4,0,1,4,0,94,8,0
amsterdam,83,56,32,20,27,117,13,40,29,18,...,0,0,1,0,0,0,0,4,1,0
stay,2659,1241,1880,653,631,18519,893,1115,367,278,...,6,0,8,6,8,8,6,26,66,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
frühstück,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
selection,35,16,2,11,21,5,9,5,1,9,...,0,0,0,0,1,0,0,1,0,0
ducks,6,7,0,4,3,5,2,4,0,12,...,0,0,0,1,0,0,0,2,0,0
ou,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [222]:
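def cooc2pmi(df):
  # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), with probabilities estimated from the counts
  counts = df.astype(float)
  N = counts.values.sum()            # total number of co-occurrences
  p_xy = counts / N                  # joint probability of each (center, context) pair
  p_x = counts.sum(axis=1) / N       # marginal probability of each center word (rows)
  p_y = counts.sum(axis=0) / N       # marginal probability of each context word (columns)

  # Probability of each pair under independence
  expected = pd.DataFrame(np.outer(p_x, p_y), index=df.index, columns=df.columns)
  with np.errstate(divide='ignore', invalid='ignore'):
    pmidf = np.log(p_xy / expected)

  # Pairs that never co-occur (log 0 = -inf) or have empty marginals (nan) get a score of 0
  pmidf = pmidf.replace([np.inf, -np.inf], np.nan).fillna(0)

  return pmidf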

In [None]:
pmidf = cooc2pmi(coocdf)
pmidf.shape

In [None]:
pmidf

In [None]:
# for name in cont_vocab:
#     if len(pmidf[name][pmidf[name] > 0]) > 0:
#         print(pmidf[name][pmidf[name] > 0 ])

### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `'towels'`), and (3) an optional named argument called `N` with a default value of 10; and returns a list of `N` strings, in descending order of their PMI score with the `center_word`. You do not need to handle cases in which `center_word` is not found in `df`. 
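
The expected behaviour can be illustrated on a toy PMI table. The sketch below assumes a hypothetical two-row DataFrame and a helper named `toy_topk`; it sorts the row of the center word with `sort_values` and returns the column labels rather than the scores.

```python
import pandas as pd

# Hypothetical PMI scores: rows are center words, columns are context words
toy_pmi = pd.DataFrame(
    {'fluffy': [2.1, 0.0], 'clean': [1.3, 0.4], 'kind': [0.0, 1.9]},
    index=['towels', 'host'],
)

def toy_topk(df, center_word, N=10):
    """Return the N context words with the highest PMI for center_word."""
    return df.loc[center_word].sort_values(ascending=False).head(N).index.tolist()

print(toy_topk(toy_pmi, 'towels', N=2))  # ['fluffy', 'clean']
```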

In [None]:
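def topk(df, center_word, N=10):
  # Sort the context words by their PMI with center_word (highest first) and return the top N words
  top_words = df.loc[center_word].astype(float).sort_values(ascending=False).head(N).index.tolist()
  return top_words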

In [None]:
topk(pmidf, 'place')

In [None]:
topk(pmidf, 'location')


# 3.b Ethical, social and legal implications



Local authorities in tourist hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent to ensure, among other things, that rent prices remain fair for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends that a new host set the price of their flat above the limit established by the local government's official regulations. Upon inspection, you realize that the inflated recommended price comes from many apartments in the area being offered only during an annual event that brings many tourists and causes prices to rise. 

In this context, critically reflect on the compliance of this recommender system with **one of the five actions** outlined in the **UK’s Data Ethics Framework**. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three **key principles** outlined in the Framework, namely _transparency_, _accountability_ and _fairness_. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support. 

Your report should be between 500 and 750 words long.  

### Your answer here. No Python, only Markdown.

Write your answer after the line.

---

...