# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked with discovering strong associations between concepts in Airbnb reviews. The starter code we provide in this notebook is for orientation only. The imports below are enough to implement a valid answer.

### Imports, data loading and helper functions

We first import pandas, NumPy and some useful nltk and collections modules, download the nltk resources we need, and then load the DataFrame of reviews.

In [240]:
import pandas as pd
from nltk.tag import pos_tag
from nltk import RegexpParser
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()

In [241]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andreas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Andreas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Andreas\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [193]:
# load stopwords
sw = set(stopwords.words('english'))

In [219]:
sw

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [194]:
basedir = os.getcwd()
df = pd.read_csv(os.path.join(basedir,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

In [195]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [196]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A function `process_reviews(df)` that will take as input the original dataframe and will return it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [197]:
def process_reviews(df):
  tokenized = []
  tagged = []
  lower_tagged = []

  mylen = len(df)
  for count, comment in enumerate(df.comments, start=1):
    # Split the raw review into word tokens and POS-tag them with their original casing
    token = word_tokenize(comment)
    tokenized.append(token)
    tagged.append(pos_tag(token))
    # Drop stopwords, lowercase the remaining tokens, then tag again.
    # The stopword test runs before lowercasing, so capitalised stopwords such as 'He' or 'I' are kept.
    lower_tagged.append(pos_tag([item.lower() for item in token if item not in sw]))

    # Log progress every 1000 reviews
    if count % 1000 == 0:
      print(f'{count} out of {mylen}')

  df['tokenized'] = tokenized
  df['tagged'] = tagged
  df['lower_tagged'] = lower_tagged
  return df

In [198]:
df = process_reviews(df[:50000])

1000 out of 50000
2000 out of 50000
3000 out of 50000
4000 out of 50000
5000 out of 50000
6000 out of 50000
7000 out of 50000
8000 out of 50000
9000 out of 50000
10000 out of 50000
11000 out of 50000
12000 out of 50000
13000 out of 50000
14000 out of 50000
15000 out of 50000
16000 out of 50000
17000 out of 50000
18000 out of 50000
19000 out of 50000
20000 out of 50000
21000 out of 50000
22000 out of 50000
23000 out of 50000
24000 out of 50000
25000 out of 50000
26000 out of 50000
27000 out of 50000
28000 out of 50000
29000 out of 50000
30000 out of 50000
31000 out of 50000
32000 out of 50000
33000 out of 50000
34000 out of 50000
35000 out of 50000
36000 out of 50000
37000 out of 50000
38000 out of 50000
39000 out of 50000
40000 out of 50000
41000 out of 50000
42000 out of 50000
43000 out of 50000
44000 out of 50000
45000 out of 50000
46000 out of 50000
47000 out of 50000
48000 out of 50000
49000 out of 50000
50000 out of 50000
A value is trying to be set on a copy of a slice from a Dat
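
The truncated last line of the log above matches the opening of pandas' `SettingWithCopyWarning`, which is typically raised when columns are assigned on a slice such as `df[:50000]`. A minimal sketch of one common way to avoid it, assuming an explicit `.copy()` of the slice is acceptable memory-wise (the toy `reviews` frame and the `n_chars` column are hypothetical):

```python
import pandas as pd

# Toy stand-in for the reviews DataFrame (hypothetical data)
reviews = pd.DataFrame({'comments': ['Nice place', 'Great host', 'Very clean']})

# Take an explicit copy of the first rows, so new columns are added
# to an independent DataFrame rather than to a slice of `reviews`
subset = reviews[:2].copy()
subset['n_chars'] = subset.comments.str.len()

print(subset)
print(reviews.columns.tolist())  # the original frame is left unchanged
```

In the notebook this would correspond to calling `process_reviews(df[:50000].copy())`.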

In [199]:
df

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,tokenized,tagged,lower_tagged
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...,"[Daniel, is, really, cool, ., The, place, was,...","[(Daniel, NNP), (is, VBZ), (really, RB), (cool...","[(daniel, NN), (really, RB), (cool, JJ), (., ...."
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...,"[Daniel, is, the, most, amazing, host, !, His,...","[(Daniel, NNP), (is, VBZ), (the, DT), (most, R...","[(daniel, NN), (amazing, VBG), (host, NN), (!,..."
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...,"[We, had, such, a, great, time, in, Amsterdam,...","[(We, PRP), (had, VBD), (such, JJ), (a, DT), (...","[(we, PRP), (great, JJ), (time, NN), (amsterda..."
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...,"[Very, professional, operation, ., Room, is, v...","[(Very, RB), (professional, JJ), (operation, N...","[(very, RB), (professional, JJ), (operation, N..."
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...,"[Daniel, is, highly, recommended, ., He, provi...","[(Daniel, NNP), (is, VBZ), (highly, RB), (reco...","[(daniel, NN), (highly, RB), (recommended, VBD..."
...,...,...,...,...,...,...,...,...,...
49995,995087,123196897,2016-12-28,29998110,Rodrigo,Nice place,"[Nice, place]","[(Nice, JJ), (place, NN)]","[(nice, JJ), (place, NN)]"
49996,995087,126479741,2017-01-12,33857593,Justine,This was a great place in a great location. Su...,"[This, was, a, great, place, in, a, great, loc...","[(This, DT), (was, VBD), (a, DT), (great, JJ),...","[(this, DT), (great, JJ), (place, NN), (great,..."
49997,995087,126927936,2017-01-15,55006052,Arron,Lovely little apartment in a quieter part of t...,"[Lovely, little, apartment, in, a, quieter, pa...","[(Lovely, RB), (little, JJ), (apartment, NN), ...","[(lovely, RB), (little, JJ), (apartment, NN), ..."
49998,995087,127556090,2017-01-19,64669388,Jake,"Great place, right next to Jordan. The apartme...","[Great, place, ,, right, next, to, Jordan, ., ...","[(Great, JJ), (place, NN), (,, ,), (right, RB)...","[(great, JJ), (place, NN), (,, ,), (right, JJ)..."


In [245]:
df.comments[0]

'Daniel is really cool. The place was nice and clean. Very quiet neighborhood. He had maps and a lonely planet guide book in the room for you to use. I didnt have any trouble finding the place from Central Station. I would defintely come back! Thanks!'

In [242]:
grammar = "CHUNK: {<JJ>*<NN.>+}" 
cp = RegexpParser(grammar)
parsed = cp.parse(df.tagged[0])
print(parsed)

(S
  (CHUNK Daniel/NNP)
  is/VBZ
  really/RB
  cool/JJ
  ./.
  The/DT
  place/NN
  was/VBD
  nice/JJ
  and/CC
  clean/JJ
  ./.
  Very/RB
  quiet/JJ
  neighborhood/NN
  ./.
  He/PRP
  had/VBD
  (CHUNK maps/NNS)
  and/CC
  a/DT
  lonely/JJ
  planet/NN
  guide/NN
  book/NN
  in/IN
  the/DT
  room/NN
  for/IN
  you/PRP
  to/TO
  use/VB
  ./.
  I/PRP
  didnt/VBP
  have/VBP
  any/DT
  trouble/NN
  finding/VBG
  the/DT
  place/NN
  from/IN
  (CHUNK Central/JJ Station/NNP)
  ./.
  I/PRP
  would/MD
  defintely/RB
  come/VB
  back/RB
  !/.
  (CHUNK Thanks/NNS)
  !/.)


In [244]:
for tree in parsed.subtrees():
    print(tree)

(S
  (CHUNK Daniel/NNP)
  is/VBZ
  really/RB
  cool/JJ
  ./.
  The/DT
  place/NN
  was/VBD
  nice/JJ
  and/CC
  clean/JJ
  ./.
  Very/RB
  quiet/JJ
  neighborhood/NN
  ./.
  He/PRP
  had/VBD
  (CHUNK maps/NNS)
  and/CC
  a/DT
  lonely/JJ
  planet/NN
  guide/NN
  book/NN
  in/IN
  the/DT
  room/NN
  for/IN
  you/PRP
  to/TO
  use/VB
  ./.
  I/PRP
  didnt/VBP
  have/VBP
  any/DT
  trouble/NN
  finding/VBG
  the/DT
  place/NN
  from/IN
  (CHUNK Central/JJ Station/NNP)
  ./.
  I/PRP
  would/MD
  defintely/RB
  come/VB
  back/RB
  !/.
  (CHUNK Thanks/NNS)
  !/.)
(CHUNK Daniel/NNP)
(CHUNK maps/NNS)
(CHUNK Central/JJ Station/NNP)
(CHUNK Thanks/NNS)
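
In the parse above, plain `NN` tokens such as `place` or `neighborhood` stay outside the chunks, while `NNP` and `NNS` tokens are chunked. This follows from the tag pattern `<NN.>`, where the `.` has to match exactly one character after `NN`. If the intent is to capture every noun tag, `<NN.*>` is the usual pattern. A minimal sketch comparing the two on a hand-tagged fragment (tags copied from the tagger output above, so no tagger model is needed; the helper `chunks` is hypothetical):

```python
from nltk import RegexpParser

# Hand-tagged fragment copied from the tagger output shown above
tagged = [('Very', 'RB'), ('quiet', 'JJ'), ('neighborhood', 'NN'), ('.', '.'),
          ('He', 'PRP'), ('had', 'VBD'), ('maps', 'NNS')]

def chunks(grammar, tagged_tokens):
    """Return the words of every CHUNK subtree found with the given grammar."""
    tree = RegexpParser(grammar).parse(tagged_tokens)
    return [[word for word, tag in subtree.leaves()]
            for subtree in tree.subtrees(lambda t: t.label() == 'CHUNK')]

strict = chunks("CHUNK: {<JJ>*<NN.>+}", tagged)   # grammar used in the cell above
loose = chunks("CHUNK: {<JJ>*<NN.*>+}", tagged)   # also matches the bare NN tag

print(strict)
print(loose)
```

With the looser pattern the adjective and noun in "quiet neighborhood" end up in one chunk, which is closer to a noun-phrase extractor.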


In [233]:
df.lower_tagged[0]

[('daniel', 'NN'),
 ('really', 'RB'),
 ('cool', 'JJ'),
 ('.', '.'),
 ('the', 'DT'),
 ('place', 'NN'),
 ('nice', 'JJ'),
 ('clean', 'NN'),
 ('.', '.'),
 ('very', 'RB'),
 ('quiet', 'JJ'),
 ('neighborhood', 'NN'),
 ('.', '.'),
 ('he', 'PRP'),
 ('maps', 'VBZ'),
 ('lonely', 'RB'),
 ('planet', 'JJ'),
 ('guide', 'NN'),
 ('book', 'NN'),
 ('room', 'NN'),
 ('use', 'NN'),
 ('.', '.'),
 ('i', 'NN'),
 ('didnt', 'VBP'),
 ('trouble', 'NN'),
 ('finding', 'VBG'),
 ('place', 'JJ'),
 ('central', 'JJ'),
 ('station', 'NN'),
 ('.', '.'),
 ('i', 'NN'),
 ('would', 'MD'),
 ('defintely', 'RB'),
 ('come', 'VB'),
 ('back', 'RB'),
 ('!', '.'),
 ('thanks', 'NNS'),
 ('!', '.')]
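
The list above still contains `he` and `i`, both of which appear in the stopword set printed earlier. In `process_reviews` the test `item not in sw` runs on the original token, before it is lowercased, so `He` and `I` pass the filter. A minimal sketch of the difference, using a tiny stand-in stopword set (an assumption for illustration, not the nltk list):

```python
# Tiny stand-in for the nltk stopword set (assumption for illustration)
stopwords_demo = {'the', 'he', 'i', 'was', 'and', 'a', 'had'}
tokens = ['He', 'had', 'maps', 'and', 'a', 'guide', 'book']

# Filter on the original casing: 'He' is not equal to 'he', so it survives
case_sensitive = [t.lower() for t in tokens if t not in stopwords_demo]

# Filter on the lowercased form instead
case_insensitive = [t.lower() for t in tokens if t.lower() not in stopwords_demo]

print(case_sensitive)
print(case_insensitive)
```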

In [200]:
# num = 0
# print(len(df.tagged[num]), len(set(df.lower_tagged[num])))
# list(set(df.lower_tagged[num]))

### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 

In [201]:
def get_vocab(df):
  cent_counts, cont_counts = Counter(), Counter()

  for review in df.lower_tagged:
    for word, tag in review:
      # Center words: any noun tag (NN, NNS, NNP, NNPS)
      if tag[0] == 'N':
        cent_counts[word] += 1
      # Context words: adjectives (JJ*) or verbs (VB*)
      elif tag[0] in ('J', 'V'):
        cont_counts[word] += 1

  # Keep the 1,000 most frequent words of each kind
  cent_vocab = [word for word, _ in cent_counts.most_common(1000)]
  cont_vocab = [word for word, _ in cont_counts.most_common(1000)]

  return cent_vocab, cont_vocab

In [202]:
cent_vocab, cont_vocab = get_vocab(df)

### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

In [203]:
comments = df.comments
type(comments)

pandas.core.series.Series

In [204]:
comments[:5]

0    Daniel is really cool. The place was nice and ...
1    Daniel is the most amazing host! His place is ...
2    We had such a great time in Amsterdam. Daniel ...
3    Very professional operation. Room is very clea...
4    Daniel is highly recommended.  He provided all...
Name: comments, dtype: object

In [205]:
def get_coocs(df, cent_vocab, cont_vocab):
  # Context = one sentence, approximated by splitting each review on '.'
  sentences = []
  for comment in df.comments:
    sentences.extend(comment.split('.'))

  coocs = {}
  for count, center_word in enumerate(cent_vocab, start=1):
    words = []
    for sentence in sentences:
      # Substring test on the raw sentence, so 'art' also matches 'apartment'
      if center_word in sentence:
        words_in_sentence = word_tokenize(sentence)
        words.extend([word for word in words_in_sentence if word in cont_vocab])

    coocs[center_word] = dict(Counter(words))
    print(f'{count} out of 1000')

  return coocs

In [217]:
def get_coocs(df, cent_vocab, cont_vocab):
  # Context = one sentence, approximated by splitting each review on '.'
  sentences = []
  for comment in df.comments:
    sentences.extend(comment.split('.'))

  # For every center word, keep the sentences that contain it as a substring
  sentences_per_center_word = {center_word : Filter(sentences, [center_word]) for center_word in cent_vocab}

  # `words` is created once and never cleared, so the Counter stored for each
  # center word also includes the context words collected for the earlier ones
  words = []
  coocs = {}

  diff = 0  # total elapsed time in seconds
  for count, (center_word, center_sentences) in enumerate(sentences_per_center_word.items(), start=1):
    start = pd.to_datetime('today')
    for sentence in center_sentences:
      for word in word_tokenize(sentence):
        if word in cont_vocab:
          words.append(word)

    coocs[center_word] = dict(Counter(words))
    end = pd.to_datetime('today')
    diff += (end-start).total_seconds()
    print(f'{(end-start).total_seconds()} seconds')
    if count % 20 == 0 : print(f'{count} center_word out of {len(sentences_per_center_word)} in {diff/60} minutes')

  print(diff/60)
  return coocs

def Filter(strings, substrs):
    # Keep the strings that contain at least one of the given substrings
    return [s for s in strings if any(sub in s for sub in substrs)]

In [218]:
coocs = get_coocs(df, cent_vocab, cont_vocab)


1.519151 seconds
1.654741 seconds
500 center_word out of 1000 in 29.129968200000004 minutes
1.632296 seconds
1.622884 seconds
1.704639 seconds
1.684264 seconds
1.519414 seconds
1.713175 seconds
1.684989 seconds
1.699186 seconds
1.7146 seconds
1.663184 seconds
1.662913 seconds
1.833519 seconds
1.657222 seconds
1.708804 seconds
1.882001 seconds
1.699051 seconds
1.632215 seconds
2.129128 seconds
1.97898 seconds
1.717782 seconds
520 center_word out of 1000 in 29.705638966666683 minutes
2.072663 seconds
1.704005 seconds
3.262991 seconds
2.118787 seconds
1.627657 seconds
1.894566 seconds
1.539467 seconds
1.642601 seconds
1.515944 seconds
3.606016 seconds
3.710903 seconds
1.706972 seconds
6.752227 seconds
13.837868 seconds
2.048311 seconds
1.61453 seconds
1.820453 seconds
16.395811 seconds
1.773465 seconds
1.640654 seconds
540 center_word out of 1000 in 30.910403816666687 minutes
1.691812 seconds
1.711942 seconds
1.719634 seconds
1.757735 seconds
1.711169 seconds
1.718103 seconds
1.590241 se
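
In the second `get_coocs` above, the `words` list is created once, before the loop over center words, so the counts stored for one center word carry over into the next. The sentences are also matched by substring and with their original casing, while the vocabularies are lowercased. A minimal sketch of a sentence-level count that avoids both, assuming regex-based sentence and word splitting as a rough stand-in for nltk's tokenizers (the helper name `sentence_coocs` and the toy reviews are hypothetical):

```python
import re
from collections import Counter, defaultdict

def sentence_coocs(reviews, cent_vocab, cont_vocab):
    """Count, per center word, the context words that share a sentence with it."""
    cent_set, cont_set = set(cent_vocab), set(cont_vocab)  # constant-time membership tests
    coocs = defaultdict(Counter)
    for review in reviews:
        # Rough sentence split on '.', '!' and '?' (stand-in for sent_tokenize)
        for sentence in re.split(r'[.!?]+', review.lower()):
            # Rough word split (stand-in for word_tokenize)
            tokens = re.findall(r"[a-z']+", sentence)
            # A center word repeated within one sentence is counted once
            centers = {t for t in tokens if t in cent_set}
            contexts = [t for t in tokens if t in cont_set]
            for center in centers:
                # Each center word has its own Counter; a word never co-occurs with itself
                coocs[center].update(c for c in contexts if c != center)
    return {center: dict(counts) for center, counts in coocs.items()}

reviews = ["The apartment was clean. Great location!",
           "Clean apartment and great host."]
result = sentence_coocs(reviews, ['apartment', 'location', 'host'], ['clean', 'great'])
print(result)
```

Because each sentence is tokenized once and looked up against sets, the work grows with the number of tokens rather than with the number of center words times the number of sentences.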

### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe

What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, each column corresponds to one context word, and each cell holds the corresponding co-occurrence count. Some (x,y) pairs will never co-occur; those cells should hold a 0.

In [220]:
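def cooc_dict2df(coocs):
  # Rows are center words, columns are context words (both taken from the global vocab lists)
  coocdf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

  for center_word in cent_vocab:
    for word in cont_vocab:
      # Pairs that never co-occur get a count of 0.
      # .loc writes to the frame itself; chained indexing (coocdf[word][index] = ...) may write to a copy.
      coocdf.loc[center_word, word] = coocs.get(center_word, {}).get(word, 0)

  return coocdf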

In [221]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

(1000, 1000)
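
The nested dictionary can also be turned into a table without explicit loops. A minimal sketch, assuming `pandas.DataFrame.from_dict` with `orient='index'` followed by `reindex` to force the vocabulary order and add the missing rows and columns (the toy dictionary and vocabularies are hypothetical):

```python
import pandas as pd

# Toy co-occurrence dictionary and vocabularies (hypothetical data)
coocs = {'apartment': {'clean': 2, 'great': 1}, 'location': {'great': 1}}
cent_vocab = ['apartment', 'location', 'host']
cont_vocab = ['great', 'clean', 'quiet']

coocdf = (pd.DataFrame.from_dict(coocs, orient='index')   # outer keys become rows
            .reindex(index=cent_vocab, columns=cont_vocab)  # add unseen words as NaN
            .fillna(0)                                      # pairs that never co-occur
            .astype(int))
print(coocdf)
```

The result has an integer dtype, which makes later row and column sums cheaper than on an object-typed frame.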

### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 
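
For reference, the standard definition of pointwise mutual information written in terms of the counts in the table. The notation is introduced here and is not from the assignment text: $c(x, y)$ is the cell for center word $x$ and context word $y$, $c(x)$ is the row sum, $c(y)$ is the column sum, and $N$ is the sum of all cells.

$$
\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)} = \log \frac{c(x, y)\,N}{c(x)\,c(y)}
$$

Since $c(x, y) \le c(x)$ and $c(x, y) \le c(y)$, the score can never exceed $\log N$.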

In [222]:
coocdf

Unnamed: 0,great,nice,recommend,i,stay,clean,good,comfortable,perfect,easy,...,carolina,meant,says,present,italian,leo,riding,stanza,chez,historical
place,3656,2033,2356,168,4538,1658,1111,883,1517,704,...,0,5,10,3,1,0,6,0,9,7
apartment,6540,3768,3558,268,6806,3842,1918,2164,2849,1320,...,0,15,20,10,1,0,15,0,10,16
location,11058,5044,4054,327,7955,4926,3253,2874,5246,2209,...,0,26,29,12,2,0,17,0,12,24
amsterdam,11170,5126,4098,402,8124,4958,3293,2896,5303,2242,...,0,27,29,12,2,0,17,4,13,24
stay,15003,6863,7277,658,28511,5930,4173,4115,6780,2754,...,0,40,36,26,3,0,28,4,14,31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
pressure,400771,250074,190370,27936,376654,207489,152617,175706,162264,121731,...,49,2670,1405,1800,407,42,1891,3178,10605,1737
heater,400781,250085,190372,27936,376659,207501,152623,175718,162267,121731,...,49,2670,1405,1800,407,42,1891,3178,10605,1737
sido,400781,250085,190372,27936,376659,207501,152623,175718,162267,121731,...,49,2670,1405,1800,407,42,1891,3178,10605,1737
accueillant,400781,250085,190372,27936,376659,207501,152623,175718,162267,121731,...,49,2670,1405,1800,407,42,1891,3178,10621,1737


In [223]:
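def cooc2pmi(df):
  pmidf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

  # N: sum of every cell in the co-occurrence table
  N = 0
  for index, row in df.iterrows():
    N += sum(row)

  for index, row in df.iterrows():
    for word in cont_vocab:
      try:
        # df[index][word] reads column `index`, row `word`; any error raised here
        # (e.g. a KeyError when the word is missing on that axis) is stored as 0 below
        pmi = df[index][word] / (sum(df[word])/N / sum(row)/N)
        if pmi == 0:
          pmidf.loc[index, word] = 0
        else:
          pmidf.loc[index, word] = np.log(pmi)
        # Debug output: prints every stored score
        print(pmidf.loc[index, word])
      except Exception:
        pmidf.loc[index, word] = 0

  return pmidf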

In [224]:
pmidf = cooc2pmi(coocdf)
pmidf.shape

8101439
55.22074979892282
55.82637473258414
56.550586273069676
57.54856320443356
60.83174072142332
56.60603567683757
61.55854788937479
56.771784602267225
57.72186380510572
57.681212267876816
56.95649123448234
61.569832022645826
59.523287206035
59.51784850142511
54.926495167273636
61.69272961225518
57.0320179671697
55.40612081101506
56.901164418188635
54.86801946947991
56.46048992565462
56.01334098341999
55.113981366318754
60.58151095146721
60.99012441201705
55.86525509034819
54.89668292687646
60.90659809602264
55.801657879897114
61.8696765841769
57.53053577906022
58.11807609920102
56.03386676208725
57.02366322885557
57.87395683465342
60.75812945267364
56.093032479200716
57.42611412697003
59.48545873297607
54.74141195502379
55.51798534889082
60.294529052923025
56.20194509486683
55.86498346156545
57.165760532454975
57.305851018009946
57.43610700471433
57.221296885118484
57.36287699219627
47.92323192258999
56.463353196198504
56.30704216643832
58.76713960588715
51.67326213870561
56.0466553

(1000, 1000)

In [225]:
pmidf

Unnamed: 0,great,nice,recommend,i,stay,clean,good,comfortable,perfect,easy,...,carolina,meant,says,present,italian,leo,riding,stanza,chez,historical
place,0,51.599916,52.116041,52.611155,49.478409,50.841277,52.989976,52.291475,51.111314,53.154812,...,61.071177,0,0,0,0,61.204768,0,0,55.479375,0
apartment,0,52.306652,52.817852,53.235248,50.068215,51.490355,53.76481,52.936539,51.759362,53.931479,...,61.846994,0,0,0,0,61.974464,0,0,56.191463,0
location,0,52.354964,52.866153,53.375275,50.199758,51.701556,53.812099,53.003022,51.989169,53.974296,...,61.891878,0,0,0,0,62.017871,0,0,56.315407,0
amsterdam,0,49.44907,50.047321,50.468474,47.373178,48.74596,50.956758,50.169619,49.012455,51.115597,...,59.037591,0,0,0,0,59.172759,0,0,53.425961,0
stay,0,52.87687,53.415004,54.017774,50.887438,52.211606,54.309381,53.534968,52.476751,54.470305,...,62.391166,0,0,0,0,62.528456,0,0,56.765624,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
pressure,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
heater,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sido,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
accueillant,0,52.267905,52.788059,52.248392,46.027002,50.887724,53.669405,52.877651,51.130479,53.799439,...,61.747348,0,0,0,0,61.873993,0,0,56.095315,0
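
PMI computed from counts is bounded above by the log of the total count, so scores of 50 to 60 as in the table above would need a total on the order of e^50, roughly 10^21 co-occurrences. That suggests the expression in `cooc2pmi` is not computing the intended ratio. Two details stand out: `df[index][word]` looks up column `index` and row `word`, which is transposed relative to the table, and `sum(df[word])/N / sum(row)/N` is evaluated left to right rather than as a product of two marginal probabilities. A minimal vectorised sketch of the standard definition on a toy table, keeping the notebook's convention of storing 0 for pairs that never co-occur (the name `pmi_table` and the data are hypothetical):

```python
import numpy as np
import pandas as pd

def pmi_table(cooc):
    """PMI(x, y) = log(c(x, y) * N / (c(x) * c(y))); cells with a zero count are set to 0."""
    counts = cooc.astype(float)
    total = counts.values.sum()                        # N: all co-occurrences
    row_sums = counts.sum(axis=1)                      # c(x): one value per center word
    col_sums = counts.sum(axis=0)                      # c(y): one value per context word
    expected = np.outer(row_sums, col_sums) / total    # counts expected under independence
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(counts.values / expected)
    pmi[~np.isfinite(pmi)] = 0.0                       # zero counts and empty rows or columns
    return pd.DataFrame(pmi, index=cooc.index, columns=cooc.columns)

toy = pd.DataFrame([[8, 2], [2, 8]],
                   index=['coffee', 'bed'], columns=['hot', 'comfortable'])
toy_pmi = pmi_table(toy)
print(toy_pmi.round(3))
```

In the toy table every cell has an expected count of 5, so the pair seen 8 times gets a positive score and the pair seen twice gets a negative one.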


In [226]:
for name in cont_vocab:
    if len(pmidf[name][pmidf[name] > 0]) > 0:
        print(pmidf[name][pmidf[name] > 0 ])

erdam      51.726665
stay           55.158133
                 ...    
amy            51.084542
quartier       56.161291
frühstück       50.51307
ou             55.164231
accueillant    54.561656
Name: supermarket, Length: 389, dtype: object
place          56.429806
apartment      57.169486
location       57.266622
amsterdam      54.383708
stay           57.720483
                 ...    
amy            53.779389
quartier       58.763294
frühstück       53.36005
ou             57.822649
accueillant    57.098025
Name: bnb, Length: 389, dtype: object
place          54.792367
apartment      55.502659
location       55.521297
amsterdam      52.727761
stay           56.058235
                 ...    
amy            52.066332
quartier        57.09383
frühstück      51.598203
ou               56.1216
accueillant    55.389907
Name: par, Length: 389, dtype: object
place          56.042225
apartment       56.81895
location       56.862095
amsterdam      54.003222
stay           57.358579
       

### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `'towels'`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`.

In [227]:
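def topk(df, center_word, N=10):
  # Collect the PMI score of every context word for `center_word` and keep the N largest;
  # the returned list holds the scores themselves, not the context words
  top_words = sorted([df.loc[center_word, word] for word in cont_vocab], reverse=True)[:N]
  return top_words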

In [228]:
topk(pmidf, 'coffee')

[63.4810461965441,
 63.249741460458985,
 62.656342134676464,
 62.502734531458294,
 62.364350607187795,
 62.179061170939555,
 62.1467580070075,
 62.01003845182203,
 61.96889371534266,
 61.87466107181496]
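
The call above returns ten scores, whereas the task asks for a list of `N` strings. A minimal sketch that returns the context words instead, assuming the row for `center_word` can be converted to numbers and ranked with `Series.nlargest` (the name `topk_words` and the toy scores are hypothetical):

```python
import pandas as pd

def topk_words(df, center_word, N=10):
    """Return the N context words with the highest score in the row of `center_word`."""
    # The PMI frame may have an object dtype, so convert the row to numbers first
    row = pd.to_numeric(df.loc[center_word])
    return row.nlargest(N).index.tolist()

# Toy PMI row (hypothetical scores)
toy = pd.DataFrame([[0.47, -0.92, 0.10]],
                   index=['coffee'], columns=['hot', 'comfortable', 'fresh'])
top2 = topk_words(toy, 'coffee', N=2)
print(top2)
```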

# 3.b Ethical, social and legal implications



Local authorities in tourist hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent to, among other goals, ensure that rents stay fair for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends that a new host set the price of their flat above the limit established by the local government's official regulations. Upon inspection, you realize that the inflated recommendation comes from many apartments in the area being offered only during an annual event that brings many tourists and causes prices to rise.

In this context, critically reflect on the compliance of this recommender system with **one of the five actions** outlined in the **UK’s Data Ethics Framework**. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three **key principles** outlined in the Framework, namely _transparency_, _accountability_ and _fairness_. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support. 

Your report should be between 500 and 750 words long.  

### Your answer here. No Python, only Markdown.

Write your answer after the line.

---

...