# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked to discover strong associations between concepts in Airbnb reviews. The starter code we provide in this notebook is for orientation only. The below imports are enough to implement a valid answer.

### Imports, data loading and helper functions

We first connect our google drive, import pandas, numpy and some useful nltk and collections modules, then load the dataframe and define a function for printing the current time, useful to log our progress in some of the tasks.

In [191]:
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()

In [192]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andreas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Andreas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Andreas\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [193]:
# load stopwords
sw = set(stopwords.words('english'))

In [194]:
basedir = os.getcwd()
df = pd.read_csv(os.path.join(basedir,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

In [195]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [196]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A `function process_reviews(df)` that will take as input the original dataframe and will return it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [197]:
def process_reviews(df):
  tokenized = []
  tagged = []
  lower_tagged = []

  mylen = len(df)
  count = 0
  for index, row in df.iterrows():
    token = word_tokenize(row.comments)
    tokenized.append(token)
    tagged.append(pos_tag(token))
    # lower_tagged.append(list(set(pos_tag([item.lower() for item in token]))))
    lower_tagged.append(pos_tag([item.lower() for item in token if item not in sw]))
    count += 1
    
    if count % 1000 == 0:      
      print(f'{count} out of {mylen}')

  df['tokenized'] = tokenized
  df['tagged'] = tagged
  df['lower_tagged'] = lower_tagged


  # df['tokenized'] = [ word_tokenize(row.comments) for index, row in df.iterrows()]
  # print('Tokenizing done!\n')
  # df['tagged'] = [pos_tag(row.tokenized) for index, row in df.iterrows()]
  # print('Tagging done!\n')
  # df['lower_tagged'] = list(set([pos_tag([item.lower() for item in row.tokenized]) for index, row in df.iterrows()]))
  # print('Lower tagging done!\n')
  return df

In [148]:
df = process_reviews(df[:50000])

1000 out of 5000
2000 out of 5000
3000 out of 5000
4000 out of 5000
5000 out of 5000
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['tokenized'] = tokenized
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['tagged'] = tagged
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['lower_tagged'] = lower_tagged


In [149]:
df

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,tokenized,tagged,lower_tagged
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...,"[Daniel, is, really, cool, ., The, place, was,...","[(Daniel, NNP), (is, VBZ), (really, RB), (cool...","[(daniel, NN), (really, RB), (cool, JJ), (., ...."
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...,"[Daniel, is, the, most, amazing, host, !, His,...","[(Daniel, NNP), (is, VBZ), (the, DT), (most, R...","[(daniel, NN), (amazing, VBG), (host, NN), (!,..."
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...,"[We, had, such, a, great, time, in, Amsterdam,...","[(We, PRP), (had, VBD), (such, JJ), (a, DT), (...","[(we, PRP), (great, JJ), (time, NN), (amsterda..."
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...,"[Very, professional, operation, ., Room, is, v...","[(Very, RB), (professional, JJ), (operation, N...","[(very, RB), (professional, JJ), (operation, N..."
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...,"[Daniel, is, highly, recommended, ., He, provi...","[(Daniel, NNP), (is, VBZ), (highly, RB), (reco...","[(daniel, NN), (highly, RB), (recommended, VBD..."
...,...,...,...,...,...,...,...,...,...
4995,68290,407172587,2019-02-01,88938296,Thomas,Lovely room in a perfect location. Manuel was ...,"[Lovely, room, in, a, perfect, location, ., Ma...","[(Lovely, RB), (room, NN), (in, IN), (a, DT), ...","[(lovely, RB), (room, NN), (perfect, JJ), (loc..."
4996,68290,409015147,2019-02-06,17645610,Ryan,"Nice apartment, walking distance from anything...","[Nice, apartment, ,, walking, distance, from, ...","[(Nice, NNP), (apartment, NN), (,, ,), (walkin...","[(nice, JJ), (apartment, NN), (,, ,), (walking..."
4997,68290,409587602,2019-02-08,8979137,Krishna,Very nice,"[Very, nice]","[(Very, RB), (nice, JJ)]","[(very, RB), (nice, JJ)]"
4998,68290,410538173,2019-02-10,15278842,Theo,"A stylish room, with a clean shared bathroom, ...","[A, stylish, room, ,, with, a, clean, shared, ...","[(A, DT), (stylish, JJ), (room, NN), (,, ,), (...","[(a, DT), (stylish, JJ), (room, NN), (,, ,), (..."


In [150]:
# num = 0
# print(len(df.tagged[num]), len(set(df.lower_tagged[num])))
# list(set(df.lower_tagged[num]))

### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 

In [152]:
def get_vocab(df):
  cent_list, cont_list = [], []

  for review in df.lower_tagged:
    cent_list.extend([word for word in [list_of_words[0] for list_of_words in review if list_of_words[1][0] == 'N']])
    cont_list.extend([word for word in [list_of_words[0] for list_of_words in review if (list_of_words[1][0] == 'J') or (list_of_words[1][0] == 'V')]])
    
  cent_dict = Counter(cent_list)
  cont_dict = Counter(cont_list)

  cent_vocab = [key for key, value in sorted(cent_dict.items(), key=lambda item: item[1], reverse=True)][:1000]
  cont_vocab = [key for key, value in sorted(cont_dict.items(), key=lambda item: item[1], reverse=True)][:1000]

  return cent_vocab, cont_vocab

In [153]:
cent_vocab, cont_vocab = get_vocab(df)

### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

In [154]:
comments = df.comments
type(comments)

pandas.core.series.Series

In [155]:
comments[:5]

0    Daniel is really cool. The place was nice and ...
1    Daniel is the most amazing host! His place is ...
2    We had such a great time in Amsterdam. Daniel ...
3    Very professional operation. Room is very clea...
4    Daniel is highly recommended.  He provided all...
Name: comments, dtype: object

In [156]:
def get_coocs(df, cent_vocab, cont_vocab):
  sentences = []
  comments = df.comments

  for comment in comments:
    sentences.extend([sentence for sentence in comment.split('.')])
  
  print('yolo')
  # print(sentences)
  
  coocs = {}

  count = 0
  for center_word in cent_vocab:
    count += 1
    words = []
    for sentence in sentences:
      if center_word in sentence:
        words_in_sentence = word_tokenize(sentence)
        words.extend([word for word in words_in_sentence if word in cont_vocab])
    
    center_word_dict = dict(Counter(words))
    coocs[center_word] = center_word_dict
    print(f'{count} out of 1000')
    
  # cent_dict = Counter(cent_list)
  # cont_dict = Counter(cont_list)

  
  return coocs  

In [189]:
def get_coocs(df, cent_vocab, cont_vocab):
  sentences = []
  comments = df.comments

  for comment in comments:
    sentences.extend([sentence for sentence in comment.split('.')])
  
  print('yolo')
  # print(sentences)

  sentences_per_center_word = {center_word : Filter(sentences, [center_word]) for center_word in cent_vocab}
  
  print('swag')

  words = []
  coocs = {}

  count = 0
  count2 = 0
  diff = 0
  for center_word, sentences in sentences_per_center_word.items():
    count += 1
    start = pd.to_datetime('today')
    if count % 100 == 0 : print(f'{count} center_word out of {len(sentences_per_center_word)}')
    # print(value)
    count2 = 0
    for sentence in sentences:
      # print(sentence)
      # break
      count2 += 1
      # if count2 % 10 == 0 : print(f'{count2} sentence out of {len(sentences)} sentences')
      for word in word_tokenize(sentence):
        # print(word)
        # break
        if word in cont_vocab:
          words.append(word)   
           
        # break
    #   if count == 10:
    #     break
    # print(center_word)
    # print(words)
    # print(Counter(words))
    coocs[center_word] = dict(Counter(words))
    end = pd.to_datetime('today')
    diff += (end-start).total_seconds()
    print(f'{(end-start).total_seconds()} seconds')
    # break

  # coocs = {key: dict(Counter([word for word in word_tokenize(value) if word in cont_vocab])) for key, value in sentences_per_center_word.items()}


  #   coocs = {}

  #   count = 0
  #   for center_word in cent_vocab:
  #     count += 1
  #     words = []
  #     for sentence in Filter(sentences, [center_word]):
  #         words_in_sentence = word_tokenize(sentence)
  #         words.extend([word for word in words_in_sentence if word in cont_vocab])
      
  #     center_word_dict = dict(Counter(words))
  #     coocs[center_word] = center_word_dict
  #     print(f'{count} out of 1000')
    
  # cent_dict = Counter(cent_list)
  # cont_dict = Counter(cont_list)

  print(diff/60)
  return coocs 

def Filter(string, substr):
    return [str for str in string if any(sub in str for sub in substr)]

In [190]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

yolo
swag
0.631205 seconds
0.737719 seconds
0.709 seconds
5.833961 seconds
0.038998 seconds
0.629261 seconds
0.539781 seconds
0.902788 seconds
0.352984 seconds
0.402381 seconds
0.381394 seconds
0.357131 seconds
0.022939 seconds
0.227973 seconds
0.23102 seconds
0.201671 seconds
0.187417 seconds
0.17484 seconds
0.437872 seconds
0.118757 seconds
0.159628 seconds
0.063006 seconds
0.023152 seconds
0.260326 seconds
0.305711 seconds
0.423071 seconds
0.024024 seconds
0.407012 seconds
0.107768 seconds
0.202017 seconds
0.157036 seconds
0.322566 seconds
0.171037 seconds
0.158875 seconds
0.025336 seconds
0.175639 seconds
0.152005 seconds
2.080956 seconds
0.025932 seconds
0.234073 seconds
0.132123 seconds
0.119124 seconds
0.124922 seconds
0.139154 seconds
0.219892 seconds
0.090861 seconds
0.105056 seconds
0.131188 seconds
0.154019 seconds
0.193969 seconds
0.115412 seconds
0.183933 seconds
0.082597 seconds
0.14123 seconds
0.370242 seconds
0.159695 seconds
0.170637 seconds
0.187526 seconds
0.189917 s

KeyboardInterrupt: 

### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 

In [160]:
def cooc_dict2df(coocs):
  coocdf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

  for index, row in coocdf.iterrows():
    for word in cont_vocab:
      try:
        coocdf[word][index] = coocs[index][word]
      except: 
        coocdf[word][index] = 0

  return coocdf

In [161]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

(1000, 1000)

### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 

In [162]:
coocdf

Unnamed: 0,great,i,nice,recommend,clean,stay,good,comfortable,perfect,easy,...,deux,parking,edwins,pointed,von,sie,nightlife,floating,estación,todas
location,532,11,172,56,153,147,144,105,298,110,...,0,0,0,1,0,0,2,2,0,0
place,876,27,371,286,349,576,264,197,468,179,...,1,2,0,3,0,0,5,5,0,0
room,1093,48,606,327,787,690,375,469,592,204,...,1,2,0,4,0,0,6,5,0,0
i,3050,196,1759,1209,1815,2593,1067,1219,1389,722,...,13,9,0,11,41,9,28,11,9,10
amsterdam,3068,205,1766,1212,1819,2608,1070,1220,1396,724,...,13,9,0,12,41,9,28,11,9,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
items,40645,3367,22900,17579,22681,36054,15308,19572,17269,10871,...,339,174,0,165,978,192,537,179,238,277
erreichen,40645,3367,22900,17579,22681,36054,15308,19572,17269,10871,...,339,174,0,165,979,192,537,179,238,277
peace,40648,3367,22909,17582,22684,36061,15308,19573,17272,10873,...,339,175,0,165,979,192,537,179,238,277
reply,40651,3367,22911,17583,22685,36066,15309,19574,17272,10877,...,339,175,0,165,979,192,537,179,238,277


In [163]:
def cooc2pmi(df):
  pmidf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

  N = 0
  for index, row in coocdf.iterrows():
    N += sum(row)

  for index, row in coocdf.iterrows():
    for word in cont_vocab:
      try:
        pmi = df[index][word] / (sum(df[word])/N / sum(row)/N)
        pmidf[word][index] = np.log([pmi])[0] 
        print(pmidf[word][index])
      except: 
        pmidf[word][index] = 0
      
  return pmidf

In [164]:
pmidf = cooc2pmi(coocdf)
pmidf.shape

13839276088
49.43352831508158
48.51716309188095
48.02243555769455
48.81936523663623
47.49957972172697
49.627353397802814
52.43252622485918
49.654778278522606
48.77395707937862
49.91298008045673
49.64057145451084
50.327883304079016
44.1035595270316
49.06527791581271
49.2101860684666
48.24800187309929
49.05714034776801
50.98351281584351
46.87238494318265
47.68112067300175
48.890198702803836
49.41163188318831
51.799408195311266
48.76048532541118
48.92642348418904
50.23370520149433
50.237521750148446
49.48085154896183
48.827106850616005
47.27079364704947
49.01512104647231
49.803892837944645
49.624390892336756
49.88225188373172
48.014602325434886
47.76585000914108
48.19508336205154
49.9300465327885
48.46366411969948
47.92885895158347
49.492112023569156
40.723820431109715
50.03343934024457
48.58617002572549
49.195372443465956
46.994341405565855
48.71750344387815
50.4476681683933
49.36889518148224
46.23946771612914
50.13245082591983
49.41809038456698
49.04953426101078
49.70867945808333
50.105

(1000, 1000)

In [165]:
pmidf

Unnamed: 0,great,i,nice,recommend,clean,stay,good,comfortable,perfect,easy,...,deux,parking,edwins,pointed,von,sie,nightlife,floating,estación,todas
location,0,45.180797,44.375139,45.026203,43.61067,42.905355,45.7187,44.902137,43.922751,46.130504,...,0,0,0,0,48.440549,0,49.128603,0,0,0
place,0,45.946325,45.23051,45.792725,44.372281,43.698008,46.491337,45.658222,44.666308,46.905564,...,0,0,0,0,49.203981,0,49.905718,0,0,0
room,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
i,0,44.565392,44.231157,44.872558,43.176817,42.409949,45.6714,44.732913,43.462504,46.089907,...,0,0,0,0,48.350932,0,49.089662,0,0,0
amsterdam,0,43.973859,43.706975,44.280488,42.910777,42.229721,45.077007,44.149577,43.20346,45.477955,...,0,0,0,0,47.79724,0,48.476691,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
items,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
erreichen,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
peace,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
reply,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [166]:
for name in cont_vocab:
    if len(pmidf[name][pmidf[name] > 0]) > 0:
        print(pmidf[name][pmidf[name] > 0 ])

: 358, dtype: object
location     48.233414
place        49.004442
i            48.129812
amsterdam    47.593908
host         50.214368
               ...    
upstairs     49.084825
quartier     50.498374
peaceful     49.680463
wish         49.380323
key          49.624527
Name: par, Length: 358, dtype: object
location     48.399269
place        49.168481
i            48.274225
amsterdam    47.702455
host          50.38418
               ...    
upstairs     49.252575
quartier     50.651183
peaceful     49.842573
wish          49.49938
key          49.771781
Name: talk, Length: 358, dtype: object
location     48.757189
place        49.520231
i            48.674065
amsterdam    48.112023
host         50.735851
               ...    
upstairs     49.608913
quartier     51.057295
peaceful     50.255531
wish         49.902544
key          50.162246
Name: sauber, Length: 358, dtype: object
location     49.330822
place        50.113736
i            49.326288
amsterdam    48.692419
host      

### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `‘towels’`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`. 

In [167]:
def topk(df, center_word, N=10):
  top_words = sorted([df[word][center_word] for word in cont_vocab], reverse=True)[:N]
  return top_words

In [168]:
topk(pmidf, 'coffee')

[53.70215805578728,
 53.6969072153563,
 53.418036823596545,
 53.36655713877299,
 53.250804845381595,
 52.894356807022874,
 52.64918014661928,
 52.6143277397996,
 52.12876666111378,
 52.03348711914996]

# 3.b Ethical, social and legal implications



Local authorities in touristic hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent to, among others, ensure that fair rent prices are kept for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends a new host to put the price of your flat at a price which is above the official regulations established by the local government. Upon inspection, you realize that the inflated price you have been recommended comes from many apartments in the area only being offered during an annual event which brings many tourists, and which causes prices to rise. 

In this context, critically reflect on the compliance of this recommender system with **one of the five actions** outlined in the **UK’s Data Ethics Framework**. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three **key principles** outlined in the Framework, namely _transparency_, _accountability_ and _fairness_. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support. 

Your report should be between 500 and 750 words long.  

### Your answer here. No Python, only Markdown.

Write your answer after the line.

---

...