# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked to discover strong associations between concepts in Airbnb reviews. The starter code we provide in this notebook is for orientation only. The below imports are enough to implement a valid answer.

### Imports, data loading and helper functions

We first connect our google drive, import pandas, numpy and some useful nltk and collections modules, then load the dataframe and define a function for printing the current time, useful to log our progress in some of the tasks.

In [128]:
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()

In [129]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andreas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Andreas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Andreas\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [130]:
# load stopwords
sw = set(stopwords.words('english'))

In [131]:
basedir = os.getcwd()
df = pd.read_csv(os.path.join(basedir,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

In [132]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [133]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A `function process_reviews(df)` that will take as input the original dataframe and will return it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [134]:
def process_reviews(df):
  tokenized = []
  tagged = []
  lower_tagged = []

  mylen = len(df)
  count = 0
  for index, row in df.iterrows():
    token = word_tokenize(row.comments)
    tokenized.append(token)
    tagged.append(pos_tag(token))
    lower_tagged.append(list(set(pos_tag([item.lower() for item in token]))))
    count += 1
    print(f'{count} out of {mylen}')
    if count % 20000 == 0:      
      break

  df['tokenized'] = tokenized
  df['tagged'] = tagged
  df['lower_tagged'] = lower_tagged


  # df['tokenized'] = [ word_tokenize(row.comments) for index, row in df.iterrows()]
  # print('Tokenizing done!\n')
  # df['tagged'] = [pos_tag(row.tokenized) for index, row in df.iterrows()]
  # print('Tagging done!\n')
  # df['lower_tagged'] = list(set([pos_tag([item.lower() for item in row.tokenized]) for index, row in df.iterrows()]))
  # print('Lower tagging done!\n')
  return df

In [135]:
df = process_reviews(df[:20000])

t of 20000
19005 out of 20000
19006 out of 20000
19007 out of 20000
19008 out of 20000
19009 out of 20000
19010 out of 20000
19011 out of 20000
19012 out of 20000
19013 out of 20000
19014 out of 20000
19015 out of 20000
19016 out of 20000
19017 out of 20000
19018 out of 20000
19019 out of 20000
19020 out of 20000
19021 out of 20000
19022 out of 20000
19023 out of 20000
19024 out of 20000
19025 out of 20000
19026 out of 20000
19027 out of 20000
19028 out of 20000
19029 out of 20000
19030 out of 20000
19031 out of 20000
19032 out of 20000
19033 out of 20000
19034 out of 20000
19035 out of 20000
19036 out of 20000
19037 out of 20000
19038 out of 20000
19039 out of 20000
19040 out of 20000
19041 out of 20000
19042 out of 20000
19043 out of 20000
19044 out of 20000
19045 out of 20000
19046 out of 20000
19047 out of 20000
19048 out of 20000
19049 out of 20000
19050 out of 20000
19051 out of 20000
19052 out of 20000
19053 out of 20000
19054 out of 20000
19055 out of 20000
19056 out of 20000
1

In [136]:
df

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,tokenized,tagged,lower_tagged
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...,"[Daniel, is, really, cool, ., The, place, was,...","[(Daniel, NNP), (is, VBZ), (really, RB), (cool...","[(in, IN), (didnt, VBP), (any, DT), (clean, JJ..."
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...,"[Daniel, is, the, most, amazing, host, !, His,...","[(Daniel, NNP), (is, VBZ), (the, DT), (most, R...","[(please, VBP), (way, NN), (bed, NN), (if, IN)..."
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...,"[We, had, such, a, great, time, in, Amsterdam,...","[(We, PRP), (had, VBD), (such, JJ), (a, DT), (...","[(bathroom, NN), (10-15, JJ), (in, IN), (we, P..."
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...,"[Very, professional, operation, ., Room, is, v...","[(Very, RB), (professional, JJ), (operation, N...","[(which, WDT), (clean, JJ), (comfortable, JJ),..."
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...,"[Daniel, is, highly, recommended, ., He, provi...","[(Daniel, NNP), (is, VBZ), (highly, RB), (reco...","[(way, NN), (in, IN), (which, WDT), (recommend..."
...,...,...,...,...,...,...,...,...,...
19995,481664,51662420,2015-10-22,46551788,Martin,I had an amazing stay on Eltjo and Liselores b...,"[I, had, an, amazing, stay, on, Eltjo, and, Li...","[(I, PRP), (had, VBD), (an, DT), (amazing, JJ)...","[(eltjo, NN), ('re, VBP), (if, IN), (exactly, ..."
19996,481664,51885927,2015-10-25,37047887,Susi,Thank you so much for that wonderful stay. The...,"[Thank, you, so, much, for, that, wonderful, s...","[(Thank, NNP), (you, PRP), (so, RB), (much, JJ...","[(thank, NN), (in, IN), (wonderful, JJ), (cosy..."
19997,481664,52491199,2015-10-30,8500879,Kristine,Eltjo and Liselore were very nice people and t...,"[Eltjo, and, Liselore, were, very, nice, peopl...","[(Eltjo, NNP), (and, CC), (Liselore, NNP), (we...","[(with, IN), (eltjo, NN), (water, NN), (thank,..."
19998,481664,53267155,2015-11-07,46466936,Alexander,The hosts are friendly) all as on foto) далее ...,"[The, hosts, are, friendly, ), all, as, on, fo...","[(The, DT), (hosts, NNS), (are, VBP), (friendl...","[(и, JJ), (бы, NNP), (людишек, NNP), (переехал..."


In [137]:
# num = 0
# print(len(df.tagged[num]), len(set(df.lower_tagged[num])))
# list(set(df.lower_tagged[num]))

In [138]:
df.lower_tagged[0][1][1][0]

'V'

### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 

In [139]:
def get_vocab(df):
  cent_list, cont_list = [], []

  for review in df.lower_tagged:
    cent_list.extend([word for word in [list_of_words[0] for list_of_words in review if list_of_words[1][0] == 'N']])
    cont_list.extend([word for word in [list_of_words[0] for list_of_words in review if (list_of_words[1][0] == 'J') or (list_of_words[1][0] == 'V')]])
    
  cent_dict = Counter(cent_list)
  cont_dict = Counter(cont_list)

  cent_vocab = [key for key, value in sorted(cent_dict.items(), key=lambda item: item[1])][:1000]
  cont_vocab = [key for key, value in sorted(cont_dict.items(), key=lambda item: item[1])][:1000]

  return cent_vocab, cont_vocab

In [140]:
cent_vocab, cont_vocab = get_vocab(df)

In [141]:
cent_vocab[0]

'compass'

### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

In [142]:
comments = df.comments
type(comments)

pandas.core.series.Series

In [143]:
comments[:5]

0    Daniel is really cool. The place was nice and ...
1    Daniel is the most amazing host! His place is ...
2    We had such a great time in Amsterdam. Daniel ...
3    Very professional operation. Room is very clea...
4    Daniel is highly recommended.  He provided all...
Name: comments, dtype: object

In [144]:
def get_coocs(df, cent_vocab, cont_vocab):
  sentences = []
  comments = df.comments

  for comment in comments:
    sentences.extend([sentence for sentence in comment.split('.')])
  
  # print(sentences)
  
  coocs = {}

  for center_word in cent_vocab:
    words = []
    for sentence in sentences:
      if center_word in sentence:
        words_in_sentence = word_tokenize(sentence)
        words.extend([word for word in words_in_sentence if word in cont_vocab])
    
    center_word_dict = dict(Counter(words))
    coocs[center_word] = center_word_dict
    
  # cent_dict = Counter(cent_list)
  # cont_dict = Counter(cont_list)
  
  return coocs  

In [145]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 

In [153]:
def cooc_dict2df(coocs):
  coocdf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

  for index, row in coocdf.iterrows():
    for word in cont_vocab:
      try:
        coocdf[word][index] = coocs[index][word]
      except: 
        coocdf[word][index] = 0

  return coocdf

In [154]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

(1000, 1000)

### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 

In [190]:
coocdf

Unnamed: 0,umbrella,convienient,couldn.t,bicycles,exeptional,unmatched,corny,stucked,asthetic,well-thought-of,...,looooved,familiarize,awe-inspiring,leidseplein/,worderful,super-nice,nearby-,entertain,+tram,shore
compass,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
travelguide,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
accomation,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
nuances,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
command,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
diversos,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
viajei,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sozinha,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
leidsesplein,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [194]:
def cooc2pmi(df):
  pmidf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

  N = 0
  for index, row in coocdf.iterrows():
    N += sum(row)

  for index, row in coocdf.iterrows():
    for word in cont_vocab:
      try:
        pmi = df[index][word] / (sum(df[word])/N / sum(row)/N)
        pmidf[word][index] = np.log2([pmi])[0] 
        print(pmidf[word][index])
      except: 
        pmidf[word][index] = 0
      
  return pmidf

In [195]:
pmidf = cooc2pmi(coocdf)
pmidf.shape

23.847819078274203
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
  pmidf[word][index] = np.log2([pmi])[0]
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
25.43278157899536
-inf
24.43278157899536
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
23.847819078274203
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
20.625426656937755
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
23.695815984829153
-inf
-inf
-inf
22.847819078274203
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
22.847819078274203
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
23.43278157899536
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
22.847819078274203
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
-inf
22.847819078274203
-inf
-inf
-inf


(1000, 1000)

In [196]:
pmidf

Unnamed: 0,umbrella,convienient,couldn.t,bicycles,exeptional,unmatched,corny,stucked,asthetic,well-thought-of,...,looooved,familiarize,awe-inspiring,leidseplein/,worderful,super-nice,nearby-,entertain,+tram,shore
compass,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
travelguide,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
accomation,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
nuances,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
command,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
diversos,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
viajei,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sozinha,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
leidsesplein,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [198]:
for name in cont_vocab:
    if len(pmidf[name][pmidf[name] > 0]) > 0:
        print(pmidf[name][pmidf[name] > 0 ])

impatient    23.847819
Name: impatient, dtype: object
bountiful    24.432782
Name: bountiful, dtype: object
schlechte    22.847819
Name: schlechte, dtype: object
pensez    20.625427
Name: pensez, dtype: object
лучшее    23.847819
Name: лучшее, dtype: object
caracol    22.847819
Name: caracol, dtype: object
llegué    22.847819
Name: llegué, dtype: object
fiquei    22.847819
Name: fiquei, dtype: object
socialising    23.432782
Name: socialising, dtype: object
небольшие    23.695816
Name: небольшие, dtype: object
delen    25.432782
Name: delen, dtype: object


### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `‘towels’`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`. 

In [201]:
def topk(df, center_word, N=10):
  top_words = sorted([df[word][center_word] for word in cont_vocab])[:N]
  return top_words

In [207]:
topk(pmidf, 'coffee')

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# 3.b Ethical, social and legal implications



Local authorities in touristic hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent to, among others, ensure that fair rent prices are kept for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends a new host to put the price of your flat at a price which is above the official regulations established by the local government. Upon inspection, you realize that the inflated price you have been recommended comes from many apartments in the area only being offered during an annual event which brings many tourists, and which causes prices to rise. 

In this context, critically reflect on the compliance of this recommender system with **one of the five actions** outlined in the **UK’s Data Ethics Framework**. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three **key principles** outlined in the Framework, namely _transparency_, _accountability_ and _fairness_. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support. 

Your report should be between 500 and 750 words long.  

### Your answer here. No Python, only Markdown.

Write your answer after the line.

---

...