# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked to discover strong associations between concepts in Airbnb reviews. The starter code we provide in this notebook is for orientation only. The below imports are enough to implement a valid answer.

### Imports, data loading and helper functions

We first connect our google drive, import pandas, numpy and some useful nltk and collections modules, then load the dataframe and define a function for printing the current time, useful to log our progress in some of the tasks.

In [1]:
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()

  from pandas import Panel


In [2]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rory\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Rory\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Rory\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [3]:
# load stopwords
sw = set(stopwords.words('english'))

In [4]:
df = pd.read_csv('reviews.csv')
# deal with empty reviews
df.comments = df.comments.fillna('')

In [5]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [7]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A `function process_reviews(df)` that will take as input the original dataframe and will return it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [8]:
import string

def process_reviews(df):
    """Creates 3 new columns on input dataframe with a column containing strings of text. Columns created: tokenized text,
       tagged text and tagged text lower-case.
       
    Input
    -----
    pd.dataframe with column of strings called 'comments'
    
    Output
    ------
    pd.dataframe
    """
    df['tokenized'] = df['comments'].apply(nltk.word_tokenize)
    df['tagged'] = df['tokenized'].apply(nltk.pos_tag)
    df['lower_tagged'] = df['tagged'].astype('str').str.lower()
    
    return df

In [9]:
df = process_reviews(df)

### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 

In [10]:
def get_vocab(df):
    """Takes dataframe with a column containg lists of tagged test, it will then return up to 1000 most common 
       center words and up to 1000 most common context words.
    
    Input
    -----
    pd.dataframe containg column called 'tagged'
    
    Output
    ------
    cent_vocab: list containing up to 1000 most common center words
    cont_vocab: list containing up to 1000 most common context words
    """
    all_words = []
    # Creating list of lists
    for listius in df['tagged']:
        all_words += listius
    # Separating the list to contain only center words
    nouns = [word for word in all_words if word[1][0] == 'N']
    # Remove duplicates
    nouns = list(set(nouns))
    # Separating the list to contain only context words
    verbs = [word for word in all_words if ((word[1][0] == 'V') or (word[1][0] =='J'))]
    # Remove duplicates
    verbs = list(set(verbs))
    # Calculating frequency of each center word and each context word
    nouns = nltk.FreqDist(nouns)
    verbs = nltk.FreqDist(verbs)
    # Keeping up to the 1000 most common center and context words
    cent_vocab = nouns.most_common(1000)
    cent_vocab = [word[0][0] for word in cent_vocab]
    cont_vocab = verbs.most_common(1000)
    cont_vocab = [word[0][0] for word in cont_vocab]
    
    return cent_vocab, cont_vocab

In [11]:
cent_vocab, cont_vocab = get_vocab(df)

### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

In [12]:
def add_value_to_dictionary_of_dictionary(master_dict,
                                          slave_dict,
                                          master_key,
                                          slave_key):
    """Used to add a dictionary to another dictionary. Checks whether slave dictionary added to master dictionary is 
       already in the master dictionary, if so the value in slave dictionary is appended, otherwise the dictionary is added.
    
    Input
    -----
    master_dict: dictionary holding dictionaries
    slave_dict: dictionary contained in master_dict
    master_key: key of master dictionary
    slave_key: key of slave dictionary
    """
    if slave_key in master_dict[master_key]:
        master_dict[master_key][slave_key] = master_dict[master_key][slave_key] + slave_dict[slave_key]
    else:
        master_dict[master_key][slave_key] = slave_dict[slave_key]

In [13]:
def get_coocs(df, cent_vocab, cont_vocab):
    """Creating co-occurence matrix between center words and context words, stored as a dictionary of dictionaries
    
    Inputs
    ------
    dataframe containing tokenized reviews
    list of center_vocab
    list of context_vocab
    
    Outputs
    -------
    Dictionary of dictionaries representing a co-occurence matrix
    """
    # Setting up output dictionary
    coocs = {}
    # Putting all center vocab as keys of output dictionary
    for w in cent_vocab:
        coocs[w] = {}
        
    # Loop through each review (dataframe rows)
    for review in range(len(df['tokenized'])):
        # Center words contained in the review
        found_centers = [word for word in df['tokenized'][review] if word in cent_vocab]
        # Context words contained in the review
        found_contexts = [word for word in df['tokenized'][review] if word in cont_vocab]
        # Creating temporary dictionary to record the number of each context word in the review
        # this allows duplicates to be captured
        temp_dict = {}
        for word in range(len(found_contexts)): 
            temp_dict[found_contexts[word]] = found_contexts.count(found_contexts[word])
            # Adding the temporary dictionary to the keys of the final dictionary (center words) found in this review
            for center_word in found_centers:       
            # Adding each context word to the center word dict
                for context_word in temp_dict:
                    # Using predefined function to append or create values - depending on pre-existence
                    add_value_to_dictionary_of_dictionary(coocs, temp_dict, center_word, context_word)
                    
    return coocs  

In [14]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 

In [15]:
def cooc_dict2df(coocs):
    """Transforms dictionary of dictionary into a pandas dataframe
    
    Input
    -----
    Dictionary containing dictionary
    
    Output
    ------
    pd.dataframe
    """
    # Putting dictionary into a dataframe
    coocdf = pd.DataFrame(coocs)
    # Transposing the dataframe so that center words are the index
    # Filling NA with 0 
    coocdf = coocdf.transpose().fillna(0)
    
    return coocdf

In [16]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

(1000, 728)

**Not all context words got processed, perhaps a memory error. For purposes of being able to run the next functions, I will cut the df down to 728 rows.**

In [24]:
coocdf = coocdf.iloc[0:728]

### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 

In [17]:
def cooc2pmi(df):
    """Replaces the occurences in the co-occurence matrix with the PPMI scores
    
    Input
    -----
    pd.dataframe containing co-occurence matrix, must be square in dimensions
    
    Output
    ------
    pd.dataframe containing PPMI scores
    """
    # Sum of all values in dataframe
    N = df.values.sum()
    # Joint probability 
    Pij = df/N
    # Context word independant probability
    Pi = df.sum(axis=0)/N
    # Center word independant probability
    Pj = df.sum(axis=1)/N
    # Plugging all values into the PMI forumula
    pmidf =  np.log(Pij/(Pi.values*Pj.values))
    # Replacing negative values with 0, to give the positive point-wise mutual information score
    pmidf[pmidf < 0] = 0
    
    
    return pmidf 

In [25]:
pmidf = cooc2pmi(coocdf)
pmidf.shape



(728, 728)

### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `‘towels’`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`. 

In [26]:
def topk(df,
         center_word,
         N=10
):
    """Given a center word, returns the top N words most likely to occur with given center word
    
    Inputs
    -----
    pd.dataframe
    center_word: string
    N: int - number of context words to return
    
    Outputs
    -------
    list: Top N context words in order of PPMI highest to lowest
    """
    top_words = df.loc[center_word].sort_values(ascending=False).index[0:N].tolist()
    return top_words

In [27]:
topk(pmidf, 'coffee')

['charme',
 'tease',
 'handle',
 'good-tasting',
 'Недалеко',
 'cutting',
 'Algerian',
 'in-between',
 'Smart',
 'taille']

# 3.b Ethical, social and legal implications



Local authorities in touristic hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent to, among others, ensure that fair rent prices are kept for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends a new host to put the price of your flat at a price which is above the official regulations established by the local government. Upon inspection, you realize that the inflated price you have been recommended comes from many apartments in the area only being offered during an annual event which brings many tourists, and which causes prices to rise. 

In this context, critically reflect on the compliance of this recommender system with **one of the five actions** outlined in the **UK’s Data Ethics Framework**. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three **key principles** outlined in the Framework, namely _transparency_, _accountability_ and _fairness_. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support. 

Your report should be between 500 and 750 words long.  

### Your answer here. No Python, only Markdown.

Write your answer after the line.

---
The action I will be reflecting on is defining public benefit and user need, I have chosen this action because I believe that it is weakest ethically. In the context given, the only groups to benefit will be the host and air bnb, this is because during the annual event they will know that they can charge more per-night, otherwise there are only downsides: the air bnb host will be recommended a price which is above the allowed regulations, this would make charging such a price illegal and that nobody would hire the air bnb at any time of the year outside the annual event. Potential air bnb clients would not benefit from this since they might have to overpay to stay at the accommodation. 

I believe that misuse of the data/ poor design of the algorithm could cause the social issue that rent prices are pushed up for residents all year round, this would happen if the fact that the high price recommendation being caused by the event is not captured by users. In being successful the project would benefit the public as air bnb hosts would have more business throughout the year, generating more tax and encouraging tourists to visit and spend money. 
The user need of the price recommender is simple and transparent, they would need to know the longitude and latitude of the property as well as the room type, the expectation is even more simple, a recommended price per-night to charge customers. Overall, I would give the project a score of 3/5 on this action.

At this moment, the project would not score high on transparency, this is because the methods are not published anywhere at this moment in time. However, this could be improved by publishing online the price recommending tool along with the methodology used behind it. At this moment in time the project would score low on accountability as the project is not publicly available and hence cannot be reviewed, if the recommender were made publicly available online then the accountability would be vastly improved as it could be openly tested by anyone. In the project's current state, it would not score well on fairness, this is because it is currently recommending prices which are higher than what they should be all-year round. This is not fair on potential tourists as they may end up overpaying and it is also not fair on residents as it could cause general rent prices to rise. I believe the aim of the project is to provide fairness, as it recommends a price closest to prices of pre-existing air bnb listings.

In order to improve the 3 principles, I would publish the recommender online as a web-app, and include information on the methodology, in addition to this I would make a change to the algorithm. In order to stop large price changes that occur in short spaces of a year, weighting could be put on prices. Whereby prices would carry a greater weighting if the yearly price standard deviation is low, this would imply the price does not change much throughout the year. If this issue was mitigated for, then I believe the project would benefit all parties better, as air bnb hosts would be recommended prices which people are more likely to pay, and it saves them the time it would take to research in order to find a suitable price. People hiring the air bnb would be recommended a fair price throughout the year.
...