# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked to discover strong associations between concepts in Airbnb reviews. The starter code we provide in this notebook is for orientation only. The below imports are enough to implement a valid answer.

### Imports, data loading and helper functions

We first connect our google drive, import pandas, numpy and some useful nltk and collections modules, then load the dataframe and define a function for printing the current time, useful to log our progress in some of the tasks.

In [1]:
import pandas as pd
from nltk.tag import pos_tag
from nltk import RegexpParser
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()

In [2]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [3]:
# load stopwords
sw = set(stopwords.words('english'))

In [4]:
sw

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [5]:
'''We want to find the alternative forms of stopwords that have the "'" symbol in them 
in order to be able to add also to stopwords the word without this symbol'''

pattern = r'\w+\'\w+'

new_stopwords = []
for word in sw:
    # If it finds a word that contains "'" it appends the word in new_stopwords list
    if len(re.findall(pattern,word)) == 1:
        new_stopwords.append(re.findall(pattern,word)[0].replace('\'',''))
new_stopwords

['shouldnt',
 'neednt',
 'isnt',
 'shouldve',
 'hadnt',
 'youre',
 'couldnt',
 'arent',
 'youll',
 'hasnt',
 'wasnt',
 'werent',
 'doesnt',
 'mightnt',
 'havent',
 'shes',
 'dont',
 'its',
 'wouldnt',
 'youve',
 'shant',
 'wont',
 'thatll',
 'mustnt',
 'youd',
 'didnt']

In [6]:
# After checking those "new" words we add them to the stopwords variables named sw
for word in new_stopwords:
    sw.add(word)
sw

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'arent',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'couldnt',
 'd',
 'did',
 'didn',
 "didn't",
 'didnt',
 'do',
 'does',
 'doesn',
 "doesn't",
 'doesnt',
 'doing',
 'don',
 "don't",
 'dont',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'hadnt',
 'has',
 'hasn',
 "hasn't",
 'hasnt',
 'have',
 'haven',
 "haven't",
 'havent',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'isnt',
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'mightnt',
 'more',
 'most',
 'mustn',
 "mustn't",
 'mustnt',
 'my',
 'myself',
 'needn',
 "needn't",
 'neednt',
 'no',
 'nor',
 'not',
 'no

In [7]:
basedir = os.getcwd()
df = pd.read_csv(os.path.join(basedir,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

In [8]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [9]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A `function process_reviews(df)` that will take as input the original dataframe and will return it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [134]:
def process_reviews(df):
    '''
    This function takes as input the given dataframe and creates three new columns the tokenized, tagged and lower_tagged. 
    The tokenized column has as input the words of the comments for its row. The tagged has the result Part-of-speech (PoS) 
    tagging for the tokenized words and finally the lower_tagged column holds the tagged words in lowercase.

    Args:
        df: The dataframe we want to modify.

    Returns: A new version of the given dataframe with three additional columns: tokenized, tagged and lower_tagged.
    '''
    
    # Initialize 3 lists one for each column we will create
    tokenized_col = []
    tagged_col = []
    lower_tagged_col = []


    mylen = len(df)
    count = 0
    
    # Iterate through the given dataframe
    for index, row in df.iterrows():
        # tokenize the words for the comments of a row
        token = word_tokenize(row.comments)
        # Append the tokenized words to the proper list
        tokenized_col.append(token)
        # Tag the tokenized words of the row and then append them to the proper list
        tagged_col.append(pos_tag(token))
        # lower_tagged.append(list(set(pos_tag([item.lower() for item in token]))))
        # Make the tagged words lowercased and then if they are not stopwords append them to the lower_tagged_col list
        lower_tagged_col.append(pos_tag([item.lower() for item in token if item.lower() not in sw]))
        count += 1

        if count % 50000 == 0: print(f'{count} out of {mylen}')

    # Set as values of the 3 new columns the proper list we created for each one
    df['tokenized'] = tokenized_col
    df['tagged'] = tagged_col
    df['lower_tagged'] = lower_tagged_col

    # Return the modified dataframe
    return df

In [135]:
# df = process_reviews(df)
df = process_reviews(df[:500])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['tokenized'] = tokenized_col
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['tagged'] = tagged_col
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['lower_tagged'] = lower_tagged_col


In [60]:
# sentences = []

# center_word = 'trouble'

# words_in_sentence = []
# it_is_in = 0
# for (word, tag) in df.lower_tagged[0]:
#     if center_word == word and tag[0] == 'N': it_is_in += 1
#     if tag[0] == '.' and it_is_in > 0:
#         sentences.extend(words_in_sentence)
#         words_in_sentence = []
#         it_is_in = 0
#     elif tag[0] == '.' and it_is_in == 0:
#         words_in_sentence = []
#     else:
#         words_in_sentence.append((word, tag))
        
# sentences

[('trouble', 'NN'),
 ('finding', 'VBG'),
 ('place', 'JJ'),
 ('central', 'JJ'),
 ('station', 'NN')]

### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 

In [136]:
def get_vocab(df):
    '''
    Based on the lower_tagged column of the dataframe (df) that this function receives it creates a vocabulary of 
    ‘center’ (the x in the PMI equation) and ‘context’ (the y in the PMI equation) words. 
    The vocabulary of center words will be the 1,000 most frequent NOUNS (words with a PoS tag starting with ‘N’), 
    and the context words will be the 1,000 most frequent words tagged as either VERB or ADJECTIVE 
    (words with any PoS tag starting with either ‘J’ or ‘V’).

    Args:
        df: The dataframe we want to modify.

    Returns: The ‘center’ and ‘context’ vocabularies as lists.
    '''
    
    # Initialize 2 empty lists for the 2 vocabularies to be filled
    cent_list, cont_list = [], []

    # Iterate through the 'lower_tagged' column of the df provided
    for review in df.lower_tagged:
        
        '''For every word in a review (a record in the 'lower_tagged' column) that the condition is true for either 
        center or context list is appended to the appropriate list '''
        cent_list.extend([word for word in [list_of_words[0] for list_of_words in review if list_of_words[1][0] == 'N']])
        cont_list.extend([word for word in [list_of_words[0] for list_of_words in review if (list_of_words[1][0] == 'J') 
                                            or (list_of_words[1][0] == 'V')]])
    
    # We create 2 dictionaries that holds information about the frequency of the words in the 2 lists we have
    cent_dict = Counter(cent_list)
    cont_dict = Counter(cont_list)

    # We sort the dictionaries based on their value for frequency of the words and then keep the 1000 most frequent in each list
    cent_vocab = [key for key, value in sorted(cent_dict.items(), key=lambda item: item[1], reverse=True)][:1000]
    cont_vocab = [key for key, value in sorted(cont_dict.items(), key=lambda item: item[1], reverse=True)][:1000]

    # Return the lists
    return cent_vocab, cont_vocab

In [137]:
cent_vocab, cont_vocab = get_vocab(df)

In [138]:
cent_vocab[:5]

['daniel', 'room', 'place', 'host', 'location']

In [139]:
cont_vocab[:5]

['great', 'clean', 'alex', 'nice', 'comfortable']

In [140]:
samewords = [name for name in cent_vocab if name in cont_vocab]
len(samewords)

243

In [141]:
samewords

['daniel',
 'place',
 'location',
 'amsterdam',
 'stay',
 'apartment',
 'bathroom',
 'alex',
 'clean',
 'bus',
 'distance',
 'neighborhood',
 'maps',
 'bed',
 'experience',
 'get',
 'tram',
 'que',
 'walk',
 'airbnb',
 'transport',
 'thank',
 'bit',
 'visit',
 'très',
 'bien',
 'tips',
 'arrival',
 'muy',
 'helpful',
 'lot',
 'è',
 'places',
 'la',
 'cozy',
 'beds',
 'help',
 'perfect',
 'towels',
 'e',
 'tea',
 'daniels',
 'need',
 'die',
 'use',
 'convenient',
 'friend',
 'shower',
 'minute',
 'sehr',
 'check',
 'map',
 'tidy',
 'bike',
 'window',
 'pictures',
 'ist',
 'zu',
 'para',
 'stops',
 'nice',
 'es',
 'lo',
 'staying',
 'guest',
 'spotless',
 'garden',
 'wifi',
 'fun',
 'et',
 'séjour',
 'photos',
 '’',
 'por',
 'amazing',
 'stop',
 'work',
 'travel',
 'lots',
 'tourist',
 'le',
 'stairs',
 'cafes',
 'pleasant',
 'appartment',
 'beautiful',
 'luggage',
 'il',
 'ce',
 'drink',
 'welcoming',
 'feel',
 'restaurant',
 'enjoy',
 'dutch',
 'living',
 'recommend',
 'ride',
 'offer'

### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

In [148]:
def get_coocs(df, cent_vocab, cont_vocab):
    '''
    This function with the 1,000-word vocabularies of center and context words, creates a co-occurrence matrix 
    where, for each center word, we keep track of how many of the context words co-occur with it. 

    Args:
        df: The dataframe that holds the reviews and we will base the creation of co-occurence matrix.
        cent_vocab: The vocabulary of center words.
        cont_vocab: The vocabulary of context words.

    Returns: A co-occurrence matrix of center and context words.
    '''
    
    sentences = [] 
    words_in_sentence = []
    start = pd.to_datetime('today')
    diff = 0
    count = 0
    for index, row in df.iterrows():
        count += 1
        for (word, tag) in row.lower_tagged:            
            if tag[0] == '.':
                sentences.append(words_in_sentence)
                words_in_sentence = []
            else:
                words_in_sentence.append((word, tag))
#         if count == 2 :
#             print(sentences)
#             break
        end = pd.to_datetime('today')
        diff = (end-start).total_seconds()
        if count % 25000 == 0 : print(f'{count} -- {diff/60}')
                
#     end = pd.to_datetime('today')
#     diff = (end-start).total_seconds()
    
    print(f'yolo in {diff/60} minutes')
#     # Initialize an empty list to append all the sentences of all comments
#     sentences = []    
    
#     start = pd.to_datetime('today')
    
#     # Iterate through all comments, split them to sentences and then append those sentences to the list we created above
#     for comment in df.comments:
#         sentences.extend([sentence for sentence in comment.split('.')])
        
#     end = pd.to_datetime('today')
#     diff = (end-start).total_seconds()
    
#     print(f'yolo in {diff/60} minutes')
    
    start = pd.to_datetime('today')
  
    '''Create a dict where we have as keys the 1,000 center_words and as value the sentences they occur.
       In order to Filter the sentences we call the Filter() function'''
    sentences_per_center_word = {center_word : Filter(sentences, center_word) for center_word in cent_vocab}

    # return sentences_per_center_word
    end = pd.to_datetime('today')
    diff = (end-start).total_seconds()
    
    print(f'swag in {diff/60} minutes')
    
    # Initialize an empty dictionary for the co-occurence matrix
    coocs = {}

    count = 0
    count2 = 0
    diff = 0
    
    # Iterate through the dictionary that keeps the sentences for each center_word
    for center_word, sentences in sentences_per_center_word.items():
        # Initialize an empty list for the context words that co-occur with the center_word
        words = []
        count += 1
        start = pd.to_datetime('today')
        count2 = 0
        
        # Iterate through the sentences the center_word occurs
        for sentence in sentences:
            count2 += 1
            
            # Create a list with the context words that co-occur with the center_word
            # words_of_sentence = [word.lower() for word in pos_tag([item.lower() for item in word_tokenize(sentence) if item.lower() not in sw]) if word[0] in cont_vocab and ((word[1][0] == 'J') or (word[1][0] == 'V'))]
            # words_of_sentence = [word.lower() for word in word_tokenize(sentence) if word in cont_vocab and word != center_word]
            
            words_of_sentence = [word for (word, tag) in sentence if word in cont_vocab and (( tag[0] == 'J' ) or ( tag[0] == 'V' ))]
            '''If the list we created above is not empty (it was succesful in searching for contenxt words) 
            extend the words list with it.'''
            if len(words_of_sentence) > 0: words.extend(words_of_sentence)
                
        end = pd.to_datetime('today')
        diff += (end-start).total_seconds()
        
        print(words)
        return dict(Counter(words))
        
        coocs[center_word] = dict(Counter(words))
        
        if count % 50 == 0 : print(f'{count} center_word out of {len(sentences_per_center_word)} in {diff/60} minutes')

            
    print(diff/60)
    
    return coocs 

def Filter(sentences, center_word):
    '''
    This function receives all the sentences in the comments column and 
    creates a list of the sentences the center_word occurs.

    Args:
        sentences: A list of all the sentences in the comments.
        center_word: A center word to find its sentences.

    Returns: A list of the sentences that the center_word occurs.
    '''
    
    # Initialize a list to collect the sentences a center_word is in.
#     sentences_for_center_word = []
    start = pd.to_datetime('today')
    
    sentences_for_center_word = [sentence for sentence in sentences 
                                 if len([word for (word, tag) in sentence if center_word == word and tag[0] == 'N']) > 0]
    end = pd.to_datetime('today')
    diff = (end-start).total_seconds()
    
    print(f'{center_word} in {diff} seconds')
    return sentences_for_center_word
    
#     # Iterate through the sentences
#     for sentence in sentences:
#         # If the center_word is one of the words in the sentence append it to the sentences_for_center_word list. 
#         [word for (word, tag) in sentence if center_word == word]
        
#         if [center_word, 'N'] in [[word[0], word[1][0]] for word in pos_tag([item.lower() for item in word_tokenize(sentence) if item.lower() not in sw])]:
#             sentences_for_center_word.append(sentence)
            
#     return sentences_for_center_word

#     start = pd.to_datetime('today')

#     sentences_for_center_word = []
#     words_in_sentence = []
#     it_is_in = 0
    
#     print(center_word)
    
    
#     for index, row in df.iterrows():
#         if center_word in [word.lower() for word in row.tokenized]:
# #             print(row.lower_tagged)
# #             break
#             for (word, tag) in row.lower_tagged:
#                 if center_word == word and tag[0] == 'N': it_is_in += 1
# #                 if word == '.' and tag[0] == '.' and it_is_in > 0:
#                 if tag[0] == '.' and it_is_in > 0:
#                     sentences_for_center_word.append(words_in_sentence)
#                     words_in_sentence = []
#                     it_is_in = 0
# #                 elif word == '.' and tag[0] == '.' and it_is_in == 0:
#                 elif tag[0] == '.' and it_is_in == 0:
#                     words_in_sentence = []
#                 else:
#                     words_in_sentence.append((word, tag))
    
#     print(len(sentences_for_center_word))

#     end = pd.to_datetime('today')
#     diff = (end-start).total_seconds()
    
#     print(f'{center_word} in {diff/60} minutes')
    
#     return sentences_for_center_word

In [149]:
# df.iloc[:,6:9]

In [150]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

yolo in 0.0021782833333333336 minutes
daniel in 0.002977 seconds
room in 0.002001 seconds
place in 0.00301 seconds
host in 0.002971 seconds
location in 0.00303 seconds
amsterdam in 0.001962 seconds
stay in 0.001996 seconds
city in 0.001999 seconds
everything in 0.001996 seconds
apartment in 0.001956 seconds
time in 0.002993 seconds
bathroom in 0.001996 seconds
information in 0.001994 seconds
alex in 0.001993 seconds
home in 0.001994 seconds
center in 0.001997 seconds
clean in 0.001993 seconds
bus in 0.001996 seconds
house in 0.002003 seconds
distance in 0.001987 seconds
thanks in 0.001993 seconds
night in 0.001995 seconds
neighborhood in 0.001994 seconds
maps in 0.001 seconds
bed in 0.001995 seconds
experience in 0.001994 seconds
get in 0.001995 seconds
area in 0.001994 seconds
tram in 0.001995 seconds
day in 0.001995 seconds
trip in 0.001997 seconds
restaurants in 0.001994 seconds
station in 0.001957 seconds
minutes in 0.00203 seconds
things in 0.001997 seconds
que in 0.001993 seconds

travels in 0.001032 seconds
spending in 0.005978 seconds
traveller in 0.003989 seconds
plane in 0.002992 seconds
wealth in 0.002993 seconds
ways in 0.00203 seconds
kitchen in 0.001995 seconds
desk in 0.001995 seconds
flowers in 0.001993 seconds
tickets in 0.001998 seconds
provide in 0.001993 seconds
fan in 0.001994 seconds
thai in 0.001995 seconds
stores in 0.001992 seconds
family in 0.000997 seconds
birds in 0.000999 seconds
towel in 0.001004 seconds
standard in 0.001007 seconds
info in 0.000997 seconds
definition in 0.002029 seconds
music in 0.001964 seconds
playing in 0.003994 seconds
mins in 0.003987 seconds
magazines in 0.002993 seconds
describe in 0.001994 seconds
suitcases in 0.001958 seconds
convenience in 0.002025 seconds
parts in 0.001993 seconds
easy in 0.001995 seconds
fait in 0.001992 seconds
desservi in 0.001007 seconds
mon in 0.000995 seconds
advise in 0.00196 seconds
votre in 0.003025 seconds
disposition in 0.001999 seconds
besoin in 0.001991 seconds
calme in 0.001018 s

kettel in 0.001993 seconds
range in 0.002054 seconds
checks in 0.002963 seconds
armoire in 0.002992 seconds
suitcase in 0.002023 seconds
doubt in 0.001996 seconds
graciousness in 0.001995 seconds
accommadations in 0.001966 seconds
st in 0.002024 seconds
flawless in 0.002963 seconds
convinient in 0.002021 seconds
regards in 0.001994 seconds
rahul in 0.001988 seconds
buddy in 0.002969 seconds
review in 0.002023 seconds
'core in 0.001994 seconds
apartments in 0.002002 seconds
north in 0.002 seconds
weeknight in 0.001996 seconds
sheets in 0.001992 seconds
fragrant in 0.001995 seconds
manager in 0.001994 seconds
angela in 0.001998 seconds
brazil in 0.001993 seconds
feeling in 0.001993 seconds
mattresses in 0.001999 seconds
airy in 0.002951 seconds
arrangements in 0.001994 seconds
accomotion in 0.001994 seconds
eat in 0.001994 seconds
bright in 0.002993 seconds
resources in 0.001994 seconds
meal in 0.001995 seconds
k in 0.002992 seconds
photo in 0.001996 seconds
cover in 0.00299 seconds
colo

In [151]:
coocs

{'cool': 1,
 'amazing': 9,
 'recommended': 2,
 'gave': 11,
 'good': 11,
 'umbrella': 1,
 'getting': 1,
 'great': 33,
 'welcoming': 11,
 'made': 7,
 'sure': 9,
 'see': 4,
 'fantastic': 9,
 'waiting': 2,
 'outside': 1,
 'arrived': 5,
 'late': 2,
 'prepared': 2,
 'providing': 1,
 'bus': 1,
 'short': 4,
 'helpful': 11,
 'local': 1,
 'mapped': 1,
 'ready': 1,
 'use': 3,
 'accommodating': 6,
 'last': 2,
 'amsterdam': 5,
 'recommend': 23,
 'next': 3,
 'corny': 1,
 'excessive': 1,
 'luxurious': 1,
 'provided': 12,
 'asked': 1,
 'travel': 1,
 'else': 1,
 'say': 3,
 "'s": 1,
 'expect': 3,
 'moreover': 1,
 'hash': 1,
 'stucked': 1,
 'help': 2,
 'come': 4,
 'choose': 1,
 'misses': 1,
 'stated': 1,
 'maps': 2,
 'stay': 18,
 'looking': 4,
 'smart': 2,
 'affordable': 2,
 'alternative': 2,
 'check': 2,
 'youtube': 1,
 'includes': 1,
 'peek': 1,
 'website': 1,
 'hidden': 2,
 'accomodating': 1,
 'friendly': 5,
 'first': 4,
 'found': 4,
 'informed': 1,
 'professional': 3,
 'considering': 2,
 'positive': 

In [93]:
cent_vocab

['place',
 'apartment',
 'location',
 'stay',
 'amsterdam',
 'host',
 'everything',
 'city',
 'room',
 'time',
 'house',
 'area',
 'home',
 'très',
 'center',
 'restaurants',
 '’',
 'station',
 'minutes',
 'walk',
 'centre',
 'tram',
 'experience',
 'space',
 'thanks',
 'à',
 'hosts',
 'neighborhood',
 'clean',
 'bien',
 'perfect',
 'communication',
 'day',
 'la',
 'kind',
 'distance',
 'days',
 'bed',
 'trip',
 'bathroom',
 'et',
 'night',
 'e',
 'sehr',
 'people',
 'thank',
 'places',
 'breakfast',
 'lot',
 'ist',
 'street',
 'boat',
 'tips',
 'der',
 'visit',
 'coffee',
 'arrival',
 'bus',
 'need',
 'muy',
 'war',
 'min',
 'appartement',
 'die',
 'que',
 'transport',
 'airbnb',
 'view',
 'cozy',
 'check',
 'shops',
 'kitchen',
 'questions',
 'way',
 'bars',
 'minute',
 'helpful',
 'family',
 'studio',
 'anyone',
 'things',
 'get',
 'es',
 'access',
 'dans',
 'mit',
 'man',
 'bit',
 'weekend',
 'town',
 'le',
 'lots',
 'je',
 'stairs',
 'con',
 'floor',
 'food',
 'information',
 'mor

### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 

In [None]:
def cooc_dict2df(coocs):
    '''
    This function takes as input the dictionary of co-occurence matrix for center and context words. 
    It converts the dictionary to a 1000x1000 pandas DataFrame.

    Args:
        coocs: The dictionary of co-occurence matrix for center and context words.

    Returns: A 1000x1000 pandas DataFrame.
    '''
    
    # Initialize a pandas DataFrame with columns the context words and indexes the center words
    coocdf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

    # Iterate through the dataframe we created previously to fill it with values
    for index, row in coocdf.iterrows():
        for word in cont_vocab:
            ''' If the pair of index(center word) and word (context word) co-occurs 
                it will add the value to proper place in the dataframe.
                Otherwise the Error of coocs not having a value for this pair will be caught 
                and the value of 0 will added to the corresponding place.'''
            
            try:
                coocdf[word][index] = coocs[index][word]
            except: 
                coocdf[word][index] = 0

    # Return the pandas DataFrame
    return coocdf

In [None]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 

In [None]:
def cooc2pmi(df):
    '''
    This function converts the raw co-occurence counts pandas DataFrame 
    to a DataFrame that keeps the information for the PMI scores.

    Args:
        df: The dataframe we want to convert from raw co-occurence counts to PMI scores.

    Returns: A pandas DataFrame with PMI scores for the pairs of center and context words.
    '''
    
    pmidf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

    N = 0
    for index, row in df.iterrows():
        N += sum(row)
    
    count = 0
    for index, row in df.iterrows():
#         count += 1
#         print(row)
#         print(sum(row))
#         if count == 100: break
        for word in cont_vocab:
#             print(f'sum(df[word]) - {word} -  {sum(df[word])}')
#             print(f'sum(row) - {word} - {sum(row)}')
#             pmi = (df[word][index] / N) / ((sum(df[word])/N) * (sum(row)/N))
#             if pmi == 0:
#                 pmidf[word][index] = 0
#             else:
#                 pmidf[word][index] = np.log([pmi])[0] 
            try:
                pmi = df[word][index] / (sum(df[word])/N / sum(row)/N)
                if pmi == 0:
                    pmidf[word][index] = 0
                else:
                    pmidf[word][index] = np.log([pmi])[0] 
            except: 
                print(word)
                pmidf[word][index] = 0
      
    return pmidf

In [None]:
pmidf = cooc2pmi(coocdf)
pmidf.shape

In [None]:
pmidf

### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `‘towels’`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`. 

In [None]:
# pmidf.iloc[2,:]
pmidf['place']['room']

In [None]:
pmidf['great']['place']

In [None]:
pmidf['place']['nice']

In [None]:
pmidf['based']['place']

In [None]:
def topk(df, center_word, N=10):
    '''
    This function takes as input the given dataframe and creates three new columns the tokenized, tagged and lower_tagged. 
    The tokenized column has as input the words of the comments for its row. The tagged has the result Part-of-speech (PoS) 
    tagging for the tokenized words and finally the lower_tagged column holds the tagged words in lowercase.

    Args:
        df: The dataframe we want to modify.
        center_word: The dataframe we want to modify.
        N: The dataframe we want to modify.

    Returns: A new version of the given dataframe with three additional columns: tokenized, tagged and lower_tagged.
    '''
    
    dicts_ = {word: df[word][center_word] for word in cont_vocab}
    top_words = [key for key, value in sorted(dicts_.items(), key=lambda item: item[1], reverse=True)][:N]

    return top_words

In [None]:
topk(pmidf, 'place')

In [None]:
topk(pmidf, 'location')

In [None]:
topk(pmidf, 'coffee')

In [None]:
topk(pmidf, 'stay')

In [None]:
topk(pmidf, 'petits')

In [None]:
topk(pmidf, 'sauber')

# 3.b Ethical, social and legal implications



Local authorities in touristic hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent to, among others, ensure that fair rent prices are kept for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends a new host to put the price of your flat at a price which is above the official regulations established by the local government. Upon inspection, you realize that the inflated price you have been recommended comes from many apartments in the area only being offered during an annual event which brings many tourists, and which causes prices to rise. 

In this context, critically reflect on the compliance of this recommender system with **one of the five actions** outlined in the **UK’s Data Ethics Framework**. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three **key principles** outlined in the Framework, namely _transparency_, _accountability_ and _fairness_. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support. 

Your report should be between 500 and 750 words long.  

### Your answer here. No Python, only Markdown.

Write your answer after the line.

---

...