# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked to discover strong associations between concepts in Airbnb reviews. The starter code we provide in this notebook is for orientation only. The below imports are enough to implement a valid answer.

### Imports, data loading and helper functions

We first connect our google drive, import pandas, numpy and some useful nltk and collections modules, then load the dataframe and define a function for printing the current time, useful to log our progress in some of the tasks.

In [34]:
import pandas as pd
from nltk.tag import pos_tag
from nltk import RegexpParser
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()

In [35]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\c2086876\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [36]:
# load stopwords
sw = set(stopwords.words('english'))

In [37]:
sw

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [38]:
'''We want to find the alternative forms of stopwords that have the "'" symbol in them 
in order to be able to add also to stopwords the word without this symbol'''

pattern = r'\w+\'\w+'

new_stopwords = []
for word in sw:
    # If it finds a word that contains "'" it appends the word in new_stopwords list
    if len(re.findall(pattern,word)) == 1:
        new_stopwords.append(re.findall(pattern,word)[0].replace('\'',''))
new_stopwords

['neednt',
 'couldnt',
 'arent',
 'hasnt',
 'mustnt',
 'youd',
 'shouldnt',
 'mightnt',
 'didnt',
 'dont',
 'youve',
 'wasnt',
 'shes',
 'wont',
 'youll',
 'isnt',
 'shouldve',
 'wouldnt',
 'youre',
 'havent',
 'hadnt',
 'doesnt',
 'its',
 'thatll',
 'shant',
 'werent']

In [39]:
# After checking those "new" words we add them to the stopwords variables named sw
for word in new_stopwords:
    sw.add(word)
sw

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'arent',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'couldnt',
 'd',
 'did',
 'didn',
 "didn't",
 'didnt',
 'do',
 'does',
 'doesn',
 "doesn't",
 'doesnt',
 'doing',
 'don',
 "don't",
 'dont',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'hadnt',
 'has',
 'hasn',
 "hasn't",
 'hasnt',
 'have',
 'haven',
 "haven't",
 'havent',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'isnt',
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'mightnt',
 'more',
 'most',
 'mustn',
 "mustn't",
 'mustnt',
 'my',
 'myself',
 'needn',
 "needn't",
 'neednt',
 'no',
 'nor',
 'not',
 'no

In [40]:
basedir = os.getcwd()
df = pd.read_csv(os.path.join(basedir,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

In [41]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [42]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A `function process_reviews(df)` that will take as input the original dataframe and will return it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [43]:
def process_reviews(df):
    '''
    This function takes as input the given dataframe and creates three new columns the tokenized, tagged and lower_tagged. 
    The tokenized column has as input the words of the comments for its row. The tagged has the result Part-of-speech (PoS) 
    tagging for the tokenized words and finally the lower_tagged column holds the tagged words in lowercase.

    Args:
        df: The dataframe we want to modify.

    Returns: A new version of the given dataframe with three additional columns: tokenized, tagged and lower_tagged.
    '''
    
    # Initialize 3 lists one for each column we will create
    tokenized_col = []
    tagged_col = []
    lower_tagged_col = []


    mylen = len(df)
    count = 0
    
    # Iterate through the given dataframe
    for index, row in df.iterrows():
        # tokenize the words for the comments of a row
        token = word_tokenize(row.comments)
        # Append the tokenized words to the proper list
        tokenized_col.append(token)
        # Tag the tokenized words of the row and then append them to the proper list
        tagged_col.append(pos_tag(token))
        # lower_tagged.append(list(set(pos_tag([item.lower() for item in token]))))
        # Make the tagged words lowercased and then if they are not stopwords append them to the lower_tagged_col list
        lower_tagged_col.append(pos_tag([item.lower() for item in token if item.lower() not in sw]))
        count += 1

        if count % 50000 == 0: print(f'{count} out of {mylen}')

    # Set as values of the 3 new columns the proper list we created for each one
    df['tokenized'] = tokenized_col
    df['tagged'] = tagged_col
    df['lower_tagged'] = lower_tagged_col

    # Return the modified dataframe
    return df

In [44]:
df = process_reviews(df)
# df = process_reviews(df[:500])

50000 out of 452143
100000 out of 452143
150000 out of 452143
200000 out of 452143
250000 out of 452143
300000 out of 452143
350000 out of 452143
400000 out of 452143
450000 out of 452143


In [45]:
# sentences = []

# center_word = 'trouble'

# words_in_sentence = []
# it_is_in = 0
# for (word, tag) in df.lower_tagged[0]:
#     if center_word == word and tag[0] == 'N': it_is_in += 1
#     if tag[0] == '.' and it_is_in > 0:
#         sentences.extend(words_in_sentence)
#         words_in_sentence = []
#         it_is_in = 0
#     elif tag[0] == '.' and it_is_in == 0:
#         words_in_sentence = []
#     else:
#         words_in_sentence.append((word, tag))
        
# sentences

### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 

In [46]:
def get_vocab(df):
    '''
    Based on the lower_tagged column of the dataframe (df) that this function receives it creates a vocabulary of 
    ‘center’ (the x in the PMI equation) and ‘context’ (the y in the PMI equation) words. 
    The vocabulary of center words will be the 1,000 most frequent NOUNS (words with a PoS tag starting with ‘N’), 
    and the context words will be the 1,000 most frequent words tagged as either VERB or ADJECTIVE 
    (words with any PoS tag starting with either ‘J’ or ‘V’).

    Args:
        df: The dataframe we want to modify.

    Returns: The ‘center’ and ‘context’ vocabularies as lists.
    '''
    
    # Initialize 2 empty lists for the 2 vocabularies to be filled
    cent_list, cont_list = [], []

    # Iterate through the 'lower_tagged' column of the df provided
    for review in df.lower_tagged:
        
        '''For every word in a review (a record in the 'lower_tagged' column) that the condition is true for either 
        center or context list is appended to the appropriate list '''
        cent_list.extend([word for word in [list_of_words[0] for list_of_words in review if list_of_words[1][0] == 'N']])
        cont_list.extend([word for word in [list_of_words[0] for list_of_words in review if (list_of_words[1][0] == 'J') 
                                            or (list_of_words[1][0] == 'V')]])
    
    # We create 2 dictionaries that holds information about the frequency of the words in the 2 lists we have
    cent_dict = Counter(cent_list)
    cont_dict = Counter(cont_list)

    # We sort the dictionaries based on their value for frequency of the words and then keep the 1000 most frequent in each list
    cent_vocab = [key for key, value in sorted(cent_dict.items(), key=lambda item: item[1], reverse=True)][:1000]
    cont_vocab = [key for key, value in sorted(cont_dict.items(), key=lambda item: item[1], reverse=True)][:1000]

    # Return the lists
    return cent_vocab, cont_vocab

In [47]:
cent_vocab, cont_vocab = get_vocab(df)

In [48]:
cent_vocab[:5]

['place', 'apartment', 'location', 'stay', 'amsterdam']

In [49]:
cont_vocab[:5]

['great', 'nice', 'recommend', 'clean', 'good']

In [50]:
samewords = [name for name in cent_vocab if name in cont_vocab]
len(samewords)

385

In [51]:
samewords

['place',
 'apartment',
 'location',
 'stay',
 'amsterdam',
 'host',
 'home',
 'très',
 'center',
 '’',
 'walk',
 'centre',
 'tram',
 'experience',
 'à',
 'hosts',
 'neighborhood',
 'clean',
 'bien',
 'perfect',
 'la',
 'bed',
 'bathroom',
 'et',
 'e',
 'sehr',
 'thank',
 'breakfast',
 'lot',
 'ist',
 'street',
 'boat',
 'tips',
 'der',
 'visit',
 'coffee',
 'arrival',
 'bus',
 'need',
 'muy',
 'min',
 'appartement',
 'die',
 'que',
 'transport',
 'airbnb',
 'view',
 'cozy',
 'check',
 'shops',
 'kitchen',
 'bars',
 'minute',
 'helpful',
 'get',
 'es',
 'dans',
 'bit',
 'le',
 'lots',
 'je',
 'stairs',
 'con',
 'neighbourhood',
 'convenient',
 'zu',
 'du',
 'wir',
 'houseboat',
 'beautiful',
 'pictures',
 'des',
 'super',
 'train',
 'park',
 'il',
 'door',
 'couple',
 'supermarket',
 'séjour',
 'bike',
 'recommend',
 'metro',
 'è',
 'airport',
 'situé',
 'help',
 'casa',
 'friends',
 'alles',
 'balcony',
 'photos',
 'garden',
 'comfy',
 'appartment',
 'feel',
 'corner',
 'nice',
 'stop

### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

In [52]:
def get_coocs(df, cent_vocab, cont_vocab):
    '''
    This function with the 1,000-word vocabularies of center and context words, creates a co-occurrence matrix 
    where, for each center word, we keep track of how many of the context words co-occur with it. 

    Args:
        df: The dataframe that holds the reviews and we will base the creation of co-occurence matrix.
        cent_vocab: The vocabulary of center words.
        cont_vocab: The vocabulary of context words.

    Returns: A co-occurrence matrix of center and context words.
    '''
    
    sentences = [] 
    words_in_sentence = []
    start = pd.to_datetime('today')
    diff = 0
    count = 0
    for index, row in df.iterrows():
        count += 1
        for (word, tag) in row.lower_tagged:            
            if tag[0] == '.':
                sentences.append(words_in_sentence)
                words_in_sentence = []
            else:
                words_in_sentence.append((word, tag))
#         if count == 2 :
#             print(sentences)
#             break
        end = pd.to_datetime('today')
        diff = (end-start).total_seconds()
        if count % 25000 == 0 : print(f'{count} -- {diff/60}')
                
#     end = pd.to_datetime('today')
#     diff = (end-start).total_seconds()
    
    print(f'yolo in {diff/60} minutes')
#     # Initialize an empty list to append all the sentences of all comments
#     sentences = []    
    
#     start = pd.to_datetime('today')
    
#     # Iterate through all comments, split them to sentences and then append those sentences to the list we created above
#     for comment in df.comments:
#         sentences.extend([sentence for sentence in comment.split('.')])
        
#     end = pd.to_datetime('today')
#     diff = (end-start).total_seconds()
    
#     print(f'yolo in {diff/60} minutes')
    
    start = pd.to_datetime('today')
  
    '''Create a dict where we have as keys the 1,000 center_words and as value the sentences they occur.
       In order to Filter the sentences we call the Filter() function'''
    sentences_per_center_word = {center_word : Filter(sentences, center_word) for center_word in cent_vocab}

    # return sentences_per_center_word
    end = pd.to_datetime('today')
    diff = (end-start).total_seconds()
    
    print(f'swag in {diff/60} minutes')
    
    # Initialize an empty dictionary for the co-occurence matrix
    coocs = {}

    count = 0
    count2 = 0
    diff = 0
    
    # Iterate through the dictionary that keeps the sentences for each center_word
    for center_word, sentences in sentences_per_center_word.items():
        # Initialize an empty list for the context words that co-occur with the center_word
        words = []
        count += 1
        start = pd.to_datetime('today')
        count2 = 0
        
        # Iterate through the sentences the center_word occurs
        for sentence in sentences:
            count2 += 1
            
            # Create a list with the context words that co-occur with the center_word
            # words_of_sentence = [word.lower() for word in pos_tag([item.lower() for item in word_tokenize(sentence) if item.lower() not in sw]) if word[0] in cont_vocab and ((word[1][0] == 'J') or (word[1][0] == 'V'))]
            # words_of_sentence = [word.lower() for word in word_tokenize(sentence) if word in cont_vocab and word != center_word]
            
            words_of_sentence = [word for (word, tag) in sentence if word in cont_vocab and (( tag[0] == 'J' ) or ( tag[0] == 'V' ))]
            '''If the list we created above is not empty (it was succesful in searching for contenxt words) 
            extend the words list with it.'''
            if len(words_of_sentence) > 0: words.extend(words_of_sentence)
                
        end = pd.to_datetime('today')
        diff += (end-start).total_seconds()
        
#         print(words)
#         return dict(Counter(words))
        
        coocs[center_word] = dict(Counter(words))
        
        if count % 50 == 0 : print(f'{count} center_word out of {len(sentences_per_center_word)} in {diff/60} minutes')

            
    print(diff/60)
    
    return coocs 

def Filter(sentences, center_word):
    '''
    This function receives all the sentences in the comments column and 
    creates a list of the sentences the center_word occurs.

    Args:
        sentences: A list of all the sentences in the comments.
        center_word: A center word to find its sentences.

    Returns: A list of the sentences that the center_word occurs.
    '''
    
    # Initialize a list to collect the sentences a center_word is in.
#     sentences_for_center_word = []
    start = pd.to_datetime('today')
    
    sentences_for_center_word = [sentence for sentence in sentences 
                                 if len([word for (word, tag) in sentence if center_word == word and tag[0] == 'N']) > 0]
    end = pd.to_datetime('today')
    diff = (end-start).total_seconds()
    
    print(f'{center_word} in {diff} seconds')
    return sentences_for_center_word
    
#     # Iterate through the sentences
#     for sentence in sentences:
#         # If the center_word is one of the words in the sentence append it to the sentences_for_center_word list. 
#         [word for (word, tag) in sentence if center_word == word]
        
#         if [center_word, 'N'] in [[word[0], word[1][0]] for word in pos_tag([item.lower() for item in word_tokenize(sentence) if item.lower() not in sw])]:
#             sentences_for_center_word.append(sentence)
            
#     return sentences_for_center_word

#     start = pd.to_datetime('today')

#     sentences_for_center_word = []
#     words_in_sentence = []
#     it_is_in = 0
    
#     print(center_word)
    
    
#     for index, row in df.iterrows():
#         if center_word in [word.lower() for word in row.tokenized]:
# #             print(row.lower_tagged)
# #             break
#             for (word, tag) in row.lower_tagged:
#                 if center_word == word and tag[0] == 'N': it_is_in += 1
# #                 if word == '.' and tag[0] == '.' and it_is_in > 0:
#                 if tag[0] == '.' and it_is_in > 0:
#                     sentences_for_center_word.append(words_in_sentence)
#                     words_in_sentence = []
#                     it_is_in = 0
# #                 elif word == '.' and tag[0] == '.' and it_is_in == 0:
#                 elif tag[0] == '.' and it_is_in == 0:
#                     words_in_sentence = []
#                 else:
#                     words_in_sentence.append((word, tag))
    
#     print(len(sentences_for_center_word))

#     end = pd.to_datetime('today')
#     diff = (end-start).total_seconds()
    
#     print(f'{center_word} in {diff/60} minutes')
    
#     return sentences_for_center_word

In [53]:
# df.iloc[:,6:9]

In [54]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

25000 -- 1.9145672
50000 -- 2.0694676000000003
75000 -- 2.209042783333333
100000 -- 3.9738843
125000 -- 4.11963795
150000 -- 4.25805245
175000 -- 4.4521100166666665
200000 -- 6.533285783333333
225000 -- 6.742544933333334
250000 -- 6.873489316666666
275000 -- 7.118114383333333
300000 -- 7.221259316666666
325000 -- 7.3446008166666665
350000 -- 7.5253455
375000 -- 10.637935133333334
400000 -- 10.730963683333332
425000 -- 10.822394699999998
450000 -- 10.91067795
yolo in 10.918363366666666 minutes
place in 41.054149 seconds
apartment in 8.439152 seconds
location in 2.149253 seconds
stay in 2.337079 seconds
amsterdam in 1.732792 seconds
host in 1.95876 seconds
everything in 1.842105 seconds
city in 1.686455 seconds
room in 1.656284 seconds
time in 1.637778 seconds
house in 1.684457 seconds
area in 1.711057 seconds
home in 2.024584 seconds
très in 1.792244 seconds
center in 2.232288 seconds
restaurants in 1.771264 seconds
’ in 2.066472 seconds
station in 2.161251 seconds
minutes in 1.644474 s

kids in 1.546851 seconds
ville in 1.546873 seconds
hope in 1.561825 seconds
size in 1.524919 seconds
warm in 1.57176 seconds
chez in 1.551879 seconds
dem in 1.537856 seconds
details in 1.52595 seconds
comfortable in 1.51765 seconds
issues in 1.507253 seconds
foot in 1.548888 seconds
wifi in 1.532344 seconds
peter in 1.532042 seconds
stores in 1.495999 seconds
messages in 1.487017 seconds
group in 1.505969 seconds
pleasure in 1.514945 seconds
property in 1.493207 seconds
waren in 1.528094 seconds
gastgeber in 1.519936 seconds
steps in 1.544865 seconds
auf in 1.520914 seconds
lo in 1.527888 seconds
bei in 1.524958 seconds
care in 1.525915 seconds
shop in 1.555837 seconds
life in 1.525908 seconds
al in 1.516401 seconds
che in 1.523879 seconds
issue in 1.539925 seconds
noise in 1.546854 seconds
ideal in 1.535892 seconds
cool in 1.539882 seconds
money in 1.527913 seconds
contact in 1.493969 seconds
centrum in 1.506012 seconds
ha in 1.491009 seconds
problems in 1.501657 seconds
sind in 1.543

meals in 1.537884 seconds
taxi in 1.518443 seconds
list in 1.527055 seconds
situation in 1.592694 seconds
posizione in 1.535891 seconds
adults in 1.527914 seconds
tutto in 1.542907 seconds
limpio in 1.537887 seconds
non in 1.527927 seconds
boyfriend in 1.508929 seconds
addition in 1.491155 seconds
cette in 1.528904 seconds
rijksmuseum in 1.503945 seconds
stuff in 1.558876 seconds
aanrader in 1.513909 seconds
gem in 1.510957 seconds
solo in 1.558857 seconds
pijp in 1.550856 seconds
anche in 1.558831 seconds
owners in 1.527374 seconds
rob in 1.536853 seconds
cafés in 1.522927 seconds
clothes in 1.542867 seconds
departure in 1.496013 seconds
talk in 1.565884 seconds
magnifique in 1.513952 seconds
families in 1.501948 seconds
tons in 1.539916 seconds
opportunity in 1.495007 seconds
centrale in 1.513909 seconds
gute in 1.556838 seconds
marcel in 1.523885 seconds
minuti in 1.522953 seconds
explore in 1.489435 seconds
emplacement in 1.545866 seconds
pratique in 1.539912 seconds
bel in 1.50995

interest in 1.518387 seconds
maker in 1.541881 seconds
pieds in 1.500936 seconds
uns in 1.517969 seconds
+ in 1.519132 seconds
facilement in 1.489046 seconds
près in 1.543966 seconds
zijn in 1.541874 seconds
danke in 1.523889 seconds
johan in 1.543173 seconds
aus in 1.503118 seconds
gentile in 1.547859 seconds
dat in 1.533901 seconds
upstairs in 1.529916 seconds
sa in 1.530868 seconds
essentials in 1.487461 seconds
viel in 1.522925 seconds
n in 1.516974 seconds
cheese in 1.53091 seconds
d'un in 1.526951 seconds
mattress in 1.486025 seconds
aussi in 1.558797 seconds
kamer in 1.674634 seconds
feet in 1.620657 seconds
louise in 1.575737 seconds
well in 1.561822 seconds
surroundings in 1.52794 seconds
alla in 1.713417 seconds
aan in 1.701461 seconds
être in 1.594733 seconds
passé in 1.557839 seconds
expérience in 1.540835 seconds
aller in 1.62768 seconds
cómoda in 1.532978 seconds
gare in 1.525914 seconds
drop in 1.533958 seconds
haus in 1.535924 seconds
doubt in 1.541876 seconds
jo in 1.5

In [73]:
coocs['place']

{'nice': 19494,
 'clean': 12018,
 'provides': 137,
 'want': 2799,
 'comfy': 316,
 'located': 5616,
 'quiet': 6530,
 'close': 4151,
 'public': 1987,
 'getting': 499,
 'easy': 4662,
 'recommend': 19551,
 'calm': 497,
 'provided': 742,
 'much': 1442,
 'amazing': 6801,
 'first': 872,
 'waiting': 135,
 'outside': 563,
 'arrived': 728,
 'late': 296,
 'comfortable': 6380,
 'next': 2156,
 'friendly': 1370,
 'accommodating': 447,
 'asked': 450,
 'travel': 125,
 'say': 781,
 'perfect': 10903,
 'safe': 669,
 'come': 2718,
 'choose': 343,
 'sure': 1008,
 'ready': 226,
 'great': 40513,
 'traveling': 506,
 'tidy': 147,
 'organized': 238,
 'go': 2351,
 'bit': 179,
 'central': 4241,
 'accessible': 563,
 'stay': 9934,
 '10-15': 138,
 'ride': 287,
 'lots': 339,
 'different': 368,
 'get': 2713,
 'make': 1118,
 'excellent': 2239,
 'considering': 146,
 'bed': 620,
 'consider': 197,
 'affordable': 207,
 'good': 9771,
 'spent': 643,
 'pleased': 115,
 'fantastic': 3450,
 'expect': 243,
 'decorated': 1037,
 'l

In [56]:
cent_vocab

['place',
 'apartment',
 'location',
 'stay',
 'amsterdam',
 'host',
 'everything',
 'city',
 'room',
 'time',
 'house',
 'area',
 'home',
 'très',
 'center',
 'restaurants',
 '’',
 'station',
 'minutes',
 'walk',
 'centre',
 'tram',
 'experience',
 'space',
 'thanks',
 'à',
 'hosts',
 'neighborhood',
 'clean',
 'bien',
 'perfect',
 'communication',
 'day',
 'la',
 'kind',
 'distance',
 'days',
 'bed',
 'trip',
 'bathroom',
 'et',
 'night',
 'e',
 'sehr',
 'people',
 'thank',
 'places',
 'breakfast',
 'lot',
 'ist',
 'street',
 'boat',
 'tips',
 'der',
 'visit',
 'coffee',
 'arrival',
 'bus',
 'need',
 'muy',
 'war',
 'min',
 'appartement',
 'die',
 'que',
 'transport',
 'airbnb',
 'view',
 'cozy',
 'check',
 'shops',
 'kitchen',
 'questions',
 'way',
 'bars',
 'minute',
 'helpful',
 'family',
 'studio',
 'anyone',
 'things',
 'get',
 'es',
 'access',
 'dans',
 'mit',
 'man',
 'bit',
 'weekend',
 'town',
 'le',
 'lots',
 'je',
 'stairs',
 'con',
 'floor',
 'food',
 'information',
 'mor

### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 

In [57]:
def cooc_dict2df(coocs):
    '''
    This function takes as input the dictionary of co-occurence matrix for center and context words. 
    It converts the dictionary to a 1000x1000 pandas DataFrame.

    Args:
        coocs: The dictionary of co-occurence matrix for center and context words.

    Returns: A 1000x1000 pandas DataFrame.
    '''
    
    # Initialize a pandas DataFrame with columns the context words and indexes the center words
    coocdf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

    # Iterate through the dataframe we created previously to fill it with values
    for index, row in coocdf.iterrows():
        for word in cont_vocab:
            ''' If the pair of index(center word) and word (context word) co-occurs 
                it will add the value to proper place in the dataframe.
                Otherwise the Error of coocs not having a value for this pair will be caught 
                and the value of 0 will added to the corresponding place.'''
            
            try:
                coocdf[word][index] = coocs[index][word]
            except: 
                coocdf[word][index] = 0

    # Return the pandas DataFrame
    return coocdf

In [58]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

(1000, 1000)

### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 

In [59]:
def cooc2pmi(df):
    '''
    This function converts the raw co-occurence counts pandas DataFrame 
    to a DataFrame that keeps the information for the PMI scores.

    Args:
        df: The dataframe we want to convert from raw co-occurence counts to PMI scores.

    Returns: A pandas DataFrame with PMI scores for the pairs of center and context words.
    '''
    
    pmidf = pd.DataFrame(columns=cont_vocab, index = cent_vocab)

    N = 0
    for index, row in df.iterrows():
        N += sum(row)
    
    count = 0
    for index, row in df.iterrows():
#         count += 1
#         print(row)
#         print(sum(row))
#         if count == 100: break
        for word in cont_vocab:
#             print(f'sum(df[word]) - {word} -  {sum(df[word])}')
#             print(f'sum(row) - {word} - {sum(row)}')
#             pmi = (df[word][index] / N) / ((sum(df[word])/N) * (sum(row)/N))
#             if pmi == 0:
#                 pmidf[word][index] = 0
#             else:
#                 pmidf[word][index] = np.log([pmi])[0] 
            try:
                pmi = df[word][index] / (sum(df[word])/N / sum(row)/N)
                if pmi == 0:
                    pmidf[word][index] = 0
                else:
                    pmidf[word][index] = np.log([pmi])[0] 
            except: 
#                 print(word)
                pmidf[word][index] = 0
      
    return pmidf

In [60]:
pmidf = cooc2pmi(coocdf)
pmidf.shape

(1000, 1000)

In [61]:
pmidf

Unnamed: 0,great,nice,recommend,clean,good,stay,comfortable,easy,perfect,quiet,...,climb,chilled,downstairs,well-located,accommodate,based,andrea,avant,taxi,wi-fi
place,43.119588,43.016158,43.852694,43.129038,42.755983,43.443306,42.743090,42.211904,43.170165,42.704746,...,41.583864,42.751780,42.117945,43.210670,42.179014,42.549861,42.178219,40.924455,41.768985,41.704752
apartment,42.726791,42.819623,43.007347,43.183159,42.457831,42.807691,43.021223,42.187156,42.745846,42.643613,...,42.921042,42.561585,42.375559,43.386565,42.322270,42.531516,41.435956,38.523964,41.988133,42.090954
location,43.268305,42.396526,41.833638,42.525356,43.059813,41.827540,42.162287,42.507726,43.257888,42.729157,...,41.079865,42.095700,41.051477,40.154106,40.298772,42.584931,40.721791,39.783880,41.998956,41.146266
stay,42.108583,41.821913,42.636182,41.359525,41.670270,41.326692,42.028527,40.982286,42.216008,40.927935,...,40.244595,41.640428,40.333990,41.155046,41.299712,41.369196,40.909625,39.153403,40.988332,41.091153
amsterdam,41.757060,41.585891,42.755572,41.185652,41.585781,42.823777,41.220988,41.627679,42.363307,41.767589,...,41.501917,41.845658,40.733862,42.167246,41.318661,41.923219,41.493218,40.611397,41.041682,40.451356
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
petits,30.887791,30.417265,30.403583,30.166548,30.000479,30.958940,30.008372,29.097762,30.305042,29.253642,...,0,0,0,0,0,0,33.927759,34.311603,0,0
show,31.745670,32.163191,31.545974,32.407552,31.653697,31.670549,31.792618,31.028611,31.874878,30.983821,...,0,0,0,0,0,0,0,0,0,0
découvrir,30.392349,30.955897,30.690901,29.537575,30.624269,29.349138,30.478011,30.483692,31.062363,30.416429,...,0,0,0,0,0,0,0,34.088095,0,0
wait,31.272333,31.089489,32.040888,31.599059,30.826854,32.568075,30.701216,30.147281,31.211460,30.303161,...,0,0,0,0,0,0,33.367839,0,33.977174,0


### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `‘towels’`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`. 

In [62]:
# pmidf.iloc[2,:]
# pmidf['place']['room']

In [63]:
# pmidf['great']['place']

In [64]:
# pmidf['place']['nice']

In [65]:
# pmidf['based']['place']

In [66]:
def topk(df, center_word, N=10):
    '''
    This function takes as input the given dataframe and creates three new columns the tokenized, tagged and lower_tagged. 
    The tokenized column has as input the words of the comments for its row. The tagged has the result Part-of-speech (PoS) 
    tagging for the tokenized words and finally the lower_tagged column holds the tagged words in lowercase.

    Args:
        df: The dataframe we want to modify.
        center_word: The dataframe we want to modify.
        N: The dataframe we want to modify.

    Returns: A new version of the given dataframe with three additional columns: tokenized, tagged and lower_tagged.
    '''
    
    dicts_ = {word: df[word][center_word] for word in cont_vocab}
    top_words = [key for key, value in sorted(dicts_.items(), key=lambda item: item[1], reverse=True)][:N]

    return top_words

In [67]:
topk(pmidf, 'place')

['recommand',
 'recomend',
 'recommend',
 'reccomend',
 '’',
 '‘',
 'looking',
 'spotless',
 'recommendable',
 'magical']

In [68]:
topk(pmidf, 'location')

['prime',
 'brilliant',
 'fantastic',
 'great',
 'perfect',
 'superb',
 'terrific',
 'excellent',
 'fabulous',
 'better']

In [69]:
topk(pmidf, 'coffee')

['tea',
 'nespresso',
 'microwave',
 'complimentary',
 'supplied',
 'fridge',
 'stocked',
 'cheese',
 'fresh',
 'drink']

In [70]:
topk(pmidf, 'stay')

['enjoyed',
 'enjoyable',
 'pleasant',
 'overall',
 'hesitate',
 'memorable',
 'love',
 'pleased',
 'ensure',
 'satisfied']

In [71]:
topk(pmidf, 'petits')

['aux', 'des', 'sont', 'deux', 'avec', 'apprécié', 'et', 'ses', 'les', 'ont']

In [72]:
topk(pmidf, 'sauber')

['wohnung',
 'sehr',
 'zimmer',
 'das',
 'alles',
 'und',
 'bad',
 'ist',
 'wie',
 'gute']

# 3.b Ethical, social and legal implications



Local authorities in touristic hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent to, among others, ensure that fair rent prices are kept for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends a new host to put the price of your flat at a price which is above the official regulations established by the local government. Upon inspection, you realize that the inflated price you have been recommended comes from many apartments in the area only being offered during an annual event which brings many tourists, and which causes prices to rise. 

In this context, critically reflect on the compliance of this recommender system with **one of the five actions** outlined in the **UK’s Data Ethics Framework**. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three **key principles** outlined in the Framework, namely _transparency_, _accountability_ and _fairness_. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support. 

Your report should be between 500 and 750 words long.  

### Your answer here. No Python, only Markdown.

Write your answer after the line.

---

...