## Udacity Write A Data Science Blog Post Project

## Introduction

This project is part of the [Udacity](https://eu.udacity.com/) Data Scientist Nanodegree Program, which is composed of:
- Term 1
    - Supervised Learning
    - Deep Learning
    - Unsupervised Learning
- Term 2
    - Write A Data Science Blog Post
    - Disaster Response Pipelines
    - Recommendation Engines
    
The goal of this project is to put into practice the technical skills taught during the program, but mainly to focus on the ability to communicate the results of the analysis effectively.
    
The **CRISP-DM** process (Cross-Industry Standard Process for Data Mining):
1. Business Understanding
2. Data Understanding
3. Prepare Data
4. Data Modeling
5. Evaluate the Results
6. Deploy   

### Software and Libraries
This project uses Python 3.7.2 and the following libraries:
- NumPy
- Pandas
- nltk
- scikit-learn
- Matplotlib
- seaborn
- TextBlob
- WordCloud

## Business Understanding

Looking at the suggested datasets, I was stuck at first because there were too many options. Then, since some friends and I were considering moving to Milan to be closer to our workplaces, I decided to use Airbnb data to run a sentiment analysis of the city's neighborhoods.

Questions:
- Which are the 5 best-scoring neighborhoods?
- Which are the 5 worst-scoring neighborhoods?
- How much does the overview of a neighborhood given by the hosts differ from the one given by the guests?

## Data Understanding

As already mentioned, the dataset comes from [Inside Airbnb](http://insideairbnb.com/get-the-data.html) and is composed of:
- **listings.csv**:	Detailed Listings data for Milan
- **calendar.csv**:	Detailed Calendar Data for listings in Milan
- **reviews.csv**:	Detailed Review Data for listings in Milan
- **summary_listings.csv**:	Summary information and metrics for listings in Milan (good for visualisations).
- **summary_reviews.csv**: Summary Review data and Listing ID (to facilitate time based analytics and visualisations linked to a listing).
- **neighbourhoods.csv**: Neighbourhood list for geo filter. Sourced from city or open source GIS files.
- **neighbourhoods.geojson**: GeoJSON file of neighbourhoods of the city.

In [None]:
# Import libraries necessary for this project

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from textblob import TextBlob

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

from wordcloud import WordCloud

from src.config import DATA_FOLDER

nltk.download("averaged_perceptron_tagger")
nltk.download("stopwords")

# Pretty display for notebooks
%matplotlib inline

In [None]:
# Load the datasets

df_listings_data = pd.read_csv(DATA_FOLDER + "listings.csv")
df_calendar_data = pd.read_csv(DATA_FOLDER + "calendar.csv")
df_reviews_data = pd.read_csv(DATA_FOLDER + "reviews.csv")
# df_summary_listings_data = pd.read_csv(DATA_FOLDER + 'summary_listings.csv')
# df_summary_reviews_data = pd.read_csv(DATA_FOLDER + 'summary_reviews.csv')

In [None]:
df_listings_data.head()

In [None]:
df_listings_data.columns

In [None]:
df_listings_data.info()

In [None]:
df_listings_data.describe()

In [None]:
print("Numerical variables:")

for name, values in df_listings_data.items():
    if values.dtype == np.float64 or values.dtype == np.int64:
        print(name)

In [None]:
print("Categorical variables values:")

for name, values in df_listings_data.items():
    if values.dtype != np.float64 and values.dtype != np.int64:
        print(name)

In [None]:
for name, values in df_listings_data.items():
    if values.dtype != np.float64 and values.dtype != np.int64:
        print("{name}: {value}\n".format(name=name, value=values.unique()))
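
The dtype checks above compare each column against `np.float64` and `np.int64` one by one. As a hedged alternative, pandas' `select_dtypes` can split the columns in one call. The sketch below runs on a small made-up frame (`df_example` and its values are assumptions, not the real listings data), and note that `include="number"` also catches other numeric widths such as `int32`, so the result may differ slightly from the loops above.

```python
import pandas as pd

# Toy frame standing in for df_listings_data (hypothetical values)
df_example = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "price": [50.0, 80.5, 120.0],
        "neighbourhood_cleansed": ["BRERA", "DUOMO", "ISOLA"],
    }
)

# Columns with a numeric dtype (integers and floats of any width)
numerical_columns = list(df_example.select_dtypes(include="number").columns)

# All remaining columns (object, bool, datetime, ...)
categorical_columns = list(df_example.select_dtypes(exclude="number").columns)

print("Numerical variables:", numerical_columns)
print("Categorical variables:", categorical_columns)
```

The same two lines could be applied to each of the three dataframes instead of repeating the loops.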

In [None]:
df_calendar_data.head()

In [None]:
df_calendar_data.columns

In [None]:
df_calendar_data.info()

In [None]:
df_calendar_data.describe()

In [None]:
print("Numerical variables:")

for name, values in df_calendar_data.items():
    if values.dtype == np.float64 or values.dtype == np.int64:
        print(name)

In [None]:
print("Categorical variables values:")

for name, values in df_calendar_data.items():
    if values.dtype != np.float64 and values.dtype != np.int64:
        print(name)

In [None]:
for name, values in df_calendar_data.items():
    if values.dtype != np.float64 and values.dtype != np.int64:
        print("{name}: {value}\n".format(name=name, value=values.unique()))

In [None]:
df_reviews_data.head()

In [None]:
df_reviews_data.columns

In [None]:
df_reviews_data.info()

In [None]:
df_reviews_data.describe()

In [None]:
print("Numerical variables:")

for name, values in df_reviews_data.items():
    if values.dtype == np.float64 or values.dtype == np.int64:
        print(name)

In [None]:
print("Categorical variables values:")

for name, values in df_reviews_data.items():
    if values.dtype != np.float64 and values.dtype != np.int64:
        print(name)

In [None]:
for name, values in df_reviews_data.items():
    if values.dtype != np.float64 and values.dtype != np.int64:
        print("{name}: {value}\n".format(name=name, value=values.unique()))

## Data Preparation and Data Modeling

Looking at the columns of the datasets, we can figure out which of them are useful for answering our questions. For our goal, the main focus is of course on the neighbourhoods:

In [None]:
df_listings_data_cleaned = df_listings_data[
    [
        "id",
        # , 'name'
        # , 'summary'
        # , 'space'
        # , 'description'
        "neighborhood_overview",
        # , 'transit'
        # , 'access'
        # , 'interaction'
        # , 'house_rules'
        # , 'host_about'
        # , 'host_neighbourhood'
        # , 'neighbourhood'
        "neighbourhood_cleansed",
    ]
].copy()  # work on a copy so that adding columns later does not trigger SettingWithCopyWarning

df_listings_data_cleaned.head()

In [None]:
df_listings_data_cleaned.shape

In [None]:
# len(df_listings_data_cleaned['id'].unique())

In [None]:
# Set 'id' as key in the dataframe

# df_listings_data_cleaned.set_index('id', inplace = True)

In [None]:
# df_listings_data_cleaned.head()

In [None]:
# Heatmap of the missing values

plt.figure(figsize=(20, 20))
sns.heatmap(df_listings_data_cleaned.isnull(), cmap="Blues", cbar=False)

In [None]:
len(df_listings_data["neighbourhood_cleansed"].unique())

There are only 85 unique **neighbourhood_cleansed** entries.

In [None]:
list_neighbourhood = list(df_listings_data["neighbourhood_cleansed"].unique())
list_neighbourhood = [neighbourhood.lower() for neighbourhood in list_neighbourhood]

print(list_neighbourhood)

Searching online for [Milan's neighbourhoods](http://www.museomilano.it/mediateca/media-pg-5/) and doing some data cleaning gives this list of 130 neighbourhoods:
- ticinese
- magenta
- porta vercellina
- cordusio
- carrobbio
- cinquevie
- sant’ambrogio
- verziere
- san babila
- brolo-pantano
- duomo
- castello
- sempione
- brera
- borgo degli ortolani - chinatown
- porta nuova
- centrale
- centro direzionale
- porta garibaldi
- porta venezia
- risorgimento
- porta vittoria
- porta romana
- citta’ studi
- acquabella
- porta monforte
- calvairate
- lazio
- tertulliano
- porta vigentina
- porta genova
- porta lodovica
- bullona
- taliedo mecenate
- morsenchio
- gamboloita
- castagnedo
- vigentino
- corvetto
- nosedo
- santa giulia
- rogoredo
- triulzo superiore
- ponte lambro
- forlanini
- monluè
- guastalla
- ortica
- cavriano
- lambrate
- loreto
- abadesse
- ponte seveso
- isola
- tortona
- washington
- solari
- navigli
- san pietro
- la maddalena
- pagano
- fopponino
- lotto
- molinazzo
- vaiano valle
- selvanesco
- moncucco
- san cristoforo
- lorenteggio giambellino
- primaticcio 
- arzaga
- forze armate
- bisceglie
- quarto cagnino
- quinto romano
- baggio
- muggiano
- trenno
- figino
- lampugnano
- gallaratese
- cascina merlata
- certosa
- qt8
- san siro
- portello
- cagnola
- musocco
- roserio
- vialba
- ronchetto sul naviglio
- barona
- boffalora
- chiesa rossa
- conca fallata
- cantalupa
- gratosoglio
- macconago
- quintosole
- morivione
- chiaravalle
- casoretto
- greco
- bicocca
- prato centenario
- gorla
- precotto
- villa san giovanni
- adriano
- crescenzago
- rottole
- turro
- maggiolina
- montalbino
- niguarda
- tre torri
- dergano
- affori 
- bovisasca
- comasina
- bruzzano
- bovisa 
- villa pizzone
- quarto oggiaro
- farini 
- la fontana
- ronchetto delle rane
- conchetta
- porta volta
- ghisolfa

![title](img/quartieri_milano.jpg)

As we can see, not all the neighbourhoods are represented in the dataset; moreover, there is no exact mapping between the dataset's neighbourhoods and the real ones.

In [None]:
list_real_neighbourhood = []

with open(DATA_FOLDER + "quartieri.txt", "r") as file:
    for line in file:
        item = line.replace("\n", "")  # remove linebreak
        list_real_neighbourhood.append(item)

print(len(list_real_neighbourhood))

# print(list_real_neighbourhood)

In [None]:
def is_present(item, lista):
    for elemento in lista:
        if elemento in item or item in elemento:
            return elemento
    return False


lista = ["pippo", "pluto", "paperino"]
print(is_present("paperino", lista))

In [None]:
list_mapping_neighbourhood_real_neighbourhood = []
list_no_matched_data_neighbourhood_by_real_neighbourhood = []
list_no_matched_real_neighbourhood_by_data_neighbourhood = []

for neighbourhood in list_neighbourhood:
    real_neighbourhood = is_present(neighbourhood, list_real_neighbourhood)
    if real_neighbourhood is False:
        list_no_matched_data_neighbourhood_by_real_neighbourhood.append(neighbourhood)
    else:
        list_mapping_neighbourhood_real_neighbourhood.append(
            (neighbourhood, real_neighbourhood)
        )

In [None]:
list_mapping_neighbourhood_real_neighbourhood

Checking the associations made by our function, we can see some errors that we must correct:

In [None]:
# Fix the wrong associations ('ronchetto sul naviglio', 'navigli') and ('bovisa', 'bovisasca')

list_mapping_neighbourhood_real_neighbourhood_correct = []

for i in range(len(list_mapping_neighbourhood_real_neighbourhood)):
    if list_mapping_neighbourhood_real_neighbourhood[i][0] == "ronchetto sul naviglio":
        list_mapping_neighbourhood_real_neighbourhood_correct.append(
            ("ronchetto sul naviglio", "ronchetto sul naviglio")
        )
    elif list_mapping_neighbourhood_real_neighbourhood[i][0] == "bovisa":
        list_mapping_neighbourhood_real_neighbourhood_correct.append(
            ("bovisa", "bovisa")
        )
    else:
        list_mapping_neighbourhood_real_neighbourhood_correct.append(
            list_mapping_neighbourhood_real_neighbourhood[i]
        )

list_mapping_neighbourhood_real_neighbourhood = (
    list_mapping_neighbourhood_real_neighbourhood_correct
)
list_mapping_neighbourhood_real_neighbourhood
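
The two wrong associations come from the substring test in `is_present`, which returns the first candidate that contains, or is contained in, the searched name. A minimal sketch of one way to avoid the manual patch is to try an exact match first and fall back to the substring test only afterwards. The function name `match_neighbourhood` is hypothetical, and it returns `None` rather than `False` when nothing matches.

```python
def match_neighbourhood(name, candidates):
    """Return a matching candidate for name, or None when nothing matches."""
    # 1. An exact match always wins
    if name in candidates:
        return name
    # 2. Otherwise fall back to the substring test, in list order
    for candidate in candidates:
        if candidate in name or name in candidate:
            return candidate
    return None


# Candidates in the same relative order as in the real neighbourhood list
candidates = ["navigli", "ronchetto sul naviglio", "bovisasca", "bovisa"]

print(match_neighbourhood("bovisa", candidates))
print(match_neighbourhood("ronchetto sul naviglio", candidates))
print(match_neighbourhood("corsica", candidates))
```

Names without an exact match can still be paired incorrectly by the fallback, so the resulting mapping is still worth reviewing by eye.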

In [None]:
for neighbourhood in list_real_neighbourhood:
    if neighbourhood not in [
        element[1] for element in list_mapping_neighbourhood_real_neighbourhood
    ]:
        list_no_matched_real_neighbourhood_by_data_neighbourhood.append(neighbourhood)

print(len(list_mapping_neighbourhood_real_neighbourhood))
print(len(list_no_matched_data_neighbourhood_by_real_neighbourhood))
print(len(list_no_matched_real_neighbourhood_by_data_neighbourhood))

In [None]:
print(list_no_matched_data_neighbourhood_by_real_neighbourhood)

In [None]:
print(list_no_matched_real_neighbourhood_by_data_neighbourhood)

Let's map the unmatched neighbourhoods by hand, with the help of Google Maps:

|     Data               |     Real                         |
|------------------------|----------------------------------| 
| bande nere             | primaticcio                      |
| buenos aires - venezia | porta venezia                    |
| corsica                | acquabella                       |
| de angeli - monte rosa | tre torri                        |
| garibaldi repubblica   | porta garibaldi                  |
| ortomercato            | calvairate                       |
| padova                 | isola                            |
| parco bosco in città   | quinto romano                    |
| parco delle abbazie    | vaiano valle                     |
| parco lambro - cimiano | lambrate                         |
| parco nord             | bicocca                          |
| qt 8                   | qt8                              |
| ripamonti              | vigentino                        |
| s. cristoforo          | san cristoforo                   |
| s. siro                | san siro                         |
| sacco                  | vialba                           |
| sarpi                  | borgo degli ortolani - chinatown |
| scalo romana           | vigentino                        |
| selinunte              | san siro                         |
| stadera                | chiesa rossa                     |
| tibaldi                | conchetta                        |
| umbria - molise        | calvairate                       |
| viale monza            | gorla                            |
| villapizzone           | villa pizzone                    |
| xxii marzo             | porta vittoria                   |

In [None]:
list_manual_mapping_neighbourhood_real_neighbourhood = [
    ("bande nere", "primaticcio"),
    ("buenos aires - venezia", "porta venezia"),
    ("corsica", "acquabella"),
    ("de angeli - monte rosa", "tre torri"),
    ("garibaldi repubblica", "porta garibaldi"),
    ("ortomercato", "calvairate"),
    ("padova", "isola"),
    ("parco bosco in citt\x85", "quinto romano"),
    ("parco delle abbazie", "vaiano valle"),
    ("parco lambro - cimiano", "lambrate"),
    ("parco nord", "bicocca"),
    ("qt 8", "qt8"),
    ("ripamonti", "vigentino"),
    ("s. cristoforo", "san cristoforo"),
    ("s. siro", "san siro"),
    ("sacco", "vialba"),
    ("sarpi", "borgo degli ortolani - chinatown"),
    ("scalo romana", "vigentino"),
    ("selinunte", "san siro"),
    ("stadera", "chiesa rossa"),
    ("tibaldi", "conchetta"),
    ("umbria - molise", "calvairate"),
    ("viale monza", "gorla"),
    ("villapizzone", "villa pizzone"),
    ("xxii marzo", "porta vittoria"),
]

for tupla in list_manual_mapping_neighbourhood_real_neighbourhood:
    list_mapping_neighbourhood_real_neighbourhood.append(tupla)
    list_no_matched_data_neighbourhood_by_real_neighbourhood.remove(tupla[0])
    if tupla[1] in list_no_matched_real_neighbourhood_by_data_neighbourhood:
        list_no_matched_real_neighbourhood_by_data_neighbourhood.remove(tupla[1])

print(len(list_mapping_neighbourhood_real_neighbourhood))
print(len(list_no_matched_data_neighbourhood_by_real_neighbourhood))
print(len(list_no_matched_real_neighbourhood_by_data_neighbourhood))

In [None]:
list_mapping_neighbourhood_real_neighbourhood

In [None]:
# Neighbourhoods not represented in the dataset

print(list_no_matched_real_neighbourhood_by_data_neighbourhood)

Now let's map in the dataframe **neighbourhood_cleansed** to the real neighbourhoods:

In [None]:
def get_real_neighbourhood(data_neighbourhood, list_mapping):
    for tupla in list_mapping:
        if tupla[0] == data_neighbourhood:
            return tupla[1]
    return False


print(
    get_real_neighbourhood(
        "villapizzone", list_mapping_neighbourhood_real_neighbourhood
    )
)
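
Since the mapping is a list of `(data name, real name)` pairs, the lookup above can also be sketched with a dictionary and pandas' `Series.map`. The pairs and names below are a small made-up subset used only for illustration. One difference from `get_real_neighbourhood` is that unmatched names come back as `NaN` rather than `False`.

```python
import pandas as pd

# A few pairs in the same (data name, real name) shape as the mapping list
mapping_pairs = [
    ("villapizzone", "villa pizzone"),
    ("qt 8", "qt8"),
    ("brera", "brera"),
]
mapping = dict(mapping_pairs)

# Hypothetical column values standing in for neighbourhood_cleansed
names = pd.Series(["Villapizzone", "QT 8", "Unknown Place"])

# Lowercase first, then look every value up in the dictionary
real_names = names.str.lower().map(mapping)

print(real_names.tolist())
```

A dictionary lookup also avoids scanning the whole list for every row of the dataframe.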

In [None]:
# Map neighbourhood to real neighbourhood

df_listings_data_cleaned["real_neighbourhood"] = [
    get_real_neighbourhood(
        neighbourhood.lower(), list_mapping_neighbourhood_real_neighbourhood
    )
    for neighbourhood in df_listings_data_cleaned["neighbourhood_cleansed"]
]

# Drop 'neighbourhood_cleansed' column

df_listings_data_cleaned = df_listings_data_cleaned.drop(
    columns=["neighbourhood_cleansed"]
)

In [None]:
df_listings_data_cleaned.head(20)

In [None]:
list_data_real_neighbourhood = list(df_listings_data_cleaned["real_neighbourhood"])

In [None]:
# list_data_real_neighbourhood

In [None]:
print(len(list_data_real_neighbourhood))

In [None]:
# Plot a bar chart of the number of listings per neighborhood

list_data_real_neighbourhood_count = []

for data_real_neighbourhood in set(list_data_real_neighbourhood):
    list_data_real_neighbourhood_count.append(
        (
            str(data_real_neighbourhood),
            list_data_real_neighbourhood.count(str(data_real_neighbourhood)),
        )
    )

# print(list_data_real_neighbourhood_count)

data_real_neighbourhoods = [
    data_real_neighbourhood_count[0]
    for data_real_neighbourhood_count in list_data_real_neighbourhood_count
]
counts = [
    data_real_neighbourhood_count[1]
    for data_real_neighbourhood_count in list_data_real_neighbourhood_count
]

# print(counts)

plt.figure(figsize=(20, 10))
plt.bar(data_real_neighbourhoods, counts)
plt.xticks(rotation=90)
plt.show()

In [None]:
total_number_listing = df_listings_data_cleaned.shape[0]

# print(total_number_listing)

for neighbourhood, count in list_data_real_neighbourhood_count:
    print("{0:<35} {1:>8}".format(neighbourhood, count / total_number_listing))
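
The counting loop and the share computation above can be sketched more compactly with pandas' `value_counts`. The series below is made-up data, not the real `real_neighbourhood` column. By default `value_counts` ignores missing values, which is an assumption to keep in mind when comparing the numbers.

```python
import pandas as pd

# Hypothetical values standing in for the real_neighbourhood column
neighbourhoods = pd.Series(["duomo", "duomo", "brera", "isola"])

# Absolute number of listings per neighbourhood
counts_per_neighbourhood = neighbourhoods.value_counts()

# Share of listings per neighbourhood (fractions that sum to 1)
shares = neighbourhoods.value_counts(normalize=True)

print(counts_per_neighbourhood)
print(shares)
```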

Now let's detect the language of **neighborhood_overview**:

In [None]:
# After some requests TextBlob gives: HTTP Error 429: Too Many Requests

# text_blob = TextBlob('la casa è brutta')

# print(text_blob.detect_language())

In [None]:
# def detect_language(text):
#    try:
#        return TextBlob(text).detect_language()
#    except:
#        return 'not detected'
#
# print(detect_language(22))

Searching online, I found this very interesting [blog post](http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/) about language detection by counting stop words.

In [None]:
# Adapted from http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/


def calculate_languages_ratios(text):
    """
    Count, for each language available in NLTK, the unique stopwords found in the
    given text and return a dictionary like {'french': 2, 'spanish': 4, 'english': 0}

    @param text: Text whose language is to be detected
    @type text: str

    @return: Dictionary mapping each language to the number of its unique stopwords seen in the text
    @rtype: dict
    """

    languages_ratios = {}

    tokenizer = RegexpTokenizer(r"\w+")
    word_tokens = tokenizer.tokenize(text)
    words = [word.lower() for word in word_tokens]

    # Compute per language included in nltk number of unique stopwords appearing in analyzed text
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        words_set = set(words)
        common_elements = words_set.intersection(stopwords_set)
        languages_ratios[language] = len(common_elements)  # language "score"

    return languages_ratios


def detect_language(text):
    """
    Calculate probability of given text to be written in several languages and
    return the highest scored.

    It uses a stopwords based approach, counting how many unique stopwords
    are seen in analyzed text.

    @param text: Text whose language is to be detected
    @type text: str

    @return: Highest-scoring language
    @rtype: str
    """

    try:
        ratios = calculate_languages_ratios(text)
        most_rated_language = max(ratios, key=ratios.get)
    except Exception:
        most_rated_language = "not detected"

    return most_rated_language


# input_text = "This is a sample sentence, showing off the language detection"
input_text = "Questa è una frase in italiano"

print(detect_language(input_text))
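
To make the idea concrete without downloading the NLTK corpora, here is a minimal sketch of the same stopword-overlap scoring with two tiny hand-picked stopword sets. `TOY_STOPWORDS` and `toy_detect_language` are illustrative names, and the word lists are far shorter than NLTK's. Unlike the function above, this sketch returns "not detected" when no stopword is found at all. With `max` alone, a text without any hit would still be assigned whichever language comes first.

```python
import re

# Tiny hand-picked stopword sets, for illustration only
TOY_STOPWORDS = {
    "english": {"the", "is", "a", "in", "this", "of", "and"},
    "italian": {"la", "è", "una", "in", "questa", "di", "e"},
}


def toy_detect_language(text):
    # Unique lowercase word tokens of the text
    words = set(re.findall(r"\w+", text.lower()))
    # Score each language by the number of its stopwords found in the text
    scores = {lang: len(words & sw) for lang, sw in TOY_STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # Without any stopword hit there is no evidence for any language
    return best if scores[best] > 0 else "not detected"


print(toy_detect_language("Questa è una frase in italiano"))
print(toy_detect_language("This is a sample sentence"))
print(toy_detect_language("12345"))
```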

In [None]:
# Mark each neighborhood_overview with the detected language

start_time = time.time()

df_listings_data_cleaned["detected_language"] = [
    detect_language(neighborhood_overview)
    for neighborhood_overview in df_listings_data_cleaned["neighborhood_overview"]
]

end_time = time.time()
elapsed_time = end_time - start_time

print("Elapsed time: {} seconds".format(elapsed_time))

In [None]:
df_listings_data_cleaned.head(20)

In [None]:
list_detected_language = list(df_listings_data_cleaned["detected_language"])

In [None]:
print(len(list_detected_language))

In [None]:
# Plot a bar chart of the languages detected in neighborhood_overview

list_language_count = []

for language in set(list_detected_language):
    list_language_count.append((language, list_detected_language.count(language)))

# print(list_language_count)

languages = [language_count[0] for language_count in list_language_count]
counts = [language_count[1] for language_count in list_language_count]

plt.figure(figsize=(20, 10))
plt.bar(languages, counts)
plt.show()

In [None]:
total_number_review = df_listings_data_cleaned.shape[0]

# print(total_number_review)

for language, count in list_language_count:
    print("{0:<15} {1:>8}".format(language, count / total_number_review))

The same analysis must be done on the reviews:

In [None]:
df_reviews_data_cleaned = df_reviews_data[
    [
        "listing_id",
        # , 'date'
        "comments",
    ]
].copy()  # work on a copy so that adding columns later does not trigger SettingWithCopyWarning

df_reviews_data_cleaned.head()

In [None]:
# Heatmap of the missing values

# Not very useful because there are no missing values

plt.figure(figsize=(20, 20))
sns.heatmap(df_reviews_data_cleaned.isnull(), cmap="Blues", cbar=False)

In [None]:
# Join df_reviews_data_cleaned with listings dataframe to link listing_id to the neighborhood

df_listings_reviews = df_reviews_data_cleaned.join(
    df_listings_data_cleaned[["id", "real_neighbourhood"]].set_index("id"),
    on="listing_id",
)

df_listings_reviews.head(20)

In [None]:
list_review_data_real_neighbourhood = list(df_listings_reviews["real_neighbourhood"])

In [None]:
# print(len(list_review_data_real_neighbourhood))

In [None]:
# Plot a bar chart of the number of reviews per neighborhood

list_review_data_real_neighbourhood_count = []

for data_review_real_neighbourhood in set(list_review_data_real_neighbourhood):
    list_review_data_real_neighbourhood_count.append(
        (
            str(data_review_real_neighbourhood),
            list_review_data_real_neighbourhood.count(
                str(data_review_real_neighbourhood)
            ),
        )
    )

# print(list_review_data_real_neighbourhood_count)

review_data_real_neighbourhoods = [
    review_data_real_neighbourhood_count[0]
    for review_data_real_neighbourhood_count in list_review_data_real_neighbourhood_count
]
counts = [
    review_data_real_neighbourhood_count[1]
    for review_data_real_neighbourhood_count in list_review_data_real_neighbourhood_count
]

# print(sum(counts))

plt.figure(figsize=(20, 10))
plt.bar(review_data_real_neighbourhoods, counts)
plt.xticks(rotation=90)
plt.show()

In [None]:
# Divide by the number of reviews (not listings) to get each neighbourhood's share of reviews
total_number_review = df_listings_reviews.shape[0]

# print(total_number_review)

for neighbourhood, count in list_review_data_real_neighbourhood_count:
    print("{0:<35} {1:>8}".format(neighbourhood, count / total_number_review))

In [None]:
# Mark each review with the detected language

start_time = time.time()

df_reviews_data_cleaned["detected_language"] = [
    detect_language(comment) for comment in df_reviews_data_cleaned["comments"]
]

end_time = time.time()
elapsed_time = end_time - start_time

print("Elapsed time: {} seconds".format(elapsed_time))

In [None]:
df_reviews_data_cleaned.head(20)

In [None]:
list_detected_language = list(df_reviews_data_cleaned["detected_language"])

In [None]:
print(len(list_detected_language))

In [None]:
# Plot a bar chart of the languages detected in the comments

list_language_count = []

for language in set(list_detected_language):
    list_language_count.append((language, list_detected_language.count(language)))

# print(list_language_count)

languages = [language_count[0] for language_count in list_language_count]
counts = [language_count[1] for language_count in list_language_count]

plt.figure(figsize=(20, 10))
plt.bar(languages, counts)
plt.show()

In [None]:
total_number_review = df_reviews_data_cleaned.shape[0]

# print(total_number_review)

for language, count in list_language_count:
    print("{0:<15} {1:>8}".format(language, count / total_number_review))

In [None]:
# Save dataframe

df_listings_data_cleaned.to_csv(DATA_FOLDER + "output_" + "listings.csv")
df_reviews_data_cleaned.to_csv(DATA_FOLDER + "output_" + "reviews.csv")

# df_listings_data_cleaned = pd.read_csv(DATA_FOLDER + 'output_' + 'listings.csv')
# df_reviews_data_cleaned = pd.read_csv(DATA_FOLDER + 'output_' + 'reviews.csv')

As a first step, for the sake of simplicity, let's focus on the neighborhood overviews and reviews written in English.
Some possible strategies for tackling the other languages could be:
- Translate everything to English
- Repeat the same steps for the other languages (starting with Italian, because it is the second most used)

The words that give us context are the ones related to the neighborhood:

Synonyms of neighborhood (*quartiere* in Italian):
- [`English`](https://www.thesaurus.com/browse/neighborhood):
    - area
    - block
    - district
    - ghetto
    - parish
    - part
    - precinct
    - region
    - section
    - slum
    - street
    - suburb
    - territory
    - zone
- [`Italian`](https://dizionari.corriere.it/dizionario_sinonimi_contrari/Q/quartiere.shtml):
    - zona
    - vicinato
    - rione
    - sobborgo
    - borgata
 
As already said, let's focus on English and extract only the records detected as English:

In [None]:
df_listings_data_cleaned_eng = df_listings_data_cleaned[
    df_listings_data_cleaned["detected_language"] == "english"
].copy()  # copy the filtered rows so that new columns can be added safely

# df_listings_data_cleaned_eng.set_index('id', inplace = True)

df_listings_data_cleaned_eng.head()

In [None]:
print(df_listings_data_cleaned_eng.shape[0])

In [None]:
df_reviews_data_cleaned_eng = df_reviews_data_cleaned[
    df_reviews_data_cleaned["detected_language"] == "english"
].copy()  # copy the filtered rows so that new columns can be added safely

df_reviews_data_cleaned_eng.head()

In [None]:
print(df_reviews_data_cleaned_eng.shape[0])

For the listings, **neighborhood_overview** gives us the sentiment of the neighborhood directly, but for the reviews we have to extract from **comments** only the sentences related to the neighborhood.

In [None]:
text_blob = TextBlob("in my opinion textblob is very usefull")

# print(text_blob.detect_language()) # after some requests gives HTTP Error 429: Too Many Requests
print(text_blob.tags)
print(text_blob.words)
print(text_blob.sentiment.polarity)

In [None]:
def get_polarity_sentiment(text):
    return TextBlob(text).sentiment.polarity


print(get_polarity_sentiment("pippo is awesome"))

In [None]:
# Mark each review with the neighborhood sentiment

start_time = time.time()

df_listings_data_cleaned_eng["neighborhood_sentiment"] = [
    get_polarity_sentiment(row["neighborhood_overview"])
    for index, row in df_listings_data_cleaned_eng.iterrows()
]

end_time = time.time()
elapsed_time = end_time - start_time

print("Elapsed time: {} seconds".format(elapsed_time))

In [None]:
df_listings_data_cleaned_eng.head(20)

In [None]:
df_listings_neighbourhood_sentiment = df_listings_data_cleaned_eng[
    ["real_neighbourhood", "neighborhood_sentiment"]
]

df_listings_neighbourhood_sentiment = df_listings_neighbourhood_sentiment.groupby(
    ["real_neighbourhood"], as_index=False
)["neighborhood_sentiment"].mean()

df_listings_neighbourhood_sentiment = df_listings_neighbourhood_sentiment.sort_values(
    by="neighborhood_sentiment", ascending=False
)

df_listings_neighbourhood_sentiment

In [None]:
# Let's peek at the neighborhood_overview of the best and the worst scoring neighborhood

list_neighborhood_overview = list(
    df_listings_data_cleaned_eng[
        df_listings_data_cleaned_eng["real_neighbourhood"] == "sempione"
    ]["neighborhood_overview"]
)
string_list_neighborhood_overview = " ".join(list_neighborhood_overview)
# print(string_list_neighborhood_overview)

tokenizer = RegexpTokenizer(r"\w+")
word_tokens = tokenizer.tokenize(string_list_neighborhood_overview)
words = [word.lower() for word in word_tokens]
stop_words = set(stopwords.words("english"))

# print(stop_words)

filtered_list_neighborhood_overview = [w for w in words if w not in stop_words]
filtered_string_list_neighborhood_overview = " ".join(
    filtered_list_neighborhood_overview
)
# print(filtered_string_list_neighborhood_overview)

text_blob = TextBlob(filtered_string_list_neighborhood_overview)
df_neighborhood_overview_word_count = pd.DataFrame(
    (text_blob.word_counts).items(), columns=["word", "count"]
)
df_neighborhood_overview_word_count = df_neighborhood_overview_word_count.sort_values(
    by="count", ascending=False
)
df_neighborhood_overview_word_count.head(20)

In [None]:
list_neighborhood_overview = list(
    df_listings_data_cleaned_eng[
        df_listings_data_cleaned_eng["real_neighbourhood"] == "figino"
    ]["neighborhood_overview"]
)
string_list_neighborhood_overview = " ".join(list_neighborhood_overview)
# print(string_list_neighborhood_overview)

tokenizer = RegexpTokenizer(r"\w+")
word_tokens = tokenizer.tokenize(string_list_neighborhood_overview)
words = [word.lower() for word in word_tokens]
stop_words = set(stopwords.words("english"))

# print(stop_words)

filtered_list_neighborhood_overview = [w for w in words if w not in stop_words]
filtered_string_list_neighborhood_overview = " ".join(
    filtered_list_neighborhood_overview
)
# print(filtered_string_list_neighborhood_overview)

text_blob = TextBlob(filtered_string_list_neighborhood_overview)
df_neighborhood_overview_word_count = pd.DataFrame(
    (text_blob.word_counts).items(), columns=["word", "count"]
)
df_neighborhood_overview_word_count = df_neighborhood_overview_word_count.sort_values(
    by="count", ascending=False
)
df_neighborhood_overview_word_count.head(20)
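
The two cells above repeat the same tokenize, filter and count steps for 'sempione' and 'figino'. As a hedged sketch, the shared logic could live in one helper. The version below swaps in `collections.Counter` and a plain regular expression in place of TextBlob and NLTK, so it runs without extra downloads. The name `top_words` and the sample texts are made up for illustration.

```python
import re
from collections import Counter


def top_words(texts, stop_words, n=20):
    """Return the n most frequent non-stopword tokens across all texts."""
    # Join the texts, lowercase them and split them into word tokens
    tokens = re.findall(r"\w+", " ".join(texts).lower())
    # Count every token that is not a stopword
    counts = Counter(token for token in tokens if token not in stop_words)
    return counts.most_common(n)


sample_texts = ["The park is near the metro", "Metro and park, quiet park"]
sample_stop_words = {"the", "is", "and", "near"}

print(top_words(sample_texts, sample_stop_words, n=2))
```

In the notebook, `texts` would be the list of **neighborhood_overview** strings for one neighbourhood and `stop_words` the NLTK English stopword set.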

It is interesting to note that the score is strongly affected by how much text is available for a neighborhood. For example, the worst-scoring one, 'figino', has only one **neighborhood_overview**, so its sentiment is very low compared to the others even though the overview itself is fairly positive.
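
One way to make this effect visible is to report, next to the mean sentiment, how many overviews each neighbourhood has, and optionally to rank only the ones above a minimum count. The sketch below uses made-up numbers, and `MIN_OVERVIEWS` is a hypothetical threshold, not a value taken from the analysis.

```python
import pandas as pd

# Made-up sentiment values in the same shape as the English listings dataframe
df_example = pd.DataFrame(
    {
        "real_neighbourhood": ["sempione", "sempione", "sempione", "figino"],
        "neighborhood_sentiment": [0.4, 0.3, 0.5, 0.1],
    }
)

# Mean sentiment and number of overviews per neighbourhood
summary = (
    df_example.groupby("real_neighbourhood")["neighborhood_sentiment"]
    .agg(["mean", "count"])
    .reset_index()
)

MIN_OVERVIEWS = 2  # hypothetical threshold
reliable = summary[summary["count"] >= MIN_OVERVIEWS]

print(summary)
print(reliable)
```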

In [None]:
df_listings_neighbourhood_sentiment.describe()

In [None]:
# Plot the distribution of the sentiment of the neighbourhoods

df_listings_neighbourhood_sentiment.hist(column="neighborhood_sentiment")

The distribution of the neighborhood sentiment is skewed right, presumably because hosts tend to give a positive overview of their own neighborhood.

In [None]:
# Plot a bar chart of the detected sentiment of the neighborhoods

neighborhoods = []
sentiments = []

for index, row in df_listings_neighbourhood_sentiment.iterrows():
    neighborhoods.append(row["real_neighbourhood"])
    sentiments.append(row["neighborhood_sentiment"])

plt.figure(figsize=(20, 10))
plt.bar(neighborhoods, sentiments)
plt.xticks(rotation=90)
plt.show()

Now let's search the reviews for the keywords related to the neighborhood:

In [None]:
searched_words_english = [
    "neighborhood",
    "area",
    "block",
    "district",
    "ghetto",
    "parish",
    # , 'part' # removed because it was mostly used in parts of comments not related to the neighborhood
    "precinct",
    "region",
    "section",
    "slum",
    "street",
    "suburb",
    "territory",
    "zone",
    "location",
]

# searched_words_italian = ['quartiere'
#                          , 'zona'
#                          , 'vicinato'
#                          , 'rione'
#                          , 'sobborgo'
#                          , 'borgata'
#                          ]

# searched_words  = searched_words_english + searched_words_italian

In [None]:
def detect_words(text, searched_words):
    try:
        for word in searched_words:
            if word in text:
                return True
    except Exception:
        return False
    return False


print(detect_words("questo testo continene pippo", ["pippo", "pluto"]))
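
Note that `word in text` behaves differently depending on the input. On a string it is a substring test, so "block" would also match "blockbuster". On a list of tokens it is an exact but case-sensitive comparison. A minimal sketch of a variant that always compares whole, lowercased tokens follows. `contains_any_word` is a hypothetical name and the searched words are assumed to be lowercase.

```python
import re


def contains_any_word(text, searched_words):
    """Return True if any searched word appears in text as a whole token."""
    # Missing values (e.g. NaN) are not strings and contain no words
    if not isinstance(text, str):
        return False
    # Unique lowercase word tokens of the text
    tokens = set(re.findall(r"\w+", text.lower()))
    return not tokens.isdisjoint(searched_words)


print(contains_any_word("Great Area, close to the metro", ["area", "street"]))
print(contains_any_word("A real blockbuster of a flat", ["block"]))
print(contains_any_word(None, ["area"]))
```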

In [None]:
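def clean_tokenize_text(text, language):
    """Split `text` into word tokens and drop the stop words of `language`.

    Returns False if the text cannot be processed (e.g. it is not a string
    or the language has no stop word list).
    """
    try:
        # Keep only alphanumeric runs, which also strips punctuation
        tokenizer = RegexpTokenizer(r"\w+")
        word_tokens = tokenizer.tokenize(text)
        stop_words = set(stopwords.words(language))
        filtered_sentence = [word for word in word_tokens if word not in stop_words]
        return filtered_sentence
    except Exception:
        return False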


input_text = "This is a sample sentence, showing off the stop words filtration!!!"
print(clean_tokenize_text(input_text, "english"))
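
Stop word lists are usually lowercase, so a capitalized token such as "This" is not filtered out unless the tokens are lowercased first. The following sketch illustrates the effect with the standard library only; `DEMO_STOP_WORDS` is a small made-up set standing in for the NLTK list, and `tokenize_and_filter` is a hypothetical helper.

```python
import re

# Small made-up stop word set, standing in for NLTK's English list
DEMO_STOP_WORDS = {"this", "is", "a", "the", "off"}


def tokenize_and_filter(text, stop_words, lowercase=True):
    """Tokenize on alphanumeric runs and drop stop words."""
    tokens = re.findall(r"\w+", text)
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in stop_words]


sample = "This is a sample sentence, showing off the stop words filtration!!!"
case_sensitive = tokenize_and_filter(sample, DEMO_STOP_WORDS, lowercase=False)
lowercased = tokenize_and_filter(sample, DEMO_STOP_WORDS, lowercase=True)

print(case_sensitive)
print(lowercased)
```

Lowercasing would also let a keyword such as "location" match a review that starts with "Location".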

Let's check the first comment:

In [None]:
# Tokenize and clean comments
comment, language = df_reviews_data_cleaned_eng[["comments", "detected_language"]].iloc[
    0
]

print(comment, language)

In [None]:
print(clean_tokenize_text(comment, language))

In [None]:
print(detect_words(clean_tokenize_text(comment, language), searched_words_english))

In [None]:
# Mark each review if searched words are present

start_time = time.time()

df_reviews_data_cleaned_eng["contains_searched_words"] = [
    detect_words(
        clean_tokenize_text(row["comments"], row["detected_language"]),
        searched_words_english,
    )
    for index, row in df_reviews_data_cleaned_eng.iterrows()
]

end_time = time.time()
elapsed_time = end_time - start_time

print("Elapsed time: {} seconds".format(elapsed_time))

In [None]:
df_reviews_data_cleaned_eng.head()

In [None]:
list_contains_serached_words = list(
    df_reviews_data_cleaned_eng["contains_searched_words"]
)

In [None]:
print(len(list_contains_serached_words))

In [None]:
print(sum(list_contains_serached_words))

In [None]:
print(sum(list_contains_serached_words) / df_reviews_data_cleaned_eng.shape[0])

Only 41% of the English reviews contain words related to the neighborhood.

We will consider only records containing the searched words:

In [None]:
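# Use the boolean column directly as a row mask (`is True` would compare the
# whole Series by identity) and copy the result so new columns can be added safely
df_reviews_data_cleaned_eng_contains = df_reviews_data_cleaned_eng[
    df_reviews_data_cleaned_eng["contains_searched_words"]
].copy()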

df_reviews_data_cleaned_eng_contains.head()
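
Row filtering in pandas relies on an element-wise boolean mask. Python's `is` operator compares object identity, so `series is True` evaluates to a plain `False` instead of a mask. The sketch below shows the difference on a toy dataframe (the data is made up for illustration).

```python
import pandas as pd

toy_reviews = pd.DataFrame(
    {
        "comments": ["nice area", "clean room", "quiet street"],
        "contains_searched_words": [True, False, True],
    }
)

mask = toy_reviews["contains_searched_words"]

# Identity check on the whole Series: a single Python bool, not a mask
identity_check = mask is True

# Element-wise mask: keeps the rows where the column is True
filtered = toy_reviews[mask].copy()

print(identity_check)
print(filtered)
```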

Now the goal is to isolate the phrases that mention the neighborhood or a related term:

In [None]:
comment = df_reviews_data_cleaned_eng_contains["comments"].iloc[0]

print(comment)

Looking at some comments, I realized I could use punctuation to isolate the phrases related to the neighborhood instead of removing it as I had planned at the beginning.

In [None]:
def get_contextual_phrase(text, language, searched_words):
    contextual_phrase = ""
    sentences = text.split(".")
    for sentence in sentences:
        if (
            detect_words(clean_tokenize_text(sentence, language), searched_words)
            is True
        ):
            contextual_phrase = contextual_phrase + " " + sentence
    if contextual_phrase == "":
        return text
    else:
        return contextual_phrase


text = "Staying at Francesca's and Alberto's place was a pleasure. Just as described, well located for my purposes, an enjoyable walk to the Tortona area. The room is very nice, cleaned daily and has private bathroom.Francesca is super friendly and very helpful; whilst still respecting privacy. Overall a great experience!"

print(get_contextual_phrase(text, "english", searched_words_english))
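
The function above splits on periods only. Reviews often end sentences with `!` or `?` as well, so a possible variant is sketched below with the standard library. `split_sentences` and `keep_matching_sentences` are hypothetical names, and the tokenization is a simplified stand-in for `clean_tokenize_text`.

```python
import re


def split_sentences(text):
    """Split on runs of '.', '!' or '?' and drop empty pieces."""
    parts = re.split(r"[.!?]+", text)
    return [part.strip() for part in parts if part.strip()]


def keep_matching_sentences(text, searched_words):
    """Keep only the sentences that contain one of the searched words."""
    kept = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"\w+", sentence.lower())
        if any(word in tokens for word in searched_words):
            kept.append(sentence)
    return kept


sample = "Great location! The room was clean. Is the area safe? Yes."
matching = keep_matching_sentences(sample, ["location", "area"])

print(matching)
```

With a period-only split, "Great location! The room was clean" would be kept as a single phrase, mixing neighborhood-related and unrelated text.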

In [None]:
# Mark each review with contextual phrases

start_time = time.time()

df_reviews_data_cleaned_eng_contains["contextual_phrases"] = [
    get_contextual_phrase(
        row["comments"], row["detected_language"], searched_words_english
    )
    for index, row in df_reviews_data_cleaned_eng_contains.iterrows()
]

end_time = time.time()
elapsed_time = end_time - start_time

print("Elapsed time: {} seconds".format(elapsed_time))

In [None]:
df_reviews_data_cleaned_eng_contains.head()

Now we can evaluate the sentiment of the neighborhood-related sentences:

In [None]:
comment = df_reviews_data_cleaned_eng_contains["contextual_phrases"].iloc[0]

print(comment)

text_blob = TextBlob(comment)

print(text_blob.tags)
print(text_blob.words)
print(text_blob.sentiment.polarity)

In [None]:
# Mark each review with the neighborhood sentiment

start_time = time.time()

df_reviews_data_cleaned_eng_contains["neighborhood_sentiment"] = [
    get_polarity_sentiment(row["contextual_phrases"])
    for index, row in df_reviews_data_cleaned_eng_contains.iterrows()
]

end_time = time.time()
elapsed_time = end_time - start_time

print("Elapsed time: {} seconds".format(elapsed_time))

In [None]:
df_reviews_data_cleaned_eng_contains.head(20)

In [None]:
df_reviews_listing_neighbourhood_sentiment = df_reviews_data_cleaned_eng_contains[
    ["listing_id", "neighborhood_sentiment"]
]

df_reviews_listing_neighbourhood_sentiment = (
    df_reviews_listing_neighbourhood_sentiment.groupby(["listing_id"], as_index=False)[
        "neighborhood_sentiment"
    ].mean()
)

df_reviews_listing_neighbourhood_sentiment = (
    df_reviews_listing_neighbourhood_sentiment.sort_values(
        by="neighborhood_sentiment", ascending=False
    )
)

df_reviews_listing_neighbourhood_sentiment

In [None]:
df_reviews_listing_neighbourhood_sentiment.describe()

In [None]:
# Join df_reviews_listing_neighbourhood_sentiment with the listings dataframe to link each sentiment to its neighborhood

df_listings_reviews_sentiment = df_listings_data_cleaned.join(
    df_reviews_listing_neighbourhood_sentiment.set_index("listing_id"), on="id"
)[["id", "real_neighbourhood", "neighborhood_sentiment"]]

df_listings_reviews_sentiment.head()

In [None]:
# Let's peek at some records

print(df_reviews_data_cleaned["comments"].iloc[6400])

In [None]:
df_reviews_neighbourhood_sentiment = df_listings_reviews_sentiment[
    ["real_neighbourhood", "neighborhood_sentiment"]
]

df_reviews_neighbourhood_sentiment = df_reviews_neighbourhood_sentiment.groupby(
    ["real_neighbourhood"], as_index=False
)["neighborhood_sentiment"].mean()

df_reviews_neighbourhood_sentiment = df_reviews_neighbourhood_sentiment.sort_values(
    by="neighborhood_sentiment", ascending=False
)

df_reviews_neighbourhood_sentiment
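
Note that the neighborhood score computed above is an average of per-listing averages, so every listing counts equally regardless of how many reviews it has. The sketch below, on made-up numbers, shows how this can differ from averaging all reviews of a neighborhood together. Which of the two is preferable depends on the question being asked.

```python
import pandas as pd

# Toy data: listing 1 has three reviews, listing 2 has one
toy_reviews = pd.DataFrame(
    {
        "listing_id": [1, 1, 1, 2],
        "real_neighbourhood": ["brera"] * 4,
        "neighborhood_sentiment": [0.9, 0.9, 0.9, 0.1],
    }
)

# Step 1: average per listing; step 2: average the listing means per neighborhood
per_listing = toy_reviews.groupby(
    ["real_neighbourhood", "listing_id"], as_index=False
)["neighborhood_sentiment"].mean()
mean_of_means = per_listing.groupby("real_neighbourhood")[
    "neighborhood_sentiment"
].mean()

# Alternative: pool all reviews of the neighborhood in a single average
pooled_mean = toy_reviews.groupby("real_neighbourhood")[
    "neighborhood_sentiment"
].mean()

print(mean_of_means["brera"], pooled_mean["brera"])
```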

In [None]:
len(df_reviews_neighbourhood_sentiment["real_neighbourhood"].unique())

In [None]:
df_reviews_neighbourhood_sentiment.describe()

In [None]:
# Plot the distribution of the sentiment of the neighbourhoods

df_reviews_neighbourhood_sentiment.hist(column="neighborhood_sentiment")

The distribution of the neighborhood sentiment is closer to a Gaussian than the one obtained from **neighborhood_overview**.

In [None]:
# Plot a bar chart of the detected sentiment for each neighborhood

neighborhoods = []
sentiments = []

for index, row in df_reviews_neighbourhood_sentiment.iterrows():
    neighborhoods.append(row["real_neighbourhood"])
    sentiments.append(row["neighborhood_sentiment"])

plt.figure(figsize=(20, 10))
plt.bar(neighborhoods, sentiments)
plt.xticks(rotation=90)
plt.show()

Let's compare the sentiment derived from the **neighborhood_overview** column of the listings dataframe with the sentiment derived from the reviews:

In [None]:
df_listings_neighbourhood_sentiment.head()

In [None]:
df_reviews_neighbourhood_sentiment.head()

In [None]:
df_neighbourhood_sentiment = pd.DataFrame(
    list_real_neighbourhood, columns=["real_neighbourhood"]
)

df_neighbourhood_sentiment.head()

In [None]:
df_neighbourhood_sentiment = df_neighbourhood_sentiment.set_index(
    "real_neighbourhood"
).join(
    df_listings_neighbourhood_sentiment.set_index("real_neighbourhood"),
    rsuffix="_listing",
)
df_neighbourhood_sentiment.rename(
    columns={"neighborhood_sentiment": "neighborhood_sentiment_listing"}, inplace=True
)

In [None]:
df_neighbourhood_sentiment = df_neighbourhood_sentiment.join(
    df_reviews_neighbourhood_sentiment.set_index("real_neighbourhood"),
    rsuffix="_review",
)
df_neighbourhood_sentiment.rename(
    columns={"neighborhood_sentiment": "neighborhood_sentiment_review"}, inplace=True
)
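
In `DataFrame.join`, `lsuffix` and `rsuffix` are only applied to column names that appear on both sides, which is why the explicit `rename` calls are needed to get distinct column names. A compact alternative, sketched here on toy data (the values are made up), is to rename each column before joining.

```python
import pandas as pd

toy_listing = pd.DataFrame(
    {"real_neighbourhood": ["brera", "duomo"], "neighborhood_sentiment": [0.30, 0.25]}
)
toy_review = pd.DataFrame(
    {"real_neighbourhood": ["brera", "duomo"], "neighborhood_sentiment": [0.40, 0.35]}
)

# Give each sentiment column its final name first, then join on the neighborhood
combined = (
    toy_listing.rename(
        columns={"neighborhood_sentiment": "neighborhood_sentiment_listing"}
    )
    .set_index("real_neighbourhood")
    .join(
        toy_review.rename(
            columns={"neighborhood_sentiment": "neighborhood_sentiment_review"}
        ).set_index("real_neighbourhood"),
        how="left",
    )
)

print(combined)
```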

In [None]:
df_neighbourhood_sentiment.head()

In [None]:
df_neighbourhood_sentiment = df_neighbourhood_sentiment.sort_values(
    by="neighborhood_sentiment_review", ascending=False
)

In [None]:
df_neighbourhood_sentiment.head()

In [None]:
df_neighbourhood_sentiment_dropna = df_neighbourhood_sentiment.dropna(
    subset=["neighborhood_sentiment_review"], axis=0
)
df_neighbourhood_sentiment_dropna.tail()

In [None]:
# Positive values mean the hosts' overview scored higher than the guests' reviews
df_neighbourhood_sentiment["difference"] = (
    df_neighbourhood_sentiment["neighborhood_sentiment_listing"]
    - df_neighbourhood_sentiment["neighborhood_sentiment_review"]
)

In [None]:
df_neighbourhood_sentiment.describe()

In [None]:
df_neighbourhood_sentiment[df_neighbourhood_sentiment["difference"] < 0]

In [None]:
df_neighbourhood_sentiment[df_neighbourhood_sentiment["difference"] > 0]

Interestingly, most **neighborhood_sentiment_review** values are higher than the corresponding **neighborhood_sentiment_listing** values. One possible explanation is that in the **neighborhood_overview** field hosts tend to use many words to describe the neighborhood, whereas extracting only the neighborhood-related sentences from the reviews yields a more focused string for the sentiment analysis and therefore a higher overall score.

Let's draw word clouds to compare the words behind **neighborhood_sentiment_listing** and **neighborhood_sentiment_review**:

In [None]:
show_neighborhood = [
    "sempione",
    "brera",
    "duomo",
    "ticinese",
    "quarto oggiaro",
    "bovisasca",
    "comasina",
    "quinto romano",
]

for neighborhood in show_neighborhood:
    list_neighborhood_overview = list(
        df_listings_data_cleaned_eng[
            df_listings_data_cleaned_eng["real_neighbourhood"] == neighborhood
        ]["neighborhood_overview"]
    )
    string_list_neighborhood_overview = " ".join(list_neighborhood_overview)
    # print(string_list_neighborhood_overview)

    list_contextual_phrases = list(
        (
            df_listings_data_cleaned_eng[
                df_listings_data_cleaned_eng["real_neighbourhood"] == neighborhood
            ]
        ).join(
            df_reviews_data_cleaned_eng_contains[
                ["listing_id", "contextual_phrases"]
            ].set_index("listing_id"),
            on="id",
        )["contextual_phrases"]
    )
    # print(list_contextual_phrases)
    list_contextual_phrases = [x for x in list_contextual_phrases if str(x) != "nan"]
    string_list_contextual_phrases = " ".join(list_contextual_phrases)
    # print(string_list_contextual_phrases)

    stop_words = set(stopwords.words("english"))
    stop_words.update(searched_words_english)
    stop_words.update([neighborhood, "milan", "city", "apartment"])
    stop_words.remove("not")
    stop_words.remove("no")
    stop_words.remove("nor")
    # print(stop_words)

    f = plt.figure(figsize=(20, 10))
    f.suptitle(neighborhood)
    ax = f.add_subplot(121)
    ax2 = f.add_subplot(122)

    ax.set_title("listing_neighborhood_overview")
    ax.axis("off")

    ax2.set_title("review_contextual_phrases")
    ax2.axis("off")

    try:
        wordcloud_neighborhood_overview = WordCloud(stopwords=stop_words).generate(
            string_list_neighborhood_overview
        )
        ax.imshow(wordcloud_neighborhood_overview, interpolation="bilinear")

        wordcloud_contextual_phrases = WordCloud(stopwords=stop_words).generate(
            string_list_contextual_phrases
        )
        ax2.imshow(wordcloud_contextual_phrases, interpolation="bilinear")
    except Exception:
        # Skip neighborhoods whose word cloud cannot be built (e.g. no text left)
        pass
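
A word cloud is driven by word frequencies, so a plain-text view of the most frequent words can complement the images. The sketch below uses only the standard library. `demo_stop_words` is a small made-up set (not the NLTK list used above) and `top_words` is a hypothetical helper.

```python
import re
from collections import Counter


def top_words(text, stop_words, n=5):
    """Return the n most frequent words that are not stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(token for token in tokens if token not in stop_words)
    return counts.most_common(n)


# Made-up stop words; negations such as "not" are kept on purpose
demo_stop_words = {"the", "is", "a", "and", "to", "milan", "area"}
demo_text = "The area is quiet and safe. Quiet streets, close to the metro. Not noisy."

top = top_words(demo_text, demo_stop_words, n=3)
print(top)
```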