# Airbnb London - Predict property rating based on sentiment analysis

This study is divided into two parts.

#### Step 1. Sentiment Analysis
Sentiment analysis is the process of identifying the attitude of an author towards the topic being written about. With sentiment analysis, text can be categorized into a variety of sentiments. 
Ideally, to perform sentiment analysis as a supervised machine learning task, we would have to manually associate each text record with a “sentiment” for training. Once trained, a prediction model assigns “positive” or “negative” sentiments to test records.

For simplicity, this work will be based on only two categories, positive and negative. Also, instead of manually labelling each comment, we will use an automated algorithm for this purpose.


#### Step 2. Property rating prediction based on sentiment analysis

Based on the sentiment scores of a set of review comments, for each property, we will try to infer its rating.


**Note:** Portions of this notebook were based on the following articles:
* https://towardsdatascience.com/social-media-sentiment-analysis-49b395771197
* https://www.kaggle.com/madatpython/python-nltk-sentiment-analysis
* https://www.kaggle.com/sasikala11/sentiment-analysis-using-python
* https://algotrading101.com/learn/sentiment-analysis-python-guide/
* https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
* https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk
* https://www.kaggle.com/leandrodoze/sentiment-analysis-in-portuguese
* https://realpython.com/python-nltk-sentiment-analysis/

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.stem.porter import *
from nltk.corpus import stopwords
from nltk.classify import SklearnClassifier

from wordcloud import WordCloud,STOPWORDS
from gensim.models import word2vec

import sklearn
from sklearn.model_selection import train_test_split # function for splitting data to train and test sets
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
#from sklearn.metrics import jaccard_similarity_score
cv = CountVectorizer()
from sklearn.metrics.pairwise import cosine_similarity

%matplotlib inline
plt.style.use('bmh')
import warnings
warnings.filterwarnings('ignore')

## 1. Read dataset

In [None]:
# Load every CSV file under /kaggle/input (except calendar.csv) into a dict of
# DataFrames, keyed by file name without extension (e.g. 'listings', 'reviews')
import os
dfdict = dict()
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if filename != 'calendar.csv' and filename.endswith('.csv'):
            filepath = os.path.join(dirname, filename)
            print(filepath)
            dfdict[Path(filepath).stem] = pd.read_csv(filepath)


### 1.1. Convert data types and remove irrelevant columns

In [None]:
# Remove columns with almost no values: listings: bathrooms, neighbourhood_group_cleansed, calendar_updated, license
dfdict['listings'].drop(columns=['bathrooms', 'neighbourhood_group_cleansed', 'calendar_updated', 'license'], inplace=True)
# Remove irrelevant columns
dfdict['listings'].drop(columns=['listing_url', 'picture_url', 'host_url', 'host_name', 'host_thumbnail_url', 'host_picture_url'], inplace=True)
dfdict['listings'].drop(columns=['minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 
                                 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm'], inplace=True)

In [None]:
dfdict['reviews']['date'] = pd.to_datetime(dfdict['reviews']['date'])
dfdict['listings']['last_scraped'] = pd.to_datetime(dfdict['listings']['last_scraped'])
dfdict['listings']['host_since'] = pd.to_datetime(dfdict['listings']['host_since'])
dfdict['listings']['calendar_last_scraped'] = pd.to_datetime(dfdict['listings']['calendar_last_scraped'])
dfdict['listings']['first_review'] = pd.to_datetime(dfdict['listings']['first_review'])
dfdict['listings']['last_review'] = pd.to_datetime(dfdict['listings']['last_review'])

In [None]:
def convert_price(df_column):
    # regex=False: '$' must be treated as a literal character, not as the regex end-of-string anchor
    return df_column.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)

In [None]:
dfdict['listings']['price'] = convert_price(dfdict['listings']['price'])

In [None]:
def convert_boolean(df_column):
    return df_column.replace({'f': 0, 't': 1}).astype('boolean')

In [None]:
# Convert t/f fields to boolean
# calendar dataframe : available
# listings dataframe : (host_is_superhost, host_has_profile_pic, host_identity_verified, has_availability, instant_bookable)
dfdict['listings']['host_is_superhost'] = convert_boolean(dfdict['listings']['host_is_superhost'])
dfdict['listings']['host_has_profile_pic'] = convert_boolean(dfdict['listings']['host_has_profile_pic'])
dfdict['listings']['host_identity_verified'] = convert_boolean(dfdict['listings']['host_identity_verified'])
dfdict['listings']['has_availability'] = convert_boolean(dfdict['listings']['has_availability'])
dfdict['listings']['instant_bookable'] = convert_boolean(dfdict['listings']['instant_bookable'])
# Convert 'host_acceptance_rate', 'host_response_rate', removing the % (literal replacement, so regex=False)
dfdict['listings']['host_acceptance_rate'] = dfdict['listings']['host_acceptance_rate'].str.replace('%', '', regex=False).str.replace(',', '', regex=False).astype(float)
dfdict['listings']['host_response_rate'] = dfdict['listings']['host_response_rate'].str.replace('%', '', regex=False).str.replace(',', '', regex=False).astype(float)

### 1.2. Setup indices

In [None]:
dfdict['listings'].set_index('id', inplace=True)
dfdict['reviews'].set_index('id', inplace=True)

## 2. Data overview

### 2.1. Overview of the `reviews` dataset

In [None]:
dfdict['reviews']

As we can see, there are 5 attributes present in the dataset, with a total of 1,163,886 reviews. Each review has a date and is associated with a property (`listing_id`) and with a reviewer (`reviewer_id`, `reviewer_name`). Each review has a text field with `comments`.

### We need to label each comment as positive (1), neutral (0) or negative (-1)



## 3. Data preprocessing

When performing the sentiment analysis of property comments, our approach will be based on performing individual sentiment analysis on the comment of each reviewer, and then aggregating sentiment scores for each property.
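
The two-stage idea described above (score each comment, then aggregate per property) can be sketched in plain Python. This is only an illustration under stated assumptions: the listing ids and scores are made up, `aggregate_by_listing` is a hypothetical helper, and the mean is just one possible aggregate.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (listing_id, sentiment score) pairs, one per review.
review_scores = [
    (101, 0.90),
    (101, 0.50),
    (202, -0.30),
    (202, 0.70),
    (202, 0.20),
]

def aggregate_by_listing(pairs):
    """Group per-review scores by listing and return the mean score per listing."""
    grouped = defaultdict(list)
    for listing_id, score in pairs:
        grouped[listing_id].append(score)
    return {listing_id: mean(scores) for listing_id, scores in grouped.items()}

property_scores = aggregate_by_listing(review_scores)
print(property_scores)
```

In the notebook itself the same grouping is done with pandas `groupby`, which also makes it easy to compare several aggregates (median, percentiles, variance) side by side.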

### 3.1. Merge `reviews` and `listings` dataframes to obtain the `property review score` column

To perform sentiment analysis on each comment, while preserving property information, we will join the `listings` and `reviews` dataframes. 

We will also sort the records by review date (column `review_date`).

In [None]:
# Rename the column from date to review_date (will help recognizing this column after merging with the other dataframe)
dfdict['reviews'].rename(columns={"date": "review_date"}, inplace=True)
df_merge = dfdict['listings'].merge(dfdict['reviews'], left_on=['id'], right_on=['listing_id']).sort_values(by=['review_date'])
df_merge

### 3.2. Reduce dataset size

Our `comments` dataframe has 1,163,886 reviews (too many records!). Let's reduce the size of the dataset by choosing a subset of reviews. 

Our filter criteria will be purely random. Let's select 10% of the records, randomly.

But, before filtering, let's look at the distribution of the number of reviews.

In [None]:
# Get the number of comments per listing
review_count = df_merge[['listing_id', 'reviewer_id']].groupby(by=['listing_id']).count()
review_count.hist(bins=40)

In [None]:
# df_merge_original = df_merge.copy()
# Select a random 10% sample of the DataFrame 
df_merge = df_merge.sample(frac=0.1, random_state=42)
review_count = df_merge[['listing_id', 'reviewer_id']].groupby(by=['listing_id']).count()
review_count.hist(bins=40)

### 3.1. Convert `comments` column from object to string

In [None]:
df_merge['comments'] = df_merge['comments'].astype(str)
df_merge.info()

### 3.2. Removing Punctuation, Numbers, and Special Characters

Punctuation, numbers and special characters do not help much. In the following function, we will replace some expressions with spaces.
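
As a rough illustration of this kind of cleaning, here is a simplified stand-alone sketch (not the notebook's exact function). The helper name `strip_non_letters` and the sample sentence are assumptions made for the example.

```python
import re

def strip_non_letters(text):
    """Lowercase the text, replace anything that is not a-z or whitespace
    with a space, then collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "Great flat!!! 5 mins from the Tube :-)"
print(strip_non_letters(sample))  # great flat mins from the tube
```

One side effect worth keeping in mind: contractions are split as well (for example, "wasn't" becomes "wasn t"), which can matter for negation handling in later steps.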

#### NOTE: We will store the modified comments in a new column called `comments_cleaned`. 

In [None]:
import re

def cleaning(s):
    s = str(s).lower()
    s = re.sub(r'https?://\S+', ' ', s)   # remove URLs
    s = re.sub(r'[^\x00-\x7f]', ' ', s)   # replace non-ASCII characters
    s = re.sub(r'[^a-z\s]', ' ', s)       # replace punctuation, digits and underscores
    s = re.sub(r'\s+', ' ', s).strip()    # collapse repeated whitespace
    return s

In [None]:
df_merge['comments_cleaned'] = df_merge['comments'].apply(lambda s: cleaning(s))
df_merge.head(8)

### 3.3. Analyzing in which language the comments are written

In [None]:
# https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language
#from langdetect import detect => does not work
#from textblob import TextBlob  # Requires NLTK package, uses Google => error of too many web requests...
import fasttext
model = fasttext.load_model('../input/fasttext-model/fasttext.ftz')

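# Predict the language of each raw comment (newlines are replaced by spaces so that
# each comment is a single line and adjacent words are not glued together)
language_list = []
for comment in df_merge['comments']:
    language_list.append(model.predict(comment.replace("\n", " "))[0][0])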

In [None]:
df_merge['language'] = language_list
df_merge[['comments', 'comments_cleaned', 'language']].head(6)

#### Oops! There are comments in other languages! We will remove all comments that are not in English.

### 3.4. Filter out non-English reviews

In [None]:
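# .copy() gives an independent DataFrame, so new columns can be added later without SettingWithCopyWarning
df_english = df_merge[df_merge['language'] == '__label__en'].copy()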
df_english[['comments', 'comments_cleaned', 'language']].head(6)

### 3.5. Removing stop words

We will define a function to remove stop words in English.

In [None]:
stopword_list = set(stopwords.words("english"))
def removeStopwords(x):
    filtered_words = [word for word in x.split() if word not in stopword_list]
    return " ".join(filtered_words)

In [None]:
df_english['comments_cleaned_2'] = df_english['comments_cleaned'].apply(lambda s: removeStopwords(s))
df_english[['comments', 'comments_cleaned', 'comments_cleaned_2', 'language']].head(6)

### 3.6. Tokenization

Now we will tokenize all the cleaned comments in our dataset. Tokens are individual terms or words, and tokenization is the process of splitting a string of text into tokens.

We tokenize our sentences because we will apply Stemming from the “NLTK” package in the next step.

In [None]:
tokenized_comments = df_english['comments_cleaned_2'].apply(lambda x: x.split())
tokenized_comments.head()

### 3.7. Stemming

Stemming is a rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”.

In [None]:
from nltk import PorterStemmer

ps = PorterStemmer()
tokenized_comments = tokenized_comments.apply(lambda x: [ps.stem(i) for i in x])
tokenized_comments.head()

### 3.8. Merge tokens back together

In [None]:
tokenized_comments = tokenized_comments.apply(lambda x: ' '.join(x))
df_english['comments_cleaned_3'] = tokenized_comments
df_english[['comments', 'comments_cleaned', 'comments_cleaned_2', 'comments_cleaned_3', 'language']].head(6)

## 5. Performing automated sentiment analysis on `comments`

Source: https://algotrading101.com/learn/sentiment-analysis-python-guide/, Section 3.

We will use the `VADER Sentiment Analyzer`

VADER is a lexicon- and rule-based sentiment analyser tuned for social media text. This means that it looks at words, punctuation, phrases, emojis, etc. and rates them as positive or negative.

VADER stands for “Valence Aware Dictionary and sEntiment Reasoner”.

### 5.1. Download VADER lexicon

In [None]:
%%time
nltk.download('vader_lexicon')

### 5.2. Run the automated sentiment analysis

We will run the analyzer on every comment and add the output to the original dataframe. Note that we are only interested in the value of the `compound` variable, a single normalized score between -1 (most negative) and +1 (most positive) that summarizes the sentiment of the whole comment.

In [None]:
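from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

# Create the analyzer once; building it loads the lexicon, so doing that for every comment is slow
sia = SIA()

def guess_sentiment(comment):
    pol_score = sia.polarity_scores(comment) # run analysis
    return pol_score["compound"]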

#### Since the VADER algorithm takes a long time to run (more than 15 min), we will store its results in a pickle file to save time. If this result file already exists, we will skip the VADER processing.
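
The "reuse if cached, otherwise compute and save" pattern can be sketched generically with the standard library. This is an illustration only: `load_or_compute`, `slow_scoring` and the file name are hypothetical, and the notebook itself uses the pandas pickle helpers instead.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Return the cached result if cache_path exists; otherwise compute it,
    save it to cache_path, and return it."""
    if os.path.isfile(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    result = compute_fn()
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)
    return result

# Usage with a cheap, hypothetical stand-in for the slow scoring step.
calls = []

def slow_scoring():
    calls.append(1)  # record that the expensive step actually ran
    return [0.8, -0.2, 0.0]

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, "scores.pkl")
    first = load_or_compute(cache_file, slow_scoring)   # computes and saves
    second = load_or_compute(cache_file, slow_scoring)  # loads from disk
```

A cache like this is only valid while its input stays the same: if the sampled comments change (for example, a different `random_state` or sampling fraction), the cached file should be deleted so the scores are recomputed.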

In [None]:
outputpath = '/kaggle/working/'
vader_sentiment_list_file = os.path.join(outputpath, 'airbnb-vader-sentiment.pkl.gz')

In [None]:
if os.path.isfile(vader_sentiment_list_file):
    print('Reusing existing VADER sentiment results.')
    vader_sentiment_list = pd.read_pickle(vader_sentiment_list_file)
else:
    print('Existing VADER sentiment results not found. Reprocessing...')
    vader_sentiment_list = df_english['comments'].apply(guess_sentiment)
    vader_sentiment_list.to_pickle(vader_sentiment_list_file)
    print('Saved VADER sentiment results to file.')

In [None]:
vader_sentiment_list.head(6)

### 5.3. Analyze obtained sentiment scores

First, let's merge the obtained sentiment score into the original dataset.

In [None]:
df_english['vader_sentiment_score'] = vader_sentiment_list

In [None]:
display(df_english[['comments', 'vader_sentiment_score']].describe())
display(df_english[['comments', 'language', 'vader_sentiment_score']])

As expected, sentiment scores range from -1.0 to 1.0.

## 6. Correlating sentiment score of the reviews against property review scores

Source: https://algotrading101.com/learn/sentiment-analysis-python-guide/, Section 4.

In this section, we want to compare the relationship between property ratings and our sentiment score. 

If there is a significant relationship, then our sentiment scores might have some predictive value.

Here are the next steps:

1. Aggregate sentiment scores by property, creating the `property sentiment score` column;
1. Check the relationship between `property sentiment score` and `property review score`.

### 6.1. Scale sentiment scores, so that values lie in the interval `[0, 1]`
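
Min-max scaling is a simple linear map, $x' = (x - x_{min}) / (x_{max} - x_{min})$. The sketch below shows it in plain Python; `min_max_scale` is a hypothetical helper and the scores are made-up values, not results from this dataset.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: all values identical
    return [(v - lo) / (hi - lo) for v in values]

compound_scores = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical VADER compound scores
scaled = min_max_scale(compound_scores)
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Note that the minimum and maximum are taken from the observed data. The result equals $(x + 1) / 2$ only when the observed scores actually reach both -1 and 1; otherwise the mapping is stretched to the observed range.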

In [None]:
# https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df_english[['vader_sentiment_score']])
df_english['sentiment_score'] = scaler.transform(df_english[['vader_sentiment_score']])

In [None]:
display(df_english[['comments', 'sentiment_score']].describe())
display(df_english[['comments', 'language', 'sentiment_score']])

#### Let's also multiply sentiment scores by 100, round the values up and convert them to int

In [None]:
df_english['sentiment_score'] = np.ceil(df_english['sentiment_score'] * 100).astype(int)

### 6.2. Aggregate sentiment scores by property

Let's take a quick look at a random property (`id = 15400`) and the sentiment scores of its reviews.

In [None]:
df_english[df_english['listing_id'] == 15400][['sentiment_score', 'review_scores_rating']]

Now let's combine the `sentiment_score` for all comments/reviews concerning each property, to get a unique property `sentiment_score`. 

#### Let's investigate the best way to calculate an aggregated property sentiment score by observing some statistics of the scores of the individual comments.
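
To make the percentile statistics concrete, here is a small pure-Python sketch that interpolates linearly between the two nearest ranks of the sorted data (to my understanding, the same convention as the default of `np.percentile`). The helper name and the scores are assumptions made for the example.

```python
def percentile_linear(values, p):
    """p-th percentile (0-100) using linear interpolation between the two
    nearest ranks of the sorted data."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

scores = [55, 80, 90, 95, 99]  # hypothetical sentiment scores of one listing
summary = {f"p{p}": percentile_linear(scores, p) for p in (10, 25, 50, 75, 90)}
print(summary)
```

Low percentiles (such as `p10`) react to a few bad reviews, while the median and high percentiles barely move, which is why it is worth comparing several of them against the property rating.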

In [None]:
# https://stackoverflow.com/questions/17578115/pass-percentiles-to-pandas-agg-function
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'p%s' % n
    return percentile_

In [None]:
df_stats = df_english[['listing_id', 'sentiment_score', 'review_scores_rating']].groupby(['listing_id'])\
.agg({"sentiment_score": [np.median, np.var, np.min, np.max, percentile(10), percentile(25), percentile(75),
                         percentile(90), percentile(95), np.mean], 
      "review_scores_rating" : [np.mean]})
df_stats.columns = ['_'.join(col).strip() for col in df_stats.columns.values]
df_stats
#reset_index().pivot(index='name', values='score', columns='level_1')

### 6.3. Correlation heatmap between candidate `property sentiment scores` and `property review score`

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# https://medium.com/@szabo.bibor/how-to-create-a-seaborn-correlation-heatmap-in-python-834c0686b88e
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(df_stats.corr()[['review_scores_rating_mean']].sort_values(by='review_scores_rating_mean', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with review_scores_rating', fontdict={'fontsize':18}, pad=16);

In [None]:
sns.scatterplot(data=df_stats, x="sentiment_score_mean", y="review_scores_rating_mean")
print(df_stats["sentiment_score_mean"].corr(df_stats["review_scores_rating_mean"]))

On the x-axis, we have our sentiment score. On the y-axis we have our property review score.

The correlation between the mean sentiment score and the property review score is not so strong (0.36). We may need to improve it.
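
For reference, the correlation quoted above is a Pearson coefficient (the default of pandas `Series.corr`). A minimal pure-Python sketch of that formula follows; the function name and the sample numbers are illustrative assumptions, not data from this notebook.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-property values: mean sentiment score vs. review score.
sentiment_means = [60, 70, 80, 90]
review_ratings = [90, 92, 96, 98]
r = pearson_r(sentiment_means, review_ratings)
print(round(r, 3))
```

The coefficient only captures linear association, so a modest value can also mean that the relationship exists but is not linear, or that it is masked by noisy per-property averages built from very few reviews.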

### 6.4. Improving the relationship between `property sentiment score` against `property review score`


#### 6.4.1. Feature engineering on `last_scraped` and `review_date` dates

We do not know the formula used by Airbnb when calculating the property review score. Maybe older reviews receive a smaller weight in the formula when compared to newer / more recent reviews?

Let's create a new field with the date difference between the review date (`review_date`) and the date of the property review score (`last_scraped`). 

In [None]:
# .dt.days extracts the whole number of days from the timedelta
df_english['days_since_review'] = (df_english['last_scraped'] - df_english['review_date']).dt.days

#### 6.4.2. Ignoring some neutral-sentiment comments

Let's try to improve the correlation by discarding comments that present a neutral sentiment (insignificant comments). 
The bare minimum is to exclude the data where the score is 0 or insignificant.

For the sake of simplicity, we shall assume that a score strictly between -0.1 and 0.1 is insignificant. This is an arbitrary figure.
Let's discard any comment with $\mid sentiment.score \mid < 0.1 $.
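
The neutral-comment filter can be sketched on a plain list. The scores are made up and `is_significant` is a hypothetical helper; the comparison mirrors the one used by the notebook's filter (`<=` / `>=`), so a score exactly at the threshold is kept.

```python
THRESHOLD = 0.1  # same arbitrary cutoff as in the text

def is_significant(score, threshold=THRESHOLD):
    """True when the score is at least `threshold` away from zero."""
    return abs(score) >= threshold

scores = [0.85, 0.05, -0.02, -0.40, 0.10, 0.0]  # hypothetical compound scores
kept = [s for s in scores if is_significant(s)]
print(kept)  # [0.85, -0.4, 0.1]
```

Raising the threshold removes more comments, so it trades a cleaner signal against fewer reviews per property.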

In [None]:
threshold = 0.1
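# .copy() gives an independent DataFrame, so the rescaled column can be added without SettingWithCopyWarning
df_english2 = df_english[((df_english['vader_sentiment_score'] <= -threshold) | (df_english['vader_sentiment_score'] >= threshold))].copy()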
df_english2

In [None]:
# https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df_english2[['vader_sentiment_score']])
df_english2['sentiment_score'] = scaler.transform(df_english2[['vader_sentiment_score']])

df_stats2 = df_english2[['listing_id', 'sentiment_score', 'review_scores_rating', 'days_since_review']].groupby(['listing_id'])\
.agg({"sentiment_score": [np.median, np.var, np.min, np.max, percentile(10), percentile(25), percentile(75),
                         percentile(90), percentile(95), np.mean], 
      "review_scores_rating" : [np.mean],
      "days_since_review" : [np.mean]})
df_stats2.columns = ['_'.join(col).strip() for col in df_stats2.columns.values]
df_stats2


In [None]:
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(df_stats2.corr()[['review_scores_rating_mean']].sort_values(by='review_scores_rating_mean', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with review_scores_rating 2', fontdict={'fontsize':18}, pad=16);

#### We managed to improve the correlation coefficient from 0.36 to 0.40.

In [None]:
sns.scatterplot(data=df_stats2, x="sentiment_score_mean", y="review_scores_rating_mean")
print(df_stats2["sentiment_score_mean"].corr(df_stats2["review_scores_rating_mean"]))

## 7. Predict `property review score` using `property sentiment score`

### 7.1. Splitting the Dataset for Training and Testing the Model

We need to divide our data into training and testing sets. The training set will be used to train the algorithm while the test set will be used to evaluate the performance of the machine learning model.

In [None]:
# Checking for NaN values in the dataframe
df_stats2.isna().sum()

In [None]:
df_stats3 = df_stats2.fillna(0)

In [None]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
target_col = 'review_scores_rating_mean'
X = df_stats3[[col for col in df_stats3.columns if col != target_col]]
y = df_stats3[target_col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In the code above we use the `train_test_split` function from the `sklearn.model_selection` module to divide our data into `training` and `testing` sets. The function takes the feature set as the first parameter, the label set as the second parameter, and a value for the `test_size` parameter. We specified a value of `0.33` for `test_size`, which means that our data set will be split into two sets holding roughly `67%` and `33%` of the data. We will use the `67%` set for training and the `33%` set for testing.

### 7.2. Define the ML Pipeline and Randomized Hyperparameter Search

The `XGBRegressor` class implements the Scikit-Learn interface for using `XGBoost` for regression. That means that it has the familiar `fit` method as well as `predict`, `score` and so on.

The preprocessing methods to use in the pipeline and the parameters to optimize are just for the sake of the example.

In k-fold cross-validation, note that I have set the number of splits/folds to 3 in order to save time. You should probably put 5 there to get a more reliable result.
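
To make the idea of folds concrete, the sketch below generates the train/validation index sets of a plain k-fold split without shuffling. It is an illustration written from scratch, not scikit-learn's implementation, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for each of n_splits
    contiguous folds, without shuffling."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # first folds absorb the remainder
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

fold_list = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, val_idx in fold_list:
    print(len(train_idx), val_idx)
```

Every sample is used for validation exactly once. With more folds, each model is trained on a larger share of the data, at the cost of fitting more models per parameter combination.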

In [None]:
# https://www.kaggle.com/carlosdg/xgboost-with-scikit-learn-pipeline-gridsearchcv
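# verbosity=0 silences XGBoost messages and n_jobs=1 uses a single thread
# (these replace the deprecated `silent` and `nthread` arguments)
model = xgb.XGBRegressor(learning_rate=0.02, n_estimators=600, verbosity=0, n_jobs=1)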

pipeline = Pipeline([
    ('standard_scaler', StandardScaler()), 
    ('model', model)
])

folds = 3
param_comb = 50

param_grid = {
        'model__max_depth': [6, 10, 15, 20],
        'model__learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
        'model__subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'model__colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'model__colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'model__min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],
        'model__gamma': [0, 0.25, 0.5, 1.0],
        'model__reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0, 100.0],
        'model__n_estimators': [100]}

grid = RandomizedSearchCV(pipeline, param_grid, random_state=0, n_iter=param_comb, cv=folds, n_jobs=4, verbose=3, 
                          scoring='neg_mean_squared_error')

In [None]:
%%time
grid.fit(X_train, y_train)

### 7.3. CV results

Here are the results of the model that gave the best mean score in the k-fold cross-validation
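
Because the search uses `scoring='neg_mean_squared_error'`, the reported scores are negated mean squared errors: they are never positive, and values closer to zero are better. A small sketch for converting such a score into an RMSE on the rating scale follows; the number used is a made-up example, not a result from this notebook, and `rmse_from_neg_mse` is a hypothetical helper.

```python
from math import sqrt

def rmse_from_neg_mse(neg_mse):
    """Convert a 'neg_mean_squared_error' score (always <= 0) into an RMSE."""
    if neg_mse > 0:
        raise ValueError("a negated MSE cannot be positive")
    return sqrt(-neg_mse)

example_cv_score = -49.0  # hypothetical mean CV score
print(rmse_from_neg_mse(example_cv_score))  # 7.0
```

Reading the score as an RMSE makes it directly comparable with the error metrics computed on the test set.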

In [None]:
mean_score = grid.cv_results_["mean_test_score"][grid.best_index_]
std_score = grid.cv_results_["std_test_score"][grid.best_index_]

print(f"Best parameters: {grid.best_params_}")
print(f"Mean CV score: {mean_score: .6f}")
print(f"Standard deviation of CV score: {std_score: .6f}")

### 7.4. Making Predictions and Evaluating the Model

Once the model has been trained, the last step is to make predictions with it. To do so, we call the `predict` method on the `grid` object that we used for training; it delegates to the best estimator found by the search. 

In [None]:
y_pred = grid.predict(X_test)

We can now evaluate the regressor performance.

In [None]:
# https://medium.com/acing-ai/how-to-evaluate-regression-models-d183b4f5853d
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error, max_error

print('RMSE: ', np.sqrt(mean_squared_error(y_test, y_pred)))
print('MAE: ', mean_absolute_error(y_test, y_pred))
print('MedAE: ', median_absolute_error(y_test, y_pred))
print('MaxError: ', max_error(y_test, y_pred))

#### The `XGBRegressor` presented a mean absolute error (MAE) of `4.18` when trying to predict the `property review score`. Note, however, that the maximum absolute error observed in prediction is equal to `95.12`, a really large error!
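
The four metrics differ mainly in how strongly they react to a few very bad predictions. The plain-Python sketch below recomputes them on made-up numbers (one property with a true rating of 0 that is predicted high); `regression_errors` is a hypothetical helper, not the scikit-learn implementation.

```python
from math import sqrt
from statistics import median

def regression_errors(y_true, y_pred):
    """Return RMSE, MAE, median absolute error and maximum error."""
    abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(abs_errors)
    return {
        "RMSE": sqrt(sum(e * e for e in abs_errors) / n),
        "MAE": sum(abs_errors) / n,
        "MedAE": median(abs_errors),
        "MaxError": max(abs_errors),
    }

y_true_demo = [100, 95, 90, 0]   # hypothetical ratings, one extreme low value
y_pred_demo = [96, 95, 92, 94]   # hypothetical predictions
errors = regression_errors(y_true_demo, y_pred_demo)
print(errors)
```

In this toy example a single outlier pushes the MAE to 25 and the maximum error to 94, while the median absolute error stays at 3. Comparing MAE with MedAE is therefore a quick way to see whether a few extreme cases dominate the average.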

We can also plot the test values (`y_test` vs. `y_pred`). 

In [None]:
# https://stackoverflow.com/questions/65539013/how-to-plot-a-graph-of-actual-vs-predict-values-in-python
def plotGraph(y_test, y_pred, regressorName):
    # Plot actual (blue) and predicted (red) values against the sample position
    plt.scatter(range(len(y_test)), y_test, color='blue', label='actual')
    plt.scatter(range(len(y_pred)), y_pred, color='red', label='predicted')
    plt.title(regressorName)
    plt.legend()
    plt.show()

In [None]:
plotGraph(y_test, y_pred, "XGBRegressor")

Observe that the regressor has problems predicting smaller property review scores (e.g., from 0 to 80). No property review score below ~60 is returned by the regressor.

## 8. Trying to improve the prediction results by formulating a classification problem

What if, instead of trying to predict the `property review score` as a number (from `0` to `100`), we tried to predict the `property review score` as a class (e.g., `0` == low, `1` == medium and `2` == high)? Would it be possible to obtain improved results?

**Let's begin by analyzing the distribution of the target column.** I suspect that these values are heavily skewed towards high scores. If this is true, the regression model may be fitting mostly the high property scores (the majority of the scores is high), leaving behind the low scores.


In [None]:
sns.displot(df_stats3, x="review_scores_rating_mean", bins=100)

### 8.1. Converting the target column from numerical to categorical

Even though some websites explain how stratified K-fold CV could be performed on non-categorical targets (https://scottclowe.com/2016-03-19-stratified-regression-partitions/), we did not find source code available on that subject.

Our approach here will be to convert the target column to categorical (property score ranges), so that `StratifiedKFold` cross-validation can be applied. 

To discretize the target column values, we will apply the `KBinsDiscretizer`. Here is a page that illustrates the different strategies of discretization: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_strategies.html
We will discretize the column into 3 groups using the `kmeans` strategy (i.e., the bin edges are derived from a 1D k-means clustering of the values, so that each value falls into the bin of its nearest cluster center).
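
Whichever strategy chooses the bin edges, turning a numeric rating into an ordinal class afterwards is just a lookup against those edges. The sketch below shows that step with the standard library; the edges `60.0` and `90.0` are made-up values for illustration, not the ones learned by `KBinsDiscretizer`, and `to_ordinal_bin` is a hypothetical helper.

```python
from bisect import bisect_right

def to_ordinal_bin(value, inner_edges):
    """Return the ordinal bin index of value, given the sorted interior
    bin edges (len(inner_edges) + 1 bins in total)."""
    return bisect_right(inner_edges, value)

inner_edges = [60.0, 90.0]  # hypothetical: low < 60 <= medium < 90 <= high
ratings = [20.0, 60.0, 89.9, 90.0, 100.0]
bins = [to_ordinal_bin(r, inner_edges) for r in ratings]
print(bins)  # [0, 1, 1, 2, 2]
```

The choice of edges decides how many samples end up in each class, which in turn affects how balanced the resulting classification problem is.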

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
enc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
enc.fit(y_train.to_numpy().reshape(-1, 1))
y_train_cat = enc.transform(y_train.to_numpy().reshape(-1, 1)).astype(int)
y_train_cat

In [None]:
new_target = 'review_scores_rating_bin'
sns.displot(pd.DataFrame(y_train_cat, columns=[new_target]), x=new_target, bins=100)

The new values of the `new_target` column are much better distributed than the original numerical column. 

Let's apply the same transformation to the `y_test` data.

In [None]:
y_test_cat = enc.transform(y_test.to_numpy().reshape(-1, 1)).astype(int)
sns.displot(pd.DataFrame(y_test_cat, columns=[new_target]), x=new_target, bins=100)

We can now run a new Machine Learning algorithm, this time as a classification task on the target column `new_target`.

### 8.2. ML pipeline for classification

We'll define a new Machine Learning pipeline for classification, this time using `XGBClassifier` and `RandomizedSearchCV` with the `f1_weighted` evaluation metric.

To properly define the `f1_weighted` metric, first we need to define the following metrics:

**Accuracy**: Accuracy is an evaluation metric that measures the fraction of all predictions that a model gets right. The formula for accuracy is below:

![Accuracy_formula](https://miro.medium.com/max/700/1*sVuthxNoz09nzzJTDN1rww.png)

**Precision**: It tells you what fraction of the samples predicted as positive were actually positive. To calculate precision, use the following formula: 

$TP/(TP+FP)$.

**Recall**: It tells you what fraction of all positive samples were correctly predicted as positive by the classifier. It is also known as True Positive Rate (TPR), Sensitivity, Probability of Detection. To calculate Recall, use the following formula: 

$TP/(TP+FN)$.

**F1-score**: It combines precision and recall into a single measure. Mathematically it’s the harmonic mean of precision and recall. It can be calculated as follows:

![F1-score](https://miro.medium.com/max/700/1*wUdjcIb9J9Bq6f2GvX1jSA.png)

Finally, `f1_weighted` is the average of the per-class F1-scores, weighted by the number of true samples (support) of each class.
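
The definitions above can be tied together by computing them from a multi-class confusion matrix, treating each class in turn as the "positive" one. This is a from-scratch sketch on a made-up 3x3 matrix; the helper names are hypothetical and the code is not the scikit-learn implementation.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j.
    Returns one (precision, recall, f1, support) tuple per class."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # TP + FP
        actual_k = sum(cm[k])                          # TP + FN (support)
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, actual_k))
    return results

def weighted_f1(cm):
    """Average of the per-class F1-scores, weighted by class support."""
    metrics = per_class_metrics(cm)
    total = sum(support for _, _, _, support in metrics)
    return sum(f1 * support for _, _, f1, support in metrics) / total

demo_cm = [      # hypothetical confusion matrix (rows: true, columns: predicted)
    [5, 1, 0],
    [2, 6, 2],
    [0, 3, 11],
]
print(per_class_metrics(demo_cm))
print(weighted_f1(demo_cm))
```

Because the weights are the class supports, a large class that is predicted well can hide a small class that is predicted poorly, so the per-class numbers are worth reading alongside the weighted average.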

In [None]:
# Machine Learning pipeline definition
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
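# verbosity=0 silences XGBoost messages and n_jobs=1 uses a single thread
# (these replace the deprecated `silent` and `nthread` arguments)
model = xgb.XGBClassifier(learning_rate=0.02, n_estimators=600, verbosity=0, n_jobs=1)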

pipeline = Pipeline([
    #('standard_scaler', StandardScaler()), 
    ('model', model)
])

folds = 5
param_comb = 30

param_grid = {
        'model__max_depth': [6, 10, 15, 20],
        'model__learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
        'model__subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'model__colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'model__colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'model__min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],
        'model__gamma': [0, 0.25, 0.5, 1.0],
        'model__reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0, 100.0],
        'model__n_estimators': [100]}

#cross_val = StratifiedKFold(n_splits=folds)
cross_val = KFold(n_splits=folds, random_state=42, shuffle=True)

class_grid = RandomizedSearchCV(pipeline, param_grid, random_state=0, n_iter=param_comb, cv=cross_val, n_jobs=4, verbose=3, 
                          scoring='f1_weighted')

In [None]:
%%time
class_grid.fit(X_train, y_train_cat)

### 8.3. Making Predictions and Evaluating the new model

Once the model has been trained, the last step is to make predictions with it. To do so, we call the `predict` method on the `class_grid` object that we used for training. 

In [None]:
y_pred_cat = class_grid.predict(X_test)

#### Let's now evaluate the classifier performance.

In [None]:
# https://stackoverflow.com/questions/65618137/confusion-matrix-for-multiple-classes-in-python
import matplotlib.pyplot as plt
from itertools import product

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):

    # Normalize each row (true class) before plotting, so that the image
    # and the cell annotations show the same values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826
# importing confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test_cat, y_pred_cat)
print('Confusion Matrix\n')
print(cm)
plot_confusion_matrix(cm, classes=[0, 1, 2])

In [None]:
# https://towardsdatascience.com/hackcvilleds-4636c6c1ba53
# Generating a report to extract the measure of interest using built-in sklearn function
# First experiment obtained a weighted_f1 = 0.351
from sklearn.metrics import classification_report
report = classification_report(y_test_cat, y_pred_cat, digits=3)
print(report)

There is no definitive answer to the question of "what's a good accuracy value". It depends on the problem and its context. 
In other words, the classification accuracy mainly depends on the application domain.

In this problem of property review classification based on sentiment scores, we obtained a weighted average accuracy of `71%` in the score class prediction.

On average, considering all 3 classes, `29%` of the predictions of `property review score class` were incorrect.

However, if we take into account only the intermediate class `1` (medium score), the classification model correctly identifies only `17%` of the records that truly belong to it (a recall of `383 / (6 + 383 + 1856)`). 

This recall rises to `96%` for the upper class `2` (high score). 
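
The class-`1` figure can be checked by hand from the confusion-matrix row quoted above. The short sketch below only uses the three counts given in the text; the variable names are illustrative.

```python
# Row of the confusion matrix for true class 1 (medium score), as quoted in the text:
# predicted as class 0, class 1 and class 2 respectively.
row_true_medium = [6, 383, 1856]

total_medium = sum(row_true_medium)
recall_medium = row_true_medium[1] / total_medium
share_predicted_high = row_true_medium[2] / total_medium

print(f"Recall for class 1: {recall_medium:.1%}")                    # 17.1%
print(f"Class 1 records predicted as class 2: {share_predicted_high:.1%}")
```

Most of the misclassified medium-score properties are predicted as high-score ones, which suggests that the classifier leans heavily towards the majority class.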