# Homework week 3 Machine Learning Zoomcamp

## Author: Sebastián Ayala Ruano 

## Dataset

In this homework, we will continue working with the New York City Airbnb Open Data. You can get it from
[Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up for Kaggle.

We'll keep working with the `'price'` variable, and we'll transform it to a classification task.

In [252]:
# Import libraries 
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import mutual_info_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [138]:
# Load data 
df = pd.read_csv("../Data/New_York_City_Airbnb_Open_Data.csv")
len(df)

48895

In [139]:
# Show the first rows of the data frame 
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | 2539 | 2595 | 3647 | 3831 | 5022 |
| name | Clean & quiet apt home by the park | Skylit Midtown Castle | THE VILLAGE OF HARLEM....NEW YORK ! | Cozy Entire Floor of Brownstone | Entire Apt: Spacious Studio/Loft by central park |
| host_id | 2787 | 2845 | 4632 | 4869 | 7192 |
| host_name | John | Jennifer | Elisabeth | LisaRoxanne | Laura |
| neighbourhood_group | Brooklyn | Manhattan | Manhattan | Brooklyn | Manhattan |
| neighbourhood | Kensington | Midtown | Harlem | Clinton Hill | East Harlem |
| latitude | 40.64749 | 40.75362 | 40.80902 | 40.68514 | 40.79851 |
| longitude | -73.97237 | -73.98377 | -73.9419 | -73.95976 | -73.94399 |
| room_type | Private room | Entire home/apt | Private room | Entire home/apt | Entire home/apt |
| price | 149 | 225 | 150 | 89 | 80 |


## Features

For the rest of the homework, you'll need the features from the previous homework plus two additional ones: `'neighbourhood_group'` and `'room_type'`. The full feature set is therefore:

* `'neighbourhood_group'`,
* `'room_type'`,
* `'latitude'`,
* `'longitude'`,
* `'price'`,
* `'minimum_nights'`,
* `'number_of_reviews'`,
* `'reviews_per_month'`,
* `'calculated_host_listings_count'`,
* `'availability_365'`

Select only these features and fill in the missing values with 0.

In [140]:
# Select features 
features = ['neighbourhood_group','room_type','latitude','longitude','price','minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365']
df = df[features]

In [141]:
# Searching features with null values
df.isnull().sum()

neighbourhood_group                   0
room_type                             0
latitude                              0
longitude                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [142]:
# Replace missing values with 0
df['reviews_per_month'] = df.reviews_per_month.fillna(0)


## Question 1

What is the most frequent observation (mode) for the column `'neighbourhood_group'`?

In [143]:
df.neighbourhood_group.mode()

0    Manhattan
dtype: object

**Answer:** The most frequent observation for the column `neighbourhood_group` is **Manhattan**. 

## Split the data

* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value ('price') is not in your dataframe.

In [144]:
# Create test and full training partitions 
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [145]:
# Create train and validation partitions (0.25 of the remaining 80% is 20% of the full data)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=42)

In [146]:
# Verify the length of the partitions 
len(df_train), len(df_val), len(df_test)

(29337, 9779, 9779)
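
The two-step split can be checked on toy data. The sketch below is only an illustration: the 1000-row array is a hypothetical stand-in for the real data frame. It shows why the second call uses `test_size=0.25`: after holding out 20% for testing, 80% remains, and 0.2 / 0.8 = 0.25 of that remainder equals 20% of the original data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the data frame: 1000 row indices.
rows = np.arange(1000)

# First split: hold out 20% of all rows for the test set.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Second split: 20% of the whole is 0.2 / 0.8 = 0.25 of the remaining rows.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)
```

On the real data the same arithmetic gives the 29337 / 9779 / 9779 partition sizes shown above.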

In [147]:
# Reset indices of all partitions 
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [148]:
# Extract target variable of all partitions 
y_train = df_train.price.values
y_val = df_val.price.values
y_test = df_test.price.values

In [149]:
# Delete target variable for all partitions 
del df_train['price']
del df_val['price']
del df_test['price']

In [150]:
# Create lists of categorical and numerical features 
categorical = ["neighbourhood_group", "room_type"]
numerical = ['latitude','longitude', 'minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365']

## Question 2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
   * In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?

In [151]:
# Obtain the correlation matrix 
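# Restrict to numerical columns; recent pandas versions raise an error on string columns
corr = df_train[numerical].corr()
# Styler.format(precision=...) replaces the deprecated Styler.set_precision
corr.style.background_gradient(cmap='coolwarm').format(precision=2)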


| | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|
| latitude | 1.0 | 0.08 | 0.03 | -0.01 | -0.01 | 0.02 | -0.01 |
| longitude | 0.08 | 1.0 | -0.06 | 0.06 | 0.13 | -0.12 | 0.08 |
| minimum_nights | 0.03 | -0.06 | 1.0 | -0.08 | -0.12 | 0.12 | 0.14 |
| number_of_reviews | -0.01 | 0.06 | -0.08 | 1.0 | 0.59 | -0.07 | 0.17 |
| reviews_per_month | -0.01 | 0.13 | -0.12 | 0.59 | 1.0 | -0.05 | 0.17 |
| calculated_host_listings_count | 0.02 | -0.12 | 0.12 | -0.07 | -0.05 | 1.0 | 0.23 |
| availability_365 | -0.01 | 0.08 | 0.14 | 0.17 | 0.17 | 0.23 | 1.0 |


**Answer:** The features that have the biggest correlation in this dataset are `reviews_per_month` and `number_of_reviews`.
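
Rather than reading the strongest pair off the matrix by eye, it can be extracted programmatically. This is a minimal sketch on a hypothetical toy frame (columns `a`, `b`, `c` are invented for illustration); in the notebook the same steps would apply to the correlation matrix of the numerical training features.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is built from "a", "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle; k=1 drops the diagonal of 1.0 values
# so a feature is never paired with itself.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Rank pairs by absolute correlation so strong negative links count too.
top_pair = pairs.abs().idxmax()
print(top_pair, round(pairs[top_pair], 2))
```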

## Make price binary

* We need to turn the price variable from numeric into binary.
* Let's create a variable `above_average` which is `1` if the price is above (or equal to) `152`.

In [218]:
# Obtain binarized prices for all datasets 
bin_price_val = (y_val >= 152).astype(int)
bin_price_train = (y_train >= 152).astype(int)
bin_price_test = (y_test >= 152).astype(int)

## Question 3

* Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
* Which of these two variables has bigger score?
* Round it to 2 decimal digits using `round(score, 2)`

In [208]:
# Mutual information between categorical features and binary target variable
def calculate_mi(series):
    return round(mutual_info_score(series, bin_price_train),2)

df_mi = df_train[categorical].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')
df_mi

| | MI |
|---|---|
| room_type | 0.14 |
| neighbourhood_group | 0.05 |


**Answer:** `room_type` variable has the bigger mutual information score. 
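
To build intuition for the scale of these scores, the sketch below evaluates `mutual_info_score` on two hand-made variables (the labels `x` and `y` are hypothetical). A variable that fully determines a balanced binary target reaches ln 2 ≈ 0.693 nats, while a variable that is independent of the target scores 0.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Balanced binary target, as in a two-class problem.
target = [0, 0, 1, 1]

# Hypothetical categorical variables for illustration.
informative = ["x", "x", "y", "y"]    # determines the target exactly
uninformative = ["x", "y", "x", "y"]  # independent of the target

mi_high = mutual_info_score(informative, target)
mi_zero = mutual_info_score(uninformative, target)

print(round(mi_high, 3), round(mi_zero, 3))
```

Against that range, the scores of 0.14 and 0.05 indicate that both variables carry some information about the binarized price, with `room_type` carrying more.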

## Question 4

* Now let's train a logistic regression
* Remember that we have two categorical variables in the data. Include them using one-hot encoding.
* Fit the model on the training dataset.
   * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
   * `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

In [225]:
# Create dictionaries of the training data to apply one-hot encoding on categorical features 
train_dict = df_train.to_dict(orient='records')

In [226]:
# Create feature matrix with numerical and one-hot encoded categorical variables
dv = DictVectorizer()
X_train = dv.fit_transform(train_dict)

In [227]:
# Fit logistic regression model 
# Note: this cell uses solver='liblinear' instead of the 'lbfgs' solver suggested in the question
model = LogisticRegression(solver='liblinear', C=1.0, random_state=42, max_iter=500)
model.fit(X_train, bin_price_train)

LogisticRegression(max_iter=500, random_state=42, solver='liblinear')

In [228]:
# Create feature matrix of validation partition 
# Use transform (not fit_transform) so the encoding learned on the training data is reused
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

In [229]:
# Make predictions on the validation dataset 
y_pred = model.predict_proba(X_val)[:, 1]

In [230]:
# Define a threshold to calculate the accuracy 
sell_cutoff = (y_pred >= 0.5)

In [231]:
# Obtain accuracy 
orig_accuracy = round((bin_price_val == sell_cutoff).mean(),2)
print(orig_accuracy)

0.79


**Answer:** The accuracy on the validation dataset was **0.79**. 
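
The one-hot encoding step relies on fitting the `DictVectorizer` on the training records only and then reusing it for validation. The sketch below uses a few hypothetical records to show what that means in practice: the columns are fixed by the training data, and a category that appears only in validation is encoded as all zeros instead of adding a new column.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records; the real ones come from df.to_dict(orient='records').
train_records = [
    {"room_type": "Private room", "minimum_nights": 1},
    {"room_type": "Entire home/apt", "minimum_nights": 3},
]
val_records = [
    {"room_type": "Shared room", "minimum_nights": 2},  # category unseen in training
]

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_records)  # learns the column layout
X_val = dv.transform(val_records)          # reuses that layout unchanged

print(dv.feature_names_)
print(X_val)
```

Calling `fit_transform` on the validation records instead would relearn the layout from validation data, which can silently misalign columns with the ones the model was trained on.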

## Question 5

* We have 9 features: 7 numerical features and 2 categorical.
* Let's find the least useful one using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 
* Which of the following features has the smallest difference? 
   * `neighbourhood_group`
   * `room_type` 
   * `number_of_reviews`
   * `reviews_per_month`

> **note**: the difference doesn't have to be positive

In [250]:
# Candidate features to eliminate one at a time
small_features = ["neighbourhood_group", "room_type", "number_of_reviews", "reviews_per_month"]
for feature in small_features:
    # Train on the training partition without the current feature
    df_train_temp = df_train.drop(columns=feature)
    train_dict_temp = df_train_temp.to_dict(orient='records')
    dv_temp = DictVectorizer()
    X_train_temp = dv_temp.fit_transform(train_dict_temp)

    model.fit(X_train_temp, bin_price_train)

    # Encode the validation partition with the vectorizer fitted on the training data
    df_val_temp = df_val.drop(columns=feature)
    val_dict_temp = df_val_temp.to_dict(orient='records')
    X_val_temp = dv_temp.transform(val_dict_temp)

    y_pred_temp = model.predict_proba(X_val_temp)[:, 1]
    sell_cutoff_temp = (y_pred_temp >= 0.5)
    acc_temp = (bin_price_val == sell_cutoff_temp).mean()

    # Absolute difference to the full-model accuracy from Q4 (which was rounded to 2 digits)
    diff_acc = abs(orig_accuracy - acc_temp)

    print(f'Accuracy difference between LogReg model with all features and without {feature} feature: {diff_acc}')


Accuracy difference between LogReg model with all features and without neighbourhood_group feature: 0.040127824930974554
Accuracy difference between LogReg model with all features and without room_type feature: 0.06129563350035794
Accuracy difference between LogReg model with all features and without number_of_reviews feature: 0.0013897126495551193
Accuracy difference between LogReg model with all features and without reviews_per_month feature: 0.0007761529808774092


**Answer:** The feature `reviews_per_month` has the smallest difference with the original accuracy. 
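
Since the note in the question says the difference doesn't have to be positive, a variant that keeps the sign and compares against an unrounded baseline may also be useful. The sketch below is one possible way to structure it, not the notebook's own code: the helper names `validation_accuracy` and `elimination_differences` and the toy columns `signal`, `noise`, and `kind` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def validation_accuracy(df_tr, y_tr, df_va, y_va):
    """Fit a fresh vectorizer and model on train, return accuracy on validation."""
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr.to_dict(orient="records"))
    X_va = dv.transform(df_va.to_dict(orient="records"))
    model = LogisticRegression(solver="liblinear", C=1.0, random_state=42)
    model.fit(X_tr, y_tr)
    # predict() applies the default 0.5 decision threshold.
    return (model.predict(X_va) == y_va).mean()


def elimination_differences(df_tr, y_tr, df_va, y_va):
    """Signed difference: baseline accuracy minus accuracy without each column."""
    baseline = validation_accuracy(df_tr, y_tr, df_va, y_va)
    return {
        col: baseline - validation_accuracy(
            df_tr.drop(columns=col), y_tr, df_va.drop(columns=col), y_va
        )
        for col in df_tr.columns
    }


# Toy data: only "signal" is related to the target.
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "kind": rng.choice(["a", "b"], size=n),
})
y = (toy["signal"] > 0).astype(int).values

diffs = elimination_differences(toy[:300], y[:300], toy[300:], y[300:])
for col, diff in diffs.items():
    print(f"{col}: {diff:+.3f}")
```

A positive value means the model got worse without the feature; a value near zero or negative suggests the feature adds little on the validation set.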

## Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn
* We'll need to use the original column `'price'`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.

In [256]:
# Log transformation (log1p) of the target variable in all partitions 
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)
y_test_log = np.log1p(y_test)

In [268]:
alphas = [0, 0.01, 0.1, 1, 10]

for alpha in alphas:
    # Fit a ridge model for the current regularization strength
    ridge_model_temp = Ridge(alpha=alpha)
    ridge_model_temp.fit(X_train, y_train_log)
    ridge_predictions = ridge_model_temp.predict(X_val)

    # RMSE on the log-transformed validation target
    rmse = np.sqrt(mean_squared_error(y_val_log, ridge_predictions))

    print(f'RMSE of ridge regression with alpha {alpha}: {round(rmse, 3)}')

RMSE of ridge regression with alpha 0: 0.506
RMSE of ridge regression with alpha 0.01: 0.506
RMSE of ridge regression with alpha 0.1: 0.506
RMSE of ridge regression with alpha 1: 0.506
RMSE of ridge regression with alpha 10: 0.506


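**Answer:** All alphas give the same RMSE of **0.506** on the validation set when rounded to 3 decimal digits. Following the tie-break rule above, the smallest alpha, **0**, is selected.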
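
The pieces of this question can be made explicit in a small sketch: the RMSE formula, the `log1p` / `expm1` round trip used for the target, and the rule of picking the smallest alpha among tied rounded scores. The helper names `rmse` and `best_alpha` and the score values are hypothetical and chosen only to illustrate a tie; they are not results from the notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def best_alpha(scores, digits=3):
    """scores maps alpha -> RMSE; ties after rounding go to the smallest alpha."""
    return min(scores, key=lambda a: (round(scores[a], digits), a))


# log1p handles a price of 0 safely, and expm1 undoes the transformation.
prices = np.array([0.0, 80.0, 149.0])
recovered = np.expm1(np.log1p(prices))

# Made-up scores: the first three tie at 0.506 after rounding.
scores = {0: 0.5061, 0.01: 0.5062, 1: 0.5064, 10: 0.52}
chosen = best_alpha(scores)
print(chosen)
```

Sorting on the pair (rounded score, alpha) keeps the selection rule in one place, so the choice does not depend on the order in which the alphas were tried.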