## 03 Classification Homework

In this homework, we will continue working with the New York City Airbnb Open Data. You can get it from Kaggle, or download it from here if you don't want to sign up for Kaggle.

We'll keep working with the `price` variable, but this time we'll turn the problem into a classification task.

In [3]:
import pandas as pd
import numpy as np

import math

from sklearn.metrics import mean_squared_error, mutual_info_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression, Ridge

In [9]:
df = pd.read_csv("AB_NYC_2019.csv")

In [10]:
df.head()

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | | | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |


### Features

For the rest of the homework, you'll need to use the features from the previous homework plus two additional ones: `'neighbourhood_group'` and `'room_type'`. The whole feature set is as follows:

* `'neighbourhood_group'`,
* `'room_type'`,
* `'latitude'`,
* `'longitude'`,
* `'price'`,
* `'minimum_nights'`,
* `'number_of_reviews'`,
* `'reviews_per_month'`,
* `'calculated_host_listings_count'`,
* `'availability_365'`

Select only these columns and fill in the missing values with 0.


### Question 1

What is the most frequent observation (mode) for the column `'neighbourhood_group'`?

In [12]:
features = ['neighbourhood_group',
            'room_type',
            'latitude',
            'longitude',
            'minimum_nights',
            'number_of_reviews',
            'reviews_per_month',
            'calculated_host_listings_count',
            'availability_365',
            'price']

# select features
df = df.loc[:, features].fillna(0)

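# string data normalization: lowercase, then replace spaces and slashes with underscores
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
for c in categorical_columns:
    # ' |/' is a regular expression, so request regex matching explicitly
    df[c] = df[c].str.lower().str.replace(' |/', '_', regex=True)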

df.head()


| | neighbourhood_group | room_type | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | brooklyn | private_room | 40.64749 | -73.97237 | 1 | 9 | 0.21 | 6 | 365 | 149 |
| 1 | manhattan | entire_home_apt | 40.75362 | -73.98377 | 1 | 45 | 0.38 | 2 | 355 | 225 |
| 2 | manhattan | private_room | 40.80902 | -73.9419 | 3 | 0 | 0.0 | 1 | 365 | 150 |
| 3 | brooklyn | entire_home_apt | 40.68514 | -73.95976 | 1 | 270 | 4.64 | 1 | 194 | 89 |
| 4 | manhattan | entire_home_apt | 40.79851 | -73.94399 | 10 | 9 | 0.1 | 1 | 0 | 80 |


In [13]:
# mode() returns a Series (there can be ties); take the first value
print('Mode is: ' + df.neighbourhood_group.mode()[0])

Mode is: manhattan


### Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value (`'price'`) is not in your dataframe.
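
The 60%/20%/20% distribution takes two calls to `train_test_split`, and the second fraction is easy to get wrong: after holding out 20% for the test set, the validation set must take 25% of the remaining 80% to end up at 20% of the full data. A minimal sketch on a toy list of 100 rows (the list and the variable names are illustrative, not part of the homework data):

```python
from sklearn.model_selection import train_test_split

rows = list(range(100))  # toy stand-in for the dataframe rows

# Step 1: hold out 20% of everything for the test set
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Step 2: 0.25 of the remaining 80% equals 20% of the original data
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

split_sizes = (len(train), len(val), len(test))
print(split_sizes)
```

The three parts are disjoint and together cover every row, so no observation leaks from one set into another.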


### Question 2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
   * In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?
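
Reading the largest off-diagonal value from a correlation matrix by eye gets harder as the number of features grows. One possible way to extract the strongest pair programmatically is sketched below; the helper `top_correlated_pair` and the toy columns `a`, `b`, `c` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pair(frame: pd.DataFrame):
    """Return (feature_a, feature_b, correlation) for the strongest off-diagonal pair."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    best = pairs.abs().idxmax()
    return best[0], best[1], float(pairs.loc[best])


# Toy data: "b" is a noisy copy of "a"; "c" is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
pair = top_correlated_pair(toy)
print(pair)
```

The absolute value is used for ranking so that a strong negative correlation would also be picked up, while the returned coefficient keeps its sign.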


In [14]:
df_train, df_test = train_test_split(df, test_size = 0.2, random_state = 42)
df_train, df_val = train_test_split(df_train, test_size = 0.25, random_state = 42)

len(df_train), len(df_test), len(df_val)

(29337, 9779, 9779)

In [42]:
x_train = df_train[features].reset_index(drop = True)
x_test  = df_test[features].reset_index(drop = True)
x_val   = df_val[features].reset_index(drop = True)

y_train = df_train.price.values
y_test = df_test.price.values
y_val = df_val.price.values

del x_train['price']
del x_val['price']
del x_test['price']

In [43]:
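# correlation is only defined for numeric columns, so drop the categorical ones first
x_train.select_dtypes(include='number').corr()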

| | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|
| latitude | 1.0 | 0.080301 | 0.027441 | -0.006246 | -0.007159 | 0.019375 | -0.005891 |
| longitude | 0.080301 | 1.0 | -0.06066 | 0.055084 | 0.134642 | -0.117041 | 0.083666 |
| minimum_nights | 0.027441 | -0.06066 | 1.0 | -0.07602 | -0.120703 | 0.118647 | 0.138901 |
| number_of_reviews | -0.006246 | 0.055084 | -0.07602 | 1.0 | 0.590374 | -0.073167 | 0.174477 |
| reviews_per_month | -0.007159 | 0.134642 | -0.120703 | 0.590374 | 1.0 | -0.048767 | 0.165376 |
| calculated_host_listings_count | 0.019375 | -0.117041 | 0.118647 | -0.073167 | -0.048767 | 1.0 | 0.225913 |
| availability_365 | -0.005891 | 0.083666 | 0.138901 | 0.174477 | 0.165376 | 0.225913 | 1.0 |


### Make price binary

* We need to turn the price variable from numeric into binary.
* Let's create a variable `above_average` which is `1` if the price is above (or equal to) `152`.


### Question 3

* Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
* Which of these two variables has the bigger score?
* Round it to 2 decimal digits using `round(score, 2)`.
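
To make the score less of a black box, here is a from-scratch sketch of mutual information between two discrete variables, measured in nats (natural logarithm), which is the same quantity `mutual_info_score` reports. The function `mutual_info` and the four-row toy data are assumptions for illustration only.

```python
import math
from collections import Counter


def mutual_info(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    score = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (count_x * count_y)
        score += p_xy * math.log(count * n / (px[x] * py[y]))
    return score


# Toy example: room type fully determines the label, borough says nothing about it
room = ["entire", "entire", "private", "private"]
borough = ["manhattan", "brooklyn", "manhattan", "brooklyn"]
label = [1, 1, 0, 0]

mi_room = mutual_info(room, label)
mi_borough = mutual_info(borough, label)
print(round(mi_room, 2), round(mi_borough, 2))
```

A variable that fully determines a balanced binary label scores ln 2 (about 0.69), and an independent variable scores 0, so real features fall somewhere in between.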

In [44]:
above_average = (y_train >= 152).astype(int)
above_average_val   = (y_val >= 152).astype(int)
above_average_test  = (y_test >= 152).astype(int)

In [68]:
numeric = list(x_train.select_dtypes(include=['int', 'float']).columns)

In [45]:
categoric = list(x_train.select_dtypes(include='object').columns)
for col in categoric:
    print(f'{col:<20} - {mutual_info_score(x_train[col], above_average):.2f}')

neighbourhood_group  - 0.05
room_type            - 0.14


### Question 4

* Now let's train a logistic regression.
* Remember that we have two categorical variables in the data. Include them using one-hot encoding.
* Fit the model on the training dataset.
   * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
   * `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
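
As a small illustration of the one-hot encoding step, the sketch below runs `DictVectorizer` on two made-up records (the records and the name `dv_demo` are illustrative). String values become one indicator column per category, while numeric values pass through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

records = [
    {"room_type": "private_room", "minimum_nights": 1},
    {"room_type": "entire_home_apt", "minimum_nights": 3},
]

dv_demo = DictVectorizer(sparse=False)
encoded = dv_demo.fit_transform(records)

# feature_names_ lists the columns in the order used by the encoded matrix
print(dv_demo.feature_names_)
print(encoded)

# A category that was not seen during fitting gets zeros in every one-hot column
unseen = dv_demo.transform([{"room_type": "shared_room", "minimum_nights": 2}])
print(unseen)
```

This is why the vectorizer should be fitted on the training records only and then reused, via `transform`, for the validation and test records.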

In [46]:
train_dict = x_train.to_dict(orient = 'records')
dv = DictVectorizer(sparse=False)
# fit_transform learns the feature mapping and encodes the training set in one step
X_train = dv.fit_transform(train_dict)

In [47]:
# encode the validation data with the vectorizer fitted on the training data
val_dict = x_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

In [54]:
model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train, above_average)
val_preds = model.predict(X_val)
accuracy = accuracy_score(above_average_val, val_preds)
print(round(accuracy, 2))

# keep the full-model accuracy for later comparisons
acc_list = []
acc_list.append(['global accuracy', accuracy])

0.79


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Question 5

* We have 9 features: 7 numerical features and 2 categorical.
* Let's find the least useful one using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
* Which of the following features has the smallest difference?
   * `neighbourhood_group`
   * `room_type` 
   * `number_of_reviews`
   * `reviews_per_month`

> **note**: the difference doesn't have to be positive
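
The procedure above can be sketched end to end on synthetic data. Everything in this block is an assumption made for illustration: the helper `accuracy_with`, the toy columns `signal`, `noise` and `kind`, and the label, which is built so that only `signal` matters.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def accuracy_with(columns, train_df, train_y, val_df, val_y):
    """Train on `columns` only and return accuracy on the validation frame."""
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(train_df[columns].to_dict(orient="records"))
    X_va = dv.transform(val_df[columns].to_dict(orient="records"))
    clf = LogisticRegression(solver="lbfgs", C=1.0, random_state=42)
    clf.fit(X_tr, train_y)
    return float((clf.predict(X_va) == val_y).mean())


# Synthetic data: only "signal" drives the label
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "kind": rng.choice(["a", "b"], size=n),
})
label = (toy["signal"] > 0).astype(int).values
toy_train, toy_val = toy.iloc[:300], toy.iloc[300:]
y_tr, y_va = label[:300], label[300:]

all_columns = list(toy.columns)
baseline = accuracy_with(all_columns, toy_train, y_tr, toy_val, y_va)

# Difference = accuracy with all features minus accuracy without one feature
differences = {}
for col in all_columns:
    kept = [c for c in all_columns if c != col]
    differences[col] = baseline - accuracy_with(kept, toy_train, y_tr, toy_val, y_va)

for col, diff in sorted(differences.items(), key=lambda item: item[1]):
    print(f"{col:<8} {diff:+.3f}")
```

A difference close to zero means the model does about as well without the feature, which marks it as the least useful one. A fresh `DictVectorizer` is created for each subset so that no column mapping carries over between runs.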

In [59]:
# feature_names_ holds the encoded column names in matrix order
columns = dv.feature_names_

In [60]:
subset = dict(zip(columns, model.coef_[0].round(3)))

In [61]:
sorted(subset.items(), key=lambda x: x[1], reverse=True)

[('room_type=private_room', 1.629),
 ('neighbourhood_group=queens', 1.259),
 ('neighbourhood_group=manhattan', 0.083),
 ('latitude', 0.004),
 ('calculated_host_listings_count', 0.003),
 ('reviews_per_month', -0.003),
 ('neighbourhood_group=bronx', -0.013),
 ('room_type=entire_home_apt', -0.044),
 ('minimum_nights', -0.098),
 ('number_of_reviews', -0.134),
 ('longitude', -0.233),
 ('neighbourhood_group=brooklyn', -0.405),
 ('neighbourhood_group=staten_island', -0.805),
 ('room_type=shared_room', -1.15)]

In [94]:
features = categoric + numeric
# baseline: validation accuracy of the full model from Question 4
acc = accuracy
diff_dict = {}

for f in features:
    # feature list with one feature removed
    f_li = [c for c in features if c != f]

    # Encoding categorical variables in training set (fresh vectorizer per subset)
    dv = DictVectorizer(sparse=False)
    train_dict = x_train[f_li].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)

    # Encoding categorical variables in validate set
    validate_dict = x_val[f_li].to_dict(orient='records')
    X_validate = dv.transform(validate_dict)

    # Train model on the binarized target, not on the raw price
    model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
    model.fit(X_train, above_average)

    # probability of the positive class, thresholded at 0.5
    y_pred_val = model.predict_proba(X_validate)[:, 1]
    above_average_pred = (y_pred_val >= 0.5)

    # calculate accuracy without the feature
    acc_without_f = (above_average_val == above_average_pred).mean()

    # calculate the difference to the full-model accuracy
    diff_dict[f] = acc - acc_without_f

In [95]:
for di in sorted(diff_dict, key=diff_dict.get):
    print(f'Model without feature: {di} got difference ==> {diff_dict[di]}')


### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `'price'`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.
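
Two details of this question are worth spelling out: RMSE is the square root of the mean squared error, and `np.log1p` (that is, log(1 + x)) stays finite when a price is 0 and is inverted by `np.expm1`. A minimal sketch with made-up numbers; the helper `rmse` and the sample arrays are illustrative only.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared residual."""
    residual = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residual ** 2)))


# log1p is defined at zero; expm1 maps the values back to the original scale
prices = np.array([0.0, 80.0, 150.0])
log_prices = np.log1p(prices)
restored = np.expm1(log_prices)

# Residuals are 0, 0 and -2, so the RMSE is sqrt(4 / 3)
example = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(round(example, 3))
```

Because the square root is monotonic, ranking the alphas by MSE or by RMSE gives the same order; only the reported magnitudes, and therefore the rounded ties, differ.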


In [98]:
y_train_log = np.log1p(y_train)
y_validate_log = np.log1p(y_val)
y_test_log = np.log1p(y_test)

In [99]:
# Encoding categorical variables in training set
train_dict = x_train[features].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

# Encoding categorical variables in validate set
validate_dict = x_val[features].to_dict(orient='records')
X_validate = dv.transform(validate_dict)

alpha_li = [0, 0.01, 0.1, 1, 10]
mse_li = []

# Train model & calculate the validation error
for a in alpha_li:
    model = Ridge(alpha=a)
    model.fit(X_train, y_train_log)

    y_pred = model.predict(X_validate)

    # mean_squared_error returns the MSE; np.sqrt(mse) would give the RMSE
    mse = mean_squared_error(y_validate_log, y_pred)

    mse_li.append(mse)

In [103]:
for al, mse in zip(alpha_li, mse_li):
    print(f'Alpha = {al} ===> MSE = {round(mse, 3)}')

Alpha = 0 ===> MSE = 0.257
Alpha = 0.01 ===> MSE = 0.257
Alpha = 0.1 ===> MSE = 0.257
Alpha = 1 ===> MSE = 0.257
Alpha = 10 ===> MSE = 0.258
