## 3.15 Homework

### Dataset

In this homework, we will continue working with the New York City Airbnb Open Data. You can get it from
[Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up for Kaggle.

We'll keep working with the `'price'` variable, and we'll turn the problem into a classification task.


### Features

For the rest of the homework, you'll need the features from the previous homework plus two additional ones: `'neighbourhood_group'` and `'room_type'`. The full feature set is as follows:

* `'neighbourhood_group'`,
* `'room_type'`,
* `'latitude'`,
* `'longitude'`,
* `'price'`,
* `'minimum_nights'`,
* `'number_of_reviews'`,
* `'reviews_per_month'`,
* `'calculated_host_listings_count'`,
* `'availability_365'`

Select only these columns.


In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.linear_model import Ridge
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer



import warnings

warnings.filterwarnings("ignore")
%matplotlib inline

In [2]:
df = pd.read_csv('AB_NYC_2019.csv')
df.head()

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |


In [3]:
# selected features
features = ['neighbourhood_group', 'room_type', 'latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews',
            'reviews_per_month', 'calculated_host_listings_count', 'availability_365']

df_hw = df[features].copy()
# fill missing reviews_per_month values with 0
df_hw['reviews_per_month'] = df_hw['reviews_per_month'].fillna(0)

df_hw.isnull().sum()

neighbourhood_group               0
room_type                         0
latitude                          0
longitude                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

### Question 1

What is the most frequent observation (mode) for the column `'neighbourhood_group'`?

### Split the data

* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value ('price') is not in your dataframe.


In [4]:
df_hw['neighbourhood_group'].mode()

0    Manhattan
dtype: object
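
As a cross-check, the mode can also be read from `value_counts()`. The snippet below is a minimal sketch on a made-up series; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# toy column, invented for illustration
groups = pd.Series(['Manhattan', 'Brooklyn', 'Manhattan', 'Queens', 'Brooklyn', 'Manhattan'])

mode_value = groups.mode()[0]                 # mode() returns a Series, take its first entry
most_common = groups.value_counts().idxmax()  # label with the highest count

print(mode_value, most_common)
```

`mode()` returns a Series because several values can share the highest count, which is why the first entry is taken explicitly.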

In [5]:
# hold out 20% for test, then take 25% of the remaining 80% (20% of the total) for validation
df_train_full, df_test = train_test_split(df_hw, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=42)


df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train['price'].to_numpy()
y_val = df_val['price'].to_numpy()
y_test = df_test['price'].to_numpy()

del df_train['price']
del df_val['price']
del df_test['price']
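
The second call uses `test_size=0.25` because it is applied to the 80% left after the first split, and 25% of 80% is 20% of the original data. A quick sanity check of the proportions, sketched on a stand-in array of 1000 rows (an assumption made for illustration, not the real dataframe):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # stand-in for a dataframe with 1000 rows

# first split: 80% train+val, 20% test
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)
# second split: 25% of the remaining 80% goes to validation
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
```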


### Question 2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
   * In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?

In [6]:
numerical = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
             'reviews_per_month', 'calculated_host_listings_count',
            'availability_365']
df_train[numerical].corr()

| | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|
| latitude | 1.0 | 0.080301 | 0.027441 | -0.006246 | -0.007159 | 0.019375 | -0.005891 |
| longitude | 0.080301 | 1.0 | -0.06066 | 0.055084 | 0.134642 | -0.117041 | 0.083666 |
| minimum_nights | 0.027441 | -0.06066 | 1.0 | -0.07602 | -0.120703 | 0.118647 | 0.138901 |
| number_of_reviews | -0.006246 | 0.055084 | -0.07602 | 1.0 | 0.590374 | -0.073167 | 0.174477 |
| reviews_per_month | -0.007159 | 0.134642 | -0.120703 | 0.590374 | 1.0 | -0.048767 | 0.165376 |
| calculated_host_listings_count | 0.019375 | -0.117041 | 0.118647 | -0.073167 | -0.048767 | 1.0 | 0.225913 |
| availability_365 | -0.005891 | 0.083666 | 0.138901 | 0.174477 | 0.165376 | 0.225913 | 1.0 |
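
Reading the largest off-diagonal value by eye works for a small matrix, but it can also be extracted programmatically. The following is a minimal sketch on synthetic data; the column names `a`, `b`, `c` and the frame `toy` are hypothetical. The same steps should carry over to `df_train[numerical].corr()`.

```python
import numpy as np
import pandas as pd

# synthetic data, invented for illustration
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),  # strongly tied to 'a'
    'c': rng.normal(size=200),                 # independent noise
})

corr = toy.corr()
# keep only the cells above the diagonal (k=1), so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()             # long format: one value per (row, column) pair
top_pair = pairs.abs().idxmax()   # idxmax ignores NaN entries by default

print(top_pair, round(pairs.loc[top_pair], 3))
```

Taking the absolute value before `idxmax` treats a strong negative correlation as just as "big" as a strong positive one.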


In [7]:
# binary target: 1 if the price is at least 152, otherwise 0
y_train_bin = (y_train >= 152).astype(int)
y_val_bin = (y_val >= 152).astype(int)
y_test_bin = (y_test >= 152).astype(int)

### Question 3

* Calculate the mutual information score for the two categorical variables that we have. Use the training set only.
* Which of these two variables has bigger score?
* Round it to 2 decimal digits using `round(score, 2)`

In [8]:
def calculate_mi(series):
    # mutual information between one categorical column and the binary target
    return round(mutual_info_score(series, y_train_bin), 2)


categorical = ['neighbourhood_group', 'room_type']
df_mi = df_train[categorical].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')


display(df_mi.head())

| | MI |
|---|---|
| room_type | 0.14 |
| neighbourhood_group | 0.05 |
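
To get a feel for the scale of these scores, it helps to look at the two extremes. The sketch below uses tiny made-up label lists: one variable that determines the target completely and one that carries no information about it.

```python
from sklearn.metrics import mutual_info_score

target = [0, 0, 1, 1]
same = ['x', 'x', 'y', 'y']       # determines the target completely
unrelated = ['x', 'y', 'x', 'y']  # each value is split evenly across both classes

mi_same = mutual_info_score(same, target)
mi_unrelated = mutual_info_score(unrelated, target)

print(round(mi_same, 2))
print(round(mi_unrelated, 2))
```

For a balanced binary target, the first score equals ln 2 (about 0.69), which is the largest value mutual information can take here. The second is 0.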


### Question 4

* Now let's train a logistic regression
* For that, we need to turn our price prediction problem into a binary classification task.
* Let's create a variable `above_average` which is `1` if the price is above (or equal to) `152`.
* Remember that we have two categorical variables in the data. Include them using one-hot encoding.
* Fit the model on the training dataset.
   * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
   * `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.


In [9]:
df_train.head()

| | neighbourhood_group | room_type | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Brooklyn | Entire home/apt | 40.7276 | -73.94495 | 3 | 29 | 0.7 | 13 | 50 |
| 1 | Manhattan | Private room | 40.70847 | -74.00498 | 1 | 0 | 0.0 | 1 | 7 |
| 2 | Bronx | Entire home/apt | 40.83149 | -73.92766 | 40 | 0 | 0.0 | 1 | 0 |
| 3 | Brooklyn | Entire home/apt | 40.66448 | -73.99407 | 2 | 3 | 0.08 | 1 | 0 |
| 4 | Manhattan | Private room | 40.74118 | -74.00012 | 1 | 48 | 1.8 | 2 | 67 |


In [10]:
train_dict = df_train.to_dict(orient='records')
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)

X_train = dv.transform(train_dict)

model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train, y_train_bin)

val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)
# probability of the positive class, thresholded at 0.5
y_pred = model.predict_proba(X_val)[:, 1]
all_accuracy = (y_val_bin == (y_pred > 0.5)).mean()
print(round(all_accuracy, 2))

0.79
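
`DictVectorizer` handles the one-hot encoding here: string values become one binary column per category, while numeric values pass through unchanged. A minimal sketch with two hand-written records (the records are made up; only the column names mirror the dataset):

```python
from sklearn.feature_extraction import DictVectorizer

# two made-up records with one categorical and one numerical field
records = [
    {'room_type': 'Private room', 'minimum_nights': 1},
    {'room_type': 'Entire home/apt', 'minimum_nights': 3},
]

dv_demo = DictVectorizer(sparse=False)
X_demo = dv_demo.fit_transform(records)

print(dv_demo.feature_names_)  # column order of the encoded matrix
print(X_demo)
```

Categories that appear only in the validation data are ignored by `transform`, which is why the vectorizer is fitted on the training records only.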


### Question 5

* We have 9 features: 7 numerical features and 2 categorical.
* Let's find the least useful one using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
* Which of the following features has the smallest difference?
   * `neighbourhood_group`
   * `room_type` 
   * `number_of_reviews`
   * `reviews_per_month`

> **note**: the difference doesn't have to be positive

In [11]:
columns = df_train.columns

# validation accuracy of the model trained without each feature
features_scores = {}
for column in columns:
    print("Feature:", column)
    # drop the feature from the training data and refit the vectorizer
    df_train1 = df_train.copy()
    del df_train1[column]
    train_dict = df_train1.to_dict(orient='records')
    dv = DictVectorizer(sparse=False)
    dv.fit(train_dict)

    X_train = dv.transform(train_dict)

    model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
    model.fit(X_train, y_train_bin)

    # drop the same feature from the validation data
    df_val1 = df_val.copy()
    del df_val1[column]
    val_dict = df_val1.to_dict(orient='records')
    X_val = dv.transform(val_dict)
    y_pred = model.predict_proba(X_val)[:, 1]
    accuracy = (y_val_bin == (y_pred > 0.5)).mean()
    features_scores[column] = accuracy
    print("Accuracy without ", column, ": ", accuracy)
    # abs() reports the magnitude only; the signed difference is all_accuracy - accuracy
    print("Diff:", abs(all_accuracy - accuracy))

Feature: neighbourhood_group
Accuracy without  neighbourhood_group :  0.7509970344616014
Diff: 0.03568872072809082
Feature: room_type
Accuracy without  room_type :  0.7165354330708661
Diff: 0.0701503221188261
Feature: latitude
Accuracy without  latitude :  0.7863789753553533
Diff: 0.00030677983433891054
Feature: longitude
Accuracy without  longitude :  0.7867880151344718
Diff: 0.00010225994477963685
Feature: minimum_nights
Accuracy without  minimum_nights :  0.785356375907557
Diff: 0.001329379282135168
Feature: number_of_reviews
Accuracy without  number_of_reviews :  0.7869925350240311
Diff: 0.00030677983433891054
Feature: reviews_per_month
Accuracy without  reviews_per_month :  0.7850495960732181
Diff: 0.0016361591164740785
Feature: calculated_host_listings_count
Accuracy without  calculated_host_listings_count :  0.7866857551896922
Diff: 0.0
Feature: availability_365
Accuracy without  availability_365 :  0.7815727579507107
Diff: 0.005112997238981509
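
The elimination loop can also be wrapped in a reusable function that returns signed differences, so it is visible when removing a feature actually improves accuracy. This is a sketch under stated assumptions: the helper names `fit_and_score` and `elimination_diffs` are hypothetical, the data is synthetic, and `model.predict` is used in place of thresholding `predict_proba` at 0.5 (equivalent for a binary model apart from exact ties).

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def fit_and_score(df_tr, y_tr, df_va, y_va):
    """Train a logistic regression on df_tr and return its accuracy on df_va."""
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr.to_dict(orient='records'))
    X_va = dv.transform(df_va.to_dict(orient='records'))
    model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
    model.fit(X_tr, y_tr)
    return (model.predict(X_va) == y_va).mean()


def elimination_diffs(df_tr, y_tr, df_va, y_va):
    """Signed accuracy change (full model minus reduced model) per feature."""
    base = fit_and_score(df_tr, y_tr, df_va, y_va)
    return {
        col: base - fit_and_score(df_tr.drop(columns=[col]), y_tr,
                                  df_va.drop(columns=[col]), y_va)
        for col in df_tr.columns
    }


# synthetic data, invented for illustration: 'signal' drives the target, the rest is noise
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise': rng.normal(size=n),
    'group': rng.choice(['a', 'b'], size=n),
})
y = (toy['signal'] > 0).astype(int).to_numpy()

diffs = elimination_diffs(toy.iloc[:300], y[:300], toy.iloc[300:], y[300:])
# list features from least to most influential by absolute difference
for col, diff in sorted(diffs.items(), key=lambda item: abs(item[1])):
    print(col, round(diff, 4))
```

Sorting by absolute value answers "smallest difference", while the stored sign still shows the direction of the change.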


### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn
* We'll need to use the original column `'price'`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.
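
The solution below applies the transformation with `np.log1p`, which computes log(1 + x) and so stays finite when a price is 0. A minimal sketch of the transform and its inverse, using made-up prices:

```python
import numpy as np

prices = np.array([0, 80, 149, 225, 10000])  # made-up prices, including a zero

log_prices = np.log1p(prices)    # log(1 + price), defined at price = 0
restored = np.expm1(log_prices)  # exp(x) - 1 undoes log1p

print(log_prices.round(3))
print(restored.round(2))
```

Predictions from a model trained on the transformed target are on the log scale, so `np.expm1` is needed to turn them back into prices.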

In [12]:
def rmse(y, y_pred):
    error = y_pred - y
    mse = (error ** 2).mean()
    return np.sqrt(mse)




train_dict = df_train.to_dict(orient='records')
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)

X_train = dv.transform(train_dict)


val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

# log-transform the target; separate names keep the cell safe to re-run
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)

for alpha in [0, 0.01, 0.1, 1, 10]:
    print("Alpha:", alpha)
    ridge_reg = Ridge(alpha=alpha)
    ridge_reg.fit(X_train, y_train_log)

    y_pred = ridge_reg.predict(X_val)
    # round to 3 decimals as the question asks; round() with no digits argument rounds to an integer
    print("RMSE:", round(rmse(y_val_log, y_pred), 3))
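
The selection rule from Question 6 (lowest RMSE after rounding to 3 decimals, smallest `alpha` on ties) can be sketched as a small helper. The name `pick_alpha` is hypothetical, and the data is synthetic, so the scores say nothing about the Airbnb dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge


def rmse(y, y_pred):
    return np.sqrt(((y_pred - y) ** 2).mean())


def pick_alpha(rmse_by_alpha, digits=3):
    """Return the alpha with the lowest rounded RMSE; ties go to the smallest alpha."""
    return min(rmse_by_alpha, key=lambda a: (round(rmse_by_alpha[a], digits), a))


# synthetic regression data, invented for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=300)
X_tr, X_va, y_tr, y_va = X[:200], X[200:], y[:200], y[200:]

scores = {}
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    scores[alpha] = rmse(y_va, model.predict(X_va))

best_alpha = pick_alpha(scores)
print(best_alpha, round(scores[best_alpha], 3))
```

Sorting on the tuple `(rounded RMSE, alpha)` encodes the tie-break directly: two alphas with the same rounded score are ordered by their value.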
