# Homework
## Dataset
In this homework, we will continue working with the New York City Airbnb Open Data.

We'll keep working with the `price` variable, and we'll transform it to a classification task.

## Features
For the rest of the homework, you'll need the features from the previous homework plus two additional ones, `neighbourhood_group` and `room_type`. The full feature set is:

* `neighbourhood_group`,
* `room_type`,
* `latitude`,
* `longitude`,
* `price`,
* `minimum_nights`,
* `number_of_reviews`,
* `reviews_per_month`,
* `calculated_host_listings_count`,
* `availability_365`

Select only them and fill in the missing values with 0.

In [1]:
import numpy as np
import pandas as pd

columns = [
    'neighbourhood_group',
    'room_type',
    'latitude',
    'longitude',
    'price',
    'minimum_nights',
    'number_of_reviews',
    'reviews_per_month',
    'calculated_host_listings_count',
    'availability_365'    
]
df = pd.read_csv('AB_NYC_2019.csv')[columns].fillna(0)
df.head()

| | neighbourhood_group | room_type | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brooklyn | Private room | 40.64749 | -73.97237 | 149 | 1 | 9 | 0.21 | 6 | 365 |
| 1 | Manhattan | Entire home/apt | 40.75362 | -73.98377 | 225 | 1 | 45 | 0.38 | 2 | 355 |
| 2 | Manhattan | Private room | 40.80902 | -73.9419 | 150 | 3 | 0 | 0.0 | 1 | 365 |
| 3 | Brooklyn | Entire home/apt | 40.68514 | -73.95976 | 89 | 1 | 270 | 4.64 | 1 | 194 |
| 4 | Manhattan | Entire home/apt | 40.79851 | -73.94399 | 80 | 10 | 9 | 0.1 | 1 | 0 |


## Question 1
What is the most frequent observation (mode) for the column `neighbourhood_group`?

In [2]:
df.groupby('neighbourhood_group').size()

neighbourhood_group
Bronx             1091
Brooklyn         20104
Manhattan        21661
Queens            5666
Staten Island      373
dtype: int64
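
The mode can also be read off directly instead of scanning the group sizes. A minimal sketch, assuming a small hypothetical `toy` Series (its values are made up for illustration and are not the real dataset):

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative, not taken from the dataset.
toy = pd.Series(['Manhattan', 'Brooklyn', 'Manhattan', 'Queens', 'Manhattan', 'Brooklyn'])

# value_counts() sorts categories by frequency, most frequent first.
counts = toy.value_counts()

# mode() returns a Series because ties are possible; take the first entry.
most_frequent = toy.mode()[0]

print(counts)
print(most_frequent)
```

On the homework data the equivalent call would be `df['neighbourhood_group'].mode()[0]`, which should agree with the largest count in the output above (`Manhattan`, 21661 listings).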

### Split the data

* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value (`'price'`) is not in your dataframe.

In [3]:
from sklearn.model_selection import train_test_split

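def split_data(df, seed=42):
    # 20% goes to test; 25% of the remaining 80% (20% overall) goes to validation.
    train, test = train_test_split(df, test_size=0.2, random_state=seed)
    train, val = train_test_split(train, test_size=0.25, random_state=seed)
    
    # Separate the target from the features and reset the shuffled indices.
    dfs = [[d.drop('price', axis=1).reset_index(drop=True), d['price'].reset_index(drop=True)]
          for d in [train, val, test]]
    
    # Flatten to: trainX, trainy, valX, valy, testX, testy.
    return [item for pair in dfs for item in pair]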

In [4]:
trainX, trainy, valX, valy, testX, testy = split_data(df)
trainX.head()

| | neighbourhood_group | room_type | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Brooklyn | Entire home/apt | 40.7276 | -73.94495 | 3 | 29 | 0.7 | 13 | 50 |
| 1 | Manhattan | Private room | 40.70847 | -74.00498 | 1 | 0 | 0.0 | 1 | 7 |
| 2 | Bronx | Entire home/apt | 40.83149 | -73.92766 | 40 | 0 | 0.0 | 1 | 0 |
| 3 | Brooklyn | Entire home/apt | 40.66448 | -73.99407 | 2 | 3 | 0.08 | 1 | 0 |
| 4 | Manhattan | Private room | 40.74118 | -74.00012 | 1 | 48 | 1.8 | 2 | 67 |
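
The second call to `train_test_split` uses `test_size=0.25` because the validation set is carved out of the remaining 80% of the rows: 0.25 × 0.8 = 0.2 of the full data. A quick check of the resulting sizes on a hypothetical 100-row frame (`toy` and its contents are invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame, used only to check the split sizes.
toy = pd.DataFrame({'x': range(100)})

# First hold out 20% for test, then 25% of the remaining 80% for validation.
full_train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)  # (60, 20, 20)
```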


## Question 2

* Create the correlation matrix for the numerical features of your train dataset.
  * In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?


In [5]:
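# Restrict to numeric columns: recent pandas versions raise an error
# when corr() is called on a frame that still contains string columns.
trainX.select_dtypes('number').corr()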

| | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|
| latitude | 1.0 | 0.080301 | 0.027441 | -0.006246 | -0.007159 | 0.019375 | -0.005891 |
| longitude | 0.080301 | 1.0 | -0.06066 | 0.055084 | 0.134642 | -0.117041 | 0.083666 |
| minimum_nights | 0.027441 | -0.06066 | 1.0 | -0.07602 | -0.120703 | 0.118647 | 0.138901 |
| number_of_reviews | -0.006246 | 0.055084 | -0.07602 | 1.0 | 0.590374 | -0.073167 | 0.174477 |
| reviews_per_month | -0.007159 | 0.134642 | -0.120703 | 0.590374 | 1.0 | -0.048767 | 0.165376 |
| calculated_host_listings_count | 0.019375 | -0.117041 | 0.118647 | -0.073167 | -0.048767 | 1.0 | 0.225913 |
| availability_365 | -0.005891 | 0.083666 | 0.138901 | 0.174477 | 0.165376 | 0.225913 | 1.0 |


In [6]:
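# Zero out the diagonal (self-correlation of 1), then take each column's largest value.
trainX.select_dtypes('number').corr().replace({1: 0}).max().sort_values(ascending=False)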

number_of_reviews                 0.590374
reviews_per_month                 0.590374
calculated_host_listings_count    0.225913
availability_365                  0.225913
minimum_nights                    0.138901
longitude                         0.134642
latitude                          0.080301
dtype: float64

In [7]:
# Same idea for the most negative correlation in each column.
trainX.select_dtypes('number').corr().replace({1: 0}).min().sort_values(ascending=True)

minimum_nights                   -0.120703
reviews_per_month                -0.120703
longitude                        -0.117041
calculated_host_listings_count   -0.117041
number_of_reviews                -0.076020
latitude                         -0.007159
availability_365                 -0.005891
dtype: float64

The two features with the largest correlation are `number_of_reviews` and `reviews_per_month` (0.59).
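
Replacing the diagonal 1s with 0 works here, but it would also erase any off-diagonal correlation that happened to be exactly 1, and it needs separate `max` and `min` passes. One alternative, sketched on a small hypothetical frame (`toy` and its columns are invented for illustration), is to keep only the upper triangle of the matrix and pick the pair with the largest absolute value:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame: 'b' closely tracks 'a', 'c' does not.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [2, 4, 5, 8, 10, 13],
    'c': [3, -1, 4, -1, 5, -9],
})

corr = toy.corr()

# Keep only entries strictly above the diagonal so each pair appears once
# and self-correlations are excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Using absolute values also catches strong negative correlations.
best_pair = pairs.abs().idxmax()
print(best_pair, round(pairs[best_pair], 3))
```

For the homework data, `corr` would be the matrix computed on the numeric columns of `trainX`.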

### Make price binary

* We need to turn the price variable from numeric into binary.
* Let's create a variable `above_average`, which is 1 if the price is greater than or equal to `152` and 0 otherwise.

In [8]:
train_above_average, val_above_average, test_above_average = [(y >= 152).astype('int') for y in [trainy, valy, testy]]

## Question 3

* Calculate the mutual information score for the two categorical variables that we have. Use the training set only.
* Which of these two variables has the bigger score?
* Round it to 2 decimal digits using `round(score, 2)`

In [9]:
from sklearn.metrics import mutual_info_score

round(mutual_info_score(trainX.neighbourhood_group, train_above_average), 2)

0.05

In [10]:
round(mutual_info_score(trainX.room_type, train_above_average), 2)

0.14

`room_type` has a bigger score.
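
To put these scores in context: `mutual_info_score` reports mutual information in nats (natural logarithm). A small sanity check on hypothetical label lists (invented for illustration): a variable that fully determines a balanced binary target scores ln 2 ≈ 0.693, while a variable independent of the target scores 0.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Hypothetical balanced binary target.
target = [0, 0, 1, 1]

# 'informative' determines the target exactly; 'uninformative' is independent of it.
informative = ['a', 'a', 'b', 'b']
uninformative = ['x', 'y', 'x', 'y']

mi_high = mutual_info_score(informative, target)
mi_low = mutual_info_score(uninformative, target)

print(round(mi_high, 3), round(mi_low, 3))
```

Against that scale, 0.14 for `room_type` and 0.05 for `neighbourhood_group` both indicate partial rather than complete information about `above_average`.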

## Question 4

* Now let's train a logistic regression
* Remember that we have two categorical variables in the data. Include them using one-hot encoding.
* Fit the model on the training dataset.
  * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
  * `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer

CAT_VARS = ['neighbourhood_group', 'room_type']
NUM_VARS = list(filter(lambda s: s not in CAT_VARS, trainX.columns))

identity = FunctionTransformer(lambda x: x)

def create_transformation_pipeline(columns):
    num_attribs, cat_attribs = [list(filter(lambda s: s in columns, attribs)) for attribs in [NUM_VARS, CAT_VARS]]
    
    return ColumnTransformer([
        ('num', identity, num_attribs),
        ('cat', OneHotEncoder(), cat_attribs)
    ])

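def train_model(dfX, dfy, pipeline):
    # Fit the preprocessing pipeline on the data passed in, not on the global trainX.
    dfX_transformed = pipeline.fit_transform(dfX)
    # Note: the assignment text specifies solver='lbfgs'; this cell uses 'liblinear',
    # and the score reported below was obtained with it.
    model = LogisticRegression(solver='liblinear', C=1.0, random_state=42)
    model.fit(dfX_transformed, dfy)
    
    return model

def score_on_val(model, valX, valy, pipeline):
    # Reuse the already fitted pipeline so validation data is encoded like the training data.
    valX_transformed = pipeline.transform(valX)
    predictions = model.predict(valX_transformed)
    # Accuracy is the share of correct predictions.
    return round((predictions == valy).mean(), 2)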


pipeline = create_transformation_pipeline(trainX.columns)

model = train_model(trainX, train_above_average, pipeline)
original_score = score_on_val(model, valX, val_above_average, pipeline)
original_score

0.79
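
To see what the `'cat'` step of the `ColumnTransformer` produces, here is a minimal sketch on a hypothetical three-row frame (`toy` is invented for illustration). `OneHotEncoder` creates one 0/1 column per category, in the order stored in `categories_`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows; only the category labels matter here.
toy = pd.DataFrame({'room_type': ['Private room', 'Entire home/apt', 'Private room']})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense for display.
encoded = encoder.fit_transform(toy[['room_type']]).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(encoded)
```

By default the encoder raises an error when `transform` meets a category it did not see during `fit`; passing `handle_unknown='ignore'` encodes such rows as all zeros instead.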

## Question 5

* We have 9 features: 7 numerical features and 2 categorical.
* Let's find the least useful one using the feature elimination technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
* Which of the following features has the smallest difference?
  * `neighbourhood_group`
  * `room_type`
  * `number_of_reviews`
  * `reviews_per_month`

> note: the difference doesn't have to be positive

In [12]:
scores = {}


for col in trainX.columns:
    # Build a pipeline over every feature except `col`, then refit and rescore.
    pipeline = create_transformation_pipeline([c for c in trainX.columns if c != col])
    model = train_model(trainX, train_above_average, pipeline)
    scores[col] = score_on_val(model, valX, val_above_average, pipeline)

{col: round(abs(original_score - score), 2) for (col, score) in scores.items()}

{'neighbourhood_group': 0.04,
 'room_type': 0.06,
 'latitude': 0.0,
 'longitude': 0.0,
 'minimum_nights': 0.0,
 'number_of_reviews': 0.0,
 'reviews_per_month': 0.0,
 'calculated_host_listings_count': 0.0,
 'availability_365': 0.01}
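
The loop above rounds each accuracy to two decimals before subtracting and then takes the absolute value, so several features collapse to a difference of 0.0. A variant that keeps unrounded accuracies and signed differences can separate them. The sketch below illustrates the idea on synthetic data (the feature names `signal`, `noise_1`, `noise_2` and the helper `val_accuracy` are assumptions made for this example, not part of the homework):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): only 'signal' drives the label.
rng = np.random.default_rng(42)
n = 1000
features = {
    'signal': rng.normal(size=n),
    'noise_1': rng.normal(size=n),
    'noise_2': rng.normal(size=n),
}
y = (features['signal'] > 0).astype(int)
X = np.column_stack(list(features.values()))
names = list(features)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

def val_accuracy(cols):
    """Fit on the given column indices and return unrounded validation accuracy."""
    model = LogisticRegression(solver='liblinear', C=1.0, random_state=42)
    model.fit(X_train[:, cols], y_train)
    return (model.predict(X_val[:, cols]) == y_val).mean()

all_cols = list(range(len(names)))
baseline = val_accuracy(all_cols)

# Signed difference: positive means dropping the feature hurt accuracy.
diffs = {
    name: baseline - val_accuracy([c for c in all_cols if c != i])
    for i, name in enumerate(names)
}
print(diffs)
```

Rounding only when reporting, rather than before subtracting, avoids hiding effects smaller than 0.01.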

## Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn
* We'll need to use the original column `'price'`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.

In [13]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

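# Log-transform the target into new variables so that re-running the cell
# does not apply the transformation twice.
trainy_log = np.log1p(trainy)
valy_log = np.log1p(valy)
testy_log = np.log1p(testy)
pipeline = create_transformation_pipeline(trainX.columns).fit(trainX)
# Transform the features once; they do not depend on alpha.
trainX_transformed = pipeline.transform(trainX)
valX_transformed = pipeline.transform(valX)
scores = []

for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha, random_state=42)
    model.fit(trainX_transformed, trainy_log)
    
    predictions = model.predict(valX_transformed)
    # RMSE on the log scale; scikit-learn's convention is (y_true, y_pred).
    rmse = np.sqrt(mean_squared_error(valy_log, predictions))
    
    scores.append({'alpha': alpha, 'rmse': round(rmse, 3)})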
    
scores

[{'alpha': 0, 'rmse': 0.497},
 {'alpha': 0.01, 'rmse': 0.497},
 {'alpha': 0.1, 'rmse': 0.497},
 {'alpha': 1, 'rmse': 0.497},
 {'alpha': 10, 'rmse': 0.498}]
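
Four of the five alphas tie at 0.497 after rounding, so the tie-breaking rule ("select the smallest `alpha`") decides the answer. A minimal sketch of that rule, using the RMSE values copied from the output above:

```python
# RMSE results copied from the output above.
scores = [
    {'alpha': 0, 'rmse': 0.497},
    {'alpha': 0.01, 'rmse': 0.497},
    {'alpha': 0.1, 'rmse': 0.497},
    {'alpha': 1, 'rmse': 0.497},
    {'alpha': 10, 'rmse': 0.498},
]

# Sort key: lowest RMSE first, then smallest alpha to break ties.
best = min(scores, key=lambda s: (s['rmse'], s['alpha']))
print(best)  # {'alpha': 0, 'rmse': 0.497}
```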