# Homework
## Dataset
In this homework, we will continue the New York City Airbnb Open Data.

We'll keep working with the `price` variable, and we'll transform it to a classification task.

## Features
For the rest of the homework, you'll need to use the features from the previous homework with additional two `neighbourhood_group` and `room_type`. So the whole feature set will be set as follows:

* `neighbourhood_group`,
* `room_type`,
* `latitude`,
* `longitude`,
* `price`,
* `minimum_nights`,
* `number_of_reviews`,
* `reviews_per_month`,
* `calculated_host_listings_count`,
* `availability_365`

Select only them

In [1]:
import numpy as np
import pandas as pd

columns = [
    'neighbourhood_group',
    'room_type',
    'latitude',
    'longitude',
    'price',
    'minimum_nights',
    'number_of_reviews',
    'reviews_per_month',
    'calculated_host_listings_count',
    'availability_365'    
]
df = pd.read_csv('AB_NYC_2019.csv')[columns]
df.head()

Unnamed: 0,neighbourhood_group,room_type,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Brooklyn,Private room,40.64749,-73.97237,149,1,9,0.21,6,365
1,Manhattan,Entire home/apt,40.75362,-73.98377,225,1,45,0.38,2,355
2,Manhattan,Private room,40.80902,-73.9419,150,3,0,,1,365
3,Brooklyn,Entire home/apt,40.68514,-73.95976,89,1,270,4.64,1,194
4,Manhattan,Entire home/apt,40.79851,-73.94399,80,10,9,0.1,1,0


## Question 1
What is the most frequent observation (mode) for the column `neighbourhood_group`?

In [2]:
df.groupby('neighbourhood_group').size()

neighbourhood_group
Bronx             1091
Brooklyn         20104
Manhattan        21661
Queens            5666
Staten Island      373
dtype: int64

### Split the data

* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value (`'price'`) is not in your dataframe.

In [3]:
from sklearn.model_selection import train_test_split

def split_data(df, seed=42):
    train, test = train_test_split(df, test_size=0.2, random_state=seed)
    train, val = train_test_split(test, test_size=0.25, random_state=seed)
    
    dfs = [[d.drop('price', axis=1).reset_index(drop=True), d['price'].reset_index(drop=True)]
          for d in [train, val, test]]
    
    return [item for pair in dfs for item in pair]

In [4]:
trainX, trainy, valX, valy, testX, testy = split_data(df)
trainX.head()

Unnamed: 0,neighbourhood_group,room_type,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Brooklyn,Private room,40.62834,-73.90506,2,37,2.19,1,67
1,Manhattan,Entire home/apt,40.73476,-74.001,1,7,0.12,1,0
2,Manhattan,Private room,40.85612,-73.93024,1,50,1.32,1,90
3,Manhattan,Private room,40.73083,-73.9943,4,0,,1,0
4,Queens,Shared room,40.7154,-73.91148,1,5,2.68,1,89


## Question 2

* Create the correlation matrix for the numerical features of your train dataset.
  * In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?


In [5]:
trainX.corr()

Unnamed: 0,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
latitude,1.0,0.109087,0.022696,-0.026649,-0.031298,0.020536,-0.022594
longitude,0.109087,1.0,-0.056032,0.053435,0.155258,-0.103358,0.084891
minimum_nights,0.022696,-0.056032,1.0,-0.097775,-0.163913,0.177761,0.174556
number_of_reviews,-0.026649,0.053435,-0.097775,1.0,0.569521,-0.07035,0.168153
reviews_per_month,-0.031298,0.155258,-0.163913,0.569521,1.0,-0.020999,0.183087
calculated_host_listings_count,0.020536,-0.103358,0.177761,-0.07035,-0.020999,1.0,0.228664
availability_365,-0.022594,0.084891,0.174556,0.168153,0.183087,0.228664,1.0


In [6]:
trainX.corr().replace({1: 0}).max().sort_values(ascending=False)

number_of_reviews                 0.569521
reviews_per_month                 0.569521
calculated_host_listings_count    0.228664
availability_365                  0.228664
minimum_nights                    0.177761
longitude                         0.155258
latitude                          0.109087
dtype: float64

In [7]:
trainX.corr().replace({1: 0}).min().sort_values(ascending=True)

minimum_nights                   -0.163913
reviews_per_month                -0.163913
longitude                        -0.103358
calculated_host_listings_count   -0.103358
number_of_reviews                -0.097775
latitude                         -0.031298
availability_365                 -0.022594
dtype: float64

Two features with largest correlation are `number_of_reviews` and `reviews_per_month`.