# Homework 2
## Dataset

In this homework, we will use the New York City Airbnb Open Data. You can get it from Kaggle, or download it from here if you don't want to sign up for Kaggle.

The goal of this homework is to create a regression model for predicting apartment prices (column 'price').

## EDA

* Load the data.
* Look at the price variable. Does it have a long tail?
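
One way to check for a long tail is to compare the mean with the median and look at the skewness. The sketch below is only an illustration: `tail_summary` is a hypothetical helper, and `fake_prices` is synthetic log-normal data standing in for the real `price` column, not values from the dataset.

```python
import numpy as np
import pandas as pd

def tail_summary(values):
    """Return the mean, median, and sample skewness of a numeric sequence."""
    s = pd.Series(values, dtype=float)
    return {'mean': s.mean(), 'median': s.median(), 'skew': s.skew()}

# Synthetic stand-in for prices (an assumption, NOT the real dataset):
# log-normal values are right-skewed, i.e. they have a long right tail.
rng = np.random.default_rng(0)
fake_prices = rng.lognormal(mean=4.7, sigma=0.7, size=10_000)

raw = tail_summary(fake_prices)
logged = tail_summary(np.log1p(fake_prices))

print(raw)     # mean well above median, large positive skew
print(logged)  # after log1p the distribution is much more symmetric
```

With the real data, the same check would be `tail_summary(df.price)`. A mean well above the median together with a large positive skew suggests a long right tail.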

## Features

For the rest of the homework, you'll need to use only these columns:

* 'latitude',
* 'longitude',
* 'price',
* 'minimum_nights',
* 'number_of_reviews',
* 'reviews_per_month',
* 'calculated_host_listings_count',
* 'availability_365'

Select only them.

In [1]:
import numpy as np
import pandas as pd

columns = [
    'latitude',
    'longitude',
    'price',
    'minimum_nights',
    'number_of_reviews',
    'reviews_per_month',
    'calculated_host_listings_count',
    'availability_365'
]
df = pd.read_csv('AB_NYC_2019.csv')[columns]
df

       latitude  longitude  price  minimum_nights  number_of_reviews  reviews_per_month  calculated_host_listings_count  availability_365
0      40.64749  -73.97237    149               1                  9               0.21                               6               365
1      40.75362  -73.98377    225               1                 45               0.38                               2               355
2      40.80902  -73.94190    150               3                  0                NaN                               1               365
3      40.68514  -73.95976     89               1                270               4.64                               1               194
4      40.79851  -73.94399     80              10                  9               0.10                               1                 0
...         ...        ...    ...             ...                ...                ...                             ...               ...
48890  40.67853  -73.94995     70               2                  0                NaN                               2                 9
48891  40.70184  -73.93317     40               4                  0                NaN                               2                36
48892  40.81475  -73.94867    115              10                  0                NaN                               1                27
48893  40.75751  -73.99112     55               1                  0                NaN                               6                 2


## Question 1
Find a feature with missing values. How many missing values does it have?

In [2]:
df.isnull().sum()

latitude                              0
longitude                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

## Question 2

What's the median (50th percentile) of the variable 'minimum_nights'?

In [3]:
df.minimum_nights.median()

3.0

## Split the data

* Shuffle the initial dataset, use seed 42.
* Split your data into train/val/test sets, with a 60%/20%/20% distribution.
* Make sure that the target value ('price') is not in your dataframe.
* Apply the log transformation to the price variable using the np.log1p() function.

In [4]:
def split_data(seed):
    np.random.seed(seed)
    n = len(df)
    idx = np.arange(n)
    np.random.shuffle(idx)

    n_val = int(n * 0.2)
    n_train = n - (2 * n_val)

    assert n_train > n_val * 2

    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]

    val = df.iloc[val_idx] #.reset_index(drop=True)
    test = df.iloc[test_idx] #.reset_index(drop=True)
    train = df.iloc[train_idx] #.reset_index(drop=True)

    valX, valy = val.drop(['price'], axis=1), val['price']
    testX, testy = test.drop(['price'], axis=1), test['price']
    trainX, trainy = train.drop(['price'], axis=1), train['price']

    assert len(valX) == n_val
    assert len(testX) == n_val
    assert len(trainX) == n_train
    assert len(valX) + len(testX) + len(trainX) == n

    valy = np.log1p(valy)
    testy = np.log1p(testy)
    trainy = np.log1p(trainy)
    
    return valX, valy, testX, testy, trainX, trainy

In [5]:
valX, valy, _, _, trainX, trainy = split_data(42)
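
The `np.log1p` transformation used in `split_data` computes `log(1 + x)`, so it stays finite if a price is 0, and `np.expm1` undoes it when predictions need to be read as prices again. A minimal sketch with made-up values (the `prices` array is illustrative, not taken from the dataset):

```python
import numpy as np

# Illustrative prices, including a zero (an assumption for the demo)
prices = np.array([0.0, 40.0, 149.0, 10000.0])

log_prices = np.log1p(prices)     # log(1 + price); equals 0 at price = 0
restored = np.expm1(log_prices)   # exp(x) - 1, the inverse of log1p

print(log_prices)
print(restored)
```

Note that the RMSE values reported below are measured on this log scale, not in the original price units.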

## Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using round(score, 2).
* Which option gives better RMSE?

In [6]:
def linear_regression(X, y, **kwargs):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)
    
    return w_full[0], w_full[1:]
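
For reference, `linear_regression` above computes the closed-form least-squares solution (the normal equation). Writing $X$ for the feature matrix with a column of ones prepended and $y$ for the target vector, the code corresponds to:

$$
w = (X^\top X)^{-1} X^\top y
$$

The first component of $w$ is the bias term (`w_full[0]`) and the remaining components are the feature weights (`w_full[1:]`).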

In [7]:
def rmse(y, y_pred):
    error = y - y_pred
    se = error ** 2
    mse = se.mean()
    return np.sqrt(mse)

In [8]:
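def evaluate(Xtrain, ytrain, Xval, yval, na_fills, lin_fn, **kwargs):
    # Fill missing values (na_fills is a scalar or a per-column Series), then fit on the training set
    X_train = Xtrain.fillna(na_fills)
    w0, w = lin_fn(X_train, ytrain, **kwargs)

    # Apply the same fill values to the validation set and predict
    X_val = Xval.fillna(na_fills)
    y_pred = w0 + X_val.dot(w)

    # RMSE on the validation set, rounded to 2 decimal digits
    return round(rmse(yval, y_pred), 2)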

In [9]:
rmse_zero = evaluate(trainX, trainy, valX, valy, 0, linear_regression)
rmse_mean = evaluate(trainX, trainy, valX, valy, trainX.mean(), linear_regression)

print(f'RMSE (zero): {rmse_zero}')
print(f'RMSE (mean): {rmse_mean}')

RMSE (zero): 0.64
RMSE (mean): 0.64
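
The "training set only" rule means the fill value is computed once from the training rows and then reused for the validation rows. A small sketch with toy frames (the `train` and `val` data here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for the demo), each with one missing value
train = pd.DataFrame({'reviews_per_month': [1.0, np.nan, 3.0]})
val = pd.DataFrame({'reviews_per_month': [np.nan, 10.0]})

# Mean of the training column; NaN is skipped, so (1 + 3) / 2 = 2.0
train_mean = train['reviews_per_month'].mean()

# The same training mean fills both sets; the validation values
# never influence the statistic
train_filled = train.fillna({'reviews_per_month': train_mean})
val_filled = val.fillna({'reviews_per_month': train_mean})

print(train_filled)
print(val_filled)
```

Computing the mean on the validation set (or on the full dataset) would leak information from data the model is supposed to be evaluated on.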


## Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0.
* Try different values of `r` from this list: `[0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If there are multiple options, select the smallest `r`.

In [10]:
def regularized_linear_regression(X, y, **kwargs):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    r = kwargs['r']
    
    XTX = X.T.dot(X)
    XTX += r * np.eye(XTX.shape[0])
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)
    
    return w_full[0], w_full[1:]
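
The function above adds `r` to the diagonal of the Gram matrix before inverting it, which shrinks the weights as `r` grows. The sketch below shows that effect on synthetic data. It is not the lesson's code: `ridge_solve` is a hypothetical variant that swaps the explicit inverse for `np.linalg.solve`, and `X_demo` and `y_demo` are made up.

```python
import numpy as np

def ridge_solve(X, y, r=0.0):
    """Regularized least squares; like the function above, the bias is penalized too."""
    X = np.column_stack([np.ones(X.shape[0]), X])
    XTX = X.T @ X + r * np.eye(X.shape[1])
    # Solve (X^T X + r I) w = X^T y without forming the inverse explicitly
    w_full = np.linalg.solve(XTX, X.T @ y)
    return w_full[0], w_full[1:]

# Synthetic, noiseless data with known weights (an assumption for the demo)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 1.5 + X_demo @ np.array([2.0, -1.0, 0.5])

w0, w = ridge_solve(X_demo, y_demo, r=0.0)         # recovers the true weights
w0_reg, w_reg = ridge_solve(X_demo, y_demo, r=10.0)  # weights pulled toward zero

print(w0, w)
print(w0_reg, w_reg)
```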

In [11]:
for r in [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]:
    print('r=%f, rmse=%.2f' % (r, evaluate(trainX, trainy, valX, valy, 0, regularized_linear_regression, r=r)))

r=0.000000, rmse=0.64
r=0.000001, rmse=0.64
r=0.000100, rmse=0.64
r=0.001000, rmse=0.64
r=0.010000, rmse=0.66
r=0.100000, rmse=0.68
r=1.000000, rmse=0.68
r=5.000000, rmse=0.68
r=10.000000, rmse=0.68


## Question 5

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`).

> Note: Standard deviation shows how spread out the values are. If it's low, all values are approximately the same. If it's high, the values differ a lot. If the standard deviation of the scores is low, the model is stable across different splits.
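
To make the note concrete, here is a small sketch comparing two made-up lists of scores (both lists are illustrative, not results from this homework). It also shows that `np.std` computes the population standard deviation by default (`ddof=0`).

```python
import numpy as np

# Made-up score lists (an assumption for the demo)
stable = [0.64, 0.65, 0.64, 0.65]      # nearly identical scores
unstable = [0.40, 0.90, 0.55, 0.75]    # scores vary a lot between runs

std_stable = np.std(stable)
std_unstable = np.std(unstable)
print(round(std_stable, 3), round(std_unstable, 3))

# np.std divides by n by default (ddof=0); ddof=1 gives the sample estimate
values = [1, 2, 3, 4]
pop_std = np.std(values)             # sqrt(1.25)
sample_std = np.std(values, ddof=1)  # sqrt(5 / 3)
print(pop_std, sample_std)
```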

In [12]:
seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
scores = []

for seed in seeds:
    valX, valy, _, _, trainX, trainy = split_data(seed)
    scores.append(evaluate(trainX, trainy, valX, valy, 0, linear_regression))
    
scores

[0.65, 0.65, 0.65, 0.64, 0.64, 0.63, 0.63, 0.65, 0.65, 0.64]

In [13]:
round(np.std(scores), 3)

0.008

## Question 6

* Split the dataset like previously, use seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with r=0.001.
* What's the RMSE on the test dataset?

In [14]:
valX, valy, testX, testy, trainX, trainy = split_data(9)
combinedX = pd.concat([valX, trainX])
combinedy = pd.concat([valy, trainy])

combinedX.fillna(0, inplace=True)
evaluate(combinedX, combinedy, testX, testy, 0, regularized_linear_regression, r=0.001)

0.65