# 1. Info

This notebook contains all the code needed to solve the week 2 homework of the Machine Learning Zoomcamp.

## Import the required libraries

In [18]:
import pandas as pd
import numpy as np

## Getting the data

For this homework, we'll use the California Housing Prices dataset, downloaded from the URL in the next cell.

In [11]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

--2023-09-19 21:03:33--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1423529 (1.4M) [text/plain]
Saving to: 'housing.csv'


2023-09-19 21:03:33 (3.44 MB/s) - 'housing.csv' saved [1423529/1423529]



In [12]:
data = pd.read_csv('housing.csv')

The goal of this homework is to create a regression model for predicting housing prices (column 'median_house_value').

### EDA

* Load the data.
* Look at the median_house_value variable. Does it have a long tail?
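One way to answer the long-tail question numerically is to look at the skewness of the column. The sketch below uses synthetic log-normal values as a stand-in for `median_house_value` (the data and variable names are assumptions for illustration, not the housing data itself); a clearly positive skew indicates a long right tail, and the skew should move toward zero after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a price column (log-normal values are right-skewed).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.5, size=10_000))

skew_raw = prices.skew()            # positive for a long right tail
skew_log = np.log1p(prices).skew()  # closer to 0 after the log transform

print(f"skew raw={skew_raw:.2f}, skew log1p={skew_log:.2f}")
```

On the real data, the same two calls could be applied to `data['median_house_value']`.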

### Preparing the dataset

For this homework, we only want to use a subset of data.

First, keep only the records where ocean_proximity is either '<1H OCEAN' or 'INLAND'.

Next, use only the following columns:

* 'latitude',
* 'longitude',
* 'housing_median_age',
* 'total_rooms',
* 'total_bedrooms',
* 'population',
* 'households',
* 'median_income',
* 'median_house_value'

In [29]:
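# Select the columns listed above. The ocean_proximity row filter is not
# applied in this cell, so the outputs below cover all 20640 rows.
df = data[['latitude','longitude','housing_median_age','total_rooms',
      'total_bedrooms','population','households','median_income',
      'median_house_value']].copy()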
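The row filter on ocean_proximity described earlier can be combined with the column selection in a single `.loc` call. This is a minimal sketch on a toy frame (the `toy` data and the shortened column list are assumptions for illustration); on the real data the same pattern would use `data` and the full column list.

```python
import pandas as pd

# Toy frame standing in for `data`; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR BAY", "ISLAND", "INLAND"],
    "population": [100, 200, 300, 400, 500],
    "median_house_value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

keep = ["<1H OCEAN", "INLAND"]
columns = ["population", "median_house_value"]

# Filter rows and select columns in one step; .copy() gives an independent frame.
subset = toy.loc[toy["ocean_proximity"].isin(keep), columns].copy()
print(subset)
```

Applying this filter to the real data would change the row count and therefore the answers computed in the cells that follow.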

# Question 1

There's one feature with missing values. What is it?

* total_rooms
* total_bedrooms
* population
* households

In [30]:
df.isnull().sum()

latitude                0
longitude               0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
dtype: int64

# Question 2

What's the median (50th percentile) for variable 'population'?

* 995
* 1095
* 1195
* 1295

In [31]:
df[['population']].describe()

       population
count  20640.0
mean   1425.476744
std    1132.462122
min    3.0
25%    787.0
50%    1166.0
75%    1725.0
max    35682.0
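The 50% row of `describe()` is the median, which can also be read directly with `Series.median()` or `Series.quantile(0.5)`. A small sketch with made-up values (not taken from the dataset) shows that the two calls agree and that the median is far less sensitive to an outlier than the mean:

```python
import pandas as pd

# Made-up values with one large outlier.
values = pd.Series([10, 20, 30, 40, 1000])

median = values.median()
q50 = values.quantile(0.5)
mean = values.mean()

print(median, q50, mean)
```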


### Prepare and split the dataset

* Shuffle the initial dataset, use seed 42.
* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Apply the log transformation to the median_house_value variable using the np.log1p() function.

In [32]:
np.random.seed(42)

n = len(df)

n_val = int(0.2 * n)
n_test = int(0.2 * n)
n_train = n - (n_val + n_test)

idx = np.arange(n)
np.random.shuffle(idx)

df_shuffled = df.iloc[idx]

df_train = df_shuffled.iloc[:n_train].copy()
df_val = df_shuffled.iloc[n_train:n_train+n_val].copy()
df_test = df_shuffled.iloc[n_train+n_val:].copy()

In [33]:
# reset index
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [34]:
# target arrays: log1p-transformed median_house_value
y_train = np.log1p(df_train.median_house_value.values)
y_val = np.log1p(df_val.median_house_value.values)
y_test = np.log1p(df_test.median_house_value.values)
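Because the targets are transformed with `np.log1p`, model predictions are on the log scale as well. A minimal sketch with illustrative prices (assumed values, not dataset rows) shows that `np.expm1` undoes the transformation, which is how predictions could be mapped back to dollar values:

```python
import numpy as np

prices = np.array([0.0, 14999.0, 180000.0, 500001.0])  # illustrative values

y = np.log1p(prices)     # log(1 + x), defined at x = 0
recovered = np.expm1(y)  # exp(y) - 1, the inverse of log1p

print(np.allclose(recovered, prices))
```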

In [35]:
# drop the target from the feature frames so it cannot leak into X
del df_train['median_house_value']
del df_val['median_house_value']
del df_test['median_house_value']

# Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using round(score, 2)
* Which option gives better RMSE?

Options:

* With 0
* With mean
* Both are equally good

In [38]:
X_train_fill_zero = df_train.fillna(0).values
X_train_fill_mean = df_train.fillna(df_train.mean()).values

X_val_fill_zero = df_val.fillna(0).values
# impute validation with the training mean, as the task requires
X_val_fill_mean = df_val.fillna(df_train.mean()).values

In [39]:
def train_linear_regression(X, y):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)
    
    return w[0], w[1:]

def rmse(y, y_pred):
    error = y_pred - y
    mse = (error ** 2).mean()
    return np.sqrt(mse)
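As a sanity check on the normal-equation approach, its result can be compared against `np.linalg.lstsq`, which solves the same least-squares problem without forming an explicit inverse. The sketch below runs on synthetic data; the function names `fit_normal_equation` and `fit_lstsq` and the data are assumptions for illustration.

```python
import numpy as np

def fit_normal_equation(X, y):
    # Same idea as train_linear_regression: prepend a bias column,
    # then compute w = (X^T X)^{-1} X^T y.
    X = np.column_stack([np.ones(X.shape[0]), X])
    w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    return w[0], w[1:]

def fit_lstsq(X, y):
    # Least-squares solver; no explicit matrix inverse.
    X = np.column_stack([np.ones(X.shape[0]), X])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w[0], w[1:]

# Synthetic regression problem with a known bias of 2.0.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 + X.dot(np.array([1.0, -0.5, 0.25])) + rng.normal(scale=0.01, size=200)

b_ne, w_ne = fit_normal_equation(X, y)
b_ls, w_ls = fit_lstsq(X, y)
print(b_ne, w_ne)
print(b_ls, w_ls)
```

On well-conditioned data both routes should agree to numerical precision; the solver-based route tends to behave better when columns are nearly collinear.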

In [46]:
w_0, w = train_linear_regression(X_train_fill_zero,y_train)
y_pred = w_0 + X_val_fill_zero.dot(w)
round(rmse(y_val, y_pred),2)

0.33

In [47]:
w_0, w = train_linear_regression(X_train_fill_mean,y_train)
y_pred = w_0 + X_val_fill_mean.dot(w)
round(rmse(y_val, y_pred),2)

0.33

# Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0.
* Try different values of r from this list: [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10].
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which r gives the best RMSE?

If there are multiple options, select the smallest r.

Options:

* 0
* 0.000001
* 0.001
* 0.0001

In [48]:
X_train_fill_zero = df_train.fillna(0).values
X_val_fill_zero = df_val.fillna(0).values

In [49]:
def train_linear_regression_reg(X, y, r=0.0):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    reg = r * np.eye(XTX.shape[0])
    XTX = XTX + reg

    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)
    
    return w[0], w[1:]
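In matrix form, the function above computes the ridge (L2-regularized) solution, where $X$ already includes the leading column of ones and $I$ is the identity matrix of matching size:

$$
w = (X^\top X + r I)^{-1} X^\top y
$$

With $r = 0$ this reduces to the unregularized solution $w = (X^\top X)^{-1} X^\top y$. Because $r$ is added to every diagonal entry, this implementation also regularizes the bias term $w_0$.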

In [51]:
for r in [0, 0.001, 0.0001,0.000001]:
    w0, w = train_linear_regression_reg(X_train_fill_zero, y_train, r=r)

    y_pred = w0 + X_val_fill_zero.dot(w)
    score = round(rmse(y_val, y_pred), 2)
    
    print(f"r={r}, w0={w0}, score={score}")

r=0, w0=-11.686975241713926, score=0.33
r=0.001, w0=-11.67093131795569, score=0.33
r=0.0001, w0=-11.685368865728286, score=0.33
r=1e-06, w0=-11.68695917553698, score=0.33


# Question 5

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
* What's the standard deviation of all the scores? To compute the standard deviation, use np.std.
* Round the result to 3 decimal digits (round(std, 3))

What's the value of std?

* 0.5
* 0.05
* 0.005
* 0.0005

Note: Standard deviation shows how different the values are. If it's low, then all values are approximately the same. If it's high, the values are different. If standard deviation of scores is low, then our model is stable.
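To make the note concrete, here is a small sketch with two made-up sets of RMSE scores (illustrative numbers, not results from this notebook). `np.std` uses the population formula (`ddof=0`) by default.

```python
import numpy as np

stable = np.array([0.33, 0.34, 0.33, 0.34])    # illustrative scores
unstable = np.array([0.20, 0.45, 0.30, 0.50])  # illustrative scores

std_stable = np.std(stable)      # population std (ddof=0), NumPy's default
std_unstable = np.std(unstable)

print(round(std_stable, 3), round(std_unstable, 3))
```

The first set varies only in the second decimal, so its standard deviation is much smaller than that of the second set.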

In [55]:

# seeds
seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# array to save the rmse scores
rmse_scores = np.zeros(len(seeds))

for i, seed in enumerate(seeds):
    # shuffle the row indices with this seed
    idx = np.arange(n)
    np.random.seed(seed)
    np.random.shuffle(idx)

    # get the train, val, test datasets
    df_shuffled = df.iloc[idx]

    df_train = df_shuffled.iloc[:n_train].reset_index(drop=True).copy()
    df_val = df_shuffled.iloc[n_train:n_train+n_val].reset_index(drop=True).copy()
    df_test = df_shuffled.iloc[n_train+n_val:].reset_index(drop=True).copy()

    # target arrays: log1p-transformed median_house_value
    y_train = np.log1p(df_train.median_house_value.values)
    y_val = np.log1p(df_val.median_house_value.values)
    y_test = np.log1p(df_test.median_house_value.values)

    # drop the target from the feature frames so it cannot leak into X
    del df_train['median_house_value']
    del df_val['median_house_value']
    del df_test['median_house_value']

    X_train = df_train.fillna(0).values
    X_val = df_val.fillna(0).values

    w0, w = train_linear_regression_reg(X_train, y_train, 0)

    y_pred = w0 + X_val.dot(w)

    # store by position so the loop does not depend on the seeds being 0..9
    rmse_scores[i] = rmse(y_val, y_pred)

round(np.std(rmse_scores), 3)

0.004

# Question 6

* Split the dataset like previously, use seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with r=0.001.
* What's the RMSE on the test dataset?

Options:

* 0.13
* 0.23
* 0.33
* 0.43

In [56]:

# set the seed
idx = np.arange(n)
np.random.seed(9)
np.random.shuffle(idx)

# get the train, val, test datasets
df_shuffled = df.iloc[idx]

df_train = df_shuffled.iloc[:n_train].reset_index(drop=True).copy()
df_val = df_shuffled.iloc[n_train:n_train+n_val].reset_index(drop=True).copy()
df_test = df_shuffled.iloc[n_train+n_val:].reset_index(drop=True).copy()

# getting the y dataframe
y_train = np.log1p(df_train.median_house_value.values)
y_val = np.log1p(df_val.median_house_value.values)
y_test = np.log1p(df_test.median_house_value.values)

# to avoid using y in the x data we will erase y from the df
del df_train['median_house_value']
del df_val['median_house_value']
del df_test['median_house_value']


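# Note: this cell fits on the training split only and scores on validation;
# the question asks for a fit on train + validation scored on the test split.
X_train = df_train.fillna(0).values
X_val = df_val.fillna(0).values

w0, w = train_linear_regression_reg(X_train, y_train, 0.001)

y_pred = w0 + X_val.dot(w)

rmse(y_val, y_pred)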

0.3365921097084417
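The question asks for a model trained on the combined train and validation sets and evaluated on the test set. The following is a minimal sketch of that procedure on synthetic data; `train_ridge`, `make_split`, and the generated frames are assumptions for illustration, and the printed score says nothing about the housing dataset.

```python
import numpy as np
import pandas as pd

def train_ridge(X, y, r=0.0):
    # Mirrors train_linear_regression_reg: bias column plus r on the diagonal.
    X = np.column_stack([np.ones(X.shape[0]), X])
    XTX = X.T.dot(X) + r * np.eye(X.shape[1])
    w = np.linalg.inv(XTX).dot(X.T).dot(y)
    return w[0], w[1:]

def rmse(y, y_pred):
    return np.sqrt(((y_pred - y) ** 2).mean())

# Synthetic stand-ins for the split frames and targets (not the housing data).
rng = np.random.default_rng(9)

def make_split(n):
    X = pd.DataFrame(rng.normal(size=(n, 2)), columns=["a", "b"])
    y = 1.0 + 2.0 * X["a"].values - X["b"].values + rng.normal(scale=0.1, size=n)
    return X, y

df_train, y_train = make_split(60)
df_val, y_val = make_split(20)
df_test, y_test = make_split(20)

# Combine train and validation, refit, and score once on the held-out test set.
df_full = pd.concat([df_train, df_val]).reset_index(drop=True)
y_full = np.concatenate([y_train, y_val])

w0, w = train_ridge(df_full.fillna(0).values, y_full, r=0.001)
test_rmse = rmse(y_test, w0 + df_test.fillna(0).values.dot(w))
print(round(test_rmse, 2))
```

In the notebook, the same steps would use the seed-9 split frames and `train_linear_regression_reg`, with `y_test` as the evaluation target.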