
# **Homework-2**

## **Setup**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer

## **Dataset**
In this homework, we will use the California Housing Prices dataset from Kaggle.
The goal is to build a regression model that predicts housing prices (column 'median_house_value').

## **EDA**

In [3]:
# Load the data
data = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv")
data

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |


In [4]:
data['median_house_value'].tail()

20635    78100.0
20636    77100.0
20637    92300.0
20638    84700.0
20639    89400.0
Name: median_house_value, dtype: float64

## **Preparing the dataset**
For this homework, we only want to use a subset of data.

First, keep only the records where ocean_proximity is either '<1H OCEAN' or 'INLAND'

Next, use only the following columns:
*   'latitude',
*   'longitude',
*   'housing_median_age',
*   'total_rooms',
*   'total_bedrooms',
*   'population',
*   'households',
*   'median_income',
*   'median_house_value'


In [5]:
# Keep only the records located '<1H OCEAN' or 'INLAND'
data = data[data['ocean_proximity'].isin(['<1H OCEAN', 'INLAND'])]

# Keep only the numeric features and the target column
selected_columns = [
    'latitude', 'longitude', 'housing_median_age', 'total_rooms',
    'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value'
]
data = data[selected_columns]

## **Question 1**
There's one feature with missing values. What is it?

In [6]:
# Name of the first (here, the only) column that contains missing values
missing_feature = data.columns[data.isnull().any()].tolist()[0]
missing_feature

'total_bedrooms'
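Taking the first column that has any null works when exactly one column has gaps. A slightly more informative check counts the missing values per column. The following is a minimal sketch on a toy frame; `toy` and its values are made up for illustration and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are illustrative only)
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
    "population": [322.0, 2401.0, 496.0],
})

# Count missing values per column, then keep only the columns that have any
missing_counts = toy.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

Counting instead of indexing `[0]` also makes it obvious if more than one column has gaps.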

## **Question 2**
What's the median (50th percentile) for variable 'population'?

In [7]:
median_population = data['population'].median()
median_population

1195.0

## **Prepare and split the dataset**


*   Shuffle the dataset (the filtered one you created above), use seed 42.
*   Split your data in train/val/test sets, with 60%/20%/20% distribution.
*   Apply the log transformation to the median_house_value variable using the np.log1p() function.

In [8]:
# Shuffle the filtered dataset and split it 60%/20%/20%
np.random.seed(42)  # sample() draws from NumPy's global random state when random_state is not given
data_shuffled = data.sample(frac=1)

train_size = int(0.6 * len(data))
val_size = int(0.2 * len(data))
# The test set takes all remaining rows, so no explicit test size is needed

train_data = data_shuffled[:train_size].copy()
val_data = data_shuffled[train_size:train_size+val_size].copy()
test_data = data_shuffled[train_size+val_size:].copy()

In [9]:
# Apply log transformation to the target variable
train_data['median_house_value'] = np.log1p(train_data['median_house_value'])
val_data['median_house_value'] = np.log1p(val_data['median_house_value'])
test_data['median_house_value'] = np.log1p(test_data['median_house_value'])
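The shuffle, split, and log transform can also be written by shuffling row positions explicitly. The sketch below is a variant under stated assumptions, not the notebook's exact procedure: it runs on a synthetic frame (`toy` is hypothetical), and it gives the remainder to the training set so the three sizes always sum to the number of rows. It also shows that `np.expm1` undoes `np.log1p`, which is how predictions can be mapped back to prices.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the filtered housing frame (values are illustrative)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "median_income": rng.uniform(1, 10, size=10),
    "median_house_value": rng.uniform(50_000, 500_000, size=10),
})

n = len(toy)
n_val = int(0.2 * n)
n_test = int(0.2 * n)
n_train = n - n_val - n_test  # remainder goes to train so the sizes sum to n

# Shuffle row positions reproducibly, then slice into three parts
idx = np.arange(n)
np.random.seed(42)
np.random.shuffle(idx)
shuffled = toy.iloc[idx].reset_index(drop=True)

train = shuffled.iloc[:n_train]
val = shuffled.iloc[n_train:n_train + n_val]
test = shuffled.iloc[n_train + n_val:]

# log1p compresses the long right tail of prices; expm1 is its inverse
y_train = np.log1p(train["median_house_value"].values)
recovered = np.expm1(y_train)
```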

## **Question 3**
*   We need to deal with missing values for the column from Q1.
*   We have two options: fill it with 0 or with the mean of this variable.
*   Try both options. For each, train a linear regression model without regularization using the code from the lessons.
*   For computing the mean, use the training set only!
*   Use the validation dataset to evaluate the models and compare the RMSE of each option.
*   Round the RMSE scores to 2 decimal digits using round(score, 2)
*   Which option gives better RMSE?

In [10]:
# Fill with 0
train_data_fill_zero = train_data.fillna(0)
val_data_fill_zero = val_data.fillna(0)

In [11]:
# Fill with mean (computed from training data)
mean_value = train_data[missing_feature].mean()
train_data_fill_mean = train_data.fillna(mean_value)
val_data_fill_mean = val_data.fillna(mean_value)

In [12]:
# Define a function that trains a Ridge regression model and returns its validation RMSE
def train_ridge_regression(train, val, alpha):
    # Separate the features from the target
    X_train, y_train = train.drop(columns=['median_house_value']), train['median_house_value']
    X_val, y_val = val.drop(columns=['median_house_value']), val['median_house_value']

    # alpha is the regularization strength (the "r" in the questions)
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)

    # RMSE is the square root of the mean squared error on the validation set
    y_pred = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))

    return rmse

In [13]:
# Calculate RMSE for both fill options.
# Question 3 asks for a model without regularization; alpha=0.001 is a very small
# Ridge penalty used here as a close stand-in for the unregularized model.
ridge_alpha = 0.001
rmse_zero = train_ridge_regression(train_data_fill_zero, val_data_fill_zero, alpha=ridge_alpha)
rmse_mean = train_ridge_regression(train_data_fill_mean, val_data_fill_mean, alpha=ridge_alpha)

In [14]:
print("RMSE with 0:", round(rmse_zero, 2))
print("RMSE with mean:", round(rmse_mean, 2))

RMSE with 0: 0.34
RMSE with mean: 0.34
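Question 3 refers to "the code from the lessons" for an unregularized linear regression. A minimal sketch of that idea, assuming it means solving the normal equation with a prepended bias column, is shown below on synthetic data. The function names and the data are assumptions for illustration, and `np.linalg.solve` is used here in place of an explicit matrix inverse.

```python
import numpy as np

def train_linear_regression(X, y):
    """Solve the normal equation (X^T X) w = X^T y with a bias column prepended."""
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    XTX = X.T.dot(X)
    w_full = np.linalg.solve(XTX, X.T.dot(y))
    return w_full[0], w_full[1:]  # bias term, feature weights

def rmse(y, y_pred):
    """Root mean squared error between targets and predictions."""
    return np.sqrt(np.mean((y - y_pred) ** 2))

# Synthetic data generated exactly from y = 2 + 3*x1 - 1*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2 + 3 * X[:, 0] - X[:, 1]

w0, w = train_linear_regression(X, y)
score = rmse(y, w0 + X.dot(w))
```

Because the synthetic target is an exact linear function of the features, the recovered weights match the generating ones and the RMSE is numerically zero.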


## **Question 4**
*   Now let's train a regularized linear regression.
*   For this question, fill the NAs with 0.
*   Try different values of r from this list: [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10].
*   Use RMSE to evaluate the model on the validation dataset.
*   Round the RMSE scores to 2 decimal digits.
*   Which r gives the best RMSE?

If there are multiple options, select the smallest r.

In [15]:
# Fill NAs with 0
train_data_fill_zero_reg = train_data.fillna(0)
val_data_fill_zero_reg = val_data.fillna(0)

In [16]:
best_rmse = float('inf')
best_alpha = None
alphas = [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]

In [17]:
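for alpha in alphas:
    # Round to 2 decimals as the question asks, so near-equal scores count as ties
    rmse = round(train_ridge_regression(train_data_fill_zero_reg, val_data_fill_zero_reg, alpha), 2)

    # Strict comparison keeps the earliest (smallest) alpha when scores tie
    if rmse < best_rmse:
        best_rmse = rmse
        best_alpha = alpha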

In [18]:
best_alpha

0
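The effect of r can be seen directly in a normal-equation formulation, where r is added to the diagonal of X^T X. The sketch below is an assumption about what the regularized lesson code looks like, run on synthetic data with hypothetical names. Note one difference from scikit-learn: this version penalizes the bias term as well, whereas `Ridge` with its default `fit_intercept=True` leaves the intercept unpenalized.

```python
import numpy as np

def train_linear_regression_reg(X, y, r=0.0):
    """Normal equation with r added to the diagonal of X^T X (bias included)."""
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    XTX = X.T.dot(X) + r * np.eye(X.shape[1])
    w_full = np.linalg.solve(XTX, X.T.dot(y))
    return w_full[0], w_full[1:]

# Synthetic data: a linear signal plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 1.0 + X.dot(np.array([2.0, -1.0, 0.5])) + rng.normal(scale=0.1, size=100)

# Larger r shrinks the weight vector toward zero
norms = {}
for r in [0, 0.001, 1, 10]:
    w0, w = train_linear_regression_reg(X, y, r=r)
    norms[r] = np.linalg.norm(np.append(w0, w))
```

A small r barely changes the solution, which is consistent with several values of r producing the same rounded RMSE.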

## **Question 5**
*   We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
*   Try different seed values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
*   For each seed, do the train/validation/test split with 60%/20%/20% distribution.
*   Fill the missing values with 0 and train a model without regularization.
*   For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
*   What's the standard deviation of all the scores? To compute the standard deviation, use np.std.
*   Round the result to 3 decimal digits (round(std, 3))

What's the value of std?
*   Note: Standard deviation shows how different the values are. If it's low, then all values are approximately the same. If it's high, the values are different. If standard deviation of scores is low, then our model is stable.

In [19]:
seed_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
rmse_scores = []

In [20]:
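for seed in seed_values:
    # Re-shuffle the filtered data with the current seed
    np.random.seed(seed)
    data_shuffled = data.sample(frac=1)

    train_size = int(0.6 * len(data))
    val_size = int(0.2 * len(data))

    # Note: these splits come from the raw `data`, so the target is not
    # log-transformed here and the RMSE values are in dollars, not log units
    train_data = data_shuffled[:train_size].copy()
    val_data = data_shuffled[train_size:train_size+val_size].copy()

    train_data_fill_zero = train_data.fillna(0)
    val_data_fill_zero = val_data.fillna(0)

    # ridge_alpha (0.001) is a very small penalty, not strictly "no regularization"
    rmse = train_ridge_regression(train_data_fill_zero, val_data_fill_zero, alpha=ridge_alpha)
    rmse_scores.append(rmse)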

In [21]:
# Sample standard deviation of the RMSE scores (ddof=1).
# np.std defaults to ddof=0, the population standard deviation.
std_deviation = np.std(rmse_scores, ddof=1)

In [22]:
print("Standard deviation of RMSE scores:", round(std_deviation, 3))

Standard deviation of RMSE scores: 1408.475
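A spread in the thousands indicates RMSE scores measured in dollars rather than on the log scale, because the loop re-splits the raw `data` without reapplying `np.log1p`. The following is a minimal sketch of the Question 5 procedure with the log transform applied inside the loop, an unregularized `LinearRegression`, and NumPy's default `np.std`. It runs on synthetic data (`toy` and `split_and_score` are hypothetical), so its numbers say nothing about the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data (values are illustrative only)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "median_income": rng.uniform(1, 10, size=n),
    "total_bedrooms": rng.uniform(100, 1000, size=n),
})
toy["median_house_value"] = np.exp(
    11 + 0.2 * toy["median_income"] + rng.normal(0, 0.3, size=n)
)
# Knock out a few values so that fillna(0) has something to do
toy.loc[toy.sample(frac=0.05, random_state=0).index, "total_bedrooms"] = np.nan

def split_and_score(df, seed, target="median_house_value"):
    """Shuffle with the given seed, split 60/20/20, return validation RMSE (log scale)."""
    shuffled = df.sample(frac=1, random_state=seed)
    n_train = int(0.6 * len(df))
    n_val = int(0.2 * len(df))
    train = shuffled.iloc[:n_train].fillna(0)
    val = shuffled.iloc[n_train:n_train + n_val].fillna(0)

    # Log-transform the target inside the loop so every seed is scored in log units
    y_train = np.log1p(train[target])
    y_val = np.log1p(val[target])

    model = LinearRegression()  # ordinary least squares, no regularization
    model.fit(train.drop(columns=[target]), y_train)
    y_pred = model.predict(val.drop(columns=[target]))
    return np.sqrt(np.mean((y_val - y_pred) ** 2))

scores = [split_and_score(toy, seed) for seed in range(10)]
std = np.std(scores)  # population standard deviation (ddof=0), NumPy's default
print(round(std, 3))
```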


## **Question 6**
*   Split the dataset like previously, use seed 9.
*   Combine train and validation datasets.
*   Fill the missing values with 0 and train a model with r=0.001.
*   What's the RMSE on the test dataset?

In [23]:
np.random.seed(9)
data_shuffled = data.sample(frac=1)

In [24]:
train_size = int(0.6 * len(data))
val_size = int(0.2 * len(data))
test_size = int(0.2 * len(data))

In [25]:
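# Combined train + validation set: the first 80% of the shuffled rows
train_data = data_shuffled[:train_size+val_size].copy()
# Test set: the remaining 20%
test_data = data_shuffled[train_size+val_size:].copy()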

In [26]:
train_data_fill_zero = train_data.fillna(0)
test_data_fill_zero = test_data.fillna(0)

In [27]:
# Train Ridge with r=0.001 on the combined train + validation rows
model = Ridge(alpha=0.001)
model.fit(train_data_fill_zero.drop(columns=['median_house_value']), train_data_fill_zero['median_house_value'])

In [28]:
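# Predict on the test set and compute the RMSE.
# The target was not log-transformed after this re-split, so the RMSE is in dollars.
y_pred = model.predict(test_data_fill_zero.drop(columns=['median_house_value']))
rmse_test = np.sqrt(mean_squared_error(test_data_fill_zero['median_house_value'], y_pred))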

In [29]:
print("RMSE on the test dataset:", round(rmse_test, 2))

RMSE on the test dataset: 66703.69
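A test RMSE in the tens of thousands is on the dollar scale, since the re-split data was not passed through `np.log1p`. To report a score comparable to Questions 3 and 4, the target would be log-transformed before fitting. Below is a minimal sketch of that variant on synthetic data; `toy` and the variable names are hypothetical, and `pd.concat` is one way to combine the train and validation parts.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic stand-in for the housing data (values are illustrative only)
rng = np.random.default_rng(9)
n = 300
toy = pd.DataFrame({"median_income": rng.uniform(1, 10, size=n)})
toy["median_house_value"] = np.exp(
    11 + 0.2 * toy["median_income"] + rng.normal(0, 0.3, size=n)
)

# Shuffle with seed 9 and split 60/20/20
shuffled = toy.sample(frac=1, random_state=9)
n_train, n_val = int(0.6 * n), int(0.2 * n)
train = shuffled.iloc[:n_train]
val = shuffled.iloc[n_train:n_train + n_val]
test = shuffled.iloc[n_train + n_val:]

# Combine train and validation into one training set
full_train = pd.concat([train, val]).reset_index(drop=True)

target = "median_house_value"
model = Ridge(alpha=0.001)
model.fit(full_train.drop(columns=[target]), np.log1p(full_train[target]))

# Score on the test set in log units
y_test = np.log1p(test[target])
y_pred = model.predict(test.drop(columns=[target]))
rmse_log = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(round(rmse_log, 2))
```

Predictions on the log scale can be mapped back to prices with `np.expm1(y_pred)` when a dollar figure is needed.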
