##### Name: Stuti Upadhyay
##### Campus ID: XT81177
##### Instructor: Chalachew Jemberie

## Homework: Linear Regression Model


### Dataset

In this homework, we will use the California Housing Prices dataset from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).


The goal of this homework is to create a regression model for predicting housing prices (column `'median_house_value'`).

### EDA

* Load the data.
* Look at the `median_house_value` variable. Does it have a long tail? 
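
One way to check for a long tail is to compare the mean with the median and to look at the sample skewness. The sketch below uses synthetic log-normal data (not the housing data) purely for illustration; `prices`, `skew_raw`, and `skew_log` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for a price-like column
# (illustrative only; this is not the housing data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=5000))

skew_raw = prices.skew()            # clearly positive for a long right tail
skew_log = np.log1p(prices).skew()  # log1p compresses the large values

print("mean > median:", prices.mean() > prices.median())
print("skewness (raw):", round(skew_raw, 2))
print("skewness (log1p):", round(skew_log, 2))
```

A clearly positive skewness together with a mean above the median points to a long right tail, which is the usual motivation for the `np.log1p()` transformation applied to the target later in the homework.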

### Features

For the rest of the homework, you'll need to use only these columns:

* `'latitude'`,
* `'longitude'`,
* `'housing_median_age'`,
* `'total_rooms'`,
* `'total_bedrooms'`,
* `'population'`,
* `'households'`,
* `'median_income'`,
* `'median_house_value'`

Select only these columns.

In [1]:
import pandas as pd

# Load the data
data = pd.read_csv("housing.csv")

# Check median_house_value for a long tail:
# positive skewness indicates a long right tail
median_house_value = data['median_house_value']
print("Skewness of median_house_value:", median_house_value.skew())

Skewness of median_house_value: 0.9777632739098341


In [2]:
# Select required columns
selected_columns = ['latitude', 'longitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 
                    'population', 'households', 'median_income', 'median_house_value']
# Copy so that later column assignments do not act on a view of the original frame
data = data[selected_columns].copy()

### Question 1

Find a feature with missing values. How many missing values does it have?
- 207
- 307
- 408
- 508

In [3]:
# Find feature with missing values
missing_values_feature = data.columns[data.isnull().any()]
print("Feature with missing values:", missing_values_feature[0])
print("Number of missing values:", data[missing_values_feature[0]].isnull().sum())

Feature with missing values: total_bedrooms
Number of missing values: 207


### Question 2

What's the median (50th percentile) of the variable `'population'`?
- 1133
- 1122
- 1166
- 1188

In [4]:
# Calculate median for population
population_median = data['population'].median()
print("Median for variable 'population':", population_median)

Median for variable 'population': 1166.0


### Split the data

* Shuffle the initial dataset, use seed `42`.
* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Make sure that the target value ('median_house_value') is not in your dataframe.
* Apply the log transformation to the median_house_value variable using the `np.log1p()` function.
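
The steps above can be sketched with a seeded shuffle of row positions. This is a minimal sketch on a tiny synthetic frame; `df`, `idx`, `n_val`, and the other names are assumptions for illustration, and shuffling indices by hand is only one of several ways to get a 60%/20%/20% split.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the housing data (illustrative only).
df = pd.DataFrame({
    "median_income": np.linspace(1.0, 10.0, 10),
    "median_house_value": np.linspace(50_000, 500_000, 10),
})

# Sizes for a 60%/20%/20% split.
n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

# Shuffle row positions reproducibly with seed 42.
idx = np.arange(n)
np.random.seed(42)
np.random.shuffle(idx)

df_train = df.iloc[idx[:n_train]].reset_index(drop=True)
df_val = df.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)
df_test = df.iloc[idx[n_train + n_val:]].reset_index(drop=True)

# Log-transform the target and remove it from the feature frames.
y_train = np.log1p(df_train.pop("median_house_value").to_numpy())
y_val = np.log1p(df_val.pop("median_house_value").to_numpy())
y_test = np.log1p(df_test.pop("median_house_value").to_numpy())

print(len(df_train), len(df_val), len(df_test))
```

Using `pop` extracts the target and drops it from the frame in one step, so the target cannot leak into the features by accident.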

In [5]:
from sklearn.model_selection import train_test_split
import numpy as np

# Apply log transformation to the target variable
data['median_house_value'] = np.log1p(data['median_house_value'])

# Shuffle the initial dataset
data_shuffled = data.sample(frac=1, random_state=42)

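# Split the data: first 80%/20% into train+val and test,
# then 75%/25% of train+val, which gives 60%/20%/20% overall
train_val, test = train_test_split(data_shuffled, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)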

### Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using `round(score, 2)`
* Which option gives better RMSE?

Options:
- With 0
- With mean
- Both are equally good
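
The lesson code is not reproduced in this notebook. As a rough sketch of what training "without regularization" can look like, the block below fits a linear model by solving the normal equation `(X^T X) w = X^T y` after prepending a column of ones for the bias. The helpers `train_linear_regression` and `rmse` and the synthetic data are assumptions for illustration, not the course's implementation.

```python
import numpy as np

def train_linear_regression(X, y):
    """Fit y ~ w0 + X @ w by solving the normal equation."""
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])   # bias column first
    XTX = X.T @ X
    w_full = np.linalg.solve(XTX, X.T @ y)
    return w_full[0], w_full[1:]

def rmse(y, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - y_pred) ** 2))

# Synthetic data with known weights (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -0.5, 0.25])
y = 2.0 + X @ true_w + rng.normal(scale=0.1, size=200)

w0, w = train_linear_regression(X, y)
score = rmse(y, w0 + X @ w)
print("bias:", round(w0, 2), "weights:", np.round(w, 2))
print("RMSE:", round(score, 2))
```

The same `rmse` helper can be applied to validation predictions to compare the two imputation options.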

In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Option 1: Fill missing values with 0
train_0 = train.fillna(0)
val_0 = val.fillna(0)

X_train_0 = train_0.drop('median_house_value', axis=1)
y_train_0 = train_0['median_house_value']
X_val_0 = val_0.drop('median_house_value', axis=1)
y_val_0 = val_0['median_house_value']

model_0 = LinearRegression()
model_0.fit(X_train_0, y_train_0)
predictions_0 = model_0.predict(X_val_0)
# RMSE is the square root of the MSE
rmse_0 = np.sqrt(mean_squared_error(y_val_0, predictions_0))

# Option 2: Fill missing values with mean
mean_value = train[missing_values_feature[0]].mean()
train_mean = train.fillna(mean_value)
val_mean = val.fillna(mean_value)

X_train_mean = train_mean.drop('median_house_value', axis=1)
y_train_mean = train_mean['median_house_value']
X_val_mean = val_mean.drop('median_house_value', axis=1)
y_val_mean = val_mean['median_house_value']

model_mean = LinearRegression()
model_mean.fit(X_train_mean, y_train_mean)
predictions_mean = model_mean.predict(X_val_mean)
# RMSE is the square root of the MSE
rmse_mean = np.sqrt(mean_squared_error(y_val_mean, predictions_mean))

print("RMSE with 0:", round(rmse_0, 2))
print("RMSE with mean:", round(rmse_mean, 2))

RMSE with 0: 0.34
RMSE with mean: 0.34


### Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0. 
* Try different values of `r` from this list: `[0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If there are multiple options, select the smallest `r`.

Options:
- 0
- 0.000001
- 0.001
- 0.0001
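
Regularization in this setting commonly means adding `r` to the diagonal of `X^T X` before solving. The sketch below assumes that form (the lesson code is not shown in this notebook) and uses a hypothetical helper, `train_linear_regression_reg`, on synthetic data with two nearly identical features to show how `r` stabilizes the weights.

```python
import numpy as np

def train_linear_regression_reg(X, y, r=0.0):
    """Normal-equation fit with r added to the diagonal of X^T X."""
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    XTX = X.T @ X + r * np.eye(X.shape[1])
    w_full = np.linalg.solve(XTX, X.T @ y)
    return w_full[0], w_full[1:]

# Two nearly duplicate features make the unregularized problem ill-conditioned
# (synthetic data, illustrative only).
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=1e-3, size=100)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

for r in [0, 0.001, 0.1, 1]:
    w0, w = train_linear_regression_reg(X, y, r=r)
    print("r =", r, "weights:", np.round(w, 2))

w0_small, w_small = train_linear_regression_reg(X, y, r=0.0)
w0_large, w_large = train_linear_regression_reg(X, y, r=1.0)
```

Note that this form also penalizes the bias term, which differs from scikit-learn's `Ridge`, where the intercept is not penalized.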

In [7]:
# Fill NAs with 0
train_reg = train.fillna(0)
val_reg = val.fillna(0)

X_train_reg = train_reg.drop('median_house_value', axis=1)
y_train_reg = train_reg['median_house_value']
X_val_reg = val_reg.drop('median_house_value', axis=1)
y_val_reg = val_reg['median_house_value']

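from sklearn.linear_model import Ridge

# Try different values of r, using scikit-learn's Ridge with alpha=r as the
# regularized model (note: Ridge does not penalize the intercept)
rs = [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]
best_r = None
best_rmse = float('inf')

for r in rs:
    model_reg = Ridge(alpha=r)
    model_reg.fit(X_train_reg, y_train_reg)
    predictions_reg = model_reg.predict(X_val_reg)
    # Round to 2 decimals, as the question asks, before comparing
    rmse_reg = round(np.sqrt(mean_squared_error(y_val_reg, predictions_reg)), 2)

    # rs is ascending and the comparison is strict, so ties keep the smallest r
    if rmse_reg < best_rmse:
        best_rmse = rmse_reg
        best_r = r

print("Best r:", best_r)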

Best r: 0


### Question 5 

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores. 
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`)

> Note: Standard deviation shows how different the values are.
> If it's low, then all values are approximately the same.
> If it's high, the values are different. 
> If the standard deviation of the scores is low, then our model is *stable*.

Options:
- 0.16
- 0.00005
- 0.005
- 0.15555
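
To make the note about standard deviation concrete, the sketch below compares `np.std` for two made-up lists of scores. The numbers are invented for illustration and are not results from this homework.

```python
import numpy as np

# Invented scores: one tightly clustered list, one widely spread list.
stable_scores = [0.340, 0.341, 0.339, 0.340]
unstable_scores = [0.20, 0.45, 0.31, 0.60]

# np.std computes the population standard deviation (ddof=0) by default.
std_stable = np.std(stable_scores)
std_unstable = np.std(unstable_scores)

print("stable:", round(std_stable, 3))
print("unstable:", round(std_unstable, 3))
```

The tightly clustered list yields a standard deviation close to zero, while the spread-out list yields a much larger one.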

In [8]:
seed_scores = []

seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

for seed in seeds:
    # Split the data
    train_val_seed, test_seed = train_test_split(data_shuffled, test_size=0.2, random_state=seed)
    train_seed, val_seed = train_test_split(train_val_seed, test_size=0.25, random_state=seed)
    
    # Fill missing values with 0
    train_seed = train_seed.fillna(0)
    val_seed = val_seed.fillna(0)
    
    X_train_seed = train_seed.drop('median_house_value', axis=1)
    y_train_seed = train_seed['median_house_value']
    X_val_seed = val_seed.drop('median_house_value', axis=1)
    y_val_seed = val_seed['median_house_value']
    
    model_seed = LinearRegression()
    model_seed.fit(X_train_seed, y_train_seed)
    predictions_seed = model_seed.predict(X_val_seed)
    # RMSE is the square root of the MSE
    rmse_seed = np.sqrt(mean_squared_error(y_val_seed, predictions_seed))
    
    seed_scores.append(rmse_seed)

std_dev = np.std(seed_scores)
print("Standard deviation of scores:", round(std_dev, 3))

Standard deviation of scores: 0.005


### Question 6

* Split the dataset as before, using seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with `r=0.001`. 
* What's the RMSE on the test dataset?

Options:
- 0.35
- 0.135
- 0.450
- 0.245
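
Combining the train and validation parts can be done with `pd.concat` for the features and `np.concatenate` for the targets. The block below is a minimal sketch on toy frames; all names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy train/validation parts (illustrative only).
df_train = pd.DataFrame({"households": [10.0, 20.0],
                         "total_bedrooms": [3.0, np.nan]})
df_val = pd.DataFrame({"households": [30.0],
                       "total_bedrooms": [5.0]})
y_train = np.array([11.0, 12.0])
y_val = np.array([13.0])

# Stack the rows and rebuild a clean 0..n-1 index.
df_full_train = pd.concat([df_train, df_val]).reset_index(drop=True)
y_full_train = np.concatenate([y_train, y_val])

# Fill missing values with 0 before training.
X_full_train = df_full_train.fillna(0).to_numpy()
print(X_full_train.shape, y_full_train.shape)
```

Keeping the features and targets in the same row order matters here, which is why both are concatenated as train first, validation second.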

In [9]:
from sklearn.linear_model import Ridge

# Split the dataset with seed 9: the 80% part is train and validation combined,
# the remaining 20% is the test set from the same split
full_train, test_set = train_test_split(data_shuffled, test_size=0.2, random_state=9)
full_train = full_train.fillna(0)
test_set = test_set.fillna(0)

X_full_train = full_train.drop('median_house_value', axis=1)
y_full_train = full_train['median_house_value']
X_test_set = test_set.drop('median_house_value', axis=1)
y_test_set = test_set['median_house_value']

# Train a regularized model with r=0.001 (Ridge with alpha=r)
model_test = Ridge(alpha=0.001)
model_test.fit(X_full_train, y_full_train)
predictions_test = model_test.predict(X_test_set)
rmse_test = np.sqrt(mean_squared_error(y_test_set, predictions_test))

print("RMSE on test dataset:", round(rmse_test, 3))

RMSE on test dataset: 0.34
