# Data Analysis

Today, you would like to build a model to predict the median housing price in California.

Thankfully, your coworker has prepared a few options to help you build the model.

**Whenever a heading says "CHOOSE ONE", copy one of the choices, paste it into the cell below, and execute it.**

You are free to run other cells.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from joblib import dump, load
from kishu.user_study import install_submit_cell_execution
install_submit_cell_execution()

housing = pd.read_csv("./housing_california.csv")

## Data visualization

In [None]:
print(housing.head())

`median_house_value` is the target you need to predict from the other features.

In [None]:
print(housing.dtypes)

In [None]:
# Check whether there are any missing (null) values
housing.isnull().sum()

In [None]:
# Understand the distribution of each feature
housing.hist(figsize=(25,25),bins=50);

In [None]:
hcorr = housing.corr()
hcorr.style.background_gradient()

# Part 1 Data preparation

## Impute the missing values (CHOOSE ONE)

There are two ways to impute the 207 missing values in the `total_bedrooms` column.

1. **Choice 1**: Median Imputation. Simplicity is good for this dataset size.
```python
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(housing.iloc[:,4:5])
housing.iloc[:,4:5] = imputer.transform(housing.iloc[:,4:5])
```
2. **Choice 2**: kNN Unsupervised Learning Imputation. Median imputation can introduce bias; kNN imputer can provide a better representation.
```python
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Normalize the data
scaler = MinMaxScaler()
housing_scaled = pd.DataFrame(scaler.fit_transform(housing), columns=housing.columns)

# Apply KNN imputation
knn_imputer = KNNImputer(n_neighbors=2)  # You can adjust the number of neighbors
housing_imputed = pd.DataFrame(knn_imputer.fit_transform(housing_scaled), columns=housing.columns)

# Inverse transform to get back to the original scale
housing_imputed = pd.DataFrame(scaler.inverse_transform(housing_imputed), columns=housing.columns)

# Replace the total_bedrooms column in the original df with the imputed values
housing['total_bedrooms'] = housing_imputed['total_bedrooms']
```

**Paste and run** the code of your choice **in the cell below**. Edit the snippet's variable names only if you need to.

In [None]:
# Making sure there is no missing value
housing.isnull().sum()
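
To make the difference between the two imputation choices concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from `housing_california.csv`. Scaling is skipped because only one column drives the neighbor distance here; the kNN snippet above scales first because the real data has many columns on different scales.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data (hypothetical values): bedrooms roughly track rooms.
toy = pd.DataFrame({
    "total_rooms":    [100.0, 200.0, 400.0, 900.0, 1000.0],
    "total_bedrooms": [20.0, 40.0, 80.0, np.nan, 200.0],
})

# Median imputation ignores total_rooms and fills with the column median.
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

# kNN imputation averages total_bedrooms over the 2 rows closest in total_rooms.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)

print("median:", median_filled[3, 1])  # 60.0, the median of 20, 40, 80, 200
print("kNN:   ", knn_filled[3, 1])     # 140.0, the mean of 80 and 200
```

In this toy case the kNN estimate follows the large `total_rooms` value, while the median does not. Whether that helps on the real dataset is something to check rather than assume.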

## Train test set split (CHOOSE ONE)

How would you split the dataset into train and test sets?

1. **Choice 1**: Random split, so every row has the same chance of landing in the test set.
```python
tr_data, te_data = train_test_split(housing, test_size=0.2, random_state=43)
```
2. **Choice 2**: Stratified split by income category, given the correlation between columns, so both sets keep a similar income distribution.
```python
housing['income_cat'] = np.ceil(housing['median_income'] / 1.5)
# Cap the category at 5; assign the result back instead of a chained inplace update
housing['income_cat'] = housing['income_cat'].where(housing['income_cat'] < 5, 5.0)
tr_data, te_data = train_test_split(housing, test_size=0.2, stratify=housing['income_cat'])
housing = housing.drop('income_cat',axis=1)
tr_data = tr_data.drop('income_cat',axis=1)
te_data = te_data.drop('income_cat',axis=1)
```

In [None]:
# Making sure splitting is successful
print(tr_data.shape, te_data.shape)
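
The effect of stratifying can be checked by comparing category proportions in the test set against the full data. The sketch below uses synthetic, skewed income values (an assumption for illustration, not the housing data) and bins them the same way as Choice 2.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic, skewed "income" values (hypothetical, not the housing data).
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=2000)})
# Same binning idea as Choice 2: width-1.5 bins, capped at category 5.
toy["income_cat"] = np.ceil(toy["median_income"] / 1.5).clip(upper=5.0)

rand_tr, rand_te = train_test_split(toy, test_size=0.2, random_state=0)
strat_tr, strat_te = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["income_cat"]
)

def proportions(frame):
    """Share of rows in each income category, ordered by category."""
    return frame["income_cat"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "overall": proportions(toy),
    "random_test": proportions(rand_te),
    "stratified_test": proportions(strat_te),
}).fillna(0.0)
print(comparison.round(3))
```

The `stratified_test` column should track `overall` closely, while `random_test` can drift by sampling noise.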

## Divide the data into feature (X) and target (Y)

In [None]:
# For the training data
X_train = tr_data.drop('median_house_value', axis=1)
Y_train = tr_data['median_house_value']

# For the testing data
X_test = te_data.drop('median_house_value', axis=1)
Y_test = te_data['median_house_value']

In [None]:
# Checking that the dataset is complete
print(X_train.head())
print(Y_train.head())

# Part 2 Feature Engineering

Several features (`total_rooms`, `total_bedrooms`, `longitude`, `latitude`) are highly correlated, so let's simply drop some of them.

In [None]:
del X_train['total_bedrooms']
del X_test['total_bedrooms']
del X_train['longitude']
del X_test['longitude']
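
One way to confirm which feature pairs are highly correlated before dropping anything is to list the pairs whose absolute correlation exceeds a threshold. This is a sketch on synthetic data; the 0.9 threshold and the generated values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rooms = rng.uniform(500, 5000, size=500)
# Synthetic features: bedrooms depend on rooms, age is independent.
toy = pd.DataFrame({
    "total_rooms": rooms,
    "total_bedrooms": rooms * 0.2 + rng.normal(0, 30, size=500),
    "housing_median_age": rng.uniform(1, 52, size=500),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()
high_pairs = pairs[pairs > 0.9]  # 0.9 is an arbitrary illustrative threshold
print(high_pairs)
```

Applied to the real `X_train`, the same pattern would show which columns carry overlapping information.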

In [None]:
# Check the dataset again.
print(X_train.head())
print(Y_train.head())
print(X_test.head())
print(Y_test.head())

# Part 3 Model selection

## Select the model to use (CHOOSE ONE)

Try one of the two models below.

1. **Choice 1**: Linear regression
```python
model = LinearRegression(n_jobs=-1)
model.fit(X_train, Y_train)
```
2. **Choice 2**: Random Forest
```python
model = RandomForestRegressor(n_estimators=30)
model.fit(X_train, Y_train)
```
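
For intuition about how the two choices can differ, the sketch below fits both on a synthetic, deliberately non-linear target and scores them on a hold-out set. The data, the `candidates` dictionary, and the `random_state` values are assumptions for illustration; results on the housing data may look quite different.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic target with a squared term that a straight line cannot capture.
X = rng.uniform(-3, 3, size=(400, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=30, random_state=0),
}

toy_rmse = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    pred = candidate.predict(X_te)
    toy_rmse[name] = float(np.sqrt(metrics.mean_squared_error(y_te, pred)))
print(toy_rmse)
```

On this toy target the forest should come out ahead because of the squared term; that is a property of the synthetic data, not a general ranking of the two models.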

# Part 4 Model Evaluation

## Evaluate your model using RMSE

In [None]:
Y_pred = model.predict(X_test)
rmse = np.sqrt(metrics.mean_squared_error(Y_test, Y_pred))
rmse
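
RMSE is the square root of the mean squared difference between predictions and true values, so it is expressed in the same unit as the target. A small worked example with made-up numbers shows that the manual computation matches the `metrics.mean_squared_error` route used above.

```python
import numpy as np
from sklearn import metrics

# Made-up true values and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

errors = y_hat - y_true                      # [10, -10, 30]
rmse_manual = np.sqrt(np.mean(errors ** 2))  # sqrt((100 + 100 + 900) / 3)
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_hat))

print(round(rmse_manual, 4), round(rmse_sklearn, 4))  # both about 19.1485
```

Because the errors are squared before averaging, the single error of 30 dominates the result.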

# Part 5: Alternative Feature Engineering

Your coworker realized that "Part 2 Feature Engineering" should instead combine the correlated features into new features.

**Try the suggestion your coworker gave. Then, report the old and new RMSE below.**

**Hint**: The following code requires the `longitude` and `total_bedrooms` columns, but both have already been dropped from `X_train` and `X_test`. You may need to restore them.
```python
X_train['diag_coord'] = X_train['longitude'] + X_train['latitude']
X_train['bedperroom'] = X_train['total_bedrooms'] / X_train['total_rooms']
X_test['diag_coord'] = X_test['longitude'] + X_test['latitude']
X_test['bedperroom'] = X_test['total_bedrooms'] / X_test['total_rooms']

X_train = X_train.drop(['longitude', 'latitude', 'total_bedrooms', 'total_rooms'], axis=1)
X_test = X_test.drop(['longitude', 'latitude', 'total_bedrooms', 'total_rooms'], axis=1)
```
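
One possible way to restore dropped columns, assuming the split frames still hold the original columns, is to copy them back by index. The sketch below uses hypothetical toy frames (`toy_tr_data`, `toy_X_train`) rather than the notebook's variables; it is only one option, and the model would still need to be retrained afterwards.

```python
import pandas as pd

# Hypothetical stand-in for the training split, with a shuffled index.
toy_tr_data = pd.DataFrame(
    {
        "longitude": [-122.2, -118.3, -121.9],
        "latitude": [37.8, 34.1, 37.3],
        "total_rooms": [880.0, 1500.0, 1200.0],
        "total_bedrooms": [129.0, 300.0, 250.0],
    },
    index=[7, 2, 5],
)
# Feature frame after two columns were dropped.
toy_X_train = toy_tr_data.drop(columns=["longitude", "total_bedrooms"])

# Assigning a Series aligns on the index, so each row gets its own value back.
for col in ["longitude", "total_bedrooms"]:
    toy_X_train[col] = toy_tr_data[col]

toy_X_train["diag_coord"] = toy_X_train["longitude"] + toy_X_train["latitude"]
toy_X_train["bedperroom"] = toy_X_train["total_bedrooms"] / toy_X_train["total_rooms"]
print(toy_X_train[["diag_coord", "bedperroom"]])
```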

## Please report old and new RMSE below

What is the RMSE for each of the two feature engineering approaches?

RMSE of the **old way** (dropping features directly): xxxx.xxxx

RMSE of the **new way** (replacing with new features): xxxx.xxxx

# Part 6 Export the models

Please export the models trained with the first and second feature engineering methods, respectively, using the following code:
1. For the first model
   ```python
   dump(model, 'drop_feature.joblib') 
   ```
2. For the second model
   ```python
   dump(model, 'replace_feature.joblib')
   ```
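
To check that an exported file can be read back, a round trip with `dump` and `load` can be sketched as follows. The toy model, the temporary directory, and the file name `toy_model.joblib` are illustrative assumptions, not part of the task.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Tiny toy model: y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    dump(toy_model, path)   # serialize the fitted model to disk
    restored = load(path)   # read it back into a new object

# The restored model should reproduce the original predictions.
same = np.allclose(toy_model.predict(X), restored.predict(X))
print(same)
```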