# Identifying Bugs

On an ordinary Thursday afternoon, your co-workers suddenly ask you to model the median housing price in California. They need the model within 30 minutes.

You look around... It turns out everyone at your company is doing the same thing, and they are all competing for a limited number of Jupyter Notebook sessions. Because of the high demand, __this Jupyter Notebook session will be slowed down and restarted at some point__.

_You're in luck_. Years ago, you wrote a notebook that accurately modeled housing prices!

... _But_ ... you have just remembered that it may be riddled with bugs!

You don't have time to start from scratch. You __must__ make this notebook work, right in the middle of the Jupyter Notebook shortage.

__Can you successfully run this notebook and submit the model in time?__

____________________

# The Notebook

**Hint: Run the cells one by one, up to the "Part 1" section.**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
from kishu.user_study import install_submit_cell_execution
install_submit_cell_execution()
pd.options.mode.chained_assignment = None

np.random.seed(42)

Read in the dataset.

In [None]:
housing = pd.read_csv("./housing_california.csv")

In [None]:
print(housing.head())

In [None]:
print(housing.dtypes)

In [None]:
housing.isnull().sum()

In [None]:
housing.hist(figsize=(25,25),bins=50);

In [None]:
hcorr = housing.corr()
hcorr.style.background_gradient()

In [None]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Normalize dataset columns by scaling.
scaler = MinMaxScaler()
housing_imputed = pd.DataFrame(scaler.fit_transform(housing), columns=housing.columns)

In [None]:
# Fit a KNN for imputing missing values.
knn_imputer = KNNImputer(n_neighbors=2)
housing_imputed = pd.DataFrame(knn_imputer.fit_transform(housing_imputed), columns=housing.columns)

In [None]:
# Undo the min-max scaling so the imputed dataset is back in its original units.
housing_imputed = pd.DataFrame(scaler.inverse_transform(housing_imputed), columns=housing.columns)
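
The three cells above scale the columns, impute the missing values with KNN, and then undo the scaling. As a rough illustration of that sequence, here is a minimal sketch on a hypothetical four-row table. The names `toy`, `toy_scaler`, and `toy_filled` are made up for this example and are chosen so that they do not overwrite any variable used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data (not the housing dataset): one missing "rooms" value.
toy = pd.DataFrame({
    "rooms": [2.0, 3.0, np.nan, 5.0],
    "price": [100.0, 150.0, 200.0, 250.0],
})

# Step 1: scale every column to [0, 1]. MinMaxScaler ignores NaNs when
# fitting and leaves them as NaN in the transformed output.
toy_scaler = MinMaxScaler()
toy_scaled = toy_scaler.fit_transform(toy)

# Step 2: fill each NaN with the mean of that column over the 2 nearest rows.
toy_imputed = KNNImputer(n_neighbors=2).fit_transform(toy_scaled)

# Step 3: map the values back to the original units.
toy_filled = pd.DataFrame(
    toy_scaler.inverse_transform(toy_imputed), columns=toy.columns
)
print(toy_filled)
```

Scaling first keeps a column with large values (such as a price) from dominating the distance that KNN uses to pick neighbors.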

In [None]:
# Replace the missing values in the original dataframe.
housing = housing_imputed[~housing['total_bedrooms'].isna()]

In [None]:
# Stratified train-test split: first categorize the median income by quantizing.
housing['income_cat'] = np.ceil(housing['median_income']/1.5)

In [None]:
# Clamp the maximum category.
housing['income_cat'].where(housing['income_cat']<5, 5.0, inplace=True)

In [None]:
# Split train and test by stratified sampling on income category.
tr_data,te_data = train_test_split(housing, test_size=0.2,stratify=housing['income_cat'])
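
To make the idea behind the last three cells more concrete, here is a small sketch of quantizing a value into categories, clamping the top category, and splitting with stratification. It uses hypothetical toy data, and the names `demo`, `demo_train`, and `demo_test` are made up so that nothing in the notebook is overwritten. The fixed `random_state` is an assumption added only to make the sketch repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20 rows for each of five income levels.
demo = pd.DataFrame({"income": np.tile([1.0, 2.0, 4.0, 5.5, 9.0], 20)})

# Quantize: divide by 1.5 and round up, giving categories 1, 2, 3, 4, 6.
demo["cat"] = np.ceil(demo["income"] / 1.5)

# Clamp: keep categories below 5 and replace everything else with 5.0.
demo["cat"] = demo["cat"].where(demo["cat"] < 5, 5.0)

# Stratified split: each category keeps the same share in train and test.
demo_train, demo_test = train_test_split(
    demo, test_size=0.2, stratify=demo["cat"], random_state=0
)
print(demo_test["cat"].value_counts().sort_index())
```

Because every category holds 20% of the toy rows, the 20-row test set should contain the same number of rows from each category.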

In [None]:
# Drop the intermediate column (income category).
housing.drop('income_cat',axis=1)
tr_data.drop('income_cat',axis=1)
te_data.drop('income_cat',axis=1)

Split the data into features and the target values to predict.

In [None]:
X_train = tr_data.drop('median_house_value', axis=1)
Y_train = tr_data['median_house_value']

In [None]:
X_test = te_data.drop('median_house_value', axis=1)
Y_test = te_data['median_house_value']

Drop unnecessary columns.

In [None]:
del X_train['latitude']
del X_train['total_rooms']
del X_train['households']

In [None]:
del X_test['latitude']
del X_test['total_rooms']
del X_test['households']

## Part 1

Please do **not** edit cells in this section.

In [None]:
from kishu.user_study import check_train_test_dataset
check_pass = check_train_test_dataset(X_train, Y_train, X_test, Y_test)

**Task**: Identify which **cell** removes rows from the dataset.

**Please contact the staff after successfully identifying the cell.**

In [None]:
assert check_pass, "Must fix the previous bug before continuing"

## Part 2

Please do **not** edit cells in this section.

Your Jupyter Notebook session is being restarted due to the high demand!

**Please manually restart the kernel ("Kernel" > "Restart Kernel...") and continue.**

**Hint**: After the restart, you may need to restore the state.

## Part 3

Please do **not** edit cells in this section.

In [None]:
# Fit a linear model.
model = LinearRegression(n_jobs=-1)
model.fit(X_train,Y_train)
Y_pred = model.predict(X_test)
rmse = np.sqrt(metrics.mean_squared_error(Y_test,Y_pred))
print(f"RMSE= {rmse}")

In [None]:
from kishu.user_study import submit
success = submit(model)  # Expecting "CHECK PASSED"

**Task**: Identify which **cell** creates the temporary column.

**Please contact the staff after successfully identifying the cell.**