**HOMEWORK**  

The goal of this homework is to create a regression model for predicting housing prices (column 'median_house_value').  

In this homework we'll again use the California Housing Prices dataset - the same one we used in homework 2 and 3.

You can take it from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices) or download using wget link mentioned below:

```
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

```



In [1]:
#@ IMPORTING LIBRARIES AND DEPENDENCIES:
import re
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

%matplotlib inline

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
#@ DOWNLOADING THE DATASET: UNCOMMENT BELOW:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

--2022-10-17 23:56:01--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1423529 (1.4M) [text/plain]
Saving to: ‘housing.csv.1’


2022-10-17 23:56:04 (2.87 MB/s) - ‘housing.csv.1’ saved [1423529/1423529]



In [3]:
#@ READING DATASET:
PATH = "./housing.csv"
select_cols = ["longitude", "latitude", "housing_median_age", "total_rooms", "total_bedrooms", "population", "households", 
               "median_income", "median_house_value", "ocean_proximity"]
df = pd.read_csv(PATH, usecols=select_cols)
df.total_bedrooms = df.total_bedrooms.fillna(0)

In [4]:
cols = ['latitude', 'longitude', 'housing_median_age', 'total_rooms','total_bedrooms', 'population', 'households', 'median_income', 'median_house_value', 'ocean_proximity']

In [5]:
df = df[cols]

In [6]:
df.head(20)

Unnamed: 0,latitude,longitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,37.88,-122.23,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,37.86,-122.22,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,37.85,-122.24,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,37.85,-122.25,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,37.85,-122.25,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
5,37.85,-122.25,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
6,37.84,-122.25,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
7,37.84,-122.25,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
8,37.84,-122.26,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
9,37.84,-122.25,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY


In [7]:
df.columns = df.columns.str.lower()
df.head(20)

Unnamed: 0,latitude,longitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,37.88,-122.23,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,37.86,-122.22,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,37.85,-122.24,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,37.85,-122.25,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,37.85,-122.25,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
5,37.85,-122.25,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
6,37.84,-122.25,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
7,37.84,-122.25,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
8,37.84,-122.26,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
9,37.84,-122.25,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY


In [8]:
df.isna().sum()

latitude              0
longitude             0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

In [9]:
df['median_house_value'] = np.log1p(df['median_house_value'])

In [10]:
df.head()

Unnamed: 0,latitude,longitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,37.88,-122.23,41.0,880.0,129.0,322.0,126.0,8.3252,13.022766,NEAR BAY
1,37.86,-122.22,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,12.789687,NEAR BAY
2,37.85,-122.24,52.0,1467.0,190.0,496.0,177.0,7.2574,12.771673,NEAR BAY
3,37.85,-122.25,52.0,1274.0,235.0,558.0,219.0,5.6431,12.74052,NEAR BAY
4,37.85,-122.25,52.0,1627.0,280.0,565.0,259.0,3.8462,12.743154,NEAR BAY


- Apply the log transform to `median_house_value`. 
- Do train/validation/test split with 60%/20%/20% distribution.
- Use the `train_test_split` function and set the `random_state parameter` to 1.

In [11]:
#@ SPLITTING THE DATASET FOR TRAINING AND TEST:
from sklearn.model_selection import train_test_split
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=1)

In [12]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [13]:
y_train = df_train.median_house_value.values
del df_train['median_house_value']

y_val = df_val.median_house_value.values
del df_val['median_house_value']

y_test = df_test.median_house_value.values
del df_test['median_house_value']

- We will use `DictVectorizer` to turn train and validation into matrices.

In [14]:
#@ IMPLEMENTATION OF DICTVECTORIZER:
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score

dv = DictVectorizer(sparse=False)

**Question 1**

Let's train a decision tree regressor to predict the `median_house_value` variable.

Train a model with `max_depth=1`.

In [15]:
train_dict = df_train.to_dict(orient = 'records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val.to_dict(orient = 'records')
X_val = dv.fit_transform(val_dict)

In [16]:
#@ TRAINING THE REGRESSION MODEL:
from sklearn.tree import DecisionTreeRegressor, export_text
 
dt = DecisionTreeRegressor(max_depth = 1)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_val)


print(export_text(dt, feature_names = dv.get_feature_names()))

|--- ocean_proximity=INLAND <= 0.50
|   |--- value: [12.31]
|--- ocean_proximity=INLAND >  0.50
|   |--- value: [11.61]





- Which feature is used for splitting the data?

- Answer: ocean_proximity

**Question 2**

Train a random forest model with these parameters:

- `n_estimators=10`  
- `random_state=1`  
- `n_jobs=-1` (optional-to make training faster)

In [17]:
#@ TRAINING RANDOM FOREST MODEL:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

In [18]:
#@ CALCULATING MEAN SQUARED ERROR:
rf_pred = rf.predict(X_val)
rmse = mean_squared_error(y_val, y_pred)

In [19]:
print(rmse)

0.21887168808741775


- What's the RMSE of this model on validation?

- Answer: 0.2188716880874183

**Question 3**

Now, let's experiment with the `n_estimators` parameter.

- Try different values of this parameter from 10 to 200 with step 10.
- Set `random_state` to 1.
- Evaluate the model on the validation dataset.

In [20]:
#@ TRAINING THE RANDOM FOREST MODEL:
for i in range(10, 201, 10):
    #print(i)
    rf = RandomForestRegressor(n_estimators=i, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    rf_pred = rf.predict(X_val)
    rmse = mean_squared_error(y_val, y_pred)
    
    print("Estimator Value %s RMSE %s" % (i, rmse))

Estimator Value 10 RMSE 0.21887168808741775
Estimator Value 20 RMSE 0.21887168808741775
Estimator Value 30 RMSE 0.21887168808741775
Estimator Value 40 RMSE 0.21887168808741775
Estimator Value 50 RMSE 0.21887168808741775
Estimator Value 60 RMSE 0.21887168808741775
Estimator Value 70 RMSE 0.21887168808741775
Estimator Value 80 RMSE 0.21887168808741775
Estimator Value 90 RMSE 0.21887168808741775
Estimator Value 100 RMSE 0.21887168808741775
Estimator Value 110 RMSE 0.21887168808741775
Estimator Value 120 RMSE 0.21887168808741775
Estimator Value 130 RMSE 0.21887168808741775
Estimator Value 140 RMSE 0.21887168808741775
Estimator Value 150 RMSE 0.21887168808741775
Estimator Value 160 RMSE 0.21887168808741775
Estimator Value 170 RMSE 0.21887168808741775
Estimator Value 180 RMSE 0.21887168808741775
Estimator Value 190 RMSE 0.21887168808741775
Estimator Value 200 RMSE 0.21887168808741775


In [21]:
#@ INSPECTING THE RMSE SCORES:


- After which value of `n_estimators` does RMSE stop improving?

- Answer: 10

**Question 4**

Let's select the best `max_depth`:

- Try different values of `max_depth`: [10, 15, 20, 25].
- For each of these values, try different values of n_estimators from 10 till 200 (with step 10).
- Fix the random seed: `random_state=1`.

In [22]:
#@ TRAINING THE MODEL WITH DEPTH:
md = [10, 15, 20, 25]
for m in md:
    for i in range(10, 201, 10):
        #print(i)
        rf = RandomForestRegressor(n_estimators=i, random_state=1, n_jobs=-1, max_depth=m)
        rf.fit(X_train, y_train)
        rf_pred = rf.predict(X_val)
        rmse = mean_squared_error(y_val, y_pred)
    
        print("Estimator Value %s and Max_depth%s; RMSE %s" % (i, m, rmse))

Estimator Value 10 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 20 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 30 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 40 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 50 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 60 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 70 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 80 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 90 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 100 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 110 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 120 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 130 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 140 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 150 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 160 and Max_depth10; RMSE 0.21887168808741775
Estimator Value 1

- What's the best `max_depth`:

- Answer: 10

**Question 5**

We can extract feature importance information from tree-based models.

At each step of the decision tree learning algorith, it finds the best split. When doint it, we can calculate "gain" - the reduction in impurity before and after the split. This gain is quite useful in understanding what are the imporatant features for tree-based models.

In Scikit-Learn, tree-based models contain this information in the `feature_importances_` field.

For this homework question, we'll find the most important feature:

Train the model with these parametes:
- `n_estimators=10`,
- `max_depth=20`,
- `random_state=1`,
- `n_jobs=-1` (optional)

Get the feature importance information from this model

In [23]:
#@ TRAINING THE RANDOM FOREST MODEL:
model = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
model.fit(X_train, y_train)

In [24]:
y_pred = rf.predict(X_val)
rmse = mean_squared_error(y_val, y_pred)
rmse

0.053553699699876826

In [25]:
d = {'feature': dv.get_feature_names(), 'values': model.feature_importances_}
feature_info_values = pd.DataFrame(data = d)
feature_info_values.sort_values('values', ascending = False)



Unnamed: 0,feature,values
4,median_income,0.363326
6,ocean_proximity=INLAND,0.310901
2,latitude,0.101256
3,longitude,0.09647
1,housing_median_age,0.033145
10,population,0.030777
12,total_rooms,0.020541
11,total_bedrooms,0.019172
0,households,0.016387
9,ocean_proximity=NEAR OCEAN,0.004699


- What's the most important feature?

- Answer: median_income

**Question 6**

Now let's train an XGBoost model! For this question, we'll tune the eta parameter:

- Install XGBoost.
- Create DMatrix for train and validation
- Create a watchlist
- Train a model with these parameters for 100 rounds:

```
xgb_params = {  
    'eta': 0.3,  
    'max_depth': 6,  
    'min_child_weight': 1,  

    'objective': 'reg:squarederror',
    'nthread': 8,

    'seed': 1,
    'verbosity': 1,
}
```



In [30]:
#@ CREATING THE DMARTIX:
import xgboost as xgb
features = dv.feature_names_

regex = re.compile(r"<", re.IGNORECASE)
features = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in features]

dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)

In [35]:
watchlist = [(dtrain, 'train'), (dval, 'val')]

In [33]:
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'eval_metric': 'auc',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}

In [34]:
model = xgb.train(xgb_params, dtrain, num_boost_round=100)

- Now, change eta first to 0.1 and then to 0.01.

- Which eta leads to the best RMSE score on the validation dataset?

- Answer: 0.1