
# California House Value

In [0]:
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_california_housing

# **Step 1**: Load dataset

**Load the dataset from sklearn:** `fetch_california_housing()` returns a dictionary-like `Bunch` object; three of its fields are used here:

  - data (features)

  - feature_names (feature column names)

  - target (target variable)


**Sample size : 20,640**

**Features (X):**

 1. 'MedInc'
 2. 'HouseAge'
 3. 'AveRooms'
 4. 'AveBedrms'
 5. 'Population'
 6. 'AveOccup'
 7. 'Latitude'
 8. 'Longitude'

 **Target variable (y):**

 - Median house value (in units of $100,000)



In [0]:
house_value = fetch_california_housing()

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /root/scikit_learn_data


In [0]:
X = pd.DataFrame(house_value.data, columns = house_value.feature_names)
y = house_value.target

In [0]:
X.head()

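|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |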


# **Step 2**: Split train, test dataset

In [0]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

In [0]:
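# train_test_split keeps the original (shuffled) row labels,
# so reset the index of each split to 0..n-1
for df in [x_train, x_test]:
  df.index = range(df.shape[0])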

In [0]:
x_train.shape

(14448, 8)

In [0]:
print("The train sample size: {}".format(x_train.shape))
print("The test sample size: {}".format(x_test.shape))

The train sample size: (14448, 8)
The test sample size: (6192, 8)


# **Step 3**: Model building

## 1. Fit the model

In [0]:
reg = LinearRegression()
reg.fit(x_train, y_train)


## 2. Get familiar with the fitted model parameters: **coef_ & intercept_**

In [0]:
coef = list(zip(x_train.columns, reg.coef_)) # pair each feature name with its coefficient
coef = pd.DataFrame(coef, columns = ["feature_name", 'coef'])
coef

|   | feature_name | coef |
|---|--------------|------|
| 0 | MedInc | 0.441038 |
| 1 | HouseAge | 0.009688 |
| 2 | AveRooms | -0.104781 |
| 3 | AveBedrms | 0.622053 |
| 4 | Population | -6e-06 |
| 5 | AveOccup | -0.003288 |
| 6 | Latitude | -0.423182 |
| 7 | Longitude | -0.437899 |


In [0]:
reg.intercept_

-37.28532899875165
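
To make the role of these parameters concrete, here is a minimal sketch of how a linear model turns them into a prediction: the intercept plus the sum of each coefficient times its feature value. The numbers are the rounded values printed above, and the input is the first row shown by `X.head()`; the dictionary names `coef_by_feature` and `row` are illustrative, not part of the notebook.

```python
# Rounded coefficients and intercept copied from the outputs above.
coef_by_feature = {
    "MedInc": 0.441038,
    "HouseAge": 0.009688,
    "AveRooms": -0.104781,
    "AveBedrms": 0.622053,
    "Population": -6e-06,
    "AveOccup": -0.003288,
    "Latitude": -0.423182,
    "Longitude": -0.437899,
}
intercept = -37.28532899875165

# First row of X.head(), used purely as an example input.
row = {
    "MedInc": 8.3252,
    "HouseAge": 41.0,
    "AveRooms": 6.984127,
    "AveBedrms": 1.02381,
    "Population": 322.0,
    "AveOccup": 2.555556,
    "Latitude": 37.88,
    "Longitude": -122.23,
}

# Linear model: prediction = intercept + sum_j(coef_j * x_j)
prediction = intercept + sum(coef_by_feature[name] * row[name] for name in coef_by_feature)
print(round(prediction, 2))
```

Because the coefficients are rounded, the result only approximates what `reg.predict` would return for the same row.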

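The absolute value of the intercept (about 37.3) is much larger than that of any coefficient. **It is possible that the model doesn't fit well**, although a large intercept can also simply reflect features whose values lie far from zero, such as Latitude and Longitude.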
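
One way to probe this observation is to check how the intercept reacts when a feature is centered. The sketch below uses made-up numbers (not the housing data) and a hypothetical helper `fit_line` that fits a one-feature least-squares line. It suggests that a feature located far from zero, like Longitude near -120, can produce a large intercept on its own, and that centering the feature changes the intercept without changing the slope.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: a feature far from zero and a target on a small scale.
xs = [-124.0, -122.0, -120.0, -118.0, -116.0]
ys = [3.0, 2.6, 2.1, 1.5, 1.3]

slope_raw, intercept_raw = fit_line(xs, ys)

# Center the feature by subtracting its mean, then refit.
mean_x = sum(xs) / len(xs)
xs_centered = [x - mean_x for x in xs]
slope_centered, intercept_centered = fit_line(xs_centered, ys)

print("raw:     ", slope_raw, intercept_raw)
print("centered:", slope_centered, intercept_centered)
```

In this toy case the fit is identical in both versions; only the intercept moves, ending up at the mean of the target once the feature is centered. The size of the intercept alone is therefore weak evidence about fit quality.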

## 3. Evaluate the model: **MSE**

The SSE (sum of squared errors) is hard to use for evaluating performance because it grows with the sample size. Dividing the SSE by the sample size m gives the MSE (mean squared error):

\begin{equation*}
MSE =
\frac{1}{m}\sum_{i = 1}^{m}(\hat{y}_i - y_i)^2
\end{equation*}
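
The formula can be checked by hand on a few numbers. The sketch below uses made-up values and two illustrative helpers, `sum_squared_error` and `mean_squared_error_manual` (neither is a scikit-learn function). It also repeats the same data twice to show that the SSE doubles with the sample size while the MSE stays the same.

```python
def sum_squared_error(y_true, y_hat):
    """SSE: sum over samples of (y_hat_i - y_i)^2."""
    if len(y_true) != len(y_hat):
        raise ValueError("y_true and y_hat must have the same length")
    return sum((p - t) ** 2 for t, p in zip(y_true, y_hat))

def mean_squared_error_manual(y_true, y_hat):
    """MSE: SSE divided by the number of samples m."""
    return sum_squared_error(y_true, y_hat) / len(y_true)

# Hypothetical targets and predictions.
y_true = [3.0, 1.5, 2.0, 4.0]
y_hat = [2.5, 1.5, 3.0, 3.0]

sse_small = sum_squared_error(y_true, y_hat)
mse_small = mean_squared_error_manual(y_true, y_hat)

# Same errors, twice as many samples.
sse_large = sum_squared_error(y_true * 2, y_hat * 2)
mse_large = mean_squared_error_manual(y_true * 2, y_hat * 2)

print(sse_small, mse_small)
print(sse_large, mse_large)
```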

In [0]:
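from sklearn.metrics import mean_squared_error as MSE

y_pred = reg.predict(x_test) # predict on the test set before scoring
MSE(y_test, y_pred) # argument order is (y_true, y_pred)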

0.5296293151408237

In [0]:
print("The max value of y: {}".format(y_test.max()))
print("The min value of y: {}".format(y_test.min()))

The max value of y: 5.00001
The min value of y: 0.14999


Comparing the MSE with the maximum and minimum values of y suggests that the model does not perform very well:

- Max value 5.00 vs. 0.5 -->  0.5/5 = 10% error

- Min value 0.15 vs. 0.5 -->  0.5/0.15 ≈ 333% error
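
A caveat on this comparison: the MSE is expressed in squared units of y, so its square root (the RMSE) is arguably the fairer quantity to set against the range of y. The sketch below redoes the arithmetic with the values printed above; the variable names are illustrative.

```python
import math

# Values copied from the outputs above.
mse = 0.5296293151408237
y_max = 5.00001
y_min = 0.14999

# RMSE is in the same units as y (here: $100,000).
rmse = math.sqrt(mse)

# Typical error relative to the largest and smallest target values.
ratio_to_max = rmse / y_max
ratio_to_min = rmse / y_min

print("RMSE: {:.3f}".format(rmse))
print("RMSE / max(y): {:.1%}".format(ratio_to_max))
print("RMSE / min(y): {:.1%}".format(ratio_to_min))
```

Measured this way the typical error is larger than the raw MSE suggests, which supports the same conclusion about the model's performance.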

In [0]:
cross_val_score(reg, X, y, cv = 10, scoring = 'neg_mean_squared_error') # the negative value of MSE

array([-0.48922052, -0.43335865, -0.8864377 , -0.39091641, -0.7479731 ,
       -0.52980278, -0.28798456, -0.77326441, -0.64305557, -0.3275106 ])
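
Since `neg_mean_squared_error` returns negated values, a common follow-up is to flip the sign and summarize the folds. A minimal sketch, using the ten scores printed above (the variable names are illustrative):

```python
# Scores copied from the cross_val_score output above.
neg_mse_scores = [
    -0.48922052, -0.43335865, -0.8864377, -0.39091641, -0.7479731,
    -0.52980278, -0.28798456, -0.77326441, -0.64305557, -0.3275106,
]

# Flip the sign to recover the per-fold MSE.
mse_per_fold = [-score for score in neg_mse_scores]

mean_cv_mse = sum(mse_per_fold) / len(mse_per_fold)
best_fold_mse = min(mse_per_fold)
worst_fold_mse = max(mse_per_fold)

print("Mean CV MSE: {:.4f}".format(mean_cv_mse))
print("Best / worst fold MSE: {:.4f} / {:.4f}".format(best_fold_mse, worst_fold_mse))
```

The spread between the best and worst folds gives a rough sense of how much the error depends on which part of the data is held out.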

In [0]:
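import sklearn
# List the available scoring names. SCORERS was removed in scikit-learn 1.3;
# on recent versions use sklearn.metrics.get_scorer_names() instead.
sorted(sklearn.metrics.SCORERS.keys())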

['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'brier_score_loss',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'jaccard',
 'jaccard_macro',
 'jaccard_micro',
 'jaccard_samples',
 'jaccard_weighted',
 'max_error',
 'mutual_info_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'v_measure_score']

## 4. Make predictions based on the fitted model

In [0]:
y_pred = reg.predict(x_test)
y_pred

array([2.12598355, 0.94104495, 2.71042934, ..., 1.85617769, 1.54295782,
       1.51367783])