# House prices reduction 

Let's check out the power of PCA on tabular data. Download the dataset here 👉👉 <a href="https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/PCA/house_prices.csv" target="_blank">House Prices</a>

1. Import `numpy` and `pandas`

In [11]:
import numpy as np
import pandas as pd

# ML libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Dataviz libraries
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go

2. Import dataset and remove column `Id`

In [2]:
dataset = pd.read_csv("/Users/qxzjy/vscworkspace/dsfs-ft-34/06_UNSUPERVISED_MACHINE_LEARNING/03_DIMENSIONALITY_REDUCTION_PCA/02_EXERCICES/data/house_prices.csv")
dataset.drop("Id", axis=1, inplace=True)
dataset

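| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |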


3. Remove every column that contains NaN values

In [4]:
dataset.dropna(axis=1, inplace=True)
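
Note that `axis=1` drops whole columns containing at least one missing value, not rows. A minimal sketch of the difference, on a made-up frame (the column names and values here are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): only column "b" contains a missing value
toy = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, np.nan, 3.0], "c": ["x", "y", "z"]})

cols_dropped = toy.dropna(axis=1)  # removes column "b", keeps all 3 rows
rows_dropped = toy.dropna(axis=0)  # removes the second row, keeps all 3 columns

print(cols_dropped.columns.tolist())
print(rows_dropped.shape)
```

Dropping columns keeps every observation but can discard informative features, so it is worth checking which columns disappear.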

4. Split data into `X` and `y` where X contains all columns except `SalePrice`

5. Using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html" target="_blank">`select_dtypes`</a> from Pandas, filter out all categorical variables. 

In [6]:
target = "SalePrice"

X = dataset.select_dtypes(exclude=object).drop(target, axis=1)
y = dataset[target]
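
To see what `select_dtypes` does in isolation, here is a small sketch on a toy frame (a three-row, three-column stand-in for the dataset; `toy`, `numeric` and `features` are illustrative names). It uses `include="number"`, which is one possible alternative to `exclude=object` and should give the same result when the non-numeric columns are all text:

```python
import pandas as pd

# Toy frame: two numeric columns and one categorical column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Pave"],
    "SalePrice": [208500, 181500, 223500],
})

numeric = toy.select_dtypes(include="number")  # keeps LotArea and SalePrice
features = numeric.drop("SalePrice", axis=1)   # removes the target column
target_values = toy["SalePrice"]

print(features.columns.tolist())
```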

6. Split `X`, `y` in a train set and test set.
> NB: specify `random_state=0` and `test_size=0.2`

In [7]:
X_train_unproc, X_test_unproc, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train_unproc)
X_test = scaler.transform(X_test_unproc)
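
The scaler is fitted on the train set only and then reused on the test set, so no information from the test set leaks into preprocessing. The following sketch, on synthetic data (an assumption made purely for illustration), shows the consequence: the scaled train set has mean 0 and standard deviation 1 per column, while the scaled test set is only approximately centered.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))  # synthetic features

X_tr, X_te = train_test_split(X_demo, random_state=0, test_size=0.2)

scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)  # statistics learned on train only
X_te_s = scaler_demo.transform(X_te)      # same statistics reused on test

print("train mean:", X_tr_s.mean(axis=0).round(6))
print("train std :", X_tr_s.std(axis=0).round(6))
print("test mean :", X_te_s.mean(axis=0).round(3))
```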

7. Import and fit `RandomForestRegressor` on your train set, then check your score on the test set
> NB: specify `random_state=0` and `min_samples_split=15`

8. Check for overfitting by comparing the train and test scores

In [25]:
rfr = RandomForestRegressor(random_state=0, min_samples_split=15)
rfr.fit(X_train, y_train)

print(f"R2 score train set : {rfr.score(X_train, y_train)}")
print(f"R2 score test set : {rfr.score(X_test, y_test)}")

R2 score train set : 0.9453162715669043
R2 score test set : 0.8361109552477888
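
The gap of roughly 0.11 between the train and test R2 scores suggests some overfitting. As a rough sketch of how to study that gap, the code below compares several values of `min_samples_split` on synthetic data from `make_regression` (an assumption: this is not the house prices dataset, and `n_estimators=50` is chosen only to keep the run short). Larger values often narrow the gap, but this is not guaranteed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem with noise (illustrative only)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=30.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0, test_size=0.2)

gaps = {}
for mss in (2, 15, 50):
    model = RandomForestRegressor(random_state=0, min_samples_split=mss, n_estimators=50)
    model.fit(X_tr, y_tr)
    train_r2 = model.score(X_tr, y_tr)
    test_r2 = model.score(X_te, y_te)
    gaps[mss] = train_r2 - test_r2  # positive gap: better on train than on test
    print(f"min_samples_split={mss}: train={train_r2:.3f}, test={test_r2:.3f}, gap={gaps[mss]:.3f}")
```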


9. Use cross-validation to check your model's overall performance: display the average score and the standard deviation
> NB: you can use <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html?highlight=cross%20val%20score#sklearn.model_selection.cross_val_score" target="_blank" >`cross_val_score`</a> from sklearn

In [10]:
cross_validation = cross_val_score(rfr, X_train, y_train, cv=10)

print(f"cross_validation mean : {cross_validation.mean()}")
print(f"cross_validation std : {cross_validation.std()}")

cross_validation mean : 0.8544660980700808
cross_validation std : 0.05727165449720752


10. Standardize your train set and apply the same transformation to your test set

In [None]:
# Already done in step 6: the scaler was fitted on the train set and applied to the test set

11. Apply PCA on your normalized train set.
> NB: specify `n_components=3`

In [13]:
pca = PCA(n_components=3)

X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

12. Visualize your principal components

In [21]:
X_train_pca

array([[ 4.328848  , -0.58109309,  0.61174832],
       [-3.23032795, -0.38723925,  1.06219162],
       [-2.37447855, -1.77209463, -0.32566518],
       ...,
       [-0.33843671,  2.87164554,  0.00682469],
       [ 1.33778317, -0.2164039 ,  1.97466599],
       [ 1.62886359,  1.72608668,  1.42988244]])
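
The array above holds the coordinates of each observation in the new basis. To interpret the axes themselves, one option is to look at `pca.components_`, which stores the weight of each original feature in each component. A minimal sketch on synthetic data (the `feat_*` column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"feat_{i}" for i in range(5)])

X_scaled = StandardScaler().fit_transform(demo)
pca_demo = PCA(n_components=3).fit(X_scaled)

# One row per principal component, one column per original feature
weights = pd.DataFrame(pca_demo.components_, columns=demo.columns, index=["PC1", "PC2", "PC3"])
print(weights.round(2))
```

Each row is a unit vector, so features with the largest absolute weights contribute most to that component.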

13. Check the explained variance ratio from your principal components

In [22]:
pca.explained_variance_ratio_

array([0.19200008, 0.0982849 , 0.06323249])
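
Together, these three components explain about 35% of the total variance (0.192 + 0.098 + 0.063). If you wanted to keep a given share of variance instead of a fixed number of components, scikit-learn accepts a float between 0 and 1 for `n_components`. A sketch on synthetic data (the data and the 0.95 threshold are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 10 features driven by 4 latent factors plus a little noise
latent = rng.normal(size=(300, 4))
X_demo = latent @ rng.normal(size=(4, 10)) + 0.1 * rng.normal(size=(300, 10))
X_demo = StandardScaler().fit_transform(X_demo)

# Cumulative share of variance explained as components are added
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Keep the smallest number of components whose cumulative share exceeds 95%
pca_95 = PCA(n_components=0.95).fit(X_demo)

print(cumulative.round(3))
print(pca_95.n_components_)
```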

14. Fit your `RandomForestRegressor` on the PCA-transformed train set, then check your score on the test set.

In [23]:
rfr.fit(X_train_pca, y_train)

print(f"R2 score train set : {rfr.score(X_train_pca, y_train)}")
print(f"R2 score test set : {rfr.score(X_test_pca, y_test)}")

R2 score train set : 0.9208319533250391
R2 score test set : 0.7892984722390265


15. Use `cross_val_score` again to see your model's overall performance. What can you say about it?

In [24]:
cross_validation = cross_val_score(rfr, X_train_pca, y_train, cv=10)

print(f"cross_validation mean : {cross_validation.mean()}")
print(f"cross_validation std : {cross_validation.std()}")

cross_validation mean : 0.8458318504926782
cross_validation std : 0.04781905255103295
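
One caveat: the scaler and the PCA were fitted on the whole train set before cross-validation, so each validation fold has already influenced the transformation. A possible way to avoid this, sketched here on synthetic data from `make_regression` (an assumption, not the house prices dataset; `n_estimators=50` only keeps the run short), is to wrap all the steps in a pipeline so that they are refitted inside every fold:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Scaling, PCA and the model are refitted on the training part of each fold
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    RandomForestRegressor(random_state=0, min_samples_split=15, n_estimators=50),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"cross_validation mean : {scores.mean()}")
print(f"cross_validation std : {scores.std()}")
```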


16. Finally, use `scatter_3d` from `plotly.express` and `go.Scatter3d` from `plotly.graph_objects` to draw a 3D graph showing the principal components of both the train set and the test set.

In [28]:
fig = px.scatter_3d(X_train_pca, x=0, y=1, z=2)

fig.add_trace(
  go.Scatter3d(
      x=X_test_pca[:, 0],
      y=X_test_pca[:, 1],
      z=X_test_pca[:, 2],
      mode="markers",
      name="test"
  )
)

fig.show()

In [27]:
df_plot = pd.DataFrame(X_train_pca)
df_plot.columns = ['PC1', 'PC2', 'PC3']
df_plot['SalePrice'] = list(y_train)

fig = px.scatter_3d(df_plot, x='PC1', y='PC2', z='PC3', color='SalePrice')
fig.show()