<a href="https://colab.research.google.com/github/stevengregori92/Project---Predicting-House-Price-Class/blob/main/Project_Predicting_House_Price_Class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project - Parameters with Highest Impact on House Price Class

![Data Science Workflow](img/ds-workflow.png)

## Goal of Project
- The real estate dealer from the last assignment calls back and clarifies his objective.
- He is less interested in what matters most for the exact house price, and more interested in which price range a house falls into.
- There are 3 classes: the 33% cheapest, the 33% mid-range, and the 33% most expensive houses.
- He needs to find which 10 parameters matter most for determining the class.

## Step 1: Acquire
- Explore problem
- Identify data
- Import data

### Step 1.a: Import libraries
- Execute the cell below (SHIFT + ENTER)
- NOTE: You might need to install mlxtend. If so, run the following in a cell:
```
!pip install mlxtend
```

In [None]:
!pip install --upgrade mlxtend

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier

### Step 1.b: Read the data
- Use ```pd.read_parquet()``` to read the file `files/house_sales.parquet`
- NOTE: Remember to assign the result to a variable (e.g., ```data```)
- Apply ```.head()``` on the data to check that everything looks as expected

In [None]:
data = pd.read_parquet('house_sales.parquet')
data.head()

Id,MSSubClass,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LandSlope,OverallQual,OverallCond,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,MiscVal,MoSold,YrSold,SalePrice
1,60,65.0,8450,1,3,3,3,2,7,5,...,61,0,0,0,0,,0,2,2008,208500
2,20,80.0,9600,1,3,3,3,2,6,8,...,0,0,0,0,0,,0,5,2007,181500
3,60,68.0,11250,1,2,3,3,2,7,5,...,42,0,0,0,0,,0,9,2008,223500
4,70,60.0,9550,1,2,3,3,2,7,5,...,35,272,0,0,0,,0,2,2006,140000
5,60,84.0,14260,1,2,3,3,2,8,5,...,84,0,0,0,0,,0,12,2008,250000


In [None]:
len(data)

1460

### Step 1.c: Inspect the data
- Check the number of rows and columns
    - HINT: `.shape`

In [None]:
data.shape

(1460, 56)

## Step 2: Prepare
- Explore data
- Visualize ideas
- Cleaning data

### Step 2.a: Check the data types
- This step tells you whether any numeric column is stored as a non-numeric type.
- Get the data types by ```.dtypes```
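As a small illustration of why this check matters, the sketch below uses a made-up frame (the `toy` values are hypothetical, not taken from `house_sales.parquet`) in which one numeric column was stored as text. It lists the non-numeric columns and converts the offending one.

```python
import pandas as pd

# Hypothetical frame: LotFrontage was stored as strings by mistake.
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "LotFrontage": ["65.0", "80.0"],
    "Street": [1, 1],
})

# Columns whose dtype is not numeric.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Convert to numbers; values that cannot be parsed become NaN.
toy["LotFrontage"] = pd.to_numeric(toy["LotFrontage"], errors="coerce")
print(toy.dtypes)
```

In the real data every column is already `int64` or `float64`, so no conversion is needed there.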

In [None]:
data.dtypes

MSSubClass         int64
LotFrontage      float64
LotArea            int64
Street             int64
LotShape           int64
LandContour        int64
Utilities          int64
LandSlope          int64
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
MasVnrArea       float64
ExterQual          int64
ExterCond          int64
BsmtQual         float64
BsmtCond         float64
BsmtExposure     float64
BsmtFinType1     float64
BsmtFinSF1         int64
BsmtFinType2     float64
BsmtFinSF2         int64
BsmtUnfSF          int64
TotalBsmtSF        int64
HeatingQC          int64
CentralAir         int64
1stFlrSF           int64
2ndFlrSF           int64
LowQualFinSF       int64
GrLivArea          int64
BsmtFullBath       int64
BsmtHalfBath       int64
FullBath           int64
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
KitchenQual        int64
TotRmsAbvGrd       int64
Fireplaces         int64
FireplaceQu      float64


### Step 2.b: Check for null (missing) values
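- Let's check whether any features have too many missing values to be useful
- Use ```.info()```
- Should we remove any?
    - You can remove features (columns) with `.drop()`. It returns a new DataFrame, so assign the result:
```Python
data = data.drop([<column_name>, ..., <column_name>], axis=1)
```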
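One possible way to decide which columns to drop is to look at the share of missing values per column. The sketch below does this on a made-up frame; the `toy` data and the 50% cut-off are assumptions for illustration, not values prescribed by the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one mostly-empty column.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "PoolQC": [np.nan, np.nan, np.nan, 3.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Assumed rule of thumb: drop columns with more than half of the values missing.
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
toy = toy.drop(columns=mostly_missing)
print(toy.columns.tolist())
```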

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 56 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   LotFrontage    1201 non-null   float64
 2   LotArea        1460 non-null   int64  
 3   Street         1460 non-null   int64  
 4   LotShape       1460 non-null   int64  
 5   LandContour    1460 non-null   int64  
 6   Utilities      1460 non-null   int64  
 7   LandSlope      1460 non-null   int64  
 8   OverallQual    1460 non-null   int64  
 9   OverallCond    1460 non-null   int64  
 10  YearBuilt      1460 non-null   int64  
 11  YearRemodAdd   1460 non-null   int64  
 12  MasVnrArea     1452 non-null   float64
 13  ExterQual      1460 non-null   int64  
 14  ExterCond      1460 non-null   int64  
 15  BsmtQual       1423 non-null   float64
 16  BsmtCond       1423 non-null   float64
 17  BsmtExposure   1422 non-null   float64
 18  BsmtFinT

In [None]:
data = data.drop('PoolQC', axis = 1)

## Step 3: Analyze
- Feature selection
- Model selection
- Analyze data

### Step 3.a: Quasi constant features
- Let's see if there are any quasi-constant features
- Create a `VarianceThreshold(threshold=0.01)` and fit it
- The features that are not quasi-constant are given by `sel.get_feature_names_out()`
- Get all the quasi-constant features with a list comprehension

In [None]:
sel = VarianceThreshold(threshold=0.01)
sel.fit(data)

VarianceThreshold(threshold=0.01)

In [None]:
len(sel.get_feature_names_out())

53

In [None]:
quasi_feature = [col for col in data.columns if col not in sel.get_feature_names_out()]
quasi_feature

['Street', 'Utilities']

### Step 3.b: Correlated features
- Calculate the correlation matrix `corr_matrix` and inspect it
    - HINT: use `.corr()`
- Get all the correlated features
    - HINT: A feature is correlated to a feature before it if
```Python
(corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
```
    - HINT: Use list comprehension to get a list of the correlated features
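The following sketch applies the hint to a made-up frame (the `toy` columns `a` to `d` are hypothetical). Note that the hinted expression only flags positive correlations above 0.8; a strongly negative correlation such as `d = -a` passes unnoticed. Taking `.abs()` first, shown as a variant here, is one way to catch both, if that is what you want.

```python
import pandas as pd

# Hypothetical frame: b is roughly 2*a, c is unrelated, d is exactly -a.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [5, 1, 4, 2, 6, 3],
    "d": [-1, -2, -3, -4, -5, -6],
})
corr_matrix = toy.corr()

# As in the hint: compare each feature only with the features before it.
corr_features = [
    f for f in corr_matrix.columns
    if (corr_matrix[f].iloc[:corr_matrix.columns.get_loc(f)] > 0.8).any()
]

# Variant (an assumption, not part of the hint): also flag strong negative correlation.
abs_corr = corr_matrix.abs()
corr_features_abs = [
    f for f in abs_corr.columns
    if (abs_corr[f].iloc[:abs_corr.columns.get_loc(f)] > 0.8).any()
]
print(corr_features, corr_features_abs)
```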

In [None]:
corr_matrix = data.corr()

In [None]:
corr_features = [feature for feature in corr_matrix.columns if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]
corr_features

['BsmtFinSF2', '1stFlrSF', 'TotRmsAbvGrd', 'GarageYrBlt', 'GarageArea']

### Step 3.c: Prepare training and test set
- Create 3 categorical price ranges using `qcut`
    - HINT: `pd.qcut(data['SalePrice'], q=3, labels=[1, 2, 3])`
- Assign all features in `X`
    - HINT: Use `.drop(['SalePrice', 'Target'] + quasi_feature + corr_features, axis=1)`
        - (assuming the same naming)
- Assign the target to `y`
    - HINT: The target is column `Target`
- Split into train and test using `train_test_split`
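To see how `pd.qcut` forms the three price classes, here is a minimal sketch on made-up prices (the values are illustrative, not from the dataset). With `q=3`, the cut points are the 1/3 and 2/3 quantiles, so each class receives roughly a third of the rows.

```python
import pandas as pd

# Hypothetical sale prices, nine evenly spaced values.
prices = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 900], name="SalePrice")

# 1 = cheapest third, 2 = mid-range third, 3 = most expensive third.
target = pd.qcut(prices, q=3, labels=[1, 2, 3])

print(pd.DataFrame({"SalePrice": prices, "Target": target}))
print(target.value_counts().sort_index())
```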

In [None]:
data['Target'] = pd.qcut(data['SalePrice'], q=3, labels=[1, 2, 3])

In [None]:
X = data.drop(['SalePrice', 'Target'] + quasi_feature + corr_features, axis = 1).fillna(-1)
y = data['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 42)

### Step 3.d: 10 best features for KNeighborsClassifier model
- Use the `SFS` to find 10 best features for a `KNeighborsClassifier` model
    - HINT: `SFS(KNeighborsClassifier(), k_features=10, verbose=2)`
    - HINT: when fitting fill missing values or remove them
        - Notice: ideally we would investigate them further to find appropriate values
- You get the best feature index from `.k_feature_idx_`
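To show the idea behind forward sequential selection, the sketch below hand-rolls a greedy loop with scikit-learn only. It is not mlxtend's implementation: `forward_select` is a hypothetical helper, the data comes from `make_classification`, and 5-fold cross-validated accuracy is assumed as the score. At each round it adds the single feature that improves the score the most.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def forward_select(X, y, k, cv=5):
    """Greedily pick k column indices by cross-validated accuracy."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            # Score a KNN model trained on the candidate feature subset.
            scores[j] = cross_val_score(
                KNeighborsClassifier(), X[:, cols], y, cv=cv, scoring="accuracy"
            ).mean()
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic data: 8 features, of which 3 carry signal.
X_toy, y_toy = make_classification(
    n_samples=150, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
chosen = forward_select(X_toy, y_toy, k=3)
print(chosen)
```

Each round fits one model per remaining feature, which matches the shrinking counts (47, 46, 45, ...) in the verbose output of the real run.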

In [None]:
sfs = SFS(KNeighborsClassifier(), k_features = 10, verbose = 2)
sfs.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  47 out of  47 | elapsed:    2.0s finished

[2023-02-19 11:29:15] Features: 1/10 -- score: 0.5872968709878581
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  46 out of  46 | elapsed:    1.6s finished

[2023-02-19 11:29:17] Features: 2/10 -- score: 0.6832104471589451
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:    1.7s finished

[2023-02-19 11:29:19] Features: 3/10 -- score: 0.7140383698323612
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  

SequentialFeatureSelector(estimator=KNeighborsClassifier(), k_features=(10, 10),
                          scoring='accuracy', verbose=2)

In [None]:
sfs.k_feature_idx_

(0, 3, 6, 14, 26, 28, 29, 32, 33, 35)

### Step 3.e: Explore the features
- Let's try to explore the features
    - HINT: The features can be accessed by `sfs.k_feature_idx_`
    - HINT: Get the feature names by: `X_train.columns[list(sfs.k_feature_idx_)]`
- Try to list them according to correlation score
    - HINT: This is a bit more advanced Python

```Python
for item in X_train.columns[list(sfs.k_feature_idx_)]:
    loc = corr_matrix['SalePrice'].sort_values(ascending=False).index.get_loc(item)
    print(item, loc)
```
- Does the result surprise you?
- Does it change your recommendations?

In [None]:
X_train.columns[list(sfs.k_feature_idx_)]

Index(['MSSubClass', 'LotShape', 'OverallQual', 'BsmtCond', 'BsmtFullBath',
       'FullBath', 'HalfBath', 'KitchenQual', 'Fireplaces', 'GarageCars'],
      dtype='object')

In [None]:
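# Rank all columns by correlation with SalePrice once; position 0 is SalePrice itself.
ranked = corr_matrix['SalePrice'].sort_values(ascending=False).index
for item in X_train.columns[list(sfs.k_feature_idx_)]:
    loc = ranked.get_loc(item)
    print(item, loc)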

MSSubClass 51
LotShape 54
OverallQual 1
BsmtCond 32
BsmtFullBath 29
FullBath 10
HalfBath 25
KitchenQual 4
Fireplaces 16
GarageCars 6


In [None]:
quasi_feature

['Street', 'Utilities']

In [None]:
corr_features

['BsmtFinSF2', '1stFlrSF', 'TotRmsAbvGrd', 'GarageYrBlt', 'GarageArea']

In [None]:
data = data.drop(columns = ['Street', 'Utilities','BsmtFinSF2', '1stFlrSF', 'TotRmsAbvGrd', 'GarageYrBlt', 'GarageArea'])

In [None]:
corr_matrix['SalePrice'].sort_values(ascending=False)

SalePrice        1.000000
OverallQual      0.790982
GrLivArea        0.708624
ExterQual        0.682639
KitchenQual      0.659600
BsmtQual         0.644019
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
YearBuilt        0.522897
YearRemodAdd     0.507101
GarageYrBlt      0.486362
MasVnrArea       0.477493
Fireplaces       0.466929
HeatingQC        0.427649
BsmtFinSF1       0.386420
BsmtExposure     0.352958
LotFrontage      0.351799
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
FireplaceQu      0.295794
HalfBath         0.284108
BsmtFinType1     0.277436
LotArea          0.263843
CentralAir       0.251328
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
BsmtCond         0.160658
GarageQual       0.156693
GarageCond       0.125013
ScreenPorch      0.111447
PoolArea         0.092404
MoSold           0.046432
3SsnPorch   

## Step 4: Report
- Present findings
- Visualize results
- Credibility counts

### Step 4.a: Present findings
- Use the analysis from Step 3 to figure out how to present your findings
- Try to think how the real estate dealer can use these findings
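One simple way to present the result is a small table of the selected features next to their correlation with `SalePrice`. The sketch below copies the correlations from the Step 3 output; `MSSubClass` and `LotShape` are left out because their values are cut off in that output. The table layout and the names `selected_corr` and `report` are only suggestions.

```python
import pandas as pd

# Correlation with SalePrice for the selected features, copied from Step 3.
# MSSubClass and LotShape are omitted: their values are not visible above.
selected_corr = pd.Series({
    "OverallQual": 0.790982,
    "KitchenQual": 0.659600,
    "GarageCars": 0.640409,
    "FullBath": 0.560664,
    "Fireplaces": 0.466929,
    "HalfBath": 0.284108,
    "BsmtFullBath": 0.227122,
    "BsmtCond": 0.160658,
}, name="corr_with_SalePrice")

# Sort from strongest to weakest and number the rows for the report.
report = selected_corr.sort_values(ascending=False).to_frame()
report["order"] = range(1, len(report) + 1)
print(report)
```

The table makes one point visible to the dealer: several selected features, such as `BsmtCond`, correlate only weakly with price on their own, yet the feature search still found them useful for classification in combination with the others.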

## Step 5: Actions
- Use insights
- Measure impact
- Main goal

### Step 5.a: Measure impact
- Can we help the dealer to use these insights?