### Last week:
- Unsupervised machine learning
- K-Means and its working
- An example of k-means

### This Week:
- Feature selection
- Project 

## Feature Selection

- Removing non-informative or redundant predictors from the modelling
- Converting raw values into desired features using statistical or machine learning approaches
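
As a rough illustration of the first point, the sketch below drops a constant column and one column of a highly correlated pair. The helper `drop_uninformative`, the 0.95 threshold, and the toy column names are assumptions made for this example only; they do not come from the housing dataset.

```python
import numpy as np
import pandas as pd


def drop_uninformative(df: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
    """Drop constant columns, then one column from each highly correlated pair."""
    # Non-informative: a column with a single unique value carries no signal.
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    df = df.drop(columns=constant)

    # Redundant: look only at the upper triangle so each pair is checked once,
    # and drop the later column of any pair above the threshold.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return df.drop(columns=redundant)


# Hypothetical toy data: two area columns carry the same information.
toy = pd.DataFrame({
    "area_sqft": [1000.0, 1500.0, 2000.0, 2500.0],
    "area_sqm": [92.9, 139.35, 185.8, 232.25],
    "country": [1, 1, 1, 1],
    "rooms": [3, 2, 5, 4],
})
reduced = drop_uninformative(toy)
print(list(reduced.columns))
```

A filter like this looks at the predictors only. RFE, covered next, instead asks a fitted model which features matter.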

### Recursive Feature Elimination (RFE)
- Example of *backward feature elimination*
- First fit the model on all features in the set, then remove the least significant feature one at a time, re-fitting after each removal, until the desired number of features is left. That number is set by the parameter `n_features_to_select`.
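
The loop described above can be sketched by hand. This is a minimal illustration, not scikit-learn's implementation. It assumes a least-squares fit and uses the absolute coefficient as the "significance" score, which is only meaningful when features are on comparable scales. The function name `backward_eliminate` and the synthetic data are made up for the example.

```python
import numpy as np


def backward_eliminate(X, y, n_features_to_select):
    """Hand-rolled backward elimination scored by least-squares coefficient size."""
    remaining = list(range(X.shape[1]))
    eliminated = []  # column indices in the order they were dropped
    while len(remaining) > n_features_to_select:
        # Re-fit using only the features still in play.
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Remove the feature with the smallest absolute coefficient.
        weakest = remaining[int(np.argmin(np.abs(coef)))]
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only columns 0 and 3 influence the target; the other columns are noise.
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

kept, dropped = backward_eliminate(X, y, n_features_to_select=2)
print(sorted(kept), dropped)
```

scikit-learn's `RFE` follows the same pattern but reads the score from the estimator's `coef_` or `feature_importances_` attribute.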

### Dataset:
[Kaggle House Price](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data)

### About dataset
Here's a brief version of what you'll find in the data description file.

- SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- BedroomAbvGr: Number of bedrooms above basement level
- KitchenAbvGr: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Dollar value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale

## Predict house price

In [7]:
import pandas as pd
import numpy as np

from operator import itemgetter
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

In [1]:
train_data = pd.read_csv('datasets/housing_price/train.csv', index_col=0)
test_data = pd.read_csv('datasets/housing_price/test.csv', index_col=0)

target = 'SalePrice'
print(train_data.shape, test_data.shape)

(1460, 80) (1459, 79)


### Inference
Separate the features from the target and keep only the numerical columns, for both the training and the test set

In [2]:
x_train = train_data.select_dtypes(include=['number']).copy()
x_train = x_train.drop([target], axis=1)
y_train = train_data[target]
x_test  = test_data.select_dtypes(include=['number']).copy()

### Inference
Simple preprocessing: fill missing values with column means

In [3]:
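# Fill gaps with training-set means so no test statistics enter preprocessing
train_means = x_train.mean()
x_train = x_train.fillna(train_means)
x_test = x_test.fillna(train_means)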
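
A mean fill can also be written with scikit-learn's `SimpleImputer`, which learns the column means from the training data and reuses them on the test data. The sketch below uses a tiny made-up frame, not the Kaggle files. The column names are borrowed from the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented toy values, used only to show the fit/transform split.
train = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "LotArea": [8000.0, 9000.0, 10000.0],
})
test = pd.DataFrame({
    "LotFrontage": [np.nan, 50.0],
    "LotArea": [np.nan, 7000.0],
})

imputer = SimpleImputer(strategy="mean")
# fit_transform learns the means from the training data and fills its gaps.
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
# transform reuses the training means, so the test set is never inspected.
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)
print(test_filled)
```

Here the missing test values are filled with 70.0 and 9000.0, the training means, rather than with anything computed from the test rows.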

### Inference
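Modelling: wrap a random forest regressor in RFE. Setting `n_features_to_select` to 1 makes RFE eliminate all the way down to a single feature, so every column receives its own rank.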

In [6]:
reg = RandomForestRegressor(n_estimators=100, max_depth=10)

n_features = 1
rfe = RFE(reg, n_features_to_select=n_features)
rfe.fit(x_train, y_train)

### Inference
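- Feature ranking from the fitted RFE: rank 1 is the last feature left standing, and higher ranks were eliminated earlier. The random forest has no fixed `random_state`, so the order can change slightly between runs.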

In [8]:
features = x_train.columns.to_list()
# ranking_[i] is the rank of column i; sort so the best-ranked feature prints first
for rank, feature in sorted(zip(rfe.ranking_, features), key=itemgetter(0)):
    print(rank, feature)

1 OverallQual
2 GrLivArea
3 TotalBsmtSF
4 BsmtFinSF1
5 2ndFlrSF
6 YearBuilt
7 1stFlrSF
8 GarageCars
9 LotArea
10 GarageArea
11 YearRemodAdd
12 LotFrontage
13 TotRmsAbvGrd
14 BsmtUnfSF
15 OpenPorchSF
16 WoodDeckSF
17 OverallCond
18 GarageYrBlt
19 FullBath
20 MasVnrArea
21 MoSold
22 Fireplaces
23 MSSubClass
24 BedroomAbvGr
25 YrSold
26 BsmtFullBath
27 KitchenAbvGr
28 ScreenPorch
29 EnclosedPorch
30 HalfBath
31 BsmtFinSF2
32 3SsnPorch
33 BsmtHalfBath
34 PoolArea
35 LowQualFinSF
36 MiscVal


### Inference
- Select the top 15 features and use them for modelling

In [9]:
n_features = 15
rfe = RFE(reg, n_features_to_select=n_features)
rfe.fit(x_train, y_train)
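
To check which columns a fitted RFE kept, its boolean `support_` mask can be applied to the column index. The sketch below is self-contained and runs on synthetic data rather than the housing data. The name `toy_rfe`, the feature names `f0` to `f5`, and the fixed `random_state` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(150, 6)), columns=[f"f{i}" for i in range(6)])
# Synthetic target driven mainly by f1 and f4.
y = 5.0 * X["f1"] + 2.0 * X["f4"] + rng.normal(scale=0.1, size=150)

toy_rfe = RFE(
    RandomForestRegressor(n_estimators=20, random_state=0),
    n_features_to_select=2,
)
toy_rfe.fit(X, y)

# support_ is True for the kept columns; those columns all have ranking_ == 1.
selected = X.columns[toy_rfe.support_].to_list()
# transform keeps only the selected columns.
X_reduced = toy_rfe.transform(X)
print(selected, X_reduced.shape)
```

Calling `predict` on a fitted RFE applies the same column selection before passing the data to the underlying estimator, so the test frame does not need to be reduced by hand.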

In [10]:
pred = rfe.predict(x_test)

In [11]:
output = pd.DataFrame({"Id":test_data.index, target:pred})
#output.to_csv('submission_rfe.csv', index=False)

## Project: Predicting Ad Click-through Rate
Dataset: [Kaggle](https://www.kaggle.com/datasets/gauravduttakiit/clickthrough-rate-prediction)

### What is Click-through Rate (CTR)
- The percentage of ad impressions (times the ad was shown) that result in a click: CTR = clicks / impressions × 100
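
As a quick worked example, CTR is usually computed as clicks divided by impressions, expressed as a percentage. The function name and the numbers below are illustrative only and are not taken from the project dataset.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return CTR as a percentage: clicks / impressions * 100."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return 100.0 * clicks / impressions


# 25 clicks out of 1000 times the ad was shown.
print(click_through_rate(clicks=25, impressions=1000))  # 2.5
```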

### Why?
- Helps advertisers determine whether their ads resonate with the target demographic and generate more engagement

### Tasks till Saturday
- Read more about CTR
- Analyse each column 
