## **HW4: Feature Engineering and Linear Regression**

### **Steven Yoo**

**Attention:** This is an individual assignment. 

#### **Description of the Dataset**

* The Ames Housing dataset describes the sale of individual residential properties in Ames, Iowa, from 2006 to 2010. It contains a wide range of features, making it an excellent dataset to practice feature engineering techniques. In this assignment, you will explore a series of feature engineering tasks aimed at improving linear regression predictions.

* Before you start implementing your homework, please read the provided `data_description.txt` file, which gives additional information about the dataset and its features.

### **Question 1: Import libraries, Load Train and Test datasets into separate DataFrames**

In [1]:
# Write your code here
import pandas as pd

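# Load train and test datasets
# NOTE: the file names look swapped. The output below (Id starting at 1461) matches test.csv,
# so `train` actually holds the test split and `test` holds the train split.
train = pd.read_csv('hw4-dataset/test.csv')
test = pd.read_csv('hw4-dataset/train.csv')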

train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


### **Mount Google Drive**

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
# Optional: display all columns so that every feature is visible

pd.set_option('display.max_columns', None)

### **Question 2: Identify columns with missing data and then determine an appropriate strategy for each.**

In [4]:
# Count missing values per column and keep the names of the affected columns
missing_data = train.isnull().sum()
missing_data = missing_data[missing_data > 0].index
missing_data

Index(['MSZoning', 'LotFrontage', 'Alley', 'Utilities', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath',
       'BsmtHalfBath', 'KitchenQual', 'Functional', 'FireplaceQu',
       'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea',
       'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType'],
      dtype='object')

#### For the numerical columns, we could replace the missing data with a summary statistic such as the mean or median. For categorical columns, we could one-hot encode the relevant variables and remove the unnecessary ones (or those where one-hot encoding doesn't make sense).
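
One way to turn this into a per-column plan is to tabulate how much is missing and suggest a fill strategy from the column type. The sketch below is illustrative only: the toy `df` and its values are made up, and the `strategy` column is a hypothetical helper, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
df = pd.DataFrame({
    "LotFrontage": [80.0, np.nan, 74.0, 78.0],
    "GarageType": ["Attchd", None, "Detchd", "Attchd"],
    "LotArea": [11622, 14267, 13830, 9978],
})

# Count and share of missing values per column, keeping only affected columns
summary = pd.DataFrame({
    "n_missing": df.isnull().sum(),
    "pct_missing": df.isnull().mean() * 100,
})
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Suggest a strategy from the column type: median for numbers, mode otherwise
summary["strategy"] = [
    "median" if pd.api.types.is_numeric_dtype(df[col]) else "mode"
    for col in summary.index
]
print(summary)
```

Columns with a very high share of missing values may be better dropped than imputed; that threshold is a judgment call.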

### **Question 3: One-hot Encoding for Categorical Variables**

* Apply one-hot encoding to transform columns with categorical values.

In [5]:
# first identify categorical columns
categorical_columns = []
for column in train.columns:
    if train[column].dtype == 'object':
        categorical_columns.append(column)

# now one-hot encode
train_encoded = pd.get_dummies(train, columns = categorical_columns)
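
When a second dataset has to be encoded the same way, the dummy columns of the two frames can differ. The following is a minimal sketch on made-up data (`train_toy` and `test_toy` are hypothetical names) of keeping an explicit indicator for missing values with `dummy_na=True` and aligning the second frame to the first with `reindex`.

```python
import pandas as pd

# Made-up samples: the second frame has a category ("FV") the first never saw
train_toy = pd.DataFrame({"MSZoning": ["RL", "RH", None, "RL"]})
test_toy = pd.DataFrame({"MSZoning": ["RL", "FV"]})

# dummy_na=True adds an explicit indicator column for missing values
train_dummies = pd.get_dummies(train_toy, columns=["MSZoning"], dummy_na=True, dtype=int)

# Encode the second frame, then align it to the first frame's columns:
# unseen categories are dropped, absent ones are filled with 0
test_dummies = pd.get_dummies(test_toy, columns=["MSZoning"], dummy_na=True, dtype=int)
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies)
print(test_dummies)
```

Without `dummy_na=True`, a missing value is encoded as a row of all zeros, which a model cannot tell apart from "none of the known categories".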

### **Question 4: Scaling and Normalization**

* Using `StandardScaler` from `scikit-learn`, scale the features in some of the numeric columns.
* You will need to import required libraries and modules here as well.

In [6]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Identify numeric columns and put them in a python list
numeric_columns = train.select_dtypes(include=[np.number]).columns.tolist()

# From the `numeric_columns`, exclude the target column 'SalePrice' (when present) and the 'Id' column
numeric_columns = [col for col in numeric_columns if col not in ('Id', 'SalePrice')]

# Scaling the numeric columns, initialize StandardScaler from `scikit-learn`
scaler = StandardScaler()

# fit and transform the numeric columns using the scaler you defined above
train_encoded[numeric_columns] = scaler.fit_transform(train_encoded[numeric_columns])
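
`StandardScaler` subtracts each column's mean and divides by its standard deviation. A common convention, sketched here on made-up numbers, is to learn those statistics from the training data only and reuse them for any other split; `train_toy` and `test_toy` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up lot areas for a training split and a held-out split
train_toy = pd.DataFrame({"LotArea": [8000.0, 10000.0, 12000.0]})
test_toy = pd.DataFrame({"LotArea": [10000.0, 14000.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learns mean and std from train
test_scaled = scaler.transform(test_toy)        # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

Calling `fit_transform` on the held-out split as well would scale it with its own mean and standard deviation, so the same raw value would map to different scaled values in the two splits.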

### **Question 5: Feature Extraction from Year Variables**

* Extract the age of the house at the time of sale by calculating the difference between the `YrSold` and `YearBuilt` features, and store the result in a new column (feature) named `HouseAge`.
* Extract the number of years since remodelling at the time of sale by calculating the difference between the `YrSold` and `YearRemodAdd` features, and store the result in a new column (feature) named `YearsSinceRemod`.

In [7]:
# Age of the house at the time of sale
train_encoded['HouseAge'] = train_encoded['YrSold'] - train_encoded['YearBuilt']

# Number of years since remodelling when the house was sold
train_encoded['YearsSinceRemod'] = train_encoded['YrSold'] - train_encoded['YearRemodAdd']

# Display the new features you have created above
train_encoded[['HouseAge', 'YearsSinceRemod']].head()

Unnamed: 0,HouseAge,YearsSinceRemod
0,2.05485,2.78679
1,2.1536,2.928814
2,0.869846,1.035163
3,0.83693,1.035163
4,1.03443,1.319211
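
The values above are fractional because `YrSold`, `YearBuilt`, and `YearRemodAdd` were already standardized in Question 4 before the subtraction, so the differences are no longer measured in years. A minimal sketch, on made-up rows in a hypothetical `raw` frame, of computing the same features from unscaled year columns:

```python
import pandas as pd

# Made-up rows with unscaled year columns
raw = pd.DataFrame({
    "YrSold": [2010, 2010, 2008],
    "YearBuilt": [1961, 1958, 1997],
    "YearRemodAdd": [1961, 1958, 1998],
})

# Differences of raw years are directly interpretable as a number of years
raw["HouseAge"] = raw["YrSold"] - raw["YearBuilt"]
raw["YearsSinceRemod"] = raw["YrSold"] - raw["YearRemodAdd"]

print(raw[["HouseAge", "YearsSinceRemod"]])
```

The new columns can then be scaled together with the other numeric features if a standardized version is needed.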


### **Question 6: Dimensionality Reduction**

* In this question you will apply the PCA algorithm to reduce the dimensions of the dataset. PCA will not work if there are missing values in your data. For this reason, you will first fill all the missing values in the numerical columns with the median of the column.
* Then, you will fill the missing values in the categorical columns with mode imputation.
* Tip: To fill the missing values in all columns, first identify all numeric columns with missing data and store them in a variable called `numeric_cols_with_missing`. Do the same for categorical columns with missing data and store them in a variable called `categorical_cols_with_missing`. Then, you can iterate over these two sets of columns (using a for loop) and fill them with the appropriate values described above. Do not forget to pass the `inplace=True` argument to the `fillna()` method so that you modify the original dataset instead of creating a copy of it.

In [8]:
# first find all columns with missing data
missing_columns = []
for column in train_encoded.columns:
    if train_encoded[column].isnull().sum() > 0:
        missing_columns.append(column)

# find numerical columns with missing data and categorical columns with missing data
numeric_cols_with_missing = [] # numerical columns with missing data
categorical_cols_with_missing = [] # categorical columns with missing data
for column in missing_columns:
    if train_encoded[column].dtype == 'int64' or train_encoded[column].dtype == 'float64':
        numeric_cols_with_missing.append(column)
    else:
        categorical_cols_with_missing.append(column)
        

# Numeric columns: median imputation
for col in numeric_cols_with_missing:
    # fill on the DataFrame itself; inplace=True on a column selection may not update the frame
    train_encoded.fillna({col: train_encoded[col].median()}, inplace = True)

# Categorical columns: mode imputation
for col in categorical_cols_with_missing:
    # mode() returns a Series, so use its first entry as the fill value
    train_encoded.fillna({col: train_encoded[col].mode()[0]}, inplace = True)

# Verifying that there are no more missing values
assert train_encoded.isnull().sum().sum() == 0
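
Because `pd.get_dummies` in Question 3 already replaced the object columns with indicator columns, the categorical list in this cell is presumably empty, and mode imputation has the most effect when it runs before encoding. The following is a minimal sketch of that order on a made-up frame; `df`, `num_cols`, and `cat_cols` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one categorical column, each with a gap
df = pd.DataFrame({
    "LotFrontage": [80.0, np.nan, 74.0, 78.0],
    "MSZoning": ["RL", None, "RL", "RH"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)

# Numeric columns: fill each column with its own median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns: fill each column with its most frequent value
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Encode only after the gaps are filled
encoded = pd.get_dummies(df, columns=list(cat_cols))
print(encoded)
```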

* **Apply the PCA algorithm to the dataset with no missing values**

In [9]:
from sklearn.decomposition import PCA
X = train_encoded.drop(columns=['Id'])

pca = PCA(n_components = 0.95)

# fit the PCA
X_pca = pca.fit_transform(X)

# Number of components retained after PCA
num_components = pca.n_components_
num_components

75
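
Passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to keep the smallest number of components whose cumulative explained variance reaches that fraction. A small sketch on synthetic data (the array `X` below is randomly generated, not the housing data) that checks this through `explained_variance_ratio_`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 200 samples, 5 features: two independent signals plus three noisy mixtures of them
base = rng.normal(size=(200, 2))
mixtures = base @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(200, 3))
X = np.hstack([base, mixtures])

# Keep as many components as needed to explain at least 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative)
```

Because the five synthetic features are driven by only two underlying signals, far fewer than five components are retained.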

### **Question 7: Engineering Ordinal Features**

* For ordinal columns like `ExterQual` and `ExterCond`, map the values to numbers using the [`map()`](https://docs.python.org/3/library/functions.html#map) method.

In [10]:
# Mapping ordinal values to numbers
ordinal_mappings = {
    'Ex': 5,
    'Gd': 4,
    'TA': 3,
    'Fa': 2,
    'Po': 1
}

train['ExterQual'] = train['ExterQual'].map(ordinal_mappings)
train['ExterCond'] = train['ExterCond'].map(ordinal_mappings)

train[['ExterQual', 'ExterCond']].head()

Unnamed: 0,ExterQual,ExterCond
0,3,3
1,3,3
2,3,3
3,3,3
4,4,3
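
With pandas' `Series.map` and a dictionary, any value that is not a key in the dictionary becomes `NaN`, so it can be worth checking for labels the mapping does not cover. A minimal sketch on a made-up series; `quality_scale` and the label `"Typ"` are used here purely for illustration.

```python
import pandas as pd

# Ordered quality scale, from poor to excellent
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Made-up values, including a missing entry and a label outside the scale
quality = pd.Series(["TA", "Gd", "Ex", None, "Typ"])
mapped = quality.map(quality_scale)

# Labels that were present but had no entry in the mapping
unmapped = quality[mapped.isna() & quality.notna()].unique()

print(mapped)
print(unmapped)
```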


### **Question 8: Feature Interaction**

* An **interaction feature** refers to a new feature that is created by combining or relating two or more existing features. It is based on the idea that two or more variables together may have a synergistic effect on the target variable that is not captured when they are used independently.

* Create an interaction feature, such as the total area of the house, by using the available features `TotalBsmtSF`, `1stFlrSF`, and `2ndFlrSF`.
* Tip: You will need to add these `Series` to get the `TotalArea`

In [11]:
# Total area of the house
train['TotalArea'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']

# Displaying the new feature
train[['TotalArea']].head()

Unnamed: 0,TotalArea
0,1778.0
1,2658.0
2,2557.0
3,2530.0
4,2560.0
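
Adding `Series` with `+` propagates missing values: if any of the three areas is `NaN`, the total is `NaN` too. A sketch on made-up rows contrasting this with `DataFrame.sum(axis=1)`, which skips missing values by default. Treating a missing basement area as zero is an assumption that may or may not suit the data.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second one has no recorded basement area
df = pd.DataFrame({
    "TotalBsmtSF": [882.0, np.nan],
    "1stFlrSF": [896, 1000],
    "2ndFlrSF": [0, 500],
})
area_cols = ["TotalBsmtSF", "1stFlrSF", "2ndFlrSF"]

# Element-wise addition: a single NaN makes the whole total NaN
plus_total = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

# Row-wise sum: NaN values are skipped, i.e. treated as 0
sum_total = df[area_cols].sum(axis=1)

print(plus_total.tolist())
print(sum_total.tolist())
```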


### **Question 9: Binning**

* Group the `LotArea` feature into 5 bins

In [12]:
train['LotAreaBin'] = pd.cut(train['LotArea'], bins = 5, labels = False)

# Displaying the binned column
train[['LotArea', 'LotAreaBin']].head()

Unnamed: 0,LotArea,LotAreaBin
0,11622,0
1,14267,1
2,13830,1
3,9978,0
4,5005,0


In [13]:
# Check the unique values in the 'LotAreaBin' column and the corresponding bin boundaries
bins = pd.cut(train['LotArea'], bins = 5, retbins = True)
unique_values = train['LotAreaBin'].value_counts().sort_index()

bins, unique_values

((0       (1414.87, 12496.0]
  1       (12496.0, 23522.0]
  2       (12496.0, 23522.0]
  3       (1414.87, 12496.0]
  4       (1414.87, 12496.0]
                 ...        
  1454    (1414.87, 12496.0]
  1455    (1414.87, 12496.0]
  1456    (12496.0, 23522.0]
  1457    (1414.87, 12496.0]
  1458    (1414.87, 12496.0]
  Name: LotArea, Length: 1459, dtype: category
  Categories (5, interval[float64, right]): [(1414.87, 12496.0] < (12496.0, 23522.0] < (23522.0, 34548.0] < (34548.0, 45574.0] < (45574.0, 56600.0]],
  array([ 1414.87, 12496.  , 23522.  , 34548.  , 45574.  , 56600.  ])),
 0    1198
 1     240
 2      12
 3       4
 4       5
 Name: LotAreaBin, dtype: int64)

* From the results, it's evident that the vast majority of houses (1198 out of 1459) have lot areas that fall into the first bin (Bin 0), which corresponds to the interval (1414.87, 12496.0]. This is why you see many values of 0 in the `LotAreaBin` column.

* The binning behavior here is due to a few properties with very large lot areas that stretch the range and hence the bin boundaries. The first bin captures most of the data, while the other bins capture only a few outliers. In such cases, it might be more appropriate to use quantile-based binning (using `pd.qcut()`) to ensure a more even distribution of data points across bins.

In [14]:
# Observe the results of the quantile-based binning and compare it to the result from only `pd.cut()` above

train['LotAreaQuantileBin'] = pd.qcut(train['LotArea'], q=5, labels=False)

# Check the unique values in the 'LotAreaQuantileBin' column and the corresponding bin boundaries
quantile_bins = pd.qcut(train['LotArea'], q=5, retbins=True)
unique_values_quantile = train['LotAreaQuantileBin'].value_counts().sort_index()

quantile_bins, unique_values_quantile

((0       (10125.8, 12194.4]
  1       (12194.4, 56600.0]
  2       (12194.4, 56600.0]
  3        (8640.0, 10125.8]
  4       (1469.999, 6958.4]
                 ...        
  1454    (1469.999, 6958.4]
  1455    (1469.999, 6958.4]
  1456    (12194.4, 56600.0]
  1457    (10125.8, 12194.4]
  1458     (8640.0, 10125.8]
  Name: LotArea, Length: 1459, dtype: category
  Categories (5, interval[float64, right]): [(1469.999, 6958.4] < (6958.4, 8640.0] < (8640.0, 10125.8] < (10125.8, 12194.4] < (12194.4, 56600.0]],
  array([ 1470. ,  6958.4,  8640. , 10125.8, 12194.4, 56600. ])),
 0    292
 1    294
 2    289
 3    292
 4    292
 Name: LotAreaQuantileBin, dtype: int64)

#### The results show that quantile-based binning produces much more evenly distributed bins. The standard `pd.cut()` binning produced bins holding anywhere from 4 to 1198 values (a range of 1194). In comparison, the quantile-based `pd.qcut()` binning produced 5 bins of around 290 values each, with a range of 5 (289-294).
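
The same contrast can be reproduced on a tiny example. The sketch below uses ten made-up lot areas with a single large outlier: `pd.cut` splits the value range into equal-width intervals, while `pd.qcut` places the boundaries at quantiles so that each bin holds about the same number of rows.

```python
import pandas as pd

# Ten made-up lot areas; the last one is an extreme outlier
lot_area = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 100000])

# Equal-width bins: the outlier stretches the range, so most rows land in bin 0
equal_width = pd.cut(lot_area, bins=5, labels=False).value_counts().sort_index()

# Quantile bins: boundaries follow the data, so the counts are balanced
equal_freq = pd.qcut(lot_area, q=5, labels=False).value_counts().sort_index()

print(equal_width.to_dict())  # {0: 9, 4: 1}
print(equal_freq.to_dict())   # {0: 2, 1: 2, 2: 2, 3: 2, 4: 2}
```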