# Pre-processing and Feature Engineering

Anthony Amadasun

December 15th, 2023

---

### Introduction


This section covers preprocessing and feature engineering, which prepare the data for modeling and enhance its predictive capabilities. These critical steps ensure that the data meets the requirements for understanding the factors that influence housing property prices. The following key aspects will be addressed:

- Categorical Variable Transformation: implement one-hot encoding for categorical variables.
- Data Scaling: apply appropriate scaling techniques to the data.
- Data Splitting and Sampling: split the dataset into training and validation sets.
- Feature Selection: identify and eliminate noisy or multicollinear features.

---

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

---

### Load in Data

In [2]:
# Read in both CSVs

df_cleaned = pd.read_csv('../data/test_clean.csv')
file_path_train = '../data/train.csv'
df_train = pd.read_csv(file_path_train)

---

### Pre-Processing

**One-hot encoding**

The decision on which categorical features to one-hot encode depends mostly on domain knowledge, the number of unique categories, and the potential impact on the model. Only the Neighborhood column will be one-hot encoded, to avoid introducing too many features, which could lead to the curse of dimensionality. Based on domain knowledge, the neighborhood in which a property is located can significantly influence its price. Neighborhoods differ in amenities, safety levels, school districts, and overall desirability, all of which can affect property values. That is why this column was selected for encoding.

Based on the data dictionary, the Neighborhood column has 28 neighborhoods, which are: 'Sawyer', 'SawyerW', 'NAmes', 'Timber', 'Edwards', 'OldTown', 'BrDale', 'CollgCr', 'Somerst', 'Mitchel', 'StoneBr', 'NridgHt', 'Gilbert', 'Crawfor', 'IDOTRR', 'NWAmes', 'Veenker', 'MeadowV', 'SWISU', 'NoRidge', 'ClearCr', 'Blmngtn', 'BrkSide', 'NPkVill', 'Blueste', 'GrnHill', 'Greens', 'Landmrk'

- Blmngtn = Bloomington Heights
- Blueste = Bluestem
- BrDale = Briardale
- BrkSide = Brookside
- ClearCr = Clear Creek
- CollgCr = College Creek
- Crawfor = Crawford
- Edwards = Edwards
- Gilbert = Gilbert
- Greens = Greens
- GrnHill = Green Hills
- IDOTRR = Iowa DOT and Rail Road
- Landmrk = Landmark
- MeadowV = Meadow Village
- Mitchel = Mitchell
- NAmes = North Ames
- NoRidge = Northridge
- NPkVill = Northpark Villa
- NridgHt = Northridge Heights
- NWAmes = Northwest Ames
- OldTown = Old Town
- SWISU = South & West of Iowa State University
- Sawyer = Sawyer
- SawyerW = Sawyer West
- Somerst = Somerset
- StoneBr = Stone Brook
- Timber = Timberland
- Veenker = Veenker

In [3]:
unique_neighborhoods = df_train['Neighborhood'].unique()
unique_neighborhoods

array(['Sawyer', 'SawyerW', 'NAmes', 'Timber', 'Edwards', 'OldTown',
       'BrDale', 'CollgCr', 'Somerst', 'Mitchel', 'StoneBr', 'NridgHt',
       'Gilbert', 'Crawfor', 'IDOTRR', 'NWAmes', 'Veenker', 'MeadowV',
       'SWISU', 'NoRidge', 'ClearCr', 'Blmngtn', 'BrkSide', 'NPkVill',
       'Blueste', 'GrnHill', 'Greens', 'Landmrk'], dtype=object)

In [4]:
len(unique_neighborhoods)

28
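The encoding step described above can be sketched on a small stand-in frame. This is only an illustration: the `toy` DataFrame and its values are made up, and `drop_first=True` is an assumption about how the baseline level would be handled, not something the notebook specifies.

```python
import pandas as pd

# Toy frame standing in for the housing data; the values are invented.
toy = pd.DataFrame({
    'Neighborhood': ['NAmes', 'Sawyer', 'NAmes', 'Timber'],
    'SalePrice': [150000, 130000, 160000, 250000],
})

# One-hot encode Neighborhood. drop_first=True drops one level (the baseline)
# so the dummy columns are not perfectly collinear with the intercept.
encoded = pd.get_dummies(toy, columns=['Neighborhood'], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

With 28 neighborhoods, the same call would add 27 dummy columns (28 without `drop_first`).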

**Feature selection**

The features 'BsmtFin_SF_1', 'BsmtFin_SF_2', 'Bsmt_Unf_SF', 'Total_Bsmt_SF', '1st_Flr_SF', '2nd_Flr_SF', and 'Low_Qual_Fin_SF' show signs of multicollinearity, so they won't be used to predict the target variable. A combined feature, 'interaction_total_bathrooms', was created from 'Bsmt_Full_Bath', 'Bsmt_Half_Bath', 'Full_Bath', and 'Half_Bath' (each half bath counts as 0.5), and the original columns were dropped because they serve little purpose for the model's goal.

In [5]:
# Combine the bathroom columns into one feature (half baths weighted 0.5) to reduce dimensionality
df_cleaned['interaction_total_bathrooms'] = (df_cleaned['Bsmt_Full_Bath'] + df_cleaned['Full_Bath'] +
    0.5 * (df_cleaned['Bsmt_Half_Bath'] + df_cleaned['Half_Bath'])
)

# Drop original columns
df_cleaned.drop(['Bsmt_Full_Bath', 'Bsmt_Half_Bath', 'Full_Bath', 
                 'Half_Bath', 'BsmtFin_SF_1', 'BsmtFin_SF_2', 
                 'Bsmt_Unf_SF', 'Total_Bsmt_SF', '1st_Flr_SF', 
                 '2nd_Flr_SF', 'Low_Qual_Fin_SF'], axis=1, inplace=True)
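One common way to check for the kind of multicollinearity mentioned above is the variance inflation factor (VIF). The sketch below is not the check used in this notebook. It uses synthetic data with hypothetical column names (`fin_sf`, `unf_sf`, `lot_area`, `total_sf`) and a hypothetical helper `vif`, where `total_sf` is built as a near-sum of two other columns, much as a total basement area relates to its parts.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in data: total_sf is almost exactly fin_sf + unf_sf.
toy = pd.DataFrame({
    'fin_sf': rng.uniform(0, 1000, n),
    'unf_sf': rng.uniform(0, 1000, n),
    'lot_area': rng.uniform(5000, 15000, n),
})
toy['total_sf'] = toy['fin_sf'] + toy['unf_sf'] + rng.normal(0, 10, n)


def vif(frame, column):
    """VIF = 1 / (1 - R^2), where R^2 comes from regressing `column` on the others."""
    others = frame.drop(columns=column)
    r2 = LinearRegression().fit(others, frame[column]).score(others, frame[column])
    return 1.0 / (1.0 - r2)


vifs = {col: vif(toy, col) for col in toy.columns}
for col, value in vifs.items():
    print(f'{col}: {value:.1f}')
```

A VIF far above 10 is a common rule-of-thumb signal that a feature is largely explained by the others, while a value near 1 means it is roughly independent of them.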



**Data Splitting and Sampling**

In [6]:
X = df_cleaned.select_dtypes(include=['float64', 'int64']).drop('SalePrice', axis=1)
y = df_cleaned['SalePrice']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42)

**Data Scaling**

In [8]:
# Scale our data.
scaler = StandardScaler()

# Fit on the training set only to learn each feature's mean and standard deviation,
# then transform both sets with those training statistics.
Z_train = scaler.fit_transform(X_train)
Z_test = scaler.transform(X_test)

In [9]:
print(f'Z_train shape is: {Z_train.shape}')
print(f'y_train shape is: {y_train.shape}')
print(f'Z_test shape is: {Z_test.shape}')
print(f'y_test shape is: {y_test.shape}')

Z_train shape is: (1640, 56)
y_train shape is: (1640,)
Z_test shape is: (411, 56)
y_test shape is: (411,)


- After scaling, the training set (Z_train) has 1640 samples (rows) and 56 features (columns).
- The target vector for the training set (y_train) has 1640 values.
- After scaling, the testing set (Z_test) has 411 samples and 56 features.
- The target vector for the testing set (y_test) has 411 values.
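To illustrate why the scaler is fitted on the training split only, the sketch below repeats the split-then-scale pattern on random stand-in data (`X_toy` is made up). The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50, scale=10, size=(100, 3))  # invented feature matrix

X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=42)

scaler = StandardScaler()
Z_tr = scaler.fit_transform(X_tr)  # learns mean/std from the training rows
Z_te = scaler.transform(X_te)      # reuses those statistics, no refitting

print('train means:', Z_tr.mean(axis=0).round(6))
print('train stds: ', Z_tr.std(axis=0).round(6))
print('test means: ', Z_te.mean(axis=0).round(3))
```

Fitting on the training rows alone keeps information about the test set from leaking into the preprocessing step.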

**OLS**

In [10]:
ols = LinearRegression()

ols.fit(Z_train, y_train)

In [11]:
# How does the model score on the training and test data?
print(f'Training Score: {ols.score(Z_train, y_train):.4f}')
print(f'Testing Score: {ols.score(Z_test, y_test):.4f}')

Training Score: 0.8626
Testing Score: 0.8648


- This linear regression model explains approximately 86.26% of the variance in the training set, which suggests that the model fits the training data quite well.

- The linear regression model also performs well on the testing set, explaining approximately 86.48% of the variance.

- The testing score is close to the training score, which suggests that the model generalizes well to unseen data, one of the objectives of this project.
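The "variance explained" reading of these scores comes from the definition of R-squared, which is what `LinearRegression.score` returns. A minimal sketch on synthetic data (`X_toy` and `y_toy` are invented) computes it by hand and compares it with `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(120, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=120)

model = LinearRegression().fit(X_toy, y_toy)
pred = model.predict(X_toy)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_toy - pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f'manual R^2: {r2_manual:.4f}')
print(f'score():    {model.score(X_toy, y_toy):.4f}')
```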



**Ridge**

In [12]:
# Instantiate, with default value of alpha (1)
ridge = Ridge()
# Fit.
ridge.fit(Z_train, y_train)
# Evaluate model using R2.
print(f'Training Score: {ridge.score(Z_train, y_train)}')
print(f'Testing Score: {ridge.score(Z_test, y_test)}')

Training Score: 0.862644153789303
Testing Score: 0.8648262562083273


In [13]:
# Instantiate with alpha = 100
ridge = Ridge(alpha=100)
# Fit.
ridge.fit(Z_train, y_train)
# Evaluate model using R2.
print(f'Training Score: {ridge.score(Z_train, y_train)}')
print(f'Testing Score: {ridge.score(Z_test, y_test)}')

Training Score: 0.8616662821930666
Testing Score: 0.86419540362334


In [14]:
# Instantiate with alpha = 1000
ridge = Ridge(alpha=1000)
# Fit.
ridge.fit(Z_train, y_train)
# Evaluate model using R2.
print(f'Training Score: {ridge.score(Z_train, y_train)}')
print(f'Testing Score: {ridge.score(Z_test, y_test)}')

Training Score: 0.8371002447600637
Testing Score: 0.8477999290316912


In [15]:
# Instantiate with alpha = 10_000
ridge = Ridge(alpha=10_000)
# Fit.
ridge.fit(Z_train, y_train)
# Evaluate model using R2.
print(f'Training Score: {ridge.score(Z_train, y_train)}')
print(f'Testing Score: {ridge.score(Z_test, y_test)}')

Training Score: 0.6034463527676546
Testing Score: 0.6127791060322482


In [16]:
# Instantiate with alpha = 0 (no regularization, equivalent to OLS)
ridge = Ridge(alpha=0)
# Fit.
ridge.fit(Z_train, y_train)
# Evaluate model using R2.
print(f'Training Score: {ridge.score(Z_train, y_train)}')
print(f'Testing Score: {ridge.score(Z_test, y_test)}')

Training Score: 0.862558578927491
Testing Score: 0.865103393519378


- Alpha 1: With the default regularization strength, the model performs well on both the training and testing sets. The scores are similar, indicating good generalization.

- Alpha 100: Increasing the regularization strength to 100 doesn't significantly affect performance. The model still generalizes well, and both scores drop only marginally.

- Alpha 1000: With a higher regularization strength, the model's performance decreases. This regularization may be too strong, leading to underfitting: the model doesn't capture the underlying patterns as well.

- Alpha 10000: With even higher regularization strength, the model's performance drops significantly. This regularization is too strong and the model is likely underfitting. The coefficients are penalized so heavily that the model fits both the training and testing sets poorly.

- Alpha 0: With no regularization, Ridge reduces to ordinary least squares, and the scores are nearly identical to the OLS scores above.
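The pattern in these bullets (larger alpha, smaller coefficients, lower fit) can be reproduced in a loop rather than one cell per alpha. The sketch below uses synthetic standardized data, not the housing data, so the exact numbers differ from the scores above. `X_toy`, `y_toy`, and the coefficient vector are invented.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X_toy = StandardScaler().fit_transform(rng.normal(size=(200, 5)))
y_toy = X_toy @ np.array([5.0, -3.0, 2.0, 0.0, 1.0]) + rng.normal(size=200)

norms, scores = {}, {}
for alpha in [1, 100, 1000, 10_000]:
    model = Ridge(alpha=alpha).fit(X_toy, y_toy)
    norms[alpha] = np.linalg.norm(model.coef_)   # overall coefficient size
    scores[alpha] = model.score(X_toy, y_toy)    # training R^2
    print(f'alpha={alpha:>6}: |coef|={norms[alpha]:.3f}, R^2={scores[alpha]:.4f}')
```

As alpha grows, the penalty shrinks the coefficients toward zero, which is why the very large alphas underfit.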



In [17]:
# Brute-force search over alphas (logic from lesson 3.07).
# np.logspace generates 100 exponents evenly spaced between 0 and 5,
# then converts them to alphas between 10^0 and 10^5.
alphas = np.logspace(0, 5, 100)
# Cross-validate over the list of ridge alphas.
ridge_cv = RidgeCV(alphas=alphas, cv=5)
# Fit model using best ridge alpha!
ridge_cv.fit(Z_train, y_train)

In [18]:
#the optimal value of alpha
ridge_cv.alpha_

187.3817422860383

In [19]:
#the optimal cross-validated R-squared score achieved 
ridge_cv.best_score_

0.8339776668151438

In [20]:
print(ridge_cv.score(Z_train, y_train))
print(ridge_cv.score(Z_test, y_test))

0.8599930605697735
0.8633853644603979


In summary, the Ridge regression model performs well on both the training and testing sets when given an appropriate alpha. With the cross-validated alpha of about 187, it balances fitting the data against overfitting and reaches R-squared scores of about 0.86 on both sets.
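For intuition, the alpha search can be written out by hand with `cross_val_score`: score each candidate alpha with 5-fold cross-validation and keep the one with the best mean R-squared. This is a sketch of roughly the same idea on synthetic data, not a reproduction of `RidgeCV`'s internals. The data are invented and the grid uses 20 alphas instead of 100 to keep it quick.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X_toy = rng.normal(size=(150, 4))
y_toy = X_toy @ np.array([4.0, 0.0, -2.0, 1.0]) + rng.normal(size=150)

# 20 alphas spaced evenly on a log scale between 10^0 and 10^5.
alphas = np.logspace(0, 5, 20)

# Mean 5-fold cross-validated R^2 for each candidate alpha.
mean_scores = [
    cross_val_score(Ridge(alpha=a), X_toy, y_toy, cv=5).mean() for a in alphas
]

best_alpha = alphas[int(np.argmax(mean_scores))]
print(f'best alpha: {best_alpha:.3f}, mean CV R^2: {max(mean_scores):.4f}')
```

The cross-validated score is usually a little lower than the training score, as with the 0.834 versus 0.860 reported above, because each fold is scored on rows the model did not see.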

**Lasso**

In [21]:
# Reminder of the OLS and Ridge scores (this belongs in the modeling section)
print(" OLS ".center(18, "="))
print(ols.score(Z_train, y_train))
print(ols.score(Z_test, y_test))
print()
print(" Ridge ".center(18, "="))
print(ridge_cv.score(Z_train, y_train))
print(ridge_cv.score(Z_test, y_test))

0.8626425982725601
0.8648270899011734

0.8599930605697735
0.8633853644603979


In [22]:
# Set up a list of 100 Lasso alphas from 0.001 to 1.0 to check.
l_alphas = np.logspace(-3, 0, 100)
# Cross-validate over the list of Lasso alphas.
lasso_cv = LassoCV(alphas=l_alphas)
# Fit model using best Lasso alpha!
lasso_cv.fit(Z_train, y_train)

[Warning output from scikit-learn's coordinate descent solver, repeated 18 times and truncated in the export: `model = cd_fast.enet_coordinate_descent_gram(`]


In [23]:
#The optimal value of alpha
lasso_cv.alpha_

1.0

In [24]:
print(lasso_cv.score(Z_train, y_train))
print(lasso_cv.score(Z_test, y_test))

0.8626442854604521
0.8648435953890418


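- The scores of the Lasso regression model represent its goodness of fit (training) and its generalization performance (testing). With the selected alpha, the model performs well on both sets, with R-squared scores of about 0.86. Note that the selected alpha (1.0) is the largest value in the search grid, so a wider grid might find a better value, and the solver warnings above suggest that coordinate descent may not have fully converged during the search.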
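Two follow-up checks are often useful after a Lasso search: whether the chosen alpha sits at the edge of the grid, and how many coefficients were driven to zero. The sketch below shows both on synthetic data. The wider grid (`np.logspace(-3, 1, 50)`) and the larger `max_iter=10_000` are assumptions chosen for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)
X_toy = StandardScaler().fit_transform(rng.normal(size=(200, 6)))
true_coef = np.array([10.0, 0.0, -6.0, 0.0, 0.0, 3.0])  # three irrelevant features
y_toy = X_toy @ true_coef + rng.normal(size=200)

grid = np.logspace(-3, 1, 50)
# A larger max_iter gives coordinate descent more room to converge.
lasso = LassoCV(alphas=grid, cv=5, max_iter=10_000).fit(X_toy, y_toy)

# Is the selected alpha on the boundary of the grid? If so, widen the grid.
at_edge = bool(np.isclose(lasso.alpha_, grid.min()) or np.isclose(lasso.alpha_, grid.max()))
# Lasso can set coefficients exactly to zero, which acts as feature selection.
n_zero = int(np.sum(lasso.coef_ == 0))

print(f'alpha: {lasso.alpha_:.4f}, at grid edge: {at_edge}, zeroed coefficients: {n_zero}')
```

Because the target here is on a dollar scale in the hundreds of thousands, much larger alphas than 1.0 may be needed before Lasso behaves differently from OLS on the housing data.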

In [25]:
# # Save preprocessed data to a CSV file
# df_cleaned.to_csv('../data/test_clean2.csv', index=False) 
