# Project Template Applied Use Case

This notebook demonstrates the discovery stage of a data scientist's work, from obtaining the data to generating the final model.

### 1. Introduction

As a note, the notebooks in this analysis folder should document any experimentation, data analysis, and insights that help others understand the processes created in this project. In this case, we only cover feature engineering and model generation. Data scientists are free to organize their notebooks however they prefer.

For this exercise, we shall use data from Kaggle's "House Prices - Advanced Regression Techniques" challenge.

Ref.: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

### 2. Motivation

The purposes of this notebook include the following:

- To build a model through fast feature engineering.

- To serve as a reference when producing the code in the template.

Note how each piece of code presented here is translated and modularized into the scripts within this template.

### 3. Use Case

From this point on, let's practice quick model development using feature engineering.
The focus of this challenge is to predict the sale price of a house using a number of explanatory variables.

As stated before, it is not the objective of this notebook to perform data analysis, such as statistical tests and data visualisation for insight generation.

#### 3.1 Virtual Environment as a Jupyter Notebook Kernel

Note that this notebook's kernel isn't "Python 3" but "project_template". Using a virtual environment as the kernel is a best practice because it adds one more layer of reproducibility to your project.

- For Linux users (Ref.: https://queirozf.com/entries/jupyter-kernels-how-to-add-change-remove)
- For Windows users (Ref.: https://towardsdatascience.com/python-virtual-environments-jupyter-notebook-bb5820d11da8)
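
To confirm that the selected kernel really runs inside a virtual environment, you can inspect the interpreter from a notebook cell. This is a minimal sketch: the helper name `running_in_virtualenv` is hypothetical, and the check assumes an environment created with `venv` or `virtualenv` (conda environments are typically not detected this way).

```python
import sys


def running_in_virtualenv() -> bool:
    """Return True when the interpreter belongs to a venv/virtualenv environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the Python installation it was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


# The executable path should live inside the environment's folder.
print("Interpreter:", sys.executable)
print("Inside a virtual environment:", running_in_virtualenv())
```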

In [1]:
import os
import boto3
import pandas as pd
import numpy as np
from io import StringIO

from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin


from feature_engine.imputation import MeanMedianImputer, CategoricalImputer

from dotenv import load_dotenv

from houses_regression import data_manager

pd.set_option('display.float_format', lambda x: '%.3f' % x)
load_dotenv()

True

#### 3.2 dotenv

Credentials **MUST NOT BE** hard-coded in any type of code. One of the good practices adopted in the community is to create a .env text file and fill it with the sensitive info. To use dotenv, create this file in the project's root directory (Ref.: https://pypi.org/project/python-dotenv/)

Inside the .env file, define variables in shell style (NAME=value). Once loaded, they are available as environment variables.

- AWS_ACCESS_KEY_ID=AWSACCESSID123
- AWS_SECRET_ACCESS_KEY=AWSSECRETACCESSKEY456
- REGION_NAME=us-east-1

**DO NOT EVER PUSH A .ENV FILE TO A REMOTE REPOSITORY**

In [2]:
aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID")
aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY")
aws_session_token = os.getenv("AWS_SESSION_TOKEN")
region_name = os.getenv("REGION_NAME")

s3 = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    aws_session_token=aws_session_token,
    region_name=region_name
)
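
`os.getenv` returns `None` when a variable is not defined, so a missing or incomplete .env file only shows up later as an authentication error. A minimal sketch of failing early instead; the helper `require_env` is hypothetical and not part of the template.

```python
import os


def require_env(names: list[str]) -> dict[str, str]:
    """Return the requested environment variables, failing early if any is unset."""
    # Treat both undefined and empty variables as missing.
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


# Possible usage after load_dotenv() has run:
# credentials = require_env(["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "REGION_NAME"])
```

The error message lists only the variable names, never their values, so it is safe to log.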

#### 3.3 Feature Engineering

**Important Note:** It's ok if you do not agree with the engineering below. The point is the transformation itself and how it is translated to production code.

The __create_dataframe_from_s3__ function is a custom function that builds a pd.DataFrame from a CSV file stored in an AWS S3 bucket. You can see how it was created inside the "houses_regression" folder; look for a .py file called "aws_resources".

- **Software engineering tip**: Refrain from using "df" when calling or creating a dataframe. Try to be extremely explicit about the objects and variables you are creating from now on. Even though df is a common name, good practice requires the code to be self-explanatory, without abbreviations. Instead of using df, the train data here is called "train_features". Ref.: https://www.castsoftware.com/glossary/coding-in-software-engineering-best-practices-good-standards

In [3]:
bucket = "testebella"
train_file_name = "houses_train.csv"
test_file_name = "houses_test.csv"

train_features = data_manager.create_dataframe_from_s3(bucket=bucket, key=train_file_name)
test_features = data_manager.create_dataframe_from_s3(bucket=bucket, key=test_file_name)

# train_features = pd.read_csv(train_file_name)
# test_features = pd.read_csv(test_file_name)

# mock_data is a fresh copy of the raw train_features. It is used in later sections
# to simulate reproducibility.
mock_data = train_features.copy()

We create some lists to be used later on. Note that these lists are stored in and read from **config.yml** in the root directory (houses_regression).

- **So what is the Configuration file?**

Definition from Wikipedia: “In computing, configuration files (or config files) are files used to configure the parameters and initial settings for some computer programs. They are used for user applications, server processes, and operating system settings.”

This means you can use a configuration file in your machine learning project. Doing so lets you run the project flexibly and manage the source code more easily, e.g. when running different machine learning experiments.

**Extremely Important Reading:** https://medium.com/analytics-vidhya/how-to-write-configuration-files-in-your-machine-learning-project-47bc840acc19

In [4]:
selected_features = [
    "HouseStyle",
    "BsmtFinType1",
    "MiscVal",
    "LotArea",
    "LotFrontage",
    "SalePrice"
]

selected_features_pipeline = [
    "HouseStyle",
    "BsmtFinType1",
    "MiscVal",
    "LotArea",
    "LotFrontage",    
]

numeric_features = [
    "MiscVal",
    "LotArea",
    "LotFrontage"
]

scaled_features = [
    "ScaledMiscVal",
    "ScaledLotArea",
    "ScaledLotFrontage"
]

to_drop_unused_features = [
    "HouseStyle",
    "BsmtFinType1",
    "NewHouseStyle"
    
]
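
The lists above repeat the same column names several times, which makes them easy to get out of sync. A minimal sketch of deriving them from a few base constants; the names `target` and `categorical_features` are hypothetical and introduced only for illustration. The resulting lists are identical to the ones defined above.

```python
# Base constants (hypothetical names: target, categorical_features).
target = "SalePrice"
categorical_features = ["HouseStyle", "BsmtFinType1"]
numeric_features = ["MiscVal", "LotArea", "LotFrontage"]

# Everything else is derived, so a change in one place propagates everywhere.
selected_features_pipeline = categorical_features + numeric_features
selected_features = selected_features_pipeline + [target]
scaled_features = [f"Scaled{name}" for name in numeric_features]
to_drop_unused_features = categorical_features + ["NewHouseStyle"]

print(scaled_features)
# ['ScaledMiscVal', 'ScaledLotArea', 'ScaledLotFrontage']
```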

As this is a simple example, we filter the initial dataset arbitrarily. The goal is to do some easily understood data wrangling that can be reproduced in code later on, which is why only a few variables were selected.

In [5]:
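# .copy() makes the selection an independent dataframe, so the column
# assignments in the next cells do not trigger a SettingWithCopyWarning.
train_features = train_features[selected_features].copy(); train_features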

Unnamed: 0,HouseStyle,BsmtFinType1,MiscVal,LotArea,LotFrontage,SalePrice
0,2Story,GLQ,0,8450,65.000,208500
1,1Story,ALQ,0,9600,80.000,181500
2,2Story,GLQ,0,11250,68.000,223500
3,2Story,ALQ,0,9550,60.000,140000
4,2Story,GLQ,0,14260,84.000,250000
...,...,...,...,...,...,...
1455,2Story,Unf,0,7917,62.000,175000
1456,1Story,ALQ,0,13175,85.000,210000
1457,2Story,GLQ,2500,9042,66.000,266500
1458,1Story,GLQ,0,9717,68.000,142125


##### HouseStyle Variable
This is the first feature engineering step. Since 1-story and 2-story houses make up most of the data, we recode the feature so that every other class is labeled "Other".

In [6]:
train_features["HouseStyle"].value_counts()

1Story    726
2Story    445
1.5Fin    154
SLvl       65
SFoyer     37
1.5Unf     14
2.5Unf     11
2.5Fin      8
Name: HouseStyle, dtype: int64

In [7]:
train_features["NewHouseStyle"] = np.where(
    (train_features["HouseStyle"]!="1Story") & (train_features["HouseStyle"]!="2Story"), 
    "Other", 
    train_features["HouseStyle"]
    )
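
The same recoding can be written with `Series.isin` and `Series.where`, which scales better when more than two classes need to be kept. A minimal sketch on toy data; the variable `kept_styles` is an assumption made for illustration. Missing values also end up as "Other", matching the `np.where` version above.

```python
import pandas as pd

# Toy column with the same kind of values as HouseStyle.
house_style = pd.Series(["1Story", "2Story", "1.5Fin", "SLvl", "2Story"], name="HouseStyle")

# Classes to keep as they are (hypothetical name).
kept_styles = ["1Story", "2Story"]

# where() keeps values for which the condition is True and replaces the rest.
new_house_style = house_style.where(house_style.isin(kept_styles), "Other")

print(new_house_style.tolist())
# ['1Story', '2Story', 'Other', 'Other', '2Story']
```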

##### Imputing Missing Values

A simple missing value imputation: the categorical variable receives its own mode and the numerical variable its median.

In [8]:
# Missing Values
train_features["BsmtFinType1"] = train_features["BsmtFinType1"].fillna(train_features["BsmtFinType1"].mode().values[0]) # Mode
train_features["LotFrontage"] = train_features["LotFrontage"].fillna(train_features["LotFrontage"].median()) # Median
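
The cell above computes the mode and the median from the same dataframe it fills. When new data arrives, it is usually preferable to reuse the values learned from the training data. A minimal sketch on toy data, assuming the fill values are kept in a plain dictionary (the names `train`, `new`, and `fill_values` are hypothetical):

```python
import pandas as pd

train = pd.DataFrame({
    "LotFrontage": [60.0, None, 80.0, 70.0],
    "BsmtFinType1": ["GLQ", "ALQ", None, "GLQ"],
})
new = pd.DataFrame({
    "LotFrontage": [None, 90.0],
    "BsmtFinType1": [None, "Unf"],
})

# Learn the fill values once, from the training data only.
fill_values = {
    "LotFrontage": train["LotFrontage"].median(),
    "BsmtFinType1": train["BsmtFinType1"].mode().iloc[0],
}

# fillna() with a dictionary fills each column with its own value.
train_filled = train.fillna(fill_values)
new_filled = new.fillna(fill_values)

print(new_filled)
```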

##### Standard Scaler

We apply StandardScaler to standardize the numerical variables (zero mean and unit variance), which puts them on a common scale.
After that, we create a dataframe with the scaled variables to be merged a few steps ahead.

In [9]:
transformed_dataframe = StandardScaler().fit_transform(train_features[numeric_features])

scaled_dataframe = pd.DataFrame(transformed_dataframe, columns=scaled_features)
train_features = train_features.merge(scaled_dataframe, how="left", left_index=True, right_index=True)
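
For reference, StandardScaler computes z = (x - mean) / std for each column, using the population standard deviation (ddof=0). A minimal NumPy sketch of the same arithmetic, using the first five LotArea values shown earlier as toy input:

```python
import numpy as np

lot_area = np.array([8450.0, 9600.0, 11250.0, 9550.0, 14260.0])

mean = lot_area.mean()
std = lot_area.std()  # NumPy's default is ddof=0, the population standard deviation

# Standardized values: zero mean and unit variance by construction.
scaled = (lot_area - mean) / std

print(scaled)
```

Because only five rows are used here, the numbers differ from the ScaledLotArea column of the notebook, which is computed over the full training set.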

##### One Hot Encoding

Applying OHE and merging the resulting dataframes with the original one.

In [10]:
dummies_BsmtFinType1 = pd.get_dummies(train_features["BsmtFinType1"])
dummies_NewHouseStyle = pd.get_dummies(train_features["NewHouseStyle"])

train_features = pd.concat([dummies_BsmtFinType1, dummies_NewHouseStyle, train_features],axis=1)
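
One caveat of `pd.get_dummies` is that it only creates columns for the categories present in the data it receives, so new data may produce fewer (or different) columns than the training data. A minimal sketch of aligning the columns with `reindex`, on toy data (the names `train` and `new` are hypothetical). Passing `dtype=int` keeps the 0/1 representation shown in this notebook on pandas versions that default to booleans.

```python
import pandas as pd

train = pd.DataFrame({"NewHouseStyle": ["1Story", "2Story", "Other"]})
new = pd.DataFrame({"NewHouseStyle": ["2Story", "2Story"]})

train_dummies = pd.get_dummies(train["NewHouseStyle"], dtype=int)

# Without alignment, the new data would only get a "2Story" column.
new_dummies = pd.get_dummies(new["NewHouseStyle"], dtype=int)

# Force the training columns, filling categories absent from the new data with 0.
new_dummies = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(new_dummies.columns.tolist())
# ['1Story', '2Story', 'Other']
```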

**Visualizing the final dataframe**

Dropping unused variables to generate the final dataframe.

In [11]:
train_features = train_features.drop(to_drop_unused_features + numeric_features, axis=1); train_features

Unnamed: 0,ALQ,BLQ,GLQ,LwQ,Rec,Unf,1Story,2Story,Other,SalePrice,ScaledMiscVal,ScaledLotArea,ScaledLotFrontage
0,0,0,1,0,0,0,0,1,0,208500,-0.088,-0.207,-0.221
1,1,0,0,0,0,0,1,0,0,181500,-0.088,-0.092,0.460
2,0,0,1,0,0,0,0,1,0,223500,-0.088,0.073,-0.085
3,1,0,0,0,0,0,0,1,0,140000,-0.088,-0.097,-0.448
4,0,0,1,0,0,0,0,1,0,250000,-0.088,0.375,0.642
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,0,0,0,0,0,1,0,1,0,175000,-0.088,-0.261,-0.357
1456,1,0,0,0,0,0,1,0,0,210000,-0.088,0.266,0.687
1457,0,0,1,0,0,0,0,1,0,266500,4.953,-0.148,-0.175
1458,0,0,1,0,0,0,1,0,0,142125,-0.088,-0.080,-0.085


#### 3.4 Model Stage

The usual model stage: build a model and calculate its metrics.

In [12]:
X = train_features.drop("SalePrice", axis=1)
y = train_features["SalePrice"]

##### Important Reminder

As noted in **3.3**, all constants MUST BE set in **config.yml**. Setting the random state, test size, and model configuration there is **ESSENTIAL** for reproducibility.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=16)

In [14]:
model = RandomForestRegressor(n_estimators=400, random_state=16)

In [15]:
model.fit(X_train, y_train)

RandomForestRegressor(n_estimators=400, random_state=16)

In [16]:
y_pred = model.predict(X_test)

In [17]:
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, y_pred)

print("Mean Absolute Error    :", mae)
print("Mean Squared Error     :", mse)
print("Root Mean Squared Error:", rmse)
print("R2:", r2)

Mean Absolute Error    : 38853.808498214305
Mean Squared Error     : 3616687285.8764863
Root Mean Squared Error: 60138.89993902853
R2: 0.45080781369442036


#### 3.5 Using Scikit-Learn Pipeline

Scikit-Learn's Pipeline allows us to preprocess data in one go. It stores essential info about the transformation steps and grants another step towards reproducibility.

To build a pipeline object properly, one must build classes that mirror the features created before. Inheriting from BaseEstimator and TransformerMixin gives these classes the methods a pipeline expects, such as get_params() and fit_transform().

In [18]:
class NewHouseStyleTransformer(BaseEstimator, TransformerMixin):
    """
    Create the NewHouseStyle variable.
    """
    def __init__(self, feature_name: str):
        self.feature_name = feature_name
        
    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        return self
    
    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe = dataframe.copy()
        dataframe["NewHouseStyle"] = np.where(
            (dataframe[self.feature_name]!="1Story") & (dataframe[self.feature_name]!="2Story"), 
            "Other", 
            dataframe[self.feature_name]
        )
        return dataframe

You can use the **fit_transform()** method to preview the dataset after the transformation. The dataframe's last column now shows the newly constructed variable.

In [19]:
NewHouseStyleTransformer("HouseStyle").fit_transform(test_features)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,NewHouseStyle
0,1461,20,RH,80.000,11622,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,6,2010,WD,Normal,1Story
1,1462,20,RL,81.000,14267,Pave,,IR1,Lvl,AllPub,...,0,,,Gar2,12500,6,2010,WD,Normal,1Story
2,1463,60,RL,74.000,13830,Pave,,IR1,Lvl,AllPub,...,0,,MnPrv,,0,3,2010,WD,Normal,2Story
3,1464,60,RL,78.000,9978,Pave,,IR1,Lvl,AllPub,...,0,,,,0,6,2010,WD,Normal,2Story
4,1465,120,RL,43.000,5005,Pave,,IR1,HLS,AllPub,...,0,,,,0,1,2010,WD,Normal,1Story
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,RM,21.000,1936,Pave,,Reg,Lvl,AllPub,...,0,,,,0,6,2006,WD,Normal,2Story
1455,2916,160,RM,21.000,1894,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2006,WD,Abnorml,2Story
1456,2917,20,RL,160.000,20000,Pave,,Reg,Lvl,AllPub,...,0,,,,0,9,2006,WD,Abnorml,1Story
1457,2918,85,RL,62.000,10441,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,Shed,700,7,2006,WD,Normal,Other


In [20]:
class StandardScalerTransformer(BaseEstimator, TransformerMixin):
    """
    Create the Standard Scaler Transformer to apply data standardization.
    """
    def __init__(self, feature_list: list, column_names: list):
        self.feature_list = feature_list
        self.column_names = column_names
        
    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        return self
    
    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe = dataframe.copy()
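        # Note: the scaler is re-fit on whatever dataframe reaches transform(),
        # so new data is scaled with its own mean and standard deviation.
        transformed_dataframe = StandardScaler().fit_transform(dataframe[self.feature_list])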
        
        scaled_dataframe = pd.DataFrame(
            transformed_dataframe, 
            columns=self.column_names
        )
        
        dataframe = dataframe.merge(
            scaled_dataframe, 
            how="left", 
            left_index=True, 
            right_index=True
        )
        
        return dataframe
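
The StandardScalerTransformer above calls `StandardScaler().fit_transform` inside `transform()`, so the scaling statistics are recomputed on every dataframe it receives. A minimal sketch of a stateful alternative that learns them in `fit()` and reuses them afterwards; the class name `FittedScalerTransformer` is hypothetical and this is not the template's implementation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class FittedScalerTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: learns the scaling statistics in fit() and reuses them."""

    def __init__(self, feature_list: list, column_names: list):
        self.feature_list = feature_list
        self.column_names = column_names

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # Means and standard deviations come from the data seen at fit time only.
        self.scaler_ = StandardScaler().fit(X[self.feature_list])
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        scaled = self.scaler_.transform(X[self.feature_list])
        # Assigning NumPy columns avoids any dependence on the dataframe's index.
        for position, name in enumerate(self.column_names):
            X[name] = scaled[:, position]
        return X


# Toy data with non-default indexes.
train = pd.DataFrame({"LotArea": [1.0, 2.0, 3.0]}, index=[10, 11, 12])
new = pd.DataFrame({"LotArea": [2.0]}, index=[99])

scaler_step = FittedScalerTransformer(["LotArea"], ["ScaledLotArea"]).fit(train)
print(scaler_step.transform(new))
```

With this design, a single new row is scaled against the training mean and standard deviation rather than against itself.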

In [21]:
class OneHotEncoderTransformer(BaseEstimator, TransformerMixin):
    """
    Applies the one-hot-encoding on categorical variables.
    """
    def __init__(self, feature_name: str):
        self.feature_name = feature_name
        
    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        return self
    
    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe = dataframe.copy()
        dummy_dataframe = pd.get_dummies(dataframe[self.feature_name])
        
        dataframe = pd.concat(
            [
                dummy_dataframe,
                dataframe
            ],
            axis=1
        )
        
        return dataframe

In [22]:
class DeleteFeaturesTransformer(BaseEstimator, TransformerMixin):
    """
    Drops unwanted variables.
    """
    def __init__(self, feature_list: list):
        self.feature_list = feature_list
        
    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        return self
    
    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe = dataframe.copy()
        
        dataframe = dataframe.drop(
            self.feature_list, 
            axis=1
        )
        
        return dataframe

In [23]:
class FilterFeaturesTransformer(BaseEstimator, TransformerMixin):
    """
    Filters dataframe variables.
    """
    def __init__(self, feature_list: list):
        self.feature_list = feature_list
        
    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        return self
    
    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe = dataframe.copy()
        
        dataframe = dataframe[self.feature_list]
        
        return dataframe

In [24]:
class MedianInputerTransformer(BaseEstimator, TransformerMixin):
    """
    Fills missing values using the Median.
    """
    def __init__(self, feature_name: str):
        self.feature_name = feature_name
        
    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        return self
    
    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe = dataframe.copy()
        dataframe[self.feature_name] = dataframe[self.feature_name].fillna(dataframe[self.feature_name].median())
        
        return dataframe

In [25]:
class ModeInputerTransformer(BaseEstimator, TransformerMixin):
    """
    Fills missing values using the mode.
    """
    def __init__(self, feature_name: str):
        self.feature_name = feature_name
        
    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        return self
    
    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe = dataframe.copy()
        dataframe[self.feature_name] = dataframe[self.feature_name].fillna(dataframe[self.feature_name].mode().values[0])
        
        return dataframe

After that, we assemble all the classes in a list to be passed as an argument to scikit-learn's Pipeline object. This pipeline runs through all the transformations, resulting in the very same dataframe as in the last section.

This object allows us to apply the data transformations in a single step, even for new/production data.

In [26]:
pipeline_list = [
    ("SelectFeatures", FilterFeaturesTransformer(selected_features_pipeline)),
    ("NewHouseStyle", NewHouseStyleTransformer("HouseStyle")),
    ("median_missing_imputer",MedianInputerTransformer("LotFrontage")),
    ("mode_missing_imputer",ModeInputerTransformer("BsmtFinType1")),
    ("StandardScaler",StandardScalerTransformer(numeric_features,scaled_features)),
    ("OneHotEncoder_NewHouseStyle",OneHotEncoderTransformer("NewHouseStyle")),
    ("OneHotEncoder_BsmtFinType1",OneHotEncoderTransformer("BsmtFinType1")),
    ("DropVariables", DeleteFeaturesTransformer(to_drop_unused_features + numeric_features)),
]


preprocess_pipeline = Pipeline(pipeline_list)

model = RandomForestRegressor(
              n_estimators=400, 
              random_state=16
              )

We've reached the same feature dataframe using the pipeline object (SalePrice is absent because it is not among the pipeline's selected features).

In [27]:
pipeline_train_data = preprocess_pipeline.fit_transform(mock_data); pipeline_train_data

Unnamed: 0,ALQ,BLQ,GLQ,LwQ,Rec,Unf,1Story,2Story,Other,ScaledMiscVal,ScaledLotArea,ScaledLotFrontage
0,0,0,1,0,0,0,0,1,0,-0.088,-0.207,-0.221
1,1,0,0,0,0,0,1,0,0,-0.088,-0.092,0.460
2,0,0,1,0,0,0,0,1,0,-0.088,0.073,-0.085
3,1,0,0,0,0,0,0,1,0,-0.088,-0.097,-0.448
4,0,0,1,0,0,0,0,1,0,-0.088,0.375,0.642
...,...,...,...,...,...,...,...,...,...,...,...,...
1455,0,0,0,0,0,1,0,1,0,-0.088,-0.261,-0.357
1456,1,0,0,0,0,0,1,0,0,-0.088,0.266,0.687
1457,0,0,1,0,0,0,0,1,0,4.953,-0.148,-0.175
1458,0,0,1,0,0,0,1,0,0,-0.088,-0.080,-0.085


In [28]:
train_features

Unnamed: 0,ALQ,BLQ,GLQ,LwQ,Rec,Unf,1Story,2Story,Other,SalePrice,ScaledMiscVal,ScaledLotArea,ScaledLotFrontage
0,0,0,1,0,0,0,0,1,0,208500,-0.088,-0.207,-0.221
1,1,0,0,0,0,0,1,0,0,181500,-0.088,-0.092,0.460
2,0,0,1,0,0,0,0,1,0,223500,-0.088,0.073,-0.085
3,1,0,0,0,0,0,0,1,0,140000,-0.088,-0.097,-0.448
4,0,0,1,0,0,0,0,1,0,250000,-0.088,0.375,0.642
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,0,0,0,0,0,1,0,1,0,175000,-0.088,-0.261,-0.357
1456,1,0,0,0,0,0,1,0,0,210000,-0.088,0.266,0.687
1457,0,0,1,0,0,0,0,1,0,266500,4.953,-0.148,-0.175
1458,0,0,1,0,0,0,1,0,0,142125,-0.088,-0.080,-0.085


In [29]:
X = pipeline_train_data
y = mock_data["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=16)

In [30]:
model.fit(X_train, y_train)
pipeline_y_pred = model.predict(X_test)

In [31]:
pipeline_mae = metrics.mean_absolute_error(y_test, pipeline_y_pred)
pipeline_mse = metrics.mean_squared_error(y_test, pipeline_y_pred)
pipeline_rmse = np.sqrt(pipeline_mse)
pipeline_r2 = metrics.r2_score(y_test, pipeline_y_pred)

print("Mean Absolute Error    :", pipeline_mae)
print("Mean Squared Error     :", pipeline_mse)
print("Root Mean Squared Error:", pipeline_rmse)
print("R2:", pipeline_r2)

Mean Absolute Error    : 38853.808498214305
Mean Squared Error     : 3616687285.8764863
Root Mean Squared Error: 60138.89993902853
R2: 0.45080781369442036


And as we can see here, even the metrics are the same!

In [32]:
print(pipeline_mae == mae)
print(pipeline_mse == mse)
print(pipeline_rmse == rmse)
print(pipeline_r2 == r2)

True
True
True
True
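
Exact equality works here because both runs perform the same computation on the same data. In general, comparing floats with `==` is fragile, so a tolerance-based comparison is a safer habit. A minimal sketch with the standard library; the helper `metrics_match` and its tolerance are assumptions made for illustration.

```python
import math


def metrics_match(first: float, second: float, rel_tol: float = 1e-9) -> bool:
    """Compare two metric values within a relative tolerance."""
    return math.isclose(first, second, rel_tol=rel_tol)


# Classic example where exact equality fails but the values are effectively equal.
print(0.1 + 0.2 == 0.3)               # False
print(metrics_match(0.1 + 0.2, 0.3))  # True
```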


Now we can fetch the test dataset to make a prediction. Note that we can run the transformations in a single step!

In [33]:
# The pipeline was already fitted on the training data, so we only transform here.
pipeline_test_features = preprocess_pipeline.transform(test_features)

In [34]:
model.predict(pipeline_test_features)

array([201561.77      , 179703.24833333, 242042.8525    , ...,
       225360.185     , 159118.79916667, 187030.        ])

Finally, we load the modularized version of these steps from the houses_regression package and apply it to the same test data.

In [35]:
from houses_regression import pipeline, features

In [42]:
processed_input = pipeline.preprocess_pipeline.transform(test_features); processed_input

Unnamed: 0,ALQ,BLQ,GLQ,LwQ,Rec,Unf,1Story,2Story,other,ScaledMiscVal,ScaledGarageArea,ScaledLotFrontage
0,0,0,0,0,1,0,1,0,0,-0.092,0.364,0.567
1,1,0,0,0,0,0,1,0,0,19.730,0.898,0.616
2,0,0,1,0,0,0,0,1,0,-0.092,0.810,0.276
3,0,0,1,0,0,0,0,1,0,-0.092,0.032,0.470
4,1,0,0,0,0,0,1,0,0,-0.092,-0.972,-1.232
...,...,...,...,...,...,...,...,...,...,...,...,...
1454,0,0,0,0,0,1,0,1,0,-0.092,-1.591,-2.302
1455,0,0,0,0,1,0,0,1,0,-0.092,-1.600,-2.302
1456,1,0,0,0,0,0,1,0,0,-0.092,2.055,4.458
1457,0,0,1,0,0,0,0,0,1,1.018,0.126,-0.308


In [43]:
model.predict(processed_input)

Feature names unseen at fit time:
- ScaledGarageArea
- other
Feature names seen at fit time, yet now missing:
- Other
- ScaledLotArea



array([201561.77      , 179703.24833333, 242042.8525    , ...,
       225360.185     , 159118.79916667, 187030.        ])

In [50]:
print(model.feature_names_in_)

['ALQ' 'BLQ' 'GLQ' 'LwQ' 'Rec' 'Unf' '1Story' '2Story' 'Other'
 'ScaledMiscVal' 'ScaledLotArea' 'ScaledLotFrontage']
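
The warning above shows that the packaged pipeline produces columns ("other", "ScaledGarageArea") that differ from the ones the model was fitted on ("Other", "ScaledLotArea"). Newer scikit-learn releases turn this kind of mismatch into an error. A minimal sketch of an explicit check that could run before calling predict; the helper `check_feature_names` is hypothetical.

```python
def check_feature_names(expected: list[str], received: list[str]) -> None:
    """Raise a ValueError when the received columns differ from the fitted ones."""
    unseen = [name for name in received if name not in expected]
    missing = [name for name in expected if name not in received]
    if unseen or missing:
        raise ValueError(f"Unseen at fit time: {unseen}; missing: {missing}")


# Column names copied from the outputs above.
expected = ["ALQ", "BLQ", "GLQ", "LwQ", "Rec", "Unf", "1Story", "2Story", "Other",
            "ScaledMiscVal", "ScaledLotArea", "ScaledLotFrontage"]
received = ["ALQ", "BLQ", "GLQ", "LwQ", "Rec", "Unf", "1Story", "2Story", "other",
            "ScaledMiscVal", "ScaledGarageArea", "ScaledLotFrontage"]

try:
    check_feature_names(expected, received)
except ValueError as error:
    print(error)

# Possible usage in the notebook:
# check_feature_names(list(model.feature_names_in_), list(processed_input.columns))
```

A check like this makes the inconsistency between the notebook's lists and the package's configuration visible before any prediction is produced.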
