# Project Template Applied Use Case

This notebook demonstrates a data scientist's discovery stage, from obtaining the data to generating the final model.

### 1. Introduction

As a general rule, the notebooks in this analysis folder should document any experimentation, data analysis, and insights that help others understand the processes created in this project. In this case, we'll only cover feature engineering and model generation. Data scientists are free to organize their notebooks in whatever way suits them.

For this exercise, we shall use data from Kaggle's "House Prices - Advanced Regression Techniques" challenge.

Ref.: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

### 2. Motivation

This notebook has two purposes:

- To build a model through fast feature engineering,

- To serve as a reference when producing the code in the template.

Note how each piece of code presented here is translated and modularized into a script within this template.

### 3. Use Case

From this point on, let's practice quick model development using feature engineering.
The goal of this challenge is to predict the sale price of a house from a number of explanatory variables.

As stated above, this notebook does not aim to perform data analysis such as statistical tests or data visualisation for insight generation.

#### 3.1 Virtual Environment as a Jupyter Notebook Kernel

Note that this notebook's kernel is not "Python 3" but "project_template". Using a virtual environment as the kernel is good practice because it adds one more layer of reproducibility to your project.

- For Linux users (Ref.: https://queirozf.com/entries/jupyter-kernels-how-to-add-change-remove)
- For Windows users (Ref.: https://towardsdatascience.com/python-virtual-environments-jupyter-notebook-bb5820d11da8)
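
One quick way to confirm that the notebook is really running on the virtual environment's interpreter is to inspect `sys` from inside the kernel. This is only an illustrative sketch; the helper name `running_in_virtualenv` is hypothetical and not part of the template.

```python
import sys


def running_in_virtualenv() -> bool:
    """Return True when the interpreter belongs to a venv/virtualenv."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", running_in_virtualenv())
```

This check covers environments created with `venv` or `virtualenv`; conda environments are typically not detected this way, so there the interpreter path printed above is the more useful signal.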

In [1]:
import os
import boto3
import pandas as pd
import numpy as np
from io import StringIO

from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from dotenv import load_dotenv

from houses_regression import aws_resources

load_dotenv()

True

#### 3.2 dotenv

Credentials **MUST NOT** appear explicitly in any code. A common good practice is to create a .env text file and store the sensitive information there. In this template the file is created in the root directory so that dotenv can find it (Ref.: https://pypi.org/project/python-dotenv/)

Inside the .env file, define variables in bash style (NAME=value). They are loaded as environment variables. The cell further below also reads an optional AWS_SESSION_TOKEN, for which os.getenv returns None when it is not defined.

- AWS_ACCESS_KEY_ID=AWSACCESSID123
- AWS_SECRET_ACCESS_KEY=AWSSECRETACCESSKEY456
- REGION_NAME=us-east-1
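
Because `os.getenv` silently returns `None` for an undefined variable, it can help to fail fast when a required entry is missing from the .env file. The sketch below shows one possible way to do that; `require_env` is a hypothetical helper, not part of the template, and the placeholder values are set only so the example runs on its own.

```python
import os


def require_env(names):
    """Return the requested environment variables, failing fast if any is unset."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


# Placeholder values, used only to make the sketch self-contained;
# in the notebook they would come from the .env file via load_dotenv().
os.environ.setdefault("AWS_ACCESS_KEY_ID", "AWSACCESSID123")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "AWSSECRETACCESSKEY456")
os.environ.setdefault("REGION_NAME", "us-east-1")

credentials = require_env(["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "REGION_NAME"])
print(sorted(credentials))
```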

In [2]:
aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID")
aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY")
aws_session_token = os.getenv("AWS_SESSION_TOKEN")
region_name = os.getenv("REGION_NAME")

s3 = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    aws_session_token=aws_session_token,
    region_name=region_name
)

#### 3.3 Feature Engineering

**Important Note:** It's fine if you disagree with the engineering choices below. The point is the transformation itself and how it is translated into production code.

The __create_dataframe_from_s3__ is a custom function that builds a pd.DataFrame from a CSV stored in an AWS S3 bucket. You can see how it was written inside the folder "houses_regression"; look for a .py file called "aws_resources". In the cell below, the S3 calls are commented out and the CSV files are read from local disk instead.

- **Soft. Engineer tip**: Refrain from using "df" when naming a dataframe. Try to be extremely explicit about the objects and variables you create from now on. Even though df is a common name, good practice requires the code to explain itself without abbreviations. Instead of df, the training data here is called "data_train". Ref.: https://www.castsoftware.com/glossary/coding-in-software-engineering-best-practices-good-standards
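
For readers without access to the package, a function of this kind might look roughly like the sketch below. The body is an assumption: the actual implementation lives in "aws_resources" and may differ. The name `create_dataframe_from_s3_sketch`, the optional `s3_client` argument, and the `FakeS3Client` class are hypothetical and exist only so the example can run without AWS access.

```python
from io import BytesIO, StringIO

import pandas as pd


def create_dataframe_from_s3_sketch(bucket, key, s3_client=None):
    """Read a CSV object from S3 into a pandas DataFrame (illustrative only)."""
    if s3_client is None:
        # Imported lazily so the sketch can be exercised without boto3 or AWS.
        import boto3
        s3_client = boto3.client("s3")
    response = s3_client.get_object(Bucket=bucket, Key=key)
    csv_text = response["Body"].read().decode("utf-8")
    return pd.read_csv(StringIO(csv_text))


class FakeS3Client:
    """Stand-in for a boto3 S3 client that returns a tiny in-memory CSV."""

    def get_object(self, Bucket, Key):
        payload = b"HouseStyle,SalePrice\n2Story,208500\n1Story,181500\n"
        return {"Body": BytesIO(payload)}


houses = create_dataframe_from_s3_sketch(
    bucket="example-bucket", key="houses_train.csv", s3_client=FakeS3Client()
)
print(houses)
```

Passing the client in as an argument keeps the function easy to test, since a fake client can replace the real one.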

In [3]:
bucket = "testebella"
train_file_name = "houses_train.csv"
test_file_name = "houses_test.csv"

# data_train = aws_resources.create_dataframe_from_s3(bucket=bucket, key=train_file_name)
# data_test = aws_resources.create_dataframe_from_s3(bucket=bucket, key=test_file_name)

data_train = pd.read_csv(train_file_name)
data_test = pd.read_csv(test_file_name)

The lists below are created for later use. Note that these lists are also stored in, and read from, **config.yml** in the root directory (houses_regression).

- **So what is a configuration file?**

Wikipedia defines it as follows: “In computing, configuration files (or config files) are files used to configure the parameters and initial settings for some computer programs. They are used for user applications, server processes, and operating system settings.”

This means you can use a configuration file in your machine learning project. Doing so lets you run the project more flexibly and manage the source code more easily, e.g. when running different machine learning experiments.

**Extremely Important Reading:** https://medium.com/analytics-vidhya/how-to-write-configuration-files-in-your-machine-learning-project-47bc840acc19
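
As a minimal sketch of the idea, the feature lists could be kept in YAML and loaded at run time. The key layout (`features`, `selected`, `numeric`) is an assumption and not necessarily how the template's config.yml is organised. The example assumes the PyYAML package is installed, and it parses an inline string where the project would read the file from disk.

```python
import yaml  # provided by the PyYAML package

# Inline stand-in for the contents of config.yml (hypothetical layout).
CONFIG_TEXT = """
features:
  selected:
    - HouseStyle
    - BsmtFinType1
    - TotalBsmtSF
    - GarageArea
    - LotFrontage
    - SalePrice
  numeric:
    - TotalBsmtSF
    - GarageArea
    - LotFrontage
"""

config = yaml.safe_load(CONFIG_TEXT)
selected_features = config["features"]["selected"]
numeric_features = config["features"]["numeric"]
print(numeric_features)
```

Changing which features an experiment uses then means editing the YAML file, not the code.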

In [4]:
selected_features = [
    "HouseStyle",
    "BsmtFinType1",
    "TotalBsmtSF",
    "GarageArea",
    "LotFrontage",
    "SalePrice"
]

numeric_features = [
    "TotalBsmtSF",
    "GarageArea",
    "LotFrontage"
]

scaled_features = [
    "ScaledTotalBsmtSF",
    "ScaledGarageArea",
    "ScaledLotFrontage"
]

to_drop_unused_features = [
    "HouseStyle",
    "BsmtFinType1",
    "NewHouseStyle"
]

As this is a simple example, we filtered the initial dataset arbitrarily. We want to do some understandable data wrangling to be reproducible in code later on. For this very reason, a few variables were selected.

In [5]:
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added later
data_train = data_train[selected_features].copy(); data_train

| | HouseStyle | BsmtFinType1 | TotalBsmtSF | GarageArea | LotFrontage | SalePrice |
|---|---|---|---|---|---|---|
| 0 | 2Story | GLQ | 856 | 548 | 65.0 | 208500 |
| 1 | 1Story | ALQ | 1262 | 460 | 80.0 | 181500 |
| 2 | 2Story | GLQ | 920 | 608 | 68.0 | 223500 |
| 3 | 2Story | ALQ | 756 | 642 | 60.0 | 140000 |
| 4 | 2Story | GLQ | 1145 | 836 | 84.0 | 250000 |
| ... | ... | ... | ... | ... | ... | ... |
| 1455 | 2Story | Unf | 953 | 460 | 62.0 | 175000 |
| 1456 | 1Story | ALQ | 1542 | 500 | 85.0 | 210000 |
| 1457 | 2Story | GLQ | 1152 | 252 | 66.0 | 266500 |
| 1458 | 1Story | GLQ | 1078 | 240 | 68.0 | 142125 |


##### HouseStyle Variable
This is the first feature engineering step. Since 1-story and 2-story houses make up most of the data (together roughly 80% of the rows, per the counts below), we recode the feature so that every other class is labelled "Other".

In [6]:
data_train["HouseStyle"].value_counts()

1Story    726
2Story    445
1.5Fin    154
SLvl       65
SFoyer     37
1.5Unf     14
2.5Unf     11
2.5Fin      8
Name: HouseStyle, dtype: int64

In [7]:
data_train["NewHouseStyle"] = np.where(
    (data_train["HouseStyle"]!="1Story") & (data_train["HouseStyle"]!="2Story"), 
    "Other", 
    data_train["HouseStyle"]
    )
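
The same recoding can also be written with `isin`, which is easier to extend if more classes need to be kept later. This is an alternative sketch on a toy series, not the form used in the template; the name `kept_styles` is introduced here for illustration.

```python
import pandas as pd

house_styles = pd.Series(
    ["1Story", "2Story", "1.5Fin", "SLvl", "2Story"], name="HouseStyle"
)

# Keep the two dominant classes and replace everything else with "Other".
kept_styles = ["1Story", "2Story"]
new_house_style = house_styles.where(house_styles.isin(kept_styles), "Other")

print(new_house_style.tolist())
# → ['1Story', '2Story', 'Other', 'Other', '2Story']
```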

##### Imputing Missing Values

A simple missing-value imputation: the categorical feature receives its own mode and the numerical feature its median.

In [8]:
# Missing Values
data_train["BsmtFinType1"] = data_train["BsmtFinType1"].fillna(data_train["BsmtFinType1"].mode().values[0]) # Mode
data_train["LotFrontage"] = data_train["LotFrontage"].fillna(data_train["LotFrontage"].median()) # Median
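
When this step is moved into production code, one option is to separate learning the fill values from applying them, so that values learned on the training data can be reused on other data. The sketch below illustrates that idea on a toy dataframe; the function names `learn_imputation_values` and `apply_imputation` are hypothetical and are not taken from the template.

```python
import pandas as pd


def learn_imputation_values(data):
    """Learn fill values: mode for the categorical column, median for the numeric one."""
    return {
        "BsmtFinType1": data["BsmtFinType1"].mode().iloc[0],
        "LotFrontage": data["LotFrontage"].median(),
    }


def apply_imputation(data, fill_values):
    """Return a copy of `data` with missing values replaced column by column."""
    return data.fillna(value=fill_values)


# Toy data with one missing value per column.
toy_train = pd.DataFrame({
    "BsmtFinType1": ["GLQ", "ALQ", "GLQ", None],
    "LotFrontage": [65.0, 80.0, None, 60.0],
})

fill_values = learn_imputation_values(toy_train)
toy_imputed = apply_imputation(toy_train, fill_values)
print(fill_values)
print(toy_imputed)
```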

##### Standard Scaler

We apply the Standard Scaler to put the numerical variables on a common scale: each one is centred on its mean and divided by its standard deviation.
After that, we create a dataframe with the scaled variables and merge it into the training data.

In [9]:
transformed_dataframe = StandardScaler().fit_transform(data_train[numeric_features])

scaled_dataframe = pd.DataFrame(transformed_dataframe, columns=scaled_features)
data_train = data_train.merge(scaled_dataframe, how="left", left_index=True, right_index=True)
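
To make the transformation concrete, the standardization can be reproduced by hand with NumPy. The sketch uses only the first five TotalBsmtSF values displayed earlier as a toy sample, so the resulting numbers differ from the ScaledTotalBsmtSF column, which was fitted on all rows.

```python
import numpy as np

# Toy sample: the first five TotalBsmtSF values shown in the notebook output.
total_bsmt_sf = np.array([856.0, 1262.0, 920.0, 756.0, 1145.0])

# Standardization: z = (x - mean) / std, where std is the population
# standard deviation (ddof=0) of the data the scaler was fitted on.
mean = total_bsmt_sf.mean()
std = total_bsmt_sf.std(ddof=0)
scaled = (total_bsmt_sf - mean) / std

print("mean of scaled values:", round(float(scaled.mean()), 6))
print("std of scaled values :", round(float(scaled.std(ddof=0)), 6))
```

After scaling, the sample has mean 0 and standard deviation 1 (up to floating-point error), which is what places the three numeric features on a comparable scale.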

##### One Hot Encoding

We apply one-hot encoding (OHE) to the two categorical features and concatenate the resulting dummy columns with the original dataframe.

In [10]:
dummies_BsmtFinType1 = pd.get_dummies(data_train["BsmtFinType1"])
dummies_NewHouseStyle = pd.get_dummies(data_train["NewHouseStyle"])

data_train = pd.concat([dummies_BsmtFinType1, dummies_NewHouseStyle, data_train],axis=1)
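
The dummy columns above are named after the raw categories (for example "ALQ" or "1Story"). If it matters which feature a column came from, `pd.get_dummies` accepts a `prefix`. The sketch below is an optional variation on a toy series, not what the template does.

```python
import pandas as pd

toy_basement = pd.Series(["GLQ", "ALQ", "GLQ"], name="BsmtFinType1")

# prefix keeps the source feature in the column name; dtype=int yields 0/1 values.
dummies = pd.get_dummies(toy_basement, prefix="BsmtFinType1", dtype=int)

print(dummies.columns.tolist())
# → ['BsmtFinType1_ALQ', 'BsmtFinType1_GLQ']
```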

**Visualizing the final dataframe**

Dropping unused variables to generate the final dataframe.

In [11]:
data_train = data_train.drop(to_drop_unused_features + numeric_features, axis=1); data_train

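| | ALQ | BLQ | GLQ | LwQ | Rec | Unf | 1Story | 2Story | Other | SalePrice | ScaledTotalBsmtSF | ScaledGarageArea | ScaledLotFrontage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 208500 | -0.459303 | 0.351000 | -0.220875 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 181500 | 0.466465 | -0.060731 | 0.460320 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 223500 | -0.313369 | 0.631726 | -0.084636 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 140000 | -0.687324 | 0.790804 | -0.447940 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 250000 | 0.199680 | 1.698485 | 0.641972 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 175000 | -0.238122 | -0.060731 | -0.357114 |
| 1456 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 210000 | 1.104925 | 0.126420 | 0.687385 |
| 1457 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 266500 | 0.215641 | -1.033914 | -0.175462 |
| 1458 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 142125 | 0.046905 | -1.090059 | -0.084636 |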


#### 3.4 Model Stage

The usual model stage: build a model and calculate its metrics.

In [12]:
X = data_train.drop("SalePrice", axis=1)
y = data_train["SalePrice"]

##### Important Reminder

As mentioned in **3.3**, all constants MUST be defined in **config.yml**. Setting a random state, a test size, and the model configuration is **ESSENTIAL** for reproducibility.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=16)
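
The effect of fixing the random state can be illustrated with a simplified stand-in for a shuffled split. This is not how `train_test_split` is implemented internally, and `shuffled_split` is a hypothetical helper; the point is only that the same seed yields the same partition on every run.

```python
import numpy as np


def shuffled_split(n_rows, test_size, random_state):
    """Shuffle row positions with a fixed seed and split them into train/test."""
    rng = np.random.default_rng(random_state)
    positions = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return positions[n_test:], positions[:n_test]


train_a, test_a = shuffled_split(n_rows=20, test_size=0.25, random_state=16)
train_b, test_b = shuffled_split(n_rows=20, test_size=0.25, random_state=16)

print(np.array_equal(test_a, test_b))
# → True
```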

In [14]:
model = RandomForestRegressor(n_estimators=400, random_state=16)

In [15]:
model.fit(X_train, y_train)

RandomForestRegressor(n_estimators=400, random_state=16)

In [16]:
y_pred = model.predict(X_test)

In [17]:
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, y_pred)

print("Mean Absolute Error    :", mae)
print("Mean Squared Error     :", mse)
print("Root Mean Squared Error:", rmse)
print("R2:", r2)

Mean Absolute Error    : 28226.457319940255
Mean Squared Error     : 2508871365.6197877
Root Mean Squared Error: 50088.63509439828
R2: 0.6190291165551294
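
For reference, the four metrics above can be reproduced by hand with NumPy. The prices in this sketch are made-up toy values, not taken from the dataset, and are chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up toy values for illustration only.
y_true = np.array([200000.0, 150000.0, 250000.0])
y_hat = np.array([210000.0, 140000.0, 240000.0])

residuals = y_true - y_hat
mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # same units as SalePrice
# R2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
print("R2  :", r2)
```

Because RMSE squares the residuals before averaging, it penalizes large errors more heavily than MAE. In the results above, the RMSE (about 50,089) is much larger than the MAE (about 28,226), which suggests that a few predictions are far off.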
