# User Guide Follow-Along Code

This tutorial contains all the code used in the [Pyreal User Guides](https://dtail.gitbook.io/pyreal/user-guides/data-preparation-and-modelling). We recommend following along with the text there.

This tutorial uses a smaller version of the Ames Housing Dataset [1], with 8 key features selected. In this guide, we will train an ML model that predicts the sale price of houses based on these features. 

[1] De Cock, D. (2011). Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, 19(3). https://doi.org/10.1080/10691898.2011.11889627

# Data Preparation and Modelling
## Training and Input Data

Pyreal expects data in the format of pandas DataFrames. Each row refers to one data instance (a person, place, thing, or entity), and each column refers to a feature, or piece of information about that instance. Column headers are the names of the features. Each instance may optionally have an instance ID, which can be stored either as the DataFrame's index (row IDs) or as a separate column.
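As a rough illustration of the two ways to store instance IDs, here is a minimal pandas sketch. The toy values and the `houses` variable are made up for this example and are not part of Pyreal or the real dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are borrowed from the dataset for illustration only.
houses = pd.DataFrame(
    {
        "House ID": ["House A", "House B", "House C"],
        "LotArea": [9937, 8450, 9600],
        "CentralAir": [True, True, False],
    }
)

# Option 1: keep the instance ID as an ordinary column.
ids_as_column = houses

# Option 2: move the instance ID into the DataFrame's index (row IDs).
ids_as_index = houses.set_index("House ID")

# With IDs in the index, one feature of one instance can be looked up by ID.
print(ids_as_index.loc["House B", "LotArea"])
```

Either layout holds the same information; the index form simply makes lookups by ID more direct.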

There are two categories of data relevant to ML decision-making: the training data and the input data.
We will load the training data from Pyreal's `sample_applications` module, and then inspect it.

In [1]:
from pyreal.sample_applications import ames_housing_small
from sklearn.model_selection import train_test_split

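# Load the features (X) and the ground-truth sale prices (y)
X, y = ames_housing_small.load_data(include_targets=True)
# Hold out part of the data for evaluating the model later
X_train_orig, X_test_orig, y_train, y_test = train_test_split(X, y)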

X_train_orig.head()


| | LotArea | Neighborhood | OverallQual | YearBuilt | Exterior1st | TotalBsmtSF | CentralAir | GrLivArea |
|---|---|---|---|---|---|---|---|---|
| 165 | 22950.0 | Old Town | NaN | 1892.0 | Wood Siding | 1107.0 | True | 3608.0 |
| 73 | 11911.0 | Gilbert | 6.0 | 2005.0 | Vinyl Siding | 684.0 | True | 1560.0 |
| 431 | 9291.0 | NaN | 6.0 | 1993.0 | Hard Board | 832.0 | NaN | 1710.0 |
| 448 | 12243.0 | Northwest Ames | 5.0 | NaN | NaN | 1484.0 | True | 1484.0 |
| 857 | 11029.0 | North Ames | 6.0 | 1958.0 | Metal Siding | 1184.0 | True | 1414.0 |


The training data is used to train the ML model and explainers. The input data is the data that you actively wish to get predictions on and understand better. The main difference between these two types of data is that you will usually have the ground-truth values (the "correct" answer for the value your model tries to predict) for your training data but not for your input data.

In the cell below, we inspect our ground-truth information for our training data, `y_train`, stored in a pandas Series.

In [2]:
y_train.head()

165    475000
73     174000
431    187000
448    175000
857    176500
Name: SalePrice, dtype: int64
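The index values in `y_train` match those of `X_train_orig` because `train_test_split` shuffles features and targets together. The sketch below illustrates this on made-up toy data (`X_toy` and `y_toy` are hypothetical). It also passes `random_state`, an optional argument not used in the cell above, to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy features and targets that share the same index.
X_toy = pd.DataFrame(
    {"LotArea": [9000, 10000, 11000, 12000, 13000, 14000, 15000, 16000]},
    index=[10, 11, 12, 13, 14, 15, 16, 17],
)
y_toy = pd.Series(
    [150, 160, 170, 180, 190, 200, 210, 220], index=X_toy.index, name="SalePrice"
)

# random_state fixes the shuffle so the same split is produced on every run.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# Features and targets are split together, so their indices still line up.
print(X_tr.index.equals(y_tr.index))

# By default, 25% of the rows are held out for testing.
print(len(X_tr), len(X_te))
```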

In the cell below, we load and inspect our input data. For this data, we have no ground-truth values.

In [3]:
x_input = ames_housing_small.load_input_data()
x_input

| | House ID | LotArea | Neighborhood | OverallQual | YearBuilt | Exterior1st | TotalBsmtSF | CentralAir | GrLivArea |
|---|---|---|---|---|---|---|---|---|---|
| 0 | House 101 | 9937 | Edwards | 5 | 1965 | Hard Board | 1256 | True | 1256 |
| 1 | House 102 | 8450 | College Creek | 7 | 2003 | Vinyl Siding | 856 | True | 1710 |
| 2 | House 103 | 9600 | Veenker | 6 | 1976 | Metal Siding | 1262 | True | 1262 |
| 3 | House 104 | 11250 | College Creek | 7 | 2001 | Vinyl Siding | 920 | True | 1786 |
| 4 | House 105 | 9550 | Crawford | 7 | 1915 | Wood Siding | 756 | True | 1717 |
| 5 | House 106 | 14260 | Northridge | 8 | 2000 | Vinyl Siding | 1145 | True | 2198 |
| 6 | House 107 | 14115 | Mitchell | 5 | 1993 | Vinyl Siding | 796 | True | 1362 |
| 7 | House 108 | 10084 | Somerset | 8 | 2004 | Vinyl Siding | 1686 | True | 1694 |
| 8 | House 109 | 10382 | Northwest Ames | 7 | 1973 | Hard Board | 1107 | True | 2090 |


## Transformers

Many ML models either require data to be in a specific format, or perform significantly better when data is in a specific format.
For example, many models require all data to be numeric, cannot handle missing data, or expect all features to be on similar numeric scales. Real-world data rarely meets these requirements, so we need to perform feature engineering using data transformers.
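To make these three requirements concrete, here is a minimal sketch of one-hot encoding, imputation, and standardization written in plain pandas. It is not Pyreal's implementation. The toy data is hypothetical, and mean imputation is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data with a categorical column and a missing numeric value.
toy = pd.DataFrame(
    {
        "Neighborhood": ["Gilbert", "Old Town", "Gilbert"],
        "YearBuilt": [2005.0, None, 1995.0],
    }
)

# 1. One-hot encode: replace the categorical column with one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Neighborhood"], dtype=float)

# 2. Impute: fill missing numeric values with the column mean (an assumed strategy).
imputed = encoded.fillna(encoded.mean())

# 3. Standardize: rescale every column to zero mean and unit variance.
scaled = (imputed - imputed.mean()) / imputed.std(ddof=0)

print(scaled.round(3))
```

Under these assumptions, a mean-imputed value ends up at 0 after standardization, because it sits exactly at the column mean.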

In the cell below, we initialize all the transformers we will need to make predictions with our model. See the [Pyreal User Guide](https://dtail.gitbook.io/pyreal/user-guides/data-preparation-and-modelling/transformers) for details.

We then fit the transformers to our training data, and inspect the resulting transformed data.

In [4]:
from pyreal.transformers import OneHotEncoder, MultiTypeImputer, StandardScaler, fit_transformers

oh_encoder = OneHotEncoder(columns=["Neighborhood", "Exterior1st"], handle_unknown="ignore")
imputer = MultiTypeImputer()
scaler = StandardScaler()

transformers = [oh_encoder, imputer, scaler]
fit_transformers(transformers, X_train_orig).head()

| | LotArea | OverallQual | YearBuilt | TotalBsmtSF | CentralAir | GrLivArea | Neighborhood_Bloomington Heights | Neighborhood_Bluestem | Neighborhood_Briardale | Neighborhood_Brookside | ... | Exterior1st_Cement Board | Exterior1st_Hard Board | Exterior1st_Imitation Stucco | Exterior1st_Metal Siding | Exterior1st_Plywood | Exterior1st_Stucco | Exterior1st_Vinyl Siding | Exterior1st_Wood Shingles | Exterior1st_Wood Siding | Exterior1st_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 165 | 1.2232 | 0.0 | -2.713342 | 0.149527 | 0.251161 | 4.169153 | -0.117502 | -0.044151 | -0.099112 | -0.206389 | ... | -0.176333 | -0.416552 | -0.031204 | -0.392122 | -0.280533 | -0.129673 | -0.720044 | -0.144409 | 2.487732 | -0.21146 |
| 73 | 0.136138 | -0.066189 | 1.164953 | -0.916721 | 0.251161 | 0.100505 | -0.117502 | -0.044151 | -0.099112 | -0.206389 | ... | -0.176333 | -0.416552 | -0.031204 | -0.392122 | -0.280533 | -0.129673 | 1.388803 | -0.144409 | -0.401973 | -0.21146 |
| 431 | -0.121866 | -0.066189 | 0.753099 | -0.54366 | 0.251161 | 0.398502 | -0.117502 | -0.044151 | -0.099112 | -0.206389 | ... | -0.176333 | 2.400658 | -0.031204 | -0.392122 | -0.280533 | -0.129673 | -0.720044 | -0.144409 | -0.401973 | -0.21146 |
| 448 | 0.168832 | -0.820747 | -7.803736e-15 | 1.099824 | 0.251161 | -0.05048 | -0.117502 | -0.044151 | -0.099112 | -0.206389 | ... | -0.176333 | -0.416552 | -0.031204 | -0.392122 | -0.280533 | -0.129673 | -0.720044 | -0.144409 | -0.401973 | 4.729021 |
| 857 | 0.049283 | -0.066189 | -0.4481427 | 0.34362 | 0.251161 | -0.189545 | -0.117502 | -0.044151 | -0.099112 | -0.206389 | ... | -0.176333 | -0.416552 | -0.031204 | 2.550225 | -0.280533 | -0.129673 | -0.720044 | -0.144409 | -0.401973 | -0.21146 |


## Modelling

We can now transform our training and testing data, and initialize, train, and evaluate our ML model.

In [18]:
from pyreal.transformers import run_transformers
from lightgbm import LGBMRegressor

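# Apply the fitted transformers to both the training and testing data
X_train_model = run_transformers(transformers, X_train_orig)
X_test_model = run_transformers(transformers, X_test_orig)

# Train a gradient-boosted regression model on the transformed training data
model = LGBMRegressor().fit(X_train_model, y_train)

# score() returns the R^2 coefficient of determination
print(model.score(X_test_model, y_test))  # held-out testing data
model.score(X_train_model, y_train)  # training data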

0.8413540333073528


0.955077097301512
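For scikit-learn-style regressors such as `LGBMRegressor`, `score` returns the coefficient of determination, R². The sketch below computes R² by hand on made-up numbers (`y_true` and `y_pred` are hypothetical, not taken from the model above) and compares the result with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted sale prices.
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 280000.0, 260000.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual value should agree with scikit-learn's implementation.
print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean of the true values.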