# Jupyter Notebook Evaluation: Generating a Model

This notebook covers the basics of building and validating a model using the DecisionTreeRegressor class from scikit-learn (sklearn).

## This notebook uses
<ul>
<li>Pandas</li>
<li>scikit-learn DecisionTreeRegressor</li>
<li>Python functions</li>
<li>train_test_split</li>
</ul>



In [1]:
## Preparing the environment 
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path to the training data file (train.csv)
mytrainingfile = './train.csv'

# Load the data and preview the first five rows
df = pd.read_csv(mytrainingfile)
df.head()
# df.describe()  # uncomment for summary statistics of the numeric columns

# Establish the column names 
# Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [2]:
# Select the prediction target (y) and the feature columns (X)
df = pd.read_csv(mytrainingfile)
y = df.SalePrice
#feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'SalePrice']
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = df[feature_columns]

# Specify the model to be used
home_model = DecisionTreeRegressor()

# Model fitting 
home_model.fit(X, y)

# Compare the model's predictions with the actual target values
print("First in-sample predictions:", home_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

# Single-row sanity check: the predicted SalePrice should be 208500 for the input values (8450, 2003, 856, 854, 2, 3, 8)

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]



### In-sample predictions
In-sample predictions take the columns listed in feature_columns as input; the target values are the sale prices recorded for those same rows. Because the model was fitted on these rows, matching values show only that it reproduces its training data.

In [3]:
# Uncomment the line below to inspect the first row of the feature columns.
# To check that the price matches, temporarily add SalePrice to the feature_columns list.
#print(X.head(1))  

In [4]:
# Now let's predict with just one row and check that we get 208500
print("First in-sample predictions:", home_model.predict(X.head(1)))
print("Actual target values for those homes:", y.head(1).tolist())
print("First in-sample predictions:", home_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

First in-sample predictions: [208500.]
Actual target values for those homes: [208500]
First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]
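
The exact match between predictions and targets above is what an unrestricted decision tree would be expected to produce on its own training rows. The sketch below illustrates this on synthetic data, not on the housing data: the arrays X_demo and y_demo are made up for illustration, and the target is pure noise, so a perfect in-sample fit says nothing about predictive quality.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the housing data from train.csv)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))   # 200 rows, 3 numeric features
y_demo = rng.normal(size=200)        # noise target, unrelated to the features

# With default settings the tree keeps splitting until every leaf is pure
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_demo, y_demo)

# Largest absolute in-sample error; effectively zero because the tree
# memorizes each distinct training row
in_sample_pred = tree.predict(X_demo)
max_abs_error = float(np.max(np.abs(in_sample_pred - y_demo)))
print("Largest in-sample error:", max_abs_error)
```

This is why the model needs to be checked on rows it has not seen during fitting.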


### Splitting data using train_test_split
The train_test_split function splits the data into a training set and a validation set. When no test_size is given, scikit-learn holds out 25% of the rows for validation.

In [8]:
from sklearn.model_selection import train_test_split
trainX, valx, trainY, valy = train_test_split(X, y, random_state=0)
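
As a rough illustration of what the split returns, the sketch below runs train_test_split on a small made-up array (X_demo and y_demo are hypothetical names, not part of the notebook). It shows the default 75/25 split and that feature rows stay aligned with their targets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features; row i is [2*i, 2*i + 1] and its target is i
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Same call pattern as in the notebook: default test_size, fixed random_state
train_X, val_X, train_y, val_y = train_test_split(X_demo, y_demo, random_state=0)

# 15 rows go to training and 5 to validation
print(train_X.shape, val_X.shape)

# Rows are shuffled, but each feature row still matches its own target
print((train_X[:, 0] // 2 == train_y).all())
```

Fixing random_state makes the shuffle reproducible, so rerunning the cell yields the same split.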

### Now train the model using the training split

In [22]:
home_model = DecisionTreeRegressor(random_state=0)
home_model.fit(trainX, trainY)
# Do not fit on the validation split; it is held out for evaluation
#home_model.fit(valx,valy)

# Predict on the held-out validation data
validate_predictions = home_model.predict(valx)

### Calculate the Mean Absolute Error on the Validation Data

In [23]:
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(valy, validate_predictions)

print(val_mae)

32410.824657534245
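
The value printed above is the average absolute difference between the predicted and actual sale prices on the validation rows. A minimal sketch of the same calculation done by hand, using four made-up prices (the arrays actual and predicted are illustrative and not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical sale prices and predictions, for illustration only
actual = np.array([200000, 150000, 320000, 180000])
predicted = np.array([210000, 140000, 300000, 185000])

# MAE by hand: the mean of the absolute errors (10000, 10000, 20000, 5000)
manual_mae = float(np.mean(np.abs(actual - predicted)))

# The scikit-learn helper should give the same number
sklearn_mae = float(mean_absolute_error(actual, predicted))

print(manual_mae, sklearn_mae)
```

Because MAE is expressed in the units of the target, the notebook's result reads as an average miss of roughly 32,400 in sale price on unseen homes.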
