I based my work on the 'House Prices - Advanced Regression Techniques' competition on Kaggle:
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques

 I followed these answers/tutorials:
 https://docs.microsoft.com/en-us/azure/machine-learning/media/algorithm-cheat-sheet/machine-learning-algorithm-cheat-sheet.png#lightbox
 https://www.kaggle.com/code/dansbecker/handling-missing-values
 https://www.kaggle.com/code/dansbecker/using-categorical-data-with-one-hot-encoding/notebook
 https://www.kaggle.com/code/dansbecker/model-validation/tutorial
 https://www.makeuseof.com/fill-missing-data-with-pandas/

In [1]:
import pandas as pd
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [55]:
#Check for missing values
missing_val_count_by_column = (df_train.isnull().sum())
missing_val_count_by_column[missing_val_count_by_column > 0]

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

In [56]:
#One-hot encode the categorical columns (dummy variables), then align the test columns to the training columns
one_hot_encoded_training_predictors = pd.get_dummies(df_train)
one_hot_encoded_test_predictors = pd.get_dummies(df_test)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors, join='left', axis=1)
missing_val_count_by_column = (final_train.isnull().sum())
missing_val_count_by_column[missing_val_count_by_column > 0]

LotFrontage    259
MasVnrArea       8
GarageYrBlt     81
dtype: int64
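
To see what `get_dummies` followed by `align` does, here is a minimal sketch on two made-up frames. The column names `LotArea` and `Street` are borrowed from the dataset, but the values (including the `"Dirt"` category) are invented for illustration. With `join='left'`, the test frame gains any dummy column that only the training data produced (filled with NaN) and drops any that only the test data produced.

```python
import pandas as pd

# Invented toy data; only the column names come from the House Prices dataset.
toy_train = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                          "Street": ["Pave", "Grvl", "Pave"]})
toy_test = pd.DataFrame({"LotArea": [7000, 8000],
                         "Street": ["Pave", "Dirt"]})  # "Dirt" is a made-up category

# One-hot encode each frame separately, as in the cell above.
enc_train = pd.get_dummies(toy_train)
enc_test = pd.get_dummies(toy_test)

# Keep exactly the training columns in both frames.
aligned_train, aligned_test = enc_train.align(enc_test, join="left", axis=1)

print(list(aligned_train.columns))  # training columns, unchanged
print(list(aligned_test.columns))   # same columns; Street_Dirt is gone
print(aligned_test["Street_Grvl"].isnull().all())  # column added to test as NaN
```

One consequence for the notebook above: `df_train` still contains `SalePrice` when it is encoded, so after the left join `final_test` also receives a `SalePrice` column, and it is entirely NaN because test.csv has no such column.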

In [60]:
#Clean numerical data by filling the missing values with each column's mean (rounded to 1 decimal)
final_train.fillna(final_train.mean().round(1), inplace=True)
missing_val_count_by_column = (final_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
final_test.fillna(final_test.mean().round(1), inplace=True)
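#NOTE: this re-checks final_train; to verify the test set, use final_test.isnull().sum() here instead
missing_val_count_by_column = (final_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])  #double check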

Series([], dtype: int64)
Series([], dtype: int64)
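
The cell above fills with the column mean only. As a rough sketch of how a mean fill and a median fill can differ, here is a toy frame with invented values (only the column names come from the dataset). The median is less affected by a single extreme value, which may matter for skewed columns such as lot sizes.

```python
import pandas as pd

# Invented values; 300.0 is a deliberately extreme entry.
toy = pd.DataFrame({"LotFrontage": [60.0, 80.0, None, 70.0, 300.0],
                    "MasVnrArea": [0.0, None, 100.0, 0.0, 200.0]})

# fillna with a Series fills each column with its own statistic.
mean_filled = toy.fillna(toy.mean())
median_filled = toy.fillna(toy.median())

print(mean_filled.loc[2, "LotFrontage"])    # 127.5, pulled up by the 300.0 entry
print(median_filled.loc[2, "LotFrontage"])  # 75.0
print(mean_filled.isnull().sum().sum())     # 0 missing values remain
```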


In [62]:
#Specify a target
#We choose to build a model based on the SalePrice column 
#We'll try to predict the sale prices and compare them to the actual prices
df_train = final_train
sp = df_train['SalePrice']


In [65]:
#Let's create a DataFrame from the features we'll use to predict SalePrice
feature_names = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd']
fn = df_train[feature_names]
fn.describe()
fn.head()

   LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  TotRmsAbvGrd
0     8450       2003       856       854         2             3             8
1     9600       1976      1262         0         2             3             6
2    11250       2001       920       866         2             3             6
3     9550       1915       961       756         1             3             7
4    14260       2000      1145      1053         2             4             9


In [67]:
#Since we're predicting a continuous value (the sale price), we're going to use decision tree regression
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(random_state=1)
model.fit(fn,sp)
predict = model.predict(fn)

DecisionTreeRegressor(random_state=1)

In [72]:
#Sanity-check the model by comparing the original values of SalePrice with its predictions on the same (training) rows
print(sp.head(10))
print(predict[:10])

0    208500
1    181500
2    223500
3    140000
4    250000
5    143000
6    307000
7    200000
8    129900
9    118000
Name: SalePrice, dtype: int64
[208500. 181500. 223500. 140000. 250000. 143000. 307000. 200000. 129900.
 118000.]


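We use mean absolute error (MAE) to measure the model quality. For each house, the prediction error is
error = actual − predicted
MAE takes the absolute value of each error and averages them, so it tells us how far off the predictions are on average, in dollars.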
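
As a small worked example of that calculation, the sketch below computes MAE by hand. The actual prices are the first three SalePrice values shown earlier; the predictions are invented for illustration.

```python
actual = [208500, 181500, 223500]
predicted = [200000, 190000, 223500]  # invented predictions, not model output

# Signed error per house: actual - predicted
errors = [a - p for a, p in zip(actual, predicted)]

# MAE: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

print(errors)       # [8500, -8500, 0]
print(sum(errors))  # 0: signed errors cancel out, which is why we take absolute values
print(mae)          # about 5666.67
```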

In [88]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(sp, predict) #In-sample MAE is about $62, because the model is scored on the same data it was trained on

62.35433789954339

In [98]:
#Fit a second model on a train/validation split of the training data
from sklearn.model_selection import train_test_split
train_fn, val_fn, train_sp, val_sp = train_test_split(fn, sp, random_state = 0)
model = DecisionTreeRegressor()
model.fit(train_fn, train_sp)
val_predictions = model.predict(val_fn)
mean_absolute_error(val_sp, val_predictions) #Out-of-sample MAE varies between runs, from about $32,000 to $35,000 (the model has no fixed random_state)

32456.86301369863
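
The gap between the in-sample MAE (about $62) and the validation MAE (about $32,000) is typical of an unrestricted decision tree, which can memorize its training rows. The sketch below reproduces the effect on synthetic data; the "floor area" feature, the price formula, and the noise level are all assumptions made up for the demonstration, not taken from the competition data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: one invented feature and a noisy linear "price".
X = np.linspace(500, 3000, 400).reshape(-1, 1)
y = 100 * X[:, 0] + rng.normal(0, 20000, size=400)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# No depth limit: the tree grows until every training row is fit exactly.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mae = mean_absolute_error(y_train, tree.predict(X_train))
val_mae = mean_absolute_error(y_val, tree.predict(X_val))

print(train_mae)  # essentially zero: the tree memorized the training rows
print(val_mae)    # much larger: the memorized noise does not generalize
```

The validation score is the more honest estimate of how the model would do on houses it has not seen.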