### what we are going to do
* load the training and testing data
* make sure there is no data leakage (target leakage and train-test contamination)
* train the model (XGBoost) and cross-validate it with mean absolute error (MAE) as the scoring metric
* predict on the test data and store the predictions in a CSV file

In [2]:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [3]:
# load the training and test data, then list the training columns
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
train_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [4]:
# limit data leakage by keeping only a hand-picked subset of columns and leaving out ambiguous ones
# prefer columns that vary across houses: near-constant columns give the trees little to split on
numerical_features = ['LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
                      'GrLivArea', 'BedroomAbvGr', 'TotRmsAbvGrd', 'GarageArea', 'GarageYrBlt', 'PoolArea', 'KitchenAbvGr']
categorical_features = ['MSZoning', 'LandContour', 'Utilities', 'LandSlope', 'Neighborhood', 'SaleType','KitchenQual']

tot_cols = numerical_features+categorical_features

X = train_data[tot_cols]
y = train_data.SalePrice
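
The leakage concerns from the plan can be checked mechanically to some extent. The following is a minimal sketch, not a complete leakage audit: `check_leakage` is a hypothetical helper, the toy frames are made up for illustration, and the `Id` and `SalePrice` column names are taken from the column listing above.

```python
import pandas as pd


def check_leakage(train_df, test_df, feature_cols,
                  target_col="SalePrice", id_col="Id"):
    """Return a few simple leakage indicators (hypothetical helper)."""
    # Target leakage: the target itself must never be used as a feature.
    target_in_features = target_col in feature_cols
    # Train-test contamination: the same row id should not be in both sets.
    shared_ids = sorted(set(train_df[id_col]) & set(test_df[id_col]))
    # Features absent from the test set cannot be used at prediction time.
    missing_in_test = [c for c in feature_cols if c not in test_df.columns]
    return {
        "target_in_features": target_in_features,
        "shared_ids": shared_ids,
        "missing_in_test": missing_in_test,
    }


# Toy data, invented for this example only.
toy_train = pd.DataFrame({"Id": [1, 2, 3],
                          "LotArea": [100, 200, 300],
                          "SalePrice": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"Id": [3, 4], "LotArea": [150, 250]})

report = check_leakage(toy_train, toy_test, ["LotArea"])
print(report)
```

With the notebook's own objects the call would presumably look like `check_leakage(train_data, test_data, tot_cols)`. A clean report only rules out the most obvious problems; features that encode information from after the sale still need a manual review.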


In [5]:
# making preprocessing pipelines
numerical_preprocessor = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean'))
])

categorical_preprocessor = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('oh_encode', OneHotEncoder(handle_unknown='ignore'))
])

#combining pipelines
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_preprocessor, numerical_features),
    ('cat', categorical_preprocessor, categorical_features)
])

In [6]:
# making the machine learning model pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('xgb', XGBRegressor(n_estimators=300, learning_rate=0.06))
])
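
Once a pipeline like this exists, the last step of the plan (predict and store the results in a CSV file) could look roughly like the sketch below. `write_predictions` is a hypothetical helper, and the `Id`/`SalePrice` output layout and the file name are assumptions. The toy usage swaps in a `DummyRegressor` so the snippet runs without the real data.

```python
import os
import tempfile

import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import Pipeline


def write_predictions(pipeline, X_train, y_train, X_test, test_ids,
                      path="submission.csv"):
    """Fit on all training rows, predict the test rows, save them as CSV."""
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    # Assumed output layout: one row per test id with its predicted price.
    output = pd.DataFrame({"Id": list(test_ids), "SalePrice": predictions})
    output.to_csv(path, index=False)
    return output


# Toy usage with a stand-in model that always predicts the training mean.
toy_X_train = pd.DataFrame({"LotArea": [100, 200, 300]})
toy_y_train = pd.Series([10.0, 20.0, 30.0])
toy_X_test = pd.DataFrame({"LotArea": [150, 250]})
toy_pipeline = Pipeline(steps=[("model", DummyRegressor(strategy="mean"))])

toy_path = os.path.join(tempfile.gettempdir(), "toy_submission.csv")
result = write_predictions(toy_pipeline, toy_X_train, toy_y_train,
                           toy_X_test, test_ids=[4, 5], path=toy_path)
print(result)
```

With the notebook's objects, the call would presumably be `write_predictions(my_pipeline, X, y, test_data[tot_cols], test_data["Id"])`, assuming `test.csv` has the same feature columns and an `Id` column.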

### approach 2
dropping the columns with missing values

In [7]:
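# defining a transformer class that integrates with a scikit-learn pipeline
# (X must be a pandas DataFrame)

from sklearn.base import BaseEstimator, TransformerMixin

class DropMissingColumns(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # remember which columns contain missing values in the data seen during fit
        self.columns_to_drop = X.columns[X.isnull().any()]
        return self

    def transform(self, X):
        # drop those same columns from whatever data is passed in later
        return X.drop(columns=self.columns_to_drop)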
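
One consequence of this design is worth checking on a small example: the columns to drop are decided at fit time, so a column that has missing values only in the validation rows is kept, and its NaNs pass through. The sketch below repeats the class so that it runs on its own. The `BaseEstimator` base class and the toy frames are additions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropMissingColumns(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # Columns with at least one missing value in the fitting data.
        self.columns_to_drop = X.columns[X.isnull().any()]
        return self

    def transform(self, X):
        return X.drop(columns=self.columns_to_drop)


# Toy frames: "b" has a gap in training, "a" has a gap only in validation.
toy_train = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 4.0]})
toy_valid = pd.DataFrame({"a": [np.nan, 5.0], "b": [6.0, 7.0]})

dropper = DropMissingColumns().fit(toy_train)
kept_train = list(dropper.transform(toy_train).columns)
kept_valid = list(dropper.transform(toy_valid).columns)
nan_left = int(dropper.transform(toy_valid).isnull().sum().sum())

print("kept in train:", kept_train)
print("kept in valid:", kept_valid)
print("NaNs left in valid:", nan_left)
```

Only column `b` is dropped in both frames, and the missing value in `a` survives in the validation frame. During cross-validation this means the model can still receive NaNs. XGBoost accepts them for numeric inputs, but it is something to keep in mind when comparing the two approaches.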

In [8]:
#defining the new pipeline
numerical_preprocessor_2 = Pipeline(steps=[
    ('drop_missing', DropMissingColumns())
])

categorical_preprocessor_2 = Pipeline(steps=[
    ('drop_missing', DropMissingColumns()),
    ('oh_encoding', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor_2 = ColumnTransformer(transformers=[
    ('num', numerical_preprocessor_2, numerical_features),
    ('cat', categorical_preprocessor_2, categorical_features)
])

my_pipeline_2 = Pipeline(steps=[
    ('preprocess', preprocessor_2),
    ('xgb', XGBRegressor(n_estimators=300, learning_rate=0.06))
])

In [9]:
# MAE per fold for the imputing pipeline (no columns dropped), then the variance across folds
scores = -1 * cross_val_score(my_pipeline, X, y,
                              scoring='neg_mean_absolute_error',
                              cv=5)
scores.var()

np.float64(973310.607951469)

In [10]:
# MAE per fold for the pipeline that drops columns with missing values, then the variance across folds
scores2 = -1 * cross_val_score(my_pipeline_2, X, y,
                               scoring='neg_mean_absolute_error',
                               cv=5)
scores2.var()

np.float64(1078831.2181048964)

In [11]:
print("score 1 : ", scores.mean())
print("score 2 : ", scores2.mean())

score 1 :  17588.443420911815
score 2 :  17538.637692636985
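
The variances printed earlier are in squared price units, which makes them hard to compare with the mean MAE. Taking the square root gives the fold-to-fold standard deviation in the same units as the scores. The sketch below only re-uses the numbers printed above. Treating the standard deviation over five folds as a noise level is a rough rule of thumb, not a formal significance test.

```python
import math

# Values copied from the outputs above.
var_imputed = 973310.607951469       # scores.var()
var_dropped = 1078831.2181048964     # scores2.var()
mean_imputed = 17588.443420911815    # scores.mean()
mean_dropped = 17538.637692636985    # scores2.mean()

# Standard deviation across folds, in the same units as the MAE.
std_imputed = math.sqrt(var_imputed)
std_dropped = math.sqrt(var_dropped)

# Difference between the two approaches' mean MAE.
mean_gap = mean_imputed - mean_dropped

# Rough check: is the gap smaller than the fold-to-fold spread?
gap_is_within_noise = abs(mean_gap) < min(std_imputed, std_dropped)

print(f"std 1    : {std_imputed:.1f}")
print(f"std 2    : {std_dropped:.1f}")
print(f"mean gap : {mean_gap:.1f}")
print("gap within fold-to-fold spread:", gap_is_within_noise)
```

The spread across folds is roughly 1,000 for both pipelines, while the two mean scores differ by about 50, so on this evidence neither approach is clearly better than the other.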


### future agenda -
* understand why the MAE varies so much from fold to fold and try to reduce that variance
* search for ways to lower the model's error