# Introduction: Pipelines, Cross-Validation, and the XGBoost Model

This kernel applies some of the techniques taught in the [Intermediate Machine Learning course](https://www.kaggle.com/learn/intermediate-machine-learning): **pipelines**, **cross-validation**, and the **XGBoost model**. If you find errors or inconsistencies in this notebook, please let me know; it would be very helpful.

The notebook is based on the [Kaggle Housing Prices Competition](https://www.kaggle.com/competitions/home-data-for-ml-course), which provides a dataset with 79 explanatory variables describing aspects of residential homes in Ames, Iowa, and challenges participants to predict the final price of each home.

# Import libraries

In [1]:
import os
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

# Load datasets: train.csv and test.csv



In [2]:
# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id') 
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

print(X.shape, y.shape, X_test.shape)

(1460, 79) (1460,) (1459, 79)


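# Classify Features
Classify the features as categorical or numerical. Categorical columns with 10 or more unique values are dropped, which reduces the 79 features to 76.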

In [3]:
# Select categorical columns with relatively low cardinality
categorical_cols = [cname for cname in X.columns if 
                    X[cname].nunique() < 10 and
                    X[cname].dtype == "object"] 
                    

# Select numerical columns
numerical_cols = [cname for cname in X.select_dtypes(exclude='object')]


# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_test = X_test[my_cols].copy()
Xf = X[my_cols].copy()

# pd.set_option('display.max_columns', None)
print(Xf.shape)
# Xf.describe(include='all')


(1460, 76)
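
The selection rule above can be tried on a tiny frame to see which columns survive. This is only a sketch: the toy data, the names `toy`, `max_cardinality`, `low_card_cols`, `numeric_cols` and `dropped_cols`, and the threshold of 3 are assumptions made for illustration (the notebook itself uses 10 on the real data).

```python
import pandas as pd

# Toy frame standing in for the housing data (values are illustrative).
# Text columns are given an explicit object dtype so that the
# dtype == "object" check behaves the same across pandas versions.
toy = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl", "Pave", "Pave"], dtype="object"),
    "Neighborhood": pd.Series(["A", "B", "C", "D"], dtype="object"),
    "LotArea": [8450, 9600, 11250, 9550],
})

max_cardinality = 3  # hypothetical threshold; the notebook uses 10

# Text columns with few distinct values are kept as categorical features
low_card_cols = [
    c for c in toy.columns
    if toy[c].dtype == "object" and toy[c].nunique() < max_cardinality
]

# Every non-object column is treated as numerical
numeric_cols = list(toy.select_dtypes(exclude="object").columns)

# Whatever is left over is a high-cardinality text column and gets dropped
dropped_cols = [c for c in toy.columns if c not in low_card_cols + numeric_cols]

print("kept categorical:", low_card_cols)
print("kept numerical:", numeric_cols)
print("dropped:", dropped_cols)
```

Listing the dropped columns this way on the real data is a quick check that nothing important was filtered out by the cardinality rule.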


# Encode Categorical Features 
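Use `OrdinalEncoder` to encode the categorical columns of the training and test sets. Categories that were not seen in the training set are encoded as -1.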

In [4]:
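# Work on copies so the unencoded frames stay intact
Xfinal = Xf.copy()
X_test_final = X_test.copy()

# Fit the encoder on the training data only; categories unseen during fitting map to -1
OrdEnc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
Xfinal[categorical_cols] = OrdEnc.fit_transform(Xf[categorical_cols])
X_test_final[categorical_cols] = OrdEnc.transform(X_test[categorical_cols])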
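
To see what `handle_unknown='use_encoded_value'` does, here is a minimal sketch on made-up data. The frames `train` and `test` and the category `"Dirt"` are assumptions for illustration; only the encoder settings match the cell above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up data: "Dirt" appears only in the test frame
train = pd.DataFrame({"Street": pd.Series(["Pave", "Grvl", "Pave"], dtype="object")})
test = pd.DataFrame({"Street": pd.Series(["Grvl", "Dirt"], dtype="object")})

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Categories are learned from the training frame and sorted: Grvl -> 0, Pave -> 1
train_codes = enc.fit_transform(train)

# The unseen category "Dirt" falls back to unknown_value
test_codes = enc.transform(test)

print(enc.categories_)
print(train_codes.ravel())
print(test_codes.ravel())
```

Without `handle_unknown`, the call to `transform` would raise an error on the unseen category, so this setting is what lets the test set be encoded safely.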

# Pipeline: Preprocess and Model
Create `my_pipeline` to impute missing values and fit the model (XGBoost).

In [5]:
# Set transformer for numerical data
numerical_transformer = SimpleImputer(strategy='mean')

# Set transformer for categorical data
categorical_transformer = SimpleImputer(strategy='most_frequent')

# Bundle transformers for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = XGBRegressor(n_estimators=800, learning_rate=0.05, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

Xfinal.shape

(1460, 76)
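
A small sketch of what the preprocessing step does on its own, assuming a made-up two-column frame (`toy`, `LotArea` and `StreetCode` are illustrative names, and the model step is left out). The numerical gap is filled with the column mean, the categorical gap with the most frequent code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up frame: one numerical column and one already ordinal-encoded column
toy = pd.DataFrame({
    "LotArea": [100.0, np.nan, 300.0],   # gap filled with the mean (200.0)
    "StreetCode": [1.0, 1.0, np.nan],    # gap filled with the most frequent value (1.0)
})

pre = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), ["LotArea"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["StreetCode"]),
])

# The result is a NumPy array whose columns follow the transformer order:
# first the "num" columns, then the "cat" columns
filled = pre.fit_transform(toy)
print(filled)
```

Because the imputers live inside the pipeline, their fill values are recomputed from the training portion of each fold during cross-validation.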

# Cross-Validation
Use 5-fold cross-validation to evaluate the pipeline.

In [6]:
# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, Xfinal, y, cv=5, scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)

print("Average MAE score (across experiments):")
print(scores.mean())

MAE scores:
 [15948.22515786 16671.70470355 16703.22016802 15125.81600492
 17345.80935627]
Average MAE score (across experiments):
16358.955078125
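
The sign flip can be checked by hand. The sketch below assumes toy data and a `DummyRegressor` in place of the pipeline (both chosen only to keep it fast); it compares the scores from `cross_val_score` with the MAE computed fold by fold.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Toy regression data, for illustration only
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10, dtype=float)

model = DummyRegressor(strategy="mean")  # always predicts the training mean
folds = KFold(n_splits=5)                # unshuffled 5-fold split

# sklearn scorers follow "greater is better", so the MAE comes back negated
neg_scores = cross_val_score(model, X_toy, y_toy, cv=folds,
                             scoring="neg_mean_absolute_error")

# Same evaluation done manually, one fold at a time
manual_mae = []
for train_idx, val_idx in folds.split(X_toy):
    model.fit(X_toy[train_idx], y_toy[train_idx])
    preds = model.predict(X_toy[val_idx])
    manual_mae.append(mean_absolute_error(y_toy[val_idx], preds))

print(-neg_scores)
print(np.array(manual_mae))
```

Both printed arrays should agree, which confirms that multiplying by -1 recovers the ordinary MAE for each fold.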


# Fit the pipeline and get predictions

In [7]:
my_pipeline.fit(Xfinal, y)

preds_test = my_pipeline.predict(X_test_final)

# Save predictions and submit

In [8]:
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

print("Your submission was successfully saved!")

Your submission was successfully saved!
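
Before uploading, it can be worth reading the file back and checking its shape. The sketch below is an assumption-laden stand-in: it uses made-up predictions (`toy_output`) and an in-memory buffer instead of `submission.csv`, and checks only the two columns written above.

```python
import io
import pandas as pd

# Made-up predictions standing in for preds_test; the Ids are illustrative only
toy_output = pd.DataFrame({"Id": [1461, 1462, 1463],
                           "SalePrice": [120000.0, 155000.5, 98000.0]})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
toy_output.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Basic sanity checks on the saved file
has_expected_columns = list(check.columns) == ["Id", "SalePrice"]
has_no_missing = not check["SalePrice"].isna().any()
has_unique_ids = check["Id"].is_unique

print(has_expected_columns, has_no_missing, has_unique_ids)
```

On the real file, the same checks would use `pd.read_csv('submission.csv')` and could also compare the row count with `X_test.shape[0]`.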
