# Introduction 
`V1.0.0`
### Who am I
Just a fellow Kaggle learner. I created this notebook as practice and thought it could be useful to others.
### Who is this for
This notebook is for people who learn from examples. Forget the boring lectures and follow along for a fun, instructive time :)
### What can I learn here
You will learn the basics needed to create a rudimentary XGBoost model with hyperparameter tuning. I go over each step with explanations. Hopefully, with these building blocks, you can go ahead and build much more complex models.

### Things to remember
+ Please Upvote/Like the Notebook so other people can learn from it
+ Feel free to give any recommendations/changes. 
+ I will keep updating the notebook, so expect more changes in the future.

### You can also refer to these notebooks that have helped me as well:
+ https://www.kaggle.com/cv13j0/tps-jan22-quick-eda-xgboost/notebook

# Imports

<div class="alert alert-block alert-info">
<b>Tip:</b> Prefix a line with "!" to run it as a shell command. Here we install the "holidays" library so that we can use it later.
</div>

In [None]:
!pip install holidays

In [None]:
# Computational imports
import numpy as np   # Library for n-dimensional arrays
import pandas as pd  # Library for dataframes (structured data)

# ML imports
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

import holidays

# Set seeds to make the experiment more reproducible.
from numpy.random import seed
seed(1)

# Allows us to see more information regarding the DataFrame
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)

# Importing Data
1. Since the data comes as CSV files, we use the pandas read_csv function to load it.
2. After loading, it is important to inspect the data and get a general feel for what we are going to be using.

Let's define the path to the folder that contains all the CSV files.

In [None]:
PATH_TO_INPUT = '/kaggle/input/tabular-playground-series-jan-2022/'

Store the data in DataFrames using the read_csv function.

In [None]:
train_df = pd.read_csv(PATH_TO_INPUT + 'train.csv')
test_df = pd.read_csv(PATH_TO_INPUT + 'test.csv')
submission_df = pd.read_csv(PATH_TO_INPUT + 'sample_submission.csv')

<div class="alert alert-block alert-info">
<b>Tip:</b> We can use the .head() method to obtain the first 5 rows of the DataFrame.
</div>

In [None]:
train_df.head(5)

<div class="alert alert-block alert-info">
<b>Tip:</b> We can use the .sample() method to obtain 5 random rows in the DataFrame.
</div>

In [None]:
test_df.sample(5)

# EDA/Visualizations
The goal is to try and gain insights from the data prior to modeling

## Exploring the DataFrame

It is useful to call the .info() method to get a quick glance at the general information about the DataFrame. It displays the type of each column and its non-null count. In this case there are 26298 entries, and every column has a non-null count of 26298. This means no column has any missing values.

In [None]:
train_df.info()

We can also explore unique values for our feature columns using the unique() method.

In [None]:
country_list = train_df['country'].unique()
store_list = train_df['store'].unique()
product_list = train_df['product'].unique()

print(f'Country List:{country_list}')
print(f'Store List:{store_list}')
print(f'Product List:{product_list}')

The value_counts() method allows us to get unique value counts that exist in a specific column. In this case, we will get the unique values and count of the three feature columns.

In [None]:
train_df["country"].value_counts(), train_df["store"].value_counts(), train_df["product"].value_counts()

The describe() method gives a quick summary of the statistical information of the numerical columns. We get descriptions for the mean, standard deviation and max value for example.

In [None]:
train_df.describe()

Here we define a function that returns our categorical, numerical and feature columns. We will be using it consistently across the notebook.

In [None]:
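def get_all_cols(df, target, exclude=None):
    """Return the categorical, numerical and feature columns of df."""
    # Avoid a mutable default argument
    exclude = [] if exclude is None else list(exclude)

    # Select categorical columns (use the df argument, not the global train_df)
    object_cols = [cname for cname in df.columns
                   if df[cname].dtype == "object"]

    # Select numerical columns (any integer, float or boolean dtype,
    # so the result does not depend on the exact integer width)
    num_cols = [cname for cname in df.columns
                if pd.api.types.is_numeric_dtype(df[cname])]

    all_cols = object_cols + num_cols

    # The target and the explicitly excluded columns are not features
    exclude_cols = exclude + [target]

    feature_cols = [col for col in all_cols if col not in exclude_cols]

    return object_cols, num_cols, feature_cols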

In [None]:
object_cols, num_cols, feature_cols = get_all_cols(train_df, 'num_sold', exclude=['row_id', 'date', 'num_sold'])

<div class="alert alert-block alert-warning">  
<b>Note:</b> We exclude row_id and date because we will not be using them directly as features. We remove
    num_sold because it is the target.
</div>

In [None]:
object_cols, num_cols, feature_cols

Let's define a handy function that quickly gives us the min and max timestep of our dataframe.

In [None]:
# Create a simple function to evaluate the time-ranges of the information provided.
# It will help with the train / validation separations

def evaluate_time(df):
    min_date = df['date'].min()
    max_date = df['date'].max()
    print(f'Min Date: {min_date} /  Max Date: {max_date}')
    return None

evaluate_time(train_df)
evaluate_time(test_df)
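
To illustrate why these ranges matter, here is a minimal sketch on two tiny made-up frames (not the competition data). It parses the dates and checks that the test period starts after the training period ends. The `date` column name matches the notebook; `toy_train` and `toy_test` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy frames standing in for train_df and test_df
toy_train = pd.DataFrame({'date': ['2015-01-01', '2018-12-31', '2016-06-15']})
toy_test = pd.DataFrame({'date': ['2019-01-01', '2019-12-31']})

# Parse the strings so that min/max compare dates rather than text
train_dates = pd.to_datetime(toy_train['date'])
test_dates = pd.to_datetime(toy_test['date'])

# True when every test date comes after every training date
no_overlap = train_dates.max() < test_dates.min()
print(f'Train: {train_dates.min().date()} to {train_dates.max().date()}')
print(f'Test starts after train ends: {no_overlap}')
```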

## Plotting
We will explore various plots that could give us valuable insights.

### Time-Series Plot
We will start with a time-series plot of the number of sales for a specific store. The goal is to see whether there is seasonality in the data (more or fewer sales during a specific time period) or a trend (sales increasing or decreasing over time).

In [None]:
km_df = train_df[train_df['store'] == 'KaggleMart']
kr_df= train_df[train_df['store'] == 'KaggleRama']

In [None]:
km_grouped_df = km_df.groupby(['date'])['num_sold'].sum()
km_grouped_df.plot(figsize = (10,5))

In [None]:
kr_grouped_df = kr_df.groupby(['date'])['num_sold'].sum()
kr_grouped_df.plot(figsize = (10,5))

We notice a spike in sales every year around the Christmas holidays. That is expected since people tend to spend more money around that time of the year.
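
One way to put a number on this kind of seasonality is to average sales per calendar month. The sketch below does this on synthetic data with an artificial end-of-December bump. The series and its values are assumptions for illustration only, not the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales (an assumption, not the competition data):
# a flat baseline of 100 units with a bump in the last week of December
dates = pd.date_range('2015-01-01', '2016-12-31', freq='D')
sales = pd.Series(100.0, index=dates)
sales[(dates.month == 12) & (dates.day >= 24)] += 150.0

# Average sales per calendar month, pooled over both years
monthly_mean = sales.groupby(sales.index.month).mean()
peak_month = int(monthly_mean.idxmax())
print(monthly_mean.round(1))
print(f'Peak month: {peak_month}')
```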

### Bar Plot
Here we will explore the number of sales per product for a specific store. The goal is to determine which items sell the most.

In [None]:
km_df = train_df[train_df['store'] == 'KaggleMart']
kr_df= train_df[train_df['store'] == 'KaggleRama']

# as_index=False keeps 'product' as a regular column, so no reset_index() is needed
km_grouped_df = km_df.groupby(by=['product'], as_index=False)['num_sold'].sum()
kr_grouped_df = kr_df.groupby(by=['product'], as_index=False)['num_sold'].sum()

In [None]:
sns.set_style('white') # darkgrid, whitegrid, dark, white and ticks
colors = sns.color_palette('pastel') # Color palette to use
plt.rc('axes', titlesize=18)     # fontsize of the axes title
plt.rc('axes', labelsize=14)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=13)    # fontsize of the tick labels
plt.rc('ytick', labelsize=13)    # fontsize of the tick labels
plt.rc('legend', fontsize=13)    # legend fontsize
plt.rc('font', size=13)          # controls default text sizes
sns.barplot(data=km_grouped_df, x='product', y= 'num_sold');

In [None]:
sns.barplot(data=kr_grouped_df, x='product', y= 'num_sold')

### Categorical Plot
Here we will explore the number of sales, the mean and the distribution per product. We can check whether the distribution is close to Gaussian and whether there are any outliers. The goal is to analyze the statistical distribution of num_sold for each product type.

In [None]:
# Distribution of num_sold per product

sns.catplot(y = "num_sold", x = "product", data = train_df.sort_values("num_sold", ascending = False), kind="violin", height = 4, aspect = 3)
plt.show()

<div class="alert alert-block alert-success">  
<b>What we found:</b> From the EDA, we notice that we will have to prepare the data before training the model. The date column is stored with the wrong data type (object instead of datetime) and the target contains outliers. The .info() output showed no missing values.
</div>

# Feature Engineering
In this section, we take the data and preprocess and engineer it so that it is ready to be fed to our model. There are many steps to this.

## Prepare the Data
In this subsection, we prepare the feature columns. That can mean converting a column to a proper type, creating datetime features from our date column, or adding more valuable feature columns (such as holidays) to our DataFrame. This is the first step before going on to the other feature engineering steps.

In [None]:
TARGET = 'num_sold'

Here we are using the holidays library to get information on the holidays of the three countries.

In [None]:
# country_holidays is the current entry point of the holidays library;
# the older CountryHoliday name is deprecated.
holiday_FI = holidays.country_holidays('FI', years=[2015, 2016, 2017, 2018, 2019])
holiday_NO = holidays.country_holidays('NO', years=[2015, 2016, 2017, 2018, 2019])
holiday_SE = holidays.country_holidays('SE', years=[2015, 2016, 2017, 2018, 2019])

We create one dict for all the holidays. Some holidays overlap and we don't want repetitive entries in our dictionary. Note that a date is therefore marked as a holiday if it is one in any of the three countries.

In [None]:
holiday_dict = holiday_FI.copy()
holiday_dict.update(holiday_NO)
holiday_dict.update(holiday_SE)

An important step here is to convert our date column to a datetime column. Before this, it was stored as an object column.

We then use the map() method to look up each date in our dict. When a date matches a dict key, map() returns the holiday name, which gets stored in our new column named "holiday_name".

In [None]:
train_df['date'] = pd.to_datetime(train_df['date'])
train_df['holiday_name'] = train_df['date'].map(holiday_dict)
train_df['is_holiday'] = np.where(train_df['holiday_name'].notnull(), 1, 0)
train_df['holiday_name'] = train_df['holiday_name'].fillna('Not Holiday')

train_df.head(10)
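
To see the mapping logic in isolation, here is a minimal sketch that replaces the holidays library with a small hand-written dict. The `toy_holidays` dict and `toy_df` frame are hypothetical. The three steps (map, flag, fill) mirror the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for holiday_dict: timestamps mapped to holiday names
toy_holidays = {
    pd.Timestamp('2015-01-01'): "New Year's Day",
    pd.Timestamp('2015-12-25'): 'Christmas Day',
}

toy_df = pd.DataFrame(
    {'date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-12-25'])}
)

# Dates found in the dict get the holiday name, all others become NaN
toy_df['holiday_name'] = toy_df['date'].map(toy_holidays)
# Flag the rows that matched before the NaN values are filled
toy_df['is_holiday'] = np.where(toy_df['holiday_name'].notnull(), 1, 0)
toy_df['holiday_name'] = toy_df['holiday_name'].fillna('Not Holiday')
print(toy_df)
```

The flag has to be computed before fillna(), otherwise every row would look like a holiday.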

<div class="alert alert-block alert-danger">  
Don't forget to do the same for our test data!
</div>

In [None]:
test_df['date'] = pd.to_datetime(test_df['date']) # Convert the date to datetime.
test_df['holiday_name'] = test_df['date'].map(holiday_dict)
test_df['is_holiday'] = np.where(test_df['holiday_name'].notnull(), 1, 0)
test_df['holiday_name'] = test_df['holiday_name'].fillna('Not Holiday')

test_df.sample(10)

Here we define a function that quickly creates time-series features from our datetime column. Since we aren't using a sequence model such as an RNN, our model has no knowledge of the sequential order of our data points. That is why we add several features, each representing one piece of information about the date (day, week, month, etc.).

In [None]:
def create_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create features based on the date variable. The idea is to extract as much
    information as possible from the date components.
    Args
        df: Input data to create the features.
    Returns
        df: A DataFrame with the new time-based features.
    """
    
    df['date'] = pd.to_datetime(df['date']) # Convert the date to datetime.
    
    # Start the feature creation process.
    df['year'] = df['date'].dt.year
    df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['dayofweek'] = df['date'].dt.dayofweek
    df['dayofmonth'] = df['date'].dt.days_in_month  # Number of days in the month (28 to 31)
    df['dayofyear'] = df['date'].dt.dayofyear
    # .dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
    df['weekofyear'] = df['date'].dt.isocalendar().week.astype('int64')
    df['weekday'] = df['date'].dt.weekday
    df['is_weekend'] = np.where((df['weekday'] == 5) | (df['weekday'] == 6), 1, 0)
    
    return df

Create the time-series features for both train and test data

In [None]:
train_df = create_time_features(train_df)
test_df = create_time_features(test_df)

train_df.sample(5)
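
To make the date features concrete, the sketch below computes a few of them for a single example date. The date itself is an arbitrary choice for illustration, and the dict is only a convenient way to print the values.

```python
import pandas as pd

# One example date: 26 December 2015, which was a Saturday
example = pd.Series(pd.to_datetime(['2015-12-26']))

features = {
    'year': int(example.dt.year[0]),
    'quarter': int(example.dt.quarter[0]),
    'month': int(example.dt.month[0]),
    'dayofweek': int(example.dt.dayofweek[0]),          # Monday=0 ... Sunday=6
    'dayofyear': int(example.dt.dayofyear[0]),
    'days_in_month': int(example.dt.days_in_month[0]),
    'weekofyear': int(example.dt.isocalendar().week[0]),  # ISO week number
}
# Saturday (5) and Sunday (6) count as the weekend
features['is_weekend'] = int(features['dayofweek'] >= 5)
print(features)
```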

## Handling Missing Values 
In this subsection, we check whether we have any missing data and, if so, take care of it.

In [None]:
train_df.isna().sum()

In [None]:
test_df.isna().sum()

Lucky for us, there seems to be no missing data for either the training or the test data.

## Handling Imbalanced/Skewed Data
In this subsection, we handle skewed data using a log transform. Skewed data can significantly affect the performance of a model. This matters most for models trained with gradient descent and for models that rely on distances between data points, such as KNN. A typical regression model trained with gradient descent performs better when the data resembles a Gaussian distribution. If the data is not transformed, the model learns a skewed distribution and does not perform well on data close to one of its extremes.

We will use this function to plot the distribution plots.

In [None]:
def dist_plots(df):
    plt.figure(figsize=(10,5))
    plt.title("Distribution Plot")
    # distplot is deprecated in seaborn; histplot with kde=True draws a similar plot
    sns.histplot(df, kde=True)
    sns.despine()
    plt.show()

In [None]:
print(train_df['num_sold'].skew())
dist_plots(train_df['num_sold'])

We notice that the distribution is right-skewed. Let's try to fix this using a log transformation.

Here we define a function that takes care of the log transform of a column.

In [None]:
def transform_target(df, target):
    """
    Apply a log transformation to the target for better optimization 
    during training.
    """
    df[target] = np.log(df[target])
    return df

train_df = transform_target(train_df, TARGET)

In [None]:
print(train_df['num_sold'].skew())
dist_plots(train_df['num_sold']);

The skewness has been reduced by a factor of 10, which is perfect. We must just remember to transform the output back (with the exponential) when we predict with the model.
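
As a small illustration of the transform and its inverse, the sketch below applies np.log to synthetic right-skewed values and then recovers them with np.exp. The lognormal sample is an assumption standing in for num_sold.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed positive values (an assumption, not num_sold itself)
rng = np.random.default_rng(0)
toy_target = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=1000))

log_target = np.log(toy_target)   # What the model would be trained on
restored = np.exp(log_target)     # Back to the original scale

print(f'Skew before: {toy_target.skew():.2f}')
print(f'Skew after:  {log_target.skew():.2f}')
print(f'Round trip ok: {np.allclose(restored, toy_target)}')
```

Predictions from a model trained on the log target would need the same np.exp step before they are compared with, or submitted as, real sales numbers.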

<div class="alert alert-block alert-warning">  
<b>Note:</b> We are not transforming the test data since it does not contain the target column.
</div>

## Handling Outliers
In this subsection, we handle outliers by removing them from our dataset. We have enough samples that removing them shouldn't be an issue. Outliers can significantly affect the performance of a model. This matters most for models trained with gradient descent and for models that rely on distances between data points, such as KNN. A typical regression model trained with gradient descent performs better on data that does not contain many outliers. With outliers present, the model may learn a distribution that is not representative of the real data, and learning becomes inefficient because the outliers have an outsized effect on the cost function.


Here we are defining a function that will plot box plots.

In [None]:
def box_plots(df):
    plt.figure(figsize=(10,5))
    plt.title("Box Plot")
    # Pass the values explicitly as x so the plot is horizontal across seaborn versions
    sns.boxplot(x=df)
    plt.show()

In [None]:
box_plots(train_df['num_sold'])

We notice some outliers outside of our whisker range. Let's remove them by calculating our upper_limit and lower_limit. These are calculated using the interquartile range. 

In [None]:
percentile25 = train_df['num_sold'].quantile(0.25)
percentile75 = train_df['num_sold'].quantile(0.75)

iqr = percentile75 - percentile25

upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
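
Here is the same interquartile-range rule worked through on eight made-up values, so the limits can be checked by hand. The numbers are hypothetical. The 1.5 multiplier is the one used above.

```python
import pandas as pd

# Made-up values with one obvious outlier (100)
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q25 = values.quantile(0.25)   # 12.75 with linear interpolation
q75 = values.quantile(0.75)   # 16.5
iqr = q75 - q25               # 3.75

upper_limit = q75 + 1.5 * iqr   # 22.125
lower_limit = q25 - 1.5 * iqr   # 7.125

# Keep only the values strictly inside the limits
kept = values[(values < upper_limit) & (values > lower_limit)]
print(lower_limit, upper_limit)
print(kept.tolist())
```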

In [None]:
train_df.shape

In [None]:
train_df = train_df[(train_df['num_sold'] < upper_limit) & (train_df['num_sold'] > lower_limit)]
train_df.shape

We have removed 4 data points.

In [None]:
box_plots(train_df['num_sold'])

After replotting, we have validated that there are no more outliers :)

## Handling Categorical Data
For the model to understand categorical data, we must transform it into a numerical form. There are various ways to do that. 

The two main kinds of categorical data are:
<div class="alert alert-block alert-info">
<b>Nominal Data</b> --> data are not in any order --> OneHotEncoder is used in this case
</div>
<div class="alert alert-block alert-info">
<b>Ordinal data </b> --> data are in order --> an integer encoding such as OrdinalEncoder is used in this case (LabelEncoder is intended for target labels)
</div>

Since all our columns are of the nominal type, we are going to use one-hot encoding. To do so, we define a function that uses the pandas get_dummies() function to one-hot encode our categorical data.
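
The difference between the two encodings can be sketched on a tiny made-up frame. The `size` column and its order are hypothetical (the competition data has no ordinal column), and pandas Categorical codes are used here as one possible integer encoding.

```python
import pandas as pd

toy = pd.DataFrame({
    'size': ['small', 'large', 'medium'],                  # ordinal: has a natural order
    'store': ['KaggleMart', 'KaggleRama', 'KaggleMart'],   # nominal: no order
})

# Ordinal column: integer codes that follow the order we state explicitly
size_order = ['small', 'medium', 'large']
toy['size_code'] = pd.Categorical(
    toy['size'], categories=size_order, ordered=True
).codes

# Nominal column: one indicator column per category
encoded = pd.get_dummies(toy, columns=['store'])
print(encoded)
```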

In [None]:
# Convert the categorical variables to one-hot encoded features.
# It will help in the training process.
def create_one_hot(df, categ_columns):
    """
    Create one-hot encoded fields for the specified categorical columns.
    Args
        df: Input DataFrame.
        categ_columns: List of categorical column names to encode.
    Returns
        df: DataFrame with the categorical columns replaced by indicator columns.
    """
    df = pd.get_dummies(df, columns=categ_columns)
    return df

Let's get all numerical and categorical columns

In [None]:
object_cols, num_cols, feature_cols = get_all_cols(train_df, target=TARGET, exclude=['row_id', 'date', 'num_sold'])

Let's one-hot encode by using the function we have defined.

In [None]:
train_df = create_one_hot(train_df, object_cols)
test_df = create_one_hot(test_df, object_cols)
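
Encoding train and test separately can produce different columns when a category appears in only one of them. The sketch below shows one possible way to align them with reindex, on made-up frames. Whether this is needed for the competition data depends on the categories actually present, which is not checked here.

```python
import pandas as pd

# Toy frames: the test frame is missing one holiday that appears in training
toy_train = pd.DataFrame({'holiday_name': ['Not Holiday', 'Christmas Day', 'Easter']})
toy_test = pd.DataFrame({'holiday_name': ['Not Holiday', 'Christmas Day']})

train_encoded = pd.get_dummies(toy_train, columns=['holiday_name'])
test_encoded = pd.get_dummies(toy_test, columns=['holiday_name'])

# Give the test frame exactly the training columns; absent categories become 0
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_encoded.columns))
print(list(test_aligned.columns))
```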

## Feature Selection

The goal is to find the features that contribute most and have a strong relationship with the target variable.
The following are some feature selection methods:


<div class="alert alert-block alert-info">
<b>1. heatmap</b> 
</div>
<div class="alert alert-block alert-info">
<b>2. feature_importances_</b> 
</div>
<div class="alert alert-block alert-info">
<b>3. SelectKBest</b> 
</div>

### Correlation 
To see the correlation between the various features and also with the target value, we will use a heatmap.

In [None]:
object_cols, num_cols, feature_cols = get_all_cols(train_df, target=TARGET, exclude=['row_id', 'date'])

In [None]:
feature_cols_plus_target = [TARGET] + feature_cols

<div class="alert alert-block alert-warning">  
<b>Note:</b> We keep only the first 12 columns (the target plus 11 numerical features) since the columns after that are the one-hot encoded features. I decided not to plot those since the heatmap would
    be too dense.
</div>

In [None]:
heatmap_df = train_df[feature_cols_plus_target].iloc[:,:12]

In [None]:
# Find the correlation between the independent and dependent attributes

plt.figure(figsize = (18,18))
sns.heatmap(heatmap_df.corr(), annot = True, cmap = "RdYlGn")

plt.show()

**We notice that some features are heavily correlated. We will remove two to reduce the dimensionality of our model:**
1.  We can remove quarter and keep month since they are heavily correlated.
2.  We can remove weekofyear and keep month since they are heavily correlated.

**We also notice which features have the strongest relationship with our target variable:**
1. The weekend flag has the highest positive correlation with the target variable, which makes sense. People tend to buy more during the weekend (more free time).
2. The year has the lowest correlation with the target variable, which also makes sense. The year doesn't really sway someone to buy more or less.

In [None]:
object_cols, num_cols, feature_cols = get_all_cols(train_df, target=TARGET, exclude=['row_id', 'date', 'quarter', 'weekofyear'])
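
Instead of reading correlated pairs off the heatmap by eye, they can also be listed programmatically. The sketch below does this on toy features. The 0.9 threshold and the `noise` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy features: quarter is derived from month, noise is unrelated
rng = np.random.default_rng(0)
month = np.tile(np.arange(1, 13), 10)
toy = pd.DataFrame({
    'month': month,
    'quarter': (month - 1) // 3 + 1,
    'noise': rng.normal(size=month.size),
})

corr = toy.corr().abs()
threshold = 0.9  # Assumed cut-off; choose what suits your data

# Walk the upper triangle so each pair is reported once
cols = list(corr.columns)
high_pairs = [(a, b) for i, a in enumerate(cols) for b in cols[i + 1:]
              if corr.loc[a, b] > threshold]
print(high_pairs)
```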

### Feature importance (ExtraTreesRegressor)
You can also use the ExtraTreesRegressor from sklearn, which allows you to easily see which features are important for the target (num_sold).


In [None]:
X = train_df[feature_cols]
y = train_df[TARGET]

In [None]:
# Important feature using ExtraTreesRegressor
from sklearn.ensemble import ExtraTreesRegressor
selection = ExtraTreesRegressor()
selection.fit(X, y)

You can print it, but it isn't the prettiest.

In [None]:
print(selection.feature_importances_)

In [None]:
plt.figure(figsize = (12,8))
feat_importances = pd.Series(selection.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()

From the feature importance plot, it is clear that the most important feature here is the product: Kaggle Sticker (note that we didn't plot the categorical features with the heatmap).

We also notice that the second most important feature is the store: KaggleRama... Might be a wise decision to buy that store :)

## Splitting the data
In this section, we will split the data into a training set and a test set. Do not confuse this test set with our test data (test_df): the test set is just a held-out subsample of train_df.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
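
train_test_split shuffles the rows by default, so with dated sales data the held-out rows are scattered between training days. One common alternative, shown here only as a sketch on made-up data, is to hold out the most recent period. The cut-off date and the `toy_*` names are assumptions and not part of the notebook's pipeline.

```python
import pandas as pd

# Toy daily data standing in for train_df (values are made up)
toy = pd.DataFrame({'date': pd.date_range('2015-01-01', '2018-12-31', freq='D')})
toy['num_sold'] = range(len(toy))

# Assumed cut-off: keep the final year as the hold-out set
cutoff = pd.Timestamp('2018-01-01')
toy_train = toy[toy['date'] < cutoff]
toy_valid = toy[toy['date'] >= cutoff]

print(len(toy_train), len(toy_valid))
print(toy_train['date'].max() < toy_valid['date'].min())
```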

# Models
In this section, we will explore two models:

1. RandomForestRegressor
2. XGBRegressor

## Training
We've prepared the food (data), time to... FEED THE MACHINE.

In [None]:
from xgboost import XGBRegressor 
from sklearn.ensemble import RandomForestRegressor

reg_rf = RandomForestRegressor()
xgboost_model = XGBRegressor()

hist_reg_rf = reg_rf.fit(X_train, y_train)
hist_xgboost_model= xgboost_model.fit(X_train, y_train)

The following function is taken from the sklearn documentation. It is a concise way of plotting the learning curves.

In [None]:
from sklearn.model_selection import learning_curve

def plot_learning_curve(
    estimator,
    title,
    X,
    y,
    axes=None,
    ylim=None,
    cv=None,
    n_jobs=None,
    train_sizes=np.linspace(0.1, 1.0, 5),
):
    """
    Generate 3 plots: the test and training learning curve, the training
    samples vs fit times curve, the fit times vs score curve.

    Parameters
    ----------
    estimator : estimator instance
        An estimator instance implementing `fit` and `predict` methods which
        will be cloned for each validation.

    title : str
        Title for the chart.

    X : array-like of shape (n_samples, n_features)
        Training vector, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    y : array-like of shape (n_samples) or (n_samples, n_features)
        Target relative to ``X`` for classification or regression;
        None for unsupervised learning.

    axes : array-like of shape (3,), default=None
        Axes to use for plotting the curves.

    ylim : tuple of shape (2,), default=None
        Defines minimum and maximum y-values plotted, e.g. (ymin, ymax).

    cv : int, cross-validation generator or an iterable, default=None
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, default=None
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like of shape (n_ticks,)
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the ``dtype`` is float, it is regarded
        as a fraction of the maximum size of the training set (that is
        determined by the selected validation method), i.e. it has to be within
        (0, 1]. Otherwise it is interpreted as absolute sizes of the training
        sets. Note that for classification the number of samples usually have
        to be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator,
        X,
        y,
        cv=cv,
        n_jobs=n_jobs,
        train_sizes=train_sizes,
        return_times=True,
    )
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(
        train_sizes,
        train_scores_mean - train_scores_std,
        train_scores_mean + train_scores_std,
        alpha=0.1,
        color="r",
    )
    axes[0].fill_between(
        train_sizes,
        test_scores_mean - test_scores_std,
        test_scores_mean + test_scores_std,
        alpha=0.1,
        color="g",
    )
    axes[0].plot(
        train_sizes, train_scores_mean, "o-", color="r", label="Training score"
    )
    axes[0].plot(
        train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score"
    )
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, "o-")
    axes[1].fill_between(
        train_sizes,
        fit_times_mean - fit_times_std,
        fit_times_mean + fit_times_std,
        alpha=0.1,
    )
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    fit_time_argsort = fit_times_mean.argsort()
    fit_time_sorted = fit_times_mean[fit_time_argsort]
    test_scores_mean_sorted = test_scores_mean[fit_time_argsort]
    test_scores_std_sorted = test_scores_std[fit_time_argsort]
    axes[2].grid()
    axes[2].plot(fit_time_sorted, test_scores_mean_sorted, "o-")
    axes[2].fill_between(
        fit_time_sorted,
        test_scores_mean_sorted - test_scores_std_sorted,
        test_scores_mean_sorted + test_scores_std_sorted,
        alpha=0.1,
    )
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt

In [None]:
plot_learning_curve(
    xgboost_model,
    "XGBoostRegressor Learning Curve",
    X_train,
    y_train,
    axes=None,
    ylim=None,
    cv=5,
    n_jobs=None,
    train_sizes=[5000,10000,15000])

<div class="alert alert-block alert-info">
<b>Plot 1</b> We notice that both our training and cross-validation scores are high and close to each other. Therefore, we do not appear to be overfitting to our training data.
</div>
<div class="alert alert-block alert-info">
<b>Plot 2</b> As we increase the number of training examples, the fit time increases roughly linearly (beyond 10000 samples, the slope decreases).
</div>
<div class="alert alert-block alert-info">
<b>Plot 3</b> As the fit time increases with the number of training examples, the score increases roughly linearly.
</div>

Conclusion... MORE DATA PLEASE

## Predicting
In this subsection, we will use the basic trained model to predict on our test set (not test data).

In [None]:
y_pred_reg_rf = reg_rf.predict(X_test)
y_pred_xgboost = xgboost_model.predict(X_test)

In [None]:
print(reg_rf.score(X_train, y_train))
print(xgboost_model.score(X_train, y_train))

In [None]:
# Score against the true targets; scoring against the predictions always gives 1.0
print(reg_rf.score(X_test, y_test))
print(xgboost_model.score(X_test, y_test))
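
For scikit-learn style regressors, score() returns the R2 coefficient of determination. The sketch below computes it by hand on four made-up values and shows why it must be computed against the true targets: comparing predictions with themselves always gives 1.0. The `r2` helper is a hypothetical name written for this illustration.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

def r2(y, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - pred) ** 2)          # Squared error of the predictions
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # Squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

print(r2(y_true, y_hat))   # Predictions against the true values
print(r2(y_hat, y_hat))    # Predictions against themselves
```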

## Evaluating
In this subsection, we evaluate using plots and metrics to see if our predictions are good or not.

### Distribution plots

In [None]:
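# distplot is deprecated in seaborn; histplot with kde=True draws a similar plot
sns.histplot(y_test - y_pred_reg_rf, kde=True)
plt.show()

In [None]:
sns.histplot(y_test - y_pred_xgboost, kde=True)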
plt.show()

At first glance, the distribution plot for the RandomForest seems tighter and a bit less spread out. However, look at the x-axis: the XGBoost residuals span only about -0.1 to 0.1, whereas the RandomForest residuals span about -0.2 to 0.2. The density at 0 is also much higher for XGBoost than for the RandomForest. This means more of the XGBoost predictions are close to the expected output.

### Scatter plot of Target and Predicted

In [None]:
plt.scatter(y_test, y_pred_reg_rf, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()

In [None]:
plt.scatter(y_test, y_pred_xgboost, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()

We notice that the scatter plot for the XGBoost model is much tighter. The tighter and more linear this graph is, the more similar the predicted and expected values are. In a perfect model, all points would lie on a line with a slope of 1.

### Metrics to decide on which model to use
I have chosen a handful of metrics; however, the main one used for regression is usually the R2 score.

In [None]:
from sklearn import metrics

In [None]:
print('RandomForest')
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_reg_rf))
print('MSE:', metrics.mean_squared_error(y_test, y_pred_reg_rf))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_reg_rf)))

print('XGBoost')
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_xgboost))
print('MSE:', metrics.mean_squared_error(y_test, y_pred_xgboost))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_xgboost)))
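
For reference, the three metrics can be written out with plain NumPy. The sketch below uses four made-up values. Since the RMSE is the square root of the MSE, the two always rank models in the same order.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.8, 3.05, 4.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # Mean absolute error
mse = np.mean(errors ** 2)      # Mean squared error
rmse = np.sqrt(mse)             # Root mean squared error

print(f'MAE: {mae:.4f}  MSE: {mse:.6f}  RMSE: {rmse:.4f}')
```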

In [None]:
metrics.r2_score(y_test, y_pred_reg_rf)

In [None]:
metrics.r2_score(y_test, y_pred_xgboost)

XGBoost outperformed the RandomForest on all three metrics. I decided to proceed with the XGBoost model for the hyperparameter tuning.

# Hyperparameter Tuning


* Choose one of the following methods for hyperparameter tuning
    1. **RandomizedSearchCV** --> Fast
    2. **GridSearchCV**
* Assign the hyperparameters in the form of a dictionary
* Fit the model
* Check the best parameters and the best score

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Create the random grid for the XGBoost model

params = {
 "learning_rate" : [0.05,0.10,0.15,0.20,0.25,0.30],
 "max_depth" : [ 3, 4, 5, 6, 8, 10, 12, 15],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma": [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
 "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
}
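
To get a feel for why a randomized search is the faster option, the sketch below counts how many combinations an exhaustive grid search over this grid would have to try. The grid is repeated under the name `grid` so the snippet runs on its own, and the 50 iterations match the n_iter value used in the search.

```python
from math import prod

# Same values as the params dict above
grid = {
    "learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

# An exhaustive grid search tries every combination of values
n_combinations = prod(len(values) for values in grid.values())
n_iter = 50  # Number of combinations sampled by the randomized search

print(f'{n_combinations} combinations, {n_iter} sampled '
      f'({n_iter / n_combinations:.1%} of the grid)')
```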

## Search for best hyperparameters

In [None]:
# Random search of parameters, using 5-fold cross-validation,
# sampling 50 different combinations (n_iter)
xgb_model_tuned = RandomizedSearchCV(
    estimator=xgboost_model, param_distributions=params,
    scoring='neg_mean_squared_error', n_iter=50, cv=5,
    verbose=2, random_state=42, n_jobs=1)

In [None]:
xgb_model_tuned.fit(X_train,y_train)

We can check the best parameters by accessing the following attribute:

In [None]:
xgb_model_tuned.best_params_

# {'min_child_weight': 5,
#  'max_depth': 8,
#  'learning_rate': 0.25,
#  'gamma': 0.0,
#  'colsample_bytree': 0.5}

## Predicting with tuned model
Let us use our tuned model to predict the target and see if it does better than our untuned model.

In [None]:
prediction = xgb_model_tuned.predict(X_test)

## Evaluating tuned model

In [None]:
plt.figure(figsize = (8,8))
sns.histplot(y_test - prediction, kde=True)  # distplot is deprecated in seaborn
plt.show()

Already the dist plot looks MUCH better

In [None]:
plt.figure(figsize = (8,8))
plt.scatter(y_test, prediction, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()

Even the scatter plot looks much tighter.

In [None]:
# MAE: 0.04518231450992608
# MSE: 0.003278265867648754
# RMSE: 0.05725614261936228

print('MAE:', metrics.mean_absolute_error(y_test, prediction))
print('MSE:', metrics.mean_squared_error(y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))

In [None]:
metrics.r2_score(y_test, prediction)

Finally, the MAE and the MSE are lower than for the untuned model. Since the RMSE is the square root of the MSE, it is lower as well.

By analyzing these graphs and metrics, we can say that the tuning improved the model.

# Save the model to reuse it again

There are various ways to save the model. We decided to go forward with pickling. It is very easy and straightforward. 

In [None]:
import pickle

# Open the file where you want to store the model; 'wb' is write-binary mode.
# The with block closes the file for us once the model has been written.
with open('xgboost_tuned.pkl', 'wb') as file:
    # Dump the tuned model to that file
    pickle.dump(xgb_model_tuned, file)

In [None]:
# 'rb' is read-binary mode; the with block closes the file after loading
with open('xgboost_tuned.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

In [None]:
y_prediction = loaded_model.predict(X_test)

In [None]:
metrics.r2_score(y_test, y_prediction)

As you can see, it is extremely easy to save and load a trained model and to use it for future predictions. No need to train every time!
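
The same save-and-load round trip can be tried without a trained model. In the sketch below, a plain dict stands in for the model (an assumption for illustration) and a temporary folder is used so no files are left behind.

```python
import os
import pickle
import tempfile

# Any picklable object works; a dict stands in for the trained model here
toy_model = {'best_params': {'max_depth': 8, 'learning_rate': 0.25}}

# Write to a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    with open(path, 'wb') as f:    # 'wb' = write, binary mode
        pickle.dump(toy_model, f)
    with open(path, 'rb') as f:    # 'rb' = read, binary mode
        restored = pickle.load(f)

print(restored == toy_model)
```

Only load pickle files that you created yourself or that come from a source you trust, since unpickling can execute arbitrary code.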

# Final Remarks
Thank you for going through this notebook. Please feel free to show support and comment on the notebooks with advice or improvements. If you found it useful, please let me know as well :)