<font color="red" size=5><center>Ensemble Learning - Part 1</center></font>

In [None]:
from IPython.display import Image
import os
Image("../input/ensemble-learning-pic/EL.png")

This notebook on Ensemble Learning will be divided into 2 parts.


This is **Part-1**
### For Part-2 click [here](https://www.kaggle.com/nitindatta/ensemble-learning-part-2)


## To begin with, below is a short story on why ensemble learning is so widely used in competitions.
 
You need a plumber, and you find one that has a 4.5-star rating (out of 5) and charges 100 dollars to do the job. Now I offer you two plumbers, each with a 4-star rating, who charge 75 dollars apiece. My selling point to you is that they would visit one at a time, and the second plumber will fix whatever the first didn't do right. You laugh at me and take the rock-star plumber. Why would you spend 2x the time and 1.5x the money when the first plumber would probably do the job just fine?

Let's start with the same premise, but now I offer you 4 plumbers. Each has a 3-star rating, so they charge 23 dollars apiece. They would come to your house together, work as a team, and fix your problem faster and for less money. You think about it for a second, because it would be nice to save 8 dollars. In the end, you decide to go with your rock-star plumber because: a) he must be good if he is charging 100 dollars; b) the other 4 plumbers can't be as good or else they would be charging more. Even though you are probably right on both counts, that still doesn't guarantee you made the best choice.

In most societies there is an unwritten rule that a single expert is always better than 3 so-so experts combined. But let's see if that holds for predictions we have to make.

Below is a simple example of predicting 10 digits that are evenly split between 1 and 0.

```
1111100000    Ground truth 
1110100000    Strong learner (90%) Best at predicting 0s
```

It seems like we have a very good model (a good expert, if you will). This model is perfect in predicting 0s, and pretty good at predicting 1s.

Now we take three weak models, none of which are better than 70% in predicting digits.

```
1111100000     Ground truth
1011110100     Weak learner (70%) Good at predicting 1s
1101000010     Weak learner (70%) Good at predicting 0s
0110101001     Weak learner (60%) Not good at predicting anything
1111100000     Vote average of weak learners (100%)
```

We take the average vote of their predictions since none of them are very good. Amazingly, we get a prediction at 100% accuracy. Is this a setup devised by yours truly in number selection, or does it actually hold in real life?
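
The vote in the table above can be checked in a few lines. This is a minimal sketch in plain Python; the helper names `accuracy` and `majority_vote` are illustrative and not part of the notebook's pipeline.

```python
# Digit strings copied from the example above.
truth = "1111100000"
weak_learners = ["1011110100", "1101000010", "0110101001"]


def accuracy(pred, truth):
    """Fraction of positions where the prediction matches the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def majority_vote(preds):
    """Per-position majority vote over an odd number of 0/1 strings."""
    return "".join(
        "1" if 2 * sum(int(p[i]) for p in preds) > len(preds) else "0"
        for i in range(len(preds[0]))
    )


vote = majority_vote(weak_learners)
for pred in weak_learners:
    print(pred, accuracy(pred, truth))
print(vote, accuracy(vote, truth))
```

The individual learners score 0.7, 0.7 and 0.6, while their majority vote reproduces the ground truth exactly.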

It is fairly intuitive that blending two good models will again yield a good model, and it also makes sense that the result could be better than either individual model. It is not so obvious that blending a good and a bad model could yield a better result. It is even less obvious that blending 3 bad models could yield a really good model, but that is the case.

This phenomenon is often referred to as the strength of weak learners. **This doesn't mean that combining any 3 weak learners will result in a great model**. A complementary expertise is needed. If you get 3 individuals with mediocre expertise that overlaps 95% between them, that would mean that each brings in only 5% unique knowledge compared to their union. On the other hand, 3 WEAK AND DIVERSE experts that overlap 70% in their knowledge and bring 30% of unique expertise each are likely to blend into a good model. That is exactly the case with the 3 weak learners I used in the example above: one of them is equally good/bad at predicting everything, while the other two are good at predicting 1s and 0s, respectively.
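
To illustrate why diversity matters, the sketch below compares the three diverse learners from the example with a hypothetical extreme case of total overlap: three copies of the same 70% learner. The `overlapping` list is an assumption made up for this illustration.

```python
truth = "1111100000"
diverse = ["1011110100", "1101000010", "0110101001"]
# Hypothetical worst case: three "experts" who make identical mistakes.
overlapping = ["1011110100"] * 3


def accuracy(pred, truth):
    """Fraction of positions where the prediction matches the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def majority_vote(preds):
    """Per-position majority vote over an odd number of 0/1 strings."""
    return "".join(
        "1" if 2 * sum(int(p[i]) for p in preds) > len(preds) else "0"
        for i in range(len(preds[0]))
    )


diverse_acc = accuracy(majority_vote(diverse), truth)
overlap_acc = accuracy(majority_vote(overlapping), truth)
print("diverse:", diverse_acc, "overlapping:", overlap_acc)
```

Voting over identical learners simply returns the same 70% prediction, whereas the diverse trio corrects each other's mistakes.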

### The above story/information is taken from [here](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/51058#290767).

## Table of Contents

1. [Simple Ensemble Learning](#1)

   a. [Max Voting](#11)
   
   b. [Averaging](#12)
   
   c. [Weighted Averaging](#13) 
   
   
2. [Advanced Ensemble Learning Types](https://www.kaggle.com/nitindatta/ensemble-learning-part-2#3)

    a. [Stacking](https://www.kaggle.com/nitindatta/ensemble-learning-part-2#31)
    
    b. [Blending](https://www.kaggle.com/nitindatta/ensemble-learning-part-2#32)
    
    c. [Bagging](https://www.kaggle.com/nitindatta/ensemble-learning-part-2#33)
        
    d. [Boosting](https://www.kaggle.com/nitindatta/ensemble-learning-part-2#34)
      
      * [XGBoost](https://www.kaggle.com/nitindatta/ensemble-learning-part-2#341)
      
      * [AdaBoost](https://www.kaggle.com/nitindatta/ensemble-learning-part-2#342)
      
      * [Light GBM](https://www.kaggle.com/nitindatta/ensemble-learning-part-2#343)
      
      * [Catboost](https://www.kaggle.com/nitindatta/ensemble-learning-part-2#344)

In [None]:
import pandas as pd
import numpy as np
import time
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import datetime
import warnings
import eli5
from eli5.sklearn import PermutationImportance

%matplotlib inline
sns.set(style="darkgrid")
warnings.filterwarnings("ignore")


In [None]:
test = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/test.csv')
item_categories = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv')
items = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/items.csv')
shops = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/shops.csv')
# Dates in sales_train.csv are stored as dd.mm.yyyy, so parse them day-first.
sales = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv',parse_dates=['date'],dayfirst=True,dtype={'date': 'str'})

In [None]:
# Joining the items, shops and item_categories dataframes onto sales to build the training dataframe
df = sales.join(items, on='item_id',rsuffix='_')
df = df.join(shops, on='shop_id', rsuffix='_')
df = df.join(item_categories, on='item_category_id', rsuffix='_')

The joined dataframe uses a lot of memory, so we will downcast its numeric columns to smaller types.


Source of the method: [LINK](https://www.kaggle.com/kyakovlev/1st-place-solution-part-1-hands-on-data)

In [None]:
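def downcast_dtypes(df):
    # Shrink float64 columns to float32 and int32/int64 columns to int16.
    # int16 only holds values up to 32767, so every integer column must fit in that range.
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols = [c for c in df if df[c].dtype in ["int64", "int32"]]
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols] = df[int_cols].astype(np.int16)
    return df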

df = downcast_dtypes(df)
print(df.info())
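
To get a feel for how much downcasting saves, here is a small sketch on a toy dataframe. The frame and its values are made up for illustration; only the dtype conversion mirrors `downcast_dtypes`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the joined sales data (values are made up).
toy = pd.DataFrame({
    "item_id": np.arange(1000, dtype="int64"),
    "item_price": np.linspace(1.0, 500.0, 1000),  # float64 by default
})

# Bytes used by the column data, before and after downcasting.
before = toy.memory_usage(index=False).sum()
small = toy.astype({"item_id": np.int16, "item_price": np.float32})
after = small.memory_usage(index=False).sum()
print("bytes before:", before, "bytes after:", after)

# Largest value an int16 column can hold; bigger ids would not fit.
print("int16 max:", np.iinfo(np.int16).max)
```

Each int64 value shrinks from 8 bytes to 2 and each float64 from 8 bytes to 4, which is where the saving comes from. It is worth checking the maximum of every integer column against the int16 limit before casting.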

In [None]:
df.head().T

Some columns are redundant, such as the suffixed duplicates created by the joins; we will remove them later.

In [None]:
df.dtypes

In [None]:
print('Dataframe shape :',df.shape)

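Data Leakages

The test set covers only some of the shops and items, so we keep only the training rows whose `shop_id` and `item_id` both appear in it.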

The below code snippet is picked from [here](https://www.kaggle.com/dimitreoliveira/model-stacking-feature-engineering-and-eda).

In [None]:
test_shop_ids = test['shop_id'].unique()
test_item_ids = test['item_id'].unique()
# Only shops that exist in test set.
leak_df = df[df['shop_id'].isin(test_shop_ids)]
# Only items that exist in test set.
leak_df = leak_df[leak_df['item_id'].isin(test_item_ids)]
print('Data set size before leaking:', df.shape[0])
print('Data set size after leaking:', leak_df.shape[0])
df = leak_df

In [None]:
print(df.isnull().sum())
print('\nNo null records')

In [None]:
# We will drop all the strings (object type) and item_category_id as we will not use them.
df.drop(['item_name','shop_name','item_category_name','item_category_id'],axis=1,inplace=True)

In [None]:
print('Is column \'shop_id\' equal to \'shop_id_\' :',df['shop_id'].equals(df['shop_id_']),'\n')
print('Is column \'item_id\' equal to \'item_id_\' :',df['item_id'].equals(df['item_id_']),'\n')
print('\nAll are same so we will drop the duplicates')
df.drop(['shop_id_','item_id_'],axis=1,inplace=True)

In [None]:
df = df[df['item_price']>0]
# Keep only rows with a positive item_price (drops zero and negative prices)

In [None]:
df = df.sort_values('date').groupby(['date_block_num', 'shop_id','item_id'], as_index=False)
df = df.agg({'item_price':['sum', 'mean'], 'item_cnt_day':['sum', 'mean','count']})
# Rename features.
df.columns = ['date_block_num', 'shop_id', 'item_id', 'item_price', 'mitem_price', 'item_cnt', 'mitem_cnt', 'transactions']

In [None]:
df.count()

In [None]:
df['year'] = df['date_block_num'].apply(lambda x: ((x//12) + 2013))
df['month'] = df['date_block_num'].apply(lambda x: (x % 12))
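
The two formulas above amount to a single `divmod`. The sketch below assumes, as the `+ 2013` offset implies, that block 0 falls in the first month of 2013; the helper name `block_to_year_month` is hypothetical.

```python
def block_to_year_month(date_block_num):
    """Map a month index to (year, zero-based month), with block 0 in 2013."""
    years_since_start, month = divmod(date_block_num, 12)
    return 2013 + years_since_start, month


for block in (0, 11, 12, 33):
    print(block, block_to_year_month(block))
```

Note that `month` is zero-based: block 33 maps to year 2015 and month 9, the same values that are later assigned to the test features.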

In [None]:
plt.figure(figsize=(22,8))
plt.subplot(2, 1, 1)
sns.boxplot(x=df['item_cnt'])
plt.subplot(2, 1, 2)
sns.boxplot(x=df['item_price'])

Both `item_cnt` and `item_price` are highly skewed.
Let us keep only the rows with `item_cnt` between 0 and 1500 and `item_price` below 400000.

In [None]:
df = df.query('item_cnt >= 0 and item_cnt <= 1500 and item_price < 400000')

In [None]:
df['cnt_m'] = df.sort_values('date_block_num').groupby(['shop_id','item_id'])['item_cnt'].shift(-1)
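
The `shift(-1)` above builds the target: for each shop and item pair, `cnt_m` is the `item_cnt` of the next row in time. A small sketch on made-up data shows the effect.

```python
import pandas as pd

# Toy monthly counts for two (shop, item) pairs (values are made up).
toy = pd.DataFrame({
    "date_block_num": [0, 1, 2, 0, 1],
    "shop_id": [5, 5, 5, 7, 7],
    "item_id": [10, 10, 10, 10, 10],
    "item_cnt": [3, 4, 6, 1, 2],
})

# Within each pair, pull the following month's count up by one row.
toy["cnt_m"] = (
    toy.sort_values("date_block_num")
       .groupby(["shop_id", "item_id"])["item_cnt"]
       .shift(-1)
)
print(toy)
```

The last row of every pair gets `NaN` because there is no later month to look at. Also note that `shift(-1)` takes the next available row for the pair, which is the following calendar month only when no month is missing in between.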

In [None]:
df.head()

In [None]:
df.describe().T

In [None]:
corr = df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(9, 7))
    ax = sns.heatmap(corr,mask=mask,square=True,annot=True,fmt='0.2f',linewidths=.8,cmap="YlGnBu")

`item_cnt` is correlated with `transactions` and `year` is highly correlated with `date_block_num`

In [None]:
fig = sns.jointplot(x='item_price',y='item_id',data=df,
                   joint_kws={'alpha':0.2,'color':'orange'},
                   marginal_kws={'color':'red'})

Around `item_id`: 6000 there seems to be an outlier due to high `item_price`

In [None]:
plt.figure(figsize=(20,6)) 
sns.countplot(x=df['shop_id'])

`shop_id` 31 has the highest number of records (monthly shop-item rows).

In [None]:
plt.figure(figsize=(20,6)) 
sns.barplot(x=df['shop_id'],y=df['item_cnt'],palette='viridis')

`shop_id` 9 has the highest average `item_cnt` (the bar plot shows the mean of `item_cnt` per shop, not the number of unique items).

In [None]:
item_cat_price = df.groupby('item_id')['item_price'].sum()
plt.figure(figsize=(18,6))
item_cat_price.plot(color ='red')

Somewhere around `item_id`: 6000 we might have an outlier.

In [None]:
ts = time.time()
shop_ids = df['shop_id'].unique()
item_ids = df['item_id'].unique()
empty_df = []
for i in range(34):
    for shop in shop_ids:
        for item in item_ids:
            empty_df.append([i, shop, item])
    
empty_df = pd.DataFrame(empty_df, columns=['date_block_num','shop_id','item_id'])
print(time.time()-ts)

In [None]:
# Merge the train set with the complete set (missing records will be filled with 0).
df = pd.merge(empty_df, df, on=['date_block_num','shop_id','item_id'], how='left')
df.fillna(0, inplace=True)
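
The grid-and-merge step can be sketched on a tiny example. Here `itertools.product` builds the same kind of complete (month, shop, item) grid as the nested loops, and the left merge fills in zeros for combinations with no sales. The toy values are made up.

```python
from itertools import product

import pandas as pd

# Toy monthly data: only two of the four possible combinations have sales.
monthly = pd.DataFrame({
    "date_block_num": [0, 1],
    "shop_id": [5, 5],
    "item_id": [10, 11],
    "item_cnt": [3.0, 2.0],
})

keys = ["date_block_num", "shop_id", "item_id"]
# Every combination of month, shop and item.
grid = pd.DataFrame(
    list(product(range(2), monthly["shop_id"].unique(), monthly["item_id"].unique())),
    columns=keys,
)

# Left merge keeps every grid row; missing sales become 0.
full = grid.merge(monthly, on=keys, how="left").fillna(0)
print(full)
```

The grid has 2 months x 1 shop x 2 items = 4 rows, and the two combinations without sales end up with `item_cnt` equal to 0.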

Splitting the data into `train`, `validation` and `test` sets.
* Train set will be from `date_block_num` : 0-25
* Validation set will be from `date_block_num` : 26-32
* Test set will be from `date_block_num` : 33

In [None]:
train_set = df.query('date_block_num >= 0 and date_block_num < 26').copy()
validation_set = df.query('date_block_num >= 26 and date_block_num < 33').copy()
test_set = df.query('date_block_num == 33').copy()

print('Train set records:', train_set.shape[0])
print('Validation set records:', validation_set.shape[0])
print('Test set records:', test_set.shape[0])

print('Percent of train_set:',(train_set.shape[0]/df.shape[0])*100,'%')
print('Percent of validation_set:',(validation_set.shape[0]/df.shape[0])*100,'%')
print('Percent of test_set:',(test_set.shape[0]/df.shape[0])*100,'%')

In [None]:
train_set.dropna(subset=['cnt_m'], inplace=True)
validation_set.dropna(subset=['cnt_m'], inplace=True)

In [None]:
# Creating training and validation sets
x_train = train_set.drop(['cnt_m','date_block_num'],axis=1)
y_train = train_set['cnt_m'].astype(int)

x_val = validation_set.drop(['cnt_m','date_block_num'],axis=1)
y_val = validation_set['cnt_m'].astype(int)

In [None]:
latest_records = pd.concat([train_set, validation_set]).drop_duplicates(subset=['shop_id', 'item_id'], keep='last')
x_test = pd.merge(test, latest_records, on=['shop_id', 'item_id'], how='left', suffixes=['', '_'])
x_test['year'] = 2015
x_test['month'] = 9
x_test.drop('cnt_m', axis=1, inplace=True)
x_test = x_test[x_train.columns]

In [None]:
ts=time.time()
sets = [x_train, x_val, x_test]
for dataset in sets:
    for shop_id in dataset['shop_id'].unique():
        for column in dataset.columns:
            shop_median = dataset[(dataset['shop_id'] == shop_id)][column].median()
            dataset.loc[(dataset[column].isnull()) & (dataset['shop_id'] == shop_id), column] = shop_median
            
# Fill remaining missing values on test set with mean.
x_test.fillna(x_test.mean(), inplace=True)
print(time.time()-ts)
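
The per-shop median fill can also be expressed without explicit loops. The sketch below uses `groupby(...).transform('median')` on a made-up frame; it illustrates the idea for a single column and is not the cell used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy feature frame with gaps (values are made up).
toy = pd.DataFrame({
    "shop_id": [5, 5, 5, 7, 7],
    "item_price": [100.0, np.nan, 300.0, 40.0, np.nan],
})

# Median of each shop, broadcast back to the rows of that shop.
shop_median = toy.groupby("shop_id")["item_price"].transform("median")
toy["item_price"] = toy["item_price"].fillna(shop_median)
print(toy)
```

Each missing price is replaced by the median of its own shop (200 for shop 5, 40 for shop 7), which is what the nested loops compute column by column.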

In [None]:
x_test.head()

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from scipy import stats

In [None]:
# These will be our base models
m1 = LinearRegression()
m2 = DecisionTreeRegressor()
m3 = RandomForestRegressor(n_estimators=10)

<a id="1"></a> <br>
# 1-Simple Ensemble Learning 
<a id="11"></a> <br>
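### a. Max Voting

Strictly speaking, max voting picks the most frequent class label, so it applies to classification. Our target is numeric, so we use scikit-learn's `VotingRegressor`, which averages the predictions of the base models instead.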
 

In [None]:
ts = time.time()
from sklearn.ensemble import VotingRegressor
model = VotingRegressor([('lr', m1), ('dt', m2),('rf', m3)])
model.fit(x_train, y_train)
train_pred = model.predict(x_train)
val_pred = model.predict(x_val)
print('Total time taken :',time.time()-ts) 

In [None]:
print('Train rmse:', np.sqrt(mean_squared_error(y_train, train_pred)))
print('Validation rmse:', np.sqrt(mean_squared_error(y_val, val_pred)))
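
For reference, the metric printed above is the square root of the mean squared error. A minimal NumPy sketch, with a hypothetical helper name `rmse` and made-up numbers:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors are 0, 0 and 3, so the mean squared error is 3.
print(rmse([0, 2, 4], [0, 2, 7]))
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones.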

In [None]:
perm = PermutationImportance(model, random_state=1).fit(x_val, y_val)
eli5.show_weights(perm, feature_names = x_val.columns.tolist())

`item_cnt` is an important feature and plays a vital role in predicting the output.

<a id="12"></a> <br>
### b. Averaging

In [None]:
ts = time.time()
m1.fit(x_train, y_train)
m2.fit(x_train, y_train)
m3.fit(x_train,y_train)

avg_train_pred1 = m1.predict(x_train)
avg_train_pred2 = m2.predict(x_train)
avg_train_pred3 = m3.predict(x_train)

avg_pred1 = m1.predict(x_val)
avg_pred2 = m2.predict(x_val)
avg_pred3 = m3.predict(x_val)

train_pred_avg = (avg_train_pred1+avg_train_pred2+avg_train_pred3)/3
val_pred_avg = (avg_pred1+avg_pred2+avg_pred3)/3

print('Total time taken: ',time.time()-ts)

In [None]:
print('Train rmse:', np.sqrt(mean_squared_error(y_train, train_pred_avg)))
print('Validation rmse:', np.sqrt(mean_squared_error(y_val, val_pred_avg)))

<a id="13"></a> <br>
### c. Weighted Averaging


Here we first compute the RMSE of each base model and then give a higher weight to the models with a lower RMSE.

In [None]:
ts = time.time()
m1.fit(x_train, y_train)
m2.fit(x_train, y_train)
m3.fit(x_train,y_train)

wavg_train_pred1 = m1.predict(x_train)
wavg_train_pred2 = m2.predict(x_train)
wavg_train_pred3 = m3.predict(x_train)

print('M1_train:',np.sqrt(mean_squared_error(y_train, wavg_train_pred1)))
print('M2_train:',np.sqrt(mean_squared_error(y_train, wavg_train_pred2)))
print('M3_train:',np.sqrt(mean_squared_error(y_train, wavg_train_pred3)))

wavg_pred1 = m1.predict(x_val)
wavg_pred2 = m2.predict(x_val)
wavg_pred3 = m3.predict(x_val)

print('\nM1_validation:',np.sqrt(mean_squared_error(y_val, wavg_pred1)))
print('M2_validation:',np.sqrt(mean_squared_error(y_val, wavg_pred2)))
print('M3_validation:',np.sqrt(mean_squared_error(y_val, wavg_pred3)))

print('\nTotal time taken: ',time.time()-ts)

From the above values it is clear that `Decision Tree` overfits the data.

For our weighted average we will use the following weights: `Random Forest`: 0.5, `Linear Regression`: 0.3, `Decision Tree`: 0.2.

In [None]:
final_val_pred = 0.3 * wavg_pred1 + 0.2 * wavg_pred2 + 0.5 * wavg_pred3
print('Weighted Average:',np.sqrt(mean_squared_error(y_val, final_val_pred)))
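
The weights above were picked by hand. One possible way to derive them from the scores is to make each weight proportional to the inverse of the model's validation RMSE. This rule is an assumption for illustration, not the method used in this notebook, and the predictions below are made up.

```python
import numpy as np

# Made-up validation targets and predictions from three hypothetical models.
y_val = np.array([0.0, 1.0, 2.0, 3.0])
preds = {
    "lr": np.array([0.5, 1.5, 2.5, 3.5]),
    "dt": np.array([0.0, 3.0, 2.0, 1.0]),
    "rf": np.array([0.0, 1.0, 2.0, 4.0]),
}


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


scores = {name: rmse(y_val, p) for name, p in preds.items()}

# Assumed rule: weight proportional to 1 / RMSE, normalised to sum to 1.
inverse = {name: 1.0 / s for name, s in scores.items()}
total = sum(inverse.values())
weights = {name: v / total for name, v in inverse.items()}

# Weighted average of the three prediction vectors.
blend = sum(weights[name] * preds[name] for name in preds)
print(scores)
print(weights)
print("Blend rmse:", rmse(y_val, blend))
```

Keep in mind that weights tuned on the validation set and then scored on that same set give an optimistic estimate; a separate hold-out set would give a fairer comparison.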

#### From the results we can notice that `Weighted Average` performs slightly better when compared to Max Voting and Averaging
----------------------------------------------------------------------------------------------------------------------------------------------

We will use the same dataframes without further processing for 'Advanced Ensemble Learning', so I will save them to CSV and use them in **Part-2**.

In [None]:
train_set.to_csv('/kaggle/working/train_set.csv',index=False)
validation_set.to_csv('/kaggle/working/validation_set.csv',index=False)
test_set.to_csv('/kaggle/working/test_set.csv',index=False)
