# **23. Ensemble Methods in Python**

## Chapter 1 - Combining Multiple Models

Do you struggle to determine which of the models you built is the best for your problem? You should give up on that, and use them all instead! In this chapter, you'll learn how to combine multiple models into one using "Voting" and "Averaging". You'll use these to predict the ratings of apps on the Google Play Store, whether or not a Pokémon is legendary, and which characters are going to die in Game of Thrones!



### Exploring Google apps data

The first dataset you'll work with contains information about the ratings of around 10,800 apps on the Google Play store. It has been preloaded into a DataFrame called ratings. There are 13 features that describe a given app, such as 'Category' and 'Price'. The goal is to use this information to predict the rating of an app, which can range from 1 to 5.

Before doing that, it's a good idea to familiarize yourself with the dataset using pandas methods such as .head() and .describe(). Explore the data in the IPython Shell and select the incorrect statement from the options below.

In [1]:
import pandas as pd   
import numpy as np

df = pd.read_csv('googleplaystore.csv')
df.head(3)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


In [2]:
df.describe()

Unnamed: 0,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0
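The `describe()` output above reports a maximum of 19.0, although ratings are supposed to range from 1 to 5, and a count of 9,367, which is lower than the number of apps. A quick way to surface rows like these is sketched below. The tiny `toy` frame is a hypothetical stand-in with made-up values, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the ratings data; the values are made up,
# apart from one out-of-range rating mimicking the 19.0 maximum seen above.
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, 3.9, 19.0, None],
})

# Ratings are expected to lie in [1, 5]; flag rows that fall outside that range.
# Comparisons with NaN evaluate to False, so missing ratings are counted separately.
out_of_range = toy[(toy["Rating"] < 1) | (toy["Rating"] > 5)]
n_missing = toy["Rating"].isna().sum()

print(out_of_range)
print("missing ratings:", n_missing)
```

The same two checks can be run on the real DataFrame by swapping `toy` for it.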


In [4]:
df.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

### Predicting the rating of an app

Having explored the Google apps dataset in the previous exercise, let's now build a model that predicts the rating of an app given a subset of its features.

To do this, you'll use scikit-learn's DecisionTreeRegressor. As decision trees are the building blocks of many ensemble models, refreshing your memory of how they work will serve you well throughout this course.

We'll use the MAE (mean absolute error) as the evaluation metric. This metric is highly interpretable, as it represents the average absolute difference between actual and predicted ratings.
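To make the metric concrete, here is a minimal sketch that computes the MAE by hand and compares it with scikit-learn's `mean_absolute_error`. The four ratings and predictions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented actual and predicted ratings, for illustration only
y_true = np.array([4.0, 3.5, 5.0, 2.0])
y_hat = np.array([4.5, 3.0, 4.0, 2.0])

# MAE is the mean of the absolute differences: (0.5 + 0.5 + 1.0 + 0.0) / 4
mae_manual = np.mean(np.abs(y_true - y_hat))
mae_sklearn = mean_absolute_error(y_true, y_hat)

print(mae_manual, mae_sklearn)
```

An MAE of 0.5 here reads directly as "predictions are off by half a rating point on average".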

All required modules have been pre-imported for you. The features and target are available in the variables X and y, respectively.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv('googleplaystore.csv')

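# Define features (X) and target variable (y).
# NOTE: in the raw CSV several of these columns are strings (e.g. Size '19M',
# Installs '10,000+', Type 'Free') and some ratings are missing, so fit() fails
# on the raw frame. The course environment supplies X and y already prepared,
# presumably cleaned and numerically encoded.
X = df[['Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating']]
y = df['Rating']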

# Split into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the regressor
reg_dt = DecisionTreeRegressor(min_samples_leaf=3, min_samples_split=9, random_state=500)

# Fit to the training set
reg_dt.fit(X_train, y_train)

# Evaluate the performance of the model on the test set
y_pred = reg_dt.predict(X_test)
print('MAE: {:.3f}'.format(mean_absolute_error(y_test, y_pred)))


# <script.py> output:
#     MAE: 0.609

### Choosing the best model

In this exercise, you'll compare different classifiers and choose the one that performs the best.

The dataset here - already loaded and split into train and test sets - consists of Pokémon - their stats, types, and whether or not they're legendary. The objective of our classifiers is to predict this 'Legendary' variable.

Three individual classifiers have been fitted to the training set:

- clf_lr is a logistic regression.
- clf_dt is a decision tree.
- clf_knn is a 5-nearest neighbors classifier.

As the classes here are imbalanced - only 65 of the 800 Pokémon in the dataset are legendary - we'll use F1-Score to evaluate the performance. Scikit-learn's f1_score() has been imported for you.
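The sketch below illustrates why accuracy is misleading at this class ratio. It uses only the 65-of-800 split stated above. The "model" is a hypothetical baseline that never predicts legendary.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Labels with the class ratio stated in the text: 65 legendary (1), 735 not (0)
y_true = np.array([1] * 65 + [0] * 735)

# Hypothetical baseline that always predicts "not legendary"
y_naive = np.zeros(800, dtype=int)

acc = accuracy_score(y_true, y_naive)
# zero_division=0 avoids a warning when no positive class is ever predicted
f1 = f1_score(y_true, y_naive, zero_division=0)

print('Accuracy: {:.3f}, F1-Score: {:.3f}'.format(acc, f1))
```

The baseline is right about 92% of the time yet finds no legendary Pokémon at all, which the F1-Score of 0 makes visible.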

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

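# Instantiate the classifiers and fit them to the training set
# (X_train and y_train come from the preloaded train/test split)
clf_lr = LogisticRegression().fit(X_train, y_train)
clf_dt = DecisionTreeClassifier().fit(X_train, y_train)
clf_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)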

# Make the individual predictions
pred_lr = clf_lr.predict(X_test)
pred_dt = clf_dt.predict(X_test)
pred_knn = clf_knn.predict(X_test)

# Evaluate the performance of each model
score_lr = f1_score(y_test, pred_lr)
score_dt = f1_score(y_test, pred_dt)
score_knn = f1_score(y_test, pred_knn)

# Print the scores
print(score_lr)
print(score_dt)
print(score_knn)



# <script.py> output:
    # 0.5882352941176471
    # 0.5833333333333334
    # 0.47619047619047616


### Assembling your first ensemble

It's time to build your first ensemble model! The Pokémon dataset from the previous exercise has been loaded and split into train and test sets.

Your job is to leverage the voting ensemble technique using the sklearn API. It's up to you to instantiate the individual models and pass them as parameters to build your first voting classifier.
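Before using the API, it can help to see what majority (hard) voting does to raw predictions. This is a minimal NumPy sketch with hypothetical predictions from three models. It is not how `VotingClassifier` is implemented internally.

```python
import numpy as np

# Hypothetical class predictions from three models on five samples
preds = np.array([
    [1, 0, 1, 0, 0],  # model A
    [1, 1, 0, 0, 0],  # model B
    [0, 1, 1, 0, 1],  # model C
])

# A sample is labelled 1 when more than half of the models predict 1
votes_for_one = preds.sum(axis=0)
majority = (votes_for_one > preds.shape[0] / 2).astype(int)

print(majority)  # [1 1 1 0 0]
```

Using an odd number of models, as here, rules out ties in a binary problem.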

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Instantiate the individual models
clf_knn = KNeighborsClassifier(n_neighbors=5)
clf_lr = LogisticRegression(class_weight='balanced')
clf_dt = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9, random_state=500)

# Create and fit the voting classifier
clf_vote = VotingClassifier(
    estimators=[('knn', clf_knn), ('lr', clf_lr), ('dt', clf_dt)]
)
clf_vote.fit(X_train, y_train)




### Evaluating your ensemble

In the previous exercise, you built your first voting classifier. Let's now evaluate it and compare its performance to that of the individual models.

The individual models (clf_knn, clf_dt, and clf_lr) and the voting classifier (clf_vote) have already been loaded and trained.

Remember to use f1_score() to evaluate the performance. In addition, you'll create a classification report on the test set (X_test, y_test) using the classification_report() function.

Will your voting classifier beat the 58% F1-score of the decision tree?

In [None]:
from sklearn.metrics import f1_score, classification_report

# Calculate the predictions using the voting classifier
pred_vote = clf_vote.predict(X_test)

# Calculate the F1-Score of the voting classifier
score_vote = f1_score(y_test, pred_vote)
print('F1-Score: {:.3f}'.format(score_vote))

# Calculate the classification report
report = classification_report(y_test, pred_vote)
print(report)

# <script.py> output:
#     F1-Score: 0.583
#                   precision    recall  f1-score   support
    
#            False       0.98      0.95      0.97       150
#             True       0.50      0.70      0.58        10
    
#         accuracy                           0.94       160
#        macro avg       0.74      0.83      0.77       160
#     weighted avg       0.95      0.94      0.94       160
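The F1-Score in the report is the harmonic mean of precision and recall for the `True` class. The sketch below rebuilds it from a confusion matrix. The four counts are an assumption: they are the integers consistent with the rounded figures in the report (support 10 and 150, recall 0.70, precision 0.50), not values taken from the actual predictions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Assumed counts, inferred from the rounded report above
tp, fn, fp, tn = 7, 3, 7, 143

# Rebuild label vectors that produce exactly these counts
y_true = [1] * (tp + fn) + [0] * (fp + tn)
y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

print('precision={:.2f} recall={:.2f} F1={:.3f}'.format(precision, recall, f1))
```

With these counts the voting classifier finds 7 of the 10 legendary Pokémon, at the cost of 7 false alarms.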
    


### Journey to Westeros

If you're a Game of Thrones fan, you might already know all about the fictional world of Westeros and the characters that inhabit it. Regardless, it's important to explore a new dataset before doing any modeling. That's what you'll do now!

The dataset is loaded into the environment and available to you as got, the commonly used acronym for Game of Thrones.

The target variable here is 'actual'. It represents whether a character is alive (1) or not (0). First explore the target using the .describe() method. What can you conclude about it?

In [19]:
df = pd.read_csv('character-predictions.csv')
df.head(3)

Unnamed: 0,S.No,actual,pred,alive,plod,name,title,male,culture,dateOfBirth,...,isAliveHeir,isAliveSpouse,isMarried,isNoble,age,numDeadRelations,boolDeadRelations,isPopular,popularity,isAlive
0,1,0,0,0.054,0.946,Viserys II Targaryen,,1,,,...,0.0,,0,0,,11,1,1,0.605351,0
1,2,1,0,0.387,0.613,Walder Frey,Lord of the Crossing,1,Rivermen,208.0,...,,1.0,1,1,97.0,1,1,1,0.896321,1
2,3,1,0,0.493,0.507,Addison Hill,Ser,1,,,...,,,0,1,,0,0,0,0.267559,1


In [20]:
df.columns

Index(['S.No', 'actual', 'pred', 'alive', 'plod', 'name', 'title', 'male',
       'culture', 'dateOfBirth', 'DateoFdeath', 'mother', 'father', 'heir',
       'house', 'spouse', 'book1', 'book2', 'book3', 'book4', 'book5',
       'isAliveMother', 'isAliveFather', 'isAliveHeir', 'isAliveSpouse',
       'isMarried', 'isNoble', 'age', 'numDeadRelations', 'boolDeadRelations',
       'isPopular', 'popularity', 'isAlive'],
      dtype='object')

In [25]:
df.describe()

Unnamed: 0,S.No,actual,pred,alive,plod,male,dateOfBirth,DateoFdeath,book1,book2,...,isAliveHeir,isAliveSpouse,isMarried,isNoble,age,numDeadRelations,boolDeadRelations,isPopular,popularity,isAlive
count,1946.0,1946.0,1946.0,1946.0,1946.0,1946.0,433.0,444.0,1946.0,1946.0,...,23.0,276.0,1946.0,1946.0,433.0,1946.0,1946.0,1946.0,1946.0,1946.0
mean,973.5,0.745632,0.68705,0.63447,0.36553,0.619219,1577.364896,2950.193694,0.198356,0.374615,...,0.652174,0.778986,0.141829,0.460946,-1293.56351,0.305755,0.074512,0.059096,0.089584,0.745632
std,561.906131,0.435617,0.463813,0.312637,0.312637,0.485704,19565.41446,28192.245529,0.398864,0.484148,...,0.486985,0.415684,0.348965,0.498601,19564.340993,1.38391,0.262669,0.235864,0.160568,0.435617
min,1.0,0.0,0.0,0.0,0.0,0.0,-28.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-298001.0,0.0,0.0,0.0,0.0,0.0
25%,487.25,0.0,0.0,0.39125,0.101,0.0,240.0,282.0,0.0,0.0,...,0.0,1.0,0.0,0.0,18.0,0.0,0.0,0.0,0.013378,0.0
50%,973.5,1.0,1.0,0.7355,0.2645,1.0,268.0,299.0,0.0,0.0,...,1.0,1.0,0.0,0.0,27.0,0.0,0.0,0.0,0.033445,1.0
75%,1459.75,1.0,1.0,0.899,0.60875,1.0,285.0,299.0,0.0,1.0,...,1.0,1.0,0.0,1.0,50.0,0.0,0.0,0.0,0.086957,1.0
max,1946.0,1.0,1.0,1.0,1.0,1.0,298299.0,298299.0,1.0,1.0,...,1.0,1.0,1.0,1.0,100.0,15.0,1.0,1.0,1.0,1.0


In [24]:
df["actual"].describe()


count    1946.000000
mean        0.745632
std         0.435617
min         0.000000
25%         0.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: actual, dtype: float64
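For a 0/1 target, the mean is simply the share of ones, so a mean of about 0.75 indicates that roughly three quarters of the characters are alive. The sketch below reproduces this with a synthetic Series. The count of 1,451 alive characters is an assumption reconstructed from the reported count and mean, not read from the file.

```python
import pandas as pd

n_total = 1946
n_alive = 1451  # assumption: the integer consistent with mean 0.745632

# Synthetic binary target with the assumed class counts
actual = pd.Series([1] * n_alive + [0] * (n_total - n_alive), name="actual")

# The mean of a 0/1 column equals the proportion of ones
print(actual.mean())
print(actual.value_counts(normalize=True))
```

On the real data, `df["actual"].value_counts(normalize=True)` gives the same class shares directly.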

### Predicting GoT deaths

While the target variable does not have any missing values, other features do. As the focus of the course is not on data cleaning and preprocessing, we have already done the following preprocessing for you:

- Replaced NA values with 0.
- Replaced negative values of age with 0.
- Replaced NA values of age with the mean.

Let's now build an ensemble model using the averaging technique. The following individual models have been built:

- Logistic Regression (clf_lr).
- Decision Tree (clf_dt).
- Support Vector Machine (clf_svm).

As the target is binary, all these models might have good individual performance. Your objective is to combine them using averaging. Recall from the video that this is the same as a soft voting approach, so you should still use the VotingClassifier().
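The equivalence between averaging and soft voting can be checked directly. The following sketch uses synthetic data from `make_classification` and swaps the SVM for a `GaussianNB` model to keep it fast. Both choices are assumptions made for illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

soft = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(min_samples_leaf=3, random_state=500)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
).fit(X_demo, y_demo)

# Average the class probabilities of the fitted member models by hand,
# then pick the class with the highest average probability
avg_proba = np.mean([est.predict_proba(X_demo) for est in soft.estimators_], axis=0)
manual_pred = soft.classes_[np.argmax(avg_proba, axis=1)]

print(np.array_equal(manual_pred, soft.predict(X_demo)))
```

Because soft voting needs `predict_proba()` from every member, an `SVC` has to be created with `probability=True` to take part.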

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Build the individual models
clf_lr = LogisticRegression(class_weight='balanced')
clf_dt = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9, random_state=500)
clf_svm = SVC(probability=True, class_weight='balanced', random_state=500)

# List of (string, estimator) tuples
estimators = [('lr', clf_lr), ('dt', clf_dt), ('svm', clf_svm)]

# Build and fit an averaging classifier
clf_avg = VotingClassifier(estimators=estimators, voting='soft')
clf_avg.fit(X_train, y_train)

# Evaluate model performance
acc_avg = accuracy_score(y_test, clf_avg.predict(X_test))
print('Accuracy: {:.2f}'.format(acc_avg))


# <script.py> output:
#     Accuracy: 0.82

### Soft vs. hard voting

You've now practiced building two types of ensemble methods: Voting and Averaging (soft voting). Which one is better? It's best to try both of them and then compare their performance. Let's try this now using the Game of Thrones dataset.

Three individual classifiers have been instantiated for you:

- A DecisionTreeClassifier (clf_dt).
- A LogisticRegression (clf_lr).
- A KNeighborsClassifier (clf_knn).

Your task is to try both voting and averaging to determine which is better.
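The two schemes can disagree on the same sample, which is why comparing them is worthwhile. Here is a minimal sketch with hypothetical probabilities from three models for a single sample, where one confident model outweighs two hesitant ones under averaging.

```python
import numpy as np

# Hypothetical P(class = 1) from three models for a single sample
p_one = np.array([0.45, 0.45, 0.95])

# Hard voting: each model casts one vote based on its own 0.5 threshold
hard_votes = (p_one > 0.5).astype(int)
hard_label = int(hard_votes.sum() > len(p_one) / 2)

# Soft voting (averaging): threshold the mean probability instead
soft_label = int(p_one.mean() > 0.5)

print('hard:', hard_label, 'soft:', soft_label)  # hard: 0 soft: 1
```

Hard voting ignores how confident each model is, while averaging rewards confidence, so neither is better in general.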

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

# List of (string, estimator) tuples
estimators = [('DecisionTree', clf_dt), ('LogisticRegression', clf_lr), ('KNeighbors', clf_knn)]

# Build and fit a voting classifier
clf_vote = VotingClassifier(estimators=estimators, voting='hard')
clf_vote.fit(X_train, y_train)

# Build and fit an averaging classifier
clf_avg = VotingClassifier(estimators=estimators, voting='soft')
clf_avg.fit(X_train, y_train)

# Evaluate the performance of both models
acc_vote = accuracy_score(y_test, clf_vote.predict(X_test))
acc_avg = accuracy_score(y_test, clf_avg.predict(X_test))
print('Voting: {:.2f}, Averaging: {:.2f}'.format(acc_vote, acc_avg))


# <script.py> output:
#     Voting: 0.80, Averaging: 0.81

## Chapter 2 - Bagging


### Restricted and unrestricted decision trees

For this exercise, we will revisit the Pokémon dataset from the last chapter. Recall that the goal is to predict whether or not a given Pokémon is legendary.

Here, you will build two separate decision tree classifiers. In the first, you will specify the parameters min_samples_leaf and min_samples_split, but not a maximum depth, so that the tree can fully develop without any restrictions.

In the second, you will specify some constraints by limiting the depth of the decision tree. By then comparing the two models, you'll better understand the notion of a "weak" learner.
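The comparison described above could look like the sketch below. It runs on synthetic imbalanced data from `make_classification` rather than the Pokémon dataset. The `max_depth=4` limit, the leaf and split settings, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 8% positives)
X_syn, y_syn = make_classification(
    n_samples=800, n_features=8, n_informative=4,
    weights=[0.92, 0.08], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

# Unrestricted tree: leaf and split sizes are set, but depth is not limited
clf_full = DecisionTreeClassifier(
    min_samples_leaf=3, min_samples_split=9, random_state=500
).fit(X_tr, y_tr)

# Restricted tree: depth is capped (4 is an arbitrary illustrative value)
clf_weak = DecisionTreeClassifier(max_depth=4, random_state=500).fit(X_tr, y_tr)

for name, clf in [("unrestricted", clf_full), ("restricted", clf_weak)]:
    score = f1_score(y_te, clf.predict(X_te), zero_division=0)
    print('{}: depth={}, F1-Score={:.3f}'.format(name, clf.get_depth(), score))
```

The scores depend on the data, so the point of the sketch is the setup of the two models rather than any particular result.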