# Scikit-learn - Unit 04 - Tree-based models for Regression and Classification

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Implement and Evaluate Tree-Based Models for Regression and Classification



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

We will install scikit-learn, xgboost, feature-engine and yellowbrick to run our exercises.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

import warnings
warnings.filterwarnings('ignore')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 04 - Tree-based models for Regression and Classification

In this unit, we will cover the practical steps and code to fit a pipeline using tree-based models, such as Decision Tree and Random Forest.
* If you want to refresh the content, refer back to Module 2.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A typical workflow used for supervised learning is
* Split the dataset into train and test set
* Fit the pipeline
* Evaluate your model. If the performance is not good, revisit the process, starting from defining the business case, collecting the data, conducting EDA (Exploratory Data Analysis) etc
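
The three steps above can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn's bundled iris data (`load_iris`) rather than the datasets used later in this notebook; the step names and variable names are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# illustrative data: iris bundled with scikit-learn (an assumption for this sketch)
X, y = load_iris(return_X_y=True, as_frame=True)

# 1) split the dataset into train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=101)

# 2) fit the pipeline on the train set only
demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=101)),
])
demo_pipeline.fit(X_tr, y_tr)

# 3) evaluate on both sets (accuracy here, just as a quick check)
train_acc = demo_pipeline.score(X_tr, y_tr)
test_acc = demo_pipeline.score(X_te, y_te)
print(f"train accuracy: {train_acc:.3f} | test accuracy: {test_acc:.3f}")
```

Comparing the two numbers is the quickest first check: a large gap between train and test suggests the model memorised the train set.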


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png"> For teaching purposes, **we will use a fixed dataset for Regression and a fixed dataset for Classification across the different algorithms used in this notebook.**

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use the Boston dataset from sklearn for the **Regression task**. 
* It has house price records and house characteristics, like the average number of rooms per dwelling and the per capita crime rate in Boston.
* We'll use the same code from the previous unit


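# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run
from sklearn.datasets import load_boston
data = load_boston()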
df_reg = pd.DataFrame(data.data,columns=data.feature_names)
df_reg['price'] = pd.Series(data.target)

df_reg = df_reg.sample(frac=0.5, random_state=101)

print(df_reg.shape)
df_reg.head(3)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use the Iris dataset from seaborn for the **Classification task**. 
* It contains records of 3 classes of iris plants, with their petal and sepal measurements

df_clf = sns.load_dataset('iris').sample(frac=0.7, random_state=101)
print(df_clf.shape)
df_clf.head(3)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png"> We will cover the following tree algorithms, which include ensemble tree algorithms
* Decision Tree
* Random Forest
* Gradient Boosting
* Ada Boost
* XG Boost (eXtreme Gradient Boost)
* Extra Tree







<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For teaching purposes, we will use:
* **Classification** task for: Decision Tree, Gradient Boosting, XG Boost
* **Regression** task for: Random Forest, Ada Boost, Extra Tree

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> That speeds up our learning process. If you fit a Regressor using a Decision Tree, the code and workflow are the same as for Classification using a Decision Tree.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Decision Tree

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may refer back to Module 2 - ML Essentials - Algorithms lesson to refresh the algorithms we cover. We are not going deep into the mathematical functions; the idea is to present the concept and the algorithm application.

* In a nutshell, a decision tree is like a flow chart where each question has a yes/no answer. This brings you from a general question to a very specific question as you get deeper. The questions asked must be ones where the yes or no answer gives useful insights into the data
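
To make "useful questions" concrete: one common way to score a yes/no question is by how much class impurity remains after asking it. The sketch below uses Gini impurity (the default criterion of scikit-learn's `DecisionTreeClassifier`) on a tiny made-up sample; the data and the helper names `gini` and `weighted_gini_after_split` are hypothetical and only serve the illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2). 0 means the group contains a single class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini_after_split(rows, question):
    """Impurity left after asking a yes/no question, weighted by group size."""
    yes = [label for value, label in rows if question(value)]
    no = [label for value, label in rows if not question(value)]
    n = len(rows)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)

# made-up sample: (petal width in cm, species)
rows = [(0.2, "setosa"), (0.3, "setosa"), (0.4, "setosa"),
        (1.3, "versicolor"), (1.5, "versicolor"), (1.4, "versicolor")]

before = gini([label for _, label in rows])
useful = weighted_gini_after_split(rows, lambda width: width <= 0.8)
weak = weighted_gini_after_split(rows, lambda width: width <= 0.25)

print(before, useful, weak)
```

The question that separates the two species perfectly leaves no impurity, while the poorly placed threshold leaves most of it in place. A tree-building algorithm searches for the questions with the largest reduction.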


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Depending on your task (Regression or Classification), you will import a different estimator to use the Decision Tree algorithm in Scikit-learn.
* The estimator has the suffix "`Regressor`" when the algorithm is used for a regression task and, as you may expect, the suffix "`Classifier`" when it is used for a classification task.
* That pattern repeats for the other tree-based algorithms.
* The difference is subtle; however, it is worth pointing out.






Find the documentation for `DecisionTreeRegressor` [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) and for `DecisionTreeClassifier` [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).
* We will import both but will use the `DecisionTreeClassifier` for the exercise.

from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier

Let's inspect our data again.
* The target variable is 'species' and we don't have missing data.

df_clf.head()

We are getting more comfortable with ML, but it is worth remembering this exercise is an example of supervised learning, where the ML task is classification. The same principle applies when the ML task is Regression.
* For that workflow, it is wise to split the data into train and test set
* In the previous units, we explained the `train_test_split()` function. From now on, we will just state "We split the data into train and test sets"

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['species'],axis=1),
                                    df_clf['species'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

Considering the existing data, we will not need the data cleaning or the feature engineering steps.
* We then set feature scaling, feature selection and modelling using the DecisionTreeClassifier. We set random_state, so the results will be reproducible anywhere. We chain these steps in a sklearn Pipeline

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.tree import DecisionTreeClassifier


def pipeline_decision_tree_clf():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),

      ( "feat_selection",SelectFromModel(DecisionTreeClassifier(random_state=101)) ),
      
      ( "model", DecisionTreeClassifier(random_state=101)),

    ])

  return pipeline

pipeline_decision_tree_clf()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> It is time to fit the pipeline, so the model can learn the relationships between the features and the target. We create a variable pipeline (it could be any name) and call the function where we set our pipeline

pipeline = pipeline_decision_tree_clf()
pipeline.fit(X_train, y_train)

Like in the previous notebook, we are now interested in starting to evaluate the pipeline. Since it is a tree-based model, we can assess the importance of the features in the model using `.feature_importances_`
* We created a custom function to assess feature importance on tree-based models. It takes the model and the variables that "hit" the model; check the docstring to understand the logic.
* Don't worry if, at first, you don't understand.  Expect it to take some time to absorb.
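
As a small illustration of what the attribute holds: a fitted tree exposes one importance value per input feature, and the values add up to 1. This is a sketch, assuming scikit-learn's bundled iris data as a stand-in for the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)
demo_tree = DecisionTreeClassifier(random_state=101).fit(X, y)

# one value per feature, normalised so that they add up to 1
importances = demo_tree.feature_importances_

df_importance = (pd.DataFrame({"Features": X.columns, "Importance": importances})
                 .sort_values(by="Importance", ascending=False)
                 .reset_index(drop=True))
print(df_importance)
print("sum of importances:", importances.sum())
```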

def feature_importance_tree_based_models(model, columns):
  """
  # Gets the model, and the columns used to train the model
  - we use the model.feature_importances_ and columns to make a
  DataFrame that shows the importance of each feature
  - next, we print the features name and its relative importance order,
  followed by a barplot indicating the importance

  """

  # create DataFrame to display feature importance
  df_feature_importance = (pd.DataFrame(data={
      'Features': columns,
      'Importance': model.feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

  best_features = df_feature_importance['Features'].to_list()

  # Most important features statement and plot
  print(f"* These are the {len(best_features)} most important features in descending order. "
        f"The model was trained on them: \n{df_feature_importance['Features'].to_list()}")

  df_feature_importance.plot(kind='bar',x='Features',y='Importance')
  plt.show()


Let's check that.
* The `model` argument is the 'model' step from the pipeline (we don't pass the whole pipeline, since we need only the model step)
* In the `columns` argument, we use the feature selection step to grab a boolean array indicating which features hit the model: `pipeline['feat_selection'].get_support()`. This array is used to subset the train set columns.
* Note that only 2 features - `['petal_width', 'petal_length']` - out of 4, were used to train the model and they have roughly similar relevance
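
The boolean-mask subsetting described above can be seen in isolation. This is a sketch, assuming scikit-learn's bundled iris data; which features get selected depends on the data and the estimator, so no particular output is claimed here.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)

selector = SelectFromModel(DecisionTreeClassifier(random_state=101)).fit(X, y)

# boolean array, one entry per input column: True means the feature is kept
mask = selector.get_support()
selected_columns = X.columns[mask]

print(mask)
print(list(selected_columns))

# the transformed data has exactly as many columns as there are True entries
X_selected = selector.transform(X)
print(X_selected.shape)
```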

feature_importance_tree_based_models(model = pipeline['model'],
                                     columns =  X_train.columns[pipeline['feat_selection'].get_support()]
                                     )

It is time to evaluate the classifier. We are using the same custom function for evaluating the classifier considered in the last notebook. 

# loads confusion_matrix and classification_report from sklearn
from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):
  """
  # Gets features, target, pipeline and how labelled (named) the levels from your target

  - it predicts based on features
  - compare predictions and actual in a confusion matrix
    - the first argument stays as rows and the second stays as columns in the matrix
    - we will use the pattern where the predictions are in the row and actual values are in the columns
    - to refresh that, revert to the Performance Metric video in Module 2
  - show classification report

  """

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")



def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  """
  # gets the features and target from train and test set, pipeline how
  you labelled (named) the levels from your target
  - for each set (train and test), it calls the function above to show the confusion matrix
  and classification report for both train and test set
  """

  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)
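
The row/column convention mentioned in the docstring is easy to verify on a toy example. In scikit-learn's `confusion_matrix`, rows follow the first argument (`y_true`) and columns the second (`y_pred`), so swapping the two transposes the matrix. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

actual    = [0, 0, 0, 1, 1]
predicted = [0, 1, 1, 1, 1]

# standard call: rows = actual, columns = predicted
cm_actual_rows = confusion_matrix(y_true=actual, y_pred=predicted)

# swapped call (as in the function above): rows = predicted, columns = actual
cm_prediction_rows = confusion_matrix(y_true=predicted, y_pred=actual)

print(cm_actual_rows)
print(cm_prediction_rows)
```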

You will notice that in this dataset the target variable isn't a set of numbers that refer to classes; instead, the classes are strings
* We are passing the unique values of the target, from df_clf, as the `label_map` parameter

df_clf['species'].unique()
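
One detail to keep in mind: `unique()` returns values in order of first appearance, whereas scikit-learn's `classification_report` and `confusion_matrix` order classes by sorted label value. `target_names` is matched against that sorted order, so it is safer to sort the labels before using them as display names. A toy sketch with made-up labels:

```python
import pandas as pd
from sklearn.metrics import classification_report

y_demo = pd.Series(["virginica", "setosa", "versicolor", "setosa"])

appearance_order = list(y_demo.unique())   # order of first appearance
sorted_order = sorted(y_demo.unique())     # alphabetical, the order scikit-learn uses

print(appearance_order)
print(sorted_order)

# predictions equal to the actual values, so only the naming can go wrong
correct_names = classification_report(y_demo, y_demo,
                                      target_names=sorted_order, output_dict=True)
wrong_names = classification_report(y_demo, y_demo,
                                    target_names=appearance_order, output_dict=True)

# setosa appears twice; with unsorted names that count lands under another label
print(correct_names["setosa"]["support"], wrong_names["virginica"]["support"])
```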

Let's evaluate the classifier then
* Note the model aced all predictions in the train set, which is an indication that it learned all the relationships from the training data. That is good, but let's check on the test set
* As we may expect, on the test set the performance was a bit lower (we noticed that in the confusion matrix, where  Virginica and Versicolor have the wrong predictions). At the same time, it is still very good, and it is not much of a difference from the train set. It is a good indication that the model didn't overfit 

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline,
                # sorted, because scikit-learn orders classes by sorted label value
                label_map= sorted(df_clf['species'].unique())
                )

One additional aspect when using a Decision Tree is visualizing the created tree.
* Scikit-learn has a `plot_tree()` function that can help us; the documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html). We pass:
* decision_tree as the model step in our pipeline
* feature_names as the variables used to train the model. That is done by extracting the information from the feature selection step
* class_names are taken from the unique values of species
* The remaining arguments help us to get a cleaner visualization

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Just a side note, this decision tree is simple, however, when it comes to big trees, the visualization might become too big or more difficult to interpret.
* In this example the decision is made first on petal_width: if it is smaller than -0.47, it is Setosa; if not, it goes to another decision point. The other decision is on petal_length: if it is smaller than 0.58, it is Versicolor, otherwise it is Virginica.
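
For larger trees, a plain-text view can be easier to scan than the plot. Here is a sketch using scikit-learn's `export_text`, assuming the bundled iris data and an arbitrary `max_depth=2` to keep the output short:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True, as_frame=True)
small_tree = DecisionTreeClassifier(max_depth=2, random_state=101).fit(X, y)

# each indented line is one decision point or one leaf
rules = export_text(small_tree, feature_names=list(X.columns))
print(rules)
```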

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> **Note the beauty: the algorithm computed by itself the pattern and now can predict. That is the major difference between ML and traditional programming. Here, we have data and an objective (predict species), then the computer finds the best rule for that. In traditional programming, the developer has to set the rules**

* However, the decision points still look odd: what does -0.47 mean for petal_width? A width cannot be negative. Let's explore the next cell

from sklearn import tree

fig = plt.figure(figsize=(15,15))
tree.plot_tree(decision_tree = pipeline['model'], 
               feature_names = X_train.columns[pipeline['feat_selection'].get_support()],
               class_names = sorted(df_clf['species'].unique()),  # ascending order, as plot_tree expects
               filled=True,
               rounded=True,
               fontsize=9,
               impurity=False)
plt.show()

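The negative values in the previous plot come from the feature scaling step, which standardized the data using a standard scaler. We can grab this pipeline step and use `.inverse_transform()` to convert the scaled values back to the original units.
* We create a DataFrame with the same columns as the original data. For petal_width and petal_length we set the decision points from the previous plot. We pass the DataFrame to `.inverse_transform()`
* `.inverse_transform()` works by column position, so the columns must be in the same order as in the data used to fit the scaler; otherwise the petal values would be converted with the sepal statistics
* The decision points in the original units are the petal_width and petal_length values in the output

scaled_data = pd.DataFrame(data={'petal_width':-0.472,
                                 'petal_length':0.578,
                                 'sepal_length':1.0, # this value doesn't matter, but needs to be here
                                 'sepal_width':1.0}, # this value doesn't matter, but needs to be here
                           index=[0])

# reorder the columns to match the data the scaler was fitted on
scaled_data = scaled_data[X_train.columns]

pd.DataFrame(pipeline['feat_scaling'].inverse_transform(scaled_data),
             columns=X_train.columns)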
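
Under the hood, standard scaling computes z = (x - mean) / std for each column, so the inverse is x = z * std + mean, applied column by column in the order the scaler was fitted. Here is a sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up measurements: two columns, e.g. a length and a width in cm
X_demo = np.array([[1.0, 0.2],
                   [4.0, 1.3],
                   [6.0, 2.1]])

scaler = StandardScaler().fit(X_demo)
X_scaled = scaler.transform(X_demo)

# manual inverse: multiply by the column std and add the column mean
manual = X_scaled * scaler.scale_ + scaler.mean_
restored = scaler.inverse_transform(X_scaled)

print(restored)
```

Because the statistics are stored per column position, feeding the columns in a different order would silently apply the wrong mean and std.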

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Random Forest

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may refer back to Module 2 - ML Essentials - Algorithms lesson to refresh the algorithms we will cover. We are not going deep into the mathematical functions; the idea is to present the concept and the algorithm application.


* The random forest is an ensemble method made of many decision trees. It uses bagging and feature randomness when building each individual tree, aiming to create an uncorrelated collection of trees whose combined prediction is more accurate than that of any individual tree.
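
The "prediction from the set of trees" can be made concrete for regression: a `RandomForestRegressor` prediction is the average of its individual trees' predictions. This is a sketch on synthetic data (`make_regression`), with arbitrary sizes chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=101)

forest = RandomForestRegressor(n_estimators=20, random_state=101).fit(X_demo, y_demo)

# each fitted tree is available in estimators_
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])

forest_prediction = forest.predict(X_demo)
mean_of_trees = per_tree.mean(axis=0)

print(per_tree.shape)
print(np.allclose(forest_prediction, mean_of_trees))
```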



Once again, the same algorithm has a different estimator depending on the task: Regression or Classification. Find the documentation for `RandomForestRegressor` [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) and for `RandomForestClassifier` [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
* We will import both but will use `RandomForestRegressor` for the exercise.

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

 We will use the Boston dataset to fit an ML pipeline to predict the sales price using the Random Forest Algorithm

df_reg.head()

We split the train and  test sets. The target variable is 'price' 

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_reg.drop(['price'],axis=1),
                                    df_reg['price'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

We create the pipeline using a similar structure as the previous example. There are 3 steps: scaling, feature selection and modelling. 
* We know in advance the data won't need data cleaning

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.ensemble import RandomForestRegressor


def pipeline_random_forest_reg():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(RandomForestRegressor(random_state=101)) ),
      ( "model", RandomForestRegressor(random_state=101)),

  ])

  return pipeline

pipeline_random_forest_reg()

We will fit the pipeline to the train set (features and target) using `.fit()`


pipeline = pipeline_random_forest_reg()
pipeline.fit(X_train, y_train)

Since it is a tree-based model, we can assess the importance of the features in the model with `.feature_importances_`, using the custom function from the previous section
* Note that from 13 features, the model was trained on 2: LSTAT and RM, where LSTAT is more important to the model

feature_importance_tree_based_models(model = pipeline['model'],
                                     columns =  X_train.columns[pipeline['feat_selection'].get_support()])

We will evaluate the regressor pipeline using the same custom function from the last unit notebook 

# import regression metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 
# we will use numpy to calculate RMSE based on MSE (mean_squared_error)
import numpy as np


def regression_performance(X_train, y_train, X_test, y_test,pipeline):
  """
  # Gets train/test sets and pipeline and evaluate the performance
  - for each set (train and test) call regression_evaluation()
  which will evaluate the pipeline performance
  """

  print("Model Evaluation \n")
  print("* Train Set")
  regression_evaluation(X_train,y_train,pipeline)
  print("* Test Set")
  regression_evaluation(X_test,y_test,pipeline)



def regression_evaluation(X,y,pipeline):
  """
  # Gets features and target (either from train or test set) and pipeline
  - it predicts using the pipeline and the features
  - calculates performance metrics comparing the prediction to the target
  """
  prediction = pipeline.predict(X)
  # built-in round() works whether the metric returns a Python float or a NumPy float
  print('R2 Score:', round(r2_score(y, prediction), 3))
  print('Mean Absolute Error:', round(mean_absolute_error(y, prediction), 3))
  print('Mean Squared Error:', round(mean_squared_error(y, prediction), 3))
  print('Root Mean Squared Error:', round(np.sqrt(mean_squared_error(y, prediction)), 3))
  print("\n")

  

def regression_evaluation_plots(X_train, y_train, X_test, y_test,pipeline, alpha_scatter=0.5):
  """
  # Gets Train and Test set (features and target), pipeline, and adjust dots transparency 
  at scatter plot
  - It predicts on train and test set
  - It creates an Actual vs Prediction scatterplots, for train and test set
  - It draws a red diagonal line. In theory, a good regressor should predict
  close to the actual, meaning the dot should be close to the diagonal red line
  The closer the dots are to the line, the better

  """
  pred_train = pipeline.predict(X_train)
  pred_test = pipeline.predict(X_test)


  fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,6))
  sns.scatterplot(x=y_train , y=pred_train, alpha=alpha_scatter, ax=axes[0])
  sns.lineplot(x=y_train , y=y_train, color='red', ax=axes[0])
  axes[0].set_xlabel("Actual")
  axes[0].set_ylabel("Predictions")
  axes[0].set_title("Train Set")

  sns.scatterplot(x=y_test , y=pred_test, alpha=alpha_scatter, ax=axes[1])
  sns.lineplot(x=y_test , y=y_test, color='red', ax=axes[1])
  axes[1].set_xlabel("Actual")
  axes[1].set_ylabel("Predictions")
  axes[1].set_title("Test Set")

  plt.show()
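
To see what the four metrics printed by `regression_evaluation()` measure, here is a worked example computed by hand with NumPy on three made-up values:

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.0, 5.0, 9.0])

errors = actual - predicted

mae = np.mean(np.abs(errors))    # mean absolute error
mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, same unit as the target

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(float(mae), 3), round(float(mse), 3),
      round(float(rmse), 3), round(float(r2), 3))
```

The errors are 1, 0 and -2, so MAE is 1.0, MSE is 5/3 (about 1.667), RMSE is about 1.291 and R2 is 1 - 5/8 = 0.375.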

* The performance on the train set is very good (R2 of 0.95, MAE of 1.4, and the actual vs prediction plot is dense around the diagonal red line). R2 on the test set is acceptable (0.72) but much lower than on the train set. That notable gap may be a sign of overfitting.
* In the actual vs predictions plots, the dots in the train set are closer to the diagonal line than in the test set. That reinforces the previous point.
* Following the diagonal line means the predictions tend to follow the actual values.
* This pipeline was trained with the default algorithm hyperparameters (like the number of trees, max depth etc). Tuning them requires understanding each hyperparameter and its typical impact on algorithm performance. We will cover how to train with multiple hyperparameters in an upcoming lesson

regression_performance(X_train, y_train, X_test, y_test,pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, 
                            pipeline, alpha_scatter=0.5)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Gradient Boosting

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may refer back to Module 2 - ML Essentials - Algorithms lesson to refresh the algorithms we will cover. We are not going deep into the mathematical functions; the idea is to present the concept and the algorithm application.

* Gradient boosting is a type of boosting technique. Boosting builds a sequence of models, starting from weak ones and combining them into an increasingly powerful model. The models are added sequentially until no further improvement can be made. Gradient boosting aims to minimize a loss function, which captures the performance of the model, by adding weak learners fitted using the gradient of that loss.
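
The sequential idea can be sketched by hand. The example below is a simplified regression version with squared loss, where the negative gradient is just the residual; it is meant for intuition and is not how scikit-learn's `GradientBoostingClassifier` is implemented internally. The synthetic data, the learning rate of 0.1 and the 50 rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic data: a noisy sine wave
rng = np.random.default_rng(101)
X_demo = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())   # start from a constant model
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    # for squared loss, the negative gradient is simply the residual
    residual = y_demo - prediction
    weak_learner = DecisionTreeRegressor(max_depth=1, random_state=101)
    weak_learner.fit(X_demo, residual)
    # add a small correction from the new weak learner
    prediction += learning_rate * weak_learner.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"MSE at start: {mse_history[0]:.4f} | after 50 rounds: {mse_history[-1]:.4f}")
```

Each one-split tree is a weak model on its own, but every round removes a bit of the remaining error on the train data.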




We import the algorithms. Find the documentation for `GradientBoostingClassifier` [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) and for `GradientBoostingRegressor` [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html).
* We will import both but will use `GradientBoostingClassifier` for the exercise.

from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import GradientBoostingRegressor

Let's consider the iris dataset again for the classification task

df_clf.head()

As usual, we split the data into train and test sets, considering 'species' as the target variable

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['species'],axis=1),
                                    df_clf['species'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

The pipeline is similar to that used in the  previous section where we considered the iris dataset.
* There are 3 steps: feature scaling, feature selection and modelling, and here we consider the Gradient Boosting Classifier

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.ensemble import GradientBoostingClassifier 


def pipeline_gradient_boost_clf():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(GradientBoostingClassifier(random_state=101)) ),
      ( "model", GradientBoostingClassifier(random_state=101)),

    ])

  return pipeline


We fit the pipeline to the train set

pipeline = pipeline_gradient_boost_clf()
pipeline.fit(X_train, y_train)

And check feature importance using the same function we used previously since, for this algorithm, feature importance is assessed using the same attribute
* Note it considers only petal_length. Note also the difference; the same data in the decision tree had 2 features as the most important features. That naturally happens since different algorithms have different mechanisms

feature_importance_tree_based_models(model = pipeline['model'],
                                     columns =  X_train.columns[pipeline['feat_selection'].get_support()]
                                     )

Let's evaluate the pipeline using the same custom function that shows the confusion matrix and classification report for the train and test sets
* The results are the same as for the Decision Tree (considering the same dataset).
* The only difference is that Gradient Boost needed only 1 feature to reach that result, while the Decision Tree needed 2. So Gradient Boost is the better choice for this data, since a system with fewer features is simpler and easier to maintain.

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline,
                # sorted, because scikit-learn orders classes by sorted label value
                label_map= sorted(df_clf['species'].unique())
                )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Ada Boost

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may refer back to Module 2 - ML Essentials - Algorithms lesson to refresh the algorithms we will cover. We are not going deep into the mathematical functions; the idea is to present the concept and the algorithm application.


* AdaBoost (or Adaptive Boosting) is an ensemble learning method used to build a strong model from several weak models. It generates a single strong learner by iteratively adding weak learners. The result is a model with higher accuracy than any single weak learner.
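
The "adaptive" part comes from re-weighting the samples between iterations. The sketch below shows one round of the classic binary AdaBoost weight update with NumPy, for intuition only; scikit-learn's `AdaBoostRegressor` uses a regression variant (AdaBoost.R2), and the misclassification pattern here is made up.

```python
import numpy as np

n_samples = 10
weights = np.full(n_samples, 1 / n_samples)    # start with equal weights

# hypothetical outcome of one weak learner: True where it misclassified
misclassified = np.array([True, True, True] + [False] * 7)

error = weights[misclassified].sum()           # weighted error rate
alpha = 0.5 * np.log((1 - error) / error)      # the learner's say in the final vote

# raise the weight of the mistakes, lower the weight of the hits, then renormalise
weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
weights = weights / weights.sum()

print(f"error={error:.2f}, alpha={alpha:.3f}")
print("weight carried by the misclassified samples:", weights[misclassified].sum())
```

After the update, the three misclassified samples carry half of the total weight, so the next weak learner is pushed to focus on them.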


We import the algorithms. Find the documentation for `AdaBoostRegressor` [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html) and for `AdaBoostClassifier` [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html).


* We will import both but will use `AdaBoostRegressor` for the exercise.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import AdaBoostRegressor

 We will use the Boston dataset to fit an ML pipeline to predict the sales price using the Ada Boost Algorithm

df_reg.head()

We split the train and test sets. The target variable is 'price'

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_reg.drop(['price'],axis=1),
                                    df_reg['price'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

We create the pipeline using the same steps as previously but now considering the Ada Boost Regressor

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.ensemble import AdaBoostRegressor

def pipeline_adaboost_reg():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(AdaBoostRegressor(random_state=101)) ),
      ( "model", AdaBoostRegressor(random_state=101)),

    ])

  return pipeline


We fit the data to the train set (in the same manner we did previously)

pipeline = pipeline_adaboost_reg()
pipeline.fit(X_train, y_train)

And assess feature importance using our custom function
* Note this pipeline selects 3 variables to train the model: `['LSTAT', 'RM', 'DIS']`


feature_importance_tree_based_models(model = pipeline['model'],
                                     columns =  X_train.columns[pipeline['feat_selection'].get_support()])

We now evaluate the pipeline using the custom function.
* The R2 score on the train set is 0.9 and on the test set is 0.78. Ideally the gap would be smaller, but it is lower than the gap we saw for Random Forest
* In the actual vs predictions plots, the dots in the train set are around the diagonal line (not as close as in the Random Forest).
* Following the diagonal line means the predictions tend to follow the actual value.

regression_performance(X_train, y_train, X_test, y_test,pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, 
                            pipeline, alpha_scatter=0.5)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  XG Boost

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may refer back to Module 2 - ML Essentials - Algorithms lesson to refresh the algorithms we will cover. We are not going deep into the mathematical functions; the idea is to present the concept and the algorithm application.


* XGBoost stands for eXtreme Gradient Boosting. It is an extension of gradient boosted decision trees, designed to improve speed and performance, and it includes regularization features that help to avoid overfitting. It is a separate library that you need to install; it is not part of the Scikit-learn library.


We import the algorithms. The documentation for both is available [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn).
* We will import both but will use `XGBClassifier` for the exercise.

from xgboost import XGBRegressor
from xgboost import XGBClassifier

Let's consider the iris dataset again for the classification task

df_clf.head()

Let's split the data into train and test sets, where the target variable is 'species' 

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['species'],axis=1),
                                    df_clf['species'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

We create the pipeline using the same steps as previously but now considering XGBoost

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from xgboost import XGBClassifier


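def pipeline_xgboost_clf():
  pipeline = Pipeline([
      # standardise the features
      ("feat_scaling", StandardScaler()),
      # keep only the features the algorithm finds most important
      ("feat_selection", SelectFromModel(XGBClassifier(random_state=101))),
      # fit the model on the selected features
      ("model", XGBClassifier(random_state=101)),
  ])

  return pipeline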


We fit the pipeline to the train data

pipeline = pipeline_xgboost_clf()
pipeline.fit(X_train, y_train)

And assess the feature importance
* Note that only `petal_length` was selected as relevant to fit the model.



feature_importance_tree_based_models(model = pipeline['model'],
                                     columns =  X_train.columns[pipeline['feat_selection'].get_support()]
                                     )

Let's assess the pipeline performance
* The performance is the same as Gradient Boost on both the train and test sets. Of the three classification algorithms we tested (Decision Tree, Gradient Boost and XGBoost), the last two are good candidates and suit the data best. However, we will study another method that tests several algorithms simultaneously and avoids the one-by-one analysis we are doing now.

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline,
                label_map= df_clf['species'].unique() 
                )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  ExtraTree

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may refer back to Module 2 - ML Essentials - Algorithms lesson to refresh the algorithms we will cover. We are not going deep into the mathematical functions; the idea is to present the concept and the algorithm application.


* Extra Trees (or Extremely Randomized Trees) is an ensemble algorithm. It works by creating a large number of unpruned decision trees. For regression, predictions are made by averaging the predictions of the trees; for classification, by majority voting.
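The aggregation step described above can be sketched in plain Python. The per-tree predictions are made up for illustration, and the helper names are hypothetical. The sketch follows the description in this lesson; note that scikit-learn's forest classifiers combine the trees' class probabilities rather than counting hard votes, so treat this as a conceptual simplification.

```python
from collections import Counter
from statistics import mean


def aggregate_regression(tree_predictions):
    """Regression: average the predictions of the individual trees."""
    return mean(tree_predictions)


def aggregate_classification(tree_predictions):
    """Classification: return the class predicted by most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Made-up predictions from individual trees for a single observation
reg_preds = [21.0, 23.5, 22.0, 24.5]
clf_preds = ["setosa", "versicolor", "setosa", "setosa", "virginica"]

reg_result = aggregate_regression(reg_preds)
clf_result = aggregate_classification(clf_preds)
print("regression prediction:", reg_result)
print("classification prediction:", clf_result)
```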


We import the algorithms. The documentation is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html) for the classifier and [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html) for the regressor.
* We will import both but will use `ExtraTreesRegressor` for the exercise.

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import ExtraTreesRegressor

We will use the Boston dataset to fit an ML pipeline to predict the sales price

df_reg.head()

Let's split the data into train and test sets using 'price' as a target variable 

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_reg.drop(['price'],axis=1),
                                    df_reg['price'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

We create the pipeline using the same steps as previously but now considering the Extra Tree Regressor

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.ensemble import ExtraTreesRegressor

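def pipeline_extra_tree_reg():
  pipeline = Pipeline([
      # standardise the features
      ("feat_scaling", StandardScaler()),
      # keep only the features the algorithm finds most important
      ("feat_selection", SelectFromModel(ExtraTreesRegressor(random_state=101))),
      # fit the model on the selected features
      ("model", ExtraTreesRegressor(random_state=101)),
  ])

  return pipeline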


We fit the pipeline to the train set

pipeline = pipeline_extra_tree_reg()
pipeline.fit(X_train, y_train)

And evaluate feature importance using our custom function
* The pipeline selected `['LSTAT', 'RM']`, and `LSTAT` is the more important of the two.
* To reinforce: different algorithms may rely on different features to find patterns in the data. Random Forest selected the same two features, while Ada Boost also selected the variable `DIS`.

feature_importance_tree_based_models(model = pipeline['model'],
                                     columns =  X_train.columns[pipeline['feat_selection'].get_support()])

Let's now evaluate the pipeline
* Note the pipeline was perfect on the train set (R2 score of 1) but poor on the test set (an R2 score of 0.68 is poor compared to a score of 1 on the train set).
* This is a sign that the model overfits: it performs much better on the train set and doesn't generalize well to unseen data, like the test set.
* Overall, for this dataset and among Random Forest, Ada Boost and Extra Tree, Ada Boost performed best since it generalizes best (the difference between its performance on the train and test sets is the smallest).
* Again, we analyse each algorithm separately for learning purposes; in the next unit, we will learn how to evaluate all the algorithms simultaneously.

regression_performance(X_train, y_train, X_test, y_test,pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, 
                            pipeline, alpha_scatter=0.5)
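The comparison above boils down to the gap between the train and test R2 scores. Below is a minimal sketch using only the scores quoted in this section for Ada Boost and Extra Tree (Random Forest is left out because its scores are not restated here); the dictionary and variable names are hypothetical.

```python
# R2 scores quoted in this section
scores = {
    "Ada Boost": {"train": 0.90, "test": 0.78},
    "Extra Tree": {"train": 1.00, "test": 0.68},
}

# A larger train-test gap suggests the model generalizes worse
gaps = {name: round(s["train"] - s["test"], 2) for name, s in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: train-test R2 gap = {gap:.2f}")

best_generalizer = min(gaps, key=gaps.get)
print("smallest gap:", best_generalizer)
```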

---