# Scikit-learn - Unit 01 - ML Pipeline and ML tasks

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%201%20-%20Lesson%20Learning%20Outcome.png"> Lesson Learning Outcome

* **Scikit-learn Lesson consists of nine units.**
* By the end of this lesson, you should be able to:
  * Learn and use the workflow for training and evaluating the ML pipeline.
  * Create a pipeline according to our dataset and ML task.
  * Fit Regression, Classification, Clustering, PCA (Principal Component Analysis), and NLP (Natural Language Processing) pipelines, considering different algorithms.
  * Learn and use the code to fit multiple algorithms in one go, with hyperparameter optimisation.

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Reinforce ML pipeline concepts and the ML tasks that are covered in upcoming notebooks.
* Learn and use the workflow for training and evaluating the ML pipeline.



---

Scikit-learn allows you to train machine learning models for classification, regression or clustering. In addition, it provides a wide set of functions for data processing, dimensionality reduction, feature engineering, feature scaling, feature selection, tuning model hyperparameters, creating an ML pipeline, evaluating a model's performance and more.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png">
 **Why do we study Scikit-learn?**
  * Because it is a centralised and complete library for conventional ML, containing a suite of practical modules that help data practitioners from the development to the deployment of ML pipelines.



## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png"> Additional Learning Context

* We encourage you to:
  * Add **code cells and try out** other possibilities, i.e.: play around with parameter values in a function/method, or consider additional function parameters etc.
  * Also, **add your comments** in the cells. It can help you to consolidate your learning. 

* Parameters in a given function/method
  * As you may expect, a given function in a package may contain multiple parameters.
  * Some of them are mandatory to declare; some have pre-defined values, and some are optional. We will cover the most common parameters used/employed in Data Science for a particular function/method.
  * However, you may seek additional information in the respective package documentation, where you will find instructions on how to use a given function/method. The studied packages are open source, so this documentation is public.
  * **For Scikit-learn, the link is [here](https://scikit-learn.org/stable/g/). We will also use the XGBoost library to train pipelines with eXtreme Gradient Boosting, which is a tree-based algorithm. The documentation is [here](https://xgboost.readthedocs.io/en/latest/index.html)**.

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

We will install scikit-learn, xgboost, feature-engine and yellowbrick to run our exercises.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 01 - ML Pipeline and ML tasks

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Introduction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In a nutshell, Machine learning is a data-driven approach that uses algorithms to learn patterns and relationships from the data, without being explicitly programmed. 
* The developer gives the algorithm data and an objective. The algorithm is trained and figures out how to match the objective based on the provided data.
* This creates a model, and the trained model is used for predicting behaviours and outputs, allowing decision-making on unseen data.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> ML is widely applied in practice across many industries. Examples include:
* E-mail spam detection
* Customer Churn
* Text Sentiment Analysis
* Fraud Detection
* Real-time Ads
* Recommendation Engine (e.g. after you finish a movie on a streaming service, you receive suggestions on what to watch next.)


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will explore
* Pipeline concepts
* Data Cleaning and Feature Engineering
* Feature Scaling and Feature Selection
* ML tasks covered in this lesson
* General Workflow

**Note**
<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">
* The overall perception of difficulty may escalate in the following notebooks since we will start using a series of concepts we covered in the videos and the previous notebooks but in more practical terms.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Pipeline concepts

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In the previous lesson, we introduced the scikit-learn Pipeline, which arranges a sequence of tasks.
* In ML, we are interested in arranging a sequence of tasks that follows the ML process of **data cleaning, feature engineering, feature scaling, feature selection and model**.
* In an ML pipeline, the last step is typically the model, and the preceding steps prepare the data for the model.

We import Pipeline from sklearn

from sklearn.pipeline import Pipeline


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In addition, the pipeline should support two outcomes: the training outcome and the prediction outcome.
* For that, we use estimators as the pipeline steps. Two types of estimators are mainly used: predictors and transformers.
  * A predictor uses methods like **.fit()** and **.predict()**. An ML model is a predictor: it learns patterns from the data with **.fit()** and is then used for predictions with **.predict()**.
  * A transformer, on the other hand, uses the methods **.fit()** and **.transform()**: it learns parameters from the data and then uses them to transform the data, for example, to rescale a variable or improve its distribution.
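
As a minimal sketch of the two estimator types described above, the example below fits one transformer and one predictor on a tiny made-up array. The data values and the variable names (`scaler`, `clf`) are illustrative assumptions and are not part of the lesson's exercises.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset: two features and a binary target (illustrative values only)
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 1, 1])

# Transformer: .fit() learns parameters (mean and standard deviation),
# .transform() applies them to the data
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Predictor: .fit() learns the relationship between features and target,
# .predict() outputs a prediction for each row
clf = DecisionTreeClassifier(random_state=101)
clf.fit(X_scaled, y)
predictions = clf.predict(X_scaled)

print(X_scaled.shape)
print(predictions)
```

Note that the transformer only needs the features, while the predictor needs both the features and the target when fitting.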
  


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We will demonstrate the differences between fitting models with and without a pipeline.
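
To make that comparison concrete, here is a small sketch that fits the same two steps first by hand and then inside a `Pipeline`. It assumes the iris data bundled with scikit-learn (`load_iris`) instead of the Seaborn copy, and the step names are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data, returned as NumPy arrays
X, y = load_iris(return_X_y=True)

# Without a pipeline: every step is fitted and applied by hand
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = DecisionTreeClassifier(random_state=101)
model.fit(X_scaled, y)
manual_pred = model.predict(scaler.transform(X))

# With a pipeline: one object holds the steps, and a single .fit() runs them in order
pipe = Pipeline([
    ("feature_scaling", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=101)),
])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

# Both approaches should give the same predictions
print((manual_pred == pipe_pred).all())
```

The pipeline version is shorter and, more importantly, it removes the risk of forgetting to apply a preprocessing step before predicting.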

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Data Cleaning and Feature Engineering

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In the feature-engine lesson, we studied common techniques to handle data cleaning and feature engineering tasks, using feature-engine's built-in transformers or creating your own transformer.
* In addition, we arranged these transformers in a pipeline.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Feature Scaling and Feature Selection

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Once the data is cleaned and engineered, you should consider feature scaling and feature selection. We have studied the definitions in the Module: Machine Learning Essentials / Section: ML Pipeline. Please refer to it if you need refreshing.

* In this section, we will cover the practical step of feature scaling and feature selection.



#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Feature Scaling

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The scale of a feature is an important aspect when fitting a model. For example, there are algorithms like K-means clustering, Linear and Logistic Regression, and Neural Networks that are highly affected by the scale of their features.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> According to Scikit-learn [documentation](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html), feature scaling can be an important preprocessing step for many machine-learning algorithms. Standardisation involves rescaling the features such that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one.
* The idea behind scaling the features is to make all features have a similar scale.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will present `StandardScaler()`, which standardises the data: it centres each variable at zero and sets its variance to 1 by subtracting the mean from each observation and dividing by the standard deviation. The result is also known as the Z-score. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
* We will cover the StandardScaler transformer in the course as a first go-to option for feature scaling. However, there are other alternatives, and you may check the [documentation](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html) to learn more. 
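
The Z-score calculation can be checked by hand. The sketch below compares a manual calculation with `StandardScaler()` on a made-up single-feature array; the values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for a single feature (one column, four observations)
x = np.array([[2.0], [4.0], [6.0], [8.0]])

# Manual Z-score: subtract the mean and divide by the standard deviation.
# np.std uses ddof=0 by default, which matches what StandardScaler computes.
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same operation with the scikit-learn transformer
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(), z_sklearn.std())
```

After scaling, the mean is zero and the standard deviation is one, up to floating-point precision.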


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The tradeoff of feature scaling is that the variable distribution will be slightly different. Still, we will create better conditions for the algorithm to learn the patterns and relationships in the data and generalize on unseen data.

Let's use the iris dataset

df =  sns.load_dataset('iris')
print(df.shape)
df.head()

We will import `StandardScaler()`

from sklearn.preprocessing import StandardScaler

We create a pipeline with a step called 'feature_scaling' and attach `StandardScaler()`. When you don't pass any variables to it, it scales all variables.

from sklearn.pipeline import Pipeline
pipeline = Pipeline([
      ("feature_scaling", StandardScaler()) 
  ])

We will apply this pipeline to the features in the train set. We will learn how to split data soon, but for now, we will manually create a train set and a test set, where each has a set for features and the target variable.
* In this dataset, features are `['sepal_length', 'sepal_width', 'petal_length', 'petal_width']` and target is `['species']`. We shuffle the data and will get the first 100 rows and set them as the train set. The remaining goes to the test.
* <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> There is a proper way to split a train and test set. We will cover that soon.
* The central point is to have 2 sets (Train and Test) and have features and the target separated.


Let's shuffle the data. We use `.sample(frac=1)`; the documentation link is [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html). It returns a random sample from the data and, with `frac=1`, that sample contains all rows in random order.

df = df.sample(frac=1)
df.head()

The train set features are X_train and the train set target is y_train; both take the first 100 rows. The same rationale goes for the test set: X_test and y_test take the last 50 rows.

X_train = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']][:100]
y_train =  df[['species']][:100]
X_test =  df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']][100:]
y_test =  df[['species']][100:]

We check the DataFrames dimensions

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

When applying pipelines to ML, we fit the pipeline to the train set (so it will learn the parameters) and based on this learning, transform the data on the train and test set

pipeline.fit(X_train)
X_train_scaled = pipeline.transform(X_train)
X_test_scaled = pipeline.transform(X_test)
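
To see what "learning the parameters" means here, the following sketch inspects the mean and standard deviation stored by the scaler after fitting, and confirms that the test set is scaled with the train set's values. It assumes scikit-learn's bundled iris data (`load_iris`), whose column names differ slightly from the Seaborn version, and a fixed `random_state` for the shuffle.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled iris features, shuffled with a fixed seed
X = load_iris(as_frame=True).data.sample(frac=1, random_state=101)
X_train, X_test = X[:100], X[100:]

pipe = Pipeline([("feature_scaling", StandardScaler())])
pipe.fit(X_train)

# The fitted scaler stores what it learned from the train set
scaler = pipe["feature_scaling"]
print(scaler.mean_)    # per-feature means of the train set
print(scaler.scale_)   # per-feature standard deviations of the train set

# The test set is transformed with the train set's parameters, not its own
X_test_scaled = pipe.transform(X_test)
expected = (X_test.to_numpy() - scaler.mean_) / scaler.scale_
print(np.allclose(X_test_scaled, expected))
```

Fitting only on the train set keeps information from the test set out of the preprocessing step.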

One caveat of using sklearn transformers is that they output NumPy arrays, instead of Pandas DataFrames. You may remember that the feature-engine outputs DataFrames. 

type(X_train_scaled)

So we need an additional step to convert the scaled data back to a DataFrame.

X_train_scaled = pd.DataFrame(data= X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(data= X_test_scaled, columns=X_train.columns)

Now we can move on: the dataset is a DataFrame.

type(X_train_scaled)

We are now interested to see the difference in each feature before and after applying StandardScaler().
* We create a logic to loop on each feature and plot two histograms in the same plot. One shows the data distribution before applying ``StandardScaler()`` and the other after applying it.
* The blue plot is before applying, and the red is after. Note that the red histograms are centred at zero on the x-axis. You will notice the distribution may change a bit, but that is part of the tradeoff we mentioned earlier.

sns.set_style('whitegrid')
for col in X_train.columns:
  fig, axes = plt.subplots(figsize=(8,5))
  sns.histplot(data=X_train, x=col, kde=True, color='b',  ax=axes)
  sns.histplot(data=X_train_scaled, x=col, kde=True,color='r', ax=axes)
  axes.set_title(f"{col}")
  axes.legend(labels=['Before Scaling', 'After Scaling'])
  plt.show()
  print("\n\n")

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Feature Selection

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> The primary goal of feature selection is to have a process to select the relevant features for fitting an ML model. 

That is important since: 
* Models with fewer and more relevant features are simpler to interpret.
* You reduce the chance of overfitting by removing features that may add little information or noise.
* You reduce the time needed to train the models.
* You reduce the feature space. You require less effort from the software development team to design and implement the interface (either API or dashboard) in the production environment.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> This step can be seen as a combination of search techniques to look for a subset of features and an evaluation measure that scores the different feature subsets. There are a few methods for feature selection:
* Filter Method
* Wrapper Method
* Embedded Method



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In the course, and as a starting point in your career, you will use the Embedded method.
* It is named the embedded method since it performs feature selection during the model training. It finds the feature subset for the algorithm that is being trained.
* The method automatically trains an ML model, and then derives feature importance from it, removing non-relevant features using the derived feature importance.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> For example:
* Suppose your pipeline is considering a Decision Tree algorithm in the model step. In that case, you can add before the model step a feature selection step using an embedded method considering a Decision Tree.


Let's reuse the same data from the previous exercise: the iris dataset.

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

X_train.head()


We are using `SelectFromModel()` as the method. Its documentation is found [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html). 
* The argument is the algorithm you are considering in the pipeline

from sklearn.feature_selection import SelectFromModel

We create a pipeline using a Decision Tree algorithm that contains three steps:
* `feature_scaling`: like we saw in the previous example.
* `feature_selection`: use SelectFromModel considering the same algorithm from the model step.
* `model`: uses a Decision Tree algorithm (we will get into more details in upcoming units; for now, take this step as the model step and let's use a decision tree for the example).

from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
      ( "feature_scaling", StandardScaler() ),
      ( "feature_selection", SelectFromModel(DecisionTreeClassifier(random_state=101)) ),
      ( "model", DecisionTreeClassifier(random_state=101) ),
  ])

pipeline

We fit the pipeline with the Train set.

pipeline.fit(X_train,y_train)

And access the feature_selection step using bracket notation as we saw in the feature-engine lesson.

pipeline['feature_selection']

That was not informative. We need to use `.get_support()` to access which features were selected by this step. 
* The output is a boolean list, where its length and order are related to the original feature space.
* For example, the train set has four features. We see that the feature_selection step selected the last two features since they are True. The first two features were not considered since they are False in the boolean list.

pipeline['feature_selection'].get_support()

However, we want to know the features list that was selected, not a boolean list.
* We then use this boolean list to subset the features.
* A quick recap on the features list.

X_train.columns

We use the boolean list to subset the previous list.
* And here we have the features that were considered important for that given dataset using that given algorithm.

X_train.columns[pipeline['feature_selection'].get_support()] 
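
To see why some features were kept, you can inspect the feature importance and the cut-off used by `SelectFromModel()`. The sketch below is a standalone example under a few assumptions: it uses scikit-learn's bundled iris data (`load_iris`) and fits the selector on its own, outside a pipeline. When no threshold is given for a tree-based estimator, the mean feature importance is used as the cut-off.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data as DataFrame / Series
data = load_iris(as_frame=True)
X, y = data.data, data.target

selector = SelectFromModel(DecisionTreeClassifier(random_state=101))
selector.fit(X, y)

# Importance of each feature, derived from the fitted decision tree
importances = pd.Series(selector.estimator_.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Cut-off used to keep or drop a feature (mean importance by default here)
print("threshold:", selector.threshold_)

# Features whose importance is at or above the threshold
selected = X.columns[selector.get_support()]
print(selected.tolist())
```

Features with an importance at or above the threshold are kept; the remaining ones are dropped before the model step.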

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> ML tasks

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In this lesson, we will explore business cases that involve the following ML tasks:
* Regression
* Classification (Binary and Multi-class)
* Clustering
* NLP (Natural Language processing)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We will use structured and tabular datasets from libraries like Seaborn, Plotly, Scikit-learn and Yellowbrick.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> General Workflow

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In a practical project, you can use the CRISP-DM workflow to manage project steps. In case you want a refresher on the workflow, refer to the Module Delivering Data Science projects.
* For this lesson, we will focus on the following CRISP-DM steps: data understanding, data preparation, modelling and evaluation.



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Therefore, when you reach the modelling phase in a project, it is assumed you have collected the data, conducted an EDA, and defined the pipeline steps.

* When modelling, for supervised learning, you will typically use an overall workflow like:
  * Split the dataset into train and test set
  * Fit the model (either a pipeline or not) 
  * Evaluate your model. If performance is not good, revisit the process, starting from collecting the data, conducting EDA etc


There are some potential small variations to this workflow, but this is the starting point we consider in your journey of modelling
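
The three workflow steps above can be sketched in a few lines. This is a minimal outline under stated assumptions: it uses scikit-learn's bundled iris data (`load_iris`), a decision tree as the model, and accuracy as the performance metric. The upcoming units cover each step, including the choice of metric, in detail.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data as DataFrame / Series
X, y = load_iris(return_X_y=True, as_frame=True)

# 1. Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# 2. Fit the model (here, a pipeline whose last step is the model)
pipe = Pipeline([
    ("feature_scaling", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=101)),
])
pipe.fit(X_train, y_train)

# 3. Evaluate on both sets (accuracy is an illustrative choice of metric)
train_acc = accuracy_score(y_train, pipe.predict(X_train))
test_acc = accuracy_score(y_test, pipe.predict(X_test))
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```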


 **HUGE WARNING** <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> 
* **Reflect** for a second on how many steps and considerations you need before fitting a model. You will be surprised that the modelling phase will take a small percentage of your time and attention in a project where a person is responsible from end to end.

* Even so, this phase is critical to your project, **so let's stop the reading/talking and let's fit some models**.




---

# Scikit-learn - Unit 02 - Split your data, fit a model, predict and save the model

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and implement the basic workflow for splitting the data, fitting a model, predicting on data and saving the model.



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

We will install scikit-learn, xgboost, feature-engine and yellowbrick to run our exercises.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 02 - Split your data, fit a model and predict

In this unit, we will cover how to:
  * Split your data
  * Fit a model
  * Run predictions with the fitted model
  * Save the model, so you can use it later

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In supervised learning, you are interested in splitting your data. In conventional ML, like in Scikit-learn, we will split the data into Train and Test sets.
* The validation set is a part of the Train set. When using a specific Scikit-learn function for hyperparameter optimization, the validation set is grabbed automatically. Therefore we will split into Train and Test sets only.
* If you want a refresher on Train, Validation, and Test sets, refer to Module 2 - ML Essentials.
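
To illustrate how the validation set can be handled automatically, here is a hedged sketch using `GridSearchCV`, one of scikit-learn's hyperparameter optimisation tools. The text above does not name a specific function, so this choice is an assumption, as are the bundled iris data (`load_iris`) and the small `max_depth` grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data as NumPy arrays
X, y = load_iris(return_X_y=True)

# We only split into Train and Test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# With cv=5, the train set is divided into 5 folds; each fold takes a turn
# as the validation set while the other folds are used for fitting
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=101),
    param_grid={"max_depth": [2, 3, 4]},  # hypothetical grid for illustration
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)  # mean validation score of the best combination
```

The test set is never seen during this search, so it remains available for a final, unbiased evaluation.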

Let's consider the iris dataset. It contains records of three classes of iris plants, with petal and sepal measurements.

df = sns.load_dataset('iris')
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> How do you know which variables are features and which variable is a target variable?
* It will depend on the context of your ML project. You will need to know or need to investigate the objective of your ML project to determine features and the target.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For the purposes of this notebook, species is our chosen target variable. There are three species. We need to classify the species according to the flower's petal and sepal. Our ML task then will be a classification.

df['species'].value_counts()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Split your data

We use `train_test_split()` to split the data. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). The parameters we will use are:
* The first two are the features and target, respectively. In this case, for the features, you drop species, and for the target, you subset species.
* ``test_size:`` it represents the data proportion to include in the test set. We set it at 0.2
* ``random_state:`` according to the documentation, it controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. It can be any positive integer. We suggest keeping the same random_state value across your project. We will select here 101

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> `random_state` is a critical parameter in ML, which we will use in other use cases. It essentially gives **REPRODUCIBILITY** to your project. That means the same result you get here right now, another person will get elsewhere at another time.


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['species'],axis=1),
                                                    df['species'],
                                                    test_size=0.2,
                                                    random_state=101)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)
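
The reproducibility given by `random_state` can be checked directly. In the sketch below, two splits with the same `random_state` are compared with a split that uses a different value. It assumes scikit-learn's bundled iris data (`load_iris`), and the value 7 is an arbitrary choice for the comparison.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled iris data as DataFrame / Series
X, y = load_iris(return_X_y=True, as_frame=True)

# train_test_split returns X_train, X_test, y_train, y_test (in this order)
split_a = train_test_split(X, y, test_size=0.2, random_state=101)
split_b = train_test_split(X, y, test_size=0.2, random_state=101)
split_c = train_test_split(X, y, test_size=0.2, random_state=7)

# Compare which rows ended up in the test set (position 1 of each result)
same_seed_match = split_a[1].index.equals(split_b[1].index)
different_seed_match = split_a[1].index.equals(split_c[1].index)

print("Same random_state, same test rows:", same_seed_match)
print("Different random_state, same test rows:", different_seed_match)
```

The same `random_state` always selects the same rows, which is why we suggest keeping one value across your project.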

Let's have a look at `X_train`
* Those will be the features used to train the model
* Note the features are numbers. Scikit-learn uses numbers to fit models. That is why we have to encode categorical data
* In this dataset, we don't need any data cleaning or categorical encoding

X_train.head()

Let's inspect `y_train`. These are categories.
* When the ML task is classification, Scikit-learn handles either numbers or categories for the target variable.

y_train

In addition, `y_train` is a Pandas Series

type(y_train)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit your model

You get a preview of tree-based algorithms in the 'Machine Learning Essentials' Algorithms unit. Even though we have a dedicated unit for tree-based algorithms, here we will use a decision tree algorithm to fit a model to demonstrate the basic workflow for fitting a model.
* We will use `DecisionTreeClassifier()`, the documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
* We create a Python object/variable called `model` and instantiate `DecisionTreeClassifier()`. A common convention is to name this object `model`.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">  Note: we create a model and fit it directly. We can do that since the data doesn't require a pre-processing step, like data cleaning or categorical encoding, for the fitting.

* Fitting the model on its own is fine as a learning experience. However, in our later exercises, we will not fit the model on its own but instead use a pipeline that contains a series of steps, where typically, the last step will be the model.

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=101)  # fixed random_state for reproducible results

Next, we fit the model with the train set - features (`X_train`) and target (`y_train`)
* We use the `.fit()` method and pass `X_train` and `y_train`. Simple as that.

model.fit(X_train,y_train)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Run predictions

Let's predict the test set using our model.
* We use `.predict()` and pass the test set features (`X_test`)
* The answer is an array

model.predict(X_test)

You can predict the probability (between 0.0 and 1.0) for each class for a given observation using `.predict_proba()`

model.predict_proba(X_test)

Ideally, we should predict on the Train and Test set, set a performance metric and evaluate model performance.
* We will not evaluate the model yet. We will leave it until another unit
* The idea here is to feel how it works 'under the hood' when doing a basic training and predicting process.

Let's assume now you want to predict on real-time data.
* In an application, you will likely create an interface to collect the data or will get the data from somewhere else, from an API, for example.
* In this case, we will manually create a DataFrame that contains the features. We call that X_live. It will have one row only. You could have a set of rows, which would mean running predictions in a batch; in our case, it is only one prediction.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In theory, you can set any value to the variable, but in practice, the values will follow the actual data distribution.

X_live = pd.DataFrame(data={'sepal_length':6.0,
                            'sepal_width':3.9,
                            'petal_length':2.5,
                            'petal_width':0.9},
                      index=[0] # the DataFrame needs an index (either number or category), we just passed the number 0
                      )
X_live

Let's predict using this live data.

model.predict(X_live)

The model is 100% confident that the observation belongs to one particular class.

model.predict_proba(X_live)

We already saw that this class is Versicolor, but you can cross-check the order of the labels with `.unique()`. The columns of `.predict_proba()` follow the order stored in `model.classes_`.

df['species'].unique()
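
A convenient way to read the probabilities is to label each column with its class name. The sketch below assumes scikit-learn's bundled iris data (`load_iris`), with the numeric target mapped to species names so it resembles the Seaborn version; the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data; map the numeric target to species names
iris = load_iris(as_frame=True)
X = iris.data
label_map = {i: str(name) for i, name in enumerate(iris.target_names)}
y = iris.target.map(label_map)

model = DecisionTreeClassifier(random_state=101)
model.fit(X, y)

# The columns of predict_proba follow the order of model.classes_
proba = pd.DataFrame(model.predict_proba(X.head(3)), columns=model.classes_)

print(model.classes_)
print(proba)
```

Each row of probabilities sums to one, and the column with the highest value is the class returned by `.predict()`.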

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Save your model

We can save either an ML model or an ML pipeline as a .pkl file with a library called joblib.
* You need the function `joblib.dump()`; the documentation is [here](https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html). We pass the object we want to save as the `value` argument, and the directory + filename + .pkl as the `filename` argument (we are saving at the root level).

import joblib
joblib.dump(value=model , filename="my_first_ml_model.pkl")

Once you are in an application or in another notebook, you can load the file with `joblib.load()`. The documentation is [here](https://joblib.readthedocs.io/en/latest/generated/joblib.load.html). You pass the directory + filename + .pkl as the `filename` argument.

loaded_model = joblib.load(filename="my_first_ml_model.pkl")
loaded_model
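
A quick way to gain confidence in a saved file is to check that the reloaded object predicts exactly like the original. The sketch below is one possible round-trip check under stated assumptions: the stand-in model, the temporary folder and the file name `demo_model.pkl` are hypothetical and only serve the illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# assumption: a small stand-in model, not the one trained in this unit
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# save to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(value=model, filename=path)
    loaded_model = joblib.load(filename=path)

# the reloaded model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X), loaded_model.predict(X))
print("Same predictions after reload:", same_predictions)
```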

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%209-%20Well%20done.png"> Awesome!! 

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Does that mean I am ready to create an ML model for the world and solve big challenges?
* Almost. We've started the ML journey now! 
* We still need to cover more topics. Now let's have some fun!

---

# Scikit-learn - Unit 03 - Linear Models for Regression and Classification

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Implement and Evaluate Linear Models for Regression and Classification



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 03 - Linear Models for Regression and Classification


In this unit, we will cover the practical steps and code to fit a pipeline considering Linear Regression and Logistic Regression.
* In case you want a reminder of the theory, refer back to the Introduction to Predictive Analytics And Machine Learning and particularly to the Machine Learning Essentials > Machine Learning Terminology > Train/Fit a Model.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A typical workflow used for supervised learning is:
* Split the dataset into train and test set
* Fit the model (either using a pipeline or not) 
* Evaluate your model. If performance is not good, revisit the process, starting from collecting the data, conducting EDA (Exploratory Data Analysis) etc.



---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Regression: Linear Regression

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use the Boston dataset from sklearn. It has house price records and characteristics, like the average number of rooms per dwelling and the per capita crime rate in Boston.

* The approach to load the data from sklearn is a bit different from seaborn.
* In this case, the data comes as a "dictionary" where you need to grab different pieces (like `data.data`, `data.feature_names`, `data.target`) to arrange the DataFrame.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> As an aside, this data won't need cleaning or feature engineering to train a model.

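# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs an environment with scikit-learn < 1.2
from sklearn.datasets import load_boston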
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)

print(df.shape)
df.head()
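
Since `load_boston` is no longer available from scikit-learn 1.2 onwards, the sketch below shows the same "grab the pieces and arrange the DataFrame" pattern on `load_diabetes`, a regression dataset that still ships with scikit-learn. Swapping the dataset is an assumption made purely for illustration, and the name `df_demo` is hypothetical; the results discussed in the rest of this unit refer to the Boston data.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# assumption: load_diabetes is used as a stand-in for the Boston data
data = load_diabetes()

# same pattern: features from data.data, column names from data.feature_names
df_demo = pd.DataFrame(data.data, columns=data.feature_names)
# the target is stored separately in data.target
df_demo['target'] = pd.Series(data.target)

print(df_demo.shape)
print(df_demo.head())
```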

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As our workflow suggests, we split the data into train and test sets
* We pass the features (the full data without the target, `df.drop(['target'],axis=1)`) and the target (`df['target']`)
* test_size is 20% and random_state is 101. From now on, we will always use these values.
* It is good practice to inspect the shapes of the train and test sets as a sanity check.

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['target'],axis=1),
                                    df['target'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)
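
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on a hypothetical 10-row DataFrame (`toy_df` and the helper `split()` are names invented for this example). With `test_size=0.2`, 2 of the 10 rows go to the test set, and repeating the split with the same `random_state` selects the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data: 10 rows, one feature and one target
toy_df = pd.DataFrame({"feature": range(10), "target": range(100, 110)})


def split(df):
    """Split features and target with the values used in this unit."""
    return train_test_split(
        df.drop(["target"], axis=1),
        df["target"],
        test_size=0.2,
        random_state=101,
    )


X_tr, X_te, y_tr, y_te = split(toy_df)
X_tr_again, X_te_again, _, _ = split(toy_df)

print("* Train set:", X_tr.shape, y_tr.shape, "\n* Test set:", X_te.shape, y_te.shape)
print("Same test rows on both runs:", X_te.index.equals(X_te_again.index))
```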

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Our target variable is the house price, which is a continuous variable. So we will create a pipeline to handle that.

* We import `Pipeline`, `StandardScaler` and `SelectFromModel`
* To speed up the process, we know the dataset doesn't require any data cleaning or feature engineering. When we work with a dataset that needs it, we will inform you and suggest a transformer for that. In the workplace, that will be the data practitioner's task. But for learning purposes, we focus on the modelling and evaluation aspects.
* We also import the linear regression algorithm. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
* We define a function to create a pipeline with three steps: feature scaling, feature selection and model. It is convenient to arrange everything in a custom function for a given pipeline.
* Just to emphasise, in the feature selection step we pass to ``SelectFromModel()`` the model we will use, in this case, Linear Regression.

**WARNING** <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> 
* The code is already written here, but you will likely write it for yourself for your milestone project and in the workplace.
* We are familiar with the idea of arranging the pipeline in a series of steps. However, mistyping code is very common: when you write the pipeline, you will almost certainly mistype something or miss a comma or a parenthesis "(". If that happens, don't worry, as Python will alert you with an error.

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.linear_model import LinearRegression

def pipeline_linear_regression():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(LinearRegression()) ),
      ( "model", LinearRegression()),

    ])

  return pipeline

pipeline_linear_regression()

We define the object `pipeline` based on `pipeline_linear_regression()`, then fit the train set (X_train and y_train)

pipeline = pipeline_linear_regression()
pipeline.fit(X_train,y_train)
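
To see what `pipeline.fit()` does under the hood, the sketch below runs the three steps by hand and compares the predictions with those of the pipeline. It is an illustration under stated assumptions: it uses `load_diabetes` as a stand-in dataset and fits on the full data, so the numbers are not the ones from this unit.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# assumption: stand-in regression data, not the dataset used in this unit
X, y = load_diabetes(return_X_y=True)

# 1) the pipeline, with the same three steps as in the lesson
pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("feat_selection", SelectFromModel(LinearRegression())),
    ("model", LinearRegression()),
])
pipeline.fit(X, y)

# 2) the same steps done manually: each step is fitted on the output of the previous one
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
selector = SelectFromModel(LinearRegression()).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
model = LinearRegression().fit(X_selected, y)

manual_pred = model.predict(X_selected)
pipeline_pred = pipeline.predict(X)
print("Same predictions:", np.allclose(manual_pred, pipeline_pred))
```

The pipeline saves you from carrying the intermediate arrays around and guarantees the same transformations are applied at prediction time.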

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Great. Once the pipeline is fitted, we want to start evaluating it. First, we want to know the linear model coefficients, which we extract from the model attribute `.coef_`
* We create a custom function that places them in a DataFrame indexed by the column names and sorts them by the absolute value of the coefficients

def linear_model_coefficients(model, columns):
  print(f"* Intercept: {model.intercept_}")
  coeff_df = (pd.DataFrame(model.coef_, index=columns, columns=['Coefficient'])
            .sort_values(['Coefficient'],key=abs, ascending=False)
            )
  print("* Coefficients")
  print(coeff_df)


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As we have seen, we need to pass the model and the columns that the model was trained on
* To pass the model only, we subset the model from the pipeline with `pipeline['model']` (in this case, we named this step 'model'; if you had named it 'ml_model', you would use that step name instead)
* To pass the columns, we subset the feature selection step, where we grab a boolean array indicating which features reach the model: `pipeline['feat_selection'].get_support()`. This array is then used to subset the train set columns.

Let's do an exercise to visualise everything we just read.
* Here we subset the model step from the pipeline

pipeline['model']

Here we subset the boolean array that tells which features hit the model
* Note the first element is False, meaning the first feature from the train set was removed in this step.

pipeline['feat_selection'].get_support()

Here we use this array to subset the train set columns

X_train.columns[pipeline['feat_selection'].get_support()]
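
Why would a feature be removed at this step? With an estimator such as `LinearRegression` and no explicit `threshold`, `SelectFromModel` keeps the features whose absolute coefficient is at least the mean of all absolute coefficients. The sketch below checks this on synthetic data; the column names `strong`, `medium` and `weak` and the coefficients used to generate the target are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# hypothetical data: the target depends strongly on one feature and weakly on another
rng = np.random.default_rng(101)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["strong", "medium", "weak"])
y = 5.0 * X["strong"] + 2.0 * X["medium"] + 0.1 * X["weak"] + rng.normal(scale=0.1, size=200)

selector = SelectFromModel(LinearRegression()).fit(X, y)

# importance of each feature = absolute value of its coefficient
importances = np.abs(selector.estimator_.coef_)
mask = selector.get_support()
selected = list(X.columns[mask])

print("Importances:", dict(zip(X.columns, importances.round(2))))
print("Threshold (mean importance):", round(selector.threshold_, 2))
print("Boolean mask:", mask)
print("Selected features:", selected)
```

Note that `medium` is dropped even though it is informative, because the mean is pulled up by the strongest feature. The `threshold` argument of `SelectFromModel` can be set explicitly when the default is too strict.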

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Now that we are comfortable with what happens on the back end for extracting information from the pipeline, we want to learn the model coefficients.
* Do you remember the intercept and beta coefficients in the algorithms lesson? Here they are. In this case, it is a multiple linear regression since we have multiple features hitting the model.
* We notice that LSTAT has the highest absolute value. Since the features were scaled before fitting, that indicates it is the most important feature for this model. But then we ask: is this model good?

linear_model_coefficients(model=pipeline['model'],
                          columns=X_train.columns[pipeline['feat_selection'].get_support()])

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we want to evaluate how well the pipeline fits the train and test sets
* In case you want to revise the performance metrics for regression, refer to the Performance Metrics video in Machine Learning Essentials > Machine Learning Terminology.
* Read the pseudo code to understand the logic better. The main aspect now is to understand the logic and why it is important for us now.
* We will use these functions in the rest of the course when we evaluate regression models.
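
As a refresher on what the regression metrics measure, the sketch below computes them by hand with NumPy and compares them with the scikit-learn functions. The four actual and predicted values are made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# hypothetical actual values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average size of the error
mse = np.mean(errors ** 2)      # squaring penalises large errors more
rmse = np.sqrt(mse)             # back in the units of the target
# share of the target variance explained by the predictions
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE :", mae, "| sklearn:", mean_absolute_error(y_true, y_pred))
print("MSE :", mse, "| sklearn:", mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)
print("R2  :", r2, "| sklearn:", r2_score(y_true, y_pred))
```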


# import regression metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 
# we will use numpy to calculate RMSE based on MSE (mean_squared_error)
import numpy as np


def regression_performance(X_train, y_train, X_test, y_test,pipeline):
  """
  # Gets train/test sets and pipeline and evaluates the performance
  - for each set (train and test) call regression_evaluation()
  which will evaluate the pipeline performance
  """

  print("Model Evaluation \n")
  print("* Train Set")
  regression_evaluation(X_train,y_train,pipeline)
  print("* Test Set")
  regression_evaluation(X_test,y_test,pipeline)



def regression_evaluation(X,y,pipeline):
  """
  # Gets features and target (either from train or test set) and pipeline
  - it predicts using the pipeline and the features
  - calculates performance metrics comparing the prediction to the target
  """
  prediction = pipeline.predict(X)
  # built-in round() works whether the metric is a numpy float or a plain Python float
  print('R2 Score:', round(r2_score(y, prediction), 3))
  print('Mean Absolute Error:', round(mean_absolute_error(y, prediction), 3))
  print('Mean Squared Error:', round(mean_squared_error(y, prediction), 3))
  print('Root Mean Squared Error:', round(np.sqrt(mean_squared_error(y, prediction)), 3))
  print("\n")

  

def regression_evaluation_plots(X_train, y_train, X_test, y_test,pipeline, alpha_scatter=0.5):
  """
  # Gets Train and Test set (features and target), pipeline, and adjust dots transparency 
  at scatter plot
  - It predicts on train and test set
  - It creates Actual vs Prediction scatterplots, for train and test set
  - It draws a red diagonal line. In theory, a good regressor should predict
  close to the actual, meaning the dot should be close to the diagonal red line
  The closer the dots are to the line, the better

  """
  pred_train = pipeline.predict(X_train)
  pred_test = pipeline.predict(X_test)


  fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,6))
  sns.scatterplot(x=y_train , y=pred_train, alpha=alpha_scatter, ax=axes[0])
  sns.lineplot(x=y_train , y=y_train, color='red', ax=axes[0])
  axes[0].set_xlabel("Actual")
  axes[0].set_ylabel("Predictions")
  axes[0].set_title("Train Set")

  sns.scatterplot(x=y_test , y=pred_test, alpha=alpha_scatter, ax=axes[1])
  sns.lineplot(x=y_test , y=y_test, color='red', ax=axes[1])
  axes[1].set_xlabel("Actual")
  axes[1].set_ylabel("Predictions")
  axes[1].set_title("Test Set")

  plt.show()



Let's use the custom regression evaluation functions.
* Note that the performance on the train and test sets is not too different. That is an indication that the model didn't overfit.
* At the same time, the test set (which best simulates real data, since the model has never seen it) has an R2 score of 0.67. This is neither very good nor very bad. You may want to look for something better, but it is a reasonable R2 value for a first model.
* We also note in the Actual vs Prediction plots that the predictions tend to follow the actual values, since the dots roughly follow the red diagonal line.

regression_performance(X_train, y_train, X_test, y_test,pipeline)
regression_evaluation_plots(X_train, y_train,
                            X_test, y_test,
                            pipeline,alpha_scatter=0.5)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Classification: Logistic Regression

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use the breast cancer dataset from sklearn. It shows records for breast mass samples and a diagnosis indicating whether each one is 0 (Malignant) or 1 (Benign)
* The approach to load the data from sklearn is a bit different from seaborn.
* In this case, `data` comes as a "dictionary" where you need to grab different pieces (like `data.data`, `data.feature_names`, `data.target`) to arrange the DataFrame


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> As an aside, this data won't need data cleaning or feature engineering to train a model.

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data,columns=data.feature_names)
df['target'] = pd.Series(data.target)
print(df.shape)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  As usual, we start our workflow by splitting the data into train and test sets.
* We use the same pattern as the previous section

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['target'],axis=1),
                                    df['target'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Our target variable takes the values 0 (Malignant) and 1 (Benign), so it is a categorical variable. We will create a pipeline to handle that; it will be a binary classifier.
* We also import the logistic regression algorithm. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* We define a function to create a pipeline with three steps: feature scaling, feature selection and model. It is convenient to arrange everything in a custom function for a given pipeline.
* Just a reminder, we pass the model to `SelectFromModel()` in the feature selection step. We will use this pattern all the time. In this case, we will use Logistic Regression

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.linear_model import LogisticRegression

def pipeline_logistic_regression():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(LogisticRegression(random_state=101)) ),
      ( "model", LogisticRegression(random_state=101)),

    ])

  return pipeline


We define the object `pipeline` based on `pipeline_logistic_regression()`, then fit it to the train set (X_train and y_train)

pipeline = pipeline_logistic_regression()
pipeline.fit(X_train,y_train)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Great. Once the pipeline is fitted, we want to start evaluating it. First, we want to know the model coefficients, which we extract from the model attribute `.coef_`
* We create a custom function that places them in a DataFrame together with the columns, then transposes it and sorts it by the absolute value of the coefficients

def logistic_regression_coef(model, columns):
  coeff_df = (pd.DataFrame(model.coef_,index=['Coefficient'],columns=columns)
            .T
            .sort_values(['Coefficient'],key=abs, ascending=False)
            )
  print(coeff_df)

We pass the arguments in a similar way to the previous section:
* the model as the model step from the pipeline
* the columns as the train set features, subset by an array that tells which features were selected by the `feat_selection` pipeline step

logistic_regression_coef(model=pipeline['model'],
                         columns=X_train.columns[pipeline['feat_selection'].get_support()])

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we want to evaluate how well the pipeline fits the train and test sets
* In case you want to revise the performance metrics for classification, refer to the Performance Metrics video in Machine Learning Essentials > Machine Learning Terminology.
* Read the pseudo code to understand the logic better. The main aspect now is to understand the logic and why it is important for us now.
* We will use these functions in the rest of the course when we evaluate classification models.

# loads confusion_matrix and classification_report from sklearn
from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):
  """
  # Gets features, target, pipeline and how the levels from your target are labelled (named)
  in this case, 0 (Malignant) and 1 (Benign), so you pass a list ['Malignant', 'Benign']

  - it predicts based on features
  - compare predictions and actuals in a confusion matrix
    - confusion_matrix() places its first argument in the rows and its second in the columns
    - we display the predictions in the rows and the actual values in the columns
    - for a refresher on that, refer to the Performance Metric video in Module 2
  - show classification report

  """

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
  # sklearn's default layout has actual values in the rows; transposing puts predictions in the rows
  print(pd.DataFrame(confusion_matrix(y_true=y, y_pred=prediction).T,
        columns=["Actual " + sub for sub in label_map],
        index=["Prediction " + sub for sub in label_map]
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")



def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  """
  # gets the features and target from train and test set, pipeline, and how
  you labelled (named) the levels from your target
  in this case, 0 (Malignant) and 1 (Benign), so you pass a list ['Malignant', 'Benign']
  - for each set (train and test), it calls the function above to show the confusion matrix
  and classification report for both train and test set
  """

  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's use the custom classification evaluation function.
* Note that the performance on the train and test sets is not too different. That is an indication that the model didn't overfit.
* As a side note, look at the confusion matrix: the actual values are in the columns, and the predictions are in the rows, as explained in the docstring. In the workplace, you may see these switched. That is fine; you just have to pay attention to where the actual values and the predictions are shown :)


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The confusion matrix shows the counts of when the classifier predicted properly or not for a given class. 
* For example, how many times did the model predict an actual malignant as malignant on the train set? That is 164.
* How many times did the model predict an actual malignant as benign on the train set? That is 6.
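
The orientation of the confusion matrix is easy to mix up, so here is a small sketch with five made-up observations. By default, `confusion_matrix()` places the actual values in the rows and the predictions in the columns; transposing the result gives the layout used in this unit, with predictions in the rows. The lists `y_actual` and `y_predicted` are hypothetical.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# hypothetical data: 0 = Malignant, 1 = Benign
y_actual = [0, 0, 0, 1, 1]
y_predicted = [0, 0, 1, 1, 1]
label_map = ["Malignant", "Benign"]

# sklearn default: rows are actual values, columns are predictions
cm_default = confusion_matrix(y_true=y_actual, y_pred=y_predicted)

# layout used in this unit: rows are predictions, columns are actual values
cm_pred_rows = cm_default.T

cm_df = pd.DataFrame(
    cm_pred_rows,
    index=["Prediction " + label for label in label_map],
    columns=["Actual " + label for label in label_map],
)
print(cm_df)
```

In this toy case, one actual malignant observation was predicted as benign, which appears in the row "Prediction Benign" and the column "Actual Malignant".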


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The classification report shows the main metrics for the classification
* We see for each class the precision, recall and f1-score.
* Support means how many observations.
* We also see the accuracy.
* macro avg: it computes the average without considering the proportion. For example, on the train set in the precision column, it takes all precisions and calculates the average: `(0.99 + 0.98)/ 2`
* weighted avg: it computes the average weighted by the proportion of each class. For example, on the train set in the precision column, it takes `[ 170/(170+285) * 0.99 ] + [ 285/(170+285) * 0.98 ]`
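
The two averages can be reproduced with a few lines of plain Python. The sketch below uses the train set precisions (0.99 and 0.98) and supports (170 and 285) quoted above; the dictionary names are chosen for the example.

```python
# precision and support (number of observations) per class, as quoted in the text
precision = {"Malignant": 0.99, "Benign": 0.98}
support = {"Malignant": 170, "Benign": 285}

# macro avg: plain mean, every class counts the same
macro_avg = sum(precision.values()) / len(precision)

# weighted avg: each class counts in proportion to its support
total = sum(support.values())
weighted_avg = sum(precision[label] * support[label] / total for label in precision)

print("macro avg   :", round(macro_avg, 3))
print("weighted avg:", round(weighted_avg, 3))
```

The two values are close here because both precisions are similar; they drift apart when a small class performs very differently from a large one.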


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> However, in practice we will mostly use precision, recall, f1-score and accuracy as metrics for classification

Let's comment now on the results.
* Since the classes are not balanced (we have more benign than malignant cases), we would probably not choose accuracy. But let's assume we did: we see the accuracy is very good on both the train and test sets.
* We could also decide, for some particular business reason, to use the recall on malignant as the metric, since we don't want to tell a patient that a tumour is benign when it is malignant. In this case, the performance on the train set is 0.96 and on the test set is 0.95. That means that, on live data, you should expect to correctly flag about 95% of the patients who have a malignant tumour. It will be up to your business problem and context to tell whether this level is acceptable. Your heuristics will also play a role in answering this question.
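
To connect the recall on malignant with the confusion matrix counts, the sketch below rebuilds the train set figures quoted earlier (164 malignant cases predicted as malignant, 6 predicted as benign). The benign predictions are assumed to be all correct purely for illustration, since they do not affect the recall on malignant.

```python
from sklearn.metrics import recall_score

# 0 = Malignant, 1 = Benign; 170 malignant and 285 benign observations
y_actual = [0] * 170 + [1] * 285

# 164 malignant caught and 6 missed (figures from the text);
# assumption: every benign observation is predicted correctly
y_predicted = [0] * 164 + [1] * 6 + [1] * 285

# recall on malignant = caught malignant / all actual malignant
manual_recall = 164 / (164 + 6)
recall_malignant = recall_score(y_actual, y_predicted, pos_label=0)

print("Manual recall :", round(manual_recall, 2))
print("sklearn recall:", round(recall_malignant, 2))
```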

clf_performance(X_train=X_train, y_train=y_train,
                        X_test=X_test, y_test=y_test,
                        pipeline=pipeline,
                        label_map= ["Malignant","Benign"] )

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> What if I don't know how to map my target variable for this custom function?
* In this example, we know that the number 0 for the target means 'Malignant' and 1 means 'Benign'. But what if we didn't?
* That is okay; you just have to pass a list with the ordered sequence of the classes as strings, like ``["0", "1"]``
* Let's try it below. It will display the same result; the only difference is the ``label_map``

clf_performance(X_train=X_train, y_train=y_train,
                        X_test=X_test, y_test=y_test,
                        pipeline=pipeline,
                        label_map= ["0", "1" ] ) # it will display the classes as 0 and 1
                                                  # note that "0" and "1" must be strings

---



# Scikit-learn - Unit 04 - Tree-based models for Regression and Classification

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Implement and Evaluate Tree-Based Models for Regression and Classification



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

We will install scikit-learn, xgboost, feature-engine and Yellowbrick to run our exercises

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

import warnings
warnings.filterwarnings('ignore')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 04 - Tree-based models for Regression and Classification

In this unit, we will cover the practical steps and code to fit a pipeline considering tree-based models, such as Decision Trees and Random Forests.
* If you want to revise the algorithm content, refer to the Machine Learning Essentials > Algorithm units. 

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A typical workflow used for supervised learning is
* Split the dataset into train and test set
* Fit the pipeline
* Evaluate your model. If the performance is not good, revisit the process, starting from defining the business case, collecting the data, conducting EDA (Exploratory Data Analysis) etc.


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
"> For teaching purposes, **we will use a fixed dataset for Regression and a fixed dataset for Classification across the different algorithms used in this notebook.**

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use the Boston dataset from sklearn for the **Regression task**. 
* It has house price records and characteristics, like the average number of rooms per dwelling and the per capita crime rate in Boston.
* We'll use the same code from the previous unit.


# Note: load_boston is only available in scikit-learn < 1.2
from sklearn.datasets import load_boston
data = load_boston()
df_reg = pd.DataFrame(data.data,columns=data.feature_names)
df_reg['price'] = pd.Series(data.target)

df_reg = df_reg.sample(frac=0.5, random_state=101)

print(df_reg.shape)
df_reg.head(3)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use the Iris dataset from seaborn for the **Classification task**. 
* It contains records of three varieties of iris plants, with their petal and sepal measurements.

df_clf = sns.load_dataset('iris').sample(frac=0.7, random_state=101)
print(df_clf.shape)
df_clf.head(3)

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 We will cover the following tree algorithms, which include ensemble tree algorithms.
* Decision Tree
* Random Forest
* Gradient Boosting
* Ada Boost
* XG Boost (eXtreme Gradient Boost)
* Extra Tree







<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For teaching purposes, we will use:
* **Classification** task for Decision Tree, Gradient Boosting and XG Boost.
* **Regression** task for Random Forest, Ada Boost and Extra Tree.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> That speeds up our learning process. If you fit a regressor using a Decision Tree, the code and workflow are the same as for a classifier using a Decision Tree.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Decision Tree

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may refer to module 2 - ML Essentials - in the Algorithms lesson to refresh the algorithms we cover. We are not going deep into the mathematical functions; the idea is to present the concept and the algorithm application.

* In a nutshell, a decision tree is like a flow chart where each question has a yes/no answer. This brings you from a general question to a very specific question as you get deeper. The questions asked must be ones where the yes or no answer gives useful insights into the data.
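
The flow chart idea can be made visible by printing the questions a fitted tree asks. The sketch below is a minimal illustration that uses `export_text` from `sklearn.tree`; loading the data with `load_iris` and limiting the tree with `max_depth=2` are assumptions made to keep the output short, not the setup used later in this unit.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# assumption: a shallow tree on sklearn's copy of the iris data, for readability
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=101)
tree.fit(iris.data, iris.target)

# each line is a yes/no question on a feature; indentation shows the depth
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each level of indentation corresponds to one more question, moving from a general split to a more specific one.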


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Depending on your task (Regression or Classification), you will import a different estimator to use the Decision Tree algorithm in Scikit-learn.
* The estimator has the suffix "`Regressor`" when the algorithm is used for a regression task and, as you may expect, the suffix "`Classifier`" when it is used for a classification task.
* That pattern repeats for the other tree-based algorithms.
* The difference is subtle, however, it is worth pointing out.






Find the documentation for both [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) (regressor) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) (classifier).
* We will import both but will use the `DecisionTreeClassifier` for the exercise.

from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
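
To illustrate that the two estimators share the same workflow, the sketch below fits both on a tiny made-up dataset. The arrays `X`, `y_class` and `y_value` are hypothetical; the only difference between the two tasks is the type of target, categorical for the classifier and numerical for the regressor.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# hypothetical data: one feature, four observations
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y_class = np.array(["low", "low", "high", "high"])  # categorical target
y_value = np.array([10.0, 12.0, 30.0, 32.0])        # numerical target

# same fit/predict calls, different estimator and target
clf = DecisionTreeClassifier(random_state=101).fit(X, y_class)
reg = DecisionTreeRegressor(random_state=101).fit(X, y_value)

clf_pred = clf.predict([[1.2]])
reg_pred = reg.predict([[3.9]])

print("Classifier prediction:", clf_pred)
print("Regressor prediction :", reg_pred)
```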

Let's inspect our data again.
* The target variable is 'species' and we don't have missing data.

df_clf.head()

We are getting more comfortable with ML, but it is worth remembering that this exercise is an example of supervised learning, where the ML task is classification. The same principle applies when the ML task is Regression.
* For that workflow, it is wise to split the data into train and test sets.
* In the previous units, we explained the `train_test_split()` function. From now on, we will just state "We split the data into train and test sets"

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['species'],axis=1),
                                    df_clf['species'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

As we are using a clean complete data set in this example, we will not need the data cleaning or the feature engineering steps.
* We then set feature scaling, feature selection and modelling using the DecisionTreeClassifier. We set random_state, so the results will be reproducible anywhere. We chain these steps in a sklearn Pipeline.

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.tree import DecisionTreeClassifier


def pipeline_decision_tree_clf():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),

      ( "feat_selection",SelectFromModel(DecisionTreeClassifier(random_state=101)) ),
      
      ( "model", DecisionTreeClassifier(random_state=101)),

    ])

  return pipeline

pipeline_decision_tree_clf()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> It is time to fit the pipeline, so the model can learn the relationships between the features and the target. We create a variable pipeline (it could have any name) and call the function where we set our pipeline.

pipeline = pipeline_decision_tree_clf()
pipeline.fit(X_train, y_train)
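
A side sketch on why the feature selection step can drop features: to our understanding, `SelectFromModel` keeps the features whose importance is at or above a threshold, and for tree-based estimators that threshold defaults to the mean importance. The example uses scikit-learn's bundled iris data and hypothetical names (`X_demo`, `selector`); it is illustrative and separate from the lesson's pipeline.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data as plain arrays
X_demo, y_demo = load_iris(return_X_y=True)

selector = SelectFromModel(DecisionTreeClassifier(random_state=101))
selector.fit(X_demo, y_demo)

# Importances of the fitted inner estimator and the threshold that was applied
importances = selector.estimator_.feature_importances_
kept = selector.get_support()

print("Importances:", importances.round(3))
print("Threshold  :", round(float(selector.threshold_), 3))
print("Kept mask  :", kept)
```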

Like in the previous notebook, we now start evaluating the pipeline. Since it is a tree-based model, we can assess the importance of the features in the model using `.feature_importances_`
* We created a custom function to assess feature importance on tree-based models. It takes the model and the variables that "hit" the model. Check the pseudo-code, comments and docstrings to understand the logic.
* Don't worry if, at first, you don't understand.  Expect it to take some time to absorb.

def feature_importance_tree_based_models(model, columns):
  """
  Gets the model, and the columns used to train the model
  - we use the model.feature_importances_ and columns to make a
  DataFrame that shows the importance of each feature
  - next, we print the features name and its relative importance order,
  followed by a barplot indicating the importance

  """

  # create DataFrame to display feature importance
  df_feature_importance = (pd.DataFrame(data={
      'Features': columns,
      'Importance': model.feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

  best_features = df_feature_importance['Features'].to_list()

  # Most important features statement and plot
  print(f"* These are the {len(best_features)} most important features in descending order. "
        f"The model was trained on them: \n{df_feature_importance['Features'].to_list()}")

  df_feature_importance.plot(kind='bar',x='Features',y='Importance')
  plt.show()


Let's check that.
* The `model` argument is the 'model' step from the pipeline (we don't pass the whole pipeline, since we need only the model step)
* In the `columns` argument, we use the feature selection step to grab a boolean array indicating which features hit the model: `pipeline['feat_selection'].get_support()`. This array is used to subset the train set columns.
* Note that only 2 features - `['petal_width', 'petal_length']` - out of 4, were used to train the model and they have roughly similar relevance

feature_importance_tree_based_models(model = pipeline['model'],
                                     columns =  X_train.columns[pipeline['feat_selection'].get_support()]
                                     )
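
The indexing used above can be sketched on its own. The example assumes scikit-learn's bundled iris data loaded as a DataFrame and a hypothetical `demo_pipeline` with the same three steps; it only shows how the boolean mask from `.get_support()` lines up with the columns seen by the model step.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data as a DataFrame
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("feat_selection", SelectFromModel(DecisionTreeClassifier(random_state=101))),
    ("model", DecisionTreeClassifier(random_state=101)),
])
demo_pipeline.fit(X_demo, y_demo)

# Boolean mask: True for every feature that passed feature selection
mask = demo_pipeline["feat_selection"].get_support()
selected_columns = X_demo.columns[mask]

# The model step holds one importance value per selected feature
n_importances = len(demo_pipeline["model"].feature_importances_)
print(list(selected_columns), n_importances)
```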

It is time to evaluate the classifier. We are using the same custom function for evaluating the classifier as used in the last notebook. 

# loads confusion_matrix and classification_report from sklearn
from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):
  """
  Gets features, target, pipeline and how labelled (named) the levels from your target

  - it predicts based on features
  - compare predictions and actuals in a confusion matrix
    - the first argument stays as rows and the second stay as columns in the matrix
    - we will use the pattern where the predictions are in the row and actual values are in the columns
    - to refresh that, revert to the Performance Metric video in Module 2
  - show classification report

  """

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")



def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  """
  gets the features and target from train and test set, pipeline how
  you labelled (named) the levels from your target
  - for each set (train and test), it calls the function above to show the confusion matrix
  and classification report for both train and test set
  """

  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)
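
The row/column convention described in the docstring can be checked on a tiny made-up example. `confusion_matrix` places its first argument on the rows, so swapping the two arguments transposes the matrix. The labels below are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
actual    = ["cat", "cat", "dog", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog", "dog"]

# Default usage: actual values on the rows, predictions on the columns
cm_default = confusion_matrix(y_true=actual, y_pred=predicted)

# Swapped usage (as in the function above): predictions on the rows
cm_swapped = confusion_matrix(y_true=predicted, y_pred=actual)

print(cm_default)
print(cm_swapped)
```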

You will notice that in this dataset, the target variable isn't a set of numbers referring to classes, but rather a set of strings.
* We are passing the unique values of the target, from df_clf, as the `label_map` parameter.

df_clf['species'].unique()
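
One detail worth keeping in mind: `.unique()` returns labels in order of appearance, whereas scikit-learn stores and reports classes in sorted order. The two coincide only when the labels first appear in sorted order. The sketch below uses a made-up three-row DataFrame (`toy`) to show the difference; sorting the labels before using them as display names is a defensive option, not something the lesson requires.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up data where labels do NOT appear in alphabetical order
toy = pd.DataFrame({
    "x": [3.0, 1.0, 2.0],
    "species": ["virginica", "setosa", "versicolor"],
})

order_of_appearance = list(toy["species"].unique())

clf = DecisionTreeClassifier(random_state=101).fit(toy[["x"]], toy["species"])
sorted_labels = list(clf.classes_)  # scikit-learn keeps classes sorted

print(order_of_appearance)
print(sorted_labels)
```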

Let's evaluate the classifier then
* Note the model aced all predictions in the train set, which is an indication that it learned all the relationships from the training data. That is good, but let's check on the test set
* As we may expect, performance on the test set was a bit lower (the confusion matrix shows wrong predictions between Virginica and Versicolor). It is still very good, and not much different from the train set, which is a good indication that the model didn't overfit.

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline,
                label_map= df_clf['species'].unique()
                )

One additional aspect of using a Decision Tree is visualising the created tree.
* Scikit-learn has a `plot_tree()` function that can help us; the documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html). We pass:
* `decision_tree` as the model step in our pipeline
* `feature_names` as the variables used to train the model. That is done by extracting the information from the feature selection step
* `class_names` taken from the unique values of species
* The remaining arguments help us to get a cleaner visualisation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Just a side note, this decision tree is simple, however, when it comes to big trees, the visualization might become too big or more difficult to interpret.
* In this example, the first decision is made on petal_width: if the scaled value is smaller than -0.47, the flower is Setosa; if not, it goes to another decision point. The next decision is on petal_length: if the scaled value is smaller than 0.58, it is Versicolor; otherwise it is Virginica.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> **Note the beauty: the algorithm computed by itself the pattern and now can predict. That is the major difference between ML and traditional programming. Here, we have data and an objective (predict species), and then the computer finds the best rule for that. In traditional programming, the developer has to set the rules**

* However, the decision points still look odd: what does -0.47 mean for petal_width, given that a width cannot be negative? Let's explore that in the following cells.

from sklearn import tree

fig = plt.figure(figsize=(15,15))
tree.plot_tree(decision_tree = pipeline['model'], 
               # plain lists are accepted across scikit-learn versions
               feature_names = list(X_train.columns[pipeline['feat_selection'].get_support()]),
               class_names = list(df_clf['species'].unique()),
               filled=True,
               rounded=True,
               fontsize=9,
               impurity=False)
plt.show()

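The negative values in the previous plot come from the feature scaling step, which standardised the data with a standard scaler. We can grab this pipeline step and use `.inverse_transform()` to convert the scaled values back to the original units.
* We create a DataFrame with the same columns as the train set. For petal_width and petal_length we set the decision points from the previous plot, then pass the DataFrame to `.inverse_transform()`.
* `.inverse_transform()` converts values column by column, by position, so we reorder the DataFrame to the train set column order before calling it. Otherwise each value would be converted with another feature's mean and standard deviation.
* The decision points in the original units are the values in the petal_width and petal_length columns of the output.

scaled_data = pd.DataFrame(data={'petal_width':-0.472,
                                 'petal_length':0.578,
                                 'sepal_length':1.0, # this value doesn't matter, but needs to be here
                                 'sepal_width':1.0}, # this value doesn't matter, but needs to be here
                           index=[0])

# reorder the columns to match the order the scaler was fitted on
scaled_data = scaled_data[X_train.columns]

# wrap the result in a DataFrame so each value is labelled with its feature
pd.DataFrame(pipeline['feat_scaling'].inverse_transform(scaled_data),
             columns=X_train.columns)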
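
For intuition, undoing standard scaling is just `original = scaled * standard_deviation + mean`, applied per feature. The sketch below checks that against `StandardScaler.inverse_transform` on made-up numbers; `raw` and `scaled_point` are illustrative values, not taken from the iris data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on different scales
raw = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])
scaler = StandardScaler().fit(raw)

# A point expressed in scaled units (one value per feature, same column order)
scaled_point = np.array([[-0.5, 0.5]])

# Manual inversion using the statistics learned during fit
manual = scaled_point * scaler.scale_ + scaler.mean_
via_sklearn = scaler.inverse_transform(scaled_point)

print(manual, via_sklearn)
```

Because the statistics are stored per column, the columns passed in must follow the same order as the data used to fit the scaler.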

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Random Forest

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may refer back to module 2 (ML Essentials) in the Algorithms lesson to refresh the algorithms we will cover. We are not going deep into the mathematical functions; the idea is to present the concept and the algorithm application.


* The random forest is made of many decision trees and it is an ensemble method. It uses bagging and feature randomness when building each individual tree, aiming to create an uncorrelated collection of trees, where the prediction from the set of trees is more accurate than that of any individual tree.
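
A minimal sketch of the "many trees, one prediction" idea for regression: the forest's prediction is the average of its individual trees' predictions. The data is synthetic (`make_regression`) and the names `forest`, `per_tree` and `averaged` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=101)

forest = RandomForestRegressor(n_estimators=25, random_state=101)
forest.fit(X_demo, y_demo)

# Predict the first five rows with each tree, then average across trees
per_tree = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

forest_pred = forest.predict(X_demo[:5])
print(averaged.round(2), forest_pred.round(2))
```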



Once again, the same algorithm has a different estimator depending on the tasks: Regression or Classification. Find the documentation here for both, [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
* We will import both but will use `RandomForestRegressor` for the exercise.

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

We will use the Boston dataset to fit an ML pipeline to predict the sales price using the Random Forest algorithm

df_reg.head()

We split the data into train and test sets. The target variable is 'price'

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_reg.drop(['price'],axis=1),
                                    df_reg['price'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

We create the pipeline using a similar structure as the previous example. There are 3 steps: scaling, feature selection and modelling. 
* We know in advance the data doesn't require data cleaning.

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.ensemble import RandomForestRegressor


def pipeline_random_forest_reg():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(RandomForestRegressor(random_state=101)) ),
      ( "model", RandomForestRegressor(random_state=101)),

  ])

  return pipeline

pipeline_random_forest_reg()

We will fit the pipeline to the train set (features and target) using `.fit()`


pipeline = pipeline_random_forest_reg()
pipeline.fit(X_train, y_train)

Since it is a tree-based model, we can assess the importance of the features in the model with `.feature_importances_`, using the custom function from the previous section
* Note that from 13 features, the model was trained on 2: LSTAT and RM, where LSTAT is more important to the model

feature_importance_tree_based_models(model = pipeline['model'],
                                     columns =  X_train.columns[pipeline['feat_selection'].get_support()])

We will evaluate the regressor pipeline using the same custom function from the last unit notebook 

# import regression metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 
# we will use numpy to calculate RMSE based on MSE (mean_squared_error)
import numpy as np


def regression_performance(X_train, y_train, X_test, y_test,pipeline):
  """
  # Gets train/test sets and pipeline and evaluates the performance
  - for each set (train and test) call regression_evaluation()
  which will evaluate the pipeline performance
  """

  print("Model Evaluation \n")
  print("* Train Set")
  regression_evaluation(X_train,y_train,pipeline)
  print("* Test Set")
  regression_evaluation(X_test,y_test,pipeline)



def regression_evaluation(X,y,pipeline):
  """
  # Gets features and target (either from train or test set) and pipeline
  - it predicts using the pipeline and the features
  - calculates performance metrics comparing the prediction to the target
  """
  prediction = pipeline.predict(X)
  # built-in round() works whether the metric returns a Python float or a NumPy float
  print('R2 Score:', round(r2_score(y, prediction), 3))
  print('Mean Absolute Error:', round(mean_absolute_error(y, prediction), 3))
  print('Mean Squared Error:', round(mean_squared_error(y, prediction), 3))
  print('Root Mean Squared Error:', round(np.sqrt(mean_squared_error(y, prediction)), 3))
  print("\n")

  

def regression_evaluation_plots(X_train, y_train, X_test, y_test,pipeline, alpha_scatter=0.5):
  """
  # Gets Train and Test set (features and target), pipeline, and adjust dots transparency 
  at scatter plot
  - It predicts on train and test set
  - It creates Actual vs Prediction scatterplots, for train and test set
  - It draws a red diagonal line. In theory, a good regressor should predict
  close to the actual, meaning the dot should be close to the diagonal red line
  The closer the dots are to the line, the better

  """
  pred_train = pipeline.predict(X_train)
  pred_test = pipeline.predict(X_test)


  fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,6))
  sns.scatterplot(x=y_train , y=pred_train, alpha=alpha_scatter, ax=axes[0])
  sns.lineplot(x=y_train , y=y_train, color='red', ax=axes[0])
  axes[0].set_xlabel("Actual")
  axes[0].set_ylabel("Predictions")
  axes[0].set_title("Train Set")

  sns.scatterplot(x=y_test , y=pred_test, alpha=alpha_scatter, ax=axes[1])
  sns.lineplot(x=y_test , y=y_test, color='red', ax=axes[1])
  axes[1].set_xlabel("Actual")
  axes[1].set_ylabel("Predictions")
  axes[1].set_title("Test Set")

  plt.show()

* The performance on the train set is good (R2 of 0.95, MAE of 1.4, and the actual vs prediction plot is dense around the diagonal red line). R2 on the test set is still acceptable (0.71) but notably lower than on the train set, which may be a sign of overfitting.
* In the actual vs predictions plots, the dots sit closer to the diagonal line in the train set than in the test set. That reinforces the previous point.
* Following the diagonal line means the predictions tend to follow the actual value.
* This pipeline was trained with the default algorithm hyperparameters (number of trees, max depth, etc.). Choosing better values requires understanding what each hyperparameter does and how it typically affects performance. We will cover how to train with multiple hyperparameters in an upcoming lesson.

regression_performance(X_train, y_train, X_test, y_test, pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, 
                            pipeline, alpha_scatter=0.5)
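
To see how a hyperparameter can change the gap between train and test performance, here is a hedged sketch on synthetic data (not the Boston dataset). It compares the default unlimited depth with `max_depth=3`; the value 3 and the names `gaps`, `X_tr` and `X_te` are arbitrary choices for illustration. A smaller gap does not by itself mean a better test score, so both numbers are worth reading.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, noisy regression data for illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=25.0, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=101)

gaps = {}
for depth in (None, 3):  # None is the default: trees grow until leaves are pure
    model = RandomForestRegressor(max_depth=depth, random_state=101)
    model.fit(X_tr, y_tr)
    r2_train = r2_score(y_tr, model.predict(X_tr))
    r2_test = r2_score(y_te, model.predict(X_te))
    gaps[depth] = r2_train - r2_test
    print(f"max_depth={depth}: train R2={r2_train:.3f}, "
          f"test R2={r2_test:.3f}, gap={gaps[depth]:.3f}")
```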

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Gradient Boosting

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may refer back to module 2 (ML Essentials) in the Algorithms lesson to refresh the algorithms we will cover. We are not going deep into the mathematical functions; the idea is to present the concept and the algorithm application.

* Gradient boosting is a type of boosting. A boosting technique builds a sequence of initially weak models into an increasingly powerful one: you add models sequentially until no further improvement can be made. Gradient boosting adds each weak learner so as to reduce a loss function that captures the performance of the model, using the gradient of that loss.
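
The sequential nature of boosting can be observed with `staged_predict`, which yields the ensemble's prediction after each added learner. This sketch uses `GradientBoostingRegressor` on synthetic data so that the loss is a simple mean squared error; the data and the name `stage_mse` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=101)

booster = GradientBoostingRegressor(n_estimators=50, random_state=101)
booster.fit(X_demo, y_demo)

# Training error after 1, 2, ..., 50 weak learners have been added
stage_mse = [mean_squared_error(y_demo, pred)
             for pred in booster.staged_predict(X_demo)]

print("MSE after first learner:", round(stage_mse[0], 1))
print("MSE after last learner :", round(stage_mse[-1], 1))
```

The training error shrinks as learners are added; whether the test error follows is a separate question that the train/test evaluation answers.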




We import the algorithms. Find the documentation here for both, [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html).
* We will import both but will use `GradientBoostingClassifier` for the exercise.

from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import GradientBoostingRegressor

Let's consider the iris dataset again for the classification task

df_clf.head()

As usual, we split the data into train and test sets, considering 'species' as the target variable

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['species'],axis=1),
                                    df_clf['species'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

The pipeline is similar to the one used in the previous section where we considered the iris dataset.
* There are 3 steps: feature scaling, feature selection and modelling, and here we consider the Gradient Boosting Classifier

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.ensemble import GradientBoostingClassifier 


def pipeline_gradient_boost_clf():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(GradientBoostingClassifier(random_state=101)) ),
      ( "model", GradientBoostingClassifier(random_state=101)),

    ])

  return pipeline


We fit the pipeline with the train set.

pipeline = pipeline_gradient_boost_clf()
pipeline.fit(X_train, y_train)

And check feature importance using the same function we used previously since, for this algorithm, feature importance is assessed using the same attribute
* Note it considers only petal_length. Note also the difference; the same data in the decision tree had 2 features as the most important features. That happens since different algorithms have different mechanisms.

feature_importance_tree_based_models(model = pipeline['model'],
                                     columns =  X_train.columns[pipeline['feat_selection'].get_support()]
                                     )

Let's evaluate the pipeline using the same custom function that shows the confusion matrix and classification report for the train and test sets
* The results are the same as for the Decision Tree on the same dataset.
* The only difference is that Gradient Boosting needed only 1 feature to reach that result, whereas the decision tree needed 2. For this data, that makes Gradient Boosting the more attractive option, since a system with fewer features is simpler to build and maintain.

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline,
                label_map= df_clf['species'].unique()
                )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Ada Boost

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may refer back to module 2 (ML Essentials) in the Algorithms lesson to refresh the algorithms we will cover. We are not going deep into the mathematical functions; the idea is to present the concept and the algorithm application.


* AdaBoost (or Adaptive Boosting) is an ensemble learning method used to build a strong model from several weak models. It generates a single strong learner by iteratively adding weak learners. The result is a model with higher accuracy than any of the weak learners on its own.
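
As a rough illustration of "weak learner vs boosted ensemble", the sketch below compares one depth-1 tree (a "stump") with an AdaBoost ensemble on synthetic classification data. The data and names (`stump`, `ada`) are assumptions for illustration, and the scores are computed on the training data only, so they say nothing about generalisation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, random_state=101)

# A single weak learner: a decision tree with one split
stump = DecisionTreeClassifier(max_depth=1, random_state=101).fit(X_demo, y_demo)

# AdaBoost combines up to 50 weak learners, added one at a time
ada = AdaBoostClassifier(n_estimators=50, random_state=101).fit(X_demo, y_demo)

stump_acc = stump.score(X_demo, y_demo)
ada_acc = ada.score(X_demo, y_demo)
print(f"Single stump accuracy: {stump_acc:.3f}")
print(f"AdaBoost accuracy    : {ada_acc:.3f} ({len(ada.estimators_)} learners)")
```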


We import the algorithms. Find the documentation here for both, [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html).


* We will import both but will use `AdaBoostRegressor` for the exercise.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import AdaBoostRegressor

We will use the Boston dataset to fit an ML pipeline to predict the sales price using the Ada Boost algorithm

df_reg.head()

We split the data into train and test sets. The target variable is 'price'

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_reg.drop(['price'],axis=1),
                                    df_reg['price'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

We create the pipeline using the same steps as previously but now considering the Ada Boost Regressor

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.ensemble import AdaBoostRegressor

def pipeline_adaboost_reg():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(AdaBoostRegressor(random_state=101)) ),
      ( "model", AdaBoostRegressor(random_state=101)),

    ])

  return pipeline


We fit the pipeline to the train set (in the same manner we did previously)

pipeline = pipeline_adaboost_reg()
pipeline.fit(X_train, y_train)

And assess feature importance using our custom function
* Note this pipeline selects 3 variables to train the model: `['LSTAT', 'RM', 'DIS']`


feature_importance_tree_based_models(model = pipeline['model'],
                                     columns =  X_train.columns[pipeline['feat_selection'].get_support()])

We now evaluate the pipeline using the custom function.
* The R2 score is 0.9 on the train set and 0.78 on the test set. Ideally the gap would be smaller, but it is lower than the gap we saw for Random Forest.
* In the actual vs predictions plots, the train set dots are around the diagonal line (though not as close as for Random Forest).
* Following the diagonal line means the predictions tend to follow the actual value.

regression_performance(X_train, y_train, X_test, y_test,pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, 
                            pipeline, alpha_scatter=0.5)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  XG Boost

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may refer back to module 2 (ML Essentials) in the Algorithms lesson to refresh the algorithms we will cover. We are not going deep into the mathematical functions; the idea is to present the concept and the algorithm application.


* XGBoost stands for eXtreme Gradient Boosting and is an extension of gradient-boosted decision trees, specially designed to improve speed and performance. It has regularisation features that help to avoid overfitting. It is a separate software library that you need to install; it is not part of the Scikit-learn library.


We import the algorithms. Find [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) the documentation for both
* We will import both but will use `XGBClassifier` for the exercise.

from xgboost import XGBRegressor
from xgboost import XGBClassifier

Let's consider the iris dataset again for the classification task

df_clf.head()

Let's split the data into train and test sets, where the target variable is 'species' 

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['species'],axis=1),
                                    df_clf['species'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

We create the pipeline using the same steps as previously but now considering XGBoost

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from xgboost import XGBClassifier


def pipeline_xgboost_clf():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(XGBClassifier(random_state=101)) ),
      ( "model", XGBClassifier(random_state=101)),

    ])

  return pipeline


We fit the pipeline to the train data

pipeline = pipeline_xgboost_clf()
pipeline.fit(X_train, y_train)
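
Depending on the installed xgboost version, `XGBClassifier` may reject string targets and expect the classes encoded as integers starting at 0. That is an assumption about your environment, so check the xgboost documentation if `.fit()` raises an error about invalid classes. In that case, one option is to encode the target with scikit-learn's `LabelEncoder` before fitting. The sketch below shows only the encoding step on a made-up Series (`species`); it does not import xgboost.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up target values for illustration
species = pd.Series(["setosa", "virginica", "versicolor", "setosa"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)      # strings -> integers 0..n-1
decoded = encoder.inverse_transform(encoded)  # integers -> original strings

print(list(encoder.classes_))
print(encoded.tolist())
```

`encoder.classes_` keeps the sorted class names, so it can also serve as the list of display labels when reading a confusion matrix.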

And assess the feature importance
* Note only petal_length is relevant to fit the model. 



feature_importance_tree_based_models(model = pipeline['model'],
                                     columns =  X_train.columns[pipeline['feat_selection'].get_support()]
                                     )

Let's assess the pipeline performance
* The performance is the same as Gradient Boost on the train and test sets. So for classification, of the three algorithms we tested (decision tree, gradient boost and XGBoost), the last 2 are good candidates and best suit the data. However, we will study another method to test more algorithms simultaneously and avoid the separate, algorithm-by-algorithm analysis we are doing now.

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline,
                label_map= df_clf['species'].unique() 
                )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  ExtraTree

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may refer back to module 2 (ML Essentials) in the Algorithms lesson to refresh the algorithms we will cover. We are not going deep into the mathematical functions; the idea is to present the concept and the algorithm application.


* Extra Trees (or Extremely Randomized Trees) is an ensemble algorithm. It works by creating a large number of unpruned trees. Predictions are made by averaging the prediction of the decision trees when it is regression or using majority voting when it is classification.


We import the algorithms. Find the documentation here for both, [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html).
* We will import both but will use `ExtraTreesRegressor` for the exercise.

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import ExtraTreesRegressor

We will use the Boston dataset to fit an ML pipeline to predict the sales price

df_reg.head()

Let's split the data into train and test sets using 'price' as a target variable 

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_reg.drop(['price'],axis=1),
                                    df_reg['price'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

We create the pipeline using the same steps as previously but now considering the Extra Tree Regressor

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.ensemble import ExtraTreesRegressor

def pipeline_extra_tree_reg():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(ExtraTreesRegressor(random_state=101)) ),
      ( "model", ExtraTreesRegressor(random_state=101)),

    ])

  return pipeline


We fit the pipeline with the train set.

pipeline = pipeline_extra_tree_reg()
pipeline.fit(X_train, y_train)

And evaluate feature importance using our custom function
* It used `['LSTAT', 'RM']` and LSTAT is more important.
* Just to reinforce: different algorithms may consider different features to find patterns in the data. Random Forest selected the same two features, while Ada Boost also selected the variable DIS.

feature_importance_tree_based_models(model = pipeline['model'],
                                     columns =  X_train.columns[pipeline['feat_selection'].get_support()])

Let's now evaluate the pipeline
* Note the pipeline was perfect on the train set (R2 score of 1) but much weaker on the test set (R2 score of 0.68).
* This is a sign the model overfits: it performs far better on the train set than on data it has not seen, like the test set.
* Overall, for this dataset and among Random Forest, Ada Boost and Extra Tree, Ada Boost performed best: it has the highest R2 on the test set (0.78) and generalises better (the difference between train and test performance is the smallest).
* Again, we analyse each algorithm separately for learning purposes; in the next unit, we will learn how to evaluate all the algorithms simultaneously.

regression_performance(X_train, y_train, X_test, y_test,pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, 
                            pipeline, alpha_scatter=0.5)

---

# Scikit-learn - Unit 05A - Cross Validation Search (GridSearchCV) and Hyperparameter Optimization Regression - Part 01

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and use GridSearchCV for Hyperparameter Optimization




---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

We will install scikit-learn, xgboost, feature-engine and yellow brick to run our exercises

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 05A - Cross Validation Search (GridSearchCV ) and Hyperparameter Optimisation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Good job! You fitted multiple pipelines considering different algorithms separately for regression and classification tasks. However, how do you know which was better for a given ML task? 
* Imagine for the classification task on the iris dataset, you fitted three individual pipelines, evaluated each separately, and concluded a given algorithm was better. That is fair enough, but you want a more effective way to assess multiple algorithms.
* However, you also fitted the models with the default hyperparameters and may wonder: what if for a given algorithm, I could fit multiple models using different hyperparameters and find a model with even better performance?




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%208-%20Challenge.png"> Let's learn how to use GridSearchCV and do Hyperparameter Optimization using multiple algorithms. 
* This is the heart of conventional ML. We will split this topic into two parts: 
  * In the next three notebooks, we will show how to conduct hyperparameter optimization using one algorithm (for Regression, Binary Classification and Multiclass Classification).
  * Then we will cover how to do hyperparameter optimization using multiple algorithms at once.


---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Hyperparameter Optimization with one algorithm

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In this section, we will select a given algorithm and fine-tune it, by defining a set of hyperparameters. 
* For each possible hyperparameter combination, a set of models will be fitted, based on the cross-validation parameter. For example, if the developer sets cross-validation as 5, it will fit five models for a given hyperparameter combination. 
* These five models are scored against a performance metric (e.g., for regression, it could be the R2 score), and the average performance is computed. This average is the cross-validated performance for a given configuration of hyperparameters. 
* This process is repeated then for every combination of hyperparameters. 
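The loop described in the bullets above can be written out by hand before handing it over to GridSearchCV. This is a minimal sketch, assuming synthetic data from `make_regression` and a plain `RandomForestRegressor` (no pipeline); the variable names are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption) so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=101)

cv_means = {}
for n_estimators in [10, 20]:                      # one hyperparameter, two values
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=101)
    # cv=5 fits five models, each validated on a different fold
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_means[n_estimators] = fold_scores.mean()    # cross-validated performance
    print(n_estimators, fold_scores.round(3), "mean:", round(fold_scores.mean(), 3))

# The combination with the highest average score is the one we would keep
best_n = max(cv_means, key=cv_means.get)
print("best n_estimators:", best_n)
```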



<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">  Let's quickly recap how we use the data for fitting supervised models
* The training subset of the data is used to fit or train the model.
* A subset of the training set is known as the validation set. This is used during fitting to compare one model against another, and when choosing or tuning hyperparameters. 
* The final subset of data used to test the model is known as the test set. This assesses the final model's performance and it must be data that is new to the model to give an unbiased result. The test set closely replicates what the deployed model will see in real-time usage. 


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png
">   When we do Hyperparameter Optimisation, a part of the train set is automatically subset as a validation set and the model is fitted using cross-validation. 
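To see how validation folds are carved out of the train set, the sketch below splits ten hypothetical training rows with `KFold`. It is an illustration only: when `cv` is an integer, GridSearchCV uses `KFold` for regressors and `StratifiedKFold` for classifiers, and the tiny array here stands in for a real train set.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical training rows (assumption), identified by their position
X_train_demo = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)
splits = list(kfold.split(X_train_demo))

for fold, (fit_idx, val_idx) in enumerate(splits, start=1):
    # Each fold is held out once for validation; the rest is used for fitting
    print(f"fold {fold}: fit on rows {fit_idx}, validate on rows {val_idx}")
```

Every training row is used for validation exactly once, and the test set is never touched during this process.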


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> How can I do that in Scikit-learn? 
* We use a class called **GridSearchCV**, which fits multiple models by looping through a list of hyperparameter values for a given model. 
* Note: CV here means cross-validation. It uses cross-validation to compare the different hyperparameter combinations, so in the end we can select, from the listed hyperparameters, the combination that achieves the best performance.
* Ultimately, it helps to automate the process to find the best combination of hyperparameters for a given algorithm in a given dataset. The documentation is found [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We will split this section to show use cases for GridSearchCV on Regression, Binary Classification and Multiclass Classification tasks
* When using GridSearchCV, the difference between these ML tasks lies in the scoring parameter, which tells which performance metric should be used to select the best model.

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Regression

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 We are going to consider a similar workflow we studied earlier:
* Split the data
* Define the pipeline 
* Fit the pipeline
* Evaluate the pipeline

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">The differences now are:
* we decide on a list of hyperparameters to optimise our model while fitting it
* we need a performance metric to decide which model is the best in cross-validating the models.


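We will use the Boston dataset from sklearn. It has house price records and characteristics, like the average number of rooms per dwelling and Boston's per capita crime rate. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so the cell below needs an earlier scikit-learn version.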

from sklearn.datasets import load_boston
data = load_boston()
df_reg = pd.DataFrame(data.data,columns=data.feature_names)
df_reg['price'] = pd.Series(data.target)

df_reg = df_reg.sample(frac=0.5, random_state=101)

print(df_reg.shape)
df_reg.head()

We split the data into train and test sets.

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_reg.drop(['price'],axis=1),
                                    df_reg['price'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Similar to previous examples, we set the pipeline with three steps; feature scaling, feature selection and modelling. 
* For the purpose of learning hyperparameter optimisation, we just set the algorithm used to RandomForestRegressor. However, we encourage you to try additional algorithms. There are example algorithms commented out that you can try.

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
#from sklearn.ensemble import AdaBoostRegressor
#from sklearn.linear_model import LinearRegression
#from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
#from sklearn.tree import DecisionTreeRegressor

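# Note: the function keeps the name used in the rest of this notebook,
# although the pipeline it returns uses RandomForestRegressor, not AdaBoost.
def pipeline_adaboost_reg():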
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(RandomForestRegressor(random_state=101)) ),
      ( "model", RandomForestRegressor(random_state=101)),

    ])

  return pipeline


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We should parse the **algorithm's hyperparameters**, in a dictionary, with support from its documentation, which is found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
* Since we are fitting a pipeline that contains a set of steps, you should state in your hyperparameter list which step each hyperparameter belongs to. In our case, we named the modelling step `"model"`, so we add the prefix `"model__"` before the hyperparameter name.
* For this example, we picked only one hyperparameter: n_estimators. For n_estimators, we pass a list with 10 and 20 (the default value is 100, but for faster computation, in this teaching example we set it to 10 and 20)
* It will take time and practical experience to make sense of which hyperparameters are more useful for each algorithm and what are the typical ranges to consider when listing hyperparameter's values
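If you are unsure which names are valid in `param_grid`, the pipeline can list them for you. The sketch below rebuilds the same three-step pipeline under a hypothetical name (`demo_pipeline`) and uses `get_params()` and `ParameterGrid` to inspect the available names and count the combinations of an illustrative grid; the extra `max_depth` values are an assumption added only to show how combinations multiply.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("feat_selection", SelectFromModel(RandomForestRegressor(random_state=101))),
    ("model", RandomForestRegressor(random_state=101)),
])

# get_params() lists every tunable name, already prefixed with its step name
model_params = sorted(k for k in demo_pipeline.get_params() if k.startswith("model__"))
print(model_params)

# ParameterGrid expands a grid into the individual combinations to be tried
demo_grid = {"model__n_estimators": [10, 20], "model__max_depth": [3, None]}
combinations = list(ParameterGrid(demo_grid))
print(len(combinations), "combinations")
for combo in combinations:
    print(combo)
```

The same double-underscore convention reaches nested estimators, for example `feat_selection__estimator__n_estimators` for the forest inside `SelectFromModel`.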

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html  # documentation is here
param_grid = {"model__n_estimators":[10,20],
              }

param_grid

We import GridSearchCV. Its documentation is found [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). We pass
*  `estimator` as the pipeline, and `param_grid` as the dictionary we stated above. 
* `cv` sets the number of folds used to cross-validate each hyperparameter combination. It uses k-fold cross-validation: the train set is divided into k subsets (folds), and the model is fitted k times, each time holding out a different fold for validation and training on the remaining k-1 folds.
* `n_jobs`, according to the documentation, is the number of jobs to run in parallel. -1 means using all processors, whereas -2 uses all but one.
* `scoring` is the evaluation metric that you want to use.  That will depend on the ML task you are considering. In this case, it is regression, so we set the R2 score as the metric. Other options would include: `'neg_mean_absolute_error'`, `'neg_mean_squared_error'`
* `verbose`, according to the documentation, controls the verbosity: the higher, the more messages. As this is a teaching example we set it as 3, so you get more information returned about the process.
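The `neg_` prefix in the scoring names deserves a quick illustration: scikit-learn scorers follow a "higher is better" convention, so error metrics are negated. The sketch below assumes a tiny hand-made dataset and a `LinearRegression` model, both chosen only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_absolute_error

# Tiny hand-made dataset (assumption)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([0.1, 0.9, 2.2, 2.8, 4.1])
model = LinearRegression().fit(X_demo, y_demo)

# The plain metric: lower is better
mae = mean_absolute_error(y_demo, model.predict(X_demo))

# The scorer behind scoring='neg_mean_absolute_error': higher is better
neg_mae_scorer = get_scorer("neg_mean_absolute_error")
score = neg_mae_scorer(model, X_demo, y_demo)

print("MAE:", mae, "| scorer value:", score)
```

When reading `mean_test_score` for such a metric, values closer to zero are therefore the better ones.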

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png
"> We create an object (called the grid) that contains a GridSearchCV using the above parameters. Next, we fit this object to the train set (features and target).

* That will fit multiple RandomForestRegressor models. With the hyperparameters listed above there are two hyperparameter combinations (`n_estimators` of 10 and of 20).
* Each combination is evaluated with 2-fold cross-validation, since cv=2. Therefore each combination will be fitted twice, each time on a different half of the train set.
* In total, the search runs four cross-validation fits: two combinations, each fitted two times. With the default `refit=True`, GridSearchCV then fits the best combination once more on the whole train set.
* Note, the two scores for `model__n_estimators=10` (the first two test_scores) are 0.618 and 0.695. The mean is 0.656. We will highlight this mean to you when computing the cross-validation results for this hyperparameter combination in upcoming cells.
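The arithmetic in the bullets above can be checked directly from a fitted search. The following sketch assumes synthetic data and a bare `RandomForestRegressor` (so the parameter name has no `model__` prefix); the scores it prints will differ from the ones quoted in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption) so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=120, n_features=5,
                                 noise=10.0, random_state=101)

demo_search = GridSearchCV(estimator=RandomForestRegressor(random_state=101),
                           param_grid={"n_estimators": [10, 20]},
                           cv=2, scoring="r2")
demo_search.fit(X_demo, y_demo)

results = demo_search.cv_results_
n_combinations = len(results["params"])
cv_fits = n_combinations * demo_search.n_splits_   # 2 combinations x 2 folds

# One column of scores per fold; the row-wise mean is mean_test_score
fold_scores = np.column_stack([results["split0_test_score"],
                               results["split1_test_score"]])
print("cross-validation fits:", cv_fits)
print("per-fold scores:\n", fold_scores)
print("row means:", fold_scores.mean(axis=1))
print("mean_test_score:", results["mean_test_score"])
```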



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Remember, the validation set is automatically defined using GridSearchCV. You pass the training set and it will subset the validation set as a part of the training set.

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator=pipeline_adaboost_reg(),
                    param_grid=param_grid,
                    cv=2,
                    n_jobs=-2,
                    verbose=3,  # for learning, we set 3 to print the score from every cross-validation
                    scoring='r2')


grid.fit(X_train,y_train)

The results of all models and their respective cross-validations are stored in the attribute `.cv_results_`
* When you access this attribute, you will see it is a dictionary and when displayed as is, is not very informative.

grid.cv_results_

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> One way to make it more informative is to convert it to a DataFrame, sort the values by 'mean_test_score', keep only the 'params' and 'mean_test_score' columns and convert the result to an array, using .values
* In the end, you print a simplified ordered list showing the results for optimising a model with multiple hyperparameter combinations.
  * For example, we see the best hyperparameter configuration is `model__n_estimators=10`, which gave a mean R2 score of 0.656
  * Previously we commented on this average performance: it is the mean of the two cross-validated models for this hyperparameter combination (0.618 and 0.695). In this particular example, considering this set of hyperparameter combinations, its performance was the best compared to the others.

(pd.DataFrame(grid.cv_results_)
.sort_values(by='mean_test_score',ascending=False)
.filter(['params','mean_test_score'])
.values
 )

Additionally, we can get the best parameters with `.best_params_`

grid.best_params_

This is interesting, but we want to have the pipeline with the
highest score, for real-world usage. To get it, use the attribute `.best_estimator_`
* This is the most important aspect of the grid search, where we grab the pipeline, which we will evaluate and potentially use.
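* Note, when fitting the pipelines, you saw the score for each cross-validated model for each hyperparameter combination. The pipeline in `.best_estimator_` is not one of those cross-validated models: with the default `refit=True`, GridSearchCV fits a new pipeline with the best hyperparameter combination on the whole train set. For reference, the cross-validated scores for the best combination were: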
  * `[CV 1/2] END ............model__n_estimators=10;, score=0.618 total time=   0.2s`
  * `[CV 2/2] END ............model__n_estimators=10;, score=0.695 total time=   0.2s`
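The relationship between `best_params_`, `best_score_` and `best_estimator_` can be verified on a small example. This is a sketch under assumptions: synthetic data, a bare `RandomForestRegressor` instead of the lesson's pipeline, and the default `refit=True`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption) so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=120, n_features=5,
                                 noise=10.0, random_state=101)

demo_search = GridSearchCV(RandomForestRegressor(random_state=101),
                           {"n_estimators": [10, 20]},
                           cv=2, scoring="r2")   # refit=True is the default
demo_search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the winning combination
best_row = demo_search.best_index_
best_mean = demo_search.cv_results_["mean_test_score"][best_row]
print(demo_search.best_params_, demo_search.best_score_, best_mean)

# best_estimator_ is a new model, refitted on all the data passed to fit()
best_model = demo_search.best_estimator_
predictions = best_model.predict(X_demo)

# Calling predict on the search object delegates to best_estimator_
same_predictions = np.allclose(demo_search.predict(X_demo), predictions)
print("search.predict matches best_estimator_.predict:", same_predictions)
```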

pipeline = grid.best_estimator_
pipeline

You can now evaluate the pipeline that you fit using hyperparameter optimisation using the techniques we covered already. We will import the custom function for regression evaluation.

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 
import numpy as np

def regression_performance(X_train, y_train, X_test, y_test,pipeline):
  print("Model Evaluation \n")
  print("* Train Set")
  regression_evaluation(X_train,y_train,pipeline)
  print("* Test Set")
  regression_evaluation(X_test,y_test,pipeline)



def regression_evaluation(X,y,pipeline):
  prediction = pipeline.predict(X)
  # built-in round() works whether the metric returns a Python float or a NumPy float
  print('R2 Score:', round(r2_score(y, prediction), 3))
  print('Mean Absolute Error:', round(mean_absolute_error(y, prediction), 3))
  print('Mean Squared Error:', round(mean_squared_error(y, prediction), 3))
  print('Root Mean Squared Error:', round(np.sqrt(mean_squared_error(y, prediction)), 3))
  print("\n")

  

def regression_evaluation_plots(X_train, y_train, X_test, y_test,pipeline, alpha_scatter=0.5):
  pred_train = pipeline.predict(X_train)
  pred_test = pipeline.predict(X_test)


  fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,6))
  sns.scatterplot(x=y_train , y=pred_train, alpha=alpha_scatter, ax=axes[0])
  sns.lineplot(x=y_train , y=y_train, color='red', ax=axes[0])
  axes[0].set_xlabel("Actual")
  axes[0].set_ylabel("Predictions")
  axes[0].set_title("Train Set")

  sns.scatterplot(x=y_test , y=pred_test, alpha=alpha_scatter, ax=axes[1])
  sns.lineplot(x=y_test , y=y_test, color='red', ax=axes[1])
  axes[1].set_xlabel("Actual")
  axes[1].set_ylabel("Predictions")
  axes[1].set_title("Test Set")
  plt.show()


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Next, we pass the train set, test set and our pipeline to the function
* Note the performance on the train set is good (0.938). The test set is not so good (0.685). The gap between these values may indicate overfitting.
* We may have to consider additional values for the hyperparameters or even consider other hyperparameters. Or maybe we need more data so the algorithm can find the patterns and generalise on unseen data.
* Or maybe this algorithm is not the best for this dataset. But don't worry, soon we will discover an approach to train multiple algorithms at once.
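One way to follow up on these suggestions is to widen the grid and look at train and validation scores side by side. The sketch below is an assumption-laden illustration: it uses synthetic data, a shortened two-step pipeline, and hyperparameter values (`max_depth`, `min_samples_leaf`) picked arbitrarily for demonstration, not recommended ranges.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption) so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=150, n_features=6, n_informative=3,
                                 noise=20.0, random_state=101)

demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", RandomForestRegressor(random_state=101)),
])

# Illustrative values only: 2 x 2 x 2 = 8 combinations
wider_grid = {
    "model__n_estimators": [10, 20],
    "model__max_depth": [3, None],
    "model__min_samples_leaf": [1, 5],
}

wider_search = GridSearchCV(demo_pipeline, wider_grid, cv=3, scoring="r2",
                            return_train_score=True)  # also store train scores
wider_search.fit(X_demo, y_demo)

summary = (pd.DataFrame(wider_search.cv_results_)
           .filter(["params", "mean_train_score", "mean_test_score"])
           .sort_values(by="mean_test_score", ascending=False))
# A large gap between train and validation scores hints at overfitting
summary["gap"] = summary["mean_train_score"] - summary["mean_test_score"]
print(summary.head())
```

Combinations that constrain the trees (a smaller `max_depth` or larger `min_samples_leaf`) typically show a smaller gap, at the possible cost of a lower validation score.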

regression_performance(X_train, y_train, X_test, y_test,pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, 
                            pipeline, alpha_scatter=0.5)

---

# Scikit-learn - Unit 05 - Cross Validation Search (GridSearchCV) and Hyperparameter Optimisation Binary Clf- Part 01

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and use GridSearchCV for Hyperparameter Optimisation




---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

We will install scikit-learn, xgboost, feature-engine and yellow brick to run our exercises

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 05 - Cross Validation Search (GridSearchCV ) and Hyperparameter Optimisation

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Hyperparameter Optimisation with one algorithm

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Binary Classification

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In the last section, we saw how to conduct hyperparameter tuning using one algorithm to solve a Regression problem.
* There is a tiny difference in using GridSearchCV when your ML task is classification. We will cover that now.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are going to consider a similar workflow to the one we studied earlier:
* Split the data
* Define the pipeline and hyperparameter
* Fit the pipeline
* Evaluate the pipeline

Let's load the breast cancer data from sklearn. It shows records for a breast mass sample and a diagnosis informing whether it is a malignant or benign tumour, where 0 is malignant and 1 is benign.

from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df_clf = pd.DataFrame(data.data,columns=data.feature_names)
df_clf['diagnostic'] = pd.Series(data.target)
df_clf = df_clf.sample(frac=0.5, random_state=101)


print(df_clf.shape)
df_clf.head()

We split the data into train and test set.

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['diagnostic'],axis=1),
                                    df_clf['diagnostic'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

And create a pipeline with three steps, feature scaling, feature selection and modelling using RandomForestClassifier.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

def pipeline_clf():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(RandomForestClassifier(random_state=101)) ),
      ( "model", RandomForestClassifier(random_state=101)),

    ])

  return pipeline


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We define our hyperparameter list based on the algorithm documentation. One method could be to consider the default parameter value and a set of values around the default value.
* In this case, there are two possible combinations of hyperparameters.

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
from sklearn.model_selection import GridSearchCV

param_grid = {"model__n_estimators":[50,20],
              }

param_grid

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> When we move to Classification, there will be a different GridSearchCV scoring argument. 



* We consider that in our classification projects, the potential performance metrics are: accuracy, recall, precision, and f1 score.
  * When the metric is either recall, precision or f1 score, we need to specify which class we want to tune for and use `make_scorer()` as an "auxiliary" function to help define the metric and the class to tune. The documentation for make_scorer is found [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html)
  * When your performance metric is recall, you need to import [recall_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html), if it is precision, [precision_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html) and if it is f1 score, you need to import [f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html); so you can pass the metric to the `make_scorer()` function.
  * When your performance metric is accuracy, you simply write "accuracy" for scoring: `scoring='accuracy'`





<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In this exercise, we have 0 and 1 as diagnostics for breast cancer. 
* We assume that when defining the ML business case, it was agreed that the performance metric is recall on malignant (0) since the client needs to detect a malignant case. 
* The client doesn't want to miss a malignant case, even if that comes with a cost where you misidentify a benign tumour, and state it is malignant. For this client, this is not as bad as misidentifying a malignant tumour as benign. Therefore, the model is tuned on recall for malignant (0).


from sklearn.metrics import make_scorer, recall_score
from sklearn.metrics import f1_score # in case your metric is f1 score, you would need this import
from sklearn.metrics import precision_score # in case your metric is precision, you would need this import

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png">  The arguments `estimator, param_grid, cv, n_jobs,` and `verbose` are similar to the previous example.

* The focus is now on `scoring` when creating the object to conduct a grid search. You will need `make_scorer()` to state that you want to tune on recall for class 0 in this binary classifier. 
  * Pass two arguments to `make_scorer()`: `recall_score` as your metric, and `pos_label` to identify the class whose recall you want to tune. In this case, it is 0.
* Next, you fit the grid search with the train set (features and target) as usual.
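To see why `pos_label` matters, the sketch below scores the same predictions against each class in turn. The labels are hand-made (an assumption), and the `DummyClassifier` that always predicts class 1 is a deliberately poor, hypothetical model used only to show how the two scorers disagree.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, recall_score

# Hand-made labels (assumption): 0 = malignant, 1 = benign
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 0])

# Recall depends on which class is treated as the "positive" one
recall_class_0 = recall_score(y_true, y_pred, pos_label=0)  # 2 of 3 malignant found
recall_class_1 = recall_score(y_true, y_pred, pos_label=1)  # 1 of 2 benign found
print(recall_class_0, recall_class_1)

# A scorer wraps the metric so GridSearchCV can call it as scorer(estimator, X, y)
scorer_class_0 = make_scorer(recall_score, pos_label=0)
scorer_class_1 = make_scorer(recall_score, pos_label=1)

# Hypothetical model that always predicts "benign"
X_dummy = np.zeros((5, 1))
always_benign = DummyClassifier(strategy="constant", constant=1).fit(X_dummy, y_true)

score_0 = scorer_class_0(always_benign, X_dummy, y_true)  # misses every malignant case
score_1 = scorer_class_1(always_benign, X_dummy, y_true)  # finds every benign case
print(score_0, score_1)
```

Tuning on recall for class 1 would reward this model, while tuning on recall for class 0 exposes that it never detects a malignant case.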


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Since `cv=2`, we will fit two models for each hyperparameter combination using k-fold cross-validation. Therefore, four models (two times two) are trained in the end. 
* The same dynamic repeats: compute the performance for each cross-validated model and get the average performance for a given hyperparameter combination, then iterate for each hyperparameter combination.


grid = GridSearchCV(estimator=pipeline_clf(),
                    param_grid=param_grid,
                    cv=2,
                    n_jobs=-2,
                    verbose=3,
                    # in the workplace we typically set verbose to 1, 
                    # to reduce the amount of messages when fitting the models
                    # for teaching purpose, we set to 3 to see the score for each cross validated model
                    scoring=make_scorer(recall_score, pos_label=0)
                    )


grid.fit(X_train,y_train)

Next, we check the results for both hyperparameter combinations with `.cv_results_` and use the same code from the previous section
* Note that `'model__n_estimators': 50` gave an average recall score on class 0 of 0.86 and is superior to the other combination.

(pd.DataFrame(grid.cv_results_)
.sort_values(by='mean_test_score',ascending=False)
.filter(['params','mean_test_score'])
.values
 )

Let's check the best parameters with `.best_params_`

grid.best_params_

And finally grab the pipeline that has the best estimator, the one which gave the highest score. 

pipeline = grid.best_estimator_
pipeline

As usual in our workflow, we will evaluate the pipeline using our custom function for classification problems

from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):

  prediction = pipeline.predict(X)

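  print('---  Confusion Matrix  ---')
  # The arguments are swapped on purpose: confusion_matrix puts y_true on the rows,
  # so passing the predictions as y_true gives rows = prediction and columns = actual.
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),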
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)

We pass the arguments, as usual, considering that class 0 is malignant and class 1 is benign. Therefore, label_map receives an ordered list that matches the class value and its meaning:  `['malignant', 'benign']`
* Note the recall on malignant on the train set is 100% and on the test set is 90%. In a project, you set the threshold you would accept. 
* If the threshold you agreed with the client is 90%, this pipeline is the solution. If your agreed threshold is 98%, you would still have to look for other algorithms or hyperparameter combinations to improve your pipeline performance, as recall weighted average is at 95%. 
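Checking a recall value against an agreed threshold can also be done in code rather than by reading the printed report. The sketch below uses hand-made labels (an assumption, giving a malignant recall of 0.95 rather than the notebook's figures) and the `output_dict=True` option of `classification_report`.

```python
from sklearn.metrics import classification_report

label_map = ["malignant", "benign"]

# Hand-made labels (assumption): 20 malignant (0) and 20 benign (1) samples,
# with one malignant case wrongly predicted as benign
y_true = [0] * 20 + [1] * 20
y_pred = [0] * 19 + [1] * 1 + [1] * 20

# output_dict=True returns the report as a nested dictionary
report = classification_report(y_true, y_pred, target_names=label_map,
                               output_dict=True)
recall_malignant = report["malignant"]["recall"]

# Hypothetical thresholds that could be agreed with a client
for agreed_threshold in [0.90, 0.98]:
    verdict = "meets" if recall_malignant >= agreed_threshold else "misses"
    print(f"recall on malignant = {recall_malignant:.2f} {verdict} "
          f"the {agreed_threshold:.0%} threshold")
```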

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline,
                label_map= ['malignant', 'benign'] 
                )

---

# Scikit-learn - Unit 05C - Cross Validation Search (GridSearchCV) and Hyperparameter Optimisation Multiple Clf- Part 01

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and use GridSearchCV for Hyperparameter Optimisation




---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

We will install scikit-learn, xgboost, feature-engine and yellow brick to run our exercises.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 05C - Cross Validation Search (GridSearchCV ) and Hyperparameter Optimisation

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Hyperparameter Optimisation with one algorithm

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Multiclass Classification

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In the last section, we saw how to conduct hyperparameter tuning using one algorithm to solve a Binary Classification problem.
* There is a tiny difference in using GridSearchCV when your ML task is multi-class classification. We will cover that now.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are going to consider a similar workflow we studied earlier:
* Split the data
* Define the pipeline and hyperparameters
* Fit the pipeline
* Evaluate the pipeline

We load the iris dataset for this exercise. It contains records of three species or classes of iris plants, with their petal and sepal measurements.

df_clf = sns.load_dataset('iris')

print(df_clf.shape)
df_clf.head()

As usual, we split the data into train and test set.

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['species'],axis=1),
                                    df_clf['species'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

And create a pipeline using three steps: feature scaling, feature selection and modelling with RandomForestClassifier.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

def pipeline_clf():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(RandomForestClassifier(random_state=101)) ),
      ( "model", RandomForestClassifier(random_state=101)),

    ])

  return pipeline


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We define our hyperparameter list based on the algorithm documentation.
* In this case, there will be two hyperparameter combinations.
* As the intention of the unit is to learn hyperparameter optimisation, we will reduce the number of hyperparameter combinations, so the code runs faster. However, we encourage you to try additional larger combinations to consolidate your learning.



# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
from sklearn.model_selection import GridSearchCV

param_grid = {"model__n_estimators":[10,20],
              }
param_grid

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Let's assume for this project, the client is interested particularly in the Virginica species, and needs the predictions for this class to be precise. Although this is an arbitrary choice, in this case, it is an example of the type of business requirements given to you by the product owner or business expert.  
* In this case, your scoring metric is `precision_score` for the class Virginica.
  * In a Multiclass classification, when your performance metric is accuracy, you just pass scoring='accuracy' as an argument, as done with a binary classifier.
  * In our case, we need to build a custom scorer with the `make_scorer()` function to fine-tune the model using precision on the Virginica species. The first argument is the metric we want: `precision_score`. The next argument is `labels`, where you pass the class you want to tune as a list. Note that in this dataset the species is stored as text categories rather than numbers; if the classes were numbers, you would pass the number of the class you want to tune. The last argument is `average`, which should be `None` since you compute the precision for one class only (in this case Virginica), so there is nothing to average.
* Finally, you fit the grid search to the training data.
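
To see what this scorer computes, here is a small sketch on hand-made labels (the two lists are illustrative and are not taken from the iris data). It calls `precision_score` with the same `labels` and `average` arguments and compares the result with a manual count.

```python
from sklearn.metrics import precision_score

# Illustrative labels only; not taken from the iris dataset
y_true = ["setosa", "virginica", "virginica", "versicolor", "virginica"]
y_pred = ["setosa", "virginica", "versicolor", "virginica", "virginica"]

# Same arguments the scorer receives: one class, no averaging
virginica_precision = precision_score(
    y_true, y_pred, labels=["virginica"], average=None
)

# Manual check: of all rows predicted as virginica, how many are truly virginica?
predicted_virginica = [t for t, p in zip(y_true, y_pred) if p == "virginica"]
manual_precision = predicted_virginica.count("virginica") / len(predicted_virginica)

print(virginica_precision)  # an array with one entry per requested label
print(manual_precision)
```

With `average=None` the function returns one value per label listed in `labels`, so a single label yields a single value.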


from sklearn.metrics import make_scorer, precision_score
grid = GridSearchCV(estimator=pipeline_clf(),
                    param_grid=param_grid,
                    cv=2,
                    n_jobs=-2,
                    verbose=3, # In the workplace we typically set verbose to 1, 
                    # to reduce the number of messages when fitting the models.
                    # For teaching purposes, we set it to 3 to see the score for each cross-validated model.
                    scoring=make_scorer(precision_score,
                                        labels=['virginica'],
                                        average=None)
                    )


grid.fit(X_train,y_train)

Next, we check the results for both hyperparameter combinations with `.cv_results_`, using the same code as in the previous section.
* Note the combination `'model__n_estimators': 10` gave an average precision score on virginica of 0.91. In this case, both options appear to give the same performance, and the grid search picked the model with `n_estimators` set to 10.

(pd.DataFrame(grid.cv_results_)
.sort_values(by='mean_test_score',ascending=False)
.filter(['params','mean_test_score'])
.values
 )
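
The cell above returns a raw array through `.values`. As a complementary sketch, the same information can be kept as a DataFrame and sorted by `rank_test_score`, a column that `cv_results_` also provides. The estimator, grid and use of `load_iris` below are assumptions chosen to keep the example small and offline; they are not the pipeline from this unit.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Small illustrative grid: two candidates, two folds
demo_grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 3]},
    cv=2,
    scoring="accuracy",
)
demo_grid.fit(X, y)

# One row per hyperparameter combination, best rank first
results = (
    pd.DataFrame(demo_grid.cv_results_)
    .sort_values(by="rank_test_score")
    .filter(["params", "mean_test_score", "std_test_score", "rank_test_score"])
)
print(results)
```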

We grab programmatically the best hyperparameter combination for a quick check.

grid.best_params_

And finally, we grab the best pipeline with `.best_estimator_`: the pipeline fitted with the best hyperparameter combination found by the cross-validated search.

pipeline = grid.best_estimator_
pipeline

Finally, we evaluate the pipeline.
* Note the precision on Virginica is 98% on the train set and 100% on the test set. It is a very good sign that the precision is maximised for the test set, since it shows the pipeline can generalise to unseen data.
* Again, the client will accept the pipeline based on the performance criteria you both set in the ML business case.

from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction),"\n")


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)
    

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline,
                label_map= df_clf['species'].unique()
                )
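
The helper above passes `y_true=prediction` and `y_pred=y` to `confusion_matrix`. A short sketch on made-up labels shows the effect: scikit-learn puts `y_true` on the rows, so swapping the arguments places the predictions on the rows, which is what the "Prediction" index labels expect.

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only
actual = ["a", "a", "b", "b", "b", "c"]
prediction = ["a", "b", "b", "b", "c", "c"]

# scikit-learn convention: rows are y_true, columns are y_pred
rows_actual = confusion_matrix(y_true=actual, y_pred=prediction)

# Swapping the arguments, as the helper does, puts predictions on the rows
rows_prediction = confusion_matrix(y_true=prediction, y_pred=actual)

print(rows_actual)
print(rows_prediction)  # the transpose of the matrix above
```

When no `labels` argument is given, the classes appear in sorted order, so the names passed as `label_map` should follow that same order for the row and column headers to line up.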

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%209-%20Well%20done.png"> Congratulations! You now know how to take a given algorithm and run a hyperparameter optimisation for Regression and Classification!
  * The **next level** is to define a set of algorithms and a set of hyperparameters for each algorithm and run a hyperparameter optimisation for Regression and Classification tasks!

---

# Scikit-learn - Unit 06 - Cross Validation Search (GridSearchCV) and Hyperparameter Optimisation - Part 02

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Do Hyperparameter Optimisation using multiple algorithms. 




---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning



import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")





## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 06 - Cross Validation Search (GridSearchCV) and Hyperparameter Optimisation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Good job! You fitted multiple pipelines using a single algorithm while looking for the best hyperparameter combination, for regression and classification tasks. However, how do you know which algorithm is the best choice for a given ML task?
* Let's learn how to use GridSearchCV to do hyperparameter optimisation using **multiple algorithms**.

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 We will cover in this notebook:
* A technique to do a hyperparameter optimisation with multiple algorithms.
* A strategy for using this technique that typically reduces the time needed to train all algorithms.
* A strategy to refit the pipeline with only the most relevant features, so you can deploy a pipeline that contains only the best features.
* **BONUS**: Here we list the values of the most common hyperparameters for the algorithms we have covered in the course. You can use them as a starting point and as a reference for the Portfolio Project or in your future workplace.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Hyperparameter Optimisation with multiple algorithms

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We are going to consider a similar workflow to the one we studied earlier:

* Split the data.
* Define the pipeline and hyperparameters.
* Fit the pipeline (using a strategy that typically trains all the algorithms faster).
* Evaluate the pipeline.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> The exercise in this notebook is a multiclass classification task, but the concepts we cover here apply equally to regression and binary classification tasks.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We will use the 'penguins' dataset for this exercise. It has records for three different species of penguins, collected from three islands in the Palmer Archipelago, Antarctica. 
* Here, we are interested in predicting the species of a given penguin.

df_clf = sns.load_dataset('penguins')
print(df_clf.shape)
df_clf.head()

We split the data into train and test set

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['species'],axis=1),
                                    df_clf['species'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 And define the pipeline steps considering:
* data cleaning (median imputation, categorical imputation).
* feature engineering (categorical encoding).
* feature scaling.
* feature selection (note we don't specify the algorithm, we pass in a variable called `model`).
*  and modelling (note we don't specify the algorithm, we pass in a variable called `model`).


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> Previously we covered the definition of data cleaning and feature-engineering/scaling/selection steps, and in this exercise, we provide the appropriate actions (imputations and encoding).

from sklearn.pipeline import Pipeline

### Data Cleaning and Feature Engineering
from feature_engine.imputation import MeanMedianImputer
from feature_engine.imputation import CategoricalImputer
from feature_engine.encoding import OrdinalEncoder

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier


def PipelineOptimization(model):
  pipeline_base = Pipeline([
      ( 'median',  MeanMedianImputer(imputation_method='median',
                                     variables=['bill_length_mm' , 'bill_depth_mm',
                                                'flipper_length_mm', 'body_mass_g']) ),

      ( 'categorical_imputer', CategoricalImputer(imputation_method='frequent',
                                                        variables=['sex']) ),

      ( "ordinal",OrdinalEncoder(encoding_method='arbitrary', 
                                 variables = ['island', 'sex']) ), 

      ("feat_scaling", StandardScaler() ),

      ("feat_selection",  SelectFromModel(model) ),

      ("model", model ),


    ])

  return pipeline_base


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we create a Python class (HyperparameterOptimizationSearch) which aims to fit a set of algorithms with multiple hyperparameters. The logic is: 
* The developer defines a set of algorithms and their respective hyperparameter values.
* The code iterates on each algorithm and fits pipelines using GridSearchCV considering their respective hyperparameter values. The result is stored. 
* That is repeated for all algorithms that the user listed.
* Once all pipelines are trained, the developer can retrieve a list with a performance result summary and an object that contains all the trained pipelines. The developer can then subset the best pipeline.
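
The logic above can be sketched in a few lines, independently of the class used in this notebook. The sketch is a simplified illustration under stated assumptions: it uses `load_iris`, a two-step pipeline and two reduced dictionaries (`demo_models`, `demo_params`) that are hypothetical and only serve to show the loop.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical, reduced versions of the two dictionaries
demo_models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(n_estimators=10, random_state=0),
}
demo_params = {
    "DecisionTreeClassifier": {"model__max_depth": [2, 4]},
    "ExtraTreesClassifier": {},  # empty dict: only the values set in the constructor
}

# One grid search per algorithm; each fitted search is stored under its name
fitted_searches = {}
for name, estimator in demo_models.items():
    pipeline = Pipeline([("feat_scaling", StandardScaler()), ("model", estimator)])
    demo_search = GridSearchCV(pipeline, demo_params[name], cv=2, scoring="accuracy")
    fitted_searches[name] = demo_search.fit(X, y)

# Performance summary: best mean cross-validated score per algorithm
summary = pd.DataFrame({
    "estimator": list(fitted_searches),
    "best_mean_score": [gs.best_score_ for gs in fitted_searches.values()],
}).sort_values("best_mean_score", ascending=False)
print(summary)
```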




<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png
"> Let's explain the major parts of the Python class.

* In the \__init__ method, you pass in models, and params as a dictionary of algorithms and their respective hyperparameters.
* In the `fit` method, we loop on each algorithm, and pass the algorithm to PipelineOptimization(). As a result, it will do a grid search on a set of hyperparameters for that given model. The result is stored and the loop continues on.
* The `score_summary` method returns all pipelines, and a DataFrame with a performance summary for all of the algorithms.

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> Again, at first, it will take some time to understand the code of this class, but what's most important for now is to understand what it does. 



from sklearn.model_selection import GridSearchCV
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

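    # Note: `refit` is accepted but not forwarded to GridSearchCV, which keeps its
    # default (refit=True); that default is what makes .best_estimator_ available later.
    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):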
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
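
The reshaping inside `score_summary` can be hard to picture. The following sketch uses made-up per-fold scores (three candidates, `cv=2`) to show how the `split{i}_test_score` arrays are stacked so that each row holds the fold scores of one hyperparameter combination.

```python
import numpy as np

# Made-up per-fold scores for three candidates and cv=2
split0_test_score = np.array([0.90, 0.80, 0.70])
split1_test_score = np.array([0.94, 0.84, 0.60])

n_candidates = 3
columns = [
    split0_test_score.reshape(n_candidates, 1),
    split1_test_score.reshape(n_candidates, 1),
]

# Shape (n_candidates, n_folds): each row holds one candidate's fold scores
all_scores = np.hstack(columns)

# min, mean and max per candidate, as reported in the summary table
for candidate_scores in all_scores:
    print(candidate_scores.min(), candidate_scores.mean(), candidate_scores.max())
```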

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png
">
 We now define a list of models and their respective hyperparameters. 
* The first dictionary is related to the algorithms.
  * We create a dictionary where the key is the model name (you can use any name here, but we suggest using the estimator name), and the value is the estimator object. For example, for the decision tree we use DecisionTreeClassifier(random_state=0).
* This is a multiclass classification task, so we consider all algorithms bar logistic regression (since that is more suitable for binary classification).

models_search = {
    "DecisionTreeClassifier":DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier":RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier":GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier":ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier":AdaBoostClassifier(random_state=0),
}

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png
"> The other dictionary relates to the hyperparameter values.
  * Its keys should match the keys from the models' dictionary.
  * For each key, the value is a dictionary whose keys are the hyperparameter names and whose values are lists of hyperparameter values.
  * Look at the example, and see that we list two hyperparameters for RandomForestClassifier: n_estimators and max_depth. For each hyperparameter, we set a list with the values to try.
  * When you want to consider only the default hyperparameters, you just pass in an empty dictionary for a given algorithm. You will see that the other algorithms have an empty dict `{ }` as their hyperparameters, which means the search will only consider their default hyperparameters. **But you may ask: why would we do that?**

params_search = {
    "DecisionTreeClassifier":{},
    "RandomForestClassifier":{"model__n_estimators":[50,20],
                               "model__max_depth":[None,3,10]},
    "GradientBoostingClassifier":{},
    "ExtraTreesClassifier":{},
    "AdaBoostClassifier":{},
}

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> That is useful when we define a strategy to speed up the hyperparameter optimisation process.
* As you have noticed, the idea is to fit multiple models with multiple hyperparameter options. However, computing all of that takes time, and how much depends on your hardware.
* It makes sense to first do a quick search using the default hyperparameters across all listed algorithms. The result shows which algorithms appear to fit your data best, and this training process tends not to take long since it fits only one candidate per algorithm.
* Then you take the best two or three algorithms and do an extensive search to fine-tune your pipeline performance.





<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's do a hyperparameter optimisation search using the **default hyperparameters values first**.

params_search = {
    "DecisionTreeClassifier":{},
    "RandomForestClassifier":{},
    "GradientBoostingClassifier":{},
    "ExtraTreesClassifier":{},
    "AdaBoostClassifier":{},
}

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We now use our custom class `HyperparameterOptimizationSearch` to assign an object called search (you can use whatever name you wish).
* We pass in two arguments: models and params, which are the two dictionaries we set in the previous cells: models_search and params_search.
* The goal here is to use the default hyperparameters to find the type of algorithms that look to best fit your data.
* Next, we fit this object, meaning we will fit all the algorithms using GridSearchCV. Therefore we pass in the `training data` (X_train, y_train), `scoring` (in this case, as it's a teaching example we arbitrarily chose accuracy) and `cv` (we defined 2 to speed up the process).

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note you will see the code looping over each algorithm. There is one candidate per algorithm since you are fitting with the default hyperparameters only, which totals two fits per algorithm since cv=2.



```
Running GridSearchCV for DecisionTreeClassifier 
Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for RandomForestClassifier 
Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for GradientBoostingClassifier 
Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for ExtraTreesClassifier 
Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for AdaBoostClassifier 
Fitting 2 folds for each of 1 candidates, totalling 2 fits
```
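
The "1 candidates" in this log can be reproduced with scikit-learn's `ParameterGrid`, which enumerates the combinations of a parameter grid. A minimal sketch, assuming the same `cv=2`:

```python
from sklearn.model_selection import ParameterGrid

cv = 2
default_only = ParameterGrid({})  # empty dict: a single candidate with no overrides
n_candidates = len(default_only)

print(list(default_only))
print(f"{n_candidates} candidate x {cv} folds = {n_candidates * cv} fits")
```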



search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring='accuracy',
           n_jobs=-1, # use all available processors
           cv=2)
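
For reference, negative `n_jobs` values follow joblib's convention: `-1` means all available cores and `-2` means all cores but one. A small sketch, assuming `joblib` is installed (it is a scikit-learn dependency), shows how the values resolve on the current machine:

```python
from joblib import effective_n_jobs

all_cores = effective_n_jobs(-1)    # every available core
all_but_one = effective_n_jobs(-2)  # every core except one, never fewer than 1

print(all_cores, all_but_one)
```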

Our method `.score_summary` returns a DataFrame with all the training results summary and a dictionary containing all pipelines.
* We grab both and first check the results summary.
* Note that ExtraTreesClassifier had an average accuracy performance (using two cross-validated models with default hyperparameters values) of 0.98
* The second best was RandomForestClassifier with 0.95. Then GradientBoostingClassifier with 0.92.
* AdaBoostClassifier had the lowest performance here, with 0.82 of average accuracy. 

grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
**On which algorithms would you spend time doing an extensive hyperparameter search?**
* It depends on how far apart the scores of the top performers are.
* In our case, we would certainly select ExtraTreesClassifier and would give a second chance to RandomForestClassifier, since its performance was not far from ExtraTrees.
* We wouldn't give a second chance to GradientBoosting since 0.92 (for this context) is quite far from 0.98.


* However, there could be a case where for example, the top four had similar performance on the default hyperparameter, then you would do an extensive hyperparameter optimisation on these four.
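
One way to make this choice explicit is to keep every algorithm whose mean score is within a tolerance of the best one. The sketch below uses the four mean accuracies quoted above; the `shortlist` helper and the `tolerance=0.05` threshold are hypothetical and chosen only for illustration.

```python
# Mean accuracies quoted above for the default-hyperparameter search
default_scores = {
    "ExtraTreesClassifier": 0.98,
    "RandomForestClassifier": 0.95,
    "GradientBoostingClassifier": 0.92,
    "AdaBoostClassifier": 0.82,
}

def shortlist(scores, tolerance):
    """Return estimator names whose score is within `tolerance` of the best."""
    best = max(scores.values())
    return [name for name, score in scores.items() if best - score <= tolerance]

# tolerance=0.05 is a hypothetical threshold chosen for this illustration
print(shortlist(default_scores, tolerance=0.05))
```

A wider tolerance would bring more algorithms into the extensive search, at the cost of a longer training time.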

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 Let's define the new hyperparameters for the extensive search.
* You don't need to pass in the same number of hyperparameters for each algorithm, and the values you assign in each list depend on the hyperparameter.
* There is no fixed number of values to pass in each list; just remember that the more values and hyperparameters you pass, the more time it will take to fit all possible combinations.
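
To get a feel for how quickly the work grows, the number of cross-validation fits per algorithm is the product of the list lengths multiplied by the number of folds. The helper `count_fits` below is a hypothetical illustration, not part of scikit-learn, and `example_params` is an assumed grid.

```python
from math import prod

def count_fits(params_search, cv):
    """Fits per algorithm: product of the list lengths times the number of folds."""
    return {
        name: prod(len(values) for values in grid.values()) * cv
        for name, grid in params_search.items()
    }

# Assumed example grid; an empty dict counts as one candidate
example_params = {
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {"model__n_estimators": [50, 20],
                               "model__max_depth": [None, 3, 10]},
}
print(count_fits(example_params, cv=2))
```

These counts cover the cross-validation fits only; with its default settings, GridSearchCV also refits the best candidate once on the whole train set.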

# the models do not need to be listed in any specific order
models_search = {
    "ExtraTreesClassifier":ExtraTreesClassifier(random_state=0),
    "RandomForestClassifier":RandomForestClassifier(random_state=0),
}

params_search = {
    # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
    "ExtraTreesClassifier":{"model__n_estimators": [20,50],
                            },
    # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
    "RandomForestClassifier":{"model__n_estimators": [40,20],
                            },
}

Let's fit again using our HyperparameterOptimizationSearch class and our updated information on models_search and params_search.
* The other arguments remain the same.
* The goal here is to do an extensive search on the algorithms that performed better in a default hyperparameter optimisation.

search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring='accuracy',
           n_jobs=-1,
           cv=2)

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png
">
 Let's check the results summary with `.score_summary`
* We could do a further round of extensive search with more hyperparameters and consider values around those that demonstrated good performance in this round. But for this teaching example, we are happy with the current search.

grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Programmatically, we grab the best model name by using `.iloc[ ]` on the first row and column of the previous DataFrame.

best_model = grid_search_summary.iloc[0,0]
best_model

Let's get the best model parameters.

grid_search_pipelines[best_model].best_params_

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Finally, we want to grab the best pipeline.
* The object grid_search_pipelines contains all trained pipelines. We first subset the pipelines from the algorithm with the best performance (with `best_model`), then use `.best_estimator_` to retrieve the pipeline that has the algorithm and hyperparameter configuration that best suits our data.

best_pipeline = grid_search_pipelines[best_model].best_estimator_
best_pipeline

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The best pipeline is a tree-based algorithm, so we can check the most important features with `.feature_importances_`

* The information on the `“best features”` is on the pipeline’s “feature selection” step as a boolean list. We get this list to subset the train set columns.
* Go through the pseudo code and code comments to understand the logic.
* Make sure you understand the variable data_cleaning_feat_eng_steps. If you use this code in your milestone project, you will likely need to update this value to the appropriate one for your pipeline.
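
Before reading the full cell, it may help to see the boolean mask on its own. The sketch below is a reduced illustration, assuming the iris data from `load_iris` and a two-step pipeline with no data cleaning steps, so the column names can be taken directly from `X`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

demo_pipeline = Pipeline([
    ("feat_selection", SelectFromModel(ExtraTreesClassifier(n_estimators=20, random_state=0))),
    ("model", ExtraTreesClassifier(n_estimators=20, random_state=0)),
]).fit(X, y)

# Boolean mask: True for the columns the selection step kept
mask = demo_pipeline["feat_selection"].get_support()
selected = X.columns[mask].to_list()

# The model only saw the selected columns, so the two lengths match
importances = demo_pipeline["model"].feature_importances_
print(dict(zip(selected, importances.round(3))))
```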

# after data cleaning and feature engineering, the feature space may change
# for example, you may drop variables, or you may add variables; such as a "date" variable
# if you extract the day, month and year, for example.
# then you ask yourself: how many data cleaning and feature engineering steps does your pipeline have?
# in our case three: median, categorical_imputer and ordinal

data_cleaning_feat_eng_steps = 3
# we get these steps with .steps[] starting from 0 until the value we assigned above
# then we .transform() to the train set and extract the columns
columns_after_data_cleaning_feat_eng = (Pipeline(best_pipeline.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

# we get the boolean list indicating the best features with best_pipeline['feat_selection'].get_support()
# and use this list to subset columns_after_data_cleaning_feat_eng
best_features = columns_after_data_cleaning_feat_eng[best_pipeline['feat_selection'].get_support()].to_list()


# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
          'Feature': best_features,
          'Importance': best_pipeline['model'].feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")


df_feature_importance.plot(kind='bar',x='Feature',y='Importance')
plt.show()

Finally, we evaluate the pipeline as usual with our custom function for classification tasks.

from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction),"\n")


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)

We pass in the arguments we are familiar with.
* Note the performance on the test set is the same as in the train set.
* For label_map, we get the class names with `.unique()`.

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=best_pipeline,
                label_map= df_clf['species'].unique() 
                # in this case the target variable is encoded as categories and we
                # get the values with .unique() 
                )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Refit only with the most important features

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%209-%20Well%20done.png"> Now you know which algorithm and which hyperparameters best fit your data. 
* That is awesome! Look at your improvement and what you achieved so far.
* However, your pipeline needs six columns and your model needs only three to predict. That means if you deploy this pipeline, your system will manage six inputs, when in fact you only need three.
* That happens because the pipeline includes a feature selection step, which is useful to determine the most appropriate features for the algorithm.


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
  In practical terms, you don't need the features that got dropped by the feature selection step. Once you know which features you can ignore, **you can fit a new pipeline with only the most important features**.
* This new pipeline will be deployed and contains an algorithm and hyperparameters that best suit your data and has the correct number of features.

These are the most important features according to the previous analysis.

best_features

We will use the same workflow, but now using the `best_features` only,  for the train and test sets

We split the data into train and test set

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['species'],axis=1),
                                    df_clf['species'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> And subset the `best_features` !!!

X_train = X_train.filter(best_features)
X_test = X_test.filter(best_features)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)
X_train.head(3)
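
One detail worth knowing about `.filter()`: it keeps only the requested labels that exist and silently ignores the rest, whereas bracket selection raises a `KeyError`. The sketch below illustrates the difference on a small made-up DataFrame. The names `df_toy` and `toy_features` are hypothetical and not part of the penguins dataset.

```python
import pandas as pd

# Toy data; the column names are illustrative only
df_toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
toy_features = ["a", "c", "not_a_column"]  # the last name does not exist

# .filter() keeps the labels it finds and ignores the missing one
filtered_columns = df_toy.filter(toy_features).columns.to_list()
print(filtered_columns)

# Bracket selection is stricter and fails on the missing label
try:
    df_toy[toy_features]
    strict_selection_failed = False
except KeyError:
    strict_selection_failed = True
print(strict_selection_failed)
```

Checking the shape after subsetting, as the cell above does, is a simple way to confirm that no feature was dropped by accident.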

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You will need to update your pipeline since you have fewer variables to consider and you don't need feature selection.
* Before you had three steps for data cleaning and feature engineering.
* Now, you have two steps: one for median imputation and another for categorical encoding.

from sklearn.pipeline import Pipeline

### Data Cleaning and Feature Engineering
from feature_engine.imputation import MeanMedianImputer
from feature_engine.encoding import OrdinalEncoder

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### ML algorithms 
from sklearn.ensemble import ExtraTreesClassifier


def PipelineOptimization(model):
  pipeline_base = Pipeline([
      ( 'median',  MeanMedianImputer(imputation_method='median',
                                     variables=['bill_length_mm' , 'flipper_length_mm']) ),

      ( "ordinal",OrdinalEncoder(encoding_method='arbitrary', variables = ['island']) ), 

      ("feat_scaling", StandardScaler() ),

      # no feature selection!!!

      ("model", model ),


    ])

  return pipeline_base



We now list the model that performed best, in this case, ExtraTreesClassifier.

models_search = {
    "ExtraTreesClassifier":ExtraTreesClassifier(random_state=0),
}
models_search

We will need to hardcode the best parameters, so let's remind ourselves of the best params.

grid_search_pipelines[best_model].best_params_

We need to pass each value inside brackets `[ ]`, as a list with a single item.

params_search = {
    "ExtraTreesClassifier":{'model__n_estimators': [20]
                            },

}
params_search

We fit the model using `HyperparameterOptimizationSearch` considering the model "ExtraTreesClassifier" and the parameters we set previously.
* The goal here is not to do a hyperparameter optimisation search, but instead to fit a pipeline using the algorithm and best hyperparameter configuration we discovered.

search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring='accuracy',
           n_jobs=-1,
           cv=2)
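
Because only one value is listed per hyperparameter, the search above evaluates a single configuration. As a minimal sketch of the same idea without the course's custom class, you could set the value directly on a pipeline with `set_params()`. The data here is synthetic (`make_classification`) and `pipeline_demo` is a simplified stand-in, not the penguin pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the penguins dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=3,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

pipeline_demo = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", ExtraTreesClassifier(random_state=0)),
])

# The "model__" prefix routes the value to the step named "model"
pipeline_demo.set_params(model__n_estimators=20)
pipeline_demo.fit(X_demo, y_demo)

print(pipeline_demo["model"].n_estimators)
```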

As usual, we check the search summary with the method .score_summary()
* Note the performance is the same as the previous pipeline.

grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

We get the best model programmatically.

best_model = grid_search_summary.iloc[0,0]
best_model

So we can grab the pipeline.

best_pipeline = grid_search_pipelines[best_model].best_estimator_
best_pipeline

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The best pipeline is a tree-based algorithm, so we can check the most important features with `.feature_importances_`
* The code is similar to the previous section, the difference is that now we don't have three steps in the pipeline related to data cleaning and feature engineering. Instead, we have two steps now.

data_cleaning_feat_eng_steps = 2

columns_after_data_cleaning_feat_eng = (Pipeline(best_pipeline.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)
best_features = columns_after_data_cleaning_feat_eng


# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
          'Feature': best_features,
          'Importance': best_pipeline['model'].feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")


df_feature_importance.plot(kind='bar',x='Feature',y='Importance')
plt.show()

We pass the arguments we are familiar with to evaluate the classifier's performance.
* Note the performance from this pipeline is the same as from the previous pipeline - as we should expect!

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=best_pipeline,
                label_map= df_clf['species'].unique() 
                # in this case the target variable is encoded as categories and we
                # get the values with .unique() 
                )

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%209-%20Well%20done.png"> Well done!
* In this notebook, you learned how to conduct a hyperparameter optimisation search fitting multiple algorithms with the best features that predict a penguin's species.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Bonus: Most common Hyperparameters for the algorithms we cover in the course

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
 It will take **time and experience** to learn which hyperparameters to consider when optimising your pipeline and which values would make sense to tune.
* The key is to understand how the algorithm works, and that will take time and experience. We offer the most common hyperparameters for the algorithms we cover in the course. You can use them as a starting point and as a reference if you require them for the Portfolio Project or in the workplace.
* Once again: the **library documentation** is your best friend to instruct you on the available hyperparameters the library offers for that given algorithm.


* The hyperparameters we list here are a suggestion so that you can use them as a reference when you start fine-tuning your ML pipelines.

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> We will write the hyperparameters for all algorithms using the same dictionary structure we saw over the notebook, assuming you are arranging everything into a pipeline and the last step is called `'model'`
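
To make the naming convention concrete, here is a small sketch of how a `model__` key maps onto a pipeline step. It assumes a two-step pipeline built only for illustration; `pipeline_demo` and `params_search_demo` are hypothetical names. `get_params()` lists every name the pipeline accepts, so it is a quick way to check a key before running a search.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline_demo = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])

# Same dictionary structure used in this notebook: model name -> hyperparameters
params_search_demo = {
    "RandomForestClassifier": {"model__n_estimators": [50, 100],
                               "model__max_depth": [None, 4]},
}

# Every key must be a parameter name the pipeline recognises
valid_names = pipeline_demo.get_params().keys()
keys_are_valid = {name: name in valid_names
                  for name in params_search_demo["RandomForestClassifier"]}
print(keys_are_valid)

# A misspelled or non-existent hyperparameter is not in the list
unknown_key_is_valid = "model__not_a_param" in valid_names
print(unknown_key_is_valid)
```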

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Linear Regression

# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
from sklearn.linear_model import LinearRegression

# Linear Regression doesn't have hyperparameters to tune here. You should pass an empty dictionary
params_search = {
    "LinearRegression":{},
}


#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Logistic Regression

# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
from sklearn.linear_model import LogisticRegression 

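# Note: the default 'lbfgs' solver supports only the 'l2' penalty. 'l1' and 'elasticnet'
# need a compatible 'model__solver' (e.g. 'saga'), and 'elasticnet' also needs 'model__l1_ratio'
params_search = {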
    "LogisticRegression":{'model__penalty': ["l2","l1", "elasticnet"],
                          'model__C': [1, 0.5, 2],
                          'model__tol': [1e-4,1e-3,1e-5],
                            }
  }
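
In scikit-learn, not every `penalty` works with every `solver`, so a grid that mixes them freely can contain combinations that fail to fit. One way to keep the grid valid is to pass a list of dictionaries, each pairing a penalty with a solver that supports it. The sketch below uses `GridSearchCV` directly on synthetic data, since whether the course's custom search class accepts a list is not shown here. The names `pipeline_demo` and `param_grid_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption: for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline_demo = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each dictionary is its own sub-grid, so only valid pairs are tried
param_grid_demo = [
    {"model__penalty": ["l2"], "model__solver": ["lbfgs"],
     "model__C": [0.5, 1, 2]},
    {"model__penalty": ["l1"], "model__solver": ["liblinear"],
     "model__C": [0.5, 1, 2]},
]

grid = GridSearchCV(pipeline_demo, param_grid_demo, cv=2, scoring="accuracy")
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```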

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Decision Tree

# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
from sklearn.tree import DecisionTreeClassifier

params_search = {
    "DecisionTreeClassifier":{'model__max_depth': [None,4, 15],
                              'model__min_samples_split': [2,50],
                              'model__min_samples_leaf': [1,50],
                              'model__max_leaf_nodes': [None,50],
                            }
  }

# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
from sklearn.tree import DecisionTreeRegressor


params_search = {
    "DecisionTreeRegressor":{'model__max_depth': [None,4, 15],
                             'model__min_samples_split': [2,50],
                             'model__min_samples_leaf': [1,50],
                             'model__max_leaf_nodes': [None,50],
                            }
  }

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Random Forest

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
from sklearn.ensemble import RandomForestRegressor

params_search = {
    "RandomForestRegressor":{'model__n_estimators': [100,50, 140],
                             'model__max_depth': [None,4, 15],
                             'model__min_samples_split': [2,50],
                             'model__min_samples_leaf': [1,50],
                             'model__max_leaf_nodes': [None,50],
                            }
  }


# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
from sklearn.ensemble import RandomForestClassifier

params_search = {
    "RandomForestClassifier":{'model__n_estimators': [100,50,140],
                             'model__max_depth': [None,4, 15],
                             'model__min_samples_split': [2,50],
                             'model__min_samples_leaf': [1,50],
                             'model__max_leaf_nodes': [None,50],
                            }
  }
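
Keep in mind that the values in a grid multiply. As a quick sketch, `ParameterGrid` from scikit-learn can count the combinations in the Random Forest grid above before you commit to a long search. The number of folds (`cv_folds`) is an example value chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

random_forest_grid = {
    "model__n_estimators": [100, 50, 140],
    "model__max_depth": [None, 4, 15],
    "model__min_samples_split": [2, 50],
    "model__min_samples_leaf": [1, 50],
    "model__max_leaf_nodes": [None, 50],
}

n_combinations = len(ParameterGrid(random_forest_grid))
cv_folds = 5  # assumption: example value for cross-validation folds

print(n_combinations)             # 3 * 3 * 2 * 2 * 2 = 72 configurations
print(n_combinations * cv_folds)  # 360 fits during cross-validation
```

If the count is too large for your time budget, reduce the number of values per hyperparameter or drop the hyperparameters you expect to matter least.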


#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Gradient Boosting

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
from sklearn.ensemble import GradientBoostingClassifier 

params_search = {
    "GradientBoostingClassifier":{'model__n_estimators': [100,50,140],
                                  'model__learning_rate':[0.1, 0.01, 0.001],
                                  'model__max_depth': [3,15, None],
                                  'model__min_samples_split': [2,50],
                                  'model__min_samples_leaf': [1,50],
                                  'model__max_leaf_nodes': [None,50],
                            }
  }


# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
from sklearn.ensemble import GradientBoostingRegressor

params_search = {
    "GradientBoostingRegressor":{'model__n_estimators': [100,50,140],
                                  'model__learning_rate':[0.1, 0.01, 0.001],
                                  'model__max_depth': [3,15, None],
                                  'model__min_samples_split': [2,50],
                                  'model__min_samples_leaf': [1,50],
                                  'model__max_leaf_nodes': [None,50],
                            }
  }



####  <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Ada Boost

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
from sklearn.ensemble import AdaBoostClassifier

params_search = {
    "AdaBoostClassifier":{'model__n_estimators': [50,25,80,150],
                          'model__learning_rate':[1,0.1, 2],
                            }
  }



# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html
from sklearn.ensemble import AdaBoostRegressor

params_search = {
    "AdaBoostRegressor":{'model__n_estimators': [50,25,80,150],
                          'model__learning_rate':[1,0.1, 2],
                          'model__loss':['linear', 'square', 'exponential'],
                            }
  }


####  <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> XG Boost

# https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
from xgboost import XGBRegressor

params_search = {
    "XGBRegressor":{'model__n_estimators': [30,80,200],
                    'model__max_depth': [None, 3, 15],
                    'model__learning_rate': [0.01,0.1,0.001],
                    'model__gamma': [0, 0.1],
                            }
  }


from xgboost import XGBClassifier

params_search = {
    "XGBClassifier":{'model__n_estimators': [30,80,200],
                      'model__max_depth': [None, 3, 15],
                      'model__learning_rate': [0.01,0.1,0.001],
                      'model__gamma': [0, 0.1],
                            }
  }


#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> ExtraTree

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
from sklearn.ensemble import ExtraTreesClassifier

params_search = {
    "ExtraTreesClassifier":{'model__n_estimators': [100,50,150],
                          'model__max_depth': [None, 3, 15],
                          'model__min_samples_split': [2, 50],
                          'model__min_samples_leaf': [1,50],
                            }
  }


# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html
from sklearn.ensemble import ExtraTreesRegressor


params_search = {
    "ExtraTreesRegressor":{'model__n_estimators': [100,50,150],
                          'model__max_depth': [None, 3, 15],
                          'model__min_samples_split': [2, 50],
                          'model__min_samples_leaf': [1,50],
                            }
  }



---

# Scikit-learn - Unit 07 - PCA (Principal Component Analysis)

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand what PCA (Principal Component Analysis) is and how it can be used in your project



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

We will install scikit-learn, xgboost, feature-engine and yellow brick to run our exercises

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 07 - PCA (Principal Component Analysis)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Principal Component Analysis, or PCA, is a transformation of your data that attempts to find the combinations of features that explain the most variance in your data.

* It reduces the number of variables, while it preserves as much information as possible. Therefore it is also referred to as "dimensionality reduction".
* After the transformation, it creates a set of components, where each component contains the relevant information from the original variables.
  * Each component explains a certain part of the variance of the whole dataset and is independent (uncorrelated) from each other.
  * The drawback of PCA is that it is not easy to understand what each of these components represents, since a component doesn't relate one-to-one to a specific variable. Instead, each component corresponds to a combination of the original variables.




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> **We will not focus** on the mathematical study of PCA but instead will discuss the idea behind it and how to use PCA in practical terms in your data science project
* It will take time and experience to understand how the PCA algorithm works. For now, the central aspect is to understand what PCA is and why it will help you in predictive modelling.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> **Why and when should I consider using PCA?**


* Imagine if your data has a lot of variables (or dimensions). 

  <img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
 You want to be able to **visualise** your data to discover patterns, however, it is unfeasible to visualise all of your data in a single plot. You can use PCA to reduce your dataset to 2 or 3 components and visualise it. We will explore that in this notebook.

  <img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
  In **predictive modelling**, we are concerned about which variables are more relevant for modelling. PCA transforms your data into a smaller set of components that retain most of the variance. Each component is built from a combination of the original variables, which helps the model learn the patterns in the data.
  * In supervised learning, you can use PCA as a feature extraction step for your ML model instead of using, for example, `SelectFromModel()`. PCA transforms your features into components that can help to predict your target variable. We will explore this technique in the Walkthrough Project 02.
  * In addition, in unsupervised learning, you can use PCA as a step to reduce dimensionality, so your cluster algorithm can better understand how to group similar data. We will explore this technique in the next lesson and Walkthrough Project 02.


You can import PCA using the command below

from sklearn.decomposition import PCA

In the next cells we are going to:
* Load a dataset and define the pipeline steps to prepare the data for PCA
* Transform the data using PCA and understand how many components to consider
* Visualise the data after the PCA transformation

---

### Load Data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's load the breast cancer data from sklearn and apply PCA
* It shows records for a breast mass sample and a diagnosis informing whether it is malignant or benign, where 0 is malignant and 1 is benign.
* The target variable is 'diagnostic' and features are the remaining variables.



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">  We know in advance this dataset has only numerical features and no missing data. 
* We are adding missing data (`np.nan`) on purpose in the first 10 rows of 'mean smoothness' using `.iloc[:10,4]`, just to better simulate the datasets you will likely face in the workplace.

from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df_clf = pd.DataFrame(data.data,columns=data.feature_names)
df_clf['diagnostic'] = pd.Series(data.target)
df_clf = df_clf.sample(frac=0.6, random_state=101)
df_clf.iloc[:10,4] = np.nan  # np.nan works in all NumPy versions; the np.NaN alias was removed in NumPy 2.0

print(df_clf.shape)
df_clf.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are interested in applying PCA to the features only (not the diagnostic)
* We create 2 distinct DataFrames, `X` which is the features, and `df_target` which contains the diagnostic (benign or malignant). 
  * Note, there are 30 features in `X`.
  * We will use `X` to apply PCA, and `df_target` at a later stage when we visualise the data.


df_target = df_clf[['diagnostic']]
X = df_clf.drop(['diagnostic'], axis=1)
print(X.shape)
X.head(3)

---

### Create pipeline steps

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> To apply PCA, we should scale the data. Therefore we create our pipeline that is responsible for data cleaning, feature engineering and feature scaling.
* In our case, it will perform data cleaning (median imputation) and feature scaling.

from sklearn.pipeline import Pipeline
### Data Cleaning
from feature_engine.imputation import MeanMedianImputer
### Feat Scaling
from sklearn.preprocessing import StandardScaler


def PipelineDataCleaningFeatEngFeatScaling():
  pipeline_base = Pipeline([
                            
      ( 'MeanMedianImputer', MeanMedianImputer(imputation_method='median') ),

      ( 'feature_scaling', StandardScaler() ),
  ])

  return pipeline_base

PipelineDataCleaningFeatEngFeatScaling()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We fit and transform the data to the pipeline.
* The result is a NumPy array. Note that it still has the same number of rows and columns (341, 30). The point to note is that the data type is now an array due to the feature scaling transformation.

pipeline_pca = PipelineDataCleaningFeatEngFeatScaling()
df_pca = pipeline_pca.fit_transform(X)
print(df_pca.shape,'\n', type(df_pca))

Just to reinforce our learning, let's check `df_pca`. 

df_pca

* As we expect, it is the familiar NumPy array we covered in previous sections. Note also it is a 2D array.
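
If you ever need the column names back after scaling, one option is to wrap the array in a DataFrame again, reusing the original columns and index. This is a minimal sketch on a toy frame (`X_toy` and its values are made up for illustration); the notebook itself continues with the NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with a non-default index (assumption: illustrative values only)
X_toy = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [10.0, 20.0, 60.0]},
                     index=[7, 3, 9])

# The scaler returns a plain NumPy array without labels
scaled_array = StandardScaler().fit_transform(X_toy)

# Reattach the original column names and index
df_scaled = pd.DataFrame(scaled_array, columns=X_toy.columns, index=X_toy.index)

print(type(scaled_array))
print(df_scaled)
```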

---

### PCA transformation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Now that the data is scaled, we can apply the PCA component
* We are not adding PCA to a pipeline in this lesson; we will do that at a later stage. The idea here is to understand how the process works.
* **A quick recap**: PCA reduces the number of variables, while it preserves as much information as possible. After the transformation, it creates a set of components, where each component contains the relevant information from the original variables.


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png
"> The first question is: 
* **How many components should I consider?** That depends; let's test, setting the number of components as the number of columns the scaled data has, in this case, 30. That is useful in understanding the explained variance of each component. 
* Read the pseudo-code and comments in the below code cell to understand its logic. Once you run the cell, you will notice that:
  * The first three components are more significant than the others and, together, they account for 72.47% of the data variance. That is okay. It is a good sign when a few components, like 3 or 4, capture more than 80% of your data variance. You could select three as the number of components, which is good progress since you had thirty features and now have three components.
  * But in this exercise, for learning purposes, we will aim for more than 90% of data variance and use seven components, since we get more data variance with a relatively small increase in components. Before, we had thirty features with all data variance. Then we switched to three components with 72% of data variance, and now seven components with more than 90% of data variance.
  

import numpy as np
from sklearn.decomposition import PCA # import PCA from sklearn

n_components = 30 # set the number of components as all columns in the data

pca = PCA(n_components=n_components).fit(df_pca)  # set PCA object and fit to the data
x_PCA = pca.transform(df_pca) # array with transformed PCA


# the PCA object has .explained_variance_ratio_ attribute, which tells 
# how much information (variance) each component has 
# We store that to a DataFrame relating each component to its variance explanation
ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,2),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

# prints how much of the dataset these components explain (naturally in this case will be 100%)
PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)

In the next cell we just copied the code from the cell above and changed n_components to 7. 
* With 7 components we achieved a bit more than 91% of data variance

n_components = 7

pca = PCA(n_components=n_components).fit(df_pca)
x_PCA = pca.transform(df_pca) # array with transformed PCA

ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,2),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)

Note that the data is transformed and stored at `x_PCA`. Let's check its content.
* You will notice it is a NumPy array with shape 341 x 7, where 341 is the number of rows and seven is the number of components we defined earlier.
* Imagine now that this data would be fed to a model. For this particular dataset, the ML task would be classification.
* Also, note that PCA helped reduce thirty features to seven components, and these seven components contain more than 90% of the information.

print(x_PCA.shape)
x_PCA
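
The choice of the number of components can also be read from the cumulative explained variance. The sketch below computes it with `np.cumsum` and finds the first count that reaches a target. It also shows that `PCA` accepts a fraction as `n_components` and then selects the count itself. As an assumption, it uses the full breast cancer dataset without the sampling and missing values introduced in this notebook, so the numbers may differ slightly from the cells above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: full dataset, no sampling and no missing values
X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit with all components and accumulate the explained variance ratio
pca_full = PCA(svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

target = 0.90
# First position where the cumulative ratio reaches the target
# (+1 converts the zero-based position into a count of components)
n_needed = int(np.argmax(cumulative >= target) + 1)
print(n_needed, round(100 * cumulative[n_needed - 1], 2))

# scikit-learn can also pick the number itself when given a fraction
pca_auto = PCA(n_components=target, svd_solver="full").fit(X_scaled)
print(pca_auto.n_components_)
```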

---

### Visualise data after PCA transformation

Imagine you want to visualise your data, before and after applying PCA.
* If you had to visualise the thirty features, you could do a correlation analysis and look for features that are correlated among themselves or, in this particular dataset, features that are correlated to the target.
* So let's suppose you want to visualise the relationship between "mean concavity" and "mean concave points" and the target. Since the features are numerical, you can do a scatter plot with them and colour by the target.
  * You can imagine the frontier between the blue and orange dots. The malignant and benign classes look largely separable, although a few data points are mingled at this frontier.
  * However, what about the remaining variables? When you consider this dataset as a whole, is that informative enough to separate these classes?

var1, var2 = 'mean concavity' , 'mean concave points'
sns.scatterplot(x=X[var1], y=X[var2], hue=df_target['diagnostic'])
plt.xlabel(var1)
plt.ylabel(var2)
plt.show()

We can plot the PCA components to evaluate, from another perspective, how the data behaves.
* We know x_PCA holds the data after transformation and has seven components. We will plot in a scatterplot the most representative components: components 0 and 1.

sns.scatterplot(x=x_PCA[:,0], y=x_PCA[:,1])
plt.xlabel('Component 0')
plt.ylabel('Component 1')
plt.show()

We know that these two components hold by themselves 62% of the information (data variance).
* This is powerful because with two variables (two components) we get a clearer view of whether the dataset has enough information to separate malignant and benign.
* We now colour the plot by diagnostic using df_target as the hue argument.
  * Note we see a clearer border between 0 and 1.
  * In a nutshell, we have the same data, showing the same information. The difference now is that the data was reduced to its major components.
  * The drawback is that we lose the interpretation, since component 0 is made of a combination of the original variables.

sns.scatterplot(x=x_PCA[:,0], y=x_PCA[:,1], hue=df_target['diagnostic'], alpha=0.8)
plt.xlabel('Component 0')
plt.ylabel('Component 1')
plt.show()
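
Although a component has no direct meaning, you can still inspect which original variables weigh most in it. A fitted `PCA` object stores these weights in `.components_`, with one row per component and one column per original feature. The sketch below is self-contained and, as an assumption, uses the full breast cancer dataset rather than the sampled `X` from this notebook; `df_weights` is a hypothetical name.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data_demo = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data_demo.data)
pca_demo = PCA(n_components=2, svd_solver="full").fit(X_scaled)

# Rows are components, columns are the original features
df_weights = pd.DataFrame(pca_demo.components_,
                          columns=data_demo.feature_names,
                          index=["Component 0", "Component 1"])

# Features with the largest absolute weight in component 0
top_features = (df_weights.loc["Component 0"]
                .abs()
                .sort_values(ascending=False)
                .head(5))
print(top_features)
```

A large absolute weight means the feature contributes strongly to that component, which gives a partial hint of what the component captures.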

Naturally, we can plot more components. In this exercise, we can plot three components in a 3D scatter plot using Plotly Express.
  * Move around the 3D plot and try to visualise if you could draw a surface that would separate the dots. The surface you imagined, is an ML model.
  * Note again these three components alone hold 72% of all information from the dataset to diagnose malignant or benign.

import plotly.express as px
fig = px.scatter_3d(x=x_PCA[:,0], y=x_PCA[:,1], z= x_PCA[:,2] , color=df_target['diagnostic'],
                    labels=dict(x="Component 0", y="Component 1", z='Component 2'),
                    color_continuous_scale='spectral',
                    width=750, height=500)
fig.update_traces(marker_size=5)
fig.show()

---

# Scikit-learn - Unit 08 - Cluster

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand how to group similar data using KMeans clustering algorithm
* Explain Clusters profiles



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 08 - Cluster

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Welcome to the world of unsupervised learning! It is slightly different from supervised learning, due to one aspect: there is no target variable. **The algorithm is left on its own to look for patterns in the data**
* The ML task we will study is called a **cluster**, a type of unsupervised learning that looks to group the data by similarity.
* The workflow used for a cluster will also be slightly different from regression and classification tasks. However, you will still do tasks like creating pipeline steps, fitting the pipeline using your data, and evaluating the pipeline. But now they will be done in a slightly different way.




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> There are multiple clustering algorithms in Scikit-learn; you may go to this [link](https://scikit-learn.org/stable/modules/clustering.html) and look for the potential algorithms to learn and use over your career. 
* We will study **KMeans** in this course since it is a starting point for your career and will not add much complexity to what we have been studying so far. In case you want to revise the concepts of KMeans, you may refer to Introduction to Predictive Analytics And Machine Learning - ML Essentials.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Workflow

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Introduction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In practical terms, you can't know for sure how good your cluster model's performance is.
* The exception is when you gather separate data and find a way to discover the actual value, so you can compare it to the cluster prediction. That is not trivial in practice, and in this course, we will not consider this alternative.
* That being said, you will not know for sure whether, for example, a pipeline with four clusters is in reality better than a pipeline with seven clusters. However, there are approaches you can use to frame the project and reach more conclusive results that will help you understand the patterns in your data.

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Expectation and Pipeline Objective

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> This notebook is dense and we cover many concepts. But always remember the core concept of this notebook is simple:
* **Fit a Cluster Pipeline that groups similar data and explains each Cluster profile**

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Major ideas

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
"> The major ideas  we consider in this notebook are:
* 1: **Create a Cluster pipeline**. Before fitting the pipeline, we need to define the number of PCA components and the number of clusters.
* 2: **Fit the Cluster Pipeline**
* 3: We need to **understand the Cluster profile**. We will use a classifier where the target is the cluster prediction to identify the most important variables that define a cluster.
* 4: **Cluster analysis**: explain each cluster profile in terms of the most important variables. In addition, if your dataset has a separate variable you want to study and you didn't include it in the cluster pipeline, you can study how this variable correlates to the clusters. In our case, we will analyse the clusters against the diagnostic (malignant or benign).

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Practical Workflow

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png
"> The practical workflow may be longer based on the ideas we outlined above. In particular, in this notebook, we will:
* 1 - **Create a Cluster Pipeline** that contains the following steps: data cleaning, feature engineering, feature scaling, PCA and Cluster Model (KMeans). Note: this pipeline has parameters for PCA and Cluster that we will need to update over the notebook.
* 2 - Run an analysis to determine the number of PCA components, then update that value in the Cluster Pipeline
* 3 - Apply **Elbow Method and evaluate the Silhouette score**, to define the number of clusters in Cluster Pipeline
* 4 - **Fit** the cluster pipeline
* 5 - Add the cluster predictions to the data
* 6 - Create a separate **Classifier Pipeline**, where the target variable is cluster predictions and features are the remaining variables
* 7 - Fit this classifier, evaluate its performance and assess the most important features. These features are the most **important features needed to define the cluster predictions**
* 8 - **Cluster analysis**: explain each cluster profile in terms of the most important features from the previous step. In addition, if your dataset has a separate variable you want to study and you didn't include it in the cluster pipeline, you can study how this variable correlates to the clusters.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Load Data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's load the breast cancer data from sklearn. It shows records for a breast mass sample and a diagnosis confirming whether it is malignant or benign cancer, where 0 is malignant, and 1 is benign.
* **Our objective is to cluster similar data points and then analyse the clusters against the diagnostic (malignant or benign)**
  * As a result, **we will use only the thirty features** (all variables but Diagnostic) to fit the cluster pipeline.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> We know in advance this dataset has only numerical features and no missing data.
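* We are intentionally adding missing data (`np.nan`) in the first ten rows of 'mean smoothness' using `.iloc[:10,4]`, which better simulates the datasets you will likely face in the workplace.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # used by the plots later in this notebook
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.iloc[:10, 4] = np.nan  # column 4 is 'mean smoothness'; the np.NaN alias was removed in NumPy 2.0
print(df.shape)
df.head()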
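
A quick check can confirm that the missing values landed where intended. The sketch below rebuilds the same DataFrame and counts the missing values per column; the name `missing_per_column` is ours, used only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.iloc[:10, 4] = np.nan  # same injection as in the cell above

# Count missing values per column and show only the columns that have any
missing_per_column = df.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Confirm which column sits at position 4
print(df.columns[4])
```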

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> ML Pipeline for Cluster

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The Cluster Pipeline is made of data cleaning (median imputation on `mean smoothness`), feature scaling, PCA and model (KMeans) steps.
* Note: the values for `n_components` of PCA and `n_clusters` of KMeans will be updated afterwards. For now, we leave an arbitrary placeholder value of 50 (it could be any number). This placeholder version is never fitted as a whole; PCA could not be fitted with 50 components on thirty features.

from sklearn.pipeline import Pipeline

### Data Cleaning
from feature_engine.imputation import MeanMedianImputer

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### PCA
from sklearn.decomposition import PCA

### ML algorithm
from sklearn.cluster import KMeans

def PipelineCluster():
  pipeline_base = Pipeline([
                            
      ( 'MeanMedianImputer', MeanMedianImputer(imputation_method='median',
                                               variables=['mean smoothness']) ),

      ("scaler", StandardScaler()  ),    

      ("PCA",  PCA(n_components=50, random_state=0)), 

      ("model", KMeans(n_clusters=50, random_state=0)  ), 
  ])
  return pipeline_base

PipelineCluster()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Principal Component Analysis (PCA)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Principal Component Analysis, or PCA, is a transformation of your data that attempts to find the combinations of features (the components) that explain the most variance in your data.
* PCA reduces the number of variables while preserving as much information as possible. The transformation creates a set of components, where each component combines the relevant information from the original variables.
* **This is useful in a Cluster pipeline since it is a method to reduce the feature space and provide data to the model that is in a better format for the algorithm to group similar data**.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are interested in finding the most suitable `n_components`, so that we can update the value in the ML Pipeline for Cluster.
* To do that, we will create an object based on `PipelineCluster()`, then remove the last two steps (PCA and model): `.steps[:-2]`
* As a result, `pipeline_pca` cleans and scales the data, so we can apply PCA afterwards.

pipeline_cluster = PipelineCluster()
pipeline_pca = Pipeline(pipeline_cluster.steps[:-2])
df_pca = pipeline_pca.fit_transform(df)

print(df_pca.shape,'\n', type(df_pca))
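
As a side note, a scikit-learn `Pipeline` can also be sliced directly, which gives the same sub-pipeline as rebuilding it from `.steps`. The sketch below assumes a stand-in pipeline where `SimpleImputer` replaces the `MeanMedianImputer` used in this notebook, so that it runs with scikit-learn alone; the names `pipeline_demo`, `via_steps` and `via_slice` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline: SimpleImputer replaces the MeanMedianImputer used in the notebook
pipeline_demo = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("PCA", PCA(n_components=7, random_state=0)),
    ("model", KMeans(n_clusters=3, random_state=0)),
])

# Option used in the notebook: build a new Pipeline from the list of steps
via_steps = Pipeline(pipeline_demo.steps[:-2])

# Equivalent option: slice the Pipeline object directly
via_slice = pipeline_demo[:-2]

print([name for name, _ in via_steps.steps])
print([name for name, _ in via_slice.steps])
```

In both cases the sub-pipeline holds the same estimator objects as the original pipeline, not copies of them.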

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we apply PCA separately to the scaled data, similar to what we did in previous unit notebooks.
* We first set the number of components to the number of columns the scaled data has, in this case, thirty. That is useful for understanding the explained variance of each component.
* The interpretation is similar to the previous PCA notebook.
  * The first three components are more significant than the others and, together, they account for 72.47% of the data variance. That is okay, although it is a better sign when a few components, like three or four, capture more than 80% of your data variance. You could still select three as the number of components, which is good progress since you had thirty features and now have three components.
  * But in this exercise, for learning purposes, we will aim for more than 90% of the data variance and use seven components, since we get more data variance with a relatively low increase in components.

n_components = 30 # set the number of components as all columns in the data

pca = PCA(n_components=n_components).fit(df_pca)  # set PCA object and fit to the data
x_PCA = pca.transform(df_pca) # array with transformed PCA


# the PCA object has .explained_variance_ratio_ attribute, which tells 
# how much information (variance) each component has 
# We store that to a DataFrame relating each component to its variance explanation
ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,3),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

# prints how much of the dataset these components explain (naturally in this case will be 100%)
PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)
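
Rather than reading the threshold off the table by eye, the cumulative explained variance can be computed directly. The sketch below is a minimal illustration under a few assumptions: it scales the original thirty features without injecting missing data (so the percentages may differ slightly from the table above), and the names `cumulative`, `threshold` and `n_needed` are ours.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale the thirty original features (no missing data injected in this sketch)
X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit PCA with all components and accumulate the explained variance ratio
pca_full = PCA(n_components=X_scaled.shape[1]).fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the threshold
threshold = 0.90
n_needed = int(np.argmax(cumulative >= threshold) + 1)
print(f"{n_needed} components reach {100 * cumulative[n_needed - 1]:.2f}% of the variance")

# PCA can also make this selection itself when n_components is a float between 0 and 1
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_scaled)
print(pca_auto.n_components_)
```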

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In the next cell, we just copied the code from the cell above and changed `n_components` to 7.
  * With seven components we achieved a bit more than 90% of data variance

n_components = 7

pca = PCA(n_components=n_components).fit(df_pca)
x_PCA = pca.transform(df_pca)

ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,3),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)

Next, we rewrite `PipelineCluster()`, updating `n_components` to 7 (`n_clusters` is still a placeholder at this point).
* Note: in an actual project, you don't necessarily have to rewrite the pipeline in the cell below. You could scroll up to the cell where we defined the pipeline previously and update it there. But for learning purposes, we'll rewrite the pipeline in the cell below.

def PipelineCluster():
  pipeline_base = Pipeline([
                            
      ( 'MeanMedianImputer', MeanMedianImputer(imputation_method='median',
                                               variables=['mean smoothness']) ),

      ("scaler", StandardScaler()  ),    

      ("PCA",  PCA(n_components=7, random_state=0)),  ##### we update the n_components to 7

      ("model", KMeans(n_clusters=30, random_state=0)  ), 
  ])
  return pipeline_base

PipelineCluster()
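
Another option, besides rewriting the function or editing the earlier cell, is to update the parameter on an existing pipeline object with `set_params()`. The sketch below uses a reduced stand-in pipeline (the imputation step is left out for brevity, and `pipeline_demo` is an illustrative name); it is not the approach this notebook follows.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reduced stand-in for the cluster pipeline (imputation step omitted for brevity)
pipeline_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("PCA", PCA(n_components=50, random_state=0)),
    ("model", KMeans(n_clusters=50, random_state=0)),
])

# Parameters of a step are addressed as <step name>__<parameter name>
pipeline_demo.set_params(PCA__n_components=7)

print(pipeline_demo["PCA"].n_components)
print(pipeline_demo.get_params()["model__n_clusters"])
```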

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Elbow Method and Silhouette Score

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are now interested in finding the most suitable value for `n_clusters`, so that we can update the value in the ML Pipeline for Cluster.
* But how do you know the optimal number of clusters for your data?
* **We will combine two techniques (Elbow Method and Silhouette Score) to find the optimal value for the number of clusters**. Both will suggest values, and we will use them in conjunction to decide on the optimal number of clusters.


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
 We will first explain and apply Elbow. Then we will explain and apply the Silhouette score.





<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">  There is a technique called Elbow Method. According to [Yellowbrick documentation](https://www.scikit-yb.org/en/latest/api/cluster/elbow.html) - (an ML visualization library), the elbow method runs k-means clustering on the dataset for a range of values for k and then for each value of k computes an average score for all clusters. By default, the distortion score is computed as the sum of square distances from each point to its assigned centre.
   
* That is plotted as a line chart, with the number of clusters on the x-axis and the distortion score on the y-axis. The line chart will remind you of an arm, and you pick the point of inflection (the elbow) as a candidate for the optimal number of clusters.
  * According to [Wikipedia](https://en.wikipedia.org/wiki/Elbow_method_(clustering)), using the "elbow" or "knee of a curve" as a cutoff point is a common heuristic in mathematical optimization to choose a point where diminishing returns are no longer worth the additional cost. In clustering, this means one should choose a number of clusters so that adding another cluster doesn't give much better modelling of the data.
  * You will also observe the plot and look for the range of values where there is a sharp, steep fall in the distances. This range will be used in the Silhouette Score analysis.
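
To make the distortion score concrete, the sketch below computes it by hand for a range of k values, using the `inertia_` attribute of a fitted `KMeans`. It runs on synthetic data from `make_blobs`, which is an assumption made purely for illustration; the names `X_demo`, `k_values` and `inertias` are ours.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups (an assumption made for illustration only)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

inertias = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # inertia_ is the sum of squared distances from each point to its closest centre
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k:2d}  distortion={inertia:10.2f}")
```

Plotting `inertias` against `k_values` gives the arm-shaped curve described above; the elbow is where the drop from one k to the next becomes small.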

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Prepare data for analysis
  * You need to transform your data up to the point that it will hit the model, for Elbow Method and Silhouette score. 
    * Therefore we remove the last step (`.steps[:-1]`) and fit_transform `pipeline_analysis` to the data.
    * Note the data has seven columns since it has passed through the PCA step.

pipeline_cluster = PipelineCluster()
pipeline_analysis = Pipeline(pipeline_cluster.steps[:-1])
df_analysis = pipeline_analysis.fit_transform(df)

print(df_analysis.shape,'\n', type(df_analysis))

Next, we use [`KElbowVisualizer()`](https://www.scikit-yb.org/en/latest/api/cluster/elbow.html) from Yellowbrick to implement the Elbow Method
* We pass in as arguments the algorithm we want (KMeans) and the range for the number of clusters we want to try, in this case from 1 to 10, so we pass in a tuple of (1,11), where the last value is not inclusive.
* Here, there is no fixed recipe; you have to try a few ranges for the number of clusters. Initially, you may try a range of 1 to 10 or 1 to 15 and refine it accordingly.
* Then we fit this object to `df_analysis` (the data that passed through data cleaning, feature scaling and PCA)
  * **Note the plot suggests three clusters!**
  * **Note also that between 2 and 5 the values have a sharp and steep falloff. Outside this range, it does not fall off in a similar manner.**

from yellowbrick.cluster import KElbowVisualizer

visualizer = KElbowVisualizer(KMeans(random_state=0), k=(1,11))
visualizer.fit(df_analysis) 
visualizer.show() 
plt.show()

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
"> There is also a **Silhouette score** that helps us to define the number of clusters. You can refer to Introduction to Predictive Analytics And Machine Learning > ML Essentials where we presented the concept in the performance metric video.

* The silhouette score **interprets and validates the consistency within clusters**. It is based on the mean intra-cluster distance and the mean nearest-cluster distance for each data point.
  * The mean intra-cluster distance is the average distance between the data point and all other data points in the same cluster. Essentially, it measures how close the data point is to the rest of its own cluster.
  * The mean nearest-cluster distance, on the other hand, is the average distance between the data point and all data points of the nearest cluster it does not belong to. In other words, it measures how far the data point is from its nearest neighbouring cluster.
 

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The silhouette score range is from -1 to +1, where:
  * “+1” means that a clustered data point is dense and properly separated from other clusters.
  * A score close to 0 means the clustered data point is overlapping with another cluster.  
  * A negative score means that the clustered data point may be wrong; it may even belong to another cluster.
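
The two distances above combine into the score of a single data point as (b - a) / max(a, b), where a is the mean intra-cluster distance and b is the mean nearest-cluster distance. The sketch below works this out by hand for one point and compares it with scikit-learn's `silhouette_samples`. It uses synthetic data from `make_blobs`, an assumption for illustration only; the variable names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic data, used only to illustrate the calculation
X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

i = 0  # index of the data point we inspect
distances = np.linalg.norm(X_demo - X_demo[i], axis=1)  # distance from point i to every point

# a: mean distance to the other points of the same cluster (point i itself excluded)
same_cluster = labels == labels[i]
same_cluster[i] = False
a = distances[same_cluster].mean()

# b: smallest mean distance to the points of any other cluster
b = min(distances[labels == c].mean() for c in np.unique(labels) if c != labels[i])

manual_score = (b - a) / max(a, b)
library_score = silhouette_samples(X_demo, labels)[i]
print(manual_score, library_score)
```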




<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
 The silhouette score for each data point allows you to build a Silhouette plot, showing each silhouette score for each data point across all clusters.
* You can then calculate an **average silhouette score** for the plot. This average helps to (1) compare different models with a different number of clusters and (2) define a performance metric for a given cluster model. A rule of thumb in the industry is that an average silhouette score greater than 0.5 means the clusters are nicely separated, but there may be cases where, for your dataset, the optimal number of clusters leads to an average lower than 0.5. That is fine too. It just means we found the best way for that dataset to cluster, even though it doesn't have a tremendous silhouette score.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> To evaluate a cluster's silhouette, we need the data formatted before it hits the model. We have done this already, and the result is stored at `df_analysis`
* We will use [SilhouetteVisualizer](https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html) and [KElbowVisualizer](https://www.scikit-yb.org/en/latest/api/cluster/elbow.html) from Yellowbrick


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The code has the following logic:
* **First, you will calculate the average silhouette score for different numbers of clusters** using `KElbowVisualizer()`, setting `KMeans()` as the algorithm, the range 2 to 6 for the number of clusters (`k=(2,7)`, where the last value is not inclusive; it doesn't accept 1 cluster) and `metric='silhouette'`. Then you will fit it to the transformed data (`df_analysis`) and show the results.
  * You will evaluate which number of clusters produces the highest average silhouette score.
* Then you will iterate on the **silhouette plot for models with a different number of clusters**, in this case from 2 to 10. You will use `SilhouetteVisualizer()` and set the estimator as `KMeans()`. Then you will fit it to the transformed data (`df_analysis`) and show the results.
  * You will evaluate if there are clusters with a maximum score below the average score, if the silhouette values vary too much within a cluster, if there are too many silhouette values lower than the average silhouette score and if there are too many negative silhouette values.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note the following:
* Average Silhouette Score: the best result is with two clusters, but three is not that far away. We will give more attention to evaluating the Silhouette Plot for these two options.
* Silhouette Plot:
  * Two clusters: One cluster is dominant (the blue) since it has more observations, and the majority of its values are greater than the average score (the red dotted line). The other cluster (green) has a few data points with a negative score (these may belong to other clusters) and almost no data point is above the average score.

  * Three clusters: One cluster is dominant (the blue) since it has more observations, and the majority of its values are greater than the average score (the red dotted line). The other two clusters appear to have a similar frequency. The last blue cluster has a few data points greater than the average and a few with negative silhouette values. However, the green middle cluster has a few data points with a negative score (these may belong to other clusters) and almost no data points are above the average score. This is not as bad as the two-cluster option, since more observations are above the average score in the non-dominant clusters.

from yellowbrick.cluster import SilhouetteVisualizer

print("=== Average Silhouette Score for different number of clusters ===")
visualizer = KElbowVisualizer(KMeans(random_state=0), k=(2,7), metric='silhouette')
visualizer.fit(df_analysis) 
visualizer.show() 
plt.show()
print("\n")

for n_clusters in np.arange(start=2,stop=11):
  
  print(f"=== Silhouette plot for {n_clusters} Clusters ===")
  visualizer = SilhouetteVisualizer(estimator = KMeans(n_clusters=n_clusters, random_state=0),
                                    colors = 'yellowbrick')
  visualizer.fit(df_analysis)
  visualizer.show()
  plt.show()
  print("\n")
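
If you prefer numbers to plots, the average silhouette score can also be computed with scikit-learn's `silhouette_score`. The sketch below loops over the same range of k (2 to 6) on synthetic data from `make_blobs`, which is an assumption for illustration; in this notebook you would pass `df_analysis` instead. The names `average_scores` and `best_k` are ours.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data, used only to illustrate the loop
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

average_scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # Mean of the silhouette scores of all data points for this value of k
    average_scores[k] = silhouette_score(X_demo, labels)

for k, score in average_scores.items():
    print(f"k={k}  average silhouette={score:.3f}")

best_k = max(average_scores, key=average_scores.get)
print("Highest average silhouette at k =", best_k)
```

As discussed above, the highest average is only one input to the decision; the per-cluster silhouette plots still need to be inspected.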

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%208-%20Challenge.png"> What is the number of clusters then?
* Elbow Method says three.
* The average Silhouette Score says two, but the Silhouette Plot from three clusters is better than for two clusters.
* As a result, we will pick three, since the Elbow Method and Silhouette Plot both support that decision.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we rewrite `PipelineCluster()`, updating `n_clusters` to 3
* Note: in an actual project, you don't necessarily have to rewrite the pipeline in the cell below. You could scroll up to the cell where we defined the pipeline previously and update it there. But for learning purposes, we'll rewrite the pipeline in the cell below.

def PipelineCluster():
  pipeline_base = Pipeline([
                            
      ( 'MeanMedianImputer', MeanMedianImputer(imputation_method='median',
                                               variables=['mean smoothness']) ),

      ("scaler", StandardScaler()  ),    

      ("PCA",  PCA(n_components=7, random_state=0)), 

      ("model", KMeans(n_clusters=3, random_state=0)  ),  ##### update n_clusters to 3 
  ])
  return pipeline_base

PipelineCluster()

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> Notice the additional effort and steps we take in clustering compared to the workflow we have for Classification and Regression. Only now are we ready to train the pipeline.
  * Note we have only one pipeline, and we are not doing hyperparameter optimisation when training the model.
  * In a sense, we did perform a kind of hyperparameter optimisation in the previous sections, since we tried different options for the number of PCA components and the number of clusters for KMeans().
  * Let's fit the pipeline then!

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit Cluster Pipeline

We don't need to split our data into train and test sets, since there is no target variable to evaluate predictions against. All available data is used for training.
* For training purposes, we create a DataFrame `X` that is a copy of the data.

X = df.copy()
print(X.shape)
X.head(3)

Then we fit the Cluster pipeline to the training data (`X`)

pipeline_cluster = PipelineCluster()
pipeline_cluster.fit(X)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Add cluster predictions to dataset

We add a column "`Clusters`" (with the Cluster Pipeline predictions) to X
* Scroll to the right and check the last variable. That is the cluster predictions for each data point of your dataset.
* The model predictions are stored in an attribute `.labels_`
* Since the model is in a pipeline, you will grab the `model` using the notation `pipeline_cluster['model'].labels_`

X['Clusters'] = pipeline_cluster['model'].labels_
print(X.shape)
X.head(3)
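
The sketch below illustrates two equivalent ways to reach the fitted model inside a pipeline and read its `.labels_`. It uses a reduced stand-in pipeline (no imputation step, since no missing data is injected here), so the cluster assignments may differ from the ones in this notebook; `pipeline_demo` and `X_demo` are illustrative names.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_demo = pd.DataFrame(data.data, columns=data.feature_names)

# Reduced stand-in for the cluster pipeline (no imputation, since no NaN is injected here)
pipeline_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("PCA", PCA(n_components=7, random_state=0)),
    ("model", KMeans(n_clusters=3, n_init=10, random_state=0)),
])
pipeline_demo.fit(X_demo)

# Two equivalent ways to reach the fitted KMeans step
labels_by_index = pipeline_demo["model"].labels_
labels_by_name = pipeline_demo.named_steps["model"].labels_

print(len(labels_by_index) == len(X_demo))  # one label per row
print(sorted(set(labels_by_index.tolist())))  # the cluster ids
```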

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we are interested in knowing the cluster frequency
* Note there are three clusters, and the counting starts from 0.
* The algorithm found that the majority of the data (63%) belongs to cluster number 2, while the remaining data points are shared equally between the other two clusters.


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">  **But what is the profile of each cluster?**


print(f"* Clusters frequencies \n{ X['Clusters'].value_counts(normalize=True).to_frame().round(2)} \n\n")
X['Clusters'].value_counts().sort_values().plot(kind='bar')
plt.show()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit a classifier, where the target is cluster predictions and the features are the remaining variables

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are at a point where we have cluster predictions made from the cluster pipeline, but we can't interpret the clusters yet. 


We are **interested in learning each cluster's profile**, based on the most relevant dataset variables.
* Our new dataset has `Clusters` as a variable. We use a technique where  `Clusters` will be the **target for a classifier**, and the remaining variables will be features for that target.
  * We will assume that the most relevant features for this classifier, will be the most relevant variables that define a cluster.

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 To do that, we will use the traditional workflow we covered in the previous notebooks: 
 * 1 - split the data into train and test set
 * 2 - create the classifier pipeline
 * 3 - fit the classifier to training data
 * 4 - evaluate pipeline performance
 * 5 - and (most important for our analysis) **assess feature importance**.

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png "> Note: If you need, pause for a second and reflect on which step from the "Major ideas" section we are in. That may help you to better understand our goal, which point we are at, and the next step we move on to.


We start by copying `X` to a DataFrame `df_clf`

df_clf = X.copy()
print(df_clf.shape)
df_clf.head(3)

Next, we split train and test sets, where the target variable is `'Clusters'`

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['Clusters'],axis=1),
                                    df_clf['Clusters'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Create a classifier pipeline 
* We should use the **data cleaning and feature engineering** steps from the Cluster Pipeline.
* Then we add the conventional steps for supervised learning: **feature scaling, feature selection and modelling**
* We use a tree-based algorithm, since it typically offers good results and its feature importance can be assessed with `.feature_importances_`. We are using GradientBoostingClassifier since it typically has good performance while it is fast to train.
  * We could conduct a detailed hyperparameter optimisation to find the best tree-based model, but we are most interested in finding a pipeline that can explain the relationship between the target (Clusters) and the features to assess the feature's importance afterwards.

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithm
from sklearn.ensemble import GradientBoostingClassifier 

def PipelineClf2ExplainClusters():
  pipeline_base = Pipeline([
                            
      ( 'MeanMedianImputer', MeanMedianImputer(imputation_method='median',
                                               variables=['mean smoothness']) ),

      ("scaler", StandardScaler()  ),    

      ("feat_selection", SelectFromModel(GradientBoostingClassifier(random_state=0)) ), 

      ("model",  GradientBoostingClassifier(random_state=0) ), 
  ])
  return pipeline_base

  
PipelineClf2ExplainClusters()

We fit the classifier to the training data
* Note again, here we are not doing a detailed hyperparameter optimisation. This classification pipeline is useful only for identifying the features that look to be most important for predicting the Clusters. We are not deploying this model, so fitting with the default hyperparameters is fine for this task.

pipeline_clf_cluster = PipelineClf2ExplainClusters()
pipeline_clf_cluster.fit(X_train, y_train)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Evaluate the classifier performance on the Train and Test Sets

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
 In theory, we expect good performance, since the Clusters were generated by KMeans() and that algorithm follows a consistent logic. As a result, the classifier algorithm (GradientBoosting) should be able to map these relationships. So let's check that.



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Then evaluate the performance on the Train set using `classification_report()`
* It looks to have learned the relationships to ace all predictions in the train set.

from sklearn.metrics import classification_report
print(classification_report(y_train, pipeline_clf_cluster.predict(X_train)))

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> And finally, we evaluate in the test set. 
* It looks to have learned the relationship between the target and the features to generalise on the test set, since the performance is not much different from the train set.

print(classification_report(y_test, pipeline_clf_cluster.predict(X_test)))
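
To see why a classifier can be expected to recover KMeans labels, here is a minimal, self-contained sketch. It is not the notebook's pipeline: it assumes scikit-learn's built-in breast cancer dataset and three clusters, and it skips imputation and feature selection. The names `labels`, `clf`, `train_acc` and `test_acc` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features only: the real diagnostic target is ignored on purpose
X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

# Step 1: create cluster labels (assumption: 3 clusters on scaled data)
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Step 2: treat the cluster labels as a classification target
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Step 3: compare train and test accuracy to check generalisation
train_acc = accuracy_score(y_tr, clf.predict(X_tr))
test_acc = accuracy_score(y_te, clf.predict(X_te))
print(f"train accuracy: {train_acc:.3f} | test accuracy: {test_acc:.3f}")
```

A small gap between the two scores suggests the classifier has learned the rule that KMeans used, rather than memorising the training rows.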

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Assess the Most Important Features that define a cluster

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Now we assess the feature importance from the pipeline. First, we need to know how many data cleaning and feature engineering steps your pipeline has.
* It has one step only: median imputation.

pipeline_clf_cluster

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We use the same code we saw in the previous unit notebook where we grab the feature importance from the feature selection step and store it in a DataFrame.
* The plot shows that these are the 4 most important features in descending order: `['mean concavity', 'worst perimeter', 'worst fractal dimension', 'mean perimeter']`




<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> **We are considering these as the most important variables that define a Cluster. They will be used to understand the Cluster Profile**

# after data cleaning and feat engineering, the feature space changes

data_cleaning_feat_eng_steps = 1 # how many data cleaning and feature engineering steps does your pipeline have?
columns_after_data_cleaning_feat_eng = (Pipeline(pipeline_clf_cluster.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

best_features = columns_after_data_cleaning_feat_eng[pipeline_clf_cluster['feat_selection'].get_support()].to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
          'Feature': columns_after_data_cleaning_feat_eng[pipeline_clf_cluster['feat_selection'].get_support()],
          'Importance': pipeline_clf_cluster['model'].feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

best_features = df_feature_importance['Feature'].to_list() # reassign best features in importance order

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{best_features} \n")
df_feature_importance.plot(kind='bar',x='Feature',y='Importance')
plt.show()
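
The key idea in the cell above is that `get_support()` returns a boolean mask over the columns entering the feature selection step, and the model's `feature_importances_` refer only to the columns that survive that mask. A reduced sketch, assuming scikit-learn's breast cancer dataset and a two-step pipeline (no imputation or scaling), makes the alignment explicit. The variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

pipe = Pipeline([
    ("feat_selection", SelectFromModel(GradientBoostingClassifier(random_state=0))),
    ("model", GradientBoostingClassifier(random_state=0)),
])
pipe.fit(X, y)

# Boolean mask with one entry per input column
mask = pipe["feat_selection"].get_support()
selected = X.columns[mask]

# The model only saw the selected columns, so both arrays have the same length
importance = (pd.Series(pipe["model"].feature_importances_, index=selected)
              .sort_values(ascending=False))
print(importance)
```

If the pipeline has steps before feature selection that add or drop columns, the mask must be applied to the column names after those steps, which is what `data_cleaning_feat_eng_steps` handles in the notebook code.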

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Cluster Analysis

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%209-%20Well%20done.png"> Bravo! You know which variables to consider now to explain each cluster!
* Let's create a custom function where we will explain the cluster profile, in terms of  `['mean concavity', 'worst perimeter', 'worst fractal dimension', 'mean perimeter']`. For each cluster, we want to know the most common values for each variable.

* Go through the code and check the pseudo-code and comments to understand its logic. It may take a while to understand it all, but the focus is to understand and apply the function to our business problem.

# df contains the most important features and the clusters
# Note: your DataFrame needs to have a variable called 'Clusters' which will
# contain the cluster prediction from the pipeline

# It outputs a table showing for each cluster what is the most common values for a given variable

def DescriptionAllClusters(df, decimal_points=3):

  DescriptionAllClusters = pd.DataFrame(columns=df.drop(['Clusters'],axis=1).columns)
  # iterate on each cluster , calls Clusters_IndividualDescription()
  for cluster in df.sort_values(by='Clusters')['Clusters'].unique():
    
      EDA_ClusterSubset = df.query(f"Clusters == {cluster}").drop(['Clusters'],axis=1)
      ClusterDescription = Clusters_IndividualDescription(EDA_ClusterSubset,cluster,decimal_points)
      # DataFrame.append() was removed in pandas 2.0; pd.concat() is the equivalent
      DescriptionAllClusters = pd.concat([DescriptionAllClusters, ClusterDescription])

  
  DescriptionAllClusters.set_index(['Cluster'],inplace=True)
  return DescriptionAllClusters


def Clusters_IndividualDescription(EDA_Cluster,cluster, decimal_points):

  ClustersDescription = pd.DataFrame(columns=EDA_Cluster.columns)
  # for a given cluster, iterate in all columns
  # if the variable is numerical, calculate the IQR: display as Q1 -- Q3.
    # That will show the range for the most common values for the numerical variable
  # if the variable is categorical, count the frequencies and display the top 3 most frequent
    # That will show the most common levels for the category

  for col in EDA_Cluster.columns:
    
    try:  # eventually a given cluster will have only missing data for a given variable
      
      if EDA_Cluster[col].dtypes == 'object':
        
        top_frequencies = EDA_Cluster.dropna(subset=[col])[[col]].value_counts(normalize=True).nlargest(n=3)
        Description = ''
        
        for x in range(len(top_frequencies)):
          freq = top_frequencies.iloc[x]
          category = top_frequencies.index[x][0]
          CategoryPercentage = int(round(freq*100,0))
          statement =  f"'{category}': {CategoryPercentage}% , "  
          Description = Description + statement
        
        ClustersDescription.at[0,col] = Description[:-2]


      
      elif EDA_Cluster[col].dtypes in ['float', 'int']:
        DescStats = EDA_Cluster.dropna(subset=[col])[[col]].describe()
        Q1 = round(DescStats.iloc[4,0], decimal_points)
        Q3 = round(DescStats.iloc[6,0], decimal_points)
        Description = f"{Q1} -- {Q3}"
        ClustersDescription.at[0,col] = Description
    
    
    except Exception as e:
      ClustersDescription.at[0,col] = 'Not available'
      print(f"** Error Exception: {e} - cluster {cluster}, variable {col}")
  
  ClustersDescription['Cluster'] = str(cluster)
  
  return ClustersDescription
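
To make the two summaries concrete, here is a small sketch on a hypothetical toy DataFrame (the columns `size` and `colour` are made up for illustration). It reproduces the same two ideas, "Q1 -- Q3" for numerical variables and the top levels with their share for categorical variables, using `groupby` instead of the loops above.

```python
import pandas as pd

# Hypothetical toy data: two clusters, one numerical and one categorical variable
toy = pd.DataFrame({
    "Clusters": [0, 0, 0, 0, 1, 1, 1, 1],
    "size": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    "colour": ["a", "a", "a", "b", "c", "c", "d", "d"],
})

def iqr_label(s, decimals=2):
    # interquartile range shown as "Q1 -- Q3"
    q1, q3 = s.quantile([0.25, 0.75])
    return f"{round(float(q1), decimals)} -- {round(float(q3), decimals)}"

def top_levels_label(s, n=3):
    # up to n most frequent levels with their relative frequency
    freq = s.value_counts(normalize=True).nlargest(n)
    return ", ".join(f"'{level}': {int(round(share * 100))}%"
                     for level, share in freq.items())

profile = pd.DataFrame({
    "size": toy.groupby("Clusters")["size"].apply(iqr_label),
    "colour": toy.groupby("Clusters")["colour"].apply(top_levels_label),
})
print(profile)
```

For cluster 0, the `size` values 1, 2, 3, 4 give "1.75 -- 3.25", and `colour` gives "'a': 75%, 'b': 25%".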




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The next custom function is called `cluster_distribution_per_variable()` and is used to analyse the Clusters against a given variable - in our case, it will evaluate **Clusters x Diagnostic**.
* It will show the absolute and relative levels of Diagnostic (Malignant and Benign) per cluster
* Go through the code and check the pseudo-code and comments to understand its logic. It may take a while to understand it, but the focus is to understand and apply the function to our business problem.

import plotly.express as px
def cluster_distribution_per_variable(df,target):

  # the data should have 2 variables, the cluster predictions and
  # the variable you want to analyze with, in this case we call "target"
  
  # we use plotly express to create 2 plots
  # cluster distribution across the target
  # relative presence of the target level in each cluster
  
   
  df_bar_plot = df.value_counts(["Clusters", target]).reset_index() 
  df_bar_plot.columns = ['Clusters',target,'Count']
  df_bar_plot[target] = df_bar_plot[target].astype('object')

  print(f"Clusters distribution across {target} levels")
  fig = px.bar(df_bar_plot, x='Clusters',y='Count',color=target,width=800, height=500)
  fig.update_layout(xaxis=dict(tickmode= 'array',tickvals= df['Clusters'].unique()))
  fig.show()


  df_relative = (df
                 .groupby(["Clusters", target])
                 .size()
                 .groupby(level=0, group_keys=False)  # keep the original index; avoids a duplicated 'Clusters' level in pandas >= 2.0
                 .apply(lambda x:  100*x / x.sum())
                 .reset_index()
                 .sort_values(by=['Clusters'])
                 )
  df_relative.columns = ['Clusters',target,'Relative Percentage (%)']
 

  print(f"Relative Percentage (%) of {target} in each cluster")
  fig = px.line(df_relative, x='Clusters',y='Relative Percentage (%)',color=target,width=800, height=500)
  fig.update_layout(xaxis=dict(tickmode= 'array',tickvals= df['Clusters'].unique()))
  fig.update_traces(mode='markers+lines')
  fig.show()
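
The two tables behind these plots (absolute counts and within-cluster percentages) can also be checked without plotting. The sketch below is an alternative to the `groupby` chain above, not the function's own implementation: it uses `pd.crosstab` on hypothetical toy data with the same column names.

```python
import pandas as pd

# Hypothetical toy data with the same column names used in the notebook
toy = pd.DataFrame({
    "Clusters":   [0, 0, 0, 1, 1, 1, 1, 2, 2],
    "diagnostic": [0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Absolute counts per (cluster, diagnostic) pair: the data behind the bar plot
counts = pd.crosstab(toy["Clusters"], toy["diagnostic"])

# Percentage of each diagnostic level within a cluster: the data behind the line plot
relative = pd.crosstab(toy["Clusters"], toy["diagnostic"], normalize="index") * 100

print(counts)
print(relative.round(1))
```

Each row of `relative` sums to 100, which is a quick sanity check that the percentages are computed within a cluster and not across the whole dataset.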
 


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> To start the analysis we want a DataFrame that contains the best features and Cluster Predictions since we want to analyse the patterns for each cluster.
* We will copy `df_clf` DataFrame (since it has all the features and Cluster predictions) and filter `best_features` plus `['Clusters']`.


df_cluster_profile = df_clf.copy()
df_cluster_profile = df_cluster_profile.filter(items=best_features + ['Clusters'], axis=1)
print(df_cluster_profile.shape)
df_cluster_profile.head(3)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We also want to analyse the Diagnostic levels
* In this exercise, we get it from `data.target` and create a DataFrame.
* We know in advance Diagnostic represents a categorical variable and came as an integer. Therefore we change its data type to `'object'`.

df_diagnostic = pd.DataFrame(data.target, columns=['diagnostic'])
df_diagnostic['diagnostic'] = df_diagnostic['diagnostic'].astype('object')
df_diagnostic.head(3)

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Cluster profile on most important features

We call `DescriptionAllClusters()` and pass a concatenated DataFrame made with `df_cluster_profile` and `df_diagnostic`. Before passing it, let's display this concatenated data so you can visualise it better.
* It has the best features `['mean concavity', 'worst perimeter', 'worst fractal dimension', 'mean perimeter']`, Cluster Predictions and Diagnostic (where 0 is malignant, 1 is benign)



pd.concat([df_cluster_profile,df_diagnostic], axis=1).head(4)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Finally, we call `DescriptionAllClusters()`, passing the concatenated DataFrame. It outputs a table showing, for each cluster, the most common values for a given variable, including the diagnostic level (where 0 is malignant, 1 is benign). You also pass the number of decimal points to display for numerical variables; depending on the range of the variable, you may need more decimal points. In our case, 2 decimal points are fine, but you can re-run the function afterwards with different values, like 0 and 4.




<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png
"> Recall that we found the most important variables that help to define a cluster are: `['mean concavity', 'worst perimeter', 'worst fractal dimension', 'mean perimeter']`
* Note that the algorithm found that for Cluster 0, the most common values for mean concavity are between 0.13 -- 0.22, worst perimeter is between 145.7 -- 174.18, worst fractal dimension is between 0.08 -- 0.09 and mean perimeter is between 120.88 -- 136.88. Also, all diagnoses in cluster 0 are 0 - malignant. **This is the profile of cluster 0!**
  * Repeat this analysis for the remaining clusters. Note we start giving meaning to each cluster.
* Note also our analysed variable (diagnostic). It shows that cluster 0 has only malignant cases, cluster 1 is a mix of malignant and benign with malignant dominant, and cluster 2 is predominantly benign.
  * Think for a moment about how cool that is. The algorithm found patterns to split into three groups, one with malignant, another a mix and the last benign. Now think how this analysis could be applied to solve other business problems.



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note the major differences/patterns between clusters across variables, like:
  * The ranges of mean concavity look to be smaller when the diagnostic is benign (1) and look to increase when the diagnostic tends to 0 (malignant).
  * The values of the worst perimeter in clusters where malignant is predominant tend to be higher than in benign clusters.
    * Note we keep adding meaning to how the clusters interact based on the analysis between a given variable (mean concavity, for example) and diagnostic
  * Repeat the same analysis for other variables (worst fractal and mean perimeter)

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
 Typically you will notice differences in ranges across the clusters and across the levels of your analysed variable (diagnostic). These differences are typically the patterns we are interested in discovering.

pd.set_option('display.max_colwidth', None)
clusters_profile = DescriptionAllClusters(df=pd.concat([df_cluster_profile,df_diagnostic], axis=1),
                                          decimal_points=2)
clusters_profile

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Clusters distribution across Diagnostic levels & Relative Percentage of Diagnostic in each cluster

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> This analysis shows the Cluster's distribution across Diagnostic. This information is revealed in the previous table, but now we can make it more visual to stakeholders. It has 2 plots:
* The first is a bar plot: the x-axis shows the clusters, the bar length shows how many data points are in that cluster, and the colour shows the level of diagnosis (where 0 is malignant, 1 is benign).
* The second plot gives a complementary view to the first. In the first, we saw the absolute values (the counts). Now we see the relative values (the percentages).

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png
"> Let's analyse the plots
* The first plot shows that cluster 0 has malignant cases only. Cluster 1 is a mix of both cases, with malignant predominant. The last cluster is predominantly benign (however, there are a few malignant cases; if required, you could analyse these malignant cases later).
* The second plot reveals the percentage presence of Diagnostic (malignant and benign) and displays the percentage in each cluster.


df_cluster_vs_diagnostic=  df_diagnostic.copy()
df_cluster_vs_diagnostic['Clusters'] = X['Clusters']
cluster_distribution_per_variable(df=df_cluster_vs_diagnostic, target='diagnostic')

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> What should I do now?

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%209-%20Well%20done.png"> You could deploy the Cluster Pipeline as is. However, it would **need all 30 variables** to predict a given cluster for a new breast sample even though you used 4 variables to describe the profile from each cluster.
* In a real system, we should consider the number of input variables we want to manage.
* Therefore, we would consider an additional step for trying to **refit the cluster pipeline using the most important variables**. We say "trying" since we will need to conduct a tradeoff analysis to validate if the pipeline with all variables and the pipeline with only the "best feature" produce "equivalent" results.
  * In case they produce "equivalent" results, you can deploy a pipeline with fewer variables that will deliver a similar performance.




<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> However, we will study this approach in our second walkthrough project
* For the moment, what really matters is to understand that we can **cluster the data on similar data points, explain the profile of clusters, and we can analyse the clusters vs another variable** (in our case, clusters vs diagnostic)
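
As a preview of what such a trade-off analysis could look like, here is a hedged sketch. It is not the method from the walkthrough project: it assumes a simplified pipeline (scaling plus KMeans with three clusters) on scikit-learn's breast cancer dataset and compares the two cluster assignments with the adjusted Rand index, which is 1.0 when both assignments group the rows identically, regardless of how the clusters are numbered. The helper `fit_clusters` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True, as_frame=True)
best_features = ['mean concavity', 'worst perimeter',
                 'worst fractal dimension', 'mean perimeter']

def fit_clusters(data):
    # hypothetical helper: simplified cluster pipeline (scaling + KMeans)
    pipe = make_pipeline(StandardScaler(),
                         KMeans(n_clusters=3, n_init=10, random_state=0))
    return pipe.fit_predict(data)

clusters_all = fit_clusters(X)                   # all 30 variables
clusters_best = fit_clusters(X[best_features])   # only the 4 best features

# Agreement between both assignments (label numbering does not matter)
ari = adjusted_rand_score(clusters_all, clusters_best)
print(f"Adjusted Rand index: {ari:.3f}")
```

What counts as "equivalent" is a project decision; the score only quantifies how much the two groupings agree.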

---

# Scikit-learn - Unit 09 - NLP (Natural Language Processing)

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand and create an ML pipeline for NLP (Natural Language Processing)


---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - NLP (Natural Language Processing)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Conversational language, unlike text neatly entered into form inputs, is unstructured data that cannot be neatly broken down into elements in a row-column database table; there is a vast quantity of information available within it and waiting to be accessed. 
* Therefore, natural language processing aims to gather, extract and make available all of this information.




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%208-%20Challenge.png"> NLP is not a trivial task since its goal is to understand the language, not only process the text/strings/keywords. 
* As we know, language is ambiguous, subjective and subtle.  New words and terms are constantly added/updated and their meaning may change according to the context. 
* These aspects all together make NLP a very interesting and challenging task for ML.




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will study NLP (Natural Language Processing) as a supervised learning approach where the features are text and the target variable is a meaning associated with that given text. Therefore the ML task is Classification.
* The workflow will be similar to what we covered for Classification tasks, where we:
    * Load the data
    * Define the pipeline steps
    * Split the data into train and test sets
    * Train multiple pipelines using hyperparameter optimisation
    * Evaluate pipeline performance
* One difference will be defining the pipeline steps, where we will use steps for pre-processing the textual data before the modelling stage. Once you have a processed text, you can use ML algorithms to predict your target variable.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Load data

We will use a dataset that contains records telling if a given SMS message is spam or not (spam or ham). We load the data from GitHub.
* In this project we are interested in **predicting if a given message is spam or not**, therefore the ML task is Classification.

url = 'https://raw.githubusercontent.com/ShresthaSudip/SMS_Spam_Detection_DNN_LSTM_BiLSTM/master/SMSSpamCollection'
df = (pd.read_csv(url, sep ='\t',names=["label", "message"])
    .sample(frac=0.6, random_state=0)
    .reset_index(drop=True)
    )
df = df.sample(frac=0.5, random_state=101)
print(df.shape)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Note: just as a reminder, in an actual project, once you load your textual data, you could explore it using the techniques covered in the Text Analysis lesson. 
* We will not do that here since our focus is on the ML process used in NLP.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Split data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  As usual, we are splitting the data into train and test sets.
* In this case, there are two columns in the dataset, where the `message` contains the text, and the `label` tells if the SMS message was spam or not.
* In the end, we have a Pandas Series for the features (`message`) and target (`label`) - note the brackets subsetting the data, for example, `df['message']`

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'],
                                                    test_size=0.2, random_state=101)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Create the pipeline

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png
"> We will consider classic steps in an NLP pipeline, where we first clean the text and extract the features for the model.
* The pipeline steps will be slightly different from what we have been studying within Classification (Data Cleaning, Feature Engineering, Feature Scaling, Feature Selection and Model), but the purpose is the same: prepare the data for the model.


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
 Overall, here we will consider steps for **(1) cleaning the textual data and (2) representing the text as numbers, or feature extraction.**
* (1) In our case, we will make the text lowercase and remove punctuation for text cleaning.
    * The practical tasks for cleaning the textual data will differ from dataset to dataset; for example, you may have a dataset where you need to clean HTML tags, so you need a function to do that for you; or eventually, you need to remove diacritics (marks located above or below a letter to reflect a particular pronunciation, like *resumé*)
  
* (2) There are also multiple techniques for feature extraction; we will consider the ones we covered in Module 2; in this case, we **will tokenize the text and then use TF-IDF (Term Frequency-Inverse Document Frequency)**

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are using texthero to clean the textual data, **by changing the text to lowercase and removing punctuation** from the textual data
* If you want to refresh these concepts, you may refer back to Machine Learning Essentials > Machine Learning Tasks > Natural Language Processing, Recommender Systems unit video.
* We must create a custom Python class so we can pass it into the pipeline afterwards. We are using the same approach for creating custom transformers we saw in the feature-engine lesson, where we use BaseEstimator and TransformerMixin, and create fit and transform methods, so that the custom transformer can be added correctly to the ML pipeline.

from sklearn.pipeline import Pipeline
import texthero as hero

from sklearn.base import BaseEstimator, TransformerMixin
class text_cleaning(BaseEstimator, TransformerMixin):

  def __init__(self ):
    return None

  def fit(self, X, y=None):
    return self

  def transform(self, X):
    X = hero.preprocessing.lowercase(X)
    X = hero.remove_punctuation(X)
    return X
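
If texthero is not available in your environment, the same two cleaning steps can be sketched with pandas string methods only. This is an alternative, not the class used in this notebook: the name `SimpleTextCleaner` is hypothetical, and replacing each punctuation character with a space is an assumption about the desired behaviour.

```python
import string

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SimpleTextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical texthero-free cleaner: lowercase and strip punctuation."""

    def fit(self, X, y=None):
        # nothing to learn from the data
        return self

    def transform(self, X):
        X = pd.Series(X).str.lower()
        # map every ASCII punctuation character to a single space
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
        return X.str.translate(table)


messages = pd.Series(["Hello, World!", "FREE prize!!! Call NOW..."])
cleaned = SimpleTextCleaner().fit_transform(messages)
print(cleaned.to_list())
```

Because the class follows the fit/transform convention, it can be placed in a `Pipeline` step in the same way as `text_cleaning()`.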


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For feature extraction, we use **CountVectorizer** and **TfidfTransformer**. You can find their documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).
* We need to convert the textual data to a format from which the algorithms can learn the relationships, also known as vectors. 
  * CountVectorizer: According to its documentation, it converts a collection of text documents to a matrix of token counts. It stores the number of times every word is used in our text data. We are also removing English "stop words".
  * (TfidfTransformer) Term Frequency-Inverse Document Frequency Transformer: According to its documentation, it transforms a count matrix to a normalised tf or tf-idf representation. The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and are empirically less informative than features that occur in a small fraction of the data. In addition, this highlights the words that are most distinctive of a document, which makes them better for characterising it.
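
The two transformers described above can be seen in action on a tiny, made-up corpus of three "messages". This is only an illustration: the documents are hypothetical and stop-word removal is left out to keep the vocabulary easy to follow.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus
docs = ["free prize now", "call me now", "free free prize"]

# Step 1: token counts, one row per document and one column per word
vect = CountVectorizer()
counts = vect.fit_transform(docs)
print(sorted(vect.vocabulary_))   # learned vocabulary
print(counts.toarray())           # raw counts

# Step 2: re-weight the counts with tf-idf
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(counts)
print(np.round(tfidf.toarray(), 2))

# Words present in more documents receive a lower idf weight
idf_now = tfidf_transformer.idf_[vect.vocabulary_["now"]]    # in 2 of 3 documents
idf_call = tfidf_transformer.idf_[vect.vocabulary_["call"]]  # in 1 of 3 documents
print(idf_now, idf_call)
```

Note that each tf-idf row is scaled to unit length by default, so values are comparable across short and long messages.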


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png
"> Finally, our pipeline will have four steps:
* Text cleaning: lowercase the text and remove punctuation
* CountVectorizer: convert text to a matrix of token counts
* TF-IDF: transform a count matrix to a normalised tf or tf-idf representation
* Model

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def PipelineOptimization(model):
  pipeline = Pipeline([
                       
        ( 'text_cleaning', text_cleaning() ),
        ( 'vect', CountVectorizer(stop_words='english') ),
        ( 'tfidf', TfidfTransformer() ),
        ( 'model', model )
    ])
  
  return pipeline


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We load the Python class (HyperparameterOptimizationSearch) that we studied in previous units, which aims to fit a set of algorithms with multiple hyperparameters. A quick reminder of what this class does: 
* The developer defines a set of algorithms and their respective hyperparameters values.
* The code iterates on each algorithm and fits pipelines using GridSearchCV considering its respective hyperparameter values. The result is stored. That is repeated for all algorithms that the user listed.
* Once all pipelines are trained, the developer can retrieve a list with a performance result summary and an object that contains all trained pipelines. The developer can then subset the best pipeline.

from sklearn.model_selection import GridSearchCV
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model=  PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
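
The `score_summary()` method relies on the `split{i}_test_score` entries that GridSearchCV stores in `cv_results_`. The sketch below shows where those entries come from, assuming a made-up eight-message corpus, a reduced three-step pipeline and an empty hyperparameter grid (default hyperparameters only).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 4 spam and 4 ham messages
texts = ["win a free prize now", "free cash prize claim now",
         "urgent free offer call now", "claim your free reward",
         "are we meeting for lunch", "see you at home tonight",
         "can you call me later", "thanks for the lovely dinner"]
labels = ["spam"] * 4 + ["ham"] * 4

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("model", SGDClassifier(random_state=101)),
])

# An empty grid means one candidate: the default hyperparameters
gs = GridSearchCV(pipe, param_grid={}, cv=2, scoring="accuracy")
gs.fit(texts, labels)

results = gs.cv_results_
# One array per CV split, each with one score per hyperparameter combination
split_scores = np.array([results[f"split{i}_test_score"] for i in range(gs.cv)])
print(results["params"])           # list of evaluated combinations
print(split_scores)                # shape: (n_splits, n_combinations)
print(split_scores.mean(axis=0))   # same values as results["mean_test_score"]
```

`score_summary()` stacks these per-split arrays and computes the min, max, mean and standard deviation for every hyperparameter combination.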

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> List algorithms

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Now we list the algorithms we want to use for this task. First, we are considering new estimators from Scikit-learn that tend to offer reasonable performance for NLP tasks.
  * It doesn't mean we couldn't have considered the algorithms we have seen already in the course, like tree-based algorithms. However, the central aspect is that we should use algorithms that tend to be more effective for NLP tasks.
  * For teaching purposes, we will consider only two algorithms (SGDClassifier and LinearSVC) from this set of algorithms used for NLP tasks to speed up the learning process. However, we suggest you try out the other algorithms at your own pace and time.
  * We will not give full details of how these other algorithms work to avoid overloading you with a lot of new information. It will be a matter of time, experience and curiosity for you to keep learning new topics as a data practitioner, including learning about additional families of algorithms. There is a BONUS section at the end of the next notebook where we will briefly explain the algorithms and present the typical hyperparameters used for the NLP classification task.

from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.svm import LinearSVC

models_search = {
    # "MultinomialNB": MultinomialNB(),
    "SGDClassifier": SGDClassifier(random_state=101),
    # "SVC": SVC(random_state=101),
    "LinearSVC": LinearSVC(random_state=101),
}


params_search = {
    # "MultinomialNB": {},
    "SGDClassifier": {},
    # "SVC": {},
    "LinearSVC": {},
}


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png
">
 We are using the technique we covered in previous units for hyperparameter optimisation, where we:
* 1 - Fit multiple pipelines with multiple algorithms using their default hyperparameters, so we can find the algorithms that appear to fit the data best.
* 2 - Fit multiple pipelines for the best algorithms using multiple hyperparameter combinations.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit multiple pipelines with multiple algorithms using their default hyperparameters

We start by fitting multiple pipelines using the default hyperparameters.
* We pass in the training data, set the scoring metric to accuracy (we assume our stakeholders are interested in how accurate their system is) and set cv=2 (5 is a more typical value, but we use 2 to keep things simple and speed up training).

search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring='accuracy',
           n_jobs=-2,
           cv=2)
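
As a rough illustration of what `cv=2` means here: for a classifier and an integer `cv`, GridSearchCV splits the training data into stratified folds, so each fold keeps roughly the same proportion of each class. The sketch below uses a small made-up label list (an assumption for illustration, not the SMS dataset) to show the two validation folds.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels for illustration only: 6 ham and 2 spam messages
y_toy = ["ham"] * 6 + ["spam"] * 2
X_toy = np.zeros(len(y_toy))  # placeholder features; only the number of rows matters

folds = []
skf = StratifiedKFold(n_splits=2)
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # Count how many messages of each class land in the validation fold
    test_counts = Counter(y_toy[i] for i in test_idx)
    folds.append(test_counts)
    print(f"train size={len(train_idx)}, validation labels={dict(test_counts)}")
```

Each candidate pipeline is trained on one fold and scored on the other, then the roles swap; the summary reports the minimum, mean and maximum of those scores.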

Let's check the training results summary.
* Note that SGDClassifier performed best, although the difference to LinearSVC is slight.

grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

---

# Scikit-learn - Unit 09 - NLP (Natural Language Processing)

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand and create an ML pipeline for NLP (Natural Language Processing)


---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - NLP (Natural Language Processing)

* We will continue from the previous notebook, where we found the algorithm that best suited the data (SGDClassifier). We now run a more extensive hyperparameter optimisation to find the pipeline with the best hyperparameter combination.
* Once we find the best pipeline, we will evaluate it and make predictions using real-time data.
* We will need to reload the data and recreate the pipeline function and the custom class for hyperparameter optimisation.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Load data

We will use a dataset that contains records indicating whether a given SMS message is spam or not (spam or ham). We load the data from GitHub.
* In this project we are interested in **predicting if a given message is spam or not**, therefore the ML task is Classification.

url = 'https://raw.githubusercontent.com/ShresthaSudip/SMS_Spam_Detection_DNN_LSTM_BiLSTM/master/SMSSpamCollection'
df = (pd.read_csv(url, sep ='\t',names=["label", "message"])
    .sample(frac=0.6, random_state=0)
    .reset_index(drop=True)
    )
df = df.sample(frac=0.5, random_state=101)
print(df.shape)
df.head()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Split data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  As usual, we are splitting the data into train and test sets.
* In this case, there are two columns in the dataset: `message` contains the text, and `label` tells if the SMS message is spam or not.
* In the end, we have one Pandas Series for the features (`message`) and one for the target (`label`). Note the brackets subsetting the data, for example, `df['message']`.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'],
                                                    test_size=0.2, random_state=101)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Create the pipeline

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png
"> We will consider classic steps in an NLP pipeline, where we first clean the text and then extract the features for the model.
* The pipeline steps will be slightly different from what we have been studying within Classification (Data Cleaning, Feature Engineering, Feature Scaling, Feature Selection and Model), but the purpose is the same: prepare the data for the model.


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
">
 Overall, here we will consider steps for **(1) cleaning the textual data and (2) representing the text as numbers or feature extraction.**
* (1) In our case, we will make the text lowercase and remove punctuation for text cleaning.
    * The practical tasks for cleaning textual data differ from dataset to dataset. For example, you may need a function to strip HTML tags, or to remove diacritics (marks located above or below a letter to reflect a particular pronunciation, as in *resumé*).
  
* (2) There are also multiple techniques for feature extraction. We will consider the ones we covered in ML essentials: we **will tokenize the text and then use TF-IDF (Term Frequency-Inverse Document Frequency)**.
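
The two ideas above (clean the text, then turn it into numbers) can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: `clean_text` and `tokenize` are hypothetical helper names, the messages are made up, and the whitespace tokenizer is simpler than the one Scikit-learn applies.

```python
import string
from collections import Counter


def clean_text(text):
    """Lowercase the text and replace ASCII punctuation with spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)


def tokenize(text):
    """Split the cleaned text on whitespace to get a list of tokens."""
    return clean_text(text).split()


# Hypothetical messages, not taken from the dataset
messages = ["WINNER!! Claim your FREE prize now.", "Are we meeting now?"]

# One token-count dictionary per message: a hand-made "bag of words"
token_counts = [Counter(tokenize(m)) for m in messages]
for counts in token_counts:
    print(dict(counts))
```

The counts per message are the raw material that a TF-IDF step later re-weights.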

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are using texthero to clean the textual data **by changing the text to lowercase and removing punctuation**.
* If you want to refresh these concepts, you may refer back to the NLP video.
* We need to create a custom Python class to pass into the pipeline afterwards. We use the same approach for creating custom transformers that we saw in the feature-engine lesson: we inherit from BaseEstimator and TransformerMixin and define fit and transform methods, so the custom transformer can be added correctly to the ML pipeline.

from sklearn.pipeline import Pipeline
import texthero as hero

from sklearn.base import BaseEstimator, TransformerMixin
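class text_cleaning(BaseEstimator, TransformerMixin):
  """Custom transformer that lowercases text and removes punctuation."""

  def __init__(self):
    # Stateless transformer: there are no hyperparameters to store
    pass

  def fit(self, X, y=None):
    # Nothing to learn from the data; return self so the pipeline can chain calls
    return self

  def transform(self, X):
    # X is expected to be a Pandas Series of text
    X = hero.preprocessing.lowercase(X)
    X = hero.remove_punctuation(X)
    return X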
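
If texthero is not available in your environment, the same fit/transform pattern can be sketched with Pandas string methods. This is a hypothetical stand-in (`SimpleTextCleaner` is not part of the course code) that replaces punctuation with spaces; it shows that `fit_transform` is inherited from TransformerMixin and that the class can be chained with other steps.

```python
import string

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class SimpleTextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the texthero-based cleaner."""

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        # Map every ASCII punctuation character to a space
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
        return pd.Series(X).str.lower().str.translate(table)


toy_messages = pd.Series(["Hello, World!", "FREE prize!!!"])  # made-up examples

# fit_transform is inherited from TransformerMixin
cleaned = SimpleTextCleaner().fit_transform(toy_messages)
print(cleaned.tolist())

# The transformer can be chained with other steps in a Pipeline
toy_pipeline = Pipeline([("clean", SimpleTextCleaner()), ("vect", CountVectorizer())])
toy_counts = toy_pipeline.fit_transform(toy_messages)
print(toy_counts.shape)
```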


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For feature extraction we use **CountVectorizer** and **TfidfTransformer**. You can find their documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).
* We need to convert the textual data to a numeric format, known as vectors, that the algorithms can learn relationships from.
  * CountVectorizer: According to its documentation, it converts a collection of text documents to a matrix of token counts. It stores the number of times each word is used in each message. We are also removing English "stop words".
  * (TfidfTransformer) Term Frequency-Inverse Document Frequency Transformer: According to its documentation, it transforms a count matrix to a normalized tf or tf-idf representation. The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and are empirically less informative than features that occur in a small fraction of the data. In addition, this highlights the words that are most distinctive of a document, which makes them better for characterising it.
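
To make the two steps concrete, here is a small sketch on a made-up three-message corpus (an assumption for illustration). It prints the count matrix and the tf-idf matrix, and compares two weights against the default idf formula from the Scikit-learn documentation, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`.

```python
import math

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["free prize now", "call me now", "free free call"]  # made-up corpus

vect = CountVectorizer()
counts = vect.fit_transform(corpus)
# Vocabulary terms ordered by their column index in the count matrix
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)
print(counts.toarray())

tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(tfidf.round(3))

# Default idf: ln((1 + n_docs) / (1 + document frequency)) + 1
n_docs = len(corpus)
idf_prize = math.log((1 + n_docs) / (1 + 1)) + 1  # "prize" appears in 1 document
idf_free = math.log((1 + n_docs) / (1 + 2)) + 1   # "free" appears in 2 documents

# Both words occur once in the first message, so their weights differ only by idf
ratio = tfidf[0][vocab.index("prize")] / tfidf[0][vocab.index("free")]
print(round(ratio, 3), round(idf_prize / idf_free, 3))
```

In the first message, "prize" receives a larger weight than "free" even though both occur once, because "prize" appears in fewer messages.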


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png
"> Finally, our pipeline will have four steps:
* Text cleaning: lowercase the text and remove punctuation.
* CountVectorizer: convert the text to a matrix of token counts.
* TF-IDF: transform a count matrix to a normalised tf or tf-idf representation.
* Model.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def PipelineOptimization(model):
  pipeline = Pipeline([
                       
        ( 'text_cleaning', text_cleaning() ),
        ( 'vect', CountVectorizer(stop_words='english') ),
        ( 'tfidf', TfidfTransformer() ),
        ( 'model', model )
    ])
  
  return pipeline


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We load the Python class (HyperparameterOptimizationSearch) that we studied in previous units, which fits a set of algorithms with multiple hyperparameters. A quick reminder of what this class does:
* The developer defines a set of algorithms and their respective hyperparameter values.
* The code iterates over each algorithm and fits pipelines using GridSearchCV with its respective hyperparameter values. The result is stored. This is repeated for all algorithms that the developer listed.
* Once all pipelines are trained, the developer can retrieve a performance summary table and an object that contains all trained pipelines. The developer can then select the best pipeline.

from sklearn.model_selection import GridSearchCV
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model=  PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Fit multiple pipelines for the best algorithms using multiple hyperparameter combinations

We update our dictionaries using the algorithms and hyperparameters combinations we want to optimise.

from sklearn.linear_model import SGDClassifier

models_search = {
    "SGDClassifier":SGDClassifier(random_state=101),}


params_search = {
    "SGDClassifier": {'model__tol':[1e-2, 1e-1], },
  }
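
The `model__tol` key follows Scikit-learn's `<step name>__<parameter>` naming for pipeline parameters: `model` is the name of the last pipeline step and `tol` is a hyperparameter of the estimator in that step. The sketch below uses a reduced two-step pipeline (an assumption, to keep the example self-contained) to list those names and set one of them.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# A reduced pipeline for illustration; the step names mirror the ones used in this notebook
toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("model", SGDClassifier(random_state=101)),
])

# get_params() lists every tunable name; step parameters carry the step prefix
model_param_names = [name for name in toy_pipeline.get_params() if name.startswith("model__")]
print(model_param_names[:5])

# set_params accepts the same names, which is what GridSearchCV relies on
toy_pipeline.set_params(model__tol=1e-2)
print(toy_pipeline.get_params()["model__tol"])
```

Calling `get_params().keys()` on your own pipeline is a quick way to check a hyperparameter name before adding it to `params_search`.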

Next, we fit multiple pipelines using the algorithms we selected considering multiple combinations of hyperparameters.

search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring='accuracy',
           n_jobs=-2,
           cv=2)

Let's check the training results summary.
* SGDClassifier is the only algorithm in this search, so the summary has one row per hyperparameter combination. Compare the mean score with the default-hyperparameter result from the previous notebook to see whether the optimisation improved performance.

grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

We check for the best model programmatically.

best_model = grid_search_summary.iloc[0,0]
best_model

So we can grab the best model parameters.

grid_search_pipelines[best_model].best_params_

And grab the best pipeline.

best_pipeline = grid_search_pipelines[best_model].best_estimator_
best_pipeline

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Pipeline Performance

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Finally, we evaluate the pipeline as usual with our custom function for classification tasks.

from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):

  prediction = pipeline.predict(X)

  # Passing the predictions as y_true puts predictions on the rows and actual values
  # on the columns; labels=label_map keeps the row/column order aligned with the names.
  print('---  Confusion Matrix  ---')
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y, labels=label_map),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, labels=label_map, target_names=label_map),"\n")


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We pass in the arguments with which we are familiar:
* Train and test sets
* Best pipeline
* For `label_map`, we get the class names with `.unique()`
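
One detail worth checking is label order: `.unique()` returns the classes in order of first appearance, while `confusion_matrix` sorts the labels by default unless a `labels` argument is given. The sketch below uses made-up predictions (an assumption for illustration) to show how the matrix changes with the label order.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up actual values and predictions, for illustration only
y_actual = pd.Series(["spam", "ham", "ham", "spam", "ham"])
y_predicted = pd.Series(["spam", "ham", "spam", "spam", "ham"])

label_map = y_actual.unique()  # order of first appearance: spam, then ham
print(label_map)

# Default behaviour: rows and columns follow the sorted labels (ham, spam)
default_cm = confusion_matrix(y_actual, y_predicted)
print(default_cm)

# With labels=label_map: rows and columns follow label_map (spam, ham)
ordered_cm = confusion_matrix(y_actual, y_predicted, labels=label_map)
print(ordered_cm)
```

If the names used for the row and column headers come from `.unique()`, passing the same array as `labels` (or sorting the names) keeps the headers and the counts aligned.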


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png
"> Note: The model learned the relationships in the data in the train set and predicted everything correctly. In the test set, we had a few misclassifications, but the performance still looks good, and the **model could generalise to the unseen data** (test set).

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=best_pipeline,
                label_map= df['label'].unique()
                )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Predict using real-time data

Pass in a real-time message to check whether or not you should click on the link   :)
* Try new sentences by changing the content of the `real_time_msg` variable.

########################################################################
real_time_msg = 'Congratulations, you won the auction. Please click on link below to get your prize'
########################################################################

X_live = pd.Series(data=real_time_msg, name='message')
best_pipeline.predict(X_live)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Bonus: Typical hyperparameters for algorithms listed in this notebook

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We again reinforce that it will take time and experience to learn which hyperparameters to consider when optimising your pipeline and which values make sense to tune.
* The library documentation is your best friend: it tells you which hyperparameters are available for a given algorithm.
* The hyperparameters we list here are suggestions that you can use as a reference when you start fine-tuning your ML pipelines.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> We will write the hyperparameters for the algorithms using the same dictionary structure we saw over the notebook, assuming you are arranging everything into a pipeline and the last step is called '`model`'

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Support Vector Machine


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Support Vector Machine (or SVM) is an algorithm that can be used for Classification or Regression.
* The idea is to find a hyperplane that separates the data.
  * A hyperplane is a boundary that separates the data points. For N variables it is N-1 dimensional. For example, with two variables (2 dimensions) you can plot the data in an XY plot, like a 2D scatter plot, and the hyperplane is a line. With three variables (3 dimensions) you can plot the data in an XYZ plot, like a 3D scatter plot, and the hyperplane is a [plane](https://en.wikipedia.org/wiki/Plane_(geometry)) (note: a geometric plane, not an aeroplane).
  * The hyperplane should have the maximum distance (called the margin) to the nearest data points of each class. Support vectors (hence the algorithm's name) are the data points closest to the hyperplane; they determine its position and orientation.
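
As a minimal sketch of these ideas, the example below fits an SVC with a linear kernel on made-up 2D points (an assumption for illustration) and prints the hyperplane coefficients, the margin width and the support vectors. With a linear kernel, the hyperplane is the set of points where `w · x + b = 0`, and the margin width is `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2D data: class 0 on the left, class 1 on the right
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.5],
                  [3.0, 0.0], [3.0, 1.0], [4.0, 0.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

# coef_ and intercept_ are available for the linear kernel and define the hyperplane
w, b = svm.coef_[0], svm.intercept_[0]
print("hyperplane: w =", w.round(3), "b =", round(float(b), 3))
print("margin width:", round(float(2 / np.linalg.norm(w)), 3))

# The support vectors are the training points that fix the position of the hyperplane
print("support vectors:\n", svm.support_vectors_)

new_points = np.array([[0.5, 0.5], [3.5, 0.5]])
new_predictions = svm.predict(new_points)
print(new_predictions)
```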

# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
from sklearn.svm import SVC

params_search = {
    "SVC": {# 'model__C': [1, 0.5, 1.5],
            'model__tol': [1e-3, 1e-2, 1e-4],
            # 'model__kernel': ['rbf', 'poly', 'sigmoid'],
            }
}


---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Linear Support Vector Machine

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> According to its documentation, Linear Support Vector Machine is similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

# https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
from sklearn.svm import LinearSVC

params_search = {

    "LinearSVC": {#'model__C':[1,0.5,1.5],
                  'model__tol':[1e-3,1e-2,1e-4],
                  },
}

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Linear classifier with SGD (Stochastic Gradient Descent)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> According to Scikit-learn documentation, this estimator implements regularized linear models (SVM, logistic regression, etc.) with stochastic gradient descent (SGD) learning.
* SGD is a simple yet very efficient approach to fitting linear classifiers and regressors under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. SGD is merely an optimisation technique and does not correspond to a specific family of machine learning models. It is only a way to train a model.
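
Because SGD is only the training procedure, the `loss` hyperparameter decides which linear model is fitted. According to the SGDClassifier documentation, `loss='hinge'` (the default) gives a linear SVM, while other losses such as `'modified_huber'` give a different linear model. A minimal sketch on made-up 2D data (an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Made-up 2D data with two well-separated classes
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

fitted = {}
for loss in ("hinge", "modified_huber"):
    # Same optimiser and data; only the loss function changes
    clf = SGDClassifier(loss=loss, random_state=101)
    clf.fit(X_toy, y_toy)
    fitted[loss] = clf
    print(loss, "-> coef:", clf.coef_.round(3), "intercept:", clf.intercept_.round(3))
```

Both models are linear (one weight per feature plus an intercept); the fitted weights generally differ because each loss penalises errors differently.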

# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
from sklearn.linear_model import SGDClassifier

params_search = {
    "SGDClassifier": {'model__tol': [1e-3, 1e-2, 1e-4],
                      # 'model__penalty': ['l2', 'l1', 'elasticnet'],
                      # 'model__alpha': [0.0001, 0.001],
                      },
}

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Naive Bayes

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> According to Scikit-learn [documentation](https://scikit-learn.org/stable/modules/naive_bayes.html), Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. 
* Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
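
As a small sketch of how MultinomialNB works on token counts, the example below fits it on a made-up count matrix (the three token columns and the labels are assumptions for illustration) and predicts a new message. With `alpha=1.0`, each token probability is estimated per class as `(count + 1) / (total count + number of tokens)`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Columns: counts of the hypothetical tokens "free", "prize", "meeting"
X_counts = np.array([[2, 1, 0],
                     [1, 2, 0],
                     [0, 0, 2],
                     [0, 1, 3]])
y_labels = np.array(["spam", "spam", "ham", "ham"])

nb = MultinomialNB(alpha=1.0).fit(X_counts, y_labels)

# A new message containing "free" once and "prize" once
new_message = np.array([[1, 1, 0]])
pred = nb.predict(new_message)
proba = nb.predict_proba(new_message)

print(nb.classes_)  # column order of predict_proba
print(pred, proba.round(3))
```

Each token contributes independently to the class score, which is the "naive" assumption described above.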

# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
from sklearn.naive_bayes import MultinomialNB

params_search = {
    "MultinomialNB":{'model__alpha': [1.0, 0.6, 0.4, 1.3, 0.0]
                     },
}


---