# Scikit-learn - Unit 01 - ML Pipeline and ML tasks

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%201%20-%20Lesson%20Learning%20Outcome.png"> Lesson Learning Outcome

* **Scikit-learn Lesson is made of 9 units.**
* By the end of this lesson, you should be able to:
  * Learn and use the workflow for training and evaluating the ML pipeline
  * Create a pipeline according to our dataset and ML task
  * Fit Regression, Classification, Cluster, PCA (Principal Component Analysis), and NLP (Natural Language Processing) considering different algorithms
  * Learn and use the code to fit multiple algorithms in one go, with hyperparameter optimization

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Reinforce ML pipeline concepts and the ML tasks that will be covered in the next notebooks.
* Learn and use the workflow for training and evaluating the ML pipeline.



---

Scikit-learn allows you to train machine learning models for classification, regression or clustering. In addition, it provides a wide set of functions for data processing, dimensionality reduction, feature engineering, feature scaling, feature selection, tuning model hyperparameters, creating an ML pipeline, evaluating a model's performance and more.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png">
 **Why do we study Scikit-learn?**
  * Because it is a centralized and complete library for conventional ML, containing a suite of practical modules that help data practitioners from the development to the deployment of ML pipelines.



## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png"> Additional Learning Context

* We encourage you to:
  * Add **code cells and try out** other possibilities, e.g. play around with parameter values in a function/method, or consider additional function parameters.
  * Also, **add your comments** in the cells. It can help you to consolidate the learning. 

* Parameters in given function/method
  * As you may expect, a given function in a package may contain multiple parameters. 
  * Some of them are mandatory to declare; some have pre-defined values, and some are optional. We will cover the parameters most commonly used in Data Science for a particular function/method. 
  * However, you may seek additional information in the respective package documentation, where you will find instructions on how to use a given function/method. The studied packages are open source, so this documentation is public.
  * **For Scikit-learn, the link is [here](https://scikit-learn.org/stable/g/). We will also use the XGBoost library to train pipelines with eXtreme Gradient Boosting, which is a tree-based algorithm. The documentation is [here](https://xgboost.readthedocs.io/en/latest/index.html)**.

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

We will install scikit-learn, xgboost, feature-engine and yellowbrick to run our exercises.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 01 - ML Pipeline and ML tasks

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Introduction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In a nutshell, Machine learning is a data-driven approach that uses algorithms to learn patterns and relationships from the data, without being explicitly programmed. 
* The developer gives the algorithm data and an objective. The algorithm is trained and figures out how to match the objective based on the provided data.
* This creates a model, and the trained model is used for predicting behaviours and outputs, allowing decision-making on unseen data.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> ML is heavily applied in practical terms for multiple use cases in many industries, examples include:
* E-mail spam detection
* Customer Churn
* Text Sentiment Analysis
* Fraud Detection
* Real-time Ads
* Recommendation Engine (e.g. when you watch movies online, after finishing one movie you receive suggestions on what to watch next)


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will explore
* Pipeline concepts
* Data Cleaning and Feature Engineering
* Feature Scaling and Feature Selection
* ML tasks covered in this lesson
* General Workflow

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> **Note** <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">
* The perceived difficulty may increase in the upcoming notebooks, since we will start applying in practice a series of concepts covered in the videos and the previous notebooks.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Pipeline concepts

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In the previous lesson, we introduced the scikit-learn Pipeline, which arranges a sequence of tasks.
* In ML, we are interested in arranging a sequence of tasks that are in line with the ML process of **data cleaning, feature engineering, feature scaling, feature selection and model**
* In an ML pipeline, the last step is typically the model, and the preceding steps prepare the data for the model

We import Pipeline from sklearn

from sklearn.pipeline import Pipeline


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> On top of that, the pipeline should identify two outcomes: the training outcome and the prediction outcome. 
* For that, we use estimators as part of the pipeline steps. There are two types of estimators mainly used: predictors and transformers.
  * A predictor estimator uses methods like **.fit()** and **.predict()**. An ML model uses these methods to learn patterns from the data and is used for predictions afterwards.
  * On the other hand, a transformer estimator uses the methods **.fit()** and **.transform()** because it learns parameters from the data and later uses them to transform the data (for example, rescaling it to a better-behaved range).
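To make the two estimator types concrete, here is a minimal sketch. The tiny dataset is made up for illustration, and `LogisticRegression` is just one example of a predictor; neither comes from the lesson's exercises.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset: two features, binary target (illustrative values only)
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0], [4.0, 260.0]])
y = np.array([0, 0, 1, 1])

# Transformer: .fit() learns parameters (mean and standard deviation per column),
# .transform() applies them to the data
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
print(scaler.mean_)    # per-column means learned during .fit()
print(X_scaled.shape)  # same shape as X, values rescaled

# Predictor: .fit() learns the relationship between features and target,
# .predict() returns one prediction per row
model = LogisticRegression()
model.fit(X_scaled, y)
predictions = model.predict(X_scaled)
print(predictions)
```

Note that the scaler has no `.predict()` and the model has no `.transform()`: that difference is what separates the two estimator types.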
  


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We will demonstrate the differences between fitting models with and without a pipeline
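As a preview, the sketch below fits the same scaler and model twice: once step by step and once inside a Pipeline. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) and a naive every-other-row split, both chosen only to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, y_train = X[::2], y[::2]    # every other row, for illustration only
X_test, y_test = X[1::2], y[1::2]

# Without a pipeline: every step is fitted and applied by hand
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = DecisionTreeClassifier(random_state=101)
model.fit(X_train_scaled, y_train)
manual_predictions = model.predict(X_test_scaled)

# With a pipeline: one object handles scaling and modelling in order
pipeline = Pipeline([
    ("feature_scaling", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=101)),
])
pipeline.fit(X_train, y_train)
pipeline_predictions = pipeline.predict(X_test)

# Both approaches run the same computation, so the predictions should agree
print((manual_predictions == pipeline_predictions).all())
```

The pipeline version has fewer intermediate variables to keep track of, which reduces the chance of accidentally applying a step to the wrong dataset.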

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Data Cleaning and Feature Engineering

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In the feature-engine lesson, we studied common techniques to handle data cleaning and feature engineering tasks, using feature-engine's built-in transformers or creating your own transformer.
* In addition, we arranged this transformer in a pipeline

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Feature Scaling and Feature Selection

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Once the data is cleaned and engineered, you should consider feature scaling and feature selection. We have studied their definition in the Module: Machine Learning Essentials / Section: ML Pipeline. Please refer to it if you need refreshing.

* In this section, we will cover the practical step of feature scaling and feature selection.



#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Feature Scaling

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The scale of a feature is an important aspect when fitting a model. For example, algorithms like K-means clustering, Linear and Logistic Regression, and Neural Networks are highly affected by the scale of their features.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> According to Scikit-learn [documentation](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html), feature scaling can be an important preprocessing step for many machine learning algorithms. Standardization involves rescaling the features such that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one.
* The idea behind scaling the features is to make all features within a similar scale.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will present `StandardScaler()`, which standardizes the data: it centers the variable at zero and sets the variance to 1 by subtracting the mean from each observation and dividing by the standard deviation. The result is also known as the Z-score. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
* We will cover the StandardScaler transformer in the course as a first go-to option for feature scaling. However, there are other alternatives, and you may check the [documentation](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html) to learn more. 
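The Z-score can be written as $z = (x - \mu) / \sigma$, where $\mu$ is the mean and $\sigma$ the standard deviation of the variable. As a quick check, the sketch below compares a manual calculation with `StandardScaler()` on a small made-up column of numbers (the values are illustrative only).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One made-up numeric variable, shaped as a single column
values = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

# Manual Z-score: subtract the mean, divide by the standard deviation.
# np.std uses the population standard deviation (ddof=0) by default,
# which is the same convention StandardScaler uses.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))  # do both approaches agree?
print(scaled.mean(), scaled.std())  # mean close to 0, standard deviation close to 1
```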


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The tradeoff of feature scaling is that the variable's values change: they are shifted and rescaled, so they are no longer expressed in their original units. Still, we will create better conditions for the algorithm to learn the patterns and relationships in the data and generalize on unseen data.

Let's use the iris dataset

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)
# df =  sns.load_dataset('iris')
print(df.shape)
df.head()

We will import `StandardScaler()`

from sklearn.preprocessing import StandardScaler

We create a pipeline with a step called 'feature_scaling' and attach `StandardScaler()`. Since we don't pass any variables to it, it scales all variables.

from sklearn.pipeline import Pipeline
pipeline = Pipeline([
      ("feature_scaling", StandardScaler()) 
  ])

We will apply this pipeline to the features in the train set. We will learn how to split data soon, but for now, we will manually create a train set and a test set, where each has a set for features and the target variable.
* In this dataset, the features are `['sepal_length', 'sepal_width', 'petal_length', 'petal_width']` and the target is `['species']`. We shuffle the data and assign the first 100 rows to the train set. The remaining rows go to the test set.
* <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> There is a proper way to split a train and test set. We will cover that soon.
* The central point is to have 2 sets (Train and test) and have features and the target separated.


Let's shuffle the data. We use `.sample(frac=1)`; the documentation link is [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html). It returns a random sample from the data, and with `frac=1` the sample contains all rows in random order.

df = df.sample(frac=1)
df.head()

df.shape

The train set features are X_train, which holds the first 100 rows of the feature columns. The train set target is y_train, which holds the first 100 rows of species. The same rationale goes for the test set: X_test holds the last 50 rows of the feature columns and y_test the last 50 rows of species.

# First 100 shuffled rows go to the train set, the remaining 50 to the test set
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X_train = df[features][:100]
y_train = df[['species']][:100]
X_test = df[features][100:]
y_test = df[['species']][100:]
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

We check the DataFrames dimensions

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

When applying pipelines to ML, we fit the pipeline on the train set only (so it learns its parameters from that data) and, based on this learning, transform both the train and the test set.

pipeline.fit(X_train)
X_train_scaled = pipeline.transform(X_train)
X_test_scaled = pipeline.transform(X_test)

One caveat of using sklearn transformers is that, by default, they output NumPy arrays instead of Pandas DataFrames. You may remember that feature-engine outputs DataFrames. 

type(X_train_scaled)

So we need an additional step to convert the scaled data back to a DataFrame.

X_train_scaled = pd.DataFrame(data= X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(data= X_test_scaled, columns=X_train.columns)

Now we can move on, since the dataset is a DataFrame.

type(X_train_scaled)

We are now interested to see the difference in each feature before and after applying StandardScaler().
* We create a logic to loop on each feature and plot two histograms in the same plot. One shows the data distribution before applying ``StandardScaler()`` and the other after applying it.
* The blue plot is before applying, and the red is after. Note that the red histograms are centred at zero on the x-axis. You will notice that the values are shifted and rescaled, but that is part of the tradeoff we mentioned earlier.

sns.set_style('whitegrid')
for col in X_train.columns:
  fig, axes = plt.subplots(figsize=(8,5))
  sns.histplot(data=X_train, x=col, kde=True, color='b',  ax=axes)
  sns.histplot(data=X_train_scaled, x=col, kde=True,color='r', ax=axes)
  axes.set_title(f"{col}")
  axes.legend(labels=['Before Scaling', 'After Scaling'])
  plt.show()
  print("\n\n")

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Feature Selection

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> The primary goal of feature selection is to have a process to select the relevant features for fitting an ML model. 

That is important since: 
* Models with fewer, more relevant features are simpler to interpret.
* You reduce the chance of overfitting by removing features that may add little information or noise.
* You reduce the time needed to train the models.
* You reduce the feature space. You require less effort from the software development team to design and implement the interface (either API or dashboard) to the production environment.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> This step can be seen as a combination of search techniques to look for a subset of features and an evaluation measure that scores the different feature subsets. There are a few methods for feature selection:
* Filter Method
* Wrapper Method
* Embedded Method



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In the course, as a starting point for your career, we will use the Embedded method.
* It is called an embedded method since it performs feature selection during model training. It finds the feature subset for the algorithm that is being trained.
* The method automatically trains an ML model, derives feature importance from it, and then removes non-relevant features based on the derived feature importance.
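The sketch below illustrates that mechanism with `SelectFromModel` and a Decision Tree: train the model, read its feature importances, and keep the features at or above a threshold. It assumes scikit-learn's bundled iris data and relies on the default threshold, which for a tree-based estimator is the mean feature importance.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)

# SelectFromModel trains the estimator internally when .fit() is called
selector = SelectFromModel(DecisionTreeClassifier(random_state=101))
selector.fit(X, y)

# Importance derived from the trained tree, one value per feature
importances = pd.Series(selector.estimator_.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Features whose importance is at or above the threshold are kept
threshold = selector.threshold_
print(threshold)

selected = X.columns[selector.get_support()]
print(selected.tolist())
```

The selected subset depends on both the dataset and the algorithm, so a different estimator may keep a different set of features.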


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> For example:
* Suppose your pipeline is considering a Decision Tree algorithm in the model step. In that case, you can add before the model step a feature selection step using an embedded method considering a Decision Tree.


Let's reuse the same data from the previous exercise: the iris dataset

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

X_train.head()


We are using `SelectFromModel()` as the method. Its documentation is found [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html). 
* The argument is the algorithm you are considering in the pipeline

from sklearn.feature_selection import SelectFromModel

We create a pipeline using a Decision Tree algorithm that contains 3 steps:
* feature scaling: like we saw in the previous example.
* feature selection: use SelectFromModel considering the same algorithm from the model step.
* model: uses a Decision Tree algorithm (we will get into more details in upcoming units; for now, take this as the model step and let's use a decision tree for the example).

from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
      ( "feature_scaling", StandardScaler() ),
      ( "feature_selection", SelectFromModel(DecisionTreeClassifier(random_state=101)) ),
      ( "model", DecisionTreeClassifier(random_state=101) ),
  ])

pipeline

We fit the pipeline with the Train set

pipeline.fit(X_train,y_train)

And access the feature_selection step, using bracket notation as we saw in the feature-engine lesson

pipeline['feature_selection']

That was not informative. We need to use `.get_support()` to access which features were selected by this step. 
* The output is a boolean list, where its length and order are related to the original feature space.
* For example, the train set has four features. We see that the feature_selection step selected the last two features since they are True. The first two features were not considered since they are False in the boolean list.

pipeline['feature_selection'].get_support()

However, we want to know the features list that was selected, not a boolean list.
* We then use this boolean list to subset the features.
* A quick recap on the features list

X_train.columns

We use the boolean list to subset the previous list
* And here we have the features that were considered important for that given dataset using that given algorithm

X_train.columns[pipeline['feature_selection'].get_support()] 

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> ML tasks

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In this lesson, we will explore business cases that involve the following ML tasks
* Regression
* Classification (Binary and Multi-class)
* Clustering
* NLP (Natural Language Processing)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We will use structured and tabular datasets from libraries like Seaborn, Plotly, Scikit-learn and Yellowbrick.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> General Workflow

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In a practical project, you can use the CRISP-DM workflow to manage project steps. In case you want a refresher on the workflow, refer back to the Module Delivering Data Science projects.
* For this lesson, we will focus on the following CRISP-DM steps: data understanding, data preparation, modelling and evaluation.



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Therefore, when you reach the modelling phase in a project, it is assumed you have collected the data, conducted an EDA, and defined the pipeline steps.

* When modelling, for supervised learning, you will typically use an overall workflow like:
  * Split the dataset into train and test set
  * Fit the model (either a pipeline or not) 
  * Evaluate your model. If performance is not good, revisit the process, starting from collecting the data, conducting EDA, etc.
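The three steps above can be sketched as follows. This is only an outline under assumptions: it uses scikit-learn's bundled iris data, an 80/20 split with `train_test_split`, and accuracy as the evaluation metric, none of which is prescribed by this unit.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)

# 1. Split the dataset into train and test set (the 80/20 ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101, stratify=y
)

# 2. Fit the model, here arranged as a pipeline, on the train set only
pipeline = Pipeline([
    ("feature_scaling", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=101)),
])
pipeline.fit(X_train, y_train)

# 3. Evaluate on both sets; a large gap between them may indicate overfitting
train_accuracy = accuracy_score(y_train, pipeline.predict(X_train))
test_accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"Train accuracy: {train_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```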


There are some potential small variations to this workflow, but this is the starting point we consider in your modelling journey.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> **HUGE WARNING** <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">
* **Reflect** for a second on how many steps and considerations you need before fitting a model. You may be surprised that, in a project where you are responsible from end to end, the modelling phase takes a small percentage of your time and attention.

* Even so, this phase is critical to your project, **so let's stop the reading/talking and let's fit some models**.




---