# Scikit-learn - Unit 02 - Split your data, fit a model, predict and save the model

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and implement the basic workflow for splitting the data, fitting a model, predicting on data and saving the model.



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

The exercises rely on scikit-learn, xgboost, feature-engine and Yellowbrick being installed. The cell below imports the general-purpose packages used in this unit.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 02 - Split your data, fit a model and predict

In this unit, we will cover how to:
  * Split your data
  * Fit a model
  * Run predictions with the fitted model
  * Save the model, so you can use it later

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In supervised learning, you split your data before fitting a model. In conventional ML, as with Scikit-learn, we split the data into Train and Test sets.
* The validation set is a part of the Train set. When you use a Scikit-learn function for hyperparameter optimization, the validation set is taken from the Train set automatically. Therefore, we split into Train and Test sets only.
* If you want a refresher on Train, Validation and Test sets, refer back to Module 2 - ML Essentials.
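As a rough illustration of the validation set being taken automatically, here is a minimal sketch. It assumes `GridSearchCV` as the hyperparameter optimization function, scikit-learn's bundled copy of iris (`load_iris`), and an arbitrary `max_depth` grid; none of these choices come from this unit.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data, so the sketch runs without a download
X, y = load_iris(return_X_y=True)

# We only split into Train and Test sets ourselves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101)

# With cv=5, the search repeatedly holds out one fifth of the Train set
# as a validation fold; we never create a validation set by hand
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=101),
    param_grid={"max_depth": [2, 3, None]},  # hypothetical grid
    cv=5,
)
search.fit(X_train, y_train)

n_folds = search.n_splits_
print("Validation folds:", n_folds)
print("Best parameters:", search.best_params_)
```

The Test set is not touched during the search, so it stays available for a final check.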

Let's consider the iris dataset. It contains records of 3 classes of iris plants, with their petal and sepal measurements.

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)
# df = sns.load_dataset('iris')
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> How do you know which variables are features and which variable is a target variable?
* It will depend on the context of your ML project. You will need to know or need to investigate the objective of your ML project to determine features and the target.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For this dataset, `species` is the target variable. There are three species. We need to classify the species according to the flower's petal and sepal measurements. Our ML task will therefore be classification.

df['species'].value_counts()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Split your data

We use `train_test_split()` to split the data. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). The parameters we will use are:
* The first 2 are the features and target, respectively. In this case, for the features, you drop `species`, and for the target, you subset `species`.
* ``test_size:`` the proportion of the data to include in the test set. We set it to 0.4.
* ``random_state:`` according to the documentation, it controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. It can be any non-negative integer. We suggest keeping the same `random_state` value across your project. Here we select 101.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> `random_state` is a critical parameter in ML, which we will use in other use cases. It essentially gives **REPRODUCIBILITY** to your project. That means the result you get here right now is the same result another person will get somewhere else at another time.


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['species'],axis=1),
                                                    df['species'],
                                                    test_size=0.4,
                                                    random_state=101)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)
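To see what `random_state` gives you, here is a minimal sketch. It assumes scikit-learn's bundled copy of iris (`load_iris`) rather than the CSV above, purely so the snippet runs without a download, and the seed 7 is an arbitrary comparison value.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled iris data (150 rows, 4 features)
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Two calls with the same random_state select exactly the same rows
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.4, random_state=101)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.4, random_state=101)
same_split = X_train_a.index.equals(X_train_b.index)

# A different seed shuffles the rows differently, so the split changes
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y, test_size=0.4, random_state=7)
same_as_other_seed = X_train_a.index.equals(X_train_c.index)

print("Same rows with the same seed:", same_split)
print("Same rows with another seed:", same_as_other_seed)
print("Train shape:", X_train_a.shape, "Test shape:", X_test_a.shape)
```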

Let's have a look at `X_train`
* These are the features used to train the model
* Note that the features are numbers. Scikit-learn estimators expect numeric features, which is why categorical features have to be encoded before fitting
* In this dataset, we don't need any data cleaning or categorical encoding

X_train.head()

Let's inspect `y_train`. These are categories.
* When the ML task is classification, Scikit-learn handles either numbers or categories for the target variable

y_train

In addition, `y_train` is a Pandas Series

type(y_train)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit your model

You got a preview of tree-based algorithms in Module 2. Even though we have a dedicated unit for tree-based algorithms, we will use a decision tree algorithm here to give a sense of the basic workflow for fitting a model.
* We will use `DecisionTreeClassifier()`, the documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
* We create a Python variable called `model` and assign it an instance of `DecisionTreeClassifier()`. Naming this object `model` is a common convention.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">  Note: we create a model and fit it directly. We can do that because the data doesn't require a pre-processing step, such as data cleaning or categorical encoding, before fitting.

* Fitting only the model is fine as a learning experience. However, in our exercises, we will not fit the model on its own but instead use a pipeline that contains a series of steps, where typically the last step is the model.

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

Next, we fit the model with the train set - features (`X_train`) and target (`y_train`)
* We use the `.fit()` method and pass `X_train` and `y_train`. Simple as that.

model.fit(X_train,y_train)
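For a sense of what the pipeline mentioned above could look like, here is a minimal sketch. The step names and the `StandardScaler` step are illustrative assumptions (a decision tree does not need scaling); the point is only that the model sits in the last step and the whole pipeline is fitted with one `.fit()` call. It also assumes scikit-learn's bundled copy of iris.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data, with string labels as the target
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target_names[iris.target.to_numpy()]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101)

# Hypothetical two-step pipeline: a pre-processing step, then the model
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=101)),
])

# One call fits every step in order, ending with the model
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions[:5])
```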

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Run predictions

Let's predict the test set using our model.
* We use `.predict()` and pass the test set features (`X_test`)
* The result is an array with one predicted label per row of `X_test`

model.predict(X_test)

You can predict the probability (between 0.0 and 1.0) for each class for a given observation using `.predict_proba()`

model.predict_proba(X_test)
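The relationship between `.predict_proba()` and `.predict()` can be checked with a short sketch. It assumes scikit-learn's bundled copy of iris and a fixed `random_state` for the tree, neither of which is used in the cells above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data, with string labels as the target
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target_names[iris.target.to_numpy()]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101)
model = DecisionTreeClassifier(random_state=101).fit(X_train, y_train)

# One row per observation, one column per class
proba = model.predict_proba(X_test)

# The probabilities of each row add up to 1
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# Columns follow the order of model.classes_, so the most probable
# column gives the same label as .predict()
labels_from_proba = model.classes_[proba.argmax(axis=1)]
matches_predict = bool((labels_from_proba == model.predict(X_test)).all())

print(proba.shape, rows_sum_to_one, matches_predict)
```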

Ideally, we should predict on the Train and Test set, set a performance metric and evaluate model performance.
* We will not evaluate the model yet. We will leave it until another unit
* The idea here is to get a feel for how a basic training and predicting process works under the hood.

Let's assume now you want to predict on real-time data.
* In an application, you will likely create an interface to collect the data or will get the data from somewhere else, from an API, for example.
* In this case, we will manually create a DataFrame that contains the features. We call it `X_live`. It has one row only, so we make a single prediction. A DataFrame with several rows would mean running predictions in a batch.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In theory, you can set any value for each variable, but in practice, the values will follow the actual data distribution.

X_live = pd.DataFrame(data={'sepal_length':6.0,
                            'sepal_width':3.9,
                            'petal_length':2.5,
                            'petal_width':0.9},
                      index=[0] # scalar values need an explicit index, so we pass a single row label, 0
                      )
X_live

Let's predict on the live data

model.predict(X_live)

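The model assigns a probability of 1.0 to a single class. This is expected for a fully grown decision tree, whose leaves usually contain training samples of one class only, so do not read it as a measure of how reliable the prediction is.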

model.predict_proba(X_live)

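We already saw in the prediction above that this class is versicolor. To map each probability column to its label, check `model.classes_`, which stores the labels in the same order as the columns returned by `.predict_proba()`.

model.classes_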
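Predicting in a batch works the same way as predicting one row. The sketch below is a minimal example under stated assumptions: it uses scikit-learn's bundled copy of iris with the columns renamed to match this notebook, and the three rows of measurements are made-up values for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data, renamed to the column names used above
iris = load_iris(as_frame=True)
X = iris.data.copy()
X.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
y = pd.Series(iris.target_names[iris.target.to_numpy()], name="species")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101)
model = DecisionTreeClassifier(random_state=101).fit(X_train, y_train)

# Hypothetical batch of three flowers; lists of values need no explicit index
X_live = pd.DataFrame(data={"sepal_length": [5.0, 6.0, 6.9],
                            "sepal_width": [3.4, 2.8, 3.1],
                            "petal_length": [1.5, 4.5, 5.8],
                            "petal_width": [0.2, 1.4, 2.2]})

# Keep the same column order the model was trained with
X_live = X_live[X_train.columns]

# One probability column per class, labelled with model.classes_
results = pd.DataFrame(model.predict_proba(X_live), columns=model.classes_)
results.insert(0, "prediction", model.predict(X_live))
print(results)
```

Labelling the probability columns with `model.classes_` removes any guesswork about which column belongs to which species.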

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Save your model

We can save either an ML model or an ML pipeline as a .pkl file with a library called joblib.
* You need the function `joblib.dump()`, the documentation is [here](https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html). We will pass the arguments `value`, the object we want to save, and `filename`, which is the directory + filename + .pkl (we are saving at root, so there is no directory prefix)

import joblib
joblib.dump(value=model , filename="my_first_ml_model.pkl")

Once you are in an application or in another notebook, you can load the model with `joblib.load()`. The documentation is [here](https://joblib.readthedocs.io/en/latest/generated/joblib.load.html). You pass the argument `filename` as the directory + filename + .pkl

loaded_model = joblib.load(filename="my_first_ml_model.pkl")
loaded_model
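A quick way to gain confidence in a saved model is to check that the loaded copy predicts exactly like the original. Here is a minimal sketch; it assumes scikit-learn's bundled copy of iris, and the temporary folder and the file name `model_sketch.pkl` are hypothetical, chosen so that the snippet leaves no file behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101)
model = DecisionTreeClassifier(random_state=101).fit(X_train, y_train)

# Save to a temporary folder and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "model_sketch.pkl")
    joblib.dump(value=model, filename=file_path)
    loaded_model = joblib.load(filename=file_path)

# The loaded model should give the same predictions as the original
same_predictions = np.array_equal(model.predict(X_test),
                                  loaded_model.predict(X_test))
print("Same predictions after reload:", same_predictions)
```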

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%209-%20Well%20done.png"> Awesome!! 

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Does that mean I am ready to create an ML model for the world and solve big challenges?
* Almost. We just started the ML journey now! 
* We still need to cover more topics. Let's have some fun now!

---