# Scikit-learn - Unit 03 - Linear Models for Regression and Classification

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Implement and Evaluate Linear Models for Regression and Classification



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scikit-learn - Unit 03 - Linear Models for Regression and Classification


In this unit, we will cover the practical steps and code to fit pipelines using Linear Regression and Logistic Regression.
* In case you want a reminder of the content, refer back to Module 2 and Module 3

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A typical workflow used for supervised learning is
* Split the dataset into train and test set
* Fit the model (either a pipeline or not) 
* Evaluate your model. If the performance is not good enough, revisit the process, starting from collecting the data, conducting EDA (Exploratory Data Analysis), etc.
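
The three workflow steps above can be sketched end to end as follows. This is a minimal sketch on synthetic data (the data, the coefficients and the noise level are assumptions made only for illustration), not the pipeline used later in this unit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): the target depends linearly on two features, plus noise
rng = np.random.default_rng(101)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# 1. Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# 2. Fit the model on the train set only
model = LinearRegression().fit(X_train, y_train)

# 3. Evaluate on both sets and compare
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print("Train R2:", round(train_r2, 3), "| Test R2:", round(test_r2, 3))
```

If the test score were much lower than the train score, that would be the cue to revisit the earlier steps.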



---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Regression: Linear Regression

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use the Boston dataset from sklearn. It has house price records and house characteristics, like the average number of rooms per dwelling and the per capita crime rate in Boston.

* Loading data from sklearn is a bit different from loading it from seaborn.
* In this case, the data comes as a dictionary-like object, and you need to grab different pieces (like `data.data`, `data.feature_names`, `data.target`) to arrange the DataFrame
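
To see what those "pieces" look like, here is a small sketch that inspects one of these dictionary-like objects. It uses `load_diabetes`, another regression dataset bundled with sklearn, purely as a stand-in (an assumption made because `load_boston` is not available in scikit-learn 1.2 and later); the access pattern is the same.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Stand-in dataset (assumption): any sklearn "load_*" dataset follows this pattern
data = load_diabetes()

# The object behaves like a dictionary, so we can list its keys
print(data.keys())

# Arrange the pieces into a single DataFrame: features first, then the target
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = pd.Series(data.target)

print(df.shape)
print(df.head())
```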


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Note that this data doesn't need cleaning or feature engineering before training a model.

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an earlier version
from sklearn.datasets import load_boston
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)

print(df.shape)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As our workflow suggests, we split the data into train and test set
* We pass the features (the full data without the target, `df.drop(['target'], axis=1)`) and the target (`df['target']`)
* `test_size` is 20% and `random_state` is 101. From now on, we always want to use these values
* It is good practice to inspect the shapes of the train and test sets as a sanity check.
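
The effect of `test_size` and `random_state` can be checked on a toy array. This is a minimal sketch with made-up data (50 rows, an assumption for illustration): 20% of the rows go to the test set, and repeating the split with the same `random_state` gives the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 50 rows, 2 features
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Split twice with the same settings
split_a = train_test_split(X, y, test_size=0.2, random_state=101)
split_b = train_test_split(X, y, test_size=0.2, random_state=101)

X_train, X_test, y_train, y_test = split_a
print("* Train set:", X_train.shape, y_train.shape)
print("* Test set:", X_test.shape, y_test.shape)

# With a fixed random_state, both splits contain exactly the same rows
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Same split both times:", same_split)
```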

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['target'],axis=1),
                                    df['target'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Our target variable is the house price, which is a continuous variable. So we will create a pipeline to handle that.

* We import `Pipeline`, `StandardScaler` and `SelectFromModel`
* To speed up the process: we know this dataset doesn't require any data cleaning or feature engineering. When a dataset needs it, we will tell you and suggest a transformer for it. In the workplace, that will be the data practitioner's task, but for learning purposes we focus on the modelling and evaluation aspects.
* We also import the linear regression algorithm. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
* We define a function to create a pipeline with three steps: feature scaling, feature selection and model. It is convenient to arrange everything in a custom function for a given pipeline.
* As a reminder, in the feature selection step we pass to ``SelectFromModel()`` the model we will use, in this case Linear Regression.
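
To make the feature selection step less of a black box, the sketch below fits `SelectFromModel(LinearRegression())` on synthetic data (an assumption for illustration) where only the first feature matters much. To my understanding of the scikit-learn documentation, for a plain linear model the default threshold is the mean of the absolute coefficients, and features below it are dropped; treat that as a reading of the docs rather than a guarantee for every estimator.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): strong, weak and unused feature
rng = np.random.default_rng(101)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 0.1 * X[:, 1]  # the third feature plays no role

# Same order as in the pipeline: scale first, then select
X_scaled = StandardScaler().fit_transform(X)
selector = SelectFromModel(LinearRegression()).fit(X_scaled, y)

# Importance of each feature is the absolute value of its coefficient
importances = np.abs(selector.estimator_.coef_)
print("Importances:", importances.round(3))
print("Threshold:", round(float(selector.threshold_), 3))
print("Selected:", selector.get_support())
```

Only the features whose importance reaches the threshold are passed on to the model step.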

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"><img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> **WARNING** <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">
* The code is already written here, but you will likely write it for yourself for your milestone project and in the workplace.
* We are familiar with arranging the pipeline as a series of steps. However, mistyping code is very common, so when you write the pipeline, watch out for missing commas "," or parentheses "(". If you type something wrong, don't worry: the code will alert you with an error.

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.linear_model import LinearRegression

def pipeline_linear_regression():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(LinearRegression()) ),
      ( "model", LinearRegression()),

    ])

  return pipeline

pipeline_linear_regression()

We define the object `pipeline` based on `pipeline_linear_regression()`, then fit it to the train set (`X_train` and `y_train`)

pipeline = pipeline_linear_regression()
pipeline.fit(X_train,y_train)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Great. Once the pipeline is fitted, we want to start evaluating it. First, we want to know the linear model coefficients, which we extract from the model's `.coef_` attribute
* We create a custom function that places them in a DataFrame, indexed by the column names and sorted by the absolute value of the coefficients

def linear_model_coefficients(model, columns):
  print(f"* Intercept: {model.intercept_}")
  coeff_df = (pd.DataFrame(model.coef_,columns,columns=['Coefficient'])
            .sort_values(['Coefficient'],key=abs, ascending=False)
            )
  print("* Coefficients")
  print(coeff_df)


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As we have seen, we need to pass the model and the columns that the model was trained on
* To pass the model only, we subset the model from the pipeline with `pipeline['model']` (we named this step 'model'; if you had named it 'ml_model', you would use that step name instead)
* To pass the columns, we subset the feature selection step and grab a boolean array indicating which features reach the model: `pipeline['feat_selection'].get_support()`. This array is then used to subset the train set columns.

Let's do a quick exercise to visualize what we just read.
* Here we subset the model step from the pipeline

pipeline['model']

Here we grab the boolean array that tells which features reach the model
* Note the first element is False, meaning the first feature from the train set was removed in this step.

pipeline['feat_selection'].get_support()

Here we use this array to subset the train set columns

X_train.columns[pipeline['feat_selection'].get_support()]

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Now that we are comfortable with what happens on the back end for extracting information from the pipeline, we want to learn the model coefficients.
* Do you remember the intercept and beta coefficients from the algorithms lesson? Here they are. In this case, it is a multiple linear regression since multiple features reach the model.
* We notice that LSTAT has the highest absolute value. Since the features were standardized in the pipeline, the coefficients are on a comparable scale, so this indicates it is the most important feature for this model. But then we ask: is this model good?
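
Comparing coefficients by absolute value only makes sense when the features share a scale, which is what the `feat_scaling` step provides. The sketch below illustrates this on synthetic data (the two features and their scales are assumptions for illustration): both features contribute about equally to the target, but that only shows in the coefficients after standardizing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): one feature in small units, one in large units
rng = np.random.default_rng(101)
small = rng.normal(loc=0, scale=1, size=500)
large = rng.normal(loc=0, scale=1000, size=500)
X = np.column_stack([small, large])
y = 2.0 * small + 0.002 * large  # similar contribution from both features

# Coefficients on the raw features depend on the units of each feature
raw_coef = LinearRegression().fit(X, y).coef_

# Coefficients on standardized features are directly comparable
X_scaled = StandardScaler().fit_transform(X)
scaled_coef = LinearRegression().fit(X_scaled, y).coef_

print("Raw coefficients:   ", raw_coef.round(4))
print("Scaled coefficients:", scaled_coef.round(4))
```

On the raw data the second feature looks a thousand times less important; after scaling, the two coefficients are close to each other.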

linear_model_coefficients(model=pipeline['model'],
                          columns=X_train.columns[pipeline['feat_selection'].get_support()])

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we want to evaluate how well the pipeline fits the train and test sets
* In case you want a refresher on the performance metrics for regression, refer back to the Performance Metrics video in Module 2
* Read the pseudo-code (the docstring of each function) to understand the logic. The main point for now is to understand what each function does and why it matters.
* We will use these functions in the rest of the course when we evaluate regression models.
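
As a reminder of what the four regression metrics measure, here is a small worked example that computes them by hand with NumPy. The actual and predicted values are made up for illustration (an assumption), chosen so that the arithmetic is easy to follow.

```python
import numpy as np

# Made-up actual values and predictions (assumption)
y = np.array([3.0, 5.0, 7.0, 9.0])
prediction = np.array([2.0, 5.0, 8.0, 10.0])

errors = y - prediction

# Mean Absolute Error: average size of the errors
mae = np.mean(np.abs(errors))

# Mean Squared Error: average of the squared errors
mse = np.mean(errors ** 2)

# Root Mean Squared Error: back in the units of the target
rmse = np.sqrt(mse)

# R2: 1 minus (squared error of the model / squared error of always predicting the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y - y.mean()) ** 2)

print("R2 Score:", round(r2, 3))
print("Mean Absolute Error:", round(mae, 3))
print("Mean Squared Error:", round(mse, 3))
print("Root Mean Squared Error:", round(rmse, 3))
```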


# import regression metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 
# we will use numpy to calculate RMSE based on MSE (mean_squared_error)
import numpy as np


def regression_performance(X_train, y_train, X_test, y_test,pipeline):
  """
  # Gets train/test sets and pipeline and evaluate the performance
  - for each set (train and test) call regression_evaluation()
  which will evaluate the pipeline performance
  """

  print("Model Evaluation \n")
  print("* Train Set")
  regression_evaluation(X_train,y_train,pipeline)
  print("* Test Set")
  regression_evaluation(X_test,y_test,pipeline)



def regression_evaluation(X,y,pipeline):
  """
  # Gets features and target (either from train or test set) and pipeline
  - it predicts using the pipeline and the features
  - calculates performance metrics comparing the prediction to the target
  """
  prediction = pipeline.predict(X)
  print('R2 Score:', round(r2_score(y, prediction), 3))
  print('Mean Absolute Error:', round(mean_absolute_error(y, prediction), 3))
  print('Mean Squared Error:', round(mean_squared_error(y, prediction), 3))
  print('Root Mean Squared Error:', round(np.sqrt(mean_squared_error(y, prediction)), 3))
  print("\n")

  

def regression_evaluation_plots(X_train, y_train, X_test, y_test,pipeline, alpha_scatter=0.5):
  """
  # Gets train and test sets (features and target), the pipeline, and the
  transparency (alpha) of the scatter plot dots
  - It predicts on the train and test sets
  - It creates Actual vs Prediction scatter plots for the train and test sets
  - It draws a red diagonal line. In theory, a good regressor should predict
  close to the actual values, meaning the dots should be close to the red diagonal line.
  The closer the dots are to the line, the better

  """
  pred_train = pipeline.predict(X_train)
  pred_test = pipeline.predict(X_test)


  fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,6))
  sns.scatterplot(x=y_train , y=pred_train, alpha=alpha_scatter, ax=axes[0])
  sns.lineplot(x=y_train , y=y_train, color='red', ax=axes[0])
  axes[0].set_xlabel("Actual")
  axes[0].set_ylabel("Predictions")
  axes[0].set_title("Train Set")

  sns.scatterplot(x=y_test , y=pred_test, alpha=alpha_scatter, ax=axes[1])
  sns.lineplot(x=y_test , y=y_test, color='red', ax=axes[1])
  axes[1].set_xlabel("Actual")
  axes[1].set_ylabel("Predictions")
  axes[1].set_title("Test Set")

  plt.show()



Let's use the custom regression evaluation functions.
* Note that the performance on the train and test sets is not too different. That is an indication that the model didn't overfit.
* At the same time, on the test set (the best proxy for real data, since the model has never seen it) the R2 score is 0.67. This is not too good and not too bad. You may want to look for something better, but it is a good example for a first model.
* We also note in the Actual vs Prediction plots that the predictions tend to follow the actual values, since the dots roughly follow the red diagonal line.

regression_performance(X_train, y_train, X_test, y_test,pipeline)
regression_evaluation_plots(X_train, y_train,
                            X_test, y_test,
                            pipeline,alpha_scatter=0.5)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Classification: Logistic Regression

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use the breast cancer dataset from sklearn. It has records of breast mass samples and a diagnosis indicating whether each one is 0 (Malignant) or 1 (Benign)
* Loading data from sklearn is a bit different from loading it from seaborn.
* In this case, `data` comes as a dictionary-like object, and you need to grab different pieces (like `data.data`, `data.feature_names`, `data.target`) to arrange the DataFrame


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Note that this data doesn't need cleaning or feature engineering before training a model

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data,columns=data.feature_names)
df['target'] = pd.Series(data.target)
print(df.shape)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  As usual, we start our workflow by splitting the data into train and test sets.
* We use the same pattern as the previous section

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['target'],axis=1),
                                    df['target'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Our target variable is 0 (Malignant) or 1 (Benign), which is a categorical variable. We will create a pipeline to handle that; since there are two classes, it will be a binary classifier.
* We also import the logistic regression algorithm. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* We define a function to create a pipeline with three steps: feature scaling, feature selection and model. It is convenient to arrange everything in a custom function for a given pipeline.
* Just a reminder, we pass the model to `SelectFromModel()` in the feature selection step. We will use this pattern all the time. In this case, we will use Logistic Regression

from sklearn.pipeline import Pipeline

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.linear_model import LogisticRegression

def pipeline_logistic_regression():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(LogisticRegression(random_state=101)) ),
      ( "model", LogisticRegression(random_state=101)),

    ])

  return pipeline


We define the object `pipeline` based on `pipeline_logistic_regression()`, then fit it to the train set (`X_train` and `y_train`)

pipeline = pipeline_logistic_regression()
pipeline.fit(X_train,y_train)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Great. Once the pipeline is fitted, we want to start evaluating it. First, we want to know the model coefficients, which we extract from the model's `.coef_` attribute
* We create a custom function that places them in a DataFrame together with the columns, then transposes it and sorts it by the absolute value of the coefficients
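
The transpose is needed because the two models store `.coef_` with different shapes. The sketch below shows this on synthetic data (the data and the column names `f1`, `f2`, `f3` are hypothetical, for illustration only): for a single continuous target, Linear Regression gives a flat array, while a binary Logistic Regression gives one row of coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data (assumption) with hypothetical column names
rng = np.random.default_rng(101)
X = rng.normal(size=(100, 3))
columns = ["f1", "f2", "f3"]
y_continuous = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)
y_binary = (y_continuous > 0).astype(int)

lin = LinearRegression().fit(X, y_continuous)
log = LogisticRegression(random_state=101).fit(X, y_binary)

# Flat array vs. a single row of coefficients
print("LinearRegression coef_ shape:  ", lin.coef_.shape)
print("LogisticRegression coef_ shape:", log.coef_.shape)

# One row with the features as columns, so we transpose to get one feature per row
coeff_df = pd.DataFrame(log.coef_, index=["Coefficient"], columns=columns).T
print(coeff_df)
```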

def logistic_regression_coef(model, columns):
  coeff_df = (pd.DataFrame(model.coef_,index=['Coefficient'],columns=columns)
            .T
            .sort_values(['Coefficient'],key=abs, ascending=False)
            )
  print(coeff_df)

We pass the arguments in a similar way to the previous section:
* the model is the model step from the pipeline
* the columns are the train set features, subset by the array that tells which features were selected by the `feat_selection` pipeline step

logistic_regression_coef(model=pipeline['model'],
                         columns=X_train.columns[pipeline['feat_selection'].get_support()])

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we want to evaluate how well the pipeline fits the train and test sets
* In case you want a refresher on the performance metrics for classification, refer back to the Performance Metrics video in Module 2
* Read the pseudo-code (the docstring of each function) to understand the logic. The main point for now is to understand what each function does and why it matters.
* We will use these functions in the rest of the course when we evaluate classification models.

# loads confusion_matrix and classification_report from sklearn
from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):
  """
  # Gets features, target, pipeline and how the levels of your target are labelled (named)
  in this case, 0 (Malignant) and 1 (Benign), so you pass a list ['Malignant' , 'Benign']

  - it predicts based on the features
  - compares predictions and actual values in a confusion matrix
    - the first argument goes to the rows and the second goes to the columns of the matrix
    - we will use the pattern where the predictions are in the rows and the actual values are in the columns
    - for a refresher on that, refer back to the Performance Metrics video in Module 2
  - show classification report

  """

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
  # the prediction is deliberately passed as y_true so that the predictions land in the rows
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")



def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  """
  # Gets the features and target from the train and test sets, the pipeline, and how
  you labelled (named) the levels of your target
  in this case, 0 (Malignant) and 1 (Benign), so you pass a list ['Malignant', 'Benign']
  - for each set (train and test), it calls the function above to show the confusion matrix
  and the classification report
  """

  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's use the custom classification evaluation function.
* Note that the performance on the train and test sets is not too different. That is an indication that the model didn't overfit.
* Just a side note: in the confusion matrix, the actual values are in the columns and the predictions are in the rows, as explained in the pseudo-code. In the workplace, you may see that switched. That is fine; you just have to pay attention to where the actual values and the predictions are :)
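
The row and column orientation can be checked on a tiny example. In this sketch (the labels are made up for illustration), sklearn's usual call puts the actual values in the rows; swapping the two arguments, as the custom function does, gives the same counts transposed, with the predictions in the rows.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels (assumption): 3 actual zeros and 4 actual ones
actual = np.array([0, 0, 0, 1, 1, 1, 1])
prediction = np.array([0, 1, 1, 1, 1, 1, 0])

# Usual sklearn orientation: actual values in the rows, predictions in the columns
actual_in_rows = confusion_matrix(y_true=actual, y_pred=prediction)

# Swapped arguments: predictions in the rows, actual values in the columns
prediction_in_rows = confusion_matrix(y_true=prediction, y_pred=actual)

print("Actual in rows:\n", actual_in_rows)
print("Prediction in rows:\n", prediction_in_rows)
```

Both matrices hold the same four counts; only the reading direction changes.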


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The confusion matrix shows how many times the classifier predicted each class correctly or incorrectly.
* For example, how many times did the model predict an actual malignant as malignant for the train set? That is 164.
* How many times did the model predict an actual malignant as benign for the train set? That is 6.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The classification report shows main metrics for classification
* We see for each class the precision, recall and f1-score.
* Support is the number of observations in each class.
* We also see the accuracy.
* macro avg: it computes the average without considering the proportion of each class. For example, on the train set in the precision column, it takes all precisions and calculates the average: `(0.99 + 0.98) / 2`
* weighted avg: it computes the average considering the proportion of each class. For example, on the train set in the precision column, it takes `[ 170/(170+285) * 0.99 ] + [ 285/(170+285) * 0.98 ]`
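
The two averages can be reproduced with a couple of lines of NumPy, using the train set precisions (0.99 and 0.98) and supports (170 and 285) quoted above. This is only a sketch of the arithmetic, not how sklearn computes the report internally.

```python
import numpy as np

# Train set values quoted above: precision and support for Malignant and Benign
precision = np.array([0.99, 0.98])
support = np.array([170, 285])

# macro avg: plain average, every class counts the same
macro_avg = precision.mean()

# weighted avg: each class is weighted by its share of the observations
weighted_avg = np.average(precision, weights=support)

print("macro avg:", round(macro_avg, 4))
print("weighted avg:", round(weighted_avg, 4))
```

The weighted average sits slightly closer to the benign precision because benign has more observations.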


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> However, in practice we will mostly use precision, recall, f1-score and accuracy as classification metrics

Let's comment now on the results.
* Since the classes are not balanced (we have more benign than malignant), we probably would not choose accuracy. But let's assume we did. We see the accuracy is very good on the train and test sets.
* We could also decide (due to some particular business reason) to use recall on malignant as the metric, since we don't want to tell a patient the mass is benign when it is malignant. In this case, the performance on the train set is 0.96 and on the test set is 0.95. That means that on live data you should expect the model to correctly flag about 95% of the malignant cases. It will be up to your business problem and context to tell if this level is acceptable. Your own heuristics will also play a role in answering this question.
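
The recall on malignant for the train set can be recovered from the two confusion matrix counts mentioned earlier (164 actual malignant predicted as malignant, 6 predicted as benign). A minimal sketch of that arithmetic:

```python
# Train set counts quoted earlier in this section
malignant_predicted_malignant = 164  # correctly flagged
malignant_predicted_benign = 6       # missed

# Recall on malignant: share of the actual malignant cases that the model flagged
recall_malignant = malignant_predicted_malignant / (
    malignant_predicted_malignant + malignant_predicted_benign
)

print("Recall on malignant (train set):", round(recall_malignant, 2))
```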

clf_performance(X_train=X_train, y_train=y_train,
                        X_test=X_test, y_test=y_test,
                        pipeline=pipeline,
                        label_map= ["Malignant","Benign"] )

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> What if I don't know how to map my target variable for this custom function?
* In this example, we know that 0 in the target means 'Malignant' and 1 means 'Benign'. But what if I didn't?
* That is okay; you just have to pass a list with the ordered sequence of the classes as strings, like ``["0", "1"]``
* Let's try it below. It will display the same results; the only difference is the ``label_map``

clf_performance(X_train=X_train, y_train=y_train,
                        X_test=X_test, y_test=y_test,
                        pipeline=pipeline,
                        label_map= ["0", "1" ] ) # it will display the classes as 0 and 1
                                                  # note that "0" and "1" must be strings
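
When the dataset comes from sklearn, the class names can often be read from the loaded object itself instead of being typed by hand. The sketch below assumes the dataset exposes a `target_names` attribute, which is the case for `load_breast_cancer`; for your own data you would need to check how the target was encoded.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names is ordered by class number: position 0 is class 0, position 1 is class 1
print(data.target_names)

# Build the label_map from it, capitalized to match the labels used in this unit
label_map = [str(name).capitalize() for name in data.target_names]
print(label_map)
```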

---