In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # plots
import matplotlib.pyplot as plt # plots

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 📚 Beginner Logistic Regression - Breast Cancer

![beginner](https://img.shields.io/badge/Level-Beginner-darkgreen.svg)

**Details**:
> **Author**: Itokiana RAFIDINARIVO

> **Last update**: Fri. 4 Dec 2020

> **Topic**: Logistic regression

## 👋 Introduction

Hi readers! Are you new to data science and want to learn how to build a model for binary classification? Then you are in the right place! This notebook is a tutorial for anyone who wants to create a binary classification model with **python**.

___

## 📜 Load the data
As you might guess, building a model requires data. This notebook uses the ***Breast Cancer Wisconsin (Diagnostic) Data Set***, a "*classic and very easy binary classification dataset*", available [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)).

> **Description**
: Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at [Web Link](http://pages.cs.wisc.edu/~street/images/)

We will use the ***scikit-learn*** package to load the data into the notebook!

> [**scikit-learn**](https://scikit-learn.org/stable/index.html) package
: Simple and efficient tools for predictive data analysis;  Accessible to everybody, and reusable in various contexts;  Built on NumPy, SciPy, and matplotlib;  Open source, commercially usable - BSD license.

In [1]:
from sklearn.datasets import load_breast_cancer  # Necessary function that loads the data

bunch = load_breast_cancer(as_frame=True)  # Load the Bunch in a variable

data = bunch["frame"]  # Access the dataframe in the Bunch

data.head()  # Display the first 5 rows of the DataFrame

The code block above did the following :
1. Load the Bunch from the package
2. Extract the ***DataFrame*** from the data bunch
3. Display the first 5 rows of the ***DataFrame***

##### Vocabulary
> [**Bunch**](https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html#sklearn.utils.Bunch)
: Container object exposing keys as attributes. Bunch objects are sometimes used as an output for functions and methods. They extend dictionaries by enabling values to be accessed by key, `bunch["value_key"]`, or by an attribute, `bunch.value_key`.

> [**DataFrame**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
: Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
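The two access styles described for a Bunch can be tried on a small example. This is only a sketch: the `toy` object and its contents are made up for illustration and are not the breast cancer data.

```python
from sklearn.utils import Bunch

# A tiny, made-up Bunch (hypothetical content) to show both access styles
toy = Bunch(frame="a DataFrame would go here", target_names=["malignant", "benign"])

by_key = toy["target_names"]      # dictionary-style access
by_attribute = toy.target_names   # attribute-style access

print(by_key is by_attribute)  # both styles return the same object
print(list(toy.keys()))        # a Bunch still behaves like a dictionary
```

This is why `bunch["frame"]` and `bunch.frame` can be used interchangeably in the cell above.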

In [1]:
names  = ["rows", "columns"]
maxstr = 10
maxval = 3

print("No. of {} : {}".format("dimensions".rjust(maxstr), str(len(data.shape)).rjust(maxval)))

for name, value in zip(names, data.shape):
    print("No. of {} : {}".format(name.rjust(maxstr), str(value).rjust(maxval)))

#### Dimensions

The ***DataFrame*** is **two-dimensional** and has :
- **569 rows**
- **31 columns** (30 features and the target)

#### Code explained

> **`data.shape`**
: `data`, our DataFrame, has an attribute named `shape`. This attribute contains a tuple with the length of every dimension. Here `data.shape` returns `(569, 31)`.
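To see how `shape` relates to rows and columns, here is a minimal sketch on a made-up DataFrame (the `toy` name and its values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical 3-row, 2-column DataFrame
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_columns = toy.shape  # the tuple unpacks as (rows, columns)

print("dimensions :", toy.ndim)
print("rows       :", n_rows)
print("columns    :", n_columns)
```

`len(toy.shape)` and `toy.ndim` give the same number of dimensions.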

In [1]:
n_missing_values = data.isna().sum().sum()

print(f"There are {n_missing_values} missing value(s).")

#### Missing values
A good practice before implementing a prediction model is to check how many values are missing in the data and to choose a suitable method to impute them. Fortunately, this dataset has no missing values, so we don't have to do that!

#### Code explained
> **`data.isna().sum().sum()`**
: `data.isna()` returns a DataFrame of boolean values containing `True` where a value is missing and `False` otherwise. Then `data.isna().sum()` returns a Series in which each value is the number of missing values in the corresponding column (each `True` counts as 1 and each `False` as 0). Finally, `data.isna().sum().sum()` is the sum of the values in `data.isna().sum()`, that is, the total number of missing values in the DataFrame.
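The three steps of that chain can be followed on a small made-up DataFrame that does contain missing values (the `toy` data below is an assumption for illustration; the real dataset has none).

```python
import numpy as np
import pandas as pd

# Hypothetical data with three missing values
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

mask = toy.isna()         # DataFrame of booleans: True where a value is missing
per_column = mask.sum()   # Series: number of missing values in each column
total = per_column.sum()  # scalar: number of missing values in the whole DataFrame

print(per_column)
print("Total :", total)
```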

#### Vocabulary
> [**Series**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)
: One-dimensional ndarray with axis labels (including time series). Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN). Operations between Series (`+`, `-`, `/`, `*`, `**`) align values based on their associated index values; they need not be the same length. The result index will be the sorted union of the two indexes.

In [1]:
corr = data.corr().round(2)  # Returns a correlation table between variables

mask = np.triu(np.ones_like(corr, dtype=bool))  # Generate a mask for the upper triangle

plt.figure(figsize=(18,10))  # Define the figure size
sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True)  # Create the correlation heatmap
plt.show()  # Show the plot

#### Correlation
Another good practice is to check whether some variables are strongly correlated with others (in the extreme case, one variable is a linear combination of others). Highly correlated features can make the coefficients of a linear model such as logistic regression unstable and harder to interpret. A common remedy is to remove the redundant variables from the features used by the model.

#### Code explained
> `data.corr()`
: Our DataFrame `data` has a method `corr` which returns a DataFrame representing the correlations between variables.
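One possible way to act on the correlation table is sketched below. The helper `find_correlated_columns`, the `0.95` threshold, and the `toy` data are all assumptions made for illustration; this notebook itself does not drop any feature.

```python
import numpy as np
import pandas as pd

def find_correlated_columns(frame, threshold=0.95):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so that each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [column for column in upper.columns if (upper[column] > threshold).any()]

# Hypothetical data: 'x_doubled' is an exact linear function of 'x'
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({"x": x, "x_doubled": 2 * x, "noise": rng.normal(size=200)})

to_drop = find_correlated_columns(toy)
reduced = toy.drop(columns=to_drop)

print("Dropped :", to_drop)
print("Kept    :", list(reduced.columns))
```

The threshold is a judgment call: a lower value removes more features, at the risk of discarding useful information.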

#### Vocabulary

> **variable** | **feature**
: You will often see the words *variable* and *feature* in data science. Here they refer to the same thing: a column of the DataFrame, for instance.

In [1]:
data.describe()

The code block above shows some basic descriptive statistics about our features, using the `describe` method of the DataFrame.

#### Code explained
> `data.describe()`
: Generate descriptive statistics. Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes in the pandas documentation for more detail.

#### Normalization or Standardization?

Another common practice before creating a model is to standardize or normalize the numerical variables. But what is the difference between those two terms?

> **Normalization** (min-max scaling)
: Rescale a variable so that its values lie between 0 and 1.

> **Standardization**
: Rescale a variable so that it has a mean of 0 and a standard deviation of 1.
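The two definitions above can be written directly with NumPy. This is a minimal sketch on a made-up array (`values` is an assumption for illustration), using the usual min-max and z-score formulas.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical variable

# Normalization (min-max scaling): (x - min) / (max - min)
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization (z-score): (x - mean) / standard deviation
standardized = (values - values.mean()) / values.std()

print("normalized   :", normalized)
print("standardized :", standardized)
```

After normalization the smallest value becomes 0 and the largest becomes 1; after standardization the mean is 0 and the standard deviation is 1.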


In [1]:
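from sklearn.preprocessing import MinMaxScaler, StandardScaler  # Import the scaler classes from the sklearn package

def process_data(dataframe=None, preprocessing_method=None):
    # Fit the scaler on the DataFrame and transform it (returns a NumPy array)
    scaled = preprocessing_method.fit_transform(dataframe)
    # Rebuild a DataFrame, keeping the original column names and index
    return pd.DataFrame(scaled, columns=dataframe.columns, index=dataframe.index)

# Note: Normalizer rescales each row to unit norm; MinMaxScaler is the class that maps each column to [0, 1]
normalized = process_data(dataframe=data, preprocessing_method=MinMaxScaler())
scaled     = process_data(dataframe=data, preprocessing_method=StandardScaler())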

In [1]:
max_values = normalized.max()  # Maximum value of each column
min_values = normalized.min()  # Minimum value of each column

print("Normalized data :")
print("  > min =", min(min_values))  # smallest value over all columns
print("  > max =", max(max_values))  # largest value over all columns

> The maximum and the minimum are between 0 and 1 so the data has been normalized.

In [1]:
described = scaled.describe()
means     = described.loc["mean"]  # mean of each column
stds      = described.loc["std"]   # sample standard deviation of each column

print("Standardized data :")
print("  > mean between [ {}; {} ]".format(min(means), max(means)))
print("  > std  between [ {} ; {} ]".format(min(stds), max(stds)))

> The mean is approximately 0 and standard deviation is approximately 1 so the data has been standardized.
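The standard deviation reported by `describe` is only *approximately* 1 because pandas computes the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below shows the difference on a made-up column (`toy` is an assumption for illustration).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]})  # hypothetical data

toy_scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

population_std = toy_scaled["a"].std(ddof=0)  # what StandardScaler sets to 1
sample_std = toy_scaled["a"].std(ddof=1)      # what describe() reports

print("std (ddof=0) :", population_std)
print("std (ddof=1) :", sample_std)
```

The gap shrinks as the number of rows grows, which is why it is barely visible on the full dataset.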

Well now let's get into the real deal! If I missed a few things please let me know down in the comments.

___

## 🧰 Create the model

Before getting into it, we need to know what we have to do and how. Here are some facts first...

#### Facts

- **Type of problem** : Binary classification
- **Target variable** : 'target' column
- **Column types**    : **float64** for the features and **int64** for the target

#### Basic process

1. Prepare the data
2. Split the data into 'train' and 'test' sets
3. Fit the model
4. Evaluate the model

In [1]:
from sklearn.linear_model import LogisticRegression  # logistic regression model class
from sklearn.model_selection import train_test_split  # function that separates 'train' and 'test' sets
from sklearn.metrics import roc_auc_score  # function calculating the area under the ROC curve
from sklearn.metrics import RocCurveDisplay  # class plotting the ROC curve (replaces plot_roc_curve, removed in scikit-learn 1.2)
from sklearn.preprocessing import scale  # function that scales the data

RANDOM_STATE = 42

def basic_logistic_regression(data, target, test_size=0.3, plot_roc=False):
    # Separate 'features' and 'target' such that : y = f(X)
    # Note: scaling before the split lets test-set statistics leak into training;
    # the Pipeline section further down avoids this
    X = scale(data.drop(columns=[target]))
    y = data[target]

    # Instantiate the logistic regression model
    mLR = LogisticRegression(random_state=RANDOM_STATE)

    # Separate the data into 'train' and 'test' sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=RANDOM_STATE)

    # Fit the model to the 'train' set
    mLR.fit(X_train, y_train)

    # Predict the probability of the positive class (column 1 of predict_proba)
    y_train_pred = mLR.predict_proba(X_train)[:, 1]
    y_test_pred  = mLR.predict_proba(X_test)[:, 1]

    # Compute the ROC AUC scores from the predicted probabilities
    score_train = roc_auc_score(y_train, y_train_pred)
    score_test  = roc_auc_score(y_test, y_test_pred)

    # Printing
    print("Score (train) :", score_train)
    print("Score (test ) :", score_test)

    if plot_roc:  # Display the ROC curve
        fig, ax = plt.subplots(figsize=(10, 8))
        RocCurveDisplay.from_estimator(mLR, X_test, y_test, ax=ax)
        line = np.linspace(0, 1, 20)
        ax.plot(line, line, linestyle="--")  # diagonal = random classifier
        plt.show()

    return mLR

#### Code explained

> `sklearn.metrics.roc_auc_score`
: Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

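> `sklearn.metrics.plot_roc_curve`
: Plot Receiver operating characteristic (ROC) curve. This function was removed in scikit-learn 1.2; `sklearn.metrics.RocCurveDisplay.from_estimator` is its replacement.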

> `sklearn.linear_model.LogisticRegression`
: Logistic Regression (aka logit, MaxEnt) classifier. This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ solvers.

> `sklearn.preprocessing.scale`
: Standardize a dataset along any axis. Center to the mean and component wise scale to unit variance.

> `sklearn.model_selection.train_test_split`
: Split arrays or matrices into random train and test subsets.
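As the `roc_auc_score` entry says, the score is meant to be computed from prediction *scores* such as probabilities. The sketch below, on made-up values (`y_true` and `probabilities` are assumptions for illustration), shows how turning probabilities into hard 0/1 labels can change the result, because the ranking information is lost.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]                     # hypothetical true classes
probabilities = [0.10, 0.40, 0.45, 0.90]  # hypothetical predicted probabilities of class 1

# Hard labels with a 0.5 cut-off, similar to what `predict` returns
hard_labels = [int(p >= 0.5) for p in probabilities]

auc_from_probabilities = roc_auc_score(y_true, probabilities)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print("AUC from probabilities :", auc_from_probabilities)
print("AUC from hard labels   :", auc_from_labels)
```

Here every positive sample has a higher probability than every negative one, so the probabilities rank the samples perfectly, while the hard labels can no longer distinguish the third sample from the negatives.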

In [1]:
model = basic_logistic_regression(data, target="target", plot_roc=True)

#### Observations
Both the training and the testing scores are very high! The training score alone says little about how well the model generalizes, but the testing score is also close to 1, which is a good sign.

#### Conclusion

That's it for a basic process in order to create a logistic regression with ***scikit-learn***.

___

## ⚒️ *Pipeline* and *K-Fold cross-validation*

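In addition to the basic process of fitting a predictive model and evaluating it with the ROC curve, there is another way to evaluate the model: ***K-Fold cross-validation***. The data is split into K folds; each fold is used once as the test set while the model is trained on the remaining folds. Here is an illustration: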

![kfold_crossvalidation](https://ethen8181.github.io/machine-learning/model_selection/img/kfolds.png)
> source: http://ethen8181.github.io/machine-learning/model_selection/model_selection.html
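The idea in the illustration can be sketched by hand with NumPy. The generator `k_fold_indices` below is a hypothetical helper written for this explanation; it does not shuffle or stratify, unlike the `StratifiedKFold` class used in the next cell.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_index, test_index) pairs for a plain, unshuffled k-fold split."""
    folds = np.array_split(np.arange(n_samples), k)  # k nearly equal chunks of indices
    for i in range(k):
        test_index = folds[i]  # fold i is the test set for this round
        train_index = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_index, test_index

splits = list(k_fold_indices(n_samples=10, k=5))

for train_index, test_index in splits:
    print("train:", train_index, "test:", test_index)
```

Every sample appears in the test set exactly once, and never in the train and test sets of the same round.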

In [1]:
from sklearn.pipeline import make_pipeline  # function that creates a pipeline
from sklearn.model_selection import StratifiedKFold  # class for stratified k-fold cross-validation

scores = list()  # scores computed by every fold

# Create a Pipeline that will: scale and create the model
pipeline = make_pipeline(
    StandardScaler(),  # scale the data
    LogisticRegression(),  # model
)

# Create a 4-Fold cross-validation
k   = 4
skf = StratifiedKFold(
    n_splits=k,
    random_state=RANDOM_STATE,
    shuffle=True
)

# Separate 'target' and 'features'
target = "target"
X      = data.drop(columns=[target])
y      = data[target]

# Loop over the k folds
for train_index, test_index in skf.split(X, y):
    # Get the features
    X_train, X_test = X.values[train_index], X.values[test_index]

    # Get the target
    y_train, y_test = y.values[train_index], y.values[test_index]

    # Scale the data & Fit to the model (the scaler is fitted on the train fold only)
    pipeline.fit(X_train, y_train)

    # Get the score: true labels first, then the predicted probability of the positive class
    score = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])

    # Add the score
    scores.append(score)

#### Code explained

> `sklearn.model_selection.StratifiedKFold` 
: This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

> `sklearn.pipeline.make_pipeline`
: Construct a `Pipeline` from the given estimators. This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.
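The automatic step names can be inspected through `named_steps`, and scikit-learn's `cross_val_score` offers a shorter way to run the same kind of loop. The block below is a self-contained sketch under assumptions: the names `features`, `labels`, `cv` and `cv_scores` are chosen for illustration, and it is not necessarily how the cell above was meant to be written.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features, labels = load_breast_cancer(return_X_y=True)

pipeline = make_pipeline(StandardScaler(), LogisticRegression())
step_names = list(pipeline.named_steps)  # names derived from the lowercased class names
print(step_names)

# Same 4-fold stratified split as in the manual loop
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

# scoring="roc_auc" makes scikit-learn compute the ROC AUC on each test fold
cv_scores = cross_val_score(pipeline, features, labels, cv=cv, scoring="roc_auc")
print(cv_scores.round(4))
```

Because the scaler lives inside the pipeline, it is refitted on the training part of every fold, so no information from the test fold leaks into the scaling.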

In [1]:
print("Scores  :",[round(value, 4) for value in scores])
average = np.mean(scores)
vmin    = np.min(scores)
vmax    = np.max(scores)
std     = np.std(scores)
print("Average :", round(average, 4))
print("Minimum :", round(vmin, 4))
print("Maximum :", round(vmax, 4))
print("STD     :", round(std, 4))

___

## 🏁 Conclusion

Well, that is all for now. I hope you learnt something from this notebook, which covered the following topics :
- Do some basic check-up of the data
- Normalization / Standardization
- Create a logistic regression with the ***scikit-learn*** package
- Evaluate with the ROC curves
- Pipelines & K-Fold cross validation