# Correlation - Unit 01 - Introduction

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%201%20-%20Lesson%20Learning%20Outcome.png"> Lesson Learning Outcome

* **The Correlation Lesson is made of two units.**
* By the end of this lesson, you should be able to:
  * Understand concepts around correlation, like direction, range, strength and causation
  * Calculate and interpret Pearson and Spearman Correlation levels in numerical variables

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand concepts around correlation, like direction, range, strength and causation



---

Correlation tests quantify the association between numerical variables in a dataset.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png"> Why do we need to know about correlation levels?
* It allows you to quickly understand the intrinsic associations among your dataset variables at a reasonably low computing cost. The correlation levels then serve as a reference for additional data investigations.
* In addition to that, the performance of some algorithms, like linear regression, can deteriorate if two or more variables are highly correlated.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Python offers multiple libraries to calculate correlations, like SciPy, NumPy and Pandas. We will stick to the Pandas library.


## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png"> Additional Learning Context

* We encourage you to:
  * Add **code cells and try out** other possibilities, i.e., play around with parameter values in a function/method, consider additional function parameters etc.
  * Also, **add your comments** in the cells. It can help you to consolidate your learning. 

* Parameters in given function/method
  * As you may expect, a given function in a package may contain multiple parameters. 
  * Some of them are mandatory to declare; some have pre-defined values, and some are optional. We will cover the most common parameters used/employed in Data Science for a particular function/method. 
  * However, you may find additional parameters in the respective package documentation, where you will find instructions on how to use a given function/method. The studied packages are open source, so this documentation is public.
  * **For Pandas the link is [here](https://pandas.pydata.org/)**.

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Unit

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Correlation - Introduction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> What do correlation levels do?
* They quantify the association between variables in a dataset.



### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Correlation Concepts

Let's discuss:
* Correlation direction
* Correlation range
* Correlation strength
* Correlation is not causation


#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Correlation Direction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A correlation level between two variables could be positive, negative or neutral (zero).
* Positive: both variables move in the same direction; when one increases, so does the other.
* Negative: the variables move in opposite directions; when one increases, the other decreases.
* Neutral: there is no consistent tendency for one variable to increase or decrease as the other changes.

Below we find the plots demonstrating, from left to right, a positive, negative and neutral correlation between 2 numerical variables.

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Correlation Range

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The correlation level could range from **-1 to +1**. 
* If the level is positive, there is a positive correlation between the variables.
* If the level is negative, there is a negative correlation between the variables.
* If the level is zero, the test detects no association between the variables.
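To make direction and range concrete, here is a minimal sketch on synthetic data. The column names (`x`, `y_pos`, `y_neg`, `y_none`), the noise level and the seed are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
x = rng.normal(size=500)

toy = pd.DataFrame({
    "x": x,
    "y_pos": 2 * x + rng.normal(scale=0.5, size=500),   # moves with x
    "y_neg": -2 * x + rng.normal(scale=0.5, size=500),  # moves against x
    "y_none": rng.normal(size=500),                      # generated independently of x
})

# correlation level of every column with x
levels = toy.corr()["x"]
print(levels.round(2))
```

You should see a level close to +1 for `y_pos`, close to -1 for `y_neg` and close to 0 for `y_none`. No level falls outside the -1 to +1 range.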


#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Correlation Strength

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The absolute value gives information on the strength. We can use the following table to interpret the strength.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> For example
* If the correlation level between the two variables is **-0.76**, they are **negatively and strongly** correlated. 
* If the correlation level between the two variables is **0.45**, they are **positively and moderately** correlated.
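The interpretation can also be sketched as a small helper. The function name `describe_correlation` and the cut-off values are hypothetical: they were chosen only so that the two examples above come out as stated, and they may differ from the bands you prefer to use.

```python
def describe_correlation(level):
    """Return a short text label for a correlation level between -1 and +1."""
    if not -1 <= level <= 1:
        raise ValueError("a correlation level must be between -1 and +1")

    strength = abs(level)  # the absolute value gives the strength
    if strength == 0:
        return "neutral"
    direction = "positive" if level > 0 else "negative"

    # hypothetical cut-offs, chosen only to agree with the examples in this unit
    if strength < 0.2:
        label = "very weak"
    elif strength < 0.4:
        label = "weak"
    elif strength < 0.6:
        label = "moderate"
    elif strength < 0.8:
        label = "strong"
    else:
        label = "very strong"
    return f"{direction}, {label}"


print(describe_correlation(-0.76))
print(describe_correlation(0.45))
```

The sign decides the direction and the absolute value decides the strength, so the two are handled separately.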


#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Correlation is not causation

Keep in mind that correlation does not indicate causation. Correlation quantifies the strength of the relationship between variables. It does not indicate a cause-effect relationship.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In the lesson video, we saw the example where the number of people who drowned by falling into a pool in the USA correlates with the number of films Nicolas Cage appeared in.
* It is a strong and positive correlation level, so, read naively, the more films Nicolas Cage appears in, the more people drown in pools.
* You may ask yourself: does that make sense? Based purely on the correlation level, we can't know if one caused the other, if a third variable causes both, or if this correlation is a coincidence.
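The "third variable" case can be simulated. In the sketch below, a hidden driver `z` feeds both `x` and `y`, which never influence each other. All names, the noise level and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# hypothetical hidden driver that we normally would not observe
z = rng.normal(size=n)

# x and y both depend on z, but never on each other
x = z + rng.normal(scale=0.3, size=n)
y = z + rng.normal(scale=0.3, size=n)

r_xy = pd.Series(x).corr(pd.Series(y))

# after removing the part explained by z, little association is left
r_residual = pd.Series(x - z).corr(pd.Series(y - z))

print(f"corr(x, y)         = {r_xy:.2f}")
print(f"corr(x - z, y - z) = {r_residual:.2f}")
```

The first level is high even though neither variable causes the other. Once the shared driver is removed, the level drops to around zero.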


# Correlation - Unit 02 - Analysis

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Calculate and interpret Pearson and Spearman Correlation levels in numerical variables



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Correlation Tests

We will study two correlation tests in this unit:
* Pearson
* Spearman

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Pearson


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Pearson’s correlation (or r coefficient) measures the linear relationship between two numerical variables. 
* That means this test assumes that the variables **change at a constant rate** relative to each other.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In the image below, we see relationship differences among each plot. There are examples where the linear relationship, either positive or negative, is strong and examples where the linear relationship is null.
* Note the plot on the bottom right, where the relationship is quadratic. Even though we know there is a quadratic relationship, a Pearson correlation test outputs the correlation as zero.
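The quadratic case is easy to reproduce. This is a minimal sketch with made-up data. Note that the zero result relies on `x` being symmetric around the turning point of the curve, which is an assumption of this example.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(-1, 1, 201))  # symmetric around zero

y_linear = 3 * x + 1   # constant rate of change
y_quadratic = x ** 2   # fully determined by x, but not linear

r_linear = x.corr(y_linear, method="pearson")
r_quadratic = x.corr(y_quadratic, method="pearson")

print(f"linear:    {r_linear:.3f}")
print(f"quadratic: {r_quadratic:.3f}")
```

Pearson reports a perfect level for the straight line and practically zero for the parabola, although `y_quadratic` is fully determined by `x`.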


---

Let's consider the mpg dataset from Seaborn. It holds records for mpg (miles per gallon), cylinders, horsepower, and weight from multiple car types.

df = sns.load_dataset('mpg')
df = df.head(50)
print(df.shape)
df.head()

We use [.corr()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) from the Pandas library. The argument is `method='pearson'`.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Reminder: it considers numerical variables only. The output is a table where the columns and rows are the numerical variables. The numbers reflect the correlation from a given variable to another.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note:
* The correlation between `mpg` and `mpg` is 1. As we may expect, they hold the same information, so they correlate perfectly. All diagonal cells, where both variables are the same, hold the value 1 for the same reason.
* The values are symmetric: the correlation between `mpg` and `cylinders` is the same as the correlation between `cylinders` and `mpg`, in this case, -0.9212 (negative and strong correlation). In practical terms, the part above the diagonal repeats the part below it.

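# numeric_only=True skips the text columns (origin, name); pandas 2.0+ raises an error without it
df_corr = df.corr(method='pearson', numeric_only=True)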
df_corr
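The two properties noted above (diagonal of 1, symmetric values) can be checked in code. The records below are hypothetical and only loosely shaped like the mpg dataset. `select_dtypes()` is used here as one explicit way to keep the numerical columns.

```python
import numpy as np
import pandas as pd

# hypothetical records, loosely shaped like the mpg dataset
cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 24.0, 27.0, 14.0, 31.0],
    "weight": [3504, 3693, 2372, 2130, 4312, 1985],
    "horsepower": [130, 165, 95, 88, 215, 67],
    "name": ["a", "b", "c", "d", "e", "f"],  # text column, not numerical
})

# keep the numerical columns explicitly, then correlate
cars_corr = cars.select_dtypes(include="number").corr(method="pearson")
print(cars_corr.round(2))

is_symmetric = np.allclose(cars_corr, cars_corr.T)
diagonal_is_one = np.allclose(np.diag(cars_corr), 1.0)
print(is_symmetric, diagonal_is_one)
```

The text column does not appear in the table, and both checks should print `True`.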

---

You may process the table for better clarity. For example, in this dataset the target variable is mpg, so you select the mpg column using `.filter()` and sort the values by their absolute value using `.sort_values(key=abs)`.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note that there are four numerical variables (`weight, displacement, horsepower, cylinders`) that are negatively and strongly correlated to your target variable! That is an insight since you now have criteria to start visualising your data based on these variables first.



df_corr.filter(['mpg']).sort_values(by='mpg', key=abs, ascending=False)

Alternatively, we can better understand the correlation using a heatmap from Seaborn.
* Note the data for the heatmap should be in the format where the columns and indices have the same information and each cell corresponds to a level between two variables.
* We will mask the upper part of the diagonal. We do that with the first two lines of code.
* We then plot a heatmap with `sns.heatmap()`. A few additional arguments are set here: `annot=True` means we want to display the value in each square, `mask=mask` hides the upper diagonal values, `annot_kws={"size": 8}` sets the font size for the annotation, and `linewidths=0.5` adds a line between the squares, giving a better sense of separation.

# np.bool was removed in NumPy 1.24; the built-in bool does the same job
mask = np.zeros_like(df_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
fig, axes = plt.subplots(figsize=(8,8))
sns.heatmap(data=df_corr, annot=True, mask=mask, cmap='viridis', annot_kws={"size": 8}, linewidths=0.5)
plt.ylim(df_corr.shape[1], 0) # it sets the y axis limits
plt.show()
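If the masking step feels abstract, you can print a small mask on its own. This sketch assumes a hypothetical 3x3 correlation table. Cells marked `True` are the ones the heatmap hides.

```python
import numpy as np

n_variables = 3  # hypothetical size of a small correlation table
mask = np.zeros((n_variables, n_variables), dtype=bool)

# mark the diagonal and everything above it as hidden
mask[np.triu_indices_from(mask)] = True
print(mask)
```

Only the cells below the diagonal stay `False`, so each pair of variables is displayed once.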

We will now create a function to plot only correlation levels greater than a certain threshold since it is more interesting to visualise the most correlated variables.
* We will build on top of the previous code to create a function. We set four arguments: `data`, `threshold`, `figsize` with a default value, and `annot_size` with a default value for the annotation font size.

def heatmap_corr(data, threshold, figsize=(8,8), annot_size=8):
  # we create the mask for the upper diagonal and
  # hide values whose absolute value is below the threshold
  mask = np.zeros_like(data, dtype=bool)
  mask[np.triu_indices_from(mask)] = True
  mask[abs(data) < threshold] = True

  # we plot the heatmap as usual, using the figsize argument
  fig, axes = plt.subplots(figsize=figsize)
  sns.heatmap(data=data, annot=True, xticklabels=True, yticklabels=True,
              mask=mask, cmap='viridis', annot_kws={"size": annot_size}, ax=axes,
              linewidths=0.5)
  plt.ylim(len(data.columns),0)
  plt.show()
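The same thresholding idea also works without a plot. The helper below, `strong_pairs`, is a hypothetical companion to the heatmap function: it lists each pair once and keeps those whose absolute level reaches the threshold. The correlation table is made up for illustration.

```python
import numpy as np
import pandas as pd


def strong_pairs(corr, threshold):
    """List each pair of variables once, keeping |level| >= threshold."""
    # keep only the cells strictly below the diagonal
    lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(lower).stack().dropna().reset_index()
    pairs.columns = ["variable_1", "variable_2", "level"]
    pairs = pairs[pairs["level"].abs() >= threshold]
    return pairs.sort_values(by="level", key=abs, ascending=False).reset_index(drop=True)


# hypothetical correlation table, for illustration only
toy_corr = pd.DataFrame(
    [[1.0, -0.9, 0.3],
     [-0.9, 1.0, -0.5],
     [0.3, -0.5, 1.0]],
    index=["a", "b", "c"], columns=["a", "b", "c"],
)

result = strong_pairs(toy_corr, threshold=0.4)
print(result)
```

A tidy list like this can be easier to read than a heatmap when the dataset has many variables.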

Now we are interested in showing linear relationships that are at least moderate, which means a threshold of 0.4.
* We notice many variables are strongly correlated among themselves. We can then increase the threshold to narrow the scope down to fewer variables.



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> But what if the correlation levels, in general, were not so high?
* Then you would decrease the threshold level. In this exercise, we will soon set the level to 0.8 (very strong) since this may be a good threshold for an initial analysis.
* But if your data shows the majority of correlation levels to be around, say, 0.4, this value will be the threshold to start your investigation.

heatmap_corr(data= df_corr, threshold=0.4)

Now we are interested in seeing only very strong correlations, which means a threshold of 0.8.
* We notice that mpg is linearly correlated with weight and displacement, a bit more with weight.
* We notice multicollinearity here since weight and displacement are correlated with each other: 0.87. That means both carry largely overlapping information for predicting the target.

heatmap_corr(data= df_corr, threshold=0.8)

Let's plot a scatterplot of `mpg` against `weight` and `displacement`.
* We note that the relationship in both cases is not purely linear; it looks curved, roughly exponential. At the same time, there is a clear pattern between the studied variables and the target.
* This is an example of how the correlation study can help you to figure out patterns in a dataset.
  * The dataset we used is simple, but you can extend the rationale to other datasets.

for col in ['weight', 'displacement']:
  sns.scatterplot(data=df,  x=col, y='mpg')
  plt.show()
  print("\n")

We also plot 'weight' against 'displacement' to check the feature multicollinearity. They show 0.93 for linear correlation, and indeed, they have a linear pattern

sns.scatterplot(data=df,  x='weight', y='displacement')
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE:** can you programmatically, in one cell, develop a logic where you will:
* Calculate the correlation level, filter the variables where they are correlated with the target at a threshold of 0.7 and do a scatter plot for all of them, separately, against the target.
* Your code should subset `['cylinders', 'displacement', 'horsepower', 'weight']` as the variables that correlate to 0.7 (either positive or negative) to 'mpg'.


# create a variable called var_list, that will calculate the linear correlation.
# filter mpg column and query values either greater than 0.7 or less than -0.7.
# hint: query() expects a string
# then you get the index
var_list = df.corr(numeric_only=True)

# next you will loop on each variable and do a scatter plot on mpg against the variable
for col in var_list:
  # do a scatter plot where x=col and y is 'mpg'
  pass  # placeholder so the cell runs; replace it with your plot

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> However, there is still another limitation: Pearson correlation is sensitive to outliers, and the significance tests commonly used with it assume your data is approximately normally distributed, which often is not the case in real datasets.
* You can use other methods, like Spearman, to figure out correlations and complement your data analysis.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Spearman

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Spearman correlation measures the monotonic relationship between two variables.
* That means the variables do not necessarily have to move at a constant rate.
* It helps reveal a general trend: if one variable increases, so will the other. Or if one increases, the other one decreases.

* 1 - There is a monotonic relationship since, generally, when X increases, so does Y.
* 2 - There is a monotonic relationship since, generally, when X increases, Y decreases.
* 3 - There is **no** monotonic relationship since there are intervals where Y increases as X increases and intervals where Y decreases as X increases.
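One way to see what "monotonic" buys you is to compare both tests on data that always increases, but not at a constant rate. The sketch below uses made-up data and computes Spearman by hand as the Pearson correlation of the ranks, which is how the Spearman coefficient is defined.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(0, 5, 50))
y = np.exp(x)  # always increasing, but not at a constant rate

pearson = x.corr(y, method="pearson")

# Spearman is Pearson computed on the ranks of the values
spearman = x.rank().corr(y.rank(), method="pearson")

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Spearman reports a perfect level because the ordering of `x` and `y` always agrees, while Pearson is lower because the relationship is not a straight line.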

---

Let's consider the mpg dataset again from Seaborn.  It holds records for mpg (miles per gallon), cylinders, horsepower, and weight from multiple car types.

df = sns.load_dataset('mpg')
df = df.head(50)
print(df.shape)
df.head()

We use [.corr()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) from the Pandas library. The argument is `method='spearman'`.
* Reminder: it considers numerical variables only.
* Again, the data in this format is not very informative. We need to plot it to make sense of it.



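# numeric_only=True skips the text columns (origin, name); pandas 2.0+ raises an error without it
df_corr_spearman = df.corr(method='spearman', numeric_only=True)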
df_corr_spearman

We use our previous heatmap function with a threshold for a very strong monotonic correlation.
* At first, it looks similar to the previous study. But looking closer, we notice that cylinders and horsepower are monotonically correlated with mpg.
  * These variables didn't appear in the previous exercise.

heatmap_corr(data= df_corr_spearman, threshold=0.8)

We already plotted mpg against weight and displacement, so we will save time here. We will plot with `sns.scatterplot()` using cylinders and horsepower.
* We note there is a pattern between mpg and horsepower, where one increases and the other decreases. The monotonic correlation between them is very strong.
* We note cylinders takes only a few distinct values, so it behaves like a categorical variable; we will plot it again using a swarmplot.

for col in ['horsepower', 'cylinders']:
  sns.scatterplot(data=df,  x=col, y='mpg')
  plt.show()
  print("\n")

We plot a swarmplot with Seaborn, where x is cylinders and y is mpg.
* We notice patterns: the data is concentrated on 4, 6 and 8 cylinders, and the mpg levels look different across most of these cylinder groups.

* Note: swarmplot works fine in this dataset since the data size is small. In big datasets, a swarmplot would make the dots cluttered. So it is a matter of trial and error. Instead of a swarmplot, you could try a histogram coloured by mpg, or a boxplot.



sns.swarmplot(data=df, x='cylinders', y='mpg', dodge=True)
plt.show()

---

# Predictive Power Score Unit 

## Lesson Learning Outcome

* **Predictive Power Score Lesson is made of 1 unit.**
* By the end of this lesson, you should be able to:
  * Understand and interpret a PPS (predictive power score)
  * Calculate and visualise PPS
  * Combine PPS with Correlation Analysis 

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand and interpret a PPS (predictive power score)
* Calculate and visualise PPS
* Combine PPS with Correlation Analysis 



---

The predictive power score detects linear or non-linear relationships between two variables. It helps to find predictive patterns in the data. 

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png"> **Why do we study Predictive Power Scores?**
  * Because PPS can be used as an alternative to, or in conjunction with, correlation level analysis.
  * The ppscore library is handy and gives additional insights into your data, including finding the potential best univariate predictors for your target variable.




##  <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png">  Additional Learning Context

* We encourage you to:
  * Add **code cells and try out** other possibilities, i.e. play around with parameter values in a function/method, or consider additional function parameters etc.
  * Also, **add your own comments** in the cells. It can help you to consolidate your learning. 


* Parameters in given function/method
  * As you may expect, a given function in a package may contain multiple parameters. 
  * Some of them are mandatory to declare; some have pre-defined values, and some are optional. We will cover the most common parameters used/employed in Data Science for a particular function/method. 
  * However, you may find additional parameters in the respective package documentation, where you will find instructions on how to use a given function/method. The studied packages are open source, so this documentation is public.
  * **For ppscore the link is [here](https://github.com/8080labs/ppscore)**.

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import ppscore as pps

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Predictive Power Score

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> According to its documentation, the PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to correlation (matrix).


### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Introduction

During your project, you will probably start wondering: what is the level of relationship (linear or not) among my variables (features and target), regardless of their data type? You may transform categorical variables for correlation analysis, but that is an extra step.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Predictive Power Score is the answer to that! For each pair of variables in your dataset, regardless of data type, it calculates a score that tells **how much predictive power a given variable has over another**.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The calculation of PPS between variables considers one variable trying to predict another using a learning algorithm. 
The library fits an ML model for each pair of variables. 
* If the target is a continuous variable, the library uses a regression model. It compares the performance (mean absolute error - MAE) to a naive regressor model (a model that always predicts the median). The PPS is the result of the following normalisation (and never smaller than 0): `PPS = 1 - (MAE_model / MAE_naive)`
* If the target is a categorical variable, the library fits a classifier and compares the performance (F1 score) to a naive classifier model (one that always predicts the most common class). The PPS is the result of the following normalisation (and never smaller than 0): `PPS = (F1_model - F1_naive) / (1 - F1_naive)`


For additional clarification, you may check the [documentation](https://github.com/8080labs/ppscore#cases-and-their-score-metrics).
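The two normalisations above can be sketched by hand. This is not the library's code: the helper names `pps_regression` and `pps_classification` are hypothetical, and the target values and "model" predictions are made up so the arithmetic is easy to follow.

```python
import numpy as np


def pps_regression(mae_model, mae_naive):
    """PPS = 1 - (MAE_model / MAE_naive), never smaller than 0."""
    return max(0.0, 1 - mae_model / mae_naive)


def pps_classification(f1_model, f1_naive):
    """PPS = (F1_model - F1_naive) / (1 - F1_naive), never smaller than 0."""
    return max(0.0, (f1_model - f1_naive) / (1 - f1_naive))


# hypothetical target values and hypothetical model predictions
y_true = np.array([10.0, 12.0, 14.0, 20.0, 30.0])
y_model = np.array([11.0, 12.0, 15.0, 18.0, 28.0])

# the naive regressor always predicts the median of the target
y_naive = np.full_like(y_true, np.median(y_true))

mae_model = np.mean(np.abs(y_true - y_model))
mae_naive = np.mean(np.abs(y_true - y_naive))

score = pps_regression(mae_model, mae_naive)
print(round(score, 3))
```

Here the model error is 1.2 and the naive error is 5.6, so the score is 1 - 1.2 / 5.6, roughly 0.786. A model that does worse than the naive baseline is clipped to 0 rather than reported as negative.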

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Calculate PPS

Let's use NumPy to generate some data and demonstrate how to use PPS.
* We will assume Y is our target variable, and X1 to X4 are features

np.random.seed(seed=50)
n_datapoints = 1000
x = np.linspace(0, 10, n_datapoints)
k = x * np.sin(x**2)
u = np.cos(2*x) * np.sin(x)
v = np.random.uniform(low=-10, high=5, size=n_datapoints)
z = v * v + np.random.uniform(low=-1, high=2,size=n_datapoints)


df = pd.DataFrame(data={"Y":k,
                        "X1":x,
                        "X2":np.random.choice(['Type A','Type B','Type C'], n_datapoints),
                        "X3":v,
                        "X4": z
                        })
df = df.head(300)
df.head()

Since the feature space is not extensive, we will do a Pairplot to visualise the relationships among the variables quickly. 
* We notice patterns between Y and X1; and between X3 and X4 
* We don't see X2 since it is a categorical variable (object type in the DataFrame) and is not plotted in the Pairplot. If this variable were encoded as a number, it would appear.


sns.pairplot(data=df)

We calculate the PPS matrix with `pps.matrix()`; the documentation is [here](https://github.com/8080labs/ppscore#api). The argument is the dataset.
* There is a computation cost to calculate the scores since the library fits an ML model for each possible combination of variables. The order is important: the library calculates the score for X1 predicting X2 and also for X2 predicting X1. In our exercise, it is expected to be a simple computation.
* The result is a table. The variables we will be interested in are x, y, and ppscore.

pps.matrix(df=df)

We will then process this table to output a DataFrame where we can visualise the results
* We create a `pps_matrix_raw` with `pps.matrix()`, then filter x, y and ppscore and do a pivot table on these filtered variables
* The aspect is similar to a correlation table we saw earlier. As you may expect, we can plot using a heatmap

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Here we see a practical aspect we mentioned earlier: PPS is asymmetric. The PPS between Y and X1 (that means X1 predicting Y) is 0.95, and the PPS between X1 and Y (Y predicting X1) is 0.44.


pps_matrix_raw = pps.matrix(df)
pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')
pps_matrix
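To see what the pivot does in isolation, here is a minimal sketch on a hand-written long-format table. The scores are hypothetical; only the column names `x`, `y` and `ppscore` mirror the table above.

```python
import pandas as pd

# hypothetical long-format scores, shaped like the x, y and ppscore columns
long_scores = pd.DataFrame({
    "x": ["X1", "X1", "Y", "Y"],
    "y": ["X1", "Y", "X1", "Y"],
    "ppscore": [1.0, 0.9, 0.4, 1.0],
})

# rows become the predicted variable (y), columns the predictor (x)
wide_scores = long_scores.pivot(columns="x", index="y", values="ppscore")
print(wide_scores)

x1_predicts_y = wide_scores.loc["Y", "X1"]
y_predicts_x1 = wide_scores.loc["X1", "Y"]
print(x1_predicts_y, y_predicts_x1)
```

Unlike a correlation table, the two off-diagonal cells differ, so you always need to read the table as "column predicts row".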

We create a custom heatmap function using a threshold
* You will notice there is a `linecolor='lightgrey'` argument that creates a grid in the plot. This makes sense for a PPS heatmap since, typically, there will be a good number of cells to look at, and a grid helps with visualising them.

def heatmap_pps(df,threshold, figsize=(8,8), font_annot = 10):
    if len(df.columns) > 1:

      # hide cells whose score is below the threshold
      mask = np.zeros_like(df, dtype=bool)
      mask[abs(df) < threshold] = True

      fig, ax = plt.subplots(figsize=figsize)
      ax = sns.heatmap(df, annot=True, annot_kws={"size": font_annot},
                       mask=mask,cmap='rocket_r', linewidths=0.05,
                       linecolor='lightgrey')
      
      plt.ylim(len(df.columns),0)
      plt.show()


Let's use our function on pps_matrix and set for now `threshold=0`
* Next, we ask ourselves, how do I interpret it? I know the levels go from 0 to 1, but what is a "good level" of pps?
* What should be the criteria to set a threshold for my analysis?

heatmap_pps(df=pps_matrix, threshold=0)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> According to [this](https://github.com/8080labs/ppscore/issues/39) forum discussion, the interpretation depends on the context:

* In general, it is hard to denote some specific levels and give some interpretation for them without knowing the context. For example, if many columns have a PPS of 0.3, then a PPS of 0.2 might actually not be that good. However, when no column has a PPS >0.01, then a PPS of 0.1 might be very good - especially when trying to predict something hard, like stock prices.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Nevertheless, some levels are often helpful during everyday life:

* PPS == 0 means that there is no predictive power
* PPS < 0.2 often means that there is some relevant predictive power, but it is weak
* PPS > 0.2 often means that there is strong predictive power
* PPS > 0.8 often means that there is a deterministic relationship in the data, for example, y = 3*x or there is some underlying if...else... logic

* Based on those levels, it is often important to check the PPS for multiple columns and then determine your interpretation.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> **We shall consider levels from 0.2; however, depending on the common levels for PPS in a given dataset, the threshold might be different.**




* We will then calculate the range of common PPS values in `pps_matrix_raw`. Let's review this matrix first.

 

pps_matrix_raw

We will grab the ppscore column and keep only the values below 1 (a value of 1 is the score between a variable and itself). Then we calculate the summary statistics with `.describe()` and transpose it (for better table visualisation).

We are interested in the IQR - Q1 and Q3 values
* In this case, Q1 and Q3 are both 0. That means the majority of the values are zero. So for the first round of visualisation, we will set the threshold at 0.15 (which allows us to see potential scores that would be closer to 0.2).

pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
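The same chain can be tried on a small hand-written column to see how Q1 and Q3 are read from the summary. The scores below are hypothetical and were chosen so that most of them are zero, as in our exercise.

```python
import pandas as pd

# hypothetical ppscore column; the 1.0 entries are a variable against itself
scores = pd.DataFrame(
    {"ppscore": [1.0, 0.0, 0.0, 0.0, 0.9, 1.0, 0.0, 0.0, 0.4, 0.0, 0.0, 1.0]}
)

summary = scores.query("ppscore < 1").filter(["ppscore"]).describe().T
print(summary)

q1 = summary.loc["ppscore", "25%"]
q3 = summary.loc["ppscore", "75%"]
print(q1, q3)
```

When both quartiles are zero, a threshold slightly above zero already removes most cells from the heatmap and leaves the few informative scores.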

We set a threshold, and now it is cleaner to analyse
* There are two axes in the plot, y and x. The y-axis is the target variable in the PPS computation, and the x-axis is the feature variable for PPS computation
* We notice that:
  * X1 has a very strong predictive power (deterministic relationship) to Y: 0.95.  That means if I have X1, I can easily predict Y
  * On the other hand, if I have Y, it is more difficult to predict X1. The PPS score is 0.44. Here there is a contextual interpretation; even 0.21 is a representative value in the dataset, the opposite relationship is much stronger (0.95); therefore, the predictive power of Y is weak to X1
  * We see that X3 and X4 have interesting PPS values among them. X3 has a strong predictive power (0.95) to X4. When you see the pair plot for these variables, this will be easy to conclude.
  * However, X4 has a lower (but still high) predictive power to X3.

heatmap_pps(df=pps_matrix, threshold=0.15)
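
The asymmetry noted above (one variable predicts the other well, but not the reverse) can be sketched without any library. The example below is an assumption-laden toy, not the PPS computation itself: it uses a hypothetical relationship `y = x ** 2` and simply counts how many distinct target values correspond to each feature value.

```python
from collections import defaultdict

# Hypothetical data: y is fully determined by x, but not the other way round.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

def distinct_targets_per_feature(feature, target):
    """For each feature value, count how many different target values it maps to."""
    groups = defaultdict(set)
    for f, t in zip(feature, target):
        groups[f].add(t)
    return {f: len(ts) for f, ts in groups.items()}

x_to_y = distinct_targets_per_feature(xs, ys)  # every x maps to exactly one y
y_to_x = distinct_targets_per_feature(ys, xs)  # some y values map to two x values

print(max(x_to_y.values()))
print(max(y_to_x.values()))
```

Knowing `x` pins down `y` exactly, while knowing `y` can leave two candidates for `x`. That is the kind of one-directional relationship an asymmetric score is able to show.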

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**

Let's take the wine dataset. It shows records for three types of wines grown in the same region in Italy, made from a chemical analysis taken for different constituents found in the three types of wine.

from sklearn.datasets import load_wine
import pandas as pd

data = load_wine()
df_practice = pd.DataFrame(data.data,columns=data.feature_names)
df_practice['wine_type'] = pd.Series(data.target)
df_practice.head()


Let's create the pps_matrix
* What should be a good threshold level for the heatmap?

pps_matrix_raw = pps.matrix(df_practice)
pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T

Plot the heatmap for PPS
* Which predictor variables look to have predictive power to the target (wine type) individually?

heatmap_pps(df=pps_matrix, threshold=...)

You will need to type in the variables for the next exercise. Let's get the column names, so we can just copy and paste them into `var_list`

df_practice.columns

Fill var_list with the four variables that showed predictive power to the target.
* The code below will loop on each variable and do a box plot and swarmplot on each variable for each wine type.
* Do these variables look to have predictive power to separate the wine types? (Pay attention to the distribution across each wine type)

var_list = [...]  # list 4  variables
for col in var_list:

  fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,6))
  sns.boxplot(data=df_practice, y=col, x='wine_type', ax=axes[0])
  sns.swarmplot(data=df_practice, y=col, x='wine_type', dodge=True, ax=axes[1])
  plt.show()
  print("\n")



### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Combine with Correlation Study

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are not interested in discussing whether Correlation Analysis is better than PPS analysis. We want to use the best of both since they look at the data from different angles. Therefore, we want to combine both.


We created custom functions that combine a correlation analysis (considering Pearson and Spearman) and PPS, arranging the knowledge we have built so far. There are two functions you need to call: 
* `CalculateCorrAndPPS()`: calculate correlation tables and PPS table for a dataset. This function prints Q1 and Q3 for PPS scores already.

* `DisplayCorrAndPPS()`: displays the heatmaps. The arguments are `df_corr_pearson`, which is the Pearson correlation table for the data; `df_corr_spearman`, which is the Spearman correlation table for the data; `pps_matrix`, which is the PPS table for the data; and `CorrThreshold` and `PPS_Threshold`, which are the visualisation thresholds for correlations and PPS respectively.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps

def heatmap_corr(df,threshold, figsize=(20,12), font_annot = 8):
  if len(df.columns) > 1:
    mask = np.zeros_like(df, dtype=bool)  # np.bool was removed in NumPy 1.24; use the built-in bool
    mask[np.triu_indices_from(mask)] = True
    mask[abs(df) < threshold] = True

    fig, axes = plt.subplots(figsize=figsize)
    sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                linewidth=0.5
                     )
    axes.set_yticklabels(df.columns, rotation = 0)
    plt.ylim(len(df.columns),0)
    plt.show()


def heatmap_pps(df,threshold, figsize=(20,12), font_annot = 8):
    if len(df.columns) > 1:

      mask = np.zeros_like(df, dtype=bool)  # np.bool was removed in NumPy 1.24; use the built-in bool
      mask[abs(df) < threshold] = True

      fig, ax = plt.subplots(figsize=figsize)
      ax = sns.heatmap(df, annot=True, xticklabels=True,yticklabels=True,
                       mask=mask,cmap='rocket_r', annot_kws={"size": font_annot},
                       linewidth=0.05,linecolor='grey')
      
      plt.ylim(len(df.columns),0)
      plt.show()



def CalculateCorrAndPPS(df):
  df_corr_spearman = df.corr(method="spearman")
  df_corr_pearson = df.corr(method="pearson")

  pps_matrix_raw = pps.matrix(df)
  pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

  pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
  print("PPS threshold - check PPS score IQR to decide the threshold for the heatmap \n")
  print(pps_score_stats.round(3))

  return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix,CorrThreshold,PPS_Threshold,
                      figsize=(20,12), font_annot=8 ):

  print("\n")
  print("* Analyze how the target variable for your ML models is correlated with other variables (features and target)")
  print("* Analyze multicollinearity, that is, how the features are correlated among themselves")

  print("\n")
  print("*** Heatmap: Spearman Correlation ***")
  print("It evaluates monotonic relationship \n")
  heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Pearson Correlation ***")
  print("It evaluates the linear relationship between two continuous variables \n")
  heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Predictive power Score (PPS) ***")
  print(f"PPS detects linear or non-linear relationships between two columns.\n"
        f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
  heatmap_pps(df=pps_matrix,threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)
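
To make the masking logic inside `heatmap_corr()` easier to follow, here is a minimal sketch on a hypothetical 3x3 correlation table (`toy_corr` and its values are made up for illustration). A cell set to `True` in the mask is hidden by `sns.heatmap()`.

```python
import numpy as np
import pandas as pd

# Hypothetical symmetric correlation table.
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, -0.7],
     [-0.2, -0.7, 1.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)
threshold = 0.6

# Start with nothing hidden.
mask = np.zeros(toy_corr.shape, dtype=bool)
# Hide the upper triangle and the diagonal, since the table is symmetric.
mask[np.triu_indices_from(mask)] = True
# Hide every cell whose absolute correlation is below the threshold.
mask[(toy_corr.abs() < threshold).to_numpy()] = True

print(mask)
```

Only the pairs (B, A) and (C, B) stay visible in this toy case, because they sit in the lower triangle and their absolute value reaches the threshold.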

Call CalculateCorrAndPPS to calculate the necessary correlation/PPS tables

df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

Display in the heatmaps the correlation and PPS levels
* Choosing the threshold value, for either correlation or PPS, is a matter of trial and error to see what fits best. We started with a correlation threshold of 0.6 (strong correlation) and a PPS threshold of 0.15.
* We set a small figsize since we don't have many variables

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Interpretations
* When looking at the first two plots (linear and monotonic relationships), we note that X3 and X4 have a strong and negative linear and monotonic relationship. 
* When looking at the last plot, we note that X1 has a very strong predictive power (deterministic relationship) to Y. On the other hand, if I have Y, it is more difficult to predict X1.
* When looking at the last plot, we note that X3 has a strong predictive power to X4. However, X4 has a lower (but still high) predictive power to X3. It shows a practical example of asymmetry.

DisplayCorrAndPPS(df_corr_pearson=df_corr_pearson,
                  df_corr_spearman=df_corr_spearman, 
                  pps_matrix=pps_matrix,
                  CorrThreshold=0.6, PPS_Threshold=0.15,
                  figsize=(5,5), font_annot=8)
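
The printed notes in `DisplayCorrAndPPS()` distinguish a monotonic relationship (Spearman) from a linear one (Pearson). A small sketch, assuming a hypothetical cubic relationship between two made-up columns, shows how the two methods can disagree on the same data.

```python
import pandas as pd

# Hypothetical data: y grows with x at every step, but not along a straight line.
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

pearson_xy = toy.corr(method="pearson").loc["x", "y"]
spearman_xy = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson_xy:.3f}")
print(f"Spearman: {spearman_xy:.3f}")
```

Spearman works on ranks, so a strictly increasing curve gets a perfect score, while Pearson is penalised for the curvature. Looking at both heatmaps side by side helps to spot this kind of relationship.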

---

# Target Imbalance - Unit 01 - Introduction

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%201%20-%20Lesson%20Learning%20Outcome.png"> Lesson Learning Outcome

* **The Target Imbalance Lesson consists of two units.**
* By the end of this lesson, you should be able to:
  * Analyse how balanced your target distribution is
  * Understand and select Over Sampling or Under Sampling techniques to handle target imbalance
  * Combine ML Pipeline and Target Imbalance

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

  * Analyse how balanced your target distribution is
  * Understand and select Over Sampling or Under Sampling techniques to handle target imbalance



---

In the workplace, the chances are high that a dataset will be imbalanced. That means the target variable will contain classes whose frequencies are not similar.


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png
">
 **Why do we study target imbalance?**
* Because evaluating whether the class frequencies are balanced is a critical part of an exploratory data analysis flow for a classification task.
* If they are not balanced, an ML model will likely perform poorly on the infrequent classes.


---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png"> Additional Learning Context

* We encourage you to:
  * Add **code cells and try out** other possibilities, i.e. play around with parameter values in a function/method, or consider additional function parameters etc.
  * Also, **add your own comments** in the cells. It can help you to consolidate your learning. 

* Parameters in given function/method
  * As you may expect, a given function in a package may contain multiple parameters. 
  * Some of them are mandatory to declare; some have pre-defined values, and some are optional. We will cover the most common parameters used/employed in Data Science for a particular function/method. 
  * However, you may seek additional information in the respective package documentation, where you will find instructions on how to use a given function/method. The studied packages are open source, so this documentation is public.
  * **For imbalanced-learn, the link is [here](https://imbalanced-learn.org/stable/auto_examples/api/plot_sampling_strategy_usage.html)**.

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

We will install imbalanced-learn to handle an imbalanced target; the documentation link is [here](https://imbalanced-learn.org/stable/auto_examples/api/plot_sampling_strategy_usage.html)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Target Imbalance

There will be ML Classification tasks where the classes from the target variable will not have the same proportion (or frequency).
* `imbalanced-learn` is a Python package with a set of re-sampling techniques used in datasets showing imbalance in the target

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Imbalanced dataset

We generate an imbalanced dataset with the `make_classification()` function from sklearn. The function documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html). 
* According to the documentation, the arguments are: 
  * `n_samples` for the number of samples
  * `n_features` as the total number of features
  * `n_redundant` as the number of redundant features, which are generated as random linear combinations of the informative features
  * `n_classes` for the number of classes (or labels) of the classification problem
  * `n_clusters_per_class` as the number of clusters per class
  * `weights` as the proportion of samples assigned to each class; this is where you define the imbalance
  * `flip_y` as the fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder
  * `random_state` determines random number generation for dataset creation.
* It returns NumPy arrays

from  sklearn.datasets import make_classification
X, y = make_classification(n_samples=150, n_features=2, n_redundant=0, n_classes=2,
                           n_clusters_per_class=1, weights=[0.95], flip_y=0, random_state=1)

print(type(X), type(y))

We transform them into a Pandas DataFrame using the techniques we know: create a DataFrame with X, then assign y as the target variable. Finally, we relabel the class names with `.replace()`

df_2classes = pd.DataFrame(data=X,columns=['X1','X2'])
df_2classes['Target'] = y
df_2classes['Target'] = df_2classes['Target'].replace({0:'Class 0',1:'Class 1'})
df_2classes.head()

Let's check what size the dataset is

df_2classes.shape

We plot the target class distribution. It is clearly imbalanced: Class 0 is dominant.

plt.figure(figsize=(7,5))
sns.countplot(x=df_2classes['Target'])
plt.show()
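
Besides the count plot, it can help to express the imbalance as numbers. The sketch below is a minimal example on a hypothetical target column (the 95/5 split is made up for illustration); it uses `.value_counts(normalize=True)` to get class proportions and a simple majority-to-minority ratio.

```python
import pandas as pd

# Hypothetical target with 95 majority and 5 minority observations.
toy_target = pd.Series(["Class 0"] * 95 + ["Class 1"] * 5, name="Target")

counts = toy_target.value_counts()
proportions = toy_target.value_counts(normalize=True)
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f}")
```

There is no universal cut-off for when a target counts as imbalanced; the proportions are a starting point for deciding whether resampling is worth trying.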

We now want to see the imbalance together with the features. We do a scatter plot with `sns.scatterplot()` and colour by `Target`. We set `alpha`, which controls the dots' transparency, so we can better see the overlap
* As we expected, we see many more instances of Class 0 than Class 1.

* This exercise is simple since we consider only two features. In real-life datasets, you will likely have more features, and then you need to assess which variables are most linked to the target, so that you can plot them. We covered these strategies in previous lessons.

plt.figure(figsize=(6,5))
sns.scatterplot(data=df_2classes, x='X1', y='X2', hue='Target', alpha=0.8)
plt.show()

You can also take a given dataset and make it imbalanced with `make_imbalance()`; read the documentation [here](https://imbalanced-learn.org/stable/references/generated/imblearn.datasets.make_imbalance.html). According to the documentation, the arguments are the features and target; `sampling_strategy`, which is the new number of observations for each class, passed as a dictionary; and `random_state`, which is the seed used by the random number generator
* Let's take the iris dataset

df = sns.load_dataset('iris')
df.head()

We count the target frequency with `.value_counts()`. We see the counts are all the same, so the classes are balanced

df['species'].value_counts()

The sampling strategy is the new number of observations for each class, passed as a dictionary. Previously we noticed each class had 50 instances; now we change them to: `'setosa': 10, 'versicolor': 30,'virginica' : 40 `
* The outputs are X_imb and y_imb; we assign them to a DataFrame using the techniques we are already familiar with

from imblearn.datasets import make_imbalance

target_variable = 'species'

X_imb, y_imb = make_imbalance(X=df.drop([target_variable],axis=1),
                              y=df[target_variable],
                              sampling_strategy={'setosa': 10, 'versicolor': 30,'virginica' : 40},
                              random_state=1)
df_3classes = pd.DataFrame(data=X_imb)
df_3classes[target_variable] = y_imb
df_3classes.head()

We check the new target frequencies

df_3classes['species'].value_counts()

And plot the target distribution. It is imbalanced; setosa has a lot less data than the other classes

plt.figure(figsize=(7,5))
sns.countplot(x=df_3classes['species'])
plt.show()

Now we plot 'sepal_length' against 'petal_length' in a scatter plot colouring by 'species'
* We notice fewer datapoints for setosa

plt.figure(figsize=(9,6))
sns.scatterplot(data=df_3classes, x='sepal_length', y='petal_length', hue='species', alpha=0.8)
plt.show()

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will study 2 techniques in these datasets:
* **Over Sampling using SMOTE** (Synthetic Minority Over-sampling Technique). Here you add synthetic data that is similar to the existing data in the minority classes 
* **Under Sampling**. Here the idea is to drop observations from the majority classes so that, in the end, all classes have frequencies similar to those of the minority classes


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The trade-off of these techniques is that we are "changing" the dataset and, therefore, may lose information; the benefit, however, is that we create conditions to fit a better model with the data provided

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Over Sampling Technique

We will use `SMOTE()`, which stands for Synthetic Minority Over-sampling Technique. The function documentation is [here](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html). Here you add synthetic data that is similar to the existing data in the minority classes
* The arguments, according to the documentation, are: `random_state` to control the randomisation of the resampling; and `sampling_strategy`, which defines how to resample the dataset; you can consider:
  * 'minority': resample only the minority class;
  * 'not minority': resample all classes but the minority class;
  * 'not majority': resample all classes but the majority class;
  * 'all': resample all classes
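
To give an intuition for what "synthetic data" means here, the sketch below builds new minority points by interpolating between an existing point and its nearest neighbour. This is a simplified illustration and not the imbalanced-learn implementation: it assumes a single nearest neighbour, and the array `minority` and the helper `synthetic_point()` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical minority-class observations with two features.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synthetic_point(points, rng):
    """Create one new point on the segment between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    base = points[i]
    distances = np.linalg.norm(points - base, axis=1)
    distances[i] = np.inf                      # ignore the point itself
    neighbour = points[np.argmin(distances)]
    lam = rng.random()                         # random position along the segment, in [0, 1)
    return base + lam * (neighbour - base)

new_points = np.array([synthetic_point(minority, rng) for _ in range(5)])
print(new_points)
```

Because every new point lies between two real minority observations, the synthetic data stays inside the region the minority class already occupies.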


First, a quick recap of `df_2classes`

print(df_2classes.shape)
df_2classes.head()


You create an object with `SMOTE()` with a sampling strategy of 'not majority', so it resamples all classes but the majority class; then you `.fit_resample()` on the train set (for now we use the whole dataset, but soon we will see a different example).
* It returns the resampled X and y. Then you assign the resampled data to a DataFrame

from imblearn.over_sampling import SMOTE
smote_over = SMOTE(sampling_strategy='not majority', random_state=1)
X, y = smote_over.fit_resample(X= df_2classes.drop(['Target'],axis=1), y= df_2classes['Target'])

df_2classes_smote =pd.DataFrame(data=X,columns=['X1','X2'])
df_2classes_smote['Target'] = y
df_2classes_smote['Target'] = df_2classes_smote['Target'].replace({0:'Class 0',1:'Class 1'})
df_2classes_smote.head()

Let's check the dataset size
* As we may expect, it increased due to the fact we added data in the minority class

df_2classes_smote.shape

You remember this dataset was strongly imbalanced; let's compare the target distribution before and after SMOTE: it is balanced now!

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
sns.countplot(x=df_2classes['Target'], ax=axes[0])
axes[0].set_title("Before SMOTE")
sns.countplot(x=df_2classes_smote['Target'], ax=axes[1])
axes[1].set_title("After SMOTE")
plt.show()

We do the same scatter plot to check the data after the transformation
* Note: the new data points for Class 1 were synthetically generated

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
sns.scatterplot(data=df_2classes, x='X1', y='X2', hue='Target', alpha=0.5, ax=axes[0])
axes[0].set_title("Before SMOTE")
sns.scatterplot(data=df_2classes_smote, x='X1', y='X2', hue='Target', alpha=0.5,  ax=axes[1])
axes[1].set_title("After SMOTE")
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**

Let's recap df_3classes target distribution

df_3classes['species'].value_counts()

Use the oversampling technique with SMOTE to manage target imbalance in df_3classes

smote_over = # write your code here; you may use not majority as a sampling strategy
X, y = smote_over.fit_resample(...) ### write your code here


df_3classes_smote = pd.DataFrame(data=X)
df_3classes_smote['species'] = y
df_3classes_smote.head()

Let's count the target distribution to validate the resampling

df_3classes_smote['species'].value_counts()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Under Sampling Technique


We will use `RandomUnderSampler()`; the function documentation is [here](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html). Here the idea is to drop observations from the majority classes so that, in the end, all classes have frequencies similar to those of the minority classes


* The arguments, according to the documentation, are: `random_state` to control the randomisation of the resampling; and `sampling_strategy`, which defines how to resample the dataset; you can consider:
  * 'minority': resample only the minority class;
  * 'not minority': resample all classes but the minority class;
  * 'not majority': resample all classes but the majority class;
  * 'all': resample all classes

Let's use df_2classes - which is the imbalanced dataset

df_2classes['Target'].value_counts()

We apply `RandomUnderSampler()` to under-sample all classes but the minority class, then we call `.fit_resample()`. The result is the resampled X and y, which we assign to a DataFrame

from imblearn.under_sampling import RandomUnderSampler
under_sampler = RandomUnderSampler(sampling_strategy='not minority',random_state=1)
X, y = under_sampler.fit_resample(X= df_2classes.drop(['Target'],axis=1), y= df_2classes['Target'])

df_2classes_under = pd.DataFrame(data=X, columns=['X1','X2'])
df_2classes_under['Target'] = y
df_2classes_under['Target'] = df_2classes_under['Target'].replace({0:'Class 0',1:'Class 1'})
df_2classes_under.head()

Let's check the dataset size
* As we may expect, it decreased due to the fact that we removed data from the majority class

df_2classes_under.shape

You remember this dataset was strongly imbalanced; let's compare the target distribution before and after undersampling

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
sns.countplot(x=df_2classes['Target'], ax=axes[0])
axes[0].set_title("Before Under Sampling")
sns.countplot(x=df_2classes_under['Target'], ax=axes[1])
axes[1].set_title("After Under Sampling")
plt.show()

We do the same scatter plot to check how the data is after the transformation

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
sns.scatterplot(data=df_2classes, x='X1', y='X2', hue='Target', alpha=0.5, ax=axes[0])
axes[0].set_title("Before Under Sampling")
sns.scatterplot(data=df_2classes_under, x='X1', y='X2', hue='Target', alpha=0.5,  ax=axes[1])
axes[1].set_title("After Under Sampling")
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**

Let's recap df_3classes target distribution

df_3classes['species'].value_counts()

Use the undersampling technique to manage target imbalance in df_3classes

under_sampler = # write your code here; you may use not minority as a sampling strategy
X, y = under_sampler.fit_resample(...) # write your code here

df_3classes_under = pd.DataFrame(data=X)
df_3classes_under['species'] = y
df_3classes_under.head()

Let's count the target distribution to validate the resampling

df_3classes_under['species'].value_counts()

---

# Target Imbalance - Unit 02 - Combine ML pipeline and Target Imbalance

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

  * Analyse how balanced your target distribution is
  * Understand and select Over Sampling or Under Sampling techniques to handle target imbalance



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

We will install imbalanced-learn to handle imbalanced targets. The documentation link is [here](https://imbalanced-learn.org/stable/auto_examples/api/plot_sampling_strategy_usage.html)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Combine Target Imbalance and ML Pipeline

In the workplace, your data will likely require data cleaning. When dealing with target imbalance, your data can't contain missing data since that is a requirement for the sampling algorithm from imbalanced-learn.
* Therefore, we take the opportunity to perform data cleaning and feature engineering before sampling.


We will consider two scenarios where you will handle target imbalance
* When there **is no** need for data cleaning and feature engineering
* When there **is a need** for data cleaning and feature engineering.

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> No need for data cleaning and feature engineering

Assume your data is cleaned and doesn't need additional processing, but unfortunately, it is imbalanced. Then, you can add SMOTE to your usual techniques when fitting models.

You should consider:
* Split your data into train and test set
* Apply SMOTE to the train set only; the test set is left untouched so that it keeps the original class distribution
* Fit and evaluate the pipeline

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **In the end, you will have one pipeline**

We will create an imbalanced numerical dataset with make_classification; there are two classes, 10000 rows and five features.
* We assign X and y to a DataFrame, and the columns' names go from X1 to X5. The target variable is called Target

from  sklearn.datasets import make_classification
X, y = make_classification(n_samples=10000, n_features=5,
                           n_redundant=3, n_classes=2,
                           n_clusters_per_class=2, weights=[0.90], 
                           flip_y=0, random_state=1)


df = pd.DataFrame(data=X,columns=['X1','X2','X3','X4','X5'])
df['Target'] = y
df.head()

We check the target distribution, and it is imbalanced

df['Target'].value_counts()

We check for missing data
* There is no missing data

df.isna().sum()

We split the train and test set using the usual technique
* We print the train and test shapes

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['Target'],axis=1),
                                    df['Target'],
                                    test_size = 0.2,
                                    random_state = 0
                                    )

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

We double-check the train set target distribution

y_train.value_counts()

And test set distribution

y_test.value_counts()

The data we are using is already cleaned and doesn't need any encoding, and, to stay within the scope of the exercise, we will not worry about numerical feature engineering.
* We create a function `pipeline_clf()` that returns a pipeline containing three steps: feature scaling, feature selection and model. We picked a LogisticRegression algorithm (here, we are not interested in finding the best model, just in demonstrating SMOTE)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler # Feat Scaling
from sklearn.feature_selection import SelectFromModel # Feat Selection
from sklearn.linear_model import LogisticRegression # ML algorithm for classification

def pipeline_clf():
  pipeline_base = Pipeline([
       ("scaler",StandardScaler() ),
       ("feat_selection",SelectFromModel(LogisticRegression(random_state=0)) ),
       ("model",LogisticRegression(random_state=0) )
  ])

  return pipeline_base

pipeline_clf()

We fit the pipeline to the train set

pipe = pipeline_clf()
pipe.fit(X_train,y_train)

And check the confusion matrix on the test set
* We want the "actual" to be the columns and "predictions" to be the rows in the confusion matrix; therefore we flipped the arguments.

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
confusion_matrix(y_true=pipe.predict(X_test), y_pred=y_test)
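
If keeping track of which axis holds the actual values becomes confusing, a labelled table is an alternative worth considering. The sketch below uses `pd.crosstab()` on hypothetical labels (`toy_actual` and `toy_pred` are made up) to place predictions on the rows and actual values on the columns, with both axes named explicitly.

```python
import pandas as pd

# Hypothetical actual labels and predictions.
toy_actual = pd.Series([0, 0, 0, 1, 1, 0], name="Actual")
toy_pred = pd.Series([0, 0, 1, 1, 0, 0], name="Prediction")

# Rows are predictions, columns are actual values.
table = pd.crosstab(index=toy_pred, columns=toy_actual)
print(table)
```

Reading the table: the cell in row `Prediction = 1`, column `Actual = 0` counts the observations of class 0 that were wrongly predicted as class 1.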

Finally, we check the classification report of the test set.
* Typically, the minority class in these cases is the one you are really interested in predicting properly, for example, whether or not a patient has a disease or a transaction is a fraud; and typically, your dataset will contain fewer instances of this class
* In this case, accuracy is not a good performance metric. We will consider the recall of class 1, which is 0.64; let's see if, with the SMOTE technique, we can improve it and if there is a trade-off

print(classification_report(y_pred=pipe.predict(X_test), y_true=y_test))
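
To see why accuracy can be misleading on an imbalanced target, consider a hypothetical model that always predicts the majority class. The numbers below are made up for illustration (90 observations of class 0 and 10 of class 1) and are unrelated to the results above.

```python
# Hypothetical test set and a "model" that always predicts class 0.
y_true_toy = [0] * 90 + [1] * 10
y_pred_toy = [0] * 100

# Accuracy: share of predictions that match the actual label.
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

# Recall for class 1: share of actual class 1 observations that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true_toy, y_pred_toy))
false_negatives = sum(t == 1 and p == 0 for t, p in zip(y_true_toy, y_pred_toy))
recall_class_1 = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall (class 1): {recall_class_1:.2f}")
```

The accuracy looks high even though not a single minority observation is detected, which is why the recall of the minority class is the metric to watch here.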

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's resample the train set with SMOTE, where the sampling strategy is 'not majority'. We `.fit_resample()` and print the shapes of the train and test sets.
* As we may expect, the train set increased

from imblearn.over_sampling import SMOTE
oversample = SMOTE(sampling_strategy='not majority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

We check the train set target distribution
* Now both classes are balanced

y_train.value_counts()

We create a new pipeline

pipe = pipeline_clf()
pipe.fit(X_train,y_train)

We check the confusion matrix on the test set

confusion_matrix(y_true=pipe.predict(X_test), y_pred=y_test)

Finally, we run a classification report on the test set

print(classification_report(y_pred=pipe.predict(X_test), y_true=y_test))

We compare the confusion matrix and classification report before and after resampling. We notice that:
* The accuracy decreased
* The recall and precision on class 0 (dominant) are still good
* The recall on class 1 (minority) increased, but precision decreased.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Assuming for our context that Recall on class 1 is super important, the performance increased when we applied the SMOTE technique. 


---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> There is a need for data cleaning and feature engineering


In the last example, the data was already cleaned and didn't need any processing. You split the data into train and test sets, applied SMOTE, and fitted the model. You had one pipeline to manage all of that.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We know that cleaned data is not common in the workplace. SMOTE only works when the data has no missing values, so we take the following approach to handle target imbalance

* Create two pipelines: one for data cleaning and feature engineering and another for modelling.
* Split your data into train and test sets
* Fit the first pipeline (data cleaning and feature engineering) on the train set and transform on both sets
* Apply SMOTE to the train set
* Fit and evaluate the second pipeline (modelling)

 
 
 <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **In the end, you will have two pipelines.**

We will consider the same dataset from the previous example, but on purpose, we will add missing data. 

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10000, n_features=5,
                           n_redundant=3, n_classes=2,
                           n_clusters_per_class=2, weights=[0.90], 
                           flip_y=0, random_state=1)


df = pd.DataFrame(data=X,columns=['X1','X2','X3','X4','X5'])
df['Target'] = y
df.head()

We replace the first 100 values of X1 with missing data (`np.nan`). You would not do this in the workplace; here we do it only to set up the exercise.

# set the first 100 rows of the first column (X1) to missing
df.iloc[0:100,0] = np.nan

Let's check the DataFrame

df.head()

Let's check the target variable frequency

df['Target'].value_counts()
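
Raw counts show the imbalance, but proportions are often easier to read. A minimal sketch using pandas' `value_counts(normalize=True)` on a toy target (the `target` Series and `imbalance_ratio` name are illustrative, not part of the notebook's dataset):

```python
import pandas as pd

# Toy target for illustration: 9 majority (0) and 1 minority (1) observations
target = pd.Series([0] * 9 + [1], name="Target")

# Share of each class instead of raw counts
proportions = target.value_counts(normalize=True)

# How many majority observations exist per minority observation
counts = target.value_counts()
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```

The same two lines can be applied to `df['Target']` to quantify how imbalanced the target is before resampling.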

We check the missing data levels
* X1 has 100 missing values

df.isna().sum()

If we try to apply `SMOTE()` to data with missing values, it will not work. 
* Let's quickly split the data into train and test sets and try to apply `SMOTE()`

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Run the cell below and notice the error, for example: `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` The exact wording may vary with your scikit-learn version.

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['Target'],axis=1),
                                    df['Target'],
                                    test_size = 0.2,
                                    random_state = 0
                                    )

from imblearn.over_sampling import SMOTE
oversample = SMOTE(sampling_strategy='not majority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
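
The error is raised when the input is validated, but the requirement is not arbitrary: SMOTE creates synthetic minority samples by interpolating between a sample and one of its nearest neighbours, and both the distance and the interpolation are undefined when a value is missing. A rough sketch of that idea with plain NumPy follows; it is not imblearn's implementation, and `interpolate` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)


def interpolate(sample, neighbour, rng):
    """Create one synthetic point on the segment between two minority samples."""
    lam = rng.uniform(0, 1)  # random position along the segment
    return sample + lam * (neighbour - sample)


# Two complete minority samples: the synthetic point is well defined
a = np.array([1.0, 2.0])
b = np.array([3.0, 6.0])
synthetic_ok = interpolate(a, b, rng)

# One sample has a missing value: the NaN propagates to the synthetic point,
# and the distance used to find neighbours is undefined as well
c = np.array([np.nan, 2.0])
synthetic_bad = interpolate(c, b, rng)
distance_bad = np.linalg.norm(c - b)

print(synthetic_ok)
print(synthetic_bad)
print(distance_bad)
```

This is why the missing values have to be handled before resampling.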

We create two pipelines: the first, `pipeline_dc_fe()`, cleans and feature engineers the data; the second, `pipeline_clf()`, is responsible for modelling
* We define the required steps here but, as you have seen in the previous lessons, it would normally be your task to work out which transformers and options to use

from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from sklearn.preprocessing import StandardScaler # Feat Scaling
from sklearn.feature_selection import SelectFromModel # Feat Selection
from sklearn.linear_model import LogisticRegression # ML algorithm for classification

def pipeline_dc_fe():

  pipeline_base = Pipeline([
       ( 'median',  MeanMedianImputer(imputation_method='median',variables=['X1']) )
  ])

  return pipeline_base

def pipeline_clf():
  pipeline_base = Pipeline([
       ("scaler",StandardScaler() ),
       ("feat_selection",SelectFromModel(LogisticRegression(random_state=0)) ),
       ("model",LogisticRegression(random_state=0) )
  ])

  return pipeline_base

pipeline_dc_fe()

We split the data into train and test sets using the usual technique
* We print the train and test shapes

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['Target'],axis=1),
                                    df['Target'],
                                    test_size = 0.2,
                                    random_state = 0
                                    )

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

We check the train set target distribution
* It is imbalanced

y_train.value_counts()

And test set distribution

y_test.value_counts()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We fit the first pipeline with the train set and transform on both sets

pipeline_data_cleaning_feat_eng = pipeline_dc_fe()
X_train = pipeline_data_cleaning_feat_eng.fit_transform(X_train)
X_test = pipeline_data_cleaning_feat_eng.transform(X_test)
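
To see what "fit on the train set, transform on both sets" means, here is a hand-rolled sketch of median imputation in plain pandas. It is a stand-in for illustration rather than what `MeanMedianImputer` runs internally; the toy `train` and `test` frames are made up.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
train = pd.DataFrame({"X1": [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({"X1": [np.nan, 10.0, 20.0]})

# "fit": learn the statistic from the train set only (NaN is skipped)
train_median = train["X1"].median()

# "transform": reuse the same learned value on both sets
train_imputed = train.fillna({"X1": train_median})
test_imputed = test.fillna({"X1": train_median})

print(train_median)
print(test_imputed)
```

The missing value in `test` is filled with the train median, not the test median, so no information from the test set leaks into the preprocessing.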

We check for missing data

X_train.isna().sum()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Now we apply SMOTE to the train set

* This section does not compare the pipeline performance with and without `SMOTE()`, since we did that in the previous section.

from imblearn.over_sampling import SMOTE
oversample = SMOTE(sampling_strategy='not majority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

We check the train set target distribution

y_train.value_counts()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We then fit the second pipeline

pipe = pipeline_clf()
pipe.fit(X_train, y_train)

And check the confusion matrix on the test set
* Once again, we want the "actual" values to be the columns and the "predicted" values to be the rows in the confusion matrix, so we swap the `y_true` and `y_pred` arguments.

confusion_matrix(y_true=pipe.predict(X_test), y_pred=y_test)
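
Swapping the two arguments is equivalent to transposing the matrix that scikit-learn returns by default. A small check on toy labels (made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only
actual    = [0, 0, 0, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0]

# scikit-learn's default layout: rows = actual, columns = predicted
standard = confusion_matrix(y_true=actual, y_pred=predicted)

# Swapping the arguments: rows = predicted, columns = actual
flipped = confusion_matrix(y_true=predicted, y_pred=actual)

print(standard)
print(flipped)
print(np.array_equal(flipped, standard.T))
```

With the flipped layout, each column sums to the number of actual observations in that class, so recall for a class is its diagonal cell divided by its column total.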

Finally, we check the classification report on the test set
* The result and interpretation are similar to the example we gave in the previous section. 
* The difference in this section is that we demonstrated an example where the dataset needs to be processed before applying `SMOTE()`

print(classification_report(y_pred=pipe.predict(X_test), y_true=y_test))

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: Let's consider the breast cancer dataset from sklearn. Each record describes a breast mass sample, together with a diagnosis indicating whether the mass is malignant or benign

from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
data = load_breast_cancer()
df_practice = pd.DataFrame(data.data, columns=data.feature_names)
df_practice['Target'] = data.target
df_practice.head()

On purpose, we will add missing data (`np.nan`) to the first 100 rows in the first column

# set the first 100 rows of the first column (mean radius) to missing
df_practice.iloc[0:100,0] = np.nan

Let's check the DataFrame

df_practice.head()

Let's check the target variable frequency

df_practice['Target'].value_counts()

We check the missing data levels
* `mean radius` has 100 missing values

df_practice.isna().sum()

We create two pipelines; the first is `pipeline_dc_fe()` which cleans and feature engineers the data. The second is `pipeline_clf()` which is responsible for modelling
* We define the modelling pipeline with feature scaling, feature selection and logistic regression algorithm

from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from sklearn.preprocessing import StandardScaler # Feat Scaling
from sklearn.feature_selection import SelectFromModel # Feat Selection
from sklearn.linear_model import LogisticRegression # ML algorithm for classification

def pipeline_dc_fe():

  pipeline_base = Pipeline([
       ( 'median',  MeanMedianImputer(imputation_method='median') )
  ])

  return pipeline_base


def pipeline_clf():
  pipeline_base = Pipeline([
       ("scaler",StandardScaler() ),
       ("feat_selection",SelectFromModel(LogisticRegression(random_state=0)) ),
       ("model",LogisticRegression(random_state=0) )
  ])

  return pipeline_base

pipeline_clf()

We split the data into train and test sets using the usual technique
* We print the train and test shapes

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_practice.drop(['Target'],axis=1),
                                    df_practice['Target'],
                                    test_size = 0.2,
                                    random_state = 0
                                    )

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

We check the train set target distribution
* It is imbalanced

y_train.value_counts()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> From here on, you should code.
* We fit the first pipeline with the train set and transform on both sets

pipeline_data_cleaning_feat_eng = pipeline_dc_fe()
# write code to fit and transform pipeline_data_cleaning_feat_eng on X_train
# write code to transform pipeline_data_cleaning_feat_eng on X_test


We check for missing data

X_train.isna().sum()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Now we apply SMOTE to the train set



from imblearn.over_sampling import SMOTE
# oversample = SMOTE(....) # write code to define your oversample object; use sampling_strategy='not majority' and random_state=0
# write code to fit and resample on X_train and y_train


print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

We check the train set target distribution

y_train.value_counts()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We then fit the second pipeline

# write code to create an object called pipe that receives pipeline_clf()
# fit pipe to the train set



Check the confusion matrix on the test set

confusion_matrix(y_true=pipe.predict(X_test), y_pred=y_test)

Good job!

---