# <a id='toc1_'></a>[Key items for this class: <span style="color:red">collinearity</span>, <span style="color:orange">scaling</span>, <span style="color:yellow">transformation</span>, <span style="color:green">encoding</span>, <span style="color:blue">get_dummies</span>, <span style="color:purple">R2_score</span>](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Data Preparation](#toc1_)    
  - [Why data preparation?](#toc1_1_)    
  - [Data Cleaning](#toc1_2_)    
  - [Data Exploration](#toc1_3_)    
    - [Review numerical continuous variables](#toc1_3_1_)    
    - [Review categorical and numerical discrete values](#toc1_3_2_)    
    - [Review correlations between variables](#toc1_3_3_)    
  - [Feature selection](#toc1_4_)    
  - [Feature engineering](#toc1_5_)    
  - [Data preprocessing](#toc1_6_)    
    - [Numerical features transformation:](#toc1_6_1_)    
    - [Numerical features scaling:](#toc1_6_2_)    
    - [Categorical features encoding](#toc1_6_3_)    
  - [Modelling](#toc1_7_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Data Preparation](#toc0_)

Or how to make your data as informative as possible before making predictions.

In [None]:
# Just another day in the life of a data analyst...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# What are the typical libraries we import?
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [None]:
donation = pd.read_csv("https://raw.githubusercontent.com/sabinagio/data-analytics/main/data/donation_data.csv")

# What do we first look at?
print(donation.shape)
donation.sample(10)

## <a id='toc1_1_'></a>[Why data preparation?](#toc0_)

In [None]:
# Let's try running a linear regression without preparing our data
from sklearn import linear_model

# Step 1 - Data
X = donation.drop('donation', axis=1)
y = donation['donation']

# Step 2 - Model
model = linear_model.LinearRegression()

# Step 3 - Fitting
# Expect a ValueError here: the data still contains strings and NaNs
result = model.fit(X, y)

## <a id='toc1_2_'></a>[Data Cleaning](#toc0_)

- Are there any nulls?
- Are the values formatted correctly?
- Which columns have almost no variance?

In [None]:
# Check nulls
donation.isna().sum()

In [None]:
# Quick review of NaN rows - basically trying to find a pattern
donation[donation.time_since_donation.isna()]

In [None]:
# Check dtypes
donation.info()

In [None]:
# Check stats
round(donation.describe(), 2)

Observations:
- `NaNs`: `time_since_donation` has ~39% null values.
- `Dtypes`: `date_insert_db` and `city_code` seem to have the wrong dtype.
- `Unique values`: `date_insert_db` doesn't have a lot of unique values even though it is numerical.
- `Unusual stats`: `salary` standard deviation is almost double its mean, so it likely has outliers on the high-end.

The standard deviation of `date_insert_db` is very low, so the column has almost no variance and carries little information for the model. The exception is when the few "different" values coincide with big changes in the target variable; in that case they should be handled separately, as we will see later with the `salary` feature.
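
One way to spot such columns is to check how much of each column is taken up by its most frequent value. The sketch below uses a made-up data frame (the column names, values, and the 0.75 threshold are assumptions for illustration, not taken from the donation data):

```python
import pandas as pd

# Toy data: column names and values are made up for illustration
toy = pd.DataFrame({
    "salary": [30_000, 45_000, 52_000, 61_000, 250_000],
    "batch_id": [1, 1, 1, 1, 2],  # almost constant
    "age": [23, 35, 41, 58, 64],
})

# Share of rows taken by the most frequent value in each column
top_share = toy.apply(lambda s: s.value_counts(normalize=True).iloc[0])

# Flag columns where a single value covers more than 75% of the rows
# (the threshold is an arbitrary choice for this sketch)
low_variance_cols = top_share[top_share > 0.75].index.tolist()
print(low_variance_cols)
```

A flagged column is a candidate for removal, not an automatic drop: it is still worth checking whether its rare values line up with changes in the target.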

In [None]:
# Remove no variance column
donation.drop(columns='date_insert_db', inplace=True)
donation.head()

## <a id='toc1_3_'></a>[Data Exploration](#toc0_)

- What is the distribution of my data?
- Do any of my features have many outliers?
- How is my target (`donation`) related to my features?

In [None]:
# Separate numerical & categorical data
don_num = donation.select_dtypes('number')
don_cat = donation.select_dtypes('object')

### <a id='toc1_3_1_'></a>[Review numerical continuous variables](#toc0_)

In [None]:
# Plot numerical variables
fig = make_subplots(rows=don_num.shape[1], cols=2)
colors = ['red', 'blue', 'green', 'purple', 'orange', 'darkblue']

# Create a loop for histogram plots
for i, col in enumerate(don_num.columns):
    fig.add_trace(go.Histogram(x=donation[col], name=col, marker=dict(color=colors[i])), row=i+1, col=1)

# Create a loop for box plots
for i, col in enumerate(don_num.columns):
    fig.add_trace(go.Box(x=donation[col], name=col, marker=dict(color=colors[i])), row=i+1, col=2)

# Adjust the height, width, and title of the layout
fig.update_layout(height=200 * don_num.shape[1], width=1000, title_text="Numerical variables distributions")
fig.show()

Observations:
- `age` & `num_calls` have the same [**uniform distribution**](https://statisticsbyjim.com/probability/uniform-distribution/)
- `donation` seems to have a normal distribution
- `salary` does indeed have a couple of people on the high-end. Whilst these are informally called outliers, they aren't wrong values (i.e. incorrectly typed values) but rather a different population, so we will treat them separately.
- `time_since_donation` is heavily left-skewed: most values are piled up at the high end with a tail towards low values, suggesting that the dataset mainly has people who donated a long time ago. Seeing this, it seems reasonable to fill in the `NaN` values in the column with the mode.
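
The direction of a skew can also be checked numerically with `Series.skew()`: a negative value means the long tail points towards small values (left skew), a positive value means it points towards large values (right skew). A minimal sketch on made-up numbers, not the donation data:

```python
import pandas as pd

# Made-up values: most donors sit at 365 days, a few donated more recently
days = pd.Series([365] * 8 + [30, 120])

# Made-up salaries: most are modest, one is very high
salaries = pd.Series([20_000, 25_000, 30_000, 32_000, 35_000, 400_000])

print(days.skew())      # negative: tail towards small values (left skew)
print(salaries.skew())  # positive: tail towards large values (right skew)
```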

`time_since_donation` - fill NaN values

In [None]:
# Check mode
donation.time_since_donation.mode()

In [None]:
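# Fill time_since_donation column with the mode (365) found above
# Assigning back avoids the chained inplace fillna, which newer pandas versions warn about
donation['time_since_donation'] = donation['time_since_donation'].fillna(365)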

In [None]:
# Review column after filling
fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Histogram(x=donation.time_since_donation, name='time_since_donation', marker=dict(color='blue')), row=1, col=1)
fig.add_trace(go.Box(x=donation.time_since_donation, name='time_since_donation', marker=dict(color='blue')), row=1, col=2)
fig.update_layout(height=400, width=1000, title_text="Time since donation distribution post-filling")
fig.show()

`salary` - select rows from the same population

In [None]:
# We can check the 20 highest salaries
donation.salary.sort_values(ascending=False).iloc[:20]

In [None]:
# Where should we set the threshold for selecting salary entries?
# For this dataset, let's say that 500K is a reasonable threshold
# .copy() makes the filtered frame independent, so later column assignments are safe
donation = donation[donation.salary <= 500000].copy()

In [None]:
# Review new top 20 (sorted, so we really see the highest remaining salaries)
donation.salary.sort_values(ascending=False).iloc[:20]

In [None]:
# Review column after removing "outliers"
fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Histogram(x=donation.salary, name='salary', marker=dict(color='green')), row=1, col=1)
fig.add_trace(go.Box(x=donation.salary, name='salary', marker=dict(color='green')), row=1, col=2)
fig.update_layout(height=400, width=1000, title_text="Salary distribution post-filtering")
fig.show()

### <a id='toc1_3_2_'></a>[Review categorical and numerical discrete values](#toc0_)

In [None]:
# Plot categorical variables
fig = make_subplots(rows=don_cat.shape[1], cols=1)
colors = ['red', 'blue', 'green', 'purple', 'orange', 'darkblue']

# Create a loop for histogram plots
for i, col in enumerate(don_cat.columns):
    fig.add_trace(go.Histogram(x=donation[col], name=col, marker=dict(color=colors[i])), row=i+1, col=1)

# Adjust the height, width, and title of the layout
fig.update_layout(height=300 * don_cat.shape[1], width=500, title_text="Categorical variables distributions")
fig.show()

Observations:
- `gender` is evenly distributed
- There is no `city_code` data for rural citizens, so we might be better off changing the values in this column to look at whether a donor is in the city or not:

In [None]:
donation['city_code'] = donation['city_code'].apply(lambda x: 'URBAN' if x != 'RURAL' else x)

In [None]:
# Review distribution
px.histogram(donation.city_code)

We can now clearly see that there are more donors in urban areas.

### <a id='toc1_3_3_'></a>[Review correlations between variables](#toc0_)

In [None]:
# Find correlation between features & target - What correlation do we use?
sns.heatmap(donation.corr(numeric_only=True), annot=True)
plt.show()

`age` and `num_calls` are very highly correlated (1!), so we need to:
1. Choose either one of them for modelling.
2. Figure out why they are so highly correlated.
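
Reading pairs off a heatmap gets harder as the number of features grows. A possible way to list highly correlated pairs programmatically is sketched below; the toy values and the 0.95 cut-off are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; num_calls is simply age minus 18 here
toy = pd.DataFrame({
    "age": [25, 40, 33, 58, 61],
    "num_calls": [7, 22, 15, 40, 43],
    "salary": [31_000, 28_000, 54_000, 39_000, 47_000],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Turn the matrix into (feature, feature) -> correlation and filter
pairs = upper.stack()
high_pairs = pairs[pairs > 0.95]  # 0.95 is an arbitrary cut-off
print(high_pairs)
```

Each pair that shows up is a candidate for dropping one of its two features.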

In [None]:
donation[['age', 'num_calls']]

In [None]:
# Review the difference
donation['diff'] = donation['age'] - donation['num_calls']
donation[['age', 'num_calls', 'diff']]

It seems that donors start receiving yearly calls as soon as they turn 18:

In [None]:
# Check mean
donation['diff'].mean() # And indeed they do, given the mean difference between age & num_calls is 17.5

Both columns are equally correlated with the target so we can remove either of them:

In [None]:
# We'll choose to keep age for now
donation.drop('num_calls', axis=1, inplace=True)
donation.head()

In [None]:
# We should also remove the extra column we created
donation.drop('diff', axis=1, inplace=True)
donation.head()

## <a id='toc1_4_'></a>[Feature selection](#toc0_)

We already removed features that:
- have no variance
- are highly correlated with other features

Now would be the time to further select features that are unlikely to contribute to our model. However, as we already have a small number of features, we can revisit this step after creating an initial model.

## <a id='toc1_5_'></a>[Feature engineering](#toc0_)

In this step we'd typically look to create new features from the current ones.
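
As a rough illustration of what that could look like, the sketch below derives two new columns from existing ones. The feature names `recent_donor` and `age_band`, the 90-day threshold, and the age bins are all hypothetical choices, not part of the original analysis:

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "age": [22, 37, 45, 68],
    "time_since_donation": [30, 365, 200, 365],
})

# Binary flag: donated within the last 90 days (hypothetical threshold)
toy["recent_donor"] = (toy["time_since_donation"] <= 90).astype(int)

# Group ages into bands; bins are right-inclusive, e.g. (17, 30]
toy["age_band"] = pd.cut(
    toy["age"], bins=[17, 30, 50, 120], labels=["18-30", "31-50", "51+"]
)
print(toy)
```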

## <a id='toc1_6_'></a>[Data preprocessing](#toc0_)

This section includes all necessary steps for running the model:
- numerical features transformation (as needed)
- numerical features scaling (as needed)
- categorical features encoding (necessary for linear regression)

### <a id='toc1_6_1_'></a>[Numerical features transformation:](#toc0_)
- This step is usually undertaken to reduce the skewness of a feature, i.e. to pull in a long tail so that the distribution becomes more symmetric. There are multiple types of transformations; common ones are the square root, logarithm, and Box-Cox transformations.

⚠️ These transformations are monotonic, so they keep the ordering of the values and do not change the underlying information of the feature ⚠️
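
A small sketch of the effect, using made-up positive values with a long tail towards the high end: both the square root and the logarithm lower the skewness, and neither changes the order of the rows.

```python
import numpy as np
import pandas as pd

# Made-up, strictly positive values with a long tail towards the high end
raw = pd.Series([18_000, 22_000, 25_000, 31_000, 40_000, 75_000, 300_000])

sqrt_t = np.sqrt(raw)
log_t = np.log(raw)  # only defined for values > 0

# Skewness of the raw values, then after each transformation
print(raw.skew(), sqrt_t.skew(), log_t.skew())

# Both transformations are monotonic, so the ranking of the rows is unchanged
same_order = raw.rank().equals(sqrt_t.rank()) and raw.rank().equals(log_t.rank())
print(same_order)
```

On this toy series the logarithm reduces the skewness more than the square root does, which is why it is the usual first choice for heavy right tails.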

In our case, the `salary` column is extremely right-skewed (long tail towards high values), so an appropriate transformation for this type of distribution is a log-transform:

In [None]:
donation['salary'] = np.log(donation['salary'])

In [None]:
# Review column after log-transform
fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Histogram(x=donation.salary, name='salary', marker=dict(color='green')), row=1, col=1)
fig.add_trace(go.Box(x=donation.salary, name='salary', marker=dict(color='green')), row=1, col=2)
fig.update_layout(height=400, width=1000, title_text="Salary distribution post-transform")
fig.show()

Should we do the same for `time_since_donation`?

In [None]:
# Let's review the time_since_donation values
donation.time_since_donation.value_counts()

Unlike `salary`, `time_since_donation` has one large spike at exactly 365 days rather than many values spread out between 360 and 365 days, so applying a log-transformation would not change the skewness of the distribution:

In [None]:
# Check what the distribution would look like
log_distrib = np.log(donation.time_since_donation)
fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Histogram(x=log_distrib, name='time_since_donation', marker=dict(color='green')), row=1, col=1)
fig.add_trace(go.Box(x=log_distrib, name='time_since_donation', marker=dict(color='green')), row=1, col=2)
fig.update_layout(height=400, width=1000, title_text="time_since_donation distribution post-transform")
fig.show()

### <a id='toc1_6_2_'></a>[Numerical features scaling:](#toc0_)
- To compare the importance of our numerical features in the LR model (through the size of their coefficients), we need to scale them, i.e. bring all features into the same range or give them the same standard deviation.

In [None]:
from sklearn.preprocessing import StandardScaler
standardizer = StandardScaler()
donation[['salary', 'age', 'time_since_donation']] = standardizer.fit_transform(donation[['salary', 'age', 'time_since_donation']])
donation.head()
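
To see what standardization does under the hood, here is a minimal sketch on made-up data. It applies the same formula `StandardScaler` uses: subtract the column mean and divide by the population standard deviation (`ddof=0`).

```python
import pandas as pd

# Toy data with made-up values on very different scales
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "salary": [20_000, 40_000, 60_000, 80_000],
})

# z = (x - mean) / std, computed column by column
scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled)
print(scaled.mean().tolist())       # approximately 0 for every column
print(scaled.std(ddof=0).tolist())  # approximately 1 for every column
```

After scaling, both toy columns contain the same values, which shows that the original units no longer influence the comparison.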

### <a id='toc1_6_3_'></a>[Categorical features encoding](#toc0_)
- Models such as linear regression do not accept non-numerical values, so we need to convert those into numerical values by encoding.

In [None]:
# Select categorical columns
don_cat = donation.select_dtypes(object)

In [None]:
# Encode categoricals in a copy df
don_cat_copy = don_cat.copy()
don_cat_copy = pd.get_dummies(don_cat_copy[['gender', 'city_code']])
don_cat_copy.head()

In [None]:
# Switch categoricals in copy df
donation_copy = donation.copy()
donation_copy = pd.concat([donation_copy, don_cat_copy], axis=1)
donation_copy.drop(['gender', 'city_code'], axis=1, inplace=True)
donation_copy.head()

⚠️ ALWAYS SET `drop_first=True` WHEN ONE-HOT ENCODING FOR LR ⚠️  

In [None]:
# Check how the features are correlated
sns.heatmap(donation_copy.corr(numeric_only=True), annot=True)
plt.show()

If all the possible values of a categorical feature (gender, city code, etc.) are present as columns of a data frame, you can always infer one of the columns from the others. This is why we drop one of the categories when applying the `get_dummies` method:

In [None]:
# Apply get_dummies on original df
# columns= replaces the listed columns with their dummies, whatever the number of categories
donation = pd.get_dummies(donation, columns=['gender', 'city_code'], drop_first=True, dtype=int)
donation.head()

In [None]:
# Check how the features are correlated
sns.heatmap(donation.corr(numeric_only=True), annot=True)
plt.show()
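
The redundancy that `drop_first=True` removes can be sketched on a tiny, made-up column. With every category kept, the dummy columns add up to 1 in each row, so any one of them is fully determined by the rest:

```python
import pandas as pd

# Toy column with made-up values
toy = pd.DataFrame({"city_code": ["URBAN", "RURAL", "URBAN", "URBAN"]})

full = pd.get_dummies(toy, columns=["city_code"], dtype=int)
reduced = pd.get_dummies(toy, columns=["city_code"], drop_first=True, dtype=int)

# Every row of the full encoding sums to 1
print(full.sum(axis=1).tolist())

# Columns kept with and without drop_first
print(full.columns.tolist())
print(reduced.columns.tolist())
```

The reduced encoding carries the same information: a 0 in the remaining column implies the dropped category.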

## <a id='toc1_7_'></a>[Modelling](#toc0_)

In [None]:
# X-y split
X = donation.drop('donation', axis=1)
y = donation['donation']

In [None]:
# Fit model
model = linear_model.LinearRegression()
result = model.fit(X, y)

Model results

In [None]:
# Check R2 score
result.score(X, y)

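This R2 score is really good! 🤩 Keep in mind that it is computed on the same data the model was fitted on, so it is an optimistic estimate of how the model would do on new donors.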
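
For reference, the R2 score reported by `score` is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on synthetic data (not the donation data) and compares the value on the rows used for fitting with the value on held-out rows; the 150/50 split and the helper name `r2` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x plus some noise
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.5, size=200)

# Fit on the first 150 rows only, hold out the last 50
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def r2(y_true, y_pred):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_train = r2(y_train, slope * x_train + intercept)
r2_test = r2(y_test, slope * x_test + intercept)
print(round(r2_train, 3), round(r2_test, 3))
```

Comparing the two numbers is a quick check for overfitting: a held-out score far below the training score suggests the model does not generalize.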