In [8]:
import pandas as pd
from pandas_profiling import ProfileReport

In [24]:
# Load the raw training data
df = pd.read_csv("../data/raw/train.csv")

# Build the exploratory profiling report and save it as HTML
profile = ProfileReport(df, title="Profiling Report", explorative=True)
profile.to_file("../docs/profiling_report.html")

# Profiling Report Interpretation

The full report can be found here: [`../docs/profiling_report.html`](../docs/profiling_report.html)

## Summary Statistics

From the summary statistics, we can see:
- The training dataset has 31 variables (30 features + 1 target) and 7003 observations. Because the number of observations far exceeds the number of features, high dimensionality is not a concern here.
- Missing values are present and need to be explored further.
- There are no duplicate rows.
- Training data size is only 4.8 MB, which shouldn't cause any memory issues.

<img src="../docs/images/summary_statistics.png" alt="summary statistics" width="300" />
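
The same summary figures can also be recomputed directly with pandas. The snippet below is a minimal sketch on a made-up toy frame (the column names `feature_a` and `feature_b` are placeholders, not columns from `train.csv`); on the real data, the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (hypothetical columns and values)
toy = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, np.nan, 4.0],
        "feature_b": ["x", "y", "y", None],
        "target": [0, 0, 1, 0],
    }
)

n_rows, n_cols = toy.shape                       # observations, variables
missing_share = toy.isna().mean()                # fraction missing per column
n_duplicate_rows = int(toy.duplicated().sum())   # fully duplicated rows
memory_mb = toy.memory_usage(deep=True).sum() / 1024**2

print(f"{n_rows} observations, {n_cols} variables")
print(missing_share)
print(f"{n_duplicate_rows} duplicate rows, {memory_mb:.4f} MB in memory")
```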

## Target Distribution

The target variable has 2 values, `0` and `1`, which points to a binary classification problem. The dataset is also imbalanced, with class `1` making up ~14% of total observations. I assume the business context here is similar to fraud detection, meaning the minority class is the one we care about, so we need resampling methods to counter the class imbalance. As we don't have a large number of observations, I chose to oversample rather than undersample. Two widely used oversampling techniques are `SMOTE` and `ADASYN`.

<img src="../docs/images/target.png" alt="target" width="600" />
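
To make the imbalance check concrete, here is a small sketch that computes class shares with pandas. The labels are invented for illustration (one positive out of seven, roughly the ~14% share seen in the report); the real check would run on the target column of `df`.

```python
import pandas as pd

# Hypothetical labels: six observations of class 0, one of class 1
y = pd.Series([0, 0, 0, 0, 0, 0, 1], name="target")

class_share = y.value_counts(normalize=True)  # share of each class
minority_class = class_share.idxmin()
minority_share = class_share.min()

print(class_share)
print(f"minority class {minority_class}: {minority_share:.1%}")
```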

## Missing Values

From the missing values chart, we can see `Vicuna` is 100% missing, so we can drop it from the training dataset. It is a good idea to set up monitoring for Vicuna's missing percentage in case it changes in the future.

<img src="../docs/images/missing_values.png" alt="missing values" width="600" />
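
One possible way to combine the drop with a simple monitoring check is sketched below. The helper `check_still_empty` and its threshold are hypothetical, and the toy frame only imitates the missingness pattern described above.

```python
import numpy as np
import pandas as pd

# Toy frame imitating the pattern: Vicuna entirely missing, Wallaby half missing
toy = pd.DataFrame(
    {
        "Vicuna": [np.nan, np.nan, np.nan, np.nan],
        "Wallaby": [1.5, np.nan, 3.0, np.nan],
    }
)


def check_still_empty(frame: pd.DataFrame, column: str, expected: float = 1.0) -> float:
    """Hypothetical monitor: fail loudly if the column starts receiving values."""
    share = float(frame[column].isna().mean())
    if share < expected:
        raise ValueError(f"{column} is only {share:.0%} missing; revisit the drop")
    return share


vicuna_share = check_still_empty(toy, "Vicuna")
cleaned = toy.drop(columns=["Vicuna"])
print(vicuna_share, list(cleaned.columns))
```

In a production setting the check would more likely log or alert than raise, but the idea is the same: the decision to drop the column is tied to an assumption that can be tested on each new batch.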

`Tiglon` and `Wallaby` are each missing in ~50% of observations, with no particular pattern.

`Tiglon` has a single observed value, `False`. Since we don't know whether a missing value represents `True`, we will fill the missing values with the string `unknown`.

<img src="../docs/images/tiglon.png" alt="tiglon" width="600" />
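
A minimal sketch of this conversion, assuming the column is read as an object column holding `False` and missing values (the toy values below are invented):

```python
import pandas as pd

# Toy column: only False is ever observed, the rest is missing
tiglon = pd.Series([False, None, False, None], name="Tiglon", dtype="object")

# Replace missing entries with "unknown", then store everything as strings
# so the column holds a single type before it is made categorical
tiglon_cat = (
    tiglon.where(tiglon.notna(), "unknown")
    .astype(str)
    .astype("category")
)

print(tiglon_cat.tolist())
print(list(tiglon_cat.cat.categories))
```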

`Wallaby` is a regular numerical feature, so we will impute its missing values with an imputer such as `KNNImputer` or `IterativeImputer`.

<img src="../docs/images/wallaby.png" alt="wallaby" width="600" />
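
As an illustration of the first option, the sketch below runs scikit-learn's `KNNImputer` on a tiny invented matrix. The choice of `n_neighbors=2` and the values are assumptions for the example, not tuned settings.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: column 0 is a complete feature, column 1 plays the role of Wallaby
X = np.array(
    [
        [1.0, 10.0],
        [2.0, np.nan],
        [3.0, 30.0],
        [4.0, 40.0],
    ]
)

# Each missing value is filled with the mean of that feature over the
# nearest rows, where distance is computed on the features both rows share
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

The imputer should be fitted on the training split only and then reused to transform validation or test data, so that no information leaks across splits.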

## Features

The report classifies the 31 variables (target included) as follows: 19 are numerical, 10 are categorical, 1 is boolean (`Tiglon`, which will be converted to categorical after filling missing values) and 1 is unsupported because it consists entirely of null values (`Vicuna`, which will be dropped).

<img src="../docs/images/variable_types.png" alt="variable types" width="300" />

### Numerical Features

Upon closer inspection, two numerical features, `Thrush` and `Turtle`, fall naturally into a few bins in their histograms. Assuming this behaviour fits the business context, converting them into categorical variables with K-bins discretization should reduce the variance contributed by these features.

<img src="../docs/images/thrush.png" alt="thrush" width="600" />
<img src="../docs/images/turtle.png" alt="turtle" width="600" />
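
A rough sketch of K-bins discretization with scikit-learn's `KBinsDiscretizer` follows. The values, the number of bins and the `kmeans` strategy are assumptions chosen to mimic a feature that clusters into three groups; the real settings should follow the histograms above.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy feature whose values cluster around three levels (invented values)
values = np.array([[0.1], [0.2], [5.0], [5.1], [9.9], [10.0]])

# strategy="kmeans" places bin edges between 1-D cluster centres,
# which suits features that already fall into natural groups
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
bins = binner.fit_transform(values).ravel()

print(bins)
print(binner.bin_edges_[0])
```

With `encode="ordinal"` each value is replaced by its bin index; `encode="onehot"` would instead produce one indicator column per bin.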

The rest of the numerical features have wildly different scales, so I decided to standardize them to avoid any potential impact from scales.
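
Standardization could look like the sketch below, which uses scikit-learn's `StandardScaler` on invented numbers with very different scales. The scaler is fitted on the training rows only and then reused for new rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (invented values)
X_train = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
X_new = np.array([[2.0, 4000.0]])

# Learn mean and standard deviation on the training data only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_new_scaled = scaler.transform(X_new)  # reuses the training statistics

print(X_train_scaled)
print(X_new_scaled)
```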

There may be outliers, e.g. in `Viper` and `Turkey`, but without the business context to determine whether these values are valid, I decided to keep them to be safe.
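
Keeping the values does not rule out flagging them for later review. The sketch below uses the common 1.5 × IQR rule, which is an assumption for illustration rather than the rule behind the report; the helper name and the toy values are made up.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR outside the interquartile range."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy values with one obviously extreme observation
viper = pd.Series([10, 11, 12, 13, 14, 200], name="Viper")
mask = iqr_outlier_mask(viper)

# The rows are only flagged, not removed
print(viper[mask])
```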

### Categorical Features

`Vulture` and `Warbler` have high cardinality, with 40 and 83 distinct categories, respectively. This can be handled with the hashing trick.

<img src="../docs/images/vulture.png" alt="vulture" width="600" />
<img src="../docs/images/warbler.png" alt="warbler" width="600" />
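
One way to apply the hashing trick is scikit-learn's `FeatureHasher`, sketched below on invented category labels. The output width `n_features=16` and the `Vulture=` token prefix are illustrative assumptions; a smaller width saves columns but increases hash collisions.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Toy category labels (invented); the real column has 40 distinct values
vulture = pd.Series(["cat_01", "cat_02", "cat_01", "cat_40"], name="Vulture")

# With input_type="string", each sample must be an iterable of string tokens.
# Prefixing the column name keeps tokens from different columns distinct.
hasher = FeatureHasher(n_features=16, input_type="string")
hashed = hasher.transform([[f"Vulture={value}"] for value in vulture])

print(hashed.shape)
print(hashed.toarray())
```

The hasher is stateless, so unseen categories at prediction time are mapped to a column without refitting.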

`Tiger`, `Toad`, `Wildfowl`, `Wolf`, and `Wolverine` have a rare-category problem: some of their categories cover only a tiny percentage of observations. Left untreated, these produce very sparse columns after one-hot encoding, which can hurt model performance. I decided to regroup any category with less than 5% representation into a single "rare" category.

Showing `Toad` as an example:
<img src="../docs/images/toad.png" alt="toad" width="600" />
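
The regrouping rule can be sketched with plain pandas as below. The helper `group_rare` is hypothetical and the category counts are invented; only the 5% threshold comes from the text above.

```python
import pandas as pd


def group_rare(series: pd.Series, min_share: float = 0.05, label: str = "rare") -> pd.Series:
    """Replace categories whose share is below min_share with a single label."""
    shares = series.value_counts(normalize=True)
    rare_levels = shares[shares < min_share].index
    return series.where(~series.isin(rare_levels), label)


# Toy column: "c" and "d" together cover 3% of the rows
toad = pd.Series(["a"] * 60 + ["b"] * 37 + ["c"] * 2 + ["d"] * 1, name="Toad")
toad_grouped = group_rare(toad)

print(toad_grouped.value_counts())
```

In a full pipeline, the set of rare levels would be learned on the training split and then applied unchanged to new data.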


## Correlation

There are high correlations between some features. I decided not to perform PCA or drop any correlated features, since 1) we don't have a high dimensionality problem and 2) the goal here is prediction rather than interpretation. As long as we steer away from algorithms that are sensitive to multicollinearity, such as Logistic Regression or Naive Bayes, the additional features won't become a problem and could add to the predictive power.

<img src="../docs/images/correlation.png" alt="correlation" width="600" />
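
To list the highly correlated pairs explicitly rather than reading them off the heatmap, something like the following sketch could be used. The synthetic columns `a`, `b`, `c` and the 0.9 cut-off are assumptions for illustration, and Pearson correlation is only one of the measures a profiling report can show.

```python
import numpy as np
import pandas as pd

# Synthetic data: b is almost a multiple of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(scale=0.01, size=200),
        "c": rng.normal(size=200),
    }
)

corr = toy.corr().abs()  # absolute Pearson correlation

# Keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]

print(high_pairs)
```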


# Conclusion

In summary, here are the preprocessing steps to perform before modelling:

## Feature Preprocessing

1. Drop `Vicuna`
2. Convert `Tiglon` from boolean to categorical feature by filling missing values with string `unknown`
3. Convert `Thrush` and `Turtle` to categorical features using K-bins discretization

### Numerical Features

4. Impute `Wallaby`'s missing values with an imputer, e.g. `KNNImputer` or `IterativeImputer`
5. Standardize all numerical features

### Categorical Features

6. Encode high-cardinality features using the hashing trick for `Vulture` and `Warbler`
7. Re-assign minority categories to `rare` for `Tiger`, `Toad`, `Wildfowl`, `Wolf`, and `Wolverine`
8. One-hot encode low-cardinality features
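
Several of these steps can be chained so that they are fitted together on the training data. The sketch below wires imputation, standardization and one-hot encoding into a scikit-learn `ColumnTransformer`; the three-column toy frame and the hyperparameters are assumptions, and the dropping, binning, hashing and rare-category steps are left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two numerical columns and one low-cardinality categorical
toy = pd.DataFrame(
    {
        "Wallaby": [1.0, np.nan, 3.0, 4.0],
        "Viper": [100.0, 200.0, 300.0, 400.0],
        "Toad": ["a", "b", "a", "rare"],
    }
)

# Numerical branch: impute first, then standardize (steps 4 and 5)
numeric = Pipeline(
    [
        ("impute", KNNImputer(n_neighbors=2)),
        ("scale", StandardScaler()),
    ]
)

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["Wallaby", "Viper"]),
        # handle_unknown="ignore" encodes unseen categories as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Toad"]),
    ]
)

X = preprocess.fit_transform(toy)
print(X.shape)
```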

## Target Preprocessing

1. Resample `target`'s minority class (`1`) with `SMOTE` or `ADASYN` to address the class imbalance
