In [8]:
import pandas as pd
from pandas_profiling import ProfileReport

In [24]:
df = pd.read_csv("../data/raw/train.csv")

profile = ProfileReport(df, title="Profiling Report", explorative=True)
profile.to_file("../docs/profiling_report.html")

# Profiling Report Interpretation

The full report can be found here: [`../docs/profiling_report.html`](../docs/profiling_report.html)

## Summary Statistics

From the summary statistics, we can see:
- The training dataset has 31 variables (30 features + 1 target) and 7003 observations. The number of observations is much greater than the number of features, so high dimensionality is not a concern.
- Missing values are present, which need to be explored further.
- There are no duplicate rows.
- Training data size is only 4.8 MB, which shouldn't cause any memory issues.

<img src="../docs/images/summary_statistics.png" alt="summary statistics" width="300" />
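
The headline figures above can be cross-checked directly in pandas. This is a minimal sketch on a small made-up frame standing in for `train.csv`; the helper name `summarize` is hypothetical and not part of the profiling library.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the overview figures: size, missing cells, duplicates, memory."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "n_missing_cells": int(df.isna().sum().sum()),
        "n_duplicate_rows": int(df.duplicated().sum()),
        "memory_mb": df.memory_usage(deep=True).sum() / 1024**2,
    }


# Invented toy data: one missing cell and one exact duplicate row.
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, 2.0],
        "b": ["x", "y", "z", "y"],
        "target": [0, 1, 0, 1],
    }
)
summary = summarize(toy)
```

Running `summarize(df)` on the real training frame should give numbers comparable to the report's overview tab.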

## Target Distribution

The target variable has 2 values, `0` and `1`, indicating a binary classification problem. The dataset is also imbalanced, with class `1` making up ~14% of total observations. Resampling may help to counter this. As we don't have a large number of observations, oversampling is preferred over undersampling, which would discard data. Common oversampling techniques include `SMOTE` and `ADASYN`.

<img src="../docs/images/target.png" alt="target" width="700" />

## Missing Values

From the missing values chart, we can see that `Vicuna` is 100% missing, so we can drop it from the training dataset. It is a good idea to set up monitoring for Vicuna's missing percentage in the future in case it changes.

<img src="../docs/images/missing_values.png" alt="missing values" width="800" />
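
A minimal sketch of how the fully missing column could be detected and dropped, and how the same missing share could be tracked later. The helper name `fully_missing_columns` is hypothetical and the values in the toy frame are invented.

```python
import numpy as np
import pandas as pd


def fully_missing_columns(df: pd.DataFrame) -> list:
    """Return the names of columns in which every value is missing."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    return missing_share[missing_share == 1.0].index.tolist()


# Toy frame reusing column names from the report; values are made up.
toy = pd.DataFrame(
    {
        "Vicuna": [np.nan] * 4,
        "Wallaby": [1.0, np.nan, 3.0, np.nan],
        "Viper": [1, 2, 3, 4],
    }
)
to_drop = fully_missing_columns(toy)
cleaned = toy.drop(columns=to_drop)
```

Logging `df.isna().mean()` for each new batch of data would be one simple way to monitor whether `Vicuna` starts being populated.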

`Tiglon` and `Wallaby` are each missing in ~50% of rows, with no apparent pattern.

The observed values of `Tiglon` are all `False`, so the only information it carries is whether the value is missing. One-hot encoding with missing treated as its own category captures this without imputing the missing values.

<img src="../docs/images/tiglon.png" alt="tiglon" width="700" />
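
One way to keep the missing indicator without imputing is sketched below with `pandas.get_dummies` and `dummy_na=True`. This is only one option (an encoder inside a scikit-learn pipeline is another), and the four toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the column: observed values are all False, the rest missing.
tiglon = pd.Series([False, np.nan, False, np.nan], name="Tiglon", dtype="object")

# dummy_na=True adds an extra indicator column for missing values,
# so no imputation is required before encoding.
encoded = pd.get_dummies(tiglon, prefix="Tiglon", dummy_na=True)
```

Each row ends up with exactly one active indicator: either the `False` column or the missing-value column.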

`Wallaby` is a numerical feature. We can impute the missing values with `KNNImputer` or `SimpleImputer`.

<img src="../docs/images/wallaby.png" alt="wallaby" width="700" />
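
The two imputers mentioned above behave quite differently, which a tiny example makes visible. The array below is made up: the second column plays the role of `Wallaby` and the first is a fully observed numeric feature. `n_neighbors=2` is an arbitrary choice for the sketch.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Column 0: fully observed feature. Column 1: stand-in for Wallaby.
X = np.array(
    [
        [1.0, 10.0],
        [2.0, np.nan],
        [3.0, 30.0],
        [4.0, np.nan],
        [5.0, 50.0],
    ]
)

# SimpleImputer fills every gap with one statistic of the observed values.
median_filled = SimpleImputer(strategy="median").fit_transform(X)

# KNNImputer fills each gap with the mean of the nearest rows that have a
# value for that feature, using distances on the features observed in both.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```

`SimpleImputer` writes the same value into every gap, while `KNNImputer` produces row-specific values at a higher computational cost. Because `KNNImputer` relies on distances, feature scale matters when it is used.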

## Features

Of the 31 variables (30 features plus the target), 19 are numerical, 10 are categorical, 1 is boolean (`Tiglon`), and 1 is unsupported because it consists entirely of null values (`Vicuna`).

<img src="../docs/images/variable_types.png" alt="variable types" width="300" />

### Numerical Features

The numerical features have wildly different scales, so we could benefit from applying a scaler, such as `StandardScaler` or `MinMaxScaler`.
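
A short sketch of what the two scalers do, using an invented array whose columns differ in scale by three orders of magnitude:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up features on very different scales.
X = np.array(
    [
        [1.0, 1000.0],
        [2.0, 2000.0],
        [3.0, 3000.0],
        [4.0, 4000.0],
    ]
)

# StandardScaler: subtract the column mean, divide by the column std.
standardized = StandardScaler().fit_transform(X)

# MinMaxScaler: map each column linearly onto the range [0, 1].
rescaled = MinMaxScaler().fit_transform(X)
```

In practice the scaler should be fitted on the training split only and then reused to transform validation and test data.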

There may also be outliers, e.g. in `Viper` and `Turkey`. Without business context to judge whether these values are invalid, I decided to keep them to be safe.

### Categorical Features

`Vulture` and `Warbler` have high cardinality, with 40 and 83 distinct categories, respectively.
<img src="../docs/images/vulture.png" alt="vulture" width="700" />
<img src="../docs/images/warbler.png" alt="warbler" width="700" />

`Tiger`, `Toad`, `Wildfowl`, `Wolf`, and `Wolverine` also have unevenly distributed categories. Leaving them untreated may result in a sparse matrix after one-hot encoding and therefore affect model performance. I will consider adding a `min_frequency` rule during the encoding stage so that rare categories are grouped together. Showing `Toad` as an example:
<img src="../docs/images/toad.png" alt="toad" width="700" />


## Correlation

There are high correlations between some features. I decided not to perform PCA or drop any correlated features, since 1) we don't have a high dimensionality problem and 2) the goal here is prediction rather than interpretation. As a result, we should steer away from algorithms that are sensitive to correlated features, such as Logistic Regression (whose coefficients become unstable under multicollinearity) or Naive Bayes (which assumes conditionally independent features).

<img src="../docs/images/correlation.png" alt="correlation" width="700" />
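
To list the strongly correlated pairs rather than read them off the heatmap, something like the following could be used. The helper name `high_correlation_pairs`, the 0.9 threshold, and the synthetic columns `x1` to `x3` are all assumptions made for this sketch.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep the strict upper triangle so each pair is reported only once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic data: x2 is almost a multiple of x1, x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "x1": x1,
        "x2": 2 * x1 + rng.normal(scale=0.1, size=200),
        "x3": rng.normal(size=200),
    }
)
pairs = high_correlation_pairs(toy)
```

The result is a Series indexed by feature pairs, which is easier to scan than the full matrix when there are 19 numerical columns.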


# Conclusion

In summary, here are the preprocessing steps to perform before modelling:

## Preprocessing Steps

1. Drop `Vicuna`

### Numerical Features

2. Impute `Wallaby`'s missing values with an imputer, e.g. `KNNImputer` or `SimpleImputer`
3. Scale all numerical features

### Categorical Features

4. One-hot encode categorical features, including `Tiglon`, with a min frequency rule

### Target Resampling

5. Add an optional resampling step for the `target` classes, using `SMOTE` or `ADASYN`, to address the class imbalance
