## Data Cleaning & Feature Engineering
- For now, most of the attention goes to data cleaning, for two reasons:
    - speed
    - running through the general process many times in order to improve the MI modeling template
- However, once the *process* is stable, feature engineering is one of the most important aspects to focus on

In [2]:
import datetime
import os

import numpy as np
import pandas as pd

import src.features.data_cleaning as data_clean
import src.features.data_exploration as data_exp

In [None]:
%load_ext autoreload
%autoreload 2

In [4]:
train_path = os.path.join("..", "data", "train_set")

In [6]:
df = pd.read_csv(train_path, sep=";", decimal=",", low_memory=False, compression="zip")

In [7]:
df.shape

(45796, 416)

### Drop variables
- without relevant information, or that perfectly explain the dependent variable (e.g., id_parent_klant, model_peildatum, toon_abo_klant_ind, ...)
- with almost no variance
    - currently only applied to frequency variables; should be applied to other types of variables as well
- that are highly correlated with other features
- with too many missing values
- without variance (only 1 unique value per variable)

Instead of using the precompiled drop_* functions, I am going to show the function that is used behind the scenes:
```python
help(df.drop)
```

In this case I use sets to make sure each variable is listed only once and to easily check for overlap between the groups of variables we want to drop.
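To illustrate the point about sets, here is a minimal sketch. The column names (`col_a` to `col_d`) are placeholders, not columns from the training set. The pairwise intersections show which columns are flagged by more than one rule, and the union gives a drop list without duplicates.

```python
# Hypothetical column names, for illustration only.
not_relevant = {"col_a", "col_b"}
low_var = {"col_b", "col_c"}
correlated = {"col_c", "col_d"}

# Columns flagged by more than one rule.
overlap = (
    (not_relevant & low_var) | (not_relevant & correlated) | (low_var & correlated)
)

# The union removes duplicates; sorting gives a reproducible order.
to_drop = sorted(not_relevant | low_var | correlated)

print(overlap)
print(to_drop)
```

Sorting the union is optional, but it keeps the order of the drop list stable between runs, since sets are unordered.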

#### Not relevant | near zero variance | highly correlated | zero variance

In [8]:
no_relevant_information = {
    "id_parent_klant",
    "code_org_eigenaar_dwh",
    "model_peildatum",
    "toon_abo_klant_ind",
    "klant_max_eind_levering_dat",
}

In [9]:
df_freq_insight = data_exp.freq_insight(df)
low_variance = set(df_freq_insight.index[df_freq_insight["abs"] < 0.05])

In [10]:
df_multi_corr = data_exp.multicollinearity(
    df, cut_off=0.9, dependent_variable="toon_churn"
)

In [11]:
high_correlated = set(df_multi_corr["drop_suggestion"])
# Keep the dependent variable; discard() does not raise if it was never suggested.
high_correlated.discard("toon_churn")
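The internals of `data_exp.multicollinearity` are not shown in this notebook. As a rough idea of how such drop suggestions can be produced, the sketch below uses a common approach: compute the absolute correlation matrix, look only at the upper triangle, and flag every column that correlates above the cut-off with an earlier column. The function name `suggest_correlated_drops` and the toy data are hypothetical, and the real helper may work differently.

```python
import numpy as np
import pandas as pd


def suggest_correlated_drops(frame, cut_off=0.9, keep=None):
    """Return columns whose absolute correlation with an earlier column exceeds cut_off."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    suggestions = {col for col in upper.columns if (upper[col] > cut_off).any()}
    # Never suggest the column we want to keep (e.g. the dependent variable).
    suggestions.discard(keep)
    return suggestions


# Toy data, made up for illustration.
toy = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
        "c": [5, 1, 4, 2, 3],
        "target": [0, 1, 0, 1, 1],
    }
)
drops = suggest_correlated_drops(toy, cut_off=0.9, keep="target")
print(drops)
```

With this approach, the first column of a correlated pair is kept and the later one is suggested for removal, so the column order influences the result.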

In [12]:
variables_to_drop = list(no_relevant_information | low_variance | high_correlated)

In [13]:
df_cleaned = df.drop(variables_to_drop, axis=1)

#### Missings & No variance
- before removing variables with missing values, we engineer a new 'days_since_ltst_sale_num' feature

In [14]:
today = datetime.date.today()

In [15]:
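# Subtract Timestamps (not Python date objects) so the result is a timedelta Series.
df_cleaned["days_since_ltst_sale_num"] = (
    pd.Timestamp(today) - pd.to_datetime(df_cleaned["ltst_sale_dat"]).dt.normalize()
).dt.days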

In [16]:
df_cleaned["days_since_ltst_sale_num"] = df_cleaned["days_since_ltst_sale_num"].fillna(
    df_cleaned["days_since_ltst_sale_num"].median()
)
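The two steps above (days since the latest sale, then median imputation) can be checked on a small toy example. The dates and the fixed reference date below are made up. Using a fixed reference date instead of `datetime.date.today()` is an assumption made here so that the result is reproducible.

```python
import pandas as pd

# Toy data; the dates and the reference date are made up for illustration.
toy = pd.DataFrame(
    {"ltst_sale_dat": ["2020-01-01", None, "2020-01-11", "2020-01-21"]}
)
reference = pd.Timestamp("2020-01-31")

# Missing dates become NaT, which turns into NaN after .dt.days.
days = (reference - pd.to_datetime(toy["ltst_sale_dat"])).dt.days

# Fill the missing values with the median of the observed values.
days_filled = days.fillna(days.median())
print(days_filled.tolist())
```

Because the feature depends on the reference date, a feature computed with `today` changes every day the notebook is rerun.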

In [17]:
df_cleaned = data_clean.drop_missings(df_cleaned, threshold=0.05)

drop_missings finished in 0.27 seconds with 45796 rows and 335 columns.
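The source of `data_clean.drop_missings` is not part of this notebook. A minimal sketch of what such a helper might do is shown below, assuming that `threshold` is the maximum allowed fraction of missing values per column. The function name `drop_missings_sketch` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_missings_sketch(frame, threshold=0.05):
    """Drop columns whose fraction of missing values exceeds threshold (assumed meaning)."""
    frac_missing = frame.isna().mean()
    keep = frac_missing[frac_missing <= threshold].index
    return frame[keep]


# Toy data, made up for illustration.
toy = pd.DataFrame(
    {
        "complete": [1, 2, 3, 4],
        "half_missing": [1.0, np.nan, np.nan, 4.0],
    }
)
result = drop_missings_sketch(toy, threshold=0.05)
print(list(result.columns))
```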


In [18]:
df_cleaned = data_clean.nzv(df_cleaned)

nzv finished in 0.76 seconds with 45796 rows and 312 columns.
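Likewise, `data_clean.nzv` is not shown here. Going by the description "only 1 unique value per variable", a plain pandas version of the zero-variance check could look like the sketch below. The function name `drop_constant_columns` is hypothetical, and the real helper may apply a different rule.

```python
import pandas as pd


def drop_constant_columns(frame):
    """Drop columns that hold a single unique value (missing values are ignored)."""
    n_unique = frame.nunique()
    return frame.loc[:, n_unique > 1]


# Toy data, made up for illustration.
toy = pd.DataFrame({"constant": [1, 1, 1], "varying": [0, 1, 0]})
result = drop_constant_columns(toy)
print(list(result.columns))
```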


### Feature engineering
As mentioned in the introduction, we will skip this section for now, but it is a **very** important step to take. Focus for later:
- check for and handle outliers (freq & numeric)
- insight: we have a lot of indicators, possibly without much variance (nzv for indicators?)
- we will leave the (near zero variance) indicators for now, since combinations of them could be useful -> how do we automatically incorporate interactions? standard lightgbm?

Notes:
- klant_jr_sinds_frst_start_rec & klant_jr_sinds_ltste_start_rec should be converted to float when loaded
- klant_min_begindatum_dat is redundant
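For the first note, one way to convert the two `rec` columns after loading is `pd.to_numeric`. The sketch below uses a toy frame with made-up values that mimics string-typed year columns. Whether the real columns contain only clean numeric strings is an assumption, which is why `errors="coerce"` is used.

```python
import pandas as pd

# Toy frame mimicking the string-typed year columns; values are made up.
toy = pd.DataFrame(
    {
        "klant_jr_sinds_frst_start_rec": ["6.17", "18.18"],
        "klant_jr_sinds_ltste_start_rec": ["6.17", "0.84"],
    }
)
rec_cols = ["klant_jr_sinds_frst_start_rec", "klant_jr_sinds_ltste_start_rec"]

# errors="coerce" turns unparseable strings into NaN instead of raising.
toy[rec_cols] = toy[rec_cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
```

If coercion is used on the real data, it is worth counting the resulting NaN values afterwards, since they indicate entries that could not be parsed.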

In [19]:
data_exp.describe_df(df_cleaned, dependent_variable="toon_churn")

This dataframe has 45796 rows and 312 columns.

The dependent variable consists of 50.0% of ones.

The variables have the following data types:
int64      308
object       3
float64      1
dtype: int64

The postfixes are distributed as follows:
Counter({'ind': 291, 'freq': 14, 'num': 3, 'rec': 2, 'churn': 1, 'dat': 1})

The following variables have missing values:
Empty DataFrame
Columns: [column, perc_missings]
Index: []




In [20]:
df_cleaned.select_dtypes(include="object").head()

| | klant_min_begindatum_dat | klant_jr_sinds_frst_start_rec | klant_jr_sinds_ltste_start_rec |
|---|---|---|---|
| 0 | 2013-10-31 00:00:00 | 6.17 | 6.17 |
| 1 | 2001-10-31 00:00:00 | 18.18 | 0.84 |
| 2 | 2004-04-01 00:00:00 | 15.76 | 15.76 |
| 3 | 1992-12-31 00:00:00 | 27.02 | 0.9 |
| 4 | 2016-06-14 00:00:00 | 3.55 | 0.94 |


### Near zero variance for indicators (first attempt)

In [None]:
ind_list = data_exp.postfix_to_column(
    df_cleaned, postfix="ind", dependent_variable=None
)

In [None]:
df_cleaned[ind_list].head()

In [None]:
ind_dict = {}

In [None]:
# value_counts sorts by frequency, so the first entry is the share of the
# most prevalent value, regardless of whether that value is 0 or 1.
for col in ind_list:
    ind_dict[col] = df_cleaned[col].value_counts(normalize=True).iloc[0]

In [None]:
ind_dict

In [None]:
near_zero_var_indicators = {
    key: value for key, value in ind_dict.items() if value > 0.95
}

In [None]:
len(near_zero_var_indicators)
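As an alternative to the loop, the same shares can be computed in one vectorized step. This is a sketch under the assumption that every indicator is strictly 0/1 with no missing values. In that case the column mean is the share of ones, and the share of the dominant value is the larger of that mean and its complement. The toy data is made up.

```python
import numpy as np
import pandas as pd

# Toy indicators, made up for illustration.
toy = pd.DataFrame(
    {
        "rare_ind": [0] * 99 + [1],   # 99% zeros
        "balanced_ind": [0, 1] * 50,  # 50% zeros
    }
)

# For a 0/1 column, the mean equals the share of ones.
share_of_ones = toy.mean()
dominant_share = np.maximum(share_of_ones, 1 - share_of_ones)

near_zero_var = dominant_share[dominant_share > 0.95].index.tolist()
print(near_zero_var)
```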

#### The following section is very memory intensive, so we skip it for now, but it is something to think about

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
X = df_cleaned[ind_list]

In [None]:
poly_transformation = PolynomialFeatures(2, interaction_only=True, include_bias=False)

In [None]:
X_t = poly_transformation.fit_transform(X)

In [None]:
type(X_t)
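A back-of-the-envelope estimate shows why this step is heavy. The sketch assumes that `ind_list` holds the 291 `ind` columns reported by `describe_df`, and that the transformed matrix is a dense float64 array. With `interaction_only=True`, degree 2 and no bias column, the output keeps the original columns and adds one column per unordered pair.

```python
from math import comb

n_rows = 45_796      # rows in the training set
n_indicators = 291   # 'ind' columns reported by describe_df (assumed to equal len(ind_list))

# Original columns plus one interaction column per unordered pair.
n_output_columns = n_indicators + comb(n_indicators, 2)

bytes_per_value = 8  # float64 (assumption about the output dtype)
dense_gb = n_rows * n_output_columns * bytes_per_value / 1e9

print(n_output_columns)
print(round(dense_gb, 1))
```

Under these assumptions the dense result has 42,486 columns and needs roughly 15.6 GB. Since the indicators are mostly zeros, a sparse representation or a pre-selection of indicators may be worth testing before running the transformation on the full set.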

### Export set
- We will come back to feature engineering, but for now we want to keep moving forward
- Only the object columns still need handling in the next section: drop klant_min_begindatum_dat and convert klant_jr_sinds_frst_start_rec and klant_jr_sinds_ltste_start_rec to float

In [21]:
train_path_cleaned = os.path.join("..", "data", "train_set_cleaned")

In [22]:
df_cleaned.to_csv(
    train_path_cleaned, sep=";", encoding="utf-8", index=False, compression="zip"
)
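The export can be sanity-checked with a small round trip. The sketch below uses a toy frame and a temporary directory, both made up for illustration. One detail to note: the raw file was read with `decimal=","`, but `to_csv` is called here without a `decimal` argument, so the cleaned file should be read back with the default decimal separator.

```python
import os
import tempfile

import pandas as pd

# Toy frame, made up for illustration.
toy = pd.DataFrame({"a_ind": [0, 1], "b_num": [1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_set_cleaned")
    toy.to_csv(path, sep=";", encoding="utf-8", index=False, compression="zip")
    # No decimal="," here: to_csv wrote floats with "." as the decimal separator.
    restored = pd.read_csv(path, sep=";", compression="zip")

print(restored.equals(toy))
```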