# Dataset Preprocessing

In [4]:
%%capture

%run C:\Users\PC\Downloads\project\project\EDA.ipynb

In [1]:
data = df.copy()

NameError: name 'df' is not defined

## Handle missing values

### 1. Drop features

From our previous inspection:
- ``user_id`` -> Meaningless
- ``ZONE1`` -> 92% missing
- ``ZONE2`` -> ~94% missing
- ``MRG`` -> Only has a single value (`NO`) for all rows, so it does not contain any information.

When a feature has more than 90% missing values, it contains very little information. Moreover, imputing such a large proportion would mostly introduce noise and bias, making the feature unreliable. Lastly, keeping it would only increase model complexity.
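The drop candidates can also be derived programmatically. The following is a minimal sketch on a made-up toy frame (the values and the 0.90 threshold are assumptions for illustration, not taken from the real dataset): it flags columns whose missing ratio exceeds the threshold and columns with at most one distinct value.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
toy = pd.DataFrame({
    "ZONE1": [np.nan] * 19 + [1.0],                          # 95% missing
    "MONTANT": [float(i) for i in range(12)] + [np.nan] * 8,  # 40% missing
    "MRG": ["NO"] * 20,                                      # single constant value
})

# Fraction of missing values per column.
missing_ratio = toy.isna().mean()

# Columns above the (assumed) 90% missingness threshold.
mostly_missing = missing_ratio[missing_ratio > 0.90].index.tolist()

# Columns with at most one distinct non-null value carry no information.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=True) <= 1]

to_drop = sorted(set(mostly_missing) | set(constant_cols))
print(missing_ratio)
print("Columns to drop:", to_drop)
```

Identifier-like columns such as `user_id` are not caught by either rule and still need to be listed by hand.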

In [58]:
data = data.drop(columns=['ZONE1', 'ZONE2', 'MRG', 'user_id'])

### 2. Impute numeric features

Numeric features with moderate missingness (30% ~ 49%):
- `MONTANT`, `FREQUENCE_RECH`, `REVENUE`, `ARPU_SEGMENT`, `FREQUENCE`, `DATA_VOLUME`, `ON_NET`, `ORANGE`, `TIGO`, `FREQ_TOP_PACK`

The strategy used is **median** imputation rather than mean imputation:
- The median is robust to outliers.
- It keeps the imputed values within the typical range of each feature, so the distribution stays reasonable.
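To illustrate why the median is preferred here, the sketch below uses a small made-up series with one extreme value. The numbers are purely illustrative and not drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values: four typical observations, one outlier, two missing entries.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0, np.nan, np.nan])

mean_val = s.mean()      # pulled far upwards by the single outlier
median_val = s.median()  # stays near the typical values

# Fill the gaps with the median, assigning the result to a new series.
filled = s.fillna(median_val)

print("mean:", mean_val, "median:", median_val)
print(filled.tolist())
```

Filling with the mean would place every imputed row far above the bulk of the data, while the median keeps them among the typical values.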

In [59]:
numeric_features = ['MONTANT', 'FREQUENCE_RECH', 'REVENUE', 'ARPU_SEGMENT', 
                    'FREQUENCE', 'DATA_VOLUME', 'ON_NET', 'ORANGE', 'TIGO', 'FREQ_TOP_PACK']

for col in numeric_features:
    median_val = data[col].median()
    # Assign back instead of using chained inplace fillna, which may not update `data` in recent pandas
    data[col] = data[col].fillna(median_val)

### 3. Impute categorical features

Categorical features with missing values:
- `REGION` -> ~39% missing 
- `TOP_PACK` -> ~42% missing

The strategy used is imputing with an **Unknown** category:
- This avoids dropping rows.
- It allows the model to treat missingness as a distinct category, which can be informative.

In [60]:
categorical_features = ['REGION', 'TOP_PACK']

for col in categorical_features:
    # Assign back instead of using chained inplace fillna
    data[col] = data[col].fillna('Unknown')

## Encoding categorical variables

### 1. Identify categorical variables

After dropping ``MRG`` and handling the missing data, the remaining categorical features are `REGION`, `TOP_PACK`, and `TENURE`.


### 2. Encoding strategy

- `REGION` -> One-hot encoding. It creates a separate binary column for each region allowing the model to treat each region independently.
- `TOP_PACK` -> Frequency encoding. It has many unique values, and one-hot encoding would create too many columns, making the model complex and sparse. Frequency encoding, on the other hand, preserves information about popularity of each pack while keeping the feature numeric.
- `TENURE` -> Ordinal numeric encoding. Tenure represents the range of duration in the network, so it has a natural order (shorter tenure -> higher likelihood of churn). Mapping to numeric values preserves this order and makes it usable in any model later on.
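As a quick illustration of what each strategy produces, here is a minimal sketch on a four-row toy frame. The rows and the pack names `A` and `B` are made up, and the tenure mapping covers only the labels used in the toy data.

```python
import pandas as pd

# Toy rows (made up) with the three categorical columns.
toy = pd.DataFrame({
    "REGION": ["THIES", "LOUGA", "THIES", "Unknown"],
    "TOP_PACK": ["A", "A", "B", "A"],
    "TENURE": ["K > 24 month", "D 3-6 month", "K > 24 month", "E 6-9 month"],
})

# One-hot encoding: one binary column per region.
encoded = pd.get_dummies(toy, columns=["REGION"], prefix="REGION")

# Frequency encoding: replace each pack by how often it occurs.
pack_counts = encoded["TOP_PACK"].value_counts()
encoded["TOP_PACK_FE"] = encoded["TOP_PACK"].map(pack_counts)

# Ordinal encoding: map each tenure band to the lower bound of its range in months.
tenure_order = {"D 3-6 month": 3, "E 6-9 month": 6, "K > 24 month": 24}
encoded["TENURE_NUM"] = encoded["TENURE"].map(tenure_order)

encoded = encoded.drop(columns=["TOP_PACK", "TENURE"])
print(encoded)
```

Note that frequency encoding assigns the same number to packs that happen to be equally common, so it trades some resolution for a compact representation.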


In [61]:
# 1. REGION → One-hot encoding
data = pd.get_dummies(data, columns=['REGION'], prefix='REGION')

print(f"Number of REGION columns after encoding: {len([c for c in data.columns if 'REGION_' in c])}\n")

Number of REGION columns after encoding: 15



In [62]:
# 2. TOP_PACK → Frequency encoding
top_pack_counts = data['TOP_PACK'].value_counts()
data['TOP_PACK_FE'] = data['TOP_PACK'].map(top_pack_counts)
data = data.drop(columns=['TOP_PACK'])
print(f"Example encoded values for TOP_PACK_FE:\n{data['TOP_PACK_FE'].head(10)}\n")

Example encoded values for TOP_PACK_FE:
0    152295
1    902594
2     18454
3     14629
4     67512
5     64412
6    902594
7    317802
8    902594
9     22332
Name: TOP_PACK_FE, dtype: int64



In [None]:
# 3. TENURE → Ordinal numeric encoding
tenure_map = {
    'D 3-6 month': 3,
    'E 6-9 month': 6,
    'F 9-12 month': 9,
    'G 12-15 month': 12,
    'H 15-18 month': 15,
    'I 18-21 month': 18,
    'J 21-24 month': 21,
    'K > 24 month': 24,
}
data['TENURE_NUM'] = data['TENURE'].map(tenure_map)
data = data.drop(columns=['TENURE'])

print(f"Example encoded values for TENURE_NUM:\n{data['TENURE_NUM'].head(10)}\n")

Example encoded values for TENURE_NUM:
0    8
1    6
2    8
3    8
4    8
5    8
6    8
7    8
8    8
9    8
Name: TENURE_NUM, dtype: int64



In [64]:
data.head()

Unnamed: 0,MONTANT,FREQUENCE_RECH,REVENUE,ARPU_SEGMENT,FREQUENCE,DATA_VOLUME,ON_NET,ORANGE,TIGO,REGULARITY,...,REGION_LOUGA,REGION_MATAM,REGION_SAINT-LOUIS,REGION_SEDHIOU,REGION_TAMBACOUNDA,REGION_THIES,REGION_Unknown,REGION_ZIGUINCHOR,TOP_PACK_FE,TENURE_NUM
0,4250.0,15.0,4251.0,1417.0,17.0,4.0,388.0,46.0,1.0,54,...,False,False,False,False,False,False,False,False,152295,8
1,3000.0,7.0,3000.0,1000.0,9.0,257.0,27.0,29.0,6.0,4,...,False,False,False,False,False,False,True,False,902594,6
2,3600.0,2.0,1020.0,340.0,2.0,257.0,90.0,46.0,7.0,17,...,False,False,False,False,False,False,True,False,18454,8
3,13500.0,15.0,13502.0,4501.0,18.0,43804.0,41.0,102.0,2.0,62,...,False,False,False,False,False,False,False,False,14629,8
4,1000.0,1.0,985.0,328.0,1.0,257.0,39.0,24.0,6.0,11,...,False,False,False,False,False,False,False,False,67512,8


## Feature Engineering

1. ``REVENUE / MONTANT``: measures efficiency of revenue relative to top-ups. High ratio may indicate high-value customers.
2. ``TENURE / FREQUENCE_RECH``: relates how long the customer has been on the network to how often they recharge. A high value may indicate inactivity relative to tenure.
3. ``TENURE / REGULARITY``: shows churn risk relative to engagement. Customers with low regularity relative to tenure may be more likely to churn.
4. ``DATA_VOLUME / REGULARITY``: average data usage per active day. Captures engagement intensity.
5. ``ON_NET / REGULARITY``: measures on-network calling frequency per active period. High value may indicate loyal users.
6. ``REVENUE - MONTANT``: difference between revenue and top-up. Large positive or negative deviations could indicate unusual behaviour.
7. Log-transformed numeric features: reduces skewness, helps models detect patterns across scales.
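The effect of the log transform in item 7 can be checked on a small example. The sketch below uses made-up, heavily right-skewed values and compares the sample skewness before and after `np.log1p`. `log1p` computes `log(1 + x)`, so zeros map to zero instead of negative infinity.

```python
import numpy as np
import pandas as pd

# Made-up, heavily right-skewed values (for example, usage volumes).
raw = pd.Series([0.0, 1.0, 10.0, 100.0, 1000.0, 100000.0])

# log1p(x) = log(1 + x): defined at zero and compresses large values.
logged = np.log1p(raw)

print("skew before:", raw.skew())
print("skew after: ", logged.skew())
```

The same reasoning motivates the `+ 1` in the ratio denominators: it keeps the features finite when the denominator is zero, at the cost of a small bias for small denominators.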

In [65]:
import numpy as np

In [None]:
data['REV_DIV_MONTANT'] = data['REVENUE'] / (data['MONTANT'] + 1)  # +1 to avoid division by zero
data['TENURE_DIV_FREQ_RECH'] = data['TENURE_NUM'] / (data['FREQUENCE_RECH'] + 1)
data['TENURE_DIV_REG'] = data['TENURE_NUM'] / (data['REGULARITY'] + 1)
data['DATA_DIV_REG'] = data['DATA_VOLUME'] / (data['REGULARITY'] + 1)
data['ON_NET_DIV_REG'] = data['ON_NET'] / (data['REGULARITY'] + 1)
data['REV_MINUS_MONTANT'] = data['REVENUE'] - data['MONTANT']

numeric_cols = ['MONTANT','REVENUE','FREQUENCE','DATA_VOLUME','FREQUENCE_RECH','REGULARITY']
for col in numeric_cols:
    data[f'{col}_log'] = np.log1p(data[col])

In [None]:
## draw the heatmap again and see the correlation, then decide finally on which features to drop/keep
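One way to turn the correlation check into a concrete list of candidates is to scan the upper triangle of the absolute correlation matrix for pairs above a threshold. This is a minimal sketch on synthetic data: the 0.95 threshold and the toy columns are assumptions, and on the real data `toy` would be replaced by the numeric columns of `data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: REVENUE is built to nearly duplicate MONTANT.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "MONTANT": base,
    "REVENUE": base + rng.normal(scale=0.01, size=200),
    "REGULARITY": rng.normal(size=200),
})

# Absolute Pearson correlations between all numeric columns.
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is reported once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

threshold = 0.95  # assumed cut-off for "highly correlated"
high_pairs = [
    (row, col, upper.loc[row, col])
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(high_pairs)
```

The matrix `corr` is also what a heatmap would display. The pair list just makes the drop/keep decision easier to review.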

In [67]:
# save the preprocessed data before scaling
data_unscaled = data.copy()
data_unscaled.to_csv('data/data_preprocessed_unscaled.csv', index=False)

## SMOTE



## Data Splitting

Split 1: Train (70%) + Temp (30%) (Test + Validation)

Split 2: Validation (15%) + Test (15%) (from Temp)

In [68]:
from sklearn.model_selection import train_test_split

In [None]:
X = data.drop(columns=['CHURN'])
y = data['CHURN']

# Split 1: 70% train, 30% temp (stratified on the target)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)
# Split 2: temp divided evenly into validation (15%) and test (15%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp)

print("Train shape:", X_train.shape, y_train.shape)
print("Validation shape:", X_val.shape, y_val.shape)
print("Test shape:", X_test.shape, y_test.shape)

Train shape: (1507833, 40) (1507833,)
Validation shape: (323107, 40) (323107,)
Test shape: (323108, 40) (323108,)
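A quick sanity check for a 70/15/15 stratified split is to compare both the sizes and the positive-class rate across the three parts. The sketch below does this on synthetic data with an assumed 20% positive rate. The variable names are illustrative and separate from the notebook's own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (20% positives, assumed).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 200 + [0] * 800)

# 70% train, 30% temp.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)
# Temp divided evenly into validation and test (15% each of the total).
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp
)

for name, part in [("train", y_tr), ("validation", y_va), ("test", y_te)]:
    print(f"{name}: n={len(part)}, positive rate={part.mean():.3f}")
```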


## Scaling / Transformation

In [70]:
from sklearn.preprocessing import StandardScaler

In [None]:
num_cols = X_train.select_dtypes(include=['float64', 'int64']).columns

scaler = StandardScaler()

X_train_scaled = X_train.copy()
X_val_scaled = X_val.copy()
X_test_scaled = X_test.copy()

X_train_scaled[num_cols] = scaler.fit_transform(X_train[num_cols])

X_val_scaled[num_cols] = scaler.transform(X_val[num_cols])
X_test_scaled[num_cols] = scaler.transform(X_test[num_cols])
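The scaler is fitted on the training split only, so validation and test rows are transformed with the training mean and standard deviation. The following sketch shows the consequence on a tiny made-up frame: the scaled training column has mean zero, the held-out row does not need to, and boolean columns are left untouched by the dtype selection.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up splits with one numeric and one boolean (one-hot style) column.
train = pd.DataFrame({"MONTANT": [100.0, 200.0, 300.0], "flag": [True, False, True]})
val = pd.DataFrame({"MONTANT": [400.0], "flag": [False]})

# Same selection rule as above: only float64/int64 columns are scaled.
num_cols = train.select_dtypes(include=["float64", "int64"]).columns

scaler = StandardScaler()
train_scaled = train.copy()
val_scaled = val.copy()

# Statistics come from the training split only.
train_scaled[num_cols] = scaler.fit_transform(train[num_cols])
# The held-out split reuses those statistics.
val_scaled[num_cols] = scaler.transform(val[num_cols])

print(train_scaled)
print(val_scaled)
```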


# Model Experimentation

Models that are suitable for churn prediction (binary classification):
1. **Logistic Regression** (baseline linear model, sensitive to scaling).
2. **Gradient Boosting models** such as XGBoost and LightGBM (powerful tree-based models that handle non-linearities and missing values well).
3. **Random Forest** (robust ensemble, less sensitive to scaling).

## Cross-Validation

Before evaluating, use the following:
1. K-fold cross-validation on the training set to estimate the performance.
2. Validation set for hyperparameter tuning. --> EXTRA
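A minimal sketch of step 1, assuming stratified 5-fold cross-validation with ROC AUC as the score. The data is synthetic (`make_classification`), and the fold count and scoring choice are assumptions rather than settings fixed by this notebook. Wrapping the scaler in a pipeline means it is refitted inside each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the training set.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Scaling lives inside the pipeline, so each fold is scaled on its own training part.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="roc_auc")

print("ROC AUC per fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```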

## Fit models

* Fit the models on the training data (or training folds when using CV).
* Evaluate on the validation set for tuning.

## Metrics

* Accuracy
* Precision
* Recall
* F1-score
* ROC_AUC
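All five metrics are available in `sklearn.metrics`. The sketch below computes them on ten made-up labels and predicted probabilities, using an assumed decision threshold of 0.5. ROC AUC is computed from the probabilities, while the other four use the thresholded predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up ground truth and predicted churn probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.2, 0.6, 0.8, 0.7, 0.3, 0.2, 0.9, 0.1]

# Hard predictions at an assumed 0.5 threshold.
y_pred = [int(p >= 0.5) for p in y_prob]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),  # uses probabilities, not labels
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With an imbalanced target, accuracy alone can look high for a model that rarely predicts churn, which is why precision, recall, F1, and ROC AUC are reported alongside it.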

If time permits, we do one of the following:

a. Pick the best-performing model, apply **feature selection/dimensionality reduction**, and train it again.

b. Apply **feature selection/dimensionality reduction** and retrain all models.

+ If performance is still low, we consider handling the class imbalance with SMOTE, undersampling, or class weights.
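Of the three options, class weights are the cheapest to try because they do not change the data. As a hedged sketch on a made-up 80/20 target: scikit-learn's `balanced` heuristic weights each class by `n_samples / (n_classes * class_count)`, and the negative-to-positive ratio is the value commonly suggested for XGBoost's `scale_pos_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced target: 80 non-churners, 20 churners.
y_toy = np.array([0] * 80 + [1] * 20)

# "balanced" weights: n_samples / (n_classes * count of each class).
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

# Ratio of negatives to positives, a common starting value for scale_pos_weight.
scale_pos_weight = (y_toy == 0).sum() / (y_toy == 1).sum()

print("class weights:", class_weight)
print("scale_pos_weight:", scale_pos_weight)
```

If resampling (SMOTE or undersampling) is used instead, it should be applied to the training split only, so that validation and test data keep the original class ratio.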

Improvements and considerations for each model:
1. Logistic Regression: --> Scaling is important
    * **Hyperparameter tuning**: GridSearchCV. Key hyperparameters are: `C` (inverse of regularization strength), `penalty` (`l1`, `l2`, `elasticnet`), and `solver` (depending on the penalty).
    * **Feature selection/dimensionality reduction**: It is very compatible with L1 regularization (lasso) for automatic feature selection. It also works well with PCA to reduce correlated features.
    * **Class imbalance**: Use `class_weight='balanced'` to automatically adjust weights. Alternatively, apply SMOTE to the training data before fitting the logistic regression.

2. Gradient Boosting (XGBoost, LightGBM): --> Scaling not important
    * **Hyperparameter tuning**: Effective. GridSearchCV. Key hyperparameters are: `n_estimators`, `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`, `min_child_weight`. NOTE: careful tuning can improve performance substantially.
    * **Feature selection/dimensionality reduction**: Less critical. Feature importances can be computed after training for insight.
    * **Class imbalance**: For XGBoost use `scale_pos_weight`, and for LightGBM use `is_unbalance=True` or `class_weight='balanced'`. SMOTE works, but gradient boosting usually handles the imbalance well with class weights.

3. Random Forest: --> Scaling not important
    * **Hyperparameter tuning**: GridSearchCV. Key hyperparameters are: `n_estimators`, `max_depth`, `max_features`, `min_samples_split`, `min_samples_leaf`.
    * **Feature selection/dimensionality reduction**: Trees are robust to irrelevant features, but feature importances can be used to drop weak ones.
    * **Class imbalance**: Use `class_weight='balanced'`. NOTE: SMOTE is optional but not important.
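As an example of the tuning setup described for logistic regression, here is a minimal sketch with `GridSearchCV` over `C` and `penalty`. The data is synthetic, and the grid values, the `liblinear` solver (which supports both `l1` and `l2`), the 3-fold CV, and the ROC AUC scoring are all assumptions chosen to keep the example small. The same pattern applies to the tree-based models with their own parameter grids.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the training set.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Scaling inside the pipeline so it is refitted on each training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", class_weight="balanced", max_iter=1000)),
])

# Small illustrative grid; parameter names are prefixed with the pipeline step name.
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__penalty": ["l1", "l2"],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_toy, y_toy)

print("best params:", search.best_params_)
print("best CV ROC AUC:", search.best_score_)
```

With the `l1` penalty, coefficients of uninformative features can shrink to exactly zero, which is the automatic feature selection mentioned above.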