# Modelling customer churn
In this notebook we develop a model for predicting customer churn, building on the insights gained in the analyze.ipynb notebook. It comprises the following sections:
1. Feature Engineering
2. Training model
3. Hyperparameter tuning
4. Error analysis
5. Summing up the models

## Feature Engineering
In this section we first remove the features that had little effect on churn prediction, then engineer new features based on the insights gained in the earlier analysis. It consists of the following steps:
1. Deleting features
2. Handling NaNs
3. Converting categorical features into labels
4. Making custom features
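
The four steps above can be sketched on a toy frame. This is only an illustration, not the notebook's actual pipeline: the `customer_id` column, the invented values, the choice of mode/median imputation, and the `approx_total_spend` feature are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy data: column names mirror the notebook, the values are invented.
df = pd.DataFrame({
    "customer_id": ["a1", "b2", "c3", "d4"],  # hypothetical identifier column
    "contract": ["Month-to-month", "Two year", None, "One year"],
    "tenure_months": [2, 40, 12, np.nan],
    "monthly_charges": [70.0, 25.5, 55.0, 80.0],
})

# 1. Delete features that carry no predictive signal.
df = df.drop(columns=["customer_id"])

# 2. Handle NaNs: mode for the categorical column, median for the numeric one.
df["contract"] = df["contract"].fillna(df["contract"].mode()[0])
df["tenure_months"] = df["tenure_months"].fillna(df["tenure_months"].median())

# 3. Convert the categorical feature into integer labels.
df["contract"] = df["contract"].astype("category").cat.codes

# 4. Make a custom feature (hypothetical): approximate total spend to date.
df["approx_total_spend"] = df["tenure_months"] * df["monthly_charges"]
```

In a real split, the fill values and the category-to-label mapping would be learned on the training set only and then reused for validation and test data.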

In [1]:
import sys
import os
from typing import List, Dict

import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from scipy.stats import chi2_contingency

current_dir = os.getcwd()
src_dir = os.path.join(current_dir, '../src')
sys.path.append(src_dir)

from utils import *
from config import *

In [3]:
df_path, date = get_latest_path_by_date(BASE_PATH, PREPROCESS_PATH)
train = read_dataframe_csv(df_path, date, "train.csv")
val = read_dataframe_csv(df_path, date, "val.csv")
test = read_dataframe_csv(df_path, date, "test.csv")

In [6]:
transformer = DataTransformer(cat_features, num_features)

In [8]:
transformer.fit(train)

ValueError: Expected 2D array, got 1D array instead:
array=[18.  7. 52. ...  1. 18. 42.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
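
The error above is what scikit-learn estimators raise when they receive a 1D array where a 2D `(n_samples, n_features)` array is expected. Which step inside `DataTransformer` triggers it is not visible here, so the sketch below only reproduces the general failure with a `SimpleImputer` (an assumption about the kind of estimator involved) and shows the reshape the message suggests.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A single feature stored as a 1D array, shape (4,).
tenure = np.array([18.0, 7.0, np.nan, 52.0])

imputer = SimpleImputer(strategy="mean")

# Passing the 1D array reproduces the "Expected 2D array" ValueError.
try:
    imputer.fit(tenure)
    raised = False
except ValueError:
    raised = True

# Reshape to one column, shape (4, 1), as the error message recommends.
tenure_2d = tenure.reshape(-1, 1)
filled = imputer.fit_transform(tenure_2d)
```

With a pandas DataFrame, selecting `df[["col"]]` (double brackets) instead of `df["col"]` yields the same 2D shape without an explicit reshape.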

In [85]:
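def fit_label_encoder(df: pd.DataFrame, columns: List[str]):
    """Fit one LabelEncoder and one SimpleImputer per column; encodes df in place."""
    encoders = {}
    for cf in columns:
        le = LabelEncoder()
        df[cf] = le.fit_transform(df[cf])
        # SimpleImputer expects 2D input, so pass a one-column frame (df[[cf]]).
        imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
        imputer = imputer.fit(df[[cf]])
        encoders[cf] = (le, imputer)
    return encoders, df

def transform_data_encoder(df: pd.DataFrame, columns: List[str], encoders: Dict) -> pd.DataFrame:
    """Apply the fitted (encoder, imputer) pairs to df in place."""
    for cf in columns:
        df[cf] = encoders[cf][0].transform(df[cf])
        # Flatten the imputer's 2D output back into a single column.
        df[cf] = encoders[cf][1].transform(df[[cf]]).ravel()
    return df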
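
Fitting the encoders on the training data and reusing them elsewhere has one known pitfall: `LabelEncoder.transform` raises a `ValueError` for a category it did not see during `fit`. The sketch below demonstrates this on invented values and shows one possible workaround, mapping unseen categories to `-1`. The workaround is an assumption for illustration, not something this notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_contract = pd.Series(["Month-to-month", "One year", "Month-to-month"])
val_contract = pd.Series(["One year", "Two year"])  # "Two year" is unseen

# Fit on the training column only; classes are stored in sorted order.
le = LabelEncoder().fit(train_contract)
train_codes = le.transform(train_contract)

# Transforming a column with an unseen category fails.
try:
    le.transform(val_contract)
    unseen_error = None
except ValueError as exc:
    unseen_error = str(exc)

# Hypothetical workaround: build the mapping by hand, send unseen values to -1.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
val_codes = val_contract.map(mapping).fillna(-1).astype(int)
```

Whether `-1` is an acceptable code depends on the downstream model; a linear model will treat it as an ordinary number.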

In [90]:
encoder, X_train = fit_label_encoder(X_train, cat_features)
X_train.fillna(0, inplace=True)
X_val.fillna(0, inplace=True)
X_val = transform_data_encoder(X_val, cat_features, encoder)
X_test = transform_data_encoder(X_test, cat_features, encoder)

In [91]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
clf.score(X_val, y_val)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8914123491838183
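
The convergence warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of both, run on synthetic data rather than the churn set, is shown below; the dataset shape, the class weights, and the inflated first column (standing in for a large-valued feature such as `cltv`) are assumptions for the example. Since `clf.score` reports mean accuracy only, the sketch also computes F1 on the positive class, which can be more informative when classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem (not the churn data).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=0
)
X[:, 0] *= 5000  # mimic one feature on a much larger scale than the rest

X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

# Standardize the features, then fit with a higher iteration cap.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

pred = model.predict(X_va)
accuracy = accuracy_score(y_va, pred)
f1 = f1_score(y_va, pred)

# Number of solver iterations actually used.
n_iter = model.named_steps["logisticregression"].n_iter_[0]
```

Putting the scaler inside the pipeline means its statistics are learned from the training split only and reapplied unchanged to validation data.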

In [92]:
X_train.isna().sum()

senior_citizen       0
partner              0
dependents           0
tenure_months        0
internet_service     0
online_security      0
online_backup        0
device_protection    0
tech_support         0
contract             0
paperless_billing    0
payment_method       0
monthly_charges      0
total_charges        0
churn_score          0
cltv                 0
dtype: int64

array([0, 1])