# Churn Prediction

## Problem Statement

We are a telecom company that offers phone and internet services, and some of our customers are churning. We would like to build a model that identifies the customers who are likely to churn. We have collected a dataset about our customers: which services they use, how much they paid, and how long they stayed with us. We also know who canceled their contracts and stopped using our services (churned).

In this notebook, we discuss the evaluation metrics and prepare the data for model training.

## Imports

In [1]:
# usual imports 
import numpy as np
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import math

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import RandomForestClassifier

from collections import defaultdict
from sklearn.metrics import fbeta_score, make_scorer

# helper functions
from churn_prediction_utils import *

In [2]:
%store -r df_train_full_explore
%store -r df_train_full
%store -r df_train
%store -r df_val
%store -r df_test

%store -r y_train_full
%store -r y_train
%store -r y_val
%store -r y_test

%store -r categorical_features
%store -r numerical_features


## Evaluation metrics
We have an imbalanced dataset, although the imbalance is moderate: approximately 26% of the data points represent customers who have churned. Accuracy is therefore not a wise choice of evaluation metric for this problem, because a model that never predicts churn would already be about 74% accurate while catching no churners. The positive class in this problem is a customer who is going to churn, that is, a customer who is going to stop using our services. Our model will identify these customers.
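The weakness of accuracy on this kind of class balance can be checked on synthetic labels. The snippet below is a minimal sketch: the 100 labels are made up to mirror the roughly 26% churn rate, and the "model" is a hypothetical one that never predicts churn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (assumption): 26 churners (1) out of 100 customers.
y_true_demo = np.array([1] * 26 + [0] * 74)

# A trivial hypothetical "model" that predicts that nobody churns.
y_pred_never_churn = np.zeros_like(y_true_demo)

baseline_accuracy = accuracy_score(y_true_demo, y_pred_never_churn)
baseline_recall = recall_score(y_true_demo, y_pred_never_churn)

print(f"accuracy: {baseline_accuracy:.2f}")  # accuracy: 0.74
print(f"recall:   {baseline_recall:.2f}")    # recall:   0.00
```

The accuracy looks respectable even though not a single churner is identified, which is why accuracy is only reported as a secondary metric.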

1. A false positive in this problem means that a customer is predicted to churn but actually does not.
2. A false negative in this problem means that a customer is predicted not to churn but actually does.
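To make the two error types concrete, here is a small sketch on ten made-up customers (the labels and predictions are illustrative, not taken from the dataset). It counts false positives and false negatives with `confusion_matrix` and derives precision and recall from them.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels (assumption): 1 = churned, 0 = stayed.
y_true_cm = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_cm = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_cm, y_pred_cm).ravel()
print(f"false positives: {fp}")  # false positives: 2
print(f"false negatives: {fn}")  # false negatives: 1

# Precision = tp / (tp + fp), so it drops as false positives grow.
precision_cm = precision_score(y_true_cm, y_pred_cm)
# Recall = tp / (tp + fn), so it drops as false negatives grow.
recall_cm = recall_score(y_true_cm, y_pred_cm)
print(f"precision: {precision_cm:.2f}")  # precision: 0.60
print(f"recall:    {recall_cm:.2f}")     # recall:    0.75
```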

Precision is lowered by #1 above (false positives), and recall is lowered by #2 (false negatives). Let's make an assumption about the problem from a business perspective:

*We want to make sure that we do not miss customers who are going to churn. In doing so, we accept flagging some customers as churners who actually are not (false positives).*

We therefore care more about recall in this problem. With this in mind, we use the F-score as our evaluation metric and calculate it at two `beta` values: 1 and 1.5. A `beta` greater than 1 weights recall more heavily than precision. These two F-scores are our primary metrics. Besides the F-score, we also calculate ROC AUC, precision, recall, and accuracy.
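The effect of `beta` can be seen from the standard definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The sketch below evaluates it on made-up predictions (not the churn data) and cross-checks the hand-written helper `f_beta`, which is introduced here only for illustration, against scikit-learn's `fbeta_score`.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score


def f_beta(precision, recall, beta):
    """F-beta from precision and recall; illustrative helper, not part of the notebook."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Illustrative labels (assumption): precision is 0.60 and recall is 0.75.
y_true_f = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_f = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

p = precision_score(y_true_f, y_pred_f)
r = recall_score(y_true_f, y_pred_f)

f1_manual = f_beta(p, r, beta=1)
f15_manual = f_beta(p, r, beta=1.5)

print(f"F1:   {f1_manual:.3f}")   # F1:   0.667
print(f"F1.5: {f15_manual:.3f}")  # F1.5: 0.696
```

Because recall is higher than precision in this toy case, the score rises when `beta` moves from 1 to 1.5, which is the behavior we want from a recall-oriented metric.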

In [3]:
evaluation_metrics = ['f1.5', 'f1', 'roc_auc', 'recall', 'precision', 'accuracy']

In [7]:
f_scorer = make_scorer(fbeta_score, beta=1.5)
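A scorer built with `make_scorer` can be passed as the `scoring` argument of scikit-learn's model selection tools. The following is a minimal sketch under stated assumptions: the data comes from `make_classification` with a churn-like class balance, and `LogisticRegression` is only a stand-in model, not necessarily the one used later in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): about 26% positives, like the churn rate.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.74, 0.26], random_state=0
)

# Same construction as f_scorer above, under a separate name for the demo.
f_scorer_demo = make_scorer(fbeta_score, beta=1.5)

# Each fold is scored with F1.5 instead of the estimator's default accuracy.
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, scoring=f_scorer_demo, cv=5
)
print(f"mean F1.5 over 5 folds: {cv_scores.mean():.3f}")
```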

## Prepare data for model training

In [5]:
# Prepare input data for model training and collect all feature names

# Full training set: fit the vectorizer and scaler that are reused for the test set below
res = get_input_data_matrix(df_train_full, categorical_features, numerical_features)
X_train_full_scaled = res['input_data_matrix']
dv_full_scaled = res['dict_vectorizer']
standard_scalar_full_data = res['standard_scalar']
feature_names = res['feature_names']

res = get_input_data_matrix(df_train_full, categorical_features, numerical_features,
                            scaling_required=False)
X_train_full_not_scaled = res['input_data_matrix']
dv_full_not_scaled = res['dict_vectorizer']

# Training set: fit the vectorizer and scaler that are reused for the validation set below
res = get_input_data_matrix(df_train, categorical_features, numerical_features)
X_train_scaled = res['input_data_matrix']
dv_scaled = res['dict_vectorizer']
standard_scalar = res['standard_scalar']

res = get_input_data_matrix(df_train, categorical_features, numerical_features,
                            scaling_required=False)
X_train_not_scaled = res['input_data_matrix']
dv_not_scaled = res['dict_vectorizer']

# Validation set: only transform, using the objects fitted on df_train
res = get_input_data_matrix(df_val, categorical_features, numerical_features,
                            scaling_required=True, is_training_data=False,
                            dict_vectorizer=dv_scaled, standard_scalar=standard_scalar)
X_val_scaled = res['input_data_matrix']

res = get_input_data_matrix(df_val, categorical_features, numerical_features,
                            scaling_required=False, is_training_data=False,
                            dict_vectorizer=dv_not_scaled)
X_val_not_scaled = res['input_data_matrix']

# Test set: only transform, using the objects fitted on df_train_full
res = get_input_data_matrix(df_test, categorical_features, numerical_features,
                            scaling_required=True, is_training_data=False,
                            dict_vectorizer=dv_full_scaled,
                            standard_scalar=standard_scalar_full_data)
X_test_scaled = res['input_data_matrix']

res = get_input_data_matrix(df_test, categorical_features, numerical_features,
                            scaling_required=False, is_training_data=False,
                            dict_vectorizer=dv_full_not_scaled)
X_test_not_scaled = res['input_data_matrix']
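The helper `get_input_data_matrix` lives in `churn_prediction_utils` and its source is not shown in this notebook. The sketch below is only a guess at how such a helper could work, inferred from the arguments and result keys used in the cell above: it is named `get_input_data_matrix_sketch` to avoid confusion with the real function, and the choice to scale numerical columns before one-hot encoding with `DictVectorizer` is an assumption. It illustrates the key point of the cell: transformers are fitted on training data only and then reused on validation or test data.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler


def get_input_data_matrix_sketch(df, categorical_features, numerical_features,
                                 scaling_required=True, is_training_data=True,
                                 dict_vectorizer=None, standard_scalar=None):
    """Hypothetical stand-in for the real helper; behavior is assumed, not confirmed."""
    df = df[categorical_features + numerical_features].copy()
    df[numerical_features] = df[numerical_features].astype(float)

    if scaling_required:
        if is_training_data:
            # Fit the scaler on training data only.
            standard_scalar = StandardScaler().fit(df[numerical_features])
        df[numerical_features] = standard_scalar.transform(df[numerical_features])

    records = df.to_dict(orient="records")
    if is_training_data:
        # Fit the one-hot encoding on training data only.
        dict_vectorizer = DictVectorizer(sparse=False).fit(records)

    return {
        "input_data_matrix": dict_vectorizer.transform(records),
        "dict_vectorizer": dict_vectorizer,
        "standard_scalar": standard_scalar,
        "feature_names": list(dict_vectorizer.feature_names_),
    }


# Tiny made-up frames to show the fit-on-train, transform-on-validation pattern.
train_demo = pd.DataFrame({"contract": ["monthly", "yearly", "monthly"],
                           "tenure": [1.0, 24.0, 12.0]})
val_demo = pd.DataFrame({"contract": ["yearly", "two_year"],
                         "tenure": [6.0, 48.0]})

res_train = get_input_data_matrix_sketch(train_demo, ["contract"], ["tenure"])
res_val = get_input_data_matrix_sketch(
    val_demo, ["contract"], ["tenure"], is_training_data=False,
    dict_vectorizer=res_train["dict_vectorizer"],
    standard_scalar=res_train["standard_scalar"],
)
print(res_train["feature_names"])
print(res_val["input_data_matrix"].shape)
```

In this sketch, a category that appears only in the validation data (`two_year`) has no column of its own, so the validation matrix keeps exactly the same columns as the training matrix.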

In [9]:
%store X_train_full_scaled
%store dv_full_scaled
%store standard_scalar_full_data
%store feature_names

%store X_train_full_not_scaled
%store X_train_scaled
%store X_train_not_scaled
%store X_val_scaled
%store X_val_not_scaled
%store X_test_scaled
%store X_test_not_scaled

%store evaluation_metrics
%store f_scorer

Stored 'X_train_full_scaled' (ndarray)
Stored 'dv_full_scaled' (DictVectorizer)
Stored 'standard_scalar_full_data' (StandardScaler)
Stored 'feature_names' (list)
Stored 'X_train_full_not_scaled' (ndarray)
Stored 'X_train_scaled' (ndarray)
Stored 'X_train_not_scaled' (ndarray)
Stored 'X_val_scaled' (ndarray)
Stored 'X_val_not_scaled' (ndarray)
Stored 'X_test_scaled' (ndarray)
Stored 'X_test_not_scaled' (ndarray)
Stored 'evaluation_metrics' (list)
Stored 'f_scorer' (_PredictScorer)
