<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

# INTRODUCTION
Challenge-1: Telecom Churn Prediction with Automated Data Science Techniques

The goal of this challenge is to perform the telecom churn prediction as defined by the KDD-2009 challenge using supervised learning techniques. This is the second notebook for this challenge and here is a summary of tasks performed in this notebook:

1. Retrieve list of columns shortlisted from training data.
2. Read test dataset from files, restricting the columns to list from above.
3. Read target labels for the test dataset. This will be used for evaluation of predicted results.
4. Retrieve auto-sklearn models from disk and predict labels for test dataset.
5. Check F1 score and other metrics of the predictions against the evaluation labels.

## 1. Column list retrieval
Columns shortlisted from the training data in notebook-1 are retrieved using pickle.

## 2. Parse test data
### a. Filter columns.

The test data set is parsed using the shortlisted column list from first step.

### b. Null values
Null values are filled with 0.0, consistent with the approach selected in notebook-1.

### c. Size reduction
Data size reduction can be performed optionally if the test dataframe is hard to run in memory.

## 3. Target labels
Test labels are retrieved from disk. This will be used to test and validate our predictions.

## 4. Model retrieval
Retrieve auto-sklearn ensemble model from disk using pickle.

Predictions are performed in multiple processes. Five processes are spawned and each predicts approximately one fifth of the test data separately. The results are then combined into a single prediction array before validating against the test labels.

## 5. Check F1 score
Check F1 binary scores by validating the predictions against the test labels.

<b><u>NOTE</u></b>: Every cell in the notebook is timed (%%time), which gives an idea of the cell runtime.

<b>--------------------------------------------------------------------------------------------------</b>
</span>

<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## IMPORTED LIBRARIES
Here is a summary of the imported modules and their purposes:

### Utilities
<b>1. pandas</b>
<br>To read raw training data from csv files into dataframes and process it.

<b>2. numpy</b>
<br>For datatype constants.

<b>3. pickle</b>
<br>Writing and reading objects to/from disk as binary files. These objects include parsed test files, the list of columns and the trained auto-sklearn model.

<b>4. datetime</b>
<br>To measure start and end time of the notebook execution. Also used to suffix timestamp to the generated log files.

<b>5. os</b>
<br>Writing and reading log files and pickles from disk.
    
### Metrics
<b>1. sklearn.metrics.f1_score</b>
<br>To evaluate the model using F1 scores of binary classes.

<b>2. sklearn.metrics.classification_report</b>
<br>To print precision, recall and support.

### Distributed processing

<b>1. multiprocessing</b>
<br>For distributing the auto-sklearn prediction workload to multiple CPU cores.

<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [18]:
#### Utilities
import pandas as pd
import numpy as np
import pickle
import datetime
import os
import warnings

#### Metrics
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

#### Multi-processing
from multiprocessing import Process, Queue

start = datetime.datetime.now()
print(start)

2018-11-16 11:29:19.900616


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## CONSTANTS
Specific details of each constant are indicated in the comments below.
<b>--------------------------------------------------------------------------------------------------</b>

</span>

In [19]:
%%time

#### Constant values

#### Data location on the disk - path of unzipped datafiles (relative to this notebook)
data_dir           = "../unzipped_datafiles/"

#### File prefix for training and label files
data_file_prefix   = "orange_large_train.data.chunk"
target_file        = "orange_large_train_churn.labels"

#### File and columns counts
num_files        = 5
targ_records     = 25000 ## Num of rows to skip in target file = Num of records to retrieve from target file

#### Test dataframes
X_test = pd.DataFrame()
y_test = pd.Series()

#### Log files file prefixes / suffixes
time_suffix = str(start)
feat_extract_txt   = "notebook_2_final_"
for ch in [" ", ":", "-"]:
    time_suffix = time_suffix.replace(ch, "")

f_name_prefix = feat_extract_txt
f_name_suffix = "_" + time_suffix + ".txt"
temp_log_str  = "temp_log_"
f_res_name = f_name_prefix + f_name_suffix

#### Pickle constants for files that are stored to disk.
pkl_cols_to_retain  = "../pickles/cols_to_retain.pkl"
pkl_automl_model    = "../pickles/automl_model.pkl.0"
pkl_x_test          = "../pickles/X_test.pkl"

#### Batch size for multiprocessing. Depends on the number of test data splits that can be handled.
batch_size = 5

#### Set this to TRUE for faster execution by using the pickled test data from disk. 
#### If set to FALSE, extraction of test data can take ~4mins.
use_saved_data = True

CPU times: user 25.1 ms, sys: 10.9 ms, total: 36 ms
Wall time: 35.9 ms
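
The timestamp suffix built in the constants cell can also be produced with `strftime`. The following is a minimal sketch, not the notebook's own code: `make_log_name` is a hypothetical helper, and the example timestamp is the start time printed earlier in this notebook.

```python
import datetime


def make_log_name(prefix, when):
    """Hypothetical helper: build '<prefix>_<timestamp>.txt' for a log file."""
    # %f always emits six microsecond digits, so the name has a fixed width.
    return f"{prefix}_{when.strftime('%Y%m%d%H%M%S.%f')}.txt"


when = datetime.datetime(2018, 11, 16, 11, 29, 19, 900616)

# The replace-based approach from the constants cell, for comparison.
legacy_suffix = str(when)
for ch in [" ", ":", "-"]:
    legacy_suffix = legacy_suffix.replace(ch, "")

print(make_log_name("notebook_2_final", when))
print(legacy_suffix)
```

One difference worth knowing: `str()` on a datetime drops the fractional part when the microseconds are exactly zero, while `%f` always prints it.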


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: CHECK_AND_UPDATE_DTYPE
Helper function that checks the maximum and minimum values of a feature and assigns an appropriate datatype.

By default, pandas assigns the widest datatype to a numeric column, such as int64 or float64, and this bloats the dataframe size in memory. Using smaller datatypes drastically reduces the memory footprint of the dataframes.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [20]:
%%time
def check_and_update_dtype(dt, mn, mx):
    
    if (dt == np.int64):
        if ((mn >= np.iinfo("int8").min) and (mx <= np.iinfo("int8").max)): ##int8
            return np.int8
        elif ((mn >= np.iinfo("int16").min) and (mx <= np.iinfo("int16").max)): ##int16
            return np.int16
        elif ((mn >= np.iinfo("int32").min) and (mx <= np.iinfo("int32").max)): ##int32
            return np.int32
    elif (dt == np.float64):
        if ((mn >= np.finfo("float32").min) and (mx <= np.finfo("float32").max)): ##float32
            return np.float32

    #### No smaller type fits (or dt is neither int64 nor float64): keep the original dtype.
    return dt

CPU times: user 9 µs, sys: 0 ns, total: 9 µs
Wall time: 12.9 µs
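
To illustrate the range check that drives the datatype choice, here is a small sketch for the integer case. `smallest_int_dtype` is a hypothetical name used only for this illustration; it walks the signed integer types from narrowest to widest and returns the first one whose range covers the observed minimum and maximum.

```python
import numpy as np


def smallest_int_dtype(mn, mx):
    """Hypothetical helper: narrowest signed integer dtype that holds [mn, mx]."""
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= mn and mx <= info.max:
            return candidate
    raise ValueError("value range does not fit in a 64-bit signed integer")


for lo, hi in [(0, 127), (0, 128), (-40000, 10), (0, 2**40)]:
    print((lo, hi), "->", np.dtype(smallest_int_dtype(lo, hi)).name)
```

The boundary case (0, 127) shows why an inclusive comparison against the type's maximum matters: 127 is still a valid int8 value.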


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: REDUCE_DATAFRAME_SIZE
Helper function to reduce dataframe size in memory by assigning the right datatype to a feature.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [21]:
%%time

def reduce_dataframe_size(df):
    
    print("Original DF size = ", df.memory_usage(deep=True).sum()/1024**3, "GB")
    print("Reducing DF size....")
    red_df = pd.DataFrame()
    for col in df.columns:
        dt = df[col].dtype
        mn = df[col].min()
        mx = df[col].max()
        up_dt = check_and_update_dtype(dt, mn, mx)
        
        red_df = red_df.join(pd.DataFrame(df[col], 
                                          columns=[col],
                                          dtype=up_dt), 
                             how="right")
    print("Reduced DF size = ", red_df.memory_usage(deep=True).sum()/1024**3, "GB")
    return (red_df)

CPU times: user 47 µs, sys: 0 ns, total: 47 µs
Wall time: 51.7 µs
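
The effect of smaller datatypes on memory can be seen on a toy dataframe. This is a sketch under assumptions: the column names `clicks` and `rate` are made up, and the target dtypes are chosen by hand here rather than by the helper functions above.

```python
import numpy as np
import pandas as pd

# Toy frame with the default 64-bit types (column names are illustrative only).
df = pd.DataFrame({
    "clicks": np.arange(1000, dtype=np.int64),
    "rate": np.arange(1000, dtype=np.float64) * 0.5,
})

# Values up to 999 fit in int16; multiples of 0.5 are exact in float32.
reduced = df.astype({"clicks": np.int16, "rate": np.float32})

before = df.memory_usage(deep=True).sum()
after = reduced.memory_usage(deep=True).sum()
print("bytes before:", before, "bytes after:", after)
```

Note that converting float64 to float32 can lose precision for values that are not exactly representable, so this trade-off should be checked against the needs of the model.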


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: EXTRACT_TEST_DATA
Extracts the last 25k rows from the training data and appends them to a dataframe. We treat this dataframe as the test/validation dataset. During parsing, we use the column list retrieved from training in notebook-1.

<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [22]:
%%time
def extract_test_data (num_files=num_files):
    
    X_test  = pd.DataFrame() 

    #### Cycle through the chunk files, keep the rows that belong to the test split and append them to the final dataframe.
    for i in range(1, num_files+1):

        if (i < 3):
            continue

        chunk_df = pd.DataFrame()
        
        chunk_df = pd.read_csv(data_dir+data_file_prefix+str(i), 
                               sep="\t", 
                               lineterminator="\n", 
                               header=None,
                               names=col_list,
                               usecols=col_ind)
                               #dtype=col_dict)

        chunk_df.fillna(0, inplace=True)

        if (i == 3):
            X_test  = X_test.append(chunk_df.iloc[5001:], ignore_index=True) 
        else:
            X_test  = X_test.append(chunk_df, ignore_index=True)
            
    del chunk_df
    
    X_test = reduce_dataframe_size(X_test)
        
    return (X_test)

CPU times: user 12 µs, sys: 0 ns, total: 12 µs
Wall time: 15.3 µs
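
The column filtering used during parsing can be tried on a tiny in-memory file. This is a minimal sketch: the `io.StringIO` buffer stands in for a chunk file, and the two-column shortlist is invented for the illustration.

```python
import io

import pandas as pd

# Stand-in for a headerless, tab-separated chunk file with four columns.
raw = "1\t2\t3\t4\n5\t\t7\t8\n"

# Shortlisted names map to zero-based positions: 'Var2' -> 1, 'Var4' -> 3.
col_list = ["Var2", "Var4"]
col_ind = [int(name.replace("Var", "")) - 1 for name in col_list]

df = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                 names=col_list, usecols=col_ind)

# The empty field in the second row is read as NaN and then filled with 0.
df = df.fillna(0)
print(df)
```

Because only the shortlisted positions are parsed, the columns that were dropped in notebook-1 never have to be held in memory.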


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: PRINT_ENSEMBLE_DETAILS
Helper function that prints the details of the selected model, such as the hyperparameters, weights etc.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [23]:
%%time
#### Get raw details of selected ensemble. Each algo+hyperparameter combination is stored as a dictionary
def print_ensemble_details (mod_with_wt):

    #### Build list of the ensemble detail dictionary
    dict_list = []
    for item in mod_with_wt:
        co_dict   = {}
        pip = item[1]
        co_dict = pip.configuration.get_dictionary().copy()
        co_dict["algo_weight"] = item[0]
        dict_list.append(co_dict)

    #### Build a dictionary where each key is a hyperparameter name and each value is a list of that
    #### hyperparameter's values across the ensemble members, in ensemble order.
    print_dict = {}
    for item in dict_list:
        for key, val in item.items():
            if key in print_dict.keys():
                pass
            else:
                print_dict[key] = []

    for item in dict_list:
        for key in print_dict.keys():
            if key in item.keys():
                print_dict[key].append(item[key])
            else:
                print_dict[key].append("NA")

    #### Read the dictionary into a pandas dataframe which is easier to print as a table in the end.
    print_df = pd.DataFrame(print_dict)
    col_dict = {}
    drop_list  = []
    const_dict = {}

    for col in print_df.columns:

        #### Remove parameters that are not relevant. For e.g. 
        #### 1. There are no categorical columns in our dataset
        #### 2. There is no preprocessing / imputation done because the null values are filled before training the automl.
        if (("categorical" in col) or ("preprocessor" in col) or ("imputation" in col)) :
            drop_list.append(col)
        else:
            str1 = col.split(":")

            if (len(str1) > 2):
                title = str1[2]
            else:
                title = str1[0]

            #### Separate parameters that have constant values for the ensemble.
            if (len(print_df[col].unique()) == 1):
                val = print_df[col].unique()[0]
                if (val == "None" or val == 0):
                    pass
                else:
                    const_dict[title] = val
                drop_list.append(col)
            else:
                col_dict[col] = title

    print_df = print_df.drop(drop_list, axis=1)
    print_df = print_df.rename(col_dict, axis=1)

    print("ENSEMBLE Constants")
    print("------------------")
    for k, v in const_dict.items():
        print (k, "\t= ", v)

    print("\n\nENSEMBLE Hyperparameters")
    print("------------------------")

    #print_df = print_df[["algo_weight", 
    #                     "max_features", 
    #                     "rescaling", 
    #                     "n_quantiles", 
    #                     "output_distribution",
    #                     "bootstrap",
    #                     "criterion", 
    #                     "min_samples_leaf", 
    #                     "min_samples_split"]]

    print_df.index = np.arange(1, len(print_df)+1)
    
    return print_df

CPU times: user 21 µs, sys: 0 ns, total: 21 µs
Wall time: 25.5 µs


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: SPAWN_PRED_PROCESS
Function that runs inside a worker process for auto-sklearn prediction. The predictions are then put on the queue associated with this process.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [24]:
%%time

def spawn_pred_process(automl_model, test_data, queue, identity):
    #print(automl_model)
    print("Worker process-%d initiated....\n" %identity)
    pred = automl_model.predict(test_data)
    queue.put(pred)
    print("Worker process-%d completed....\n" %identity)
    return

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 8.82 µs


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Column List
Retrieve column list shortlisted from pre-processing and training in notebook-1. This list will be used to retrieve reduced columns from the test set.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [25]:
%%time

#### Shortlisted columns and their types
with open(pkl_cols_to_retain, "rb") as fp:
    col_list = pickle.load(file=fp)
col_ind  = [int(num.replace("Var", ""))-1 for num in col_list]

print("List of columns to filter:")
print(col_list, "\n")
print("Total Columns: ", len(col_list))

List of columns to filter:
Index(['Var7', 'Var11', 'Var13', 'Var17', 'Var19', 'Var20', 'Var21', 'Var22',
       'Var27', 'Var28',
       ...
       'Var14696', 'Var14703', 'Var14710', 'Var14713', 'Var14714', 'Var14721',
       'Var14724', 'Var14729', 'Var14731', 'Var14732'],
      dtype='object', length=2907) 

Total Columns:  2907
CPU times: user 6.86 ms, sys: 0 ns, total: 6.86 ms
Wall time: 5.87 ms
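
The round trip between notebook-1 and this notebook can be sketched as follows. The three column names and the temporary directory are assumptions made for the example; the real list is stored in the pickle file named in the constants cell.

```python
import os
import pickle
import tempfile

# Illustrative shortlist; the real one holds 2907 names.
cols = ["Var7", "Var11", "Var13"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cols_to_retain.pkl")

    # What notebook-1 would do: persist the shortlist.
    with open(path, "wb") as fp:
        pickle.dump(cols, fp)

    # What this notebook does: load it back.
    with open(path, "rb") as fp:
        loaded = pickle.load(fp)

# 'VarN' is the N-th column of the raw file, so its zero-based index is N - 1.
col_ind = [int(name.replace("Var", "")) - 1 for name in loaded]
print(loaded, col_ind)
```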


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Extract Test Data
Extract the 25k rows of test data.

The original size of the dataframe is ~550MB. However, we can optimize this by reducing the datatypes of the features, just as was done during training in notebook-1.

For data size reduction, we use a different approach to reducing the datatypes: cycle through the columns, reduce the datatype of each one and append it to a new dataframe. We observed this to be much faster than the .astype() function of pandas, which took about 6 seconds per feature, whereas this approach takes less than half a second per feature. Even so, with ~3k features, this needs ~4 minutes. The final dataframe size is ~165MB (0.16GB).
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [27]:
%%time

#### To speed up execution of this cell, set use_saved_data boolean accordingly in CONSTANTS 
#### cell at the start of the notebook.

if not use_saved_data:
    X_test = extract_test_data()

    #### Store the test dataframe to disk
    with open(pkl_x_test, "wb") as fp:
        pickle.dump(obj=X_test, file=fp)
else:
    #### Load pickled data from disk
    with open(pkl_x_test, "rb") as fp:
        X_test = pickle.load(file=fp)

print ("Memory footprint of X_test: ", X_test.memory_usage(deep=True).sum()/1024**3, "GB")
print ("Shape of X_test: ", X_test.shape)

Memory footprint of X_test:  0.16072306782007217 GB
Shape of X_test:  (25000, 2907)
CPU times: user 346 ms, sys: 134 ms, total: 480 ms
Wall time: 481 ms


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Predict
We now fork batch_size worker processes (5 in this run). Each process receives the same auto-sklearn model and one split of the test data.

Every process is associated with a queue where the process writes its predicted values.

Appending the queue objects to a list in order and then reading them back in the same order ensures that we combine the results of the split test batches in the order of the original test set, even though the worker processes may complete asynchronously.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [28]:
%%time

with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    predictions   = []
    processes     = []
    queues        = []
    ind = 0
    count = 0
    
    with open(pkl_automl_model, "rb") as fp:
        model = pickle.load(file=fp)

    while (ind < len(X_test)):
        
        end = ind + int(len(X_test)/batch_size)
        X_test_split = X_test.iloc[ind:end, ].copy()
        ind = end
        print("Shape of split batch: ", X_test_split.shape)

        q = Queue()
        p = Process(target=spawn_pred_process, args=(model, X_test_split, q, count))
        p.start()
        queues.append(q)
        processes.append(p)
        count += 1

    for proc, que in zip(processes, queues):
        predictions.append(que.get())
        proc.join()

Shape of split batch:  (5000, 2907)
Worker process-0 initiated....

Shape of split batch:  (5000, 2907)
Worker process-1 initiated....

Shape of split batch:  (5000, 2907)
Worker process-2 initiated....

Shape of split batch:  (5000, 2907)
Worker process-3 initiated....

Shape of split batch:  (5000, 2907)
Worker process-4 initiated....

Worker process-0 completed....

Worker process-1 completed....

Worker process-2 completed....

Worker process-3 completed....

Worker process-4 completed....

CPU times: user 14 s, sys: 4.35 s, total: 18.3 s
Wall time: 49.6 s
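
The ordering guarantee described above can be checked with a stand-alone sketch. Everything here is illustrative: `worker` and `predict_in_batches` are hypothetical names, the sign-based labelling is a stand-in for the model's predict call, and the `fork` start method is an assumption that holds on POSIX systems only.

```python
import multiprocessing as mp


def worker(chunk, queue):
    # Stand-in for model.predict: label each value by its sign.
    queue.put([1 if x > 0 else -1 for x in chunk])


def predict_in_batches(data, n_batches):
    """Hypothetical helper: one process per slice, results kept in slice order."""
    ctx = mp.get_context("fork")  # assumption: POSIX system with fork support
    size = max(1, len(data) // n_batches)
    procs, queues = [], []
    for start in range(0, len(data), size):
        q = ctx.Queue()
        p = ctx.Process(target=worker, args=(data[start:start + size], q))
        p.start()
        procs.append(p)
        queues.append(q)

    results = []
    for p, q in zip(procs, queues):
        # Read before joining: a child cannot exit while its queue data is unread.
        results.extend(q.get())
        p.join()
    return results


data = [3, -1, 4, -1, 5, -9, 2, -6, 5, 3]
print(predict_in_batches(data, 5))
```

Because each process owns its queue and the queues are drained in creation order, the combined list lines up with the input regardless of which worker finishes first.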


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Model Ensemble Description
Auto-sklearn builds an ensemble of models that best classify the dataset. We have restricted the classifier selection to RandomForest, since it has provided the best results so far. Auto-sklearn then builds an ensemble of RandomForest classifier models with varying hyperparameter values. In our approach, we divided the training set into batches to reduce the number of majority-class samples in each batch. Each batch was then trained separately in auto-sklearn, producing a different set of model ensembles. These sets are shown below along with the selected hyperparameters that gave the best value of the scoring metric, which in our case is ROC_AUC.

For example, the ensemble below contains 10 RandomForest classifier models. Here is a brief description of the columns:

1. <b>balancing</b>
This describes if the samples are weighted in the model.

2. <b>rescaling</b>
This describes how the feature values were scaled.

3. <b>bootstrap</b>
This describes whether bootstrap resampling was performed during training. This means that each tree is trained on a sample drawn with replacement from the training dataset, so it is possible that a record appears multiple times in a sample and in different iterations.

4. <b>criterion</b>
In tree based classification, there are two criteria used for splitting at a node: <b>gini</b>, which measures the probability of misclassifying a randomly chosen sample at the node, and <b>entropy</b>, which measures the impurity of the class distribution at the node. The split that reduces the chosen measure the most is selected.

5. <b>max_features</b>
This controls how many features a tree considers when searching for the best split, and auto-sklearn tunes it as a hyperparameter. Hence every model seems to be using only a fraction of the total available features at each split.

6. <b>min_samples_leaf</b>
Minimum number of samples at a leaf node of the tree.

7. <b>min_samples_split</b>
Minimum number of samples at a node before it is split into branches.

8. <b>algo_weight</b>
This describes the weight of the predictions that the model contributes to the ensemble. For instance, 0.16 indicates that 16% of the prediction weight comes from this model. In the first ensemble there is a very good distribution of weights across the ensemble, while in the second ensemble 46% of the prediction weight is provided by the first model.

9. <b>q_max, q_min, n_quantiles and output_distribution</b>
This describes the values based on selected scaling methods.

It should also be noted that each ensemble is different. For example, in the second ensemble, the models differ primarily in their rescaling techniques (other than the normal standardizer), while criterion and bootstrapping have been kept constant. This shows how different samples of the same dataset can produce different ensemble strategies.
<b>--------------------------------------------------------------------------------------------------</b>
</span>
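
The two split criteria listed above can be computed directly from the class proportions at a node. The following is a minimal sketch using the standard textbook definitions; the function names are chosen for this illustration and the churn share is taken from the support counts reported at the end of this notebook.

```python
import math


def gini(proportions):
    """Gini impurity: chance of mislabelling a random sample drawn from the node."""
    return 1.0 - sum(p * p for p in proportions)


def entropy(proportions):
    """Shannon entropy in bits; zero proportions contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


# A perfectly mixed binary node versus a pure node.
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))

# Class balance of the test set (1827 churners out of 25000 records).
churn_share = 1827 / 25000
print(gini([1 - churn_share, churn_share]), entropy([1 - churn_share, churn_share]))
```

Both measures are zero for a pure node and largest for an evenly mixed one, so a split is considered good when it lowers the measure in the child nodes.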

In [29]:
%%time

#### Print model ensemble details
with open(pkl_automl_model, "rb") as fp:
    model = pickle.load(file=fp)

    print("========================")
    print("*****MODEL-ENSEMBLE*****")
    print("========================")

    display(print_ensemble_details(model.get_models_with_weights()))
    print("\n")

*****MODEL-ENSEMBLE*****
ENSEMBLE Constants
------------------
classifier 	=  random_forest
n_estimators 	=  100


ENSEMBLE Hyperparameters
------------------------


Unnamed: 0,balancing,rescaling,bootstrap,criterion,max_features,min_samples_leaf,min_samples_split,algo_weight,n_quantiles,output_distribution,q_max,q_min
1,none,standardize,False,entropy,0.844016,20,18,0.18,,,,
2,weighting,none,True,entropy,0.881737,7,9,0.14,,,,
3,weighting,quantile_transformer,True,entropy,0.849415,10,3,0.12,116.0,normal,,
4,weighting,none,True,entropy,0.833178,15,13,0.1,,,,
5,weighting,none,True,gini,0.901453,19,3,0.06,,,,
6,weighting,quantile_transformer,True,entropy,0.970101,17,9,0.06,130.0,uniform,,
7,weighting,none,True,gini,0.901453,19,3,0.04,,,,
8,weighting,robust_scaler,True,entropy,0.907332,19,2,0.04,,,0.92252,0.286795
9,weighting,none,True,entropy,0.975166,15,19,0.04,,,,
10,weighting,quantile_transformer,True,gini,0.971916,17,11,0.04,157.0,normal,,




CPU times: user 5.08 s, sys: 2.21 s, total: 7.3 s
Wall time: 7.34 s
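
As a rough illustration of how the algo_weight column can be read, the sketch below combines class probabilities from three imaginary models by a weighted average. The weights and probabilities are made up, and this is a simplified picture of weighted ensembling rather than a reproduction of auto-sklearn internals.

```python
import numpy as np

# Made-up ensemble weights that sum to 1.
weights = np.array([0.5, 0.3, 0.2])

# Made-up class probabilities, shape (n_models, n_samples, n_classes).
# Class order is [no churn (-1), churn (+1)].
probas = np.array([
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.2, 0.8], [0.6, 0.4]],
])

# Weighted average over the model axis gives one probability row per sample.
combined = np.tensordot(weights, probas, axes=1)

# Pick the class with the highest combined probability.
labels = np.array([-1, 1])[combined.argmax(axis=1)]
print(combined)
print(labels)
```

In this toy case the third model disagrees with the other two on both samples, but its low weight means it does not change the final labels.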


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Combine results from models
We combine the results from the worker processes into one single array.

<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [30]:
%%time

print("Predictions from all batches:")
display(predictions)

y_pred = np.concatenate(predictions, axis=0)

print("Combined final prediction:")
print(y_pred)

Predictions from all batches:


[array([-1, -1, -1, ..., -1, -1, -1]),
 array([-1, -1, -1, ...,  1, -1, -1]),
 array([-1, -1, -1, ...,  1, -1, -1]),
 array([-1, -1, -1, ..., -1, -1, -1]),
 array([-1, -1, -1, ..., -1, -1, -1])]

Combined final prediction:
[-1 -1 -1 ... -1 -1 -1]
CPU times: user 5.04 ms, sys: 0 ns, total: 5.04 ms
Wall time: 3.7 ms


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Parse test labels
We now read the test labels from disk. This will be used for validating our predictions.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [31]:
%%time

#### Get labels for test data to check accuracy
y_test = pd.read_csv(data_dir+target_file, header=None, squeeze=True, skiprows=targ_records)

print ("Memory footprint of y_test: ", y_test.memory_usage(deep=True)/1024**3, "GB")
print ("Shape of y_test: ", y_test.shape)

Memory footprint of y_test:  0.00018633902072906494 GB
Shape of y_test:  (25000,)
CPU times: user 11.3 ms, sys: 1.05 ms, total: 12.3 ms
Wall time: 11 ms
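
The row skipping used for the labels can be tried on a small in-memory file. This sketch assumes six made-up labels and skips the first three; it also uses `DataFrame.squeeze("columns")`, which is the replacement for the `squeeze=True` argument in newer pandas releases where that argument is no longer available.

```python
import io

import pandas as pd

# Six made-up labels, one per line, standing in for the labels file.
labels_text = "\n".join(["-1", "-1", "1", "-1", "1", "-1"]) + "\n"

# Skip the first three rows and keep the rest as a one-dimensional Series.
tail = pd.read_csv(io.StringIO(labels_text), header=None, skiprows=3)
tail = tail.squeeze("columns")

print(tail.tolist(), tail.shape)
```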


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

# Scores
Finally, the precision, recall and F1 scores are displayed for the model predictions.

## F1 Score
No Churn  (-1): 	 <b>0.9177</b>
<br>Churn (+1): 	 <b>0.245</b>
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [32]:
%%time

with open (f_res_name, "a") as ar:

    ar.write ("------------------Scores----------------------------\n\n")
    
    #### Default / Binary Scores
    class_rep = classification_report(y_pred=y_pred, y_true=y_test, labels=[-1, +1], 
                                      target_names=["No Churn (-1)", "Churn (+1)"])

    f1_scor = f1_score(y_true=y_test, y_pred=y_pred, average=None)

    print ("\nF1 Binary Score:")
    print ("---------------")
    print ("No Churn (-1): \t", np.round(f1_scor[0], 4))
    print ("Churn    (+1): \t", np.round(f1_scor[1], 4))
    print("\n")
    ar.write ("\n\n" + str(f1_scor))
    
    print ("Classification report:")
    print ("---------------------")
    print (class_rep)
    ar.write ("\n\n" + str(class_rep))


F1 Binary Score:
---------------
No Churn (-1): 	 0.9177
Churn    (+1): 	 0.245


Classification report:
---------------------
               precision    recall  f1-score   support

No Churn (-1)       0.94      0.89      0.92     23173
   Churn (+1)       0.19      0.33      0.24      1827

    micro avg       0.85      0.85      0.85     25000
    macro avg       0.57      0.61      0.58     25000
 weighted avg       0.89      0.85      0.87     25000

CPU times: user 27.7 ms, sys: 1.91 ms, total: 29.6 ms
Wall time: 28.3 ms
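
To make the link between precision, recall and F1 explicit, here is a small pure-Python sketch of the per-class F1 computation. The helper names and the six-record example are made up for the illustration; the last line plugs in the rounded churn precision and recall from the report above.

```python
def f1_from_pr(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_for_label(y_true, y_pred, label):
    """Hypothetical helper: F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    return f1_from_pr(tp / (tp + fp), tp / (tp + fn))


# Made-up labels and predictions.
y_true = [-1, -1, -1, 1, 1, -1]
y_pred = [-1, 1, -1, 1, -1, -1]

print("No Churn (-1):", f1_for_label(y_true, y_pred, -1))
print("Churn    (+1):", f1_for_label(y_true, y_pred, 1))

# Rounded churn precision and recall from the classification report.
print(round(f1_from_pr(0.19, 0.33), 2))
```

Because F1 is a harmonic mean, it stays close to the smaller of the two inputs, which is why the low churn precision keeps the churn F1 low even though recall is higher.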


In [33]:
end = datetime.datetime.now()
print(end)
print(end-start)

2018-11-16 11:31:53.418358
0:02:33.517742
