<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

# INTRODUCTION
Challenge-1: Telecom Churn Prediction with Automated Data Science Techniques

The goal of this challenge is to perform the telecom churn prediction as defined by the KDD-2009 challenge using supervised learning techniques. Here is a summary of tasks performed in this notebook:

1. Parse the training data from provided files.
2. Pre-process the data and filter columns for feature selection and engineering.
3. Apply the feature selection rules shortlisted in the step above.
4. Randomly under-sample to reduce the number of majority class samples.
5. Train an AUTOML model with the under-sampled dataset.

Details of these tasks are in subsequent sections.

For AUTOML, we have selected <b>auto-sklearn</b>, which is an automated machine learning toolkit developed as a wrapper
around the scikit-learn estimators. More details can be found here - https://automl.github.io/auto-sklearn/stable/#

## 1. Parsing
This is done in two steps. In the first step, the training data is pre-analyzed to find appropriate datatypes for the features. In the second step, the training data is combined into a single pandas dataframe using the reduced datatypes. This drastically reduces the memory footprint of the training data.

## 2. Feature selection
### a. Categorical features
We chose to remove all categorical features from the training data, i.e., the last 260 columns. 
<br>We tried to incorporate the categorical columns into different estimators by converting them to numerics (using hash buckets, one hot encoding etc.). However they only seemed to degrade the results.</br>

### b. Null values
The dataset is highly sparse and many columns have a very high volume of null values. We chose to use a null value threshold of 0.90. In other words, all columns having more than 90% of their values as null values were removed.
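A minimal sketch of this filter, assuming a small made-up dataframe (the column names and values are illustrative and not taken from the KDD data):

```python
import numpy as np
import pandas as pd

null_val_threshold = 0.90  # same threshold as described above

# Hypothetical toy frame: 20 rows, three columns with different sparsity.
df = pd.DataFrame({
    "dense":  np.arange(20, dtype=float),                      # 0% null
    "half":   [np.nan if i % 2 else 1.0 for i in range(20)],   # 50% null
    "sparse": [1.0] + [np.nan] * 19,                           # 95% null
})

# Fraction of null values per column.
null_fraction = df.isnull().mean()

# Drop every column whose null fraction exceeds the threshold.
cols_to_drop = null_fraction[null_fraction > null_val_threshold].index.tolist()
df_filtered = df.drop(columns=cols_to_drop)

print(cols_to_drop)
print(df_filtered.columns.tolist())
```

Only the column with 95% null values is removed here; the half-empty column stays because it is below the 0.90 threshold.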

### c. Null filler
We chose to fill all remaining null values with 0. 
<br>We tried other approaches such as ffill (forward fill), bfill (backward fill) and majority selection. However, filling null values with "0" seemed to give the best results.</br>

### d. Variance
We also chose to use a variance threshold of 0.9 * (1 - 0.9) = 0.09. This is the variance of a Boolean feature that takes the same value in 90% of the samples; columns whose variance falls below this threshold were removed.

### e. Recursive feature elimination
We chose to use recursive feature elimination (<b>RFECV</b>) schemes provided by scikit-learn libraries, to reduce feature dimensionality. We chose to use RandomForestClassifier as the estimator for RFECV with validation based on f1-weighted scores.

Since the dataset shows a very high imbalance between the target labels (~13:1), we needed an estimator that pays more attention to classifying the minority class accurately. In other words, the samples should be weighted so that the minority and majority classes contribute equally to training. RandomForestClassifier provides a hyperparameter "class_weight" which can be set to "balanced" for this purpose. In the long run, this gave a better set of features, which in turn produced the best results for us.

## 3. Filter training data
In this step, the training data is filtered based on the features selected in the previous step.

## 4. Under sampling
In this approach we randomly remove samples of the majority class (no-churn cases) to reduce the imbalance between the classes in the dataset. For this purpose we use Python's <b>imbalanced-learn</b> library - http://imbalanced-learn.org/en/stable/. We chose a majority:minority class ratio of 3:1; in other words, for every minority class sample there are 3 majority class samples. Other ratios were tried as well (10:1, 5:1, 2:1 and 1:1), but 3:1 gave the best results. Here is the training set configuration:

<b><u>Original Data</u></b>
<table align="left">
    <tr>
        <td align="left">Total training rows</td>
        <td>25000 (100%)</td>
    </tr>
    <tr>
        <td align="left">Rows with outcome as "no-churn"</td>
        <td>23155 (92.62%)</td>
    </tr>
    <tr>
        <td align="left">Rows with outcome as "churn"</td>
        <td>1845 (7.38%)</td>
    </tr>
</table>

<br>
<br>
<br>

<br><b><u>Under-sampled Data</u></b>
<table align="left">
    <tr>
        <td align="left">Total training rows</td>
        <td>7435 (100%)</td>
    </tr>
    <tr>
        <td align="left">Rows with outcome as "no-churn"</td>
        <td>5590 (75.18%)</td>
    </tr>
    <tr>
        <td align="left">Rows with outcome as "churn"</td>
        <td>1845 (24.82%)</td>
    </tr>
</table>


<br><br><br><br><br><b><u>NOTE:</u></b>
<br>We tried other oversampling, undersampling and combined sampling approaches using SMOTE, Tomek, Edited Nearest Neighbours etc., but with every approach there were too many noisy samples of the minority and majority class that distorted the final results. Hence we proceeded with this approach, where we do not introduce any synthetic samples but try to reduce the shadowing effect of the majority class samples. This also produced better results than any of the other oversampling-undersampling techniques mentioned here.

## 5. Training with AUTO-SKLEARN
The auto-sklearn model is trained on the under-sampled training batch.

In <b>auto-sklearn</b> the search space is restricted to the RandomForest classifier. There are 2 reasons behind this restriction:
<br>a. Training with other classifiers did not yield good results. Multiple approaches were attempted using all classifiers and batches of classifiers, but results were no better than running RandomForest alone.
<br>b. RFECV has shortlisted the best features using the RandomForest classifier. It then seemed more logical to continue training the models using the same estimator.

Training is performed for 24hrs. Taking into account the other steps mentioned above, it may take slightly more than 24hrs for the notebook to complete its task. Ensemble time varies between runs, so the email functionality was included to signal when the runs are complete.

<b><u>NOTE:</u></b><br>When each step concludes, the relevant resultant objects are stored to disk using the pickle library.
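A minimal sketch of this store-and-reload pattern, assuming a hypothetical dictionary as the intermediate result and a temporary folder in place of the notebook's pickle folder:

```python
import os
import pickle
import tempfile

# Hypothetical intermediate result; the notebook stores dataframes, models, etc.
col_type_dict = {0: "int8", 1: "float32"}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "col_type_dict.pkl")

    # Write the object; the with-block closes the file handle afterwards.
    with open(pkl_path, "wb") as fh:
        pickle.dump(col_type_dict, fh)

    # Read it back in a later step (or a later run).
    with open(pkl_path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == col_type_dict)
```

Opening the file in a with-block is a design choice of this sketch: it guarantees that the handle is flushed and closed before the next step reads the pickle.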

Every cell in the notebook is timed (%%time), which gives an idea of the cell runtime.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## IMPORTED LIBRARIES
Here is a summary of the imported modules and their purposes:

### 1. Utilities
<b>a. datetime</b>
<br>To measure start and end time of the notebook execution. Also used to suffix timestamp to the generated log files.

<b>b. shutil, glob</b>
<br>These modules are used to maintain (delete/create) folders and log files. (AUTOML uses pSMAC for distributed processing. This is done by writing and reading data to a shared folder).

<b>c. warnings</b>
<br>To suppress warnings from the auto-sklearn functions. The warnings themselves are harmless, but when the training runs for hours they increase the size of the notebook and can lead to a Jupyter crash if disk space is at a premium.

<b>d. os</b>
<br>Writing and reading log files and pickles from disk.

<b>e. pickle</b>
<br>Writing objects to disk as binary files. These objects include the RFECV object, parsed training files, list of columns, list of column datatypes and the trained auto-sklearn model.

<b>f. pandas</b>
<br>To read raw training data from csv files into dataframes and process it.

<b>g. numpy</b>
<br>For datatype constants.

### 2. Distributed processing
<b>a. multiprocessing</b>
<br>For distributing the auto-sklearn training workload to multiple CPU cores.

### 3. Logs via email
<b>a. smtplib</b>
<br>Once the notebook run is completed, this module helps to send the logs and results via email. 
    
### 4. Metrics
<b>a. sklearn.metrics.f1_score</b>
<br>To evaluate the model using F1 scores of binary classes.

### 5. Feature selection
<b>a. sklearn.feature_selection.VarianceThreshold</b>
<br>To filter features that have variance less than a defined variance threshold.

<b>b. sklearn.ensemble.RandomForestClassifier</b>
<br>RandomForestClassifier (RFC) is used as the estimator to find best ranking features via recursive feature elimination. One of the main reasons for selecting RFC is that it provides a hyperparameter to balance the weights of the classes being predicted.

<b>c. sklearn.feature_selection.RFECV</b>
<br>To find the best ranking features against f1-weighted score using RandomForestClassifier as the estimator.

### 6. Auto-Sklearn
<b>a. autosklearn.classification.AutoSklearnClassifier</b>
<br>AUTOML for classification task.

<b>b. autosklearn.constants import *</b>
<br>Constants to be used in the AUTOML functions.

<b>c. autosklearn import metrics</b>
<br>Metric-related constants for AUTOML functions.

### 7. Class Imbalance
<b>a. imblearn.under_sampling.RandomUnderSampler</b>
<br>To reduce the number of majority class samples and bring down the ratio of majority:minority classes from ~13:1 to ~3:1.

<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [1]:
#### Utilities
import datetime
import shutil
import glob
import warnings
import os
import pickle
import pandas as pd
import numpy as np

#### Distributed processing
import multiprocessing

#### Send logs via email
import smtplib
from os.path import basename
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.utils import COMMASPACE, formatdate

#### Metrics
from sklearn.metrics import f1_score

#### Feature selection modules
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold  # cv splitter used by the RFECV cell

#### Automl related modules
from autosklearn.classification import AutoSklearnClassifier
from autosklearn.constants import *
from autosklearn import metrics as amet

#### Class imbalance addressing modules
from imblearn.under_sampling import RandomUnderSampler

start = datetime.datetime.now()
print(start)



2018-11-15 09:27:53.534154


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## CONSTANTS
Specific details of each constant are indicated in the comments below.
<b>--------------------------------------------------------------------------------------------------</b>

</span>

In [2]:
%%time
#### Constant values

#### Data location on the disk - path of unzipped datafiles (relative to the notebook)
data_dir           = "../unzipped_datafiles/"

#### File prefix for training and label files
data_file_prefix   = "orange_large_train.data.chunk"   # training file prefix
target_file        = "orange_large_train_churn.labels" # label file prefix

#### File and columns counts
tot_cols           = 14740 # only numeric cols
tot_files          = 5     # 5 chunks

#### Null and variance thresholds
null_val_threshold = .90
variance_threshold = 0.9*(1-0.9) ## = 0.09, variance of a Boolean feature with a 90/10 split (for sklearn's VarianceThreshold)

#### Variables for target label splits
targ_start         = 0
num_rows_for_targ  = 25000

#### List of columns indices to process.
col_range          = [x for x in range(0, tot_cols)]

#### Num of CPU cores to use for parallel processing
num_cores_to_use = 7 ## = 7 parallel processes

#### Parameters for auto-sklearn
estimator_for_automl = ["random_forest", ]  # Estimator is limited to RandomForest, since RFECV has
                                            # selected best features using this estimator
    
train_duration       = 86400 # 24hrs - This is for one batch.
ensemble_duration    = 43200 # 12hrs - This is for one batch, although in practice, ensemble requires only few mins.
mem_to_use           = 18432 # 18GB RAM per CPU process.

#### Temporary folders for auto-sklearn output. 
#### Used for building ensemble after fitting training data.
tmp_folder         = "../tmp/autosklearn_parallel_code5_tmp"
output_folder      = "../tmp/autosklearn_parallel_code5_out"
dataset_name       = "kdd_ds"

#### Variables to hold column names and column data type dictionary.
cols_to_drop        = []
col_names           = []
col_type_dict       = {}
int_list            = [np.int8, np.int16, np.int32, np.int64]
float_list          = [np.float16, np.float32, np.float64]

#### Log and result file prefixes and suffixes
time_suffix = str(start)
for ch in [" ", ":", "-"]:
    time_suffix = time_suffix.replace(ch, "")

feat_extract_txt   = "kdd2009_"
f_name_prefix = "notebook_1_" + feat_extract_txt
f_name_suffix = time_suffix + ".txt"
temp_log_str  = "temp_log_"
f_res_name    = f_name_prefix + "training_log_" + f_name_suffix
res_file_list = []

#### Pickle constants for files that are stored to disk.
pkl_cols_to_retain  = "../pickles/cols_to_retain.pkl"
pkl_col_type_dict   = "../pickles/col_type_dict.pkl"
pkl_automl_model    = "../pickles/automl_model.pkl.0"
pkl_rfecv_model     = "../pickles/rfecv_model.pkl"
pkl_xtrain_orig     = "../pickles/X_train.pkl"
pkl_ytrain_orig     = "../pickles/y_train.pkl"

#### Set this boolean to TRUE, to load data from disk
#### If this boolean is set to FALSE, Parsing, RFECV based feature elimination etc are run again.
#### NOTE: A new run is not guaranteed to give better prediction results, since the random state is not preserved.
use_saved_data = True

CPU times: user 665 µs, sys: 0 ns, total: 665 µs
Wall time: 671 µs


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: SEND_EMAIL
This function reads the email credentials from a stored file and sends logs to the recipient's email id. The format of the data in "email_cred.txt" should be as below:

from_email_id
<br>nsn-intra password
<br>to_email_id
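
A small sketch of reading such a three-line file, assuming placeholder addresses and a hypothetical helper name `read_email_credentials` (the notebook itself reads the lines inline inside `send_email`):

```python
import os
import tempfile

def read_email_credentials(path):
    """Return (sender, password, recipient) from a three-line text file."""
    with open(path, "r") as fp:
        lines = [line.strip() for line in fp.readlines()]
    if len(lines) < 3:
        raise ValueError("expected three lines: sender, password, recipient")
    return lines[0], lines[1], lines[2]

# Demonstration with placeholder values written to a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    cred_path = os.path.join(tmp_dir, "email_cred.txt")
    with open(cred_path, "w") as fp:
        fp.write("sender@example.com\nnot-a-real-password\nrecipient@example.com\n")
    user, pwd, recep = read_email_credentials(cred_path)

print(user, recep)
```

Stripping each line removes the trailing newline, and the length check fails early if the file is incomplete.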
 
<b>--------------------------------------------------------------------------------------------------</b>

</span>

In [3]:
%%time
def send_email (file_list):
    
    with open ("../email_cred.txt", "r") as fp:
        user  = fp.readline().replace("\n", "")
        pwd   = fp.readline().replace("\n", "")
        recep = fp.readline().replace("\n", "")
        
    #### Add headers
    msg = MIMEMultipart()
    msg['From'] = user
    msg['To'] = recep
    msg['Subject'] = "AUTOML Training Results Are Ready."

    #### Add mail text
    msg.attach(MIMEText("Hello,\n\nTraining results of batches are in attached files.\n\nBest Regards"))

    #### Attach files
    for f in file_list:
        with open(f, "rb") as fil:  #### read as bytes for the MIME attachment
            part = MIMEApplication(fil.read(),Name=basename(f))
        part['Content-Disposition'] = 'attachment; filename="%s"' % basename(f)
        msg.attach(part)
    
    #### Open SMTP server and send mail
    server = smtplib.SMTP("smtp.office365.com", 587)
    #server.set_debuglevel(True)
    server.starttls()
    server.login(user, pwd)
    log_res = server.sendmail(user, recep, msg.as_string())
    server.quit()

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 5.48 µs


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: CHECK_AND_UPDATE_DTYPE
Helper function that checks the maximum and minimum values of a feature and assigns an appropriate datatype.

By default, pandas assigns the largest datatype to a numeric column, such as int64 or float64, and this bloats the dataframe size in memory. Using smaller datatypes drastically reduces the memory footprint of the dataframes.
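To illustrate the saving, here is a minimal sketch with a made-up integer array. The helper name `smallest_int_dtype` is hypothetical and only mirrors the idea of the function in the next cell:

```python
import numpy as np

def smallest_int_dtype(mn, mx):
    """Return the narrowest signed integer dtype that can hold [mn, mx]."""
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= mn and mx <= info.max:
            return candidate
    raise OverflowError("range does not fit in int64")

values = np.arange(0, 100, dtype=np.int64)   # 100 values, 8 bytes each
reduced = values.astype(smallest_int_dtype(values.min(), values.max()))

print(values.nbytes, "bytes as", values.dtype)
print(reduced.nbytes, "bytes as", reduced.dtype)
```

The same 100 values need one eighth of the memory once they are stored as int8 instead of int64.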
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [4]:
%%time
def check_and_update_dtype(dt, mn, mx):
    
    if (dt == np.int64):
        if ((mn >= np.iinfo("int8").min) and (mx <= np.iinfo("int8").max)): ##int8
            return np.int8
        elif ((mn >= np.iinfo("int16").min) and (mx <= np.iinfo("int16").max)): ##int16
            return np.int16
        elif ((mn >= np.iinfo("int32").min) and (mx <= np.iinfo("int32").max)): ##int32
            return np.int32
    elif (dt == np.float64):
        if ((mn >= np.finfo("float32").min) and (mx <= np.finfo("float32").max)): ##float32
            return np.float32

    #### Fall back to the original dtype when no smaller type fits (or for any other dtype).
    return dt

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.25 µs


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: PREPROCESS_TRAIN_COLUMNS
Dataframe parsing is done in two steps. In the first step, the whole training dataset (25k rows) is parsed with the following intentions:
1. Find features (columns) that have null values greater than defined threshold (Constant: null_val_threshold). These features will be dropped later during second parse.
2. Find reduced datatypes appropriate for the features.

A two-step approach is more time and memory efficient than a single-step approach of reading the data and updating the data type on the fly. For example, analysing the actual data type of a feature and reducing it in pandas using the astype() function takes about 8secs for a single feature. This translates to about 19hrs for ~8500 features (reduced after null value checks are applied), with a training dataframe size of ~1.6GB. <br>The two-step approach takes less than 7mins to read the entire training data, amounting to ~360MB in memory.

First step is performed in this function - <b>preprocess_train_columns()</b>.

In the second step we parse the training data again, armed with the information above. This enables us to store all the training data in one dataframe, since the size is memory friendly. The second step is performed in function <b>extract_train_data()</b>.
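The two passes can be sketched on a tiny in-memory file. This is an illustration under assumptions: the data is made up, the helper `reduced_dtype` is hypothetical, and the dtype map is keyed by column name, whereas the notebook keys it by column index:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical tab-separated data standing in for one training chunk.
raw = "Var1\tVar2\n1\t0.5\n2\t1.5\n3\t2.5\n"

def reduced_dtype(series):
    """Pick a narrower dtype for a column (toy rule for this sketch)."""
    if series.dtype == np.int64:
        info = np.iinfo(np.int8)
        if info.min <= series.min() and series.max() <= info.max:
            return np.int8
    if series.dtype == np.float64:
        return np.float32   # assumed safe for these toy values
    return series.dtype

# Pass 1: let pandas infer dtypes, then record a narrower dtype per column.
first_pass = pd.read_csv(io.StringIO(raw), sep="\t")
dtype_map = {col: reduced_dtype(first_pass[col]) for col in first_pass.columns}

# Pass 2: re-read the same data, handing the reduced dtypes to the parser.
second_pass = pd.read_csv(io.StringIO(raw), sep="\t", dtype=dtype_map)

print(first_pass.dtypes.to_dict())
print(second_pass.dtypes.to_dict())
print(first_pass.memory_usage(index=False).sum(),
      second_pass.memory_usage(index=False).sum())
```

Because the parser receives the target dtypes up front, no per-column astype() conversion is needed after loading.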
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [5]:
%%time

#### Find cols to be dropped. Read all the columns from all files and build the drop list and data dict.
def preprocess_train_columns(num_files, data_dir, file_prefix):

    #### break from this loop when 2.5 files are read.
    break_loop = False 
    
    for i in range(1, num_files+1):
        if (i == 1):
            hdr = 0
        else:
            hdr = None

        chunk_df = pd.read_csv(data_dir+file_prefix+str(i), 
                               sep="\t", 
                               lineterminator="\n", 
                               header=hdr,
                               usecols=col_range)
        
        #### Store column names from first file as they are not available in other chunks
        if (i == 1):
            col_names = [col for col in chunk_df.columns]
        else:
            chunk_df.columns = col_names
            
        #### Only 25k rows are part of the training set. Limit it here when the 3rd file is read.
        if (i==3):
            chunk_df = chunk_df.iloc[:5001]
            break_loop = True

        #### Flag columns whose fraction of null values in this chunk exceeds the defined threshold.
        #### (Categorical columns are excluded via col_range; variance is checked later, in the second parse.)
        cols_to_drop_in_chunk = [col for col in col_names
                                 if (chunk_df[col].isnull().sum()/len(chunk_df) > null_val_threshold) #### Null values
                                ]
        
        #### Since the null value check is performed for every chunk, the drop list 
        #### has to be updated after every file is read. This ensures we have a holistic view of the features
        #### before they are dropped: a column is dropped only if it exceeds the threshold in every chunk.
        if (i == 1):
            cols_to_drop[:] = [col for col in cols_to_drop_in_chunk]
        else:
            #### Only columns in both the new list and original list have to be maintained.
            cols_to_drop[:] = list(set(cols_to_drop) & set(cols_to_drop_in_chunk))
        
        #### For the included columns check the min, max and dtypes and store them in a dictionary.
        #### Later we will update this dictionary with reduced dtypes.
        #cols_to_include = list(set(col_names) - set(cols_to_drop))
        #cols_to_include = sorted(cols_to_include)

        if (i == 1):
            col_dt_min_max_list = [[col, chunk_df[col].dtype, chunk_df[col].min(), chunk_df[col].max()]
                                   for col in chunk_df.columns]
        else:
            for col in chunk_df.columns:
                mn = chunk_df[col].min()
                mx = chunk_df[col].max()
                dt = chunk_df[col].dtype
                found = False

                for colm in col_dt_min_max_list:
                    if (col == colm[0]):

                        #### pandas could read different dtypes for the same column depending on values in a chunk
                        #### float dtype is given higher preference than int
                        if ((dt != colm[1]) and (dt in float_list)):
                            colm[1] = dt
                        
                        if  (mn < colm[2]):
                            colm[2] = mn

                        if (mx > colm[3]):
                            colm[3] = mx

                        break

        if (break_loop): break
    
    #### Build the reduced datatype dictionary for numeric columns.
    #### This dictionary will be used later when reading training data into a pandas dataframe - second parsing step.
    for item in col_dt_min_max_list:
        ind = int(item[0].replace("Var", "")) - 1
        col_type_dict[ind] = check_and_update_dtype(item[1], item[2], item[3])
        
    del chunk_df

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.25 µs


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: EXTRACT_TRAIN_DATA
This is the second step in parsing the training data. Here, the training data is read from the respective chunks based on the reduced feature set and the reduced datatypes of the selected features.

Once the data is in a single dataframe, null values are filled with 0 and further columns are reduced based on variance threshold. Features having variance less than defined threshold (Constant: variance_threshold) are dropped.
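
A minimal sketch of what the 0.09 threshold means, using plain numpy on made-up binary columns rather than scikit-learn's VarianceThreshold (the toy data and the strict `>` comparison are assumptions of this sketch):

```python
import numpy as np

variance_threshold = 0.9 * (1 - 0.9)   # 0.09, as in the constants cell

n = 100
X = np.column_stack([
    np.r_[np.ones(5),  np.zeros(n - 5)],    # 5% ones  -> variance 0.0475
    np.r_[np.ones(50), np.zeros(n - 50)],   # 50% ones -> variance 0.25
    np.zeros(n),                            # constant -> variance 0
])

variances = X.var(axis=0)                   # population variance per column
keep_mask = variances > variance_threshold  # keep columns above the threshold
X_reduced = X[:, keep_mask]

print(variances)
print(keep_mask)
```

A binary column with p ones has variance p * (1 - p), so a column that is 95% zeros falls below the threshold while an evenly split column is kept.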

<b><u>NOTE:</u></b>
1. Various null filler approaches were tried such as "Forward Filling", "Backward filling" and "Majority filling". However null values filled with zero seems to provide better results for the selected model.

2. Variance threshold can be applied only when null values are filled. This is the reason variance threshold based filtering is delayed to second step of parsing data.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [6]:
%%time
def extract_train_data (num_files):
    
    X_train = pd.DataFrame()
    X_test  = pd.DataFrame()

    #### Cycle through columns in batches, reduce their dtype and then append to final dataframe.
    for i in range(1, num_files+1):
        
        if i > 3:
            continue
            
        chunk_df = pd.DataFrame()

        if (i == 1):
            hdr = 0
        else:
            hdr = None

        chunk_df = pd.read_csv(data_dir+data_file_prefix+str(i), 
                               sep="\t", 
                               lineterminator="\n", 
                               header=hdr,
                               usecols=col_range,
                               dtype=col_type_dict) #### Dictionary of datatype from first parsing step.

        if (i == 1):
            col_names = [col for col in chunk_df.columns]
        else:
            chunk_df.columns = col_names
        
        chunk_df.drop(columns=cols_to_drop, inplace=True)
        
        #chunk_df.fillna(method=null_filler, inplace=True)
        chunk_df.fillna(0, inplace=True)

        if (i < 3):
            X_train = X_train.append(chunk_df, ignore_index=True)
        elif (i == 3):
            X_train = X_train.append(chunk_df.iloc[:5001], ignore_index=True)
    
    #### Filter columns with low variance from the training data
    X_var = VarianceThreshold(threshold=variance_threshold).fit(X_train)
    indx  = X_var.get_support(indices=True) 
    X_train = X_train[X_train.columns[indx]]
    
    #### Release memory - although python's garbage collector is "lazy" and may not release the memory immediately.
    del chunk_df
    
    return (X_train)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.72 µs


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: SPLIT_TARGET_LABELS
Function to read 25k target labels corresponding to the training data.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [7]:
%%time
def split_target_labels():
    
    #### Read the true labels for the training set (the first num_rows_for_targ rows)
    y_train = pd.read_csv(data_dir+target_file, 
                          header=None, 
                          squeeze=True, #### Save memory by squeezing this into a pandas Series
                          skiprows=targ_start, 
                          nrows=num_rows_for_targ)

    return (y_train)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: DEL_TMP_OUT_FOLDERS
Helper function to maintain sanity of output folders for auto-sklearn. 

These folders store shared data that is used by parallel processes of auto-sklearn. Folders have to be deleted before a fresh run, else the script fails.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [8]:
%%time

#### Check if the tmp and output folders exist and if so, delete them
def del_tmp_out_folders():
    for folder in [tmp_folder, output_folder]:
        #### ignore_errors=True skips folders that do not exist, without a bare except
        shutil.rmtree(folder, ignore_errors=True)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.72 µs


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Pre-process Training Data
First parse of the entire training set.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [9]:
%%time

if (use_saved_data == False):
    ## preprocess columns based on filter options and prepare dtype dict
    preprocess_train_columns(num_files=tot_files, data_dir=data_dir, file_prefix=data_file_prefix)
    pickle.dump(file=(open(pkl_col_type_dict, "wb")), obj=col_type_dict)
else:
    col_type_dict = pickle.load(file=(open(pkl_col_type_dict, "rb")))

CPU times: user 2min 21s, sys: 8.77 s, total: 2min 30s
Wall time: 2min 31s


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Extract Training Data (this has no labels)
Second parse of the entire training set.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [11]:
%%time

if (use_saved_data == False):
    #### Extract training data. This has no labels.
    X_train = extract_train_data(tot_files)
    pickle.dump(file=open(pkl_xtrain_orig, "wb"), obj=X_train)
else:
    X_train = pickle.load(file=open(pkl_xtrain_orig, "rb"))
    
print ("Memory footprint of X_train: ", X_train.memory_usage(deep=True).sum()/1024**3, "GB")
print ("Shape of X_train: ", X_train.shape)

Memory footprint of X_train:  0.3603520616889 GB
Shape of X_train:  (25000, 8757)
CPU times: user 2min 6s, sys: 7.48 s, total: 2min 14s
Wall time: 2min 5s


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Extract Target Labels
Extract target labels corresponding to the training set above.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [12]:
%%time

#### Extract target labels
if (use_saved_data == False):
    y_train = split_target_labels()
    pickle.dump(file=open(pkl_ytrain_orig, "wb"), obj=y_train)
else:
    y_train = pickle.load(file=open(pkl_ytrain_orig, "rb"))
    
print ("Memory footprint of y_train: ", y_train.memory_usage(deep=True)/1024**3, "GB")
print ("Shape of y_train: ", y_train.shape)

Memory footprint of y_train:  0.00018633902072906494 GB
Shape of y_train:  (25000,)
CPU times: user 5.55 ms, sys: 2.97 ms, total: 8.52 ms
Wall time: 7.44 ms


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Reduce Feature Dimensionality with RFECV
Scikit-learn's Recursive Feature Elimination with Cross-validation (RFECV) is used with Random Forest Classifier to reduce the feature dimensionality.

Since the dataset is highly imbalanced with a churn:no-churn ratio of <b>1:12.5</b>, we set the hyperparameter <b>class_weight</b> of the Random Forest classifier to <b>balanced</b>. This mode uses the target labels to automatically adjust weights inversely proportional to class frequencies in the training set as <i>n_samples / (n_classes * numpy.bincount(target))</i>.
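
As a worked example of this formula, the sketch below plugs in the class counts from the "Original Data" table; the array layout (no-churn first, churn second) is an assumption made for illustration:

```python
import numpy as np

# Class counts from the "Original Data" table: [no-churn, churn].
class_counts = np.array([23155, 1845])
n_samples = class_counts.sum()
n_classes = len(class_counts)

# Same formula as quoted above: n_samples / (n_classes * bincount(target)).
class_weights = n_samples / (n_classes * class_counts)

print(class_weights.round(4))
print(round(class_counts[0] / class_counts[1], 2))
```

Each churn sample ends up weighted roughly 12.5 times more than a no-churn sample, so both classes carry the same total weight (12500 each).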

With balanced class weights, we set the <b>scoring</b> parameter of RFECV to <b>f1_weighted</b>. RFECV then uses the cross-validated weighted F1 score to decide how many features to keep.
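
A scaled-down sketch of this setup on synthetic data. The dataset, the small forest, the 3-fold split and the fixed random_state values are assumptions chosen so the example runs in seconds; they are not the settings used in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced toy problem (roughly 9:1); not the KDD data.
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

estimator = RandomForestClassifier(n_estimators=20, class_weight="balanced",
                                   random_state=0)

# Remove 2 features per iteration and score each subset with weighted F1.
selector = RFECV(estimator=estimator, cv=StratifiedKFold(3),
                 scoring="f1_weighted", step=2)
selector.fit(X, y)

selected = selector.get_support(indices=True)
print(selector.n_features_, selected)
```

Fixing random_state in a sketch like this makes the selected feature set repeatable from run to run.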

We have set <b>n_jobs</b> as -1 so that RFECV utilises all available CPU cores. However it seems RFECV limits the usage of CPU cores to the number of cross validation folds. Since <b>cv</b> is set as 5, only 5 CPU cores were used in this case.

Nested parallelism is not supported by scikit-learn (at least in version 0.20.0). Hence the <b>n_jobs</b> hyperparameter has not been set on the Random Forest classifier, only on RFECV.

<b><u>NOTE:</u></b>
<br>We did not preserve the <b>random_state</b> of the RFECV runs, so the result cannot be reproduced exactly. With <b>use_saved_data</b> set to True, the code below therefore skips the RFECV run and instead loads a pickle of an earlier RFECV run, the one that gave the best results with auto-sklearn.

</span>

In [13]:
%%time

#### Select features based on RFECV using Random Forest Classifier
if use_saved_data == False:
    classif = RandomForestClassifier(n_estimators=50, class_weight="balanced", n_jobs=None)

    #### Seems like RFECV uses total of 5 cores, one for each of the CV loop, even though we ask to use all cores.
    rfecv = RFECV(estimator=classif, cv=StratifiedKFold(5), n_jobs=-1, scoring="f1_weighted", step=25)
    rfecv.fit (X_train, y_train)
    pickle.dump(file=open(pkl_rfecv_model, "wb"), obj=rfecv)
else:
    #### Load RFECV for successful run. Unable to reproduce this since the random_state was not preserved.
    rfecv = pickle.load(open(pkl_rfecv_model, "rb"))
    
#### Print RFECV results
print("RFECV configuration: \n", rfecv)
print ("\nUsed estimator:\n", rfecv.estimator_)
print ("\nNo: of selected features = ", rfecv.n_features_)

#### Some other parameters of RFECV
#print ("Support     = ", rfecv.support_)
#print ("Ranking     = ", rfecv.ranking_)
#print ("Grid_Scores = ", rfecv.grid_scores_)

RFECV configuration: 
 RFECV(cv=5,
   estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=50, n_jobs=None, oob_score=False,
            random_state=None, verbose=0, warm_start=False),
   min_features_to_select=1, n_jobs=-1, scoring='f1_weighted', step=25,
   verbose=0)

Used estimator:
 RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=50, n_jobs=None, oob_score=False,
            random_state=None, verbose=0, warm_

<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Filter and Drop Features from Training Dataset
Reduce the training dataset's feature dimensions based on features selected by RFECV.
<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [14]:
%%time

#### Reduce the dataframe based on selected features
sel_ind = rfecv.get_support(indices=True)
X_train = X_train[X_train.columns[sel_ind]]

print ("Memory footprint of X_train: ", X_train.memory_usage(deep=True).sum()/1024**3, "GB")
print ("Shape of X_train: ", X_train.shape)

pkl_file = pkl_xtrain_orig + "_reduced"
with open(pkl_file, "wb") as fp:
    pickle.dump(obj=X_train, file=fp)

Memory footprint of X_train:  0.1610257476568222 GB
Shape of X_train:  (25000, 2907)
CPU times: user 271 ms, sys: 182 ms, total: 454 ms
Wall time: 454 ms
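
As a minimal illustration of what the cell above does with `get_support(indices=True)`, the sketch below uses a plain Python boolean mask in place of a fitted selector. The column names and the mask are made up for the example and are not taken from the dataset.

```python
# Hypothetical stand-ins: a boolean support mask like the one a fitted
# feature selector exposes, and the matching column names.
columns = ["Var1", "Var2", "Var3", "Var4", "Var5"]
support = [True, False, True, True, False]

# Equivalent of get_support(indices=True): positions of the selected features.
selected_idx = [i for i, keep in enumerate(support) if keep]

# Equivalent of X_train.columns[sel_ind]: names of the columns that are kept.
selected_cols = [columns[i] for i in selected_idx]

print(selected_idx)   # [0, 2, 3]
print(selected_cols)  # ['Var1', 'Var3', 'Var4']
```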


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Random Under-Sampler
Python's imblearn package provides a mechanism to randomly under-sample the majority class. We chose a majority:minority class ratio of 3:1, which means that for every minority class sample there are 3 majority class samples in the new training dataset. Other ratios were tried as well (10:1, 5:1, 2:1 and 1:1), but 3:1 gave the best results.

Under-sampling returns numpy arrays, which lose the column names that were available in the pandas dataframe. Although the arrays can still be used to train the ensemble models, they are less convenient for analysis. Hence the re-sampled training data and training labels are converted back to pandas dataframes, using the datatype dictionary that was created earlier, during phase 1 of parsing the raw data. This keeps the resulting dataframe memory friendly.
<b>--------------------------------------------------------------------------------------------------</b>
</span>
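
The idea behind random under-sampling at a given ratio can be sketched with the standard library alone. This is an illustrative stand-in, not imblearn's implementation: the helper `random_under_sample` and the toy labels are hypothetical, and `ratio` is interpreted as minority count divided by majority count after resampling.

```python
import random

def random_under_sample(labels, ratio, seed=0):
    """Return sorted indices that keep every minority sample and a random
    subset of majority samples so that minority/majority is about `ratio`."""
    rng = random.Random(seed)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    minority = min(counts, key=counts.get)
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    maj_idx = [i for i, y in enumerate(labels) if y != minority]
    # Number of majority samples to keep, capped at what is available.
    n_major = min(len(maj_idx), int(len(min_idx) / ratio))
    return sorted(min_idx + rng.sample(maj_idx, n_major))

# Toy binary labels: 10 churners (1) and 90 non-churners (-1).
labels = [1] * 10 + [-1] * 90
kept = random_under_sample(labels, ratio=0.33)
n_min = sum(1 for i in kept if labels[i] == 1)
n_maj = len(kept) - n_min
print(n_min, n_maj)  # 10 30
```

With `ratio=0.33`, roughly three majority samples are kept for each minority sample, which matches the 3:1 choice described above.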

In [15]:
%%time

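#### ratio=0.33 is the minority:majority ratio after resampling, i.e. roughly 3 majority samples per minority sample.
#### Note: imbalanced-learn 0.4+ renames `ratio` to `sampling_strategy` and `fit_sample` to `fit_resample`.
rus = RandomUnderSampler(ratio=0.33)

#### Get the column names to create the DF later.
col_names = X_train.columns
with open(pkl_cols_to_retain, "wb") as fp:
    pickle.dump(obj=col_names, file=fp)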

X_resamp, y_resamp = rus.fit_sample(X_train, y_train)

#### Build the dtypes for selected features
col_dict = {}
for item in col_type_dict:
    col_dict["Var"+str(item+1)] = col_type_dict[item]
    
X_train = pd.DataFrame()
for ind in range (0, X_resamp.shape[1]):
    #print("Started...", ind, X_resamp[:, ind].dtype, end="")
    X_train = X_train.join(pd.DataFrame(X_resamp[:, ind], 
                                        columns=[col_names[ind]], 
                                        dtype=col_dict[col_names[ind]]), 
                           how="right")
    #print(" - Completed.....", col_names[ind], X_train[col_names[ind]].dtype)

y_train = pd.Series(y_resamp)

print("Shape of X_train:\t", X_train.shape)
print("Memory footprint of X_train:\t", X_train.memory_usage(deep=True).sum()/1024**3, "GB")

print("Shape of y_train:\t", y_train.shape)
print("Sample frequency of training set:")

Shape of X_train:	 (7435, 2907)
Memory footprint of X_train:	 0.0478891097009182 GB
Shape of y_train:	 (7435,)
Sample frequency of training set:
CPU times: user 1min 53s, sys: 1.13 s, total: 1min 55s
Wall time: 1min 55s


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: GET_PROCESS_SPAWN
Helper function that returns a worker process, which runs an instance of auto-sklearn.

Multiprocessing support in auto-sklearn is based on the pSMAC algorithm, which specifies how parallel processes write shared data to disk and how the other processes read that shared data.

We chose not to use the preprocessing functionality of auto-sklearn (such as min-max scaling), since it tends to disregard outliers, which degrades the final results.

We restricted auto-sklearn to a single classifier and let it build an ensemble of this classifier with differently tuned hyperparameters.
<b>--------------------------------------------------------------------------------------------------</b>
</span>
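
The shared-folder idea described above can be sketched as follows. This is only a conceptual illustration: the file names, JSON layout, helper functions and scores are all hypothetical and are not the on-disk format that SMAC or auto-sklearn actually uses.

```python
import json
import os
import tempfile

def write_run(shared_dir, seed, config, score):
    """A worker records one evaluated configuration in its own file."""
    path = os.path.join(shared_dir, f"runs_seed_{seed}.json")
    runs = []
    if os.path.exists(path):
        with open(path) as fp:
            runs = json.load(fp)
    runs.append({"config": config, "score": score})
    with open(path, "w") as fp:
        json.dump(runs, fp)

def read_all_runs(shared_dir):
    """Any worker can read the runs written by every worker so far."""
    runs = []
    for name in sorted(os.listdir(shared_dir)):
        with open(os.path.join(shared_dir, name)) as fp:
            runs.extend(json.load(fp))
    return runs

with tempfile.TemporaryDirectory() as shared_dir:
    # Two workers (seeds 0 and 1) each report one made-up result.
    write_run(shared_dir, seed=0, config={"min_samples_leaf": 20}, score=0.71)
    write_run(shared_dir, seed=1, config={"min_samples_leaf": 7}, score=0.73)
    all_runs = read_all_runs(shared_dir)
    best = max(all_runs, key=lambda r: r["score"])

print(len(all_runs), best["config"])  # 2 {'min_samples_leaf': 7}
```

Because each worker writes only to its own file and reads everyone's, workers can benefit from each other's evaluations without locking a common file.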

In [16]:
%%time

## Parallel processing functions based on SMAC implementation of auto-sklearn
def get_process_spawn (X_train, y_train, count):
    def spawned_process(seed, dataset_name):
        
        if (seed==0):
            init_config = 25
            smac_args   = {} 
        else:
            init_config = 0
            smac_args   = {"initial_incumbent": "RANDOM"}

        #### Shared mode is enabled, and the shared folders are kept after each process terminates.
        #### Ensembling is done in a later step and hence ensemble_size is set to 0 here.
        #### Computation resource counts have been defined in the Constants section above.
        automl = AutoSklearnClassifier(initial_configurations_via_metalearning=init_config,
                                       ml_memory_limit=mem_to_use,
                                       time_left_for_this_task=train_duration,
                                       include_preprocessors=["no_preprocessing"], ## We have done this when parsing data
                                       include_estimators=estimator_for_automl, 
                                       exclude_estimators=None,
                                       shared_mode=True,
                                       tmp_folder=tmp_folder,
                                       output_folder=output_folder,
                                       delete_tmp_folder_after_terminate=False,
                                       ensemble_size=0,
                                       seed=seed,
                                       smac_scenario_args=smac_args
                                      )
        
        #### Fit the training data using scoring method as ROC_AUC.
        automl.fit(X_train, y_train, dataset_name=dataset_name, metric=amet.roc_auc)
        
        #### Log file to store training results
        f_name = f_name_prefix + temp_log_str + str(seed) + f_name_suffix + "_" + str(count)
        
        with open (f_name, "a") as ar:
            ar.write("\n\n------------------AUTOML_SEED: " + str(seed) + "------------------------")
            ar.write ("\n\n-----------------------cv_results--------------------\n\n")
            for item in automl.cv_results_:
                ar.write (str (item) + "\n")

            ar.write ("\n\n--------------sprint_statistics-----------------------\n\n")
            ar.write (str (automl.sprint_statistics()) + "\n")

    return spawned_process

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.87 µs


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: LAUNCH_PROCESSES
Similar to _main_, this calls the helper function above multiple times to launch auto-sklearn worker processes.

Each worker process is given a different seed number. The number of worker processes can be less than or equal to the number of available CPU cores. However, each process needs its own memory allocation, so the available memory may not allow all cores to be used. Based on various CPU-memory balancing ratios, an appropriate number of parallel processes and memory per process is defined in the CONSTANTS section above.

The function join() ensures that all worker processes have finished before the ensemble is built. The scoring function used for the ensemble is ROC_AUC.

The number of CPU cores and the amount of memory directly affect how many algorithm configurations auto-sklearn can evaluate during training. The more, the better!

<b>--------------------------------------------------------------------------------------------------</b>
</span>
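
The CPU-memory trade-off mentioned above can be made concrete with a small calculation. This is a sketch under assumed numbers: the helper `workers_that_fit` and the machine size are hypothetical and are not the values used in the CONSTANTS section.

```python
def workers_that_fit(n_cores, total_mem_mb, mem_per_worker_mb, reserve_mb=0):
    """Largest worker count allowed by both the core count and the memory budget."""
    by_memory = (total_mem_mb - reserve_mb) // mem_per_worker_mb
    return max(0, min(n_cores, by_memory))

# Hypothetical machine: 16 cores, 64 GB RAM, 2 GB kept free, 6 GB per worker.
n_workers = workers_that_fit(n_cores=16, total_mem_mb=64 * 1024,
                             mem_per_worker_mb=6 * 1024, reserve_mb=2 * 1024)
print(n_workers)  # 10
```

Here memory, not the core count, is the limiting factor: only 10 of the 16 cores can be given a worker.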

In [17]:
%%time

def launch_processes(X_train, y_train, count):    
    
    #### Delete tmp and output folders used for pSMAC
    del_tmp_out_folders()

    #### List of parallel processes
    processes = []
    spawn_process = get_process_spawn(X_train, y_train, count)

    #### Launch the worker processes
    for i in range (num_cores_to_use):
        p = multiprocessing.Process(target=spawn_process, args=(i, dataset_name))
        p.start()
        processes.append(p)

    #### Wait for worker processes to complete.
    for proc in processes:
        proc.join()

    #### Build the ensemble of models here using data from shared folders.
    automl = AutoSklearnClassifier(initial_configurations_via_metalearning=0,
                                   ml_memory_limit=mem_to_use,
                                   time_left_for_this_task=ensemble_duration,
                                   shared_mode=True,
                                   tmp_folder=tmp_folder,
                                   output_folder=output_folder,
                                   ensemble_size=50,
                                   ensemble_nbest=50,
                                   seed=1
                                   )
    
    automl.fit_ensemble(y_train, 
                        task=BINARY_CLASSIFICATION,
                        metric=amet.roc_auc,
                        precision="64",
                        dataset_name=dataset_name)
    
    #### Store the model to disk.
    with open(pkl_automl_model + "." + str(count), "wb") as fp:
        pickle.dump(obj=automl, file=fp)
    
    
    #### Log file of final ensemble.
    with open(f_res_name + "_" + str(count), "a") as ar:
            
        for file in os.listdir():
            if ((f_name_prefix+temp_log_str) in file):
                with open(file, "r") as fp:
                    ar.write (fp.read())
                    ar.write("\n")
                os.remove(file)
            
        ar.write ("\n\n------------------show_models----------------------------\n\n")
        ar.write (str (automl.show_models()) + "\n")    

        ar.write ("\n\n------------------params----------------------------\n\n")
        ar.write (str (automl.get_params()) + "\n")
    
    #### Append to result file list to be emailed later.
    res_file_list.append(f_res_name + "_" + str(count))

CPU times: user 15 µs, sys: 0 ns, total: 15 µs
Wall time: 317 µs


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Function: PRINT_ENSEMBLE_DETAILS
Helper function that prints the details of the selected model, such as the hyperparameters, weights etc.
<b>--------------------------------------------------------------------------------------------------</b>
</span>
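
The core step of this helper is separating hyperparameters that are identical across the ensemble from those that vary between models. A minimal sketch of that step on plain dictionaries, with a hypothetical helper name and made-up configurations:

```python
def split_constant_and_varying(configs):
    """Split hyperparameters into those shared by every model and those that differ."""
    keys = []
    for cfg in configs:
        for key in cfg:
            if key not in keys:
                keys.append(key)
    constants, varying = {}, {}
    for key in keys:
        # Models that lack a hyperparameter get the placeholder "NA".
        values = [cfg.get(key, "NA") for cfg in configs]
        if all(v == values[0] for v in values):
            constants[key] = values[0]
        else:
            varying[key] = values
    return constants, varying

# Made-up configurations for two ensemble members.
configs = [
    {"classifier": "random_forest", "criterion": "entropy", "min_samples_leaf": 20},
    {"classifier": "random_forest", "criterion": "gini", "min_samples_leaf": 19},
]
constants, varying = split_constant_and_varying(configs)
print(constants)  # {'classifier': 'random_forest'}
print(varying)    # {'criterion': ['entropy', 'gini'], 'min_samples_leaf': [20, 19]}
```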

In [18]:
%%time
#### Get raw details of the selected ensemble. Each algorithm + hyperparameter combination is stored as a dictionary.
def print_ensemble_details (mod_with_wt):

    #### Build list of the ensemble detail dictionary
    dict_list = []
    for item in mod_with_wt:
        co_dict   = {}
        pip = item[1]
        co_dict = pip.configuration.get_dictionary().copy()
        co_dict["algo_weight"] = item[0]
        dict_list.append(co_dict)

    #### Build a dictionary whose keys are hyperparameter names and whose values are lists holding that
    #### hyperparameter's value for each model of the ensemble, in order ("NA" if a model does not have it).
    print_dict = {}
    for item in dict_list:
        for key in item:
            print_dict.setdefault(key, [])

    for item in dict_list:
        for key in print_dict:
            print_dict[key].append(item.get(key, "NA"))

    #### Read the dictionary into a pandas dataframe which is easier to print as a table in the end.
    print_df = pd.DataFrame(print_dict)
    col_dict = {}
    drop_list  = []
    const_dict = {}

    for col in print_df.columns:

        #### Remove parameters that are not relevant. For example:
        #### 1. There are no categorical columns in our dataset.
        #### 2. No preprocessing / imputation is done because the null values are filled before training the automl.
        if (("categorical" in col) or ("preprocessor" in col) or ("imputation" in col)) :
            drop_list.append(col)
        else:
            str1 = col.split(":")

            if (len(str1) > 2):
                title = str1[2]
            else:
                title = str1[0]

            #### Separate parameters that have constant values across the ensemble.
            if (len(print_df[col].unique()) == 1):
                val = print_df[col].unique()[0]
                if (val == "None" or val == 0):
                    pass
                else:
                    const_dict[title] = val
                drop_list.append(col)
            else:
                col_dict[col] = title

    print_df = print_df.drop(drop_list, axis=1)
    print_df = print_df.rename(col_dict, axis=1)

    print("ENSEMBLE Constants")
    print("------------------")
    for k, v in const_dict.items():
        print (k, "\t= ", v)

    print("\n\nENSEMBLE Hyperparameters")
    print("------------------------")

    print_df.index = np.arange(1, len(print_df)+1)
    
    return print_df

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 6.91 µs


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Train and ensemble Under-sampled Training Batch

<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [19]:
%%time
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    count=0
    launch_processes(X_train, y_train, count)

CPU times: user 15.2 s, sys: 2.91 s, total: 18.1 s
Wall time: 1d 26s


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

# SUMMARY
After training, auto-sklearn has completed <b>~5500</b> successful target algorithm runs (5538 of 6068, per the log summary below) and selected an ensemble of RandomForest models that best predicts churn.

## Model Ensemble Description
Auto-sklearn builds an ensemble of models that best classifies the dataset. We restricted the classifier selection to RandomForest, since it has given the best results so far. Auto-sklearn then builds an ensemble of RandomForest classifier models with varying hyperparameter values. The models in the ensemble are shown below along with the hyperparameters that gave the best score on the chosen metric, which in our case is ROC_AUC.

For example, the ensemble below contains 19 RandomForest classifier models. Here is a brief description of the columns:

1. <b>balancing</b>
This describes whether the samples are weighted in the model.

2. <b>rescaling</b>
This describes how the feature values were scaled.

3. <b>bootstrap</b>
This describes whether bootstrap resampling was performed during training. With bootstrapping, each tree is trained on a sample drawn with replacement from the training dataset, so the same sample can appear multiple times.

4. <b>criterion</b>
In tree-based classification, there are two criteria used for splitting at a node: <b>gini</b>, which measures the probability of misclassifying a sample at a node, and <b>entropy</b>, which measures the impurity of the class distribution at a node. A split is chosen to reduce the selected measure.

5. <b>max_features</b>
This random forest hyperparameter controls how many features are considered when searching for the best split at a node. The values shown are auto-sklearn's tuned settings for it, so each split looks at only a subset of the available features.

6. <b>min_samples_leaf</b>
Minimum number of samples required at a leaf node of the tree.

7. <b>min_samples_split</b>
Minimum number of samples required at a node before it is split into branches.

8. <b>algo_weight</b>
This describes the weight of the predictions that the model contributes to the ensemble. For instance, 0.18 indicates that 18% of the prediction weight comes from the first model, 0.14 indicates that 14% of the prediction weight comes from the second model, and so on.

9. <b>q_max, q_min, n_quantiles and output_distribution</b>
These are the parameters of the scaling method selected in each model of the ensemble.
<b>--------------------------------------------------------------------------------------------------</b>
</span>
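
To make the two <b>criterion</b> options concrete, the sketch below computes Gini impurity and entropy for a single node from its class counts. The node counts are made up for illustration.

```python
import math

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits of the class distribution at a node."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical node holding 30 samples of one class and 10 of the other.
node = [30, 10]
print(round(gini(node), 4))     # 0.375
print(round(entropy(node), 4))  # 0.8113
```

Both measures are 0 for a pure node and largest for an evenly mixed one, so a split that lowers either one makes the child nodes purer.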

In [49]:
%%time

#### Copy files to a common dir
file_list = glob.glob("*.txt*")
for file in file_list:
    shutil.copy(file, "../pickles/.")

target_alog_str = "Number of target algorithm runs: "
succes_algo_str = "Number of successful target algorithm runs: "

target_algos = 0
succes_algos = 0
file_count = 0
log_dir = "../pickles/"
for file in os.listdir(log_dir):
    
    if f_res_name in file:
        file_count += 1
        with open(log_dir+file, "r", encoding="utf-8") as fp:
            for line in fp:
                line = line.rstrip()
                if target_alog_str in line:
                    tokens = line.split()
                    target_algos += int(tokens[-1])
                elif succes_algo_str in line:
                    tokens = line.split()
                    succes_algos += int(tokens[-1])

print ("Log files found: ", file_count)
print (target_alog_str, target_algos)
print (succes_algo_str, succes_algos)

#### Print model ensemble details
with open(pkl_automl_model, "rb") as fp:
    model = pickle.load(file=fp)
    
    print("\n")
    print("========================")
    print("*****MODEL-ENSEMBLE*****")
    print("========================")

    display(print_ensemble_details(model.get_models_with_weights()))
    print("\n")

Log files found:  1
Number of target algorithm runs:  6068
Number of successful target algorithm runs:  5538


*****MODEL-ENSEMBLE*****
ENSEMBLE Constants
------------------
classifier 	=  random_forest
n_estimators 	=  100


ENSEMBLE Hyperparameters
------------------------


| | balancing | rescaling | bootstrap | criterion | max_features | min_samples_leaf | min_samples_split | algo_weight | n_quantiles | output_distribution | q_max | q_min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | none | standardize | False | entropy | 0.844016 | 20 | 18 | 0.18 | | | | |
| 2 | weighting | none | True | entropy | 0.881737 | 7 | 9 | 0.14 | | | | |
| 3 | weighting | quantile_transformer | True | entropy | 0.849415 | 10 | 3 | 0.12 | 116.0 | normal | | |
| 4 | weighting | none | True | entropy | 0.833178 | 15 | 13 | 0.1 | | | | |
| 5 | weighting | none | True | gini | 0.901453 | 19 | 3 | 0.06 | | | | |
| 6 | weighting | quantile_transformer | True | entropy | 0.970101 | 17 | 9 | 0.06 | 130.0 | uniform | | |
| 7 | weighting | none | True | gini | 0.901453 | 19 | 3 | 0.04 | | | | |
| 8 | weighting | robust_scaler | True | entropy | 0.907332 | 19 | 2 | 0.04 | | | 0.92252 | 0.286795 |
| 9 | weighting | none | True | entropy | 0.975166 | 15 | 19 | 0.04 | | | | |
| 10 | weighting | quantile_transformer | True | gini | 0.971916 | 17 | 11 | 0.04 | 157.0 | normal | | |




CPU times: user 4.97 s, sys: 1.3 s, total: 6.27 s
Wall time: 6.3 s
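
The role of `algo_weight` can be illustrated with a weighted average of per-model probabilities. This is a simplified sketch of weighted ensembling in general, not auto-sklearn's internal code; the three-model ensemble and its probabilities are made up.

```python
def ensemble_probability(weights, probabilities):
    """Weighted average of per-model churn probabilities for one sample."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probabilities)) / total

# Hypothetical three-model ensemble and made-up per-model probabilities.
weights = [0.5, 0.3, 0.2]
probs = [0.80, 0.40, 0.10]
p = ensemble_probability(weights, probs)
print(round(p, 2))  # 0.54
```

A model with a larger weight pulls the ensemble prediction more strongly towards its own output.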


<span style="color: black; font-family: 'courier new'; font-size: 1.2em">

## Send email with log files

<b>--------------------------------------------------------------------------------------------------</b>
</span>

In [21]:
%%time
send_email (res_file_list)
print("Mail sent with logs .....")

Mail sent with logs .....
CPU times: user 14.7 ms, sys: 3.08 ms, total: 17.8 ms
Wall time: 2.19 s


In [22]:
end = datetime.datetime.now()
print(end)
print(end-start)

2018-11-16 09:43:54.244007
1 day, 0:16:00.709853
