# Problem statement:
Suppose you have training data (with the 'Revenue' attribute) for records from June-Sept only. For all records from Oct-Dec, the 'Revenue' attribute is missing. Build a semi-supervised self-labelling model to estimate 'Revenue' for the missing records in Oct-Dec, and then fit your classifier. Report classification performance on the Feb-March data set with and without the self-labelled data.

1. If you don't consider the records from Oct-Dec, what is the classification performance on the test data?
2. After using the self-labelled data and training data together, does the classification performance on the test data improve? Comment on which metrics are of importance here.

# Download data and import libraries

In [1]:
%matplotlib inline 

import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.tree import DecisionTreeClassifier
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

from tpot import TPOTClassifier
from sklearn.semi_supervised import LabelSpreading

import pickle

In [2]:
# load data    
with open('./transfer_files/df_data.pickle', 'rb') as f:
    df_data = pickle.load(f)
df_data.head()

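| Unnamed: 0 | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated_Duration | ExitRates | PageValues | Month | Revenue | SpecialDay_0.2 | ... | TrafficType_3 | TrafficType_4 | TrafficType_5 | TrafficType_6 | TrafficType_7 | TrafficType_8 | TrafficType_9 | VisitorType_Other | VisitorType_Returning_Visitor | Weekend_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.2 | 0.0 | 2 | False | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0.0 | 0 | 0.0 | 64.0 | 0.1 | 0.0 | 2 | False | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.2 | 0.0 | 2 | False | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0.0 | 0 | 0.0 | 2.666667 | 0.14 | 0.0 | 2 | False | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0.0 | 0 | 0.0 | 627.5 | 0.05 | 0.0 | 2 | False | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |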


In [3]:
# Split the data by month: train = June-Dec, test = Feb-March
df_train = df_data[df_data['Month'] >= 6]
df_test = df_data[(df_data['Month'] >= 2) & (df_data['Month'] <= 3)]
len(df_train), len(df_test)

(6523, 2035)

In [4]:
pv_selected_columns = ['ProductRelated_Duration', 'PageValues', 'Browser_12', 'Browser_2',
                         'Region_8', 'Region_9', 'TrafficType_14', 'TrafficType_18', 'TrafficType_19',
                         'TrafficType_3', 'TrafficType_6', 'TrafficType_7',
                         'VisitorType_Returning_Visitor', 'Weekend_True']

## Reduced train data classification performance

In [5]:
df_reduced_train = df_train[(df_train['Month'] < 10)].copy()
df_reduced_train.shape

(1539, 63)

In [6]:
X_train_reduced = df_reduced_train[pv_selected_columns].values
y_train_reduced = df_reduced_train['Revenue'].values

X_test = df_test[pv_selected_columns].values
y_test = df_test['Revenue'].values

print(X_train_reduced.shape, y_train_reduced.shape)
print(X_test.shape, y_test.shape)

(1539, 14) (1539,)
(2035, 14) (2035,)


In [7]:
# Standardize features by removing the mean and scaling to unit variance.
# Note: X_train_reduced_st and X_test_st are not used below; the models are fit on the raw features.
scaler = StandardScaler()
X_train_reduced_st = scaler.fit_transform(X_train_reduced)
X_test_st = scaler.transform(X_test)
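
One possible reason to fit on raw features (an assumption; the notebook does not state it) is that the pipeline used later ends in `MultinomialNB`, which rejects negative inputs, while standardized features are centred on zero. The sketch below uses made-up data (`X_demo` and `y_demo` are hypothetical) to show the failure, with `MinMaxScaler` as one non-negative alternative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical non-negative features and binary labels, for illustration only
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 100, size=(20, 3))
y_demo = np.array([0, 1] * 10)

# Standardizing centres each column on zero, so some values become negative
X_std = StandardScaler().fit_transform(X_demo)
try:
    MultinomialNB().fit(X_std, y_demo)
    std_ok = True
except ValueError:
    # MultinomialNB raises ValueError when it sees negative feature values
    std_ok = False

# Min-max scaling maps each column into [0, 1], which MultinomialNB accepts
X_mm = MinMaxScaler().fit_transform(X_demo)
MultinomialNB().fit(X_mm, y_demo)

print("standardized input accepted:", std_ok)
print("min-max input is non-negative:", bool(X_mm.min() >= 0))
```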

In [8]:
# 1. Baseline: ignore the Oct-Dec records and search for a pipeline with TPOT, optimizing F1
tpot = TPOTClassifier(generations=25,
                      population_size=50,
                      scoring='f1',
                      verbosity=2,
                      random_state=2,
                      n_jobs=-1)
tpot.fit(X_train_reduced, y_train_reduced)

Optimization Progress:   0%|          | 0/1300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.5901266980888318

Generation 2 - Current best internal CV score: 0.5989636123476292

Generation 3 - Current best internal CV score: 0.6021539820917374

Generation 4 - Current best internal CV score: 0.6021539820917374

Generation 5 - Current best internal CV score: 0.6021539820917374

Generation 6 - Current best internal CV score: 0.6061921654665279

Generation 7 - Current best internal CV score: 0.6061921654665279

Generation 8 - Current best internal CV score: 0.6061921654665279

Generation 9 - Current best internal CV score: 0.6061921654665279

Generation 10 - Current best internal CV score: 0.6063011631130151

Generation 11 - Current best internal CV score: 0.6082604949867364

Generation 12 - Current best internal CV score: 0.6082604949867364

Generation 13 - Current best internal CV score: 0.6082604949867364

Generation 14 - Current best internal CV score: 0.612523832202199

Generation 15 - Current best internal CV score: 0.6125238

TPOTClassifier(generations=25, n_jobs=-1, population_size=50, random_state=2,
               scoring='f1', verbosity=2)

In [9]:
print(tpot.export())

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.tree import DecisionTreeClassifier
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=2)

# Average CV score on the training set was: 0.612523832202199
exported_pipeline = make_pipeline(
    make_union(
        StackingEstimator(estimator=MultinomialNB(alpha=100.0, fit_prior=True)),
        StackingEstimator(estimator=make_pipeline(
            StackingEstimator(estimator=MultinomialNB(alpha

In [10]:
# Average CV score on the training set was: 0.609134780563352
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=DecisionTreeClassifier(criterion="entropy", max_depth=4, min_samples_leaf=5, min_samples_split=2)),
    MultinomialNB(alpha=10.0, fit_prior=False)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 2)


exported_pipeline.fit(X_train_reduced, y_train_reduced)
results = exported_pipeline.predict(X_test)

cm = confusion_matrix(y_test, results)
print("Model performance:")
print(classification_report(y_test, results))
print("Confusion matrix")
print(cm)

Model performance:
              precision    recall  f1-score   support

       False       0.97      0.97      0.97      1855
        True       0.67      0.69      0.68       180

    accuracy                           0.94      2035
   macro avg       0.82      0.83      0.82      2035
weighted avg       0.94      0.94      0.94      2035

Confusion matrix
[[1793   62]
 [  55  125]]
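
The per-class numbers in the report follow directly from the confusion matrix. As a quick check, this sketch recomputes the `True`-class metrics from the four counts printed above (rows are actual classes, columns are predicted classes, in the order False, True, which is scikit-learn's convention).

```python
import numpy as np

# Baseline confusion matrix copied from the output above
cm_baseline = np.array([[1793, 62],
                        [55, 125]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm_baseline.ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm_baseline.sum()

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```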


## Apply label spreading on the data

In [11]:
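# Oct-Dec records: treat 'Revenue' as unknown (LabelSpreading expects -1 for unlabelled samples)
df_unlabeled_train = df_train[(df_train['Month'] >= 10)].copy()
df_unlabeled_train['Revenue'] = -1
# Encode the known June-Sept labels as integers (True -> 1, False -> 0)
df_reduced_train['Revenue'] = df_reduced_train['Revenue'].astype(int)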

df_X = pd.concat([df_reduced_train, df_unlabeled_train]) 
X = df_X[pv_selected_columns].values
y = df_X['Revenue'].values
unlabeled_set = df_unlabeled_train[pv_selected_columns].values

print(X.shape, y.shape, unlabeled_set.shape, df_reduced_train.shape)
df_X['Revenue'].value_counts()

(6523, 14) (6523,) (4984, 14) (1539, 63)


-1    4984
 0    1296
 1     243
Name: Revenue, dtype: int64

In [12]:
# Initialize LabelSpreading and fit it on the labelled (0/1) and unlabelled (-1) rows together
lp_model = LabelSpreading(gamma=.25, max_iter=20)
lp_model.fit(X, y)

LabelSpreading(gamma=0.25, max_iter=20)

In [13]:
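# Extract the label predictions for the unlabeled data.
# The labelled rows come first in X, so the unlabelled rows start at len(df_reduced_train).
predicted_labels = lp_model.transduction_[len(df_reduced_train):]
predicted_labels.shape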

(4984,)
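
To make the mechanics concrete, here is a minimal sketch on made-up 1-D data (`X_toy` and `y_toy` are hypothetical, not taken from the data set). `-1` marks unlabelled rows, and `transduction_` holds one label per input row in the original order, which is why slicing off the labelled rows recovers the predictions for the unlabelled ones.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Hypothetical toy data: two well-separated clusters on a line
X_toy = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
# One labelled point per cluster; -1 means "label unknown"
y_toy = np.array([0, -1, -1, 1, -1, -1])

# Same hyperparameters as in the notebook; the kernel defaults to RBF
toy_model = LabelSpreading(gamma=0.25, max_iter=20)
toy_model.fit(X_toy, y_toy)

# One label per row of X_toy, labelled and unlabelled alike
toy_labels = toy_model.transduction_
print(toy_labels)

# Keep only the rows that were unlabelled on input
print(toy_labels[y_toy == -1])
```

Selecting with a boolean mask such as `y_toy == -1` does not depend on the row order, unlike a positional slice.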

In [14]:
# Store the predicted labels as booleans, matching the dtype of 'Revenue' in the test set
df_unlabeled_train['Revenue'] = (predicted_labels == 1)
df_unlabeled_train['Revenue'].value_counts()

False    4499
True      485
Name: Revenue, dtype: int64
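
A quick sanity check on the self-assigned labels is to compare the share of positives across the three sets. The sketch below uses only counts printed in the outputs above; how large a gap is acceptable is a judgment call.

```python
# (positives, total) for each set, copied from the outputs above
counts = {
    "labelled June-Sept": (243, 1539),
    "self-labelled Oct-Dec": (485, 4984),
    "test Feb-March": (180, 2035),
}

# Fraction of Revenue=True records in each set
rates = {name: pos / total for name, (pos, total) in counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%} positive")
```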

In [16]:
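df_train_spreaded = pd.concat([df_reduced_train, df_unlabeled_train])
X_train_spreaded = df_train_spreaded[pv_selected_columns].values
# The labelled rows hold 0/1 and the self-labelled rows hold True/False; cast to one dtype
y_train_spreaded = df_train_spreaded['Revenue'].astype(bool).values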

In [17]:
# Same pipeline as the baseline, refit on the labelled plus self-labelled data
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=DecisionTreeClassifier(criterion="entropy", max_depth=4, min_samples_leaf=5, min_samples_split=2)),
    MultinomialNB(alpha=10.0, fit_prior=False)
)

set_param_recursive(exported_pipeline.steps, 'random_state', 2)


exported_pipeline.fit(X_train_spreaded, y_train_spreaded)
results = exported_pipeline.predict(X_test)

cm = confusion_matrix(y_test, results)
print("Model performance:")
print(classification_report(y_test, results))
print("Confusion matrix")
print(cm)

Model performance:
              precision    recall  f1-score   support

       False       0.97      0.97      0.97      1855
        True       0.71      0.69      0.70       180

    accuracy                           0.95      2035
   macro avg       0.84      0.83      0.84      2035
weighted avg       0.95      0.95      0.95      2035

Confusion matrix
[[1804   51]
 [  55  125]]


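**F1-score for Revenue=True improved from 0.68 to 0.70.** The gain comes from precision (0.67 to 0.71, with false positives falling from 62 to 51); recall is unchanged at 0.69 (125 of 180 positives found in both runs).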

**Note:**
I did not optimize for recall (which is more appropriate here) because AutoML was giving me perfect scores in the first step, leaving nothing to improve.
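
If recall matters more than precision, one option (not used in this notebook) is an F-beta score with beta > 1, which weights recall more heavily. The sketch below computes recall and F2 from the two confusion matrices reported above; the helper `recall_and_fbeta` and the choice of beta = 2 are assumptions for illustration.

```python
import numpy as np

def recall_and_fbeta(cm, beta=2.0):
    """Return (recall, F-beta) for the positive class of a 2x2 confusion matrix."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    recall = tp / (tp + fn)
    b2 = beta ** 2
    # F-beta written in terms of counts: false negatives are weighted by beta^2
    fbeta = (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)
    return recall, fbeta

cm_without = [[1793, 62], [55, 125]]  # trained on June-Sept labels only
cm_with = [[1804, 51], [55, 125]]     # trained with self-labelled Oct-Dec data added

scores = {}
for name, cm in [("without self-labelled data", cm_without),
                 ("with self-labelled data", cm_with)]:
    scores[name] = recall_and_fbeta(cm)
    print(f"{name}: recall={scores[name][0]:.3f} F2={scores[name][1]:.3f}")
```

Because recall is identical in both runs, F2 moves less than F1 does.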