This notebook is for trying out new features and then deploying their definitions into the scripts.

In this notebook, we do not call objects from model.dataEngine; instead, we develop methods here and then deploy them into model.dataEngine.

In [1]:
import os
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tqdm import tqdm

import config
import model.performance
import utility.df
import utility.iolib
import utility.plotlib
from utility.feature import Feature
from utility.feature import FeatureCM

#env = sys.argv[1] if len(sys.argv) > 2 else "dev"

In [4]:
# Setup configuration
cfg = config.ResearchConfig_MonOnly
time_format = cfg.CSV_TIME_FORMAT
date_format = cfg.CSV_DATE_FORMAT
cutoff_date = pd.to_datetime(cfg.CUTOFF_DATE, format=cfg.CSV_DATE_FORMAT)

# Retrieve data
df_subspt, df_lesson, df_incomp, df_crclum, df_pupils = utility.iolib.retrieve_data(cfg)
print("Complete loading data for subscription and lesson history!")

# Filter data (cutoff_date is already defined in the configuration block above)
first_date_impFromData = df_subspt.subscription_start_date.min()

pupils_toBeRemoved = utility.df.filter_subspt_data(
    df_subspt, first_date_impFromData, cutoff_date, remove_annual_subspt=cfg.MONTHLY_ONLY)
df_lesson1 = df_lesson[~df_lesson['pupilId'].isin(pupils_toBeRemoved)]
df_incomp1 = df_incomp[~df_incomp['pupilId'].isin(pupils_toBeRemoved)]
df_subspt1 = df_subspt[~df_subspt['pupilId'].isin(pupils_toBeRemoved)]

df_subspt1 = utility.df.compute_customer_month(df_subspt1, cfg)

# Construct dates frame
df_datesFrame = utility.df.construct_dates_frame(df_subspt1, df_lesson1, df_incomp1, cfg)
df_datesFrame.fillna(0, inplace=True)

# Initialise feature obj
feature = Feature(df_datesFrame)

feature.add_usageTime(df_lesson1, df_incomp1)
feature.add_progressions(df_lesson1)
feature.add_age(df_pupils)
#feature.add_mathAge(df_lesson1, df_incomp1)
feature.add_outcome(df_lesson1)
feature.add_mark(df_lesson1, df_incomp1)
feature.add_hardship(df_lesson1, df_incomp1)

Complete loading data for subscription and lesson history!
By the cutoff date 2018-04-20, there are 1234 active subscriptions.
These subscribers shall be removed from the analysis because we have no evidence to know the lifetime of their subscriptions. 

In the first month of dataset starting from 2014-01-01, there are 154 renewal or new subscriptions.
These subscribers shall be removed from the analysis because we have no evidence to show if they renewed or newly joined. 

We also choose to remove 2525 annual subscribers. 

In summary, there are 3013/5685 subscribers being removed from the dataset in the analysis. 

Calculate customer month in the subscription table.


100%|██████████| 2672/2672 [00:03<00:00, 750.58it/s] 


Construct data-driven dates frame.
The dates frame has already been assigned customer month and saved in a file. The file has been loaded!
+ Add feature: usage time.
+ Add feature: progressions.
+ Add feature: pupils' age.
+ Add feature: outcome.
+ Add feature: mark.
Start binning stackDepth for complete lesson table.
Start binning stackDepth for incomplete lesson table.
+ Add feature: hardship.
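
The filtering messages above describe three removal rules: subscriptions still active at the cutoff date, subscriptions starting in the first month of the dataset, and annual subscriptions. The sketch below shows one way such a filter could be written with pandas. It is not the actual `utility.df.filter_subspt_data`, whose implementation is not shown here. The function name `sketch_pupils_to_remove` and the columns `subscription_end_date` and `subscription_type` are hypothetical; only `pupilId` and `subscription_start_date` appear in the notebook.

```python
import pandas as pd


def sketch_pupils_to_remove(df_subspt, first_date, cutoff_date, remove_annual=True):
    """Illustrative only: return pupilIds matching any of the three removal rules.

    Assumes hypothetical columns `subscription_end_date` and `subscription_type`.
    """
    # Rule 1: still active at the cutoff date, so the subscription lifetime is unknown.
    active = df_subspt["subscription_end_date"] > cutoff_date
    # Rule 2: started in the first month of the dataset, so we cannot tell
    # whether it is a renewal or a new subscription.
    first_month_end = first_date + pd.DateOffset(months=1)
    first_month = df_subspt["subscription_start_date"] < first_month_end
    mask = active | first_month
    # Rule 3 (optional): annual subscribers.
    if remove_annual:
        mask = mask | (df_subspt["subscription_type"] == "annual")
    # A pupil matching several rules is counted once.
    return df_subspt.loc[mask, "pupilId"].unique()


# Small made-up table to exercise the rules (not real data).
demo = pd.DataFrame({
    "pupilId": [1, 2, 3, 4],
    "subscription_start_date": pd.to_datetime(
        ["2014-01-10", "2015-03-01", "2016-05-01", "2017-01-01"]),
    "subscription_end_date": pd.to_datetime(
        ["2014-06-10", "2015-09-01", "2018-06-01", "2018-01-01"]),
    "subscription_type": ["monthly", "monthly", "monthly", "annual"],
})
removed = sketch_pupils_to_remove(
    demo,
    first_date=demo["subscription_start_date"].min(),
    cutoff_date=pd.Timestamp("2018-04-20"),
)
print(sorted(removed.tolist()))
```

Because the three rules are combined as a union, the number of removed subscribers can be smaller than the sum of the three individual counts, which is consistent with the summary line printed above.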


In [7]:
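import pydot

# Render the Graphviz file clf.dot to a PNG image.
# pydot calls the Graphviz `dot` executable, so it must be on PATH.
(graph,) = pydot.graph_from_dot_file('clf.dot')
graph.write_png('clf.png')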

In [6]:
import os
# Add the local Graphviz install directory to PATH so pydot can find `dot`.
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

In [1]:
import pickle

with open('mixture_model_2000.pkl', 'rb') as inputs:
    mixture_model = pickle.load(inputs)

In [3]:
mixture_model.map_group_model_['G1'][0]

[    multi0  multi1  groupId  count     churn  cumcount
 13       8       6       13      1  1.000000         1
 14       9       4       14      1  1.000000         2
 5        7       2        5      3  0.333333         5
 0        1       0        0    997  0.257773      1002
 6        7       3        6      4  0.250000      1006
 15       9       6       15   1598  0.249061      2604
 4        6       3        4   1230  0.243089      3834
 12       8       4       12   1134  0.221340      4968
 10       8       0       10      5  0.200000      4973
 1        4       2        1    796  0.188442      5769
 9        7       8        9    637  0.178964      6406
 7        7       4        7      7  0.142857      6413
 2        4       4        2      1  0.000000      6414
 3        6       0        3      1  0.000000      6415
 8        7       6        8      1  0.000000      6416
 11       8       3       11      1  0.000000      6417,
     multi0  multi1  groupId  count     churn  