This is a call_center exercise. Through this exercise, we want to find out which agent can approach the customers most efficiently (interested in the deal) and also which type of customer is more inclined to be interested in the deal given their characteristic features, i.e. age, working sector and residential region.

Import all related modules from Python

In [3]:
import pandas as pd
import os
import numpy as np
from pandas.api.types import CategoricalDtype
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from scipy import stats

Make sure data files and the program file are saved in the same directory

In [4]:
notebook_path = os.path.abspath('')

Load calls data

In [5]:
calls_data = pd.read_csv(notebook_path + '\\calls.csv')

Have a quick look at the data

In [6]:
calls_data.sort_values('Phone Number').head(20)

Unnamed: 0,Phone Number,Call Outcome,Agent,Call Number
4934,86974920,CALL BACK LATER,red,4934
3398,293076903,NOT INTERESTED,orange,3398
2443,296668016,CALL BACK LATER,red,2443
2543,296668016,INTERESTED,black,2543
2343,296668016,CALL BACK LATER,orange,2343
4053,330347320,INTERESTED,orange,4053
3953,330347320,CALL BACK LATER,red,3953
1473,457146043,INTERESTED,orange,1473
1373,457146043,CALL BACK LATER,red,1373
4545,832669152,NOT INTERESTED,red,4545


which agent makes most calls？

In [7]:
calls_data.groupby('Agent')['Agent'].count().sort_values(ascending = False)

Agent
orange    2234
red       1478
black      750
green      339
blue       199
Name: Agent, dtype: int64

Agent 'orange' apparently has far more calls made than others. But we are not sure whether he made a lot of useless calls or he called the customers effectively.

Among all leads received at least one call from us, we want to find how many calls we need to make in order for them to make a decision (sign up or not).

In [8]:
calls_data.groupby('Phone Number')['Phone Number'].count().mean()

1.839587932303164

Alternatively, we could merge the two data sets between leads (customer characteristic file) and calls

In [9]:
leads_data = pd.read_csv(notebook_path + '\\leads.csv')

In [10]:
leads_data.head(10)

Unnamed: 0,Name,Phone Number,Region,Sector,Age
0,Isabela MEZA,175718505368,north-west,wholesale,19
1,Deangelo LEE,937521423043,north-west,retail,38
2,Rosia MENDEZ,403640999962,midlands,agriculture,40
3,Jeremiah GALLOWAY,946740713605,scotland,food,23
4,Sarah POPE,264176984341,midlands,retail,18
5,Nolan VILLANUEVA,102993220908,north-west,wholesale,35
6,Wade AVERY,936057266681,south-west,construction,20
7,Karyn SHEPARD,416050061466,midlands,retail,60
8,Buster CALDERON,169044176823,south-west,food,21
9,Lu JACOBSON,477236163516,north-west,consultancy,28


In [11]:
leads_calls_left = pd.merge(leads_data, calls_data, how = 'left', on = 'Phone Number')

In [12]:
leads_calls_left.sort_values('Name').head(20)

Unnamed: 0,Name,Phone Number,Region,Sector,Age,Call Outcome,Agent,Call Number
6905,Aaliyah STOKES,829695501942,north-east,retail,44,,,
5745,Aaron MICHAEL,718889280723,north-west,consultancy,24,INTERESTED,blue,4125.0
871,Abagail KENT,518447025746,midlands,food,20,,,
7568,Abagail PACE,647443547151,midlands,entertainment,40,CALL BACK LATER,green,2484.0
7570,Abagail PACE,647443547151,midlands,entertainment,40,NOT INTERESTED,orange,2684.0
7569,Abagail PACE,647443547151,midlands,entertainment,40,CALL BACK LATER,green,2584.0
8130,Abagail SHAFFER,118266780799,south,construction,32,,,
8329,Abbey BLANKENSHIP,96746535702,south,food,28,,,
9906,Abbey BROWNING,143802021855,north-east,entertainment,40,,,
5003,Abbey SALAS,41832516768,scotland,consultancy,43,,,


In [13]:
leads_calls_left[leads_calls_left['Call Outcome'].notnull()].groupby('Name')['Name'].count().mean()

1.839587932303164

Focusing on the signed up leads only, calculate the average calls they received.

In [14]:
temp_data = pd.merge(calls_data, leads_data, on = 'Phone Number')
signups_data = pd.read_csv(notebook_path + '\\signups.csv')
signups_leads_calls = pd.merge(temp_data, signups_data, left_on = 'Name', right_on = 'Lead')

In [15]:
signups_leads_calls.groupby('Name').last().head(20)

Unnamed: 0_level_0,Phone Number,Call Outcome,Agent,Call Number,Region,Sector,Age,Lead,Approval Decision
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Aaron MICHAEL,718889280723,INTERESTED,blue,4125,north-west,consultancy,24,Aaron MICHAEL,REJECTED
Abbey TERRELL,967742850061,INTERESTED,red,380,south-west,wholesale,45,Abbey TERRELL,REJECTED
Abigail BOYLE,882003542264,INTERESTED,red,2017,south-west,consultancy,71,Abigail BOYLE,REJECTED
Ada CISNEROS,583848516870,INTERESTED,orange,2031,north-west,food,63,Ada CISNEROS,REJECTED
Admiral SOLOMON,480425059761,INTERESTED,red,1937,north-west,retail,25,Admiral SOLOMON,APPROVED
Adolf SALAZAR,163178719806,INTERESTED,orange,228,north-west,retail,19,Adolf SALAZAR,APPROVED
Adolph ZUNIGA,76223398562,INTERESTED,orange,917,northern-ireland,entertainment,24,Adolph ZUNIGA,REJECTED
Adrianna GARRETT,388295208858,INTERESTED,green,235,scotland,food,18,Adrianna GARRETT,REJECTED
Adrianne MCMAHON,557948543615,INTERESTED,orange,3069,north-west,food,30,Adrianne MCMAHON,APPROVED
Adrienne OROZCO,655413695089,INTERESTED,black,3248,south,food,37,Adrienne OROZCO,APPROVED


In [17]:
signups_leads_calls.groupby('Lead')['Lead'].count().mean()

2.0989583333333335

In [16]:
# check whether the last deciding call per lead always ends up with Call Outcome == INTERESTED
temp = signups_leads_calls.groupby('Name').last()
temp[temp['Call Outcome'] != 'INTERESTED'].shape

(0, 9)

In [17]:
# check whether each lead has several Call Outcome == INTERESTED
temp1 = calls_data[calls_data['Call Outcome'] == 'INTERESTED'].groupby(['Phone Number']).count()
temp1[temp1['Call Outcome'] > 1].shape

(0, 3)

Find out which agent has most signups

In [18]:
# We filter by 'INTERESTED' in order to only keep the last call that make the lead sign up.
temp.groupby('Agent')['Agent'].count().sort_values(ascending = False)

Agent
red       316
orange    284
green      67
blue       52
black      49
Name: Agent, dtype: int64

It is interesting to see that although Agent Orange made most calls, Agent red got most of the signups.

Note that the above conclusion is based on the assumption that all agents called same type of leads. This is an over-restrictive assumption. For example, Agent red may be just lucky to call leads who are more likely to sign up based on their characteristic variables. 

Note also that not necessarily those leads who are interested will sign up.

In [19]:
signups_leads_calls_all = pd.merge(leads_calls_left, signups_data, how = 'left', left_on = 'Name', right_on = 'Lead')
signups_leads_calls_all[signups_leads_calls_all['Call Outcome'] == 'INTERESTED'].head(10)

Unnamed: 0,Name,Phone Number,Region,Sector,Age,Call Outcome,Agent,Call Number,Lead,Approval Decision
3,Deangelo LEE,937521423043,north-west,retail,38,INTERESTED,orange,2413.0,Deangelo LEE,APPROVED
12,Lu JACOBSON,477236163516,north-west,consultancy,28,INTERESTED,orange,2149.0,Lu JACOBSON,REJECTED
16,Theron WELCH,533788208390,north-east,entertainment,36,INTERESTED,green,1207.0,Theron WELCH,APPROVED
19,Lilia OCHOA,80967872849,north-west,wholesale,33,INTERESTED,black,1333.0,,
31,Cheryle CALDWELL,484404817049,north-west,consultancy,27,INTERESTED,orange,473.0,,
36,Naoma DURHAM,940509676942,south-east,entertainment,30,INTERESTED,black,2013.0,Naoma DURHAM,REJECTED
56,Alyssa HAMPTON,593981680906,north-east,consultancy,27,INTERESTED,orange,2292.0,Alyssa HAMPTON,REJECTED
72,Glen PATTON,532155164354,midlands,food,20,INTERESTED,red,2400.0,Glen PATTON,APPROVED
88,Christena KRAMER,379624957194,scotland,food,32,INTERESTED,orange,4067.0,,
101,Bonnie CALLAHAN,350965802992,scotland,food,27,INTERESTED,red,1753.0,Bonnie CALLAHAN,APPROVED


there are cases where an Agent has successfully made a lead interested but the lead didn't sign up.

Next we want to check which agent tends to have the highest signups/calls ratio.

In [20]:
# answer to the most calls question
calls_per_agent = calls_data.groupby('Agent')['Agent'].count().sort_index()
# answer to the most signups question
signups_per_agent = signups_leads_calls.groupby('Lead').last().groupby('Agent')['Agent'].count().sort_index()
# divide between the two and sort it
signups_rate = (signups_per_agent/calls_per_agent).sort_values(ascending = False)
signups_rate

Agent
blue      0.261307
red       0.213802
green     0.197640
orange    0.127126
black     0.065333
Name: Agent, dtype: float64

Alternatively, we can just include those final actual conversational calls (neither a deadline nor answer machine) each agent made.

In [24]:
actual_calls_per_agent = calls_data[(calls_data['Call Outcome'] != 'DEADLINE') & (calls_data['Call Outcome'] != 'ANSWER MACHINE')].groupby('Agent')['Agent'].count()
actual_signups_rate = (signups_per_agent/actual_calls_per_agent).sort_values(ascending = False)
actual_signups_rate

Agent
blue      0.290503
red       0.238491
green     0.219672
orange    0.140873
black     0.074130
Name: Agent, dtype: float64

Although Agent blue has the highest signups/calls ratio, he only made 199 calls in total, not even one tenth of the Agent orange, and he only made 52 signups, far less than those of Agent red.

Now we want to check whether the variation of average signups/calls ratio is statistically significant.

chi-square test can be used to test whether signups count is uniformlly distributed. Firstly we need to create a contingency table.

In [25]:
test_data = pd.concat([signups_per_agent, calls_per_agent], axis = 1)
test_data.columns = ['signups count', 'calls count']
test_data

Unnamed: 0_level_0,signups count,calls count
Agent,Unnamed: 1_level_1,Unnamed: 2_level_1
black,49,750
blue,52,199
green,67,339
orange,284,2234
red,316,1478


Using our bare eye we can see that the distribution of the ratio of signups against non-signups count is disproportioinal. 
We can prove this by running chi-square test.

In [26]:
# I use list comprehension to decompose the dataframe into separate row vector.
chi2, p, dof, expected = stats.chi2_contingency([i for i in np.array(test_data)])
chi2, p

(88.97471726636842, 2.1740841509211066e-18)

In [27]:
# I need to change the code here
test_data = pd.concat([signups_per_agent, actual_calls_per_agent], axis = 1)
test_data.columns = ['signups count', 'actual calls count']
chi2, p, dof, expected = stats.chi2_contingency([i for i in np.array(test_data)])
chi2, p

(86.62269161071926, 6.8649353770441634e-18)

From the p-value we can see that the variability between the agents signups per call is statistically significant.

Next we dig into the characteristic features of leads and try to firstly find out leads from which region are more likely to be interested.

In [28]:
leads_calls_inner = pd.merge(calls_data, leads_data, how = 'inner', on = 'Phone Number')
leadsOrderByRegion = leads_calls_inner[leads_calls_inner['Call Outcome'] == 'INTERESTED'].groupby('Region')['Region'].count().sort_values(ascending = False)
leadsOrderByRegion

Region
north-west          365
south-west          161
midlands            150
north-east          139
scotland            137
south-east          136
south                62
london               56
wales                50
northern-ireland     40
Name: Region, dtype: int64

We can see that leads from north-west are more likely to be interested.

then we can find out leads from which sector are more likely to be interested.

In [29]:
leadsOrderBySector = leads_calls_inner[leads_calls_inner['Call Outcome'] == 'INTERESTED'].groupby('Sector')['Sector'].count().sort_values(ascending = False)
leadsOrderBySector

Sector
consultancy      301
retail           290
food             261
wholesale        233
entertainment    135
construction      46
agriculture       30
Name: Sector, dtype: int64

We can see that leads from consultancy sector are more likely to be interested.

Given leads who have expressed interests and signed up, we want to find out leads from which region are more likely to be approved.

In [30]:
# We can use signups_leads_calls for this purpose since data included only covers those leads who have signed up.
signups_leads_calls_clean = signups_leads_calls[signups_leads_calls['Call Outcome'] == 'INTERESTED']
signups_per_region = signups_leads_calls_clean.groupby('Region')['Region'].count().sort_index()
approved_per_region = signups_leads_calls_clean[signups_leads_calls_clean['Approval Decision'] == 'APPROVED'].groupby('Region')['Region'].count().sort_index()
(approved_per_region/signups_per_region).sort_values(ascending = False)

Region
north-west          0.452381
scotland            0.451220
south               0.375000
south-east          0.337209
midlands            0.285714
northern-ireland    0.250000
south-west          0.245098
north-east          0.243902
wales               0.147059
london              0.080000
Name: Region, dtype: float64

We can see that leads from north-west region are more likely to be approved and the difference in approval rate is large across different regions.

We can run the same Chi-square test to see whether such difference is statistically different as well.

In [31]:
test_data = pd.concat([approved_per_region, signups_per_region], axis = 1)
chi2, p, dof, expected = stats.chi2_contingency(test_data)
chi2, p

(20.50334600413018, 0.015047849853946683)

The variation in approved rate is statistically significant across different regions.

Next, we want to build up a forecasting model for signups by firstly including the three characteristic features, i.e. age, region and sector.

In [35]:
# We are only interested in the calls that got a call outcome as 'INTERESTED' or 'NOT INTERESTED'. 
# 'ANSWER MACHINE', 'CALL BACK LATER' are unfinished calls; 'DEAD LINE' does not reflect any characteristic features by nature.
in_sample = signups_leads_calls_all[(signups_leads_calls_all['Call Outcome'] == 'INTERESTED') | (signups_leads_calls_all['Call Outcome'] == 'NOT INTERESTED')]
temp_data = pd.merge(calls_data, leads_data, on = 'Phone Number', how = 'right')
out_sample = temp_data[temp_data['Call Outcome'].isnull()]

Create a data transformation pipeline

In [36]:
def data_pipeline(data, selected_cols, target_index, char_cols, char_orders, mapper):
    data = data.loc[:, selected_cols]
    for index, char_col in enumerate(char_cols):
        cat_type = CategoricalDtype(categories = char_orders[index], ordered = True)
        data[char_col] = data[char_col].astype(cat_type).values
        data = data.join(pd.get_dummies(data[char_col]))
#         To reduce the right skewness of variable 'Age' 
        data['Age'] = np.log(data['Age'])
    target_col = data[selected_cols[target_index]].map(mapper)
    data.insert(0, 'target', target_col)
    data = data.drop(selected_cols[target_index], axis = 1)
    data = data.drop(char_cols, axis = 1)
    data = data.reset_index(drop = True)
    return data

In [37]:
# 1st step modelling
in_sample_vars = ['Call Outcome', 'Age', 'Sector', 'Region']
out_sample_vars = ['Phone Number', 'Name', 'Age', 'Sector', 'Region']
char_col =  ['Region', 'Sector']
char_order = [leadsOrderByRegion.index, leadsOrderBySector.index]
in_sample_mapper = {'INTERESTED':1, 'NOT INTERESTED':0}
out_sample_mapper = {}

final_in_sample = data_pipeline(in_sample, in_sample_vars, 0, char_col, char_order, in_sample_mapper)
final_out_sample = data_pipeline(out_sample, out_sample_vars, 0, char_col, char_order, out_sample_mapper)
X_out_sample = final_out_sample.iloc[:, 2:]

train_set, test_set = train_test_split(final_in_sample, test_size = 0.2, random_state = 42)
train_set = train_set.reset_index(drop = True)
test_set = test_set.reset_index(drop = True)

X_train = train_set.iloc[:, 1:]
Y_train = train_set.iloc[:, 0]
X_test = test_set.iloc[:, 1:]
Y_test = test_set.iloc[:, 0]
X_full = final_in_sample.iloc[:, 1:]
Y_full = final_in_sample.iloc[:, 0]

In [38]:
#  Fine tunning Logistic regression
from sklearn.model_selection import GridSearchCV
LogR = LogisticRegression(solver= 'liblinear', random_state=50)

param_grid = {
     'penalty' : ['l1', 'l2'],
    'C' : np.logspace(-2, 2, 20),
    'class_weight': ['balanced', None],
    'solver' : ['liblinear']}

temp = LogisticRegression(random_state = 50)
grid_search = GridSearchCV(temp, param_grid, cv = 10)
grid_search.fit(X_train, Y_train)
aug_LogR = grid_search.best_estimator_
[LogR.get_params, aug_LogR.get_params]

[<bound method BaseEstimator.get_params of LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=50, solver='liblinear',
           tol=0.0001, verbose=0, warm_start=False)>,
 <bound method BaseEstimator.get_params of LogisticRegression(C=0.06951927961775606, class_weight=None, dual=False,
           fit_intercept=True, intercept_scaling=1, max_iter=100,
           multi_class='warn', n_jobs=None, penalty='l2', random_state=50,
           solver='liblinear', tol=0.0001, verbose=0, warm_start=False)>]

In [39]:
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import precision_score, recall_score, roc_auc_score, auc, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

RandForest = RandomForestClassifier(n_estimators=100, random_state = 50)

temp = RandomForestClassifier(random_state = 50)
param_grid = {'n_estimators': [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)],
               'max_features': ['auto', 'sqrt'],
               'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
               'min_samples_split': [2, 5, 10],
               'min_samples_leaf': [1, 2, 4],
               'bootstrap': [True, False]}

random_grid = RandomizedSearchCV(estimator = temp, param_distributions = param_grid, n_iter = 50, cv = 5, verbose=2, random_state=50, n_jobs = -1)
random_grid.fit(X_train, Y_train)
aug_RandForest = random_grid.best_estimator_

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   22.4s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.4min finished


In [40]:
[RandForest.get_params, aug_RandForest.get_params]

[<bound method BaseEstimator.get_params of RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
             oob_score=False, random_state=50, verbose=0, warm_start=False)>,
 <bound method BaseEstimator.get_params of RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=70, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=4, min_samples_split=10,
             min_weight_fraction_leaf=0.0, n_estimators=600, n_jobs=None,
             oob_score=False, random_state=50, verbose=0, warm_start=False)>]

In [41]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
GBM = GradientBoostingClassifier(subsample = 0.8, max_depth = 5, min_samples_split= 15, max_features = 'sqrt', random_state = 50)
KNN = KNeighborsClassifier(n_neighbors = 5)

In [42]:
model_dict = {'logistic' : LogR, 
        'tuned logistic' : aug_LogR, 
        'random forest' : RandForest, 
        'tuned random forest' : aug_RandForest,
        'GBM': GBM,
        'KNN': KNN}

validation_dict = {'precision' : precision_score,
                  'recall' : recall_score,
                  'f1_score': f1_score}

for key in model_dict:
    model_dict[key].fit(X_train, Y_train)

[LogR.coef_, aug_LogR.coef_]

[array([[-0.37793425,  0.60914159,  0.4363232 , -0.60425131, -0.18617069,
         -0.23387347,  0.01194474, -0.14904875,  1.48598785, -0.6606915 ,
         -0.22735238,  0.86291872, -0.52371939,  0.02080468,  0.59466066,
          0.28335261, -0.70892056, -0.04708746]]),
 array([[-0.0477804 ,  0.52818561,  0.34995491, -0.49218562, -0.12430482,
         -0.16943933,  0.01937853, -0.08930823,  0.63646746, -0.42475767,
         -0.10705501,  0.66343909, -0.52368232, -0.05267253,  0.42250993,
          0.1706423 , -0.50025654, -0.05304409]])]

In [43]:
# Create a function to generate modelling performance metrics
def output_table(model_f, val_f, X, Y, pred_f = None):
    out = []
    for key1 in model_f.keys():
        if pred_f == None:
            Y_predict = model_f[key1].predict(X)
        else:
            Y_predict = pred_f(model_f[key1], X, Y, cv = 10)
        for key2 in val_f.keys():
            out.append(val_f[key2](Y, Y_predict))
    cols = len(val_f)
    rows = len(model_f)
    out = np.array(out)
    out.shape = (rows, cols)
    out = pd.DataFrame(out, index=model_f.keys(), columns=val_f.keys())
    return out

In [44]:
# We can have a look at the modelling performance of all four models in the training sample
output_table(model_dict, validation_dict, X_train, Y_train, cross_val_predict)

Unnamed: 0,precision,recall,f1_score
logistic,0.651445,0.66826,0.659745
tuned logistic,0.647215,0.699809,0.672485
random forest,0.59002,0.576482,0.583172
tuned random forest,0.627181,0.652964,0.639813
GBM,0.62763,0.655832,0.641421
KNN,0.612695,0.636711,0.624473


In [45]:
# We can have a look at the forecasting performance of four models in the test sample
output_table(model_dict, validation_dict, X_test, Y_test)

Unnamed: 0,precision,recall,f1_score
logistic,0.614035,0.7,0.654206
tuned logistic,0.604651,0.728,0.660617
random forest,0.571429,0.64,0.603774
tuned random forest,0.574468,0.648,0.609023
GBM,0.576271,0.68,0.623853
KNN,0.581132,0.616,0.598058


In [46]:
# Define a function mapping predicted probability to precision
def prob_precision(model, X, Y, prob_range):
    precision = []
    for prob in prob_range:
        Y_prob = pd.DataFrame(model.predict_proba(X))
        selected_index = Y_prob.index[Y_prob[1] > prob].tolist()
        Y_selected = Y[selected_index]
        TP = len(Y_selected[Y_selected == 1])
        FP = len(Y_selected[Y_selected == 0])
        precision.append(TP/(TP + FP))
    out = pd.concat([pd.DataFrame(prob_range), pd.DataFrame(precision)], axis = 1)
    out.columns = ['proba', 'precision']
    return out

In [47]:
prob_precision(LogR, X_train, Y_train, np.linspace(.5, .8, 11))

Unnamed: 0,proba,precision
0,0.5,0.660395
1,0.53,0.682022
2,0.56,0.694132
3,0.59,0.698217
4,0.62,0.717666
5,0.65,0.729167
6,0.68,0.735211
7,0.71,0.751701
8,0.74,0.756881
9,0.77,0.782353


In [48]:
prob_precision(LogR, X_test, Y_test, np.linspace(.5, .8, 11))

Unnamed: 0,proba,precision
0,0.5,0.614035
1,0.53,0.647826
2,0.56,0.673267
3,0.59,0.6875
4,0.62,0.673203
5,0.65,0.709402
6,0.68,0.755814
7,0.71,0.746479
8,0.74,0.785714
9,0.77,0.808511


In [49]:
Y_out_sample_prob = pd.DataFrame(LogR.predict_proba(X_out_sample))
X_out_sample_selected_index = Y_out_sample_prob.index[Y_out_sample_prob[1] > .68].tolist()
X_out_sample = X_out_sample.reset_index(drop=True)
X_out_sample_1st = X_out_sample.iloc[X_out_sample_selected_index, ].reset_index(drop = True)
final_out_sample_1st = final_out_sample.iloc[X_out_sample_selected_index, ].reset_index(drop = True)

In [50]:
# 2nd step modelling starts
in_sample = signups_leads_calls_all[(signups_leads_calls_all['Call Outcome'] == 'INTERESTED')]
in_sample_vars = ['Approval Decision', 'Age', 'Sector', 'Region']
out_sample_vars = ['Phone Number','Name', 'Age', 'Sector', 'Region']
char_col =  ['Region', 'Sector']
char_order = [leadsOrderByRegion.index, leadsOrderBySector.index]
in_sample_mapper = {'APPROVED':1, 'REJECTED':1, None:0}
out_sample_mapper = {}

final_in_sample = data_pipeline(in_sample, in_sample_vars, 0, char_col, char_order, in_sample_mapper)
final_out_sample = data_pipeline(out_sample, out_sample_vars, 0, char_col, char_order, out_sample_mapper)

train_set, test_set = train_test_split(final_in_sample, test_size = 0.2, random_state = 42)
train_set = train_set.reset_index(drop = True)
test_set = test_set.reset_index(drop = True)
X_train = train_set.iloc[:, 1:]
Y_train = train_set.iloc[:, 0]
X_test = test_set.iloc[:, 1:]
Y_test = test_set.iloc[:, 0]
X_full = final_in_sample.iloc[:, 1:]
Y_full = final_in_sample.iloc[:, 0]

In [51]:
LogR_2nd = LogisticRegression(solver= 'liblinear', random_state=50)

param_grid = {
     'penalty' : ['l1', 'l2'],
    'C' : np.logspace(-2, 2, 20),
    'class_weight': ['balanced', None],
    'solver' : ['liblinear']}
temp = LogisticRegression(random_state = 50)
grid_search = GridSearchCV(temp, param_grid, cv = 10)
grid_search.fit(X_train, Y_train)
aug_LogR_2nd = grid_search.best_estimator_
[LogR_2nd.get_params, aug_LogR_2nd.get_params]

[<bound method BaseEstimator.get_params of LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=50, solver='liblinear',
           tol=0.0001, verbose=0, warm_start=False)>,
 <bound method BaseEstimator.get_params of LogisticRegression(C=0.06951927961775606, class_weight=None, dual=False,
           fit_intercept=True, intercept_scaling=1, max_iter=100,
           multi_class='warn', n_jobs=None, penalty='l2', random_state=50,
           solver='liblinear', tol=0.0001, verbose=0, warm_start=False)>]

In [52]:
RandForest_2nd = RandomForestClassifier(n_estimators=100, random_state = 50)

temp = RandomForestClassifier(random_state = 50)
param_grid = {'n_estimators': [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)],
               'max_features': ['auto', 'sqrt'],
               'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
               'min_samples_split': [2, 5, 10],
               'min_samples_leaf': [1, 2, 4],
               'bootstrap': [True, False]}

random_grid = RandomizedSearchCV(estimator = temp, param_distributions = param_grid, n_iter = 50, cv = 5, verbose=2, random_state = 50, n_jobs = -1)
random_grid.fit(X_train, Y_train)
aug_RandForest_2nd = random_grid.best_estimator_

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   14.3s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  1.5min finished


In [53]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
GBM_2nd = GradientBoostingClassifier(subsample = 0.8, max_depth = 5, min_samples_split= 10, max_features = 'sqrt', random_state = 50)
KNN_2nd = KNeighborsClassifier(n_neighbors=3)

In [54]:
model_dict_2nd = {'logistic 2nd' : LogR_2nd, 
        'tuned logistic 2nd' : aug_LogR_2nd, 
        'random forest 2nd' : RandForest_2nd, 
        'tuned random 2nd' : aug_RandForest_2nd,
        'GBM 2nd': GBM_2nd,
        'KNN 2nd': KNN_2nd}

In [55]:
output_table(model_dict_2nd, validation_dict, X_train, Y_train, cross_val_predict)

Unnamed: 0,precision,recall,f1_score
logistic 2nd,0.602911,0.933977,0.732786
tuned logistic 2nd,0.602172,0.982287,0.746634
random forest 2nd,0.596899,0.619968,0.608215
tuned random 2nd,0.606178,0.758454,0.67382
GBM 2nd,0.601036,0.747182,0.666188
KNN 2nd,0.595273,0.648953,0.620955


In [56]:
for key in model_dict_2nd:
    model_dict_2nd[key].fit(X_train, Y_train)
    
output_table(model_dict_2nd, validation_dict, X_test, Y_test)

Unnamed: 0,precision,recall,f1_score
logistic 2nd,0.567797,0.911565,0.699739
tuned logistic 2nd,0.572,0.972789,0.720403
random forest 2nd,0.585987,0.62585,0.605263
tuned random 2nd,0.597938,0.789116,0.680352
GBM 2nd,0.56383,0.721088,0.632836
KNN 2nd,0.575,0.62585,0.599349


In [57]:
prob_precision(aug_LogR_2nd, X_train, Y_train, np.linspace(.5, .7, 11))

Unnamed: 0,proba,precision
0,0.5,0.604374
1,0.52,0.605817
2,0.54,0.613851
3,0.56,0.639503
4,0.58,0.649682
5,0.6,0.651042
6,0.62,0.681529
7,0.64,0.6917
8,0.66,0.732673
9,0.68,0.735294


In [132]:
largest_1000_index = np.argsort(aug_LogR_2nd.predict_proba(X_out_sample_1st)[:,1])[-1000:]
final_out_sample_1st.iloc[largest_1000_index, 1]

215            Dedra ADKINS
402          Pennie MURILLO
1147            Lars RITTER
1060             Neil MEYER
272       Alexandra HAMMOND
1192            Doris SCOTT
23               Gage HOUSE
1027            Marco OLSON
1263            Sal HERRING
876           Tia ARMSTRONG
952          Omer MCFARLAND
49           Willie HANCOCK
174            Beryl BURTON
175               Ali PATEL
665              Rosco SHAH
891          Regan FRANKLIN
1253            Fannie KOCH
1036        Bernadine BOONE
243       Krystle RASMUSSEN
733            Irena WAGNER
1146            Oran HAYNES
575            Miracle DUNN
1029    Hildegard GILLESPIE
915           Shalonda MEZA
1131            Tony WAGNER
846              Bette MEZA
1248             Gregg HOLT
499             Lyman RAMOS
224            Melody NOLAN
277         Milburn PEARSON
               ...         
60              Young PRATT
678         Zita STEPHENSON
732             Boss LARSON
255         Brennon RICHARD
231           Tamra 

From the 1000 called leads, we can derive the expected signups by multiplying the predicted probability in the 1st step modelling by the predicted probability in the 2nd step modelling and then sum up.

In [117]:
final_out_sample_2nd = final_out_sample_1st.iloc[largest_1000_index, 2:]
p1 = LogR.predict_proba(final_out_sample_2nd)[:, 1]
p2 = aug_LogR_2nd.predict_proba(final_out_sample_2nd)[:, 1]
np.sum(p1*p2)

441.165410604176

The expected signups is therefore around 564.

Finnaly, we want to check by taking all four characteristic features into account, which agent tend to reach more signups in the called leads . We first try to create the predictive model in one go.

In [70]:
in_sample = signups_leads_calls_all[(signups_leads_calls_all['Call Outcome'] == 'INTERESTED') | (signups_leads_calls_all['Call Outcome'] == 'NOT INTERESTED')]
in_sample_vars = ['Approval Decision', 'Age', 'Sector', 'Region', 'Agent']
char_col =  ['Region', 'Sector', 'Agent']
char_order = [leadsOrderByRegion.index, leadsOrderBySector.index, signups_rate.index]
in_sample_mapper = {'APPROVED':1,
                    'REJECTED':1,
                    None: 0}

final_in_sample = data_pipeline(in_sample, in_sample_vars, 0, char_col, char_order, in_sample_mapper)
train_set, test_set = train_test_split(final_in_sample, test_size = 0.2, random_state = 42)
train_set = train_set.reset_index(drop = True)
test_set = test_set.reset_index(drop = True)

X_train = train_set.iloc[:, 1:]
Y_train = train_set.iloc[:, 0]
X_test = test_set.iloc[:, 1:]
Y_test = test_set.iloc[:, 0]
X_full = final_in_sample.iloc[:, 1:]
Y_full = final_in_sample.iloc[:, 0]

In [71]:
LogR_extended = LogisticRegression(solver= 'liblinear', random_state = 50)

param_grid = {'penalty' : ['l1', 'l2'],
              'C' : np.logspace(-2, 2, 20),
              'class_weight': ['balanced', None],
              'solver' : ['liblinear']}
temp = LogisticRegression(random_state = 50)
grid_search = GridSearchCV(temp, param_grid, cv = 10)
grid_search.fit(X_train, Y_train)
aug_LogR_extended = grid_search.best_estimator_
[LogR_extended.get_params, aug_LogR_extended.get_params]

[<bound method BaseEstimator.get_params of LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=50, solver='liblinear',
           tol=0.0001, verbose=0, warm_start=False)>,
 <bound method BaseEstimator.get_params of LogisticRegression(C=0.7847599703514611, class_weight=None, dual=False,
           fit_intercept=True, intercept_scaling=1, max_iter=100,
           multi_class='warn', n_jobs=None, penalty='l1', random_state=50,
           solver='liblinear', tol=0.0001, verbose=0, warm_start=False)>]

In [72]:
RandForest_extended = RandomForestClassifier(n_estimators=100, random_state = 50)

temp = RandomForestClassifier(random_state = 50)
param_grid = {'n_estimators': [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)],
               'max_features': ['auto', 'sqrt'],
               'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
               'min_samples_split': [2, 5, 10],
               'min_samples_leaf': [1, 2, 4],
               'bootstrap': [True, False]}

random_grid = RandomizedSearchCV(estimator = temp, param_distributions = param_grid, n_iter = 50, cv = 5, verbose=2, random_state=50, n_jobs = -1)
random_grid.fit(X_train, Y_train)
aug_RandForest_extended = random_grid.best_estimator_

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   19.0s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.2min finished


In [73]:
model_dict_extended = {'logistic' : LogR_extended, 
        'tuned logistic' : aug_LogR_extended, 
        'random forest' : RandForest_extended, 
        'tuned random forest' : aug_RandForest_extended}

validation_dict = {'precision' : precision_score,
                  'recall' : recall_score,
                   'roc_auc' : roc_auc_score,
                  'f1_score': f1_score}


LogR_extended.fit(X_train, Y_train)
aug_LogR_extended.fit(X_train, Y_train)
RandForest_extended.fit(X_train, Y_train)
aug_RandForest_extended.fit(X_train, Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=900, n_jobs=None,
            oob_score=False, random_state=50, verbose=0, warm_start=False)

In [74]:
[LogR_extended.coef_, aug_LogR_extended.coef_]

[array([[-0.42712563,  0.34687899,  0.41486207, -0.51101347, -0.15508759,
         -0.12175836,  0.08120608, -0.13755526,  0.14506644, -0.4488959 ,
         -0.05799832,  0.34300817, -0.37101835,  0.11800854,  0.3667841 ,
          0.28667232, -0.80324264, -0.38450747,  1.19576696,  0.32687267,
          0.05028557, -0.62173451, -1.39548602]]),
 array([[-0.23605892,  0.39004618,  0.45149874, -0.43949807, -0.07241232,
         -0.04094986,  0.10659246, -0.01978802,  0.08008299, -0.34594331,
          0.        ,  0.11790474, -0.5742477 , -0.08331351,  0.1354734 ,
          0.05253265, -0.98271527, -0.49401321,  0.88883388,  0.02743826,
         -0.19042074, -0.90387519, -1.68697601]])]

In [75]:
output_table(model_dict_extended, validation_dict, X_train, Y_train, cross_val_predict)

Unnamed: 0,precision,recall,roc_auc,f1_score
logistic,0.565891,0.237398,0.578411,0.334479
tuned logistic,0.581301,0.23252,0.57921,0.332172
random forest,0.421053,0.377236,0.57387,0.397942
tuned random forest,0.535398,0.196748,0.560604,0.287753


In [76]:
output_table(model_dict_extended, validation_dict, X_test, Y_test, cross_val_predict)

Unnamed: 0,precision,recall,roc_auc,f1_score
logistic,0.565789,0.281046,0.593245,0.375546
tuned logistic,0.555556,0.261438,0.584874,0.355556
random forest,0.346774,0.281046,0.524477,0.310469
tuned random forest,0.52381,0.143791,0.543242,0.225641


In [77]:
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier(n_estimators = 50)
model.fit(X_train, Y_train)
importance = pd.Series(model.feature_importances_)
features = pd.Series(X_train.columns)
importance_table = pd.concat([features, importance], axis = 1)
importance_table.columns = ['feature', 'importance']
importance_table.sort_values('importance', ascending = False)

Unnamed: 0,feature,importance
0,Age,0.683798
19,red,0.031481
22,black,0.026105
21,orange,0.020493
18,blue,0.019172
12,retail,0.015997
1,north-west,0.015011
6,south-east,0.014789
13,food,0.014665
3,midlands,0.014558


It is probabily due to the fact that Agents 'red', 'black', 'blue' and 'orange' are all import predictors that the base logistic predicting model has decent precision score. So we can estimate the signups using this one-go predictive model. 

In [79]:
X_agent = X_full
signups_predict = np.zeros(5)
Agent = ['blue', 'red', 'green', 'orange', 'black']
for i in range(5):
    extend = np.zeros(5)
    extend[i] = 1
    X_agent[Agent] = extend
    signups_predict[i] = sum(LogR_extended.predict(X_agent))

signups_predict = pd.Series(signups_predict)
signups_predict.index = Agent
signups_predict

blue      2247.0
red        839.0
green      431.0
orange       0.0
black        0.0
dtype: float64

In [80]:
# becuase the forecasting performance of one go modelling is not convincing, we are now switching to two-step modelling method.
# 1st step starts
in_sample = signups_leads_calls_all[(signups_leads_calls_all['Call Outcome'] == 'INTERESTED') | (signups_leads_calls_all['Call Outcome'] == 'NOT INTERESTED')]
in_sample_vars = ['Call Outcome', 'Age', 'Sector', 'Region', 'Agent']
char_col =  ['Region', 'Sector', 'Agent']
char_order = [leadsOrderByRegion.index, leadsOrderBySector.index, signups_rate.index]
in_sample_mapper = {'INTERESTED':1, 'NOT INTERESTED':0}
out_sample_mapper = {}

final_in_sample = data_pipeline(in_sample, in_sample_vars, 0, char_col, char_order, in_sample_mapper)
train_set, test_set = train_test_split(final_in_sample, test_size = 0.2, random_state = 42)
train_set = train_set.reset_index(drop = True)
test_set = test_set.reset_index(drop = True)

X_train = train_set.iloc[:, 1:]
Y_train = train_set.iloc[:, 0]
X_test = test_set.iloc[:, 1:]
Y_test = test_set.iloc[:, 0]
X_full = final_in_sample.iloc[:, 1:]
Y_full = final_in_sample.iloc[:, 0]

In [81]:
LogR_extended = LogisticRegression(solver= 'liblinear', random_state=50)

param_grid = {
     'penalty' : ['l1', 'l2'],
    'C' : np.logspace(-2, 2, 20),
    'class_weight': ['balanced', None],
    'solver' : ['liblinear']}
temp = LogisticRegression(random_state = 50)
grid_search = GridSearchCV(temp, param_grid, cv = 10)
grid_search.fit(X_train, Y_train)
aug_LogR_extended = grid_search.best_estimator_
[LogR_extended.get_params, aug_LogR_extended.get_params]

[<bound method BaseEstimator.get_params of LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=50, solver='liblinear',
           tol=0.0001, verbose=0, warm_start=False)>,
 <bound method BaseEstimator.get_params of LogisticRegression(C=0.29763514416313175, class_weight='balanced', dual=False,
           fit_intercept=True, intercept_scaling=1, max_iter=100,
           multi_class='warn', n_jobs=None, penalty='l1', random_state=50,
           solver='liblinear', tol=0.0001, verbose=0, warm_start=False)>]

In [82]:
RandForest_extended = RandomForestClassifier(n_estimators=100, random_state = 50)

temp = RandomForestClassifier(random_state = 50)
param_grid = {'n_estimators': [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)],
               'max_features': ['auto', 'sqrt'],
               'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
               'min_samples_split': [2, 5, 10],
               'min_samples_leaf': [1, 2, 4],
               'bootstrap': [True, False]}

random_grid = RandomizedSearchCV(estimator = temp, param_distributions = param_grid, n_iter = 50, cv = 5, verbose=2, random_state=50, n_jobs = -1)
random_grid.fit(X_train, Y_train)
aug_RandForest_extended = random_grid.best_estimator_

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   18.5s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.2min finished


In [83]:
model_dict_extend = {'logistic' : LogR_extended, 
        'tuned logistic' : aug_LogR_extended, 
        'random forest' : RandForest_extended, 
        'tuned random forest' : aug_RandForest_extended}

validation_dict = {'precision' : precision_score,
                  'recall' : recall_score,
                   'roc_auc' : roc_auc_score,
                  'f1_score': f1_score}

LogR_extended.fit(X_train, Y_train)
aug_LogR_extended.fit(X_train, Y_train)
RandForest_extended.fit(X_train, Y_train)
aug_RandForest_extended.fit(X_train, Y_train)

[LogR_extended.coef_, aug_LogR_extended.coef_]

[array([[-5.23481611e-01,  6.10970594e-01,  4.46955690e-01,
         -6.52963170e-01, -2.38301838e-01, -2.41456166e-01,
         -2.29392503e-02, -1.08240604e-01,  1.54029661e+00,
         -7.56019513e-01, -2.82139524e-01,  8.56037270e-01,
         -5.70099167e-01,  1.18084234e-04,  5.85866462e-01,
          2.62702941e-01, -8.01872075e-01, -3.65906927e-02,
          9.51754724e-01,  2.70244784e-01,  7.21251771e-02,
         -2.46117045e-01, -7.51844818e-01]]),
 array([[ 0.        ,  0.74078937,  0.55516255, -0.4253809 ,  0.        ,
         -0.00491947,  0.0782384 ,  0.        ,  1.43939271, -0.45003213,
          0.        ,  0.78998769, -0.55152453,  0.        ,  0.5101017 ,
          0.17245713, -0.6896387 ,  0.        ,  0.65096251,  0.16637206,
          0.        , -0.28616908, -0.76282779]])]

In [84]:
output_table(model_dict_extended, validation_dict, X_train, Y_train, cross_val_predict)

Unnamed: 0,precision,recall,roc_auc,f1_score
logistic,0.66385,0.675908,0.651301,0.669825
tuned logistic,0.664767,0.669216,0.650562,0.666984
random forest,0.603561,0.615679,0.587297,0.60956
tuned random forest,0.642534,0.678776,0.633444,0.660158


In [85]:
output_table(model_dict_extended, validation_dict, X_test, Y_test, cross_val_predict)

Unnamed: 0,precision,recall,roc_auc,f1_score
logistic,0.644628,0.624,0.641365,0.634146
tuned logistic,0.640496,0.62,0.637381,0.630081
random forest,0.511719,0.524,0.513984,0.517787
tuned random forest,0.60166,0.58,0.599524,0.590631


Because the base logistic model outperforms the rest, we shall use the base logistic model for the 1st step

In [86]:
X_agent = X_full

In [87]:
# 2nd step 
in_sample = signups_leads_calls_all[(signups_leads_calls_all['Call Outcome'] == 'INTERESTED')]
in_sample_vars = ['Approval Decision', 'Age', 'Sector', 'Region', 'Agent']
char_col =  ['Region', 'Sector', 'Agent']
char_order = [leadsOrderByRegion.index, leadsOrderBySector.index, signups_rate.index]
in_sample_mapper = {'APPROVED':1, 'REJECTED':1, None:0}
final_in_sample_extended = data_pipeline(in_sample, in_sample_vars, 0, char_col, char_order, in_sample_mapper)
train_set, test_set = train_test_split(final_in_sample_extended, test_size = 0.2, random_state = 42)
train_set.reset_index(drop = True)
test_set.reset_index(drop = True)
X_train = train_set.iloc[:, 1:]
Y_train = train_set.iloc[:, 0]
X_test = test_set.iloc[:, 1:]
Y_test = test_set.iloc[:, 0]

In [88]:
from sklearn.model_selection import GridSearchCV

LogR_extended_2nd = LogisticRegression(solver= 'liblinear')

param_grid = {
     'penalty' : ['l1', 'l2'],
    'C' : np.logspace(-5, 5, 20),
    'class_weight': ['balanced', None],
    'solver' : ['liblinear']}
temp = LogisticRegression()
grid_search = GridSearchCV(temp, param_grid, cv = 10)
grid_search.fit(X_train, Y_train)
aug_LogR_extended_2nd = grid_search.best_estimator_
[LogR_extended_2nd.get_params, aug_LogR_extended_2nd.get_params]

[<bound method BaseEstimator.get_params of LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
           tol=0.0001, verbose=0, warm_start=False)>,
 <bound method BaseEstimator.get_params of LogisticRegression(C=0.1623776739188721, class_weight=None, dual=False,
           fit_intercept=True, intercept_scaling=1, max_iter=100,
           multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
           solver='liblinear', tol=0.0001, verbose=0, warm_start=False)>]

In [89]:
RandForest_extended_2nd = RandomForestClassifier(n_estimators=100, random_state = 50)

temp = RandomForestClassifier(random_state = 50)
param_grid = {'n_estimators': [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)],
               'max_features': ['auto', 'sqrt'],
               'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
               'min_samples_split': [2, 5, 10],
               'min_samples_leaf': [1, 2, 4],
               'bootstrap': [True, False]}

random_grid = RandomizedSearchCV(estimator = temp, param_distributions = param_grid, n_iter = 50, cv = 5, verbose=2, random_state=42, n_jobs = -1)
random_grid.fit(X_train, Y_train)
aug_RandForest_extended_2nd = random_grid.best_estimator_

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   27.6s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.8min finished


In [90]:
[RandForest_extended_2nd.get_params, aug_LogR_extended_2nd.get_params]

[<bound method BaseEstimator.get_params of RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
             oob_score=False, random_state=50, verbose=0, warm_start=False)>,
 <bound method BaseEstimator.get_params of LogisticRegression(C=0.1623776739188721, class_weight=None, dual=False,
           fit_intercept=True, intercept_scaling=1, max_iter=100,
           multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
           solver='liblinear', tol=0.0001, verbose=0, warm_start=False)>]

In [91]:
model_dict_extended_2nd = {'logistic extended 2nd':LogR_extended_2nd, 
        'tuned logistic extended 2nd':aug_LogR_extended_2nd, 
        'random forest extended 2nd':RandForest_extended_2nd, 
        'tuned random forest extended 2nd':aug_RandForest_extended_2nd}

LogR_extended_2nd.fit(X_train, Y_train)
aug_LogR_extended_2nd.fit(X_train, Y_train)
RandForest_extended_2nd.fit(X_train, Y_train)
aug_RandForest_extended_2nd.fit(X_train, Y_train)

[LogR_extended_2nd.coef_, aug_LogR_extended_2nd.coef_]

[array([[-0.11904204,  0.04746577,  0.48058017,  0.05188084,  0.04688377,
          0.13318816,  0.11924707, -0.17678567, -0.50970723,  0.11895485,
         -0.02726972,  0.01400882,  0.2573736 ,  0.50388326,  0.30642185,
          0.04997604, -0.2367378 , -0.61048776,  1.13587193,  0.61848788,
          0.18898966, -0.52842011, -1.13049135]]),
 array([[-3.10153927e-02,  3.15613020e-02,  3.80964350e-01,
          3.74313613e-02,  4.41789673e-02,  9.81670613e-02,
          8.30915176e-02, -1.44597983e-01, -3.34894213e-01,
          8.95175111e-02, -2.92974080e-02, -3.94551437e-02,
          1.89561517e-01,  4.15006834e-01,  2.25424369e-01,
         -2.94870884e-04, -1.63507057e-01, -3.70613181e-01,
          7.25735243e-01,  6.39111781e-01,  1.99517177e-01,
         -4.18148767e-01, -8.90092966e-01]])]

In [92]:
output_table(model_dict_extended_2nd, validation_dict, X_train, Y_train, cross_val_predict)

Unnamed: 0,precision,recall,roc_auc,f1_score
logistic extended 2nd,0.676838,0.785829,0.612192,0.727273
tuned logistic extended 2nd,0.67663,0.801932,0.614219,0.733972
random forest extended 2nd,0.652106,0.673108,0.567879,0.662441
tuned random forest extended 2nd,0.662011,0.763285,0.590076,0.70905


In [93]:
output_table(model_dict_extended_2nd, validation_dict, X_test, Y_test)

Unnamed: 0,precision,recall,roc_auc,f1_score
logistic extended 2nd,0.644444,0.789116,0.611372,0.70948
tuned logistic extended 2nd,0.646409,0.795918,0.614773,0.713415
random forest extended 2nd,0.626667,0.639456,0.57194,0.632997
tuned random forest extended 2nd,0.670659,0.761905,0.63759,0.713376


Now we can run regression to see the predicted leads for each agent by setting each of the agent to be 1 and dropping the others

In [94]:
signups_predict = np.zeros(5)
Agent = ['blue', 'red', 'green', 'orange', 'black']

for i in range(5):
    extend = np.zeros(5)
    extend[i] = 1
    X_agent[Agent] = extend 
    selected_index = X_agent.index[LogR_extended.predict(X_agent) == 1].tolist()
    X_in_sample = X_agent.iloc[selected_index, :]
    signups_predict[i] = sum(aug_RandForest_extended_2nd.predict(X_in_sample))

signups_predict = pd.Series(signups_predict)
signups_predict.index = Agent
signups_predict

blue      2312.0
red       1756.0
green     1599.0
orange     526.0
black      116.0
dtype: float64

Since the 2-step modelling method is more appropriate, I shall use the 2-step modelling method to estimata the signups for each agent for the 1000 selected leads.

In [133]:
X_agent = final_out_sample_2nd.reset_index(drop=True)
X_agent['blue'] = 0
X_agent['red'] = 0
X_agent['green'] = 0
X_agent['orange'] = 0
X_agent['black'] = 0

X_agent = X_agent.reset_index(drop = True)
for i in range(5):
    extend = np.zeros(5)
    extend[i] = 1
    X_agent[Agent] = extend 
    selected_index = X_agent.index[LogR_extended.predict(X_agent) == 1].tolist()
    X = X_agent.iloc[selected_index, :]
    pr1 = LogR_extended.predict_proba(X)
    pr2 = aug_RandForest_extended_2nd.predict_proba(X)
    signups_predict[i] = np.sum(pr1 * pr2)

signups_predict

blue      704.531561
red       600.753574
green     538.084104
orange    492.374353
black     270.076941
dtype: float64