# Intuit Quickbooks Upgrade

* Team-lead GitLab userid:
* Group name:
* Team member names:

## Setup

Please complete this Python notebook with your group by answering the questions in `intuit-redux.pdf`. Create a Notebook and HTML file with all your results and comments, and push both files to GitLab when your team is done. All results MUST be reproducible (i.e., the TA and I must be able to recreate the HTML file from the Jupyter Notebook without changes or errors). This means that you should NOT use any Python packages that are not part of the rsm-msba-spark docker container.

This is the second group assignment for MGTA 455 and you will be using Git and GitLab. If two people edit the same file at the same time you could get what is called a "merge conflict". This is not serious, but Git will not decide for you whose change to accept, so the team-lead will have to determine which edits to use. To avoid merge conflicts, **always** "pull" changes to the repo before you start working on any files. Then, when you are done, save and commit your changes, and push them to GitLab. Make "pull first" a habit!

If multiple people are going to work on the assignment at the same time, I recommend you work in different notebooks. You can then `%run ...` these "sub" notebooks from the main assignment file. You can see an example of this in action below for the `model1.ipynb` notebook.

Some group work-flow tips:

* Pull, edit, save, stage, commit, and push
* Schedule who does what and when
* Try to avoid working simultaneously on the same file 
* If you are going to work simultaneously, do it in different notebooks, e.g., 
    - model1.ipynb, model2.ipynb, model3.ipynb
* Use the `%run ...` command to bring different pieces of code together into the main Jupyter notebook
* Put Python functions in modules that you can import from your notebooks. See the example below for the `example` function defined in `utils/functions.py`
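
As a rough sketch of the last tip, a shared helper could live in `utils/functions.py` and be imported by every sub-notebook. The helper `mailing_cost` and its default value are hypothetical and only illustrate the pattern; they are not the `example` function referred to above.

```python
# Hypothetical contents of utils/functions.py (illustrative helper only).
def mailing_cost(n_pieces: int, cost_per_piece: float = 1.41) -> float:
    """Return the total cost of sending n_pieces mailings."""
    if n_pieces < 0:
        raise ValueError("n_pieces must be non-negative")
    return n_pieces * cost_per_piece


# In a notebook stored next to the utils/ folder, the helper would be
# imported instead of being redefined in each notebook:
#     from utils.functions import mailing_cost
print(mailing_cost(100))
```

Keeping helpers in a module means each team member edits notebooks, not shared function definitions, which reduces the chance of merge conflicts.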

A graphical depiction of the group work-flow is shown below:

![](images/git-group-workflow-wbg.png)

Tutorial videos about using Git, GitLab, and GitGadget for group assignments:

* Setup the MSBA server to use Git and GitLab: https://youtu.be/zJHwodmjatY
* Dealing with Merge Conflicts: https://youtu.be/qFnyb8_rgTI
* Group assignment practice: https://youtu.be/4Ty_94gIWeA

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyrsm as rsm
import statsmodels.formula.api as smf
from sklearn import preprocessing
from statsmodels.genmod.families import Binomial
from statsmodels.genmod.families.links import logit
import xgboost as xgb
from sklearn import metrics
from pyrsm import profit_max, confusion, profit_plot, gains_plot, lift_plot, ROME_plot
from sklearn.model_selection import RandomizedSearchCV


In [None]:
mpl.rcParams["figure.dpi"] = 150

In [None]:
## loading the data - this dataset must NOT be changed
intuit75k = pd.read_pickle("../data/intuit75k.pkl")
intuit75k["res1_yes"] = (intuit75k["res1"] == "Yes").astype(int)
intuit75k.head()

In [None]:
# show dataset description
rsm.describe(intuit75k)

In [None]:
intuit75k.zip_bins = intuit75k.zip_bins.astype(object)

In [None]:
intuit75k = intuit75k.join(pd.get_dummies(intuit75k.sex), how='inner')
intuit75k = intuit75k.join(pd.get_dummies(intuit75k.zip_bins), how='inner')
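
The two joins above one-hot encode `sex` and `zip_bins`. A minimal sketch on made-up data shows what the result looks like. The toy values and the `prefix` argument are assumptions added for illustration; the notebook itself does not use a prefix, so its zip-bin columns are named by the bin numbers.

```python
import pandas as pd

# Toy stand-in for intuit75k; the values are made up for illustration.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Unknown", "Female"],
    "zip_bins": [1, 20, 1, 5],
})

# One indicator column per level of sex.
sex_dummies = pd.get_dummies(toy["sex"])

# Casting to object treats the bin numbers as labels; the prefix (an
# addition, not in the notebook) keeps the new column names readable strings.
zip_dummies = pd.get_dummies(toy["zip_bins"].astype(object), prefix="zip_bin")

encoded = toy.join(sex_dummies).join(zip_dummies)
print(encoded.columns.tolist())
```

The original `sex` and `zip_bins` columns remain in the frame after the join, which is why they are dropped again when the feature matrices are built.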

In [None]:
intuit75k.dtypes

In [None]:
intuit_train = intuit75k.query('training == 1').reset_index()
intuit_val = intuit75k.query('training == 0').reset_index()

In [None]:
X_train = intuit_train.drop(columns=['id','zip', 'zip_bins','res1','res1_yes','training','sex','index'])
y_train = intuit_train[['res1_yes']]
X_test = intuit_val.drop(columns=['id','zip', 'zip_bins','res1','res1_yes','training','sex','index'])
y_test = intuit_val[['res1_yes']]

In [None]:
xgb_clf = xgb.XGBClassifier(objective='binary:logistic', n_estimators=1000, seed=123, max_depth=2, n_jobs=6, use_label_encoder=False,reg_lambda=3, learning_rate=0.3)
xgb_clf.fit(X_train, y_train.values.ravel(), early_stopping_rounds=10, eval_metric="auc", verbose=True, eval_set=[(X_test, y_test.values.ravel())])

In [None]:
# Predicted probability of the positive class (column 1) on the test set
pred = xgb_clf.predict_proba(X_test)
probs = pd.Series(pred[:, 1])

# Predicted probability of the positive class (column 1) on the train set
pred_train = xgb_clf.predict_proba(X_train)
probs_train = pd.Series(pred_train[:, 1])

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_test.res1_yes, pred[:,1])
print(f'Test data auc is {metrics.auc(fpr,tpr)}')

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_train.res1_yes, pred_train[:,1])
print(f'Train data auc is {metrics.auc(fpr,tpr)}')
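
To interpret the two AUC values: the AUC equals the probability that a randomly chosen responder gets a higher predicted score than a randomly chosen non-responder. The sketch below computes that directly on a small made-up example. The function name `auc_by_pairs` is hypothetical and is not part of scikit-learn or pyrsm.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """Share of (responder, non-responder) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every responder score with every non-responder score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y, p))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A train AUC that is much higher than the test AUC would suggest the model is overfitting the training data.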

In [None]:
breakeven = 1.41/30
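
The divisor of 30 appears to combine the margin of 60 used in the profit projection further down with the halving of the response rate for the second wave. That reading is inferred from the code in this notebook, not stated in it. A short worked version under that assumption:

```python
cost_per_mail = 1.41     # mailing cost used throughout the notebook
margin = 60              # margin per responder, as used in the profit projection
wave2_adjustment = 0.5   # assumed: wave-2 response is half the predicted rate

# Mail a business when: predicted_prob * wave2_adjustment * margin > cost_per_mail
breakeven = cost_per_mail / (margin * wave2_adjustment)
print(round(breakeven, 4))  # 0.047
```

In words, a business is worth mailing only if its predicted wave-1 response probability exceeds roughly 4.7%.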

In [None]:
pred_prof = probs.rename('predictions_xgb_1_test')

pred_prof_train = probs_train.rename('predictions_xgb_1_train')

df_test = y_test.join(pred_prof, how='inner')

In [None]:
p = profit_max(df_test,'res1_yes',1,'predictions_xgb_1_test',1.41,30)

print(f'The profit for {xgb_clf} on the test data is ${round(p,3)}')

In [None]:
TP, FP, TN, FN, contact = confusion(df_test,'res1_yes',1,'predictions_xgb_1_test',1.41,30)

print(f'TP: {TP}')
print(f'TN: {TN}')
print(f'FP: {FP}')
print(f'FN: {FN}')
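
The internals of the imported `confusion` helper are not shown here. The sketch below is one plausible reading of what it returns, assuming a business is targeted when its predicted probability exceeds the breakeven rate and that `contact` is the share of businesses targeted. The function `confusion_counts` and the toy data are hypothetical.

```python
import numpy as np

def confusion_counts(y_true, prob, threshold):
    """Count TP, FP, TN, FN when everyone with prob > threshold is targeted."""
    y_true = np.asarray(y_true)
    target = np.asarray(prob) > threshold
    tp = int(np.sum(target & (y_true == 1)))    # mailed and responded
    fp = int(np.sum(target & (y_true == 0)))    # mailed, no response
    tn = int(np.sum(~target & (y_true == 0)))   # not mailed, would not respond
    fn = int(np.sum(~target & (y_true == 1)))   # not mailed, would have responded
    contact = float(target.mean())              # share of businesses mailed
    return tp, fp, tn, fn, contact

y = [1, 0, 0, 1, 0]
prob = [0.30, 0.02, 0.10, 0.01, 0.04]
tp, fp, tn, fn, contact = confusion_counts(y, prob, 1.41 / 30)
print(tp, fp, tn, fn, contact)
```

False negatives are the costly mistakes here: each one is a responder who was never mailed.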

In [None]:
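# Target every business whose predicted probability exceeds the breakeven rate
df_test['target_xgb'] = (df_test.predictions_xgb_1_test > breakeven).astype(int)

# Businesses eligible for wave 2: all businesses minus those who already responded
total_biz = 801821
already_resp = 38487
population = total_biz - already_resp
response_rate_xgb = np.mean(df_test[df_test.target_xgb == 1]['res1_yes'])
targets = population * contact
# The wave-2 response rate is taken to be half of the wave-1 rate among targets
responses = targets * (response_rate_xgb/2)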

In [None]:
cost = targets * 1.41
rev = responses * 60
profit = rev - cost
print(f'The projected profit for businesses that did not respond to wave 1 but will be mailed a second time is ${round(profit,2)}')
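
The projection scales the validation-set results to the full population of non-responders. The arithmetic can be wrapped in one function, sketched below. The function name is hypothetical, and the contact rate of 0.4 and response rate of 0.1 are placeholder inputs, not results from this notebook.

```python
def projected_wave2_profit(population, contact_rate, response_rate,
                           cost=1.41, margin=60, wave2_adjustment=0.5):
    """Scale validation-set rates up to the full wave-2 population."""
    targets = population * contact_rate                         # businesses mailed
    responses = targets * response_rate * wave2_adjustment      # expected buyers
    return responses * margin - targets * cost                  # revenue minus cost

population = 801821 - 38487  # all businesses minus wave-1 responders
profit = projected_wave2_profit(population, contact_rate=0.4, response_rate=0.1)
print(round(profit, 2))
```

Writing the calculation as a function makes it easy to compare the projected profit of several models with the same assumptions.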

In [None]:
mpl.rcParams["figure.dpi"] = 500
xgb.plot_tree(xgb_clf, rankdir='LR')

In [None]:
clf = xgb.XGBClassifier(objective='binary:logistic',seed=123, use_label_encoder=False)

In [None]:
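# Parameter names are passed without a step prefix because `clf` is a bare
# estimator; a `step__param` prefix only applies inside a sklearn Pipeline
gbm_param_grid = {
    'learning_rate': np.arange(0.05, 0.4, 0.05),
    'max_depth': np.arange(1, 6, 1),
    'n_estimators': np.arange(1000, 2000, 100)
}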
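
It helps to know how much of this grid a randomized search will actually cover. `RandomizedSearchCV` draws only `n_iter` combinations (10 by default) and, with the default 5-fold cross-validation, fits each of them five times. The sketch below counts both. It uses unprefixed parameter names, which is what a bare estimator expects; a `step__param` prefix is only meaningful when the estimator is a `Pipeline` with a step of that name.

```python
import numpy as np

param_grid = {
    "learning_rate": np.arange(0.05, 0.4, 0.05),
    "max_depth": np.arange(1, 6, 1),
    "n_estimators": np.arange(1000, 2000, 100),
}

# Number of distinct parameter combinations in the full grid.
n_combinations = int(np.prod([len(v) for v in param_grid.values()]))

# scikit-learn defaults for RandomizedSearchCV when neither is specified.
n_iter, cv = 10, 5
total_fits = n_iter * cv

print(f"{n_combinations} combinations in the grid, {total_fits} model fits in the search")
```

With 1,000 or more trees per model, even this small sample of the grid can take a while to run, so it may be worth setting `n_iter` explicitly.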

In [None]:
from sklearn.metrics import make_scorer

def profit_scoring(y_true, y_pred):
    profit = rsm.profit(pd.Series(y_true), pd.Series(y_pred), 1, 1.41, 30)
    return profit

profit_score = make_scorer(profit_scoring, greater_is_better = True, needs_proba=True)

In [None]:
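# Pass the scorer built with make_scorer, not the raw metric function;
# despite its name, this search optimizes profit rather than ROC AUC
randomized_roc_auc = RandomizedSearchCV(scoring=profit_score, verbose=1, estimator=clf, param_distributions=gbm_param_grid, n_jobs=6)
randomized_roc_auc.fit(X_train, y_train.values.ravel())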
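
`RandomizedSearchCV` expects `scoring` to be a scorer: either the object returned by `make_scorer` or a callable with the signature `(estimator, X, y)`. A plain metric function with the signature `(y_true, y_pred)` does not qualify. The sketch below shows the callable form. The profit calculation is an assumption about what a profit metric could look like (mail everyone above breakeven); it is not confirmed to match `rsm.profit`. `FixedProbModel` is a hypothetical stand-in for a fitted classifier.

```python
import numpy as np

COST, MARGIN = 1.41, 30

def profit_at_breakeven(y_true, prob, cost=COST, margin=MARGIN):
    """Profit from mailing everyone whose probability exceeds cost / margin."""
    y_true = np.asarray(y_true)
    mailed = np.asarray(prob) > cost / margin
    return float(margin * np.sum(y_true[mailed] == 1) - cost * np.sum(mailed))

def profit_scorer(estimator, X, y):
    """Callable with the (estimator, X, y) signature accepted by `scoring`."""
    prob = estimator.predict_proba(X)[:, 1]
    return profit_at_breakeven(y, prob)

class FixedProbModel:
    """Tiny stand-in for a fitted classifier, used only to exercise the scorer."""
    def __init__(self, prob):
        self.prob = np.asarray(prob, dtype=float)

    def predict_proba(self, X):
        # Column 0: probability of no response, column 1: probability of response.
        return np.column_stack([1 - self.prob, self.prob])

model = FixedProbModel([0.30, 0.02, 0.10, 0.01])
score = profit_scorer(model, X=None, y=[1, 0, 0, 1])
print(round(score, 2))  # one responder among two mailed: 30 - 2 * 1.41 = 27.18
```

Because higher profit is better, a scorer like this can be handed to the search as is, and the best parameters are then the ones that maximize cross-validated profit.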