# Intuit QuickBooks Upgrade

* Team-lead GitLab userid:
* Group name:
* Team member names:

## Setup

Please complete this Python notebook with your group by answering the questions in `intuit-redux.pdf`. Create a Notebook and HTML file with all your results and comments and push both files to GitLab when your team is done. All results MUST be reproducible (i.e., the TA and I must be able to recreate the HTML file from the Jupyter Notebook without changes or errors). This means that you should NOT use any Python packages that are not part of the rsm-msba-spark docker container.

This is the second group assignment for MGTA 455 and you will be using Git and GitLab. If two people edit the same file at the same time you could get what is called a "merge conflict". This is not serious, but Git will not decide for you whose change to accept, so the team-lead will have to determine which edits to use. To avoid merge conflicts, **always** "pull" changes to the repo before you start working on any files. Then, when you are done, save and commit your changes, and then push them to GitLab. Make "pull first" a habit!

If multiple people are going to work on the assignment at the same time, I recommend you work in different notebooks. You can then `%run ...` these "sub" notebooks from the main assignment file. You can see an example of this in action below for the `model1.ipynb` notebook.

Some group work-flow tips:

* Pull, edit, save, stage, commit, and push
* Schedule who does what and when
* Try to avoid working simultaneously on the same file 
* If you are going to work simultaneously, do it in different notebooks, e.g., 
    - model1.ipynb, model2.ipynb, model3.ipynb
* Use the `%run ...` command to bring different pieces of code together into the main Jupyter notebook
* Put Python functions in modules that you can import from your notebooks. See the example below for the `example` function defined in `utils/functions.py`
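
As a rough sketch of the last tip, a helper kept in `utils/functions.py` might look like the following. The body of `example` is purely hypothetical (this notebook only names the function), and the import line is shown as a comment so the snippet runs on its own.

```python
# Hypothetical contents of utils/functions.py; the real `example`
# function is not shown in this notebook, so this body is an assumption.
def example(name="team"):
    """Return a short greeting; stands in for any shared helper."""
    return f"Hello from {name}!"


# In a notebook you would normally write:
#     from utils.functions import example
# Here the function is defined inline so the snippet is self-contained.
message = example("model1")
print(message)
```

Keeping shared helpers in one module means each team member's notebook imports the same code instead of carrying its own copy, which also reduces the chance of merge conflicts.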

A graphical depiction of the group work-flow is shown below:

![](images/git-group-workflow-wbg.png)

Tutorial videos about using Git, GitLab, and GitGadget for group assignments:

* Set up the MSBA server to use Git and GitLab: https://youtu.be/zJHwodmjatY
* Dealing with Merge Conflicts: https://youtu.be/qFnyb8_rgTI
* Group assignment practice: https://youtu.be/4Ty_94gIWeA

In [None]:
#importing libraries
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyrsm as rsm
from sklearn import metrics
from pyrsm import profit_max, confusion, profit_plot, gains_plot, lift_plot, ROME_plot
from keras.layers import Dense
from keras.models import Sequential
from keras.utils.np_utils import to_categorical
from keras.models import load_model
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from keras.wrappers.scikit_learn import KerasClassifier

In [None]:
## loading the data - this dataset must NOT be changed
intuit75k = pd.read_pickle("../data/intuit75k.pkl")
intuit75k["res1_yes"] = (intuit75k["res1"] == "Yes").astype(int)
intuit75k.head()

In [None]:
# show dataset description
rsm.describe(intuit75k)

In [None]:
# Standardization: fit the scaler on the training rows only, then transform all rows
num_cols = ['numords', 'dollars', 'last', 'sincepurch']
scaler = preprocessing.StandardScaler()
sf = scaler.fit(intuit75k.query('training==1')[num_cols])
intuit75k[num_cols] = sf.transform(intuit75k[num_cols])
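
The cell above fits the scaler on the training rows and then applies it to every row. A minimal pure-Python sketch of that idea is shown below, using made-up numbers for a single column and a hypothetical helper named `standardize`. It is only an illustration of the arithmetic, not a replacement for `StandardScaler`.

```python
from statistics import fmean, pstdev

# Toy values standing in for one column such as `dollars` (made-up numbers)
train_values = [10.0, 20.0, 30.0]
test_values = [40.0]

# "Fit": learn the mean and population standard deviation from training rows only
mu = fmean(train_values)
sigma = pstdev(train_values)


def standardize(values, mu, sigma):
    """Apply the training-set mean and standard deviation to any rows."""
    return [(v - mu) / sigma for v in values]


# "Transform": the same mu and sigma are reused for training and test rows
train_scaled = standardize(train_values, mu, sigma)
test_scaled = standardize(test_values, mu, sigma)
print(train_scaled, test_scaled)
```

Because the test rows never influence `mu` or `sigma`, no information from the test set leaks into the preprocessing step.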

In [None]:
from sklearn.metrics import make_scorer

def profit_scoring(y_true, y_pred):
    profit = rsm.profit(pd.Series(y_true), pd.Series(y_pred), 1, 1.41, 60)
    return profit

profit_score = make_scorer(profit_scoring, greater_is_better = True, needs_proba=True)

In [None]:
# Cast zip_bins to object dtype before one-hot encoding
intuit75k["zip_bins"] = intuit75k["zip_bins"].astype(object)

In [None]:
#One-hot encoding of categorical variables
intuit75k = intuit75k.join(pd.get_dummies(intuit75k.zip_bins), how='inner')

In [None]:
#Extracting zips 00801 and 00804
intuit75k = intuit75k.assign(
    zip801=(intuit75k['zip'] == '00801').astype(int),
    zip804=(intuit75k['zip'] == '00804').astype(int),
)

In [None]:
#Splitting train and test dataset
intuit_train = intuit75k.query('training == 1').reset_index()
intuit_test = intuit75k.query('training == 0').reset_index()
intuit_train

In [None]:
#Creating X and Y datasets 
X_train = intuit_train.drop(columns=['id','zip', 'zip_bins','res1','res1_yes','training','sex','index','bizflag'])
y_train = intuit_train[['res1_yes']]
X_test = intuit_test.drop(columns=['id','zip', 'zip_bins','res1','res1_yes','training','sex','index','bizflag'])
y_test = intuit_test[['res1_yes']]

In [None]:
#Number of columns to define input shape of model
ncols= X_test.shape[1]
ncols

In [None]:
#Model creation
model= Sequential()
model.add(Dense(50, activation='relu', input_shape=(ncols,)))
model.add(Dense(50, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(2, activation='softmax'))

In [None]:
#Model compilation and fit
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, to_categorical(y_train))

In [None]:
#Predictions from model
predictions = model.predict(X_test)
# Column 1 holds the predicted probability of the positive class (res1_yes == 1)
prob_true = pd.Series(predictions[:, 1], name='predictions_test')

In [None]:
#Join actual responses with predicted response probabilities
df_test = y_test.join(prob_true, how='inner')

In [None]:
#Calculating breakeven response rate
breakeven = 1.41/60
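
To illustrate how a break-even rate like this is typically used, the sketch below applies it as a targeting threshold. It assumes 1.41 is the cost of contacting one customer and 60 is the margin per response, as the calculation above implies. The probabilities are made up for illustration.

```python
cost = 1.41    # assumed: cost of contacting one customer
margin = 60    # assumed: margin earned per response
breakeven = cost / margin

# Made-up predicted response probabilities for four customers
probs = [0.01, 0.02, 0.03, 0.10]

# Contact a customer only if the predicted probability exceeds break-even,
# which is the same as requiring a positive expected profit per contact
mail = [p > breakeven for p in probs]
expected_profit = [p * margin - cost for p in probs]

for p, m, ep in zip(probs, mail, expected_profit):
    print(f"prob={p:.3f} mail={m} expected profit={ep:.2f}")
```

The comparison `p > breakeven` and the sign of `p * margin - cost` always agree, which is why the break-even rate can serve as the cutoff.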

In [None]:
#Calculating confusion matrix
TP, FP, TN, FN, contact = confusion(df_test,'res1_yes',1,'predictions_test',1.41,60)

print(f'TP: {TP}')
print(f'TN: {TN}')
print(f'FP: {FP}')
print(f'FN: {FN}')
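
For intuition about what these four counts mean, here is a minimal sketch that classifies customers at the break-even threshold. The helper `confusion_counts` is hypothetical and is not pyrsm's `confusion` implementation. The data are made up.

```python
def confusion_counts(actual, probs, cost, margin):
    """Count TP, FP, TN, FN when contacting customers above break-even."""
    breakeven = cost / margin
    tp = fp = tn = fn = 0
    for y, p in zip(actual, probs):
        contacted = p > breakeven
        if contacted and y == 1:
            tp += 1   # contacted and responded
        elif contacted and y == 0:
            fp += 1   # contacted but did not respond
        elif not contacted and y == 0:
            tn += 1   # not contacted, would not have responded
        else:
            fn += 1   # not contacted, but would have responded
    return tp, fp, tn, fn


# Made-up actual responses (1 = responded) and predicted probabilities
actual = [1, 0, 0, 1, 0]
probs = [0.30, 0.05, 0.01, 0.02, 0.001]
tp, fp, tn, fn = confusion_counts(actual, probs, cost=1.41, margin=60)
print(tp, fp, tn, fn)
```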

In [None]:
#Calculating profit on test dataset 
p = profit_max(df_test,'res1_yes',1,'predictions_test',1.41,30)

print(f'The profit on the test data is ${round(p,3)}')

In [None]:
# Use scikit-learn to run a randomized search over the number of neurons and the activation function

# Function to create model, required for KerasClassifier
def create_model(neurons=1, activation='relu'):
    # create model
    model = Sequential()
    model.add(Dense(neurons, activation=activation))
    model.add(Dense(neurons, activation=activation))
    model.add(Dense(2, activation='softmax'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 1234
np.random.seed(seed)

# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
activation= ['linear','relu','softmax','tanh','sigmoid']
neurons= [1, 5, 10, 15, 20, 25, 30, 50, 100]
param_grid = dict(neurons=neurons, activation=activation)
# sample n_iter candidate settings from the search space defined above
grid = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs=4, cv=3, n_iter=20)
grid_result = grid.fit(X_train, to_categorical(y_train))
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
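
To get a feel for the size of this search, the sketch below enumerates the full set of neuron and activation combinations and draws a random subset of 20 from it. It uses only the standard library and a seeded `random.Random` as a stand-in sampler, so the specific combinations drawn will differ from those scikit-learn picks.

```python
import itertools
import random

activation = ['linear', 'relu', 'softmax', 'tanh', 'sigmoid']
neurons = [1, 5, 10, 15, 20, 25, 30, 50, 100]

# Full grid: every (neurons, activation) pair
full_grid = list(itertools.product(neurons, activation))

# A randomized search evaluates only a random subset of that grid
rng = random.Random(1234)
sampled = rng.sample(full_grid, k=20)

# With 3-fold cross-validation, each sampled candidate is fitted 3 times
n_fits = len(sampled) * 3

print(len(full_grid), "combinations in the full grid")
print(len(sampled), "combinations sampled")
print(n_fits, "cross-validation fits")
```

Sampling 20 of the 45 combinations cuts the number of cross-validation fits from 135 to 60, at the cost of possibly missing the best setting.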