# Group 4: Project Update 2

# Preamble

In project update 1, we defined the problem at hand along with the proposed solution. Our research aims at providing a tool for enterprises to identify, monitor, and mitigate compliance risk to government sales and use tax audit. We presented the Audit Risk Dataset along with descriptive and visual statistics for comprehension. The preprocessed and clean dataset is now ready for training and evaluation.

So, let us continue our journey by importing the necessary python libraries.

In [1]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from scipy.special import comb
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from matplotlib.colors import ListedColormap
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.feature_selection import RFE
from sklearn.ensemble import VotingRegressor
from sklearn import linear_model
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Perceptron
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

I. To prevent bias in our model evaluation, we partitioned the audit risk dataset into training and evaluation sets.

In [2]:
# Load dataset
risk = pd.read_csv('../data/Audit_Risk.csv')

In [3]:
# Partition into training and validation sets
X = risk.drop(columns=['Audit Risk','NAICS Description'])
y = risk['Audit Risk']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

II. We tested the performance of the model for each classifier through different metrics, including accuracy, precision, recall, F1-score, AUC, and ROC. To that purpose, we built some helper functions as defined in the lines of codes below.

In [4]:
# A function to calculate and print the different metrics.
def evaluate_model(model, X_test, algorithm):
    # Evaluate the model and print the different metrics. 
    print('The metrics for the ', algorithm, ' algorithms are as follow:')
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_val, y_pred)
    print('Accuracy: %.2f%%' % (accuracy * 100))
    f1 = f1_score(y_val, y_pred)
    print("f1_score:  %.2f%%" % (f1 * 100))
    precision = precision_score(y_val, y_pred, zero_division=0)
    print("precision_score:  %.2f%%" % (precision * 100))
    recall = recall_score(y_val, y_pred)
    print("recall_score:  %.2f%%" % (recall * 100))
    auc = roc_auc_score(y_val, y_pred)
    print("roc_auc_score:  %.2f%%" % (auc * 100))
    
# A function for feature scaling.
def scale_features(x):
    scaler = StandardScaler()
    return scaler.fit_transform(x)

III. We spot checked the training of our model using machine learning algorithms for classification, including Naive Bayes, Support Vector Machines, and Neural Networks.

In [5]:
# experimenting with Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)
evaluate_model(nb, X_val, 'Naive Bayes')

The metrics for the  Naive Bayes  algorithms are as follow:
Accuracy: 70.35%
f1_score:  78.80%
precision_score:  94.30%
recall_score:  67.68%
roc_auc_score:  74.87%


In [6]:
# experimenting with Support Vector Machines
svm = SVC()
svm.fit(X_train, y_train)
evaluate_model(svm, X_val, 'Support Vector Machines')

The metrics for the  Support Vector Machines  algorithms are as follow:
Accuracy: 81.43%
f1_score:  89.77%
precision_score:  81.43%
recall_score:  100.00%
roc_auc_score:  50.00%


In [7]:
 # experimenting with Neural Networks
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000)
mlp.fit(X_train, y_train)
evaluate_model(mlp, X_val, 'Neural Networks')

The metrics for the  Neural Networks  algorithms are as follow:
Accuracy: 93.94%
f1_score:  96.26%
precision_score:  96.68%
recall_score:  95.85%
roc_auc_score:  90.72%


IV. As demonstrated in the printed metrics from our earlier model training, each algorithm provides a different level of performance. The neural network algorithm offers the highest performance for all metrics, whereas the Naive Bayes algorithm yields fewer satisfying performances.

Our data has already been pre-processed and cleaned. So, improving performance through pre-processing and cleaning is no longer an option. To improve performance and predictive accuracy, feature selection/reduction became our best resource. We used the RandomForestRegressor from sklearn.ensemble for that matter. It allows us to extract the best fitting features from our dataset.

In [8]:
# Instantiate RandomForestRegressor
rf = RandomForestRegressor()
# Fit the model
rf.fit(X, y)
# Get feature importances
feature_importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
feature_importances.to_frame()

Unnamed: 0,0
Additional Tax Due,1.0
Taxable Ratio,0.0
Tax Paid,0.0
NAICS Code,0.0
Gross Sales Total,0.0
Gross Income,0.0
Tax Due,0.0


a) As shown in the data frame above, Additional Tax Due (score 1.0) is the most important feature, followed by NAICS Code. This confirms our expectation from our knowledge about the dataset. Although NAICS Code has a score of 0.0 in the feature selection, our knowledge of the dataset along with the audit risk algorithm tells us that it is an important distinguishing feature. The NAICS Code identifies industries and sectors. It is indispensable for the model unbiased evaluation. So, we keep it as a feature along with the Additional Tax Due. Then, we re-evaluated our model.

In [9]:
# Split the dataset for training based on the reduced features. 
X = risk[['NAICS Code', 'Additional Tax Due']]
y = risk['Audit Risk']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

b) Our model evaluation improves for all tested algorithm after reducing the features to just 'NAICS Code', 'Additional Tax Due'.

In [10]:
print("New metrics after feature selections")
nb = GaussianNB()
nb.fit(X_train, y_train)
evaluate_model(nb, X_val, 'Naive Bayes')

New metrics after feature selections
The metrics for the  Naive Bayes  algorithms are as follow:
Accuracy: 75.09%
f1_score:  82.22%
precision_score:  98.13%
recall_score:  70.76%
roc_auc_score:  82.42%


In [11]:
# Support Vector Machines
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
evaluate_model(svm, X_val, 'Support Vector Machines')

The metrics for the  Support Vector Machines  algorithms are as follow:
Accuracy: 99.98%
f1_score:  99.99%
precision_score:  100.00%
recall_score:  99.98%
roc_auc_score:  99.99%


In [12]:
# Neural Networks
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000)
mlp.fit(X_train, y_train)
evaluate_model(mlp, X_val, 'Neural Networks')

The metrics for the  Neural Networks  algorithms are as follow:
Accuracy: 99.48%
f1_score:  99.68%
precision_score:  99.47%
recall_score:  99.89%
roc_auc_score:  98.77%


V. Now, let us move on to the more exciting part and combine the individual classifiers through ensemble models. We implemented three different ensemble-based classifiers, including voting classifier, Stacking Classifier, and Voting Regressor. Ensemble learning algorithms offer better predictive performance by combining the predictions from multiple models. 

The different models are themselves fit through pipelines. Pipeline provides built-in functionalities for composite estimators as a sequence of data transformers with an optional final predictor. We implemented three ensemble models each with three different pipelines, for a total of nine pipelines. We then devised a helper function to assist with cross validation scoring.

In [13]:
def cross_validation_scoring(model, algorithm_name):
    # Evaluate the model and print the different metrics. 
    print('\nThe metrics for the ', algorithm_name, ' algorithm are as follow:')
    metrics = 'accuracy', 'precision', 'recall', 'f1'
    results, names = list(), list()
    for metric in metrics:
        # Each model will be evaluated using repeated k-fold cross-validation.
        score = cross_val_score(estimator=model, X=X_train, y=y_train, cv=10, scoring=metric, error_score='raise')
        print(metric + ": %0.8f (+/- %0.8f) [%s]" % (score.mean(), score.std(), label))
        results.append(score)
        names.append(metric)

1.	The first ensemble is the voting classifier. It consists of three models, which are Naive Bayes, Support Vector Machines, and Neural Networks

In [14]:
# Pipelining the Naive Bayes classifier
naive_bayes_pipeline = make_pipeline(StandardScaler(), SimpleImputer(), PCA(n_components=2), GaussianNB())
naive_bayes_pipeline.fit(X_train, y_train)

In [15]:
# Pipelining the support vector machines
support_vector_machines_pipeline = make_pipeline(StandardScaler(), SimpleImputer(), PCA(n_components=2), SVC(kernel='linear'))
support_vector_machines_pipeline.fit(X_train, y_train)

In [16]:
# Pipelining the Neural Networks classifier
neural_networks_pipeline = make_pipeline(StandardScaler(), SimpleImputer(), PCA(n_components=2),
                                             MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000))
neural_networks_pipeline.fit(X_train, y_train)

In [17]:
# Implementing the ensemble voting classifier
votingClassifier = VotingClassifier(estimators=[('nb', naive_bayes_pipeline), ('svm', support_vector_machines_pipeline),
                    ('mlp', neural_networks_pipeline)], voting='hard')
votingClassifier.fit(X_train, y_train)

In [18]:
ensemble1_labels = ['Naive Bayes', 'Support Vector Machines', 'Neural Networks', 'Majority voting']
ensemble1 = [naive_bayes_pipeline, support_vector_machines_pipeline, neural_networks_pipeline, votingClassifier]
# Computing and displaying performance metrics
for clf, label in zip(ensemble1, ensemble1_labels):
    cross_validation_scoring(clf, label)


The metrics for the  Naive Bayes  algorithm are as follow:
accuracy: 0.81650797 (+/- 0.00004278) [Naive Bayes]
precision: 0.81650797 (+/- 0.00004278) [Naive Bayes]
recall: 1.00000000 (+/- 0.00000000) [Naive Bayes]
f1: 0.89898639 (+/- 0.00002593) [Naive Bayes]

The metrics for the  Support Vector Machines  algorithm are as follow:
accuracy: 0.93363768 (+/- 0.00344664) [Support Vector Machines]
precision: 1.00000000 (+/- 0.00000000) [Support Vector Machines]
recall: 0.91872424 (+/- 0.00422084) [Support Vector Machines]
f1: 0.95763569 (+/- 0.00229233) [Support Vector Machines]

The metrics for the  Neural Networks  algorithm are as follow:
accuracy: 0.99107320 (+/- 0.00361641) [Neural Networks]
precision: 0.99990952 (+/- 0.00018101) [Neural Networks]
recall: 0.99153624 (+/- 0.00387302) [Neural Networks]
f1: 0.99334071 (+/- 0.00194760) [Neural Networks]

The metrics for the  Majority voting  algorithm are as follow:
accuracy: 0.98996786 (+/- 0.00221077) [Majority voting]
precision: 0.9999

2. The second ensemble is the Stacking Classifier. It combines the following 3 classifiers: Logistic Regression, Decision Tree Classifier, and K Neighbors Classifier.

In [19]:
# Pipelining the logistic regression
logistic_regression_pipeline = make_pipeline(StandardScaler(), SimpleImputer(), 
                                             PCA(n_components=2), LogisticRegression(random_state=1, solver='lbfgs'))
logistic_regression_pipeline.fit(X_train, y_train)

In [20]:
# Pipelining the decision tree classifier
decision_tree_classifier_pipeline = make_pipeline(StandardScaler(), SimpleImputer(), PCA(n_components=2),
                                    DecisionTreeClassifier(max_depth=10, random_state=0, criterion='entropy'))
decision_tree_classifier_pipeline.fit(X_train, y_train)

In [21]:
# Pipelining the K neighbors classifier classifier
K_neighbors_classifier_pipeline = make_pipeline(StandardScaler(), SimpleImputer(), PCA(n_components=2),
                                  KNeighborsClassifier(n_neighbors=43, p=2, metric='minkowski'))
K_neighbors_classifier_pipeline.fit(X_train, y_train)

In [22]:
# Implementing the ensemble-based Stacking Classifier
estimators = [('lr', logistic_regression_pipeline), ('dt', decision_tree_classifier_pipeline), 
              ('kn', K_neighbors_classifier_pipeline)]
stackingClassifier = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stackingClassifier.fit(X_train, y_train)


In [23]:
# Computing and displaying performance metrics
ensemble2_labels = ['Logistic Regression', 'K Neighbors Classifier', 'Stacking Ensemble']
ensemble2 = [logistic_regression_pipeline, decision_tree_classifier_pipeline, 
             K_neighbors_classifier_pipeline,stackingClassifier]
for clf, label in zip(ensemble2, ensemble2_labels):
    cross_validation_scoring(clf, label)


The metrics for the  Logistic Regression  algorithm are as follow:
accuracy: 0.96465695 (+/- 0.00231687) [Logistic Regression]
precision: 1.00000000 (+/- 0.00000000) [Logistic Regression]
recall: 0.95671441 (+/- 0.00283715) [Logistic Regression]
f1: 0.97787629 (+/- 0.00148080) [Logistic Regression]

The metrics for the  K Neighbors Classifier  algorithm are as follow:
accuracy: 0.98698014 (+/- 0.00137962) [K Neighbors Classifier]
precision: 0.99025557 (+/- 0.00352808) [K Neighbors Classifier]
recall: 0.99385668 (+/- 0.00310967) [K Neighbors Classifier]
f1: 0.99204240 (+/- 0.00083624) [K Neighbors Classifier]

The metrics for the  Stacking Ensemble  algorithm are as follow:
accuracy: 0.99205693 (+/- 0.00074916) [Stacking Ensemble]
precision: 0.99954987 (+/- 0.00045500) [Stacking Ensemble]
recall: 0.99071814 (+/- 0.00063898) [Stacking Ensemble]
f1: 0.99511431 (+/- 0.00046133) [Stacking Ensemble]


3. Our last ensemble is the Voting Regressor. Our first two ensembles were classifiers. We figured we could try regression as well as a learning method. We pipeline the three regressors which are the Gradient Boosting Regressor, the Random Forest Regressor, and the Linear Regression, into the Voting Regressor as the ensemble-based algorithm. 

In [24]:
# Pipelining the gradient boosting regressor
gradient_boostingRegressor_pipeline = make_pipeline(StandardScaler(), SimpleImputer(), PCA(n_components=2),
                                                        GradientBoostingRegressor(random_state=1))
gradient_boostingRegressor_pipeline.fit(X_train, y_train)

In [25]:
# Pipelining the random forest regressor
randomForestRegressor_pipeline = make_pipeline(StandardScaler(), SimpleImputer(), PCA(n_components=2),
                                                   RandomForestRegressor(random_state=1))
randomForestRegressor_pipeline.fit(X_train, y_train)

In [26]:
# Pipelining the linear_regression
linear_regression_pipeline = make_pipeline(StandardScaler(), SimpleImputer(), PCA(n_components=2),
                                               LinearRegression())
linear_regression_pipeline.fit(X_train, y_train)

In [27]:
# Regrouping the regressors as estimators
estimators = [
        ('gb', gradient_boostingRegressor_pipeline),
        ('rf', randomForestRegressor_pipeline),
        ('lr', linear_regression_pipeline)]

# Loading the estimators to the voting regressor ensemble
votingRegressor = VotingRegressor(estimators)
# Fitting the voting regressor
votingRegressor.fit(X_train, y_train)

In [28]:
# Set the descriptive labels for the performance scores
ensemble3_labels = ['Gradient Boosting Regressor', 'Random Forest Regressor', 'LinearRegression', 'Voting Regressor']
# load the regressor pipelines along with the voting regressor for scoring
ensemble3 = [gradient_boostingRegressor_pipeline, randomForestRegressor_pipeline, linear_regression_pipeline,
            votingRegressor]

In [29]:
# Using the coefficient of determination regression score function to evaluate the strength of our prediction. 
for clf, label in zip(ensemble3, ensemble3_labels):
    clf.score(X_train, y_train)
    y_pred = clf.predict(X_val)
    r2 = r2_score(y_val, y_pred)
    print('The coefficient of determination regression score for ' + label + ' is: ' + str(r2))

The coefficient of determination regression score for Gradient Boosting Regressor is: 0.7789830348816729
The coefficient of determination regression score for Random Forest Regressor is: 0.9896227849604501
The coefficient of determination regression score for LinearRegression is: 0.08917920615943897
The coefficient of determination regression score for Voting Regressor is: 0.795688944568206


The scoring of the voting Regressor is not so promising. The Random Forest Regressor even performs better than the ensemble voting regressor, 98.96% versus 79.56%. The regression is certainly not a good fit for our audit risk dataset. Regression cannot do the job of a classifier. In the other hand, our classifiers perform tremendously well for all computed metrics, first through pipelining them through ensemble 1 and 2, voting classifier and Stacking Classifier, respectively. 

It is quite elegant to be able to use machine learning algorithms to predict and mitigate sales and use tax audit risk. Our tax audit risk algorithm is a valuable tool for businesses. But also, sales tax professionals, CPA, auditors, and attorneys involved in the auditing process can certainly instrumentalize it to build audit prevention infrastructures as well as defense strategies for their clients.