### Mini-Project 2: Yimeng Wu 
### Teammate: Kathy Zhang

#### 1. Since it’s important to use theory/intuition/common sense in concert with our data driven approaches, what factors do you suspect will affect the true, underlying model of whether or not a firm will commit tax evasion? Briefly explain.

1. Local corruption. Firms in places where corruption is common may be more likely to evade taxes, because they may find it easier to bribe local officials to overlook the evasion.
2. Financial status. Firms in poor financial condition, for example those with liquidity problems, may be more tempted to evade taxes to ease their financial burden.
3. Firm structure. Firms with more complex financial structures may be more likely to evade taxes, since complexity makes auditing harder and helps conceal the evasion.
4. Industry. Certain industries may be more prone to tax evasion, particularly those where transactions are frequently conducted in cash. Cash transactions leave accounting records that are harder to trace, which helps hide a firm's true financial activity from auditors.
5. Economic conditions. During economic downturns, firms may be more likely to evade taxes in order to maintain profitability.

#### 2. Assume that in addition to some combination of the predictors listed in Table 1, the interaction of two predictor variables also enters the true model. If the appropriate interaction is not explicitly included as a predictor in the fitted model, what advantage does KNN enjoy over the LPM if the interaction is indeed important to the true relationship?

Advantage: KNN is a non-parametric method, so it does not impose a functional form linking the predictors to the outcome. It classifies a firm by the outcomes of nearby firms in predictor space, so if the effect of one predictor depends on the level of another, that pattern is already reflected in the neighbors' outcomes, and KNN can pick up the interaction without it being specified. The LPM, in contrast, is parametric and additive in the included predictors: an interaction affects the fit only if the product term is explicitly added. If an important interaction is omitted, the LPM is misspecified and its predicted probabilities are biased. KNN's flexibility therefore lets it capture patterns that an LPM would miss unless the interaction is explicitly included.
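
A minimal sketch of this point, using synthetic data rather than the audit data: the outcome below depends only on the product of two predictors, the LPM is fit without that interaction term, and KNN is fit on the same two columns. The names `X_demo`, `lpm_demo`, and `knn_demo` and the data-generating rule are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic data (not the audit data): the outcome depends only on the
# interaction x1 * x2, so neither predictor has a main effect.
X_demo = rng.uniform(-1, 1, size=(2000, 2))
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)

X_fit, X_val = X_demo[:1000], X_demo[1000:]
y_fit, y_val = y_demo[:1000], y_demo[1000:]

# LPM without the interaction term: fitted values thresholded at 0.5.
lpm_demo = LinearRegression().fit(X_fit, y_fit)
lpm_acc = np.mean((lpm_demo.predict(X_val) > 0.5).astype(int) == y_val)

# KNN sees the same two columns but classifies by local neighborhoods.
knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_fit, y_fit)
knn_acc = knn_demo.score(X_val, y_val)

print(f"LPM accuracy without interaction: {lpm_acc:.3f}")
print(f"KNN accuracy (k=5):               {knn_acc:.3f}")
```

On data like this the LPM should land near chance, since neither predictor has a main effect, while KNN recovers most of the pattern from local neighborhoods. Adding the product `x1 * x2` as a third column would let the LPM pick up much of it as well.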


In [19]:
import os
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import math
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score

path = r'/Users/yimeng/Documents/Machine Learning/MP2'
audit_f = 'Data-Audit.csv'
audit = pd.read_csv(os.path.join(path, audit_f))
audit.head()

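| | Sector_score | PARA_A | Risk_A | PARA_B | Risk_B | Money_Value | Risk_D | Score | Inherent_Risk | Audit_Risk | Risk |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.89 | 4.18 | 2.508 | 2.5 | 0.5 | 3.38 | 0.676 | 2.4 | 8.574 | 1.7148 | 1 |
| 1 | 3.89 | 0.0 | 0.0 | 4.83 | 0.966 | 0.94 | 0.188 | 2.0 | 2.554 | 0.5108 | 0 |
| 2 | 3.89 | 0.51 | 0.102 | 0.23 | 0.046 | 0.0 | 0.0 | 2.0 | 1.548 | 0.3096 | 0 |
| 3 | 3.89 | 0.0 | 0.0 | 10.8 | 6.48 | 11.75 | 7.05 | 4.4 | 17.53 | 3.506 | 1 |
| 4 | 3.89 | 0.0 | 0.0 | 0.08 | 0.016 | 0.0 | 0.0 | 2.0 | 1.416 | 0.2832 | 0 |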


In [20]:
print(audit.groupby("Risk").size())

Risk
0    471
1    305
dtype: int64


In [21]:
na_count = pd.DataFrame(np.sum(audit.isna(), axis = 0), columns = ["Count NAs"])
na_count

| | Count NAs |
|---|---|
| Sector_score | 0 |
| PARA_A | 0 |
| Risk_A | 0 |
| PARA_B | 0 |
| Risk_B | 0 |
| Money_Value | 1 |
| Risk_D | 0 |
| Score | 0 |
| Inherent_Risk | 0 |
| Audit_Risk | 0 |


In [22]:
audit = audit.dropna()
audit.info()

<class 'pandas.core.frame.DataFrame'>
Index: 775 entries, 0 to 775
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Sector_score   775 non-null    float64
 1   PARA_A         775 non-null    float64
 2   Risk_A         775 non-null    float64
 3   PARA_B         775 non-null    float64
 4   Risk_B         775 non-null    float64
 5   Money_Value    775 non-null    float64
 6   Risk_D         775 non-null    float64
 7   Score          775 non-null    float64
 8   Inherent_Risk  775 non-null    float64
 9   Audit_Risk     775 non-null    float64
 10  Risk           775 non-null    int64  
dtypes: float64(10), int64(1)
memory usage: 72.7 KB


In [23]:
pd.set_option('display.max_columns', None) 
audit.describe()

| | Sector_score | PARA_A | Risk_A | PARA_B | Risk_B | Money_Value | Risk_D | Score | Inherent_Risk | Audit_Risk | Risk |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 775.0 | 775.0 | 775.0 | 775.0 | 775.0 | 775.0 | 775.0 | 775.0 | 775.0 | 775.0 | 775.0 |
| mean | 20.138877 | 2.453059 | 1.352712 | 10.813924 | 6.342181 | 14.137631 | 8.276099 | 2.703484 | 17.70156 | 7.177034 | 0.393548 |
| std | 24.301417 | 5.681977 | 3.442348 | 50.114461 | 30.091403 | 66.606519 | 39.995557 | 0.859106 | 54.772482 | 38.691674 | 0.488852 |
| min | 1.85 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.4 | 0.28 | 0.0 |
| 25% | 2.37 | 0.21 | 0.042 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.584 | 0.3168 | 0.0 |
| 50% | 3.89 | 0.88 | 0.176 | 0.41 | 0.082 | 0.09 | 0.018 | 2.4 | 2.214 | 0.556 | 0.0 |
| 75% | 55.57 | 2.48 | 1.488 | 4.16 | 1.887 | 5.595 | 2.238 | 3.3 | 10.703 | 3.2526 | 1.0 |
| max | 59.85 | 85.0 | 51.0 | 1264.63 | 758.778 | 935.03 | 561.018 | 5.2 | 801.262 | 961.5144 | 1.0 |


### Data Analysis Questions

#### 3. Split the sample set into a training set and a validation set. Use the training set to fit a linear probability model (LPM). Apply the model to the second half of the data to predict the probability a firm cheated.

In [24]:
X = audit.drop(columns = ['Risk'])
target = audit.loc[:,'Risk']
display(X.head())
display(target.head())

| | Sector_score | PARA_A | Risk_A | PARA_B | Risk_B | Money_Value | Risk_D | Score | Inherent_Risk | Audit_Risk |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.89 | 4.18 | 2.508 | 2.5 | 0.5 | 3.38 | 0.676 | 2.4 | 8.574 | 1.7148 |
| 1 | 3.89 | 0.0 | 0.0 | 4.83 | 0.966 | 0.94 | 0.188 | 2.0 | 2.554 | 0.5108 |
| 2 | 3.89 | 0.51 | 0.102 | 0.23 | 0.046 | 0.0 | 0.0 | 2.0 | 1.548 | 0.3096 |
| 3 | 3.89 | 0.0 | 0.0 | 10.8 | 6.48 | 11.75 | 7.05 | 4.4 | 17.53 | 3.506 |
| 4 | 3.89 | 0.0 | 0.0 | 0.08 | 0.016 | 0.0 | 0.0 | 2.0 | 1.416 | 0.2832 |


0    1
1    0
2    0
3    1
4    0
Name: Risk, dtype: int64

In [33]:
from sklearn.linear_model import LinearRegression

# 50/50 split into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.50, random_state=13)

# LPM: ordinary least squares on the 0/1 outcome
lpm = LinearRegression()
lpm.fit(X_train, y_train)
# LinearRegression has no predict_proba; predict() returns the fitted values,
# which the LPM interprets as probabilities (they can fall outside [0, 1])
lpm_pred = lpm.predict(X_test)


(a) For firms with a predicted probability of tax evasion greater than 0.5, construct the
confusion matrix.

(b) For firms with a predicted probability of tax evasion greater than 0.6, construct the
confusion matrix.

(c) For each of the two thresholds, report the error rate. Which results in more accurate
overall predictions?

(d) For each of the two thresholds, what proportion of the firms predicted to evade their
taxes actually evaded taxes?

In [32]:
thresholds = [0.5, 0.6]
cm_lpm = {}
error_rate_lpm = {}
precision_lpm = {}

for threshold in thresholds:
    lpm_pred_bin = np.where(lpm_pred > threshold, 1, 0)
    cm_lpm[threshold] = confusion_matrix(y_test, lpm_pred_bin)
    error_rate_lpm[threshold] = 1 - accuracy_score(y_test, lpm_pred_bin)
    precision_lpm[threshold] = precision_score(y_test, lpm_pred_bin, zero_division=0)

print(f"Confusion Matrix: {cm_lpm}")
print(f"Error Rate: {error_rate_lpm}")
print(f"Precision: {precision_lpm}")

Confusion Matrix: {0.5: array([[226,   3],
       [  3, 156]]), 0.6: array([[228,   1],
       [  4, 155]])}
Error Rate: {0.5: 0.015463917525773141, 0.6: 0.012886597938144284}
Precision: {0.5: 0.9811320754716981, 0.6: 0.9935897435897436}


(a)/(b) The confusion matrices for thresholds 0.5 and 0.6 are shown in the output above (rows are actual classes, columns are predicted classes).  
(c) Error rates: 1.55% at threshold 0.5 and 1.29% at threshold 0.6.  
The overall error rate is lower at 0.6, so the threshold in (b) gives more accurate overall predictions.  
(d) Precision: 98.11% at threshold 0.5 and 99.36% at threshold 0.6.  
At the 0.5 threshold, 98.11% of the firms predicted to evade their taxes actually evaded taxes.  
At the 0.6 threshold, 99.36% of the firms predicted to evade their taxes actually evaded taxes.  
The proportion of correct tax-evasion predictions is therefore higher at the 0.6 threshold.
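
As a check on the arithmetic, the error rate and precision can be recomputed directly from the confusion matrices reported above. This is a small sketch; the helper names `error_rate` and `precision` are introduced here for illustration and are not part of the notebook.

```python
# Confusion matrices reported above; rows are actual (0, 1),
# columns are predicted (0, 1), following scikit-learn's convention.
cms = {
    0.5: [[226, 3], [3, 156]],
    0.6: [[228, 1], [4, 155]],
}

def error_rate(cm):
    """Share of all firms that were misclassified: (FP + FN) / total."""
    (tn, fp), (fn, tp) = cm
    return (fp + fn) / (tn + fp + fn + tp)

def precision(cm):
    """Share of predicted evaders that actually evaded: TP / (TP + FP)."""
    (tn, fp), (fn, tp) = cm
    return tp / (tp + fp)

for threshold, cm in cms.items():
    print(threshold, round(error_rate(cm), 4), round(precision(cm), 4))
```

The values agree with the `accuracy_score` and `precision_score` output: 6 of 388 firms are misclassified at 0.5 and 5 of 388 at 0.6.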

#### 4. In measuring performance in this context, should a false negative matter as much as a false positive? Briefly explain why or why not and how changing the threshold for classifying a firm as a tax evader (as in the previous question) affects this trade-off.

The government's stated goal is to increase tax revenue, so I consider minimizing false negatives (FNs) the priority. A false negative is a tax-evading firm that goes undetected: the government loses that revenue directly, and the firm can continue to evade.  

To achieve this, the government could lower the threshold for classifying a firm as a tax evader. A lower threshold makes the model more sensitive to potential evasion, thereby reducing FNs. However, this approach might increase false positives (FPs), leading to more compliant firms being incorrectly flagged as evaders.  

In question 3, the 0.5 threshold gives 3 FNs and 3 FPs, while the 0.6 threshold gives 4 FNs and 1 FP. Lowering the threshold to 0.5 therefore catches more evaders at the small cost of more FPs, and raising it to 0.6 reduces FPs but misses more evaders. Adjusting the threshold shifts the trade-off between capturing more evaders and wrongly flagging compliant firms. The lower threshold (0.5) is better aligned with the goal of maximizing tax revenue, despite a slightly higher number of false positives.  

One caveat: although a lower threshold catches more evaders, it also raises auditing costs because the government must examine more firms. The extra revenue has to be weighed against those costs.
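
One way to make this balance concrete is to attach a cost to each type of error and compare thresholds by total cost. The sketch below uses the FP and FN counts from question 3; the unit costs `cost_fp` and `cost_fn` are hypothetical values chosen only for illustration, not estimates from the data.

```python
# Validation-set counts from question 3: (false positives, false negatives).
counts = {0.5: (3, 3), 0.6: (1, 4)}

def total_cost(fp, fn, cost_fp, cost_fn):
    """Cost of wasted audits plus revenue lost to undetected evaders."""
    return fp * cost_fp + fn * cost_fn

# Hypothetical unit costs: a missed evader costs 5x a wasted audit.
cost_fp, cost_fn = 1.0, 5.0

costs = {t: total_cost(fp, fn, cost_fp, cost_fn) for t, (fp, fn) in counts.items()}
best_threshold = min(costs, key=costs.get)
print(costs, best_threshold)
```

With these counts, the 0.5 threshold is cheaper whenever a missed evader costs more than twice as much as a wasted audit, since 3·cost_fp + 3·cost_fn < cost_fp + 4·cost_fn reduces to 2·cost_fp < cost_fn.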

#### 5. Using the training set from the previous question, fit a KNN model with k = 5, then use it to predict outcomes in the validation set.

(a) Construct the confusion matrix.

(b) Report the error rate. How accurate are the overall predictions?

(c) What proportion of the firms predicted to evade their taxes actually evaded taxes?

In [34]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)

cm_knn = confusion_matrix(y_test, knn_pred)
error_rate_knn = 1 - accuracy_score(y_test, knn_pred)
precision_knn = precision_score(y_test, knn_pred, zero_division=0)

print(f"Confusion Matrix: {cm_knn}")
print(f"Accuracy: {1-error_rate_knn}")
print(f"Error Rate: {error_rate_knn}")
print(f"Precision: {precision_knn}")

Confusion Matrix: [[226   3]
 [ 11 148]]
Accuracy: 0.9639175257731959
Error Rate: 0.03608247422680411
Precision: 0.9801324503311258


(b) With an error rate of 3.61%, the overall predictions have an accuracy of 96.39%.  
(c) 98.01% of the firms predicted to evade their taxes actually evaded taxes. 

#### 6. Repeat the previous question with k = 5 after normalizing the data.

In [35]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
cols = X.columns
X_norm = pd.DataFrame(X_scaled, columns=cols)
X_norm.head()

| | Sector_score | PARA_A | Risk_A | PARA_B | Risk_B | Money_Value | Risk_D | Score | Inherent_Risk | Audit_Risk |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.669071 | 0.304129 | 0.335827 | -0.166006 | -0.194273 | -0.161614 | -0.190146 | -0.353484 | -0.166753 | -0.141265 |
| 1 | -0.669071 | -0.432005 | -0.393216 | -0.119482 | -0.178777 | -0.198271 | -0.202356 | -0.819385 | -0.276733 | -0.172402 |
| 2 | -0.669071 | -0.34219 | -0.363566 | -0.211331 | -0.20937 | -0.212393 | -0.207059 | -0.819385 | -0.295112 | -0.177606 |
| 3 | -0.669071 | -0.432005 | -0.393216 | -0.000278 | 0.004583 | -0.03587 | -0.030676 | 1.976022 | -0.003134 | -0.09494 |
| 4 | -0.669071 | -0.432005 | -0.393216 | -0.214326 | -0.210368 | -0.212393 | -0.207059 | -0.819385 | -0.297523 | -0.178289 |


In [36]:
X_train_norm, X_test_norm, y_train_norm, y_test_norm = train_test_split(X_norm, target, test_size=0.50, random_state=13)

knn_norm = KNeighborsClassifier(n_neighbors=5)
knn_norm.fit(X_train_norm, y_train_norm)
knn_pred_norm = knn_norm.predict(X_test_norm)

cm_knn_norm = confusion_matrix(y_test_norm, knn_pred_norm)
error_rate_knn_norm = 1 - accuracy_score(y_test_norm, knn_pred_norm)
precision_knn_norm = precision_score(y_test_norm, knn_pred_norm, zero_division=0)

print(f"Confusion Matrix: {cm_knn_norm}")
print(f"Accuracy: {1-error_rate_knn_norm}")
print(f"Error Rate: {error_rate_knn_norm}")
print(f"Precision: {precision_knn_norm}")

Confusion Matrix: [[222   7]
 [ 15 144]]
Accuracy: 0.9432989690721649
Error Rate: 0.05670103092783507
Precision: 0.9536423841059603


(b) With an error rate of 5.67%, the overall predictions have an accuracy of 94.33%.   
(c) 95.36% of the firms predicted to evade their taxes actually evaded taxes. 
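
The cells above fit `StandardScaler` on all rows before splitting, so the validation rows contribute to the means and standard deviations used for scaling. A common alternative is to fit the scaler on the training rows only. The sketch below shows one way to do that with a scikit-learn pipeline, on synthetic data rather than the audit data; the names `X_demo` and `knn_pipe` and the data-generating rule are assumptions for illustration, and the accuracy it prints says nothing about the audit results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)

# Synthetic stand-in for the predictors (not the audit data).
X_demo = np.column_stack([rng.normal(5, 2, 600), rng.normal(50, 20, 600)])
y_demo = (X_demo[:, 0] > 5).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=13
)

# The pipeline fits the scaler on the training rows only, then applies the
# same training mean and standard deviation to the validation rows.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipe.fit(X_tr, y_tr)

scaler_fitted = knn_pipe.named_steps["standardscaler"]
val_acc = knn_pipe.score(X_va, y_va)
print("Training means used for scaling:", scaler_fitted.mean_)
print("Validation accuracy:", val_acc)
```

Because scaling happens inside the pipeline, the same object can also be passed to `GridSearchCV`, which would then refit the scaler within each training fold.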

#### 7. Which KNN model performs better: with or without the predictors normalized? Briefly explain how you make this determination and why you think this is the case.

Without normalization, the error rate is 3.61% and the precision is 98.01%.

With normalization, the error rate is 5.67% and the precision is 95.36%.

Judged on the same validation set by both error rate and precision, the KNN model without normalization performs better. KNN predicts from the distances between points. On the raw scale, predictors with large spreads (for example PARA_B, Money_Value, and Inherent_Risk) dominate the distance; normalization rescales every predictor to unit variance so that all of them contribute equally. In this data set the large-scale predictors appear to carry much of the useful information, so giving equal weight to the remaining predictors may add noise to the distance and make the model slightly less accurate.
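
To illustrate how scale drives the distance calculation, the sketch below compares three hypothetical firms on two predictors, PARA_B and Score, using the standard deviations from the `describe()` output above. The firms' values are made up for illustration.

```python
import numpy as np

# Hypothetical firms described by (PARA_B, Score).
firm_a = np.array([10.0, 2.0])
firm_b = np.array([60.0, 2.0])   # differs from A by about 1 std in PARA_B
firm_c = np.array([10.0, 5.2])   # differs from A by about 3.7 std in Score

# Standard deviations of PARA_B and Score from describe().
stds = np.array([50.114461, 0.859106])

def euclid(u, v):
    """Euclidean distance between two points."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Raw scale: the PARA_B gap dominates, so C looks far closer to A than B does.
raw_ab, raw_ac = euclid(firm_a, firm_b), euclid(firm_a, firm_c)

# Standardized scale: each gap is measured in standard deviations.
# (Subtracting the means would cancel out in the differences.)
std_ab = euclid(firm_a / stds, firm_b / stds)
std_ac = euclid(firm_a / stds, firm_c / stds)

print(f"raw:          d(A,B)={raw_ab:.2f}  d(A,C)={raw_ac:.2f}")
print(f"standardized: d(A,B)={std_ab:.2f}  d(A,C)={std_ac:.2f}")
```

The nearest neighbor of firm A changes from C to B after standardizing, which is why the two KNN fits can classify the same validation firm differently.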

#### 8. For KNN, which k yields the lowest error rate? By 5-fold cross-validation (5FCV), find the k with the lowest classification error rate.

In [39]:
rows = audit.shape[0]
upper_k = int(math.sqrt(rows))
ks = list(range(1, upper_k + 1))
para = {'n_neighbors': ks}

knni = KNeighborsClassifier()
knn_cv = GridSearchCV(knni, para, cv=KFold(5, random_state=13, shuffle=True))
knn_cv.fit(X, target)

print(f"Best parameters: {knn_cv.best_params_}")
print(f"Best cross-validation score: {knn_cv.best_score_}")
print(f"Lowest classification error rate: {1 - knn_cv.best_score_}")

Best parameters: {'n_neighbors': 1}
Best cross-validation score: 0.9793548387096775
Lowest classification error rate: 0.020645161290322456


Using 5FCV, the k with the lowest classification error rate is k=1, with an error rate of 2.06%.
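
It can also help to look at the cross-validated error for every candidate k rather than only the winner, since nearby values of k may perform almost identically. The sketch below reads the per-k scores from `cv_results_` after a grid search; it runs on synthetic data from `make_classification` as a stand-in for `X` and `target`, so its numbers do not describe the audit data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (X, target); the audit data is not loaded here.
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=13)

ks = list(range(1, 21))
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": ks},
    cv=KFold(5, shuffle=True, random_state=13),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# Mean CV error rate for every candidate k, in the order of the grid.
cv_errors = 1 - search.cv_results_["mean_test_score"]
for k, err in zip(ks, cv_errors):
    print(f"k={k:2d}  CV error={err:.4f}")

print("k with lowest CV error:", search.best_params_["n_neighbors"])
```

Applied to the notebook's `knn_cv` object, the same `cv_results_["mean_test_score"]` array would show how much worse k = 2, 3, ... are than k = 1.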

#### 9. In the long run, what problem might arise from the nature of the sample if the government heavily uses your best KNN model to target audits? Hint: the firms in the data are all firms that were audited.

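If the government relies on the KNN model with the best parameter from question 8 (`n_neighbors = 1`), several problems could arise:   
1. Overfitting. With k = 1, each prediction copies the label of the single nearest training firm, so the model may be too closely tailored to the training data and less accurate on new cases.   
2. Selection bias. The model is trained only on firms that were previously audited, which were probably not chosen at random, so the sample may not represent the wider population of firms. Predictions for firms unlike those already audited could be biased, and if future audits are targeted by the model, new data will keep coming from the same kinds of firms, reinforcing the bias over time.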
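
The overfitting concern with k = 1 can be seen in a small sketch: on its own training data a 1-nearest-neighbor model is always right, because each point is its own nearest neighbor, so training accuracy says nothing about new firms. The example uses synthetic data with noisy labels, not the audit data, and all names in it are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(13)

# Synthetic data (not the audit data): the label depends on the first
# feature plus noise, so no classifier can be perfect on new points.
X_demo = rng.normal(size=(600, 2))
noise = rng.normal(scale=1.0, size=600)
y_demo = (X_demo[:, 0] + noise > 0).astype(int)

X_fit, X_new = X_demo[:300], X_demo[300:]
y_fit, y_new = y_demo[:300], y_demo[300:]

results = {}
for k in (1, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
    results[k] = (model.score(X_fit, y_fit), model.score(X_new, y_new))
    print(f"k={k:2d}  train accuracy={results[k][0]:.3f}  "
          f"new-data accuracy={results[k][1]:.3f}")
```

The gap between training and new-data accuracy at k = 1 is the kind of optimism that cross-validation is meant to expose, though cross-validation on audited firms alone cannot correct for how those firms were selected.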