# Predictive Modeling

The Consumer Financial Protection Bureau (CFPB) is a U.S. government agency that makes sure financial companies treat their customers fairly. Through its website, customers of financial services can file complaints about unfair treatment by banks and other financial companies when those companies have not resolved the issue to the customer’s satisfaction.
 
When customers choose to complain to the CFPB, financial companies incur additional costs to resolve such complaints.

On receipt, the CFPB routes complaints to the financial companies, which generally respond to the consumer within 15 days. Once a response is provided, one of two things can happen:

1.	In most cases, consumers accept the response or remediation offered by the financial company.
2.	In other cases, they dispute the resolution offered by the company (flagged in the 'Consumer disputed?' field). In these situations, the bank has to perform additional investigations and possibly offer further relief to the customer. As a result, the cost of dealing with disputes can be high.

The original dataset for this project has over 2 million anonymized recent records and covers 6000+ financial providers of all varieties. It can be downloaded by following the instructions at https://www.consumerfinance.gov/data-research/consumer-complaints/.

For this project, we will use only the data up to 2017, and only for the top 5 banks in the US. To make sure we are all working from the same data, we will use the file complaints_25Nov21.csv, available in Jupyterhub under the shared/ folder.

The cost structure:
1.	On average, it costs the banks $100 to resolve, respond to and close a complaint that is not disputed.
2.	It costs banks an extra $500 to resolve a complaint that has been disputed. This $500 is on top of the $100 they have already spent, so a disputed complaint costs $600 in total.
3.	Extra diligence: If the banks know in advance which complaints will be disputed, they can perform “extra diligence” during the first round of addressing the complaint, with a view to avoiding an eventual dispute. Extra diligence costs $90 per complaint and guarantees that the customer will not dispute the complaint. The $90 is wasted, however, if the customer would not have disputed the complaint anyway.
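The cost structure above implies a fixed cost for each of the four outcomes of a binary classifier. The following is a minimal sketch, assuming extra diligence is performed exactly when the model predicts a dispute; the helper name `total_cost` and the example counts are hypothetical.

```python
# Costs taken from the cost structure above
BASE_COST = 100        # resolve, respond to and close any complaint
DISPUTE_COST = 500     # extra cost when a complaint is disputed
DILIGENCE_COST = 90    # extra diligence, which prevents the dispute


def total_cost(tp, fp, fn, tn):
    """Total cost given confusion-matrix counts (hypothetical helper).

    tp: disputes flagged by the model -> base + diligence, dispute avoided
    fp: non-disputes flagged          -> base + diligence (wasted)
    fn: disputes missed               -> base + dispute cost
    tn: non-disputes left alone       -> base only
    """
    return (
        (tp + fp) * (BASE_COST + DILIGENCE_COST)
        + fn * (BASE_COST + DISPUTE_COST)
        + tn * BASE_COST
    )


# Made-up counts for illustration only
print(total_cost(tp=60, fp=40, fn=20, tn=80))
```

Under these assumptions a flagged complaint always costs $190, whether or not it would have been disputed, and a missed dispute costs $600.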


In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
import seaborn as sns
from sklearn import tree
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import mean_absolute_error, mean_squared_error, ConfusionMatrixDisplay
from sklearn import metrics

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

import sklearn.preprocessing as preproc

In [2]:
#load data
df = pd.read_csv('shared/complaints_25Nov21.csv')
df

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2016-10-26,Money transfers,International money transfer,Other transaction issues,,"To whom it concerns, I would like to file a fo...",Company has responded to the consumer and the ...,"CITIBANK, N.A.",,,,Consent provided,Web,2016-10-29,Closed with explanation,Yes,No,2180490
1,2015-03-27,Bank account or service,Other bank product/service,"Account opening, closing, or management",,My name is XXXX XXXX XXXX and huband name is X...,Company chooses not to provide a public response,"CITIBANK, N.A.",PA,151XX,Older American,Consent provided,Web,2015-03-27,Closed with explanation,Yes,No,1305453
2,2015-04-20,Bank account or service,Other bank product/service,"Making/receiving payments, sending money",,XXXX 2015 : I called to make a payment on XXXX...,Company chooses not to provide a public response,U.S. BANCORP,PA,152XX,,Consent provided,Web,2015-04-22,Closed with monetary relief,Yes,No,1337613
3,2013-04-29,Mortgage,Conventional fixed mortgage,"Application, originator, mortgage broker",,,,JPMORGAN CHASE & CO.,VA,22406,Servicemember,,Phone,2013-04-30,Closed with explanation,Yes,Yes,393900
4,2013-05-29,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,"BANK OF AMERICA, NATIONAL ASSOCIATION",GA,30044,,,Referral,2013-05-31,Closed with explanation,Yes,No,418647
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
207255,2015-05-24,Debt collection,Credit card,Taking/threatening an illegal action,Sued w/o proper notification of suit,,,JPMORGAN CHASE & CO.,FL,33133,,Consent not provided,Web,2015-05-24,Closed with explanation,Yes,No,1390395
207256,2012-01-10,Mortgage,Conventional fixed mortgage,"Loan modification,collection,foreclosure",,,,JPMORGAN CHASE & CO.,NY,10312,,,Referral,2012-01-11,Closed without relief,Yes,Yes,12192
207257,2012-07-17,Student loan,Non-federal student loan,Repaying your loan,,,,"BANK OF AMERICA, NATIONAL ASSOCIATION",NH,032XX,,,Web,2012-07-18,Closed with explanation,Yes,No,118351
207258,2016-09-29,Bank account or service,Checking account,"Account opening, closing, or management",,Near the end of XXXX 2016 I opened a Citigold ...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",CA,900XX,,Consent provided,Web,2016-09-29,Closed with non-monetary relief,Yes,No,2138969


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from imblearn.under_sampling import RandomUnderSampler
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

# Select specified predictor variables
predictors = ['Product', 'Sub-product', 'Issue', 'State', 'Tags', 'Submitted via', 'Company response to consumer', 'Timely response?']
X = pd.get_dummies(df[predictors], drop_first=True)  # One-hot encode categorical variables

# Convert 'Consumer disputed?' to 0s and 1s
y = LabelEncoder().fit_transform(df['Consumer disputed?'].astype(str))

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Check the proportion of disputes in the training dataset
proportion_disputed = y_train.sum() / len(y_train)
print(f"Proportion of disputes in the training dataset: {proportion_disputed:.2f}")

# Balance the dataset with random undersampling if the proportion of disputes is less than 30%
if proportion_disputed < 0.3:
    undersampler = RandomUnderSampler(random_state=123)
    X_train, y_train = undersampler.fit_resample(X_train, y_train)

Proportion of disputes in the training dataset: 0.22
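Random undersampling balances the classes by discarding randomly chosen rows from the majority class until every class has as many rows as the minority class. The dependency-free sketch below illustrates the idea on made-up labels. It is not the imblearn implementation, and `undersample_indices` is a hypothetical helper.

```python
import random


def undersample_indices(labels, seed=123):
    """Return sorted row indices with every class cut down to the minority size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    minority_size = min(len(rows) for rows in by_class.values())
    kept = []
    for rows in by_class.values():
        # Sample without replacement; the minority class is kept in full
        kept.extend(rng.sample(rows, minority_size))
    return sorted(kept)


# Made-up labels: 2 disputes (1) out of 9 complaints
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0]
kept = undersample_indices(labels)
balanced = [labels[i] for i in kept]
print(sum(balanced) / len(balanced))  # prints 0.5
```

Only the training rows are resampled in the cell above; the test set keeps its original class mix, so test metrics and costs reflect the real share of disputes.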


In [None]:
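# Baseline: with no model, every dispute costs $100 + $500
num_disputed = y_test.sum()
num_non_disputed = len(y_test) - num_disputed
base_case_cost = num_disputed * 600 + num_non_disputed * 100

# Model-based cost; conf_matrix comes from the cell that fits the XGBClassifier
TN, FP, FN, TP = conf_matrix.ravel()
# Flagged complaints (TP, FP) cost $100 + $90, missed disputes (FN) cost $600,
# and correctly unflagged complaints (TN) cost $100
model_based_cost = (TP + FP) * 190 + FN * 600 + TN * 100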


In [None]:
y_scores = model_xgb.predict_proba(X_test)[:, 1]

best_threshold = 0.5
min_cost = float('inf')

for threshold in np.linspace(0,1,101):
    y_pred_threshold = (y_scores >= threshold).astype(int)
    conf_matrix = confusion_matrix(y_test, y_pred_threshold)
    TN, FP, FN, TP = conf_matrix.ravel()
    # Flagged complaints cost $190, missed disputes $600, all others $100
    cost = (TP + FP) * 190 + FN * 600 + TN * 100
    if cost < min_cost:
        min_cost = cost
        best_threshold = threshold

## In the test set (not the entire dataset), what proportion of consumers raised a dispute?

In [5]:
#1
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Encode the target variable ('No' -> 0, 'Yes' -> 1)
le = preprocessing.LabelEncoder()
y = le.fit_transform(df['Consumer disputed?'])

# Repeat the 80/20 split with the same seed, keeping only the test labels so the
# one-hot encoded X_train / X_test from the earlier cell are not overwritten
features = df.drop(columns=['Consumer disputed?'])
_, _, _, y_test = train_test_split(features, y, test_size=0.2, random_state=123)

# Calculate the proportion of disputes in the test set
dispute_proportion_test_set = y_test.mean()
dispute_proportion_test_set

0.21586413200810575

## What proportion of consumers in the training dataset raised a dispute?

In [6]:
#2
from imblearn.under_sampling import RandomUnderSampler

# Since we're asked after random undersampling, let's perform the undersampling on the training set
undersampler = RandomUnderSampler(random_state=123)
X_resampled, y_resampled = undersampler.fit_resample(X_train, y_train)

# Calculate the proportion of disputes in the undersampled training set
dispute_proportion_undersampled_train_set = y_resampled.mean()
dispute_proportion_undersampled_train_set

0.5

## Fit the XGBClassifier model as described in the instructions, and evaluate it on the test set. What is the recall for the category 'Consumer disputed?' = 'Yes' on the test set?

In [4]:
#3
# Train the XGBClassifier model
model_xgb = XGBClassifier(use_label_encoder=False, objective='binary:logistic', eval_metric='logloss', random_state=123)
model_xgb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model_xgb.predict(X_test)

# Generate the classification report
report = classification_report(y_test, y_pred, output_dict=True)

#Extract recall for 'Consumer disputed?' = 'Yes'
#recall_yes = report['1']['recall']
#print(f"Recall for 'Consumer disputed? = Yes': {recall_yes}")

print(classification_report(y_test, y_pred))
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)  # Print the computed matrix, not the confusion_matrix function


              precision    recall  f1-score   support

           0       0.84      0.53      0.65     32504
           1       0.27      0.63      0.38      8948

    accuracy                           0.55     41452
   macro avg       0.55      0.58      0.51     41452
weighted avg       0.72      0.55      0.59     41452
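Recall for the 'Yes' class is the share of actual disputes that the model flags, TP / (TP + FN), while precision is the share of flagged complaints that are actual disputes, TP / (TP + FP). A small sketch with made-up counts (not the counts produced by the model in this notebook) shows the arithmetic; `precision_recall` is a hypothetical helper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for illustration only
precision, recall = precision_recall(tp=30, fp=70, fn=20)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

For the cost problem, recall matters more than precision: a missed dispute costs far more than a wasted round of extra diligence.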


## If there were no model, what would be the total cost to the banks of dealing with the complaints in the test set?

In [7]:
#4
# The number of non-disputed complaints
num_non_disputed = (y_test == 0).sum()

# The number of disputed complaints
num_disputed = (y_test == 1).sum()

# Calculate the total cost without the model
total_cost_no_model = (num_non_disputed * 100) + (num_disputed * 600)
total_cost_no_model

8619200
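As a cross-check, the same figure follows from the class supports shown in the classification report for the test set (32,504 undisputed and 8,948 disputed complaints):

```python
# Supports taken from the classification report on the test set
num_non_disputed = 32504
num_disputed = 8948

# Every complaint costs $100; a disputed one costs $500 more
baseline = num_non_disputed * 100 + num_disputed * (100 + 500)
print(baseline)  # prints 8619200
```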

## Assume that if the model predicts a complaint will be disputed, the banks decide to spend $90 performing extra diligence to avoid the $600 cost of a dispute.

In [None]:
#5
from sklearn.metrics import confusion_matrix

# Calculate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
TP, FP, TN, FN = conf_matrix[1, 1], conf_matrix[0, 1], conf_matrix[0, 0], conf_matrix[1, 0]

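# Costs per complaint
base_cost = 100  # Resolve, respond to and close any complaint
cost_per_diligence = 90  # Extra diligence for each complaint predicted to be disputed
cost_per_missed_dispute = 500  # Extra cost for each disputed complaint that wasn't predicted

# Every complaint incurs the base cost; flagged complaints (TP, FP) add the diligence
# cost, and missed disputes (FN) add the dispute cost
total_cost = (TP + FP + TN + FN) * base_cost + (TP + FP) * cost_per_diligence + FN * cost_per_missed_dispute
total_cost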

## The costs to the banks from doing extra diligence and from having disputes are asymmetrical. Therefore you have the opportunity to reduce total cost by varying the probability threshold from the default 0.5 in a binary classification situation such as this.

## Change the value of the threshold and determine the lowest total cost to the banks based on the observations in the test set.
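One way to see why a threshold below 0.5 can pay off: extra diligence costs $90 and avoids an extra $500 with probability p, so under the stated cost structure it lowers expected cost when 500p > 90, that is, when p > 0.18. The sketch below computes this break-even point. It assumes p is a calibrated dispute probability; scores from a model trained on an undersampled (balanced) training set are generally not calibrated to the original class mix, so the best threshold on those scores still has to be found empirically. Both function names are hypothetical.

```python
def break_even_probability(diligence_cost, dispute_extra_cost):
    """Dispute probability above which extra diligence lowers expected cost."""
    return diligence_cost / dispute_extra_cost


def expected_costs(p, base=100, diligence=90, dispute_extra=500):
    """Expected cost of one complaint with and without extra diligence."""
    with_diligence = base + diligence
    without_diligence = base + p * dispute_extra
    return with_diligence, without_diligence


print(break_even_probability(90, 500))  # prints 0.18
print(expected_costs(0.25))             # prints (190, 225.0)
```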

In [None]:
#6
import numpy as np

# Get the predicted probabilities for the positive class
y_probs = model_xgb.predict_proba(X_test)[:, 1]

# Initialize variables to store the best threshold and the corresponding lowest cost
best_threshold = None
lowest_cost = float('inf')

# Define the costs
base_cost = 100  # Every complaint costs $100 to resolve
cost_per_extra_diligence = 90
cost_per_dispute = 600  # $100 base + $500 extra for a disputed complaint

# Iterate over a range of possible threshold values
for threshold in np.linspace(0, 1, 101):
    # Convert probabilities to binary predictions based on the current threshold
    y_pred_threshold = (y_probs >= threshold).astype(int)
    
    # Calculate the confusion matrix
    conf_matrix = confusion_matrix(y_test, y_pred_threshold)
    TP, FP, TN, FN = conf_matrix[1, 1], conf_matrix[0, 1], conf_matrix[0, 0], conf_matrix[1, 0]
    
    # Total cost: flagged complaints pay base + diligence, missed disputes pay
    # the full dispute cost, and the remaining complaints pay the base cost
    total_cost = (TP + FP) * (base_cost + cost_per_extra_diligence) + FN * cost_per_dispute + TN * base_cost
    
    # Update the best threshold and lowest cost if this threshold results in a lower cost
    if total_cost < lowest_cost:
        best_threshold = threshold
        lowest_cost = total_cost

print(f"Best threshold: {best_threshold}, Lowest total cost: ${lowest_cost}")
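The sweep can be tried end to end on a toy example without a fitted model. The labels and scores below are made up, and `cost_at_threshold` is a hypothetical helper that charges the $100 base cost for every complaint, $90 more for each flagged complaint, and $500 more for each missed dispute.

```python
def cost_at_threshold(y_true, scores, threshold):
    """Total cost when diligence is applied to every score >= threshold."""
    cost = 0
    for actual, score in zip(y_true, scores):
        if score >= threshold:
            cost += 100 + 90      # base cost plus diligence, dispute avoided
        elif actual == 1:
            cost += 100 + 500     # missed dispute
        else:
            cost += 100           # undisputed complaint, left alone
    return cost


# Made-up labels and scores for illustration only
y_true = [0, 0, 1, 0, 1, 0, 1, 0]
scores = [0.10, 0.20, 0.35, 0.40, 0.60, 0.55, 0.80, 0.05]

thresholds = [i / 100 for i in range(101)]
# min() returns the first (lowest) threshold that reaches the minimum cost
best = min(thresholds, key=lambda t: cost_at_threshold(y_true, scores, t))
print(best, cost_at_threshold(y_true, scores, best))  # prints 0.21 1250
```

In this toy data the cheapest policy flags every complaint scoring above 0.20, accepting two wasted rounds of diligence in order to catch all three disputes.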

## At what value of the threshold is the lowest dollar cost achieved?


In [9]:
#7
# best_threshold was set by the threshold sweep in the previous cell
print(f"Threshold with the lowest total cost: {best_threshold:.2f}")