# CFPB Consumer Complaints Modeling  
The Consumer Financial Protection Bureau (CFPB) is a U.S. government agency that works to ensure financial companies treat their customers fairly. Through its website, customers of financial services can file complaints about unfair treatment by banks and other financial companies when those companies fail to resolve the issue to the customer’s satisfaction. On receipt, the CFPB routes each complaint to the company concerned.

We build a model that helps banks identify complaints that are likely to end in a dispute. The goal is to minimize total cost: if a bank can identify future disputes in advance, it can avoid the larger dispute cost by performing the cheaper extra diligence up front.

The cost structure:
On average, it costs a bank \$100 to resolve, respond to and close a complaint that is not disputed. A disputed complaint costs an extra \$500 to resolve, for a total of \$600.
Extra diligence: If a bank knows in advance which complaints will be disputed, it can perform “extra diligence” during the first round of addressing the complaint in order to avoid an eventual dispute. Extra diligence costs \$90 per complaint and guarantees that the customer will not dispute the complaint. That money is wasted, however, if the customer would not have disputed the complaint anyway.
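
The cost rules above can be sketched as a small function. This is only an illustration, not part of the original analysis: the constant and function names are hypothetical, and it assumes the \$90 diligence cost is paid on top of the \$100 base handling cost.

```python
BASE_COST = 100        # resolve and close an undisputed complaint
DISPUTE_EXTRA = 500    # additional cost once a complaint is disputed
DILIGENCE_COST = 90    # extra diligence; ASSUMED to be on top of BASE_COST


def complaint_cost(flagged: bool, would_dispute: bool) -> int:
    """Cost of one complaint, given whether the bank performed extra
    diligence (flagged) and whether the customer would otherwise dispute."""
    if flagged:
        # Diligence guarantees no dispute, whether or not one was coming.
        return BASE_COST + DILIGENCE_COST
    return BASE_COST + (DISPUTE_EXTRA if would_dispute else 0)


for flagged in (True, False):
    for would_dispute in (True, False):
        print(flagged, would_dispute, complaint_cost(flagged, would_dispute))
```

Under this reading, a flagged complaint always costs \$190, a missed dispute costs \$600, and every other complaint costs \$100.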

For this project, we use only the data through 2017, and only for the top 5 banks in the US. Source: https://www.consumerfinance.gov/data-research/consumer-complaints/

In [158]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

In [159]:
complaints = pd.read_csv('shared/complaints_25Nov21.csv')
complaints

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2016-10-26,Money transfers,International money transfer,Other transaction issues,,"To whom it concerns, I would like to file a fo...",Company has responded to the consumer and the ...,"CITIBANK, N.A.",,,,Consent provided,Web,2016-10-29,Closed with explanation,Yes,No,2180490
1,2015-03-27,Bank account or service,Other bank product/service,"Account opening, closing, or management",,My name is XXXX XXXX XXXX and huband name is X...,Company chooses not to provide a public response,"CITIBANK, N.A.",PA,151XX,Older American,Consent provided,Web,2015-03-27,Closed with explanation,Yes,No,1305453
2,2015-04-20,Bank account or service,Other bank product/service,"Making/receiving payments, sending money",,XXXX 2015 : I called to make a payment on XXXX...,Company chooses not to provide a public response,U.S. BANCORP,PA,152XX,,Consent provided,Web,2015-04-22,Closed with monetary relief,Yes,No,1337613
3,2013-04-29,Mortgage,Conventional fixed mortgage,"Application, originator, mortgage broker",,,,JPMORGAN CHASE & CO.,VA,22406,Servicemember,,Phone,2013-04-30,Closed with explanation,Yes,Yes,393900
4,2013-05-29,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,"BANK OF AMERICA, NATIONAL ASSOCIATION",GA,30044,,,Referral,2013-05-31,Closed with explanation,Yes,No,418647
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
207255,2015-05-24,Debt collection,Credit card,Taking/threatening an illegal action,Sued w/o proper notification of suit,,,JPMORGAN CHASE & CO.,FL,33133,,Consent not provided,Web,2015-05-24,Closed with explanation,Yes,No,1390395
207256,2012-01-10,Mortgage,Conventional fixed mortgage,"Loan modification,collection,foreclosure",,,,JPMORGAN CHASE & CO.,NY,10312,,,Referral,2012-01-11,Closed without relief,Yes,Yes,12192
207257,2012-07-17,Student loan,Non-federal student loan,Repaying your loan,,,,"BANK OF AMERICA, NATIONAL ASSOCIATION",NH,032XX,,,Web,2012-07-18,Closed with explanation,Yes,No,118351
207258,2016-09-29,Bank account or service,Checking account,"Account opening, closing, or management",,Near the end of XXXX 2016 I opened a Citigold ...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",CA,900XX,,Consent provided,Web,2016-09-29,Closed with non-monetary relief,Yes,No,2138969


## 1. In the test set (not the entire dataset), what proportion of consumers raised a dispute?

In [161]:
# Select relevant columns
complaints = complaints[['Product', 'Sub-product', 'Issue', 'State', 'Tags', 'Submitted via', 'Company response to consumer', 'Timely response?', 'Consumer disputed?']]

# Encode 'Consumer disputed?' as 0 (No) and 1 (Yes)
le = preprocessing.LabelEncoder()
y = le.fit_transform(complaints['Consumer disputed?'])

# Select factors for prediction
X = complaints.drop('Consumer disputed?', axis=1)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Calculate the proportion of consumers in the test set who raised a dispute
disputed_proportion_1 = sum(y_test) / len(y_test)

print(round(disputed_proportion_1, 4))

0.2159


## 2. After you have performed random undersampling, what proportion of consumers in the training dataset raised a dispute?

In [162]:
# Perform random undersampling
undersampler = RandomUnderSampler(random_state=123)
X_train_resampled, y_train_resampled = undersampler.fit_resample(X_train, y_train)

# Calculate the proportion of consumers in the resampled training set who raised a dispute
disputed_proportion_2 = sum(y_train_resampled) / len(y_train_resampled)

print(disputed_proportion_2)

0.5
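
The 0.5 result follows from how random undersampling works: every row of the minority class is kept, and the other class is randomly cut down to the same size. A minimal pure-Python sketch of that idea follows; `random_undersample` is a hypothetical helper written for illustration, not the imbalanced-learn implementation, and the 78/22 label split is made up.

```python
import random
from collections import Counter


def random_undersample(labels, seed=123):
    """Return sorted row indices that keep all rows of the smallest class
    and an equally sized random sample of every other class."""
    rng = random.Random(seed)
    rows_by_class = {}
    for idx, label in enumerate(labels):
        rows_by_class.setdefault(label, []).append(idx)
    n_min = min(len(rows) for rows in rows_by_class.values())
    kept = []
    for rows in rows_by_class.values():
        kept.extend(rows if len(rows) == n_min else rng.sample(rows, n_min))
    return sorted(kept)


labels = [0] * 78 + [1] * 22          # made-up imbalanced labels
kept = random_undersample(labels)
balanced = Counter(labels[i] for i in kept)
print(balanced[1] / len(kept))        # share of the positive class after resampling
```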


## 3. Fit the XGBClassifier model as described in the instructions, and evaluate it on the test set.  What is the recall for the category 'Consumer disputed?' = 'Yes' on the test set?

In [163]:
# Combine training and test sets
combined_data = pd.concat([X_train_resampled, X_test])

# One-hot encode categorical variables
encoder = OneHotEncoder()
X_combined_encoded = encoder.fit_transform(combined_data)

# Split the data
X_train_encoded = X_combined_encoded[:len(X_train_resampled)]
X_test_encoded = X_combined_encoded[len(X_train_resampled):]

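# Fit the XGBClassifier model
# (the hyperparameters from the instructions are not shown here;
#  library defaults and random_state=123 are assumptions)
model_xgb = XGBClassifier(random_state=123)
model_xgb.fit(X_train_encoded, y_train_resampled)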

# Make predictions on the test set
y_pred = model_xgb.predict(X_test_encoded)

report = classification_report(y_test, y_pred, target_names=le.classes_)

print(report)

              precision    recall  f1-score   support

          No       0.84      0.53      0.65     32504
         Yes       0.27      0.63      0.38      8948

    accuracy                           0.55     41452
   macro avg       0.55      0.58      0.51     41452
weighted avg       0.72      0.55      0.59     41452
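
To make the report easier to read, here is a short sketch of how precision and recall for the positive class ('Yes') are derived from confusion-matrix counts. The helper name and the counts are hypothetical and chosen only for illustration; they are not the counts behind the report above.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall of the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: many false alarms, most real disputes caught
precision, recall = precision_recall(tp=60, fp=140, fn=40)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A model trained on balanced data and evaluated on imbalanced data often shows this pattern of fairly high recall with low precision, because it flags many complaints that would not have been disputed.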



## 4. If there were no model, what would be the total cost to the banks of dealing with the complaints in the test set? 

In [164]:
# Count the number of disputed and not-disputed complaints
num_disputed = sum(y_test)
num_not_disputed = len(y_test) - num_disputed

# Calculate the total cost without model
total_cost_without_model = (num_disputed * 600) + (num_not_disputed * 100)

print(total_cost_without_model)

8619200
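
As a sanity check, the same figure can be reproduced by hand from the class supports in the classification report (8948 'Yes' and 32504 'No'). The variable names below are chosen for illustration.

```python
n_disputed = 8948        # support for 'Yes' in the classification report
n_not_disputed = 32504   # support for 'No' in the classification report

# Without a model, every dispute costs $100 + $500 and everything else $100.
baseline_cost = n_disputed * 600 + n_not_disputed * 100
print(baseline_cost)
```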


## 5. Use the predictions for which complaints are likely to be disputed from the model you have created (using the default classification threshold). Assume that if the model predicts a complaint will be disputed, the banks decide to spend \$90 performing extra diligence to avoid the \$600 cost of a dispute.
## In this situation based on model results, what would be the total cost to the banks of dealing with the complaints in the test set?


In [165]:
# Counting the number of complaints predicted correctly
num_predicted_disputed_correctly = sum((y_pred == 1) & (y_test == 1))
num_predicted_not_disputed_correctly = sum((y_pred == 0) & (y_test == 0))

# Counting the number of complaints predicted incorrectly
num_predicted_not_disputed_incorrectly = sum((y_pred == 0) & (y_test == 1))
num_predicted_disputed_incorrectly = sum((y_pred == 1) & (y_test == 0))

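# Calculating the total cost with the model:
# - flagged complaints (predicted disputed) cost $100 + $90 diligence = $190 and are never disputed
# - missed disputes (predicted not disputed, actually disputed) cost $600
# - correctly ignored complaints cost $100
total_cost_with_model = (
    (num_predicted_disputed_correctly + num_predicted_disputed_incorrectly) * 190
    + num_predicted_not_disputed_incorrectly * 600
    + num_predicted_not_disputed_correctly * 100
)

print(total_cost_with_model)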
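
One way to reason about whether flagging pays off is to compare against the no-model baseline: each correctly flagged dispute avoids a \$600 outcome at a cost of \$190, while each needlessly flagged complaint wastes \$90. The sketch below uses hypothetical function names and made-up counts, and assumes the diligence cost is paid on top of the \$100 base cost.

```python
BASE, DISPUTE_TOTAL, FLAGGED = 100, 600, 190  # FLAGGED assumes $100 base + $90 diligence


def net_savings(tp: int, fp: int) -> int:
    """Savings versus handling every complaint without a model."""
    return tp * (DISPUTE_TOTAL - FLAGGED) - fp * (FLAGGED - BASE)


def break_even_precision() -> float:
    """Minimum precision on the 'disputed' class at which flagging stops losing money."""
    saved_per_tp = DISPUTE_TOTAL - FLAGGED   # 410
    wasted_per_fp = FLAGGED - BASE           # 90
    return wasted_per_fp / (saved_per_tp + wasted_per_fp)


print(net_savings(tp=50, fp=100))   # made-up counts
print(break_even_precision())
```

Under these assumptions the break-even precision is 0.18, so a precision of 0.27 for the 'Yes' class, as in the report above, would suggest the model reduces cost overall.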


## 6. The costs to the banks of performing extra diligence and of having disputes are asymmetric. You therefore have the opportunity to reduce total cost by varying the probability threshold from the default 0.5 used in a binary classification situation such as this one.
## Change the value of the threshold and determine the lowest total cost to the banks based on the observations in the test set.

In [156]:
# Calculate predicted probabilities
y_prob = model_xgb.predict_proba(X_test_encoded)[:, 1]

# Create an array of possible thresholds
thresholds = np.arange(0.1, 1.0, 0.01)

lowest_total_cost = float('inf')
best_threshold = None

# Calculating the total cost at each threshold
# (flagged complaints cost $190, missed disputes $600, all others $100)
for threshold in thresholds:
    y_pred_thresholded = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_thresholded).ravel()
    total_cost = tn * 100 + (fp + tp) * 190 + fn * 600

    if total_cost < lowest_total_cost:
        lowest_total_cost = total_cost
        best_threshold = threshold

print(lowest_total_cost)


## 7. At what value of the threshold is the lowest dollar cost achieved?

In [157]:
print(best_threshold)
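
For intuition about where a cost-minimizing threshold might lie, here is a hedged back-of-the-envelope sketch. Flagging is worthwhile when the expected dispute cost avoided (p × \$500) exceeds the \$90 diligence cost. Because the model was trained on 50/50 undersampled data, its scores are shifted relative to the true dispute probability, so the sketch also maps the break-even probability onto the score scale. The function names are hypothetical, the base rate of 0.22 is a rounded illustrative value, and the mapping assumes well-calibrated scores, which XGBoost does not guarantee.

```python
def break_even_probability(diligence: float = 90.0, dispute_extra: float = 500.0) -> float:
    """True dispute probability above which extra diligence pays off on average."""
    return diligence / dispute_extra


def score_threshold_on_balanced_model(p_true: float, base_rate: float) -> float:
    """Translate a threshold on the true probability into the score scale of a
    model trained on data undersampled to a 50/50 class balance."""
    true_odds = p_true / (1.0 - p_true)
    balanced_odds = true_odds * (1.0 - base_rate) / base_rate
    return balanced_odds / (1.0 + balanced_odds)


p_star = break_even_probability()
score_star = score_threshold_on_balanced_model(p_star, base_rate=0.22)
print(p_star, round(score_star, 3))
```

This is only a rough guide. The empirical sweep over thresholds on held-out data remains the deciding evidence, and ideally the threshold would be tuned on a validation set rather than on the test set itself.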
