# Reading Data Into Python and Creating Analysis File

The goal of this project is to use the narrative complaints submitted through the CFPB complaint intake form to predict whether a complaint will be closed (with or without an explanation) or whether the customer will receive relief.

Initial exploration will consider an aggregate of all complaints across all states and time periods. The inputs will be the product, sub-product, consumer complaint narrative, company, and state.

Possible things to take into account: recency (should more recent complaints get more weight than older ones?) and whether the consumer disputed the outcome.
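
One way the recency idea could be expressed is an exponential-decay sample weight, sketched below. The function name `recency_weights`, the one-year half-life, and the choice of the newest complaint as the reference date are all assumptions made for illustration; the dates follow the `MM/DD/YYYY` format of the `Date received` column.

```python
import pandas as pd

def recency_weights(dates, half_life_days=365.0, as_of=None):
    """Return a weight in (0, 1] per complaint; the weight halves every half_life_days."""
    dates = pd.to_datetime(dates, format="%m/%d/%Y")
    # Assumption: measure age relative to the newest complaint unless told otherwise
    as_of = dates.max() if as_of is None else pd.Timestamp(as_of)
    age_days = (as_of - dates).dt.days
    return 0.5 ** (age_days / half_life_days)

# Illustrative dates only, not taken from the complaint file
example_dates = pd.Series(["01/16/2019", "01/16/2018", "01/16/2017"])
example_weights = recency_weights(example_dates)
```

Weights of this kind could be passed as sample weights when fitting a model, if the chosen estimator supports them.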

Another option is to build several representative models and combine them with some form of model averaging; see https://docs.pymc.io/notebooks/model_averaging.html
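
As a rough illustration of model averaging, the sketch below blends predicted probabilities from several models using weights proportional to `exp(-0.5 * delta)`, where `delta` is each model's score minus the best score (lower is better, as with an information criterion). This is one common weighting scheme, not necessarily the one this project will adopt; the function name, scores, and probabilities are made up for the example.

```python
import numpy as np

def average_predictions(prob_matrix, scores):
    """Blend per-model probabilities; prob_matrix has shape (n_models, n_observations)."""
    scores = np.asarray(scores, dtype=float)
    delta = scores - scores.min()      # lower score = better model
    weights = np.exp(-0.5 * delta)
    weights /= weights.sum()           # normalize so the weights sum to 1
    blended = np.asarray(prob_matrix, dtype=float).T @ weights
    return weights, blended

# Hypothetical: three models scoring four complaints
probs = np.array([[0.10, 0.80, 0.40, 0.30],
                  [0.20, 0.70, 0.50, 0.20],
                  [0.05, 0.90, 0.30, 0.40]])
weights, blended = average_predictions(probs, scores=[100.0, 102.0, 104.0])
```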

In [28]:
import pandas as pd
import numpy as np
import keras
import nltk
import re
import codecs
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from collections import namedtuple

In [29]:
cfpb=pd.read_csv("Consumer_Complaints.csv")
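
ZIP codes in this file mix digits with masked values such as `377XX`, so it may be safer to read that column as text and to parse the date column explicitly. The sketch below shows the idea on a tiny in-memory stand-in for `Consumer_Complaints.csv`; the sample rows are illustrative, and whether the real file needs these options is an assumption.

```python
import io
import pandas as pd

# Tiny stand-in for Consumer_Complaints.csv (illustrative values only)
sample = io.StringIO(
    "Date received,ZIP code,Complaint ID\n"
    "01/16/2019,02139,1\n"
    "01/16/2019,377XX,2\n"
)

# Read ZIP codes as strings so leading zeros and masked digits are preserved
demo = pd.read_csv(sample, dtype={"ZIP code": str}, parse_dates=["Date received"])
```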

In [30]:
cfpb.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,01/16/2019,"Credit reporting, credit repair services, or o...",Credit reporting,Improper use of your report,Reporting company used your report improperly,,,"Diversified Consultants, Inc.",PA,18301,,,Web,01/16/2019,In progress,Yes,,3126392
1,01/16/2019,Debt collection,Other debt,Written notification about debt,Didn't receive notice of right to dispute,,,"Diversified Consultants, Inc.",TX,78130,,,Web,01/16/2019,In progress,Yes,,3126504
2,01/16/2019,Mortgage,Conventional home mortgage,Struggling to pay mortgage,,,,"BAYVIEW LOAN SERVICING, LLC",TN,377XX,,Other,Web,01/16/2019,In progress,Yes,,3126744
3,01/16/2019,Checking or savings account,Checking account,Closing an account,Company closed your account,,,NAVY FEDERAL CREDIT UNION,NC,282XX,,,Web,01/16/2019,In progress,Yes,,3126534
4,01/16/2019,"Payday loan, title loan, or personal loan",Payday loan,Struggling to pay your loan,,,Company believes it acted appropriately as aut...,BlueChip Financial,FL,,,,Web,01/16/2019,Closed with explanation,Yes,,3125859


In [31]:
#shorten the column names; "Tags" is left unchanged
#note: index=str also converts the row labels to strings
cfpb=cfpb.rename(index=str, columns={
    "Date received": "date_rec", "Product": "prod", "Sub-product": "subprod",
    "Issue": "issue", "Sub-issue": "sub_issue", "Consumer complaint narrative": "narrative",
    "Company public response": "pub_resp", "Company": "company", "State": "state",
    "ZIP code": "zip", "Consumer consent provided?": "consent", "Submitted via": "how_submit",
    "Date sent to company": "date_to_company", "Company response to consumer": "comp_resp",
    "Timely response?": "timely_resp", "Consumer disputed?": "cons_disp", "Complaint ID": "id"})

In [35]:
cfpb.shape

(1199558, 18)

In [32]:
#wrap each narrative in double quotes so embedded commas do not split the field in the CSV; missing narratives stay NaN
cfpb.narrative="\"" + cfpb.narrative + "\""
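
For comparison, `DataFrame.to_csv` with its default quoting already wraps any field that contains a comma, so the manual wrapping may not be strictly required. The sketch below checks this with a small round trip through an in-memory buffer; the example narrative is invented.

```python
import io
import pandas as pd

# Invented example rows: one narrative with a comma, one missing narrative
demo = pd.DataFrame({"id": [1, 2],
                     "narrative": ["I paid, then they charged me again", None]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)   # default quoting quotes fields containing commas
csv_text = buffer.getvalue()

buffer.seek(0)
round_trip = pd.read_csv(buffer)   # the narrative comes back as a single field
```

One thing to keep in mind: quotes added by hand are treated as part of the text, so they are escaped on write and come back as literal quote characters when the file is read again.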

In [34]:
#keep only the complaints that include a narrative and work with these first
complaints=cfpb[cfpb.narrative.notnull()]
complaints.shape
nonar=cfpb[cfpb['narrative'].isnull()]
nonar.shape

#write both subsets to CSV so they can be reloaded as independent frames, avoiding chained-assignment problems later
complaints.to_csv('narratives.csv')
nonar.to_csv('nonarratives.csv')
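
An alternative to the CSV round trip, if the concern is chained assignment on a filtered frame, is to take explicit copies of each subset. This is a minimal sketch on invented data; the variable names are placeholders.

```python
import pandas as pd

# Invented stand-in for the renamed complaint frame
demo = pd.DataFrame({"id": [1, 2, 3], "narrative": ["text a", None, "text b"]})

has_narrative = demo["narrative"].notna()
with_text = demo.loc[has_narrative].copy()      # independent copy, safe to modify
without_text = demo.loc[~has_narrative].copy()

# Adding a column to the copy does not touch the original frame
with_text["n_chars"] = with_text["narrative"].str.len()
```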

In [24]:
#check to see how much missing data there is in other columns
complaints.isna().sum()

date_rec                0
prod                    0
subprod             52173
issue                   0
sub_issue          110239
narrative               0
pub_resp           185652
company                 0
state                1357
zip                 79473
Tags               297224
consent                 0
how_submit              0
date_to_company         0
comp_resp               4
timely_resp             0
cons_disp          195125
id                      0
dtype: int64
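
Raw counts are easier to compare when shown next to the share of rows they represent. The sketch below builds such a summary on invented data; applied to `complaints`, the same two calls would give the count and fraction missing per column.

```python
import pandas as pd

# Invented data: one missing state, three missing dispute flags
demo = pd.DataFrame({"state": ["CA", None, "TX", "FL"],
                     "cons_disp": ["No", None, None, None]})

missing = pd.DataFrame({"n_missing": demo.isna().sum(),
                        "share_missing": demo.isna().mean()})
missing = missing.sort_values("share_missing", ascending=False)
```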

In [25]:
#date_rec, prod, issue, narrative, company, consent, how_submit, date_to_company, timely_resp, and id have no missing values
#next, look at the possible values of comp_resp and cons_disp
complaints['comp_resp'].value_counts()

Closed with explanation            289812
Closed with non-monetary relief     42580
Closed with monetary relief         20566
Closed                               3741
Untimely response                    2506
Name: comp_resp, dtype: int64
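
Given the stated goal, these response values could be collapsed into a binary target. The sketch below treats the two relief categories as 1 and the two plain closures as 0, and leaves everything else (for example "Untimely response", "In progress", or a missing value) undefined. How those remaining categories should be handled is an open choice, so this mapping is an assumption rather than a settled design.

```python
import pandas as pd

# Assumed mapping: 1 = customer received relief, 0 = closed without relief
LABELS = {
    "Closed": 0,
    "Closed with explanation": 0,
    "Closed with monetary relief": 1,
    "Closed with non-monetary relief": 1,
}

# Invented example values covering each case
demo = pd.Series(["Closed with explanation", "Closed with monetary relief",
                  "Untimely response", None, "Closed",
                  "Closed with non-monetary relief"])

# Unmapped or missing responses become <NA> in a nullable integer column
target = demo.map(LABELS).astype("Int64")
```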

In [8]:
complaints['cons_disp'].value_counts()

No     128277
Yes     35807
Name: cons_disp, dtype: int64
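
Because cons_disp is missing in 195,125 of the narrative rows (against 164,084 with an answer), dropping the missing rows would discard most of the data. One option is to keep them under an explicit category, sketched here on invented values; the label "Unknown" is a placeholder.

```python
import pandas as pd

# Invented example values
demo = pd.Series(["No", None, "Yes", None], name="cons_disp")

# Treat a missing dispute flag as its own category
filled = demo.fillna("Unknown")
counts = filled.value_counts()
```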

In [9]:
#many values are missing for cons_disp; could add an "unknown" category so these rows can still be included
#look at geographic distribution
complaints['state'].value_counts()

CA                                      48652
FL                                      34638
TX                                      34228
GA                                      20859
NY                                      20382
IL                                      14042
PA                                      12478
NJ                                      11920
NC                                      11607
OH                                      10861
VA                                      10017
MD                                       9910
MI                                       8199
AZ                                       8076
TN                                       6898
WA                                       6811
MA                                       6348
CO                                       5741
MO                                       5673
SC                                       5512
NV                                       5019
LA                                

In [10]:
#some of these are really rare: PW (Palau), MH (Marshall Islands). Among the 50 states plus Puerto Rico, the state with the
#fewest complaints is Wyoming, which is consistent with its small population. The states at the top and bottom of the list
#roughly track the distribution of the US population.
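
The comparison above can be made explicit by separating the 50 states plus Puerto Rico from the remaining codes before looking for the minimum. The sketch below does this on invented counts; the grouping and variable names are assumptions.

```python
import pandas as pd

# Two-letter codes for the 50 states, plus Puerto Rico
STATES = ("AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO "
          "MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY").split()
MAIN = set(STATES) | {"PR"}

# Invented state column, not the real complaint data
demo = pd.Series(["CA", "CA", "WY", "PR", "PW", "MH", "CA", "PR"])
counts = demo.value_counts()

main_counts = counts[counts.index.isin(MAIN)]     # 50 states + PR
other_counts = counts[~counts.index.isin(MAIN)]   # territories and other rare codes
least_common = main_counts.idxmin()
```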