In this notebook, we prepare the dataset that will be used to demonstrate text classification by fine-tuning a BERT-based model.

The data used here is obtained from the [Consumer Complaint Database](https://catalog.data.gov/dataset/consumer-complaint-database).

We download the entire dataset as a CSV file into the local *data* folder and then read it into a pandas DataFrame.

In [50]:
import pandas as pd

#df1 = pd.read_csv('./data/consumer_complaint_data.csv')
df1 = pd.read_csv('E:/azure_ml_notebook/azureml_data/complaints.csv')


In [51]:
df1.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,transworld systems inc. \nis trying to collect...,,TRANSWORLD SYSTEMS INC,FL,335XX,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,,3384392
1,2020-09-25,Debt collection,I do not know,Written notification about debt,Didn't receive enough information to verify debt,,Company believes it acted appropriately as aut...,Phoenix Financial Services LLC,FL,33853,,Consent not provided,Web,2020-09-25,Closed with explanation,Yes,,3866397
2,2019-09-19,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,PA,15206,,Consent not provided,Web,2019-09-20,Closed with non-monetary relief,Yes,,3379500
3,2019-11-08,Debt collection,I do not know,Communication tactics,Frequent or repeated calls,"Over the past 2 weeks, I have been receiving e...",,"Diversified Consultants, Inc.",NC,275XX,,Consent provided,Web,2019-11-08,Closed with explanation,Yes,,3433198
4,2020-09-21,Debt collection,Credit card debt,Attempts to collect debt not owed,Debt is not yours,,,Resurgent Capital Services L.P.,MA,02124,,,Web,2020-09-21,Closed with explanation,Yes,,3857820


For our implementation, we use only two columns: *Consumer complaint narrative*, which contains the text of each consumer complaint and which we rename to *Complaint*, and *Product*, which identifies the financial product or service associated with the complaint.

In [52]:
# Copy the selection so that later in-place edits do not trigger SettingWithCopyWarning
df2 = df1[['Product', 'Consumer complaint narrative']].copy()

In [53]:
df2.columns = ['Product', "Complaint"]
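
As an alternative sketch, the column selection and renaming can be done at read time, which avoids holding the unused columns in memory. The in-memory CSV and the helper name `load_complaints` below are hypothetical stand-ins for illustration; only `pd.read_csv` with `usecols` and `DataFrame.rename` are standard pandas calls.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV (hypothetical rows, for illustration only).
sample_csv = io.StringIO(
    "Product,Issue,Consumer complaint narrative\n"
    "Debt collection,Communication tactics,They keep calling me.\n"
    "Mortgage,Trouble during payment process,\n"
)


def load_complaints(source):
    """Read only the two columns of interest and rename the narrative column."""
    df = pd.read_csv(
        source,
        usecols=["Product", "Consumer complaint narrative"],
    )
    return df.rename(columns={"Consumer complaint narrative": "Complaint"})


complaints = load_complaints(sample_csv)
print(complaints.columns.tolist())
```

The empty narrative in the second row is read as a missing value, just like the empty narratives in the real file.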

The dataset has approximately 1.78M rows, but a large portion of them have missing data in the *Complaint* column. Here we simply drop all rows with missing data, which leaves about 606K rows.

In [54]:
df2.head()

Unnamed: 0,Product,Complaint
0,Debt collection,transworld systems inc. \nis trying to collect...
1,Debt collection,
2,"Credit reporting, credit repair services, or o...",
3,Debt collection,"Over the past 2 weeks, I have been receiving e..."
4,Debt collection,


In [55]:
df2.shape

(1782596, 2)

In [56]:
df2.dropna(inplace=True)

In [57]:
df2.shape

(606211, 2)
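
In the run shown above, 606,211 of 1,782,596 rows survive, which is about 34%. Before dropping rows, it can help to measure how much of each column is missing. The following is a minimal sketch on a hypothetical toy frame, not the real data:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real complaints data.
toy = pd.DataFrame({
    "Product": ["Debt collection", "Mortgage", "Credit card", "Mortgage"],
    "Complaint": ["They keep calling me.", None, None, "My payment was misapplied."],
})

# Fraction of missing values per column, computed before dropping anything.
missing_share = toy.isna().mean()

# dropna() returns a new frame; the original is left untouched.
kept = toy.dropna(subset=["Complaint"])
print(missing_share["Complaint"], len(toy), len(kept))
```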

There are 18 distinct values for the *Product* column, but some of them are heavily underrepresented, and many of them overlap.

We therefore consolidate the distinct values of the *Product* column into 6 categories: *Credit Reporting*, *Debt Collection*, *Mortgage*, *Card Services*, *Loans*, and *Banking Services*. Rows labeled *Other financial service* are mapped to *Other* and then dropped.

In [58]:
df2['Product'].value_counts()

Credit reporting, credit repair services, or other personal consumer reports    208537
Debt collection                                                                 123335
Mortgage                                                                         69464
Credit card or prepaid card                                                      43274
Credit reporting                                                                 31588
Student loan                                                                     26969
Checking or savings account                                                      25223
Credit card                                                                      18838
Bank account or service                                                          14885
Money transfer, virtual currency, or money service                               11023
Vehicle loan or lease                                                            10518
Consumer Loan                              

In [59]:
df2.replace({'Product':
             {'Credit reporting, credit repair services, or other personal consumer reports': 'Credit Reporting',
              'Debt collection': 'Debt Collection',
              'Credit reporting': 'Credit Reporting',
              'Credit card': 'Card Services',
              'Bank account or service': 'Banking Services',
              'Credit card or prepaid card': 'Card Services',
              'Student loan': 'Loans',
              'Checking or savings account': 'Banking Services',
              'Consumer Loan': 'Loans',
              'Vehicle loan or lease': 'Loans',
              'Money transfer, virtual currency, or money service': 'Banking Services',
              'Payday loan, title loan, or personal loan': 'Loans',
              'Payday loan': 'Loans',
              'Money transfers': 'Banking Services',
              'Prepaid card': 'Card Services',
              'Other financial service': 'Other',
              'Virtual currency': 'Banking Services'}
            }, inplace=True)

In [60]:
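# Copy the filtered frame so that adding a column later does not trigger SettingWithCopyWarning
df2 = df2[df2['Product'] != 'Other'].copy()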

In [61]:
pd.DataFrame(df2['Product'].value_counts())

Unnamed: 0,Product
Credit Reporting,240125
Debt Collection,123335
Mortgage,69464
Card Services,63562
Loans,56789
Banking Services,52644
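
A quick sanity check after a consolidation like this is to confirm that no raw category slipped through the mapping. The sketch below uses an abbreviated, illustrative mapping and a hypothetical `expected` set; it is not the notebook's full mapping.

```python
import pandas as pd

# Abbreviated mapping for illustration; the notebook's full mapping has more keys.
mapping = {
    "Debt collection": "Debt Collection",
    "Credit card": "Card Services",
    "Prepaid card": "Card Services",
    "Payday loan": "Loans",
}
expected = {"Debt Collection", "Card Services", "Loans", "Mortgage"}

products = pd.Series(
    ["Debt collection", "Prepaid card", "Mortgage", "Payday loan", "Credit card"]
)
consolidated = products.replace(mapping)

# Any value outside the expected set signals a raw category the mapping missed.
unmapped = set(consolidated.unique()) - expected
print(sorted(unmapped))
```

An empty result means every value landed in one of the target categories.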


We need to represent data as numeric values for the model. Here we create a new column *Product_Label* that encodes the information from the *Product* column into numeric values.

We need to do something similar for the text in the *Complaint* column, but since that encoding depends on the model architecture, it is done in the subsequent notebook.

In [62]:
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
df2['Product_Label'] = enc.fit_transform(df2['Product'])

In [63]:
df2.head()

Unnamed: 0,Product,Complaint,Product_Label
0,Debt Collection,transworld systems inc. \nis trying to collect...,3
3,Debt Collection,"Over the past 2 weeks, I have been receiving e...",3
8,Debt Collection,"I received the email below, but I have never s...",3
9,Credit Reporting,i am a victim of identity theft as previously ...,2
11,Credit Reporting,"Previously, on XX/XX/XXXX, XX/XX/XXXX, and XX/...",2
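
`LabelEncoder` assigns integer codes following the sorted order of the class names, which is why *Debt Collection* maps to 3 and *Credit Reporting* to 2 above. A plain-Python sketch of the same idea, with an inverse lookup for turning predictions back into names (the dictionary names are illustrative, not part of the notebook):

```python
# Consolidated category names taken from the value counts above.
categories = [
    "Credit Reporting", "Debt Collection", "Mortgage",
    "Card Services", "Loans", "Banking Services",
]

# Sorted order determines the integer code, mirroring the encoder's behavior.
label_to_id = {name: idx for idx, name in enumerate(sorted(categories))}

# Inverse lookup, useful for decoding model predictions later.
id_to_label = {idx: name for name, idx in label_to_id.items()}

print(label_to_id)
```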


In [64]:
df2.iloc[4]['Complaint']

'Previously, on XX/XX/XXXX, XX/XX/XXXX, and XX/XX/XXXX I requested that Experian send me a copy of the verifiable proof they have on file showing that the XXXX account they have listed on my credit report is actually mine. On XX/XX/XXXX and XX/XX/XXXX, instead of sending me a copy of the verifiable proof that I requested, Experian sent me a statement which reads, " The information you disputed has been verified as accurate. \'\' Experian also failed to provide me with the method of " verification. \'\' Since Experian neither provided me with a copy of the verifiable proof, nor did they delete the unverified information, I believe they are in violation of the Fair Credit Reporting Act and I have been harmed as a result. I have again, today, sent my fourth and final written request that they verify the account, and send me verifiable proof that this account is mine, or that they delete the unverified account. If they do not, my next step is to pursue a remedy through litigation.'

We can further preprocess the data to reduce the vocabulary size of the text. Here we apply light text preprocessing: replacing punctuation with spaces, removing the masked information (*XXX…* patterns), collapsing repeated spaces, and finally normalizing everything to lowercase.

In [65]:
import string

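# Map every punctuation character to a space
table = str.maketrans(string.punctuation, ' '*len(string.punctuation))
df2['Complaint'] = df2['Complaint'].str.translate(table)
# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
df2['Complaint'] = df2['Complaint'].str.replace('X+', '', regex=True)   # drop masked runs
df2['Complaint'] = df2['Complaint'].str.replace(' +', ' ', regex=True)  # collapse repeated spaces
df2['Complaint'] = df2['Complaint'].str.lower()
df2['Complaint'] = df2['Complaint'].str.strip()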

In [66]:
df2.iloc[4]['Complaint']

'previously on and i requested that experian send me a copy of the verifiable proof they have on file showing that the account they have listed on my credit report is actually mine on and instead of sending me a copy of the verifiable proof that i requested experian sent me a statement which reads the information you disputed has been verified as accurate experian also failed to provide me with the method of verification since experian neither provided me with a copy of the verifiable proof nor did they delete the unverified information i believe they are in violation of the fair credit reporting act and i have been harmed as a result i have again today sent my fourth and final written request that they verify the account and send me verifiable proof that this account is mine or that they delete the unverified account if they do not my next step is to pursue a remedy through litigation'
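
The same cleaning steps can be sketched as a standalone function on a single string, which makes them easy to test in isolation. The function name `clean_text` and the sample sentence are illustrative assumptions; the steps mirror the cell above.

```python
import re
import string

# Translation table that maps every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Apply the notebook's light preprocessing to one string."""
    text = text.translate(PUNCT_TO_SPACE)  # punctuation -> spaces
    text = re.sub(r"X+", "", text)         # drop masked runs such as XXXX
    text = re.sub(r" +", " ", text)        # collapse repeated spaces
    return text.lower().strip()


cleaned = clean_text("On XX/XX/XXXX, Experian sent me a statement.")
print(cleaned)
```

Note that the ` +` pattern collapses only spaces, so newlines inside a narrative are left in place; `\s+` would be one possible variant if those should be collapsed too.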

Some texts in the *Complaint* column have zero or very few words; these account for about 1,200 rows in the dataset (605,919 rows before filtering, 604,719 after). Here we require a minimum of 5 words for a text to carry some useful information.

In [67]:
# Number of whitespace-separated words per complaint (vectorized, avoids row-by-row iloc)
lengths = df2['Complaint'].str.split().str.len()
print(max(lengths))
print(min(lengths))

5958
0


In [68]:
df2 = df2[[l >= 5 for l in lengths]]

In [69]:
df2.shape

(604719, 3)
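
The word-count filter can also be written with a boolean mask. A minimal sketch on a hypothetical toy frame, using the same 5-word threshold:

```python
import pandas as pd

# Hypothetical toy frame standing in for the cleaned complaints.
toy = pd.DataFrame({
    "Complaint": [
        "",
        "please help",
        "they keep calling me every single day",
    ]
})

# Whitespace-split each text and count the tokens; an empty string yields 0.
word_counts = toy["Complaint"].str.split().str.len()

MIN_WORDS = 5  # same threshold as in the notebook
filtered = toy[word_counts >= MIN_WORDS]
print(word_counts.tolist(), len(filtered))
```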

In [70]:
pd.DataFrame(df2['Product'].value_counts())

Unnamed: 0,Product
Credit Reporting,239250
Debt Collection,123128
Mortgage,69444
Card Services,63532
Loans,56759
Banking Services,52606


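We then save the preprocessed dataset, along with a sampled subset. In the run shown here, only the first 100 rows and a random sample of 100 rows are written; the commented-out lines save the full dataset and a 10% sample instead.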

In [71]:
#df2.to_csv('./data/consumer_complaint_data_prepared.csv', index=False)

# write only the first 100 rows, tab-separated, without the index column
df2.head(100).to_csv('E:/azure_ml_notebook/azureml_data/complaints_after.tsv', sep='\t', index=False)

In [72]:
#df2.sample(n=int(len(df2)*0.1), random_state=111).to_csv('./data/consumer_complaint_data_sample_prepared.csv', index=False)

df2.sample(n=100, random_state=111).to_csv('E:/azure_ml_notebook/azureml_data/complaints_sampled_after.csv', index=False)
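
Because the classes are imbalanced, a plain random sample may under-represent the smaller categories. One possible variant, not used in this notebook, is to sample the same fraction from each class. The sketch below assumes pandas 1.1 or later (for `DataFrameGroupBy.sample`); the helper name `stratified_sample` and the toy data are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with two imbalanced classes.
toy = pd.DataFrame({
    "Product": ["Credit Reporting"] * 20 + ["Mortgage"] * 10,
    "Complaint": [f"complaint text number {i}" for i in range(30)],
})


def stratified_sample(df, label_col, frac, seed):
    """Draw the same fraction from every class so class proportions are preserved."""
    return df.groupby(label_col).sample(frac=frac, random_state=seed)


sample = stratified_sample(toy, "Product", frac=0.5, seed=111)
print(sample["Product"].value_counts().to_dict())
```

Fixing the seed keeps the sample reproducible across runs, in the same way `random_state=111` does in the cell above.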
