### Dataset: https://www.consumerfinance.gov/data-research/
https://www.consumerfinance.gov/data-research/consumer-complaints/

What are consumers complaining about in the market for financial products and services?
Data from these complaints helps us understand the financial marketplace and protect consumers.

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

#### Data acquisition
- download the [JSON-format data source](http://files.consumerfinance.gov/ccdb/complaints.json.zip) locally and unzip it
- upload the unzipped JSON file to DBFS
- read the JSON file and partition by 
- create a Delta table. Delta Engine is a high-performance, Apache Spark compatible query engine that provides an efficient way to process data in data lakes, including data stored in open source Delta Lake. Delta Engine optimizations accelerate data lake operations, supporting a variety of workloads ranging from large-scale ETL processing to ad-hoc, interactive queries.

In [0]:
#make sure file uploaded 6s
#dbutils.fs.rm('dbfs:/FileStore/tables/complaints_csv.zip',True)
#dbutils.fs.rm('dbfs:/FileStore/tables/complaints_json.zip',True)
#dbutils.fs.rm('dbfs:/FileStore/tables/complaints-1.json',True)
display(dbutils.fs.ls("dbfs:/FileStore/tables/"))

path,name,size
dbfs:/FileStore/tables/complaints.csv,complaints.csv,1233411049
dbfs:/FileStore/tables/complaints.json,complaints.json,1942366519
dbfs:/FileStore/tables/exercise_pyspark_dataframe.ipynb,exercise_pyspark_dataframe.ipynb,30542
dbfs:/FileStore/tables/flight_model/,flight_model/,0
dbfs:/FileStore/tables/flight_weather.csv,flight_weather.csv,431664555


In [0]:
file_name = '/FileStore/tables/complaints.json'
#for line,item in zip(open(file_name,'r').readlines(), range(3)):
#    print(line)
   
df = spark.read.format('json').load(file_name)
df.dtypes

#### Data cleansing
- Check the date range and remove the partial months at the start and end of the range (such as 2021/02)
- Delete rows without the primary key (Complaint ID)
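
The two cleansing rules above can be sketched in plain Python on a few made-up records. This is only an illustration under assumptions: the sample rows and the `trim_partial_months` helper are hypothetical, the observed minimum and maximum dates are used as a stand-in for the dataset's coverage, and the notebook itself works on a Spark DataFrame rather than a list of dicts.

```python
from datetime import date
import calendar

# Hypothetical sample records; real rows come from the complaints dataset.
rows = [
    {"complaint_id": "1", "date_received": date(2021, 1, 1)},
    {"complaint_id": "2", "date_received": date(2021, 1, 31)},
    {"complaint_id": None, "date_received": date(2021, 1, 20)},  # no primary key
    {"complaint_id": "4", "date_received": date(2021, 2, 8)},    # partial last month
]

def month_key(d):
    """Return a (year, month) tuple identifying the calendar month of d."""
    return (d.year, d.month)

def trim_partial_months(rows, first_day, last_day):
    """Drop rows in the first/last month when that month is only partly covered."""
    drop = set()
    if first_day.day != 1:
        drop.add(month_key(first_day))
    days_in_last = calendar.monthrange(last_day.year, last_day.month)[1]
    if last_day.day != days_in_last:
        drop.add(month_key(last_day))
    return [r for r in rows if month_key(r["date_received"]) not in drop]

# Rule 1: drop rows without a primary key.
keyed = [r for r in rows if r["complaint_id"] is not None]

# Rule 2: drop partially covered months at either end of the observed range.
dates = [r["date_received"] for r in keyed]
cleaned = trim_partial_months(keyed, min(dates), max(dates))
print([r["complaint_id"] for r in cleaned])
```

In this toy input the data ends on 2021-02-08, so February is treated as a partial month and removed, while January (which starts on the 1st) is kept.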

In [0]:
df.show(n=3, truncate=False, vertical=True)

In [0]:
count = df.count()

In [0]:
#https://blog.csdn.net/sinat_26917383/article/details/80500349
from pyspark.sql.functions import isnull
# drop rows where any of these columns is null
df = df.dropna(subset=['complaint_id', 'issue','product','date_received','company','state','submitted_via'])

In [0]:
#rows deleted
print (count - df.count())

In [0]:
df.createOrReplaceTempView('t_complaints')

#### Time related questions

- What is the average processing time of closed complaints? We cannot tell, because no closing time is recorded.
- Are they resolved in a timely manner?
- Are the consumers satisfied?

Let's look at the **timely** column. It shows that **98%** of these complaints were handled in a timely manner.

In [0]:
# which values does the timely column take, and how often?
display(df.select(['timely']).groupby('timely').count())

timely,count
No,39765
Yes,1920683
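
The 98% figure can be checked directly from the counts in this output. A minimal sketch in plain Python, with the two counts copied by hand (the variable names are illustrative only):

```python
# Counts copied from the output above.
timely_counts = {"No": 39765, "Yes": 1920683}

total = sum(timely_counts.values())
timely_share = timely_counts["Yes"] / total * 100
print(f"{timely_share:.2f}% of {total} complaints are marked timely")
```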


- Which company has the highest delay rate?
- How are these delayed records distributed?
###### Let's check companies that have over 100 complaints and at least one delayed response.

In [0]:
# how the untimely responses are distributed across these financial companies
df_delay = sqlContext.sql("SELECT company,sum(IF(timely='No', 1, 0)) AS delay_count, count(1) as total " + \
                           " FROM t_complaints " +\
                           " GROUP BY company " + 
                           " HAVING sum(IF(timely='No', 1, 0))>0 and count(1)>100")
pddf_delay = df_delay.toPandas()

In [0]:
# calculate the delay rate as a percentage, rounded to two decimals
pddf_delay['percent'] = (pddf_delay['delay_count'] / pddf_delay['total'] * 100).round(2)
# keep the companies whose delay rate is over 80%
pddf_delay_percent = pddf_delay[pddf_delay['percent'] > 80].sort_values("percent", ascending=False)

In [0]:
display(pddf_delay.sort_values("total", ascending=False))

company,delay_count,total,percent
"EQUIFAX, INC.",1687,232551,0.73
"TRANSUNION INTERMEDIATE HOLDINGS, INC.",96,224589,0.04
Experian Information Solutions Inc.,9,222666,0.0
"BANK OF AMERICA, NATIONAL ASSOCIATION",1631,97852,1.67
WELLS FARGO & COMPANY,3807,83815,4.54
JPMORGAN CHASE & CO.,96,75470,0.13
"CITIBANK, N.A.",367,63120,0.58
CAPITAL ONE FINANCIAL CORPORATION,71,50644,0.14
"Navient Solutions, LLC.",3,33770,0.01
Ocwen Financial Corporation,545,30725,1.77


- These companies have the highest delay rates. Let's look at them more closely. Each has only a few hundred complaints on record. Since this dataset runs from 2015 to the present, it is worth checking how their complaints are distributed.

In [0]:
display(pddf_delay_percent)

company,delay_count,total,percent
"Mobiloans, LLC",451,451,100.0
Ameritech Financial,197,217,90.78
"Credit Collections U.S.A., L.L.C.",113,130,86.92
High Point Asset Inc,122,141,86.52
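
The `percent` column and the 80% cut-off can be re-derived by hand from a few rows of the two outputs above. This is a plain-Python sketch for checking the arithmetic; the `delay_percent` helper and the choice of rows are illustrative and not part of the notebook.

```python
# (company, delay_count, total) copied from the outputs above.
records = [
    ("EQUIFAX, INC.", 1687, 232551),
    ("WELLS FARGO & COMPANY", 3807, 83815),
    ("Mobiloans, LLC", 451, 451),
    ("Ameritech Financial", 197, 217),
]

def delay_percent(delay_count, total):
    """Share of complaints without a timely response, in percent."""
    return round(delay_count / total * 100, 2)

rates = {name: delay_percent(d, t) for name, d, t in records}

# Companies above the 80% threshold, highest rate first.
high_delay = sorted((n for n, p in rates.items() if p > 80),
                    key=rates.get, reverse=True)
print(rates)
print(high_delay)
```

The companies with the most complaints have low delay rates, while the companies above the threshold each have only a few hundred complaints.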


#### What are the most complained products?
- Credit report related (), Debt collection, Credit card related (), mortgage, student loan.

In [0]:
df_service = sqlContext.sql("SELECT product,count(1) AS count " + \
                           " FROM t_complaints " +\
                           " GROUP BY product ORDER BY 2")
display(df_service)

product,count
Virtual currency,18
Other financial service,1059
Prepaid card,3819
Money transfers,5354
Payday loan,5543
"Payday loan, title loan, or personal loan",16071
Vehicle loan or lease,22572
"Money transfer, virtual currency, or money service",22682
Consumer Loan,31604
Student loan,60527


**Credit reporting, credit repair services, or other personal consumer reports** has the highest complaints. Checking sub-product to see what happened.
- About 98.7% of these complaints (629,885 of 638,054) are about **Credit reporting**

In [0]:
df_subproduct = sqlContext.sql("SELECT sub_product,count(1) AS count " + \
                           " FROM t_complaints " +\
                           " WHERE PRODUCT='Credit reporting, credit repair services, or other personal consumer reports' " +\
                           " GROUP BY sub_product ORDER BY 2")
display(df_subproduct)

sub_product,count
Conventional home mortgage,1
Credit repair services,1700
Other personal consumer report,6468
Credit reporting,629885


##### What are the issues with **Credit reporting**?
- Looking at the issues and sub-issues, most complaints are about **incorrect information on the consumer's report, where the information belongs to someone else**. This is a question of the quality of the information service.
- Is this kind of problem common across financial companies?

In [0]:
df_issue = sqlContext.sql("SELECT issue,sub_issue,count(issue) AS count_issue,count(sub_issue) as count_sub_issue " + \
                           " FROM t_complaints " +\
                           " WHERE PRODUCT='Credit reporting, credit repair services, or other personal consumer reports' " +\
                           " AND sub_product='Credit reporting' " +\
                           " GROUP BY issue,sub_issue")
display(df_issue)

issue,sub_issue,count_issue,count_sub_issue
Credit monitoring or identity theft protection services,Problem canceling credit monitoring or identify theft protection service,999,999
Incorrect information on your report,Account information incorrect,46398,46398
Credit monitoring or identity theft protection services,Received unwanted marketing or advertising,249,249
Problem with a credit reporting company's investigation into an existing problem,Their investigation did not fix an error on your report,85923,85923
Problem with fraud alerts or security freezes,,9822,9822
Credit monitoring or identity theft protection services,Didn't receive services that were advertised,757,757
Problem with a credit reporting company's investigation into an existing problem,Problem with personal statement of dispute,5561,5561
Incorrect information on your report,Information belongs to someone else,256333,256333
Improper use of your report,Received unsolicited financial product or insurance offers after opting out,397,397
Getting a loan or lease,Credit denial,3,3


##### Which company contributed this issue most?
- Check how this issue is distributed across companies. Have they dealt with these complaints in a timely manner?
TRANSUNION INTERMEDIATE HOLDINGS, INC., Experian Information Solutions Inc. and EQUIFAX, INC. have the highest counts of complaints on this issue. A quick web search shows that these are financial companies established more than 30 years ago. Is this simply because they serve a large consumer base?

In [0]:
df_company_issue = sqlContext.sql("SELECT company,count(1) as count, " + \
                           " round((sum(IF(timely='No', 1, 0))/count(1))*100,2) as delay_percent " +\
                           " FROM t_complaints " +\
                           " WHERE PRODUCT='Credit reporting, credit repair services, or other personal consumer reports' " +\
                           " AND sub_product='Credit reporting' " +\
                           " AND issue='Incorrect information on your report' AND sub_issue='Information belongs to someone else' "\
                           " GROUP BY company")
display(df_company_issue)

company,count,delay_percent
"PlusFour, Inc",6,0.0
FORD MOTOR CREDIT CO.,38,0.0
"ClearOne Advantage, LLC",2,0.0
GLOBAL PAYMENTS DIRECT INC.,3,0.0
"International Collection Systems, Inc.",1,0.0
"Manhattan Beach Venture, LLC",1,100.0
"Medical Data Systems, Inc.",26,0.0
"Southern Credit Recovery, Inc.",5,0.0
Nelson Cruz & Associates LLC,8,75.0
AFNI INC.,95,0.0


###### How can we look further?
- Use a web crawler to check the size of each company's user base.
- Check the narratives and generate a word cloud.
- Over time, check whether these issues have been fixed.

The top 3 companies all respond in a timely manner, so this time let's focus only on the number of complaints about this issue over the years.
The line chart shows that, from 2015 to the present, this issue has grown gradually rather than improved. Since 2020, the number of complaints has increased sharply.

In [0]:
df_company_improve = sqlContext.sql("SELECT company,date_received,to_timestamp(date_received, 'yyyy-MM') as received_ym," + \
                           " count(1) as total "         
                           " FROM t_complaints " +\
                           " WHERE PRODUCT='Credit reporting, credit repair services, or other personal consumer reports' " +\
                           " AND sub_product='Credit reporting' " +\
                           " AND issue='Incorrect information on your report' AND sub_issue='Information belongs to someone else' "\
                           " AND ((company='TRANSUNION INTERMEDIATE HOLDINGS, INC.')  "
                           " OR (company='Experian Information Solutions Inc.')  "
                           " OR (company='EQUIFAX, INC.'))  "
                           " GROUP BY company,date_received,to_timestamp(date_received, 'yyyy-MM')")

display(df_company_improve)

company,date_received,received_ym,total
"EQUIFAX, INC.",2019-01-25,2019-01-01T00:00:00.000+0000,25
"EQUIFAX, INC.",2018-11-05,2018-11-01T00:00:00.000+0000,32
Experian Information Solutions Inc.,2018-11-08,2018-11-01T00:00:00.000+0000,32
"EQUIFAX, INC.",2020-10-11,2020-10-01T00:00:00.000+0000,72
Experian Information Solutions Inc.,2020-12-16,2020-12-01T00:00:00.000+0000,217
"EQUIFAX, INC.",2020-12-08,2020-12-01T00:00:00.000+0000,203
"EQUIFAX, INC.",2020-10-28,2020-10-01T00:00:00.000+0000,145
"TRANSUNION INTERMEDIATE HOLDINGS, INC.",2020-01-27,2020-01-01T00:00:00.000+0000,100
"TRANSUNION INTERMEDIATE HOLDINGS, INC.",2019-11-14,2019-11-01T00:00:00.000+0000,78
"EQUIFAX, INC.",2019-10-27,2019-10-01T00:00:00.000+0000,36
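
Because the query groups by `date_received` as well as by month, each row above is a daily count. A monthly trend needs those daily counts summed per company and month. The following is a minimal plain-Python sketch of that roll-up, using four rows copied from the output above; it is an illustration of the aggregation, not the notebook's Spark code.

```python
from collections import defaultdict

# (company, date_received, total) rows copied from the output above.
daily = [
    ("EQUIFAX, INC.", "2020-10-11", 72),
    ("EQUIFAX, INC.", "2020-10-28", 145),
    ("EQUIFAX, INC.", "2020-12-08", 203),
    ("Experian Information Solutions Inc.", "2020-12-16", 217),
]

monthly = defaultdict(int)
for company, day, total in daily:
    month = day[:7]  # 'yyyy-MM-dd' -> 'yyyy-MM'
    monthly[(company, month)] += total

for key in sorted(monthly):
    print(key, monthly[key])
```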


#### After sampling some of the data, we have seen what it looks like.
- **complaint_what_happened** is the consumer's description of the complaint. Can NLP analysis reveal the sentiment of these records? Every record is a complaint, but complaints usually differ in severity. If we can analyze these severity levels, they might be used for future classification.
- Check which columns hold standardized items
- **date_received** and **date_sent_to_company** can be used for time series analysis

| Column Name               | Remark   |
|-------------------------|----------|
| company                   | any change or converge|
| company_public_response   | is it standard?|
| company_response          | standard item|
| complaint_id              | primary key  |
| complaint_what_happened   | story|
| consumer_consent_provided | standard item|
| consumer_disputed         | standard item|
| date_received             | date yyyy-mm-dd|
| date_sent_to_company      | date yyyy-mm-dd|
| issue                     | standard item|
| product                   | standard item|
| state                     | CA|
| sub_issue                 | standard item|
| sub_product               | standard item|
| submitted_via             | standard item|
| tags                      | ?|
| timely                    | standard item|
| zip_code                  | is it standard?|

In [0]:
df.count()

#### Load dataset from csv file
- read data from the CSV file, using the first row as the header, and display it
- observe the sample data

Reference: https://cloud.tencent.com/developer/article/1096712

In [0]:
from pyspark.sql.types import StructField, StringType

# view a few lines
#csvFile.take(3)
# read the header row
header = csvFile.first()
# one nullable string field per column
fields = [StructField(field_name, StringType(), True) for field_name in header.split(',')]
# column count
len(fields)

#### Process data schema
- rename the columns to make them easier to access
- normalize the column data types
(reference https://www.nodalpoint.com/spark-data-frames-from-csv-files-handling-headers-column-types/   20210208)

In [0]:
from pyspark.sql.types import StructType, TimestampType

# list the original fields, and set the date columns to the correct data type
#fields
fields[0].dataType = TimestampType()   # Date received
fields[13].dataType = TimestampType()  # Date sent to company (index 12 is "Submitted via")
# rename columns: replace spaces and hyphens with underscores, remove question marks, lowercase
for f in fields:
    f.name = f.name.replace(' ','_').replace('-','_').replace('?','').lower()
# construct a new schema
schema = StructType(fields)
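
The renaming rule can be tried on the CSV header without Spark. This sketch applies the same chain of `replace` calls to the header row shown in the outputs further down; the `normalize` helper name is an assumption made for illustration. Printing the index next to each name is also a convenient way to double-check positional references to the two date columns.

```python
# Header row as it appears in the CSV outputs in this notebook.
header = ("Date received,Product,Sub-product,Issue,Sub-issue,"
          "Consumer complaint narrative,Company public response,Company,State,"
          "ZIP code,Tags,Consumer consent provided?,Submitted via,"
          "Date sent to company,Company response to consumer,Timely response?,"
          "Consumer disputed?,Complaint ID")

def normalize(name):
    """Spaces and hyphens become underscores, '?' is dropped, result is lowercased."""
    return name.replace(' ', '_').replace('-', '_').replace('?', '').lower()

columns = [normalize(c) for c in header.split(',')]
for i, c in enumerate(columns):
    print(i, c)
```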

In [0]:
'''
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc =SparkContext()
sqlContext = SQLContext(sc)
'''
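# observe the raw data from the csv file to make sure the correct csv format is applied
# (islice reads only the first 3 lines instead of loading the whole file)
from itertools import islice
with open('/tmp/complaints.csv', 'r') as f:
    for line in islice(f, 3):
        print(line)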
#data = sqlContext.read.format('com.databricks.spark.csv').options(header='true', 
#inferschema='true').load('/FileStore/tables/complaints.csv')

#### From the above lines, we can see
- data fields are separated by commas
- the **Consumer complaint narrative** column is enclosed in double quotes.
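
Why the quoting matters can be shown with Python's standard `csv` module. The record below is a shortened, hand-made variant of one sample row (an assumption for illustration, not a verbatim line from the file): two of its fields contain commas, so splitting on every comma gives too many fields, while a quote-aware parser returns the expected five.

```python
import csv
import io

# Shortened, hand-made record; the product and company fields contain commas.
line = ('2021-01-11,"Payday loan, title loan, or personal loan",Installment loan,'
        '"Express Collections, Inc.",MI\n')

# Naive approach: every comma is treated as a delimiter.
naive = line.rstrip('\n').split(',')

# Quote-aware approach: commas inside double quotes stay within the field.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```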

In [0]:
display(data)

Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,transworld systems inc.,,,,,,,,,,,,
is trying to collect a debt that is not mine,"not owed and is inaccurate.""",,TRANSWORLD SYSTEMS INC,FL,335XX,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,,3384392,,,,
2019-09-19,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the CFPB and chooses not to provide a public response,Experian Information Solutions Inc.,PA,15206,,Consent not provided,Web,2019-09-20,Closed with non-monetary relief,Yes,,3379500
2019-11-08,Debt collection,I do not know,Communication tactics,Frequent or repeated calls,"Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work.",,"Diversified Consultants, Inc.",NC,275XX,,Consent provided,Web,2019-11-08,Closed with explanation,Yes,,3433198
2019-09-15,Debt collection,Other debt,Attempts to collect debt not owed,Debt was result of identity theft,"Pioneer has committed several federal violations against me, a Private law abiding Federally Protected Consumer. Each violation is a statutory cost of {$1000.00} each, which does not include my personal cost and fees which shall be determined for taking time to address these issues. Violations committed against me include but not limited to : ( 1 ) Violated 15 USC 1692c ( a ) ; Communication without prior consent, expressed permission. ( 2 ) Violated 15 USC 1692d ; Harass and oppressive use of intercourse about an alleged debt. ( 3 ) Violated 15 USC 1692d ( l ) ; Attacking my reputation, accusing me of owing an alleged debt to you. ( 4 ) Violated 15 USC 1692e ( 9 ) ; Use/distribution of communication with authorization or approval. ( 5 ) Violated 15 USC 1692f ( l ) ; Attempting to collect a debt unauthorized by an agreement between parties.",,Pioneer Capital Solutions Inc,CA,925XX,,Consent provided,Web,2019-09-15,Closed with explanation,Yes,,3374555
2021-01-11,"Payday loan, title loan, or personal loan",Installment loan,Charged fees or interest you didn't expect,,,Company disputes the facts presented in the complaint,"Express Collections, Inc.",MI,48114,,,Web,2021-01-11,Closed with explanation,Yes,,4061634
2019-07-18,Mortgage,Conventional home mortgage,Closing on a mortgage,,"I started the process to refinance my current mortgage. The closing lawyer attempted to obtain the payoff statement, but they would not provide that statement for 4 business as they claimed they are behind. The lawyer informed me I could obtain the information sooner. I called and spoke to a representative on XX/XX/XXXX who said she could request I receive the payoff statement in 24 hours if she requested an expedite through a supervisor. I called back on XX/XX/XXXX, my closing date on the refinance, to determine the payoff amount as I had not yet received the payoff statement. I was told that it had not yet been generated, but still could be done today. I asked to speak to a supervisor and was told that one was not available and could call me within 24 hours. They also told me that there was no one within the company that could help me with this inquiry.",Company has responded to the consumer and the CFPB and chooses not to provide a public response,Freedom Mortgage Company,NC,275XX,,Consent provided,Web,2019-07-18,Closed with explanation,Yes,,3311105
2020-12-30,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Improper use of your report,Credit inquiries on your report that you don't recognize,,,"EQUIFAX, INC.",LA,700XX,,,Web,2020-12-30,Closed with explanation,Yes,,4039920
2019-07-26,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Problem with a credit reporting company's investigation into an existing problem,Their investigation did not fix an error on your report,"""Previously, on XX/XX/XXXX, XX/XX/XXXX, and XX/XX/XXXX I requested that Experian send me a copy of the verifiable proof they have on file showing that the XXXX account they have listed on my credit report is actually mine. On XX/XX/XXXX and XX/XX/XXXX, instead of sending me a copy of the verifiable proof that I requested, Experian sent me a statement which reads, """" The information you disputed has been verified as accurate. '' Experian also failed to provide me with the method of """" verification. '' Since Experian neither provided me with a copy of the verifiable proof",nor did they delete the unverified information,I believe they are in violation of the Fair Credit Reporting Act and I have been harmed as a result. I have again,today,sent my fourth and final written request that they verify the account,and send me verifiable proof that this account is mine,or that they delete the unverified account. If they do not,"my next step is to pursue a remedy through litigation.""",Company has responded to the consumer and the CFPB and chooses not to provide a public response,Experian Information Solutions Inc.,CA,914XX,
2019-07-08,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Problem with a credit reporting company's investigation into an existing problem,Their investigation did not fix an error on your report,"Hello This complaint is against the three credit reporting companies. XXXX, Trans Union and XXXX. I noticed some discrepencies on my credit report so I put a credit freeze with XXXX.on XX/XX/2019. I then notified the three credit agencies previously stated with a writtent letter dated XX/XX/2019 requesting them to verifiy certain accounts showing on my report They were a Bankruptcy and a bank account from XXXX XXXX XXXX.",,,,,,,,,,,,


In [0]:
# Read the dataset. Some descriptive columns contain quotation marks.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType, TimestampType
df = (sqlContext.read.format("csv").
  option("header", "true").
  option("nullValue", "NA").
  option("inferSchema", True).
  option("delimiter",",").option("quote",'').option("escape",'').
  load("/FileStore/tables/complaints.csv"))
display(df)

Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,"""transworld systems inc.",,,,,,,,,,,,
is trying to collect a debt that is not mine,"not owed and is inaccurate.""",,TRANSWORLD SYSTEMS INC,FL,335XX,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,,3384392,,,,
2019-09-19,"""Credit reporting",credit repair services,"or other personal consumer reports""",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the CFPB and chooses not to provide a public response,Experian Information Solutions Inc.,PA,15206,,Consent not provided,Web,2019-09-20,Closed with non-monetary relief,Yes
2019-11-08,Debt collection,I do not know,Communication tactics,Frequent or repeated calls,"""Over the past 2 weeks","I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work.""",,"""Diversified Consultants","Inc.""",NC,275XX,,Consent provided,Web,2019-11-08,Closed with explanation,Yes
2019-09-15,Debt collection,Other debt,Attempts to collect debt not owed,Debt was result of identity theft,"""Pioneer has committed several federal violations against me",a Private law abiding Federally Protected Consumer. Each violation is a statutory cost of {$1000.00} each,which does not include my personal cost and fees which shall be determined for taking time to address these issues. Violations committed against me include but not limited to : ( 1 ) Violated 15 USC 1692c ( a ) ; Communication without prior consent,expressed permission. ( 2 ) Violated 15 USC 1692d ; Harass and oppressive use of intercourse about an alleged debt. ( 3 ) Violated 15 USC 1692d ( l ) ; Attacking my reputation,"accusing me of owing an alleged debt to you. ( 4 ) Violated 15 USC 1692e ( 9 ) ; Use/distribution of communication with authorization or approval. ( 5 ) Violated 15 USC 1692f ( l ) ; Attempting to collect a debt unauthorized by an agreement between parties.""",,Pioneer Capital Solutions Inc,CA,925XX,,Consent provided,Web,2019-09-15
2021-01-11,"""Payday loan",title loan,"or personal loan""",Installment loan,Charged fees or interest you didn't expect,,,Company disputes the facts presented in the complaint,"""Express Collections","Inc.""",MI,48114,,,Web,2021-01-11,Closed with explanation
2019-07-18,Mortgage,Conventional home mortgage,Closing on a mortgage,"""""","""I started the process to refinance my current mortgage. The closing lawyer attempted to obtain the payoff statement",but they would not provide that statement for 4 business as they claimed they are behind. The lawyer informed me I could obtain the information sooner. I called and spoke to a representative on XX/XX/XXXX who said she could request I receive the payoff statement in 24 hours if she requested an expedite through a supervisor. I called back on XX/XX/XXXX,my closing date on the refinance,to determine the payoff amount as I had not yet received the payoff statement. I was told that it had not yet been generated,"but still could be done today. I asked to speak to a supervisor and was told that one was not available and could call me within 24 hours. They also told me that there was no one within the company that could help me with this inquiry.""",Company has responded to the consumer and the CFPB and chooses not to provide a public response,Freedom Mortgage Company,NC,275XX,,Consent provided,Web,2019-07-18
2020-12-30,"""Credit reporting",credit repair services,"or other personal consumer reports""",Credit reporting,Improper use of your report,Credit inquiries on your report that you don't recognize,,,"""EQUIFAX","INC.""",LA,700XX,,,Web,2020-12-30,Closed with explanation
2019-07-26,"""Credit reporting",credit repair services,"or other personal consumer reports""",Credit reporting,Problem with a credit reporting company's investigation into an existing problem,Their investigation did not fix an error on your report,"""Previously",on XX/XX/XXXX,XX/XX/XXXX,and XX/XX/XXXX I requested that Experian send me a copy of the verifiable proof they have on file showing that the XXXX account they have listed on my credit report is actually mine. On XX/XX/XXXX and XX/XX/XXXX,instead of sending me a copy of the verifiable proof that I requested,Experian sent me a statement which reads,""""" The information you disputed has been verified as accurate. '' Experian also failed to provide me with the method of """" verification. '' Since Experian neither provided me with a copy of the verifiable proof",nor did they delete the unverified information,I believe they are in violation of the Fair Credit Reporting Act and I have been harmed as a result. I have again,today,sent my fourth and final written request that they verify the account
2019-07-08,"""Credit reporting",credit repair services,"or other personal consumer reports""",Credit reporting,Problem with a credit reporting company's investigation into an existing problem,Their investigation did not fix an error on your report,"""Hello This complaint is against the three credit reporting companies. XXXX",Trans Union and XXXX. I noticed some discrepencies on my credit report so I put a credit freeze with XXXX.on XX/XX/2019. I then notified the three credit agencies previously stated with a writtent letter dated XX/XX/2019 requesting them to verifiy certain accounts showing on my report They were a Bankruptcy and a bank account from XXXX XXXX XXXX.,,,,,,,,,


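#### Check the sample dataset above
- the data is loaded into the wrong columns: with the quote character set to an empty string, commas inside quoted fields are treated as delimiters, and narratives that contain line breaks are split across rows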
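
The row-splitting part of the problem can be reproduced with the standard `csv` module on a tiny, hypothetical file (the three-column layout and the values are made up). A narrative with a line break inside its quotes spans two physical lines, so reading line by line yields more lines than there are records.

```python
import csv
import io

# Hypothetical two-record file; the first narrative contains a line break inside quotes.
raw = ('id,narrative,state\n'
       '1,"first line\nsecond line, with a comma",FL\n'
       '2,"single line",PA\n')

# Line-based view: the quoted line break splits record 1 in two.
physical_lines = raw.splitlines()

# Quote-aware view: the reader keeps the line break inside the field.
records = list(csv.reader(io.StringIO(raw)))

print(len(physical_lines))
print(len(records))
print(records[1])
```

Spark's CSV reader exposes `quote`, `escape`, and `multiLine` options for the same purpose. Whether a particular combination parses this file cleanly is not verified here and would need to be checked against the data.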

In [0]:
csv = spark.read.csv('/FileStore/tables/complaints.csv', inferSchema=True, header=True)
#csv.show(10)
display(csv)

Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,transworld systems inc.,,,,,,,,,,,,
is trying to collect a debt that is not mine,"not owed and is inaccurate.""",,TRANSWORLD SYSTEMS INC,FL,335XX,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,,3384392,,,,
2019-09-19,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the CFPB and chooses not to provide a public response,Experian Information Solutions Inc.,PA,15206,,Consent not provided,Web,2019-09-20,Closed with non-monetary relief,Yes,,3379500
2019-11-08,Debt collection,I do not know,Communication tactics,Frequent or repeated calls,"Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work.",,"Diversified Consultants, Inc.",NC,275XX,,Consent provided,Web,2019-11-08,Closed with explanation,Yes,,3433198
2019-09-15,Debt collection,Other debt,Attempts to collect debt not owed,Debt was result of identity theft,"Pioneer has committed several federal violations against me, a Private law abiding Federally Protected Consumer. Each violation is a statutory cost of {$1000.00} each, which does not include my personal cost and fees which shall be determined for taking time to address these issues. Violations committed against me include but not limited to : ( 1 ) Violated 15 USC 1692c ( a ) ; Communication without prior consent, expressed permission. ( 2 ) Violated 15 USC 1692d ; Harass and oppressive use of intercourse about an alleged debt. ( 3 ) Violated 15 USC 1692d ( l ) ; Attacking my reputation, accusing me of owing an alleged debt to you. ( 4 ) Violated 15 USC 1692e ( 9 ) ; Use/distribution of communication with authorization or approval. ( 5 ) Violated 15 USC 1692f ( l ) ; Attempting to collect a debt unauthorized by an agreement between parties.",,Pioneer Capital Solutions Inc,CA,925XX,,Consent provided,Web,2019-09-15,Closed with explanation,Yes,,3374555
2021-01-11,"Payday loan, title loan, or personal loan",Installment loan,Charged fees or interest you didn't expect,,,Company disputes the facts presented in the complaint,"Express Collections, Inc.",MI,48114,,,Web,2021-01-11,Closed with explanation,Yes,,4061634
2019-07-18,Mortgage,Conventional home mortgage,Closing on a mortgage,,"I started the process to refinance my current mortgage. The closing lawyer attempted to obtain the payoff statement, but they would not provide that statement for 4 business as they claimed they are behind. The lawyer informed me I could obtain the information sooner. I called and spoke to a representative on XX/XX/XXXX who said she could request I receive the payoff statement in 24 hours if she requested an expedite through a supervisor. I called back on XX/XX/XXXX, my closing date on the refinance, to determine the payoff amount as I had not yet received the payoff statement. I was told that it had not yet been generated, but still could be done today. I asked to speak to a supervisor and was told that one was not available and could call me within 24 hours. They also told me that there was no one within the company that could help me with this inquiry.",Company has responded to the consumer and the CFPB and chooses not to provide a public response,Freedom Mortgage Company,NC,275XX,,Consent provided,Web,2019-07-18,Closed with explanation,Yes,,3311105
2020-12-30,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Improper use of your report,Credit inquiries on your report that you don't recognize,,,"EQUIFAX, INC.",LA,700XX,,,Web,2020-12-30,Closed with explanation,Yes,,4039920
2019-07-26,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Problem with a credit reporting company's investigation into an existing problem,Their investigation did not fix an error on your report,"""Previously, on XX/XX/XXXX, XX/XX/XXXX, and XX/XX/XXXX I requested that Experian send me a copy of the verifiable proof they have on file showing that the XXXX account they have listed on my credit report is actually mine. On XX/XX/XXXX and XX/XX/XXXX, instead of sending me a copy of the verifiable proof that I requested, Experian sent me a statement which reads, """" The information you disputed has been verified as accurate. '' Experian also failed to provide me with the method of """" verification. '' Since Experian neither provided me with a copy of the verifiable proof",nor did they delete the unverified information,I believe they are in violation of the Fair Credit Reporting Act and I have been harmed as a result. I have again,today,sent my fourth and final written request that they verify the account,and send me verifiable proof that this account is mine,or that they delete the unverified account. If they do not,"my next step is to pursue a remedy through litigation.""",Company has responded to the consumer and the CFPB and chooses not to provide a public response,Experian Information Solutions Inc.,CA,914XX,
2019-07-08,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Problem with a credit reporting company's investigation into an existing problem,Their investigation did not fix an error on your report,"Hello This complaint is against the three credit reporting companies. XXXX, Trans Union and XXXX. I noticed some discrepencies on my credit report so I put a credit freeze with XXXX.on XX/XX/2019. I then notified the three credit agencies previously stated with a writtent letter dated XX/XX/2019 requesting them to verifiy certain accounts showing on my report They were a Bankruptcy and a bank account from XXXX XXXX XXXX.",,,,,,,,,,,,


In [0]:
#show sample data
df.select(['Product','Complaint ID','Consumer complaint narrative']).show(20)
#data = csv.select("Product", "Complaint ID").where(col("Complaint ID").isNull())
#data.show(20)

#df2 = df.dropna(thresh=2,subset=('Product','Complaint ID'))
#df2.where(col("Complaint ID").isNull()).show(20)
#register df as a temporary view named table1 so it can be queried with SQL
df.createOrReplaceTempView("table1")

In [0]:
dfProduct = sqlContext.sql("SELECT `Complaint ID` as id,lower(Product) AS product, `Consumer complaint narrative` AS narrative " + \
                           ",`Submitted via` AS via FROM table1 ")
dfProduct = dfProduct.dropna()
#df2 = csv.na.drop()
#df.printSchema()


### Import and download NLP-related files

In [0]:
#install NLTK, then download all of its corpora and models
!pip install --upgrade pip
!pip install nltk
!python -m nltk.downloader all

In [0]:
import nltk
nltk.download('punkt')
nltk.download('twitter_samples')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('ieer')
nltk.download('stopwords')
#stopwords = set(STOPWORDS) 
stopwords = nltk.corpus.stopwords.words('english')


- Clear punctuation from the text (regexp_replace)
- Tokenization (Tokenizer)
- Delete stop words (StopWordsRemover)
- Stemming (SnowballStemmer)
- Filter short words (udf)
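
The steps listed above can be sketched on a single string in plain Python, which may help when checking what each stage does before wiring it into Spark. This is only an illustration under stated assumptions: the function name `preprocess`, the tiny `STOP_WORDS` set, and the `min_len` threshold are hypothetical, and stemming defaults to a no-op so that the sketch has no dependency beyond the standard library.

```python
import re

# Tiny illustrative stop-word list (an assumption); the notebook loads the
# full English list from nltk.corpus.stopwords instead.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "on", "my", "is", "i"}


def preprocess(text, stop_words=STOP_WORDS, stem=lambda token: token, min_len=3):
    """Apply the five cleaning steps to one text and return its tokens."""
    # 1. Replace punctuation (anything that is not a letter, digit or space).
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # 2. Tokenize: lowercase and split on whitespace.
    tokens = cleaned.lower().split()
    # 3. Delete stop words.
    tokens = [t for t in tokens if t not in stop_words]
    # 4. Stem each token (identity by default, so nothing changes here).
    tokens = [stem(t) for t in tokens]
    # 5. Filter out short words.
    return [t for t in tokens if len(t) >= min_len]


sample = "I would like the credit bureau to correct my balance."
print(preprocess(sample))
```

A real stemmer can be supplied through the `stem` argument, for example the `stem` method of an NLTK `SnowballStemmer("english")` instance, assuming NLTK is installed.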

### Sample code

In [0]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
# alternatively, pattern="\\w+", gaps(False)

countTokens = udf(lambda words: len(words), IntegerType())

tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")\
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)

regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words") \
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)

In [0]:
#take a 100-row sample after dropping rows with null values (this cell was run out of order)
dfProduct100 = dfProduct.dropna().limit(100)
type(dfProduct100)

In [0]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

sentenceDataFrame = dfProduct100

regexTokenizer = RegexTokenizer(inputCol="product", outputCol="words", pattern="\\W")
countTokens = udf(lambda words: len(words), IntegerType())

regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("product", "words") \
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)


In [0]:
#regexTokenized = regexTokenized.withColumn('words', concat_ws(' ', 'words'))
#regexTokenized.show(11)

In [0]:
#convert list column 'words' to string
#from pyspark.sql.functions import col, concat_ws
#words_list = regexTokenized.limit(100).select('words').collect() 
#stringList = ' '.join([str(item[0]) for item in words_list ])
#stringList
#import pyspark.sql.functions.*
#dfTokenized.select(concat_ws(' ', split(dfTokenized.words)).alias('content')).collect()

In [0]:
#word cloud
!pip install wordcloud
from wordcloud import WordCloud, STOPWORDS 
import matplotlib.pyplot as plt 

In [0]:
#stopwords = set(STOPWORDS) 
# Join the tokenized product names of the 100-row sample into one
# space-separated string and feed it to the word cloud.
comment_words = ' '.join(
    ' '.join(row['words']) for row in regexTokenized.select('words').collect()
)
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', 
                min_font_size = 10).generate(comment_words) 


In [0]:
  
# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show() 

### NLP
- `DocumentAssembler()` is one of the most essential transformers in the Spark NLP library. It is the entry point that brings your data in so that annotators can process it further; its output has no use until it is linked to annotators in a pipeline. Other NLP tasks are then applied on top of the `DocumentAssembler()` output.

In [0]:
from sparknlp.base import *
documentAssembler = DocumentAssembler().setInputCol("product").setOutputCol("document").setCleanupMode("shrink")
doc_df = documentAssembler.transform(dfProduct100)
doc_df.show(10)

#### Flatten the document column

In [0]:
doc_df.select("document.result").take(1)
import pyspark.sql.functions as F
doc_df.withColumn("tmp",F.explode("document")).select("tmp.*").show(3)

In [0]:
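# Join the tokenized product names into a single string for the word cloud
stringList = ' '.join(
    ' '.join(row['words']) for row in regexTokenized.select('words').collect()
)
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', 
                stopwords = stopwords, 
                min_font_size = 10).generate(stringList) 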
  
# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show() 

In [0]:
dfNarrative = dfProduct.filter("narrative IS NOT NULL")
dfNarrative.show(3)
#dfNarrative = dfProduct.select("narrative", "id").where(col("narrative").isNotNull())
#dfNarrative.show(20)

In [0]:
dfNarrative.count()

#### Where are these customers, and how did they submit their complaints?
- Use geographic information to visualize the distribution of these consumers.
- Count the complaint records by source.
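
Counting by source means producing one row per submission channel, which is also the shape a pie chart needs. Here is a minimal plain-Python sketch of that aggregation. The sample values and the helper name `count_by_source` are hypothetical; on the Spark side, the equivalent would be `groupBy('via').count()` on `dfProduct`.

```python
from collections import Counter

# Hypothetical sample of "Submitted via" values; real values come from the dataset.
via_values = ["Web", "Web", "Phone", "Referral", "Web", "Phone"]


def count_by_source(values):
    """Return (source, count) pairs, most frequent first, skipping missing values."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common()


via_counts = count_by_source(via_values)
print(via_counts)
```

Unlike a window count, which repeats the total on every input row, this yields each source exactly once.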

In [0]:
display(dfProduct)

id,product,narrative,via
3433198,debt collection,"Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work.",Web
3374555,debt collection,"Pioneer has committed several federal violations against me, a Private law abiding Federally Protected Consumer. Each violation is a statutory cost of {$1000.00} each, which does not include my personal cost and fees which shall be determined for taking time to address these issues. Violations committed against me include but not limited to : ( 1 ) Violated 15 USC 1692c ( a ) ; Communication without prior consent, expressed permission. ( 2 ) Violated 15 USC 1692d ; Harass and oppressive use of intercourse about an alleged debt. ( 3 ) Violated 15 USC 1692d ( l ) ; Attacking my reputation, accusing me of owing an alleged debt to you. ( 4 ) Violated 15 USC 1692e ( 9 ) ; Use/distribution of communication with authorization or approval. ( 5 ) Violated 15 USC 1692f ( l ) ; Attempting to collect a debt unauthorized by an agreement between parties.",Web
3311105,mortgage,"I started the process to refinance my current mortgage. The closing lawyer attempted to obtain the payoff statement, but they would not provide that statement for 4 business as they claimed they are behind. The lawyer informed me I could obtain the information sooner. I called and spoke to a representative on XX/XX/XXXX who said she could request I receive the payoff statement in 24 hours if she requested an expedite through a supervisor. I called back on XX/XX/XXXX, my closing date on the refinance, to determine the payoff amount as I had not yet received the payoff statement. I was told that it had not yet been generated, but still could be done today. I asked to speak to a supervisor and was told that one was not available and could call me within 24 hours. They also told me that there was no one within the company that could help me with this inquiry.",Web
3446975,"credit reporting, credit repair services, or other personal consumer reports",Today XX/XX/XXXX went online to dispute the incorrect personal information and it says This request can not be processed online,Web
3214857,"credit reporting, credit repair services, or other personal consumer reports",XXXX is reporting incorrectly to Equifax and XXXX an account balance of {$2300.00} on the XXXX partial account number XXXX. ( Please see pages 12 and 13 of the attached credit report ). This account is over 7 years old and therefore should not be on my credit report. This incorrect reporting is harming my credit score and is a Fair Credit Reporting Act ( F.C.R.A. ) violation.,Web
3417374,"credit reporting, credit repair services, or other personal consumer reports","Please reverse the late payments reported on the following accounts : XXXX XXXX XXXX XXXX XXXX XXXX XXXX The accounts were never past due, I never made a late payment to this company ever please change this, I have a good relationship with these companies.",Web
3415907,"credit reporting, credit repair services, or other personal consumer reports",i am a victim of identity theft as previously stated,Web
3158699,credit card or prepaid card,"I was shocked when I reviewed my credit report and found late payment on the date 60 days late as of XX/XX/2017 and XX/XX/2017. I am not sure how this happened, I believe that I had made my payments to you when I received my statements. My only thought is that my monthly statement did not get to me.",Web
3444592,"credit reporting, credit repair services, or other personal consumer reports",I would like the credit bureau to correct my XXXX XXXX XXXX XXXX balance. My correct balance is XXXX,Web
3448432,"credit reporting, credit repair services, or other personal consumer reports",The credit bureaus are reporting inaccurate/outdated/incomplete personal information.,Web


In [0]:
#pie chart
#dfProduct.agg({'via': 'count'}).withColumnRenamed("count(via)", "via_count").show()
import pyspark.sql.functions as F
from pyspark.sql import Window
w = Window.partitionBy('via')
#dfProduct.groupBy('via').count().select('via', dfProduct.col('count').alias('via_count')).show(10)
#dfProduct.select('via', dfProduct.count('via').over(w).alias('via_count')).sort('via').show()

#dfProduct.withColumn('via', F.count('via').over(w)).sort('via').show()
dfVia = sqlContext.sql(\
                           "SELECT `Submitted via` AS via, " + \
                           "COUNT(`Submitted via`) OVER (PARTITION BY `Submitted via`) as via_count " + \
                           "FROM table1 ")
display(dfVia)
#dfProduct.groupBy(F.col('via')).agg(F.count('via').alias('via_count')).show()


via,via_count
XXXX advising we are still a month behind. I have since been in contact with the Chase Executive Offices and have been advised,1
"Deceptive or Abusive Acts or Practices.""",1
I got a message that said to go to the website and,1
I hung up. I called the executive team,1
I started to receive alerts thru Credit wise,1
LLC itself.,1
OH XXXX Thank You,1
OHXXXX XXXX,7
OHXXXX XXXX,7
OHXXXX XXXX,7


In [0]:

dfTest = sqlContext.sql(\
                           "SELECT * FROM table1 WHERE `Complaint ID`='1471337'")
display(dfTest)

Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID


In [0]:
dfProduct.select('via', F.count('via').over(w).alias('via_count')).show()

In [0]:
display(dfProduct)

id,product,narrative
3433198,debt collection,"Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work."
3374555,debt collection,"Pioneer has committed several federal violations against me, a Private law abiding Federally Protected Consumer. Each violation is a statutory cost of {$1000.00} each, which does not include my personal cost and fees which shall be determined for taking time to address these issues. Violations committed against me include but not limited to : ( 1 ) Violated 15 USC 1692c ( a ) ; Communication without prior consent, expressed permission. ( 2 ) Violated 15 USC 1692d ; Harass and oppressive use of intercourse about an alleged debt. ( 3 ) Violated 15 USC 1692d ( l ) ; Attacking my reputation, accusing me of owing an alleged debt to you. ( 4 ) Violated 15 USC 1692e ( 9 ) ; Use/distribution of communication with authorization or approval. ( 5 ) Violated 15 USC 1692f ( l ) ; Attempting to collect a debt unauthorized by an agreement between parties."
3311105,mortgage,"I started the process to refinance my current mortgage. The closing lawyer attempted to obtain the payoff statement, but they would not provide that statement for 4 business as they claimed they are behind. The lawyer informed me I could obtain the information sooner. I called and spoke to a representative on XX/XX/XXXX who said she could request I receive the payoff statement in 24 hours if she requested an expedite through a supervisor. I called back on XX/XX/XXXX, my closing date on the refinance, to determine the payoff amount as I had not yet received the payoff statement. I was told that it had not yet been generated, but still could be done today. I asked to speak to a supervisor and was told that one was not available and could call me within 24 hours. They also told me that there was no one within the company that could help me with this inquiry."
3446975,"credit reporting, credit repair services, or other personal consumer reports",Today XX/XX/XXXX went online to dispute the incorrect personal information and it says This request can not be processed online
3214857,"credit reporting, credit repair services, or other personal consumer reports",XXXX is reporting incorrectly to Equifax and XXXX an account balance of {$2300.00} on the XXXX partial account number XXXX. ( Please see pages 12 and 13 of the attached credit report ). This account is over 7 years old and therefore should not be on my credit report. This incorrect reporting is harming my credit score and is a Fair Credit Reporting Act ( F.C.R.A. ) violation.
3417374,"credit reporting, credit repair services, or other personal consumer reports","Please reverse the late payments reported on the following accounts : XXXX XXXX XXXX XXXX XXXX XXXX XXXX The accounts were never past due, I never made a late payment to this company ever please change this, I have a good relationship with these companies."
3415907,"credit reporting, credit repair services, or other personal consumer reports",i am a victim of identity theft as previously stated
3158699,credit card or prepaid card,"I was shocked when I reviewed my credit report and found late payment on the date 60 days late as of XX/XX/2017 and XX/XX/2017. I am not sure how this happened, I believe that I had made my payments to you when I received my statements. My only thought is that my monthly statement did not get to me."
3444592,"credit reporting, credit repair services, or other personal consumer reports",I would like the credit bureau to correct my XXXX XXXX XXXX XXXX balance. My correct balance is XXXX
3448432,"credit reporting, credit repair services, or other personal consumer reports",The credit bureaus are reporting inaccurate/outdated/incomplete personal information.


#### How many valid and invalid complaints are in the dataset?

The government provides platforms for financial products