## Text Summarization

Types:
- Abstractive: generates new text that paraphrases the document, typically with deep learning models; BERT-based models are one option.
- Extractive: selects the 'most important' sentences from a document and returns them verbatim.

We tried an extractive model that uses sentence embeddings built from pre-trained GloVe word vectors.

Source: https://appliedmachinelearning.blog/2019/12/31/extractive-text-summarization-using-glove-vectors/ 

Paper: https://nlp.stanford.edu/pubs/glove.pdf 
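
As a rough illustration of the approach, the sketch below averages word vectors into sentence embeddings, compares sentences by cosine similarity, and ranks each sentence by its total similarity to the others. The three-dimensional `toy_embeddings`, the helper names, and the sum-of-similarities scoring rule are assumptions made for illustration; they stand in for real GloVe vectors and are not necessarily the scoring used in the source post.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for pre-trained GloVe embeddings.
toy_embeddings = {
    "debt": np.array([1.0, 0.0, 0.0]),
    "collector": np.array([0.9, 0.1, 0.0]),
    "called": np.array([0.5, 0.5, 0.0]),
    "weather": np.array([0.0, 0.0, 1.0]),
    "nice": np.array([0.0, 0.1, 0.9]),
}

def sentence_vector(sentence, embeddings, dim=3):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

def cosine_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty sentences
    unit = vectors / norms
    return unit @ unit.T

def rank_sentences(sentences, embeddings, dim=3):
    """Order sentences by their summed similarity to all other sentences."""
    vectors = np.vstack([sentence_vector(s, embeddings, dim) for s in sentences])
    sim = cosine_matrix(vectors)
    np.fill_diagonal(sim, 0.0)  # a sentence should not vote for itself
    scores = sim.sum(axis=1)
    order = np.argsort(-scores)
    return [sentences[i] for i in order], scores

sentences = ["debt collector called", "collector called debt debt", "weather nice"]
ranked, scores = rank_sentences(sentences, toy_embeddings)
print(ranked)
```

A summary would then keep the top few entries of `ranked`; the off-topic sentence ends up last because it is dissimilar to the rest.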



In [1]:
import csv

## Preprocessing

In [1]:
pip install swifter

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install contractions

Note: you may need to restart the kernel to use updated packages.


In [3]:
import collections
import gc
import itertools
import os
import re
import string
import sys

import contractions
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import swifter
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.notebook import tqdm

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/kagenlim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kagenlim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
df_narr = pd.read_csv("/content/drive/MyDrive/Classes/2 Practicum/[For Classmates 3-Feb-2021] cfpb_cleaned/data_files/narratives_raw.csv")

In [None]:
df_narr.head()

Unnamed: 0.1,Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint length
0,0,2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,transworld systems inc. is trying to collect a...,,TRANSWORLD SYSTEMS INC,FL,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,,18
1,1,2019-11-08,Debt collection,I do not know,Communication tactics,Frequent or repeated calls,"Over the past 2 weeks, I have been receiving e...",,"Diversified Consultants, Inc.",NC,,Consent provided,Web,2019-11-08,Closed with explanation,Yes,,78
2,2,2019-09-15,Debt collection,Other debt,Attempts to collect debt not owed,Debt was result of identity theft,Pioneer has committed several federal violatio...,,Pioneer Capital Solutions Inc,CA,,Consent provided,Web,2019-09-15,Closed with explanation,Yes,,152
3,3,2019-07-26,"Credit reporting, credit repair services, or o...",Credit reporting,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,"Previously, on XX/XX/XXXX, XX/XX/XXXX, and XX/...",Company has responded to the consumer and the ...,Experian Information Solutions Inc.,CA,,Consent provided,Web,2019-07-26,Closed with explanation,Yes,,171
4,4,2019-07-08,"Credit reporting, credit repair services, or o...",Credit reporting,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,Hello This complaint is against the three cred...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",NY,,Consent provided,Web,2019-07-08,Closed with explanation,Yes,,428


In [None]:
len(df_narr[df_narr['Complaint length']==1])

0

In [None]:
df_narr.shape

(657719, 18)

### Tokenization First (Sentence Tokenization uses Punctuation)
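
Sentence boundaries are detected from punctuation, so splitting has to happen before punctuation is stripped. The sketch below illustrates the ordering effect with a simplified regex splitter; `naive_sent_split` is a hypothetical stand-in for `sent_tokenize`, which handles abbreviations and other cases this regex does not.

```python
import re
import string

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (rough stand-in for sent_tokenize)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def strip_punctuation(text):
    """Remove all ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

text = "I was on automatic payment. The bank charged a late fee! Why?"

# Splitting first keeps three sentences; stripping first leaves nothing to split on.
split_first = [strip_punctuation(s) for s in naive_sent_split(text)]
strip_first = naive_sent_split(strip_punctuation(text))

print(len(split_first), len(strip_first))
```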

In [None]:
narrs = df_narr[['Consumer complaint narrative']]

narrs.head()

Unnamed: 0,Consumer complaint narrative
0,transworld systems inc. is trying to collect a...
1,"Over the past 2 weeks, I have been receiving e..."
2,Pioneer has committed several federal violatio...
3,"Previously, on XX/XX/XXXX, XX/XX/XXXX, and XX/..."
4,Hello This complaint is against the three cred...


In [None]:
narrs_stringed = narrs.convert_dtypes(convert_string=True)

In [None]:
narrs_tokenized = narrs_stringed['Consumer complaint narrative'].swifter.apply(sent_tokenize)

In [None]:
df_narr['Consumer complaint narrative'] = narrs_tokenized.values

In [None]:
df_narr.to_csv('/content/drive/MyDrive/Classes/2 Practicum/cfpb_cleaned/data_files/sent_tokenized.csv')
print('Saved to sent_tokenized.csv')

Saved to sent_tokenized.csv


### Cleaning

In [None]:
df_narr.to_csv('sent_tokenized_use.csv')
print('Saved to Drive')

Saved to Drive


#### Cleaning Up sent_tokenized_use.csv

In [4]:
df_narr = pd.read_csv('sent_tokenized_use.csv')

In [5]:
df_narr.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint length
0,0,0,0,2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,['transworld systems inc. is trying to collect...,,TRANSWORLD SYSTEMS INC,FL,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,,18
1,1,1,1,2019-11-08,Debt collection,I do not know,Communication tactics,Frequent or repeated calls,"['Over the past 2 weeks, I have been receiving...",,"Diversified Consultants, Inc.",NC,,Consent provided,Web,2019-11-08,Closed with explanation,Yes,,78
2,2,2,2,2019-09-15,Debt collection,Other debt,Attempts to collect debt not owed,Debt was result of identity theft,['Pioneer has committed several federal violat...,,Pioneer Capital Solutions Inc,CA,,Consent provided,Web,2019-09-15,Closed with explanation,Yes,,152
3,3,3,3,2019-07-26,"Credit reporting, credit repair services, or o...",Credit reporting,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,"['Previously, on XX/XX/XXXX, XX/XX/XXXX, and X...",Company has responded to the consumer and the ...,Experian Information Solutions Inc.,CA,,Consent provided,Web,2019-07-26,Closed with explanation,Yes,,171
4,4,4,4,2019-07-08,"Credit reporting, credit repair services, or o...",Credit reporting,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,['Hello This complaint is against the three cr...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",NY,,Consent provided,Web,2019-07-08,Closed with explanation,Yes,,428


In [6]:
type(df_narr['Consumer complaint narrative'][4])  # str: the sentence lists were serialized to text in the CSV

str
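
The value is a `str` because writing a column of Python lists to CSV stores each list as its text representation, and `read_csv` does not parse it back. The sketch below reproduces this on a one-row frame held in memory (the column name `narrative` and the sample sentences are made up) and shows `ast.literal_eval` recovering the list. Passing `index=False` to `to_csv` is an optional extra that keeps additional `Unnamed: 0` index columns from accumulating on each save and reload.

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({"narrative": [["First sentence.", "Second sentence."]]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

raw = reloaded["narrative"][0]       # text such as "['First sentence.', ...]"
restored = ast.literal_eval(raw)     # back to a list of sentences

print(columns_with_index, type(raw).__name__, type(restored).__name__)
```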

In [7]:
from tqdm import tqdm
tqdm.pandas()  # registers Series.progress_apply

def number_free(doc):
  """Remove every run of digits from the text."""
  return re.sub(r"\d+", "", doc)

In [8]:
df_narr_clean1 = df_narr['Consumer complaint narrative'].swifter.apply(number_free)

In [9]:
df_narr_clean1

0         ['transworld systems inc. is trying to collect...
1         ['Over the past  weeks, I have been receiving ...
2         ['Pioneer has committed several federal violat...
3         ['Previously, on XX/XX/XXXX, XX/XX/XXXX, and X...
4         ['Hello This complaint is against the three cr...
                                ...                        
657714    ['I was on automatic payment for my car loan.'...
657715    ['I recieved a collections call from an unknow...
657716    ['On XXXX XXXX, , I contacted XXXX XXXX, who i...
657717    ['I can not get from chase who services my mor...
657718    ['I made a payment to CITI XXXX Credit Card on...
Name: Consumer complaint narrative, Length: 657719, dtype: object

In [10]:
type(df_narr_clean1[4])

str

In [90]:
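def odd_tokens_free(doc):
  """Remove the xx/XX redaction masks left in the narratives."""
  # "xx|XX" already matches inside longer runs such as "XXXX", so the longer alternatives were redundant.
  processed = re.sub(r"xx|XX", "", doc)
  return processed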
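
Because the pattern is not anchored to word boundaries, it also deletes any `xx` pair inside an ordinary word. A possible variant, sketched below under the assumption that redaction masks always appear as standalone runs of two or more `X` characters, uses `\b` so that only whole masks are removed. The name `mask_free` and the sample sentence are illustrative.

```python
import re

def odd_tokens_free(doc):
    # Same behaviour as the notebook's function: removes every "xx"/"XX" pair.
    return re.sub(r"xx|XX", "", doc)

# Assumption: a mask is a standalone run of two or more x/X characters.
MASK = re.compile(r"\b[xX]{2,}\b")

def mask_free(doc):
    """Remove only whole-token masks, leaving words that merely contain 'xx' intact."""
    return MASK.sub("", doc)

sample = "I called Exxon on XX/XX/XXXX about account XXXX."
print(odd_tokens_free(sample))
print(mask_free(sample))
```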

In [91]:
df_narr_clean2 = df_narr_clean1.swifter.apply(odd_tokens_free)

In [92]:
df_narr_clean2[2] 

"['Pioneer has committed several federal violations against me, a Private law abiding Federally Protected Consumer.', 'Each violation is a statutory cost of {$.} each, which does not include my personal cost and fees which shall be determined for taking time to address these issues.', 'Violations committed against me include but not limited to : (  ) Violated  USC c ( a ) ; Communication without prior consent, expressed permission.', '(  ) Violated  USC d ; Harass and oppressive use of intercourse about an alleged debt.', '(  ) Violated  USC d ( l ) ; Attacking my reputation, accusing me of owing an alleged debt to you.', '(  ) Violated  USC e (  ) ; Use/distribution of communication with authorization or approval.', '(  ) Violated  USC f ( l ) ; Attempting to collect a debt unauthorized by an agreement between parties.']"

In [104]:
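import ast

def string_to_list(doc):
  """Parse the stringified list of sentences back into a Python list."""
  # Escape raw newlines and drop \x / \u escape prefixes so literal_eval does not fail on them.
  escaped = doc.replace('\r', '\\r').replace('\n', '\\n').replace("\\x", '').replace("\\u", '')
  return ast.literal_eval(escaped)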

In [105]:
df_narr_clean3 = df_narr_clean2.swifter.apply(string_to_list)

In [106]:
df_narr_clean3[2]

['Pioneer has committed several federal violations against me, a Private law abiding Federally Protected Consumer.',
 'Each violation is a statutory cost of {$.} each, which does not include my personal cost and fees which shall be determined for taking time to address these issues.',
 'Violations committed against me include but not limited to : (  ) Violated  USC c ( a ) ; Communication without prior consent, expressed permission.',
 '(  ) Violated  USC d ; Harass and oppressive use of intercourse about an alleged debt.',
 '(  ) Violated  USC d ( l ) ; Attacking my reputation, accusing me of owing an alleged debt to you.',
 '(  ) Violated  USC e (  ) ; Use/distribution of communication with authorization or approval.',
 '(  ) Violated  USC f ( l ) ; Attempting to collect a debt unauthorized by an agreement between parties.']

In [167]:
def lowering(doc):
    """Lower-case each sentence and join the list back into a single string."""
    return " ".join(sentence.lower() for sentence in doc)

In [168]:
df_narr_clean4 = df_narr_clean3.progress_apply(lowering)

In [169]:
df_narr_clean4[2]

'pioneer has committed several federal violations against me, a private law abiding federally protected consumer. each violation is a statutory cost of {$.} each, which does not include my personal cost and fees which shall be determined for taking time to address these issues. violations committed against me include but not limited to : (  ) violated  usc c ( a ) ; communication without prior consent, expressed permission. (  ) violated  usc d ; harass and oppressive use of intercourse about an alleged debt. (  ) violated  usc d ( l ) ; attacking my reputation, accusing me of owing an alleged debt to you. (  ) violated  usc e (  ) ; use/distribution of communication with authorization or approval. (  ) violated  usc f ( l ) ; attempting to collect a debt unauthorized by an agreement between parties.'

In [173]:
stop_words = set(stopwords.words('english'))  # built once and reused for every document

def stop_free(doc):
  """Drop whitespace-delimited tokens that exactly match an English stop word."""
  return " ".join(word for word in doc.split() if word not in stop_words)

In [174]:
df_narr_clean5 = df_narr_clean4.swifter.apply(stop_free)

In [175]:
df_narr_clean5[2]

'pioneer committed several federal violations me, private law abiding federally protected consumer. violation statutory cost {$.} each, include personal cost fees shall determined taking time address issues. violations committed include limited : ( ) violated usc c ( ) ; communication without prior consent, expressed permission. ( ) violated usc ; harass oppressive use intercourse alleged debt. ( ) violated usc ( l ) ; attacking reputation, accusing owing alleged debt you. ( ) violated usc e ( ) ; use/distribution communication authorization approval. ( ) violated usc f ( l ) ; attempting collect debt unauthorized agreement parties.'
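
In the output above, tokens such as `me,` and `you.` survive because the comparison is made on whitespace-delimited tokens with punctuation still attached. One way to handle this, sketched below, is to compare the token with its surrounding punctuation stripped and to move any trailing punctuation onto the previous kept token so that sentence boundaries are preserved. The function name and the small stop-word set are illustrative assumptions, not NLTK's list.

```python
import string

def stop_free_punct_aware(doc, stop_words):
    """Remove stop words even when punctuation is attached, keeping trailing punctuation."""
    kept = []
    for token in doc.split():
        core = token.strip(string.punctuation)
        if core and core in stop_words:
            # Preserve e.g. the "." of "you." by attaching it to the previous kept token.
            trailing = token[token.rfind(core) + len(core):]
            if trailing and kept:
                kept[-1] += trailing
            continue
        kept.append(token)
    return " ".join(kept)

toy_stop_words = {"against", "me", "a", "to", "you"}  # hypothetical subset
sample = "violations against me, a private consumer owes a debt to you."
print(stop_free_punct_aware(sample, toy_stop_words))
```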

In [176]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

def stemmer(doc):
  """Apply the Porter stemmer to every whitespace-delimited token."""
  return " ".join(ps.stem(token) for token in doc.split())

In [177]:
df_narr_clean6 = df_narr_clean5.progress_apply(stemmer)

In [178]:
df_narr_clean6[2]

'pioneer commit sever feder violat me, privat law abid feder protect consumer. violat statutori cost {$.} each, includ person cost fee shall determin take time address issues. violat commit includ limit : ( ) violat usc c ( ) ; commun without prior consent, express permission. ( ) violat usc ; harass oppress use intercours alleg debt. ( ) violat usc ( l ) ; attack reputation, accus owe alleg debt you. ( ) violat usc e ( ) ; use/distribut commun author approval. ( ) violat usc f ( l ) ; attempt collect debt unauthor agreement parties.'
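
Since the sentence embeddings are built from pre-trained GloVe vectors, which are keyed by surface word forms, it may be worth checking how many stemmed tokens still have a vector: stems such as `violat` or `feder` may be missing from the vocabulary. The sketch below computes that coverage; `toy_vocab` is a made-up stand-in for the keys of the loaded embedding dictionary.

```python
def vocabulary_coverage(tokens, vocab):
    """Return the share of tokens found in vocab, plus the list of missing tokens."""
    if not tokens:
        return 0.0, []
    missing = [t for t in tokens if t not in vocab]
    return 1 - len(missing) / len(tokens), missing

# Hypothetical vocabulary; in practice this would be word_embeddings.keys().
toy_vocab = {"pioneer", "committed", "commit", "several", "federal", "violations", "debt"}

original_tokens = "pioneer committed several federal violations".split()
stemmed_tokens = "pioneer commit sever feder violat".split()

original_cov, _ = vocabulary_coverage(original_tokens, toy_vocab)
stemmed_cov, stemmed_missing = vocabulary_coverage(stemmed_tokens, toy_vocab)
print(original_cov, stemmed_cov, stemmed_missing)
```

If coverage on the real vocabulary drops sharply after stemming, skipping the stemmer or switching to lemmatization would be options to consider.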

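### Sentence Tokenization, Again

`lowering` joined each complaint back into a single string, so the cleaned text has to be split into sentences again.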

In [179]:
narrs_tokenized7 = df_narr_clean6.swifter.apply(sent_tokenize)

In [180]:
narrs_tokenized7[2]

['pioneer commit sever feder violat me, privat law abid feder protect consumer.',
 'violat statutori cost {$.}',
 'each, includ person cost fee shall determin take time address issues.',
 'violat commit includ limit : ( ) violat usc c ( ) ; commun without prior consent, express permission.',
 '( ) violat usc ; harass oppress use intercours alleg debt.',
 '( ) violat usc ( l ) ; attack reputation, accus owe alleg debt you.',
 '( ) violat usc e ( ) ; use/distribut commun author approval.',
 '( ) violat usc f ( l ) ; attempt collect debt unauthor agreement parties.']

### Punctuation

In [195]:
def punc_free(doc):
  """Remove ASCII punctuation characters from every sentence in the list."""
  exclude = set(string.punctuation)
  return [''.join(c for c in s if c not in exclude) for s in doc]

In [196]:
df_narr_clean8 = narrs_tokenized7.swifter.apply(punc_free)

In [197]:
df_narr_clean8[2]

['pioneer commit sever feder violat me privat law abid feder protect consumer',
 'violat statutori cost ',
 'each includ person cost fee shall determin take time address issues',
 'violat commit includ limit    violat usc c    commun without prior consent express permission',
 '  violat usc  harass oppress use intercours alleg debt',
 '  violat usc  l   attack reputation accus owe alleg debt you',
 '  violat usc e    usedistribut commun author approval',
 '  violat usc f  l   attempt collect debt unauthor agreement parties']

In [203]:
def odd_spaces(doc):
  """Collapse runs of whitespace in every sentence to single spaces."""
  return [' '.join(sentence.split()) for sentence in doc]

In [204]:
df_narr_clean9 = df_narr_clean8.swifter.apply(odd_spaces)

In [205]:
df_narr_clean9[2]

['pioneer commit sever feder violat me privat law abid feder protect consumer',
 'violat statutori cost',
 'each includ person cost fee shall determin take time address issues',
 'violat commit includ limit violat usc c commun without prior consent express permission',
 'violat usc harass oppress use intercours alleg debt',
 'violat usc l attack reputation accus owe alleg debt you',
 'violat usc e usedistribut commun author approval',
 'violat usc f l attempt collect debt unauthor agreement parties']
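
The two steps above, punctuation removal and whitespace normalisation, can also be done in one pass. The sketch below uses `str.translate` with a table built by `str.maketrans`; the function name `clean_sentences` is illustrative, and the result should match `punc_free` followed by `odd_spaces` for ASCII punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_sentences(sentences):
    """Strip punctuation, then collapse whitespace, for each sentence in the list."""
    return [" ".join(s.translate(_PUNCT_TABLE).split()) for s in sentences]

sample = [
    "( ) violat usc ( l ) ; attack reputation, accus owe alleg debt you.",
    "violat statutori cost {$.}",
]
cleaned = clean_sentences(sample)
print(cleaned)
```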

In [206]:
type(df_narr_clean9)

pandas.core.series.Series

In [207]:
df_narr['narrs_cleaned_sent_tokenized'] = df_narr_clean9

In [208]:
df_narr

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,...,State,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint length,narrs_cleaned_sent_tokenized
0,0,0,0,2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,['transworld systems inc. is trying to collect...,,...,FL,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,,18,[transworld system inc tri collect debt mine o...
1,1,1,1,2019-11-08,Debt collection,I do not know,Communication tactics,Frequent or repeated calls,"['Over the past 2 weeks, I have been receiving...",,...,NC,,Consent provided,Web,2019-11-08,Closed with explanation,Yes,,78,[past weeks receiv excess amount telephon call...
2,2,2,2,2019-09-15,Debt collection,Other debt,Attempts to collect debt not owed,Debt was result of identity theft,['Pioneer has committed several federal violat...,,...,CA,,Consent provided,Web,2019-09-15,Closed with explanation,Yes,,152,[pioneer commit sever feder violat me privat l...
3,3,3,3,2019-07-26,"Credit reporting, credit repair services, or o...",Credit reporting,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,"['Previously, on XX/XX/XXXX, XX/XX/XXXX, and X...",Company has responded to the consumer and the ...,...,CA,,Consent provided,Web,2019-07-26,Closed with explanation,Yes,,171,[previously request experian send copi verifi ...
4,4,4,4,2019-07-08,"Credit reporting, credit repair services, or o...",Credit reporting,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,['Hello This complaint is against the three cr...,Company has responded to the consumer and the ...,...,NY,,Consent provided,Web,2019-07-08,Closed with explanation,Yes,,428,[hello complaint three credit report companies...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
657714,657714,657714,657714,2016-07-11,Consumer Loan,Vehicle loan,Managing the loan or lease,,['I was on automatic payment for my car loan.'...,,...,IL,,Consent provided,Web,2016-07-11,Closed with explanation,Yes,No,100,"[automat payment car loan, fine print supposed..."
657715,657715,657715,657715,2017-01-24,Debt collection,I do not know,Communication tactics,Threatened to take legal action,['I recieved a collections call from an unknow...,Company has responded to the consumer and the ...,...,CA,,Consent provided,Web,2017-01-24,Closed with explanation,Yes,No,92,[reciev collect call unknown compani morn hosp...
657716,657716,657716,657716,2015-03-26,Mortgage,FHA mortgage,"Loan servicing, payments, escrow account",,"['On XXXX XXXX, 2015, I contacted XXXX XXXX, w...",,...,CA,,Consent provided,Web,2015-03-26,Closed with monetary relief,Yes,No,331,[contact branch manag gateway funding learn lo...
657717,657717,657717,657717,2015-12-12,Mortgage,Conventional adjustable mortgage (ARM),"Loan servicing, payments, escrow account",,['I can not get from chase who services my mor...,,...,NY,,Consent provided,Web,2015-12-12,Closed with explanation,Yes,No,21,[get chase servic mortgage own origin loan doc...


In [210]:
df_narr['narrs_cleaned_sent_tokenized'][2]

['pioneer commit sever feder violat me privat law abid feder protect consumer',
 'violat statutori cost',
 'each includ person cost fee shall determin take time address issues',
 'violat commit includ limit violat usc c commun without prior consent express permission',
 'violat usc harass oppress use intercours alleg debt',
 'violat usc l attack reputation accus owe alleg debt you',
 'violat usc e usedistribut commun author approval',
 'violat usc f l attempt collect debt unauthor agreement parties']

In [212]:
df_narr.shape

(657719, 21)

In [211]:
df_narr.to_csv('/Users/kagenlim/Documents/data_files/narrs_cleaned_sent_tokenized.csv')
print('Saved to Local')

Saved to Local


## Partial Data

In [None]:
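def loadGloveModel(gloveFile):
    """Read a GloVe text file into a dict mapping each word to its float32 vector."""
    word_embeddings = {}
    # The context manager closes the file even if a line fails to parse.
    with open(gloveFile, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            word_embeddings[word] = coefs
    return word_embeddings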

In [None]:
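# gloveFile must hold the path to a pre-trained GloVe text file; it is not defined in this section.
word_embeddings = loadGloveModel(gloveFile)
print("Vocab Size =", len(word_embeddings))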
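
Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below applies the same parsing to a few in-memory lines so it can be run without downloading anything; the three-dimensional vectors are invented, and `load_glove_lines` is a hypothetical variant of `loadGloveModel` that accepts any iterable of lines.

```python
import io

import numpy as np

def load_glove_lines(lines):
    """Parse 'word v1 v2 ...' lines into a dict of float32 vectors."""
    embeddings = {}
    for line in lines:
        values = line.split()
        if len(values) < 2:
            continue  # skip blank or malformed lines
        embeddings[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings

# Invented sample in GloVe's text layout.
sample_file = io.StringIO(
    "debt 0.10 0.20 0.30\n"
    "loan 0.40 0.50 0.60\n"
)
embeddings = load_glove_lines(sample_file)

# All vectors should share one dimensionality; a mixed set signals a parsing problem.
dims = {vec.shape[0] for vec in embeddings.values()}
print("Vocab Size =", len(embeddings), "| vector dimensions:", dims)
```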

## Full Model