# DSCI 614: Project 3

### Symphony Hopkins

## Introduction

We are part of a data science team working for a retirement investment company. We were given a paper as a PDF file, tax-efficient-withdrawal-strategies.pdf, and asked to perform the following tasks:

## 1. Extract all texts from the given pdf file.

We will begin by importing the necessary libraries and extracting the text from the pdf file.

In [1]:
# importing library
import PyPDF2
import re

In [2]:
# specifying file
tews_pdf = '/Users/symphonyhopkins/Documents/Maryville_University/DSCI_614/Week_3/tax-efficient-withdrawal-strategies.pdf'
# creating a pdf file object
pdf_file = open(tews_pdf, 'rb')
# creating a pdf reader object
pdf_file_reader = PyPDF2.PdfFileReader(pdf_file)


# getting the number of pages in the pdf file
page_count = pdf_file_reader.numPages
# printing number of pages in pdf file
print(f' There are {page_count} pages in the file.')


# initiating empty list
output = []
# getting text from each page via for loop
for i in range(page_count):
    # getting the i-th page contents from the pdf file
    pdf_page = pdf_file_reader.getPage(i)
    # extracting text from each page and append it to the list
    output.append(pdf_page.extractText())
# concatenating items in the list to a single string
all_texts = ' '.join(output)


# removing \n from the texts
all_texts = re.sub('\n', '', all_texts)
# removing punctuation from the texts
all_texts = re.sub(r'[^\w\s]','',all_texts)


# printing out the first 300 chars from the texts
print('\n',all_texts[:300])
# closing the pdf file object
pdf_file.close()


 There are 17 pages in the file.

 1T ROWE  PRICE INSIGHTSON RETIREMENTKEY INSIGHTS	There are alternatives to the conventional strategy of drawing on a taxable account first followed by taxdeferred and then Roth accounts 	Many people can take advantage of income in a low tax bracket or taxfree capital gains	If planning to leave an es
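The fused words in the output above (for example `INSIGHTSON` and `RETIREMENTKEY`) are consistent with replacing `\n` with an empty string. The following is a minimal sketch of the difference, assuming the line breaks in the extracted text fall between words; `sample` is a hypothetical string, not text taken from the PDF.

```python
import re

# Hypothetical sample imitating extracted PDF text with line breaks between words
sample = "INSIGHTS\nON RETIREMENT\nKEY INSIGHTS"

# Replacing newlines with an empty string fuses the neighbouring words
fused = re.sub(r"\n", "", sample)

# Replacing every run of whitespace with a single space keeps the words apart
spaced = re.sub(r"\s+", " ", sample).strip()

print(fused)
print(spaced)
```

Words that were hyphenated across a line break would need separate handling, which this sketch does not attempt.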


## 2. Extract all the tokens from the texts. Extract all lemmas from the texts.

Next, we'll extract the tokens and lemmas from the texts. Since the text is very long, we will only print out the first 20 items from each list.

In [3]:
# importing library
import spacy

In [4]:
# loading the model
nlp = spacy.load("en_core_web_sm")

# processing the texts
doc = nlp(all_texts)

# initializing empty lists to store tokens and lemmas
token_texts = []
lemma_texts = []

# storing tokens and lemmas in lists
for token in doc:
    # appending the token text to the token_texts list
    token_texts.append(token.text)
    # appending the lemma text to the lemma_texts list
    lemma_texts.append(token.lemma_)

# printing out the first 20 tokens and lemmas
for i in range(20):
    print(f'Token Text = {token_texts[i]}; Lemma Text = {lemma_texts[i]}')

Token Text = 1; Lemma Text = 1
Token Text = T; Lemma Text = t
Token Text = ROWE; Lemma Text = ROWE
Token Text =  ; Lemma Text =  
Token Text = PRICE; Lemma Text = price
Token Text = INSIGHTSON; Lemma Text = INSIGHTSON
Token Text = RETIREMENTKEY; Lemma Text = RETIREMENTKEY
Token Text = INSIGHTS; Lemma Text = INSIGHTS
Token Text = 	; Lemma Text = 	
Token Text = There; Lemma Text = there
Token Text = are; Lemma Text = be
Token Text = alternatives; Lemma Text = alternative
Token Text = to; Lemma Text = to
Token Text = the; Lemma Text = the
Token Text = conventional; Lemma Text = conventional
Token Text = strategy; Lemma Text = strategy
Token Text = of; Lemma Text = of
Token Text = drawing; Lemma Text = draw
Token Text = on; Lemma Text = on
Token Text = a; Lemma Text = a


## 3. Remove all the default stop words in spaCy from the texts.

Now, we will remove all of the default stop words from the texts.

In [5]:
# removing default stop words
spacy_stopwords = nlp.Defaults.stop_words
tokens_without_stopword = [token for token in doc if token.text not in spacy_stopwords]

# printing tokens without stop words
print(tokens_without_stopword[:100])

[1, T, ROWE,  , PRICE, INSIGHTSON, RETIREMENTKEY, INSIGHTS, 	, There, alternatives, conventional, strategy, drawing, taxable, account, followed, taxdeferred, Roth, accounts, 	, Many, people, advantage, income, low, tax, bracket, taxfree, capital, gains, 	, If, planning, leave, estate, heirs, consider, assets, ultimately, maximize, aftertax, value, How, Get, More, Out, Your, Retirement, Account, Withdrawals, These, approaches, extend, life, portfolio, preserve, assets, heirsMany, people, rely, largely, Social, Security, benefits, taxdeferred, accounts, individual, retirement, accounts, IRAs, 401k, plans, support, lifestyle, retirement, However, sizable, number, retirees, enter, retirement, assets, taxable, accounts, brokerage, accounts, Roth, accounts, Deciding, use, combination, accounts, fund, spending, decision, likely, driven, tax, consequences]
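Capitalized words such as `There`, `Many`, and `If` survive in the output above because the membership test compares the exact token text against a lowercase stop-word set, and whitespace tokens survive because they are not stop words. The sketch below illustrates a lowercased comparison that also drops whitespace tokens. It uses plain strings and a small hypothetical `stop_words` set rather than spaCy objects, so it shows the logic only.

```python
# Hypothetical mini stop-word set; spaCy's default list is much larger
stop_words = {"there", "are", "to", "the", "of", "on", "a"}

# A few token strings resembling the output above, including a whitespace token
tokens = ["There", "are", "alternatives", "to", "the", " ", "conventional", "strategy"]

# Exact match: "There" and " " survive because neither is in the set
exact = [t for t in tokens if t not in stop_words]

# Lowercased match that also drops whitespace-only tokens
normalized = [t for t in tokens if t.lower() not in stop_words and not t.isspace()]

print(exact)
print(normalized)
```

With spaCy tokens, the equivalent checks would likely use `token.text.lower()` and `token.is_space` in place of the string methods.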


## 4. Customize the stop words in spaCy by: a. Adding "tax" and "account" to the stop words. b. Removing "full" from the default stop words. Then remove all the customized stop words from the texts.

The company would like us to add "tax" and "account" to the list of stop words, so let's do that. To confirm that they were added, we will count the stop words before and after running the code.

In [6]:
# printing number of stop words currently
print(f'There are {len(nlp.Defaults.stop_words)} stop words in Spacy')

# specifying new stop words
customized_stop_words = ['tax', 'account']
# adding new stop words
for token in customized_stop_words:
    nlp.Defaults.stop_words.add(token)
# setting the tag of the customized stop words as stop word 
for token in customized_stop_words:
    nlp.vocab[token].is_stop = True
    
# printing number of stop words now
print(f'There are {len(nlp.Defaults.stop_words)} stop words in Spacy')

There are 326 stop words in Spacy
There are 328 stop words in Spacy


Then, we will remove "full" from the default stop words. Like before, we will show the number of stop words before and after.

In [7]:
# printing number of stop words currently
print(f'There are {len(nlp.Defaults.stop_words)} stop words in Spacy')

# specifying the stop word
remove_stop_words = ['full']
# removing the stop word
for token in remove_stop_words:
    nlp.Defaults.stop_words.remove(token)
# setting the tag of the removed stop words as non-stop word 
for token in remove_stop_words:
    nlp.vocab[token].is_stop = False

# printing number of stop words now
print(f'There are {len(nlp.Defaults.stop_words)} stop words in Spacy')

There are 328 stop words in Spacy
There are 327 stop words in Spacy
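One practical point when rerunning these cells: the code above calls `.add` and `.remove` on the stop-word collection, and if it behaves like a built-in Python set (an assumption here), a second `.remove("full")` would raise `KeyError`. The sketch below uses a small stand-in set to show the difference between `remove` and `discard`.

```python
# Stand-in for the stop-word collection, assumed to behave like a built-in set
stop_words = {"full", "the", "of"}

# The first removal succeeds
stop_words.remove("full")

# A second remove() raises KeyError because "full" is already gone
try:
    stop_words.remove("full")
    second_remove_failed = False
except KeyError:
    second_remove_failed = True

# discard() does nothing when the element is absent, so it is safe to rerun
stop_words.discard("full")

# add() is already idempotent: adding twice leaves a single entry
stop_words.add("tax")
stop_words.add("tax")

print(second_remove_failed, sorted(stop_words))
```

Using `discard` in the removal cell would make it safe to execute more than once without restarting the kernel.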


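Finally, we will remove the customized stop words from the texts. Because the comparison is an exact, case-sensitive match on the token text, tokens such as "Account" and "accounts" are not removed.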

In [8]:
# getting the new stop words
spacy_stopwords = nlp.Defaults.stop_words
tokens_without_stopword = [token for token in doc if token.text not in spacy_stopwords]
# printing the tokens
print(tokens_without_stopword[:100])

[1, T, ROWE,  , PRICE, INSIGHTSON, RETIREMENTKEY, INSIGHTS, 	, There, alternatives, conventional, strategy, drawing, taxable, followed, taxdeferred, Roth, accounts, 	, Many, people, advantage, income, low, bracket, taxfree, capital, gains, 	, If, planning, leave, estate, heirs, consider, assets, ultimately, maximize, aftertax, value, How, Get, More, Out, Your, Retirement, Account, Withdrawals, These, approaches, extend, life, portfolio, preserve, assets, heirsMany, people, rely, largely, Social, Security, benefits, taxdeferred, accounts, individual, retirement, accounts, IRAs, 401k, plans, support, lifestyle, retirement, However, sizable, number, retirees, enter, retirement, assets, taxable, accounts, brokerage, accounts, Roth, accounts, Deciding, use, combination, accounts, fund, spending, decision, likely, driven, consequences, distributions, withdrawals, accounts]


## 5. Perform the part of speech tagging for the texts.

Now, we will tag the parts of speech (POS) in the text. Since this is a long text, we will only print the POS tags of the first 20 tokens.

In [9]:
# initializing empty list
token_pos = []

# storing POS in list
for token in doc:
    # appending the POS to the token_pos list
    token_pos.append(token.pos_)
    
# printing out the first 20 POS
for i in range(20):
    print(f'Token Text = {token_texts[i]}; POS Text = {token_pos[i]}')

Token Text = 1; POS Text = NUM
Token Text = T; POS Text = NOUN
Token Text = ROWE; POS Text = PROPN
Token Text =  ; POS Text = SPACE
Token Text = PRICE; POS Text = NOUN
Token Text = INSIGHTSON; POS Text = PROPN
Token Text = RETIREMENTKEY; POS Text = PROPN
Token Text = INSIGHTS; POS Text = PROPN
Token Text = 	; POS Text = SPACE
Token Text = There; POS Text = PRON
Token Text = are; POS Text = VERB
Token Text = alternatives; POS Text = NOUN
Token Text = to; POS Text = ADP
Token Text = the; POS Text = DET
Token Text = conventional; POS Text = ADJ
Token Text = strategy; POS Text = NOUN
Token Text = of; POS Text = ADP
Token Text = drawing; POS Text = VERB
Token Text = on; POS Text = ADP
Token Text = a; POS Text = DET


## 6. Visualize the dependency parser of the texts.

Next, we'll visualize the dependency parse of the text. Because the visualization can become very large, we will render only the first 20 tokens.

In [10]:
# importing library
from spacy import displacy

In [11]:
# visualizing the dependency parse of the first 20 tokens
displacy.render(doc[:20], style="dep", jupyter=True)

## 7. Perform the named entities recognition for the texts.

Let's also perform named entity recognition.

In [12]:
# initializing empty lists to store entity text and labels
entity_text = []
entity_label = []

# storing text and types in list
for ent in doc.ents:
    # appending the entity texts to the entity_text list
    entity_text.append(ent.text)
    # appending the labels to the entity_label list
    entity_label.append(ent.label_)

# printing out the first 20 entity text and labels
for i in range(20):
    print(f'Entity Text = {entity_text[i]}; Entity Label = {entity_label[i]}')

Entity Text = 1; Entity Label = CARDINAL
Entity Text = ROWE; Entity Label = ORG
Entity Text = first; Entity Label = ORDINAL
Entity Text = Roth; Entity Label = PERSON
Entity Text = Social Security; Entity Label = ORG
Entity Text = 401k; Entity Label = PRODUCT
Entity Text = Roth; Entity Label = PERSON
Entity Text = 1; Entity Label = CARDINAL
Entity Text = Appendix 1A; Entity Label = PRODUCT
Entity Text = first; Entity Label = ORDINAL
Entity Text = Roth; Entity Label = PERSON
Entity Text = first; Entity Label = ORDINAL
Entity Text = Leaving Roth; Entity Label = PERSON
Entity Text = three; Entity Label = CARDINAL
Entity Text = Roger A; Entity Label = PERSON
Entity Text = Retirees With Relatively Modest Income; Entity Label = ORG
Entity Text = 5  ; Entity Label = CARDINAL
Entity Text = 6; Entity Label = CARDINAL
Entity Text = the SECURE Act9  Other Observations and Considerations; Entity Label = ORG
Entity Text = first; Entity Label = ORDINAL
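The loop above indexes the first 20 positions directly, which would raise `IndexError` on a document with fewer than 20 entities. A minimal sketch of a length-safe alternative follows; the short lists are hypothetical stand-ins for `entity_text` and `entity_label`.

```python
from itertools import islice

# Hypothetical short lists standing in for entity_text and entity_label
entity_text = ["1", "ROWE", "first"]
entity_label = ["CARDINAL", "ORG", "ORDINAL"]

# zip pairs the two lists; islice stops after 20 pairs or at the end, whichever comes first
first_pairs = list(islice(zip(entity_text, entity_label), 20))

for text, label in first_pairs:
    print(f"Entity Text = {text}; Entity Label = {label}")
```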


## 8. Visualize the MONEY and QUANTITY in the texts.

Finally, we will visualize the MONEY and QUANTITY named entities.

In [13]:
# setting options
options = {"ents": ['MONEY', 'QUANTITY']}
# visualizing entities
displacy.render(doc, style="ent", jupyter = True, options=options)

There don't appear to be many MONEY and QUANTITY labels in the visualization, so let's create counters to see how many are present.

In [14]:
# initializing counters for MONEY and QUANTITY entities
money_count = 0
quantity_count = 0

# counting labels using the entity_label variable from the previous question
for label in entity_label:
    # checking if the label is MONEY
    if label == "MONEY":
        money_count += 1
    # checking if the label is QUANTITY
    elif label == "QUANTITY":
        quantity_count += 1

# printing the counts
print(f'''
Number of MONEY entities: {money_count}
Number of QUANTITY entities: {quantity_count}''')


Number of MONEY entities: 1
Number of QUANTITY entities: 0
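The manual counters above could also be written with `collections.Counter`, which tallies every label in one pass. This is a sketch using the first ten labels from the entity output shown earlier as sample data, not the full `entity_label` list.

```python
from collections import Counter

# The first ten labels from the entity output shown earlier, used as sample data
entity_label = ["CARDINAL", "ORG", "ORDINAL", "PERSON", "ORG",
                "PRODUCT", "PERSON", "CARDINAL", "PRODUCT", "ORDINAL"]

# Tally every label in a single pass
label_counts = Counter(entity_label)

# A Counter returns 0 for labels that never occur, so no KeyError handling is needed
money_count = label_counts["MONEY"]
quantity_count = label_counts["QUANTITY"]

print(f"Number of MONEY entities: {money_count}")
print(f"Number of QUANTITY entities: {quantity_count}")
print(dict(label_counts))
```

Seeing the full label distribution also makes it easier to spot which entity types dominate the document.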


As the counts show, only one MONEY entity and no QUANTITY entities were found, although a closer reading of the document suggests that it contains more. One likely cause is that the punctuation removal in Section 1 also stripped dollar signs, commas, and decimal points, which the NER model may rely on to recognize monetary amounts. Cleaning the text more carefully should improve the results.
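As one possible direction for that cleanup, the sketch below compares the punctuation pattern used in Section 1 with a narrower pattern that keeps `$`, `%`, `.`, and `,`. The sentence in `sample` is hypothetical, and whether spaCy would then label the amounts as MONEY is not verified here.

```python
import re

# Hypothetical sentence resembling the kind of text found in the paper
sample = "A couple (both 65) spends $65,000 per year at a 12% rate."

# The pattern used in Section 1 removes everything that is not a word character or whitespace
stripped = re.sub(r"[^\w\s]", "", sample)

# A narrower pattern that keeps currency, percent, decimal, and thousands markers
kept = re.sub(r"[^\w\s$%.,]", "", sample)

print(stripped)
print(kept)
```

The narrower pattern still removes the parentheses but leaves `$65,000` and `12%` intact.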