# Files in Python
---

**Handling files in Python follows much the same pattern regardless of file type. However, there are some gotchas. This notebook delves into 5 of the more common file types.**

## Index
1. Text Files
2. CSV & Excel Files 
3. JSON Files
4. PDF Files

## 1. Text Files
---
 
* Create a new text file or open an existing one using the built-in open() function
* open() returns a file object and is usually called with 2 arguments: open('filename', mode)

**There are a number of modes that can be used for the 'mode' argument in open(), the most common are listed here:**

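* r = read only (the default)
* w = write only (creates the file, or erases its contents if it already exists)
* a = append (writes are added to the end of the file)
* r+ = read and write an existing file; w+ = read and write, but erases the file first
* b = binary mode (for non-text files)
* modes can be combined, e.g. rb = read binary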
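
As a rough illustration of how the modes listed above behave, the sketch below writes, appends to, and re-reads a scratch file. The temporary directory and the file name `modes_demo.txt` are placeholders chosen for this example; they are not files used elsewhere in this notebook.

```python
import os
import tempfile

# Scratch location (an assumption for this demo) so data/ is left untouched
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'modes_demo.txt')

# 'w' creates the file, or empties it if it already exists
with open(demo_path, 'w') as f:
    f.write('first line\n')

# 'a' keeps what is already there and adds to the end
with open(demo_path, 'a') as f:
    f.write('second line\n')

# 'r' (the default) returns str, while 'rb' returns the raw bytes
with open(demo_path, 'r') as f:
    text_content = f.read()
with open(demo_path, 'rb') as f:
    byte_content = f.read()

print(repr(text_content))
print(type(byte_content))
```

Opening the same path again with 'w' would discard both lines, which is the main gotcha to keep in mind when choosing between 'w' and 'a'.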

In [11]:
# Basic method for opening an already existing file
txtfile = open('data/text1.txt')
print(type(txtfile))
print()

# When looking at what's in the txtfile variable, note that txtfile is a 
# wrapper around the text1.txt file and by default opens in read-only mode ('r')
# Note: every file opens in read-only text mode unless a mode is given. 
print(txtfile)

# Files are closed manually with file.close() like so:
txtfile.close()

<class '_io.TextIOWrapper'>

<_io.TextIOWrapper name='data/text1.txt' mode='r' encoding='cp1252'>


### ***Important***
**While the above method for opening a file works, using the with keyword when dealing with file objects allows for automatic file closing after use.**
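
A small sketch of what "automatic file closing" means in practice: the `closed` attribute of the file object flips to True as soon as the with block exits, even when the block exits because of an error. The scratch file name `with_demo.txt` is a placeholder for this demonstration only.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration
demo_path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

with open(demo_path, 'w') as f:
    f.write('hello')
    closed_inside = f.closed    # still open inside the block

closed_after = f.closed         # closed automatically on exit

# The file is also closed when the block is left through an exception
try:
    with open(demo_path) as g:
        raise ValueError('simulated failure')
except ValueError:
    closed_after_error = g.closed

print(closed_inside, closed_after, closed_after_error)
```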

In [12]:
# Here the text1.txt file is opened using a with statement and is stored in
# the f variable

# The file is then read using the read() method, which reads to the end of
# the file; calling it again returns an empty string because the cursor is
# already at the end

with open('data/text1.txt') as f:
    print(f.read())
    print()
    print('nothing returned here:')
    print(f.read())
    
# Also note close() not required due to using with statement to open file

Welcome to the files tutorial.
First will see how Python can be used to work with text files.

nothing returned here:



In [14]:
# Text files can also be read line by line using the readlines() method.
# Note that each line is stored as one item in a list, and just like
# read() the file is only read through once

# Note that here the text is stored in a list so that it can be re-read
txtlist = []

with open('data/text1.txt') as f:
    txtlist = f.readlines()
    
    print('Initial f.readlines() stored in var returned here:')
    print(txtlist)
    print()
    
    print('Nothing returned here as f.readlines() already used:')
    print(f.readlines())
    
    print()
    print('Var can be re-used:')
    print(txtlist)


print() 
print('Here first word from each line returned:')
for line in txtlist:
    # split() breaks each line into a list of words, so here each line
    # is split and the first word in each list is pulled (index [0])
    print(line.split()[0])

Initial f.readlines() stored in var returned here:
['Welcome to the files tutorial.\n', 'First will see how Python can be used to work with text files.']

Nothing returned here as f.readlines() already used:
[]

Var can be re-used:
['Welcome to the files tutorial.\n', 'First will see how Python can be used to work with text files.']

Here first word from each line returned:
Welcome
First
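
For larger files, an alternative to readlines() is to loop over the file object itself, which yields one line at a time instead of building the whole list up front. The sketch below assumes a scratch file (`lines_demo.txt`, a made-up name) holding the same two lines as text1.txt.

```python
import os
import tempfile

# Scratch copy of the two tutorial lines (file name is a placeholder)
demo_path = os.path.join(tempfile.mkdtemp(), 'lines_demo.txt')
with open(demo_path, 'w') as f:
    f.write('Welcome to the files tutorial.\n')
    f.write('First will see how Python can be used to work with text files.')

first_words = []
with open(demo_path) as f:
    for line in f:                   # one line at a time
        words_in_line = line.split()
        if words_in_line:            # blank lines split to an empty list
            first_words.append(words_in_line[0])

print(first_words)
```

The guard against empty lists matters because `line.split()[0]` raises an IndexError on a blank line.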


In [15]:
# In order to write to a file, a write mode must be set; 'w+' is used here
# so that the file can also be read back
txtwrite = open('data/text2.txt', 'w+')
txtwrite.write('This is written text')

# The function seek() moves the file cursor to a given position in the
# file. Here 0 goes back to the start of the file; 3 skips the first three
# characters, so reading resumes at the fourth character. 
txtwrite.seek(0)
print(txtwrite.read())

txtwrite.seek(3)
print(txtwrite.read())

txtwrite.close()

This is written text
s is written text
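
The cursor position can be inspected with tell(). The sketch below uses binary mode, where offsets are plain byte counts; in text mode the Python documentation only guarantees seek() for offset 0 or for values previously returned by tell(). The file name `seek_demo.bin` is a placeholder.

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration
demo_path = os.path.join(tempfile.mkdtemp(), 'seek_demo.bin')

with open(demo_path, 'wb+') as f:
    f.write(b'This is written text')
    end_position = f.tell()       # cursor sits just after the last byte

    f.seek(0)                     # back to the start
    first_word = f.read(4)        # read exactly 4 bytes
    position_after_read = f.tell()

    f.seek(3)                     # skip the first 3 bytes
    rest = f.read()

print(end_position, first_word, position_after_read, rest)
```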


In [16]:
# Opening an existing file in 'w' mode erases all of its data; using the
# append ('a') mode prevents this

# Also note that write() is used to append to a file, not the append()
# method, which is used to add items to a list

txtappend = open('data/text2.txt', 'a+')
txtappend.write('\nAppended line....yeah')
txtappend.seek(0)
print(txtappend.read())
txtappend.close()

This is written text
Appended line....yeah


## 2. CSV & Excel Files
---
CSV files are comma delimited and tabular in nature, similar to Excel. However, they do not save or contain any formulas or calculations like Excel does, so they are really just good for tabular data storage.

Data in both CSV and Excel files is laid out in rows and columns, and because of this it is usually numerical in nature and used for some sort of analytical purpose.

### There are two ways to work with CSV & Excel files:
1. Using the csv built-in module (csv files only)
2. Using external packages such as NumPy (csv) or pandas (both csv and excel)

### Built-in csv module
The simplest method for dealing with CSV files is the built-in CSV module

In [18]:
import csv

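# Here the file is read using csv.reader() and printed out.
# Note that each row comes back as its own list. Because delimiter=' ' is
# passed here, the commas are not treated as separators, so each list
# holds the whole line as a single string.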
print('CSV file data read in by row:')
print('----------------------------------------')
with open('data/AA_daily.csv') as csv_file:
    data = csv.reader(csv_file, delimiter=' ')
    for row in data:
        print(row)

CSV file data read in by row:
----------------------------------------
['date,changePercent,close,high,low,open,volume']
['2019-03-05,0.13699999999999998,29.2,29.31,28.76,29.07,1890970']
['2019-03-06,-3.253,28.25,29.155,28.14,29.06,2466049']
['2019-03-07,-3.15,27.36,28.3,27.31,28.16,3102396']
['2019-03-08,-1.974,26.82,27.16,26.51,26.93,2864338']
['2019-03-11,2.61,27.52,27.57,26.73,26.76,3497718']
['2019-03-12,3.343,28.44,28.59,27.6,27.73,3112615']
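
To split each row into separate fields, the delimiter has to match the file. The sketch below feeds the first rows shown above to csv.reader() through an in-memory io.StringIO buffer (a stand-in for the real file) and relies on the default delimiter, which is a comma.

```python
import csv
import io

# In-memory stand-in for data/AA_daily.csv, using rows from the output above
sample = io.StringIO(
    'date,changePercent,close,high,low,open,volume\n'
    '2019-03-05,0.13699999999999998,29.2,29.31,28.76,29.07,1890970\n'
    '2019-03-06,-3.253,28.25,29.155,28.14,29.06,2466049\n'
)

rows = list(csv.reader(sample))    # delimiter defaults to ','
header, first_row = rows[0], rows[1]

print(header)
print(first_row)
```

Each row is now a list of seven strings, one per column, rather than a list holding one long string.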


In [19]:
# Using the csv module's DictReader class, each row is read into a
# dictionary keyed by the column headers
# Note that each row is a separate dictionary; to see this simply print row
print()
print('Only returned Date and close price:')
print('----------------------------------------')
with open('data/AA_daily.csv') as csv_file:
    data = csv.DictReader(csv_file)
    for row in data:
        print(row['date'], row['close'])


Only returned Date and close price:
----------------------------------------
2019-03-05 29.2
2019-03-06 28.25
2019-03-07 27.36
2019-03-08 26.82
2019-03-11 27.52
2019-03-12 28.44
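
One gotcha worth noting: the csv module returns every value as a string, so numbers must be converted before doing any arithmetic. A minimal sketch, again using an in-memory buffer with a trimmed-down, two-column version of the data above:

```python
import csv
import io

# Two-column excerpt of the data shown above (an in-memory stand-in)
sample = io.StringIO(
    'date,close\n'
    '2019-03-05,29.2\n'
    '2019-03-06,28.25\n'
    '2019-03-07,27.36\n'
)

closes = []
for row in csv.DictReader(sample):
    closes.append(float(row['close']))    # convert the string to a float

average_close = sum(closes) / len(closes)
print(closes)
print(round(average_close, 2))
```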


In [20]:
# Writing a CSV file differs from reading one in that the column headers,
# here called fieldnames, must be supplied. DictWriter() wraps the file,
# writeheader() writes the fieldnames as the first row, and
# writerow() writes one row, given in dict key/val format
with open('data/test_csv.csv', 'w', newline='') as csvfile:
    fieldnames = ['first_name', 'last_name']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'first_name': 'Jack', 'last_name':'Splatt'})
    writer.writerow({'first_name': 'Paul', 'last_name':'Bunyon'})




In [22]:
# This reads the file from the above example
with open('data/test_csv.csv') as csv_file:
    data = csv.DictReader(csv_file)
    for row in data:
        print(row['first_name'], row['last_name'])

Jack Splatt
Paul Bunyon


### Using external packages with csv and excel files

One of the simpler ways to get data out of CSV files is to read it with pandas and/or numpy (see the pandas and numpy notebooks for detailed info on each). The csv module is faster than this methodology overall, but 
for smaller data sets, this really shouldn't be an issue. 

There are 2 methods for getting CSV data into numpy and 1 for pandas: 
* loadtxt (numpy) 
* genfromtxt (numpy)
* read_csv (pandas)

I performed a test to see which method was fastest, and the answer was genfromtxt. However, numpy is best for large numerical calculations and doesn't handle text/headers very well. 

For general cases (not huge datasets) pandas will be the way to go as it
has a read/write methodology for all file types. So from here on out I only
use pandas.

Final note: data read into pandas is usually used for some type of analysis, so appending rows to the original data will be rare. More often
one will perform some kind of calculation or add columns with newly
calculated values to the original table rather than add new rows 
to the table.

In [27]:
import pandas as pd

# Reading data in pandas is really easy using read_csv()
df = pd.read_csv('data/test_csv.csv', delimiter=",")
print('CSV table as read in by Pandas')
print('----------------------------------------')
print(df)

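# Appending a row to the pandas DataFrame. DataFrame.append() was removed in
# pandas 2.0, so pd.concat() is used here; note ignore_index=True must be used
new_row = pd.DataFrame([{'first_name': 'Indy', 'last_name': 'Jones'}])
df = pd.concat([df, new_row], ignore_index=True)

# Write the new table data to a second CSV using to_csv()
# note that index=False keeps the row index out of the file
df.to_csv('data/test_csv2.csv', index=False)

# Re-read the file after appending. test_csv.csv itself is unchanged; to
# re-create it from scratch, re-run the DictWriter cell above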
df = pd.read_csv('data/test_csv2.csv', delimiter=",")
print()

print('CSV table appended to and re-read in by Pandas')
print('----------------------------------------')
print(df)

CSV table as read in by Pandas
----------------------------------------
  first_name last_name
0       Jack    Splatt
1       Paul    Bunyon

CSV table appended to and re-read in by Pandas
----------------------------------------
  first_name last_name
0       Jack    Splatt
1       Paul    Bunyon
2       Indy     Jones


### Converting between file types using pandas

Pandas is also good for converting data between file types as both Excel and CSV files are handled similarly.

In [28]:
# Convert CSV file to Excel using the pandas to_excel() method
# (writing .xlsx files needs an Excel engine such as openpyxl installed)
import pandas as pd
df = pd.read_csv('data/test_csv.csv', delimiter=',')
df.to_excel('data/test_excel.xlsx', index=False)

df = pd.read_excel('data/test_excel.xlsx')
df.index.name = 'id'
print(df)

   first_name last_name
id                     
0        Jack    Splatt
1        Paul    Bunyon


## 3. JSON Files
---
JSON (JavaScript Object Notation) is a key/value syntax for storing and exchanging data. JSON data is formatted in a similar fashion to Python data types; however, there are a couple of exceptions:
* true and false are not capitalized in JSON
* None is written as null in JSON

JSON uses a key/value syntax (similar to Python dictionaries). Listed below are the data types and some key rules:
* keys are ALWAYS strings
* values can be strings, numbers, true/false, null, lists, or dictionaries
* note that all strings must use double quotes " " in JSON

Python has a built-in JSON module to handle JSON files. 
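
The rules above can be checked with a short round trip through the json module. The dictionary below is made up for illustration; note how the integer key comes back as a string and how single-quoted JSON is rejected.

```python
import json

# Illustrative values only; the integer key is there to show key conversion
python_value = {'is_awesome': True, 'budget': None, 1997: 'release_year'}

as_json = json.dumps(python_value)
print(as_json)

round_trip = json.loads(as_json)
print(round_trip)

# Single quotes are valid in Python but not in JSON
try:
    json.loads("{'title': 'Tron Legacy'}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
print(single_quotes_ok)
```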

In [29]:
import json

# dir() lists every name in the json module; the four used most often are
# load, loads, dump and dumps
print(dir(json))

['JSONDecodeError', 'JSONDecoder', 'JSONEncoder', '__all__', '__author__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_default_decoder', '_default_encoder', 'codecs', 'decoder', 'detect_encoding', 'dump', 'dumps', 'encoder', 'load', 'loads', 'scanner']


In [31]:
# json.load() reads json file data into a python dictionary
# Note that after the JSON is converted to a dict, true and false become
# Python's capitalized True and False

with open('data/json_data.json', 'r', encoding='utf-8') as json_file:
    movie = json.load(json_file)

print(type(movie))
print()
print(movie)
print()
print(movie['title'])
print()
print(movie['actors'])
print()
print(movie['release_year'])

<class 'dict'>

{'title': 'Gattica', 'release_year': 1997, 'is_awesome': True, 'won_oscar': False, 'actors': ['Ethan Hawke', 'Uma Thurman', 'Alan Arkin', 'Loren Dean'], 'budget': None, 'credits': {'director': 'Andrew Niccol', 'writer': 'Andrew Nicool', 'composer': 'Michale Nyman', 'cinematographer': 'Slawomir Idziak'}}

Gattica

['Ethan Hawke', 'Uma Thurman', 'Alan Arkin', 'Loren Dean']

1997


In [32]:
# JSON data is often consumed through an API rather than a file. API data
# usually arrives as a string
# json.loads() loads JSON from a string into a python dict

sample_api = '''
    {"title" : "Tron Legacy",
     "composer" : "Daft Punk",
     "year" : 2010,
     "budget" : 170000000,
     "actors" : null,
     "won_oscar" : false  
    }
    '''

tron = json.loads(sample_api)
print(type(tron))
print()
print(tron)
print()
print(tron['title'])

<class 'dict'>

{'title': 'Tron Legacy', 'composer': 'Daft Punk', 'year': 2010, 'budget': 170000000, 'actors': None, 'won_oscar': False}

Tron Legacy


In [33]:
# json.dumps() converts a python dictionary into a valid JSON string
# Note that true and false have been converted to lowercase

print(type(json.dumps(movie)))
print()
print(json.dumps(movie))

# By default json.dumps() escapes non-ASCII characters as \uXXXX sequences;
# to keep them as-is use json.dumps(movie, ensure_ascii=False)

<class 'str'>

{"title": "Gattica", "release_year": 1997, "is_awesome": true, "won_oscar": false, "actors": ["Ethan Hawke", "Uma Thurman", "Alan Arkin", "Loren Dean"], "budget": null, "credits": {"director": "Andrew Niccol", "writer": "Andrew Nicool", "composer": "Michale Nyman", "cinematographer": "Slawomir Idziak"}}
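
A brief sketch of two json.dumps() options that come up often: ensure_ascii, which controls whether non-ASCII characters are escaped, and indent, which pretty-prints the result. The one-key dictionary is a made-up example.

```python
import json

# Made-up example containing one non-ASCII character
person = {'cinematographer': 'Janusz Kami\u0144ski'}

escaped = json.dumps(person)                          # default: ASCII only
readable = json.dumps(person, ensure_ascii=False)     # keeps the character
pretty = json.dumps(person, ensure_ascii=False, indent=2)

print(escaped)
print(readable)
print(pretty)
```

Both strings decode back to the same dictionary; the options only change how the text looks on disk or on screen.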


In [34]:
# json.dump() writes a Python object to a file as JSON
# A dictionary is created and then written out as a JSON file
movie2 = {}
movie2["title"] = "Minority Report"
movie2["director"] = "Steven Spielberg"
movie2["composer"] = "John Williams"
movie2["actors"] = ["Tom Cruise", "Colin Farrell", "Samantha Morton"]
movie2["is_awesome"] = True
movie2["cinematographer"] = "Janusz Kami\u0144ski" 

# The with statement closes the file automatically once the dump is done
with open('data/json_data2.json', 'w', encoding='utf-8') as write_file:
    json.dump(movie2, write_file, ensure_ascii=False)
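
A quick way to confirm that a file written with json.dump() is valid is to read it straight back with json.load() and compare. This is a minimal sketch that assumes a scratch file (`round_trip.json`, a placeholder name) rather than the notebook's data folder.

```python
import json
import os
import tempfile

# Hypothetical scratch file so data/ is left untouched
demo_path = os.path.join(tempfile.mkdtemp(), 'round_trip.json')

original = {
    'title': 'Minority Report',
    'actors': ['Tom Cruise', 'Colin Farrell'],
    'is_awesome': True,
    'cinematographer': 'Janusz Kami\u0144ski',
}

# Write with an explicit encoding so non-ASCII text is stored consistently
with open(demo_path, 'w', encoding='utf-8') as f:
    json.dump(original, f, ensure_ascii=False)

# Read back with the same encoding
with open(demo_path, 'r', encoding='utf-8') as f:
    restored = json.load(f)

print(restored == original)
```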

In [35]:
# JSON file load with pandas
import pandas as pd

get_data = pd.read_json('data/flights.json')
print(get_data)

     origin destination  duration
0  New York       Paris       435
1    Moscow       Paris       235
2      Lima    New York       455


In [36]:
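# JSON file creation with pandas
print(type(get_data))
get_data.to_json('data/flights2.json', orient='records')

# Without orient='records', to_json() falls back to its default layout for a
# DataFrame (orient='columns'), which nests values under column name and row
# index instead of writing a list of row objects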

<class 'pandas.core.frame.DataFrame'>
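
To make the difference between the two layouts concrete, the sketch below builds both shapes by hand with the json module. It does not call pandas; it only mimics, as far as I understand the pandas defaults, what orient='records' and the default orient='columns' produce for the first two flights.

```python
import json

# First two rows of the flights table shown above
rows = [
    {'origin': 'New York', 'destination': 'Paris', 'duration': 435},
    {'origin': 'Moscow', 'destination': 'Paris', 'duration': 235},
]

# Comparable to orient='records': a list with one object per row
records_layout = json.dumps(rows)

# Comparable to orient='columns': one object per column,
# keyed by the row index as a string
columns_layout = json.dumps(
    {key: {str(i): row[key] for i, row in enumerate(rows)} for key in rows[0]}
)

print(records_layout)
print(columns_layout)
```

The records layout is usually what other tools and APIs expect, which is why it is requested explicitly.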


## 4. PDF Files
---
Python does not natively handle PDF files, so two external packages are used in this notebook.

* PyPDF2 - extracts text from simple, text-based PDF files as Python strings
* NLTK - a natural language toolkit, used here to tokenize and clean the extracted text

For more on NLTK, see my Natural Language Processing notebook

In [37]:
import PyPDF2

# nltk is used here to clean and convert phrases into keywords
import nltk
from nltk.corpus import stopwords

# Note that 'rb' is used for pdf files as they are in binary format, 
# 'rb' stands for read binary.
mypdf = open('data/pdf1.pdf', mode='rb')

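# Once the file is opened, PdfFileReader() wraps it in a reader object that
# gives access to its pages. (These are the older PyPDF2 names; PyPDF2 3.x
# renamed them to PdfReader, len(reader.pages), reader.pages[n] and
# page.extract_text().)
pdf_text = PyPDF2.PdfFileReader(mypdf)

# The numPages attribute gets the number of pages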
num_pages = pdf_text.numPages
print('Number of pages in pdf:')
print(num_pages)

Number of pages in pdf:
242


In [39]:
# To extract text from a specific page, use getPage(pagenum) and extractText()
# note that extractText() only works when the pdf contains real text (not
# scanned images), and even then the text often comes back oddly formatted,
# e.g. truncated words and lines that run together
page = pdf_text.getPage(12)
pagetxt = page.extractText()
print(pagetxt)
print()

  List of Exhibits
    xi
Exhibit 16.3  Cash Flow from Operating 
 t-Making) Activities in 
Decline Scenario 127Exhibit 17.1  Three Financial Statements and 
Footnotes 134
Exhibit 18.1  External Financial Statements 
of Business (without Footnotes) 146
Exhibit 19.1 External Income Statement for Year 160
Exhibit 19.2  t Report for Year 163

Exhibit 19.3  5% Sales Prices versus 5% Sales 
Volume Increase 
165Exhibit 19.4  EBIT Breakeven Point for Lower 
Sales Prices and Lower Sales Volume 167
fbetw.indd   xifbetw.indd   xi23-11-2013   16:31:19
23-11-2013   16:31:19



In [40]:
# Because the initial read-in text returns timestamps and all kinds of
# non-words it must be cleaned. 

# The nltk package is used to convert the text to a list of tokens in order 
# to clean up the non-word issue.
tokens = nltk.word_tokenize(pagetxt)
print('Initial Tokens, Not Cleaned so a ton of crap included')
print('---------------------------------------------------------------------')
print(tokens)

Initial Tokens, Not Cleaned so a ton of crap included
---------------------------------------------------------------------
['List', 'of', 'Exhibits', 'xi', 'Exhibit', '16.3', 'Cash', 'Flow', 'from', 'Operating', 't-Making', ')', 'Activities', 'in', 'Decline', 'Scenario', '127Exhibit', '17.1', 'Three', 'Financial', 'Statements', 'and', 'Footnotes', '134', 'Exhibit', '18.1', 'External', 'Financial', 'Statements', 'of', 'Business', '(', 'without', 'Footnotes', ')', '146', 'Exhibit', '19.1', 'External', 'Income', 'Statement', 'for', 'Year', '160', 'Exhibit', '19.2', 't', 'Report', 'for', 'Year', '163', 'Exhibit', '19.3', '5', '%', 'Sales', 'Prices', 'versus', '5', '%', 'Sales', 'Volume', 'Increase', '165Exhibit', '19.4', 'EBIT', 'Breakeven', 'Point', 'for', 'Lower', 'Sales', 'Prices', 'and', 'Lower', 'Sales', 'Volume', '167', 'fbetw.indd', 'xifbetw.indd', 'xi23-11-2013', '16:31:19', '23-11-2013', '16:31:19']


In [41]:
# We can see above that nltk split all sections appropriately, so now the
# cleaning begins. Note that in this example I do so using pure python;
# the next section delves into much simpler ways to handle the problem

clean_list = []

for token in tokens:
    # Don't add to the clean list if the last 4 letters are 'indd',
    # or if the token starts or ends with a digit
    if token[-4:] == 'indd' or token[-1].isdigit() or token[0:1].isdigit():
        continue
    # Anything left is treated as a word (or punctuation), so simply append
    clean_list.append(token)

print()
print('Cleaner List, but some punctations and numbers snuck through')
print('---------------------------------------------------------------------')
print(clean_list)

# Note the clean list still has problems even after my initial scrub;
# further cleaning can be done by removing punctuation like:
punctuations = ['(',')',';',':','[',']',',', '%']

# Simple words like 'and', 'the', 'of' can also be removed using nltk stopwords
# note that the english stopwords are just a list of lowercase words
stop_words = stopwords.words('english')

# To further clean the list to remove all the junk, a final list can be 
# generated by eliminating all of the words from punctuations and stop_words
super_clean_list = [word for word in clean_list 
                    if word not in stop_words and word not in punctuations]
print()
print('Super Clean List')
print('---------------------------------------------------------------------')
print(super_clean_list)


Cleaner List, but some punctations and numbers snuck through
---------------------------------------------------------------------
['List', 'of', 'Exhibits', 'xi', 'Exhibit', 'Cash', 'Flow', 'from', 'Operating', 't-Making', ')', 'Activities', 'in', 'Decline', 'Scenario', 'Three', 'Financial', 'Statements', 'and', 'Footnotes', 'Exhibit', 'External', 'Financial', 'Statements', 'of', 'Business', '(', 'without', 'Footnotes', ')', 'Exhibit', 'External', 'Income', 'Statement', 'for', 'Year', 'Exhibit', 't', 'Report', 'for', 'Year', 'Exhibit', '%', 'Sales', 'Prices', 'versus', '%', 'Sales', 'Volume', 'Increase', 'EBIT', 'Breakeven', 'Point', 'for', 'Lower', 'Sales', 'Prices', 'and', 'Lower', 'Sales', 'Volume']

Super Clean List
---------------------------------------------------------------------
['List', 'Exhibits', 'xi', 'Exhibit', 'Cash', 'Flow', 'Operating', 't-Making', 'Activities', 'Decline', 'Scenario', 'Three', 'Financial', 'Statements', 'Footnotes', 'Exhibit', 'External', 'Financial

In [43]:
# Note that using the above approach to turn the text into clean tokens is
# a cumbersome task. With Natural Language Processing libraries such as
# spaCy, cleaning and preparing text for further analysis is easier. spaCy
# documents create tokens automatically behind the scenes, and unnecessary
# text can be stripped beforehand using regular expressions

# Below is the page text pulled from the pdf in its raw form:
import PyPDF2
import nltk
import re
from nltk.corpus import stopwords

mypdf = open('data/pdf1.pdf', mode='rb')
pdf_text = PyPDF2.PdfFileReader(mypdf)
page = pdf_text.getPage(32)
pagetxt = page.extractText()
print(pagetxt)

Thr
 nancial statements
    17
 
Interest Expense:
 The amount of interest on debt (interest-bearing liabilities) for the period.
 nancing charges may also be included, such as loan origination fees.
 
Income Tax Expense:
 The total amount due the government (both federal and state) on the amount of taxable income of the business during the period. Taxable income is multiplied by the 
appropriate tax rates. The income tax expense does not include 

other types of taxes, such as unemployment and Social Security 

taxes on the company’s payroll. These other, nonincome taxes 

are included in operating expenses.c02.indd   17c02.indd   1712-12-2013   15:32:17
12-12-2013   15:32:17


In [44]:
# Cleaning text for processing using regex

# 1. Make all letters lowercase
processed_pdf = pagetxt.lower()
print(processed_pdf)

thr
 nancial statements
    17
 
interest expense:
 the amount of interest on debt (interest-bearing liabilities) for the period.
 nancing charges may also be included, such as loan origination fees.
 
income tax expense:
 the total amount due the government (both federal and state) on the amount of taxable income of the business during the period. taxable income is multiplied by the 
appropriate tax rates. the income tax expense does not include 

other types of taxes, such as unemployment and social security 

taxes on the company’s payroll. these other, nonincome taxes 

are included in operating expenses.c02.indd   17c02.indd   1712-12-2013   15:32:17
12-12-2013   15:32:17


In [45]:
# 2. Remove all chars that aren't letters
processed_pdf = re.sub('[^a-zA-Z]', ' ', processed_pdf)
print(processed_pdf)

thr  nancial statements          interest expense   the amount of interest on debt  interest bearing liabilities  for the period   nancing charges may also be included  such as loan origination fees    income tax expense   the total amount due the government  both federal and state  on the amount of taxable income of the business during the period  taxable income is multiplied by the  appropriate tax rates  the income tax expense does not include   other types of taxes  such as unemployment and social security   taxes on the company s payroll  these other  nonincome taxes   are included in operating expenses c   indd     c   indd                                                


In [46]:
# 3. Collapse each run of whitespace into a single space
processed_pdf = re.sub(r'\s+', ' ', processed_pdf)
print(processed_pdf)

thr nancial statements interest expense the amount of interest on debt interest bearing liabilities for the period nancing charges may also be included such as loan origination fees income tax expense the total amount due the government both federal and state on the amount of taxable income of the business during the period taxable income is multiplied by the appropriate tax rates the income tax expense does not include other types of taxes such as unemployment and social security taxes on the company s payroll these other nonincome taxes are included in operating expenses c indd c indd 
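
The three regex steps above can be bundled into one helper so that every page gets the same treatment. This is only a sketch: the function name `clean_text` is my own, the sample string is a shortened piece of the page text, and the final strip() is an extra step that the cells above do not perform.

```python
import re

def clean_text(raw):
    """Lowercase, keep letters only, and collapse whitespace."""
    lowered = raw.lower()                            # step 1
    letters_only = re.sub(r'[^a-z]', ' ', lowered)   # step 2
    collapsed = re.sub(r'\s+', ' ', letters_only)    # step 3
    return collapsed.strip()                         # extra: trim the ends

# Shortened excerpt of the extracted page text
sample = ('Interest Expense:\n The amount of interest on debt '
          '(interest-bearing liabilities).   17')
cleaned = clean_text(sample)
print(cleaned)
```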


In [47]:
# At this point the above text is readable, but some of the words are 
# incomplete and there is nonsense text like 'indd' and 'c' at the end.
# Such words are useless for any type of keyword/token analysis.
# So to fix this, the python library spacy can be used

import spacy
nlp = spacy.load('en_core_web_sm')

# First a spacy document must be created, doing so automatically creates
# tokens for each word and sentence in the document.

spacy_doc = nlp(processed_pdf)
for token in spacy_doc:
    print(token)

thr
nancial
statements
interest
expense
the
amount
of
interest
on
debt
interest
bearing
liabilities
for
the
period
nancing
charges
may
also
be
included
such
as
loan
origination
fees
income
tax
expense
the
total
amount
due
the
government
both
federal
and
state
on
the
amount
of
taxable
income
of
the
business
during
the
period
taxable
income
is
multiplied
by
the
appropriate
tax
rates
the
income
tax
expense
does
not
include
other
types
of
taxes
such
as
unemployment
and
social
security
taxes
on
the
company
s
payroll
these
other
nonincome
taxes
are
included
in
operating
expenses
c
indd
c
indd


In [48]:
# Notice there are a lot of single letters and nonsense words such as indd,
# as well as stop-words such as 'the', 'of', 'and', etc.
# spacy can filter out the stop-words using the is_stop attribute

for token in spacy_doc:
    if not token.is_stop:
        print(token)

thr
nancial
statements
interest
expense
interest
debt
interest
bearing
liabilities
period
nancing
charges
included
loan
origination
fees
income
tax
expense
total
government
federal
state
taxable
income
business
period
taxable
income
multiplied
appropriate
tax
rates
income
tax
expense
include
types
taxes
unemployment
social
security
taxes
company
s
payroll
nonincome
taxes
included
operating
expenses
c
indd
c
indd


In [36]:
# Now the keyword list is smaller; however, there are still single letters as
# well as two misspelled words at the start of the page. Because the 
# single 'c' and the 'indd' are artifacts of the PDF reader, it is 
# safe to assume that they will be on every page. There are a few ways to
# handle this: remove them manually using loops, with or without regex, or
# add them to the stop-word list in spacy. 

# Here I do it manually, but see my Natural Language Processing workbook in
# this directory for more detail on this process.

In [49]:
# Here the spacy doc is converted back to a Python list and all the single
# chars are removed

scrubbed_tokens = [token.text for token in spacy_doc if len(token)!=1]
print(scrubbed_tokens)

['thr', 'nancial', 'statements', 'interest', 'expense', 'the', 'amount', 'of', 'interest', 'on', 'debt', 'interest', 'bearing', 'liabilities', 'for', 'the', 'period', 'nancing', 'charges', 'may', 'also', 'be', 'included', 'such', 'as', 'loan', 'origination', 'fees', 'income', 'tax', 'expense', 'the', 'total', 'amount', 'due', 'the', 'government', 'both', 'federal', 'and', 'state', 'on', 'the', 'amount', 'of', 'taxable', 'income', 'of', 'the', 'business', 'during', 'the', 'period', 'taxable', 'income', 'is', 'multiplied', 'by', 'the', 'appropriate', 'tax', 'rates', 'the', 'income', 'tax', 'expense', 'does', 'not', 'include', 'other', 'types', 'of', 'taxes', 'such', 'as', 'unemployment', 'and', 'social', 'security', 'taxes', 'on', 'the', 'company', 'payroll', 'these', 'other', 'nonincome', 'taxes', 'are', 'included', 'in', 'operating', 'expenses', 'indd', 'indd']


In [50]:
# Below the two misspelled words at the start of the list are removed
# along with the 2 'indd' at the end. 

# pop() could be used here to remove the last two items, and remove() could
# be used for the first two misspelled words. However, because these may not
# always appear at the start and end, it is a good idea to weed them out

# Once again there are a couple of ways to go about this. From quick searches
# I found an nltk solution, but there is a library called pyenchant that 
# looked interesting. 

# Here I use nltk
import nltk
from nltk.corpus import words
#print('financial' in words.words())

# Before removing the words I want to point out that for word count analysis
# the first two words are just incomplete and in reality would actually
# need to be included
# TODO SEE IF THAT pyenchant library WILL WORK FOR THIS, TO RESPELL THEM?

# Here I just use the nltk words corpus to filter them out. words.words()
# returns a large list, so it is converted to a set once to keep the
# word-by-word check fast. Note that the corpus also lacks many inflected
# forms, so valid words such as 'liabilities' and 'taxes' are dropped too
english_vocab = set(words.words())
final_list = [token for token in scrubbed_tokens if token in english_vocab]
print(final_list)

['interest', 'expense', 'the', 'amount', 'of', 'interest', 'on', 'debt', 'interest', 'bearing', 'for', 'the', 'period', 'may', 'also', 'be', 'included', 'such', 'as', 'loan', 'origination', 'income', 'tax', 'expense', 'the', 'total', 'amount', 'due', 'the', 'government', 'both', 'federal', 'and', 'state', 'on', 'the', 'amount', 'of', 'taxable', 'income', 'of', 'the', 'business', 'during', 'the', 'period', 'taxable', 'income', 'is', 'by', 'the', 'appropriate', 'tax', 'the', 'income', 'tax', 'expense', 'does', 'not', 'include', 'other', 'of', 'such', 'as', 'unemployment', 'and', 'social', 'security', 'on', 'the', 'company', 'payroll', 'these', 'other', 'are', 'included', 'in', 'operating']


**In conclusion, the quickest way to remove "known" recurring unwanted tokens such as 'indd' above is to add them to a list of stopwords, either self-created or from a library like spacy.**
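
A minimal sketch of the self-created stopword approach, in plain Python. The token list is a short excerpt of the tokens seen earlier, and the contents of `custom_stop_words` are an illustrative guess at what a project-specific list might hold.

```python
# Short excerpt of the tokens produced earlier in this section
tokens = ['interest', 'expense', 'the', 'amount', 'c', 'indd', 'c', 'indd']

# Hypothetical project-specific stop list: PDF debris plus a few common words
custom_stop_words = {'indd', 'c', 's', 'the', 'of', 'and'}

# A set makes each membership check fast, even for long token lists
keywords = [token for token in tokens if token not in custom_stop_words]
print(keywords)
```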

## TODO: Find the best way to detect misspelled words so that the first two words pictured below are not removed

<img src="img/pdf_data_almost_clean.png">