## Text Preprocessing using Python

1. f-strings (Format strings)
2. Writing and Reading text files
3. Writing and Reading PDF files
4. Regular Expressions

### f-strings (Format strings)

In [2]:
name = 'Fred'

# Old method
print('His name is {var}.'.format(var=name))

# New method:
print(f'His name is {name}.')

His name is Fred.
His name is Fred.


In [5]:
library = [('Author', 'Topic', 'Pages'), ('Twain', 'Rafting', 601), ('Feynman', 'Physics', 95), ('Hamilton', 'Mythology', 144)]

for book in library:
    print(f'{book[0]} {book[1]} {book[2]}')

Author Topic Pages
Twain Rafting 601
Feynman Physics 95
Hamilton Mythology 144


In [6]:
for book in library:
    print(f'{book[0]:{10}} {book[1]:{8}} {book[2]:{7}}')

Author     Topic    Pages  
Twain      Rafting      601
Feynman    Physics       95
Hamilton   Mythology     144


In [7]:
for book in library:
    print(f'{book[0]:{10}} {book[1]:{10}} {book[2]:.>{7}}') # '.>' right-aligns the value and pads it with dots

Author     Topic      ..Pages
Twain      Rafting    ....601
Feynman    Physics    .....95
Hamilton   Mythology  ....144
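
The part after the colon in each replacement field follows the pattern `[fill][align][width]`. The cell below is a minimal sketch of the three alignment options, using one row from `library`; the variable names `row`, `left`, `right` and `center` are only illustrative.

```python
# Format spec inside the braces: [fill][align][width]
# '<' left-aligns, '>' right-aligns, '^' centers; the fill character is optional.
row = ('Feynman', 'Physics', 95)

left = f'{row[0]:<{10}}|'      # pad on the right with spaces
right = f'{row[2]:.>{7}}|'     # pad on the left with dots
center = f'{row[1]:*^{11}}|'   # pad on both sides with asterisks

print(left)
print(right)
print(center)
```

The trailing `|` is there only to make the padding visible when printed.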


### Writing and Reading text files

In [8]:
%%writefile test.txt
Hello, Welcome everyone.
This is the first session in Learn and Share series.

Writing test.txt


In [15]:
myfile = open('./test.txt')
print(myfile.read())

Hello, Welcome everyone.
This is the first session in Learn and Share series.



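The file object keeps track of a cursor position. After `read()` has returned the whole file, the cursor sits at the end, so a second `myfile.read()` returns an empty string until the cursor is moved back with `seek()`. Try running `myfile.read()` again and check.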

In [18]:
myfile.read()

''

In [17]:
# Seek to the start of file (index 0)
myfile.seek(0)
myfile.read()

'Hello, Welcome everyone.\nThis is the first session in Learn and Share series.\n'
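
The cursor behaviour can be inspected directly with `tell()`. The sketch below writes its own throwaway file (the name `cursor_demo.txt` is an assumption for illustration, not part of the session files) so it can run on its own.

```python
import os
import tempfile

# Write a small throwaway file so the sketch does not depend on test.txt
path = os.path.join(tempfile.mkdtemp(), 'cursor_demo.txt')
with open(path, 'w') as f:
    f.write('Hello\nWorld\n')

with open(path) as f:
    start = f.tell()     # 0: the cursor starts at the beginning
    first = f.read()     # reads everything and moves the cursor to the end
    second = f.read()    # nothing is left to read, so this is ''
    f.seek(0)            # move the cursor back to the start
    third = f.read()     # the full content is available again

print(start, repr(first), repr(second), repr(third))
```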

In [21]:
myfile.seek(0)
myfile.readlines()

['Hello, Welcome everyone.\n',
 'This is the first session in Learn and Share series.\n']

In [22]:
myfile.close()

In [31]:
myfile2 = open('test2.txt','w+')

In [32]:
myfile2.write('This is a new first line')

24

In [33]:
myfile2.write('\nThis is a new second line')

26

In [34]:
myfile2.seek(0)
myfile2.read()

'This is a new first line\nThis is a new second line'

In [36]:
myfile2.close() # You should close the file after usage

We can use the `with` statement in Python, which automatically closes the file once the block finishes.

In [37]:
with open('test.txt', 'r') as f:
    txt = f.read()
    print(txt)

Hello, Welcome everyone.
This is the first session in Learn and Share series.



`r` stands for reading, `w` for writing, and `a` for appending. If you open a file in append mode and it doesn't exist, Python creates a new file and writes the data to it.
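
As a rough illustration of how these modes differ, the sketch below appends to a file that does not exist yet, appends again, and then reopens it with `w`. The temporary path and the name `modes_demo.txt` are assumptions made so the example does not touch the files used above.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

# 'a' creates the file if it does not exist yet
with open(path, 'a') as f:
    f.write('first line\n')

# 'a' again: new text goes to the end, existing content is kept
with open(path, 'a') as f:
    f.write('second line\n')

with open(path, 'r') as f:
    after_append = f.read()

# 'w' truncates the file before writing
with open(path, 'w') as f:
    f.write('overwritten\n')

with open(path, 'r') as f:
    after_write = f.read()

print(repr(after_append))
print(repr(after_write))
```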

### Writing and Reading PDF files

In [41]:
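# Note: these cells use the legacy PyPDF2 API, which was removed in PyPDF2 3.0.
# Newer releases use PdfReader, len(reader.pages), reader.pages[0] and extract_text().
import PyPDF2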

f = open('./US_Declaration.pdf','rb')

In [42]:
pdf_reader = PyPDF2.PdfFileReader(f)
pdf_reader.numPages

5

In [43]:
page_one = pdf_reader.getPage(0)

In [46]:
page_one_txt = page_one.extractText()

In [47]:
page_one_txt

"Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the\npolitical bands which have connected them with another, and to assume among the powers of the\nearth, the separate and equal station to which the Laws of Nature and of Nature's God entitle\n\nthem, a decent respect to the opinions of mankind requires that they should declare the causes\n\nwhich impel them to the separation. \nWe hold these truths to be self-evident, that all men are created equal, that they are endowed by\n\ntheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving\n\ntheir just powers from the consent of the governed,ŠThat whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter or 

In [48]:
f.close()

In [49]:
pdf_writer = PyPDF2.PdfFileWriter() # Start pdf writer

In [50]:
pdf_writer.addPage(page_one) # Add pages

In [51]:
pdf_output = open("Some_New_Doc.pdf","wb") # Open a new pdf file

In [52]:
pdf_writer.write(pdf_output) # Push added pages to new file

In [53]:
pdf_output.close() # and lastly close the PDF file

### Regular Expressions

In [54]:
import re

In [92]:
text = "introducing regular expressions in this section. Repeat regular twice."

In [93]:
match = re.search("regular",text)

In [94]:
match.span()

(12, 19)

In [95]:
match.group()

'regular'

In [96]:
match = re.findall("regular",text)

In [97]:
match

['regular', 'regular']

In [100]:
text = 'this is my contact number 99999 99999'

In [106]:
match = re.search(r'\d\d\d\d\d \d\d\d\d\d', text)

In [107]:
match.group()

'99999 99999'

In [114]:
text = 'this is my employee id: EMPN01'

In [115]:
match = re.search(r'EMP\w\w\w', text)

In [116]:
match.group()

'EMPN01'

#### Identifiers: 

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (including underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
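
To check the rows of the table above, the following sketch runs each example pattern against its example match with `re.fullmatch`. The `examples` list and the `results` dictionary are illustrative names, not part of the original session.

```python
import re

# (pattern, sample text) pairs taken from the identifiers table
examples = [
    (r'file_\d\d', 'file_25'),
    (r'\w-\w\w\w', 'A-b_1'),      # \w also matches the underscore
    (r'a\sb\sc', 'a b c'),
    (r'\D\D\D', 'ABC'),
    (r'\W\W\W\W\W', '*-+=)'),
    (r'\S\S\S\S', 'Yoyo'),
]

results = {}
for pattern, sample in examples:
    # fullmatch succeeds only if the whole sample matches the pattern
    matched = re.fullmatch(pattern, sample) is not None
    results[pattern] = matched
    print(f'{pattern:{12}} {sample:{8}} {matched}')
```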

In [118]:
text = 'this is my contact number 99999 99999'

In [119]:
match = re.search(r'\d{5} \d{5}', text)

In [120]:
match.group()

'99999 99999'

#### Quantifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
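
The quantifier rows can be checked the same way. This is a small sketch under the assumption that each example match is meant to match its pattern in full; the last lines also show that quantifiers are greedy by default, taking as many characters as the upper bound allows.

```python
import re

# (pattern, sample text) pairs taken from the quantifiers table
examples = [
    (r'Version \w-\w+', 'Version A-b1_1'),
    (r'\D{3}', 'abc'),
    (r'\d{2,4}', '123'),
    (r'\w{3,}', 'anycharacters'),
    (r'A*B*C*', 'AAACC'),          # B* matches zero B's here
    (r'plurals?', 'plural'),
]

quantifier_results = [re.fullmatch(p, s) is not None for p, s in examples]
print(quantifier_results)

# Greedy matching: {2,4} takes four digits when four are available
greedy = re.search(r'\d{2,4}', '1234567').group()
print(greedy)
```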

**Groups:**

In [122]:
text = 'this is my contact number +91 - 99999 99999'

In [131]:
pattern = re.compile(r'(\W\d{2}) - (\d{5}) (\d{5})')
match = re.search(pattern, text)

In [132]:
match.group()

'+91 - 99999 99999'

In [133]:
match.group(1)

'+91'

In [135]:
match.group(2)

'99999'

In [136]:
match.group(3)

'99999'

**Or Operator:**

In [140]:
text = "introducing regular expressions in this section. Repeat Regular twice."

In [141]:
match = re.search(r'regular|Regular', text)

In [142]:
match.group()

'regular'

In [143]:
match = re.findall(r'regular|Regular', text)

In [144]:
match

['regular', 'Regular']

**Wildcard character:**

In [145]:
re.findall(r".at","The cat in the hat sat here.")

['cat', 'hat', 'sat']

In [146]:
re.findall(r".at","The cat in the hat saturday here.")

['cat', 'hat', 'sat']

In [147]:
re.findall(r"......at","The cat in the hat saturday here.")

[' the hat']

In [154]:
re.findall(r"\S+at","The cat in the hat saturday here.")

['cat', 'hat', 'sat']

In [153]:
re.findall(r"\w+at","The cat in the hat saturday here 3q324at")

['cat', 'hat', 'sat', '3q324at']

**Starts with and Ends with:**

In [159]:
re.findall(r'\d$','This ends with a 123number kdhfadsf2')

['2']

In [163]:
re.findall(r'^\d','1 is the loneliest number.')

['1']

### Tokenization

In [164]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

In [165]:
# Create a string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


In [166]:
# Create a Doc object and explore tokens
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ')

" | We | 're | moving | to | L.A. | ! | " | 

In [167]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


In [168]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')

for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


### Named Entities

In [173]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:
    print(token.text, end=' | ')

print('\n----')

for ent in doc8.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 
----
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


### Noun Chunks

In [174]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [175]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
    print(chunk.text)

Red cars
higher insurance rates


### Visualization

In [176]:
from spacy import displacy

doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

In [177]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)

### Stemming

In [187]:
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
s_stemmer = SnowballStemmer(language='english')

words = ['run','runner','running','ran','runs','easily','fairly']

print('Porter Stemmer...')
print('==========================')
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

print('\nSnowball Stemmer...')
print('==========================')
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

Porter Stemmer...
run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli

Snowball Stemmer...
run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair


### Lemmatization

In [188]:
def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

In [189]:
doc2 = nlp(u"I saw eighteen mice today!")
show_lemmas(doc2)

I            PRON   561228191312463089     -PRON-
saw          VERB   11925638236994514241   see
eighteen     NUM    9609336664675087640    eighteen
mice         NOUN   1384165645700560590    mouse
today        NOUN   11042482332948150395   today
!            PUNCT  17494803046312582752   !


### Feature Extraction from Text

#### Count Vectorizer

In [190]:
%%writefile 1.txt
This is a story about cats
our feline pets
Cats are furry animals

Writing 1.txt


In [191]:
%%writefile 2.txt
This story is about surfing
Catching waves is fun
Surfing is a popular water sport

Writing 2.txt


In [192]:
vocab = {}
i = 1

with open('1.txt') as f:
    x = f.read().lower().split()

for word in x:
    if word in vocab:
        continue
    else:
        vocab[word]=i
        i+=1

print(vocab)

{'this': 1, 'is': 2, 'a': 3, 'story': 4, 'about': 5, 'cats': 6, 'our': 7, 'feline': 8, 'pets': 9, 'are': 10, 'furry': 11, 'animals': 12}


In [193]:
with open('2.txt') as f:
    x = f.read().lower().split()

for word in x:
    if word in vocab:
        continue
    else:
        vocab[word]=i
        i+=1

print(vocab)

{'this': 1, 'is': 2, 'a': 3, 'story': 4, 'about': 5, 'cats': 6, 'our': 7, 'feline': 8, 'pets': 9, 'are': 10, 'furry': 11, 'animals': 12, 'surfing': 13, 'catching': 14, 'waves': 15, 'fun': 16, 'popular': 17, 'water': 18, 'sport': 19}


In [194]:
# Create an empty vector with space for each word in the vocabulary:
one = ['1.txt']+[0]*len(vocab)
one

['1.txt', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [195]:
# map the frequencies of each word in 1.txt to our vector:
with open('1.txt') as f:
    x = f.read().lower().split()
    
for word in x:
    one[vocab[word]]+=1
    
one

['1.txt', 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

In [196]:
# Do the same for the second document:
two = ['2.txt']+[0]*len(vocab)

with open('2.txt') as f:
    x = f.read().lower().split()
    
for word in x:
    two[vocab[word]]+=1

In [197]:
# Compare the two vectors:
print(f'{one}\n{two}')

['1.txt', 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
['2.txt', 1, 3, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 1]
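
The same count vectors can be built more compactly with `collections.Counter`. The sketch below is one possible shortcut, not the method used in the cells above: it keeps the two documents as inline strings (`doc_one`, `doc_two`) instead of reading `1.txt` and `2.txt`, and the helper names `tokenize` and `count_vector` are made up for illustration.

```python
from collections import Counter

# Inline copies of 1.txt and 2.txt so the sketch is self-contained
doc_one = 'This is a story about cats\nour feline pets\nCats are furry animals'
doc_two = ('This story is about surfing\nCatching waves is fun\n'
           'Surfing is a popular water sport')

def tokenize(text):
    """Lowercase the text and split on whitespace, as in the cells above."""
    return text.lower().split()

# Vocabulary in order of first appearance (dict keys keep insertion order)
vocab_words = list(dict.fromkeys(tokenize(doc_one) + tokenize(doc_two)))

def count_vector(text):
    """Return one count per vocabulary word; missing words count as 0."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab_words]

vec_one = count_vector(doc_one)
vec_two = count_vector(doc_two)
print(vec_one)
print(vec_two)
```

Apart from the leading file name, the two lists should line up with `one` and `two` from the manual approach.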


#### Tfidf Vectorizer

$$\mathrm{Tf}(w, d) = \text{number of times word } w \text{ appears in document } d$$

$$\mathrm{IDF}(w) = \log \frac{\text{total number of documents}}{\text{number of documents containing word } w}$$

$$\mathrm{Tfidf}(w, d) = \mathrm{Tf}(w, d) \times \mathrm{IDF}(w)$$
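
These three formulas can be applied directly to the two small documents used for the count vectors. The sketch below is a plain-Python illustration under a few assumptions: the logarithm is the natural log (the formulas do not fix a base), the documents are kept as inline strings, and the function names `tf`, `idf` and `tfidf` are chosen only to mirror the formulas. Library implementations often use smoothed variants of IDF, so their numbers may differ.

```python
import math
from collections import Counter

# Inline copies of 1.txt and 2.txt so the sketch is self-contained
docs = {
    '1.txt': 'This is a story about cats\nour feline pets\nCats are furry animals',
    '2.txt': ('This story is about surfing\nCatching waves is fun\n'
              'Surfing is a popular water sport'),
}
tokens = {name: text.lower().split() for name, text in docs.items()}

def tf(word, name):
    """Number of times `word` appears in document `name`."""
    return Counter(tokens[name])[word]

def idf(word):
    """log(total documents / documents containing `word`).

    Assumes the word occurs in at least one document.
    """
    containing = sum(1 for words in tokens.values() if word in words)
    return math.log(len(tokens) / containing)

def tfidf(word, name):
    return tf(word, name) * idf(word)

for word in ['cats', 'is', 'surfing']:
    for name in docs:
        print(f'{word:{8}} {name:{6}} {tfidf(word, name):.3f}')
```

A word such as `is`, which occurs in every document, gets an IDF of `log(1) = 0` and therefore a Tfidf of zero, while `cats` and `surfing`, which each occur in only one document, keep a positive weight there.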