### Working with Text Files
In this section we will cover:
* Working with f-strings (formatted string literals) to format printed text
* Working with files: opening, reading, writing, and appending text files

In [1]:
name = 'S'
print('His name is {var}.'.format(var=name))
print(f'His name is {name!a}.')

His name is S.
His name is 'S'.
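
The `!a` conversion in the cell above applies `ascii()` to the value, so for a plain ASCII name it looks the same as `!r`. A minimal sketch with a hypothetical non-ASCII name (`'José'` is an illustrative value, not from the cells above) shows where the two differ:

```python
name = 'José'  # hypothetical example value, not from the original cells

as_str = f'{name!s}'    # same as str(name)
as_repr = f'{name!r}'   # same as repr(name): adds quotes
as_ascii = f'{name!a}'  # same as ascii(name): adds quotes and escapes non-ASCII characters

print(as_ascii)             # 'Jos\xe9'
print(as_repr == as_ascii)  # False
```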


In [2]:
d = {'a':123, 'b':456}

print(f'Address : {d["a"]}')

Address : 123


In [3]:
library = [('Author', 'Topic', 'Pages'),
           ('Twain', 'Rafting', 601),
           ('Feynman', 'Physics', 95),
           ('Hamilton', 'Mythology', 144)]

for book in library:
    print(f'{book[0]:{10}} {book[1]:{10}} {book[2]:{5}}')

Author     Topic      Pages
Twain      Rafting      601
Feynman    Physics       95
Hamilton   Mythology    144


In [8]:
for book in library:
    print(f"{book[0]:10} {book[1]:10} {book[2]:.>7} ")

Author     Topic      ..Pages 
Twain      Rafting    ....601 
Feynman    Physics    .....95 
Hamilton   Mythology  ....144 
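
The format spec after the colon reads as `[fill][align][width]`, so `.>7` means "pad with dots, right-align, width 7". As a rough sketch, the same layout can be wrapped in a small helper; `format_row` and its `fill` parameter are hypothetical names introduced only for this illustration:

```python
rows = [('Twain', 'Rafting', 601),
        ('Feynman', 'Physics', 95)]

def format_row(author, topic, pages, fill='.'):
    # '<10' left-aligns in 10 characters; '{fill}>7' right-aligns in 7, padded with `fill`
    return f'{author:<10} {topic:<10} {pages:{fill}>7}'

lines = [format_row(*row) for row in rows]
for line in lines:
    print(line)
```

Nesting `{fill}` inside the spec works because replacement fields are evaluated before the spec is applied, which is also why `{book[0]:{10}}` above behaves like `{book[0]:10}`.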


### Files

In [9]:
%%writefile helloworld.txt
Hello, this is a quick test file.
This is the second line of file.

Writing helloworld.txt


In [10]:
%%writefile -a helloworld.txt
Hello, this is a quick test file.
This is the second line of file.
Hello, this is a quick test file.
This is the second line of file.
Hello, this is a quick test file.
This is the second line of file.

Appending to helloworld.txt


In [11]:
myfile = open('helloworld.txt')

In [12]:
myfile.read()

'Hello, this is a quick test file.\nThis is the second line of file.\nHello, this is a quick test file.\nThis is the second line of file.\nHello, this is a quick test file.\nThis is the second line of file.\nHello, this is a quick test file.\nThis is the second line of file.\n'

In [19]:
myfile

<_io.TextIOWrapper name='helloworld.txt' mode='r' encoding='cp1252'>

In [20]:
print(myfile.read())




In [21]:
myfile.seek(0)
print(myfile.read())

Hello, this is a quick test file.
This is the second line of file.
Hello, this is a quick test file.
This is the second line of file.
Hello, this is a quick test file.
This is the second line of file.
Hello, this is a quick test file.
This is the second line of file.



In [31]:
myfile.seek(0)
myfile.readlines()

['Hello, this is a quick test file.\n',
 'This is the second line of file.\n',
 'Hello, this is a quick test file.\n',
 'This is the second line of file.\n',
 'Hello, this is a quick test file.\n',
 'This is the second line of file.\n',
 'Hello, this is a quick test file.\n',
 'This is the second line of file.\n']

In [32]:
myfile.close()
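
The empty output of the second `read()` above comes from the file cursor: after a full read it sits at the end of the file, and it stays there until `seek(0)` moves it back. A minimal, self-contained sketch of that behaviour, assuming a throwaway file (`demo.txt` is a hypothetical name created in a temporary directory):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name

    with open(path, 'w', encoding='utf-8') as handle:
        handle.write('first line\nsecond line\n')

    with open(path, encoding='utf-8') as handle:
        first_read = handle.read()    # reads everything, cursor moves to the end
        second_read = handle.read()   # nothing left to read, so this is ''
        handle.seek(0)                # move the cursor back to the start
        lines = handle.readlines()    # one string per line, newlines included

print(repr(second_read))
print(lines)
```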

In [34]:
myfile = open('test.txt', 'w+')

In [35]:
myfile.write('This is a new First Line.')

25

In [37]:
myfile.seek(0)
myfile.read()

'This is a new First Line.'

In [1]:
myfile = open('test.txt', 'a+')
myfile.write("\nThis line is being appended to test.txt")
myfile.write("\nAnd another line here.")

23

In [2]:
myfile.seek(0)
print(myfile.read())

This is a new First Line.
This line is being appended to test.txt
And another line here.


In [3]:
myfile.close()
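
The cells above rely on two modes: `'w+'` creates or truncates the file before writing, while `'a+'` keeps the existing content and writes at the end. A small sketch of the difference, assuming a temporary file (`modes.txt` is a hypothetical name):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes.txt')  # hypothetical file name

    with open(path, 'w+', encoding='utf-8') as handle:
        handle.write('first')

    # 'a+' preserves what is already there and appends
    with open(path, 'a+', encoding='utf-8') as handle:
        handle.write('\nsecond')
        handle.seek(0)
        after_append = handle.read()

    # 'w+' empties the file as soon as it is opened
    with open(path, 'w+', encoding='utf-8') as handle:
        after_truncate = handle.read()

print(repr(after_append))
print(repr(after_truncate))
```

Using `with` here also closes each handle automatically, so no explicit `close()` call is needed.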

In [11]:
with open('test.txt', 'r') as txt:
    first_single_line = txt.readlines()[0]

print(first_single_line)

This is a new First Line.



In [13]:
with open('test.txt', 'r') as txt:
    for line in txt:
        print(line)

This is a new First Line.

This line is being appended to test.txt

And another line here.
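
The blank lines in the output above appear because each line read from the file still ends in `'\n'` and `print` adds a newline of its own. One way to avoid that, sketched here with `io.StringIO` standing in for the real file (an assumption made only to keep the example self-contained):

```python
import io

# io.StringIO behaves like an open text file for iteration purposes
fake_file = io.StringIO('This is a new First Line.\n'
                        'This line is being appended to test.txt\n'
                        'And another line here.')

cleaned = [line.rstrip('\n') for line in fake_file]

for line in cleaned:
    print(line)
```

Passing `end=''` to `print` would be another option when the trailing newlines should be kept in the data.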


### Working with PDF Files

In [13]:
import PyPDF2 as pdf2

In [14]:
f = open('US_Declaration.pdf', 'rb')

In [17]:
pdf_reader = pdf2.PdfReader(f)

In [21]:
len(pdf_reader.pages)

5

In [23]:
page_one = pdf_reader.pages[0].extract_text()

In [24]:
page_one

"Declaration of Independence\nIN CONGRESS, July 4, 1776.  \nThe unanimous Declaration of the thirteen united States of America,  \nWhen in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.— \x14That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,—  \x14That whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter or to 

# Reading all pages together

In [28]:
for page in range(len(pdf_reader.pages)):
    print(f"Page # {page}")
    print(pdf_reader.pages[page].extract_text())

Page # 0
Declaration of Independence
IN CONGRESS, July 4, 1776.  
The unanimous Declaration of the thirteen united States of America,  
When in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit
of Happiness.— That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,—  That whenever any Form of Government
becomes destructive of these ends, it is the Right of the People to alter or to abo

In [29]:
f.close()

In [30]:
f = open('US_Declaration.pdf', 'rb')
pdf_reader = pdf2.PdfReader(f)

In [31]:
first_page = pdf_reader.pages[0]

In [32]:
pdf_writer = pdf2.PdfWriter()

In [33]:
pdf_writer.add_page(first_page)

{'/Type': '/Page',
 '/Contents': {},
 '/MediaBox': [0, 0, 612, 792],
 '/Resources': {'/Font': {'/F9': {'/Type': '/Font',
    '/Subtype': '/Type1',
    '/Name': '/F9',
    '/Encoding': '/WinAnsiEncoding',
    '/FirstChar': 31,
    '/LastChar': 255,
    '/Widths': [778,
     250,
     333,
     ...

In [39]:
newpdf = open('Some_New_Doc.pdf', 'wb')

In [40]:
pdf_writer.write(newpdf)

(False, <_io.BufferedWriter name='Some_New_Doc.pdf'>)

In [41]:
newpdf.close()
f.close()

In [50]:
# Index 0 holds a placeholder so that pdf_text[n] is the text of page n (1-based)
pdf_text = [0]

with open("US_Declaration.pdf", 'rb') as f:
    pdf_reader = pdf2.PdfReader(f)
    for page_index in range(len(pdf_reader.pages)):
        pdf_text.append(pdf_reader.pages[page_index].extract_text())


In [57]:
pdf_text

[0,
 "Declaration of Independence\nIN CONGRESS, July 4, 1776.  \nThe unanimous Declaration of the thirteen united States of America,  \nWhen in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.— \x14That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,—  \x14That whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter o

In [59]:
print(pdf_text[2])

He has dissolved Re presentative Ho uses repeatedly , for opposing wit h manly
firmness his invasions on the rights of the people.
He has refused for a long time, after such dissolutions, to cause others to be
elected; whereby the Leg islative powers, incapable of Annihilation, have returned
to the People at lar ge for their exe rcise; the State r emaining in the me an time
exposed to all the dangers of invasion from without, and convulsions within.
He has endeavou red to prevent the  population of these  States; for that pur pose
obstructing the L aws for Natural ization of Foreig ners; refusing  to pass others to
encourage their migrations hither, and raising the conditions of new
Appropriations of  Lands.
He has obstructed the Administration of Justice, by refusing his Assent to Laws
for establishing  Judiciary pow ers.
He has made Judge s dependent on his Wil l alone, for the te nure of their off ices,
and the amount and  payment of t heir salaries.
He has erected  a multitude of N

# Regular Expressions

In [4]:
import re
text = "108-333-3241 The agent's phone number is 408-555-1234. Call soon!"

In [2]:
'phone' in text

True

In [5]:
pattern = 'phone'
re.search(pattern,text)


<re.Match object; span=(25, 30), match='phone'>

In [6]:
text[25:30]

'phone'

In [7]:
pattern = 'NOT IN TEXT'

In [8]:
re.search(pattern, text)
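
The cell above prints nothing because `re.search` returns `None` when the pattern is not found. Calling `.span()` on that result would raise an `AttributeError`, so it is usually worth checking first. A minimal sketch (`find_span` is a hypothetical helper name):

```python
import re

text = "108-333-3241 The agent's phone number is 408-555-1234. Call soon!"

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

print(find_span('phone', text))        # (25, 30)
print(find_span('NOT IN TEXT', text))  # None
```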

In [11]:
pattern = 'phone'
match = re.search(pattern, text)
print(match.span())
print(match.start())
print(match.end())

(25, 30)
25
30


In [13]:
# But what if the pattern occurs more than once?
txt = 'my phone is a new phone'
match = re.search(pattern, txt)
print(match.span())

(3, 8)


In [16]:
matches = re.findall(pattern, txt)
matches

['phone', 'phone']

In [17]:
# To get actual match objects, use the iterator
for match in re.finditer(pattern, txt):
    print(match.span())

(3, 8)
(18, 23)


In [19]:
text = "108-333-3241 The agent's phone number is 408-555-1234. Call soon!"

In [21]:
pattern = r'\d\d\d-\d\d\d-\d\d\d\d'
for match in re.finditer(pattern, text):
    print(match.span())

(0, 12)
(41, 53)


In [22]:
match.group()

'408-555-1234'
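
`match.group()` above only shows the last phone number because `match` is whatever the loop variable held when the loop finished. A sketch that keeps every match together with its span, assuming a compiled version of the same pattern (`phone_pattern` is a name introduced here):

```python
import re

text = "108-333-3241 The agent's phone number is 408-555-1234. Call soon!"
phone_pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

# Collect the matched text and its position for every occurrence
found = [(m.group(), m.span()) for m in phone_pattern.finditer(text)]

for number, span in found:
    print(number, span)
```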

### Patterns

In [23]:
text2 = 'My telephone number is 408-555-1234'
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text2)
phone.group()

'408-555-1234'

In [24]:
# Let's rewrite the pattern using quantifiers
text2 = 'My telephone number is 408-555-1234'
pattern = r'\d{3}-\d{3}-\d{4}'
phone = re.search(pattern, text2)
phone.group()

'408-555-1234'

In [37]:
# Groups
text2 = 'My telephone number is 408-555-1234'
pattern = r'(\d{3})-(\d{3})-(\d{4})'  # a plain raw string, so type(pattern) is str
phone = re.search(pattern, text2)
print(phone.group())
print(phone.group(1))
print(phone.group(2))
print(phone.group(3))

print(type(pattern))

408-555-1234
408
555
1234
<class 'str'>


In [38]:
# Groups, this time with a compiled pattern
text2 = 'My telephone number is 408-555-1234'
pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')  # re.compile returns a re.Pattern object
phone = re.search(pattern, text2)
print(phone.group())
print(phone.group(1))
print(phone.group(2))
print(phone.group(3))

print(type(pattern))

408-555-1234
408
555
1234
<class 're.Pattern'>


In [39]:
# Raises IndexError because the pattern defines only three groups.
phone.group(4)

IndexError: no such group
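
Numbered groups are easy to get wrong, as the `IndexError` above shows. A possible alternative is to name the groups with `(?P<name>...)`; the names `area`, `exchange`, and `line` below are illustrative choices, not part of the original cells:

```python
import re

text2 = 'My telephone number is 408-555-1234'
pattern = re.compile(r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})')

phone = pattern.search(text2)

all_groups = phone.groups()   # every captured group as a tuple
parts = phone.groupdict()     # the same values keyed by group name

print(all_groups)
print(parts['area'])
print(pattern.groups)         # number of capturing groups in the pattern
```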

### Additional Regex Syntax
### Or operator |

In [46]:
re.search(r'man|woman', 'This man was here.')

<re.Match object; span=(5, 8), match='man'>

In [47]:
re.search(r'man|woman', 'This woman was here.')

<re.Match object; span=(5, 10), match='woman'>

In [3]:
# The Wildcard Character
re.findall(r'.at', 'The cat sat on the mat.')

['cat', 'sat', 'mat']

In [5]:
re.findall(r'.at', 'The bat went splat')

['bat', 'lat']

In [6]:
re.findall(r'...at', 'The bat went splat')

['e bat', 'splat']

In [None]:
re.findall(r'\S+at', 'The bat went splat i am at')

['bat', 'splat']
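
`\S+at` skips the standalone word `at` because `\S+` needs at least one character in front of it. If the goal is every whole word ending in "at" (an assumption about the intent), word boundaries with `\w*` are one way to sketch it:

```python
import re

sentence = 'The bat went splat i am at'

# \S+ requires one or more non-space characters before 'at'
needs_prefix = re.findall(r'\S+at', sentence)

# \w* allows zero characters before 'at'; \b anchors the match to whole words
whole_words = re.findall(r'\b\w*at\b', sentence)

print(needs_prefix)
print(whole_words)
```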

### Starts With and Ends With

In [8]:
re.findall(r'\d$', 'This ends with a number 2')

['2']

In [28]:
re.findall(r'^\d', '1 is the loneliest number. 1dfdfd')

['1']

In [19]:
phrase = 'there are 3 numbers 34 inside 5 sentences.'
re.findall(r'[^\d]', phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 's',
 '.']

In [44]:
test_phrase = 'This > is a string! But it {}has , punctuation. How can "we :remove it?'
re.findall('''[^!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' ]+''', test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [57]:
test_phrase = 'This > is a string! But it {}has , punctuation. How can "we :remove it?'
clean_punctuation = re.findall('''[^!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' ]+''', test_phrase)
join_seq = ' '.join(clean_punctuation)
print(join_seq)

This is a string But it has punctuation How can we remove it
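
Typing every punctuation character by hand is error-prone. As a sketch of an alternative, the character class can be built from `string.punctuation` with `re.escape`, then removed with `re.sub`; the variable names here are introduced for illustration:

```python
import re
import string

test_phrase = 'This > is a string! But it {}has , punctuation. How can "we :remove it?'

# re.escape makes characters such as ']' and '\' safe inside the [...] class
punct_class = '[' + re.escape(string.punctuation) + ']'

no_punct = re.sub(punct_class, '', test_phrase)

# split() with no arguments collapses the leftover runs of spaces
cleaned = ' '.join(no_punct.split())
print(cleaned)
```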


# spaCy

In [3]:
import spacy
nlp = spacy.blank('en')

In [4]:
doc = nlp("This is a sentence.")
print(doc.text)

This is a sentence.


In [5]:
doc = nlp("I like tree kangaroos and narwhals.")
print(doc.text)
print(doc[0])

I like tree kangaroos and narwhals.
I


In [8]:
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals)

tree kangaroos
tree kangaroos and narwhals


In [11]:
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)


for token in doc:
    
    if token.like_num:
        print(token)
        next_token = doc[token.i + 1]
        print(next_token.i)
        if next_token.text == '%':
            print('Percentage Found :', token.text)
        

1990
2
60
6
Percentage Found : 60
4
21
Percentage Found : 4


In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [13]:
doc = nlp("She ate the pizza")

In [17]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


In [18]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


### Loading Pipeline

In [9]:
import spacy
nlp = spacy.load('en_core_web_sm')
text = "It's official: Apple is the first U.S. public company to reach a 1$ trillion market value"
doc = nlp(text)
print(doc)


It's official: Apple is the first U.S. public company to reach a 1$ trillion market value


In [13]:
for token in doc:
    # Left-align the token text, part-of-speech tag, and dependency label in fixed-width columns
    print(f"{token.text:<12}{token.pos_:<10}{token.dep_:<10}")

It          PRON      nsubj     
's          AUX       ccomp     
official    ADJ       acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
1           NUM       quantmod  
$           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [18]:
for ent in doc.ents:
    print(ent.text, ent.label_, ent.label)

Apple ORG 383
first ORDINAL 396
U.S. GPE 384
1$ trillion MONEY 394


# Efficient phrase matching

In [13]:
import json
import spacy
from spacy.matcher import PhraseMatcher

with open('countries.json', encoding='utf8') as f:
    COUNTRIES = json.load(f)

nlp = spacy.blank('en')
doc = nlp("Czech Republic may help Slovakia protect its airspace")

matcher = PhraseMatcher(nlp.vocab)

# Build one Doc pattern per country name; nlp.pipe processes the list in a batch
patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', patterns)

# Each match is a (match_id, start, end) tuple of token indices
matches = matcher(doc)

print([doc[start:end] for match_id, start, end in matches])


[Czech Republic, Slovakia]


## Extracting countries and relationships

In [15]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
import json

with open('countries.json', encoding='utf8') as f:
    COUNTRIES = json.loads(f.read())
    
with open('country_text.txt', encoding='utf8') as f:
    TEXT = f.read()
    
nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

doc = nlp(TEXT)
doc.ents = []

for match_id, start, end in matcher(doc):
    span = Span(doc, start, end, label="GPE")
    
    doc.ents = list(doc.ents) + [span]
    
    span_root_head = span.root.head
    print(span_root_head.text, "-->", span.text)
    
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])

in --> Namibia
in --> South Africa
Africa --> Cambodia
of --> Kuwait
as --> Somalia
Somalia --> Haiti
Haiti --> Mozambique
in --> Somalia
for --> Rwanda
Britain --> Singapore
War --> Sierra Leone
of --> Afghanistan
invaded --> Iraq
in --> Sudan
of --> Congo
earthquake --> Haiti
[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]


# Chapter 3 | Processing Pipelines

### 6 | Simple components

In [2]:
import spacy
from spacy.language import Language

@Language.component("length_component")
def length_component_func(doc):
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    return doc

nlp = spacy.load("en_core_web_sm")

nlp.add_pipe("length_component", first=True)
print(nlp.pipe_names)

doc = nlp("This is a sentence.")

['length_component', 'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
This document is 5 tokens long.


### 7 | Complex components

In [9]:
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)

matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)

@Language.component("animal_component")
def animal_component_func(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    doc.ents = spans
    return doc

nlp.add_pipe("animal_component", before="tok2vec")
print(nlp.pipe_names)
    
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['animal_component', 'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]

