
# Working with Text & Files

### Formatted String Literals (f-strings)

In [None]:
name = 'John'

In [None]:
# Using the old .format() method:
print('His name is {var}.'.format(var=name))

# Using f-strings:
print(f'His name is {name}.')

His name is John.
His name is John.


In [None]:
print(f'His name is {name!r}')

His name is 'John'
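The `!r` flag asks the f-string to use the value's `repr()` instead of its `str()`, which is why the quotes show up. A small sketch comparing the two forms side by side (the variable names `plain`, `quoted`, and `explicit` are illustrative only):

```python
name = 'John'

plain = f'His name is {name}.'            # formatted with str(name)
quoted = f'His name is {name!r}.'         # formatted with repr(name), so quotes appear
explicit = f'His name is {repr(name)}.'   # calling repr() directly gives the same text

print(plain)
print(quoted)
print(quoted == explicit)
```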


In [None]:
d = {'a':123,'b':456}

# Reusing the outer quote type inside the braces is a SyntaxError before Python 3.12:
# print(f'Address: {d['a']} Main Street')

In [None]:
print(f"Address: {d['a']} Main Street")

Address: 123 Main Street


In [None]:
library = [('Author', 'Topic', 'Pages'), ('Twain', 'Rafting', 601), ('Feynman', 'Physics', 95), ('Hamilton', 'Mythology', 144)]

for book in library:
    print(f'{book[0]:{10}} {book[1]:{8}} {book[2]:{7}}')

Author     Topic    Pages  
Twain      Rafting      601
Feynman    Physics       95
Hamilton   Mythology     144
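In the output above, `Mythology` is nine characters long and overflows its eight-character field, and the page counts are right-aligned while the `Pages` header is left-aligned (strings default to left alignment, numbers to right). A minimal sketch of one way to tidy this, assuming a wider topic column and an explicit `>` alignment flag on the last column; the widths 10, 10, and 7 are illustrative choices:

```python
library = [('Author', 'Topic', 'Pages'), ('Twain', 'Rafting', 601),
           ('Feynman', 'Physics', 95), ('Hamilton', 'Mythology', 144)]

# '>' right-aligns the last column for both the string header and the integers.
rows = [f'{author:{10}} {topic:{10}} {pages:>{7}}'
        for author, topic, pages in library]

for row in rows:
    print(row)
```

Every row now has the same total width, so the columns line up regardless of whether a field holds text or a number.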


### Date Formatting

In [None]:
from datetime import datetime

In [None]:
today = datetime.now()

In [None]:
today

datetime.datetime(2023, 11, 8, 23, 51, 37, 398602)

In [None]:
today = datetime(year=2023, month=11, day=8)

In [None]:
today

datetime.datetime(2023, 11, 8, 0, 0)

In [None]:
print(f'{today:%B %d, %Y}')

November 08, 2023
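The text after the colon in the f-string is passed to the datetime object as a format string, so it accepts the same directives as `strftime`. A short sketch with a few other common directives, using a fixed date so the results are reproducible (the variable names are illustrative):

```python
from datetime import datetime

# A fixed date keeps the output reproducible.
today = datetime(year=2023, month=11, day=8)

iso_style = f'{today:%Y-%m-%d}'    # four-digit year, zero-padded month and day
short_style = f'{today:%d/%m/%y}'  # day/month/two-digit year
clock = f'{today:%H:%M}'           # 24-hour time; midnight here, since no time was given

print(iso_style)
print(short_style)
print(clock)

# The f-string form is equivalent to calling strftime directly.
print(today.strftime('%Y-%m-%d') == iso_style)
```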


## Files

In [None]:
%%writefile test.txt
Hi there, General Kenobi.
Ohh the negotiator.

Writing test.txt


In [None]:
data = open('test.txt')

In [None]:
data.read()

'Hi there, General Kenobi.\nOhh the negotiator.\n'

In [None]:
data.read()

''

In [None]:
data.seek(0)

0

In [None]:
data.read()

'Hi there, General Kenobi.\nOhh the negotiator.\n'

In [None]:
data.seek(0)
data.readlines()

['Hi there, General Kenobi.\n', 'Ohh the negotiator.\n']

In [None]:
data.close()

In [None]:
# data.read()  # would raise ValueError: I/O operation on closed file

In [None]:
my_file = open('test.txt','a+')

In [None]:
my_file.write('\nThis line is being appended to test.txt')
my_file.write('\nAnd another line here.')

23

In [None]:
my_file.seek(0)
print(my_file.read())

Hi there, General Kenobi.
Ohh the negotiator.

This line is being appended to test.txt
And another line here.


In [None]:
my_file.close()

In [None]:
%%writefile -a test.txt

This is more text being appended to test.txt
And another line here.

Appending to test.txt


In [None]:
with open('test.txt','r') as txt:
    all_but_last = txt.readlines()[0:-1]  # every line except the final one

print(all_but_last)

['Hi there, General Kenobi.\n', 'Ohh the negotiator.\n', '\n', 'This line is being appended to test.txt\n', 'And another line here.\n', 'This is more text being appended to test.txt\n']


In [None]:
# txt.read()  # would raise ValueError: the with block has already closed the file
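`readlines()` loads the whole file into a list at once. A file object can also be iterated directly, one line at a time. The sketch below assumes a throwaway file in a temporary directory (the name `sample.txt` is hypothetical) so that it does not modify `test.txt`:

```python
import os
import tempfile

# Write a small sample file to a temporary directory.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w') as out:
    out.write('Hi there, General Kenobi.\nOhh the negotiator.\n')

# Iterating over the file object yields one line at a time,
# so the whole file never has to be held in memory.
with open(sample_path) as txt:
    lines = [line.rstrip('\n') for line in txt]

print(lines)
print(txt.closed)  # the with block has already closed the file
```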

# Working with PDFs Using PyPDF2

In [None]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
import PyPDF2

In [None]:
# Reading PDFs

f = open('/content/drive/MyDrive/LaGuardia Classes/Fall 2023 Predictive Analytics/US_Declaration.pdf', 'rb')

In [None]:
pdf_reader = PyPDF2.PdfReader(f)

In [None]:
len(pdf_reader.pages)

5

In [None]:
page_one = pdf_reader.pages[0]

In [None]:
page_one_text = page_one.extract_text()

In [None]:
page_one_text

"Declaration of Independence\nIN CONGRESS, July 4, 1776.  \nThe unanimous Declaration of the thirteen united States of America,  \nWhen in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.— \x14That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,—  \x14That whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter or to 

In [None]:
f.close()
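The cells above read a single page and close the file by hand. A hedged sketch of how the same calls (`PdfReader`, `.pages`, `.extract_text()`) could be wrapped to collect the text of every page; the helper names `extract_all_text` and `read_pdf_text` are hypothetical, not part of PyPDF2:

```python
def extract_all_text(reader):
    """Join the text of every page of a PdfReader-like object."""
    # Fall back to an empty string if a page yields no text.
    return '\n'.join(page.extract_text() or '' for page in reader.pages)


def read_pdf_text(path):
    """Open a PDF, extract the text of all pages, and close the file afterwards."""
    import PyPDF2  # imported here so extract_all_text works without PyPDF2 installed

    # The with block closes the file even if extraction raises an error.
    with open(path, 'rb') as f:
        return extract_all_text(PyPDF2.PdfReader(f))
```

Calling `read_pdf_text` with the path to a PDF file would return one string containing all pages, separated by newlines.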

# Regular Expressions

### Searching for Basic Patterns

In [None]:
import re

In [None]:
text = "The agent's phone number is 408-555-1234. Call soon!"

In [None]:
'phone' in text

True

In [None]:
pattern = 'phone'

In [None]:
re.search(pattern, text)

<re.Match object; span=(12, 17), match='phone'>

In [None]:
match = re.search(pattern,text)

In [None]:
match

<re.Match object; span=(12, 17), match='phone'>

In [None]:
match.span()

(12, 17)

In [None]:
match.start()

12

In [None]:
match.end()

17

In [None]:
text = "my phone is a new phone"

In [None]:
match = re.search("phone", text)

In [None]:
match.span()

(3, 8)

In [None]:
matches = re.findall("phone",text)

In [None]:
matches

['phone', 'phone']

In [None]:
for match in re.finditer("phone", text):
    print(match.span())

(3, 8)
(18, 23)


In [None]:
text = "My telephone number is 408-555-1234"

In [None]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text)

In [None]:
phone.group()

'408-555-1234'

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
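The rows of the table can be checked directly. A small sketch that runs each example pattern against its example match using `re.fullmatch`, which succeeds only when the entire string fits the pattern (the list name `examples` is illustrative):

```python
import re

# (pattern, sample text) pairs taken from the table rows above.
examples = [
    (r'Version \w-\w+', 'Version A-b1_1'),
    (r'\D{3}', 'abc'),
    (r'\d{2,4}', '123'),
    (r'\w{3,}', 'anycharacters'),
    (r'A*B*C*', 'AAACC'),
    (r'plurals?', 'plural'),
]

for pattern, sample in examples:
    # fullmatch returns None unless the whole sample fits the pattern.
    result = re.fullmatch(pattern, sample)
    print(f'{pattern!r:20} {sample!r:18} {result is not None}')
```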

In [None]:
re.search(r'\d{3}-\d{3}-\d{4}',text)

<re.Match object; span=(23, 35), match='408-555-1234'>
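Once a pattern uses quantifiers, parentheses can split it into capture groups so that each part of the match is available separately. A minimal sketch on the same phone number, assuming the three-part grouping shown here (the name `phone_pattern` is illustrative):

```python
import re

text = "My telephone number is 408-555-1234"

# Parentheses create capture groups; compiling lets the pattern be reused.
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
result = phone_pattern.search(text)

print(result.group())   # the entire match
print(result.group(1))  # the first group (area code)
print(result.groups())  # all groups as a tuple
```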

# Intro to spaCy

In [None]:
# Reference: https://spacy.io/


In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

In [None]:
for token in doc:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


### Pipeline

In [None]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x780e0db97820>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x780e0db97340>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x780e0dc90cf0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x780e0da94900>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x780e0d97aa40>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x780e0dc90f90>)]

In [None]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

### Tokenization

In [None]:
doc2 = nlp(u"Tesla isn't   looking into startups anymore.")

for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
n't PART neg
   SPACE dep
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


In [None]:
doc2

Tesla isn't   looking into startups anymore.

In [None]:
type(doc2)

spacy.tokens.doc.Doc

### Part-of-Speech Tagging (POS)

In [None]:
doc2[0].pos_

'PROPN'

In [None]:
spacy.explain('PROPN')

'proper noun'

In [None]:
spacy.explain('nsubj')

'nominal subject'

### Additional Token Attributes

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist only of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [None]:
# Lemmas (the base form of the word):
print(doc2[4].text)
print(doc2[4].lemma_)

looking
look


In [None]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[4].pos_)
print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))

VERB
VBG / verb, gerund or present participle


In [None]:
# Word Shapes:
print(doc2[0].text+': '+doc2[0].shape_)
print(doc[5].text+' : '+doc[5].shape_)

Tesla: Xxxxx
U.S. : X.X.
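To make the idea of a word shape concrete, here is a pure-Python approximation. It is not spaCy's own code: the function name `approx_shape` is hypothetical, and it assumes that uppercase letters map to `X`, lowercase to `x`, digits to `d`, other characters stay as they are, and runs of the same shape character are cut off after four.

```python
def approx_shape(text):
    """Approximate a spaCy-style word shape for a single token string."""
    shape = []
    last = ''
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            shape_char = 'd'
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        # Keep at most four consecutive copies of the same shape character.
        if run < 4:
            shape.append(shape_char)
    return ''.join(shape)


print(approx_shape('Tesla'))
print(approx_shape('U.S.'))
```

Under these assumptions the sketch reproduces the two shapes printed above.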


In [None]:
# Boolean Values:
print(doc2[0].is_alpha)
print(doc2[0].is_stop)

True
False


### Spans

In [None]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [None]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [None]:
type(life_quote)

spacy.tokens.span.Span

### Sentences

In [None]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [None]:
for sentence in doc4.sents:
    print(sentence)

This is the first sentence.
This is another sentence.
This is the last sentence.
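In the `en_core_web_sm` pipeline, sentence boundaries appear to come from the dependency parser. When only sentence splitting is needed, a lighter alternative is spaCy's rule-based `sentencizer` component. The sketch below assumes spaCy v3 and a blank English pipeline, which needs no model download; the name `nlp_blank` is illustrative:

```python
import spacy

# A blank English pipeline contains only the tokenizer.
nlp_blank = spacy.blank('en')
nlp_blank.add_pipe('sentencizer')  # rule-based splitter keyed on punctuation

doc = nlp_blank('This is the first sentence. This is another sentence. '
                'This is the last sentence.')
sentences = [sent.text for sent in doc.sents]

for sentence in sentences:
    print(sentence)
```

Because the sentencizer looks only at punctuation, it is faster than the parser but can be less accurate on text with abbreviations or unusual punctuation.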
