# spaCy NLP stuff

[spaCy](https://spacy.io/) is a Python library that bills itself as "Industrial-Strength Natural Language Processing" (NLP).

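Make sure you have installed spaCy and its small English model: <br>
`pip install -q -U spacy` <br>
`python -m spacy download en_core_web_sm`

The cells below also use the `beautifulsoup4`, `lxml`, and `pandas` packages. You might need to restart your Jupyter kernel after installing so that the new packages are picked up.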

In [1]:
! curl -H "user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36" https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm > /tmp/TeslaIPO.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2307k    0 2307k    0     0  2547k      0 --:--:-- --:--:-- --:--:-- 2563k


In [15]:
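from bs4 import BeautifulSoup

def html2text(html_text):
    """Strip the HTML markup and return only the text content."""
    soup = BeautifulSoup(html_text, 'lxml')  # the 'lxml' parser requires the lxml package
    text = soup.get_text()
    return text

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
tsla = html2text(html_text)
print(tsla[0:100].split())  # peek at the first 100 characters, split on whitespace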

['S-1', '1', 'ds1.htm', 'REGISTRATION', 'STATEMENT', 'ON', 'FORM', 'S-1', 'Registration', 'Statement', 'on', 'Form', 'S-1', 'Table', 'of', 'Cont']
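
The extracted text still carries the page's layout whitespace: non-breaking spaces (`\xa0`) and long runs of newlines, which surface later as `SPACE` tokens. An optional cleanup pass before handing the text to spaCy might look like the sketch below. `normalize_whitespace` is a hypothetical helper, not part of the original notebook, and applying it would change the token positions and outputs shown in the rest of these notes.

```python
import re

def normalize_whitespace(text):
    """Hypothetical helper: tidy layout whitespace but keep paragraph breaks."""
    text = text.replace("\xa0", " ")        # non-breaking space -> plain space
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around each newline
    text = re.sub(r"\n{2,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()
```

For example, `tsla_clean = normalize_whitespace(tsla)` would give a tidier string to pass to `nlp(...)`.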


## Tokenizing with spaCy

In [16]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [17]:
doc = nlp(tsla[0:5000])
type(doc)

spacy.tokens.doc.Doc

In [18]:
for token in doc[:30]:
    if not token.is_space:  # skip whitespace-only tokens
        print(token.text)

S-1
1
ds1.htm
REGISTRATION
STATEMENT
ON
FORM
S-1
Registration
Statement
on
Form
S-1
Table
of
Contents
As
filed
with
the
Securities
and
Exchange


## Parts of speech

In [6]:
import pandas as pd
winfo = []
for token in doc[100:120]:
    winfo.append([token.text, token.pos_, token.is_stop])
winfo

[['jurisdiction', 'NOUN', False],
 ['of', 'ADP', True],
 ['incorporation', 'NOUN', False],
 ['or', 'CCONJ', True],
 ['organization', 'NOUN', False],
 [')', 'PUNCT', False],
 ['\n\xa0\n ', 'SPACE', False],
 ['(', 'PUNCT', False],
 ['Primary', 'PROPN', False],
 ['Standard', 'PROPN', False],
 ['Industrial', 'PROPN', False],
 ['Classification', 'PROPN', False],
 ['Code', 'PROPN', False],
 ['Number', 'PROPN', False],
 [')', 'PUNCT', False],
 ['\n\xa0\n ', 'SPACE', False],
 ['(', 'PUNCT', False],
 ['I.R.S.', 'PROPN', False],
 ['Employer', 'PROPN', False],
 ['Identification', 'PROPN', False]]

In [7]:
pd.DataFrame(data=winfo, columns=['word','part of speech', 'stop word'])

| | word | part of speech | stop word |
|---|---|---|---|
| 0 | jurisdiction | NOUN | False |
| 1 | of | ADP | True |
| 2 | incorporation | NOUN | False |
| 3 | or | CCONJ | True |
| 4 | organization | NOUN | False |
| 5 | ) | PUNCT | False |
| 6 | `\n\xa0\n` | SPACE | False |
| 7 | ( | PUNCT | False |
| 8 | Primary | PROPN | False |
| 9 | Standard | PROPN | False |
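
Beyond inspecting tokens one at a time, it can help to tally how often each tag occurs. The sketch below is one way to do that, assuming each token exposes the `pos_` and `is_space` attributes that spaCy tokens provide. `pos_counts` is a hypothetical helper name, not something from spaCy or the original notebook.

```python
from collections import Counter

def pos_counts(tokens):
    """Count coarse part-of-speech tags, ignoring whitespace-only tokens."""
    return Counter(tok.pos_ for tok in tokens if not tok.is_space)
```

With the `doc` built above, `pos_counts(doc).most_common(5)` would list the five most frequent tags in the first 5000 characters of the filing.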


## Entities
You can access the named entities recognized in a `Doc` object by iterating over its `.ents` attribute.

Each entity recognized by spaCy carries a label (`ent.label_`) that indicates the type of entity it represents. Common entity labels include:

PERSON: People's names. <br>
ORG: Organizations such as companies, agencies, and institutions. <br>
GPE: Geopolitical entities (e.g., countries, cities, states). <br>
DATE: Absolute or relative dates or periods (times of day get the separate TIME label). <br>
MONEY: Monetary values, including the unit. <br>
PERCENT: Percentage values. <br>
PRODUCT: Product names. <br>
And many more.

In [8]:
winfo = []
for ent in doc.ents[:20]:
    winfo.append([ent.text, ent.label_])
pd.DataFrame(data=winfo, columns=['word', 'label'])

| | word | label |
|---|---|---|
| 0 | S-1 | WORK_OF_ART |
| 1 | 1 | CARDINAL |
| 2 | the Securities and Exchange Commission | ORG |
| 3 | January 29, 2010 | DATE |
| 4 | UNITED STATES | GPE |
| 5 | Washington | GPE |
| 6 | D.C. | GPE |
| 7 | 20549 | DATE |
| 8 | 1933 | DATE |
| 9 | Tesla Motors | ORG |
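
The output shows that the statistical model is not always right: the form name `S-1` is tagged `WORK_OF_ART` and the ZIP code `20549` is tagged `DATE`. Grouping entities by label makes such slips easier to spot. The sketch below assumes each entity exposes `text` and `label_`, as spaCy's entity spans do. `entities_by_label` is a hypothetical helper, not part of spaCy.

```python
from collections import defaultdict

def entities_by_label(ents):
    """Map each entity label to the distinct entity texts seen, in first-seen order."""
    grouped = defaultdict(list)
    for ent in ents:
        text = " ".join(ent.text.split())  # collapse newlines and repeated spaces
        if text not in grouped[ent.label_]:
            grouped[ent.label_].append(text)
    return dict(grouped)
```

For instance, `entities_by_label(doc.ents)["ORG"]` would return the organizations found in the parsed text.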


## Word vectors

In [9]:
winfo = []
for t in doc[100:110]:
    winfo.append([t.text, t.vector])
pd.DataFrame(data=winfo, columns=['word', 'vector'])

| | word | vector |
|---|---|---|
| 0 | jurisdiction | [-0.06652121, -0.11071381, 0.16533509, -0.7449... |
| 1 | of | [2.121176, -0.7885181, -0.11417009, -0.9308520... |
| 2 | incorporation | [-0.9278236, 0.19262522, 0.1077134, -0.4664523... |
| 3 | or | [2.1501977, -0.75503707, -0.6008313, -0.020373... |
| 4 | organization | [-0.19795139, -0.15959348, -1.1749573, -0.3780... |
| 5 | ) | [-0.80560553, -0.32602733, -0.38138187, -2.045... |
| 6 | `\n\xa0\n` | [0.13619107, 0.08687705, -1.129379, -1.5866337... |
| 7 | ( | [0.34838852, -0.5217571, -0.47835198, -0.01440... |
| 8 | Primary | [-0.1356867, -0.56116086, 0.979896, 0.95332634... |
| 9 | Standard | [-0.26291898, -0.8166239, -0.6339028, 0.484101... |
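
Vectors like these are mostly useful for comparing words, and cosine similarity is a common measure for that. The sketch below is a plain-Python version that should work on any two equal-length sequences of numbers, including `token.vector` arrays. `cosine_similarity` is a helper defined here for illustration. One caveat, to the best of my knowledge: small pipelines such as `en_core_web_sm` do not ship static word vectors, so `token.vector` falls back to context-sensitive tensors and the similarities are correspondingly rougher than with a medium or large model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity is undefined for a zero vector; report 0.0
    return dot / (norm_u * norm_v)
```

For example, `cosine_similarity(doc[100].vector, doc[102].vector)` compares "jurisdiction" with "incorporation". spaCy also offers a built-in comparison, `doc[100].similarity(doc[102])`.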


## Visualizing entities in notebook

In [10]:
from spacy import displacy
displacy.render(doc[100:180], style='ent')

## Splitting into sentences

In [11]:
winfo = []
for s in doc.sents:
    if len(s.text.strip())>2:
        winfo.append([s.text])
pd.DataFrame(data=winfo, columns=['sentence'])

| | sentence |
|---|---|
| 0 | \nS-1\n1\nds1.htm\nREGISTRATION STATEMENT ON F... |
| 1 | (Exact name of Registrant as\nspecified in its... |
| 2 | (650) 251-5000 Approximate date\nof commen... |
| 3 | If any of the\nsecurities being registered on ... |
| 4 | ¨ If this Form is filed to register additiona... |
| 5 | ¨ If this Form is a post-effective amendment ... |
| 6 | ¨ If this Form is a post-effective amendment ... |
| 7 | ¨ Indicate by check mark whether the registra... |
| 8 | See the definitions of large accelerated file... |
| 9 | (Check one): \n\n\n\n\n\n\n\n\n\n\nLarge acc... |
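
The sentence texts above still contain the filing's line breaks and padding. A small post-processing step can make them easier to read. The sketch below assumes each sentence exposes a `text` attribute, as spaCy sentence spans do. `clean_sentences` and its `min_chars` parameter are hypothetical names chosen for this example.

```python
def clean_sentences(sents, min_chars=3):
    """Return sentence texts with whitespace collapsed, dropping very short ones."""
    cleaned = []
    for sent in sents:
        text = " ".join(sent.text.split())  # split() also treats \xa0 as whitespace
        if len(text) >= min_chars:
            cleaned.append(text)
    return cleaned
```

With the `doc` from above, `pd.DataFrame(clean_sentences(doc.sents), columns=['sentence'])` would show the same sentences without the embedded `\n` sequences.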


## Exercise

Extract every token in the TSLA doc that spaCy considers a number. Your output should look like this (assuming you used `doc = nlp(tsla[0:5000])`):

```
[1, 29, 2010, 20549, 1933, 3711, 91, 2197729, 3500, 94304, 650, 413, 4000, 3500, 94304, 650, 413, 4000, 650, 94304, 650, 493, 9300, 2550, 94304, 650, 251, 5000, 415, one, 0.001, 100,000,000, 7,130, 1, 457, 1933, 2, 1933, 29, 2010]
```

See [solution](https://github.com/parrt/msds692/tree/master/notes/code/spacy) if you get stuck.