# NLP using spaCy

<img src='figures/nlp_general.png' width=90%>

**How I learnt this:**

Watched a couple of videos on YouTube:
* https://www.youtube.com/watch?v=NiIJIU5BEBU
* https://www.youtube.com/watch?v=WnGPv6HnBok&list=PLBmcuObd5An559HbDr_alBnwVsGq-7uTF
* https://www.youtube.com/watch?v=KL4-Mpgbahw

**Notes**

First I tried NLTK (the old, tried-and-tested NLP package), but spaCy seemed easier.
* stemming and lemmatization: https://www.guru99.com/stemming-lemmatization-python-nltk.html#5
* NER: https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
* if working with Danish text: https://spacy.io/universe/project/lemmy. Download the Danish model from the shell with `python -m spacy download da_core_news_sm`, then load it in Python with `import spacy` and `nlp = spacy.load("da_core_news_sm")`.
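
A rough sketch of the Danish setup from the last note. It tries to load `da_core_news_sm` and, if the model has not been downloaded, falls back to `spacy.blank("da")`, which only gives Danish tokenization (no lemmas, POS tags or entities). The fallback, the `nlp_da` name and the example sentence are assumptions for illustration, not something from the videos.

```python
import spacy

try:
    # trained Danish pipeline (needs: python -m spacy download da_core_news_sm)
    nlp_da = spacy.load("da_core_news_sm")
except OSError:
    # assumption: fall back to a tokenizer-only Danish pipeline if the model is missing
    nlp_da = spacy.blank("da")

# hypothetical example sentence ("I would like to buy an apple.")
doc_da = nlp_da("Jeg vil gerne købe et æble.")
tokens_da = [t.text for t in doc_da]
print(tokens_da)
```

With the trained model loaded, attributes like `t.lemma_` and `t.pos_` are filled in as well; with the blank pipeline they stay empty.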

**This notebook is structured as follows.**
1. **basic intro**
1. **visualizing**
1. **real data**
1. **a**

## basic intro

In [149]:
# Import the english language class
from spacy.lang.en import English

# create the nlp object
nlp = English()

# the doc object
doc = nlp('Hello world! I am here :D\n')

# Iterate over tokens in doc
print('tokens: ',[token.text for token in doc])
    
# doc can be indexed
span = doc[1:5]
print('span: ',span.text,'\n')

# some of the attributes
print('Index: ', [token.i for token in doc])
print('text: ', [token.text for token in doc])
print('is_alpha: ', [token.is_alpha for token in doc])
print('is_punct: ', [token.is_punct for token in doc])
print('like_num: ', [token.like_num for token in doc])

tokens:  ['Hello', 'world', '!', 'I', 'am', 'here', ':D', '\n']
span:  world! I am 

Index:  [0, 1, 2, 3, 4, 5, 6, 7]
text:  ['Hello', 'world', '!', 'I', 'am', 'here', ':D', '\n']
is_alpha:  [True, True, False, True, True, True, False, False]
is_punct:  [False, False, True, False, False, False, False, False]
like_num:  [False, False, False, False, False, False, False, False]


## visualizing

In [136]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

doc = nlp('Anton and I would like to purchase an apple.')
[t for t in doc]

[Anton, and, I, would, like, to, purchase, an, apple, .]

In [128]:
displacy.render(doc, jupyter=True, style='ent', minify=True,
                options={
                    "compact": True, 
                    "bg": "#09a3d5",
                    "color": "white", 
                    "font": "Source Sans Pro"})

In [146]:
displacy.render(doc, jupyter=True, style='dep', minify=True,
                options={
                    "compact": True, 
                    "bg": "#09a3d5",
                    "color": "white", 
                    "font": "Source Sans Pro"})

In [147]:
spacy.explain("PRON"), spacy.explain(doc[2].pos_), 

('pronoun', 'pronoun')

In [148]:
import pandas as pd

def dataFrame_from_doc(doc):
    '''Input: a spaCy doc. Output: a DataFrame with one row per token.'''
    columns = ['text', 'lemma_', 'pos_', 'pos_ex', 'tag_', 'tag_ex',
               'dep_', 'dep_ex', 'shape_', 'is_alpha', 'is_stop']
    # the *_ex columns hold spacy.explain() descriptions of the label next to them
    rows = [[t.text,
             t.lemma_,
             t.pos_, spacy.explain(t.pos_),
             t.tag_, spacy.explain(t.tag_),
             t.dep_, spacy.explain(t.dep_),
             t.shape_,
             t.is_alpha,
             t.is_stop] for t in doc]
    return pd.DataFrame(rows, columns=columns)

dataFrame_from_doc(doc)

| | text | lemma_ | pos_ | pos_ex | tag_ | tag_ex | dep_ | dep_ex | shape_ | is_alpha | is_stop |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Anton | Anton | PROPN | proper noun | NNP | noun, proper singular | nsubj | nominal subject | Xxxxx | True | False |
| 1 | and | and | CCONJ | coordinating conjunction | CC | conjunction, coordinating | cc | coordinating conjunction | xxx | True | True |
| 2 | I | I | PRON | pronoun | PRP | pronoun, personal | conj | conjunct | X | True | True |
| 3 | would | would | AUX | auxiliary | MD | verb, modal auxiliary | aux | auxiliary | xxxx | True | True |
| 4 | like | like | VERB | verb | VB | verb, base form | ROOT | | xxxx | True | False |
| 5 | to | to | PART | particle | TO | infinitival "to" | aux | auxiliary | xx | True | True |
| 6 | purchase | purchase | VERB | verb | VB | verb, base form | xcomp | open clausal complement | xxxx | True | False |
| 7 | an | an | DET | determiner | DT | determiner | det | determiner | xx | True | True |
| 8 | apple | apple | NOUN | noun | NN | noun, singular or mass | dobj | direct object | xxxx | True | False |
| 9 | . | . | PUNCT | punctuation | . | punctuation mark, sentence closer | punct | punctuation | . | False | False |
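
In the table above the `dep_ex` cell for `ROOT` is empty, presumably because `spacy.explain` returns `None` for labels it has no glossary entry for (which labels are covered may differ between spaCy versions). A small sketch of a fallback that shows the raw label instead; the helper name `explain_or_label` and the made-up label are assumptions for illustration.

```python
import spacy

def explain_or_label(label):
    """Return spaCy's glossary description, or the raw label if there is none."""
    description = spacy.explain(label)
    return description if description is not None else label

print(explain_or_label("PRON"))              # label with a glossary entry
print(explain_or_label("NOT_A_REAL_LABEL"))  # hypothetical label without one
```

Swapping `spacy.explain(...)` for `explain_or_label(...)` inside `dataFrame_from_doc` would avoid empty cells in the `*_ex` columns.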


## real data
I'll use Stack Overflow question titles (the StackSample dataset on Kaggle).

In [131]:
# dataset is @ 'https://www.kaggle.com/stackoverflow/stacksample?select=Answers.csv'

df = pd.read_csv("/Users/antongolles/Downloads/stackSample/Questions.csv", nrows=1_000_000,
                 encoding="ISO-8859-1", usecols=['Title', 'Id'])

# cheap pre-filter: keep only titles containing the substring "go" before running spaCy
titles = df.loc[df['Title'].str.lower().str.contains("go"), 'Title'].tolist()

In [133]:
%%time
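def has_golang(doc):
    '''True if "go"/"golang" occurs as a noun that is the object of a preposition.'''
    for t in doc:
        if t.lower_ in ("go", "golang") and t.pos_ == 'NOUN' and t.dep_ == 'pobj':
            return True
    return False

# nlp.pipe processes the titles in batches, which is faster than calling nlp() on each one
g = (doc for doc in nlp.pipe(titles) if has_golang(doc))

# take the first 10 matches from the lazy generator
[next(g) for _ in range(10)]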

CPU times: user 8.65 s, sys: 1.54 s, total: 10.2 s
Wall time: 10.9 s


[Can I append an Ajax requestXML object to my document tree all in one go?,
 How do I disable multiple listboxes in one go using jQuery?,
 multi package makefile example for go,
 What's the point of having pointers in Go?,
 Trouble reading from a socket in go,
 Is there a way to find a specific file and then change into the directory containing it in one go?,
 How many records can be loaded into Salesforce using Apex Data Loader in one go?,
 How can I run multiple inserts with NHibernate in one go?,
 Convert string to integer type in Go?,
 Generating Random Numbers in Go]
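
The `str.contains("go")` pre-filter keeps any title containing those two letters (for example "Going" or "Google"), so the token-level check is what actually narrows things down. A minimal sketch of the same pipe-and-filter pattern, assuming a tokenizer-only `spacy.blank("en")` pipeline and made-up titles, so the POS and dependency checks from `has_golang` are left out here:

```python
from itertools import islice

import spacy

# assumption: tokenizer-only pipeline, so no trained model is needed for this demo
nlp_tok = spacy.blank("en")

# hypothetical titles that all contain the substring "go"
demo_titles = [
    "Generating Random Numbers in Go",
    "Going from Python 2 to Python 3",
    "golang channel deadlock",
    "Google Maps API key",
]

def mentions_go(doc):
    """True if any whole token is 'go' or 'golang' (case-insensitive)."""
    return any(t.lower_ in {"go", "golang"} for t in doc)

# lazily process the titles in batches and keep only the matching docs
matches = (doc for doc in nlp_tok.pipe(demo_titles, batch_size=2) if mentions_go(doc))

# islice takes at most 10 matches and simply stops if there are fewer
first_matches = [doc.text for doc in islice(matches, 10)]
print(first_matches)
```

`islice` is used instead of `[next(g) for i in range(10)]` because calling `next` on an exhausted generator raises `StopIteration` when there are fewer than 10 matches.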

In [135]:
string = 'How many records can be loaded into Salesforce using Apex Data Loader in one go?'
displacy.render(nlp(string))