## Spacy Lab Session

1 library for NLP (e.g., NLTK, gensim)
HuggingFace (mostly for deep learning)

pip install spacy or pip install -U 'spacy[cuda-autodetect]'

python -m spacy download en_core_web_sm

python -m spacy download en_core_web_md #has word embedding (gloVe)

In [47]:
import spacy
spacy.__version__

'3.5.0'

In [48]:
#create a spacy object that can parse a lot of stuffs
#based on some learned model

nlp = spacy.load('en_core_web_sm')

In [49]:
text = 'Chaky really like to eat naan and masala.  He also likes to eat sushi.'

In [50]:
doc = nlp(text)
type(doc)

spacy.tokens.doc.Doc

In [51]:
#there are so many things in this doc
for tokens in doc[:10]:
    print(tokens)  #this spacy.tokens.doc.Doc already tokenize it!!!

Chaky
really
like
to
eat
naan
and
masala
.
 


In [52]:
for sent in doc.sents:
    print(sent)  #it also has sentence 

Chaky really like to eat naan and masala.  
He also likes to eat sushi.


In [53]:
tokens

 

In [54]:
tokens.ent_type #entity type ids
tokens.ent_type_ #geo political entity

''

In [55]:
spacy.explain('GPE')

'Countries, cities, states'

## 2.Word2vec

In [63]:
nlp = spacy.load('en_core_web_sm')
text= "Chakey likes to eat suhi"

In [64]:
doc = nlp(text)

In [65]:
sentence = list(doc.sents)[0]

In [66]:
sentence[1]

likes

In [67]:
len(sentence[1].vector)  #what is the size?? --> 300 glove embedding

96

## 3.Similarity

In [68]:
doc = nlp("I love coffee.")
nlp.vocab.strings['cofee']

5527451774744871294

In [69]:
nlp.vocab.strings[3197928453018144401]

'coffee'

In [70]:
#first numericalize dog
integer = nlp.vocab.strings['dog']
integer

7562983679033046312

## 4. Sentence similarity

In [74]:
doc1 = nlp("Chaky likes french fries")
doc2 = nlp("Tonson likes sweet potato nuggets")

In [75]:
doc1.similarity(doc2) 

  doc1.similarity(doc2)


0.8397504609170756

In [76]:
#doc ---> sents ---> span ---> tokens

#do span similarity
span1 = doc1[2:4]
span1

french fries

In [77]:
span2 = doc2[2:6]
span2

sweet potato nuggets

In [78]:
span1.similarity(span2)

  span1.similarity(span2)


0.7385958433151245

## 1.Basic

### 1.1.Intro

### 1.2.Word Vectors

### 1.3.Similarity

### 1.4.Docs and Span similarity

## 2.2.Entity Ruler

### 2.2 Matcher

In [79]:
from spacy.matcher import Matcher #help us recognize patterns

In [80]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

In [81]:
pattern = [{"LIKE_EMAIL": True}]
matcher.add("EMAIL", [pattern])

doc = nlp("Chaky email is chaklam@ait.asia.")
matches = matcher(doc)

In [82]:
matches

[(17587345535198158200, 3, 4)]

In [84]:
#proper nouns and longer phrases
with open("wiki_king.txt", "r") as f:
    text = f.read()
    
text

'Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 â€“ April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr.\n\nKing participated in and led marches for blacks\' right to vote, desegregation, labor rights, and other basic civil rights.[1] King led the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King helped organize the 1963 March on Washington, where he delivered his f

In [86]:
nlp = spacy.load("en_core_web_sm")

In [87]:
matcher = Matcher(nlp.vocab)

In [88]:
pattern = [{"POS": "PROPN"}]  #pos ==> part of speech
matcher.add("PROPER_NOUN_CHAKY", [pattern])

In [89]:
doc = nlp(text)
matches = matcher(doc)

In [90]:
for match in matches[:10]:
    print(match, doc[match[1]:match[2]]) #match[1] start of the span, match[2] end of the span

(2015442650195688329, 0, 1) Martin
(2015442650195688329, 1, 2) Luther
(2015442650195688329, 2, 3) King
(2015442650195688329, 3, 4) Jr.
(2015442650195688329, 6, 7) Michael
(2015442650195688329, 7, 8) King
(2015442650195688329, 8, 9) Jr.
(2015442650195688329, 10, 11) January
(2015442650195688329, 16, 17) April
(2015442650195688329, 24, 25) Baptist


In [93]:
# mulitple token
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]  #pos ==> part of speech
matcher.add("PROPER_NOUN_CHAKY", [pattern])
doc = nlp(text)
matches = matcher(doc)
for match in matches[:10]:
    print(match, doc[match[1]:match[2]]) #match[1] start of the span, match[2] end of the span

(2015442650195688329, 0, 1) Martin
(2015442650195688329, 0, 2) Martin Luther
(2015442650195688329, 1, 2) Luther
(2015442650195688329, 0, 3) Martin Luther King
(2015442650195688329, 1, 3) Luther King
(2015442650195688329, 2, 3) King
(2015442650195688329, 0, 4) Martin Luther King Jr.
(2015442650195688329, 1, 4) Luther King Jr.
(2015442650195688329, 2, 4) King Jr.
(2015442650195688329, 3, 4) Jr.
