<a href="https://colab.research.google.com/github/rajanpbg/Demo_labs/blob/deeplearning/NLP/01_Spacy_POS_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### spaCy
spaCy is an NLP library mainly used for tasks such as
- part-of-speech tagging
- named entity recognition

spaCy provides three main English models:
- Small (en_core_web_sm)
- Medium (en_core_web_md)
- Large (en_core_web_lg)

In [3]:
!pip install spacy
# If the model is not already installed, also run:
# !python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")



In [5]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

### Rule-Based Matching

In [7]:
## Now let's look at a pattern (example: Hello World)
# "Hello World" can appear in the following forms:
# Hello World
# Hello-world
pattern_1 = [{'LOWER': 'hello'}, {'LOWER': 'world'}]                      # hello world
pattern_2 = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]  # hello-world
# spaCy v2 signature: add(key, on_match_callback, *patterns)
matcher.add('Hello World', None, pattern_1, pattern_2)

In [8]:
doc = nlp("'Hello World' is the first programme for everyone, printing 'Hello-World' means u started coding ")
doc

'Hello World' is the first programme for everyone, printing 'Hello-World' means u started coding 

In [9]:
find_matches = matcher(doc)
print(find_matches)

[(8585552006568828647, 1, 3), (8585552006568828647, 13, 16)]


In [10]:
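# Each match is a (match_id, start, end) tuple: match_id is the hash of the
# pattern name, and start/end are token indices into the doc.
for match_id, start, end in find_matches:
  string_id = nlp.vocab.strings[match_id]   # look up the pattern name from its hash
  span = doc[start:end]                     # the matched tokens
  print(match_id, string_id, start, end, span.text)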

8585552006568828647 Hello World 1 3 Hello World
8585552006568828647 Hello World 13 16 Hello-World
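
The `matcher.add(key, None, *patterns)` call used above follows the spaCy v2 signature. As a minimal sketch, assuming spaCy v3 (where the patterns are passed as one list), the same rule-based match might look like this. The pattern name `HELLO_WORLD` and the sample text are illustrative, and `spacy.blank("en")` is used only so that no model download is needed.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline has a tokenizer but no trained components
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern_1 = [{"LOWER": "hello"}, {"LOWER": "world"}]
pattern_2 = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]

# Assumption: spaCy v3 signature, add(key, list_of_patterns)
matcher.add("HELLO_WORLD", [pattern_1, pattern_2])

doc = nlp("Hello World and hello-world")

# Sort by start index so the results follow document order
matches = sorted(matcher(doc), key=lambda m: m[1])
matched_texts = [doc[start:end].text for _, start, end in matches]
print(matched_texts)
```

Because both patterns use `LOWER`, the match does not depend on the capitalisation of the input.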


In [12]:
pattern_3 = [{'LOWER': 'hello'}, {'LOWER': 'world'}]
# 'OP': '*' lets the punctuation token occur zero or more times
pattern_4 = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'world'}]
matcher.add('Hello World', None, pattern_3, pattern_4)
doc1 = nlp("'Hello World' is the first programme for everyone Hello World , printing 'Hello-World' means u started coding in Hello--world")
find_matches = matcher(doc1)
print(find_matches)

[(8585552006568828647, 1, 3), (8585552006568828647, 10, 12), (8585552006568828647, 15, 18), (8585552006568828647, 24, 27)]
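
In the output above, the last match spans tokens 24 to 27, which suggests that `--` was tokenized as a single punctuation token, so the `'OP': '*'` quantifier is not really exercised there. The sketch below isolates the quantifier by building a `Doc` from an explicit word list, so tokenization plays no role. It assumes the spaCy v3 `add` signature, and the word list is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Zero or more punctuation tokens are allowed between "hello" and "world"
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD", [pattern])  # assumption: spaCy v3 signature

# Build the Doc from explicit tokens so tokenization is not a factor
words = ["hello", "-", "-", "world", "and", "hello", "world"]
doc = Doc(nlp.vocab, words=words)

# Collect the distinct (start, end) token spans in document order
spans = sorted({(start, end) for _, start, end in matcher(doc)})
print(spans)
```

The first span covers two punctuation tokens and the second covers none, which is the behaviour `*` is meant to allow.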


### Phrase-Based Matching

Here we match against a list of exact phrases rather than token-level rules. Let's look at an example.



In [16]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
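phrase_list = ['Barak Obama', 'Angela Markel', 'Washington, D.c']
# Convert each phrase to a Doc so the matcher can compare token sequences
phrase_patterns = [nlp(text) for text in phrase_list]
# spaCy v2 signature: add(key, on_match_callback, *docs)
matcher.add("terminology", None, *phrase_patterns)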

In [17]:
doc3 = nlp("German chancoler Angela Markel meet the Barak Obama in  Washington, D.c ")
find_matches = matcher(doc3)
print(find_matches)

[(14795493746145778174, 2, 4), (14795493746145778174, 6, 8), (14795493746145778174, 10, 13)]


In [18]:
# Each match is a (match_id, start, end) tuple, as with the rule-based Matcher
for match_id, start, end in find_matches:
  string_id = nlp.vocab.strings[match_id]   # look up the pattern name from its hash
  span = doc3[start:end]                    # the matched phrase
  print(match_id, string_id, start, end, span.text)

14795493746145778174 terminology 2 4 Angela Markel
14795493746145778174 terminology 6 8 Barak Obama
14795493746145778174 terminology 10 13 Washington, D.c
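
By default the `PhraseMatcher` compares exact token text, so a phrase written in a different case would be missed. A minimal sketch of case-insensitive phrase matching, assuming spaCy v3 and its `attr="LOWER"` option, is shown below. The sample sentence is made up, and the phrase spellings are copied from the list above.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# attr="LOWER" compares lowercase token text instead of the exact text
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

phrase_list = ["Barak Obama", "Angela Markel"]
# make_doc only tokenizes, which is all the PhraseMatcher needs here
phrase_patterns = [nlp.make_doc(text) for text in phrase_list]
matcher.add("terminology", phrase_patterns)  # assumption: spaCy v3 signature

doc = nlp("angela markel met BARAK OBAMA")
matches = sorted(matcher(doc), key=lambda m: m[1])
found = [doc[start:end].text for _, start, end in matches]
print(found)
```

Using `nlp.make_doc` instead of `nlp` skips the rest of the pipeline, which can matter when the phrase list is long.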


### Part-of-Speech Tagging

In [22]:
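doc4 = "Apple is trying to Buy the U.K startup for $1billion dollars"
postext = nlp(doc4)
# pos_ is the coarse-grained tag, tag is the integer hash of the fine-grained tag,
# and spacy.explain(tag_) describes the fine-grained tag in words
for token in postext:
  print(token.text, token.pos_, token.tag, spacy.explain(token.tag_))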



Apple PROPN 15794550382381185553 noun, proper singular
is AUX 13927759927860985106 verb, 3rd person singular present
trying VERB 1534113631682161808 verb, gerund or present participle
to PART 5595707737748328492 infinitival "to"
Buy VERB 14200088355797579614 verb, base form
the DET 15267657372422890137 determiner
U.K PROPN 15794550382381185553 noun, proper singular
startup NOUN 15308085513773655218 noun, singular or mass
for ADP 1292078113972184607 conjunction, subordinating or preposition
$ SYM 11283501755624150392 symbol, currency
1billion NUM 8427216679587749980 cardinal number
dollars NOUN 783433942507015291 noun, plural


In [26]:
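# Count tokens per coarse-grained POS tag; the keys are integer attribute IDs,
# so look each one up in the vocab to get its readable name
for key, val in postext.count_by(spacy.attrs.POS).items():
  print(key, postext.vocab[key].text, val)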

96 PROPN 2
87 AUX 1
100 VERB 2
94 PART 1
90 DET 1
92 NOUN 2
85 ADP 1
99 SYM 1
93 NUM 1
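
`count_by` returns integer IDs that need a vocab lookup. If only the readable tag names are needed, counting `token.pos_` with `collections.Counter` is a possible alternative. The sketch below assumes spaCy v3, whose `Doc` constructor accepts a `pos` list. The tags are hand-copied from the first five rows of the tagging output above so that no trained model is required.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-labelled tokens copied from the first five rows of the tagging output
words = ["Apple", "is", "trying", "to", "Buy"]
pos = ["PROPN", "AUX", "VERB", "PART", "VERB"]
doc = Doc(nlp.vocab, words=words, pos=pos)  # assumption: spaCy v3 Doc constructor

# Count the string form of each coarse-grained tag directly
pos_counts = Counter(token.pos_ for token in doc)
print(pos_counts)
```

With a real pipeline, the same `Counter` line works on any tagged `Doc`, such as `postext`.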


In [31]:
from spacy import displacy
displacy.render(docs=postext,style='dep',jupyter=True,options={'distance':100})

### Named Entity Recognition

Finding the named entities in a given sentence.

In [34]:
s1 = "The Apple want to buy some Company in U.K for $1 billion"
s2= "san francico tries to disable sideways"

ner_s1 = nlp(s1)
ner_s2 = nlp(s2)
ner_s1.ents

(Apple, U.K, $1 billion)

In [37]:
for ent in ner_s1.ents:
  print(ent.text,ent.label_)

for ent in ner_s2.ents:
  print(ent.text,ent.label_)

Apple ORG
U.K GPE
$1 billion MONEY
san francico GPE


In [38]:
# adding new entities
s3 = "facebook is hiring the new V.P"
ner_s3 = nlp(s3) 
for ent in ner_s3.ents:
  print(ent.text,ent.label_)

V.P GPE


In [40]:
## Adding "facebook" as an ORG entity
from spacy.tokens import Span

ORG = ner_s3.vocab.strings['ORG']            # hash of the ORG label
new_ent = Span(ner_s3, 0, 1, label=ORG)      # token 0 is "facebook"
ner_s3.ents = list(ner_s3.ents) + [new_ent]  # append to the existing entities
for ent in ner_s3.ents:
  print(ent.text, ent.label_)

facebook ORG
V.P GPE
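
The hash lookup through `vocab.strings` is not strictly required: `Span` also accepts the label as a plain string. A minimal sketch, using a blank pipeline (an assumption made so that no model is needed, which also means `doc.ents` starts empty):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # no NER component, so doc.ents starts empty
doc = nlp("facebook is hiring the new V.P")

# A string label avoids the explicit vocab.strings lookup
new_ent = Span(doc, 0, 1, label="ORG")
doc.ents = list(doc.ents) + [new_ent]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Entity spans must not overlap, so a new span that covers tokens already inside an existing entity would be rejected when assigned to `doc.ents`.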


In [42]:
displacy.render(docs=ner_s3,style='ent',jupyter=True)

In [43]:
displacy.render(docs=ner_s3,style='ent',options={'ents':['ORG']},jupyter=True)

### Sentence Segmentation

In [47]:
s4 = "This is first sentance in u.s.. The second one in U.K.. There is the third sentence"
sent_seg = nlp(s4)
for sent in sent_seg.sents:
  print(sent.text)


This is first sentance in u.s..
The second one in U.K..
There is the third sentence


In [60]:
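def sentence_breaker(doc):
  """Start a new sentence after every semicolon."""
  for token in doc[:-1]:   # skip the last token: nothing follows it
    if token.text == ';':
      print(token.i)       # index of the semicolon token
      doc[token.i+1].is_sent_start = True
  return doc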


In [62]:
s4 = "This is first sentance in u.s.; The second one in U.K.; There is the third sentence"
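## Now we need to set custom sentence boundaries
nlp1 = spacy.load("en_core_web_sm")
print(nlp1.pipe_names)
# The component must run before the parser so the parser respects the preset boundaries
# (spaCy v2 style: the function itself is passed to add_pipe)
nlp1.add_pipe(sentence_breaker, before='parser')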
print("after adding custom sentence breaker")
print(nlp1.pipe_names)

['tagger', 'parser', 'ner']
after adding custom sentence breaker
['tagger', 'sentence_breaker', 'parser', 'ner']


In [63]:
sent_seg = nlp1(s4)
for sent in sent_seg.sents:
  print(sent.text)

7
13
This is first sentance in u.s.;
The second one in U.K.;
There is the third sentence
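
Passing a function directly to `add_pipe` is the spaCy v2 style. In spaCy v3, a custom component is registered under a name first and then added by that name. The sketch below assumes v3. The component name `semicolon_breaker` and the sample text are hypothetical. It runs on a blank pipeline with no parser, so the component marks every token explicitly instead of leaving the remaining boundaries to the parser.

```python
import spacy
from spacy.language import Language


@Language.component("semicolon_breaker")  # hypothetical component name
def semicolon_breaker(doc):
    # Mark every token after the first: True right after ';', False otherwise
    for token in doc[1:]:
        token.is_sent_start = doc[token.i - 1].text == ";"
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("semicolon_breaker")  # v3: components are added by registered name

doc = nlp("First part; second part; third part")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

With a trained pipeline, the equivalent call would be `nlp.add_pipe("semicolon_breaker", before="parser")`, and the component would set only the `True` boundaries, as in the cell above.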
