# spaCy Tutorial - NLP with Python

## Introduction

spaCy is a popular open-source natural language processing (NLP) library that is designed to be fast, efficient, and production-ready.

## Basic NLP Tasks with spaCy

In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")
print(type(nlp))  # show the class of the loaded pipeline
text = "India is beautiful country."
doc = nlp(text)

for token in doc:
    print(token.text, token.pos_)

<class 'spacy.lang.en.English'>
India PROPN
is AUX
beautiful ADJ
country NOUN
. PUNCT


## Named Entity Recognition (NER)

In [34]:
for ent in doc.ents:
    print(ent.text, ent.label_)

India GPE


## Dependency Parsing

In [35]:

for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_)

India nsubj is AUX
is ROOT is AUX
beautiful amod country NOUN
country attr is AUX
. punct is AUX


In [36]:
print(spacy.explain('amod'))

adjectival modifier


## Advanced NLP Techniques

#### Word Vectors and Similarity

In [37]:
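# Note: en_core_web_sm ships without static word vectors, so this score is
# only a rough estimate; en_core_web_md or en_core_web_lg include word vectors.
word1 = nlp("king")
word2 = nlp("prince")
# word2 = nlp("queen")
similarity = word1.similarity(word2)
print("Similarity:", similarity)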

Similarity: 0.7899729502051105
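
When a pipeline has word vectors, spaCy's default `similarity` estimate is the cosine similarity between the vectors of the two objects. The following is a minimal plain-Python sketch of that computation. The three-dimensional vectors are made-up toy values, not real spaCy embeddings, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

# Toy vectors (illustrative values only, not taken from any model)
king = [0.9, 0.8, 0.1]
prince = [0.8, 0.9, 0.2]
banana = [0.1, 0.0, 0.9]

print(cosine_similarity(king, prince))  # close to 1: vectors point the same way
print(cosine_similarity(king, banana))  # much lower: vectors point apart
```

Values near 1 mean the vectors point in nearly the same direction, and values near 0 mean they are close to orthogonal.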




## Text Preprocessing With spaCy

In [38]:
text = "I am running in the park"
doc = nlp(text)
print(doc)

I am running in the park


## Tokenization

In [39]:
tokens = [token.text for token in doc]
print(tokens)

['I', 'am', 'running', 'in', 'the', 'park']


## Lemmatization

In [40]:
lemmas = [token.lemma_ for token in doc]

In [41]:
print(lemmas)

['I', 'be', 'run', 'in', 'the', 'park']


## Removing Stop Words and Punctuation

In [42]:
filtered_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(filtered_words)

['running', 'park']


## POS Tagging

In [43]:
pos_tags = [(token.text, token.pos_) for token in doc]
print(pos_tags)

[('I', 'PRON'), ('am', 'AUX'), ('running', 'VERB'), ('in', 'ADP'), ('the', 'DET'), ('park', 'NOUN')]


## NER

In [44]:
doc1 = nlp("I lived in India")

In [45]:
# Use label_ (with underscore) for the readable label; label is its integer ID
entities = [(ent.text, ent.label_) for ent in doc1.ents]
print(entities)

[('India', 'GPE')]


## TextBlob

In [46]:
from textblob import TextBlob

In [47]:
text = "The brown dog chased the cat. Rahul is a bad boy"
blob = TextBlob(text)

In [48]:
# tokenization
words = blob.words
print(words)
sentences = blob.sentences
print(sentences)

['The', 'brown', 'dog', 'chased', 'the', 'cat', 'Rahul', 'is', 'a', 'bad', 'boy']
[Sentence("The brown dog chased the cat."), Sentence("Rahul is a bad boy")]


In [49]:
# POS tagging
pos_tags = blob.tags
print(pos_tags)

[('The', 'DT'), ('brown', 'JJ'), ('dog', 'NN'), ('chased', 'VBD'), ('the', 'DT'), ('cat', 'NN'), ('Rahul', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('bad', 'JJ'), ('boy', 'NN')]


In [50]:
# Noun phrases
noun_phrases = blob.noun_phrases
noun_phrases

WordList(['brown dog', 'rahul', 'bad boy'])

In [51]:
# Spell Checker
sentence = "i havv a book"
obj = TextBlob(sentence)
obj.correct()

TextBlob("i have a book")

In [52]:
obj.words[3].pluralize()

'books'

In [53]:
text = "I hate this country"
obj = TextBlob(text)

In [54]:
obj.sentiment

Sentiment(polarity=-0.8, subjectivity=0.9)

## N-Grams in NLP

In NLP, an n-gram is a contiguous sequence of n items (or tokens) from a given sample of text or speech.

In [55]:
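import nltk
from nltk.util import ngrams

# word_tokenize needs NLTK's tokenizer data, e.g. nltk.download("punkt")
# (recent NLTK releases use "punkt_tab" instead)
sentence = "this is a sentence"
tokens = nltk.word_tokenize(sentence)

bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print('Bigrams:', bigrams)
print()
print('trigrams:', trigrams)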



Bigrams: [('this', 'is'), ('is', 'a'), ('a', 'sentence')]

trigrams: [('this', 'is', 'a'), ('is', 'a', 'sentence')]
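
The sliding-window idea behind n-grams can also be sketched without NLTK. In this minimal example, `make_ngrams` is a hypothetical helper, and a plain whitespace split stands in for a real tokenizer.

```python
def make_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace split is an assumption standing in for a proper tokenizer
tokens = "this is a sentence".split()

print(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 3))
```

A sequence of length L yields L - n + 1 n-grams, so asking for an n larger than the sequence returns an empty list.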


## Applications of N-grams

Language modeling

Text classification

Spell checking

Machine translation
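
As one illustration of the spell checking use case, character n-grams can rank candidate corrections by how much they overlap with a misspelled word. This is a toy sketch under stated assumptions: it is not how TextBlob's `correct()` works, and the helper names and the three-word candidate list are hypothetical.

```python
def char_ngrams(word, n=2):
    """Set of contiguous character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(word, candidates, n=2):
    """Pick the candidate whose character n-grams overlap most with the word."""
    target = char_ngrams(word, n)
    return max(candidates, key=lambda c: jaccard(target, char_ngrams(c, n)))

candidates = ["have", "hive", "book"]  # hypothetical mini-dictionary
print(best_match("havv", candidates))  # have
```

Here "havv" and "have" share the bigrams "ha" and "av", while the other candidates share none, so "have" is selected.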

## N-grams in Language Modeling 

In [56]:
from nltk.lm import MLE # Maximum Likelihood Estimation (MLE)
from nltk.lm.preprocessing import padded_everygram_pipeline

<b>padded_everygram_pipeline</b> = pad_sequence + everygrams

<b>pad_sequence</b>: padding adds special boundary tokens (&lt;s&gt; at the start of a sentence and &lt;/s&gt; at the end) so that n-grams at the edges of a sentence still have context. It does not make all sentences the same length.

<b>everygrams</b>: a function that generates all n-grams (contiguous sequences of n items, usually words) of every length from 1 up to a chosen maximum from a given input sequence.
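
The two steps can be sketched in plain Python to show what the pipeline produces. `pad_tokens` and `all_ngrams` are hypothetical stand-ins written for illustration, not the NLTK functions themselves, and the order in which NLTK yields its n-grams may differ from this sketch.

```python
def pad_tokens(tokens, n, left="<s>", right="</s>"):
    """Add n - 1 boundary symbols on each side of a token list."""
    return [left] * (n - 1) + list(tokens) + [right] * (n - 1)

def all_ngrams(tokens, max_len):
    """All n-grams of length 1..max_len, shortest lengths first."""
    return [
        tuple(tokens[i:i + size])
        for size in range(1, max_len + 1)
        for i in range(len(tokens) - size + 1)
    ]

padded = pad_tokens(["I", "love", "nlp"], n=2)
print(padded)
print(all_ngrams(padded, max_len=2))
```

With n = 2, one boundary symbol is added on each side, so the bigram ("&lt;s&gt;", "I") records that "I" starts the sentence.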

In [59]:
text = "I love working natural language processing. I love nlp most."
sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text)]

# Build the padded training n-grams and the vocabulary stream
n = 3  # use trigrams
train_data, padded_sents = padded_everygram_pipeline(n, sentences)

model = MLE(n)  # create a trigram model
model.fit(train_data, padded_sents)

# Generate one word. generate(1) returns a single string, so print it
# directly; " ".join() on a string would put a space between its letters.
next_word = model.generate(1, random_seed=42)
print("Generated text:", next_word)

Generated text: language


In [63]:
# Previous word
previous_word = "love"

# Generate the next three words, conditioned on the previous word
next_words = model.generate(3, random_seed=42, text_seed=[previous_word])
print("Generated text:", " ".join(next_words))


Generated text: working natural language
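
Maximum likelihood estimation scores a word by relative frequency: the number of times it follows a given context divided by the number of times that context occurs. A minimal sketch with a bigram model follows. It assumes whitespace tokenization and no padding, and `train_bigram_counts` and `mle_prob` are hypothetical helpers, not part of the NLTK API.

```python
from collections import Counter, defaultdict

def train_bigram_counts(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def mle_prob(counts, prev, word):
    """Relative frequency: count(prev, word) / count(prev, any word)."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

# Same two sentences as above, pre-split on whitespace (an assumption)
sentences = [
    "I love working natural language processing .".split(),
    "I love nlp most .".split(),
]
counts = train_bigram_counts(sentences)

print(mle_prob(counts, "love", "working"))  # 0.5
print(mle_prob(counts, "I", "love"))        # 1.0
```

"love" is followed once by "working" and once by "nlp", so each gets probability 0.5, while any unseen pair gets 0. That zero probability for unseen n-grams is the main limitation of unsmoothed MLE.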
