<a href="https://colab.research.google.com/github/rushikeshnaik779/EDA/blob/master/pytorch_for_nlp/Pytorch_NLP_Basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import spacy 
nlp = spacy.load('en')
text = "Marry, don't slap the green witch"
print([str(token) for token in nlp(text.lower())])

['marry', ',', 'do', "n't", 'slap', 'the', 'green', 'witch']


In [4]:
from nltk.tokenize import TweetTokenizer 
tweet=u"Snow White and the Seven Degrees #MakeAMovieCold@midnight :­-)" 
tokenizer = TweetTokenizer() 
print(tokenizer.tokenize(tweet.lower()))

['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':', '\xad', '-', ')']


In [5]:
# Generating N-grams 

def n_grams(text, n=1):
    """
    takes tokens or text, returns a list of n-grams 
    """
    return [text[i: i+n] for i in range(len(text)-n+1)]

In [7]:
cleaned = ['mary', ',', "n't", 'slap', 'green', 'witch', '.'] 
print(n_grams(cleaned, 5))

[['mary', ',', "n't", 'slap', 'green'], [',', "n't", 'slap', 'green', 'witch'], ["n't", 'slap', 'green', 'witch', '.']]


In [8]:
# Lemmatization: reducing words to their root forms

import spacy 
nlp = spacy.load("en")
doc = nlp(u"he was running late")
for token in doc: 
    print('{} --> {}'.format(token, token.lemma_))

he --> -PRON-
was --> be
running --> run
late --> late


spaCy, for example, uses a predefined dictionary, called WordNet, for extracting lemmas, but lemmatization can be framed as a machine learning problem requiring an understanding of the morphology of the language.

In [9]:
# CATEGORIZATION SENTENCES AND DOCUMENTS 

Categorizing or classifying documents is probably one of the earliest applications of NLP. The TF and TF­IDF representations we described in hapter 1 are immediately useful for classifying and categorizing longer chunks of text such as documents or sentences. Problems such as assigning topic labels, predicting sentiment of reviews, filtering spam emails, language identification, and email triaging can be framed as supervised document classification problems. 

In [12]:
# Categorizing Words: POS Tagging

import spacy 
nlp = spacy.load('en')

doc = nlp(u"Marry slapped the green witch . !! !!!! ++ ++")
for token in doc: 
    print("{} --> {}".format(token, token.pos_))

Marry --> PROPN
slapped --> VERB
the --> DET
green --> ADJ
witch --> NOUN
. --> PUNCT
! --> PUNCT
! --> PUNCT
! --> PUNCT
! --> PUNCT
! --> PUNCT
! --> PUNCT
+ --> CCONJ
+ --> NOUN
+ --> NOUN
+ --> NOUN


In [15]:
# Categorizing Spans: Chunking and Named Entity Recognition

import spacy 
nlp = spacy.load('en')
doc = nlp(u"his is called chunking or shallow parsing. Shallow parsing aims to derive higher­order units composed of the grammatical atoms, like nouns, verbs, adjectives, and so on. It is possible to write regular expressions over the part­of­speech tags to approximate shallow parsing if you do not have data to train models for shallow parsing. Fortunately,")

In [16]:
for chunk in doc.noun_chunks:
    print ('{} ­ {}'.format(chunk, chunk.label_))

higher­order units ­ NP
the grammatical atoms ­ NP
nouns ­ NP
verbs ­ NP
adjectives ­ NP
It ­ NP
regular expressions ­ NP
the part­of­speech tags ­ NP
you ­ NP
data ­ NP
models ­ NP
shallow parsing ­ NP
