<a href="https://colab.research.google.com/github/wdase/AI-and-ML-projects/blob/master/NLP_g_colab_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing Lab Practice using Python on Google Colab

Tokenize sentences and words, remove stopwords, use stemmer & lemmatizer

**Tokenisation** - Splitting larger units of text into smaller ones. We can tokenize paragraphs into sentences and sentences into words. The result is a list of tokens (the words we actually want) in place of the raw text string.

**Stemming** - Removing affixes from a word to get its stem. The stem is not always a real word (for example, `natur` for `Natural`).

**Lemmatization** - Similar to stemming, but the output differs: a lemma is a real word, not just a trimmed string. For the code below to work, you have to download the `wordnet` package for NLTK.

> from nltk.stem import WordNetLemmatizer    
lemmatizer = WordNetLemmatizer()


**Stop words**: Very common English words such as “the,” “of,” “a,” and “an” are called ‘stop words’. Stop words differ from language to language. Because they carry little meaning on their own and can skew results such as word counts, they are usually removed before analysis.
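
As a minimal sketch of stop word removal, the snippet below filters a token list against a small hand-written set. The names `SAMPLE_STOP_WORDS` and `remove_stop_words` are illustrative assumptions, not part of NLTK; in this notebook the full English list would come from `nltk.corpus.stopwords.words('english')` once the `stopwords` package is downloaded.

```python
# Hypothetical, hand-picked stop word set for illustration only.
# NLTK's own English list is much longer.
SAMPLE_STOP_WORDS = {"the", "of", "a", "an", "is", "in", "and", "to"}


def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Return the tokens that are not stop words (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["The", "process", "of", "tokenization", "is", "a", "first", "step"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Lowercasing each token before the lookup keeps "The" and "the" from being treated differently.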

**Count word frequency** - Counting how often each word occurs is a basic part of language analysis. NLTK ships with a frequency counter (`nltk.FreqDist`) that counts how many times each word appears in a dataset.
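
The counting idea can be sketched with the standard library alone. NLTK's `FreqDist` is a subclass of `collections.Counter`, so the calls shown here (indexing and `most_common`) should carry over to it; the token list is made up for illustration.

```python
from collections import Counter

# Made-up token list; in practice this would come from word_tokenize.
tokens = ["the", "root", "word", "links", "the", "words", "to", "the", "root"]

# Count occurrences of each token. nltk.FreqDist(tokens) offers the same
# Counter-style interface.
frequencies = Counter(tokens)

print(frequencies["the"])          # count for a single word
print(frequencies.most_common(2))  # the two most frequent tokens
```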

**Synonyms/Antonyms** - We can also look up the synonyms and antonyms of English words.
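
One possible way to do this lookup is through NLTK's WordNet interface, sketched below. The helper name `synonyms_and_antonyms` is an assumption made for this example, and calling it requires the `wordnet` package to be downloaded first.

```python
def synonyms_and_antonyms(word):
    """Collect WordNet synonyms and antonyms for an English word."""
    # Imported inside the function so that defining it does not need the
    # corpus; calling it requires nltk.download('wordnet').
    from nltk.corpus import wordnet

    synonyms, antonyms = set(), set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            # Antonyms are attached to lemmas, not to synsets.
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    synonyms.discard(word)  # the word itself is not a useful synonym
    return sorted(synonyms), sorted(antonyms)
```

A call such as `synonyms_and_antonyms("good")` returns two sorted lists. Multi-word entries come back with underscores in place of spaces.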

**POS Tagging**: English has different parts of speech (POS) such as nouns, verbs, pronouns, and adjectives. POS tagging analyzes the words in a sentence and assigns each one a POS tag depending on how it is used. It is also called grammatical tagging or word-category disambiguation. In NLTK, use `nltk.pos_tag`. Several tagsets exist; the most common are the Penn Treebank tagset and the Universal tagset.



In [5]:
# Setup
!pip install -q wordcloud
import wordcloud

import nltk
nltk.download('stopwords')                     # for stop words
nltk.download('wordnet')                       # for the lemmatizer
nltk.download('punkt')                         # for the sentence and word tokenizers
nltk.download('averaged_perceptron_tagger')    # for POS tagging

import pandas as pd
import matplotlib.pyplot as plt
import io
import unicodedata
import numpy as np
import re
import string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


# Tokenizer

In [0]:
from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "Word tokenization is the process of splitting a large sample of text into words. We can also tokenize the sentences in a paragraph like we tokenized the words. We use the functions word_tokenize and sent_tokenize to achieve this."
print(word_tokenize(EXAMPLE_TEXT))
print(sent_tokenize(EXAMPLE_TEXT))

# Stemmer 

In [7]:
import nltk
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
word_data = "In the areas of Natural Language Processing we come across situation where two or more words have a common root. For example, the three words - agreed, agreeing and agreeable have the same root word agree. A search involving any of these words should treat them as the same word which is the root word. So it becomes essential to link all the words into their root word. The NLTK library has methods to do this linking and give the output showing the root word. This program uses the Porter Stemming Algorithm for stemming."


nltk_tokens = nltk.word_tokenize(word_data)
for w in nltk_tokens:
    print("Actual: %s  Stem: %s" % (w, porter_stemmer.stem(w)))

Actual: In  Stem: In
Actual: the  Stem: the
Actual: areas  Stem: area
Actual: of  Stem: of
Actual: Natural  Stem: natur
Actual: Language  Stem: languag
Actual: Processing  Stem: process
Actual: we  Stem: we
Actual: come  Stem: come
Actual: across  Stem: across
Actual: situation  Stem: situat
Actual: where  Stem: where
Actual: two  Stem: two
Actual: or  Stem: or
Actual: more  Stem: more
Actual: words  Stem: word
Actual: have  Stem: have
Actual: a  Stem: a
Actual: common  Stem: common
Actual: root  Stem: root
Actual: .  Stem: .
Actual: For  Stem: for
Actual: example  Stem: exampl
Actual: ,  Stem: ,
Actual: the  Stem: the
Actual: three  Stem: three
Actual: words  Stem: word
Actual: -  Stem: -
Actual: agreed  Stem: agre
Actual: ,  Stem: ,
Actual: agreeing  Stem: agre
Actual: and  Stem: and
Actual: agreeable  Stem: agreeabl
Actual: have  Stem: have
Actual: the  Stem: the
Actual: same  Stem: same
Actual: root  Stem: root
Actual: word  Stem: word
Actual: agree  Stem: agre
Actual: .  Stem: .
...

# Lemmatizer

In [8]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
word_data = "Lemmatization is similar to stemming but it brings context to the words. So it goes a steps further by linking words with similar meaning to one word. For example if a paragraph has words like cars, trains and automobile, then it will link all of them to automobile. In the below program we use the WordNet lexical database for lemmatization."
nltk_tokens = nltk.word_tokenize(word_data)

for w in nltk_tokens:
    print("Actual: %s  Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))

Actual: Lemmatization  Lemma: Lemmatization
Actual: is  Lemma: is
Actual: similar  Lemma: similar
Actual: to  Lemma: to
Actual: stemming  Lemma: stemming
Actual: but  Lemma: but
Actual: it  Lemma: it
Actual: brings  Lemma: brings
Actual: context  Lemma: context
Actual: to  Lemma: to
Actual: the  Lemma: the
Actual: words  Lemma: word
Actual: .  Lemma: .
Actual: So  Lemma: So
Actual: it  Lemma: it
Actual: goes  Lemma: go
Actual: a  Lemma: a
Actual: steps  Lemma: step
Actual: further  Lemma: further
Actual: by  Lemma: by
Actual: linking  Lemma: linking
Actual: words  Lemma: word
Actual: with  Lemma: with
Actual: similar  Lemma: similar
Actual: meaning  Lemma: meaning
Actual: to  Lemma: to
Actual: one  Lemma: one
Actual: word  Lemma: word
Actual: .  Lemma: .
Actual: For  Lemma: For
Actual: example  Lemma: example
Actual: if  Lemma: if
Actual: a  Lemma: a
Actual: paragraph  Lemma: paragraph
Actual: has  Lemma: ha
Actual: words  Lemma: word
Actual: like  Lemma: like
Actual: cars  Lemma: car
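
In the output above, `has` becomes `ha` and `brings` is left unchanged. `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, which likely explains these results. The sketch below shows one way to pass a part of speech by mapping Penn Treebank tags from `nltk.pos_tag` to WordNet's single-letter codes. The helper names `penn_to_wordnet_pos` and `lemmatize_with_pos` are assumptions made for this example.

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if penn_tag.startswith("J"):
        return "a"  # adjective
    if penn_tag.startswith("V"):
        return "v"  # verb
    if penn_tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also the lemmatizer's own default


def lemmatize_with_pos(tokens):
    """Lemmatize tokens using the POS tag predicted for each one."""
    # Imported here so the mapping above works without NLTK data; calling
    # this function needs 'wordnet' and 'averaged_perceptron_tagger'.
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token, pos=penn_to_wordnet_pos(tag))
            for token, tag in nltk.pos_tag(tokens)]
```

With the setup cell already run, `lemmatize_with_pos(nltk_tokens)` can be compared against the noun-only output above.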


# POS Tagging

In [9]:
import nltk
text = nltk.word_tokenize("And now for something compeletely")
print(nltk.pos_tag(text))


[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('compeletely', 'RB')]
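
The tags in the output above are Penn Treebank codes. As a reading aid, the sketch below pairs each tag from that output with its meaning. The dictionary `PENN_TAG_MEANINGS` is a hand-written assumption covering only these four tags; NLTK's `nltk.help.upenn_tagset` can print the full list if the `tagsets` data package is available.

```python
# Meanings of the Penn Treebank tags that appear in the output above.
# Hand-written for illustration; this is not the full tagset.
PENN_TAG_MEANINGS = {
    "CC": "coordinating conjunction",
    "RB": "adverb",
    "IN": "preposition or subordinating conjunction",
    "NN": "noun, singular or mass",
}

# The tagged tokens exactly as printed by nltk.pos_tag above.
tagged = [("And", "CC"), ("now", "RB"), ("for", "IN"),
          ("something", "NN"), ("compeletely", "RB")]

descriptions = [PENN_TAG_MEANINGS.get(tag, "unknown") for _, tag in tagged]
for (word, tag), meaning in zip(tagged, descriptions):
    print("%-12s %-3s %s" % (word, tag, meaning))
```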
