**TEXT ANALYSIS**
## 1. Problem Statement
1. Extract a sample document and apply the following document preprocessing methods: tokenization, POS tagging, stop word removal, stemming, and lemmatization.
2. Create a representation of the document by calculating Term Frequency and Inverse Document Frequency.


**Import and install required packages**

In [None]:
pip install nltk




In [None]:
import nltk
import re


**Download the required NLTK data**

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True


## 2. Data Collection

In [None]:
text= "Tokenization is the first step in text analytics. The process of breaking down a text paragraph into smaller chunks such as words or sentences is called Tokenization."

## 3. Exploratory Data Analysis
**Perform Tokenization**

Sentence Tokenization

In [None]:
from nltk.tokenize import sent_tokenize
tokenized_text= sent_tokenize(text)
print(tokenized_text)

['Tokenization is the first step in text analytics.', 'The process of breaking down a text paragraph into smaller chunks such as words or sentences is called Tokenization.']


Word Tokenization

In [None]:
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)

['Tokenization', 'is', 'the', 'first', 'step', 'in', 'text', 'analytics', '.', 'The', 'process', 'of', 'breaking', 'down', 'a', 'text', 'paragraph', 'into', 'smaller', 'chunks', 'such', 'as', 'words', 'or', 'sentences', 'is', 'called', 'Tokenization', '.']


**Removing Punctuation and Stop Words**

In [None]:
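from nltk.corpus import stopwords
# Load NLTK's English stop-word list as a set for fast membership checks
stop_words=set(stopwords.words("english"))
print(stop_words)
text= "How to remove stop words with NLTK library in Python?"
# Replace every non-letter character (punctuation, digits) with a space
text= re.sub('[^a-zA-Z]', ' ',text)
# Lowercase before tokenizing because the stop-word list is lowercase
tokens = word_tokenize(text.lower())
filtered_text=[]
for w in tokens:
  if w not in stop_words:
    filtered_text.append(w)
print("Tokenized Sentence:",tokens)
print("Filtered Sentence:",filtered_text)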

{'hasn', 'won', 'can', 'below', 'didn', 'where', "she's", 'his', 'yourself', 'an', 'during', "wouldn't", 'how', 'between', 'she', 'him', 'over', 'own', "hadn't", 'are', 'and', "you've", 'such', 'further', 'd', 'he', "needn't", "mightn't", 'ours', 'most', 'from', 'aren', 'up', "won't", 'while', 'with', "that'll", 'ma', 'which', 'through', 'theirs', 'll', 'yours', 'her', 're', 'you', 'nor', 'haven', 'mightn', 'were', 'me', 'few', 'does', 'being', 'why', 's', 'o', "doesn't", "didn't", 'couldn', 'yourselves', 'shan', 'hadn', "weren't", "you're", 'doesn', 'doing', "wasn't", 'same', 'these', 'was', "isn't", 'their', 'under', 'whom', 'm', 'too', 'herself', 'after', 'have', 'there', 'or', 'so', 'any', 'myself', 'that', 'isn', 'we', 'your', 'at', 'needn', 'before', "couldn't", "you'll", 'just', 'did', 'as', 'himself', 'if', 'than', 'wasn', 'when', 'to', "you'd", 'having', 'of', 'by', 'y', "shan't", 'been', 'mustn', 'who', 'each', 'had', 'more', "mustn't", 'hers', "hasn't", 'the', 'down', 'on', 
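
The filtering loop above can also be written as a list comprehension. The sketch below is a minimal illustration that runs without any NLTK downloads; `demo_stop_words` is a small hand-picked set assumed for this example, not NLTK's list, and a plain `split()` stands in for `word_tokenize`.

```python
import re

# Hypothetical, hand-picked stop-word set used only for this sketch;
# NLTK's English list is much longer.
demo_stop_words = {"how", "to", "with", "in"}

sentence = "How to remove stop words with NLTK library in Python?"
# Replace every non-letter with a space, lowercase, and split on whitespace.
demo_tokens = re.sub(r"[^a-zA-Z]", " ", sentence).lower().split()
# Keep only the tokens that are not stop words.
demo_filtered = [w for w in demo_tokens if w not in demo_stop_words]
print(demo_filtered)
```

Using a set for the stop words keeps each membership check constant-time, which matters once the document is longer than a single sentence.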

**Perform Stemming**

In [None]:
from nltk.stem import PorterStemmer
e_words= ["wait", "waiting", "waited", "waits"]
ps =PorterStemmer()
for w in e_words:
  rootWord=ps.stem(w)
  # Print inside the loop so the stem of every word is shown
  print(rootWord)

wait
wait
wait
wait


**Perform Lemmatization**

In [None]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
  print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))

Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry


**Apply POS Tagging to text**

In [None]:
import nltk
from nltk.tokenize import word_tokenize
data="The pink sweater fit her perfectly"
words=word_tokenize(data)
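# Note: each word is tagged on its own here, so the tagger sees no sentence
# context (which is why 'pink' and 'fit' come out as NN below). Passing the
# whole token list, nltk.pos_tag(words), lets the tagger use context.
for word in words:
  print(nltk.pos_tag([word]))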

[('The', 'DT')]
[('pink', 'NN')]
[('sweater', 'NN')]
[('fit', 'NN')]
[('her', 'PRP$')]
[('perfectly', 'RB')]


***Algorithm for creating a document representation by calculating TF-IDF***

**Step 1: Import the necessary libraries.**

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

**Step 2: Initialize the Documents.**

In [None]:
documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the Sun'

**Step 3: Create a Bag of Words (BoW) for Documents A and B.**

In [None]:
bagOfWordsA = documentA.split(' ')
bagOfWordsB = documentB.split(' ')

**Step 4: Create a collection of unique words from Documents A and B.**

In [None]:
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))

**Step 5: Create a dictionary of words and their occurrences for each document in the corpus.**

In [None]:
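# Start every unique word at 0, then count the occurrences in Document A
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
  numOfWordsA[word] += 1
# Same for Document B (initialized once, outside the loop above)
numOfWordsB = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsB:
  numOfWordsB[word] += 1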

**Step 6: Compute the term frequency for each of our documents.**

In [None]:
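def computeTF(wordDict, bagOfWords):
  # Term frequency: occurrences of a word divided by the document length
  tfDict = {}
  bagOfWordsCount = len(bagOfWords)
  for word, count in wordDict.items():
    tfDict[word] = count / float(bagOfWordsCount)
  return tfDict

# Call the function at top level (not after the return statement)
tfA = computeTF(numOfWordsA, bagOfWordsA)
tfB = computeTF(numOfWordsB, bagOfWordsB)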
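
As a worked check of what `computeTF` computes, the term frequency can be written as below. The symbols $f_{t,d}$ (occurrences of term $t$ in document $d$) and $|d|$ (number of tokens in $d$) are notation assumed here for illustration; they do not appear in the code. Document A has 5 tokens and Document B has 8, with "the" appearing twice in B.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|},
\qquad
\mathrm{tf}(\text{Jupiter}, A) = \frac{1}{5} = 0.2,
\qquad
\mathrm{tf}(\text{the}, B) = \frac{2}{8} = 0.25
```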

**Step 7: Compute the term Inverse Document Frequency.**

In [None]:
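import math

def computeIDF(documents):
  N = len(documents)
  # Count the number of documents in which each word occurs
  idfDict = dict.fromkeys(documents[0].keys(), 0)
  for document in documents:
    for word, val in document.items():
      if val > 0:
        idfDict[word] += 1
  # IDF: natural log of (number of documents / document frequency)
  for word, val in idfDict.items():
    idfDict[word] = math.log(N / float(val))
  return idfDict

# Call the function at top level (not after the return statement)
idfs = computeIDF([numOfWordsA, numOfWordsB])
idfs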
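
The quantity returned by `computeIDF` can be written as follows, where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$ (the name $\mathrm{df}$ is notation assumed here, not taken from the code). With the two documents above, a word found in both gets an IDF of zero, and a word found in only one gets $\ln 2$.

```latex
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{idf}(\text{the}) = \ln\frac{2}{2} = 0,
\qquad
\mathrm{idf}(\text{Jupiter}) = \ln\frac{2}{1} \approx 0.693
```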

**Step 8: Compute the TF-IDF for all words.**

In [None]:
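def computeTFIDF(tfBagOfWords, idfs):
  # TF-IDF: term frequency multiplied by inverse document frequency
  tfidf = {}
  for word, val in tfBagOfWords.items():
    tfidf[word] = val * idfs[word]
  return tfidf

# Call the function at top level (not after the return statement)
tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)
# One row per document, one column per unique word
df = pd.DataFrame([tfidfA, tfidfB])
df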
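
The steps above treat "Planet" (Document A) and "planet" (Document B) as different words because `split(' ')` is case-sensitive. The sketch below is one possible compact variant, not the notebook's method: it lowercases the text first and uses `collections.Counter`. The helper name `tf_idf` is hypothetical, and unlike the step-by-step version it only stores words that actually occur in each document.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: tf-idf} dict per document (hypothetical helper)."""
    # Lowercase before splitting so 'Planet' and 'planet' count as one term.
    bags = [doc.lower().split() for doc in documents]
    n_docs = len(bags)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for bag in bags for word in set(bag))
    results = []
    for bag in bags:
        counts = Counter(bag)
        results.append({
            word: (count / len(bag)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return results

docs = ['Jupiter is the largest Planet',
        'Mars is the fourth planet from the Sun']
tfidf_a, tfidf_b = tf_idf(docs)
print(tfidf_a)
print(tfidf_b)
```

After lowercasing, "planet" occurs in both documents, so its TF-IDF drops to zero along with "is" and "the"; only the words unique to one document keep a non-zero weight.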