# Assignment 10: Text Analytics

Problem Statement

1. Extract a sample document and apply the following preprocessing methods: tokenization, POS tagging, stop word removal, stemming, and lemmatization.

2. Create a representation of the document by calculating Term Frequency (TF) and Inverse Document Frequency (IDF).



# Theory before Assignment

Natural Language Processing (NLP) begins by converting raw text into structured data. Each preprocessing step below handles one part of that conversion.

## Tokenization
Tokenization is the process of splitting raw text into smaller units called tokens (such as words, punctuation marks, or sentences).  
It is usually the first step in an NLP pipeline: it breaks the text into manageable pieces that every later step (stop word removal, tagging, counting) operates on.  
For example, the sentence `"Hello, world! NLP is fun."` can be tokenized into words or sentences. NLTK provides functions like `word_tokenize` and `sent_tokenize` for this purpose.
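
As a rough illustration of what a tokenizer does, the sketch below splits the example sentence with plain regular expressions. This is a stand-in, not NLTK's algorithm: `word_tokenize` and `sent_tokenize` handle abbreviations, contractions, and other edge cases that these two patterns (both assumptions made for this example) do not.

```python
import re

text = "Hello, world! NLP is fun."

# Word level: runs of word characters, or any single non-space symbol.
words = re.findall(r"\w+|[^\w\s]", text)

# Sentence level: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", text)

print(words)      # ['Hello', ',', 'world', '!', 'NLP', 'is', 'fun', '.']
print(sentences)  # ['Hello, world!', 'NLP is fun.']
```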

## Stop Words

Stop words are very common words (e.g. “and”, “the”, “is”) that carry little semantic weight in many tasks. In NLP preprocessing, we often filter them out to reduce noise. NLTK includes built-in stop word lists for multiple languages via `nltk.corpus.stopwords`. Removing stop words shrinks the vocabulary and speeds up processing, usually without losing crucial information.  

For example, from the sentence “The quick brown fox jumps over the lazy dog,” words like “the” and “over” might be removed.  

**Pitfall**: Overzealous removal may drop important context. For instance, the word “not” is often on stop lists but negates meaning (“not bad” vs “bad”). Also, stop lists are language-specific – be sure to use the correct language list.
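
A minimal sketch of a filter that guards against this pitfall: it lowercases tokens before the lookup (NLTK's English list is all lowercase, so “The” would otherwise survive) and keeps negations. The small `STOP_WORDS` set and the `KEEP` set are hand-written assumptions for this example; in practice you would start from `stopwords.words('english')`.

```python
# Tiny hand-written stop list, for illustration only.
STOP_WORDS = {"the", "over", "is", "a", "and", "not"}
# Words we deliberately keep even though the stop list contains them.
KEEP = {"not", "no", "nor"}

def remove_stop_words(tokens):
    """Drop stop words case-insensitively, but keep negations."""
    active = STOP_WORDS - KEEP
    return [t for t in tokens if t.lower() not in active]

fox = remove_stop_words("The quick brown fox jumps over the lazy dog".split())
print(fox)     # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

review = remove_stop_words("The film is not bad".split())
print(review)  # ['film', 'not', 'bad']
```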

## Stemming
Stemming reduces words to their root form (stem) using heuristic rules.
For example, “running” and “runs” both reduce to “run”, while a rule-based stemmer such as Porter leaves “runner” unchanged. NLTK provides stemmers like `PorterStemmer`, `SnowballStemmer`, and `LancasterStemmer`. Stemming groups variants of a word together, which shrinks the vocabulary.

> With the Porter stemmer: “running” → “run”, “jumps” → “jump”, “easily” → “easili”, “fairly” → “fairli”. Notice that “easili” and “fairli” are not valid English words. This aggressive stripping of suffixes is typical of stemming.
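
The stems quoted above can be reproduced with NLTK's `PorterStemmer`, which needs no corpus download. A short sketch; the word list is simply the examples from this section:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "jumps", "easily", "fairly"]

# Map each word to its Porter stem.
stems = {w: stemmer.stem(w) for w in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```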

## Lemmatization
Lemmatization also reduces words to a base form, but it produces dictionary words (lemmas) by using vocabulary and part-of-speech (POS) information.  

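For example, “better” becomes “good” when treated as an adjective, and “running” becomes “run” when treated as a verb. Because lemmatization takes the POS into account, it is generally more accurate than stemming.
In NLTK, `WordNetLemmatizer` looks up dictionary forms in the WordNet corpus; it treats every word as a noun unless a `pos` argument is given.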

## Part-of-Speech (POS) Tagging
Part-of-speech tagging labels each token with its grammatical role (noun, verb, adjective, etc.).  
It assigns tags like “NN” (noun), “VB” (verb), “JJ” (adjective) to words. This step follows tokenization in an NLP pipeline.
NLTK’s `pos_tag` function uses a pretrained tagger that emits Penn Treebank tags by default.

**Pitfall**: POS tagging is context-dependent. The same spelling can receive different tags (“permit” as a noun or verb), and capitalization also affects tagging (e.g. “may” is a modal verb, while “May” can be a month name).

## Term Frequency (TF)
Term Frequency (TF) quantifies how often a word appears in a document. It is usually defined as either the raw count of the term or that count normalized by the document length.  
For example, in the document “this is a sample sample text”, the word “sample” has a raw count of 2 and a normalized TF of 2/6 ≈ 0.33. A higher TF suggests the term is more important within that document.  
TF alone doesn’t account for how common a term is across other documents.

**TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)**

**Pitfall**: Common words (like “the”) can have high TF but low informational value. That’s why TF is often combined with IDF (below) to downweight common terms.
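
The two definitions of TF can be compared on the sample document from this section. This is a minimal sketch using only the standard library; splitting on whitespace is an assumption that works for this toy text but not for real documents with punctuation.

```python
from collections import Counter

doc = "this is a sample sample text"
tokens = doc.split()  # whitespace split is enough for this toy document

raw_tf = Counter(tokens)  # raw counts per term
total = len(tokens)       # document length in tokens
norm_tf = {term: count / total for term, count in raw_tf.items()}

print(raw_tf["sample"])             # 2
print(round(norm_tf["sample"], 3))  # 0.333
```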

## Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) measures how rare or informative a word is across a collection of documents. The basic formula is:

**IDF(t) = log( N / DF(t) )**

that is, **IDF = log( Total number of documents / Number of documents containing the word )**,

where N is the total number of documents and DF(t) is the number of documents containing term t. Rare terms (low DF) get high IDF, and very common terms get low IDF.  

A common smoothed variant adds 1 to the denominator so that a term with DF(t) = 0 does not cause a division by zero:

**IDF(t) = log( N / (1 + DF(t)) )**

Intuitively, if a term appears in almost every document (high DF), it’s not useful for distinguishing documents, so its IDF is low.

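**Pitfall**: If a word appears in all documents (DF = N), the basic IDF becomes log(N/N) = 0, so the word contributes nothing to TF-IDF. Adding 1 only to the denominator makes this value negative, so some implementations (scikit-learn's default, for example) add 1 to both numerator and denominator and then add 1 to the result to keep the weight positive. Also, very rare terms (e.g. typos) get very high IDF but might not be useful.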
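
To make the DF = N case concrete, here is a small sketch on a made-up corpus of three tokenized documents. The function names `idf_basic` and `idf_smooth` are hypothetical; the smoothed form shown, log((1 + N) / (1 + DF)) + 1, is one variant among several and is chosen here only because it stays positive.

```python
import math

# Toy corpus: each document is already tokenized.
docs = [
    ["tree", "planting", "drive"],
    ["tree", "survival", "report"],
    ["tree", "planting", "target"],
]
N = len(docs)

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def idf_basic(term):
    # log(N / DF): zero when the term occurs in every document.
    return math.log(N / document_frequency(term))

def idf_smooth(term):
    # log((1 + N) / (1 + DF)) + 1: positive, and defined even for DF = 0.
    return math.log((1 + N) / (1 + document_frequency(term))) + 1

for term in ["tree", "planting", "report"]:
    print(term, round(idf_basic(term), 3), round(idf_smooth(term), 3))
```

“tree” occurs in all three documents, so its basic IDF is 0 while its smoothed IDF is 1; the rarer “report” scores highest under both formulas.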


In [1]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\LENOVO\AppData\Roaming\nltk_data

True

In [2]:
import nltk
import string
import math

In [3]:
with open("text_01.txt","r") as file:
    doc1 = file.read()

with open("text_02.txt","r") as file:
    doc2 = file.read()
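
The `Â` and `â€` fragments that appear in the outputs below are consistent with UTF-8 text being decoded as Windows-1252, which can happen when `open` is called without an `encoding` argument on Windows. The sketch below reproduces the effect on an in-memory string and shows the fix. That the input files are UTF-8 is an assumption, since the files themselves are not shown.

```python
# A non-breaking space (U+00A0) and an em dash (U+2014), stored as UTF-8 bytes.
original = "the\u00a0BJP \u2014 government"
raw_bytes = original.encode("utf-8")

# Decoding with the wrong codec yields mojibake.
garbled = raw_bytes.decode("cp1252")
print(repr(garbled))

# Decoding with the right codec round-trips cleanly.
clean = raw_bytes.decode("utf-8")
print(clean == original)  # True

# For files, pass the encoding explicitly:
# with open("text_01.txt", "r", encoding="utf-8") as file:
#     doc1 = file.read()
```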

### Tokenization

In [4]:
from nltk.tokenize import word_tokenize, sent_tokenize

# Sentence Level Tokenization
sent_tokens = sent_tokenize(doc1)

# Word Level Tokenization
word_tokens = word_tokenize(doc1)

In [5]:
sent_tokens[:5]

['Between 2016 and 2019, the state forest department under theÂ\xa0BJPÂ\xa0government had launched Green Maharashtra drive with an aim to plant 50 crore trees across the state in the four-year period.',
 'In October 2019, the government had claimed it had surpassed the target by planting 33 crore trees in July-September 2019.Â\xa0The Indian ExpressÂ\xa0had found that non-forest agencies â€” such as gram panchayats â€” which were tasked with planting trees had not uploaded the mandatory audio-visual proof of the tree plantation drives on the specially created portal.',
 'In Pune Revenue Division, it was claimed the gram panchayats planted 1.7 crore saplings; however, no evidence was uploaded for 87 per cent (1.49 crore) saplings.',
 'Also, out of the 59 government agencies involved in the drive as many as 38 had not submitted survival reports about the saplings.',
 'This year, the targets set by the forest department were comparatively modest.']

In [6]:
word_tokens[:10]

['Between',
 '2016',
 'and',
 '2019',
 ',',
 'the',
 'state',
 'forest',
 'department',
 'under']

### Stop Words

In [7]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
list(stop_words)[:10]

["mustn't",
 "i'd",
 'mustn',
 "she'd",
 'there',
 'why',
 'under',
 'until',
 'wouldn',
 'what']

In [8]:
word_tokens = [token for token in word_tokens if token not in stop_words]
word_tokens[:10]

['Between',
 '2016',
 '2019',
 ',',
 'state',
 'forest',
 'department',
 'theÂ',
 'BJPÂ',
 'government']

### POS Tagging

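Some common Penn Treebank tags:  
nouns ('NN')  
verbs, third-person singular present ('VBZ')  
adjectives ('JJ')  
determiners ('DT')  

Note that the tagger below runs on tokens from which stop words have already been removed. Taggers rely on sentence context, so tagging before stop word removal generally gives more reliable tags (here “forest” is tagged 'JJS', a superlative adjective).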

In [9]:
pos_tags = nltk.pos_tag(word_tokens)
pos_tags[:10]

[('Between', 'IN'),
 ('2016', 'CD'),
 ('2019', 'CD'),
 (',', ','),
 ('state', 'NN'),
 ('forest', 'JJS'),
 ('department', 'NN'),
 ('theÂ', 'NN'),
 ('BJPÂ', 'NNP'),
 ('government', 'NN')]

### Stemming
Stemming reduces a word to its stem, the base form to which prefixes and suffixes attach. The Porter stemmer used here also lowercases its input, as the output below shows.

In [10]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmed_words = [stemmer.stem(w) for w in word_tokens]

dict(zip(word_tokens[:10],stemmed_words[:10]))

{'Between': 'between',
 '2016': '2016',
 '2019': '2019',
 ',': ',',
 'state': 'state',
 'forest': 'forest',
 'department': 'depart',
 'theÂ': 'theâ',
 'BJPÂ': 'bjpâ',
 'government': 'govern'}

### Lemmatization

In [11]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(w) for w in word_tokens]
dict(zip(word_tokens[:30],lemmatized_tokens[:30]))

{'Between': 'Between',
 '2016': '2016',
 '2019': '2019',
 ',': ',',
 'state': 'state',
 'forest': 'forest',
 'department': 'department',
 'theÂ': 'theÂ',
 'BJPÂ': 'BJPÂ',
 'government': 'government',
 'launched': 'launched',
 'Green': 'Green',
 'Maharashtra': 'Maharashtra',
 'drive': 'drive',
 'aim': 'aim',
 'plant': 'plant',
 '50': '50',
 'crore': 'crore',
 'trees': 'tree',
 'across': 'across',
 'four-year': 'four-year',
 'period': 'period',
 '.': '.',
 'In': 'In',
 'October': 'October',
 'claimed': 'claimed'}
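
In the output above, verb forms such as “launched” and “claimed” are unchanged because `lemmatize` treats every word as a noun unless told otherwise. One common remedy is to pass each token's POS tag along. The helper below (a hypothetical name, `penn_to_wordnet`) maps Penn Treebank tags to the single-letter POS codes that WordNet uses. It is a sketch, and falling back to noun for all other tags is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if tag.startswith("JJ"):
        return "a"   # adjective
    if tag.startswith("VB"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # noun, also used as the fallback

for tag in ["JJ", "VBD", "RB", "NNP", "CD"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With the `pos_tags` list from the POS tagging cell, this could be applied as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair.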

### Term Frequency (TF)

In [12]:
from nltk.probability import FreqDist
fdist = FreqDist(word_tokens)
fdist

FreqDist({',': 38, '.': 19, 'crore': 12, 'plantation': 10, 'saplings': 9, 'Forest': 7, 'forest': 6, 'In': 6, 'â€': 6, '”': 6, ...})

In [13]:
# Alternatively, use Counter from the collections module (raw, case-sensitive counts)
from collections import Counter
tf_counts = Counter(word_tokens)
tf_counts['crore']

12

### Inverse Document Frequency

In [14]:
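d1_tokens = word_tokenize(doc1)
d2_tokens = word_tokenize(doc2)

# Despite the "_df" suffix, these hold per-document term counts (raw TF).
d1_df = Counter(d1_tokens)
d2_df = Counter(d2_tokens)

# Vocabulary: every distinct token seen in either document
all_words = set(d1_tokens) | set(d2_tokens)

N = 2  # number of documents

idf = {}
for word in all_words:
    # Document frequency: number of documents (1 or 2 here) containing the word
    df = (1 if word in d1_df else 0) + (1 if word in d2_df else 0)
    # df >= 1 for every word in all_words, so the division is safe.
    idf[word] = math.log(N / df)          # smoothed variant: math.log(N / (1 + df))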

In [15]:
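# TF-IDF = raw term count in the document * IDF of the term.
# One [token, score] pair is appended per token occurrence, so repeated
# tokens appear more than once in the result.
d1_tf_idf_scores = []
for token in d1_tokens:
    d1_tf_idf_scores.append([token, d1_df[token] * idf[token]])

d2_tf_idf_scores = []
for token in d2_tokens:
    d2_tf_idf_scores.append([token, d2_df[token] * idf[token]])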
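
The loops above emit one score per token occurrence, so repeated words show up repeatedly in the output. A variation that scores each distinct term once and ranks the results is sketched below on two toy documents (made up for this example, not the contents of `text_01.txt` or `text_02.txt`). The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

docs = {
    "d1": "crore saplings planted across the state".split(),
    "d2": "million saplings planted across the country".split(),
}
N = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in docs.values() for term in set(tokens))

def tf_idf(tokens):
    """Return {term: raw count * log(N / DF)} for one document."""
    counts = Counter(tokens)
    return {t: c * math.log(N / df[t]) for t, c in counts.items()}

for name, tokens in docs.items():
    scores = tf_idf(tokens)
    # Highest-scoring terms first
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:2]
    print(name, top)
```

Terms shared by both documents (“saplings”, “planted”, “across”, “the”) score 0, so only the terms unique to each document surface at the top.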

In [16]:
d1_tf_idf_scores

[['Between', 0.6931471805599453],
 ['2016', 2.0794415416798357],
 ['and', 0.0],
 ['2019', 0.0],
 [',', 0.0],
 ['the', 0.0],
 ['state', 0.0],
 ['forest', 0.0],
 ['department', 1.3862943611198906],
 ['under', 0.0],
 ['theÂ', 0.6931471805599453],
 ['BJPÂ', 0.6931471805599453],
 ['government', 0.0],
 ['had', 4.852030263919617],
 ['launched', 0.6931471805599453],
 ['Green', 2.0794415416798357],
 ['Maharashtra', 3.4657359027997265],
 ['drive', 0.0],
 ['with', 4.852030263919617],
 ['an', 0.0],
 ['aim', 1.3862943611198906],
 ['to', 0.0],
 ['plant', 2.0794415416798357],
 ['50', 2.0794415416798357],
 ['crore', 8.317766166719343],
 ['trees', 0.0],
 ['across', 0.0],
 ['the', 0.0],
 ['state', 0.0],
 ['in', 0.0],
 ['the', 0.0],
 ['four-year', 0.6931471805599453],
 ['period', 0.6931471805599453],
 ['.', 0.0],
 ['In', 4.1588830833596715],
 ['October', 0.6931471805599453],
 ['2019', 0.0],
 [',', 0.0],
 ['the', 0.0],
 ['government', 0.0],
 ['had', 4.852030263919617],
 ['claimed', 1.3862943611198906],
 [

In [17]:
d2_tf_idf_scores

[['Millions', 0.6931471805599453],
 ['of', 0.0],
 ['people', 0.6931471805599453],
 ['in', 0.0],
 ['India', 2.772588722239781],
 ['took', 0.6931471805599453],
 ['part', 0.0],
 ['in', 0.0],
 ['an', 0.0],
 ['annual', 1.3862943611198906],
 ['tree', 0.0],
 ['planting', 0.0],
 ['drive', 0.0],
 ['Sunday', 0.6931471805599453],
 ['.', 0.0],
 ['More', 0.6931471805599453],
 ['than', 0.6931471805599453],
 ['250', 0.6931471805599453],
 ['million', 1.3862943611198906],
 ['saplings', 0.0],
 ['were', 0.0],
 ['planted', 0.0],
 ['in', 0.0],
 ['a', 0.0],
 ['single', 0.0],
 ['day', 0.0],
 ['across', 0.0],
 ['the', 0.0],
 ['country', 1.3862943611198906],
 ["'s", 1.3862943611198906],
 ['most-populous', 0.6931471805599453],
 ['state', 0.0],
 ['.', 0.0],
 ['The', 0.0],
 ['campaign', 0.6931471805599453],
 ['was', 0.0],
 ['led', 0.6931471805599453],
 ['by', 0.0],
 ['Uttar', 2.772588722239781],
 ['Pradesh', 2.772588722239781],
 ['state', 0.0],
 ['government', 0.0],
 ['officials', 1.3862943611198906],
 [',', 0.0]