In [None]:
# !pip install nltk
# !pip install scikit-learn

In [None]:
paragraph = '''
Narendra Damodardas Modi[a] (born 17 September 1950) is an Indian politician who has served as the prime minister of India since 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the member of parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindutva paramilitary volunteer organisation. He is the longest-serving prime minister outside the Indian National Congress.

Modi was born and raised in Vadnagar, Bombay State (present-day Gujarat), where he completed his secondary education. He was introduced to the RSS at the age of eight, becoming a full-time worker for the organisation in Gujarat in 1971. The RSS assigned him to the BJP in 1985, and he rose through the party hierarchy, becoming general secretary in 1998.[b] In 2001, Modi was appointed chief minister of Gujarat and elected to the legislative assembly soon after. His administration is considered complicit in the 2002 Gujarat violence[c] and has been criticised for its management of the crisis. According to official records, a little over 1,000 people were killed, three-quarters of whom were Muslim; independent sources estimated 2,000 deaths, mostly Muslim.[4] A Special Investigation Team appointed by the Supreme Court of India in 2012 found no evidence to initiate prosecution proceedings against him, causing widespread anger and disbelief among the country's Muslim communities.[d] While his policies as chief minister were credited for encouraging economic growth, his administration was criticised for failing to significantly improve health, poverty and education indices in the state.[e]
'''
paragraph

In [None]:
import nltk

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import re

## Text Preprocessing:

### 1. Tokenisation and Cleaning

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')   # for newer NLTK
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

sentences = nltk.sent_tokenize(paragraph)

In [None]:
# Clean the data using Regular Expressions
corpus = []
for i in range(len(sentences)):
    # Remove everything except letters a-z and A-Z, replacing them with a space
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])
    # Convert text to lowercase
    review = review.lower()
    corpus.append(review)
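
To see what the cleaning step does to a single sentence, here is a minimal sketch on a made-up string (`sample` is an assumption for illustration, not part of the corpus). Digits, brackets and punctuation all become spaces, which leaves runs of spaces behind; the second substitution, which is not in the loop above, is one optional way to collapse them.

```python
import re

# Hypothetical sample sentence, chosen to include digits, brackets and punctuation
sample = "Modi[a] (born 17 September 1950) is an Indian politician."

# Same rule as the cleaning loop: every non-letter becomes a space, then lowercase
cleaned = re.sub('[^a-zA-Z]', ' ', sample).lower()

# Optional extra step (an addition, not in the original loop): collapse repeated spaces
normalised = re.sub(r'\s+', ' ', cleaned).strip()

print(repr(cleaned))
print(normalised)
```

Later steps split on whitespace, so the extra spaces are harmless, but collapsing them makes the intermediate `corpus` easier to read.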

### 2. Stemming and Lemmatisation

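So far the paragraph (corpus) has been split into a list of sentences, and each sentence has been cleaned by removing special characters and converting everything to lowercase. The next step removes stop words and reduces each remaining word to its base form. The cells below use lemmatisation; the stemmer is initialised for comparison.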

In [None]:
# Initialize tools
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
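
The stemmer is created above but never applied, so here is a minimal sketch of what stemming does. The word list is an assumption chosen for illustration. `PorterStemmer` strips suffixes by rule, so its output is not always a dictionary word; `WordNetLemmatizer` instead looks words up in WordNet and needs the `wordnet` download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words (an assumption, not taken from the corpus)
words = ["eating", "running", "history", "finally"]

for word in words:
    # Stemming chops suffixes by rule, so the result need not be a real word
    print(word, "->", stemmer.stem(word))
```

Stemming is fast and needs no extra data, while lemmatisation is slower but returns readable base forms, which is why the loop below uses the lemmatizer.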

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# print(stop_words)

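for i in range(len(corpus)):
    # Tokenise the cleaned, lowercased sentence (not the raw one), so that
    # punctuation stays removed and stop-word matching is not defeated by capitals
    words = nltk.word_tokenize(corpus[i])

    review = []
    for word in words:
        # Keep only words that are not English stop words
        if word not in stop_words:
            lemma = lemmatizer.lemmatize(word)
            review.append(lemma)

    # Rebuild the sentence from the lemmatised words
    review = ' '.join(review)
    corpus[i] = review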

corpus


## Vectorisation: 

### 1. Binary Bag of Words (BBoW)

The cleaned text is now converted into numerical vectors using `CountVectorizer` from scikit-learn. The instructor also covers Binary Bag of Words, which records only whether a word is present (1) or absent (0) in a sentence, regardless of how often it occurs.
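
The difference between counting and the binary variant can be sketched by hand with the standard library. The three toy documents are an assumption for illustration; the vocabulary is sorted alphabetically to mimic the feature order `CountVectorizer` uses.

```python
from collections import Counter

# Toy documents (hypothetical, for illustration only)
docs = ["good boy good", "good girl", "boy girl good"]

# Vocabulary: sorted unique words across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

count_vectors = []
binary_vectors = []
for doc in docs:
    counts = Counter(doc.split())
    # Count BoW: how many times each vocabulary word occurs
    count_vectors.append([counts[w] for w in vocab])
    # Binary BoW: 1 if the word occurs at all, else 0
    binary_vectors.append([int(counts[w] > 0) for w in vocab])

print(vocab)
print(count_vectors)
print(binary_vectors)
```

The two representations differ only where a word repeats within a document: here "good" occurs twice in the first document, so its count is 2 but its binary value is 1.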

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Initialise CountVectorizer
cv = CountVectorizer(binary=True)  # binary=True creates a Binary Bag of Words

X_bbow = cv.fit_transform(corpus)

# To view the mapping of words to their index:
print("Vocabulary:\n", cv.vocabulary_)

# Convert sparse matrix to array
bbow_array = X_bbow.toarray()

# Print full document-term matrix
print("\nDocument-Term Matrix:")
print(bbow_array)

# Print the vectors for the first and second sentences
print("\nFirst sentence vector:")
print(bbow_array[0])
print("\nSecond sentence vector:")
print(bbow_array[1])

### 2. Bag of Words (BoW)

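This is the count-based Bag of Words model: each value is the number of times a feature occurs in a sentence. Instead of single words (unigrams), the features used here are n-grams:

• bigrams (2-word sequences)

• trigrams (3-word sequences)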
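
As a rough illustration of what these n-gram features look like, the sketch below builds them by hand from one tokenised sentence. The helper `ngrams` and the sample tokens are hypothetical, written only for this example; `CountVectorizer` does the equivalent internally when `ngram_range` is set.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical cleaned sentence, already split into tokens
tokens = "modi born raised vadnagar".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k-1 bigrams and k-2 trigrams, so the vocabulary grows quickly as n increases.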

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Creating the BoW model
# ngram_range=(2, 3) considers both bigrams and trigrams
cv = CountVectorizer(ngram_range=(2, 3))

X_bow = cv.fit_transform(corpus)

# The vocabulary maps each feature (n-gram) to its column index
print("BoW Vocabulary:", cv.vocabulary_)

bow_array = X_bow.toarray()
print("BoW Vectors:\n", bow_array)

# Print the vectors for the first and second sentences
print("\nFirst sentence vector:")
print(bow_array[0])
print("\nSecond sentence vector:")
print(bow_array[1])



### 3. Term Frequency - Inverse Document Frequency (TF-IDF)

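TF-IDF captures word importance by giving higher weight to words that are rare across sentences and lower weight to common words that appear in every sentence. This partly addresses a limitation of BoW, which scores words purely by count and so treats a frequent filler word as no less informative than a distinctive one. TF-IDF still does not model the meaning of words or the relationships between them.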

#### Key Concepts
• **Term Frequency (TF)** : Calculated as the number of repetitions of a word in a sentence divided by the total number of words in that sentence.

• **Inverse Document Frequency (IDF)** : Calculated as the log of (total number of sentences / number of sentences containing the word).

• **Sparsity Issue** : Both BoW and TF-IDF can result in large vectors containing many zeros if the vocabulary is huge, which makes computation difficult.

• **Word Importance** : TF-IDF identifies important words; for example, if the word "good" appears in every sentence, its IDF is log(1) = 0, so its TF-IDF value becomes 0, indicating that it does not help distinguish between sentences.

To understand TF-IDF, imagine a digital highlighter. If every sentence in a book has the word "the", your highlighter ignores it because it doesn't help you find a specific topic. However, if the word "pizza" only appears in one chapter, the highlighter marks it brightly because it is a rare and important keyword for that specific section.
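
The TF and IDF definitions above can be checked by hand on a toy corpus. This is a minimal sketch: the three tokenised sentences and the helper names `tf`, `idf` and `smoothed_idf` are assumptions for illustration. The last helper shows the smoothed formula that scikit-learn documents for its default `smooth_idf=True`, which is why `TfidfVectorizer` does not give exactly 0 to a word that appears in every sentence.

```python
import math

# Toy tokenised sentences (hypothetical)
docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
n_docs = len(docs)

def tf(term, doc):
    # Occurrences of the term divided by the number of words in the sentence
    return doc.count(term) / len(doc)

def idf(term):
    # log(total sentences / sentences containing the term)
    df = sum(1 for doc in docs if term in doc)
    return math.log(n_docs / df)

def smoothed_idf(term):
    # Smoothed variant: ln((1 + n) / (1 + df)) + 1
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ["good", "boy"]:
    print(term, tf(term, docs[0]) * idf(term), smoothed_idf(term))
```

With the plain formula, "good" scores 0 because it occurs in all three sentences, while "boy" gets a positive weight. scikit-learn additionally L2-normalises each row by default, so its numbers will not match the hand calculation exactly.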

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Initialising the TF-IDF vectorizer
# max_features=10 keeps only the 10 terms with the highest frequency across
# the corpus, which shrinks the vectors; ngram_range=(1, 1) uses unigrams only
tfidf = TfidfVectorizer(max_features=10, ngram_range=(1, 1))

X_tfidf = tfidf.fit_transform(corpus)

# Convert the sparse matrix to an array to see the vector values
tfidf_array = X_tfidf.toarray()
print("TF-IDF Vectors:\n", tfidf_array)

# Print the vectors for the first and second sentences
print("\nFirst sentence vector:")
print(tfidf_array[0])
print("\nSecond sentence vector:")
print(tfidf_array[1])

## Word Embeddings

### 1. Continuous Bag of Words (CBOW)

### 2. Skip-gram