In [105]:
paragraph = '''Narendra Modi, the Prime Minister of India since 2014, is a prominent figure in global politics known for his dynamic leadership and ambitious vision for the country. Born on September 17, 1950, in Vadnagar, Gujarat, Modi's journey from a small-town tea seller to the highest office in the country is both inspiring and remarkable. He rose through the ranks of the Bharatiya Janata Party (BJP), showcasing his organizational acumen and oratory skills, which eventually led to his tenure as the Chief Minister of Gujarat from 2001 to 2014. During his time as Chief Minister, he was credited with transforming Gujarat into a vibrant economic powerhouse, although his tenure was also marked by controversy, particularly the 2002 Gujarat riots.

As Prime Minister, Modi has initiated several key reforms and policies aimed at modernizing India's infrastructure and economy. His flagship initiatives like "Make in India," "Digital India," and "Swachh Bharat Abhiyan" (Clean India Mission) have garnered significant attention and investment, both domestically and internationally. Modi's foreign policy has also been assertive, striving to enhance India's global stature and forge strategic partnerships with major world powers. His leadership style, characterized by a strong centralization of power and a focus on nationalistic and economic development agendas, has resonated with many Indians, securing him a second term in 2019 with a decisive electoral mandate.

Despite his successes, Modi's tenure has not been without criticism. His economic policies, including the controversial demonetization of 2016 and the implementation of the Goods and Services Tax (GST), have faced scrutiny for their impacts on various sectors of the economy. Additionally, his administration's handling of issues such as religious tolerance, press freedom, and the Kashmir conflict has been a subject of intense debate and international concern. Nevertheless, Narendra Modi remains a formidable and influential leader, continually shaping the course of India's future through his policies and vision. His ability to connect with the masses and his unwavering commitment to his vision for India ensure that he remains a pivotal figure in both Indian and global politics.'''

In [106]:
paragraph

'Narendra Modi, the Prime Minister of India since 2014, is a prominent figure in global politics known for his dynamic leadership and ambitious vision for the country. Born on September 17, 1950, in Vadnagar, Gujarat, Modi\'s journey from a small-town tea seller to the highest office in the country is both inspiring and remarkable. He rose through the ranks of the Bharatiya Janata Party (BJP), showcasing his organizational acumen and oratory skills, which eventually led to his tenure as the Chief Minister of Gujarat from 2001 to 2014. During his time as Chief Minister, he was credited with transforming Gujarat into a vibrant economic powerhouse, although his tenure was also marked by controversy, particularly the 2002 Gujarat riots.\n\nAs Prime Minister, Modi has initiated several key reforms and policies aimed at modernizing India\'s infrastructure and economy. His flagship initiatives like "Make in India," "Digital India," and "Swachh Bharat Abhiyan" (Clean India Mission) have garner

## Implementation of Text Preprocessing 1

### 1. Tokenization
### 2. StopWords
### 3. Stemming
### 4. Lemmatization

In [107]:
# All imports for text preprocessing
import nltk
from nltk.stem import PorterStemmer 
from nltk.corpus import stopwords   

In [108]:
# Tokenization: split the paragraph into sentences (each sentence is split into words later)
nltk.download('punkt')
sentences = nltk.sent_tokenize(paragraph)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\CZ0234\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [109]:
# Stemming: create a Porter stemmer
stemmer = PorterStemmer()

In [110]:
stemmer.stem("history")

'histori'
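
The result `'histori'` shows that a Porter stem is a truncated form and need not be a dictionary word. A minimal sketch that stems a few more words; the word list is an illustrative assumption, not taken from the paragraph above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words (an assumption, chosen only for this demonstration).
words = ["history", "drinking", "goes", "policies"]

# Map each word to its Porter stem; stems are rule-based truncations, not lemmas.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

`PorterStemmer` is purely rule-based, so it needs no downloaded corpus, which makes it fast but also explains outputs that are not real words.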

In [111]:
# Lemmatization: import the WordNet lemmatizer and download its data
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\CZ0234\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [112]:
lemmatizer = WordNetLemmatizer()

In [113]:
lemmatizer.lemmatize("Drinking")  # returned unchanged: the default part of speech is noun (pos="n")

'Drinking'

In [114]:
lemmatizer.lemmatize("goes")  # returns a dictionary word, so it is more accurate than stemming

'go'

In [116]:
# Number of sentences to clean in the paragraph
len(sentences)

13

In [119]:
# Download the stopword lists that ship with NLTK
nltk.download('stopwords')
# stopwords.words("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\CZ0234\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [124]:
# Cleaning the sentences and doing 
# 0. Removing special character and symbols which are not going to use.
# 1. Stemming
# 2. Lemmatization
import re # importing the Regular-Expression
corpus = []

for i in range(len(sentences)):
    # Remove non-alphabetic characters
    review = re.sub('[^a-zA-Z]', " ", sentences[i])
    # Convert to lowercase
    review = review.lower()
    # Split into words
    words = review.split()
    # Remove stopwords and lemmatize the remaining words
    filtered_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    # Rejoin the words into a single string
    review = ' '.join(filtered_words)
    # Append to corpus
    corpus.append(review)

print(corpus)       

['narendra modi prime minister india since prominent figure global politics known dynamic leadership ambitious vision country', 'born september vadnagar gujarat modi journey small town tea seller highest office country inspiring remarkable', 'rose rank bharatiya janata party bjp showcasing organizational acumen oratory skill eventually led tenure chief minister gujarat', 'time chief minister credited transforming gujarat vibrant economic powerhouse although tenure also marked controversy particularly gujarat riot', 'prime minister modi initiated several key reform policy aimed modernizing india infrastructure economy', 'flagship initiative like make india digital india swachh bharat abhiyan clean india mission garnered significant attention investment domestically internationally', 'modi foreign policy also assertive striving enhance india global stature forge strategic partnership major world power', 'leadership style characterized strong centralization power focus nationalistic econo
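
To see why digits and the possessive in "Modi's" vanish from the corpus, the regex and stopword steps can be sketched on a single fragment. This is a simplified illustration: `DEMO_STOP_WORDS` and `clean_sentence` are hypothetical names, the stopword set is a tiny hand-picked stand-in for NLTK's full English list, and lemmatization is left out.

```python
import re

# A tiny hand-picked stopword set, used only for this illustration;
# the notebook itself relies on NLTK's full English list.
DEMO_STOP_WORDS = {"on", "in", "s", "the", "to", "is", "from", "a"}


def clean_sentence(sentence, stop_words=DEMO_STOP_WORDS):
    """Replace non-letters with spaces, lowercase, and drop stopwords."""
    # Digits, commas, and apostrophes all become spaces here,
    # so "Modi's" turns into the two tokens "modi" and "s".
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in stop_words)


cleaned = clean_sentence("Born on September 17, 1950, in Vadnagar, Gujarat, Modi's journey")
print(cleaned)  # born september vadnagar gujarat modi journey
```

The leftover token "s" is removed only because it appears in the stopword set; with a different list it would stay in the corpus.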

## Implementation of Text Preprocessing 2

### 1. Bag of Words
### 2. TF-IDF
### 3. Ngrams
### 4. Word2Vec

In [125]:
# Bag of Words with binary trigram features

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary=True,ngram_range=(3,3))
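
The `binary=True` flag records only whether a term is present, not how often it occurs. A minimal sketch on a two-document toy corpus (the documents and the variable names are assumptions for illustration, and unigrams are used instead of trigrams to keep the output small):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for this comparison only.
docs = ["good good movie", "bad movie"]

count_cv = CountVectorizer()              # raw term counts
binary_cv = CountVectorizer(binary=True)  # presence/absence only

counts = count_cv.fit_transform(docs).toarray()
flags = binary_cv.fit_transform(docs).toarray()

# Vocabulary terms ordered by their column index.
terms = sorted(count_cv.vocabulary_, key=count_cv.vocabulary_.get)

print(terms)   # ['bad', 'good', 'movie']
print(counts)  # "good" is counted twice in the first document
print(flags)   # the same cell is clipped to 1 with binary=True
```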

In [126]:
x = cv.fit_transform(corpus)

In [127]:
cv.vocabulary_  # ngram_range=(3, 3) means the vocabulary contains only trigrams

{'narendra modi prime': 108,
 'modi prime minister': 105,
 'prime minister india': 126,
 'minister india since': 98,
 'india since prominent': 72,
 'since prominent figure': 146,
 'prominent figure global': 128,
 'figure global politics': 45,
 'global politics known': 55,
 'politics known dynamic': 122,
 'known dynamic leadership': 86,
 'dynamic leadership ambitious': 37,
 'leadership ambitious vision': 88,
 'ambitious vision country': 10,
 'born september vadnagar': 16,
 'september vadnagar gujarat': 140,
 'vadnagar gujarat modi': 168,
 'gujarat modi journey': 59,
 'modi journey small': 104,
 'journey small town': 83,
 'small town tea': 148,
 'town tea seller': 165,
 'tea seller highest': 158,
 'seller highest office': 139,
 'highest office country': 62,
 'office country inspiring': 112,
 'country inspiring remarkable': 28,
 'rose rank bharatiya': 135,
 'rank bharatiya janata': 129,
 'bharatiya janata party': 14,
 'janata party bjp': 82,
 'party bjp showcasing': 117,
 'bjp showcasing 

In [128]:
corpus[0]

'narendra modi prime minister india since prominent figure global politics known dynamic leadership ambitious vision country'

In [129]:
x[0].toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)
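
The row above contains 14 ones because the first cleaned sentence has 16 words, which yield 16 - 3 + 1 = 14 overlapping trigrams. A minimal sketch of that sliding window, assuming plain whitespace tokenization (`word_ngrams` is a hypothetical helper, not the analyzer `CountVectorizer` uses internally):

```python
def word_ngrams(text, n):
    """Return all overlapping word n-grams of a whitespace-separated text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


first_sentence = (
    "narendra modi prime minister india since prominent figure "
    "global politics known dynamic leadership ambitious vision country"
)

trigrams = word_ngrams(first_sentence, 3)

print(len(trigrams))  # 14
print(trigrams[0])    # narendra modi prime
print(trigrams[-1])   # ambitious vision country
```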

# So far we have covered the basics of text preprocessing

In [130]:
# Now we will see the implementation of TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(ngram_range=(3,3)) # trigrams only
x = cv.fit_transform(corpus)

In [131]:
corpus[0]

'narendra modi prime minister india since prominent figure global politics known dynamic leadership ambitious vision country'

In [132]:
cv.vocabulary_

{'narendra modi prime': 108,
 'modi prime minister': 105,
 'prime minister india': 126,
 'minister india since': 98,
 'india since prominent': 72,
 'since prominent figure': 146,
 'prominent figure global': 128,
 'figure global politics': 45,
 'global politics known': 55,
 'politics known dynamic': 122,
 'known dynamic leadership': 86,
 'dynamic leadership ambitious': 37,
 'leadership ambitious vision': 88,
 'ambitious vision country': 10,
 'born september vadnagar': 16,
 'september vadnagar gujarat': 140,
 'vadnagar gujarat modi': 168,
 'gujarat modi journey': 59,
 'modi journey small': 104,
 'journey small town': 83,
 'small town tea': 148,
 'town tea seller': 165,
 'tea seller highest': 158,
 'seller highest office': 139,
 'highest office country': 62,
 'office country inspiring': 112,
 'country inspiring remarkable': 28,
 'rose rank bharatiya': 135,
 'rank bharatiya janata': 129,
 'bharatiya janata party': 14,
 'janata party bjp': 82,
 'party bjp showcasing': 117,
 'bjp showcasing 

In [104]:
x[0].toarray() # here each trigram gets a TF-IDF weight instead of a 0/1 flag

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.26726124, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.26726124, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.26726124, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.26726124, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.26726124, 0.  
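
The repeated weight 0.26726124 can be reproduced by hand. The sketch below assumes the `TfidfVectorizer` defaults (smoothed IDF and L2 normalization) and assumes that each of the 14 trigrams in the first sentence occurs in no other sentence; the variable names are illustrative.

```python
import math

n_documents = 13         # len(sentences) in the notebook
n_trigrams_in_row = 14   # trigrams in corpus[0]
document_frequency = 1   # assumption: each trigram occurs in one sentence only

# Smoothed IDF used by TfidfVectorizer's defaults: ln((1 + n) / (1 + df)) + 1
idf = math.log((1 + n_documents) / (1 + document_frequency)) + 1

# Raw TF-IDF is tf * idf, with tf = 1 for every trigram in the row.
raw_weights = [1 * idf] * n_trigrams_in_row

# L2 normalization rescales the row to unit length.
norm = math.sqrt(sum(w * w for w in raw_weights))
weights = [w / norm for w in raw_weights]

print(round(weights[0], 8))  # 0.26726124
```

Because every trigram in the row has the same raw weight, the IDF cancels during normalization and each entry equals 1 / sqrt(14).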