**Text Preprocessing and TF-IDF Analysis:**
The main objective is to analyze a text about Allama Muhammad Iqbal, focusing on preprocessing techniques (cleaning, normalization) and applying TF-IDF to extract significant words.

In [1]:
%pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.1-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.1-py3-none-any.whl (6.8 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.1


In [2]:
# Import necessary libraries for Natural Language Processing (NLP) tasks
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize
from bs4 import BeautifulSoup  # For removing HTML tags
from spellchecker import SpellChecker  # For spelling correction
from sklearn.feature_extraction.text import TfidfVectorizer  # For TF-IDF vectorization

In [3]:
# Download required NLTK resources
nltk.download('punkt')  # For tokenization
nltk.download('punkt_tab')  # Tokenizer data expected by newer NLTK releases
nltk.download('stopwords')  # For stop word removal
nltk.download('wordnet')  # For lemmatization

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [4]:
# Input Text
# Purpose: Define the text data for analysis
text="""Dr. Sir Allama Muhammad Iqbal (9 November 1877 – 21 April 1938) was a Muslim poet, philosopher, political thinker, and politician from Punjab, British India (now
Pakistan), whose poetry in Urdu and Persian is considered to be among the greatest of the modern era, whose vision of an independent state for the Muslims of British
India was to inspire the creation of Pakistan, and who is thus revered by Pakistanis and recognized internationally as Pakistan’s spiritual father of the nation.
Iqbal was born in Sialkot, now in Pakistan’s Punjab province. His father, Sheikh Noor Muhammad, was a tailor by profession and a pious individual with a mystic bent – he
had received no formal education but could read Urdu and Persian books and treasured the company of scholars and mystics, some of whom called him an
“unlettered philosopher”. Iqbal’s mother, Imam Bibi, was illiterate but was highly respected in the family as a wise and generous woman who quietly gave financial
help to the poor and needy and arbitrated in neighbours’ disputes. A few days before the birth of Iqbal, his father had a dream: “I saw a big crowd
gathered in a large field. A magnificent coloured bird was flying over our heads and everyone was admiring it and trying to catch it, but no one succeeded, and, at last, it
got tired of its flight and fell into my lap.” He understood this to be a message that God was about to bless him with a world-famous son. Hence, the “unlettered
philosopher” gave his son the name Muhammad Iqbal – the word Iqbal, whose origins lie in the Arabic language, means recognition, stature, respect, and fortune.
About four hundred years before Iqbal’s birth, his Brahmin ancestors, who lived in Kashmir (Northern India), had converted to Islam. In the late eighteenth or early
nineteenth century, when Afghan rule in Kashmir was being replaced by Sikh rule, Iqbal’s great grandfather emigrated from Kashmir to Sialkot. """

In [5]:
def preprocess_text(text):
    """
    Preprocesses text data by applying cleaning and normalization steps.

    Args:
        text (str): The input text to be preprocessed.

    Returns:
        list: A list of preprocessed tokens.
    """
    # 1. Remove HTML tags (if any). This must run before punctuation removal,
    #    which would otherwise delete the angle brackets that delimit the tags.
    soup = BeautifulSoup(text, 'html.parser')
    text = soup.get_text()

    # 2. Convert to lowercase
    text = text.lower()

    # 3. Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)

    # 4. Tokenize into words
    tokens = word_tokenize(text)

    # 5. Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # 6. Perform stemming (using PorterStemmer)
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]

    # 7. Perform lemmatization (using WordNetLemmatizer)
    #    Note: this has little effect on stems, which are often not dictionary words.
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # 8. Spelling correction (using SpellChecker)
    #    Caution: stems and proper nouns that are missing from the dictionary are
    #    replaced by the nearest known word (e.g. 'iqbal' becomes 'equal').
    spell = SpellChecker()
    tokens = [spell.correction(word) or word for word in tokens]  # Replace None with the original word

    return tokens
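
The order of the cleaning steps matters: tag removal can only find tags while their angle brackets are still in the text. The sketch below illustrates this with the standard library only. `strip_tags` is a hypothetical, deliberately crude stand-in for `BeautifulSoup(...).get_text()`, and the sample string is made up for illustration.

```python
import re

def strip_tags(s):
    """Crude regex stand-in for an HTML parser (assumption, illustration only)."""
    return re.sub(r'<[^>]+>', ' ', s)

def remove_punct(s):
    """Same pattern as the notebook: drop everything except word chars and whitespace."""
    return re.sub(r'[^\w\s]', '', s)

sample = "<p>Iqbal was a <b>poet</b>.</p>"  # hypothetical input containing HTML

# Punctuation first: the brackets vanish, so tag names fuse with the words.
punct_first = strip_tags(remove_punct(sample.lower())).split()

# Tags first: the markup is removed cleanly before punctuation is stripped.
tags_first = remove_punct(strip_tags(sample).lower()).split()

print(punct_first)
print(tags_first)
```

In the first ordering the tag names survive as stray letters attached to neighbouring words, which is why HTML stripping belongs at the start of the pipeline.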

In [6]:
processed_tokens = preprocess_text(text)
processed_tokens

['do',
 'sir',
 'llama',
 'muhammad',
 'equal',
 '9',
 'novel',
 '1877',
 '21',
 'aril',
 '1938',
 'muslin',
 'poet',
 'philosophy',
 'polite',
 'thinker',
 'politician',
 'unjam',
 'brutish',
 'india',
 'pakistan',
 'whose',
 'poetry',
 'rude',
 'person',
 'conoid',
 'among',
 'greatest',
 'modern',
 'era',
 'whose',
 'vision',
 'indeed',
 'state',
 'muslin',
 'brutish',
 'india',
 'inspire',
 'creation',
 'pakistan',
 'the',
 'never',
 'pakistan',
 'recon',
 'intern',
 'pakistan',
 'spirit',
 'father',
 'nation',
 'equal',
 'born',
 'shallot',
 'pakistan',
 'unjam',
 'proving',
 'father',
 'sheikh',
 'door',
 'muhammad',
 'tailor',
 'profess',
 'pious',
 'individual',
 'mystic',
 'bent',
 'receive',
 'formal',
 'educe',
 'could',
 'read',
 'rude',
 'person',
 'book',
 'treasure',
 'company',
 'scholar',
 'mystic',
 'call',
 'unless',
 'philosophy',
 'equal',
 'mother',
 'imam',
 'bib',
 'liter',
 'highly',
 'respect',
 'family',
 'wise',
 'gene',
 'woman',
 'quietly',
 'gave',
 'fina
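
Several tokens in this output appear to be proper nouns rewritten by the spelling-correction step ('llama', 'equal', 'muslin', 'brutish'). One possible mitigation, sketched here under assumptions, is to skip correction for a protected vocabulary and for numbers. The helper `correct_tokens`, the `protected` set, and the toy corrector are all hypothetical; in the notebook, `spell.correction` could be passed as the `correct` argument.

```python
def correct_tokens(tokens, correct, protected=frozenset()):
    """Apply `correct` to each token, leaving protected words and numbers untouched."""
    result = []
    for token in tokens:
        if token in protected or token.isdigit():
            result.append(token)
        else:
            # Fall back to the original token if the corrector returns None.
            result.append(correct(token) or token)
    return result

# Toy corrector that mimics substitutions visible in the output above (assumption).
toy_corrections = {"iqbal": "equal", "allama": "llama", "muslim": "muslin"}

def toy_correct(word):
    return toy_corrections.get(word, word)

tokens = ["allama", "iqbal", "muslim", "poet", "1877"]
unguarded = correct_tokens(tokens, toy_correct)
guarded = correct_tokens(tokens, toy_correct, protected={"allama", "iqbal", "muslim"})

print(unguarded)
print(guarded)
```

Running correction before stemming, rather than after it, would likely help as well, since stems are rarely dictionary words.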

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It's widely used in information retrieval and text mining.
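
To make the statistic concrete, here is a minimal pure-Python sketch of the weighting that scikit-learn's `TfidfVectorizer` applies with its default settings: raw term count times a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The function name `tfidf` and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (smoothed IDF, L2 norm)."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# With a single document every term has df == n, so IDF is 1 for all terms
# and the weights are simply L2-normalized counts.
single = tfidf([["iqbal", "iqbal", "poet"]])[0]
print(single)

# With several documents, terms that occur in fewer documents get a larger IDF.
multi = tfidf([["iqbal", "poet"], ["iqbal", "tailor"]])
print(multi[0])
```

When only one document is vectorized, the IDF factor cannot distinguish between terms, so the ranking is driven entirely by how often each term occurs.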

In [10]:
# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Join the preprocessed tokens into a single string
text_for_tfidf = ' '.join(processed_tokens)

# Fit and transform the text
tfidf_matrix = vectorizer.fit_transform([text_for_tfidf])  # Pass as a list with a single string element
tfidf_matrix

<1x147 sparse matrix of type '<class 'numpy.float64'>'
	with 147 stored elements in Compressed Sparse Row format>
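
The `1x147` shape shows that the vectorizer saw one document with 147 distinct terms. One way to give the IDF factor something to compare, sketched here as an assumption rather than as part of the original analysis, is to treat each sentence as its own document. The regex-based sentence splitter and the sample text are hypothetical simplifications.

```python
import math
import re
from collections import Counter

sample = ("Iqbal was born in Sialkot. His father was a tailor. "
          "Iqbal wrote poetry in Urdu.")  # hypothetical mini-corpus

# Naive sentence split on whitespace that follows ., ! or ? (assumption).
sentences = re.split(r'(?<=[.!?])\s+', sample.strip())
docs = [re.findall(r'\w+', s.lower()) for s in sentences]

# Document frequency per term across the sentence-level documents.
n = len(docs)
df = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF in the form used by scikit-learn's default settings.
idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

print(n, df["iqbal"], df["tailor"])
print(round(idf["iqbal"], 4), round(idf["tailor"], 4))
```

Terms confined to one sentence now receive a higher IDF than terms spread across several, which is the contrast a single-document corpus cannot provide.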

In [8]:
# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()
feature_names

array(['1877', '1938', '21', 'admit', 'afghan', 'among', 'ancestor',
       'arab', 'arbiter', 'aril', 'bent', 'bib', 'big', 'bird', 'birth',
       'bless', 'book', 'born', 'brain', 'brutish', 'call', 'catch',
       'century', 'color', 'company', 'conoid', 'convert', 'could',
       'creation', 'crowd', 'day', 'dispute', 'do', 'door', 'dream',
       'early', 'educe', 'eighteenth', 'emir', 'equal', 'era', 'everyone',
       'family', 'father', 'fell', 'field', 'finance', 'flight', 'fly',
       'formal', 'fortune', 'four', 'gather', 'gave', 'gene', 'god',
       'got', 'grandfather', 'great', 'greatest', 'head', 'help', 'hence',
       'highly', 'imam', 'indeed', 'india', 'individual', 'inspire',
       'intern', 'kashmir', 'language', 'lap', 'large', 'last', 'late',
       'lie', 'liter', 'live', 'llama', 'magnific', 'mean', 'message',
       'modern', 'mother', 'muhammad', 'muslin', 'mystic', 'name',
       'nation', 'need', 'neighbor', 'never', 'nineteenth', 'northern',
       'no

In [9]:
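# Access and analyze TF-IDF values
# For example, to print the top 10 words with the highest TF-IDF scores.
# With a single document, these scores are L2-normalized term counts.
top_n = 10
scores = tfidf_matrix.toarray()[0]  # Dense score vector for the single document
top_indices = scores.argsort()[::-1][:top_n]  # Indices ordered by descending score
for index in top_indices:
    print(f"{feature_names[index]}: {scores[index]:.4f}")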

equal: 0.4465
pakistan: 0.2791
india: 0.1674
father: 0.1674
whose: 0.1674
kashmir: 0.1674
philosophy: 0.1674
muhammad: 0.1674
muslin: 0.1116
brutish: 0.1116
