# Text Insights: Extracting Key Terms with NLP & TF-IDF

## Introduction

Natural Language Processing (NLP) enables us to analyze, process, and extract meaningful insights from textual data. In this project, we explore the application of **TF-IDF (Term Frequency-Inverse Document Frequency)** and **text preprocessing** techniques to analyze a collection of articles. The primary goal is to convert raw text into a structured numerical format, allowing us to identify significant terms and patterns within each document.

### **Objectives of the Project**
- **Preprocess textual data** by cleaning, tokenizing, removing stopwords, and lemmatizing.
- **Vectorize the text** using **CountVectorizer** and **TF-IDF** to transform words into numerical features.
- **Analyze TF-IDF scores** to identify the most important terms in each article.
- **Gain insights** into which keywords characterize each document.

By implementing these NLP techniques, we can better understand how term importance varies across different texts, a fundamental step in various applications like search engines, text classification, and topic modeling.

---

## **Workflow of the Project**
1. **Text Preprocessing:**
   - Convert text to lowercase and remove punctuation.
   - Tokenize the text into words.
   - Remove stopwords to reduce noise.
   - Apply **lemmatization** to normalize words to their root forms.
    
   
2. **Text Vectorization:**
   - Use **CountVectorizer** to convert words into numerical frequency-based representations.
   - Transform the frequency matrix using **TF-IDF** to assign importance to words based on their relevance.
    

3. **Data Analysis:**
   - Construct **TF-IDF matrices** for each article.
   - Identify the **most significant words** per document.
   - Compare different vectorization methods to validate consistency.
---

In [1]:
import nltk
import re
import pandas as pd
import numpy as np
from nltk.corpus import wordnet, stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
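from articles import articles  # list of raw article strings from the local articles module

# One-time setup, only if the NLTK data is missing (newer NLTK releases may also need 'punkt_tab'):
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')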

#### Initialize stopwords and lemmatizer, and define helper functions

In [2]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [3]:
def get_part_of_speech(word):  # determine the most probable part of speech for a word
    probable_pos = wordnet.synsets(word)
    pos_counts = Counter()
    # Count the WordNet synsets per tag: noun, verb, adjective, adverb.
    pos_counts["n"] = sum(1 for item in probable_pos if item.pos() == "n")
    pos_counts["v"] = sum(1 for item in probable_pos if item.pos() == "v")
    pos_counts["a"] = sum(1 for item in probable_pos if item.pos() == "a")
    pos_counts["r"] = sum(1 for item in probable_pos if item.pos() == "r")
    # Ties, including words WordNet does not know (all counts zero), resolve to
    # the first key inserted, so "n" acts as the fallback.
    return pos_counts.most_common(1)[0][0]
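
The fallback to `"n"` depends on how `Counter.most_common` breaks ties: entries with equal counts keep their insertion order. The sketch below reproduces only the counting step, using hand-written tag lists instead of real WordNet lookups, so it runs without any NLTK data. The helper `most_probable_pos` and its inputs are illustrative and not part of the notebook.

```python
from collections import Counter

def most_probable_pos(synset_tags):
    """Pick the most frequent tag among 'n', 'v', 'a', 'r' (hypothetical stand-in)."""
    pos_counts = Counter()
    for tag in ("n", "v", "a", "r"):  # insertion order decides ties
        pos_counts[tag] = sum(1 for t in synset_tags if t == tag)
    return pos_counts.most_common(1)[0][0]

print(most_probable_pos(["v", "v", "n"]))  # mostly verb senses
print(most_probable_pos([]))               # unknown word: every count is zero
print(most_probable_pos(["v", "n"]))       # tie between noun and verb
```

The last two calls both return `"n"`, which is why unknown or ambiguous tokens are lemmatized as nouns.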

In [4]:
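def preprocess_text(text):  # clean, tokenize, remove stopwords, and lemmatize text
    # Lowercase and delete punctuation. Nothing is put in its place, so words separated
    # only by punctuation are fused (e.g. "reported.Sources" becomes "reportedsources").
    text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = word_tokenize(text)
    # Drop stopwords and tokens that start with a digit, then lemmatize with the likeliest POS.
    tokens = [lemmatizer.lemmatize(token, get_part_of_speech(token))
              for token in tokens if token not in stop_words and not re.match(r'\d+', token)]
    return ' '.join(tokens)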
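
To see what the two regular expressions in `preprocess_text` do in isolation, here is a minimal sketch using only the standard library. The helper names and the sample sentence are made up for illustration, and a plain whitespace split stands in for `word_tokenize`.

```python
import re

def strip_punctuation(text):
    """Lowercase and delete every character that is not a word character or whitespace."""
    return re.sub(r'[^\w\s]', '', text.lower())

def starts_with_digit(token):
    """Mirror the notebook's digit filter: True when the token begins with a digit."""
    return re.match(r'\d+', token) is not None

sample = "Geo News reported.Sources said 80pc vehicles run on CNG."
cleaned = strip_punctuation(sample)
tokens = cleaned.split()  # whitespace split stands in for word_tokenize here
kept = [t for t in tokens if not starts_with_digit(t)]
print(cleaned)
print(kept)

# Possible variation (an assumption, not the notebook's behavior): substitute a
# space for punctuation so that "reported.Sources" stays two separate words.
spaced = re.sub(r'[^\w\s]', ' ', sample.lower()).split()
print(spaced)
```

Because `re.match` anchors at the start of the token, mixed tokens such as `80pc` are dropped along with pure numbers.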

#### Preprocess articles

In [5]:
processed_articles = [preprocess_text(article) for article in articles]

In [6]:
print("First two articles:\n", articles[:2]) 

First two articles:
 ['KARACHI: The Sindh government has decided to bring down public transport fares by 7 per cent due to massive reduction in petroleum product prices by the federal government, Geo News reported.Sources said reduction in fares will be applicable on public transport, rickshaw, taxi and other means of traveling. Meanwhile, Karachi Transport Ittehad (KTI) has refused to abide by the government decision.KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country, adding that 80pc vehicles run on Compressed Natural Gas (CNG). Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made.', 'HONG KONG:  Hong Kong shares opened 0.66 percent lower Monday following a tepid lead from Wall Street, as the first full week of the new year kicked off. The benchmark Hang Seng Index dipped 158.63 points to 23,699.19.']


In [7]:
print("First two articles after cleaning the text for our analysis:\n", processed_articles[:2])

First two articles after cleaning the text for our analysis:
 ['karachi sindh government decide bring public transport fare per cent due massive reduction petroleum product price federal government geo news reportedsources say reduction fare applicable public transport rickshaw taxi mean travel meanwhile karachi transport ittehad kti refuse abide government decisionkti president irshad bukhari say commuter charge low fare karachi compare part country add vehicle run compress natural gas cng bukhari say karachi transporter cut fare decrease cng price make', 'hong kong hong kong share open percent lower monday follow tepid lead wall street first full week new year kick benchmark hang seng index dip point']


#### Initialize vectorizers

In [8]:
vectorizer = CountVectorizer()
# norm=None keeps the raw tf * idf products instead of L2-normalizing each document vector.
tfidf_transformer = TfidfTransformer(norm=None)
tfidf_vectorizer = TfidfVectorizer(norm=None)
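
Setting `norm=None` leaves the scores as unnormalized `tf × idf` products. As a rough illustration on a three-sentence toy corpus (invented for this sketch), the default `norm='l2'` rescales each document vector to unit length but does not change which term scores highest within a document:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only.
toy_docs = [
    "fare fare transport karachi",
    "hong kong share index",
    "sugar price sugar sugar export",
]

raw = TfidfVectorizer(norm=None).fit_transform(toy_docs).toarray()
unit = TfidfVectorizer(norm="l2").fit_transform(toy_docs).toarray()

# Each L2-normalized row has Euclidean length 1.
row_lengths = np.linalg.norm(unit, axis=1)
# Dividing a row by a positive constant keeps its largest entry in place.
same_top_term = bool((raw.argmax(axis=1) == unit.argmax(axis=1)).all())
print(row_lengths)
print(same_top_term)
```

Raw scores are easier to trace back to counts, while normalized scores are easier to compare across documents of different lengths.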

#### Convert text to numerical representation

In [9]:
word_counts = vectorizer.fit_transform(processed_articles)
tfidf_scores_transformed = tfidf_transformer.fit_transform(word_counts)
tfidf_scores = tfidf_vectorizer.fit_transform(processed_articles)

#### Validate TF-IDF consistency

In [10]:
# Both routes (CountVectorizer + TfidfTransformer, and TfidfVectorizer alone) should give the same matrix.
if np.allclose(tfidf_scores_transformed.toarray(), tfidf_scores.toarray()):
    print("TF-IDF scores match.")
else:
    print("Mismatch in TF-IDF scores.")

TF-IDF scores match.


#### Extract feature names and construct DataFrames to visualize the results

In [11]:
feature_names = vectorizer.get_feature_names_out()
article_index = [f"Article {i+1}" for i in range(len(articles))]
df_word_counts = pd.DataFrame(word_counts.T.todense(), index=feature_names, columns=article_index)
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=article_index)

display(df_tf_idf)
print(df_tf_idf.describe())

| Term | Article 1 | Article 2 | Article 3 | Article 4 | Article 5 | Article 6 | Article 7 | Article 8 | Article 9 | Article 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| abbasi | 0.000000 | 0.000000 | 0.000000 | 2.704748 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| abide | 2.704748 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| accord | 0.000000 | 0.000000 | 2.704748 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| add | 2.299283 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 2.299283 | 0.000000 |
| agency | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 2.704748 | 0.0 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| world | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 8.114244 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| would | 0.000000 | 0.000000 | 0.000000 | 2.299283 | 0.0 | 0.000000 | 0.000000 | 0.0 | 2.299283 | 0.000000 |
| year | 0.000000 | 2.704748 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| yi | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 5.409496 |


        Article 1   Article 2   Article 3   Article 4   Article 5   Article 6  \
count  315.000000  315.000000  315.000000  315.000000  315.000000  315.000000   
mean     0.529467    0.217187    0.462144    0.518605    0.380584    0.421591   
std      1.385332    0.788563    1.714269    1.172005    1.351378    1.294715   
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
25%      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
50%      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
75%      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
max     10.818992    5.409496   18.933237    8.114244   16.228489    8.114244   

        Article 7   Article 8   Article 9  Article 10  
count  315.000000  315.000000  315.000000  315.000000  
mean     0.401589    0.452858    0.354990    0.246758  
std      1.052310    1.331015    1.107808    0.888571  
min      0.000000    0.000000    0.000000    0
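
The recurring values in this output can be reproduced by hand. With `norm=None` and scikit-learn's default smoothing, a score is the raw term count multiplied by `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The helper below is a hypothetical illustration; the document count (10) and the document frequencies are read off the table above.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency as documented for scikit-learn (smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_one_doc = smoothed_idf(10, 1)   # term found in a single article, e.g. "abide"
idf_two_docs = smoothed_idf(10, 2)  # term found in two articles, e.g. "add"

print(round(idf_one_doc, 6))      # 2.704748
print(round(idf_two_docs, 6))     # 2.299283
print(round(3 * idf_one_doc, 6))  # 8.114244, the score of "world" in Article 6
```

Under this reading, a score of 8.114244 means the term occurs three times in one article and nowhere else, and the 10.818992 maximum for Article 1 corresponds to four occurrences of a term unique to that article.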

#### Display top TF-IDF term for each article

In [12]:
for i in range(len(articles)):
    print(f"Top term for Article {i+1}:", df_tf_idf.iloc[:, i].idxmax())

Top term for Article 1: fare
Top term for Article 2: hong
Top term for Article 3: sugar
Top term for Article 4: petrol
Top term for Article 5: engine
Top term for Article 6: australia
Top term for Article 7: apple
Top term for Article 8: railway
Top term for Article 9: cabinet
Top term for Article 10: china
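
`idxmax` reports a single term per article, and when several terms share the maximum it returns the first one in index order. To get more context per article, one option is `Series.nlargest`. The sketch below uses a small DataFrame with made-up scores; `toy_tf_idf` and `top_terms` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Made-up scores laid out like df_tf_idf: terms as rows, articles as columns.
toy_tf_idf = pd.DataFrame(
    {
        "Article 1": [10.8, 5.4, 0.0, 2.7],
        "Article 2": [0.0, 0.0, 5.4, 2.7],
    },
    index=["fare", "karachi", "hong", "share"],
)

def top_terms(df, column, k=3):
    """Return up to k terms with the highest positive score in one article."""
    scores = df[column].nlargest(k)
    return scores[scores > 0].index.tolist()

for column in toy_tf_idf.columns:
    print(column, top_terms(toy_tf_idf, column))
```

Filtering out zero scores keeps terms that never occur in an article from padding its list.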


## **Key Findings & Conclusion**

Through this analysis, we achieved a structured numerical representation of textual data, enabling us to highlight the **most relevant words per article**. Some key takeaways:
- **TF-IDF ranks words by importance**, down-weighting words that appear in many articles and up-weighting terms concentrated in a few.
- **Text preprocessing shapes the results**: stopword removal cuts noise, and lemmatization merges inflected forms (for example, "fares" becomes "fare", the top term for Article 1).
- **The highest-scoring TF-IDF words offer valuable insights** into document themes, providing a foundation for tasks like authorship attribution, keyword extraction, and topic classification.

This project demonstrates how **NLP and text vectorization techniques** turn raw text into meaningful insights. It also provides a foundation for future extensions such as **document classification, sentiment analysis, or machine learning applications**.

*We will update and enhance this project very soon!*