# NLP Practice Assignments
Day 2

You are part of a team developing a text classification system for a news aggregator 
platform. The platform aims to categorize news articles into different topics automatically. 
The dataset contains news articles along with their corresponding topics. Perform only the 
Feature extraction techniques.

Dataset Link: https://www.kaggle.com/datasets/therohk/million-headlines
        
Data Exploration: Begin by exploring the dataset. What are the different topics/categories 
present in the dataset? What is the distribution of articles across these topics?

Bag-of-Words (BoW): Implement a Bag-of-Words (BoW) model using CountVectorizer 
or TF-IDF to transform the text data into numerical features. Discuss the advantages and 
limitations of BoW in this context. Apply both unigram and bigram techniques and 
compare their effects on classification accuracy.

N-grams: Explore the use of N-grams (bi-grams, tri-grams) in feature engineering. How do 
different N-gram ranges impact the performance of the classification model?

TF-IDF: Apply TF-IDF (Term Frequency-Inverse Document Frequency) to the text data. 
Describe how TF-IDF works and its significance in capturing the importance of words 
across documents. Compare the results of TF-IDF with the BoW approach.

One-Hot Encoding: Investigate the application of One-Hot Encoding to encode categorical 
variables or labels. Can One-Hot Encoding be used directly for text classification? Why or 
why not?

Deliverables: 
Present insights gathered from data exploration and discuss the impact of different feature 
engineering techniques (BoW, N-grams, TF-IDF, One-Hot Encoding). Provide 
recommendations for the best feature engineering strategy.

In [13]:
import pandas as pd

# Load the dataset
df = pd.read_csv("abcnews-date-text.csv")

# Display the column names
print("Column Names:", df.columns)

# Display the first few rows of the dataset
print(df.head())

Column Names: Index(['publish_date', 'headline_text'], dtype='object')
   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
3      20030219           air nz staff in aust strike for pay rise
4      20030219      air nz strike to affect australian travellers


In [14]:
# The dataset has only 'publish_date' and 'headline_text'; there is no topic/category column.
# unique() therefore lists distinct headlines, not topics.
unique_topics = df['headline_text'].unique()
print(f"Topics/Categories: {unique_topics}")

# value_counts() counts how often each headline repeats, not articles per topic
topic_distribution = df['headline_text'].value_counts()
print("Distribution of Articles Across Topics:\n", topic_distribution)


Topics/Categories: ['aba decides against community broadcasting licence'
 'act fire witnesses must be aware of defamation'
 'a g calls for infrastructure protection summit'
 'air nz staff in aust strike for pay rise'
 'air nz strike to affect australian travellers'
 'ambitious olsson wins triple jump'
 'antic delighted with record breaking barca'
 'aussie qualifier stosur wastes four memphis match'
 'aust addresses un security council over iraq'
 'australia is locked into war timetable opp'
 'australia to contribute 10 million in aid to iraq'
 'barca take record as robson celebrates birthday in'
 'bathhouse plans move ahead'
 'big hopes for launceston cycling championship'
 'big plan to boost paroo water supplies'
 'blizzard buries united states in bills'
 'brigadier dismisses reports troops harassed in'
 'british combat troops arriving daily in kuwait'
 'bryant leads lakers to double overtime win'
 'bushfire victims urged to see centrelink'
 'businesses should prepare for terrorist at
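
Because the file has no topic column, exploration is limited to what the two existing columns offer. A minimal sketch, assuming a small stand-in DataFrame built from the headlines printed above (the column names `date` and `n_words` are introduced here for illustration only), checks for a label column, counts headlines per day, and measures headline length:

```python
import pandas as pd

# Small stand-in for abcnews-date-text.csv, using headlines shown in the output above
df = pd.DataFrame({
    "publish_date": [20030219] * 5,
    "headline_text": [
        "aba decides against community broadcasting licence",
        "act fire witnesses must be aware of defamation",
        "a g calls for infrastructure protection summit",
        "air nz staff in aust strike for pay rise",
        "air nz strike to affect australian travellers",
    ],
})

# Check whether any topic-like column exists (the names tested are assumptions)
has_topic_column = any(col in df.columns for col in ("topic", "category"))
print("Has topic column:", has_topic_column)

# Parse the YYYYMMDD integer into a date and count headlines per day
df["date"] = pd.to_datetime(df["publish_date"].astype(str), format="%Y%m%d")
headlines_per_day = df.groupby("date").size()
print(headlines_per_day)

# Headline length in whitespace-separated words
df["n_words"] = df["headline_text"].str.split().str.len()
print(df["n_words"].describe())
```

On the full file, the same calls would give the publishing volume over time and the typical headline length, which is about as far as label-free exploration can go.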

# Bag-of-Words (BoW):

In [23]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

# Bag-of-Words (BoW) with unigrams
vectorizer_unigram = CountVectorizer()
X_bow_unigram = vectorizer_unigram.fit_transform(df['headline_text'])
print("BoW with unigrams shape:", X_bow_unigram.shape)

# Bag-of-Words (BoW) with bigrams
vectorizer_bigram = CountVectorizer(ngram_range=(2, 2))
X_bow_bigram = vectorizer_bigram.fit_transform(df['headline_text'])
print("BoW with bigrams shape:", X_bow_bigram.shape)

BoW with unigrams shape: (499, 1689)
BoW with bigrams shape: (499, 2596)
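
To make the unigram versus bigram difference concrete, the sketch below uses two made-up headlines (not from the dataset) that contain the same words in a different order. Under these assumptions, unigram BoW cannot tell them apart, while bigram BoW can:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical headlines with identical words but different order
docs = ["dog bites man", "man bites dog"]

# Unigram BoW: counts of single words, word order is discarded
unigram = CountVectorizer()
X_uni = unigram.fit_transform(docs).toarray()

# Bigram BoW: counts of adjacent word pairs, local order is kept
bigram = CountVectorizer(ngram_range=(2, 2))
X_bi = bigram.fit_transform(docs).toarray()

print("Unigram vocabulary:", sorted(unigram.vocabulary_))
print("Bigram vocabulary:", sorted(bigram.vocabulary_))

same_unigram = bool((X_uni[0] == X_uni[1]).all())
same_bigram = bool((X_bi[0] == X_bi[1]).all())
print("Identical unigram vectors:", same_unigram)
print("Identical bigram vectors:", same_bigram)
```

The trade-off is a larger and sparser feature matrix, which matches the jump from 1689 to 2596 columns reported above.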


# N-grams:


In [27]:
# N-grams: ngram_range=(2, 4) extracts bi-grams, tri-grams, and 4-grams
vectorizer_ngram = CountVectorizer(ngram_range=(2, 4))
X_ngram = vectorizer_ngram.fit_transform(df['headline_text'])
print("N-gram (2-4) shape:", X_ngram.shape)

N-gram (2-4) shape: (499, 6616)
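
One way to see how the N-gram range drives feature count is to loop over several ranges on the same text. This is a rough sketch on five headlines from the dataset rather than the full file; the dictionary name `feature_counts` is introduced here for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Five headlines taken from the dataset preview
docs = [
    "aba decides against community broadcasting licence",
    "act fire witnesses must be aware of defamation",
    "a g calls for infrastructure protection summit",
    "air nz staff in aust strike for pay rise",
    "air nz strike to affect australian travellers",
]

# Number of features produced by each (min_n, max_n) range
feature_counts = {}
for ngram_range in [(1, 1), (2, 2), (1, 2), (2, 4)]:
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    X = vectorizer.fit_transform(docs)
    feature_counts[ngram_range] = X.shape[1]
    print(f"ngram_range={ngram_range}: {X.shape[1]} features")
```

Note that the default tokenizer drops single-character tokens such as "a" and "g", so N-grams are built from the remaining words only.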


# TF-IDF:

In [28]:
# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['headline_text'])
print("TF-IDF shape:", X_tfidf.shape)


TF-IDF shape: (499, 1689)
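
The TF-IDF shape matches the unigram BoW shape because the vocabulary is the same; only the values change. With scikit-learn's default settings, each count is multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and each row is then scaled to unit L2 norm. The sketch below reproduces that by hand on three headlines from the dataset and compares it with TfidfVectorizer; the variable names are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "air nz staff in aust strike for pay rise",
    "air nz strike to affect australian travellers",
    "aust addresses un security council over iraq",
]

# Raw term counts (the BoW matrix)
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then L2-normalise each row (norm='l2' default)
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
matches = bool(np.allclose(tfidf_manual, tfidf_sklearn))
print("Manual TF-IDF matches TfidfVectorizer:", matches)
```

Terms that occur in several headlines, such as "air" or "strike", receive a lower idf than terms that occur in only one.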


# One-Hot Encoding:

In [33]:
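# Note: CountVectorizer() yields word counts, i.e. the same unigram BoW features as above.
# For true 0/1 presence indicators, pass binary=True.
encoder = CountVectorizer()
X_onehot = encoder.fit_transform(df['headline_text'])
print("One-Hot Encoding shape:", X_onehot.shape)

# Split the dataset for training and testing.
# The dataset has no topic column, so 'publish_date' stands in as the target here;
# the resulting accuracy measures date prediction, not topic classification.
X_train, X_test, y_train, y_test = train_test_split(X_onehot, df['publish_date'], test_size=0.2, random_state=42)

# Train a classifier (e.g., Naive Bayes) on these features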
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Evaluate accuracy on the test set
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with One-Hot Encoding: {accuracy}")

One-Hot Encoding shape: (499, 1689)
Accuracy with One-Hot Encoding: 0.44
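
One-Hot Encoding is normally applied to categorical values such as labels, not to raw text. The sketch below assumes a handful of hypothetical topic labels (the dataset itself has none) and shows OneHotEncoder on them, followed by the closest text equivalent, CountVectorizer with binary=True, which records word presence instead of counts:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical topic labels, one per article (not present in the dataset)
topics = np.array(["sport", "politics", "sport", "business"]).reshape(-1, 1)

# One column per category, exactly one 1 per row
label_encoder = OneHotEncoder()
y_onehot = label_encoder.fit_transform(topics).toarray()
print("Categories:", label_encoder.categories_[0])
print(y_onehot)

# Made-up headlines where one word repeats
docs = ["air nz strike strike", "air nz staff"]

# Plain counts: 'strike' is recorded twice in the first headline
X_counts = CountVectorizer().fit_transform(docs).toarray()

# Binary presence: every non-zero count becomes 1
X_binary = CountVectorizer(binary=True).fit_transform(docs).toarray()
print("Counts:\n", X_counts)
print("Binary:\n", X_binary)
```

Encoding each whole headline as a single category would give one column per distinct headline, which carries no shared information between articles; this is why word-level representations are used for the text itself.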


In [None]:
The dataset contains only publish_date and headline_text; it has no topic column, so the distribution of articles across topics cannot be measured directly.
The feature extraction results are still informative.
On the 499 headlines loaded here, unigram Bag-of-Words (BoW) produces 1689 features; it records which words occur but discards word order.
Bigram BoW produces 2596 features and captures local word order, and combining 2-grams through 4-grams raises the count to 6616, so longer N-grams increase dimensionality and sparsity quickly.
TF-IDF keeps the unigram vocabulary (1689 features) but reweights each term, down-weighting words that appear in many headlines.
The cell labelled One-Hot Encoding applied CountVectorizer to the headlines and reached an accuracy of 0.44 with Naive Bayes; because the target was publish_date, this figure does not measure topic classification.
One-Hot Encoding itself is better suited to categorical labels than to raw text.
Recommendations for refining the text classification system include ensuring data quality (in particular, obtaining topic labels), evaluating models with several metrics, tuning hyperparameters, exploring more advanced models,
and considering further preprocessing or additional features to improve classification accuracy and robustness.
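
As a rough illustration of the tuning and evaluation recommendations, the sketch below combines TF-IDF and Naive Bayes in a Pipeline and searches over the N-gram range and smoothing parameter. The topic labels are hand-assigned to eight dataset headlines purely for demonstration; they are an assumption, not part of the data, and the sample is far too small for the scores to mean anything:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Headlines from the dataset with hypothetical, hand-assigned labels
headlines = [
    "ambitious olsson wins triple jump",
    "antic delighted with record breaking barca",
    "bryant leads lakers to double overtime win",
    "big hopes for launceston cycling championship",
    "aust addresses un security council over iraq",
    "australia is locked into war timetable opp",
    "australia to contribute 10 million in aid to iraq",
    "british combat troops arriving daily in kuwait",
]
topics = ["sport"] * 4 + ["iraq"] * 4

# Feature extraction and classifier in one estimator
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Candidate settings: unigrams only vs unigrams plus bigrams, two smoothing values
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 1.0],
}

# Macro-averaged F1 treats both classes equally; cv=2 only because the sample is tiny
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(headlines, topics)

print("Best parameters:", search.best_params_)
print("Best macro F1:", search.best_score_)
```

Keeping the vectorizer inside the pipeline means the vocabulary is fitted on the training folds only, which avoids leaking test-fold words into the features.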