# BBC News Classification — Exploratory Data Analysis

This notebook performs the exploratory data analysis (EDA) and feature extraction steps for the BBC news classification project.  We load the provided training data, inspect its structure, visualise category balance, article length and word frequencies, and explore topic structure using Non‑negative Matrix Factorisation (NMF).

## 1. Load and inspect the data

We read the `BBC News Train.csv` file and examine its columns, size and basic statistics.

In [None]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the training data
train_df = pd.read_csv('/home/oai/share/BBC News Train.csv')

# Display the first few rows
train_df.head()


In [None]:

# Dataset shape and column information
print('Dataset shape:', train_df.shape)
print('Columns:', train_df.columns.tolist())

# Check for missing values
print('\nMissing values per column:')
print(train_df.isnull().sum())

# Check for duplicate ArticleId and duplicate Text
print('\nNumber of duplicate ArticleId:', train_df['ArticleId'].duplicated().sum())
print('Number of duplicate Text:', train_df['Text'].duplicated().sum())

# Compute word and character counts
train_df['char_count'] = train_df['Text'].str.len()
train_df['word_count'] = train_df['Text'].str.split().apply(len)

# Display summary statistics of word_count
train_df['word_count'].describe()


From the above, we see that the dataset contains 1,490 rows and three columns (`ArticleId`, `Text` and `Category`).  There are no missing values, but some `Text` entries are duplicated.  To prevent the model from over‑fitting on identical documents, we will drop duplicate texts when building features.
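Before dropping duplicates it can be worth checking that repeated texts carry the same label. The following is a minimal sketch on a hypothetical toy frame (`toy_df` and its rows are invented for illustration; only the column names match the notebook), not a result from the real data.

```python
import pandas as pd

# Toy stand-in for train_df: hypothetical rows, same column names as the notebook
toy_df = pd.DataFrame({
    'ArticleId': [1, 2, 3, 4],
    'Text': ['shares rise', 'shares rise', 'team wins cup', 'new phone launched'],
    'Category': ['business', 'business', 'sport', 'tech'],
})

# keep=False marks every member of a duplicate group, not only the repeats
dupes = toy_df[toy_df['Text'].duplicated(keep=False)]

# Distinct labels per duplicated text; a value above 1 would signal a label conflict
labels_per_text = dupes.groupby('Text')['Category'].nunique()
print(labels_per_text)

# Keep the first occurrence of each text
deduped = toy_df.drop_duplicates(subset='Text')
print(deduped.shape)
```

If any duplicated text turned out to have more than one label, simply keeping the first occurrence would silently pick one of them, so such cases would need a closer look.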

In [None]:

# Remove duplicates for visualisation
viz_df = train_df.drop_duplicates(subset='Text')

# Plot category distribution
plt.figure(figsize=(6, 4))
sns.countplot(y='Category', data=viz_df, order=viz_df['Category'].value_counts().index)
plt.title('Number of articles per category')
plt.xlabel('Count')
plt.ylabel('Category')
plt.tight_layout()
plt.show()

# Plot boxplot of word counts by category
plt.figure(figsize=(7, 4))
sns.boxplot(x='Category', y='word_count', data=viz_df)
plt.title('Distribution of word counts by category')
plt.xlabel('Category')
plt.ylabel('Word Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Plot character count distribution
plt.figure(figsize=(6, 4))
sns.histplot(viz_df['char_count'], bins=50)
plt.title('Distribution of article character count')
plt.xlabel('Number of characters')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()



In [None]:

import heapq

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Compute top words after stop‑word removal
# (CountVectorizer lower-cases text by default, so no explicit .str.lower() is needed)
cv = CountVectorizer(stop_words='english')
word_counts = cv.fit_transform(train_df['Text'])

# Sum each column of the document-term matrix to get corpus-wide counts
total_counts = np.asarray(word_counts.sum(axis=0)).ravel()
words = cv.get_feature_names_out()
word_freq = dict(zip(words, total_counts))

# Get top 20 words
top_words = heapq.nlargest(20, word_freq, key=word_freq.get)
top_counts = [word_freq[w] for w in top_words]

# Plot the top words
plt.figure(figsize=(8, 5))
sns.barplot(x=top_counts, y=top_words, palette='viridis')
plt.title('Top 20 frequent words (stop words removed)')
plt.xlabel('Frequency')
plt.ylabel('Word')
plt.tight_layout()
plt.show()


The bar chart above lists the 20 most frequent words after removing stop words.  Common journalistic terms such as **said**, **new**, **people** and **year** appear frequently.  Other terms like **election**, **music**, **film** and **mobile** hint at specific topics.

In [None]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Remove duplicate texts for NMF
nmf_df = train_df.drop_duplicates(subset='Text')

# TF‑IDF vectorisation
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(nmf_df['Text'])

# Apply NMF with 5 components (number of categories)
n_topics = 5
nmf_model = NMF(n_components=n_topics, random_state=42)
W = nmf_model.fit_transform(X)
H = nmf_model.components_

# Display top words for each topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    top_indices = topic.argsort()[::-1][:10]
    top_words = [feature_names[i] for i in top_indices]
    print(f'Topic {topic_idx + 1} top words: {", ".join(top_words)}')


The NMF model uncovers latent topics by factorising the TF‑IDF matrix into two non‑negative matrices: `W`, which holds the weight of each topic in each document, and `H`, which holds the weight of each term in each topic.  Each topic is represented by its highest‑weight words.  In our case, the topics align closely with the five news categories: one topic contains sports terms like *england*, *game* and *win*; another contains political terms like *labour*, *election* and *blair*; a business topic includes words like *growth* and *economy*; an entertainment topic contains *film*, *awards* and *actor*; and a technology topic includes *mobile*, *music* and *phone*.  This alignment suggests that TF‑IDF combined with NMF captures meaningful structure in the corpus.
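One way to check the topic alignment numerically, rather than by reading word lists, is to assign each document to its highest-weight topic and cross-tabulate that against the known category. The sketch below does this on a small invented corpus (`toy_docs` and all `*_toy` names are hypothetical); on the real data the same idea would apply to `W` and `nmf_df['Category']`.

```python
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented two-category corpus, used only to illustrate the check
toy_docs = pd.DataFrame({
    'Text': [
        'england win the game after late goal',
        'the team win the match and the cup game',
        'coach praises team after game win',
        'labour election campaign led by blair',
        'blair defends labour policy before election',
    ],
    'Category': ['sport', 'sport', 'sport', 'politics', 'politics'],
})

tfidf = TfidfVectorizer(stop_words='english')
X_toy = tfidf.fit_transform(toy_docs['Text'])

# W_toy has one row per document and one column per topic
nmf_toy = NMF(n_components=2, random_state=42, max_iter=500)
W_toy = nmf_toy.fit_transform(X_toy)

# The dominant topic of a document is the column with the largest weight
dominant_topic = W_toy.argmax(axis=1)

# Rows: true categories; columns: dominant topic index
alignment = pd.crosstab(toy_docs['Category'], dominant_topic,
                        rownames=['Category'], colnames=['Dominant topic'])
print(alignment)
```

A table in which each category concentrates in a single column indicates good alignment. Topic indices are arbitrary, so only the pattern matters, not which number a category receives.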

## 2. Discussion of embedding methods and plan of analysis

Several techniques can convert raw text into numeric features:

* **Term Frequency–Inverse Document Frequency (TF‑IDF)** weighs each term according to how often it appears in a document and how rare it is across the corpus.  Terms that are frequent in a document but rare overall receive higher weights, which often improves discrimination.
* **Word2Vec** trains a shallow neural network to predict context words (skip‑gram) or predict a word from its context (CBOW).  After training, semantically similar words have similar vectors.
* **GloVe** constructs a global co‑occurrence matrix and factorises it so that the dot product of two word vectors (plus bias terms) approximates the logarithm of how often the two words co‑occur.  GloVe embeddings capture both local and global context.
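To make the TF‑IDF weighting in the list above concrete, the following sketch computes it by hand on three invented sentences and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings, where the smoothed inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is L2-normalised; other TF‑IDF variants use slightly different formulas.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented mini-corpus for illustration
docs = ['market growth slows', 'market rally continues', 'film wins award']

# Raw term counts: one row per document, one column per term
cv_toy = CountVectorizer()
counts = cv_toy.fit_transform(docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed IDF as used by scikit-learn's defaults
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each document vector to unit L2 norm
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

# Both vectorisers sort the vocabulary alphabetically, so columns line up
tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

In this toy corpus *market* appears in two of the three documents and therefore receives a lower IDF than *film*, which appears in only one.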

For this project we choose TF‑IDF followed by NMF.  Unlike Word2Vec or GloVe, TF‑IDF does not require a large corpus to learn useful representations.  NMF reduces the high‑dimensional TF‑IDF matrix to a handful of interpretable topics that align with the news categories.  The resulting features will serve as inputs to a classifier in the next stage of the project.

## 3. Data cleaning and next steps

Before training a model we will:
1. **Convert to lower case and remove punctuation** to reduce vocabulary size.
2. **Remove stop words and apply lemmatisation or stemming** to collapse inflected forms into a common base form.
3. **Drop duplicate texts** to avoid biasing the model.
4. **Vectorise** the cleaned corpus with TF‑IDF, limiting the vocabulary size.
5. **Apply NMF** to reduce dimensionality and extract latent topics.
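The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the project's final pipeline: `clean_text` and the toy rows are hypothetical, the regular expression keeps only the ASCII letters a to z (so digits and accented characters are dropped as well as punctuation), and lemmatisation or stemming is left out because it needs an extra library such as NLTK or spaCy.

```python
import re

import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


def clean_text(text: str) -> str:
    """Lower-case, replace non-letter characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()


# Invented rows; the first two differ only in case and punctuation
toy_df = pd.DataFrame({'Text': [
    'Shares RISE, again!',
    'shares rise again',
    'Team wins the cup final.',
    'Striker scores in cup final',
    'Bank shares fall as profits dip',
]})

# Step 1: normalise case and punctuation
toy_df['clean_text'] = toy_df['Text'].apply(clean_text)

# Step 3: drop duplicates after cleaning, so near-identical rows collapse too
toy_df = toy_df.drop_duplicates(subset='clean_text')

# Steps 2 and 4: stop-word removal and TF-IDF with a capped vocabulary
vectorizer_toy = TfidfVectorizer(stop_words='english', max_features=5000)
X_clean = vectorizer_toy.fit_transform(toy_df['clean_text'])

# Step 5: reduce to a small number of topic features
nmf_clean = NMF(n_components=2, random_state=42, max_iter=500)
features = nmf_clean.fit_transform(X_clean)
print(features.shape)
```

Dropping duplicates after cleaning rather than before is a design choice made here: it also removes texts that differ only in case or punctuation.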

Next we will train a classifier (e.g., logistic regression or SVM) on the NMF features, tune hyper‑parameters using cross‑validation, and evaluate accuracy and F1‑scores.
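As a rough preview of that stage, the sketch below chains TF‑IDF, NMF and logistic regression in a scikit-learn pipeline and scores it with cross-validation. The texts, labels and parameter values are invented placeholders, so the scores themselves carry no meaning; the point is only the shape of the workflow.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Invented toy corpus: three documents per class
texts = [
    'team wins the cup final game',
    'striker scores as team wins game',
    'coach praises team after cup game',
    'election campaign begins for labour party',
    'labour party leads election poll',
    'party leader launches election campaign',
]
labels = ['sport'] * 3 + ['politics'] * 3

# Fitting the vectoriser and NMF inside the pipeline keeps each
# cross-validation fold free of information from its held-out documents
clf = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    NMF(n_components=2, random_state=42, max_iter=500),
    LogisticRegression(max_iter=1000),
)

# Stratified 3-fold cross-validation reporting accuracy and macro-averaged F1
scores = cross_validate(clf, texts, labels, cv=3,
                        scoring=['accuracy', 'f1_macro'])
print(scores['test_accuracy'])
print(scores['test_f1_macro'])
```

The same pipeline object could be handed to `GridSearchCV` to tune settings such as the number of NMF components.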
