DAY 1 – Load Text Dataset & Exploration

In [1]:
from sklearn.datasets import fetch_20newsgroups

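# Download (on first use) and load the training split of 20 Newsgroups
data = fetch_20newsgroups(subset='train')

print("Number of documents:", len(data.data))
# Only the first five of the 20 class names are shown
print("Target classes:", data.target_names[:5])
# Preview the first 500 characters of the first post
print("\nSample text:\n", data.data[0][:500])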


Number of documents: 11314
Target classes: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware']

Sample text:
 From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a m
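
The printout above lists only five class names and a single post, so it says little about how documents are spread across the 20 classes. The sketch below counts labels per class. `summarize_labels` and the toy inputs are hypothetical names introduced here for illustration; they are not part of the original notebook.

```python
from collections import Counter

def summarize_labels(targets, target_names):
    """Return {class name: document count}, most frequent class first.

    Hypothetical helper: `targets` is a sequence of integer labels and
    `target_names` maps each label index to its class name.
    """
    counts = Counter(int(t) for t in targets)
    return {target_names[label]: n for label, n in counts.most_common()}

# Toy stand-in for data.target / data.target_names (assumed values)
toy_targets = [0, 1, 1, 2, 1, 0]
toy_names = ["alt.atheism", "comp.graphics", "sci.space"]
print(summarize_labels(toy_targets, toy_names))
```

In the notebook, `summarize_labels(data.target, data.target_names)` should give the per-class counts for the training split. Note also that the sample still contains the post's headers (From, Subject, Organization). `fetch_20newsgroups` accepts `remove=('headers', 'footers', 'quotes')` to strip that metadata; the cells here do not use it, so their scores reflect text with headers included.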


DAY 2 – Text Cleaning & Preprocessing (Basic)

In [2]:
import re

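def clean_text(text):
    """Lowercase the text and collapse each run of non-word characters into one space."""
    text = text.lower()
    # \W matches anything other than letters, digits, and underscore
    text = re.sub(r'\W+', ' ', text)
    return text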

cleaned_texts = [clean_text(text) for text in data.data]
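
To see what this cleaning step actually does, it helps to run it on one short string. The sketch below repeats the notebook's `clean_text` logic so it runs on its own; the example string is made up for illustration.

```python
import re

def clean_text(text):
    # Same logic as the notebook's clean_text
    text = text.lower()
    return re.sub(r'\W+', ' ', text)

# Made-up input mixing punctuation, a contraction, a digit, and an address
example = "WHAT car is this!? It's a 2-door, user_name@wam.umd.edu"
cleaned = clean_text(example)
print(repr(cleaned))
```

Punctuation is dropped, but contractions are split ("it's" becomes "it s"), digits and underscores survive, and e-mail addresses break into separate tokens. Whether that is acceptable depends on the task; it is worth knowing before interpreting the features built on top of it.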


DAY 3 – Bag of Words (CountVectorizer)

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

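# Keep the 5,000 most frequent terms after dropping English stop words
vectorizer = CountVectorizer(stop_words='english', max_features=5000)
# Note: the vocabulary is learned from all documents, before the train/test split
X = vectorizer.fit_transform(cleaned_texts)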

y = data.target

print("Feature matrix shape:", X.shape)


Feature matrix shape: (11314, 5000)


DAY 4 – Train/Test Split + Naive Bayes Model

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

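# Hold out 20% of the documents for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Multinomial Naive Bayes with its default smoothing (alpha=1.0)
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Score on the held-out split
pred = nb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))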


Accuracy: 0.8214759169244366
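
The split above is purely random, so the number of test documents per class can drift from the class proportions in the full set. One option, not used in the original cell, is the `stratify` argument of `train_test_split`. The sketch below shows its effect on a toy label list; all values are made up.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, two balanced classes (assumed values)
toy_X = list(range(20))
toy_y = [0] * 10 + [1] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratify, both splits keep the 50/50 class ratio
print("train:", Counter(y_tr))
print("test:", Counter(y_te))
```

Applying `stratify=y` in the notebook would produce a different split, so the accuracy printed above would no longer be the number to expect.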


DAY 5 – Model Evaluation (Classification Report)

In [5]:
from sklearn.metrics import classification_report

print(classification_report(y_test, pred))


              precision    recall  f1-score   support

           0       0.87      0.93      0.90        97
           1       0.58      0.87      0.69       104
           2       1.00      0.03      0.07       115
           3       0.54      0.74      0.62       123
           4       0.71      0.87      0.78       126
           5       0.81      0.83      0.82       106
           6       0.69      0.85      0.76       109
           7       0.86      0.89      0.88       139
           8       0.85      0.91      0.88       122
           9       0.88      0.96      0.92       102
          10       1.00      0.93      0.96       108
          11       1.00      0.89      0.94       125
          12       0.81      0.76      0.79       114
          13       0.96      0.91      0.93       119
          14       0.93      0.89      0.91       127
          15       0.92      0.88      0.90       122
          16       0.90      0.91      0.91       121
          17       0.98    
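
The report labels classes by index, which makes rows hard to read. Per the Day 1 output, index 2 is comp.os.ms-windows.misc: its precision is 1.00 but its recall is 0.03, meaning almost all of its documents were assigned to other classes. Two things may help here: passing `target_names` to `classification_report`, and looking at a confusion matrix to see where the missed documents go. The sketch below uses made-up labels for three classes to show both calls.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels (assumed); class 2 mimics a high-precision, low-recall class
toy_names = ["alt.atheism", "comp.graphics", "comp.os.ms-windows.misc"]
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 1, 1, 2]

# target_names replaces the numeric row labels with class names
report = classification_report(
    toy_true, toy_pred, target_names=toy_names, zero_division=0
)
print(report)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred)
print(cm)
```

In the notebook the equivalent calls would be `classification_report(y_test, pred, target_names=data.target_names)` and `confusion_matrix(y_test, pred)`; row 2 of the matrix would then show which classes absorb the comp.os.ms-windows.misc posts.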

DAY 6 – TF-IDF Vectorization (Improvement)

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Same settings as the CountVectorizer, but counts are reweighted by inverse document frequency
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = tfidf.fit_transform(cleaned_texts)

# Same split parameters as before; this overwrites the Day 4 split variables
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42
)

nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train, y_train)

pred_tfidf = nb_tfidf.predict(X_test)
print("TF-IDF Accuracy:", accuracy_score(y_test, pred_tfidf))


TF-IDF Accuracy: 0.8590366769774636
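
In the cell above the vectorizer is fitted on every document before the split, so vocabulary and IDF weights are partly learned from the test documents. A common way to avoid this is to put the vectorizer and classifier in one pipeline and fit it on the training texts only. The following is a minimal sketch on a made-up four-document corpus, not the notebook's method; passing `clean_text` as the `preprocessor` is an assumption about how the cleaning step could be folded in.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def clean_text(text):
    # Same logic as the notebook's clean_text
    return re.sub(r'\W+', ' ', text.lower())

# Made-up training texts and labels for illustration
train_docs = [
    "rocket launch orbit",
    "satellite orbit moon",
    "car engine tires",
    "engine wheels brakes",
]
train_labels = ["space", "space", "autos", "autos"]

# The vectorizer is fitted only on what is passed to fit()
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_text, stop_words='english'),
    MultinomialNB(),
)
model.fit(train_docs, train_labels)

# Raw text goes straight in; cleaning and vectorizing happen inside the pipeline
toy_prediction = model.predict(["Moon rocket!"])
print(toy_prediction[0])
```

With this layout the split would be applied to the raw texts first, and only the training portion would reach `fit`. Scores obtained this way may differ from the accuracy printed above.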


DAY 7 – Predict on New Custom Text

In [7]:
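sample_text = ["NASA launches new satellite into space"]

# Apply the same cleaning, then the already-fitted TF-IDF vocabulary (transform, not fit_transform)
sample_clean = [clean_text(text) for text in sample_text]
sample_vec = tfidf.transform(sample_clean)

# predict returns an array of class indices; map the first one to its name
prediction = nb_tfidf.predict(sample_vec)
print("Predicted Category:", data.target_names[prediction[0]])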


Predicted Category: sci.space
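
A single predicted label hides how close the runner-up classes were. `MultinomialNB` also exposes `predict_proba`, which can be used to rank all classes for a text. The sketch below trains a toy model on made-up documents so that it runs on its own; the class names and texts are assumptions, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus with three classes
toy_docs = [
    "rocket launch orbit",
    "satellite orbit moon",
    "car engine tires",
    "engine wheels brakes",
    "goalie puck ice",
    "hockey ice skates",
]
toy_labels = [0, 0, 1, 1, 2, 2]
toy_names = ["sci.space", "rec.autos", "rec.sport.hockey"]

toy_tfidf = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_tfidf.fit_transform(toy_docs), toy_labels)

# One probability per class, in the order given by toy_model.classes_
proba = toy_model.predict_proba(toy_tfidf.transform(["nasa satellite orbit"]))[0]
ranked = sorted(
    ((toy_names[c], p) for c, p in zip(toy_model.classes_, proba)),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, p in ranked:
    print(f"{name}: {p:.3f}")
```

In the notebook the same pattern would use `nb_tfidf.predict_proba(sample_vec)` with `data.target_names`. Naive Bayes probabilities are often poorly calibrated, so the ranking is usually more informative than the raw values.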
