# Text Classification with the 20 Newsgroups Dataset

In this notebook, we perform text classification on the 20 Newsgroups dataset using a Naive Bayes classifier. The process includes:

1. **Loading Data**: Fetching the training and test subsets of the 20 Newsgroups dataset.
2. **Preprocessing**: Converting text documents into token counts with `CountVectorizer` and reweighting those counts with `TfidfTransformer`.
3. **Model Training**: Training a Naive Bayes classifier (`MultinomialNB`) on the TF-IDF features of the training data.
4. **Evaluation**: Bundling preprocessing and the classifier into a `Pipeline`, predicting on the test data, and reporting accuracy, a classification report, and a confusion matrix.

The notebook demonstrates a typical workflow for text classification tasks using scikit-learn.


In [1]:
# Importing the necessary module from scikit-learn to load the 20 Newsgroups dataset.
from sklearn.datasets import fetch_20newsgroups

# Fetching the training subset of the 20 Newsgroups dataset. This dataset is used for text classification.
# 'shuffle=True' randomizes the document order. This matters for order-sensitive learners; Naive Bayes itself does not depend on sample order.
doc_train = fetch_20newsgroups(subset='train', shuffle=True)


In [2]:
# Importing CountVectorizer for converting a collection of text documents into a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer

# Initializing the CountVectorizer. This will be used to convert the text data into a format suitable for model training.
count_vect = CountVectorizer()

# Applying the CountVectorizer to the training data to create a document-term matrix.
# Each row represents a document, and each column represents a term (word) in the corpus.
X_train_counts = count_vect.fit_transform(doc_train.data)
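
To make the document-term matrix concrete, here is a minimal pure-Python sketch of the same idea on a two-sentence toy corpus. It assumes lowercasing and tokens of two or more word characters, which mirrors the `CountVectorizer` defaults, but `build_count_matrix` is a hypothetical helper for illustration, not the library's implementation (which returns a sparse matrix).

```python
import re
from collections import Counter


def build_count_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    token_re = re.compile(r"\b\w\w+\b")  # tokens of at least two word characters
    tokenized = [token_re.findall(doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            row[index[tok]] = count
        rows.append(row)
    return vocab, rows


toy_docs = ["The cat sat on the mat", "The dog sat"]
vocab, counts = build_count_matrix(toy_docs)
print(vocab)
print(counts)
```

Each row is one document and each column one vocabulary term, so "the" gets a count of 2 in the first row.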


In [7]:
# Importing TfidfTransformer to transform the count matrix to a term-frequency times inverse document-frequency (TF-IDF) representation.
from sklearn.feature_extraction.text import TfidfTransformer

# Initializing the TfidfTransformer. This down-weights terms that appear in many documents and up-weights terms that are specific to a few.
tfid = TfidfTransformer()

# Applying the TfidfTransformer to the document-term matrix to compute the TF-IDF scores.
X_train_tfid = tfid.fit_transform(X_train_counts)
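
The weighting can be sketched by hand. With its default settings, `TfidfTransformer` uses a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw count, and then scales each row to unit Euclidean length. The sketch below applies that formula to a small hand-written count matrix; `tfidf_rows` and `toy_counts` are illustrative names, and the code is a simplified dense version rather than the library's sparse implementation.

```python
import math


def tfidf_rows(counts):
    """Apply smoothed IDF weighting and L2 row normalization to a count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    weighted_rows = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        weighted_rows.append([v / norm for v in weighted])
    return idf, weighted_rows


# Toy counts: 2 documents, 6 terms; the last two terms occur in both documents.
toy_counts = [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
idf, tfidf = tfidf_rows(toy_counts)
print([round(w, 3) for w in idf])
```

Terms that occur in every document get the minimum IDF of 1.0, while terms found in only one document are weighted more heavily.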


In [8]:
# Importing MultinomialNB, a Naive Bayes classifier for multinomially distributed data, which is commonly used for text classification.
from sklearn.naive_bayes import MultinomialNB

# Initializing the MultinomialNB classifier.
clf = MultinomialNB()

# Training the Naive Bayes classifier on the TF-IDF transformed training data and the corresponding labels (targets).
clf.fit(X_train_tfid, doc_train.target)
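
For intuition about what `fit` estimates, the following is a minimal sketch of multinomial Naive Bayes with additive (Laplace) smoothing, using `alpha=1.0` as in the `MultinomialNB` default. The function names and the toy data are assumptions made for illustration; the real estimator works on sparse matrices and, in this notebook, on fractional TF-IDF values rather than integer counts.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log P(term | class) from count rows."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_prob = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Smoothed per-term totals for this class.
        totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        denom = sum(totals)
        log_prob[c] = [math.log(t / denom) for t in totals]
    return classes, log_prior, log_prob


def predict_one(x, classes, log_prior, log_prob):
    """Pick the class with the highest log prior plus weighted log likelihood."""
    scores = {
        c: log_prior[c] + sum(v * lp for v, lp in zip(x, log_prob[c]))
        for c in classes
    }
    return max(scores, key=scores.get)


toy_X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
toy_y = [0, 0, 1, 1]
classes, log_prior, log_prob = fit_multinomial_nb(toy_X, toy_y)
print(predict_one([1, 1, 0, 0], classes, log_prior, log_prob))
print(predict_one([0, 0, 1, 1], classes, log_prior, log_prob))
```

The smoothing term keeps every probability above zero, so a term never seen in a class does not rule that class out entirely.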


In [16]:
# Importing Pipeline from scikit-learn to streamline the process of combining multiple steps in a single workflow.
from sklearn.pipeline import Pipeline

# Creating a Pipeline that chains text vectorization, TF-IDF transformation, and the classifier.
# This applies the same preprocessing consistently during both training and prediction.
text_clf = Pipeline(
    [
        ('vect', CountVectorizer()),  # Step 1: Convert text documents to term-count matrix.
        ('tfid', TfidfTransformer()),  # Step 2: Transform term-count matrix to TF-IDF representation.
        ('clf', MultinomialNB())  # Step 3: Train a Naive Bayes classifier on the TF-IDF features.
    ]
)

# Fitting the pipeline to the training data. This runs vectorization and TF-IDF transformation, then trains the classifier, in one call.
text_clf = text_clf.fit(doc_train.data, doc_train.target)
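
The control flow a pipeline automates can be sketched in a few lines: during `fit`, every intermediate step is fitted and its output passed on; during `predict`, the intermediate steps only transform, so nothing is re-learned from test data. `TinyPipeline`, `VocabCounter`, and `CountVoteClassifier` below are hypothetical toy classes written for this illustration, not scikit-learn components.

```python
class VocabCounter:
    """Toy transformer: learns a vocabulary in fit, counts known words in transform."""

    def fit_transform(self, docs, y=None):
        words = sorted({w for d in docs for w in d.lower().split()})
        self.vocab_ = {w: j for j, w in enumerate(words)}
        return self.transform(docs)

    def transform(self, docs):
        rows = []
        for d in docs:
            row = [0] * len(self.vocab_)
            for w in d.lower().split():
                if w in self.vocab_:  # words unseen during fit are ignored
                    row[self.vocab_[w]] += 1
            rows.append(row)
        return rows


class CountVoteClassifier:
    """Toy classifier: scores a class by overlap with its summed training counts."""

    def fit(self, X, y):
        self.totals_ = {}
        for row, label in zip(X, y):
            acc = self.totals_.setdefault(label, [0] * len(row))
            for j, v in enumerate(row):
                acc[j] += v
        return self

    def predict(self, X):
        def score(row, label):
            return sum(a * b for a, b in zip(row, self.totals_[label]))
        return [max(self.totals_, key=lambda c: score(row, c)) for row in X]


class TinyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)  # learn from training data only
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)  # reuse what was learned during fit
        return self.steps[-1][1].predict(X)


pipe = TinyPipeline([("vect", VocabCounter()), ("clf", CountVoteClassifier())])
pipe.fit(
    ["fast car engine", "car race", "hockey goal", "goal keeper ice"],
    ["autos", "autos", "hockey", "hockey"],
)
toy_predictions = pipe.predict(["new car engine", "ice hockey goal"])
print(toy_predictions)
```

The word "new" never appears in the training documents, so it is simply ignored at prediction time instead of extending the vocabulary.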


In [19]:
# Fetching the test subset of the 20 Newsgroups dataset. This dataset will be used to evaluate the performance of the trained model.
doc_test = fetch_20newsgroups(subset='test', shuffle=True)

# Using the trained pipeline to make predictions on the test data.
# The pipeline handles both vectorization and TF-IDF transformation as part of the prediction process.
prediction = text_clf.predict(doc_test.data)
prediction


array([ 7, 11,  0, ...,  9,  3, 15])

In [20]:
import numpy as np

# Accuracy: the fraction of test documents whose predicted label matches the true label.
prediction_mean = np.mean(prediction == doc_test.target)
prediction_mean

np.float64(0.7738980350504514)

In [21]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict the labels for the test data
predictions = text_clf.predict(doc_test.data)

# Calculate and print the accuracy
accuracy = accuracy_score(doc_test.target, predictions)
print(f'Accuracy: {accuracy:.2f}')

# Print classification report
report = classification_report(doc_test.target, predictions, target_names=doc_test.target_names)
print('Classification Report:\n', report)

# Print confusion matrix
conf_matrix = confusion_matrix(doc_test.target, predictions)
print('Confusion Matrix:\n', conf_matrix)


Accuracy: 0.77
Classification Report:
                           precision    recall  f1-score   support

             alt.atheism       0.80      0.52      0.63       319
           comp.graphics       0.81      0.65      0.72       389
 comp.os.ms-windows.misc       0.82      0.65      0.73       394
comp.sys.ibm.pc.hardware       0.67      0.78      0.72       392
   comp.sys.mac.hardware       0.86      0.77      0.81       385
          comp.windows.x       0.89      0.75      0.82       395
            misc.forsale       0.93      0.69      0.80       390
               rec.autos       0.85      0.92      0.88       396
         rec.motorcycles       0.94      0.93      0.93       398
      rec.sport.baseball       0.92      0.90      0.91       397
        rec.sport.hockey       0.89      0.97      0.93       399
               sci.crypt       0.59      0.97      0.74       396
         sci.electronics       0.84      0.60      0.70       393
                 sci.med       0.92 
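
The columns of the report can be recomputed by hand from a confusion matrix. The sketch below does this on a small invented label set (the `toy_*` values are not from the 20 Newsgroups run) and then checks the F1 formula against the `alt.atheism` row above, where precision 0.80 and recall 0.52 give an F1 of about 0.63. The helper names are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """matrix[t][p] counts samples with true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_scores(matrix, c):
    """Return (precision, recall, f1) for class c."""
    tp = matrix[c][c]
    predicted = sum(row[c] for row in matrix)  # column sum
    actual = sum(matrix[c])                    # row sum, the "support"
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


toy_true = [0, 0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 0, 1, 1, 1, 2, 0, 2]
matrix = confusion_counts(toy_true, toy_pred, n_classes=3)
accuracy = sum(matrix[c][c] for c in range(3)) / len(toy_true)
print(matrix, accuracy)
print(per_class_scores(matrix, 1))

# F1 from the rounded alt.atheism precision and recall in the report.
atheism_f1 = 2 * 0.80 * 0.52 / (0.80 + 0.52)
print(round(atheism_f1, 2))
```

High precision with low recall, as in that row, means the model is usually right when it predicts the class but misses many documents that belong to it.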

In [22]:
import numpy as np

# Extract feature names
feature_names = text_clf.named_steps['vect'].get_feature_names_out()

# MultinomialNB has no feature importances; feature_log_prob_ holds log P(term | class) instead.
# Note: Ranking by this value alone favors frequent terms, so stop words appear near the top.
class_labels = text_clf.named_steps['clf'].classes_
feature_log_prob = text_clf.named_steps['clf'].feature_log_prob_

# Display the top features for each class
for i, class_label in enumerate(class_labels):
    top10 = np.argsort(feature_log_prob[i])[-10:]  # Get indices of top 10 features
    print(f'\nTop features for class {class_label}:')
    for index in top10:
        print(f'{feature_names[index]}: {np.exp(feature_log_prob[i, index]):.4f}')



Top features for class 0:
keith: 0.0002
it: 0.0002
and: 0.0002
you: 0.0002
in: 0.0002
that: 0.0002
is: 0.0002
to: 0.0003
of: 0.0003
the: 0.0004

Top features for class 1:
edu: 0.0001
in: 0.0001
for: 0.0001
it: 0.0001
is: 0.0001
and: 0.0002
graphics: 0.0002
of: 0.0002
to: 0.0002
the: 0.0003

Top features for class 2:
file: 0.0001
for: 0.0001
of: 0.0001
and: 0.0001
edu: 0.0001
is: 0.0001
it: 0.0002
to: 0.0002
the: 0.0003
windows: 0.0003

Top features for class 3:
card: 0.0001
ide: 0.0001
is: 0.0001
of: 0.0002
it: 0.0002
drive: 0.0002
and: 0.0002
scsi: 0.0002
to: 0.0002
the: 0.0004

Top features for class 4:
in: 0.0001
it: 0.0001
is: 0.0001
and: 0.0002
of: 0.0002
edu: 0.0002
apple: 0.0002
mac: 0.0002
to: 0.0002
the: 0.0004

Top features for class 5:
it: 0.0001
mit: 0.0001
in: 0.0001
motif: 0.0001
and: 0.0001
is: 0.0001
of: 0.0002
window: 0.0002
to: 0.0002
the: 0.0003

Top features for class 6:
shipping: 0.0001
offer: 0.0001
of: 0.0001
00: 0.0001
to: 0.0001
and: 0.0001
edu: 0.0002
the: 0.
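
Because common words such as "the" and "to" dominate every class in the listing above, one possible refinement is to rank terms by how much higher their log probability is in one class than in the others. The sketch below shows this on invented toy probabilities; `distinctive_terms` is a hypothetical helper, and the log-ratio scoring is an alternative offered for illustration, not something the notebook computes.

```python
import math


def distinctive_terms(log_prob, feature_names, target, top_n=2):
    """Rank terms by log P(term | target) minus the mean over the other classes."""
    others = [c for c in log_prob if c != target]
    scores = []
    for j, name in enumerate(feature_names):
        baseline = sum(log_prob[c][j] for c in others) / len(others)
        scores.append((log_prob[target][j] - baseline, name))
    scores.sort(reverse=True)
    return [name for _, name in scores[:top_n]]


# Toy values: "the" is equally likely in both classes, the other terms are not.
toy_names = ["the", "graphics", "hockey"]
toy_log_prob = {
    "comp.graphics": [math.log(p) for p in (0.5, 0.4, 0.1)],
    "rec.sport.hockey": [math.log(p) for p in (0.5, 0.1, 0.4)],
}

raw_top = max(zip(toy_log_prob["comp.graphics"], toy_names))[1]
distinctive_top = distinctive_terms(toy_log_prob, toy_names, "comp.graphics", top_n=1)
print(raw_top, distinctive_top)
```

Ranking by raw probability picks "the", while the log-ratio ranking picks "graphics", the term that actually separates the two toy classes.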

In [23]:
# Print the pipeline steps
print('Pipeline steps:')
for step_name, step_process in text_clf.named_steps.items():
    print(f'{step_name}: {step_process}')


Pipeline steps:
vect: CountVectorizer()
tfid: TfidfTransformer()
clf: MultinomialNB()


In [24]:
# Print a few sample predictions alongside their true labels
for doc, pred, true_label in zip(doc_test.data[:5], predictions[:5], doc_test.target[:5]):
    print(f'Document: {doc[:100]}...')  # Print the first 100 characters of the document
    print(f'Predicted Label: {doc_test.target_names[pred]}')
    print(f'True Label: {doc_test.target_names[true_label]}')
    print()


Document: From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)
Subject: Need info on 88-89 Bonneville
Organi...
Predicted Label: rec.autos
True Label: rec.autos

Document: From: Rick Miller <rick@ee.uwm.edu>
Subject: X-Face?
Organization: Just me.
Lines: 17
Distribution: ...
Predicted Label: sci.crypt
True Label: comp.windows.x

Document: From: mathew <mathew@mantis.co.uk>
Subject: Re: STRONG & weak Atheism
Organization: Mantis Consultan...
Predicted Label: alt.atheism
True Label: alt.atheism

Document: From: bakken@cs.arizona.edu (Dave Bakken)
Subject: Re: Saudi clergy condemns debut of human rights g...
Predicted Label: talk.politics.mideast
True Label: talk.politics.mideast

Document: From: livesey@solntze.wpd.sgi.com (Jon Livesey)
Subject: Re: After 2000 years, can we say that Chris...
Predicted Label: alt.atheism
True Label: talk.religion.misc
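
Beyond eyeballing a handful of samples, the mismatches can be tallied to see which class pairs are confused most often. This is a minimal sketch on invented toy labels; `confusion_pairs` is a hypothetical helper, and in the notebook one would pass `doc_test.target`, `predictions`, and `doc_test.target_names` instead.

```python
from collections import Counter


def confusion_pairs(y_true, y_pred, names):
    """Count (true label, predicted label) pairs for misclassified samples."""
    pairs = Counter(
        (names[t], names[p]) for t, p in zip(y_true, y_pred) if t != p
    )
    return pairs.most_common()


toy_names = ["alt.atheism", "comp.windows.x", "sci.crypt"]
toy_true = [1, 0, 1, 2, 1]
toy_pred = [2, 0, 2, 2, 1]
pairs = confusion_pairs(toy_true, toy_pred, toy_names)
print(pairs)
```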



#### Summary
**Performance Metrics:** `accuracy_score`, `classification_report`, and `confusion_matrix` evaluate the model; the pipeline reaches about 0.77 accuracy on the test subset.  
**Feature Log Probabilities:** `MultinomialNB` exposes `feature_log_prob_` rather than feature importances; ranked on their own, the top terms are dominated by common words such as "the" and "to".  
**Pipeline Steps:** Printing the pipeline steps shows the processing flow: `vect`, `tfid`, `clf`.  
**Sample Predictions:** Reviewing a few sample predictions gives a sense of the model output; two of the five samples shown are misclassified.  

Together, these checks show how well the model performs overall and where it goes wrong.