<a href="https://colab.research.google.com/github/rewpak/AI-works/blob/main/NLP_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 6. Natural Language Processing
# Task 6.2 Text Classification
# Problem Description

The task is to develop a text classification model for a dataset consisting of messages from 20 news groups, covering diverse topics such as comp.graphics, rec.sport.hockey, sci.space, talk.religion.misc, and others. The objective is to automatically categorize these messages into their respective news groups based on their content.

Text classification consists of the following steps:

1. Collection of texts with class labels:

   1. rec.sport.hockey
   2. talk.religion.misc
   3. comp.graphics
   4. sci.space

2. Split the data into training set and testing set:

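   Both sets are collections of text documents from the specified categories: 'rec.sport.hockey', 'talk.religion.misc', 'comp.graphics', and 'sci.space'. The dataset ships with a predefined split, selected with the `subset` argument of `fetch_20newsgroups` ('train' or 'test').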

   

3. Extract the word count features:

   tf: term frequency, how often a word occurs in a document

   tf-idf: tf weighted by inverse document frequency, which down-weights words that occur in many documents

4. Build the naïve Bayes classifier from the training set:

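   This task uses a Multinomial Naïve Bayes classifier, which models each class by its distribution over word counts. It is a common choice for text classification because it is simple, fast to train, and effective with word count features. Although the multinomial model is defined for integer counts, fractional features such as tf-idf weights also work in practice.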

5. Apply the classifier on the testing data set:

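   The trained classifier predicts a news group for each document in the testing set. The testing documents are converted to features with the vectorizer that was fitted on the training set, so both sets share the same vocabulary.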

6. Evaluate the performance:

   The final step involves evaluating the model's performance. Common metrics such as accuracy, precision, recall, and F1-score are used to assess how well the classifier assigns messages to the correct news groups.
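
The six steps above can be sketched end to end on a tiny corpus. This is only an illustration under stated assumptions: the sentences and the train/test split below are made up, and `make_pipeline` is used here for brevity even though the notebook calls the vectorizer and the classifier separately.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Steps 1-2: a made-up labelled corpus, already split into training and testing sets
train_texts = [
    "the goalie made a great save in the hockey game",
    "the team scored in overtime to win the game",
    "hockey players skate fast on the ice",
    "the rocket launch put the satellite into orbit",
    "the shuttle crew repaired the telescope in orbit",
    "astronomers tracked the satellite from the ground",
]
train_labels = ["rec.sport.hockey"] * 3 + ["sci.space"] * 3

test_texts = [
    "the hockey team won the game on the ice",
    "a new satellite reached orbit after the launch",
]
test_labels = ["rec.sport.hockey", "sci.space"]

# Steps 3-4: extract tf-idf features and fit Multinomial Naive Bayes on them
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Step 5: predict labels for the unseen test documents
predicted = model.predict(test_texts)

# Step 6: evaluate (mean accuracy on the test set)
accuracy = model.score(test_texts, test_labels)
print(list(predicted), accuracy)
```

Wrapping both stages in one pipeline guarantees that the test documents are transformed with the vocabulary learned from the training documents.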

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

fetch_20newsgroups(subset='train')["target_names"]

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [None]:
# Get the training dataset for the specified categories
categories = ['rec.motorcycles', 'talk.politics.guns',
              'comp.sys.mac.hardware', 'sci.crypt']
training_data = fetch_20newsgroups(subset='train', categories=categories)



In [None]:
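# Create the tf-idf vectorizer (tokenizes, counts words, and applies idf weighting)
tfidf = TfidfVectorizer(use_idf=True)
# Alternative: plain term frequencies without idf weighting
# tfidf = TfidfVectorizer(use_idf=False)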
training_tfidf = tfidf.fit_transform(training_data.data)
print(training_tfidf.shape)

# Train a Multinomial Naive Bayes classifier
classifier = MultinomialNB().fit(training_tfidf, training_data.target)


(2317, 36913)
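
To see what the `use_idf` switch changes, the following sketch compares the two settings on three made-up sentences (the sentences and variable names are illustrative assumptions, not part of the lab data). A word that occurs in every document, such as "the", gets the smallest idf, so its weight relative to a rarer word shrinks once idf weighting is enabled.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents; "the" occurs in all of them, "puck" in only one
docs = [
    "the puck hit the post",
    "the rocket reached orbit",
    "the key encrypts the message",
]

tf_only = TfidfVectorizer(use_idf=False)  # normalized term frequencies
tf_idf = TfidfVectorizer(use_idf=True)    # term frequencies weighted by idf

tf_matrix = tf_only.fit_transform(docs)
tfidf_matrix = tf_idf.fit_transform(docs)

# Both vectorizers saw the same documents, so they share one vocabulary
the_col = tf_idf.vocabulary_["the"]
puck_col = tf_idf.vocabulary_["puck"]

# Weight of "the" relative to "puck" in the first document, under each scheme
tf_ratio = tf_matrix[0, the_col] / tf_matrix[0, puck_col]
tfidf_ratio = tfidf_matrix[0, the_col] / tfidf_matrix[0, puck_col]

print("idf of 'the': ", tf_idf.idf_[the_col])
print("idf of 'puck':", tf_idf.idf_[puck_col])
print("the/puck weight ratio, tf only:", tf_ratio)
print("the/puck weight ratio, tf-idf: ", tfidf_ratio)
```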


In [None]:
from sklearn import metrics

testing_data = fetch_20newsgroups(subset='test', categories=categories)
testing_tfidf = tfidf.transform(testing_data.data)
predictions = classifier.predict(testing_tfidf)
# target_names must follow label-index order; fetch_20newsgroups sorts the
# categories alphabetically, so use the dataset's own target_names
print(metrics.classification_report(testing_data.target, predictions,
                                    target_names=testing_data.target_names))


                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.99      0.91      0.95       385
      rec.motorcycles       0.98      0.96      0.97       398
            sci.crypt       0.90      0.99      0.94       396
   talk.politics.guns       0.96      0.96      0.96       364

             accuracy                           0.95      1543
            macro avg       0.96      0.95      0.95      1543
         weighted avg       0.96      0.95      0.95      1543
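
`classification_report` matches `target_names` to the sorted label values by position, so the names have to be listed in label order. For `fetch_20newsgroups` that order is alphabetical by category name, which is what the returned dataset's `target_names` attribute holds. The sketch below uses made-up labels and predictions to show how the same scores end up under the wrong category when the names are passed in a different order.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 0 = comp.sys.mac.hardware, 1 = rec.motorcycles
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 0]

names_in_label_order = ["comp.sys.mac.hardware", "rec.motorcycles"]
names_in_other_order = ["rec.motorcycles", "comp.sys.mac.hardware"]

right = classification_report(y_true, y_pred,
                              target_names=names_in_label_order, output_dict=True)
wrong = classification_report(y_true, y_pred,
                              target_names=names_in_other_order, output_dict=True)

# Label 0 has recall 1.0 (all three of its posts are found); label 1 has recall 0.5
print(right["comp.sys.mac.hardware"]["recall"], right["rec.motorcycles"]["recall"])
# With the names swapped, the same numbers appear under the wrong category
print(wrong["comp.sys.mac.hardware"]["recall"], wrong["rec.motorcycles"]["recall"])
```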



In [None]:
# Indices of test posts whose prediction differs from the true label
errors = [i for i in range(len(predictions)) if predictions[i] != testing_data.target[i]]

# Print the first five misclassified posts as "true label --> predicted label"
for post_id in errors[:5]:
    print("------------------------------------------------------------------")
    true_name = testing_data.target_names[testing_data.target[post_id]]
    predicted_name = testing_data.target_names[predictions[post_id]]
    print(f"{true_name} --> {predicted_name}\n")
    print(testing_data.data[post_id])
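
Reading individual misclassified posts can be complemented by a confusion matrix, which counts every (true label, predicted label) pair at once. The following is a minimal sketch on made-up labels and predictions; the arrays and the three category names are assumptions chosen for illustration, not results from the cells above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical category names, in label-index order
target_names = ["comp.sys.mac.hardware", "rec.motorcycles", "sci.crypt"]

# Made-up true labels and predictions for seven posts
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)
print(cm)

# List only the off-diagonal cells, i.e. the kinds of mistakes that occurred
for true_idx, pred_idx in zip(*np.nonzero(cm)):
    if true_idx != pred_idx:
        print(f"{target_names[true_idx]} --> {target_names[pred_idx]}: {cm[true_idx, pred_idx]}")
```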


# Discussions

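On the first set of categories (rec.sport.hockey, talk.religion.misc, comp.graphics, and sci.space), the classifier performed well overall, although recall was noticeably lower for one class. The reported per-class scores were:

1. For rec.sport.hockey, 97% of its predictions were correct (precision), and it found 90% of all relevant documents (recall).

2. Talk.religion.misc had 93% precision and 99% recall.

3. Comp.graphics had 83% precision and 98% recall.

4. Sci.space had perfect precision (100%) but found only 72% of the relevant documents.

Note that `classification_report` assigns `target_names` to labels by position, and `fetch_20newsgroups` numbers the labels in alphabetical order of category name. If the names for this run were passed in the order listed above rather than alphabetically, the four rows would instead belong to comp.graphics, rec.sport.hockey, sci.space, and talk.religion.misc, respectively.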


For the second set of categories (rec.motorcycles, talk.politics.guns, comp.sys.mac.hardware, and sci.crypt), precision and recall were high for every class. Because `fetch_20newsgroups` numbers the labels alphabetically by category name, the four rows of the report, in order, belong to comp.sys.mac.hardware, rec.motorcycles, sci.crypt, and talk.politics.guns:

1. Comp.sys.mac.hardware had 99% precision and 91% recall.

2. Rec.motorcycles had 98% precision and 96% recall.

3. Sci.crypt had 90% precision and 99% recall.

4. Talk.politics.guns had 96% precision and 96% recall.

Overall accuracy was 92% on the first set of categories and 95% on the second, so the model classified the large majority of test posts correctly in both experiments. This suggests that tf-idf features combined with Multinomial Naïve Bayes are effective across a variety of topics.