<table align="left" width=100%>
    <tr>
        <td width="10%">
            <img src="../images/RA_Logo.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                  <b> 10. Text Classification </b>
                </font>
            </div>
        </td>
    </tr>
</table>

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/vidyadharbendre/learn_nlp_using_examples/blob/main/notebooks/10_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/vidyadharbendre/learn_nlp_using_examples/blob/main/notebooks/10_Text_Classification.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>

## Text Classification
### What is Text Classification?

Text classification is the task of assigning text to one of a set of predefined categories based on its content. A model is trained on labeled examples to recognize patterns in the text, which lets it assign a label to new, unseen text.

### Why Text Classification?
Text classification is useful for:

- Document organization: automatically categorizing documents, articles, or emails based on their content.
- Sentiment analysis: identifying whether a text expresses a positive, negative, or neutral opinion.
- Spam detection: filtering spam or irrelevant content out of email or social media messages.

### How to Perform Text Classification Programmatically?

#### Using Scikit-Learn for Text Classification

In [3]:
#!conda install scikit-learn

In [4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

In [5]:
# Load 20 Newsgroups dataset (example dataset for text classification)
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
data_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

In [6]:
# Vectorize text data using TF-IDF
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
y_train = data_train.target
y_test = data_test.target
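
To make the TF-IDF step above less opaque, the following sketch computes the weights by hand for a toy corpus. It assumes the smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row, which is what scikit-learn documents as the TfidfVectorizer defaults. The `toy_docs` corpus and the `tfidf_rows` helper are hypothetical names used only for illustration, and tokenization is simplified to lowercased whitespace splitting.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from 20 Newsgroups.
toy_docs = [
    "the doctor prescribed medicine",
    "the image renders in graphics",
    "the doctor studied the image",
]

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

vocab, rows = tfidf_rows(toy_docs)
# Show the non-zero weights of the first document.
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term:12s} {weight:.3f}")
```

In the first toy document every term occurs once, so the weights differ only through IDF: "the", which appears in all three documents, receives the smallest weight, and terms unique to that document receive the largest.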

In [7]:
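# Train a Support Vector Machine (SVM) classifier with a linear kernel.
# Note: this pipeline wraps only the classifier; the TF-IDF step was applied
# separately above, so fit/predict expect already-vectorized input.
classifier = make_pipeline(SVC(kernel='linear'))
classifier.fit(X_train, y_train)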

In [8]:
# Predict on test data
y_pred = classifier.predict(X_test)

In [9]:
# Evaluate performance
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data_test.target_names))
print("Accuracy:", accuracy_score(y_test, y_pred))

Classification Report:
                        precision    recall  f1-score   support

           alt.atheism       0.96      0.83      0.89       319
         comp.graphics       0.90      0.96      0.93       389
               sci.med       0.94      0.91      0.93       396
soc.religion.christian       0.89      0.96      0.93       398

              accuracy                           0.92      1502
             macro avg       0.93      0.92      0.92      1502
          weighted avg       0.92      0.92      0.92      1502

Accuracy: 0.9207723035952063
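
The per-class columns in the report can be reproduced from raw counts of true positives, false positives, and false negatives. The sketch below does this for a pair of small label lists. The `y_true_toy` and `y_pred_toy` values are made up for illustration and are not outputs of the classifier above, and `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels for illustration only.
y_true_toy = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 0, 1, 1, 1, 2, 0, 2]

for label in sorted(set(y_true_toy)):
    prec, rec, f1 = per_class_metrics(y_true_toy, y_pred_toy, label)
    print(f"class {label}: precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")

# Accuracy is the fraction of positions where the labels agree.
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(f"accuracy={accuracy:.2f}")
```

Read the same way, the recall of 0.83 for alt.atheism in the report above means that about 83% of the 319 alt.atheism test posts were labeled correctly, while the precision of 0.96 means that about 96% of the posts predicted as alt.atheism actually belong to that group.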


## Explanation:

- What: This example uses Scikit-Learn to classify posts from four categories of the 20 Newsgroups dataset.
- Why: Scikit-Learn provides vectorizers, classifiers, and evaluation metrics behind a consistent fit/transform/predict interface, which makes it straightforward to build and compare text classifiers.
- How: The text is vectorized with TF-IDF (TfidfVectorizer()), and a Support Vector Machine classifier with a linear kernel (SVC(kernel='linear')) is trained on the resulting vectors. The classifier is then evaluated on the predefined test split using accuracy and a per-class classification report (precision, recall, F1-score), reaching about 92% accuracy on the 1,502 test documents.
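
One possible variation, sketched here under assumptions, is to place the vectorizer inside the pipeline so that the fitted model accepts raw strings directly. The toy texts and labels below are hypothetical and far too small to produce a meaningful model. LinearSVC is swapped in for SVC(kernel='linear') as a commonly used alternative for sparse text features; it is not the classifier used in the cells above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 0 = medicine, 1 = graphics.
train_texts = [
    "the doctor prescribed a new medicine for the patient",
    "patients reported symptoms after the treatment",
    "the clinic tested a vaccine for the disease",
    "rendering a 3d image with shaders and polygons",
    "the graphics card draws pixels on the screen",
    "image files store pixels in a compressed format",
]
train_labels = [0, 0, 0, 1, 1, 1]

# Vectorizer and classifier are chained, so fit/predict take raw text.
text_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
text_clf.fit(train_texts, train_labels)

new_texts = ["the doctor tested the medicine", "pixels in the rendered image"]
predictions = text_clf.predict(new_texts)
print(predictions)
```

Chaining the steps this way means the vocabulary and IDF weights learned during fit are reused automatically at prediction time, which removes the need to call the vectorizer's transform separately on new text.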