# 07. Text Classification
Text classification is a fundamental task in Natural Language Processing (NLP): assigning predefined categories or labels to text documents based on their content. Common applications include spam detection, sentiment analysis, and topic categorization.

### What You'll Learn:
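- How to classify documents with scikit-learn
- Three algorithms: Naive Bayes, logistic regression, and linear SVM
- Splitting data into training and test sets
- Evaluation metrics: accuracy, precision, recall, and F1-score
- Worked examples on small toy datasets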

## Text Classification Task

Assign text to predefined categories.

**Examples**:
- Spam vs Not Spam (emails)
- Positive vs Negative (sentiment)
- News categories (sports, politics, tech)
- Language detection (English, French, etc.)
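
Before a classifier can assign any of these labels, each text has to be turned into a numeric feature vector. The sketch below shows one common way to do that, TF-IDF weighting with scikit-learn's `TfidfVectorizer`. The three strings in `docs` are made up for illustration and are not part of the datasets used later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, invented for this illustration
docs = ['free money now', 'meeting at noon', 'free meeting room']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# One column per distinct token seen during fitting
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X.shape)  # (number of documents, vocabulary size)
```

Each row of `X` is what the classifier actually sees; words shared between documents (here `free` and `meeting`) receive a lower weight than words unique to one document.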

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

print('='*60)
print('TEXT CLASSIFICATION EXAMPLE')
print('='*60)

# Sample data
texts = [
    'Free money now!!!', 'You won lottery', 'Click here for cash',
    'Meeting at 3pm', 'Project deadline tomorrow', 'Team discussion scheduled',
    'Great product quality', 'Highly recommended', 'Very satisfied customer',
    'Terrible experience', 'Worst purchase ever', 'Complete waste of money'
]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]  # 0=spam, 1=meeting, 2=positive, 3=negative

print(f'Total samples: {len(texts)}')
print('Classes: 0=Spam, 1=Meeting, 2=Positive, 3=Negative\n')

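# Feature extraction
# Note: the vectorizer is fit on all texts before the split, so test-set
# vocabulary leaks into the features. In practice, fit on the training split only.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
y = np.array(labels)

# Split data (30% held out; without stratify=y a class can be missing from either split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# X_train and X_test are sparse matrices, so use .shape[0] rather than len()
# print(f'Training samples: {X_train.shape[0]}')
# print(f'Testing samples: {X_test.shape[0]}\n')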

# Train classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predictions
y_pred = classifier.predict(X_test)

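# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2%}')
print('\nClassification Report:')
# zero_division=0 reports undefined precision/recall as 0 without emitting warnings
print(classification_report(y_test, y_pred, zero_division=0))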

TEXT CLASSIFICATION EXAMPLE
Total samples: 12
Classes: 0=Spam, 1=Meeting, 2=Positive, 3=Negative

Accuracy: 0.00%

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       1.0
           1       0.00      0.00      0.00       0.0
           2       0.00      0.00      0.00       1.0
           3       0.00      0.00      0.00       2.0

    accuracy                           0.00       4.0
   macro avg       0.00      0.00      0.00       4.0
weighted avg       0.00      0.00      0.00       4.0
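
The report above lists class 1 with a support of 0: the random split left no "meeting" example in the test set, and with only 12 samples a single split says little about the model. One possible remedy, sketched below with the same `texts` and `labels`, is to split the raw texts with `stratify` so that every class appears in both sets, and to wrap the vectorizer and classifier in a `Pipeline` so that TF-IDF is fit on the training texts only. Scores on a dataset this small remain unreliable, so no expected accuracy is shown.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    'Free money now!!!', 'You won lottery', 'Click here for cash',
    'Meeting at 3pm', 'Project deadline tomorrow', 'Team discussion scheduled',
    'Great product quality', 'Highly recommended', 'Very satisfied customer',
    'Terrible experience', 'Worst purchase ever', 'Complete waste of money'
]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]  # 0=spam, 1=meeting, 2=positive, 3=negative

# Split the raw texts (not the TF-IDF matrix) and preserve class proportions
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42
)

# The pipeline fits TF-IDF on the training texts only, avoiding vocabulary leakage
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
pipe.fit(train_texts, train_labels)

print('Test class counts:', Counter(test_labels))
print(f'Accuracy: {pipe.score(test_texts, test_labels):.2%}')
```

With three samples per class and four held out, the stratified split places exactly one example of each class in the test set, so every row of a classification report would have non-zero support.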





## Comparing Algorithms

In [4]:
from sklearn.pipeline import Pipeline

# Simple dataset
train_texts = ['good great excellent', 'bad terrible awful', 'ok fine']
train_labels = [1, 0, 0]  # 1=positive, 0=not positive (negative or neutral)

# None of these test words appear in the training texts
test_texts = ['wonderful', 'horrible', 'average']
test_labels = [1, 0, 0]

algorithms = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(max_iter=200),
    'SVM': LinearSVC(max_iter=200)
}

print('\nALGORITHM COMPARISON')
for name, clf in algorithms.items():
    pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', clf)])
    pipe.fit(train_texts, train_labels)
    score = pipe.score(test_texts, test_labels)
    print(f'{name:20} -> Accuracy: {score:.2%}')


ALGORITHM COMPARISON
Naive Bayes          -> Accuracy: 66.67%
Logistic Regression  -> Accuracy: 66.67%
SVM                  -> Accuracy: 66.67%
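
All three models score the same because none of the test words (`wonderful`, `horrible`, `average`) occurs in the training texts. TF-IDF drops unseen words, so every test document becomes an all-zero vector and each model can only return one fallback prediction for all three. The reported 66.67% is consistent with every model predicting class 0, since two of the three test labels are 0. The sketch below reuses the toy data from the cell above to make this visible; the `models` and `distinct` names are introduced here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train_texts = ['good great excellent', 'bad terrible awful', 'ok fine']
train_labels = [1, 0, 0]
test_texts = ['wonderful', 'horrible', 'average']

# Fit TF-IDF on the training texts only, then transform the test texts
vectorizer = TfidfVectorizer().fit(train_texts)
X_test = vectorizer.transform(test_texts)
print('Non-zero test features:', X_test.nnz)  # every test word is out of vocabulary

models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(max_iter=200),
    'SVM': LinearSVC(max_iter=200),
}

# Identical all-zero inputs force each model to give one prediction for all rows
distinct = {}
for name, clf in models.items():
    pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', clf)])
    pipe.fit(train_texts, train_labels)
    preds = pipe.predict(test_texts).tolist()
    distinct[name] = set(preds)
    print(f'{name:20} -> predictions: {preds}')
```

A meaningful comparison needs training and test texts that share vocabulary, and enough samples that the models are not simply falling back to the majority class.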
