## Classification using Ridge Classifier
This example shows how scikit-learn's `RidgeClassifier` can be used to classify documents by topic using a Bag of Words approach.

### Load and vectorize text dataset
We define a function to load data from the 20 newsgroups text dataset, which comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training and one for testing. The code below loads only four of these categories.

1. Load dataset

In [2]:
# Import libraries
from time import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Define text categories
categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]

# Function to get the document size
def size_mb(docs):
    return sum(len(s.encode("utf-8")) for s in docs) / 1e6

# Load and vectorize the 20 newsgroups dataset
def load_dataset(verbose=False, remove=()):
    data_train = fetch_20newsgroups(
        subset="train",
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=remove,
    )
    data_test = fetch_20newsgroups(
        subset="test",
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=remove,
    )

    # Order of labels in 'target_names' can be different from 'categories'
    target_names = data_train["target_names"]

    # Split target to training and testing sets
    y_train, y_test = data_train["target"], data_test["target"]

    # Extract features from training data using a sparse vectorizer
    t0 = time()
    vectorizer = TfidfVectorizer(
        sublinear_tf=True, max_df=0.5, min_df=5, stop_words="english"
    )
    X_train = vectorizer.fit_transform(data_train["data"])
    duration_train = time() - t0

    # Extract features from testing data using the same vectorizer
    t0 = time()
    # Use transform (not fit_transform) so the test set keeps the training vocabulary
    X_test = vectorizer.transform(data_test["data"])
    duration_test = time() - t0
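    feature_names = vectorizer.get_feature_names_out()

    # Optionally report dataset sizes and vectorization timings
    if verbose:
        train_mb = size_mb(data_train["data"])
        test_mb = size_mb(data_test["data"])
        print(f"train: {len(y_train)} documents, {train_mb:.2f} MB, vectorized in {duration_train:.3f}s")
        print(f"test: {len(y_test)} documents, {test_mb:.2f} MB, vectorized in {duration_test:.3f}s")
        print(f"n_samples: {X_train.shape[0]}, n_features: {X_train.shape[1]}")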

    return X_train, X_test, y_train, y_test, feature_names, target_names
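
The vectorizer must learn its vocabulary from the training documents only and then be reused unchanged on the test documents. The following is a minimal sketch of that fit-then-transform pattern on a tiny made-up corpus (the sentences and variable names are illustrative assumptions, not part of the 20 newsgroups data). It also shows what goes wrong when a vectorizer is refitted on the test documents: the feature columns no longer line up with the training matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the pattern
train_docs = [
    "the rocket reached orbit",
    "the shuttle left orbit",
    "pixels render the image",
]
test_docs = ["the image shows a rocket", "unseen words only"]

# Learn the vocabulary and IDF weights from the training documents
vectorizer = TfidfVectorizer()
X_train_toy = vectorizer.fit_transform(train_docs)

# Reuse the fitted vocabulary: same columns as the training matrix
X_test_toy = vectorizer.transform(test_docs)

# Refitting on the test documents builds a different vocabulary
X_refit = TfidfVectorizer().fit_transform(test_docs)

print("train features:", X_train_toy.shape[1])
print("test features (transform):", X_test_toy.shape[1])
print("test features (refit):", X_refit.shape[1])
```

Words that never occur in the training documents are simply ignored by `transform`, so a test document made up entirely of unseen words becomes an all-zero row.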



### Train the classifier using the dataset
`RidgeClassifier` is a linear classification model that minimizes the mean squared error on {-1, 1} encoded targets, one for each possible class. The predicted class is the one whose target receives the highest output value.
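
To make the {-1, 1} encoding concrete, the sketch below reproduces the idea by hand: it encodes each class as its own target column, fits a plain `Ridge` regression on those columns, and picks the class with the largest output. It runs on synthetic data from `make_classification` (an assumption for illustration, not the newsgroups data), and it is meant as a conceptual sketch rather than a description of the library's internal code path.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge, RidgeClassifier
from sklearn.preprocessing import LabelBinarizer

# Synthetic 3-class data, used only for illustration
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_classes=3, random_state=0
)

# Encode each class as its own {-1, 1} target column
binarizer = LabelBinarizer(pos_label=1, neg_label=-1)
Y = binarizer.fit_transform(y)  # shape (n_samples, n_classes)

# Fit ridge regression on all target columns at once (multi-output)
reg = Ridge(alpha=1.0).fit(X, Y)

# Predict the class whose regression output is largest
manual_pred = binarizer.classes_[reg.predict(X).argmax(axis=1)]

# Compare with the built-in classifier using the same regularization
clf_pred = RidgeClassifier(alpha=1.0).fit(X, y).predict(X)
agreement = np.mean(manual_pred == clf_pred)
print(f"Agreement with RidgeClassifier: {agreement:.3f}")
```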

2. Train the ridge classifier

In [None]:
# Import library
from sklearn.linear_model import RidgeClassifier

X_train, X_test, y_train, y_test, feature_names, target_names = load_dataset(verbose=False)

# Train the ridge classifier
clf = RidgeClassifier(tol=1e-2, solver="sparse_cg")
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

3. Plot the confusion matrix

Check whether there is a pattern in the classification errors.

In [None]:
# Import libraries
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Create a plot
fig, ax = plt.subplots()
ConfusionMatrixDisplay.from_predictions(y_test, predictions, ax=ax)
ax.xaxis.set_ticklabels(target_names)
ax.yaxis.set_ticklabels(target_names)
ax.set_title(
    f"Confusion Matrix for {clf.__class__.__name__}\non the original documents"
)
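
Besides reading the plot, the same pattern can be pulled out numerically. The sketch below is one possible way to find the most frequently confused pair of classes: it builds the matrix with `confusion_matrix`, zeroes the diagonal (the correct predictions), and looks up the largest remaining entry. The labels and class names here are made-up toy values, not results from the classifier above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, used only to illustrate the lookup
names = ["class_a", "class_b", "class_c"]
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 0, 0, 2])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Keep only the misclassifications by zeroing the diagonal
errors = cm.copy()
np.fill_diagonal(errors, 0)

# Locate the largest off-diagonal entry
true_idx, pred_idx = np.unravel_index(errors.argmax(), errors.shape)
print(
    f"Most frequent error: {names[true_idx]} predicted as "
    f"{names[pred_idx]} ({errors[true_idx, pred_idx]} times)"
)
```

With the notebook's own variables, `y_test`, `predictions`, and `target_names` would take the place of the toy arrays.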