The following code is part of a machine learning pipeline that processes, analyzes, and classifies text data from a dataset containing newsgroup posts. Here's a summary of the main steps and components:

1. Imports: The script begins by importing necessary Python libraries and modules for data manipulation, machine learning, and plotting. This includes pandas and numpy for data handling, datasets for loading the data, matplotlib for visualization, and several components from scikit-learn for text vectorization, dimensionality reduction, and logistic regression modeling.

2. Data Loading: The dataset named "rungalileo/20_Newsgroups_Fixed" is loaded using the `datasets` library. This dataset presumably contains posts from 20 different newsgroups, along with labels indicating which newsgroup each post belongs to.

3. Data Preprocessing:

    - The dataset is filtered to remove any entries where either the label or the text is missing.
    - The 'id' column is removed from both the training and test sets, as it's not relevant for the analysis.
    - The labels (`y_train` and `y_test`) are extracted from both the training and test datasets.

4. Text Vectorization:

    - A `TfidfVectorizer` with default settings is first fitted on the test text only, to display the feature names and the vectorized data. This first vectorizer is exploratory and is not used for modeling.
    - A second `TfidfVectorizer` is configured with document-frequency thresholds (`min_df=5`, `max_df=0.40`). It is fitted on the training text and then applied to both the training and test datasets to prepare them for modeling.

5. Model Training and Prediction:

    - A logistic regression model is set up with a specific random state and a maximum iteration count equal to the number of rows in the training dataset.
    - The logistic regression model is trained on the vectorized training data and the corresponding labels.
    - Finally, the model is used to predict the labels for the vectorized test data.

This process involves typical steps in a text classification task, including data cleaning, feature extraction through TF-IDF vectorization, and classification using logistic regression. The purpose is to classify text posts into one of the 20 newsgroups based on their content.

In [None]:
# All imports at the top
import pandas as pd
import numpy as np
import datasets
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import LabelEncoder

# Load the data
dataset = datasets.load_dataset("rungalileo/20_Newsgroups_Fixed")

# Drop rows where the label or the text is None
filtered_train = dataset["train"].filter(lambda x: x['label'] is not None and x['text'] is not None)
filtered_test = dataset["test"].filter(lambda x: x['label'] is not None and x['text'] is not None)

# Remove irrelevant 'id' column
filtered_test = filtered_test.remove_columns('id')
filtered_train = filtered_train.remove_columns('id')

y_train = filtered_train['label']
y_test = filtered_test['label']

# Extract text data from filtered_test
text = [item['text'] for item in filtered_test]

# Vectorize text data
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text)

# Display feature names and vectorized data
print(vectorizer.get_feature_names_out())
print(X.toarray())

# Define frequency vectorizer with specific thresholds
freqV = TfidfVectorizer(min_df=5, max_df=0.40)
tokenized_train = list(filtered_train['text'])  # raw text column, read in one pass
x_train = freqV.fit_transform(tokenized_train)

# Apply the same (already fitted) vectorization to test data
tokenized_test = list(filtered_test['text'])
x_test = freqV.transform(tokenized_test)

# Set up and train the logistic regression model
logreg = LogisticRegression(random_state=16, max_iter=filtered_train.num_rows)
logreg.fit(x_train, y_train)

# Predict on test data
log_pred = logreg.predict(x_test)

## Multiclass Visualization

Significant challenges arise when visualizing logistic regression results for a multiclass classification task, especially with text data transformed via [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vectorization. Traditional 2D or 3D plots used in binary logistic regression, [such as this example](https://raw.githubusercontent.com/trenton3983/Blog-Posts/main/2024-04-29-visualizing-logistic-regression-multiclass-text-data/82L2cb2T.png), are not directly applicable because they typically represent binary outcomes and cannot easily accommodate the high-dimensional space created by TF-IDF vectorization.

### Key Challenges:

1. **High Dimensionality**: TF-IDF vectorization transforms text into a high-dimensional space (often thousands of dimensions), where each dimension corresponds to a specific word's frequency or importance in the text. Standard binary logistic regression plots, which typically show decision boundaries in two or three dimensions, cannot naturally extend to this high-dimensional space.

2. **Multiclass Classification**: Binary logistic regression is inherently designed for two classes. In multiclass settings, logistic regression models typically use schemes like [one-vs-rest](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) (OvR) or [multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression), which involve multiple binary decisions or probability distributions across more than two classes. This complexity makes it challenging to depict decision boundaries or class separations in a simple 2D or 3D plot.
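To make the multiclass point concrete, the sketch below fits a logistic regression on three synthetic classes and inspects what the model holds. The data from `make_blobs` and all `toy_` names are assumptions for illustration only. The model stores one weight vector per class, and each prediction is a probability distribution over all classes rather than a single yes/no boundary.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 90 points, 5 features, 3 classes
toy_X, toy_y = make_blobs(n_samples=90, centers=3, n_features=5, random_state=0)

toy_model = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)

# One row of coefficients per class, one column per feature
print(toy_model.coef_.shape)

# Each row of predict_proba is a distribution over the three classes
toy_proba = toy_model.predict_proba(toy_X)
print(toy_proba[:3].round(3))
print(np.allclose(toy_proba.sum(axis=1), 1.0))
```

With 20 newsgroups and thousands of TF-IDF features, the same structure becomes a 20-row coefficient matrix, which is why a single 2D decision boundary plot no longer applies.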

### Alternative Visualization Strategies:

Given these constraints, alternative visualization methods are recommended:

- [**Confusion Matrix**](https://en.wikipedia.org/wiki/Confusion_matrix): This is particularly useful in multiclass settings to show the model's performance across all classes, illustrating how often each class is correctly predicted versus misclassified.

- [**Dimensionality Reduction**](https://en.wikipedia.org/wiki/Dimensionality_reduction): Techniques such as [PCA](https://en.wikipedia.org/wiki/Dimensionality_reduction#Principal_component_analysis_(PCA)) (Principal Component Analysis) or [t-SNE](https://en.wikipedia.org/wiki/Dimensionality_reduction#t-SNE) (t-Distributed Stochastic Neighbor Embedding) can be used to reduce the dataset to two or three dimensions. These reduced dimensions can then be visualized in a scatter plot, providing a way to observe data clustering and separation at a high level.

These methods provide more meaningful insights into the performance and behavior of logistic regression models in multiclass, high-dimensional scenarios like those involving TF-IDF vectorized text data.

## Confusion Matrix

A confusion matrix can help you understand the performance of your classifier across different classes. It shows the actual vs. predicted classifications.

In a confusion matrix for a classification task, the matrix isn't necessarily symmetrical, meaning the upper and lower sections (above and below the diagonal) aren't expected to be mirror images of each other. Here’s why:

### Understanding the Confusion Matrix

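A confusion matrix shows the counts of predictions versus the actual labels. In scikit-learn's convention, rows are the actual classes and columns are the predicted classes, so the entry in row X, column Y counts how often actual class X was predicted as class Y:
- **Diagonal elements** show the number of correct predictions for each class (True Positives for each class).
- **Off-diagonal elements** show the misclassifications:
  - **Elements above the diagonal** count cases where the predicted class comes later in the label order than the actual class.
  - **Elements below the diagonal** count cases where the predicted class comes earlier in the label order than the actual class.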

### Reasons for Asymmetry

1. **Class Imbalance**: If some classes have more samples than others, the likelihood of predicting the majority class increases, affecting the symmetry.

2. **Model Biases and Sensitivities**: The model might be better at recognizing certain classes over others due to inherent biases in the training data or differences in feature distinctiveness between classes.

3. **Error Types**: In a multiclass setting these are defined per class, treating that class as "positive" and all others as "negative":
  - [**Type I Errors (False Positives)**](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors#Type_I_error): Cases where the model predicts the class but the actual class is different.
  - [**Type II Errors (False Negatives)**](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors#Type_II_error): Cases where the model fails to predict the class when it is the actual class.
  - Each type of error might be more common for certain classes than others.
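Per-class false positives and false negatives can be read directly off a confusion matrix. The sketch below uses a made-up 3x3 matrix (`toy_cm`, an assumption for illustration) and assumes rows are actual classes and columns are predicted classes, which is the layout scikit-learn's `confusion_matrix` returns.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
toy_cm = np.array([
    [5, 2, 0],
    [1, 6, 1],
    [3, 2, 4],
])

true_pos = np.diag(toy_cm)

# False positives for a class: predicted as that class (column) but actually another
false_pos = toy_cm.sum(axis=0) - true_pos

# False negatives for a class: actually that class (row) but predicted as another
false_neg = toy_cm.sum(axis=1) - true_pos

print("TP:", true_pos)
print("FP:", false_pos)
print("FN:", false_neg)
```

In this toy matrix the third class has few false positives but many false negatives, which is exactly the kind of per-class imbalance that makes the matrix asymmetric.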

### Example:

Suppose you have three classes: A, B, and C. If:
- Class A is often confused with Class B, but not vice versa.
- Class C is often mistaken for both A and B, but rarely are A or B mistaken for C.

This leads to a non-symmetrical confusion matrix because the misclassification patterns are not uniform across classes.
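The three-class scenario can be reproduced with invented labels. In this sketch the counts are arbitrary (10 samples per class, chosen only to match the pattern described): some A samples are predicted as B, B is always predicted correctly, and C is frequently predicted as A or B.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels that follow the A/B/C pattern described above
toy_true = ["A"] * 10 + ["B"] * 10 + ["C"] * 10
toy_pred = (
    ["A"] * 6 + ["B"] * 4                 # A: 4 of 10 predicted as B
    + ["B"] * 10                          # B: never predicted as A or C
    + ["C"] * 4 + ["A"] * 3 + ["B"] * 3   # C: often predicted as A or B
)

toy_abc_cm = confusion_matrix(toy_true, toy_pred, labels=["A", "B", "C"])
print(toy_abc_cm)

# A symmetric matrix would equal its own transpose
is_symmetric = np.array_equal(toy_abc_cm, toy_abc_cm.T)
print("Symmetric:", is_symmetric)
```

The A row has mass in the B column while the B row has none in the A column, and the C row spills into both A and B while the C column is otherwise empty.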

### Conclusion

A non-symmetrical confusion matrix is typical in practice, especially in multi-class scenarios where varying features, class distributions, and model sensitivities contribute to unique patterns of misclassification. This matrix is a valuable tool for identifying how well the model performs on each class and where it may need improvements or adjustments in its training data or feature selection.

In [None]:
# Compute the confusion matrix
cm = confusion_matrix(y_test, log_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=logreg.classes_)

# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(10, 10))
disp.plot(ax=ax, cmap='viridis', xticks_rotation='vertical')  # You can specify the color map
plt.show()

## Dimensionality Reduction for Visualization

You can use techniques like PCA (Principal Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding) to reduce the dimensionality of your TF-IDF vectors to two or three dimensions, and then plot these dimensions.

### General PCA Explanation

**Principal Components**: Each principal component is a linear combination of the original features and represents a direction in the feature space where the data varies the most. Principal components are orthogonal to each other and are ordered by the amount of variance they capture from the data.

- **Variance Representation**: The first principal component captures the most variance, the second captures the next most, and so on, under the constraint that each is orthogonal to the others.
- **Interpretation of Axes**: The axes in PCA plots (whether 2D or 3D) represent these principal components. The ticks on these axes indicate the position of each data point along the respective principal component. PCA always centers the data to zero mean, and data is often also scaled to unit variance beforehand (scikit-learn's `PCA` centers but does not scale), so these ticks do not represent the original units of measurement but rather relative positions in the transformed space.
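The two properties above, orthogonality and ordering by variance, can be checked numerically. This is a small sketch on synthetic data; `toy_data` and its per-feature spreads are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 5 features with deliberately different spreads
toy_data = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

toy_pca = PCA(n_components=3).fit(toy_data)

# Share of total variance captured by each component, largest first
variance_ratios = toy_pca.explained_variance_ratio_
print(variance_ratios)

# Rows of components_ are unit-length and mutually orthogonal,
# so their Gram matrix is (numerically) the identity
gram = toy_pca.components_ @ toy_pca.components_.T
print(np.round(gram, 6))
```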

### Specific to 2D Plots

In a 2D PCA plot:

- **X-axis (First Principal Component)**: Represents the direction of the greatest variance in the data. This axis captures the largest amount of information (variation) in the dataset.
- **Y-axis (Second Principal Component)**: Represents the direction of the second greatest variance, orthogonal to the first principal component.

**Visual Analysis**: The plot can reveal clustering and other patterns, helping to visually assess similarities and differences in the data. Points that are close together have similar coordinates along the directions of greatest variance, which suggests, but does not guarantee, that they are similar in the original feature space.

### Specific to 3D Plots

In addition to the first and second principal components, 3D PCA plots include:

- **Z-axis (Third Principal Component)**: This axis captures the third highest variance in the data, providing an additional dimension of analysis which is orthogonal to both the first and second components.

**Enhanced Visual Analysis**: A 3D plot allows for a deeper visual exploration, revealing structures and relationships that might not be visible in 2D. It can be especially useful in datasets where the top two components do not capture the majority of variance.
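Whether a third component is worth plotting depends on how much variance the first two already capture, which can be read from the cumulative explained variance. The sketch below uses uniform random data (`toy_wide`, an assumption for illustration) as a loose stand-in for data whose variance is spread across many directions; it is not TF-IDF data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(16)

# 100 samples, 50 features with no dominant direction of variance
toy_wide = rng.random((100, 50))

toy_pca3 = PCA(n_components=3).fit(toy_wide)

# Running total of the variance share captured by the first k components
cumulative = np.cumsum(toy_pca3.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f"first {k} component(s): {share:.1%} of variance")
```

When the cumulative share stays low, as it does here, both 2D and 3D plots show only a small slice of the data's structure, so any apparent overlap between classes should be read with caution.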

### Practical Use

Both 2D and 3D PCA plots are used for:
- **Pattern Recognition**: Identifying clusters or outliers in the data.
- **Data Simplification**: Reducing the dimensionality of the data while attempting to retain the most important characteristics.
- **Exploratory Data Analysis**: Providing insights into the structure of the data before applying more complex models.

These visualizations serve as powerful tools for initial data analysis, especially when dealing with complex datasets like those generated from text vectorization (e.g., TF-IDF) in natural language processing tasks. They help in making informed decisions about the next steps in the data analysis or machine learning workflow.

#### 2D Plot

In [None]:
# Reduce dimensions (PCA)
label_encoder = LabelEncoder()
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(x_test.toarray())
y_test_numeric = label_encoder.fit_transform(y_test)

# Plot
plt.figure(figsize=(8, 8))
scatter = plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_test_numeric, cmap='viridis')
plt.colorbar(scatter) 

# Generate legend
classes = label_encoder.classes_
colors = plt.cm.viridis(np.linspace(0, 1, len(classes)))
legend_handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor=col, markersize=5) for col in colors]
plt.legend(legend_handles, classes, title="Classes", bbox_to_anchor=(1.2, 0.5), loc='center left', frameon=False)

# Show the plot
plt.show()

#### 3D Plot

In [None]:
# Fit and transform with PCA for 3 components
label_encoder = LabelEncoder()
pca = PCA(n_components=3)  # Three components for 3D plotting
X_reduced = pca.fit_transform(x_test.toarray())
y_test_numeric = label_encoder.fit_transform(y_test)

# Create a 3D plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')  # Add a 3D subplot

# Scatter plot for 3D data
scatter = ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y_test_numeric, cmap='viridis')

# Color bar
plt.colorbar(scatter, ax=ax)

# Generate legend manually
classes = label_encoder.classes_
colors = plt.cm.viridis(np.linspace(0, 1, len(classes)))
legend_handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor=col, markersize=5) for col in colors]
ax.legend(legend_handles, classes, title="Classes", bbox_to_anchor=(1.2, 0.5), loc='center left', frameon=False)

# Show the plot
plt.show()

### General t-SNE Explanation

**t-SNE Overview**: t-SNE is a non-linear dimensionality reduction technique that is particularly well suited for embedding high-dimensional data into a space of two or three dimensions. The technique is designed to maintain the local structure of the data, making it excellent for visualizing clusters or groups within the data.

- **Local Structure Preservation**: t-SNE focuses on preserving the local distances between points, meaning that points that are similar in the high-dimensional space are placed near each other in the reduced space. However, global distances (i.e., distances between clusters) might not be as accurately represented.
- **Interpretation of Axes**: Unlike PCA, t-SNE axes do not have intrinsic meaning as principal components do. The axes in t-SNE plots are abstract and do not correspond to specific original variables. The placement and orientation of clusters can vary significantly between different runs of the algorithm due to its stochastic nature. The axes are simply the 2D or 3D coordinates chosen to best preserve local point-to-point distances.
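The run-to-run variation mentioned above can be seen by embedding the same data with two different seeds. This is a sketch under stated assumptions: the two synthetic groups, the small `perplexity`, and the choice of `init="random"` (used so that the seed fully determines the starting layout) are all picked for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two well-separated synthetic groups of 30 points in 10 dimensions
group_a = rng.normal(loc=0.0, size=(30, 10))
group_b = rng.normal(loc=8.0, size=(30, 10))
toy_points = np.vstack([group_a, group_b])

# Same data, same settings, different seeds
emb_seed_1 = TSNE(n_components=2, perplexity=10, init="random",
                  random_state=1).fit_transform(toy_points)
emb_seed_2 = TSNE(n_components=2, perplexity=10, init="random",
                  random_state=2).fit_transform(toy_points)

print(emb_seed_1.shape, emb_seed_2.shape)

# The coordinates differ between runs even though the data is identical
print("Identical layouts:", np.allclose(emb_seed_1, emb_seed_2))
```

Fixing `random_state`, as the notebook cells do, makes a given plot reproducible, but the specific coordinates still carry no meaning of their own.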

### Specific to 2D Plots

In a 2D t-SNE plot:

- **X-axis and Y-axis**: These represent the two dimensions onto which the data has been mapped. The axes themselves don’t carry specific meanings but serve as a canvas to observe the grouping and separation of data points.

**Visual Analysis**: 2D plots are typically sufficient to identify clusters and outliers. They allow for easy visualization of how data points are grouped and which points are similar to each other.

### Specific to 3D Plots

A 3D t-SNE plot introduces an additional dimension:

- **Z-axis**: Adds depth to the visualization, offering another layer for interpreting the data. This can sometimes reveal structures hidden in 2D views.

**Enhanced Visual Analysis**: 3D plots can provide a more comprehensive view of the data's structure, revealing relationships that might not be perceptible in only two dimensions. However, they can also be more challenging to interpret and navigate, especially when trying to understand complex data relationships visually.

### Practical Use

Both 2D and 3D t-SNE plots are used for:

- **Cluster Visualization**: Effectively demonstrates how data points are clustered or grouped, which is invaluable for exploratory data analysis, particularly in fields like genomics, image analysis, and text data analysis.
- **Outlier Detection**: Helps in identifying data points that do not fit well with any group.
- **Data Exploration**: Provides a means to visually explore the structure of the data, which can guide further analysis or preprocessing steps.

t-SNE is a powerful tool for data visualization, especially when the primary interest is to understand the local structure of the data or to discover patterns in data that lacks clear labels or defined groups.

#### 2D Plot

In [None]:
# Reduce dimensions (t-SNE)
label_encoder = LabelEncoder()
tsne = TSNE(n_components=2, random_state=16)
X_embedded = tsne.fit_transform(x_test.toarray())
y_test_numeric = label_encoder.fit_transform(y_test)

In [None]:
# Plot
plt.figure(figsize=(8, 8))
scatter = plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y_test_numeric, cmap='viridis')
plt.colorbar(scatter) 

classes = label_encoder.classes_
colors = plt.cm.viridis(np.linspace(0, 1, len(classes)))
legend_handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor=col, markersize=5) for col in colors]
plt.legend(legend_handles, classes, title="Classes", bbox_to_anchor=(1.2, 0.5), loc='center left', frameon=False)

# Show the plot
plt.show()

#### 3D Plot

In [None]:
# Reduce dimensions (t-SNE)
label_encoder = LabelEncoder()
tsne = TSNE(n_components=3, random_state=16)
X_embedded = tsne.fit_transform(x_test.toarray())
y_test_numeric = label_encoder.fit_transform(y_test)

In [None]:
# Create a 3D plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')  # Add a 3D subplot

# Scatter plot for 3D data
scatter = ax.scatter(X_embedded[:, 0], X_embedded[:, 1], X_embedded[:, 2], c=y_test_numeric, cmap='viridis')

# Color bar
plt.colorbar(scatter, ax=ax)

# Generate legend manually
classes = label_encoder.classes_
colors = plt.cm.viridis(np.linspace(0, 1, len(classes)))
legend_handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor=col, markersize=5) for col in colors]
ax.legend(legend_handles, classes, title="Classes", bbox_to_anchor=(1.2, 0.5), loc='center left', frameon=False)

# Show the plot
plt.show()

## Model Coefficients

You can also look at the coefficients of the logistic regression model to determine the importance of each feature (word), but visualizing this effectively can be challenging due to the high number of features. In a multiclass model there is one set of coefficients per class, so you might want to display the most influential words for each class.

In [None]:
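# Use freqV, the vectorizer the model was trained on, so names align with logreg.coef_
feature_names = np.array(freqV.get_feature_names_out())

# coef_ has one row per class; row 0 belongs to logreg.classes_[0]
sorted_coef_index = logreg.coef_[0].argsort()

print("Class: {}".format(logreg.classes_[0]))
print("Smallest Coefs:\n{}\n".format(feature_names[sorted_coef_index[:10]]))
print("Largest Coefs: \n{}".format(feature_names[sorted_coef_index[:-11:-1]]))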