# Data preparation

First, import the necessary packages, load the data, and clean it. We drop rows with missing values so that they do not cause errors or bias downstream, and we set the Major and Minor columns as the prediction targets.

In [None]:

#PyTorch (torch) is a machine-learning package for building neural networks.
import torch
torch.__version__
#Import Pandas for data manipulation
import pandas as pd
pd.set_option('display.max_columns', None)


In [None]:
#billsTrain.csv is a CSV file of 72,517 Congressional bills from http://congressionalbills.org/about.html
#More specifically, it is a part of the 83-9X congressional csv. This will be our training set for a neural network to categorize bills based on their titles.

#Columns of key interest are "major" and "minor", from https://www.comparativeagendas.net/pages/master-codebook
#These are our topic area proxies.
total = pd.read_csv("billsTrain.csv")


if total.isnull().values.any():
  total = total.dropna()


Disclosure: This notebook incorporates some sections of machine-written code.

In [None]:
display(total.head())

Unnamed: 0,Title,Major,Minor
1880,A bill to establish a new program of health ca...,3.0,301.0
1881,An Act to provide for pension reform.,5.0,503.0
1882,A bill to provide for the regulation of surfac...,8.0,805.0
1883,A bill to provide that meetings of Government ...,2.0,208.0
1884,A bill to amend the Internal Revenue Code of 1...,6.0,601.0


The above dataset (billsLess) is the Congressional dataset with all but the titles and Major/Minor categories stripped from it.
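
The cleaning step can be sketched on a toy DataFrame with the same column names. The rows here are invented for illustration, and two details are assumptions rather than what the notebook does: restricting `dropna` to the three columns used downstream, and casting the topic codes to integers.

```python
import pandas as pd

# Toy stand-in for billsTrain.csv; the rows are invented for illustration.
toy = pd.DataFrame({
    "Title": ["A bill to amend the tax code.", None, "An Act for pension reform."],
    "Major": [6.0, 3.0, None],
    "Minor": [601.0, 301.0, 503.0],
})

# Drop rows missing any of the three columns used downstream (an assumption;
# the notebook calls dropna() on every column).
cleaned = toy.dropna(subset=["Title", "Major", "Minor"])

# The topic codes are categorical, so an integer dtype avoids labels such as 6.0.
cleaned = cleaned.astype({"Major": int, "Minor": int})

print(f"Dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
print(cleaned)
```

Reporting how many rows were dropped makes the gap between the raw file size and the cleaned dataset explicit.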


In [None]:
major_target = total['Major']
minor_target = total['Minor']

print("First 5 values of major_target:\n", major_target.head())
print("\nFirst 5 values of minor_target:\n", minor_target.head())

First 5 values of major_target:
 1880    3.0
1881    5.0
1882    8.0
1883    2.0
1884    6.0
Name: Major, dtype: float64

First 5 values of minor_target:
 1880    301.0
1881    503.0
1882    805.0
1883    208.0
1884    601.0
Name: Minor, dtype: float64


## Vectorizing Data

Vectorize the 'Title' column using TF-IDF to convert text into numerical features. This will be the sole feature set for the models.
This feature-engineering step transforms every bill title into a high-dimensional sparse vector.
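
To make the vectorization concrete, here is a minimal sketch on three short titles. The titles are shortened examples, and the default `TfidfVectorizer` settings are assumed, as in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_titles = [
    "A bill to provide for pension reform.",
    "A bill to amend the Internal Revenue Code.",
    "An Act to provide for the regulation of surface mining.",
]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(toy_titles)  # SciPy sparse matrix

n_docs, n_terms = features.shape
density = features.nnz / (n_docs * n_terms)
print(f"{n_docs} titles x {n_terms} vocabulary terms, density {density:.2f}")

# A new title is encoded with the vocabulary learned above;
# words that were not seen during fitting are ignored.
new_vec = vectorizer.transform(["A bill to provide for coastal fisheries."])
print("Non-zero entries for the new title:", new_vec.nnz)
```

Each column corresponds to one vocabulary term, so the number of columns grows with the corpus while each row stays mostly zero.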


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(total['Title'])

print("Shape of TF-IDF features:", tfidf_features.shape)

Shape of TF-IDF features: (59757, 17685)


## Establishing Evaluation Metrics (Train/Test Split)
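To evaluate the model and detect overfitting, we split the data into a training set and a testing set.

We perform this split for both the Major and Minor targets. Because both calls use the same features, `test_size`, and `random_state`, they select the same rows.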
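
The following sketch, on toy arrays, checks that two calls to `train_test_split` with the same `random_state` pick the same rows, and shows that a single call can split the features and both label arrays at once. The array contents are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 10 samples, 3 features, and two label arrays.
X = np.arange(30).reshape(10, 3)
y_major = np.arange(10)          # each label equals its row index
y_minor = np.arange(10) * 100

# Two separate calls with the same random_state, as in the notebook.
X_tr_a, X_te_a, ymaj_tr, ymaj_te = train_test_split(X, y_major, test_size=0.2, random_state=42)
X_tr_b, X_te_b, ymin_tr, ymin_te = train_test_split(X, y_minor, test_size=0.2, random_state=42)

# One call that splits all three arrays with a single shuffle.
X_tr, X_te, maj_tr, maj_te, min_tr, min_te = train_test_split(
    X, y_major, y_minor, test_size=0.2, random_state=42
)

same_rows = np.array_equal(X_tr_a, X_tr_b) and np.array_equal(X_tr_a, X_tr)
print("Identical training rows across all three splits:", same_rows)
```

The single-call form keeps one copy of the feature matrix and makes the row alignment between the two targets explicit.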

In [None]:
from sklearn.model_selection import train_test_split

# Split data for 'Major' target
X_train_major, X_test_major, y_train_major, y_test_major = train_test_split(
    tfidf_features, major_target, test_size=0.2, random_state=42
)

# Split data for 'Minor' target
X_train_minor, X_test_minor, y_train_minor, y_test_minor = train_test_split(
    tfidf_features, minor_target, test_size=0.2, random_state=42
)

print("Shape of X_train_major:", X_train_major.shape)
print("Shape of X_test_major:", X_test_major.shape)
print("Shape of y_train_major:", y_train_major.shape)
print("Shape of y_test_major:", y_test_major.shape)
print("\nShape of X_train_minor:", X_train_minor.shape)
print("Shape of X_test_minor:", X_test_minor.shape)
print("Shape of y_train_minor:", y_train_minor.shape)
print("Shape of y_test_minor:", y_test_minor.shape)

Shape of X_train_major: (47805, 17685)
Shape of X_test_major: (11952, 17685)
Shape of y_train_major: (47805,)
Shape of y_test_major: (11952,)

Shape of X_train_minor: (47805, 17685)
Shape of X_test_minor: (11952, 17685)
Shape of y_train_minor: (47805,)
Shape of y_test_minor: (11952,)


### Building the PyTorch Data Pipeline
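Convert the scikit-learn sparse matrices into dense PyTorch float tensors and create DataLoaders that feed mini-batches during training (64 samples at a time in the recorded run, which matches the 747 training batches reported below).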
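
Converting the whole training matrix with `.toarray()` stores every zero: 47,805 rows by 17,685 columns in float32 is roughly 3.4 GB. One possible alternative, sketched below under the assumption that memory is the constraint, is a dataset that densifies one row at a time. `SparseRowDataset` is a hypothetical helper and not part of the notebook; the toy matrix stands in for the TF-IDF features.

```python
import numpy as np
import scipy.sparse as sp
import torch
from torch.utils.data import Dataset, DataLoader


class SparseRowDataset(Dataset):
    """Hypothetical helper: keeps features sparse and densifies one row per item."""

    def __init__(self, features, labels):
        self.features = sp.csr_matrix(features)
        self.labels = torch.as_tensor(np.asarray(labels), dtype=torch.long)

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        row = self.features[idx].toarray().ravel()  # dense 1-D array for this row only
        return torch.from_numpy(row.astype(np.float32)), self.labels[idx]


# Toy data standing in for the TF-IDF matrix and the remapped labels.
toy_X = sp.random(10, 50, density=0.1, format="csr", random_state=0)
toy_y = np.arange(10) % 3

loader = DataLoader(SparseRowDataset(toy_X, toy_y), batch_size=4, shuffle=False)
batch_X, batch_y = next(iter(loader))
print(batch_X.shape, batch_y.shape)

# Size of the fully dense float32 training matrix reported in the notebook.
dense_bytes = 47805 * 17685 * 4
print(f"Dense training tensor: {dense_bytes / 1e9:.2f} GB")
```

The trade-off is speed: densifying per item is slower than indexing a preloaded tensor, so this only pays off when the dense matrix does not fit in memory.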

In [None]:
from torch.utils.data import TensorDataset, DataLoader

# Convert TF-IDF features to PyTorch tensors (float32)
X_train_major_tensor = torch.tensor(X_train_major.toarray(), dtype=torch.float32)
X_test_major_tensor = torch.tensor(X_test_major.toarray(), dtype=torch.float32)

# Convert target labels to PyTorch tensors (long)
y_train_major_tensor = torch.tensor(y_train_major.values, dtype=torch.long)
y_test_major_tensor = torch.tensor(y_test_major.values, dtype=torch.long)

# Create TensorDatasets
train_major_dataset = TensorDataset(X_train_major_tensor, y_train_major_tensor)
test_major_dataset = TensorDataset(X_test_major_tensor, y_test_major_tensor)

# Create DataLoaders
batch_size = 16 #Original batch size was 64, but a smaller batch size lets us use an extended dataset.
train_major_loader = DataLoader(train_major_dataset, batch_size=batch_size, shuffle=True)
test_major_loader = DataLoader(test_major_dataset, batch_size=batch_size, shuffle=False)

print("Shape of X_train_major_tensor:", X_train_major_tensor.shape)
print("Shape of y_train_major_tensor:", y_train_major_tensor.shape)
print("Number of batches in train_major_loader:", len(train_major_loader))
print("Number of batches in test_major_loader:", len(test_major_loader))

Shape of X_train_major_tensor: torch.Size([47805, 17685])
Shape of y_train_major_tensor: torch.Size([47805])
Number of batches in train_major_loader: 747
Number of batches in test_major_loader: 187


## Model 1: Major Topic Classification
We then classify the 'Major' category from the TF-IDF features.
We define a standard multi-layer perceptron (MLP) that takes the TF-IDF vector as input, applies linear layers with ReLU activations, and outputs one score per Major topic.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

# Number of output classes for the 'Major' target.
# Hard-coded here; the training cell below recomputes it from the labels with LabelEncoder.
num_major_classes = 23

# Define the neural network architecture
class MajorCategoryClassifier(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(MajorCategoryClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, 256)  # First hidden layer
        self.fc2 = nn.Linear(256, 128)        # Second hidden layer
        self.fc3 = nn.Linear(128, output_dim) # Output layer

    def forward(self, x):
        x = F.relu(self.fc1(x)) # Apply ReLU activation to first hidden layer
        x = F.relu(self.fc2(x)) # Apply ReLU activation to second hidden layer
        x = self.fc3(x)         # Output layer, no activation here as CrossEntropyLoss will handle Softmax
        return x

# Instantiate the model
input_dim = X_train_major_tensor.shape[1]
model_major = MajorCategoryClassifier(input_dim, num_major_classes)

# Print the model architecture
print("Model Architecture for Major Category Classification:")
print(model_major)
print(f"\nNumber of unique major classes: {num_major_classes}")

Model Architecture for Major Category Classification:
MajorCategoryClassifier(
  (fc1): Linear(in_features=17685, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=23, bias=True)
)

Number of unique major classes: 23


Below, we have our model to predict the Major Categories. This model is a multi-layer perceptron that takes the bill title features as an input and trains against the major label.

We tuned the number of layers and the number of units in each layer; however, because of limits on compute time and compute units (this project has already cost me money to buy compute units), this may not be the optimal configuration. Still, it is a long way up from the base model's 80-83% accuracy.
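
To give a sense of what the tuning changes, the sketch below counts trainable parameters for a stack of fully connected layers. The layer widths are taken from the architectures printed in this notebook; `count_mlp_parameters` is a hypothetical helper written for this illustration.

```python
def count_mlp_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Layer widths taken from the printed architectures in this notebook.
small = [17685, 256, 128, 23]
deep = [17685, 1024, 512, 256, 128, 22]

print(f"3-layer model: {count_mlp_parameters(small):,} parameters")
print(f"5-layer model: {count_mlp_parameters(deep):,} parameters")
```

In both cases the first layer dominates, because it multiplies the 17,685-term vocabulary by the width of the first hidden layer.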

In [None]:
import torch.optim as optim
from sklearn.preprocessing import LabelEncoder
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

# --- Re-evaluate and transform labels ---
# Create a LabelEncoder to ensure labels are 0-indexed and contiguous
le = LabelEncoder()

# Fit the encoder on the entire 'major_target' Series to get a global, consistent mapping
# This will identify all unique major categories, including any potentially unexpected ones like '99'.
le.fit(major_target)

# Transform both training and testing labels to their new 0-indexed, contiguous values
remapped_y_train_major = le.transform(y_train_major)
remapped_y_test_major = le.transform(y_test_major)

# The corrected number of classes is the total number of unique classes found by the LabelEncoder
num_major_classes_corrected = len(le.classes_)

print(f"Original unique major labels found: {le.inverse_transform(range(num_major_classes_corrected))}")
print(f"Corrected number of major classes (output dimension): {num_major_classes_corrected}")

# --- Re-define PyTorch Model Architecture with corrected num_major_classes ---
# This class definition is repeated to ensure the code block is self-contained and fully runnable.
class MajorCategoryClassifier(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(MajorCategoryClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, 1024)  # First hidden layer
        self.fc2 = nn.Linear(1024, 512)        # Second hidden layer
        self.fc3 = nn.Linear(512, 256)         # Third hidden layer
        self.fc4 = nn.Linear(256, 128)         # Fourth hidden layer
        self.fc5 = nn.Linear(128, output_dim)  # Output layer

    def forward(self, x):
        x = F.relu(self.fc1(x)) # ReLU after the first hidden layer
        x = F.relu(self.fc2(x)) # ReLU after the second hidden layer
        x = F.relu(self.fc3(x)) # ReLU after the third hidden layer
        x = F.relu(self.fc4(x)) # ReLU after the fourth hidden layer
        x = self.fc5(x)         # Output layer, no activation here as CrossEntropyLoss will handle Softmax
        return x

# Instantiate the model with the corrected number of classes
# input_dim comes from X_train_major_tensor.shape[1], which was defined in a previous cell.
input_dim = X_train_major_tensor.shape[1]
model_major = MajorCategoryClassifier(input_dim, num_major_classes_corrected)

# Print the updated model architecture for verification
print("\nUpdated Model Architecture for Major Category Classification:")
print(model_major)

# --- Re-prepare Data for PyTorch with transformed labels ---
# Convert the transformed target labels to PyTorch tensors (long type is required for CrossEntropyLoss)
y_train_major_tensor_remapped = torch.tensor(remapped_y_train_major, dtype=torch.long)
y_test_major_tensor_remapped = torch.tensor(remapped_y_test_major, dtype=torch.long)

# Create TensorDatasets using the original TF-IDF features and the newly remapped labels
train_major_dataset_remapped = TensorDataset(X_train_major_tensor, y_train_major_tensor_remapped)
test_major_dataset_remapped = TensorDataset(X_test_major_tensor, y_test_major_tensor_remapped)

# Re-create DataLoaders with the remapped datasets
# Set the batch size explicitly so this cell does not depend on an earlier value
batch_size = 64
train_major_loader_remapped = DataLoader(train_major_dataset_remapped, batch_size=batch_size, shuffle=True)
test_major_loader_remapped = DataLoader(test_major_dataset_remapped, batch_size=batch_size, shuffle=False)

print(f"\nNumber of batches in remapped train_major_loader: {len(train_major_loader_remapped)}")
print(f"Number of batches in remapped test_major_loader: {len(test_major_loader_remapped)}")

# --- Training loop using the re-prepared data and model ---
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_major.parameters(), lr=0.0005)

# Set the number of training epochs
num_epochs = 25


# Training loop
print("\nStarting training with remapped labels...")
for epoch in range(num_epochs):
    model_major.train() # Set the model to training mode
    running_loss = 0.0
    # Iterate through batches from the remapped training data loader
    for inputs, labels in train_major_loader_remapped:
        optimizer.zero_grad()
        outputs = model_major(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * inputs.size(0)

    # Calculate and print the average loss for the current epoch
    epoch_loss = running_loss / len(train_major_loader_remapped.dataset)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.4f}")

print("Training finished.")


Original unique major labels found: [ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 12. 13. 14. 15. 16. 17. 18. 19.
 20. 21. 23. 99.]
Corrected number of major classes (output dimension): 22

Updated Model Architecture for Major Category Classification:
MajorCategoryClassifier(
  (fc1): Linear(in_features=17685, out_features=1024, bias=True)
  (fc2): Linear(in_features=1024, out_features=512, bias=True)
  (fc3): Linear(in_features=512, out_features=256, bias=True)
  (fc4): Linear(in_features=256, out_features=128, bias=True)
  (fc5): Linear(in_features=128, out_features=22, bias=True)
)

Number of batches in remapped train_major_loader: 747
Number of batches in remapped test_major_loader: 187

Starting training with remapped labels...
Epoch 1/25, Loss: 1.2780
Epoch 2/25, Loss: 0.3500
Epoch 3/25, Loss: 0.1996
Epoch 4/25, Loss: 0.1439
Epoch 5/25, Loss: 0.1150
Epoch 6/25, Loss: 0.0986
Epoch 7/25, Loss: 0.0878
Epoch 8/25, Loss: 0.0795
Epoch 9/25, Loss: 0.0741
Epoch 10/25, Loss: 0.0697
Epoch 11/25
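
The label remapping in the training cell matters because `nn.CrossEntropyLoss` expects class indices in the range 0 to C-1, while the raw Major codes skip values and include 99. A minimal sketch with a few codes and random logits (the logits are placeholders, not model output):

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import LabelEncoder

# A few raw Major codes, including the non-contiguous 99.
raw_labels = np.array([1.0, 3.0, 99.0, 3.0, 23.0])

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)  # indices 0 .. n_classes - 1
n_classes = len(encoder.classes_)

# Random logits standing in for model outputs of shape (batch, n_classes).
torch.manual_seed(0)
logits = torch.randn(len(raw_labels), n_classes)

loss = nn.CrossEntropyLoss()(logits, torch.tensor(encoded, dtype=torch.long))
print("Encoded labels:", encoded, "loss:", round(loss.item(), 4))

# Predicted indices are mapped back to the original codes with inverse_transform.
decoded = encoder.inverse_transform(logits.argmax(dim=1).numpy())
print("Decoded predictions:", decoded)
```

Passing the raw code 99 as a target to a model with 22 or 23 outputs would raise an out-of-range error, which is why the encoder is fitted before training.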

## Evaluate the Major Category Model

We now assess the performance of the trained PyTorch model on the test data.

In [None]:
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# 1. Set the model to evaluation mode
model_major.eval()

# 2. Initialize empty lists to store true labels and predicted labels
all_preds = []
all_labels = []

# 3. Iterate through the test_major_loader_remapped without computing gradients
with torch.no_grad():
    for inputs, labels in test_major_loader_remapped:
        # Perform a forward pass to get predictions
        outputs = model_major(inputs)

        # Get the predicted class as the index of the largest logit
        _, predicted = torch.max(outputs.data, 1)

        # Move the true labels and predicted labels to the CPU and convert to NumPy arrays
        all_preds.extend(predicted.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# 4. Convert the collected true and predicted labels into NumPy arrays
all_preds = np.array(all_preds)
all_labels = np.array(all_labels)

# 5. Calculate and print the overall accuracy
accuracy = accuracy_score(all_labels, all_preds)
print(f"Accuracy on the test set for Major Category: {accuracy:.4f}")

Accuracy on the test set for Major Category: 0.9075
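
Overall accuracy can hide weak performance on rare classes. As a hedged illustration on toy labels (not the notebook's results), the sketch below compares plain accuracy with balanced accuracy and prints the per-class report that `classification_report` provides.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, classification_report

# Toy labels: class 0 dominates, class 2 is rare and always missed.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

overall = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
print(f"Accuracy: {overall:.2f}, balanced accuracy: {balanced:.2f}")

# zero_division=0 avoids warnings for classes that are never predicted.
print(classification_report(y_true, y_pred, zero_division=0))
```

In the notebook, the same calls could be applied to `all_labels` and `all_preds` to see which Major categories account for most of the errors.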


As a prediction test, we used the title "An Act to Protect Hawai'is Coastal Waters from Overfishing". We made this title up, so it does not exist in the model's training dataset, and we intended it to belong to Agriculture (Category 4). Invoking Hawai'i and coastal waters is designed to tempt the model into treating this as an environmental bill, but it correctly predicts Agriculture.

In [None]:
# New bill title for prediction
new_bill_title = ["An Act to Protect Hawai'is Coastal Waters from Overfishing"]

# Vectorize the new bill title using the *already fitted* tfidf_vectorizer
new_bill_title_tfidf = tfidf_vectorizer.transform(new_bill_title)

# Make prediction using the trained model_major
model_major.eval() # Set model to evaluation mode
with torch.no_grad():
    # Convert TF-IDF features to PyTorch tensor
    new_bill_title_tensor = torch.tensor(new_bill_title_tfidf.toarray(), dtype=torch.float32)
    outputs = model_major(new_bill_title_tensor)
    _, predicted_label_remapped = torch.max(outputs.data, 1)

# Inverse transform the predicted remapped label to get the original Major category
predicted_major_category = le.inverse_transform(predicted_label_remapped.cpu().numpy())

print(f"The predicted Major category for \"{new_bill_title[0]}\" is: {predicted_major_category[0]}")

The predicted Major category for "An Act to Protect Hawai'is Coastal Waters from Overfishing" is: 4.0
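
A single predicted label does not show how close the runner-up was. One way to inspect this, sketched here with made-up logits rather than real model output, is to apply softmax and take the top few classes. In the notebook, the resulting indices would still need `le.inverse_transform` to recover the original Major codes.

```python
import torch

# Hypothetical logits for one title over five classes (not real model output).
logits = torch.tensor([[0.5, 2.0, -1.0, 1.5, 0.0]])

probs = torch.softmax(logits, dim=1)                 # each row sums to 1
top_probs, top_idx = torch.topk(probs, k=3, dim=1)   # three most likely classes

for p, i in zip(top_probs[0].tolist(), top_idx[0].tolist()):
    print(f"class index {i}: probability {p:.3f}")
```

For an adversarial title like the one above, a small gap between the first and second class would indicate that the model found the example genuinely ambiguous.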


### Mislabeled analysis


We list a sample of mislabeled bills to determine what types of errors the model makes and how we might interpret or remedy them.
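
Alongside reading individual titles, it can help to count which category pairs are confused most often. A minimal sketch with invented labels; in the notebook the two lists would come from applying `le.inverse_transform` to `all_labels` and `all_preds`.

```python
from collections import Counter

# Toy actual/predicted Major codes, invented for illustration.
actual = [16, 99, 13, 16, 99, 5, 16]
predicted = [12, 4, 5, 12, 4, 5, 16]

# Count each (actual, predicted) pair where the prediction was wrong.
confusions = Counter(
    (a, p) for a, p in zip(actual, predicted) if a != p
)

for (a, p), count in confusions.most_common():
    print(f"actual {a} -> predicted {p}: {count}")
```

Pairs that recur often point to systematic overlaps between categories, while one-off pairs are more likely noise or labeling ambiguity.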

In [None]:
import numpy as np

# Find indices where predicted labels do not match true labels
mismatched_indices = np.where(all_preds != all_labels)[0]

# Get the first 30 mislabeled indices from the test set
num_to_display = 30
selected_mismatched_indices = mismatched_indices[:num_to_display]

# Map these indices back to the original `total` DataFrame indices
# First, get the original indices of the test set
original_test_indices = y_test_major.index.values

# Then, get the original indices for the selected mismatched entries
original_indices_for_display = original_test_indices[selected_mismatched_indices]

print(f"Displaying the first {len(selected_mismatched_indices)} mislabeled entries:\n")

for i, original_df_idx in enumerate(original_indices_for_display):
    # Get the original bill title
    bill_title = total.loc[original_df_idx, 'Title']

    # Get the remapped actual and predicted labels
    remapped_actual_label = all_labels[selected_mismatched_indices[i]]
    remapped_predicted_label = all_preds[selected_mismatched_indices[i]]

    # Inverse transform to get the original major category values
    actual_major_category = le.inverse_transform([remapped_actual_label])[0]
    predicted_major_category = le.inverse_transform([remapped_predicted_label])[0]

    print(f"Entry {i+1}:")
    print(f"  Title: {bill_title}")
    print(f"  Actual Major Category: {actual_major_category}")
    print(f"  Predicted Major Category: {predicted_major_category}")
    print("--------------------------------------------------")

Displaying the first 30 mislabeled entries:

Entry 1:
  Title: A bill to amend section 5052 and 5232 of title 10, United States Code, relating to the appointment to the grades of general and lieutenant general of Marine Corps officers designated for appropriate higher commands or for performance of duties of great importance and responsibility.
  Actual Major Category: 16.0
  Predicted Major Category: 12.0
--------------------------------------------------
Entry 2:
  Title: An Act for the relief of Swiff-Train Company.
  Actual Major Category: 99.0
  Predicted Major Category: 4.0
--------------------------------------------------
Entry 3:
  Title: A bill to amend the Internal Revenue Code of 1954 to add social security benefits to the annuity and pension payments which are exempt from levy thereunder.
  Actual Major Category: 13.0
  Predicted Major Category: 5.0
--------------------------------------------------
Entry 4:
  Title: A bill to provide for minimum rate provisions by nominat

## Model 2: Minor Topic Prediction
We use a similar architecture to Model 1, but the output layer is adjusted to match the number of Minor categories. We want to see how well the model performs using *only* the text title.

There are more categories for Minor topics, so this task is harder than predicting Major. While this would usually call for more epochs of training, compute limits restricted us to 25 epochs. Still, given our large dataset (72,517 bills before cleaning, 59,757 after dropping rows with missing values), we should be able to expect reasonable accuracy.

In [None]:
import torch.optim as optim
from sklearn.preprocessing import LabelEncoder
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

# --- Re-evaluate and transform labels for Minor Category ---
le_minor = LabelEncoder()
le_minor.fit(minor_target)

remapped_y_train_minor = le_minor.transform(y_train_minor)
remapped_y_test_minor = le_minor.transform(y_test_minor)

num_minor_classes_corrected = len(le_minor.classes_)

print(f"Original unique minor labels found: {le_minor.inverse_transform(range(num_minor_classes_corrected))}")
print(f"Corrected number of minor classes (output dimension): {num_minor_classes_corrected}")

# --- Define PyTorch Model Architecture for Minor Category ---
class MinorCategoryClassifier(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(MinorCategoryClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, 1024)  # First hidden layer
        self.fc2 = nn.Linear(1024, 512)        # Second hidden layer
        self.fc3 = nn.Linear(512, 256)         # Third hidden layer
        self.fc4 = nn.Linear(256, 128)         # Fourth hidden layer
        self.fc5 = nn.Linear(128, output_dim)  # Output layer

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        x = self.fc5(x)
        return x

# Instantiate the model with the corrected number of classes
input_dim = X_train_minor.shape[1] # Use X_train_minor for input_dim
model_minor = MinorCategoryClassifier(input_dim, num_minor_classes_corrected)

print("\nModel Architecture for Minor Category Classification:")
print(model_minor)

# --- Re-prepare Data for PyTorch with transformed minor labels ---
y_train_minor_tensor_remapped = torch.tensor(remapped_y_train_minor, dtype=torch.long)
y_test_minor_tensor_remapped = torch.tensor(remapped_y_test_minor, dtype=torch.long)

# Create TensorDatasets using the original TF-IDF features and the newly remapped labels
train_minor_dataset_remapped = TensorDataset(torch.tensor(X_train_minor.toarray(), dtype=torch.float32), y_train_minor_tensor_remapped)
test_minor_dataset_remapped = TensorDataset(torch.tensor(X_test_minor.toarray(), dtype=torch.float32), y_test_minor_tensor_remapped)

# Re-create DataLoaders with the remapped datasets
batch_size = 64 # Same batch size as the Major model
train_minor_loader_remapped = DataLoader(train_minor_dataset_remapped, batch_size=batch_size, shuffle=True)
test_minor_loader_remapped = DataLoader(test_minor_dataset_remapped, batch_size=batch_size, shuffle=False)

print(f"\nNumber of batches in remapped train_minor_loader: {len(train_minor_loader_remapped)}")
print(f"Number of batches in remapped test_minor_loader: {len(test_minor_loader_remapped)}")

# --- Training loop for Minor Category Model ---
criterion_minor = nn.CrossEntropyLoss()
optimizer_minor = optim.Adam(model_minor.parameters(), lr=0.0005)

num_epochs_minor = 25 # You can adjust the number of epochs

print("\nStarting training for Minor Category with remapped labels...")
for epoch in range(num_epochs_minor):
    model_minor.train() # Set the model to training mode
    running_loss_minor = 0.0
    for inputs, labels in train_minor_loader_remapped:
        optimizer_minor.zero_grad()
        outputs = model_minor(inputs)
        loss = criterion_minor(outputs, labels)
        loss.backward()
        optimizer_minor.step()

        running_loss_minor += loss.item() * inputs.size(0)

    epoch_loss_minor = running_loss_minor / len(train_minor_loader_remapped.dataset)
    print(f"Epoch {epoch+1}/{num_epochs_minor}, Loss: {epoch_loss_minor:.4f}")

print("Training for Minor Category finished.")

Original unique minor labels found: [ 100.  101.  103.  104.  105.  107.  108.  110.  200.  201.  202.  204.
  205.  206.  207.  208.  209.  299.  300.  301.  302.  321.  322.  323.
  324.  325.  331.  332.  333.  334.  335.  341.  342.  398.  399.  400.
  401.  402.  403.  404.  405.  408.  498.  499.  500.  501.  502.  503.
  504.  505.  506.  529.  599.  600.  601.  602.  603.  604.  606.  607.
  699.  700.  701.  703.  704.  705.  707.  708.  709.  711.  798.  799.
  800.  801.  802.  803.  805.  806.  807.  898.  899.  900. 1000. 1001.
 1002. 1003. 1005. 1007. 1010. 1098. 1099. 1200. 1201. 1202. 1203. 1204.
 1205. 1206. 1207. 1208. 1210. 1211. 1299. 1300. 1301. 1303. 1304. 1305.
 1308. 1399. 1400. 1401. 1403. 1404. 1405. 1406. 1407. 1408. 1409. 1499.
 1500. 1501. 1502. 1504. 1505. 1507. 1520. 1521. 1522. 1523. 1524. 1525.
 1526. 1599. 1600. 1602. 1603. 1604. 1605. 1606. 1608. 1610. 1611. 1612.
 1614. 1615. 1616. 1617. 1619. 1620. 1698. 1699. 1700. 1701. 1704. 1705.
 1706. 1707. 17

## Model 2 Evaluation

In [None]:
# 1. Set the model to evaluation mode
model_minor.eval()

# 2. Initialize empty lists to store true labels and predicted labels
all_preds = []
all_labels = []

# 3. Iterate through the test with no grads.
with torch.no_grad():
    for inputs, labels in test_minor_loader_remapped:
        # Perform a forward pass to get predictions
        outputs = model_minor(inputs)

        # Get the predicted class as the index of the largest logit
        _, predicted = torch.max(outputs.data, 1)

        # Move the true labels and predicted labels to the CPU and convert to NumPy arrays
        all_preds.extend(predicted.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# 4. Convert the collected true and predicted labels into NumPy arrays
all_preds = np.array(all_preds)
all_labels = np.array(all_labels)

# 5. Calculate and print the overall accuracy
accuracy = accuracy_score(all_labels, all_preds)
print(f"Accuracy on the test set for Minor Category: {accuracy:.4f}")

Accuracy on the test set for Minor Category: 0.8376


We attained 83.76% accuracy for the Minor category. This is significantly (~15%) higher than the accuracy of our previous Minor model, and we believe the improvement is due to the more complex architecture with a greater number of hidden layers.

### Model 2 mislabeled analysis

With Minor categories, we have two types of errors. In the first type, the predicted Minor category belongs to the wrong Major category. For our overfishing Hawai'i example, this would mean the Minor model returned a code in the 700s (an environmental Major category) rather than one in the 400s. In the second type, the predicted Minor category falls under the correct Major category but carries the wrong Minor label.

We see below that 21 of a sample of 30 were of the first type.
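
The two error types can be separated mechanically, assuming that a Minor code embeds its Major code as the leading digits (for example, 1608 belongs to Major 16), which matches the sample rows shown earlier. `error_type` is a hypothetical helper written for this sketch; the three pairs are the mislabeled entries shown in this notebook's output.

```python
def error_type(actual_minor, predicted_minor):
    """Classify a Minor prediction by whether the implied Major code differs."""
    if actual_minor == predicted_minor:
        return "correct"
    # Assumption: Minor code = Major code * 100 + subtopic (e.g. 1608 -> 16).
    if int(actual_minor) // 100 != int(predicted_minor) // 100:
        return "wrong major"
    return "right major, wrong minor"


# Actual/predicted pairs from the mislabeled entries shown in this notebook.
examples = [(502.0, 302.0), (1608.0, 323.0), (2008.0, 2007.0)]
for actual, predicted in examples:
    print(actual, predicted, "->", error_type(actual, predicted))
```

Applying this to every mismatch in the test set, rather than a sample of 30, would give the share of each error type directly.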

In [None]:
import numpy as np

# Find indices where predicted labels do not match true labels for the minor model
mismatched_indices_minor = np.where(all_preds != all_labels)[0]

# Get the first 30 mislabeled indices from the test set
num_to_display_minor = 30
selected_mismatched_indices_minor = mismatched_indices_minor[:num_to_display_minor]

# Map these indices back to the original `total` DataFrame indices
# First, get the original indices of the minor test set (from y_test_minor)
original_test_indices_minor = y_test_minor.index.values

# Then, get the original indices for the selected mismatched entries
original_indices_for_display_minor = original_test_indices_minor[selected_mismatched_indices_minor]

print(f"Displaying the first {len(selected_mismatched_indices_minor)} mislabeled entries for Minor Category:\n")

for i, original_df_idx in enumerate(original_indices_for_display_minor):
    # Get the original bill title
    bill_title = total.loc[original_df_idx, 'Title']

    # Get the remapped actual and predicted labels
    remapped_actual_label_minor = all_labels[selected_mismatched_indices_minor[i]]
    remapped_predicted_label_minor = all_preds[selected_mismatched_indices_minor[i]]

    # Inverse transform to get the original minor category values
    actual_minor_category = le_minor.inverse_transform([remapped_actual_label_minor])[0]
    predicted_minor_category = le_minor.inverse_transform([remapped_predicted_label_minor])[0]

    print(f"Entry {i+1}:")
    print(f"  Title: {bill_title}")
    print(f"  Actual Minor Category: {actual_minor_category}")
    print(f"  Predicted Minor Category: {predicted_minor_category}")
    print("--------------------------------------------------")

Displaying the first 30 mislabeled entries for Minor Category:

Entry 1:
  Title: A bill to delay the repayment of an advance or advances to the unemployment account of a State under title XII of the Social Security Act.
  Actual Minor Category: 502.0
  Predicted Minor Category: 302.0
--------------------------------------------------
Entry 2:
  Title: A bill to amend section 5052 and 5232 of title 10, United States Code, relating to the appointment to the grades of general and lieutenant general of Marine Corps officers designated for appropriate higher commands or for performance of duties of great importance and responsibility.
  Actual Minor Category: 1608.0
  Predicted Minor Category: 323.0
--------------------------------------------------
Entry 3:
  Title: A bill to authorize the Architect of the Capitol to furnish chilled water to the Folger Shakespeare Library.
  Actual Minor Category: 2008.0
  Predicted Minor Category: 2007.0
--------------------------------------------------

Given the above, we wanted to test whether a model that takes `model_major`'s predictions as part of its input would yield better Minor predictions. To that end, we had `model_major` generate predictions to serve as an extra input feature for a second Minor model.

In [None]:
import numpy as np

# 1. Set the model to evaluation mode
model_major.eval()

# 2. Initialize empty lists to store predicted major categories
train_major_preds = []
test_major_preds = []

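# 3. Collect predictions on the training set in its original row order.
# train_major_loader_remapped shuffles each pass, which would misalign the
# predictions with the rows of X_train_minor, so use an unshuffled loader.
train_major_loader_ordered = DataLoader(train_major_dataset_remapped, batch_size=batch_size, shuffle=False)
with torch.no_grad():
    for inputs, _ in train_major_loader_ordered:
        outputs = model_major(inputs)
        _, predicted = torch.max(outputs.data, 1)
        train_major_preds.extend(predicted.cpu().numpy())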

# 4. Iterate through test_major_loader_remapped and collect predictions
with torch.no_grad():
    for inputs, _ in test_major_loader_remapped:
        outputs = model_major(inputs)
        _, predicted = torch.max(outputs.data, 1)
        test_major_preds.extend(predicted.cpu().numpy())

# 5. Convert lists into NumPy arrays
train_major_preds = np.array(train_major_preds, dtype=np.float32)
test_major_preds = np.array(test_major_preds, dtype=np.float32)

print("Major category predictions generated for training and testing datasets.")
print(f"Shape of train_major_preds: {train_major_preds.shape}")
print(f"Shape of test_major_preds: {test_major_preds.shape}")

Major category predictions generated for training and testing datasets.
Shape of train_major_preds: (47805,)
Shape of test_major_preds: (11952,)


To create the augmented feature set for `MinorModel2`, the NumPy prediction arrays were converted to PyTorch tensors, and then concatenated with the existing TF-IDF feature tensors for both the training and testing sets.
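
Appending the predicted class index as a single number treats the Major categories as if they were ordered. A common alternative, shown here only as a sketch and not as the notebook's method, is to append a one-hot vector instead. The tensors below are toy stand-ins for the TF-IDF features and the remapped predictions.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: 4 samples, 6 TF-IDF features, 3 possible Major classes.
tfidf = torch.rand(4, 6)
major_pred_idx = torch.tensor([0, 2, 1, 2])   # remapped class indices
num_major = 3

# Scalar encoding, as in the notebook: one extra column holding the index.
scalar_aug = torch.cat((tfidf, major_pred_idx.float().unsqueeze(1)), dim=1)

# One-hot encoding: one extra column per Major class, with no implied ordering.
one_hot = F.one_hot(major_pred_idx, num_classes=num_major).float()
onehot_aug = torch.cat((tfidf, one_hot), dim=1)

print(scalar_aug.shape, onehot_aug.shape)
```

With 22 Major classes the one-hot version would add 22 columns instead of 1, which is small next to the 17,685 TF-IDF columns.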



In [None]:
import torch

# Ensure X_train_minor_tensor and X_test_minor_tensor are explicitly defined
# (These were implicitly converted in DataLoader creation in a previous cell)
X_train_minor_tensor = torch.tensor(X_train_minor.toarray(), dtype=torch.float32)
X_test_minor_tensor = torch.tensor(X_test_minor.toarray(), dtype=torch.float32)

# 1. Convert train_major_preds and test_major_preds NumPy arrays to PyTorch tensors (float32)
train_major_preds_tensor = torch.tensor(train_major_preds, dtype=torch.float32)
test_major_preds_tensor = torch.tensor(test_major_preds, dtype=torch.float32)

# 2. Reshape prediction tensors from 1D to 2D (e.g., (num_samples,) to (num_samples, 1))
train_major_preds_tensor_reshaped = train_major_preds_tensor.unsqueeze(1)
test_major_preds_tensor_reshaped = test_major_preds_tensor.unsqueeze(1)

# 3. Concatenate the reshaped train_major_preds tensor with X_train_minor_tensor
X_train_minor_augmented = torch.cat((X_train_minor_tensor, train_major_preds_tensor_reshaped), dim=1)

# 4. Concatenate the reshaped test_major_preds tensor with X_test_minor_tensor
X_test_minor_augmented = torch.cat((X_test_minor_tensor, test_major_preds_tensor_reshaped), dim=1)

# 5. Print the shapes of the newly created augmented tensors
print(f"Shape of X_train_minor_augmented: {X_train_minor_augmented.shape}")
print(f"Shape of X_test_minor_augmented: {X_test_minor_augmented.shape}")

Shape of X_train_minor_augmented: torch.Size([47805, 17686])
Shape of X_test_minor_augmented: torch.Size([11952, 17686])


## Define MinorModel2 Architecture

We design and define a new PyTorch neural network model (`MinorModel2`) that accepts the combined feature set as input and outputs predictions for the Minor categories.


In [None]:
import torch.nn as nn
import torch.nn.functional as F

# Define the neural network architecture for MinorModel2
class MinorModel2(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 1024)  # First hidden layer
        self.fc2 = nn.Linear(1024, 512)        # Second hidden layer
        self.fc3 = nn.Linear(512, 256)         # Third hidden layer
        self.fc4 = nn.Linear(256, 128)         # Fourth hidden layer
        self.fc5 = nn.Linear(128, output_dim)  # Output layer

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        x = self.fc5(x)  # Output layer returns raw logits; CrossEntropyLoss applies log-softmax internally
        return x

# Instantiate MinorModel2
# input_dim for MinorModel2 will be the shape of X_train_minor_augmented
input_dim_minor_model2 = X_train_minor_augmented.shape[1]
model_minor2 = MinorModel2(input_dim_minor_model2, num_minor_classes_corrected)

# Print the model architecture
print("Model Architecture for MinorModel2 (using augmented features):")
print(model_minor2)
print(f"\nInput dimension for MinorModel2: {input_dim_minor_model2}")
print(f"Output dimension for MinorModel2 (num_minor_classes_corrected): {num_minor_classes_corrected}")

Model Architecture for MinorModel2 (using augmented features):
MinorModel2(
  (fc1): Linear(in_features=17686, out_features=1024, bias=True)
  (fc2): Linear(in_features=1024, out_features=512, bias=True)
  (fc3): Linear(in_features=512, out_features=256, bias=True)
  (fc4): Linear(in_features=256, out_features=128, bias=True)
  (fc5): Linear(in_features=128, out_features=208, bias=True)
)

Input dimension for MinorModel2: 17686
Output dimension for MinorModel2 (num_minor_classes_corrected): 208
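
Before training, it can help to confirm that a model maps a batch to one row of logits per sample and to see how many parameters it has. The sketch below does both on a small stand-in network; `count_parameters`, `toy_model`, and the layer sizes are illustrative assumptions, not part of this notebook.

```python
import torch
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Return the number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Small stand-in network (hypothetical sizes, not the notebook's 17686 -> 208)
toy_model = nn.Sequential(
    nn.Linear(10, 8),
    nn.ReLU(),
    nn.Linear(8, 4),
)

# A dummy batch of 5 samples should produce one row of logits per sample
dummy_batch = torch.randn(5, 10)
with torch.no_grad():
    logits = toy_model(dummy_batch)

print(logits.shape)                 # (batch size, number of classes)
print(count_parameters(toy_model))  # weights plus biases of both layers
```

Calling `count_parameters(model_minor2)` and passing a small slice of `X_train_minor_augmented` through `model_minor2` would apply the same checks to the real model.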


In [None]:
from torch.utils.data import TensorDataset, DataLoader

# 1. Create a TensorDataset for the training data
train_minor_dataset_model2 = TensorDataset(X_train_minor_augmented, y_train_minor_tensor_remapped)

# 2. Create a TensorDataset for the testing data
test_minor_dataset_model2 = TensorDataset(X_test_minor_augmented, y_test_minor_tensor_remapped)

# Set batch_size here so this cell runs on its own (earlier cells are assumed to use 64 as well)
batch_size = 64

# 3. Create a DataLoader for the training dataset
train_minor_loader_model2 = DataLoader(train_minor_dataset_model2, batch_size=batch_size, shuffle=True)

# 4. Create a DataLoader for the testing dataset
test_minor_loader_model2 = DataLoader(test_minor_dataset_model2, batch_size=batch_size, shuffle=False)

# 5. Print the number of batches in both DataLoaders
print(f"Number of batches in train_minor_loader_model2: {len(train_minor_loader_model2)}")
print(f"Number of batches in test_minor_loader_model2: {len(test_minor_loader_model2)}")

Number of batches in train_minor_loader_model2: 747
Number of batches in test_minor_loader_model2: 187
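
The batch counts printed above follow from the dataset sizes and the batch size. A quick check, assuming the `DataLoader` default of `drop_last=False` so that the final, smaller batch is kept:

```python
import math

batch_size = 64
n_train, n_test = 47805, 11952  # sample counts from the shapes printed earlier

# The last batch may be smaller than batch_size, so round up
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)

print(train_batches, test_batches)  # 747 187
```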


We now train MinorModel2 using the predictions from Model 1 (the Major model) as an additional input feature. The goal is to correct a flaw in the original Model_Minor: it could predict minor categories that fall outside the bill's major category.
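
The flaw described above can be measured directly. The sketch below is a rough illustration: `minor_outside_major_rate` is a hypothetical helper, and it assumes that a minor code's parent topic is `minor // 100` (as in the sample rows, where 301 sits under 3). That assumption should be checked against the codebook. It also expects original topic codes, so remapped class indices would need to be mapped back first.

```python
import numpy as np

def minor_outside_major_rate(major_codes, minor_codes):
    """Fraction of samples whose minor code does not belong to its major code.

    Assumes (not verified here) that a minor code's parent topic is
    minor // 100, e.g. 301 -> 3 and 1201 -> 12.
    """
    major_codes = np.asarray(major_codes).astype(int)
    minor_codes = np.asarray(minor_codes).astype(int)
    return float(np.mean(minor_codes // 100 != major_codes))

# Made-up predictions: the third pair is inconsistent (minor 805 under major 3)
example_major = [3, 5, 3, 2]
example_minor = [301, 503, 805, 208]
print(minor_outside_major_rate(example_major, example_minor))  # 0.25
```

Computing this rate for both the original Minor model and MinorModel2 would show whether the augmented input actually reduces out-of-category predictions.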

In [None]:
import torch.optim as optim

# 1. Define the loss function and optimizer for model_minor2
criterion_minor2 = nn.CrossEntropyLoss()
optimizer_minor2 = optim.Adam(model_minor2.parameters(), lr=0.0005) # Using the same learning rate as previous models

# 2. Set the number of training epochs
num_epochs_minor2 = 25 # You can adjust this number as needed

# 3. Implement a training loop
print("\nStarting training for MinorModel2 with augmented features...")
for epoch in range(num_epochs_minor2):
    model_minor2.train() # 4. Set the model to training mode
    running_loss_minor2 = 0.0
    # 5. Iterate through batches from train_minor_loader_model2
    for inputs, labels in train_minor_loader_model2:
        # 6a. Zero the gradients of the optimizer
        optimizer_minor2.zero_grad()

        # 6b. Perform a forward pass
        outputs = model_minor2(inputs)

        # 6c. Calculate the loss
        loss = criterion_minor2(outputs, labels)

        # 6d. Perform backpropagation
        loss.backward()

        # 6e. Update the model's weights
        optimizer_minor2.step()

        # 6f. Accumulate the running loss
        running_loss_minor2 += loss.item() * inputs.size(0)

    # 7. Calculate and print the average loss for the current epoch
    epoch_loss_minor2 = running_loss_minor2 / len(train_minor_loader_model2.dataset)
    print(f"Epoch {epoch+1}/{num_epochs_minor2}, Loss: {epoch_loss_minor2:.4f}")

print("Training for MinorModel2 finished.")


Starting training for MinorModel2 with augmented features...
Epoch 1/25, Loss: 3.5607
Epoch 2/25, Loss: 1.6724
Epoch 3/25, Loss: 1.0492
Epoch 4/25, Loss: 0.7292
Epoch 5/25, Loss: 0.5541
Epoch 6/25, Loss: 0.4464
Epoch 7/25, Loss: 0.3755
Epoch 8/25, Loss: 0.3250
Epoch 9/25, Loss: 0.2866
Epoch 10/25, Loss: 0.2562
Epoch 11/25, Loss: 0.2355
Epoch 12/25, Loss: 0.2108
Epoch 13/25, Loss: 0.1875
Epoch 14/25, Loss: 0.1774
Epoch 15/25, Loss: 0.1698
Epoch 16/25, Loss: 0.1552
Epoch 17/25, Loss: 0.1461
Epoch 18/25, Loss: 0.1394
Epoch 19/25, Loss: 0.1278
Epoch 20/25, Loss: 0.1195
Epoch 21/25, Loss: 0.1175
Epoch 22/25, Loss: 0.1156
Epoch 23/25, Loss: 0.1070
Epoch 24/25, Loss: 0.1075
Epoch 25/25, Loss: 0.0972
Training for MinorModel2 finished.


In [None]:
from sklearn.metrics import accuracy_score
import numpy as np

# 1. Set the model to evaluation mode
model_minor2.eval()

# 2. Initialize empty lists to store true labels and predicted labels
all_preds_minor2 = []
all_labels_minor2 = []

# 3. Iterate through the test_minor_loader_model2 without computing gradients
with torch.no_grad():
    for inputs, labels in test_minor_loader_model2:
        # Perform a forward pass to get predictions
        outputs = model_minor2(inputs)

        # Get the predicted class: the index of the largest logit
        _, predicted = torch.max(outputs, 1)

        # Move the true labels and predicted labels to the CPU and convert to NumPy arrays
        all_preds_minor2.extend(predicted.cpu().numpy())
        all_labels_minor2.extend(labels.cpu().numpy())

# 4. Convert the accumulated lists of predictions and labels to NumPy arrays
all_preds_minor2 = np.array(all_preds_minor2)
all_labels_minor2 = np.array(all_labels_minor2)

# 5. Calculate and print the overall accuracy
accuracy_minor2 = accuracy_score(all_labels_minor2, all_preds_minor2)
print(f"Accuracy on the test set for MinorModel2: {accuracy_minor2:.4f}")

Accuracy on the test set for MinorModel2: 0.7961


MinorModel2 is less accurate than the initial Minor model. One possible reason is error propagation: MinorModel2 takes the Major model's predictions as input, so a wrong major prediction can push it toward a wrong minor category. If the Major model and the original Minor model err on two different bills, MinorModel2 may get both bills wrong, whereas the original Minor model would have missed only one of them.
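
One way to probe this explanation is to split minor accuracy by whether the major prediction was correct. The sketch below is only an illustration: `minor_accuracy_by_major_outcome` is a hypothetical helper, the example labels are made up, and all four arrays are assumed to be aligned on the same test rows.

```python
import numpy as np

def minor_accuracy_by_major_outcome(major_true, major_pred, minor_true, minor_pred):
    """Minor accuracy split by whether the major prediction was right."""
    major_true, major_pred = np.asarray(major_true), np.asarray(major_pred)
    minor_true, minor_pred = np.asarray(minor_true), np.asarray(minor_pred)

    major_ok = major_true == major_pred
    minor_ok = minor_true == minor_pred

    # Guard against empty groups, which would make the mean undefined
    acc_when_right = float(minor_ok[major_ok].mean()) if major_ok.any() else float("nan")
    acc_when_wrong = float(minor_ok[~major_ok].mean()) if (~major_ok).any() else float("nan")
    return acc_when_right, acc_when_wrong

# Made-up labels: the major prediction is wrong for the last two samples
right, wrong = minor_accuracy_by_major_outcome(
    major_true=[3, 5, 8, 2], major_pred=[3, 5, 2, 6],
    minor_true=[301, 503, 805, 208], minor_pred=[301, 503, 201, 208],
)
print(right, wrong)  # 1.0 0.5
```

If minor accuracy is much lower on the rows where the major prediction is wrong, that would be consistent with the error-propagation explanation.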