<a href="https://colab.research.google.com/github/timeowilliams/Responsible-ai/blob/main/HW_3_Create_A_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CORD-19 Predictive Model
## Responsible AI Assignment: Assignment 3 - Model Creation

This notebook builds a predictive model to classify whether a CORD-19 paper has a PDF available (`pdf_json_files` not null). The dataset has biases (e.g., journal concentration, recent papers), which is okay per the assignment. We’ll fetch the data from Kaggle, install dependencies, clean it, train a model, evaluate it with detailed metrics, check for proxy features, and document everything for replication.

## 1. Install Dependencies
Install all required packages to ensure reproducibility.

In [None]:
# Install dependencies from requirements.txt
pip install pandas numpy matplotlib seaborn scikit-learn langdetect scipy statsmodels kaggle

## 2. Setup and Data Loading
Download the dataset from Kaggle and load it. Ensures data is clean and accurate.

**Note**: You’ll need a Kaggle API key. See Section 8 for setup instructions.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from scipy.sparse import hstack
from statsmodels.stats.outliers_influence import variance_inflation_factor
import os
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
sns.set_palette('husl')

# Download dataset from Kaggle
os.environ['KAGGLE_USERNAME'] = 'timeowilliams'
os.environ['KAGGLE_KEY'] = 'a7f8c8f6ad2f54d8ce119b3d607e0833'
!kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge -f metadata.csv
!unzip -o metadata.csv.zip

# Load dataset
dtype_dict = {
    'sha': str,
    'doi': str,
    'pmcid': str,
    'pubmed_id': str,
    'who_covidence_id': str,
    'arxiv_id': str,
    'pdf_json_files': str,
    'pmc_json_files': str
}
df = pd.read_csv('metadata.csv', dtype=dtype_dict)

# Clean data
df = df.drop_duplicates(subset='sha', keep='first')
print("Dataset Shape after deduplication:", df.shape)
print("\nMissing Values:")
print(df.isnull().sum())

## 3. Data Preprocessing
Prepare features and target. Target: `has_pdf` (1 = PDF available, 0 = not). Features: journal, year, title text.

**Note**: If `has_pdf` is imbalanced (e.g., mostly 1s), the model might favor the majority class. We’ll check this below.

In [None]:
# Define target
df['has_pdf'] = (~df['pdf_json_files'].isna()).astype(int)
print("\nTarget Distribution (proportion of 0s and 1s):")
print(df['has_pdf'].value_counts(normalize=True))

# Feature engineering
df['publish_time'] = pd.to_datetime(df['publish_time'], errors='coerce')
df['year'] = df['publish_time'].dt.year.fillna(-1).astype(int)
df['journal'] = df['journal'].fillna('Unknown')
df['title'] = df['title'].fillna('')

# Select features and target
X = df[['journal', 'year', 'title']]
y = df['has_pdf']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTrain Shape:", X_train.shape, "Test Shape:", X_test.shape)

## 4. Model Pipeline
Build a pipeline to process features and train a Logistic Regression model.

**Why Logistic Regression?** It’s simple, interpretable, and good for binary classification (0 or 1).

**Note**: Encode `journal` on the full dataset to avoid unseen labels in test set.

In [None]:
# Encode journal on full dataset to handle all possible categories
le = LabelEncoder()
X['journal_encoded'] = le.fit_transform(X['journal'])  # Fit on full X before split
X_train['journal_encoded'] = le.transform(X_train['journal'])  # Transform train
X_test['journal_encoded'] = le.transform(X_test['journal'])    # Transform test

# Vectorize title
tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
X_train_tfidf = tfidf.fit_transform(X_train['title'])
X_test_tfidf = tfidf.transform(X_test['title'])

# Combine features
X_train_final = hstack([X_train_tfidf, X_train[['year', 'journal_encoded']].values])
X_test_final = hstack([X_test_tfidf, X_test[['year', 'journal_encoded']].values])

# Train model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_final, y_train)

# Predict
y_pred = model.predict(X_test_final)
print("\nModel Training Complete")

## 5. Model Evaluation
Evaluate performance with metrics and visuals.

**Quick Metrics Guide**:
- **Confusion Matrix**: Counts predictions:
  - True Positives (TP): Predicted 1, actual 1 (correctly predicted PDF)
  - True Negatives (TN): Predicted 0, actual 0 (correctly predicted no PDF)
  - False Positives (FP): Predicted 1, actual 0 (wrongly predicted PDF)
  - False Negatives (FN): Predicted 0, actual 1 (missed a PDF)
- **Precision**: TP / (TP + FP) - How often are positive predictions correct?
- **Recall**: TP / (TP + FN) - How many actual positives did we catch?
- **F1**: Balances precision and recall
- **FPR**: FP / (FP + TN) - Rate of wrong positives
- **FNR**: FN / (FN + TP) - Rate of missed positives
- **TPR**: Same as recall
- **TNR**: TN / (TN + FP) - Rate of correct negatives

In [None]:
# Basic metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))

# Extended metrics (force 2x2 matrix with labels=[0, 1])
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()  # Now guaranteed to unpack 4 values
fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
fnr = fn / (fn + tp) if (fn + tp) > 0 else 0
tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
tnr = tn / (tn + fp) if (tn + fp) > 0 else 0
ppv = tp / (tp + fp) if (tp + fp) > 0 else 0
print(f"\nExtended Metrics: FPR={fpr:.3f}, FNR={fnr:.3f}, TPR={tpr:.3f}, TNR={tnr:.3f}, PPV={ppv:.3f}")

# Group analysis by year
df_test = X_test.copy()
df_test['y_true'] = y_test
df_test['y_pred'] = y_pred
print("\nFPR and TPR by Year (checking equal opportunity):")
for year in df_test['year'].unique():
    subset = df_test[df_test['year'] == year]
    tn, fp, fn, tp = confusion_matrix(subset['y_true'], subset['y_pred'], labels=[0, 1]).ravel()
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
    print(f"Year {year}: FPR={fpr:.3f}, TPR={tpr:.3f}")

# Confusion matrix plot
plt.figure()
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

## 6. Proxy Features Detection
Check if features like `journal` or `year` act as proxies for sensitive attributes (e.g., prestige, region).

**What’s a Proxy?** A feature that indirectly hints at something sensitive (e.g., `journal` might reflect funding levels). We’ll use correlation and VIF (Variance Inflation Factor) to spot overlap.

In [None]:
# Feature importance
feature_names = tfidf.get_feature_names_out().tolist() + ['year', 'journal_encoded']
coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': model.coef_[0]})
print("\nTop 10 Positive Coefficients (Increase PDF Likelihood):")
print(coef_df.sort_values('Coefficient', ascending=False).head(10))
print("\nTop 10 Negative Coefficients (Decrease PDF Likelihood):")
print(coef_df.sort_values('Coefficient').head(10))

# Proxy check: Correlation
print("\nCorrelation Matrix (year vs. journal_encoded):")
print(X_train[['year', 'journal_encoded']].corr())

# Proxy check: VIF
vif_data = pd.DataFrame()
vif_data['Feature'] = ['year', 'journal_encoded']
vif_data['VIF'] = [variance_inflation_factor(X_train[['year', 'journal_encoded']].values, i)
                   for i in range(2)]
print("\nVariance Inflation Factor (VIF > 5 suggests overlap):")
print(vif_data)

## 7. Analysis and Discussion
### Model Performance
- **Accuracy**: 1.000 (100%) reflects that all 74,744 test instances have PDFs, and the model predicts “1” for all, perfectly matching the test set. This stems from the dataset’s extreme imbalance (99.9997% PDFs, 0.0003% no PDFs), with the single “no PDF” case in the training set.
- **Metrics**: TPR = 1.000 (caught all PDFs), FPR = 0.000, FNR = 0.000, TNR = 0.000, PPV = 1.000. Metrics for class 0 are undefined in practice (no `0`s in test), defaulting to 0 due to absence of TN, FP, or FN. The confusion matrix is [[0, 0], [0, 74744]], showing only TPs.
- **Group Fairness**: FPR and TPR by year are uniformly 0.000 and 1.000, respectively, as all test data is class 1. No meaningful fairness analysis is possible without class 0 representation.

### Proxy Insights
- **Coefficients**: Positive coefficients (e.g., `year`=0.003899, `journal_encoded`=0.003849, `virus`=2.052750e-06) align with higher PDF availability, tied to recent years and COVID-related journals/topics. Negative terms (e.g., `scientific`=-1.377613e-05) slightly reduce likelihood, but all effects are muted with 99.9997% “1”s—the model defaults to “1” regardless.
- **Correlation/VIF**: Correlation between `year` and `journal_encoded` is 0.006499 (negligible), and VIF values (4.041687) confirm no overlap, indicating no significant proxying between these features.

### Bias Reflection
The model is a trivial “always yes” predictor due to the dataset’s extreme skew (0.0003% no PDFs), reflecting CORD-19’s bias toward accessible, PDF-available research. With no “no PDF” cases in the test set, it achieves perfect accuracy but learns nothing beyond the majority class. This satisfies the assignment’s “bias is okay” rule, delivering predictable results, though it lacks practical value for distinguishing rare cases. Startups might prioritize such simplicity over nuance due to resource constraints.

**Future Idea**: Proxy analysis could explore Ivy League funding or military lab proximity—fun projects needing balanced datasets!

## 8. Reproducibility Notes
To replicate:
1. Clone repo: `git clone https://github.com/timeowilliams/Responsible-ai`
2. Run this notebook in Jupyter (installs dependencies and downloads `metadata.csv` automatically).
3. Set up Kaggle API:
   - Go to [Kaggle Account](https://www.kaggle.com/account), create API token (`kaggle.json`).
   - In Section 2, `KAGGLE_USERNAME` and `KAGGLE_KEY` use `timeowilliams` and its key. Replace with your own if forking.
   - Alternatively, run `kaggle config set -n username -v YOUR_USERNAME` and `kaggle config set -n key -v YOUR_KEY` in terminal.

Random state is 42 for consistency. Dataset source: [Kaggle CORD-19](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv).

## 9. Acknowledgments
This notebook was developed with significant assistance from Grok, an AI developed by xAI. Grok provided guidance on data preprocessing, model development, error troubleshooting, and documentation to ensure reproducibility and alignment with Responsible AI principles for this assignment.