# DBLP Citation Impact Prediction

**Team Member:** Julio Amaya  
**Task:** Predictive Modeling  
**Date:** December 4, 2025

## Overview

This notebook builds predictive models to forecast citation impact of research papers:

1. Load paper data with features (metadata + topics)
2. Engineer features for prediction
3. Time-split data (pre-2010 train, 2010 onward test)
4. Train classifiers (Logistic Regression, XGBoost, Random Forest)
5. Predict citation impact categories
6. Evaluate models (F1, AUC, accuracy)
7. Extract feature importance
8. Analyze high/low impact papers
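
As a rough illustration of step 3, the sketch below applies a year-based split to a small made-up table. The `toy` frame and its values are assumptions for demonstration only; the split year matches the one used later in this notebook.

```python
import pandas as pd

# Toy stand-in for the papers table; the values are illustrative only.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "year": [2005, 2009, 2010, 2013, 2008],
    "n_citation": [12, 3, 40, 1, 7],
})

SPLIT_YEAR = 2010

# Papers published strictly before the split year form the training set;
# papers from the split year onward are held out for testing.
train_mask = toy["year"] < SPLIT_YEAR
train = toy[train_mask]
test = toy[~train_mask]

print(f"train: {len(train)} papers, test: {len(test)} papers")
```

Splitting on time rather than at random keeps every test paper later than every training paper, which is closer to how a forecasting model would be used.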

---

In [None]:
# 1. Import Libraries
import warnings
from pathlib import Path

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler

# ==========================================
# CONFIGURATION & SETUP
# ==========================================

# Define Project Path (Uses current working directory)
PROJECT_ROOT = Path.cwd()

# Define Sub-directories
DATA_DIR = PROJECT_ROOT / "data/parquet"
FIGURES_DIR = PROJECT_ROOT / "figures"
MODELS_DIR = PROJECT_ROOT / "models"
RESULTS_DIR = PROJECT_ROOT / "results"

# Create directories if they don't exist
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# Plotting settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11
warnings.filterwarnings('ignore')

print(f"✅ Setup Complete. Working directory: {PROJECT_ROOT}")

In [None]:
# ==========================================
# STEP 1: Load Data
# ==========================================

print("Loading papers from partitioned Parquet files...")
# Reads every part-*.parquet file under data/parquet/papers/ (the partitioned output Truc generated)
papers = pd.read_parquet(DATA_DIR / "papers")

print(f"Loaded {len(papers):,} papers")
print(f"Year range: {papers['year'].min():.0f} – {papers['year'].max():.0f}")
print(f"Papers with abstract: {papers['abstract'].notna().sum():,} ({papers['abstract'].notna().mean():.1%})")

# Load topic assignments (written by notebook 06)
# NOTE: this file name must match the one notebook 06 actually writes
topics_path = PROJECT_ROOT / "output" / "paper_with_topics.csv"

if not topics_path.exists():
    raise FileNotFoundError(
        f"Topic file not found at {topics_path}\n"
        "Please run notebook 06_topic_modeling.ipynb first and make sure it saves:\n"
        f"    df[['id', 'year', 'topic_id', 'topic_label']].to_csv('output/{topics_path.name}', index=False)"
    )

paper_topics = pd.read_csv(topics_path)

print(f"Loaded {len(paper_topics):,} topic assignments")

# Merge with main papers table
paper_topics = paper_topics.rename(columns={'topic_id': 'topic'})  # keep consistent naming
df = papers.merge(paper_topics[['id', 'topic']], on='id', how='left')

print(f"\nMerge complete:")
print(f"   • Total papers:           {len(df):,}")
print(f"   • Papers with topics:     {df['topic'].notna().sum():,} ({df['topic'].notna().mean():.1%})")
print(f"   • Papers without topics:  {df['topic'].isna().sum():,} (will be excluded later)")
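
A left merge on `id` keeps every paper, but it silently multiplies rows if the topic file repeats an `id`. The sketch below shows one way to guard against that, using two made-up frames (`papers_toy`, `topics_toy`) that are assumptions and not the project data.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
papers_toy = pd.DataFrame({"id": ["p1", "p2", "p3"], "year": [2001, 2011, 2012]})
topics_toy = pd.DataFrame({"id": ["p1", "p3", "p3"], "topic": [4, 7, 7]})

# Collapse repeated ids first so the left merge cannot multiply paper rows.
topics_unique = topics_toy.drop_duplicates(subset="id")

merged = papers_toy.merge(
    topics_unique,
    on="id",
    how="left",
    validate="one_to_one",  # raises MergeError if either side repeats an id
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Papers with no topic assignment show up as 'left_only'.
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"{len(merged)} rows after merge, {unmatched} without a topic")
```

With `validate` set, a duplicated `id` fails loudly at merge time instead of inflating the row count later in the pipeline.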

In [None]:
# ==========================================
# STEP 2: Feature Engineering
# ==========================================

# Filter papers
df_features = df[
    (df['year'].notna()) &
    (df['n_citation'].notna()) &
    (df['topic'].notna())
].copy()

# --- BINARY CLASSIFICATION (High vs Low) ---
# Label each paper relative to the median citation count of its publication year.
# Papers strictly above the median are High Impact (1); papers at or below it are
# Low Impact (0), so ties at the median can leave the classes slightly unbalanced.
print("Calculating relative impact (Binary: Above/Below Median)...")
yearly_median = df_features.groupby('year')['n_citation'].transform('median')
df_features['citation_impact'] = (df_features['n_citation'] > yearly_median).astype(int)

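# Add other features
# NOTE: paper_age uses a hard-coded reference year (2017) and is not included in the feature matrix below
df_features['paper_age'] = 2017 - df_features['year']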

if 'author_count' not in df_features.columns:
    df_features['author_count'] = df_features['authors'].apply(lambda x: len(x) if x is not None else 1)

if 'ref_count' not in df_features.columns:
    df_features['ref_count'] = df_features['references'].apply(lambda x: len(x) if x is not None else 0)

print(f"\nTarget distribution:")
print(df_features['citation_impact'].value_counts().sort_index())

# --- ENCODING ---

# Encode venue (top 50 venues + 'OTHER')
# NOTE: the ranking uses every year in df_features, including the test period
top_venues = df_features['venue'].value_counts().head(50).index
df_features['venue_encoded'] = df_features['venue'].where(
    df_features['venue'].isin(top_venues), 'OTHER'
)

print("Encoding venues and topics...")
venue_dummies = pd.get_dummies(df_features['venue_encoded'], prefix='venue')
topic_dummies = pd.get_dummies(df_features['topic'], prefix='topic')

# Select Numeric features
feature_cols = ['author_count', 'ref_count'] 
X_base = df_features[feature_cols].fillna(0).copy()

# Combine All Features
X = pd.concat([X_base, venue_dummies, topic_dummies], axis=1)
y = df_features['citation_impact']

print(f"Feature matrix shape: {X.shape}")
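
The cell above ranks venues on all years before the data is split. A stricter variant, sketched below under assumed toy data, builds the venue vocabulary from training-period rows only so that nothing from the test period influences the encoding. `TOP_K = 2` and the `toy` frame are illustrative assumptions; the notebook itself keeps the top 50.

```python
import pandas as pd

# Made-up rows; the venue names are only placeholders.
toy = pd.DataFrame({
    "year": [2005, 2006, 2007, 2008, 2009, 2011, 2012],
    "venue": ["ICML", "ICML", "KDD", "KDD", "VLDB", "NeurIPS", "ICML"],
})

SPLIT_YEAR = 2010
TOP_K = 2  # hypothetical small value for the toy data

is_train = toy["year"] < SPLIT_YEAR

# Rank venues using training rows only.
top_venues = toy.loc[is_train, "venue"].value_counts().head(TOP_K).index

# Any venue outside the training vocabulary collapses into OTHER,
# including venues that first appear in the test period.
venue_encoded = toy["venue"].where(toy["venue"].isin(top_venues), "OTHER")
venue_dummies = pd.get_dummies(venue_encoded, prefix="venue")

print(venue_dummies.columns.tolist())
```

Whether this changes the results depends on how much the venue mix shifts after the split year; it is offered as a check, not as a correction to the reported numbers.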

In [None]:
# ==========================================
# STEP 3: Model Training
# ==========================================

# Time-based split
SPLIT_YEAR = 2010

train_mask = df_features['year'] < SPLIT_YEAR
test_mask = df_features['year'] >= SPLIT_YEAR

X_train = X[train_mask]
X_test = X[test_mask]
y_train = y[train_mask]
y_test = y[test_mask]

print(f"Training set (pre-{SPLIT_YEAR}): {len(X_train):,} papers")
print(f"Test set ({SPLIT_YEAR}+): {len(X_test):,} papers")

# Scale numerical features
scaler = StandardScaler()
numeric_cols = ['author_count', 'ref_count']

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test_scaled[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(f"\n✓ Features scaled")

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1), # n_jobs=-1 uses all local cores
    'XGBoost': xgb.XGBClassifier(n_estimators=100, max_depth=6, random_state=42, n_jobs=-1, eval_metric='logloss')
}

results = {}
trained_models = {}

print("Starting Model Training...")

for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train_scaled, y_train)
    trained_models[name] = model

    # Predictions
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1] 

    # Metrics
    f1 = f1_score(y_test, y_pred, average='weighted')
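    try:
        auc = roc_auc_score(y_test, y_pred_proba)
    except ValueError:
        # AUC is undefined when y_test contains a single class;
        # record NaN rather than a misleading 0.0
        auc = float('nan')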
    accuracy = (y_pred == y_test).mean()

    results[name] = {
        'f1_score': f1,
        'auc': auc,
        'accuracy': accuracy,
        'predictions': y_pred
    }

    print(f"  F1 Score: {f1:.4f}")
    print(f"  AUC: {auc:.4f}")
    print(f"  Accuracy: {accuracy:.4f}")

print("\n✓ All models trained")

# Model comparison DataFrame
comparison_df = pd.DataFrame({
    'Model': list(results.keys()),
    'F1 Score': [results[m]['f1_score'] for m in results.keys()],
    'AUC': [results[m]['auc'] for m in results.keys()],
    'Accuracy': [results[m]['accuracy'] for m in results.keys()]
})

print("Model Performance Comparison:")
print("="*60)
print(comparison_df.to_string(index=False))

# Save results
comparison_df.to_csv(RESULTS_DIR / 'model_comparison.csv', index=False)
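
The three reported metrics measure different things: accuracy and weighted F1 are computed from hard predictions, while AUC is computed from predicted probabilities and reflects ranking quality. The sketch below uses made-up labels and scores to show the distinction; the 0.5 threshold is an assumption that mirrors the default behaviour of `predict` for these classifiers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Made-up ground truth and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_proba = np.array([0.2, 0.6, 0.7, 0.4, 0.9, 0.1])

# Hard predictions at an assumed 0.5 threshold.
y_pred = (y_proba >= 0.5).astype(int)

# Threshold-dependent metrics use the hard predictions.
accuracy = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

# AUC uses the raw probabilities and ignores the threshold.
auc = roc_auc_score(y_true, y_proba)

print(f"accuracy={accuracy:.3f}  weighted F1={weighted_f1:.3f}  AUC={auc:.3f}")
```

A model can rank papers well (high AUC) and still score lower on accuracy or F1 if the 0.5 threshold is a poor cut-off for the test period.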

In [None]:
# ==========================================
# STEP 4: Visualization & Analysis
# ==========================================

# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for i, metric in enumerate(['F1 Score', 'AUC', 'Accuracy']):
    axes[i].bar(comparison_df['Model'], comparison_df[metric], color=['steelblue', 'coral', 'seagreen'], alpha=0.8)
    axes[i].set_ylabel(metric)
    axes[i].set_title(f'{metric} by Model')
    axes[i].set_ylim([0, 1.0])
    axes[i].grid(axis='y', alpha=0.3)
    axes[i].tick_params(axis='x', rotation=15)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'fig7_model_comparison.png', dpi=300, bbox_inches='tight')
print(f"Saved figure to {FIGURES_DIR / 'fig7_model_comparison.png'}")
# plt.show() # Uncomment if you want to see the plot popup

# Get feature importance from Random Forest
rf_model = trained_models['Random Forest']
feature_importance = pd.DataFrame({
    'feature': X_train_scaled.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

# Top 15 features
top_features = feature_importance.head(15)

print("\nTop 15 Most Important Features:")
print("="*60)
print(top_features.to_string(index=False))

# Visualize Feature Importance
fig, ax = plt.subplots(figsize=(12, 6))
ax.barh(range(len(top_features)), top_features['importance'], color='mediumseagreen', alpha=0.8)
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['feature'], fontsize=10)
ax.set_xlabel('Importance')
ax.set_title('Top 15 Feature Importances (Random Forest)')
ax.invert_yaxis()
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'fig8_feature_importance.png', dpi=300, bbox_inches='tight')
print(f"Saved figure to {FIGURES_DIR / 'fig8_feature_importance.png'}")

# Save feature importance
feature_importance.to_csv(RESULTS_DIR / 'feature_importance.csv', index=False)

# Add predictions to test data
df_test = df_features[test_mask].copy()
df_test['predicted_impact'] = results['Random Forest']['predictions']

# --- CASE STUDIES ---

# High impact papers (Correctly Predicted as 1)
high_impact_correct = df_test[
    (df_test['citation_impact'] == 1) &
    (df_test['predicted_impact'] == 1)
].nlargest(5, 'n_citation')

print("\nHigh Impact Papers (Correctly Predicted):")
print("="*80)
display_cols = ['title', 'year', 'venue', 'n_citation', 'author_count', 'topic']
print(high_impact_correct[display_cols].to_string())

# Low impact papers (Correctly Predicted as 0)
low_impact_correct = df_test[
    (df_test['citation_impact'] == 0) &
    (df_test['predicted_impact'] == 0)
].head(5)

print("\nLow Impact Papers (Correctly Predicted):")
print("="*80)
print(low_impact_correct[display_cols].to_string())

# Save case studies
high_impact_correct.to_csv(RESULTS_DIR / 'high_impact_cases.csv', index=False)
low_impact_correct.to_csv(RESULTS_DIR / 'low_impact_cases.csv', index=False)

# Save best model
best_model_name = comparison_df.loc[comparison_df['F1 Score'].idxmax(), 'Model']
best_model = trained_models[best_model_name]

joblib.dump(best_model, MODELS_DIR / 'citation_predictor.pkl')
joblib.dump(scaler, MODELS_DIR / 'feature_scaler.pkl')

print(f"\n✓ Best model ({best_model_name}) saved to {MODELS_DIR}")

# Final summary
print("\n" + "="*80)
print("PREDICTIVE MODELING SUMMARY")
print("="*80)
print(f"Task: Citation Impact Prediction (Binary: High vs Low)")
print(f"Train Period: Pre-{SPLIT_YEAR} ({len(X_train):,} papers)")
print(f"Test Period: {SPLIT_YEAR}+ ({len(X_test):,} papers)")
print(f"Features Used: {X.shape[1]} (metadata + topic_dummies + venue_dummies)")
print(f"\nBest Model: {best_model_name}")
print(f"  F1 Score: {comparison_df.loc[comparison_df['F1 Score'].idxmax(), 'F1 Score']:.4f}")
print(f"  AUC: {comparison_df.loc[comparison_df['F1 Score'].idxmax(), 'AUC']:.4f}")
print(f"\nOutput Locations:")
print(f"  - Models: {MODELS_DIR}")
print(f"  - Results: {RESULTS_DIR}")
print(f"  - Figures: {FIGURES_DIR}")
print("="*80)

## Results Summary

This notebook carries out the following steps:

1. **Data Loading**: Loaded papers dataset and merged with topic assignments from NLP modeling
2. **Feature Engineering**: Created binary classification target (high vs low impact based on median citations per year) and encoded venue/topic features
3. **Time-Based Split**: Split data at 2010 (training on pre-2010, testing on 2010+)
4. **Model Training**: Trained three classifiers (Logistic Regression, Random Forest, XGBoost) after standardizing the numeric features (`author_count`, `ref_count`)
5. **Model Evaluation**: Compared models using weighted F1 score, AUC, and accuracy
6. **Feature Analysis**: Extracted and visualized top 15 most important features from Random Forest
7. **Case Studies**: Identified correctly predicted high and low impact papers
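
To make the target definition in step 2 concrete, the sketch below labels a handful of made-up papers against the median citation count of their own publication year. The `toy` values are assumptions chosen to show that the same citation count can fall on different sides of the threshold in different years.

```python
import pandas as pd

# Made-up citation counts for two publication years.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000, 2015, 2015, 2015],
    "n_citation": [100, 40, 10, 2, 5, 1, 0],
})

# Broadcast each year's median back onto its rows.
yearly_median = toy.groupby("year")["n_citation"].transform("median")

# 1 = strictly above the year's median, 0 = at or below it.
toy["citation_impact"] = (toy["n_citation"] > yearly_median).astype(int)

print(toy)
```

Here a 2015 paper with 5 citations is labelled high impact while a 2000 paper with 10 citations is labelled low impact, because older papers have had more time to accumulate citations. A paper sitting exactly on the median is labelled low impact.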

Result tables are saved to `results/`, the best model and the scaler to `models/`, and figures to `figures/`.
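
Reusing a saved model on new papers requires the same feature columns, in the same order, that the model saw during training. The sketch below shows one possible way to handle that by storing the column list alongside the model and scaler. It is a self-contained toy under stated assumptions: the bundle layout, the file name `citation_predictor_bundle.pkl`, and the tiny training frame are hypothetical and are not what this notebook writes.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

numeric_cols = ["author_count", "ref_count"]

# Tiny made-up training frame with two numeric columns and one dummy column.
X_train = pd.DataFrame({
    "author_count": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "ref_count": [5.0, 10.0, 15.0, 20.0, 25.0, 30.0],
    "venue_OTHER": [1, 0, 1, 0, 1, 0],
})
y_train = np.array([0, 0, 0, 1, 1, 1])

# Scale only the numeric columns, as in the training cell.
scaler = StandardScaler().fit(X_train[numeric_cols])
X_scaled = X_train.copy()
X_scaled[numeric_cols] = scaler.transform(X_train[numeric_cols])
model = LogisticRegression(max_iter=1000, random_state=42).fit(X_scaled, y_train)

with tempfile.TemporaryDirectory() as tmp:
    bundle_path = Path(tmp) / "citation_predictor_bundle.pkl"  # hypothetical name
    joblib.dump(
        {
            "model": model,
            "scaler": scaler,
            "feature_columns": list(X_train.columns),
            "numeric_cols": numeric_cols,
        },
        bundle_path,
    )
    bundle = joblib.load(bundle_path)

# A new paper that lacks the dummy columns entirely.
new_paper = pd.DataFrame({"author_count": [5.0], "ref_count": [28.0]})

# reindex restores the training column order and fills missing dummies with 0.
new_X = new_paper.reindex(columns=bundle["feature_columns"], fill_value=0)
new_X[bundle["numeric_cols"]] = bundle["scaler"].transform(new_X[bundle["numeric_cols"]])

prediction = bundle["model"].predict(new_X)
print(prediction)
```

Keeping the column list with the model avoids a mismatch when the one-hot encoding of a new batch produces a different set of venue or topic columns.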