<h1 style="color: 	#80B1D3;"><strong>PAPER LENS PROJECT</strong></h1>

<h2 style="color: 	#365F93;"><strong>Overview</strong></h2>

**Author:** Xavi

**Date:** June 2025 

**Environment:** Jupyter Notebook · Python 

---

<h3 style="color: 	#365F93;">🎯 Objective</h3>

This project aims to develop a machine learning model capable of **predicting the retraction risk of a scientific paper** at the moment of its publication. The core idea is to identify subtle, quantifiable patterns in a paper's metadata, authorship, and content that correlate with a higher likelihood of future retraction.

By training on a comprehensive dataset of previously retracted articles and a carefully constructed control group of non-retracted papers, the model will learn to distinguish high-risk publications from sound research, providing a valuable tool for researchers, editors, and the public.

---

<h3 style="color: 	#365F93;">📦 Steps Covered in This Notebook</h3>

*Steps 0-4 are covered in this notebook. Steps 5-8 are covered in the second part of this project, `02_Analysis_and_Modeling.ipynb`.*

0. **Libraries and Functions**
1. **Data Collection**
2. **Data Cleaning & Formatting**
3. **Obtain Selected Metadata from the OpenAlex API for Retracted Papers**
4. **Obtain Selected Metadata from the OpenAlex API for Non-Retracted Papers**<br><br>


5. **EDA**
6. **Feature Engineering**
7. **Applying Machine Learning Models**
8. **Conclusions**

---

<h3 style="color: 	#365F93;">📁 Dataset</h3>

- **Primary Source**: Retraction Watch Database, providing the ground truth for retracted articles.
  - *Git Repository: https://gitlab.com/crossref/retraction-watch-data.git*
- **Enrichment Source**: OpenAlex API, used to fetch detailed and consistent metadata for all papers.
- **Format**: Initial data from `.csv`, enriched data processed into `.jsonl` and final datasets saved as `.csv`.
- **Records**: ~55k retracted papers (the "cases") and a matched control group of ~55k non-retracted papers.<br><br>
- **Target Variable**:
  - `is_retracted`: A binary flag (1 for retracted, 0 for non-retracted).<br><br>
- **Feature Variables (examples)**:
  - **Bibliographic**: `publication_year`, `article_type`, `is_open_access`, `n_references`.
  - **Authorship**: `author_count`, `country_count`, `first_author_country`, `is_international_collaboration`.
  - **Venue**: `publisher`, `journal_name`.
  - **Content (NLP)**:  `title_length`, `abstract_length`, `n_concepts`, `top_concept_level`.
  - **Impact**: `citations_in_first_2_years`.

---

<h2 style="color: 	#365F93;"><strong>Libraries and Functions</strong></h2>

In [None]:
# Run in the terminal: pip install -r requirements.txt

In [1]:
# Install the required packages if they were not already installed through requirements.txt
# !pip install pandas requests sentence-transformers tqdm --quiet

In [2]:
# Import the libraries needed for data collection, cleaning, and API enrichment
import time
import json
import random
import requests
import pandas as pd
from tqdm.notebook import tqdm  # notebook-friendly progress bar (the plain tqdm import was shadowed by this one)
from typing import Optional, List, Dict, Tuple, Any

In [3]:
# Importing custom utility functions from the src folder
import os
import sys
sys.path.append('../src')
from feature_extractor import get_work_features_from_doi
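
The helper `get_work_features_from_doi` lives in `src/feature_extractor.py`, which is not shown in this notebook. To give a sense of what such an extractor might compute, here is a minimal sketch that derives a few of the features listed in the Dataset section from an OpenAlex-style work record. The function name `sketch_features_from_work` and the sample record are hypothetical. The field names (`publication_year`, `type`, `open_access.is_oa`, `authorships`, `referenced_works_count`) are assumed to follow the OpenAlex Works schema, and the real module may differ.

```python
from typing import Any, Dict, Optional


def sketch_features_from_work(work: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical extractor: derive a few features from an OpenAlex-style work dict."""
    authorships = work.get("authorships") or []

    # Collect the distinct country codes across all authors' institutions
    countries = {
        inst.get("country_code")
        for authorship in authorships
        for inst in (authorship.get("institutions") or [])
        if inst.get("country_code")
    }

    # Country of the first listed author's first institution, if any
    first_author_country: Optional[str] = None
    if authorships:
        first_insts = authorships[0].get("institutions") or []
        if first_insts:
            first_author_country = first_insts[0].get("country_code")

    title = work.get("title") or ""
    return {
        "publication_year": work.get("publication_year"),
        "article_type": work.get("type"),
        "is_open_access": bool((work.get("open_access") or {}).get("is_oa", False)),
        "author_count": len(authorships),
        "country_count": len(countries),
        "first_author_country": first_author_country,
        "is_international_collaboration": len(countries) > 1,
        "title_length": len(title),
        "n_references": work.get("referenced_works_count", 0),
    }


# Made-up record for illustration only (not real project data)
sample_work = {
    "title": "An example title",
    "publication_year": 2020,
    "type": "article",
    "open_access": {"is_oa": True},
    "referenced_works_count": 12,
    "authorships": [
        {"institutions": [{"country_code": "CN"}]},
        {"institutions": [{"country_code": "US"}, {"country_code": "CN"}]},
    ],
}
features = sketch_features_from_work(sample_work)
print(features)
```

Keeping the parsing step separate from the HTTP request, as in this sketch, makes the feature logic testable without calling the API.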

<h2 style="color: 	#365F93;"><strong>1. Data Collection</strong></h2>

In [None]:
# Clone the scientific publication retraction dataset and store it in 'data/raw'
# !git clone https://gitlab.com/crossref/retraction-watch-data.git ../data/raw/retraction-watch-data

Cloning into '../data/raw/retraction-watch-data'...


In [3]:
# Read the CSV file from the 'raw/retraction-watch-data' folder
df_retraction = pd.read_csv('../data/raw/retraction-watch-data/retraction_watch.csv')
df_retraction

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
0,65194,Fuzzy regression model for forecasting impact ...,(B/T) Business - Economics;(B/T) Data Science;,"School of Economics and Management, Tongji Uni...",Journal of Intelligent & Fuzzy Systems,IOS Press (bought by Sage November 2023),China,Lining Guo;Tian Yuan,https://retractionwatch.com/2025/01/31/sage-jo...,Research Article;,1/15/2025 0:00,10.3233/JIFS-219434,0.0,2/1/2021 0:00,10.3233/jifs-189671,0.0,Retraction,+Computer-Aided Content or Computer-Generated ...,No,ACM information page for the original article:...
1,65193,Fuzzy multi-attribute decision making method b...,(B/T) Data Science;,"School of Management and Engineering, Capital ...",Journal of Intelligent & Fuzzy Systems,IOS Press (bought by Sage November 2023),China,Jing Yang;Wei Su,https://retractionwatch.com/2025/01/31/sage-jo...,Research Article;,1/15/2025 0:00,10.3233/JIFS-219434,0.0,6/5/2022 0:00,10.3233/jifs-220534,0.0,Retraction,+Computer-Aided Content or Computer-Generated ...,No,ACM information page for the original article:...
2,65192,Fuzzy modelling approach and soft computing me...,(B/T) Data Science;(HSC) Medicine - Geriatric;...,"Department of Mathematics, The Arab Academic C...",Journal of Intelligent & Fuzzy Systems,IOS Press (bought by Sage November 2023),Bahrain;India;Israel;Malaysia;Mexico;Peru,Yousef Methkal Abd Algani;K Suresh Babu;Shehab...,https://retractionwatch.com/2025/01/31/sage-jo...,Research Article;,1/15/2025 0:00,10.3233/JIFS-219434,0.0,12/12/2023 0:00,10.3233/jifs-233695,0.0,Retraction,+Computer-Aided Content or Computer-Generated ...,No,
3,65191,Fuzzy logical system for personalized vocal mu...,(B/T) Data Science;(HUM) Arts - Music;(SOC) Ed...,"School of Educational Sciences, Hui Zhou Unive...",Journal of Intelligent & Fuzzy Systems,IOS Press (bought by Sage November 2023),China,Yu Wang,https://retractionwatch.com/2025/01/31/sage-jo...,Research Article;,1/15/2025 0:00,10.3233/JIFS-219434,0.0,3/11/2024 0:00,10.3233/jifs-236248,0.0,Retraction,+Computer-Aided Content or Computer-Generated ...,No,See also: https://pubpeer.com/publications/521...
4,65181,Fuzzy assessment and improvement path of green...,(B/T) Business - Economics;(B/T) International...,"Institute of Finance, University of Internatio...",Journal of Intelligent & Fuzzy Systems,IOS Press (bought by Sage November 2023),China,Yu Liu,https://retractionwatch.com/2025/01/31/sage-jo...,Research Article;,1/15/2025 0:00,10.3233/JIFS-219434,0.0,6/5/2023 0:00,10.3233/jifs-223257,0.0,Retraction,+Computer-Aided Content or Computer-Generated ...,No,ACM information page for the original article:...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64631,5,Effect of Perindopril on Large Artery Stiffnes...,(BLS) Biochemistry;(HSC) Medicine - Cardiology...,"Alfred and Baker Medical Unit, Baker Heart Res...",JAMA: Journal of the American Medical Association,American Medical Association,Australia,Anna A Ahimastos;Anuradha Aggarwal;Kellie M D'...,http://retractionwatch.com/2015/11/23/jama-ret...,Clinical Study;Research Article;,12/22/2015 0:00,10.1001/jama.2015.16678,26594834.0,10/3/2007 0:00,10.1001/jama.298.13.1539,1791149.0,Retraction,+Falsification/Fabrication of Data;+Investigat...,No,
64632,4,MtvR is a global small noncoding regulatory RN...,(BLS) Biology - Cellular;(BLS) Genetics;(BLS) ...,Institute for Biotechnology and Bioengineering...,Journal of Bacteriology,American Society for Microbiology,Portugal,Christian G Ramos;André M Grilo;Paulo J P da C...,http://retractionwatch.com/2014/11/03/post-doc...,Research Article;,11/1/2014 0:00,10.1128/JB.02299-14,25319527.0,5/31/2013 0:00,10.1128/JB.00242-13,2372964.0,Retraction,+Duplication of/in Image;+Manipulation of Images;,No,exact date of retraction unknown
64633,3,"The second RNA chaperone, Hfq2, is also requir...",(BLS) Biology - Cellular;(BLS) Genetics;(BLS) ...,IBB—Institute for Biotechnology and Bioenginee...,Journal of Bacteriology,American Society for Microbiology,Portugal,Christian G Ramos;Sílvia A Sousa;André M Grilo...,http://retractionwatch.com/2014/10/17/this-sit...,Research Article;,11/1/2014 0:00,10.1128/JB.02242-14,25319526.0,1/28/2011 0:00,10.1128/JB.01375-10,21278292.0,Retraction,+Duplication of/in Image;+Error in Image;,No,
64634,2,Regulation of Wnt/beta-catenin pathway by cPLA...,(BLS) Biology - Cancer;(BLS) Biology - Cellula...,"Department of Pathology, University of Pittsbu...",Journal of Cellular Biochemistry,Wiley,United States,Chang Han;Kyu Lim;Lihong Xu;Guiying Li;Tong Wu,http://retractionwatch.com/2015/02/09/figure-d...,Research Article;,1/29/2015 0:00,10.1002/jcb.25020,25767853.0,7/17/2008 0:00,10.1002/jcb.21852,18636547.0,Retraction,+Duplication of/in Image;+Falsification/Fabric...,No,


<h2 style="color: 	#365F93;"><strong>2. Data Cleaning & Formatting</strong></h2>

In [4]:
# Standardize column names: lowercase, replace spaces with underscores, and split run-together names for readability
df_retraction.columns = df_retraction.columns.str.lower().str.replace(' ', '_') \
    .str.replace('articletype', 'article_type') \
    .str.replace('retractiondate', 'retraction_date') \
    .str.replace('retractiondoi', 'retraction_doi') \
    .str.replace('retractionpubmedid', 'retraction_pub_med_id') \
    .str.replace('originalpaperdate', 'original_paper_date') \
    .str.replace('originalpaperdoi', 'original_paper_doi') \
    .str.replace('originalpaperpubmedid', 'original_paper_pub_med_id') \
    .str.replace('retractionnature', 'retraction_nature')

In [5]:
df_retraction.columns

Index(['record_id', 'title', 'subject', 'institution', 'journal', 'publisher',
       'country', 'author', 'urls', 'article_type', 'retraction_date',
       'retraction_doi', 'retraction_pub_med_id', 'original_paper_date',
       'original_paper_doi', 'original_paper_pub_med_id', 'retraction_nature',
       'reason', 'paywalled', 'notes'],
      dtype='object')
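
The chained `.str.replace` calls above work, but they need one entry per run-together name. As an alternative sketch (not the approach used in this notebook), a regular expression can insert underscores at CamelCase boundaries before lowercasing. The helper name `to_snake_case` is hypothetical, and it would have to be applied to the original column names, before they are lowercased.

```python
import re

# Boundary between a lowercase letter/digit and an uppercase letter ("ArticleType"),
# or between an acronym and a following capitalized word ("DOIPrefix" -> "DOI_Prefix")
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def to_snake_case(name: str) -> str:
    """Hypothetical helper: convert a raw column name to snake_case."""
    name = name.strip().replace(" ", "_")
    return _CAMEL_BOUNDARY.sub("_", name).lower()


raw_columns = ["Record ID", "ArticleType", "RetractionPubMedID", "OriginalPaperDOI", "URLS"]
print([to_snake_case(c) for c in raw_columns])
```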

In [6]:
# Normalize DOI column: strip, lowercase
df_retraction['original_paper_doi'] = df_retraction['original_paper_doi'].astype(str).str.strip().str.lower()

In [7]:
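# Remove invalid DOI rows. After astype(str), missing values usually become the string 'nan',
# so the isin() check does most of the work; isna() is kept as a safeguard.
invalids = df_retraction['original_paper_doi'].isna() | df_retraction['original_paper_doi'].isin(['', 'nan', 'none', 'unavailable'])
print(f"Invalid DOI rows to be removed: {invalids.sum()}")
df_retraction = df_retraction.loc[~invalids].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning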

Invalid DOI rows to be removed: 5931


In [8]:
# Remove duplicates based on 'original_paper_doi' column
before = len(df_retraction)
df_retraction = df_retraction.drop_duplicates(subset=['original_paper_doi'])
after = len(df_retraction)
print(f"Rows before drop_duplicates: {before}, after: {after}, removed: {before - after}")

Rows before drop_duplicates: 58705, after: 56046, removed: 2659


In [9]:
# Confirm rows == unique DOIs
unique_count = df_retraction['original_paper_doi'].nunique(dropna=True)
print(f"Final rows: {len(df_retraction)}, unique DOIs: {unique_count}")

Final rows: 56046, unique DOIs: 56046
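
To make the cleaning logic above concrete, here is a small sketch on invented DOIs. It shows how stripping whitespace and lowercasing lets `drop_duplicates` catch near-identical DOIs, and why both the `isna()` check and the placeholder list are useful: depending on the pandas version, a missing value may become a string such as `'nan'` or `'none'` after `astype(str)`, or it may stay missing.

```python
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({"original_paper_doi": [
    " 10.1000/ABC ",   # stray whitespace and uppercase
    "10.1000/abc",     # same DOI once normalized
    None,              # missing value
    "Unavailable",     # placeholder text instead of a DOI
    "10.1000/xyz",
]})

# Same normalization as in the notebook: cast to string, strip, lowercase
toy["original_paper_doi"] = toy["original_paper_doi"].astype(str).str.strip().str.lower()

# Flag rows that are still missing or that hold a placeholder string
col = toy["original_paper_doi"]
invalid = col.isna() | col.isin(["", "nan", "none", "unavailable"])

# Drop invalid rows, then keep the first occurrence of each DOI
clean = toy.loc[~invalid].drop_duplicates(subset=["original_paper_doi"]).copy()
print(clean["original_paper_doi"].tolist())
```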


<details>
  <summary><strong>📌Information of One of the Removed Rows (An Interesting Curiosity)</strong></summary>

{'record_id': 18930, 'title': 'Treatise upon Electricity', 'subject': '(PHY) Energy;', 'institution': 'unavailable', 'journal': 'Philosophical Transactions', 'publisher': 'Royal Society Publishing', 'country': 'United Kingdom', 'author': 'Benjamin Wilson', 'urls': 'http://retractionwatch.com/2012/02/27/the-first-ever-english-language-retraction-1756/', 'article_type': 'Technical Report/White Paper;', 'retraction_date': '6/24/1756 12:00:00 AM', 'retraction_doi': '10.1098/rstl.1755.0107', 'retraction_pub_med_id': 0.0, 'original_paper_date': '1/1/1753 12:00:00 AM', 'original_paper_doi': 'unavailable', 'original_paper_pub_med_id': 0.0, 'retraction_nature': 'Retraction', 'reason': '+Error in Text;', 'paywalled': 'No', 'notes': 'Original Paper date may be in 1750;'}
</details>

In [10]:
# Collect the cleaned DOIs (already lowercase, non-null, and deduplicated above) into a set
retract_dois = set(df_retraction['original_paper_doi'])
print(len(retract_dois))

56046


In [11]:
# Convert 'original_paper_date' to datetime; invalid dates are coerced to NaT (Not a Time)
df_retraction['original_paper_date'] = pd.to_datetime(df_retraction['original_paper_date'], errors='coerce')

In [12]:
# Date check
n_invalid_dates = df_retraction['original_paper_date'].isna().sum()
print(f"Invalid dates after to_datetime: {n_invalid_dates}")

Invalid dates after to_datetime: 0


In [13]:
# Verify conversion
print(f"Valid dates in 'original_paper_date': {df_retraction['original_paper_date'].notna().sum()}")

Valid dates in 'original_paper_date': 56046


In [14]:
# Add a new column with the year from 'original_paper_date'
df_retraction['year'] = df_retraction['original_paper_date'].dt.year

In [15]:
# Count retracted papers per year and convert to dictionary
counts_by_year = df_retraction['year'].value_counts().sort_index().to_dict()
print(counts_by_year)

{1940: 1, 1942: 1, 1943: 1, 1946: 1, 1955: 1, 1956: 1, 1958: 1, 1959: 5, 1960: 9, 1961: 2, 1962: 5, 1963: 1, 1964: 2, 1965: 1, 1966: 2, 1967: 6, 1968: 1, 1969: 1, 1970: 8, 1971: 5, 1972: 3, 1973: 2, 1974: 3, 1975: 7, 1976: 7, 1977: 27, 1978: 10, 1979: 19, 1980: 14, 1981: 18, 1982: 7, 1983: 14, 1984: 13, 1985: 13, 1986: 25, 1987: 16, 1988: 17, 1989: 35, 1990: 47, 1991: 39, 1992: 32, 1993: 41, 1994: 45, 1995: 73, 1996: 80, 1997: 78, 1998: 119, 1999: 144, 2000: 177, 2001: 243, 2002: 323, 2003: 336, 2004: 438, 2005: 492, 2006: 615, 2007: 817, 2008: 829, 2009: 1700, 2010: 3229, 2011: 5150, 2012: 1479, 2013: 1696, 2014: 1815, 2015: 1905, 2016: 1849, 2017: 2131, 2018: 3058, 2019: 3092, 2020: 3909, 2021: 6676, 2022: 10325, 2023: 1864, 2024: 882, 2025: 13}


In [16]:
# Save the cleaned retraction data to a CSV file in the processed folder
df_retraction.to_csv('../data/processed/cleaned_retraction_data.csv', index=False)

<h2 style="color: 	#365F93;"><strong>3. Obtain Selected Metadata from the OpenAlex API for Retracted Papers</strong></h2>

In [17]:
# 1. LOAD DATA: Get the list of DOIs to process

retract_dois_set = set()

# Check if the set already exists in memory from a previous step
# ("'name' in globals()" tests for the variable without raising a NameError)
if 'retract_dois' in globals() and isinstance(globals()['retract_dois'], set) and len(globals()['retract_dois']) > 0:
    print("✅ Efficiency win! Using the 'retract_dois' set already present in memory.")
    retract_dois_set = globals()['retract_dois']
    print(f"Set contains {len(retract_dois_set)} unique DOIs.")

else:
    # If the set is not in memory, load it from the file
    print("Variable 'retract_dois' not found in memory. Attempting to load from file...")
    
    # Define the path to cleaned data file
    cleaned_data_path = '../data/processed/cleaned_retraction_data.csv'
    print(f"Attempting to load data from: {cleaned_data_path}")

    try:
        # Load the DataFrame
        df_retraction = pd.read_csv(cleaned_data_path)
        
        # Extract the unique DOIs into the set
        retract_dois_set = set(df_retraction['original_paper_doi'].dropna().astype(str))
        
        if not retract_dois_set:
            print("⚠️ WARNING: The file was loaded, but no DOIs were found.")
        else:
            print(f"✅ Success! Recreated the set from file. It contains {len(retract_dois_set)} unique DOIs.")

    except FileNotFoundError:
        print(f"❌ ERROR: The file was not found at '{cleaned_data_path}'.")
        print("Run the cleaning notebook first or check the file path.")
    except KeyError:
        print("❌ ERROR: The column 'original_paper_doi' was not found in the file.")
        print("Check the column names in your CSV file.")

✅ Efficiency win! Using the 'retract_dois' set already present in memory.
Set contains 56046 unique DOIs.


In [18]:
# 2. TEST RUN: Process a small sample to validate the function

print("Starting Test Run")
test_results = []

# Check if the set is not empty before proceeding
if retract_dois_set:
    # To take a random sample from a set, we first convert it to a list
    sample_dois = random.sample(list(retract_dois_set), k=min(100, len(retract_dois_set)))
    
    for doi in tqdm(sample_dois, desc="Testing with sample"):
        print(f"Processing test DOI: {doi}")
        features = get_work_features_from_doi(doi)
        if features:
            test_results.append(features)
        time.sleep(1) 

    if test_results:
        df_test = pd.DataFrame(test_results)
        print("\nTest run successful. Here's a preview of the extracted data:")
        display(df_test)
    else:
        print("\nTest run completed, but no data was extracted. Check for errors above.")
else:
    print("Skipping test run because no DOIs were loaded in the previous step.")

Starting Test Run


Testing with sample:   0%|          | 0/100 [00:00<?, ?it/s]

Processing test DOI: 10.4314/jfas.v10i6s.120
Processing test DOI: 10.1007/s13277-014-2123-6
Processing test DOI: 10.1109/icemms.2010.5563470
Processing test DOI: 10.1155/2022/6861781
Processing test DOI: 10.1155/2022/6249534
Processing test DOI: 10.1007/s00261-014-0337-0
Processing test DOI: 10.1016/j.ptsp.2020.04.015
Processing test DOI: 10.1371/journal.pone.0170860
Processing test DOI: 10.3390/ani13071240
Processing test DOI: 10.1021/ja5011724
Processing test DOI: 10.1109/edt.2010.5496358
Processing test DOI: 10.1017/s2045796021000408
Processing test DOI: 10.1016/j.isatra.2019.11.016
Processing test DOI: 10.1155/2022/2197071
Processing test DOI: 10.1016/j.abd.2022.06.015
Processing test DOI: 10.1186/s40748-020-00121-3
Processing test DOI: 10.1016/j.omtn.2019.06.005
Processing test DOI: 10.1155/2022/5991154
Processing test DOI: 10.1109/coconet.2018.8476816
Processing test DOI: 10.1016/j.nmni.2017.03.005
Processing test DOI: 10.1124/mol.116.103697
Processing test DOI: 10.3233/cbm-18237

Unnamed: 0,doi,source_id,publication_year,article_type,is_open_access,author_count,institution_count,country_count,first_author_country,is_international_collaboration,...,is_publisher_missing,title,abstract,title_length,abstract_length,is_abstract_missing,n_concepts,top_concept_level,citations_in_first_2_years,n_references
0,10.1007/s13277-014-2123-6,https://openalex.org/S12644804,2014,article,False,9,3,1,CN,False,...,False,RETRACTED ARTICLE: Relationships between genet...,Our study aims to discuss the association betw...,144,1726,False,16,2,2,50
1,10.1109/icemms.2010.5563470,https://openalex.org/S4306418675,2010,article,False,4,1,1,CN,False,...,True,Notice of Retraction: Research on the influenc...,,109,0,True,10,2,0,0
2,10.1155/2022/6861781,https://openalex.org/S11392764,2022,article,True,1,1,1,CN,False,...,False,The Rights and Interests Protection Strategy o...,In order to have an in-depth understanding of ...,110,1446,False,28,2,1,24
3,10.1155/2022/6249534,https://openalex.org/S36980176,2022,article,True,9,1,1,CN,False,...,False,Alisol B 23-Acetate Increases the Antitumor Ef...,Objective. Liver cancer seriously threatens th...,117,1764,False,16,3,6,46
4,10.1007/s00261-014-0337-0,https://openalex.org/S131800149,2014,article,True,12,3,2,US,True,...,False,Preoperative CT-based nomogram for predicting ...,,126,0,True,14,2,1,20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,10.1088/0957-4484/18/22/225103,https://openalex.org/S206811868,2007,article,False,6,3,1,IN,False,...,False,Retracted: Characterization of enhanced antiba...,"In the present study, we report the preparatio...",91,792,False,22,2,0,31
94,10.1177/0020720920923308,https://openalex.org/S183582160,2020,article,False,2,1,1,TW,False,...,False,RETRACTED: Cost estimation through Monte Carlo...,Architectural design can be considered an info...,93,900,False,20,2,0,18
95,10.1155/2022/7532086,https://openalex.org/S155241436,2022,article,True,6,5,4,IN,True,...,False,Anti-Quorum Sensing in Pathogenic Microbes Usi...,Infectious disease-causing pathogenic microorg...,85,808,False,13,4,11,31
96,10.1080/16742834.2009.11446809,https://openalex.org/S2764567338,2009,review,True,2,0,0,,False,...,False,Retraction: A Review of Potential Vorticity an...,,102,0,True,13,4,0,0


In [None]:
# 3. FULL RUN: Process all DOIs

print("Starting Full-Scale Enrichment Process")

# Define paths for saving progress
output_folder = '../data/processed/enriched_data'
results_filepath = os.path.join(output_folder, 'retracted_enriched_results.jsonl')
processed_dois_filepath = os.path.join(output_folder, 'processed_dois.txt')

# Create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Load already processed DOIs to avoid re-doing work
processed_dois = set()
try:
    with open(processed_dois_filepath, 'r') as f:
        processed_dois = {line.strip() for line in f} # More efficient way to load a set
    print(f"Resuming process. Found {len(processed_dois)} already processed DOIs.")
except FileNotFoundError:
    print("Starting a new process. No previously processed DOIs found.")

# Keep only the DOIs that have NOT been processed yet
# (`retract_dois_set` was created in the LOAD DATA cell)
dois_to_process = list(retract_dois_set - processed_dois)
print(f"Number of new DOIs to process: {len(dois_to_process)}")

# Main processing loop
# Initialize the counter before the `if` so the final summary line works even when nothing is left to process
processed_count = 0
if dois_to_process:
    
    # Open files in 'append' mode ('a') to add new results
    with open(results_filepath, 'a') as results_file, \
         open(processed_dois_filepath, 'a') as dois_file:
        
        for doi in tqdm(dois_to_process, desc="Enriching Retracted DOIs"):
            features = None # Reset features for each loop
            try:
                # 1. Get features from the API
                features = get_work_features_from_doi(doi)
            except Exception as e:
                # This is an ultimate safety net for unexpected errors in the extractor
                print(f"  - CRITICAL ERROR processing {doi}: {e}. Skipping.")
            
            # 2. If the function returned a dictionary of features, save it
            if features and isinstance(features, dict):
                results_file.write(json.dumps(features) + '\n')
            
            # 3. ALWAYS log the DOI as processed to prevent retrying it
            dois_file.write(doi + '\n')
            
            # 4. Increment and report progress
            processed_count += 1
            if processed_count % 500 == 0:
                print(f"\n--- Progress Update: {processed_count} / {len(dois_to_process)} DOIs processed. ---")
                # Flush buffers periodically to ensure data is saved to disk
                results_file.flush()
                dois_file.flush()
            
            # 5. Be polite to the API
            time.sleep(0.1) # Safe rate of ~10 requests per second

print(f"\n--- Full-scale enrichment process complete! Total DOIs processed in this run: {processed_count} ---")

Starting Full-Scale Enrichment Process
Starting a new process. No previously processed DOIs found.
Number of new DOIs to process: 56046


Enriching Retracted DOIs:   0%|          | 0/56046 [00:00<?, ?it/s]


--- Progress Update: 500 / 56046 DOIs processed. ---

--- Progress Update: 1000 / 56046 DOIs processed. ---

--- Progress Update: 1500 / 56046 DOIs processed. ---

--- Progress Update: 2000 / 56046 DOIs processed. ---

--- Progress Update: 2500 / 56046 DOIs processed. ---

--- Progress Update: 3000 / 56046 DOIs processed. ---

--- Progress Update: 3500 / 56046 DOIs processed. ---

--- Progress Update: 4000 / 56046 DOIs processed. ---

--- Progress Update: 4500 / 56046 DOIs processed. ---

--- Progress Update: 5000 / 56046 DOIs processed. ---

--- Progress Update: 5500 / 56046 DOIs processed. ---

--- Progress Update: 6000 / 56046 DOIs processed. ---

--- Progress Update: 6500 / 56046 DOIs processed. ---

--- Progress Update: 7000 / 56046 DOIs processed. ---

--- Progress Update: 7500 / 56046 DOIs processed. ---

--- Progress Update: 8000 / 56046 DOIs processed. ---

--- Progress Update: 8500 / 56046 DOIs processed. ---

--- Progress Update: 9000 / 56046 DOIs processed. ---

--- Progre
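
The full run above is resumable because every DOI is appended to a checkpoint file as soon as it has been attempted. The sketch below isolates that pattern with a stand-in worker so it can be run without network access. `run_resumable`, `fake_fetch`, and the file names are hypothetical. Unlike the cell above, this variant records a DOI as processed only when the worker does not raise, so failures are retried on the next run. That is a design alternative, not the notebook's behavior.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Optional


def run_resumable(dois: Iterable[str],
                  fetch: Callable[[str], Optional[dict]],
                  results_path: str,
                  done_path: str) -> int:
    """Process each pending DOI once, appending results and one checkpoint line per DOI."""
    done = set()
    if os.path.exists(done_path):
        with open(done_path, "r", encoding="utf-8") as f:
            done = {line.strip() for line in f if line.strip()}

    processed = 0
    with open(results_path, "a", encoding="utf-8") as results_file, \
         open(done_path, "a", encoding="utf-8") as done_file:
        for doi in sorted(set(dois) - done):
            try:
                features = fetch(doi)
            except Exception as exc:
                # Not checkpointed, so the next run will try this DOI again
                print(f"Failed on {doi}: {exc}")
                continue
            if isinstance(features, dict):
                results_file.write(json.dumps(features) + "\n")
            done_file.write(doi + "\n")
            processed += 1
    return processed


def fake_fetch(doi: str) -> Optional[dict]:
    """Stand-in for the real extractor: fails for one DOI, succeeds otherwise."""
    if doi.endswith("bad"):
        raise RuntimeError("simulated API error")
    return {"doi": doi}


workdir = tempfile.mkdtemp()
results_path = os.path.join(workdir, "results.jsonl")
done_path = os.path.join(workdir, "processed_dois.txt")
dois = ["10.1/a", "10.1/b", "10.1/bad"]

first_run = run_resumable(dois, fake_fetch, results_path, done_path)
second_run = run_resumable(dois, lambda d: {"doi": d}, results_path, done_path)
print(first_run, second_run)
```

The trade-off is that a DOI which fails permanently would be retried on every run, so a real pipeline would probably also cap the number of attempts.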

In [None]:
# 4. FINAL ASSEMBLY: Create the final DataFrame

print("Assembling the final enriched DataFrame for Retracted Articles")

# Define file paths
results_filepath = '../data/processed/enriched_data/retracted_enriched_results.jsonl'
original_info_filepath = '../data/processed/cleaned_retraction_data.csv'
final_csv_path = '../data/processed/retracted_cases_final.csv'

try:
    # 1. Load the enriched data from the JSONL file
    df_retracted_enriched = pd.read_json(results_filepath, lines=True)
    print(f"✅ Loaded {len(df_retracted_enriched)} enriched records.")

    # 2. Load the specific columns needed from the original cleaned data
    df_original_info = pd.read_csv(
        original_info_filepath, 
        usecols=['original_paper_doi', 'reason', 'retraction_nature']
    )
    
    # 3. Merge the two DataFrames
    df_final_cases = pd.merge(
        df_retracted_enriched,
        df_original_info,
        left_on='doi',
        right_on='original_paper_doi',
        how='left'
    )
    
    # 4. Clean up and add the target variable
    df_final_cases = df_final_cases.drop(columns=['original_paper_doi'])
    df_final_cases['is_retracted'] = 1
    
    # 5. Save the final result and show info
    df_final_cases.to_csv(final_csv_path, index=False)
    
    print("\nFinal DataFrame info:")
    df_final_cases.info()
    print("✅ FIRST HALF OF GOLDEN DATASET COMPLETE!")
    print(f"Final enriched 'cases' data saved to: {final_csv_path}")
    
    display(df_final_cases.head())
    
except FileNotFoundError:
    print(f"❌ ERROR: A required file was not found. Please check paths for '{results_filepath}' and '{original_info_filepath}'.")
except Exception as e:
    print(f"❌ ERROR: An unexpected error occurred during assembly.")
    print(f"   Error details: {e}")

Assembling the final enriched DataFrame for Retracted Articles
✅ Loaded 55682 enriched records.

Final DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55682 entries, 0 to 55681
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   doi                             55682 non-null  object
 1   source_id                       48559 non-null  object
 2   publication_year                55682 non-null  int64 
 3   article_type                    55682 non-null  object
 4   is_open_access                  55682 non-null  bool  
 5   author_count                    55682 non-null  int64 
 6   institution_count               55682 non-null  int64 
 7   country_count                   55682 non-null  int64 
 8   first_author_country            52584 non-null  object
 9   is_international_collaboration  55682 non-null  bool  
 10  journal_name                    48559 non-null 

Unnamed: 0,doi,source_id,publication_year,article_type,is_open_access,author_count,institution_count,country_count,first_author_country,is_international_collaboration,...,title_length,abstract_length,is_abstract_missing,n_concepts,top_concept_level,citations_in_first_2_years,n_references,retraction_nature,reason,is_retracted
0,10.1109/ifita.2009.133,https://openalex.org/S4306424334,2009,article,False,3,1,1,CN,False,...,132,0,True,17,2,0,14,Retraction,+Breach of Policy by Author;+Date of Article a...,1
1,10.1111/jan.14338,https://openalex.org/S137832324,2020,article,False,9,2,1,CN,False,...,173,2726,False,10,0,2,23,Retraction,+Concerns/Issues About Data;+Ethical Violation...,1
2,10.1016/j.jaad.2018.01.028,https://openalex.org/S58727509,2018,retraction,True,14,11,1,US,False,...,209,0,True,11,0,3,5,Retraction,+Concerns/Issues About Data;,1
3,10.1109/icnidc.2009.5360923,https://openalex.org/S4306422667,2009,article,False,3,2,1,CN,False,...,80,0,True,12,2,0,4,Retraction,+Breach of Policy by Author;+Date of Article a...,1
4,10.1021/acs.jpcb.5b02877,https://openalex.org/S185621672,2015,retraction,False,4,0,0,,False,...,157,1891,False,14,3,0,39,Retraction,+Concerns/Issues About Authorship/Affiliation;,1
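
The left merge above assumes that each enriched DOI matches at most one row in the cleaned Retraction Watch data. One hedged way to check that assumption is the `validate` and `indicator` options of `pd.merge`, sketched here on invented rows. The toy frames stand in for `df_retracted_enriched` and `df_original_info`.

```python
import pandas as pd

# Invented rows for illustration only
enriched = pd.DataFrame({"doi": ["10.1/a", "10.1/b", "10.1/c"],
                         "author_count": [3, 9, 1]})
original = pd.DataFrame({"original_paper_doi": ["10.1/a", "10.1/b"],
                         "reason": ["+Error in Image;", "+Concerns/Issues About Data;"]})

merged = pd.merge(
    enriched,
    original,
    left_on="doi",
    right_on="original_paper_doi",
    how="left",
    validate="many_to_one",  # raises MergeError if a DOI is duplicated on the right side
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Rows tagged 'left_only' were enriched but have no match in the cleaned data
unmatched = merged.loc[merged["_merge"] == "left_only", "doi"].tolist()
print(unmatched)
```

Rows flagged as `left_only` would end up with empty `reason` and `retraction_nature` values, so listing them before saving can catch DOI mismatches early.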


In [None]:
# 5. Identify DOIs not processed by the API and save them to a CSV File

if 'retract_dois_set' in globals() and 'df_final_cases' in globals():

    # 1. Get the original set of DOIs
    original_dois = retract_dois_set
    
    # 2. Get the set of DOIs that were successfully processed
    processed_dois_ok = set(df_final_cases['doi'])
    
    # 3. Calculate the set of missing DOIs
    missing_dois_set = original_dois - processed_dois_ok
    
    # 4. Print a summary report
    print(f"Original DOIs to process: {len(original_dois)}")
    print(f"DOIs successfully enriched: {len(processed_dois_ok)}")
    print(f"Total missing/unprocessed DOIs: {len(missing_dois_set)}")
    
    # 5. Save the full list of missing DOIs to a file
    if missing_dois_set:
        missing_dois_filepath = '../data/processed/missing_dois_from_retracted_cases_final.csv'
        pd.DataFrame(list(missing_dois_set), columns=['doi']).to_csv(missing_dois_filepath, index=False)
        print(f"\n✅ A full list of the {len(missing_dois_set)} missing DOIs has been saved to: '{missing_dois_filepath}'")
    else:
        print("\nNo missing DOIs found.")
                
else:
    print("Could not run audit. Ensure 'retract_dois_set' and 'df_final_cases' are loaded in memory.")

Original DOIs to process: 56046
DOIs successfully enriched: 55682
Total missing/unprocessed DOIs: 365

✅ A full list of the 365 missing DOIs has been saved to: '../data/processed/missing_dois_from_retracted_cases_final.csv'
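
In the report above, 56046 original DOIs minus 55682 enriched DOIs would give 364, yet 365 are reported missing. The numbers imply that one enriched DOI is not in the original set. Why is not established here. One possibility, stated as an assumption, is that the API returned a DOI string that differs from the one requested. Comparing the sets in both directions would surface such cases, as this sketch with invented DOIs shows.

```python
# Invented DOIs for illustration only
original_dois = {"10.1/a", "10.1/b", "10.1/c", "10.1/d"}
enriched_dois = {"10.1/a", "10.1/b", "10.1/c-v2"}  # one DOI came back in a different form

missing = original_dois - enriched_dois      # requested but not found among enriched DOIs
unexpected = enriched_dois - original_dois   # enriched but not in the original set

# The two counts are linked: |missing| = |original| - |enriched| + |unexpected|
assert len(missing) == len(original_dois) - len(enriched_dois) + len(unexpected)

print(sorted(missing), sorted(unexpected))
```

Applied to the figures above, the same identity gives 365 = 56046 - 55682 + 1.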


In [6]:
df_final_cases = pd.read_csv('../data/processed/retracted_cases_final.csv')

In [7]:
df_final_cases

Unnamed: 0,doi,source_id,publication_year,article_type,is_open_access,author_count,institution_count,country_count,first_author_country,is_international_collaboration,...,title_length,abstract_length,is_abstract_missing,n_concepts,top_concept_level,citations_in_first_2_years,n_references,retraction_nature,reason,is_retracted
0,10.1109/ifita.2009.133,https://openalex.org/S4306424334,2009,article,False,3,1,1,CN,False,...,132,0,True,17,2,0,14,Retraction,+Breach of Policy by Author;+Date of Article a...,1
1,10.1111/jan.14338,https://openalex.org/S137832324,2020,article,False,9,2,1,CN,False,...,173,2726,False,10,0,2,23,Retraction,+Concerns/Issues About Data;+Ethical Violation...,1
2,10.1016/j.jaad.2018.01.028,https://openalex.org/S58727509,2018,retraction,True,14,11,1,US,False,...,209,0,True,11,0,3,5,Retraction,+Concerns/Issues About Data;,1
3,10.1109/icnidc.2009.5360923,https://openalex.org/S4306422667,2009,article,False,3,2,1,CN,False,...,80,0,True,12,2,0,4,Retraction,+Breach of Policy by Author;+Date of Article a...,1
4,10.1021/acs.jpcb.5b02877,https://openalex.org/S185621672,2015,retraction,False,4,0,0,,False,...,157,1891,False,14,3,0,39,Retraction,+Concerns/Issues About Authorship/Affiliation;,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55677,10.1155/2022/2026210,https://openalex.org/S66990431,2022,article,True,9,7,3,,True,...,77,1170,False,13,0,6,14,Retraction,+Concerns/Issues About Data;+Concerns/Issues a...,1
55678,10.1109/icmss.2009.5302366,,2009,article,False,1,1,1,CN,False,...,73,0,True,7,2,0,1,Retraction,+Breach of Policy by Author;+Date of Article a...,1
55679,10.1109/paccs.2009.96,https://openalex.org/S4306425123,2009,article,False,3,1,1,CN,False,...,94,0,True,7,2,0,3,Retraction,+Breach of Policy by Author;+Date of Article a...,1
55680,10.1016/j.fitote.2014.03.020,https://openalex.org/S44811770,2014,article,False,10,2,1,CN,False,...,145,0,True,10,0,4,31,Retraction,+Concerns/Issues About Image;+Error in Image;+...,1


<h2 style="color: 	#365F93;"><strong>4. Obtain Selected Metadata from the OpenAlex API for Non-Retracted Papers</strong></h2>

In [None]:
# 1. Setup and Sampling Plan for a 100-item Test

# My details for the OpenAlex API
MY_EMAIL = "xavi@gmail.com"

# Step 1: Load Retracted Cases
print("Loading the retracted cases dataset.")
try:
    df_cases = pd.read_csv('../data/processed/retracted_cases_final.csv')
    print(f"✅ Successfully loaded {len(df_cases)} retracted cases.")
except FileNotFoundError:
    print("❌ ERROR: 'retracted_cases_final.csv' not found. Please ensure the previous steps are complete.")
    df_cases = pd.DataFrame() # Create empty df to avoid errors

# Step 2: Create a 100-item Test Sampling Plan
if not df_cases.empty:
    print("\nCreating a sampling plan for a 100-item test run.")
    
    # Take a random sample of 100 cases to test our logic
    if len(df_cases) > 100:
        test_plan = df_cases.sample(n=100, random_state=42).copy()
    else:
        # If the dataset is smaller than 100, use all of it
        test_plan = df_cases.copy()
    
    test_plan.reset_index(inplace=True, drop=True)
    
    print(f"✅ Test plan created with {len(test_plan)} cases.")
    print("This plan will be used to find matching 'twin' control papers.")
    display(test_plan.head())

Loading the retracted cases dataset.
✅ Successfully loaded 55682 retracted cases.

Creating a sampling plan for a 100-item test run.
✅ Test plan created with 100 cases.
This plan will be used to find matching 'twin' control papers.


Unnamed: 0,doi,source_id,publication_year,article_type,is_open_access,author_count,institution_count,country_count,first_author_country,is_international_collaboration,...,title_length,abstract_length,is_abstract_missing,n_concepts,top_concept_level,citations_in_first_2_years,n_references,retraction_nature,reason,is_retracted
0,10.1155/2022/2756459,https://openalex.org/S50311662,2022,article,True,4,3,1,CN,False,...,101,630,False,21,3,3,30,Retraction,+Concerns/Issues About Data;+Concerns/Issues a...,1
1,10.1016/j.jenvrad.2009.09.005,https://openalex.org/S76547072,2009,article,True,3,2,1,BR,False,...,132,0,True,10,3,0,0,Retraction,+Falsification/Fabrication of Results;+Investi...,1
2,10.1074/jbc.m109.065987,https://openalex.org/S140251998,2009,article,True,6,4,1,US,False,...,111,0,True,15,4,0,51,Retraction,+Error in Image;+Investigation by Company/Inst...,1
3,10.1155/2022/3408501,https://openalex.org/S36625193,2022,article,True,16,11,6,CN,True,...,168,2114,False,14,2,18,89,Retraction,+Concerns/Issues About Data;+Concerns/Issues a...,1
4,10.1016/j.jisa.2019.102412,https://openalex.org/S4210191536,2020,article,False,4,2,1,EG,False,...,89,0,True,7,3,0,40,Retraction,+Duplication of/in Article;,1


In [18]:
# 2. The Hierarchical Search Function

# We reuse our existing feature extractor, imported with: from src.feature_extractor import get_work_features_from_doi

def find_control_twin_hierarchical_robust(case_row: pd.Series) -> tuple[Optional[dict], Optional[str]]:
    """
    Tries to find a control twin using a robust cascade with a retry mechanism.
    This version is designed to minimize 'Not_Found' cases due to temporary network/API errors.
    
    Returns a tuple: (features_of_twin, match_quality_level)
    """
    base_filters = [
        f"publication_year:{int(case_row['publication_year'])}",
        f"type:{case_row['article_type']}",
        "is_retracted:false",
        f"doi:!{case_row['doi']}"
    ]

    cascade_levels = [
        {'quality': 'Source_and_Country', 'fields': ['source_id', 'first_author_country']},
        {'quality': 'Source_Only', 'fields': ['source_id']},
        {'quality': 'Country_Only', 'fields': ['first_author_country']},
        {'quality': 'Type_Only', 'fields': []}
    ]

    for level in cascade_levels:
        if any(pd.isna(case_row.get(field)) for field in level['fields']):
            continue

        level_filters = base_filters.copy()
        
        filter_mapping = {
            'source_id': 'primary_location.source.id',
            'first_author_country': 'authorships.countries'
        }
        
        for field in level['fields']:
            value = case_row[field]
            safe_value = requests.utils.quote(str(value))
            level_filters.append(f"{filter_mapping[field]}:{safe_value}")
        
        filter_string = ",".join(level_filters)
        url = f"https://api.openalex.org/works?filter={filter_string}&sample=1&mailto={MY_EMAIL}"

        # Retry Mechanism
        max_retries = 3
        for attempt in range(max_retries):
            try:
                response = requests.get(url, headers={'User-Agent': f'PaperLensClient/1.0 (mailto:{MY_EMAIL})'}, timeout=30)

                if response.status_code == 200:
                    # Parse the body once and reuse it
                    results = response.json().get('results')
                    if results:
                        twin_doi = results[0].get('doi')
                        if twin_doi:
                            twin_features = get_work_features_from_doi(twin_doi)
                            if twin_features:
                                return twin_features, level['quality']
                    # Either the API returned no results or the result could not be processed.
                    # Retrying this level will not help, so go to the next level.
                    break

                # Any other status (e.g., rate limiting or a server error) is treated as temporary
                print(f"  - WARN: Attempt {attempt + 1}/{max_retries} got HTTP {response.status_code} for level '{level['quality']}' on DOI {case_row['doi']}.")

            except requests.RequestException as e:
                # This catches network errors, timeouts, etc.
                print(f"  - WARN: Attempt {attempt + 1}/{max_retries} failed for level '{level['quality']}' on DOI {case_row['doi']}. Error: {e}")

            # Wait before the next attempt (1, then 2 seconds); applies to HTTP errors and exceptions alike
            if attempt < max_retries - 1:
                time.sleep(attempt + 1)

    # If the entire cascade finishes without finding anything, return None
    return None, None

In [19]:
# 3. The Test Run

print(f"Starting the test run on {len(test_plan)} cases with the new robust function")

# This list will store the results of our test
test_results_robust = []
# This dict will count how many twins we find at each quality level
match_quality_counts_robust = {}

# We iterate through the test plan DataFrame
for index, case_row in tqdm(test_plan.iterrows(), total=len(test_plan), desc="Finding Control Twins (Robust)"):
    
    # We call the new function with the retry mechanism
    twin_features, quality_level = find_control_twin_hierarchical_robust(case_row)
    
    if twin_features and quality_level:
        # A twin was found!
        twin_features['match_quality'] = quality_level
        twin_features['original_case_doi'] = case_row['doi']
        test_results_robust.append(twin_features)
        
        # Update our counter
        match_quality_counts_robust[quality_level] = match_quality_counts_robust.get(quality_level, 0) + 1
    else:
        # This should only happen in very rare cases now
        match_quality_counts_robust['Not_Found'] = match_quality_counts_robust.get('Not_Found', 0) + 1
        
    time.sleep(0.1)

# Analyze the Test Results
print("Robust Test Run Complete")
print(f"Total twins found: {len(test_results_robust)} out of {len(test_plan)} cases.")

if test_results_robust:
    # Convert results to a DataFrame for easy viewing
    df_test_results_robust = pd.DataFrame(test_results_robust)
    
    print("\nDistribution of Match Quality:")
    # Sort the counts for a nicer display
    sorted_counts = sorted(match_quality_counts_robust.items(), key=lambda item: item[1], reverse=True)
    for level, count in sorted_counts:
        print(f"  - {level}: {count} twins")
        
    print("\nSample of the first 5 twins found:")
    display(df_test_results_robust[['doi', 'publication_year', 'publisher', 'journal_name', 'match_quality', 'original_case_doi']].head())
else:
    print("\nNo control twins were found in this robust test run.")

Starting the test run on 100 cases with the new robust function


Finding Control Twins (Robust):   0%|          | 0/100 [00:00<?, ?it/s]

Robust Test Run Complete
Total twins found: 97 out of 100 cases.

Distribution of Match Quality:
  - Source_and_Country: 86 twins
  - Source_Only: 6 twins
  - Country_Only: 4 twins
  - Not_Found: 3 twins
  - Type_Only: 1 twins

Sample of the first 5 twins found:


Unnamed: 0,doi,publication_year,publisher,journal_name,match_quality,original_case_doi
0,10.1155/2022/4442417,2022,Hindawi Publishing Corporation,Applied Bionics and Biomechanics,Source_and_Country,10.1155/2022/2756459
1,10.1016/j.jenvrad.2008.12.017,2009,Elsevier BV,Journal of Environmental Radioactivity,Source_and_Country,10.1016/j.jenvrad.2009.09.005
2,10.1074/jbc.m109.043885,2009,Elsevier BV,Journal of Biological Chemistry,Source_and_Country,10.1074/jbc.m109.065987
3,10.1155/2022/6984403,2022,Hindawi Publishing Corporation,Journal of Healthcare Engineering,Source_and_Country,10.1155/2022/3408501
4,10.1016/j.jisa.2020.102644,2020,Elsevier BV,Journal of Information Security and Applications,Source_and_Country,10.1016/j.jisa.2019.102412


**Final Data Matching Strategy for Control Group Generation**

Our final strategy for generating the control group of non-retracted papers is a hierarchical matching cascade, designed to be robust and to retain as many cases as possible. It improves on our earlier iterations because it is based on what we observed while debugging queries against the OpenAlex API directly.

The plan: For each retracted paper (a "case"), we will attempt to find a non-retracted "twin" by querying the API in a specific, prioritized order. As soon as a twin is found, we record its match quality and move to the next case. The *publication_year* and *article_type* are mandatory for all levels.

Matching Cascade:
- **Level 1** (Quality: *Source_and_Country*): Match on *source_id*, *first_author_country*, *article_type*, and *publication_year*.
- **Level 2** (Quality: *Source_Only*): If Level 1 fails, match on *source_id*, *article_type*, and *publication_year*.
- **Level 3** (Quality: *Country_Only*): If prior levels fail (e.g., due to a missing *source_id*), match on *first_author_country*, *article_type*, and *publication_year*.
- **Level 4** (Quality: *Type_Only*): As a final fallback, match only on *article_type* and *publication_year*.

Advantages of this approach:
1. **Uses a Unique Identifier:** The *source_id* is a much more reliable key than text-based names like publisher or journal, preventing mismatches.
2. **Controls for the True "Source":** Matching on *source_id* ensures we are comparing papers from the exact same journal or conference, which is an excellent proxy for editorial standards and quality control.
3. **Simple and Defensible:** The logic is clear, evidence-based from our API testing, and methodologically sound. It gives us the flexibility to filter by match quality during the modeling phase.
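
To make the cascade concrete, the sketch below shows how each level could translate into an OpenAlex filter string. It mirrors the filter keys used by the search function in this notebook, but `build_level_filter`, `CASCADE_LEVELS`, and `FILTER_MAPPING` are illustrative names, and missing values are simplified to `None` in a plain dict (the notebook itself uses `pd.isna` on a DataFrame row). The example case is the row with a missing `source_id` shown earlier.

```python
from typing import Optional
from urllib.parse import quote

# Dataset column -> OpenAlex filter key (same mapping as the notebook's search function)
FILTER_MAPPING = {
    "source_id": "primary_location.source.id",
    "first_author_country": "authorships.countries",
}

# The four cascade levels, from strictest to loosest
CASCADE_LEVELS = [
    ("Source_and_Country", ["source_id", "first_author_country"]),
    ("Source_Only", ["source_id"]),
    ("Country_Only", ["first_author_country"]),
    ("Type_Only", []),
]


def build_level_filter(case: dict, fields: list[str]) -> Optional[str]:
    """Return the filter string for one level, or None if a required field is missing."""
    if any(case.get(field) in (None, "") for field in fields):
        return None  # this level is skipped, as in the cascade
    filters = [
        f"publication_year:{int(case['publication_year'])}",
        f"type:{case['article_type']}",
        "is_retracted:false",
        f"doi:!{case['doi']}",  # never return the case itself
    ]
    for field in fields:
        filters.append(f"{FILTER_MAPPING[field]}:{quote(str(case[field]))}")
    return ",".join(filters)


example_case = {
    "doi": "10.1109/icmss.2009.5302366",
    "publication_year": 2009,
    "article_type": "article",
    "source_id": None,  # missing, so Levels 1 and 2 are skipped
    "first_author_country": "CN",
}

for quality, fields in CASCADE_LEVELS:
    print(quality, "->", build_level_filter(example_case, fields))
```

For this case the first usable level is *Country_Only*, which matches the match quality recorded for its twin in the final control table.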

In [20]:
# 4. FULL RUN: Process all DOIs
# Step 1: Load Retracted Cases
print("Loading the full dataset of retracted cases.")
try:
    df_cases = pd.read_csv('../data/processed/retracted_cases_final.csv')
    print(f"✅ Successfully loaded {len(df_cases)} retracted cases.")
except FileNotFoundError:
    print("❌ ERROR: 'retracted_cases_final.csv' not found. Please run the previous steps.")
    df_cases = pd.DataFrame()

# Step 2: Create the Full-Scale Sampling Plan
if not df_cases.empty:
    print("\nCreating the full-scale sampling plan.")
    # We use the entire dataframe now.
    sampling_plan = df_cases.copy()
    # We use the original index of the DataFrame for resumable processing
    sampling_plan.reset_index(inplace=True) 
    
    print(f"✅ Full sampling plan created with {len(sampling_plan)} cases.")

Loading the full dataset of retracted cases.
✅ Successfully loaded 55682 retracted cases.

Creating the full-scale sampling plan.
✅ Full sampling plan created with 55682 cases.


In [22]:
# The Robust Hierarchical Search Function

def find_control_twin_hierarchical_robust(case_row: pd.Series) -> tuple[Optional[dict], Optional[str]]:
    """
    Tries to find a control twin using the verified robust cascade based on source_id.
    Includes a retry mechanism for network reliability.
    """
    base_filters = [
        f"publication_year:{int(case_row['publication_year'])}",
        f"type:{case_row['article_type']}",
        "is_retracted:false",
        f"doi:!{case_row['doi']}"
    ]

    cascade_levels = [
        {'quality': 'Source_and_Country', 'fields': ['source_id', 'first_author_country']},
        {'quality': 'Source_Only', 'fields': ['source_id']},
        {'quality': 'Country_Only', 'fields': ['first_author_country']},
        {'quality': 'Type_Only', 'fields': []}
    ]

    for level in cascade_levels:
        if any(pd.isna(case_row.get(field)) for field in level['fields']):
            continue

        level_filters = base_filters.copy()
        filter_mapping = {
            'source_id': 'primary_location.source.id',
            'first_author_country': 'authorships.countries'
        }
        
        for field in level['fields']:
            value = case_row[field]
            safe_value = requests.utils.quote(str(value))
            level_filters.append(f"{filter_mapping[field]}:{safe_value}")
        
        filter_string = ",".join(level_filters)
        url = f"https://api.openalex.org/works?filter={filter_string}&sample=1&mailto={MY_EMAIL}"

        max_retries = 3
        for attempt in range(max_retries):
            try:
                response = requests.get(url, headers={'User-Agent': f'PaperLensClient/1.0 (mailto:{MY_EMAIL})'}, timeout=30)

                if response.status_code == 200:
                    # Parse the body once and reuse it
                    results = response.json().get('results')
                    if results:
                        twin_doi = results[0].get('doi')
                        if twin_doi:
                            twin_features = get_work_features_from_doi(twin_doi)
                            if twin_features:
                                return twin_features, level['quality']
                    # No results, or an unusable result: move on to the next level
                    break
            except requests.RequestException:
                pass  # Network error or timeout: fall through to the backoff below

            # Back off before retrying; this also covers non-200 responses
            if attempt < max_retries - 1:
                time.sleep(attempt + 1)
    
    return None, None

In [23]:
# Full-Scale Data Collection Execution

print(f"Starting Full-Scale Data Collection on {len(sampling_plan)} cases.")

# Define paths for saving progress
output_folder = '../data/processed/enriched_data'
results_filepath = os.path.join(output_folder, 'non_retracted_enriched.jsonl')
processed_indices_filepath = os.path.join(output_folder, 'controls_processed_indices.txt')
not_found_filepath = os.path.join(output_folder, 'controls_not_found.txt')

os.makedirs(output_folder, exist_ok=True)

# Load already processed indices to avoid re-doing work
try:
    with open(processed_indices_filepath, 'r') as f:
        # The 'index' column we created in the sampling_plan
        processed_indices = {int(line.strip()) for line in f}
    print(f"Resuming process. Found {len(processed_indices)} already processed cases.")
except FileNotFoundError:
    processed_indices = set()
    print("Starting a new process. No previously processed cases found.")

# Main processing loop
with open(results_filepath, 'a') as results_file, \
     open(processed_indices_filepath, 'a') as indices_file, \
     open(not_found_filepath, 'a') as not_found_file:

    # We iterate using .iterrows() to get both the index and the row data
    for index, case_row in tqdm(sampling_plan.iterrows(), total=len(sampling_plan), desc="Finding All Control Twins"):
        
        # The 'index' here is the row number from the sampling_plan DataFrame
        if index in processed_indices:
            continue
        
        twin_features, quality_level = find_control_twin_hierarchical_robust(case_row)
        
        if twin_features and quality_level:
            twin_features['match_quality'] = quality_level
            twin_features['original_case_doi'] = case_row['doi']
            results_file.write(json.dumps(twin_features) + '\n')
        else:
            not_found_file.write(f"{case_row['doi']}\n")
            
        indices_file.write(str(index) + '\n')
        
        # Flush the files every 100 iterations to save progress
        if index > 0 and index % 100 == 0:
            results_file.flush()
            indices_file.flush()
            not_found_file.flush()
            
        time.sleep(0.1)

print("Full-scale data collection complete!")

Starting Full-Scale Data Collection on 55682 cases.
Starting a new process. No previously processed cases found.


Finding All Control Twins:   0%|          | 0/55682 [00:00<?, ?it/s]

Full-scale data collection complete!


In [24]:
# Final Assembly of the Control Group DataFrame
print("Assembling the final DataFrame for Non-Retracted Control Articles.")

# Define file paths
controls_enriched_path = '../data/processed/enriched_data/non_retracted_enriched.jsonl'
cases_final_path = '../data/processed/retracted_cases_final.csv'
controls_final_path = '../data/processed/non_retracted_controls_final.csv'

try:
    # 1. Load the enriched control data
    df_controls_enriched = pd.read_json(controls_enriched_path, lines=True)
    print(f"✅ Loaded {len(df_controls_enriched)} enriched control records.")

    # 2. Add the target variable for the control group
    df_controls_enriched['is_retracted'] = 0
    
    # 3. Load the 'cases' DataFrame to use its structure as a template
    print(f"Loading '{cases_final_path}' to use as a structural template.")
    df_cases_template = pd.read_csv(cases_final_path)
    
    # 4. Define the final, consistent column order
    # Get all columns from the cases template
    final_column_order = df_cases_template.columns.tolist()
    
    # Add the new columns that are specific to the control group
    final_column_order.append('match_quality')
    final_column_order.append('original_case_doi')
    
    # 5. Create the final DataFrame by reindexing
    # This ensures both DataFrames will have the exact same columns.
    # Columns in df_controls_enriched that are not in final_column_order will be dropped.
    # Columns in final_column_order that are not in df_controls_enriched (like 'reason') will be created and filled with NaN.
    df_controls_final = df_controls_enriched.reindex(columns=final_column_order)
    
    # 6. Save the final result to CSV
    df_controls_final.to_csv(controls_final_path, index=False)
    
    print("\n--- Final Control Group DataFrame ---")
    df_controls_final.info()
    print(f"\n✅ SECOND HALF OF GOLDEN DATASET COMPLETE!")
    print(f"   The structure now perfectly matches the cases dataset.")
    print(f"   Final control group data saved to: {controls_final_path}")
    
    display(df_controls_final.head())
    
except FileNotFoundError as e:
    print(f"❌ FILE NOT FOUND ERROR: {e}. Please check your file paths.")
except Exception as e:
    print(f"❌ An unexpected error occurred during final assembly: {e}")

Assembling the final DataFrame for Non-Retracted Control Articles.
✅ Loaded 51428 enriched control records.
Loading '../data/processed/retracted_cases_final.csv' to use as a structural template.

--- Final Control Group DataFrame ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51428 entries, 0 to 51427
Data columns (total 27 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   doi                             51428 non-null  object 
 1   source_id                       50751 non-null  object 
 2   publication_year                51428 non-null  int64  
 3   article_type                    51428 non-null  object 
 4   is_open_access                  51428 non-null  bool   
 5   author_count                    51428 non-null  int64  
 6   institution_count               51428 non-null  int64  
 7   country_count                   51428 non-null  int64  
 8   first_author_country            49314 non-

Unnamed: 0,doi,source_id,publication_year,article_type,is_open_access,author_count,institution_count,country_count,first_author_country,is_international_collaboration,...,is_abstract_missing,n_concepts,top_concept_level,citations_in_first_2_years,n_references,retraction_nature,reason,is_retracted,match_quality,original_case_doi
0,10.1109/ifita.2009.226,https://openalex.org/S4306424334,2009,article,False,3,1,1,CN,False,...,False,29,3,0,10,,,0,Source_and_Country,10.1109/ifita.2009.133
1,10.1111/jan.14571,https://openalex.org/S137832324,2020,article,False,4,1,1,CN,False,...,False,16,2,1,46,,,0,Source_and_Country,10.1111/jan.14338
2,10.1016/j.jaad.2018.04.011,https://openalex.org/S58727509,2018,retraction,True,0,0,0,,False,...,True,4,0,0,0,,,0,Source_Only,10.1016/j.jaad.2018.01.028
3,10.1109/icnidc.2009.5360970,https://openalex.org/S4306422667,2009,article,False,3,1,1,CN,False,...,False,15,3,0,6,,,0,Source_and_Country,10.1109/icnidc.2009.5360923
4,10.4103/0972-124x.154336,https://openalex.org/S30640462,2015,retraction,True,0,0,0,,False,...,False,9,0,0,2,,,0,Type_Only,10.1021/acs.jpcb.5b02877


In [4]:
df_controls_final = pd.read_csv('../data/processed/non_retracted_controls_final.csv')

In [5]:
df_controls_final

Unnamed: 0,doi,source_id,publication_year,article_type,is_open_access,author_count,institution_count,country_count,first_author_country,is_international_collaboration,...,is_abstract_missing,n_concepts,top_concept_level,citations_in_first_2_years,n_references,retraction_nature,reason,is_retracted,match_quality,original_case_doi
0,10.1109/ifita.2009.226,https://openalex.org/S4306424334,2009,article,False,3,1,1,CN,False,...,False,29,3,0,10,,,0,Source_and_Country,10.1109/ifita.2009.133
1,10.1111/jan.14571,https://openalex.org/S137832324,2020,article,False,4,1,1,CN,False,...,False,16,2,1,46,,,0,Source_and_Country,10.1111/jan.14338
2,10.1016/j.jaad.2018.04.011,https://openalex.org/S58727509,2018,retraction,True,0,0,0,,False,...,True,4,0,0,0,,,0,Source_Only,10.1016/j.jaad.2018.01.028
3,10.1109/icnidc.2009.5360970,https://openalex.org/S4306422667,2009,article,False,3,1,1,CN,False,...,False,15,3,0,6,,,0,Source_and_Country,10.1109/icnidc.2009.5360923
4,10.4103/0972-124x.154336,https://openalex.org/S30640462,2015,retraction,True,0,0,0,,False,...,False,9,0,0,2,,,0,Type_Only,10.1021/acs.jpcb.5b02877
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51423,10.1155/2022/6230298,https://openalex.org/S66990431,2022,article,True,8,4,1,,False,...,False,17,0,19,19,,,0,Source_Only,10.1155/2022/2026210
51424,10.1117/12.872955,https://openalex.org/S183492911,2009,article,False,5,1,1,CN,False,...,False,19,2,0,0,,,0,Country_Only,10.1109/icmss.2009.5302366
51425,10.1109/paccs.2009.20,https://openalex.org/S4306425123,2009,article,False,2,1,1,CN,False,...,False,7,0,0,3,,,0,Source_and_Country,10.1109/paccs.2009.96
51426,10.1016/j.fitote.2013.12.020,https://openalex.org/S44811770,2014,article,False,8,2,1,CN,False,...,True,19,4,5,33,,,0,Source_and_Country,10.1016/j.fitote.2014.03.020


Excellent question. This is the right moment to reassess this, now that we have solved the source identification problem.

The short answer is: **Yes, you can now obtain it, but the problem was not what you thought it was. And the solution comes at a significant cost.**

Let's break it down.

### The Real Problem with the `h-index` (and Why It Failed Before)

The difficulty in obtaining the `journal_h_index` was not directly related to our *matching* problem. They were two separate problems:

1.  **Matching Problem (Solved):** The question was "How do I find a twin that comes from the same 'source'?" We found that the solution was to filter by `source_id`.
2.  **H-Index Extraction Problem (the one at hand):** The question is "Once I have a `source_id`, where do I find its `h-index`?"

Your previous attempt probably failed for one of these two reasons:
*   You were looking for the `h_index` in the wrong place: inside the API response for an **article** (`/works/{doi}`). An individual article does not carry the h-index of the whole journal.
*   The `source_id` you had was invalid or empty, so you could not make the query.

### The Technical Solution (and Its "Price")

Now that we know the `source_id` is the key, it IS technically possible to obtain the `h-index`. The process would be as follows:

1.  You make the first API call, the one `get_work_features_from_doi` already makes, to get the data for the **article**. From this call you extract the `source_id`.
2.  If the `source_id` is not empty, you make a **SECOND API CALL** to a different endpoint: the "sources" endpoint.
    -   The URL would be: `https://api.openalex.org/sources/{source_id}` (e.g., `.../sources/S50311662`).
3.  The response to this second call is a JSON object that describes the journal, and this object DOES contain an `h_index` field, nested under `summary_stats`.
4.  You add that `h_index` to your features dictionary.
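
As a rough illustration of steps 2 and 3, the sketch below builds the sources URL from a stored `source_id` and reads the h-index from the returned payload. It assumes, to the best of my understanding of the OpenAlex Source object, that the value sits under `summary_stats`; the helper names and the sample payload (including the value 42) are made up for illustration, and no network call is made here.

```python
from typing import Optional


def source_api_url(source_id: str) -> str:
    """Build the /sources URL from a full OpenAlex ID URL or a bare ID."""
    short_id = source_id.rstrip("/").split("/")[-1]
    return f"https://api.openalex.org/sources/{short_id}"


def extract_h_index(source_payload: dict) -> Optional[int]:
    """Read h_index from a Source payload (assumed to be nested under 'summary_stats')."""
    stats = source_payload.get("summary_stats") or {}
    return stats.get("h_index")


# Hypothetical, trimmed-down payload used only to show the expected shape
sample_payload = {
    "id": "https://openalex.org/S50311662",
    "summary_stats": {"h_index": 42},
}

print(source_api_url("https://openalex.org/S50311662"))
print(extract_h_index(sample_payload))
print(extract_h_index({}))  # missing stats -> None
```

Returning `None` when the statistics are absent keeps the feature dictionary consistent with how other missing values are handled downstream.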

**The "Price" to Pay: Doubling the API Calls**
For each of the ~110,000 articles (55k cases + 55k controls), you would have to make **two** API calls instead of one.

*   **Total Calls:** ~220,000 instead of ~110,000.
*   **Run Time:** Collection time would at least double, going from ~3 hours to **~6 hours or more**.

### My Recommendation: A Phased Approach (Much Smarter)

Given the high time cost, I suggest a more efficient strategy. Do not try to obtain the `h-index` *during* the main collection.

**Phase 1: Main Collection (what we are about to do)**
1.  Run the collection script as we have it now.
2.  Generate your two main datasets: `retracted_cases_final.csv` and `non_retracted_controls_final.csv`.
3.  Save these files. Your "Golden Dataset" is now ready for modeling!

**Phase 2: Enrichment Script (a new script or new cells)**
*After* you have the data, create a new process that does the following:
1.  Load your `retracted_cases_final.csv` and your `non_retracted_controls_final.csv`.
2.  Combine them into a single DataFrame.
3.  Get the list of **unique** `source_id` values from that DataFrame (`df['source_id'].unique()`). This will drastically reduce the number of calls. Instead of 110,000, you may only have 5,000-10,000 unique `source_id` values.
4.  Write a loop that iterates only over this list of unique `source_id` values. For each one, call the API (`.../sources/{source_id}`) and get its `h_index`.
5.  Store the results in a simple dictionary: `h_index_map = {'S123': 25, 'S456': 150, ...}`.
6.  Use `pd.merge` or `.map()` to add the new `journal_h_index` column to your main DataFrame, using the `h_index_map` you just created.
7.  Save the final, enriched DataFrame as `golden_dataset_enriched.csv`.
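
The core of Phase 2 (steps 3 to 6) can be sketched as below. This is only an outline under stated assumptions: `add_journal_h_index` is a hypothetical helper, and the fetcher is passed in as a function so the example runs offline with a stand-in (`fake_fetch`, with made-up values) instead of real calls to the sources endpoint.

```python
from typing import Callable, Optional

import pandas as pd


def add_journal_h_index(
    df: pd.DataFrame,
    fetch_h_index: Callable[[str], Optional[int]],
) -> pd.DataFrame:
    """Add a 'journal_h_index' column, calling fetch_h_index once per unique source_id."""
    unique_ids = df["source_id"].dropna().unique()
    h_index_map = {source_id: fetch_h_index(source_id) for source_id in unique_ids}
    enriched = df.copy()
    # Rows with a missing source_id simply get NaN
    enriched["journal_h_index"] = enriched["source_id"].map(h_index_map)
    return enriched


# Stand-in for the real API call; the values are invented for the example
calls = []

def fake_fetch(source_id: str) -> Optional[int]:
    calls.append(source_id)
    return {"S1": 25, "S2": 150}.get(source_id)


df = pd.DataFrame(
    {"doi": ["a", "b", "c", "d"], "source_id": ["S1", "S2", "S1", None]}
)
enriched = add_journal_h_index(df, fake_fetch)
print(enriched)
print("API calls made:", len(calls))
```

In a real run the fetcher would wrap the request to `.../sources/{source_id}` with the same retry and rate-limit handling used for the main collection.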

**Advantages of this approach:**
-   **Much faster:** You make roughly 10 times fewer API calls.
-   **Non-blocking:** You can start working in Notebook 02 with the main data while the enrichment script runs in the background.
-   **Cleaner:** You separate the collection logic from the enrichment logic.

**Conclusion:**
No, the problem was not the same one, and the solution is not automatic. But we now DO have the tools to get it. My recommendation is: **Don't do it now.** Run the main collection, and then, if you want that feature, we add it with a much more efficient enrichment script.

Should we apply something from the statistics week? Try several final DataFrames: one with the perfect matching, one with the less perfect matching, one with everything included (downloading the missing ones, with a breakdown by year, Type_Only, etc.), and another capped at 2022 so that it contains no non-retracted papers that end up retracted in the future. (Or does a non-retracted DOI mean it has already passed at least that initial review? Or is a paper marked as non-retracted from the very first second the author publishes it?)

Also look at what Rocío did, which Toño passed on to us: the "piña" model and another one for analyzing words or something like that.

Excellent questions. Clarifying this before starting is crucial to know what to expect at the end.

### 1. Will we have a column to filter by matching quality?

**Yes, absolutely.** That is already built into the code.

In the loop in Cell 3, when we find a twin, we do this:

```python
twin_features, quality_level = find_control_twin_hierarchical_robust(case_row)

if twin_features and quality_level:
    # Add the column with the match quality
    twin_features['match_quality'] = quality_level
    # And the DOI of the original case so it can be traced back
    twin_features['original_case_doi'] = case_row['doi']
    test_results_robust.append(twin_features)
```

In the end, your `non_retracted_enriched.jsonl` file (and the CSV you generate from it) will have a column called `match_quality`. When you load the data in Notebook 02, you will be able to filter very easily:

```python
# Load all controls
df_controls = pd.read_csv('../data/processed/non_retracted_controls_final.csv')

# Strict mode: only the highest-quality matches
df_strict_controls = df_controls[df_controls['match_quality'].isin(['Source_and_Country', 'Source_Only'])]

# Inclusive mode: every control that was found
df_inclusive_controls = df_controls.copy()
```

This is the flexibility you wanted, and it is already implemented.

### 2. Can we add the `Not_Found` cases at the end, without matching?

**Yes, and it is a very interesting idea to experiment with.**

At the end of the process, you will have a list of the retracted DOIs for which no twin could be found (the `controls_not_found.txt` file).

For your ML analysis, you could try a third approach:
1.  **Dataset 1 (Strict):** `df_retracted_con_gemelo` + `df_controls_estrictos`
2.  **Dataset 2 (Inclusive):** `df_retracted_con_gemelo` + `df_controls_inclusivos`
3.  **Dataset 3 (Fully Random):** `df_retracted_TOTAL` (all of them, with and without a twin) + an equal number of **completely random** controls (no matching of any kind, only that they are not retracted).

Comparing the performance of models trained on these three datasets would be a very thorough analysis and would show how much value the matching really adds.
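
One possible way to assemble Datasets 1 and 2 is sketched below. The helper `build_matched_dataset` and the toy DOIs are hypothetical. Note one design choice that goes beyond the description above: to keep the two classes balanced, the sketch also drops cases whose twin falls outside the selected match levels.

```python
import pandas as pd

STRICT_LEVELS = ["Source_and_Country", "Source_Only"]


def build_matched_dataset(df_cases, df_controls, levels=None):
    """Keep controls at the requested match levels plus the cases they were matched to."""
    if levels is None:
        controls = df_controls  # inclusive: every control that was found
    else:
        controls = df_controls[df_controls["match_quality"].isin(levels)]
    # Cases "with a twin" are those referenced by a kept control
    cases = df_cases[df_cases["doi"].isin(controls["original_case_doi"])]
    return pd.concat([cases, controls], ignore_index=True)


# Toy data with invented identifiers, only to show the mechanics
df_cases = pd.DataFrame(
    {"doi": ["case-1", "case-2", "case-3"], "is_retracted": [1, 1, 1]}
)
df_controls = pd.DataFrame(
    {
        "doi": ["ctrl-1", "ctrl-2"],
        "original_case_doi": ["case-1", "case-2"],
        "match_quality": ["Source_and_Country", "Type_Only"],
        "is_retracted": [0, 0],
    }
)

strict = build_matched_dataset(df_cases, df_controls, STRICT_LEVELS)
inclusive = build_matched_dataset(df_cases, df_controls)
print(len(strict), len(inclusive))
```

Here `case-3` has no twin, so it appears in neither dataset; it would only enter the fully random Dataset 3.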

### 3. Are they 100% non-retracted? And will the model know which one is the twin?

This is a very important two-part question.

**a) Are they 100% non-retracted?**
Yes, with very high confidence. Every request we make to the API includes the `is_retracted:false` filter. OpenAlex relies on the information publishers provide and on the Retraction Watch database (now owned by Crossref) to set that flag.

Could there be an article that was retracted yesterday and that OpenAlex has not yet updated? Yes, it is possible, but the probability is low. For our project, we can assume with high confidence that what OpenAlex marks as non-retracted was in fact not retracted at the time of the query. It is one of the most reliable "ground truth" sources available for this.

**b) Does the model know which control is the twin of each case?**
**No, and more importantly, it MUST NOT KNOW.**

Once you have your `df_retracted` and your `df_controls`, you combine them into a single "Golden Dataset":

```python
golden_dataset = pd.concat([df_retracted, df_controls])
```
And the most crucial step is to **shuffle** it:

```python
golden_dataset = golden_dataset.sample(frac=1).reset_index(drop=True)
```

From this point on, the model (and you too) **completely loses track of which control was the twin of which case**. The model only sees a list of ~110,000 articles, each with its features and a label (`is_retracted` = 1 or 0). For this to hold, the `match_quality` and `original_case_doi` columns, which are filled in only for controls, must be dropped before training; otherwise they both record the pairing and leak the label.

The purpose of matching was not to tell the model "compare this one with that one." The purpose was to **build a control group that, as a whole, has the same contextual characteristics as the group of cases**. By doing this, we force the model to ignore those contextual characteristics (because they do not help it discriminate) and to focus on the subtle signals, which is exactly our goal.
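
One way to check that the control group really mirrors the cases on a contextual field is to compare, for each value of that field, its share among cases and among controls. The sketch below is an assumption about how such a check could look; `compare_group_shares` is a hypothetical helper and the toy data is invented.

```python
import pandas as pd


def compare_group_shares(df, column, label="is_retracted"):
    """Share of each value of `column` within controls (0) and cases (1), side by side."""
    shares = pd.crosstab(df[column], df[label], normalize="columns")
    shares = shares.rename(columns={0: "controls", 1: "cases"})
    shares["abs_diff"] = (shares["cases"] - shares["controls"]).abs()
    return shares


# Toy dataset: cases are split evenly across years, controls lean toward 2009
toy = pd.DataFrame(
    {
        "publication_year": [2009, 2009, 2022, 2022, 2009, 2009, 2009, 2022],
        "is_retracted": [1, 1, 1, 1, 0, 0, 0, 0],
    }
)

shares = compare_group_shares(toy, "publication_year")
print(shares)
```

Large values in `abs_diff` for a matched field such as `publication_year` would suggest that many twins came from the looser cascade levels.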

You are ready. You have all the pieces and the logic under control. Green light for the collection!

CATEGORICAL FEATURES: GEMINI TOLD ME EVERYTHING THAT NEEDS TO BE DONE