# Data Preprocessing Pipeline  
This Notebook documents the complete data preprocessing workflow.     
  
The project uses the [arXiv Dataset](https://www.kaggle.com/Cornell-University/arxiv). We also provide the [processed dataset](https://github.com/xinhuangcs/PaperTrail/releases) for direct use.
  
1. **Original Size**: ~2.84M papers.
2. **Filtering**: We keep papers whose categories start with prefixes such as `cs.` (Computer Science), `stat.` (Statistics), and `eess.` (Electrical Engineering and Systems Science).
3. **Add citation data**: We use the OpenAlex API to attach citation counts to the papers.
4. **Final Dataset**: Contains ~730,000 papers used for the search index.
5. **Preprocessing**: Text cleaning, stop word removal, and stemming are applied before indexing.  




Our dataset was downloaded on September 26, 2025, and the statistics are based on this date.  
While adding citation data, we found that OpenAlex has not yet indexed some publications released in 2025. Our dataset therefore only includes papers published up to the end of 2024.  
  
The core logic is implemented in Python scripts in the `src/preprocess_data/` directory.  
Because the source code is large, this Notebook serves as the scheduling layer: it runs these scripts sequentially and explains the logic behind each step.  
  
**⏱️ Estimated runtime (excluding data-download time):** more than 200 hours  
  * **\>200 hours**: Get the newest original [arXiv Dataset](https://www.kaggle.com/Cornell-University/arxiv) and call the OpenAlex API to add citation data.  
  * **\<1 hour**: Data cleaning.

**1. Environment & Path Setup**
**Prerequisites**:  
- Python 3.12+  
- When running the Notebook, make sure the current working directory is `src/jupyter_notebook/` so that the relative paths used below resolve correctly.
- **Download Data**: Get the newest original [arXiv Dataset](https://www.kaggle.com/Cornell-University/arxiv) and place `arxiv-metadata-oai-snapshot.json` in `data/preprocess/`.

In [12]:
!pip install -r ../requirements.txt



**2. Reduce 2.8M papers to a manageable subset relevant to CS** (src/preprocess_data/0_1_reduce_categories.py):

In [13]:
%run ../src/preprocess_data/0_1_reduce_categories.py

Filtering: 100%|██████████| 2840638/2840638 [00:12<00:00, 226262.27line/s]


=== Summary ===
Input:     /Users/jasonh/Desktop/02807/PaperTrail/data/preprocess/arxiv-metadata-oai-snapshot.json
Output:    /Users/jasonh/Desktop/02807/PaperTrail/data/preprocess/arxiv-cs-data.json
Read:      2,840,638
Kept:      895,809
Skipped:   1,944,829
By category:
  - Computer Science: 1,238,160
  - Statistics: 163,543
  - EESS: 116,936
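
The prefix filter described in this step can be sketched as follows. This is a minimal illustration, not the code in `0_1_reduce_categories.py`: the names `KEEP_PREFIXES`, `keep_paper`, and `filter_lines` are hypothetical, and the sketch assumes each record stores its categories as a single space-separated string in a `categories` field.

```python
import json

# Prefixes named in the filtering step; the real script may keep more.
KEEP_PREFIXES = ("cs.", "stat.", "eess.")

def keep_paper(record: dict) -> bool:
    """Return True if any listed category starts with a kept prefix."""
    # Assumption: categories are stored as one space-separated string.
    categories = record.get("categories", "").split()
    return any(cat.startswith(KEEP_PREFIXES) for cat in categories)

def filter_lines(lines):
    """Yield only the JSON lines whose record passes keep_paper."""
    for line in lines:
        line = line.strip()
        if line and keep_paper(json.loads(line)):
            yield line

# Tiny made-up sample, one JSON record per line.
sample = [
    '{"id": "a", "categories": "cs.LG stat.ML"}',
    '{"id": "b", "categories": "hep-th"}',
    '{"id": "c", "categories": "math.ST eess.SP"}',
]
kept = [json.loads(line)["id"] for line in filter_lines(sample)]
print(kept)
```

A paper can list several categories, so a per-category tally over kept papers can add up to more than the number of papers kept; that may be why the counts in the summary above exceed the `Kept` total.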





**3. Add citation counts from OpenAlex using DOI and title matching (sliced execution)** (`src/preprocess_data/1_0_add_incite_num.py`)  
We enriched the dataset with citation counts by querying the OpenAlex API by DOI, falling back to fuzzy title matching for entries without a DOI.


**Because the full dataset takes more than 200 hours to process, we manually constructed a minimal dataset here to verify that the script runs inside Jupyter.**  


The processed files are available from the releases section of our GitHub repository.
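
The DOI-first lookup with a title fallback might look roughly like the sketch below. It is an illustration under stated assumptions, not the logic of `1_0_add_incite_num.py`: the two lookup callables are injected so the example runs offline, the similarity measure (`difflib.SequenceMatcher`) and the 0.8 threshold are assumptions, and each result is assumed to be a dict with `title` and `cited_by_count` keys.

```python
from difflib import SequenceMatcher

NOT_FOUND = -1  # sentinel for "no citation count found"

def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def resolve_citation_count(paper, get_by_doi, search_by_title, threshold=0.8):
    """Try an exact DOI lookup first, then accept a sufficiently similar title."""
    doi = paper.get("doi")
    if doi:
        work = get_by_doi(doi)  # hypothetical callable: DOI -> dict or None
        if work is not None:
            return work["cited_by_count"]
    # Hypothetical callable: title -> list of candidate dicts.
    for work in search_by_title(paper["title"]):
        if title_similarity(paper["title"], work["title"]) >= threshold:
            return work["cited_by_count"]
    return NOT_FOUND

# Offline stand-ins for the API calls (made-up data).
fake_by_doi = {"10.1/x": {"title": "A", "cited_by_count": 12}}
fake_search = lambda title: [{"title": "Attention Is All You Need", "cited_by_count": 5}]

by_doi = resolve_citation_count({"doi": "10.1/x", "title": "A"}, fake_by_doi.get, fake_search)
by_title = resolve_citation_count({"doi": None, "title": "Attention is all you need."}, fake_by_doi.get, fake_search)
missing = resolve_citation_count({"title": "Completely different"}, fake_by_doi.get, fake_search)
print(by_doi, by_title, missing)
```

Injecting the lookups keeps the matching logic testable without network access; in a real run they would wrap HTTP requests to OpenAlex with a pause between calls.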

In [18]:
%run ../src/preprocess_data/1_0_add_incite_num.py

[i] Parameters:
    INPUT_FILE  = /Users/jasonh/Desktop/02807/PaperTrail/data/preprocess/arxiv-cs-data.json
    OUTPUT_FILE = /Users/jasonh/Desktop/02807/PaperTrail/data/preprocess/arxiv-cs-data-with-citations_slice6.jsonl
    CACHE_FILE  = /Users/jasonh/Desktop/02807/PaperTrail/data/preprocess/citation_cache.json
    SLICE       = 6/19
    SLEEP_SECS  = 0.2
[i] Cache loaded: 0 entries
[i] Total lines=895809, this slice range=268741 ~ 313530 (total 44790 lines)


Slice 6/19:   0%|          | 7/44790 [00:03<5:55:15,  2.10rec/s] 
 Interrupted, saving cache
Slice 6/19:   0%|          | 8/44790 [00:04<6:39:57,  1.87rec/s]

Processing completed: processed=8, written=8
Output file: /Users/jasonh/Desktop/02807/PaperTrail/data/preprocess/arxiv-cs-data-with-citations_slice6.jsonl
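
The slice range printed in the log above is consistent with splitting the input into 20 equal, zero-indexed slices (so `6/19` would be index 6 out of a maximum index of 19). The sketch below reproduces that arithmetic; the slice count, the 1-based line numbering, and the rule that the last slice absorbs the remainder are inferences from the log, not taken from the script.

```python
def slice_range(total_lines: int, n_slices: int, index: int) -> tuple[int, int]:
    """Return the 1-based, inclusive (start, end) line range of one slice."""
    size = total_lines // n_slices
    start = index * size + 1
    # Assumption: the last slice also takes the leftover lines.
    end = total_lines if index == n_slices - 1 else (index + 1) * size
    return start, end

slice6 = slice_range(895_809, 20, 6)
last = slice_range(895_809, 20, 19)
print(slice6, last)
```

With these assumptions, slice 6 covers lines 268741 to 313530 (44790 lines), which matches the logged range.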





**4. Merge distributed processing slices into a unified dataset** (`src/preprocess_data/1_1_merge_slices.py`)  


This script merges the separate JSONL slice files into a single final dataset file. It reads and writes line by line, so the data is consolidated without loading whole files into memory.


Here, we manually constructed a minimal dataset to verify that the script runs inside Jupyter.  


The processed files are available from the releases section of our GitHub repository.
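
A line-by-line merge of this kind can be sketched as below. The function name `merge_jsonl` and the file names are hypothetical; the real script may order, validate, or deduplicate slices differently.

```python
import tempfile
from pathlib import Path

def merge_jsonl(slice_paths, out_path) -> int:
    """Append every non-empty line of each slice to out_path; return lines written."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in slice_paths:
            with open(path, "r", encoding="utf-8") as src:
                for line in src:  # streams one line at a time
                    line = line.strip()
                    if not line:
                        continue
                    out.write(line + "\n")
                    written += 1
    return written

# Demonstration on two tiny temporary slice files.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "slice0.jsonl").write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")
    (tmp / "slice1.jsonl").write_text('{"id": 3}\n\n', encoding="utf-8")
    merged = tmp / "merged.jsonl"
    total = merge_jsonl(sorted(tmp.glob("slice*.jsonl")), merged)
    merged_lines = merged.read_text(encoding="utf-8").splitlines()
print(total, merged_lines)
```

Because only one line is held in memory at a time, peak memory use does not grow with the size of the slices.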



In [20]:
%run ../src/preprocess_data/1_1_merge_slices.py

Merging (any order): 100%|██████████| 1/1 [00:00<00:00, 1855.07file/s]

Merge done → /Users/jasonh/Desktop/02807/PaperTrail/data/preprocess/arxiv-cs-data-with-citations-final-dataset.json
[i] Files: 1 | Lines: 8





**5. Fix missing citation data using a multi-stage fallback strategy (truncated title/abstract search)** (`src/preprocess_data/1_2_retry_record_of_negone.py`)  
We addressed records with missing citation data (`citation_count == -1`) with a multi-stage fallback strategy that queries OpenAlex in turn by DOI, full title, truncated title, and abstract snippet, stopping at the first match.


Here, we manually constructed a minimal dataset to verify that the script runs inside Jupyter.  


The processed files are available from the releases section of our GitHub repository.
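
The staged fallback can be pictured as a list of progressively looser queries that are tried in order until one succeeds. The sketch below is illustrative only: `build_queries`, `retry_citation_count`, the stage names, and the word limits (`title_words=6`, `abstract_words=20`) are hypothetical, and `search` is an injected stand-in for the OpenAlex request.

```python
def build_queries(paper, title_words=6, abstract_words=20):
    """Yield (stage, query) pairs from most to least specific."""
    title = " ".join(paper.get("title", "").split())  # collapse whitespace
    abstract = " ".join(paper.get("abstract", "").split())
    if paper.get("doi"):
        yield "doi", paper["doi"]
    if title:
        yield "title", title
        truncated = " ".join(title.split()[:title_words])
        if truncated != title:  # only retry if truncation changes the query
            yield "truncated_title", truncated
    if abstract:
        yield "abstract", " ".join(abstract.split()[:abstract_words])

def retry_citation_count(paper, search):
    """Return (stage, count) for the first stage that finds a match, else ("none", -1)."""
    for stage, query in build_queries(paper):
        count = search(stage, query)  # hypothetical callable: returns int or None
        if count is not None:
            return stage, count
    return "none", -1

paper = {
    "doi": None,
    "title": "Deep  Residual Learning for Image Recognition: Extended Version",
    "abstract": "We present a residual learning framework.",
}
# Stand-in search that only succeeds on the truncated title.
fake_search = lambda stage, query: 42 if stage == "truncated_title" else None
resolved = retry_citation_count(paper, fake_search)
unresolved = retry_citation_count({"title": "X"}, lambda stage, query: None)
print(resolved, unresolved)
```

Recording which stage produced the match makes it possible to print a per-stage summary like the one in the script output.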



In [21]:
%run ../src/preprocess_data/1_2_retry_record_of_negone.py

[i] Parameters:
    INPUT_FILE   = /Users/jasonh/Desktop/02807/PaperTrail/data/preprocess/arxiv-cs-data-with-citations_merged_zrk_5.json
    OUTPUT_FILE  = /Users/jasonh/Desktop/02807/PaperTrail/data/preprocess/arxiv-cs-data-with-citations_final_dataset_odd.json
    CACHE_FILE   = /Users/jasonh/Desktop/02807/PaperTrail/data/preprocess/citation_cache.json
    SLEEP_SECS   = 0.25
    TITLE_SIM    = 0.8
[i] Cache loaded: 8 entries
[i] Records to re-check (citation_count == -1): 4


Re-check (-1 only): 100%|██████████| 4/4 [00:03<00:00,  1.25rec/s]

Match summary:
  DOI:     attempt=0, ok=0, nf=0
  Title:   attempt=4, ok=0, nf=4
  Title4:  attempt=4, ok=3, nf=1, skipped=0
  Abstract:attempt=1, ok=0, nf=1, skipped=0
  Final:   ok=3, neg1=1
Second pass done. -1 processed=4, total written=96
Output file: /Users/jasonh/Desktop/02807/PaperTrail/data/preprocess/arxiv-cs-data-with-citations_final_dataset_odd.json





**6. Normalize text data and prepare `processed_content` for embedding/indexing** (`src/preprocess_data/2_data_filtering.py`)  



We implemented a text normalization pipeline that standardizes unstructured paper metadata by merging each title and abstract into a single text feature. It then applies a sequence of transformations such as lowercasing, noise removal, and Porter stemming.

Here, we manually constructed a minimal dataset to verify that the script runs inside Jupyter.  


The processed files are available from the releases section of our GitHub repository.
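
As a rough sketch of such a normalization chain, the example below merges title and abstract, lowercases, strips non-letter characters, removes stop words, and stems each token. Everything here is an assumption for illustration: the stop-word list is a tiny made-up subset, the noise rule is a simple regex, and the stemmer is injectable (NLTK's `PorterStemmer` is used if NLTK is installed, otherwise tokens are left unstemmed).

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "and", "are", "for", "in", "is", "of",
                        "on", "the", "to", "we", "with"})

def default_stemmer():
    """Return NLTK's Porter stemmer if available, else an identity function."""
    try:
        from nltk.stem import PorterStemmer
    except ImportError:
        return lambda word: word
    return PorterStemmer().stem

def build_processed_content(paper, stem=None):
    """Merge title and abstract, then lowercase, clean, filter, and stem."""
    stem = stem or default_stemmer()
    text = f"{paper.get('title', '')} {paper.get('abstract', '')}".lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # drop digits, punctuation, symbols
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(stem(tok) for tok in tokens)

paper = {
    "title": "Graph Neural Networks",
    "abstract": "We survey methods for learning on graphs (2020).",
}
# Toy stand-in stemmer (strips a trailing "s") so the demo needs no dependency.
toy_stem = lambda word: word[:-1] if word.endswith("s") else word
processed = build_processed_content(paper, stem=toy_stem)
print(processed)
```

Passing the stemmer as a parameter keeps the rest of the chain easy to test in isolation and makes it simple to swap in a different stemmer.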

In [24]:
import sys
!{sys.executable} ../src/preprocess_data/2_data_filtering.py

Using batch processing for large dataset...
Processing large dataset in batches of 1000...
Completed! Total processed: 8 papers
Saved to: /Users/jasonh/Desktop/02807/PaperTrail/data/preprocess/arxiv-cs-data-with-citations-final-dataset_preprocessed.json
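
The batch processing mentioned in the output above can be sketched with a small generator that groups any iterable into fixed-size lists. The helper name `in_batches` is hypothetical; only the batch size of 1000 comes from the log.

```python
from itertools import islice

def in_batches(iterable, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

# 2500 dummy records split into batches of 1000.
batch_sizes = [len(batch) for batch in in_batches(range(2500), 1000)]
print(batch_sizes)
```

Processing one batch at a time bounds memory use, since only the current batch has to be held in memory.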
