# Data preparation

Run the following code to prepare all data for the experiments.

## Prerequisites
- Python 3.8 or newer
- pip

To obtain a larger sample of papers on which logistic regression can be performed, it is advised to obtain a **Semantic Scholar API key**, as mentioned in the original codebase.
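
As a rough illustration of how such a key is typically attached to a Semantic Scholar Graph API request, the sketch below builds a request without sending it. The environment variable name `S2_API_KEY` and the helper `build_paper_request` are assumptions made for this example; the scripts in this repository may read the key differently.

```python
import os
import urllib.request

# Hypothetical environment variable name; the repository scripts may expect another.
API_KEY_ENV = "S2_API_KEY"
BASE_URL = "https://api.semanticscholar.org/graph/v1/paper/"


def build_paper_request(arxiv_id, fields=("title", "citationCount")):
    """Build (but do not send) a request for one paper identified by its arXiv ID."""
    url = f"{BASE_URL}ARXIV:{arxiv_id}?fields={','.join(fields)}"
    request = urllib.request.Request(url)
    api_key = os.environ.get(API_KEY_ENV)
    if api_key:
        # The key travels in the x-api-key header; without it the request is unauthenticated.
        request.add_header("x-api-key", api_key)
    return request


request = build_paper_request("1706.03762")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual call; it is left out here so the sketch runs offline.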


## Setup

In [None]:
!pip install -r requirements.txt

## arXiv Dataset

Download the original dataset from the public arXiv Dataset on [Kaggle](https://www.kaggle.com/datasets/Cornell-University/arxiv/data).

### Step 1
Convert the downloaded JSON dataset into CSV for easier processing.

In [None]:
!python import_metadata.py
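
For orientation, a conversion of this kind can be sketched as follows. It assumes the metadata snapshot stores one JSON object per line; the sample records and the chosen column subset are made up for illustration and do not claim to match what `import_metadata.py` does.

```python
import csv
import io
import json

# Two made-up records in a one-JSON-object-per-line layout (illustrative data only).
SAMPLE_JSON_LINES = """\
{"id": "0001.0001", "title": "A first paper", "categories": "cs.LG stat.ML"}
{"id": "0001.0002", "title": "A second paper", "categories": "math.CO"}
"""

FIELDS = ["id", "title", "categories"]  # assumed subset of columns


def json_lines_to_csv(source, target, fields=FIELDS):
    """Stream records from `source` to `target` as CSV, keeping only `fields`."""
    writer = csv.DictWriter(target, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for line in source:
        if not line.strip():
            continue  # skip blank lines
        writer.writerow(json.loads(line))
        count += 1
    return count


output = io.StringIO()
rows_written = json_lines_to_csv(io.StringIO(SAMPLE_JSON_LINES), output)
print(rows_written)
print(output.getvalue())
```

Reading line by line keeps memory use flat, which matters for a snapshot that is too large to load in one piece.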

### Step 2
Generate `categories.csv` for the next script.

In [None]:
!python create_categories.py

### Step 3
Select only papers in the Computer Science category, and split the dataset into a train and a test set.

In [None]:
!python explore_data.py
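
The selection and split can be pictured with the minimal sketch below. It assumes that `categories` is a space-separated string of category codes and that a paper counts as Computer Science if any code starts with `cs.`; the records, the test fraction, and the helper names are hypothetical, and `explore_data.py` may apply different rules.

```python
import random

# Made-up records; `categories` is assumed to be a space-separated string of codes.
papers = [
    {"id": f"p{i}", "categories": cats}
    for i, cats in enumerate(
        ["cs.LG stat.ML", "math.CO", "cs.IR", "physics.optics cs.CV", "q-bio.NC", "cs.CL"]
    )
]


def is_computer_science(paper):
    """True if any listed category belongs to the cs.* group."""
    return any(code.startswith("cs.") for code in paper["categories"].split())


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and cut off the last `test_fraction` as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


cs_papers = [p for p in papers if is_computer_science(p)]
train, test = train_test_split(cs_papers, test_fraction=0.25)
print(len(cs_papers), len(train), len(test))
```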

### Step 4
Acquire additional details for each paper in the test set, necessary for logistic regression.

In [None]:
!python get_paper_details.py

### Step 5
*(Optional: sample dataset)* 
Sample the test (and/or train) set to the desired size.

In [None]:
!python sample_dataset.py

### Step 6
Acquire additional details for all authors, external papers that cited a paper of the dataset, and all papers that a paper in the dataset cites.

In [None]:
!python get_authors_papers.py
!python get_paper_citations.py
!python get_references.py

### Step 7
Generate TF-IDF embeddings for all papers in the train and test sets and, based on cosine similarity, for all authors that appear in both.

In [None]:
!python tfidf_authors.py
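
To make the two ingredients of this step concrete, here is a small pure-Python sketch of TF-IDF weighting and cosine similarity. The tokenisation, the IDF formula (`log(N / df) + 1`), and the sample texts are assumptions chosen for brevity; the sketch does not reproduce how `tfidf_authors.py` builds paper or author embeddings.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up texts standing in for paper abstracts.
abstracts = [
    "graph neural networks for recommendation",
    "neural networks for image recognition",
    "protein folding dynamics",
]
vectors = tfidf_vectors(abstracts)
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

Texts that share terms score above zero, while texts with no terms in common score exactly zero.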

### Step 8
Randomly sample 1,000 authors for the experiments.

In [None]:
!python sample_random_authors.py
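
A reproducible draw of this kind can be sketched as below. The placeholder identifiers, the seed, and the helper name are assumptions; `sample_random_authors.py` may sample differently.

```python
import random

# Placeholder identifiers standing in for the real author list.
author_ids = [f"author_{i}" for i in range(5000)]


def sample_authors(ids, size=1000, seed=42):
    """Draw `size` distinct authors; a fixed seed makes the draw repeatable."""
    return random.Random(seed).sample(ids, size)


selected = sample_authors(author_ids)
print(len(selected))
```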

### Step 9
Generate a single data file containing the embeddings of all authors and all papers of the test set. This is required input for the experiments.

In [None]:
!python generate_data_source_file.py
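
As a sketch of what bundling embeddings into one pickle file can look like, the example below writes and reloads a small dictionary. The dictionary layout (`authors` and `papers` keys mapping identifiers to vectors) is purely an assumption for illustration; the actual structure written by `generate_data_source_file.py` may differ.

```python
import pickle
import tempfile
from pathlib import Path

# Assumed layout: identifier -> embedding vector, grouped by entity type.
data = {
    "authors": {"author_0": [0.1, 0.9]},
    "papers": {"paper_0": [0.4, 0.6]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data_source_file_experiments.pickle"
    with path.open("wb") as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with path.open("rb") as handle:
        restored = pickle.load(handle)

print(restored == data)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.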

### Step 10
Perform logistic regression on arXiv dataset.

In [None]:
!python logistic_regression.py
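
For readers unfamiliar with the technique, the following is a minimal logistic regression fitted by batch gradient descent on made-up one-dimensional data. The feature values, labels, learning rate, and epoch count are all assumptions; `logistic_regression.py` may use different features and a library implementation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(features, labels, learning_rate=0.5, epochs=1000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict_probability(x, weights, bias):
    """Estimated probability that sample `x` belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Made-up data: one feature, with class 1 for the larger values.
X = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic_regression(X, y)
print(predict_probability([0.0], weights, bias), predict_probability([1.0], weights, bias))
```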

## Amazon Books Reviews dataset
The dataset for the extension is available on [Kaggle](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews/data).

### Step 1
Split the dataset into a train and a test set, create embeddings for all books, and create embeddings for 1,000 random authors.

In [None]:
%cd ./Amazon

In [None]:
!python amazon_exp_preparation.py

### Step 2
Generate a single data file containing the embeddings of all authors and all books of the test set. This is required input for the experiments.

In [None]:
!python ../generate_data_source_file.py --df ./Amazon/

### Step 3
Perform logistic regression on Amazon dataset.

In [None]:
!python amazon_logreg.py

# Experiments

## arXiv Dataset

### Step 1
Perform experiment 1, recreating the unused figure in the original paper.

In [None]:
%cd ..

In [None]:
!python ./experiments/fig1.py --n 20 --m 40 --curves 10 --curve_pts 50 --ff ./experiments/fig1.png

### Step 2
Perform experiment 2a, recreating figure 1a in the original paper.

In [None]:
!python ./experiments/fig2a.py --n 20 --m 40 --curves 10 --curve_pts 50 --clusters 25 --components 2 --df ./Data/data_source_file_experiments.pickle --ff ./experiments/fig2a.png

### Step 3
Perform experiment 2b, recreating figure 1b in the original paper.

In [None]:
!python ./experiments/fig2b.py --n 20 --m 40 --curves 10 --curve_pts 50 --beta 0.9 --df ./Data/data_source_file_experiments.pickle --ff ./experiments/fig2b.png

## Amazon Books Reviews dataset

### Step 1
Perform experiment 2a for the Amazon Books Reviews dataset, recreating figure 1a in the original paper with new data.

In [None]:
!python ./experiments/fig2a.py --n 20 --m 40 --curves 10 --curve_pts 50 --clusters 25 --components 2 --df ./Amazon/Data/data_source_file_experiments.pickle --ff ./experiments/fig2a_amazon.png

### Step 2
Perform experiment 2b for the Amazon Books Reviews dataset, recreating figure 1b in the original paper with new data.

In [None]:
!python ./experiments/fig2b.py --n 20 --m 40 --curves 10 --curve_pts 50 --beta 0.9 --df ./Amazon/Data/data_source_file_experiments.pickle --ff ./experiments/fig2b_amazon.png

## Multiple recommendations per author


### Step 1
Perform experiment 2a with recommending k=3 papers per author, recreating figure 1a in the original paper with different k.

In [None]:
!python ./experiments/fig2a.py --n 20 --m 40 --curves 10 --curve_pts 50 --clusters 25 --components 2 --k 3 --df ./Data/data_source_file_experiments.pickle --ff ./experiments/fig2a_k3.png

### Step 2
Perform experiment 2b with recommending k=3 papers per author, recreating figure 1b in the original paper with different k.

In [None]:
!python ./experiments/fig2b.py --n 20 --m 40 --curves 10 --curve_pts 50 --beta 0.9 --k 3 --df ./Data/data_source_file_experiments.pickle --ff ./experiments/fig2b_k3.png

### Step 3
Perform experiment 2a with recommending k=5 papers per author, recreating figure 1a in the original paper with different k.

In [None]:
!python ./experiments/fig2a.py --n 20 --m 40 --curves 10 --curve_pts 50 --clusters 25 --components 2 --k 5 --df ./Data/data_source_file_experiments.pickle --ff ./experiments/fig2a_k5.png

### Step 4
Perform experiment 2b with recommending k=5 papers per author, recreating figure 1b in the original paper with different k.

In [None]:
!python ./experiments/fig2b.py --n 20 --m 40 --curves 10 --curve_pts 50 --beta 0.9 --k 5 --df ./Data/data_source_file_experiments.pickle --ff ./experiments/fig2b_k5.png
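
The four commands in this section differ only in the script, the value of `--k`, and the output file name, so they can also be generated in a loop. The sketch below only builds and prints the command lines; the helper name `build_commands` is an assumption, and the flags are copied from the cells above without interpreting them.

```python
import shlex

# Arguments shared by every run in this section, copied from the cells above.
COMMON = ["--n", "20", "--m", "40", "--curves", "10", "--curve_pts", "50"]
DATA_FILE = "./Data/data_source_file_experiments.pickle"


def build_commands(k_values=(3, 5)):
    """Return the fig2a and fig2b command lines for each value of k."""
    commands = []
    for k in k_values:
        commands.append([
            "python", "./experiments/fig2a.py", *COMMON,
            "--clusters", "25", "--components", "2", "--k", str(k),
            "--df", DATA_FILE, "--ff", f"./experiments/fig2a_k{k}.png",
        ])
        commands.append([
            "python", "./experiments/fig2b.py", *COMMON,
            "--beta", "0.9", "--k", str(k),
            "--df", DATA_FILE, "--ff", f"./experiments/fig2b_k{k}.png",
        ])
    return commands


for command in build_commands():
    print(shlex.join(command))
```

To execute the runs instead of printing them, each list could be passed to `subprocess.run(command, check=True)` from the repository root.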