# arXiv Recommender System

This notebook makes the arXiv recommender system easier to use and explains what each of its components does.

## 1. Data Acquisition from arXiv

### We developed `DataProcessor.py` to perform two main tasks:
1. Download raw data for a specified query, such as `record_subject = 'PDE'`. With `recorded_article_count` set to 20000, metadata for 20000 articles is fetched and stored locally in `recorded_articles.csv`. The raw data contains the following fields:

    * `Title`
    * `Authors`
    * `Published`
    * `Abstract`
    * `Link`
    * `PrimaryCategory`
    * `Categories`
    * `Article Abstract`
    * `Article Tittle`
   
2. Process the downloaded data and save it in vectorized form as sparse matrices. The `Abstract` and `Title` features of each article go through the following steps:
    * Clean the text, e.g., remove special characters and fix formatting issues
    * Tokenize
    * Remove stopwords
    * Lemmatize
    * Vectorize
    * Save the result as a sparse matrix in a `.pkl` file, along with the vectorizers, in the `data` folder
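
The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code in `DataProcessor.py`: the stopword set is a tiny placeholder, `naive_lemma` is a hypothetical stand-in for a real lemmatizer, the TF-IDF weights (smoothed IDF, L2-normalized rows) are computed by hand, and each document is stored as a `{term_index: weight}` dict to mimic one row of a sparse matrix.

```python
import math
import pickle
import re
from collections import Counter

# Tiny placeholder stopword list; a real pipeline would use a much larger one.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "we"}

def clean(text):
    """Lowercase, replace every non-letter with a space, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def naive_lemma(token):
    """Hypothetical stand-in for a lemmatizer: strips a plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text):
    """Clean, tokenize on whitespace, drop stopwords, lemmatize."""
    return [naive_lemma(t) for t in clean(text).split() if t not in STOPWORDS]

def tfidf_fit_transform(docs):
    """Return (vocabulary, rows); each row is a sparse {term_index: weight} dict."""
    tokenized = [preprocess(d) for d in docs]
    vocab = {t: i for i, t in enumerate(sorted({t for doc in tokenized for t in doc}))}
    n_docs = len(tokenized)
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # term frequency times smoothed inverse document frequency
        row = {vocab[t]: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({i: w / norm for i, w in row.items()})
    return vocab, rows

# Made-up abstracts, for illustration only.
abstracts = [
    "We prove local existence of solutions to nonlinear PDEs.",
    "Existence & uniqueness for a stochastic heat equation!",
]
vocab, rows = tfidf_fit_transform(abstracts)

# Serialize the rows together with the vocabulary needed to vectorize new text.
# In practice the payload would be written to a .pkl file.
payload = pickle.dumps({"vocabulary": vocab, "rows": rows})
restored = pickle.loads(payload)
```

Keeping the vocabulary next to the matrix matters because new text, such as a user's own articles, has to be mapped to the same term indices before it can be compared with the stored rows.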

_Remark_: Downloading and processing 20000 articles takes about 20 minutes on average.
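
For the download task, the public arXiv API returns search results as an Atom XML feed. The sketch below shows how one feed entry could be mapped to the raw-data fields listed above. Whether `DataProcessor.py` works this way is an assumption; `build_query_url` and `parse_feed` are hypothetical helpers, the `all:` search prefix is one possible choice, and `SAMPLE_FEED` is a hand-written placeholder rather than a real API response.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
ARXIV = "{http://arxiv.org/schemas/atom}"

def build_query_url(subject, start=0, max_results=100):
    """Build an arXiv API query URL; `start` pages through the results."""
    params = {"search_query": f"all:{subject}", "start": start, "max_results": max_results}
    return "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(params)

def parse_feed(xml_text):
    """Convert an Atom feed string into a list of row dicts."""
    rows = []
    for entry in ET.fromstring(xml_text).iter(ATOM + "entry"):
        pdf_links = [link.get("href") for link in entry.findall(ATOM + "link")
                     if link.get("title") == "pdf"]
        primary = entry.find(ARXIV + "primary_category")
        rows.append({
            # Titles and abstracts may contain line breaks; collapse them.
            "Title": " ".join(entry.findtext(ATOM + "title", "").split()),
            "Authors": ", ".join(a.findtext(ATOM + "name", "")
                                 for a in entry.findall(ATOM + "author")),
            "Published": entry.findtext(ATOM + "published", ""),
            "Abstract": " ".join(entry.findtext(ATOM + "summary", "").split()),
            "Link": pdf_links[0] if pdf_links else None,
            "PrimaryCategory": primary.get("term") if primary is not None else None,
            "Categories": [c.get("term") for c in entry.findall(ATOM + "category")],
        })
    return rows

# Placeholder feed with one invented entry, for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2000-01-01T00:00:00Z</published>
    <title>A placeholder
      title</title>
    <summary>A placeholder abstract.</summary>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <link href="http://arxiv.org/abs/0000.00000v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
    <arxiv:primary_category term="math.AP"/>
    <category term="math.AP"/>
    <category term="math.PR"/>
  </entry>
</feed>"""

url = build_query_url("PDE", start=0, max_results=100)
records = parse_feed(SAMPLE_FEED)
```

Fetching 20000 articles would mean requesting the URL repeatedly with an increasing `start` value and appending the parsed rows to the CSV file.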

#### __IMPORTANT NOTE__: Do not run the following code unless you want to download a new dataset.

In [3]:
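```python
from src.DataProcessor import ProcessData

# Search query and number of articles to download
record_subject = 'PDE'
recorded_article_count = 20000

# Download the raw data, then vectorize it and save the results under ./data
ProcessData(record_subject, recorded_article_count)
```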

```
Saving Positive speed of pr ...: 100%|██████████| 20000/20000 [16:44<00:00, 19.90it/s]

Raw data installed @ ./data/recorded_articles.csv
Vectorized Abstract data is successfully saved at @ data/Abstract_tfidf_sparse_matrix.pkl
Vectorized Title data is successfully saved at @ data/Title_tfidf_sparse_matrix.pkl
```


## 2. arXiv Recommender Engine

### The recommender engine has the following components

Given a `user_name`, the engine performs the following steps:

1. Download up to 5 of the user's articles and vectorize them with the vectorizer saved during the data processing step.

2. Use cosine similarity to score each article in the recorded data against the selected user articles.
    * Each user article (up to five) is compared with every recorded article, and a recorded article's final similarity score is the highest score it obtains against any of the user articles.
    * Similarity is computed separately for the `Abstract` and `Title` features, and the overall score is their weighted average, with weight `0.8` for `Abstract` and `0.2` for `Title`.


3. Identify and record the top 5 articles with the highest similarity scores, excluding the user's own articles.
    * For efficiency, the second and third steps are combined, so there is no need to store a score for every article and sort them all at the end.
    * A heap keeps only the current top candidates, which saves both memory and time.


4. Compile and present the results in the form of a DataFrame.
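
The scoring rule in step 2 can be sketched as follows. The source does not say whether the maximum over user articles is taken before or after the weighted average; this sketch assumes the weighted average is computed per user article and the maximum is taken afterwards. Vectors are represented as `{index: value}` dicts, and the helper names and toy vectors are hypothetical.

```python
import math

ABSTRACT_WEIGHT = 0.8
TITLE_WEIGHT = 0.2

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    dot = sum(val * v.get(idx, 0.0) for idx, val in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def article_score(candidate, user_articles):
    """Best weighted similarity between one recorded article and any user article.

    Assumption: weighted average first, then the maximum over user articles.
    """
    return max(
        ABSTRACT_WEIGHT * cosine(candidate["abstract"], ua["abstract"])
        + TITLE_WEIGHT * cosine(candidate["title"], ua["title"])
        for ua in user_articles
    )

# Toy vectors, for illustration only.
candidate = {"abstract": {0: 1.0}, "title": {1: 1.0}}
user_articles = [
    {"abstract": {0: 3.0}, "title": {2: 1.0}},  # same abstract direction, unrelated title
    {"abstract": {5: 1.0}, "title": {1: 2.0}},  # unrelated abstract, same title direction
]
score = article_score(candidate, user_articles)
print(score)  # 0.8: the first user article matches on the abstract only
```

With these weights, a candidate that matches a user article only on the title can score at most `0.2`, so the abstract dominates the ranking.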

On average this process takes under a minute.
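
The way steps 2 and 3 can be combined is illustrated below with a bounded min-heap. This is a sketch, not the code in `src/recommender`: `top_k` is a hypothetical helper, the user's own articles are identified by title as an assumption, and the scores are invented.

```python
import heapq

def top_k(scored_articles, user_titles, k=5):
    """Keep the k best (score, title) pairs in one pass using a size-k min-heap."""
    heap = []  # the smallest retained score always sits at heap[0]
    for title, score in scored_articles:
        if title in user_titles:
            continue  # exclude the user's own articles
        if len(heap) < k:
            heapq.heappush(heap, (score, title))
        elif score > heap[0][0]:
            # The new score beats the weakest retained one: swap it in.
            heapq.heapreplace(heap, (score, title))
    return sorted(heap, reverse=True)

# Invented scores; in the engine they would be produced one at a time in step 2.
stream = [("A", 0.10), ("B", 0.32), ("mine", 0.99), ("C", 0.29),
          ("D", 0.31), ("E", 0.05), ("F", 0.30), ("G", 0.20)]
best = top_k(iter(stream), user_titles={"mine"}, k=3)
print(best)  # [(0.32, 'B'), (0.31, 'D'), (0.3, 'F')]
```

Because the heap never holds more than `k` entries, memory stays at O(k) and each of the n articles costs at most O(log k), compared with O(n) memory and O(n log n) time for storing and sorting every score.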

In [10]:
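```python
import src.recommender

# Create the recommender engine
recommender_engine = src.recommender.ArxivRecommender()

# Fetch the user's articles and rank the recorded articles against them
user_name = 'Krutika Tawri'
recommendations = recommender_engine.recommend_to(user_name)

# Display the resulting DataFrame
recommendations
```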

```
Saving Multi-BERT for Embed ...: 100%|██████████| 5/5 [00:00<00:00, 12.77it/s]

5
```


| | Title | Authors | Published | Abstract | Link | PrimaryCategory | Categories | Similarity Scores |
|---|---|---|---|---|---|---|---|---|
| 0 | Local existence and non-explosion of solutions... | Michael Rockner, Rongchan Zhu, Xiangchan Zhu | 2013-07-16 19:48:40+00:00 | In this paper we prove the local existence and... | http://arxiv.org/pdf/1307.4392v1 | math.PR | ['math.PR', 'math.AP'] | 0.323205 |
| 1 | Weak-strong uniqueness for fluid-rigid body in... | Nikolai V. Chemetov, Sarka Necasova, Boris Muha | 2017-10-03 20:49:40+00:00 | We consider a coupled PDE-ODE system describin... | http://arxiv.org/pdf/1710.01382v2 | math.AP | ['math.AP', '35Q30'] | 0.316557 |
| 2 | Local existence of Strong solutions for a flui... | Sourav Mitra | 2018-08-20 23:01:01+00:00 | We are interested in studying a system couplin... | http://arxiv.org/pdf/1808.06716v1 | math.AP | ['math.AP'] | 0.306568 |
| 3 | On the existence and the uniqueness of the sol... | Daniele Boffi, Lucia Gastaldi | 2020-06-18 13:52:34+00:00 | In this paper we consider the linearized versi... | http://arxiv.org/pdf/2006.10536v1 | math.AP | ['math.AP', '65N30, 65N12, 74F10'] | 0.304354 |
| 4 | Stochastic 2D Navier-Stokes equations on time-... | Wei Wang, Jianliang Zhai, Tusheng Zhang | 2021-05-28 03:19:18+00:00 | We establish the existence and uniqueness of s... | http://arxiv.org/pdf/2105.13565v1 | math.PR | ['math.PR', 'math.AP'] | 0.294312 |



## 3. Next Steps and Potential Enhancements
1. Extend the data acquisition process (`ProcessData`) to cover multiple fields for greater diversity.
2. Assign different weights to categories to refine the recommendations.
3. When the user has too few articles, use information about the user's advisor to supplement the user's preferences.
4. Upgrade `vectorizer.py` by replacing the existing TF-IDF vectorizer with a more advanced alternative, for example embedding models such as Word2Vec or Doc2Vec.
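
As one possible shape for item 2, the sketch below boosts a similarity score according to how much the candidate's categories overlap with the categories of the user's own articles. Nothing here exists in the project yet: `category_profile`, `reweight`, and the `alpha` parameter are hypothetical, and the multiplicative boost is only one of several reasonable designs.

```python
from collections import Counter

def category_profile(user_article_categories):
    """Relative frequency of each category across the user's articles."""
    counts = Counter(cat for cats in user_article_categories for cat in cats)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def reweight(similarity, candidate_categories, profile, alpha=0.5):
    """Scale a similarity score by the share of user interest the candidate covers.

    alpha (hypothetical) controls how strongly category overlap matters.
    """
    overlap = sum(profile.get(cat, 0.0) for cat in set(candidate_categories))
    return similarity * (1.0 + alpha * overlap)

# Toy input: two user articles with their category lists.
profile = category_profile([["math.AP", "math.PR"], ["math.AP"]])

boosted = reweight(0.3, ["math.AP"], profile)    # candidate shares the user's main category
unchanged = reweight(0.3, ["math.NA"], profile)  # no overlap, so the score is unchanged
```

A candidate with no shared category keeps its original score, so the text similarity still drives the ranking and the categories act only as a tie-breaking boost.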
