<a href="https://www.kaggle.com/code/tomsawyyer/classify-axriv-by-competition-s-label?scriptVersionId=130941824" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import json
from datetime import datetime
import plotly.express as px
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Notebook Overview

In this notebook, we will attempt to analyze the source files from arXiv and use an LLM and Transformer-based models to classify potential target papers.


## Review arXiv Metadata and Filter

In [None]:
# Inspired by https://www.kaggle.com/code/leonidkulyk/kaggle-ai-report-topic-selection?scriptVersionId=129336498&cellId=24
# Load the JSON Lines metadata file: one paper record per line
dict_arxiv = []
with open("/kaggle/input/2023-kaggle-ai-report/arxiv_metadata_20230510.json", 'r') as f:
    for line in f:
        dict_arxiv.append(json.loads(line))

In [None]:
len(dict_arxiv) # total number of papers

### Filter papers

In [None]:
# The number of scholarly articles is simply gigantic, so let's first filter out all the articles that were published before 2021
date_format = "%a, %d %b %Y %H:%M:%S %Z"
limit_year = datetime.strptime("2021", "%Y")

filtered_arxiv = []
for article_meta in dict_arxiv:
    parsed_date = datetime.strptime(
        article_meta["versions"][0]["created"], 
        date_format
    )
    if parsed_date >= limit_year:
        filtered_arxiv.append(article_meta)

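Loading every record first and filtering afterwards keeps the whole metadata file in memory. A minimal sketch of doing both in one pass is shown here, assuming the same date format as the cell above; `iter_recent_articles` and the sample records are hypothetical names and made-up data for illustration. With the real file, the open file handle could be passed in place of `sample_lines`.

```python
import json
from datetime import datetime

DATE_FORMAT = "%a, %d %b %Y %H:%M:%S %Z"  # same format string as the filter cell


def iter_recent_articles(lines, min_year=2021):
    """Yield records whose first version was created in `min_year` or later."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        created = datetime.strptime(record["versions"][0]["created"], DATE_FORMAT)
        if created.year >= min_year:
            yield record


# Made-up records standing in for lines of the metadata file
sample_lines = [
    '{"id": "old-paper", "versions": [{"created": "Mon, 2 Apr 2007 19:18:42 GMT"}]}',
    '{"id": "new-paper", "versions": [{"created": "Tue, 5 Jan 2021 10:00:00 GMT"}]}',
]
recent_ids = [record["id"] for record in iter_recent_articles(sample_lines)]
print(recent_ids)
```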
In [None]:
len(filtered_arxiv)

In [None]:
filtered_arxiv[0] 
# one paper may have several categories
# keep this in mind if your analysis is based on category

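Because one paper can carry several categories, a membership test has to look at each category separately. The sketch below assumes the `categories` field is a single space-separated string (as the later cells treat it); `has_archive` is a hypothetical helper. It also shows why a bare substring test is risky: `"cs"` occurs inside `"physics"`.

```python
def has_archive(categories_field, archive="cs"):
    """Return True if any space-separated category belongs to `archive`."""
    prefix = archive + "."
    return any(cat.startswith(prefix) for cat in categories_field.split())


print("cs" in "physics.optics")       # True: the substring test is fooled
print(has_archive("physics.optics"))  # False
print(has_archive("math.CO cs.CG"))   # True
```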
### Find all CS categories

- Find all categories from https://arxiv.org/category_taxonomy

In [None]:
# List all CS related topics in arXiv Category Taxonomy.
categories = ['cs.AI', 'cs.AR', 'cs.CC', 'cs.CE', 'cs.CG', 'cs.CL', 'cs.CR', 'cs.CV', 'cs.CY', 'cs.DB', 'cs.DC', 'cs.DL', 'cs.DM', 'cs.DS', 'cs.ET', 'cs.FL', 'cs.GL', 'cs.GR', 'cs.GT', 'cs.HC', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO', 'cs.MA', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NE', 'cs.NI', 'cs.OH', 'cs.OS', 'cs.PF', 'cs.PL', 'cs.RO', 'cs.SC', 'cs.SD', 'cs.SE', 'cs.SI', 'cs.SY']

- To collect the categories of another archive from the arXiv Category Taxonomy, run the following JavaScript snippet on that page in Chrome DevTools (Sources -> Snippets).

```javascript
// init array
let csContents = [];

// get all h4 tags in page https://arxiv.org/category_taxonomy
let h4Tags = document.getElementsByTagName("h4");

// iterate array to get useful content
for(let i = 0; i < h4Tags.length; i++) {
    if(h4Tags[i].textContent.startsWith("cs.")) { // modify cs to your topic
        let csContent = h4Tags[i].textContent.split(" (")[0];
        csContents.push(csContent);
    }
}

console.log(csContents);
```

### Category filtered papers

In [None]:
categories_ids = {}
for category in categories:
    category_ids = []
    for article_meta in filtered_arxiv:
        if category in article_meta["categories"]:
            category_ids.append(article_meta["id"])
    categories_ids[category] = set(category_ids)

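The loop above scans all filtered papers once per category. As a rough alternative, the same mapping can be built in a single pass over the papers; this is only a sketch, assuming `categories` is a space-separated string, and `index_by_category` plus the sample records are hypothetical.

```python
from collections import defaultdict


def index_by_category(articles, wanted):
    """Map each wanted category to the set of paper ids that list it."""
    wanted_set = set(wanted)
    ids_by_category = defaultdict(set)
    for article in articles:
        for cat in article["categories"].split():
            if cat in wanted_set:
                ids_by_category[cat].add(article["id"])
    # Keep the order of `wanted` and include categories with no papers
    return {cat: ids_by_category[cat] for cat in wanted}


# Made-up records for illustration
sample_articles = [
    {"id": "1", "categories": "cs.AI cs.LG"},
    {"id": "2", "categories": "physics.optics"},
    {"id": "3", "categories": "cs.AI"},
]
sample_index = index_by_category(sample_articles, ["cs.AI", "cs.LG", "cs.CV"])
for cat, ids in sample_index.items():
    print(f"{cat} has {len(ids)} items in its set.")
```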
In [None]:
for key, value_set in categories_ids.items():
    print(f"{key} has {len(value_set)} items in its set.")

In [None]:
fig = px.bar(x=categories_ids.keys(), y=[len(x) for x in categories_ids.values()], text_auto='.2s')
fig.update_layout(
    xaxis_title="Category",
    yaxis_title="Articles count"
)
fig.show()

- Thanks to https://www.kaggle.com/code/leonidkulyk/kaggle-ai-report-topic-selection/notebook#--1.-Topics-selection
- I've just realized that I can use the ChatGPT API and LangChain to categorize these papers into the following topics.
- The Kaggle competition's topic classification differs from the category taxonomy used by arXiv.
- If someone plans to generate papers using generative AI in the future, it's advisable to first classify the papers by the Kaggle topics. The topics for the competition are as follows:


1. Text data
2. Image and/or video data
3. Tabular and/or time series data
4. Kaggle Competitions
5. Generative AI
6. AI ethics
7. Other



### Prepare the content to be classified by the LLM

In [None]:

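# We will use each paper's title + abstract as the text to classify.
# Our goal is an AI report, so drop every paper without a 'cs.*' category to save API credits.
# `categories` is a space-separated string; match on the 'cs.' prefix, because the
# plain substring test `'cs' in ...` also matches 'physics.*'.

extracted_data = [
    {"id": item["id"], "title": item["title"], "abstract": item["abstract"]}
    for item in filtered_arxiv
    if any(cat.startswith("cs.") for cat in item["categories"].split())
]

df = pd.DataFrame(extracted_data)

df['title_abstract'] = df['title'] + ' ' + df['abstract']

df = df.drop(['title', 'abstract'], axis=1)

print(df) # 231386 rows in the original run (with the looser substring filter). LOL, my credits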

In [None]:
print(df.loc[0]['title_abstract']) 
# for demo usage
temp_title_abstract = df.loc[0]['title_abstract']

In [None]:
# prepare candidate labels
candidate_labels = [
    "Text data", 
    "Image and/or video data", 
    "Tabular and/or time series data", 
    "Generative AI", 
    "AI ethics",
    "Other"
]

### Using LangChain

- We now have our dataset and labels at hand.
- It's time to proceed with classification.
- We'll use the ChatGPT model, gpt-3.5-turbo, and LangChain for its easy-to-use prompt templates.
- If you're new to LangChain, I recommend going through this: https://python.langchain.com/en/latest/modules/prompts/prompt_templates.html
- Given that we have about 230k papers on CS topics, classifying all of them would use a substantial amount of API credits.
- Therefore, for a first attempt, I've decided to use a zero-shot approach to label a small dataset (50 papers) for a downstream BERT task.

In [None]:
# install dependencies
# Todo

!pip install langchain
!pip install openai
!pip install tqdm

In [None]:
# define a progress helper for the API calls, to kill time while waiting
from tqdm import tqdm
import time

def show_progress(iterable):
    with tqdm(iterable) as pbar:
        for item in pbar:
            pbar.set_description(f"Progress: {pbar.n + 1}/50")

In [None]:
from langchain.llms import OpenAI
# llm = OpenAI(openai_api_key="OPENAI_API_KEY")
llm = OpenAI(openai_api_key="", model_name="gpt-3.5-turbo",temperature=0.0)

from langchain import PromptTemplate,LLMChain


template = """
Consider the following categories: 'Text data', 'Image and/or video data', 'Tabular and/or time series data', 'Generative AI', 'AI ethics', and 'Other'. Given a piece of text,
please strictly classify it into one of these categories exactly as written. 
Do not add any extra words, punctuation, or provide paraphrased versions of the categories. 
The input text is: {titleAbstract}
"""

prompt = PromptTemplate(
    input_variables=["titleAbstract"],
    template=template,
)

# try template result to validate your prompt
prompt.format(titleAbstract=temp_title_abstract)

llm_chain = LLMChain(prompt=prompt, llm=llm)
llm_chain.run(temp_title_abstract)

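The prompt above spells out the categories by hand, so it can drift from `candidate_labels` if either one is edited. One possible way to keep them in sync is to build the template text from the list; this is a plain-Python sketch with a hypothetical `build_template` helper, it needs at least two labels, and the returned string keeps the `{titleAbstract}` placeholder so it could still be handed to `PromptTemplate`.

```python
candidate_labels = [
    "Text data",
    "Image and/or video data",
    "Tabular and/or time series data",
    "Generative AI",
    "AI ethics",
    "Other",
]


def build_template(labels):
    """Return the prompt text with the category list filled in from `labels`."""
    quoted = [f"'{label}'" for label in labels]
    category_list = ", ".join(quoted[:-1]) + f", and {quoted[-1]}"
    return (
        f"Consider the following categories: {category_list}. Given a piece of text,\n"
        "please strictly classify it into one of these categories exactly as written.\n"
        "Do not add any extra words, punctuation, or provide paraphrased versions of the categories.\n"
        # Plain (non-f) string: the placeholder is left for later formatting
        "The input text is: {titleAbstract}\n"
    )


generated_template = build_template(candidate_labels)
print(generated_template.format(titleAbstract="A made-up title and abstract."))
```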

In [None]:
# Prepare the papers to label; to save my credits, only 50 papers are used.
import string
df_small = df.iloc[:50].copy()  # copy so the new column is not written to a slice of df
df_small['api-label'] = None

def get_gpt_label():
    label_col = df_small.columns.get_loc('api-label')
    for i in range(len(df_small)):
        text = df_small['title_abstract'].iloc[i]
        label = llm_chain.run(text)
        # Uncomment the following print function if you wish to review the result.
        # Be prepared for extensive and potentially tedious output.
        # print("original content is {} label predicted by API is {}".format(text,label))
        # print()
        # Manual cleanup is necessary because the API response does not strictly follow the prompt
        df_small.iloc[i, label_col] = label.strip().strip(string.punctuation)
        yield i

        
show_progress(get_gpt_label())

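Since the API response does not always follow the prompt, stripping punctuation may still leave labels that match none of the categories. A minimal sketch of mapping a raw response back onto the fixed label list follows; `normalize_label` and its `fallback` parameter are hypothetical, and falling back to "Other" is an assumption rather than something the notebook does.

```python
import string

CANDIDATE_LABELS = [
    "Text data",
    "Image and/or video data",
    "Tabular and/or time series data",
    "Generative AI",
    "AI ethics",
    "Other",
]


def normalize_label(raw, candidates=CANDIDATE_LABELS, fallback="Other"):
    """Map a raw model response onto one of the candidate labels."""
    cleaned = raw.strip(string.punctuation + string.whitespace).lower()
    # 1) exact match, ignoring case and surrounding punctuation
    for candidate in candidates:
        if cleaned == candidate.lower():
            return candidate
    # 2) the response contains a label somewhere in a longer sentence
    for candidate in candidates:
        if candidate.lower() in cleaned:
            return candidate
    # 3) nothing recognizable: use the fallback
    return fallback


print(normalize_label("'Text data'."))             # Text data
print(normalize_label("Category: Generative AI"))  # Generative AI
print(normalize_label("I am not sure"))            # Other
```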
In [None]:
df_small.head(20)

In [None]:
label_counts = df_small['api-label'].value_counts().reset_index()

label_counts.columns = ['api-label', 'count']

fig = px.bar(label_counts, x='api-label', y='count', title='Distribution of API Labels')

fig.show()

### Using zero shot classification model from HuggingFace

- The results are not entirely convincing, as the tabular and time series category has an excessive count.
- I intend to experiment with other models for comparison and possibly to improve the results.
- This time we use the `facebook/bart-large-mnli` model from Facebook.
- Both models classify the same first 50 papers, so let's see the difference.

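To see the difference between two labelers on the same papers, one simple measure is the share of papers on which they agree, plus a tally of the disagreeing pairs. The sketch below uses made-up labels; `compare_labels` is a hypothetical helper, and once both `api-label` columns are filled in they could be passed in as lists via `.tolist()`.

```python
from collections import Counter


def compare_labels(labels_a, labels_b):
    """Return (agreement rate, Counter of disagreeing (a, b) pairs)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same papers")
    if not labels_a:
        return 0.0, Counter()
    pairs = list(zip(labels_a, labels_b))
    agreement = sum(a == b for a, b in pairs) / len(pairs)
    disagreements = Counter((a, b) for a, b in pairs if a != b)
    return agreement, disagreements


# Made-up labels for four papers
gpt_labels = ["Text data", "Generative AI", "Other", "Text data"]
bart_labels = ["Text data", "Other", "Other", "Image and/or video data"]
rate, diffs = compare_labels(gpt_labels, bart_labels)
print(rate)  # 0.5
print(diffs.most_common())
```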
In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

nlp = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

text = "The research paper discusses the application of deep learning for natural language processing."

result = nlp(text, candidate_labels)

max_score_index = result['scores'].index(max(result['scores']))
predicted_label = result['labels'][max_score_index]
print(predicted_label) # sanity check: confirm the model works

In [None]:
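df_small_bart = df.iloc[:50].copy()  # copy so the new column is not written to a slice of df
df_small_bart['api-label'] = None

def get_bart_label():
    label_col = df_small_bart.columns.get_loc('api-label')
    for i in range(len(df_small_bart)):
        text = df_small_bart['title_abstract'].iloc[i]
        # Store this paper's output in `result`; reading a `result` left over from
        # an earlier cell would give every row the same label
        result = nlp(text, candidate_labels)
        max_score_index = result['scores'].index(max(result['scores']))
        predicted_label = result['labels'][max_score_index]
        df_small_bart.iloc[i, label_col] = predicted_label.strip(string.punctuation)
        yield i
show_progress(get_bart_label())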

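Each call to the zero-shot pipeline returns a dictionary with parallel `labels` and `scores` lists, which is what the cells above index into. The sketch below reads the top label from such a dictionary without loading the model, using a made-up result; `top_label`, `min_score`, and `fallback` are hypothetical, and treating low-confidence predictions as "Other" is an assumption.

```python
def top_label(result, min_score=0.0, fallback="Other"):
    """Pick the best-scoring label from one zero-shot result dictionary."""
    # Pair each label with its score and keep the highest-scoring pair
    best_label, best_score = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    return best_label if best_score >= min_score else fallback


# Made-up result in the shape the pipeline returns
fake_result = {
    "sequence": "A made-up abstract about language models.",
    "labels": ["Text data", "Generative AI", "Other"],
    "scores": [0.62, 0.30, 0.08],
}
print(top_label(fake_result))                 # Text data
print(top_label(fake_result, min_score=0.7))  # Other
```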
In [None]:
df_small_bart.head(20) # What a ridiculous result! In the original run, every paper was classified as "Other".

- Judging by its popularity among the `Zero-Shot Classification` models on Hugging Face, there seems to be no reason to choose any model other than `facebook/bart-large-mnli`.
- Feel free to review the alternatives and pick your own through this link: https://huggingface.co/models?pipeline_tag=zero-shot-classification&library=transformers&sort=downloads.


# We Are Not Finished Yet

At this moment, I find myself at a crossroads with two divergent paths in front of me:

1. **Option One:** Classify all the papers through the API, which will consume a large amount of credits. At the current consumption rate, classifying 50 papers with the gpt-3.5-turbo model costs roughly `0.4 dollars`. Therefore, classifying `230,000` papers would cost approximately `1800 dollars`. While this might not be a large sum for a lab or a company, it is a high price for an individual like me and beyond what I can afford.

2. **Option Two:** Use all my credit allowance for this month to classify about 700 papers, creating a small training set. I could then use this training set to distill or fine-tune a BERT or similar encoder model, and subsequently classify the entire set of 230,000 papers. However, I am concerned that 700 examples are too few to fine-tune the parameters effectively.

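The cost figures in both options come from scaling the 50-paper sample linearly. As a back-of-the-envelope sketch, assuming the per-paper cost stays constant (in practice it depends on how long each abstract is); `api_cost` is a hypothetical helper:

```python
SAMPLE_SIZE = 50        # papers classified in the trial run
SAMPLE_COST_USD = 0.4   # approximate cost of that run
COST_PER_PAPER = SAMPLE_COST_USD / SAMPLE_SIZE


def api_cost(n_papers):
    """Estimate the API cost in dollars for `n_papers`, scaling linearly."""
    return n_papers * COST_PER_PAPER


print(round(api_cost(230_000), 2))  # Option One: all papers, about 1840
print(round(api_cost(700), 2))      # Option Two: 700-paper training set, about 5.6
```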
### Now, I am stuck in this predicament. If you have any good suggestions, please discuss them in the comment section.
