#### Note
This classifier could easily be re-implemented with the newer LLM_classifier framework that is used for theory and methods identification.

This notebook contains legacy code and is included only because it is the code we actually used to generate the classifications.

# LLM (GPT) section type classification

## Imports and general setup

In [None]:
import pandas as pd

from openai import OpenAI
import json
import datetime

import sys
import os

# In a Jupyter notebook, __file__ is not defined. Use the current working directory instead.
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), "..")))


from section_classification.gpt_classification_helpers import (
    prepare_articles_dataframe,
    expand_classified_articles_dataframe,
    classify_articles_with_validation,
)
from python_code.helpers import recombine_chunks, apply_in_chunks, safe_load_env_variable

In [None]:
df = pd.read_pickle("./data/processed_embeddings_light_20250702_142821.pkl")
df.keys()

In [None]:
prepared_df = prepare_articles_dataframe(df)

In [None]:

# The following block dynamically inserts categories and their descriptions loaded from a JSON file.
with open("./classifiers_LLM_generated.json", "r") as f:
    categories_json = json.load(f)

category_lines = []
for cat in categories_json:
    category_lines.append(f"- {cat['category']}: {cat['description']}")
    # category_lines.append(f"- {cat['category']}")

categories_prompt = "\n".join(category_lines)
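
To make the structure of the category file concrete, here is a minimal sketch of the same prompt-building step on an inline JSON string. The two category names appear in the classifier output further down; the descriptions and the variable names `example_json` and `example_prompt` are invented for illustration, and the real contents of `classifiers_LLM_generated.json` may differ.

```python
import json

# Hypothetical stand-in for classifiers_LLM_generated.json.
# The descriptions below are invented for illustration only.
example_json = """
[
  {"category": "Methods", "description": "How the study was designed and carried out."},
  {"category": "Results", "description": "Findings reported by the study."}
]
"""

categories = json.loads(example_json)

# One bullet per category, in the same "- name: description" layout as above.
example_prompt = "\n".join(
    f"- {cat['category']}: {cat['description']}" for cat in categories
)
print(example_prompt)
```

Each entry only needs the `category` and `description` keys, since those are the two fields the loop above reads.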

#### Initialize OpenAI

In [None]:
api_key = safe_load_env_variable("OPENAI_API_KEY", "..")
client = OpenAI(api_key=api_key)  # the OpenAI client only accepts keyword arguments

## Categorize whole articles

In [None]:
whole_article_system_prompt = f"""
You are an expert in the field of physics education research. 
You are given a whole physics education research article, formatted as markdown with each section marked by a subheading.
Your task is to classify each section into one of the following categories:

{categories_prompt}
 
Each section can match multiple categories. Consider the title, the content, and the context of the section to make the best classification.
 

Return an array of all the sections with the fields
- section: the section title
- classifications: the classifications for the section
- probabilities: the probability distribution for the classifications

Generate a probability distribution for the classification of each section. After generating all of the distributions, review them and refine
them if need be.
Return the probability distribution for each classification in the json field "probabilities". Make sure that these probabilities reflect the ambiguities of the data.
Return the classifications as an array of strings in the json field "classifications".
"""

In [None]:
model = "gpt-4.1-mini"   # Update to change model
data: pd.DataFrame = prepared_df


def classify(data):
    res = classify_articles_with_validation(
        data, system_prompt=whole_article_system_prompt, model=model, client=client, verbose=True)
    print(res)
    return res

# Process the data in chunks so that a failure part-way through does not lose the chunks already classified
classified_df = apply_in_chunks(data, classify, chunk_size=50,
                                output_dir="classified_sections")

Processing 1222 rows in chunks of 50 for a total of 25 chunks...
Processing chunk 0-49...
{'sections': [{'section': 'INTRODUCTION', 'classifications': ['Introduction / Motivation', 'Literature Review/Reanalysis'], 'probabilities': {'Introduction / Motivation': 0.75, 'Literature Review/Reanalysis': 0.2, 'Theoretical Framework': 0.03, 'Methods': 0.01, 'Results': 0.0, 'Discussion': 0.0, 'Implications': 0.0, 'Conclusion / Summary': 0.0, 'Limitations': 0.0, 'Ethical Considerations': 0.0, 'Appendix / Supplementary Material': 0.0, 'Acknowledgments': 0.0, 'References': 0.01}}, {'section': 'CU IMPLEMENTATION', 'classifications': ['Methods'], 'probabilities': {'Methods': 0.9, 'Introduction / Motivation': 0.05, 'Results': 0.03, 'Discussion': 0.01, 'Theoretical Framework': 0.01}}, {'section': 'DATA AND RESULTS', 'classifications': ['Results'], 'probabilities': {'Results': 0.95, 'Discussion': 0.03, 'Methods': 0.02}}, {'section': 'DISCUSSION', 'classifications': ['Discussion', 'Theoretical Framework

AttributeError: 'NoneType' object has no attribute 'iterrows'
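
The response printed above contains the fields requested in the prompt, so each section can be checked before it is used. The following is a minimal sketch of such a check. It is not the implementation of `classify_articles_with_validation`, which is not shown in this notebook. The function names, the tolerance `tol`, and the rule that probabilities should sum to roughly 1 are assumptions. The category names are taken from the printed output.

```python
# Category names as they appear in the printed model output above.
ALLOWED = {
    "Introduction / Motivation", "Literature Review/Reanalysis",
    "Theoretical Framework", "Methods", "Results", "Discussion",
    "Implications", "Conclusion / Summary", "Limitations",
    "Ethical Considerations", "Appendix / Supplementary Material",
    "Acknowledgments", "References",
}


def validate_section(section: dict, allowed: set, tol: float = 0.05) -> list:
    """Return a list of problems found in one classified section (empty if none)."""
    problems = []
    probs = section.get("probabilities", {})
    labels = section.get("classifications", [])

    # Every label and every probability key must be a known category.
    unknown = (set(probs) | set(labels)) - allowed
    if unknown:
        problems.append(f"unknown categories: {sorted(unknown)}")

    # Each probability must lie in [0, 1] (hypothetical rule).
    if any(not 0.0 <= p <= 1.0 for p in probs.values()):
        problems.append("probability outside [0, 1]")

    # The distribution should sum to about 1 (hypothetical tolerance).
    total = sum(probs.values())
    if abs(total - 1.0) > tol:
        problems.append(f"probabilities sum to {total:.2f}")
    return problems


def highest_probability_label(section: dict) -> str:
    """Return the category with the largest probability."""
    probs = section["probabilities"]
    return max(probs, key=probs.get)


# Second section of the printed response above.
example = {
    "section": "CU IMPLEMENTATION",
    "classifications": ["Methods"],
    "probabilities": {
        "Methods": 0.9, "Introduction / Motivation": 0.05,
        "Results": 0.03, "Discussion": 0.01, "Theoretical Framework": 0.01,
    },
}
print(validate_section(example, ALLOWED), highest_probability_label(example))
```

A helper like `highest_probability_label` is one plausible way to derive a column such as `classification_highest_prob`, which appears later in this notebook.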

In [None]:
classified_df = recombine_chunks("classified_sections")
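
The implementations of `apply_in_chunks` and `recombine_chunks` live in `python_code.helpers` and are not shown here. As a rough illustration of the checkpointing idea, the sketch below writes each processed chunk to its own pickle file and later concatenates the files. The function names, the file naming scheme, and the skip-if-exists behaviour are assumptions, not the helpers' actual behaviour.

```python
import os
import tempfile

import pandas as pd


def apply_in_chunks_sketch(df, func, chunk_size, output_dir):
    """Apply func to consecutive row chunks and save each result as a pickle."""
    os.makedirs(output_dir, exist_ok=True)
    for start in range(0, len(df), chunk_size):
        path = os.path.join(output_dir, f"chunk_{start:06d}.pkl")
        if os.path.exists(path):
            continue  # chunk was already processed in an earlier run
        # func must return a DataFrame for the chunk it receives.
        func(df.iloc[start:start + chunk_size]).to_pickle(path)


def recombine_chunks_sketch(output_dir):
    """Concatenate all saved chunks in file-name order."""
    paths = sorted(p for p in os.listdir(output_dir) if p.endswith(".pkl"))
    return pd.concat(
        [pd.read_pickle(os.path.join(output_dir, p)) for p in paths]
    )


# Toy run: 7 rows in chunks of 3 gives 3 chunk files.
toy = pd.DataFrame({"x": range(7)})
with tempfile.TemporaryDirectory() as tmp:
    apply_in_chunks_sketch(toy, lambda chunk: chunk.assign(y=chunk["x"] * 2), 3, tmp)
    n_chunks = len(os.listdir(tmp))
    combined = recombine_chunks_sketch(tmp)
print(n_chunks, len(combined))
```

With this design, rerunning after a crash only repeats the chunks that have no file on disk yet.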

## Renaming, reformatting and saving the classifications

Having run the classifications, we are left with one row per article. Since we are primarily interested in specific types of sections, we expand this dataframe to have one row per section.

In [20]:

expanded_classified_df = expand_classified_articles_dataframe(
    classified_df)
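
`expand_classified_articles_dataframe` is defined in `section_classification.gpt_classification_helpers` and is not shown here. As a hedged illustration of the general idea, going from one row per article to one row per section, the sketch below uses `DataFrame.explode` and `pd.json_normalize` on toy data. The column name `sections` and the toy values are assumptions.

```python
import pandas as pd

# Toy data: one row per article, each holding a list of section dicts.
toy_articles = pd.DataFrame({
    "article_id": ["A", "B"],
    "sections": [
        [
            {"section": "INTRODUCTION", "classifications": ["Introduction / Motivation"]},
            {"section": "METHODOLOGY", "classifications": ["Methods"]},
        ],
        [
            {"section": "DISCUSSION", "classifications": ["Discussion"]},
        ],
    ],
})

# One row per list element; article_id is repeated for each section.
exploded = toy_articles.explode("sections", ignore_index=True)

# Turn each section dict into its own columns and attach the article id.
per_section = pd.concat(
    [exploded[["article_id"]], pd.json_normalize(exploded["sections"].tolist())],
    axis=1,
).rename(columns={"section": "section_title"})
print(per_section)
```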

In [21]:
expanded_classified_df

| | article_id | section_title | section_content | probability_dist | classification_gpt | classification_highest_prob | failed_validation |
|---|---|---|---|---|---|---|---|
| 0 | 10.1103/PhysRevSTPER.1.010101 | INTRODUCTION | I. INTRODUCTION Both popular [bibr:c1] and res... | {'Introduction / Motivation': 0.75, 'Literatur... | [Introduction / Motivation, Literature Review/... | Introduction / Motivation | False |
| 1 | 10.1103/PhysRevSTPER.1.010101 | CU IMPLEMENTATION | II. CU IMPLEMENTATION A. Background and course... | {'Methods': 0.9, 'Introduction / Motivation': ... | [Methods] | Methods | False |
| 2 | 10.1103/PhysRevSTPER.1.010101 | DATA AND RESULTS | III. DATA AND RESULTS Data were collected in t... | {'Results': 0.95, 'Discussion': 0.03, 'Methods... | [Results] | Results | False |
| 3 | 10.1103/PhysRevSTPER.1.010101 | DISCUSSION | IV. DISCUSSION A. Examining the how and why of... | {'Discussion': 0.8, 'Theoretical Framework': 0... | [Discussion, Theoretical Framework] | Discussion | False |
| 4 | 10.1103/PhysRevSTPER.1.010101 | CONCLUDING REMARKS | V. CONCLUDING REMARKS Clearly, it is possible ... | {'Conclusion / Summary': 0.7, 'Implications': ... | [Conclusion / Summary, Implications] | Conclusion / Summary | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 7308 | 10.1103/PhysRevPhysEducRes.19.020109 | INTRODUCTION | I. INTRODUCTION Gender disparity in the partic... | {'Introduction / Motivation': 0.7, 'Literature... | [Introduction / Motivation, Literature Review/... | Introduction / Motivation | False |
| 7309 | 10.1103/PhysRevPhysEducRes.19.020109 | METHODOLOGY | II. METHODOLOGY In this study gender was defin... | {'Methods': 1.0, 'Introduction / Motivation': ... | [Methods] | Methods | False |
| 7310 | 10.1103/PhysRevPhysEducRes.19.020109 | RESULTS AND DISCUSSION | III. RESULTS AND DISCUSSION Figure [fig:f1] sh... | {'Results': 0.6, 'Discussion': 0.4, 'Introduct... | [Results, Discussion] | Results | False |
| 7311 | 10.1103/PhysRevPhysEducRes.19.020109 | LIMITATIONS AND FUTURE DIRECTIONS | IV. LIMITATIONS AND FUTURE DIRECTIONS This wor... | {'Limitations': 0.65, 'Implications': 0.35, 'I... | [Limitations, Implications] | Limitations | False |


Now let us merge the classifications back into the original dataframe and make the new column names more informative by suffixing them with the model name.

In [None]:
merged_df = df.merge(
    expanded_classified_df,
    how="left",
    on=["article_id", "section_title"],
)

# Suffix the classification columns with the name of the model that produced them
merged_df = merged_df.rename(columns={
    "probability_dist": "probability_dist_gpt-4.1-mini",
    "classification_gpt": "classification_gpt-4.1-mini",
    "classification_highest_prob": "classification_highest_prob_gpt-4.1-mini",
    "failed_validation": "failed_validation_gpt-4.1-mini",
})
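
Because the merge is a left join, any section in `df` that has no matching classification is kept and its classification columns are filled with NaN. A minimal sketch of how such rows could be spotted, using the `indicator` option of `DataFrame.merge` on toy data (the frames `sections` and `classified` are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for df and expanded_classified_df.
sections = pd.DataFrame({
    "article_id": ["A", "A", "B"],
    "section_title": ["INTRODUCTION", "METHODOLOGY", "DISCUSSION"],
})
classified = pd.DataFrame({
    "article_id": ["A", "B"],
    "section_title": ["INTRODUCTION", "DISCUSSION"],
    "classification_highest_prob": ["Introduction / Motivation", "Discussion"],
})

# indicator=True adds a "_merge" column saying where each row came from.
checked = sections.merge(
    classified, how="left", on=["article_id", "section_title"], indicator=True
)

# Rows present only in the left frame received no classification.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["article_id", "section_title"]])
```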

In [28]:
merged_df.drop(columns=["section_title"], inplace=True)

In [30]:
merged_df.keys()

Index(['section_label', 'article_title', 'section_title_raw', 'article_id',
       'year', 'content_text', 'relative_position', 'title_embedding',
       'content_embedding', 'section_content', 'probability_dist_gpt-4.1-mini',
       'classification_gpt-4.1-mini',
       'classification_highest_prob_gpt-4.1-mini',
       'failed_validation_gpt-4.1-mini'],
      dtype='object')

Finally, we can save our results!

In [None]:

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
merged_df.to_pickle(f"./data/classified_sections_light_gpt-4.1-mini_{timestamp}.pkl")