<a href="https://colab.research.google.com/github/ieg-dhr/NLP-Course4Humanities_2024/blob/main/Large_Language_Models_Article_Separation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Large Language Models and Article Extraction


Created by Sarah Oberbichler [ORCID](https://orcid.org/0000-0002-1031-2759)

### Using LLMs via APIs

For this course, we use the NVIDIA API, which provides up to 4,000 free credits for accessing the open-source model llama-3.1-nemotron-70b-instruct on NVIDIA's GPU infrastructure. Outside of chatbot applications, larger models demand significant computational resources.
While APIs offer a way to access models and GPU power through third parties when no local computing power is available, they typically:

*   Require payment beyond free trial credits
*   Should not be used with sensitive data
*   Should not be used with copyright-restricted data


### Using LLMs via APIs for the Analysis of Historical Newspapers
Historical newspapers published before 1940 are generally free from copyright protection and, when accessed through public newspaper platforms, are not classified as sensitive data. However, important considerations include:

*   Library licensing agreements may restrict usage
*   Cultural heritage institutions might have specific terms of use
*   Access and processing policies may vary by institution

When using APIs provided by third parties, make sure to check the licensing agreements of the data provider (e.g. the library). For example, newspapers marked with **Public Domain Mark 1.0 Universell** carry no usage restrictions.
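
A check like this can also be applied in code before any page is sent to a third-party API. The following is only a minimal sketch: the `license` field, the label list, and the sample records are invented for illustration, and the actual field name and wording depend on the data provider's export.

```python
from typing import Dict, List

# Hypothetical licence labels treated as unrestricted; adjust to your provider's wording.
ALLOWED_LICENSES = {"public domain mark 1.0"}


def is_unrestricted(record: Dict, field: str = "license") -> bool:
    """Return True if the record's licence label starts with an allowed label."""
    label = str(record.get(field) or "").strip().lower()
    return any(label.startswith(allowed) for allowed in ALLOWED_LICENSES)


def filter_unrestricted(records: List[Dict], field: str = "license") -> List[Dict]:
    """Keep only the records that may be sent to a third-party API."""
    return [record for record in records if is_unrestricted(record, field)]


# Invented sample records, for illustration only
pages = [
    {"id": 1, "license": "Public Domain Mark 1.0 Universell"},
    {"id": 2, "license": "In Copyright"},
    {"id": 3, "license": None},
]
allowed_pages = filter_unrestricted(pages)
print([page["id"] for page in allowed_pages])  # [1]
```

With a pandas DataFrame, the same functions could be fed with `df.to_dict("records")`.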

# Setting up the Large Language Model

In order to use the large language model via the API, you need an API key: https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct. Add your private key to your Colab notebook under *Secrets* as `NVIDIA_TOKEN`. Run the next cell and see if everything worked as intended.
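
If the notebook is also run outside Colab, the key lookup needs a fallback. The sketch below is an assumption, not part of the course code: `get_api_key` is a hypothetical helper that first tries the Colab *Secrets* panel and then an environment variable of the same name.

```python
import os


def get_api_key(name: str = "NVIDIA_TOKEN") -> str:
    """Look up the API key in Colab Secrets, falling back to an environment variable."""
    key = None
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            key = userdata.get(name)
        except Exception:
            # Secret missing or notebook access not granted; try the environment next
            key = None

    if not key:
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"No API key found: add '{name}' under Secrets or set it as an environment variable."
        )
    return key
```

The result could then be passed to the client as `api_key=get_api_key()`.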

In [None]:
import pandas as pd
from google.colab import userdata  # access to the Colab *Secrets* panel
from openai import OpenAI

# Initialize OpenAI client with NVIDIA API settings
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=userdata.get('NVIDIA_TOKEN')
)

# Process the DataFrame (`df` must already be loaded and contain a 'plainpagefulltext' column)
all_articles = []
for index, row in df.iterrows():
    try:
        # Make API call
        completion = client.chat.completions.create(
            model="nvidia/llama-3.1-nemotron-70b-instruct",
            messages=[
                {
                    'role': 'system',
                    'content': """ System Instructions: """
                },
                {
                    'role': 'user',
                    'content': f"""# Task Instructions:
Text to analyze:
{row['plainpagefulltext']}"""
                }
            ],
            temperature=0.0,
            max_tokens=20000
        )

        content = completion.choices[0].message.content

        # Process articles
        if content and "Keine Artikel mit dem angegebenen Thema gefunden." not in content:
            new_row = row.to_dict()
            new_row['extracted_article'] = content.strip()
            all_articles.append(new_row)

    except Exception as e:
        print(f"Error processing row {index}: {str(e)}")
        continue

# Create final DataFrame
result_df = pd.DataFrame(all_articles)

# Save to Excel
result_df.to_excel('test_1.xlsx', index=False)

# Display results
print(result_df.head())
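
The loop above skips a row as soon as the API call raises an exception. For transient failures such as rate limits or timeouts, retrying with a growing pause is a common alternative. The following is a minimal sketch under that assumption: `call_with_retry` is a hypothetical helper, not part of the OpenAI client, and the attempt count and delays are arbitrary defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); on failure wait base_delay * 2**(attempt - 1) seconds and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as e:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({e}); retrying in {delay:.1f}s")
            sleep(delay)
    raise RuntimeError("unreachable")  # the loop always returns or raises


# Usage inside the loop (assumes `client` and the message list from the cell above):
# completion = call_with_retry(
#     lambda: client.chat.completions.create(model=..., messages=..., temperature=0.0)
# )
```

Wrapping the call in a `lambda` keeps the helper independent of the API client, so it can be tested without network access.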

# Now try the code with the provided prompts

How well did your prompt perform in comparison to this prompt?

In [None]:
import pandas as pd
from typing import List, Dict
from google.colab import userdata  # access to the Colab *Secrets* panel
from openai import OpenAI

# Initialize OpenAI client with NVIDIA API settings
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=userdata.get('NVIDIA_TOKEN')
)

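def analyze_dataframe(df: pd.DataFrame, text_column: str, examples: str = "") -> pd.DataFrame:
    # `examples` holds optional few-shot examples of the desired output format
    # (e.g. showing the "**END OF ARTICLE**" delimiter that the parsing code below
    # splits on); it is interpolated into the system prompt.
    def analyze_text(text: str) -> List[Dict[str, str]]:
        system_prompt = f"""
# System Instructions
You are an expert text analyst and information retrieval specialist and hate summarization as well as enumerations. Use {examples} for structuring your answer.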
Your task is to carefully analyze given texts and extract complete articles that contain specific themes. You never change original texts.

Classify as relevant if the text contains:
- Primary earthquake terminology from the 19th and 20th centuries
- Official earthquake reports
- Geology and seismology
- Impact descriptions
- Solution descriptions
- Technical descriptions
- Aid
- Honors
- Political discussions and opinions on earthquakes
- Stories from victims and refugees
- Reports on refugees and victims
- Lives of victims
- Historical references
- Comparisons

Your output should consist of the extracted articles and the verification.

Maintain a neutral, objective stance throughout the analysis. Focus on accuracy and completeness in your extractions.
"""
        user_prompt = f"""
# Task Instructions
Bitte führe die folgenden Schritte aus:
1. Lies jeden Text aufmerksam durch. Behandle jeden Text als eigene Einheit, ohne auf andere Texte zu referieren.
2. Identifiziere alle Artikel zum Thema Erdbeben und Erdstoß.
3. Für jedes Vorkommen des Themas:
   a. Bestimme den Anfang des Artikels, in dem das Thema vorkommt.
   b. Kontrolliere Satz für Satz, ob diese zusammengehören. Beende den Artikel, wenn die Sätze nicht mehr zusammengehören.
   c. Markiere den vollständigen Artikel von Anfang bis Ende.
   d. Wenn der Artikel zu lang für eine Antwort ist, antworte mit Ja auf "article too long, human addition needed":
   e. Berücksichtige auch sehr kurze und sehr lange Artikel
4. Überprüfe jeden markierten Artikel:
   a. Stelle sicher, dass er eine Einheit bildet, auch wenn es nicht mehr um Erdbeben geht.
   b. Vergewissere dich, dass er eines der genannten Themen enthält.
   c. Prüfe, ob der extrahierte Text tatsächlich im Dokument ist
5. Extrahiere jeden überprüften Artikel als Originaltext, der nichts als den originalen Text enthält
6. Korrigiere OCR-Fehler
7. Wenn keine Artikel gefunden wurden, gib "Keine Artikel mit dem angegebenen Thema gefunden." aus.

Führe nun diese Schritte für den folgenden Text aus:
{text}
"""
        try:
            messages = [
                {
                    'role': 'system',
                    'content': system_prompt
                },
                {
                    'role': 'user',
                    'content': user_prompt
                }
            ]

            completion = client.chat.completions.create(
                model="nvidia/llama-3.1-nemotron-70b-instruct",
                messages=messages,
                temperature=0.0,
                max_tokens=20000
            )

            content = completion.choices[0].message.content

            # Stop early if the model returned nothing or reported no matching articles
            if not content or "Keine Artikel mit dem angegebenen Thema gefunden." in content:
                return []

            # Split by "**END OF ARTICLE**" if present, otherwise treat as single article
            if "**END OF ARTICLE**" in content:
                parts = content.split("**END OF ARTICLE**")
                articles = [{"article": part.strip()} for part in parts if part.strip()]
            else:
                articles = [{"article": content.strip()}]

            return articles

        except Exception as e:
            print(f"Error in AI processing: {str(e)}")
            return []

    # Apply the analysis to each row in the DataFrame
    all_articles = []
    for index, row in df.iterrows():
        articles = analyze_text(row[text_column])
        for i, article in enumerate(articles, 1):
            new_row = row.to_dict()
            new_row['extracted_article'] = article['article']
            new_row['article_part'] = i
            new_row['total_parts'] = len(articles)
            all_articles.append(new_row)

    # Create a new DataFrame with individual rows for each article
    result_df = pd.DataFrame(all_articles)

    return result_df

# Usage example
text_column = 'plainpagefulltext'
result_df = analyze_dataframe(df, text_column)

# Save the results to an Excel file
result_df.to_excel('test_2.xlsx', index=False)

# Display the first few rows of the result
print(result_df.head())
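
One way to approach the comparison question is to measure how much of each extracted article can actually be found on the source page (step 4c of the prompt). The sketch below is an illustration under stated assumptions, not part of the original notebook: `source_overlap` is a hypothetical helper that computes the share of an article's words that also occur on the page, and the sample texts are invented. Because the prompt asks the model to correct OCR errors, values slightly below 1.0 are to be expected, while very low values hint at summarised or invented text.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (umlauts are kept)."""
    return re.findall(r"\w+", text.lower())


def source_overlap(extracted: str, source: str) -> float:
    """Share of words in `extracted` that also occur somewhere in `source`."""
    words = tokenize(extracted)
    if not words:
        return 0.0
    source_words = set(tokenize(source))
    found = sum(1 for word in words if word in source_words)
    return found / len(words)


# Invented sample texts, for illustration only
page = "Wien, 3. Mai. Ein heftiges Erdbeben erschütterte gestern die Stadt. Börsenbericht folgt."
verbatim = "Ein heftiges Erdbeben erschütterte gestern die Stadt."
summary = "Der Artikel berichtet über ein Erdbeben."

print(round(source_overlap(verbatim, page), 2))  # 1.0
print(round(source_overlap(summary, page), 2))   # 0.33
```

Applied to the results of both prompts, for example with `result_df.apply(lambda r: source_overlap(r['extracted_article'], r['plainpagefulltext']), axis=1)`, this gives a rough per-article score. The measure ignores word order, so it complements rather than replaces a manual check of article boundaries.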