<a href="https://colab.research.google.com/github/ieg-dhr/NLP-Course4Humanities_2024/blob/main/Large_Language_Models_Article_Separation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Large Language Models and Article Extraction


Created by Sarah Oberbichler [![ORCID](https://info.orcid.org/wp-content/uploads/2019/11/orcid_16x16.png)](https://orcid.org/0000-0002-1031-2759)

### Using LLMs via APIs

For this course, we use the NVIDIA API, which provides up to 4,000 free credits for accessing the open-source model llama-3.1-nemotron-70b-instruct on NVIDIA's GPU infrastructure. Outside of chatbot applications, larger models demand significant computational resources.
APIs offer a way to access models and GPU power through third parties when no local computing power is available. However, they typically:

*   Require payment beyond free trial credits
*   Should not be used with sensitive data
*   Should not be used with copyright restricted data


### Using LLMs via APIs for the Analysis of Historical Newspapers
Historical newspapers published before 1940 are generally free from copyright protection and, when accessed through public newspaper platforms, are not classified as sensitive data. However, important considerations include:

*   Library licensing agreements may restrict usage
*   Cultural heritage institutions might have specific terms of use
*   Access and processing policies may vary by institution

When using APIs provided by third parties, make sure to check the licensing agreements of the data provider (e.g. the library). For example, newspapers marked with **Public Domain Mark 1.0 Universell** don't have any restrictions.

In [None]:
!git clone https://github.com/ieg-dhr/NLP-Course4Humanities_2024.git

# Setting up the Large Language Model

To use the large language model via the API, you need an API key: https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct. Add your private key to your Colab notebook under *Secrets* as NVIDIA_TOKEN. Then run the next cell to check that everything works as intended.

In [None]:
!pip uninstall -y httpx
!pip install httpx==0.27.2
from openai import OpenAI
from google.colab import userdata

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=userdata.get('NVIDIA_TOKEN'),
    # Remove any default timeout settings
    timeout=None
)


completion = client.chat.completions.create(
  model="nvidia/llama-3.1-nemotron-70b-instruct",
  messages=[{"role": "user", "content": "Hello?"}],
  temperature=0.3,
  top_p=1,
  max_tokens=10024,
  stream=True
)

for chunk in completion:
  if chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")
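
The streaming loop above prints each piece of the reply as it arrives. If the full reply is needed as a single string (for example, to store it in a DataFrame), the pieces can be collected instead of printed. The following is a minimal sketch: `collect_stream` and `fake_chunk` are hypothetical helper names, and the stand-in chunks only mimic the `choices[0].delta.content` shape used above so that the example runs without an API key.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text pieces of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        # Chunks without text have content set to None.
        if piece is not None:
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in chunk (hypothetical test data, not real API output)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo!")]
print(collect_stream(demo))
```

With a real request, `collect_stream(completion)` would take the place of the printing loop, assuming `completion` was created with `stream=True` as above.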

# Importing the Dataset

In [None]:
import pandas as pd

# Load the course dataset; replace the path with your own Excel file if needed.
df = pd.read_excel('/content/NLP-Course4Humanities_2024/datasets/Süddeutsche_Zeitung_Messina.xlsx')

# Keep only the first four rows for this demo, then display them to verify the data loaded correctly.
df = df[:4]
df.head()

In [None]:
import pandas as pd
from google.colab import userdata
from openai import OpenAI

# Initialize OpenAI client with NVIDIA API settings
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=userdata.get('NVIDIA_TOKEN')
)

def analyze_dataframe(df: pd.DataFrame, text_column: str) -> pd.DataFrame:
    def analyze_text(text: str) -> str:
        system_prompt = """
# System Instructions
You are an expert text analyst and information retrieval specialist, and you hate summarization as well as enumerations.
Your task is to carefully analyze given texts and extract complete articles that contain specific themes. You never change original texts.

Classify as relevant if the text contains:
- Primary earthquake terminology from the 19th and 20th century
- Official earthquake reports
- geology and seismology
- Impact descriptions
- Solution description
- Technical description
- Aid
- Honors
- Political discussion and opinions on the earthquake
- Stories from victims and refugees
- Reports on refugees and victims
- Lives of victims
- historical references
- comparisons

Your output should consist of nothing else but the XML structure <article></article><verification></verification><human_verification_needed></human_verification_needed>

Maintain a neutral, objective stance throughout the analysis. Focus on accuracy and completeness in your extractions.
"""
        user_prompt = f"""
Bitte befolgen Sie diese Spezifikationen:
1. Definition eines Artikels: Ein Artikel ist eine semantische Einheit im Text, die sich deutlich von vorangehendem und nachfolgendem Inhalt abgrenzt (z.B. durch eine eigene Überschrift).
3. Antwortformat:
- Wenn ein oder mehrere relevante Artikel gefunden werden, strukturieren Sie Ihre Antwort mit XML-Tags wie im folgenden Beispiel, unter Verwendung der Tags article, verification und human_verification_needed (True oder False): <article>vollständiger extrahierter Artikelinhalt</article><verification>Ist die Einheit kohärent? Ist das Thema vorhanden? Ist der Artikel vollständig? Wurden alle Artikel gefunden?</verification><human_verification_needed>False</human_verification_needed>
- Gebe alle relevanten Artikel in ihrer Originalform zurück, ohne Ergänzungen, Auslassungen, Korrekturen oder Kommentare.
- Wenn keine relevanten Artikel gefunden werden, ist keine besondere Strukturierung erforderlich; gebe einfach "Kein relevanter Artikel gefunden." ohne weitere Erklärungen zurück.
4. Hinweise zur Segmentierung:
- Stelle sicher, dass über mehrere Absätze verteilte Artikel als eine Einheit behandelt werden.
5. Menschliche Überprüfung notwendig:
- Kann die Werte "True" oder "False" haben
- False: Wenn Sie glauben, den Artikel korrekt segmentiert und seine Relevanz richtig eingeschätzt zu haben.
- True: Wenn du unsicher bist, ob du den vollständigen Inhalt des Artikels, wie er im Zeitungsdokument enthalten ist, erfasst hast oder ob er relevant ist.

Hier ist das Zeitungsdokument:

{text}
"""
        try:
            messages = [
                {
                    'role': 'system',
                    'content': system_prompt
                },
                {
                    'role': 'user',
                    'content': user_prompt
                }
            ]

            completion = client.chat.completions.create(
                model="nvidia/llama-3.1-nemotron-70b-instruct",
                messages=messages,
                temperature=0.0,
                max_tokens=20000
            )
            return completion.choices[0].message.content
        except Exception as e:
            print(f"Error in API call: {str(e)}")
            return ""

    # Apply the analysis to each row in the DataFrame
    df['separated_articles'] = df[text_column].apply(lambda x: analyze_text(x) if pd.notna(x) else "")

    return df

# Usage example (assuming df is your input DataFrame)
if __name__ == "__main__":
    # Process the DataFrame
    text_column = 'plainpagefulltext'  # or your text column name
    result_df = analyze_dataframe(df, text_column)

    # Save the results
    result_df.to_excel('analyzed_results.xlsx', index=False)

    # Display sample results
    print("\nSample of processed articles:")
    print(result_df['separated_articles'].head())
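
The `separated_articles` column holds the raw model reply, in which each article is wrapped in the `<article>`, `<verification>` and `<human_verification_needed>` tags requested in the prompt. To work with individual articles, the reply has to be split back into its parts. The sketch below does this with a regular expression rather than an XML parser, on the assumption that the model follows the requested tag order and that newspaper text may contain characters that are not valid XML. `parse_articles` and the sample reply are hypothetical and for illustration only.

```python
import re

# One record: article text, the model's self-check, and the human-review flag.
RECORD_RE = re.compile(
    r"<article>(?P<article>.*?)</article>\s*"
    r"<verification>(?P<verification>.*?)</verification>\s*"
    r"<human_verification_needed>(?P<flag>.*?)</human_verification_needed>",
    re.DOTALL,  # articles span several lines
)


def parse_articles(response):
    """Return one dict per extracted article; an empty list if none are found."""
    if not response:
        return []
    records = []
    for match in RECORD_RE.finditer(response):
        records.append({
            "article": match.group("article").strip(),
            "verification": match.group("verification").strip(),
            # The prompt asks for the literal strings "True" or "False".
            "human_verification_needed": match.group("flag").strip().lower() == "true",
        })
    return records


# Made-up reply in the requested format (not real model output).
sample = (
    "<article>Erdbeben in Messina.\nDie Stadt ist zerstört.</article>"
    "<verification>Kohärent und vollständig.</verification>"
    "<human_verification_needed>False</human_verification_needed>"
)
print(parse_articles(sample))
```

A reply such as "Kein relevanter Artikel gefunden." contains no tags and therefore yields an empty list. Applied to the whole column, `result_df['separated_articles'].apply(parse_articles)` would give one list of records per newspaper page.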

# Write a Prompt for OCR Post-Correction

In [None]:
import pandas as pd
from openai import OpenAI

# Initialize OpenAI client with NVIDIA API settings
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key = userdata.get('NVIDIA_TOKEN')
)

# Process the DataFrame
all_articles = []
for index, row in result_df.iterrows():
    try:
        # Make API call
        completion = client.chat.completions.create(
            model="nvidia/llama-3.1-nemotron-70b-instruct",
            messages=[
                {
                    'role': 'system',
                    'content': """ System Instructions: """
                },
                {
                    'role': 'user',
                    'content': f"""# Task Instructions:
Text to analyze:
{row['separated_articles']}"""
                }
            ],
            temperature=0.0,
            max_tokens=20000
        )

        content = completion.choices[0].message.content

        # Keep the row unless the reply only repeats the "no relevant article" message
        if content and "Kein relevanter Artikel gefunden." not in content:
            new_row = row.to_dict()
            new_row['article_corrected'] = content.strip()
            all_articles.append(new_row)

    except Exception as e:
        print(f"Error processing row {index}: {str(e)}")
        continue

# Create final DataFrame
result_2_df = pd.DataFrame(all_articles)

# Save to Excel
result_2_df.to_excel('test_2.xlsx', index=False)

# Display results
print(result_2_df.head())
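
OCR post-correction should only repair recognition errors, not rewrite or shorten the article. One simple way to spot replies that drifted too far from the input is to compare both texts with `difflib` from the Python standard library. The sketch below is only a heuristic: `similarity` and `needs_review` are hypothetical helper names, and the threshold of 0.8 is an assumed starting value that would need tuning on real data.

```python
from difflib import SequenceMatcher


def similarity(original, corrected):
    """Return a ratio between 0.0 (nothing in common) and 1.0 (identical)."""
    return SequenceMatcher(None, original, corrected).ratio()


def needs_review(original, corrected, threshold=0.8):
    """Flag a correction whose text differs strongly from the OCR input.

    The threshold is an assumption for illustration, not a tested value.
    """
    return similarity(original, corrected) < threshold


# Made-up examples: a light OCR fix versus a reply that replaced the text.
ocr_text = "Erdbcben in Mefsina"
light_fix = "Erdbeben in Messina"
rewritten = "xyz"

print(needs_review(ocr_text, light_fix))
print(needs_review(ocr_text, rewritten))
```

On the results above, the same check could compare `separated_articles` with `article_corrected` row by row and mark suspicious rows for manual inspection.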