<a href="https://colab.research.google.com/github/soberbichler/mogon_ki/blob/main/article_separation_nemotron_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import ollama

In [None]:
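import os
# Let requests to localhost (where the Ollama server is assumed to run) bypass any configured proxy
os.environ['no_proxy'] = 'localhost'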
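
The assignment above overwrites any `no_proxy` value that is already set in the environment. A minimal sketch of merging instead, assuming the Ollama server listens on the local machine. `merge_no_proxy` is a hypothetical helper written for this illustration, not part of `os` or `ollama`:

```python
import os


def merge_no_proxy(current: str, hosts: list) -> str:
    """Return a comma-separated no_proxy value that keeps existing entries and adds `hosts`."""
    entries = [entry.strip() for entry in current.split(",") if entry.strip()]
    for host in hosts:
        if host not in entries:
            entries.append(host)
    return ",".join(entries)


# Keep whatever is already configured and add the local addresses
# (assumption: Ollama runs on this machine).
os.environ["no_proxy"] = merge_no_proxy(
    os.environ.get("no_proxy", ""), ["localhost", "127.0.0.1"]
)
```

This only matters when the session already has proxy exclusions configured; on a fresh Colab runtime the original one-line assignment is likely sufficient.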


In [None]:
# Connect to the Ollama server and list the locally available models
client = ollama.Client()
client.list()
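
Before starting a long run it can help to confirm that the model used later (`nemotron:latest`) is actually installed. The sketch below only shows the name comparison. `has_model` is a hypothetical helper, and the list of names is written by hand here because the exact field names in the result of `client.list()` depend on the installed version of the `ollama` package:

```python
def has_model(installed: list, wanted: str) -> bool:
    """Check whether `wanted` is among the installed model names.

    Ollama treats a name without a tag as `<name>:latest`, so both sides
    are normalised to that form before comparing.
    """
    def normalise(name: str) -> str:
        return name if ":" in name else f"{name}:latest"

    return normalise(wanted) in {normalise(name) for name in installed}


# Hand-written example names; in the notebook they would be taken from client.list().
installed_names = ["nemotron:latest", "llama3:8b"]
print(has_model(installed_names, "nemotron"))
```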

In [None]:

import pandas as pd

# Replace 'your_excel_file.xlsx' with the actual path to your Excel file
df = pd.read_excel('your_excel_file.xlsx')

df.head()
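
Empty cells in an Excel sheet are read by pandas as NaN, and a NaN value formatted into a prompt becomes the literal string `nan`. A minimal sketch of normalising cell values before they are sent to the model. `cell_to_text` is a hypothetical helper introduced only for this illustration:

```python
import math


def cell_to_text(value) -> str:
    """Convert a spreadsheet cell to a stripped string, mapping missing values to ''."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value).strip()
```

It could be applied to the text column with something like `df['plainpagefulltext'].map(cell_to_text)`, after which rows with an empty string can be skipped instead of being analysed.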

In [None]:
import pandas as pd
from typing import List, Dict

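# Few-shot examples that are interpolated into the prompt below
# (the file is assumed to be UTF-8 encoded)
with open('examples.txt', 'r', encoding='utf-8') as file:
    examples = file.read()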

def analyze_dataframe(df: pd.DataFrame, text_column: str) -> pd.DataFrame:
    def analyze_text(text: str) -> List[Dict[str, str]]:
        combined_prompt = f"""
# System Instructions
You are an expert text analyst and information retrieval specialist. Your task is to carefully analyze given texts and extract complete articles that contain specific themes. Follow these guidelines and use {examples} to learn from:

1. Approach each text with meticulous attention to detail.
2. Identify all instances of the specified theme within the text.
3. For each keyword occurrence:
   a. Determine the beginning of the article containing the keyword.
   b. Analyze sentence by sentence to ensure continuity and relevance.
   c. Include the entire article. If the article is too long to fit in one response, write "[CONTINUED]" at the end and continue in the next response.
   d. If articles have headlines, consider them as start/end markers.
4. Verify each extracted article:
   a. Ensure it forms a coherent unit.
   b. Confirm it contains the specified keyword.
   c. Check for completeness and inclusion of all relevant information.
5. Extract and present each verified article in its original, unaltered form.
6. Separate distinct articles clearly with "###ARTICLE_SEPARATOR###".
7. If no articles containing the keyword are found, state this explicitly.

Your output should consist solely of the extracted articles or the statement that no relevant articles were found. Do not include explanations, summaries, or additional commentary unless specifically requested.

Maintain a neutral, objective stance throughout the analysis. Focus on accuracy and completeness in your extractions.

# Task Instructions
Bitte führe die folgenden Schritte aus:
1. Lies den gesamten Text sorgfältig durch.
2. Identifiziere alle Artikel zum Thema Erdbeben, Erdbebenkatastrophe in Italien/Messina/Sizilien/Katastrophe/Trümmer und Flüchtlinge und suche nach den Keywords 'Erdbeben', 'Erdstoß', 'Erdbebenkatastrophe', 'Messina', 'Trümmer', 'Hilfsaktion', 'Katastrophe'.
3. Für jedes Vorkommen des Themas:
   a. Bestimme den Anfang des Artikels, in dem Keywords vorkommen.
   b. Kontrolliere Satz für Satz, ob diese zusammengehören. Beende den Artikel, wenn die Sätze nicht mehr zusammengehören.
   c. Markiere den vollständigen Artikel von Anfang bis Ende.
   d. Wenn der Artikel zu lang für eine Antwort ist, schreibe "[CONTINUED]" am Ende und setze in der nächsten Antwort fort.
4. Überprüfe jeden markierten Artikel:
   a. Stelle sicher, dass er eine zusammenhängende Einheit bildet.
   b. Vergewissere dich, dass er zum Thema passt.
   c. Prüfe, ob er vollständig ist und keine wichtigen Informationen fehlen.
5. Extrahiere jeden überprüften Artikel als Originaltext ohne Korrekturen.
6. Gib die extrahierten Artikel exakt und unverändert wieder, ohne Zusammenfassungen oder zusätzliche Kommentare.
7. Wenn keine Artikel gefunden wurden, gib "Keine Artikel mit dem angegebenen Keyword gefunden." aus.

Führe nun diese Schritte für den folgenden Text aus:
{text}
"""

        articles = []
        current_article = ""
        continuation = False

        while True:
            try:
                response = client.generate(
                    model='nemotron:latest',
                    prompt=combined_prompt if not continuation else current_article,
                    options={
                        'num_ctx': 20000,
                        'temperature': 0.1,
                        'top_p': 0.5,
                        'num_predict': 20000,
                        'repeat_penalty': 1,
                        'top_k': 20
                    }
                )

                content = response['response']

                if continuation:
                    current_article += content
                else:
                    parts = content.split("###ARTICLE_SEPARATOR###")
                    for part in parts[:-1]:
                        if current_article:
                            articles.append({"article": current_article.strip()})
                            current_article = ""
                        articles.append({"article": part.strip()})
                    current_article = parts[-1]

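                # Ignore trailing whitespace or newlines when looking for the marker
                if content.rstrip().endswith("[CONTINUED]"):
                    continuation = True
                    # Drop trailing whitespace together with the "[CONTINUED]" marker
                    current_article = current_article.rstrip()[:-len("[CONTINUED]")]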
                else:
                    continuation = False
                    if current_article:
                        articles.append({"article": current_article.strip()})
                        current_article = ""
                    break

            except Exception as e:
                print(f"Error in AI processing: {e}")
                break

        return articles

    # Apply the analysis to each row in the DataFrame
    all_articles = []
    for index, row in df.iterrows():
        articles = analyze_text(row[text_column])
        for i, article in enumerate(articles, 1):
            new_row = row.to_dict()
            new_row['extracted_article'] = article['article']
            new_row['article_part'] = i
            new_row['total_parts'] = len(articles)
            all_articles.append(new_row)

    # Create a new DataFrame with individual rows for each article
    result_df = pd.DataFrame(all_articles)

    return result_df

# Usage example (run this in your notebook)
text_column = 'plainpagefulltext'
result_df = analyze_dataframe(df, text_column)

# Optionally, save the results to an Excel file
result_df.to_excel('analysis_results.xlsx', index=False)

# Display the first few rows of the result
print(result_df.head())
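
The continuation and splitting logic inside `analyze_text` can also be isolated into two small functions, which makes it testable without a running model. The sketch below is one possible variant, not the notebook's own method: it caps the number of continuation rounds, resends the task together with the output so far (the loop above resends only the partial article), and drops the "no articles found" sentence so that it is not stored as an article. `collect_response`, `split_articles`, `max_rounds`, and the wording of the continuation prompt are all assumptions made for this illustration.

```python
from typing import Callable, List

SEPARATOR = "###ARTICLE_SEPARATOR###"
CONTINUED = "[CONTINUED]"
NO_RESULT = "Keine Artikel mit dem angegebenen Keyword gefunden."


def collect_response(generate: Callable[[str], str], prompt: str, max_rounds: int = 5) -> str:
    """Call `generate` until a reply no longer ends with CONTINUED or max_rounds is reached."""
    collected = ""
    next_prompt = prompt
    for _ in range(max_rounds):
        reply = generate(next_prompt).rstrip()
        if not reply.endswith(CONTINUED):
            collected += reply
            break
        # Keep the text before the marker and ask for the rest.
        collected += reply[: -len(CONTINUED)]
        # Hypothetical continuation prompt: original task plus the output so far.
        next_prompt = (
            f"{prompt}\n\nOutput so far:\n{collected}\n\n"
            "Continue exactly where the output stopped."
        )
    return collected


def split_articles(response: str) -> List[str]:
    """Split a full response into articles, dropping empty parts and the no-result sentence."""
    parts = (part.strip() for part in response.split(SEPARATOR))
    return [part for part in parts if part and part != NO_RESULT]


# Demonstration with a fake generator that returns canned replies instead of calling a model.
replies = iter(["Teil eins [CONTINUED]", f"Teil zwei{SEPARATOR}Zweiter Artikel"])
demo_text = collect_response(lambda _prompt: next(replies), "prompt")
print(split_articles(demo_text))
```

In the notebook, `generate` could wrap the existing call, for example `lambda p: client.generate(model='nemotron:latest', prompt=p, options=...)['response']`, with the same options as above.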