# Prompt Engineering for Repository Summarization

This notebook is designed to help you experiment with different prompts to generate high-quality summaries for government repositories. 

You can:
1.  Load your repository data from a CSV file.
2.  Take a random sample of repositories to test.
3.  Define and test multiple summarization prompts.
4.  Generate summaries using a language model.
5.  Compare the results in a clear, side-by-side format.

## 1. Setup

First, let's import the necessary libraries and set up our configuration.

In [43]:
import pandas as pd
from IPython.display import display, HTML
import random

# --- Configuration ---
CSV_FILE_PATH = '/home/theo/projects/iai/data/tmprunid/classified_gov_repositories_batch.csv'
SAMPLE_SIZE = 10  # Number of random repositories to test

## 2. Load and Sample Data

We'll load the data from the CSV file and select a random sample of repositories to work with. This keeps the process fast and manageable for experimentation.

In [36]:
try:
    df = pd.read_csv(CSV_FILE_PATH)
    # Ensure the sample size is not larger than the number of rows in the dataframe
    sample_size = min(SAMPLE_SIZE, len(df))
    if len(df) > sample_size:
        sample_df = df.sample(n=sample_size, random_state=62) # random_state for reproducibility
    else:
        sample_df = df
    print(f"Successfully loaded {len(df)} repositories and sampled {len(sample_df)} of them.")
    display(sample_df[['name', 'readme', 'description']].head())
except FileNotFoundError:
    print(f"Error: The file was not found at {CSV_FILE_PATH}")

Successfully loaded 500 repositories and sampled 20 of them.


Unnamed: 0,name,readme,description
83,rdl-standard,"# Risk Data Library Standard\n\nThe Risk Data Library Standard is a data model for describing Hazard, Exposure, \nVulnerability and Loss data.\n\nThe model describes the common core metadata that applies to all risk datasets, \nas well as standardised metadata that applies to Hazard, Exposure, Vulnerability and Loss \ndata.\n\nThis repository is used to coordinate the development of this data model. It will \nbe used to:\n\n* publish working and released drafts of the data model specifications\n* coordinate collaboration and discussion around the iterative development of those specifications\n* provide an overview of the current status and roadmap\n\n## Intended audience\n\nThe repository is intended to support the work of those developing and contributing to the \nRisk Data Library specifications.\n\nThis repository is intended to:\n\n* support comments or feedback on the current specifications\n* propose and discuss changes, e.g. in the form of revised wording or additions to the model\n* answer questions about the governance and evolution of the standard\n\nOther more useful resources exist if you have general questions about the scope and goals \nof [the Risk Data Library project](http://riskdatalibrary.org/), or are looking for a more [high-level introduction to \nthe standard and its key concepts](https://docs.riskdatalibrary.org/).\n\n## How to contribute\n\nThe [Contributors guide](CONTRIBUTING.md) covers the different ways in which you can contribute to this project to \nsupport the development and adoption of the Risk Data Library Standard.\n\n## Project governance\n\nRead [the project governance documentation](GOVERNANCE.md) for more detail about our approach to making decisions and \nagreeing changes to the standard.\n\n## Licence\n\nThe published specifications and all working documents in this repository are published under \na [Creative Commons Attribution-ShareAlike 4.0 (CC-BY-SA 
4.0)](https://creativecommons.org/licenses/by-sa/4.0/legalcode) licence.\n\nVisit the Creative Commons website for [official translations of the licence text](https://creativecommons.org/licenses/by-sa/4.0/legalcode#languages).\n","The Risk Data Library Standard (RDLS) is an open data standard to make it easier to work with disaster and climate risk data. It provides a common description of the data used and produced in risk assessments, including hazard, exposure, vulnerability, and modelled loss, or impact, data."
235,GeoNature-mobile-webapi,"# GeoNature-mobile-webapi\n\nGeoNature est une application de saisie et de synthèse des observations faune et flore : https://github.com/PnEcrins/GeoNature\n\nPour pouvoir importer les données saisies avec [Geonature-mobile](https://github.com/PnEcrins/GeoNature-mobile) dans la BDD PostgreSQL de GeoNature, cette web-API doit être installée sur le serveur.\n\nLa synchronisation de ces données peut être faite par le réseau (wifi ou 3G) ou en connectant le mobile en USB à un PC connecté à internet. Dans ce cas, une application de synchronisation des données soit être installée sur le PC : https://github.com/PnEcrins/GeoNature-mobile-sync \n\n![GeoNature schema general](https://github.com/PnEcrins/GeoNature/raw/master/docs/images/schema-geonature-environnement.jpg)\n\n## License\n&copy; Makina Corpus / Parc national des Ecrins 2012 - 2017\n",WebAPI (coté serveur) de synchronisation des données produites par GeoNature-mobile
207,portal-brasil,"<div align=""center""><img alt=""logo"" src=""https://raw.githubusercontent.com/plonegovbr/plonegovbr.portal/main/docs/logo.png"" width=""150"" /></div>\n\n<h1 align=""center"">PortalBrasil</h1>\n\nProjeto de desenvolvimento do Portal Brasil\n\n## Instalação\n\nClone este repositório\n\n```bash\ngit clone git@github.com:plonegovbr/portal-brasil.git\n```\n\nInstale as dependências de backend\n\n```bash\nmake install-backend\n```\n\nInstale as dependências de frontend\n\n```bash\nmake install-frontend\n```\n\n## Inicie os servidores\n\nInicie o servidor de backend\n\n```bash\nmake start-backend\n```\nEm outro terminal, inicie o servidor de frontend:\n\n```bash\nmake start-frontend\n```\n\n## Pacotes em desenvolvimento\n\n### Backend\n\nEdite o arquivo `backend/mx.ini` e adicione / edite os pacotes e rode `make install-backend` novamente.\n\n### Frontend\n\nEdite o arquivo `frontend/mrs.developer.json` e adicione / edite os pacotes e rode `make install-frontend` novamente.\n",Ambiente de desenvolvimento do PortalBrasil
168,Alouette_ISIS_extract,"\n# Alouette-1, ISIS - 1 and ISIS -2 - Ionogram Data Extraction - Data from Canada's First Satellites Over 60 Years In the Making\n\n> In this project, the film rolls from the Alouette and ISIS satellites were scanned, digitized, and made accessible to the public. The primary aim was to establish a centralized data repository, facilitating access for researchers to utilize both the data and metadata derived from the satellites for future research.\n\nAlouette -1 was the first topside ionospheric satellite and the first Canadian satellite launched in 1962 in collaboration with the United States through NASA. Alouette – 1 was known for its swept frequency topside sounder experiment with the goal to investigate the geographic and diurnal variation of the topside ionosphere at altitudes up to 1000 km. One of the most important scientific results from Alouette-1 was that it provided the first global picture of electron-density distribution in the topside ionosphere. With the success of Alouette -1, Canada and the United States formally agreed on December 23rd, 1963, to extend their collaboration to a program called International Satellites for Ionospheric studies (ISIS). As part of this program, Canada designed and built an additional family of ionospheric satellites: Alouette – 2, ISIS -1 and ISIS – 2. The ISIS - 1 and ISIS - 2 satellites had a more complex navigational systems and larger data collection capabilities than Alouette- 1 and 2 satellites. For instance, ISIS – 1 was the first in the series to contain a swept and fixed frequency sounder technique combined with a complete set of direct measurements.\n\nThe output from the topside sounders were a video signal that contained the ionospheric echo pulses, but also pulses that depicted frequency markers and when a new frame started. A system was built to read the 7-track reel-to-reel magnetic tapes displayed on a cathode ray tube in ‘B-scan’ form. 
This product was called an ionogram, which depicted the reflections of radio waves emitted from the satellite off the top side of the ionosphere, across a range of frequencies. The scanning of the ionograms as the first step of the historical data restoration of the Alouette and ISIS satellites began in 2017. The processing of the Alouette \n and ISIS data was concluded in 2023 and 2024 respectively. \n\n> Canadian Space Agency has created a centralized repository, facilitating easy access for researchers to utilize both the data and metadata derived from the Alouette and ISIS satellites. This includes but is not limited to open-source code on the processing of the data, raw images, data dictionaries, detailed methodology and a micro application that provides users the ability to select, download and visualize Alouette and ISIS data.\n\n\n## How to Get Started\n**To learn how to access, work and re-process the data, read:**\n\n- [**Alouette-1 – Ionogram Data Extraction Methodology**](https://github.com/asc-csa/Alouette_extract/blob/working/documentation/Alouette-1%20-",🛰️ Ce code sert à extraire les données et les métadonnées des ionogrammes numérisés des satellites Alouette et ISIS | 🛰️ This code is an effort to extract data and metadata from the scanned ionogram images from the Alouette and ISIS satellites.
135,quadratic-voting-frontend,"# Quadratic Voting Frontend\n\nÊ≠§ËôïÁÇ∫[2019Á∏ΩÁµ±ÁõÉÈªëÂÆ¢Êùæ](https://presidential-hackathon.taiwan.gov.tw/)Âπ≥ÊñπÊäïÁ•®Ê≥ï‰πãÂâçÁ´ØÁ®ãÂºèÁ¢ºÔºå‰æõÂ§ßÁúæÂèÉËÄÉÂà©Áî®„ÄÇ\n\nÊ≥®ÊÑèÔºöÊ≠§Á®ãÂºèÁ¢º‰∏çÂê´ÂæåÁ´ØÁ®ãÂºèÁ¢º\n\nThis is the frontend code of [Taiwan Presidential Hackathon 2019](https://presidential-hackathon.taiwan.gov.tw/en/Default.aspx) quadratic voting page, and open under MIT License for public use.\n\nNotice: This code did not include backend.\n\n## ÊäïÁ•®ÁµêÊûú\n\nÊ≠§ËôïË≥áÊñôÁÇ∫ÊäïÁ•®Âæå‰πãË≥áÊñôÔºå‰æõÂ§ßÁúæÁ†îÁ©∂Âà©Áî®„ÄÇ\n\n### ÊèêÊ°àË≥áÊñô\n\n#### Ê™îÊ°àÂêçÔºö\n\n[Proposal.json](data/Proposal.json)\n\n#### Ê¨Ñ‰ΩçË™™ÊòéÔºö\n\n- ProposalID: ÊèêÊ°àÁ∑®Ëôü\n- ServiceAgencies: ÂúòÈ´î/Ê©üÊßãÂêçÁ®±\n- TeamName: ÈöäÂêç\n- ProposalTitle: È°åÁõÆ\n\n### ÊäïÁ•®ÁµêÊûúË≥áÊñô\n\n#### Ê™îÊ°àÂêçÔºö\n\n[ProposalPolls.json](data/ProposalPolls.json)\n\n#### Ê¨Ñ‰ΩçË™™ÊòéÔºö\n\n- UserID: ‰ΩøÁî®ËÄÖÁ∑®Ëôü\n- ProposalID: ÊèêÊ°àÁ∑®Ëôü\n- Count: ÂæóÁ•®Êï∏\n- CreateDate: Âª∫Á´ãÊôÇÈñìÔºàÊôÇÂçÄÔºöUTC+8Ôºâ\n\n### ‰ΩøÁî®ËÄÖË≥áÊñô\n\n#### Ê™îÊ°àÂêçÔºö\n\n[User.json](data/User.json)\n\n#### Ê¨Ñ‰ΩçË™™ÊòéÔºö\n\n- UserID: ‰ΩøÁî®ËÄÖÁ∑®Ëôü\n- CreateDate: Âª∫Á´ãÊôÇÈñìÔºàÊôÇÂçÄÔºöUTC+8Ôºâ\n\n### ‰ΩøÁî®ËÄÖÁ¥ÄÈåÑ\n\n#### Ê™îÊ°àÂêçÔºö\n\n[UserAction.json](data/UserAction.json)\n\n#### Ê¨Ñ‰ΩçË™™ÊòéÔºö\n\n- ActionID: Â∫èËôü\n- UserID: ‰ΩøÁî®ËÄÖÁ∑®Ëôü\n- ProposalID: ÊèêÊ°àÁ∑®Ëôü\n- Sequence: Ë©≤Ê¨°ÊäïÁ•®Èö®Ê©üÊéíÂ∫èÂá∫ÁèæÂú®Á¨¨N‰Ωç\n- Action: ÊäïÁ•®ÊàñÂõûÊî∂\n - Add: ÊäïÁ•®\n - Sub: Êî∂Âõû‰∏ÄÁ•®\n- VoteCount: ÊäïÁ•®ÊàñÊî∂ÂõûÂæåÂâ©NÁ•®\n- SessionID: SessionË≠òÂà•Á¢º\n- CreateDate: Âª∫Á´ãÊôÇÈñìÔºàÊôÇÂçÄÔºöUTC+8Ôºâ\n\n# Author\n\nÈô≥‰∏ñÁ••\n\n# License\n\n[MIT](License)\n\n",No description


## 3. Define LLM Utility

This is where you'll integrate your language model. The `get_summary` function in the code cell below is a starting point. **You should replace its content with the logic from your `llm_utils.py` file** if your setup differs.

The function should accept a `prompt` template together with the repository's `repo_description` and `repo_readme`, and return the generated summary as a string.
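While the real model is being wired up, it can be handy to exercise the rest of the notebook without making API calls. The sketch below is a hypothetical stand-in: `fake_get_summary` and `demo_prompt` are illustrative names, not part of `llm_utils.py`. It takes the same three arguments as the `get_summary` function in this notebook, checks that the prompt carries both placeholders, and returns the first few words of the description (falling back to the README).

```python
def fake_get_summary(prompt: str, repo_description: str, repo_readme: str,
                     max_words: int = 10) -> str:
    """Hypothetical offline stand-in for get_summary; makes no LLM call."""
    # Check that the prompt has both placeholders the real chain fills in.
    for placeholder in ("{description}", "{readme}"):
        if placeholder not in prompt:
            raise ValueError(f"Prompt is missing {placeholder}")
    # Prefer the description; fall back to the README if it is empty.
    source = (repo_description or "").strip() or (repo_readme or "").strip()
    return " ".join(source.split()[:max_words])


# Illustrative prompt, not one of the prompts tested in this notebook.
demo_prompt = "Description: {description}\nREADME: {readme}\nSummary:"
print(fake_get_summary(demo_prompt, "Ansible playbooks for installing the ALA components", ""))
```

Swapping this in for `get_summary` lets you check the loop and the comparison table end to end before spending any model quota.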

## 4. Define Prompts for Engineering

Here you can define all the different prompts you want to test. I've added a few examples to get you started, focusing on different aspects like tone, format, and content.

In [37]:
# 1. Install necessary libraries if you haven't already
# !pip install langchain-google-genai langchain python-dotenv

import os
from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_google_genai import ChatGoogleGenerativeAI

# Load environment variables from your .env file
load_dotenv(dotenv_path='../.env')

# Securely get the API key from the environment.
# This prevents the key from being stored in the notebook's code or output.
google_api_key = os.getenv("GOOGLE_API_KEY")
if not google_api_key:
    raise ValueError("GOOGLE_API_KEY not found. Make sure it's in your .env file.")

# Define the model name (as seen in llm_utils.py)
LLM_MODEL_NAME = "gemini-1.5-flash"

# Initialize the Google Generative AI model
llm = ChatGoogleGenerativeAI(
    model=LLM_MODEL_NAME,
    google_api_key=google_api_key,
)

# --- Example Usage ---
# You can replace these variables with your own data
repo_description = "A Python library for parsing and generating beautiful, human-readable reports from complex data structures."
repo_readme = """
# Reportify v1.2

Reportify is a tool designed to make data reporting simple. It takes JSON or dictionary data and outputs clean, formatted reports in PDF or HTML.

## Features
- Multiple output formats (PDF, HTML)
- Customizable templates
- Easy integration with pandas DataFrames

## Getting Started
`pip install reportify`
"""

def get_summary(prompt: str, repo_description: str, repo_readme: str) -> str:
    """
    Generates a summary of a repository using the specified prompt template.

    Args:
        prompt: The prompt template; it must contain the "{description}" and "{readme}" placeholders.
        repo_description: The repository's short description.
        repo_readme: The text of the repository's README.

    Returns:
        The generated summary as a string.
    """

    # Build the prompt template from the caller-supplied prompt text
    summary_prompt = PromptTemplate(
        input_variables=["description", "readme"],
        template=prompt,
    )

    # Create the LangChain chain by piping the components together
    summary_chain = summary_prompt | llm | StrOutputParser()


    # Invoke the chain with your repository data
    return summary_chain.invoke({
        "description": repo_description,
        "readme": repo_readme
    })

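# Summarise the example repository and print the result.
# NOTE: example_prompt is an illustrative stand-in; the prompt used for the original run is not shown.
example_prompt = (
    "Summarise the following GitHub repository.\n\n"
    "Repository description:\n{description}\n\n"
    "README.md content:\n{readme}\n\n"
    "Summary:"
)
generated_summary = get_summary(example_prompt, repo_description, repo_readme)
print(generated_summary)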


Reportify is a Python library that simplifies data reporting.  It transforms JSON or dictionary data into aesthetically pleasing PDF or HTML reports.  The library offers customizable templates and integrates easily with pandas DataFrames.  Users can generate formatted reports from complex data structures.


In [44]:
# prompts must include "{description}" and "{readme}"


simple_prompt = """
        Please provide a summary of the following GitHub repository based on its description and README.md content.
        If the README.md or description is not in English, please first translate it to English and then generate a summary.
    
        Repository description:
        {description}
    
        README.md content:
        {readme}
    
        Summary:
        """

short_response_prompt = """
        Please provide a very short summary of the purpose of the following GitHub repository based on its description and README.md content.
        If the README.md or description is not in English, please first translate it to English and then generate a summary.
        Focus on the intent and purpose of the repository, not the state or organisation that created it, the programming language, or the nature of the licence.

        The summary should be very brief, around ten words, and should not be full sentences, just relevant, expressive words.
        
        Repository description:
        {description}
    
        README.md content:
        {readme}
    
        Summary:
        """

bullets_prompt = """
        Please provide a bulleted summary of the following GitHub repository based on its description and README.md content.
        If the README.md or description is not in English, please first translate it to English and then generate a summary.
        Focus on the intent and purpose of the repository, not the state or organisation that created it, the programming language, or the nature of the licence.
        Repository description:
        {description}
    
        README.md content:
        {readme}
    
        Summary:
        """

prescriptive = """
        Please provide a summary of the following GitHub repository based on its description and README.md content.
        These will be used for categorisation via embeddings.
        If the README.md or description is not in English, please first translate it to English and then generate a summary.

        The summary should be:
            * concise, in fewer than 3 sentences.
            * focused on what the repository is used for, what it enables, and what its *intent* is.
        The summary should not:
            * mention the country, state or organisation that the repository is for.
            * include information about the type of license.
            * comment on where more information can be found or what information was not available.
            * mention the programming language used unless intrinsic to its purpose.
            * include the name of the repository unless it is a standard word or description needed for the summary.
    
        Repository description:
        {description}
    
        README.md content:
        {readme}
    
        Summary:
        """

prompts_to_test = {
    # 'simple': simple_prompt,
    # 'bullets': bullets_prompt,
    # 'prescriptive': prescriptive,
    'short': short_response_prompt
}

print(f"Defined {len(prompts_to_test)} prompts to test.")

Defined 1 prompts to test.
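Because every prompt must include `{description}` and `{readme}`, it may be worth checking templates before spending any model calls on them. The following is a minimal sketch using only the standard library; `REQUIRED_FIELDS`, `missing_fields`, and the two example templates are hypothetical names introduced for illustration.

```python
from string import Formatter

REQUIRED_FIELDS = {"description", "readme"}  # placeholders filled in by get_summary


def missing_fields(template: str) -> set:
    """Return the required placeholders that do not appear in the template."""
    found = {name for _, name, _, _ in Formatter().parse(template) if name}
    return REQUIRED_FIELDS - found


# Hypothetical example templates, not the prompts defined above.
examples = {
    "ok": "Description: {description}\nREADME: {readme}\nSummary:",
    "broken": "Summarise this README: {readme}",
}
for name, template in examples.items():
    print(name, sorted(missing_fields(template)))
```

Running the same check over `prompts_to_test.items()` would flag a template with a misspelled or missing placeholder before the generation loop starts.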


## 5. Generate and Compare Summaries

This next cell will iterate through each of the sampled repositories and generate a summary for each of the prompts you defined above. The results will be collected into a DataFrame for easy comparison.

In [45]:
results = []

for index, row in sample_df.iterrows():
    print(f"Processing repository: {row['name']}...")
    repo_info = {
        'name': row['name'],
        'description': row['description'],
        'original_summary': row['summary']
    }
    
    
    for prompt_name, prompt_text in prompts_to_test.items():
        # Generate a summary for this repository with the current prompt (this calls the LLM)
        summary = get_summary(prompt_text, repo_info['description'], row['readme'])
        repo_info[prompt_name] = summary
        
    results.append(repo_info)

print("\nFinished processing all repositories.")

# Create a DataFrame from the results
results_df = pd.DataFrame(results)


Processing repository: rdl-standard...
Processing repository: GeoNature-mobile-webapi...
Processing repository: portal-brasil...
Processing repository: Alouette_ISIS_extract...
Processing repository: quadratic-voting-frontend...
Processing repository: ala-install...
Processing repository: gsoc-2023...
Processing repository: volto-vlibras...
Processing repository: ioos-code-sprint...
Processing repository: ris-backend-service...
Processing repository: openspace-android-sdk...
Processing repository: steuerlotse...
Processing repository: lexml-renderer-pdf...
Processing repository: uswds...
Processing repository: avh-hub...
Processing repository: invitation-manager...
Processing repository: ioos-python-package-skeleton...
Processing repository: medlink...
Processing repository: peacetrack-readme...
Processing repository: restaurant-inspections...

Finished processing all repositories.
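Some rows in the sample have no usable description or README, and some READMEs are very long. One possible way to guard the loop against both is sketched below. `clean_text`, the `"Not provided"` placeholder, and the 4000-character cap are all assumptions chosen for illustration, not values taken from `llm_utils.py`.

```python
import math


def clean_text(value, max_chars: int = 4000, placeholder: str = "Not provided") -> str:
    """Return a prompt-safe string for a possibly missing or very long field."""
    # pandas represents empty CSV cells as float NaN.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return placeholder
    text = str(value).strip()
    if not text:
        return placeholder
    # Truncate so a very long README does not dominate the prompt.
    return text[:max_chars]


print(clean_text(float("nan")))
print(len(clean_text("x" * 10000)))
```

In the loop above, this would be applied as `get_summary(prompt_text, clean_text(row['description']), clean_text(row['readme']))`.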


## 6. Review the Results

The table below shows the original summary alongside the new summaries generated by each of your prompts. This should make it easy to compare their effectiveness.

In [46]:
# Set display options to show full text content
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

# Display the DataFrame as an HTML table for better readability
display(HTML(results_df.to_html()))

Unnamed: 0,name,description,original_summary,short
0,rdl-standard,"The Risk Data Library Standard (RDLS) is an open data standard to make it easier to work with disaster and climate risk data. It provides a common description of the data used and produced in risk assessments, including hazard, exposure, vulnerability, and modelled loss, or impact, data.","The Risk Data Library Standard (RDLS) is an open data standard facilitating work with disaster and climate risk data. It provides a common framework for describing hazard, exposure, vulnerability, and loss data used in risk assessments. The standard's development is coordinated through this repository, which publishes specifications and supports collaborative improvements. The repository also encourages feedback and discussion on the evolving standard.","Standardizing disaster risk data: hazard, exposure, vulnerability, loss."
1,GeoNature-mobile-webapi,WebAPI (coté serveur) de synchronisation des données produites par GeoNature-mobile,This repository provides a web API for synchronizing data generated by the GeoNature-mobile application. It enables importing data from GeoNature-mobile into a GeoNature PostgreSQL database. Synchronization can occur via network connection or by connecting a mobile device to a computer. The API is designed to work with a separate synchronization application for PC-based syncing.,"Data synchronization, GeoNature mobile, PostgreSQL database."
2,portal-brasil,Ambiente de desenvolvimento do PortalBrasil,This repository provides the development environment for the PortalBrasil website. It enables the installation and running of both backend and frontend servers using provided make commands. The repository includes instructions for managing backend and frontend packages. Developers can easily set up and manage the PortalBrasil website using this environment.,"PortalBrasil development environment, backend, frontend."
3,Alouette_ISIS_extract,🛰️ Ce code sert à extraire les données et les métadonnées des ionogrammes numérisés des satellites Alouette et ISIS | 🛰️ This code is an effort to extract data and metadata from the scanned ionogram images from the Alouette and ISIS satellites.,"This repository provides tools and data for extracting information from scanned ionograms of the Alouette and ISIS satellites. It offers a centralized repository of data and metadata from these satellites, enabling researchers to access and utilize this historical information for further research. The repository includes open-source code for data processing, raw images, and a user-friendly application for data selection, download, and visualization. The data covers over 60 years of ionospheric research.","Satellite ionogram data extraction, research access, data repository."
4,quadratic-voting-frontend,No description,"This repository provides the frontend code for a quadratic voting system developed for the 2019 Taiwan Presidential Hackathon. It includes sample data files (Proposal.json, ProposalPolls.json, User.json, UserAction.json) representing proposals, voting results, users, and user actions. The repository does not contain backend code. The data can be used for research and analysis.","Quadratic voting, frontend, hackathon project, data, public use."
5,ala-install,Ansible playbooks for installing the ALA components,"This repository provides Ansible playbooks for installing ALA components on Ubuntu 16 and later systems. It includes a playbook for setting up an ALA demo. The playbooks are designed to facilitate the deployment of a Living Atlas, and integrate with supporting tools to simplify the process. These tools assist in generating necessary configuration files and managing the Living Atlas portal.","ALA component installation, Ansible playbooks, Ubuntu deployment."
6,gsoc-2023,Google Summer of Code 2023 with the Mayor's Office of New Urban Mechanics: Guidance + Ideas,This repository contains guidance and ideas developed during Google Summer of Code 2023 in collaboration with the Mayor's Office of New Urban Mechanics. The project aims to provide support and resources. The exact nature of the guidance and ideas is unspecified due to the lack of a README.,Google Summer of Code project: urban mechanics guidance.
7,volto-vlibras,An addon integrating the VLibras service into a Plone site running Volto,"This repository provides a Volto add-on that integrates the VLibras service, enabling sign language interpretation within Plone websites. It allows developers to easily add VLibras functionality to their Volto-based Plone projects. Installation instructions and configuration details are included. The add-on is designed for use with Volto 18 and utilizes pnpm for package management.",Plone Volto addon: VLibras video interpretation integration.
8,ioos-code-sprint,Information about IOOS Code Sprint activities.,"This repository organizes the biannual IOOS Code Sprint, a four-day hackathon focused on addressing ocean data and information challenges. The sprint brings together developers, researchers, and community members to work on projects supporting NOAA's Integrated Ocean Observing System mission. Past sprints have been held both in-person and virtually. Project ideas can be submitted via issues on this repository.","Ocean data, hackathon, collaboration, projects, information challenges."
9,ris-backend-service,RIS Caselaw,"This repository, RIS Caselaw, provides a backend service. It requires several CLI tools including Docker, Node.js, and Java. The setup uses a container runtime and manages dependencies. The project includes tools for vulnerability scanning and managing architecture decision records.","Caselaw data, backend service, legal information, RIS system."
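Since the short prompt asks for "around ten words", a quick length check can complement reading the table by eye. The sketch below counts whitespace-separated words for two of the `short` summaries copied from the table above; `word_count`, `TARGET_WORDS`, and `TOLERANCE` are hypothetical names, and the tolerance of three words is an arbitrary assumption.

```python
def word_count(summary: str) -> int:
    """Count whitespace-separated words in a summary."""
    return len(summary.split())


# Two 'short' summaries copied from the results table.
short_summaries = {
    "rdl-standard": "Standardizing disaster risk data: hazard, exposure, vulnerability, loss.",
    "ioos-code-sprint": "Ocean data, hackathon, collaboration, projects, information challenges.",
}
TARGET_WORDS = 10  # the short prompt asks for "around ten words"
TOLERANCE = 3      # assumed tolerance; adjust to taste

for name, summary in short_summaries.items():
    n = word_count(summary)
    within = abs(n - TARGET_WORDS) <= TOLERANCE
    print(f"{name}: {n} words, within tolerance: {within}")
```

On the full results, the same function could be applied with `results_df['short'].map(word_count)` to see how closely each prompt variant respects its length instruction.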
