# **Repo Setup**



If this is the first time you are importing the GitHub repo, follow [this tutorial](https://github.com/KSDeshappriya/GoogleColab-GDrive-Git-GitHub_Repo).

## 1 - Imports and general setup

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Managing secrets
from google.colab import userdata

import gc # Garbage collector
import sys
from pathlib import Path
import os
import inspect # Access to source code


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2 - Config Git

In [2]:
# Set username and email
!git config --global user.name "SamdGuizani"
!git config --global user.email "samd.guizani@gmail.com"

# Set remote repo name/username
repo = 'reportingAgent'
username = "zBotta"

# Get GitHub token (must be 'classic' type)
github_token = userdata.get('GitHub_Samd_ReportAgent_GoogleColab')

# Cloning remote repo
cloning_required = False

# Clone remote repo and change working directory
# Set `cloning_required` to `True` when creating the local repo for the first time.
# Set `cloning_required` to `False` if the local repo already exists in Google Drive.
if cloning_required:
    !git clone https://{username}:{github_token}@github.com/{username}/{repo}.git /content/drive/MyDrive/GitHub/{repo}

# Changing working directory to local repo
%cd /content/drive/MyDrive/GitHub/{repo}

print(f"Current working directory : {os.getcwd()}\n")

# List content of working directory
!ls -lia

# Setting up token access (needed for git push)
!git remote set-url origin https://{username}:{github_token}@github.com/{username}/{repo}.git

# Check git status and current branch
!git status


/content/drive/MyDrive/GitHub/reportingAgent
Current working directory : /content/drive/MyDrive/GitHub/reportingAgent

total 19
37 drwx------ 2 root root 4096 Aug  6 13:44 app
36 drwx------ 2 root root 4096 Aug  6 13:44 .git
43 -rw------- 1 root root  126 Aug  9 16:20 .gitignore
41 -rw------- 1 root root 1073 Aug  7 07:42 LICENSE
38 drwx------ 3 root root 4096 Aug  6 13:46 PoC
40 -rw------- 1 root root 1151 Aug  7 07:42 projectSetup.py
42 -rw------- 1 root root 1453 Aug  7 07:42 README.md
39 -rw------- 1 root root 1474 Aug  7 07:42 requirements.txt
Refresh index: 100% (22/22), done.
On branch dev
Your branch is up to date with 'origin/dev'.

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   .gitignore
	deleted:    PoC/PoC_Prompt and report gen.ipynb
	modified:   app/mods/modelLoader.py
	modified:   app/mods/reportGenerator.py
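
The `cloning_required` flag above is set by hand. As a minimal sketch, assuming a cloned repo can be recognised by a `.git` entry at its root, the flag could instead be derived from the file system. The helper name `needs_clone` is hypothetical and not part of the project.

```python
from pathlib import Path
import tempfile


def needs_clone(local_repo: Path) -> bool:
    """Return True when `local_repo` does not yet contain a Git working copy."""
    # A cloned repository has a `.git` entry at its root.
    return not (local_repo / ".git").exists()


# Demonstration on a throwaway directory rather than on Google Drive.
with tempfile.TemporaryDirectory() as tmp:
    fresh = Path(tmp) / "reportingAgent"
    fresh.mkdir()
    print(needs_clone(fresh))   # no .git yet
    (fresh / ".git").mkdir()
    print(needs_clone(fresh))   # .git now present
```

In the notebook this could be used as `cloning_required = needs_clone(Path('/content/drive/MyDrive/GitHub') / repo)`.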


## 3 - Git commands

Use `!git` in a code cell to run Git commands, or use a terminal.

Useful Git commands:

In [3]:
# !git status                           # Display status (current branch and up-to-date status)

# !git fetch                            # Downloads all history from remote tracking branches

# !git checkout [branch name]           # Switches to specified branch and updates working dir

# !git branch [branch name]             # Creates branch
# !git branch -a                        # List all branches

# !git pull                             # Updates current local working branch (combination of 'git fetch' and 'git merge')

# !git add --all                        # Add all untracked and tracked changes
# !git add [file]                       # Add a specific file

# !git commit -m ["commit message"]     # Commit changes to the current branch

# !git push origin [branch name]        # Pushes changes to remote repo (branch already tracked)
# !git push -u origin [branch name]     # Pushes changes to remote repo (branch not tracked)

# !git remote get-url origin            # Display url of the remote repo
# !git remote set-url origin [url]      # Set up url of the remote repo

# !git reset --soft HEAD~1              # BEFORE PUSH: Undo last commit, keep all changes in staging area
# !git reset --mixed HEAD~1             # BEFORE PUSH: Undo last commit, move changes to working dir (unstaged)
# !git reset --hard HEAD~1              # BEFORE PUSH: Undo last commit, permanently delete changes from staging area and working dir
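
When the Git status is needed inside Python rather than as printed text, `git status --porcelain` gives a stable, script-friendly format. The sketch below is one possible approach. The helper names `parse_porcelain` and `git_status` are hypothetical, and the parser assumes simple paths (Git may quote unusual file names in this format).

```python
import subprocess


def parse_porcelain(text: str) -> list[tuple[str, str]]:
    """Split `git status --porcelain` output into (status, path) pairs."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # Porcelain v1 layout: two status characters, one space, then the path.
        entries.append((line[:2], line[3:]))
    return entries


def git_status(repo_dir: str = ".") -> list[tuple[str, str]]:
    """Run `git status --porcelain` in `repo_dir` and parse the result."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_porcelain(result.stdout)


# Parsing a hand-written sample, so no repository is needed to try it.
sample = " M .gitignore\n M app/mods/modelLoader.py\n?? PoC/Experiments/\n"
print(parse_porcelain(sample))
```

Calling `git_status()` from the repo's working directory would then return the same kind of list for the real repository.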


# **Python Environment Setup**

## 1 - Install project dependencies



In [4]:
# !pip install -r requirements.txt

In [5]:
# !pip install --upgrade torch torchvision
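
Before reinstalling anything, it can help to see which packages are already present and in which version. This is a minimal sketch using the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical, and the package list is only an example taken from the imports used in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the state of a few packages this notebook relies on.
for name in ("numpy", "pandas", "torch", "instructor"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```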

# **PoC Work**

## 1 - Check `git status` and current branch

In [6]:
!git status


On branch dev
Your branch is up to date with 'origin/dev'.

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   .gitignore
	deleted:    PoC/PoC_Prompt and report gen.ipynb
	modified:   app/mods/modelLoader.py
	modified:   app/mods/reportGenerator.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	PoC/Experiments/
	PoC/PoC_01_Prompt and report gen with different models.ipynb
	PoC/PoC_02_Report gen with parameters.ipynb
	PoC/python_env_setup.sh
	app/datasets/pharma_dev_reports_collection.xlsx

no changes added to commit (use "git add" and/or "git commit -a")


In [7]:
!git fetch


In [8]:
!git pull


Already up to date.


## 2 - Project specific imports

In [75]:
import numpy as np
import pandas as pd
import torch

from huggingface_hub import login
login(token=userdata.get('HF_TOKEN'))  # Hugging Face token read from the Colab secret 'HF_TOKEN'

import itertools
import json

To import a library stored locally, see [Importing python library from Drive](https://colab.research.google.com/drive/12qC2abKAIAlUM_jNAokGlooKY-idbSxi#scrollTo=prUMpfLaB-D7).

In [10]:
# Add root and app project path to environment -> permits module import
sys.path.append(os.getcwd())
sys.path.append(os.getcwd() + '/app')

# Import project modules
# from app.mods.apiReportGenerator import * # ModuleNotFoundError: No module named 'instructor'
from app.mods.dataHandler import *
from app.mods.metricsEvaluator import * # ImportError: cannot import name '_center' from 'numpy._core.umath' (/usr/local/lib/python3.11/dist-packages/numpy/_core/umath.py)
from app.mods.modelLoader import *
from app.mods.promptGenerator import *
from app.mods.reportGenerator import *
from app.mods.testBench import *
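
Re-running the cell above appends the same directories to `sys.path` each time. A small guard avoids the duplicates. This is a sketch only, and the helper name `add_to_sys_path` is hypothetical.

```python
import sys
from pathlib import Path


def add_to_sys_path(path) -> bool:
    """Append `path` to sys.path unless already listed; return True if it was added."""
    entry = str(Path(path).resolve())
    if entry in sys.path:
        return False
    sys.path.append(entry)
    return True


# Same two locations as in the cell above: project root and its `app` folder.
root = Path.cwd()
for candidate in (root, root / "app"):
    add_to_sys_path(candidate)
```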


## 3 - Global Variables

In [11]:
# set device to cuda if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch_dtype = torch.float32  # float32 is used on both CPU and GPU

device, torch_dtype

(device(type='cpu'), torch.float32)

In [12]:
# List of model IDs to test
model_ids = [
    "gpt2",
    "EleutherAI/gpt-neo-1.3B",
    "meta-llama/Llama-3.2-1B",
    "meta-llama/Llama-3.2-1B-Instruct",
    "meta-llama/Llama-3.2-3B",
    "meta-llama/Llama-3.2-3B-Instruct",
    "microsoft/phi-2",
    "HuggingFaceTB/SmolLM3-3B",
]


## 4 - Functions and Classes (defined within notebook)

In [None]:
def generate_report_from_row(row,
                             model, 
                             tokenizer, 
                             prompt_method,
                             generation_args={'temperature': 0.7}, output_type=Report):
    '''
    Generate a report from a single DataFrame row using a language model.

    Builds a prompt from the row's data, generates a report via `ReportGenerator`,
    and returns the title, report text, generation parameters, and metadata.
    On failure, returns placeholder values with the error message.

    Parameters
    ----------
    row : pandas.Series
        Row containing input fields for prompt generation.
    model : Any
        Language model instance.
    tokenizer : Any
        Associated tokenizer.
    prompt_method : str
        Prompt generation method name.
    generation_args : dict
        Arguments for `generate_report`. Defaults to `{'temperature': 0.7}`.
    output_type : type, optional
        Report output class. Defaults to `Report`.

    Returns
    -------
    dict
        Dictionary with 'prompt_method', generation args, 'generated_title', 'generated_report', and 'gen_params'.
    '''
    dh = DataHandler()
    gen_params = None  # Stays None if generation fails before parameters are returned
    try:
        prompt = PromptGenerator(**row.to_dict()).create_prompt(prompt_method)
        output, gen_params = ReportGenerator(model, tokenizer, output_type).generate_report(prompt, **generation_args)
        title, report = dh.get_title_and_report(output)  # json.loads(output).values()
    except Exception as e:
        print(f'Failed to generate report: {e}')
        title = 'FAIL'
        report = f'Error: {e}'
    # NOTE: This return dict is redundant; it would be better to pass all gen_params as kwargs and the prompt method through TestBench class methods.
    return {'prompt_method': prompt_method,
            **generation_args,
            'generated_title': title,
            'generated_report': report,
            'gen_params': gen_params}

## 5 - Main Script

### Import Dataset

In [53]:
# Import pharma_dev_reports_collection.xlsx

_path = os.getcwd() + '/app/datasets/pharma_dev_reports_collection.xlsx'

# Read the Excel file into a pandas DataFrame
df_Reports = pd.read_excel(_path)
df_Reports.columns = ['report_name', 'what', 'when', 'where', 'who', 'how', 'why', 'contingency_actions', 'event_description', 'NbChr', 'Comments']
df_Reports

Unnamed: 0,report_name,what,when,where,who,how,why,contingency_actions,event_description,NbChr,Comments
0,,Incorrect pH adjustment in buffer preparation,"June 10, 2025, 9:15 AM","Formulation Area, Production Building 2","Rahul Mehta, Process Technician",pH meter not calibrated before use,Technician skipped calibration step due to tim...,"Buffer batch discarded, technician retrained, ...","On June 10, 2025, at 9:15 AM in the Formulatio...",347,Manually generated in ChatGPT
1,,Contaminated gloves observed during aseptic fi...,"June 12, 2025, 2:40 PM","Grade A Filling Line, Sterile Suite A","Emily Zhang, Line Operator",Touched non-sterile surface during setup,Operator unaware surface was non-sterile,"Line stopped, gloves changed, affected vials q...","On June 12, 2025, at 2:40 PM, during aseptic f...",269,Manually generated in ChatGPT
2,,Late sampling of stability chamber,"June 15, 2025, 11:00 AM","QC Lab, Stability Room 3","Daniel Ortiz, QC Analyst",Sample collection delayed by 24 hours,Oversight due to miscommunication in sampling ...,"Deviation logged, additional sample points add...","On June 15, 2025, at 11:00 AM in QC Stability ...",258,Manually generated in ChatGPT
3,,Temperature excursion in cold room,"June 17, 2025, 6:00 AM – 9:00 AM","Cold Room 2, Warehouse Building 1",Detected by automated monitoring,HVAC malfunction caused temp rise to 10°C,Unexpected failure of compressor unit,"Products moved, HVAC repaired, QA notified, ro...","Between 6:00 and 9:00 AM on June 17, 2025, Col...",252,Manually generated in ChatGPT
4,,Incorrect material label applied,"June 19, 2025, 4:30 PM",Material Receiving Area,"Alexandra Becker, Warehouse Operator",Wrong label selected from batch printout,Look-alike/sound-alike material names,"All affected labels corrected, batch quarantin...","On June 19, 2025, at 4:30 PM, Alexandra Becker...",260,Manually generated in ChatGPT
...,...,...,...,...,...,...,...,...,...,...,...
102,Product Recall Due to Potency Out of Specifica...,potency OOS in finished product batch FP-4321,"July 12, 2024, at 08:00 (discovery)",Quality Control Testing QC-Lab5,"David Nguyen, QC Chemist – discovered; Quality...",assay results below specification limits,possible raw material variability or process d...,"batch recall executed, investigation launched ...",David Nguyen detected potency out-of-specifica...,224,2025-07-19_Automatically generated using API (...
103,Failure to Follow SOP for Sampling,deviation from SOP in sampling during raw mate...,"March 28, 2024, at 12:00 (discovery)",Incoming Inspection Area IIA-4,"Samuel Brooks, QC Technician – discovered; QC ...",sample size smaller than defined by SOP,operator misunderstanding of procedure,"resampling performed, SOP review and retrainin...",Samuel Brooks identified that sampling during ...,225,2025-07-19_Automatically generated using API (...
104,Expired Material Used in Production,use of expired excipient batch EX-998 in produ...,"April 9, 2024, at 14:15 (discovery)",Production Area PA-7,"Jessica Evans, Inventory Controller – discover...",expired excipient mistakenly retrieved due to ...,failure to segregate expired materials properly,"product batch quarantined, material segregatio...","On April 9, 2024 at 14:15, Jessica Evans found...",234,2025-07-19_Automatically generated using API (...
105,Incomplete Stability Sample Submission,missing stability samples scheduled for analys...,"May 20, 2024, at 09:00 (discovery)",Stability Testing Lab STL-1,"Olivia Turner, Stability Analyst – discovered;...",samples not delivered on time from storage,logistical error in sample transportation,"samples located and submitted, scheduling proc...","Olivia Turner discovered on May 20, 2024 at 09...",238,2025-07-19_Automatically generated using API (...


### Load model

In [14]:
ml = ModelLoader(model_id=model_ids[0], device=device, torch_dtype=torch_dtype)
model, tokenizer = ml.load_model(hf_token=userdata.get('HF_TOKEN'))


In [15]:
# # Select a case
# report_index = 29 # row to pick in df_Reports
# row = df_Reports.loc[report_index, 'what':'contingency_actions']

### Generate report per input row (sample of dataset)

In [77]:
# Sample of dataset
df = df_Reports.loc[:4, 'what':'contingency_actions']

# Choose prompting method and user-defined generation parameters
prompt_method = 'C'
generation_args = {'temperature': 0.7}

# Generate a report for each case
df_results = pd.DataFrame(
    df.apply(lambda row: generate_report_from_row(row, model, tokenizer, prompt_method, generation_args=generation_args), axis=1).to_list()
)
df_results

Unnamed: 0,prompt_method,temperature,title,report,gen_params
0,C,0.7,Wrong pH adjustment,"On June 10, 2025, at 9:15 AM, Rahul Mehta and ...","{'temperature': 0.7, 'top_k': 50, 'top_p': 1.0..."
1,C,0.7,Wrong tablet count,"On June 12, 2025, at 2:40 PM, Emily Zhang load...","{'temperature': 0.7, 'top_k': 50, 'top_p': 1.0..."
2,C,0.7,Wrong tablet count,"On June 15, 2025, at 11:00 AM, Daniel Ortiz re...","{'temperature': 0.7, 'top_k': 50, 'top_p': 1.0..."
3,C,0.7,"On July 2, 2025, at 3:30 PM, Erik Hansen was u...",",","{'temperature': 0.7, 'top_k': 50, 'top_p': 1.0..."
4,C,0.7,Wrong material label applied,"On June 19, 2025, at 4:30 PM, Alexandra Becker...","{'temperature': 0.7, 'top_k': 50, 'top_p': 1.0..."


### Automate testing of various generation parameters on the dataset

In [None]:
import os
import shutil
from projectSetup import Setup
from conf.projectConfig import Config as cf
from mods.testBench import TestBench
from mods.metricsEvaluator import MetricsEvaluator
from mods.dataHandler import DataHandler, Report
from mods.promptGenerator import PromptGenerator
from mods.reportGenerator import ReportGenerator
from mods.modelLoader import ModelLoader

env = Setup()
met_eval = MetricsEvaluator()
# Load data
dh = DataHandler()

report_data = dh.import_reports()
# Load model
model_id = 'openai-community/gpt2' 
ml = ModelLoader(model_id=model_id, device=env.device, torch_dtype=env.torch_dtype)
model, tokenizer = ml.load_model(hf_token=env.config["HF_TOKEN"])

rg = ReportGenerator(model, tokenizer, output_type=Report)
tb = TestBench(MetricsEvaluator=met_eval, DataHandler=dh, ModelLoader=ml)

report_idx_list = [20, 25]
report_data_filtered = report_data.iloc[report_idx_list]
tb.eval_gs_param(report_data=report_data_filtered,
                 report_generator=rg,
                 prompt_method_list=["C"],
                 param_dict={"temperature": [0.7, 1.3],
                             "top_p": [0.6, 1],
                             "max_new_tokens": [300]})
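
The name `eval_gs_param` suggests a grid search over `param_dict`, although its implementation is not shown here. Under that assumption, the sketch below shows how such a dictionary expands into individual parameter combinations. The helper `expand_grid` is hypothetical and is not part of `TestBench`.

```python
import itertools


def expand_grid(param_dict: dict) -> list[dict]:
    """Return one dict per combination of the value lists in `param_dict`."""
    keys = list(param_dict)
    value_lists = (param_dict[k] for k in keys)
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


# Same parameter ranges as in the cell above: 2 x 2 x 1 combinations.
grid = expand_grid({"temperature": [0.7, 1.3],
                    "top_p": [0.6, 1],
                    "max_new_tokens": [300]})
for combo in grid:
    print(combo)
```

Each resulting dict has the same shape as the `generation_args` passed to `generate_report_from_row`.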

In [None]:
# Sample of dataset
df = df_Reports.loc[:4, 'what':'contingency_actions']

# Parameter ranges
list_prompt_method = ['A', 'B', 'C']
list_temperature = [0.6, 0.9]
list_top_p = [0.9]
list_max_new_tokens = [300]
list_repetition_penalty = [1.0]

# Test different prompts and generation parameters
experiments_results = []
param_grid = itertools.product(list_prompt_method, list_temperature, list_top_p, list_max_new_tokens, list_repetition_penalty)
for i, (prompt_method, temperature, top_p, max_new_tokens, repetition_penalty) in enumerate(param_grid, start=1):
    print(f'----- Starting test #{i} -----')
    generation_args = {
        "temperature": temperature,
        "top_p": top_p,
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": repetition_penalty
    }
    # TODO: update the TestBench to implement apply within a df and parallelization
    df_results = pd.DataFrame(
        df.apply(lambda row: generate_report_from_row(row, model, tokenizer, prompt_method=prompt_method, generation_args=generation_args), axis=1).to_list()
    )

    experiments_results.append(df_results)
    print(f'----- End test #{i} -----')

----- Starting test #1 -----
----- End test #1 -----
----- Starting test #2 -----
----- End test #2 -----
----- Starting test #3 -----
Failed to generate report: Unterminated string starting at: line 1 column 44 (char 43)
----- End test #3 -----
----- Starting test #4 -----
Failed to generate report: Unterminated string starting at: line 1 column 46 (char 45)
----- End test #4 -----
----- Starting test #5 -----
----- End test #5 -----
----- Starting test #6 -----
Failed to generate report: Unterminated string starting at: line 1 column 54 (char 53)
----- End test #6 -----
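
The `Unterminated string` messages above are what Python's `json` module raises when the text ends in the middle of a string, for example when generation is cut off. The sketch below reproduces that failure mode and the fallback to placeholder values. It assumes the model output is a JSON object with `title` and `report` keys, which is an assumption about the output format. `parse_title_and_report` is a hypothetical stand-in, not the project's `DataHandler.get_title_and_report`.

```python
import json


def parse_title_and_report(output: str) -> tuple[str, str]:
    """Parse generated JSON text; fall back to placeholders when it is invalid."""
    try:
        data = json.loads(output)
        return data["title"], data["report"]
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        # Same placeholder convention as in generate_report_from_row.
        return "FAIL", f"Error: {e}"


complete = '{"title": "Wrong pH adjustment", "report": "On June 10, 2025, at 9:15 AM"}'
truncated = '{"title": "Wrong pH adjustment", "report": "On June 10, 2025'

print(parse_title_and_report(complete))
print(parse_title_and_report(truncated))
```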


In [88]:
experiments_results[5]

Unnamed: 0,prompt_method,temperature,top_p,max_new_tokens,repetition_penalty,generated_title,generated_report,gen_params
0,C,0.9,0.9,300,1.0,Wrong pH adjustment,"On June 10, 2025, at 9:15 AM, Erik Hansen chec...","{'temperature': 0.9, 'top_k': 50, 'top_p': 0.9..."
1,C,0.9,0.9,300,1.0,Wrong tablet count in bottle,"On June 12, 2025, at 2:40 PM, Erik Hansen load...","{'temperature': 0.9, 'top_k': 50, 'top_p': 0.9..."
2,C,0.9,0.9,300,1.0,Wrong tablet counting,"On June 15, 2025, at 11:00 AM, Erik Hansen ins...","{'temperature': 0.9, 'top_k': 50, 'top_p': 0.9..."
3,C,0.9,0.9,300,1.0,Wrong tablet counting,"On June 17, 2025, at 6:00 AM, Erik Hansen chec...","{'temperature': 0.9, 'top_k': 50, 'top_p': 0.9..."
4,C,0.9,0.9,300,1.0,FAIL,Error: Unterminated string starting at: line 1...,"{'temperature': 0.9, 'top_k': 50, 'top_p': 0.9..."
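
To compare runs, the share of failed generations per experiment can be computed from the `'FAIL'` placeholder in `generated_title`. This is a minimal sketch working on plain dictionaries, such as those produced by `experiments_results[5].to_dict('records')`. The helper name `failure_rate` is hypothetical.

```python
def failure_rate(records: list[dict]) -> float:
    """Fraction of records whose 'generated_title' is the 'FAIL' placeholder."""
    if not records:
        return 0.0
    failures = sum(1 for r in records if r.get("generated_title") == "FAIL")
    return failures / len(records)


# Titles copied from the table above: four successes and one failure.
records = [
    {"generated_title": "Wrong pH adjustment"},
    {"generated_title": "Wrong tablet count in bottle"},
    {"generated_title": "Wrong tablet counting"},
    {"generated_title": "Wrong tablet counting"},
    {"generated_title": "FAIL"},
]
print(failure_rate(records))  # 0.2
```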


In [None]:
# pd.concat(experiments_results, ignore_index=True).to_excel("generated_reports.xlsx", index=False) # Export all generated reports to xlsx