<a href="https://colab.research.google.com/github/summerdevlin46/recruitment_test/blob/main/bsc_recruitment_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BSC Recruitment Test

## Setting up environment

In [1]:
  !git clone https://github.com/summerdevlin46/recruitment_test

Cloning into 'recruitment_test'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 22 (delta 4), reused 7 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (22/22), 299.49 KiB | 21.39 MiB/s, done.
Resolving deltas: 100% (4/4), done.


In [11]:
# Run git inside the cloned repository; the notebook's working directory is not itself a git repo
!git -C recruitment_test status


In [3]:
from huggingface_hub import login
from google.colab import userdata

access_token_read = userdata.get('HF_TOKEN')

login(token = access_token_read)

In [4]:
import json

# Load the JSON data
with open("recruitment_test/data/articles.json", "r") as file:
    data = json.load(file)

## Preprocessing JSON Files

I removed every entry in which all fields other than `pmcid` were null, which left 36 entries containing usable data.

In [None]:
# Filter out the entries with all null values except for pmcid
filtered_articles = [entry for entry in data if any(value is not None for key, value in entry.items() if key != 'pmcid')]

# Print filtered data
print(json.dumps(filtered_articles, indent=4))

# Number of entries after filtering
#print("Number of entries after filtering:", len(filtered_articles))
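
To make the filter's rule concrete, the sketch below applies the same comprehension to a small hand-made list. The three sample entries and their field values are invented for illustration and are not taken from `articles.json`.

```python
# Hypothetical sample entries (not from articles.json)
sample = [
    {"pmcid": "PMC0000001", "abstract": None, "results": None},         # all null -> dropped
    {"pmcid": "PMC0000002", "abstract": "Some text", "results": None},  # one field set -> kept
    {"pmcid": "PMC0000003", "abstract": None, "results": "Some text"},  # one field set -> kept
]

# Same rule as the filter above: keep an entry if any field other than pmcid is non-null
kept = [
    entry for entry in sample
    if any(value is not None for key, value in entry.items() if key != "pmcid")
]

print([entry["pmcid"] for entry in kept])
```

An entry survives as soon as a single field besides `pmcid` holds a value, so partially filled articles are still passed to the model.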


## Setting up LLM and Prompting

In [12]:
import torch
from transformers import pipeline

# device_map='auto' below places the model on the GPU when one is available.
# If the GPU runs out of memory, pick a smaller model or set the device to CPU.

# model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # Uses more GPU RAM
model_id = "meta-llama/Llama-3.2-3B-Instruct"        # Uses less GPU RAM
# model_id = "meta-llama/Llama-3.2-1B-Instruct"      # Uses the least GPU RAM
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

Device set to use cuda:0


In [18]:
# Function to process a single article and extract relevant information
def extract_information_from_article(article):
    # Combine relevant fields into a single text input
    # dict.get only falls back when a key is absent, so `or` is used to
    # also cover fields that are present but stored as null
    article_content = (
        f"Abstract: {article.get('abstract') or 'Not available'}\n"
        f"Introduction: {article.get('intro') or 'Not available'}\n"
        f"Methods: {article.get('method') or 'Not available'}\n"
        f"Subjects: {article.get('subjects') or 'Not available'}\n"
        f"Results: {article.get('results') or 'Not available'}\n"
        f"Discussion: {article.get('discussion') or 'Not available'}\n"
        f"Conclusion: {article.get('conclusion') or 'Not available'}\n"
    )

    # Define the prompt for extracting information
    # Prompt Engineering Reasoning:
    # After experimenting with multiple prompt variations, this approach consistently produced the best results.
    #
    # Key Design Choices:
    # 1. Role Specification: Clearly defining the LLM's role as an "assistant trained to extract structured data"
    #    improves accuracy and consistency in extraction.
    # 2. JSON Format Enforcement: The output is explicitly structured in JSON, ensuring machine-readable and
    #    structured responses. If any information is missing, it is assigned a `null` value to maintain consistency.
    # 3. Field Definition: Each field is clearly specified, detailing the type of information expected.
    # 4. Flexibility in Study Design: Initially, a stricter JSON structure was enforced, but due to variability in
    #    research methodologies (e.g., some studies not having an experimental-control group setup), the prompt was
    #    adjusted to allow for different study designs. This ensures broader applicability across various scientific articles.

    system_prompt = """
You are an assistant trained to extract structured data from the scientific article provided.
Extract the following information fields and return the output in JSON. If some information is missing, you respond with null.
1. disease_condition: Studied disease condition, health condition, disorder, syndrome, etc.
2. total_patients: Total number of patients
3. case_group_patients: Number of patients in experimental groups
4. control_group_patients: Number of patients in control groups
5. patient_sex: Sex of the patients
6. patient_origin: Origin of the patients
"""

    user_prompt = f""" Scientific article:
{article_content}
"""
    # Configure the messages for the LLM
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

    outputs = pipe(
        messages,
        max_new_tokens=500,
    )
    # The pipeline returns the whole chat; the last message is the assistant's reply
    return outputs[0]["generated_text"][-1]['content']
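
The prompt asks for six named fields and `null` for anything missing, but nothing in the notebook checks that a reply actually has that shape. A minimal sketch of such a check follows. `EXPECTED_FIELDS` and `normalize_record` are hypothetical names introduced here, and the sample reply is made up.

```python
# Field names copied from the system prompt above
EXPECTED_FIELDS = [
    "disease_condition",
    "total_patients",
    "case_group_patients",
    "control_group_patients",
    "patient_sex",
    "patient_origin",
]


def normalize_record(record):
    """Return a dict with exactly the expected fields, using None for missing ones.

    Hypothetical helper: keys outside EXPECTED_FIELDS are dropped.
    """
    return {field: record.get(field) for field in EXPECTED_FIELDS}


# Made-up reply that omits some fields and adds an unexpected one
reply = {"disease_condition": "asthma", "total_patients": 120, "notes": "n/a"}
normalized = normalize_record(reply)
print(normalized)
```

Normalizing this way would give every saved record the same keys, at the cost of discarding any extra fields the model volunteers.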

## Processing the articles and saving the extracted information

In [14]:
# Parses the model's reply as JSON and wraps it together with the article's pmcid.
# Returns (1, record) on success and (0, None) when the reply holds no valid JSON.
def format_json(pmcid, content):
    json_content = content.strip()

    # The model usually wraps its JSON in triple backticks, often tagged ```json
    if "```" in json_content:
        # The part after the first fence is the JSON content
        json_content = json_content.split('```')[1].strip()
        # Drop the optional language tag that follows the opening fence
        if json_content.lower().startswith("json"):
            json_content = json_content[4:].strip()

    try:
        # Parse the JSON content
        extracted_info = json.loads(json_content)
    except json.JSONDecodeError:
        # print(f"Error decoding JSON for pmcid: {pmcid}")
        return 0, None

    return 1, {'pmcid': pmcid, 'extracted_info': extracted_info}
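
Splitting on triple backticks depends on the model fencing its answer. As a sketch of a fence-independent alternative, the helper below scans the reply for the first position at which a JSON object can be decoded, using `json.JSONDecoder.raw_decode` from the standard library. `find_first_json_object` is a hypothetical name and is not part of the notebook's pipeline; the sample reply is invented.

```python
import json


def find_first_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and ignores whatever text follows it
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        # This brace did not open a valid object; try the next one
        start = text.find("{", start + 1)
    return None


# Made-up reply with prose around the JSON object
reply = 'Here is the extracted data: {"total_patients": 42, "patient_sex": null} Hope this helps.'
print(find_first_json_object(reply))
```

Because the decoder stops at the end of the first complete object, trailing commentary from the model does not cause a parse error.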

In [19]:
import json
from pathlib import Path

# Process all articles in the dataset
results = []

MAX_ATTEMPTS = 5

for article in filtered_articles:
    # Regenerate until the reply parses as JSON, up to MAX_ATTEMPTS times
    for _ in range(MAX_ATTEMPTS):
        content = extract_information_from_article(article)
        check, formatted_extracted_info = format_json(article['pmcid'], content)
        if check == 1:
            results.append(formatted_extracted_info)
            print(f"PROCESSED article with PMCID: {article['pmcid']}")
            break
    else:
        # Runs only when every attempt failed to produce valid JSON
        print(f"ERROR processing article with PMCID: {article['pmcid']}\nCONTENT:{content}\n")

# Save the results to a JSON file
output_path = Path("/content/recruitment_test/data/extracted_info.json")
output_path.write_text(json.dumps(results, indent=4))

print(f"Extraction complete. Results saved to {output_path}.")


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

PROCESSED article with PMCID: PMC10167034
PROCESSED article with PMCID: PMC10191296
PROCESSED article with PMCID: PMC10390885
PROCESSED article with PMCID: PMC2383879
PROCESSED article with PMCID: PMC2693442
PROCESSED article with PMCID: PMC3145824
PROCESSED article with PMCID: PMC3174812
PROCESSED article with PMCID: PMC3212907
PROCESSED article with PMCID: PMC3534646
PROCESSED article with PMCID: PMC3909226
PROCESSED article with PMCID: PMC4393161
PROCESSED article with PMCID: PMC5040013
PROCESSED article with PMCID: PMC5425199
PROCESSED article with PMCID: PMC5441889
PROCESSED article with PMCID: PMC5601641
PROCESSED article with PMCID: PMC5717332
PROCESSED article with PMCID: PMC5839230
PROCESSED article with PMCID: PMC6060212
PROCESSED article with PMCID: PMC6286024
PROCESSED article with PMCID: PMC6730009
PROCESSED article with PMCID: PMC6760014
PROCESSED article with PMCID: PMC6775309
PROCESSED article with PMCID: PMC6853912
PROCESSED article with PMCID: PMC6955584
PROCESSED article with PMCID: PMC7007877
PROCESSED article with PMCID: PMC7449478
PROCESSED article with PMCID: PMC7657407
PROCESSED article with PMCID: PMC7661891
PROCESSED article with PMCID: PMC8288503
PROCESSED article with PMCID: PMC8382172
PROCESSED article with PMCID: PMC8976245
PROCESSED article with PMCID: PMC9333080
PROCESSED article with PMCID: PMC9381901
PROCESSED article with PMCID: PMC9529122
PROCESSED article with PMCID: PMC9616492
PROCESSED article with PMCID: PMC9697589
Extraction complete. Results saved to /content/recruitment_test/data/extracted_info.json.
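
A quick way to judge the extraction is to count how often each field came back non-null. The sketch below does this for a list shaped like `results`, where each item has a `pmcid` and an `extracted_info` dict. The two sample records are invented, and `count_filled_fields` is a hypothetical helper rather than part of the notebook.

```python
from collections import Counter


def count_filled_fields(records):
    """Count, per field name, how many records hold a non-null value for it."""
    counts = Counter()
    for record in records:
        info = record.get("extracted_info")
        if not isinstance(info, dict):
            # Skip replies that parsed as JSON but are not an object
            continue
        for field, value in info.items():
            if value is not None:
                counts[field] += 1
    return counts


# Invented records in the same shape as the entries written to extracted_info.json
sample_results = [
    {"pmcid": "PMC0000001", "extracted_info": {"total_patients": 50, "patient_sex": None}},
    {"pmcid": "PMC0000002", "extracted_info": {"total_patients": 12, "patient_sex": "female"}},
]
filled = count_filled_fields(sample_results)
print(dict(filled))
```

Fields with low counts are the ones worth inspecting by hand, since a null can mean either that the article omits the information or that the model missed it.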
