**Step 1: Literature search and text collection**

In [1]:
# Search PubMed and get article abstracts

# pip install biopython pandas openpyxl

from Bio import Entrez
import pandas as pd

# Define your email to use with NCBI Entrez
Entrez.email = "your@email.com"

def search_pubmed(keyword):
    
    # Adjust the search term to focus on abstracts
    search_term = f"{keyword}[Abstract]"
    handle = Entrez.esearch(db="pubmed", term=search_term, retmax=500)
    record = Entrez.read(handle)
    handle.close()
    # Get the list of Ids returned by the search
    id_list = record["IdList"]
    return id_list

def fetch_details(id_list):
    # Nothing to fetch if the search returned no IDs
    if not id_list:
        return []
    ids = ','.join(id_list)
    handle = Entrez.efetch(db="pubmed", id=ids, retmode="xml")
    records = Entrez.read(handle)
    handle.close()

    # Create a list to hold our article details
    articles = []

    for pubmed_article in records['PubmedArticle']:
        article = {}
        article_data = pubmed_article['MedlineCitation']['Article']
        article['Title'] = article_data.get('ArticleTitle')
        
        # Directly output the abstract
        abstract_text = article_data.get('Abstract', {}).get('AbstractText', [])
        if isinstance(abstract_text, list):
            abstract_text = ' '.join(abstract_text)
        article['Abstract'] = abstract_text

        article['Journal'] = article_data.get('Journal', {}).get('Title')

        articles.append(article)

    return articles



# Example usage
keyword = "yarrowia carotene"
id_list = search_pubmed(keyword)
articles = fetch_details(id_list)

# Convert our list of articles to a DataFrame
df = pd.DataFrame(articles)

# Saving the DataFrame to an Excel file
excel_filename = keyword+"_pubmed_search_results.xlsx"
df.to_excel(excel_filename, index=False)

print(f"Saved search results to {excel_filename}")


Saved search results to yarrowia carotene_pubmed_search_results.xlsx
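
The cell above sends up to 500 IDs to `efetch` in a single request. If that ever proves too large or slow, one option is to fetch in smaller batches. The sketch below shows only the batching step; `chunked` and the batch size of 100 are illustrative assumptions, not part of the original notebook.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical usage with the functions defined above (needs network access):
# articles = []
# for batch in chunked(id_list, 100):
#     articles.extend(fetch_details(batch))

# Offline demonstration with placeholder IDs
print(list(chunked(["1", "2", "3", "4", "5"], 2)))
```

Each batch is an ordinary list, so it can be passed to `fetch_details` unchanged.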


**Step 2: Entity and relationship extraction with LLM**

In [2]:
import pandas as pd
import os
import requests

def ask_questions(abstract, questions, system_prompts):
    responses = []
    for question, system_prompt in zip(questions, system_prompts):
        prompt_text = question + " " + str(abstract)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt_text}
        ]
        
        try:
            # Use the OpenAI-compatible endpoint
            response = requests.post(
                'http://localhost:11434/v1/chat/completions',
                json={
                    'model': 'qwen2.5:14b',  # Verify model name with `ollama list`
                    'messages': messages,
                    'max_tokens': 5000
                },
                timeout=10000  # Timeout in seconds; an argument to requests, not a field of the JSON body
            )
            response.raise_for_status()
            
            # Extract response in OpenAI-compatible format
            answer = response.json()['choices'][0]['message']['content']
            responses.append(answer.strip())
        except Exception as e:
            print(f"Error getting response: {e}")
            responses.append("")
    
    return responses

# ---------------------------------------------------
# Example usage reading from Excel and saving results
# ---------------------------------------------------

# Read the Excel file
file_path = excel_filename  # Replace with your file path
df = pd.read_excel(file_path)

questions = [" "]  # Maintain a placeholder for text structure
system_prompts = [
    # Adjacent string literals are concatenated, so the prompt can span two lines
    "You are a specialized analyzer for scientific paper abstracts with a focus on identifying causal relationships between key entities in biological studies. Your primary task is to extract and identify all causal relationships present in an abstract between the following entities: Performance, Species, Genes, Methods of genetic engineering (such as knockout or expression), Enzymes, Proteins, and Bioprocess conditions (e.g., growth conditions). For each abstract provided, identify every causal relationship between these entities. You must consider all combinations, even indirect relationships, and output a long, detailed answer that includes every possible valid combination pair. Your output should strictly follow this format: (Entity A, Entity B), (Entity C, Entity D), ... with no additional text. Important Instructions: Comprehensiveness: Include all valid combinations of causal relationships found in the abstract; Detail: Ensure the output is long and detailed, listing every relationship pair even if some relationships may be indirectly connected; Format: Output only the pairs in the exact format described above with no additional explanations or commentary. Examples: Example 1: Abstract: The knockout of gene X significantly improved performance in species Y. Output: (gene X, performance), (gene X, species Y), (species Y, performance). Example 2: Abstract: Expression of enzyme Z in species W leads to increased protein levels under specific growth conditions. Output: (enzyme Z, protein levels), (enzyme Z, species W), (enzyme Z, growth conditions), (protein levels, species W), (protein levels, growth conditions), (species W, growth conditions). Example 3: Abstract: The study shows that a change in bioprocess conditions influences the expression of gene A, which in turn affects the performance of species B. "
    "Output: (bioprocess conditions, gene A), (bioprocess conditions, performance), (bioprocess conditions, species B), (gene A, performance), (gene A, species B), (performance, species B).",
]

# Process each abstract and store the response
total_rows = len(df)

for i, row in df.iterrows():
    os.system('cls' if os.name == 'nt' else 'clear')
    
    # Get response from Ollama
    response = ask_questions(row['Abstract'], [questions[0]], [system_prompts[0]])[0]
    df.at[i, 'Extracted entities'] = response
    
    print(f"Response for Row {i+1}:")
    print(f"Answer to Question 2: {response}")
    
    progress = ((i + 1) / total_rows) * 100
    print(f"Progress: {progress:.2f}% completed")

# Save the updated DataFrame
output_file_path = 'updated(Qwen2.5_14b)_' + keyword + '_causal.xlsx'
df.to_excel(output_file_path, index=False)


Response for Row 1:
Answer to Question 2: (bioprocess conditions, performance), (bioprocess conditions, species Yarrowia lipolytica), (species Yarrowia lipolytica, performance), (lipid utilization, performance), (lipid utilization, species Yarrowia lipolytica), (growth rate, lipid accumulation), (growth rate, performance), (growth rate, species Yarrowia lipolytica), (lipid uptake, lipid accumulation), (lipid uptake, performance), (lipid uptake, species Yarrowia lipolytica), (lipid accumulation, performance), (lipid accumulation, species Yarrowia lipolytica)
Progress: 1.10% completed
Response for Row 2:
Answer to Question 2: (squalene biosynthesis pathway, squalene production), (ABC transporters, squalene secretion), (OSH3, squalene secretion), (binding domain of OSH3, secretion signal peptide), (signal peptides, squalene secretion), (Yarrowia lipolytica, squalene production), (Yarrowia lipolytica, squalene secretion), (SNQ2, squalene secretion), (squalene biosynthesis pathway, Yarrowia
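
The model's answers are plain strings of `(Entity A, Entity B)` pairs, so the later steps have to parse them back into structured data. The sketch below uses the same regular expression as Step 3. The `parse_pairs` helper and the sample string are illustrative assumptions. An incomplete final pair, such as one cut off by the token limit, simply does not match and is dropped.

```python
import re

# Same pattern as in Step 3: "(entity A, entity B)"
PAIR_PATTERN = re.compile(r"\(([^,]+), ([^)]+)\)")

def parse_pairs(text):
    """Return the (entity_a, entity_b) tuples found in a model response."""
    return [(a.strip(), b.strip()) for a, b in PAIR_PATTERN.findall(text)]

# Made-up response whose last pair is cut off
sample = "(gene X, performance), (gene X, species Y), (species Y, perf"
print(parse_pairs(sample))
```

Because the pattern splits on the first comma, an entity name that itself contains a comma would be parsed incorrectly. This sketch does not handle that case.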

**Step 3: Combine entities with similar meanings**

In [3]:
import pandas as pd
import re
import requests
import numpy as np
import concurrent.futures

##################################################
# 1) READ EXCEL AND EXTRACT ENTITIES
##################################################
df = pd.read_excel(output_file_path, engine="openpyxl")
df["Extracted entities"] = df["Extracted entities"].fillna("")
column_values = df["Extracted entities"].astype(str).tolist()

pattern = r"\(([^,]+), ([^)]+)\)"
entities = []
for value in column_values:
    matches = re.findall(pattern, value)
    for (e1, e2) in matches:
        entities.append(e1)
        entities.append(e2)

# Remove duplicates (preserving the order of first appearance)
entities = list(dict.fromkeys(entities))

##################################################
# 2) GET OLLAMA EMBEDDINGS (PARALLEL)
##################################################
def get_ollama_embedding(text, model):
    """
    Calls Ollama's OpenAI-style /v1/embeddings endpoint.
    Returns a Python list of floats or None on error.
    """
    try:
        r = requests.post(
            "http://localhost:11434/v1/embeddings",
            json={"model": model, "input": text},
            timeout=30
        )
        r.raise_for_status()
        data = r.json()
        # data["data"][0]["embedding"] => the actual embedding vector
        return data["data"][0]["embedding"]
    except Exception as e:
        print(f"Error embedding '{text}': {e}")
        return None

model_name = "qwen2.5:14b"  # Replace with your actual Ollama model name

# --- PARALLELIZE EMBEDDING REQUESTS ---
all_embeddings = []
with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    # Submit a future for each entity
    future_to_entity = {executor.submit(get_ollama_embedding, ent, model_name): ent for ent in entities}
    
    # Collect results as they complete
    for future in concurrent.futures.as_completed(future_to_entity):
        ent = future_to_entity[future]
        try:
            emb = future.result()
            all_embeddings.append((ent, emb))
        except Exception as e:
            print(f"Error for entity '{ent}': {e}")
            all_embeddings.append((ent, None))

# Re-sort embeddings back to original entity order
emb_dict = dict(all_embeddings)  # { "EntityString": embedding or None }
vectors = []
for ent in entities:
    emb_vec = emb_dict[ent]
    if emb_vec is not None:
        vectors.append(np.array(emb_vec, dtype=np.float32))
    else:
        vectors.append(None)

##################################################
# 3) COSINE SIMILARITY (VECTORIZED IN NUMPY)
##################################################
# We need a consistent dimensionality, so fill None embeddings with zeros
valid_vectors = [v for v in vectors if v is not None]
if not valid_vectors:
    # Stop the cell with a clear message; exit() can shut down a notebook kernel
    raise SystemExit("No valid embeddings found, cannot proceed.")

dim = len(valid_vectors[0])
for i, v in enumerate(vectors):
    if v is None:
        vectors[i] = np.zeros(dim, dtype=np.float32)

# Create a single 2D array: shape (N, D)
matrix = np.stack(vectors)  # shape (N, D)

# Dot product matrix (N x N)
dot_matrix = matrix @ matrix.T
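norms = np.linalg.norm(matrix, axis=1, keepdims=True)  # shape (N,1)
denominator = norms @ norms.T                           # shape (N,N)
# Zero vectors (failed embeddings) would cause 0/0 = NaN; give them similarity 0 instead
denominator[denominator == 0] = 1.0
similarity_matrix = dot_matrix / denominator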

threshold = 0.8
N = len(entities)
similar_phrases = {}

# --------------------------------------------------------
# We use np.triu_indices(N, k=1) => all i<j pairs in [0..N-1].
# This covers every unique pair exactly once, no duplication.
# --------------------------------------------------------
upper_indices = np.triu_indices(N, k=1)  # i<j
sim_vals = similarity_matrix[upper_indices]  # 1D array: sim for each pair (i<j)
above_thresh = np.where(sim_vals > threshold)[0]

for idx in above_thresh:
    i = upper_indices[0][idx]  # row index
    j = upper_indices[1][idx]  # col index
    # If sim > threshold, we say entity j is similar to entity i
    similar_phrases[entities[j]] = entities[i]

##################################################
# 4) REPLACE SIMILAR PHRASES IN THE DATAFRAME
##################################################
total_rows = len(df)
for row_idx in range(total_rows):
    if row_idx % 100 == 0 or row_idx == total_rows - 1:
        print(f"Progress: {100.0 * row_idx / total_rows:.1f}%")

    cell_value = str(df.at[row_idx, "Extracted entities"])
    
    # If "Yarrowia" appears in the cell, skip it
    if "Yarrowia" in cell_value:
        continue

    for similar, original in similar_phrases.items():
        # Also skip if "Yarrowia" is in the phrase itself
        if "Yarrowia" in similar:
            continue
        if similar in cell_value:
            cell_value = cell_value.replace(similar, original)

    df.at[row_idx, "Extracted entities"] = cell_value
modified_file_path = 'modified_' + output_file_path
df.to_excel(modified_file_path, index=False, engine="openpyxl")
print("Done. Saved modified file.")


Progress: 0.0%
Progress: 98.9%
Done. Saved modified file.
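
To make the merging rule concrete, here is a small sketch of thresholded cosine similarity on made-up two-dimensional vectors. It is written in plain Python so it runs without NumPy or an Ollama server. The names `cosine` and `map_similar` and the toy vectors are assumptions for illustration only. Unlike the cell above, it maps each entity to the earliest similar entity and then stops, which is a deliberate simplification.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def map_similar(names, vectors, threshold=0.8):
    """Map each name to the earliest earlier name whose similarity exceeds threshold."""
    mapping = {}
    for j in range(len(names)):
        for i in range(j):
            if cosine(vectors[i], vectors[j]) > threshold:
                mapping[names[j]] = names[i]
                break  # keep the earliest match only
    return mapping

# Made-up embeddings; the all-zero vector stands in for a failed embedding request
names = ["beta-carotene", "β-carotene", "glucose", "failed"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 0.0]]
print(map_similar(names, vectors))
```

Only the two carotene spellings are merged here. The zero vector never exceeds the threshold, so a failed embedding cannot trigger a replacement.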


**Step 4.1: Plot knowledge graph**

In [4]:
from pyvis.network import Network
import pandas as pd
import re
import networkx as nx

# Load the Excel file
filepath = modified_file_path
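df = pd.read_excel(filepath, engine='openpyxl')
# Empty cells come back from Excel as NaN, but re.findall needs strings
df['Extracted entities'] = df['Extracted entities'].fillna('').astype(str)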

# Initialize NetworkX Graph
G = nx.Graph()

# Nodes to exclude
words_to_exclude = []

# Regular expression to match the pattern (entity A, entity B)
pattern = r'\(([^,]+), ([^\)]+)\)'

# Iterate over the DataFrame rows to extract entity pairs and their sources
for _, row in df.iterrows():
    value = row['Extracted entities']
    source = row['Title']  # Extract source for each pair

    matches = re.findall(pattern, value)
    for entity_a, entity_b in matches:
        # Check if any word to exclude is part of the entity names
        if not any(word in entity_a for word in words_to_exclude) and not any(word in entity_b for word in words_to_exclude):
            G.add_node(entity_a, label=entity_a)
            G.add_node(entity_b, label=entity_b)
            G.add_edge(entity_a, entity_b, title=source)

def search_network(graph, keywords, depth=1):
    # Ensure all keywords are lowercase for case-insensitive search
    keyword_list = [kw.lower() for kw in keywords]

    # Helper function to check if a node label contains all keywords
    def contains_all_keywords(label):
        return all(kw in label.lower() for kw in keyword_list)

    # Collect nodes that contain all keywords in their label
    nodes_of_interest = set()
    for node, attr in graph.nodes(data=True):
        if 'label' in attr and contains_all_keywords(attr['label']):
            nodes_of_interest.add(node)

    # Expand search to include neighbors up to the specified depth
    for _ in range(depth):
        neighbors = set()
        for node in nodes_of_interest:
            neighbors.update(nx.neighbors(graph, node))
        nodes_of_interest.update(neighbors)
    
    # Return a subgraph containing only relevant nodes and edges
    return graph.subgraph(nodes_of_interest).copy()

# Perform search with a list of keywords
word_combinations = ["carotene"]  # Replace with your keywords
filtered_graph = search_network(G, word_combinations)

# Extract node names from the filtered graph
node_names = list(filtered_graph.nodes())

# Prepare a simple text summary of node names
node_names_text = ", ".join(node_names)

# Now, `node_names_text` contains a clean, comma-separated list of node names, ready for summarization
print(node_names_text)

# Initialize Pyvis network with the filtered graph
net = Network(height="2160px", width="100%", bgcolor="#222222", font_color="white")
net.from_nx(filtered_graph)

# Set physics and node font options for the interactive layout
net.set_options("""
{
  "physics": {
    "barnesHut": {
      "gravitationalConstant": -80000,
      "centralGravity": 0.5,
      "springLength": 75,
      "springConstant": 0.05,
      "damping": 0.09,
      "avoidOverlap": 0.5
    },
    "maxVelocity": 100,
    "minVelocity": 0.1,
    "solver": "barnesHut",
    "timestep": 0.3,
    "stabilization": {
        "enabled": true,
        "iterations": 500,
        "updateInterval": 10,
        "onlyDynamicEdges": false,
        "fit": true
    }
  },
  "nodes": {
    "font": {
      "size": 30,
      "color": "white"
    }
  }
}
""")

# Save the network as an HTML file (open it in a browser to view)
net.write_html('filtered_entity_' + "_".join(word_combinations) + '_network.html')


increased β-carotene titer, α-carotene, genotyping yeast, carB, overall β-carotene concentration, glucose, optimized fermentation conditions, production capability, β-carotene ketolase, carRP, β-carotene hydroxylases CrtZ, astaxanthin, intermediates' accumulation, β-carotene production, beta-carotene synthesis related genes copy number increase, HMG1, production of β-carotene, zeaxanthin, morphological transition, retinol, enhancing MVA pathway, Y. lipolytica, α-carotene production, HMG-CoA and FPP accumulation, salinity stress, lipids production, mycelium to yeast form conversion, β-carotene accumulation, downregulation of SQS1, glycerol, gene modification, fermentation delay, transcriptional units, subchronic toxicity, subcellular organelles, indigoidine titer yield, alleviating rate-limiting steps, genes crtI, β-carotene hydroxylase from Chondromyces crocatus, glucose consumption, increased β-carotene production, metabolic engineering, beta-carotene, β-carotene biosynthesis pathway,
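
The keyword search keeps every node whose label contains all keywords, then adds neighbors up to `depth` hops away. The sketch below reproduces that expansion on a plain adjacency dictionary, so it runs without NetworkX. The `expand_matches` helper and the toy graph are illustrative assumptions, and it matches on node names rather than a `label` attribute.

```python
def expand_matches(adjacency, keywords, depth=1):
    """Return nodes matching all keywords plus neighbors within `depth` hops."""
    wanted = [kw.lower() for kw in keywords]
    selected = {n for n in adjacency if all(kw in n.lower() for kw in wanted)}
    for _ in range(depth):
        neighbors = set()
        for node in selected:
            neighbors.update(adjacency.get(node, ()))
        selected |= neighbors
    return selected

# Toy undirected graph written as an adjacency dictionary (made-up edges)
toy = {
    "β-carotene production": {"HMG1", "glucose"},
    "HMG1": {"β-carotene production", "MVA pathway"},
    "glucose": {"β-carotene production"},
    "MVA pathway": {"HMG1"},
    "squalene": set(),
}
print(expand_matches(toy, ["carotene"], depth=1))
print(expand_matches(toy, ["carotene"], depth=2))
```

With `depth=1` only the direct neighbors of the matching node are included. Raising `depth` to 2 also pulls in `MVA pathway`, while the unconnected `squalene` node is never reached.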

**Step 4.2: Produce summary report**

In [5]:
from IPython.display import Markdown, display
import requests  # Required for API calls

def trim_text(text, max_length):
    if len(text) > max_length:
        return text[:max_length].rsplit(' ', 1)[0] + "..."  # Trim to max_length, avoid cutting words in half
    else:
        return text

# Apply the trimming function to node_names_text
cut_off_chunk_size = 5000
trimmed_node_names_text = trim_text(node_names_text, cut_off_chunk_size)
keyword = ", ".join(word_combinations)

# Construct the prompt with the potentially trimmed node_names_text
prompt = "These are the terms related to " + filepath + " " + keyword + ", categorize them and write a summary report.   " + trimmed_node_names_text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

try:
    # Use Ollama's OpenAI-compatible API
    response = requests.post(
        'http://localhost:11434/v1/chat/completions',
        json={
            'model': 'qwen2.5:14b',  # Verify with `ollama list`
            'messages': messages,
            'max_tokens': 5000
        },
        timeout=10000  # Timeout in seconds; an argument to requests, not a field of the JSON body
    )
    response.raise_for_status()
    response1 = response.json()['choices'][0]['message']['content']
    
except Exception as e:
    print(f"API Error: {e}")
    response1 = "Failed to generate response"

display(Markdown(response1))

### Summary Report: Terms Related to β-Carotene Production in _Yarrowia lipolytica_

#### Introduction
This report classifies and summarizes the various terms related to β-carotene production in *Yarrowia lipolytica*. The objective is to provide an organized overview of genetic modifications, metabolic engineering techniques, and bioprocess conditions aimed at enhancing the production of this important carotenoid.

#### Categorization

1. **Genetic Modifications**
   - Increased copy number: β-carotene synthesis related genes
   - Gene modification methods: CRISPR/Cas9, multiple fragment assembly method (MFA), codon-adapted genes, homologous recombination
   - Genes involved in the pathway:
     - crtB (carB) and carRP
     - β-carotene hydroxylases (crtZ, GGS1/crtE)
     - Upregulated genes: HMG1, crtI, crtYB
     - Downregulated gene: SQS1
     - Additional key enzymes/enzymes from heterologous organisms:
       - β-ketolase CrtW (Chondromyces crocatus)
       - Hydroxylases

2. **Bioprocess Conditions**
   - Fermentation parameters:
     - Fed-batch fermentation, DO-stat flask culture conditions
     - Optimization: glucose consumption rate, salinity stress, pH effects, acetic acid concentration and utilization
     - Cultivation conditions: YPD medium, fermenters 
   - Stress Relief and Metabolic Balance:
     - Relieving metabolic stress
     - Alleviating rate-limiting steps
     - Enhancing precursor supply
     - NAA-dependent MFE1 degradation

3. **Metabolic Pathways**
   - Mevalonate (MVA) pathway: Improved expression, ERG13, HMG-CoA and FPP accumulation
   - Enhancement of native pathways:
     - Precursor boosting in mevalonate pathway (tHMG1 gene overexpression)
     - β-carotene biosynthetic pathway metabolic balance 
     - Homologous enzymes from other organisms (e.g., OluLCY, HMG)

4. **Product and Analytical Methods**
   - Product measurement: Increased β-carotene titer, accumulation, extracellular export
   - Performance metrics:
     - Dry cell weight (DCW), β-carotene content per cell, biomass productivity 
     - α- and β-carotenes, zeaxanthin, lutein, astaxanthin
     - Yield: Indigoidine titer yield, subchronic toxicity, genotoxicity tests
   - Subcellular organelles: ER, mitochondria

5. **Strain Engineering**
   - Yarrowia lipolytica strains:
     - Improved strains (XK19/Yli-C)
     - Lipid overproducer strain
     - Genetically modified strains 
   - Gene deletion mutants: Ura3Δ

#### Summary
The report categorizes the extensive list of terms into key areas focused on enhancing β-carotene production through genetic modification, optimizing fermentation conditions, and improving metabolic pathways in *Yarrowia lipolytica*. Genetic techniques such as CRISPR/Cas9 and MFA have been employed to modify critical genes within the carotenoid biosynthesis pathway. Additionally, metabolic engineering strategies target rate-limiting steps and precursor supply for improving β-carotene synthesis efficiency.

Bioprocess conditions have also seen significant optimization through fed-batch fermentations and DO-stat flask culture methods, with a focus on alleviating metabolic stress to enhance biomass productivity. Strain development has led to the creation of high-performing strains like XK19/Yli-C, enabling higher β-carotene titers with improved precursor supply from the MVA pathway.

Overall, this report reflects the comprehensive approach undertaken in enhancing β-carotene production in *Yarrowia lipolytica*, integrating genetic advancements and bioprocess optimization at both microscale bench experiments and large-scale fermenter applications.