**Step 1: Literature search and text collection**

In [1]:
# Search PubMed and get article abstracts

# pip install biopython pandas openpyxl

from Bio import Entrez
import pandas as pd

# Define your email to use with NCBI Entrez
Entrez.email = "your@email.com"

def search_pubmed(keyword):
    
    # Adjust the search term to focus on abstracts
    search_term = f"{keyword}[Abstract]"
    handle = Entrez.esearch(db="pubmed", term=search_term, retmax=500)
    record = Entrez.read(handle)
    handle.close()
    # Get the list of Ids returned by the search
    id_list = record["IdList"]
    return id_list

def fetch_details(id_list):
    ids = ','.join(id_list)
    handle = Entrez.efetch(db="pubmed", id=ids, retmode="xml")
    records = Entrez.read(handle)
    handle.close()

    # Create a list to hold our article details
    articles = []

    for pubmed_article in records['PubmedArticle']:
        article = {}
        article_data = pubmed_article['MedlineCitation']['Article']
        article['Title'] = article_data.get('ArticleTitle')
        
        # Directly output the abstract
        abstract_text = article_data.get('Abstract', {}).get('AbstractText', [])
        if isinstance(abstract_text, list):
            abstract_text = ' '.join(abstract_text)
        article['Abstract'] = abstract_text

        article['Journal'] = article_data.get('Journal', {}).get('Title')

        articles.append(article)

    return articles



# Example usage
keyword = "yarrowia carotene"
id_list = search_pubmed(keyword)
articles = fetch_details(id_list)

# Convert our list of articles to a DataFrame
df = pd.DataFrame(articles)

# Saving the DataFrame to an Excel file
excel_filename = keyword+"_pubmed_search_results.xlsx"
df.to_excel(excel_filename, index=False)

print(f"Saved search results to {excel_filename}")


Saved search results to yarrowia carotene_pubmed_search_results.xlsx
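
`fetch_details` sends every ID from the search in a single request. For larger result sets, one option is to split the ID list into fixed-size batches and call `fetch_details` once per batch. The sketch below covers only the batching step; the helper name `batched_ids` and the batch size of 200 are illustrative assumptions, not part of the original notebook.

```python
from typing import Iterator, List


def batched_ids(id_list: List[str], batch_size: int = 200) -> Iterator[List[str]]:
    """Yield consecutive slices of id_list, each holding at most batch_size IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(id_list), batch_size):
        yield id_list[start:start + batch_size]


# Hypothetical usage with the notebook's own function:
# articles = []
# for batch in batched_ids(id_list):
#     articles.extend(fetch_details(batch))
```

Batching keeps each request small and makes it easier to retry a single failed batch instead of the whole download.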


**Step 2: Entity and relationship extraction with LLM**

In [2]:
import pandas as pd
import os
from openai import OpenAI  # Updated import syntax

# Initialize OpenAI client (put your key here)
client = OpenAI(api_key="YOUR-API-KEY")  # New client initialization

def ask_questions(abstract, questions, system_prompts):
    responses = []
    for question, system_prompt in zip(questions, system_prompts):
        prompt_text = question + " " + str(abstract)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt_text}
        ]
        try:
            # Updated API call syntax
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # Use the latest model you have access to
                messages=messages,
                max_tokens=5000,
                temperature=0
            )
            answer = response.choices[0].message.content
            responses.append(answer.strip())
        except Exception as e:
            print(f"Error getting response: {e}")
            responses.append("")
    return responses

# ---------------------------------------------------
# Example usage reading from Excel and saving results
# ---------------------------------------------------
# Read the Excel file
file_path = excel_filename  # Replace with your file path
df = pd.read_excel(file_path)

questions = [" "]
system_prompts = [
    (
        "You are a specialized analyzer for scientific paper abstracts with a focus on identifying causal relationships between key entities in biological studies. "
        "Your primary task is to extract and identify all causal relationships present in an abstract between the following entities: Performance, Species, Genes, Methods of genetic engineering (such as knockout or expression), Enzymes, Proteins, and Bioprocess conditions (e.g., growth conditions). "
        "For each abstract provided, identify every causal relationship between these entities. "
        "You must consider all combinations, even indirect relationships, and output a long, detailed answer that includes every possible valid combination pair. "
        "Your output should strictly follow this format: (Entity A, Entity B), (Entity C, Entity D), ... with no additional text. "
        "Important Instructions: "
        "Comprehensiveness: Include all valid combinations of causal relationships found in the abstract; "
        "Detail: Ensure the output is long and detailed, listing every relationship pair even if some relationships may be indirectly connected; "
        "Format: Output only the pairs in the exact format described above with no additional explanations or commentary. "
        "Examples: "
        "Example 1: Abstract: The knockout of gene X significantly improved performance in species Y. "
        "Output: (gene X, performance), (gene X, species Y), (species Y, performance). "
        "Example 2: Abstract: Expression of enzyme Z in species W leads to increased protein levels under specific growth conditions. "
        "Output: (enzyme Z, protein levels), (enzyme Z, species W), (enzyme Z, growth conditions), (protein levels, species W), (protein levels, growth conditions), (species W, growth conditions). "
        "Example 3: Abstract: The study shows that a change in bioprocess conditions influences the expression of gene A, which in turn affects the performance of species B. "
        "Output: (bioprocess conditions, gene A), (bioprocess conditions, performance), (bioprocess conditions, species B), (gene A, performance), (gene A, species B), (performance, species B)."
    ),
]
# Process each abstract and store the response
total_rows = len(df)
for i, row in df.iterrows():
    os.system('cls' if os.name == 'nt' else 'clear')
    response = ask_questions(row['Abstract'], [questions[0]], [system_prompts[0]])[0]
    df.at[i, 'Extracted entities'] = response
    
    print(f"Response for Row {i+1}:")
    print(f"Answer to Question 2: {response}")
    progress = ((i + 1) / total_rows) * 100
    print(f"Progress: {progress:.2f}% completed")

# Save the updated DataFrame
output_file_path = f'updated(GPT-4o)_{keyword}_causal.xlsx'  # New filename
df.to_excel(output_file_path, index=False)

Response for Row 1:
Answer to Question 2: (Yarrowia lipolytica, growth), (Yarrowia lipolytica, lipid utilization), (Yarrowia lipolytica, efficiency), (Yarrowia lipolytica, Po1f strain), (Po1f strain, growth), (Po1f strain, lipid utilization), (Po1f strain, efficiency), (engineered strain, growth rate), (engineered strain, lipid content), (engineered strain, lipid uptake), (engineered strain, lipid accumulation), (engineered strain, lipid metabolism), (engineered strain, Yarrowia lipolytica), (original strain, growth rate), (original strain, lipid content), (original strain, lipid uptake), (original strain, lipid accumulation), (original strain, lipid metabolism), (original strain, Yarrowia lipolytica), (β-Carotene, production of lipophilic natural compounds), (engineered strain, β-Carotene), (engineered strain, efficiency), (growth rate, efficiency), (lipid content, efficiency), (lipid uptake, efficiency), (lipid accumulation, efficiency), (lipid metabolism, efficiency), (growth, lipid
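
The prompt asks for output in the form `(Entity A, Entity B), ...`, and Step 3 reads that format back with a regular expression. The sketch below applies the same pattern to a short made-up response to show what the parser keeps. If a response ends mid-pair, the incomplete trailing pair is not matched and is dropped. The helper name `parse_pairs` and the sample string are assumptions for illustration.

```python
import re
from typing import List, Tuple

# Same pattern the notebook uses in Step 3 and Step 4.1
PAIR_PATTERN = re.compile(r"\(([^,]+), ([^)]+)\)")


def parse_pairs(response: str) -> List[Tuple[str, str]]:
    """Return every complete (entity A, entity B) pair found in a response."""
    return [(a.strip(), b.strip()) for a, b in PAIR_PATTERN.findall(response)]


# Made-up response whose last pair is cut off before the closing parenthesis
example = "(gene X, performance), (gene X, species Y), (growth, lipid"
print(parse_pairs(example))
# [('gene X', 'performance'), ('gene X', 'species Y')]
```

Note that an entity name containing a comma would be split at the first comma, so this format works best when the model keeps entity names comma-free.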

**Step 3: Combine entities with similar meanings**

In [3]:
import pandas as pd
import re
import numpy as np
import concurrent.futures
import openai
import os

##################################################
# 1) READ EXCEL AND EXTRACT ENTITIES (UNCHANGED)
##################################################
df = pd.read_excel(output_file_path, engine="openpyxl")
df["Extracted entities"] = df["Extracted entities"].fillna("")
column_values = df["Extracted entities"].astype(str).tolist()

pattern = r"\(([^,]+), ([^)]+)\)"
entities = []
for value in column_values:
    matches = re.findall(pattern, value)
    for (e1, e2) in matches:
        entities.append(e1)
        entities.append(e2)

# Remove duplicates while preserving order
entities = list(dict.fromkeys(entities))

##################################################
# 2) UPDATED OPENAI EMBEDDINGS IMPLEMENTATION
##################################################

def get_openai_embedding(text, model="text-embedding-3-small"):
    """Get embedding with proper error handling and API parameters"""
    try:
        response = client.embeddings.create(
            input=text,
            model=model,
            encoding_format="float"  # Explicitly request float format
        )
        return response.data[0].embedding
    except openai.APIError as e:
        # status_code is only set on HTTP status errors, not on connection errors
        print(f"API Error: {getattr(e, 'status_code', 'n/a')} - {e.message}")
    except Exception as e:
        print(f"General error embedding '{text[:20]}...': {str(e)}")
    return None

# --- PARALLEL EMBEDDING REQUESTS WITH RATE LIMIT HANDLING ---
model_name = "text-embedding-3-small"  # Can change to "text-embedding-ada-002" for older model
all_embeddings = []

# Reduced max_workers to comply with OpenAI rate limits
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    future_to_entity = {
        executor.submit(get_openai_embedding, ent, model_name): ent 
        for ent in entities
    }

    for future in concurrent.futures.as_completed(future_to_entity):
        ent = future_to_entity[future]
        try:
            emb = future.result()
            all_embeddings.append((ent, emb))
        except Exception as e:
            print(f"Error processing '{ent}': {e}")
            all_embeddings.append((ent, None))

# Create embedding dictionary maintaining original entity order
emb_dict = dict(all_embeddings)
vectors = []
for ent in entities:
    emb_vec = emb_dict.get(ent)
    if emb_vec is not None:
        vectors.append(np.array(emb_vec, dtype=np.float32))
    else:
        vectors.append(None)

##################################################
# 3) COSINE SIMILARITY (VECTORIZED IN NUMPY)
##################################################
# Ensure a consistent dimensionality by replacing missing embeddings with zeros
valid_vectors = [v for v in vectors if v is not None]
if not valid_vectors:
    # Stop here: similarity cannot be computed without any embeddings
    raise SystemExit("No valid embeddings found, cannot proceed.")

dim = len(valid_vectors[0])
for i, v in enumerate(vectors):
    if v is None:
        vectors[i] = np.zeros(dim, dtype=np.float32)

# Create a 2D array with shape (N, D)
matrix = np.stack(vectors)

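# Compute cosine similarity matrix
dot_matrix = matrix @ matrix.T
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
denom = norms @ norms.T
# Zero vectors (missing embeddings) have zero norm; assign them similarity 0 instead of NaN
similarity_matrix = np.divide(dot_matrix, denom, out=np.zeros_like(dot_matrix), where=denom > 0)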

threshold = 0.8
N = len(entities)
similar_phrases = {}

# Use np.triu_indices to consider unique pairs (i < j)
upper_indices = np.triu_indices(N, k=1)
sim_vals = similarity_matrix[upper_indices]
above_thresh = np.where(sim_vals > threshold)[0]

for idx in above_thresh:
    i = upper_indices[0][idx]
    j = upper_indices[1][idx]
    # Consider entity j as similar to entity i
    similar_phrases[entities[j]] = entities[i]

##################################################
# 4) REPLACE SIMILAR PHRASES IN THE DATAFRAME
##################################################
total_rows = len(df)
for row_idx in range(total_rows):
    if row_idx % 100 == 0 or row_idx == total_rows - 1:
        print(f"Progress: {100.0 * row_idx / total_rows:.1f}%")
    
    cell_value = str(df.at[row_idx, "Extracted entities"])
    
    # Skip rows containing "Yarrowia"
    if "Yarrowia" in cell_value:
        continue
    
    for similar, original in similar_phrases.items():
        # Also skip if "Yarrowia" appears in the phrase itself
        if "Yarrowia" in similar:
            continue
        if similar in cell_value:
            cell_value = cell_value.replace(similar, original)
    
    df.at[row_idx, "Extracted entities"] = cell_value

modified_file_path = f'modified_{output_file_path}'
df.to_excel(modified_file_path, index=False, engine="openpyxl")
print(f"Processing complete. Modified file saved to: {modified_file_path}")

Progress: 0.0%
Progress: 98.9%
Processing complete. Modified file saved to: modified_updated(GPT-4o)_yarrowia carotene_causal.xlsx
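
`similar_phrases` stores a single replacement target per entity. When entity C is similar to B and B is similar to A, but C and A fall below the threshold, C ends up mapped to B rather than to A. One way to collapse such chains before the replacement loop is sketched below. The helper name `resolve_chains` and the sample entities are assumptions, and the original notebook does not include this step.

```python
from typing import Dict


def resolve_chains(mapping: Dict[str, str]) -> Dict[str, str]:
    """Map every key to the final entity reached by following the mapping."""
    resolved = {}
    for key in mapping:
        seen = {key}
        target = mapping[key]
        # Follow the chain until reaching an entity that is not itself remapped.
        # The `seen` set is a defensive guard against accidental cycles.
        while target in mapping and target not in seen:
            seen.add(target)
            target = mapping[target]
        resolved[key] = target
    return resolved


# Made-up mapping with a two-step chain
similar = {
    "b-carotene titer": "β-carotene titer",
    "beta-carotene titre": "b-carotene titer",
}
canonical = resolve_chains(similar)
```

After this step both variants point at `β-carotene titer`, so the order in which replacements are applied no longer matters.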


**Step 4.1: Plot knowledge graph**

In [4]:
from pyvis.network import Network
import pandas as pd
import re
import networkx as nx

# Load the Excel file
filepath = modified_file_path
df = pd.read_excel(filepath, engine='openpyxl')

# Initialize NetworkX Graph
G = nx.Graph()

# Nodes to exclude
words_to_exclude = []

# Regular expression to match the pattern (entity A, entity B)
pattern = r'\(([^,]+), ([^\)]+)\)'

# Iterate over the DataFrame rows to extract entity pairs and their sources
for _, row in df.iterrows():
    value = row['Extracted entities']
    if not isinstance(value, str):
        continue  # Empty cells are read back from Excel as NaN
    source = row['Title']  # Extract source for each pair

    matches = re.findall(pattern, value)
    for entity_a, entity_b in matches:
        # Check if any word to exclude is part of the entity names
        if not any(word in entity_a for word in words_to_exclude) and not any(word in entity_b for word in words_to_exclude):
            G.add_node(entity_a, label=entity_a)
            G.add_node(entity_b, label=entity_b)
            G.add_edge(entity_a, entity_b, title=source)

def search_network(graph, keywords, depth=1):
    # Ensure all keywords are lowercase for case-insensitive search
    keyword_list = [kw.lower() for kw in keywords]

    # Helper function to check if a node label contains all keywords
    def contains_all_keywords(label):
        return all(kw in label.lower() for kw in keyword_list)

    # Collect nodes that contain all keywords in their label
    nodes_of_interest = set()
    for node, attr in graph.nodes(data=True):
        if 'label' in attr and contains_all_keywords(attr['label']):
            nodes_of_interest.add(node)

    # Expand search to include neighbors up to the specified depth
    for _ in range(depth):
        neighbors = set()
        for node in nodes_of_interest:
            neighbors.update(nx.neighbors(graph, node))
        nodes_of_interest.update(neighbors)
    
    # Return a subgraph containing only relevant nodes and edges
    return graph.subgraph(nodes_of_interest).copy()

# Perform search with a list of keywords
word_combinations = ["carotene"]  # Replace with your keywords
filtered_graph = search_network(G, word_combinations)

# Extract node names from the filtered graph
node_names = list(filtered_graph.nodes())

# Prepare a simple text summary of node names
node_names_text = ", ".join(node_names)

# Now, `node_names_text` contains a clean, comma-separated list of node names, ready for summarization
print(node_names_text)

# Initialize Pyvis network with the filtered graph
net = Network(height="2160px", width="100%", bgcolor="#222222", font_color="white")
net.from_nx(filtered_graph)

# Continue with setting options and saving the network as before
net.set_options("""
{
  "physics": {
    "barnesHut": {
      "gravitationalConstant": -80000,
      "centralGravity": 0.5,
      "springLength": 75,
      "springConstant": 0.05,
      "damping": 0.09,
      "avoidOverlap": 0.5
    },
    "maxVelocity": 100,
    "minVelocity": 0.1,
    "solver": "barnesHut",
    "timestep": 0.3,
    "stabilization": {
        "enabled": true,
        "iterations": 500,
        "updateInterval": 10,
        "onlyDynamicEdges": false,
        "fit": true
    }
  },
  "nodes": {
    "font": {
      "size": 30,
      "color": "white"
    }
  }
}
""")

# Save and show the network
net.write_html('filtered_entity_' + "_".join(word_combinations) + '_network.html')


CarB, hexokinase activity, growth mediums, fermentation cycle, tHMG, heterologous pathway, hydrophobic substrates, enzyme fusion, carotenogenesis genes, total production, phytoene synthase, Yarrowia lipolytica, glucose based YP medium, β-carotene concentration, carotenes, protein engineering, native precursor supply, gene integration, wild-type strain, rate-limiting steps, sterol transcriptional regulation, NADPH regeneration, retinol, competitive producer organism, growth profile, metabolic balance, target product yield, extracellular export, β-carotene-producing Yarrowia lipolytica strains, integration efficiency, syngas-derived acetic acid, canola oil-containing yeast-peptone, mvaE, fatty acid oxidation, health functions, productivity, engineered strains, genes, genes of Mucor circinelloides, EasyClone, lycopene β-cyclase, transcriptional level of Hxk, heterotrophic workhorses, copy numbers of carRP, β-carotene ketolase, screening efficiency, bioreactors, β-carotene production chass
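
`search_network` first selects nodes whose label contains every keyword, then widens the selection by one hop per unit of `depth`. The sketch below reproduces that expansion on a plain adjacency dictionary so the effect of `depth` can be checked without NetworkX. The function name `expand_neighborhood` and the toy graph are assumptions for illustration.

```python
from typing import Dict, Iterable, Set


def expand_neighborhood(
    adjacency: Dict[str, Set[str]], seeds: Iterable[str], depth: int = 1
) -> Set[str]:
    """Return the seeds plus every node within `depth` hops of a seed."""
    selected = set(seeds)
    for _ in range(depth):
        neighbors = set()
        for node in selected:
            neighbors.update(adjacency.get(node, set()))
        selected |= neighbors
    return selected


# Toy undirected graph written as an adjacency dictionary
adjacency = {
    "β-carotene": {"CarB", "tHMG"},
    "CarB": {"β-carotene"},
    "tHMG": {"β-carotene", "MVA pathway"},
    "MVA pathway": {"tHMG"},
}

one_hop = expand_neighborhood(adjacency, {"β-carotene"}, depth=1)
two_hops = expand_neighborhood(adjacency, {"β-carotene"}, depth=2)
```

With `depth=1` the selection stops at direct neighbors; raising it to 2 also pulls in `MVA pathway`, which illustrates how quickly the filtered graph can grow in a densely connected network.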

**Step 4.2: Produce summarization report**

In [8]:
from IPython.display import Markdown, display

def trim_text(text, max_length):
    if len(text) > max_length:
        return text[:max_length].rsplit(' ', 1)[0] + "..."  # Trim to max_length, avoid cutting words in half
    else:
        return text

# Apply the trimming function to node_names_text
cut_off_chunk_size = 5000
trimmed_node_names_text = trim_text(node_names_text, cut_off_chunk_size)
keyword = ", ".join(word_combinations)

# Construct the prompt with the potentially trimmed node_names_text
prompt = "These are the terms related to " + filepath + ", " + keyword + ", categorize them and write a summary report. " + trimmed_node_names_text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

try:
    # Use OpenAI's API directly
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=5000,
        timeout=30  # Timeout in seconds
    )
    response1 = response.choices[0].message.content
except Exception as e:
    print(f"API Error: {e}")
    response1 = "Failed to generate response"

display(Markdown(response1))

### Summary Report on Carotene Production in Yarrowia lipolytica

#### Categories:
1. **Genetic Engineering and Biotechnology:**
   - **Genes and Pathways:** CarB, carotenogenesis genes, phytoene synthase, crtI, carRA, 11 genes in β-carotene synthesis pathway, CrtE, crtYB, crtZ, genes of Mucor circinelloides, CrtW, engineered pathway, heterologous expression, heterologous pathway.
   - **Metabolic and Pathway Engineering:** MVA pathway, mevalonate pathway, central carbon pathway engineering, metabolic engineering, native precursor supply, acetyl-CoA, NADP+/NADPH, metabolic balance, key nodes on the carotenoid pathway, iterative overexpression, lipid biosynthesis pathway, upcycler strain.
   - **Gene Manipulation Tools:** CRISPR/Cas9, gene integration, protein engineering, codon adaptation, gene integration method, NHEJ, EasyClone, gene expression, codon-adapted CarRA, codon-adapted CarB, protein knockdown.

2. **Fermentation and Cultivation Conditions:**
   - **Cultivation Media:** YPD cultures, glucose based YP medium, glucose utilization, glucose consumption, acetic acid, syngas-derived acetic acid, canola oil-containing yeast-peptone, YNB flask cultures.
   - **Fermentation Methods:** Bioreactors, shake flask cultures, bioreactor fermentations, controlled conditions, high-throughput screening, large-scale fermentation, DO-stat fed-batch fermentation.
   - **Growth Conditions:** pH effects, optimal medium, temperature, substrate choice, bioprocess conditions, kinetic parameters, high-β-carotene production.

3. **Strain Development:**
   - **Engineered Strains:** β-carotene-producing Yarrowia lipolytica strains, engineered Yarrowia lipolytica, engineered strain Yli-CAH, optimized strain, lipid overproducer strain, auxotrophic mutants IMUFRJ 50682, wild-type strain, engineered Yarrowia lipolytica strain.
   - **Productivity and Efficiency:** Productivity, extraction yields, integration efficiency, rate-limiting steps, production titer, yield.

4. **Product Accumulation and Analysis:**
   - **Carotenoids and Derivatives:** β-carotene, α-carotene, zeaxanthin, retinol, lycopene, canthaxanthin, β-ionone, astaxanthin, retinal.
   - **Production Metrics:** Total production, β-carotene concentration, β-carotene yield, β-carotene titer, β-carotene biosynthetic pathway, β-carotene production chassis, accumulation of β-carotene.

5. **Functional Attributes and Applications:**
   - **Health and Nutritional Benefits:** Health functions, anti-cardiovascular properties, antioxidant properties, nutritional supplement, clinical pathology.
   - **Industrial and Biotechnological Applications:** Biotechnological industry, microbial cell factories, production of lipophilic natural compounds.

6. **Structural Biology and Enzyme Activity:**
   - **Enzymes and Proteins:** Hexokinase activity, enzyme fusion, phytoene synthase, lycopene β-cyclase, bifunctional phytoene synthase/lycopene cyclase, β-carotene ketolase, HpCrtZ.
   - **Gene Regulation and Expression:** Sterol transcriptional regulation, transcriptional level of Hxk, transcriptional unit, transcriptional factor, promoter, promoters engineering, gene targets.

7. **Safety and Toxicology:**
   - **Toxicological Assessments:** Genotoxicity, subchronic toxicity, gross- and histopathological evaluations, NOAEL.

#### Summary:
The carotene production research in Yarrowia lipolytica spans various domains. Genetic and metabolic engineering strategies are leveraged to optimize carotenoid biosynthesis pathways. This involves manipulating specific genes and pathways like the MVA pathway and enhancing native precursor supply for improved productivity.

Fermentation and cultivation strategies, including the choice of growth media and optimization of fermentation conditions, play a significant role in achieving high carotene yields. The productivity of β-carotene and related compounds is monitored through extraction yields and production titers, with emphasis on engineered and wild-type strains of Yarrowia lipolytica.

Applications range from health-related benefits, such as antioxidant and anti-cardiovascular properties, to industrial applications in biotechnology for the production of lipophilic natural compounds.

Finally, there is a focus on assessing the safety of these engineered strains, ensuring the absence of genotoxic effects through comprehensive toxicological evaluations.
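
Step 4.2 trims `node_names_text` to 5000 characters, so terms beyond the cutoff never reach the model. An alternative is to split the terms into several chunks at term boundaries and summarize each chunk separately. The sketch below shows only the splitting step; the helper name `chunk_terms` and the small limit in the example are assumptions, not part of the original notebook.

```python
from typing import List


def chunk_terms(terms: List[str], max_chars: int, sep: str = ", ") -> List[str]:
    """Join terms into chunks no longer than max_chars, never splitting a term.

    A single term longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for term in terms:
        added = len(term) if not current else len(sep) + len(term)
        if current and current_len + added > max_chars:
            # Close the current chunk and start a new one with this term
            chunks.append(sep.join(current))
            current, current_len = [term], len(term)
        else:
            current.append(term)
            current_len += added
    if current:
        chunks.append(sep.join(current))
    return chunks


# Small limit chosen only to make the split visible
chunks = chunk_terms(["CarB", "tHMG", "retinol"], max_chars=10)
```

Each chunk could then be sent in its own request, with a final request that merges the per-chunk summaries.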