In [1]:
# Interesting:
# - https://github.com/ollama/ollama/blob/main/docs/import.md
# Resources:
# - https://medium.com/@danushidk507/rag-with-llama-using-ollama-a-deep-dive-into-retrieval-augmented-generation-c58b9a1cfcd3
# - https://github.com/ollama/ollama/blob/main/docs/linux.md

In [2]:
# To be able to run `ollama`
# From terminal with same conda env activated as in this notebook:
# curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
# tar -xzf ollama-linux-amd64.tgz
## -> You'll now have bin/ and lib/.
## -> I extracted the contents of each to /path/to/conda/env/{bin,lib}/ollama
## -> and created /path/to/conda/env/etc/conda/activate.d/ollama.sh, which prepends those paths to $PATH and $LD_LIBRARY_PATH and exports both.

import os
os.environ["PATH"] += ":/research/rgs01/home/clusterHome/jpastr08/.conda/envs/py310/bin/ollama/"
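
The PATH tweak above can be made repeatable, so that re-running the cell does not keep appending the same directory. The following is only a sketch: `prepend_to_path` is a hypothetical helper, and `OLLAMA_DIR` is the placeholder path from the comments above, not a real location.

```python
import os
import shutil


def prepend_to_path(path_value, directory):
    """Return path_value with directory first, without duplicating it."""
    parts = [p for p in path_value.split(os.pathsep) if p and p != directory]
    return os.pathsep.join([directory] + parts)


# Placeholder: replace with the directory that actually holds the ollama binary.
OLLAMA_DIR = "/path/to/conda/env/bin/ollama"
os.environ["PATH"] = prepend_to_path(os.environ.get("PATH", ""), OLLAMA_DIR)

# shutil.which returns None when the executable cannot be found on PATH.
print(shutil.which("ollama") or "ollama not found on PATH")
```

If the last line prints a path, later cells that shell out to `ollama` should find the binary.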

In [3]:
%cd /research/rgs01/home/clusterHome/jpastr08/biohackathon/KIDS24-team12/vm_files/Jose
%pwd

/research/rgs01/home/clusterHome/jpastr08/biohackathon/KIDS24-team12/vm_files/Jose




'/research/rgs01/home/clusterHome/jpastr08/biohackathon/KIDS24-team12/vm_files/Jose'

In [4]:
# Commented out to avoid running every time
#!pip install langchain
#!pip install -U langchain-community
#!pip install sentence-transformers
#!pip install faiss-gpu
#!pip install pypdf
#!pip install langchain_ollama
#!pip install colab-xterm
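
Since the installs above are commented out, a quick check of what is already present can help before running the rest of the notebook. This is a sketch only: `missing_packages` is a hypothetical helper, and the names are the distribution names from the pip lines above (name matching may depend on the Python version).

```python
from importlib import metadata


def missing_packages(names):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Distribution names taken from the pip commands in the cell above.
REQUIRED = [
    "langchain",
    "langchain-community",
    "sentence-transformers",
    "faiss-gpu",
    "pypdf",
    "langchain_ollama",
    "colab-xterm",
]
print("Missing:", missing_packages(REQUIRED))
```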

In [5]:
# Data ingestion

from langchain_community.document_loaders import PyPDFLoader  # loaders now live in langchain-community
from langchain.text_splitter import CharacterTextSplitter

# Load the document
loader = PyPDFLoader("files/4-2_manual.pdf")
documents = loader.load()

# Split the document into chunks
text_splitter = CharacterTextSplitter(chunk_size=750, chunk_overlap=100, separator="\n")
docs = text_splitter.split_documents(documents=documents)
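
To make `chunk_size`, `chunk_overlap` and `separator` more concrete, here is a simplified, pure-Python illustration of separator-based chunking with overlap. It is not LangChain's implementation, and `split_with_overlap` is a hypothetical name; the real splitter may differ in edge cases such as pieces longer than `chunk_size`.

```python
def split_with_overlap(text, chunk_size=750, chunk_overlap=100, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks that share a tail."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Carry over trailing pieces totalling at most chunk_overlap characters.
            kept = []
            for prev in reversed(current):
                if len(separator.join([prev] + kept)) > chunk_overlap:
                    break
                kept.insert(0, prev)
            current = kept
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Ten 10-character lines, packed into chunks of at most 35 characters.
demo = "\n".join(["a" * 10] * 10)
for chunk in split_with_overlap(demo, chunk_size=35, chunk_overlap=10):
    print(len(chunk), repr(chunk))
```

The overlap means the last line of one chunk reappears as the first line of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.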


In [6]:
# Data Embedding and Storage

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS # library for vector similarity search

# Load embedding model 
embedding_model_name = "sentence-transformers/all-mpnet-base-v2" # known for its robust performance across various text tasks
model_kwargs = {"device": "cuda"}
embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs=model_kwargs
)

# Create FAISS vector store
vectorstore = FAISS.from_documents(docs, embeddings)

# Save and reload the vector store
vectorstore.save_local("faiss_index_")
persisted_vectorstore = FAISS.load_local("faiss_index_", embeddings, allow_dangerous_deserialization=True)

# Create a retriever
retriever = persisted_vectorstore.as_retriever()
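
Conceptually, the retriever embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with a brute-force Euclidean search in plain Python. It assumes an L2 metric and a default of four results, which I believe match the LangChain FAISS wrapper's defaults but have not verified against this notebook's versions; `top_k_by_l2` and the toy vectors are made up for illustration.

```python
import math


def top_k_by_l2(query_vec, doc_vecs, k=4):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = sorted((math.dist(query_vec, vec), i) for i, vec in enumerate(doc_vecs))
    return [(i, dist) for dist, i in scored[:k]]


# Toy 2-D "embeddings"; real ones from all-mpnet-base-v2 have many more dimensions.
doc_vecs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
print(top_k_by_l2([0.9, 0.0], doc_vecs, k=2))
```

A real index avoids scanning every vector in Python, but the ranking it produces follows the same principle.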




In [7]:
# From terminal with same conda env activated as in this notebook:
# ollama serve & ollama pull llama3.1
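
Before loading the model it can be useful to confirm that `ollama serve` is actually reachable. The sketch below assumes the server listens on Ollama's default address, `http://localhost:11434`; `ollama_is_up` is a hypothetical helper, and the check only confirms that something answers HTTP 200 at that URL.

```python
import urllib.request


def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and HTTP errors;
        # ValueError covers malformed URLs.
        return False


# Example use in a notebook cell:
# print("Ollama reachable:", ollama_is_up())
```

If this returns False, the later `llm.invoke(...)` call is likely to fail with a connection error, so it is worth checking that the server started and that the model pull finished.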

In [9]:
# Load LLaMA model

from langchain_community.llms import Ollama

# Initialize the LLaMA model
llm = Ollama(model="llama3.1")

# Test with a sample prompt
response = llm.invoke("Tell me a joke")
print(response)


Here's one:

What do you call a fake noodle?

(wait for it...)

An impasta!

Hope that made you smile! Do you want to hear another one?


In [None]:
# Use RAG with retriever defined above

from langchain.chains import RetrievalQA

# Create RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

# Interactive query loop
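while True:
    query = input("Type your query (or type 'Exit' to quit): \n")
    if query.strip().lower() == "exit":
        break
    # invoke() replaces the deprecated run(); the answer is under the "result" key
    result = qa.invoke({"query": query})["result"]
    print(result)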


Type your query (or type 'Exit' to quit): 
 If the DRAGEN software is run with the parameter --enable-variant-caller=true, what are the implications? Construct a command line including this parameter




If the DRAGEN software is run with the parameter `--enable-variant-caller=true`, it implies that the variant caller component will be enabled in the workflow. This means that DRAGEN will construct a workflow that includes the variant caller component and automatically resolve any component inconsistencies.

Here's an example command line including this parameter:

`dragen --enable-variant-caller=true --input-input-fastq.gz your_data.fastq.gz`

Note: The `--input-input-fastq.gz` option is not specified in the provided text, but it is a general input option that DRAGEN uses to read input data. You should replace `your_data.fastq.gz` with your actual input file path.

When the variant caller component is enabled, DRAGEN will produce its own set of VCFs and metric output files for each run.


Type your query (or type 'Exit' to quit): 
 Using the Illumina DRAGEN software, how do you construct a command line for mapping FASTQ files to a reference genome?


Here's an example of how to construct a command line for mapping FASTQ files to a reference genome using the Illumina DRAGEN software:

`dragen -r <reference_path> -1 <FASTQ_file_1> -2 <FASTQ_file_2> --RGID <Read_group_ID> --RGSMD <Sample_Name> --enable-map-align true`

Where:
- `<reference_path>` is the path to the reference genome
- `<FASTQ_file_1>` and `<FASTQ_file_2>` are the paths to the two FASTQ files to be mapped
- `<Read_group_ID>` is a unique identifier for the read group
- `<Sample_Name>` is the name of the sample being analyzed

For example:

`dragen -r /staging/human/reference/hg38_alt_aware/DRAGEN/8 -1 /staging/test/data/NA12878_R1.fastq -2 /staging/test/data/NA12878_R2.fastq --RGID DRAGEN_RGID --RGSMD NA12878 --enable-map-align true`

Note that you will need to replace the placeholders (`<reference_path>`, `<FASTQ_file_1>`, etc.) with your actual file paths and names.


Type your query (or type 'Exit' to quit): 
 Provide an example using the reference genome GRCh38 and input files sample1_R1.fastq and sample1_R2.fastq


I don't know. The provided context only contains examples that use the hg19 reference genome, but it does not include any information on how to use the GRCh38 reference genome with the specified input files (sample1_R1.fastq and sample1_R2.fastq). 

However, based on the general usage of DRAGEN mentioned in the documentation, a possible command would be:

```
dragen-f \
-r /staging/human/reference/GRCh38/GRCh38.fa.k_21.f_16.m_149\
-1 sample1_R1.fastq.gz\
-2 sample1_R2.fastq.gz\
--output-directory/staging/examples/\
--output-file-prefixsample1_dragen
```

Note that you need to replace the path to the reference genome with your own location, and adjust the output directory and file prefix as needed. Additionally, you may need to specify other options depending on your specific use case.


Type your query (or type 'Exit' to quit): 
 What is the command line syntax to enable variant calling in DRAGEN?


According to the provided context, I don't have information about how to specifically enable variant calling for DRAGEN. However, based on a general section that mentions configuring VARIANT CALLERs, it seems like the relevant option is:

• Configure the VARIANT CALLERs based on the application


Type your query (or type 'Exit' to quit): 
 Include parameters for setting the output directory to /data/output and specifying a memory limit of 16 GB.


Here are the command line options you would need:

`--output-directory Yes Specifies the output directory.` should be specified with `/data/output`.

Unfortunately, there is no option directly mentioned in the provided context for "specifying a memory limit". However, there is an option to specify the number of threads (`-n --num-threads No Specifies the number of processor threads to use.`) which could indirectly influence the system's ability to handle memory.

Therefore, you would add these two options to your command:

```
dragen \
--output-directory /data/output \
-n 16
```


Type your query (or type 'Exit' to quit): 
 Write a command line to perform mapping, variant calling, and generate a BAM file, using DRAGEN. Include the input files reads_R1.fastq and reads_R2.fastq, a reference genome hg19, and an output directory /output_dir


Here is the command line to perform mapping, variant calling, and generate a BAM file using DRAGEN:

```bash
dragen \
-r /path/to/reference/hg19 \
--fastq-file1 /path/to/input/reads_R1.fastq \
--fastq-file2 /path/to/input/reads_R2.fastq \
--output-directory /output_dir \
--output-file-prefix BAM_file_prefix \
--RGID DRAGEN_RGID \
--RGSM sample_name \
--enable-map-align=true \
--enable-cyp21a2=true
```

Note that you should replace `/path/to/reference/hg19` with the actual path to your reference genome, and `sample_name` with a meaningful name for your sample. Also, make sure to adjust the output file prefix as needed.

This command line uses FASTQ input files (`reads_R1.fastq` and `reads_R2.fastq`) and maps them against the hg19 reference genome using DRAGEN's mapper/aligner. The resulting BAM file will be stored in the `/output_dir` directory with a name specified by the `--output-file-prefix` option.


Type your query (or type 'Exit' to quit): 
 The following command results in an error: dragen --input sample.fastq --output-dir /output. Correct the command and explain the changes you made.


The original command is incorrect because it does not specify a reference directory, which is required by DRAGEN.

Here's the corrected command:
```
dragen \
-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
--input sample.fastq \
--output-dir /output
```
The changes made were:

* I added the reference directory option (`-r`) to specify a valid reference directory. This is required for DRAGEN to function correctly.
* The `--fastq-file1` and `--fastq-file2` options are not necessary in this case, since only one input file (sample.fastq) is provided. These options are typically used when working with paired-end FASTQ files.
* I removed the `--RGID` and `--RGSM` options, as they are typically required for passing a FASTQ file as input, but not necessary in this case.
* The rest of the command remains unchanged.

Note that without more information about the specific version of DRAGEN being used or the specific reference directory available, it's difficult to provide a completely a
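
The `chain_type="stuff"` setting used in the RAG cell above means the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question. The sketch below illustrates the idea. The template wording is paraphrased and should be treated as an assumption rather than LangChain's exact prompt, and `build_stuffed_prompt` is a hypothetical helper.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nHelpful Answer:"
    )


# Made-up chunks standing in for what the retriever would return.
example_chunks = [
    "--output-directory Specifies the output directory.",
    "-r Specifies the reference directory.",
]
print(build_stuffed_prompt("How do I set the output directory?", example_chunks))
```

Because each query is sent with only its own retrieved chunks, follow-up questions such as "Include parameters for..." carry no memory of the previous exchange, which may explain some of the disconnected answers in the transcript above.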

In [4]:
## Code graveyard

## Instead of the following (venvs nested inside a conda env), I could probably just install
## jupyter in each venv and launch the notebook from there. Jupyter would then run in that venv,
## which itself sits inside the conda env.

# From a terminal window, after activating the same conda environment this notebook runs in.
# cd <dir of choice, probably within the scope of the notebook>
# python -m venv venv_rag_langchain
# source venv_rag_langchain/bin/activate
# pip install ipykernel
# python -m ipykernel install --user --name=venv_rag_langchain

# If you want to remove the kernel
# jupyter kernelspec remove venv_rag_langchain