# MPCS56430 Assignment 7B

# Problem 1. RCC Single Node Blast
Using the RCC version of BLAST, conduct a single node run of the following sequence against the `refseq` database. The databases we will be using are in a shared class directory at `/project2/mpcs56430/db`.

Query sequence:
``` 
>YP_009724390.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHV
SGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPF
LGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPI
NLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYN
ENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASV
YAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYF
PLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFL
PFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLT
PTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLG
AENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGI
AVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIG
VTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDI
LSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLM
SFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNT
FVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVA
KNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD
SEPVLKGVKLHYT
```

Perform several runs and establish a baseline time for the database search. Please run the job using `sinteractive` so it runs on a worker node.
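
A minimal sketch of one way to collect repeated timings, assuming bash and the same `time -p` approach used in the scripts in this write-up. `time_runs` is a helper name introduced here for illustration, and the paths in the commented example are assumptions.

```bash
#!/bin/bash
# Run a command N times and print one wall-clock ("real") time in seconds per run.
# Usage: time_runs N command [args...]
time_runs() {
    local n=$1 i
    shift
    for ((i = 0; i < n; i++)); do
        # The command's own output is discarded; bash's `time -p` report goes to
        # the group's stderr, which is redirected into the pipe for awk to parse.
        { time -p "$@" > /dev/null 2>&1; } 2>&1 | awk '/^real/ {print $2}'
    done
}

# Example (assumed paths; adjust to your own query file and database):
# time_runs 5 blastp -query query_sequence.fasta -db /project2/mpcs56430/db/refseq -num_threads 1
```

Taking five or more runs per setting makes it easier to tell a real difference from a single slow run.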

**Running times with 1 thread:**
- 0.54s
- 0.25s
- 0.31s
- 0.25s
- 0.31s
  
avg: 0.332s

**Question:** Does altering the `-num_threads` parameter change the runtime? Is this the same as what you observed running it locally?

**2 threads**:
- 0.25s
- 0.32s
- 0.49s
- 0.27s
- 0.31s

avg: 0.328s

**4 threads**:
- 0.28s
- 0.26s
- 0.25s
- 0.25s
- 0.26s

avg: 0.26s

**16 threads**:
- 0.83s
- 0.32s
- 0.32s
- 0.28s
- 1.32s

avg: 0.614s

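Changing the `-num_threads` parameter does not meaningfully change the runtime: most runs take about 0.3s regardless of thread count.
The 16-thread average (0.614s) is nearly double the 1-thread average (0.332s), but it is driven by two slow runs (0.83s and 1.32s); the other three fall in the same 0.28s to 0.32s range as the other settings. For a search this short, thread startup overhead and run-to-run noise most likely outweigh any parallel gain.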
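
Because a single slow run can move a five-run average a lot, it can help to look at the median alongside the mean. The sketch below is one way to compute both in bash with `sort` and `awk`; `summarize` is a helper name introduced here, and the inputs are the 1-thread and 16-thread timings listed above.

```bash
#!/bin/bash
# Print the mean and median of the numbers given as arguments.
summarize() {
    printf '%s\n' "$@" | LC_ALL=C sort -n | LC_ALL=C awk '
        { v[NR] = $1; sum += $1 }
        END {
            # Middle value for odd counts, average of the two middle values otherwise
            mid = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "mean=%.3f median=%.3f\n", sum / NR, mid
        }'
}

summarize 0.54 0.25 0.31 0.25 0.31   # 1 thread   -> mean=0.332 median=0.310
summarize 0.83 0.32 0.32 0.28 1.32   # 16 threads -> mean=0.614 median=0.320
```

By the median, the 1-thread and 16-thread runs are nearly identical (0.31s vs 0.32s).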

**Question:** Redo all the same runs, but this time using an RCC _big mem_ machine. Does this make a difference?

> Threads Runtime (with big mem):
> 
> 1 Thread: 0.25s\
> 2 Threads: 0.24s\
> 4 Threads: 1.35s\
> 6 Threads: 0.24s\
> 8 Threads: 0.24s\
> 16 Threads: 0.24s\
> 32 Threads: 0.24s

**Conclusion**: \
The big mem machine does not have a significant impact on the total execution time: the times stay around 0.25s, which is not significantly faster than the runs shown previously. The 4-thread run is much slower (1.35s), but since each setting was timed only once here, this is probably a one-off outlier.

Script:

```bash
#!/bin/bash
#SBATCH --account=mpcs56430
#SBATCH --mail-user=sharapova@cs.uchicago.edu
#SBATCH --mail-type=ALL
#SBATCH --job-name=b_problem1          # Job name
#SBATCH --output=./sbatch/out/%j.B1.out
#SBATCH --error=./sbatch/out/%j.B1.err
#SBATCH --cpus-per-task=4              # Number of CPU cores per task
#SBATCH --nodes=1                      # Run on a single node
#SBATCH --partition=bigmem2

# Load necessary modules (BLAST+)
module load blast

# Paths to query and database
QUERY=/project2/mpcs56430/db/query_sequence.fasta
# NOTE: the Problem 2 script searches refseq_protein rather than refseq;
# this mismatch might have been the cause of the inconsistent results.
DB_PATH=/project2/mpcs56430/db/refseq

# Thread counts to benchmark
# NOTE: only 4 cores are requested above, so counts above 4 may not get extra cores.
THREADS=(1 2 4 6 8 16 32)

# Output file for times
TIMES_FILE="blastp_times_bigmem.txt"
echo "Threads Runtime" > "$TIMES_FILE"

# Directory for BLAST output; it must exist before blastp writes to it
mkdir -p tmp_out

# Loop through thread counts and run BLASTP
for THREAD in "${THREADS[@]}"
do
    echo "Running BLASTP with $THREAD threads..."
    OUTPUT_FILE="tmp_out/blastp_${THREAD}_threads.out"

    # Run BLASTP; `time -p` writes its report to stderr, captured in temp_time.txt
    { time -p blastp -query "$QUERY" -db "$DB_PATH" -num_threads "$THREAD" -out "$OUTPUT_FILE" ; } 2> temp_time.txt

    # Extract the wall-clock ("real") time
    RUNTIME=$(awk '/^real/ {print $2}' temp_time.txt)

    # Save thread count and runtime to file
    echo "$THREAD $RUNTIME" >> "$TIMES_FILE"
done

# Cleanup
rm temp_time.txt

echo "All BLASTP runs completed. Times saved in $TIMES_FILE."
```


_NOTE: After re-running the script the next day, the numbers were very different and this is what you see in the blastp_times text files submitted._ 

# Problem 2. SLURM Array Job
One of the simplest ways to parallelize a computing task is to break it into smaller parts. Write a SLURM submission script that submits the above query against the fragmented *nr* database in *array job* mode.

The *nr* database has already been fragmented into 22 equally sized databases. It is located in the `db` directory in the format: `/project2/mpcs56430/db/refseq_protein.[00-21]`.

You should create a single script that utilizes the SLURM *array job* functionality.  Specify only 1 task per node.  Paste your script below and include it in the assignment repository.

```bash
#!/bin/bash
#SBATCH --account=mpcs56430
#SBATCH --mail-user=sharapova@cs.uchicago.edu
#SBATCH --mail-type=ALL
#SBATCH --job-name=blastp_array          # Job name
#SBATCH --output=./sbatch/out/bp_array_%j.out
#SBATCH --error=./sbatch/out/bp_array_%j.err
#SBATCH --cpus-per-task=16               # Number of CPU cores per task
#SBATCH --nodes=1                        # One node per task
#SBATCH --time=02:00:00
#SBATCH --ntasks=1
#SBATCH --array=0-21                     # One array task per database fragment
#SBATCH --partition=bigmem2

# Load BLAST if it is not already available in the submitting environment
# module load blast
# conda activate blast

# Define the query file
# QUERY=/project2/mpcs56430/db/query_seq_b.fasta
QUERY=/project2/mpcs56430/db/query_sequence.fasta

# Build the fragment name for this array task, e.g. refseq_protein.07
DB_PREFIX=/project2/mpcs56430/db/refseq_protein
DB_FRAGMENT=$(printf '%s.%02d' "$DB_PREFIX" "$SLURM_ARRAY_TASK_ID")

# Define the output directory and file
OUTPUT_DIR=./assignment_b/array_output
mkdir -p "$OUTPUT_DIR"
OUTPUT_FILE="${OUTPUT_DIR}/blastp_${SLURM_ARRAY_TASK_ID}.out"

# Run BLASTP for the current database fragment.
# `time -p` reports to stderr, so the timings end up in the .err file.
echo "Running BLASTP for database fragment $DB_FRAGMENT..."
time -p blastp -query "$QUERY" -db "$DB_FRAGMENT" -num_threads 16 -out "$OUTPUT_FILE"

echo "BLASTP for database fragment $DB_FRAGMENT completed."
```

**Outputs in assignment_b.out and sbatch/out**

**Question:** What is the runtime for the entire job?

Command used to find total times:
> $ grep "real" ./sbatch/out/bp_array_43369*.err | awk '{sum += $2} END {print "Total Runtime:", sum, "seconds"}'

Total Runtime: 45.29 seconds
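
The sum above adds up the time of every array task. If the tasks run concurrently, the wall-clock time of the whole job is closer to the slowest single task, so it is useful to report both. A minimal sketch, assuming the `.err` files contain `time -p` output as in the script above; `runtime_summary` is a name introduced here for illustration.

```bash
#!/bin/bash
# Report the number of tasks, the summed time, and the slowest task
# from the `real` lines of the given `time -p` logs.
runtime_summary() {
    awk '
        $1 == "real" { n++; sum += $2; if ($2 > max) max = $2 }
        END { printf "tasks=%d sum=%.2f max=%.2f\n", n, sum, max }
    ' "$@"
}

# Example: runtime_summary ./sbatch/out/bp_array_43369*.err
```

Unlike the `grep` pipeline, this reads each file directly, so the `real` keyword is always the first field.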

_NOTE: most task times were around 0.01s to 0.02s, but a few took around 11s, and these dominate the total. You can see the individual times in `sbatch/out/bp_array_43369*.err`.\
Some of the runs failed with errors such as the following, which may account for the near-instant times:_
> _BLAST Database error: No alias or index file found for protein database [/home/sharapova/../../project2/mpcs56430/db/refseq_protein.19] in search path [/scratch/midway3/sharapova/mpcs56430-2024-autumn-assignment-7-s-sharapova:/project2/databases/blast:]_
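
One way to catch this before submitting is to check that every fragment the array will index actually has a database file on disk. The sketch below assumes the standard BLAST protein database extensions (`.pin` for a volume index, `.pal` for an alias file) and the `refseq_protein.NN` naming from the assignment text; `missing_fragments` is a name introduced here for illustration.

```bash
#!/bin/bash
# Print the array indices whose database fragment has neither an index (.pin)
# nor an alias (.pal) file.
# Usage: missing_fragments PREFIX FIRST LAST
missing_fragments() {
    local prefix=$1 first=$2 last=$3 i frag
    for ((i = first; i <= last; i++)); do
        # Zero-pad the index to two digits, matching refseq_protein.00 ... .21
        frag=$(printf '%s.%02d' "$prefix" "$i")
        if [[ ! -e "$frag.pin" && ! -e "$frag.pal" ]]; then
            echo "$i"
        fi
    done
}

# Example: missing_fragments /project2/mpcs56430/db/refseq_protein 0 21
```

If this prints any indices, those array tasks would be expected to fail with the "No alias or index file found" error rather than perform a search.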

# Question 3. HPC Speedup

Speedup is a metric used to assess the relative performance improvement gained by executing one task versus another.  In our case, we are comparing serial execution (a single process) of searching a single sequence against a database versus a task parallelism version. 

Speedup is defined as 
$$ S \equiv \frac{T_{\text{Old}}}{T_{\text{New}}} $$ 

where $S$ is speedup, $T_{\text{Old}}$ is the time taken to execute the script without improvement (serial), and $T_{\text{New}}$ is the time taken to execute the script with the improvement (parallel).

**Question:** What is the speedup of the single node job in Problem 1 vs running as an array job in Problem 2?

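$T_{\text{Old}} = 0.332\,\text{s}$ (average single-thread runtime from Problem 1)

$T_{\text{New}} = 11.32\,\text{s}$ (the total parallel runtime is the maximum runtime across all fragments)

$S = 0.332 / 11.32 \approx 0.029$

A speedup below 1 is a slowdown, so this result is terrible and clearly wrong. The comparison is not like-for-like: Problem 1 searched `refseq`, Problem 2 searched the `refseq_protein` fragments, and several array tasks failed with database errors.

Comparing `-num_threads` parallelization within Problem 1 instead (1 thread vs 4 threads), we get:

$S = 0.332 / 0.26 \approx 1.28$

This is only a modest gain, and it is within the run-to-run variation seen in the timings.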
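
The two speedup figures can be reproduced with a one-line helper. This is a minimal sketch; `speedup` is a name introduced here, and the inputs are the runtimes reported in this write-up.

```bash
#!/bin/bash
# Speedup S = T_old / T_new, printed to three decimals.
speedup() {
    awk -v t_old="$1" -v t_new="$2" 'BEGIN { printf "%.3f\n", t_old / t_new }'
}

speedup 0.332 11.32   # serial baseline vs slowest array task -> 0.029
speedup 0.332 0.26    # 1 thread vs 4 threads                 -> 1.277
```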

**Question:** Experiment with any other SLURM settings that you can think of (e.g., _tasks per node_) and identify what combination gives the best speedup for the query against the entire database. Support your answer with a benchmark experiment.

**Approaches:**

- Run the array job with different `--cpus-per-task` values (4, 8, 16): _this monitors how increased CPU utilization affects runtimes._
- Use multiple nodes (`--nodes`): _this tests running the array job across multiple nodes (distributes tasks)._
- Experiment with `--ntasks` and `--ntasks-per-node`.
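
The first approach can be sketched as a small sweep that submits the same array script once per `--cpus-per-task` value, relying on `sbatch` command-line options taking precedence over `#SBATCH` directives in the script. `blast_array.sh` is a hypothetical name for the Problem 2 script, `sweep_cpus` and `DRY_RUN` are names introduced here, and the sketch assumes the script passes `$SLURM_CPUS_PER_TASK` to `-num_threads` instead of a fixed value.

```bash
#!/bin/bash
# Print (default) or submit (DRY_RUN=0) one array job per --cpus-per-task value.
# Usage: sweep_cpus SCRIPT CPUS [CPUS...]
sweep_cpus() {
    local script=$1 cpus
    shift
    for cpus in "$@"; do
        local cmd=(sbatch --cpus-per-task="$cpus" --job-name="blastp_c${cpus}" "$script")
        if [[ "${DRY_RUN:-1}" == 1 ]]; then
            echo "${cmd[*]}"     # show the command without submitting
        else
            "${cmd[@]}"          # submit to SLURM
        fi
    done
}

sweep_cpus blast_array.sh 4 8 16
```

After the jobs finish, `sacct -j <jobid> --format=JobID,Elapsed,State` reports the elapsed time and final state of each array task, which gives a per-setting benchmark that also shows which tasks failed.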

None of these adjustments changed my runtimes, though a different combination of settings might. I was also having issues with the RCC during these runs.