# CSUMB Diagnostics Lab Tool Development

You and your team are bioinformaticians for a clinical lab that performs diagnostic testing for the early and accurate detection of differentially expressed oncogenes and known oncogenic mutations.

Today, you're tasked with developing software that automates the quantification of gene expression and the detection of variants from RNA-sequencing data.

Case-study data has been deposited in the organizational database, `/media/fileshare/CSUMB_aligned_data/`, where you can access it to help you develop these tools.

## Loading reference sequences (FASTA) into Python

In [7]:
# This dictionary will hold gene names as keys and sequences as values.
reference_dictionary = {}
fasta_path = ''  # set this to the path of the reference FASTA

# With the file connection set to our reference FASTA ...
with open(fasta_path) as reference:
    header = None  # name of the record currently being read

    for line in reference:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # a new record begins: store its name and start an empty sequence
            header = line[1:]
            reference_dictionary[header] = ''
        elif header is not None:
            # append this sequence line (whitespace removed, upper-cased)
            reference_dictionary[header] += ''.join(line.split()).upper()

In [None]:
for gene, sequence in reference_dictionary.items():
  print(gene, len(sequence)/1000, 'kb')

## Loading Alignments (SAM) into Python

The `.sam` alignment files are stored in `/media/fileshare/CSUMB_aligned_data/{KRAS, CDH1, TP53}`.

In [9]:
%ls /media/fileshare/CSUMB_aligned_data/*/pair_1 | head

ls: /media/fileshare/CSUMB_aligned_data/*/pair_1: No such file or directory


We can also run `bash` commands by prepending them with `!`. For example, let's look at one of the alignment (`.sam`) files and think about which lines we need to load and which we need to skip.

In [10]:
!head /media/fileshare/CSUMB_aligned_data/KRAS/pair_3/pair_3.sam

head: /media/fileshare/CSUMB_aligned_data/KRAS/pair_3/pair_3.sam: No such file or directory


### And now,

Let's write the code.

I'm providing the destination object, `alignment_dictionary`, which you will populate with each read's `read_id`, `read_start` position, and `read_sequence`.

Use the `FASTA` parser code above as inspiration to implement an approach that reads the `SAM` files line by line and stores the relevant information in `alignment_dictionary`.

hint: not all SAM lines are records. The first few lines of a SAM file are the "header". Use the terminal to find a pattern that identifies these lines, and don't include them in your dictionary.
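
As a starting point, here is a minimal sketch of one way to parse SAM records. It assumes the standard tab-separated SAM layout (read name in column 1, flag in column 2, 1-based start position in column 4, read sequence in column 10) and that header lines begin with `@`. The function name `parse_sam` and the toy records are hypothetical, made up for illustration. Note that if both mates of a pair share a read name, keying on `read_id` alone keeps only the last one seen.

```python
import io


def parse_sam(handle):
    """Return {read_id: (read_start, read_sequence)} from SAM-formatted lines."""
    alignments = {}
    for line in handle:
        # Header lines start with '@'; blank lines carry no record.
        if line.startswith('@') or not line.strip():
            continue
        fields = line.rstrip('\n').split('\t')
        flag = int(fields[1])
        if flag & 4:
            continue  # bit 0x4 set means the read is unmapped
        read_id = fields[0]
        read_start = int(fields[3])  # POS column, 1-based
        read_sequence = fields[9]    # SEQ column
        alignments[read_id] = (read_start, read_sequence)
    return alignments


# Toy SAM content (hypothetical, not taken from the case-study data).
toy_sam = io.StringIO(
    "@HD\tVN:1.0\tSO:unsorted\n"
    "@SQ\tSN:KRAS\tLN:12\n"
    "read_1\t0\tKRAS\t1\t42\t4M\t*\t0\t0\tACGT\tIIII\n"
    "read_2\t0\tKRAS\t3\t42\t4M\t*\t0\t0\tGTAC\tIIII\n"
    "read_3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n"
)
alignment_dictionary = parse_sam(toy_sam)
print(alignment_dictionary)
```

To use it on a real file, pass an open file handle instead of the `io.StringIO` object, for example `with open(sam_path) as sam: alignment_dictionary = parse_sam(sam)`.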

In [3]:
alignment_dictionary = {}
# keys: read_id
# values: tuple(start position of alignment, sequence of aligned read)

loading /Users/vikas/Downloads/CSUMB_aligned_data//KRAS/pair_1/pair_1.sam into aligment dictionary


Explore `alignment_dictionary` by printing its keys. Print the value of the first key that's printed.

# Exercises

## Exercise #1

Calculate the number of reads aligned to the reference and the size of your reference gene
  - Determine how many reads we have aligned to our reference. If possible, compare to the bowtie output, and remember that in this special case we're aligning to only one gene
  - Store this in the variable `read_count`

hint:
  - we want to measure how many entries are in our alignment obj...
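
A minimal sketch of the idea, assuming `alignment_dictionary` holds one entry per aligned read and `reference_dictionary` holds a single gene. The toy values below are hypothetical stand-ins for your real dictionaries.

```python
# Hypothetical toy data standing in for the dictionaries built earlier.
alignment_dictionary = {'read_1': (1, 'ACGT'), 'read_2': (3, 'GTAC')}
reference_dictionary = {'KRAS': 'ACGTACGTACGT'}

# One dictionary entry per aligned read, so the entry count is the read count.
read_count = len(alignment_dictionary)

# The gene size is the length of its reference sequence, in bases.
gene_length = len(reference_dictionary['KRAS'])

print(read_count, 'reads aligned to a', gene_length, 'bp reference')
```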

## Exercise #2

Calculate the number of reads aligned to the reference per-kilobase length of gene
  - Scale / normalize the read abundance to the size of the gene and answer: **why are we doing this**?
  

hint:
  - reuse `read_count`
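
One possible sketch of the normalization, using a hypothetical helper named `reads_per_kilobase` and made-up numbers. It illustrates why we scale: at the same expression level, a longer gene collects more reads simply because it offers more sequence to align to, so raw counts are not comparable between genes of different sizes.

```python
def reads_per_kilobase(read_count, gene_length_bp):
    """Scale a raw read count to the gene length, in reads per kilobase."""
    if gene_length_bp <= 0:
        raise ValueError('gene length must be positive')
    return read_count / (gene_length_bp / 1000)


# Same raw read count, different gene sizes (made-up numbers).
short_gene = reads_per_kilobase(150, 1500)
long_gene = reads_per_kilobase(150, 3000)

print(short_gene, long_gene)
```

The shorter gene ends up with the higher normalized value, even though both genes have the same raw `read_count`.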

## Exercise #3

Calculate per-nucleotide coverage across target gene
  - for every position in the reference, determine the number of reads that overlap that position

hint:
  - Python indexing is 0-based (refer to the primer), but positions in the reference sequence are 1-based

extra credit:
  - using `max(dictionary, key=dictionary.get)`, determine the position with the maximum coverage and its coverage value across the reference (do the same for `min`)
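
A minimal sketch of per-nucleotide coverage, under a simplifying assumption: every read aligns without gaps, so a read starting at position `p` with length `n` covers positions `p` through `p + n - 1`. Real alignments may contain insertions, deletions, or clipping, which this sketch ignores. The function name and toy data are hypothetical.

```python
def per_nucleotide_coverage(alignments, reference_length):
    """Return {position: read count} for 1-based positions 1..reference_length."""
    coverage = {position: 0 for position in range(1, reference_length + 1)}
    for read_start, read_sequence in alignments.values():
        for offset in range(len(read_sequence)):
            position = read_start + offset  # stays 1-based, like the SAM start
            if position in coverage:        # ignore bases past the reference end
                coverage[position] += 1
    return coverage


# Hypothetical toy data.
alignment_dictionary = {'read_1': (1, 'ACGT'), 'read_2': (3, 'GTAC')}
coverage = per_nucleotide_coverage(alignment_dictionary, 8)

# max/min return the first position reaching the extreme value.
max_position = max(coverage, key=coverage.get)
min_position = min(coverage, key=coverage.get)
print(max_position, coverage[max_position])
print(min_position, coverage[min_position])
```

Keeping the dictionary keys 1-based means they can be compared directly with SAM start positions; only indexing into the Python reference string needs the `- 1` shift.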

## Exercise #4

Calculate SNP-aware per-nucleotide coverage of target reference
  - for every position in the reference, determine the number of reads that overlap that position with the expected, matching nucleotide, as well as the number of reads that contain single-nucleotide polymorphisms (SNPs)

hint:
  - re-use as much code as possible....

extra credit:
  - calculate the "allele frequency" at positions with mutations, where `AF = reference observed / total observations`
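
A sketch of one way to split coverage into matches and mismatches, again assuming gap-free alignments so that each read base lines up with exactly one reference position. The function names and toy sequences are hypothetical. The allele frequency follows the definition given above: reference observations divided by total observations.

```python
def snp_aware_coverage(alignments, reference_sequence):
    """Return (matches, mismatches), each a {1-based position: read count} dict."""
    length = len(reference_sequence)
    matches = {position: 0 for position in range(1, length + 1)}
    mismatches = {position: 0 for position in range(1, length + 1)}
    for read_start, read_sequence in alignments.values():
        for offset, base in enumerate(read_sequence):
            position = read_start + offset
            if position < 1 or position > length:
                continue  # base falls outside the reference
            # Shift by one: 1-based position -> 0-based Python index.
            if base == reference_sequence[position - 1]:
                matches[position] += 1
            else:
                mismatches[position] += 1
    return matches, mismatches


def allele_frequency(matches, mismatches):
    """AF = reference observed / total observations, at mismatching positions only."""
    frequencies = {}
    for position in matches:
        if mismatches[position] > 0:
            total = matches[position] + mismatches[position]
            frequencies[position] = matches[position] / total
    return frequencies


# Hypothetical toy data: read_2 carries an A where the reference has a T.
reference_sequence = 'ACGTACGT'
alignment_dictionary = {'read_1': (1, 'ACGT'), 'read_2': (3, 'GAAC')}

matches, mismatches = snp_aware_coverage(alignment_dictionary, reference_sequence)
frequencies = allele_frequency(matches, mismatches)
print(frequencies)
```

Summing `matches` and `mismatches` at each position gives back the plain coverage from Exercise #3, which is a handy sanity check.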

If your dictionary is correct, replace the variable `YOUR_DICTIONARY_HERE` below with your dictionary and a graph should be produced:

In [None]:
import matplotlib.pyplot as plt

positions_dict = YOUR_DICTIONARY_HERE

# Extract the keys and values from positions_dict
keys = list(positions_dict.keys())
values = list(positions_dict.values())

# Create the bar plot
plt.bar(keys, values)

# Set labels and title
plt.xlabel('Positions')
plt.ylabel('Values')
plt.title('Bar Plot of positions_dict')


# Set y-axis to logarithmic scale
plt.yscale('log')

# Show the plot
plt.show()

## Exercise #5

Generalize your approach to work on all of the generated SAM files
  * Create two functions:
    * 1) Read each SAM file into an alignment dictionary. Store the dictionary in another dictionary where the key is the sample name and the value is the alignment dictionary
    * 2) Process the SAM file to do the following:
      * For each file and its corresponding reference, print the position with the highest allele fraction in the following format:
        * "Pair 1/2/3 from gene A had an allele fraction of B for gene C at position D"

hint: you can modify the following `if` statement to check if a file exists:
* `if os.path.exists(file_path):`
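
The two functions could be sketched as follows. This is only one possible shape, with several assumptions: files are laid out as `<base>/<gene>/pair_N/pair_N.sam`, the sample name is the tuple `(gene, pair)`, each sample is compared against the reference of its own gene, alignments are gap-free, and the allele fraction is reference observations over total observations at positions with at least one mismatch. All function names and the toy data are hypothetical; the demo writes a small SAM file to a temporary directory so that it runs anywhere.

```python
import os
import tempfile


def load_sam(sam_path):
    """Read one SAM file into {read_id: (read_start, read_sequence)}."""
    alignments = {}
    with open(sam_path) as sam:
        for line in sam:
            if line.startswith('@') or not line.strip():
                continue  # skip header and blank lines
            fields = line.rstrip('\n').split('\t')
            alignments[fields[0]] = (int(fields[3]), fields[9])
    return alignments


def load_all_alignments(base_dir, genes, pair_names):
    """Return {(gene, pair): alignment dictionary} for every SAM file that exists."""
    samples = {}
    for gene in genes:
        for pair in pair_names:
            file_path = os.path.join(base_dir, gene, pair, pair + '.sam')
            if os.path.exists(file_path):
                samples[(gene, pair)] = load_sam(file_path)
    return samples


def highest_allele_fraction(alignments, reference_sequence):
    """Return (position, fraction) with the highest fraction, or None if no SNPs."""
    matches, totals = {}, {}
    for read_start, read_sequence in alignments.values():
        for offset, base in enumerate(read_sequence):
            position = read_start + offset
            if not 1 <= position <= len(reference_sequence):
                continue
            totals[position] = totals.get(position, 0) + 1
            if base == reference_sequence[position - 1]:
                matches[position] = matches.get(position, 0) + 1
    # Keep only positions where at least one read disagrees with the reference.
    fractions = {p: matches.get(p, 0) / totals[p]
                 for p in totals if matches.get(p, 0) < totals[p]}
    if not fractions:
        return None
    best = max(fractions, key=fractions.get)
    return best, fractions[best]


# Demo on a throwaway directory with hypothetical toy data.
reference_dictionary = {'KRAS': 'ACGTACGT'}
messages = []
with tempfile.TemporaryDirectory() as base_dir:
    pair_dir = os.path.join(base_dir, 'KRAS', 'pair_1')
    os.makedirs(pair_dir)
    with open(os.path.join(pair_dir, 'pair_1.sam'), 'w') as out:
        out.write("@SQ\tSN:KRAS\tLN:8\n")
        out.write("read_1\t0\tKRAS\t1\t42\t4M\t*\t0\t0\tACGT\tIIII\n")
        out.write("read_2\t0\tKRAS\t3\t42\t4M\t*\t0\t0\tGAAC\tIIII\n")

    samples = load_all_alignments(base_dir, ['KRAS', 'TP53'], ['pair_1', 'pair_2'])
    for (gene, pair), alignments in samples.items():
        result = highest_allele_fraction(alignments, reference_dictionary[gene])
        if result is not None:
            position, fraction = result
            messages.append(
                f"{pair.replace('pair_', 'Pair ')} from gene {gene} had an allele "
                f"fraction of {fraction} for gene {gene} at position {position}"
            )

for message in messages:
    print(message)
```

Missing files (here, everything except `KRAS/pair_1`) are skipped by the `os.path.exists` check rather than raising an error.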