# Bio Computing Course

# Instructions
### Welcome to the Python and Biological Concepts practice notebook! This set of exercises will help you apply your understanding of basic Python programming while exploring biological concepts. Please read each question carefully and use the provided code cells to write your solutions.

# General Guidelines:
### Read Each Question Carefully: Make sure you understand what is being asked before you start coding. Pay close attention to the input and output requirements.

### Write Clear and Efficient Code: Aim to write code that is both correct and efficient. Use basic Python concepts such as loops, conditionals, lists, dictionaries, and strings as needed.

In [None]:
dna_sequence = "ATGCGTACGTT"

### 1. String Manipulation - DNA Sequence
#### Question: Given a DNA sequence "ATGCGTACGTT", write a Python function to count the number of times the nucleotide 'G' appears.

### 2. List - DNA Codons
#### Question: Create a list of all codons (3-letter combinations) in the DNA sequence "ATGCGTACGTT".

### 3. Dictionary - Codon to Amino Acid Mapping
#### Question: Create a dictionary that maps codons to their corresponding amino acids for the following: 'ATG': 'Methionine', 'CGT': 'Arginine', 'TAC': 'Tyrosine', 'GTT': 'Valine'.

In [None]:
codon_to_amino_acid = {
    'ATG': 'Methionine',
    'CGT': 'Arginine',
    'TAC': 'Tyrosine',
    'GTT': 'Valine'
}

### 4. Conditional Statements - DNA Base Identification
#### Question: Write a function that identifies if a base is a purine (A, G) or pyrimidine (C, T).
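
One possible approach, sketched under the assumption that anything outside A, G, C, T should be reported as unknown. The names `PURINES`, `PYRIMIDINES`, and `classify_base` are illustrative and not part of the notebook.

```python
# Hypothetical names; the base groupings come from the question text.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_base(base: str) -> str:
    """Return 'purine', 'pyrimidine', or 'unknown' for a single DNA base."""
    base = base.upper()  # accept lower-case input as well
    if base in PURINES:
        return "purine"
    if base in PYRIMIDINES:
        return "pyrimidine"
    return "unknown"

print(classify_base("A"))  # purine
print(classify_base("t"))  # pyrimidine
```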

### 5. Loops - Counting Bases
#### Question: Write a function to count the number of each base in the DNA sequence "ATGCGTACGTT".

### 6. String Slicing - mRNA from DNA

#### Question: Convert the DNA sequence "ATGCGTACGTT" to its mRNA sequence.



### 7. List - Reverse Complement of DNA
#### Question: Write a function that returns the reverse complement of a DNA sequence "ATGCGTACGTT".
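
A minimal sketch of one way to do this: complement each base with a translation table, then reverse the string. The names `COMPLEMENT` and `reverse_complement` are assumptions made for illustration, and the input is assumed to contain only upper-case A, T, G, C.

```python
# Translation table mapping each base to its complement (A<->T, G<->C).
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the resulting strand."""
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGCGTACGTT"))  # AACGTACGCAT
```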

### 8. Dictionary - Transcription Mapping

#### Question: Create a dictionary to map each DNA base to its RNA complement (A->U, T->A, C->G, G->C) and use it to transcribe "ATGCGTACGTT".
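
The mapping in the question can be written directly as a dictionary. The sketch below assumes upper-case input with no ambiguous bases; `DNA_TO_RNA` and `transcribe` are hypothetical names.

```python
# Mapping taken from the question: A->U, T->A, C->G, G->C.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(dna: str) -> str:
    """Replace each DNA base with its RNA complement, position by position."""
    return "".join(DNA_TO_RNA[base] for base in dna)

print(transcribe("ATGCGTACGTT"))  # UACGCAUGCAA
```

A base outside the dictionary raises `KeyError`, which may or may not be the desired behaviour for invalid input.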



### 9. Conditional Statements - DNA Sequence Validity

#### Question: Write a function to check if a given sequence is a valid DNA sequence (only contains A, T, C, G).



### 10. Loops - Translation of Codons

#### Question: Write a function to translate the DNA sequence "ATGCGTACGTT" into a protein sequence using the codon dictionary provided.

### 11. String - Counting GC Content

#### Question: Write a function to calculate the GC content of a DNA sequence "ATGCGTACGTT".
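
GC content is usually reported as the percentage of bases that are G or C. A short sketch, assuming that convention and returning 0.0 for an empty sequence; `gc_content` is an illustrative name.

```python
def gc_content(dna: str) -> float:
    """Percentage of bases in dna that are G or C (0.0 for an empty string)."""
    if not dna:
        return 0.0
    gc = sum(1 for base in dna.upper() if base in "GC")
    return 100 * gc / len(dna)

# 5 of the 11 bases are G or C
print(f"{gc_content('ATGCGTACGTT'):.2f}%")  # 45.45%
```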

### 12. List - Extract Exons

#### Question: Given a list of exon positions [(0, 3), (4, 7)], extract the exons from the DNA sequence "ATGCGTACGTT".
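
The question does not say whether the end position is inclusive. The sketch below assumes it is (the same reading a later cell in this notebook adopts) and exposes the choice as a parameter; `extract_exons` and `inclusive` are hypothetical names.

```python
def extract_exons(dna, positions, inclusive=True):
    """Slice out each (start, end) exon; end is part of the exon if inclusive."""
    offset = 1 if inclusive else 0
    return [dna[start:end + offset] for start, end in positions]

print(extract_exons("ATGCGTACGTT", [(0, 3), (4, 7)]))                   # ['ATGC', 'GTAC']
print(extract_exons("ATGCGTACGTT", [(0, 3), (4, 7)], inclusive=False))  # ['ATG', 'GTA']
```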

### 13. Dictionary - Frequency of Codons

#### Question: Write a function to count the frequency of each codon in the DNA sequence "ATGCGTACGTT".

### 14. String Slicing - Subsequence Check
#### Question: Write a function to check if the sequence "CGTAC" is a subsequence of "ATGCGTACGTT".

### 15. Loops - Finding Start Codon
#### Question: Write a function to find the position of the first start codon (ATG) in the DNA sequence "ATGCGTACGTT".

### 16. Dictionary - Counting Nucleotides
#### Question: Write a function to count the occurrence of each nucleotide in the DNA sequence "ATGCGTACGTT" using a dictionary.

### 17. List - Split DNA Sequence into Codons
#### Question: Write a function to split the DNA sequence "ATGCGTACGTT" into codons and store them in a list.

### 18. Conditional Statements - Check Palindrome
#### Question: Write a function to check if the DNA sequence "ATGCGCAT" is a palindrome.

In [169]:
def check_palindrome(x : str) -> bool:
    #x = "ATTA"

    is_palindrome = True
    n = len(x)
    for i in range(n//2):
        if x[i] != x[n-i-1]:
            is_palindrome = False
        
    return is_palindrome

dna_sequence = "ATTA"

print(f"DNA : {dna_sequence}")

if check_palindrome(dna_sequence):
    print("DNA seq is a palindrome")
else:
    print("DNA seq is NOT a palindrome")



DNA : ATTA
DNA seq is a palindrome


### 19. Loops - Complementary DNA Strand
#### Question: Write a function to generate the complementary DNA strand for "ATGCGTACGTT".

### 20. Dictionary - Counting Codons in mRNA
#### Question: Write a function to count the frequency of each codon in the mRNA sequence "AUGCGUACGUU" using a dictionary.

In [29]:
from pprint import pprint

nuc = "AUGC"

codon_list = [(char_1+char_2+char_3,0) for char_1 in nuc for char_2 in nuc for char_3 in nuc]
#print(codon_list)

count = dict(codon_list)
print(count)

dna_seq = "AUGCGUACGUU"
n = len(dna_seq)
for i in range(0,n,3):

    # to handle case when seq length is not a multiple of 3
    if len(dna_seq[i:i+3]) < 3:
        break

    codon = dna_seq[i:i+3]
    count[codon] += 1

pprint(count)




{'AAA': 0, 'AAU': 0, 'AAG': 0, 'AAC': 0, 'AUA': 0, 'AUU': 0, 'AUG': 0, 'AUC': 0, 'AGA': 0, 'AGU': 0, 'AGG': 0, 'AGC': 0, 'ACA': 0, 'ACU': 0, 'ACG': 0, 'ACC': 0, 'UAA': 0, 'UAU': 0, 'UAG': 0, 'UAC': 0, 'UUA': 0, 'UUU': 0, 'UUG': 0, 'UUC': 0, 'UGA': 0, 'UGU': 0, 'UGG': 0, 'UGC': 0, 'UCA': 0, 'UCU': 0, 'UCG': 0, 'UCC': 0, 'GAA': 0, 'GAU': 0, 'GAG': 0, 'GAC': 0, 'GUA': 0, 'GUU': 0, 'GUG': 0, 'GUC': 0, 'GGA': 0, 'GGU': 0, 'GGG': 0, 'GGC': 0, 'GCA': 0, 'GCU': 0, 'GCG': 0, 'GCC': 0, 'CAA': 0, 'CAU': 0, 'CAG': 0, 'CAC': 0, 'CUA': 0, 'CUU': 0, 'CUG': 0, 'CUC': 0, 'CGA': 0, 'CGU': 0, 'CGG': 0, 'CGC': 0, 'CCA': 0, 'CCU': 0, 'CCG': 0, 'CCC': 0}
{'AAA': 0,
 'AAC': 0,
 'AAG': 0,
 'AAU': 0,
 'ACA': 0,
 'ACC': 0,
 'ACG': 1,
 'ACU': 0,
 'AGA': 0,
 'AGC': 0,
 'AGG': 0,
 'AGU': 0,
 'AUA': 0,
 'AUC': 0,
 'AUG': 1,
 'AUU': 0,
 'CAA': 0,
 'CAC': 0,
 'CAG': 0,
 'CAU': 0,
 'CCA': 0,
 'CCC': 0,
 'CCG': 0,
 'CCU': 0,
 'CGA': 0,
 'CGC': 0,
 'CGG': 0,
 'CGU': 1,
 'CUA': 0,
 'CUC': 0,
 'CUG': 0,
 'CUU': 0,
 'GAA':

In [1]:
codon_to_amino_acid = {
    'ATG': 'Methionine', 'CGT': 'Arginine', 'TAC': 'Tyrosine', 'GTT': 'Valine',
    'TAA': 'Stop', 'TAG': 'Stop', 'TGA': 'Stop'
}

### 21. Dictionary & List - DNA to Protein Translation with a Stop Codon

#### Question: Given a dictionary mapping codons to amino acids and a DNA sequence "ATGCGTTAA", write a function to translate the DNA sequence into a protein sequence, stopping translation at the first stop codon ('TAA', 'TAG', 'TGA').

In [36]:
codon_to_amino_acid = {
        'AUG': 'Methionine', 'UUU': 'Phenylalanine', 'UUC': 'Phenylalanine', 'UUA': 'Leucine', 'UUG': 'Leucine',
        'CUU': 'Leucine', 'CUC': 'Leucine', 'CUA': 'Leucine', 'CUG': 'Leucine', 'AUU': 'Isoleucine', 'AUC': 'Isoleucine',
        'AUA': 'Isoleucine', 'GUU': 'Valine', 'GUC': 'Valine', 'GUA': 'Valine', 'GUG': 'Valine', 'UCU': 'Serine',
        'UCC': 'Serine', 'UCA': 'Serine', 'UCG': 'Serine', 'CCU': 'Proline', 'CCC': 'Proline', 'CCA': 'Proline',
        'CCG': 'Proline', 'ACU': 'Threonine', 'ACC': 'Threonine', 'ACA': 'Threonine', 'ACG': 'Threonine', 'GCU': 'Alanine',
        'GCC': 'Alanine', 'GCA': 'Alanine', 'GCG': 'Alanine', 'UAU': 'Tyrosine', 'UAC': 'Tyrosine', 'UAA': 'Stop',
        'UAG': 'Stop', 'CAU': 'Histidine', 'CAC': 'Histidine', 'CAA': 'Glutamine', 'CAG': 'Glutamine', 'AAU': 'Asparagine',
        'AAC': 'Asparagine', 'AAA': 'Lysine', 'AAG': 'Lysine', 'GAU': 'Aspartic Acid', 'GAC': 'Aspartic Acid',
        'GAA': 'Glutamic Acid', 'GAG': 'Glutamic Acid', 'UGU': 'Cysteine', 'UGC': 'Cysteine', 'UGA': 'Stop',
        'UGG': 'Tryptophan', 'CGU': 'Arginine', 'CGC': 'Arginine', 'CGA': 'Arginine', 'CGG': 'Arginine', 'AGU': 'Serine',
        'AGC': 'Serine', 'AGA': 'Arginine', 'AGG': 'Arginine', 'GGU': 'Glycine', 'GGC': 'Glycine', 'GGA': 'Glycine',
        'GGG': 'Glycine'
    }
# the sequence below is mRNA, so the stop codons use U rather than T
stop_codon = ['UAA','UAG','UGA']

dna_seq = "AUGCGUACGUU"
n = len(dna_seq)

protein = ""


for i in range(0,n,3):
    
    codon = dna_seq[i:i+3]

    # if codon len is < 3 then its invalid
    if len(codon) < 3 :        
        break

    # if codon is a stop codon, record it and stop translating
    # (the trailing "-" is stripped when the protein is printed)
    if codon in stop_codon:
        protein += "Stop-"
        break

    # continue adding protein corresponding to codon

    amino_acid = codon_to_amino_acid[codon]

    protein +=  amino_acid + "-"

print(f"DNA : {dna_seq}")
#print(f"Protein : {protein}")
print(f"Protein : {protein[:-1]}")





DNA : AUGCGUACGUU
Protein : Methionine-Arginine-Threonine


### 22. Set Operations - Unique Nucleotide Combinations

#### Question: Write a function that finds all unique nucleotide combinations of length 2 in a DNA sequence "ATGCGTACGTT".

In [42]:
from pprint import pprint

unique_nu_2 = []
dna_seq = "ATGCGTACGTT"
n = len(dna_seq)

for i in range(0,n,2):

    nuc_pair = dna_seq[i:i+2]

    # To handle the case when the sequence length is odd and the last base cannot form a pair
    if len(nuc_pair) < 2:
        break

    # just add to unique nuc pair list
    if nuc_pair not in unique_nu_2:
        unique_nu_2.append(nuc_pair)


pprint(unique_nu_2)




['AT', 'GC', 'GT', 'AC']


In [47]:
from pprint import pprint

unique_nu_2 = set()
dna_seq = "ATGCGTACGTT"
n = len(dna_seq)

for i in range(0,n,2):

    nuc_pair = dna_seq[i:i+2]

    # To handle the case when the sequence length is odd and the last base cannot form a pair
    if len(nuc_pair) < 2:
        break

    # add to the set; duplicates are ignored automatically
    
    unique_nu_2.add(nuc_pair)


pprint(unique_nu_2)




{'GT', 'GC', 'AT', 'AC'}


### 23. Nested Loops - Finding All ORFs
#### Question: Write a function to find all open reading frames (ORFs) in the DNA sequence "ATGCGTACGTTATGCGTTAA" starting with 'ATG' and ending with a stop codon.

In [99]:
# using regular expressions 

import re 

seq = "ATGCGTACTGGTAAATGCGTACGTAGATGCGTACCTGAATGCGTACTGATGA" 

stop_codon = ['TAA','TAG','TGA']
pattern = r"ATG((?:[ATGC]{1})*?)(TAA|TAG|TGA)"

pattern = re.compile(pattern , re.IGNORECASE)
matches = pattern.finditer(seq)

for i,match in enumerate(matches,1):    
    #print(match.group(0))
    print(f"ORF {i} : {match.group(1)}")
    
    

ORF 1 : CGTACTGG
ORF 2 : CGTACG
ORF 3 : CGTACC
ORF 4 : CGTAC
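
The regular expression in the cell above accepts any run of bases between ATG and the nearest stop codon, so a match need not be a whole number of codons (ORF 1, CGTACTGG, is 8 bases long). If an ORF is required to stay in the reading frame set by its ATG, one alternative is to step through the sequence three bases at a time. This is a sketch under that assumption; `STOP_CODONS` and `find_orfs` are illustrative names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str) -> list:
    """Return every in-frame ORF: ATG ... first stop codon in the same frame."""
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # walk codon by codon from the base after ATG
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                orfs.append(dna[start:pos + 3])  # include start and stop codons
                break
    return orfs

print(find_orfs("ATGCGTACGTTATGCGTTAA"))  # ['ATGCGTTAA']
```

An ATG with no in-frame stop codon downstream contributes nothing to the result, which is why the first ATG in the question's sequence is skipped.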


### 24. Regular Expressions - Validating DNA Sequence
#### Question: Use a regular expression to validate if a given DNA sequence "ATGCGTACGTT" only contains valid nucleotides (A, T, C, G).

In [105]:
import re
seq = "ATGCATGfCAGgT"


pattern = r"[^ATGC]"

pattern = re.compile(pattern , re.IGNORECASE)
matches = pattern.findall(seq)

print(f"DNA Sequence : {seq}")

if matches:
    print("DNA is INVALID")

else:
    print("DNA is VALID")



DNA Sequence : ATGCATGfCAGgT
DNA is INVALID


### 25. Recursion - Calculate GC Content Recursively
#### Question: Write a recursive function to calculate the GC content of a DNA sequence "ATGCGTACGTT".

In [125]:
def count_gc(seq,i):
    if i == 0 and (seq[i] in ['G','C']):        
        return 1
    

    if i == 0 and (seq[i] not in ['G','C']):
        return 0
    
    if seq[i] in ['G','C']:        
        return 1 + count_gc(seq,i-1)
    else: 
        return count_gc(seq,i-1)

seq = "ATGCGTACGTT"
n = len(seq)

print(f"DNA : {seq}")
print(f"GC Count : {count_gc(seq,n-1)}")

DNA : ATGCGTACGTT
GC Count : 5


### 27. List & Set - Finding Unique and Common Codons
#### Question: Write a function to find unique and common codons between two DNA sequences "ATGCGTACGTT" and "ATGCGTGTGTA".

In [139]:
def find_codon(dna):
    codon_list = set()
    n = len(dna)

    for i in range(0,n,3):
        codon = dna[i:i+3]
        if len(codon) < 3:
            break
        
        codon_list.add(codon)

    return codon_list

def find_unique_codon(dna1 ,dna2):

    codon_set_1 = find_codon(dna1)
    codon_set_2 = find_codon(dna2)

    unique_common_codon =  codon_set_1.union(codon_set_2)

    return unique_common_codon

dna1 = "ATGCGTACGTT"
dna2 = "ATGCGTGTGTA"

find_unique_codon(dna1, dna2)

{'ACG', 'ATG', 'CGT', 'GTG'}
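
The union above merges both codon sets into one. If the common codons and the codons unique to each sequence are wanted separately, set intersection and difference can do it. A sketch with hypothetical names (`codons_in`, `compare_codons`), assuming frame-0 codons and dropping any incomplete trailing codon:

```python
def codons_in(dna: str) -> set:
    """Set of complete, non-overlapping codons read from position 0."""
    return {dna[i:i + 3] for i in range(0, len(dna) - 2, 3)}

def compare_codons(dna1: str, dna2: str) -> dict:
    set1, set2 = codons_in(dna1), codons_in(dna2)
    return {
        "common": set1 & set2,          # present in both sequences
        "only_in_first": set1 - set2,   # unique to dna1
        "only_in_second": set2 - set1,  # unique to dna2
    }

result = compare_codons("ATGCGTACGTT", "ATGCGTGTGTA")
print(result["common"])          # ATG and CGT
print(result["only_in_first"])   # {'ACG'}
print(result["only_in_second"])  # {'GTG'}
```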

### 28. List Comprehension - Transcribe Multiple DNA Sequences
#### Question: Write a function to transcribe a list of DNA sequences ["ATGCGT", "GATTACA", "CGTACG"] into their respective mRNA sequences using list comprehension.

In [141]:
dna_seq = ["ATGCGT", "GATTACA", "CGTACG"]

rna_seq = [dna.replace('T','U') for dna in dna_seq]

print(dna_seq)
print(rna_seq)

['ATGCGT', 'GATTACA', 'CGTACG']
['AUGCGU', 'GAUUACA', 'CGUACG']


### 29. Nested Loops - Counting Overlapping Codons
#### Question: Write a function to count the number of times a specific codon "CGT" appears in a DNA sequence "ATGCGTACGTT" including overlapping occurrences.

In [146]:
codon = "CGT"
dna = "ATGCGTACGTT"
n = len(dna)

count = 0

for i in range(n):
    local_codon = dna[i:i+3]

    if len(local_codon) < 3:
        break

    if local_codon == codon:
        count += 1

print(f"DNA : {dna}")
print(f"codon : {codon}")
print(f"overlapping : {count}")


    

DNA : ATGCGTACGTT
codon : CGT
overlapping : 2


### 30. List & Dictionary - Grouping DNA Sequences by GC Content
#### Question: Write a function to group DNA sequences ["ATGCGT", "GATTACA", "CGTACG", "TATATA"] into categories based on GC content: "High GC" (>50%), "Moderate GC" (30-50%), "Low GC" (<30%).

In [162]:
dna_list = ["ATGCGT", "GATTACA", "CGTACG", "TATATA"]

# GC fraction of each sequence, using the recursive count_gc from question 25
gc_fraction = {dna: count_gc(dna, len(dna) - 1) / len(dna) for dna in dna_list}

# thresholds follow the question: High > 50%, Moderate 30-50%, Low < 30%
high_gc_dna_list = [dna for dna in dna_list if gc_fraction[dna] > 0.5]

moderate_gc_dna_list = [dna for dna in dna_list if 0.3 <= gc_fraction[dna] <= 0.5]

low_gc_dna_list = [dna for dna in dna_list if gc_fraction[dna] < 0.3]

print("High GC DNA: ")
for dna in high_gc_dna_list:
    print(dna, end=" ")
print("\n")

print("Moderate GC DNA: ")
for dna in moderate_gc_dna_list:
    print(dna, end=" ")
print("\n")

print("low GC DNA: ")
for dna in low_gc_dna_list:
    print(dna, end=" ")
print("\n")



High GC DNA: 
CGTACG 

Moderate GC DNA: 
ATGCGT 

low GC DNA: 
GATTACA TATATA 



### 31. List & Conditional Statements - Counting Transitions and Transversions
#### Question: Write a function that counts the number of transitions (purine to purine, pyrimidine to pyrimidine) and transversions (purine to pyrimidine, vice versa) between two DNA sequences "ATGCGTAC" and "ATGCGTAG".

In [168]:

def count_transitions_transversions(dna1, dna2):
    purines = ['A','G']
    pyrimidines = ['T','C']

    transition = 0
    transversions = 0

    for nuc_1, nuc_2 in zip(dna1,dna2):
        # identical bases are not substitutions, so they count as neither
        if nuc_1 == nuc_2:
            continue

        # purine <-> purine or pyrimidine <-> pyrimidine
        if (nuc_1 in purines and nuc_2 in purines ) or (nuc_1 in pyrimidines and nuc_2 in pyrimidines ):
            transition += 1

        # purine <-> pyrimidine
        if (nuc_1 in purines and nuc_2 in pyrimidines ) or (nuc_1 in pyrimidines and nuc_2 in purines ):
            transversions += 1

    return (transition,transversions)

dna_1 = "ATGCGTAC" 
dna_2 = "ATGCGTAG"
count_transitions_transversions(dna_1, dna_2)

(0, 1)

### 32. List & Dictionary - Identifying Palindromic Sequences
#### Question: Write a function to find all palindromic sequences of length 4 in a DNA sequence "ATGCGTACGTACGCGT".

In [172]:
def find_4_len_palin(dna):
    n = len(dna)
    palin_list = []

    # check every window of 4 bases; relies on check_palindrome from question 18
    for i in range(n - 3):
        sm_dna = dna[i:i+4]

        if check_palindrome(sm_dna):
            palin_list.append(sm_dna)

    return palin_list



dna = "ATTAGTACGTACGCGT"

find_4_len_palin(dna)
        



['ATTA']

### 33. String & List - DNA Mutation Simulation
#### Question: Write a function to simulate a point mutation in a DNA sequence "ATGCGTAC" by randomly replacing one nucleotide with another.

In [11]:
import random

def random_nuc(nuc): 
    nuc_list = ['A','T','G','C']
    nuc_list.remove(nuc)
    return nuc_list[random.randint(0,2)]




dna = "ATGCGTAC"
n = len(dna)

random_index = random.randint(0,n-1)

print(f"DNA : {dna}")
print(f"Random mutation at {random_index}")

mutated_dna = dna[0:random_index] + random_nuc(dna[random_index]) + dna[random_index+1 : ]

print(f"Mutated DNA : {mutated_dna}")
 


DNA : ATGCGTAC
Random mutation at 5
Mutated DNA : ATGCGAAC


### 34. Loops & Dictionary - Counting Nucleotides in Multiple Sequences
#### Question: Write a function to count the occurrence of each nucleotide in a list of DNA sequences ["ATG", "CGT", "TAC", "GTT"].

In [19]:
nucleotides = ['A','T','G','C']

def count_nuc(dna):
    count = {'A':0,'T':0,'G':0,'C':0}

    for key in count.keys():
        count[key] = dna.count(key)

    return count


def print_count_nuc(count):

    nucleotides = ['A','T','G','C']

    
    for nuc in nucleotides:
        print(f"Count {nuc} : {count[nuc]}")


dna_list = ["ATG", "CGT", "TAC", "GTT"]

for dna in dna_list:
    print(f"DNA : {dna}")
    print_count_nuc(count_nuc(dna))
    print()

    



DNA : ATG
Count A : 1
Count T : 1
Count G : 1
Count C : 0

DNA : CGT
Count A : 0
Count T : 1
Count G : 1
Count C : 1

DNA : TAC
Count A : 1
Count T : 1
Count G : 0
Count C : 1

DNA : GTT
Count A : 0
Count T : 2
Count G : 1
Count C : 0



### 35. String Manipulation - GC Skew Calculation
#### Question: Write a function to calculate the GC skew, (G - C) / (G + C), at each position in the DNA sequence "ATGCGTACGTT".

In [27]:
nucleotides = ['A','T','G','C']

def count_nuc(dna):
    count = {'A':0,'T':0,'G':0,'C':0}

    for key in count.keys():
        count[key] = dna.count(key)

    return count

dna = "ATGCGTACGTT"
count_all = count_nuc(dna)

# parenthesise numerator and denominator: skew = (G - C) / (G + C)
gc_skew = (count_all['G'] - count_all['C']) / (count_all['G'] + count_all['C'])

print(f"DNA : {dna}")
print(f"GC skew : {gc_skew:.3f}")


DNA : ATGCGTACGTT
GC skew : 0.200
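
The cell above reports a single skew value for the whole sequence, while the question asks for a value at each position. One common reading is the cumulative skew over the prefix ending at each base. The sketch below assumes that reading and reports 0.0 while no G or C has been seen yet; `gc_skew_profile` is an illustrative name.

```python
def gc_skew_profile(dna: str) -> list:
    """Cumulative (G - C) / (G + C) after each base of dna."""
    g = c = 0
    profile = []
    for base in dna:
        if base == "G":
            g += 1
        elif base == "C":
            c += 1
        total = g + c
        # avoid dividing by zero before the first G or C appears
        profile.append((g - c) / total if total else 0.0)
    return profile

print([round(value, 3) for value in gc_skew_profile("ATGCGTACGTT")])
# [0.0, 0.0, 1.0, 0.0, 0.333, 0.333, 0.333, 0.0, 0.2, 0.2, 0.2]
```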


### 36. List & Set - Finding Common and Unique Nucleotides
#### Question: Write a function to find common and unique nucleotides between two DNA sequences "ATGCGT" and "GATTACA".

In [32]:
dna_1 = "ATT" 
dna_2 = "ATGC"

# the set intersection keeps only the nucleotides present in both sequences
common = set(dna_1).intersection(set(dna_2))

print(f"dna 1 : {dna_1}")
print(f"dna 2 : {dna_2}")

print("Common nucleotides : ", end=" ")
for nuc in sorted(common):
    print(f"{nuc}", end=" ")

dna 1 : ATT
dna 2 : ATGC
Common nucleotides :  A T 

### 37. String & List - Reversing Transcription
#### Question: Write a function to reverse transcribe an mRNA sequence "AUGCGUACGUU" back into a DNA sequence.

In [33]:
rna = "AUGCGUACGUU"
dna = rna.replace('U','T')

print(f'MRNA : {rna}')
print(f"DNA : {dna}")

MRNA : AUGCGUACGUU
DNA : ATGCGTACGTT


### 38. List & String - Translating Overlapping Codons
#### Question: Write a function to translate overlapping codons in a DNA sequence "ATGCGTACGTT" by one base at a time.

In [65]:
def count_nuc(dna):
    count = {'A':0,'T':0,'G':0,'C':0}

    for key in count.keys():
        count[key] = dna.count(key)

    return count

dna = "ATGCGTACGTT"
print(count_nuc(dna))

{'A': 2, 'T': 4, 'G': 3, 'C': 2}
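
The cell above counts nucleotides rather than translating. For the overlapping translation the question describes, one option is to slide a 3-base window along the sequence one base at a time and look each window up in a codon table. This sketch assumes the four-codon dictionary from question 3 and labels every other codon "Unknown"; `CODON_TABLE` and `translate_overlapping` are hypothetical names.

```python
# Four-codon table from question 3; any other codon is reported as "Unknown".
CODON_TABLE = {
    "ATG": "Methionine",
    "CGT": "Arginine",
    "TAC": "Tyrosine",
    "GTT": "Valine",
}

def translate_overlapping(dna: str, table: dict) -> list:
    """Translate every 3-base window, advancing one base at a time."""
    return [table.get(dna[i:i + 3], "Unknown") for i in range(len(dna) - 2)]

print(translate_overlapping("ATGCGTACGTT", CODON_TABLE))
# ['Methionine', 'Unknown', 'Unknown', 'Arginine', 'Unknown',
#  'Tyrosine', 'Unknown', 'Arginine', 'Valine']
```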


### 39. List & Conditional Statements - Detecting Frameshifts
#### Question: Write a function to detect frameshift mutations between two sequences "ATGCGTACGTT" and "ATCGTACGTT".

### 40. String & List - Extracting Introns and Exons
#### Question: Write a function to extract exons and introns from a DNA sequence "ATGCGTACGTT" with exons at positions [(0, 3), (4, 7)].

In [64]:
#Assuming in (0,3) means  Exon locations are 0,1,2,3


dna = "ATGCGTACGTT"
n = len(dna)
exon_loc = [(0, 3), (4, 7)]
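introns = []

# an intron is the gap between one exon's end and the next exon's start;
# everything after the last exon is treated as a trailing intron
for i in range(len(exon_loc)):
    if i == (len(exon_loc) - 1):
        introns.append((exon_loc[i][1] + 1, n - 1))
        break
    else:
        start = exon_loc[i][1] + 1
        end = exon_loc[i + 1][0] - 1
        # adjacent exons leave no gap, so there is no intron between them
        if end < start:
            continue

        introns.append((start, end))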

print(f"DNA : {dna}")
print("Exon Location : ",exon_loc) 
print("Introns Location : ",introns)


exons_value = [dna[start:end+1] for (start,end) in exon_loc]

introns_value = [dna[start:end+1] for (start,end) in introns]



print("Exons  : ",end=" ")
for exon in exons_value:
    print(f"{exon}",end=" ")

print()
print("Introns : ",end=" ")
for intron in introns_value:
    print(f"{intron}",end=" ")



DNA : ATGCGTACGTT
Exon Location :  [(0, 3), (4, 7)]
Introns Location :  [(8, 10)]
Exons  :  ATGC GTAC 
Introns :  GTT 

### 41. String & Dictionary - Codon Usage Frequency

#### Question: Write a function that calculates the frequency of each codon in a DNA sequence "ATGCGTACGTTATGCGT" and returns a dictionary with the counts.



In [None]:
# Repeat

### 42. Nested Loops - Finding Longest ORF

#### Question: Write a function to find the longest open reading frame (ORF) in the DNA sequence "ATGCGTACGTTATGCGTTAA" that starts with 'ATG' and ends with a stop codon.



In [72]:
import re 

dna = "ATGCGTACTGGTAAATGCGTACGTAGATGCGTACCTGAATGCGTACTGATGA"

# ATG, then the shortest run of bases that reaches a stop codon
# (note: this pattern does not enforce the codon reading frame)
pattern = re.compile(r"ATG((?:[ATGC])*?)(TAA|TAG|TGA)", re.IGNORECASE)

longest_ORF = ""

# search the sequence defined in this cell
for match in pattern.finditer(dna):
    orf = match.group(0)    # full ORF, including the start and stop codons
    if len(orf) > len(longest_ORF):
        longest_ORF = orf

print(f"DNA : {dna}")
print(f"Longest ORF : {longest_ORF}")
    
    

DNA : ATGCGTACTGGTAAATGCGTACGTAGATGCGTACCTGAATGCGTACTGATGA
Longest ORF : ATGCGTACTGGTAA


### 43. List & Set - Unique Amino Acids in Protein Sequence

#### Question: Write a function to identify unique amino acids in a protein sequence "MKVLYRFY" using a set.

##### Output: {'R', 'V', 'K', 'L', 'Y', 'M', 'F'}

In [74]:
protein = "MKVLYRFY"
uni_amino = set(protein)


print(uni_amino)

{'L', 'Y', 'F', 'R', 'K', 'V', 'M'}


### 44. String Manipulation - Finding Overlapping K-mers

#### Question: Write a function to find all overlapping k-mers of length 3 in the DNA sequence "ATGCGTACGTT".



In [78]:

def kmers(seq, k):
    kmer_list = []

    # slide a window of width k along the sequence, one base at a time
    for i in range(len(seq) - k + 1):
        kmer_list.append(seq[i:i+k])

    return kmer_list

dna = "ATGCGTACGTT"

print(f"DNA : {dna}")
print("kmer (k = 3): ",end=" ")

for kmer in kmers(dna,3):
    print(f"{kmer}",end=" ")     


DNA : ATGCGTACGTT
kmer (k = 3):  ATG TGC GCG CGT GTA TAC ACG CGT GTT 

### 45. List & Conditional Statements - GC Content Windows

#### Question: Write a function to calculate the GC content in non-overlapping windows of size 4 in the DNA sequence "ATGCGTACGTT".

#####  Output: [50.0, 50.0, 25.0]


In [96]:
# easy
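
A possible solution sketch. It divides the G/C count by the window size even for the shorter final window, an assumption chosen because it reproduces the stated output [50.0, 50.0, 25.0]; dividing by the actual window length would give about 33.3 for the last window instead. `gc_windows` is an illustrative name.

```python
def gc_windows(dna: str, size: int = 4) -> list:
    """GC percentage of each non-overlapping window of the given size."""
    result = []
    for start in range(0, len(dna), size):
        window = dna[start:start + size]
        gc = window.count("G") + window.count("C")
        # assumption: always divide by the nominal window size
        result.append(100 * gc / size)
    return result

print(gc_windows("ATGCGTACGTT"))  # [50.0, 50.0, 25.0]
```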

### 46. List & String - Translating Codon Frame Shifts

#### Question: Write a function to translate all possible frames (0, +1, +2) in the DNA sequence "ATGCGTACGTT".

##### # Output: [['Methionine', 'Arginine', 'Unknown'], ['Unknown', 'Unknown', 'Arginine'], ['Unknown', 'Tyrosine', 'Valine']]

### 47. String & Dictionary - Complementary RNA Strand
#### Question: Write a function to generate the complementary RNA strand for the sequence "AUGCGUACGUU".
##### Output: 'UACGCAUGCAA'

In [83]:
comp_rna_pair ={"A":"U","U":"A","G":"C","C":"G"}
rna  = "AUGCGUACGUU"
comp_rna = ""

for char in rna :
    comp_rna += comp_rna_pair[char]

print(f"RNA : {rna}")
print(f"comp Rna : {comp_rna}")


RNA : AUGCGUACGUU
comp Rna : UACGCAUGCAA


### 49. List & Loops - Translating Reverse Complement
#### Question: Write a function to translate the reverse complement of a DNA sequence "ATGCGTACGTT" into a protein sequence.


##### # Output: ['Unknown', 'Unknown', 'Unknown', 'Unknown']

In [95]:
comp_pair_dna = {'A':'T','T':'A','G':'C','C':'G'}
dna = "ATGCGTACGTT"

comp_dna = ""
for nuc in dna:
    comp_dna += comp_pair_dna[nuc]
rev_comp = comp_dna[::-1]

print(dna)
print(comp_dna)
print(rev_comp)

def dna_to_rna(dna):
    return dna.replace('T','U')



codon_table = {
        'AUG': 'Methionine', 'UUU': 'Phenylalanine', 'UUC': 'Phenylalanine', 'UUA': 'Leucine', 'UUG': 'Leucine',
        'CUU': 'Leucine', 'CUC': 'Leucine', 'CUA': 'Leucine', 'CUG': 'Leucine', 'AUU': 'Isoleucine', 'AUC': 'Isoleucine',
        'AUA': 'Isoleucine', 'GUU': 'Valine', 'GUC': 'Valine', 'GUA': 'Valine', 'GUG': 'Valine', 'UCU': 'Serine',
        'UCC': 'Serine', 'UCA': 'Serine', 'UCG': 'Serine', 'CCU': 'Proline', 'CCC': 'Proline', 'CCA': 'Proline',
        'CCG': 'Proline', 'ACU': 'Threonine', 'ACC': 'Threonine', 'ACA': 'Threonine', 'ACG': 'Threonine', 'GCU': 'Alanine',
        'GCC': 'Alanine', 'GCA': 'Alanine', 'GCG': 'Alanine', 'UAU': 'Tyrosine', 'UAC': 'Tyrosine', 'UAA': 'Stop',
        'UAG': 'Stop', 'CAU': 'Histidine', 'CAC': 'Histidine', 'CAA': 'Glutamine', 'CAG': 'Glutamine', 'AAU': 'Asparagine',
        'AAC': 'Asparagine', 'AAA': 'Lysine', 'AAG': 'Lysine', 'GAU': 'Aspartic Acid', 'GAC': 'Aspartic Acid',
        'GAA': 'Glutamic Acid', 'GAG': 'Glutamic Acid', 'UGU': 'Cysteine', 'UGC': 'Cysteine', 'UGA': 'Stop',
        'UGG': 'Tryptophan', 'CGU': 'Arginine', 'CGC': 'Arginine', 'CGA': 'Arginine', 'CGG': 'Arginine', 'AGU': 'Serine',
        'AGC': 'Serine', 'AGA': 'Arginine', 'AGG': 'Arginine', 'GGU': 'Glycine', 'GGC': 'Glycine', 'GGA': 'Glycine',
        'GGG': 'Glycine'
    }

def translate_rna_protein(codon : str):
    if codon in codon_table.keys():
        return codon_table[codon]
    else :
        return "Invalid"
    
def rna_to_protein(codon_seq):
    n = len(codon_seq)

    protein_seq = ""

    for i in range(0,n,3):     
        protein_seq = protein_seq + translate_rna_protein(codon_seq[i:i+3])+"-"

    return protein_seq 


    


result = rna_to_protein((dna_to_rna(rev_comp)))

print(result)




ATGCGTACGTT
TACGCATGCAA
AACGTACGCAT
Asparagine-Valine-Arginine-Invalid-


### 50. List & String - Identifying Start and Stop Codons

#### Question: Write a function to find all positions of start codons 'ATG' and stop codons ('TAA', 'TAG', 'TGA') in a DNA sequence "ATGCGTATGCGTTAA".

##### # Output: ([0, 6], [12])

In [None]:
import re
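
One way to approach this without regular expressions is to test every 3-base window against the start and stop codon sets, which also catches overlapping hits. A sketch with hypothetical names (`codon_positions`, `find_start_stop`), using 0-based positions in any reading frame:

```python
def codon_positions(dna: str, codons: set) -> list:
    """0-based start index of every 3-base window that is in codons."""
    return [i for i in range(len(dna) - 2) if dna[i:i + 3] in codons]

def find_start_stop(dna: str) -> tuple:
    starts = codon_positions(dna, {"ATG"})
    stops = codon_positions(dna, {"TAA", "TAG", "TGA"})
    return starts, stops

print(find_start_stop("ATGCGTATGCGTTAA"))  # ([0, 6], [12])
```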



### 51. String & List - Generating K-mer Frequencies

#### Question: Write a function to calculate the frequency of each k-mer of length 3 in the DNA sequence "ATGCGTACGTT".

##### # Output: {'ATG': 1, 'TGC': 1, 'GCG': 1, 'CGT': 2, 'GTA': 1, 'TAC': 1, 'ACG': 1, 'GTT': 1}
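
A compact sketch using `collections.Counter` from the standard library, which tallies each overlapping window as it is generated. The function name `kmer_frequencies` is an assumption made for illustration.

```python
from collections import Counter

def kmer_frequencies(dna: str, k: int = 3) -> dict:
    """Count every overlapping k-mer in dna."""
    return dict(Counter(dna[i:i + k] for i in range(len(dna) - k + 1)))

print(kmer_frequencies("ATGCGTACGTT"))
# {'ATG': 1, 'TGC': 1, 'GCG': 1, 'CGT': 2, 'GTA': 1, 'TAC': 1, 'ACG': 1, 'GTT': 1}
```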


### 52. Dictionary & List - Counting Dinucleotides

#### Question: Write a function to count the frequency of each dinucleotide pair (e.g., 'AA', 'AC', etc.) in a DNA sequence "ATGCGTACGTT".

##### # Output: {'AT': 1, 'TG': 1, 'GC': 1, 'CG': 2, 'GT': 2, 'TA': 1, 'AC': 1, 'TT': 1}


### 53. List & String - Finding Reverse Palindromes

#### Question: Write a function to find all reverse palindromic sequences of length 6 in the DNA sequence "ATGCGTACGCGTACGT".

##### Output: ['CGTACG', 'ACGCGT', 'CGTACG']
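
A reverse palindrome is a sequence equal to its own reverse complement. The sketch below checks every window of the requested length against that definition and reports each hit with its 0-based position; `COMPLEMENT`, `is_reverse_palindrome`, and `reverse_palindromes` are illustrative names.

```python
# Translation table mapping each base to its complement (A<->T, G<->C).
COMPLEMENT = str.maketrans("ATGC", "TACG")

def is_reverse_palindrome(seq: str) -> bool:
    """True if seq reads the same as its reverse complement."""
    return seq == seq.translate(COMPLEMENT)[::-1]

def reverse_palindromes(dna: str, length: int = 6) -> list:
    """(position, sequence) for every window that is a reverse palindrome."""
    return [
        (i, dna[i:i + length])
        for i in range(len(dna) - length + 1)
        if is_reverse_palindrome(dna[i:i + length])
    ]

print(reverse_palindromes("ATGCGTACGCGTACGT"))
# [(3, 'CGTACG'), (6, 'ACGCGT'), (9, 'CGTACG')]
```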

### 54. String & List - Protein Subsequence Search

#### Question: Write a function to find all occurrences of a protein subsequence "LYR" in a protein sequence "MKVLYRLYRFY".

##### Output: [3, 6]

### 55. List & Dictionary - Finding Codon Usage Bias

#### Question: Write a function to compare codon usage in two different DNA sequences "ATGCGTACGTT" and "ATGCGTAGCGT".

##### Output: ({'ATG': 1, 'CGT': 1, 'ACG': 1}, {'ATG': 1, 'CGT': 1, 'AGC': 1})


### 56. List & String - Converting DNA to Protein Using a Custom Genetic Code

#### Question: Write a function to translate a DNA sequence "ATGCGTACGTT" into a protein using a custom genetic code mapping.

##### Output: ['Methionine', 'Arginine', 'Tyrosine', 'Valine']


In [None]:
custom_genetic_code = {
    'ATG': 'Methionine', 'CGT': 'Arginine', 'TAC': 'Tyrosine', 'GTT': 'Valine',
    'TAA': 'Stop', 'TAG': 'Stop', 'TGA': 'Stop'
}

### 57. List & Set - Finding Unique Codons

#### Question: Write a function to find all unique codons in a DNA sequence "ATGCGTACGTTATGCGT".

##### Output: {'ATG', 'CGT', 'ACG', 'TTA', 'TGC'}

### 58. String & List - Translating Protein Sequences with Multiple Start Codons

#### Question: Write a function to translate all possible protein sequences starting from each occurrence of 'ATG' in a DNA sequence "ATGCGTATGCGTTAA".

##### Output: [['Methionine', 'Arginine', 'Tyrosine'], ['Methionine', 'Arginine', 'Valine']]


### 59. List & Dictionary - Calculating Codon Usage in Different Frames

#### Question: Write a function to calculate codon usage in all three reading frames of a DNA sequence "ATGCGTACGTT".

##### Output: [{'ATG': 1, 'CGT': 1, 'ACG': 1}, {'TGC': 1, 'GTA': 1, 'CGT': 1}, {'GCG': 1, 'TAC': 1, 'GTT': 1}]


### 60. List & Conditional Statements - Identifying ORFs in Reverse Complement

#### Question: Write a function to find all open reading frames (ORFs) in the reverse complement of a DNA sequence "ATGCGTACGTT" that start with 'ATG' and end with a stop codon.



### 61. String & Set - Finding Overlapping Motifs

#### Question: Write a function to find all occurrences of the motif "ACGT" in the DNA sequence "ATGCGTACGTTACGTACGT".
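
To allow overlapping hits, the search can restart one base after each match instead of after the whole motif. A sketch built on `str.find`, returning 0-based positions; `find_motif` is a hypothetical name.

```python
def find_motif(dna: str, motif: str) -> list:
    """0-based start positions of motif in dna, overlapping hits included."""
    positions = []
    start = dna.find(motif)
    while start != -1:
        positions.append(start)
        # resume one base later so overlapping occurrences are not skipped
        start = dna.find(motif, start + 1)
    return positions

print(find_motif("ATGCGTACGTTACGTACGT", "ACGT"))  # [6, 11, 15]
```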



### 62. List & Dictionary - Counting Triplet Nucleotide Repeats

#### Question: Write a function to count the number of times each triplet nucleotide repeat occurs in the DNA sequence "ATGCGTACGTTACG".



### 63. String & Loops - Translating a Custom Genetic Code Sequence

#### Question: Write a function to translate a DNA sequence "ATGCGTACGTT" using a custom genetic code that includes ambiguous codons (e.g., 'ATN' -> 'Methionine', 'CGN' -> 'Arginine').



In [2]:
genetic_code = {
    'ATG': 'Methionine', 'ATN': 'Methionine', 'CGT': 'Arginine', 'CGN': 'Arginine',
    'TAC': 'Tyrosine', 'GTT': 'Valine', 'TAA': 'Stop', 'TAG': 'Stop', 'TGA': 'Stop'
}

### 64. String & List - Analyzing Amino Acid Composition

#### Question: Write a function to calculate the composition of each amino acid in a protein sequence "MKVLYRFY".



### 65. List & Dictionary - Calculating Nucleotide Composition in Codon Positions

#### Question: Write a function to calculate the nucleotide composition at each codon position (1st, 2nd, 3rd) in the DNA sequence "ATGCGTACGTT".

##### # Output: {1: {'A': 1, 'T': 1, 'C': 1, 'G': 2}, 2: {'A': 0, 'T': 1, 'C': 3, 'G': 1}, 3: {'A': 1, 'T': 2, 'C': 0, 'G': 2}}


### 66. List & Set - Identifying Non-overlapping Motifs

#### Question: Write a function to find all non-overlapping occurrences of the motif "ACG" in the DNA sequence "ATGCGTACGTTACG".

##### Output: [6, 11]
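
`re.finditer` scans left to right and resumes after the end of each match, so its hits never overlap. A sketch relying on that behaviour, with 0-based positions; `find_non_overlapping` is an illustrative name.

```python
import re

def find_non_overlapping(dna: str, motif: str) -> list:
    """0-based start positions of non-overlapping occurrences of motif."""
    # re.escape keeps the motif literal even if it contains regex characters
    return [match.start() for match in re.finditer(re.escape(motif), dna)]

print(find_non_overlapping("ATGCGTACGTTACG", "ACG"))  # [6, 11]
```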

### 67. List & Dictionary - Creating a Reverse Complement Dictionary

#### Question: Write a function to create a dictionary that maps each codon to its reverse complement in a DNA sequence "ATGCGTACGTT".

##### Output: {'ATG': 'CAT', 'CGT': 'ACG', 'ACG': 'CGT', 'GTT': 'AAC'}


### 68. List & String - Transcribing DNA with Ambiguous Bases

#### Question: Write a function to transcribe a DNA sequence with ambiguous bases "ATGCGTNCGTT" into an RNA sequence, where 'N' represents any nucleotide.

##### Output: 'UACGCANGCAA'

### 69. String & List - Translating Overlapping Codons with Degenerate Bases

#### Question: Write a function to translate overlapping codons in a DNA sequence "ATGCGTACGTNNN" using a genetic code that includes degenerate bases.

##### Output: ['Methionine', 'Arginine', 'Threonine', 'Any']


In [3]:
degenerate_genetic_code = {
    'ATG': 'Methionine', 'CGT': 'Arginine', 'ACG': 'Threonine', 'TNN': 'Any'
}

### 70. List & String - Finding Longest Protein Coding Sequence

#### Question: Write a function to find the longest protein coding sequence in the DNA sequence "ATGCGTACGTTGAAATGCCGTTAG".

In [1]:
genetic_code = {
    'ATG': 'Methionine', 'CGT': 'Arginine', 'TAC': 'Tyrosine', 'GTT': 'Valine',
    'TAA': 'Stop', 'TAG': 'Stop', 'TGA': 'Stop'
}



## Please use the below seq for verifying the results

In [None]:
dna_sequences = [
    "ATGCGTACGTTGACGTAGCCTAGCGTACGATTACGCGTATGGGCTACTGCGTACGTTGCGTATGCGTACGTTGAATGCGT",
    "GCTAGCGTACGTTGCGTAGCGTACGTGACGTACTGCGTAGCTAGCGTTACGTTACGCGTACGATGCGTACGTGCGTGACG",
    "ATGCGTACGTTGCGTATGCGTACGTTGACGTAGCTAGCGTACGTTACGCGTACGCTGCGTAGCGTACGCGTATGCGTACG",
    "CGTACGTTGACGTAGCGTACGTGCGTACGCGTACGCTAGCGTACGTTGCGTACGTACGCGTACGTTGCGTACGCGTACGT",
    "ACGTTGCGTACGCGTACGTTGACGTAGCGTACGTTACGCGTACGTGCGTAGCGTACGCTGCGTACGCGTACGTTGCGTAC"
]

protein_sequences = [
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY",
    "MVVVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY",
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY",
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY",
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
]

In [None]:
dna_sequences = [
    "ATGCGTACGTTGACGTAGCCTAGCGTACGATTACGCGTATGGGCTACTGCGTACGTTGCGTATGCGTACGTTGAATGCGT"
    "TACGTTGACGTAGCGTACGTGCGTAGCTAGCGTACGTTACGCGTACGATGCGTACGTGCGTGACGTACGTTACGCGTACG"
    "TAGCGTACGTTGCGTACGTTGACGTAGCGTACGCTGCGTAGCGTACGCGTATGCGTACGTTGACGTAGCGTACGTGCGTA"
    "GCTAGCGTACGTTGCGTACGTACGCGTACGTTGCGTACGCGTACGTTGACGTAGCGTACGTTACGCGTACGTGCGTAGCG",
    
    "GCTAGCGTACGTTGCGTAGCGTACGTGACGTACTGCGTAGCTAGCGTTACGTTACGCGTACGATGCGTACGTGCGTGACG"
    "TACGTTGACGTAGCGTACGTTACGCGTACGATGCGTACGTGCGTGACGTACGTTACGCGTACGTGCGTAGCGTACGTTAC"
    "GCGTACGTGCGTACGTTGACGTAGCGTACGTTACGCGTACGTGCGTAGCGTACGCTGCGTACGCGTACGTTGCGTACGCG"
    "TACGTGCGTACGTTGACGTAGCGTACGTTACGCGTACGTGCGTAGCGTACGTTACGCGTACGTTGACGTAGCGTACGTTG",
    
    "ATGCGTACGTTGCGTATGCGTACGTTGACGTAGCTAGCGTACGTTACGCGTACGCTGCGTAGCGTACGCGTATGCGTACG"
    "GCGTACGTTGCGTAGCGTACGTTGACGTAGCGTACGCTGCGTAGCGTACGCGTATGCGTACGTTGACGTAGCGTACGTGC"
    "GTTACGCGTACGTGCGTAGCGTACGCTGCGTACGCGTACGTTGCGTACGTGCGTACGTTGACGTAGCGTACGTTACGCGT"
    "ACGTGCGTAGCGTACGTTGACGTAGCGTACGTTACGCGTACGTGCGTAGCGTACGTTGCGTACGCGTACGTTGACGTAGC",
    
    "CGTACGTTGACGTAGCGTACGTGCGTACGCGTACGCTAGCGTACGTTGCGTACGTACGCGTACGTTGCGTACGCGTACGT"
    "GCGTACGTTGACGTAGCGTACGTGCGTACGCGTACGTTGACGTAGCGTACGTTACGCGTACGCTGCGTACGCGTACGTTG"
    "CGTACGTTGACGTAGCGTACGTGCGTACGCGTACGCTAGCGTACGTTGCGTACGTACGCGTACGTTGCGTACGCGTACGT"
    "GCGTACGTTGACGTAGCGTACGTGCGTACGCGTACGTTGACGTAGCGTACGTTACGCGTACGCTGCGTACGCGTACGTTG",
    
    "ACGTTGCGTACGCGTACGTTGACGTAGCGTACGTTACGCGTACGTGCGTAGCGTACGCTGCGTACGCGTACGTTGCGTAC"
    "GCGTACGTGCGTACGTTGACGTAGCGTACGTTACGCGTACGTGCGTAGCGTACGCTGCGTACGCGTACGTTGCGTACGCG"
    "TACGTGCGTAGCGTACGTTGACGTAGCGTACGTTACGCGTACGTGCGTAGCGTACGCTGCGTACGCGTACGTTGCGTACG"
    "GTTGACGTAGCGTACGTTACGCGTACGTGCGTAGCGTACGCTGCGTACGCGTACGTTGCGTACGCGTACGTGCGTAGCGT"
]


In [None]:
protein_sequences = [
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY",
    
    "MVVVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY",
    
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY",
    
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY",
    
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
    "MKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFYMKVLYRLYRFY"
]
