**CREATING DATA**

**The data consists of 50 rows and 6 columns, namely, IndividualID, Chromosome, Position, Reference Base, Variation Base and Genotype**

In [1]:
# Generate random genomics data.
# Positions are drawn assuming each chromosome is 2.5 million base pairs long.

import random

genomics_data = []
dna_types = ["A", "T", "C", "G"]

def generate_genomics_data():
    # Start from an empty list so repeated calls do not keep appending rows
    genomics_data.clear()

    for _ in range(50):
        reference_base = random.choice(dna_types)
        variation_base = random.choice(dna_types)
        genomics_data.append({"IndividualID": random.randint(1, 50),
                              "Chromosome": random.randint(1, 22),
                              "Position": random.randint(1, 2500000),
                              "ReferenceBase": reference_base,
                              "VariationBase": variation_base,
                              # Genotype is the reference base followed by the variation base
                              "Genotype": reference_base + variation_base})

    return genomics_data

In [2]:
generate_genomics_data()[:2]

[{'IndividualID': 46,
  'Chromosome': 14,
  'Position': 1455111,
  'ReferenceBase': 'C',
  'VariationBase': 'T',
  'Genotype': 'CT'},
 {'IndividualID': 1,
  'Chromosome': 18,
  'Position': 1759091,
  'ReferenceBase': 'A',
  'VariationBase': 'T',
  'Genotype': 'AT'}]
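
Because the records are random, every run produces different rows. A minimal sketch of a reproducible variant, assuming a hypothetical helper `make_records` and a fixed `seed` (neither is part of the notebook above):

```python
import random

BASES = ["A", "T", "C", "G"]

def make_records(n_rows=50, seed=0):
    """Return n_rows random variant records (hypothetical helper, not in the notebook)."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    records = []
    for _ in range(n_rows):
        ref = rng.choice(BASES)
        var = rng.choice(BASES)
        records.append({
            "IndividualID": rng.randint(1, 50),
            "Chromosome": rng.randint(1, 22),
            "Position": rng.randint(1, 2_500_000),
            "ReferenceBase": ref,
            "VariationBase": var,
            "Genotype": ref + var,
        })
    return records

records = make_records()
print(len(records), sorted(records[0]))
```

Calling `make_records` twice with the same seed returns identical rows, which makes later results comparable between runs.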

**Many analyses can be performed on the genomics dataset above, but here we focus on two of them:**

**1) Comparing Reference Base and Variation Base: this shows how a gene varies from the reference. Based on the randomly generated dataset, we will calculate the percentage of people who have the same alleles and the percentage of people who have different alleles**

**2) Calculating Genetic Density**

**These insights can be used to determine the genetic diversity among different individuals, which can in turn be used to detect diseases based on these genetic differences.**
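
The comparison in item 1 reduces to counting the records whose two bases match. A minimal sketch on a hand-made four-record sample follows; `allele_percentages` and `sample` are illustrative names, not part of the notebook. When both bases are drawn uniformly and independently from four options, about 25% of records are expected to match.

```python
def allele_percentages(records):
    """Return (percent_same, percent_different) for a list of records."""
    if not records:
        return 0.0, 0.0
    same = sum(1 for r in records if r["ReferenceBase"] == r["VariationBase"])
    percent_same = same / len(records) * 100
    return percent_same, 100 - percent_same

# Hand-made sample: two matching pairs and two differing pairs
sample = [
    {"ReferenceBase": "A", "VariationBase": "A"},
    {"ReferenceBase": "C", "VariationBase": "T"},
    {"ReferenceBase": "G", "VariationBase": "G"},
    {"ReferenceBase": "T", "VariationBase": "A"},
]
print(allele_percentages(sample))  # (50.0, 50.0)
```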

In [17]:
# compare_ref_var: This function compares the reference base and variation base of each record
# It returns a list of heterozygous alleles and a list of homozygous alleles, in that order

homo_allele = []
hetero_allele = []

def compare_ref_var(data):
    # Reset the shared lists so repeated calls do not count the same records twice
    homo_allele.clear()
    hetero_allele.clear()

    for entry in data:
        if entry["ReferenceBase"] != entry["VariationBase"]:
            hetero_allele.append(entry["Genotype"])
        else:
            homo_allele.append(entry["Genotype"])

    return hetero_allele, homo_allele

# calc_percentage: returns the percentage of homozygous and heterozygous alleles
def calc_percentage(homo_list, hetero_list):
    total = len(homo_list) + len(hetero_list)
    if total == 0:
        # Nothing has been classified yet, so avoid dividing by zero
        return 0.0, 0.0

    percentage_homo = (len(homo_list) / total) * 100
    percentage_hetero = (len(hetero_list) / total) * 100

    return percentage_homo, percentage_hetero

chromosome_variant_counts = {}
chromosome_lengths = {
    1: 249250621,
    2: 243199373,
    3: 198022430,
    5: 180857866,
    7: 159345973
}
# Chromosomes missing from the table fall back to the 2.5 million base pairs
# assumed when the data was generated
assumed_chromosome_length = 2500000


def genetic_density(data):
    # Calculate variant counts per chromosome, starting from a clean slate
    chromosome_variant_counts.clear()
    for entry in data:
        chromosome = entry['Chromosome']
        chromosome_variant_counts[chromosome] = chromosome_variant_counts.get(chromosome, 0) + 1

    for chromosome, variant_count in chromosome_variant_counts.items():
        chromosome_length = chromosome_lengths.get(chromosome, assumed_chromosome_length)
        density = variant_count / chromosome_length
        print(f"Chromosome {chromosome}: Genetic Density = {density:.3e} variants/base pair")


In [None]:
compare_ref_var(genomics_data)

In [18]:
genetic_density(genomics_data)

Chromosome 14: Genetic Density = 1.200e-06 variants/base pair
Chromosome 18: Genetic Density = 1.200e-06 variants/base pair
Chromosome 16: Genetic Density = 1.200e-06 variants/base pair
Chromosome 5: Genetic Density = 5.529e-09 variants/base pair
Chromosome 11: Genetic Density = 1.600e-06 variants/base pair
Chromosome 15: Genetic Density = 2.000e-06 variants/base pair
Chromosome 17: Genetic Density = 1.600e-06 variants/base pair
Chromosome 12: Genetic Density = 1.200e-06 variants/base pair
Chromosome 22: Genetic Density = 1.200e-06 variants/base pair
Chromosome 8: Genetic Density = 1.200e-06 variants/base pair
Chromosome 13: Genetic Density = 1.600e-06 variants/base pair
Chromosome 4: Genetic Density = 1.200e-06 variants/base pair
Chromosome 20: Genetic Density = 4.000e-07 variants/base pair
Chromosome 1: Genetic Density = 1.605e-08 variants/base pair
Chromosome 21: Genetic Density = 4.000e-07 variants/base pair
Chromosome 10: Genetic Density = 4.000e-07 variants/base pair
Chromosome 3: Genetic Densit

In [12]:
chromosome_variant_counts

{14: 3,
 18: 3,
 16: 3,
 5: 1,
 11: 4,
 15: 5,
 17: 4,
 12: 3,
 22: 3,
 8: 3,
 13: 4,
 4: 3,
 20: 1,
 1: 4,
 21: 1,
 10: 1,
 3: 1,
 6: 2,
 2: 1}
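
The per-chromosome tally can also be built with `collections.Counter`. The sketch below is one possible formulation, not the notebook's method: `variants_per_megabase`, `toy_records` and `toy_lengths` are hypothetical names, the lengths are made up for illustration, and chromosomes with no known length are skipped rather than given a default.

```python
from collections import Counter

def variants_per_megabase(records, lengths):
    """Map chromosome -> variants per million base pairs, skipping unknown lengths."""
    counts = Counter(r["Chromosome"] for r in records)
    return {
        chrom: count * 1_000_000 / lengths[chrom]
        for chrom, count in counts.items()
        if chrom in lengths
    }

toy_records = [{"Chromosome": 1}, {"Chromosome": 1}, {"Chromosome": 2}, {"Chromosome": 9}]
toy_lengths = {1: 4_000_000, 2: 2_000_000}  # made-up lengths for illustration only
print(variants_per_megabase(toy_records, toy_lengths))  # {1: 0.5, 2: 0.5}
```

Reporting per megabase keeps the numbers readable, since per-base-pair densities for a 50-row dataset are tiny.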

**We can also calculate the percentage of each type of nucleotide present in a DNA sequence**

**Here, we generate random data representing the DNA sequences of the 50 individuals above. Using this data, we can calculate the percentage of "A", "G", "C" and "T" in a particular individual's sequence**

In [19]:
# defining a function to generate a random DNA sequence

def generate_dna_sequence(length):
    nucleotides = ["A", "G", "C", "T"]

    # Draw `length` bases independently and join them into a single string
    ran_seq = ''.join(random.choice(nucleotides) for _ in range(length))
    return ran_seq


In [20]:
ran_seq = generate_dna_sequence(10)

In [21]:
ran_seq

'TGTAAAAGGA'

In [22]:
# calculating the percentage of each nucleotide present in a particular DNA sequence
# note: despite its name, gc_content reports the percentage of every base, not only G and C

def gc_content(random_sequence):

    # Count each base directly on the string
    count_a = random_sequence.count("A")
    count_c = random_sequence.count("C")
    count_g = random_sequence.count("G")
    count_t = random_sequence.count("T")

    percent_a = (count_a/len(random_sequence))*100
    percent_c = (count_c/len(random_sequence))*100
    percent_g = (count_g/len(random_sequence))*100
    percent_t = (count_t/len(random_sequence))*100

    print(f'Percentage of A: {percent_a}\nPercentage of C: {percent_c}\nPercentage of G: {percent_g}\nPercentage of T: {percent_t}')

In [23]:
gc_content(ran_seq)

Percentage of A: 50.0
Percentage of C: 0.0
Percentage of G: 30.0
Percentage of T: 20.0
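
GC content in the usual sense is the share of bases that are either G or C. A minimal sketch, assuming a hypothetical helper named `gc_percent` that is not part of the notebook, applied to the same ten-base sequence:

```python
def gc_percent(sequence):
    """Return the percentage of bases in `sequence` that are G or C."""
    if not sequence:
        return 0.0  # avoid dividing by zero on an empty sequence
    gc = sequence.count("G") + sequence.count("C")
    return gc * 100 / len(sequence)

print(gc_percent("TGTAAAAGGA"))  # 30.0
```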


In [29]:
# The main program starts from here

print("WELCOME TO THE GENOMICS DATA ANALYSIS PROGRAM")
print("*"*100)
print("Make a choice to proceed further")
print("1) Enter 1: To generate the random genomics data\n2) Enter 2: To compare the reference base and variation base in the data\n3) Enter 3: To calculate the percentage of homozygous alleles and heterozygous alleles\n4) Enter 4: To calculate the genetic density\n5) Enter 5: To generate a random DNA sequence and calculate the percentage of each nucleotide in it\nAny other input exits the program")
print("*"*100)
while True:
    try:
        choice = int(input("Enter the specific choice to proceed further: "))
    except ValueError:
        # Empty or non-numeric input ends the program instead of crashing
        break

    if choice == 1:
        print("The data contains details about 50 individuals. The details include: IndividualID, Chromosome, Position, Reference Base, Variation Base, Genotype.\nHere we only display the data of the first 5 individuals for reference.")
        print(generate_genomics_data()[:5])

    elif choice in (2, 3, 4) and not genomics_data:
        # The analyses need data, so ask the user to generate it first
        print("Generate the data first by entering 1.")

    elif choice == 2:
        hetero, homo = compare_ref_var(genomics_data)
        print(f"Heterozygous alleles: {hetero}\nHomozygous alleles: {homo}")

    elif choice == 3:
        hetero, homo = compare_ref_var(genomics_data)
        percentage_homo, percentage_hetero = calc_percentage(homo, hetero)
        print(f"Homozygous: {percentage_homo}%\nHeterozygous: {percentage_hetero}%")

    elif choice == 4:
        genetic_density(genomics_data)

    elif choice == 5:
        print("This is a randomly generated DNA sequence, used only for analysis purposes")
        sequence = generate_dna_sequence(10)
        print(sequence)
        gc_content(sequence)

    else:
        break

WELCOME TO THE GENOMICS DATA ANALYSIS PROGRAM
****************************************************************************************************
Make a choice to proceed further
1) Enter 1: To generate the random genomics data
2) Enter 2: To compare the reference base and variation base in the data
3) Enter 3: To calculate the percentage of homozygous alleles and heterozygous alleles
4) Enter 4: To calculate the genetic density
5) Enter 5: To generate a random DNA sequence and calculate the percentage of each nucleotide in it
Any other input exits the program
****************************************************************************************************
The data contains details about 50 individuals. The details include: IndividualID, Chromosome, Position, Reference Base, Variation Base, Genotype.
Here we only display the data of the first 5 individuals for reference.
[{'IndividualID': 46, 'Chromosome': 14, 
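
`int()` raises `ValueError` on an empty or non-numeric string, so menu input is worth validating before it is converted. A minimal sketch, assuming a hypothetical helper `parse_choice` whose default range of valid choices is an illustrative assumption:

```python
def parse_choice(raw, valid=range(1, 6)):
    """Return the menu choice as an int, or None if `raw` is not a valid choice."""
    try:
        choice = int(raw.strip())
    except ValueError:
        return None  # empty or non-numeric input
    return choice if choice in valid else None

print(parse_choice("3"), parse_choice(""), parse_choice("abc"), parse_choice("9"))
# 3 None None None
```

Returning `None` lets the calling loop decide whether to ask again or exit, instead of letting the exception end the program.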