# **Installation**

In [6]:
! pip install biopython



In [9]:
from Bio import SeqIO
import pandas as pd
from datetime import datetime

# **Gene Features: DNA**

In DNA, a **"feature"** is an element or characteristic of the genome that has a specific function or carries significant information. Sequence databases store feature annotations together with the sequence in GenBank-format files **(.gb)**. Some commonly encountered DNA features are:


1. **Gene**: A DNA segment that encodes a protein or a functional RNA molecule. The gene is the basic unit of heredity.
2. **Promoter**: A DNA region upstream of a gene that serves as the binding site for RNA polymerase, which initiates transcription of the gene.
3. **Exons and Introns**: Exons are the gene segments retained in the mature RNA, including those that code for protein, while introns are non-coding segments that are spliced out during RNA processing.
4. **Enhancers and Silencers**: DNA sequences that regulate gene expression. Enhancers increase expression, while silencers repress it.
5. **Regulatory Sequence**: A region containing elements that control when, where, and how much a gene is expressed.
6. **ORI (Origin of Replication)**: The site where DNA replication begins in the cell.
7. **Telomere**: The end of a chromosome, which protects the DNA from damage and prevents loss of genetic information during replication.
8. **Centromere**: The region that joins two sister chromatids during cell division and where the kinetochore attaches for chromosome segregation.
9. **Protein Binding Site**: A location on the DNA where proteins, such as transcription factors, bind to regulate gene activity.
10. **Microsatellites and Minisatellites**: Short sequences repeated in tandem that are often used as genetic markers because they vary widely between individuals.
11. **Transposon (Mobile Element)**: A DNA segment that can move to another location in the genome, affecting genome structure and function.
12. **Pseudogene**: A sequence that resembles a gene but is not functional, often because of mutations.
    
These features are important for understanding genome function and for genetic analysis, as well as for drug development, diagnostics, and evolutionary research.
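
To make the idea of a feature concrete, here is a minimal sketch of how Biopython represents one: a `SeqFeature` with a type, a location, and a dictionary of qualifiers. The sequence, the identifier `DEMO1`, and the qualifier values are all invented for illustration and do not come from any real GenBank record.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, SimpleLocation
from Bio.SeqRecord import SeqRecord

# Hypothetical 30-base sequence, invented for illustration only.
record = SeqRecord(Seq("ATGAAATTTGGGTAACCCGGGTTTAAACCC"), id="DEMO1", name="DEMO1")

# Each feature has a type, a location (0-based start, exclusive end), and qualifiers.
record.features = [
    SeqFeature(SimpleLocation(0, 30, strand=1), type="source",
               qualifiers={"organism": ["synthetic construct"]}),
    SeqFeature(SimpleLocation(0, 15, strand=1), type="CDS",
               qualifiers={"product": ["hypothetical protein"]}),
]

# Collect (type, start, end) for every feature on the record.
feature_summary = [
    (f.type, int(f.location.start), int(f.location.end)) for f in record.features
]
for ftype, start, end in feature_summary:
    print(f"{ftype}: {start}..{end}")
```

Records parsed from a `.gb` file carry the same kind of objects in `record.features`, so the loop above should apply to them unchanged.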

# **Exploring Gene Features**

To explore **gene features** in DNA with Python, we can use the *Biopython* package. Biopython provides modules for reading, writing, and analyzing biological data in formats such as GenBank and FASTA.

In this example, we load genome data from a **GenBank** file, extract the gene feature information, and display it. The basic steps are as follows:

# **Exploring GenBank Data**

## 1. Load a GenBank file and view all its values

In [33]:
from Bio import SeqIO

# Open a single-record GenBank file, print its details, and return the record
def open_single_record_genbank(file_path):
    # SeqIO.read expects exactly one record in the file
    with open(file_path, "r") as handle:
        seq_record = SeqIO.read(handle, "genbank")

    # Print basic information
    print(f"Accession ID: {seq_record.id}")
    print(f"Name: {seq_record.name}")
    print(f"Comment: {seq_record.annotations.get('comment', 'Unknown')}")
    print(f"Description: {seq_record.description}")
    print(f"Taxonomy: {seq_record.annotations.get('taxonomy', 'Unknown')}")
    print(f"Organism: {seq_record.annotations.get('organism', 'Unknown')}")
    print(f"Source: {seq_record.annotations.get('source', 'Unknown')}")
    print(f"Sequence Length: {len(seq_record.seq)}")
    print(f"Sequence: {seq_record.seq[:50]}...")  # Print first 50 bases of the sequence
    # The 'date' annotation is the date on the LOCUS line of the record
    print(f"Submission Date: {seq_record.annotations.get('date', 'Unknown')}")
    # print(f"References: {seq_record.annotations.get('references', 'Unknown')}")

    # Print features
    print("\nFeatures:")
    for feature in seq_record.features:
        print(f" - {feature.type} at location {feature.location}")
    print("-" * 40)

    # Return the record so that later cells can keep working with it
    return seq_record

# Example usage
file_path = "/kaggle/input/biokom-genbank/plasmodium-vivax-1-record.gb"  # Replace with the path to your GenBank file
seq_record = open_single_record_genbank(file_path)

Accession ID: L23073.1
Name: PFAPNG4A
Comment: Original source text: Plasmodium vivax (individual_isolate Papua
New Guinea 41-2) DNA.
Description: Plasmodium vivax (Papua New Guinea 41-2) microneme protein-1 (erythrocyte-binding domain) gene, exon 2
Taxonomy: ['Eukaryota', 'Sar', 'Alveolata', 'Apicomplexa', 'Aconoidasida', 'Haemosporida', 'Plasmodiidae', 'Plasmodium', 'Plasmodium (Plasmodium)']
Organism: Plasmodium vivax
Source: Plasmodium vivax (malaria parasite P. vivax)
Sequence Length: 2272
Sequence: AGATATCAATTATGTATGAAGGAACTTACGAATTTTGTAAATAATACAGA...
Submission Date: 31-JAN-1995

Features:
 - source at location [0:2272](+)
 - CDS at location [<0:>2272](+)
 - misc_feature at location [1296:1372](+)
----------------------------------------


## 2. View all features

In [15]:
seq_record.features[:5]

[SeqFeature(SimpleLocation(ExactPosition(0), ExactPosition(2272), strand=1), type='source', qualifiers=...),
 SeqFeature(SimpleLocation(BeforePosition(0), AfterPosition(2272), strand=1), type='CDS', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(1296), ExactPosition(1372), strand=1), type='misc_feature', qualifiers=...)]
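
The locations printed above use Python-style coordinates: the start is 0-based and the end is exclusive. `BeforePosition` and `AfterPosition` (shown as `<` and `>`) mark a feature that extends past the ends of the record. The sketch below illustrates both cases on a made-up 12-base sequence; the sequence and the two features are assumptions for demonstration, not data from the Plasmodium record.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import AfterPosition, BeforePosition, SeqFeature, SimpleLocation

# Hypothetical 12-base sequence, invented for illustration.
seq = Seq("AAACCCGGGTTT")

# Exact location [3:9]: 0-based start, exclusive end, forward strand.
exact = SeqFeature(SimpleLocation(3, 9, strand=1), type="misc_feature")

# Partial location [<0:>12]: the feature continues beyond both ends of the record.
partial = SeqFeature(
    SimpleLocation(BeforePosition(0), AfterPosition(12), strand=1), type="CDS"
)

# extract() returns the part of the parent sequence covered by the feature.
exact_bases = str(exact.extract(seq))
print(exact.location, exact_bases)
print(partial.location, len(partial.location))
```

In the notebook, the same `extract` call on `seq_record.seq` should return the bases covered by any feature in `seq_record.features`.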

## 3. View gene features

In [18]:
# Keep only the features whose type is 'gene'
gene_features = [feature for feature in seq_record.features if feature.type == 'gene']

print(f'Number of gene features: {len(gene_features)}')
gene_features

Number of gene features: 0


[]
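
The empty list is not an error: this record has no feature annotated with the type `gene`. A quick way to see which types a record does contain is to tally them. The sketch below does this with `collections.Counter`, using the three type names copied from the output in section 1 as stand-in data.

```python
from collections import Counter

# Feature types copied from the printed output for L23073.1.
# In the notebook this list could be built as: [f.type for f in seq_record.features]
feature_types = ["source", "CDS", "misc_feature"]

type_counts = Counter(feature_types)
print(type_counts)
print(type_counts["gene"])  # a Counter returns 0 for a key it has never seen
```

Checking the tally first avoids filtering on a feature type that the record never uses.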

## 4. View CDS features

In [20]:
# Keep only the features whose type is 'CDS' (coding sequence)
CDS_features = [feature for feature in seq_record.features if feature.type == 'CDS']

print(f"Number of CDS features: {len(CDS_features)}")
CDS_features

Number of CDS features: 1


[SeqFeature(SimpleLocation(BeforePosition(0), AfterPosition(2272), strand=1), type='CDS', qualifiers=...)]

## 5. Select one CDS and examine its qualifiers

In [22]:
print(f'CDS Qualifier Keys: {CDS_features[0].qualifiers.keys()}\n')

print('Displaying the first CDS feature')
print(CDS_features[0].qualifiers)  # a dict mapping each qualifier name to a list of strings

CDS Qualifier Keys: dict_keys(['function', 'experiment', 'codon_start', 'product', 'protein_id', 'translation'])

Displaying the first CDS feature
{'function': ['erythrocyte binding domain'], 'experiment': ['experimental evidence, no additional details recorded'], 'codon_start': ['1'], 'product': ['microneme protein-1'], 'protein_id': ['AAA61770.1'], 'translation': ['RYQLCMKELTNFVNNTDTNFHRDITFRKLYLKWKLIYDAAVEFDLLLKLNNYRYNKVFCKDIRRSLGDFGDIIMGTDMEGIGYSEVVENNLRSFGTGEKAQQHRKQWWNESKAQIWTAMMYSVKKRLKGNFIWICKINVAVNIEPQIYRRIREWGRDYVSELPTEVQKLKEKCDGKINYTDKKVCKVPPCQNACKSYDQWITRKKNQWDVLSNKFKSVKNAEKVQTAGIVTPYDILKQELDEFNEVAFENEINKRDGAYIELCVCSVEEAKKNTQEVVTNVDNAAKSQATNSNPISQPVDSSKAEKVPGDSAHGNVNSGQDSSTTGKAVTGDGQNGNQTPAESDVQRSDIAESVSAKNVDPQKSVSKRSDDTASVTGIAEAGKENLGASNSRPSESTVEANSPGDDTVNSASIPVVRGENPLVTPYNGLGHSKDNSDGTAEFAESTKSAESMANPDSNSKGETGKGQDNDMAKATKDSSNSSDGTSSATGDTTDAVDREINKGVPEDRDKTVGSKDGGGEDNSANKDAATVVGEDRIRENSAGGGTNDRSKNDTEKNGASTPDSKQSEDATALSKTESLESTESGDRTTNDTTNSLENKNGGKEKDLQKHDFKSNDTPNEEPNSDQTTDAE

## 6. View all CDS information

In [23]:
for key, value in CDS_features[0].qualifiers.items():
    print(f'{key} : {value}')

function : ['erythrocyte binding domain']
experiment : ['experimental evidence, no additional details recorded']
codon_start : ['1']
product : ['microneme protein-1']
protein_id : ['AAA61770.1']
translation : ['RYQLCMKELTNFVNNTDTNFHRDITFRKLYLKWKLIYDAAVEFDLLLKLNNYRYNKVFCKDIRRSLGDFGDIIMGTDMEGIGYSEVVENNLRSFGTGEKAQQHRKQWWNESKAQIWTAMMYSVKKRLKGNFIWICKINVAVNIEPQIYRRIREWGRDYVSELPTEVQKLKEKCDGKINYTDKKVCKVPPCQNACKSYDQWITRKKNQWDVLSNKFKSVKNAEKVQTAGIVTPYDILKQELDEFNEVAFENEINKRDGAYIELCVCSVEEAKKNTQEVVTNVDNAAKSQATNSNPISQPVDSSKAEKVPGDSAHGNVNSGQDSSTTGKAVTGDGQNGNQTPAESDVQRSDIAESVSAKNVDPQKSVSKRSDDTASVTGIAEAGKENLGASNSRPSESTVEANSPGDDTVNSASIPVVRGENPLVTPYNGLGHSKDNSDGTAEFAESTKSAESMANPDSNSKGETGKGQDNDMAKATKDSSNSSDGTSSATGDTTDAVDREINKGVPEDRDKTVGSKDGGGEDNSANKDAATVVGEDRIRENSAGGGTNDRSKNDTEKNGASTPDSKQSEDATALSKTESLESTESGDRTTNDTTNSLENKNGGKEKDLQKHDFKSNDTPNEEPNSDQTTDAEGHDRDSIKNDKAERRKHMNKDTFTKNTNSHHLNSNNNLSNGKLDIKEYKYRDVKATRKNIILMSSVHKCNNNISLEYCNSVEDKISSNTCSREKSKNLCCSISDFCLNYFDVNSYEYHSCMKKEFE']
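
Every qualifier value is a list of strings, so a single value is read by taking the first element. The sketch below shows that pattern and also checks a `translation` qualifier by translating the coding sequence directly, honouring the 1-based `codon_start` offset. The 15-base sequence and its qualifiers are invented for illustration; they are not taken from the record above.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, SimpleLocation

# Hypothetical sequence and CDS, invented for illustration.
seq = Seq("ATGAAATTTGGGTAA")
cds = SeqFeature(
    SimpleLocation(0, 15, strand=1),
    type="CDS",
    qualifiers={
        "codon_start": ["1"],
        "product": ["demo protein"],
        "translation": ["MKFG"],
    },
)

# Qualifier values are lists of strings, so take the first element.
product = cds.qualifiers.get("product", ["Unknown"])[0]
codon_start = int(cds.qualifiers.get("codon_start", ["1"])[0])

# Translate the CDS ourselves, starting at the codon_start offset (1-based).
coding_seq = cds.extract(seq)[codon_start - 1:]
protein = str(coding_seq.translate(to_stop=True))

matches_annotation = protein == cds.qualifiers["translation"][0]
print(product, protein, matches_annotation)
```

Using `.get` with a default list keeps the code from failing on features that lack a given qualifier.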


# **Multiple GenBank Records**

## 1. Input data

In [36]:
from Bio import SeqIO

# Function to open and print GenBank file details
def open_genbank_file(file_path):
    with open(file_path, "r") as handle:
        for seq_record in SeqIO.parse(handle, "genbank"):
            print(f"Accession: {seq_record.id}")
            print(f"Name: {seq_record.name}")
            print(f"Description: {seq_record.description}")
            print(f"Organism: {seq_record.annotations.get('organism', 'Unknown')}")
            print(f"Sequence Length: {len(seq_record.seq)}")
            print(f"Sequence: {seq_record.seq[:50]}...")  # Print first 50 bases of the sequence
            print(f"Submission Date: {seq_record.annotations.get('date', 'Unknown')}")
            print("\nFeatures:")
            for feature in seq_record.features:
                print(f" - {feature.type} at location {feature.location}")
            print("-" * 40)

In [37]:
# Example usage
file_path = "/kaggle/input/biokom-multi/plasmodium-vivax-10-record.gb"  # Replace with the path to your GenBank file
open_genbank_file(file_path)

Accession: L23073.1
Name: PFAPNG4A
Description: Plasmodium vivax (Papua New Guinea 41-2) microneme protein-1 (erythrocyte-binding domain) gene, exon 2
Organism: Plasmodium vivax
Sequence Length: 2272
Sequence: AGATATCAATTATGTATGAAGGAACTTACGAATTTTGTAAATAATACAGA...
Submission Date: 31-JAN-1995

Features:
 - source at location [0:2272](+)
 - CDS at location [<0:>2272](+)
 - misc_feature at location [1296:1372](+)
----------------------------------------
Accession: L23075.1
Name: PFAPNG8A
Description: Plasmodium vivax (Papua New Guinea 8-1) microneme protein-1 (erythrocyte-binding domain) gene, exon 2
Organism: Plasmodium vivax
Sequence Length: 2251
Sequence: AGATATCAATTATGTATGAAGGAACTTACGAATTTGGTAAATAATACAGA...
Submission Date: 31-JAN-1995

Features:
 - source at location [0:2251](+)
 - CDS at location [<0:>2251](+)
 - misc_feature at location [1296:1351](+)
----------------------------------------
Accession: L23074.1
Name: PFAPNG7A
Description: Plasmodium vivax (Papua New Guinea 7-1) mic
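
The multi-record function uses `SeqIO.parse` rather than `SeqIO.read` because `SeqIO.read` accepts exactly one record and raises `ValueError` otherwise. The sketch below demonstrates the difference on two made-up records written to an in-memory GenBank file; the identifiers and sequences are hypothetical.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two hypothetical records, written to an in-memory GenBank "file".
records = [
    SeqRecord(Seq("ATGAAATTTGGGTAA"), id="DEMO0001.1", name="DEMO0001",
              description="demo record 1", annotations={"molecule_type": "DNA"}),
    SeqRecord(Seq("ATGCCCTAA"), id="DEMO0002.1", name="DEMO0002",
              description="demo record 2", annotations={"molecule_type": "DNA"}),
]
handle = StringIO()
SeqIO.write(records, handle, "genbank")

# SeqIO.parse yields every record in turn.
handle.seek(0)
parsed_ids = [rec.id for rec in SeqIO.parse(handle, "genbank")]

# SeqIO.read insists on exactly one record and raises ValueError otherwise.
handle.seek(0)
try:
    SeqIO.read(handle, "genbank")
    read_failed = False
except ValueError:
    read_failed = True

print(parsed_ids, read_failed)
```

`SeqIO.parse` also works on a single-record file, where it simply yields one record.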

In [38]:
# Function to parse GenBank file and extract relevant data
def genbank_to_dataframe_by_year(file_path):
    # Initialize an empty list to store the data
    data = []

    # Read the GenBank file (multiple records)
    with open(file_path, "r") as handle:
        for seq_record in SeqIO.parse(handle, "genbank"):
            # Extract the sequence's metadata
            accession = seq_record.id
            name = seq_record.name
            description = seq_record.description
            organism = seq_record.annotations.get("organism", "Unknown")
            sequence_length = len(seq_record.seq)
            sequence = str(seq_record.seq)

            # Extract the year from the 'date' annotation (the LOCUS-line date,
            # used here as the submission year)
            submission_year = pd.NA  # Missing value if the date is absent or unparsable
            date_str = seq_record.annotations.get("date")
            if date_str is not None:
                try:
                    submission_year = datetime.strptime(date_str, "%d-%b-%Y").year
                except ValueError:
                    pass  # Keep the missing value if date parsing fails

            # Append the record data
            data.append({
                "Accession": accession,
                "Name": name,
                "Description": description,
                "Organism": organism,
                "Sequence Length": sequence_length,
                "Submission Year": submission_year,
                "Sequence": sequence
            })

    # Create a DataFrame from the collected data
    df = pd.DataFrame(data)

    # Use a nullable integer dtype so that missing years sort last; mixing
    # integers with an "Unknown" string would make sort_values raise a TypeError
    df["Submission Year"] = df["Submission Year"].astype("Int64")

    # Sort the DataFrame by year
    df = df.sort_values(by="Submission Year", ascending=True)

    return df
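
The year extraction relies on the `DD-MON-YYYY` date format used in GenBank files. Below is a minimal standalone sketch of that parsing step, using a hypothetical helper named `parse_genbank_year` that is not part of Biopython. It assumes an English or C locale, since `%b` matches locale-specific month abbreviations.

```python
from datetime import datetime
from typing import Optional


def parse_genbank_year(date_str: str) -> Optional[int]:
    """Return the year of a DD-MON-YYYY date string, or None if it does not parse."""
    try:
        return datetime.strptime(date_str, "%d-%b-%Y").year
    except ValueError:
        return None


print(parse_genbank_year("31-JAN-1995"))  # 1995
print(parse_genbank_year("not a date"))   # None
```

Returning `None` rather than a placeholder string keeps the resulting column free of mixed types.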

In [40]:
# Example usage
df = genbank_to_dataframe_by_year(file_path)

# Save the DataFrame to a CSV file
df.to_csv("genbank_data_for.csv", index=False)

# Display the DataFrame
df

| | Accession | Name | Description | Organism | Sequence Length | Submission Year | Sequence |
|---|---|---|---|---|---|---|---|
| 0 | L23073.1 | PFAPNG4A | Plasmodium vivax (Papua New Guinea 41-2) micro... | Plasmodium vivax | 2272 | 1995 | AGATATCAATTATGTATGAAGGAACTTACGAATTTTGTAAATAATA... |
| 1 | L23075.1 | PFAPNG8A | Plasmodium vivax (Papua New Guinea 8-1) micron... | Plasmodium vivax | 2251 | 1995 | AGATATCAATTATGTATGAAGGAACTTACGAATTTGGTAAATAATA... |
| 2 | L23074.1 | PFAPNG7A | Plasmodium vivax (Papua New Guinea 7-1) micron... | Plasmodium vivax | 2251 | 1995 | AGATATCAATTATGTATGAAGGAACTTACGAATTTGGTAAATAATA... |
| 3 | L23072.1 | PFAPNG3A | Plasmodium vivax (Papua New Guinea 32-1) micro... | Plasmodium vivax | 2272 | 1995 | AGATATCAATTATGTATGAAGGAACTTACGAATTTGGTAAATAATA... |
| 4 | L23071.1 | PFAPNG2AA | Plasmodium vivax (Papua New Guinea 29-1) micro... | Plasmodium vivax | 2251 | 1995 | AGATATCAATTATGTATGAAGGAACTTACGAATTTGGTAAATAATA... |
| 5 | L23070.1 | PFAPNG1AB | Plasmodium vivax (Papua New Guinea 18-5) micro... | Plasmodium vivax | 2251 | 1995 | AGATATCAATTATGTATGAAGGAACTTACGAATTTGGTAAATAATA... |
| 6 | L23069.1 | PFAPNG1AA | Plasmodium vivax (Papua New Guinea 15-1) micro... | Plasmodium vivax | 2272 | 1995 | AGATATCAATTATGTATGAAGGAACTTACGAATTTGGTAAATAATA... |
| 9 | U31928.1 | PVU31928 | Plasmodium vivax strain PNG 46 merozoite surfa... | Plasmodium vivax | 381 | 1996 | ACTGAGAAGAACAAGCCGACCGTGGCAGCGGCAGATATAGTGGCAA... |
| 7 | DQ376101.1 | DQ376101 | Plasmodium vivax isolate Chesson merozoite sur... | Plasmodium vivax | 493 | 2016 | GACAGAGGTCACAACCAATGCGGTAACATCTGAAGTACAACAACAA... |
| 8 | DQ376094.1 | DQ376094 | Plasmodium vivax isolate Chesson apical membra... | Plasmodium vivax | 360 | 2016 | CATATCTCCCCCATGACATTAGCGAACCTTAAGGAAAGGTATAAAG... |
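
Once the records are in a DataFrame, a natural next step is to count how many fall in each year. The sketch below does this with pandas, using the year values copied from the table above as stand-in data. In the notebook, the same call could be applied to `df["Submission Year"]`.

```python
import pandas as pd

# Years copied from the "Submission Year" column of the table above.
years = pd.Series([1995] * 7 + [1996] + [2016] * 2, name="Submission Year")

# Count the records per year and order the result chronologically.
records_per_year = years.value_counts().sort_index()
print(records_per_year)
```

Note that the year comes from the date on the LOCUS line, which GenBank updates when a record is revised. It is therefore best read as an approximation of the submission year.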


## **Assignment**

Based on the previous exercise, you should now have the GenBank data for all COVID-19 variants. Your task is to combine the data for all variants into a single dataframe with the following columns:
1. Accession ID
2. Organism
3. Source
4. Submission Year
5. Sequence Length
6. CDS Number
7. Sequence
8. Reference Number

Finally, save the dataframe to a CSV file.
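
As a starting hint, the sketch below shows one way to derive two of the less obvious columns for a single record. It assumes that "CDS Number" means the count of `CDS` features and that "Reference Number" means the count of entries in the `references` annotation; both readings are assumptions about the assignment. The helper `summarize_record` and the demo record are hypothetical.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import Reference, SeqFeature, SimpleLocation
from Bio.SeqRecord import SeqRecord


def summarize_record(record):
    """Hypothetical helper: count the CDS features and references of one record."""
    cds_count = sum(1 for f in record.features if f.type == "CDS")
    reference_count = len(record.annotations.get("references", []))
    return {
        "Accession ID": record.id,
        "Source": record.annotations.get("source", "Unknown"),
        "CDS Number": cds_count,
        "Reference Number": reference_count,
    }


# Made-up record with one CDS feature and two empty references.
demo = SeqRecord(
    Seq("ATGAAATTTGGGTAA"),
    id="DEMO0001.1",
    annotations={"source": "synthetic construct",
                 "references": [Reference(), Reference()]},
)
demo.features = [SeqFeature(SimpleLocation(0, 15, strand=1), type="CDS")]

row = summarize_record(demo)
print(row)
```

Collecting one such dictionary per record in a list and passing the list to `pd.DataFrame` follows the same pattern as `genbank_to_dataframe_by_year`.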