# Create a Database

Create a database that connects GenBank data to MP3 pathogenicity output. Please see final_project.ipynb for more information. GenBank data can be downloaded from a RAST annotation of a genome and then run through MP3.  
The file structure should be:
- ./
    - create_database.ipynb
    - Groups
        - Group1
            - Group1.faa.Hybrid.result
            - Group1.gbk
        - Group2
            - Group2.faa.Hybrid.result
            - Group2.gbk
    - database.pkl (output; named salmonella_database.pkl in the code below)
    
The term "Groups" comes from the class project. The code may be adjusted for different file names and directories.
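
Before running the extraction cells, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the original notebook: the helper names `expected_files` and `missing_files` are hypothetical, and the sketch assumes the `GroupN/GroupN.gbk` and `GroupN/GroupN.faa.Hybrid.result` naming shown in the listing.

```python
from pathlib import Path


def expected_files(groups_directory, groups):
    """Return the GenBank and MP3 result paths expected for each group number."""
    paths = []
    for group in groups:
        group_dir = Path(groups_directory) / f"Group{group}"
        paths.append(group_dir / f"Group{group}.gbk")
        paths.append(group_dir / f"Group{group}.faa.Hybrid.result")
    return paths


def missing_files(groups_directory, groups):
    """Return the expected paths that are not present on disk."""
    return [p for p in expected_files(groups_directory, groups) if not p.is_file()]


# Report anything missing for the two groups shown in the listing above
for path in missing_files("./Groups", [1, 2]):
    print("missing:", path)
```

An empty result means every expected input file was found; otherwise the printed paths show which files to add or rename.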

### Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from Bio import SeqIO
from collections import Counter

### Extract Data

In [2]:
# Extract genbank data and put into dictionary for recovery later
groups_directory = "./Groups/"
database_directory = "./"
database_name = "salmonella_database.pkl"
groups = [
    1, 3, 4, 5, 6, 7, 
    8, 9, 10, 11, 12, 
    14, 15
]
genbank = {}
for group in groups:
    file_name = "Group" + str(group) + ".gbk"
    file_directory = groups_directory + "Group" + str(group) + "/"
    for seq_record in SeqIO.parse(file_directory + file_name, "genbank"):
        organism = seq_record.annotations["organism"]
        if len(seq_record.features) > 1:
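            for feature in seq_record.features[1:]:
                # Check the feature type first so that non-CDS features
                # without a db_xref qualifier do not raise a KeyError
                if feature.type != 'CDS':
                    continue
                # Drop the 5-character prefix of the db_xref value so the
                # ID can be matched against the MP3 sequence names
                name = feature.qualifiers["db_xref"][0][5:]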
                protein_sequence = feature.qualifiers["translation"][0]
                protein_name = feature.qualifiers["product"][0]
                genbank[name] = [protein_sequence, protein_name, organism]


In [3]:
# Parse mp3 output data
columns = ["Group", "Sr._No.", "Sequence_Name", "Type_of_Pfam_domains", 
           "HMM_Prediction", "SVM_Score", 
           "SVM_prediction", "Hybrid_Prediction", "Assignment", 
           "Sequence", "Product", "Organism"]
data = []
for group in groups:
    file_name = "Group" + str(group) + ".faa.Hybrid.result"
    file_directory = groups_directory + "Group" + str(group) + "/"
    with open(file_directory + file_name, "r") as handle:
        lines = handle.readlines()[1:]
        for line in lines:
            if len(line) == 1:
                continue
            l = line.split('\t')
            # Keep only complete rows containing all eight MP3 fields
            if len(l) == 8:
                l = [field.strip() for field in l]
                l[4] = float(l[4])  # SVM_Score
                l[0] = int(float(l[0]))  # Sr._No.
                # add in genbank data
                l.extend(genbank[l[1]])
                # add group number field
                l.insert(0, group)
                data.append(l)


### Create Dataframe
We chose a pandas DataFrame for our database because it is easy to implement. In the future, we could implement an SQL relational database, which would help link our entries.
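
As a rough illustration of the relational idea, the sketch below uses Python's built-in `sqlite3` module to keep GenBank protein records and MP3 predictions in two tables linked by sequence name. The table names, the reduced set of columns, and the inserted rows are all hypothetical placeholders chosen for illustration; this is not the project's actual schema or data.

```python
import sqlite3


def build_schema(conn):
    """Create two linked tables: GenBank proteins and MP3 predictions."""
    conn.executescript(
        """
        CREATE TABLE proteins (
            sequence_name TEXT PRIMARY KEY,
            sequence TEXT,
            product TEXT,
            organism TEXT
        );
        CREATE TABLE predictions (
            group_number INTEGER,
            sequence_name TEXT REFERENCES proteins(sequence_name),
            svm_score REAL,
            hybrid_prediction TEXT
        );
        """
    )


conn = sqlite3.connect(":memory:")
build_schema(conn)

# Placeholder rows, invented purely to exercise the schema
conn.execute(
    "INSERT INTO proteins VALUES (?, ?, ?, ?)",
    ("example_id", "MKV", "example product", "example organism"),
)
conn.execute(
    "INSERT INTO predictions VALUES (?, ?, ?, ?)",
    (1, "example_id", 0.5, "example label"),
)

# Join each prediction to its protein record through the shared sequence name
rows = conn.execute(
    "SELECT p.product, r.svm_score "
    "FROM predictions r JOIN proteins p ON p.sequence_name = r.sequence_name"
).fetchall()
print(rows)
```

Storing each protein once and referencing it from the predictions table avoids repeating the sequence, product, and organism on every prediction row, which is the kind of linking a flat DataFrame does not provide.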

In [4]:
df = pd.DataFrame(data=data, columns=columns)

Save the DataFrame to a pickle file so it can be loaded quickly later without the original input files.

In [5]:
df.to_pickle(database_directory + database_name)
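
The saved file can later be restored with `pd.read_pickle`. The sketch below shows the round trip and a simple filter on a tiny stand-in DataFrame; the values and the file name `example_database.pkl` are invented placeholders, and only the `Group` and `SVM_Score` column names are taken from the notebook.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the real DataFrame; the values are invented placeholders
df_example = pd.DataFrame({"Group": [1, 1, 3], "SVM_Score": [0.5, -1.2, 0.9]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_database.pkl")  # hypothetical file name
    df_example.to_pickle(path)
    reloaded = pd.read_pickle(path)

# Boolean-mask filtering: keep rows from group 1 with a positive SVM score
subset = reloaded[(reloaded["Group"] == 1) & (reloaded["SVM_Score"] > 0)]
print(len(subset))
```

Pickle files can execute arbitrary code when loaded, so `pd.read_pickle` should only be used on files from a trusted source.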

Please see analyze_database.ipynb for querying examples and analysis tools.