# Bacterial Ensemble Data Processing

This notebook builds a language model dataset from an ensemble of bacterial genomes. The dataset is intended for unsupervised learning, so each example is simply raw genome sequence text with no labels.

#### Data Source
All genomes were downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/genome/browse#!/overview/).

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import os
import sys
from pathlib import Path

import pandas as pd
import networkx as nx
from fastai import *
from fastai.text import *
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import FeatureLocation, CompoundLocation

In [3]:
sys.path.append("../../..")
from utils import *

In [4]:
path = Path('F:/genome/bacterial genomes/')

Genome files used:

In [5]:
os.listdir(path/'genome_fastas')

['Bacillus andreraoultii.fna',
 'Bacillus cereus.fna',
 'bacillus sp 2b10.fna',
 'Bacillus sp EGD-AK10.fna',
 'Bacillus sp FJAT-28004.fna',
 'Bacillus sp L_1B0_5.fna',
 'Bacillus sp MBGLi97.fna',
 'Bacillus sp MYb78.fna',
 'Bacillus sp OV166.fna',
 'Bacillus thuringiensis.fna',
 'Clostridium sp ASF502.fna',
 'Clostridium sp DL-VIII.fna',
 'Corynebacterium diphtheriae.fna',
 'Corynebacterium sp 13CS0277.fna',
 'Corynebacterium sp CNJ-954.fna',
 'Corynebacterium sp J010B-136.fna',
 'Corynebacterium sp JB4.fna',
 'Corynebacterium sp Marseille-P2417.fna',
 'Corynebacterium sp YIM 101343.fna',
 'Corynebacterium striatum.fna',
 'Escherichia albertii.fna',
 'Escherichia coli.fna',
 'Escherichia fergusonii.fna',
 'Escherichia marmotae.fna',
 'Escherichia sp KTE172.fna',
 'Escherichia sp MOD1-EC2449.fna',
 'Escherichia sp MOD1-EC4516.fna',
 'Escherichia sp MOD1-EC4560.fna',
 'Escherichia sp R11.fna',
 'Escherichia sp R14.fna',
 'Escherichia sp R15.fna',
 'Escherichia sp r18.fna',
 'Escherichia 

In [6]:
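valid_pct = 0.1  # fraction of each genome's chunks held out for validation
dfs_trn = []
dfs_val = []
for file in os.listdir(path/'genome_fastas'):
    # Use the file name without its extension as the source label
    source = file.split('.')[0]
    
    # Split the genome into sequence chunks (process_fasta comes from utils)
    data = process_fasta(path/'genome_fastas'/file, 2000, 900)
    
    df = pd.DataFrame(data, columns=['Sequence'])
    df['Source'] = source
    # Sequential split: the leading ~90% of chunks train, the tail validates
    cut = int((1-valid_pct) * len(df)) + 1
    train_df, valid_df = df[:cut], df[cut:]
    dfs_trn.append(train_df)
    dfs_val.append(valid_df)

# Labeled copies of each split
df_trn = pd.concat(dfs_trn)
df_trn['set'] = 'train'
df_val = pd.concat(dfs_val)
df_val['set'] = 'valid'

# data_df is built from the per-file frames, so it holds only Sequence and Source
# (the 'set' labels above are not written to the CSV); training rows come first
data_df = pd.concat(dfs_trn+dfs_val)
data_df.reset_index(inplace=True, drop=True)
data_df.to_csv(path/'bacterial_data.csv', index=False)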
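
The cell above relies on `process_fasta` from the local `utils` module, which is not shown in this notebook. The following is a minimal sketch of how such a helper could work, not the actual implementation. It assumes (unconfirmed by this notebook) that the arguments `2000` and `900` are a filtering window size and an output chunk length: windows containing anything other than A, C, G or T are dropped, and the remaining sequence is cut into fixed-length chunks. The names `read_fasta_sequences` and `chunk_genome` are hypothetical, and the FASTA parsing uses only the standard library rather than Biopython.

```python
from typing import Iterable, List


def read_fasta_sequences(lines: Iterable[str]) -> List[str]:
    """Return each FASTA record's sequence as one uppercase string."""
    records, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A header line closes the previous record, if any
            if current:
                records.append(''.join(current))
                current = []
        else:
            current.append(line.upper())
    if current:
        records.append(''.join(current))
    return records


def chunk_genome(lines: Iterable[str], filter_size: int, chunk_size: int) -> List[str]:
    """Hypothetical stand-in for process_fasta (assumed behavior, see text)."""
    genome = ''.join(read_fasta_sequences(lines))
    # Keep only windows made up entirely of unambiguous bases
    clean = ''.join(
        genome[i:i + filter_size]
        for i in range(0, len(genome), filter_size)
        if set(genome[i:i + filter_size]) <= set('ACGT')
    )
    # Cut the cleaned sequence into chunks; the last chunk may be shorter
    return [clean[i:i + chunk_size] for i in range(0, len(clean), chunk_size)]


example = ['>rec1', 'ACGTAC', 'GTNN', '>rec2', 'acgt']
chunks = chunk_genome(example, filter_size=4, chunk_size=3)
print(chunks)
```

Under these assumptions, a real file could be processed with `with open(fasta_path) as f: chunk_genome(f, 2000, 900)`, where `fasta_path` is a placeholder for one of the `.fna` files listed earlier.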

In [7]:
data_df.head()

| | Sequence | Source |
|---|---|---|
| 0 | AGACGCTCTATCCAATTGAGCTACGGGCGCATATAAATGGTGCCGA... | Bacillus andreraoultii |
| 1 | TATAGGAATTGTATTTACGGGATTTCCGCATAAATTTTACACATTT... | Bacillus andreraoultii |
| 2 | AAGTCAATGATTATCTTCCAACGAAAGTCCGGGTTTTATCGTCTAT... | Bacillus andreraoultii |
| 3 | CATGAGCTAGCGAAATCGCACTTTCGAGTAGAACGTGAACAGACGT... | Bacillus andreraoultii |
| 4 | TAAATGGTTTAATTAACTATAACATACTTGACCTTGCGAAAAAAAC... | Bacillus andreraoultii |


In [8]:
data_df.shape

(371831, 2)
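
The row count above combines training and validation rows from every genome. To see how the per-file split in cell 6 divides one file's chunks, the `cut` arithmetic can be reproduced on a plain list. This is an illustrative sketch only: `split_index` and `split_rows` are hypothetical helper names, and the 25-element list stands in for one genome's chunks.

```python
from typing import List, Tuple


def split_index(n_rows: int, valid_pct: float) -> int:
    """Index of the first validation row, mirroring the cut used in cell 6."""
    return int((1 - valid_pct) * n_rows) + 1


def split_rows(rows: List[str], valid_pct: float = 0.1) -> Tuple[List[str], List[str]]:
    """Sequential split: leading rows train, trailing rows validate."""
    cut = split_index(len(rows), valid_pct)
    return rows[:cut], rows[cut:]


rows = [f"chunk_{i}" for i in range(25)]
train, valid = split_rows(rows)
print(len(train), len(valid))  # 23 2
```

Because the split is positional, each genome's validation rows are the chunks at the end of its sequence. The `+ 1` also means a file that yields a single chunk contributes no validation rows.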