# Mammal Ensemble Data Processing

This notebook creates a language model dataset from an ensemble of mammalian genomes. Specifically, four primate genomes are used:
  * [Homo sapiens (human)](https://www.ncbi.nlm.nih.gov/genome/51)
  * [Pan troglodytes (chimpanzee)](https://www.ncbi.nlm.nih.gov/genome/202?genome_assembly_id=276759)
  * [Pan paniscus (bonobo, formerly called pygmy chimpanzee)](https://www.ncbi.nlm.nih.gov/genome/10729?genome_assembly_id=249283)
  * [Gorilla gorilla gorilla (western lowland gorilla)](https://www.ncbi.nlm.nih.gov/genome/2156?genome_assembly_id=291477)


The dataset is intended for unsupervised language-model training, so each row is simply a chunk of raw genome sequence text; no labels are required.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai import *
from fastai.text import *
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import FeatureLocation, CompoundLocation
import networkx as nx

In [3]:
sys.path.append("../../..")
from utils import *

In [4]:
path = Path('F:/genome/mammals/')

# Genome Processing

# Human

In [5]:
fname = 'GCF_000001405.38_GRCh38.p12_genomic.fna'

In [6]:
data = process_fasta(path/fname, 10000, 2000, filter_txt='NC_')

In [7]:
df_human = pd.DataFrame(data, columns=['Sequence'])
df_human['Source'] = 'NCBI Human'

In [8]:
df_human.shape

(1465634, 2)
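
`process_fasta` comes from the local `utils` module and its body is not shown in this notebook, so the meaning of the two numeric arguments is not documented here. The sketch below is one plausible reading, not the actual helper. It assumes the first number is a window size used to drop stretches containing non-ACGT characters such as `N`, the second is the length of each output row, and `filter_txt` keeps only records whose ID contains that text (`NC_` is the RefSeq accession prefix for assembled chromosomes). The names `read_fasta_records` and `chunk_genome_sketch` are hypothetical, and a tiny in-memory FASTA stands in for the real file.

```python
import io

def read_fasta_records(handle):
    """Yield (record_id, sequence) pairs from a FASTA text handle."""
    rec_id, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if rec_id is not None:
                yield rec_id, ''.join(parts)
            # The record ID is the first token after '>'
            rec_id, parts = line[1:].split()[0], []
        else:
            parts.append(line)
    if rec_id is not None:
        yield rec_id, ''.join(parts)

def chunk_genome_sketch(handle, filter_len, out_len, filter_txt=None):
    # Hypothetical stand-in for process_fasta; the argument meanings are assumed.
    seqs = [seq.upper() for rec_id, seq in read_fasta_records(handle)
            if filter_txt is None or filter_txt in rec_id]
    genome = ''.join(seqs)
    # Keep only filter_len-sized windows made entirely of A, C, G, T
    clean = ''.join(genome[i:i + filter_len]
                    for i in range(0, len(genome), filter_len)
                    if set(genome[i:i + filter_len]) <= set('ACGT'))
    # Cut the cleaned sequence into fixed-length rows (the last may be shorter)
    return [clean[i:i + out_len] for i in range(0, len(clean), out_len)]

toy_fasta = io.StringIO(
    ">NC_000001.1 toy chromosome\n"
    "ACGTACGTNNNNACGT\n"
    ">NW_000002.1 toy scaffold\n"
    "TTTTTTTT\n"
)
rows = chunk_genome_sketch(toy_fasta, filter_len=4, out_len=6, filter_txt='NC_')
print(rows)  # ['ACGTAC', 'GTACGT']
```

In the toy input the `NW_` scaffold is skipped by the filter and the `NNNN` window is dropped, which is the kind of cleanup a call like `process_fasta(path/fname, 10000, 2000, filter_txt='NC_')` might perform under these assumptions.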

# Chimp

In [9]:
fname = 'GCF_000001515.7_Pan_tro_3.0_genomic.fna'

In [10]:
data = process_fasta(path/fname, 10000, 2000, filter_txt='NC_')

In [11]:
df_c1 = pd.DataFrame(data, columns=['Sequence'])
df_c1['Source'] = 'NCBI Pan Troglodytes'

In [12]:
df_c1.shape

(1353383, 2)

# Bonobo (Chimp 2)

In [13]:
fname = 'GCF_000258655.2_panpan1.1_genomic.fna'

In [14]:
data = process_fasta(path/fname, 10000, 2000, filter_txt='NC_')

In [15]:
df_c2 = pd.DataFrame(data, columns=['Sequence'])
df_c2['Source'] = 'NCBI Pan paniscus'

In [16]:
df_c2.shape

(954992, 2)

# Gorilla

In [17]:
fname = 'GCF_000151905.2_gorGor4_genomic.fna'

In [18]:
data = process_fasta(path/fname, 10000, 2000, filter_txt='NC_')

In [19]:
df_g = pd.DataFrame(data, columns=['Sequence'])
df_g['Source'] = 'NCBI Gorilla gorilla gorilla'

In [20]:
df_g.shape

(987197, 2)

# Concat

In [21]:
def partition_data(df):
    # Positional split: the first 99% of rows go to train, the last 1% to valid
    train_size = int(len(df)*.99)

    # .copy() makes each partition an independent DataFrame, so adding the
    # 'set' column below does not raise pandas' SettingWithCopyWarning
    train_df = df[:train_size].copy()
    valid_df = df[train_size:].copy()

    train_df['set'] = 'train'
    valid_df['set'] = 'valid'

    return (train_df, valid_df)

In [22]:
h_t, h_v = partition_data(df_human)
c1_t, c1_v = partition_data(df_c1)
c2_t, c2_v = partition_data(df_c2)
g_t, g_v = partition_data(df_g)

In [23]:
mammal_train = pd.concat([h_t, c1_t, c2_t, g_t])
mammal_val = pd.concat([h_v, c1_v, c2_v, g_v])

In [24]:
mammal_train.shape, mammal_val.shape

((4713593, 3), (47613, 3))
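
As a sanity check, the combined shapes can be reproduced from the per-genome row counts reported earlier in the notebook. The short sketch below only redoes the arithmetic of the 99/1 split; the dictionary keys are illustrative labels, not names used elsewhere in the notebook.

```python
# Row counts taken from the .shape outputs of the four genome DataFrames
row_counts = {
    'human': 1465634,
    'chimpanzee': 1353383,
    'bonobo': 954992,
    'gorilla': 987197,
}

# partition_data keeps int(len(df) * .99) rows per genome for training
train_rows = sum(int(n * .99) for n in row_counts.values())
valid_rows = sum(row_counts.values()) - train_rows

print(train_rows, valid_rows)  # 4713593 47613
```

Because the split is applied per genome before concatenation, each species contributes roughly 1% of its own rows to the validation set.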

In [25]:
mammal_train.to_csv(path/'mammal_train.csv', index=False)
mammal_val.to_csv(path/'mammal_val.csv', index=False)
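
The training CSV holds several million rows, so reloading it entirely just to confirm the write succeeded can be slow. A minimal sketch of a lighter check, assuming the `Sequence`, `Source`, and `set` columns written above: stream the file with the standard-library `csv` module and count rows per source and split. The function name `count_rows_by_source` and the toy file are hypothetical and used only for illustration.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

def count_rows_by_source(csv_path):
    """Stream a CSV and count rows per (Source, set) pair without loading it all."""
    counts = Counter()
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            counts[(row['Source'], row['set'])] += 1
    return counts

# Toy file with the same column layout as mammal_train.csv
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / 'toy_train.csv'
    with open(toy_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Sequence', 'Source', 'set'])
        writer.writerow(['ACGT', 'NCBI Human', 'train'])
        writer.writerow(['TTGA', 'NCBI Human', 'train'])
        writer.writerow(['GGCC', 'NCBI Gorilla gorilla gorilla', 'train'])
    toy_counts = count_rows_by_source(toy_path)

print(toy_counts[('NCBI Human', 'train')])  # 2
```

On the real files, the per-source counts should sum to the shapes reported for `mammal_train` and `mammal_val`.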