## Data Source

The following files used in this project were obtained from the [INPHARED GitHub repository](https://github.com/RyanCook94/inphared?tab=readme-ov-file):

- `14Apr2025_data_excluding_refseq.tsv`
- `14Apr2025_genomes_excluding_refseq.fa`



In [1]:
import pandas as pd

In [None]:

# importing metadata for viral genomes
background_data = pd.read_csv('data/14Apr2025_data_excluding_refseq.tsv', sep='\t')

In [3]:
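# select hosts with more than 200 viral genome entries

pd.set_option('display.max_rows', None)
host_counts = background_data['Host'].value_counts()

# strict threshold: a host with exactly 200 entries is excluded
host_counts = host_counts[host_counts > 200]

# value_counts() sorts by descending count, so [1:] skips the most frequent Host value
selected_hosts = host_counts.index.tolist()[1:]
len(selected_hosts)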


22

In [12]:
filtered_df = background_data[background_data['Host'].isin(selected_hosts)]
filtered_df.shape

(20277, 27)
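
The host selection above can be tried on a toy table to see what the threshold and the `[1:]` slice do. This is a minimal sketch: the helper name `select_hosts`, its parameters, and the toy host labels are hypothetical and not part of the original notebook.

```python
import pandas as pd

def select_hosts(meta, min_count, skip_top=1):
    """Return hosts with more than min_count rows, skipping the skip_top most frequent.

    Hypothetical helper that mirrors the cell above.
    """
    counts = meta["Host"].value_counts()   # sorted by descending count
    counts = counts[counts > min_count]    # strict threshold, as in the notebook
    return counts.index.tolist()[skip_top:]

# Made-up data for illustration only
toy = pd.DataFrame(
    {"Host": ["Unspecified"] * 5 + ["Escherichia"] * 4 + ["Vibrio"] * 3 + ["Erwinia"]}
)

selected = select_hosts(toy, min_count=2)
filtered = toy[toy["Host"].isin(selected)]
print(selected, filtered.shape)
```

Here the most frequent label is dropped by the slice and the rarest one by the threshold, so only the two middle hosts remain.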

### Loading and Working with the FASTA File

The file `14Apr2025_genomes_excluding_refseq.fa` contains phage genome sequences in FASTA format.


In [5]:
# keep only records whose description contains "complete genome" or "complete sequence"

from Bio import SeqIO
from tqdm import tqdm

fasta_file = "data/14Apr2025_genomes_excluding_refseq.fa"
fasta_data = []

# total is the hard-coded number of records in this file; it only drives the progress bar
for record in tqdm(SeqIO.parse(fasta_file, "fasta"), total=28665, desc="Processing Sequences"):
    seq_id = record.id
    description = record.description
    sequence = str(record.seq)
    if "complete genome" in description or "complete sequence" in description:
        fasta_data.append({
            "id": seq_id,
            "description": description,
            "sequence": sequence,
            "length": len(sequence),
        })

fasta_df = pd.DataFrame(fasta_data)




Processing Sequences: 100%|██████████| 28665/28665 [00:08<00:00, 3580.27it/s]
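
The progress bar above relies on a hard-coded record count. One way to avoid that, sketched here with the standard library only, is to count header lines (those starting with `>`) before parsing. The helper names `count_fasta_records` and `is_complete` and the toy FASTA content are hypothetical and used only for illustration.

```python
import os
import tempfile

def count_fasta_records(path):
    """Count records by counting header lines, which start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

def is_complete(description):
    """Same substring test as the filter in the cell above."""
    return "complete genome" in description or "complete sequence" in description

# Tiny made-up FASTA file, for illustration only
toy_fasta = (
    ">A1 Toy phage one, complete genome.\nACGT\nACGT\n"
    ">A2 Toy phage two, partial genome.\nGGCC\n"
    ">A3 Toy plasmid, complete sequence.\nTTAA\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(toy_fasta)
    toy_path = tmp.name

n_records = count_fasta_records(toy_path)

# Header text without the leading '>'
headers = [line[1:].strip() for line in toy_fasta.splitlines() if line.startswith(">")]
n_complete = sum(is_complete(h) for h in headers)

os.remove(toy_path)
print(n_records, n_complete)
```

In the notebook, the value returned by `count_fasta_records(fasta_file)` could be passed as `total=` to `tqdm` instead of the literal number.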


### Matching Accessions and Reducing FASTA Entries

To keep the metadata and the genome sequences consistent, the `Accession` values in `14Apr2025_data_excluding_refseq.tsv` were matched against the record IDs in the FASTA file, and each table was reduced to the entries present in both.



In [13]:
filtered_df.shape, fasta_df.shape

((20277, 27), (25075, 4))

In [None]:
# drop columns with little variability or relevance

filtered_df = filtered_df.drop(columns=[
    'Jumbophage',
    'Modification Date',
    'Low Coding Capacity Warning',
    'Genbank Division',
    'Isolation Host (beware inconsistent and nonsense values)',
])

In [15]:
filtered_df.columns

Index(['Accession', 'Description', 'Classification', 'Genome Length (bp)',
       'molGC (%)', 'Molecule', 'Number CDS', 'Positive Strand (%)',
       'Negative Strand (%)', 'Coding Capacity(%)', 'tRNAs', 'Host',
       'Lowest Taxa', 'Genus', 'Sub-family', 'Family', 'Order', 'Class',
       'Phylum', 'Kingdom', 'Realm', 'Baltimore Group'],
      dtype='object')

In [61]:
fasta_df.head(1)

Unnamed: 0,id,description,sequence,length
0,PP099880,"PP099880 Erwinia phage RH-42-1, complete genome.",GGGGATACGTGCCCCTCCACCGCCACCCGCACCCCCTACCAAAATT...,14942


In [16]:
# keep only FASTA records whose id appears in the metadata

fasta_df = fasta_df[fasta_df['id'].isin(filtered_df['Accession'])]
fasta_df.shape

(17424, 4)

In [17]:
# keep only metadata rows whose Accession appears in the remaining FASTA records

filtered_df = filtered_df[filtered_df['Accession'].isin(fasta_df['id'])]
filtered_df.shape, fasta_df.shape

((17424, 22), (17424, 4))
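
Equal row counts are consistent with a one-to-one match but do not prove it on their own. A minimal sketch of an explicit check, using toy frames with made-up accessions (the variable names are hypothetical), intersects the two ID sets and then verifies that the IDs agree and are unique:

```python
import pandas as pd

# Made-up stand-ins for the metadata and FASTA tables
meta = pd.DataFrame({"Accession": ["A1", "A2", "A3"], "Host": ["x", "y", "z"]})
seqs = pd.DataFrame({"id": ["A2", "A3", "A4"], "length": [10, 20, 30]})

# IDs present in both tables
shared = set(meta["Accession"]) & set(seqs["id"])

meta_matched = meta[meta["Accession"].isin(shared)]
seqs_matched = seqs[seqs["id"].isin(shared)]

# Both tables should now cover exactly the same IDs, each appearing once
ids_agree = set(meta_matched["Accession"]) == set(seqs_matched["id"])
no_duplicates = meta_matched["Accession"].is_unique and seqs_matched["id"].is_unique
print(meta_matched.shape, seqs_matched.shape, ids_agree, no_duplicates)
```

Filtering both tables against the intersection gives the same result as the two sequential `isin` steps above, and the uniqueness check guards against duplicated accessions that matching row counts could hide.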

In [None]:
filtered_df.to_csv('data/filtered_meta_data.csv', index=False)
fasta_df.to_csv('data/filtered_fasta_data.csv', index=False)

In [84]:
fasta_df.head(1)

Unnamed: 0,id,description,sequence,length
2,PQ614275,"PQ614275 Vibrio phage vB_VhaS_R31B, complete g...",ATGGTAGCTGTAAACAATAATGTAGTAGGTACAACGGAACAACCGT...,58583
