#**Genome/*Contig* retrieval and Inspection**

##NCBI GenomeID retrieval

In [None]:
import pandas as pd
from Bio import Entrez

main = pd.read_csv("FULL_results.csv")
main = main.sort_values(
    by=["E-value", "P(H)", "%aln"],
    ascending=[True, False, True]  # Specify sorting order
)

protein_ids = main["GenBankID"].to_list()
Entrez.email = "juancarlos.ramirezm@estudiante.uam.es"

def fetch_genomic_accession(protein_ids):
    genomic_accessions = {}
    for protein_id in protein_ids:
        try:
            handle = Entrez.efetch(db="protein", id=protein_id, rettype="gp", retmode="xml")
            records = Entrez.read(handle)
            handle.close()

            # Navigate to the 'GBSeq_feature-table' to find the genomic accession
            for feature in records[0]['GBSeq_feature-table']:
                if feature['GBFeature_key'] == 'CDS':
                    for qualifier in feature['GBFeature_quals']:
                        if qualifier['GBQualifier_name'] == 'coded_by':
                            # Extracting the genomic accession
                            coded_by = qualifier['GBQualifier_value']
                            genomic_acc = coded_by.split(":")[0]
                            genomic_accessions[protein_id] = genomic_acc
                            print(f"Genomic Accession for {protein_id}: {genomic_acc}")
                            break
        except Exception as e:
            print(f"Error fetching data for {protein_id}: {e}")
            genomic_accessions[protein_id] = protein_id
    return genomic_accessions

# Retrieve and print genomic accessions
genomic_accessions = fetch_genomic_accession(protein_ids)
for protein, genome in genomic_accessions.items():
    print(f"Protein ID: {protein}, Genomic Accession: {genome}")

The last 5000 lines of the streaming output have been truncated.
Genomic Accession for DAJ64125.1: BK043702.1
Genomic Accession for MCC5410459.1: JAJHOF010000423.1
Genomic Accession for HQL66766.1: DAOLPE010000010.1
Genomic Accession for DAM92081.1: BK052906.1
Genomic Accession for DAI26734.1: complement(BK030368.1
Genomic Accession for BCU98189.1: complement(LC629474.1
Error fetching data for BK014853: HTTP Error 400: Bad Request
Error fetching data for PCCC01000030: HTTP Error 400: Bad Request
Error fetching data for QXJB01000015: HTTP Error 400: Bad Request
Genomic Accession for MBQ5630483.1: complement(JAFNWL010000145.1
Genomic Accession for YP_010756110.1: NC_073482.1
Genomic Accession for DAM57470.1: complement(BK052217.1
Genomic Accession for DAL72410.1: complement(BK021178.1
Genomic Accession for GIS09147.1: BOZZ01000004.1
Genomic Accession for BCU94434.1: LC629459.1
Error fetching data for KF268200: HTTP Error 400: Bad Request
Genomic Accession for MFJ8508641.1: JBIW
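
The output above shows that splitting a `coded_by` value on `:` alone leaves the `complement(` prefix attached for reverse-strand features. A minimal sketch of a more tolerant parser follows. The helper name `parse_coded_by` is hypothetical, the coordinates in the usage lines are made up for illustration, and `join(...)` locations are assumed to use a single accession for all segments.

```python
import re

# An accession (optionally versioned) followed by the ":" that introduces the coordinates.
_ACCESSION_RE = re.compile(r"([A-Za-z0-9_]+(?:\.\d+)?):")


def parse_coded_by(coded_by):
    """Return (accession, strand) from a GenPept 'coded_by' qualifier value."""
    # A leading "complement(" marks a feature on the reverse strand.
    strand = "-" if coded_by.startswith("complement(") else "+"
    match = _ACCESSION_RE.search(coded_by)
    if match is None:
        raise ValueError(f"No accession found in {coded_by!r}")
    return match.group(1), strand


# Illustrative inputs; the coordinates are invented.
print(parse_coded_by("BK043702.1:1..900"))
print(parse_coded_by("complement(BK030368.1:100..900)"))
```

Returning the strand alongside the accession keeps information that the plain split discards, which may be useful when the coding region is later extracted from the contig.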

In [None]:
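# Map each protein ID to the genomic accession retrieved above
main["GenomeID"] = main["GenBankID"].apply(lambda x: genomic_accessions.get(x, None))

# Strip the "complement(" prefix left on reverse-strand accessions.
# .loc assigns in a single step and avoids the chained-assignment warning.
is_complement = main["GenomeID"].astype(str).str.contains("complement", regex=False)
for genbank_id in main.loc[is_complement, "GenBankID"]:
  print(genbank_id)
main.loc[is_complement, "GenomeID"] = main.loc[is_complement, "GenomeID"].str.split("(").str[-1]

# Rows without a GenBankID cannot have a GenomeID
main.loc[main["GenBankID"].isna() | (main["GenBankID"] == ""), "GenomeID"] = None
main.to_csv("FULL_results+GenomeID.csv", index=False)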
main

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  main["GenomeID"][i] = main["GenBankID"][i].split("(")[-1]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  main[

EAK6855417.1
MCX2201125.1
MDU5307243.1
HCC1440778.1
HBH6004542.1
HDL2032282.1
MRC81627.1
MED3302581.1
KQB17995.1
HET8685922.1
HET8688091.1
NP_046315.1
QSV39475.1
AP_000051.1
XIF84270.1
AAW33116.1
QOV03172.1
AZI15564.1
QOV03173.1
UZE90035.1
ACZ92149.2
ABH01045.1
WRQ19809.1
ALB78183.1
WRQ19827.1
WRQ19845.1
NP_044190.1
NP_861851.1
AAA42522.1
BAA76963.1
HRC96190.1
MCI7601928.1
MCI7633214.1
UWI18491.1
UVY52706.1
UWI36984.1
DAW91747.1
UWD67627.1
UVY45549.1
APD78427.1
UZF96919.1
NP_047386.1
MBQ5630483.1
MBP5362725.1
NP_108659.1
KAH0372502.1
YP_009272543.1
YP_068061.1
YP_010796273.1
MBQ2167681.1
MBO7714060.1
MBO7241797.1
MBO7696524.1
MBO7715537.1
MBO7697041.1
MBO7694246.1
MBO7712828.1
MBO7712864.1
YP_009162347.1
YP_009414576.1
XGR28300.1
AGT76236.1
KAF8900737.1
YP_007346999.1
NP_062436.1
NP_064287.1
NP_659516.1
NP_040518.2
UAW96080.1
YP_094033.1
NP_015538.1
NP_040854.1
AGT75466.1
AGT76449.1
NP_044703.1
NP_077390.1
MFC7663686.1
QJP03672.1
AAS10433.1
BCU98190.1
BCV08106.1
BCU94968.1
BCU96511.1
B

QIJ58559.1
YP_010790601.1
WZB38155.1
QRV11652.1
QRV11644.1
QYW15027.1
WXG22732.1
YP_004009772.1
YP_004010294.1
YP_004009519.1
QUS52947.1
YP_009047095.1
YP_009051655.1
YP_003933582.1
YP_003969617.1
YP_007675246.1
YP_004123739.1
YP_009032608.1
YP_007518314.1
ADZ39805.1
YP_006489170.1
AEK79911.1
YP_007501177.1
KAI0341762.1
BBE29335.1
AFD22004.1
YP_006383557.1
YP_007010812.1
AFN37562.1
YP_007235979.1
YP_009217541.1
YP_009783144.1
YP_007517891.1
QZW33688.1
YP_008060104.1
YP_009008137.1
AGU01647.1
BAR30536.1
YP_008719821.1
YP_008719853.1
YP_009211588.1
YP_009005429.1
AIX42222.1
YP_004414801.1
YP_004935932.1
YP_009047156.1
YP_009099867.1
YP_009213519.1
YP_009109574.1
YP_009147558.1
YP_009112717.1
YP_009190323.1
YP_009162589.1
AMQ65969.1
YP_009203885.1
YP_009505687.1
YP_009704123.1
YP_009174146.1
ALE30461.1
YP_009211276.1
YP_009197882.1
YP_009198013.1
OCC01058.1
YP_009272876.1
YP_009272923.1
YP_009324168.1
YP_009595607.1
YP_009279999.1
OPZ86633.1
MDC0889085.1
MDC3384680.1
MBC95972.1
MDA9686287

MBO06619.1
MBC8429883.1
MAE62671.1
MEO1999289.1
MBM4030439.1
HUT37491.1
MBI4243693.1
MAE56250.1
MDA9007811.1
MAZ07533.1
MBA94843.1
MBB00706.1
NBS07768.1
MCC5848332.1
MCX6894949.1
MAD26109.1
MAG49942.1
MAG48589.1
MAG49771.1
HIJ10602.1
MAG49323.1
RLA61615.1
MDW7641723.1
MDW7641338.1
HCC73139.1
PHS22110.1
MBR5434462.1
MDX9694602.1
MEG1902988.1
MBO5810611.1
MDH7601173.1
HEY3377837.1
YP_010790712.1
MBQ3416088.1
MBR4003570.1
MBO5715289.1
NDG53439.1
YP_009620045.1
ATV46198.1
HIB83020.1
QGX41978.1
MDE2101029.1
HEV2278975.1
HZS07951.1
MBN2451868.1
MBN2451608.1
MDP7366166.1
MDP7367034.1
MDP7368567.1
MDP7365881.1
HEY6021524.1
MDP7365914.1
PQE22120.1
AUR86065.1
HYT41710.1
YP_010090862.1
MCK9416602.1
YP_010097843.1
CAB5221146.1
CAB4141269.1
CAB4162756.1
CAB4168152.1
CAB4126602.1
CAB5221867.1
CAB4170673.1
MBN1902360.1
MCM1266094.1
MEG0367343.1
DAC64040.1
MEK9696010.1
MEK9695742.1
MEK9694978.1
YP_010095095.1
XBY87759.1
YP_009508601.1
YP_009373239.1
HIF37435.1
YP_009838143.1
HWM26375.1
TXG86101.1
MCF7

QEJ80752.1
YP_010799986.1
QFR55786.1
YP_009910606.1
QHR77492.1
HIH06061.1
QGZ10481.1
UVZ42948.1
QGT54262.1
NP_073686.1
YP_009854086.1
HYM34099.1
HUT90058.1
MBT4995142.1
YP_010107304.1
MBF0875049.1
QJC19248.1
YP_010798521.1
QJT71872.1
MCK5020812.1
QJT70022.1
QPI17848.1
MBR4431869.1
MBP5422379.1
YP_010796998.1
QOI69101.1
QQM13896.1
HTB82130.1
YP_009389491.1
QOI66645.1
YP_010113748.1
QOJ53941.1
YP_010670009.1
YP_010669417.1
QPX65093.1
QQD36934.1
QQG32119.1
YP_010114692.1
NBP58505.1
NBP55005.1
NBP55030.1
HUT45556.1
QTH80085.2
DAF88746.1
DAF85828.1
DAD78869.1
DAD84233.1
DAD72321.1
DAF56532.1
DAF64363.1
DAJ41877.1
DAM31263.1
QZI94395.1
DAU03933.1
DAF79701.1
DAQ51055.1
QHJ78973.1
DAU74324.1
DAP83352.1
DAQ29212.1
DAI87465.1
DAL41398.1
DAR80535.1
DAS06867.1
DAV04765.1
DAT57027.1
DAR03116.1
DAQ54201.1
DAJ26306.1
DAK11991.1
DAS18361.1
DAN45936.1
DAT96603.1
DAU65350.1
DAW80596.1
DAX13020.1
DAQ47148.1
DAM29426.1
DAK66860.1
DAM57470.1
DAL21706.1
DAI10712.1
DAU75569.1
DAV45373.1
DAS03720.1
DAN40356.1

DAR33301.1
DAP14008.1
DAL90500.1
DAT87944.1
DAU86428.1
DAE52150.1
DAQ18774.1
BCU93954.1
DAO80396.1
DAS78354.1
DAS48661.1
DAJ38996.1
DAN98500.1
DAI26734.1
DAP95445.1
DAF15172.1
DAJ56772.1
DAY33806.1
DAV25782.1
DAP71713.1
DAL80791.1
DAK11225.1
DAX35708.1
DAP92411.1
DAN45148.1
DAL72410.1
DAL15396.1
DAU41979.1
DAR80938.1
DAV00907.1
DAG59579.1
DAT87872.1
DAI02652.1
DAU73763.1
DAF75000.1
DAH99495.1
HEY7148018.1
HIS35831.1
YP_010798145.1
YP_009328903.1
QWT50655.1
BCZ16696.1
YP_010650127.1
QYN80598.1
HMO27256.1
NP_944079.1
UGO48219.1
UGO48220.1
CAH1027143.1
UGC97972.1
MBN75716.1
MBB38065.1
UKM62970.1
YP_004300732.1
YP_010656554.1
UOL48750.1
USL85445.1
URC15339.1
USL89528.1
MDY5645606.1
UTC25166.1
WAX11784.1
HSE61122.1
UXN78499.1
NP_690636.1
MDA1663856.1
WBV74353.1
WJJ54608.1
WJJ54797.1
WGL40795.1
WGN96457.1
WJZ28044.1
WJE88055.1
WJZ48166.1
WJZ47887.1
HMH51646.1
WMQ77644.1
XCK17000.1
WNM55411.1
WNM55626.1
MEE0930468.1
WPJ72151.1
DBA48924.1
DBA50534.1
DBA51401.1
DBA47008.1
DBA49230.1
DBA47500.1


Unnamed: 0,Hit,GenBankID,aln_hit,%I,P(H),E-value,Bit-Score,len(Qry),len(aln),%aln,...,Method,TP,superkingdom,phylum,class,order,family,genus,species,GenomeID
532,AOC84064.1,AOC84064.1,352,99.148,1.000,0.000000,716.0,679,352,0.518409,...,PSI-BLAST,FAdV-8,Viruses,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Aviadenovirus,Fowl aviadenovirus E,AOC84064.1
729,ANA50312.1,ANA50312.1,354,98.023,1.000,0.000000,711.0,679,353,0.519882,...,PSI-BLAST,FAdV-8,Viruses,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Aviadenovirus,Fowl aviadenovirus E,ANA50312.1
1509,XEQ86939.1,XEQ86939.1,374,99.465,1.000,0.000000,752.0,671,374,0.557377,...,PSI-BLAST,hAd2,Viruses,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Mastadenovirus,Human adenovirus sp.,XEQ86939.1
227,QOV03173.1,QOV03173.1,378,72.487,1.000,0.000000,549.0,671,376,0.560358,...,PSI-BLAST,hAd2,Viruses,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Mastadenovirus,Human mastadenovirus F,QOV03173.1
466,AGT76236.1,AGT76236.1,442,74.661,1.000,0.000000,573.0,671,430,0.640835,...,PSI-BLAST,hAd2,Viruses,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Mastadenovirus,Human mastadenovirus B,AGT76236.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1210,A0A1V5MQJ2,MWAL01000503,489,9.400,0.992,0.009964,85.0,559,380,0.679785,...,FoldSeek,GC1,Bacteria,Bacteroidota,,,,,Bacteroidetes bacterium ADurb.Bin416,MWAL01000503
435,A0A3B1EJS3,MF990902,445,10.100,0.961,0.009964,74.0,559,385,0.688730,...,FoldSeek,GC1,Bacteria,,,,,,uncultured bacterium,MF990902
1382,A0A2I7RRS2,MG592590,540,11.200,0.933,0.009964,70.0,559,413,0.738819,...,FoldSeek,GC1,Viruses,Uroviricota,Caudoviricetes,,,,Vibrio phage 1.223.O._10N.261.48.A9,MG592590
2766,A0A6H0X6N1,MT259468,599,10.600,0.923,0.009964,69.0,559,387,0.692308,...,FoldSeek,GC1,Viruses,Uroviricota,Caudoviricetes,,Autographiviridae,,Aeromonas phage PS,MT259468


##**IPG** Genome/*Contig* retrieval

In [None]:
!pip install biopython
!apt-get update
!apt-get install ncbi-blast+

Collecting biopython
  Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 26.3 MB/s eta 0:00:00
Installing collected packages: biopython
Successfully installed biopython-1.85
Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:8 https://r2u.stat.illinois.edu/ubuntu jammy/main

In [None]:
import pandas as pd
from Bio import Entrez

Entrez.email = "juancarlos.ramirezm@estudiante.uam.es"

def fetch_genomic_context(genome_id):
    try:
        # Fetch data from NCBI in XML format
        handle = Entrez.efetch(db="protein", id=genome_id, rettype="ipg", retmode="xml")
        records = Entrez.read(handle)
        handle.close()

        # Extract the genomic context from the parsed XML
        if records: #and 'GBSeq_source-db' in records[0]:
            print(records)
            with open("IPG_results.tsv", "a") as ipg:
              ipg.write(f"{records}\n")
            return records#[0]['GBSeq_source-db']
        else:
            return "No genome/contig information available"
    except Exception as e:
        return f"Error retrieving data for {genome_id}: {str(e)}"

# Example list of GenBank IDs
genome_ids = full_rep_id_list

# Retrieve genomic context for each GenBank ID
genomic_contexts = {genome_id: fetch_genomic_context(genome_id) for genome_id in genome_ids}

# Print the results
for genome_id, context in genomic_contexts.items():
    print(f"{genome_id}: {context}")

In [None]:
import pandas as pd

fields_ipg = ["Id", "Source", "Nucleotide Accession", "Start", "Stop", "Strand", "Protein", "Protein Name", "Organism", "Strain", "Assembly"]
df_ipg = pd.DataFrame(columns = fields_ipg)

with open("IPG_results_2.tsv", "r") as ipg:
  ipg_list = ipg.readlines()

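for line in ipg_list:
  line = line.strip()
  # Skip the header line repeated at the start of every IPG report
  if "Source" not in line:
    line = line.split("\t")
    # strip() drops trailing tabs, so pad the empty trailing fields back.
    # "<" rather than "!=" so a row with extra fields cannot loop forever.
    while len(line) < len(fields_ipg):
      line.append("")
    df_ipg.loc[len(df_ipg.index)] = line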

df_ipg

Unnamed: 0,Id,Source,Nucleotide Accession,Start,Stop,Strand,Protein,Protein Name,Organism,Strain,Assembly
0,799678,RefSeq,NC_001734.1,7850,10021,-,NP_044190.1,hypothetical protein,Canine mastadenovirus A,,GCF_000845925.1
1,799678,Swiss-Prot,,,,,Q96682.1,Preterminal protein,Canine adenovirus 1 strain RI261,,
2,799678,INSDC,Y07760.1,7850,10021,-,CAA69058.1,orf8,Canine adenovirus 1,,GCA_000845925.1
3,718888242,INSDC,OR544955.1,69794,71134,+,WPK28981.1,DNA end protector protein,Escherichia phage vB_EcoP_EP32B,,GCA_033967435.1
4,27156601,INSDC,JN880452.1,8420,13603,-,AFD22004.1,pre-terminal protein,Simian adenovirus A1285,A1285,GCA_006446335.1
...,...,...,...,...,...,...,...,...,...,...,...
4250,6134858,INSDC,AB211830.1,916,3913,+,BAF32130.1,pollen allergen,Cryptomeria japonica,,
4251,303407851,RefSeq,NC_049509.1,2963,4510,+,YP_009889342.1,hypothetical protein,Salmonella phage P46FS4,Salmonella sp.,GCF_011067595.1
4252,303407851,INSDC,MT078988.1,2963,4510,+,QIG62069.1,hypothetical protein,Salmonella phage P46FS4,Salmonella sp.,GCA_011067595.1
4253,44799567,RefSeq,NC_022774.1,24934,26487,+,YP_008771948.1,minor tail protein,Bacillus phage Slash,,GCF_000913795.1


In [None]:
# Identify representative IDs missing from the IPG results

id_list_ipg = df_ipg["Protein"].unique()

with open("FULL_representative_ids.txt", "r") as rep_id:
  rep_list = rep_id.read()
  rep_list = rep_list.split("\n")
  rep_list = list(filter(None, rep_list)) # Drop empty strings, such as the one after the final newline
  print(rep_list)

for query_id in rep_list:
  if query_id not in id_list_ipg:
    print(query_id)

['NP_044190.1', 'WPK28981.1', 'AFD22004.1', 'WP_087972990.1', 'WP_063336313.1', 'MDB4339469.1', 'WPJ21176.1', 'QDH49043.1', 'WP_061511719.1', 'XBY87759.1', 'AIS19786.1', 'ALB78183.1', 'UVY45549.1', 'DAO80396.1', 'HEY8985452.1', 'CAD5240754.1', 'WP_060812158.1', 'YP_009910369.1', 'AOC84064.1', 'NP_817303.1', 'MAI05278.1', 'APU01937.1', 'MAJ57753.1', 'BCU95600.1', 'MAI82168.1', 'WP_138909107.1', 'WP_388741811.1', 'YP_009099867.1', 'YP_009910463.1', 'YP_007673845.1', 'MBT72005.1', 'MAR78155.1', 'YP_009324168.1', 'KAE9558774.1', 'QHJ78973.1', 'MBC8429883.1', 'BCU98818.1', 'MBA41859.1', 'NP_077390.1', 'MDY5996077.1', 'KAG8956018.1', 'MBM3417389.1', 'HUT37491.1', 'MBT6072472.1', 'NBW34590.1', 'UZV41723.1', 'QQM13896.1', 'MDB4559066.1', 'WP_063384088.1', 'WP_187507283.1', 'DAI39303.1', 'MEC7807456.1', 'WP_048849597.1', 'DAL72410.1', 'MAR77710.1', 'DAV45373.1', 'MDB9980541.1', 'RPG04967.1', 'MCH2508301.1', 'AUR86065.1', 'DAX35708.1', 'NP_073686.1', 'MDP4148461.1', 'NDB28677.1', 'QEG05439.1', '

In [None]:
# Query Number count and validation

NR = 0
with open("IPG_results_2.tsv", "r") as ipg:
  ipg_list = ipg.readlines()
  #print(ipg_list[0])
for line in ipg_list:
  if "Id" in line and "Source" in line:
    NR = NR + 1
print(NR)

numero = len(df_ipg["Id"].unique())
print(numero)

2149
2136
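
The two counts above differ: 2149 header lines against 2136 distinct `Id` values. One possible explanation, not verified here, is that several queried proteins belong to the same identical protein group, so the same `Id` heads more than one report; another is a header that is followed by no data rows. The following is a rough sketch for listing such cases, assuming every report starts with a header line whose first tab-separated field is `Id`. The function names `block_ids` and `repeated_ids` are hypothetical.

```python
from collections import Counter


def block_ids(lines):
    """Return the Id of the first data row after each header (None for an empty report)."""
    ids = []
    expecting_first_row = False
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        is_header = fields[0] == "Id" and "Source" in fields
        if is_header:
            if expecting_first_row:
                ids.append(None)  # the previous header had no data rows
            expecting_first_row = True
        elif expecting_first_row and fields[0]:
            ids.append(fields[0])
            expecting_first_row = False
    if expecting_first_row:
        ids.append(None)  # the file ended right after a header
    return ids


def repeated_ids(lines):
    """Return {Id: number of reports} for Ids that head more than one report."""
    counts = Counter(i for i in block_ids(lines) if i is not None)
    return {i: n for i, n in counts.items() if n > 1}
```

Calling `repeated_ids(ipg_list)` on the lines read from `IPG_results_2.tsv` would list the shared groups, and `block_ids(ipg_list).count(None)` would count empty reports; together they should account for the gap if the assumptions hold.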


In [None]:
# Sieving

# Work on a copy so the original results stay untouched
df_ipg_sieved = df_ipg.copy()

# Drop Swiss-Prot entries: they carry no nucleotide accession or coordinates
swissprot = df_ipg_sieved["Source"] == "Swiss-Prot"
for entry in df_ipg_sieved.index[swissprot]:
  print(entry)
df_ipg_sieved = df_ipg_sieved[~swissprot].reset_index(drop=True)

# Drop any remaining entries without a nucleotide accession
no_accession = df_ipg_sieved["Nucleotide Accession"] == ""
for entry in df_ipg_sieved.index[no_accession]:
  print(f"Entry-less: {entry}")
df_ipg_sieved = df_ipg_sieved[~no_accession].reset_index(drop=True)

# Drop entries without start or stop positions
no_position = (df_ipg_sieved["Start"] == "") | (df_ipg_sieved["Stop"] == "")
for entry in df_ipg_sieved.index[no_position]:
  print(f"Position-less: {entry}")
df_ipg_sieved = df_ipg_sieved[~no_position].reset_index(drop=True)

# Length of the coding region on the nucleotide record (inclusive coordinates)
df_ipg_sieved["len"] = df_ipg_sieved["Stop"].astype(int) - df_ipg_sieved["Start"].astype(int) + 1

df_ipg_sieved.to_csv("IPG_results_sieved.csv", index=False)
df_ipg_sieved

1
225
336
1273
1369
2042
2291
2408
2529
2595
2699
3094
3251
3390
3716
3733
3798
3969
Position-less: 225
Position-less: 1146
Position-less: 1147
Position-less: 1165
Position-less: 1166
Position-less: 1167
Position-less: 1168
Position-less: 1169
Position-less: 1172
Position-less: 1173
Position-less: 1224
Position-less: 2690
Position-less: 2691
Position-less: 2766
Position-less: 3076
Position-less: 3084
Position-less: 3085
Position-less: 3086
Position-less: 3087
Position-less: 3088
Position-less: 3089
Position-less: 3090
Position-less: 3091
Position-less: 3092
Position-less: 3093
Position-less: 3094
Position-less: 3095
Position-less: 3096
Position-less: 3097
Position-less: 3098
Position-less: 3099
Position-less: 3100
Position-less: 3101
Position-less: 3102
Position-less: 3103
Position-less: 3104
Position-less: 3105
Position-less: 3106
Position-less: 3107
Position-less: 3108
Position-less: 3109
Position-less: 3110
Position-less: 3111
Position-less: 3112
Position-less: 3113
Position-less: 3

Unnamed: 0,Id,Source,Nucleotide Accession,Start,Stop,Strand,Protein,Protein Name,Organism,Strain,Assembly,len
0,799678,RefSeq,NC_001734.1,7850,10021,-,NP_044190.1,hypothetical protein,Canine mastadenovirus A,,GCF_000845925.1,2172
1,799678,INSDC,Y07760.1,7850,10021,-,CAA69058.1,orf8,Canine adenovirus 1,,GCA_000845925.1,2172
2,718888242,INSDC,OR544955.1,69794,71134,+,WPK28981.1,DNA end protector protein,Escherichia phage vB_EcoP_EP32B,,GCA_033967435.1,1341
3,27156601,INSDC,JN880452.1,8420,13603,-,AFD22004.1,pre-terminal protein,Simian adenovirus A1285,A1285,GCA_006446335.1,5184
4,149490929,RefSeq,NZ_CP083093.1,12817,13557,-,WP_087972990.1,hypothetical protein,Bacillus thuringiensis,B401,GCF_020809125.1,741
...,...,...,...,...,...,...,...,...,...,...,...,...
4154,6134858,INSDC,AB211830.1,916,3913,+,BAF32130.1,pollen allergen,Cryptomeria japonica,,,2998
4155,303407851,RefSeq,NC_049509.1,2963,4510,+,YP_009889342.1,hypothetical protein,Salmonella phage P46FS4,Salmonella sp.,GCF_011067595.1,1548
4156,303407851,INSDC,MT078988.1,2963,4510,+,QIG62069.1,hypothetical protein,Salmonella phage P46FS4,Salmonella sp.,GCA_011067595.1,1548
4157,44799567,RefSeq,NC_022774.1,24934,26487,+,YP_008771948.1,minor tail protein,Bacillus phage Slash,,GCF_000913795.1,1554


In [None]:
import pandas as pd
# New dataframe to store the final entries
df_ipg_final = pd.DataFrame(columns=fields_ipg)


# Grouping the dataframe by the "Id" column
grouped = df_ipg_sieved.groupby('Id')

# List to collect dataframes
df_list = []

# Iterating over each group
for group_id, group_df in grouped:
    # Sorting by 'len' in descending order
    sorted_df = group_df.sort_values(by='len', ascending=False)

    # Selecting the entry with the maximum 'len' value
    max_len_entry = sorted_df.iloc[0:1]  # This keeps it as a DataFrame

    # Collect the DataFrame
    df_list.append(max_len_entry)

# Concatenating all selected rows into a final dataframe
df_ipg_final = pd.concat(df_list, ignore_index=True)

# Ensuring the original order of "Id" is respected
df_ipg_final = df_ipg_final.set_index('Id').reindex(df_ipg_sieved['Id'].unique()).reset_index()

df_ipg_final.to_csv("IPG_results_FILTERED.csv", index=False)

# Displaying the final dataframe
df_ipg_final

Unnamed: 0,Id,Source,Nucleotide Accession,Start,Stop,Strand,Protein,Protein Name,Organism,Strain,Assembly,len
0,799678,RefSeq,NC_001734.1,7850,10021,-,NP_044190.1,hypothetical protein,Canine mastadenovirus A,,GCF_000845925.1,2172
1,718888242,INSDC,OR544955.1,69794,71134,+,WPK28981.1,DNA end protector protein,Escherichia phage vB_EcoP_EP32B,,GCA_033967435.1,1341
2,27156601,INSDC,JN880452.1,8420,13603,-,AFD22004.1,pre-terminal protein,Simian adenovirus A1285,A1285,GCA_006446335.1,5184
3,149490929,RefSeq,NZ_CP083093.1,12817,13557,-,WP_087972990.1,hypothetical protein,Bacillus thuringiensis,B401,GCF_020809125.1,741
4,108190427,RefSeq,NZ_JBDGIA010000015.1,1677,2480,+,WP_063336313.1,hypothetical protein,Bacillus inaquosorum,gbc_l,GCF_040784945.1,804
...,...,...,...,...,...,...,...,...,...,...,...,...
2131,212774742,RefSeq,NZ_JAQOUA010000005.1,318549,320114,-,WP_272528280.1,NosD domain-containing protein,Lautropia mirabilis,SCCH130 Lau2261318,GCF_028462385.1,1566
2132,321606571,INSDC,LR796879.1,27352,28899,-,CAB4171957.1,Pectate lyase superfamily protein,uncultured Caudovirales phage,,GCA_902990395.1,1548
2133,6134858,RefSeq,NC_081406.1,162138635,162141632,+,XP_057831880.2,,Cryptomeria japonica,,GCF_030272615.1,2998
2134,303407851,RefSeq,NC_049509.1,2963,4510,+,YP_009889342.1,hypothetical protein,Salmonella phage P46FS4,Salmonella sp.,GCF_011067595.1,1548
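
The per-group loop above can also be written with a grouped `idxmax`. This is a sketch of an alternative, not the notebook's method: `longest_per_id` is a hypothetical name, and it assumes that `len` is numeric (or convertible to numbers) and that the DataFrame index is unique. Ties go to the first matching row in the input, which may differ from the row chosen by the default sort used above.

```python
import pandas as pd


def longest_per_id(df):
    """Keep, for every Id, the row with the largest 'len', in first-seen Id order."""
    lengths = pd.to_numeric(df["len"])
    # sort=False keeps groups in the order their Id first appears.
    best_rows = lengths.groupby(df["Id"], sort=False).idxmax()
    return df.loc[best_rows.tolist()].reset_index(drop=True)


# Small made-up table to show the behaviour.
example = pd.DataFrame({
    "Id": ["a", "a", "b", "c", "c"],
    "Protein": ["p1", "p2", "p3", "p4", "p5"],
    "len": [10, 30, 5, 7, 7],
})
print(longest_per_id(example))
```

Because the selection is done by index label, no intermediate list of one-row DataFrames or final reindexing step is needed.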


In [None]:
# Contig retrieval
from Bio import Entrez, SeqIO
import pandas as pd

df_ipg_final = pd.read_csv("IPG_results_FILTERED.csv")
genomeid_list = df_ipg_final["Nucleotide Accession"].to_list()
genomeid_list.append("GG666582.1")  # Include additional GenomeID

Entrez.email = "juancarlos.ramirezm@estudiante.uam.es"

def fetch_sequence_and_save(accession):
    if accession:
        print(f"Fetching {accession}")
        try:
            with Entrez.efetch(db="nuccore", id=accession, rettype="fasta", retmode="text") as handle:
                record = SeqIO.read(handle, "fasta")
                # Save the record to a FASTA file named after the accession
                with open(f"{accession}.fasta", "w") as fasta_out:
                    SeqIO.write(record, fasta_out, "fasta")
                print(f"Successfully retrieved and saved: {accession}")
        except Exception as e:
            print(f"Error fetching {accession}: {e}")
            print("No FASTA sequence available")
            with open("GENOME_ERRORS.txt", "a") as error_file:
                error_file.write(f"{accession}\n")

for genomeid in genomeid_list:
    fetch_sequence_and_save(genomeid)

Fetching NC_001734.1
Successfully retrieved and saved: NC_001734.1
Fetching OR544955.1
Successfully retrieved and saved: OR544955.1
Fetching JN880452.1
Successfully retrieved and saved: JN880452.1
Fetching NZ_CP083093.1
Successfully retrieved and saved: NZ_CP083093.1
Fetching NZ_JBDGIA010000015.1
Successfully retrieved and saved: NZ_JBDGIA010000015.1
Fetching JAOKRE010000001.1
Successfully retrieved and saved: JAOKRE010000001.1
Fetching OR666137.1
Successfully retrieved and saved: OR666137.1
Fetching MN038175.1
Successfully retrieved and saved: MN038175.1
Fetching NZ_LHZM01000090.1
Successfully retrieved and saved: NZ_LHZM01000090.1
Fetching OR096706.1
Successfully retrieved and saved: OR096706.1
Fetching KM096544.1
Successfully retrieved and saved: KM096544.1
Fetching KP279746.1
Successfully retrieved and saved: KP279746.1
Fetching OP073754.1
Successfully retrieved and saved: OP073754.1
Fetching BK038159.1
Successfully retrieved and saved: BK038159.1
Fetching DATCQX010001001.1
Success
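
With a couple of thousand accessions, a run like the one above may be interrupted and restarted. The sketch below filters the accession list down to what still needs downloading. It assumes the `<accession>.fasta` naming used above, treats empty files as failed downloads, and skips missing values (a NaN read back from the CSV is a float, which the `if accession:` check would let through). The name `pending_accessions` is hypothetical.

```python
import os


def pending_accessions(accessions, directory="."):
    """Return accessions still to download, skipping blanks, NaN, repeats and existing files."""
    pending = []
    seen = set()
    for accession in accessions:
        # NaN (a float) and empty strings are not valid accessions.
        if not isinstance(accession, str) or not accession.strip():
            continue
        accession = accession.strip()
        if accession in seen:
            continue
        seen.add(accession)
        path = os.path.join(directory, f"{accession}.fasta")
        # A non-empty file is taken as a completed download.
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            continue
        pending.append(accession)
    return pending
```

The retrieval loop could then iterate over `pending_accessions(genomeid_list)` instead of `genomeid_list`, so a rerun only requests what is missing.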

In [None]:
!tar -czvf "0000.tar.gz" *.fasta

ABBGAK010000009.1.fasta
ABJPJW010000008.1.fasta
ABXP02000122.1.fasta
ADAD01000082.1.fasta
AJ586898.1.fasta
ALJD01000016.1.fasta
AMFJ01005424.1.fasta
AP013912.1.fasta
AP014200.1.fasta
AP017647.1.fasta
AP019525.1.fasta
ATBP01001652.1.fasta
AUSU01001187.1.fasta
AWSJ01000245.1.fasta
AY822469.1.fasta
AY848684.1.fasta
AZFC01000030.1.fasta
AZRA01000001.1.fasta
BCNO01000002.1.fasta
BDQX01000096.1.fasta
BDSW01000143.1.fasta
BEIQ01000001.1.fasta
BFAA01021776.1.fasta
BGZN01000059.1.fasta
BJFU01000002.1.fasta
BK013345.1.fasta
BK013347.1.fasta
BK014698.1.fasta
BK014751.1.fasta
BK014853.1.fasta
BK014964.1.fasta
BK015140.1.fasta
BK015217.1.fasta
BK015539.1.fasta
BK015645.1.fasta
BK015667.1.fasta
BK015689.1.fasta
BK015897.1.fasta
BK015931.1.fasta
BK015936.1.fasta
BK015993.1.fasta
BK016086.1.fasta
BK016121.1.fasta
BK016125.1.fasta
BK016136.1.fasta
BK016182.1.fasta
BK016274.1.fasta
BK016314.1.fasta
BK016585.1.fasta
BK016733.1.fasta
BK017080.1.fasta
BK017141.1.fasta
BK017257.1.fasta
BK017272.1.fasta
BK01

In [None]:
with open("IPG_results_2.tsv", "r") as ipg:
  ipg_list = ipg.readlines()

line_nr_list = []

for line_nr in range(len(ipg_list)):
  if "Id" in ipg_list[line_nr] and "Source" in ipg_list[line_nr]:
    print(line_nr)
    line_nr_list.append(line_nr)

line_nr_list

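# Report header lines immediately followed by another header (reports with no data rows)
for current, following in zip(line_nr_list, line_nr_list[1:]):
  if following == current + 1:
    print(current)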

0
4
6
8
66
74
78
80
82
85
87
91
93
95
97
99
102
105
108
110
115
117
123
125
131
133
136
139
142
147
150
152
154
157
161
163
165
167
169
172
174
177
179
181
183
185
187
190
193
208
232
234
236
241
243
245
247
249
251
253
255
257
260
262
266
268
271
273
275
277
280
283
285
287
289
291
293
295
297
299
301
303
306
311
316
318
320
326
329
331
333
335
337
339
341
343
345
348
350
352
354
357
360
362
366
368
370
372
374
376
378
380
383
386
388
391
393
395
399
401
403
405
408
410
412
415
417
423
425
427
429
431
434
436
438
443
445
447
449
451
453
455
457
459
461
463
465
467
470
472
474
476
479
481
483
485
487
489
491
493
495
499
501
503
505
507
510
512
514
516
518
520
522
524
527
529
532
534
536
538
540
542
544
546
548
551
553
555
557
568
570
572
574
576
578
580
582
584
587
589
591
594
596
599
602
605
608
610
612
615
617
619
621
623
625
627
629
631
633
635
637
639
641
643
646
648
650
652
654
656
658
660
662
664
666
668
670
672
674
676
679
681
683
685
687
689
691
693
695
697
699
701
703
705
707


#**Taxonomy Completion**

In [None]:
import pandas as pd
from Bio import Entrez
from Bio import SeqIO
from collections import defaultdict
import os

main = pd.read_csv("FULL_results+GenomeID.csv")
Entrez.email = "juancarlos.ramirezm@estudiante.uam.es"

main

main["realm"] = ""
main["kingdom"] = ""

tax_ids = main["TaxID"].to_list()
tax_ids

Unnamed: 0,Hit,GenBankID,aln_hit,%I,P(H),E-value,Bit-Score,len(Qry),len(aln),%aln,...,superkingdom,phylum,class,order,family,genus,species,GenomeID,realm,kingdom
0,AOC84064.1,AOC84064.1,352,99.148,1.000,0.000000,716.0,679,352,0.518409,...,Viruses,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Aviadenovirus,Fowl aviadenovirus E,AOC84064.1,,
1,ANA50312.1,ANA50312.1,354,98.023,1.000,0.000000,711.0,679,353,0.519882,...,Viruses,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Aviadenovirus,Fowl aviadenovirus E,ANA50312.1,,
2,XEQ86939.1,XEQ86939.1,374,99.465,1.000,0.000000,752.0,671,374,0.557377,...,Viruses,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Mastadenovirus,Human adenovirus sp.,XEQ86939.1,,
3,QOV03173.1,QOV03173.1,378,72.487,1.000,0.000000,549.0,671,376,0.560358,...,Viruses,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Mastadenovirus,Human mastadenovirus F,QOV03173.1,,
4,AGT76236.1,AGT76236.1,442,74.661,1.000,0.000000,573.0,671,430,0.640835,...,Viruses,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Mastadenovirus,Human mastadenovirus B,AGT76236.1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3304,A0A1V5MQJ2,MWAL01000503,489,9.400,0.992,0.009964,85.0,559,380,0.679785,...,Bacteria,Bacteroidota,,,,,Bacteroidetes bacterium ADurb.Bin416,MWAL01000503,,
3305,A0A3B1EJS3,MF990902,445,10.100,0.961,0.009964,74.0,559,385,0.688730,...,Bacteria,,,,,,uncultured bacterium,MF990902,,
3306,A0A2I7RRS2,MG592590,540,11.200,0.933,0.009964,70.0,559,413,0.738819,...,Viruses,Uroviricota,Caudoviricetes,,,,Vibrio phage 1.223.O._10N.261.48.A9,MG592590,,
3307,A0A6H0X6N1,MT259468,599,10.600,0.923,0.009964,69.0,559,387,0.692308,...,Viruses,Uroviricota,Caudoviricetes,,Autographiviridae,,Aeromonas phage PS,MT259468,,
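
Many hits share a TaxID, so querying once per distinct ID may save a large number of Entrez requests. The sketch below is one possible way to build that list; the helper name `unique_tax_ids` and the toy column are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def unique_tax_ids(tax_id_column):
    """Return the distinct, non-missing TaxIDs in first-seen order."""
    # Anything that cannot be read as a number becomes NaN and is dropped
    numeric = pd.to_numeric(tax_id_column, errors="coerce")
    return numeric.dropna().astype(int).drop_duplicates().tolist()

# Toy column standing in for main["TaxID"] (values are placeholders)
example = pd.Series([190063, 190063, None, "130310", "n/a"])
print(unique_tax_ids(example))
```

The lineages fetched for the distinct IDs can then be mapped back onto every row, which also keeps the result aligned with `main` when some lookups fail.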




In [None]:
categories = {"Viruses": "viruses", "Bacteria": "bacteria", "Archaea": "archaea", "Eukaryota": "eukaryota", "Other": "other", "Unclassified": "unclassified"}
classification_counts = defaultdict(int)
classification_results = []

count = 0
# Query NCBI Taxonomy database for each Taxonomy ID
with open("log.txt", "a") as log:
  for tax_id in tax_ids:
    print(count)
    log.write(f"{count}\n")
    count = count + 1
    if pd.notna(tax_id) and tax_id != "":  # skip missing TaxIDs (read_csv yields NaN, not "")
      print(f"Querying Taxonomy ID: {tax_id}")
      log.write(f"Querying Taxonomy ID: {tax_id}\n")
      try:
          handle = Entrez.efetch(db="taxonomy", id=str(tax_id), retmode="xml")
          records = Entrez.read(handle)
          handle.close()

          # Extract lineage
          lineage = records[0]["Lineage"]
          lineage_lower = lineage.lower()
          print(f"Lineage for {tax_id}: {lineage}")
          log.write(f"Lineage for {tax_id}: {lineage}\n")

          # Check category membership
          result = {"TaxID": tax_id, "Lineage": lineage}
          for category, keyword in categories.items():
              if keyword in lineage_lower:
                  result["Category"] = category
                  classification_counts[category] += 1
                  break
          else:
              # No keyword matched: record and count the entry as "Other"
              result["Category"] = "Other"
              classification_counts["Other"] += 1

          classification_results.append(result)
      except Exception as e:
          print(f"Error fetching Taxonomy ID {tax_id}: {e}")
          log.write(f"Error fetching Taxonomy ID {tax_id}: {e}\n")
          # Visible marker so failures are easy to spot in the long output
          print("ERROR\n\n\n\n\n++++++")
          log.write("ERROR\n\n\n\n\n++++++\n")

# Convert results to a DataFrame
results_df = pd.DataFrame(classification_results)

# Print summary
summary_df = pd.DataFrame.from_dict(classification_counts, orient="index", columns=["Count"])
print("\nClassification Summary:")
print(summary_df)

# Save results
results_df.to_csv("FULL_taxonomy_classification_results.csv", index=False)
print("\nClassification results saved to 'FULL_taxonomy_classification_results.csv'.")

The last 5000 lines of the streaming output were truncated.
Lineage for 2026714: cellular organisms; Archaea; Thermoproteati; Candidatus Bathyarchaeota; unclassified Candidatus Bathyarchaeota
1646
Querying Taxonomy ID: 2021690
Lineage for 2021690: cellular organisms; Bacteria; Bacillati; Bacillota; Bacilli; Bacillales; Bacillaceae; Bacillus; unclassified Bacillus (in: firmicutes)
1647
Querying Taxonomy ID: 2026773
Lineage for 2026773: cellular organisms; Archaea; Nanobdellati; Candidatus Pacearchaeota; unclassified Candidatus Pacearchaeota
1648
Querying Taxonomy ID: 2268204
Lineage for 2268204: cellular organisms; Archaea; Methanobacteriati; Thermoplasmatota; Thermoplasmata; Thermoplasmatales; unclassified Thermoplasmatales
1649
Querying Taxonomy ID: 1869227
Lineage for 1869227: cellular organisms; Bacteria; unclassified Bacteria
1650
Querying Taxonomy ID: 2053570
Lineage for 2053570: cellular organisms; Bacteria; Pseudomonadati; FCB group; Candidatus Latescibacterota; u
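
The taxonomy records returned by `Entrez.efetch` also carry a `LineageEx` list in which each ancestor has a `ScientificName` and a `Rank`, so ranks can be read directly instead of being inferred from the flat `Lineage` string. The following is a minimal sketch under that assumption; `ranks_from_record`, the `RANKS` tuple and the hand-written example record are hypothetical, and the rank labels NCBI actually returns (for instance for the top-level groups) should be checked against a real response.

```python
# Rank labels we would like to extract (an assumption; adjust to what NCBI returns)
RANKS = ("realm", "kingdom", "phylum", "class", "order", "family", "genus", "species")

def ranks_from_record(record, wanted=RANKS):
    """Map each wanted rank to its scientific name ('' when the rank is absent)."""
    found = {rank: "" for rank in wanted}
    for node in record.get("LineageEx", []):
        rank = node.get("Rank", "")
        if rank in found:
            found[rank] = node.get("ScientificName", "")
    # LineageEx lists ancestors only, so add the queried taxon itself
    own_rank = record.get("Rank", "")
    if own_rank in found:
        found[own_rank] = record.get("ScientificName", "")
    return found

# Hand-written, abbreviated record mimicking the shape of Entrez.read(handle)[0]
example = {
    "ScientificName": "Fowl aviadenovirus E",
    "Rank": "species",
    "LineageEx": [
        {"ScientificName": "Varidnaviria", "Rank": "realm"},
        {"ScientificName": "Bamfordvirae", "Rank": "kingdom"},
        {"ScientificName": "Adenoviridae", "Rank": "family"},
        {"ScientificName": "Aviadenovirus", "Rank": "genus"},
    ],
}
print(ranks_from_record(example))
```

Ranks that do not appear in the record stay as empty strings, which matches how missing levels are represented elsewhere in this notebook.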

In [None]:
import pandas as pd
results_df = pd.read_csv("FULL_taxonomy_classification_results.csv")
tax_levels = {"Empire": [], "Realm": [], "Kingdom": [], "Phylum": [], "Class": [], "Order": [], "Family": [], "Genus": [], "Species": []}

OTU_list = []
for i in range(len(results_df)):
  OTU_list.append(results_df.at[i, "Lineage"])
#OTU_list
lineage_list = pd.DataFrame(tax_levels)

for i in range(len(OTU_list)):
  lineage = OTU_list[i].split(";")
  print(lineage)
  if "unclassified" not in lineage[-1] and lineage[-1][1:].count(" ") >= 1:
    lineage_list.at[i, "Species"] = lineage[-1][1:]
  if lineage[0] == "Viruses":
    lineage_list.loc[i, "Empire"] = "Viruses"
  elif lineage[0] == "cellular organisms":
    lineage_list.loc[i, "Empire"] = "Cytota"
  else:
    lineage_list.loc[i, "Empire"] = "unclassified entries"
  if "unclassified" in lineage[1]:
    lineage_list.loc[i, "Realm"] = ""
  elif lineage[1][-1] != "a":
    lineage_list.loc[i, "Realm"] = ""
  else:
    lineage_list.at[i, "Realm"] = lineage[1][1:]
  for level in lineage:
    #print(level)
    if level[-2:] == "ae" and "unclassified" not in level and level[-4:] != "diae" and " " not in level[1:] and level[-4:] != "neae" and level[-4:] != "inae" and level[-5:] != "virae":  # "-virae" is a viral kingdom, not a family
      lineage_list.at[i, "Family"] = level[1:]

  if lineage[0] == "Viruses": # Viral classification
    for level in lineage:
      if level[-5:] == "viria" and "unclassified" not in level:
        lineage_list.at[i, "Realm"] = level[1:]
      if level[-5:] == "virae" and "unclassified" not in level:
        lineage_list.at[i, "Kingdom"] = level[1:]
      if level[-8:] == "viricota" and "unclassified" not in level:
        lineage_list.at[i, "Phylum"] = level[1:]
      if level[-9:] == "viricetes" and "unclassified" not in level:
        lineage_list.at[i, "Class"] = level[1:]
      if level[-7:] == "virales" and "unclassified" not in level:
        lineage_list.at[i, "Order"] = level[1:]
      if (level[-5:] == "virus" or level[-6:] == "viroid" or level[-9:] == "satellite" or level[-8:] == "viriform") and "unclassified" not in level and " " not in level[1:]:
        lineage_list.at[i, "Genus"] = level[1:]
  if lineage[1][1:] == "Bacteria" or lineage[1][1:] == "Archaea": # Bacterial/Archaeal classification
    for level in lineage:
      if level[-3:] == "ati" and "unclassified" not in level:
        lineage_list.at[i, "Kingdom"] = level[1:]
      if (level[-3:] == "ota" or level[1:] == "Candidatus Kryptoniota" or level[1:] == "Candidatus Aminicenantota" or level[1:] == "Candidatus Hydrothermarchaeota" or level[1:] == "Candidatus Eiseniibacteriota" or level[1:] == "Candidatus Zixiibacteriota") and "unclassified" not in level and "Candidatus " not in level:
        lineage_list.at[i, "Phylum"] = level[1:]
      if "unclassified" not in level and "Candidatus " in level and level[1:] != lineage_list["Realm"][i] and level[1:] != lineage_list["Kingdom"][i] and level[1:] and level[1:] != lineage_list["Empire"][i] and level != "cellular organisms" and level[1:] != lineage_list["Species"][i]:
        lineage_list.at[i, "Phylum"] = level[1:]
      if (level[-2:] == "ia" or level[-4:] == "etes" or level[-2:] == "ei" or level[-4:] == "neae" or level[-2:] == "bi" or level[-5:] == "cocci" or level[1:] == "Bacilli") and "unclassified" not in level and level[1:] != "Bacteria" and level[1:] not in ["Massilia", "Escherichia", "Ehrlichia", "Rickettsia", "Pregia", "Wolbachia", "Orientia", "Chlamydia", "Mannheimia", "Neorickettsia", "Hafnia", "Hafkinia", "Gortzia", "Bealeaia", "Seliberia"]:
        lineage_list.at[i, "Class"] = level[1:]
      if level[-4:] == "ales" and "unclassified" not in level:
        lineage_list.at[i, "Order"] = level[1:]
      if level[-5:] == "aceae" and "unclassified" not in level:
        lineage_list.at[i, "Family"] = level[1:]
      if level.count(" ") <= 1  and ("unclassified" not in level or "Candidatus " in level) and level[1:] != lineage_list["Realm"][i] and level[1:] != lineage_list["Kingdom"][i] and level[1:] != lineage_list["Phylum"][i] and level[1:] != lineage_list["Class"][i] and level[1:] != lineage_list["Order"][i] and level[1:] != lineage_list["Family"][i] and level[1:] != lineage_list["Empire"][i] and level != "cellular organisms":
        lineage_list.at[i, "Genus"] = level[1:]
      if "unclassified" not in level and "Candidatus " in level and level[1:] != lineage_list["Realm"][i] and level[1:] != lineage_list["Kingdom"][i] and level[1:] != lineage_list["Phylum"][i] and level[1:] != lineage_list["Class"][i] and level[1:] != lineage_list["Order"][i] and level[1:] != lineage_list["Family"][i] and level[1:] != lineage_list["Empire"][i] and level != "cellular organisms" and level[1:] != lineage_list["Species"][i] and level[1:] != "Candidatus Hydrothermarchaeota" and level[1:] != "Candidatus Kryptoniota" and level[1:] != "Candidatus Aminicenantota" and level[1:] != "Candidatus Eiseniibacteriota" and level[1:] != "Candidatus Zixiibacteriota":
        lineage_list.at[i, "Genus"] = level[1:]

  if lineage[1][1:] == "Eukaryota": # Eukaryotic classification
    for level in lineage:
      if level[1:] == "Metazoa":
        lineage_list.at[i, "Kingdom"] = level[1:]
      if level[1:] in ["Porifera", "Cnidaria", "Ctenophora", "Placozoa", "Chordata", "Echinodermata", "Arthropoda", "Nematoda", "Mollusca", "Annelida", "Platyhelminthes", "Nemertea", "Rotifera", "Bryozoa", "Tardigrada", "Onychophora", "Brachiopoda", "Chaetognatha", "Hemichordata", "Xenacoelomorpha", "Priapulida", "Loricifera", "Kinorhyncha", "Gastrotricha", "Cycliophora", "Micrognathozoa", "Phoronida", "Entoprocta", "Ectoprocta", "Acanthocephala", "Gnathostomulida"]:  # Explicit list of animal phyla
          lineage_list.at[i, "Phylum"] = level[1:]
      if (level[-6:] == "ophyta" or level[-6:] == "mycota") and "unclassified" not in level:
        lineage_list.at[i, "Phylum"] = level[1:]
      if (level[-7:] == "mycetes" or level[-7:] == "phyceae" or level[-3:] == "ata" or(lineage_list["Kingdom"][i] != "Metazoa" and level[-7:] == "opsida")) and "unclassified" not in level:
        lineage_list.at[i, "Class"] = level[1:]
      if (level[-4:] == "ales" or level[-7:] == "iformes" or (lineage_list["Kingdom"][i] == "Metazoa" and level[-3:] == "ida") or level[-2:] == "ea") and "unclassified" not in level:
        lineage_list.at[i, "Order"] = level[1:]
      if level[1:] not in [" ", "unclassified"] and level[1:] != lineage_list["Realm"][i] and level[1:] != lineage_list["Kingdom"][i] and level[1:] != lineage_list["Phylum"][i] and level[1:] != lineage_list["Empire"][i] and level[1:] != lineage_list["Class"][i] and level[1:] != lineage_list["Order"][i] and level[1:] != lineage_list["Family"][i]:
        lineage_list.at[i, "Genus"] = level[1:]
      if level[1:] == "Amoebozoa":
        lineage_list.at[i, "Phylum"] = level[1:]
        lineage_list.at[i, "Kingdom"] = "Protozoa"
      if level[1:] == "Viridiplantae":
        lineage_list.at[i, "Kingdom"] = "Viridiplantae"

lineage_list.to_csv("FULL_lineage_classification_results.csv", index=False)
lineage_list

['Viruses', ' Varidnaviria', ' Bamfordvirae', ' Preplasmiviricota', ' Tectiliviricetes', ' Rowavirales', ' Adenoviridae', ' Aviadenovirus']
['Viruses', ' Varidnaviria', ' Bamfordvirae', ' Preplasmiviricota', ' Tectiliviricetes', ' Rowavirales', ' Adenoviridae', ' Aviadenovirus', ' Fowl aviadenovirus E']
['Viruses', ' Varidnaviria', ' Bamfordvirae', ' Preplasmiviricota', ' Tectiliviricetes', ' Rowavirales', ' Adenoviridae', ' Mastadenovirus', ' unclassified Human adenoviruses']
['Viruses', ' Varidnaviria', ' Bamfordvirae', ' Preplasmiviricota', ' Tectiliviricetes', ' Rowavirales', ' Adenoviridae', ' Mastadenovirus', ' Human mastadenovirus F']
['Viruses', ' Varidnaviria', ' Bamfordvirae', ' Preplasmiviricota', ' Tectiliviricetes', ' Rowavirales', ' Adenoviridae', ' Mastadenovirus']
['Viruses', ' Varidnaviria', ' Bamfordvirae', ' Preplasmiviricota', ' Tectiliviricetes', ' Rowavirales', ' Adenoviridae', ' Aviadenovirus', ' Duck aviadenovirus B']
['Viruses', ' Varidnaviria', ' Bamfordvirae'

['Viruses', ' Varidnaviria', ' Bamfordvirae', ' Preplasmiviricota', ' Tectiliviricetes', ' Rowavirales', ' Adenoviridae', ' Mastadenovirus', ' unclassified Mastadenovirus']
['Viruses', ' Varidnaviria', ' Bamfordvirae', ' Preplasmiviricota', ' Tectiliviricetes', ' Rowavirales', ' Adenoviridae', ' Mastadenovirus', ' Murine mastadenovirus B']
['cellular organisms', ' Bacteria', ' Bacillati', ' Bacillota', ' Bacilli', ' Lactobacillales', ' Aerococcaceae', ' Abiotrophia']
['Viruses', ' Varidnaviria', ' Bamfordvirae', ' Preplasmiviricota', ' Tectiliviricetes', ' Rowavirales', ' Adenoviridae', ' Siadenovirus', ' unclassified Siadenovirus']
['Viruses', ' unclassified bacterial viruses']
['Viruses', ' Varidnaviria', ' Bamfordvirae', ' Preplasmiviricota', ' Tectiliviricetes', ' Rowavirales', ' Adenoviridae', ' Siadenovirus', ' Turkey siadenovirus A']
['Viruses', ' Varidnaviria', ' Bamfordvirae', ' Preplasmiviricota', ' Tectiliviricetes', ' Rowavirales', ' Adenoviridae', ' Atadenovirus', ' Lizard

Unnamed: 0,Empire,Realm,Kingdom,Phylum,Class,Order,Family,Genus,Species
0,Viruses,Varidnaviria,Bamfordvirae,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Aviadenovirus,
1,Viruses,Varidnaviria,Bamfordvirae,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Aviadenovirus,Fowl aviadenovirus E
2,Viruses,Varidnaviria,Bamfordvirae,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Mastadenovirus,
3,Viruses,Varidnaviria,Bamfordvirae,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Mastadenovirus,Human mastadenovirus F
4,Viruses,Varidnaviria,Bamfordvirae,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Mastadenovirus,
...,...,...,...,...,...,...,...,...,...
3304,Cytota,Bacteria,Pseudomonadati,Bacteroidota,,,,,
3305,Cytota,Bacteria,,,,,,,environmental samples
3306,Viruses,Duplodnaviria,Heunggongvirae,Uroviricota,Caudoviricetes,,Heunggongvirae,,
3307,Viruses,Duplodnaviria,Heunggongvirae,Uroviricota,Caudoviricetes,,Autographiviridae,,
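
The viral branch of the classifier relies on the standard rank suffixes (`-viria`, `-virae`, `-viricota`, `-viricetes`, `-virales`, `-viridae`). As a sketch only, the same idea can be written as a lookup table so that each suffix maps to exactly one rank; `VIRAL_SUFFIXES` and `viral_ranks` are hypothetical names, and this is not a drop-in replacement for the cell above, which also handles genera, species and cellular organisms.

```python
# Suffix -> rank table for viral taxa (higher ranks only)
VIRAL_SUFFIXES = [
    ("viricetes", "Class"),
    ("viricota", "Phylum"),
    ("virales", "Order"),
    ("viridae", "Family"),
    ("viria", "Realm"),
    ("virae", "Kingdom"),
]

def viral_ranks(lineage):
    """Assign viral lineage levels to ranks by suffix; unmatched levels are ignored."""
    ranks = {}
    for level in (part.strip() for part in lineage.split(";")):
        if "unclassified" in level:
            continue
        for suffix, rank in VIRAL_SUFFIXES:
            if level.endswith(suffix):
                ranks[rank] = level
                break  # each level gets at most one rank
    return ranks

print(viral_ranks("Viruses; Duplodnaviria; Heunggongvirae; Uroviricota; Caudoviricetes"))
```

Because a level can match only one entry of the table, a kingdom such as `Heunggongvirae` cannot end up in the `Family` column when the lineage has no family.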


In [None]:
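if len(lineage_list) == len(main):
  # Copy the parsed realm and kingdom into the main table, row by row
  for i in range(len(lineage_list)):
    main.at[i, "realm"] = lineage_list.at[i, "Realm"]
    main.at[i, "kingdom"] = lineage_list.at[i, "Kingdom"]
  # Move realm and kingdom (currently the last two columns) to just after superkingdom
  cols = main.columns.tolist()
  cols = cols[:-9] + cols[-2:] + cols[-9:-2]
  main = main[cols]
  main.to_csv("FULL_results+GenomeID+Tax.csv", index=False)
else:
  # Rows cannot be aligned by position, so nothing is merged or saved
  print(f"Row count mismatch: lineage_list has {len(lineage_list)} rows, main has {len(main)}.")
main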

Unnamed: 0,Hit,GenBankID,aln_hit,%I,P(H),E-value,Bit-Score,len(Qry),len(aln),%aln,...,superkingdom,realm,kingdom,phylum,class,order,family,genus,species,GenomeID
0,AOC84064.1,AOC84064.1,352,99.148,1.000,0.000000,716.0,679,352,0.518409,...,Viruses,Varidnaviria,Bamfordvirae,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Aviadenovirus,Fowl aviadenovirus E,AOC84064.1
1,ANA50312.1,ANA50312.1,354,98.023,1.000,0.000000,711.0,679,353,0.519882,...,Viruses,Varidnaviria,Bamfordvirae,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Aviadenovirus,Fowl aviadenovirus E,ANA50312.1
2,XEQ86939.1,XEQ86939.1,374,99.465,1.000,0.000000,752.0,671,374,0.557377,...,Viruses,Varidnaviria,Bamfordvirae,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Mastadenovirus,Human adenovirus sp.,XEQ86939.1
3,QOV03173.1,QOV03173.1,378,72.487,1.000,0.000000,549.0,671,376,0.560358,...,Viruses,Varidnaviria,Bamfordvirae,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Mastadenovirus,Human mastadenovirus F,QOV03173.1
4,AGT76236.1,AGT76236.1,442,74.661,1.000,0.000000,573.0,671,430,0.640835,...,Viruses,Varidnaviria,Bamfordvirae,Preplasmiviricota,Tectiliviricetes,Rowavirales,Adenoviridae,Mastadenovirus,Human mastadenovirus B,AGT76236.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3304,A0A1V5MQJ2,MWAL01000503,489,9.400,0.992,0.009964,85.0,559,380,0.679785,...,Bacteria,Bacteria,Pseudomonadati,Bacteroidota,,,,,Bacteroidetes bacterium ADurb.Bin416,MWAL01000503
3305,A0A3B1EJS3,MF990902,445,10.100,0.961,0.009964,74.0,559,385,0.688730,...,Bacteria,Bacteria,,,,,,,uncultured bacterium,MF990902
3306,A0A2I7RRS2,MG592590,540,11.200,0.933,0.009964,70.0,559,413,0.738819,...,Viruses,Duplodnaviria,Heunggongvirae,Uroviricota,Caudoviricetes,,,,Vibrio phage 1.223.O._10N.261.48.A9,MG592590
3307,A0A6H0X6N1,MT259468,599,10.600,0.923,0.009964,69.0,559,387,0.692308,...,Viruses,Duplodnaviria,Heunggongvirae,Uroviricota,Caudoviricetes,,Autographiviridae,,Aeromonas phage PS,MT259468
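
The merge above depends on `realm` and `kingdom` being the last two of the final nine columns. A possible alternative, sketched here with toy frames, places the new columns by name instead of by position; the function name `add_realm_kingdom` is hypothetical, and the sketch assumes both tables are in the same row order.

```python
import pandas as pd

def add_realm_kingdom(main, lineage_list, after="superkingdom"):
    """Return a copy of main with realm/kingdom inserted right after `after`."""
    if len(main) != len(lineage_list):
        raise ValueError("main and lineage_list must have the same number of rows")
    out = main.copy()
    # to_numpy() copies by position, ignoring any index mismatch
    out["realm"] = lineage_list["Realm"].to_numpy()
    out["kingdom"] = lineage_list["Kingdom"].to_numpy()
    cols = [c for c in out.columns if c not in ("realm", "kingdom")]
    split = cols.index(after) + 1
    return out[cols[:split] + ["realm", "kingdom"] + cols[split:]]

# Toy frames standing in for the real tables
toy_main = pd.DataFrame({"Hit": ["a", "b"], "superkingdom": ["Viruses", "Bacteria"], "phylum": ["", ""]})
toy_lineage = pd.DataFrame({"Realm": ["Varidnaviria", "Bacteria"], "Kingdom": ["Bamfordvirae", ""]})
print(add_realm_kingdom(toy_main, toy_lineage).columns.tolist())
```

Raising on a length mismatch, rather than skipping silently, makes it obvious when failed taxonomy lookups have left the two tables out of step.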
