# Species wrangling
In this notebook, I combine the manually curated spreadsheets listing the species we're interested in for this project, then look for genome or transcriptome sources for the species we don't already have data for.

In [1]:
import pandas as pd
from os import listdir
from os.path import isfile, exists
from openpyxl import load_workbook
from collections import defaultdict, Counter
import numpy as np
from pysradb.sraweb import SRAweb
from tqdm import tqdm
import subprocess



In [38]:
run_date = '18Sep2025'

## Reading in the data

In [13]:
phyto_origin = pd.read_csv('../data/species_info/species_info_full.csv')
phyto_origin.head()

Unnamed: 0,abbreviated_species,species_gbif,species,common_name,has_genome,has_transcriptome,original_source,thermotolerance,reference,notes
0,A. americanus,Acorus americanus,Acorus americanus,,,,phytozome,,,
1,A. tequilanavar,Agave tequilana,Agave tequilana,,,,phytozome,,,
2,A. officinalis,Althaea officinalis,Althaea officinalis,marsh-mallow,,,phytozome,,,
3,A. linifolium,Alyssum linifolium,Alyssum linifolium,,,,phytozome,,,
4,A. hypochondriacus,Amaranthus hypochondriacus,Amaranthus hypochondriacus,,,,phytozome,,,


Add a path for the genomes we already have:

In [14]:
genome_top_path = '/mnt/research/Walker_Lab_Research/Serena_project_data/selection-under-heat_data/phytozome_genomes'
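# The text before the first underscore in each file name is the genus initial fused
# to the species epithet; rebuild the 'G. species' abbreviation from it
sp_path = [(f"{f.split('_')[0][0]}. {f.split('_')[0][1:]}", f'{genome_top_path}/{f}') for f in listdir(genome_top_path)]
sp_path_df = pd.DataFrame(sp_path, columns=['abbreviated_species', 'full_path'])
# Right merge keeps every species in phyto_origin, even those without a local genome file
phyto_paths = sp_path_df.merge(phyto_origin, on='abbreviated_species', how='right')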
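
As a minimal sketch of the abbreviation step above, the helper below rebuilds the `G. species` label from a file name. The function name `abbreviate` and the example file name are hypothetical; the only thing taken from the notebook is the assumption that the text before the first underscore is the genus initial fused to the species epithet.

```python
def abbreviate(filename):
    """Turn a name like 'Aamericanus_...' into 'A. americanus'."""
    # Everything before the first underscore identifies the species
    prefix = filename.split('_')[0]
    # The first character is the genus initial, the rest is the epithet
    return f"{prefix[0]}. {prefix[1:]}"


# Hypothetical file name, for illustration only
print(abbreviate('Aamericanus_example.fa'))
```

Any file whose name does not follow this pattern would get an abbreviation that fails to match `abbreviated_species`, and the right merge would leave its `full_path` empty.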

Define a function to read hyperlink URLs from Excel files:

In [15]:
def get_hyperlink_columns(base_df, path, sheet_name):
    """
    Use openpyxl to detect hyperlinks and add them as new columns.

    From: https://github.com/serenalotreck/greencut-formatting/blob/main/notebooks/greencut_database_input_formatting.ipynb

    parameters:
        base_df, pandas df: the result of read_excel on path
        path, str: path to the excel document
        sheet_name, str: name of the sheet to read

    returns:
        updated_df, pandas df: base_df with added columns for hyperlinks
    """
    # Load the workbook using openpyxl
    workbook = load_workbook(path)
    
    # Select the sheet
    sheet = workbook[sheet_name] if sheet_name else workbook.active

    # Openpyxl will maintain rows that aren't relevant to the dataframe,
    # whereas pandas.read_excel just drops those -- we need to make them agree
    if sheet.max_row != len(base_df):
        rows_to_delete = [i for i in range(len(base_df) + 1, sheet.max_row)]
        for row_index in reversed(rows_to_delete):
            sheet.delete_rows(row_index)

    # Detect columns with hyperlinks
    hyperlink_cols = defaultdict(list)
    for row_idx, row in enumerate(sheet.iter_rows()):
        # Skip the first row, since it's the header
        if row_idx == 0:
            continue
        for col_index, cell in enumerate(row):
            hyperlink = cell.hyperlink.target if cell.hyperlink else None
            header_name = sheet.cell(row=1, column=col_index + 1).value
            if hyperlink:
                hyperlink_cols[header_name + "_hyperlink"].append(hyperlink)
            else:
                # We need to add NaN if there's no hyperlink to keep the values
                # in the right rows, but this does mean we'll get non-hyperlink columns
                hyperlink_cols[header_name + "_hyperlink"].append(np.nan)

    # Create a df with the new columns
    hyperlink_df = pd.DataFrame(hyperlink_cols)
    
    # Now remove any columns that are all NaN
    hyperlink_df = hyperlink_df.dropna(axis=1, how='all')

    # Concat this horizontally to the original df
    updated_df = pd.concat([base_df, hyperlink_df], axis=1)

    return updated_df

In [16]:
with_lit = pd.read_excel('../data/species_info/thermophilic_plant_species.xlsx', sheet_name='Sheet1')
with_lit.columns = with_lit.columns.str.replace(' ', '_')
with_lit = with_lit.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
with_lit.head()

Unnamed: 0,Species_Name,Species_Notes,Transciptomic_data_in_NCBI_with_SRA_data,Publication_Link,Notes
0,Heliotropium curisavvicum,a prostrate plant from the Mojave desert. You...,,,Rowan suggested
1,Sesuvium verrucosum,Succulent halophytes from deserts and hot coas...,,,Rowan suggested
2,Sesuvium trianthema,Succulent halophytes from deserts and hot coas...,,,Rowan suggested
3,Ipomea pes-caprae,"tropical beach vine, grows on hot sands of tro...",,,Rowan suggested
4,Mollugo verticillata,carpet weed. Grows across blacktop that burns...,,,Rowan suggested


## Wrangling
Drop two rows that manual inspection showed contain no species information:

In [17]:
with_lit = with_lit.drop(index=[60,125])
with_lit.tail()

Unnamed: 0,Species_Name,Species_Notes,Transciptomic_data_in_NCBI_with_SRA_data,Publication_Link,Notes
120,Azorella trifurcata (Gaertn.) Pers.,,,,"Grows in the high Andes, super arid, appears t..."
121,Azorella trisecta (H.Wolff) Mart.Fernández & C...,,,,"Grows in the high Andes, super arid, appears t..."
122,Azorella ulicina (Gillies & Hook.) G.M.Plunket...,,,,"Grows in the high Andes, super arid, appears t..."
123,Azorella valentini (Speg.) Mart.Fernández & C....,,,,"Grows in the high Andes, super arid, appears t..."
124,Cistanthe longiscapa,CAM species that grows in the atacama desert,,,"Paper from Ariel Orellana, gave a talk in the ..."


In [18]:
lit_w_links = get_hyperlink_columns(with_lit, '../data/species_info/thermophilic_plant_species.xlsx', sheet_name='Sheet1').drop(index=60)
lit_w_links.tail()

Unnamed: 0,Species_Name,Species_Notes,Transciptomic_data_in_NCBI_with_SRA_data,Publication_Link,Notes,Species Name_hyperlink
120,Azorella trifurcata (Gaertn.) Pers.,,,,"Grows in the high Andes, super arid, appears t...",https://en.wikipedia.org/w/index.php?title=Azo...
121,Azorella trisecta (H.Wolff) Mart.Fernández & C...,,,,"Grows in the high Andes, super arid, appears t...",https://en.wikipedia.org/w/index.php?title=Azo...
122,Azorella ulicina (Gillies & Hook.) G.M.Plunket...,,,,"Grows in the high Andes, super arid, appears t...",https://en.wikipedia.org/w/index.php?title=Azo...
123,Azorella valentini (Speg.) Mart.Fernández & C....,,,,"Grows in the high Andes, super arid, appears t...",
124,Cistanthe longiscapa,CAM species that grows in the atacama desert,,,"Paper from Ariel Orellana, gave a talk in the ...",


Keep only the first two whitespace-delimited words of the species column (genus and species epithet), dropping any trailing text such as author citations, to get a base species name.

In [19]:
lit_w_links['species'] = lit_w_links['Species_Name'].str.split(r'[\xa0\s]+').str[:2].str.join(' ')
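
The same truncation can be sketched on plain strings with the standard `re` module. The helper name `base_species` is hypothetical, and the second example string (with a non-breaking space) is made up to show why `\xa0` appears in the pattern.

```python
import re


def base_species(name):
    """Keep the first two whitespace-delimited words of a species name."""
    # In Python 3, \s already matches the non-breaking space (\xa0) for str
    # patterns; it is listed explicitly to mirror the notebook's pattern
    words = re.split(r'[\xa0\s]+', name)
    return ' '.join(words[:2])


examples = [
    'Azorella trifurcata (Gaertn.) Pers.',  # name taken from the table above
    'Cistanthe\xa0longiscapa',               # made-up non-breaking space
]
for name in examples:
    print(base_species(name))
```

Note that a name with leading whitespace would split into an empty first word, so stripping the names before this step matters.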

In [20]:
lit_w_links = lit_w_links.rename(columns={'Notes': 'original_source', 'Species_Notes': 'notes', 'Publication_Link': 'reference'})
lit_w_links.head()

Unnamed: 0,Species_Name,notes,Transciptomic_data_in_NCBI_with_SRA_data,reference,original_source,Species Name_hyperlink,species
0,Heliotropium curisavvicum,a prostrate plant from the Mojave desert. You...,,,Rowan suggested,,Heliotropium curisavvicum
1,Sesuvium verrucosum,Succulent halophytes from deserts and hot coas...,,,Rowan suggested,,Sesuvium verrucosum
2,Sesuvium trianthema,Succulent halophytes from deserts and hot coas...,,,Rowan suggested,,Sesuvium trianthema
3,Ipomea pes-caprae,"tropical beach vine, grows on hot sands of tro...",,,Rowan suggested,,Ipomea pes-caprae
4,Mollugo verticillata,carpet weed. Grows across blacktop that burns...,,,Rowan suggested,,Mollugo verticillata


Merge with the Phytozome list:

In [21]:
phyto_paths['reference'] = phyto_paths['reference'].astype('object')
all_species = lit_w_links.merge(phyto_paths, on=['species', 'notes', 'reference'], suffixes=('_berkley', '_serena'), how='outer')
all_species = all_species.drop(columns=['Species_Name', 'Transciptomic_data_in_NCBI_with_SRA_data', 'Species Name_hyperlink', 'abbreviated_species'])
all_species = all_species[['species', 'thermotolerance'] + sorted(c for c in all_species.columns if c != 'species' and c != 'thermotolerance')]
all_species = all_species.drop_duplicates()
all_species.head()

Unnamed: 0,species,thermotolerance,common_name,full_path,has_genome,has_transcriptome,notes,original_source_berkley,original_source_serena,reference,species_gbif
0,Acorus americanus,,,/mnt/research/Walker_Lab_Research/Serena_proje...,,,,,phytozome,,Acorus americanus
1,Agave sp.,,,,,,,,,,
2,Agave tequilana,,,/mnt/research/Walker_Lab_Research/Serena_proje...,,,,,phytozome,,Agave tequilana
3,Althaea officinalis,,marsh-mallow,/mnt/research/Walker_Lab_Research/Serena_proje...,,,,,phytozome,,Althaea officinalis
4,Alyssum linifolium,,,/mnt/research/Walker_Lab_Research/Serena_proje...,,,,,phytozome,,Alyssum linifolium


How many genera are represented here?

In [22]:
unique_genera = all_species.species.str.split(' ').str[0].value_counts()
print(f'There are {len(unique_genera)} unique genera in the list. The top genus is {unique_genera.idxmax()}, with {unique_genera.max()} species')
spec_in_common = set(phyto_paths.species).intersection(set(lit_w_links.species))
print(f'There were only {len(spec_in_common)} species in common between the two lists: {", ".join(list(spec_in_common))}')

There are 159 unique genera in the list. The top genus is Azorella, with 58 species
There were only 14 species in common between the two lists: Boechera crandallii, Boechera gunnisoniana, Salvadora persica, Boechera depauperata, Haloxylon salicornicum, Boechera pallidifolia, Boechera perennans, Vigna unguiculata, Calligonum comosum, Avicennia marina, Boechera holboellii, Salsola villosa, Phoenix dactylifera, Boechera stricta
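
For reference, the genus count and the overlap check can also be sketched with the standard library alone. The helper name `genus_counts` and the two toy lists are illustrative only; the species names appear in this notebook, but which list each one sits in here is made up.

```python
from collections import Counter


def genus_counts(species_names):
    """Count how many species names fall under each genus (first word)."""
    return Counter(name.split(' ')[0] for name in species_names)


# Toy lists; list membership is illustrative only
list_a = ['Boechera stricta', 'Boechera holboellii', 'Avicennia marina']
list_b = ['Boechera stricta', 'Azorella selago', 'Avicennia marina']

# Count each species once, even if it appears in both lists
counts = genus_counts(set(list_a) | set(list_b))
top_genus, n_species = counts.most_common(1)[0]

# Sorting makes the printed overlap deterministic, unlike raw set order
shared = sorted(set(list_a) & set(list_b))
print(top_genus, n_species, shared)
```

Sorting the intersection before joining it gives the same order on every run, which makes the printed list easier to compare between run dates.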


Save the combined species list:

In [23]:
all_species.to_csv(f'../data/species_info/berkley_serena_combined_species_list_{run_date}.csv', index=False)

## Looking in SRA
Before manually searching for each of these species, we want to confirm that each one has at least one entry in SRA. We'll use [`pysradb`](https://github.com/saketkc/pysradb) for this. Because we can only use RNA-Seq data, we also filter on library strategy here.
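
One design choice in this filter is whether to match the library strategy by substring or exactly. The sketch below works on plain dictionaries rather than on `pysradb` output; the helper name `rna_seq_only` and the records are made up for illustration. It shows that a substring match on `RNA` also keeps strategies such as `miRNA-Seq`, which may or may not be wanted.

```python
def rna_seq_only(records, exact=True):
    """Filter SRA-style metadata records by library strategy."""
    if exact:
        # Keep only records labelled exactly 'RNA-Seq'
        return [r for r in records if r.get('library_strategy') == 'RNA-Seq']
    # Substring match: keeps any strategy whose label contains 'RNA'
    return [r for r in records if 'RNA' in (r.get('library_strategy') or '')]


# Made-up records, for illustration only
records = [
    {'run': 'run_1', 'library_strategy': 'RNA-Seq'},
    {'run': 'run_2', 'library_strategy': 'miRNA-Seq'},
    {'run': 'run_3', 'library_strategy': 'WGS'},
    {'run': 'run_4', 'library_strategy': None},
]

print(len(rna_seq_only(records, exact=False)))
print(len(rna_seq_only(records, exact=True)))
```

If only standard RNA-Seq libraries are usable downstream, the exact comparison is the safer of the two; the substring version is more permissive.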

In [24]:
db = SRAweb()

In [104]:
all_studies = {}
missing_rnaseq = []
for spec in tqdm(all_species.species):
    try:
        # Species with a local genome file don't need an SRA search
        if isfile(all_species.loc[all_species.species==spec, 'full_path'].values[0]):
            print(f'Skipping species {spec} because it has a genome in Phytozome')
            continue
    except TypeError:
        # full_path is NaN when there is no local genome, which makes isfile
        # raise a TypeError; those are the species we search for in SRA
        studies = db.search_sra(spec, detailed=True)
        try:
            # Keep only studies whose library strategy mentions RNA
            studies_rna = studies[studies.library_strategy.str.contains('RNA')]
            if len(studies_rna) > 0:
                print(f'There are {len(studies_rna)} RNA-seq studies in SRA for species {spec}')
                all_studies[spec] = studies_rna
            else:
                print(f'0 RNA-seq studies found for species {spec}.')
                missing_rnaseq.append(spec)
        except AttributeError:
            # The search gave nothing we can filter, so count the species as missing
            print(f'0 RNA-seq studies found for species {spec}.')
            missing_rnaseq.append(spec)

  0%|          | 0/294 [00:00<?, ?it/s]

Skipping species Acorus americanus because it has a genome in Phytozome


  1%|          | 2/294 [00:03<09:39,  1.99s/it]

0 RNA-seq studies found for species Agave sp..
Skipping species Agave tequilana  because it has a genome in Phytozome
Skipping species Althaea officinalis because it has a genome in Phytozome
Skipping species Alyssum linifolium because it has a genome in Phytozome
Skipping species Amaranthus hypochondriacus because it has a genome in Phytozome
Skipping species Amborella trichopoda because it has a genome in Phytozome
Skipping species Anacardium occidentale because it has a genome in Phytozome
Skipping species Ananas comosus because it has a genome in Phytozome
Skipping species Andropogon gerardi because it has a genome in Phytozome
Skipping species Aquilegia caerulea because it has a genome in Phytozome
Skipping species Arabidopsis halleri because it has a genome in Phytozome
Skipping species Arabidopsis lyrata because it has a genome in Phytozome
Skipping species Arabidopsis thaliana because it has a genome in Phytozome
Skipping species Arachis hypogaea because it has a genome in Phyt

  5%|▌         | 16/294 [00:33<09:44,  2.10s/it]

There are 127 RNA-seq studies in SRA for species Avicennia marina


  6%|▌         | 17/294 [00:56<17:57,  3.89s/it]

There are 127 RNA-seq studies in SRA for species Avicennia marina


  6%|▌         | 18/294 [01:00<17:46,  3.86s/it]

0 RNA-seq studies found for species Azorella acaulis.
No results found for Azorella albovaginata
0 RNA-seq studies found for species Azorella albovaginata.


  7%|▋         | 22/294 [01:03<11:27,  2.53s/it]

0 RNA-seq studies found for species Azorella allanii.
No results found for Azorella ameghinoi
0 RNA-seq studies found for species Azorella ameghinoi.
No results found for Azorella andina
0 RNA-seq studies found for species Azorella andina.
No results found for Azorella aretioides
0 RNA-seq studies found for species Azorella aretioides.


  8%|▊         | 24/294 [01:05<09:07,  2.03s/it]

No results found for Azorella biloba
0 RNA-seq studies found for species Azorella biloba.
No results found for Azorella boelckei
0 RNA-seq studies found for species Azorella boelckei.


  9%|▉         | 26/294 [01:08<08:21,  1.87s/it]

0 RNA-seq studies found for species Azorella burkartii.


  9%|▉         | 27/294 [01:11<09:42,  2.18s/it]

0 RNA-seq studies found for species Azorella cockaynei.


 10%|▉         | 28/294 [01:15<11:10,  2.52s/it]

0 RNA-seq studies found for species Azorella colensoi.
No results found for Azorella compacta
0 RNA-seq studies found for species Azorella compacta.


 11%|█         | 32/294 [01:19<06:24,  1.47s/it]

0 RNA-seq studies found for species Azorella corymbosa.
No results found for Azorella crassipes
0 RNA-seq studies found for species Azorella crassipes.
No results found for Azorella crenata
0 RNA-seq studies found for species Azorella crenata.
No results found for Azorella cryptantha
0 RNA-seq studies found for species Azorella cryptantha.


 12%|█▏        | 35/294 [01:20<04:20,  1.01s/it]

No results found for Azorella cuatrecasasii
0 RNA-seq studies found for species Azorella cuatrecasasii.
No results found for Azorella diapensioides
0 RNA-seq studies found for species Azorella diapensioides.
No results found for Azorella diversifolia
0 RNA-seq studies found for species Azorella diversifolia.


 13%|█▎        | 37/294 [01:21<03:47,  1.13it/s]

No results found for Azorella echegarayi
0 RNA-seq studies found for species Azorella echegarayi.


 13%|█▎        | 38/294 [01:24<05:37,  1.32s/it]

0 RNA-seq studies found for species Azorella exigua.
No results found for Azorella filamentosa
0 RNA-seq studies found for species Azorella filamentosa.


 14%|█▎        | 40/294 [01:27<05:55,  1.40s/it]

0 RNA-seq studies found for species Azorella fragosea.


 14%|█▍        | 41/294 [01:28<04:54,  1.16s/it]

No results found for Azorella fuegiana
0 RNA-seq studies found for species Azorella fuegiana.


 14%|█▍        | 42/294 [01:32<08:18,  1.98s/it]

0 RNA-seq studies found for species Azorella haastii.
No results found for Azorella hallei
0 RNA-seq studies found for species Azorella hallei.


 15%|█▍        | 44/294 [01:36<07:37,  1.83s/it]

0 RNA-seq studies found for species Azorella hookeri.


 15%|█▌        | 45/294 [01:40<09:28,  2.28s/it]

0 RNA-seq studies found for species Azorella hydrocotyloides.
No results found for Azorella julianii
0 RNA-seq studies found for species Azorella julianii.


 16%|█▌        | 47/294 [01:43<08:09,  1.98s/it]

0 RNA-seq studies found for species Azorella lyallii.


 16%|█▋        | 48/294 [01:47<09:57,  2.43s/it]

0 RNA-seq studies found for species Azorella lycopodioides.


 17%|█▋        | 51/294 [01:51<06:55,  1.71s/it]

0 RNA-seq studies found for species Azorella macquariensis.
No results found for Azorella madreporica
0 RNA-seq studies found for species Azorella madreporica.
No results found for Azorella microphylla
0 RNA-seq studies found for species Azorella microphylla.
No results found for Azorella monantha
0 RNA-seq studies found for species Azorella monantha.


 18%|█▊        | 54/294 [01:52<04:17,  1.07s/it]

No results found for Azorella monteroi
0 RNA-seq studies found for species Azorella monteroi.
No results found for Azorella multifida
0 RNA-seq studies found for species Azorella multifida.


 19%|█▉        | 56/294 [01:55<04:36,  1.16s/it]

0 RNA-seq studies found for species Azorella nitens.
No results found for Azorella nivalis
0 RNA-seq studies found for species Azorella nivalis.


 19%|█▉        | 57/294 [01:58<06:27,  1.63s/it]

0 RNA-seq studies found for species Azorella pallida.


 20%|█▉        | 58/294 [01:59<05:10,  1.31s/it]

No results found for Azorella patagonica
0 RNA-seq studies found for species Azorella patagonica.
No results found for Azorella pedunculata
0 RNA-seq studies found for species Azorella pedunculata.


 21%|██        | 62/294 [02:02<03:29,  1.11it/s]

0 RNA-seq studies found for species Azorella polaris.
No results found for Azorella prolifera
0 RNA-seq studies found for species Azorella prolifera.
No results found for Azorella pulvinata
0 RNA-seq studies found for species Azorella pulvinata.


 21%|██▏       | 63/294 [02:05<05:01,  1.30s/it]

0 RNA-seq studies found for species Azorella ranunculus.


 22%|██▏       | 64/294 [02:09<07:17,  1.90s/it]

0 RNA-seq studies found for species Azorella robusta.


 22%|██▏       | 66/294 [02:13<06:47,  1.79s/it]

0 RNA-seq studies found for species Azorella roughii.
No results found for Azorella ruizii
0 RNA-seq studies found for species Azorella ruizii.


 23%|██▎       | 67/294 [02:16<08:08,  2.15s/it]

0 RNA-seq studies found for species Azorella schizeilema.


 24%|██▍       | 70/294 [02:20<05:50,  1.57s/it]

0 RNA-seq studies found for species Azorella selago.
No results found for Azorella spinosa
0 RNA-seq studies found for species Azorella spinosa.
No results found for Azorella triacantha
0 RNA-seq studies found for species Azorella triacantha.
No results found for Azorella trifoliolata
0 RNA-seq studies found for species Azorella trifoliolata.


 24%|██▍       | 72/294 [02:25<07:12,  1.95s/it]

0 RNA-seq studies found for species Azorella trifurcata.


 26%|██▌       | 75/294 [02:29<05:17,  1.45s/it]

There are 1 RNA-seq studies in SRA for species Azorella trisecta
No results found for Azorella ulicina
0 RNA-seq studies found for species Azorella ulicina.
No results found for Azorella valentini
0 RNA-seq studies found for species Azorella valentini.
Skipping species Beta vulgaris subsp. vulgaris because it has a genome in Phytozome
Skipping species Betula platyphylla because it has a genome in Phytozome


 27%|██▋       | 78/294 [02:32<04:38,  1.29s/it]

0 RNA-seq studies found for species Boechera arcuata.


 27%|██▋       | 79/294 [02:36<06:27,  1.80s/it]

0 RNA-seq studies found for species Boechera breweri.


 27%|██▋       | 80/294 [02:40<07:42,  2.16s/it]

0 RNA-seq studies found for species Boechera cobrensis.


 28%|██▊       | 81/294 [02:44<08:49,  2.49s/it]

0 RNA-seq studies found for species Boechera constancei.


 28%|██▊       | 82/294 [02:47<09:51,  2.79s/it]

0 RNA-seq studies found for species Boechera crandallii.


 28%|██▊       | 83/294 [02:52<11:05,  3.16s/it]

There are 6 RNA-seq studies in SRA for species Boechera depauperata


 29%|██▊       | 84/294 [02:55<11:31,  3.29s/it]

0 RNA-seq studies found for species Boechera dispar.


 29%|██▉       | 85/294 [03:01<13:33,  3.89s/it]

There are 16 RNA-seq studies in SRA for species Boechera divaricarpa


 29%|██▉       | 86/294 [03:04<13:13,  3.82s/it]

0 RNA-seq studies found for species Boechera falcata.


 30%|██▉       | 87/294 [04:49<1:52:59, 32.75s/it]

There are 124 RNA-seq studies in SRA for species Boechera falcatoria


 30%|██▉       | 88/294 [04:52<1:23:15, 24.25s/it]

0 RNA-seq studies found for species Boechera fecunda.


 30%|███       | 89/294 [04:56<1:02:09, 18.19s/it]

There are 3 RNA-seq studies in SRA for species Boechera formosa
No results found for Boechera gracilenta
0 RNA-seq studies found for species Boechera gracilenta.


 31%|███       | 91/294 [04:59<35:41, 10.55s/it]  

There are 12 RNA-seq studies in SRA for species Boechera gunnisoniana


 31%|███▏      | 92/294 [05:02<29:44,  8.83s/it]

0 RNA-seq studies found for species Boechera hoffmannii.


 32%|███▏      | 93/294 [05:07<25:43,  7.68s/it]

There are 5 RNA-seq studies in SRA for species Boechera holboellii


 32%|███▏      | 94/294 [05:10<21:56,  6.58s/it]

There are 3 RNA-seq studies in SRA for species Boechera lasiocarpa


 32%|███▏      | 95/294 [05:14<19:10,  5.78s/it]

There are 3 RNA-seq studies in SRA for species Boechera lignifera


 33%|███▎      | 96/294 [06:48<1:42:30, 31.06s/it]

There are 124 RNA-seq studies in SRA for species Boechera ophira
No results found for Boechera oxylobula
0 RNA-seq studies found for species Boechera oxylobula.


 34%|███▎      | 99/294 [06:52<43:55, 13.52s/it]  

There are 11 RNA-seq studies in SRA for species Boechera pallidifolia
No results found for Boechera patens
0 RNA-seq studies found for species Boechera patens.


 34%|███▍      | 100/294 [06:54<34:41, 10.73s/it]

0 RNA-seq studies found for species Boechera paupercula.


 34%|███▍      | 101/294 [06:58<28:35,  8.89s/it]

0 RNA-seq studies found for species Boechera pendulina.


 35%|███▍      | 102/294 [07:02<24:14,  7.57s/it]

There are 3 RNA-seq studies in SRA for species Boechera perennans


 35%|███▌      | 103/294 [07:06<20:35,  6.47s/it]

0 RNA-seq studies found for species Boechera platysperma.


 35%|███▌      | 104/294 [07:10<17:58,  5.68s/it]

0 RNA-seq studies found for species Boechera puberula.


 36%|███▌      | 105/294 [07:14<16:14,  5.15s/it]

0 RNA-seq studies found for species Boechera pulchra.
No results found for Boechera quebecensis
0 RNA-seq studies found for species Boechera quebecensis.


 36%|███▋      | 107/294 [07:19<12:19,  3.96s/it]

0 RNA-seq studies found for species Boechera retrofracta.


 37%|███▋      | 108/294 [07:22<12:01,  3.88s/it]

There are 3 RNA-seq studies in SRA for species Boechera schistacea


 37%|███▋      | 109/294 [07:26<11:44,  3.81s/it]

0 RNA-seq studies found for species Boechera sparsiflora.


 37%|███▋      | 110/294 [07:30<11:40,  3.81s/it]

0 RNA-seq studies found for species Boechera spatifolia.
Skipping species Boechera stricta because it has a genome in Phytozome
Skipping species Boechera stricta because it has a genome in Phytozome
No results found for Boechera tularensis
0 RNA-seq studies found for species Boechera tularensis.


 39%|███▉      | 114/294 [07:33<05:53,  1.97s/it]

There are 12 RNA-seq studies in SRA for species Boechera williamsii


 42%|████▏     | 123/294 [07:37<02:10,  1.31it/s]

0 RNA-seq studies found for species Boechera yorkii.
Skipping species Botryococcus braunii because it has a genome in Phytozome
Skipping species Brachypodium distachyon because it has a genome in Phytozome
Skipping species Brachypodium hybridum because it has a genome in Phytozome
Skipping species Brachypodium stacei because it has a genome in Phytozome
Skipping species Brassica juncea because it has a genome in Phytozome
Skipping species Brassica oleracea var. capitata because it has a genome in Phytozome
Skipping species Brassica rapa because it has a genome in Phytozome
No results found for Breonadia salicina
0 RNA-seq studies found for species Breonadia salicina.
Skipping species Cakile maritima because it has a genome in Phytozome
No results found for Calligonum comosum
0 RNA-seq studies found for species Calligonum comosum.
Skipping species Camelina sativa var. DH55 because it has a genome in Phytozome
Skipping species Capsella rubella because it has a genome in Phytozome
Skippin

 44%|████▍     | 129/294 [07:42<02:15,  1.22it/s]

There are 5 RNA-seq studies in SRA for species Carnegiea gigantea
Skipping species Carya illinoinensis because it has a genome in Phytozome
Skipping species Castanea dentata because it has a genome in Phytozome
Skipping species Castanea mollissima because it has a genome in Phytozome
Skipping species Caulanthus amplexicaulis because it has a genome in Phytozome


 46%|████▌     | 134/294 [07:47<02:17,  1.16it/s]

0 RNA-seq studies found for species Celtis toka.
Skipping species Ceratodon purpureus because it has a genome in Phytozome
Skipping species Ceratopteris richardii because it has a genome in Phytozome
Skipping species Cercis canadensis because it has a genome in Phytozome
Skipping species Chamaecrista fasciculata because it has a genome in Phytozome
Skipping species Chasmanthium laxum because it has a genome in Phytozome
Skipping species Chlamydomonas reinhardtii because it has a genome in Phytozome
Skipping species Chromochloris zofingiensis because it has a genome in Phytozome
Skipping species Cicer arietinum because it has a genome in Phytozome
Skipping species Cinnamomum kanehirae because it has a genome in Phytozome


 49%|████▉     | 144/294 [07:51<01:34,  1.59it/s]

There are 18 RNA-seq studies in SRA for species Cistanthe longiscapa
Skipping species Citrus clementina because it has a genome in Phytozome
Skipping species Cleome violacea because it has a genome in Phytozome
Skipping species Clonorchis sinensis because it has a genome in Phytozome
Skipping species Coccomyxa subellipsoidea because it has a genome in Phytozome
Skipping species Coffea arabica because it has a genome in Phytozome
Skipping species Coreopsis grandiflora because it has a genome in Phytozome
Skipping species Corylus americana var. rush because it has a genome in Phytozome
Skipping species Corylus avellana because it has a genome in Phytozome
Skipping species Corymbia citriodora because it has a genome in Phytozome
Skipping species Crambe hispanica because it has a genome in Phytozome
Skipping species Crocus sativus because it has a genome in Phytozome


 53%|█████▎    | 156/294 [07:55<01:08,  2.02it/s]

There are 5 RNA-seq studies in SRA for species Cucurbita palmata


 53%|█████▎    | 157/294 [07:58<01:30,  1.51it/s]

0 RNA-seq studies found for species Cylindropuntia acanthocarpa.


 54%|█████▎    | 158/294 [08:02<01:57,  1.16it/s]

0 RNA-seq studies found for species Cylindropuntia bigelovii.
Skipping species Daucus carota because it has a genome in Phytozome
Skipping species Dendrocalamus strictus because it has a genome in Phytozome
Skipping species Dioscorea alata because it has a genome in Phytozome
Skipping species Diphasiastrum complanatum because it has a genome in Phytozome
Skipping species Dunaliella salina because it has a genome in Phytozome
No results found for Echinocactus sp
0 RNA-seq studies found for species Echinocactus sp.
Skipping species Ehretia anacua because it has a genome in Phytozome
Skipping species Eruca vesicaria because it has a genome in Phytozome
Skipping species Eschscholzia californica because it has a genome in Phytozome
Skipping species Eucalyptus grandis because it has a genome in Phytozome
Skipping species Euclidium syriacum because it has a genome in Phytozome
Skipping species Eutrema salsugineum because it has a genome in Phytozome


 58%|█████▊    | 171/294 [08:05<01:02,  1.95it/s]

0 RNA-seq studies found for species Ficus vasta.
Skipping species Fragaria vesca because it has a genome in Phytozome
Skipping species Glycine max because it has a genome in Phytozome
Skipping species Glycine max because it has a genome in Phytozome
Skipping species Glycine max because it has a genome in Phytozome
Skipping species Glycine soja because it has a genome in Phytozome
Skipping species Gossypium barbadense because it has a genome in Phytozome
Skipping species Gossypium darwinii because it has a genome in Phytozome
Skipping species Gossypium hirsutum because it has a genome in Phytozome
Skipping species Gossypium mustelinum because it has a genome in Phytozome
Skipping species Gossypium raimondii because it has a genome in Phytozome
Skipping species Gossypium tomentosum because it has a genome in Phytozome


 62%|██████▏   | 183/294 [08:15<01:12,  1.54it/s]

There are 16 RNA-seq studies in SRA for species Haloxylon persicum


 63%|██████▎   | 184/294 [08:29<02:22,  1.29s/it]

0 RNA-seq studies found for species Haloxylon salicornicum.
Skipping species Helianthus annuus because it has a genome in Phytozome


 63%|██████▎   | 186/294 [08:36<02:44,  1.52s/it]

There are 16 RNA-seq studies in SRA for species Heliotropium curisavvicum
Skipping species Hordeum vulgare because it has a genome in Phytozome
Skipping species Hydrangea quercifolia because it has a genome in Phytozome
Skipping species Hydrocotyle leucocephala because it has a genome in Phytozome
Skipping species Iberis amara because it has a genome in Phytozome
Skipping species Indigofera tinctoria because it has a genome in Phytozome
No results found for Ipomea pes-caprae
0 RNA-seq studies found for species Ipomea pes-caprae.


 66%|██████▌   | 193/294 [08:40<01:58,  1.17s/it]

0 RNA-seq studies found for species Juniperus procera.
Skipping species Kalanchoe fedtschenkoi because it has a genome in Phytozome
Skipping species Lactuca sativa because it has a genome in Phytozome
Skipping species Lens culinaris because it has a genome in Phytozome
Skipping species Lens ervoides because it has a genome in Phytozome
Skipping species Lepidium sativum because it has a genome in Phytozome


 68%|██████▊   | 199/294 [08:43<01:34,  1.01it/s]

0 RNA-seq studies found for species Leptadenia pyrotechnica.
Skipping species Liriodendron tulipifera because it has a genome in Phytozome
Skipping species Lotus japonicus because it has a genome in Phytozome
Skipping species Lunaria annua because it has a genome in Phytozome
Skipping species Lupinus albus because it has a genome in Phytozome
Skipping species Malcolmia maritima because it has a genome in Phytozome
Skipping species Malus domestica because it has a genome in Phytozome
Skipping species Marchantia polymorpha because it has a genome in Phytozome
Skipping species Medicago truncatula because it has a genome in Phytozome
Skipping species Micromonas pusilla because it has a genome in Phytozome
Skipping species Mimulus guttatus because it has a genome in Phytozome
Skipping species Mimulus guttatus because it has a genome in Phytozome
Skipping species Mimulus nasutus because it has a genome in Phytozome
Skipping species Mimulus tilingii because it has a genome in Phytozome
No res

 73%|███████▎  | 215/294 [08:47<00:44,  1.79it/s]

There are 1 RNA-seq studies in SRA for species Mollugo pentaphylla


 73%|███████▎  | 216/294 [08:50<00:52,  1.48it/s]

There are 2 RNA-seq studies in SRA for species Mollugo verticillata
Skipping species Morchella esculenta because it has a genome in Phytozome
No results found for Moringa peregrina
0 RNA-seq studies found for species Moringa peregrina.
Skipping species Musa acuminata because it has a genome in Phytozome
Skipping species Myagrum perfoliatum because it has a genome in Phytozome
Skipping species Notholithocarpus densiflorus because it has a genome in Phytozome
Skipping species Nymphaea colorata because it has a genome in Phytozome
Skipping species Olea europaea because it has a genome in Phytozome


 76%|███████▌  | 224/294 [08:53<00:40,  1.71it/s]

There are 6 RNA-seq studies in SRA for species Opuntia sp
Skipping species Oropetium thomaeum because it has a genome in Phytozome
Skipping species Oryza sativa because it has a genome in Phytozome
Skipping species Ostreococcus lucimarinus because it has a genome in Phytozome
Skipping species Panicum hallii because it has a genome in Phytozome
Skipping species Panicum virgatum because it has a genome in Phytozome
Skipping species Phaseolus acutifolius because it has a genome in Phytozome
Skipping species Phaseolus coccineus because it has a genome in Phytozome
Skipping species Phaseolus lunatus because it has a genome in Phytozome


 79%|███████▉  | 233/294 [09:30<01:53,  1.86s/it]

There are 340 RNA-seq studies in SRA for species Phoenix dactylifera
Skipping species Physcomitrella patens because it has a genome in Phytozome
Skipping species Podocarpus latifolius because it has a genome in Phytozome
Skipping species Poncirus trifoliata because it has a genome in Phytozome
Skipping species Populus deltoides because it has a genome in Phytozome
Skipping species Populus nigra x maximowiczii because it has a genome in Phytozome
Skipping species Populus tremula x Populus alba because it has a genome in Phytozome
Skipping species Populus trichocarpa because it has a genome in Phytozome
Skipping species Portulaca amilis because it has a genome in Phytozome
Skipping species Prunella vulgaris because it has a genome in Phytozome
Skipping species Prunus persica because it has a genome in Phytozome
Skipping species Quercus rubra because it has a genome in Phytozome


 83%|████████▎ | 245/294 [09:35<01:02,  1.28s/it]

0 RNA-seq studies found for species Rhanterium epapposum.


 84%|████████▎ | 246/294 [11:16<04:41,  5.86s/it]

There are 157 RNA-seq studies in SRA for species Rhyza stricta
Skipping species Ricinus communis because it has a genome in Phytozome
Skipping species Rorippa islandica because it has a genome in Phytozome
Skipping species Saccharum officinarum x spontaneum because it has a genome in Phytozome
Skipping species Salix alba because it has a genome in Phytozome
No results found for Salsola villosa
0 RNA-seq studies found for species Salsola villosa.


 87%|████████▋ | 257/294 [11:20<02:01,  3.28s/it]

0 RNA-seq studies found for species Salvadora persica.
Skipping species Salvia officinalis because it has a genome in Phytozome
Skipping species Sarracenia purpurea because it has a genome in Phytozome
Skipping species Schrenkiella parvula because it has a genome in Phytozome
Skipping species Selaginella moellendorffii because it has a genome in Phytozome
No results found for Sesuvium trianthema
0 RNA-seq studies found for species Sesuvium trianthema.


 88%|████████▊ | 259/294 [11:23<01:47,  3.07s/it]

There are 11 RNA-seq studies in SRA for species Sesuvium verrucosum
Skipping species Setaria italica because it has a genome in Phytozome
Skipping species Setaria viridis because it has a genome in Phytozome
Skipping species Sisymbrium irio because it has a genome in Phytozome
Skipping species Solanum lycopersicum because it has a genome in Phytozome
Skipping species Solanum tuberosum because it has a genome in Phytozome
Skipping species Sorghum bicolor because it has a genome in Phytozome
Skipping species Sphagnum fallax because it has a genome in Phytozome
Skipping species Sphagnum magellanicum because it has a genome in Phytozome
Skipping species Spinacia oleracea because it has a genome in Phytozome
Skipping species Spirodela polyrhiza because it has a genome in Phytozome
Skipping species Stanleya pinnata because it has a genome in Phytozome


 94%|█████████▍| 276/294 [11:28<00:22,  1.26s/it]

0 RNA-seq studies found for species Tamarindus indica.
Skipping species Theobroma cacao because it has a genome in Phytozome
Skipping species Thinopyrum intermedium because it has a genome in Phytozome
Skipping species Thlaspi arvense because it has a genome in Phytozome
Skipping species Thuja plicata because it has a genome in Phytozome
No results found for Tidestromia carnosa
0 RNA-seq studies found for species Tidestromia carnosa.
No results found for Tidestromia gemmata
0 RNA-seq studies found for species Tidestromia gemmata.
There are 2 RNA-seq studies in SRA for species Tidestromia lanuginosa


 95%|█████████▍| 278/294 [11:34<00:23,  1.49s/it]

There are 1 RNA-seq studies in SRA for species Tidestromia oblongifolia


 95%|█████████▌| 280/294 [11:38<00:21,  1.55s/it]

There are 1 RNA-seq studies in SRA for species Tidestromia suffruticosa
Skipping species Trifolium pratense because it has a genome in Phytozome
Skipping species Triticum aestivum because it has a genome in Phytozome
Skipping species Typha latifolia because it has a genome in Phytozome
Skipping species Urochloa fusca because it has a genome in Phytozome
Skipping species Vaccinium darrowii because it has a genome in Phytozome
Skipping species Vicia faba because it has a genome in Phytozome
Skipping species Vigna unguiculata because it has a genome in Phytozome
Skipping species Vitis vinifera because it has a genome in Phytozome
Skipping species Volvox carteri because it has a genome in Phytozome
Skipping species Yucca aloifolia because it has a genome in Phytozome


 99%|█████████▊| 290/294 [11:51<00:05,  1.42s/it]

There are 58 RNA-seq studies in SRA for species Yucca brevifolia
Skipping species Zea mays because it has a genome in Phytozome


 99%|█████████▉| 292/294 [11:55<00:03,  1.52s/it]

There are 36 RNA-seq studies in SRA for species Ziziphus spina


100%|██████████| 294/294 [12:00<00:00,  2.45s/it]

There are 49 RNA-seq studies in SRA for species Ziziphus spina christi
Skipping species Zostera marina because it has a genome in Phytozome





In [109]:
print(f'There are {len(all_studies)} organisms with RNAseq datasets that also don\'t have Phytozome genomes, {len(list(listdir(genome_top_path)))} organisms with Phytozome genomes, and {len(missing_rnaseq)} that don\'t have an RNAseq dataset in SRA.')

There are 32 organisms with RNAseq datasets that also don't have Phytozome genomes, 162 organisms with Phytozome genomes, and 101 that don't have an RNAseq dataset in SRA.


In [110]:
print([len(stud) for stud in all_studies.values()], sum([len(stud) for stud in all_studies.values()]))

[127, 1, 6, 16, 124, 3, 12, 5, 3, 3, 124, 11, 3, 3, 12, 5, 18, 5, 16, 16, 1, 2, 6, 340, 157, 11, 2, 1, 1, 58, 36, 49] 1177


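Concatenate the per-species results so that we can delete the studies we don't want from a single spreadsheet. Before concatenating, resolve any duplicated column names in the per-species metadata: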

In [111]:
for i, reps in enumerate([len(df.columns) - len(set(df.columns)) for df in all_studies.values()]):
    if reps > 0:
        print(f'Species {list(all_studies.keys())[i]} detailed metadata has duplicate columns')
        species_df = list(all_studies.values())[i]
        # Get the duplicated columns and their counts
        counted_col_names = Counter(species_df.columns.tolist())
        dup_cols = [k for k,v in counted_col_names.items() if v > 1]
        # Check which version of the columns actually have data (if any)
        dup_col_isna = [(col_name, [vals.isna().all() for col, vals in species_df[col_name].items()]) for col_name in dup_cols]
        # If they're all na, drop them all, if  not, keep the ones with info
        # In our dataset there's only one with data in every duplicated group, you'd want to
        # add a secondary check here if you were adding different species
        for col_name, contents in dup_col_isna:
            if set(contents) == set([True]):
                species_df = species_df.drop(columns=col_name)
                all_studies[list(all_studies.keys())[i]] = species_df
            else:
                # Get unique cols
                unique_cols = species_df.columns.duplicated(keep=False) == False
                df_unique = species_df.loc[:, unique_cols]
                # Choose the duplicated column that actually has data
                col_to_keep = species_df.T.groupby(species_df.T.index, axis=0).nth(contents.index(False)).T
                species_df = pd.concat([df_unique, col_to_keep], axis=1)
                all_studies[list(all_studies.keys())[i]] = species_df
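
A more compact way to express the duplicate-column cleanup is sketched below. The helper name `resolve_duplicate_columns` is hypothetical and not part of the notebook. Unlike the loop above, it keeps the *first* copy that contains any data, on the assumption that at most one copy per duplicated name is populated, and it drops a duplicated name entirely when every copy is empty.

```python
import numpy as np
import pandas as pd


def resolve_duplicate_columns(df):
    """Return a copy of df in which every column name appears at most once.

    Hypothetical helper: for a duplicated name, keep the first copy that has
    at least one non-null value; if all copies are empty, drop the name.
    """
    cols = df.columns
    keep = []
    for name in cols.unique():
        positions = [i for i, c in enumerate(cols) if c == name]
        if len(positions) == 1:
            # Not duplicated, keep as-is
            keep.append(positions[0])
            continue
        # Positions of the copies that actually contain data
        with_data = [i for i in positions if not df.iloc[:, i].isna().all()]
        if with_data:
            keep.append(with_data[0])
    # Preserve the original column order
    return df.iloc[:, sorted(keep)].copy()


example = pd.DataFrame(
    [[1, np.nan, 2, np.nan, np.nan]],
    columns=['a', 'b', 'b', 'c', 'c'],
)
cleaned = resolve_duplicate_columns(example)
print(cleaned.columns.tolist())
```

If more than one copy of a column could hold data, a secondary check (for example, comparing the copies) would still be needed before discarding any of them.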

In [112]:
all_studies_df = pd.concat([df.reset_index(drop=True) for df in all_studies.values()], ignore_index=True)#.drop_duplicates()
all_studies_df.head()

Unnamed: 0,run_accession,study_accession,study_title,experiment_accession,experiment_title,experiment_desc,organism_taxid,organism_name,library_name,library_strategy,...,subsrc_note,authority,library name,sample type,geo_loc,plant_id,geographic location,time,source population,tag
0,DRR119672,DRP004551,Avicennia marina leaves transcriptome,DRX112674,Illumina HiSeq 1500 paired end sequencing of S...,Illumina HiSeq 1500 paired end sequencing of S...,82927,Avicennia marina,,RNA-Seq,...,,,,,,,,,,
1,SRR10533306,SRP233272,Tissue-specific transcriptomes from Avicennia ...,SRX7217273,Tissue-specific transcriptomes from Avicennia ...,Tissue-specific transcriptomes from Avicennia ...,82927,Avicennia marina,Transcriptomic reads of Avicennia_marina_Flowers,RNA-Seq,...,,,,,,,,,,
2,SRR12145926,SRP270003,RNASeq to identify salt responsive genes from ...,SRX8666650,RNASeq to identify salt responsive genes from ...,RNASeq to identify salt responsive genes from ...,82927,Avicennia marina,am_Leaf_salt_0h,RNA-Seq,...,,,,,,,,,,
3,SRR12145992,SRP270003,RNASeq to identify salt responsive genes from ...,SRX8666716,RNASeq to identify salt responsive genes from ...,RNASeq to identify salt responsive genes from ...,82927,Avicennia marina,am_salt_leaf_24h,RNA-Seq,...,,,,,,,,,,
4,SRR12145993,SRP270003,RNASeq to identify salt responsive genes from ...,SRX8666717,RNASeq to identify salt responsive genes from ...,RNASeq to identify salt responsive genes from ...,82927,Avicennia marina,am_salt_leaf_48h,RNA-Seq,...,,,,,,,,,,


Save out the studies so we don't have to run the search again:

In [113]:
all_studies_df.to_csv(f'/mnt/research/Walker_Lab_Research/Serena_project_data/selection-under-heat_data/sra_study_info/all_SRA_studies_{run_date}.csv', index=False)

Read them back in:

In [83]:
all_studies_df = pd.read_csv('/mnt/research/Walker_Lab_Research/Serena_project_data/selection-under-heat_data/sra_study_info/all_SRA_studies_17Sep2025.csv')
all_studies_df.head()

Unnamed: 0,run_accession,study_accession,study_title,experiment_accession,experiment_title,experiment_desc,organism_taxid,organism_name,library_name,library_strategy,...,subsrc_note,authority,library name,sample type,geo_loc,plant_id,geographic location,time,source population,tag
0,DRR119672,DRP004551,Avicennia marina leaves transcriptome,DRX112674,Illumina HiSeq 1500 paired end sequencing of S...,Illumina HiSeq 1500 paired end sequencing of S...,82927,Avicennia marina,,RNA-Seq,...,,,,,,,,,,
1,SRR10533306,SRP233272,Tissue-specific transcriptomes from Avicennia ...,SRX7217273,Tissue-specific transcriptomes from Avicennia ...,Tissue-specific transcriptomes from Avicennia ...,82927,Avicennia marina,Transcriptomic reads of Avicennia_marina_Flowers,RNA-Seq,...,,,,,,,,,,
2,SRR12145926,SRP270003,RNASeq to identify salt responsive genes from ...,SRX8666650,RNASeq to identify salt responsive genes from ...,RNASeq to identify salt responsive genes from ...,82927,Avicennia marina,am_Leaf_salt_0h,RNA-Seq,...,,,,,,,,,,
3,SRR12145992,SRP270003,RNASeq to identify salt responsive genes from ...,SRX8666716,RNASeq to identify salt responsive genes from ...,RNASeq to identify salt responsive genes from ...,82927,Avicennia marina,am_salt_leaf_24h,RNA-Seq,...,,,,,,,,,,
4,SRR12145993,SRP270003,RNASeq to identify salt responsive genes from ...,SRX8666717,RNASeq to identify salt responsive genes from ...,RNASeq to identify salt responsive genes from ...,82927,Avicennia marina,am_salt_leaf_48h,RNA-Seq,...,,,,,,,,,,


We want to filter this list further to get a single rep per organism, which requires some if/else logic on the studies for each organism. First, for any species that has a `PAIRED` option, we'll drop the rows with `SINGLE` in the `library_layout` column, because paired-end reads are more reliable. Species that only have single-end reads are kept as they are.

In [84]:
drop_for_single = []
for species in all_studies_df.organism_name.unique():
    spec_subset = all_studies_df[all_studies_df.organism_name == species]
    single_end = spec_subset[spec_subset.library_layout == 'SINGLE']
    if len(single_end) == len(spec_subset): # Means there's only single end
        print(f'There are only single-end reads for species {species}')
        continue
    else:
        drop_for_single.extend(single_end.index.tolist())
        if len(single_end) > 0:
            print(f'For species {species}, {len(single_end)} of {len(spec_subset)} total studies are single-end reads, and will be dropped')

For species Avicennia marina, 5 of 88 total studies are single-end reads, and will be dropped
There are only single-end reads for species Boechera depauperata
For species Boechera divaricarpa, 15 of 48 total studies are single-end reads, and will be dropped
For species Boechera stricta, 49 of 115 total studies are single-end reads, and will be dropped
There are only single-end reads for species Arabidopsis thaliana
For species Boechera williamsii, 18 of 36 total studies are single-end reads, and will be dropped
For species Boechera pallidifolia, 24 of 33 total studies are single-end reads, and will be dropped
There are only single-end reads for species Opuntia robusta
There are only single-end reads for species Opuntia joconostle
There are only single-end reads for species Opuntia ficus-indica
For species Phoenix dactylifera, 23 of 267 total studies are single-end reads, and will be dropped
There are only single-end reads for species Coccotrypes dactyliperda
There are only single-end r

In [85]:
all_studies_df = all_studies_df.drop(index=drop_for_single)
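
The same rule can also be written without an explicit loop. The following is a minimal sketch on toy data; the function name `drop_single_where_paired_exists` is made up for illustration, and it assumes every row has a non-missing `organism_name`.

```python
import pandas as pd


def drop_single_where_paired_exists(df, group_col='organism_name',
                                    layout_col='library_layout'):
    """Drop SINGLE rows only for groups that also have a non-SINGLE row."""
    is_single = df[layout_col].eq('SINGLE')
    # True for every row whose species has at least one non-single-end run
    has_other = (~is_single).groupby(df[group_col]).transform('any').astype(bool)
    return df[~(is_single & has_other)]


toy = pd.DataFrame({
    'organism_name': ['A', 'A', 'B', 'B'],
    'library_layout': ['PAIRED', 'SINGLE', 'SINGLE', 'SINGLE'],
})
filtered = drop_single_where_paired_exists(toy)
print(filtered.index.tolist())
```

Species A loses its single-end row because a paired-end alternative exists, while species B keeps both rows because single-end is all there is.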

Unfortunately, the read length is not included in the metadata returned by our search, so we can't filter on it; we'll have to check that manually at the end. However, we also want to know that our samples come from leaf tissue. While there is not a dedicated field for tissue type, many studies include this information in one of the following fields: `study_title`, `experiment_title`, `experiment_desc`, `tissue`, `tissue_type`. As with the single vs. paired filter, we'll keep only the rows that contain the string `"leaf"` or `"leaves"` in one of these fields, but if a species has no such rows, we'll just keep what's there.

In [86]:
drop_for_leaf = []
for species in all_studies_df.organism_name.unique():
    spec_subset = all_studies_df[all_studies_df.organism_name == species]
    # This feels unwieldy and like there's probably a cleaner way to do it
    has_leaf = spec_subset[(spec_subset.study_title.str.lower().str.contains('leaf')) | (spec_subset.experiment_title.str.lower().str.contains('leaf')) | (spec_subset.experiment_desc.str.lower().str.contains('leaf')) | (spec_subset.tissue.str.lower().str.contains('leaf')) | (spec_subset.tissue_type.str.lower().str.contains('leaf')) | (spec_subset.study_title.str.lower().str.contains('leaves')) | (spec_subset.experiment_title.str.lower().str.contains('leaves')) | (spec_subset.experiment_desc.str.lower().str.contains('leaves')) | (spec_subset.tissue.str.lower().str.contains('leaves')) | (spec_subset.tissue_type.str.lower().str.contains('leaves'))]
    if len(has_leaf) == 0: # Means there's no leaf specific one
        print(f'There are no leaf-specific studies for species {species}')
        continue
    else:
        not_leaf_idxs = spec_subset.drop(index=has_leaf.index.tolist()).index.tolist()
        drop_for_leaf.extend(not_leaf_idxs)
        print(f'For species {species}, {len(has_leaf)} of {len(spec_subset)} total studies are leaf specific, all others will be dropped')

For species Avicennia marina, 50 of 83 total studies are leaf specific, all others will be dropped
There are no leaf-specific studies for species mangrove metagenome
For species Rhizophora apiculata, 6 of 12 total studies are leaf specific, all others will be dropped
For species Avicennia marina var. rumphiana, 2 of 2 total studies are leaf specific, all others will be dropped
For species Avicennia marina subsp. eucalyptifolia, 1 of 2 total studies are leaf specific, all others will be dropped
For species Avicennia marina subsp. australasica, 1 of 1 total studies are leaf specific, all others will be dropped
For species Avicennia marina subsp. marina, 1 of 1 total studies are leaf specific, all others will be dropped
There are no leaf-specific studies for species sediment metagenome
There are no leaf-specific studies for species Azorella trisecta
There are no leaf-specific studies for species Boechera depauperata
There are no leaf-specific studies for species Boechera divaricarpa
There

In [87]:
all_studies_df = all_studies_df.drop(index=drop_for_leaf)
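
The long boolean expression in the leaf filter can be condensed by looping over the text fields and using one case-insensitive regular expression. This is only a sketch: `leaf_mask` and `keep_leaf_where_available` are hypothetical names, and converting missing values to strings with `astype(str)` is a choice made here to avoid NaN handling, not something the notebook does.

```python
import pandas as pd

LEAF_FIELDS = ['study_title', 'experiment_title', 'experiment_desc',
               'tissue', 'tissue_type']


def leaf_mask(df, fields=LEAF_FIELDS, pattern=r'leaf|leaves'):
    """Boolean Series: True where any of the given fields mentions leaf tissue."""
    present = [col for col in fields if col in df.columns]
    hits = pd.DataFrame(
        {col: df[col].astype(str).str.contains(pattern, case=False, regex=True)
         for col in present},
        index=df.index,
    )
    return hits.any(axis=1)


def keep_leaf_where_available(df, group_col='organism_name'):
    """Keep leaf rows for species that have them; keep everything otherwise."""
    is_leaf = leaf_mask(df)
    group_has_leaf = is_leaf.groupby(df[group_col]).transform('any').astype(bool)
    return df[is_leaf | ~group_has_leaf]


toy = pd.DataFrame({
    'organism_name': ['A', 'A', 'B', 'B'],
    'study_title': ['Leaf transcriptome', 'root RNA', 'root RNA', None],
    'experiment_title': [None] * 4,
    'experiment_desc': [None] * 4,
    'tissue': [None, None, None, 'stem'],
    'tissue_type': [None] * 4,
})
kept = keep_leaf_where_available(toy)
print(kept.index.tolist())
```

Species A keeps only its leaf row, and species B, which has no leaf-specific rows, keeps everything.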

The last thing we can do to make this list easier to manage is to remove duplicate entries from the same experiment, since we only need one rep. For any species that still has multiple experiments, we'll keep one rep from each, since we can manually check the read lengths to choose among them. Replicates from the same experiment share a BioProject number but have different sample accession numbers and different experiment numbers, so we'll go through each remaining BioProject and keep just the first sample. Note that some BioProjects contain multiple species, so we first look at each species and then, within that species, at the BioProject numbers.

In [93]:
drop_for_reps = []
for species in all_studies_df.organism_name.unique():
    spec_subset = all_studies_df[all_studies_df.organism_name == species]
    for proj in spec_subset.bioproject.unique():
        bio_subset = spec_subset[spec_subset.bioproject == proj]
        all_but_first = bio_subset.index.tolist()[1:]
        drop_for_reps.extend(all_but_first)
print(f'{len(drop_for_reps)} replicate rows will be dropped (out of a total of {len(all_studies_df)})')

586 replicate rows will be dropped (out of a total of 684)


In [95]:
all_studies_df = all_studies_df.drop(index=drop_for_reps)
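
For comparison, keeping the first row per species and BioProject can also be done with `drop_duplicates`. The sketch below uses toy data and a made-up function name. One difference to be aware of: `drop_duplicates` treats missing `bioproject` values as equal to each other, so rows with no BioProject would be collapsed to one per species, whereas the loop above never matches a missing value and therefore keeps all such rows.

```python
import pandas as pd


def first_rep_per_bioproject(df, species_col='organism_name',
                             project_col='bioproject'):
    """Keep the first row for each (species, BioProject) combination."""
    return df.drop_duplicates(subset=[species_col, project_col], keep='first')


toy = pd.DataFrame({
    'organism_name': ['A', 'A', 'A', 'B', 'B'],
    'bioproject': ['P1', 'P1', 'P2', 'P1', 'P1'],
    'run_accession': ['R1', 'R2', 'R3', 'R4', 'R5'],
})
single_rep = first_rep_per_bioproject(toy)
print(single_rep.run_accession.tolist())
```

BioProject `P1` contributes one row for species A and one for species B, which mirrors looking at each species first and then at its BioProjects.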

Save out what we have:

In [98]:
all_studies_df.to_csv(f'/mnt/research/Walker_Lab_Research/Serena_project_data/selection-under-heat_data/sra_study_info/filtered_single_rep_studies_{run_date}.csv', index=False)

Since I revised the code above to use the additional columns from the `detailed` option, I already have read lengths for some, but not all, of the studies that result from adding the `tissue` and `tissue_type` columns. We'll merge in the old results here to avoid having to add them all back manually:

In [99]:
all_studies_df = pd.read_csv('/mnt/research/Walker_Lab_Research/Serena_project_data/selection-under-heat_data/sra_study_info/filtered_single_rep_studies_18Sep2025.csv')
all_studies_df.head()

Unnamed: 0,run_accession,study_accession,study_title,experiment_accession,experiment_title,experiment_desc,organism_taxid,organism_name,library_name,library_strategy,...,subsrc_note,authority,library name,sample type,geo_loc,plant_id,geographic location,time,source population,tag
0,DRR119672,DRP004551,Avicennia marina leaves transcriptome,DRX112674,Illumina HiSeq 1500 paired end sequencing of S...,Illumina HiSeq 1500 paired end sequencing of S...,82927,Avicennia marina,,RNA-Seq,...,,,,,,,,,,
1,SRR10533306,SRP233272,Tissue-specific transcriptomes from Avicennia ...,SRX7217273,Tissue-specific transcriptomes from Avicennia ...,Tissue-specific transcriptomes from Avicennia ...,82927,Avicennia marina,Transcriptomic reads of Avicennia_marina_Flowers,RNA-Seq,...,,,,,,,,,,
2,SRR12145926,SRP270003,RNASeq to identify salt responsive genes from ...,SRX8666650,RNASeq to identify salt responsive genes from ...,RNASeq to identify salt responsive genes from ...,82927,Avicennia marina,am_Leaf_salt_0h,RNA-Seq,...,,,,,,,,,,
3,SRR12146026,SRP270005,Whole Transcriptome sequencing and analaysis i...,SRX8666750,RNASeq from Avicennia leaf tissue,RNASeq from Avicennia leaf tissue,82927,Avicennia marina,am_rnaseq_leaf,RNA-Seq,...,,,,,,,,,,
4,SRR12185868,SRP271133,Gazi bay and Mida creek mangrove metagenome Ra...,SRX8700441,16S rRNA amplicon metagenomics sequences from ...,16S rRNA amplicon metagenomics sequences from ...,1284368,mangrove metagenome,Ed16S-AVM1-1-5cm,RNA-Seq,...,,,,,,,,,,


In [100]:
with_read_lens = pd.read_csv('/mnt/research/Walker_Lab_Research/Serena_project_data/selection-under-heat_data/sra_study_info/filtered_single_rep_studies_15Sep2025_with_read_lengths.csv')
with_read_lens = with_read_lens.drop(columns='Unnamed: 8')
with_read_lens.head()

Unnamed: 0,organism_name,len_reads,study_accession,study_title,experiment_accession,experiment_title,experiment_desc,organism_taxid,library_name,library_strategy,...,biosample,bioproject,instrument,instrument_model,instrument_model_desc,total_spots,total_size,run_accession,run_total_spots,run_total_bases
0,Adenophora,150,SRP406394,Phylogenetic Genomics of Adenophora,SRX18176188,RNA-seq of Adenophora stricta subsp. stricta :,RNA-seq of Adenophora stricta subsp. stricta :,82270,am_Leaf_salt_0h,RNA-Seq,...,SAMN15447932,PRJNA643813,NextSeq 500,NextSeq 500,ILLUMINA,11648210,757097881,SRR12145926,11648210,1750406459
1,Adenophora stricta,150,SRP549143,This BioProject includes raw sequencing data f...,SRX26957374,leaf+stem RNA of Adenophora stricta,leaf+stem RNA of Adenophora stricta,82279,am_rnaseq_leaf,RNA-Seq,...,SAMN15447991,PRJNA392014,NextSeq 500,NextSeq 500,ILLUMINA,54006517,6346122539,SRR12146026,54006517,14987554557
2,Avicennia marina,75,SRP270003,RNASeq to identify salt responsive genes from ...,SRX8666650,RNASeq to identify salt responsive genes from ...,RNASeq to identify salt responsive genes from ...,82927,Ed16S-AVM1-1-5cm,RNA-Seq,...,SAMN15492854,PRJNA644929,Illumina MiSeq,Illumina MiSeq,ILLUMINA,49648,17902988,SRR12185868,49648,29735410
3,Avicennia marina,139,SRP270005,Whole Transcriptome sequencing and analaysis i...,SRX8666750,RNASeq from Avicennia leaf tissue,RNASeq from Avicennia leaf tissue,82927,RA_leaf_500_r2,RNA-Seq,...,SAMN22185018,PRJNA719266,Illumina HiSeq 2000,Illumina HiSeq 2000,ILLUMINA,26620432,3597436477,SRR16279084,26620432,5270845536
4,Avicennia marina,76,SRP058424,Transcriptomic changes in Avicennia marina in...,SRX1030223,Avicennia marina A1 accession control leaf tra...,Avicennia marina A1 accession control leaf tra...,82927,Arumphiana_leaf,RNA-Seq,...,SAMN26804911,PRJNA817364,Illumina NovaSeq 6000,Illumina NovaSeq 6000,ILLUMINA,30063634,3007376940,SRR18449582,30063634,9019090200


Because we had additional fields to filter on, the newer dataframe will be missing some studies that are in the old one, in cases where the additional fields identified leaf-specific studies. In the merged dataframe, these show up as rows that have a run accession but no other information, so we'll drop any rows that have no organism name.

In [101]:
# Fix the type of this one col that causes an issue
all_studies_df.sample_title = all_studies_df.sample_title.astype('object')
# Merge
merged = all_studies_df.merge(with_read_lens, left_on='run_accession', right_on='run_accession', suffixes=('_new', '_old'), how='outer')
# Drop cols from old
merged = merged.drop(columns=[col for col in merged.columns if col.endswith('_old')])
# Drop _new suffix (only at the end of the name, so other occurrences of "_new" are untouched)
merged.columns = merged.columns.str.replace(r'_new$', '', regex=True)
# Drop rows with no info in the organism name column
merged = merged.dropna(subset=['organism_name'])
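
Since the only column we need from the old file is `len_reads`, the same result can be reached with a left merge that carries over just that column. This avoids the suffix handling and the final `dropna`. The function name `add_read_lengths` is hypothetical, and the sketch assumes that `run_accession` identifies a run uniquely in the old file.

```python
import pandas as pd


def add_read_lengths(new_df, old_df, key='run_accession', value='len_reads'):
    """Attach previously recorded read lengths to the new dataframe."""
    # Keep one read length per run so the merge cannot multiply rows
    lookup = old_df[[key, value]].drop_duplicates(subset=key)
    # A left merge keeps every row of new_df and nothing that is only in old_df
    return new_df.merge(lookup, on=key, how='left')


new = pd.DataFrame({
    'run_accession': ['R1', 'R2'],
    'organism_name': ['Species one', 'Species two'],
})
old = pd.DataFrame({
    'run_accession': ['R1', 'R3'],
    'organism_name': ['Species one', 'Species three'],
    'len_reads': [150, 100],
})
with_lens = add_read_lengths(new, old)
print(with_lens)
```

Runs that appear only in the old file (`R3` here) never enter the result, and runs without a recorded length (`R2`) get a missing value that can be filled in by hand.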

In [103]:
# Move the species, run accession, read length, and tissue columns to the front
front_cols = ['organism_name', 'run_accession', 'len_reads', 'tissue', 'tissue_type']
merged = merged[front_cols + [col for col in merged.columns if col not in front_cols]]
merged.head()

Unnamed: 0,organism_name,run_accession,len_reads,tissue,tissue_type,study_accession,study_title,experiment_accession,experiment_title,experiment_desc,...,subsrc_note,authority,library name,sample type,geo_loc,plant_id,geographic location,time,source population,tag
0,Avicennia marina,DRR119672,,,leaf,DRP004551,Avicennia marina leaves transcriptome,DRX112674,Illumina HiSeq 1500 paired end sequencing of S...,Illumina HiSeq 1500 paired end sequencing of S...,...,,,,,,,,,,
2,Boechera gunnisoniana,ERR1548596,101.0,,,ERP016601,Abiotic stress-response reprogramming events u...,ERX1619366,Illumina HiSeq 2000 paired end sequencing; Abi...,Illumina HiSeq 2000 paired end sequencing; Abi...,...,,,,,,,,,,
4,Sesuvium verrucosum,ERR2040195,,,,ERP023948,1000 Plant (1KP) Transcriptomes: The Remaining...,ERX2099252,Illumina HiSeq 2000 paired end sequencing,Illumina HiSeq 2000 paired end sequencing,...,,,,,,,,,,
5,Trigastrotheca pentaphylla,ERR2040238,,,,ERP023948,1000 Plant (1KP) Transcriptomes: The Remaining...,ERX2099295,Illumina HiSeq 2000 paired end sequencing,Illumina HiSeq 2000 paired end sequencing,...,,,,,,,,,,
6,Mollugo verticillata,ERR2040239,,,,ERP023948,1000 Plant (1KP) Transcriptomes: The Remaining...,ERX2099296,Illumina HiSeq 2000 paired end sequencing,Illumina HiSeq 2000 paired end sequencing,...,,,,,,,,,,


In [104]:
merged.to_csv(f'/mnt/research/Walker_Lab_Research/Serena_project_data/selection-under-heat_data/sra_study_info/filtered_single_rep_studies_{run_date}_with_len_reads.csv', index=False)

## Getting fastq files
After exporting the above file, I manually checked read lengths and removed most duplicate samples. For species with several paired-end samples of similar read length, I left duplicates if (a) I thought it would be valuable or (b) I wasn't certain about the tissue origin of the samples. Here, we'll read the manually finalized dataframe back in and use it to download the datasets.

In [105]:
final_df = pd.read_csv('/mnt/research/Walker_Lab_Research/Serena_project_data/selection-under-heat_data/sra_study_info/filtered_single_rep_studies_18Sep2025_manual_filtering.csv')
final_df.head()

Unnamed: 0,organism_name,run_accession,len_reads,tissue,tissue_type,study_accession,study_title,experiment_accession,experiment_title,experiment_desc,...,subsrc_note,authority,library name,sample type,geo_loc,plant_id,geographic location,time,source population,tag
0,Adenophora stricta,SRR31592529,150,leaf-stem,,SRP549143,This BioProject includes raw sequencing data f...,SRX26957374,leaf+stem RNA of Adenophora stricta,leaf+stem RNA of Adenophora stricta,...,,,,,,,,,,
1,Avicennia marina,SRR12145926,150,Leaf,,SRP270003,RNASeq to identify salt responsive genes from ...,SRX8666650,RNASeq to identify salt responsive genes from ...,RNASeq to identify salt responsive genes from ...,...,,,,,,,,,,
2,Avicennia marina,SRR12146026,150,Leaf,,SRP270005,Whole Transcriptome sequencing and analaysis i...,SRX8666750,RNASeq from Avicennia leaf tissue,RNASeq from Avicennia leaf tissue,...,,,,,,,,,,
3,Avicennia marina subsp. australasica,SRR18449588,100,leaf,,SRP365449,Mangrove Genome Project,SRX14582333,Amarinaau_leaf,Amarinaau_leaf,...,,,,,,,,,,
4,Avicennia marina subsp. eucalyptifolia,SRR18449587,100,leaf,,SRP365449,Mangrove Genome Project,SRX14582334,Amarinaeu_leaf,Amarinaeu_leaf,...,,,,,,,,,,


Rather than use the SRA toolkit, which is prohibitively slow, we'll use `pysradb` with the `detailed` option to get FTP links for the data we're interested in.

In [107]:
db = SRAweb()
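
Once detailed metadata has been retrieved (for example with `db.sra_metadata(accession, detailed=True)`), the links still have to be pulled out of the resulting dataframe. The sketch below works on a toy dataframe and makes no network calls. The helper `collect_ftp_links` is hypothetical, and the column names in the toy data are an assumption about what the detailed metadata contains, so check the actual column names before relying on it.

```python
import pandas as pd


def collect_ftp_links(meta, run_col='run_accession'):
    """Map each run accession to the non-empty values of any FTP-like column.

    Assumption: the relevant columns have "ftp" somewhere in their name.
    """
    ftp_cols = [col for col in meta.columns if 'ftp' in col.lower()]
    links = {}
    for _, row in meta.iterrows():
        # Skip missing values (NaN/None) and empty strings
        urls = [row[col] for col in ftp_cols
                if isinstance(row[col], str) and row[col]]
        links[row[run_col]] = urls
    return links


# Toy metadata with assumed column names and placeholder URLs
toy_meta = pd.DataFrame({
    'run_accession': ['RUN_A', 'RUN_B'],
    'ena_fastq_ftp_1': ['ftp://example.org/RUN_A_1.fastq.gz', None],
    'ena_fastq_ftp_2': ['ftp://example.org/RUN_A_2.fastq.gz', None],
})
toy_links = collect_ftp_links(toy_meta)
print(toy_links)
```

Runs that come back with an empty list have no link in the metadata and would need a different download route.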

In [108]:
ftp_links = {}


0    leaf-stem
Name: tissue, dtype: object
0    Leaf
Name: tissue, dtype: object
0    Leaf
Name: tissue, dtype: object


KeyboardInterrupt: 