# Phenotypes

This notebook is a simple exploration of the phenotypes in the dataset. It processes and cleans the data and summarises how the phenotypes are distributed across drugs.

It is species-agnostic: set the `species` variable and make sure the data follows the path conventions built for _E. coli_.
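The path conventions in question can be summarised in a small helper. This is only a sketch: `species_paths` and `drug_from_filename` are hypothetical names that do not appear in the notebook, and the folder layout is copied from the paths used in the cells below.

```python
import re
from pathlib import Path

def species_paths(species, root="data"):
    """Collect the per-species locations this notebook reads from and writes to."""
    root = Path(root)
    return {
        "presence_matrix": root / "presence_matrices" / f"{species}_filtered_SxG.csv",
        "genome_ids": root / "pangenome_pipeline_output" / "genome_ids" / f"{species}_genome_ids.csv",
        "phenotypes_dir": root / "phenotypes",
        "processed_dir": root / "processed_phenotypes",
    }

def drug_from_filename(filename, species):
    """Return the drug name from '<species>_<drug>.csv', or None if the file is for another species."""
    match = re.fullmatch(rf"{re.escape(species)}_(.+)\.csv", filename)
    return match.group(1) if match else None

paths = species_paths("Escherichia_coli")
print(paths["genome_ids"])
print(drug_from_filename("Escherichia_coli_nalidixic_acid.csv", "Escherichia_coli"))
```

Keeping the species name out of string literals this way means switching organisms should only require changing the one variable, provided the files are named consistently.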

## Imports

In [40]:
import os
import re
import pandas as pd
import warnings  # silences the deprecation warning from DataFrame.append (an older pandas version is used)
import plotly.graph_objects as go
import plotly.io as pio

warnings.filterwarnings('ignore')

os.chdir(os.path.expanduser('~/capstone-project'))

species='Escherichia_coli'

## Processing

This step mainly consists of removing all genome IDs that are not present in the genotypic dataset (the genomes used to construct the pangenome).

In [6]:
# -- dropping out all genome ids that we don't have a corresponding entry for in genome_ids file

X_df = pd.read_csv(f'data/presence_matrices/{species}_filtered_SxG.csv', index_col=0)
X_df.index = X_df.index.astype('float')

genome_ids_file = f'data/pangenome_pipeline_output/genome_ids/{species}_genome_ids.csv'
samples_df = pd.read_csv(genome_ids_file, dtype=str)
samples_list=samples_df['genome.genome_ids'].tolist()
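The genome IDs are read with `dtype=str` so that they can be compared as exact strings. The sketch below illustrates why this matters, assuming IDs of the form `<taxon>.<number>`; the two IDs are made up for the example. Parsed as floats, IDs that differ only by a trailing zero collapse into the same value, which is also worth keeping in mind for the float cast applied to `X_df.index`.

```python
import io
import pandas as pd

# Two hypothetical genome IDs that differ only by a trailing zero
csv_text = "genome_id,SIR\n562.10,1\n562.1,0\n"

as_float = pd.read_csv(io.StringIO(csv_text))            # IDs inferred as floats
as_str = pd.read_csv(io.StringIO(csv_text), dtype=str)   # IDs kept as text

n_float = as_float["genome_id"].nunique()
n_str = as_str["genome_id"].nunique()

print(f"distinct IDs when parsed as float: {n_float}")
print(f"distinct IDs when parsed as str:   {n_str}")
```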

In [8]:
#  --removing all genome ids that are not in the genome_ids file

count_files=0
for file in os.listdir('data/phenotypes/'):
    if file.endswith('.csv') and species in file:
        count_files+=1
        
        y=pd.read_csv(f'data/phenotypes/{file}', dtype=str)
        old=y.shape[0]

        y=y[y['genome_id'].isin(samples_list)]
        new=y.shape[0]

        print(f'{file} - {old-new} samples removed')

        # -- create the output folder on first use
        os.makedirs('data/processed_phenotypes/', exist_ok=True)
        y.to_csv(f'data/processed_phenotypes/{file}', index=False)

Escherichia_coli_streptomycin.csv - 29 samples removed
Escherichia_coli_nitrofurantoin.csv - 145 samples removed
Escherichia_coli_kanamycin.csv - 11 samples removed
Escherichia_coli_aztreonam.csv - 70 samples removed
Escherichia_coli_doxycycline.csv - 119 samples removed
Escherichia_coli_tobramycin.csv - 476 samples removed
Escherichia_coli_oxytetracycline.csv - 92 samples removed
Escherichia_coli_cefotetan.csv - 109 samples removed
Escherichia_coli_sulfamethoxazole.csv - 62 samples removed
Escherichia_coli_nalidixic_acid.csv - 42 samples removed
Escherichia_coli_tetracycline.csv - 90 samples removed
Escherichia_coli_imipenem.csv - 187 samples removed
Escherichia_coli_cefoxitin.csv - 98 samples removed
Escherichia_coli_tigecycline.csv - 225 samples removed
Escherichia_coli_cefalothin.csv - 421 samples removed
Escherichia_coli_cefotaxime.csv - 728 samples removed
Escherichia_coli_amikacin.csv - 99 samples removed
Escherichia_coli_trimethoprim_sulphamethoxazole.csv - 183 samples removed


## Filtration

Drugs are filtered based on:
* class balance: drugs with an imbalance of 80/20 or worse are removed
* SIR reading counts: drugs with 100 or fewer sample readings are removed
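The two criteria can be written as a single reusable filter. This is a minimal sketch on a toy table, not the notebook's own cell: `filter_drugs` and its parameter names are hypothetical, and the defaults mirror the thresholds listed above.

```python
import pandas as pd

def filter_drugs(counts, max_majority_pct=80.0, min_samples=100):
    """Keep drugs whose larger class is below max_majority_pct and that have more than min_samples readings."""
    total = counts["R"] + counts["S"]
    majority_pct = counts[["R", "S"]].max(axis=1) / total * 100
    keep = (majority_pct < max_majority_pct) & (total > min_samples)
    return counts[keep]

# Toy counts (made up): balanced and large, heavily imbalanced, balanced but small
toy = pd.DataFrame(
    {"R": [60, 95, 30], "S": [60, 5, 30]},
    index=["drug_a", "drug_b", "drug_c"],
)

kept = filter_drugs(toy)
print(kept)
```

Exposing the thresholds as parameters makes it easy to try stricter settings, for example `filter_drugs(toy, max_majority_pct=70, min_samples=200)`.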

In [15]:
# -- plotting all pheno data for all available drugs

df = pd.DataFrame(columns=['drug', 'R', 'S'])

for file in os.listdir('data/processed_phenotypes/'):
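    if file.endswith('.csv') and species in file:
        # -- the drug name is whatever follows '<species>_' in the file name
        drug = re.match(rf'{re.escape(species)}_(.*)\.csv', file).group(1)
        data = pd.read_csv(f'data/processed_phenotypes/{file}')
        # -- counting resistant (1) and susceptible (0) readings in the SIR column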
        count = data['SIR'].value_counts()
        count_R=count.get(1, 0)
        count_S=count.get(0, 0)

        if count_R == 0 and count_S == 0:
            continue

        df = df.append({'drug': drug, 'R': count_R, 'S': count_S}, ignore_index=True)

df.set_index('drug', inplace=True)

# -- adding percentage columns
df['R%'] = df['R'] / (df['R'] + df['S']) * 100
df['S%'] = df['S'] / (df['R'] + df['S']) * 100  
df['samples no.'] = df['R'] + df['S']
# df
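`DataFrame.append`, used in the cell above, was deprecated in pandas 1.4 and removed in pandas 2.0. On a newer pandas, one alternative is to collect the rows in a plain list and build the frame once. The sketch below uses made-up SIR readings in place of the phenotype files; `sir_by_drug` and `summary` are hypothetical names.

```python
import pandas as pd

# Made-up SIR readings per drug, standing in for the CSV files (1 = resistant, 0 = susceptible)
sir_by_drug = {
    "drug_a": pd.Series([1, 0, 1, 1]),
    "drug_b": pd.Series([0, 0, 1]),
}

rows = []
for drug, sir in sir_by_drug.items():
    counts = sir.value_counts()
    # .get returns 0 when a class is absent from the readings
    rows.append({"drug": drug, "R": int(counts.get(1, 0)), "S": int(counts.get(0, 0))})

# Build the frame once instead of appending row by row
summary = pd.DataFrame(rows, columns=["drug", "R", "S"]).set_index("drug")
summary["samples no."] = summary["R"] + summary["S"]
print(summary)
```

Building the frame from a list also avoids re-copying the whole table on every iteration.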

In [21]:
#  -- keeping only drugs with less than an 80/20 class imbalance and more than 100 samples
df = df[(df['R%'] < 80) & (df['S%'] < 80) & (df['samples no.'] > 100)]
df.to_csv(f'data/{species}_pheno_data.csv') #location is debatable...

num_drugs = df.shape[0]
drugs_list=df.index.to_list()

print(f'Number of drugs: {num_drugs}')
print(f'Drugs: {drugs_list}')

df

Number of drugs: 11
Drugs: ['streptomycin', 'sulfamethoxazole', 'tetracycline', 'cefalothin', 'trimethoprim_sulphamethoxazole', 'amoxicillin_clavulanic_acid', 'trimethoprim', 'amoxicillin', 'ampicillin', 'levofloxacin', 'ciprofloxacin']


| drug | R | S | R% | S% | samples no. |
|---|---|---|---|---|---|
| streptomycin | 55 | 78 | 41.353383 | 58.646617 | 133 |
| sulfamethoxazole | 83 | 50 | 62.406015 | 37.593985 | 133 |
| tetracycline | 140 | 85 | 62.222222 | 37.777778 | 225 |
| cefalothin | 196 | 170 | 53.551913 | 46.448087 | 366 |
| trimethoprim_sulphamethoxazole | 180 | 96 | 65.217391 | 34.782609 | 276 |
| amoxicillin_clavulanic_acid | 401 | 1017 | 28.279267 | 71.720733 | 1418 |
| trimethoprim | 216 | 283 | 43.286573 | 56.713427 | 499 |
| amoxicillin | 596 | 388 | 60.569106 | 39.430894 | 984 |
| ampicillin | 312 | 194 | 61.660079 | 38.339921 | 506 |
| levofloxacin | 180 | 58 | 75.630252 | 24.369748 | 238 |
| ciprofloxacin | 444 | 1228 | 26.555024 | 73.444976 | 1672 |


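_trimethoprim sulfamethoxazole_ and _amoxicillin clavulanate_ are to be removed from the dataset, as no ARGs (antibiotic resistance genes) were found for them during extraction. Note that the `antibiotics` keep-list below still contains `amoxicillin_clavulanic_acid`, so only `trimethoprim_sulphamethoxazole` is actually dropped at this point.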
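If both combinations are meant to be excluded, as the sentence above states, dropping them by name is less error-prone than maintaining a list of drugs to keep. The sketch below uses a three-row toy frame with counts copied from the table above; `no_args_found` and `remaining` are hypothetical names.

```python
import pandas as pd

# Three rows copied from the filtered table, indexed by drug name
toy = pd.DataFrame(
    {"R": [55, 180, 401], "S": [78, 96, 1017]},
    index=pd.Index(
        ["streptomycin", "trimethoprim_sulphamethoxazole", "amoxicillin_clavulanic_acid"],
        name="drug",
    ),
)

# Drugs with no ARGs found during extraction (labels as derived from the file names)
no_args_found = ["trimethoprim_sulphamethoxazole", "amoxicillin_clavulanic_acid"]

# errors="ignore" keeps the call safe if a label was already filtered out earlier
remaining = toy.drop(index=no_args_found, errors="ignore")
print(remaining)
```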

In [56]:
antibiotics = ['streptomycin', 'sulfamethoxazole', 'tetracycline', 'cefalothin', 'amoxicillin_clavulanic_acid', 'trimethoprim', 'amoxicillin', 'ampicillin', 'levofloxacin', 'ciprofloxacin']
df = df[df.index.isin(antibiotics)]

df

| drug | R | S | R% | S% | samples no. |
|---|---|---|---|---|---|
| streptomycin | 55 | 78 | 41.353383 | 58.646617 | 133 |
| sulfamethoxazole | 83 | 50 | 62.406015 | 37.593985 | 133 |
| tetracycline | 140 | 85 | 62.222222 | 37.777778 | 225 |
| cefalothin | 196 | 170 | 53.551913 | 46.448087 | 366 |
| amoxicillin_clavulanic_acid | 401 | 1017 | 28.279267 | 71.720733 | 1418 |
| trimethoprim | 216 | 283 | 43.286573 | 56.713427 | 499 |
| amoxicillin | 596 | 388 | 60.569106 | 39.430894 | 984 |
| ampicillin | 312 | 194 | 61.660079 | 38.339921 | 506 |
| levofloxacin | 180 | 58 | 75.630252 | 24.369748 | 238 |
| ciprofloxacin | 444 | 1228 | 26.555024 | 73.444976 | 1672 |


In [57]:
df_sorted = df.sort_values(by='samples no.', ascending=False)

# -- one bar trace per class; S is passed first so it forms the base of each stack
trace_R = go.Bar(x=df_sorted.index, y=df_sorted['R'], name='R', marker_color='#FF5A5F')
trace_S = go.Bar(x=df_sorted.index, y=df_sorted['S'], name='S', marker_color='#BFD7EA')

fig = go.Figure(data=[trace_S, trace_R])

fig.update_layout(
    title=f'Number of samples per drug for {species}',
    barmode='stack', plot_bgcolor='white', width=900, height=600,
)
# -- axis titles
fig.update_xaxes(title_text='Drug')
fig.update_yaxes(title_text='Number of samples')
fig.show()

# -- make sure the output folder exists before exporting the static image
os.makedirs('figures', exist_ok=True)
pio.write_image(fig, f'figures/phenotypic_drugs_distributions_{species}.png')

to do:

- [ ] tighten the class-imbalance threshold to 70/30
- [ ] raise the minimum sample count to 200