<a href="https://colab.research.google.com/github/sanjaynagi/rna-seq-meta/blob/main/workflow/notebooks/strip_plot_gene_expression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
! git clone https://github.com/sanjaynagi/rna-seq-meta.git

Cloning into 'rna-seq-meta'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 30 (delta 1), reused 27 (delta 1), pack-reused 0
Unpacking objects: 100% (30/30), done.


In [14]:
import pandas as pd
import numpy as np
import plotly.express as px

def gene_ids_from_domain(gene_annot_df, domain):
    """Return the unique gene IDs annotated with a Pfam domain, or with any of a list of domains."""
    # accept either a single domain name or a list of domain names
    domains = domain if isinstance(domain, list) else [domain]
    ids = gene_annot_df.query("domain in @domains")['gene_id']
    return np.unique(ids.to_numpy())

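def load_species(df, meta):
    """Look up the species for each row of df from its comparison name."""
    spp = np.array([])
    for c in df['comparison']:
        # the condition is the part of the comparison name before the first underscore
        c = c.split("_")[0]
        # assumes each condition maps to exactly one species, giving one entry per row
        spps = meta.query(f"condition == '{c}'")['species'].unique()
        spp = np.append(spp, spps)
    return spp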

def plotly_strip_genes(gene_ids, title, plot_type='strip', fc_path="rna-seq-meta/results/fc_data.tsv", meta_path="rna-seq-meta/config/ALLcoldata.txt", width=1000, height=None):

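  """
  Plot log2 fold changes for the provided AGAP gene IDs.
  plot_type='strip' gives a strip plot; any other value gives a boxplot.
  Returns the full fold change table.
  """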
  # load metadata
  metadata = pd.read_csv(meta_path, sep="\t")
  meta = metadata[['condition', 'species']].drop_duplicates()
  meta.loc[:,'condition'] = meta['condition'].str.replace("_fun", "")
  
  # load fold change data and remove gene description column
  fc_data = pd.read_csv(fc_path, sep="\t")
  #pval_data = pd.read_csv("rna-seq-meta/results/pval_data.tsv", sep="\t")
  fc_data = fc_data.iloc[:, :-1]

  fam_fc_data = fc_data.query("GeneID in @gene_ids").copy()
  fam_fc_data.loc[:, 'Label'] = [id_ + " | " + name if name != "" else id_ for id_, name in zip(fam_fc_data['GeneID'].fillna(""), fam_fc_data['GeneName'].fillna(""))]
  fam_fc_data = fam_fc_data.drop(columns=['GeneName', 'GeneID']).melt(id_vars='Label', var_name='comparison', value_name='log2FC')
  fam_fc_data.loc[:, 'comparison'] = fam_fc_data['comparison'].str.replace("_log2FoldChange", "")
  fam_fc_data.loc[:, 'species'] = load_species(fam_fc_data, meta)
  fam_fc_data.loc[:, 'log2FC'] *= -1 # invert the FCs (currently > 0 log2FC = overexpression in susceptible)

  if not height:
    height = np.min([fam_fc_data.shape[0]*12, 2500])
  
  my_plot = px.strip if plot_type == 'strip' else px.box
  fig = my_plot(
      fam_fc_data, 
      y='Label', 
      x='log2FC', 
      color='species',
      title=title, 
      hover_data=['comparison'],
      width=width, 
      height=height,
      template='ggplot2'
  )
  fig.update_layout(
      title_font=dict(size=20, color='black', family='Arial, sans-serif'),
      xaxis_range=[-4, 6],
      xaxis_title="log2 Fold Change",
      yaxis_title="Gene",
  )
  fig.add_vline(0,  line_width=1, line_dash="dash", line_color="grey")
  fig.show()
  return(fc_data)

### RNA-Seq-Meta 

This notebook produces interactive strip plots and boxplots with plotly to summarise gene expression across *An. gambiae* RNA-sequencing experiments. It is still in development.

Currently *An. gambiae* is not split into *An. gambiae* and *An. coluzzii*. You can toggle which species are displayed by clicking the legend. Because *An. funestus* is included, only genes with orthologs in *An. funestus* are shown.

Feedback and ideas are welcome.
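
To make the reshaping step concrete, here is a minimal sketch of what happens to the fold change table before plotting: the wide table (one column per comparison) is melted to long format and the sign of the fold changes is flipped. The gene IDs, gene names, comparison names and values below are made up for illustration; only the column names and the `_log2FoldChange` suffix follow the code in this notebook.

```python
import pandas as pd

# Toy wide table: IDs, names, comparisons and values are hypothetical.
wide = pd.DataFrame({
    "GeneID": ["GENE0001", "GENE0002"],
    "GeneName": ["nameA", None],
    "expA_log2FoldChange": [1.5, -0.5],
    "expB_log2FoldChange": [2.0, 0.25],
})

# Build a display label: "ID | name" when a name exists, otherwise just the ID.
names = wide["GeneName"].fillna("")
wide["Label"] = [g if n == "" else f"{g} | {n}" for g, n in zip(wide["GeneID"], names)]

# Reshape to long format: one row per gene per comparison.
long_df = wide.drop(columns=["GeneID", "GeneName"]).melt(
    id_vars="Label", var_name="comparison", value_name="log2FC"
)
long_df["comparison"] = long_df["comparison"].str.replace("_log2FoldChange", "", regex=False)

# Flip the sign of the fold changes, as plotly_strip_genes does.
long_df["log2FC"] = -long_df["log2FC"]
print(long_df)
```

Each row of `long_df` then corresponds to one point in the strip plot, with `Label` on the y axis and `log2FC` on the x axis.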

In [15]:
fc_data = plotly_strip_genes(gene_ids=["AGAP006227", "AGAP006228"], title="Coeae1f", height=300)

You can also produce a boxplot, although the hover text doesn't work properly.

In [11]:
plotly_strip_genes(gene_ids=["AGAP006227", "AGAP006228"], title="Coeae1f", plot_type='boxplot', height=300)

**Across gene families linked to resistance**


`"rna-seq-meta/resources/Anogam_long.pep_Pfamscan.seqs"` maps genes to Pfam domains in *An. gambiae*, and we can use this to produce separate plots for different gene families. We also have GO term data in `Anogam_long.pep_eggnog_diamond.emapper.annotations.GO`, which could be used in the same way.

If you have ideas for genesets to use or for improvements to the plot, please let me know :)
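
As a rough sketch of how the domain-to-gene lookup works, the toy example below derives gene IDs from transcript IDs and then selects the genes carrying any of a set of Pfam domains. The transcript IDs and domain assignments are hypothetical, and `ids_for_domains` is an illustrative helper rather than a function from this repository.

```python
import pandas as pd

# Toy annotation table: transcript IDs and domain assignments are hypothetical.
annot = pd.DataFrame({
    "transcript": ["Anogam_AGAP000001-RA", "Anogam_AGAP000001-RB",
                   "Anogam_AGAP000002-RA", "Anogam_AGAP000003-RA"],
    "domain": ["GST_N", "GST_C", "p450", "GST_C"],
})

# Strip the species prefix (literal match) and the transcript suffix (regex match).
annot["gene_id"] = (
    annot["transcript"]
    .str.replace("Anogam_", "", regex=False)
    .str.replace(r"-R[A-Z]$", "", regex=True)
)

def ids_for_domains(df, domains):
    """Return sorted unique gene IDs carrying any of the given Pfam domains."""
    if isinstance(domains, str):
        domains = [domains]
    return sorted(df.loc[df["domain"].isin(domains), "gene_id"].unique())

gst_ids = ids_for_domains(annot, ["GST_N", "GST_C"])
p450_ids = ids_for_domains(annot, "p450")
print(gst_ids, p450_ids)
```

A gene with several transcripts or several matching domains appears only once in the result, so it is plotted only once per family.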

In [7]:
# Read in the delimited text files containing Pfam domains (whitespace-separated) and GO terms (tab-separated)
pfam_df = pd.read_csv("rna-seq-meta/resources/Anogam_long.pep_Pfamscan.seqs", sep=r"\s+", header=None)
go_df = pd.read_csv("rna-seq-meta/resources/Anogam_long.pep_eggnog_diamond.emapper.annotations.GO", sep="\t", header=None)

pfam_df.columns = ["transcript", "pstart", "pend", "pfamid", "domain", "domseq"]
go_df.columns = ['transcript', 'GO_terms']

gene_annot_df = pfam_df.merge(go_df)
# strip the species prefix (literal) and the transcript suffix such as -RA (regex) to get the gene ID
gene_annot_df.loc[:, 'gene_id'] = gene_annot_df.loc[:, 'transcript'].str.replace("Anogam_", "", regex=False).str.replace("-R[A-Z]", "", regex=True)



In [8]:
# a dict with gene families and their respective Pfam domain for extracting
gene_fams = {
    'CSP': 'OS-D',
    'Cytochrome P450s': 'p450',
    'GSTs': ['GST_N', 'GST_N_3', 'GST_C'],
    'ABC-transporters': ['ABC_membrane', 'ABC_tran'],
    'Carboxylesterases': 'COesterase',
    'Odorant binding proteins': 'PBP_GOBP',
    'Olfactory receptors': '7tm_6',
    'Ionotropic receptors': ['Lig_chan', '7tm_1'],
    'Gustatory receptors': '7tm_7',
    'Fatty acid synthases': 'ketoacyl-synt',
    'FA Elongase': 'ELO',
    'FA desaturase': 'FA_desaturase',
    'FA reductase': 'NAD_binding_4',
}


for name, domain in gene_fams.items():
    
    gene_ids = gene_ids_from_domain(gene_annot_df, domain)
    plotly_strip_genes(gene_ids, title=name, plot_type='boxplot')