# Tutorial on How To Use The *mqhandler* Package

In this tutorial, we show how to use the different functionalities of the mqhandler package, which is installed and imported below under the name `proharmed`. The package comprises four functionalities:
- filter protein IDs
- remap gene names
- reduce gene names
- map orthologs

In this tutorial, we first load the data into a pandas dataframe. Then the protein IDs are filtered, and the gene names are remapped and reduced to those having an Ensembl ID. Finally, we map the gene names to human orthologs.

## Installation

In [6]:
# Install the Python package (published on PyPI as proharmed)
import sys
!{sys.executable} -m pip install proharmed

Defaulting to user installation because normal site-packages is not writeable
Collecting proharmed
  Downloading proharmed-0.0.3.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: proharmed
  Building wheel for proharmed (setup.py) ... done
  Created wheel for proharmed: filename=proharmed-0.0.3-py3-none-any.whl size=36509 sha256=29c26550d27d1c8bba4ff24c858d5aea5c80594bcef91838094fc4e6b1161f99
  Stored in directory: /home/kikky/.cache/pip/wheels/e8/b9/1f/d67b69ed2d1df40ad35434054bf8ad706caf22528999e2a0d7
Successfully built proharmed
Installing collected packages: proharmed
Successfully installed proharmed-0.0.3
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


In [7]:
from importlib.metadata import version
version('proharmed')

'0.0.3'
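
If the package might be missing from the current environment, the version lookup can be wrapped so that it reports this instead of raising an exception. The following is a minimal sketch using only the standard library; the helper name `installed_version` is an assumption made for this illustration and is not part of proharmed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if proharmed is installed, otherwise None.
print(installed_version("proharmed"))
```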

## 1. Load Data

To use the mqhandler functionalities, you need to load your data into a pandas dataframe. The input could be a MaxQuant proteinGroups.txt output, a single column with protein IDs, or any other matrix with a column of IDs or gene names.

In this tutorial, we will use proteomics data processed with MaxQuant.

#### 1.1 Imports

In [3]:
import pandas as pd
from proharmed.mq_utils.runner_utils import find_delimiter

#### 1.2 Specify File Path

In [4]:
file = "<file_path>" # Replace with the path to your data file

#### 1.3 Load Your Data

In [5]:
data = pd.read_table(file, sep=find_delimiter(file)).fillna("")
data.head(5)

FileNotFoundError: [Errno 2] No such file or directory: '<file_path>'
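
The loading step relies on `find_delimiter` to detect the column separator of the input file. How that helper works internally is not shown in this tutorial; as a stand-in, the sketch below guesses the delimiter of a text sample with the standard library's `csv.Sniffer`. The function name `guess_delimiter`, the candidate set, and the sample rows are assumptions made for this illustration.

```python
import csv


def guess_delimiter(sample_text, candidates="\t,;"):
    """Guess the column delimiter of a text sample with csv.Sniffer."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter


# Invented three-line sample in the style of a tab-separated table.
sample = (
    "Protein IDs\tGene names\tIntensity\n"
    "P12345\tAbc1\t100\n"
    "Q67890\tDef2\t200\n"
)
print(repr(guess_delimiter(sample)))
```

In practice, one would read the first few kilobytes of the file and pass them as `sample_text`, then hand the result to `pd.read_table(file, sep=...)`.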

## 2. Filter Protein IDs

MaxQuant requires FASTA files for protein assignment. Since MaxQuant can also process several datasets together, the results may contain protein IDs from several organisms. This method checks the organism of each protein ID by querying the UniProt database directly and removes incorrectly assigned IDs. Additionally, decoy (REV__) and contaminant (CON__) IDs and/or unreviewed protein IDs can be removed.


First, import the method from the package. After a few parameters have been specified, the method can be called. It filters the selected protein column and returns a new dataframe.

To show how many IDs were filtered out, in total and per row, the call also returns a log dictionary with two dataframes (`Overview_Log` and `Detailed_Log`) that present this information as tables.

The same information can also be displayed as plots with a simple call.
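
To make the decoy and contaminant part of the filtering concrete, here is a minimal sketch that works on single ID cells. It assumes that IDs within a cell are separated by semicolons and that decoys and contaminants carry the `REV__` and `CON__` prefixes; the function name and the example IDs are invented for this illustration. The organism and review-status checks need access to UniProt and are not sketched here, and this is not the package's own implementation.

```python
def drop_decoys_and_contaminants(cell, separator=";", prefixes=("REV__", "CON__")):
    """Split one ID cell, drop IDs with a decoy or contaminant prefix,
    and return the filtered cell plus the number of removed IDs."""
    ids = [i for i in cell.split(separator) if i]
    kept = [i for i in ids if not i.startswith(prefixes)]
    return separator.join(kept), len(ids) - len(kept)


# Invented example cells: (filtered cell, number of removed IDs) per row.
rows = ["P12345;REV__P12345", "CON__Q67890", "P11111"]
for row in rows:
    print(drop_decoys_and_contaminants(row))
```

Returning the removed count next to the filtered cell mirrors the idea of a per-row log, and a row that ends up empty is the case that `keep_empty` decides about.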

#### 2.1 Imports

In [None]:
from proharmed import filter_ids as fi

#### 2.2 Set Preferences

In [None]:
# mandatory
protein_column = "Protein IDs" # Name of column with protein IDs

# optional
organism = "rat" # Specify organism the IDs should match to
rev_con = False # Bool to indicate if protein IDs from decoy (REV__) and contaminants (CON__) should be kept
reviewed = False # Bool to indicate if newly retrieved protein IDs should be reduced to reviewed ones
keep_empty = False # Bool to indicate if empty ID cells should be kept or deleted
res_column = None # Name of column for filtered protein IDs results. If None, the protein_column will be overridden

#### 2.3 Run filter_protein_ids

In [None]:
fi_data, fi_log_dict = fi.filter_protein_ids(data = data, protein_column = protein_column,
                                             organism = organism, rev_con = rev_con, keep_empty = keep_empty,
                                             reviewed = reviewed, res_column = res_column)
fi_data.head(5)

#### 2.4 Inspect Logging

In [None]:
from proharmed.mq_utils import plotting as pt
out_dir = "<out_dir>"
pt.create_overview_plot(fi_log_dict["Overview_Log"], out_dir = out_dir)

In [None]:
pt.create_filter_detailed_plot(fi_log_dict["Detailed_Log"], organism = organism, 
                               reviewed = reviewed, decoy = rev_con, out_dir = out_dir)

## 3. Remap Gene Names

Besides protein IDs, gene names are also extracted from the respective FASTA files and mapped. They are needed for readable labels in plots and for analyses such as enrichment analysis. Unfortunately, FASTA files are not always complete in terms of gene names.

This method retrieves the gene names assigned to the protein IDs by querying the UniProt database directly, and uses them to fill empty entries in the user file or even to replace existing entries. Several modes determine which names are taken.


Again, import the package's function and specify some preferences before running the method. The selected gene name column is remapped based on the specified protein ID column. The method returns a dataframe containing the remapped gene name column and all other columns of the original data.

Here, too, you can subsequently obtain information on how many gene names were found for how many rows. This can also be displayed as a plot with a simple call.

In this tutorial, we call the remap gene names function on the data that has already been processed with the filter IDs method.

#### 3.1 Imports

In [None]:
from proharmed import remap_genenames as rmg

#### 3.2 Set Preferences

In [None]:
# mandatory
mode = "uniprot_primary" # Mode of refilling. See below for details.
protein_column = "Protein IDs" # Name of column with protein IDs

# optional
gene_column = "Gene names" # Name of column with gene names
skip_filled = False # Bool to indicate if already filled gene names should be skipped
organism = "rat" # Specify organism the IDs should match to
fasta = None # Path of FASTA file (used when mode is all or fasta)
keep_empty = False # Bool to indicate if empty gene names cells should be kept or deleted
res_column = None # Name of column for remap gene names results. If None, the gene_column will be overridden

**Modes of refilling:**
- all: Use primarily FASTA information and additionally UniProt information.
- fasta: Use information extracted from FASTA headers.
- uniprot: Use mapping information from UniProt and use all gene names.
- uniprot_primary: Use mapping information from UniProt and use only the primary gene names.
- uniprot_one: Use mapping information from UniProt and use only the single most frequent gene name.
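
The difference between keeping all gene names and keeping only the most frequent one can be illustrated with a toy lookup. The sketch below is not the package's implementation: the lookup table, the mode labels `all_names` and `most_frequent`, and the function name are assumptions made for this example, and real runs obtain the names from UniProt or FASTA headers.

```python
from collections import Counter

# Hypothetical lookup: protein ID -> gene names (invented for this illustration).
GENE_NAMES = {
    "P1": ["Abc1", "Abc1a"],
    "P2": ["Abc1"],
    "P3": ["Xyz2"],
}


def names_for_row(protein_ids, mode):
    """Collect gene names for a row of protein IDs in one of two toy modes."""
    found = [name for pid in protein_ids for name in GENE_NAMES.get(pid, [])]
    if mode == "all_names":
        # Keep every distinct name, in order of first appearance.
        return list(dict.fromkeys(found))
    if mode == "most_frequent":
        # Keep only the single name that occurs most often.
        return [Counter(found).most_common(1)[0][0]] if found else []
    raise ValueError(f"unknown mode: {mode}")


row = ["P1", "P2", "P3"]
print(names_for_row(row, "all_names"))
print(names_for_row(row, "most_frequent"))
```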

#### 3.3 Run remap_genenames

In [None]:
rmg_data, rmg_log_dict = rmg.remap_genenames(data = fi_data, mode = mode, protein_column = protein_column,
                                             gene_column = gene_column, skip_filled = skip_filled, organism = organism,
                                             fasta = fasta, keep_empty = keep_empty, res_column = res_column)
rmg_data.head(5)

## 4. Reduce Gene Names

A well-known problem with gene symbols is that they are not unique, and slight differences in spelling can cause problems. UniProt often lists different gene symbols for the same gene. Depending on which protein IDs were used to retrieve the gene symbol, the previous remap function can therefore return multiple gene symbols for the same gene.

This method reduces the gene symbols to a common gene symbol using different features and databases, thus preventing redundancy. Several modes determine which names are taken.

Again, import the package's function and specify some preferences before running the method. The method returns a dataframe containing the reduced gene name column and all other columns of the original data.

Here, too, you can subsequently obtain information on how many gene names were reduced for how many rows. This can also be displayed as a plot with a simple call.

In this tutorial, we call the reduce gene names function on the data that has already been processed with the filter IDs and remap gene names methods.

#### 4.1 Imports

In [8]:
from proharmed import reduce_genenames as rdg

#### 4.2 Set Preferences

In [None]:
# mandatory
mode = "ensembl" # Mode of reduction. See below for details.
gene_column = "Remapped Gene Names" # Name of column with gene names
organism = "human" # Specify organism the IDs should match to

# optional
res_column = None # Name of column of reduced gene names results. If None, the gene_column will be overridden
keep_empty = False # Bool to indicate if empty reduced gene names cells should be kept or deleted
HGNC_mode = None # Mode on how to reduce the gene names using HGNC (mostfrequent, all)

**Modes of reduction:**
- ensembl: Use gProfiler to reduce gene names to those having an Ensembl ID
- HGNC: Use HGNC database to reduce gene names to those having an entry in HGNC (only for human)
- mygeneinfo: Use mygeneinfo database to reduce gene names to those having an entry in mygeneinfo
- enrichment: Use gProfiler to reduce gene names to those having a functional annotation
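
Whatever database a mode uses, the underlying idea is to collapse synonyms to one symbol and to drop names that have no entry. The sketch below shows that idea with a small hand-written alias table; the table, the function name, and the case-insensitive matching are assumptions made for this illustration, whereas the package queries the databases listed above.

```python
# Hand-written alias table for illustration only; a real run queries a database.
ALIAS_TO_SYMBOL = {"TP53": "TP53", "P53": "TP53", "LFS1": "TP53", "ACTB": "ACTB"}


def reduce_names(cell, separator=";"):
    """Map each name in a cell to its canonical symbol, dropping unknown names
    and duplicates while keeping the order of first appearance."""
    reduced = []
    for name in cell.split(separator):
        symbol = ALIAS_TO_SYMBOL.get(name.strip().upper())
        if symbol and symbol not in reduced:
            reduced.append(symbol)
    return separator.join(reduced)


print(reduce_names("p53;TP53;LFS1"))
print(reduce_names("ACTB;unknown1"))
```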

#### 4.3 Run reduce_genenames

In [None]:
rdg_data, rdg_log_dict = rdg.reduce_genenames(data = rmg_data, mode = mode, gene_column = gene_column,
                                              organism = organism, res_column = res_column, keep_empty = keep_empty,
                                              HGNC_mode = HGNC_mode)
rdg_data.head(5)

#### 4.4 Inspect Logging

In [None]:
pt.create_overview_plot(rdg_log_dict["Overview_Log"], out_dir = out_dir)

In [None]:
pt.create_reduced_detailed_plot(rdg_log_dict["Detailed_Log"], out_dir = out_dir)

## 5. Map Orthologs

Suppose you want to compare data between organisms, for example in a review across several species. You then run into a known problem: gene names differ between species, so all IDs must be mapped to one selected organism through ortholog mapping.

Using the widely used gProfiler, this method maps the gene names from the current organism to the target organism.

Again, import the package's function and specify some preferences before running the method. The selected gene name column is mapped to its orthologs in the specified target organism. The method returns a dataframe containing the mapped ortholog gene names and all other columns of the original data.

Unfortunately, depending on the source and target organisms, there will be cases where no orthologous gene can be found. The method also returns log information that shows how often this happened.

As with the previous tasks, the log information can be displayed in plots.
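
The mapping and the bookkeeping of unmapped names can be sketched with a toy lookup table. This is only an illustration of the principle: the table is written by hand, `Xyz1` is a made-up name without an entry, and the function name is an assumption, whereas the package itself obtains orthologs through gProfiler.

```python
# Toy mouse -> human ortholog table, written by hand for illustration.
MOUSE_TO_HUMAN = {"Trp53": "TP53", "Actb": "ACTB"}


def map_to_orthologs(gene_names, table):
    """Return the mapped names and the list of names without an ortholog entry."""
    mapped, unmapped = [], []
    for name in gene_names:
        if name in table:
            mapped.append(table[name])
        else:
            unmapped.append(name)
    return mapped, unmapped


genes = ["Trp53", "Actb", "Xyz1"]
mapped, unmapped = map_to_orthologs(genes, MOUSE_TO_HUMAN)
print(mapped)
print(f"{len(unmapped)} of {len(genes)} names had no ortholog in the toy table")
```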

#### 5.1 Imports

In [None]:
from proharmed import map_orthologs as mo

#### 5.2 Set Preferences

In [None]:
# mandatory
gene_column = "Gene Names" # Name of column with gene names
source_organism = "mouse" # Specify organism the IDs match to
tar_organism = "human" # Specify organism the IDs should be mapped to

# optional
keep_empty = False # Bool to indicate if empty ortholog gene names cells should be kept or deleted
res_column = None # Name of column of orthologs gene names results. If None, the gene_column will be overridden

#### 5.3 Run map_orthologs

In [None]:
mo_data, mo_log_dict = mo.map_orthologs(data = rdg_data, gene_column = gene_column, organism = source_organism,
                                        tar_organism = tar_organism, keep_empty = keep_empty,
                                        res_column = res_column)
mo_log_dict

#### 5.4 Inspect Logging

In [None]:
pt.create_overview_plot(mo_log_dict["Overview_Log"], out_dir = out_dir)

In [None]:
pt.create_ortholog_detailed_plot(mo_log_dict["Detailed_Log"], organism = source_organism, out_dir = out_dir)