# `nf_rnaseq` notebook

The `nf_rnaseq` package can be used to query a number of databases and harmonize gene identifiers in the course of an RNA-seq analysis.

In [1]:
import pandas as pd
from io import StringIO

from nf_rnaseq import variables
from nf_rnaseq.biomart import BioMart
from nf_rnaseq.hgnc import HGNC
from nf_rnaseq.uniprot import UniProt, UniProtPOST, UniProtGET
from nf_rnaseq import load

## Variables

This module contains a dictionary for the default properties needed at instantiation of {class}`BioMart`, {class}`HGNC`, {class}`UniProt`, {class}`UniProtGET`, and {class}`UniProtPOST`.

The package is designed to query only the provided `url_base` for each database; `term_in` and `term_out`, however, can be changed to convert between other identifier types.

In [2]:
dict_databases = variables.DICT_DATABASES
dict_databases

{'BioMart': {'GET': {'api_object': nf_rnaseq.biomart.BioMart,
   'term_in': 'ensembl_transcript_id_version',
   'term_out': 'external_gene_name',
   'url_base': 'http://www.ensembl.org/biomart/martservice?query=<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query><Query  virtualSchemaName = "default" formatter = "TSV" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" ><Dataset name = "hsapiens_gene_ensembl" interface = "default" ><Filter name = "<TERM_IN>" value = "<IDS>"/><Attribute name = "<TERM_IN>" /><Attribute name = "<TERM_OUT>" /></Dataset></Query>',
   'headers': None}},
 'HGNC': {'GET': {'api_object': nf_rnaseq.hgnc.HGNC,
   'term_in': 'mane_select',
   'term_out': 'symbol',
   'url_base': 'https://rest.genenames.org/fetch',
   'headers': "{'Accept': 'application/json'}"}},
 'UniProt': {'GET': {'api_object': nf_rnaseq.uniprot.UniProt,
   'term_in': 'UniProtKB_AC-ID',
   'term_out': 'Gene_Name',
   'url_base': 'https://rest.uniprot.org/uniprotkb',
   'heade
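Because only `term_in` and `term_out` are meant to change, one way to customize a query is to copy a database entry and override those two keys while leaving `url_base` untouched. The sketch below is a minimal illustration under stated assumptions: it uses a stand-in dictionary that mirrors the HGNC entry shown above instead of importing `nf_rnaseq`, and `with_terms` is a hypothetical helper that is not part of the package.

```python
from copy import deepcopy

# Stand-in that mirrors the HGNC entry of variables.DICT_DATABASES shown above.
DICT_DATABASES_EXAMPLE = {
    "HGNC": {
        "GET": {
            "term_in": "mane_select",
            "term_out": "symbol",
            "url_base": "https://rest.genenames.org/fetch",
            "headers": "{'Accept': 'application/json'}",
        }
    }
}


def with_terms(entry, term_in=None, term_out=None):
    """Return a copy of `entry` with optionally overridden terms (hypothetical helper)."""
    updated = deepcopy(entry)  # never mutate the shared defaults
    if term_in is not None:
        updated["term_in"] = term_in
    if term_out is not None:
        updated["term_out"] = term_out
    return updated


default_entry = DICT_DATABASES_EXAMPLE["HGNC"]["GET"]
custom_entry = with_terms(default_entry, term_in="refseq_accession")
print(custom_entry["term_in"], custom_entry["url_base"])
```

Copying the entry keeps the module-level defaults intact, so later cells that read the dictionary still see the original terms.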

<style>
  table {
    margin: auto;
    width: 100%; /* Adjust the width as needed */
  }
  td {
    text-align: justify;
    padding: 8px; /* Adding padding for better readability */
  }
  th
  {
    text-align: center;
  }
</style>

## API schema

The use of the API clients is governed by a series of `ABC` and `dataclass` objects from the `api_schema` module whose inheritance, properties, and functions are described below:

<ins>**{class}`APIClient`**</ins>

Parent class that governs all shared API client properties and functions
    
**Properties**

<table>
    <tr>
        <th>Properties</th>
        <th>Type (Default)</th>
        <th>Description</th> 
    </tr>
    <tr>
        <td>identifier</td>
        <td>str</td>
        <td>String value containing search term or comma-delimited set of search terms</td>
    </tr>
    <tr>
        <td>term_in</td>
        <td>str (default: None)</td>
        <td>Term from which to convert in query database</td>
    </tr>
    <tr>
        <td>term_out</td>
        <td>str (default: None)</td>
        <td>Term to which to convert in query database</td>
    </tr>
</table>

**Functions**

+ {func}`APIClient.__post_init__`
  
+ {func}`APIClient.check_response`

+ {func}`APIClient.process_identifier`

+ {func}`APIClient.query_api` (`@abstractmethod`)

<table>
    <tr>
        <th>Function</th>
        <th>Description</th> 
    </tr>
    <tr>
        <td>__post_init__</td>
        <td>Upon initialization, the `process_identifier` function is called</td>
    </tr>
    <tr>
        <td>check_response</td>
        <td>Raise for status with `requests` otherwise log error</td>
    </tr>
    <tr>
        <td>process_identifier</td>
        <td>For `identifier` strip [ and ], split on comma, strip extra spaces; save results as `identifier` and list version as `list_identifier`</td>
    </tr>
    <tr>
        <td>query_api</td>
        <td>Abstract method to query API implemented at level of sub-class</td>
    </tr>
</table>
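The normalization described for `process_identifier` can be sketched as a plain function. This is an illustration of the steps listed in the table, not the package's implementation; dropping empty items is an extra assumption made here.

```python
def process_identifier(identifier: str) -> tuple[str, list[str]]:
    """Strip brackets, split on commas, and strip spaces around each item."""
    cleaned = identifier.strip().strip("[]")
    # Assumption: empty items (e.g., from a trailing comma) are dropped.
    list_identifier = [item.strip() for item in cleaned.split(",") if item.strip()]
    return ",".join(list_identifier), list_identifier


identifier, list_identifier = process_identifier("[P24468, C9J5X1,  Q5W5X9]")
print(identifier)       # P24468,C9J5X1,Q5W5X9
print(list_identifier)  # ['P24468', 'C9J5X1', 'Q5W5X9']
```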

<br>

<ins>**{class}`APIClientGET`**</ins>

Child class of `APIClient` that provides basic `GET` functionality for HTTP requests

**Additional properties**

<table>
    <tr>
        <th>Properties</th>
        <th>Type (Default)</th>
        <th>Description</th> 
    </tr>
    <tr>
        <td>headers</td>
        <td>str</td>
        <td>String representation of the dictionary of HTTP request headers (e.g., "{'Accept': 'application/json'}")</td>
    </tr>
    <tr>
        <td>polling_interval</td>
        <td>int (default: 5)</td>
        <td>How often a poll check occurs for change of state, if necessary (e.g., GET after POST)</td>
    </tr>
</table>

**Additional functions**

+ {func}`APIClientGET.__post_init__`

+ {func}`APIClientGET.query_api`

+ {func}`APIClientGET.create_query_url` (`@abstractmethod`)

+ {func}`APIClientGET.check_if_job_ready` (`@abstractmethod`)

+ {func}`APIClientGET.maybe_get_gene_names` (`@abstractmethod`)

<table>
    <tr>
        <th>Function</th>
        <th>Description</th> 
    </tr>
    <tr>
        <td>__post_init__</td>
        <td>Upon initialization, the `super().__post_init__`, `create_query_url`, `query_api`, and `maybe_get_gene_names` functions are called </td>
    </tr>
    <tr>
        <td>query_api</td>
        <td>Query API and add the output to `self.json` if json otherwise `self.text`</td>
    </tr>
    <tr>
        <td>create_query_url (@abstractmethod)</td>
        <td>Abstract method to generate the URL to query; implemented at the level of the sub-class</td>
    </tr>
    <tr>
        <td>check_if_job_ready (@abstractmethod)</td>
        <td>Abstract method to check whether the job is ready when a preceding POST is necessary; implemented at the level of the sub-class; should return `True` if needed and `False` otherwise</td>
    </tr>
    <tr>
        <td>maybe_get_gene_names (@abstractmethod)</td>
        <td>Abstract method to extract the gene names from the query response, if available; implemented at the level of the sub-class</td>
    </tr>
</table>
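The inheritance pattern described above (an abstract `dataclass` parent whose `__post_init__` is extended by a `GET` child that builds a URL, waits for the job if necessary, and then queries) can be sketched as follows. All class names here are hypothetical stand-ins, the HTTP request is replaced by a string, `maybe_get_gene_names` is omitted for brevity, and treating `True` from `check_if_job_ready` as "ready" is an assumption.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchClient(ABC):
    """Stand-in for APIClient: normalizes the identifier on construction."""

    identifier: str
    term_in: Optional[str] = None
    term_out: Optional[str] = None
    list_identifier: list[str] = field(init=False, default_factory=list)

    def __post_init__(self):
        self.list_identifier = [s.strip() for s in self.identifier.strip("[] ").split(",")]
        self.identifier = ",".join(self.list_identifier)

    @abstractmethod
    def query_api(self):
        """Implemented by sub-classes."""


@dataclass
class SketchClientGET(SketchClient):
    """Stand-in for APIClientGET: builds the URL, waits if needed, then queries."""

    url_base: str = ""
    polling_interval: int = 5
    url_query: str = field(init=False, default="")
    text: str = field(init=False, default="")

    def __post_init__(self):
        super().__post_init__()
        self.create_query_url()
        self.query_api()

    def query_api(self):
        # Assumption: True means "ready"; otherwise wait and check again.
        while not self.check_if_job_ready():
            time.sleep(self.polling_interval)
        # A real client would issue an HTTP GET here; the sketch records the URL.
        self.text = f"GET {self.url_query}"

    @abstractmethod
    def create_query_url(self):
        """Set self.url_query."""

    @abstractmethod
    def check_if_job_ready(self):
        """Return True once results can be fetched."""


@dataclass
class EchoClient(SketchClientGET):
    """Toy sub-class that needs no POST, so the job is always ready."""

    def create_query_url(self):
        self.url_query = f"{self.url_base}/{self.term_in}/{self.identifier}"

    def check_if_job_ready(self):
        return True


client = EchoClient(identifier="[A1, B2]", term_in="id", url_base="https://example.org")
print(client.list_identifier)  # ['A1', 'B2']
print(client.text)             # GET https://example.org/id/A1,B2
```

Running the whole query inside `__post_init__` means that constructing an object is enough to obtain results, which matches how the clients are used in the cells of this notebook.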

<br>

<ins>**{class}`APIClientPOST`**</ins>

Child class of `APIClient` that provides basic `POST` functionality for HTTP requests

TODO

## BioMart

{class}`BioMart` can be used to retrieve multiple comma-delimited entries from [Ensembl's BioMart](https://useast.ensembl.org/info/data/biomart/index.html). Note that the following query logs a `requests.exceptions.JSONDecodeError` because BioMart returns TSV rather than JSON; as a result, the response is stored in the `text` property of the {class}`BioMart` object instead of in the `json` property.

In [3]:
dict_biomart = dict_databases["BioMart"]["GET"]
biomart_obj = BioMart(
    identifier="ENST00000614007.1,ENST00000493287.5,ENST00000582431.2",
    term_in=dict_biomart["term_in"],
    term_out=dict_biomart["term_out"],
    url_base=dict_biomart["url_base"],
)

pd.DataFrame(
    {"original_id": biomart_obj.list_identifier, "gene_names": biomart_obj.list_gene_names, "source": "BioMart"}
)

ERROR:root:Error at division
Traceback (most recent call last):
  File "/home/whitej6/miniforge3/envs/nf_rna/lib/python3.11/site-packages/requests/models.py", line 974, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/whitej6/miniforge3/envs/nf_rna/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/whitej6/miniforge3/envs/nf_rna/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/whitej6/miniforge3/envs/nf_rna/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call la

| | original_id | gene_names | source |
|---|---|---|---|
| 0 | ENST00000493287.5 | [MRPL20] | BioMart |
| 1 | ENST00000582431.2 | [RN7SL657P] | BioMart |
| 2 | ENST00000614007.1 | [U6] | BioMart |
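Since the BioMart query requests the `TSV` formatter with two attributes (`term_in` followed by `term_out`), the `text` property can be turned into a mapping with a few lines of standard-library code. The following is a minimal sketch, not the package's parser: `parse_biomart_text` is a hypothetical helper and the sample text is an illustrative reconstruction based on the table above.

```python
from collections import defaultdict


def parse_biomart_text(text: str) -> dict[str, list[str]]:
    """Group the second TSV column (term_out) by the first (term_in)."""
    mapping = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        original_id, _, gene_name = line.partition("\t")
        names = mapping[original_id.strip()]  # creates an empty list for new IDs
        gene_name = gene_name.strip()
        if gene_name and gene_name not in names:
            names.append(gene_name)
    return dict(mapping)


# Illustrative response body (two columns, no header row).
sample_text = (
    "ENST00000493287.5\tMRPL20\n"
    "ENST00000582431.2\tRN7SL657P\n"
    "ENST00000614007.1\tU6\n"
)
biomart_mapping = parse_biomart_text(sample_text)
print(biomart_mapping)
```

Keeping a list per identifier allows for the case where one identifier maps to more than one gene name.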


## HGNC

{class}`HGNC` can be used to retrieve single entries from the HUGO Gene Nomenclature Committee's (HGNC) [API](https://www.genenames.org/help/rest). The searchable fields are listed on the HGNC website.

In [4]:
dict_hgnc = dict_databases["HGNC"]["GET"]
hgnc_obj = HGNC(
    identifier="NM_033360",
    term_in="refseq_accession",
    term_out="symbol",
    url_base=dict_hgnc["url_base"],
    headers=dict_hgnc["headers"],
)

pd.DataFrame({"original_id": hgnc_obj.list_identifier, "gene_names": hgnc_obj.list_gene_names, "source": "HGNC"})

| | original_id | gene_names | source |
|---|---|---|---|
| 0 | NM_033360 | KRAS | HGNC |
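The HGNC REST service exposes records at `fetch/<field>/<value>` and returns matches under `response.docs` when JSON is requested. The sketch below shows how the pieces used above could fit together without making a network call. The helper names are hypothetical, the way the package itself composes the URL is an assumption, and the payload is a trimmed, illustrative response.

```python
import ast


def build_fetch_url(url_base: str, term_in: str, identifier: str) -> str:
    """Compose a fetch URL of the form <url_base>/<term_in>/<identifier>."""
    return f"{url_base.rstrip('/')}/{term_in}/{identifier}"


def parse_headers(headers: str) -> dict:
    """Convert the stringified headers stored in the defaults into a dict."""
    return ast.literal_eval(headers)


def extract_terms(payload: dict, term_out: str) -> list:
    """Collect `term_out` from every matching record in the response."""
    docs = payload.get("response", {}).get("docs", [])
    return [doc[term_out] for doc in docs if term_out in doc]


url = build_fetch_url("https://rest.genenames.org/fetch", "refseq_accession", "NM_033360")
headers = parse_headers("{'Accept': 'application/json'}")

# Trimmed, illustrative payload; a real response carries many more fields.
payload = {"response": {"numFound": 1, "docs": [{"symbol": "KRAS"}]}}
symbols = extract_terms(payload, "symbol")
print(url, headers, symbols)
```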


## UniProt

### Single entry retrieval

{class}`UniProt` can be used to retrieve single entries from UniProtKB's [individual entry API](https://www.uniprot.org/help/api_retrieve_entries).

In [5]:
uniprot_obj = UniProt(
    identifier="P24468",
    term_in=dict_databases["UniProt"]["GET"]["term_in"],
    term_out=dict_databases["UniProt"]["GET"]["term_out"],
    url_base=dict_databases["UniProt"]["GET"]["url_base"],
)

pd.DataFrame(
    {"original_id": uniprot_obj.list_identifier, "gene_names": uniprot_obj.list_gene_names, "source": "UniProt"}
)

| | original_id | gene_names | source |
|---|---|---|---|
| 0 | P24468 | NR2F2 | UniProt |


### Bulk entry retrieval

A combination of {class}`UniProtPOST` and {class}`UniProtGET` can be used to retrieve a large number of entries from UniProtKB's [ID mapping API](https://www.uniprot.org/help/id_mapping).

In [6]:
str_id = "P24468, C9J5X1, Q5W5X9"
str_db = "UniProtBULK"

dict_post = dict_databases[str_db]["POST"]
uniprot_post_obj = UniProtPOST(
    identifier=str_id,
    term_in=dict_post["term_in"],
    term_out=dict_post["term_out"],
    url_base=dict_post["url_base"],
)

dict_get = dict_databases[str_db]["GET"]
uniprot_get_obj = UniProtGET(
    identifier=str_id,
    term_in=dict_get["term_in"],
    term_out=dict_get["term_out"],
    url_base=dict_get["url_base"],
    jobId=uniprot_post_obj.jobId,
)

pd.DataFrame(
    {
        "original_id": uniprot_get_obj.list_identifier,
        "gene_names": uniprot_get_obj.list_gene_names,
        "source": "UniProtBULK",
    }
)

| | original_id | gene_names | source |
|---|---|---|---|
| 0 | P24468 | [NR2F2] | UniProtBULK |
| 1 | C9J5X1 | [IGF1R] | UniProtBULK |
| 2 | Q5W5X9 | [TTC23] | UniProtBULK |
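The ID mapping results arrive as a list of `from`/`to` pairs plus a list of `failedIds` (the payload shape appears in the log output of the command line section of this notebook). A minimal sketch of how such a payload could be aligned with the input identifiers is shown here; `gene_names_by_id` is a hypothetical helper, not the package's implementation.

```python
def gene_names_by_id(payload: dict, list_identifier: list[str]) -> list[list[str]]:
    """Return one list of gene names per input ID, in input order."""
    grouped = {identifier: [] for identifier in list_identifier}
    for hit in payload.get("results", []):
        grouped.setdefault(hit["from"], []).append(hit["to"])
    # IDs without a hit (e.g., those listed in failedIds) keep an empty list.
    return [grouped[identifier] for identifier in list_identifier]


payload = {
    "results": [
        {"from": "P24468", "to": "NR2F2"},
        {"from": "C9J5X1", "to": "IGF1R"},
        {"from": "Q5W5X9", "to": "TTC23"},
    ],
    "failedIds": [],
}
list_gene_names = gene_names_by_id(payload, ["P24468", "C9J5X1", "Q5W5X9"])
print(list_gene_names)  # [['NR2F2'], ['IGF1R'], ['TTC23']]
```

Returning the names in input order keeps them aligned with `list_identifier`, which is what allows both to be placed side by side in a data frame.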


## Command line script

The package also provides a command line script with the following inputs:

- `cachePath` (`-c`): path in which to store the `requests_cache` object

- `database` (`-d`): one of the keys in {dict}`variables.DICT_DATABASES` (BioMart, HGNC, UniProt, UniProtBULK)

- `input` (`-i`): identifier or comma-delimited list of identifiers

- `tsv` (`-t`): a `store_true` flag; if set, the output is tab-separated, otherwise it is comma-separated
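For orientation, the inputs listed above map naturally onto an `argparse` parser. The sketch below is not the package's actual command line definition: the short flags follow the usage shown in this notebook, while the long option names, the `choices`, and which arguments are required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented inputs (illustrative only)."""
    parser = argparse.ArgumentParser(prog="get_gene_name")
    parser.add_argument("-c", "--cachePath", default=None, help="path for the requests_cache object")
    parser.add_argument(
        "-d",
        "--database",
        required=True,
        choices=["BioMart", "HGNC", "UniProt", "UniProtBULK"],
        help="database to query",
    )
    parser.add_argument("-i", "--input", required=True, help="identifier or comma-delimited identifiers")
    parser.add_argument("-t", "--tsv", action="store_true", help="write TSV instead of CSV")
    return parser


args = build_parser().parse_args(["-i", "P24468, C9J5X1, Q5W5X9", "-d", "UniProtBULK", "-t"])
print(args.database, args.tsv, args.cachePath)
```

A `store_true` flag defaults to `False`, so omitting `-t` selects the CSV output.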

### CSV output
```
get_gene_name \
    -i <INPUT_IDS> \
    -d <DATABASE> \
    -c <CACHE_DIR> \
    > <FILE_NAME>.csv
```

In [7]:
!get_gene_name \
    -i "P24468, C9J5X1, Q5W5X9" \
    -d "UniProtBULK"

2024-08-16 15:32:01,025 - nf_rnaseq.cli.get_gene_name - INFO - Querying API for UniProtBULK
2024-08-16 15:32:02,799 - nf_rnaseq.uniprot - INFO - 
30672867582f7e26860278d92bf5058a91631230
{'results': [{'from': 'P24468', 'to': 'NR2F2'}, {'from': 'C9J5X1', 'to': 'IGF1R'}, {'from': 'Q5W5X9', 'to': 'TTC23'}], 'failedIds': []}
2024-08-16 15:32:02,799 - nf_rnaseq.api_schema - INFO - 
P24468,C9J5X1,Q5W5X9
{'results': [{'from': 'P24468', 'to': 'NR2F2'}, {'from': 'C9J5X1', 'to': 'IGF1R'}, {'from': 'Q5W5X9', 'to': 'TTC23'}], 'failedIds': []}

P24468              ,['NR2F2']           ,UniProtBULK
C9J5X1              ,['IGF1R']           ,UniProtBULK
Q5W5X9              ,['TTC23']           ,UniProtBULK



### TSV output
```
get_gene_name \
    -i <INPUT_IDS> \
    -d <DATABASE> \
    -c <CACHE_DIR> \
    -t \
    > <FILE_NAME>.tsv
```

In [8]:
!get_gene_name \
    -i "P24468, C9J5X1, Q5W5X9" \
    -d "UniProtBULK" \
    -t

2024-08-16 15:32:06,455 - nf_rnaseq.cli.get_gene_name - INFO - Querying API for UniProtBULK
2024-08-16 15:32:08,084 - nf_rnaseq.uniprot - INFO - 
30672867582f7e26860278d92bf5058a91631230
{'results': [{'from': 'P24468', 'to': 'NR2F2'}, {'from': 'C9J5X1', 'to': 'IGF1R'}, {'from': 'Q5W5X9', 'to': 'TTC23'}], 'failedIds': []}
2024-08-16 15:32:08,084 - nf_rnaseq.api_schema - INFO - 
P24468,C9J5X1,Q5W5X9
{'results': [{'from': 'P24468', 'to': 'NR2F2'}, {'from': 'C9J5X1', 'to': 'IGF1R'}, {'from': 'Q5W5X9', 'to': 'TTC23'}], 'failedIds': []}

P24468              	['NR2F2']           	UniProtBULK
C9J5X1              	['IGF1R']           	UniProtBULK
Q5W5X9              	['TTC23']           	UniProtBULK
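The rows printed by the script have space-padded fields joined by a comma or a tab. One way to produce rows of that shape is sketched below; `format_row` is a hypothetical helper, and the field width of 20 characters is inferred from the output above, not taken from the package source.

```python
def format_row(original_id, gene_names, source, tsv=False, width=20):
    """Left-justify the first two fields and join with the chosen separator."""
    sep = "\t" if tsv else ","
    fields = [original_id.ljust(width), str(gene_names).ljust(width), source]
    return sep.join(fields)


csv_row = format_row("P24468", ["NR2F2"], "UniProtBULK")
tsv_row = format_row("P24468", ["NR2F2"], "UniProtBULK", tsv=True)
print(csv_row)
print(tsv_row)
```

Because the gene names are written with `str()` on a list, the second field contains brackets and quotes, which is why it needs to be parsed back into a list when the file is loaded.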



### Analysis

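The package also provides an analysis module for processing the resulting CSV and TSV files. For the purposes of visualization, the fields in these files are padded with additional spaces, which should be stripped on loading. Moreover, the output IDs take the format of a stringified Python list (e.g., `['NR2F2']`), which can be converted back into a list with `load.literal_eval_list`.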

In [9]:
str_tsv = "\
P24468              \t['NR2F2']           \tUniProtBULK\n\
C9J5X1              \t['IGF1R']           \tUniProtBULK\n\
Q5W5X9              \t['TTC23']           \tUniProtBULK\
"

df_tsv = pd.read_table(StringIO(str_tsv), sep="\t", header=None)

df_tsv.columns = ["original_id", "gene_name", "source"]

df_tsv["original_id"] = df_tsv["original_id"].apply(lambda x: x.strip())
df_tsv["gene_name"] = df_tsv["gene_name"].apply(load.literal_eval_list)
df_tsv["source"] = df_tsv["source"].apply(lambda x: x.strip())

df_tsv

| | original_id | gene_name | source |
|---|---|---|---|
| 0 | P24468 | [NR2F2] | UniProtBULK |
| 1 | C9J5X1 | [IGF1R] | UniProtBULK |
| 2 | Q5W5X9 | [TTC23] | UniProtBULK |
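The conversion of a padded, stringified list back into a Python list can be done safely with `ast.literal_eval`. The function below is a guess at the kind of logic a helper such as `load.literal_eval_list` might contain; it is named differently on purpose and should not be read as the package's implementation.

```python
import ast


def literal_eval_list_sketch(value: str) -> list:
    """Parse a string such as "['NR2F2']           " into a list."""
    parsed = ast.literal_eval(value.strip())  # evaluates literals only, never arbitrary code
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list literal, got {type(parsed).__name__}")
    return parsed


print(literal_eval_list_sketch("['NR2F2']           "))  # ['NR2F2']
print(literal_eval_list_sketch("['A', 'B']"))            # ['A', 'B']
```

Unlike `eval`, `ast.literal_eval` accepts only Python literals, so a malformed or malicious field raises an error instead of executing code.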
