# Objectives

Because CELLxGENE serves as an initiating data source for the NCBI Cell
pilot, this document demonstrates:

-   Identification of CELLxGENE datasets for a particular organism and
    tissue

-   Identification of CELLxGENE citation publications within PubMed

## Background

All single-cell RNA data from CZ CELLxGENE Discover can be accessed,
queried, and analyzed using the [CELLxGENE Discover
Census](https://chanzuckerberg.github.io/cellxgene-census/). Using
cell-based slicing and querying, one can:

-   Interact with the data through TileDB-SOMA

-   Get slices in AnnData, Seurat, or SingleCellExperiment objects

The following sections draw on the CZ CELLxGENE tutorials that
demonstrate how to use the Census to:

-   [Explore and query the Census in the context of a single tissue,
    lung](https://chanzuckerberg.github.io/cellxgene-census/notebooks/analysis_demo/comp_bio_explore_and_load_lung_data.html)

-   [Query the expression data and cell/gene metadata from the Census,
    and load them into common in-memory Python
    objects](https://chanzuckerberg.github.io/cellxgene-census/notebooks/api_demo/census_query_extract.html)

-   [Generate a citation string for all datasets contained in a Census
    slice](https://chanzuckerberg.github.io/cellxgene-census/notebooks/api_demo/census_citation_generation.html)

## Development environment

See [Introduction.ipynb](Introduction.ipynb).

### Jupyter Notebook

Launch Jupyter Notebook from a terminal in which `.zshenv` has been
sourced and the virtual environment has been activated.

### Emacs Org Mode

Evaluate this code block to define environment variables:

``` zsh
source .zshenv
```

Evaluate this code block to activate the virtual environment:

``` elisp
(pyvenv-activate "../../.venv")
```
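
Whichever front end is used, it can help to confirm from Python that the
environment variables are actually visible before running the rest of
the document. The following is a minimal sketch, assuming `.zshenv`
exports `NCBI_EMAIL` and `NCBI_API_KEY` (the two variables read later in
this document); the helper name `missing_env_vars` is hypothetical.

``` python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the environment.

    Hypothetical helper; `environ` defaults to os.environ and can be
    replaced by a plain dict for testing.
    """
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Assumption: these are the variables that .zshenv is expected to export
missing = missing_env_vars(["NCBI_EMAIL", "NCBI_API_KEY"])
if missing:
    print(f"Not set: {', '.join(missing)}")
```

If anything is reported as not set, source `.zshenv` and relaunch the
notebook or Emacs session.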

# Identification of CELLxGENE datasets for human, lung cells

Following the first tutorial, the function below obtains all human lung
cell metadata and datasets from the CZ CELLxGENE Census. Because this is
a time-consuming process, the first call writes the resulting Pandas
DataFrames to `.parquet` files, and subsequent calls read those files
instead. In both cases, the resulting DataFrames are returned.
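
The write-on-first-call, read-thereafter pattern can be sketched in
isolation. This is only an illustration of the caching idea, not the
function used in this document: the name `load_or_build` is
hypothetical, and JSON stands in for parquet so that the sketch needs
only the standard library.

``` python
import json
import os


def load_or_build(path, build):
    """Return cached data from `path`, or call `build()` and cache it.

    Hypothetical helper illustrating the caching pattern only.
    """
    if os.path.exists(path):
        # Subsequent calls: read the cached file
        with open(path) as f:
            return json.load(f)

    # First call: do the slow work, then write the cache
    data = build()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f)
    return data
```

One consequence of this design is that the cache never refreshes on its
own: to pick up a newer Census release, the cached files have to be
deleted by hand.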

To begin, import modules and assign module-scope variables:

    import os

    import cellxgene_census
    import pandas as pd

    DATA_DIR = "../data"

    # NCBI settings read from the environment (None if unset)
    NCBI_EMAIL = os.environ.get("NCBI_EMAIL")
    NCBI_API_KEY = os.environ.get("NCBI_API_KEY")
    NCBI_API_SLEEP = 1

    NCBI_CELL_DIR = f"{DATA_DIR}/ncbi-cell"



Then define the function that does the work:

    def get_lung_obs_and_datasets():
        """Use the CZ CELLxGENE Census to obtain all unprocessed human
        lung cell metadata and datasets, then write the resulting Pandas
        DataFrames to parquet files, or, if the files exist, read them.

        Parameters
        ----------
        None

        Returns
        -------
        lung_obs : pd.DataFrame
            DataFrame containing unprocessed cell metadata
        lung_datasets : pd.DataFrame
            DataFrame containing unprocessed dataset descriptions
        """
        # Create and write, or read DataFrames
        lung_obs_parquet = f"{NCBI_CELL_DIR}/up_lung_obs.parquet"
        lung_datasets_parquet = f"{NCBI_CELL_DIR}/up_lung_datasets.parquet"
        if not os.path.exists(lung_obs_parquet) or not os.path.exists(
            lung_datasets_parquet
        ):
            # The context manager closes the Census even if a read fails
            print("Opening soma")
            with cellxgene_census.open_soma(census_version="latest") as census:

                print("Collecting all datasets")
                datasets = census["census_info"]["datasets"].read().concat().to_pandas()

                print("Collecting lung obs")
                lung_obs = (
                    census["census_data"]["homo_sapiens"]
                    .obs.read(
                        value_filter="tissue_general == 'lung' and is_primary_data == True"
                    )
                    .concat()
                    .to_pandas()
                )

            # Keep only the datasets that contain at least one lung cell
            print("Finding unprocessed lung datasets")
            lung_datasets = datasets[datasets["dataset_id"].isin(lung_obs["dataset_id"])]

            # Ensure the output directory exists before writing
            os.makedirs(NCBI_CELL_DIR, exist_ok=True)

            print("Writing unprocessed lung obs parquet")
            lung_obs.to_parquet(lung_obs_parquet)

            print("Writing unprocessed lung datasets parquet")
            lung_datasets.to_parquet(lung_datasets_parquet)

        else:

            print("Reading unprocessed lung obs parquet")
            lung_obs = pd.read_parquet(lung_obs_parquet)

            print("Reading unprocessed lung datasets parquet")
            lung_datasets = pd.read_parquet(lung_datasets_parquet)

        return lung_obs, lung_datasets
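
The function hard-codes lung in its `value_filter`. To apply the same
query to another tissue, the filter string could be built from a
parameter. The following is a minimal sketch: the helper
`build_value_filter` is hypothetical, and it only assembles the string
in the form used above, without contacting the Census.

``` python
def build_value_filter(tissue_general, primary_only=True):
    """Build an obs value_filter string for one general tissue.

    Hypothetical helper; it mirrors the filter used for lung above.
    """
    if "'" in tissue_general:
        # A quote in the name would break out of the quoted literal
        raise ValueError("tissue name must not contain a single quote")
    value_filter = f"tissue_general == '{tissue_general}'"
    if primary_only:
        # Restrict to primary data, as the lung query does
        value_filter += " and is_primary_data == True"
    return value_filter


print(build_value_filter("lung"))
```

For `"lung"` this yields the same filter string as in the function. The
organism is selected separately, by the key used under
`census["census_data"]` (here `homo_sapiens`).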


Now call the function to obtain the human lung cell metadata and
datasets:

    lung_obs, lung_datasets = get_lung_obs_and_datasets()


Now view the returned DataFrames:

    print(f"lung_obs:\n\n{lung_obs}")
    print()
    print(f"lung_datasets:\n\n{lung_datasets}")

# Identification of CELLxGENE citation publications within PubMed