# Selecting Subcortical Interneurons from the Braun et al. Dataset

In this notebook, I loaded the Braun AnnData object `human_dev_GRCh38-3.0.0.h5ad` and selected some subcortical interneurons to integrate into the main notebook with the developing brain meta-atlas. The aim of this integration is to improve the diversity and accuracy of cell type annotations in the meta-atlas.

In [1]:
import scanpy as sc
import numpy as np
import pandas as pd
import anndata as ad

In [2]:
sc.logging.print_versions()
sc.set_figure_params(facecolor="white", figsize=(7, 4))
sc.settings.verbosity = 3

-----
anndata     0.10.8
scanpy      1.10.1
-----
PIL                         10.2.0
anyio                       NA
arrow                       1.3.0
asciitree                   NA
asttokens                   NA
astunparse                  1.6.3
attr                        23.1.0
attrs                       23.1.0
babel                       2.11.0
bottleneck                  1.3.7
brotli                      1.0.9
certifi                     2024.08.30
cffi                        1.16.0
charset_normalizer          2.0.4
cloudpickle                 3.0.0
colorama                    0.4.6
comm                        0.2.1
cycler                      0.10.0
cython_runtime              NA
dask                        2024.7.0
dateutil                    2.8.2
debugpy                     1.6.7
decorator                   5.1.1
defusedxml                  0.7.1
executing                   0.8.3
fastjsonschema              NA
fqdn                        NA
h5py                        3.9.0

#### I loaded the full Braun et al. dataset.

In [3]:
braun = sc.read_h5ad('/hpc/hers_basak/rnaseq_data/Silettilab/samples/human_dev_GRCh38-3.0.0.h5ad')

In [4]:
braun

AnnData object with n_obs × n_vars = 1665937 × 33538
    obs: 'CellClass', 'CellCycleFraction', 'DoubletFlag', 'DoubletScore', 'Region', 'Subdivision', 'Subregion', 'Tissue', 'TopLevelCluster', 'TotalUMIs', 'organism_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'suspension_type', 'dissection', 'total_UMIs', 'sample_id', 'cluster_id', 'NGenes', 'AnnotationAle', 'Neuroepithelial', 'TotalUMI', 'Chemistry', 'assay'
    var: 'Chromosome', 'End', 'Gene', 'Start', 'Strand', 'Selected'
    uns: 'batch_condition', 'config', 'radius', 'schema_version', 'species', 'title'
    obsm: 'PCA', 'TSNE', 'UMAP'
    obsp: 'KNN', 'MKNN', 'RNN'

#### To retain the metadata of the donor, it is crucial that the column names match. Therefore, I renamed the donor variable to align with the naming convention used in the training meta-atlas.

In [5]:
braun.obs.rename(columns={'donor_id': 'donor_kim'}, inplace=True)
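
Why the column names need to match can be seen with two toy `obs` tables. This is a minimal sketch in plain pandas; `atlas_obs`, `braun_obs` and all donor values are hypothetical and not taken from either dataset. Concatenating before the rename leaves the donor information split across two columns with gaps, while after the rename it ends up in a single column.

```python
import pandas as pd

# Toy stand-ins for the two .obs tables (hypothetical values).
atlas_obs = pd.DataFrame({"donor_kim": ["d1", "d2"]}, index=["a1", "a2"])
braun_obs = pd.DataFrame({"donor_id": ["d3", "d4"]}, index=["b1", "b2"])

# Without renaming, the donor metadata is spread over two columns with missing values.
mismatched = pd.concat([atlas_obs, braun_obs])
n_missing_before = int(mismatched["donor_kim"].isna().sum())

# After renaming, both tables share one donor column and nothing is missing.
renamed = braun_obs.rename(columns={"donor_id": "donor_kim"})
matched = pd.concat([atlas_obs, renamed])
n_missing_after = int(matched["donor_kim"].isna().sum())

print(n_missing_before, n_missing_after)
```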

#### Given the size of the Braun et al. dataset, I applied some preliminary filters (cells with more than 1000 detected genes, genes detected in at least 3 cells) to make the dataset more manageable before further analysis.

In [6]:
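# Keep cells with more than 1000 detected genes; .copy() turns the view into a real AnnData object
braun = braun[braun.obs.NGenes > 1000, :].copy()
# min_genes=200 is expected to be redundant after the NGenes filter above
sc.pp.filter_cells(braun, min_genes=200)
sc.pp.filter_genes(braun, min_cells=3)

filtered out 1523 genes that are detected in less than 3 cells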
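
What the `min_genes` and `min_cells` thresholds mean can be sketched on a small dense matrix with NumPy. This is an illustration under stated assumptions, not scanpy's implementation: the matrix values and the thresholds of 2 are hypothetical, chosen so that the toy data shows one cell and one gene being removed.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 5 genes (columns); values are hypothetical.
X = np.array([
    [3, 0, 1, 0, 2],
    [0, 0, 0, 0, 1],
    [1, 2, 0, 0, 4],
    [5, 1, 1, 0, 0],
])

min_genes = 2  # illustrative threshold, much smaller than the notebook's
min_cells = 2  # illustrative threshold

# Number of genes detected in each cell; keep cells with at least min_genes.
genes_per_cell = (X > 0).sum(axis=1)
keep_cells = genes_per_cell >= min_genes
X = X[keep_cells]

# Number of remaining cells in which each gene is detected; keep genes with at least min_cells.
cells_per_gene = (X > 0).sum(axis=0)
keep_genes = cells_per_gene >= min_cells
X = X[:, keep_genes]

print(X.shape)
```

Because the gene filter runs after the cell filter, the per-gene counts are computed only over the cells that survived the first step, which mirrors the order of the two calls in the cell above.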


#### These cells had not been filtered for cell cycling, so I applied a threshold to `CellCycleFraction`, matching the criterion used for the rest of the training dataset (`noAdolescence_nocc_noclusters_ThirdManualAnnotations.h5ad`), so that they can be merged consistently.

In [7]:
braun = braun[braun.obs["CellCycleFraction"] < 0.01]

#### I selected only the cells whose `Subregion` is 'Subcortex' (stored in this column as the literal text `b'Subcortex'`).

In [8]:
braun = braun[braun.obs['Subregion'] == "b'Subcortex'"]
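
The comparison above uses the text `b'Subcortex'`, which suggests the column stores the printed form of a byte string rather than the plain name. A small sketch on a toy Series (hypothetical values) shows why the exact stored text matters, and one possible way to strip the wrapper if plain names are preferred later on. The clean-up step is an optional suggestion, not something this notebook does.

```python
import pandas as pd

# Toy Subregion column storing the printed form of byte strings (hypothetical values).
subregion = pd.Series(["b'Subcortex'", "b'Cortex'", "b'Subcortex'"])

# Comparing against the plain name silently selects nothing.
n_plain = int((subregion == "Subcortex").sum())

# Comparing against the stored text works, as in the cell above.
n_literal = int((subregion == "b'Subcortex'").sum())

# Optional clean-up: strip the b'...' wrapper so plain names can be used afterwards.
cleaned = subregion.str.replace(r"^b'(.*)'$", r"\1", regex=True)
n_cleaned = int((cleaned == "Subcortex").sum())

print(n_plain, n_literal, n_cleaned)
```

Checking `braun.obs['Subregion'].unique()` before filtering is a cheap way to confirm which form the labels take.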

#### I selected only the nIPC subtypes, i.e. the cells annotated as 'Neuronal IPC' or 'Neuroblast' in `AnnotationAle`.

In [9]:
braun = braun[braun.obs['AnnotationAle'].isin(["Neuronal IPC", "Neuroblast"])]
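
The three selection steps (cell cycle, subregion, annotation) can also be written as named boolean masks, which makes it easy to see how many cells each criterion keeps. The sketch below uses a toy `obs` table in plain pandas; the rows and the 'Radial glia' label are hypothetical, and only the column names and thresholds come from the cells above.

```python
import pandas as pd

# Toy .obs table with the three columns used for selection (hypothetical rows).
obs = pd.DataFrame(
    {
        "CellCycleFraction": [0.000, 0.020, 0.005, 0.001],
        "Subregion": ["b'Subcortex'", "b'Subcortex'", "b'Cortex'", "b'Subcortex'"],
        "AnnotationAle": ["Neuronal IPC", "Neuroblast", "Neuroblast", "Radial glia"],
    },
    index=["c1", "c2", "c3", "c4"],
)

# One mask per criterion, so each can be inspected on its own.
low_cycling = obs["CellCycleFraction"] < 0.01
subcortex = obs["Subregion"] == "b'Subcortex'"
nipc = obs["AnnotationAle"].isin(["Neuronal IPC", "Neuroblast"])

# A cell is kept only if it passes all three criteria.
keep = low_cycling & subcortex & nipc

for name, mask in [("low cycling", low_cycling), ("subcortex", subcortex),
                   ("nIPC", nipc), ("all three", keep)]:
    print(f"{name}: {int(mask.sum())} of {len(obs)} cells")

selected = obs[keep]
```

On the real object, the same combined mask could be applied once as `braun[keep].copy()`, which gives the same cells as the three consecutive subsets.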

#### I made some adjustments so that these cells merge correctly into the meta-atlas: gene symbols become the `var` index, and the previous index is kept as the column `ensemble_ids`.

In [10]:
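braun.var = (
    braun.var
    .reset_index()                              # move the current index into a column
    .rename(columns={'index': 'ensemble_ids'})  # keep the old identifiers
    .set_index('Gene')                          # use gene symbols as the new index
)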

AnnData expects .var.index to contain strings, but got values like:
    ['AL627309.1', 'AL627309.3', 'LINC00115', 'FAM41C', 'AL645608.1']

    Inferred to be: categorical

  value_idx = self._prep_dim_index(value.index, attr)
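
The warning above can be reproduced on a toy `var` table in plain pandas. In this sketch the identifiers `id_1` to `id_3` are hypothetical, and the `Gene` column is assumed to be categorical, as the warning suggests. Setting a categorical column as the index produces a categorical index, which is why a cast to plain strings is needed before the names are used as `var` names.

```python
import pandas as pd

# Toy .var table: identifiers in the index, gene symbols in a categorical column.
var = pd.DataFrame(
    {"Gene": pd.Categorical(["LINC00115", "FAM41C", "FAM41C"])},
    index=["id_1", "id_2", "id_3"],  # hypothetical identifiers
)

reindexed = (
    var
    .reset_index()                              # unnamed index becomes a column called 'index'
    .rename(columns={"index": "ensemble_ids"})  # keep the old identifiers
    .set_index("Gene")                          # gene symbols become the new index
)

# The new index inherits the categorical dtype of the 'Gene' column ...
is_categorical = isinstance(reindexed.index, pd.CategoricalIndex)

# ... so it is cast to plain strings.
reindexed.index = reindexed.index.astype(str)

print(is_categorical, list(reindexed.index))
```

Note that `reset_index()` only names the new column `index` when the original index has no name. If the index is named, the rename would need to use that name instead.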


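#### I cast the gene names to strings (addressing the warning above) and made duplicated gene names unique by appending a numeric suffix; no genes are removed in this step.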

In [11]:
braun.var.index = braun.var.index.astype(str)
braun.var_names_make_unique(join="_")
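
The effect of making names unique can be illustrated with a simplified stand-in. The helper `make_unique` below is hypothetical and is not anndata's implementation: it keeps the first occurrence of each name and appends the `join` string plus a counter to every repeat, without handling the corner case where a suffixed name already exists in the index.

```python
from collections import Counter

import pandas as pd


def make_unique(names, join="_"):
    """Keep the first occurrence of each name; append join + counter to repeats."""
    seen = Counter()
    out = []
    for name in names:
        if seen[name] == 0:
            out.append(name)
        else:
            out.append(f"{name}{join}{seen[name]}")
        seen[name] += 1
    return pd.Index(out)


# Toy gene symbols with one symbol repeated three times.
genes = pd.Index(["FAM41C", "LINC00115", "FAM41C", "FAM41C"])
unique_genes = make_unique(genes, join="_")

print(list(unique_genes))
```

The number of genes is unchanged, so repeated symbols remain separate features under distinct names rather than being merged or dropped.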

In [12]:
# Give every remaining cell the same label in the annotation column
braun.obs['ThirdManualAnnotations'] = 'Subcortical nIPCs'

#### I saved the result to import it into the main notebook `from_noAdolescence_to_final_training_dataset.ipynb`.

In [13]:
braun.write_h5ad('/hpc/hers_basak/rnaseq_data/Silettilab/samples/final_useful_datasets/braun_subcortical.h5ad')