# Differential Expression Analysis among the final labels
In this notebook, I analyzed the manual annotation stored in the column 'ThirdManualAnnotations' of `noAdolescence_nocc_noclusters_ThirdManualAnnotations_Interneurons.h5ad`, retrieving the top 25 marker genes for each label.

In [1]:
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import anndata as ad

In [2]:
adata = sc.read_h5ad('/hpc/hers_basak/rnaseq_data/Silettilab/icoratella/final_useful_datasets/noAdolescence_nocc_noclusters_ThirdManualAnnotations_Interneurons.h5ad')

In [3]:
# Remove the cell-cycle genes listed in ccGenesHuman.txt from the expression matrix.
ccGenesHuman = np.loadtxt('/hpc/hers_basak/rnaseq_data/Silettilab/icoratella/models/ccGenesHuman.txt', dtype=str)
mask = ~adata.var_names.isin(ccGenesHuman)
adata = adata[:, mask]

In [4]:
# Drop genes whose symbol starts with 'MT-' (mitochondrial) or 'RP' (intended to catch ribosomal genes).
badGenes = [gene for gene in adata.var_names if gene.startswith(('MT-', 'RP'))]

adata = adata[:, ~adata.var_names.isin(badGenes)]
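
The prefix filter above can be checked on a small list of gene symbols before it is applied to the full dataset. The sketch below is only an illustration: `flag_by_prefix` is a hypothetical helper, the gene list is a toy example, and the stricter `'RPS'`/`'RPL'` prefixes are an assumption about what a ribosomal-only filter might look like, not what this notebook ran. It shows that a bare `'RP'` prefix also matches non-ribosomal symbols such as RPE65 and RPGR.

```python
def flag_by_prefix(genes, prefixes):
    """Return the genes whose symbol starts with any of the given prefixes."""
    return [gene for gene in genes if gene.startswith(tuple(prefixes))]


# Toy gene list (not taken from the dataset).
genes = ["MT-CO1", "RPS6", "RPL13", "RPE65", "RPGR", "AQP4"]

# Prefixes used in the notebook cell above.
broad = flag_by_prefix(genes, ["MT-", "RP"])

# Hypothetical stricter alternative that only targets RPS*/RPL* symbols.
strict = flag_by_prefix(genes, ["MT-", "RPS", "RPL"])

# Genes removed by the broad filter but kept by the strict one.
only_broad = [gene for gene in broad if gene not in strict]
print(only_broad)
```

Comparing the two lists this way makes it easy to see which genes a prefix removes beyond the intended ones.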

#### I normalized and log-transformed the data. Scaling and PCA are commented out and were not run.

In [5]:
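# Scale each cell to 10,000 total counts, then apply log(1 + x).
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Scaling and PCA are left disabled in this run.
#sc.pp.scale(adata, max_value=10)
#sc.tl.pca(adata, svd_solver="auto")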
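
To make the two preprocessing calls concrete, here is a minimal NumPy sketch of the arithmetic on a toy count matrix (rows are cells, columns are genes). It mirrors what total-count normalization followed by `log1p` computes; it is not scanpy's implementation, and the matrix values are made up.

```python
import numpy as np

# Toy raw counts: 2 cells x 3 genes (illustrative values only).
counts = np.array([[1.0, 3.0, 0.0],
                   [10.0, 0.0, 30.0]])

target_sum = 1e4

# Total counts per cell, kept as a column so it broadcasts across genes.
per_cell_total = counts.sum(axis=1, keepdims=True)

# Rescale every cell so that its counts sum to target_sum.
normalized = counts / per_cell_total * target_sum

# Natural log of (1 + x); zeros stay zero.
log_normalized = np.log1p(normalized)

print(normalized.sum(axis=1))
print(log_normalized)
```

After this step, cells with different sequencing depths are on a comparable scale, which is what the differential expression test below operates on.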



#### I performed differential expression analysis and printed the top 25 markers for each label.

In [6]:
# Rank genes per label with logistic regression; results are stored in adata.uns['rank_genes_groups'].
sc.tl.rank_genes_groups(adata, "ThirdManualAnnotations", method="logreg")
# Print the 25 top-ranked genes for each label.
for group in adata.uns['rank_genes_groups']['names'].dtype.names:
    genes = adata.uns['rank_genes_groups']['names'][group][:25]
    print(f"Group {group}:")
    print(", ".join(genes))
    print("\n")

Group Astrocytes:
NTRK2, CLU, LRIG1, APOE, GPC5, LINC00609, LRRC4C, PLCG2, SPARCL1, PPP2R2B, DAAM2, AQP4, ATP1A2, ADCY2, GLIS3, CTNNA2, CST3, BCAN, MGST1, RYR3, NEAT1, LIMCH1, FAM107A, NAV3, TNC


Group OPCs:
PCDH15, LHFPL3, OPCML, CA10, FGF14, NXPH1, MMP16, DCC, TNR, CNTN1, NRXN3, ANKS1B, SGCD, KIF13A, SNTG1, LRRC4C, RNF144A, DISC1, ASTN2, SEMA5A, SGCZ, CNTNAP5, SORCS3, GRIA2, EPN2


Group Subcortical nIPCs:
HBA2, HBA1, HES6, CACNA2D1, AUTS2, KCNB2, DLX2, MAML3, PDE4D, CCNI, NFIB, ZIC1, KALRN, ELAVL2, CCSER1, PRKX, CHRDL1, ROBO1, NNAT, RUNX1T1, GNG4, STMN1, PENK, CDKN2D, FRMD4B


Group early Radial Glia:
PLCG2, MALAT1, LINC00486, TUBA1A, EEF1G, HEY1, SFRP1, EEF1A1, HDAC9, HBG2, CDH12, ID4, LEF1, EIF1AY, LIX1, HES4, KIAA1217, VIM, UNC13C, MAP1B, BAIAP2L1, B3GAT2, FAM182B, MEF2C, DPP10


Group late

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


#### I also filtered the previous DEA results to keep more specific markers, that is, genes expressed in at most 50% of the cells outside the group.

In [7]:
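# Require that a marker is expressed in at most 50% of cells outside its group
# (the other thresholds keep their defaults). Genes that fail the filter are set to NaN.
sc.tl.filter_rank_genes_groups(adata, max_out_group_fraction=0.5)
filtered_genes = {}
for group in adata.uns['rank_genes_groups_filtered']['names'].dtype.names:
    genes = adata.uns['rank_genes_groups_filtered']['names'][group]
    # Drop the NaN placeholders, then keep the first 25 surviving genes.
    filtered_gene_list = [gene for gene in genes if pd.notnull(gene)]
    filtered_genes[group] = filtered_gene_list[:25]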
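
The list comprehension in the cell above relies on filtered-out genes being stored as NaN in the ranked names. A small sketch of that step on a toy array, where `top_surviving` is a hypothetical helper and the values are illustrative:

```python
import numpy as np
import pandas as pd


def top_surviving(names, n=25):
    """Drop NaN placeholders, preserve rank order, and keep the first n genes."""
    return [gene for gene in names if pd.notnull(gene)][:n]


# Toy ranked names for one group; NaN marks genes removed by the filter.
toy_ranked = np.array(["AQP4", np.nan, "APOE", np.nan, "CLU"], dtype=object)

kept = top_surviving(toy_ranked, n=2)
print(kept)  # ['AQP4', 'APOE']
```

In the notebook, the same result per label is held in `filtered_genes`, so looping over `filtered_genes.items()` and printing each list would display the filtered markers.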

Overall, the top-ranked markers for each label appear appropriate. For example, the astrocyte list includes AQP4 and APOE, and the OPC list includes PCDH15 and LHFPL3.