<a href="https://colab.research.google.com/github/robinanwyl/oud_transcriptomics/blob/main/BENG204_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BENG 204 Project: Understanding Transcriptional Responses to Opioid Exposure Across Neurodevelopmental Stages in Brain Organoid Models

Mount Google Drive. Run this cell every time the notebook is opened, and grant permissions if prompted.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
# filepath for the project data is now "/content/drive/My Drive/BENG204_Project/BENG204_Project_Data/"

Mounted at /content/drive


If the `scanpy` and `anndata` import statements are underlined with a red squiggle, the packages are not installed in the current runtime. Uncomment the lines in this cell and run it to re-install them.

In [None]:
#%pip install scanpy
#%pip install anndata

Import statements

In [None]:
import scanpy as sc
import anndata as ad
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
# etc.

## Read in Kim et al scRNA-seq 10X output files and save as .h5ad.gz files (perform once)
**Read in the scRNA-seq data for Kim et al dataset day 53 untreated sample and day 53 acute fentanyl treatment sample. Merge the datasets (using unique cell barcodes with sample identifiers). Save the two datasets and the merged dataset as `.h5ad.gz` files.**

The original sample IDs are KH001 for the day 53 untreated sample and KH002 for the day 53 acute fentanyl treatment sample. Each sample has its own folder containing 3 compressed (`.gz`) files, which are the 10X Genomics CellRanger output files:

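*   `matrix.mtx.gz` is a sparse count matrix in Matrix Market format. On disk, rows are genes (features) and columns are cells (barcodes), and each entry is the UMI count of that gene in that cell. `scanpy.read_10x_mtx()` transposes it, so the resulting `AnnData` object has cells as rows and genes as columns.
*   `barcodes.tsv.gz` contains the cell barcodes (each cell is labeled with a unique barcode, which is used as an identifier)
*   `features.tsv.gz` contains the gene IDs, gene names (symbols), and feature types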
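
Before reading a sample, it can help to confirm that its folder really contains all three files. The following is a minimal sketch, not part of the original workflow: `EXPECTED_10X_FILES` and `missing_10x_files` are hypothetical names introduced here, and the demo runs against a temporary folder rather than the project data.

```python
import tempfile
from pathlib import Path

# File names expected in each sample folder (as listed above).
EXPECTED_10X_FILES = ("matrix.mtx.gz", "barcodes.tsv.gz", "features.tsv.gz")


def missing_10x_files(sample_dir):
    """Return the expected 10X files that are absent from sample_dir (hypothetical helper)."""
    folder = Path(sample_dir)
    return [name for name in EXPECTED_10X_FILES if not (folder / name).is_file()]


# Demo on a temporary folder that contains only one of the three files.
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "matrix.mtx.gz").touch()
    demo_missing = missing_10x_files(tmp_dir)

print("Missing files:", demo_missing)
```

On the real data, the same function could be called with a sample folder path once the drive is mounted; an empty list would mean the folder is complete.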

For each sample, we will first use `scanpy.read_10x_mtx()` to read the 3 files into a single `AnnData` object that contains the cell-by-gene matrix and associated metadata (barcodes and features):

In [None]:
# sample1_path = "/content/drive/My Drive/BENG204_Project/BENG204_Project_Data/Kim_KH001_Day53_Untreated"
# adata1 = sc.read_10x_mtx(sample1_path, var_names="gene_symbols", cache=True)
# adata1.obs["sample"] = "kim_day53_untreated"
# print("kim_day53_untreated\n", adata1)

# sample2_path = "/content/drive/My Drive/BENG204_Project/BENG204_Project_Data/Kim_KH002_Day53_FTY_Acute"
# adata2 = sc.read_10x_mtx(sample2_path, var_names="gene_symbols", cache=True)
# adata2.obs["sample"] = "kim_day53_fty_acute"
# print("\nkim_day53_fty_acute\n", adata2)

kim_day53_untreated
 AnnData object with n_obs × n_vars = 5131 × 33538
    obs: 'sample'
    var: 'gene_ids', 'feature_types'

kim_day53_fty_acute
 AnnData object with n_obs × n_vars = 5499 × 33538
    obs: 'sample'
    var: 'gene_ids', 'feature_types'


Now we will merge the two samples into one AnnData object. Because both samples can contain the same barcode sequences, we first prefix each cell barcode with a sample label so that barcodes stay unique and each cell remains traceable to its sample after merging: "d53_ut" for the untreated sample and "d53_fty" for the fentanyl-treated sample.

In [None]:
# adata1.obs.index = [f"d53_ut_{bc}" for bc in adata1.obs.index]
# adata2.obs.index = [f"d53_fty_{bc}" for bc in adata2.obs.index]
# adata_combined = sc.concat([adata1, adata2], label="batch", keys=["sample1", "sample2"])
# print("day 53 samples merged\n", adata_combined)

day 53 samples merged
 AnnData object with n_obs × n_vars = 10630 × 33538
    obs: 'sample', 'batch'


Now we will save the three AnnData objects as `.h5ad.gz` files for later use.

In [None]:
# adata1.write("/content/drive/My Drive/BENG204_Project/BENG204_Project_Data/kim_d53_ut.h5ad.gz", compression="gzip")
# adata2.write("/content/drive/My Drive/BENG204_Project/BENG204_Project_Data/kim_d53_fty.h5ad.gz", compression="gzip")
# adata_combined.write("/content/drive/My Drive/BENG204_Project/BENG204_Project_Data/kim_d53_combined.h5ad.gz", compression="gzip")

## Read in Kim et al .h5ad.gz files

The Kim et al datasets saved as `.h5ad.gz` files can now be read back in as `AnnData` objects using `scanpy.read_h5ad()`.

In [None]:
kim_d53_ut = sc.read_h5ad("/content/drive/My Drive/BENG204_Project/BENG204_Project_Data/kim_d53_ut.h5ad.gz")
kim_d53_fty = sc.read_h5ad("/content/drive/My Drive/BENG204_Project/BENG204_Project_Data/kim_d53_fty.h5ad.gz")
kim_d53 = sc.read_h5ad("/content/drive/My Drive/BENG204_Project/BENG204_Project_Data/kim_d53_combined.h5ad.gz")
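
After loading, a quick sanity check is to count cells per sample in the merged object. The sketch below uses a toy `obs` table so that it runs on its own; `cells_per_sample` is a hypothetical helper, and the toy barcodes are invented.

```python
import pandas as pd


def cells_per_sample(obs, column="sample"):
    """Count cells for each value of an obs column (hypothetical helper)."""
    return obs[column].value_counts().to_dict()


# Toy stand-in for an AnnData .obs table, using the sample labels from this notebook.
toy_obs = pd.DataFrame(
    {"sample": ["kim_day53_untreated"] * 2 + ["kim_day53_fty_acute"] * 3},
    index=[
        "d53_ut_AAAC-1",
        "d53_ut_AAAG-1",
        "d53_fty_AAAC-1",
        "d53_fty_AAAT-1",
        "d53_fty_AACC-1",
    ],
)

print(cells_per_sample(toy_obs))
```

On the loaded data the call would be `cells_per_sample(kim_d53.obs)`. Given the shapes printed when the samples were first read, the expected counts would be 5131 untreated cells and 5499 fentanyl-treated cells, which sum to the 10630 cells of the merged object.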