In [3]:
!date

Wed Jul 21 18:55:02 PDT 2021


___
# Notebook 1: Downloading sequencing data. 
#### This notebook was written by Victoria Jorgensen. 
Questions can be sent to jorgensen at caltech.edu. 
___

#### Import all packages (some not necessary)

In [84]:
import numpy as np
import pandas as pd
import scanpy as sc
import anndata

%config InlineBackend.figure_format = 'retina'

## Loading in supplemental data: 
You will first need to download all supplementary data (metadata and gene lists) from the GitHub repository [hEP-structures_MZG](https://github.com/vjorgensen/hEP-structures_MZG). The sequencing files were too large to upload to GitHub, so they must be downloaded separately (see the next section) and added to the repository's "data" folder, inside a subfolder titled "matrix_data" and then inside the folder "MZG_hEP" or "MZG_nat", depending on the source of each sample.

## Loading in sequencing data:  
This notebook will help you read in all sequencing data generated in this study for 2D hEPSCs, Day 5 hEP-structures, Day 6 hEP-structures, and Day 5/6 natural embryos. To use this notebook, you will need to download the data from GEO at accession number [GSE178326](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE178326). Download "GSE178326_RAW.tar" from the supplementary files and extract the following:

For hEP-structures (put data in folder "data/matrix_data/MZG_hEP"):
- GSM5387817_EPSC_matrix.mtx
- GSM5387817_EPSC_barcodes.tsv
- GSM5387817_EPSC_features.tsv
- GSM5387817_EPSC_LMO-tags.csv

For natural embryos (put data in folder "data/matrix_data/MZG_nat"): 
- GSM5387818_natural_matrix.mtx
- GSM5387818_natural_barcodes.tsv
- GSM5387818_natural_features.tsv



**Note: You will have to rename the above files to "matrix.mtx.gz", "barcodes.tsv.gz", and "features.tsv.gz" in "MZG_hEP" and "MZG_nat" to use the following code!**
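
The renaming step can also be scripted. The following is a minimal sketch, not part of the original notebook: the helper name `stage_10x_files` is hypothetical, and it assumes the downloaded files sit in the target folder under the GEO names listed above, either gzip-compressed or uncompressed. It leaves the originals in place and writes copies under the three names given in the note.

```python
import gzip
import shutil
from pathlib import Path

# File names (before the ".gz" suffix) required by the note above.
TARGET_NAMES = ("matrix.mtx", "barcodes.tsv", "features.tsv")


def stage_10x_files(folder, prefix):
    """Create matrix.mtx.gz, barcodes.tsv.gz and features.tsv.gz in `folder`.

    `prefix` is the GEO file prefix, e.g. "GSM5387817_EPSC" (assumed layout).
    Already-gzipped sources are copied; uncompressed sources are compressed.
    """
    folder = Path(folder)
    created = []
    for name in TARGET_NAMES:
        target = folder / f"{name}.gz"
        gz_source = folder / f"{prefix}_{name}.gz"
        plain_source = folder / f"{prefix}_{name}"
        if gz_source.exists():
            # Source is already compressed: copy it under the new name.
            shutil.copyfile(gz_source, target)
        elif plain_source.exists():
            # Source is uncompressed: gzip it into the new name.
            with open(plain_source, "rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
        else:
            raise FileNotFoundError(f"No {prefix}_{name}[.gz] found in {folder}")
        created.append(target)
    return created
```

Under those assumptions, `stage_10x_files("../data/matrix_data/MZG_hEP", "GSM5387817_EPSC")` and `stage_10x_files("../data/matrix_data/MZG_nat", "GSM5387818_natural")` would prepare both folders. The LMO-tags file keeps its original name because the code below reads it by that name.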

In [85]:
# Data from hEP-structures
Goetz_D = sc.read_10x_mtx("../data/matrix_data/MZG_hEP/")

# Data from D5/D6 natural embryos
Goetz_D2 = sc.read_10x_mtx("../data/matrix_data/MZG_nat/")

#### Concatenate the two AnnData objects.

In [197]:
adata = Goetz_D.concatenate(Goetz_D2, index_unique=None)

#### Organize the data by adding a gene_name column.

In [198]:
# Make gene name column
adata.var["gene_name"] = adata.var.index
adata.var

|  | gene_ids | feature_types | gene_name |
| --- | --- | --- | --- |
| RP11-34P13.3 | ENSG00000243485 | Gene Expression | RP11-34P13.3 |
| FAM138A | ENSG00000237613 | Gene Expression | FAM138A |
| OR4F5 | ENSG00000186092 | Gene Expression | OR4F5 |
| RP11-34P13.7 | ENSG00000238009 | Gene Expression | RP11-34P13.7 |
| RP11-34P13.8 | ENSG00000239945 | Gene Expression | RP11-34P13.8 |
| ... | ... | ... | ... |
| AC233755.2 | ENSG00000277856 | Gene Expression | AC233755.2 |
| AC233755.1 | ENSG00000275063 | Gene Expression | AC233755.1 |
| AC240274.1 | ENSG00000271254 | Gene Expression | AC240274.1 |
| AC213203.1 | ENSG00000277475 | Gene Expression | AC213203.1 |


#### Load in labels for samples based on barcodes.

In [199]:
# Barcodes/LMOs for EPSCs, D5 hEP-structures, and D6 hEP-structures.  (Multiplexed samples, hence LMO tags)
tags_ep = pd.read_csv("../data/matrix_data/MZG_hEP/GSM5387817_EPSC_LMO-tags.csv")

# Barcodes for natural embryos. (Not multiplexed, so there are no LMO tags; sample labels are filled in below)
tags_nat = pd.read_csv("../data/matrix_data/MZG_nat/barcodes.tsv.gz", names=["cell_barcode", "sample_id", "sample_number"], header=None)
tags_nat["sample_id"] = "Nat"
tags_nat["sample_number"] = 7

# Combine the two dataframes, note that order must match order of adata file. (EPSC then natural samples)
tag = (tags_ep, tags_nat)
tags = pd.concat(tag, ignore_index=True)
tags = tags.set_index("cell_barcode").rename_axis(index=None, axis=1)
tags

|  | sample_id | sample_number |
| --- | --- | --- |
| AAACCCAAGCGATGAC11 | unknown | 0 |
| AAACCCAAGGTAAAGG11 | D6-R3 | 6 |
| AAACCCAAGTTGAAGT11 | D6-R1 | 4 |
| AAACCCACAAATCCCA11 | D6-R3 | 6 |
| AAACCCACAACCGTGC11 | D6-R1 | 4 |
| ... | ... | ... |
| TTTCACAGTCCGGACT-1 | Nat | 7 |
| TTTCATGCATCCTCAC-1 | Nat | 7 |
| TTTGATCAGGAGCAAA-1 | Nat | 7 |
| TTTGGTTAGACGGTTG-1 | Nat | 7 |


In [200]:
# Generate 'cell_group' column to distinguish natural embryo, 2D EPSCs, D5 hEP, and D6 hEP.
# Note that 'sample_id' distinguishes between replicates.
# Take the part of 'sample_id' before the first '-' (e.g. 'D6-R3' -> 'D6'), using vectorized string methods.
tags["cell_group"] = tags["sample_id"].str.split("-").str[0]
tags.index.name = None

tags

|  | sample_id | sample_number | cell_group |
| --- | --- | --- | --- |
| AAACCCAAGCGATGAC11 | unknown | 0 | unknown |
| AAACCCAAGGTAAAGG11 | D6-R3 | 6 | D6 |
| AAACCCAAGTTGAAGT11 | D6-R1 | 4 | D6 |
| AAACCCACAAATCCCA11 | D6-R3 | 6 | D6 |
| AAACCCACAACCGTGC11 | D6-R1 | 4 | D6 |
| ... | ... | ... | ... |
| TTTCACAGTCCGGACT-1 | Nat | 7 | Nat |
| TTTCATGCATCCTCAC-1 | Nat | 7 | Nat |
| TTTGATCAGGAGCAAA-1 | Nat | 7 | Nat |
| TTTGGTTAGACGGTTG-1 | Nat | 7 | Nat |


#### Add this sample data to adata. 

In [201]:
# Add metadata to adata.obs (rows are aligned on the cell barcode index)
adata.obs = pd.concat([adata.obs, tags], axis=1)

adata.obs

|  | batch | sample_id | sample_number | cell_group |
| --- | --- | --- | --- | --- |
| AAACCCAAGCGATGAC11 | 0 | unknown | 0 | unknown |
| AAACCCAAGGTAAAGG11 | 0 | D6-R3 | 6 | D6 |
| AAACCCAAGTTGAAGT11 | 0 | D6-R1 | 4 | D6 |
| AAACCCACAAATCCCA11 | 0 | D6-R3 | 6 | D6 |
| AAACCCACAACCGTGC11 | 0 | D6-R1 | 4 | D6 |
| ... | ... | ... | ... | ... |
| TTTCACAGTCCGGACT-1 | 1 | Nat | 7 | Nat |
| TTTCATGCATCCTCAC-1 | 1 | Nat | 7 | Nat |
| TTTGATCAGGAGCAAA-1 | 1 | Nat | 7 | Nat |
| TTTGGTTAGACGGTTG-1 | 1 | Nat | 7 | Nat |
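
Because `pd.concat(..., axis=1)` aligns rows by index label, barcodes that appear in only one of the two tables would show up as rows with missing values, and duplicated barcodes may cause the join to fail. The sketch below shows one way to check for this before joining. It is an illustration on toy barcodes, not part of the original analysis, and the helper name `check_barcode_alignment` is hypothetical.

```python
import pandas as pd


def check_barcode_alignment(obs_index, tags_index):
    """Summarize how two sets of cell barcodes line up (hypothetical helper)."""
    obs_index = pd.Index(obs_index)
    tags_index = pd.Index(tags_index)
    return {
        # Duplicated labels make label-based alignment ambiguous.
        "duplicates_in_obs": int(obs_index.duplicated().sum()),
        "duplicates_in_tags": int(tags_index.duplicated().sum()),
        # Barcodes present on one side only would become NaN rows after the join.
        "missing_from_tags": obs_index.difference(tags_index).tolist(),
        "missing_from_obs": tags_index.difference(obs_index).tolist(),
        # True only if both indexes hold the same labels in the same order.
        "same_order": obs_index.equals(tags_index),
    }


# Toy barcodes: same cells on both sides, listed in a different order.
obs_barcodes = ["AAAC-1", "AAAG-1", "AAAT-1"]
tag_barcodes = ["AAAC-1", "AAAT-1", "AAAG-1"]
report = check_barcode_alignment(obs_barcodes, tag_barcodes)
print(report)
```

With the real objects, the call would be `check_barcode_alignment(adata.obs.index, tags.index)`, run before `adata.obs` is overwritten.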


#### Remove all "unknown" cells.
Some cells were labeled "unknown" because of duplicates from multiplexing, ambiguous LMO tagging, etc. The following code removes those cells from further analysis.

In [202]:
# Remove cells with unknown sample_id (i.e., cells with ambiguous barcoding)
adata = adata[adata.obs["sample_id"] != "unknown"]
adata.obs

|  | batch | sample_id | sample_number | cell_group |
| --- | --- | --- | --- | --- |
| AAACCCAAGGTAAAGG11 | 0 | D6-R3 | 6 | D6 |
| AAACCCAAGTTGAAGT11 | 0 | D6-R1 | 4 | D6 |
| AAACCCACAAATCCCA11 | 0 | D6-R3 | 6 | D6 |
| AAACCCACAACCGTGC11 | 0 | D6-R1 | 4 | D6 |
| AAACCCACACAACATC11 | 0 | D6-R2 | 5 | D6 |
| ... | ... | ... | ... | ... |
| TTTCACAGTCCGGACT-1 | 1 | Nat | 7 | Nat |
| TTTCATGCATCCTCAC-1 | 1 | Nat | 7 | Nat |
| TTTGATCAGGAGCAAA-1 | 1 | Nat | 7 | Nat |
| TTTGGTTAGACGGTTG-1 | 1 | Nat | 7 | Nat |
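
It can be useful to record how many cells this filter drops and how many remain per sample. The following sketch shows the same boolean-mask pattern on a small toy table. The data here are made up for illustration and do not reflect the counts in this study.

```python
import pandas as pd

# Toy stand-in for adata.obs (illustrative values only).
obs = pd.DataFrame(
    {"sample_id": ["unknown", "D6-R3", "D6-R1", "Nat", "unknown"]},
    index=["c1", "c2", "c3", "c4", "c5"],
)

# Boolean mask: True for cells with a known sample label.
keep = obs["sample_id"] != "unknown"

n_removed = int((~keep).sum())  # number of cells that will be dropped
filtered = obs[keep]            # same filter as applied to adata above
counts = filtered["sample_id"].value_counts()  # remaining cells per sample

print(f"Removed {n_removed} of {len(obs)} cells")
print(counts)
```

Applied to the real object, the mask would be built from `adata.obs["sample_id"]` in the same way. Subsetting an AnnData object returns a view, so appending `.copy()` to the subset is one way to get an independent object before modifying its `.obs`.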


#### Make sample labels more descriptive.

In [203]:
# Give labels a more descriptive name 
adata.obs["sample_group"] = " "
adata.obs.loc[(adata.obs["cell_group"]=="Nat"), "sample_group"] = "Natural human embryo"
adata.obs.loc[(adata.obs["cell_group"]=="D5"), "sample_group"] = "Day 5 hEP-structures"
adata.obs.loc[(adata.obs["cell_group"]=="D6"), "sample_group"] = "Day 6 hEP-structures"
adata.obs.loc[(adata.obs["cell_group"]=="2D"), "sample_group"] = "hEPSCs in 2D"
# adata.obs.loc[(adata.obs["cell_group"]=="unknown"), "sample_group"] = "unknown"


Trying to set attribute `.obs` of view, copying.
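
The same relabeling can be written as a single dictionary lookup, which keeps all label pairs in one place. This is an alternative sketch on a toy table, not the code used in the notebook. It assumes `cell_group` is a plain string column, and the name `GROUP_LABELS` is introduced here for illustration.

```python
import pandas as pd

# Short group code -> descriptive label (same pairs as in the cell above).
GROUP_LABELS = {
    "Nat": "Natural human embryo",
    "D5": "Day 5 hEP-structures",
    "D6": "Day 6 hEP-structures",
    "2D": "hEPSCs in 2D",
}

# Toy stand-in for adata.obs; "other" is a code with no entry in the mapping.
obs = pd.DataFrame(
    {"cell_group": ["D6", "Nat", "2D", "D5", "other"]},
    index=["c1", "c2", "c3", "c4", "c5"],
)

# Codes missing from the mapping become NaN, then fall back to " ",
# matching the default value assigned in the notebook cell.
obs["sample_group"] = obs["cell_group"].map(GROUP_LABELS).fillna(" ")
print(obs)
```

Adding a new group then only requires a new entry in the dictionary rather than another `.loc` assignment.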


#### Remove superfluous columns.

In [204]:
adata.obs = adata.obs.drop(columns=["batch", "sample_number"])
adata.obs

|  | sample_id | cell_group | sample_group |
| --- | --- | --- | --- |
| AAACCCAAGGTAAAGG11 | D6-R3 | D6 | Day 6 hEP-structures |
| AAACCCAAGTTGAAGT11 | D6-R1 | D6 | Day 6 hEP-structures |
| AAACCCACAAATCCCA11 | D6-R3 | D6 | Day 6 hEP-structures |
| AAACCCACAACCGTGC11 | D6-R1 | D6 | Day 6 hEP-structures |
| AAACCCACACAACATC11 | D6-R2 | D6 | Day 6 hEP-structures |
| ... | ... | ... | ... |
| TTTCACAGTCCGGACT-1 | Nat | Nat | Natural human embryo |
| TTTCATGCATCCTCAC-1 | Nat | Nat | Natural human embryo |
| TTTGATCAGGAGCAAA-1 | Nat | Nat | Natural human embryo |
| TTTGGTTAGACGGTTG-1 | Nat | Nat | Natural human embryo |


Save the adata file in the matrix_data folder so it can be used by nb2_data-analaysis.ipynb.

In [205]:
# Save adata.
adata.write("../data/matrix_data/adata.h5ad")

... storing 'sample_id' as categorical
... storing 'cell_group' as categorical
... storing 'sample_group' as categorical
... storing 'feature_types' as categorical


___
___

In [206]:
%load_ext watermark
%watermark -v -p numpy,pandas,scanpy,anndata,jupyterlab

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
CPython 3.7.10
IPython 7.22.0

numpy 1.19.4
pandas 1.1.4
scanpy 1.7.2
anndata 0.7.6
jupyterlab 3.0.11
