In [1]:
!date

Wed Jul 21 15:36:30 PDT 2021


Import packages. Not all of them are used in this notebook.

In [2]:
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'

## Loading in data
This notebook reads in all sequencing data generated in this study for 2D hEPSCs, Day 5 hEP-structures, Day 6 hEP-structures, and Day 5/6 natural embryos. To use it, download the data from GEO at accession number [GSE178326](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE178326). Download "GSE178326_RAW.tar" from the supplementary files to obtain the following:
For hEP-structures: 
- EPSC_matrix.mtx
- EPSC_barcodes.tsv
- EPSC_features.tsv

For natural embryos: 
- natural_matrix.mtx
- natural_barcodes.tsv
- natural_features.tsv

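**Note: You will have to rename the files to "matrix.mtx", "barcodes.tsv", and "features.tsv" to use the following code. Keep each set in its own directory, since the code below reads the hEP-structure data and the natural embryo data from separate folders.**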
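
The renaming step described in the note can be scripted. The following is a minimal sketch, assuming the extracted files carry exactly the names listed above; the helper name `stage_10x_files` and the directory layout are hypothetical, and the demo runs on empty placeholder files in a temporary folder rather than on the real data.

```python
import shutil
import tempfile
from pathlib import Path

# File names expected in each dataset directory, per the note above
STANDARD_NAMES = ("matrix.mtx", "barcodes.tsv", "features.tsv")


def stage_10x_files(src_dir, dest_dir, prefix):
    """Copy <prefix>_matrix.mtx etc. into dest_dir under the unprefixed names.

    Hypothetical helper: it assumes the downloaded files are named
    "<prefix>_matrix.mtx", "<prefix>_barcodes.tsv", "<prefix>_features.tsv".
    """
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for name in STANDARD_NAMES:
        source = src_dir / f"{prefix}_{name}"
        if not source.exists():
            raise FileNotFoundError(source)
        target = dest_dir / name
        shutil.copy2(source, target)  # copy so the original download stays intact
        staged.append(target)
    return staged


# Demo on empty placeholder files (not the real GEO data)
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    for prefix in ("EPSC", "natural"):
        for name in STANDARD_NAMES:
            (raw / f"{prefix}_{name}").touch()

    hep_files = stage_10x_files(raw, Path(tmp) / "MZG_hEP", "EPSC")
    nat_files = stage_10x_files(raw, Path(tmp) / "MZG_nat", "natural")
    staged_names = sorted(p.name for p in hep_files)
    print(staged_names)
```

Copying rather than moving keeps the original download untouched, so the step can be repeated if a path turns out to be wrong.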

In [5]:
# Data from hEP-structures
Goetz_D = sc.read_10x_mtx("../../GEO_submission_Sozen-2021/MZG_hEP/")

# Data from D5/D6 natural embryos
Goetz_D2 = sc.read_10x_mtx("../../GEO_submission_Sozen-2021/MZG_nat")

Concatenate the two AnnData objects.

In [6]:
adata = Goetz_D.concatenate(Goetz_D2, index_unique=None)

Organize the data.

In [7]:
# Make gene name column
adata.var["gene_name"] = adata.var.index
adata.var

Unnamed: 0,gene_ids,feature_types,gene_name
RP11-34P13.3,ENSG00000243485,Gene Expression,RP11-34P13.3
FAM138A,ENSG00000237613,Gene Expression,FAM138A
OR4F5,ENSG00000186092,Gene Expression,OR4F5
RP11-34P13.7,ENSG00000238009,Gene Expression,RP11-34P13.7
RP11-34P13.8,ENSG00000239945,Gene Expression,RP11-34P13.8
...,...,...,...
AC233755.2,ENSG00000277856,Gene Expression,AC233755.2
AC233755.1,ENSG00000275063,Gene Expression,AC233755.1
AC240274.1,ENSG00000271254,Gene Expression,AC240274.1
AC213203.1,ENSG00000277475,Gene Expression,AC213203.1


Load in meta data.

**Note:** This metadata file was generated during the analysis in nb2_data-analysis and added here retrospectively. You can skip this code for now and move on to nb2, where you can generate "all_meta.csv". Alternatively, you can find "all_meta.csv" in the following GitHub repository: [hEP-structures_MZG](https://github.com/vjorgensen/hEP-structures_MZG)

In [10]:
# Read in meta data csv
all_meta = pd.read_csv("../../GEO_submission_Sozen-2021/all_meta.csv", index_col=0)

# Generate 'cell_group' column to distinguish natural embryo, 2D EPSCs, D5 hEP, and D6 hEP.
# Note that 'sample_id' distinguishes between replicates.
# Take the part of 'sample_id' before the first '-' (e.g. 'D6-R3' -> 'D6').
all_meta['cell_group'] = all_meta['sample_id'].str.split('-').str[0]
all_meta.index.name = None

all_meta

Unnamed: 0,sample_id,sample_number,cell_group
AAACCCAAGCGATGAC11,unknown,0,unknown
AAACCCAAGGTAAAGG11,D6-R3,6,D6
AAACCCAAGTTGAAGT11,D6-R1,4,D6
AAACCCACAAATCCCA11,D6-R3,6,D6
AAACCCACAACCGTGC11,D6-R1,4,D6
...,...,...,...
TTTCACAGTCCGGACT-1,Nat,9,Nat
TTTCATGCATCCTCAC-1,Nat,9,Nat
TTTGATCAGGAGCAAA-1,Nat,9,Nat
TTTGGTTAGACGGTTG-1,Nat,9,Nat
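
To make the relationship between 'sample_id' and 'cell_group' concrete, here is a small sketch on a toy table. The barcodes (`cell_a`, ...) and the table itself are made up for illustration; only the label format (such as "D6-R3", "Nat", "unknown") follows the output above. It derives the group from the text before the first '-' and then counts how many distinct replicates fall into each group.

```python
import pandas as pd

# Toy metadata with hypothetical barcodes; labels mimic the real 'sample_id' format
meta = pd.DataFrame(
    {"sample_id": ["unknown", "D6-R3", "D6-R1", "D6-R3", "Nat"]},
    index=["cell_a", "cell_b", "cell_c", "cell_d", "cell_e"],
)

# Group label = everything before the first '-' ('D6-R3' -> 'D6', 'Nat' -> 'Nat')
meta["cell_group"] = meta["sample_id"].str.split("-").str[0]

# Number of distinct replicates (sample_id values) within each group
replicates_per_group = meta.groupby("cell_group")["sample_id"].nunique()

print(meta)
print(replicates_per_group)
```

Labels without a '-' (here "Nat" and "unknown") pass through unchanged, which matches the 'cell_group' column shown in the output above.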


Add the metadata to adata.obs.

In [11]:
# Add meta data to adata.obs
adata.obs = pd.concat([adata.obs, all_meta], axis=1)

adata.obs

Unnamed: 0,batch,sample_id,sample_number,cell_group
AAACCCAAGCGATGAC11,0,unknown,0,unknown
AAACCCAAGGTAAAGG11,0,D6-R3,6,D6
AAACCCAAGTTGAAGT11,0,D6-R1,4,D6
AAACCCACAAATCCCA11,0,D6-R3,6,D6
AAACCCACAACCGTGC11,0,D6-R1,4,D6
...,...,...,...,...
TTTCACAGTCCGGACT-1,1,Nat,9,Nat
TTTCATGCATCCTCAC-1,1,Nat,9,Nat
TTTGATCAGGAGCAAA-1,1,Nat,9,Nat
TTTGGTTAGACGGTTG-1,1,Nat,9,Nat
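
`pd.concat(..., axis=1)` aligns rows on the index (here, the cell barcodes) and performs an outer join by default, so the result is only a clean drop-in for `adata.obs` when both tables hold the same barcodes. The sketch below uses made-up tables and hypothetical barcodes to show what happens when they differ, along with a left join as one possible alternative that keeps exactly the rows of the first table. It is an illustration, not a step from the original analysis.

```python
import pandas as pd

# Toy stand-ins: 'obs' plays the role of adata.obs, 'meta' of all_meta
obs = pd.DataFrame(
    {"batch": ["0", "0", "1"]},
    index=["cell_a", "cell_b", "cell_c"],
)
meta = pd.DataFrame(
    {"sample_id": ["D6-R1", "Nat", "D5-R1"]},
    index=["cell_a", "cell_c", "cell_x"],  # cell_b is missing, cell_x is extra
)

# Outer join on the index: one row per barcode found in either table
merged = pd.concat([obs, meta], axis=1)

# Left join: keeps exactly the rows (and order) of 'obs'
joined = obs.join(meta, how="left")

# Cells in 'obs' that received no metadata
n_missing = int(joined["sample_id"].isna().sum())

print(merged)
print(joined)
print(n_missing)
```

Comparing `len(merged)` with `len(obs)` and checking for missing values before assigning the result is a cheap way to catch barcode mismatches early.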


Some samples were labeled "unknown" due to duplicates from multiplexing/ambiguous LMO tagging/etc. The following code removes those cells from further analysis. 

In [12]:
# Removes cells with unknown sample-id (i.e. cells with ambiguous barcoding)
adata = adata[adata.obs['sample_id'] != 'unknown']
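
The filter above is an ordinary boolean mask built from a pandas column, so it can be inspected before it is applied. A minimal sketch on a toy table with hypothetical barcodes, counting how many cells would be dropped:

```python
import pandas as pd

# Toy stand-in for adata.obs with hypothetical barcodes
obs = pd.DataFrame(
    {"sample_id": ["unknown", "D6-R3", "D6-R1", "unknown", "Nat"]},
    index=[f"cell_{i}" for i in range(5)],
)

# True for cells to keep, False for cells with an ambiguous sample assignment
keep = obs["sample_id"] != "unknown"

n_removed = int((~keep).sum())
filtered = obs[keep].copy()  # .copy() gives an independent table, not a slice

print(f"removing {n_removed} of {len(obs)} cells")
print(filtered)
```

Subsetting an AnnData object with such a mask returns a view of the original; calling `.copy()` on the result, as done for the toy table here, gives an independent object if the data will be modified later.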

In [15]:
%pwd

'/Users/vickijorgensen/Caltech/Projects/2021_Sozen_NatComm/notebooks/2021-07 copy'

Save the adata file to be used in nb2_data-analysis.ipynb.

In [16]:
# Save adata.
adata.write("../../GEO_submission_Sozen-2021/adata.h5ad")

In [17]:
%load_ext watermark
%watermark -v -p numpy,pandas,scanpy,anndata,jupyterlab,matplotlib

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
CPython 3.7.10
IPython 7.22.0

numpy 1.19.4
pandas 1.1.4
scanpy 1.7.2
anndata 0.7.6
jupyterlab 3.0.11
matplotlib 3.3.4
