This notebook prepares each dataset for integration.

For each dataset, this involves:
* reading in the dataset
* checking that the metadata are correct
* adding metadata for site and cancer_subtype
* adding an integration_id --> samples that are not biologically distinct (e.g. two biopsies from one tumour) get the same id
* joining the layers and re-splitting them by integration_id --> the layers in each dataset then represent the units that will be integrated
* excluding any samples with <100 myeloid cells
* recording the number of cells per sample
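
The integration_id rule in the list above can be sketched on a toy metadata table. This is a minimal base-R illustration, not code from this notebook: the sample names and the `shared` lookup vector are hypothetical, and in the datasets processed below integration_id is simply copied from sample_id.

```r
# Toy per-cell metadata; the sample names are hypothetical.
meta <- data.frame(
  sample_id = c("DS1_pt1_biopsyA", "DS1_pt1_biopsyB", "DS1_pt2_tumour"),
  stringsAsFactors = FALSE
)

# Default: every sample is its own integration unit.
meta$integration_id <- meta$sample_id

# Override: two biopsies from one tumour share a single integration_id.
shared <- c(DS1_pt1_biopsyA = "DS1_pt1", DS1_pt1_biopsyB = "DS1_pt1")
hit <- meta$sample_id %in% names(shared)
meta$integration_id[hit] <- unname(shared[meta$sample_id[hit]])

table(meta$integration_id)
```

Keeping the default and the overrides separate makes it easy to see which samples were deliberately grouped.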

Backing up to RDM:
```bash
rsync -azvhp /scratch/user/s4436039/scdata/Myeloid_Cells/Myeloid_Cells_Integrate/ /QRISdata/Q5935/nikita/scdata/Myeloid_Cells/Myeloid_Cells_Integrate
```

In [1]:
#set wd
getwd()
setwd('/scratch/user/s4436039/scdata/Myeloid_Cells')
getwd()

In [2]:
#Load packages
library(dplyr)
library(Seurat)
library(patchwork)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: SeuratObject

Loading required package: sp


Attaching package: ‘SeuratObject’


The following object is masked from ‘package:base’:

    intersect




## GSE184880

In [12]:
HGSOC <- readRDS("/scratch/user/s4436039/scdata/Myeloid_Cells/GSE184880_myeloid.RDS")

In [13]:
HGSOC
HGSOC@project.name
head(HGSOC@meta.data)

An object of class Seurat 
27984 features across 7799 samples within 1 assay 
Active assay: RNA (27984 features, 2000 variable features)
 25 layers present: counts.1, counts.2, counts.3, counts.4, counts.5, counts.6, counts.7, counts.8, counts.9, counts.10, counts.11, counts.12, data.1, data.2, data.3, data.4, data.5, data.6, data.7, data.8, data.9, data.10, data.11, data.12, scale.data
 2 dimensional reductions calculated: pca, umap

Unnamed: 0_level_0,orig.ident,nCount_RNA,nFeature_RNA,sample_type,cancer_type,patient_id,sample_id,percent.mt,RNA_snn_res.0.2,seurat_clusters
Unnamed: 0_level_1,<chr>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<fct>,<fct>
GSE184880_Cancer1_AAACCCACAGCTGCCA-1,GSE184880,9374,2655,tumour,HGSOC,Cancer1,GSE184880_HGSOC_Cancer1,15.980371,1,1
GSE184880_Cancer1_AAACCCACATGACGGA-1,GSE184880,2659,1246,tumour,HGSOC,Cancer1,GSE184880_HGSOC_Cancer1,8.837909,1,1
GSE184880_Cancer1_AAACGAACAGTAGTGG-1,GSE184880,3020,1206,tumour,HGSOC,Cancer1,GSE184880_HGSOC_Cancer1,13.807947,1,1
GSE184880_Cancer1_AAACGAATCACCCTCA-1,GSE184880,50940,6660,tumour,HGSOC,Cancer1,GSE184880_HGSOC_Cancer1,8.531606,1,1
GSE184880_Cancer1_AAACGCTTCTCCACTG-1,GSE184880,10129,2880,tumour,HGSOC,Cancer1,GSE184880_HGSOC_Cancer1,11.225195,1,1
GSE184880_Cancer1_AAACGCTTCTGCTCTG-1,GSE184880,12756,3352,tumour,HGSOC,Cancer1,GSE184880_HGSOC_Cancer1,9.321104,1,1


In [15]:
table(HGSOC$sample_type)
table(HGSOC$cancer_type)
table(HGSOC$patient_id)
table(HGSOC$sample_id)


Healthy_ovary        tumour 
         1457          6342 


Healthy   HGSOC 
   1457    6342 


Cancer1 Cancer2 Cancer3 Cancer4 Cancer5 Cancer6 Cancer7   Norm1   Norm2   Norm3 
   2298    1080     577     792     695     652     248      54     281     360 
  Norm4   Norm5 
    193     569 


GSE184880_Healthy_Norm1 GSE184880_Healthy_Norm2 GSE184880_Healthy_Norm3 
                     54                     281                     360 
GSE184880_Healthy_Norm4 GSE184880_Healthy_Norm5 GSE184880_HGSOC_Cancer1 
                    193                     569                    2298 
GSE184880_HGSOC_Cancer2 GSE184880_HGSOC_Cancer3 GSE184880_HGSOC_Cancer4 
                   1080                     577                     792 
GSE184880_HGSOC_Cancer5 GSE184880_HGSOC_Cancer6 GSE184880_HGSOC_Cancer7 
                    695                     652                     248 

In [16]:
#set site metadata
HGSOC@meta.data$site <- "ovary"

In [17]:
#set subtype metadata

#split by cancer_type
HGSOC_tumour <- subset(HGSOC, subset = cancer_type %in% c("HGSOC"))
HGSOC_healthy <- subset(HGSOC, subset = cancer_type %in% c("Healthy"))

HGSOC_tumour@meta.data$cancer_subtype <- "HGSOC"
HGSOC_healthy@meta.data$cancer_subtype <- "NA"

#Merge seurat objects back together
HGSOC <- merge(HGSOC_tumour, y = c(HGSOC_healthy), project = "GSE184880")
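
The same labels could probably be assigned in place, without the subset-and-merge round trip. The sketch below shows the idea on a toy data frame standing in for `HGSOC@meta.data`; applying it to the real object (assigning to `HGSOC@meta.data$cancer_subtype`) is an untested assumption here. Note that `"NA"` is a literal string, as in the cell above, not R's missing value `NA`.

```r
# Toy stand-in for HGSOC@meta.data (two cells only).
meta <- data.frame(
  cancer_type = c("HGSOC", "Healthy"),
  stringsAsFactors = FALSE
)

# Tumour cells get "HGSOC"; healthy cells get the string "NA",
# matching the values assigned in the cell above.
meta$cancer_subtype <- ifelse(meta$cancer_type == "HGSOC", "HGSOC", "NA")
meta
```

If the string is later meant to be treated as missing, `NA_character_` would be the usual choice, but that would change the values this notebook stores.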

In [19]:
#set integration_id metadata
HGSOC@meta.data$integration_id <- HGSOC@meta.data$sample_id

In [20]:
HGSOC
HGSOC@project.name
head(HGSOC@meta.data)

An object of class Seurat 
27984 features across 7799 samples within 1 assay 
Active assay: RNA (27984 features, 2000 variable features)
 26 layers present: counts.1.1, counts.10.2, counts.11.2, counts.12.2, counts.2.1, counts.3.1, counts.4.1, counts.5.1, counts.6.1, counts.7.1, data.1.1, data.2.1, data.3.1, data.4.1, data.5.1, data.6.1, data.7.1, scale.data.1, counts.8.2, counts.9.2, data.8.2, data.9.2, data.10.2, data.11.2, data.12.2, scale.data.2

Unnamed: 0_level_0,orig.ident,nCount_RNA,nFeature_RNA,sample_type,cancer_type,patient_id,sample_id,percent.mt,RNA_snn_res.0.2,seurat_clusters,site,cancer_subtype,integration_id
Unnamed: 0_level_1,<chr>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
GSE184880_Cancer1_AAACCCACAGCTGCCA-1,GSE184880,9374,2655,tumour,HGSOC,Cancer1,GSE184880_HGSOC_Cancer1,15.980371,1,1,ovary,HGSOC,GSE184880_HGSOC_Cancer1
GSE184880_Cancer1_AAACCCACATGACGGA-1,GSE184880,2659,1246,tumour,HGSOC,Cancer1,GSE184880_HGSOC_Cancer1,8.837909,1,1,ovary,HGSOC,GSE184880_HGSOC_Cancer1
GSE184880_Cancer1_AAACGAACAGTAGTGG-1,GSE184880,3020,1206,tumour,HGSOC,Cancer1,GSE184880_HGSOC_Cancer1,13.807947,1,1,ovary,HGSOC,GSE184880_HGSOC_Cancer1
GSE184880_Cancer1_AAACGAATCACCCTCA-1,GSE184880,50940,6660,tumour,HGSOC,Cancer1,GSE184880_HGSOC_Cancer1,8.531606,1,1,ovary,HGSOC,GSE184880_HGSOC_Cancer1
GSE184880_Cancer1_AAACGCTTCTCCACTG-1,GSE184880,10129,2880,tumour,HGSOC,Cancer1,GSE184880_HGSOC_Cancer1,11.225195,1,1,ovary,HGSOC,GSE184880_HGSOC_Cancer1
GSE184880_Cancer1_AAACGCTTCTGCTCTG-1,GSE184880,12756,3352,tumour,HGSOC,Cancer1,GSE184880_HGSOC_Cancer1,9.321104,1,1,ovary,HGSOC,GSE184880_HGSOC_Cancer1


In [30]:
#exclude any samples with <100 cells
table(HGSOC$integration_id)
#exclude Norm1 (54 cells)
HGSOC <- subset(HGSOC, subset = integration_id != "GSE184880_Healthy_Norm1")
table(HGSOC$integration_id)


GSE184880_Healthy_Norm1 GSE184880_Healthy_Norm2 GSE184880_Healthy_Norm3 
                     54                     281                     360 
GSE184880_Healthy_Norm4 GSE184880_Healthy_Norm5 GSE184880_HGSOC_Cancer1 
                    193                     569                    2298 
GSE184880_HGSOC_Cancer2 GSE184880_HGSOC_Cancer3 GSE184880_HGSOC_Cancer4 
                   1080                     577                     792 
GSE184880_HGSOC_Cancer5 GSE184880_HGSOC_Cancer6 GSE184880_HGSOC_Cancer7 
                    695                     652                     248 


GSE184880_Healthy_Norm2 GSE184880_Healthy_Norm3 GSE184880_Healthy_Norm4 
                    281                     360                     193 
GSE184880_Healthy_Norm5 GSE184880_HGSOC_Cancer1 GSE184880_HGSOC_Cancer2 
                    569                    2298                    1080 
GSE184880_HGSOC_Cancer3 GSE184880_HGSOC_Cancer4 GSE184880_HGSOC_Cancer5 
                    577                     792                     695 
GSE184880_HGSOC_Cancer6 GSE184880_HGSOC_Cancer7 
                    652                     248 
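
Rather than reading the table by eye, the samples below the threshold can be picked out programmatically. A minimal sketch using the counts printed above; `min_cells`, `drop_ids` and `keep_ids` are names introduced here for illustration.

```r
# Cell counts per integration_id, copied from the first table above.
n_cells <- c(
  GSE184880_Healthy_Norm1 = 54,   GSE184880_Healthy_Norm2 = 281,
  GSE184880_Healthy_Norm3 = 360,  GSE184880_Healthy_Norm4 = 193,
  GSE184880_Healthy_Norm5 = 569,  GSE184880_HGSOC_Cancer1 = 2298,
  GSE184880_HGSOC_Cancer2 = 1080, GSE184880_HGSOC_Cancer3 = 577,
  GSE184880_HGSOC_Cancer4 = 792,  GSE184880_HGSOC_Cancer5 = 695,
  GSE184880_HGSOC_Cancer6 = 652,  GSE184880_HGSOC_Cancer7 = 248
)

min_cells <- 100
drop_ids <- names(n_cells)[n_cells < min_cells]   # samples to exclude
keep_ids <- names(n_cells)[n_cells >= min_cells]  # samples to retain

drop_ids
```

On the real object, `n_cells` would come from `table(HGSOC$integration_id)`, and `keep_ids` could then drive the subsetting step; that part is not run here.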

In [31]:
#join layers and then split them by integration_id
Layers(HGSOC[["RNA"]])
#join layers
HGSOC[["RNA"]] <- JoinLayers(HGSOC[["RNA"]])
Layers(HGSOC[["RNA"]])
#split layers
HGSOC[["RNA"]] <- split(HGSOC[["RNA"]], f = HGSOC$integration_id)
Layers(HGSOC[["RNA"]])


Splitting ‘counts’, ‘data’ layers. Not splitting ‘scale.data’. If you would like to split other layers, set in `layers` argument.
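
As a rough analogy for the split step, the base-R sketch below partitions the columns (cells) of a toy matrix by a grouping vector, giving one sub-matrix per id. It only illustrates the idea of one layer per integration_id; it does not use or reproduce Seurat's layer machinery, and all names and values are hypothetical.

```r
# Toy counts matrix: 2 genes x 4 cells (values are hypothetical).
counts <- matrix(
  1:8, nrow = 2,
  dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4))
)
integration_id <- c("s1", "s1", "s2", "s2")

# One sub-matrix per integration_id, analogous to one counts layer per sample.
layers <- lapply(
  split(colnames(counts), integration_id),
  function(cells) counts[, cells, drop = FALSE]
)

names(layers)
sapply(layers, ncol)
```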



In [33]:
#record number of cells
table(HGSOC$integration_id)


GSE184880_Healthy_Norm2 GSE184880_Healthy_Norm3 GSE184880_Healthy_Norm4 
                    281                     360                     193 
GSE184880_Healthy_Norm5 GSE184880_HGSOC_Cancer1 GSE184880_HGSOC_Cancer2 
                    569                    2298                    1080 
GSE184880_HGSOC_Cancer3 GSE184880_HGSOC_Cancer4 GSE184880_HGSOC_Cancer5 
                    577                     792                     695 
GSE184880_HGSOC_Cancer6 GSE184880_HGSOC_Cancer7 
                    652                     248 
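
To keep a record of these counts outside the notebook, the table can be converted to a data frame and written to disk. A minimal sketch, assuming a CSV file is an acceptable format; the `integration_id` vector is a hypothetical stand-in for `HGSOC$integration_id`, and a temporary file is used instead of a project path.

```r
# Hypothetical per-cell ids standing in for HGSOC$integration_id.
integration_id <- c(rep("sampleA", 3), rep("sampleB", 2))

# One row per integration_id with its cell count.
cell_counts <- as.data.frame(table(integration_id), stringsAsFactors = FALSE)
names(cell_counts) <- c("integration_id", "n_cells")

# Write to a temporary file; replace with a real path when recording results.
out_file <- tempfile(fileext = ".csv")
write.csv(cell_counts, out_file, row.names = FALSE)
read.csv(out_file)
```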

In [34]:
#re-export seurat object ready for integration
saveRDS(HGSOC, "/scratch/user/s4436039/scdata/Myeloid_Cells/Myeloid_Cells_Integrate/GSE184880_myeloid_int.RDS")

In [36]:
#remove all objects in R
rm(list = ls())

## GSE213243

In [44]:
HGSOC_tu <- readRDS("/scratch/user/s4436039/scdata/Myeloid_Cells/GSE213243_Tumour_myeloid.RDS")
HGSOC_As <- readRDS("/scratch/user/s4436039/scdata/Myeloid_Cells/GSE213243_Ascites_myeloid.RDS")

In [45]:
HGSOC_tu
HGSOC_tu@project.name
head(HGSOC_tu@meta.data)

HGSOC_As
HGSOC_As@project.name
head(HGSOC_As@meta.data)

An object of class Seurat 
58825 features across 804 samples within 1 assay 
Active assay: RNA (58825 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 2 dimensional reductions calculated: pca, umap

Unnamed: 0_level_0,orig.ident,nCount_RNA,nFeature_RNA,sample_type,cancer_type,patient_id,sample_id,percent.mt,RNA_snn_res.0.5,seurat_clusters
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<fct>,<fct>
GSE213243_tumour_AAAGGTACACGCAGTC-1,GSE213243,8050,2780,tumour,HGSOC,pt-1,GSE213243_HGSOC_tumour,19.962733,3,3
GSE213243_tumour_AAATGGACACACGCCA-1,GSE213243,5854,2467,tumour,HGSOC,pt-1,GSE213243_HGSOC_tumour,4.936795,3,3
GSE213243_tumour_AACAAAGCAATTTCCT-1,GSE213243,6073,2541,tumour,HGSOC,pt-1,GSE213243_HGSOC_tumour,6.323069,3,3
GSE213243_tumour_AACACACGTAGCTTTG-1,GSE213243,13497,3862,tumour,HGSOC,pt-1,GSE213243_HGSOC_tumour,5.319701,3,3
GSE213243_tumour_AACACACTCGCTGTTC-1,GSE213243,8644,3306,tumour,HGSOC,pt-1,GSE213243_HGSOC_tumour,10.596946,3,3
GSE213243_tumour_AACAGGGCAACCCTAA-1,GSE213243,6263,2562,tumour,HGSOC,pt-1,GSE213243_HGSOC_tumour,3.544627,3,3


An object of class Seurat 
58825 features across 2688 samples within 1 assay 
Active assay: RNA (58825 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 2 dimensional reductions calculated: pca, umap

Unnamed: 0_level_0,orig.ident,nCount_RNA,nFeature_RNA,sample_type,cancer_type,patient_id,sample_id,percent.mt,RNA_snn_res.0.5,seurat_clusters
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<fct>,<fct>
GSE213243_ascites_AAACCCAAGTAGCAAT-2,GSE213243,16943,4684,ascites,HGSOC,pt-1,GSE213243_HGSOC_ascites,7.572449,5,5
GSE213243_ascites_AAACCCACAGTCGTTA-2,GSE213243,14219,3822,ascites,HGSOC,pt-1,GSE213243_HGSOC_ascites,5.02145,1,1
GSE213243_ascites_AAACCCATCCGTAGTA-2,GSE213243,15634,4224,ascites,HGSOC,pt-1,GSE213243_HGSOC_ascites,6.556224,5,5
GSE213243_ascites_AAACGAAAGTGCTCGC-2,GSE213243,3007,1377,ascites,HGSOC,pt-1,GSE213243_HGSOC_ascites,28.766212,6,6
GSE213243_ascites_AAACGAAGTATGGTAA-2,GSE213243,13828,4227,ascites,HGSOC,pt-1,GSE213243_HGSOC_ascites,4.122071,5,5
GSE213243_ascites_AAACGCTAGTATCTGC-2,GSE213243,12945,3944,ascites,HGSOC,pt-1,GSE213243_HGSOC_ascites,8.937814,6,6


In [46]:
table(HGSOC_tu$sample_type)
table(HGSOC_tu$cancer_type)
table(HGSOC_tu$patient_id)
table(HGSOC_tu$sample_id)

table(HGSOC_As$sample_type)
table(HGSOC_As$cancer_type)
table(HGSOC_As$patient_id)
table(HGSOC_As$sample_id)


tumour 
   804 


HGSOC 
  804 


pt-1 
 804 


GSE213243_HGSOC_tumour 
                   804 


ascites 
   2688 


HGSOC 
 2688 


pt-1 
2688 


GSE213243_HGSOC_ascites 
                   2688 

In [47]:
#set site metadata
HGSOC_tu@meta.data$site <- "ovary"
HGSOC_As@meta.data$site <- "ascites fluid"

In [48]:
#set subtype metadata

#both objects are HGSOC, so no split by cancer_type is needed
HGSOC_tu@meta.data$cancer_subtype <- "HGSOC"
HGSOC_As@meta.data$cancer_subtype <- "HGSOC"

In [49]:
#merge objects
HGSOC <- merge(HGSOC_tu, y = c(HGSOC_As), project = "GSE213243")

In [50]:
#set integration_id metadata
HGSOC@meta.data$integration_id <- HGSOC@meta.data$sample_id

In [52]:
HGSOC
HGSOC@project.name
head(HGSOC@meta.data)
tail(HGSOC@meta.data)

An object of class Seurat 
58825 features across 3492 samples within 1 assay 
Active assay: RNA (58825 features, 2000 variable features)
 6 layers present: counts.1, counts.2, data.1, scale.data.1, data.2, scale.data.2

Unnamed: 0_level_0,orig.ident,nCount_RNA,nFeature_RNA,sample_type,cancer_type,patient_id,sample_id,percent.mt,RNA_snn_res.0.5,seurat_clusters,site,cancer_subtype,integration_id
Unnamed: 0_level_1,<chr>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
GSE213243_tumour_AAAGGTACACGCAGTC-1,GSE213243,8050,2780,tumour,HGSOC,pt-1,GSE213243_HGSOC_tumour,19.962733,3,3,ovary,HGSOC,GSE213243_HGSOC_tumour
GSE213243_tumour_AAATGGACACACGCCA-1,GSE213243,5854,2467,tumour,HGSOC,pt-1,GSE213243_HGSOC_tumour,4.936795,3,3,ovary,HGSOC,GSE213243_HGSOC_tumour
GSE213243_tumour_AACAAAGCAATTTCCT-1,GSE213243,6073,2541,tumour,HGSOC,pt-1,GSE213243_HGSOC_tumour,6.323069,3,3,ovary,HGSOC,GSE213243_HGSOC_tumour
GSE213243_tumour_AACACACGTAGCTTTG-1,GSE213243,13497,3862,tumour,HGSOC,pt-1,GSE213243_HGSOC_tumour,5.319701,3,3,ovary,HGSOC,GSE213243_HGSOC_tumour
GSE213243_tumour_AACACACTCGCTGTTC-1,GSE213243,8644,3306,tumour,HGSOC,pt-1,GSE213243_HGSOC_tumour,10.596946,3,3,ovary,HGSOC,GSE213243_HGSOC_tumour
GSE213243_tumour_AACAGGGCAACCCTAA-1,GSE213243,6263,2562,tumour,HGSOC,pt-1,GSE213243_HGSOC_tumour,3.544627,3,3,ovary,HGSOC,GSE213243_HGSOC_tumour


Unnamed: 0_level_0,orig.ident,nCount_RNA,nFeature_RNA,sample_type,cancer_type,patient_id,sample_id,percent.mt,RNA_snn_res.0.5,seurat_clusters,site,cancer_subtype,integration_id
Unnamed: 0_level_1,<chr>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
GSE213243_ascites_TTTGATCGTTAGGCCC-2,GSE213243,20342,4702,ascites,HGSOC,pt-1,GSE213243_HGSOC_ascites,5.899125,5,5,ascites fluid,HGSOC,GSE213243_HGSOC_ascites
GSE213243_ascites_TTTGATCTCTCGGCTT-2,GSE213243,1614,820,ascites,HGSOC,pt-1,GSE213243_HGSOC_ascites,34.262701,1,1,ascites fluid,HGSOC,GSE213243_HGSOC_ascites
GSE213243_ascites_TTTGGAGCACGTCTCT-2,GSE213243,10549,3639,ascites,HGSOC,pt-1,GSE213243_HGSOC_ascites,11.119537,6,6,ascites fluid,HGSOC,GSE213243_HGSOC_ascites
GSE213243_ascites_TTTGGAGGTCCTGGGT-2,GSE213243,4613,2061,ascites,HGSOC,pt-1,GSE213243_HGSOC_ascites,12.421418,1,1,ascites fluid,HGSOC,GSE213243_HGSOC_ascites
GSE213243_ascites_TTTGGTTCATCCTATT-2,GSE213243,6073,2678,ascites,HGSOC,pt-1,GSE213243_HGSOC_ascites,11.954553,1,1,ascites fluid,HGSOC,GSE213243_HGSOC_ascites
GSE213243_ascites_TTTGTTGCATGATGCT-2,GSE213243,14293,4430,ascites,HGSOC,pt-1,GSE213243_HGSOC_ascites,5.044427,6,6,ascites fluid,HGSOC,GSE213243_HGSOC_ascites


In [54]:
#exclude any samples with <100 cells
table(HGSOC$integration_id)
#none to exclude


GSE213243_HGSOC_ascites  GSE213243_HGSOC_tumour 
                   2688                     804 

In [55]:
#join layers and then split them by integration_id
Layers(HGSOC[["RNA"]])
#join layers
HGSOC[["RNA"]] <- JoinLayers(HGSOC[["RNA"]])
Layers(HGSOC[["RNA"]])
#split layers
HGSOC[["RNA"]] <- split(HGSOC[["RNA"]], f = HGSOC$integration_id)
Layers(HGSOC[["RNA"]])


Splitting ‘counts’, ‘data’ layers. Not splitting ‘scale.data’. If you would like to split other layers, set in `layers` argument.



In [56]:
#record number of cells
table(HGSOC$integration_id)


GSE213243_HGSOC_ascites  GSE213243_HGSOC_tumour 
                   2688                     804 

In [57]:
#re-export seurat object ready for integration
saveRDS(HGSOC, "/scratch/user/s4436039/scdata/Myeloid_Cells/Myeloid_Cells_Integrate/GSE213243_myeloid_int.RDS")

In [58]:
#remove all objects in R
rm(list = ls())

## GSE217517

In [59]:
HGSOC <- readRDS("/scratch/user/s4436039/scdata/Myeloid_Cells/GSE217517_myeloid.RDS")

In [60]:
HGSOC
HGSOC@project.name
head(HGSOC@meta.data)

An object of class Seurat 
36601 features across 8457 samples within 1 assay 
Active assay: RNA (36601 features, 2000 variable features)
 17 layers present: counts.1, counts.2, counts.3, counts.4, counts.5, counts.6, counts.7, counts.8, data.1, data.2, data.3, data.4, data.5, data.6, data.7, data.8, scale.data
 2 dimensional reductions calculated: pca, umap

Unnamed: 0_level_0,orig.ident,nCount_RNA,nFeature_RNA,sample_type,cancer_type,patient_id,sample_id,percent.mt,RNA_snn_res.0.5,seurat_clusters,RNA_snn_res.0.2
Unnamed: 0_level_1,<chr>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<fct>,<fct>,<fct>
GSE217517_pt1_AAACGAAAGAACCCGA-1,GSE217517,7268,2217,tumour,HGSOC,pt1,GSE217517_HGSOC_pt1,3.76995,9,1,1
GSE217517_pt1_AAAGAACCAGGGCTTC-1,GSE217517,20132,4339,tumour,HGSOC,pt1,GSE217517_HGSOC_pt1,7.634612,5,1,1
GSE217517_pt1_AAAGAACTCCATGAGT-1,GSE217517,4183,1410,tumour,HGSOC,pt1,GSE217517_HGSOC_pt1,35.142242,9,1,1
GSE217517_pt1_AAAGGATTCTATTTCG-1,GSE217517,3037,1274,tumour,HGSOC,pt1,GSE217517_HGSOC_pt1,6.914718,9,1,1
GSE217517_pt1_AAATGGACACTGAGGA-1,GSE217517,9516,2822,tumour,HGSOC,pt1,GSE217517_HGSOC_pt1,2.847835,5,1,1
GSE217517_pt1_AACAGGGGTCATCGGC-1,GSE217517,22104,4611,tumour,HGSOC,pt1,GSE217517_HGSOC_pt1,7.69544,9,1,1


In [61]:
table(HGSOC$sample_type)
table(HGSOC$cancer_type)
table(HGSOC$patient_id)
table(HGSOC$sample_id)


tumour 
  8457 


HGSOC 
 8457 


 pt1  pt2  pt3  pt4  pt5  pt6  pt7  pt8 
 842  966 2678 1517 1004   37 1054  359 


GSE217517_HGSOC_pt1 GSE217517_HGSOC_pt2 GSE217517_HGSOC_pt3 GSE217517_HGSOC_pt4 
                842                 966                2678                1517 
GSE217517_HGSOC_pt5 GSE217517_HGSOC_pt6 GSE217517_HGSOC_pt7 GSE217517_HGSOC_pt8 
               1004                  37                1054                 359 

In [62]:
#set site metadata
HGSOC@meta.data$site <- "ovary"

In [63]:
#set subtype metadata
HGSOC@meta.data$cancer_subtype <- "HGSOC"

In [64]:
#set integration_id metadata
HGSOC@meta.data$integration_id <- HGSOC@meta.data$sample_id

In [65]:
HGSOC
HGSOC@project.name
head(HGSOC@meta.data)

An object of class Seurat 
36601 features across 8457 samples within 1 assay 
Active assay: RNA (36601 features, 2000 variable features)
 17 layers present: counts.1, counts.2, counts.3, counts.4, counts.5, counts.6, counts.7, counts.8, data.1, data.2, data.3, data.4, data.5, data.6, data.7, data.8, scale.data
 2 dimensional reductions calculated: pca, umap

Unnamed: 0_level_0,orig.ident,nCount_RNA,nFeature_RNA,sample_type,cancer_type,patient_id,sample_id,percent.mt,RNA_snn_res.0.5,seurat_clusters,RNA_snn_res.0.2,site,cancer_subtype,integration_id
Unnamed: 0_level_1,<chr>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<fct>,<fct>,<fct>,<chr>,<chr>,<chr>
GSE217517_pt1_AAACGAAAGAACCCGA-1,GSE217517,7268,2217,tumour,HGSOC,pt1,GSE217517_HGSOC_pt1,3.76995,9,1,1,ovary,HGSOC,GSE217517_HGSOC_pt1
GSE217517_pt1_AAAGAACCAGGGCTTC-1,GSE217517,20132,4339,tumour,HGSOC,pt1,GSE217517_HGSOC_pt1,7.634612,5,1,1,ovary,HGSOC,GSE217517_HGSOC_pt1
GSE217517_pt1_AAAGAACTCCATGAGT-1,GSE217517,4183,1410,tumour,HGSOC,pt1,GSE217517_HGSOC_pt1,35.142242,9,1,1,ovary,HGSOC,GSE217517_HGSOC_pt1
GSE217517_pt1_AAAGGATTCTATTTCG-1,GSE217517,3037,1274,tumour,HGSOC,pt1,GSE217517_HGSOC_pt1,6.914718,9,1,1,ovary,HGSOC,GSE217517_HGSOC_pt1
GSE217517_pt1_AAATGGACACTGAGGA-1,GSE217517,9516,2822,tumour,HGSOC,pt1,GSE217517_HGSOC_pt1,2.847835,5,1,1,ovary,HGSOC,GSE217517_HGSOC_pt1
GSE217517_pt1_AACAGGGGTCATCGGC-1,GSE217517,22104,4611,tumour,HGSOC,pt1,GSE217517_HGSOC_pt1,7.69544,9,1,1,ovary,HGSOC,GSE217517_HGSOC_pt1


In [67]:
#exclude any samples with <100 cells
table(HGSOC$integration_id)
#exclude patient 6 (37 cells)
HGSOC <- subset(HGSOC, subset = integration_id != "GSE217517_HGSOC_pt6")
table(HGSOC$integration_id)


GSE217517_HGSOC_pt1 GSE217517_HGSOC_pt2 GSE217517_HGSOC_pt3 GSE217517_HGSOC_pt4 
                842                 966                2678                1517 
GSE217517_HGSOC_pt5 GSE217517_HGSOC_pt6 GSE217517_HGSOC_pt7 GSE217517_HGSOC_pt8 
               1004                  37                1054                 359 


GSE217517_HGSOC_pt1 GSE217517_HGSOC_pt2 GSE217517_HGSOC_pt3 GSE217517_HGSOC_pt4 
                842                 966                2678                1517 
GSE217517_HGSOC_pt5 GSE217517_HGSOC_pt7 GSE217517_HGSOC_pt8 
               1004                1054                 359 

In [68]:
#join layers and then split them by integration_id
Layers(HGSOC[["RNA"]])
#join layers
HGSOC[["RNA"]] <- JoinLayers(HGSOC[["RNA"]])
Layers(HGSOC[["RNA"]])
#split layers
HGSOC[["RNA"]] <- split(HGSOC[["RNA"]], f = HGSOC$integration_id)
Layers(HGSOC[["RNA"]])


Splitting ‘counts’, ‘data’ layers. Not splitting ‘scale.data’. If you would like to split other layers, set in `layers` argument.



In [69]:
#record number of cells
table(HGSOC$integration_id)


GSE217517_HGSOC_pt1 GSE217517_HGSOC_pt2 GSE217517_HGSOC_pt3 GSE217517_HGSOC_pt4 
                842                 966                2678                1517 
GSE217517_HGSOC_pt5 GSE217517_HGSOC_pt7 GSE217517_HGSOC_pt8 
               1004                1054                 359 

In [70]:
#re-export seurat object ready for integration
saveRDS(HGSOC, "/scratch/user/s4436039/scdata/Myeloid_Cells/Myeloid_Cells_Integrate/GSE217517_myeloid_int.RDS")

In [71]:
#remove all objects in R
rm(list = ls())