# X-inactivation Preprocess

This notebook prepares a subset of the Lupus (SLE) dataset for detecting X-inactivation.

1. Get "SLEcrossX_nonorm.h5ad" from Richard (Google Drive).
2. Filter for donors that we have metadata for in cluestime1.csv from Meena.
3. Keep only genes detected in at least 5000 cells.
4. Annotate each gene with its chromosome.
5. Keep only cells from a single batch (labeled `'0'` in `obs.batch`).
6. Save the data.

In [68]:
import pandas as pd
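# `scanpy.api` is deprecated; import the top-level package instead.
import scanpy as sc

# Note: `data_path` must be defined before running the cells below; its value is not shown in this notebook.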

### Read in SLE AnnData 

In [4]:
adata = sc.read(data_path + 'SLEcrossX_nonorm.h5ad')

### Read the CLUES metadata

In [5]:
metadata = pd.read_csv(data_path + '../misc/cluestime1.csv', sep='\t')
metadata['ind_cov'] = metadata['subjectid'].astype(str) + '_' + metadata['subjectid'].astype(str)
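To make the join key concrete, here is a minimal sketch on a made-up table. The subject IDs and the `toy_metadata` name are hypothetical and not taken from cluestime1.csv. It assumes, as the cell above does, that donor labels in `adata.obs.ind_cov` have the form `<subjectid>_<subjectid>`.

```python
import pandas as pd

# Toy stand-in for the CLUES metadata; all values are hypothetical.
toy_metadata = pd.DataFrame({'subjectid': [101, 202], 'female': [1, 0]})

# Build the donor key by repeating the subject id on both sides of an underscore.
toy_metadata['ind_cov'] = (
    toy_metadata['subjectid'].astype(str) + '_' + toy_metadata['subjectid'].astype(str)
)

print(toy_metadata[['ind_cov', 'female']])
```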

### Filter the SLE data for only donors in CLUES

In [6]:
def clean_donor(x):
    # Split a donor label on whitespace (not used in the cells below).
    return x.split()

In [7]:
clues_adata = adata[adata.obs.ind_cov.isin(metadata.ind_cov.tolist())].copy()

In [8]:
clues_index = clues_adata.obs.index.values

In [9]:
clues_adata.obs = clues_adata.obs.merge(metadata[['ind_cov', 'female']], on='ind_cov', how='inner')
clues_adata.obs.index = clues_index

AnnData expects string indices for some functionality, but your first two indices are: Int64Index([0, 1], dtype='int64'). 
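The warning above most likely arises because `DataFrame.merge` on a column returns a fresh integer index, so the cell barcodes are briefly lost until the saved index is written back. Writing the index back is only safe if the merge keeps the number and order of rows, which in turn assumes one metadata row per donor. The sketch below uses hypothetical toy tables (`obs`, `meta`) to show the index reset, together with one possible alternative, `Series.map`, that keeps the original index throughout.

```python
import pandas as pd

# Toy per-cell table indexed by cell barcode; all values are hypothetical.
obs = pd.DataFrame(
    {'ind_cov': ['a_a', 'b_b', 'a_a']},
    index=['cell1', 'cell2', 'cell3'],
)
# Toy per-donor metadata with exactly one row per donor.
meta = pd.DataFrame({'ind_cov': ['a_a', 'b_b'], 'female': [1, 0]})

# merge() on a column discards the barcode index and returns 0..n-1.
merged = obs.merge(meta, on='ind_cov', how='inner')
print(merged.index.tolist())

# Alternative: look up each cell's donor in a donor-indexed Series.
# The barcode index is never touched, so nothing needs restoring.
female_by_donor = meta.set_index('ind_cov')['female']
obs_with_sex = obs.assign(female=obs['ind_cov'].map(female_by_donor))
print(obs_with_sex)
```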


### Show cell counts by donor

In [50]:
print('Male donors:')
clues_adata.obs.query('female == 0').groupby(['ind_cov']).size()

Male donors:


ind_cov
1051_1051    2753
1195_1195    3265
1196_1196    3525
1414_1414    4028
1522_1522    4802
1558_1558    4208
1615_1615    4815
1621_1621    5135
dtype: int64
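The same `query` and `groupby` pattern can be tried on a small table. This is only an illustrative sketch: the `toy_obs` frame and its donor labels are made up, and `female == 0` is assumed to mark male donors, as in the cell above.

```python
import pandas as pd

# Toy per-cell table; donor labels and sex flags are hypothetical.
toy_obs = pd.DataFrame({
    'ind_cov': ['a_a', 'a_a', 'b_b', 'c_c', 'c_c', 'c_c'],
    'female':  [0,     0,     1,     0,     0,     0],
})

# Number of cells per male donor.
male_counts = toy_obs.query('female == 0').groupby('ind_cov').size()
print(male_counts)

# Number of cells per donor, split by sex, in one table.
counts_by_sex = toy_obs.groupby(['female', 'ind_cov']).size()
print(counts_by_sex)
```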

### Filter gene list

In [28]:
sc.pp.filter_genes(clues_adata, min_cells=5000)
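As a rough illustration of what a `min_cells` threshold does, the NumPy sketch below counts, for each gene, the cells with a nonzero count and keeps genes that reach the threshold. The matrix and the threshold of 3 are made up; this is a simplified stand-in, not scanpy's implementation. Note that in this notebook the filter runs before the batch filter, so the 5000-cell threshold applies to all CLUES cells across batches.

```python
import numpy as np

# Toy count matrix: rows are cells, columns are genes (values are made up).
counts = np.array([
    [0, 3, 1],
    [0, 0, 2],
    [5, 0, 4],
    [0, 1, 1],
])
min_cells = 3  # hypothetical threshold; the notebook uses 5000

# Number of cells in which each gene has a nonzero count.
cells_per_gene = (counts > 0).sum(axis=0)

# Keep only genes detected in at least `min_cells` cells.
keep = cells_per_gene >= min_cells
filtered = counts[:, keep]

print(cells_per_gene, keep, filtered.shape)
```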

### Get chromosome information about each gene

In [44]:
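# Keep the chromosome (column 0) and gene symbol (column 5) from the BED-style map,
# one row per gene symbol, indexed by symbol so it can be joined onto `var`.
gene_metadata = (
    pd.read_csv(data_path + '../misc/my_uscs_ids_symbols.map.bed', sep='\t', header=None)
    .iloc[:, [0, 5]]
    .rename(columns={0: 'chromosome', 5: 'gene_name'})
    .drop_duplicates(subset=['gene_name'])
    .set_index('gene_name')
)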

In [49]:
clues_adata.var = clues_adata.var.join(gene_metadata)
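The sketch below replays the gene annotation on toy data. The six-column `bed` table, the gene names, and the `chrX` label format are all assumptions made for illustration; the real map file may differ. `join` is a left join on the index, so genes missing from the map end up with `NaN` in `chromosome`. The final X-chromosome selection is not part of this notebook; it only shows how the annotation could be used downstream.

```python
import pandas as pd

# Toy BED-style table with six unnamed columns; all rows are hypothetical.
bed = pd.DataFrame([
    ['chrX', 100, 200, 'id1', 0, 'GENE_A'],
    ['chrX', 150, 250, 'id2', 0, 'GENE_A'],  # second entry for the same symbol
    ['chr1', 300, 400, 'id3', 0, 'GENE_B'],
])

gene_metadata = (
    bed.iloc[:, [0, 5]]
    .rename(columns={0: 'chromosome', 5: 'gene_name'})
    .drop_duplicates(subset=['gene_name'])  # keep the first row per symbol
    .set_index('gene_name')
)

# Toy stand-in for `adata.var`, indexed by gene symbol.
var = pd.DataFrame(index=['GENE_A', 'GENE_B', 'GENE_C'])
var = var.join(gene_metadata)  # GENE_C is not in the map, so it gets NaN
print(var)

# Illustrative downstream use: list genes annotated to the X chromosome.
x_genes = var.index[var['chromosome'] == 'chrX'].tolist()
print(x_genes)
```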

### Filter for single batch

In [65]:
clues_adata = clues_adata[clues_adata.obs.batch == '0'].copy()

In [66]:
clues_adata.shape

(273918, 7379)

### Save the AnnData object

In [67]:
sc.write(data_path + 'clues_data.h5ad', clues_adata)