# Marker investigation 

This notebook uses all of the variants in the [Pv4 data release](https://www.malariagen.net/resource/30) to identify regions of the core genome that are microhaplotype candidates (<200 bp in length). The notebook consists of three parts:
- Setup and loading the data
- Subsetting the variants
- Running a sliding window across the core regions of the genome, calculating statistics for each window

The variants are subset to only include ones with the following characteristics: 

- Clonal samples only (FWS > 0.95)
- Unique samples only, > 50% callable (Richard's "in_analysis_set" metadata column) 
- QC pass (Filter pass) 
- Only biallelic SNPs 
- Located in core genome 

Files used in this notebook are available through the [Pv4 data release](https://www.malariagen.net/resource/30), and are also included in the repo.

Questions 
- Do we also want to apply the usable study list from the other notebook (Sasha's notes: samples only in the GSK and Price studies, plus anything in the Pv1.0 release)?

## Setup 

In [1]:
from malariagen_data.pv4 import Pv4
import pandas as pd
import numpy as np
import allel
import dask.array as da
import collections
import math

In [2]:
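# Suppress NumPy's VisibleDeprecationWarning (np.warnings was removed in NumPy 1.24, so use the stdlib module)
import warnings
warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)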

## Load Data  

Using the Pv4 data package we can access the files that are stored on the cloud. This is set up with the following code:

In [3]:
pv4 = Pv4("gs://pv4_staging/")

Using this we can load the **sample metadata**

In [4]:
pv4_metadata = pv4.sample_metadata()

pv4_metadata.head()

| | Sample | Study | Site | First-level administrative division | Country | Lat | Long | Year | ENA | All samples same individual | Population | % callable | QC pass | Exclusion reason | Is returning traveller |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BBH-1-125 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2678989 | BBH-1-125 | AF | 88.52 | True | Analysis_set | False |
| 1 | BBH_1_132 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2678991 | BBH_1_132 | AF | 90.2 | True | Analysis_set | False |
| 2 | BBH_1_137 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2679003 | BBH_1_137 | AF | 87.09 | True | Analysis_set | False |
| 3 | BBH_1_153 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2678992 | BBH_1_153 | AF | 90.6 | True | Analysis_set | False |
| 4 | BBH_1_162 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2678993 | BBH_1_162 | AF | 91.67 | True | Analysis_set | False |


We can also use the package to load the **variant data**

In [5]:
variant_dataset = pv4.variant_calls(extended=True)
variant_dataset

[Output: summary of the dataset, one block per array. The arrays are chunked and lazily loaded; they cover 4,571,056 variants and 1,895 samples, with per-variant arrays split into chunks of 65,536 variants. The largest per-sample array has shape (4571056, 1895, 7) and would take 112.94 GiB if loaded in full.]

## Subset Variants

We only want to include certain samples and variants in this analysis. Below we filter the variant dataset to only include:
* samples in the analysis set
* samples with FWS > 0.95
* samples with percent callable > 50%
* variants that are SNPs
* variants in coding sequence (CDS)
* filter-pass variants
* biallelic SNPs

We will need the [FWS values](https://www.malariagen.net/sites/default/files/Pv4_fws.txt) which are stored in a separate file within the repository. The following code loads the FWS data and adds it to the existing metadata:

In [6]:
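# Load the per-sample Fws values and attach them to the metadata by sample name.
pv4_fws = pd.read_csv('../supplementary_files/Pv4_fws.txt', sep='\t', comment='t')
# how='left' keeps the metadata row order, so masks built from it stay aligned with the dataset's samples.
pv4_metadata = pd.merge(pv4_metadata, pv4_fws, on='Sample', how='left')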
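
Since the sample mask built from this table is later applied positionally to the variant dataset, it can be worth confirming that the merge leaves the metadata with one row per sample in its original order. The sketch below uses small made-up tables (the sample names and Fws values are hypothetical, not Pv4 data) to show how `how='left'` and `how='outer'` differ in that respect.

```python
import pandas as pd

# Toy stand-ins for the metadata and Fws tables (hypothetical values).
meta = pd.DataFrame({"Sample": ["S3", "S1", "S2"], "% callable": [90.0, 40.0, 75.0]})
fws = pd.DataFrame({"Sample": ["S1", "S2", "S4"], "Fws": [0.99, 0.80, 0.97]})

# A left join keeps exactly the metadata rows, in their original order.
# validate="one_to_one" raises if a sample name is duplicated in either table.
left = pd.merge(meta, fws, on="Sample", how="left", validate="one_to_one")

# An outer join also keeps samples that only appear in the Fws table,
# so the result can be longer than the metadata.
outer = pd.merge(meta, fws, on="Sample", how="outer")

# Check row alignment with the original metadata.
left_aligned = left["Sample"].tolist() == meta["Sample"].tolist()
outer_aligned = outer["Sample"].tolist() == meta["Sample"].tolist()
print(left_aligned, outer_aligned, len(left), len(outer))
```

In this toy case the left join stays aligned with the metadata (three rows), while the outer join gains a fourth row for the sample that exists only in the Fws table.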

Filter the dataset to keep only samples in the **analysis_set** with **FWS > 0.95** and **percent callable > 50%**:

In [7]:
loc_filtered_samples = (
    (pv4_metadata["Fws"] > 0.95)
    & (pv4_metadata["% callable"] > 50)
    & (pv4_metadata["Exclusion reason"] == "Analysis_set")
)
subset_metadata = pv4_metadata[loc_filtered_samples]
variant_dataset_filtered = variant_dataset.isel(samples=loc_filtered_samples)
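
A small illustration of how this kind of mask behaves, using made-up metadata (all values are hypothetical, including the `Low_coverage` exclusion label). Comparisons against missing values evaluate to `False`, so a sample without an Fws value is excluded rather than raising an error. The NumPy array here merely stands in for the dataset's samples dimension.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata for five samples; S3 has no Fws value.
meta = pd.DataFrame({
    "Sample": ["S1", "S2", "S3", "S4", "S5"],
    "Fws": [0.99, 0.90, np.nan, 0.97, 0.98],
    "% callable": [88.5, 91.0, 77.0, 45.0, 60.0],
    "Exclusion reason": ["Analysis_set", "Analysis_set", "Analysis_set",
                         "Analysis_set", "Low_coverage"],
})

keep = (
    (meta["Fws"] > 0.95)                 # clonal samples only; NaN > 0.95 is False
    & (meta["% callable"] > 50)          # enough of the genome callable
    & (meta["Exclusion reason"] == "Analysis_set")
)

# Stand-in for a (variants, samples) array: 3 variants x 5 samples.
calls = np.arange(15).reshape(3, 5)
calls_kept = calls[:, keep.to_numpy()]   # select sample columns positionally by the mask

print(meta.loc[keep, "Sample"].tolist(), calls_kept.shape)
```

Only `S1` satisfies all three conditions here, so a single sample column is retained.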

Subset the variants to keep only those that **pass filters** and are **coding SNPs**:

In [8]:
filters = (
    (variant_dataset_filtered["variant_filter_pass"].data)
    & (variant_dataset_filtered["variant_is_snp"].data)
    & (variant_dataset_filtered["variant_CDS"].data)
)
variant_dataset_filtered = variant_dataset_filtered.isel(variants=filters)
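
The element-wise `&` keeps a variant only when every flag is true for it. A toy NumPy illustration with hypothetical flag values and positions (plain in-memory arrays are used here purely to show the logic):

```python
import numpy as np

# Hypothetical per-variant flags for six variants.
filter_pass = np.array([True, True, False, True, True, True])
is_snp = np.array([True, False, True, True, True, True])
in_cds = np.array([True, True, True, False, True, True])

# Keep variants for which all three flags are true.
keep = filter_pass & is_snp & in_cds

# Toy coordinates; boolean indexing keeps only the flagged variants.
pos = np.array([101, 150, 220, 305, 410, 512])
pos_kept = pos[keep]

print(int(keep.sum()), pos_kept.tolist())
```

Each of the second, third and fourth toy variants fails exactly one flag, so three of the six variants remain.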

Filter the variants to keep only **biallelic** SNPs:

In [9]:
biallelic_filter = (variant_dataset_filtered["variant_numalt"] == 1).data
variant_dataset_filtered = variant_dataset_filtered.isel(variants=biallelic_filter)
variant_dataset_filtered

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 1.68 MiB 62.07 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type int32 numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 3.36 MiB 124.13 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,5.42 kiB,5.42 kiB
Shape,"(694,)","(694,)"
Count,2 Tasks,1 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 5.42 kiB 5.42 kiB Shape (694,) (694,) Count 2 Tasks 1 Chunks Type object numpy.ndarray",694  1,

Unnamed: 0,Array,Chunk
Bytes,5.42 kiB,5.42 kiB
Shape,"(694,)","(694,)"
Count,2 Tasks,1 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,23.51 MiB,744.80 kiB
Shape,"(440222, 7)","(15889, 6)"
Count,514 Tasks,82 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 23.51 MiB 744.80 kiB Shape (440222, 7) (15889, 6) Count 514 Tasks 82 Chunks Type object numpy.ndarray",7  440222,

Unnamed: 0,Array,Chunk
Bytes,23.51 MiB,744.80 kiB
Shape,"(440222, 7)","(15889, 6)"
Count,514 Tasks,82 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,429.90 kiB,15.52 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 429.90 kiB 15.52 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,429.90 kiB,15.52 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,429.90 kiB,15.52 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 429.90 kiB 15.52 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,429.90 kiB,15.52 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 1.68 MiB 62.07 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type int32 numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,429.90 kiB,15.52 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 429.90 kiB 15.52 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,429.90 kiB,15.52 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,582.72 MiB,1.18 MiB
Shape,"(440222, 694, 2)","(15889, 39, 2)"
Count,6660 Tasks,1230 Chunks
Type,int8,numpy.ndarray
"Array Chunk Bytes 582.72 MiB 1.18 MiB Shape (440222, 694, 2) (15889, 39, 2) Count 6660 Tasks 1230 Chunks Type int8 numpy.ndarray",2  694  440222,

Unnamed: 0,Array,Chunk
Bytes,582.72 MiB,1.18 MiB
Shape,"(440222, 694, 2)","(15889, 39, 2)"
Count,6660 Tasks,1230 Chunks
Type,int8,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.98 GiB,8.27 MiB
Shape,"(440222, 694, 7)","(15889, 39, 7)"
Count,6660 Tasks,1230 Chunks
Type,int16,numpy.ndarray
"Array Chunk Bytes 3.98 GiB 8.27 MiB Shape (440222, 694, 7) (15889, 39, 7) Count 6660 Tasks 1230 Chunks Type int16 numpy.ndarray",7  694  440222,

Unnamed: 0,Array,Chunk
Bytes,3.98 GiB,8.27 MiB
Shape,"(440222, 694, 7)","(15889, 39, 7)"
Count,6660 Tasks,1230 Chunks
Type,int16,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,582.72 MiB,1.18 MiB
Shape,"(440222, 694)","(15889, 39)"
Count,6660 Tasks,1230 Chunks
Type,int16,numpy.ndarray
"Array Chunk Bytes 582.72 MiB 1.18 MiB Shape (440222, 694) (15889, 39) Count 6660 Tasks 1230 Chunks Type int16 numpy.ndarray",694  440222,

Unnamed: 0,Array,Chunk
Bytes,582.72 MiB,1.18 MiB
Shape,"(440222, 694)","(15889, 39)"
Count,6660 Tasks,1230 Chunks
Type,int16,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,291.36 MiB,605.15 kiB
Shape,"(440222, 694)","(15889, 39)"
Count,6660 Tasks,1230 Chunks
Type,int8,numpy.ndarray
"Array Chunk Bytes 291.36 MiB 605.15 kiB Shape (440222, 694) (15889, 39) Count 6660 Tasks 1230 Chunks Type int8 numpy.ndarray",694  440222,

Unnamed: 0,Array,Chunk
Bytes,291.36 MiB,605.15 kiB
Shape,"(440222, 694)","(15889, 39)"
Count,6660 Tasks,1230 Chunks
Type,int8,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,2.28 GiB,4.73 MiB
Shape,"(440222, 694)","(15889, 39)"
Count,6660 Tasks,1230 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 2.28 GiB 4.73 MiB Shape (440222, 694) (15889, 39) Count 6660 Tasks 1230 Chunks Type object numpy.ndarray",694  440222,

Unnamed: 0,Array,Chunk
Bytes,2.28 GiB,4.73 MiB
Shape,"(440222, 694)","(15889, 39)"
Count,6660 Tasks,1230 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,2.28 GiB,4.73 MiB
Shape,"(440222, 694)","(15889, 39)"
Count,6660 Tasks,1230 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 2.28 GiB 4.73 MiB Shape (440222, 694) (15889, 39) Count 6660 Tasks 1230 Chunks Type object numpy.ndarray",694  440222,

Unnamed: 0,Array,Chunk
Bytes,2.28 GiB,4.73 MiB
Shape,"(440222, 694)","(15889, 39)"
Count,6660 Tasks,1230 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.41 GiB,7.09 MiB
Shape,"(440222, 694, 3)","(15889, 39, 3)"
Count,6660 Tasks,1230 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 3.41 GiB 7.09 MiB Shape (440222, 694, 3) (15889, 39, 3) Count 6660 Tasks 1230 Chunks Type int32 numpy.ndarray",3  694  440222,

Unnamed: 0,Array,Chunk
Bytes,3.41 GiB,7.09 MiB
Shape,"(440222, 694, 3)","(15889, 39, 3)"
Count,6660 Tasks,1230 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,10.08 MiB,372.40 kiB
Shape,"(440222, 6)","(15889, 6)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 10.08 MiB 372.40 kiB Shape (440222, 6) (15889, 6) Count 152 Tasks 41 Chunks Type int32 numpy.ndarray",6  440222,

Unnamed: 0,Array,Chunk
Bytes,10.08 MiB,372.40 kiB
Shape,"(440222, 6)","(15889, 6)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,10.08 MiB,372.40 kiB
Shape,"(440222, 6)","(15889, 6)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 10.08 MiB 372.40 kiB Shape (440222, 6) (15889, 6) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",6  440222,

Unnamed: 0,Array,Chunk
Bytes,10.08 MiB,372.40 kiB
Shape,"(440222, 6)","(15889, 6)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 1.68 MiB 62.07 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type int32 numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 1.68 MiB 62.07 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type int32 numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,429.90 kiB,15.52 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 429.90 kiB 15.52 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",440222  1,

[Remainder of the `variant_dataset_filtered` repr omitted: each remaining per-variant variable is a dask array over 440222 variants, split into 41 chunks of up to 15889 variants.]


**Only include variants that have a MAF > 0.1 and missingness < 0.1**

Count alleles across all samples and convert the counts to frequencies.

In [10]:
%%time
# allele frequency for all samples
gt = allel.GenotypeDaskArray(variant_dataset_filtered["call_genotype"].data)
ac_pop = gt.count_alleles()
ac_pop_freq = ac_pop.to_frequencies().compute()
ac_pop_freq

CPU times: user 1min 57s, sys: 3.92 s, total: 2min 1s
Wall time: 3min 11s


array([[0.98991354, 0.01008646],
       [0.64841499, 0.35158501],
       [0.98559078, 0.01440922],
       ...,
       [0.99855282, 0.00144718],
       [0.99422799, 0.00577201],
       [1.        , 0.        ]])
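The conversion from counts to frequencies is a row-wise division by the number of called alleles. The sketch below uses plain Python and made-up counts rather than the notebook's data; `to_frequencies` here is a hypothetical helper that only illustrates the arithmetic, not the scikit-allel implementation.

```python
# Toy allele counts: one row per variant, one column per allele (ref, alt).
allele_counts = [
    [687, 7],
    [450, 244],
    [0, 0],  # no called alleles at this variant
]


def to_frequencies(counts):
    """Divide each row by its total; rows with no calls give NaN."""
    freqs = []
    for row in counts:
        total = sum(row)
        if total == 0:
            freqs.append([float("nan")] * len(row))
        else:
            freqs.append([c / total for c in row])
    return freqs


freqs = to_frequencies(allele_counts)

# For a biallelic SNP the minor allele frequency is the smaller of the two.
minor_allele_freq = [min(row) for row in freqs]
```

Each row of `freqs` sums to 1 whenever at least one allele was called, which is a quick sanity check on the real `ac_pop_freq` array as well.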

Calculate, for each SNP, the fraction of samples with a missing genotype call.

In [11]:
%%time 
freq_missing = gt.count_missing(axis=1).compute() / gt.shape[1]

CPU times: user 56.1 s, sys: 1.93 s, total: 58 s
Wall time: 55.1 s


Keep only variants with a minor allele frequency above 0.1 and a missing-call fraction below 0.1.

In [12]:
pop_freq_filter = (ac_pop_freq[:, :2].min(axis=1) > 0.1) & (freq_missing < 0.1)
variant_dataset_filtered = variant_dataset_filtered.isel(variants=pop_freq_filter)
variant_dataset_filtered

[Output: xarray repr of `variant_dataset_filtered` after filtering, now 13464 variants × 694 samples. Per-variant variables are dask arrays in 41 chunks of up to 641 variants; the per-call genotype array has shape (13464, 694, 2).]

## Load core region data 

Load the [Pv4 regions](https://www.malariagen.net/sites/default/files/Pv4_regions.bed.gz) file into a pandas DataFrame. Each row gives the chromosome, the start and end coordinates, and the type of the region.

In [22]:
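# Note: comment="t" makes pandas ignore everything from the first "t" on a line,
# so region names containing "t" are truncated; "Core" is unaffected.
pv4_regions = pd.read_csv(
    "../supplementary_files/Pv4_regions.bed", sep="\t", comment="t", header=None
)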
header = ["chrom", "chromStart", "chromEnd", "name"]
pv4_regions.columns = header[: len(pv4_regions.columns)]

In [23]:
# Convert regions to be 1-based
pv4_regions[["chromStart", "chromEnd"]] += 1
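If the file follows the standard BED convention (0-based start, exclusive end), only the start needs shifting to obtain 1-based inclusive coordinates. The sketch below illustrates that arithmetic with made-up numbers; `bed_to_one_based` is a hypothetical helper, and whether `Pv4_regions.bed` follows the standard convention is an assumption worth checking before relying on either version.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based, half-open BED interval to 1-based, inclusive."""
    return start + 1, end


bed_start, bed_end = 100, 200
one_based = bed_to_one_based(bed_start, bed_end)

# Both representations describe the same number of bases.
length_bed = bed_end - bed_start
length_one_based = one_based[1] - one_based[0] + 1
```

Under that assumption, adding 1 to both columns would extend each region by one position at its end when the coordinates are later used in an inclusive slice.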

In [24]:
pv4_regions.loc[pv4_regions.name == "Core"]

Unnamed: 0,chrom,chromStart,chromEnd,name
1,PvP01_01_v1,116542,677963,Core
3,PvP01_01_v1,679790,903592,Core
6,PvP01_02_v1,100156,162349,Core
8,PvP01_02_v1,164088,745644,Core
11,PvP01_03_v1,108062,630664,Core
13,PvP01_03_v1,632482,894723,Core
16,PvP01_04_v1,185115,564966,Core
18,PvP01_04_v1,566928,685686,Core
20,PvP01_04_v1,748924,967651,Core
23,PvP01_05_v1,143102,844199,Core


In [20]:
total_variants = 0

for index, row in (pv4_regions.loc[pv4_regions.name == "Core"]).iterrows():

    filter_values = (variant_dataset_filtered["variant_chrom"] == row.chrom).data
    variant_dataset_chrom = variant_dataset_filtered.isel(variants=filter_values)

    test_variants = variant_dataset_chrom.set_index(
        variants="variant_position", samples="sample_id"
    )

    variant_count = test_variants.sel(
        variants=slice(row.chromStart, row.chromEnd)
    ).dims["variants"]
    print(variant_count)
    total_variants += variant_count
print("Total variants to be included in analysis : ", total_variants)

388
169
34
427
587
209
293
82
209
432
322
241
494
859
174
610
300
425
698
598
249
93
843
274
370
149
1052
591
602
1125
565
Total variants to be included in analysis :  13464


In [None]:
# Could add in a plot of the variants across the genome and the core region boundaries? 

# Sliding window through core regions 

Slide a window through each core region of the genome. For each window, calculate:
- the number of biallelic SNPs in the window
- for each unique allele (the combination of calls across all SNPs in the window), how many samples carry that allele
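The windowing itself can be sketched without any genomics library. The version below is a minimal plain-Python illustration with inclusive window coordinates and made-up positions; `sliding_windows` and `count_in_windows` are hypothetical helpers, not the scikit-allel functions used in the notebook.

```python
from bisect import bisect_left, bisect_right


def sliding_windows(start, stop, size, step):
    """Return inclusive (window_start, window_end) pairs covering start..stop."""
    windows = []
    window_start = start
    while window_start <= stop:
        window_end = min(window_start + size - 1, stop)
        windows.append((window_start, window_end))
        if window_end == stop:
            break
        window_start += step
    return windows


def count_in_windows(sorted_positions, windows):
    """Count how many sorted positions fall inside each inclusive window."""
    return [
        bisect_right(sorted_positions, end) - bisect_left(sorted_positions, begin)
        for begin, end in windows
    ]


positions = [105, 130, 260, 410]
windows = sliding_windows(start=100, stop=450, size=200, step=50)
counts = count_in_windows(positions, windows)
```

Because the step (50) is smaller than the window size (200), neighbouring windows overlap and the same SNP is counted in several of them, which is why duplicate windows are worth collapsing afterwards.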

In [25]:
def filter_variants(variant_dataset, field, value):
    filter_values = (variant_dataset[field] == value).data
    variant_dataset_filtered = variant_dataset.isel(variants=filter_values)
    return variant_dataset_filtered


def variant_positions(positions):
    return list(positions)


def unique_allele_counts_in_window(gt):
    # Each column of gt holds one sample's calls across the window.
    # Find the unique columns, the index of a sample carrying each one,
    # and the number of samples that share it.
    unique, index, counts = np.unique(gt, axis=1, return_counts=True, return_index=True)
    # Record which unique alleles contain a missing (-1) or heterozygous call
    alleles_with_missing = []
    alleles_with_het = []
    for i in range(len(index)):
        if -1 in (gt[:, index[i]].compute()):
            alleles_with_missing.append(i)
        if True in gt[:, int(index[i])].is_het().compute():
            alleles_with_het.append(i)

    return counts, alleles_with_missing, alleles_with_het


def calculate_stats(variant_dataset, window_length, step):
    pos = variant_dataset["variants"].data

    # Find windows with variants
    n_variants, windows = allel.windowed_count(
        pos, size=window_length, step=step
    )
    index_with_variants = [i for i, var in enumerate(n_variants) if var != 0]
    window_with_variants = [list(windows[i]) for i in index_with_variants]

    # Keep one window per distinct set of variant positions
    positions, windows, counts = allel.windowed_statistic(
        pos, pos, statistic=variant_positions, windows=window_with_variants
    )
    unique_var, unique_var_index = np.unique(positions, return_index=True)
    unique_windows = [list(windows[i]) for i in unique_var_index]
    
    # Count occurrences of each unique allele
    values = allel.GenotypeDaskArray(variant_dataset["call_genotype"].data)
    allele_counts, windows, counts = allel.windowed_statistic(
        pos,
        values,
        statistic=unique_allele_counts_in_window,
        windows=unique_windows,
        fill=[0, None, None],
    )
    n_variants, windows = allel.windowed_count(
        pos, windows=unique_windows
    )
    return n_variants, allele_counts, windows
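The idea behind `unique_allele_counts_in_window` can be shown on a toy window in plain Python. This is a simplified sketch with made-up calls: the hypothetical `haplotype_counts` returns the flagged alleles themselves, whereas the notebook function returns their indices into the array of unique alleles.

```python
from collections import Counter

# Toy window: rows are variants, columns are samples, each call is (allele, allele).
window_calls = [
    [(0, 0), (0, 0), (1, 1), (0, 1), (-1, -1), (0, 0)],
    [(0, 0), (1, 1), (1, 1), (0, 0), (0, 0), (0, 0)],
]


def haplotype_counts(calls):
    """Count samples per unique combination of calls and flag problem combinations."""
    per_sample = list(zip(*calls))  # one tuple of calls per sample
    counts = Counter(per_sample)
    with_missing = [h for h in counts if any(-1 in call for call in h)]
    with_het = [h for h in counts if any(call[0] != call[1] for call in h)]
    return counts, with_missing, with_het


counts, with_missing, with_het = haplotype_counts(window_calls)
```

Here the first and last samples share the all-reference combination, one combination contains a missing call and one contains a heterozygous call.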

In [26]:
def evaluate_marker_options(
    variant_dataset, chrom, region_df, window_length=200, step=50
):

    # Filter variants to chromosome and set index
    variant_dataset = filter_variants(variant_dataset, "variant_chrom", chrom)
    variant_dataset = variant_dataset.set_index(
        variants="variant_position", samples="sample_id"
    )

    # Find core region boundaries for chromosome
    core_region_df = region_df.loc[
        (region_df.chrom == chrom) & (region_df.name == "Core")
    ]

    biallelic_counts = []
    unique_allele_counts = []
    unique_alleles_with_missing = []
    unique_alleles_with_het = []
    window_start = []
    window_end = []
    variant_counts = []

    # For each region
    for index, row in core_region_df.iterrows():
        print(f"starting sliding window for region: {row.chromStart}-{row.chromEnd}")

        # Restrict variants to region
        variant_dataset_region = variant_dataset.sel(
            variants=slice(row.chromStart, row.chromEnd)
        )

        # Calculate window statistics
        n_variants, allele_counts, windows = calculate_stats(
            variant_dataset_region, window_length, step
        )

        # Concatenate results
        window_start.extend(windows[:, 0])
        window_end.extend(windows[:, 1])
        variant_counts.extend(n_variants)
        unique_allele_counts.extend(allele_counts[:, 0])
        unique_alleles_with_missing.extend(allele_counts[:, 1])
        unique_alleles_with_het.extend(allele_counts[:, 2])

    return (
        variant_counts,
        unique_allele_counts,
        unique_alleles_with_missing,
        unique_alleles_with_het,
        window_start,
        window_end,
    )
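
The `window_length=200, step=50` arithmetic can be illustrated without the Pv4 data. The following is only a minimal sketch: `sliding_windows` and `count_variants` are hypothetical helpers (not part of this notebook or of scikit-allel), the positions are made up, and window ends are assumed to be inclusive, which matches the 200 bp spans in the results (e.g. 101265-101464). How the notebook's own `unique_windows` are chosen is not shown in this section, so the sketch simply enumerates every window.

```python
from bisect import bisect_left, bisect_right


def sliding_windows(start, stop, size=200, step=50):
    """Return inclusive (begin, end) windows of `size` bp, advancing by `step`."""
    windows = []
    left = start
    while left + size - 1 <= stop:
        windows.append((left, left + size - 1))
        left += step
    return windows


def count_variants(sorted_positions, windows):
    """Count how many sorted variant positions fall inside each inclusive window."""
    return [
        bisect_right(sorted_positions, end) - bisect_left(sorted_positions, begin)
        for begin, end in windows
    ]


# Toy region and variant positions (illustrative values only)
toy_positions = [1010, 1180, 1195, 1420]
windows = sliding_windows(1000, 1500)
n_variants = count_variants(toy_positions, windows)

for (begin, end), n in zip(windows, n_variants):
    print(begin, end, n)
```

Because the step is smaller than the window length, neighbouring windows overlap and the same variant is counted in several of them.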

**Evaluate Markers for one Chrom**

In [27]:
%%time 
(
    variant_counts,
    unique_allele_counts,
    unique_alleles_with_missing,
    unique_alleles_with_het,
    window_start,
    window_end,
) = evaluate_marker_options(variant_dataset_filtered, "PvP01_02_v1", pv4_regions)

starting sliding window for region: 100156-162349
starting sliding window for region: 164088-745644
CPU times: user 14min 14s, sys: 29.6 s, total: 14min 44s
Wall time: 15min 30s


In [28]:
PvP01_02_v1_df = pd.DataFrame(
    data={
        "window_start": window_start,
        "window_end": window_end,
        "variant_counts": variant_counts,
        "unique_allele_counts": unique_allele_counts,
        "unique_alleles_with_missing_index": unique_alleles_with_missing,
        "unique_alleles_with_het_index": unique_alleles_with_het,
    }
)
PvP01_02_v1_df

Unnamed: 0,window_start,window_end,variant_counts,unique_allele_counts,unique_alleles_with_missing_index,unique_alleles_with_het_index
0,101265,101464,1,"[617, 2, 75]",[],[1]
1,101365,101564,1,"[514, 1, 179]",[],[1]
2,101415,101614,2,"[293, 221, 1, 3, 176]",[],[2]
3,101465,101664,4,"[287, 1, 3, 2, 1, 74, 2, 107, 37, 1, 3, 46, 90...",[4],"[1, 6, 9]"
4,101565,101764,4,"[255, 35, 1, 3, 2, 1, 42, 78, 2, 28, 170, 68, 9]",[5],"[2, 8]"
...,...,...,...,...,...,...
495,739961,740160,1,"[623, 2, 69]",[],[1]
496,740361,740560,1,"[583, 1, 110]",[],[1]
497,744461,744660,1,"[7, 498, 3, 186]",[0],[2]
498,744961,745160,3,"[7, 1, 1, 85, 4, 3, 593]","[0, 1, 2]",[5]
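
The three list columns appear to describe the distinct multi-variant alleles seen in each window: how many samples carry each one, and which entries contain a missing or a heterozygous call. The statistic function used by the notebook (`unique_allele_counts_in_window`) is defined earlier and is not reproduced here. The sketch below is an assumption about that behaviour, written with a hypothetical helper `count_window_haplotypes` and toy genotype calls in which `-1` marks a missing allele.

```python
from collections import Counter


def count_window_haplotypes(sample_calls):
    """Tally distinct per-sample call patterns within one window.

    `sample_calls` holds one entry per sample; each entry is a tuple of
    (allele, allele) pairs, one pair per variant in the window.
    """
    tally = Counter(tuple(calls) for calls in sample_calls)
    haplotypes = sorted(tally)  # fixed order; patterns starting with -1 sort first
    counts = [tally[h] for h in haplotypes]
    # Indices of patterns that contain at least one missing call
    missing_idx = [
        i for i, h in enumerate(haplotypes) if any(a < 0 or b < 0 for a, b in h)
    ]
    # Indices of patterns that contain at least one heterozygous call
    het_idx = [
        i
        for i, h in enumerate(haplotypes)
        if any(a >= 0 and b >= 0 and a != b for a, b in h)
    ]
    return counts, missing_idx, het_idx


# Toy window with a single variant and nine samples (made-up calls)
toy_calls = (
    [((0, 0),)] * 5 + [((0, 1),)] + [((1, 1),)] * 2 + [((-1, -1),)]
)
counts, missing_idx, het_idx = count_window_haplotypes(toy_calls)
print(counts, missing_idx, het_idx)
```

Under this reading, a row such as `[7, 498, 3, 186]` with missing index `[0]` and het index `[2]` would mean 7 samples with a missing call, 3 with a heterozygous call, and two homozygous alleles carried by 498 and 186 samples.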


# Calculate entropy and heterozygosity

In [44]:
unique_allele_freqs = []
unique_allele_count = []
entropy = []
het = []
df_with_stats = PvP01_02_v1_df.copy()
for index, row in PvP01_02_v1_df.iterrows():
    # Convert the per-window allele counts to frequencies
    gt_counts = np.asarray(row.unique_allele_counts, dtype=float)
    n_alleles = len(gt_counts)
    gt_freqs = gt_counts / gt_counts.sum()

    unique_allele_freqs.append(list(gt_freqs))
    unique_allele_count.append(n_alleles)
    # Shannon entropy (natural log) and expected heterozygosity
    entropy.append(-np.sum(gt_freqs * np.log(gt_freqs)))
    het.append(1.0 - np.sum(gt_freqs ** 2))

df_with_stats["unique_allele_frequencies"] = unique_allele_freqs
df_with_stats["unique_allele_count"] = unique_allele_count
df_with_stats["entropy"] = entropy
df_with_stats["het"] = het
df_with_stats

Unnamed: 0,window_start,window_end,variant_counts,unique_allele_counts,unique_alleles_with_missing_index,unique_alleles_with_het_index,unique_allele_frequencies,unique_allele_count,entropy,het
0,101265,101464,1,"[617, 2, 75]",[],[1],"[0.8890489913544669, 0.002881844380403458, 0.1...",3,0.361864,0.197905
1,101365,101564,1,"[514, 1, 179]",[],[1],"[0.7406340057636888, 0.001440922190201729, 0.2...",3,0.581312,0.384934
2,101415,101614,2,"[293, 221, 1, 3, 176]",[],[2],"[0.42219020172910665, 0.3184438040345821, 0.00...",5,1.109352,0.656014
3,101465,101664,4,"[287, 1, 3, 2, 1, 74, 2, 107, 37, 1, 3, 46, 90...",[4],"[1, 6, 9]","[0.41354466858789624, 0.001440922190201729, 0....",14,1.766696,0.766405
4,101565,101764,4,"[255, 35, 1, 3, 2, 1, 42, 78, 2, 28, 170, 68, 9]",[5],"[2, 8]","[0.36743515850144093, 0.05043227665706052, 0.0...",13,1.768088,0.774714
...,...,...,...,...,...,...,...,...,...,...
495,739961,740160,1,"[623, 2, 69]",[],[1],"[0.8976945244956772, 0.002881844380403458, 0.0...",3,0.343247,0.184251
496,740361,740560,1,"[583, 1, 110]",[],[1],"[0.840057636887608, 0.001440922190201729, 0.15...",3,0.447795,0.269178
497,744461,744660,1,"[7, 498, 3, 186]",[0],[2],"[0.010086455331412104, 0.7175792507204611, 0.0...",4,0.660937,0.413129
498,744961,745160,3,"[7, 1, 1, 85, 4, 3, 593]","[0, 1, 2]",[5],"[0.010086455331412104, 0.001440922190201729, 0...",7,0.510040,0.254728
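
As a sanity check, the two statistics can be recomputed by hand for a single window. The sketch below uses only the standard library and a hypothetical helper, `diversity_stats`, applied to the counts from row 0 (`[617, 2, 75]`). Entropy is the Shannon entropy with natural log, and heterozygosity is one minus the sum of squared allele frequencies, as in the cell above.

```python
import math


def diversity_stats(counts):
    """Return (number of alleles, Shannon entropy, expected heterozygosity)."""
    total = sum(counts)
    # Zero counts are skipped because log(0) is undefined
    freqs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in freqs)
    het = 1.0 - sum(p * p for p in freqs)
    return len(freqs), entropy, het


n_alleles, entropy, het = diversity_stats([617, 2, 75])
print(n_alleles, round(entropy, 4), round(het, 4))
```

The result should agree with row 0 of the table (entropy about 0.3619, het about 0.1979). Both statistics are pulled up by the low-frequency alleles that carry missing or heterozygous calls, which is worth keeping in mind when ranking windows.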


In [30]:
df_with_stats.to_csv('sliding_window_results/PvP01_02_v1_windowed_heterozygosity.csv')

# Perform the sliding window and calculate entropy and heterozygosity for all chromosomes

In [None]:
%%time 
chromosomes = np.unique(variant_dataset_filtered["variant_chrom"].data.compute())
for chrom in chromosomes:
    if chrom == "PvP01_02_v1":
        # Already processed and saved above
        continue
    else:
        # Calculate window stats
        (
            variant_counts,
            unique_allele_counts,
            unique_alleles_with_missing,
            unique_alleles_with_het,
            window_start,
            window_end,
        ) = evaluate_marker_options(variant_dataset_filtered, chrom, pv4_regions)
        # Format data
        df = pd.DataFrame(
            data={
                "window_start": window_start,
                "window_end": window_end,
                "variant_counts": variant_counts,
                "unique_allele_counts": unique_allele_counts,
                "unique_alleles_with_missing_index": unique_alleles_with_missing,
                "unique_alleles_with_het_index": unique_alleles_with_het,
            }
        )
        # Calculate entropy and heterozygosity
        unique_allele_count = []
        unique_allele_freqs = []
        entropy = []
        het = []
        df_with_stats = df.copy()
        for index, row in df.iterrows():
            gt_counts = np.asarray(row.unique_allele_counts, dtype=float)
            n_alleles = len(gt_counts)
            gt_freqs = gt_counts / gt_counts.sum()

            unique_allele_freqs.append(list(gt_freqs))
            unique_allele_count.append(n_alleles)
            entropy.append(-np.sum(gt_freqs * np.log(gt_freqs)))
            het.append(1.0 - np.sum(gt_freqs ** 2))
        df_with_stats["unique_allele_frequencies"] = unique_allele_freqs
        df_with_stats["unique_allele_count"] = unique_allele_count
        df_with_stats["entropy"] = entropy
        df_with_stats["het"] = het
        # Output to csv
        df_with_stats.to_csv(f"sliding_window_results/{chrom}_windowed_heterozygosity.csv")

starting sliding window for region: 116542-677963
starting sliding window for region: 679790-903592
starting sliding window for region: 108062-630664
starting sliding window for region: 632482-894723
starting sliding window for region: 185115-564966
starting sliding window for region: 566928-685686
starting sliding window for region: 748924-967651
starting sliding window for region: 143102-844199
starting sliding window for region: 846073-1408237
starting sliding window for region: 39349-335698
starting sliding window for region: 337570-1017210
starting sliding window for region: 1258300-1463495
starting sliding window for region: 28424-1132375
starting sliding window for region: 1134203-1627673
starting sliding window for region: 255762-977412
starting sliding window for region: 979223-2156012
starting sliding window for region: 140990-1011339
starting sliding window for region: 1013168-1327962
starting sliding window for region: 1375230-1472801
starting sliding window for region: 621
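
Each chromosome takes several minutes, so it can help to make the loop resumable instead of hard-coding which chromosome to skip. The following is only a sketch of one way to do that. `pending_chromosomes` is a hypothetical helper, the chromosome names other than `PvP01_02_v1` are illustrative, and a temporary directory stands in for `sliding_window_results/`.

```python
from pathlib import Path
import tempfile


def pending_chromosomes(chromosomes, out_dir, suffix="_windowed_heterozygosity.csv"):
    """Return the chromosomes that do not yet have a results CSV in `out_dir`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # also avoids to_csv failing on a missing folder
    return [c for c in chromosomes if not (out_dir / f"{c}{suffix}").exists()]


with tempfile.TemporaryDirectory() as tmp:
    # Pretend chromosome 2 was finished in an earlier run
    (Path(tmp) / "PvP01_02_v1_windowed_heterozygosity.csv").write_text("done\n")
    todo = pending_chromosomes(["PvP01_01_v1", "PvP01_02_v1", "PvP01_03_v1"], tmp)
    print(todo)
```

Looping over `pending_chromosomes(chromosomes, "sliding_window_results")` would then pick up where an interrupted run left off.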

In [59]:
df_with_stats

Unnamed: 0,window_start,window_end,variant_counts,unique_allele_counts,unique_alleles_with_missing_index,unique_alleles_with_het_index,unique_allele_frequencies,unique_allele_count,entropy,het
0,185360,185559,2,"[2, 1, 144, 1, 1, 3, 542]","[0, 1]","[3, 4]","[0.002881844380403458, 0.001440922190201729, 0...",7,0.588049,0.346984
1,435360,435559,1,"[269, 3, 422]",[],[1],"[0.38760806916426516, 0.004322766570605188, 0....",3,0.693386,0.479993
2,574031,574230,1,"[2, 594, 1, 97]",[0],[2],"[0.002881844380403458, 0.8559077809798271, 0.0...",4,0.43449,0.247876
3,753448,753647,1,"[533, 2, 159]",[],[1],"[0.7680115273775217, 0.002881844380403458, 0.2...",3,0.557178,0.35766
