# Prediction on real data

After training the models, our ultimate goal is to scan genomic regions to see whether they are under selection and, if so, what type of selection. In this short tutorial, you will learn how to use trained models to scan genomic regions.

First, we load the BaSe modules:

In [1]:
%run -i ../BaSe/Preprocess.py
%run -i ../BaSe/Model.py

Using TensorFlow backend.


Next, we load the VCF file. To do so, we can use `VCF()`:

In [2]:
file_name = "/Users/ulas/Projects/balancing_selection/Data/VCFs/test.vcf"

vcf = VCF(file_name)

### 1. Create summary statistics and predict using a trained ANN model

To use a trained ANN model to predict selection, we first need to calculate summary statistics. `create_stat()` is a convenient function that scans the given region for candidate targets and calculates summary statistics for each target. It accepts the following arguments:
* __N__: Length of the sequence (should be the same as the simulated sequence length)
* __target_freq__: A tuple specifying the frequency range for targets.
* __target_list__: A list of target SNPs. If None, scans the target region for all candidate targets.
* __target_range__: A tuple specifying the target range of positions. If None, scans all the positions.
* __scale__: If True, performs feature scaling.
* __pca__: If True, performs PCA.
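
To make the roles of `target_freq` and `target_range` more concrete, here is a minimal sketch of how candidate targets could be selected from a haplotype matrix. It is not BaSe's implementation: the function `find_candidate_targets`, the 0/1 matrix layout, and the inclusive range boundaries are all assumptions made for illustration.

```python
import numpy as np

def find_candidate_targets(haplotypes, positions, target_freq=(0.4, 0.6),
                           target_range=None):
    """Return column indices of SNPs that qualify as candidate targets.

    Hypothetical helper, not part of BaSe.
    haplotypes: array of shape (n_haplotypes, n_snps) with alleles coded 0/1.
    positions:  array of SNP positions, in the same order as the columns.
    """
    haplotypes = np.asarray(haplotypes)
    positions = np.asarray(positions)

    # Frequency of the allele coded as 1 at each SNP
    freq = haplotypes.mean(axis=0)

    # Keep SNPs whose frequency lies inside the target range (assumed inclusive)
    low, high = target_freq
    keep = (freq >= low) & (freq <= high)

    # Optionally restrict the scan to a window of positions
    if target_range is not None:
        start, end = target_range
        keep &= (positions >= start) & (positions <= end)

    return np.flatnonzero(keep)

# Toy data: 20 haplotypes, 4 SNPs with frequencies 0.5, 0.1, 0.9 and 0.45
haps = np.zeros((20, 4), dtype=int)
haps[:10, 0] = 1
haps[:2, 1] = 1
haps[:18, 2] = 1
haps[:9, 3] = 1
positions = np.array([1000, 2000, 3000, 4000])

all_targets = find_candidate_targets(haps, positions)
windowed_targets = find_candidate_targets(haps, positions,
                                          target_range=(0, 2500))
```

With the toy data, the first and fourth SNPs fall inside the 0.4 to 0.6 range; restricting the positions to 0 through 2500 leaves only the first.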

In [3]:
stat_matrix, snps = vcf.create_stat(N=50000, target_freq=(0.4,0.6), target_list=None, 
                                 target_range=None, scale=True, pca=False)

4 candidate targets have been found.


Here, we can print the IDs of candidate SNPs:

In [4]:
print(snps)

['rs7225123' 'rs4890183' 'rs1609550' 'rs1109995']


In this example, the input data (test.vcf) come from the 1000 Genomes Project and contain variation information for the first 80 kb of chromosome 17. We want to scan the whole region for target SNPs, so `target_list` and `target_range` are both None. Since our simulations were conditioned on a final allele frequency between 0.4 and 0.6, we use the same target frequency range here. We must also apply exactly the same preprocessing steps that were applied to the training data: we performed feature scaling but not PCA on the training data (see the ANN_training notebook), so here we also perform only feature scaling. The function returns a matrix of summary statistic values for the target SNPs and a list of the target SNPs found.
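
The reason preprocessing has to match can be illustrated with a generic standardization sketch. How `create_stat(scale=True)` scales features internally is not shown in this tutorial, so the helpers `fit_scaler` and `apply_scaler` below are hypothetical and only demonstrate the general idea: the model expects inputs on the same scale it saw during training.

```python
import numpy as np

def fit_scaler(train_stats):
    """Compute per-feature mean and standard deviation (hypothetical helper)."""
    mean = train_stats.mean(axis=0)
    std = train_stats.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std

def apply_scaler(stats, mean, std):
    """Standardize features with previously computed parameters."""
    return (stats - mean) / std

# Toy "training" statistics: 2 samples, 2 features
train = np.array([[1.0, 10.0],
                  [3.0, 30.0]])
mean, std = fit_scaler(train)

# Toy "real data" statistics, scaled with the training parameters
real = np.array([[2.0, 40.0]])
real_scaled = apply_scaler(real, mean, std)
```

If scaling were skipped for the real data, or PCA were applied only on one side, the features passed to the network would no longer be comparable to those it was trained on.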

Now, to perform prediction, we can use the `predict()` function. It requires the following arguments:

* __x__: The input data.
* __model__: Full path to the trained model.
* __labels__: Labels of target SNPs.
* __test__: Test number.


In [6]:
ann_model = "/Users/ulas/Projects/balancing_selection/Data/Model/ANN_model_recent_1.h5"

results_ann_recent_1 = predict(stat_matrix, ann_model, labels=snps, test=1)

The output is a pandas DataFrame with three columns: the first column contains the SNP ID, the second column contains the prediction value (probability), and the third column contains the predicted class.

In [7]:
results_ann_recent_1

|   | SNP | Pred | PredClass |
|---|-----|------|-----------|
| 0 | rs7225123 | 0.933917 | Selection |
| 1 | rs4890183 | 0.906652 | Selection |
| 2 | rs1609550 | 0.694114 | Selection |
| 3 | rs1109995 | 0.057465 | Neutral |
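
The relationship between the prediction value and the predicted class can be sketched as follows. The 0.5 cutoff is an assumption: the tutorial does not state which threshold `predict()` uses, although 0.5 is consistent with the output shown above. The DataFrame is rebuilt by hand from that output.

```python
import pandas as pd

# Values copied from the ANN output above
snp_ids = ["rs7225123", "rs4890183", "rs1609550", "rs1109995"]
probs = [0.933917, 0.906652, 0.694114, 0.057465]

# Assumed decision threshold (not confirmed by the tutorial)
threshold = 0.5

results = pd.DataFrame({"SNP": snp_ids, "Pred": probs})
results["PredClass"] = [
    "Selection" if p > threshold else "Neutral" for p in results["Pred"]
]
```

Under this assumption, the first three SNPs are labelled Selection and the last one Neutral, matching the table.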


### 2. Create images and predict using a trained CNN model

Next, we create images so that we can use the trained CNN for prediction. For this, we can use the `create_image()` function. It accepts the following key arguments:

* __N__: Length of the sequence (should be the same as the simulated sequence length)
* __sort__: Sorting algorithm, either:
    * __gen_sim__: sorting based on genetic similarity
    * __freq__: sorting based on frequency
* __method__: Sorting method, either:
    * __t__: together; all individuals are sorted together
    * __s__: separate; the two haplotype groups are sorted separately
* __target_freq__: A tuple specifying the frequency range for targets.
* __target_list__: A list of target SNPs. If None, scans the target region for all candidate targets.
* __target_range__: A tuple specifying the target range of positions. If None, scans all the positions.
* __img_dim__: Image dimension (nrow, ncol)

Again, it is important to perform the same preprocessing steps as used for training data.
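
The `sort="freq"` and `method="s"` options can be pictured with the sketch below. It reflects one plausible reading of these options, not BaSe's actual code: rows (haplotypes) are ordered so that the most common haplotypes come first, and in the separate mode the haplotypes are first split by the allele they carry at the target SNP. The function names, the 0/1 coding, and the order in which the two groups are stacked are assumptions. Resizing to `img_dim` is not covered.

```python
import numpy as np

def sort_rows_by_frequency(hap_matrix):
    """Order rows so that the most frequent haplotypes come first."""
    hap_matrix = np.asarray(hap_matrix)
    if hap_matrix.shape[0] == 0:
        return hap_matrix
    # Identify distinct rows and count how often each one occurs
    _, inverse, counts = np.unique(hap_matrix, axis=0,
                                   return_inverse=True, return_counts=True)
    inverse = np.reshape(inverse, -1)
    row_counts = counts[inverse]
    # Primary key: descending count; secondary key: keeps identical rows adjacent
    order = np.lexsort((inverse, -row_counts))
    return hap_matrix[order]

def sort_separately(hap_matrix, target_col):
    """Sort the two haplotype groups (allele 0 / allele 1 at the target) on their own."""
    hap_matrix = np.asarray(hap_matrix)
    carries_one = hap_matrix[:, target_col] == 1
    group0 = sort_rows_by_frequency(hap_matrix[~carries_one])
    group1 = sort_rows_by_frequency(hap_matrix[carries_one])
    # Stacking group 0 above group 1 is an arbitrary choice for this sketch
    return np.vstack([group0, group1])

# Toy data: 6 haplotypes, 3 SNPs; the target SNP is column 0
haps = np.array([[0, 1, 0],
                 [1, 1, 0],
                 [0, 1, 0],
                 [1, 0, 1],
                 [1, 1, 0],
                 [1, 1, 0]])

together = sort_rows_by_frequency(haps)          # analogous to method="t"
separate = sort_separately(haps, target_col=0)   # analogous to method="s"
```

Whichever ordering is chosen, it has to be the one used when the training images were generated, since the CNN learns patterns that depend on row order.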

In [8]:
im_matrix, snps = vcf.create_image(N=50000, sort="freq", method="s", target_freq=(0.4,0.6), 
                                   target_list=None, target_range=None, img_dim=(128,128))

4 candidate targets have been found.


Now, we can use the same `predict()` function to perform prediction. This time, however, we use the trained CNN model:

In [9]:
cnn_model = "/Users/ulas/Projects/balancing_selection/Data/Model/CNN_model_recent_1.h5"

results_cnn_recent_1 = predict(im_matrix, cnn_model, labels=snps, test=1)

In [10]:
results_cnn_recent_1

|   | SNP | Pred | PredClass |
|---|-----|------|-----------|
| 0 | rs7225123 | 0.997502 | Selection |
| 1 | rs4890183 | 0.683404 | Selection |
| 2 | rs1609550 | 0.860795 | Selection |
| 3 | rs1109995 | 0.6319 | Selection |
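
Since both models return a DataFrame with the same columns, their predictions can be placed side by side. The following sketch rebuilds the two result tables by hand from the outputs shown in this tutorial and joins them on the SNP ID with pandas; in the notebook itself, `results_ann_recent_1` and `results_cnn_recent_1` could be used directly. The column suffixes and the `agree` column are illustrative choices.

```python
import pandas as pd

snp_ids = ["rs7225123", "rs4890183", "rs1609550", "rs1109995"]

# ANN results, copied from the output in section 1
ann = pd.DataFrame({
    "SNP": snp_ids,
    "Pred": [0.933917, 0.906652, 0.694114, 0.057465],
    "PredClass": ["Selection", "Selection", "Selection", "Neutral"],
})

# CNN results, copied from the output in section 2
cnn = pd.DataFrame({
    "SNP": snp_ids,
    "Pred": [0.997502, 0.683404, 0.860795, 0.6319],
    "PredClass": ["Selection", "Selection", "Selection", "Selection"],
})

# Join on SNP ID; suffixes distinguish the two models' columns
comparison = ann.merge(cnn, on="SNP", suffixes=("_ann", "_cnn"))
comparison["agree"] = comparison["PredClass_ann"] == comparison["PredClass_cnn"]

# SNPs for which the two models predict different classes
disagreements = comparison.loc[~comparison["agree"], "SNP"].tolist()
```

For these four SNPs, the two models differ only on rs1109995.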
