## Final Assignment


Before working on this assignment, please read the instructions fully. Use Blackboard to submit a link to your repository. Upload a rendered document (HTML or PDF) as well as the original code. Please familiarize yourself with the criteria before beginning the assignment.

You should define a research question yourself based on at least two data sources that can be merged into a tidy dataset. The research question should be related to the life sciences and should be causal in nature, for instance: how do the independent variables X influence the dependent variable Y? The research question should be answered with an interactive visual and, if possible, tested for significance.
If you use code snippets from others, you should refer to the original author; otherwise you will be accused of plagiarism. Please be prepared to explain your code in a verbal exam.



Assessment criteria

Conditional
- No data or API-key information is stored in the repository.
- No hard-coded data paths are used; data paths are provided in a config file.
- At least two data sets are merged into one tidy dataframe.

Graded
- (5 pt) The research question is stated. 
- (5 pt) Links to sources are provided, along with a short description of the data.
- (20 pt) Data quality and data quantity are inspected and reported. Appropriate transformations are applied.
- (20 pt) Assumptions and presuppositions are made explicit (chosen data storage method, chosen analysis method, chosen design). An argumentative approach is used to explain the steps, taking into account data quality and quantity. Explanation is provided either with comments in the code or in a separate document.
- (10 pt) Interactive visualization is derived from a correct analysis of (incomplete) data.
- (10 pt) The design supports the research question. The data is informative in relation to the topic. Visualization is functional and attractive. Figures contain X and Y labels, title and captions.
- (20 pt) Code is efficiently written, follows a coding style, is free of code smells, and is easy to read. Code is demonstrably robust and flexible.
- (10 pt) All the code is stored in a repository with a README that includes the most relevant information to implement the code. Used software is suitably licensed and documented.


### About the data

You can choose one of the following:
- a dataset combination provided on Blackboard
- two datasets on the web from two different sources which can be used to answer a research question
- the data from your project

You are welcome to choose datasets at your discretion, but keep in mind they will be shared with others, so choose appropriate datasets. You are welcome to use datasets of your own as well, but at least two datasets should come from the web and/or APIs.

Also, you are welcome to preserve data in its original language, but for the purposes of grading you should provide English translations in your visualization.

### Instructions:

Define a research question, select data and code your data acquisition, data processing, data analysis and visualization. Use a repository with a commit strategy and write a readme file. Make sure that you document your choices. 

## Question

In [13]:
# What percentage of the father's whole-genome SNP genotypes is identical to those of his children?

## Links to sources

In [14]:
# Family of Five - Genome Dataset | Kaggle

## Dataset description

The dataset contains the complete genome SNP raw data of a family of five: two parents and three siblings (Genome Phenotype SNPs Raw Data). Only the father and two of the children are analysed here. Each genome is represented as a sequence of SNP genotypes written with the following symbols: A (adenine), C (cytosine), G (guanine), T (thymine). It covers chromosomes 1-22, X, Y, and mitochondrial DNA.

## Load the Data

In [17]:
import pandas as pd
import numpy as np
import yaml
from functools import reduce
from bokeh.io import output_notebook
from bokeh.plotting import ColumnDataSource, figure, show
from bokeh.palettes import Bright6, HighContrast3, Spectral
from bokeh.palettes import Greys256, Inferno256, Magma256, Plasma256
from bokeh.palettes import Viridis256, Cividis256, Turbo256
output_notebook()

In [22]:
# load config file
def get_config():
    with open("Final_Assignment.yaml", 'r') as stream:
        config = yaml.safe_load(stream)
    return config
config = get_config()

# read the file paths from the config and load the data frames
Child1Genome = config["Child1Genome"]
Child2Genome = config["Child2Genome"]
Child3Genome = config["Child3Genome"]
FatherGenome = config["FatherGenome"]
MotherGenome = config["MotherGenome"]

df_Child1Genome = pd.read_csv(Child1Genome, low_memory=False)
df_Child2Genome = pd.read_csv(Child2Genome, low_memory=False)
df_Child3Genome = pd.read_csv(Child3Genome, low_memory=False)
df_FatherGenome = pd.read_csv(FatherGenome, low_memory=False)
df_MotherGenome = pd.read_csv(MotherGenome, low_memory=False)
print(df_Child1Genome.head())
print(df_Child2Genome.head())
print(df_FatherGenome.head())

FileNotFoundError: [Errno 2] No such file or directory: 'Final_Assignment.yaml'
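
The `FileNotFoundError` above means the config file was not found in the working directory. A minimal sketch of a more defensive lookup, assuming the same file name and config keys as the cell above; the helper names `find_config` and `require_keys` are hypothetical and not part of the original notebook:

```python
from pathlib import Path

# Keys the notebook reads from the config file (taken from the cell above).
REQUIRED_KEYS = ["Child1Genome", "Child2Genome", "Child3Genome",
                 "FatherGenome", "MotherGenome"]


def find_config(filename="Final_Assignment.yaml", start=None):
    """Hypothetical helper: look for the config file in `start` and its parents."""
    start = Path(start) if start is not None else Path.cwd()
    for folder in [start, *start.parents]:
        candidate = folder / filename
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"{filename} not found in {start} or any parent directory"
    )


def require_keys(config, keys=REQUIRED_KEYS):
    """Hypothetical helper: fail early with a clear message if a path is missing."""
    missing = [key for key in keys if key not in config]
    if missing:
        raise KeyError(f"Missing config entries: {', '.join(missing)}")
    return {key: config[key] for key in keys}
```

With helpers like these, the loading cell could pass `find_config()` to `open` before calling `yaml.safe_load`, and run the result through `require_keys` before reading any CSV file.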

## Inspecting the Data

In [None]:
# checking for noise and outliers:
    # categorical data such as SNP genotypes has no numerical outliers, so no outlier check is applied


# checking for missing data ('--' entries are treated as missing):
df_Child1Genome2= df_Child1Genome.replace('--', np.nan)
df_Child1Genome2.isna().any()

df_Child2Genome2=df_Child2Genome.replace('--', np.nan)
df_Child2Genome2.isna().any()

df_FatherGenome2=df_FatherGenome.replace('--', np.nan)
df_FatherGenome2.isna().any()

# proportion of missing data (in percent):

len(df_Child1Genome2[df_Child1Genome2['genotype'].isna()])/len(df_Child1Genome2['genotype'])* 100
len(df_Child2Genome2[df_Child2Genome2['genotype'].isna()])/len(df_Child2Genome2['genotype'])* 100
len(df_FatherGenome2[df_FatherGenome2['genotype'].isna()])/len(df_FatherGenome2['genotype'])* 100

# handling missing data:
# because the proportion of missing data is less than 3 percent, these rows are dropped
clean_Child1Genome2= df_Child1Genome2.dropna()
clean_Child2Genome2 = df_Child2Genome2.dropna()
clean_FatherGenome2 = df_FatherGenome2.dropna()
clean_FatherGenome2.info()



<class 'pandas.core.frame.DataFrame'>
Int64Index: 595186 entries, 0 to 601801
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   # rsid      595186 non-null  object
 1   chromosome  595186 non-null  object
 2   position    595186 non-null  int64 
 3   genotype    595186 non-null  object
dtypes: int64(1), object(3)
memory usage: 22.7+ MB
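
The three missing-data percentages above are computed line by line and are not printed. A small sketch of how the same check could be reported for all family members at once, assuming the `genotype` column and the `'--'` placeholder used above; `missing_report` is a hypothetical helper and the two toy frames are made up for illustration:

```python
import pandas as pd


def missing_report(frames, column="genotype", no_call="--"):
    """Return the percentage of missing genotypes per data frame.

    `frames` maps a label to a DataFrame; `no_call` is the placeholder
    that the notebook treats as missing ('--').
    """
    report = {}
    for label, frame in frames.items():
        is_missing = frame[column].isna() | (frame[column] == no_call)
        report[label] = round(float(is_missing.mean()) * 100, 2)
    return pd.Series(report, name="missing_percent")


# Tiny made-up frames, only to show the call; these are not the real genomes.
toy = {
    "Father": pd.DataFrame({"genotype": ["AA", "--", "AG", "GG"]}),
    "Child1": pd.DataFrame({"genotype": ["AA", "AG", "AG", "GG"]}),
}
print(missing_report(toy))
```

Reporting the percentages in one table makes the "less than 3 percent" argument for dropping rows directly checkable.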


## Wrangling the Data

In [None]:
# rearrange the order of the columns:
clean1_Child1Genome2= clean_Child1Genome2[['chromosome', 'position', '# rsid', 'genotype']]
clean1_Child1Genome2.head()

clean1_Child2Genome2= clean_Child2Genome2[['chromosome', 'position', '# rsid', 'genotype']]
clean1_Child2Genome2.head()

clean1_FatherGenome2=clean_FatherGenome2[['chromosome', 'position', '# rsid', 'genotype']]
clean1_FatherGenome2.head()

# rename columns so that each family member keeps its own rsid and genotype columns:
clean2_Child1Genome2=clean1_Child1Genome2.rename(columns={"# rsid":"rsid_Ch1","genotype":"genotype_Ch1"})
clean2_Child2Genome2=clean1_Child2Genome2.rename(columns={"# rsid":"rsid_Ch2","genotype":"genotype_Ch2"})
clean2_FatherGenome2=clean1_FatherGenome2.rename(columns={"# rsid":"rsid_Fa","genotype":"genotype_Fa"})

print(clean2_Child1Genome2.head())
print(clean2_Child2Genome2.head())
print(clean2_FatherGenome2.head())

#merging Father's genome with two children on "chromosome","position" columns
FamilyGenome = pd.merge(pd.merge(clean2_FatherGenome2,clean2_Child1Genome2,on=["chromosome","position"]),clean2_Child2Genome2,on=["chromosome","position"]) 
FamilyGenome.head() 


  chromosome  position     rsid_Ch1 genotype_Ch1
0          1    734462   rs12564807           AA
1          1    752721    rs3131972           AG
2          1    760998  rs148828841           AC
3          1    776546   rs12124819           AG
4          1    787173  rs115093905           GG
  chromosome  position     rsid_Ch2 genotype_Ch2
0          1     69869  rs548049170           TT
1          1    565508    rs9283150           AA
2          1    727841  rs116587930           GG
3          1    752721    rs3131972           GG
4          1    754105   rs12184325           CC
  chromosome  position      rsid_Fa genotype_Fa
0          1    734462   rs12564807          AA
1          1    752721    rs3131972          AG
2          1    760998  rs148828841          AC
3          1    776546   rs12124819          AA
4          1    787173  rs115093905          GG


Unnamed: 0,chromosome,position,rsid_Fa,genotype_Fa,rsid_Ch1,genotype_Ch1,rsid_Ch2,genotype_Ch2
0,1,752721,rs3131972,AG,rs3131972,AG,rs3131972,GG
1,1,776546,rs12124819,AA,rs12124819,AG,rs12124819,AA
2,1,824398,rs7538305,AA,rs7538305,AA,rs7538305,AA
3,1,846808,rs4475691,TT,rs4475691,CT,rs4475691,TT
4,1,854250,rs7537756,GG,rs7537756,AG,rs7537756,GG
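
The nested `pd.merge` call above can also be written with `functools.reduce`, which is already imported in this notebook. A minimal sketch, assuming every frame shares the `chromosome` and `position` columns with matching dtypes; `merge_genomes` is a hypothetical helper and the rows below are toy data for illustration only:

```python
from functools import reduce

import pandas as pd


def merge_genomes(frames, keys=("chromosome", "position")):
    """Inner-join a list of genotype frames on the shared key columns."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=list(keys), how="inner"),
        frames,
    )


# Toy rows for illustration only.
father = pd.DataFrame({"chromosome": ["1", "1"], "position": [752721, 776546],
                       "genotype_Fa": ["AG", "AA"]})
child1 = pd.DataFrame({"chromosome": ["1", "1"], "position": [752721, 776546],
                       "genotype_Ch1": ["AG", "AG"]})
child2 = pd.DataFrame({"chromosome": ["1"], "position": [752721],
                       "genotype_Ch2": ["GG"]})

family = merge_genomes([father, child1, child2])
print(family)
```

Because the join is an inner join, only positions present in all frames survive, which is why the merged table can be shorter than each input. Adding a third child or the mother would only require one more entry in the list.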


## Result

In [None]:
# similarity analysis: flag each SNP with 1 if the child's genotype is identical to the father's, otherwise 0
def similarity1(genotype):
    if genotype["genotype_Fa"] == genotype["genotype_Ch1"]:
        similarity_Ch1 =1
    else:
        similarity_Ch1= 0
    return similarity_Ch1 
       
def similarity2(genotype):
    if genotype["genotype_Fa"] == genotype["genotype_Ch2"]:
        similarity_Ch2 =1
    else:
        similarity_Ch2= 0
    return similarity_Ch2 

FamilyGenome["similarity_Ch1"] = FamilyGenome.apply(similarity1,axis=1)       
FamilyGenome["similarity_Ch2"] = FamilyGenome.apply(similarity2,axis=1)
FamilyGenome.head()

# sum the 0/1 flags per chromosome: the number of SNPs whose genotype is identical to the father's
FamilyGenome=FamilyGenome.groupby("chromosome",as_index=False).aggregate({'similarity_Ch1':'sum','similarity_Ch2':'sum'})
print(FamilyGenome)

   chromosome  similarity_Ch1  similarity_Ch2
0           1            6038            5999
1          10            3634            3669
2          11            3757            3810
3          12            3778            3733
4          13            2731            2728
5          14            2528            2566
6          15            2311            2300
7          16            2536            2570
8          17            2422            2464
9          18            2205            2109
10         19            2226            2286
11          2            5820            5852
12         20            1861            1845
13         21            1134            1097
14         22            1151            1165
15          3            4921            4925
16          4            4328            4355
17          5            4150            4148
18          6            6120            5940
19          7            4168            4127
20          8            3782     
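
The table above holds counts of identical genotypes, while the research question asks for a percentage. A sketch of how the percentage per chromosome could be computed in a vectorised way, assuming the merged column names used in this notebook; `percent_identical` is a hypothetical helper and the toy frame is made up for illustration:

```python
import pandas as pd


def percent_identical(family, parent="genotype_Fa",
                      children=("genotype_Ch1", "genotype_Ch2")):
    """Percentage of SNPs per chromosome where a child's genotype equals the parent's."""
    matches = pd.DataFrame({
        child: (family[child] == family[parent]).astype(int)
        for child in children
    })
    matches["chromosome"] = family["chromosome"]
    # The mean of a 0/1 column is the fraction of matching SNPs.
    return matches.groupby("chromosome").mean() * 100


# Toy rows for illustration only.
toy = pd.DataFrame({
    "chromosome":   ["1", "1", "1", "1", "2", "2"],
    "genotype_Fa":  ["AG", "AA", "AA", "TT", "GG", "CC"],
    "genotype_Ch1": ["AG", "AG", "AA", "CT", "GG", "CT"],
    "genotype_Ch2": ["GG", "AA", "AA", "TT", "GG", "CC"],
})
print(percent_identical(toy))
```

This uses the same definition of similarity as the functions above (identical genotype strings), but divides by the number of SNPs per chromosome so that large and small chromosomes become comparable.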

In [None]:



#chromosome = ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','Y','X','MT']
#similarity = ["similarity_Ch1","similarity_Ch2"]
#p = figure(x_range=chromosome, height=250, title="genotype similarity between father and children",
           #toolbar_location=None, tools="hover")

#p.vbar_stack(similarity, x='chromosome', width=0.9, color=HighContrast3, source=FamilyGenome,
             #legend_label=similarity)

#p.y_range.start = 0
#p.x_range.range_padding = 0.1
#p.xgrid.grid_line_color = None
#p.axis.minor_tick_line_color = None
#p.outline_line_color = None
#p.legend.location = "top_left"
#p.legend.orientation = "horizontal"
#'similarity_Ch1'    :[ 6038,5820,4921,4328,4150,6120,4168,3782,3252,3634,3757,3778,2731,2528,2311]

In [None]:


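# collect the aggregated counts per chromosome as plain lists for Bokeh
chromosome = FamilyGenome['chromosome'].tolist()
counts_ch1 = FamilyGenome['similarity_Ch1'].tolist()  # renamed so the similarity1() function is not overwritten
counts_ch2 = FamilyGenome['similarity_Ch2'].tolist()

similarity = ['similarity_Ch1','similarity_Ch2']
colors = ["#c9d9d3", "#718dbf"]

#data =FamilyGenome=FamilyGenome.groupby("chromosome").aggregate({'similarity_Ch1':'sum','similarity_Ch2':'sum'})
data = {'chromosome': chromosome,
        'similarity_Ch1': counts_ch1,
        'similarity_Ch2': counts_ch2}

# stacked bars: number of SNPs per chromosome where each child's genotype is identical to the father's
g = figure(x_range=chromosome, height=500,
           title="Genotype similarity between father and children")

g.vbar_stack(similarity,x='chromosome',source=data,color=colors,width=0.9,legend_label=similarity)
g.xaxis.axis_label = "Chromosome"
g.yaxis.axis_label = "Number of SNPs with identical genotype"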
g.y_range.start = 0
g.x_range.range_padding = 0.1
g.xgrid.grid_line_color = None
g.axis.minor_tick_line_color = None
g.outline_line_color = None
g.legend.location = "top_left"
g.legend.orientation = "horizontal"
show(g)
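
Because the chromosome labels are strings, the grouped table and therefore the x-axis of the plot above are ordered as 1, 10, 11, ... instead of 1, 2, 3, .... A minimal sketch of a sort key that restores the natural order, assuming the labels `1`-`22`, `X`, `Y` and `MT` from the dataset description; `chromosome_sort_key` is a hypothetical helper:

```python
def chromosome_sort_key(name):
    """Sort autosomes numerically, followed by X, Y and MT."""
    special = {"X": 23, "Y": 24, "MT": 25}
    name = str(name)
    if name.isdigit():
        return int(name)
    return special.get(name, 99)  # unknown labels go last


labels = ["1", "10", "11", "2", "20", "3", "MT", "X", "Y"]
ordered = sorted(labels, key=chromosome_sort_key)
print(ordered)
```

The same key could be applied to the aggregated frame with `FamilyGenome.sort_values("chromosome", key=lambda col: col.map(chromosome_sort_key))` before the lists for `x_range` are built.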