# phenotype prediction case study

The humble fruit fly, *Drosophila melanogaster*, is one of the most important and well-studied model organisms in biological and biomedical research.

Early research using the fruit fly helped to establish the basic 'rules' of genetics and inheritance, including basic information about how mutations occur.

The fruit fly has been used extensively to learn about neural biology, including neurodevelopment and the origins of neurological disorders important for human health, such as Alzheimer's and Parkinson's disease.

The close historical association between fruit fly and human populations led to the use of the fruit fly as a model for studying early human migrations, including understanding how humans may have adapted to their local environments as they migrated out of Africa to colonize the globe.

The fruit fly was one of the first animals used to extensively study the links between genetic variation and differences in phenotypes at a whole-genome scale. In 2012, ~200 'reference' fly lines were fully genome-sequenced and made available as a public resource for a wide variety of genome-phenotype association experiments. The results of all experiments are freely available to the public through the Drosophila Genetic Reference Panel (DGRP), now at the [dgrp2 website](http://dgrp.gnets.ncsu.edu/).

In this exercise, we will develop a neural-network model for predicting a fly's 'longevity' (normalized lifespan) from >17,000 genomic mutations (SNPs) dispersed along the fruit fly's two main autosomal chromosomes.

## data download and organization

First, we load the data set into a pandas dataframe and look at its first few rows.

In [None]:
import pandas
import numpy as np

dataframe = pandas.read_csv('https://raw.githubusercontent.com/bryankolaczkowski/ALS3200C/main/phenopred.data.csv')
dataframe.head()

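Next, we randomly assign 80% of the rows to a training set and hold out the remaining 20% for validation.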

In [None]:
train_dataframe = dataframe.sample(frac=0.8, random_state=402201)
valid_dataframe = dataframe.drop(train_dataframe.index)
print(train_dataframe.shape, valid_dataframe.shape)
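
The sample-then-drop pattern used for the split can be checked on a small table. The sketch below is an illustration only: the `toy` dataframe and its column names are made up, not part of the real data set. It shows that the two subsets share no rows and together cover the whole table.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 10 rows, 3 SNP columns, 1 lifespan column.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 3, size=(10, 3)),
                   columns=['SNP_a', 'SNP_b', 'SNP_c'])
toy['LS'] = rng.normal(40.0, 5.0, size=10)

# Same pattern as the split above: sample 80% of the rows, then drop
# those rows from the full table to obtain the remaining 20%.
toy_train = toy.sample(frac=0.8, random_state=402201)
toy_valid = toy.drop(toy_train.index)

# The subsets should not overlap, and together should contain every row.
shared = toy_train.index.intersection(toy_valid.index)
covered = sorted(toy_train.index.union(toy_valid.index))
print(len(toy_train), len(toy_valid), len(shared))
```

Note that `drop` removes rows by index label, so this pattern relies on the dataframe having unique index labels, which is the case for the default index created by `read_csv`.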

In [None]:
# SNP columns are the ones whose names start with 'SNP'
snp_ids = [x for x in dataframe.columns if x.startswith('SNP')]
train_x = train_dataframe[snp_ids].to_numpy(dtype=np.float32)
valid_x = valid_dataframe[snp_ids].to_numpy(dtype=np.float32)
print(train_x.shape, valid_x.shape)

In [None]:
train_y = train_dataframe['LS'].to_numpy(dtype=np.float32)
valid_y = valid_dataframe['LS'].to_numpy(dtype=np.float32)
print(train_y.shape, valid_y.shape)

In [None]:
import tensorflow as tf

train_data = tf.data.Dataset.from_tensor_slices((train_x,train_y)).batch(10)
valid_data = tf.data.Dataset.from_tensor_slices((valid_x,valid_y)).batch(36)


In [None]:
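model = tf.keras.models.Sequential()
# one input per SNP column (17165 for this data set)
model.add(tf.keras.Input(shape=(train_x.shape[1],)))
# heavy dropout: each input is zeroed with probability 0.98 during training
model.add(tf.keras.layers.Dropout(rate=0.98))
# a single linear output unit predicts longevity
model.add(tf.keras.layers.Dense(units=1))
model.compile(optimizer=tf.keras.optimizers.RMSprop(),
              loss=tf.keras.losses.MeanAbsoluteError())
model.summary()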
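
Because the model has a single dense unit with no activation function, it amounts to a linear regression on the SNPs, with one weight per SNP plus a bias term. The dropout layer randomly zeroes inputs during training and scales the surviving inputs by 1/(1 - rate), so the expected value of each input is unchanged; at prediction time dropout is switched off. The NumPy sketch below illustrates that behaviour on made-up data. The function name `inverted_dropout` is hypothetical and this is not Keras's internal code.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Zero each entry with probability `rate`; scale the rest by 1/(1 - rate)."""
    keep = rng.random(x.shape) >= rate
    return np.where(keep, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
# Hypothetical stand-in for a batch of inputs: every value is 1.
x = np.ones((1000, 100), dtype=np.float32)
dropped = inverted_dropout(x, rate=0.98, rng=rng)

kept_fraction = np.mean(dropped != 0)   # close to 1 - rate = 0.02
mean_value = dropped.mean()             # close to 1, the mean of the original input
print(kept_fraction, mean_value)
```

With a rate of 0.98, only about 2% of the SNPs contribute to any one training step, each scaled up 50-fold. This is a very strong form of regularization, which is one plausible reason to use it when there are far more SNPs than fly lines.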

model.fit(train_data, epochs=1000, validation_data=valid_data)

train_y_hat = model.predict(train_x)
valid_y_hat = model.predict(valid_x)

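import matplotlib.pyplot as plt
# diagonal y = x line: perfect predictions would fall on it
plt.plot([10,60],[10,60])
plt.scatter(train_y, train_y_hat, marker='o', label='training')
plt.scatter(valid_y, valid_y_hat, marker='+', label='validation')
plt.xlabel('observed longevity')
plt.ylabel('predicted longevity')
plt.legend()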
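
The scatter plot gives a visual impression of the fit; a numeric summary can complement it. The sketch below computes the mean absolute error (the same quantity used as the training loss) and the Pearson correlation between observed and predicted values. The helper name `summarize_fit` is hypothetical, and the arrays are made up for illustration.

```python
import numpy as np

def summarize_fit(y_true, y_pred):
    """Return (mean absolute error, Pearson correlation) for two arrays."""
    # ravel() flattens column-shaped predictions such as those from model.predict
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    mae = np.mean(np.abs(y_true - y_pred))
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return mae, r

# Made-up observed and predicted lifespans, for illustration only.
y_obs = np.array([30.0, 40.0, 50.0, 60.0])
y_hat = np.array([[32.0], [38.0], [52.0], [58.0]])  # one column, like model.predict output

mae, r = summarize_fit(y_obs, y_hat)
print(mae, r)
```

With the fitted model, calling `summarize_fit(train_y, train_y_hat)` and `summarize_fit(valid_y, valid_y_hat)` would give the corresponding values for the training and validation sets. A large gap between the two would suggest overfitting.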