<a id="1.1"></a>
<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Introduction</h3>


![](https://www.nutracera.com/wp-content/uploads/2018/04/Why-is-Methylation-Important-1200x500.png)

**WHAT IS METHYLATION?**

DNA methylation is a biological process by which methyl groups are added to the DNA molecule. Methylation can change the activity of a DNA segment without changing the sequence. When located in a gene promoter, DNA methylation typically acts to repress gene transcription. In mammals, DNA methylation is essential for normal development and is associated with a number of key processes including genomic imprinting, X-chromosome inactivation, repression of transposable elements, aging, and carcinogenesis.

**WHY IS METHYLATION IMPORTANT?**

The body is a very complex machine, with various gears and switches that all need to function properly for it to operate optimally. Think of methylation, and the opposite action, demethylation, as the mechanism that allows the gears to turn and that turns biological switches on and off for a host of systems in the body.

**HOW DOES METHYLATION HAPPEN?**

CH3 is provided to the body through a universal methyl donor known as SAMe (S-adenosylmethionine). SAMe readily gives away its methyl group to other substances in the body, which enables the cardiovascular, neurological, reproductive, and detoxification systems to perform their functions.

Unfortunately, the system that produces SAMe is reliant on one switch being turned on by a critical B vitamin, 5-MTHF (also known as active folate or methylfolate).

Sources: Wikipedia.com, thorne.com



![](https://www.researchgate.net/profile/Marian-Hajduch/publication/323190556/figure/fig1/AS:610867326513153@1522653528846/Interplay-between-DNA-methylation-gene-transcription-and-chromatin-structure-The.png)

<a id="1.1"></a>
<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Libraries And Utilities</h3>


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import pymc3 as pm
import theano.tensor as tt
import string
import nltk
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
sns.set_style('darkgrid')
import plotly.express as ex
import plotly.graph_objs as go
import plotly.offline as pyo
from plotly.subplots import make_subplots
pyo.init_notebook_mode()
from sklearn.decomposition import TruncatedSVD,PCA
from sklearn.cluster import KMeans
import matplotlib.gridspec as gridspec
import random
from tqdm.notebook import tqdm
import gc
from scipy.stats.mstats import mquantiles
%pip install joypy
from joypy import joyplot
plt.rc('figure',figsize=(17,10))
sns.set_context('paper',font_scale=2)


<a id="1.1"></a>
<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Data Loading</h3>

In [None]:
s_data = pd.read_csv('/kaggle/input/cpg-values-of-smoking-and-non-smoking-patients/Smoker_Epigenetic_df.csv')
s_data.Gender = s_data.Gender.str.lower()
s_data.drop(columns=['GSM'],inplace=True)
s_data.head(3)


<a id="1.1"></a>
<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Exploratory Data Analysis</h3>


In [None]:
plt.title('Number of Missing Values at Each Feature')
sns.heatmap(s_data.isna().sum().to_frame(),cmap='coolwarm',linewidth=2,annot=True)
s_data.dropna(inplace=True)
plt.show()


Apparently, there are 62 samples in our dataset that are missing all of their genetic features. As we are interested in exploring genetic attributes, we drop these samples at this stage. It may be interesting to treat them as a test set for a predictive model and use that model to impute the missing values.
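The idea of setting the incomplete samples aside, rather than discarding them, can be sketched as follows. This is a minimal illustration on a tiny made-up frame; the column names `cg_a` and `cg_b` are hypothetical stand-ins for the probe columns and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): two hypothetical probe columns, two rows missing all probes.
toy = pd.DataFrame({
    "Age": [34, 51, 60, 47, 58],
    "cg_a": [0.81, np.nan, 0.42, np.nan, 0.77],
    "cg_b": [0.12, np.nan, 0.55, np.nan, 0.20],
})
probe_cols = ["cg_a", "cg_b"]

# Rows where every probe is missing become the hold-out set.
missing_all = toy[probe_cols].isna().all(axis=1)
holdout = toy[missing_all]
complete = toy[~missing_all]

print(len(complete), len(holdout))
```

Keeping `holdout` around costs nothing and leaves the option of imputing its probe values later from `Age` and the other non-genetic columns.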

In [None]:
fig = plt.figure()

ax1 = plt.subplot(221)
ax1.set_title('Distribution of Genders')
sns.countplot(x=s_data['Gender'],ax=ax1,palette=['tab:pink','tab:blue'])
ax2 = plt.subplot(222)
ax2.set_title('Distribution of Smoking Labels')
sns.countplot(x=s_data['Smoking Status'],ax=ax2)
ax3 = plt.subplot(212)
ax3.set_title('Distribution of Sample Ages')
sns.histplot(data=s_data['Age'],ax=ax3,kde=True)


plt.tight_layout()
plt.show()

**Observation**: depending on our questions of interest, we will have to keep in mind that our dataset is imbalanced both gender-wise and smoking-status-wise, and that the age distribution appears to be negatively skewed. With ages centered around 55, this skewness reduces the confidence of our inference and models for younger patients.
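Both remarks can be backed by numbers instead of being read off the plots. The sketch below uses synthetic ages and made-up labels (assumptions, not the real columns) to show how the sample skewness and the class proportions could be computed with pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic ages with a long tail toward younger values (not the real data).
ages = pd.Series(80 - rng.gamma(shape=2.0, scale=8.0, size=2000))

# A negative sample skewness means the long tail is on the left.
skewness = ages.skew()
print(round(skewness, 2), round(ages.mean(), 1), round(ages.median(), 1))

# Class balance expressed as proportions, on made-up labels.
labels = pd.Series(["never"] * 70 + ["current"] * 30)
balance = labels.value_counts(normalize=True)
print(balance)
```

On the notebook's data, the same two calls would be applied to `s_data['Age']` and to the `Gender` and `Smoking Status` columns.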

In [None]:
joyplot(
    data=s_data[list(s_data.columns[4:])], 
    figsize=(15, 12),
    alpha=0.85,
    title='Difference in Probe Methylation Distribution Across Given Sites'

)
plt.show()

**Observation**: looking at the distribution of each individual probe, we can see that many of our probes follow a bimodal distribution, which may point to 2 distinct underlying groups in our data.
We know of 2 such groupings, smokers/non-smokers and females/males, but the bimodality is not necessarily tied to them. We may uncover an underlying group based on the ages of our patients, for example.
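One hedged way to check bimodality beyond visual inspection is to compare one-component and two-component Gaussian mixtures by BIC. The sketch below runs on a synthetic probe with two well-separated modes; it is an illustration of the check, not an analysis of the dataset's probes.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Synthetic probe values with two well-separated modes (not real data).
probe = np.concatenate([
    rng.normal(0.25, 0.04, 300),
    rng.normal(0.75, 0.04, 300),
]).reshape(-1, 1)

# Fit mixtures with 1 and 2 components and record the BIC of each.
bic = {}
for k in (1, 2):
    gm = GaussianMixture(n_components=k, random_state=0).fit(probe)
    bic[k] = gm.bic(probe)

# A lower BIC for k=2 supports two underlying groups for this probe.
print(bic)
```

A lower BIC for two components only says that two groups describe the probe better than one; it does not say which known variable, if any, defines those groups.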

In [None]:
joyplot(
    data=s_data[list(s_data.columns[4:])+['Smoking Status']], 
    figsize=(13, 8),
    by='Smoking Status',
    alpha=0.85
    ,title='Difference in Probe Methylation Between Smoking Status'

)
plt.show()

**Observation**: when looking at the distribution of methylation levels in our probes, we see no clearly visible difference between smokers and non-smokers.
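A visual comparison can miss small shifts, so a per-probe two-sample test is a natural complement. The sketch below applies the Mann-Whitney U test from SciPy to synthetic groups (assumed values, not the dataset), once with identical distributions and once with a shifted one.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)

# Synthetic methylation levels for one probe (not real data).
smokers = rng.normal(0.50, 0.05, 200)
non_smokers_same = rng.normal(0.50, 0.05, 200)    # same distribution
non_smokers_shift = rng.normal(0.60, 0.05, 200)   # shifted distribution

p_same = mannwhitneyu(smokers, non_smokers_same, alternative="two-sided").pvalue
p_shift = mannwhitneyu(smokers, non_smokers_shift, alternative="two-sided").pvalue

print(p_same, p_shift)
```

When the test is repeated over all probes, the p-values should be corrected for multiple comparisons before any probe is called different.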

In [None]:
joyplot(
    data=s_data[list(s_data.columns[4:])+['Gender']], 
    figsize=(13, 8),
    by='Gender',
    alpha=0.85,title='Difference in Probe Methylation Between Genders'
)
plt.show()

**Observation**: when looking at the difference in distribution based on gender, there is a striking difference between the two genders. The genetic explanation behind this difference is unclear to me, as our dataset does not include the genes associated with our probes.

In [None]:
#Encoding Categorical Features
s_data.Gender = s_data.Gender.astype('category').cat.codes
s_data['Smoking Status'] = s_data['Smoking Status'].astype('category').cat.codes

In [None]:
cx = sns.clustermap(np.round(s_data.corr(),2),linewidth=0.8,cmap='vlag',figsize=(15,15),annot=True,annot_kws=dict(fontsize=11))
cx.ax_row_dendrogram.set_visible(False)
cx.ax_col_dendrogram.set_visible(False)
cx.fig.suptitle('Pearson Correlation Between Features') 
cx.fig.tight_layout()
plt.show()

**Observation**: looking at the clustered Pearson correlations, we see that many features are correlated with each other, especially our probes.
This leads us to believe that we have multicollinearity in our data and that we most likely cannot assume independence between all the probes we are working with. Next, we will reduce the dimensionality of our data to test this hypothesis.
If the reduced representation retains a high explained variance ratio (EVR), we will continue our analysis in the lower-dimensional space.
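The EVR criterion can be made concrete by asking how many components are needed to reach a chosen share of the variance. The sketch below uses synthetic data with 20 columns driven by 2 latent factors (an assumption chosen to mimic strongly correlated probes); the 80% threshold is also just an illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 20 synthetic "probes" driven by 2 latent factors plus small noise.
latent = rng.normal(size=(500, 2))
loadings = rng.normal(size=(2, 20))
X = latent @ loadings + 0.1 * rng.normal(size=(500, 20))

# Fit PCA with all components and accumulate the explained variance ratio.
pca_full = PCA().fit(X)
cum_evr = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative EVR reaches 80%.
n_needed = int(np.searchsorted(cum_evr, 0.80) + 1)
print(n_needed, cum_evr[:3].round(3))
```

When the columns are close to independent, the cumulative curve rises slowly instead, and many more components are needed to reach the same threshold.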

In [None]:
pca = PCA(2)
transformed = pca.fit_transform(s_data.iloc[:,4:])
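# Reuse s_data's index so the columns assigned below align row by row after dropna
t_df = pd.DataFrame(transformed,columns=['pc1','pc2'],index=s_data.index)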
t_df['Gender'] = s_data.Gender
t_df['Age'] = s_data.Age
t_df['Smoking Status'] = s_data['Smoking Status']

sns.barplot(x=['PC_1','PC_2'],y=pca.explained_variance_ratio_)
sns.pointplot(x=['PC_1','PC_2'],y=np.cumsum(pca.explained_variance_ratio_),lw=5,legend=True,label='Cumulative',color='tab:red')
plt.ylabel('Explained Variance')
plt.title(r'Explained Variance Ratio After Projecting $R^{20} \longrightarrow R^{2}$')
plt.show()

**Observation**: the amount of variance we preserve even after projecting our data to a 2-dimensional space supports our prior hypothesis. Preserving more than 80% of the variance with just 2 principal components indicates that the probes are strongly correlated.

In [None]:
# One panel per attribute; each title names the attribute used for coloring
ax1 = plt.subplot(221)
ax1.set_title('$R^{2}$ Projection of Each Sample, Colored by Gender',fontsize=15)
sns.scatterplot(x=t_df['pc1'],y=t_df['pc2'],hue=t_df['Gender'],ax=ax1)
ax2 = plt.subplot(223)
ax2.set_title('$R^{2}$ Projection of Each Sample, Colored by Smoking Status',fontsize=15)
sns.scatterplot(x=t_df['pc1'],y=t_df['pc2'],hue=t_df['Smoking Status'],ax=ax2)
ax3 = plt.subplot(122)
ax3.set_title('$R^{2}$ Projection of Each Sample, Colored and Sized by Age',fontsize=15)
sns.scatterplot(x=t_df['pc1'],y=t_df['pc2'],size=t_df['Age'],hue=t_df['Age'],ax=ax3)

plt.tight_layout()

plt.show()

**Observation**: now that each sample's 20 probes are summarized by 2 principal component scores, we revisit the analysis we performed with ridge plots in an earlier section, and here the massive difference between the genders is clear as daylight!

<a id="1.1"></a>
<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Probabilistic Inference</h3>


In [None]:
plt.figure(figsize=(15,8))
plt.title('Gender vs PC 1 Value')
sns.scatterplot(x=t_df['pc1'],y=t_df['Gender'])
plt.show()

It appears that *the probability* of being male increases as the first principal component increases. We are interested in modeling this probability. The best we can do is ask, "At PC1 value $X$, what is the probability of a sample being male?". The goal of the following experiment is to answer that question.

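We need a function of PC1, call it $p(X)$, that is bounded between 0 and 1 and moves monotonically from one bound to the other as PC1 increases. A well-known function with these properties is the *logistic function*. With the sign convention used here, $p$ increases with $X$ when $\beta < 0$ and decreases when $\beta > 0$.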

$$p(X) = \frac{1}{ 1 + e^{ \;\beta X } } $$

In this model, $\beta$ is the variable we are uncertain about. Below, the function is plotted for $\beta = -2, 52, 7$.

In [None]:
plt.title('Different Values of Beta Example')
x = np.linspace(-4, 4, 100)
plt.plot(x, 1.0 / (1.0 + np.exp(-2 * x)), label=r"$\beta = -2$",lw=3)
plt.plot(x, 1.0 / (1.0 + np.exp(52 * x)), label=r"$\beta = 52$",lw=3)
plt.plot(x, 1.0 / (1.0 + np.exp(7 * x) ), label=r"$\beta = 7$",lw=3)
plt.legend();

We can *shift* our logistic function along the x-axis by adding a constant $\alpha$ to the exponent, i.e.

$$p(X) = \frac{1}{ 1 + e^{ \;\beta X + \alpha } } $$
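Because the exponent here is $+(\beta X + \alpha)$ rather than the more common $-(\beta X + \alpha)$, the sign of $\beta$ reads the opposite way from a standard logistic regression. The short check below, with a helper named `logistic_curve` introduced only for this illustration, confirms the direction of the curve and shows that it crosses 0.5 at $X = -\alpha/\beta$.

```python
import numpy as np

def logistic_curve(x, beta, alpha=0.0):
    """p(x) = 1 / (1 + exp(beta * x + alpha)), the form used in the text."""
    return 1.0 / (1.0 + np.exp(beta * x + alpha))

grid = np.linspace(-4, 4, 9)

# With this sign convention a negative beta gives an increasing curve,
# and a positive beta gives a decreasing one.
increasing = bool(np.all(np.diff(logistic_curve(grid, beta=-2.0)) > 0))
decreasing = bool(np.all(np.diff(logistic_curve(grid, beta=7.0)) < 0))

# The curve crosses 0.5 where beta * x + alpha = 0, i.e. x = -alpha / beta.
beta_demo, alpha_demo = -2.0, 3.0
midpoint = -alpha_demo / beta_demo
p_mid = logistic_curve(midpoint, beta_demo, alpha_demo)

print(increasing, decreasing, midpoint, p_mid)
```

So $\alpha$ moves the point where the probability equals one half, while the magnitude of $\beta$ controls how sharp the transition is.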


In [None]:
plt.title('Different Values of Beta and Alpha Example')
x = np.linspace(-4, 4, 100)
plt.plot(x, 1.0 / (1.0 + np.exp(-2 * x)), label=r"$\beta = -2$",ls="--", lw=3)
plt.plot(x, 1.0 / (1.0 + np.exp(52 * x)), label=r"$\beta = 52$",ls="--", lw=3)
plt.plot(x, 1.0 / (1.0 + np.exp(7 * x) ), label=r"$\beta = 7$", ls="--", lw=3)

plt.plot(x, 1.0 / (1.0 + np.exp(-2 * x+3)), label=r"$\beta = -2 \alpha = 3$",
         color="#348ABD")
plt.plot(x, 1.0 / (1.0 + np.exp(52 * x-1)), label=r"$\beta = 52 \alpha = -1$",
         color="#A60628")
plt.plot(x, 1.0 / (1.0 + np.exp(7 * x+2) ), label=r"$\beta = 7 \alpha = 2$",
         color="#7A68A6")

plt.legend(loc="lower left");

$$ \text{Sample is Male, $M_i$} \sim \text{Ber}( \;p(PC1_i)\; ), \;\; i=1..N$$

where $p(PC1)$ is our logistic function and $PC1_i$ are the PC1 values.
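To see what this Bernoulli model implies for fitting, the log-likelihood it defines can be written out directly. The sketch below evaluates it on simulated data (an assumption, not the notebook's `t_df`) and compares a coefficient with the correct sign against one with the wrong sign.

```python
import numpy as np

def bernoulli_loglik(y, x, beta, alpha):
    """Sum of log Ber(y_i | p(x_i)) with p(x) = 1 / (1 + exp(beta * x + alpha))."""
    p = 1.0 / (1.0 + np.exp(beta * x + alpha))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(3)

# Simulated predictor and labels, generated with beta = -1.5 and alpha = 0.
x_sim = rng.normal(size=400)
p_true = 1.0 / (1.0 + np.exp(-1.5 * x_sim))
y_sim = rng.binomial(1, p_true)

ll_right = bernoulli_loglik(y_sim, x_sim, beta=-1.5, alpha=0.0)
ll_wrong = bernoulli_loglik(y_sim, x_sim, beta=1.5, alpha=0.0)

print(ll_right > ll_wrong)
```

The posterior sampled later combines this likelihood with the priors on `beta` and `alpha`.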

In [None]:
with pm.Model() as model:
    beta = pm.Normal("beta", mu=0, tau=0.001, testval=0)
    alpha = pm.Normal("alpha", mu=0, tau=1/t_df.pc1.std(), testval=0)
    # Pass a plain NumPy array so the expression does not depend on pandas index alignment
    p = pm.Deterministic("p_parm", 1.0/(1. + tt.exp(beta*t_df.pc1.values + alpha)))

Notice that in the above code we set the initial values (`testval`) of `beta` and `alpha` to 0. If `beta` and `alpha` start out very large in magnitude, `p` becomes numerically equal to 1 or 0. `pm.Bernoulli` does not handle probabilities of exactly 0 or 1 well, even though they are mathematically well-defined, because the log-likelihood can become infinite. Starting both coefficients at `0` gives `p = 0.5`, a reasonable starting value.
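The saturation problem is easy to reproduce in plain NumPy. The values below are arbitrary and chosen only to force overflow and underflow in 64-bit floats.

```python
import numpy as np

x_val = 10.0

# Silence the overflow and divide-by-zero warnings triggered on purpose below.
with np.errstate(over="ignore", divide="ignore"):
    p_low = 1.0 / (1.0 + np.exp(80.0 * x_val))     # exp overflows to inf, so p is 0.0
    p_high = 1.0 / (1.0 + np.exp(-80.0 * x_val))   # exp underflows to 0.0, so p is 1.0
    loglik_at_zero = np.log(p_low)                 # log(0) is -inf

print(p_low, p_high, loglik_at_zero)
```

An observation with label 1 at a point where `p` has collapsed to 0 contributes a log-likelihood of negative infinity, which is why the sampler needs a starting point away from these extremes.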

In [None]:
with model:
    observed = pm.Bernoulli("obs", p, observed=t_df.Gender)
    start = pm.find_MAP()
    step = pm.Metropolis()
    trace = pm.sample(120000, step=step, start=start)
    burned_trace = trace[100000::2]
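For intuition about what the Metropolis step, the burn-in, and the thinning in the cell above do, here is a minimal random-walk Metropolis sampler written in NumPy. It is a sketch on simulated data, with assumed Normal(0, 10²) priors and an assumed proposal scale, and is not a reimplementation of `pm.Metropolis`.

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulated data with true beta = -2 and alpha = 0 (not the notebook's t_df).
x_sim = rng.normal(size=300)
y_sim = rng.binomial(1, 1.0 / (1.0 + np.exp(-2.0 * x_sim)))

def log_post(beta, alpha):
    # Bernoulli log-likelihood for p = 1 / (1 + exp(z)), z = beta * x + alpha:
    # log p = -log(1 + e^z) and log(1 - p) = z - log(1 + e^z), computed stably.
    z = beta * x_sim + alpha
    log1pexp = np.logaddexp(0.0, z)
    loglik = np.sum(-y_sim * log1pexp + (1 - y_sim) * (z - log1pexp))
    # Wide Normal(0, 10^2) priors on both coefficients (an assumption).
    return loglik - (beta**2 + alpha**2) / (2 * 10.0**2)

state = np.array([0.0, 0.0])            # start at beta = alpha = 0
current = log_post(*state)
draws = []
for _ in range(6000):
    proposal = state + rng.normal(scale=0.3, size=2)   # symmetric random walk
    candidate = log_post(*proposal)
    # Accept with probability min(1, posterior ratio).
    if np.log(rng.uniform()) < candidate - current:
        state, current = proposal, candidate
    draws.append(state.copy())

kept = np.array(draws)[2000::2]         # drop burn-in, then keep every 2nd draw
print(kept.mean(axis=0))
```

The slice `[2000::2]` plays the same role as `trace[100000::2]` above: discard the early draws taken before the chain settles, then thin the remainder to reduce autocorrelation.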

In [None]:
alpha_samples = burned_trace["alpha"][:, None]
beta_samples = burned_trace["beta"][:, None]
plt.subplot(211)
plt.title(r"Posterior distributions of the variables $\alpha, \beta$")
sns.histplot(beta_samples, bins=35, alpha=0.85,label=r"posterior of $\beta$", palette=["#7A68A6"],stat='probability')
plt.legend()

plt.subplot(212)
sns.histplot(alpha_samples, bins=35, alpha=0.85,label=r"posterior of $\alpha$", palette=["#A60628"],stat='probability')
plt.legend();

All samples of $\beta$ are smaller than 0. Had the posterior instead been centered around 0, we might suspect that $\beta = 0$, implying that PC1 has no effect on the probability of being male. Because $\beta < 0$ under our sign convention, the probability of being male increases with PC1.

In contrast, the posterior of $\alpha$ is centered around 0, which suggests that $\alpha$ is close to 0 and that the curve needs little horizontal shift.
  

Next, let's look at the *expected probability* for a specific value of PC1. That is, we average over all samples from the posterior to get a likely value for $p(PC1_i)$.

In [None]:
t = np.linspace(t_df.pc1.min() - 2, t_df.pc1.max()+2, 50)[:, None]
def logistic(x, beta, alpha=0):
    return 1.0 / (1.0 + np.exp(np.dot(beta, x) + alpha))
p_t = logistic(t.T, beta_samples, alpha_samples)

mean_prob_t = p_t.mean(axis=0)

In [None]:
plt.plot(t, mean_prob_t, lw=3, label="average posterior \nprobability of being male")
plt.plot(t, p_t[0, :], ls="--", label="realization from posterior")
plt.plot(t, p_t[-2, :], ls="--", label="realization from posterior")
plt.scatter(t_df.pc1, t_df.Gender, color="tab:red", s=50, alpha=0.5)
plt.title("Posterior expected value of probability of being Male; \
plus realizations")
plt.legend()
plt.ylim(-0.1, 1.1)
plt.xlim(t.min(), t.max())
plt.ylabel("probability")
plt.xlabel("$PC_1$ value");

Above we also plotted two possible realizations of what the actual underlying system might be. Each is as likely as any other draw from the posterior. The blue line is what we get when we average all of the retained posterior samples (the possible dotted lines) together.

An interesting question to ask is: at which PC1 values are we most uncertain about the probability of being male? Below we plot the expected value line **and** the associated 95% credible intervals for each PC1 value.

In [None]:
qs = mquantiles(p_t, [0.025, 0.975], axis=0)
plt.fill_between(t[:, 0], *qs, alpha=0.7,color="#7A68A6")
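# Draw both edges of the 95% band: qs[0] is the lower bound, qs[1] the upper
plt.plot(t[:, 0], qs[0], label="95% CI", color="#7A68A6", alpha=0.7)
plt.plot(t[:, 0], qs[1], color="#7A68A6", alpha=0.7)
plt.plot(t, mean_prob_t, lw=1, ls="--", color="k",
         label="average posterior \nprobability of being male")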
plt.xlim(t.min(), t.max())
plt.ylim(-0.02, 1.02)
plt.legend()
sns.scatterplot(x=t_df.pc1,y= t_df.Gender, color="tab:red", s=50, alpha=0.5)
plt.xlabel("$PC_1$, $X$")
plt.ylabel("probability estimate")
plt.title("Posterior probability estimates given $PC_1$ Value. $X$");

The *95% credible interval*, or 95% CI, painted in purple, represents the interval, for each $PC_1$ value, that contains 95% of the posterior distribution of the probability. For example, at $PC_1 = 0.01$, we can be 95% sure that the probability of being male lies between 0.98 and 0.99.
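The interval at a single $PC_1$ value is just a pair of quantiles of the posterior draws of $p$ at that value. The sketch below uses stand-in draws for $\beta$ and $\alpha$ (assumed Normal samples, not the notebook's trace) and a hypothetical evaluation point `x0`.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in posterior draws (assumption: not taken from the notebook's trace).
beta_draws = rng.normal(-2.0, 0.3, size=5000)
alpha_draws = rng.normal(0.0, 0.2, size=5000)

# Posterior draws of the probability at one hypothetical PC1 value.
x0 = 0.5
p_draws = 1.0 / (1.0 + np.exp(beta_draws * x0 + alpha_draws))

# The central 95% credible interval is the 2.5% and 97.5% quantile pair.
lower, upper = np.quantile(p_draws, [0.025, 0.975])
inside = np.mean((p_draws >= lower) & (p_draws <= upper))

print(lower, upper, inside)
```

Repeating this over a grid of $PC_1$ values gives the purple band; the band is widest where the posterior draws disagree most about the probability.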