
# TKO_7092 Evaluation of Machine Learning Methods 2024

---

Student name: Mohammadreza Akhtari

Student number: 2304399

Student email: mohammadreza.akhtari@utu.fi

---

## Exercise 4

Complete the tasks given to you in the letter below. In your submission, explain clearly, precisely, and comprehensively why the cross-validation described in the letter failed, how cross-validation should be performed in the given scenario and why  your cross-validation will give a reliable estimate of the generalisation performance. Then implement the correct cross-validation for the scenario and report its results.

Remember to follow all the general exercise guidelines that are stated in Moodle. Full points (2p) will be given for a submission that demonstrates a deep understanding of cross-validation on pair-input data and implements the requested cross-validation correctly (incl. reporting the results). Partial points (1p) will be given if there are small error(s) but the overall approach is correct. No points will be given if there are significant error(s).

The deadline of this exercise is **Wednesday 21 February 2024 at 11:59 PM**. Please contact Juho Heimonen (juaheim@utu.fi) if you have any questions about this exercise.

---


Dear Data Scientist,

I have a long-term research project regarding a specific set of proteins. Currently I am attempting to discover small organic compounds that can bind strongly to these proteins and thus act as drugs. I have a list of over 100,000 potential drug molecules, but their affinities still need to be verified in the lab. Obviously I do not have the resources to measure all the possible drug-target pairs, so I need to prioritise. I have decided to do this with the use of machine learning, but I have encountered a problem.

Here is what I have done so far: First I trained a K-nearest neighbours regressor with the parameter value K=10 using all the 400 measurements I had made in the lab, which cover all the 77 target proteins of interest but only 59 different drug molecules. Then I performed a leave-one-out cross-validation with this same data to estimate the generalisation performance of the model. I used C-index and got a stellar score above 90%. Finally I used the model to predict the affinities of the remaining drug molecules. The problem is: when I selected the highest predicted affinities and tried to verify them in the lab, I found that many of them are much lower in reality. My model clearly does not work despite the high cross-validation score.

Please explain why my estimation failed and how leave-one-out cross-validation should be performed to get a reliable estimate. Also, implement the correct leave-one-out cross-validation and report its results. I need to know whether I am wasting my lab resources by using my model.

The data I used to create my model is available in the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of the pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

Looking forward to hearing from you soon.

Yours sincerely, \
Bio Scientist

---

#### Answer the questions about cross-validation on pair-input data

In [1]:
# Why did the estimation described in the letter fail?

# Drug-target affinity prediction is a pair-input problem: each sample is a (drug, target) pair, and pairs that share a drug or a target are not independent of each other.
# Conventional leave-one-out cross-validation assumes independent samples. It removes only the test pair, so other pairs with the same drug or the same target stay in the training set.
# The training and test data therefore share dependencies, and the resulting score is an optimistic estimate that does not carry over to unseen data.
# In the letter the model is applied to drugs that never occur in the training data, so the cross-validation has to reproduce that setting to give a reliable estimate.
#*********************************************************************************************************************************************************************
# How should leave-one-out cross-validation be performed in the given scenario and why?
# Remember to provide comprehensive and precise arguments.

# The cross-validation must mimic how the model is used: for each test pair, the training set must not contain pairs that depend on it in a way that will not occur at prediction time.
# Type A (drug and target both occur in the training data): ordinary leave-one-out. Only the test pair itself is removed from the training set.
# Type B (known target, new drug): all training pairs that share the first element (the drug) with the test pair are removed.
# Type C (known drug, new target): all training pairs that share the second element (the target) with the test pair are removed.
# Type D (new drug and new target): all training pairs that share either the drug or the target with the test pair are removed.
# In the letter all 77 targets are in the data but predictions are made for drugs outside the 59 measured ones, so the scenario is type B.
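
The four settings above can be expressed as a rule for selecting the training rows of one held-out pair. The sketch below is only an illustration under stated assumptions: the function name `training_rows`, the setting labels and the toy identifiers are hypothetical and are not part of the exercise files.

```python
import numpy as np

def training_rows(drugs, targets, test_idx, setting):
    """Return a boolean mask of the rows allowed in training for one test pair.

    "A": drug and target both known  -> drop only the test pair
    "B": new drug, known target      -> drop every pair sharing the test drug
    "C": known drug, new target      -> drop every pair sharing the test target
    "D": new drug and new target     -> drop pairs sharing either of them
    """
    drugs = np.asarray(drugs)
    targets = np.asarray(targets)
    same_drug = drugs == drugs[test_idx]
    same_target = targets == targets[test_idx]
    if setting == "A":
        mask = np.ones(len(drugs), dtype=bool)
        mask[test_idx] = False
    elif setting == "B":
        mask = ~same_drug
    elif setting == "C":
        mask = ~same_target
    elif setting == "D":
        mask = ~(same_drug | same_target)
    else:
        raise ValueError("setting must be one of 'A', 'B', 'C', 'D'")
    return mask

# Toy pairs (hypothetical identifiers in the style of pairs.data).
drugs = ["D1", "D1", "D2", "D2", "D3"]
targets = ["T1", "T2", "T1", "T3", "T2"]

# Hold out pair 0, i.e. (D1, T1), under each setting.
masks = {s: training_rows(drugs, targets, 0, s) for s in "ABCD"}
for s, m in masks.items():
    print(s, m.astype(int).tolist())
```

The test pair is excluded in every setting, because it shares both its drug and its target with itself.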

#### Import libraries

In [2]:
# Import the libraries you need.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

#### Write utility functions

In [3]:
# Write the utility functions you need in your analysis.
"""
C-index function:
- INPUTS:
'y' an array of the true output values
'yp' an array of predicted output values
- OUTPUT:
The c-index value
"""
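def cindex(y, yp):
    n = 0      # number of comparable pairs (different true values)
    h_num = 0  # concordance score accumulated over those pairs
    for i in range(0, len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i+1, len(y)):
            nt = y[j]
            p_next = yp[j]  # prediction for the second element of the pair
            if (t != nt):
                n = n + 1
                if (p < p_next and t < nt) or (p > p_next and t > nt):
                    h_num += 1    # pair ranked in the correct order
                elif (p == p_next):
                    h_num += 0.5  # tied predictions count as half

    # Return 0.0 when there are no comparable pairs, to avoid division by zero
    if n == 0:
        return 0.0
    else:
        return h_num / n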
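
As a sanity check of the C-index definition, the sketch below re-implements the same pairwise counting in a compact form and evaluates it on three toy cases. The name `cindex_check` and the toy values are assumptions made for this illustration.

```python
from itertools import combinations

def cindex_check(y, yp):
    """Share of pairs with different true values that the predictions order
    correctly; pairs with tied predictions count as half."""
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(y, yp), 2):
        if t1 == t2:
            continue  # equal true values: the pair is not comparable
        comparable += 1
        if p1 == p2:
            concordant += 0.5
        elif (p1 < p2) == (t1 < t2):
            concordant += 1.0
    return concordant / comparable if comparable else 0.0

y_true = [0.1, 0.4, 0.7, 0.9]
perfect = cindex_check(y_true, [1, 2, 3, 4])    # same order as y_true
reversed_ = cindex_check(y_true, [4, 3, 2, 1])  # opposite order
constant = cindex_check(y_true, [5, 5, 5, 5])   # every prediction tied
print(perfect, reversed_, constant)  # 1.0 0.0 0.5
```

A value of 0.5 is what uninformative predictions give, for example a constant prediction for every pair, so it serves as the baseline when reading the scores.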



#### Load datasets

In [4]:
# Read the data files (input.data, output.data, pairs.data).
input_df = pd.read_csv('https://raw.githubusercontent.com/simulate111/LOO_Pairs/main/input.data', header=None, sep=' ')  # Assuming input data is space-separated
output_df = pd.read_csv('https://raw.githubusercontent.com/simulate111/LOO_Pairs/main/output.data', header=None, sep=' ')  # Assuming output data is space-separated
pairs_df = pd.read_csv('https://raw.githubusercontent.com/simulate111/LOO_Pairs/main/pairs.data', header=None, sep=' ')  # Assuming pairs data is space-separated
display('input:', input_df.head(), 'input:', input_df.shape)
display('output:', output_df.head(),'output:', output_df.shape)
display('pairs:', pairs_df.head(), 'pairs:',pairs_df.shape)

'input:'

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,57,58,59,60,61,62,63,64,65,66
0,0.759222,0.709585,0.253151,0.421082,0.72778,0.404487,0.709027,0.242963,0.407292,0.379971,...,0.838616,0.16505,0.515334,0.332678,0.577533,0.678125,0.463608,0.538938,0.460883,0.345251
1,0.034584,0.30472,0.688257,0.296396,0.151878,0.830755,0.270656,0.705392,0.18612,0.085594,...,0.472762,0.730013,0.639373,0.445218,0.45568,0.090737,0.308432,0.079023,0.603089,0.197008
2,0.737867,0.236079,0.905987,0.163612,0.801455,0.789823,0.393999,0.522067,0.411352,0.781861,...,0.595468,0.582292,0.836193,0.281514,0.79179,0.081695,0.58345,0.422539,0.076437,0.299662
3,0.406913,0.60774,0.235365,0.888679,0.150347,0.598991,0.130108,0.465818,0.799953,0.906878,...,0.45388,0.311799,0.534668,0.563793,0.727767,0.172686,0.908368,0.786892,0.790459,0.666388
4,0.697707,0.432565,0.650329,0.886065,0.32866,0.576926,0.5231,0.080463,0.131349,0.913496,...,0.583892,0.444141,0.249423,0.11069,0.42077,0.250148,0.19635,0.427255,0.166715,0.91972


'input:'

(400, 67)

'output:'

Unnamed: 0,0
0,0.733933
1,0.569419
2,0.832588
3,0.389664
4,0.725953


'output:'

(400, 1)

'pairs:'

Unnamed: 0,0,1
0,D40,T2
1,D31,T64
2,D6,T58
3,D56,T49
4,D20,T28


'pairs:'

(400, 2)

In [5]:
#Data standardization using z-score
scaler = StandardScaler()
input_data_scaled = scaler.fit_transform(input_df)

#### Implement and run cross-validation

In [6]:
# Implement and run the requested cross-validation. Report and interpret its results.

In [7]:
# Type A: ordinary leave-one-out, where only the test pair is removed from the training set
knn = KNeighborsRegressor(n_neighbors=10)
loo = LeaveOneOut()
yp = []
for train_index, test_index in loo.split(input_data_scaled):
    X_train, X_test = input_data_scaled[train_index], input_data_scaled[test_index]
    y_train, y_test = output_df.values[train_index], output_df.values[test_index]
    knn.fit(X_train, y_train)
    yp0 = knn.predict(X_test)
    yp.extend(yp0)
#print('yp',len(yp))
print("C-index:", round(cindex(output_df.values, yp), 2))

C-index: 0.83


In [8]:
# Type B
knn = KNeighborsRegressor(n_neighbors=10)
loo = LeaveOneOut()
Cindex = []
for train, test in loo.split(input_data_scaled):
    X_train, X_test = input_data_scaled[train], input_data_scaled[test]
    y_train, y_test = output_df.values[train], output_df.values[test]
    yp = []
    for i in range(len(X_train) - 1):
        #print('X_train',X_train)
        X_train1 = np.delete(X_train, [i, i + 1], axis=0)
        y_train1 = np.delete(y_train, [i, i + 1], axis=0)
        knn.fit(X_train1, y_train1)
        #print(X_test)
        yp0 = knn.predict(X_test)
        #print('yp0',yp0)
        yp.extend(yp0)
    #print('yp',[arr[0] for arr in yp])
    #Flattening
    yp=[arr[0] for arr in yp]
    #print('y_test',y_test)
    c_index = cindex(output_df[:-2].values, yp)
    #print('c_index', c_index)
    Cindex.append(c_index)
print("C-index:", round(np.mean(Cindex),2))

C-index: 0.5
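
For type B the rows to remove are determined by the identifiers in `pairs.data`: every training pair that has the same drug as the test pair is left out. A minimal sketch of such a loop follows. The function name `leave_drug_out_predictions` is hypothetical, and the data are synthetic stand-ins, not the exercise data, so the printed result says nothing about the real model.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def leave_drug_out_predictions(X, y, drugs, n_neighbors=10):
    """Predict each pair from a model trained only on pairs of other drugs."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    drugs = np.asarray(drugs)
    preds = np.empty(len(y))
    for i in range(len(y)):
        # Keep only pairs whose drug differs from the test pair's drug.
        # This also removes the test pair itself.
        train = drugs != drugs[i]
        model = KNeighborsRegressor(n_neighbors=n_neighbors)
        model.fit(X[train], y[train])
        preds[i] = model.predict(X[i:i + 1])[0]
    return preds

# Synthetic stand-in data (assumption): 60 pairs, 5 features, 6 drugs.
rng = np.random.default_rng(0)
X_demo = rng.random((60, 5))
y_demo = rng.random(60)
drugs_demo = np.array([f"D{i % 6}" for i in range(60)])

preds_demo = leave_drug_out_predictions(X_demo, y_demo, drugs_demo)
print(preds_demo.shape)
```

On the exercise data the inputs would presumably be `input_data_scaled`, `output_df.values` and the first column of `pairs_df`, and the returned predictions could then be scored with `cindex`.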


In [9]:
# Type C
knn = KNeighborsRegressor(n_neighbors=10)
loo = LeaveOneOut()
ypT = []
for train, test in loo.split(input_data_scaled):
    yp = []
    X_train, X_test = input_data_scaled[train], input_data_scaled[test]
    y_train, y_test = output_df.values[train], output_df.values[test]
    # Remove the corresponding columns, then train the model
    for i in range(X_train.shape[1] - 1):
        X_train2 = np.delete(X_train, [i, i + 1], axis=1)
        X_test2 = np.delete(X_test, [i, i + 1], axis=1)
        knn.fit(X_train2, y_train)
        yp0 = knn.predict(X_test2)
        yp.extend(yp0)
    ypT.append(yp)
print("C-index:", round(np.mean(cindex(output_df[:-1].values, ypT)),2))

C-index: 0.82


In [10]:
# Type D
knn = KNeighborsRegressor(n_neighbors=10)
loo = LeaveOneOut()
ypT = []
CT=[]
for train, test in loo.split(input_data_scaled):
    yp = []
    X_train, X_test = input_data_scaled[train], input_data_scaled[test]
    y_train, y_test = output_df.values[train], output_df.values[test]
    for i in range(X_train.shape[1] - 1):
        X_train1 = np.delete(X_train, [i, i + 1], axis=1)
        X_test1 = np.delete(X_test, [i, i + 1], axis=1)
        for i in range(len(X_train1) - 1):
            X_train2 = np.delete(X_train1, [i, i + 1], axis=0)
            y_train2 = np.delete(y_train, [i, i + 1], axis=0)
            knn.fit(X_train2, y_train2)
            yp0 = knn.predict(X_test1)
            yp.extend(yp0)
        c_index = cindex(output_df[:-2].values, yp)
        Cindex.append(c_index)
        #print(Cindex)
    CT.append(np.mean(Cindex))
    print("Average C-index:", round(np.mean(CT),2))
print("C-index:", round(np.mean(CT),2))

Average C-index: 0.5
Average C-index: 0.5
Average C-index: 0.5
Average C-index: 0.5
Average C-index: 0.5
Average C-index: 0.5
Average C-index: 0.5
Average C-index: 0.5
Average C-index: 0.5
Average C-index: 0.5
Average C-index: 0.5
Average C-index: 0.5


KeyboardInterrupt: 
