**Unsupervised Machine Learning Approach for Tumor Prediction from Random Gene Expression Extracts**



# **Abstract**

Disease prediction with machine learning is a subject of growing demand in healthcare. Supervised machine learning, which learns from labeled data, has become reliable in recent years. However, the expert-annotated data that supervised algorithms need is expensive and difficult to acquire, which limits where they can be applied. In contrast, unsupervised machine learning models learn by clustering unannotated data, which makes them more versatile because they can use any data that is available. In this paper, we trained an unsupervised model and a supervised model on a dataset of randomly extracted gene expressions from patients with one of five tumor types: BRCA, KIRC, COAD, LUAD, and PRAD. The supervised model was a logistic regression, and the unsupervised model was k-means enhanced with an autoencoder; comparing the two lets us properly assess the results of the unsupervised model. After finishing the program, we repeated the train and test process 5000 times so that we could collect the averages and observe the outliers. The supervised model had a median accuracy of 100%, which is not surprising considering that the data was labeled. After training the autoencoder, the unsupervised model had a median accuracy of 51% in tumor prediction, which is encouraging considering the challenges of using unlabeled data. These results highlight the potential of unsupervised learning models in disease prediction. This study suggests that, with further optimization, unsupervised models could approach the performance of supervised models without depending on expensive human annotations.


# Importing the Dataset


This collection of data is part of the RNA-Seq (HiSeq) PANCAN data set. It is a random extraction of gene expressions from patients with different types of tumors: BRCA, KIRC, COAD, LUAD, and PRAD.

You can access the dataset [here](https://archive.ics.uci.edu/dataset/401/gene+expression+cancer+rna+seq).

The labels (`labels.csv`) and the data (`data.csv`) have to be uploaded separately, one file per upload cell.

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
from google.colab import files
uploaded = files.upload()

# Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
import torch
from torchvision import datasets
from torchvision import transforms
import matplotlib.pyplot as plt

In [None]:
import os
from torch.utils.data import Dataset

In [None]:
from sklearn.cluster import KMeans

In [None]:
from sklearn.metrics import accuracy_score

# Setting up the Dataset

In [None]:
data = pd.read_csv("data.csv")
data.describe()

In [None]:
label = pd.read_csv("labels.csv")
print(label)

In [None]:
print(data)

In [None]:
merged = pd.concat([data,label], axis=1)

In [None]:
print(merged)

In [None]:
X = data.drop("Unnamed: 0", axis=1)
X.head()
X.to_numpy()

In [None]:
X.to_numpy().shape

In [None]:
Y = label.drop("Unnamed: 0", axis=1)
Y.head()

In [None]:
# Map the tumor class names to integer labels
labelmapping = {'PRAD':0,'LUAD':1,'BRCA':2,'KIRC':3,"COAD":4}
Y = Y['Class'].tolist()
Y = [labelmapping[y] for y in Y]

# Logistic Regression Model

In [None]:
# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X.to_numpy(), np.array(Y), test_size=0.20)

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print('Score:', score)

predictions = model.predict(X_test)
print('Predictions:', predictions)

In [None]:
acc = accuracy_score(y_test, predictions)
print("Accuracy:", acc)
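
The abstract mentions repeating the train and test process many times to collect averages and observe outliers. The notebook does not show that loop, so the following is only a minimal sketch of how it could look for the logistic regression model. The helper name `repeated_accuracy`, the synthetic `make_blobs` data, the `stratify` option, and `max_iter=1000` are all assumptions made for illustration, and the run count is kept small so the example finishes quickly.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def repeated_accuracy(X, y, n_runs=20, test_size=0.20):
    """Return one test accuracy per random train/test split (hypothetical helper)."""
    scores = []
    for seed in range(n_runs):
        # A different seed gives a different random split on every run
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y
        )
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return np.array(scores)


# Synthetic stand-in for the gene expression matrix (assumption, not the real data)
X_demo, y_demo = make_blobs(n_samples=200, centers=5, n_features=20, random_state=0)
scores = repeated_accuracy(X_demo, y_demo)
print("median:", np.median(scores), "min:", scores.min(), "max:", scores.max())
```

With the real data, `X.to_numpy()` and `np.array(Y)` would take the place of the synthetic arrays, and `n_runs` could be raised toward the number of repetitions reported in the abstract.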

# K-means


In [None]:

kmeans = KMeans(n_clusters=5)
kmeans.fit(X_train)
prediction = kmeans.predict(X_test)


In [None]:
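# K-means cluster ids are arbitrary, so this score depends on how the ids
# happen to line up with the integer label encoding
acc = accuracy_score(y_test, prediction)
print("Accuracy:", acc)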
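
K-means numbers its clusters arbitrarily, so cluster `0` does not have to correspond to the label encoded as `0`. One common way to score a clustering against known labels is to map each cluster to the label that occurs most often inside it and then compute accuracy on the mapped predictions. The sketch below illustrates that idea with plain Python; `map_clusters_to_labels` and `mapped_accuracy` are hypothetical helpers, not part of the notebook, and the tiny lists are made-up demonstration values.

```python
from collections import Counter


def map_clusters_to_labels(cluster_ids, true_labels):
    """Map every cluster id to the most frequent true label inside that cluster."""
    votes = {}
    for cluster, label in zip(cluster_ids, true_labels):
        votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


def mapped_accuracy(cluster_ids, true_labels):
    """Accuracy after relabeling clusters by majority vote."""
    mapping = map_clusters_to_labels(cluster_ids, true_labels)
    hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
    return hits / len(true_labels)


# Made-up example: the clustering is mostly right, but the ids are permuted
demo_clusters = [2, 2, 0, 0, 1, 1, 1]
demo_labels = [0, 0, 1, 1, 2, 2, 0]
demo_mapping = map_clusters_to_labels(demo_clusters, demo_labels)
demo_accuracy = mapped_accuracy(demo_clusters, demo_labels)
print(demo_mapping)
print(demo_accuracy)
```

In this example a direct comparison of cluster ids with labels would score zero, while the mapped accuracy is 6 out of 7. The mapping uses the true labels, so it is an evaluation aid only; deriving it on the training split and applying it to the test split keeps the test score honest.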

# Autoencoder


In [None]:

class CustomGeneDataset(Dataset):
  """Wraps the feature matrix and the integer labels as float tensors."""
  def __init__(self, features, labels, transform=None, target_transform=None):
    # transform and target_transform are accepted but not used
    self.features = torch.FloatTensor(features)
    self.labels = torch.FloatTensor(labels)

  def __len__(self):
    return len(self.labels)

  def __getitem__(self, idx):
    data = self.features[idx]
    label = self.labels[idx]
    return data,label

In [None]:
# Fully connected autoencoder as a PyTorch module
# 20531 ==> 64 ==> 20531
class AE(torch.nn.Module):
	def __init__(self):
		super().__init__()

		self.encoder = torch.nn.Sequential(
			torch.nn.Linear(20531,1024),
			torch.nn.ReLU(),
			torch.nn.Linear(1024, 512),
			torch.nn.ReLU(),
			torch.nn.Linear(512, 256),
			torch.nn.ReLU(),
			torch.nn.Linear(256, 128),
			torch.nn.ReLU(),
			torch.nn.Linear(128, 64)
		)

		self.decoder = torch.nn.Sequential(
			torch.nn.Linear(64, 128),
			torch.nn.ReLU(),
			torch.nn.Linear(128, 256),
			torch.nn.ReLU(),
			torch.nn.Linear(256, 512),
			torch.nn.ReLU(),
			torch.nn.Linear(512, 1024),
			torch.nn.ReLU(),
			torch.nn.Linear(1024,20531),
			torch.nn.Sigmoid()
		)

	def forward(self, x):
		encoded = self.encoder(x)
		decoded = self.decoder(encoded)
		return decoded

	def encode(self, x):
		encoded = self.encoder(x)
		return encoded
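
The decoder above ends in `Sigmoid`, so every reconstructed value lies between 0 and 1. The notebook passes the expression matrix to the autoencoder without rescaling, so if any expression values fall outside that range the reconstruction cannot match them. One possible remedy, shown here only as a sketch, is min-max scaling with statistics taken from the training split. The helper names `fit_min_max` and `apply_min_max` and the small demonstration arrays are assumptions for illustration.

```python
import numpy as np


def fit_min_max(train):
    """Compute per-feature minimum and range from the training data only."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features: avoid division by zero
    return lo, span


def apply_min_max(x, lo, span):
    """Scale features with the training statistics and clip to [0, 1]."""
    return np.clip((x - lo) / span, 0.0, 1.0)


# Made-up values: the second feature is constant in the training split
demo_train = np.array([[0.0, 5.0], [10.0, 5.0]])
demo_test = np.array([[20.0, 5.0]])

lo, span = fit_min_max(demo_train)
scaled_train = apply_min_max(demo_train, lo, span)
scaled_test = apply_min_max(demo_test, lo, span)
print(scaled_train)
print(scaled_test)
```

Fitting the statistics on the training split and reusing them for the test split avoids leaking test information into preprocessing. Dropping the final `Sigmoid` would be another option, at the cost of unbounded outputs.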


In [None]:
# Transforms to a PyTorch Tensor
tensor_transform = transforms.ToTensor()

train_dataset = CustomGeneDataset(X_train, y_train)
test_dataset = CustomGeneDataset(X_test, y_test)


In [None]:
train_loader = torch.utils.data.DataLoader(dataset = train_dataset, batch_size = 8, shuffle = True)
# Keep the test set in its original order so the encoded samples stay aligned with y_test
test_loader = torch.utils.data.DataLoader(dataset = test_dataset, batch_size = 8, shuffle = False)

In [None]:
# Model Initialization
model = AE()
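# Move the model to the GPU when one is available, otherwise stay on the CPU
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
# Reconstruction loss: mean squared error
loss_function = torch.nn.MSELoss()

# Adam optimizer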
optimizer = torch.optim.Adam(model.parameters(),
							lr = 1e-3,
							weight_decay = 1e-8)


In [None]:
epochs = 20
outputs = []
losses = []
# Run the batches on whichever device the model lives on
device = next(model.parameters()).device
for epoch in range(epochs):
    # The labels in each batch are not used for training
    for (data, _) in train_loader:
        data = data.to(device)

        # Output of Autoencoder
        reconstructed = model(data)

        # Calculating the loss function
        loss = loss_function(reconstructed, data)

        # The gradients are set to zero,
        # the gradient is computed and stored.
        # .step() performs parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Storing the losses as Python floats for plotting
        losses.append(loss.item())
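
The training loop stores one loss value per batch, which makes a raw plot noisy. A simple way to see the trend is to average the batch losses within each epoch before plotting. The sketch below uses plain Python floats; `mean_loss_per_epoch` is a hypothetical helper and the demonstration losses are made up.

```python
def mean_loss_per_epoch(batch_losses, batches_per_epoch):
    """Average consecutive groups of `batches_per_epoch` batch losses."""
    means = []
    for start in range(0, len(batch_losses), batches_per_epoch):
        chunk = batch_losses[start:start + batches_per_epoch]
        means.append(sum(chunk) / len(chunk))
    return means


# Made-up batch losses: three epochs with two batches each
demo_losses = [4.0, 2.0, 3.0, 1.0, 2.0, 0.0]
demo_means = mean_loss_per_epoch(demo_losses, batches_per_epoch=2)
print(demo_means)
```

For the notebook's loop, `batches_per_epoch` would be `len(train_loader)`, and the resulting list could be passed to `plt.plot`. Tensor losses would need converting to floats first, for example with `.item()`.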





In [None]:
trainset_encoded_representation = []
with torch.no_grad():
  for (data, _) in train_loader:
    # Move each batch to the model's device before encoding
    encoded = model.encode(data.to(next(model.parameters()).device))
    trainset_encoded_representation.extend(encoded)

In [None]:
testset_encoded_representation = []
with torch.no_grad():
  for (data, _) in test_loader:
    # Move each batch to the model's device before encoding
    encoded = model.encode(data.to(next(model.parameters()).device))
    testset_encoded_representation.extend(encoded)

In [None]:
trainset_encoded_representation = [encoded.cpu().numpy() for encoded in trainset_encoded_representation]
testset_encoded_representation = [encoded.cpu().numpy() for encoded in testset_encoded_representation]

In [None]:

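# K-means on the 64-dimensional encoded representations
# (3 clusters here, while the dataset has 5 tumor classes)
kmeans = KMeans(n_clusters=3)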
kmeans.fit(trainset_encoded_representation)
prediction = kmeans.predict(testset_encoded_representation)

In [None]:
acc = accuracy_score(y_test, prediction)
print("Accuracy:", acc)

In [None]:
#print(trainset_encoded_representation)

# Citations


Fiorini, Samuele. (2016). gene expression cancer RNA-Seq. UCI Machine Learning Repository. https://doi.org/10.24432/C5R88H.

https://www.geeksforgeeks.org/implementing-an-autoencoder-in-pytorch/

https://pytorch.org/tutorials/beginner/basics/data_tutorial.html