<a href="https://colab.research.google.com/github/szezlong/gene-classification/blob/main/Gene_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cancer classification using gene expression analysis
Dataset source: https://www.kaggle.com/datasets/crawford/gene-expression/data?select=actual.csv

The data come from the study: https://www.dkfz.de/genomics-proteomics/fileadmin/downloads/Expression/Golub_1999.pdf

The dataset used in Golub's research was sourced from bone marrow samples obtained from patients diagnosed with acute leukemia. It included samples of acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML).

It served as the basis for developing a class predictor to classify new, unknown samples of acute leukemias based on their gene expression profiles.

##Environment setup

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

##Dataset

In [21]:
from google.colab import userdata
import os

os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')

In [22]:
!kaggle datasets download -d crawford/gene-expression
!unzip "gene-expression.zip"

Dataset URL: https://www.kaggle.com/datasets/crawford/gene-expression
License(s): CC0-1.0
Downloading gene-expression.zip to /content
  0% 0.00/1.41M [00:00<?, ?B/s]
100% 1.41M/1.41M [00:00<00:00, 135MB/s]
Archive:  gene-expression.zip
  inflating: actual.csv              
  inflating: data_set_ALL_AML_independent.csv  
  inflating: data_set_ALL_AML_train.csv  


In [116]:
train_df = pd.read_csv('data_set_ALL_AML_train.csv')
test_df = pd.read_csv('data_set_ALL_AML_independent.csv')
patient_result_df = pd.read_csv('actual.csv')

###Data description

The data are split into the two sets used in the paper: the initial (training, 38 samples) set and the independent (test, 34 samples) set. Both contain measurements for ALL and AML samples taken from bone marrow and peripheral blood.

In [14]:
patient_result_df['cancer'].value_counts()

cancer
ALL    47
AML    25
Name: count, dtype: int64

In the combined training and testing sets there are 72 patients. Each patient is marked either as "ALL" or "AML" based on the type of leukemia they are diagnosed with.
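
The two classes are not balanced, which matters when reading accuracy figures later on. The sketch below uses only the counts printed above to compute the class shares and the accuracy of a trivial classifier that always predicts the majority class; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"ALL": 47, "AML": 25}, name="count")

# Share of each class among all 72 patients.
proportions = counts / counts.sum()

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = proportions.max()

print(proportions.round(3))
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model evaluated on the combined data should clearly beat this baseline to be considered informative.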

In [24]:
print(train_df.shape)
print(test_df.shape)

(7129, 78)
(7129, 70)


In [32]:
train_df.head()

Unnamed: 0,Gene Description,Gene Accession Number,1,call,2,call.1,3,call.2,4,call.3,...,29,call.33,30,call.34,31,call.35,32,call.36,33,call.37
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-214,A,-139,A,-76,A,-135,A,...,15,A,-318,A,-32,A,-124,A,-135,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-153,A,-73,A,-49,A,-114,A,...,-114,A,-192,A,-49,A,-79,A,-186,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,-58,A,-1,A,-307,A,265,A,...,2,A,-95,A,49,A,-37,A,-70,A
3,AFFX-BioC-5_at (endogenous control),AFFX-BioC-5_at,88,A,283,A,309,A,12,A,...,193,A,312,A,230,P,330,A,337,A
4,AFFX-BioC-3_at (endogenous control),AFFX-BioC-3_at,-295,A,-264,A,-376,A,-419,A,...,-51,A,-139,A,-367,A,-188,A,-407,A


In [19]:
test_df.head()

Unnamed: 0,Gene Description,Gene Accession Number,39,call,40,call.1,42,call.2,47,call.3,...,65,call.29,66,call.30,63,call.31,64,call.32,62,call.33
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-342,A,-87,A,22,A,-243,A,...,-62,A,-58,A,-161,A,-48,A,-176,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-200,A,-248,A,-153,A,-218,A,...,-198,A,-217,A,-215,A,-531,A,-284,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,41,A,262,A,17,A,-163,A,...,-5,A,63,A,-46,A,-124,A,-81,A
3,AFFX-BioC-5_at (endogenous control),AFFX-BioC-5_at,328,A,295,A,276,A,182,A,...,141,A,95,A,146,A,431,A,9,A
4,AFFX-BioC-3_at (endogenous control),AFFX-BioC-3_at,-224,A,-226,A,-211,A,-289,A,...,-256,A,-191,A,-172,A,-496,A,-294,A


The gene descriptions, numbering 7129 in total, are listed in the rows, while each patient's values are displayed in the columns.

##Preprocessing

In [117]:
keep_train = [col for col in train_df.columns if "call" not in col]
keep_test = [col for col in test_df.columns if "call" not in col]

train_df = train_df[keep_train]
test_df = test_df[keep_test]

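The 'call' columns were dropped. They are not explained anywhere in the dataset description. In the rows previewed above almost every entry in a 'call' column is the same ('A'), which suggests these columns carry little information for the gene analysis. They most likely hold the Affymetrix detection calls (Absent, Marginal, Present) rather than expression values.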
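
Before discarding the 'call' columns, it can be worth tallying what they actually contain. The sketch below does this on a small hypothetical frame that mimics the raw layout (the values are invented for illustration); the same two lines could be run on the raw frame before the filtering step.

```python
import pandas as pd

# Hypothetical miniature of the raw layout; values are invented for illustration.
df = pd.DataFrame({
    "Gene Accession Number": ["A_at", "B_at", "C_at", "D_at"],
    "1": [-214, -153, -58, 88], "call": ["A", "A", "A", "P"],
    "2": [-139, -73, -1, 283], "call.1": ["A", "M", "A", "P"],
})

# Select every column whose name contains 'call'.
call_cols = [c for c in df.columns if "call" in c]

# Flatten all 'call' values into one Series and count each distinct value.
call_counts = pd.Series(df[call_cols].to_numpy().ravel()).value_counts()
print(call_counts)
```

If a single value dominates the counts, dropping the columns loses little; if not, they may be worth keeping as a quality filter.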

In [118]:
train_df = train_df.T
test_df = test_df.T

In [119]:
train_df.columns = train_df.iloc[1]
train_df = train_df.drop(["Gene Description", "Gene Accession Number"]).apply(pd.to_numeric)
train_df.index = pd.to_numeric(train_df.index)  # numeric patient IDs, so that sorting is numeric rather than lexicographic
train_df.sort_index(inplace=True)  # raw columns are not in patient order; the positional label merge below relies on this
train_df.head()

Gene Accession Number,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U48730_at,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
1,-214,-153,-58,88,-295,-558,199,-176,252,206,...,185,511,-125,389,-37,793,329,36,191,-37
2,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,169,837,-36,442,-17,782,295,11,76,-14
3,-76,-49,-307,309,-376,-650,33,-367,206,-215,...,315,1199,33,168,52,1138,777,41,228,-41
4,-135,-114,265,12,-419,-585,158,-253,49,31,...,240,835,218,174,-110,627,170,-50,126,-91
5,-106,-125,-76,168,-230,-284,4,-122,70,252,...,156,649,57,504,-26,250,314,14,56,-25


In [120]:
test_df.columns = test_df.iloc[1]
test_df = test_df.drop(["Gene Description", "Gene Accession Number"]).apply(pd.to_numeric)
test_df.index = pd.to_numeric(test_df.index)  # numeric patient IDs, so that sorting is numeric rather than lexicographic
test_df.sort_index(inplace=True)  # raw columns are not in patient order (39, 40, 42, 47, ...)
test_df.head()

Gene Accession Number,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U48730_at,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
39,-342,-200,41,328,-224,-427,-656,-292,137,-144,...,277,1023,67,214,-135,1074,475,48,168,-70
40,-87,-248,262,295,-226,-493,367,-452,194,162,...,83,529,-295,352,-67,67,263,-33,-33,-21
41,-62,-23,-7,142,-233,-284,-167,-97,-12,-70,...,129,383,46,104,15,245,164,84,100,-18
42,22,-153,17,276,-211,-250,55,-141,0,500,...,413,399,16,558,24,893,297,6,1971,-42
43,86,-36,-141,252,-201,-384,-420,-197,-60,-468,...,341,91,-84,615,-52,1235,9,7,1545,-81


In [121]:
print(train_df.shape)
print(test_df.shape)

(38, 7129)
(34, 7129)


The training set now has 38 patients as rows, and the test set has the remaining 34. Each set holds the expression values of 7129 genes for every patient.
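
The reshaping steps above can be reproduced on a toy frame to see what each one does. This is a minimal sketch with invented values and hypothetical names (`raw`, `expr`); the patient columns are deliberately out of order to show why the numeric sort matters.

```python
import pandas as pd

# Hypothetical miniature of the raw layout: genes in rows, patients in columns,
# with a 'call' column after every patient column.
raw = pd.DataFrame({
    "Gene Description": ["gene A", "gene B"],
    "Gene Accession Number": ["A_at", "B_at"],
    "2": [10, 20], "call": ["A", "P"],
    "1": [30, 40], "call.1": ["A", "A"],
})

# Drop the 'call' columns, then transpose so that patients become rows.
expr = raw[[c for c in raw.columns if "call" not in c]].T

# Use accession numbers as column names and drop the two descriptive rows.
expr.columns = expr.loc["Gene Accession Number"]
expr = expr.drop(["Gene Description", "Gene Accession Number"]).apply(pd.to_numeric)

# Convert patient IDs to integers and sort so rows follow patient order.
expr.index = pd.to_numeric(expr.index)
expr = expr.sort_index()
print(expr)
```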

In [122]:
train_df = train_df.reset_index(drop=True)
train_df = train_df.merge(patient_result_df[['cancer']], how='left', left_index=True, right_index=True)

In [123]:
dic = {'ALL':0,'AML':1}
train_df.replace(dic,inplace=True)
train_df.head()

Unnamed: 0,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at,cancer
0,-214,-153,-58,88,-295,-558,199,-176,252,206,...,511,-125,389,-37,793,329,36,191,-37,0
1,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,837,-36,442,-17,782,295,11,76,-14,0
2,-76,-49,-307,309,-376,-650,33,-367,206,-215,...,1199,33,168,52,1138,777,41,228,-41,0
3,-135,-114,265,12,-419,-585,158,-253,49,31,...,835,218,174,-110,627,170,-50,126,-91,0
4,-106,-125,-76,168,-230,-284,4,-122,70,252,...,649,57,504,-26,250,314,14,56,-25,0


In [124]:
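# Test patients are numbered 39-72, so labels are looked up by patient number.
# Merging on a 0-based index would attach the labels of patients 1-34 instead.
# Assumes actual.csv has a 'patient' column holding the patient numbers.
labels = patient_result_df.set_index('patient')[['cancer']]
test_df = test_df.merge(labels, how='left', left_index=True, right_index=True)
test_df = test_df.reset_index(drop=True)
test_df.replace(dic, inplace=True)
test_df.head()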

Unnamed: 0,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at,cancer
0,-342,-200,41,328,-224,-427,-656,-292,137,-144,...,1023,67,214,-135,1074,475,48,168,-70,0
1,-87,-248,262,295,-226,-493,367,-452,194,162,...,529,-295,352,-67,67,263,-33,-33,-21,0
2,-62,-23,-7,142,-233,-284,-167,-97,-12,-70,...,383,46,104,15,245,164,84,100,-18,0
3,22,-153,17,276,-211,-250,55,-141,0,500,...,399,16,558,24,893,297,6,1971,-42,0
4,86,-36,-141,252,-201,-384,-420,-197,-60,-468,...,91,-84,615,-52,1235,9,7,1545,-81,0


The diagnosed leukemia type was added to the data as the `cancer` column (ALL = 0, AML = 1).
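
Since both the encoding and the row alignment are easy to get wrong, a small check on toy data can help. The sketch below assumes a label table with `patient` and `cancer` columns (the `patient` column name and all values are assumptions for illustration) and looks labels up by patient number rather than by row position.

```python
import pandas as pd

# Hypothetical label table in the style of actual.csv (column names assumed).
labels = pd.DataFrame({
    "patient": [1, 2, 3, 4],
    "cancer": ["ALL", "AML", "ALL", "AML"],
})

# Hypothetical expression rows indexed by patient number (not starting at 1).
expr = pd.DataFrame({"gene_x": [5, 7]}, index=[3, 4])

# Look labels up by patient number, then encode them as integers.
encoding = {"ALL": 0, "AML": 1}
expr["cancer"] = labels.set_index("patient")["cancer"].reindex(expr.index).map(encoding)
print(expr)
```

Matching on the patient number keeps the labels correct even when a subset of patients does not start at 1 or is not stored in order.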

##Selection of relevant genes