
## Goal of the Competition
The goal of this competition is to predict if a person has any of three medical conditions. You are being asked to predict if the person has one or more of any of the three medical conditions (Class 1), or none of the three medical conditions (Class 0). You will create a model trained on measurements of health characteristics.

To determine if someone has these medical conditions requires a long and intrusive process to collect information from patients. With predictive models, we can shorten this process and keep patient details private by collecting key characteristics relative to the conditions, then encoding these characteristics.

Your work will help researchers discover the relationship between measurements of certain characteristics and potential patient conditions.

## Context
They say age is just a number but a whole host of health issues come with aging. From heart disease and dementia to hearing loss and arthritis, aging is a risk factor for numerous diseases and complications. The growing field of bioinformatics includes research into interventions that can help slow and reverse biological aging and prevent major age-related ailments. Data science could have a role to play in developing new methods to solve problems with diverse data, even if the number of samples is small.

Currently, models like XGBoost and random forest are used to predict medical conditions, yet their performance is not good enough. For critical problems where lives are on the line, models need to make correct predictions reliably and consistently across different cases.

Founded in 2015, competition host InVitro Cell Research, LLC (ICR) is a privately funded company focused on regenerative and preventive personalized medicine. Their offices and labs in the greater New York City area offer state-of-the-art research space. InVitro Cell Research's scientists are what set them apart, helping to guide and define their mission of researching how to repair aging people fast.

In this competition, you’ll work with measurements of health characteristic data to solve critical problems in bioinformatics. Based on minimal training, you’ll create a model to predict if a person has any of three medical conditions, with an aim to improve on existing methods.

You could help advance the growing field of bioinformatics and explore new methods to solve complex problems with diverse data.

## Importing Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing Datasets

In [None]:
trainset = pd.read_csv('train.csv')
trainset = trainset.drop_duplicates()  # drop exact duplicate rows, keeping the first occurrence




In [None]:
testset = pd.read_csv('test.csv')
testset = testset.drop_duplicates()  # drop exact duplicate rows, keeping the first occurrence


In [None]:
trainset.shape

(617, 58)

In [None]:
testset.shape

(5, 57)

Check for Nulls

In [None]:
trainset.isnull().sum()

Id        0
AB        0
AF        0
AH        0
AM        0
AR        0
AX        0
AY        0
AZ        0
BC        0
BD        0
BN        0
BP        0
BQ       60
BR        0
BZ        0
CB        2
CC        3
CD        0
CF        0
CH        0
CL        0
CR        0
CS        0
CU        0
CW        0
DA        0
DE        0
DF        0
DH        0
DI        0
DL        0
DN        0
DU        1
DV        0
DY        0
EB        0
EE        0
EG        0
EH        0
EJ        0
EL       60
EP        0
EU        0
FC        1
FD        0
FE        0
FI        0
FL        1
FR        0
FS        2
GB        0
GE        0
GF        0
GH        0
GI        0
GL        1
Class     0
dtype: int64

In [None]:
testset.isnull().sum()

Id     0
AB     0
AF     0
AH     0
AM     0
AR     0
AX     0
AY     0
AZ     0
BC     0
BD     0
BN     0
BP     0
BQ     0
BR     0
BZ     0
CB     0
CC     0
CD     0
CF     0
CH     0
CL     0
CR     0
CS     0
CU     0
CW     0
DA     0
DE     0
DF     0
DH     0
DI     0
DL     0
DN     0
DU     0
DV     0
DY     0
EB     0
EE     0
EG     0
EH     0
EJ     0
EL     0
EP     0
EU     0
FC     0
FD     0
FE     0
FI     0
FL     0
FR     0
FS     0
GB     0
GE     0
GF     0
GH     0
GI     0
GL     0
dtype: int64

In [None]:
(trainset['BQ'] == 0).sum()

0

In [None]:
# Note: a notebook cell displays only its last expression, so only the CC count is shown.
(trainset['CB'] == 0).sum()
(trainset['CC'] == 0).sum()

0

In [None]:
(trainset['EL'] == 0).sum()

0

In [None]:
# Note: a notebook cell displays only its last expression, so only the FS count is shown.
(trainset['FC'] == 0).sum()
(trainset['FL'] == 0).sum()
(trainset['FS'] == 0).sum()

0

In [None]:
(trainset['GL'] == 0).sum()

0

In [None]:
trainset.head()


Unnamed: 0,Id,AB,AF,AH,AM,AR,AX,AY,AZ,BC,...,FL,FR,FS,GB,GE,GF,GH,GI,GL,Class
0,000ff2bfdfe9,0.209377,3109.03329,85.200147,22.394407,8.138688,0.699861,0.025578,9.812214,5.555634,...,7.298162,1.73855,0.094822,11.339138,72.611063,2003.810319,22.136229,69.834944,0.120343,1
1,007255e47698,0.145282,978.76416,85.200147,36.968889,8.138688,3.63219,0.025578,13.51779,1.2299,...,0.173229,0.49706,0.568932,9.292698,72.611063,27981.56275,29.13543,32.131996,21.978,0
2,013f2bd269f5,0.47003,2635.10654,85.200147,32.360553,8.138688,6.73284,0.025578,12.82457,1.2299,...,7.70956,0.97556,1.198821,37.077772,88.609437,13676.95781,28.022851,35.192676,0.196941,0
3,043ac50845d5,0.252107,3819.65177,120.201618,77.112203,8.138688,3.685344,0.025578,11.053708,1.2299,...,6.122162,0.49706,0.284466,18.529584,82.416803,2094.262452,39.948656,90.493248,0.155829,0
4,044fb8a146ec,0.380297,3733.04844,85.200147,14.103738,8.138688,3.942255,0.05481,3.396778,102.15198,...,8.153058,48.50134,0.121914,16.408728,146.109943,8524.370502,45.381316,36.262628,0.096614,1


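Because the column headings are anonymized, I cannot tell what a missing value ('NA') means. Replacing it with a column average could introduce bias. The 'NA' might also mean "not applicable", in which case replacing it with 0 could be appropriate, but I cannot confirm this without consulting the data provider. I have therefore decided to drop the rows that contain missing values.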

In [None]:
new_data = trainset.dropna(axis=0, how='any')
new_data = new_data.drop(['Id'], axis=1)




In [None]:
new_data.shape


(548, 57)
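
Dropping incomplete rows reduces the training data from 617 to 548 rows. To make the trade-off concrete, the sketch below compares dropping rows with median imputation on a small made-up frame. The toy values and the choice of the median are assumptions for illustration only and are not part of the analysis above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for trainset; the values are made up for illustration.
toy = pd.DataFrame({
    "BQ": [1.0, np.nan, 3.0, 4.0],
    "EL": [10.0, 20.0, np.nan, 40.0],
    "Class": [0, 1, 0, 1],
})

# Option used in this notebook: drop every row with at least one missing value.
dropped = toy.dropna(axis=0, how="any")

# Alternative (assumption): fill each numeric column with its own median.
imputed = toy.fillna(toy.median(numeric_only=True))

rows_lost = len(toy) - len(dropped)
print(f"dropna keeps {len(dropped)} of {len(toy)} rows ({rows_lost} lost)")
print(imputed)
```

Whether imputation is acceptable still depends on what 'NA' means in each column, which is unknown here.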

The EJ column is categorical, so I will one-hot encode it with pandas get_dummies.

In [None]:
new_data = pd.get_dummies(new_data,columns=['EJ'], drop_first=False)
new_data.shape

(548, 58)

In [None]:
list(new_data.columns)


['AB',
 'AF',
 'AH',
 'AM',
 'AR',
 'AX',
 'AY',
 'AZ',
 'BC',
 'BD ',
 'BN',
 'BP',
 'BQ',
 'BR',
 'BZ',
 'CB',
 'CC',
 'CD ',
 'CF',
 'CH',
 'CL',
 'CR',
 'CS',
 'CU',
 'CW ',
 'DA',
 'DE',
 'DF',
 'DH',
 'DI',
 'DL',
 'DN',
 'DU',
 'DV',
 'DY',
 'EB',
 'EE',
 'EG',
 'EH',
 'EL',
 'EP',
 'EU',
 'FC',
 'FD ',
 'FE',
 'FI',
 'FL',
 'FR',
 'FS',
 'GB',
 'GE',
 'GF',
 'GH',
 'GI',
 'GL',
 'Class',
 'EJ_A',
 'EJ_B']

In [None]:
new_data.head()

Unnamed: 0,AB,AF,AH,AM,AR,AX,AY,AZ,BC,BD,...,FS,GB,GE,GF,GH,GI,GL,Class,EJ_A,EJ_B
0,0.209377,3109.03329,85.200147,22.394407,8.138688,0.699861,0.025578,9.812214,5.555634,4126.58731,...,0.094822,11.339138,72.611063,2003.810319,22.136229,69.834944,0.120343,1,0,1
1,0.145282,978.76416,85.200147,36.968889,8.138688,3.63219,0.025578,13.51779,1.2299,5496.92824,...,0.568932,9.292698,72.611063,27981.56275,29.13543,32.131996,21.978,0,1,0
2,0.47003,2635.10654,85.200147,32.360553,8.138688,6.73284,0.025578,12.82457,1.2299,5135.78024,...,1.198821,37.077772,88.609437,13676.95781,28.022851,35.192676,0.196941,0,0,1
3,0.252107,3819.65177,120.201618,77.112203,8.138688,3.685344,0.025578,11.053708,1.2299,4169.67738,...,0.284466,18.529584,82.416803,2094.262452,39.948656,90.493248,0.155829,0,0,1
4,0.380297,3733.04844,85.200147,14.103738,8.138688,3.942255,0.05481,3.396778,102.15198,5728.73412,...,0.121914,16.408728,146.109943,8524.370502,45.381316,36.262628,0.096614,1,0,1


Separate the data into X and y

In [None]:
# Every column except the target 'Class' is a feature
feature_cols = [col for col in new_data.columns if col != 'Class']
X = new_data[feature_cols].values
y = new_data['Class'].values

## Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
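
The labels are imbalanced (most samples are class 0), so a purely random split can leave the two parts with different class ratios. A minimal sketch of a stratified split, using made-up toy data rather than the competition data, might look like this. The `stratify` argument is a real `train_test_split` parameter; the toy arrays and their names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 negatives and 20 positives (made-up numbers).
X_toy = np.arange(200, dtype=float).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy asks the splitter to keep the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print("positive share, train:", y_tr.mean())
print("positive share, test: ", y_te.mean())
```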

In [None]:
print(X_train)

[[3.41840000e-01 2.39213370e+03 8.52001470e+01 ... 2.19780000e+01
  1.00000000e+00 0.00000000e+00]
 [3.41840000e-01 1.52480570e+03 8.52001470e+01 ... 1.58711199e-01
  0.00000000e+00 1.00000000e+00]
 [2.69199000e-01 9.66454830e+02 8.52001470e+01 ... 9.28729640e-02
  0.00000000e+00 1.00000000e+00]
 ...
 [5.04214000e-01 6.08931532e+03 8.52001470e+01 ... 2.19780000e+01
  1.00000000e+00 0.00000000e+00]
 [1.75193000e-01 1.92593280e+02 8.52001470e+01 ... 2.19780000e+01
  1.00000000e+00 0.00000000e+00]
 [3.93116000e-01 2.30447404e+03 8.52001470e+01 ... 2.19780000e+01
  1.00000000e+00 0.00000000e+00]]


In [None]:
print(y_train)

[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0
 0 0 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0 1 0 0 0 0 0
 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
 0 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 1 0 0]


In [None]:
print(X_test)

[[5.85401000e-01 3.60006097e+03 1.23241059e+02 ... 6.77621360e-02
  0.00000000e+00 1.00000000e+00]
 [3.71751000e-01 2.07222301e+03 1.20532560e+02 ... 2.22929782e-01
  0.00000000e+00 1.00000000e+00]
 [3.24748000e-01 4.98960924e+03 8.52001470e+01 ... 3.56400000e-01
  0.00000000e+00 1.00000000e+00]
 ...
 [3.29021000e-01 1.92593280e+02 8.52001470e+01 ... 2.19780000e+01
  1.00000000e+00 0.00000000e+00]
 [8.20416000e-01 3.45744795e+03 1.73657460e+02 ... 2.25720000e+00
  0.00000000e+00 1.00000000e+00]
 [2.09377000e-01 2.98122825e+03 1.44551982e+02 ... 7.75945950e-02
  0.00000000e+00 1.00000000e+00]]


In [None]:
print(y_test)

[1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0]



## As this is a classification problem, I will use Kernel SVM

In [None]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', C = 0.75, gamma= 0.1, random_state = 0)

classifier.fit(X_train, y_train)
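
RBF-kernel SVMs are sensitive to feature scale, and the feature values printed above span several orders of magnitude (values near 0.3 sit next to values in the thousands). One common remedy, sketched here on synthetic stand-in data, is to standardize the features inside a pipeline before the SVM. This is an illustration under assumptions, not the model fitted in this notebook; the data, variable names, and default hyperparameters are all hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the competition data): two well-separated
# classes whose features live on very different scales.
rng = np.random.default_rng(0)
n = 200
y_demo = np.array([0] * (n // 2) + [1] * (n // 2))
X_demo = rng.normal(size=(n, 3)) + 4.0 * y_demo[:, None]
X_demo = X_demo * np.array([0.01, 1.0, 1000.0])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# The scaler is fitted on the training split only, inside the pipeline,
# and the same transformation is reused at prediction time.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=0))
scaled_svm.fit(X_tr, y_tr)

print("accuracy with scaling:", scaled_svm.score(X_te, y_te))
```

Wrapping the scaler and the classifier in one pipeline keeps the test split from influencing the scaling statistics.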

## Predicting the Test set results

In [None]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 1]
 [0 0]
 [0 1]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 0]
 [0 1]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]]


## Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[110   0]
 [ 27   0]]


0.8029197080291971
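
The confusion matrix shows that every sample was predicted as class 0, so the 80.29% accuracy equals the share of class 0 in the held-out split (110 of 137). The sketch below rebuilds labels that match that matrix and computes metrics that are less forgiving of this behaviour. The arrays are reconstructed from the printed counts, not taken from the original variables.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Rebuild labels that reproduce the confusion matrix printed above:
# 110 true negatives, and 27 positives that were all predicted as 0.
y_true = np.array([0] * 110 + [1] * 27)
y_hat = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_hat)
rec = recall_score(y_true, y_hat)            # recall for class 1
bal = balanced_accuracy_score(y_true, y_hat)  # mean of per-class recall

print("accuracy:         ", acc)
print("recall (class 1): ", rec)
print("balanced accuracy:", bal)
```

A recall of 0 for class 1 means the model finds none of the positive cases, which plain accuracy hides on imbalanced data.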

Predict on the test file data. First, drop the Id column from the testset dataframe.

In [None]:
ntestset = testset.drop(['Id'], axis=1)

The 'EJ' column is categorical, so it needs the same get_dummies encoding before the classifier can be used.

In [None]:
newtest_set = pd.get_dummies(ntestset,columns=['EJ'], drop_first=False)
newtest_set.shape

(5, 56)

In [None]:
newtest_set.head()

    AB   AF   AH   AM   AR   AX   AY   AZ   BC  BD   ...   FL   FR   FS   GB  \
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0
4  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0

    GE   GF   GH   GI   GL  EJ_A
0  0.0  0.0  0.0  0.0  0.0     1
1  0.0  0.0  0.0  0.0  0.0     1
2  0.0  0.0  0.0  0.0  0.0     1
3  0.0  0.0  0.0  0.0  0.0     1
4  0.0  0.0  0.0  0.0  0.0     1

[5 rows x 56 columns]

As the encoded test set has no EJ_B column, I will add this column and populate it with 0.

In [None]:
newtest_set['EJ_B'] = 0  # no test row has EJ == 'B', so the dummy column is all zeros
newtest_set.head()

    AB   AF   AH   AM   AR   AX   AY   AZ   BC  BD   ...   FR   FS   GB   GE  \
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0
4  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0

    GF   GH   GI   GL  EJ_A  EJ_B
0  0.0  0.0  0.0  0.0     1     0
1  0.0  0.0  0.0  0.0     1     0
2  0.0  0.0  0.0  0.0     1     0
3  0.0  0.0  0.0  0.0     1     0
4  0.0  0.0  0.0  0.0     1     0

[5 rows x 57 columns]
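
Adding a missing dummy column by hand works for this small test file, but it depends on knowing which categories are absent. A more general approach, sketched below on hypothetical miniature frames, is to reindex the encoded test frame to the training columns so that any missing dummy columns are created with 0 and the column order matches. The frame names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the raw training and test features.
train_raw = pd.DataFrame({"AB": [0.2, 0.1, 0.4], "EJ": ["B", "A", "B"]})
test_raw = pd.DataFrame({"AB": [0.0, 0.0], "EJ": ["A", "A"]})

train_enc = pd.get_dummies(train_raw, columns=["EJ"])
test_enc = pd.get_dummies(test_raw, columns=["EJ"])  # only EJ_A appears here

# Reindex the test frame to the training columns: missing dummy columns
# are created and filled with 0, and the column order matches training.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```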

In [None]:
entry_y_pred = classifier.predict(newtest_set)
entry_y_pred



array([0, 0, 0, 0, 0])

In [None]:
len(entry_y_pred)

5

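The submission file requires a predicted probability for each class. As a rough stand-in, I use the 80.29% accuracy from the held-out split: 0.8 for the predicted class (class 0) and 0.2 for class 1. Note that overall accuracy is not a per-sample probability, so these values are only a crude constant estimate.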

In [None]:
# Constant probability estimates: 0.8 for class 0 (the predicted class), 0.2 for class 1
class_0_prob = [0.8] * len(testset)
class_1_prob = [0.2] * len(testset)

In [None]:
submission = pd.DataFrame(list(zip(testset['Id'], class_0_prob, class_1_prob)), columns=['Id', 'class_0', 'class_1'])

In [None]:



submission.head()

             Id  class_0  class_1
0  00eed32682bb      0.8      0.2
1  010ebe33f668      0.8      0.2
2  02fa521e1838      0.8      0.2
3  040e15f562a2      0.8      0.2
4  046e85c7cc7f      0.8      0.2
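
A constant 0.8/0.2 gives every row the same estimate. As a possible alternative, a classifier can be wrapped in scikit-learn's `CalibratedClassifierCV` to obtain per-sample class probabilities through `predict_proba`. The sketch below uses synthetic stand-in data and placeholder Ids; it is an assumption about how this could be done, not the approach used for the submission above.

```python
import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

# Synthetic stand-in data; the real features and Ids are not used here.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 40 + [1] * 40)
X_demo = rng.normal(size=(80, 2)) + 3.0 * y_demo[:, None]

# Sigmoid (Platt) calibration fitted with 5-fold cross-validation maps the
# SVM decision values to probabilities.
prob_svm = CalibratedClassifierCV(SVC(kernel="rbf"), method="sigmoid", cv=5)
prob_svm.fit(X_demo, y_demo)

# One row per sample, one column per class, in the order of prob_svm.classes_.
proba = prob_svm.predict_proba(X_demo[:5])

demo_submission = pd.DataFrame({
    "Id": [f"row_{i}" for i in range(5)],  # placeholder Ids
    "class_0": proba[:, 0],
    "class_1": proba[:, 1],
})
print(demo_submission)
```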

Create and download submission.csv

In [None]:
submission.to_csv('submission.csv',index=False)

In [None]:
from google.colab import files
files.download('submission.csv')

