## Dataset information as per Kaggle (https://www.kaggle.com/competitions/icr-identify-age-related-conditions/data)

1. **train.csv** - The training set.
   * `Id` Unique identifier for each observation.
   * `AB-GL` Fifty-six anonymized health characteristics. All are numeric except for EJ, which is categorical.
   * `Class` A binary target: 1 indicates the subject has been diagnosed with one of the three conditions, 0 indicates they have not.
2. **test.csv** - The test set. Your goal is to predict the probability that a subject in this set belongs to each of the two classes.
3. **greeks.csv** - Supplemental metadata, only available for the training set.
   * `Alpha` Identifies the type of age-related condition, if present.
     * `A` No age-related condition. Corresponds to class 0.
     * `B`, `D`, `G` The three age-related conditions. Correspond to class 1.
   * `Beta`, `Gamma`, `Delta` Three experimental characteristics.
   * `Epsilon` The date the data for this subject was collected. Note that all of the data in the test set was collected after the training set was collected.
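
The relationship between `Alpha` and `Class` described above can be checked directly once the two files are joined. The sketch below uses a small hypothetical frame (the `toy` rows are invented for illustration, not taken from the competition data) and assumes lower-cased column names, as used later in this notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the merged train/greeks data.
toy = pd.DataFrame({
    "alpha": ["A", "B", "A", "D", "G"],
    "class": [0, 1, 0, 1, 1],
})

# Derive the binary target from Alpha: 'A' -> 0, any of 'B', 'D', 'G' -> 1.
derived_class = (toy["alpha"] != "A").astype(int)

# True only when Alpha and Class agree on every row.
labels_consistent = bool((derived_class == toy["class"]).all())
print(labels_consistent)
```

Running the same comparison on the real merged frame is a cheap sanity check before training on either target.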

In [8]:
# standard
import pandas as pd
import numpy as np
import random
import os

# tf, keras, and sklearn
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from keras import models
from keras import layers
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.manifold import TSNE


# plots and images
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image

In [9]:
greeks = pd.read_csv('../data/greeks.csv')
test = pd.read_csv('../data/test.csv')
train = pd.read_csv('../data/train.csv')

greeks.columns = map(str.lower, greeks.columns)
test.columns = map(str.lower, test.columns)
train.columns = map(str.lower, train.columns)

print('Shape of greeks df:', greeks.shape)
print('Shape of test df:', test.shape)
print('Shape of train df:', train.shape)

# Merge train and greeks to get all columns in the same DataFrame
df = pd.merge(train, greeks, on='id')

Shape of greeks df: (617, 6)
Shape of test df: (5, 57)
Shape of train df: (617, 58)
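
The merge above relies on each `id` appearing exactly once in both files, which the matching row counts (617 each) suggest but do not prove. A minimal sketch of a stricter merge, using hypothetical toy frames (`train_toy` and `greeks_toy` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical stand-ins for train.csv and greeks.csv.
train_toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "ab": [0.2, 0.5, 0.1],
    "class": [0, 1, 0],
})
greeks_toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "alpha": ["A", "B", "A"],
})

# validate="one_to_one" raises pandas.errors.MergeError if an id is duplicated on either side.
merged = pd.merge(train_toy, greeks_toy, on="id", how="inner", validate="one_to_one")

# An inner join silently drops unmatched ids, so compare row counts explicitly.
rows_preserved = len(merged) == len(train_toy)
print(merged.shape, rows_preserved)
```

The explicit row-count comparison matters because an inner join does not report ids that are missing from one side.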


In [10]:
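# Encode the categorical EJ column as 0/1.
# Assign the result back instead of using inplace=True on a column selection,
# which may not update df under pandas copy-on-write.
df['ej'] = df['ej'].map({'A': 0, 'B': 1})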

# Multiclass
target_variable_multiclass = df['alpha']

# Binary
target_variable = df['class']
features_variable = df.drop(['class', 'id', 'alpha', 'beta', 'gamma', 'delta', 'epsilon'], axis=1)

In [11]:
# Fill in NaN values
imputer = KNNImputer(n_neighbors=2)
features_variable = imputer.fit_transform(features_variable)

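# Standardize values
standardized_data = StandardScaler().fit_transform(features_variable)

# t-SNE embedding into 12 dimensions (new coordinates, not a subset of the original features).
# method='exact' is needed because the default Barnes-Hut method only supports n_components < 4.
# Note: newer scikit-learn releases rename n_iter to max_iter.
model = TSNE(n_components=12, random_state=0, perplexity=50, n_iter=5000, method='exact')

# Get our features
tsne_features = model.fit_transform(standardized_data)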
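
To make the two preprocessing steps above concrete, here is a minimal sketch on a hypothetical 4x3 array (the values in `X` are invented for illustration). It checks that `KNNImputer` leaves no missing values and that `StandardScaler` gives each column zero mean and unit variance.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix with two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 6.0],
    [3.0, 6.0, 9.0],
    [4.0, 8.0, 12.0],
])

# Each NaN is replaced by the mean of that column over the 2 nearest rows.
imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# Rescale every column to zero mean and unit (population) standard deviation.
scaled = StandardScaler().fit_transform(imputed)

print(np.isnan(imputed).any())
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Both transformers are fitted on the full array here, mirroring the cell above; when a held-out set is involved, fitting on the training rows only avoids leaking information.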

In [17]:
tsne_features = pd.DataFrame(tsne_features)
display(tsne_features.head())
display(tsne_features.shape)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.396739 | 1.418988 | 6.536719 | -0.599304 | 3.496097 | 0.39731 | 2.572014 | -2.525381 | -2.697865 | -2.430291 | 6.295337 | -6.202816 |
| 1 | -0.145949 | 0.773972 | 0.601172 | 2.354358 | 3.173093 | 0.493702 | 0.384078 | 4.039861 | -1.398307 | -8.212885 | -7.352618 | 1.137156 |
| 2 | -0.914068 | -9.540289 | -2.439326 | -1.074276 | -2.873172 | 5.748519 | 1.505882 | -1.919561 | 2.055977 | 0.926398 | 3.344955 | 4.421085 |
| 3 | 0.213694 | -0.851135 | -2.53149 | 0.367552 | 0.452652 | -4.702112 | -2.974192 | -0.29176 | 4.750243 | 0.70752 | -0.912372 | 7.766903 |
| 4 | -4.719997 | 3.157185 | -0.555827 | 4.228339 | -4.961008 | -2.026945 | -0.174429 | -1.053386 | -6.764446 | -0.988785 | -0.237722 | -0.250103 |


(617, 12)