# Data Processing

Data visualization, cleaning, division, and normalization.

* Make small adjustments to the columns, such as renaming
* Disease parameter: DMD/Cnt -> 1/0
* Remove rows if:
  * Sample.ID's value is "BLANK", "POOL 1" or "POOL 2"
  * Disease or Sample.ID value for the row is missing
* If there are multiple rows with the same Sample.ID, drop the duplicate rows with fewer value entries
  * Might need to evaluate manually or reevaluate the heuristics (which proteins to prioritize over others)
* Remove columns of antibody concentrations if there are no data entries

* TODO: debug, continue with the later points
* Potentially normalize intensities: [-1, 1] or [0, 1]
* Create a new column for LoA based on FT1-5

* Reorder the columns so that each row becomes a vector of the form <<METADATA>,<Y>,<X>> where <X> is [LoA], <Y> is the intensities (and possibly age later) and <METADATA> is everything else.

* Run an SVM with our input data
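
The normalization and column-reordering points in the list above are not implemented later in this notebook. A minimal sketch of how they might look, on a toy frame: the column names `HPA_a`, `HPA_b` and the `LoA` values are made up for illustration (how LoA is derived from FT1-5 is not defined here), and per-column scaling is an assumption, since a single global min/max would be the other option.

```python
import pandas as pd

# Toy stand-in for the real dataset; all values are invented
toy = pd.DataFrame({
    "Sample_ID": ["S1", "S2", "S3"],
    "LoA": [0, 1, 0],               # hypothetical placeholder values
    "HPA_a": [12.0, 12.5, 13.0],    # hypothetical intensity column
    "HPA_b": [11.0, 13.0, 12.0],    # hypothetical intensity column
})
intensity_cols = ["HPA_a", "HPA_b"]
loa_cols = ["LoA"]
metadata_cols = [c for c in toy.columns if c not in intensity_cols + loa_cols]

# Per-column min-max scaling of the intensities to [0, 1]
col_min = toy[intensity_cols].min()
col_max = toy[intensity_cols].max()
scaled = toy.copy()
scaled[intensity_cols] = (toy[intensity_cols] - col_min) / (col_max - col_min)

# Map [0, 1] to [-1, 1] if a symmetric range is preferred
symmetric = scaled.copy()
symmetric[intensity_cols] = scaled[intensity_cols] * 2 - 1

# Reorder to <METADATA>, <intensities>, <LoA>
reordered = scaled[metadata_cols + intensity_cols + loa_cols]
print(reordered)
```

Scaling parameters computed this way depend on the samples at hand, so they would need to be stored and reused for any new samples.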

## Environment to read and process DMD data

Imports used for data processing and cleaning

In [98]:
import pandas as pd

Load the CSV file into a pandas DataFrame

In [99]:
dataset = pd.read_csv('normalised_data_all_w_clinical_kex_20240321.csv')
dataset.head()  # Preview the first five rows

Unnamed: 0,ID,Sample.ID,Participant.ID,dataset,Disease,patregag,TREAT,FT1,FT2,FT3,...,HPA049320_SBA4_rep1,HPA051620_SBA4_rep1,HPA054862_SBA4_rep1,HPA003901_SBA4_rep1,HPA035863_SBA4_rep1,HPA040052_SBA4_rep1,HPA041542_SBA4_rep1,HPA044582_SBA4_rep1,HPA045702_SBA4_rep1,HPA048982_SBA4_rep1
0,1,S87,P3,DNHS,DMD,11.61,,0.0,0.0,,...,12.476497,12.666707,12.3314,12.410101,12.504796,12.580695,12.574592,12.797013,12.492433,12.622226
1,2,S10,P3,DNHS,DMD,12.62,,,,,...,12.494297,12.564169,12.354718,12.581718,12.441237,12.583769,12.579893,12.660894,12.450384,12.704464
2,3,S19,P3,DNHS,DMD,13.62,,,,,...,12.554581,12.70724,12.460004,12.610387,12.504661,12.556319,12.679701,12.824755,12.517335,12.83697
3,4,S137,P4,DNHS,DMD,11.43,,,,,...,12.455811,12.505603,12.416459,12.510771,12.472934,12.556096,12.47376,12.74402,12.511776,12.666065
4,5,S182,P4,DNHS,DMD,13.25,,,,,...,12.523768,12.598164,12.369592,12.605211,12.523768,12.503911,12.557672,12.926602,12.518327,12.668579


## First Data Visualization & Processing

In [100]:

# Dictionary for value conversion
token_to_val = {
    "DMD": 1,
    "Cnt": 0
}
# Rename columns
dataset.rename(columns={
    dataset.columns[0]: 'ID',
    'Sample.ID': 'Sample_ID',
    'Participant.ID': 'Participant_ID',
    'dataset': 'Dataset',
    'patregag': 'Age',
}, inplace=True)

# Replace the string values in the column using the mapping in token_to_val
dataset['Disease'] = dataset['Disease'].replace(token_to_val)

# Verify the change
print(dataset.columns)
dataset.head()

Index(['ID', 'Sample_ID', 'Participant_ID', 'Dataset', 'Disease', 'Age',
       'TREAT', 'FT1', 'FT2', 'FT3',
       ...
       'HPA049320_SBA4_rep1', 'HPA051620_SBA4_rep1', 'HPA054862_SBA4_rep1',
       'HPA003901_SBA4_rep1', 'HPA035863_SBA4_rep1', 'HPA040052_SBA4_rep1',
       'HPA041542_SBA4_rep1', 'HPA044582_SBA4_rep1', 'HPA045702_SBA4_rep1',
       'HPA048982_SBA4_rep1'],
      dtype='object', length=1039)




Unnamed: 0,ID,Sample_ID,Participant_ID,Dataset,Disease,Age,TREAT,FT1,FT2,FT3,...,HPA049320_SBA4_rep1,HPA051620_SBA4_rep1,HPA054862_SBA4_rep1,HPA003901_SBA4_rep1,HPA035863_SBA4_rep1,HPA040052_SBA4_rep1,HPA041542_SBA4_rep1,HPA044582_SBA4_rep1,HPA045702_SBA4_rep1,HPA048982_SBA4_rep1
0,1,S87,P3,DNHS,1.0,11.61,,0.0,0.0,,...,12.476497,12.666707,12.3314,12.410101,12.504796,12.580695,12.574592,12.797013,12.492433,12.622226
1,2,S10,P3,DNHS,1.0,12.62,,,,,...,12.494297,12.564169,12.354718,12.581718,12.441237,12.583769,12.579893,12.660894,12.450384,12.704464
2,3,S19,P3,DNHS,1.0,13.62,,,,,...,12.554581,12.70724,12.460004,12.610387,12.504661,12.556319,12.679701,12.824755,12.517335,12.83697
3,4,S137,P4,DNHS,1.0,11.43,,,,,...,12.455811,12.505603,12.416459,12.510771,12.472934,12.556096,12.47376,12.74402,12.511776,12.666065
4,5,S182,P4,DNHS,1.0,13.25,,,,,...,12.523768,12.598164,12.369592,12.605211,12.523768,12.503911,12.557672,12.926602,12.518327,12.668579


In [101]:
# Give the control group (non-DMD) the default value 34 (top score) on FT5
in_control = dataset['Disease'] == 0
dataset.loc[in_control, 'FT5'] = 34

# Verify the change
print(dataset.iloc[:15, 7:12])

         FT1       FT2       FT3    FT4   FT5
0   0.000000  0.000000       NaN    NaN   NaN
1        NaN       NaN       NaN    NaN   NaN
2        NaN       NaN       NaN    NaN   NaN
3        NaN       NaN       NaN    NaN   NaN
4        NaN       NaN       NaN    NaN   NaN
5   2.538071  0.628931  0.367647  488.7  32.0
6   2.531646  0.492611  0.275482  475.0  29.0
7        NaN       NaN       NaN    NaN   NaN
8   2.237136  0.346021  0.176991  449.0  31.0
9        NaN       NaN       NaN    NaN  34.0
10       NaN       NaN       NaN    NaN  34.0
11       NaN       NaN       NaN    NaN  34.0
12       NaN       NaN       NaN    NaN  34.0
13       NaN       NaN       NaN    NaN  34.0
14       NaN       NaN       NaN    NaN  34.0


## Column Based Data Clean-up

In [102]:
def calculate_column_value_percentage(df, start_column=1):
    """
    Calculates the percentage of actual (non-NA) data points for each column in a pandas DataFrame,
    from a given start column through the last column.

    :param df: A pandas DataFrame with potential NA values.
    :param start_column: The first column of the interval (1-based index).
    :return: A pandas Series with the percentage of non-NA values for each column in the interval.
    """
    # Adjust for 0-based indexing
    start_index = max(0, start_column - 1)

    # Select only the columns within the specified interval
    interval_df = df.iloc[:, start_index:]

    # Calculate the total number of non-NA values for each column
    value_counts = interval_df.count()

    # Calculate the total number of rows (to handle potential NA rows)
    total_rows = len(df)

    # Calculate the percentage of non-NA values for each column
    value_percentage = (value_counts / total_rows) * 100

    return value_percentage
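
The same per-column percentage can also be obtained in one expression, because the mean of a boolean `notna()` mask is the fraction of filled entries. A small sketch on invented toy data (the column names are placeholders, not columns from the real file):

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values; names and values are invented
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "prot_a": [1.0, np.nan, 3.0, 4.0],
    "prot_b": [np.nan, np.nan, np.nan, 2.0],
})

# notna() gives a boolean frame; the column-wise mean of booleans is the
# fraction of non-NA entries, so multiplying by 100 gives the percentage.
# iloc[:, 1:] skips the first column, like start_column=2 would.
value_percentage = toy.iloc[:, 1:].notna().mean() * 100
print(value_percentage)
```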

In [103]:
# Calculate column statistics for low content columns
value_percentage = calculate_column_value_percentage(dataset, 15)
limit = 50
low_percentage_columns = value_percentage[value_percentage < limit]

# Visualize status
num = 0
for column, percentage in low_percentage_columns.items():
    print(f"Column {column} has {percentage:.2f}% values")
    num += 1

print(f"We have {num} proteins with less than {limit}% datapoints")

Column HPA003948_SBA1_rep1 has 0.00% values
Column HPA059806_SBA1_rep1 has 11.20% values
Column HPA055893_SBA2_rep1 has 41.15% values
Column HPA035933_SBA2_rep1 has 33.07% values
Column HPA009426_SBA2_rep1 has 32.29% values
Column HPA057437_SBA3_rep1 has 0.00% values
Column HPA003909_SBA3_rep1 has 15.89% values
Column HPA015774_SBA3_rep1 has 42.45% values
Column HPA003223_SBA3_rep1 has 6.77% values
Column HPA003948_SBA3_rep1 has 0.00% values
Column Empty_SBA3_rep1 has 8.85% values
Column HPA040591_SBA3_rep1 has 30.99% values
Column HPA034960_SBA3_rep1 has 0.00% values
Column HPA021513_SBA3_rep1 has 43.75% values
Column HPA058513_SBA3_rep1 has 0.00% values
Column HPA036287_SBA3_rep1 has 11.46% values
Column HPA073315_SBA3_rep1 has 13.54% values
Column HPA041863_SBA3_rep1 has 34.38% values
Column HPA004712_SBA3_rep1 has 34.64% values
Column HPA074922_SBA3_rep1 has 0.00% values
Column HPA000837_SBA4_rep1 has 0.00% values
Column HPA001482_SBA4_rep1 has 0.00% values
Column HPA000293_SBA4_re

In [104]:
# Remove the low-content columns (less than `limit` percent non-NA values)
columns_to_drop = low_percentage_columns.index
print("Columns to drop:", columns_to_drop)

# Check changes
print("Before drop:", dataset.shape)
dataset.drop(labels=columns_to_drop, axis="columns", inplace=True)
print("After drop:", dataset.shape)

Columns to drop: Index(['HPA003948_SBA1_rep1', 'HPA059806_SBA1_rep1', 'HPA055893_SBA2_rep1',
       'HPA035933_SBA2_rep1', 'HPA009426_SBA2_rep1', 'HPA057437_SBA3_rep1',
       'HPA003909_SBA3_rep1', 'HPA015774_SBA3_rep1', 'HPA003223_SBA3_rep1',
       'HPA003948_SBA3_rep1', 'Empty_SBA3_rep1', 'HPA040591_SBA3_rep1',
       'HPA034960_SBA3_rep1', 'HPA021513_SBA3_rep1', 'HPA058513_SBA3_rep1',
       'HPA036287_SBA3_rep1', 'HPA073315_SBA3_rep1', 'HPA041863_SBA3_rep1',
       'HPA004712_SBA3_rep1', 'HPA074922_SBA3_rep1', 'HPA000837_SBA4_rep1',
       'HPA001482_SBA4_rep1', 'HPA000293_SBA4_rep1', 'HPA007982_SBA4_rep1',
       'HPA028190_SBA4_rep1', 'HPA031466_SBA4_rep1', 'HPA040972_SBA4_rep1',
       'HPA001526_SBA4_rep1', 'HPA002021_SBA4_rep1', 'HPA010558_SBA4_rep1',
       'HPA013390_SBA4_rep1', 'HPA020610_SBA4_rep1', 'HPA028657_SBA4_rep1',
       'HPA030651_SBA4_rep1', 'HPA000226_SBA4_rep1', 'HPA007316_SBA4_rep1',
       'HPA008128_SBA4_rep1', 'HPA064736_SBA4_rep1', 'HPA041991_SBA4_rep1',
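
As a possible shortcut, `DataFrame.dropna` with `thresh` can drop sparse columns directly. The sketch below uses invented toy data and applies the rule to every column, whereas the cells above only evaluate the columns from position 15 onward, so it is an illustration of the idea rather than a drop-in replacement.

```python
import math

import numpy as np
import pandas as pd

# Toy frame; names and values are invented
toy = pd.DataFrame({
    "Sample_ID": ["S1", "S2", "S3", "S4"],
    "prot_a": [1.0, np.nan, 3.0, 4.0],        # 75% filled
    "prot_b": [np.nan, np.nan, np.nan, 2.0],  # 25% filled
    "prot_c": [1.0, 2.0, np.nan, np.nan],     # 50% filled
})
limit = 50  # minimum percentage of non-NA values a column must have

# thresh is the minimum number of non-NA values required to keep a column
min_count = math.ceil(limit / 100 * len(toy))
kept = toy.dropna(axis="columns", thresh=min_count)
print(kept.columns.tolist())
```

Columns at exactly `limit` percent are kept, which matches the strict `<` comparison used when building `low_percentage_columns`.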

In [105]:
# Remove superfluous data and calibration columns
print("Before drop:", dataset.shape)
dataset.drop(labels=['TREAT', 'Plate', 'Location', 'Empty_SBA1_rep1', 'Rabbit.IgG_SBA1_rep1'], axis='columns', inplace=True)
print("After drop:", dataset.shape)

Before drop: (384, 981)
After drop: (384, 976)


## Row Based Data Clean-up

In [106]:
def remove_wrong_value_rows(df, column_name, wrong_val):
    """
    Removes rows from the DataFrame where the specified column has one of the specified wrong values.
    The DataFrame is modified in place and also returned.

    :param df: A pandas DataFrame from which rows will be removed.
    :param column_name: The name of the column to check for the wrong value.
    :param wrong_val: A value, or a list of values, considered wrong in the specified column.
    :return: The pandas DataFrame with rows containing a wrong value in the specified column removed.
    """
    # Accept a single string as well as a list of values
    if isinstance(wrong_val, str):
        wrong_val = [wrong_val]

    # Find indices of rows with a wrong value (checked on df, not on the global dataset)
    indices_to_drop = df.index[df[column_name].isin(wrong_val)]
    # Drop these rows
    df.drop(indices_to_drop, inplace=True)
    return df

In [107]:
# Drop rows with invalid sample data
print("Before drop:", dataset.shape)
dataset = remove_wrong_value_rows(dataset, 'Sample_ID', ['BLANK', 'POOL 1', 'POOL 2'])
print("After drop:", dataset.shape)

Before drop: (384, 976)
After drop: (372, 976)


In [108]:
# Drop rows with NaN in the row's key values
print("Before drop:", dataset.shape)
dataset.dropna(subset=['Sample_ID','Disease'], inplace=True)
print("After drop:", dataset.shape)

Before drop: (372, 976)
After drop: (357, 976)


#### Handle sample duplicates

In [109]:
def get_duplicate_indecies(df, columns):
    """
    Find indices of all rows whose values in the specified column(s) occur more than once.
    """
    duplicate = df.duplicated(subset=columns, keep=False)
    return df.index[duplicate]

In [110]:
def calculate_row_value_percentage(df, start_column=1):
    """
    Calculates the percentage of actual (non-NA) data points for each row in a pandas DataFrame,
    from a given start column through the last column.

    :param df: A pandas DataFrame with potential NA values.
    :param start_column: The first column of the interval (1-based index).
    :return: A pandas Series with the percentage of non-NA values for each row.
    """
    # Adjust for 0-based indexing
    start_index = max(0, start_column - 1)

    # Select only the columns within the specified interval
    interval_df = df.iloc[:, start_index:]

    # Calculate the number of non-NA values per row within the interval
    value_counts_per_row = interval_df.notna().sum(axis=1)

    # Calculate the total number of columns in the interval
    total_columns = interval_df.shape[1]

    # Calculate the percentage of non-NA values for each row
    value_percentage_per_row = (value_counts_per_row / total_columns) * 100

    return value_percentage_per_row

In [111]:
def remove_duplicate_rows(df, duplicate_indexes, row_val_percentages):
    """
    For each group of rows sharing a Sample_ID, keeps the row with the highest
    percentage of non-NA values and drops the others in place.
    """
    for i in duplicate_indexes:
        # Skip rows that were already dropped in an earlier iteration
        if i not in df.index:
            continue

        # Look up the duplicate's Sample_ID by index label (not by position)
        sample_ID = df.loc[i, 'Sample_ID']

        # Find all row indices where this Sample_ID occurs
        duplicate_sample_ID_indices = df.index[df['Sample_ID'] == sample_ID]

        # Find which of these rows has the highest percentage in row_val_percentages
        # (on ties, the first row wins)
        best_index = row_val_percentages.loc[duplicate_sample_ID_indices].idxmax()

        # Drop every duplicate except the best one
        df.drop(index=duplicate_sample_ID_indices.drop(best_index), inplace=True)
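
The same "keep the most complete row per Sample_ID" rule can also be expressed without an explicit loop. This is a sketch on invented toy data, assuming the measurement columns are known by name; the helper column `_completeness` is made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with duplicated Sample_IDs; values are invented
toy = pd.DataFrame({
    "Sample_ID": ["S1", "S1", "S2", "S3", "S3"],
    "p1": [1.0, 1.0, 2.0, np.nan, 3.0],
    "p2": [np.nan, 4.0, 5.0, np.nan, 6.0],
})

# Count non-NA entries per row over the measurement columns only
completeness = toy[["p1", "p2"]].notna().sum(axis=1)

# Sort so the most complete row of each Sample_ID comes first,
# keep that row, then restore the original row order
deduplicated = (
    toy.assign(_completeness=completeness)
       .sort_values("_completeness", ascending=False, kind="stable")
       .drop_duplicates(subset="Sample_ID", keep="first")
       .drop(columns="_completeness")
       .sort_index()
)
print(deduplicated)
```

The stable sort keeps the earlier row when two duplicates are equally complete, which mirrors the tie-breaking of the loop-based version.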

In [112]:
# Remove duplicate rows for same Sample_ID
duplicate_indexes = get_duplicate_indecies(dataset, 'Sample_ID')
row_val_percentages = calculate_row_value_percentage(dataset, start_column=15)

# Check changes
print("Before drop:", dataset.shape)
remove_duplicate_rows(dataset, duplicate_indexes, row_val_percentages)
print("After drop:", dataset.shape)

Before drop: (357, 976)
After drop: (342, 976)


##### Row handling based on FT5 (Should consider data generation based on age/other tests)

In [113]:
# Drop rows with NaN values in the FT5 column and check the change
print("Before drop:", dataset.shape)
dataset.dropna(subset=['FT5'], inplace=True)
print("After drop:", dataset.shape)

dataset.head(15)

Before drop: (342, 976)
After drop: (301, 976)


Unnamed: 0,ID,Sample_ID,Participant_ID,Dataset,Disease,Age,FT1,FT2,FT3,FT4,...,HPA049320_SBA4_rep1,HPA051620_SBA4_rep1,HPA054862_SBA4_rep1,HPA003901_SBA4_rep1,HPA035863_SBA4_rep1,HPA040052_SBA4_rep1,HPA041542_SBA4_rep1,HPA044582_SBA4_rep1,HPA045702_SBA4_rep1,HPA048982_SBA4_rep1
5,6,S237,P5,DNHS,1.0,11.72,2.538071,0.628931,0.367647,488.7,...,12.543301,12.608368,,,12.598509,12.452527,12.495444,12.800951,12.696679,12.602999
6,7,S220,P5,DNHS,1.0,13.41,2.531646,0.492611,0.275482,475.0,...,12.390816,12.571472,12.389882,12.490296,12.522449,12.519723,12.413517,12.917533,12.514924,12.618601
8,9,S91,P6,DNHS,1.0,11.76,2.237136,0.346021,0.176991,449.0,...,12.464468,12.599863,12.459217,12.509688,12.482434,12.507582,12.593243,12.733109,12.554774,12.587261
9,10,S49,P134,DNHS,0.0,,,,,,...,12.560804,12.531565,12.490397,12.419143,12.585085,12.509866,12.386933,12.820761,12.532538,12.661084
16,17,S79,P134,DNHS,0.0,,,,,,...,12.583934,12.599499,12.52721,12.636333,12.440678,12.555278,12.617759,12.909895,12.578293,12.685735
18,19,S70,P135,DNHS,0.0,,,,,,...,12.456166,12.597872,12.496037,12.652614,12.457952,12.487103,12.521334,12.823034,12.541253,12.606741
21,22,S2,P135,DNHS,0.0,,,,,,...,12.481201,12.621694,12.415145,12.616261,12.450933,12.536485,12.559103,12.795596,12.552749,12.618074
26,27,S83,P136,DNHS,0.0,,,,,,...,12.403677,12.545281,,12.439043,12.489224,12.565305,12.60043,12.764371,12.561835,
27,28,S41,P136,DNHS,0.0,,,,,,...,12.391321,12.450349,12.378164,12.358069,12.510605,12.474641,12.471794,12.904288,12.526909,
31,32,S276,P7,DNHS,1.0,10.9,1.461988,0.155763,0.114155,317.0,...,12.47645,12.425208,,12.397139,12.523445,12.530703,12.50394,12.791145,12.471993,12.604703


# Northstar Prediction Estimation

## Feature Selection Method

Handle Missing Values:
Before feature selection, ensure all NaN values are handled appropriately, either by imputation or by removing rows/columns with NaN values.

In [114]:
from sklearn.impute import SimpleImputer

# Isolate relevant data: FT5 (column 10) followed by the protein intensities
df = dataset.iloc[:, 10:]

# Option 1: Impute missing values (e.g., with the mean of each column)
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Option 2: Drop rows with any NaN values (use with caution)
# df_dropped = dataset.dropna()

In [115]:
from sklearn.model_selection import train_test_split

y = df_imputed['FT5'] # Vector for the target variable
X = df_imputed.iloc[:, 1:] # Matrix with variable input

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
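
The imputation cell above is fitted on all rows before the split, so the column means are partly computed from rows that later land in the test set. A variant that splits first and fits the imputer on the training rows only could look like the sketch below. It runs on randomly generated stand-in data, and the names `features`, `target`, and `prot_0` to `prot_3` are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 10% missing values
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(12.5, 0.2, size=(50, 4)),
                        columns=[f"prot_{i}" for i in range(4)])
features = features.mask(rng.random(features.shape) < 0.1)
target = pd.Series(rng.integers(0, 35, size=50), name="FT5")

# Split first, so the test rows never influence the imputation means
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

# Fit the imputer on the training rows only, then apply it to both parts
imputer = SimpleImputer(strategy="mean")
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train),
                           columns=X_train.columns, index=X_train.index)
X_test_imp = pd.DataFrame(imputer.transform(X_test),
                          columns=X_test.columns, index=X_test.index)
print(X_train_imp.shape, X_test_imp.shape)
```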

### 1. Univariate Feature Selection
This method selects the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator.

In [116]:
from sklearn.feature_selection import SelectKBest, f_classif

# Select the top 100 features based on ANOVA F-value
select_k_best = SelectKBest(f_classif, k=100)
X_train_selected = select_k_best.fit_transform(X_train, y_train)
X_test_selected = select_k_best.transform(X_test)

print("Selected Features Shape:", X_train_selected.shape)

Selected Features Shape: (240, 100)
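
Since FT5 is a numeric score, `f_classif` treats every distinct score as its own class. If the target is meant to be modelled as a continuous value, `f_regression` may be the more natural scoring function. A minimal sketch on synthetic data from `make_regression` (not the DMD data; the `_demo` names are invented):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression problem standing in for (X_train, y_train)
X_demo, y_demo = make_regression(n_samples=80, n_features=30,
                                 n_informative=5, noise=1.0, random_state=0)

# Score each feature by its univariate linear relationship with the target
selector = SelectKBest(score_func=f_regression, k=10)
X_demo_selected = selector.fit_transform(X_demo, y_demo)

# Column positions of the selected features
selected_idx = selector.get_support(indices=True)
print(X_demo_selected.shape)
```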


### 2. Feature Selection Using Model
You can use a model to determine the importance of each feature and select the most important features accordingly. Here, we'll use ExtraTreesClassifier as an example for classification. For regression tasks, you could use ExtraTreesRegressor.

In [117]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

model = ExtraTreesClassifier(n_estimators=50)
model = model.fit(X_train, y_train)

# Model-based feature selection
model_select = SelectFromModel(model, prefit=True)
X_train_model = model_select.transform(X_train)
X_test_model = model_select.transform(X_test)

print("Model Selected Features Shape:", X_train_model.shape)

Model Selected Features Shape: (240, 444)
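
Without an explicit `threshold`, `SelectFromModel` keeps every feature whose importance is at least the mean importance for tree ensembles like this one, so the number of selected features is not fixed in advance. To cap it at a chosen number, `max_features` can be combined with `threshold=-np.inf`. The sketch below uses `ExtraTreesRegressor` on synthetic data, under the assumption that the target is treated as a regression problem.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic regression problem standing in for (X_train, y_train)
X_demo, y_demo = make_regression(n_samples=80, n_features=30,
                                 n_informative=5, random_state=0)

# Default behaviour: keep features with importance >= mean importance
default_select = SelectFromModel(
    ExtraTreesRegressor(n_estimators=50, random_state=0)
).fit(X_demo, y_demo)

# Capped variant: disable the threshold and keep the 10 most important features
capped_select = SelectFromModel(
    ExtraTreesRegressor(n_estimators=50, random_state=0),
    threshold=-np.inf,
    max_features=10,
).fit(X_demo, y_demo)

print(default_select.transform(X_demo).shape)
print(capped_select.transform(X_demo).shape)
```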




### 3. Recursive Feature Elimination (RFE)
RFE works by recursively removing the least important feature and building a model on those features that remain.

In [97]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Initialize the model to be used
model = LogisticRegression(max_iter=1000)

# Initialize RFE and select the top 100 features
rfe = RFE(estimator=model, n_features_to_select=100, step=1)
X_train_rfe = rfe.fit_transform(X_train, y_train)
X_test_rfe = rfe.transform(X_test)

print("RFE Selected Features Shape:", X_train_rfe.shape)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

KeyboardInterrupt: 
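
The run above was interrupted after repeated convergence warnings. Two changes that might make it tractable are standardizing the features first, as the warning itself suggests, and removing several features per round with a larger `step`. The sketch below shows both on a small synthetic classification problem. It is not the DMD data, and the chosen `step` and feature counts are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic classification problem standing in for (X_train, y_train)
X_demo, y_demo = make_classification(n_samples=120, n_features=30,
                                     n_informative=5, random_state=0)

# Standardize features so the solver converges more easily
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# Remove 5 features per round instead of 1 to cut the number of model fits
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5, step=5)
X_rfe = rfe.fit_transform(X_scaled, y_demo)
print(X_rfe.shape)
```

On real data the scaler would be fitted on the training split only and then applied to the test split with `scaler.transform`.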

## Test for SVM

## Old notes

In [None]:
# Get summary statistics for whole data set
max = dataset.max(axis=None)
min = dataset.min(axis=None)
median = dataset.median(axis=None)



In [None]:
print(max)

ID                           384
Disease                      1.0
serum.age                   16.3
FT1                     3.215434
FT2                     0.628931
                         ...    
HPA040052_SBA4_rep1    12.907514
HPA041542_SBA4_rep1    12.814063
HPA044582_SBA4_rep1    13.232916
HPA045702_SBA4_rep1    12.836329
HPA048982_SBA4_rep1    12.919319
Length: 1035, dtype: object


In [None]:
print(min)

ID                             1
Disease                      0.0
serum.age               4.016427
FT1                          0.0
FT2                          0.0
                         ...    
HPA040052_SBA4_rep1    11.823214
HPA041542_SBA4_rep1    12.013636
HPA044582_SBA4_rep1    12.580456
HPA045702_SBA4_rep1    12.267231
HPA048982_SBA4_rep1     11.92033
Length: 1035, dtype: object


In [None]:
print(median)

ID                     192.500000
Disease                  1.000000
serum.age                7.585216
FT1                      0.458295
FT2                      0.244801
                          ...    
HPA040052_SBA4_rep1     12.543317
HPA041542_SBA4_rep1     12.559998
HPA044582_SBA4_rep1     12.803850
HPA045702_SBA4_rep1     12.530788
HPA048982_SBA4_rep1     12.640344
Length: 1034, dtype: float64


In [None]:
# Get protein intensity values only by skipping the first 14 columns
intensities = dataset.iloc[:, 14:]
# print(intensities)

# Calculate min, max, and median per column for intensities
min_values = intensities.min()
max_values = intensities.max()
median_values = intensities.median()

# Now aggregate across the columns: overall minimum, overall maximum,
# and the mean of the per-column medians
global_min = min_values.min()
global_max = max_values.max()
global_median = median_values.mean()

print("Global maximum intensity:", global_max)
print("Global median intensity:", global_median)
print("Global minimum intensity:", global_min)


Global maximum intensity: 18.665777617921
Global median intensity: 12.325486692708193
Global minimum intensity: 8.72125121808696


In [None]:
# Normalize protein intensities in dataset
"""
 NOTE: we do not want to do this initially.
       It might also make the model less useful,
       since it makes intensities from other
       studies or clinical samples harder to
       compare.

       Also, we are normalizing the intensities globally instead of per-column
       (per-protein). Would doing it per protein be better?

       Disabled for now.
"""
# dataset.iloc[:, 14:] = (dataset.iloc[:, 14:] - global_min) / (global_max - global_min)

print(dataset.head())

   ID Sample.ID Participant.ID dataset  Disease  serum.age TREAT  FT1  FT2  \
0   1       S87             P3    DNHS      1.0   11.60849   NaN  0.0  0.0   
1   2       S10             P3    DNHS      1.0   12.62423   NaN  NaN  NaN   
2   3       S19             P3    DNHS      1.0   13.62081   NaN  NaN  NaN   
3   4      S137             P4    DNHS      1.0   11.42505   NaN  NaN  NaN   
4   5      S182             P4    DNHS      1.0   13.24572   NaN  NaN  NaN   

   FT3  ...  HPA049320_SBA4_rep1  HPA051620_SBA4_rep1  HPA054862_SBA4_rep1  \
0  NaN  ...            12.476497            12.666707            12.331400   
1  NaN  ...            12.494297            12.564169            12.354718   
2  NaN  ...            12.554581            12.707240            12.460004   
3  NaN  ...            12.455811            12.505603            12.416459   
4  NaN  ...            12.523768            12.598164            12.369592   

  HPA003901_SBA4_rep1  HPA035863_SBA4_rep1  HPA040052_SBA4_rep

In [None]:
# Drop physical tests (might include them in the future)
columns_to_drop = dataset.columns[7:12]
dataset = dataset.drop(columns=columns_to_drop)

print(columns_to_drop)

Index(['FT1', 'FT2', 'FT3', 'FT4', 'FT5'], dtype='object')


In [None]:
# Print final dataset
print(dataset.head())

   ID Sample.ID Participant.ID dataset  Disease  serum.age TREAT  Plate  \
0   1       S87             P3    DNHS      1.0   11.60849   NaN      1   
1   2       S10             P3    DNHS      1.0   12.62423   NaN      1   
2   3       S19             P3    DNHS      1.0   13.62081   NaN      1   
3   4      S137             P4    DNHS      1.0   11.42505   NaN      2   
4   5      S182             P4    DNHS      1.0   13.24572   NaN      2   

  Location  HPA029198_SBA1_rep1  ...  HPA049320_SBA4_rep1  \
0      D12            12.264364  ...            12.476497   
1       C2            12.264976  ...            12.494297   
2       D3            12.278605  ...            12.554581   
3       H6            12.351390  ...            12.455811   
4      H12            12.301256  ...            12.523768   

   HPA051620_SBA4_rep1  HPA054862_SBA4_rep1  HPA003901_SBA4_rep1  \
0            12.666707            12.331400            12.410101   
1            12.564169            12.354718   

In [None]:
# Create list of training vectors consisting of age and protein intensities

columns = dataset.columns
# print(columns)

# Select specific columns (in this case only 5) and then from 9 through the last column
dataset_ANN_vectors = dataset.iloc[:, [5] + list(range(9, dataset.shape[1]))]

print(dataset_ANN_vectors.head())

   serum.age  HPA029198_SBA1_rep1  HPA047839_SBA1_rep1  HPA029005_SBA1_rep1  \
0   11.60849            12.264364            12.159417            12.348176   
1   12.62423            12.264976            12.226864            12.430424   
2   13.62081            12.278605            12.124720            12.377718   
3   11.42505            12.351390            12.227065            12.320271   
4   13.24572            12.301256            12.260200            12.403662   

   HPA049793_SBA1_rep1  HPA014245_SBA1_rep1  HPA053793_SBA1_rep1  \
0            12.906300            12.211722            12.322140   
1            13.152528            12.245174            12.340043   
2            13.055278            12.206496            12.354011   
3            12.655442            12.267009            12.340713   
4            13.016106            12.140189            12.333273   

   HPA059859_SBA1_rep1  HPA038097_SBA1_rep1  HPA061383_SBA1_rep1  ...  \
0            12.605700            12.446275

In [None]:
# np.loadtxt(dataset, delimiter=",", skiprows=0)

# r = np.genfromtxt(dataset, delimiter=',', names=True, case_sensitive=True)
# print(repr(r))