# Data Processing

Data visualization, cleaning, division, and normalization.

* Execute small adjustments/renaming of columns
* Disease parameter: DMD/Cnt -> 1/0
* Remove rows if:
  * Sample.ID's value is "BLANK", "POOL 1" or "POOL 2"
  * Disease or Sample.ID value for row is missing  
* If there are multiple rows with the same Sample.ID, drop the duplicate rows with fewer value entries
  * Might need to evaluate manually or reevaluate the heuristics (which proteins to prioritize over others)
* Remove columns of antibody concentrations if there are no data entries

* TODO: debug, continue with later points
* Potentially normalize intensities: [-1, 1] or [0, 1]
* Create new column for LoA based on FT1-5 

* Reorder the columns so that each row becomes a vector of the form <<METADATA>,<Y>,<X>>, where <X> is [LoA], <Y> is the intensities (and possibly age later), and <METADATA> is everything else.

* Run an SVM with our input data
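
The column reordering step in the list above could be sketched as follows. This is only an illustration: `reorder_columns` and the toy column names are hypothetical, and it assumes the X and Y columns are known by name.

```python
import pandas as pd

def reorder_columns(df, x_cols, y_cols):
    """Return df with columns ordered as metadata, then y_cols, then x_cols."""
    chosen = set(x_cols) | set(y_cols)
    # Everything that is neither X nor Y counts as metadata
    meta_cols = [c for c in df.columns if c not in chosen]
    return df[meta_cols + list(y_cols) + list(x_cols)]

# Toy frame; the column names are placeholders, not the real data
toy = pd.DataFrame({
    "LoA": [1, 0],
    "HPA_a": [12.1, 12.3],
    "Sample.ID": ["S1", "S2"],
    "HPA_b": [11.9, 12.0],
})
ordered = reorder_columns(toy, x_cols=["LoA"], y_cols=["HPA_a", "HPA_b"])
print(list(ordered.columns))  # ['Sample.ID', 'HPA_a', 'HPA_b', 'LoA']
```

Selecting columns by name rather than by position keeps the step valid even after empty antibody columns have been dropped.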

## Environment to read and process DMD data

Imports used for data processing and cleaning

In [115]:
import pandas as pd
import numpy as np

Load CSV file through pandas dataframe

In [116]:
dataset = pd.read_csv('normalised_data_all_w_clinical_kex.csv')
dataset.head()  # Preview the first five rows

Unnamed: 0.1,Unnamed: 0,Sample.ID,Participant.ID,dataset,Disease,serum.age,TREAT,FT1,FT2,FT3,...,HPA049320_SBA4_rep1,HPA051620_SBA4_rep1,HPA054862_SBA4_rep1,HPA003901_SBA4_rep1,HPA035863_SBA4_rep1,HPA040052_SBA4_rep1,HPA041542_SBA4_rep1,HPA044582_SBA4_rep1,HPA045702_SBA4_rep1,HPA048982_SBA4_rep1
0,1,S87,P3,DNHS,DMD,11.60849,,0.0,0.0,,...,12.476497,12.666707,12.3314,12.410101,12.504796,12.580695,12.574592,12.797013,12.492433,12.622226
1,2,S10,P3,DNHS,DMD,12.62423,,,,,...,12.494297,12.564169,12.354718,12.581718,12.441237,12.583769,12.579893,12.660894,12.450384,12.704464
2,3,S19,P3,DNHS,DMD,13.62081,,,,,...,12.554581,12.70724,12.460004,12.610387,12.504661,12.556319,12.679701,12.824755,12.517335,12.83697
3,4,S137,P4,DNHS,DMD,11.42505,,,,,...,12.455811,12.505603,12.416459,12.510771,12.472934,12.556096,12.47376,12.74402,12.511776,12.666065
4,5,S182,P4,DNHS,DMD,13.24572,,,,,...,12.523768,12.598164,12.369592,12.605211,12.523768,12.503911,12.557672,12.926602,12.518327,12.668579


### First Data Visualization & Processing

In [117]:
# Dictionary for value conversion
token_to_val = {
    "DMD": 1,
    "Cnt": 0,
}

# Replace the string values in the column using the mapping in token_to_val
dataset['Disease'] = dataset['Disease'].replace(token_to_val)

# Rename columns for uniform representation
dataset.rename(columns={dataset.columns[0]: 'ID'}, inplace=True)
dataset.rename(columns={'dataset': 'Dataset'}, inplace=True)
dataset.rename(columns={'serum.age': 'Participant.Age'}, inplace=True)

# Verify the change
print(dataset.columns)
print(dataset)

Index(['ID', 'Sample.ID', 'Participant.ID', 'Dataset', 'Disease',
       'Participant.Age', 'TREAT', 'FT1', 'FT2', 'FT3',
       ...
       'HPA049320_SBA4_rep1', 'HPA051620_SBA4_rep1', 'HPA054862_SBA4_rep1',
       'HPA003901_SBA4_rep1', 'HPA035863_SBA4_rep1', 'HPA040052_SBA4_rep1',
       'HPA041542_SBA4_rep1', 'HPA044582_SBA4_rep1', 'HPA045702_SBA4_rep1',
       'HPA048982_SBA4_rep1'],
      dtype='object', length=1039)
      ID Sample.ID Participant.ID Dataset  Disease  Participant.Age  \
0      1       S87             P3    DNHS      1.0         11.60849   
1      2       S10             P3    DNHS      1.0         12.62423   
2      3       S19             P3    DNHS      1.0         13.62081   
3      4      S137             P4    DNHS      1.0         11.42505   
4      5      S182             P4    DNHS      1.0         13.24572   
..   ...       ...            ...     ...      ...              ...   
379  380      S306            P29  FORDMD      1.0              NaN   
380  



### Row Based Data Clean-up

Handling incorrect values in key columns

In [118]:
def remove_wrong_value_rows(df, column_name, wrong_val):
    """
    Removes rows from the DataFrame where the specified column has one of the specified wrong values.

    :param df: A pandas DataFrame from which rows will be removed (modified in place).
    :param column_name: The name of the column to check for the wrong value.
    :param wrong_val: A single value (string) or a list of values considered wrong in the specified column.
    :return: The same pandas DataFrame with the matching rows removed.
    """
    # Accept a single string as well as a list of values
    if isinstance(wrong_val, str):
        wrong_val = [wrong_val]
    for val in wrong_val:
        # Find indices of rows with the wrong value (use df, not the global dataset)
        incorrect = df[column_name] == val
        indices_to_drop = incorrect[incorrect].index
        # Drop these rows in place
        df.drop(indices_to_drop, inplace=True)
    return df

In [119]:
# Drop rows with invalid sample data
print("Before drop:", dataset.shape)
dataset = remove_wrong_value_rows(dataset, 'Sample.ID', ['BLANK', 'POOL 1', 'POOL 2'])
print("After drop:", dataset.shape)

Before drop: (384, 1039)
After drop: (372, 1039)
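
The same filtering can also be written without a loop. The sketch below is an alternative, not the notebook's method: `drop_rows_with_values` is a hypothetical helper that uses `Series.isin` and returns a filtered copy instead of modifying the frame in place.

```python
import pandas as pd

def drop_rows_with_values(df, column_name, bad_values):
    """Return a copy of df without rows whose column_name is in bad_values."""
    if isinstance(bad_values, str):
        bad_values = [bad_values]
    # isin builds one boolean mask for all unwanted values at once
    return df[~df[column_name].isin(bad_values)].copy()

# Toy data standing in for the real samples
toy = pd.DataFrame({
    "Sample.ID": ["S1", "BLANK", "POOL 1", "S2"],
    "value": [1, 2, 3, 4],
})
cleaned = drop_rows_with_values(toy, "Sample.ID", ["BLANK", "POOL 1", "POOL 2"])
print(cleaned["Sample.ID"].tolist())  # ['S1', 'S2']
```

Returning a copy leaves the input untouched, which makes the before/after shapes easy to compare.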


Drop rows with NaN in either key column (Sample.ID or Disease)

In [120]:
print("Before drop:", dataset.shape)
dataset.dropna(subset=['Sample.ID','Disease'], inplace=True)
print("After drop:", dataset.shape)

Before drop: (372, 1039)
After drop: (357, 1039)


Among duplicate samples, choose one row to keep and remove the rest

In [121]:
def get_duplicate_indecies(df, columns):
    """
    Find the indices of all rows that share their value in the specified column(s)
    with at least one other row.
    """
    # keep=False marks every member of a duplicate group, not only the repeats
    duplicate = df.duplicated(subset=columns, keep=False)
    duplicate_indexes = duplicate[duplicate].index
    return duplicate_indexes

In [122]:
def calculate_row_value_percentage(df, start_column=0):
    """
    Calculates the percentage of actual (non-NA) data points for each row in a pandas DataFrame,
    counting only the columns from start_column onwards.

    :param df: A pandas DataFrame with potential NA values.
    :param start_column: The starting column index for the interval (1-based index).
    :return: A pandas Series with the percentage of non-NA values for each row.
    """
    # Adjust for 0-based indexing
    start_index = max(0, start_column - 1)

    # Select only the columns within the specified interval
    interval_df = df.iloc[:, start_index:]

    # Calculate the number of non-NA values per row within the interval
    value_counts_per_row = interval_df.notna().sum(axis=1)

    # Calculate the total number of columns (to handle potential NA values)
    total_columns = interval_df.shape[1]

    # Calculate the percentage of non-NA values for each row
    value_percentage_per_row = (value_counts_per_row / total_columns) * 100

    return value_percentage_per_row

In [123]:
def remove_duplicate_rows(df, duplicate_indexes, row_val_percentages):
    """
    For each duplicated Sample.ID, keeps the row with the highest percentage of
    non-NA values and drops the other rows from df in place.
    """
    for i in duplicate_indexes:
        # Skip rows already dropped while handling an earlier duplicate
        if i not in df.index:
            continue

        # Look up the Sample.ID by index label (duplicate_indexes holds labels, not positions)
        sample_ID = df.loc[i, 'Sample.ID']

        # Find the row indices of all occurrences of this Sample.ID
        duplicate_sample_ID_indices = df.index[df['Sample.ID'] == sample_ID]

        # Find which of these rows has the highest percentage in row_val_percentages
        best_index = -1
        best_val = -1
        for duplicate_idx in duplicate_sample_ID_indices:
            val = row_val_percentages.loc[duplicate_idx]

            if val > best_val:
                best_val = val
                best_index = duplicate_idx

        # Remove the best row from the list of duplicates
        duplicate_sample_ID_indices = duplicate_sample_ID_indices.drop(best_index)

        # Drop the rest of the duplicates
        df.drop(index=duplicate_sample_ID_indices, inplace=True)

In [124]:
print("Before drop:", dataset.shape)
duplicate_indexes = get_duplicate_indecies(dataset, 'Sample.ID')
row_val_percentages = calculate_row_value_percentage(dataset, start_column=15)
remove_duplicate_rows(dataset, duplicate_indexes, row_val_percentages)
print("After drop:", dataset.shape)

Before drop: (357, 1039)
After drop: (342, 1039)
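
A more compact way to express the same heuristic (keep the most complete row per Sample.ID) is sketched below. It is an alternative under stated assumptions, not the notebook's implementation: `keep_most_complete` is a hypothetical helper, the key column is assumed to contain no missing values, and ties go to the first row in the group.

```python
import numpy as np
import pandas as pd

def keep_most_complete(df, key, value_cols):
    """Keep, per key value, the row with the most non-NA entries in value_cols."""
    completeness = df[value_cols].notna().sum(axis=1)
    # idxmax returns the index label of the first maximum within each group
    best = completeness.groupby(df[key]).idxmax()
    return df.loc[sorted(best.tolist())]

# Toy data: S1 appears twice, and the second row is more complete
toy = pd.DataFrame({
    "Sample.ID": ["S1", "S1", "S2"],
    "p1": [np.nan, 12.0, 11.5],
    "p2": [12.2, 12.4, np.nan],
})
deduped = keep_most_complete(toy, "Sample.ID", ["p1", "p2"])
print(deduped.index.tolist())  # [1, 2]
```

Counting raw completeness treats all proteins equally; weighting some proteins above others, as the notes suggest, would require replacing `completeness` with a weighted sum.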


### Column Based Data Clean-up

In [125]:
def calculate_column_value_percentage(df, start_column=1):
    """
    Calculates the percentage of actual (non-NA) data points for each column in a pandas DataFrame,
    from start_column through the last column.

    :param df: A pandas DataFrame with potential NA values.
    :param start_column: The starting column index for the interval (1-based index).
    :return: A pandas Series with the percentage of non-NA values for each column in the interval.
    """
    # Adjust for 0-based indexing
    start_index = max(0, start_column - 1)

    # Select only the columns within the specified interval
    interval_df = df.iloc[:, start_index:]

    # Calculate the total number of non-NA values for each column
    value_counts = interval_df.count()

    # Calculate the total number of rows (to handle potential NA rows)
    total_rows = len(df)

    # Calculate the percentage of non-NA values for each column
    value_percentage = (value_counts / total_rows) * 100

    return value_percentage

In [126]:
# Remove empty columns
value_percentage = calculate_column_value_percentage(dataset, 15)
columns_to_drop = value_percentage[value_percentage == 0].index
print("Columns to drop:", columns_to_drop)

print("Before drop:", dataset.shape)
dataset.drop(labels=columns_to_drop, axis="columns", inplace=True)
print("After drop:", dataset.shape)

Columns to drop: Index(['HPA003948_SBA1_rep1', 'HPA057437_SBA3_rep1', 'HPA003948_SBA3_rep1',
       'HPA034960_SBA3_rep1', 'HPA058513_SBA3_rep1', 'HPA074922_SBA3_rep1',
       'HPA000837_SBA4_rep1', 'HPA001482_SBA4_rep1', 'HPA000293_SBA4_rep1',
       'HPA031466_SBA4_rep1', 'HPA040972_SBA4_rep1', 'HPA010558_SBA4_rep1',
       'HPA020610_SBA4_rep1', 'HPA030651_SBA4_rep1', 'HPA000226_SBA4_rep1',
       'HPA041991_SBA4_rep1', 'HPA000564_SBA4_rep1', 'HPA035222_SBA4_rep1',
       'HPA041943_SBA4_rep1', 'HPA001833_SBA4_rep1', 'HPA003927_SBA4_rep1',
       'HPA011918_SBA4_rep1', 'HPA023314_SBA4_rep1', 'HPA031335_SBA4_rep1',
       'HPA031717_SBA4_rep1'],
      dtype='object')
Before drop: (342, 1039)
After drop: (342, 1014)
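
Since only columns with 0% coverage are dropped, the percentage calculation can be bypassed. The following is a minimal sketch of that shortcut: `drop_empty_columns` is a hypothetical helper, and `start_index` is a 0-based position (unlike the 1-based `start_column` above).

```python
import numpy as np
import pandas as pd

def drop_empty_columns(df, start_index):
    """Drop columns at or after start_index that contain only NA values."""
    candidates = df.columns[start_index:]
    empty = [c for c in candidates if df[c].isna().all()]
    return df.drop(columns=empty), empty

# Toy data: TREAT is empty metadata (kept), p2 is an empty intensity column (dropped)
toy = pd.DataFrame({
    "Sample.ID": ["S1", "S2"],
    "TREAT": [np.nan, np.nan],
    "p1": [12.1, np.nan],
    "p2": [np.nan, np.nan],
})
cleaned, dropped = drop_empty_columns(toy, start_index=2)
print(dropped)                # ['p2']
print(list(cleaned.columns))  # ['Sample.ID', 'TREAT', 'p1']
```

Restricting the check to the intensity columns keeps sparse metadata such as `TREAT` in place even when it is entirely missing.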


## LoA Prediction Estimation

Function for transforming physical tests into an estimate for LoA which becomes our quantity to predict.

In [None]:
def get_LoA(FT1, FT2, FT3, FT4, FT5):
    """Convert FTs into an estimate for LoA.

    Assumes N/A values are None.
    """
    # Placeholder thresholds, one per physical test
    threshold_FT1 = 20
    threshold_FT2 = 20
    threshold_FT3 = 20
    threshold_FT4 = 20
    threshold_FT5 = 20
    # FT5 is checked first; any test below its threshold gives True
    if FT5 is not None and FT5 < threshold_FT5:
        return True
    if FT1 is not None and FT1 < threshold_FT1:
        return True
    if FT2 is not None and FT2 < threshold_FT2:
        return True
    if FT3 is not None and FT3 < threshold_FT3:
        return True
    if FT4 is not None and FT4 < threshold_FT4:
        return True
    return False
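
pandas stores missing values as NaN rather than None, so creating the LoA column for the whole table may be easier with a vectorized comparison. The sketch below assumes the same placeholder thresholds of 20 as the function above; `add_loa_column` and `FT_COLUMNS` are hypothetical names.

```python
import numpy as np
import pandas as pd

FT_COLUMNS = ["FT1", "FT2", "FT3", "FT4", "FT5"]

def add_loa_column(df, thresholds):
    """Return a copy of df with a boolean LoA column (any FT below its threshold)."""
    # Comparisons against NaN evaluate to False, so missing tests never trigger LoA
    below = pd.DataFrame({col: df[col] < thresholds[col] for col in FT_COLUMNS})
    out = df.copy()
    out["LoA"] = below.any(axis=1)
    return out

# Placeholder thresholds (assumption: identical to the values in get_LoA)
thresholds = {col: 20 for col in FT_COLUMNS}

toy = pd.DataFrame({
    "FT1": [5.0, np.nan, 30.0],
    "FT2": [np.nan, np.nan, 25.0],
    "FT3": [np.nan, np.nan, 40.0],
    "FT4": [np.nan, np.nan, 22.0],
    "FT5": [np.nan, np.nan, 21.0],
})
with_loa = add_loa_column(toy, thresholds)
print(with_loa["LoA"].tolist())  # [True, False, False]
```

Note that a row with no test results at all is labelled `False` here; whether such rows should instead be treated as unknown is a modelling decision this sketch does not settle.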

## Test for SVM

## Old notes

In [166]:
# Get summary statistics for whole data set
max = dataset.max(axis=None)
min = dataset.min(axis=None)
median = dataset.median(axis=None)



In [167]:
print(max)

ID                           384
Disease                      1.0
serum.age                   16.3
FT1                     3.215434
FT2                     0.628931
                         ...    
HPA040052_SBA4_rep1    12.907514
HPA041542_SBA4_rep1    12.814063
HPA044582_SBA4_rep1    13.232916
HPA045702_SBA4_rep1    12.836329
HPA048982_SBA4_rep1    12.919319
Length: 1035, dtype: object


In [168]:
print(min)

ID                             1
Disease                      0.0
serum.age               4.016427
FT1                          0.0
FT2                          0.0
                         ...    
HPA040052_SBA4_rep1    11.823214
HPA041542_SBA4_rep1    12.013636
HPA044582_SBA4_rep1    12.580456
HPA045702_SBA4_rep1    12.267231
HPA048982_SBA4_rep1     11.92033
Length: 1035, dtype: object


In [169]:
print(median)

ID                     192.500000
Disease                  1.000000
serum.age                7.585216
FT1                      0.458295
FT2                      0.244801
                          ...    
HPA040052_SBA4_rep1     12.543317
HPA041542_SBA4_rep1     12.559998
HPA044582_SBA4_rep1     12.803850
HPA045702_SBA4_rep1     12.530788
HPA048982_SBA4_rep1     12.640344
Length: 1034, dtype: float64


In [170]:
# Get protein intensity values only by removing the first 14 (metadata) columns
intensities = dataset.iloc[:, 14:]
# print(intensities)

# Calculate min, max, and median for intensities
min_values = intensities.min()
max_values = intensities.max()
median_values = intensities.median()

# Aggregate across columns: overall min, overall max, and the mean of the
# per-column medians (note: this is not the true global median)
global_min = min_values.min()
global_max = max_values.max()
global_median = median_values.mean()

print("Global maximum intensity:", global_max)
print("Global median intensity:", global_median)
print("Global minimum intensity:", global_min)


Global maximum intensity: 18.665777617921
Global median intensity: 12.325486692708193
Global minimum intensity: 8.72125121808696


In [171]:
# Normalize protein intensities in dataset
"""
 NB: we do not want to do this initially.
     It might also make the model less useful,
     since it makes intensities from other
     studies or clinical samples harder to
     compare.

     Also, this normalizes the intensities globally instead of per column
     (per protein). Would per-protein normalization be better?

     Disabled for now.
"""
# dataset.iloc[:, 14:] = (dataset.iloc[:, 14:] - global_min) / (global_max - global_min)

print(dataset.head())

   ID Sample.ID Participant.ID dataset  Disease  serum.age TREAT  FT1  FT2  \
0   1       S87             P3    DNHS      1.0   11.60849   NaN  0.0  0.0   
1   2       S10             P3    DNHS      1.0   12.62423   NaN  NaN  NaN   
2   3       S19             P3    DNHS      1.0   13.62081   NaN  NaN  NaN   
3   4      S137             P4    DNHS      1.0   11.42505   NaN  NaN  NaN   
4   5      S182             P4    DNHS      1.0   13.24572   NaN  NaN  NaN   

   FT3  ...  HPA049320_SBA4_rep1  HPA051620_SBA4_rep1  HPA054862_SBA4_rep1  \
0  NaN  ...            12.476497            12.666707            12.331400   
1  NaN  ...            12.494297            12.564169            12.354718   
2  NaN  ...            12.554581            12.707240            12.460004   
3  NaN  ...            12.455811            12.505603            12.416459   
4  NaN  ...            12.523768            12.598164            12.369592   

  HPA003901_SBA4_rep1  HPA035863_SBA4_rep1  HPA040052_SBA4_rep
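
To make the global versus per-protein question concrete, here is a hedged sketch of both min-max variants on toy data. The helper names `min_max_global` and `min_max_per_column` are hypothetical, and neither variant is applied to the real dataset.

```python
import pandas as pd

def min_max_global(values):
    """Scale all columns to [0, 1] using one shared minimum and maximum."""
    lo = values.min().min()
    hi = values.max().max()
    return (values - lo) / (hi - lo)

def min_max_per_column(values):
    """Scale each column to [0, 1] using its own minimum and maximum."""
    col_min = values.min()
    col_max = values.max()
    # A constant column would divide by zero and become NaN
    return (values - col_min) / (col_max - col_min)

toy = pd.DataFrame({"p1": [10.0, 12.0, 14.0], "p2": [8.0, 9.0, 10.0]})
glob = min_max_global(toy)
per_col = min_max_per_column(toy)

print(per_col["p1"].tolist())  # [0.0, 0.5, 1.0]
print(per_col["p2"].tolist())  # [0.0, 0.5, 1.0]
```

Global scaling preserves the relative levels between proteins, while per-column scaling removes them and puts every protein on the same range. In either case the minimum and maximum would need to come from the training split only if the scaled values are later used for prediction.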

In [172]:
# Drop physical tests (might include them in the future)
columns_to_drop = dataset.columns[7:12]
dataset = dataset.drop(columns=columns_to_drop)

print(columns_to_drop)

Index(['FT1', 'FT2', 'FT3', 'FT4', 'FT5'], dtype='object')


In [173]:
# Print final dataset
print(dataset.head())

   ID Sample.ID Participant.ID dataset  Disease  serum.age TREAT  Plate  \
0   1       S87             P3    DNHS      1.0   11.60849   NaN      1   
1   2       S10             P3    DNHS      1.0   12.62423   NaN      1   
2   3       S19             P3    DNHS      1.0   13.62081   NaN      1   
3   4      S137             P4    DNHS      1.0   11.42505   NaN      2   
4   5      S182             P4    DNHS      1.0   13.24572   NaN      2   

  Location  HPA029198_SBA1_rep1  ...  HPA049320_SBA4_rep1  \
0      D12            12.264364  ...            12.476497   
1       C2            12.264976  ...            12.494297   
2       D3            12.278605  ...            12.554581   
3       H6            12.351390  ...            12.455811   
4      H12            12.301256  ...            12.523768   

   HPA051620_SBA4_rep1  HPA054862_SBA4_rep1  HPA003901_SBA4_rep1  \
0            12.666707            12.331400            12.410101   
1            12.564169            12.354718   

In [174]:
# Create list of training vectors consisting of age and protein intensities

columns = dataset.columns
# print(columns)

# Select column 5 (serum.age), then columns 9 through the last (protein intensities)
dataset_ANN_vectors = dataset.iloc[:, [5] + list(range(9, dataset.shape[1]))]

print(dataset_ANN_vectors.head())

   serum.age  HPA029198_SBA1_rep1  HPA047839_SBA1_rep1  HPA029005_SBA1_rep1  \
0   11.60849            12.264364            12.159417            12.348176   
1   12.62423            12.264976            12.226864            12.430424   
2   13.62081            12.278605            12.124720            12.377718   
3   11.42505            12.351390            12.227065            12.320271   
4   13.24572            12.301256            12.260200            12.403662   

   HPA049793_SBA1_rep1  HPA014245_SBA1_rep1  HPA053793_SBA1_rep1  \
0            12.906300            12.211722            12.322140   
1            13.152528            12.245174            12.340043   
2            13.055278            12.206496            12.354011   
3            12.655442            12.267009            12.340713   
4            13.016106            12.140189            12.333273   

   HPA059859_SBA1_rep1  HPA038097_SBA1_rep1  HPA061383_SBA1_rep1  ...  \
0            12.605700            12.446275

In [175]:
# np.loadtxt(dataset, delimiter=",", skiprows=0)

# r = np.genfromtxt(dataset, delimiter=',', names=True, case_sensitive=True)
# print(repr(r))