# CONSTANTS

**ALPHA** is the significance level used in the statistical hypothesis tests below. It is set to 0.05 (5%), the conventional choice, which means that when the null hypothesis is true there is a 5% chance of rejecting it anyway (a false positive, or Type I error). A test's p-value is compared against ALPHA: the null hypothesis is rejected when the p-value is below ALPHA.

In [None]:
ALPHA = 0.05
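
As an illustration of what the significance level controls, the sketch below simulates many two-sample t-tests in which the null hypothesis is true by construction. The sample sizes, the number of simulations, and the names `n_sim` and `false_positive_rate` are assumptions made for this example and are not part of the analysis; the share of p-values below ALPHA should land close to 5%.

```python
import numpy as np
from scipy.stats import ttest_ind

ALPHA = 0.05  # same value as the notebook constant

# Hypothetical simulation: both groups come from the same distribution,
# so every rejection is a false positive (Type I error).
rng = np.random.default_rng(0)
n_sim = 2000
group_a = rng.normal(loc=0.0, scale=1.0, size=(n_sim, 30))
group_b = rng.normal(loc=0.0, scale=1.0, size=(n_sim, 30))

# One t-test per simulated experiment (one experiment per row)
_, p_values = ttest_ind(group_a, group_b, axis=1)

false_positive_rate = np.mean(p_values < ALPHA)
print(f"Share of null experiments rejected: {false_positive_rate:.3f}")
```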

**RANDOM_STATE** is the seed used to initialize the random number generator of any routine that relies on randomness (in this notebook, the train/test split). With RANDOM_STATE fixed, the routine produces the same "random" results on every run, which makes the analysis reproducible.

In [None]:
RANDOM_STATE = 0
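
A minimal sketch of the effect of a fixed seed, using NumPy's `default_rng` as a stand-in for any seeded routine. The variable names are hypothetical and chosen only for this illustration.

```python
import numpy as np

RANDOM_STATE = 0  # same value as the notebook constant

# Two generators seeded identically produce identical draws
first_run = np.random.default_rng(RANDOM_STATE).random(5)
second_run = np.random.default_rng(RANDOM_STATE).random(5)

# A different seed gives a different stream of numbers
other_seed = np.random.default_rng(RANDOM_STATE + 1).random(5)

print("Same seed, same draws:", np.array_equal(first_run, second_run))
print("Different seed, same draws:", np.array_equal(first_run, other_seed))
```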

# LIBS

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind, mannwhitneyu, chi2_contingency, shapiro
from sklearn.model_selection import train_test_split


# READ FILE

In [None]:
# Define the file path for the data file
file_path = '../dataset/dataset.xlsx'

dtypes = {'Sex': 'category', 
          'Age': int, 
          'Height': float, 
          'Weight': float, 
          'BMI': float, 
          'BSA': float,
          'DM': 'category', 
          'HBP': 'category', 
          'Type of Surgery': 'category', 
          'Redo': 'category', 
          'Urgency': 'category', 
          'CPB': 'category',
          'Ht': float, 
          'Hb': float, 
          'Creatinine': float, 
          'Blood Bags': int, 
          'TRUST': int, 
          'TRACK': int}

# Read the Excel file and select only the desired columns
df = pd.read_excel(file_path, 
                   usecols=list(dtypes.keys()), 
                   dtype=dtypes)

## - Filter

In [None]:
# Keep only patients aged 18 or older
df = df[df.Age >= 18]

# VISUALIZE

## - Describe

In [None]:
df.describe()

## - Head

In [None]:
df.head()

## - Info

In [None]:
df.info()

# FEATURES 

## - Categoric

- Type of Surgery

In [None]:
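# Normalize string values: trim, lowercase, and drop any character other than a-z or whitespace
# (str.replace on a plain Python string is literal, so the regex must go through the .str accessor)
df['Type of Surgery'] = df['Type of Surgery'].str.strip().str.lower().str.replace(r'[^a-z\s]', '', regex=True)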

# Create binary columns for each unique type of surgery
for type_of_surgery in df['Type of Surgery'].unique():
    df['type_' + type_of_surgery] = (df['Type of Surgery'] == type_of_surgery).astype(int)

# Drop the original 'Type of Surgery' column
df.drop('Type of Surgery',axis=1,inplace=True)
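
For comparison, the same kind of binary columns can be produced with `pd.get_dummies`. The sketch below runs on a hypothetical three-row frame (the surgery labels are invented for the example) and is not a drop-in replacement for the cell above, since `get_dummies` orders the new columns alphabetically.

```python
import pandas as pd

# Hypothetical toy data with inconsistent spacing and casing
toy = pd.DataFrame({'Type of Surgery': [' CABG', 'valve ', 'cabg']})

# Normalize, then create one 0/1 column per distinct value
normalized = toy['Type of Surgery'].str.strip().str.lower()
dummies = pd.get_dummies(normalized, prefix='type', dtype=int)

print(dummies)
```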

- Replace 2 with 0

In [None]:
# Recode the remaining categorical columns from 1/2 to 1/0 and store them as integers
for col in list(df.select_dtypes(include=['category']).columns):
    df[col] = df[col].astype(int).replace({2: 0})

# DATA DIVISION

Dividing the data into training and testing subsets makes it possible to assess a machine learning model's performance on unseen data. The model learns the underlying patterns from the training portion; evaluating it on the held-out testing portion then gives an estimate of how well it generalizes to new data.

In [None]:
# Split the data into predictors and a boolean target:
# True if the patient received at least one blood bag
X = df.drop(columns=['Blood Bags'])
y = df['Blood Bags'] >= 1

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    stratify=y, 
                                                    random_state=RANDOM_STATE)
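
The effect of `stratify` can be checked on a small synthetic example. The toy frame below (100 rows, 30% positives) and all `toy_*` names are assumptions made for illustration; with stratification, the share of positives should be close to 0.30 in both subsets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 30% positives
toy_X = pd.DataFrame({'feature': np.arange(100)})
toy_y = pd.Series([True] * 30 + [False] * 70, name='target')

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0)

# Stratification keeps the class proportions similar in both subsets
print("Positives in train:", toy_y_train.mean())
print("Positives in test: ", toy_y_test.mean())
```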

# NULL VALUES

Imputation of missing values involves replacing missing values with estimated values based on a variety of statistical or machine learning techniques. The goal is to replace missing values with plausible and relevant values that allow the data to be analyzed and used for modeling or analysis purposes.

## - Hb

We examine the correlation between 'Hb' and the other features to see whether any of them can be used to estimate missing 'Hb' values.

In [None]:
# Calculate the correlation
correlation = X_train.corr()['Hb']

# Create a bar plot of the correlations
correlation.plot(kind='bar')
plt.xlabel('Columns')
plt.ylabel('Correlation')
plt.title('Correlation with column "Hb"')
plt.show()

Due to a strong correlation between 'Hb' and 'Ht', we will use 'Ht' values to estimate missing 'Hb' values.

- Impute missing 'Hb' values in the X_train dataset.

In [None]:
# Find missing values in 'Hb'
missing_hb = np.isnan(X_train['Hb'])

# Iterate over rows with missing 'Hb' values
for index, patient in X_train[missing_hb].iterrows():
    # Find the indices of the three closest 'Ht' values
    indices = (X_train.loc[~missing_hb, 'Ht'] - patient['Ht']).abs().sort_values().index[:3]
    # Estimate 'Hb' by taking the mean of the closest 'Hb' values
    X_train.loc[index, 'Hb'] = X_train.loc[indices, 'Hb'].mean()

- Impute missing 'Hb' values in the X_test dataset.

In [None]:
# Find missing values in 'Hb'
missing_hb = np.isnan(X_test['Hb'])

# Iterate over rows with missing 'Hb' values
for index, patient in X_test[missing_hb].iterrows():
    # Find the indices of the three closest 'Ht' values in the training set (avoids using test data as donors)
    indices = (X_train['Ht'] - patient['Ht']).abs().sort_values().index[:3]
    # Estimate 'Hb' by taking the mean of the closest 'Hb' values
    X_test.loc[index, 'Hb'] = X_train.loc[indices, 'Hb'].mean()
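
The two imputation cells above follow the same pattern, which could be factored into a single helper. The sketch below is one possible way to do that, not the notebook's own code: `impute_from_nearest` and its parameters are hypothetical, and the toy values are invented. As in the cells above, the test rows are filled from training donors only.

```python
import numpy as np
import pandas as pd

def impute_from_nearest(target, reference, col='Hb', by='Ht', k=3):
    """Fill missing `col` in `target` with the mean `col` of the k rows in
    `reference` whose `by` value is closest. Hypothetical helper."""
    filled = target.copy()
    # Only rows with observed `col` and `by` can serve as donors
    donors = reference.dropna(subset=[col, by])
    for index, row in filled[filled[col].isna()].iterrows():
        nearest = (donors[by] - row[by]).abs().nsmallest(k).index
        filled.loc[index, col] = donors.loc[nearest, col].mean()
    return filled

# Invented toy values, for illustration only
toy_train = pd.DataFrame({'Ht': [30.0, 33.0, 36.0, 39.0, 42.0],
                          'Hb': [10.0, 11.0, 12.0, 13.0, np.nan]})
toy_test = pd.DataFrame({'Ht': [34.0], 'Hb': [np.nan]})

# Training rows are filled from other training rows; test rows from training rows
toy_train_filled = impute_from_nearest(toy_train, toy_train)
toy_test_filled = impute_from_nearest(toy_test, toy_train_filled)

print(toy_train_filled)
print(toy_test_filled)
```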

# FEATURE SELECTION

In [None]:
FEATURES = []

## - Statistical

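Features are selected according to whether they differ significantly between patients who received at least one blood bag and those who did not. Binary features are compared with a chi-square test and numerical features with Student's t-test or the Mann-Whitney U test; a feature is kept when its p-value is below ALPHA.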

In [None]:
def analysis_of_binary_feature(feature):
    
    # Merge the data into a single DataFrame
    data = pd.concat([X_train, y_train], axis=1)
    
    # Create contingency table
    contingency_table = pd.crosstab(data[feature], data['Blood Bags'])
    

    # Plot the contingency table
    plt.figure(figsize=(8, 6))
    sns.heatmap(contingency_table, annot=True, fmt='d', cmap="YlGnBu")
    plt.xlabel('Blood Bags')
    plt.ylabel(feature)
    plt.title('Contingency Table')
    plt.show()

    # Print counts and percentages
    for index, row in contingency_table.iterrows():
        total_samples = row.sum()
        positive_percentage = 100 * row[True] / total_samples
        negative_percentage = 100 * row[False] / total_samples
        
        print(f"{feature} == {index}: {total_samples} ({100*total_samples/len(X_train):.2f}%)")
        print(f" - positives: {row[True]} ({positive_percentage:.2f}%)")
        print(f" - negatives: {row[False]} ({negative_percentage:.2f}%)")
    
    # Perform the Chi-square test
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    print("\nChi-square test:")
    print("Chi-square statistic:", chi2_stat)
    print("p-value:", p_value, '\n')
    
    return p_value
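
To see the chi-square step in isolation, the sketch below applies `chi2_contingency` to a hypothetical toy table in which the outcome is clearly more frequent when the feature equals 1. The data and the `toy_*` names are invented for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

ALPHA = 0.05  # same value as the notebook constant

# Hypothetical toy data: 30/40 positives when feature == 1, 10/40 when feature == 0
toy = pd.DataFrame({
    'feature': [1] * 40 + [0] * 40,
    'outcome': [True] * 30 + [False] * 10 + [True] * 10 + [False] * 30,
})

toy_table = pd.crosstab(toy['feature'], toy['outcome'])
chi2_stat, toy_p, dof, expected = chi2_contingency(toy_table)

print(toy_table)
print("Chi-square statistic:", chi2_stat)
print("p-value:", toy_p)
print("Significant at ALPHA:", toy_p < ALPHA)
```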

In [None]:
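def analysis_of_numerical_feature(feature, test='auto'):
    # Fail early on an unsupported option instead of silently returning None
    if test not in ('auto', 't-test', 'U test'):
        raise ValueError(f"Unknown test: {test!r}")
    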
    # Merge the data into a single DataFrame
    data = pd.concat([X_train, y_train], axis=1)
    
    # Descriptive statistics
    print(data.groupby('Blood Bags')[feature].describe())

    # Density plot
    data.groupby('Blood Bags')[feature].plot(kind='density',legend=True)
    plt.show()
        
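    # Split the feature by outcome, dropping missing values so the tests do not return NaN
    group1 = data[data['Blood Bags'] == 0][feature].dropna()
    group2 = data[data['Blood Bags'] == 1][feature].dropna()
    
    # Check normality with the Shapiro-Wilk test
    if test == 'auto':
        _, p_value_group1 = shapiro(group1)
        _, p_value_group2 = shapiro(group2)

        # Use the t-test only if neither group departs significantly from normality
        if p_value_group1 > ALPHA and p_value_group2 > ALPHA: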
            test = 't-test'
        else:
            test = 'U test'
    
    # Mann-Whitney U test
    if test == 'U test':
        U_stat, p_value = mannwhitneyu(group1, group2)
        print("Mann-Whitney U test:")
        print("U statistic:", U_stat)
        print("p-value:", p_value, '\n')
        return p_value
      
    # Student's t-test
    elif test == 't-test':  
        t_stat, p_value = ttest_ind(group1, group2)
        print("Student's t-test:")
        print("t-statistic:", t_stat)
        print("p-value:", p_value, '\n')
        return p_value
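
The automatic choice between the two tests can be illustrated on synthetic data. In the sketch below, both hypothetical groups are drawn from a strongly right-skewed (exponential) distribution, so the Shapiro-Wilk test is expected to reject normality and the Mann-Whitney U test is selected. The sample sizes, the shift between groups, and the `toy_*` names are assumptions for this example.

```python
import numpy as np
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

ALPHA = 0.05  # same value as the notebook constant
rng = np.random.default_rng(0)

# Hypothetical groups: right-skewed values, with the second group shifted upward
toy_group1 = rng.exponential(scale=1.0, size=200)
toy_group2 = rng.exponential(scale=1.0, size=200) + 1.0

_, p_norm1 = shapiro(toy_group1)
_, p_norm2 = shapiro(toy_group2)

# t-test only if neither group departs significantly from normality
if p_norm1 > ALPHA and p_norm2 > ALPHA:
    chosen_test = 't-test'
    _, toy_p = ttest_ind(toy_group1, toy_group2)
else:
    chosen_test = 'U test'
    _, toy_p = mannwhitneyu(toy_group1, toy_group2)

print("Chosen test:", chosen_test)
print("p-value:", toy_p)
```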

In [None]:
for feature in X_train.columns:
    
    print('-'*(33-len(feature)//2), feature.upper(), '-'*(33-len(feature)//2))
    
    # nunique() ignores missing values, so NaN is not counted as a third level
    if X_train[feature].nunique() == 2:
        p = analysis_of_binary_feature(feature)
    else:
        p = analysis_of_numerical_feature(feature)
    
    if p < ALPHA:
        FEATURES.append(feature)

## - Correlation

Highly correlated features can provide redundant information and may not add much value to the model. In such cases, it is possible to consider removing one of the features to avoid multicollinearity and reduce the complexity of the model.

In [None]:
# Compute the correlation matrix
corr = pd.concat([X_train[FEATURES], y_train], axis=1).corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 20))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap on the axes created above, with square cells
sns.heatmap(corr, cmap=cmap, center=0, annot=True, square=True, fmt='.2f',
            linewidths=.5, cbar_kws={"shrink": .5}, annot_kws={"fontsize":12}, ax=ax)

# Customize tick labels
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, horizontalalignment='right', fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0, fontsize=12)

# Add a title
ax.set_title("Correlation Matrix", fontsize=16)

plt.show()

In [None]:
FEATURES.remove('Weight') # Highly correlated with BSA
FEATURES.remove('BMI') # Highly correlated with BSA
FEATURES.remove('Height') # Highly correlated with BSA
FEATURES.remove('Ht') # Highly correlated with Hb
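
The removals above were chosen by reading the heatmap. A possible cross-check is to list the feature pairs whose absolute correlation exceeds a cutoff. The helper `correlated_pairs` and the 0.8 threshold below are assumptions for illustration, not values used in this analysis, and the toy columns are synthetic.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds `threshold`.
    Hypothetical helper; the default threshold is an arbitrary choice."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic toy data: 'b' is nearly a scaled copy of 'a', 'c' is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'a': base,
    'b': base * 2 + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

toy_pairs = correlated_pairs(toy)
print(toy_pairs)
```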

# SAVE

In [None]:
X_train[FEATURES].to_csv('../dataset/X_train.csv')
X_test[FEATURES].to_csv('../dataset/X_test.csv')
y_train.to_csv('../dataset/y_train.csv')
y_test.to_csv('../dataset/y_test.csv')
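
Because `to_csv` writes the index as the first column by default, the files can be read back with `index_col=0` to restore the original row labels. The round trip below uses an in-memory buffer and invented values rather than the files above.

```python
import io
import pandas as pd

# Hypothetical frame with a non-contiguous index, as left behind by a shuffled split
toy_X = pd.DataFrame({'Hb': [12.5, 13.1]}, index=[7, 42])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_X.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first CSV column back into the index
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```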