## Importing Libraries and Loading Metadata
-   Import the necessary libraries: pandas for data handling, NumPy for numerical operations, `os` for file-path operations, scikit-learn for data splitting and preprocessing, and TensorFlow/Keras for deep learning.
-   Load the metadata CSV file, which describes each image and its associated tabular data, using pandas.

In [12]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Concatenate

# Load metadata
metadata_path = 'covid_cx_dataset/metadata.csv'
metadata = pd.read_csv(metadata_path)

## Image Preprocessing Function
-    The `preprocess_image` function loads an image from a file path, resizes it to 224x224 pixels, converts it to a NumPy array, and scales pixel values to the range [0, 1] by dividing by 255.

In [2]:
# Define function to preprocess images
def preprocess_image(filepath):
    img = load_img(filepath, target_size=(224, 224))
    img = img_to_array(img)
    img = img / 255.0
    return img
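
To see what the scaling step does without needing an image file, the sketch below applies the same division by 255 to a synthetic array. The random array is only a stand-in for the output of `img_to_array`, not real data.

```python
import numpy as np

# Synthetic stand-in for a loaded 224x224 RGB image (integer values 0..255).
rng = np.random.default_rng(0)
fake_img = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Same scaling as preprocess_image: cast to float, then divide by 255.
scaled = fake_img.astype("float32") / 255.0

print(scaled.shape, scaled.dtype, scaled.min() >= 0.0, scaled.max() <= 1.0)
```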

## Creating File Paths for Images
-   Builds a `filepath` column by joining the image directory `covid_cx_dataset/covid19` with each `filename`, and keeps the `label` column as `labels`.


In [18]:
# Prepare image file paths and labels
metadata['filepath'] = metadata['filename'].apply(lambda x: os.path.join('covid_cx_dataset/covid19', x))
labels = metadata['label']
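
Before loading images, it can help to confirm that every constructed path exists and to turn the string labels into numbers for a sigmoid output. The sketch below does both on a toy frame. The `find_missing` helper, the `'Normal'` class, and the choice of `'COVID-19'` as the positive class are assumptions for illustration, not part of the notebook.

```python
import os
import pandas as pd

IMAGE_DIR = 'covid_cx_dataset/covid19'  # directory used in the cell above

def find_missing(paths):
    """Hypothetical helper: return the paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Toy metadata; the 'Normal' label is a made-up second class for illustration.
toy = pd.DataFrame({'filename': ['284.jpg', '277.jpg'],
                    'label': ['COVID-19', 'Normal']})
toy['filepath'] = toy['filename'].apply(lambda x: os.path.join(IMAGE_DIR, x))

# Binary target: 1.0 for 'COVID-19', 0.0 otherwise (assumed encoding).
toy['target'] = (toy['label'] == 'COVID-19').astype('float32')

missing = find_missing(toy['filepath'])
print(f"{len(missing)} of {len(toy)} files missing")
```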


## Splitting Data into Train and Validation Sets

-    Uses `train_test_split` from scikit-learn to split `metadata` into training (`train_metadata`) and validation (`val_metadata`) sets with an 80-20 ratio. Setting `random_state=42` makes the split reproducible.

In [19]:
# Split data into training and validation sets
train_metadata, val_metadata = train_test_split(metadata, test_size=0.2, random_state=42)
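
The metadata includes a `patient_id` column, so if a patient has several images, a row-wise random split can place that patient in both sets. One alternative, sketched here on toy data and not what the notebook does, is scikit-learn's `GroupShuffleSplit`, which keeps all rows of a group together. The toy frame and variable names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy metadata: 5 patients with 2 images each (illustrative values only).
toy = pd.DataFrame({
    'filename': [f'{i}.jpg' for i in range(10)],
    'patient_id': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})

# Here test_size is the fraction of groups (patients), not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(toy, groups=toy['patient_id']))

toy_train, toy_val = toy.iloc[train_idx], toy.iloc[val_idx]

# No patient should appear on both sides of the split.
overlap = set(toy_train['patient_id']) & set(toy_val['patient_id'])
print(len(toy_train), len(toy_val), overlap)
```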


In [20]:
print(train_metadata.head())

     filename  patient_id     sex   age view     label  pcr_test  survival  \
24    284.jpg          19    male  76.0  NaN  COVID-19  positive       NaN   
17    277.jpg          16    male  70.0  NaN  COVID-19  positive       NaN   
66    326.jpg          36  female  76.0  NaN  COVID-19  positive       NaN   
148  411.jpeg         102    male  70.0   AP  COVID-19       NaN       NaN   
249   678.jpg         168  female  68.0   AP  COVID-19  positive       1.0   

    location  admission_offset  ...  has_dyspnea  has_diarrhea  spo2  \
24     Spain               5.0  ...          2.0           NaN   NaN   
17     Spain               5.0  ...          NaN           NaN   NaN   
66     Spain               3.0  ...          1.0           NaN   NaN   
148       UK               6.0  ...          1.0           NaN   NaN   
249       UK              12.0  ...          NaN           NaN   NaN   

     other_symptoms              medical_background  opacification  other  \
24              NaN  

In [21]:
print(val_metadata.head())

    filename  patient_id     sex   age view     label  pcr_test  survival  \
407  858.png         269  female  80.0   AP  COVID-19  positive       0.0   
444  895.jpg         299    male  63.0   AP  COVID-19  positive       NaN   
117  377.png          63     NaN   NaN  NaN  COVID-19       NaN       NaN   
30   290.jpg          21    male  73.0  NaN  COVID-19  positive       NaN   
415  866.png         273    male  77.0   AP  COVID-19  positive       NaN   

    location  admission_offset  ...  has_dyspnea  has_diarrhea  spo2  \
407   Jordan              17.0  ...          NaN           NaN   NaN   
444    Italy               NaN  ...          NaN           NaN   NaN   
117       US               NaN  ...          NaN           NaN   NaN   
30     Spain               3.0  ...          1.0           NaN   NaN   
415    Italy               NaN  ...          1.0           NaN  84.0   

     other_symptoms      medical_background  opacification  other  \
407             NaN                

## Preparing Image Data

-   Uses list comprehension to preprocess each image file path in `train_metadata` and `val_metadata`.
-   Converts processed images into NumPy arrays and stores them in `train_images` and `val_images`.

In [22]:
# Process image data
train_images = np.array([preprocess_image(filepath) for filepath in train_metadata['filepath']])
val_images = np.array([preprocess_image(filepath) for filepath in val_metadata['filepath']])
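
Stacking the per-image arrays gives a single array of shape `(N, 224, 224, 3)` that is held entirely in memory. The sketch below uses a zero-filled stand-in for `preprocess_image`, so no files are needed, to show the resulting shape and size. `fake_preprocess` and the file names are hypothetical.

```python
import numpy as np

def fake_preprocess(filepath):
    """Stand-in for preprocess_image: returns a blank 224x224 RGB float32 array."""
    return np.zeros((224, 224, 3), dtype=np.float32)

paths = ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']  # hypothetical file names
batch = np.array([fake_preprocess(p) for p in paths])

# Each float32 image takes 224 * 224 * 3 * 4 bytes.
mb_per_image = batch[0].nbytes / 1e6
print(batch.shape, f"{mb_per_image:.2f} MB per image")
```

At roughly 0.6 MB per image, a few hundred images fit comfortably in memory. For much larger datasets, a generator-based loader may be preferable.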

## Preparing Tabular Data
-   Selects seven tabular feature columns from the metadata, fills missing values with 0, standardizes them, and records the number of features as the tabular input shape.


In [23]:
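# Define features for tabular data
feature_cols = ['age', 'sex_numeric', 'admission_offset', 'symptom_offset', 'has_fever', 'has_cough', 'has_dyspnea']

# The metadata has a 'sex' column but no 'sex_numeric', so derive it first.
# Assumed encoding: male -> 0, female -> 1; missing values stay NaN and are filled below.
sex_map = {'male': 0, 'female': 1}
train_metadata = train_metadata.assign(sex_numeric=train_metadata['sex'].map(sex_map))
val_metadata = val_metadata.assign(sex_numeric=val_metadata['sex'].map(sex_map))

# Prepare tabular data (missing values are filled with 0)
X_train_tabular = train_metadata[feature_cols].fillna(0)
X_val_tabular = val_metadata[feature_cols].fillna(0)

# Scale tabular data: fit on the training set only, then reuse for validation
scaler = StandardScaler()
X_train_tabular_scaled = scaler.fit_transform(X_train_tabular)
X_val_tabular_scaled = scaler.transform(X_val_tabular)

# Define input shape for tabular data (a tuple, as Keras expects)
input_shape_tabular = (X_train_tabular_scaled.shape[1],)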

## Standardizing Tabular Data

-   Uses `StandardScaler` to standardize the tabular features: `fit_transform` learns each column's mean and standard deviation from `X_train_tabular`, and `transform` applies those same statistics to `X_val_tabular`.

In [10]:
# Standardize numerical features
scaler = StandardScaler()
X_train_tabular_scaled = scaler.fit_transform(X_train_tabular)
X_val_tabular_scaled = scaler.transform(X_val_tabular)
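
What `StandardScaler` computes can be written out with NumPy: each column has its training mean subtracted and is divided by its training standard deviation, and the validation set reuses those training statistics. The toy values below are made up for illustration.

```python
import numpy as np

# Toy tabular data: columns are (age, admission_offset); values are made up.
X_train = np.array([[76.0, 5.0], [70.0, 5.0], [68.0, 12.0], [70.0, 6.0]])
X_val = np.array([[80.0, 17.0], [63.0, 0.0]])

# "fit": learn per-column mean and population standard deviation from training data.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # ddof=0, matching StandardScaler

# "transform": apply the training statistics to both sets.
X_train_scaled = (X_train - mean) / std
X_val_scaled = (X_val - mean) / std

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

Because the validation rows are scaled with training statistics, their columns will generally not have mean 0 and standard deviation 1.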

## Building the Model

-   Defines input shapes for image (`input_shape_img`) and tabular (`input_shape_tabular`) data.
-   Constructs the model using the Functional API of Keras:
    -   **Image Branch**: Convolves and pools the image data.
    -   **Tabular Branch**: Passes tabular data through dense layers.
    -   **Concatenation**: Combines outputs of image and tabular branches.
    -   **Output Layer**: Uses `sigmoid` activation for binary classification (adjust as needed).
-   Compiles the model with `adam` optimizer, `binary_crossentropy` loss function, and `accuracy` metric.

In [11]:
# Define input shapes for the image and tabular branches
input_shape_img = train_images[0].shape
# Keras expects a shape tuple, not a bare integer
input_shape_tabular = (X_train_tabular_scaled.shape[1],)

# Image input branch
img_input = Input(shape=input_shape_img, name='image_input')
x = Conv2D(32, (3, 3), activation='relu')(img_input)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(128, (3, 3), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)

# Tabular input branch
tabular_input = Input(shape=input_shape_tabular, name='tabular_input')
y = Dense(64, activation='relu')(tabular_input)
y = Dense(32, activation='relu')(y)

# Concatenate image and tabular branches
combined = Concatenate()([x, y])
z = Dense(64, activation='relu')(combined)
output = Dense(1, activation='sigmoid')(z)  # Adjust output layer based on your label format

# Build model
model = Model(inputs=[img_input, tabular_input], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
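
The width of each branch at the concatenation point follows from the layer settings: a 3x3 convolution with Keras's default `'valid'` padding removes 2 pixels per spatial dimension, and 2x2 max pooling halves it, rounding down. The arithmetic below traces a 224x224 input through the three conv/pool stages. The helper names are illustrative.

```python
def conv_out(size, kernel=3):
    """Output size of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output size of max pooling with stride equal to pool size, 'valid' padding."""
    return size // pool

size = 224
for _ in range(3):  # three Conv2D + MaxPooling2D stages
    size = pool_out(conv_out(size))

flat_img = size * size * 128       # Flatten() after the last 128-filter stage
tabular_out = 32                   # width of the last tabular Dense layer
combined = flat_img + tabular_out  # width after Concatenate()

print(size, flat_img, combined)  # 26 86528 86560
```

The flattened image features (86,528 values) far outnumber the 32 tabular features at the merge point, which is worth keeping in mind when judging how much the tabular branch can contribute.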