### Experiment: Determine Optimal Number of Layers

We plan to run an experiment to find the optimal number of layers for our neural network on our chosen dataset, the Stroke Prediction dataset (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset). We will train multiple models with varying numbers of layers and compare their performance on a validation set.

To run these experiments efficiently, we will use Dask to train the candidate models in parallel across multiple CPU cores, which should shorten the experimentation phase. We can also use Google Colab to access GPU resources, which would let us test different architectures more quickly.
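As a rough illustration of the plan above, the sketch below fans out one training task per layer count with `dask.delayed` and collects validation scores. Everything specific here is an assumption made for illustration: the synthetic data from `make_classification` stands in for the preprocessed stroke features, scikit-learn's `MLPClassifier` stands in for whichever network we end up using, and `train_and_score`, `layer_counts`, and the layer width of 16 are hypothetical names and values.

```python
import dask
from dask import delayed
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): replace with the preprocessed stroke features.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def train_and_score(n_layers, width=16):
    """Train an MLP with `n_layers` hidden layers; return (n_layers, validation accuracy)."""
    model = MLPClassifier(
        hidden_layer_sizes=(width,) * n_layers,
        max_iter=500,
        random_state=0,
    )
    model.fit(X_train, y_train)
    return n_layers, model.score(X_val, y_val)


# Hypothetical search space: one lazy task per candidate depth.
layer_counts = [1, 2, 3, 4]
tasks = [delayed(train_and_score)(n) for n in layer_counts]

# Nothing runs until compute() is called. The threaded scheduler is used here
# for simplicity; a process-based or distributed scheduler may suit
# CPU-bound training better.
results = dask.compute(*tasks, scheduler="threads")

best_layers, best_score = max(results, key=lambda r: r[1])
print(f"Best depth on this toy data: {best_layers} (accuracy {best_score:.3f})")
```

Because each task is independent, the same pattern should carry over to other model libraries by swapping out the body of `train_and_score`.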


In [None]:
## Data loading:
import dask.dataframe as dd

csv_path = '../healthcare-dataset-stroke-data.csv'

# Dask infers dtypes from a sample at the start of the file, so it may mis-infer
# integer columns when missing values only appear in later partitions.
# Two common fixes are:
#  1) pass assume_missing=True to treat unspecified integer columns as floats, or
#  2) provide explicit dtypes for troublesome columns (e.g. {'age': 'float64'}).
# We read with assume_missing=True and fall back to an explicit dtype if needed.
# dd.read_csv is lazy, so we call head() inside the try block to surface parsing
# errors from the first partition here rather than later.
try:
    data = dd.read_csv(csv_path, assume_missing=True)
    data.head()
except Exception as e:
    print('dd.read_csv failed:', e)
    print('Retrying with explicit dtype for age as float64')
    data = dd.read_csv(csv_path, dtype={'age': 'float64'}, assume_missing=True)

In [13]:
print(data.head())

        id  gender   age  hypertension  heart_disease ever_married  \
0   9046.0    Male  67.0           0.0            1.0          Yes   
1  51676.0  Female  61.0           0.0            0.0          Yes   
2  31112.0    Male  80.0           0.0            1.0          Yes   
3  60182.0  Female  49.0           0.0            0.0          Yes   
4   1665.0  Female  79.0           1.0            0.0          Yes   

       work_type Residence_type  avg_glucose_level   bmi   smoking_status  \
0        Private          Urban             228.69  36.6  formerly smoked   
1  Self-employed          Rural             202.21   NaN     never smoked   
2        Private          Rural             105.92  32.5     never smoked   
3        Private          Urban             171.23  34.4           smokes   
4  Self-employed          Rural             174.12  24.0     never smoked   

   stroke  
0     1.0  
1     1.0  
2     1.0  
3     1.0  
4     1.0  
