# Gender Bias

- Is there a gender difference in COVID-19 infection? Early news reports indicated that women are less likely than men to be severely affected. Does this show up in the competition data set?

In [None]:
import os
import pandas as pd
from tqdm.notebook import tqdm
import numpy as np
import pydicom

### DICOM Meta tag

- The patient's name and other identifying data are hashed, but the gender metadata is still available in the Patient's Sex tag (0010,0040). I wish age were available too!

In [None]:
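path = "../input/siim-covid19-detection/train/00086460a852/9e8302230c91/65761e66de9f.dcm"
# dcmread replaces read_file, which is deprecated in pydicom 2.x and removed in 3.0
dicom = pydicom.dcmread(path)
# Print the header lines that hold the patient tags
print('\n'.join(str(dicom).split('\n')[14:17]))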

In [None]:
df_study = pd.read_csv('../input/siim-covid19-detection/train_study_level.csv')

### Map labels and gender

In [None]:
ids, genders, labels = [], [], []
label_cols = df_study.columns[1:5]  # the four study-level label columns
# For each study id, the name of the label column holding the largest value
study_labels = df_study.set_index('id')[label_cols].idxmax(axis=1)
trainfiles = [(dirname, filenames) for dirname, _, filenames in tqdm(os.walk('../input/siim-covid19-detection/train/')) if len(filenames) > 0]
for dirname, filenames in tqdm(trainfiles):
    for file in filenames:
        # Directory layout is train/<study>/<subdir>/<image>.dcm
        sid = dirname.split("/")[-2] + '_study'
        ids.append(file.replace('.dcm', '') + '_image')
        labels.append(study_labels[sid])
        path = os.path.join(dirname, file)
        # Only the header is needed, so skip the pixel data
        dicom = pydicom.dcmread(path, stop_before_pixels=True)
        # (0010,0040) Patient's Sex; keep the first character (F or M)
        genders.append(str(dicom.PatientSex)[0])
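
- The label mapping in the loop above picks, for each study, the label column with the largest value. A minimal sketch of that step on a toy table follows. The ids and values are made up, and the four column names are an assumption about the layout of `train_study_level.csv`.

```python
import pandas as pd

# Toy stand-in for train_study_level.csv; ids and values are made up
toy_study = pd.DataFrame({
    "id": ["aaa_study", "bbb_study"],
    "Negative for Pneumonia": [1, 0],
    "Typical Appearance": [0, 1],
    "Indeterminate Appearance": [0, 0],
    "Atypical Appearance": [0, 0],
})

label_cols = toy_study.columns[1:5]
# idxmax(axis=1) returns, per row, the name of the column with the largest value
study_label = toy_study.set_index("id")[label_cols].idxmax(axis=1)
print(study_label.to_dict())
```

- Building the lookup once avoids filtering the whole table for every image file.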

- Gender is always F or M. Every DICOM file in the training set contains the gender tag.

In [None]:
set(genders)
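
- `set` shows which values occur but not how often. A small sketch with `collections.Counter` gives the counts and flags anything other than F or M. The list below is made-up sample data, not the real `genders` list.

```python
from collections import Counter

# Made-up values standing in for the `genders` list built above
sample_genders = ["F", "M", "M", "F", "M"]

counts = Counter(sample_genders)
# Any value outside the two expected codes would show up here
unexpected = sorted(set(counts) - {"F", "M"})
print(counts, unexpected)
```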

In [None]:
pd.DataFrame({"id":ids,"gender":genders,"label":labels}).to_csv("genders.csv", index=False)

### Number in the dataset 

- Men and women are evenly represented in the data set.

In [None]:
import matplotlib.pyplot as plt
df = pd.read_csv('genders.csv')
plt.hist(df.gender)

In [None]:
df.hist(column="label", by="gender")

### Gender Bias in Study

- In the per-gender label histograms above, there are clearly more 'negatives' among women. Wow! This matches the news reports: in this data set, women appear to be less affected by COVID-19.
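
- Raw histogram heights are hard to compare when the groups differ in size, so a table of per-gender label shares may be easier to read. This is only a sketch on made-up rows shaped like `genders.csv`; the numbers say nothing about the real data, and the label names are shortened placeholders.

```python
import pandas as pd

# Made-up rows in the same shape as genders.csv (gender, label)
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "label": ["negative", "negative", "typical", "typical",
              "negative", "typical", "typical", "typical"],
})

# normalize="index" makes each row sum to 1: label shares within one gender
shares = pd.crosstab(toy["gender"], toy["label"], normalize="index")
print(shares)
```

- On the real file, the same call would be `pd.crosstab(df.gender, df.label, normalize="index")`.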

In [None]:
ids, genders = [], []
testfiles = [(dirname, filenames) for dirname, _, filenames in tqdm(os.walk('../input/siim-covid19-detection/test/')) if len(filenames) > 0]
for dirname, filenames in tqdm(testfiles):
    for file in filenames:
        ids.append(file.replace('.dcm', '') + '_image')
        path = os.path.join(dirname, file)
        # Only the header is needed, so skip the pixel data
        dicom = pydicom.dcmread(path, stop_before_pixels=True)
        # (0010,0040) Patient's Sex; keep the first character (F or M)
        genders.append(str(dicom.PatientSex)[0])

### In the Test set

- The same tag is available in the test data set.

In [None]:
set(genders)

In [None]:
pd.DataFrame({"id":ids,"gender":genders}).to_csv("genders_test.csv", index=False)

In [None]:
df = pd.read_csv('genders_test.csv')
plt.hist(df.gender)

### We might be able to use this for something...