# An Analysis of Coronavirus Infection in Cancer Patients
> What is the incidence of infection with coronavirus among cancer patients?
>
>And, is disease severity or trajectory different among these patients vs those not diagnosed with cancer?

These questions are difficult to answer without access to specific data about cancer patients who have contracted the coronavirus. The most detailed dataset I have seen is the [dataset from Hospital Israelita Albert Einstein in São Paulo, Brazil](https://www.kaggle.com/einsteindata4u/covid19/) that contains anonymized lab results for patients, including whether they tested positive for COVID-19. The description of the data is as follows:

>This dataset contains anonymized data from patients seen at the Hospital Israelita Albert Einstein, at São Paulo, Brazil, and who had samples collected to perform the SARS-CoV-2 RT-PCR and additional laboratory tests during a visit to the hospital.
>
>All data were anonymized following the best international practices and recommendations. All clinical data were standardized to have a mean of zero and a unit standard deviation.

I hypothesize that it is possible to use the lab results in this dataset to infer which patients may have cancer, allowing us to answer the questions posed by this task.

Here are the details about this notebook that address the evaluation criteria:

>1. Scientific rigor
>a. Is the solution evidence based? I.e. does it leverage robust data?

Yes, the solution is based on data collected by a well-regarded hospital.

>2. Scientific model/strategy
>a. Did the solution employ a robust scientific method?

The solution attempts to answer the question posed while being conservative about which data is used in the analysis. Because the answer is inferred from lab results rather than read from a direct diagnosis, a conservative approach seemed prudent.

>3. Unique and novel insight
>a. Does the solution identify information (new data, features, insights etc) that is yet to be “uncovered?”

Yes, the solution attempts to identify cancer patients based on their lab results instead of a direct cancer diagnosis.

>4. Market Translation and Applicability
>a. Does the solution resolve an existing market need for either an individual, health institution or policy maker?

Possibly; given more complete data, this method could identify specific lab markers or tests that could be used to identify severe cases of the coronavirus.

>5. Speed to market
>a. Does it apply to an existing product vision such as a self assessment tool or policy decision-making tool?

Possibly, given more data (as stated in the previous answer).

>6. Longevity of solution in market
>a. Is the solution one that could be used in various markets through time?

Yes, given more data.

>7. Ability of user to collaborate and contribute to other solutions within the Kaggle community
>a. Did the user provide expertise and or resources in the form of datasets or models to their fellow Kaggle members?

I have not created any models yet, due to the small amount of data I have been able to extract from this dataset. If I am able to find other datasets, it would be possible to create models based on this technique.

In [None]:
import csv, sys, math
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

# Read data into a pandas dataframe
einsteinFile = "/kaggle/input/uncover/einstein/diagnosis-of-covid-19-and-its-clinical-spectrum.csv"
healthDF = pd.read_csv(einsteinFile)

healthDF.head()

In [None]:
# Show the columns available in this dataset
healthDF.columns.values

From a quick peek at the data:
* the 3rd column, *sars_cov_2_exam_result*, indicates whether the patient tested positive for COVID-19 or not
* the next three columns indicate whether the patient was admitted to the hospital and if so, to which unit (regular ward, semi-intensive unit, intensive care) -- **these columns can be used to indicate the severity of the sickness for these patients**
* the columns after that (column 7 and later) represent results of tests that were run on the patient -- **these columns can be used to determine which patients had cancer**

To start, I identify the lab tests for which we have a large enough sample size to be considered representative of the population. Using this [sample calculator](https://www.surveysystem.com/sscalc.htm), I determined that for:

* confidence level: 95%
* confidence interval: 5 (that is, a margin of error of +/- 5 percentage points)
* population: 5644 (*the number of rows in the dataset*)

the suggested sample size needed is 360. Therefore, I will only look at lab tests where at least 360 results were recorded.
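
The figure of 360 can be reproduced without the online calculator. The sketch below assumes the calculator applies the usual sample-size formula for a proportion with a finite population correction and a worst-case proportion of 0.5; `required_sample_size` is a hypothetical helper name, not part of the notebook.

```python
import math
from statistics import NormalDist

def required_sample_size(population, confidence=0.95, margin=0.05, p=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size needed for an effectively infinite population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Reduce it to account for the finite population
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

suggested = required_sample_size(5644)
print(suggested)  # 360
```

Under these assumptions the result matches the threshold used for `minSampleSize` in the next cell.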

In [None]:
columns = list(healthDF.columns)
labTests = columns[6:]

minSampleSize = 360

# Count the number of recordings for each lab test. Disallow test if the count is below minSampleSize.
usableLabTests = list()

for labTest in labTests:
    sampleCount = len(healthDF[healthDF[labTest].notnull()])
    
    if sampleCount >= minSampleSize:
        usableLabTests.append(labTest)
        
print("__Lab tests with sample size >= " + str(minSampleSize) + "__\n" + str(usableLabTests))

Some of these lab tests have a "detected" or "not detected" result. These are tests used to determine whether the patient had a specific disease. We should separate these boolean lab tests from the lab tests that have numerical results.

In [None]:
floatType = np.dtype('float64')

numericalLabTests = list()
booleanLabTests = list()

for test in usableLabTests:
    if healthDF[test].dtype == floatType:
        numericalLabTests.append(test)        
    else:
        uniqueVals = list(healthDF[test].unique())

        # Look for the presence of the value "detected", since there is at least one test where
        # "not_detected" is the only non-null result (not useful).
        if "detected" in uniqueVals:
            booleanLabTests.append(test)

print("Numerical tests: " + str(numericalLabTests))
print("\nBoolean tests: " + str(booleanLabTests))

For most of the numerical tests, a result outside the normal range can indicate severe issues, which may include the following:
* autoimmune disease
* cancer
* anemia
* deficiency of some sort

While we may not be able to specifically determine from this dataset whether a patient has cancer, we can at the very least infer that something is severely and negatively impacting a patient's health, and that their immune system is likely compromised (much like a cancer patient). Based on this, I will separate the data into two groups of patients: "possible cancer/immunocompromised" and "likely no cancer".

Since the numerical lab tests are key to our analysis, we should drop any patients who have not recorded results for any of the numerical tests.

In [None]:
print("Original number of patients: " + str(len(healthDF)))
patientsWithNumericalResults = list()

for index, row in healthDF.iterrows():
    patientID = row['patient_id']
    
    hasResults = False
    for numTest in numericalLabTests:
        if not pd.isna(row[numTest]):
            hasResults = True
            break
    
    if hasResults:
        patientsWithNumericalResults.append(patientID)
        
patientsWithResultsDF = healthDF[healthDF['patient_id'].isin(patientsWithNumericalResults)].copy()

print("Number of patients with at least one numerical lab result: " + str(len(patientsWithResultsDF)))
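
The row-by-row loop above can also be written as a single vectorized filter. The following is a minimal sketch on a made-up toy dataframe (the column names and values are illustrative only); it gives the same rows as the loop when every `patient_id` is unique.

```python
import numpy as np
import pandas as pd

# Toy stand-in for healthDF; column names and values are made up
toyDF = pd.DataFrame({
    "patient_id": ["a", "b", "c", "d"],
    "hematocrit": [0.2, np.nan, np.nan, -1.3],
    "platelets": [np.nan, np.nan, 0.7, 0.1],
})
toyNumericalTests = ["hematocrit", "platelets"]

# True for rows where at least one numerical test has a recorded value
hasAnyResult = toyDF[toyNumericalTests].notna().any(axis=1)
toyWithResultsDF = toyDF[hasAnyResult].copy()

print(list(toyWithResultsDF["patient_id"]))  # ['a', 'c', 'd']
```

Patient `b` is dropped because neither test has a recorded value.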



Recall that the description of the dataset mentioned that the (numerical) lab results were normalized to have a mean of 0 and a standard deviation of 1. The normalization makes it difficult to determine if a test result is outside the range of acceptable values for that test (we can't just look up the "acceptable" range for that test and then see if the recorded value is within that range). However, we can still calculate whether the result is outside the +/- one standard deviation range of the mean. This could indicate that the person is very sick.

Let's take a look at the range of values recorded for these tests.

In [None]:
testStats = dict()
minErrorToTest = dict()

for test in numericalLabTests:
    # Calculate mean, stdev
    testStats[test] = dict()
    
    meanVal = patientsWithResultsDF[test].mean()
    minVal = patientsWithResultsDF[test].min()
    maxVal = patientsWithResultsDF[test].max()
    stdVal = patientsWithResultsDF[test].std()

    testStats[test]["mean"] = meanVal
    testStats[test]["std"] = stdVal
    testStats[test]["min"] = minVal
    testStats[test]["max"] = maxVal
    testStats[test]["min1Std"] = meanVal - stdVal
    testStats[test]["max1Std"] = meanVal + stdVal
     
    minError = meanVal - minVal
    if minError not in minErrorToTest:
        minErrorToTest[minError] = list()
        
    minErrorToTest[minError].append(test)

sortedMins = reversed(sorted(minErrorToTest.keys()))

meansForPlot = list()
testNamesForPlot = list()
minsForPlot = list()
maxsForPlot = list()
minStdForPlot = list()
maxStdForPlot = list()

for sm in sortedMins:
    for testName in minErrorToTest[sm]:
        minsForPlot.append(sm)
        testNamesForPlot.append(testName)
        meansForPlot.append(testStats[testName]["mean"])
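        # errorbar's yerr expects distances from the mean, so store (max - mean) rather than the raw max
        maxsForPlot.append(testStats[testName]["max"] - testStats[testName]["mean"])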
        
        minStdForPlot.append(testStats[testName]["min1Std"])
        maxStdForPlot.append(testStats[testName]["max1Std"])

# Set the figure size when creating the figure; calling plt.figure() afterwards would open a new, empty figure
fig, (ax0, ax1) = plt.subplots(1, 2, sharey=True, figsize=(14, 6))
ax0.plot(testNamesForPlot[0:10], minStdForPlot[0:10], 'b--', label='1 std dev')
ax0.plot(testNamesForPlot[0:10], maxStdForPlot[0:10], 'b--')
ax0.errorbar(testNamesForPlot[0:10], meansForPlot[0:10], yerr=[minsForPlot[0:10], maxsForPlot[0:10]], fmt='o', label='test Mean')

ax1.plot(testNamesForPlot[10:], minStdForPlot[10:], 'b--', label='1 std dev')
ax1.plot(testNamesForPlot[10:], maxStdForPlot[10:], 'b--')
ax1.errorbar(testNamesForPlot[10:], meansForPlot[10:], yerr=[minsForPlot[10:], maxsForPlot[10:]], fmt='o', label='test Mean')

ax0.tick_params(labelsize=10)
ax1.tick_params(labelsize=10)

ax0.legend(loc='upper left')
ax1.legend(loc='upper left')

# Use a figure-level title so it applies to both panels
fig.suptitle("Mean and min/max range for numerical lab tests")
plt.setp(ax0.xaxis.get_majorticklabels(), rotation=90)
plt.setp(ax1.xaxis.get_majorticklabels(), rotation=90)
plt.tight_layout()
plt.show()

As expected, the mean for each of the measured lab tests is indeed approximately 0, and the standard deviation is approximately 1.

Assuming a normal distribution, then +/- one standard deviation from the mean represents about 68% of the normal population. From the figure above, you can see that there are recordings for each test that range above and, in some cases, below, one standard deviation from the mean value. From this, we can conclude that patients that recorded test results outside of the +/- one standard deviation range exhibit abnormal results for that test, and are thus likely to be compromised. 
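
The 68% figure can be checked directly with the standard library. This is only a sanity check under the same normality assumption; it says nothing about how the lab values in this dataset are actually distributed.

```python
from statistics import NormalDist

standardNormal = NormalDist(mu=0, sigma=1)

# Probability mass within one standard deviation of the mean
withinOneStd = standardNormal.cdf(1) - standardNormal.cdf(-1)
outsideOneStd = 1 - withinOneStd

print(round(withinOneStd, 4), round(outsideOneStd, 4))  # 0.6827 0.3173
```

Note that roughly 32% of values from a perfectly normal population also fall outside this range, so a one standard deviation cutoff is a fairly loose definition of "abnormal".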

Notice that the minimum values for the last two lab tests are less than one standard deviation below the mean. This agrees with what I found when researching these tests: a low value for each of them is considered normal, so the majority of the test results should be grouped in the low part of the range.

* [eosinophils](https://www.healthline.com/health/eosinophil-count-absolute)
* [proteina_c_reativa_mg_dl](https://labtestsonline.org/tests/c-reactive-protein-crp)

Let's look at the patient data and add columns to the dataframe to signify whether each patient exhibited a significant deviation (more than 1 standard deviation higher or lower than the mean) for any lab result, and whether they tested positive for any diseases.

In [None]:
def calculateHighLowNormalTest(row, mode, testNames, testStats):
    
    testCount = 0
    
    for testName in testNames:
        stats = testStats[testName]       
    
        if mode == "high":
            if row[testName] > stats["max1Std"]:
                testCount += 1
                
        elif mode == "low":
            if row[testName] < stats["min1Std"]:
                testCount += 1
                
        elif mode == "normal":
            if row[testName] <= stats["max1Std"] and row[testName] >= stats["min1Std"]:
                testCount += 1
        
        else:
            # Fail loudly on a bad mode instead of exiting the interpreter
            raise ValueError("Unrecognized mode: " + mode)
    
    return testCount
    
def calculateBooleanTest(row, stringToFind, booleanTests):
    count = 0
    
    for bt in booleanTests:
        if isinstance(row[bt], float):
            continue
            
        if row[bt] == stringToFind:
            count += 1
            
    return count

    
def sumRecordedTests(row, testType, testNames):
    count = 0
    
    if testType == "num":
        for testName in testNames:
            if not math.isnan(row[testName]):
                count += 1
                
    elif testType == "bool":
        for testName in testNames:
            if isinstance(row[testName], float):
                continue
            if "detected" in row[testName]:
                count += 1
            
    return count

patientsWithResultsDF['numHighTests'] = patientsWithResultsDF.apply(lambda row: calculateHighLowNormalTest(row, "high", numericalLabTests, testStats), axis=1)
patientsWithResultsDF['numLowTests'] = patientsWithResultsDF.apply(lambda row: calculateHighLowNormalTest(row, "low", numericalLabTests, testStats), axis=1)
patientsWithResultsDF['numNormalTests'] = patientsWithResultsDF.apply(lambda row: calculateHighLowNormalTest(row, "normal", numericalLabTests, testStats), axis=1)

patientsWithResultsDF['numDetectedTests'] = patientsWithResultsDF.apply(lambda row: calculateBooleanTest(row, "detected", booleanLabTests), axis=1)
patientsWithResultsDF['numNotDetectedTests'] = patientsWithResultsDF.apply(lambda row: calculateBooleanTest(row, "not_detected", booleanLabTests), axis=1)

patientsWithResultsDF['numValidNumTests'] = patientsWithResultsDF.apply(lambda row: sumRecordedTests(row, "num", numericalLabTests), axis=1)
patientsWithResultsDF['numValidBoolTests'] = patientsWithResultsDF.apply(lambda row: sumRecordedTests(row, "bool", booleanLabTests), axis=1)
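
For reference, the same high/low/recorded counts can be computed without `apply`. The sketch below uses a made-up toy dataframe and hard-codes the thresholds to +/- 1 (the notebook computes them per test from the data); the test names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy standardized results; NaN means the test was not recorded
toyDF = pd.DataFrame({
    "testA": [0.1, 2.5, -1.8, np.nan],
    "testB": [1.4, 0.0, np.nan, np.nan],
})
toyTests = ["testA", "testB"]

# Assumed thresholds for illustration: one standard deviation around a mean of 0
upper = pd.Series({"testA": 1.0, "testB": 1.0})
lower = pd.Series({"testA": -1.0, "testB": -1.0})

# Comparisons against NaN are False, so missing results are never counted
toyDF["numHighTests"] = toyDF[toyTests].gt(upper, axis="columns").sum(axis=1)
toyDF["numLowTests"] = toyDF[toyTests].lt(lower, axis="columns").sum(axis=1)
toyDF["numValidNumTests"] = toyDF[toyTests].notna().sum(axis=1)

print(toyDF[["numHighTests", "numLowTests", "numValidNumTests"]])
```

The last toy row has no recorded tests, so all three of its counts are 0.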

In [None]:
patientsWithResultsDF.head()

At this point, we have enough data that we can separate the patients into two populations: *Group 1* patients that may have cancer or be immunocompromised, and *Group 2* patients that are likely not to have cancer or be immunocompromised. In the cases where a patient recorded abnormal numerical test results and also tested positive for any other disease, I do not include that patient's results in either Group 1 or Group 2, since it's possible the abnormal lab results are a result of testing positive for the disease (and not due to cancer).

### Group 1 
* possible cancer/immunocompromised
* abnormal (high or low) lab results
* did not test positive for any disease

### Group 2
* no cancer/not immunocompromised
* normal lab results
* positive/negative results for diseases are disregarded since I assume normal lab results indicate no cancer



In [None]:
# Patients with abnormal test results (possible cancer)
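# Conditions: at least one high/low numerical result, at least one boolean test recorded,
# and every recorded boolean test came back "not_detected"
group1 = patientsWithResultsDF[((patientsWithResultsDF['numHighTests'] + patientsWithResultsDF['numLowTests']) > 0) &
                               (patientsWithResultsDF['numValidBoolTests'] > 0) &
                               (patientsWithResultsDF['numNotDetectedTests'] == patientsWithResultsDF['numValidBoolTests'])]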

group1Size = len(group1)

positiveGroup1 = len(group1[group1['sars_cov_2_exam_result'] == 'positive'])
negativeGroup1 = len(group1[group1['sars_cov_2_exam_result'] == 'negative'])

print("Group 1 (Cancer patients) size: " + str(group1Size))

In [None]:
# Patients with no high/low numerical tests (no cancer)
group2 = patientsWithResultsDF[(patientsWithResultsDF['numNormalTests'] > 0) &
                                ((patientsWithResultsDF['numHighTests'] + patientsWithResultsDF['numLowTests']) == 0)]

group2Size = len(group2)

positiveGroup2 = len(group2[group2['sars_cov_2_exam_result'] == 'positive'])
negativeGroup2 = len(group2[group2['sars_cov_2_exam_result'] == 'negative'])

print("Group 2 (Non-Cancer patients) size: " + str(group2Size))

Based on the population represented in this dataset, we can answer the original question:

>What is the incidence of infection with coronavirus among cancer patients?

(Note, however, that both groups are small: 186 patients in Group 1 and only 28 in Group 2. It is difficult to generalize from samples this small.)

In [None]:
print("Group 1 (Cancer) fraction of patients testing positive for COVID-19:\t" + str(positiveGroup1/group1Size))
print("Group 2 (No Cancer) fraction of patients testing positive for COVID-19:\t" + str(positiveGroup2/group2Size))
print()
print("Group 1 (Cancer) fraction of patients testing negative for COVID-19:\t" + str(negativeGroup1/group1Size))
print("Group 2 (No Cancer) fraction of patients testing negative for COVID-19:\t" + str(negativeGroup2/group2Size))

In [None]:
# Graphical representation of the numbers
labels = ['Pos for COVID-19', 'Neg for COVID-19']
group1Numbers = [positiveGroup1/group1Size, negativeGroup1/group1Size]
group2Numbers = [positiveGroup2/group2Size, negativeGroup2/group2Size]

x = np.arange(len(labels))
width = 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, group1Numbers, width, label='Group 1 (Cancer) - ' + str(group1Size) + ' Patients')
rects2 = ax.bar(x + width/2, group2Numbers, width, label='Group 2 (No Cancer) - ' + str(group2Size) + ' Patients')

ax.set_ylabel('Fraction')
ax.set_title('Coronavirus Diagnosis in Cancer vs. Non-Cancer Patients')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
plt.legend(bbox_to_anchor=(1.05, 1.0), loc='upper left')
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45)

plt.show()

**Approximately 27% of the patients in Group 1 (possible cancer/immunocompromised) tested positive for coronavirus.** This percentage is higher (as expected) than the 18% of Group 2 (likely no cancer) patients who tested positive.
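
To get a feel for how much uncertainty the small group sizes introduce, one option is a Wilson score interval for each proportion. This is a sketch only: the counts below are hypothetical, chosen to roughly match the reported percentages (about 27% of 186 and about 18% of 28), and are not taken from the notebook output. `wilson_interval` is a hypothetical helper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Hypothetical counts for illustration; replace with positiveGroup1/group1Size etc.
intervals = {}
for label, positives, size in [("Group 1", 50, 186), ("Group 2", 5, 28)]:
    low, high = wilson_interval(positives, size)
    intervals[label] = (low, high)
    print(f"{label}: {positives / size:.3f} (95% CI {low:.3f} to {high:.3f})")
```

With counts like these, the Group 2 interval is much wider than the Group 1 interval and overlaps it, so the difference between the two percentages should be treated as suggestive rather than established.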

To answer the second question

>And, is disease severity or trajectory different among these patients vs those not diagnosed with cancer?

we can use the columns in the dataset indicating which hospital ward the patient was admitted to (regular, semi-intensive, or ICU). Note that a patient may not be admitted to any of these wards (all three columns could be false). However, I did verify that in this dataset, a patient can only be admitted to one of these wards (there are no patients where more than one ward has a value == 't').

In [None]:
wardColumns = [
    'patient_addmited_to_regular_ward_1_yes_0_no',
    'patient_addmited_to_semi_intensive_unit_1_yes_0_no',
    'patient_addmited_to_intensive_care_unit_1_yes_0_no',
]

def countAdmissions(group, examResult):
    # Return (regular ward, semi-intensive unit, ICU, not admitted) counts for the
    # patients in `group` whose COVID-19 test result equals examResult.
    subset = group[group['sars_cov_2_exam_result'] == examResult]
    admitted = [int((subset[col] == 't').sum()) for col in wardColumns]
    # "Not admitted" is counted within this subset only (not the whole group),
    # relying on each patient being admitted to at most one ward.
    return (*admitted, len(subset) - sum(admitted))

positiveGroup1_regularWard, positiveGroup1_semiIU, positiveGroup1_ICU, positiveGroup1_noWard = countAdmissions(group1, 'positive')
negativeGroup1_regularWard, negativeGroup1_semiIU, negativeGroup1_ICU, negativeGroup1_noWard = countAdmissions(group1, 'negative')

positiveGroup2_regularWard, positiveGroup2_semiIU, positiveGroup2_ICU, positiveGroup2_noWard = countAdmissions(group2, 'positive')
negativeGroup2_regularWard, negativeGroup2_semiIU, negativeGroup2_ICU, negativeGroup2_noWard = countAdmissions(group2, 'negative')

In [None]:
print("Group 1 (Cancer) COVID-positive patients admitted to the regular ward:\t\t" + str(positiveGroup1_regularWard/group1Size))
print("Group 2 (No Cancer) COVID-positive patients admitted to the regular ward:\t" + str(positiveGroup2_regularWard/group2Size))
print()
print("Group 1 (Cancer) COVID-positive patients admitted to the Semi-Intensive Unit:\t\t" + str(positiveGroup1_semiIU/group1Size))
print("Group 2 (No Cancer) COVID-positive patients admitted to the Semi-Intensive Unit:\t" + str(positiveGroup2_semiIU/group2Size))
print()
print("Group 1 (Cancer) COVID-positive patients admitted to the ICU:\t\t" + str(positiveGroup1_ICU/group1Size))
print("Group 2 (No Cancer) COVID-positive patients admitted to the ICU:\t" + str(positiveGroup2_ICU/group2Size))
print()
print("Group 1 (Cancer) COVID-positive patients not admitted to any ward:\t\t" + str(positiveGroup1_noWard/group1Size))
print("Group 2 (No Cancer) COVID-positive patients not admitted to any ward:\t" + str(positiveGroup2_noWard/group2Size))
print("----------")
print("Group 1 (Cancer) COVID-negative patients admitted to the regular ward:\t\t" + str(negativeGroup1_regularWard/group1Size))
print("Group 2 (No Cancer) COVID-negative patients admitted to the regular ward:\t" + str(negativeGroup2_regularWard/group2Size))
print()
print("Group 1 (Cancer) COVID-negative patients admitted to the Semi-Intensive Unit:\t\t" + str(negativeGroup1_semiIU/group1Size))
print("Group 2 (No Cancer) COVID-negative patients admitted to the Semi-Intensive Unit:\t" + str(negativeGroup2_semiIU/group2Size))
print()
print("Group 1 (Cancer) COVID-negative patients admitted to the ICU:\t\t" + str(negativeGroup1_ICU/group1Size))
print("Group 2 (No Cancer) COVID-negative patients admitted to the ICU:\t" + str(negativeGroup2_ICU/group2Size))
print()
print("Group 1 (Cancer) COVID-negative patients not admitted to any ward:\t\t" + str(negativeGroup1_noWard/group1Size))
print("Group 2 (No Cancer) COVID-negative patients not admitted to any ward:\t" + str(negativeGroup2_noWard/group2Size))

In [None]:
labels = ['Pos COVID-19 admitted to Regular Ward', 'Pos COVID-19 admitted to Semi-Intensive Ward', 
          'Pos COVID-19 admitted to ICU', 'Pos COVID-19 not admitted']
group1Numbers = [ positiveGroup1_regularWard/group1Size, positiveGroup1_semiIU/group1Size, 
                 positiveGroup1_ICU/group1Size, positiveGroup1_noWard/group1Size ]
group2Numbers = [ positiveGroup2_regularWard/group2Size, positiveGroup2_semiIU/group2Size, 
                 positiveGroup2_ICU/group2Size, positiveGroup2_noWard/group2Size ]

x = np.arange(len(labels))
width = 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, group1Numbers, width, label='Group 1 (Cancer) - ' + str(group1Size) + ' Patients')
rects2 = ax.bar(x + width/2, group2Numbers, width, label='Group 2 (No Cancer) - ' + str(group2Size) + ' Patients')

ax.set_ylabel('Fraction')
ax.set_title('Admittance for Positive COVID-19 Cancer vs. Non-Cancer Patients')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
plt.legend(bbox_to_anchor=(1.05, 1.0), loc='upper left')
plt.setp(ax.xaxis.get_majorticklabels(), rotation=80)

plt.show()



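In this dataset, about 11% of the patients in each group tested positive for coronavirus and were admitted to the regular hospital ward. (The fractions above are computed over each group's full size, not over its COVID-positive patients only.)

However, none of the Group 2 (non-cancer) patients who tested positive were admitted to the semi-intensive or intensive care units (0% in this data). Meanwhile, approximately 4% of Group 1 (cancer) patients tested positive and were admitted to the semi-intensive unit, and 3% to the ICU. With only 28 patients in Group 2, the 0% figure should be read with caution.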
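
The cells above divide each count by the full group size. To express admissions as a rate among COVID-positive patients instead, the denominator would be the number of positive patients in the group. A minimal sketch with hypothetical counts (not taken from the notebook output) and a hypothetical helper name:

```python
def admission_rates(admittedPositive, totalPositive, groupSize):
    """Return (share of the whole group, share of the COVID-positive patients)."""
    shareOfGroup = admittedPositive / groupSize
    # Guard against a group with no positive patients
    shareOfPositives = admittedPositive / totalPositive if totalPositive else float("nan")
    return shareOfGroup, shareOfPositives

# Hypothetical counts for illustration only
shareOfGroup, shareOfPositives = admission_rates(admittedPositive=20, totalPositive=50, groupSize=186)
print(f"{shareOfGroup:.3f} of the group, {shareOfPositives:.3f} of positive patients")
```

In the notebook, the real inputs would be values such as `positiveGroup1_regularWard`, `positiveGroup1`, and `group1Size`.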

For the sake of comparison, here is the graph showing admittance rates for cancer and non-cancer patients who test *negative* for the coronavirus.

In [None]:
labels = ['Neg COVID-19 admitted to Regular Ward', 'Neg COVID-19 admitted to Semi-Intensive Ward', 
          'Neg COVID-19 admitted to ICU', 'Neg COVID-19 not admitted']
group1Numbers = [ negativeGroup1_regularWard/group1Size, negativeGroup1_semiIU/group1Size,
                 negativeGroup1_ICU/group1Size, negativeGroup1_noWard/group1Size ]
group2Numbers = [ negativeGroup2_regularWard/group2Size, negativeGroup2_semiIU/group2Size,
                 negativeGroup2_ICU/group2Size, negativeGroup2_noWard/group2Size ]

x = np.arange(len(labels))
width = 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, group1Numbers, width, label='Group 1 (Cancer) - ' + str(group1Size) + ' Patients')
rects2 = ax.bar(x + width/2, group2Numbers, width, label='Group 2 (No Cancer) - ' + str(group2Size) + ' Patients')

ax.set_ylabel('Fraction')
ax.set_title('Admittance for Negative COVID-19 Cancer vs. Non-Cancer Patients')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
plt.legend(bbox_to_anchor=(1.05, 1.0), loc='upper left')
plt.setp(ax.xaxis.get_majorticklabels(), rotation=80)

plt.show()



## Future Work
The dataset used in this analysis is useful in showing the effects of the coronavirus on immunocompromised patients. With additional data, we may be able to delve more into how specific parts of the population are affected. Aspects I would hope to be able to study in the future (assuming more data is forthcoming) are:

* With more data, we could use Machine Learning algorithms (e.g., Random Forest) to predict when cancer and non-cancer patients are likely to have the coronavirus. With algorithms like Random Forest, we can also determine which lab tests (features) are important to and indicative of a positive coronavirus test. If the features are different for cancer patients vs non-cancer patients, that might give doctors a good idea of which tests to prioritize for the patient.
*(Note: I started going down this path, and saw some differences in models created from cancer vs. non-cancer patients, but I didn't feel confident in reporting the results given I only had 28 non-cancer datapoints to work with.)*

* With more lab tests reported per patient, we could include more numerical lab tests in our analysis (sample size would be larger).

* If the gender of the patient is provided, we may be able to pinpoint how high/low test results contribute towards a coronavirus diagnosis (normal ranges for lab tests are often different between male/female).

* We could also start to delve into how much higher/lower a patient's test results are from the mean, and determine how that affects the coronavirus diagnosis.


