# Data Wrangling Homework

In the spirit of warming up for final projects, this week's homework will be somewhat open-ended. Actually, it will be more like closed-ended and open-middled...

Different doctors with different backgrounds, trained in different places, etc., might behave differently. In other words, one doctor might have various biases relative to another. We hope not, at least in critical situations, but doctors are people too.

Your job is to determine whether the 4 doctors in our data set are behaving essentially the same with respect to measuring clump thickness, bland chromatin, and diagnosis of tumor type, or whether any one of them seems to be different. 

The submission should be a PDF that makes your case as though to a boss or hospital administrator; it should make the case in enough detail to be convincing, but not in such detail that your boss will hate you. For example, one doctor-to-doctor comparison can be described in some detail, but the rest can be summarized with "Similar comparisons were made for ..."

Your final conclusion should be whether 1) everything seems okay with respect to the doctors or 2) there are any red flags that might warrant further scrutiny.

Do the analysis with an open mind. It's not good to enter an analysis with a preconceived notion of what you may or may not find.

In [36]:
import numpy as np
import pandas as pd
data = pd.read_csv("./breast_cancer_data.csv")

In [37]:
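# Per-doctor summary: mean of each measurement and the most frequent diagnosis
doctor_stats = data.groupby('doctor_name').agg({'clump_thickness': 'mean',
                                                'bland_chromatin': 'mean',
                                                # value_counts() sorts by frequency, so index[0] is the most common class
                                                'class': lambda x: x.value_counts().index[0],
                                                })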
print(doctor_stats)

             clump_thickness  bland_chromatin   class
doctor_name                                          
Dr. Doe             4.189189         3.076503  benign
Dr. Lee             4.182320         3.435754  benign
Dr. Smith           4.874286         3.863636  benign
Dr. Wong            4.445860         3.426752  benign
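
The per-doctor means above differ somewhat (Dr. Smith's mean clump thickness is the highest of the four). One way to judge whether a gap of that size could plausibly arise by chance is a permutation test on the difference in means. The following is only a sketch under stated assumptions: the scores are synthetic integers on a 1 to 10 scale, and the names `doctor_a`, `others`, and `permutation_pvalue` are hypothetical, not part of the homework data or any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scores on a 1-10 scale (hypothetical, NOT the homework data)
doctor_a = rng.integers(1, 11, size=150)
others = rng.integers(1, 11, size=450)

def permutation_pvalue(a, b, n_perm=5000, rng=rng):
    """Two-sided permutation test for a difference in group means."""
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        # Shuffle the pooled scores and split them into groups of the original sizes
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        hits += int(diff >= observed)
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_perm + 1)

p_value = permutation_pvalue(doctor_a, others)
print(f"p-value: {p_value:.3f}")
```

With the real data, `doctor_a` would be one doctor's `clump_thickness` values and `others` the values from the remaining doctors. A small p-value suggests the gap is unlikely under random relabeling, which is a flag for a closer look rather than proof of bias.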


In [39]:
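# Same summary, but index[1] picks the second most frequent diagnosis
# (raises IndexError if a doctor recorded only one class)
doctor_stats = data.groupby('doctor_name').agg({'clump_thickness': 'mean',
                                                'bland_chromatin': 'mean',
                                                'class': lambda x: x.value_counts().index[1]
                                                })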
print(doctor_stats)

             clump_thickness  bland_chromatin      class
doctor_name                                             
Dr. Doe             4.189189         3.076503  malignant
Dr. Lee             4.182320         3.435754  malignant
Dr. Smith           4.874286         3.863636  malignant
Dr. Wong            4.445860         3.426752  malignant


In [40]:
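# Count diagnoses per doctor alongside the measurement means.
# NOTE: the labels in this data set are lowercase ('benign', 'malignant'),
# so these capitalized comparisons match no rows and every count is 0.
doctor_stats = data.groupby('doctor_name').agg({'clump_thickness': 'mean',
                                                'bland_chromatin': 'mean',
                                                'class': [('Benign', lambda x: (x == 'Benign').sum()),
                                                          ('Malignant', lambda x: (x == 'Malignant').sum())]})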

print(doctor_stats)

            clump_thickness bland_chromatin  class          
                       mean            mean Benign Malignant
doctor_name                                                 
Dr. Doe            4.189189        3.076503      0         0
Dr. Lee            4.182320        3.435754      0         0
Dr. Smith          4.874286        3.863636      0         0
Dr. Wong           4.445860        3.426752      0         0
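
The zero counts above come from comparing against `'Benign'` and `'Malignant'`, while the earlier outputs show the `class` labels in lowercase. A minimal sketch of a case-insensitive count using `pd.crosstab` follows. The small `toy` frame is made-up illustration data that only borrows the column names, so its numbers say nothing about the real doctors.

```python
import pandas as pd

# Hypothetical toy data standing in for breast_cancer_data.csv
toy = pd.DataFrame({
    "doctor_name": ["Dr. Doe", "Dr. Doe", "Dr. Lee", "Dr. Lee", "Dr. Lee"],
    "class": ["benign", "malignant", "benign", "benign", "malignant"],
})

# Normalize case so the count does not depend on capitalization
labels = toy["class"].str.lower()

# Rows: doctors, columns: diagnosis labels, cells: number of cases
class_counts = pd.crosstab(toy["doctor_name"], labels)

# Row-normalized version: fraction of each diagnosis per doctor
class_fractions = pd.crosstab(toy["doctor_name"], labels, normalize="index")

print(class_counts)
print(class_fractions)
```

Fractions are usually easier to compare than raw counts when the doctors saw different numbers of patients.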


In [21]:
doctor_stats = doctor_stats.reset_index()
print(doctor_stats)

  doctor_name clump_thickness bland_chromatin  class          
                         mean            mean Benign Malignant
0     Dr. Doe        4.189189        3.076503      0         0
1     Dr. Lee        4.182320        3.435754      0         0
2   Dr. Smith        4.874286        3.863636      0         0
3    Dr. Wong        4.445860        3.426752      0         0
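
After `reset_index()`, the columns still carry two header levels, which makes later selection awkward (for example, `doctor_name` has an empty second level). One possible way to flatten them into single strings is sketched below, again on hypothetical toy data; the joined names such as `clump_thickness_mean` are a naming choice made here, not something the assignment prescribes.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the homework file
toy = pd.DataFrame({
    "doctor_name": ["Dr. Doe", "Dr. Doe", "Dr. Lee"],
    "clump_thickness": [4, 6, 3],
    "bland_chromatin": [2, 4, 5],
})

# Passing lists of aggregations produces two-level (MultiIndex) columns
stats = toy.groupby("doctor_name").agg({
    "clump_thickness": ["mean", "max"],
    "bland_chromatin": ["mean"],
}).reset_index()

# Join the two levels with "_" and strip the trailing "_" left where
# the second level is empty (as for doctor_name)
stats.columns = ["_".join(col).rstrip("_") for col in stats.columns]

print(stats)
```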
