# DATA WRANGLING AND TIDYING
## EDA: Diagnosing Diabetes

In this project, you’ll imagine you are a data scientist exploring how certain diagnostic factors affect the diabetes outcome of female patients.

You will use your EDA skills to help inspect, clean, and validate the data.

**Note**: This [dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) is from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains the following columns:

- `Pregnancies`: Number of times pregnant
- `Glucose`: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- `BloodPressure`: Diastolic blood pressure (mm Hg)
- `SkinThickness`: Triceps skinfold thickness (mm)
- `Insulin`: 2-hour serum insulin (µU/ml)
- `BMI`: Body mass index (kg/m^2)
- `DiabetesPedigreeFunction`: Diabetes pedigree function
- `Age`: Age (years)
- `Outcome`: Class variable (0 or 1)

Let’s get started!

### Initial Inspection

1. First, familiarize yourself with the dataset
    [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

    Look at each of the nine columns in the documentation.

    What do you expect each data type to be?

    *Answer*: It looks like all `dtype`s should be `int`s or `float`s.

2. Next, let’s load in the diabetes data to start exploring.

    Load the data in a variable called `diabetes_data` and print the first few rows.

    **Note**: The data is stored in a file called `diabetes.csv`.

In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_rows", None)

diabetes_data = pd.read_csv("diabetes.csv")

print(diabetes_data.info())
diabetes_data.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None


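|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |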


3. How many columns (features) does the data contain?

   *Answer*: 9

4. How many rows (observations) does the data contain?

    *Answer*: 768
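
Both counts can also be read directly from the `.shape` attribute instead of the `.info()` printout. Here is a minimal sketch; `sample` is a hypothetical three-row stand-in built from values in the table above, so the snippet runs without `diabetes.csv`.

```python
import pandas as pd

# Hypothetical stand-in: three columns from the first three rows shown above.
# On the real data you would use diabetes_data instead of sample.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8],
    "Glucose": [148, 85, 183],
    "Outcome": [1, 0, 1],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Given the `.info()` output above, `diabetes_data.shape` should report 768 rows and 9 columns.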

5. Let’s inspect `diabetes_data` further.

    Do any of the columns in the data contain null (missing) values?

In [2]:
diabetes_data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

*Answer*: Looks like no...

6. If you answered no to the question above, not so fast!

    While it’s technically true that none of the columns contain null values, that doesn’t necessarily mean that the data isn’t missing any values.

    When exploring data, you should always question your assumptions and try to dig deeper.

    To investigate further, calculate summary statistics on `diabetes_data` using the `.describe()` method.

In [3]:
diabetes_data.describe()

|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


7. Looking at the summary statistics, do you notice anything odd about the following columns?
    - `Glucose`
    - `BloodPressure`
    - `SkinThickness`
    - `Insulin`
    - `BMI`

*Answer*: The minimum value in each of these columns is 0, which is not a plausible measurement for a living patient. The zeros are most likely placeholders for missing values.
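
If you want to see how many zeros each of these columns holds, you can count them with a boolean comparison. The sketch below is one possible approach, not a required project step; `sample` is a hypothetical stand-in containing the five columns from the first ten rows printed in Step 2.

```python
import pandas as pd

# Hypothetical stand-in: the five suspicious columns from the first ten rows.
# On the real data you would select these columns from diabetes_data.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 0, 45, 0],
    "Insulin": [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
})

# (sample == 0) is a True/False table; summing it counts the zeros per column
zero_counts = (sample == 0).sum()
print(zero_counts)
```

Even in this small slice, `Insulin` and `SkinThickness` hold most of the zeros.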

8. Do you spot any other outliers in the data?

*Answer*: The maximum `Insulin` value is 846, far above the 75th percentile of 127.25, and the maximum `Pregnancies` value is 17.
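
Spotting outliers by eye works for a first pass. One common rule of thumb for a more systematic check is the 1.5 × IQR fence. The sketch below is an illustration rather than part of the project: `iqr_outliers` is a hypothetical helper, and the input holds only the four non-zero `Insulin` readings from the first ten rows printed in Step 2.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[outside]

# Non-zero Insulin readings from the first ten rows (a tiny sample, so the
# fences here will differ from those computed on the full column)
insulin_sample = pd.Series([94, 168, 88, 543], name="Insulin")

flagged = iqr_outliers(insulin_sample)
print(flagged)
```

A flagged value is a candidate for a closer look, not automatically an error; a very high insulin reading may be a genuine measurement.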

9. Let’s see if we can get a more accurate view of the missing values in the data.

    Use the following code to replace the instances of `0` with `NaN` in the five columns mentioned:
    
    ```python
    diabetes_data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = diabetes_data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0,np.nan)
    ```

In [4]:
# Columns where 0 is a placeholder for a missing measurement
columns = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
diabetes_data[columns] = diabetes_data[columns].replace(0, np.nan)

10. Next, check for missing (null) values in all of the columns just like you did in Step 5.

    Now how many missing values are there?

In [5]:
diabetes_data.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
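
Raw counts are easier to compare as a share of all rows. From the output above, `Insulin` is missing in 374 of 768 rows (about 49%) and `SkinThickness` in 227 of 768 (about 30%). A minimal sketch of computing those shares follows; `sample` is a hypothetical stand-in built from the first five rows of the data after the zeros were replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: three columns from the first five rows, zeros
# already replaced with NaN. On the real data, use diabetes_data.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "SkinThickness": [35, 29, np.nan, 23, 35],
    "Insulin": [np.nan, np.nan, np.nan, 94, 168],
})

# The mean of a True/False column is the fraction of True values
missing_share = sample.isnull().mean()
print(missing_share)
```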

11. Let’s take a closer look at these rows to get a better idea of why some data might be missing.

    Print out all of the rows that contain missing (null) values.

In [6]:
diabetes_data[diabetes_data.isnull().any(axis=1)]

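|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 5 | 5 | 116.0 | 74.0 | NaN | NaN | 25.6 | 0.201 | 30 | 0 |
| 7 | 10 | 115.0 | NaN | NaN | NaN | 35.3 | 0.134 | 29 | 0 |
| 9 | 8 | 125.0 | 96.0 | NaN | NaN | NaN | 0.232 | 54 | 1 |
| 10 | 4 | 110.0 | 92.0 | NaN | NaN | 37.6 | 0.191 | 30 | 0 |
| 11 | 10 | 168.0 | 74.0 | NaN | NaN | 38.0 | 0.537 | 34 | 1 |
| 12 | 10 | 139.0 | 80.0 | NaN | NaN | 27.1 | 1.441 | 57 | 0 |
| 15 | 7 | 100.0 | NaN | NaN | NaN | 30.0 | 0.484 | 32 | 1 |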


12. Go through the rows with missing data. Do you notice any patterns or overlaps between the missing data?

*Answer*: Most rows with missing data are also missing `Insulin` data.
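
Rather than scanning the rows by eye, you can measure the overlap by combining two missing-value masks. This is one possible way to check it, sketched on a hypothetical `sample` that holds two columns from the first ten rows of the data after the zeros were replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: two columns from rows 0-9 after the 0 -> NaN step.
# On the real data, use the same columns of diabetes_data.
sample = pd.DataFrame({
    "SkinThickness": [35, 29, np.nan, 23, 35, np.nan, 32, np.nan, 45, np.nan],
    "Insulin": [np.nan, np.nan, np.nan, 94, 168, np.nan, 88, np.nan, 543, np.nan],
})

skin_missing = sample["SkinThickness"].isnull()
insulin_missing = sample["Insulin"].isnull()

# Rows where both measurements are missing at once
both_missing = int((skin_missing & insulin_missing).sum())

# Share of SkinThickness-missing rows that are also missing Insulin
share = both_missing / int(skin_missing.sum())
print(both_missing, share)
```

In this slice, every row that is missing `SkinThickness` is also missing `Insulin`.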

13. Next, take a closer look at the data types of each column in `diabetes_data`.

    Does the result match what you would expect?

In [7]:
print(diabetes_data.dtypes)

Pregnancies                   int64
Glucose                     float64
BloodPressure               float64
SkinThickness               float64
Insulin                     float64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

*Answer*: Yes, all values are `float`s or `int`s.
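
You may have noticed that four columns changed from `int64` to `float64` compared with Step 2. That happens because `NaN` is a floating-point value, so pandas converts an integer column to float as soon as it holds one. The sketch below reproduces this on a hypothetical three-value Series and shows pandas' nullable `Int64` dtype as one optional alternative if you prefer to keep whole numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: Insulin values from rows 2-4 of the original data
insulin = pd.Series([0, 94, 168], name="Insulin", dtype="int64")

# Replacing 0 with NaN upcasts the column to float64
as_float = insulin.replace(0, np.nan)

# Optional: the nullable integer dtype keeps integers and shows <NA>
as_nullable = as_float.astype("Int64")

print(insulin.dtype, as_float.dtype, as_nullable.dtype)
```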

**Note**: In Codecademy, the `Outcome` column shows up as `object` `dtype`. This is because some of the `0`s are actually the letter `O`. The data has likely been updated to fix this issue, so we'll skip ahead.
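
If your copy of the data does have this problem, a fix along the following lines should work. This is a sketch under assumptions: the values in `raw_outcome` are made up for illustration, and the letter `O` is assumed to be the only non-numeric entry.

```python
import pandas as pd

# Made-up example values: the capital letter "O" stands in for some zeros
raw_outcome = pd.Series(["1", "0", "O", "1", "O"], name="Outcome")

# Inspect the distinct values first to find anything unexpected
print(raw_outcome.unique())

# Swap the letter for the digit, then convert the column to integers
cleaned_outcome = raw_outcome.replace("O", "0").astype("int64")
print(cleaned_outcome.dtype)
```

Checking `.unique()` before converting matters because `.astype("int64")` raises an error if any other non-numeric string remains.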

### Next Steps

16. Congratulations! In this project, you saw how EDA can help with the initial data inspection and cleaning process. This is an important step as it helps to keep your datasets clean and reliable.

    Here are some ways you might extend this project if you’d like:

    * Use `.value_counts()` to more fully explore the values in each column.
    * Investigate other outliers in the data that may be easily overlooked.
    * Instead of changing the `0` values in the five columns to `NaN`, try replacing the values with the median or mean of each column.
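
As a starting point for the last idea, here is a minimal sketch of median imputation. It assumes you first turn the zeros into `NaN` so that they do not pull the median down, then fill the gaps; `sample` is a hypothetical stand-in built from the first five rows of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: two columns from the first five rows of the raw data
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "Insulin": [0, 0, 0, 94, 168],
})

# Step 1: mark the placeholder zeros as missing
marked = sample.assign(Insulin=sample["Insulin"].replace(0, np.nan))

# Step 2: fill each column's gaps with that column's median
# (.median() skips NaN by default)
imputed = marked.fillna(marked.median())
print(imputed)
```

The median is less sensitive than the mean to extreme values such as the `Insulin` maximum of 846. To try the mean instead, swap `.median()` for `.mean()`.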

In [8]:
diabetes_data

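|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116.0 | 74.0 | NaN | NaN | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78.0 | 50.0 | 32.0 | 88.0 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115.0 | NaN | NaN | NaN | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197.0 | 70.0 | 45.0 | 543.0 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125.0 | 96.0 | NaN | NaN | NaN | 0.232 | 54 | 1 |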
