# CRISP-DM for prediction of diabetes

Over the next few weeks, we'll go through a simple example of a data science process:

- business understanding (asking the right questions)
- loading data
- EDA (exploratory data analysis), visualization
- data cleaning and preparation
- modeling and evaluation
- communication of results
    
In this file, we'll cover the first three bullets above.
    
# 1. Business understanding

In this demo, we'll be working with a dataset with health and biographic data on diabetes diagnoses from [here](https://data.world/informatics-edu/diabetes-prediction). The purpose of our task is to understand which data is related to the occurrence of diabetes, and to eventually predict a risk of diabetes based on the data. Our question is: can we accurately predict the occurrence of diabetes based on the demographic and medical data we collect? With this, we can offer better personalized health services to people, and potentially improve the overall health of everyone by understanding what we can do to reduce the risk of diabetes.

We'll start with EDA (exploratory data analysis) - both numeric and visual.

# 2. Data understanding - EDA and visualization basics

Understanding how to create charts and plots of data is an important first step in understanding the data. In this first section, we'll cover the bare bones basics which are all you need to complete the assignments this week.

Make sure you've got the following packages installed before starting. You can install them from a terminal or command prompt with `conda install -c conda-forge pandas pandas-profiling matplotlib openpyxl`, or you can install them in a Jupyter cell with `!conda install -c conda-forge pandas pandas-profiling matplotlib openpyxl -y`. If conda is taking too long, you can use pip instead, e.g. `!pip install pandas-profiling`. To run a cell, select it and press Shift+Enter.

In [None]:
!conda install -c conda-forge pandas matplotlib openpyxl -y

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

First, we load the data. `pandas` is one of the most, if not *the* most, common data loading and preparation packages in Python for data science. The package's documentation is excellent, and it can read from many filetypes. Here is the page for the `read_excel` function: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

The main class in pandas is the DataFrame, which is often stored in a variable `df`. If you are getting an error, you may need to install `openpyxl` as shown above.

In [None]:
# we can give an index number or name for our index column, or leave it blank
df = pd.read_excel('data/diabetes_data.xlsx', index_col='Patient number')
df

It's always a good idea to look at the top and bottom of the data to make sure everything looks OK. Putting the `df` object on the last line of a Jupyter cell, as we did above, displays the top and bottom rows of the data. We can also do this with `head` and `tail`:

In [None]:
df.head()

In [None]:
df.tail()

## Using pandas for EDA and visualization

The pandas package has a few functions for generating numeric EDA and statistics, and can easily plot data.

numeric EDA:
- info
- describe
- unique
- value_counts

plots:
- bar plots
- histograms
- scatter plots

other:
- filtering

### Numeric EDA

`info` shows each column's datatype (dtype) and its number of non-null values; comparing that count with the total number of entries tells us how many values are missing:

In [None]:
df.info()
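
Since `info` reports non-null counts rather than missing counts, it can help to count the missing values explicitly. Here is a minimal sketch on a small made-up dataframe (`toy_df` and its values are illustrative stand-ins, not the diabetes data):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the diabetes data, with two missing values
toy_df = pd.DataFrame({
    'Age': [34, 51, np.nan, 45],
    'Glucose': [85.0, 140.0, 99.0, np.nan],
    'Gender': ['female', 'male', 'female', 'male'],
})

# isna() marks each missing cell as True; sum() counts them per column
missing_counts = toy_df.isna().sum()
print(missing_counts)
```

The same `isna().sum()` call should work on `df` once the data is loaded.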

Describe shows some numeric stats on numeric columns:

In [None]:
df.describe()

We can get the columns like so:

In [None]:
df.columns

In [None]:
# select a column
df['Age']

In [None]:
# select a column and get the counts of each unique value
df['Age'].value_counts()

In [None]:
# similar, but only gets unique values
df['Age'].unique()

### Bar plots

In [None]:
# turn value_counts into a bar plot
df['Age'].value_counts().plot.bar()

In [None]:
# we can make it easier to read by restricting the number of values to the top 10
df['Age'].value_counts()[:10].plot.bar()

This is using the matplotlib package, so we can add axis labels and other things to the plot with matplotlib. The matplotlib package is one of the oldest (if not *the* oldest) plotting packages in Python. For most common tasks (e.g. adding an x-axis label), an internet search will usually lead us to a Stack Overflow page or the matplotlib documentation.

In [None]:
import matplotlib.pyplot as plt

df['Age'].value_counts()[:10].plot.bar()
plt.xlabel('Age')
plt.ylabel('Counts')

If you want to hide the text that gets printed above the plot, assign the last line to the `_` variable (essentially, throw away the returned value instead of displaying it).

In [None]:
df['Age'].value_counts()[:10].plot.bar()
plt.xlabel('Age')
_ = plt.ylabel('Counts')

If you want to share the figure, you can right click it and copy or save the image, or you can see the example at the bottom of the notebook for saving a figure.

### Histograms

Three common types of plots we can use are bar plots (like we saw), histograms, and scatter plots. Histograms are generated by pandas-profiling, but we can also look at a particular histogram like so:

In [None]:
df['Glucose'].hist()

In [None]:
# this slightly different interface has a different style and generally looks better without gridlines
df['Glucose'].plot.hist()

There are many options for the function shown in the docs:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html

Here, we change the number of bars (bins).

In [None]:
df['Glucose'].plot.hist(bins=30)

### Scatter plots

Scatter plots are for showing the relationship between two continuous variables, or variables that can take any value within a given range (e.g. both glucose and cholesterol can be any value above 0, but cholesterol is usually in the range 100-300).

In [None]:
df.plot.scatter(x='Cholesterol', y='Glucose')

# Optional - Advanced EDA and visualization
This part is not required, but is extra for those who want to learn more. It covers:
- filtering dataframes
- plotting with seaborn
- using the phik correlation
- time series plots with pandas

### Filtering

If we want to get only certain subsets of the data, we can filter it. For example, let's get everyone over the median age. It's usually best to call `copy()` at the end to take a copy of the slice of the dataframe; this avoids the `SettingWithCopyWarning` that can otherwise appear when we later modify the filtered data.

In [None]:
over_median_age = df[df['Age'] > df['Age'].median()].copy()

This uses a boolean comparison, which indexes the dataframe and returns rows where the condition is `True`.

In [None]:
df['Age'] > df['Age'].median()

We can use the same boolean comparison operators as in most of Python and other programming languages, such as `<`, `>`, `<=`, `>=`, `==`, and `!=`. We can also negate a condition with the `~` character:

In [None]:
~(df['Age'] > df['Age'].median())

To combine filters, we use the `&` (and) and `|` (or) operators, being careful to wrap each conditional filter in parentheses:

In [None]:
over_median_age_chol = df[(df['Age'] > df['Age'].median()) & (df['Cholesterol'] > df['Cholesterol'].median())].copy()

In [None]:
over_median_age_chol
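
The same pattern works with `|` and `~`. Below is a small sketch on a made-up dataframe (`toy_df` and its values are hypothetical). Storing each condition in its own variable is one way to avoid forgetting the parentheses, which are needed because `&` and `|` bind more tightly than comparisons like `>`:

```python
import pandas as pd

# hypothetical stand-in for the diabetes data
toy_df = pd.DataFrame({
    'Age': [25, 40, 60, 70],
    'Cholesterol': [150, 250, 180, 300],
})

# each condition is a boolean Series with one True/False per row
older = toy_df['Age'] > toy_df['Age'].median()                      # median is 50
high_chol = toy_df['Cholesterol'] > toy_df['Cholesterol'].median()  # median is 215

either = toy_df[older | high_chol]       # rows matching at least one condition
both = toy_df[older & high_chol]         # rows matching both conditions
not_both = toy_df[~(older & high_chol)]  # all remaining rows
print(len(either), len(both), len(not_both))
```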

We can filter to get the two groups, people with diabetes and people without, and then look at the proportions of the genders in each group (we use proportions rather than counts since the groups are not the same size). The `shape` attribute of a dataframe is a tuple with (rows, columns), so getting the first element with `[0]` gives us the number of rows. It looks like there isn't a large difference in the balance of male/female genders between these two groups.

In [None]:
df.shape

In [None]:
diabetes_df = df[df['Diabetes'] == 'Diabetes']

diabetes_df['Gender'].value_counts() / diabetes_df.shape[0]

In [None]:
no_diabetes_df = df[df['Diabetes'] == 'No diabetes']

no_diabetes_df['Gender'].value_counts() / no_diabetes_df.shape[0]

This is also what the `normalize` argument does.

In [None]:
no_diabetes_df['Gender'].value_counts(normalize=True)
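
Rather than filtering each group separately, the same proportions can be computed in one step with `groupby`. This is a sketch on a made-up dataframe (the values are illustrative; the column names mirror the ones used above):

```python
import pandas as pd

# hypothetical stand-in: 2 people with diabetes, 4 without
toy_df = pd.DataFrame({
    'Diabetes': ['Diabetes', 'Diabetes', 'No diabetes',
                 'No diabetes', 'No diabetes', 'No diabetes'],
    'Gender': ['female', 'male', 'female', 'female', 'female', 'male'],
})

# proportions of each gender within each diabetes group
gender_props = toy_df.groupby('Diabetes')['Gender'].value_counts(normalize=True)
print(gender_props)
```

The result is indexed by (group, gender), so the proportions within each group sum to 1.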

## Seaborn for plotting

Seaborn is a package that uses matplotlib and pandas dataframes to create more complex plots with minimal effort. In our case, we can group our data by people with and without diabetes, and plot some of their characteristics. We are also going to use the phik package for correlations, so we need to install both these packages first:

In [None]:
!conda install -c conda-forge seaborn phik -y

Setting `stat='density'` and `common_norm=False` normalizes each group's histogram separately, so the area under each individual histogram equals 1.
This makes it easier to compare the two groups. We can see that people with diabetes tend to have much higher glucose levels:

In [None]:
import phik
import seaborn as sns

_ = sns.histplot(data=df, y='Glucose', hue='Diabetes', stat='density', common_norm=False)
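
To see what density normalization means, here is a sketch with NumPy on made-up values (randomly generated, not the diabetes data). With `density=True`, the bar heights multiplied by the bin widths sum to 1, which is the normalization `stat='density'` applies to each group when `common_norm=False`:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=100, scale=15, size=500)  # made-up glucose-like values

# density=True rescales the bar heights so the total area is 1
heights, edges = np.histogram(values, bins=20, density=True)

# area of each bar = height * width; the sum is the total area
area = np.sum(heights * np.diff(edges))
print(area)
```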

With seaborn, we can create scatter plots and color them by groups:

In [None]:
sns.relplot(data=df, x='Cholesterol', y='Glucose', hue='Diabetes')

One other nice plot to examine is a correlogram. This shows the linear correlations between columns. We can see the pairs BMI and weight as well as waist and hip measurements are strongly correlated with each other. This is the Pearson correlation, which shows linear relationships between two numeric columns. For more advanced correlations, try the Phi-k correlation package: https://phik.readthedocs.io/en/latest/

In [None]:
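# numeric_only=True skips text columns such as Gender, which newer pandas versions would otherwise raise an error on
sns.heatmap(df.corr(numeric_only=True))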

In [None]:
sns.heatmap(df.phik_matrix())
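
For reference, the Pearson correlation is the covariance of two columns divided by the product of their standard deviations. The sketch below uses made-up numbers (`x` and `y` are illustrative) to compare that formula with pandas' built-in `corr`:

```python
import pandas as pd

# made-up, nearly linear data
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.1, 9.8])

# covariance divided by the product of the standard deviations
manual_r = x.cov(y) / (x.std() * y.std())

# pandas computes the Pearson correlation by default
pandas_r = x.corr(y)
print(manual_r, pandas_r)
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.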

One last note: the `countplot` in seaborn is very much like doing `df['column'].value_counts().plot.bar()`, but allows us to use the `hue` argument to group data by a categorical variable.
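
As a sketch of what that might look like, here is `countplot` on a small made-up dataframe (`toy_df` is hypothetical; the column names mirror the diabetes data):

```python
import pandas as pd
import seaborn as sns

# hypothetical stand-in for the diabetes data
toy_df = pd.DataFrame({
    'Gender': ['female', 'male', 'female', 'female', 'male', 'male'],
    'Diabetes': ['Diabetes', 'No diabetes', 'No diabetes',
                 'No diabetes', 'Diabetes', 'No diabetes'],
})

# one bar per Gender value, split into colored bars by Diabetes status
ax = sns.countplot(data=toy_df, x='Gender', hue='Diabetes')
ax.set_ylabel('Counts')
```

`countplot` returns a matplotlib Axes object, so labels can be set on it directly, as above, or with `plt.ylabel` as before.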

## Time series plots

Time series plots are a little different, since we'll often be using the x-axis as sequential time. With pandas, as long as our dataframe's index has a datetime datatype, we can easily plot time series data. We'll be using data from here: https://www.kaggle.com/selfishgene/historical-hourly-weather-data?select=temperature.csv

In [None]:
# parse_dates converts the datetime column to timestamps as the file is read
time_df = pd.read_csv('data/temperature.csv', index_col='datetime', parse_dates=['datetime'])
time_df

In [None]:
# we can see the index is of type "DatetimeIndex"
time_df.info()

We could also get our data in a proper format using `pd.to_datetime()`:

In [None]:
time_df2 = pd.read_csv('data/temperature.csv')
time_df2['datetime'] = pd.to_datetime(time_df2['datetime'])
time_df2.set_index('datetime', inplace=True)

In [None]:
time_df2

In [None]:
time_df['Denver'].plot()

One last trick we'll learn with datetime data is that we can *resample* it, meaning change the time increments. For example, we can convert our hourly data to monthly data like so. Since many hourly values fall into each month, we need to provide an aggregation for the data, like `mean` to take the average.

In [None]:
# note: in pandas 2.2 and later, the 'M' alias is deprecated in favor of 'ME' (month end)
time_df_months = time_df.resample('1M').mean()

In [None]:
time_df_months['Denver'].plot()
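
To make the effect of resampling concrete, here is a sketch on two days of made-up hourly readings (`toy_series` is hypothetical), resampled to daily values. Different aggregations, such as `mean` or `max`, can be applied to the same resampled groups:

```python
import numpy as np
import pandas as pd

# two days of made-up hourly readings: day 1 is all 10.0, day 2 is all 20.0
hours = pd.date_range('2020-01-01', periods=48, freq=pd.Timedelta(hours=1))
toy_series = pd.Series(np.repeat([10.0, 20.0], 24), index=hours)

# 'D' groups the hourly values by calendar day
daily_mean = toy_series.resample('D').mean()
daily_max = toy_series.resample('D').max()
print(daily_mean)
```

The 48 hourly values collapse to 2 daily values, one per calendar day.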

If you want to remove rows with missing values, `dropna` works:

In [None]:
time_df.shape

In [None]:
time_df.dropna(inplace=True)
time_df.shape
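
Note that `dropna` with no other arguments drops every row that has a missing value in *any* column. If only one column matters, the `subset` argument restricts the check to that column. Here is a sketch on made-up data (the values and the choice of columns are illustrative):

```python
import numpy as np
import pandas as pd

# hypothetical temperatures in Kelvin with one missing value per column
toy_df = pd.DataFrame({
    'Denver': [280.0, np.nan, 285.0, 290.0],
    'Boston': [np.nan, 275.0, 278.0, 281.0],
})

all_complete = toy_df.dropna()                  # drops rows with any missing value
denver_only = toy_df.dropna(subset=['Denver'])  # only checks the Denver column
print(all_complete.shape, denver_only.shape)
```

Without `inplace=True`, `dropna` returns a new dataframe and leaves the original unchanged.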

### Saving a plot

Saving a plot allows you to get higher resolution and control the size of the plot. In general, we want to first create the figure object with our specified size, then create our plot, then use `plt.tight_layout`, then save the figure. Remember `plt` is matplotlib's `pyplot` module, which we imported earlier. Here is a respectable figure showing the temperature in Denver over the years. `dpi` is dots per inch. A higher value means higher resolution and a bigger file size. 300 can work well.

Notice we also convert the units from Kelvin to Fahrenheit and add reasonable x- and y-labels so that the plot is easily understood. Making a good plot requires this effort.

In [None]:
# Kelvin -> Celsius (subtract 273.15), then Celsius -> Fahrenheit
time_df_months['Denver_F'] = 9 / 5 * (time_df_months['Denver'] - 273.15) + 32

f = plt.figure(figsize=(5.5, 5.5))
time_df_months['Denver_F'].plot()
plt.xlabel('Year')
plt.ylabel('Temperature (F)')
plt.tight_layout()  # auto-adjust margins
plt.savefig('denver_temps.jpg', dpi=300)
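
As a sanity check on the unit conversion, here is a sketch that wraps the formula in a function (`kelvin_to_fahrenheit` is a name chosen for illustration) and tests it on the freezing and boiling points of water. It assumes the standard offset of 273.15 between Kelvin and Celsius:

```python
import math


def kelvin_to_fahrenheit(kelvin):
    """Convert a temperature (or a pandas Series of them) from Kelvin to Fahrenheit."""
    # Kelvin -> Celsius, then Celsius -> Fahrenheit
    return 9 / 5 * (kelvin - 273.15) + 32


freezing_f = kelvin_to_fahrenheit(273.15)  # water freezes at 32 F
boiling_f = kelvin_to_fahrenheit(373.15)   # water boils at 212 F

# compare with a tolerance, since floating-point arithmetic is not exact
print(math.isclose(freezing_f, 32.0), math.isclose(boiling_f, 212.0))
```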

## Further resources

The pandas documentation is excellent and shows how to create plots: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

Seaborn also has a gallery with examples: https://seaborn.pydata.org/examples/index.html

Kaggle has a short course on Python visualization: https://www.kaggle.com/learn/overview