In [27]:
import pandas as pd
import numpy as np

import ml_utils

## Exploratory Data Analysis

This is an example of how we can use an external EDA class to automate some of the data exploration steps. 

A few points to note here:
<ul>
<li> The imports above include "ml_utils", which refers to "ml_utils.py", the file we've made to contain our EDA class.
<li> The actual EDA code lives in that ml_utils.py file.
<li> Below, we create an object of that EDA class (similarly to how we'd create an object of a LinearRegression() when making a model).
<li> To do the EDA work, we call ("ask") methods on the object we created, and those methods run the EDA steps that we've built into the class.
</ul>
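
The contents of ml_utils.py are not shown in this notebook, so the block below is only a rough sketch of the general shape such a class could take. The class name `SketchEDA` and the `summary` method are hypothetical. The other method names mirror the ones called later in this notebook, but their bodies are assumptions rather than the actual implementation.

```python
import pandas as pd


class SketchEDA:
    """Hypothetical, stripped-down stand-in for an EDA class such as edaDF."""

    def __init__(self, data, target):
        self.data = data      # the DataFrame to explore
        self.target = target  # name of the target column
        self.cat = []         # categorical column names, set later
        self.num = []         # numerical column names, set later

    def giveTarget(self):
        """Return the name of the target column."""
        return self.target

    def setCat(self, cat_list):
        """Record which columns should be treated as categorical."""
        self.cat = list(cat_list)

    def setNum(self, num_list):
        """Record which columns should be treated as numerical."""
        self.num = list(num_list)

    def summary(self):
        """Return describe() output for the numerical columns (assumed behavior)."""
        return self.data[self.num].describe()


# Tiny demo frame, just enough to exercise the class.
demo = pd.DataFrame({"Age": [40, 49, 37], "Sex": ["M", "F", "M"], "HeartDisease": [0, 1, 0]})
eda = SketchEDA(demo, "HeartDisease")
eda.setCat(["Sex"])
eda.setNum(["Age"])
print(eda.giveTarget())
```

Creating `eda` here plays the same role as creating a `LinearRegression()` object: the object stores its configuration, and each method call asks it to do one piece of work.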

In [28]:
df = pd.read_csv("heart.csv")
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


### Load External EDA Class

We want to create an object here that we can call later. 

In [30]:
df_eda = ml_utils.edaDF(df,"HeartDisease")
print(df_eda.giveTarget())

HeartDisease


### Provide Configuration to the EDA Class

In [31]:
df_eda.setCat(["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"])
df_eda.setNum(["Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak"])
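
Typing the column lists by hand works, but they could also be derived from the dtypes as a starting point. The helper below is a minimal sketch under that assumption. `split_columns` is a hypothetical name and is not part of ml_utils.py.

```python
import pandas as pd


def split_columns(df, target):
    """Return (categorical, numerical) column-name lists, excluding the target."""
    features = df.drop(columns=[target])
    # Anything with a numeric dtype counts as numerical; everything else as categorical.
    num_cols = features.select_dtypes(include="number").columns.tolist()
    cat_cols = features.select_dtypes(exclude="number").columns.tolist()
    return cat_cols, num_cols


# Small demo frame; with the real data you would pass df instead.
demo = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "Oldpeak": [0.0, 1.0, 0.0],
    "HeartDisease": [0, 1, 0],
})
cat_cols, num_cols = split_columns(demo, "HeartDisease")
print(cat_cols)
print(num_cols)
```

A dtype-based split treats integer-coded flags such as FastingBS as numerical, so it is worth reviewing the result and overriding it by hand where a column is really a category.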

### Perform EDA

In [32]:
df_eda.fullEDA()

Tab(children=(Output(), Output(), Output()), _titles={'0': 'Info', '1': 'Categorical', '2': 'Numerical'})

## Create an EDA Class

Your first task is to create your own EDA class, optionally based on the example code above. There isn't one correct answer or one specific set of features that your tool needs to have. The overall goal is to do as much of the EDA process as possible, with the least work required each time. Some things to consider:

<ul>
<li>How are different types of data represented and displayed?
<li>What is important to know? E.g. outliers, correlations, distributions, etc.
<li>How much customization is needed? Think about the Seaborn visualizations as an example: each chart has many options that we can modify. E.g. pairplots can be useful, but can also take a long time to generate; would it be better if they were optional?
<li>Code modularity - in general, breaking up large tasks into reusable pieces of code will be preferable. If/when you add to this later on, it'll be much easier. 
<li>Commenting - we'll comment this code in the correct manner. See the example for a pattern that should be enough to follow. Commenting is dull and boring, but generally very important; we leave it mainly to the coding class to cover proper comments, but we'll do it the 'proper' way here since we are creating a portable tool. If someone else were to use this, good comments would allow them to make sense of it. 
</ul>
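
The modularity and customization points above can be sketched as a few small functions plus one that assembles them, with the more expensive step switched off by default. All names here (`missing_counts`, `iqr_outlier_counts`, `eda_report`, `include_corr`) are hypothetical, and the 1.5 × IQR rule is just one common outlier convention, not a requirement of the assignment. The same opt-in flag pattern would suit a slow pairplot.

```python
import pandas as pd


def missing_counts(df):
    """Number of missing values per column."""
    return df.isna().sum()


def iqr_outlier_counts(df, num_cols, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numerical column."""
    counts = {}
    for col in num_cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
        counts[col] = int(outside.sum())
    return pd.Series(counts)


def eda_report(df, num_cols, include_corr=False):
    """Assemble the pieces; the correlation matrix is opt-in."""
    report = {
        "missing": missing_counts(df),
        "outliers": iqr_outlier_counts(df, num_cols),
    }
    if include_corr:
        report["corr"] = df[num_cols].corr()
    return report


# Small demo frame with one extreme value and one missing value.
demo = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [2.0, 4.0, 6.0, 8.0, None]})
report = eda_report(demo, ["x", "y"], include_corr=True)
print(report["outliers"])
```

Because each piece is its own function, a new check can be added later without touching the existing ones, and users who only want the quick parts can leave the optional flag off.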

The example code isn't comprehensive or definitive; the intent is to give a few examples of how to put things together (tabs, matplotlib figures, classes, etc.). You can use whatever you want to put this together; it is also a good chance to build skills in reading documentation for the different libraries to adapt things to what you want. 

This tool will be peer evaluated - you'll each try out 3 other people's EDA tools and judge how useful they are and how easy they are to use. 

<h4>Utility File</h4>

When creating this EDA class, place it in a regular Python file, e.g. ml_utils.py. The Python file is effectively just one big code cell, and pretty much everything that you write should translate directly. There's a slight possibility you might need to adjust something, though my sample worked as is; if so, Google the error, since it is probably common, then ask me if you don't get it. This utility file can also stay with you and be built upon as we go forward; any common code that you use repeatedly can be built into a function in this file, then you can just import it (like thinkstats2 and thinkplot) into your notebook and use those things without rewriting them. 

This approach is really common; when programming, we want to make things into reusable functions almost as often as possible. It saves us work, spares us from thinking about the same challenge repeatedly, makes it easy to make improvements, and reduces the probability of making an error. As we go, add to it! 
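
One practical detail when a utility file keeps growing: a running notebook kernel keeps the first imported version of a module in memory, so edits to the file are not picked up by simply re-running `import`. The sketch below shows one way to refresh it with `importlib.reload`. The module name `demo_utils_sketch` and its `greet` function are hypothetical stand-ins written to a temporary folder; in the notebook you would reload `ml_utils` itself, or restart the kernel.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a throwaway module (hypothetical name) into a temp directory.
tmp_dir = Path(tempfile.mkdtemp())
module_path = tmp_dir / "demo_utils_sketch.py"
module_path.write_text("def greet():\n    return 'v1'\n")
sys.path.insert(0, str(tmp_dir))

import demo_utils_sketch

first = demo_utils_sketch.greet()

# Edit the file, as you would while developing the utility file.
module_path.write_text("def greet():\n    return 'version 2'\n")

# Re-execute the module's source so the edit becomes visible.
importlib.invalidate_caches()
importlib.reload(demo_utils_sketch)
second = demo_utils_sketch.greet()
print(first, second)
```

Objects created before the reload still use the old class definition, so recreate them after reloading.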


For development purposes, use this tool in the next part, predictions with trees. 