## Myocardial Infarction Analytics with Pandas


### University of Virginia
### Programming for Data Science
### Last Updated: March 1, 2021
---

### Objectives: 
- Use pandas to perform operations, analytics, and visualization on healthcare data

### Executive Summary

In this assignment you will download and work with the Myocardial Infarction (MI) Complications Data Set housed at UCI.  
A myocardial infarction is commonly called a heart attack.

**Dataset Description File (DDF)**  
Read about the dataset here:  
https://archive.ics.uci.edu/ml/datasets/Myocardial+infarction+complications#

You will work with some of the columns.  
A subset of these could be predictors in an ML model, while others could be outcome variables.  
The section **Attribute Information** in the DDF provides details.

### Instructions

You will be asked to perform several tasks.  
Show your code and solutions clearly in the cells following each question.   
When the file is completed, submit the notebook through Collab.

**TOTAL POINTS: 12**

---


In [2]:
import pandas as pd
import numpy as np

#### Read in the dataset from the UCI Machine Learning Repository 
Hints:  
Use the read_csv() function, giving the path to the dataset (*path_to_data*) as an argument.  
There is no header in this data, so pass a second argument `header=None`.
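
To see why the `header=None` hint matters, here is a minimal sketch on a made-up, in-memory CSV (the `raw` string and variable names are illustrative and not part of the MI data). Without the argument, pandas treats the first data row as column names, so one record is lost and the labels become strings.

```python
import io
import pandas as pd

# Made-up, header-less CSV: three data rows, no column names.
raw = "1,77,1\n2,55,1\n3,52,0\n"

# Default behaviour: the first row is consumed as column names.
with_default = pd.read_csv(io.StringIO(raw))

# header=None: every row is data and columns get integer labels 0, 1, 2.
with_none = pd.read_csv(io.StringIO(raw), header=None)

print(with_default.shape)          # (2, 3)
print(list(with_default.columns))  # ['1', '77', '1.1']
print(with_none.shape)             # (3, 3)
print(list(with_none.columns))     # [0, 1, 2]
```

With integer labels in place, `with_none[1]` selects the second column directly.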

In [7]:
path_to_data = "https://archive.ics.uci.edu/ml/machine-learning-databases/00579/MI.data"
df = pd.read_csv(path_to_data, header=None)

**<span style="color:blue">(1 PT) 1. Print the number of records in the dataset</span>**

In [4]:
len(df.index)

1700

**<span style="color:blue">(1 PT) 2. Show the first three records in the dataset</span>**

In [6]:
df.head(3)

,0,1,2,3,4,5,6,7,8,9,...,114,115,116,117,118,119,120,121,122,123
0,1,77,1,2,1,1,2,?,3,0,...,0,0,0,0,0,0,0,0,0,0
1,2,55,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,52,1,0,0,0,2,?,2,0,...,0,0,0,0,0,0,0,0,0,0


**Working with AGE**

The second column contains patient age. If your dataframe is named `df`, you can reference the column with `df[1]`, because reading with `header=None` labels the columns with integers starting at 0.  
(More generally, field names are strings, and you would use `df['age']` to access a field named `age`.)


**<span style="color:blue">(1 PT) 3. One complication: missing values are filled with '?' which will cause problems (e.g., stats can't be computed easily).   
Count the number of records in df[1] containing '?'.</span>**

In [8]:
(df[1] == '?').sum()

**<span style="color:blue">(1 PT) 4. Write code to replace '?' with np.nan in the column df[1] of your dataframe.</span>**

In [10]:
df[1] = df[1].replace("?", np.nan)

**<span style="color:blue">(1 PT) 5. Print the number of records containing np.nan in the column df[1] of your dataframe.</span>**

In [11]:
df[1].isna().sum()
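
The three steps above (count the sentinel, replace it, count the nulls) can be sketched on a small made-up Series. The values below are invented for illustration and are not taken from the MI data.

```python
import numpy as np
import pandas as pd

# Made-up ages stored as text, with '?' marking a missing value.
ages = pd.Series(["77", "55", "?", "68", "?"])

n_question = (ages == "?").sum()      # count the sentinel before replacing
cleaned = ages.replace("?", np.nan)   # swap the sentinel for a real null
n_missing = cleaned.isna().sum()      # count nulls after replacing

print(n_question, n_missing)  # 2 2
```

The two counts should agree, which is a quick check that the replacement caught every sentinel.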

Another complication: the ages are stored as strings, alongside the null values.  
Here's an example:

In [12]:
# inspect first element

df[1].iloc[0]

'77'

In [13]:
# check the column type
df[1].dtype

dtype('O')

To convert the column to numeric, we can use `apply()` with a lambda function.  
If the type is string, we cast to numeric, otherwise it's null and we leave things alone.  
`isinstance(x, str)` checks if x is a string, returning a bool.  

Review this code for understanding:

In [14]:
df[1] = df[1].apply(lambda x: float(x) if isinstance(x, str) else x)
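
To make the conversion concrete, here is a minimal sketch on a made-up column. It applies the same lambda and, as an alternative that the assignment does not require, `pd.to_numeric` with `errors="coerce"`. The Series values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up column: numeric strings plus one null, as after the '?' replacement.
ages = pd.Series(["77", "55", np.nan, "68"])

# Element-wise conversion: cast strings to float, pass nulls through unchanged.
via_apply = ages.apply(lambda x: float(x) if isinstance(x, str) else x)

# Alternative (not part of the assignment): let pandas parse the whole column,
# turning anything it cannot parse into NaN.
via_to_numeric = pd.to_numeric(ages, errors="coerce")

print(via_apply.dtype)     # float64
print(via_apply.median())  # 68.0
```

Once the column is numeric, statistics such as `median()` skip the nulls by default.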

**<span style="color:blue">(1 PT) 6. Show the data type of `age` is no longer string type.</span>**

In [15]:
df[1].dtype

dtype('float64')

**<span style="color:blue">(1 PT) 7. Compute the median age</span>**

In [16]:
df[1].median()

**Working with GENDER** 

The third column contains patient gender; again, since indexing starts at zero, you'll reference `df[2]`.

**<span style="color:blue">(1 PT) 8. Print the frequency AND percentage of each gender.</span>**  
Hint: The function you'll use to compute frequencies will take an argument to compute percentages.
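
As a sketch of the hint, on made-up codes rather than the MI data, `value_counts()` gives frequencies and `value_counts(normalize=True)` gives proportions, which can be scaled to percentages. The `summary` table is an illustrative way to show both together.

```python
import pandas as pd

# Made-up binary column standing in for a coded categorical variable.
codes = pd.Series([1, 0, 1, 1, 0, 1, 1, 0])

counts = codes.value_counts()                            # raw frequencies
percents = codes.value_counts(normalize=True).mul(100)   # shares scaled to percent

# Put both in one table so frequency and percentage print together.
summary = pd.DataFrame({"count": counts, "percent": percents})
print(summary)
```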

In [17]:
print(df[2].value_counts())
print(df[2].value_counts(normalize=True).mul(100))

**Working with Essential Hypertension (EH)** 

Reference this column with df[8]  

**<span style="color:blue">(1 PT) 9. Enter the most frequent value</span>**

In [1]:
# value_counts() sorts by frequency, so idxmax() returns the most frequent value
print(df[8].value_counts())
df[8].value_counts().idxmax()

**Working with Atrial Fibrillation (AFIB)** 

Reference this column with df[112]  
AFIB is one of the complications and outcomes of myocardial infarction.

**<span style="color:blue">(1 PT) 10. Print the number of AFIB cases (value=1)</span>**

In [28]:
df.iloc[:, 112].sum()

170

Construct a new dataframe containing only columns: AGE, AFIB  
As a reminder, AGE lives in df[1] and AFIB lives in df[112]  

**<span style="color:blue">(1 PT) 11. Print the shape of this dataframe</span>**

In [20]:
df2 = df[[1, 112]]
df2.shape

(1700, 2)

We plan to plot AGE and AFIB, so renaming the columns to strings will make things easier.  
We use `rename()` for this:

In [128]:
df2 = df2.rename(columns={1:'age', 112:'AFIB'})
df2.head()


**<span style="color:blue">(1 PT) 12. Import the `seaborn` package for visualization.
             Create a boxplot with AFIB on the x-axis and Age on the y-axis</span>**
             
documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

In [142]:
import seaborn as sns
sns.boxplot(data=df2, x='AFIB', y='age')

(ungraded) What do you notice about the difference in age distributions between AFIB/non-AFIB groups?
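
One way to put numbers behind the visual comparison is to compute per-group quartiles, which are the values a boxplot draws as the box edges and median line. The sketch below uses a made-up `toy` dataframe, so its numbers say nothing about the real AFIB and non-AFIB groups.

```python
import pandas as pd

# Made-up data: 'AFIB' is a 0/1 flag and 'age' is in years.
toy = pd.DataFrame({
    "AFIB": [0, 0, 0, 0, 1, 1, 1, 1],
    "age":  [50, 58, 62, 70, 60, 66, 72, 80],
})

# Quartiles per group, reshaped so each quantile becomes a column.
quartiles = toy.groupby("AFIB")["age"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

Running the same `groupby` on `df2` would give the corresponding figures for the assignment data.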