# Pandas

Using the heart-attack.csv example below, fill in the code blocks for a new dataset, auto-mpg.

The first step is to download the auto-mpg data set (auto-mpg.data and auto-mpg.names) from UCI: https://archive.ics.uci.edu/ml/datasets/Auto%2BMPG

In this file, replace `gender` with `origin` and `age` with `mpg`.

## Resources
1. Ch 5-6 in Python for Data Analysis, 2nd Ed, Wes McKinney (UCalgary library and https://github.com/wesm/pydata-book)
2. Ch 3 in Python Data Science Handbook, Jake VanderPlas (UCalgary library and https://github.com/jakevdp/PythonDataScienceHandbook)

First, import Pandas and NumPy:

In [4]:
import numpy as np
import pandas as pd

## Load data from file

Data usually comes from an external source, often a CSV file; `pd.read_csv()` loads such a file directly into a DataFrame.

Let's load the required dataset:

In [5]:
# Replace code below with code to load auto-mpg dataset
# Hint: Use attribute information from website to determine column names
# Hint: Load with na_values = '?' and sep=r'\s+'
data = pd.read_csv('heart-attack.csv')
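
As a rough illustration of the two hints above, the sketch below reads a small made-up sample (not the real auto-mpg file) that is whitespace-separated and marks a missing value with `?`. The column names are an assumption based on the attribute list on the UCI page, with underscores substituted for spaces; adjust them to whatever you read from auto-mpg.names.

```python
import io

import pandas as pd

# Made-up sample in a whitespace-separated layout; '?' marks a missing value.
sample = io.StringIO(
    '18.0 8 307.0 130.0 3504. 12.0 70 1 "car a"\n'
    '25.0 4  98.0 ?     2046. 19.0 71 1 "car b"\n'
    '24.0 4 113.0  95.0 2372. 15.0 70 3 "car c"\n'
)

# Assumed column names (the file itself has no header row).
column_names = [
    "mpg", "cylinders", "displacement", "horsepower", "weight",
    "acceleration", "model_year", "origin", "car_name",
]

df = pd.read_csv(
    sample,
    sep=r"\s+",          # any run of spaces or tabs separates fields
    names=column_names,  # supply names because there is no header line
    na_values="?",       # treat '?' as NaN instead of as a string
)

print(df.dtypes)
print(df["horsepower"].isna().sum())
```

Because `?` is converted to NaN on load, `horsepower` is read as a float column rather than as `object`.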

After loading data, it is good practice to check what we have. Usually, the sequence is:
1. Check dimension
2. Peek at the first rows
3. Get info on data types and missing values
4. Summarize columns

In [6]:
# Check dimension (rows, columns) 
data.shape

(293, 14)

In [7]:
# Peek at the first rows
data.head()

,age,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,28,1,2,130,132,0,2,185,0,0.0,?,?,?,0
1,29,1,2,120,243,0,0,160,0,0.0,?,?,?,0
2,29,1,2,140,?,0,0,170,0,0.0,?,?,?,0
3,30,0,1,170,237,0,1,170,0,0.0,?,?,6,0
4,31,0,2,100,219,0,1,150,0,0.0,?,?,?,0


In [8]:
# List the column names
data.columns

Index(['age', 'gender', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'],
      dtype='object')

In [9]:
# Get info on data types and missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293 entries, 0 to 292
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       293 non-null    int64  
 1   gender    293 non-null    int64  
 2   cp        293 non-null    int64  
 3   trestbps  293 non-null    object 
 4   chol      293 non-null    object 
 5   fbs       293 non-null    object 
 6   restecg   293 non-null    object 
 7   thalach   293 non-null    object 
 8   exang     293 non-null    object 
 9   oldpeak   293 non-null    float64
 10  slope     293 non-null    object 
 11  ca        293 non-null    object 
 12  thal      293 non-null    object 
 13  num       293 non-null    int64  
dtypes: float64(1), int64(4), object(9)
memory usage: 32.2+ KB


## Summarize values
What are the mean, standard deviation, minimum, and maximum of each column?

In [None]:
data.mean(numeric_only=True)  # recent pandas versions raise an error on object columns otherwise

In [None]:
# Where are the other columns? Check the data types
data.dtypes
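
To see why some columns drop out of a numeric summary, here is a minimal sketch on a made-up two-column frame (the name `toy` and its values are illustrative, not taken from either dataset). A column that contains a `?` string is stored as `object`, so it is excluded when only numeric columns are averaged.

```python
import pandas as pd

# Made-up frame: 'age' is numeric, 'chol' holds strings because of the '?'.
toy = pd.DataFrame({"age": [28, 29, 30], "chol": ["132", "243", "?"]})

print(toy.dtypes)

# Average only the numeric columns; 'chol' is skipped.
numeric_means = toy.mean(numeric_only=True)

# List the columns pandas considers numeric.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(numeric_means)
print(numeric_cols)
```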

Now we can describe the numeric columns, that is, print basic statistics (count, mean, std, min, quartiles, max) for each. Note that by default Pandas skips NaN when computing these, whereas the plain NumPy functions do not.

In [None]:
data.describe() # ignores NaN
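
The difference in NaN handling can be checked on a small made-up Series. This is only a sketch; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, np.nan, 3.0])

# Pandas skips NaN by default: mean of 1, 2 and 3.
pandas_mean = values.mean()

# Pandas can be told not to skip NaN.
strict_mean = values.mean(skipna=False)

# NumPy on the raw array propagates NaN ...
numpy_mean = np.mean(values.to_numpy())

# ... unless the NaN-aware variant is used.
numpy_nanmean = np.nanmean(values.to_numpy())

print(pandas_mean, strict_mean, numpy_mean, numpy_nanmean)
```

The call to `to_numpy()` matters here: passing the Series itself to `np.mean` hands the work back to pandas, which would skip the NaN.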

We may want these statistics separately for each value of one column. To get them, we first group the rows by that column, then ask for the description. For clarity, we will look at only one variable.

In [None]:
# Replace variables to correspond to auto-mpg dataset
data.groupby(by='gender')['age'].describe()
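
The group-then-describe pattern is easier to follow on a tiny made-up frame. In this sketch the column names `group` and `value` are placeholders, standing in for the grouping column and the variable being summarized.

```python
import pandas as pd

toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [10.0, 20.0, 1.0, 2.0, 3.0],
})

# One row of statistics per distinct value of 'group'.
summary = toy.groupby("group")["value"].describe()

print(summary)
```

Selecting the column before calling `describe()` computes statistics only for that variable, instead of describing every column and discarding most of the result.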

## Find NaNs
How many NaNs in each column?

We can ask which entries are null, which produces a DataFrame of boolean values with the same shape as the data.


In [10]:
data.isnull()

,age,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
288,False,False,False,False,False,False,False,False,False,False,False,False,False,False
289,False,False,False,False,False,False,False,False,False,False,False,False,False,False
290,False,False,False,False,False,False,False,False,False,False,False,False,False,False
291,False,False,False,False,False,False,False,False,False,False,False,False,False,False


Applying `sum()` to this boolean DataFrame counts the number of `True` values in each column.

In [11]:
data.isnull().sum()

age         0
gender      0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
num         0
dtype: int64
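
Every count above is zero even though the first rows show `?` entries, because a `?` that was not declared as a missing-value marker at load time is just a string. The sketch below, on a made-up frame, shows the count before and after converting such a column to numbers; `pd.to_numeric(..., errors="coerce")` is one possible way to do the conversion, not necessarily the one intended for this exercise.

```python
import pandas as pd

# Made-up frame: '?' was read as text, so 'chol' is an object column.
toy = pd.DataFrame({"age": [28, 29, 29], "chol": ["132", "243", "?"]})

before = toy.isnull().sum()  # '?' is not counted as missing

# Convert to numbers; anything unparseable (here '?') becomes NaN.
cleaned = toy.assign(chol=pd.to_numeric(toy["chol"], errors="coerce"))

after = cleaned.isnull().sum()

print(before)
print(after)
```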

We get complementary information from `info()`, which lists the non-null count for each column.

In [None]:
data.info()

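We can fill (replace) missing values, for example with the minimum value in each column. This only affects entries that pandas recognizes as NaN, so `?` placeholders are left untouched unless the file was loaded with `na_values='?'`.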

In [None]:
data.fillna(data.min()).describe()
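
Here is a minimal sketch of how `fillna` pairs each column with its own fill value when it is given a Series, using a made-up frame with one NaN per column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [3.0, np.nan, 1.0], "y": [np.nan, 5.0, 7.0]})

# toy.min() is a Series indexed by column name (NaN is skipped),
# so each column is filled with its own minimum.
column_minima = toy.min()
filled = toy.fillna(column_minima)

print(column_minima)
print(filled)
```

`fillna` returns a new DataFrame; `toy` itself still contains the NaN entries unless the result is assigned back.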

## Count unique values (a histogram)

We finish off with our good friend, the histogram.

In [None]:
# Replace code to correspond to relevant auto-mpg variable
data['age'].value_counts()
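
By default `value_counts()` orders the result by frequency, which is not always how a histogram is read. The sketch below, on a made-up Series, shows the default ordering, ordering by value, and grouping a numeric variable into bins; the choice of two bins is arbitrary.

```python
import pandas as pd

ages = pd.Series([29, 30, 29, 31, 29, 30])

# Default: most frequent value first.
counts = ages.value_counts()

# Ordered by the values themselves, like the x-axis of a histogram.
by_value = ages.value_counts().sort_index()

# Group a numeric variable into equal-width bins before counting.
binned = ages.value_counts(bins=2, sort=False)

print(counts)
print(by_value)
print(binned)
```

Binning is mostly useful for a continuous variable, where nearly every value would otherwise appear only once.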