# Pandas

Using the heart-attack.csv example below as a template, fill in the code blocks for a new dataset, auto-mpg.

The first step is to download the auto-mpg data set (auto-mpg.data and auto-mpg.names) from UCI: https://archive.ics.uci.edu/ml/datasets/Auto%2BMPG

In this file, replace the `gender` column with `origin` and the `age` column with `mpg`.

## Resources
1. Ch 5-6 in Python for Data Analysis, 2nd Ed, Wes McKinney (UCalgary library and https://github.com/wesm/pydata-book)
2. Ch 3 in Python Data Science Handbook, Jake VanderPlas (UCalgary library and https://github.com/jakevdp/PythonDataScienceHandbook)

First, import Pandas and NumPy:

In [None]:
import numpy as np
import pandas as pd

## Load data from file

Data usually comes from external files, most often CSV files, and `pd.read_csv()` loads such a file directly into a DataFrame.

Let's load the required dataset:

In [None]:
# Replace code below with code to load auto-mpg dataset
# Hint: Use attribute information from website to determine column names
# Hint: Load with na_values = '?' and sep=r'\s+'
data = pd.read_csv('heart-attack.csv')
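
As a minimal sketch of the two hints above, the following reads whitespace-separated text with `?` marking missing values. The three rows and the column names (`col_a` to `col_d`) are invented for illustration and are not the auto-mpg columns; the real names should come from the attribute information on the website.

```python
import io

import pandas as pd

# Hypothetical three-row sample in a whitespace-separated layout.
# Values and column names are made up for illustration only.
raw = io.StringIO(
    "21.0 4 98.0 1\n"
    "15.5 8 ? 1\n"
    "30.0 4 70.0 3\n"
)

sample = pd.read_csv(
    raw,
    sep=r"\s+",       # one or more whitespace characters separate the fields
    na_values="?",    # treat '?' as a missing value (NaN)
    header=None,      # the sample has no header row
    names=["col_a", "col_b", "col_c", "col_d"],  # hypothetical column names
)
print(sample)
print(sample.dtypes)
```

Because `?` is converted to NaN on load, `col_c` is parsed as a floating-point column rather than as text.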

After loading data, it is good practice to check what we have. Usually, the sequence is:
1. Check dimension
2. Peek at the first rows
3. Get info on data types and missing values
4. Summarize columns

In [None]:
# Check dimension (rows, columns) 
data.shape

In [None]:
# Peek at the first rows
data.head()

In [None]:
# Column names are
data.columns

In [None]:
# Get info on data types and missing values
data.info()

## Summarize values
What are the mean, std, min, and max of each column?

In [None]:
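# numeric_only=True skips non-numeric columns; pandas >= 2.0 raises a TypeError on them otherwise
data.mean(numeric_only=True)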

In [None]:
# Where are the other columns? Check the data types
data.dtypes

Now we can describe the data, that is, print basic statistics for each column. By default `describe()` covers only the numeric columns. Note that Pandas ignores NaN in these statistics by default, whereas NumPy does not.

In [None]:
data.describe() # ignores NaN
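
To make the NaN difference concrete, here is a small sketch on made-up values. The array is extracted with `to_numpy()` first, since calling `np.mean` directly on a Series hands the work back to Pandas.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry
values = pd.Series([1.0, 2.0, np.nan, 4.0])

pandas_mean = values.mean()                     # skips NaN (skipna=True by default)
numpy_mean = np.mean(values.to_numpy())         # NaN propagates into the result
numpy_nanmean = np.nanmean(values.to_numpy())   # NumPy's NaN-aware variant

print(pandas_mean, numpy_mean, numpy_nanmean)
```

The Pandas result and `np.nanmean` agree, while the plain NumPy mean is NaN.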

We may want these statistics separately for each distinct value of one column. To get them, we first group the rows by that column, then ask for the description. For clarity, we look at only one variable within each group.

In [None]:
# Replace variables to correspond to auto-mpg dataset
data.groupby('gender')['age'].describe()
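
The group-then-describe pattern can be sketched on a toy table. The column names `group` and `value` and all numbers are hypothetical.

```python
import pandas as pd

# Toy data: two groups with made-up measurements
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0, 30.0],
})

# One row of summary statistics per distinct value of 'group'
per_group = toy.groupby("group")["value"].describe()
print(per_group)
```

Each row of `per_group` holds the count, mean, std, min, quartiles, and max for one group.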

## Find NaNs
How many NaNs in each column?

We can ask which entries are null, which produces a boolean DataFrame of the same shape as the data.


In [None]:
data.isnull()

Applying `sum()` to this boolean DataFrame counts the number of `True` values in each column, because `True` is summed as 1 and `False` as 0.

In [None]:
data.isnull().sum()
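
A small sketch of the counting step, using a toy DataFrame with made-up values and hypothetical column names:

```python
import numpy as np
import pandas as pd

# Toy data with a known number of missing entries per column
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, np.nan, 6.0],
    "z": [7.0, 8.0, 9.0],
})

missing_per_column = toy.isnull().sum()        # one count per column
total_missing = int(missing_per_column.sum())  # count over the whole table

print(missing_per_column)
print(total_missing)
```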

We get complementary information from `info()`, which reports the number of non-null entries in each column.

In [None]:
data.info()

We can fill (replace) these missing values, for example with the minimum value of each column. Note that `fillna()` returns a new DataFrame and leaves `data` unchanged.

In [None]:
data.fillna(data.min()).describe()
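
As a sketch of how the fill works: `min()` returns one value per column, and `fillna()` matches those values to columns by name. The toy data below is made up.

```python
import numpy as np
import pandas as pd

# Toy data with one missing entry in each column
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, 5.0, 4.0],
})

column_minimums = toy.min()           # Series indexed by column name
filled = toy.fillna(column_minimums)  # each NaN takes its own column's minimum

print(filled)
print(toy.isnull().sum())  # the original still has its NaNs
```

Assigning the result to a new name, as done here with `filled`, keeps the original data available for comparison.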

## Count unique values (a histogram)

We finish off with our good friend, the histogram. `value_counts()` returns how many times each distinct value occurs.

In [None]:
# Replace code to correspond to relevant auto-mpg variable
data['age'].value_counts()
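
A brief sketch of `value_counts()` on made-up data. For a variable with many distinct numeric values, the `bins` argument groups the values into equal-width intervals before counting, which is closer to a conventional histogram.

```python
import pandas as pd

# Made-up categorical labels
labels = pd.Series(["b", "a", "b", "c", "b", "a"])
counts = labels.value_counts()   # sorted by count, most frequent first
by_label = counts.sort_index()   # same counts, ordered by label instead

# Made-up numeric values, counted in two equal-width bins
numbers = pd.Series([1, 2, 3, 10, 11, 20])
binned = numbers.value_counts(bins=2)

print(counts)
print(by_label)
print(binned)
```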