# Pandas
As described at https://pandas.pydata.org:
> pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

## Resources
1. Ch 5-6 in Python for Data Analysis, 2nd Ed, Wes McKinney (UCalgary library and https://github.com/wesm/pydata-book)
2. Ch 3 in Python Data Science Handbook, Jake VanderPlas (UCalgary library and https://github.com/jakevdp/PythonDataScienceHandbook)


Let's explore some of the features. 

First, import Pandas and NumPy:

In [1]:
import numpy as np
import pandas as pd

## Create pandas DataFrames

There are several ways to create Pandas DataFrames, most commonly by reading a CSV (comma-separated values) file. A DataFrame is the Python analogue of a spreadsheet: a table with labelled rows and columns. We will often use `df` as a variable name for a DataFrame.

If the data is not stored in a file, a DataFrame can be created from a dictionary of lists:

```python
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
```

where dictionary keys become column headers.
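
As a small sketch of the same idea, the optional `columns` argument of `pd.DataFrame` can select and order the columns taken from the dictionary. The name `frame_ordered` is only illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

# Dictionary keys become column headers, in insertion order
frame = pd.DataFrame(data)
print(list(frame.columns))

# The columns argument picks and reorders columns from the dictionary
frame_ordered = pd.DataFrame(data, columns=['year', 'state', 'pop'])
print(list(frame_ordered.columns))
```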

An alternative is to create from a numpy array and set column headers separately:

In [None]:
# From a numpy array
df = pd.DataFrame( np.arange(20).reshape(5,4), columns=['alpha', 'beta', 'gamma', 'delta'])
df

In [None]:
# checking its type
type(df)

## Indexing
Data in DataFrames is accessed by rows and columns, either by integer position (`iloc`) or by label (`loc`).

In [None]:
# select a column
df['alpha']

In [None]:
# select two columns
df[['alpha', 'gamma']]

In [None]:
# select rows
df.iloc[:2]

In [None]:
# select rows and columns
df.iloc[:2, :2]

In [None]:
# select rows by label (the end label 2 is included) and columns by name
df.loc[:2, ['alpha', 'beta']]
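
The difference between position-based and label-based slicing is easy to miss when the row labels are the default 0, 1, 2, ... The sketch below rebuilds the same `df` and compares the two; `df_named` and its letter labels are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  columns=['alpha', 'beta', 'gamma', 'delta'])

# iloc slices by position and excludes the end point: rows 0 and 1
by_position = df.iloc[:2]

# loc slices by label and includes the end point: rows 0, 1 and 2
by_label = df.loc[:2]

print(len(by_position), len(by_label))

# With non-default row labels the two styles are easier to tell apart
df_named = df.copy()
df_named.index = ['a', 'b', 'c', 'd', 'e']
print(df_named.loc['b':'d', 'alpha'])   # labels 'b' through 'd', inclusive
print(df_named.iloc[1:4, 0])            # positions 1, 2 and 3
```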

## DataFrame math
Similar to NumPy, DataFrames support direct, element-wise math.


In [None]:
# direct math
df2 = (9/5) * df + 32
df2

In [None]:
# add two dataframes of same shape
df + df2

In [None]:
# apply a function to each column
f = lambda x: x.max() - x.min()

df.apply(f)
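
By default `apply` works column by column. As a brief sketch, passing `axis=1` applies the same function to each row instead. The helper name `value_range` is an illustrative stand-in for the lambda above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  columns=['alpha', 'beta', 'gamma', 'delta'])

def value_range(x):
    """Difference between the largest and smallest value of a Series."""
    return x.max() - x.min()

col_range = df.apply(value_range)           # axis=0: one result per column
row_range = df.apply(value_range, axis=1)   # axis=1: one result per row

print(col_range)
print(row_range)
```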

## DataFrame manipulation
Adding and deleting columns, as well as changing entries, is similar to working with Python dictionaries.

Note that most DataFrame methods do not change the DataFrame directly, but return a new DataFrame. It is always good to check how the method you are invoking behaves.


In [None]:
# add a column
df['epsilon'] = ['low', 'medium', 'low', 'high', 'high']
df

In [None]:
# What is the size?
df.shape

In [None]:
# delete column
df_dropped = df.drop(columns=['gamma'])
df_dropped

In [None]:
# the original dataframe is unaffected
df

Let's create a copy and assign new values to the first column:

In [None]:
df_copy = df.copy()
df_copy['alpha'] = 20
print(df)
print(df_copy)
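
To see why `copy()` matters, the sketch below contrasts it with plain assignment, which only creates a second name for the same DataFrame. The name `df_alias` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  columns=['alpha', 'beta', 'gamma', 'delta'])

df_alias = df          # a second name for the same object
df_copy = df.copy()    # an independent copy of the data

df_alias['alpha'] = -1   # also visible through df
df_copy['beta'] = 99     # df is not affected

print(df_alias is df)
print(df)
```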

DataFrames can be sorted by column:

In [None]:
# sorting values
df.sort_values(by='epsilon')
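
Note that strings sort alphabetically, so 'high' comes before 'low'. The sketch below shows two possible refinements: sorting by several columns with mixed directions, and using an ordered categorical so that the levels sort by meaning. The category order is an assumption about what the labels are meant to express.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  columns=['alpha', 'beta', 'gamma', 'delta'])
df['epsilon'] = ['low', 'medium', 'low', 'high', 'high']

# Sort by epsilon (alphabetical), then by alpha in descending order
sorted_df = df.sort_values(by=['epsilon', 'alpha'], ascending=[True, False])
print(sorted_df)

# An ordered categorical makes 'low' < 'medium' < 'high'
df['epsilon'] = pd.Categorical(df['epsilon'],
                               categories=['low', 'medium', 'high'],
                               ordered=True)
by_level = df.sort_values(by='epsilon')
print(by_level)
```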

## Load data from file

Most often, data comes from files, typically CSV files, and `pd.read_csv()` creates a DataFrame from them directly.

Let's load that same heart-attack.csv that we used in NumPy before:

In [None]:
data = pd.read_csv('heart-attack.csv')

After loading data, it is good practice to check what we have. Usually, the sequence is:
1. Check dimension
2. Peek at the first rows
3. Get info on data types and missing values
4. Summarize columns

In [None]:
# Check dimension (rows, columns) 
data.shape

In [None]:
# Peek at the first rows
data.head()

In [None]:
# Column names are
data.columns

In [None]:
# Get info on data types and missing values
data.info()

## Summarize values
How do we find the mean, std, min, max in each column?

In [None]:
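# numeric_only=True skips non-numeric columns (pandas 2.0 and later raise an error without it)
data.mean(numeric_only=True)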

In [None]:
# where are the other columns? Check data types
data.dtypes

Notice that many columns are of type object, which is not a numeric type. This is caused by missing values: we know from peeking at the first rows that there are '?' entries in there, so those columns are read as text. Let's replace these with `np.nan`, the floating-point not-a-number value that Pandas uses to mark missing data.

In [None]:
# replace '?' with NaN
data = data.replace('?', np.nan)
data.head()

With the '?' entries marked as missing, we can convert the data type from object to float:

In [None]:
# convert dtypes
data = data.astype('float')
data.dtypes
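
An alternative worth knowing is `pd.to_numeric` with `errors='coerce'`, which turns anything that cannot be parsed as a number into NaN in a single step. The sketch below uses a small made-up table, not the heart-attack data.

```python
import pandas as pd

# Made-up text columns, as they might look after reading a CSV with '?' entries
raw = pd.DataFrame({'age': ['28', '29', '30'],
                    'chol': ['132', '?', '237']})

# Apply pd.to_numeric to every column; unparseable entries become NaN
converted = raw.apply(pd.to_numeric, errors='coerce')

print(converted)
print(converted.dtypes)
```

Unlike replacing '?' explicitly, this silently converts any unparseable text to NaN, so it is worth counting the missing values afterwards.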

We could instead have loaded the data with the `na_values` argument to indicate that '?' marks a missing value:

In [None]:
data = pd.read_csv('heart-attack.csv', na_values='?')
data.dtypes
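
The effect of `na_values` can be checked without the file. In this sketch, `csv_text` is a made-up stand-in for a CSV file on disk, read through `io.StringIO`.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = "age,chol\n28,132\n29,?\n30,237\n"

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values='?')

print(raw.dtypes)     # chol is read as text because of the '?'
print(clean.dtypes)   # chol is numeric, with NaN where the '?' was
```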

This worked nicely. Now we can describe all columns, that is, print basic statistics for each of them. Note that by default Pandas ignores NaN, whereas NumPy does not.

In [None]:
data.describe() # ignores NaN
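
The difference in NaN handling can be seen on a tiny made-up array:

```python
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, np.nan, 4.0])

print(np.mean(values))                        # nan: NumPy propagates NaN
print(np.nanmean(values))                     # NaN-aware NumPy variant
print(pd.Series(values).mean())               # Pandas skips NaN by default
print(pd.Series(values).mean(skipna=False))   # nan: opt out of skipping
```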

We might be interested in these statistics for each gender. To get them, we first group the rows by gender, then ask for the description. We will only look at age for clarity:

In [None]:
data.groupby(by='gender')['age'].describe()
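
When only a few statistics are needed, `agg` with a list of function names is a compact alternative to `describe`. The sketch below uses a made-up table named `people`, with 0/1 codes in the `gender` column as an assumption.

```python
import pandas as pd

# Made-up data for illustration
people = pd.DataFrame({'gender': [0, 1, 1, 0, 1],
                       'age': [30, 40, 50, 34, 60]})

# Group rows by gender, pick the age column, then compute selected statistics
age_stats = people.groupby('gender')['age'].agg(['count', 'mean', 'min', 'max'])
print(age_stats)
```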

## Find NaNs
How many NaNs in each column?

We can ask which entries are null, which produces a boolean DataFrame of the same shape:


In [None]:
data.isnull()

Applying `sum()` to this boolean DataFrame counts the number of `True` values in each column:

In [None]:
data.isnull().sum()
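
The same boolean DataFrame answers related questions too. Since `True` counts as 1, `mean()` gives the fraction of missing entries per column, and `any(axis=1)` flags rows with at least one missing value. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                   'b': [1.0, 2.0, 3.0, np.nan]})

n_missing = df.isnull().sum()                      # count per column
frac_missing = df.isnull().mean()                  # fraction per column
rows_with_missing = df.isnull().any(axis=1).sum()  # rows with any NaN

print(n_missing)
print(frac_missing)
print(rows_with_missing)
```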

We get complementary information from `info()`

In [None]:
data.info()

We can fill (replace) these missing values, for example with the minimum value in each column

In [None]:
data.fillna(data.min()).describe()
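
Filling with the minimum is only one option. The sketch below, on made-up data, compares it with filling by the column mean and with dropping incomplete rows. Which choice is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 2.0, 4.0]})

filled_min = df.fillna(df.min())    # each NaN gets its column's minimum
filled_mean = df.fillna(df.mean())  # each NaN gets its column's mean
dropped = df.dropna()               # keep only rows without any NaN

print(filled_min)
print(filled_mean)
print(dropped)

# None of these calls modified df itself
print(df.isnull().sum().sum())
```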

## Count unique values (a histogram)

We finish off with our good friend the histogram: `value_counts()` counts how often each distinct value occurs.

In [None]:
data['age'].value_counts()
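
By default `value_counts()` orders the result by frequency. For a histogram-like view, it can help to order by value or to group values into bins. A sketch on a made-up series named `ages`:

```python
import pandas as pd

ages = pd.Series([28, 29, 29, 30, 31, 31, 31, 45])

counts = ages.value_counts()                      # most frequent value first
counts_by_age = ages.value_counts().sort_index()  # ordered by age instead

# Group ages into 3 equal-width bins and keep the bins in order
binned = ages.value_counts(bins=3, sort=False)

print(counts)
print(counts_by_age)
print(binned)
```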