# Pandas

Programming in Python

School of Computer Science, University of St Andrews

# What is Pandas?

Pandas is a library for data manipulation.

It is useful for reading, exploring, transforming, and writing table-like data whose columns have different data types.

This heterogeneous nature of the data is what makes it distinct from NumPy, where an array holds elements of a single type.

It’s similar to a spreadsheet but can do many of the selection and transformation operations you might do with a database.
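As a minimal sketch of that difference, the example below builds a small table with one data type per column and compares it with a NumPy array built from mixed values. The `shopping` data is invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the lecture): three columns, three different types.
shopping = pd.DataFrame({
    "item": ["pen", "notebook", "stapler"],   # text
    "quantity": [3, 1, 2],                    # whole numbers
    "price": [1.50, 4.25, 7.00],              # floating-point numbers
})

# Each column keeps its own data type.
print(shopping.dtypes)

# A NumPy array has a single dtype, so mixing text and numbers
# turns every element into a string.
mixed = np.array(["pen", 3, 1.50])
print(mixed.dtype)
```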

# Importing Pandas
As with NumPy, there is a conventional alias when importing Pandas: `pd`.

In [1]:
import pandas as pd

# Series

A Series is a one-dimensional ndarray with a label for each of its elements. Together, these labels are called its index.

An index label can be any hashable value, such as a string or an integer. The labels default to 0, 1, 2, 3, …

A Series can optionally have a name.

We can declare a Series like so, passing in a value for each element:

In [2]:
pd.Series(['a', 'b', 'c', 'c'])

0    a
1    b
2    c
3    c
dtype: object

This adds a `RangeIndex` by default, that is, one that starts at 0 and goes up in increments of 1.

Let's say that we want to use strings of our own invention as the index labels. We can do this by passing a list to the `index` keyword argument.

In [3]:
s = pd.Series(['a', 'b', 'c', 'c'], 
        index=['row0', 'row1', 'row2', 'row3'])
s

row0    a
row1    b
row2    c
row3    c
dtype: object
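A Series can also be given a name, and one convenient way to supply labels is to build it from a dictionary, whose keys become the index. The sketch below assumes the band members' ages used later in these notes. The variable name `ages` is only illustrative.

```python
import pandas as pd

# Dictionary keys become the index labels; `name` labels the Series itself.
ages = pd.Series({"Nick": 77, "Richard": 65, "Dave": 74}, name="age")

print(ages)
print(ages.name)          # the name of the Series
print(list(ages.index))   # the labels taken from the dictionary keys
```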

## Indexing Series
If we want to select a value from a Series, there are two indexers:

- `loc` retrieves the value with the given label in the index

In [4]:
s.loc['row2']

'c'

- `iloc` retrieves the value at the given integer position

In [5]:
s.iloc[2]

'c'

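Watch out! The subscript operator (`s[...]`) looks values up by label, like `loc`, not by position. If the index labels are themselves integers, `s[2]` returns the element labelled 2, which need not be the third element. This can be a source of bugs.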
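To make the pitfall concrete, here is a small sketch using a Series whose integer labels are deliberately out of order. The Series `shuffled` is made up for this example.

```python
import pandas as pd

# Integer labels that do not match the positions 0, 1, 2.
shuffled = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = shuffled.loc[2]       # the element labelled 2
by_position = shuffled.iloc[2]   # the third element
by_subscript = shuffled[2]       # label-based, so it matches loc

print(by_label, by_position, by_subscript)
```

Writing `loc` or `iloc` explicitly makes the intent clear to the reader and avoids the ambiguity.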

## Data Frames

- A Data Frame is a two-dimensional data structure, used to represent table-like data consisting of cells, organised into rows and columns.

- Each column in a Data Frame is a Series. 

- You can think of a Data Frame as a dictionary of Series.

- All Series in the Data Frame share the same index.

- Think of the index as defining the row names and the key for each Series as defining its column name.

- They are declared using the `DataFrame` constructor. For example:

In [6]:
pink_floyd = pd.DataFrame({
    "name": ["Nick", "Richard", "Dave"],
    "instrument": ["drums", "keyboards", "guitar"],
    "age": [77, 65, 74]
})
pink_floyd

|   | name | instrument | age |
|---|------|------------|-----|
| 0 | Nick | drums | 77 |
| 1 | Richard | keyboards | 65 |
| 2 | Dave | guitar | 74 |
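The "dictionary of Series" view can be checked directly: selecting a column by its key gives back a Series that shares the Data Frame's index. A minimal sketch, rebuilding the same frame so that it runs on its own:

```python
import pandas as pd

pink_floyd = pd.DataFrame({
    "name": ["Nick", "Richard", "Dave"],
    "instrument": ["drums", "keyboards", "guitar"],
    "age": [77, 65, 74],
})

# Selecting by column name works like a dictionary lookup and returns a Series.
ages = pink_floyd["age"]

print(type(ages).__name__)                   # the column is a Series
print(ages.name)                             # named after its column
print(ages.index.equals(pink_floyd.index))   # it shares the frame's index
```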


You can declare an empty Data Frame with the column names using the `columns` keyword argument.

Data Frames also have an attribute called `columns`. Note that it is of type `Index`, but it holds the column names, not the row labels.

In [7]:
band = pd.DataFrame(columns=['name', 'instrument', 'age'])
band

|   | name | instrument | age |
|---|------|------------|-----|


In [8]:
band.columns

Index(['name', 'instrument', 'age'], dtype='object')

It can be useful to get a quick look at the start of any data. To do this we can use the `head` method, which returns the first `n` rows (five by default).

In [9]:
pink_floyd.head(2)

|   | name | instrument | age |
|---|------|------------|-----|
| 0 | Nick | drums | 77 |
| 1 | Richard | keyboards | 65 |


## Indexing Data Frames

Data Frames support the same label-based (`loc`) and position-based (`iloc`) indexing as Series, with one selector for the rows and one for the columns.

In [10]:
pink_floyd.loc[[0, 2], ['name', 'age']]

|   | name | age |
|---|------|-----|
| 0 | Nick | 77 |
| 2 | Dave | 74 |
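With the default index, labels and positions coincide, which hides the difference between `loc` and `iloc`. The sketch below gives the frame string row labels (the initials `nm`, `rw` and `dg` are invented for this example) and makes the same selection both ways.

```python
import pandas as pd

pink_floyd = pd.DataFrame(
    {
        "name": ["Nick", "Richard", "Dave"],
        "instrument": ["drums", "keyboards", "guitar"],
        "age": [77, 65, 74],
    },
    index=["nm", "rw", "dg"],  # hypothetical row labels
)

# Label-based: row labels and column names.
by_label = pink_floyd.loc[["nm", "dg"], ["name", "age"]]

# Position-based: row positions 0 and 2, column positions 0 and 2.
by_position = pink_floyd.iloc[[0, 2], [0, 2]]

print(by_label)
print(by_label.equals(by_position))
```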


We can also use a boolean condition (a *predicate*) to select specific rows.

In [11]:
predicate = pink_floyd['age'] > 71
over_71 = pink_floyd[predicate]
over_71

|   | name | instrument | age |
|---|------|------------|-----|
| 0 | Nick | drums | 77 |
| 2 | Dave | guitar | 74 |
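Predicates can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A predicate can also be passed to `loc` together with a column selection. A small sketch on the same data:

```python
import pandas as pd

pink_floyd = pd.DataFrame({
    "name": ["Nick", "Richard", "Dave"],
    "instrument": ["drums", "keyboards", "guitar"],
    "age": [77, 65, 74],
})

# Rows where both conditions hold.
older_non_drummers = pink_floyd[
    (pink_floyd["age"] > 70) & (pink_floyd["instrument"] != "drums")
]
print(older_non_drummers)

# A predicate for the rows plus a column name: just the matching names.
names_over_71 = pink_floyd.loc[pink_floyd["age"] > 71, "name"]
print(list(names_over_71))
```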


We can transform Data Frames in lots of ways. For example, let's say we wanted to add another column to the frame. We can do this by assigning to the column, like so:

In [12]:
hometown = ['London', 'Great Bookham', 'Cambridge']
pink_floyd['hometown'] = hometown
pink_floyd

|   | name | instrument | age | hometown |
|---|------|------------|-----|----------|
| 0 | Nick | drums | 77 | London |
| 1 | Richard | keyboards | 65 | Great Bookham |
| 2 | Dave | guitar | 74 | Cambridge |


Note that we can assign to a column with a single value and it will fill the column.

In [13]:
pink_floyd['can_sing'] = True  
pink_floyd

|   | name | instrument | age | hometown | can_sing |
|---|------|------------|-----|----------|----------|
| 0 | Nick | drums | 77 | London | True |
| 1 | Richard | keyboards | 65 | Great Bookham | True |
| 2 | Dave | guitar | 74 | Cambridge | True |
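A single value fills the whole column, but a list must have exactly one entry per row. As a sketch of what happens otherwise, the example below tries to assign a two-element list to a three-row frame and catches the resulting `ValueError`.

```python
import pandas as pd

pink_floyd = pd.DataFrame({
    "name": ["Nick", "Richard", "Dave"],
    "age": [77, 65, 74],
})

try:
    # Two values for three rows: Pandas refuses rather than guessing.
    pink_floyd["hometown"] = ["London", "Cambridge"]
    length_error = None
except ValueError as error:
    length_error = error
    print("Could not add column:", error)

print(list(pink_floyd.columns))
```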


## Reading and Writing CSV files using Pandas
Pandas can also be used to read and write CSV files, using the `pd.read_csv` function and the `to_csv` method of a Data Frame.

In [14]:
grades = pd.read_csv('data/grades.csv')
grades.head()

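|   | Student ID | Name | Subject | Grade | Attendance |
|---|------------|------|---------|-------|------------|
| 0 | 1 | John Smith | Math | 85 | 95 |
| 1 | 2 | Jane Doe | Science | 92 | 100 |
| 2 | 3 | Mark Brown | English | 78 | 90 |
| 3 | 4 | Sarah Green | History | 88 | 85 |
| 4 | 5 | Alex Lee | Math | 95 | 100 |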


Let's add a column and save it back out.

In [15]:
grades['Normalised Grade'] = grades['Grade'] / 100
grades.head()

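|   | Student ID | Name | Subject | Grade | Attendance | Normalised Grade |
|---|------------|------|---------|-------|------------|------------------|
| 0 | 1 | John Smith | Math | 85 | 95 | 0.85 |
| 1 | 2 | Jane Doe | Science | 92 | 100 | 0.92 |
| 2 | 3 | Mark Brown | English | 78 | 90 | 0.78 |
| 3 | 4 | Sarah Green | History | 88 | 85 | 0.88 |
| 4 | 5 | Alex Lee | Math | 95 | 100 | 0.95 |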


In [16]:
grades.to_csv('data/normed_grades.csv', index=False)
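The `index=False` argument stops the row labels being written as an extra, unnamed first column. The sketch below shows the difference without touching the file system: calling `to_csv` with no path returns the CSV text, and `io.StringIO` lets `pd.read_csv` read that text back. The two-row frame is a cut-down stand-in for the grades data.

```python
import io

import pandas as pd

grades = pd.DataFrame({
    "Name": ["John Smith", "Jane Doe"],
    "Grade": [85, 92],
})

with_index = grades.to_csv()                # row labels written as a first column
without_index = grades.to_csv(index=False)  # data columns only

print(with_index.splitlines()[0])      # header starts with an empty field
print(without_index.splitlines()[0])

# Reading the indexed version back turns that column into "Unnamed: 0".
reread_with_index = pd.read_csv(io.StringIO(with_index))
reread_without_index = pd.read_csv(io.StringIO(without_index))
print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```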

## Getting Summary Statistics
As in other similar libraries, it's possible to get summary statistics about your data using Pandas. For example:

In [17]:
# What is the mean grade?
grades['Grade'].mean()

88.6

In [18]:
# What is the median attendance?
grades['Attendance'].median()

95.0

You can generate a standard set of statistics about specific columns using the `describe` method.

In [19]:
grades[['Grade', 'Attendance']].describe()

|   | Grade | Attendance |
|---|-------|------------|
| count | 20.0 | 20.0 |
| mean | 88.6 | 92.5 |
| std | 5.030434 | 6.786209 |
| min | 78.0 | 80.0 |
| 25% | 86.5 | 88.75 |
| 50% | 89.5 | 95.0 |
| 75% | 92.0 | 100.0 |
| max | 96.0 | 100.0 |
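The result of `describe` is itself a Data Frame, indexed by the name of each statistic, so individual values can be picked out with `loc`. The sketch below uses only the five rows shown by `head` earlier, so its numbers differ from those of the full 20-row file.

```python
import pandas as pd

# Only the first five rows of the grades data, typed in by hand.
grades = pd.DataFrame({
    "Grade": [85, 92, 78, 88, 95],
    "Attendance": [95, 100, 90, 85, 100],
})

mean_grade = grades["Grade"].mean()
median_grade = grades["Grade"].median()
print(mean_grade, median_grade)

summary = grades.describe()
print(list(summary.index))                        # the statistic names
print(summary.loc["mean", "Grade"])               # one statistic, one column
print(summary.loc[["min", "max"], "Attendance"])  # several statistics at once
```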


## Summary

- Series

- DataFrames

- Indexing and Manipulating DataFrames

- Loading and Saving DataFrames

- Using describe to generate summary statistics of the data frame.

## Exercise

Read in the data in `data/us-states.csv` and print out:
- The quartiles of the state populations;
- The state codes, separated by commas, in the order that the states joined the union;
- The GDP per capita of each state (GDP divided by population).