In [1]:
import pandas as pd
import numpy as np

# Introduction to Data Science for Public Policy
## Class 6: NumPy & Pandas Basics
## Thomas Monk

Now we get on to the *data science* part of the course.

We've spent a lot of time over the last week learning the fundamentals of Python - that knowledge will let us work easily with these new modules.

## NumPy

The tools we're going to use are built on *NumPy* - this module allows us to easily manipulate *arrays*.

Arrays work much like lists, but they give us many more ways to transform them.

NumPy adds a lot of mathematical functionality - we can think of these arrays as matrices - but we can abstract away from that in our use.

First we need to import the numpy module.

In [1]:
import numpy as np

Importing gives us access to all the functionality of NumPy, including the built-in functions and methods that we'll use.

Ensure this is always the first cell in your notebook, and remember to run it so the module is available to you.

Notice that I've imported it `as` np. All this does is make the name shorter while we use it!

I'm going to quickly show you the features of a NumPy array. We won't ever use these arrays directly, but the Pandas module we use is built on top of them.

The features we'll use carry over almost unchanged when we look at Pandas.

In [3]:
import numpy as np
arr = np.array([[0, 1.5, 2.0],
                [-1.0, 3.0, 5.0]])

Notice, this is just a list of lists wrapped in the `np.array` function.

Why does this matter? All we're going to work with are lists of lists, but they're now called arrays!

We can do some useful things that we couldn't do before with standard lists - think about how this will be useful when it comes to the data!
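
As a small illustration of what the array adds on top of a list of lists, the sketch below inspects a few attributes that plain lists don't have. The names `nested` and `arr_demo` are made up for this example.

```python
import numpy as np

# The same nested list as above, kept under a hypothetical name.
nested = [[0, 1.5, 2.0],
          [-1.0, 3.0, 5.0]]
arr_demo = np.array(nested)

print(arr_demo.ndim)   # number of axes: 2
print(arr_demo.shape)  # (rows, columns): (2, 3)
print(arr_demo.dtype)  # float64, because every element is stored as a float
```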

**Indexing**. We can index the array in the standard way, and some new ways.

In [5]:
arr[0] # Gives us the first row

array([0. , 1.5, 2. ])

In [9]:
arr[0][1:3] # Gives us the second and third elements of the first row

array([1.5, 2. ])

In [17]:
arr[0,1] # We can now also access items in 'comma' notation - i.e. we give the coordinates of the item we want.

1.5

In [18]:
arr[0,:] # Or obtain the whole first row.

array([0. , 1.5, 2. ])

In [20]:
arr[1][:] = 0 # we can also assign in this way
arr

array([[0. , 1.5, 2. ],
       [0. , 0. , 0. ]])
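
The comma notation also lets us slice along the second axis, which nested lists can't do in one step. A minimal sketch, using a fresh array under the hypothetical name `grid`:

```python
import numpy as np

grid = np.array([[0, 1.5, 2.0],
                 [-1.0, 3.0, 5.0]])

second_column = grid[:, 1]   # every row, column 1
sub_block = grid[:, 1:3]     # every row, columns 1 and 2

print(second_column)         # [1.5 3. ]
print(sub_block.shape)       # (2, 2)
```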

**Arithmetic**. We definitely weren't able to do the following before - these will make our lives a lot easier. Almost like Stata!

In [23]:
arr = np.array([1,2,3])
arr * 2 # We can multiply the whole array directly!

array([2, 4, 6])

In [24]:
arr + 1 # Or add to it.

array([2, 3, 4])
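
To see why this is new, compare with what the same operator does to a plain list. This is only a sketch, and the variable names are invented for the comparison.

```python
import numpy as np

plain_list = [1, 2, 3]
arr_demo = np.array(plain_list)

repeated = plain_list * 2                     # list repetition, not arithmetic
doubled_list = [x * 2 for x in plain_list]    # a loop is needed for lists
doubled_arr = arr_demo * 2                    # element-wise multiplication

print(repeated)       # [1, 2, 3, 1, 2, 3]
print(doubled_list)   # [2, 4, 6]
print(doubled_arr)    # [2 4 6]
```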

In [28]:
arr1 = np.array([2,4,8])
arr2 = np.array([2,4,8])
arr1 / arr2 # Or do any operation with two different arrays.
# These will be run element-wise (i.e. element by element) unless we specify otherwise

array([1., 1., 1.])
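
One way to "specify otherwise" is the `@` operator, which performs matrix multiplication instead of element-wise multiplication. The two small matrices below are made up purely to show the difference.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20],
              [30, 40]])

elementwise = a * b   # multiplies matching positions
matrix_prod = a @ b   # rows of a times columns of b

print(elementwise)    # [[ 10  40] [ 90 160]]
print(matrix_prod)    # [[ 70 100] [150 220]]
```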

In [29]:
arr = np.array([10,20,30,40])
print(arr.sum(), arr.min(), arr.max(), arr.std())

100 10 40 11.180339887498949
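
A point worth checking if you compare against other software: by default `arr.std()` divides by n (the population standard deviation). Passing `ddof=1` divides by n - 1 instead, which is the sample standard deviation that Pandas uses by default. A short sketch with the same numbers:

```python
import numpy as np

values = np.array([10, 20, 30, 40])

population_std = values.std()        # divides by n
sample_std = values.std(ddof=1)      # divides by n - 1

print(population_std)   # about 11.18
print(sample_std)       # about 12.91
```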


**Axes** We can also run functions across specific axes.
<img src="numpy-arrays-have-axes_updated_v2.png" alt="drawing" width="600"/>
Our first axis here (vertical, running down the rows) is axis 0, and the horizontal axis (running across the columns) is axis 1.

**Axes** This can expand to as many dimensions as we wish...
<img src="3dax.png" alt="drawing" width="600"/>

In [31]:
# If we want to sum down the rows (one total per column), we sum along axis 0:
arr = np.array([[0, 1.5, 2.0],
                [-1.0, 3.0, 5.0]])
arr.sum(axis=0)

array([-1. ,  4.5,  7. ])

In [32]:
# Or across the columns (one total per row), along axis 1:
arr = np.array([[0, 1.5, 2.0],
                [-1.0, 3.0, 5.0]])
arr.sum(axis=1)

array([3.5, 7. ])
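
A helpful way to remember this is that the axis we name is the one that disappears from the result. The sketch below checks that with shapes, first on the 2-D array and then on a made-up 3-D array of ones (`cube` is a hypothetical name).

```python
import numpy as np

table = np.array([[0, 1.5, 2.0],
                  [-1.0, 3.0, 5.0]])   # shape (2, 3)

col_totals = table.sum(axis=0)   # collapses axis 0 (the rows)    -> shape (3,)
row_totals = table.sum(axis=1)   # collapses axis 1 (the columns) -> shape (2,)

cube = np.ones((2, 3, 4))        # a 3-D array filled with ones
print(cube.sum(axis=0).shape)    # (3, 4)
print(cube.sum(axis=2).shape)    # (2, 3)
```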

## Quick NumPy problem
How would we find the sum of the difference of these two arrays?

In [35]:
arr1 = np.array([1,2,3,4])
arr2 = np.array([5,1,3,2])

In [37]:
(arr1-arr2).sum()

-1

## Pandas

Pandas adapts this NumPy array into two structures useful for data science: the `Series` and the `DataFrame`.

- The Series is just a one-dimensional array. The difference is it stores an array of data labels alongside - the index. We've seen this before in Stata.

![title](stata.png)

- The DataFrame is like a table or a spreadsheet. It is an ordered list of columns, where each column is a Pandas Series.
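
To make the idea of a Series concrete, here is a minimal sketch that builds one by hand. The values and the state-name labels are illustrative only.

```python
import pandas as pd

# Hypothetical data: populations labelled by state name.
pop_series = pd.Series([1.5, 2.4], index=['Ohio', 'Nevada'], name='pop')

print(pop_series['Nevada'])        # look up a value by its label: 2.4
print(pop_series.index.tolist())   # the labels: ['Ohio', 'Nevada']
print(pop_series.to_numpy())       # the underlying NumPy array
```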

We can construct a data frame directly from a dictionary, but there are many other ways - loading from a CSV, Excel file or an external database.

Notice that the index below is generated as part of the DataFrame construction, again just like Stata.

In [2]:
import pandas as pd
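data = {
    'state': ['Ohio', 'Ohio', 'Ohio',
              'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

df = pd.DataFrame(data)
print(df.sort_index()) # DataFrame.sort() no longer exists in current Pandas; sort_index() orders the rows by index

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9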

Notice that the columns have the names of the `keys` of the dictionary.
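
Loading from a CSV follows the same pattern: the header row supplies the column names and the index is generated for us. The sketch below reads from an in-memory string rather than a real file so that it is self-contained. The CSV text is invented for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content; normally this would be a file path.
csv_text = """state,year,pop
Ohio,2000,1.5
Nevada,2001,2.4
"""

df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.columns.tolist())   # ['state', 'year', 'pop']
print(df_csv.index.tolist())     # [0, 1]
```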

**Referencing columns** We can reference a specific column, or set of columns, in a similar way to lists and dictionaries.

In [5]:
df['state'] # Reference with a single string, for a single column.
# Notice that the output here is a Pandas Series, with its index alongside.

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

In [None]:
# We can also use this notation
df.state

In [9]:
df[['state']] # Indexing with a list returns a DataFrame, not a Pandas Series. Here the DataFrame has a single column.

Unnamed: 0,state
0,Ohio
1,Ohio
2,Ohio
3,Nevada
4,Nevada


In [46]:
df[['state', 'pop']] # We can also grab two (or more) columns by passing more items in the list.

Unnamed: 0,state,pop
0,Ohio,1.5
1,Ohio,1.7
2,Ohio,3.6
3,Nevada,2.4
4,Nevada,2.9


**Referencing rows** Here we use the `.loc[]` indexer of the DataFrame, which selects rows by their index label.

In [58]:
df.loc[3] # The fourth row of the DataFrame.

state    Nevada
year       2001
pop         2.4
Name: 3, dtype: object

In [14]:
df.loc[[3, 0]] # Or the fourth and the first, in the order we specify.

Unnamed: 0,state,year,pop
3,Nevada,2001,2.4
0,Ohio,2000,1.5
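
Note that `.loc[]` looks rows up by index label, not by position. With the default index the two coincide, which hides the difference. The sketch below uses a made-up index of 10, 20, 30 to separate them, and contrasts `.loc[]` with the position-based `.iloc[]`.

```python
import pandas as pd

# Hypothetical data with a non-default index.
df_demo = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                        'pop': [1.5, 3.6, 2.4]},
                       index=[10, 20, 30])

by_label = df_demo.loc[20]      # the row whose index label is 20
by_position = df_demo.iloc[1]   # the second row, whatever its label

print(by_label['pop'])       # 3.6
print(by_position['pop'])    # 3.6
```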

### Boolean series indexing and logic

This is where you'll start to see the power of the DataFrame.

Let's say we want to grab only the rows which have a population of over 2.

In [63]:
df[df['pop']>2] # That's all we need to do! This returns a DF.

Unnamed: 0,state,year,pop
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
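
It can help to look at the condition on its own. The expression inside the brackets is itself a Series of True/False values, one per row, and the DataFrame keeps the rows marked True. A minimal sketch on a cut-down copy of the data (`df_demo` and `mask` are names chosen for this example):

```python
import pandas as pd

df_demo = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                        'pop': [1.5, 3.6, 2.4]})

mask = df_demo['pop'] > 2    # a Series of True/False, one per row
selected = df_demo[mask]     # keeps only the rows where mask is True

print(mask.tolist())         # [False, True, True]
print(mask.sum())            # True counts as 1, so this counts matching rows: 2
```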


And what if we want only the state names and years for the rows with a population of over 2? We add our column selection.

In [64]:
df[['state','year']][df['pop']>2]

Unnamed: 0,state,year
2,Ohio,2002
3,Nevada,2001
4,Nevada,2002


In [68]:
df[['state','year']][(df['pop']>2) & (df['state']!='Nevada')] # If we want to ignore Nevada, we can just add the condition on.
# Note that & is used as the and here.
# Also note the parentheses, they have to surround each of our conditions

Unnamed: 0,state,year
2,Ohio,2002
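
The same selection can also be written as a single `.loc[]` call that takes the row condition and the column list together. This is offered as an alternative spelling, not a correction. The sketch also shows `|` for "or" and `~` for "not", with the data rebuilt under the hypothetical name `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})

# Rows and columns chosen in one .loc call.
big_not_nevada = df_demo.loc[
    (df_demo['pop'] > 2) & (df_demo['state'] != 'Nevada'),
    ['state', 'year']]

# | means "or", ~ means "not".
early_or_big = df_demo[(df_demo['year'] == 2000) | (df_demo['pop'] > 3)]
not_ohio = df_demo[~(df_demo['state'] == 'Ohio')]

print(big_not_nevada.index.tolist())   # [2]
print(early_or_big.index.tolist())     # [0, 2]
print(not_ohio.index.tolist())         # [3, 4]
```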


**Creating columns**. We can create new columns very simply!

In [72]:
df_new = df.copy()
df_new['debt'] = 1.5
df_new

Unnamed: 0,state,year,pop,debt
0,Ohio,2000,1.5,1.5
1,Ohio,2001,1.7,1.5
2,Ohio,2002,3.6,1.5
3,Nevada,2001,2.4,1.5
4,Nevada,2002,2.9,1.5


And we can create new columns based on our others - notice the overlap with NumPy.

In [74]:
df_new['debt_per_pop'] = df_new['debt']/df_new['pop']
df_new

Unnamed: 0,state,year,pop,debt,debt_per_pop
0,Ohio,2000,1.5,1.5,1.0
1,Ohio,2001,1.7,1.5,0.882353
2,Ohio,2002,3.6,1.5,0.416667
3,Nevada,2001,2.4,1.5,0.625
4,Nevada,2002,2.9,1.5,0.517241
