# ICT 782 - Day 2 Notes

# pandas

The Python Data Analysis Library (**pandas**) is a widely used, well-supported package for data analysis. Wes McKinney began developing the project in 2008 while working in quantitative investment management. It has since been adopted by numerous companies and projects in a variety of fields. These notes follow the structure of Chapter 5 of McKinney's book *Python for Data Analysis* (McKinney, 2013).

pandas is built on top of NumPy, and therefore inherits much of NumPy's functionality and flexibility. However, pandas introduces new objects that are more specific to data analysis tasks, rather than the general numerical computation of NumPy.

In [1]:
import numpy as np
import pandas as pd

pd.__version__

'0.25.1'

## The `Series` object

We've seen NumPy `ndarray` objects, and the `Series` object is very similar. It has an added attribute called the *index*. Let's take a look and then discuss it in more detail. We'll create a `Series` object with 10 random numbers sampled from the half-open interval `[0,1)`.

In [2]:
ser1 = pd.Series(np.random.rand(10))
ser1

0    0.807117
1    0.514131
2    0.251846
3    0.981062
4    0.489280
5    0.777354
6    0.741454
7    0.516422
8    0.219777
9    0.207266
dtype: float64

The integers ranging from 0 to 9 on the left of the display are the *indices* of the `Series`. It is useful to think of the indices as rows of a spreadsheet. As with `ndarray`s, we can access `Series` elements by their index.

In [3]:
ser1[4]

0.4892801874561964

However, pandas gives us more options than this. For example, we can now specify what we want the indices to be called.

In [4]:
ser1 = pd.Series(np.random.rand(10), index = ['a','b','c','d','e','f','g','h','i','j'])
ser1

a    0.021457
b    0.878437
c    0.602672
d    0.871500
e    0.210825
f    0.552579
g    0.745294
h    0.904041
i    0.026695
j    0.195820
dtype: float64

In [5]:
ser1['e']

0.21082537061597384
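
When the index holds labels, it can help to be explicit about whether a lookup is by label or by position. The following is a minimal sketch using the `loc` and `iloc` indexers on a small, made-up `Series` (the name `s` and its values are for illustration only).

```python
import pandas as pd

# A small Series with string labels (illustrative values)
s = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

by_label = s.loc['b']      # look up by index label
by_position = s.iloc[1]    # look up by integer position

print(by_label, by_position)
```

Both lookups reach the same element here, because `'b'` is the label at position 1.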

To change the index names, I didn't really need to create a new `Series`. The index is an attribute of the `Series` object, so we can reassign it directly. We can also display the values of a `Series` through the `values` attribute, but that attribute cannot be reassigned.

In [7]:
ser1.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

In [8]:
ser1.index = np.arange(10,20)
ser1

10    0.021457
11    0.878437
12    0.602672
13    0.871500
14    0.210825
15    0.552579
16    0.745294
17    0.904041
18    0.026695
19    0.195820
dtype: float64

In [9]:
ser1.values

array([0.02145715, 0.87843694, 0.60267183, 0.87150015, 0.21082537,
       0.55257867, 0.7452942 , 0.90404058, 0.02669488, 0.19581991])

In [10]:
ser1.values = np.arange(200,210)

AttributeError: can't set attribute

We can't reassign the `values` attribute, but we can change single values by referencing their index.

In [11]:
ser1[10] = 1
ser1

10    1.000000
11    0.878437
12    0.602672
13    0.871500
14    0.210825
15    0.552579
16    0.745294
17    0.904041
18    0.026695
19    0.195820
dtype: float64
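
Although the `values` attribute itself cannot be reassigned, every element can still be overwritten through ordinary indexing. A minimal sketch, assuming a small made-up `Series` named `s`:

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3], index=[10, 11, 12])

# Overwrite every element with a full slice; the index is kept
s[:] = [200.0, 201.0, 202.0]

# Overwrite a single element by its index label
s.loc[11] = -1.0

print(s)
```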

We can also display multiple values by list indexing.

In [12]:
ser1[[10,11,16]]

10    1.000000
11    0.878437
16    0.745294
dtype: float64

In [13]:
ser2 = pd.Series(np.random.rand(10), index = ['a','b','c','d','e','f','g','h','i','j'])
ser2[['d','j']]

d    0.604505
j    0.642697
dtype: float64

### `Series` objects from dictionaries

Recall that a common data structure in Python is the dictionary, which takes the form
```
{key0: value0, key1: value1, key2: value2, ...}
```

A dictionary may be passed as an argument to the `Series` constructor.

In [14]:
prov_exp = {'AB': 2.3, 'BC': 2.1, 'SK': 1.7, 'MB': 1.9}
prov = pd.Series(prov_exp)
prov

AB    2.3
BC    2.1
SK    1.7
MB    1.9
dtype: float64
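
The constructor also accepts an `index` argument alongside a dictionary. The sketch below reuses the `prov_exp` dictionary from above; the key `'ON'` is an illustrative addition that is deliberately absent from the data.

```python
import pandas as pd

prov_exp = {'AB': 2.3, 'BC': 2.1, 'SK': 1.7, 'MB': 1.9}

# Only the requested keys are kept, in the requested order;
# 'ON' is not in the dictionary, so its value is NaN
subset = pd.Series(prov_exp, index=['BC', 'AB', 'ON'])
print(subset)
```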

## The `DataFrame` object

A natural extension of the `Series` object would be to allow the object to contain more than one variable in more than one column. Hence, we introduce the `DataFrame`. This object allows data to be stored in a very similar manner to a spreadsheet (or in R, the `data.frame` object), with multiple values being stored in multiple columns, and an index keeping track of each row. It may also be helpful to think of a `DataFrame` as a dictionary of `Series` objects. 

Let's see an example of a `DataFrame` being initialized and view its `index`, `values`, and `columns` attributes.

In [15]:
data = {'Country': ['Afghanistan','Bahamas','Cabo Verde','Denmark','Ecuador','Fiji','Gabon'],
        'Population (millions)': [38.9, .393, .556, 5.79, 17.6, .896, 2.23],
        'Urban pop (%)': [25, 86, 68, 88, 63, 59, 87]}

df1 = pd.DataFrame(data)
df1

Unnamed: 0,Country,Population (millions),Urban pop (%)
0,Afghanistan,38.9,25
1,Bahamas,0.393,86
2,Cabo Verde,0.556,68
3,Denmark,5.79,88
4,Ecuador,17.6,63
5,Fiji,0.896,59
6,Gabon,2.23,87


In [16]:
df1.index

RangeIndex(start=0, stop=7, step=1)

In [17]:
df1.values

array([['Afghanistan', 38.9, 25],
       ['Bahamas', 0.393, 86],
       ['Cabo Verde', 0.556, 68],
       ['Denmark', 5.79, 88],
       ['Ecuador', 17.6, 63],
       ['Fiji', 0.896, 59],
       ['Gabon', 2.23, 87]], dtype=object)

In [18]:
df1.columns

Index(['Country', 'Population (millions)', 'Urban pop (%)'], dtype='object')

In [19]:
df1.keys()

Index(['Country', 'Population (millions)', 'Urban pop (%)'], dtype='object')

We may easily reorder columns by specifying their left-to-right order on initialization of the `DataFrame`.

In [20]:
df1 = pd.DataFrame(data, columns = ['Population (millions)', 'Urban pop (%)', 'Country'])
df1

Unnamed: 0,Population (millions),Urban pop (%),Country
0,38.9,25,Afghanistan
1,0.393,86,Bahamas
2,0.556,68,Cabo Verde
3,5.79,88,Denmark
4,17.6,63,Ecuador
5,0.896,59,Fiji
6,2.23,87,Gabon


Columns can also be declared when a `DataFrame` is created, even if the data contains no values for them. If a name listed in `columns` has no matching key in the data, then the column is created and filled with `NaN` (not a number).

In [21]:
df1 = pd.DataFrame(data, columns = ['Country','Population (millions)','Urban pop (%)','GDP'])
df1

Unnamed: 0,Country,Population (millions),Urban pop (%),GDP
0,Afghanistan,38.9,25,
1,Bahamas,0.393,86,
2,Cabo Verde,0.556,68,
3,Denmark,5.79,88,
4,Ecuador,17.6,63,
5,Fiji,0.896,59,
6,Gabon,2.23,87,


### Finding data in `DataFrame` objects

There are two basic techniques for finding specific data within a `DataFrame`: locating a *row* with a particular index or a *column* with a given name (key).

In the first case, we use the `loc` indexer of the `DataFrame` object, which takes the row index in square brackets. In the next code snippet, we'll locate the data corresponding to row `0`.

In [24]:
df1.loc[0]

Country                  Afghanistan
Population (millions)           38.9
Urban pop (%)                     25
GDP                              NaN
Name: 0, dtype: object

To locate a *column* with a given name, we pass the column key as our indexing variable.

In [25]:
df1['Country']

0    Afghanistan
1        Bahamas
2     Cabo Verde
3        Denmark
4        Ecuador
5           Fiji
6          Gabon
Name: Country, dtype: object

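Note that this is different from the `Series` object. With a `DataFrame`, square brackets look up a column key, so an integer row index raises a `KeyError` unless a column happens to have that name, as the following code snippet demonstrates.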

In [26]:
df1[0]

KeyError: 0

Data may be taken from a `DataFrame` column and put into a `Series` object.

In [27]:
country = df1.Country
country

0    Afghanistan
1        Bahamas
2     Cabo Verde
3        Denmark
4        Ecuador
5           Fiji
6          Gabon
Name: Country, dtype: object

In [28]:
type(country)

pandas.core.series.Series
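
The row and column lookups above can be combined. The following sketch uses a small, made-up `DataFrame` with string row labels (`'dk'`, `'ec'`, `'fj'` are illustrative) to show a few common access patterns.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Denmark', 'Ecuador', 'Fiji'],
                   'Urban pop (%)': [88, 63, 59]},
                  index=['dk', 'ec', 'fj'])

row = df.loc['ec']                       # row by index label (a Series)
cell = df.loc['ec', 'Urban pop (%)']     # single cell by row and column label
first = df.iloc[0]                       # row by integer position
cols = df[['Country', 'Urban pop (%)']]  # a list of keys returns a DataFrame
```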

### Changing values within a row or column

Once a `DataFrame` has been initialized, we may still change its values. For example, we can set a column to a single value.

In [29]:
df1.GDP = 16.5
df1

Unnamed: 0,Country,Population (millions),Urban pop (%),GDP
0,Afghanistan,38.9,25,16.5
1,Bahamas,0.393,86,16.5
2,Cabo Verde,0.556,68,16.5
3,Denmark,5.79,88,16.5
4,Ecuador,17.6,63,16.5
5,Fiji,0.896,59,16.5
6,Gabon,2.23,87,16.5


Similarly, we can set a column to a range of values.

In [30]:
df1.GDP = np.arange(7,14)
df1

Unnamed: 0,Country,Population (millions),Urban pop (%),GDP
0,Afghanistan,38.9,25,7
1,Bahamas,0.393,86,8
2,Cabo Verde,0.556,68,9
3,Denmark,5.79,88,10
4,Ecuador,17.6,63,11
5,Fiji,0.896,59,12
6,Gabon,2.23,87,13


If we assign a `Series` as a `DataFrame` column, its values are aligned with the `DataFrame` by index label. Any row whose index label does not appear in the `Series` is filled with `NaN`. This example also illustrates that if we assign values to a column that doesn't exist yet, then the column will be created.

In [31]:
df1['GNP'] = pd.Series([1.7,3.2,1.6])
df1

Unnamed: 0,Country,Population (millions),Urban pop (%),GDP,GNP
0,Afghanistan,38.9,25,7,1.7
1,Bahamas,0.393,86,8,3.2
2,Cabo Verde,0.556,68,9,1.6
3,Denmark,5.79,88,10,
4,Ecuador,17.6,63,11,
5,Fiji,0.896,59,12,
6,Gabon,2.23,87,13,
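
To make the role of the index in column assignment more visible, here is a minimal sketch with made-up data in which the `Series` labels are deliberately out of order. The column names `x`, `y`, and `z` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})           # default index 0, 1, 2

# The Series labels decide where each value lands, not the order
df['y'] = pd.Series([1.7, 3.2], index=[2, 0])

# A plain list or array has no labels, so its length must match exactly
df['z'] = np.array([7, 8, 9])
print(df)
```

Row `1` has no matching label in the `Series`, so its `y` value is `NaN`.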


Finally, a column can be deleted using the `del` statement.

In [32]:
del df1['GNP']
df1

Unnamed: 0,Country,Population (millions),Urban pop (%),GDP
0,Afghanistan,38.9,25,7
1,Bahamas,0.393,86,8
2,Cabo Verde,0.556,68,9
3,Denmark,5.79,88,10
4,Ecuador,17.6,63,11
5,Fiji,0.896,59,12
6,Gabon,2.23,87,13


## Slicing and boolean indexing

Just as with `ndarray`s, we can access or set elements of a `Series` or `DataFrame` using slicing and boolean indexing. 

### Slicing

Slicing with a `Series` object works just like it does with NumPy arrays.

In [33]:
ser2

a    0.046123
b    0.226757
c    0.339368
d    0.604505
e    0.614747
f    0.735096
g    0.770280
h    0.750207
i    0.777633
j    0.642697
dtype: float64

In [34]:
ser2[0:4]

a    0.046123
b    0.226757
c    0.339368
d    0.604505
dtype: float64

Note that if we slice by row label, the endpoint is included.

In [35]:
ser2['c':'i']

c    0.339368
d    0.604505
e    0.614747
f    0.735096
g    0.770280
h    0.750207
i    0.777633
dtype: float64

In [37]:
# Slicing a DataFrame by row

df1.loc[3:]

Unnamed: 0,Country,Population (millions),Urban pop (%),GDP
3,Denmark,5.79,88,10
4,Ecuador,17.6,63,11
5,Fiji,0.896,59,12
6,Gabon,2.23,87,13
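
The difference between the two endpoint conventions is easiest to see side by side. A minimal sketch on a made-up `Series`, using the `iloc` and `loc` indexers to make the intent explicit:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

by_position = s.iloc[0:2]     # positions 0 and 1; endpoint excluded
by_label = s.loc['a':'c']     # labels 'a' through 'c'; endpoint included

print(len(by_position), len(by_label))
```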


### Boolean indexing

Here, we'll find the elements of `ser2` that are greater than 0.5.

In [38]:
ser2

a    0.046123
b    0.226757
c    0.339368
d    0.604505
e    0.614747
f    0.735096
g    0.770280
h    0.750207
i    0.777633
j    0.642697
dtype: float64

In [39]:
ser2[ser2 > 0.5]

d    0.604505
e    0.614747
f    0.735096
g    0.770280
h    0.750207
i    0.777633
j    0.642697
dtype: float64

For a `DataFrame`, we need to specify the column when we use boolean indexing. In this example, we'll display all rows that have a value in `col0` greater than 0.5.

In [40]:
df2 = pd.DataFrame(np.random.rand(16).reshape((4,4)), 
                   columns = ['col0','col1','col2','col3'], 
                   index = ['row0','row1','row2','row3'])
df2

Unnamed: 0,col0,col1,col2,col3
row0,0.044191,0.437528,0.626565,0.880175
row1,0.621835,0.77037,0.228442,0.863557
row2,0.502018,0.81118,0.727412,0.808079
row3,0.268752,0.176167,0.836101,0.022781


In [41]:
df2[df2['col0'] > 0.5]

Unnamed: 0,col0,col1,col2,col3
row1,0.621835,0.77037,0.228442,0.863557
row2,0.502018,0.81118,0.727412,0.808079


We may also use boolean masks in a similar manner to their use in NumPy arrays. Here we'll display a boolean mask corresponding to elements of `df2` that are greater than 0.5.

In [42]:
df2 > 0.5

Unnamed: 0,col0,col1,col2,col3
row0,False,False,True,True
row1,True,True,False,True
row2,True,True,True,True
row3,False,False,True,False


Now we'll use this mask to zero elements of `df2` less than or equal to 0.5.

In [43]:
df2[df2 <= 0.5] = 0
df2

Unnamed: 0,col0,col1,col2,col3
row0,0.0,0.0,0.626565,0.880175
row1,0.621835,0.77037,0.0,0.863557
row2,0.502018,0.81118,0.727412,0.808079
row3,0.0,0.0,0.836101,0.0
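
Boolean conditions can also be combined. The sketch below assumes a small made-up `DataFrame`; note that pandas uses the element-wise operators `&`, `|`, and `~` rather than the Python keywords `and`, `or`, and `not`.

```python
import pandas as pd

df = pd.DataFrame({'col0': [0.1, 0.6, 0.7, 0.9],
                   'col1': [0.8, 0.2, 0.9, 0.4]})

# Each comparison needs its own parentheses; use & (and), | (or), ~ (not)
both = df[(df['col0'] > 0.5) & (df['col1'] > 0.5)]
either = df[(df['col0'] > 0.5) | (df['col1'] > 0.5)]
neither = df[~((df['col0'] > 0.5) | (df['col1'] > 0.5))]
```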


## Useful `Series` and `DataFrame` methods

### The `set_index()` method

By default, `DataFrame` objects have the index attribute set to an integer range. We can change the index for the entire object to be any of the existing columns or an integer range. Note that `set_index()` returns a new `DataFrame` and leaves the original unchanged.

In [44]:
df3 = pd.DataFrame({'Country': ['Mexico','Guatemala','Nicaragua','Honduras'],
                    'Currency': ['peso','quetzal','cordoba','lempira']})
df3

Unnamed: 0,Country,Currency
0,Mexico,peso
1,Guatemala,quetzal
2,Nicaragua,cordoba
3,Honduras,lempira


In [45]:
df3.set_index(df3['Country'])

Unnamed: 0_level_0,Country,Currency
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
Mexico,Mexico,peso
Guatemala,Guatemala,quetzal
Nicaragua,Nicaragua,cordoba
Honduras,Honduras,lempira


In [46]:
df3.set_index(np.arange(5,9))

Unnamed: 0,Country,Currency
5,Mexico,peso
6,Guatemala,quetzal
7,Nicaragua,cordoba
8,Honduras,lempira
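
A common variation is to pass the column *name* rather than the column itself, which moves the column into the index instead of duplicating it. The following is a minimal sketch with made-up data; the `Capital` column is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Mexico', 'Guatemala'],
                   'Capital': ['Mexico City', 'Guatemala City']})

# Passing the column name moves that column into the index
indexed = df.set_index('Country')

# The original is unchanged; reset_index() moves the index back to a column
restored = indexed.reset_index()
```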


### The `reindex()` method

Similar to rearranging the column ordering, we may rearrange the row ordering using the `reindex()` method. The `reindex()` method takes in a list of row names as an argument and returns a copy with the changed order of rows in a `Series` or `DataFrame`.

In [47]:
data2 = pd.Series(np.random.rand(4), index = ['id' + str(i) for i in range(4)])
data2

id0    0.010216
id1    0.320124
id2    0.473417
id3    0.346076
dtype: float64

In [48]:
data3 = data2.reindex(['id3','id1','id2','id0'])
data3

id3    0.346076
id1    0.320124
id2    0.473417
id0    0.010216
dtype: float64

In [49]:
df2 = pd.DataFrame(np.random.rand(12).reshape((3,4)), index = ['row0','row1','row2'])
df2

Unnamed: 0,0,1,2,3
row0,0.583449,0.45117,0.880017,0.502174
row1,0.440819,0.722679,0.540755,0.084769
row2,0.495765,0.104294,0.815846,0.265324


In [50]:
df3 = df2.reindex(['row2','row1','row0'])
df3

Unnamed: 0,0,1,2,3
row2,0.495765,0.104294,0.815846,0.265324
row1,0.440819,0.722679,0.540755,0.084769
row0,0.583449,0.45117,0.880017,0.502174


In the event that we pass the `reindex()` method more indices than the `DataFrame` contains, new indices will be created and their values will be set to `NaN`.

We may also choose how to fill in these new indices. This is controlled by specifying the optional `fill_value` argument. 
```
df.reindex(<new order of indices>, fill_value = <option here>)
```
Let's fill in new rows with zeros.

In [51]:
df4 = df3.reindex(['row0','row4','row1','row2'])
df4

Unnamed: 0,0,1,2,3
row0,0.583449,0.45117,0.880017,0.502174
row4,,,,
row1,0.440819,0.722679,0.540755,0.084769
row2,0.495765,0.104294,0.815846,0.265324


In [52]:
df4 = df3.reindex(['row0','row4','row1','row2'], fill_value = 0)
df4

Unnamed: 0,0,1,2,3
row0,0.583449,0.45117,0.880017,0.502174
row4,0.0,0.0,0.0,0.0
row1,0.440819,0.722679,0.540755,0.084769
row2,0.495765,0.104294,0.815846,0.265324
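
Besides a constant `fill_value`, `reindex()` also accepts a `method` argument that fills new entries from neighbouring ones. A minimal sketch, assuming a made-up `Series` whose index is sorted (which `method` requires):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=[0, 2, 4])

# New labels 1 and 3 take the value of the closest preceding label
filled = s.reindex(range(5), method='ffill')
print(filled)
```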


### Transposing objects

We may also use the transpose attribute, denoted `T`, to switch rows and columns in a pandas `DataFrame`.

In [53]:
df3.T

Unnamed: 0,row2,row1,row0
0,0.495765,0.440819,0.583449
1,0.104294,0.722679,0.45117
2,0.815846,0.540755,0.880017
3,0.265324,0.084769,0.502174


### The `drop()` method

We may remove values from a `Series` by passing the index to be removed to the `drop()` method.

In [54]:
ser2

a    0.046123
b    0.226757
c    0.339368
d    0.604505
e    0.614747
f    0.735096
g    0.770280
h    0.750207
i    0.777633
j    0.642697
dtype: float64

In [58]:
ser2.drop('f')

a    0.046123
b    0.226757
c    0.339368
d    0.604505
e    0.614747
g    0.770280
h    0.750207
i    0.777633
j    0.642697
dtype: float64

Similarly, for a `DataFrame`, we may remove either a row or a column. To drop a column, we specify the column label to be dropped. To drop a row, the row index must be used. By specifying the `axis` keyworded argument, we tell the interpreter where to find the index or column label we have passed in. Using `axis = 0` corresponds to dropping a row, and `axis = 1` indicates dropping a column. By default, `axis = 0`. 

In [56]:
df1

Unnamed: 0,Country,Population (millions),Urban pop (%),GDP
0,Afghanistan,38.9,25,7
1,Bahamas,0.393,86,8
2,Cabo Verde,0.556,68,9
3,Denmark,5.79,88,10
4,Ecuador,17.6,63,11
5,Fiji,0.896,59,12
6,Gabon,2.23,87,13


In [60]:
df1.drop('GDP', axis = 1)

Unnamed: 0,Country,Population (millions),Urban pop (%)
0,Afghanistan,38.9,25
1,Bahamas,0.393,86
2,Cabo Verde,0.556,68
3,Denmark,5.79,88
4,Ecuador,17.6,63
5,Fiji,0.896,59
6,Gabon,2.23,87


In [61]:
df1.drop(6, axis = 0)

Unnamed: 0,Country,Population (millions),Urban pop (%),GDP
0,Afghanistan,38.9,25,7
1,Bahamas,0.393,86,8
2,Cabo Verde,0.556,68,9
3,Denmark,5.79,88,10
4,Ecuador,17.6,63,11
5,Fiji,0.896,59,12
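
As an alternative to the `axis` argument, `drop()` accepts the keyword arguments `index` and `columns`, and each can take a list of labels. A minimal sketch on made-up data, which also shows that `drop()` returns a new object:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

fewer_cols = df.drop(columns=['b', 'c'])   # same as drop(['b', 'c'], axis=1)
fewer_rows = df.drop(index=[0, 2])         # same as drop([0, 2], axis=0)

# drop() returned new objects; df still has all rows and columns
print(df.shape)
```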


## Adding, subtracting, multiplying, and dividing `Series` and `DataFrame` objects

Combining two or more `Series` or `DataFrame` objects using arithmetic operators is done by index. If both objects have the same indices, then the operation works as expected. If one object has indices that are missing from the other object, then the index of the result is the union of both indices, and any entry whose label is not present in both objects is set to `NaN`. Let's look at some examples to make this clearer.

In [62]:
# Adding two Series with the same indices

s1 = pd.Series(np.arange(5))
s2 = pd.Series(np.arange(10,15))
s1 + s2

0    10
1    12
2    14
3    16
4    18
dtype: int32

In [65]:
# Adding two Series with some of the same indices

s1 = s1.reindex([0,1,'c','d','e'])
s1

0    0.0
1    1.0
c    NaN
d    NaN
e    NaN
dtype: float64

In [66]:
s1 + s2

0    10.0
1    12.0
2     NaN
3     NaN
4     NaN
c     NaN
d     NaN
e     NaN
dtype: float64

In [67]:
# Multiplying two DataFrames with some overlapping indices

d1 = pd.DataFrame(np.arange(25).reshape((5,5)))
d2 = pd.DataFrame(np.arange(16).reshape((4,4)))
d1 * d2

Unnamed: 0,0,1,2,3,4
0,0.0,1.0,4.0,9.0,
1,20.0,30.0,42.0,56.0,
2,80.0,99.0,120.0,143.0,
3,180.0,208.0,238.0,270.0,
4,,,,,


In [68]:
d1 / d2

Unnamed: 0,0,1,2,3,4
0,,1.0,1.0,1.0,
1,1.25,1.2,1.166667,1.142857,
2,1.25,1.222222,1.2,1.181818,
3,1.25,1.230769,1.214286,1.2,
4,,,,,
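
If the `NaN` entries are not wanted, the arithmetic methods (`add()`, `sub()`, `mul()`, `div()`) take a `fill_value` argument that stands in for the missing side. A minimal sketch with two made-up `Series` that overlap only at label `'b'`:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])

plain = s1 + s2                     # NaN for 'a' and 'c'
filled = s1.add(s2, fill_value=0)   # the missing side is treated as 0
```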


## Applying universal functions

Since pandas objects are built on the solid foundation of NumPy arrays, it is no surprise that universal (vectorized) functions work perfectly with pandas objects. In this section, we'll look at NumPy universal functions and introduce a way to apply custom 1-dimensional functions to pandas objects.

First, we can apply universal NumPy functions to `Series` objects without any problems.

In [69]:
ser2

a    0.046123
b    0.226757
c    0.339368
d    0.604505
e    0.614747
f    0.735096
g    0.770280
h    0.750207
i    0.777633
j    0.642697
dtype: float64

In [70]:
np.sin(ser2)

a    0.046107
b    0.224819
c    0.332891
d    0.568355
e    0.576752
f    0.670658
g    0.696336
h    0.681790
i    0.701595
j    0.599356
dtype: float64

In [71]:
2*ser2

a    0.092246
b    0.453514
c    0.678736
d    1.209011
e    1.229493
f    1.470191
g    1.540559
h    1.500415
i    1.555266
j    1.285393
dtype: float64

When applying universal functions to `DataFrame`s, the function is automatically applied to every element in every column.

In [72]:
random10 = pd.DataFrame(np.random.rand(10).reshape((5,2)))
random10

Unnamed: 0,0,1
0,0.889487,0.990971
1,0.1613,0.652061
2,0.811852,0.775182
3,0.152872,0.2117
4,0.124333,0.350693


In [73]:
random10 + 2

Unnamed: 0,0,1
0,2.889487,2.990971
1,2.1613,2.652061
2,2.811852,2.775182
3,2.152872,2.2117
4,2.124333,2.350693


Alternatively, we may specify the column to which we apply the function. As always, the result of any expression may be saved to a new variable.

In [74]:
r10 = random10[0] + 2
r10

0    2.889487
1    2.161300
2    2.811852
3    2.152872
4    2.124333
Name: 0, dtype: float64

### `lambda` functions

There are many times when we may want to apply a function to an argument passed to another function, or we might want to compute a function without using a full function declaration. This is where we use `lambda`, or **anonymous**, functions.

Declaring a `lambda` function is done using the syntax:
```
lambda <inputs>: <return value>
```

**Note:** If you find yourself doing anything complicated with `lambda` functions, then the purpose of `lambda` functions is really defeated. Use a declared regular function instead.

Here's an example of declaring and calling a `lambda` function in one line.

In [75]:
(lambda x, y: (x%2) + y)(np.arange(5),np.random.rand(5))

array([0.16124981, 1.32125093, 0.22685185, 1.92709531, 0.52462414])

The `lambda` function above takes in two arguments, `x` and `y`. It then returns `(x%2) + y`. We immediately called the function by passing `np.arange(5)` and `np.random.rand(5)` as its arguments.

We can also give `lambda` functions a name, but this takes away their anonymity. It can, however, make for visually satisfying syntax.

In [76]:
f = lambda x: (x + 4) % 3
x = np.arange(23,33)
f(x)

array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype=int32)

We can apply these functions to `Series` and `DataFrame` objects using the `apply()` method. Doing so will generate a copy of the object with the function applied. When applying a function to a `DataFrame`, make sure that you apply the function to columns of the correct input type. Here are some examples.

In [77]:
ser3 = pd.Series(np.random.rand(7))
ser3

0    0.394547
1    0.230526
2    0.988302
3    0.759609
4    0.015620
5    0.753137
6    0.528928
dtype: float64

In [78]:
ser3.apply(lambda x: x-0.75)

0   -0.355453
1   -0.519474
2    0.238302
3    0.009609
4   -0.734380
5    0.003137
6   -0.221072
dtype: float64

In [79]:
ser3

0    0.394547
1    0.230526
2    0.988302
3    0.759609
4    0.015620
5    0.753137
6    0.528928
dtype: float64

In [80]:
dat1 = pd.DataFrame({'pond': ['DeWitt','McLean'],
                     'ph': [6.5, 6.8],
                     'biod': [54, 67]})
dat1['ph'].apply(lambda x: np.ceil(x))

0    7.0
1    7.0
Name: ph, dtype: float64
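
Calling `apply()` on a whole `DataFrame` behaves a little differently: the function receives an entire column (or row) as a `Series` rather than a single element. The sketch below uses made-up pond data to illustrate both directions.

```python
import pandas as pd

dat = pd.DataFrame({'ph': [6.5, 6.8, 7.1],
                    'biod': [54, 67, 60]})

# The function receives each column as a Series and returns one number
col_range = dat.apply(lambda col: col.max() - col.min())

# With axis=1, the function receives each row instead
row_sum = dat.apply(lambda row: row['ph'] + row['biod'], axis=1)
```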

## Computing summary statistics

Many methods are available to calculate various statistics for `Series` and `DataFrame` objects. We will list some of the most commonly used here and provide examples below. When these methods are used on `DataFrame`s, the calculation is done column-wise unless otherwise specified. If you want to use a method on numeric columns only, you can use the keyworded argument `numeric_only = True`.

|Method|Details|
|---|---|
|`sum()`|Sum all values.|
|`mean()`|Calculate the average value(s).|
|`median()`|Calculate the median value(s) (also known as the 50th percentile(s)).|
|`quantile()`|Compute the quantiles for the data.|
|`count()`|Count all values. This is useful when `DataFrame` columns contain missing values.|
|`std()`|Calculate the standard deviation.|
|`var()`|Calculate the variance.|
|`describe()`|Produce an output of the summary statistics which includes the count, mean, std, quantiles, min, and max.|
|`min()`, `max()`|Compute the minimum or maximum value(s).|
|`idxmin()`, `idxmax()`|Return the index corresponding to the minimum or maximum value.|
|`skew()`|Compute the unbiased skew of the data.|
|`kurt()`|Compute the unbiased kurtosis of the data.|

In [81]:
dat1

Unnamed: 0,pond,ph,biod
0,DeWitt,6.5,54
1,McLean,6.8,67


In [92]:
dat1.describe()

Unnamed: 0,ph,biod
count,2.0,2.0
mean,6.65,60.5
std,0.212132,9.192388
min,6.5,54.0
25%,6.575,57.25
50%,6.65,60.5
75%,6.725,63.75
max,6.8,67.0


In [91]:
# To access a particular stat; in this case, the mean.

stats = dat1.describe()
stats['ph'].loc['mean']

6.65

In [83]:
dat1.var()

ph       0.045
biod    84.500
dtype: float64

In [84]:
ser10 = pd.Series(np.array([1.2, 5.6, 1.11, 2.34, 4.101, 5.66, 0.671]))
ser10.skew()

0.3763560087196213

In [85]:
ser10.kurt()

-2.058620536689417
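
The column-wise default and the `numeric_only` argument mentioned above can be sketched as follows, reusing the shape of the `dat1` data. The variable names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'pond': ['DeWitt', 'McLean'],
                   'ph': [6.5, 6.8],
                   'biod': [54, 67]})

col_means = df.mean(numeric_only=True)        # one value per numeric column
row_means = df[['ph', 'biod']].mean(axis=1)   # one value per row
```

Averaging `ph` and `biod` across a row is not meaningful for real data; it is done here only to show the effect of `axis = 1`.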

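The next few methods compute pairwise statistics between the columns of a `DataFrame`. (A `Series` also has `cov()` and `corr()` methods, but they take a second `Series` as an argument.)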

|Method|Details|
|---|---|
|`cov()`|Returns the pair-wise covariance matrix of the columns.|
|`corr()`|Returns the pair-wise correlation of the columns.|
|`corrwith(<object>)`|Returns the pairwise correlation of the `DataFrame` with either a `Series` or `DataFrame` specified by `<object>`.|

In [94]:
dat2 = pd.DataFrame(np.random.rand(25).reshape((5,5)), columns = ['col0','col1','col2','col3','col4'])
ser2 = pd.Series(np.random.rand(5))
dat2.cov()

Unnamed: 0,col0,col1,col2,col3,col4
col0,0.064786,-0.01457,0.025866,0.018971,-0.051671
col1,-0.01457,0.100862,-0.010842,0.06599,0.018109
col2,0.025866,-0.010842,0.025682,0.000163,-0.047429
col3,0.018971,0.06599,0.000163,0.059747,0.001902
col4,-0.051671,0.018109,-0.047429,0.001902,0.100317


In [95]:
dat2.corr()

Unnamed: 0,col0,col1,col2,col3,col4
col0,1.0,-0.180247,0.634121,0.304929,-0.640952
col1,-0.180247,1.0,-0.213027,0.850077,0.180033
col2,0.634121,-0.213027,1.0,0.004166,-0.934414
col3,0.304929,0.850077,0.004166,1.0,0.024565
col4,-0.640952,0.180033,-0.934414,0.024565,1.0


In [96]:
dat2.corrwith(ser2)

col0    0.730430
col1   -0.166275
col2   -0.043962
col3    0.243510
col4   -0.068900
dtype: float64

## Hierarchical indexing

It is also possible to store data with additional categorical *levels* by giving the index more than one level. For example, if we had multiple data points from 4 individual sources, we could express this as a `Series` with its index specified by a list of lists or arrays.

In [97]:
data = pd.Series(np.random.rand(11),
                 index = [[0,0,0,1,1,1,2,2,3,3,3],
                          [5,5,6,5,6,7,7,7,5,6,7]])
data       

0  5    0.540078
   5    0.296653
   6    0.869285
1  5    0.356492
   6    0.682221
   7    0.272009
2  7    0.392580
   7    0.116501
3  5    0.101806
   6    0.321968
   7    0.638971
dtype: float64

From this `Series`, we can access a single data source by using its outer-level index label as a key.

In [98]:
data[0]

5    0.540078
5    0.296653
6    0.869285
dtype: float64

In [99]:
data[0,5]

(0, 5)    0.540078
(0, 5)    0.296653
dtype: float64

We can make `DataFrame`s with different levels on *both* axes by passing lists of lists to the `index` and `columns` keyworded arguments.

In [100]:
data = pd.DataFrame(np.random.rand(12).reshape((4,3)),
                    index = [['a','a','b','b'], [5,6,6,6]],
                    columns = [['A','A','B'],['x','y','x']])
data

Unnamed: 0_level_0,Unnamed: 1_level_0,A,A,B
Unnamed: 0_level_1,Unnamed: 1_level_1,x,y,x
a,5,0.777587,0.883452,0.033223
a,6,0.478672,0.368124,0.026701
b,6,0.313537,0.727633,0.455368
b,6,0.973679,0.833384,0.837494


In [104]:
data['A']

Unnamed: 0,Unnamed: 1,x,y
a,5,0.777587,0.883452
a,6,0.478672,0.368124
b,6,0.313537,0.727633
b,6,0.973679,0.833384


In [109]:
data['A','y']

a  5    0.883452
   6    0.368124
b  6    0.727633
   6    0.833384
Name: (A, y), dtype: float64

In [110]:
data['A','y'].loc['a']

5    0.883452
6    0.368124
Name: (A, y), dtype: float64

In [111]:
data['A','y'].loc['a'].loc[5]

0.8834520435209073

In [102]:
data['B']

Unnamed: 0,Unnamed: 1,x
a,5,0.033223
a,6,0.026701
b,6,0.455368
b,6,0.837494
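
A hierarchically indexed `Series` can also be reshaped into a `DataFrame`. The following is a minimal sketch with made-up data and unique index pairs, using `loc` for lookups and `unstack()` for the reshape.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4],
              index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

inner = s.loc['a']          # all entries under outer label 'a'
single = s.loc[('a', 2)]    # one entry, addressed by both levels
wide = s.unstack()          # inner level becomes the columns
```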


## Sorting

The three main object methods used for sorting and ranking `Series` and `DataFrame`s are `sort_index()`, `sort_values()`, and `rank()`. All three return a new object and leave the original unchanged. To sort the actual object with `sort_index()` or `sort_values()`, use the `inplace = True` keyworded argument; `rank()` has no such argument.

In [112]:
ser = pd.Series(np.random.rand(7))
ser

0    0.106441
1    0.100691
2    0.185566
3    0.150508
4    0.620327
5    0.385640
6    0.097992
dtype: float64

In [113]:
ser.sort_index(ascending = False)

6    0.097992
5    0.385640
4    0.620327
3    0.150508
2    0.185566
1    0.100691
0    0.106441
dtype: float64

The `sort_values()` method is used to sort an object by a specified column of values. For `DataFrame` objects, we can specify the `by` keyworded argument to select the column(s) by which we want to sort.

In [114]:
humidity = pd.DataFrame({'date': ['1-23-1999','1-19-1999','1-20-1999','1-21-1999','1-22-1999'], 
                         'humi': np.random.rand(5), 
                         'temp': np.random.randint(14, 33, 5)})
humidity.sort_values(by = 'temp')

Unnamed: 0,date,humi,temp
2,1-20-1999,0.649933,17
3,1-21-1999,0.90884,21
1,1-19-1999,0.970229,25
0,1-23-1999,0.698075,27
4,1-22-1999,0.456091,29


In [115]:
humidity.sort_values(by = 'date')

Unnamed: 0,date,humi,temp
1,1-19-1999,0.970229,25
2,1-20-1999,0.649933,17
3,1-21-1999,0.90884,21
4,1-22-1999,0.456091,29
0,1-23-1999,0.698075,27


In [116]:
humidity.sort_values(by = ['date','humi'])

Unnamed: 0,date,humi,temp
1,1-19-1999,0.970229,25
2,1-20-1999,0.649933,17
3,1-21-1999,0.90884,21
4,1-22-1999,0.456091,29
0,1-23-1999,0.698075,27


Note that sorting by two columns will only produce a difference if there are duplicate values in the first column; the second column is then used to break the ties.

In [117]:
humidity.loc[5] = ['1-19-1999', np.random.rand(), np.random.randint(14,33)]
humidity

Unnamed: 0,date,humi,temp
0,1-23-1999,0.698075,27
1,1-19-1999,0.970229,25
2,1-20-1999,0.649933,17
3,1-21-1999,0.90884,21
4,1-22-1999,0.456091,29
5,1-19-1999,0.453245,15


In [118]:
humidity.sort_values(by = ['date','humi'])

Unnamed: 0,date,humi,temp
5,1-19-1999,0.453245,15
1,1-19-1999,0.970229,25
2,1-20-1999,0.649933,17
3,1-21-1999,0.90884,21
4,1-22-1999,0.456091,29
0,1-23-1999,0.698075,27
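
The `date` column in these examples holds strings, so it is sorted as text. That happens to give chronological order above, but it would not if the day numbers had different lengths. A minimal sketch with made-up dates, assuming the month-day-year layout used above and a hypothetical helper column named `parsed`:

```python
import pandas as pd

df = pd.DataFrame({'date': ['1-23-1999', '1-9-1999', '1-19-1999'],
                   'temp': [27, 25, 17]})

# String comparison puts '1-9-1999' last, because '9' sorts after '1' and '2'
as_text = df.sort_values(by='date')

# Parsing the strings first gives chronological order
df['parsed'] = pd.to_datetime(df['date'], format='%m-%d-%Y')
as_dates = df.sort_values(by='parsed')
```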


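Finally, the `rank()` method replaces each value with its rank within its column. By default, tied values share the average of their ranks, which is why the two `1-19-1999` entries below both receive rank 1.5. The optional keyworded argument `pct = True` displays the ranks as percentile ranks (each rank divided by the number of values). As with the other methods above, we can also use the keyworded arguments `axis` for ranking by row/column and `ascending` for ascending/descending order.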

In [119]:
humidity.rank()

Unnamed: 0,date,humi,temp
0,6.0,4.0,5.0
1,1.5,6.0,4.0
2,3.0,3.0,2.0
3,4.0,5.0,3.0
4,5.0,2.0,6.0
5,1.5,1.0,1.0


In [120]:
humidity.rank(pct = True)

Unnamed: 0,date,humi,temp
0,1.0,0.666667,0.833333
1,0.25,1.0,0.666667
2,0.5,0.5,0.333333
3,0.666667,0.833333,0.5
4,0.833333,0.333333,1.0
5,0.25,0.166667,0.166667
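
How ties are ranked is controlled by the `method` argument of `rank()`. A minimal sketch on a made-up `Series` with one tie, comparing three of the available options:

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])

average = s.rank()                 # default: ties share the mean rank
lowest = s.rank(method='min')      # ties share the lowest rank
dense = s.rank(method='dense')     # like 'min', but ranks have no gaps
```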


# Summary

* pandas is one of the main Python packages for data analysis.
* The `Series` object is an array with an additional `index` attribute.
* The `DataFrame` object is used for storing tabular data in row/column format.
* `Series` and `DataFrame` objects can be created from lists, dictionaries, or NumPy arrays.
    * Both objects can be sliced and boolean indexed much like NumPy arrays.
    * Arithmetic operations (`+`, `-`, `*`, `/`) on two objects result in element-wise calculation where the indices and columns agree, and `NaN` elsewhere.
    * Universal functions work element-wise on pandas objects.
* `lambda` functions are used to create simple, anonymous functions without a full function declaration.
    * `lambda` functions can be applied to pandas objects using the `apply()` method.
* Many summary statistic calculations are available as methods of pandas objects.
* Hierarchical indexing allows creating pandas objects with more than 2 dimensions.
* Sorting and ranking data in pandas objects is done using the `sort_index()`, `sort_values()`, and `rank()` methods.

# *Exercises*

1. Create pandas objects containing the following data:
    
    a) A single column of 100 random values. 
    
    b) Three columns, each with 10 values. The first column consists of random integers between 1 and 50. The second column will consist of the `sin` of the first column. Finally, the third column will be the difference of the first and second columns divided by the mean of the second column.
    
    c) Thirteen columns, one for each province or territory of Canada. The values in each column will simply be 30 random values from the normal distribution with mean 0 and variance 1.

In [121]:
# a) single column of 100 random values.
obj1 = pd.Series(np.random.rand(100))
obj1

0     0.283872
1     0.850190
2     0.533738
3     0.956789
4     0.230041
        ...   
95    0.739886
96    0.919620
97    0.874968
98    0.453832
99    0.647021
Length: 100, dtype: float64

In [122]:
# b) three columns of 10 values

# randint excludes the upper bound, so use 51 to allow values from 1 to 50
obj2 = pd.DataFrame({'col1': np.random.randint(1, 51, 10)})
obj2

Unnamed: 0,col1
0,35
1,10
2,41
3,40
4,44
5,16
6,34
7,4
8,28
9,18


In [123]:
obj2['col2'] = np.sin(obj2['col1'])
obj2

Unnamed: 0,col1,col2
0,35,-0.428183
1,10,-0.544021
2,41,-0.158623
3,40,0.745113
4,44,0.017702
5,16,-0.287903
6,34,0.529083
7,4,-0.756802
8,28,0.270906
9,18,-0.750987


In [124]:
obj2['col3'] = (obj2['col1'] - obj2['col2'])/obj2['col2'].mean()
obj2

Unnamed: 0,col1,col2,col3
0,35,-0.428183,-259.791511
1,10,-0.544021,-77.318309
2,41,-0.158623,-301.812285
3,40,0.745113,-287.852371
4,44,0.017702,-322.51803
5,16,-0.287903,-119.437654
6,34,0.529083,-245.439069
7,4,-0.756802,-34.881183
8,28,0.270906,-203.334824
9,18,-0.750987,-137.499215


In [127]:
# c) thirteen columns
obj3 = pd.DataFrame(np.random.randn(30, 13),
                    columns = ['YT','NWT','NVT','BC','AB','SK','MB','ON','QC','NB','PEI','NS','NFLD'])
obj3

Unnamed: 0,YT,NWT,NVT,BC,AB,SK,MB,ON,QC,NB,PEI,NS,NFLD
0,-0.034285,-1.472742,0.141023,0.028372,-0.395149,-0.867364,-1.157314,0.778553,0.480924,-0.472213,1.378278,2.608821,-0.609321
1,0.327806,1.139212,1.34541,1.51868,-2.339126,-0.22257,0.86846,0.660011,-2.497309,-0.670455,1.20026,0.525854,0.184823
2,0.462033,0.745415,-1.433063,0.407043,0.265008,0.019005,-0.159403,-0.655218,1.28232,-0.101855,-1.585913,0.031235,-0.352905
3,0.988618,0.995684,1.191479,0.896943,1.328536,-0.864569,-0.178767,1.201558,-1.20402,-0.390222,0.149133,1.29111,1.080377
4,-0.887174,-0.481288,-0.479606,-2.527273,0.119462,1.2697,1.154197,-0.8889,0.620135,0.649167,0.854276,1.780744,-1.476721
5,-0.747932,-1.070316,-0.593384,-0.14458,-0.068448,0.521945,-2.0823,0.876233,-0.587577,-1.390417,1.450669,-0.568462,0.298337
6,0.606071,0.095835,0.439594,-0.418624,-1.523756,-0.675963,0.886072,-0.052287,-1.015929,0.148229,0.191213,-0.966947,0.642568
7,0.346963,1.622298,-0.175552,2.142699,-1.120323,0.466952,-3.891988,0.656304,-0.372482,0.575088,0.881975,-0.158631,0.496484
8,-1.286475,0.636659,0.677417,1.682905,0.427912,-0.598665,-1.299606,0.593012,-0.318723,1.063993,-0.227803,-0.864507,-1.033744
9,0.485912,-0.102049,0.434591,-1.145246,0.589053,-1.149702,0.553664,2.605512,0.004717,1.325732,0.491074,-0.518277,2.238223


2. Compute the summary statistics for the dataset below. Next, use boolean indexing to zero any negative numbers in the dataset. Compare the summary statistics for the new dataset with the original summary statistics.

In [128]:
ages = pd.Series(np.random.randn(50))
ages

0    -0.031027
1     1.373437
2    -0.284365
3     0.440350
4    -1.141828
5     0.632892
6    -1.216421
7     0.039934
8    -1.780844
9     0.094619
10   -1.875970
11    1.184589
12    0.260636
13   -0.159101
14   -0.082723
15    0.451262
16    0.132534
17   -1.556987
18    1.000775
19    2.734834
20    0.236891
21    0.244116
22    0.345310
23    1.234274
24   -1.680645
25   -0.255190
26   -1.800049
27    0.157154
28   -0.771738
29    1.344758
30   -0.841088
31   -1.118430
32   -0.718158
33    1.000506
34    0.130716
35   -0.432284
36    0.745079
37    1.115818
38   -0.788836
39   -1.361365
40    0.302811
41   -0.463416
42    1.156021
43   -1.100722
44    0.013337
45   -0.232182
46   -1.182635
47   -0.789555
48    1.286519
49    1.362109
dtype: float64

In [129]:
ages.describe()

count    50.000000
mean     -0.052886
std       1.022699
min      -1.875970
25%      -0.789375
50%       0.026636
75%       0.587485
max       2.734834
dtype: float64

In [131]:
mask = ages > 0
mask

0     False
1      True
2     False
3      True
4     False
5      True
6     False
7      True
8     False
9      True
10    False
11     True
12     True
13    False
14    False
15     True
16     True
17    False
18     True
19     True
20     True
21     True
22     True
23     True
24    False
25    False
26    False
27     True
28    False
29     True
30    False
31    False
32    False
33     True
34     True
35    False
36     True
37     True
38    False
39    False
40     True
41    False
42     True
43    False
44     True
45    False
46    False
47    False
48     True
49     True
dtype: bool

In [132]:
ages = ages*mask
ages

0    -0.000000
1     1.373437
2    -0.000000
3     0.440350
4    -0.000000
5     0.632892
6    -0.000000
7     0.039934
8    -0.000000
9     0.094619
10   -0.000000
11    1.184589
12    0.260636
13   -0.000000
14   -0.000000
15    0.451262
16    0.132534
17   -0.000000
18    1.000775
19    2.734834
20    0.236891
21    0.244116
22    0.345310
23    1.234274
24   -0.000000
25   -0.000000
26   -0.000000
27    0.157154
28   -0.000000
29    1.344758
30   -0.000000
31   -0.000000
32   -0.000000
33    1.000506
34    0.130716
35   -0.000000
36    0.745079
37    1.115818
38   -0.000000
39   -0.000000
40    0.302811
41   -0.000000
42    1.156021
43   -0.000000
44    0.013337
45   -0.000000
46   -0.000000
47   -0.000000
48    1.286519
49    1.362109
dtype: float64

In [133]:
ages.describe()

count    50.000000
mean      0.380426
std       0.585137
min      -0.000000
25%      -0.000000
50%       0.026636
75%       0.587485
max       2.734834
dtype: float64
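
Comparing the two `describe()` outputs above: the mean rises from about -0.05 to about 0.38, the standard deviation falls from about 1.02 to about 0.59, the minimum becomes zero, and the median and 75% quantile are unchanged. The `-0.000000` entries appear because a negative number multiplied by `False` (that is, by 0) gives negative zero. As a minimal sketch of an alternative, the negative entries can be selected with a boolean mask and overwritten directly. The `ages_demo` values below are made up for illustration and are not the data used above.

```python
import numpy as np
import pandas as pd

# Small stand-in for the `ages` Series above (illustrative values only).
ages_demo = pd.Series([-0.5, 1.25, -2.0, 0.75])

# Boolean indexing: select the negative entries and overwrite them with 0.
zeroed = ages_demo.copy()
zeroed[zeroed < 0] = 0

# Put the summary statistics side by side for comparison.
before = ages_demo.describe()
after = zeroed.describe()
print(pd.DataFrame({'before': before, 'after': after}))
```

Assigning through the mask writes a plain `0.0`, so no negative zeros appear in the result.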

3. Roughly speaking, we say a dataset is 'normally distributed' when the histogram of the data approximates the normal distribution (sometimes known as the 'bell curve'). A quick check for normality is to compare the `median` and the `mean` of the data. When these two statistics are close in value, the data *might* be normally distributed. When they differ substantially, the data is probably not normally distributed.
    
    We're going to step ahead to next week momentarily. We will load the 'Food Services and Drinking Places Sales' dataset from Statistics Canada. To get the data, navigate to [this page](https://www150.statcan.gc.ca/n1/pub/71-607-x/71-607-x2017003-eng.htm) at StatsCan. Copy the dataset with `Ctrl` + `C` **without selecting the top row**, and execute the cell below. 

In [137]:
data = pd.read_clipboard(header = None)
data

Unnamed: 0,0,1,2
0,Canada Footnote5,6318106,0.1
1,Newfoundland and Labrador,74155,-0.4
2,Prince Edward Island,25251,0.4
3,Nova Scotia,146743,-0.7
4,New Brunswick,106165,0.4
5,Quebec,1243607,0.2
6,Ontario,2496507,0.1
7,Manitoba,180150,0.8
8,Saskatchewan,161992,1.2
9,Alberta,803148,-0.2


a) Rename the columns according to the original dataset.

In [138]:
data.columns = ['Geography', 'Total', 'Percent change']
data

Unnamed: 0,Geography,Total,Percent change
0,Canada Footnote5,6318106,0.1
1,Newfoundland and Labrador,74155,-0.4
2,Prince Edward Island,25251,0.4
3,Nova Scotia,146743,-0.7
4,New Brunswick,106165,0.4
5,Quebec,1243607,0.2
6,Ontario,2496507,0.1
7,Manitoba,180150,0.8
8,Saskatchewan,161992,1.2
9,Alberta,803148,-0.2


b) Using a `lambda` function, strip the `Footnote5` string from the values in the first column. Make sure that each province is spelled correctly, and change the spelling if necessary. **Hint:** the `rstrip(<characters to remove>)` string method removes characters from the *right* of a string. It treats its argument as a set of characters, not as an exact suffix.

In [139]:
data['Geography'] = data['Geography'].apply(lambda text: text.rstrip("Footnote 5"))
data

Unnamed: 0,Geography,Total,Percent change
0,Canada,6318106,0.1
1,Newfoundland and Labrador,74155,-0.4
2,Prince Edward Island,25251,0.4
3,Nova Scotia,146743,-0.7
4,New Brunswick,106165,0.4
5,Quebec,1243607,0.2
6,Ontari,2496507,0.1
7,Manitoba,180150,0.8
8,Saskatchewa,161992,1.2
9,Alberta,803148,-0.2
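
The `Ontari` and `Saskatchewa` entries above are a side effect of how `rstrip()` works: it keeps removing trailing characters for as long as they belong to the given set, so a name ending in `o` or `n` loses those letters too. The sketch below reproduces the effect on a few names and compares it with a small helper that removes only an exact suffix. The helper name `strip_suffix` is hypothetical and is not part of pandas or the standard library.

```python
import pandas as pd

def strip_suffix(text, suffix):
    """Remove `suffix` from the end of `text` only if it is really there."""
    if suffix and text.endswith(suffix):
        # Drop the suffix, then any whitespace left in front of it.
        return text[:-len(suffix)].rstrip()
    return text

names = pd.Series(['Canada Footnote5', 'Ontario', 'Saskatchewan', 'Yukon'])

# rstrip removes any trailing characters found in the set 'Footnote 5'.
with_rstrip = names.apply(lambda text: text.rstrip('Footnote 5'))

# strip_suffix removes only the exact trailing string 'Footnote5'.
with_suffix = names.apply(lambda text: strip_suffix(text, 'Footnote5'))

print(pd.DataFrame({'rstrip': with_rstrip, 'strip_suffix': with_suffix}))
```

With the exact-suffix version, only the `Canada` row changes and no spellings need to be repaired afterwards.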


In [140]:
data.loc[[6,8,11,13], 'Geography'] = ['Ontario','Saskatchewan','Yukon','Nunavut']
data

Unnamed: 0,Geography,Total,Percent change
0,Canada,6318106,0.1
1,Newfoundland and Labrador,74155,-0.4
2,Prince Edward Island,25251,0.4
3,Nova Scotia,146743,-0.7
4,New Brunswick,106165,0.4
5,Quebec,1243607,0.2
6,Ontario,2496507,0.1
7,Manitoba,180150,0.8
8,Saskatchewan,161992,1.2
9,Alberta,803148,-0.2


c) Remove the row containing `Canada`.

In [143]:
data = data.drop(0, axis = 0)
data

Unnamed: 0,Geography,Total,Percent change
1,Newfoundland and Labrador,74155,-0.4
2,Prince Edward Island,25251,0.4
3,Nova Scotia,146743,-0.7
4,New Brunswick,106165,0.4
5,Quebec,1243607,0.2
6,Ontario,2496507,0.1
7,Manitoba,180150,0.8
8,Saskatchewan,161992,1.2
9,Alberta,803148,-0.2
10,British Columbia,1065956,0.1


d) Convert the `Total` column to integers. You may have to remove the commas from the values using the `replace(<character to remove>,<replacement>)` string method before performing the conversion.

In [144]:
data['Total'] = data['Total'].apply(lambda text: int(text.replace(',','')))
data

Unnamed: 0,Geography,Total,Percent change
1,Newfoundland and Labrador,74155,-0.4
2,Prince Edward Island,25251,0.4
3,Nova Scotia,146743,-0.7
4,New Brunswick,106165,0.4
5,Quebec,1243607,0.2
6,Ontario,2496507,0.1
7,Manitoba,180150,0.8
8,Saskatchewan,161992,1.2
9,Alberta,803148,-0.2
10,British Columbia,1065956,0.1


e) Calculate the summary statistics for the `Total` and `Percent change` columns. Comment on the possible normality of the data.

In [147]:
data[['Total','Percent change']].describe()

Unnamed: 0,Total,Percent change
count,13.0,13.0
mean,486008.2,0.030769
std,740916.4,1.081962
min,1478.0,-1.9
25%,25251.0,-0.4
50%,146743.0,0.1
75%,803148.0,0.4
max,2496507.0,2.1
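
In the output above, the mean of `Total` (about 486,008) is more than three times its median (146,743), so `Total` is probably not normally distributed. For `Percent change`, the mean (about 0.03) and the median (0.1) are close relative to the standard deviation (about 1.08), so that column *might* be normally distributed. The sketch below shows one way to put a number on "close": the gap between mean and median divided by the standard deviation. That scaling is my own choice for illustration, not a standard test, and the two Series are made-up values.

```python
import pandas as pd

def mean_median_gap(series):
    """Return |mean - median| scaled by the standard deviation."""
    return abs(series.mean() - series.median()) / series.std()

# Hypothetical data, chosen only to illustrate the check.
symmetric = pd.Series([1, 2, 3, 4, 5])
skewed = pd.Series([1, 2, 3, 4, 500])

print(mean_median_gap(symmetric))  # mean and median coincide
print(mean_median_gap(skewed))     # one large value pulls the mean upward
```

A small gap does not prove normality. It only fails to rule it out, which is why the check is described as a quick one.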


4. Sort the dataset by the columns under the different column levels. **Hint:** to sort by a column of a hierarchically indexed dataset, you must specify the outer level label and the column name together in a tuple. The general syntax is `pd.DataFrame.sort_values(by = (<level>, <column name>))`.

In [2]:
data = pd.DataFrame(np.random.rand(12).reshape((4,3)),
                    index = [['a','a','b','b'], [5,6,6,6]],
                    columns = [['A','A','B'],['x','y','x']])
data

Unnamed: 0_level_0,Unnamed: 1_level_0,A,A,B
Unnamed: 0_level_1,Unnamed: 1_level_1,x,y,x
a,5,0.519427,0.575028,0.572158
a,6,0.455852,0.447914,0.408985
b,6,0.075348,0.456354,0.541799
b,6,0.185952,0.522912,0.409019


In [3]:
data.sort_values(by = ('A','x'))

Unnamed: 0_level_0,Unnamed: 1_level_0,A,A,B
Unnamed: 0_level_1,Unnamed: 1_level_1,x,y,x
b,6,0.075348,0.456354,0.541799
b,6,0.185952,0.522912,0.409019
a,6,0.455852,0.447914,0.408985
a,5,0.519427,0.575028,0.572158


In [4]:
data.sort_values(by = ('B','x'))

Unnamed: 0_level_0,Unnamed: 1_level_0,A,A,B
Unnamed: 0_level_1,Unnamed: 1_level_1,x,y,x
a,6,0.455852,0.447914,0.408985
b,6,0.185952,0.522912,0.409019
b,6,0.075348,0.456354,0.541799
a,5,0.519427,0.575028,0.572158
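
The same tuple syntax extends to several sort keys, and `sort_index()` can sort by a level of the row index instead of by column values. The sketch below uses a small DataFrame with fixed integer values (made up for illustration) so that the resulting order is easy to follow.

```python
import pandas as pd

demo = pd.DataFrame([[3, 1, 2],
                     [1, 2, 2],
                     [2, 3, 1],
                     [1, 1, 3]],
                    index=[['a', 'a', 'b', 'b'], [5, 6, 5, 6]],
                    columns=[['A', 'A', 'B'], ['x', 'y', 'x']])

# Sort by ('A', 'x') first, breaking ties with ('A', 'y').
by_two = demo.sort_values(by=[('A', 'x'), ('A', 'y')])

# Sort the rows by the inner level of the row index.
by_inner = demo.sort_index(level=1)

print(by_two)
print(by_inner)
```

Passing a list of tuples to `by` gives each key its own (outer label, inner label) pair, in order of priority.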


# References

McKinney, W. (2013). *Python for Data Analysis*. O'Reilly Media: Sebastopol, California.