In [1]:
import pandas as pd

# Using Pandas DataFrames

This notebook focuses on DataFrames, the primary data structure that Pandas adds to Python. We will discuss the various parts of a Pandas DataFrame and how to create, manipulate, and edit one. 

For this section, we are going to be using the data located at  
> https://raw.githubusercontent.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/master/ch_02/data/parsed.csv

This dataset will be used for all of the exercises.


## What is a Dataframe

The most widely understood analogy for a Pandas DataFrame is a spreadsheet. In a spreadsheet (be it Excel, Google Sheets, or whatever version you prefer), you have rows, columns, and entries. In fact, Pandas uses this same vocabulary when referring to the various pieces of a DataFrame. 

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/master/ch_02/data/parsed.csv")
df

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,status,time,title,tsunami,type,types,tz,updated,url,parsed_place
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...,California
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.020030,,79.0,",ci37389202,",1.29,ml,...,automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...,California
2,,4.4,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.021370,28.0,21.0,",ci37389194,",3.42,ml,...,automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...,California
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.026180,,39.0,",ci37389186,",0.44,ml,...,automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...,California
4,,,73096941,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.077990,,192.0,",nc73096941,",2.16,md,...,automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...,California
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9327,,,73086771,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.018060,,185.0,",nc73086771,",0.62,md,...,reviewed,1537230228060,"M 0.6 - 9km ENE of Mammoth Lakes, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1537285598315,https://earthquake.usgs.gov/earthquakes/eventp...,California
9328,,,38063967,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.030410,,50.0,",ci38063967,",1.00,ml,...,reviewed,1537230135130,"M 1.0 - 3km W of Julian, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1537276800970,https://earthquake.usgs.gov/earthquakes/eventp...,California
9329,,,2018261000,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.452600,,276.0,",pr2018261000,",2.40,md,...,reviewed,1537229908180,"M 2.4 - 35km NNE of Hatillo, Puerto Rico",0,earthquake,",geoserve,origin,phase-data,",-240.0,1537243777410,https://earthquake.usgs.gov/earthquakes/eventp...,Puerto Rico
9330,,,38063959,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.018650,,61.0,",ci38063959,",1.10,ml,...,reviewed,1537229545350,"M 1.1 - 9km NE of Aguanga, CA",0,earthquake,",focal-mechanism,geoserve,nearby-cities,origin...",-480.0,1537230211640,https://earthquake.usgs.gov/earthquakes/eventp...,California


The example dataframe we will use throughout this notebook has 27 columns and 9332 rows. We can also get information about what data type each column contains. For example: 

In [3]:
df.dtypes

alert            object
cdi             float64
code             object
detail           object
dmin            float64
felt            float64
gap             float64
ids              object
mag             float64
magType          object
mmi             float64
net              object
nst             float64
place            object
rms             float64
sig               int64
sources          object
status           object
time              int64
title            object
tsunami           int64
type             object
types            object
tz              float64
updated           int64
url              object
parsed_place     object
dtype: object
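
The row and column counts quoted above can also be read programmatically from the `shape` attribute. The snippet below is a minimal sketch on a tiny, made-up table (the `quakes` name and its values are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# A tiny, made-up table standing in for the earthquake data.
quakes = pd.DataFrame(
    {
        "mag": [1.35, 1.29, 3.42],
        "magType": ["ml", "ml", "ml"],
        "tsunami": [0, 0, 0],
    }
)

# shape is a (rows, columns) tuple.
n_rows, n_cols = quakes.shape
print(n_rows, n_cols)

# dtypes is a Series mapping each column name to its data type.
print(quakes.dtypes)
```

Running `df.shape` on the full dataset should report the same row and column counts mentioned in the text.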

In this dataframe, we have integers, floats, and strings (here labeled as `object`). We can extract any one of these columns by referencing the column's name as an attribute of the dataframe.

In [4]:
df.place

0                  9km NE of Aguanga, CA
1                  9km NE of Aguanga, CA
2                  8km NE of Aguanga, CA
3                  9km NE of Aguanga, CA
4                  10km NW of Avenal, CA
                      ...               
9327        9km ENE of Mammoth Lakes, CA
9328                 3km W of Julian, CA
9329    35km NNE of Hatillo, Puerto Rico
9330               9km NE of Aguanga, CA
9331               9km NE of Aguanga, CA
Name: place, Length: 9332, dtype: object

Alternatively, we can reference the column by using a key, similar to Python dictionaries. 

In [5]:
df["place"]

0                  9km NE of Aguanga, CA
1                  9km NE of Aguanga, CA
2                  8km NE of Aguanga, CA
3                  9km NE of Aguanga, CA
4                  10km NW of Avenal, CA
                      ...               
9327        9km ENE of Mammoth Lakes, CA
9328                 3km W of Julian, CA
9329    35km NNE of Hatillo, Puerto Rico
9330               9km NE of Aguanga, CA
9331               9km NE of Aguanga, CA
Name: place, Length: 9332, dtype: object

This single column is no longer a `DataFrame`, but is instead an instance of the `Series` class.

In [6]:
print(type(df))
print(type(df.place))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


A `DataFrame` object is essentially a collection of `Series` objects that share the same index. All values in a `Series` share a single data type, and a `DataFrame` can be made up of whatever `Series` objects you wish.
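
To make the relationship concrete, here is a small sketch that builds a dataframe out of two series. The names `mags`, `places`, and `small_df` and all of the values are invented for this illustration:

```python
import pandas as pd

# Two made-up Series; each one holds a single data type.
mags = pd.Series([1.35, 1.29, 3.42], name="mag")
places = pd.Series(["California", "California", "Alaska"], name="parsed_place")

# Combining them by name gives a DataFrame with one column per Series.
small_df = pd.DataFrame({"mag": mags, "parsed_place": places})

# Pulling a column back out returns a Series again.
print(type(small_df["mag"]))
print(small_df["mag"].dtype)
```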

An additional object that makes up both the `Series` and the `DataFrame` is the `Index`. Notice that in every one of the above outputs, you can see the numbers 0-9331. This list of numbers is the `Index` for each `Series`. Each value in the `Index` is the label of that row or value. For `Series` objects, we can reference a particular value by its index. 

In [8]:
df.place[3]

'9km NE of Aguanga, CA'

However, if we try to do the same thing for a `DataFrame`, we get the following error. 

In [9]:
df[3]

KeyError: 3

This is because the above syntax looks for a column named `3`, and no such column exists. To reference an entire row of a dataframe by its position, you must use the following syntax.

In [14]:
df.iloc[3]

alert                                                         NaN
cdi                                                           NaN
code                                                     37389186
detail          https://earthquake.usgs.gov/fdsnws/event/1/que...
dmin                                                      0.02618
felt                                                          NaN
gap                                                          39.0
ids                                                  ,ci37389186,
mag                                                          0.44
magType                                                        ml
mmi                                                           NaN
net                                                            ci
nst                                                          26.0
place                                       9km NE of Aguanga, CA
rms                                                          0.17
sig       

The column names can be retrieved by using

In [15]:
df.columns

Index(['alert', 'cdi', 'code', 'detail', 'dmin', 'felt', 'gap', 'ids', 'mag',
       'magType', 'mmi', 'net', 'nst', 'place', 'rms', 'sig', 'sources',
       'status', 'time', 'title', 'tsunami', 'type', 'types', 'tz', 'updated',
       'url', 'parsed_place'],
      dtype='object')

If we need to reference multiple columns at once, we can pass a list of column names inside the square brackets. This will return a new dataframe containing just that subset of columns. 

In [17]:
df[["time","mag", "magType", "place", "parsed_place"]]

Unnamed: 0,time,mag,magType,place,parsed_place
0,1539475168010,1.35,ml,"9km NE of Aguanga, CA",California
1,1539475129610,1.29,ml,"9km NE of Aguanga, CA",California
2,1539475062610,3.42,ml,"8km NE of Aguanga, CA",California
3,1539474978070,0.44,ml,"9km NE of Aguanga, CA",California
4,1539474716050,2.16,md,"10km NW of Avenal, CA",California
...,...,...,...,...,...
9327,1537230228060,0.62,md,"9km ENE of Mammoth Lakes, CA",California
9328,1537230135130,1.00,ml,"3km W of Julian, CA",California
9329,1537229908180,2.40,md,"35km NNE of Hatillo, Puerto Rico",Puerto Rico
9330,1537229545350,1.10,ml,"9km NE of Aguanga, CA",California


## Using DataFrames

Python has a reputation for being slow. This is largely due to the fact that 
1. it is an interpreted language rather than one compiled ahead of time like C or C++
2. every value carries extra bookkeeping (such as type information) on the backend that might not be present in a lower-level language. 

For most scripting use cases, neither of these poses a problem. If you are doing any sort of high-volume numerical computation, however, they can really slow down your workflow. To address this, the Python community created the NumPy package. This package stores numbers in compact, fixed-type arrays, offers matrix- and vector-like functionality (element-wise addition, vector products, matrix multiplication, etc.), and provides many other functions to enable faster numerical computations. Pandas builds upon that base to bring many of the same speed and functionality benefits to dataframes. To this end, we can operate on entire columns, build entirely new columns based on the values of already existing ones, filter rows based on the value of a single column, etc.
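
As a rough sketch of what operating on an entire column looks like, the example below derives two new columns without writing an explicit loop. The `quakes` table, its values, and the new column names are assumptions made up for the illustration:

```python
import pandas as pd

quakes = pd.DataFrame({"mag": [1.35, 1.29, 3.42, 0.44]})

# Operate on the whole column at once; no explicit Python loop is needed.
quakes["mag_rounded"] = quakes["mag"].round(1)
quakes["is_noticeable"] = quakes["mag"] >= 2.0

# The same comparison written as a loop, for contrast.
looped = [m >= 2.0 for m in quakes["mag"]]

print(quakes)
print(looped == quakes["is_noticeable"].tolist())
```

Both approaches give the same values; on large tables the column-wise form is usually the faster one, because the loop runs inside NumPy rather than in Python.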

### Adding new Columns
Let's first build a new column. I want to determine if an earthquake occurred on the Ring of Fire. The locations that make up the Ring of Fire are saved in the following list (including a mix of country and US state names):

In [27]:
ring_of_fire = [ 
    "Bolivia", 
    "Chile", 
    "Ecuador", 
    "Peru", 
    "Costa Rica", 
    "Guatemala", 
    "Mexico", 
    "Japan", 
    "Philippines", 
    "Indonesia", 
    "New Zealand", 
    "Antarctic", 
    "Canada", 
    "Fiji", 
    "Alaska", 
    "Washington", 
    "California", 
    "Russia", 
    "Taiwan", 
    "Tonga", 
    "Kermadec Islands"
]

Taking a look at the column `parsed_place`, we see that these names best match the values in that column. 

In [19]:
df.parsed_place.unique()

array(['California', 'Dominican Republic', 'Alaska', 'Indonesia',
       'Canada', 'Puerto Rico', 'Montana', 'Nevada', 'Christmas Island',
       'Hawaii', 'Northern Mariana Islands', 'Japan', 'Ecuador',
       'Vanuatu', 'Mexico', 'Russia', 'British Virgin Islands',
       'Washington', 'Papua New Guinea', 'Fiji', 'U.S. Virgin Islands',
       'Chile', 'Peru', 'Yemen', 'Guatemala', 'Kansas', 'Australia',
       'Wyoming', 'Kuril Islands', 'Oklahoma', 'Tennessee',
       'Pacific-Antarctic Ridge', 'Utah', 'Colombia', 'Argentina',
       'Oregon', 'Greece', 'Missouri', 'Tajikistan',
       'Northern Mid-Atlantic Ridge', 'Sumatra', 'Solomon Islands',
       'Burma', 'Taiwan', 'Nicaragua',
       'South Georgia and South Sandwich Islands', 'Idaho', 'Kyrgyzstan',
       'Arizona', 'Tonga', 'Northern East Pacific Rise', 'South Africa',
       'Southern Mid-Atlantic Ridge', 'Costa Rica', 'China',
       'Philippines', 'Haiti', 'Jamaica', 'Kentucky', 'New Zealand',
       'Iran', 'Afghanistan

One way of determining if a value is in a given list is by using the `value in list` syntax. Using this, we are essentially asking if a particular value exists within that list. This would look like the following: 

In [23]:
"North Carolina" in df.parsed_place.unique()

True

This tells us that at least one row has the value `'North Carolina'` in the `parsed_place` column of our dataset. However, what we need to do is build a series of true and false values, one for each row, based on the value in that row. For this we can use a feature called a list comprehension. This one-liner builds a list very efficiently, which we can later convert into a column. 

In [28]:
ring_of_fire_column_list = [location in ring_of_fire for location in df.parsed_place]
print("length:", len(ring_of_fire_column_list))
print("unique:", set(ring_of_fire_column_list))

length: 9332
unique: {False, True}


Perfect, now we have a list of true/false values with the same number of values as the number of rows in our dataset. Converting this into a series is done simply by instantiating the `Series` class from the Pandas library.

In [29]:
ring_of_fire_column = pd.Series(ring_of_fire_column_list, name="is_in_ring_of_fire")
ring_of_fire_column

0        True
1        True
2        True
3        True
4        True
        ...  
9327     True
9328     True
9329    False
9330     True
9331     True
Name: is_in_ring_of_fire, Length: 9332, dtype: bool

Pandas is a very versatile library and is able to adapt its functionality based on the inputs. Here we were able to convert our list into a `Series`. We can add our series to the original dataframe by using the `join` method.

In [30]:
df_join = df.join(ring_of_fire_column)
df_join[["parsed_place","is_in_ring_of_fire"]]

Unnamed: 0,parsed_place,is_in_ring_of_fire
0,California,True
1,California,True
2,California,True
3,California,True
4,California,True
...,...,...
9327,California,True
9328,California,True
9329,Puerto Rico,False
9330,California,True


In [32]:
df_join.columns

Index(['alert', 'cdi', 'code', 'detail', 'dmin', 'felt', 'gap', 'ids', 'mag',
       'magType', 'mmi', 'net', 'nst', 'place', 'rms', 'sig', 'sources',
       'status', 'time', 'title', 'tsunami', 'type', 'types', 'tz', 'updated',
       'url', 'parsed_place', 'is_in_ring_of_fire'],
      dtype='object')

We can now see that the column has been added to the dataframe. 

_Note: `join` does not modify the original dataframe. It returns a new one, so the result has to be saved to a variable (here `df_join`). Many dataframe operations, such as dropping columns or filtering rows, behave the same way, so remember to save the result after making such a change._

An alternate (and slightly simpler) way of adding the column is by assigning the list directly to a new key. 

In [33]:
df['is_in_ring_of_fire'] = ring_of_fire_column_list
df[["parsed_place", "is_in_ring_of_fire"]]

Unnamed: 0,parsed_place,is_in_ring_of_fire
0,California,True
1,California,True
2,California,True
3,California,True
4,California,True
...,...,...
9327,California,True
9328,California,True
9329,Puerto Rico,False
9330,California,True
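
As an aside, Pandas also offers the `Series.isin` method, which is not used in this notebook but performs the same membership test without a list comprehension. The sketch below uses a shortened, made-up list and a hypothetical `places` table to show that both routes agree:

```python
import pandas as pd

# Shortened, illustrative versions of the list and column used above.
ring_of_fire = ["Alaska", "California", "Japan"]
places = pd.DataFrame(
    {"parsed_place": ["California", "Puerto Rico", "Japan", "Montana"]}
)

# isin() tests every entry against the list and returns a boolean Series.
places["is_in_ring_of_fire"] = places["parsed_place"].isin(ring_of_fire)

# The list comprehension from the text produces the same values.
by_comprehension = [loc in ring_of_fire for loc in places["parsed_place"]]

print(places)
print(by_comprehension == places["is_in_ring_of_fire"].tolist())
```

Note that assigning to `places["is_in_ring_of_fire"]` changes the existing table directly, just like the key assignment shown above.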


### Filtering rows
Filtering rows works by creating a series or list of boolean values and passing it inside the square brackets. The boolean values can come from a boolean column, such as the `is_in_ring_of_fire` column we created in the last section. Alternatively, you can filter on a condition involving the values of a column, such as whether the magnitude is higher than some threshold. Let's explore both of these options below. 

In [39]:
df[df.is_in_ring_of_fire][["parsed_place","is_in_ring_of_fire"]]

Unnamed: 0,parsed_place,is_in_ring_of_fire
0,California,True
1,California,True
2,California,True
3,California,True
4,California,True
...,...,...
9326,California,True
9327,California,True
9328,California,True
9330,California,True


Notice that the number of rows is smaller by ~2000. To make this even more dramatic, let's show all the rows that are _not_ in the ring of fire. 

In [40]:
df[df.is_in_ring_of_fire == False][["parsed_place","is_in_ring_of_fire"]]

Unnamed: 0,parsed_place,is_in_ring_of_fire
5,Dominican Republic,False
11,Dominican Republic,False
17,Puerto Rico,False
20,Puerto Rico,False
30,Montana,False
...,...,...
9306,Puerto Rico,False
9307,Puerto Rico,False
9315,Puerto Rico,False
9319,Argentina,False


Now we have just over 2000 rows, which lines up roughly with what we noticed before. Notice that we could have used the same syntax for the true case, specifically `df.is_in_ring_of_fire == True`. Also notice that the index of each row does not change. The index still points back to the row number in the original dataframe. In other words, the index is tied to the data and is not simply a counter. There are ways of reassigning the index, but that is not something I want to explore for this class. 

Note that we used the equality operator (`==`) in this second filter. This suggests that we can use other comparison operators as well. In fact, we can use any expression that returns an array of boolean values. To clarify what I mean, let's look at the following. 

In [41]:
df.is_in_ring_of_fire == True

0        True
1        True
2        True
3        True
4        True
        ...  
9327     True
9328     True
9329    False
9330     True
9331     True
Name: is_in_ring_of_fire, Length: 9332, dtype: bool

The above value is a Pandas Series. Coincidentally, it holds the same set of values as the column that we created. This suggests that any list of boolean values with one entry per row can be used to filter. We could use the following to get the same filter instead of creating a new column. 

In [42]:
df[[location in ring_of_fire for location in df.parsed_place]][["parsed_place","is_in_ring_of_fire"]]

Unnamed: 0,parsed_place,is_in_ring_of_fire
0,California,True
1,California,True
2,California,True
3,California,True
4,California,True
...,...,...
9326,California,True
9327,California,True
9328,California,True
9330,California,True


We get the same result in fewer lines of code and without storing an extra column, which can be helpful if we only need the filter once or just don't want to add more data for Python to manage. 

Another way we can filter our data is through numerical comparisons. Note the following series

In [43]:
df.mag >= 2.0

0       False
1       False
2        True
3       False
4        True
        ...  
9327    False
9328    False
9329     True
9330    False
9331    False
Name: mag, Length: 9332, dtype: bool

We can use something like this to find just the earthquakes that are above a certain threshold. 

In [44]:
df[df.mag >= 5.0][["parsed_place","is_in_ring_of_fire"]]

Unnamed: 0,parsed_place,is_in_ring_of_fire
36,Christmas Island,False
118,Russia,True
180,Indonesia,True
226,Vanuatu,False
227,Peru,True
...,...,...
9175,East Timor,False
9176,New Zealand,True
9211,Southwest Indian Ridge,False
9213,Tonga,True


By this point, we can start combining all sorts of conditions on the dataset to zero in on the specific rows that you need for your analysis. Let's find all the earthquakes that hit Indonesia that were also coupled with a tsunami. The first part of this filter is easy. Simply find all the rows where `df.parsed_place == "Indonesia"`, very similar to filters we have already performed. For the second part, let's first take a look at the values present in `df.tsunami`. 

In [None]:
df.tsunami.unique()

Out of all 9000+ rows, only two values exist: zero and one. Sometimes, boolean values are stored as integers. When this is the case, the standard translation is `0 == False` and `1 == True`. We could have Python convert the values of this column to an actual boolean type, but that would be an unnecessary extra step. We can simply use `df.tsunami == 1` to find all the rows where a tsunami was also triggered by the earthquake. 
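
As a quick check of that claim, the following sketch compares the two approaches on a made-up series of 0/1 flags (the values and variable names are invented for this example):

```python
import pandas as pd

# Made-up 0/1 flags like those in the tsunami column.
tsunami = pd.Series([0, 1, 0, 1, 1], name="tsunami")

# Comparing against 1 gives a boolean mask directly...
mask_from_comparison = tsunami == 1

# ...and matches what an explicit conversion to bool would give.
mask_from_astype = tsunami.astype(bool)

print(mask_from_comparison.equals(mask_from_astype))
```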

To use both filters, there are a couple of different ways of managing this. The simplest is to use `loc`. This indexer lets us select a subset of columns and combine the filters using the element-wise and/or operators (`&` and `|`), with each condition wrapped in parentheses. For this problem, you could execute the following:

In [45]:
df.loc[ 
    (df.parsed_place == "Indonesia") & (df.tsunami == 1),
    ["parsed_place", "is_in_ring_of_fire"]
]

Unnamed: 0,parsed_place,is_in_ring_of_fire
1406,Indonesia,True
1698,Indonesia,True
3112,Indonesia,True
3150,Indonesia,True
3605,Indonesia,True
3609,Indonesia,True
3692,Indonesia,True
3699,Indonesia,True
3707,Indonesia,True
3709,Indonesia,True


_Note: `loc` can also be used to select specific rows if their index labels are known. It works based on the index of the row, not the position in the dataframe. Read [this StackOverflow question](https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different) for a more detailed explanation, along with a comparison with the related `iloc`._
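
The difference between label and position is easiest to see after filtering, when the two no longer line up. This is a minimal sketch on made-up data (the `quakes` and `strong` names are hypothetical):

```python
import pandas as pd

quakes = pd.DataFrame({"mag": [1.35, 5.2, 3.42, 6.1]})

# Filtering keeps the original index labels (here 1 and 3).
strong = quakes[quakes["mag"] >= 5.0]

# loc looks rows up by index label.
by_label = strong.loc[3, "mag"]

# iloc looks rows up by position within the filtered frame.
by_position = strong.iloc[0]["mag"]

print(by_label, by_position)
```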

### Summary Statistics

One main use of large datasets is calculating statistical values, such as averages and spread. Methods to compute these values are built directly into the `DataFrame` and `Series` classes. They can be accessed by calling the appropriate methods. Doing so on a `DataFrame` returns a `Series` with one value per column. 

In [46]:
print("Count non-empty:")
print(df.count(), end="\n----------------------------------\n\n")
print("Mean:")
print(df.mean(), end="\n----------------------------------\n\n")
print("Standard Deviation:")
print(df.std(), end="\n----------------------------------\n\n")

Count non-empty:
alert                   59
cdi                    329
code                  9332
detail                9332
dmin                  6139
felt                   329
gap                   6164
ids                   9332
mag                   9331
magType               9331
mmi                     93
net                   9332
nst                   5364
place                 9332
rms                   9332
sig                   9332
sources               9332
status                9332
time                  9332
title                 9332
tsunami               9332
type                  9332
types                 9332
tz                    9331
updated               9332
url                   9332
parsed_place          9332
is_in_ring_of_fire    9332
dtype: int64
----------------------------------

Mean:
cdi                   2.754711e+00
dmin                  5.449253e-01
felt                  1.231003e+01
gap                   1.215066e+02
mag                   1.497345e+

  print(df.mean(), end="\n----------------------------------\n\n")
  print(df.std(), end="\n----------------------------------\n\n")


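The lines above that repeat the `print` calls are the tail end of warnings from Pandas. Calling `mean()` or `std()` on a dataframe that contains non-numeric columns is deprecated: the version used here silently drops those columns and warns, while Pandas 2.0 and later raise an error unless you pass `numeric_only=True` or select only the numeric columns first. The following table shows some of the more common methods that you might use on a table or series to gather some high-level information about the data. 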
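
Two ways of restricting these calls to numeric columns are sketched below on a small, made-up table (the `quakes` name and its values are assumptions). Either one should avoid the warning shown above:

```python
import pandas as pd

quakes = pd.DataFrame(
    {
        "mag": [1.0, 2.0, 3.0],
        "sig": [10, 20, 30],
        "magType": ["ml", "md", "ml"],
    }
)

# Option 1: ask the reduction to skip non-numeric columns.
means = quakes.mean(numeric_only=True)

# Option 2: select the numeric columns first, then reduce.
stds = quakes.select_dtypes(include="number").std()

print(means)
print(stds)
```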

| Method | Description | Data types |
| - | - | - |
| `count()` | The number of non-null observations | Any |
| `nunique()` | The number of unique values | Any |
| `sum()` | The total of the values | Numerical or Boolean | 
| `mean()` | The average of the values | Numerical or Boolean | 
| `median()` | The median of the values | Numerical | 
| `min()` | The minimum of the values | Numerical | 
| `idxmin()` | The index where the minimum value occurs | Numerical | 
| `max()` | The maximum of the values | Numerical | 
| `idxmax()` | The index where the maximum value occurs | Numerical | 
| `abs()` | The absolute value of the values | Numerical | 
| `std()` | The standard deviation | Numerical | 
| `var()` | The variance | Numerical | 
| `cov()` | The covariance between two `Series`, or a covariance matrix for all column combinations in a `DataFrame` | Numerical |
| `corr()` | The correlation between two `Series`, or a correlation matrix for all column combinations in a `DataFrame` | Numerical | 
| `quantile()` | Gets a specific quantile | Numerical | 
| `cumsum()` | The cumulative sum | Numerical or Boolean | 
| `cummin()` | The cumulative minimum | Numerical | 
| `cummax()` | The cumulative maximum | Numerical |
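
A few of these methods are tried out below on a short, made-up series of magnitudes (the values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Made-up magnitudes with the default 0..4 index.
mags = pd.Series([1.3, 0.7, 5.2, 1.3, 2.5], name="mag")

print(mags.count())        # non-null observations
print(mags.nunique())      # distinct values
print(mags.median())
print(mags.idxmax())       # index label of the largest value
print(mags.quantile(0.5))  # the 0.5 quantile is the median
print(mags.cummax().tolist())
```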

A handful of these values can be calculated and displayed all at once with the `describe` method.

In [47]:
df.describe()

Unnamed: 0,cdi,dmin,felt,gap,mag,mmi,nst,rms,sig,time,tsunami,tz,updated
count,329.0,6139.0,329.0,6164.0,9331.0,93.0,5364.0,9332.0,9332.0,9332.0,9332.0,9331.0,9332.0
mean,2.754711,0.544925,12.31003,121.506588,1.497345,3.651398,19.053878,0.362122,56.899914,1538284000000.0,0.006537,-451.99014,1538537000000.0
std,1.010637,2.214305,48.954944,72.962363,1.203347,1.790523,15.492315,0.317784,91.872163,608030600.0,0.080589,231.752571,656413500.0
min,0.0,0.000648,0.0,12.0,-1.26,0.0,0.0,0.0,0.0,1537229000000.0,0.0,-720.0,1537230000000.0
25%,2.0,0.020425,1.0,66.1425,0.72,2.68,8.0,0.119675,8.0,1537793000000.0,0.0,-540.0,1537996000000.0
50%,2.7,0.05905,2.0,105.0,1.3,3.72,15.0,0.21,26.0,1538245000000.0,0.0,-480.0,1538621000000.0
75%,3.3,0.17725,5.0,159.0,1.9,4.57,25.0,0.59,56.0,1538766000000.0,0.0,-480.0,1539110000000.0
max,8.4,53.737,580.0,355.91,7.5,9.12,172.0,1.91,2015.0,1539475000000.0,1.0,720.0,1539537000000.0


This provides some of the most common values that are used in statistical analysis.
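
Because the result of `describe` is itself a dataframe, individual statistics can be pulled out of it with `loc`. The sketch below assumes a small, made-up table named `quakes`:

```python
import pandas as pd

quakes = pd.DataFrame({"mag": [1.0, 2.0, 3.0, 4.0], "sig": [10, 20, 30, 40]})

summary = quakes.describe()

# Rows are statistics, columns are the original column names.
median_mag = summary.loc["50%", "mag"]
max_sig = summary.loc["max", "sig"]

print(median_mag, max_sig)
```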