In [None]:
import pandas as pd

# Using Pandas DataFrames

This notebook focuses on DataFrames, the primary data structure that Pandas adds to Python. We will discuss the various parts of a Pandas DataFrame and how to create, manipulate, and edit one.

For this section, we are going to be using the data located at  
> https://raw.githubusercontent.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/master/ch_02/data/parsed.csv

This dataset will be used for all of the exercises.


## What is a DataFrame

The most widely understood analogy for a Pandas DataFrame is a spreadsheet. In a spreadsheet (be it Excel, Google Sheets, or whatever version you prefer), you have rows, columns, and entries. In fact, Pandas uses this same vocabulary when referring to the various pieces of a DataFrame.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/master/ch_02/data/parsed.csv")
df

The example dataframe we will use throughout this file has 27 columns and 9332 rows. We can also get information about what data type each column contains. For example:

In [None]:
df.dtypes

In this dataframe, we have integers, floats, and strings (here labeled as `object`). We can extract any one of these columns by referencing the column's name as an attribute of the dataframe.

In [None]:
df.place

Alternatively, we can reference the column by using a key, similar to Python dictionaries. 

In [None]:
df["place"]

This single column is no longer a `DataFrame`, but is instead an instance of the `Series` class.

In [None]:
print(type(df))
print(type(df.place))

A `DataFrame` is essentially a collection of `Series` objects that share an index. All values in a `Series` have the same data type, while a `DataFrame` can combine `Series` of whatever types you wish.
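As a minimal sketch of this relationship, the following builds a small table from two `Series` of different types. The values and the name `small_df` are made up for illustration and are not taken from the earthquake dataset.

```python
import pandas as pd

# Hypothetical sample data, not taken from the earthquake dataset
mag = pd.Series([1.3, 4.7, 2.1], name="mag")                   # floats
place = pd.Series(["Alaska", "Japan", "Chile"], name="place")  # strings

# Combine the Series column-wise; they share the default index 0, 1, 2
small_df = pd.concat([mag, place], axis=1)

print(small_df.dtypes)         # each column keeps its own data type
print(type(small_df["mag"]))   # pulling a column back out gives a Series
```

Each `Series` name becomes a column name, and each column keeps the data type of the `Series` it came from.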

An additional object that makes up both the `Series` and the `DataFrame` is the `Index`. Notice that in every one of the above outputs, you can see the numbers 0-9331. This list of numbers is the `Index` for each `Series`. Each value in the `Index` is the label of the corresponding row or value. For `Series` objects, we can reference a particular value by its index.

In [None]:
# Look up the value with index 3 in the place column
df.place[3]

However, if we try to do the same thing for a `DataFrame`, we get a `KeyError`.

In [None]:
df[3]

This is because the above syntax tries to look up a column named `3`, which does not exist. To reference an entire row of a dataframe by its position, use `iloc` as follows.

In [None]:
df.iloc[3]

The column names can be retrieved with the `columns` attribute:

In [None]:
df.columns

If we need to reference multiple columns at once, we can pass a list of column names inside the square brackets. This returns a new dataframe containing just that subset of columns.

In [None]:
df[["time","mag", "magType", "place", "parsed_place"]]

## Using DataFrames

Python has a reputation for being slow. This is because
1. it is not a compiled language like C or C++, and
2. it carries a lot of runtime overhead that is not present in a lower-level language.

For most scripting use cases, neither of these poses a problem. However, if you are doing high-volume numerical computation, they will noticeably slow down your workflow. To address this, the Python community created the NumPy package. NumPy stores numbers in compact, fixed-type arrays rather than as full Python objects, offers matrix- and vector-like functionality (element-wise addition, vector products, matrix multiplication, etc.), and provides many other functions that enable faster numerical computation. Pandas builds on that base to bring many of the same speed and functionality benefits to dataframes. To this end, we can operate on entire columns, build new columns from the values of existing ones, filter rows based on the values in a column, and so on.
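To make "operating on entire columns" concrete, here is a small sketch on a made-up table. The frame `quakes` and its column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only
quakes = pd.DataFrame({"mag": [1.5, 4.0, 6.2], "depth_km": [10.0, 35.0, 70.0]})

# Element-wise arithmetic on a whole column, no explicit loop
quakes["depth_m"] = quakes["depth_km"] * 1000

# NumPy functions also apply element-wise to a Series
quakes["log_depth"] = np.log10(quakes["depth_km"])

# Comparisons are vectorized too and return a boolean Series
quakes["is_strong"] = quakes["mag"] >= 4.0

print(quakes)
```

Each of these lines processes every row at once, which is generally much faster on large tables than looping over the rows in plain Python.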

### Adding new Columns
Let's first build a new column. I want to determine if an earthquake occurred on the Ring of Fire. The locations that make up the Ring of Fire are saved in the following list (including a mix of country and US state names):

In [None]:
ring_of_fire = [ 
    "Bolivia", 
    "Chile", 
    "Ecuador", 
    "Peru", 
    "Costa Rica", 
    "Guatemala", 
    "Mexico", 
    "Japan", 
    "Philippines", 
    "Indonesia", 
    "New Zealand", 
    "Antarctic", 
    "Canada", 
    "Fiji", 
    "Alaska", 
    "Washington", 
    "California", 
    "Russia", 
    "Taiwan", 
    "Tonga", 
    "Kermadec Islands"
]

Taking a look at the column `parsed_place`, we note that these names best match the values in that column.

In [None]:
df.parsed_place.unique()

One way of determining if a value is in a given list is by using the `value in list` syntax. Using this, we are essentially asking if a particular value exists within that list. This would look like the following:

In [None]:
"North Carolina" in df.parsed_place.unique()

This tells us that at least one row has the value `'North Carolina'` in the `parsed_place` column of our dataset. What we need, however, is a true/false value for every row, based on that row's `parsed_place`. For this we can use a feature called a list comprehension. This one-line construct builds a list concisely, and we can later convert that list into a column.

In [None]:
ring_of_fire_column_list = [location in ring_of_fire for location in df.parsed_place]
print("length:", len(ring_of_fire_column_list))
print("unique:", set(ring_of_fire_column_list))

Perfect, now we have a list of true/false values with the same number of values as the number of rows in our dataset. Converting this into a series is done simply by instantiating the `Series` class from the Pandas library.

In [None]:
ring_of_fire_column = pd.Series(ring_of_fire_column_list, name="is_in_ring_of_fire")
ring_of_fire_column

Pandas is a very versatile library and is able to adapt its functionality based on its inputs. Here we were able to convert our list into a `Series`. We can add our series to the original dataframe by using the `join` method.

In [None]:
df_join = df.join(ring_of_fire_column)
df_join[["parsed_place","is_in_ring_of_fire"]]

In [None]:
df_join.columns

We can now see that the column has been added to the dataframe. 

_Note that the dataframe has to be saved again after the join. Most dataframe methods, including `join`, return a new dataframe rather than modifying the original. Therefore, you need to assign the result to a variable after making a change like dropping or adding columns or filtering rows._
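A small sketch of why the result has to be saved: `join` hands back a new dataframe and leaves the one it was called on untouched. The two-row table and the names `original`, `flag`, and `joined` are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row table
original = pd.DataFrame({"parsed_place": ["Japan", "Nevada"]})
flag = pd.Series([True, False], name="is_in_ring_of_fire")

joined = original.join(flag)   # returns a new DataFrame

print(original.columns.tolist())   # still only the original column
print(joined.columns.tolist())     # the original column plus the new one
```

If the return value of `join` is not assigned to a name, the new column is simply lost.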

An alternate (and slightly simpler) way of adding the column is by assigning the list directly to a new key. 

In [None]:
df['is_in_ring_of_fire'] = ring_of_fire_column_list
df[["parsed_place", "is_in_ring_of_fire"]]
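As an alternative to the list comprehension, Pandas provides the `Series.isin` method, which performs the same membership test directly on a column. The sketch below uses a shortened, made-up list and table (`ring_of_fire_sample`, `places`) rather than the real data.

```python
import pandas as pd

# Shortened, hypothetical stand-ins for the list and column used above
ring_of_fire_sample = ["Japan", "Chile", "Alaska"]
places = pd.DataFrame({"parsed_place": ["Japan", "Nevada", "Chile", "Hawaii"]})

# The list comprehension from this section
from_comprehension = [p in ring_of_fire_sample for p in places.parsed_place]

# The same test with isin, which returns a boolean Series
from_isin = places.parsed_place.isin(ring_of_fire_sample)

places["is_in_ring_of_fire"] = from_isin
print(places)
```

Both approaches produce the same true/false values; `isin` skips the intermediate Python list.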

### Filtering rows
Filtering rows works by creating a series or list of boolean values and passing it inside the square brackets. The boolean values can come from an existing boolean column, such as the `is_in_ring_of_fire` column we created in the last section. Alternatively, you can filter on a condition applied to a column's values, such as whether the magnitude is higher than some threshold. Let's explore both of these options below.

In [None]:
df[df.is_in_ring_of_fire][["parsed_place","is_in_ring_of_fire"]]

Notice that the number of rows is smaller by ~2000. To make this even more dramatic, let's show all the rows that are _not_ in the ring of fire. 

In [None]:
df[df.is_in_ring_of_fire == False][["parsed_place","is_in_ring_of_fire"]]

Now we have just over 2000 rows, which lines up roughly with what we noticed before. Notice that we could have used the same syntax for the true case, specifically `df.is_in_ring_of_fire == True`. Also notice that the index of each row does not change. The index still points back to the row number in the original dataframe. In other words, the index is tied to the data and is not simply a counter. There are ways of reassigning the index, but that is not something I want to explore for this class.

Note that we used the equality operator (`==`) in this second filter. This suggests that we can use other comparison operators as well. More precisely, we can use any expression that returns an array of boolean values. To clarify what I mean, let's look at the following.

In [None]:
df.is_in_ring_of_fire == True

The above value is a Pandas `Series`. Coincidentally, it contains the same set of values as the column that we created. This suggests that any list of boolean values we create can be used to filter. We could use the following to get the same filter without creating a new column.

In [None]:
df[[location in ring_of_fire for location in df.parsed_place]][["parsed_place","is_in_ring_of_fire"]]

We get the same result in fewer lines of code and without storing an extra column, which can be helpful when we don't need to add more data for Python to manage.
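The filtering ideas so far can be sketched on a small made-up table: a boolean column, its negation, and a plain Python list all work as masks. The frame `demo` and its values are hypothetical. The `~` operator shown here is an alternative to comparing against `False`.

```python
import pandas as pd

# Hypothetical data; the index is deliberately left as the default 0..3
demo = pd.DataFrame({
    "parsed_place": ["Japan", "Nevada", "Chile", "Hawaii"],
    "is_in_ring_of_fire": [True, False, True, False],
})

inside = demo[demo.is_in_ring_of_fire]                # boolean column as mask
outside = demo[~demo.is_in_ring_of_fire]              # ~ flips True and False
outside_eq = demo[demo.is_in_ring_of_fire == False]   # same rows as above
from_list = demo[[True, False, True, False]]          # a plain list also works

print(inside.index.tolist())    # original row labels are kept
print(outside.index.tolist())
```

In each case the filtered rows keep the index labels they had in `demo`.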

Another way we can filter our data is through numerical comparisons. Note the following series:

In [None]:
df.mag >= 2.0

We can use something like this to find just the earthquakes that are above a certain threshold. 

In [None]:
df[df.mag >= 5.0][["parsed_place","is_in_ring_of_fire"]]

By this point, we can start combining all sorts of conditions on the dataset to zero in on the specific rows that you need for your analysis. Let's find all the earthquakes that hit Indonesia that were also coupled with a tsunami. The first part of this filter is easy. Simply find all the rows where `df.parsed_place == "Indonesia"`, very similar to filters we have already performed. For the second part, let's first take a look at the values present in `df.tsunami`. 

In [None]:
df.tsunami.unique()

Out of all 9000+ rows, only two values exist: zero and one. Sometimes, a boolean value is stored as an integer. If this is the case, the standard translation is `0 == False` and `1 == True`. We could have Python convert the values of this column to an actual boolean type, but that would be an unnecessary extra step. We can simply use `df.tsunami == 1` to find all the rows where a tsunami was also triggered by the earthquake.
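For reference, the explicit conversion mentioned above can be sketched with `astype(bool)`. The five values below are made up and only mimic the 0/1 encoding of the `tsunami` column.

```python
import pandas as pd

# Hypothetical 0/1 flags like those in the tsunami column
tsunami = pd.Series([0, 1, 0, 1, 1], name="tsunami")

as_bool = tsunami.astype(bool)   # explicit conversion to True/False
by_compare = tsunami == 1        # comparison gives the same mask

print(as_bool.equals(by_compare))
print(tsunami.sum())             # summing 0/1 counts the flagged rows
```

Since both routes give the same boolean mask, comparing against `1` directly is enough for filtering.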

To use both filters together, there are a couple of different approaches. The simplest is to use the `loc` indexer. It allows us to select a subset of columns and to combine the filters using the element-wise and/or operators (`&` and `|`). For this problem, you could execute the following:

In [None]:
df.loc[ 
    (df.parsed_place == "Indonesia") & (df.tsunami == 1),
    ["parsed_place", "is_in_ring_of_fire"]
]

_Note: `loc` can also be used to select specific rows if their index labels are known. This works based on the index of the row, not its position in the dataframe. Read [this StackOverflow question](https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different) for a more detailed explanation, along with a comparison with the related `iloc`._
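The following sketch, on a made-up four-row table named `demo`, shows combined conditions and the difference between label-based `loc` and position-based `iloc` once rows have been filtered out.

```python
import pandas as pd

# Hypothetical data
demo = pd.DataFrame({
    "parsed_place": ["Indonesia", "Japan", "Indonesia", "Chile"],
    "tsunami": [1, 0, 0, 1],
    "mag": [6.1, 4.2, 5.0, 7.3],
})

# Each condition needs its own parentheses because & and | bind tighter than ==
both = demo.loc[(demo.parsed_place == "Indonesia") & (demo.tsunami == 1)]
either = demo.loc[(demo.parsed_place == "Indonesia") | (demo.tsunami == 1)]

# After filtering, labels and positions no longer line up
strong = demo[demo.mag >= 5.0]   # keeps index labels 0, 2, 3
by_label = strong.loc[2]         # the row whose index label is 2
by_position = strong.iloc[2]     # the third row, whose label is 3

print(by_label["mag"], by_position["mag"])
```

On an unfiltered dataframe with the default index the two indexers return the same row, which is why the difference is easy to miss.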

### Summary Statistics

One main reason for working with large datasets is to calculate statistical values, such as averages and spread. Methods to compute these values are built directly into the `DataFrame` and `Series` classes. They can be accessed by calling the appropriate methods. Doing so on a `DataFrame` returns a series of the computed values, one per column.

In [None]:
print("Count non-empty:")
print(df.count(), end="\n----------------------------------\n\n")
print("Mean:")
# numeric_only=True skips the string columns, which have no mean
print(df.mean(numeric_only=True), end="\n----------------------------------\n\n")
print("Standard Deviation:")
print(df.std(numeric_only=True), end="\n----------------------------------\n\n")

The `numeric_only=True` argument restricts `mean` and `std` to the numeric and boolean columns. Without it, older versions of Pandas dropped the string columns and emitted a warning that this behavior was deprecated, and Pandas 2.0 and later raise an error instead. The following table shows some of the more common methods that you might use on a table or series to gather some high-level information about the data.

| Method | Description | Data types |
| - | - | - |
| `count()` | The number of non-null observations | Any |
| `nunique()` | The number of unique values | Any |
| `sum()` | The total of the values | Numerical or Boolean |
| `mean()` | The average of the values | Numerical or Boolean |
| `median()` | The median of the values | Numerical |
| `min()` | The minimum of the values | Numerical |
| `idxmin()` | The index where the minimum value occurs | Numerical |
| `max()` | The maximum of the values | Numerical |
| `idxmax()` | The index where the maximum value occurs | Numerical |
| `abs()` | The absolute value of the values | Numerical |
| `std()` | The standard deviation | Numerical |
| `var()` | The variance | Numerical |
| `cov()` | The covariance between two `Series`, or a covariance matrix for all column combinations in a `DataFrame` | Numerical |
| `corr()` | The correlation between two `Series`, or a correlation matrix for all column combinations in a `DataFrame` | Numerical |
| `quantile()` | Gets a specific quantile | Numerical |
| `cumsum()` | The cumulative sum | Numerical or Boolean | 
| `cummin()` | The cumulative minimum | Numerical | 
| `cummax()` | The cumulative maximum | Numerical |
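A few of these methods applied to a small made-up `Series` (the values are hypothetical and include one missing entry) might look like this:

```python
import pandas as pd

# Hypothetical magnitudes; None becomes a missing value (NaN)
mag = pd.Series([1.0, 2.0, 2.0, 5.0, None], name="mag")

print(mag.count())             # non-null observations
print(mag.nunique())           # distinct non-null values
print(mag.mean())
print(mag.median())
print(mag.idxmax())            # index label of the largest value
print(mag.quantile(0.5))       # the 0.5 quantile equals the median
print(mag.cumsum().tolist())   # running total; the missing entry stays NaN
```

By default these methods skip missing values, which is why `count()` reports four observations for a five-element series.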

A handful of these values can be calculated and displayed all at once with the `describe` method.

In [None]:
df.describe()

This provides some of the most common values that are used in statistical analysis.
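As a small sketch of what `describe` returns, the made-up table below mixes numeric and string columns. The frame `demo` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical mixed-type table
demo = pd.DataFrame({
    "mag": [1.0, 2.0, 3.0, 4.0],
    "tsunami": [0, 0, 1, 0],
    "place": ["Alaska", "Japan", "Chile", "Peru"],
})

summary = demo.describe()        # numeric columns only by default
print(summary.index.tolist())    # the statistics, one per row
print(summary.loc["mean", "mag"])

# Non-numeric columns get their own summary (count, unique, top, freq)
print(demo["place"].describe())
```

The result of `describe` is itself a dataframe, so individual statistics can be pulled out with `loc` as shown.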