# Pandas Fundamentals


![](pandas.png)
## Use the Pandas library to do statistics on tabular data.

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of tabular data, i.e. data that can go into a table.

### In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.
 
Pandas DataFrames are 2-dimensional tables whose columns have names and potentially have different data types.

Pandas DataFrames are pretty much like Excel spreadsheets! **And Excel spreadsheets are pretty much like matrices** - make Nick draw this on the board and talk through some examples.


### Installing `pandas` with (Ana)conda (if needed...)
To get `pandas` (you only need to do this the first time we go through this):
1. Go to the little `+` on the left to open a `launcher` window.
1. Click the 'terminal' tile to open a terminal.
1. Type `conda env list` and hit enter. Make sure that there is a `*` next to the line that says `swbc`.
1. If that's right, type `conda install pandas`.
1. When it asks `Proceed ([y]/n)?`, type `y` and hit enter.

## Credit:

This comes from Abernathey's open book, which we will be looking at a lot! https://earth-env-data-science.github.io/lectures/core_python/python_fundamentals.html


## Pandas Data Structures: Series

We've seen several data structures so far: `lists`, `dictionaries`, `arrays`. The Pandas library provides several data structures which are **super** useful.

The `Series` data structure represents a one-dimensional array of data. The main difference between a `Series` and numpy `array` is that a Series has an _index_. The index contains the labels that we use to access the data.

There are many ways to [create a Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series). We will just show a few.
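One way that might look is sketched here: pass a list of values plus a list of labels for the index. The names and ages are made-up example data, picked so that the `ages` Series used in the indexing section has a `Johnny` entry.

```python
import pandas as pd

# Hypothetical example data: three names and their ages in years
names = ["Ryan", "Chiara", "Johnny"]
values = [35, 36, 1.8]

# The values are the data, the index holds the labels
ages = pd.Series(values, index=names)
print(ages)
```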

If you think back to the `dictionary` you will see some similarities, namely the labels (`keys` for a `dict`, `index` for a pandas `Series`).
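To make the comparison concrete, here is a small sketch that builds a Series straight from a dictionary. The dictionary contents are invented for illustration; the keys simply become the index.

```python
import pandas as pd

# Hypothetical dictionary: name -> age in years
ages_dict = {"Ryan": 35, "Chiara": 36, "Johnny": 1.8}

# The dict keys become the Series index
ages = pd.Series(ages_dict)

# Looking up by key (dict) and by label (Series) reads the same way
print(ages_dict["Johnny"], ages["Johnny"])
```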

Series have built-in plotting methods that let you quickly make some default plots.

These default plots are configurable in many ways:
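As a rough sketch (it assumes `matplotlib` is installed, and the title, color, and axis label are arbitrary choices), a bar plot with a few options set could look like this. The `Agg` backend line is only there so the code runs outside a notebook; you can drop it in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, not needed inside a notebook
import pandas as pd

# Hypothetical example data
ages = pd.Series([35, 36, 1.8], index=["Ryan", "Chiara", "Johnny"])

# kind, title and color are a few of the options .plot() accepts
ax = ages.plot(kind="bar", title="Ages", color="tab:green")
ax.set_ylabel("age (years)")
```

Calling `ages.plot(kind="bar")` with no other arguments gives the plain default version.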

Arithmetic operations and most numpy functions can be applied to Series.
An important point is that the Series keeps its index during such operations.
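A quick sketch of that idea, using made-up ages: multiply the Series by a number, apply a numpy function, and check that the labels are still attached.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
ages = pd.Series([35, 36, 1.8], index=["Ryan", "Chiara", "Johnny"])

# Plain arithmetic works element by element
ages_in_months = ages * 12

# numpy functions work too, and return a Series with the same index
log_ages = np.log(ages)

print(ages_in_months)
print(log_ages)
```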

We can access the underlying `index` object if we need to through the `.index` attribute of a Series:
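For instance, with the same invented `ages` data as a stand-in:

```python
import pandas as pd

# Hypothetical example data
ages = pd.Series([35, 36, 1.8], index=["Ryan", "Chiara", "Johnny"])

# .index is an attribute (no parentheses) holding the labels
idx = ages.index
print(idx)
print(list(idx))  # the labels as a plain Python list
```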

### Indexing

We talked about indexing (or grabbing data from a position) before: for example we looked at the first patient data with something like `patient_0 = data[0,:]`. There we used a number to get the data position. 

With pandas we can use the index to grab data. In this case the index is a bunch of names, so we can ask for the data from `Johnny` for example using the `.loc` attribute:
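A minimal sketch of label-based lookup, assuming an `ages` Series like the made-up one here:

```python
import pandas as pd

# Hypothetical example data
ages = pd.Series([35, 36, 1.8], index=["Ryan", "Chiara", "Johnny"])

# .loc uses square brackets and looks up by index label
johnny_age = ages.loc["Johnny"]
print(johnny_age)
```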

Or we can still use the numeric position with `.iloc`.
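The position-based version, again on invented data, counts from zero just like lists and arrays:

```python
import pandas as pd

# Hypothetical example data
ages = pd.Series([35, 36, 1.8], index=["Ryan", "Chiara", "Johnny"])

# .iloc looks up by integer position; position 2 is the third entry
third_age = ages.iloc[2]
print(third_age)
```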

You can maybe already see part of why pandas is so great. Which is easier to understand?
```python
ages.loc['Johnny']
```
or 
```python
ages.iloc[2]
```

To me, being able to ask for the data using a label like `Johnny` makes the code much easier to understand, and makes my analysis clearer in my head.

We can pass a list or array to `.loc` to get multiple rows back:
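Something like the following, where the data and the chosen labels are just for illustration. Note the double brackets: the outer pair belongs to `.loc`, the inner pair is the list.

```python
import pandas as pd

# Hypothetical example data
ages = pd.Series([35, 36, 1.8], index=["Ryan", "Chiara", "Johnny"])

# A list of labels returns a smaller Series, in the order we asked for
subset = ages.loc[["Ryan", "Johnny"]]
print(subset)
```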

And we can even use slice notation.
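Here is a small sketch of slicing on made-up data. One thing to watch for: a label slice with `.loc` includes the end label, while a position slice with `.iloc` stops before the end position, like regular Python slicing.

```python
import pandas as pd

# Hypothetical example data
ages = pd.Series([35, 36, 1.8], index=["Ryan", "Chiara", "Johnny"])

# Label slice: both "Ryan" and "Chiara" are included
by_label = ages.loc["Ryan":"Chiara"]

# Position slice: positions 0 and 1, position 2 is excluded
by_position = ages.iloc[0:2]

print(by_label)
print(by_position)
```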

If we need to, we can always get the raw data back out as well.
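A sketch of two ways to do that, using the same invented data. Both hand back a plain numpy array with the labels stripped off.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
ages = pd.Series([35, 36, 1.8], index=["Ryan", "Chiara", "Johnny"])

# .values is an attribute; .to_numpy() is the equivalent method
raw = ages.values
raw_again = ages.to_numpy()

print(type(raw), raw)
```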

## Pandas Data Structures: DataFrame

There is a lot more to Series, but they are limited to a single "column". A more useful Pandas data structure is the `DataFrame`. A `DataFrame` is basically a **bunch of series that share the same index**. It's a lot like a table in a spreadsheet.

Below we create a DataFrame.
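One possible version is sketched here, built from a dictionary of columns plus a shared index. All the numbers are made up, with heights assumed to be in cm and weights in kg, and one weight left missing on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: each key becomes a column
data = {
    "age": [35, 36, 1.8],
    "height": [180, 155, 83],       # assumed to be in cm
    "weight": [72.5, np.nan, 11.3]  # assumed to be in kg; one value is missing
}

# Every column shares the same index of names
df = pd.DataFrame(data, index=["Ryan", "Chiara", "Johnny"])
print(df)
```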

Pandas handles missing data gracefully: missing values are stored as `NaN`, they carry through element-wise arithmetic, and summary methods such as `.mean()` skip them by default.
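To see what that means in practice, here is a short sketch with one missing weight (the data is invented):

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing weight
weight = pd.Series([72.5, np.nan, 11.3], index=["Ryan", "Chiara", "Johnny"])

n_missing = weight.isna().sum()  # count the missing entries
mean_weight = weight.mean()      # the NaN is skipped by default
doubled = weight * 2             # the NaN stays NaN

print(n_missing, mean_weight)
print(doubled)
```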

We can get some basic information about our `DataFrame` by using its `.info()` method:
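For example, on a small made-up DataFrame, `.info()` prints the index, the column names, how many non-missing values each column has, and each column's data type:

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame(
    {"age": [35, 36, 1.8], "height": [180, 155, 83], "weight": [72.5, np.nan, 11.3]},
    index=["Ryan", "Chiara", "Johnny"],
)

# Prints a summary; it does not return anything
df.info()

# The same non-missing counts, as a Series we can keep working with
non_null = df.count()
```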

A wide range of statistical functions are available on both Series and DataFrames.
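A few of them are sketched here on invented data. By default each one works column by column.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame(
    {"age": [35, 36, 1.8], "height": [180, 155, 83], "weight": [72.5, np.nan, 11.3]},
    index=["Ryan", "Chiara", "Johnny"],
)

col_min = df.min()       # smallest value in each column
col_mean = df.mean()     # mean of each column
col_std = df.std()       # standard deviation of each column
summary = df.describe()  # several statistics at once

print(summary)
```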

We can get a single column as a Series using Python's getitem (square-bracket) syntax on the DataFrame object.
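With a made-up DataFrame, that looks a lot like a dictionary lookup:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame(
    {"age": [35, 36, 1.8], "height": [180, 155, 83]},
    index=["Ryan", "Chiara", "Johnny"],
)

# Square brackets with a column name give back that column as a Series
heights = df["height"]
print(heights)
```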

...or using attribute syntax.
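The dot version is sketched here on the same kind of invented data. It only works when the column name is a valid Python name and does not clash with an existing DataFrame method, so the bracket form is the safer habit.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame(
    {"age": [35, 36, 1.8], "height": [180, 155, 83]},
    index=["Ryan", "Chiara", "Johnny"],
)

# Same column, reached with a dot instead of brackets
heights = df.height
print(heights)
```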

Indexing works very similarly to Series.
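A small sketch on invented data: `.loc` and `.iloc` now pick out a whole row, which comes back as a Series labeled by the column names.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame(
    {"age": [35, 36, 1.8], "height": [180, 155, 83], "weight": [72.5, np.nan, 11.3]},
    index=["Ryan", "Chiara", "Johnny"],
)

row_by_label = df.loc["Johnny"]  # row with the label "Johnny"
row_by_position = df.iloc[2]     # third row, counting from zero

print(row_by_label)
```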

But we can also specify the column we want to access.
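The pattern is `row, column` inside the brackets, much like `data[0,:]` with arrays. A sketch with made-up data:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame(
    {"age": [35, 36, 1.8], "height": [180, 155, 83]},
    index=["Ryan", "Chiara", "Johnny"],
)

one_value = df.loc["Johnny", "age"]        # a single cell
some_heights = df.loc[:"Chiara", "height"]  # rows up to and including "Chiara"
first_two_ages = df.iloc[:2, 0]             # first two rows of the first column

print(one_value)
print(some_heights)
```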

If we make a calculation using columns from the DataFrame, it will keep the same index:
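As an illustration, here is a body mass index calculation on invented data, assuming height is in cm and weight is in kg. The choice of BMI is just an example calculation, not something the text above prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: height in cm, weight in kg
df = pd.DataFrame(
    {"height": [180, 155, 83], "weight": [72.5, np.nan, 11.3]},
    index=["Ryan", "Chiara", "Johnny"],
)

# BMI = weight (kg) / height (m) squared, computed for every row at once
bmi = df["weight"] / (df["height"] / 100) ** 2
print(bmi)
```

The missing weight for `Chiara` shows up as a missing BMI rather than causing an error.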

Which we can easily add as another column to the DataFrame:
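Adding a column looks like adding a key to a dictionary. In this sketch the data and the `BMI` column name are made up, with height assumed in cm and weight in kg.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: height in cm, weight in kg
df = pd.DataFrame(
    {"height": [180, 155, 83], "weight": [72.5, np.nan, 11.3]},
    index=["Ryan", "Chiara", "Johnny"],
)

bmi = df["weight"] / (df["height"] / 100) ** 2

# Assigning to a new column name creates the column, matched up by index
df["BMI"] = bmi
print(df)
```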

## Merging Data

Pandas supports a wide range of methods for merging different datasets. These are described extensively in the [documentation](https://pandas.pydata.org/pandas-docs/stable/merging.html). Here we just give a few examples.

We can add the data from the Series `education` to our DataFrame `df` using the `.join()` method, which matches rows by index label and adds the new Series as a column:
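Here is a sketch of what that could look like. The contents of `education` and `df` are invented, including an extra person (`Takaya`) who is not in `df`. The Series needs a `name`, because that becomes the new column name.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame(
    {"age": [35, 36, 1.8], "height": [180, 155, 83]},
    index=["Ryan", "Chiara", "Johnny"],
)
education = pd.Series(
    ["PhD", "PhD", None, "masters"],
    index=["Ryan", "Chiara", "Johnny", "Takaya"],
    name="education",
)

# Default join keeps the rows of df, so "Takaya" is dropped
df_joined = df.join(education)

# how="right" keeps the rows of education instead, so "Takaya" appears
df_right = df.join(education, how="right")

print(df_joined)
print(df_right)
```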

We can also index using a boolean Series. This is very useful.
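A minimal sketch, with made-up data and an arbitrary cutoff of 18 years: a comparison gives a Series of `True`/`False`, and passing that back into the brackets keeps only the `True` rows.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame(
    {"age": [35, 36, 1.8], "height": [180, 155, 83]},
    index=["Ryan", "Chiara", "Johnny"],
)

# The comparison produces one True/False per row
is_adult = df["age"] > 18

# Only the rows where is_adult is True are kept
adults = df[is_adult]
print(adults)
```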

### Modifying Values

We often want to modify values in a DataFrame based on some rule. To modify values, we need to use `.loc` or `.iloc`.
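Three variations are sketched here on invented data: by label, by position, and by a boolean rule. The rule itself (record any age under 2 as 2) is hypothetical and only there to show the pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame(
    {"age": [35, 36, 1.8], "height": [180, 155, 83], "weight": [72.5, np.nan, 11.3]},
    index=["Ryan", "Chiara", "Johnny"],
)

# By label: row "Johnny", column "height"
df.loc["Johnny", "height"] = 85

# By position: third row, third column (weight)
df.iloc[2, 2] = 11.5

# By rule: every row where age is under 2 gets age set to 2
df.loc[df["age"] < 2, "age"] = 2

print(df)
```

Assigning through chained brackets such as `df["height"]["Johnny"] = 85` can end up changing a temporary copy instead of `df`, which is why `.loc` and `.iloc` are the forms to reach for.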

## Plotting

DataFrames have all kinds of [useful plotting](https://pandas.pydata.org/pandas-docs/stable/visualization.html) built in.
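As one example (assuming `matplotlib` is installed, with made-up data and an arbitrary title), a scatter plot of one column against another might look like this. The `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, not needed inside a notebook
import pandas as pd

# Hypothetical example data
df = pd.DataFrame(
    {"age": [35, 36, 1.8], "height": [180, 155, 83]},
    index=["Ryan", "Chiara", "Johnny"],
)

# x and y are column names; kind picks the type of plot
ax = df.plot(kind="scatter", x="age", y="height", title="Height vs. age", grid=True)
```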

Later we will dig deeper into resampling, rolling means, and grouping operations (groupby).