### [Video Explanation Here!](https://youtu.be/31NuV3K-cGw)

### Pandas

Pandas is a Python library that allows us to analyze tabular data. If you have used Microsoft Excel or a similar spreadsheet tool, you've used a user interface to do some of the things we can do with Pandas. If you've worked on data analysis with R, you have already done some programmatic data analysis like we will do with Pandas. 

Pandas represents tabular data inside of an object called a `DataFrame`. Each column header points to a `Series` (a `pandas` object that contains a collection of values).
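To make the `DataFrame`/`Series` relationship concrete, here is a minimal sketch. The `toy` table and its values are invented for illustration and are not the contents of `birds.csv`.

```python
import pandas as pd

# Hypothetical toy data, not the contents of birds.csv
toy = pd.DataFrame({
    "species": ["oriole", "wren", "finch"],
    "weight": [1.2, 0.4, 0.7],
})

weights = toy["weight"]          # selecting one column gives back a Series

print(type(toy).__name__)        # DataFrame
print(type(weights).__name__)    # Series
print(weights.tolist())          # [1.2, 0.4, 0.7]
```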

In [None]:
import sys
!{sys.executable} -m pip install pandas 

import pandas as pd

Notice that, when we install `pandas`, it pulls in its own dependency on a library called `numpy`. We are going to talk about that later.

### Pandas `DataFrame` Object

In [None]:
df = pd.read_csv('birds.csv')

df.head()

A `DataFrame` has attributes:

- index: An Index object. The values are the row/index labels.
- columns: An Index object. The values are the column labels.

In [None]:
print(df.index)
print(df.columns)

### Data Indexing and Selection

A `DataFrame` maps a column name to a `Series` of column data. Because the `__getitem__` behavior of a `DataFrame` returns a column, you can think about a `DataFrame` as a sort of dictionary. There's more to it, of course: for sorting or filtering rows, for example, we need a row-oriented representation of the data.

In [None]:
df['specimen_id'].head()

In [None]:
df.specimen_id.head()

We can also add a new column the way that we would add a new key-value pair to a dictionary:

In [None]:
df['Description'] = "Specimen " + df["specimen_id"] + " is a " + df.species + " and weighs " + df.weight.astype(str) + " ounces."
print(df['Description'][0])

## You can transpose a dataframe.

**Transpose** means to switch the rows and columns. This might make sense if you're referencing individual specimens more often than their characteristics, or if you have more characteristics per specimen than you have specimens.

In [None]:
df.T  # .T returns the transposed DataFrame: rows become columns and vice versa

## You can also get ahold of the values as a 2d matrix.

Remember, we sometimes need a matrix representation of our data for sorting, filtering, or other operations that focus on the rows (specimens) rather than the columns (specimen characteristics).

We also often need a representation like this to feed data into existing machine learning model libraries.

In [None]:
df.values[:5]

Notice, again, that the data type of the thing we are looking at above is not the Python `list`: instead, it says **array**. We'll discuss that further later.
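As a small sketch of what that array is, the example below converts an invented two-column table into a NumPy `ndarray`. The `toy` data is hypothetical. `to_numpy()` is the method form of the same idea as `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data, invented for illustration
toy = pd.DataFrame({"wingspan": [10.0, 12.5], "weight": [1.2, 0.4]})

matrix = toy.to_numpy()          # same data that toy.values exposes

print(type(matrix).__name__)     # ndarray
print(matrix.shape)              # (2, 2): two rows, two columns
print(matrix[1, 0])              # 12.5: second row, first column
```

When the columns hold mixed types (strings and numbers, say), the resulting array falls back to the generic `object` dtype, which is worth keeping in mind before handing it to a numeric library.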

## Many kinds of `DataFrame` indexing

In [None]:
df.loc[3]

In [None]:
df.loc[3, 'weight']

`loc` looks rows and columns up by label. `iloc` stands for "integer location": it looks them up by position, counting from zero.

In [None]:
df.iloc[3,2]
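The difference between label lookup and positional lookup is easiest to see when the row labels are not 0, 1, 2, and so on. This is a minimal sketch with an invented table whose row labels are 10, 20, and 30.

```python
import pandas as pd

# Hypothetical data with row labels that differ from the row positions
toy = pd.DataFrame(
    {"species": ["oriole", "wren", "finch"], "weight": [1.2, 0.4, 0.7]},
    index=[10, 20, 30],
)

by_label = toy.loc[20, "weight"]   # the row labelled 20, the column named "weight"
by_position = toy.iloc[1, 1]       # the second row, the second column

print(by_label, by_position)       # both refer to the same cell

try:
    toy.loc[1]                     # there is no row labelled 1
except KeyError:
    print("no row is labelled 1")
```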

If the numeric index doesn't mean anything to you and you'd like to index on one of the column values, you can do that:

In [None]:
df = df.set_index('specimen_id')
df.head()

In [None]:
df.loc['7ss24g6t7f2dr4h32']

You can even set a multi-index on multiple characteristics. This would be useful if, for example, a `specimen_id` might be duplicated among the specimens but never within a species.

In [None]:
df = df.reset_index() # resets the index to the default

In [None]:
df = df.set_index(['specimen_id', 'species'])
df.head()

In [None]:
df.loc[('g6t7f2dr4h327ss24', 'oriole'), 'Description']

Note that after setting a multi-index, numeric index location still works. Here we are getting a slice of the dataframe representing the third through the fifth rows and the first column. (The index levels are also shown, which is why you see `specimen_id` and `species` in the output.)

In [None]:
df.iloc[2:5, 0]

In [None]:
df.iloc[2:5, 0:1]
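The two cells above look alike but return different types: a single integer for the column gives a `Series`, while a slice gives a one-column `DataFrame`. Here is a sketch on invented data:

```python
import pandas as pd

# Hypothetical five-row table, invented for illustration
toy = pd.DataFrame({
    "species": ["oriole", "wren", "finch", "robin", "jay"],
    "weight": [1.2, 0.4, 0.7, 2.7, 3.0],
})

as_series = toy.iloc[2:5, 0]     # integer column position -> Series
as_frame = toy.iloc[2:5, 0:1]    # column slice -> DataFrame with one column

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```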

### Masking

You can filter for a window of your dataframe for which a certain condition is true:

In [None]:
# Masking
df[df['weight'] < 4.0]

In [None]:
df[(df['weight'] < 3.0) & (df['weight'] > 2.0)]
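It can help to look at the mask on its own. The comparison produces a boolean `Series`, and indexing with it keeps only the rows marked `True`. The sketch below uses invented data. The parentheses matter because `&` binds more tightly than `<` and `>` in Python.

```python
import pandas as pd

# Hypothetical data, invented for illustration
toy = pd.DataFrame({
    "species": ["oriole", "wren", "finch", "robin"],
    "weight": [1.2, 2.5, 3.4, 2.9],
})

# Each comparison yields a boolean Series; & combines them element by element
mask = (toy["weight"] > 2.0) & (toy["weight"] < 3.0)
print(mask.tolist())             # [False, True, False, True]

filtered = toy[mask]             # keeps only the rows where mask is True
print(filtered["species"].tolist())  # ['wren', 'robin']
```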

### `DataFrame` Operations

Pandas `DataFrame` objects allow element-wise operations.

NumPy universal functions (ufuncs) also work on a `DataFrame`, and they *preserve index and column labels* in the output.

In [None]:
import numpy as np 
df = pd.DataFrame(np.random.randint(0, 10, (3,4)), columns=['A','B','C','D'])
df

In [None]:
np.sin(df * np.pi / 4)

### Creating DataFrames

`pandas` provides several functions that return a `DataFrame` from various file formats.

```
df_from_csv = pd.read_csv('/path/to/file.csv')
df_from_web_csv = pd.read_csv('https://data.cityofchicago.org/api/views/x8fc-8rcq/rows.csv?accessType=DOWNLOAD')
df_from_json = pd.read_json('/path/to/file.json')
df_from_web_json = pd.read_json('https://website.com/resource')
df_from_excel = pd.read_excel('path/to/file.xlsx')
```

You can also, if you like, make one from a dictionary:

In [None]:
pd.DataFrame({
                  "name" : ["Jupiter", "Saturn", "Neptune", "Earth"],
                  "habitable_zone": [False, False, False, False],
                  "gaseous": [True, True, True, False]
                  
              })

## Analyzing Data with DataFrames

I could stand here and walk you through an entire catalog of the methods available on a `DataFrame`. I won't, though: that's not how you would learn to use a new library or tool in a professional or research context. Rather, you would: 

1. start with a **use case**: a specific problem to solve.
2. look for ways to solve your problem by:
    - Asking Experts
    - Searching the Internet
    - Checking the Indices of good books
    
As you gain experience in this industry, you are going to learn how to do more things. But even more importantly, you will improve your efficiency at _learning_ new things by:
- Meeting experts, who you can text/email with questions
- Learning better strategies for searching the internet
- Building a collection of authors and books that you trust
- Developing intuition for how things work that sometimes helps you guess what's going on.

For example, from looking at usage examples for `pandas` like the one below, you can get an idea of what the library is good for and what other methods it probably has.

In [None]:
aggregation = pd.read_csv('metrics.csv') \
        .assign(year=lambda row: row["Period Start"].apply(lambda x: x[-4:])) \
        .assign(activity_year=lambda row: row["Activity"] + " (" + row["year"] + ")") \
        .assign(average_days_to_complete_activity=lambda row: row["Average Days to Complete Activity"].apply(lambda x: float(x))) \
        .where(lambda x: x['Activity'] == 'Alley Grading-Unimproved') \
        .groupby('activity_year') \
        .agg({
             'Target Response Days': 'max', 
             'average_days_to_complete_activity': 'mean'
            }) \
        .reset_index() \
        .assign(average_schedule_slippage=lambda row: row.average_days_to_complete_activity - row['Target Response Days']) \
        .sort_values('average_schedule_slippage', ascending=False)

aggregation


In this block of code, we use `pandas` to take some data in a csv and answer the question "Out of all the years that Chicago has recorded the completion of Alley Grading-Unimproved projects, in which of those years has the schedule for the projects slipped the most, on average, compared to its target?"

**In the process we:**
- Make a tabular representation of a csv
- Assign new columns to it to represent the year
- Filter it for the activities we want to look at
- Group it by activity year
- Take some aggregate statistics
- Use those aggregate statistics to calculate schedule slippage
- Put the activity years in descending order from most to least schedule slippage.
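The same sequence of steps can be tried without `metrics.csv`. The sketch below reuses the column names from the example, but the rows are invented. It groups by a plain `year` column rather than `activity_year`, and filters with `.loc` instead of `.where`. Treat it as an illustration of the pattern, not a reproduction of the real result.

```python
import pandas as pd

# Invented rows that reuse the column names from the example above
toy = pd.DataFrame({
    "Period Start": ["01/01/2011", "02/01/2011", "01/01/2012", "01/01/2012"],
    "Activity": ["Alley Grading-Unimproved"] * 3 + ["Pothole Repair"],
    "Average Days to Complete Activity": ["30", "50", "20", "5"],
    "Target Response Days": [10, 10, 10, 7],
})

result = (
    toy
    # New columns: the year (last four characters) and a numeric average
    .assign(year=lambda d: d["Period Start"].str[-4:])
    .assign(avg_days=lambda d: d["Average Days to Complete Activity"].astype(float))
    # Filter for the one activity we care about
    .loc[lambda d: d["Activity"] == "Alley Grading-Unimproved"]
    # Group by year and aggregate
    .groupby("year")
    .agg({"Target Response Days": "max", "avg_days": "mean"})
    .reset_index()
    # Slippage = average days taken minus the target
    .assign(slippage=lambda d: d["avg_days"] - d["Target Response Days"])
    .sort_values("slippage", ascending=False)
)

print(result)
```

With these made-up rows, 2011 averages 40 days against a 10-day target, so it sorts ahead of 2012.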

### Sidenote: The Aggregation Functions

> Here are the 13 aggregating functions available in Pandas and a quick summary of what each one does.

- mean(): Compute mean of groups
- sum(): Compute sum of group values
- size(): Compute group sizes
- count(): Compute count of group
- std(): Standard deviation of groups
- var(): Compute variance of groups
- sem(): Standard error of the mean of groups
- describe(): Generates descriptive statistics
- first(): Compute first of group values
- last(): Compute last of group values
- nth() : Take nth value, or a subset if n is a list
- min(): Compute min of group values
- max(): Compute max of group values
    
Source: [Python and R Tips](https://cmdlinetips.com/2019/10/pandas-groupby-13-functions-to-aggregate/)
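As a quick sketch of how a few of these are called, the example below groups an invented table by `species` and applies `mean()`, `size()`, and several aggregations at once through `agg`. The data is hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for illustration
toy = pd.DataFrame({
    "species": ["oriole", "oriole", "wren"],
    "weight": [1.0, 3.0, 0.5],
})

grouped = toy.groupby("species")["weight"]

means = grouped.mean()           # one mean per species
sizes = grouped.size()           # number of rows in each group
summary = grouped.agg(["min", "max", "count"])  # several aggregations at once

print(means)
print(sizes)
print(summary)
```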
