In [None]:
import pandas as pd
import numpy as np

# Summarizing, Aggregating & Grouping
Knowing how to use pandas aggregation & grouping functions lets us reduce the dimensionality of our data and visualize it in different ways (most often over the rows - aka `axis=0`).  

For completeness, `axis=1` aggregates across the columns, giving one result per row.
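As a minimal sketch of the two axes, using a small made-up DataFrame (`scores` is a hypothetical name, not part of the wine data):

```python
import pandas as pd

# Hypothetical toy frame, not the wine data
scores = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 collapses the rows: one result per column
col_means = scores.mean(axis=0)

# axis=1 collapses the columns: one result per row
row_means = scores.mean(axis=1)

print(col_means)
print(row_means)
```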

In [None]:
wine = pd.read_csv('data/wine_reviews/winemag-data_first150k.csv', index_col=0)

Answer to exercise from notebook 1:


`scrambled_wine[['points', 'region_1']].iloc[:5]`

## Initial quick analysis using pandas
Pandas has multiple built-in functions that make it easy to quickly see what's in your DataFrame.
You can combine them with the selection tools you used before.

Here, we will select a column, and then see how pandas lets us quickly analyse it.

To quickly see which columns our wine dataset has, we can use the `.columns` attribute.

Let's select price.

We can quickly see some metrics of the price, using some built-in aggregating functions in pandas.
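A minimal sketch of a few of these aggregating functions, assuming a small made-up frame (`toy_wine`) in place of the real wine data:

```python
import pandas as pd

# Made-up rows standing in for the wine data
toy_wine = pd.DataFrame({
    "country": ["US", "US", "France", "Italy", "France"],
    "price": [20.0, 35.0, 50.0, None, 15.0],
})

price = toy_wine["price"]

print(price.mean())              # average of the non-null prices
print(price.median())            # middle value of the non-null prices
print(price.min(), price.max())
print(price.count())             # number of non-null values
```

Note that these functions skip missing values by default, so the one missing price does not affect the results.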

And some more advanced metrics:

What if we want to see this list in the opposite order?

Both of these lists are too long. What if we only want to see the top 10 countries?
We can string together the other selectors we learned before!
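One possible way to do this, sketched on made-up data: count the rows per country with `.value_counts()`, flip the order with `ascending=True`, and chain a positional selection to keep only the top of the list. The frame and the cut-off of 2 are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for the wine data
toy_wine = pd.DataFrame({
    "country": ["US", "US", "US", "France", "France", "Italy", None],
})

# Most frequent country first (missing values are left out by default)
counts = toy_wine["country"].value_counts()

# The same list in the opposite order
counts_reversed = toy_wine["country"].value_counts(ascending=True)

# Chain a selector to keep only the first rows of the sorted list
top_two = counts.iloc[:2]

print(top_two)
```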

Can you think of another way to get the top 10 rows?

What if we just want to know how many countries are on the list?

In [None]:
# Number of non-null unique values
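As a sketch on a made-up Series, `.nunique()` is one way to get this count. By default it leaves out missing values; `dropna=False` counts them as a value of their own.

```python
import pandas as pd

# Made-up country column with one missing value
countries = pd.Series(["US", "France", "US", None, "Italy"])

n_countries = countries.nunique()               # nulls excluded by default
n_with_null = countries.nunique(dropna=False)   # the null counts as one more value

print(n_countries, n_with_null)
```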


And if we want a list of them? 
(This isn't a pandas thing, but is still super useful:)

In [None]:
# Gives all unique values


You can look [here](https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/) for a list of all the built-in pandas stats.

One of the most powerful built-in summary tools for pandas is `df.describe()`. This quickly calculates some of these stats for the numeric columns in the df.
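A minimal sketch of `df.describe()` on a made-up frame with one text column and two numeric columns:

```python
import pandas as pd

# Made-up rows standing in for the wine data
toy_wine = pd.DataFrame({
    "country": ["US", "France", "Italy", "US"],
    "points": [88, 92, 85, 90],
    "price": [20.0, 50.0, 15.0, 35.0],
})

# One column of summary statistics per numeric column
summary = toy_wine.describe()
print(summary)
```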

**Question**: Why are only 2 of the columns included?

### Conditional Selections 
We can use conditional selections to narrow our analysis even further.

DON'T FORGET - to make things easier, we can save selections we plan to use often as their own variables.

We can then use these to calculate more targeted metrics.
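For instance, a sketch on made-up data in which the selection is saved once as `us_wine` (a hypothetical variable name) and then reused:

```python
import pandas as pd

# Made-up rows standing in for the wine data
toy_wine = pd.DataFrame({
    "country": ["US", "France", "US", "Italy"],
    "price": [20.0, 50.0, 40.0, 15.0],
})

# Save the selection once, then reuse it
us_wine = toy_wine[toy_wine["country"] == "US"]

print(len(us_wine))              # number of US wines
print(us_wine["price"].mean())   # average price of US wines only
```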

#### More advanced conditionals: Using masks
When you want to filter on more than one criterion, it can be easier to use a mask.
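A mask is just a boolean Series with one True/False per row. The sketch below combines two conditions on made-up data; the country list and the points threshold are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up rows standing in for the wine data
toy_wine = pd.DataFrame({
    "country": ["US", "Canada", "France", "Mexico", "US"],
    "points": [91, 88, 95, 86, 84],
})

# Each condition gets its own parentheses; & means "and", | means "or"
mask = (toy_wine["country"].isin(["US", "Canada", "Mexico"])) & (toy_wine["points"] >= 86)

selected = toy_wine[mask]
print(selected)
print(mask.sum())   # True counts as 1, so this is the number of matching rows
```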

How many wines from North America do we have on our list?

How many wines do we have in total from North America?

**Question:** How many of the wines belong to each country?

**Question:** From which US state do most of our wines come?

## Groupby
One of the most flexible ways to aggregate in pandas is with `.groupby()`.
We will look at how this works for categorical datasets like this one, and also for datetime datasets, as dealing with datetimes in pandas can be tricky.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

### How Groupby Works:
You can group your data in many different ways, and also aggregate it by any of the aggregators we saw before: like mean, mode, sum, etc.
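A minimal sketch of the pattern on made-up data: pick the grouping column, pick the column to aggregate, then pick the aggregator.

```python
import pandas as pd

# Made-up rows standing in for the wine data
toy_wine = pd.DataFrame({
    "country": ["US", "US", "France", "France", "Italy"],
    "points": [90, 86, 94, 92, 85],
})

# Average score per country, highest first
mean_points = toy_wine.groupby("country")["points"].mean()
ranked = mean_points.sort_values(ascending=False)

print(ranked)
```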

So, British wine has the highest average ranking? This is surprising. Maybe we need to look at our data in a different way.

The beauty of `df.groupby()` is that it lets you aggregate different columns in different ways.

Say we want to know the average price of wine in each country, but the _highest_ score:
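One way to sketch this, on made-up data, is to pass `.agg()` a dictionary that maps each column to the aggregator it should get:

```python
import pandas as pd

# Made-up rows standing in for the wine data
toy_wine = pd.DataFrame({
    "country": ["US", "US", "France", "France"],
    "points": [90, 86, 94, 92],
    "price": [20.0, 40.0, 60.0, 30.0],
})

# Mean of price, but max of points, per country
by_country = toy_wine.groupby("country").agg({"price": "mean", "points": "max"})

print(by_country)
```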

**Tip:** Sometimes, to make your code cleaner, it's best to move the aggregations out and store them in their own variable that you can update separately.

You can also use a list in your aggs to aggregate one column in different ways.

This will give a **multi-index**. Multi-indexes can be difficult to sort on. But, there are a few different ways we can deal with this.

One way is by dropping the top level ('price'):
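A sketch of this on made-up data: the list of aggregators produces two-level column labels, and `droplevel(0)` removes the outer level so we can sort on a plain column name again.

```python
import pandas as pd

# Made-up rows standing in for the wine data
toy_wine = pd.DataFrame({
    "country": ["US", "US", "France", "France"],
    "price": [20.0, 40.0, 60.0, 30.0],
})

price_stats = toy_wine.groupby("country").agg({"price": ["mean", "max"]})
print(price_stats.columns.tolist())   # each label is a (column, aggregator) tuple

# Drop the outer 'price' level so the columns are plain names again
flat = price_stats.copy()
flat.columns = flat.columns.droplevel(0)

print(flat.sort_values("mean", ascending=False))
```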

Another way is by using `np.ravel()`. This preserves the "price" indicator somewhere in each of the column names.
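The idea of flattening while keeping "price" in the names can be sketched as follows. Note the swap: this sketch iterates over the column tuples directly instead of calling `np.ravel()`, and the `_` separator is an arbitrary choice.

```python
import pandas as pd

# Made-up rows standing in for the wine data
toy_wine = pd.DataFrame({
    "country": ["US", "US", "France", "France"],
    "price": [20.0, 40.0, 60.0, 30.0],
})

price_stats = toy_wine.groupby("country").agg({"price": ["mean", "max"]})

# Each column label is a tuple such as ('price', 'mean'); join its parts
price_stats.columns = ["_".join(col) for col in price_stats.columns]

print(price_stats.sort_values("price_mean", ascending=False))
```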

**Question**: Create a quick plot of the number of wines made in each country, from highest to lowest.

## Selecting the max and min values with Index Max and Min
In addition to `.max()` and `.min()`, which return the maximum or minimum values, we can use `.idxmax()` and `.idxmin()` to return the *index* pertaining to the maximum and minimum values.

For example, let's use `.idxmax()` to find the country with the highest standard deviation in its prices.
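A minimal sketch on made-up prices, where one country has a much wider spread than the others:

```python
import pandas as pd

# Made-up rows standing in for the wine data
toy_wine = pd.DataFrame({
    "country": ["US", "US", "France", "France", "Italy", "Italy"],
    "price": [20.0, 22.0, 10.0, 90.0, 15.0, 25.0],
})

price_std = toy_wine.groupby("country")["price"].std()

print(price_std.max())      # the largest standard deviation itself
print(price_std.idxmax())   # the index label (country) where it occurs
```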

# Dealing with datetime in Pandas
Pandas builds on Python's `datetime` module to offer a `DatetimeIndex`, and plenty of ways to work with it.
However, it is still far from intuitive.
That doesn't mean it's not useful: anyone doing a time series project will need to deal with dates and times in pandas often.

Let's load a sample dataset of datetime energy data and get started!

In [None]:
energy = pd.read_csv('data/energy/PJM_Load_hourly.csv', parse_dates=True, index_col=0)

Note that this data has a `DatetimeIndex`.
Setting `parse_dates=True` when we read the CSV tells pandas to parse the index column as dates, which gives us this `DatetimeIndex`.

We can select data points within a specific time range, using the `DatetimeIndex` and `.loc`.
Here, we select the first day of data.

In [None]:
# One record for each hour of this day.
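A minimal sketch of this kind of selection, assuming made-up hourly data (`toy_energy` and the column name `load_mw` are hypothetical, and the dates are arbitrary):

```python
import numpy as np
import pandas as pd

# Made-up hourly data; 'load_mw' is a hypothetical column name
idx = pd.date_range("1998-04-01", periods=72, freq="h")
toy_energy = pd.DataFrame({"load_mw": np.arange(72, dtype=float)}, index=idx)

# A date string selects every timestamp that falls on that day
first_day = toy_energy.loc["1998-04-01"]

# A slice of date strings includes both end points
two_days = toy_energy.loc["1998-04-01":"1998-04-02"]

print(len(first_day), len(two_days))
```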


## Selecting with boolean indexing on pandas DatetimeIndex
We can use attribute (dot) notation on the index, combined with conditionals, to select on specific parts of the datetime, like days or months.

Python datetime functionality example:

In [None]:
from datetime import datetime



In [None]:
# making a new DF that only includes the month of September from each year.
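One way this could look, sketched on made-up daily data that spans two Septembers (`toy_energy` and `load_mw` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up daily data covering parts of two years
idx = pd.date_range("1998-08-31", "1999-09-02", freq="D")
toy_energy = pd.DataFrame({"load_mw": np.ones(len(idx))}, index=idx)

# .month is available directly on the index; the comparison gives one True/False per row
september = toy_energy[toy_energy.index.month == 9]

print(september.index.year.unique().tolist())   # the years that contribute rows
print(len(september))
```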


We can also call just a date, and get all the hours/time periods in that day:

The same works for a year and month:

In [None]:
# We see that it includes one record for each hour of each day of the month of January, which has 31 days


In [None]:
# or better, with an assert statement:
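A sketch of the year-and-month selection together with such a check, on made-up hourly data (names and dates are assumptions):

```python
import numpy as np
import pandas as pd

# 60 days of made-up hourly data starting on 1 January
idx = pd.date_range("1999-01-01", periods=24 * 60, freq="h")
toy_energy = pd.DataFrame({"load_mw": np.ones(len(idx))}, index=idx)

# A 'year-month' string selects every timestamp in that month
january = toy_energy.loc["1999-01"]

# 31 days x 24 hours; the assert fails loudly if the count is off
assert len(january) == 31 * 24
```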


## Resampling
We can also combine the data in different ways, and over different time periods.
This means that just because our data is in hourly time periods, we don't have to keep it that way.

In [None]:
# We can get the average load over a day:


In [None]:
# We can also get the total MWh used in a day:
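Both resamples can be sketched on two days of made-up hourly data. The column name `load_mw` is hypothetical, and the sketch assumes each hourly value is an average load in MW, so that summing 24 of them gives MWh.

```python
import pandas as pd

# Two days of made-up hourly data: a flat 100 MW day, then a flat 200 MW day
idx = pd.date_range("1998-04-01", periods=48, freq="h")
toy_energy = pd.DataFrame({"load_mw": [100.0] * 24 + [200.0] * 24}, index=idx)

daily_mean = toy_energy.resample("D").mean()   # average load per day
daily_total = toy_energy.resample("D").sum()   # hourly values summed over each day

print(daily_mean)
print(daily_total)
```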


## Groupby with DatetimeIndex
Using groupby with a pandas `DatetimeIndex` can be extremely useful and powerful.
Let's look at how this can work.

The index level names are not particularly helpful here. We can change them.

In [None]:
# the long, ugly way


In [None]:
# the short, clean way. Both do the same thing.
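As a sketch of the whole step on made-up data: group by year and week, then name the resulting index levels. Everything here is an assumption for illustration, including the use of `isocalendar()` for the week numbers, the level names `year` and `week`, and the two renaming options shown.

```python
import pandas as pd

# Two full Monday-to-Sunday weeks of made-up daily data
idx = pd.date_range("1999-03-01", periods=14, freq="D")
toy_energy = pd.DataFrame({"load_mw": [1.0] * 7 + [2.0] * 7}, index=idx)

# ISO calendar year and week number for every timestamp
iso = idx.isocalendar()
keys = [iso["year"].to_numpy(dtype="int64"), iso["week"].to_numpy(dtype="int64")]

weekly = toy_energy.groupby(keys).sum()
print(weekly.index.names)   # the two levels start out unnamed

# Option 1: assign the names in place
weekly.index.names = ["year", "week"]

# Option 2: rename_axis returns a renamed copy, handy in a method chain
weekly = toy_energy.groupby(keys).sum().rename_axis(["year", "week"])

print(weekly)
```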


### Selecting on Multi-Index: Using Slice

Or, say we want to know what Christmas week looked like across all the years for which we have data.
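A sketch of such a selection on made-up daily data, using week 52 as a stand-in for Christmas week. The (year, week) index is built with `isocalendar()`, which is an assumption, as are all the names.

```python
import pandas as pd

# Three years of made-up daily data, one unit of load per day
idx = pd.date_range("1998-01-01", "2000-12-31", freq="D")
toy_energy = pd.DataFrame({"load_mw": 1.0}, index=idx)

# Weekly totals with a (year, week) multi-index
iso = idx.isocalendar()
keys = [iso["year"].to_numpy(dtype="int64"), iso["week"].to_numpy(dtype="int64")]
weekly = toy_energy.groupby(keys).sum().rename_axis(["year", "week"])

# slice(None) means "every value of this level": all years, week 52 only
week_52 = weekly.loc[(slice(None), 52), :]

print(week_52)
```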

You can also then use a groupby again!

Note that here, we're using a groupby on the index level.
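Grouping on an index level can be sketched like this, with a small made-up (year, week) frame built by hand:

```python
import pandas as pd

# Made-up weekly totals with a two-level index
index = pd.MultiIndex.from_product([[1998, 1999], [51, 52]], names=["year", "week"])
weekly = pd.DataFrame({"load_mw": [10.0, 20.0, 30.0, 40.0]}, index=index)

# level= groups on part of the index instead of on a column
by_week = weekly.groupby(level="week")["load_mw"].mean()
by_year = weekly.groupby(level="year")["load_mw"].sum()

print(by_week)
print(by_year)
```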

Now, we can plot by these levels!

**Question:** See what just writing "52" in the slice box does.

**Question**: Select Christmas week, but only for years 1998-2000.

**Question**: Which week is Christmas in, in each of the years?

## Selecting on Multi-Index: Using reset_index()
Selecting with `slice` on a multi-index can be difficult to work with and read. Often, it is easier to turn the index levels back into "normal" pandas DataFrame columns and work with those.

As an example, we'll select the Christmas week the same way we did with the slice command above. 
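A minimal sketch of the `reset_index()` route, on a small made-up (year, week) frame and again using week 52 as the stand-in:

```python
import pandas as pd

# Made-up weekly totals with a two-level index
index = pd.MultiIndex.from_product([[1998, 1999], [51, 52]], names=["year", "week"])
weekly = pd.DataFrame({"load_mw": [10.0, 20.0, 30.0, 40.0]}, index=index)

# 'year' and 'week' become ordinary columns
flat = weekly.reset_index()

# Now the usual conditional selection works
week_52 = flat[flat["week"] == 52]

print(week_52)
```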

## Exercise:
- Find the week (and its associated year) with the highest total weekly consumption.

- Find the day of the week that averages the highest consumption

- Find the time of day that averages the lowest consumption.
    - Has this changed over the years?
    

- Is average consumption rising, falling, or staying the same over the years?
- What is the percentage difference in consumption on average between April and June?
