# 0. Importing data (DO NOT ALTER)

In [0]:
import pandas as pd
import numpy as np

In [0]:
# Load the dataset
data = pd.read_csv('../../../Data/avocado.csv')

In [0]:
# Preview the data
data

# 1. 1-D aggregations on Pandas Series

Let's recall computing aggregations such as `sum()`, `mean()`, `median()`, `max()` and `min()` using Pandas Series.
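
As a quick refresher, here is a minimal sketch on a hypothetical toy Series (`toy_series` and its values are made up for illustration and are unrelated to the avocado data). It also shows that `sum()` and `count()` answer different questions.

```python
import pandas as pd

# Hypothetical toy values, unrelated to the avocado dataset
toy_series = pd.Series([2.0, 4.0, 4.0, 10.0])

toy_sum = toy_series.sum()        # adds up all values
toy_count = toy_series.count()    # number of non-missing values
toy_mean = toy_series.mean()      # arithmetic average
toy_median = toy_series.median()  # middle value of the sorted data

print(toy_sum, toy_count, toy_mean, toy_median)
```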

In [0]:
# Create Pandas Series using values: [8.45, 3.15, 1.25, 10.55, 2.40]

our_series = pd.Series([8.45, 3.15, 1.25, 10.55, 2.40])

In [0]:
# TASK 1 >>>> Print the computed aggregations
print(f'The rounded sum of the values is: {our_series.sum().round()}')

# Fill in the empty brackets {} (returns an error if not filled)
print(f'The average value is: {}')
print(f'The median value is: {}')
print(f'The maximum value is: {}')
print(f'The minimum value is: {}')

# 2. 2-D aggregations on Pandas DataFrame

To understand the true power of `groupby()` we can take a look at what is going on under the hood.  
Let's say we want to compute the average price of avocados based on their type: conventional and organic. 

First, we have to split our dataset into 2 different groups based on the type:

In [0]:
# Filter only those records that are organic type and assign it to variable filter_o
filter_o = data['type'] == 'organic'

In [0]:
# Use .loc[] on data to access all columns based on our condition filter_o and assign it to the variable data_organic
data_organic = data.loc[filter_o]
data_organic

Notice that only records of the organic type remain.
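
The filtering pattern used here can be sketched on a small example. This is only an illustration on a hypothetical toy DataFrame (`toy`, `kind` and `price` are made-up names, not columns of the avocado dataset).

```python
import pandas as pd

# Hypothetical toy data, not the avocado dataset
toy = pd.DataFrame({
    'kind': ['a', 'b', 'a', 'b'],
    'price': [1.0, 2.0, 3.0, 4.0],
})

# Comparing a column with a value yields a boolean Series (the mask)
mask_a = toy['kind'] == 'a'

# .loc keeps only the rows where the mask is True, with all columns
toy_a = toy.loc[mask_a]

print(mask_a.tolist())
print(toy_a)
```

Note that the filtered rows keep their original index labels.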

In [0]:
# TASK 2.1 >>>> Filter only those records that are of type conventional and assign it to the variable filter_c

In [0]:
# TASK 2.2 >>>> Use .loc[] on data to access all columns based on our condition filter_c and assign it to the variable data_conventional

Now compute the average price for both types of avocados using the `.mean()` method applied to the column `AveragePrice`.

In [0]:
# Compute the average price for filtered organic avocados and assign it to the variable avg_organic
avg_organic = data_organic['AveragePrice'].mean()

In [0]:
# TASK 3 >>>> Compute the average price for filtered conventional avocados and assign it to the variable avg_conventional

In [0]:
# Print the outputs and the type of the outputs
print(avg_organic, avg_conventional)
print('\n')
print(type(avg_organic), type(avg_conventional))

Lastly, combine these results into a single data structure using `pd.DataFrame()`. Create a dictionary where the first key is 'Type' with the values 'organic' and 'conventional'. The second key is 'Average_price', and its values are `avg_organic` and `avg_conventional`, respectively.

In [0]:
# Combine these results into a new DataFrame
data_output = pd.DataFrame({'Type':['organic','conventional'], 
                            'Average_price':[avg_organic, avg_conventional]})

In [0]:
# Print the resulting DataFrame
print('\nResult dataframe :\n',data_output)

However, we can use `groupby()` to achieve the same result with a single line of code!
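
To see that the two routes agree, here is a minimal sketch on a hypothetical toy DataFrame (the names `toy`, `kind` and `price` are made up for illustration and are not part of the avocado dataset).

```python
import pandas as pd

# Hypothetical toy data, not the avocado dataset
toy = pd.DataFrame({
    'kind': ['a', 'b', 'a', 'b'],
    'price': [1.0, 2.0, 3.0, 4.0],
})

# Manual route: filter each group, then average it
manual = {
    kind: toy.loc[toy['kind'] == kind, 'price'].mean()
    for kind in ['a', 'b']
}

# groupby route: split, apply and combine in one expression
grouped = toy.groupby('kind')['price'].mean()

print(manual)
print(grouped.to_dict())
```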

# 3. 2-D aggregations on Pandas DataFrame (KEY LEARNING)

The `groupby()` method allows us to quickly and efficiently split the data into separate groups and perform computations on each of them. When we pass the desired column or columns to `groupby()`, it returns a _DataFrameGroupBy object_. We can think of it as a special view on our DataFrame: no computation is done until we call an aggregation such as `mean()` or `sum()`.

In [0]:
# Group the data based on the column 'year'
data.groupby('year')
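
The deferred behaviour of the grouped object can be illustrated with a small sketch. It assumes a hypothetical toy DataFrame (`toy` with made-up `year` and `price` values), not the avocado data.

```python
import pandas as pd

# Hypothetical toy data, not the avocado dataset
toy = pd.DataFrame({
    'year': [2015, 2015, 2016],
    'price': [1.0, 3.0, 5.0],
})

groups = toy.groupby('year')

# The grouped object only records how rows are split; nothing is aggregated yet
print(type(groups).__name__)

# Iterating yields (group key, sub-DataFrame) pairs
for key, frame in groups:
    print(key, len(frame))

# The aggregation is computed only once a method such as mean() is called
yearly_mean = groups['price'].mean()
print(yearly_mean)
```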

Now we compute the average price for organic and conventional avocados again but we'll make use of `groupby()`.

In [0]:
# Group the data based on Avocado type
# Compute the average price using .mean()

by_type_total = data.groupby('type')['AveragePrice'].mean()
print(by_type_total)

In [0]:
# Group the data based on columns 'type' and 'year' passed into the list and compute the average price

by_type_year = data.groupby(['type','year'])['AveragePrice'].mean()
print(by_type_year)

In [0]:
# TASK 4 >>>> Group the data based on columns 'type', 'year' and 'region' passed into the list
# and compute how many kg of Large Hass Avocados have been sold in total. 
# Assign the result to the variable by_year.

When we aggregate after `.groupby()`, the result is slightly different from a standard Pandas DataFrame: the grouping columns become the index of the result. You can see this in the printed output above, where "type" and "year" appear as index levels rather than as regular columns.

To work with the result further, we can turn the index levels back into columns with `reset_index()`, which gives us a regular DataFrame.
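
A minimal sketch of this step, assuming a hypothetical toy DataFrame (`toy`, `kind`, `year` and `price` are made-up names used only for illustration):

```python
import pandas as pd

# Hypothetical toy data, not the avocado dataset
toy = pd.DataFrame({
    'kind': ['a', 'a', 'b', 'b'],
    'year': [2015, 2016, 2015, 2016],
    'price': [1.0, 2.0, 3.0, 4.0],
})

# Grouping by two columns gives a Series indexed by (kind, year)
by_kind_year = toy.groupby(['kind', 'year'])['price'].mean()
print(type(by_kind_year.index).__name__)

# reset_index() turns the index levels back into ordinary columns
flat = by_kind_year.reset_index()
print(flat.columns.tolist())
```

Calling `reset_index()` directly on the aggregated Series already returns a DataFrame, so wrapping it in `pd.DataFrame()` first is optional.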

In [0]:
# Reset the index using .reset_index() method and create a DataFrame
our_df = pd.DataFrame(by_year).reset_index()
print(our_df)

# 4. Aggregate function (ADVANCED)

![](https://keytodatascience.com/wp-content/uploads/2020/04/image-1.png)

[image source](https://keytodatascience.com/groupby-pandas-python/)

The aggregation method `agg()`:

- is an alias for `aggregate()`
- takes a function or a list of functions and applies them to a Series or to each column of a DataFrame

To apply different functions to different columns, pass the column names and functions in a dictionary like this:

`our_dataset.agg({'First_column' : ['max', 'min'], 'Second_column' : ['mean', 'median']})`

In [0]:
# Compute maximum and minimum values for column 'Total Volume' and minimum and mean values for column 'Small Bags' using .agg()
data.agg({'Total Volume' : ['max', 'min'], 'Small Bags' : ['min', 'mean']})
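
To make the shape of the result easier to read, here is the same dictionary form of `agg()` sketched on a hypothetical toy DataFrame (`toy`, `volume` and `bags` are made-up names, not columns of the avocado dataset).

```python
import pandas as pd

# Hypothetical toy data, not the avocado dataset
toy = pd.DataFrame({
    'volume': [10, 20, 30],
    'bags': [1, 2, 6],
})

# Each key is a column, each value the list of aggregations for that column
summary = toy.agg({'volume': ['max', 'min'], 'bags': ['min', 'mean']})
print(summary)
```

The rows of the result are the aggregation names and the columns are the selected columns. A combination that was not requested, such as the mean of `volume` here, shows up as `NaN`.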

We can also call `.agg()` on a grouped object and compute statistics for selected columns.

In [0]:
# Group the data based on the two columns 'region' and 'type'
# Compute aggregations 'min','max' and 'mean' for 'AveragePrice'
grouped = data.groupby(['region','type']).agg({'AveragePrice':['min','max','mean']})

In [0]:
grouped

- Within `agg()`, we can also pass our own custom functions alongside the built-in aggregations.

In [0]:
# Write a function to compute 95th percentile on desired column using .quantile(0.95)
def percentile_95(column):
    return column.quantile(0.95)
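
As an illustration of mixing a built-in aggregation with a custom one, here is a minimal sketch on hypothetical toy data. The helper `value_range` and the names `toy`, `kind` and `price` are made up for this example.

```python
import pandas as pd

def value_range(column):
    # Hypothetical custom aggregation: spread between largest and smallest value
    return column.max() - column.min()

# Hypothetical toy data, not the avocado dataset
toy = pd.DataFrame({
    'kind': ['a', 'a', 'b', 'b'],
    'price': [1.0, 4.0, 2.0, 3.0],
})

# Built-in aggregations are passed as strings, custom ones as function objects
stats = toy.groupby('kind')['price'].agg(['mean', value_range])
print(stats)
```

The custom function receives the values of one group at a time, and its name becomes the column label in the result.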

In [0]:
# TASK 5 - HARD >>>> Get 95th percentile and mean values for columns: 'Small Bags','Large Bags','XLarge Bags' from DataFrame data, using .agg()

# 5. Bonus Task (HARD)

`groupby()` can be useful when we want to look at the proportion of each avocado type. We would like to see what percentages of all avocados sold were conventional and organic, for example 97% and 3%.

To reach this result:
- Group the data by 'type' and obtain sums on the 'Total Volume' column, assign result to `volume_by_type`
- Divide `volume_by_type` by the sum of all avocados. Assign the result to the variable `proportion`.
- Print the proportion and optionally multiply it by 100 to obtain a figure in percentage
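
The general pattern behind these steps can be sketched on unrelated toy data. Everything here (`toy`, `kind`, `units`, `units_by_kind`, `share`) is a hypothetical name chosen for illustration, so the task on the avocado data is still left to you.

```python
import pandas as pd

# Hypothetical toy data, not the avocado dataset
toy = pd.DataFrame({
    'kind': ['a', 'a', 'b'],
    'units': [30, 50, 20],
})

# Sum the units within each group
units_by_kind = toy.groupby('kind')['units'].sum()

# Divide each group's total by the overall total to get its share
share = units_by_kind / units_by_kind.sum()

# Multiply by 100 to express the shares as percentages
print(share * 100)
```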

In [0]:
# TASK 6.1 >>>> Group data based on their types and compute the sum of the Total Volume

In [0]:
# TASK 6.2 >>>> Compute the proportion of each avocado type

In [0]:
# TASK 6.3 >>>> Print the output multiplied by 100

# 6. Appendix

Data Source: https://www.kaggle.com/neuromusic/avocado-prices

License: Database: Open Database, Contents: © Original Authors

Material adapted for RBI internal purposes with full permissions from original authors. Source: https://github.com/zatkopatrik/authentic-data-science