# 0. Importing data (DON'T ALTER)

In [3]:
# Import pandas as pd
# Import numpy as np
import pandas as pd
import numpy as np

In [4]:
# Load the dataset
data = pd.read_csv('Datasets/avocado.csv')

In [None]:
# Preview the data
data

# 1. 1-D aggregations on Pandas Series

Let's recall how to compute aggregations such as `sum()`, `mean()`, `median()`, `max()` and `min()` on a Pandas Series.

In [None]:
# Create Pandas Series using values: [8.45, 3.15, 1.25, 10.55, 2.40]

our_series = pd.Series([8.45, 3.15, 1.25, 10.55, 2.40])

In [None]:
# TASK 1 >>>> Print the computed aggregations

print('The rounded sum of the values is: {}'.format(our_series.sum().round()))
print('The average value is: {}'.format())
print('The median value is: {}'.format())
print('The maximum value is: {}'.format())
print('The minimum value is: {}'.format())

# 2. 2-D aggregations on Pandas DataFrame

To understand the true power of `groupby()`, let's look at what is going on under the hood. Say we want to compute the average price of avocados by type: conventional and organic.

First, we have to split our dataset into 2 groups based on the type:

In [6]:
# Build a boolean condition that is True for records of organic type and assign it to variable filter_o
filter_o = data['type'] == 'organic'

In [None]:
# Use .loc[] on data to select the rows matching our condition filter_o (all columns) and assign the result to variable data_organic
data_organic = data.loc[filter_o]
data_organic

Notice that only records of the organic type remain.

In [9]:
# TASK 2 >>>> Build a boolean condition that is True for records of conventional type and assign it to variable filter_c


In [None]:
# TASK 2 >>>> Use .loc[] on data to select the rows matching our condition filter_c (all columns) and assign the result to variable data_conventional


Now compute the average price for both types of avocados using the `.mean()` method applied to the column `AveragePrice`.

In [26]:
# Compute average price for filtered organic avocados and assign it to variable avg_organic
avg_organic = data_organic['AveragePrice'].mean()

In [27]:
# TASK 3 >>>> Compute average price for filtered conventional avocados and assign it to variable avg_conventional
avg_conventional = data_conventional['AveragePrice'].mean()

In [None]:
# Print the outputs and the type of the outputs
print(avg_organic, avg_conventional)
print('\n')
print(type(avg_organic), type(avg_conventional))

Lastly, combine these results into a single data structure using `pd.DataFrame()`. Create a dictionary where the first key is 'Type' with the values 'organic' and 'conventional'. The second key is 'Average_price', and its values are the `avg_organic` and `avg_conventional` we computed, respectively.

In [29]:
# Combine these results into the DataFrame
data_output = pd.DataFrame({'Type':['organic','conventional'], 
                            'Average_price':[avg_organic, avg_conventional]})

In [None]:
# Print resulting DataFrame
print('\nResult dataframe :\n',data_output)

However, we can use `groupby()` to achieve the same result with a single line of code!
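As a minimal sketch of that equivalence, the snippet below repeats the split, apply and combine steps on a small made-up DataFrame and compares them with `groupby()`. The name `toy` and its values are hypothetical and are not taken from the avocado dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the avocado dataset
toy = pd.DataFrame({
    'type': ['organic', 'conventional', 'organic', 'conventional'],
    'AveragePrice': [1.50, 1.00, 1.70, 1.20],
})

# Manual approach: split with a boolean condition, apply mean(), combine into a dict
manual = {
    kind: toy.loc[toy['type'] == kind, 'AveragePrice'].mean()
    for kind in ['conventional', 'organic']
}

# groupby() performs the same three steps in one expression
grouped = toy.groupby('type')['AveragePrice'].mean()

print(manual)
print(grouped)
```

Both approaches give the same average per type; `groupby()` simply does not require us to list the categories by hand.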

# 3. 2-D aggregations on Pandas DataFrame (KEY LEARNING)

The `groupby()` method allows us to quickly and efficiently split the data into separate groups to perform computations. When we pass the desired column or columns to `groupby()`, it returns a `DataFrameGroupBy` object. We can think of it as a special view of our DataFrame. No computation is done until we call an aggregation such as `mean()` or `sum()`.

In [None]:
# Group the data based on the column 'year'
data.groupby('year')
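The sketch below illustrates this behaviour on a small made-up DataFrame (`toy` is hypothetical): grouping alone only yields a `DataFrameGroupBy` object, and numbers appear once an aggregation is called on it.

```python
import pandas as pd

# Hypothetical toy data, not the avocado dataset
toy = pd.DataFrame({
    'year': [2015, 2015, 2016, 2016],
    'AveragePrice': [1.0, 1.2, 1.4, 1.6],
})

# Grouping only describes how rows are split; no averages exist yet
groups = toy.groupby('year')
print(type(groups).__name__)
print(groups.ngroups)  # number of distinct groups

# The aggregation triggers the actual computation
yearly_mean = groups['AveragePrice'].mean()
print(yearly_mean)
```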

Now we compute the average price for the organic and conventional types of avocados again, but this time we'll make use of `groupby()`.

In [None]:
# Group the data based on Avocado type
# Compute average price using .mean()
by_type_total = data.groupby('type')['AveragePrice'].mean()
print(by_type_total)

In [None]:
# Group the data based on 2 columns passed as a list: columns 'type' and 'year', and compute the average price
by_type_year = data.groupby(['type','year'])['AveragePrice'].mean()
print(by_type_year)

In [None]:
# TASK 4 >>>> Group the data based on 3 columns passed into the list: columns 'type', 'year', 'region' and compute how many Small Hass Avocados have been sold in total
#             Assign it to variable by_year


When we use `.groupby()` with an aggregation, the resulting object is slightly different from a standard Pandas DataFrame: the grouping columns become the index of the result. You can see this in the printed output, where "type" and "year" are nicely laid out as index levels.

If we would like to work with the resulting object further, we should reset its row index by using `reset_index()`.

In [None]:
# Reset the index using .reset_index() method and create a DataFrame
our_df = pd.DataFrame(by_year).reset_index()
print(our_df)
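To make the effect of `reset_index()` concrete, here is a small sketch on made-up data (`toy`, `by_group` and `flat` are hypothetical names). Calling `reset_index()` on a grouped Series already returns a DataFrame, with the group keys turned back into ordinary columns.

```python
import pandas as pd

# Hypothetical toy data, not the avocado dataset
toy = pd.DataFrame({
    'type': ['organic', 'organic', 'conventional', 'conventional'],
    'year': [2015, 2016, 2015, 2016],
    'AveragePrice': [1.5, 1.7, 1.0, 1.2],
})

# After grouping by two columns, the group keys live in a two-level index
by_group = toy.groupby(['type', 'year'])['AveragePrice'].mean()
print(list(by_group.index.names))

# reset_index() moves the index levels back into regular columns
flat = by_group.reset_index()
print(list(flat.columns))
print(flat)
```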

# 4. Aggregate function (ADVANCED)

`agg()` 

- it is an alias for `aggregate()`
- it applies one function, or a list of functions, to a Series or to each column of a DataFrame separately

To apply different functions to different columns, pass the columns and functions within a dictionary like this:

`our_dataset.agg({'First_column' : ['max', 'min'], 'Second_column' : ['mean', 'median']})`

In [None]:
# Compute maximum and minimum values for column 'Total Volume' and minimum and mean values for column 'Small Bags' using .agg()
data.agg({'Total Volume' : ['max', 'min'], 'Small Bags' : ['min', 'mean']})

We can also call `.agg()` on our grouped object and compute statistics for a selected column.

In [36]:
# Group the data based on 2 columns: 'region' and 'type'
# Compute aggregations 'min','max' and 'mean' for 'AveragePrice'
grouped = data.groupby(['region','type']).agg({'AveragePrice':['min','max','mean']})

In [None]:
grouped
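When a dictionary of lists is passed to `.agg()`, the result has two-level column labels: the original column name and the name of the aggregation. The sketch below, on made-up data (`toy` and `summary` are hypothetical), shows how a single statistic can be read back with a tuple of both labels.

```python
import pandas as pd

# Hypothetical toy data, not the avocado dataset
toy = pd.DataFrame({
    'type': ['organic', 'organic', 'conventional', 'conventional'],
    'AveragePrice': [1.5, 1.7, 1.0, 1.2],
})

summary = toy.groupby('type').agg({'AveragePrice': ['min', 'max', 'mean']})

# Column labels are (column name, aggregation name) pairs
print(summary.columns.tolist())

# Select one value by row label and by the pair of column labels
organic_max = summary.loc['organic', ('AveragePrice', 'max')]
print(organic_max)
```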

- within `agg()` we can combine our own custom functions with the built-in aggregations

In [None]:
# Write a function to compute 95th percentile on desired column using .quantile(0.95)
def percentile_95(column):
  return column.quantile(0.95)
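As a hedged illustration of mixing a custom function with a built-in aggregation, the sketch below uses a different helper, `value_range`, on a grouped column. Both `value_range` and the `toy` data are hypothetical and serve only to show the pattern.

```python
import pandas as pd

# Hypothetical toy data, not the avocado dataset
toy = pd.DataFrame({
    'type': ['organic', 'organic', 'conventional', 'conventional'],
    'AveragePrice': [1.5, 1.7, 1.0, 1.2],
})

def value_range(column):
    # Hypothetical helper: spread between the largest and smallest value
    return column.max() - column.min()

# A custom function is passed as an object, built-in aggregations by name
summary = toy.groupby('type')['AveragePrice'].agg([value_range, 'mean'])
print(summary)
```

The resulting columns are named after the function and the aggregation string, here `value_range` and `mean`.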

In [None]:
# TASK 5 - HARD >>>> Get 95th percentile and mean values for columns: 'Small Bags','Large Bags','XLarge Bags' from DataFrame data, using .agg()


# 5. Bonus Task (HARD)

`groupby()` can be useful when we want to look at the proportion of each avocado type. We would like to see what percentage of the total volume sold was conventional and what percentage was organic. For example: 97% and 3%.

To reach this result:
- Group the data by 'type' and compute the sum of the 'Total Volume' column. Assign the result to variable volume_by_type.
- Divide volume_by_type by the total volume of all avocados. Assign the result to variable proportion.
- Print the proportion, optionally multiplied by 100 to obtain a figure in percent.

In [None]:
# TASK 6 >>>> Group the data by type and compute the sum of the Total Volume


In [None]:
# TASK 6 >>>> Compute the proportion of each avocado type


In [None]:
# TASK 6 >>>> Print the output multiplied by 100


# 6. Appendix

Data Source: https://www.kaggle.com/neuromusic/avocado-prices

License: Database: Open Database, Contents: © Original Authors