
# Descriptive statistics
2.3.2019, Sakari Lukkarinen<br>
Probability and Statistics<br>
[Helsinki Metropolia University of Applied Sciences](https://www.metropolia.fi/en)

## Background and Objectives
The aims of this Notebook exercise are to:
- get familiar with pandas descriptive statistics
- learn to plot the distribution of the data (histogram, box plot, violin plot)
- group the data by using categories


In [None]:
import os # operating system
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # standard graphics
import seaborn as sns # fancier graphics

# Input data files are available in the "../input/" directory.
print(os.listdir("../input"))

In [None]:
# Read the dataset
data = pd.read_csv('../input/avocado.csv')

In [None]:
# Show the first rows
data.head()

### Clean the data

The first column, `Unnamed: 0`, is just an index, so we drop it.

In [None]:
data = data.drop(columns = 'Unnamed: 0')
data.head()

The data type of the Date column is object. It needs to be converted to datetime.

In [None]:
data.info()

In [None]:
data['Date'] = pd.to_datetime(data['Date'])
data.info()
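
As a minimal sketch of what the conversion buys us, the cell below repeats it on a tiny table. The `toy` frame and its values are made up for illustration and are not part of the avocado dataset.

```python
import pandas as pd

# Hypothetical miniature of the avocado table; the values are made up.
toy = pd.DataFrame({
    "Date": ["2018-01-07", "2018-01-14", "2018-02-04"],
    "AveragePrice": [1.10, 1.25, 0.98],
})
print(toy["Date"].dtype)  # dates are still stored as text here

# Convert the text column to real timestamps.
toy["Date"] = pd.to_datetime(toy["Date"])
print(toy["Date"].dtype)

# The .dt accessor works only on datetime columns.
toy["month"] = toy["Date"].dt.month
print(toy)
```

Once the column holds timestamps, parts of the date such as the month or the year can be extracted and used for grouping.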

Check the data now.

In [None]:
data.head()

What are 4046, 4225 and 4770?

The page on avocado varieties at https://producebrands.com/the-avocado/ explains them:
- 4046 = Hass – small
- 4225 = Hass – large
- 4770 = Hass Extra Large

We rename the columns accordingly.


In [None]:
data = data.rename(columns = {'4046': 'small', '4225': 'large', '4770': 'xl'})
data.head()

In [None]:
# Show the last rows
data.tail()

### Descriptive statistics

- [count = number of observations (rows) in data](https://en.wikipedia.org/wiki/Sample_(statistics))
- [mean = arithmetic mean](https://en.wikipedia.org/wiki/Mean#Arithmetic_mean_(AM))
- [std = standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)
- [25%, 50%, 75% = quartiles](https://en.wikipedia.org/wiki/Quartile)
- [50% = median](https://en.wikipedia.org/wiki/Median#Finite_set_of_numbers)
- [min, max = sample minimum and maximum](https://en.wikipedia.org/wiki/Sample_maximum_and_minimum)

In [None]:
# Show the descriptive statistics
data.describe()
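
To make the rows of the table concrete, here is a small sketch that recomputes them one at a time. The sample `x` is hypothetical and was chosen so that the numbers are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, made up for illustration.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

summary = x.describe()
print(summary)

# The same statistics computed step by step.
n = x.count()                                      # number of non-missing observations
mean = x.sum() / n                                 # arithmetic mean
std = np.sqrt(((x - mean) ** 2).sum() / (n - 1))   # sample standard deviation (divides by n - 1)
q1, median, q3 = x.quantile([0.25, 0.50, 0.75])    # quartiles; the 50% quartile is the median

print(n, mean, std, q1, median, q3)
```

Note that pandas divides by n - 1 when it computes the standard deviation, so `std` is the sample standard deviation, not the population one.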

In [None]:
# What is the mean AveragePrice?
data['AveragePrice'].mean()

In [None]:
# Print it with fewer decimals
m = data['AveragePrice'].mean()
print('Mean value of AveragePrice = {:.2f}'.format(m))

In [None]:
# What is the standard deviation of Total Volume?
s = data['Total Volume'].std()
s

In [None]:
# Print the standard deviation of Total Volume in scientific notation with fewer decimals
print('Standard deviation of Total Volume = {:.2e}'.format(s))

Comparing `data.head()` with `data.describe()`, we notice that the 'type' and 'region' columns are missing from the descriptive statistics table because they contain text (categorical data). However, we can still study how many unique values these variables contain and how the values are distributed.

In [None]:
# What are the unique type values?
data['type'].unique()

In [None]:
# How many rows are there for each type?
data['type'].value_counts()

In [None]:
# What are the regions?
data['region'].unique()

In [None]:
# How many regions are there in total?
len(data['region'].unique())
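
The same questions can also be answered with relative frequencies and with `nunique()`. The sketch below uses a hypothetical four-row column whose labels mimic the 'type' column.

```python
import pandas as pd

# Hypothetical categorical column; the values are made up.
kind = pd.Series(["organic", "conventional", "organic", "organic"], name="type")

counts = kind.value_counts()                 # absolute frequencies
shares = kind.value_counts(normalize=True)   # relative frequencies, which sum to 1
n_unique = kind.nunique()                    # number of distinct values

print(counts, shares, n_unique, sep="\n\n")
```

`nunique()` ignores missing values by default, whereas `len(... .unique())` counts NaN as a value of its own, so the two can differ on columns with gaps.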

We can use type and region to group the data and calculate the statistics separately for each group.

In [None]:
data.groupby('type').describe()

In [None]:
# The same values with the table transposed
data.groupby('type').describe().T

In [None]:
# What if we only want to compare the mean AveragePrice between the types?
data['AveragePrice'].groupby(data['type']).mean()

In [None]:
# Another example: How does the sum of large avocados vary grouped by year?
data['large'].groupby(data['year']).sum()
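
Grouping also works with several statistics or several keys at once. The following is a sketch on a hypothetical four-row table. The column names follow the notebook, but the values are made up.

```python
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "type": ["conventional", "conventional", "organic", "organic"],
    "year": [2017, 2018, 2017, 2018],
    "AveragePrice": [1.00, 1.20, 1.60, 1.80],
})

# Several statistics of one column, one row per group.
by_type = toy.groupby("type")["AveragePrice"].agg(["mean", "min", "max"])
print(by_type)

# Grouping by two keys gives a result with a hierarchical index.
by_type_year = toy.groupby(["type", "year"])["AveragePrice"].mean()
print(by_type_year)
```

Selecting the column after `groupby` (`toy.groupby("type")["AveragePrice"]`) is equivalent to the `data['AveragePrice'].groupby(data['type'])` form used above.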

## Distribution plots

- [What is a histogram?](https://en.wikipedia.org/wiki/Histogram)
- [What is a box plot?](https://en.wikipedia.org/wiki/Box_plot)
- [What is a violin plot?](https://en.wikipedia.org/wiki/Violin_plot)


### Histograms

In [None]:
# What is the distribution of the average prices?
data.hist(column = 'AveragePrice', bins = 30, figsize = (8, 6))
plt.xlabel('Price')
plt.ylabel('Count')
plt.title('Distribution of avocado average prices')
plt.show()

It seems that the distribution has two peaks. Could this be due to the different avocado types?

In [None]:
# How do the prices differ by type?
ax = data.hist(column = 'AveragePrice', by = 'type', bins = 30, sharex = True, grid = True, figsize = (14, 6), xlabelsize = 12, ylabelsize = 12)
# Annotate the graphs (xlabel, ylabel and grid)
for i in range(2):
    ax[i].set_xlabel('Price', size = 14)
    ax[i].set_ylabel('Count', size = 14)
    ax[i].grid()
plt.show()
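
A histogram is just a table of counts per bin. As a sketch of what `hist` computes before drawing, the cell below bins a handful of hypothetical prices with `np.histogram`. The values and the bin range are made up for illustration.

```python
import numpy as np

# Hypothetical prices, made up for illustration.
prices = np.array([0.9, 1.0, 1.1, 1.1, 1.4, 1.5, 1.9, 2.0])

# Split the range 0.5 ... 2.0 into 3 equal-width bins and count the values in each.
counts, edges = np.histogram(prices, bins=3, range=(0.5, 2.0))

for left, right, c in zip(edges[:-1], edges[1:], counts):
    print('{:.1f} ... {:.1f}: {}'.format(left, right, c))
```

Every bin includes its left edge and excludes its right edge, except the last bin, which includes both. Changing `bins` changes how coarse or fine the picture of the distribution is.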

In [None]:
# Use seaborn to create distribution plot
import warnings

# sns.distplot() is deprecated since seaborn 0.11 and emits warnings. Ignore them.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    plt.figure(figsize=(12,5))
    plt.title("Distribution of avocado average price")
    ax = sns.distplot(data["AveragePrice"], color = 'r')
    plt.grid()

More info: [seaborn.distplot()](https://seaborn.pydata.org/generated/seaborn.distplot.html)

In [None]:
# For quick help, uncomment the next line
# ?sns.distplot

In [None]:
# Can we overlay the distribution of average price grouped by type?

# sns.distplot() is deprecated since seaborn 0.11 and emits warnings. Ignore them.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    # Create a figure, add title
    plt.figure(figsize=(12,5))
    plt.title("Distribution of average price grouped by type")

    # Plot the distribution of conventional type data
    mask0 = data['type'] == 'conventional'
    ax = sns.distplot(data["AveragePrice"][mask0], color = 'r', label = 'conventional')

    # Plot the distribution of organic type data
    mask1 = data['type'] == 'organic'
    ax = sns.distplot(data["AveragePrice"][mask1], color = 'g', label = 'organic')

    # add legend, show the graphics
    plt.legend()
    plt.grid()

### Boxplot

In [None]:
# Make a boxplot graph using pandas
data.boxplot(column = 'AveragePrice', by = 'type', figsize = (8,6))
plt.show()

In [None]:
# Make a boxplot graph with seaborn
plt.figure(figsize=(12,5))
sns.boxplot(y = "type", x = "AveragePrice", data = data, palette = 'pink')
plt.xlim([0, 4])
plt.grid()
plt.show()

More info: [seaborn boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot)
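
The elements of a box plot can also be computed numerically. The sketch below assumes the default whisker rule of matplotlib and seaborn (1.5 times the interquartile range) and uses a hypothetical sample with one unusually large value.

```python
import pandas as pd

# Hypothetical sample, made up for illustration.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

q1, median, q3 = x.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1                      # interquartile range = length of the box

# Fences: points beyond them are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
inside = x[(x >= lower_fence) & (x <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = x[(x < lower_fence) | (x > upper_fence)]

print('box: {} ... {}, median: {}'.format(q1, q3, median))
print('whiskers: {} ... {}'.format(whisker_low, whisker_high))
print('outliers:', outliers.tolist())
```

The box spans the middle half of the data, so a long whisker or many outliers on one side is a quick sign of a skewed distribution.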

### Violin plot

The pandas library doesn't provide a violin plot, see: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

Instead, we can use seaborn to draw violins.

In [None]:
# Violin plot using seaborn

# sns.violinplot() gives some warnings. Ignore them.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    plt.figure(figsize=(12,5))
    sns.violinplot(y = "type", x = "AveragePrice", data = data, palette = 'pink')
    plt.xlim([0, 4])
    plt.grid()
    plt.show()

More info: [Seaborn violinplot](https://seaborn.pydata.org/generated/seaborn.violinplot.html#seaborn.violinplot)

### Categorical plot

A categorical plot (`catplot`) is a higher-level function for making all kinds of statistical plots. More info: [seaborn catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html?highlight=catplot#seaborn.catplot)

In [None]:
# How do the avocado prices vary by region? First, the organic avocados
mask = data['type'] == 'organic'
g = sns.catplot(x = 'AveragePrice', y = 'region', data = data[mask],
                   height = 13,
                   aspect = 0.8,
                   palette = 'magma')
plt.xlim([0, 4])
plt.grid()
plt.show()

In [None]:
# How about the price distribution of conventional avocados by region?
mask = data['type'] == 'conventional'
g = sns.catplot(x = 'AveragePrice', y = 'region', data = data[mask],
                   height = 13,
                   aspect = 0.8,
                   palette = 'magma')
plt.xlim([0, 4])
plt.grid()
plt.show()

In [None]:
# Overlay the distributions; we use hue = 'type' for overlaying
g = sns.catplot(data = data,
                x = 'AveragePrice', 
                y = 'region',
                hue = 'type',
                height = 13,
                aspect = 0.8,
                palette = 'magma')
plt.xlim([0, 4])
plt.grid()
plt.show()

## Next steps
- How does the Total Volume vary by type and region?
- In which year were the total avocado sales (AveragePrice * Total Volume) highest?
- How much do the total sales vary by calendar month?

## More reading
- [Basic statistics in pandas DataFrame](https://medium.com/@kasiarachuta/basic-statistics-in-pandas-dataframe-594208074f85)
- [Pandas describe](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)
- [Pandas group by](http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)
- [Pandas visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)
- [Seaborn tutorials](https://seaborn.pydata.org/tutorial.html)
- Seaborn [distplot](https://seaborn.pydata.org/generated/seaborn.distplot.html), [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html?highlight=boxplot#seaborn.boxplot), [violinplot](https://seaborn.pydata.org/generated/seaborn.violinplot.html#seaborn.violinplot), [catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html?highlight=catplot#seaborn.catplot)

## Acknowledgments

- Many thanks to: [Explore avocados from all sides](https://www.kaggle.com/hely333/explore-avocados-from-all-sides)
- Useful ideas on how to use seaborn: [Data Visualization-Seaborn(Beginner)](https://www.kaggle.com/fetenbasak/data-visualization-seaborn-beginner)