## Descriptive Statistics

## Contents


1. Setup
1. Generate Statistics Summary
1. Central Tendency
1. Computations

## 1. Setup

Load the required libraries.

In [5]:
import pandas as pd
import numpy as np
import datetime
(pd.__version__,np.__version__)

Create sample data `order_pdf` for demonstration below.

In [7]:
np.random.seed(1)
order_pdf=pd.DataFrame({'Item':['A']*6+['B']*6+['C']*6,
                  'Price':np.random.rand(18),
                  'Quantity':np.random.randint(2,100,size=18),
                  'Type':['Fruit']*12+['Drink']*6,
                  'Date':[datetime.date(2013, i, 1) for i in range(1, 7)]*3})
order_pdf.head()

In [8]:
order_pdf.info()

Notice that `order_pdf` has three columns with the `object` data type, one column with the `float` data type and one column with the `integer` data type. `order_pdf` has 18 observations and 5 columns in total.

Create sample data `order_pdf_NA` with `NaN` values for the demonstrations below.

In [11]:
order_pdf_NA=order_pdf.stack().sample(frac=0.9).unstack()
order_pdf_NA['Price']=order_pdf_NA['Price'].astype(float)
order_pdf_NA['Quantity']=order_pdf_NA['Quantity'].astype(float)
order_pdf_NA.head()

Notice that, compared to `order_pdf`, `order_pdf_NA` has random `NaN` values in each column.
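
The `stack`/`sample`/`unstack` chain above is compact, so a smaller sketch may help show why it leaves `NaN` values behind. The frame `demo`, the seed and the `random_state` argument are illustrative assumptions and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame, used only to illustrate the mechanism.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Item": ["A"] * 5 + ["B"] * 5,
    "Price": rng.random(10),
    "Quantity": rng.integers(2, 100, size=10),
})

# stack() reshapes the frame into one long Series with an entry per cell.
cells = demo.stack()

# Keep a random 90% of the cells; random_state makes the draw repeatable.
kept = cells.sample(frac=0.9, random_state=0)

# unstack() rebuilds the table; every cell that was not sampled becomes NaN.
demo_na = kept.unstack()

print(demo_na.isna().sum())  # number of missing values per column
```

Because the cells are dropped at random, the position of the missing values changes from run to run unless a `random_state` is fixed.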

## 2. Generate Statistics Summary

To generate a statistics summary for this dataframe, use the `describe` method, which by default returns `count`, `mean`, `std`, `min`, the 25th, 50th and 75th percentiles, and `max` for numeric series.

The `describe` method takes `percentiles` and `include` parameters.
-  `percentiles`: The percentiles to include in the output.
-  `include`: A list of data types to include in the output.


Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

Use the `describe` method to return a descriptive statistics summary of the `order_pdf` dataframe. By default, it returns summaries only for numeric series.

In [16]:
order_pdf.describe()

Notice that the `describe` method only returns a summary for the numeric columns `Price` and `Quantity`. By default, it returns the `count`, `mean`, `std`, `min` and `max` of each numeric series, together with the `25%`, `50%` and `75%` percentiles.

Use single brackets to extract one column of the `order_pdf` dataframe and summarize it with the `describe` method. Below we extract and summarize the `Price` column.

In [19]:
order_pdf['Price'].describe()

Notice that the output includes all 18 observations and the mean price of all orders is 0.38.

Use the `percentiles` parameter of the `describe` method to choose which percentiles appear in the output. The default is `[0.25,0.5,0.75]`, which returns the 25th, 50th and 75th percentiles.

In [22]:
order_pdf.describe(percentiles=[0.1,0.6,0.7])

Instead of returning the 25th, 50th, 75th percentiles, the output returns the 10th, 50th, 60th, 70th percentiles of numeric series `Price` and `Quantity`.
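
As a rough check of what the percentile rows contain, the sketch below compares them with `Series.quantile` on a hypothetical series whose values are easy to verify by hand. The series `values` is an assumption made for illustration only.

```python
import pandas as pd

# Hypothetical values 1.0, 2.0, ..., 11.0, chosen so the percentiles are round numbers.
values = pd.Series(range(1, 12), dtype=float)

summary = values.describe(percentiles=[0.1, 0.6, 0.7])
print(summary)

# Each requested percentile row should match Series.quantile for the same fraction.
quantiles = values.quantile([0.1, 0.6, 0.7])
print(quantiles)
```

With the default linear interpolation, the 10th, 60th and 70th percentiles of this series are 2.0, 7.0 and 8.0.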

To get the summary for non-numeric series, set the `include` parameter of the `describe` method to `'object'`. For each non-numeric series it returns the number of non-missing values (`count`), the number of unique values (`unique`), the most commonly occurring value (`top`) and the frequency of that value (`freq`).

In [25]:
order_pdf.describe(include='object')

Notice that when the `include` parameter is set to `'object'`, the `describe` method does not return a summary for numeric series.
Take the `Type` column as an example: it has 18 values and 2 unique values, the most commonly occurring value is `Fruit`, and `Fruit` occurs 12 times.
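
The four statistics can also be checked on a single column. The sketch below rebuilds only the `Type` values from the sample data (12 `Fruit`, 6 `Drink`) and cross-checks `top` and `freq` against `value_counts`; the variable names are illustrative.

```python
import pandas as pd

# Same values as the Type column of the sample data.
types = pd.Series(["Fruit"] * 12 + ["Drink"] * 6, name="Type")

# For a non-numeric Series, describe() reports count, unique, top and freq.
summary = types.describe()
print(summary)

# value_counts() lists every distinct value with its frequency, most common first.
counts = types.value_counts()
print(counts)
```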

The `describe` method with the parameter `include='all'` gives a summary of all columns.

In [28]:
order_pdf.describe(include='all')

The result includes summaries for both numeric and object series. Notice that for the `Date`, `Item` and `Type` columns, which hold categorical values, numeric statistics such as `mean` and `std` are not defined, so those cells show `NaN`. In the same way, `unique`, `top` and `freq` show `NaN` for the numeric columns.

To get a summary for each item separately, call `describe` after grouping with the `groupby` method.

In [31]:
order_pdf.groupby(['Item']).describe()

The `describe` method generates numeric summaries for `Price` and `Quantity` of each item instead of generating numeric summaries for all observations of `order_pdf`.
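
A minimal sketch of the same idea on a hypothetical two-item frame: the per-group summary for item `A` should equal the summary obtained by filtering the rows of item `A` first. The frame `demo` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical prices for two items.
demo = pd.DataFrame({
    "Item": ["A", "A", "A", "B", "B", "B"],
    "Price": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# One row of summary statistics per item.
per_item = demo.groupby("Item")["Price"].describe()
print(per_item)

# The same statistics computed on the rows of item A only.
only_a = demo.loc[demo["Item"] == "A", "Price"].describe()
print(only_a)
```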

Now use the `describe` method on `order_pdf_NA` to see how it handles a dataframe with `NaN` values.

In [34]:
order_pdf.describe(include='all')

In [35]:
order_pdf_NA.describe(include='all')

Compared to `order_pdf`, the dataframe `order_pdf_NA` has missing values in all columns so the `count` for each column is less than 18. The `describe` method skips the `NaN` values in the `order_pdf_NA` and generates numeric summaries based on valid values in the dataframe.
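
The effect of missing values on `count` and `mean` can be seen on a short hypothetical series. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Five entries, two of them missing.
prices = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

summary = prices.describe()

print(len(prices))       # total number of entries, missing ones included
print(summary["count"])  # number of non-missing entries only
print(summary["mean"])   # mean of the non-missing entries
```

Here `count` is 3 rather than 5, and the mean is computed from 1.0, 3.0 and 5.0 only.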

## 3. Central Tendency

After generating a statistics summary, use `mean` and `median` to measure the central tendency of a dataframe, and `var` and `std` to measure how widely the values spread around that center.

Reference:

1. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html
1. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html
1. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.std.html
1. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.var.html
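
As a quick illustration of what these four methods compute, the sketch below applies them to a hypothetical series. One detail worth knowing is that pandas `var` and `std` use `ddof=1` by default, which gives the sample variance and sample standard deviation; passing `ddof=0` gives the population versions.

```python
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

center_mean = values.mean()      # arithmetic mean
center_median = values.median()  # middle value (average of the two middle values here)

sample_var = values.var()            # divides by n - 1 (ddof=1, the default)
sample_std = values.std()            # square root of the sample variance
population_var = values.var(ddof=0)  # divides by n

print(center_mean, center_median)
print(sample_var, sample_std, population_var)
```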

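The `mean` method summarizes numeric series. In pandas 2.0 and later, pass `numeric_only=True` so that the non-numeric columns are skipped instead of raising an error; older releases dropped them by default.

In [40]:
order_pdf.mean(numeric_only=True)

Notice that the result contains a mean value only for the numeric series of the dataframe `order_pdf`.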

Combine `groupby` method and `mean` method to return the average price and quantity for each item.

In [43]:
order_pdf.groupby(['Item']).mean(numeric_only=True)  # skip the non-numeric Type and Date columns

Notice that the average price of item A is lower than those of item B and item C. The average quantity sold of item B is lower than those of item A and item C.

Calculate the variance, standard deviation and median of the `Price` and `Quantity` of each item with the `var`, `std` and `median` methods.

In [46]:
order_pdf.groupby(['Item']).var(numeric_only=True)

In [47]:
order_pdf.groupby(['Item']).std(numeric_only=True)

In [48]:
order_pdf.groupby(['Item']).median(numeric_only=True)

Use the `mean`, `median`, `var` and `std` methods on the dataframe `order_pdf_NA`, which has `NaN` values.

In [50]:
order_pdf_NA.groupby(['Item']).mean(numeric_only=True)

In [51]:
order_pdf_NA.groupby(['Item']).median(numeric_only=True)

In [52]:
order_pdf_NA.groupby(['Item']).var(numeric_only=True)

In [53]:
order_pdf_NA.groupby(['Item']).std(numeric_only=True)

Notice that by default the methods exclude `NaN` values. If we set the parameter `skipna=False`, the `NaN` values are included in the calculation, and any column that contains a `NaN` returns `NaN`.

In [55]:
order_pdf_NA.mean(skipna=False, numeric_only=True)

In [56]:
order_pdf_NA.var(skipna=False, numeric_only=True)

When `NaN` values are included, the result of `mean`, `median`, `std` and `var` is `NaN`.
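
The difference between the default behavior and `skipna=False` is easiest to see on a short hypothetical series with one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing entry.
prices = pd.Series([1.0, np.nan, 3.0])

default_mean = prices.mean()             # NaN is skipped: (1.0 + 3.0) / 2
strict_mean = prices.mean(skipna=False)  # NaN is kept, so the result is NaN

print(default_mean)
print(strict_mean)
```

Using `skipna=False` can be a deliberate choice when a missing value should invalidate the statistic instead of being silently ignored.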

## 4. Computations

Pandas offers computation methods to analyze data. We will use the following methods to summarize the `Quantity` column of `order_pdf`:
1.  `sum` method: Return the sum of values on the required axis.
1.  `cumsum` method: Return cumulative sum over a DataFrame axis.
1.  `diff` method: Return the difference of a DataFrame element compared with another element in the DataFrame.
1.  `pct_change` method: Return percentage change between the current element and a prior element in the DataFrame.
1.  `cummax` method: Return cumulative maximum over a DataFrame axis.
1.  `cummin` method: Return cumulative minimum over a DataFrame axis.
1.  `corr` method: Compute the pairwise correlation of columns; it uses the Pearson correlation by default.


Reference: 
1. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html
1. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.cumsum.html
1. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.diff.html
1. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pct_change.html#pandas.DataFrame.pct_change
1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cummax.html
1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cummin.html
1. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html

Use the `sum` method to find the total quantity of all orders.

In [61]:
order_pdf['Quantity'].sum()

The total quantity of all orders is 842.

Calculate the cumulative sum of quantity using the `cumsum` method.

In [64]:
order_pdf['Quantity'].cumsum()

The result shows that the first two orders sold 194 items in total, the first three orders sold 282, and so on.
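
The relationship between `sum` and `cumsum` can be sketched on a few hypothetical quantities: each cumulative value is the running total so far, and the last one equals the overall sum.

```python
import pandas as pd

# Hypothetical order quantities.
quantity = pd.Series([10, 20, 30, 40])

total = quantity.sum()       # single overall total
running = quantity.cumsum()  # running total after each order

print(total)
print(running.tolist())
```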

Use `diff` method to calculate the difference in quantity between current order and previous order.

In [67]:
order_pdf['Quantity'].diff()

Notice that because the first order has no previous value to compare against, the first result is `NaN`. The second order has 2 more items than the first order. The seventh order and the sixteenth order show a remarkable increase in quantity ordered.

Calculate the difference in quantity between the current order and the order N positions earlier by setting the parameter `periods=N`.

In [70]:
order_pdf['Quantity'].diff(periods=2)

The result shows that the third order sold 8 fewer items than the first order. The first order and the second order have no earlier values to compare against, so their results are `NaN`.
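
One way to read `diff(periods=N)` is as the series minus a copy of itself shifted down by N positions. The sketch below checks this on hypothetical quantities; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical order quantities.
quantity = pd.Series([5, 8, 6, 10])

step_one = quantity.diff()           # current order minus the previous order
step_two = quantity.diff(periods=2)  # current order minus the order two positions earlier

# The same result written out with shift().
manual_two = quantity - quantity.shift(2)

print(step_one.tolist())
print(step_two.tolist())
```

The first N results are `NaN` because those orders have no earlier value N positions back.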

Calculate percent change of quantity between orders using `pct_change` method.

In [73]:
order_pdf['Quantity'].pct_change()

Each value is the relative change in quantity compared to the previous order, expressed as a fraction rather than multiplied by 100. As with the `diff` method, the first order has no previous order to compare against, so its result is `NaN`. The output shows that the quantity sold fluctuates from order to order.

Calculate the percent change in quantity between the current order and the order N positions earlier by setting the parameter `periods=N`.

In [76]:
order_pdf['Quantity'].pct_change(periods=2)

The result shows that, compared to the first order, the quantity of the third order decreased by 0.08333 (about 8.3%). Since the first order and the second order have no earlier values to compare against, their results are `NaN`. Compared to the twelfth order, the quantity ordered in the fourteenth order increased by 40%.
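
The fraction returned by `pct_change(periods=N)` is the current value divided by the value N positions earlier, minus 1. The sketch below checks that reading on hypothetical quantities.

```python
import pandas as pd

# Hypothetical order quantities.
quantity = pd.Series([100.0, 110.0, 99.0])

change_one = quantity.pct_change()           # versus the previous order
change_two = quantity.pct_change(periods=2)  # versus the order two positions earlier

# The same calculation written out with shift().
manual_one = quantity / quantity.shift(1) - 1

print(change_one.tolist())
print(change_two.tolist())
```

Here the second order is 10% above the first, the third is 10% below the second, and the third is 1% below the first.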

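Use the `cummax` method to track the largest quantity seen so far. The method returns the cumulative maximum over the `Quantity` column: the final value is the largest quantity of all orders, and the first position where that value appears identifies the order.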

In [79]:
order_pdf['Quantity'].cummax()

Notice that the second order has the largest quantity sold, so the cumulative maximum stays at that value for all subsequent orders.

Use the `cummin` method to track the smallest quantity seen so far. The method returns the cumulative minimum over the `Quantity` column: the final value is the smallest quantity of all orders, and the first position where that value appears identifies the order.

In [82]:
order_pdf['Quantity'].cummin()

The cumulative minimum reaches its lowest value, 2, at the 11th order, so the 11th order has the smallest quantity of all orders.
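
A short sketch on hypothetical quantities shows how the running maximum and minimum behave. It also uses `idxmax` and `idxmin`, which are not covered above, as a more direct way to locate the extreme orders.

```python
import pandas as pd

# Hypothetical order quantities.
quantity = pd.Series([3, 7, 5, 9, 2])

running_max = quantity.cummax()  # largest value seen up to each order
running_min = quantity.cummin()  # smallest value seen up to each order

print(running_max.tolist())
print(running_min.tolist())

# idxmax/idxmin return the index label of the first largest/smallest value.
largest_at = quantity.idxmax()
smallest_at = quantity.idxmin()
print(largest_at, smallest_at)
```

The running maximum only changes when a new record is set, which is why it stays flat after the largest order.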

Use the `corr` method to compute the Pearson correlation coefficient between `Price` and `Quantity`. In pandas 2.0 and later, pass `numeric_only=True` so that the non-numeric columns are skipped.

In [85]:
order_pdf.corr(numeric_only=True)

The method returns the Pearson coefficient between the numeric series of `order_pdf` by default. The Pearson correlation coefficient between `Price` and `Quantity` is 0.31, so `Price` and `Quantity` have a weak positive relationship.
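
To make the coefficient less of a black box, the sketch below computes it three ways on a hypothetical all-numeric frame: with `DataFrame.corr`, with `Series.corr`, and with NumPy's `corrcoef`. The frame `demo` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data.
demo = pd.DataFrame({
    "Price": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Quantity": [2.0, 4.0, 5.0, 4.0, 5.0],
})

matrix = demo.corr()                            # full correlation matrix (Pearson by default)
pairwise = demo["Price"].corr(demo["Quantity"])  # one coefficient for a pair of columns
by_numpy = np.corrcoef(demo["Price"], demo["Quantity"])[0, 1]

print(matrix)
print(pairwise, by_numpy)
```

The matrix is symmetric with ones on the diagonal, so the off-diagonal entry is the only value of interest when there are two columns.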

__The End__