# Mean and median

Summary statistics are exactly what they sound like: they summarize many numbers in one statistic. For example, the mean, median, minimum, maximum, and standard deviation are all summary statistics. Calculating them gives you a better sense of your data.

In [1]:
# # Print the head of the sales DataFrame
# print(sales.head())

# # Print the info about the sales DataFrame
# print(sales.info())

# # Print the mean of weekly_sales
# print(sales["weekly_sales"].mean())

# # Print the median of weekly_sales
# print(sales["weekly_sales"].median())
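
As a minimal sketch of the difference between the two statistics, the snippet below uses a small made-up DataFrame (`toy_sales` is hypothetical and stands in for the real `sales` data). One unusually large week pulls the mean up, while the median stays near the typical value.

```python
import pandas as pd

# Hypothetical toy data standing in for the sales DataFrame
toy_sales = pd.DataFrame({"weekly_sales": [100.0, 200.0, 300.0, 1400.0]})

# The mean is sensitive to the one large value
mean_sales = toy_sales["weekly_sales"].mean()      # 500.0

# The median is the middle of the sorted values (average of 200 and 300)
median_sales = toy_sales["weekly_sales"].median()  # 250.0

print(mean_sales, median_sales)
```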

# Summarizing dates

Summary statistics can also be calculated on date columns whose values have the data type `datetime64`. Some summary statistics, like the mean, don't make much sense on dates, but others are very helpful. The minimum and maximum, for example, show what time range your data covers.

In [2]:
# # Print the maximum of the date column
# print(sales["date"].max())

# # Print the minimum of the date column
# print(sales["date"].min())
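
The following sketch shows the same idea on a small made-up date column (`toy_dates` is hypothetical, not the real `sales` data). Subtracting the minimum from the maximum gives the length of the covered time range.

```python
import pandas as pd

# Hypothetical dates, parsed into the datetime64 data type
toy_dates = pd.DataFrame(
    {"date": pd.to_datetime(["2010-02-05", "2010-03-12", "2010-01-08"])}
)

first_date = toy_dates["date"].min()  # earliest date in the column
last_date = toy_dates["date"].max()   # latest date in the column

# The difference of two timestamps is a Timedelta
span = last_date - first_date

print(first_date, last_date, span)
```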

# Efficient summaries

The `.agg()` method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, which keeps your aggregations concise.

In [3]:
# # A custom IQR function
# def iqr(column):
#     return column.quantile(0.75) - column.quantile(0.25)
    
# # Print IQR of the temperature_c column
# print(sales["temperature_c"].agg(iqr))

In [4]:
# # A custom IQR function
# def iqr(column):
#     return column.quantile(0.75) - column.quantile(0.25)

# # Update to print IQR of temperature_c, fuel_price_usd_per_l, & unemployment
# print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg(iqr))

In [5]:
# # Create custom IQR function
# def iqr(column):
#     return column.quantile(0.75) - column.quantile(0.25)

# # Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
# # ("median" is passed by name; recent pandas versions warn when given np.median here)
# print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, "median"]))

# Cumulative statistics

Cumulative statistics are helpful for tracking summary statistics over time. For example, they let you identify the total sales so far, as well as the highest weekly sales so far.

In [6]:
# # Sort sales_1_1 by date
# sales_1_1 = sales_1_1.sort_values("date")

# # Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
# sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()

# # Get the cumulative max of weekly_sales, add as cum_max_sales col
# sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()

# # See the columns you calculated
# print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])
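
A small sketch with made-up values (the `toy` DataFrame is hypothetical) shows why sorting by date comes first: the cumulative columns only read as "so far" when the rows are in chronological order.

```python
import pandas as pd

# Hypothetical toy data, deliberately out of date order
toy = pd.DataFrame({
    "date": pd.to_datetime(["2010-02-19", "2010-02-05", "2010-02-12"]),
    "weekly_sales": [150.0, 100.0, 300.0],
})

# Sort chronologically before accumulating
toy = toy.sort_values("date")

# Running total and running maximum of weekly sales
toy["cum_weekly_sales"] = toy["weekly_sales"].cumsum()
toy["cum_max_sales"] = toy["weekly_sales"].cummax()

print(toy)
```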

# Dropping duplicates

Removing duplicates is an essential skill for getting accurate counts, because you often don't want to count the same thing multiple times.

In [7]:
# # Drop duplicate store/type combinations
# store_types = sales.drop_duplicates(['store', 'type'])
# print(store_types.head())

# # Drop duplicate store/department combinations
# store_depts = sales.drop_duplicates(['store', 'department'])
# print(store_depts.head())

# # Subset the rows where is_holiday is True and drop duplicate dates
# holiday_dates = sales[sales['is_holiday']].drop_duplicates('date')

# # Print date col of holiday_dates
# print(holiday_dates['date'])
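
The sketch below illustrates how the choice of columns decides what counts as a duplicate. The `toy` DataFrame is made up for illustration; by default `.drop_duplicates()` keeps the first row of each combination.

```python
import pandas as pd

# Hypothetical toy data: two stores, some repeated combinations
toy = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "type": ["A", "A", "B", "B"],
    "department": [1, 2, 1, 1],
})

# One row per unique store/type pair
store_types = toy.drop_duplicates(subset=["store", "type"])

# One row per unique store/department pair
store_depts = toy.drop_duplicates(subset=["store", "department"])

print(len(store_types), len(store_depts))
```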

# Counting categorical variables

Counting is a great way to get an overview of your data and to spot curiosities that you might not notice otherwise.

In [8]:
# # Count the number of stores of each type
# store_counts = store_types['type'].value_counts()
# print(store_counts)

# # Get the proportion of stores of each type
# store_props = store_counts / store_counts.sum()
# print(store_props)

# # Count the number of each department number and sort
# dept_counts_sorted = store_depts["department"].value_counts(sort = True)
# print(dept_counts_sorted)

# # Get the proportion of departments of each number and sort
# dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
# print(dept_props_sorted)
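
As a minimal sketch on made-up data (the `toy_store_types` DataFrame is hypothetical), the example below shows counts and proportions side by side. Passing `normalize=True` gives the same result as dividing the counts by their sum.

```python
import pandas as pd

# Hypothetical toy data: one row per store, already de-duplicated
toy_store_types = pd.DataFrame({
    "store": [1, 2, 3, 4],
    "type": ["A", "A", "A", "B"],
})

# Number of stores of each type, most frequent first
counts = toy_store_types["type"].value_counts()

# Share of stores of each type
props = toy_store_types["type"].value_counts(normalize=True)

print(counts)
print(props)
```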

# What percent of sales occurred at each store type?

While `.groupby()` is useful, you can calculate grouped summary statistics without it.

In [9]:
# # Calc total weekly sales
# sales_all = sales["weekly_sales"].sum()

# # Subset for type A stores, calc total weekly sales
# sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()

# # Subset for type B stores, calc total weekly sales
# sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()

# # Subset for type C stores, calc total weekly sales
# sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()

# # Get proportion for each type
# sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
# print(sales_propn_by_type)
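
The manual approach can be sketched on made-up data as follows. The `toy` DataFrame and the `shares` dictionary are assumptions for illustration; the point is that each group needs its own subset-and-sum step.

```python
import pandas as pd

# Hypothetical toy data standing in for the sales DataFrame
toy = pd.DataFrame({
    "type": ["A", "A", "B", "C"],
    "weekly_sales": [50.0, 25.0, 15.0, 10.0],
})

# Total across all store types
total = toy["weekly_sales"].sum()

# Subset once per store type, sum, and divide by the total
shares = {
    store_type: toy.loc[toy["type"] == store_type, "weekly_sales"].sum() / total
    for store_type in ["A", "B", "C"]
}

print(shares)
```

The list of store types has to be written out by hand here, which becomes tedious and error-prone as the number of groups grows.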

# Calculations with .groupby()

The `.groupby()` method makes these tasks much easier.

In [10]:
# # Group by type; calc total weekly sales
# sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# # Get proportion for each type
# sales_propn_by_type = sales_by_type / sales["weekly_sales"].sum()
# print(sales_propn_by_type)

In [11]:
# # From previous step
# sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# # Group by type and is_holiday; calc total weekly sales
# sales_by_type_is_holiday = sales.groupby(["type","is_holiday"])["weekly_sales"].sum()
# print(sales_by_type_is_holiday)
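
A runnable sketch on made-up data (the `toy` DataFrame is hypothetical) shows both forms: grouping by one column, and grouping by a list of columns, which produces a result indexed by every combination that occurs in the data.

```python
import pandas as pd

# Hypothetical toy data standing in for the sales DataFrame
toy = pd.DataFrame({
    "type": ["A", "A", "B", "B"],
    "is_holiday": [False, True, False, False],
    "weekly_sales": [60.0, 20.0, 15.0, 5.0],
})

# Total sales per store type, then each type's share of the overall total
by_type = toy.groupby("type")["weekly_sales"].sum()
share_by_type = by_type / toy["weekly_sales"].sum()

# Total sales per (type, is_holiday) combination
by_type_holiday = toy.groupby(["type", "is_holiday"])["weekly_sales"].sum()

print(share_by_type)
print(by_type_holiday)
```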

# Multiple grouped summaries

The `.agg()` method is useful for computing multiple statistics on multiple variables. It also works with grouped data.

In [12]:
# # For each store type, aggregate weekly_sales: get min, max, mean, and median
# # (aggregations are named as strings, so NumPy is not needed)
# sales_stats = sales.groupby("type")["weekly_sales"].agg(["min", "max", "mean", "median"])

# # Print sales_stats
# print(sales_stats)

# # For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median
# # (select several columns with a list, i.e. double brackets)
# unemp_fuel_stats = sales.groupby("type")[["unemployment", "fuel_price_usd_per_l"]].agg(["min", "max", "mean", "median"])

# # Print unemp_fuel_stats
# print(unemp_fuel_stats)
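
The sketch below applies several aggregations to several columns of grouped data, using made-up values (the `toy` DataFrame is hypothetical). Note the double brackets: a list of column names selects multiple columns from the grouped object.

```python
import pandas as pd

# Hypothetical toy data standing in for the sales DataFrame
toy = pd.DataFrame({
    "type": ["A", "A", "B", "B"],
    "unemployment": [8.0, 10.0, 7.0, 7.0],
    "fuel_price_usd_per_l": [0.5, 0.75, 1.0, 0.5],
})

# One row per store type; columns are (variable, statistic) pairs
stats = (
    toy.groupby("type")[["unemployment", "fuel_price_usd_per_l"]]
    .agg(["min", "max", "mean", "median"])
)

print(stats)
```

The result has two-level column labels, so a single cell is addressed with a tuple such as `("unemployment", "mean")`.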

# Pivoting on one variable

Pivot tables are the standard way of aggregating data in spreadsheets. In pandas, pivot tables are essentially another way of performing grouped calculations: the `.pivot_table()` method is an alternative to `.groupby()`.

In [13]:
# # Pivot for mean weekly_sales for each store type
# mean_sales_by_type = sales.pivot_table(index="type", values="weekly_sales")

# # Print mean_sales_by_type
# print(mean_sales_by_type)

In [14]:
# # Pivot for mean and median weekly_sales for each store type
# # (aggregations are named as strings, so NumPy is not needed)
# mean_med_sales_by_type = sales.pivot_table(index="type", values="weekly_sales", aggfunc=["mean", "median"])

# # Print mean_med_sales_by_type
# print(mean_med_sales_by_type)

In [15]:
# # Pivot for mean weekly_sales by store type and holiday 
# mean_sales_by_type_holiday = sales.pivot_table(index="type", columns="is_holiday", values="weekly_sales", aggfunc="mean")

# # Print mean_sales_by_type_holiday
# print(mean_sales_by_type_holiday)
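
To make the relationship with `.groupby()` concrete, the sketch below computes the same per-type means both ways on made-up data (the `toy` DataFrame is hypothetical). The pivot table returns a DataFrame and the grouped calculation returns a Series, but the numbers match.

```python
import pandas as pd

# Hypothetical toy data standing in for the sales DataFrame
toy = pd.DataFrame({
    "type": ["A", "A", "B", "B"],
    "weekly_sales": [10.0, 30.0, 20.0, 40.0],
})

# Pivot table: mean weekly_sales per store type
pivoted = toy.pivot_table(index="type", values="weekly_sales", aggfunc="mean")

# The equivalent grouped calculation
grouped = toy.groupby("type")["weekly_sales"].mean()

print(pivoted)
print(grouped)
```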

# Fill in missing values and sum values with pivot tables

The `.pivot_table()` method has several useful arguments, including `fill_value` and `margins`:

- `fill_value` replaces missing values with a real value (known as imputation). What to replace missing values with is a topic big enough to have its own course (Dealing with Missing Data in Python), but the simplest thing to do is to substitute a dummy value.
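- `margins` is a shortcut for when you pivoted by two variables, but also wanted to pivot by each of those variables separately: with `margins=True`, the table gains an extra `All` row and column that apply the same aggregation across all rows and all columns (totals when summing, overall means when averaging).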

In [16]:
# # Print mean weekly_sales by department and type; fill missing values with 0
# print(sales.pivot_table(index="type", columns="department", values="weekly_sales", fill_value=0, aggfunc="mean"))

In [17]:
# # Print the total weekly_sales by department and type; fill missing values with 0s; add row and column totals
# print(sales.pivot_table(values="weekly_sales", index="department", columns="type", fill_value=0, aggfunc="sum", margins=True))
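
The effect of both arguments can be seen in a small sketch on made-up data (the `toy` DataFrame is hypothetical). Store type B has no department 2, so that cell would be missing without `fill_value`; `margins=True` adds the `All` row and column.

```python
import pandas as pd

# Hypothetical toy data: type B has no rows for department 2
toy = pd.DataFrame({
    "type": ["A", "A", "B"],
    "department": [1, 2, 1],
    "weekly_sales": [10.0, 30.0, 20.0],
})

# Sum weekly_sales by department and type, fill the gap with 0,
# and add an "All" row and column holding the totals
table = toy.pivot_table(
    values="weekly_sales",
    index="department",
    columns="type",
    aggfunc="sum",
    fill_value=0,
    margins=True,
)

print(table)
```

Because the aggregation here is a sum, the `All` entries are totals; with `aggfunc="mean"` they would be overall means instead.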