**Mean and median**

Summary statistics are exactly what they sound like: they summarize many numbers in one statistic. For example, mean, median, minimum, maximum, and standard deviation are summary statistics. Calculating summary statistics gives you a better sense of your data, even if there's a lot of it.

sales is available and pandas is loaded as pd.


* Explore your new DataFrame first by printing the first few rows of the sales DataFrame.
* Print information about the columns in sales.
* Print the mean of the Weekly_Sales column.
* Print the median of the Weekly_Sales column.

In [7]:
import pandas as pd
sales = pd.read_csv('/kaggle/input/walmart-sales/Walmart_Sales.csv')

In [8]:
# Print the head of the sales DataFrame
print(sales.head())

# Print the info about the sales DataFrame
print('---------------------------------------------------')
# info() prints its report directly and returns None, so no print() is needed
sales.info()



   Store        Date  Weekly_Sales  Holiday_Flag  Temperature  Fuel_Price  \
0      1  05-02-2010    1643690.90             0        42.31       2.572   
1      1  12-02-2010    1641957.44             1        38.51       2.548   
2      1  19-02-2010    1611968.17             0        39.93       2.514   
3      1  26-02-2010    1409727.59             0        46.63       2.561   
4      1  05-03-2010    1554806.68             0        46.50       2.625   

          CPI  Unemployment  
0  211.096358         8.106  
1  211.242170         8.106  
2  211.289143         8.106  
3  211.319643         8.106  
4  211.350143         8.106  
---------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         6435 non-null   int64  
 1   Date          6435 non-null   object 
 2   Weekly_Sales  6435 non

In [9]:
# Print the mean of Weekly_Sales
print(sales['Weekly_Sales'].mean())

# Print the median of Weekly_Sales
print(sales['Weekly_Sales'].median())

1046964.8775617715
960746.04


The mean weekly sales amount (about 1,046,965) is roughly 9% higher than the median (about 960,746). A mean above the median suggests a right-skewed distribution: a few very high sales weeks pull the mean up, while the median stays close to a typical week.
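To see how a handful of large values separates the mean from the median, here is a minimal sketch. The two small series are made-up numbers, not taken from the Walmart data; only the final ratio uses the mean and median printed above.

```python
import pandas as pd

# Hypothetical weekly sales figures (illustration only)
typical = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0])
with_spike = pd.Series([100.0, 110.0, 120.0, 130.0, 1000.0])

# Symmetric data: mean and median coincide
print(typical.mean(), typical.median())        # 120.0 120.0

# One very large week pulls the mean up but leaves the median alone
print(with_spike.mean(), with_spike.median())  # 292.0 120.0

# Ratio of the mean to the median printed for Weekly_Sales above
ratio = 1046964.8775617715 / 960746.04
print(round(ratio, 2))                         # 1.09
```

A ratio close to 1 means the two statistics roughly agree; the further the mean sits above the median, the stronger the pull from large values.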

**Summarizing dates**

Summary statistics can also be calculated on date columns that have values with the data type datetime64. Some summary statistics, like the mean, don't make a ton of sense on dates, but others are super helpful, for example, minimum and maximum, which allow you to see what time range your data covers. This only works as intended when the column really is datetime64; a column of date strings is compared alphabetically instead.

sales is available and pandas is loaded as pd.


* Print the maximum of the Date column.
* Print the minimum of the Date column.

In [10]:
# Print the maximum of the date column
print(sales['Date'].max())

# Print the minimum of the date column
print(sales['Date'].min())

31-12-2010
01-04-2011


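Super summarizing! Taking the minimum and maximum of a column of dates is handy for figuring out what time period your data covers. Be careful with the output above, though: the info() output shows Date stored as object (strings), so max() and min() compare the text character by character. "31-12-2010" and "01-04-2011" are simply the alphabetically last and first strings, not the end and start of the data (the first rows printed earlier already begin at 05-02-2010). To get the real range, convert the column first, for example with pd.to_datetime(sales['Date'], format='%d-%m-%Y'), and then take the minimum and maximum.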
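The difference between comparing date strings and comparing parsed dates can be sketched on a tiny example. The three values below are a hypothetical sample in the same dd-mm-YYYY layout as the Date column, and the `format` string is an assumption based on that layout.

```python
import pandas as pd

# Hypothetical sample of day-first date strings (dd-mm-YYYY)
dates = pd.Series(["05-02-2010", "31-12-2010", "01-04-2011"])

# As strings, min/max compare character by character
print(dates.min(), dates.max())   # 01-04-2011 31-12-2010

# Parse with an explicit day-first format to get datetime64 values
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

# Now min/max are the earliest and latest calendar dates
print(parsed.min())               # 2010-02-05 00:00:00
print(parsed.max())               # 2011-04-01 00:00:00
```

Passing an explicit format avoids any ambiguity between day-first and month-first dates.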

**Efficient summaries**

While pandas and NumPy have tons of functions, sometimes, you may need a different function to summarize your data.

The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example,

df['column'].agg(function)

In the custom function for this exercise, **"IQR" is short for inter-quartile range**, which is the 75th percentile minus the 25th percentile. It's an alternative to standard deviation that is helpful if your data contains outliers.
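As a rough illustration of why the IQR copes better with outliers than the standard deviation, the sketch below applies both to two made-up series that differ only in their largest value. The data are hypothetical; the `iqr` function has the same definition as the one used in this exercise.

```python
import pandas as pd

def iqr(column):
    # 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical data: identical except for one extreme value
clean = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = pd.Series([1.0, 2.0, 3.0, 4.0, 500.0])

# .agg() accepts a list mixing custom functions and built-in names
clean_stats = clean.agg([iqr, "std"])
outlier_stats = with_outlier.agg([iqr, "std"])

print(clean_stats)
print(outlier_stats)
```

In this example the IQR is 2.0 for both series, because the quartiles ignore the extreme value, while the standard deviation grows from under 2 to over 200.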

In [11]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
    
# Print IQR of the Temperature column
print(sales['Temperature'].agg(iqr))

# Update to print IQR of Temperature, Fuel_Price, & Unemployment
print(sales[["Temperature", "Fuel_Price", "Unemployment"]].agg(iqr))

27.479999999999997
Temperature     27.480
Fuel_Price       0.802
Unemployment     1.731
dtype: float64


In [12]:
# Create custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR and median of Temperature, Fuel_Price, & Unemployment
# Passing the string "median" instead of np.median avoids a FutureWarning in recent pandas versions
print(sales[["Temperature", "Fuel_Price", "Unemployment"]].agg([iqr, "median"]))

        Temperature  Fuel_Price  Unemployment
iqr           27.48       0.802         1.731
median        62.67       3.445         7.874




Excellent efficiency! The .agg() method makes it easy to compute multiple statistics on multiple columns, all in just one line of code.

**Cumulative statistics**

Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.

In the original exercise, sales_1_1 contains the sales data for department 1 of store 1. Here it is built from the full Walmart_Sales.csv file, which has no department column, so the cumulative values below run across all stores. pandas is loaded as pd.

* Sort the rows of sales_1_1 by the Date column in ascending order.
* Get the cumulative sum of Weekly_Sales and add it as a new column of sales_1_1 called cum_weekly_sales.
* Get the cumulative maximum of Weekly_Sales, and add it as a column called cum_max_sales.
* Print the Date, Weekly_Sales, cum_weekly_sales, and cum_max_sales columns.

In [16]:
sales_1 = pd.read_csv("/kaggle/input/walmart-sales/Walmart_Sales.csv")


In [17]:
# Sort sales_1_1 by date
sales_1_1 = sales_1.sort_values("Date", ascending=True)

# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1['cum_weekly_sales'] = sales_1_1['Weekly_Sales'].cumsum() 

# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1['cum_max_sales'] = sales_1_1["Weekly_Sales"].cummax()

# See the columns you calculated
print(sales_1_1[["Date", "Weekly_Sales", "cum_weekly_sales", "cum_max_sales"]])

            Date  Weekly_Sales  cum_weekly_sales  cum_max_sales
5208  01-04-2011     534578.78      5.345788e+05      534578.78
1204  01-04-2011     520962.14      1.055541e+06      534578.78
1776  01-04-2011    1864238.64      2.919780e+06     1864238.64
2634  01-04-2011    1305950.22      4.225730e+06     1864238.64
6066  01-04-2011     611585.54      4.837315e+06     1864238.64
...          ...           ...               ...            ...
1620  31-12-2010     891736.91      6.733058e+09     3818686.45
5767  31-12-2010    1001790.16      6.734059e+09     3818686.45
5624  31-12-2010     811318.30      6.734871e+09     3818686.45
2907  31-12-2010     672903.23      6.735544e+09     3818686.45
1763  31-12-2010    1675292.00      6.737219e+09     3818686.45

[6435 rows x 4 columns]


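You've accumulated success! Not all functions that calculate on columns return a single number. Some, like the cumulative statistic functions, return a whole column. One caveat about the output above: because Date is stored as strings, sort_values("Date") orders the rows alphabetically (01-04-2011 first, 31-12-2010 last) rather than chronologically, so the running totals do not follow the calendar. Converting Date with pd.to_datetime before sorting gives a true running total over time.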
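A minimal sketch of a chronological version is shown below on made-up rows. The `toy` DataFrame is hypothetical and only borrows the column names and dd-mm-YYYY date layout of the Walmart file; the filter on store 1 is an assumption about what a single-store running total would look like.

```python
import pandas as pd

# Hypothetical rows using the same column names as the Walmart file
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2],
    "Date": ["01-04-2011", "05-02-2010", "31-12-2010", "05-02-2010"],
    "Weekly_Sales": [300.0, 100.0, 200.0, 999.0],
})

# Parse the strings so that sorting is chronological, not alphabetical
toy["Date"] = pd.to_datetime(toy["Date"], format="%d-%m-%Y")

# Keep one store, then sort by the parsed dates
store_1 = toy[toy["Store"] == 1].sort_values("Date").copy()

# Running total and running maximum of weekly sales
store_1["cum_weekly_sales"] = store_1["Weekly_Sales"].cumsum()
store_1["cum_max_sales"] = store_1["Weekly_Sales"].cummax()

print(store_1[["Date", "Weekly_Sales", "cum_weekly_sales", "cum_max_sales"]])
```

With the dates parsed, the rows run from 2010-02-05 to 2011-04-01, and the cumulative sum reads 100, 300, 600 for the three store 1 rows.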