# Descriptive Analytics: Numerical Summary
By Eli Yi-Liang Tung

Department of Analytics and Operations, Business School, NUS

## Learning Objectives
1. Use Pandas to obtain and interpret summary statistics.
2. Slice and dice data with group-wise operations, using the aggregate, filter, and apply functions in Pandas.

## Descriptive Measures
**Descriptive analytics** creates a summary of historical data to yield useful information and possibly prepare the data for further analysis. Such information may include some basic descriptive measures of data and graphs showing important insights.

### Centers, variations, and extreme points
The center of the data is usually expressed by the mean or the median, both of which can be obtained with the corresponding methods.

In [1]:
import pandas as pd

In [2]:
data_dict = {'wage': [3.10, 3.24, 3.00, 6.00, 5.30, 8.75],
             'educ': [11.0, 12.0, 11.0, 8.0, 12.0, 16.0],
             'exper': [2.0, 22.0, 2.0, 44.0, 7.0, 9.0],
             'female': [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
             'married': [0.0, 1.0, 0.0, 1.0, 1.0, 1.0]}

data = pd.DataFrame(data_dict)    # DataFrame constructor
data                              # Display the DataFrame

Unnamed: 0,wage,educ,exper,female,married
0,3.1,11.0,2.0,1.0,0.0
1,3.24,12.0,22.0,1.0,1.0
2,3.0,11.0,2.0,0.0,0.0
3,6.0,8.0,44.0,0.0,1.0
4,5.3,12.0,7.0,0.0,1.0
5,8.75,16.0,9.0,0.0,1.0


In [3]:
print(data.mean())          # Mean value of each column
print(type(data.mean()))    # Show the data type of the results

wage        4.898333
educ       11.666667
exper      14.333333
female      0.333333
married     0.666667
dtype: float64
<class 'pandas.core.series.Series'>


In [4]:
data.median()        # Median value of each column

wage        4.27
educ       11.50
exper       8.00
female      0.00
married     1.00
dtype: float64

In [5]:
type(data.median())  # Show the data type of the results

pandas.core.series.Series

Please note that the mean (or median) values of the columns are returned as a <code>pandas.Series</code>, whose index labels are the variable names rather than a sequence of integers.

Also notice that in the case of the 0-1 categorical variable "female", the mean value of 0.333 is the proportion of observations in the dataset labeled as "female". The same concept can be applied to the variable "married" as well.
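
As a small check of the point above, the sketch below recomputes the proportion in two ways for the `female` column. The variable names `proportion_by_mean` and `proportion_by_count` are illustrative only.

```python
import pandas as pd

# 0-1 indicator column, copied from the 'female' column of the small dataset above
female = pd.Series([1.0, 1.0, 0.0, 0.0, 0.0, 0.0], name='female')

# Mean of the 0-1 values
proportion_by_mean = female.mean()

# Number of ones divided by the number of observations
proportion_by_count = (female == 1).sum() / len(female)

print(proportion_by_mean, proportion_by_count)
```

Both numbers agree because summing a 0-1 column simply counts the ones.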

Similarly, standard deviations and variances can be calculated with the corresponding methods.

In [6]:
data.std()         # Sample Standard deviation of each column

wage        2.271479
educ        2.581989
exper      16.280868
female      0.516398
married     0.516398
dtype: float64

In [7]:
data.var()         # Sample Variance of each column

wage         5.159617
educ         6.666667
exper      265.066667
female       0.266667
married      0.266667
dtype: float64
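
The comments above say "sample" because `std` and `var` in Pandas divide by n - 1 by default (`ddof=1`). The following sketch, using the `wage` column only, recomputes the variance by hand and compares it with the population version (`ddof=0`). The names `sample_var` and `population_var` are illustrative.

```python
import pandas as pd

# 'wage' column copied from the small dataset above
wage = pd.Series([3.10, 3.24, 3.00, 6.00, 5.30, 8.75], name='wage')

n = len(wage)
squared_dev = ((wage - wage.mean()) ** 2).sum()   # sum of squared deviations

sample_var = squared_dev / (n - 1)    # same denominator as wage.var()
population_var = squared_dev / n      # same denominator as wage.var(ddof=0)

print(sample_var, wage.var())
print(population_var, wage.var(ddof=0))
print(sample_var ** 0.5, wage.std())  # standard deviation is the square root
```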

The maximum and minimum points in the dataset can also be found.

In [8]:
data.max()         # Maximum value of each column

wage        8.75
educ       16.00
exper      44.00
female      1.00
married     1.00
dtype: float64

In [9]:
data.min()         # Minimum value of each column

wage       3.0
educ       8.0
exper      2.0
female     0.0
married    0.0
dtype: float64

### Method <code>describe</code>

For <code>pandas.DataFrame</code> and <code>pandas.Series</code>, the method <code>describe</code> is a convenient tool to summarize some key measures altogether.

In [10]:
wage_summary = data.describe()  # Obtain the key descriptive measures 
wage_summary                    # Display these measures as a table

Unnamed: 0,wage,educ,exper,female,married
count,6.0,6.0,6.0,6.0,6.0
mean,4.898333,11.666667,14.333333,0.333333,0.666667
std,2.271479,2.581989,16.280868,0.516398,0.516398
min,3.0,8.0,2.0,0.0,0.0
25%,3.135,11.0,3.25,0.0,0.25
50%,4.27,11.5,8.0,0.0,1.0
75%,5.825,12.0,18.75,0.75,1.0
max,8.75,16.0,44.0,1.0,1.0


The variable <code>wage_summary</code> is a <code>pandas.DataFrame</code> table where the index labels are the names of the descriptive measures. 

Note that rows <code>25%</code>, <code>50%</code>, and <code>75%</code> represent the first (Q1), second (Q2), and third (Q3) quartiles, respectively. The value Q3 - Q1 is called the interquartile range (IQR).

<img src="http://www.brainfuse.com/quizUpload/c_83740/quartiles2.gif" width=450>
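
As a minimal sketch, the IQR of the `wage` column can be computed with the `quantile` method, which uses the same default interpolation as `describe`. The variable names are illustrative.

```python
import pandas as pd

# 'wage' column copied from the small dataset above
wage = pd.Series([3.10, 3.24, 3.00, 6.00, 5.30, 8.75], name='wage')

q1 = wage.quantile(0.25)   # first quartile, the '25%' row of describe()
q3 = wage.quantile(0.75)   # third quartile, the '75%' row of describe()
iqr = q3 - q1              # interquartile range

print(q1, q3, iqr)
```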

Besides the methods mentioned above, Pandas provides many others that compute descriptive measures in one line of code. See [Essential Basic Functionality](https://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics).

## Group-wise Operations in Pandas

<i><b>Background</b></i>: In the following, we use Singapore’s 4-Digits data set to illustrate the power of group-wise operations in Pandas. 4-Digits (abbreviated 4-D) is a fixed-odds lottery in Singapore. Players choose any number from 0000 to 9999, and each draw selects twenty-three winning numbers. A player wins a prize if the number they bought matches one of the winning numbers. There are five prize categories: **1st Prize**, **2nd Prize**, **3rd Prize**, **Starter Prizes** and **Consolation Prizes**. We would like to obtain a prize-specific summary of the 4-D lottery.

### Load data

The data file "4D_results_long.csv" has already been prepared. First, load it into Python.

In [11]:
data = pd.read_csv('4D_results_long.csv')

In [12]:
data.head(10)

Unnamed: 0,draw_no,date,querystring,weekday,year,week_no,prize,number,prize_type
0,4323,2018-09-30,sppl=RHJhd051bWJlcj00MzIz,Sun,2018,39,first_prize,1646,1st
1,4322,2018-09-29,sppl=RHJhd051bWJlcj00MzIy,Sat,2018,38,first_prize,1490,1st
2,4321,2018-09-26,sppl=RHJhd051bWJlcj00MzIx,Wed,2018,38,first_prize,3141,1st
3,4320,2018-09-23,sppl=RHJhd051bWJlcj00MzIw,Sun,2018,38,first_prize,5917,1st
4,4319,2018-09-22,sppl=RHJhd051bWJlcj00MzE5,Sat,2018,37,first_prize,7338,1st
5,4318,2018-09-19,sppl=RHJhd051bWJlcj00MzE4,Wed,2018,37,first_prize,939,1st
6,4317,2018-09-16,sppl=RHJhd051bWJlcj00MzE3,Sun,2018,37,first_prize,7127,1st
7,4316,2018-09-15,sppl=RHJhd051bWJlcj00MzE2,Sat,2018,36,first_prize,3444,1st
8,4315,2018-09-12,sppl=RHJhd051bWJlcj00MzE1,Wed,2018,36,first_prize,4281,1st
9,4314,2018-09-09,sppl=RHJhd051bWJlcj00MzE0,Sun,2018,36,first_prize,4185,1st


In [13]:
data.shape

(10810, 9)

### Column Summary

#### value_counts()

In [14]:
# count the instances of each prize type
data['prize_type'].value_counts()

consolation    4700
starter        4700
1st             470
3rd             470
2nd             470
Name: prize_type, dtype: int64

In [15]:
# show the top 10 most frequent winning numbers in the dataset
data['number'].value_counts().head(10)

7532    8
5788    7
9331    6
1289    6
9532    6
3442    6
9306    6
3509    6
5228    6
8304    6
Name: number, dtype: int64
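
Counts can also be expressed as shares of the total by passing `normalize=True` to `value_counts`. The sketch below uses a small made-up column rather than the real 4-D file, so the values are purely illustrative.

```python
import pandas as pd

# Hypothetical miniature of the 'prize_type' column (not the real 4-D data)
prize_type = pd.Series(
    ['1st', '2nd', '3rd',
     'starter', 'starter', 'starter',
     'consolation', 'consolation', 'consolation', 'consolation'],
    name='prize_type',
)

counts = prize_type.value_counts()                 # absolute frequencies
shares = prize_type.value_counts(normalize=True)   # relative frequencies

print(counts)
print(shares)
```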

#### describe()

In [16]:
# summary statistics of all the winning numbers
data['number'].describe()

count    10810.000000
mean      4957.073636
std       2905.037198
min          0.000000
25%       2400.250000
50%       4964.000000
75%       7479.750000
max       9999.000000
Name: number, dtype: float64

In [17]:
# summary of a categorical variable
data['weekday'].describe()

count     10810
unique        3
top         Sat
freq       3611
Name: weekday, dtype: object

### GroupBy

#### create a GroupBy object

In [18]:
data_by_prizetype = data.groupby(['prize_type'])

In [19]:
data_by_prizetype

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002CB6276ACA0>

#### GroupBy object attributes

In [20]:
data_by_prizetype.size()

prize_type
1st             470
2nd             470
3rd             470
consolation    4700
starter        4700
dtype: int64

In [21]:
#DataFrame column selection in GroupBy
data_by_prizetype['number']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002CB62779970>

In [22]:
data_by_prizetype[['number']]

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002CB62779C10>

In [23]:
data_by_prizetype.groups

{'1st': Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
             ...
             460, 461, 462, 463, 464, 465, 466, 467, 468, 469],
            dtype='int64', length=470),
 '2nd': Int64Index([470, 471, 472, 473, 474, 475, 476, 477, 478, 479,
             ...
             930, 931, 932, 933, 934, 935, 936, 937, 938, 939],
            dtype='int64', length=470),
 '3rd': Int64Index([ 940,  941,  942,  943,  944,  945,  946,  947,  948,  949,
             ...
             1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409],
            dtype='int64', length=470),
 'consolation': Int64Index([ 6110,  6111,  6112,  6113,  6114,  6115,  6116,  6117,  6118,
              6119,
             ...
             10800, 10801, 10802, 10803, 10804, 10805, 10806, 10807, 10808,
             10809],
            dtype='int64', length=4700),
 'starter': Int64Index([1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419,
             ...
             6100, 6101, 6102, 6103, 610

In [24]:
# Selecting a group
data_by_prizetype.get_group('2nd').head(10)

Unnamed: 0,draw_no,date,querystring,weekday,year,week_no,prize,number,prize_type
470,4323,2018-09-30,sppl=RHJhd051bWJlcj00MzIz,Sun,2018,39,second_prize,8122,2nd
471,4322,2018-09-29,sppl=RHJhd051bWJlcj00MzIy,Sat,2018,38,second_prize,4593,2nd
472,4321,2018-09-26,sppl=RHJhd051bWJlcj00MzIx,Wed,2018,38,second_prize,764,2nd
473,4320,2018-09-23,sppl=RHJhd051bWJlcj00MzIw,Sun,2018,38,second_prize,2345,2nd
474,4319,2018-09-22,sppl=RHJhd051bWJlcj00MzE5,Sat,2018,37,second_prize,7494,2nd
475,4318,2018-09-19,sppl=RHJhd051bWJlcj00MzE4,Wed,2018,37,second_prize,5409,2nd
476,4317,2018-09-16,sppl=RHJhd051bWJlcj00MzE3,Sun,2018,37,second_prize,6211,2nd
477,4316,2018-09-15,sppl=RHJhd051bWJlcj00MzE2,Sat,2018,36,second_prize,9074,2nd
478,4315,2018-09-12,sppl=RHJhd051bWJlcj00MzE1,Wed,2018,36,second_prize,3577,2nd
479,4314,2018-09-09,sppl=RHJhd051bWJlcj00MzE0,Sun,2018,36,second_prize,2622,2nd
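
Selecting a group with `get_group` picks out the same rows as filtering the original table with a boolean condition. A minimal sketch on a made-up miniature table (the `toy` DataFrame and the last value in it are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the 4-D data, for illustration only
toy = pd.DataFrame({
    'prize_type': ['1st', '2nd', '1st', '2nd', '3rd'],
    'number': [1646, 8122, 1490, 4593, 7001],
})

via_get_group = toy.groupby('prize_type').get_group('2nd')   # rows of one group
via_mask = toy[toy['prize_type'] == '2nd']                   # boolean filtering

print(via_get_group['number'].tolist())
print(via_mask['number'].tolist())
```

Note that the original row labels are preserved in both cases.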


In [25]:
# Iterating through groups
for name, group in data_by_prizetype:
    print(name)
    print(group.shape)
    print(type(group))

1st
(470, 9)
<class 'pandas.core.frame.DataFrame'>
2nd
(470, 9)
<class 'pandas.core.frame.DataFrame'>
3rd
(470, 9)
<class 'pandas.core.frame.DataFrame'>
consolation
(4700, 9)
<class 'pandas.core.frame.DataFrame'>
starter
(4700, 9)
<class 'pandas.core.frame.DataFrame'>


### Aggregation

#### Mean of winning numbers for each prize_type

In [26]:
# as series
data_by_prizetype['number'].mean()

prize_type
1st            4996.627660
2nd            5003.395745
3rd            4904.921277
consolation    4921.029362
starter        4989.745532
Name: number, dtype: float64

In [27]:
# as dataframe with the group variable as index
data_by_prizetype[['number']].mean()

prize_type,number
1st,4996.62766
2nd,5003.395745
3rd,4904.921277
consolation,4921.029362
starter,4989.745532


In [28]:
# as dataframe without index
data_by_prizetype[['number']].mean().reset_index()

Unnamed: 0,prize_type,number
0,1st,4996.62766
1,2nd,5003.395745
2,3rd,4904.921277
3,consolation,4921.029362
4,starter,4989.745532


In [29]:
# avoid index when grouping
data.groupby('prize_type', as_index=False)[['number']].mean()

Unnamed: 0,prize_type,number
0,1st,4996.62766
1,2nd,5003.395745
2,3rd,4904.921277
3,consolation,4921.029362
4,starter,4989.745532


#### Count of prize_type for each draw_no

In [30]:
data.groupby('draw_no')['prize_type'].value_counts()

draw_no  prize_type 
3854     consolation    10
         starter        10
         1st             1
         2nd             1
         3rd             1
                        ..
4323     consolation    10
         starter        10
         1st             1
         2nd             1
         3rd             1
Name: prize_type, Length: 2350, dtype: int64

### Aggregation by agg()

#### Applying multiple functions at once

In [31]:
import numpy as np
# Min/Median/Max of number for each weekday-prize_type combination
# First group by 'weekday' and 'prize_type'; the 'number' column is selected later, when aggregating
weekday_prize_gpby = data.groupby(['weekday', 'prize_type'])

In [32]:
weekday_prize_gpby

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002CB627791C0>

In [33]:
weekday_prize_gpby[['number']].agg([np.min, np.median, np.max])

weekday,prize_type,number (amin),number (median),number (amax)
Sat,1st,64,5154.0,9996
Sat,2nd,25,4849.0,9938
Sat,3rd,86,5108.0,9925
Sat,consolation,1,5165.5,9993
Sat,starter,3,5135.5,9995
Sun,1st,5,5061.0,9992
Sun,2nd,32,5248.0,9978
Sun,3rd,40,4493.0,9966
Sun,consolation,0,4882.0,9999
Sun,starter,5,5100.5,9997
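
As an alternative sketch, `agg` also accepts function names as strings, and "named aggregation" lets you choose the labels of the result columns. The `toy` table and the labels `lowest`, `middle`, and `highest` below are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature table, for illustration only
toy = pd.DataFrame({
    'weekday': ['Sat', 'Sat', 'Sat', 'Sun', 'Sun', 'Sun'],
    'number': [64, 5154, 9996, 5, 5061, 9992],
})

# List of function names: result columns are labelled by those names
by_name = toy.groupby('weekday')['number'].agg(['min', 'median', 'max'])

# Named aggregation: keyword = (column, function) sets the output labels
named = toy.groupby('weekday').agg(
    lowest=('number', 'min'),
    middle=('number', 'median'),
    highest=('number', 'max'),
)

print(by_name)
print(named)
```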


#### Applying different functions to different columns

In [34]:
weekday_prize_gpby.agg({'number': np.mean, 'date':np.max})

weekday,prize_type,number,date
Sat,1st,5184.949045,2018-09-29
Sat,2nd,4908.77707,2018-09-29
Sat,3rd,4875.140127,2018-09-29
Sat,consolation,5060.333758,2018-09-29
Sat,starter,5007.86879,2018-09-29
Sun,1st,5148.828025,2018-09-30
Sun,2nd,5191.235669,2018-09-30
Sun,3rd,4794.929936,2018-09-30
Sun,consolation,4900.069427,2018-09-30
Sun,starter,5036.996178,2018-09-30


#### agg with a customized function

In [35]:
year_gpby = data.groupby('year')

In [36]:
year_gpby['number'].agg(lambda x: sum(x > 9900))

year
2015    10
2016    38
2017    40
2018    25
Name: number, dtype: int64

In [37]:
# count how many times the winning number exceeds 9900 in each year
year_gpby[['number']].agg(lambda x: sum(x > 9900))

year,number
2015,10
2016,38
2017,40
2018,25
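
The same kind of count can be obtained without a lambda by first building a True/False column and then summing it within each group. This is only a sketch on made-up values; `toy`, `with_lambda`, and `with_flag` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature table, for illustration only
toy = pd.DataFrame({
    'year': [2015, 2015, 2016, 2016, 2016],
    'number': [9950, 1200, 9901, 9999, 9900],
})

# Custom function passed to agg, as in the cells above
with_lambda = toy.groupby('year')['number'].agg(lambda x: sum(x > 9900))

# Equivalent: flag the rows first, then sum the flags per year
with_flag = (toy['number'] > 9900).groupby(toy['year']).sum()

print(with_lambda)
print(with_flag)
```

The comparison is strict, so the value 9900 itself is not counted.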


### Transformation

#### Normalize number by year

In [38]:
year_gpby[['number']].transform(lambda x: (x - x.mean())/x.std())

Unnamed: 0,number
0,-1.152427
1,-1.206749
2,-0.631836
3,0.334825
4,0.829647
...,...
10805,1.388484
10806,1.401417
10807,1.484800
10808,1.453148
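
Unlike aggregation, `transform` returns one value per original row, so the result can be attached back to the table. A minimal sketch with made-up numbers chosen so that the z-scores are easy to verify by hand:

```python
import pandas as pd

# Hypothetical miniature table, for illustration only
toy = pd.DataFrame({
    'year': [2015, 2015, 2015, 2016, 2016, 2016],
    'number': [100, 200, 300, 4000, 5000, 6000],
})

# Standardize 'number' within each year: subtract the group mean,
# then divide by the group (sample) standard deviation
z = toy.groupby('year')['number'].transform(lambda x: (x - x.mean()) / x.std())

toy['number_z'] = z   # same length and index as the original rows
print(toy)
```

Within each year the standardized values have mean 0, even though the raw numbers in the two years are on very different scales.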


### Filtration

#### Focus on first prizes, select data only in weeks with 3 draws

In [39]:
filt_1st = (data['prize'] == 'first_prize')
data_1st = data.loc[filt_1st,:].copy()       # Create a new data set with 1st prize data only

In [40]:
year_wkno_1st_gpby = data_1st.groupby(['year', 'week_no'])

In [41]:
year_wkno_1st_gpby.filter(lambda x: len(x) > 2)

Unnamed: 0,draw_no,date,querystring,weekday,year,week_no,prize,number,prize_type
1,4322,2018-09-29,sppl=RHJhd051bWJlcj00MzIy,Sat,2018,38,first_prize,1490,1st
2,4321,2018-09-26,sppl=RHJhd051bWJlcj00MzIx,Wed,2018,38,first_prize,3141,1st
3,4320,2018-09-23,sppl=RHJhd051bWJlcj00MzIw,Sun,2018,38,first_prize,5917,1st
4,4319,2018-09-22,sppl=RHJhd051bWJlcj00MzE5,Sat,2018,37,first_prize,7338,1st
5,4318,2018-09-19,sppl=RHJhd051bWJlcj00MzE4,Wed,2018,37,first_prize,939,1st
...,...,...,...,...,...,...,...,...,...
464,3859,2015-10-14,sppl=RHJhd051bWJlcj0zODU5,Wed,2015,41,first_prize,9306,1st
465,3858,2015-10-11,sppl=RHJhd051bWJlcj0zODU4,Sun,2015,41,first_prize,6932,1st
466,3857,2015-10-10,sppl=RHJhd051bWJlcj0zODU3,Sat,2015,40,first_prize,8596,1st
467,3856,2015-10-07,sppl=RHJhd051bWJlcj0zODU2,Wed,2015,40,first_prize,4542,1st
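
`filter` keeps or drops whole groups: the function is evaluated once per group and must return True or False. A minimal sketch on a made-up table, where one week has three rows and another has two:

```python
import pandas as pd

# Hypothetical miniature table, for illustration only
toy = pd.DataFrame({
    'year': [2018, 2018, 2018, 2018, 2018],
    'week_no': [38, 38, 38, 39, 39],
    'number': [1490, 3141, 5917, 1646, 2222],
})

# Keep only the (year, week_no) groups that contain more than 2 rows
kept = toy.groupby(['year', 'week_no']).filter(lambda x: len(x) > 2)

print(kept)
```

All rows of week 38 survive and all rows of week 39 are dropped; the remaining rows keep their original index labels.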


### Apply

#### For each year and each prize_type, find the draw with smallest number and the one with the largest number

In [42]:
def draw_min_max(x):
    i = x['number'].idxmin()
    j = x['number'].idxmax()
    return pd.concat([x.loc[i, ['draw_no', 'number']], x.loc[j, ['draw_no', 'number']]])

In [43]:
data.groupby(['year', 'prize_type']).apply(draw_min_max)

year,prize_type,draw_no,number,draw_no,number
2015,1st,3883,235,3870,9992
2015,2nd,3889,320,3868,9926
2015,3rd,3856,135,3866,9720
2015,consolation,3887,13,3886,9994
2015,starter,3880,33,3869,9971
2016,1st,3972,5,4013,9996
2016,2nd,3977,42,3909,9978
2016,3rd,3983,86,4048,9979
2016,consolation,3915,0,3998,9988
2016,starter,3995,3,3905,9993
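
In the output above the column names `draw_no` and `number` each appear twice, once for the minimum and once for the maximum. One possible variant, sketched here on made-up data, returns a Series with distinct labels so that the columns can be told apart. The function name `draw_min_max_labelled` and the column labels are hypothetical.

```python
import pandas as pd

# Hypothetical miniature table, for illustration only
toy = pd.DataFrame({
    'year': [2015, 2015, 2015, 2016, 2016],
    'draw_no': [3856, 3870, 3883, 3972, 4013],
    'number': [135, 9992, 235, 5, 9996],
})

def draw_min_max_labelled(x):
    """Return the draws with the smallest and largest number, with distinct labels."""
    i = x['number'].idxmin()   # row label of the smallest number in the group
    j = x['number'].idxmax()   # row label of the largest number in the group
    return pd.Series({
        'min_draw_no': x.loc[i, 'draw_no'],
        'min_number': x.loc[i, 'number'],
        'max_draw_no': x.loc[j, 'draw_no'],
        'max_number': x.loc[j, 'number'],
    })

# Select the needed columns before apply, so the function sees only those
result = toy.groupby('year')[['draw_no', 'number']].apply(draw_min_max_labelled)
print(result)
```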
