# Descriptive Statistics

The simplest way to understand a dataset is to aggregate it.
An aggregation is a number or a category that summarizes the data. Vertica ML Python computes all the well-known aggregations in a single line.

The 'agg' method is the most direct way to compute several aggregations on several columns at the same time.

In [32]:
from vertica_ml_python import *
help(vDataFrame.agg)

Help on function agg in module vertica_ml_python.vdataframe:

agg(self, func:list, columns:list=[])
    ---------------------------------------------------------------------------
    Aggregates the vDataFrame using the input functions.
    
    Parameters
    ----------
    func: list
            List of the different aggregation.
                    approx_unique  : approximative cardinality
                    count          : number of non-missing elements
                    cvar           : conditional value at risk
                    dtype          : virtual column type
                    iqr            : interquartile range
                    kurtosis       : kurtosis
                    jb             : Jarque Bera index 
                    mad            : median absolute deviation
                    mae            : mean absolute error (deviation)
                    max            : maximum
                    mean           : average
                    median        

The 'agg' method helps you understand your data: it returns one row per column and one column per requested aggregation.

In [36]:
vdf = vDataFrame("public.churn")
vdf.agg(func = ["min", "10%", "median", "90%", "max", "kurtosis", "unique"])

| | min | 10% | median | 90% | max | kurtosis | unique |
|---|---|---|---|---|---|---|---|
| "Churn" | 0 | 0.0 | 0.0 | 1.0 | 1 | -0.870211342331981 | 2 |
| "TotalCharges" | 18.8 | 84.445 | 1397.475 | 5985.4476923077 | 8684.8 | -0.231798760869362 | 6530 |
| "tenure" | 0 | 2.0 | 29.0 | 69.0 | 72 | -1.38737163597169 | 73 |
| "Dependents" | 0 | 0.0 | 0.0 | 1.0 | 1 | -1.2343780571695 | 2 |
| "MonthlyCharges" | 18.25 | 20.05 | 70.3214285714286 | 102.6 | 118.75 | -1.25725969454951 | 1585 |
| "PhoneService" | 0 | 1.0 | 1.0 | 1.0 | 1 | 5.43890755508706 | 2 |
| "PaperlessBilling" | 0 | 0.0 | 1.0 | 1.0 | 1 | -1.85960618560884 | 2 |
| "SeniorCitizen" | 0 | 0.0 | 0.0 | 1.0 | 1 | 1.36259589579391 | 2 |
| "Partner" | 0 | 0.0 | 0.0 | 1.0 | 1 | -1.9959534211947 | 2 |
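
To make the idea of "several aggregations on several columns" concrete, here is a minimal pure-Python sketch. It does not use Vertica ML Python and runs locally on made-up data. The names `percentile`, `AGGREGATIONS` and `aggregate` are hypothetical, and the linear-interpolation percentile is an assumption: the quantiles Vertica computes may follow a different rule.

```python
from statistics import median


def percentile(values, q):
    """Percentile by linear interpolation between order statistics (q in [0, 1])."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical lookup table: aggregation name -> function over a list of values.
AGGREGATIONS = {
    "min": min,
    "median": median,
    "90%": lambda values: percentile(values, 0.9),
    "max": max,
    "unique": lambda values: len(set(values)),
}


def aggregate(columns, funcs):
    """Apply every requested aggregation to every column, skipping missing values."""
    result = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        result[name] = {func: AGGREGATIONS[func](present) for func in funcs}
    return result


# Made-up sample data, not the churn dataset.
sample = {
    "tenure": [1, 5, 12, 24, 72, None],
    "churn": [0, 1, 0, 0, 1, 1],
}
summary = aggregate(sample, ["min", "median", "90%", "max", "unique"])
for column, values in summary.items():
    print(column, values)
```

As in the output of 'agg', the result holds one entry per column and one value per aggregation.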

Some other methods are abstractions of the 'agg' method that simplify calls to specific aggregations. For example, the 'statistics' method computes the most useful quantiles and other important statistics in one line.

In [37]:
vdf.statistics()

| | skewness | kurtosis | count | avg | stddev | min | 10% | 25% | median | 75% | 90% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "Churn" | 1.06303144457513 | -0.870211342331981 | 7043.0 | 0.265369870793696 | 0.441561305121947 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| "TotalCharges" | 0.961642499724251 | -0.231798760869362 | 7032.0 | 2283.30044084187 | 2266.77136188314 | 18.8 | 84.445 | 402.683333333333 | 1397.475 | 3798.2375 | 5985.4476923077 | 8684.8 |
| "tenure" | 0.239539749561985 | -1.38737163597169 | 7043.0 | 32.3711486582422 | 24.5594810230945 | 0.0 | 2.0 | 9.0 | 29.0 | 55.0 | 69.0 | 72.0 |
| "Dependents" | 0.87519857729972 | -1.2343780571695 | 7043.0 | 0.299588243646173 | 0.458110167510015 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| "MonthlyCharges" | -0.220524433943982 | -1.25725969454951 | 7043.0 | 64.7616924605992 | 30.0900470976785 | 18.25 | 20.05 | 35.5 | 70.3214285714286 | 89.85 | 102.6 | 118.75 |
| "PhoneService" | -2.72715293844056 | 5.43890755508706 | 7043.0 | 0.903166264375976 | 0.295752231783635 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| "PaperlessBilling" | -0.375395747503722 | -1.85960618560884 | 7043.0 | 0.592219224762175 | 0.491456924049407 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| "SeniorCitizen" | 1.83363274409285 | 1.36259589579391 | 7043.0 | 0.162146812437882 | 0.368611605610013 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| "Partner" | 0.0679223834263394 | -1.9959534211947 | 7043.0 | 0.483032798523357 | 0.499747510719987 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
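
Several kurtosis values above are negative. Raw kurtosis can never fall below 1, so these figures must be excess kurtosis, where a normal distribution scores 0. The sketch below computes skewness and excess kurtosis from population central moments in plain Python. The helper names are hypothetical, and the exact estimator used by Vertica ML Python is an assumption here: it may apply a sample-size correction, so its values can differ slightly from these.

```python
from statistics import fmean


def central_moment(values, order):
    """Population central moment of the given order."""
    mean = fmean(values)
    return fmean([(v - mean) ** order for v in values])


def skewness(values):
    """Third standardized moment (population definition, no bias correction)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Made-up data: a symmetric column and an unbalanced binary flag.
symmetric = [1, 2, 3, 4, 5]
flags = [0, 0, 0, 1]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(flags), excess_kurtosis(flags))
```

The unbalanced flag has positive skewness because its rare value sits on the high side, which is the same pattern the 'Churn' and 'SeniorCitizen' rows show.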

You can also use the 'describe' method, which computes different information depending on the 'method' parameter.

In [38]:
vdf.describe()

| | count | mean | std | min | 25% | 50% | 75% | max | unique |
|---|---|---|---|---|---|---|---|---|---|
| Churn | 7043 | 0.265369870793696 | 0.441561305121947 | 0 | 0 | 0 | 1 | 1 | 2.0 |
| TotalCharges | 7032 | 2283.30044084187 | 2266.77136188314 | 18.8 | 402.683333333333 | 1397.475 | 3798.2375 | 8684.8 | 6530.0 |
| tenure | 7043 | 32.3711486582422 | 24.5594810230945 | 0 | 9 | 29 | 55 | 72 | 73.0 |
| Dependents | 7043 | 0.299588243646173 | 0.458110167510015 | 0 | 0 | 0 | 1 | 1 | 2.0 |
| MonthlyCharges | 7043 | 64.7616924605992 | 30.0900470976785 | 18.25 | 35.5 | 70.3214285714286 | 89.85 | 118.75 | 1585.0 |
| PhoneService | 7043 | 0.903166264375976 | 0.295752231783635 | 0 | 1 | 1 | 1 | 1 | 2.0 |
| PaperlessBilling | 7043 | 0.592219224762175 | 0.491456924049407 | 0 | 0 | 1 | 1 | 1 | 2.0 |
| SeniorCitizen | 7043 | 0.162146812437882 | 0.368611605610013 | 0 | 0 | 0 | 0 | 1 | 2.0 |
| Partner | 7043 | 0.483032798523357 | 0.499747510719987 | 0 | 0 | 0 | 1 | 1 | 2.0 |

In [43]:
vdf.describe(method = "categorical")

| | dtype | unique | count | top | top_percent |
|---|---|---|---|---|---|
| "PaymentMethod" | varchar(50) | 4 | 7043 | Electronic check | 33.579 |
| "OnlineBackup" | varchar(38) | 3 | 7043 | No | 43.845 |
| "gender" | varchar(20) | 2 | 7043 | Male | 50.476 |
| "Churn" | boolean | 2 | 7043 | False | 73.463 |
| "StreamingTV" | varchar(38) | 3 | 7043 | No | 39.898 |
| "TotalCharges" | numeric(9,3) | 6530 | 7032 | 20.2 | 0.156 |
| "Contract" | varchar(28) | 3 | 7043 | Month-to-month | 55.019 |
| "tenure" | int | 73 | 7043 | 1 | 8.704 |
| "DeviceProtection" | varchar(38) | 3 | 7043 | No | 43.944 |
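
The columns of the categorical summary can be reproduced on a small list with the standard library. This is only a sketch on made-up data: `describe_categorical` is a hypothetical helper, and it assumes that 'top_percent' is the share of the most frequent value among the non-missing elements, which the output above does not confirm.

```python
from collections import Counter


def describe_categorical(values):
    """Summarize a categorical column: distinct values, non-missing count, mode."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top, top_count = counts.most_common(1)[0]
    return {
        "unique": len(counts),
        "count": len(present),
        "top": top,
        # Assumption: percentage is relative to the non-missing elements.
        "top_percent": round(100 * top_count / len(present), 3),
    }


# Made-up contract values with one missing entry.
contracts = [
    "Month-to-month", "One year", "Month-to-month",
    "Two year", None, "Month-to-month",
]
contract_summary = describe_categorical(contracts)
print(contract_summary)
```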

Every aggregation is also available as its own built-in method. For example, you can compute the 'avg' of all the numerical columns in one line.

In [39]:
vdf.avg()

| | avg |
|---|---|
| "Churn" | 0.265369870793696 |
| "TotalCharges" | 2283.30044084187 |
| "tenure" | 32.3711486582422 |
| "Dependents" | 0.299588243646173 |
| "MonthlyCharges" | 64.7616924605992 |
| "PhoneService" | 0.903166264375976 |
| "PaperlessBilling" | 0.592219224762175 |
| "SeniorCitizen" | 0.162146812437882 |
| "Partner" | 0.483032798523357 |

Or just the 'median' of a specific column.

In [42]:
vdf["tenure"].median()

29.0

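It is also possible to use the 'groupby' method to compute customized aggregations. In the call below, the average of 'Churn' cast to an integer gives the churn rate for each combination of 'gender' and 'Contract'.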

In [46]:
vdf.groupby(["gender",
             "Contract"],
            ["AVG(Churn::int) AS churn"]).head(6)

| | gender | Contract | churn |
|---|---|---|---|
| 0 | Female | One year | 0.104456824512535 |
| 1 | Male | Two year | 0.0305882352941176 |
| 2 | Male | Month-to-month | 0.416923076923077 |
| 3 | Male | One year | 0.120529801324503 |
| 4 | Female | Two year | 0.0260355029585799 |
| 5 | Female | Month-to-month | 0.437402597402597 |

Name: groupby, Number of rows: 6, Number of columns: 3
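
Averaging a boolean cast to an integer, as `AVG(Churn::int)` does, yields the fraction of true values in each group. The following sketch shows the same computation in plain Python on made-up rows. `churn_rate_by_group` is a hypothetical helper for illustration, not part of Vertica ML Python.

```python
from collections import defaultdict


def churn_rate_by_group(rows, keys):
    """Average the boolean 'Churn' flag (as 0/1) for each combination of keys."""
    totals = defaultdict(lambda: [0, 0])  # group -> [churned, total]
    for row in rows:
        group = tuple(row[key] for key in keys)
        totals[group][0] += int(row["Churn"])
        totals[group][1] += 1
    return {group: churned / total for group, (churned, total) in totals.items()}


# Made-up rows, not the churn dataset.
rows = [
    {"gender": "Female", "Contract": "One year", "Churn": False},
    {"gender": "Female", "Contract": "One year", "Churn": True},
    {"gender": "Male", "Contract": "Two year", "Churn": False},
    {"gender": "Male", "Contract": "Month-to-month", "Churn": True},
    {"gender": "Male", "Contract": "Month-to-month", "Churn": True},
    {"gender": "Male", "Contract": "Month-to-month", "Churn": False},
]
rates = churn_rate_by_group(rows, ["gender", "Contract"])
for group, rate in rates.items():
    print(group, rate)
```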

Aggregations can be used directly to understand the data.

Graphics are another way to use the power of aggregations. The next chapter shows how to draw graphics in Vertica ML Python.