### Disclaimer
Please note, the Vantage Functions via SQLAlchemy feature is a preview/beta code release with limited functionality (the “Code”). As such, you acknowledge that the Code is experimental in nature and that the Code is provided “AS IS” and may not be functional on any machine or in any environment. TERADATA DISCLAIMS ALL WARRANTIES RELATING TO THE CODE, EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES AGAINST INFRINGEMENT OF THIRD-PARTY RIGHTS, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

TERADATA SHALL NOT BE RESPONSIBLE OR LIABLE WITH RESPECT TO ANY SUBJECT MATTER OF THE CODE UNDER ANY CONTRACT, NEGLIGENCE, STRICT LIABILITY OR OTHER THEORY 
    (A) FOR LOSS OR INACCURACY OF DATA OR COST OF PROCUREMENT OF SUBSTITUTE GOODS, SERVICES OR TECHNOLOGY, OR 
    (B) FOR ANY INDIRECT, INCIDENTAL OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO LOSS OF REVENUES AND LOSS OF PROFITS. TERADATA SHALL NOT BE RESPONSIBLE FOR ANY MATTER BEYOND ITS REASONABLE CONTROL.

Notwithstanding anything to the contrary: 
    (a) Teradata will have no obligation of any kind with respect to any Code-related comments, suggestions, design changes or improvements that you elect to provide to Teradata in either verbal or written form (collectively, “Feedback”), and 
    (b) Teradata and its affiliates are hereby free to use any ideas, concepts, know-how or techniques, in whole or in part, contained in Feedback: 
        (i) for any purpose whatsoever, including developing, manufacturing, and/or marketing products and/or services incorporating Feedback in whole or in part, and 
        (ii) without any restrictions or limitations, including requiring the payment of any license fees, royalties, or other consideration. 

In [1]:
# This notebook covers examples of the following regular aggregate functions.
# SQL Documentation: https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/c2fX4dzxCcDJFKqXbyQtTA
    # 1. avg/average/ave
    # 2. corr
    # 3. count
    # 4. covar_pop
    # 5. covar_samp
    # 6. var_pop
    # 7. var_samp
    # 8. kurtosis
    # 9. max/maximum
    # 10. REGR_AVGX
    # 11. REGR_AVGY
    # 12. REGR_Count
    # 13. REGR_Intercept
    # 14. REGR_SLOPE
    # 15. REGR_R2
    # 16. REGR_SXX
    # 17. REGR_SXY
    # 18. REGR_SYY
    # 19. min/minimum
    # 20. skew
    # 21. stddev_pop
    # 22. stddev_samp
    # 23. sum

In [2]:
# Connect to Vantage using create_context().
from teradataml import *
import getpass
td_context = create_context(host=getpass.getpass("Hostname: "), username=getpass.getpass("Username: "), password=getpass.getpass("Password: "))
# Load the example dataset.
load_example_data("GLM", ["admissions_train"])

Hostname: ········
Username: ········
Password: ········


In [3]:
# Create the DataFrame on 'admissions_train' table
admissions_train = DataFrame("admissions_train")
admissions_train

   masters   gpa     stats programming  admitted
id                                              
15     yes  4.00  Advanced    Advanced         1
7      yes  2.33    Novice      Novice         1
22     yes  3.46    Novice    Beginner         0
17      no  3.83  Advanced    Advanced         1
13      no  4.00  Advanced      Novice         1
38     yes  2.65  Advanced    Beginner         1
26     yes  3.57  Advanced    Advanced         1
5       no  3.44    Novice      Novice         0
34     yes  3.85  Advanced    Beginner         0
40     yes  3.95    Novice    Beginner         0

In [4]:
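import sqlalchemy  # explicit import for the NullType check below

def print_variables(df, columns):
    # Print the equivalent SQL, the DataFrame, its dtypes, and the SQLAlchemy type of each listed column.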
    print("Equivalent SQL: {}".format(df.show_query()))
    print("\n")
    print(" ************************* DataFrame ********************* ")
    print(df)
    print("\n\n")
    print(" ************************* DataFrame.dtypes ********************* ")
    print(df.dtypes)
    print("\n\n")
    if isinstance(columns, str):
        columns = [columns]
    for col in columns:
        coltype = df.__getattr__(col).type
        if isinstance(coltype, sqlalchemy.sql.sqltypes.NullType):
            coltype = "NullType"
        print(" '{}' Column Type: {}".format(col, coltype))

# Using Aggregate Functions from Teradata Vantage with SQLAlchemy

In [5]:
# Import func from SQLAlchemy to use the same for executing aggregate functions
from sqlalchemy import func

In [6]:
# Before moving on to the examples, note how a teradataml DataFrame and
# its columns are used to create a SQLAlchemy ClauseElement/Expression.

# The examples below often contain something like 'admissions_train.admitted.expression'.
# In that expression:
#    'admissions_train' is a teradataml DataFrame.
#    'admitted' is a column name in the teradataml DataFrame 'admissions_train'.
#    Thus,
#        'admissions_train.admitted' together forms a ColumnExpression.
#    'expression' allows a teradataml ColumnExpression to be treated as a SQLAlchemy expression.
#    Thus,
#        'admissions_train.admitted.expression' gives an expression that can be used with SQLAlchemy ClauseElements.

## Avg/Average/Ave Function

In [7]:
# Function returns the arithmetic average of all values in value_expression.
# Syntax:
#         Avg(value_expression)

In [8]:
agg_func_ = func.avg(admissions_train.gpa.expression)
type(agg_func_)

sqlalchemy.sql.functions.Function

In [9]:
df = admissions_train.assign(True, avg_gpa_=agg_func_, 
                             average_admitted_=func.average(admissions_train.admitted.expression),
                             ave_admitted_=func.ave(admissions_train.admitted.expression))
print_variables(df, ["avg_gpa_", "average_admitted_", "ave_admitted_"])

Equivalent SQL: select ave(admitted) AS ave_admitted_, average(admitted) AS average_admitted_, avg(gpa) AS avg_gpa_ from "admissions_train"


 ************************* DataFrame ********************* 
   ave_admitted_  average_admitted_  avg_gpa_
0           0.65               0.65   3.54175



 ************************* DataFrame.dtypes ********************* 
ave_admitted_        float
average_admitted_    float
avg_gpa_             float



 'avg_gpa_' Column Type: FLOAT
 'average_admitted_' Column Type: FLOAT
 'ave_admitted_' Column Type: FLOAT


## CORR Function

In [10]:
# Function returns the sample Pearson product-moment correlation coefficient of its arguments for all non-null data point pairs.
# Syntax:
#         Corr(value_expression1, value_expression2)

In [11]:
df = admissions_train.assign(True, 
                             corr_numeric_=func.corr(admissions_train.admitted.expression, admissions_train.gpa.expression))
print_variables(df, ["corr_numeric_"])

Equivalent SQL: select corr(admitted, gpa) AS corr_numeric_ from "admissions_train"


 ************************* DataFrame ********************* 
   corr_numeric_
0      -0.022265



 ************************* DataFrame.dtypes ********************* 
corr_numeric_    float



 'corr_numeric_' Column Type: FLOAT
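
As a rough cross-check outside the database, the Pearson coefficient over non-null pairs can be sketched in plain Python. The helper name `pearson_corr` and the sample values are illustrative assumptions; they are not part of teradataml or the `admissions_train` data.

```python
import math

def pearson_corr(ys, xs):
    """Pearson correlation over pairs where both values are non-null."""
    pairs = [(y, x) for y, x in zip(ys, xs) if y is not None and x is not None]
    n = len(pairs)
    mean_y = sum(y for y, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    sxy = sum((y - mean_y) * (x - mean_x) for y, x in pairs)
    sxx = sum((x - mean_x) ** 2 for _, x in pairs)
    syy = sum((y - mean_y) ** 2 for y, _ in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data; the pair containing None is skipped.
admitted = [1, 0, 1, None, 1]
gpa = [3.9, 2.1, 3.5, 3.0, 3.7]
print(pearson_corr(admitted, gpa))
```

A value near 0, like the -0.022265 reported above, indicates almost no linear relationship between the two columns.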


## Count Function

In [12]:
# Function returns the number of rows for which value_expression is not null.
# Syntax:
#         Count(value_expression)

In [13]:
df = admissions_train.assign(True, assined_count_col_=func.count(admissions_train.admitted.expression))
print_variables(df, ["assined_count_col_"])

Equivalent SQL: select count(admitted) AS assined_count_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_count_col_
0                  40



 ************************* DataFrame.dtypes ********************* 
assined_count_col_    int



 'assined_count_col_' Column Type: INTEGER


## Covar_pop Function

In [14]:
# Function returns the population covariance of its arguments for all non-null data point pairs.
# Syntax:
#         Covar_pop(value_expression1, value_expression2)

In [15]:
df = admissions_train.assign(True, 
                             assined_col_Covar_pop=func.Covar_pop(admissions_train.admitted.expression, admissions_train.gpa.expression))
print_variables(df, ["assined_col_Covar_pop"])

Equivalent SQL: select Covar_pop(admitted, gpa) AS "assined_col_Covar_pop" from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_Covar_pop
0              -0.005387



 ************************* DataFrame.dtypes ********************* 
assined_col_Covar_pop    float



 'assined_col_Covar_pop' Column Type: FLOAT


## Covar_samp Function

In [16]:
# Function returns the sample covariance of its arguments for all non-null data point pairs.
# Syntax:
#         Covar_samp(value_expression1, value_expression2)

In [17]:
df = admissions_train.assign(True, 
                             assined_col_Covar_samp=func.Covar_samp(admissions_train.admitted.expression, admissions_train.gpa.expression))
print_variables(df, ["assined_col_Covar_samp"])

Equivalent SQL: select Covar_samp(admitted, gpa) AS "assined_col_Covar_samp" from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_Covar_samp
0               -0.005526



 ************************* DataFrame.dtypes ********************* 
assined_col_Covar_samp    float



 'assined_col_Covar_samp' Column Type: FLOAT
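
The population and sample covariances differ only in the divisor: n for the population, n - 1 for the sample. The sketch below shows that on a small hypothetical data set, then checks that the two results reported above follow the same ratio for 40 rows. The `covariance` helper is an illustrative name, not a teradataml function.

```python
import math

def covariance(ys, xs, sample=False):
    """Covariance of paired values; divides by n - 1 when sample=True, else by n."""
    n = len(xs)
    mean_y = sum(ys) / n
    mean_x = sum(xs) / n
    sxy = sum((y - mean_y) * (x - mean_x) for y, x in zip(ys, xs))
    return sxy / (n - 1 if sample else n)

# Hypothetical values, not taken from admissions_train.
ys = [1, 0, 1, 1]
xs = [3.9, 2.1, 3.5, 3.7]
pop = covariance(ys, xs)
samp = covariance(ys, xs, sample=True)
print(pop, samp)

# The displayed results are rounded, so compare with a small tolerance.
reported_pop, reported_samp, n_rows = -0.005387, -0.005526, 40
ratio_holds = math.isclose(reported_pop * n_rows / (n_rows - 1), reported_samp, abs_tol=2e-6)
print(ratio_holds)
```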


## Kurtosis Function

In [18]:
# Function returns the kurtosis of the distribution of value_expression.
# Syntax:
#         Kurtosis(value_expression)

In [19]:
df = admissions_train.assign(True, assined_col_Kurtosis_num=func.Kurtosis(admissions_train.gpa.expression))
print_variables(df, ["assined_col_Kurtosis_num"])

Equivalent SQL: select Kurtosis(gpa) AS "assined_col_Kurtosis_num" from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_Kurtosis_num
0                  4.052659



 ************************* DataFrame.dtypes ********************* 
assined_col_Kurtosis_num    float



 'assined_col_Kurtosis_num' Column Type: FLOAT
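
For intuition, here is a minimal sketch of a kurtosis calculation in plain Python. It assumes the bias-adjusted sample excess kurtosis formula (the one spreadsheet `KURT` functions use); whether Vantage applies exactly this formula should be confirmed against the SQL documentation linked at the top. The `kurtosis` helper and its input are hypothetical.

```python
import math

def kurtosis(values):
    """Bias-adjusted sample excess kurtosis; needs at least 4 values with non-zero spread."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample std dev
    fourth = sum(((v - mean) / s) ** 4 for v in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

# Evenly spread values are flatter than a normal distribution, so the result is negative.
print(kurtosis([1, 2, 3, 4, 5]))
```

Under this definition a positive value, like the one reported for `gpa`, points to heavier tails than a normal distribution.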


## max/maximum Function

In [20]:
# Function returns a column value that is the maximum value for value_expression.
# Syntax:
#         max(value_expression)

In [21]:
df = admissions_train.assign(True, 
                             assined_col_max=func.max(admissions_train.gpa.expression),
                             assined_col_maximum=func.maximum(admissions_train.stats.expression))
print_variables(df, ["assined_col_maximum", "assined_col_max"])

Equivalent SQL: select max(gpa) AS assined_col_max, maximum(stats) AS assined_col_maximum from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_max assined_col_maximum
0              4.0              Novice



 ************************* DataFrame.dtypes ********************* 
assined_col_max        float
assined_col_maximum      str



 'assined_col_maximum' Column Type: VARCHAR
 'assined_col_max' Column Type: FLOAT


## min/minimum Function

In [22]:
# Function returns a column value that is the minimum value for value_expression.
# Syntax:
#         min(value_expression)

In [23]:
df = admissions_train.assign(True, 
                             assined_col_min=func.min(admissions_train.gpa.expression),
                             assined_col_minimum=func.minimum(admissions_train.stats.expression))
print_variables(df, ["assined_col_min", "assined_col_minimum"])

Equivalent SQL: select min(gpa) AS assined_col_min, minimum(stats) AS assined_col_minimum from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_min assined_col_minimum
0             1.87            Advanced



 ************************* DataFrame.dtypes ********************* 
assined_col_min        float
assined_col_minimum      str



 'assined_col_min' Column Type: FLOAT
 'assined_col_minimum' Column Type: VARCHAR
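
The string results above (`maximum(stats)` is `Novice`, `minimum(stats)` is `Advanced`) follow from lexical ordering rather than from skill level. A small plain-Python illustration, with the caveat that database collation rules can differ from Python's code-point ordering; for these three values the two agree:

```python
# The three values that appear in the 'stats' column of the sample output.
stats_levels = ["Advanced", "Novice", "Beginner"]

# Strings compare character by character, so 'A' < 'B' < 'N'.
print(max(stats_levels))
print(min(stats_levels))
```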


## REGR_AVGX Function

In [24]:
# Function returns the mean of the independent_variable_expression for all non-null data pairs of the 
# dependent and independent variable arguments.
# Syntax:
#         REGR_AVGX(dependent_value_expression, independent_value_expression)

In [25]:
df = admissions_train.assign(True, 
                             assined_col_=func.regr_avgx(admissions_train.admitted.expression, 
                                                         admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select regr_avgx(admitted, gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0       3.54175



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT


## REGR_AVGY Function

In [26]:
# Function returns the mean of the dependent_variable_expression for all non-null data pairs of the 
# dependent and independent variable arguments.
# Syntax:
#         REGR_AVGY(dependent_value_expression, independent_value_expression)

In [27]:
df = admissions_train.assign(True, 
                             assined_col_=func.regr_avgy(admissions_train.admitted.expression, 
                                                         admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select regr_avgy(admitted, gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0          0.65



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT


## REGR_Count Function

In [28]:
# Function returns the count of all non-null data pairs of the dependent and independent variable arguments.
# Syntax:
#         REGR_count(dependent_value_expression, independent_value_expression)

In [29]:
df = admissions_train.assign(True, 
                             assined_col_=func.REGR_count(admissions_train.admitted.expression, 
                                                         admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select REGR_count(admitted, gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0            40



 ************************* DataFrame.dtypes ********************* 
assined_col_    int



 'assined_col_' Column Type: INTEGER


## REGR_Intercept Function

In [30]:
# Function returns the intercept of the univariate linear regression line through all non-null data pairs of the 
# dependent and independent variable arguments.
# Syntax:
#         REGR_Intercept(dependent_value_expression, independent_value_expression)

In [31]:
df = admissions_train.assign(True, 
                             assined_col_=func.REGR_Intercept(admissions_train.admitted.expression, 
                                                         admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select REGR_Intercept(admitted, gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0      0.724144



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT


## REGR_R2 Function

In [32]:
# Function returns the coefficient of determination for all non-null data pairs of the dependent and independent 
# variable arguments.
# Syntax:
#         REGR_R2(dependent_value_expression, independent_value_expression)

In [33]:
df = admissions_train.assign(True, 
                             assined_col_=func.REGR_R2(admissions_train.admitted.expression, 
                                                         admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select REGR_R2(admitted, gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0      0.000496



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT


## REGR_SLOPE Function

In [34]:
# Function returns the slope of the univariate linear regression line through all non-null data pairs of the 
# dependent and independent variable arguments.
# Syntax:
#         REGR_SLOPE(dependent_value_expression, independent_value_expression)

In [35]:
df = admissions_train.assign(True, 
                             assined_col_=func.REGR_SLOPE(admissions_train.admitted.expression, 
                                                         admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select REGR_SLOPE(admitted, gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0     -0.020934



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT


## REGR_SXX Function

In [36]:
# Function returns the sum of the squared deviations of the independent_variable_expression from its mean 
# for all non-null data pairs of the dependent and independent variable arguments.
# Syntax:
#         REGR_SXX(dependent_value_expression, independent_value_expression)

In [37]:
df = admissions_train.assign(True, 
                             assined_col_=func.REGR_SXX(admissions_train.admitted.expression, 
                                                         admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select REGR_SXX(admitted, gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0     10.294177



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT


## REGR_SXY Function

In [38]:
# Function returns the sum of the products of the deviations of the independent_variable_expression and the 
# dependent_variable_expression from their means, for all non-null data pairs of the dependent and independent variable arguments.
# Syntax:
#         REGR_SXY(dependent_value_expression, independent_value_expression)

In [39]:
df = admissions_train.assign(True, 
                             assined_col_=func.REGR_SXY(admissions_train.admitted.expression, 
                                                         admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select REGR_SXY(admitted, gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0       -0.2155



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT


## REGR_SYY Function

In [40]:
# Function returns the sum of the squared deviations of the dependent_variable_expression from its mean 
# for all non-null data pairs of the dependent and independent variable arguments.
# Syntax:
#         REGR_SYY(dependent_value_expression, independent_value_expression)

In [41]:
df = admissions_train.assign(True, 
                             assined_col_=func.REGR_SYY(admissions_train.admitted.expression, 
                                                         admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select REGR_SYY(admitted, gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0           9.1



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT
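
The REGR_* results above are tied together by the standard least-squares identities: the slope is SXY / SXX, the intercept is AVGY - slope * AVGX, and R2 is SXY² / (SXX * SYY). The sketch below recomputes them from the displayed (rounded) outputs, so it needs no database connection; the variable names are illustrative.

```python
import math

# Values copied from the outputs above (rounded as displayed).
avg_x, avg_y = 3.54175, 0.65              # REGR_AVGX, REGR_AVGY
sxx, sxy, syy = 10.294177, -0.2155, 9.1   # REGR_SXX, REGR_SXY, REGR_SYY

slope = sxy / sxx                     # compare with REGR_SLOPE
intercept = avg_y - slope * avg_x     # compare with REGR_Intercept
r2 = sxy ** 2 / (sxx * syy)           # compare with REGR_R2
corr = sxy / math.sqrt(sxx * syy)     # compare with CORR

for name, derived, reported in [
    ("slope", slope, -0.020934),
    ("intercept", intercept, 0.724144),
    ("r2", r2, 0.000496),
    ("corr", corr, -0.022265),
]:
    # Tolerance allows for rounding in the displayed values.
    print(name, round(derived, 6), math.isclose(derived, reported, abs_tol=1e-5))
```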


## Skew Function

In [42]:
# Function returns the skewness of the distribution of value_expression.
# Syntax:
#         skew(value_expression)

In [43]:
df = admissions_train.assign(True, assined_col_int=func.skew(admissions_train.admitted.expression),
                            assined_col_float=func.skew(admissions_train.gpa.expression))
print_variables(df, ["assined_col_int", "assined_col_float"])

Equivalent SQL: select skew(gpa) AS assined_col_float, skew(admitted) AS assined_col_int from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_float  assined_col_int
0          -2.058969        -0.653746



 ************************* DataFrame.dtypes ********************* 
assined_col_float    float
assined_col_int      float



 'assined_col_int' Column Type: FLOAT
 'assined_col_float' Column Type: FLOAT
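
The skewness of `admitted` can be approximately reproduced offline. The sketch assumes the bias-adjusted sample skewness formula and assumes `admitted` holds only 0 and 1, in which case a mean of 0.65 over 40 rows means 26 ones and 14 zeros. Both assumptions, and the `skew` helper itself, are illustrative rather than taken from the Vantage documentation.

```python
import math

def skew(values):
    """Bias-adjusted sample skewness; needs at least 3 values with non-zero spread."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample std dev
    third = sum(((v - mean) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * third

# Assumed reconstruction of the 'admitted' column: 26 ones and 14 zeros.
admitted = [1] * 26 + [0] * 14
derived = skew(admitted)
print(round(derived, 6), math.isclose(derived, -0.653746, abs_tol=1e-6))
```

A negative value means the longer tail is on the low side, which fits the few low `gpa` values pulling that column's skewness to about -2.06.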


## stddev_pop Function

In [44]:
# Function returns the population standard deviation for the non-null data points in value_expression.
# Syntax:
#         stddev_pop(value_expression)

In [45]:
df = admissions_train.assign(True, assined_col_=func.stddev_pop(admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select stddev_pop(gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0      0.507301



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT


## stddev_samp Function

In [46]:
# Function returns the sample standard deviation for the non-null data points in value_expression.
# Syntax:
#         stddev_samp(value_expression)

In [47]:
df = admissions_train.assign(True, assined_col_=func.stddev_samp(admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select stddev_samp(gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0      0.513764



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT


## sum Function

In [48]:
# Function returns a column value that is the arithmetic sum of value_expression.
# Syntax:
#         sum(value_expression)

In [49]:
df = admissions_train.assign(True, assined_col_=func.sum(admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select sum(gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0        141.67



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT
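
The sum ties back to the average computed earlier: dividing `sum(gpa)` by the number of non-null rows gives `avg(gpa)`. The sketch below first shows a null-skipping average on hypothetical values (SQL aggregates ignore NULLs), then checks the figures reported in this notebook. `sql_avg` is an illustrative helper name.

```python
import math

def sql_avg(values):
    """Average that skips None, mirroring how SQL AVG ignores NULLs."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

# Hypothetical values: the None entry does not count toward the divisor.
print(sql_avg([3.0, None, 4.0]))

# Reported above: sum(gpa) = 141.67 over 40 rows, avg(gpa) = 3.54175.
derived_avg = 141.67 / 40
print(math.isclose(derived_avg, 3.54175, abs_tol=1e-5))
```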


## var_pop Function

In [50]:
# Function returns the population variance for the data points in value_expression.
# Syntax:
#         var_pop(value_expression)

In [51]:
df = admissions_train.assign(True, assined_col_=func.var_pop(admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select var_pop(gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0      0.257354



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT


## var_samp Function

In [52]:
# Function returns the sample variance for the data points in value_expression.
# Syntax:
#         var_samp(value_expression)

In [53]:
df = admissions_train.assign(True, assined_col_=func.var_samp(admissions_train.gpa.expression))
print_variables(df, ["assined_col_"])

Equivalent SQL: select var_samp(gpa) AS assined_col_ from "admissions_train"


 ************************* DataFrame ********************* 
   assined_col_
0      0.263953



 ************************* DataFrame.dtypes ********************* 
assined_col_    float



 'assined_col_' Column Type: FLOAT
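
The variance and standard deviation results can be cross-checked against each other. With n rows and S the sum of squared deviations of `gpa` from its mean (the value REGR_SXX returned earlier), the population variance is S / n, the sample variance is S / (n - 1), and each standard deviation is the square root of the matching variance. A small sketch using the displayed, rounded outputs:

```python
import math

n = 40            # row count reported by count() and REGR_count
sxx = 10.294177   # REGR_SXX(admitted, gpa): squared deviations of gpa from its mean

derived = {
    "var_pop": sxx / n,
    "var_samp": sxx / (n - 1),
    "stddev_pop": math.sqrt(sxx / n),
    "stddev_samp": math.sqrt(sxx / (n - 1)),
}
reported = {
    "var_pop": 0.257354,
    "var_samp": 0.263953,
    "stddev_pop": 0.507301,
    "stddev_samp": 0.513764,
}

for name in derived:
    # Tolerance allows for rounding in the displayed values.
    print(name, round(derived[name], 6), math.isclose(derived[name], reported[name], abs_tol=1e-5))
```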


In [54]:
# One must run remove_context() to close the connection and garbage collect internally generated objects.
remove_context()

True

In [55]:
## Grouping, pivot, and unpivot cannot be used here.