<header style="padding:10px;background:#f9f9f9;border-top:3px solid #00b2b1">
    <h3>Machine Learning Analytics (MLA) with Teradata Vantage</h3>
    <h2>Vantage Analytics Library (VAL) with Python:<br>In-Database Hypothesis Testing</h2>
    <p>This notebook demonstrates Vantage Analytics Library (VAL) functions that leverage Vantage's SQL push-down architecture, so you can run statistical tests on datasets in-database without moving data to your client machine.</p>
    <p>For reference documentation, go to <a href="https://docs.teradata.com">https://docs.teradata.com</a> and search for "Teradata Package for Python Function Reference".</p>
    <p>For more information on Machine Learning Analytics (MLA), go to the <a href="https://uhgazure.sharepoint.com/sites/UDW/SitePages/Machine-Learning-Analytics.aspx" target="new">Machine Learning Analytics (MLA) SharePoint Site</a>.</p>
</header>

### Use Cases
1. Use a binomial test to compare savings and checking account balances.
2. Use a median test to compare incomes across marital statuses.


#### Import teradataml package libraries

##### Install packages as needed
Note: You only need to run these once per package. The "!" prefix runs a shell command from the notebook cell.

In [None]:
!pip install teradataml --user
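
Because the install only needs to run once per package, it can help to check whether the package is already importable first. The following is a minimal sketch using only the standard library; the helper names `is_installed` and `ensure_installed` are hypothetical and not part of teradataml.

```python
import importlib.util
import subprocess
import sys


def is_installed(package_name):
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


def ensure_installed(package_name):
    """Install the package with pip (user site) only if it is not importable yet."""
    if not is_installed(package_name):
        # Same effect as "!pip install <package> --user", but tied to the
        # interpreter that is running this notebook kernel.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--user", package_name]
        )
```

Calling `ensure_installed("teradataml")` in a cell would then skip the install on later runs. This assumes the import name matches the package name, which holds for teradataml but not for every package.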

In [None]:
# managing connection context
from teradataml import create_context, get_context, remove_context

# for setting configure options
from teradataml import configure

# for teradataml DataFrame object
from teradataml.dataframe.dataframe import DataFrame, in_schema

# for copying pandas dataframe to SQL table
from teradataml.dataframe.copy_to import copy_to_sql

# dataframe manipulation methods and sql data types
from teradatasqlalchemy.types import *
from sqlalchemy.sql.expression import select, and_, or_, not_, extract, text, join, case as case_when
from sqlalchemy import func, sql, distinct

# teradataml utils (configure is already imported above)
from teradataml import db_drop_table, UtilFuncs

# Vantage Analytics Library (valib)
from teradataml.analytics.valib import *
from teradataml.analytics import Transformations as tf 

#### Import other helpful open source packages

In [None]:
# Open source packages

# hide passwords
import getpass as gp

# visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# system
from os.path import exists
import yaml, sys
from datetime import datetime as dt, timedelta
import math

# dataframes and matrices
import pandas as pd
import numpy as np

%matplotlib inline

##### Configure Display Options

In [None]:
plt.rcdefaults()
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (5, 3)
plt.rcParams['font.size'] = 8

### Connection Variables

##### Set User and Password Variables

In [None]:
user = gp.getpass("User")

In [None]:
password = gp.getpass("Password")

##### Set Connection Variables

In [None]:
host = 'UDWProd.uhc.com'
logmech = 'LDAP'
defaultDB = 'MLA_BOPS'  # use your MLA database (MLA_BOPS, MLA_CNS, MLA_ENI, MLA_MNR)

##### Virtual DataFrames are only allowed in the MLA Databases.

If you are using a default database other than database="MLA_xxx", you must set the configure options below to your MLA database to use virtual DataFrames. 

**<font color="red">STOP: Only run cell below if default database is NOT MLA_XXX database.</font>**

In [None]:
mlaDB = "MLA_XXX" # use your MLA database (MLA_BOPS, MLA_CNS, MLA_ENI, MLA_MNR)

# this is the MLA database to which teradataml virtual tables will be written. 
configure.temp_table_database = defaultDB if defaultDB[:3]=="MLA" else mlaDB

# this is the MLA database to which teradataml virtual views will be written.
configure.temp_view_database = defaultDB if defaultDB[:3]=="MLA" else mlaDB 

##### Create Context
See the PythonBasics-1-ConnectingToVantage Notebook for more information about contexts and garbage collection.  

In [None]:
td_context = create_context(host=host,
                            username=user,
                            password=password,
                            logmech=logmech,  # use the variable set above ('LDAP')
                            sslmode='ALLOW',
                            database=defaultDB)

#### Set Vantage Analytics Library (VAL) database location
`from teradataml import configure`

In [None]:
# val_database: name of the database where VAL is installed (set it before running this cell)
configure.val_install_location = val_database

### Binomial Tests

A binomial test assumes N independent trials, each with two possible outcomes that are equally likely by default. VAL offers two styles: a binomial test, which analyzes the sign of the difference between a first and a second column, and a sign test, which analyzes the sign of a single column. In a binomial test you may set a binomial probability other than the default of 0.5; in a sign test the probability is fixed at 0.5.
    
`valib.BinomialTest(data, first_column=None, binomial_prob=0.5, exact_matches='negative', fallback=False, group_columns=None, allow_duplicates=False, second_column=None, single_tail=False, stats_database=None, style='binomial', probability_threshold=0.05)`
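
To make the idea concrete, here is a minimal pure-Python sketch of a two-sided binomial test on the sign of the difference between two columns. It illustrates the textbook technique only; it is not VAL's implementation, and the function name `sign_binomial_test` and its `ties` parameter are hypothetical names chosen for this example. How VAL treats exact matches internally is an assumption here.

```python
from math import comb


def sign_binomial_test(first, second, prob=0.5, ties="negative"):
    """Two-sided binomial test on the sign of (first - second).

    ties: "negative" counts zero differences as negative,
          "positive" counts them as positive, anything else drops them.
    Returns (n, n_pos, n_neg, p_value).
    """
    diffs = [a - b for a, b in zip(first, second)]
    n_pos = sum(d > 0 for d in diffs)
    n_zero = sum(d == 0 for d in diffs)
    n_neg = len(diffs) - n_pos - n_zero

    if ties == "negative":
        n_neg += n_zero
    elif ties == "positive":
        n_pos += n_zero

    n = n_pos + n_neg
    # Probability of each possible number of positive signs under H0.
    pmf = [comb(n, k) * prob**k * (1 - prob) ** (n - k) for k in range(n + 1)]
    # Two-sided p-value: total probability of all outcomes that are
    # no more likely than the observed one.
    observed = pmf[n_pos]
    p_value = min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))
    return n, n_pos, n_neg, p_value


# Ten pairs where the first value is always larger.
n, n_pos, n_neg, p_value = sign_binomial_test(
    [5, 6, 7, 8, 9, 1, 2, 3, 4, 10], [0] * 10
)
print(n, n_pos, n_neg, p_value)
```

A small p-value means the positive and negative signs are too unbalanced to be explained by the assumed probability, which corresponds to rejecting H0 at the chosen threshold.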

#### A binomial test without any grouping.
**Hypotheses:**
- H0: The distribution of the savings account balance is the same as that of the checking account balance
- H1: The distribution of the savings account balance is NOT the same as that of the checking account balance

In [None]:
df = DataFrame("demo_customer_analysis")

In [None]:
plot_df = df.select(["avg_sv_bal","avg_ck_bal"]).to_pandas()

In [None]:
plt.hist(plot_df.avg_sv_bal, alpha=0.5, label='Saving Acc Bal')
plt.hist(plot_df.avg_ck_bal, alpha=0.5, label='Checking Acc Bal')
plt.legend(loc='upper right')
plt.show() 

In [None]:
obj = valib.BinomialTest(data= df,
                         first_column="avg_sv_bal",
                         second_column="avg_ck_bal",
                         probability_threshold = 0.05,
                         stats_database=val_database)
bin_df = obj.result.to_pandas().reset_index()

In [None]:
print("N = %s" % bin_df['N'].values[0])
print("# of Positive = %s" % bin_df['NPos'].values[0])
print("# of Negative = %s" % bin_df['NNeg'].values[0])
print("Binomial probability = %s" % bin_df['BP'].values[0])

# 'a' means the test does not reject the null hypothesis at the 0.05 threshold
if bin_df['BinomialCallP_0.05'].values[0] == 'a':
    print('*** fail to reject null hypothesis ***')
else:
    print('*** reject null hypothesis ***')

### Chi-square Median test
**Hypotheses:**

- H0: The median income is the same across all marital statuses
- H1: The median income is NOT the same across all marital statuses
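
As an illustration of what a chi-square median test computes, the sketch below follows the textbook approach (Mood's median test): classify each value as above or not above the grand median, tabulate the counts per group, and compute a chi-square statistic on that table. This is an assumption about the general technique, not a description of VAL's internals; the function name `median_test` is hypothetical, and tie handling in VAL may differ.

```python
from collections import defaultdict
from statistics import median


def median_test(values, groups):
    """Chi-square statistic for a median test across groups.

    Returns (grand_median, chi_square, degrees_of_freedom).
    """
    grand = median(values)

    # group -> [count above grand median, count at or below it]
    counts = defaultdict(lambda: [0, 0])
    for value, group in zip(values, groups):
        counts[group][0 if value > grand else 1] += 1

    total = len(values)
    total_above = sum(c[0] for c in counts.values())
    total_below = total - total_above

    chi_sq = 0.0
    for above, below in counts.values():
        size = above + below
        for observed, column_total in ((above, total_above), (below, total_below)):
            expected = size * column_total / total
            if expected > 0:
                chi_sq += (observed - expected) ** 2 / expected

    dof = len(counts) - 1  # (rows - 1) * (columns - 1) with 2 rows
    return grand, chi_sq, dof


# Two groups with clearly separated values.
incomes = [1, 2, 3, 4, 10, 11, 12, 13]
status = ["a"] * 4 + ["b"] * 4
print(median_test(incomes, status))
```

The statistic is then compared against a chi-square distribution with the returned degrees of freedom to obtain a p-value; that last step is omitted here to keep the sketch dependency-free.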

#### Visualize the data

In [None]:
plot_df = df.select(["marital_status", "income"]).to_pandas().sort_values("income")
sns.boxplot(x="marital_status", y="income", data=plot_df)

In [None]:
obj = valib.ChiSquareTest(data= df,
                          dependent_column="income",
                          columns="marital_status",
                          style="median",
                          stats_database=val_database)
med_df = obj.result.to_pandas().reset_index()

In [None]:
print("*** Median test result ***")
print(" - Degrees of Freedom = %s" % med_df['DF'].values[0])
print(" - Chi Square = %s" % med_df['ChiSq'].values[0])
print(" - P-value = %s" % med_df['MedianPText'].values[0])

# 'a' means the test does not reject the null hypothesis at the 0.05 threshold
if med_df['MedianCallP_5E-2'].values[0] == 'a':
    print('*** fail to reject null hypothesis ***')
else:
    print('*** reject null hypothesis ***')
    

### Remove context
As a best practice, remove the context at the end of the session. Doing so closes the connection and garbage-collects the volatile tables and views created during this session.

In [None]:
# Run remove_context() to close the connection and garbage-collect internally generated objects.
remove_context()

<span style="font-size:16px;">For online documentation on Teradata Vantage analytic functions, refer to the [Teradata Developer Portal](https://docs.teradata.com/) and search for the phrases "Python User Guide" and "Python Function Reference".</span>