<header style="background:#00233C;padding-left:20pt;padding-right:20pt;padding-top:20pt;padding-bottom:10pt;"><img id="Teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 100px; height: auto; margin-top: 20pt;" align="right">
<p style="font-size:20px; color:#ffffff">UDW INNOVATION DAYS</p>
<p style="font-size:24px; color:#ffffff">Teradata Package for Python: Introduction to teradataml </p>
<p style="font-size:16px; color:#ffffff">Data transfer.</p>
</header>

#### Install teradataml package
Note: You only need to run this once. The "!" prefix runs a shell command from the notebook cell.

In [None]:
#!pip install teradataml --user

#### Import Packages

In [5]:
# to hide authentication strings
import getpass as gp

# for managing connections
from teradataml import create_context, get_context, remove_context

# for setting configure options
from teradataml import configure

# DataFrames
from teradataml import DataFrame, in_schema

# for dropping tables or views
from teradataml import db_drop_table, db_drop_view, db_list_tables

import pandas as pd
import numpy as np

### Connection Variables

##### Set User and Password Variables

In [6]:
user = gp.getpass("User")

User ········


In [7]:
password = gp.getpass("Password")

Password ········


##### Set Connection Variables

In [8]:
host = 'UDWTest'
logmech = 'LDAP'
defaultDB = 'INOUDWTRAINING2024' 

##### Create Context
See the PythonBasics-1-ConnectingToVantage Notebook for more information about contexts and garbage collection.  

In [9]:
td_context = create_context(host=host,
                            username=user,
                            password=password,
                            logmech=logmech,      # defined in the connection variables above
                            sslmode='ALLOW',
                            database=defaultDB)



#### Load Packages for fastload, copy_to_sql, fastexport

`to_sql()` and `to_pandas()` are methods of `teradataml.DataFrame`, so they need no separate import.

In [10]:
# for fastload, copy_to_sql, fastexport
from teradataml.dataframe.fastload import fastload
from teradataml.dataframe.data_transfer import fastexport
from teradataml.dataframe.copy_to import copy_to_sql

### Read CSV Data into Pandas
##### Load three CSV files locally into Pandas:  Plots, Species, Surveys

In [11]:
plots = pd.read_csv('data/plotsdata.csv', names=["plot_id","plot_type"])
plots.head()

Unnamed: 0,plot_id,plot_type
0,5,Rodent Exclosure
1,24,Rodent Exclosure
2,3,Long-term Krat Exclosure
3,1,Spectab exclosure
4,20,Short-term Krat Exclosure


In [12]:
species = pd.read_csv('data/speciesdata.csv', names=["species_id","genus","species","taxa"])
species.head()

Unnamed: 0,species_id,genus,species,taxa
0,CM,Calamospiza,melanocorys,Bird
1,SS,Spermophilus,spilosoma,Rodent
2,AH,Ammospermophilus,harrisi,Rodent
3,ST,Spermophilus,tereticaudus,Rodent
4,RF,Reithrodontomys,fulvescens,Rodent


In [13]:
surveys = pd.read_csv('data/surveysdata.csv', 
                      names=["record_id","month","day","year","plot_id","species_id","sex","hindfoot_length","weight"])
surveys.head()

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
0,12552,4,6,1987,4,RM,F,16.0,9.0
1,26704,7,30,1997,8,PP,M,20.0,14.0
2,4434,5,4,1981,9,DM,F,36.0,44.0
3,24153,6,14,1996,4,DM,M,38.0,46.0
4,5,7,16,1977,3,DM,M,35.0,


#### View Pandas Data Types

In [14]:
surveys.dtypes

record_id            int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object

### Write dataset to Vantage

#### Method: copy_to_sql() for < 100,000 rows
Write pandas dataframe "plots" to Vantage database using copy_to_sql(): < 100,000 Rows 

If a table already exists in the database and you only want to append data, then set `if_exists='append'`. Otherwise, use `if_exists='replace'`.

The call below is roughly equivalent to the following SQL:
```
DROP TABLE {user}_plots; 
CREATE MULTISET TABLE {user}_plots (plot_id INTEGER, plot_type VARCHAR(40)) PRIMARY INDEX ( plot_id );
INSERT INTO {user}_plots (plot_id, plot_type) VALUES (?,?);
```

In [15]:
# for defining SQL data types
from teradatasqlalchemy.types import INTEGER, VARCHAR

copy_to_sql(df = plots, 
            table_name = f'{user}_plots', 
            schema_name = defaultDB, 
            index = False, 
            temporary = False,            # not volatile/temp table
            primary_index = ['plot_id'], 
            if_exists = 'replace',        # create or drop/create table
            types = {'plot_id': INTEGER,
                     'plot_type': VARCHAR},
            set_table=False
           )

Verify that data was written. You can pull a Vantage table into a teradataml DataFrame by using the `DataFrame()` constructor. 

In [16]:
td_plots = DataFrame(f'{user}_plots')
td_plots.head()

plot_id,plot_type
3,Long-term Krat Exclosure
5,Rodent Exclosure
6,Short-term Krat Exclosure
7,Rodent Exclosure
9,Spectab exclosure
10,Rodent Exclosure
8,Control
4,Control
2,Control
1,Spectab exclosure


#### Method: DataFrame.to_sql() for < 100,000 rows
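Write the species pandas DataFrame to Vantage with to_sql(): < 100,000 rows. Here the teradataml `DataFrame.to_sql()` method is called on the class, with the pandas DataFrame passed as its first argument.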

##### Species Table

In [17]:
DataFrame.to_sql(species, table_name=f'{user}_species', if_exists='replace')



Confirm table loaded correctly in database

In [18]:
td_species = DataFrame(f'{user}_species')
td_species.head()

species_id,genus,species,taxa
AS,Ammodramus,savannarum,Bird
CB,Campylorhynchus,brunneicapillus,Bird
CM,Calamospiza,melanocorys,Bird
CQ,Callipepla,squamata,Bird
CT,Cnemidophorus,tigris,Reptile
CU,Cnemidophorus,uniparens,Reptile
CS,Crotalus,scutalatus,Reptile
BA,Baiomys,taylori,Rodent
AH,Ammospermophilus,harrisi,Rodent
AB,Amphispiza,bilineata,Bird


##### Surveys

Finally, also write the surveys pandas dataframe to Vantage with to_sql(): < 100,000 Rows 

In [19]:
DataFrame.to_sql(surveys, table_name=f'{user}_surveys', if_exists='replace')


Confirm table loaded correctly in database

In [20]:
td_surveys = DataFrame(f'{user}_surveys')
td_surveys.head()

record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
3,7,16,1977,2,DM,F,37.0,
5,7,16,1977,3,DM,M,35.0,
6,7,16,1977,1,PF,M,14.0,
7,7,16,1977,2,PE,F,,
9,7,16,1977,1,DM,F,34.0,
10,7,16,1977,6,PF,F,20.0,
8,7,16,1977,1,DM,M,37.0,
4,7,16,1977,7,DM,M,36.0,
2,7,16,1977,3,NL,M,33.0,
1,7,16,1977,2,NL,M,32.0,


#### Method: fastload() > 100,000 rows

Write a large pandas DataFrame to Vantage with fastload(): > 100,000 rows

- fastload() has limited support for NaN and inf, so convert those values to an empty string or None.
- teradataml fastload() does not support BLOB or CLOB data.

##### Heart Disease Data
Because fastload works best with data > 100K rows, we will use the heartdisease.csv (n=319,795) for demonstration purposes. 
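
The row-count rule of thumb used throughout this notebook can be written down as a small helper. This is only a sketch: `choose_loader` and `FASTLOAD_THRESHOLD` are hypothetical names, and the 100,000-row cutoff is this notebook's guideline rather than a hard limit.

```python
# Hypothetical helper: suggest a load function from the row count.
FASTLOAD_THRESHOLD = 100_000  # guideline used in this notebook, not a hard limit


def choose_loader(n_rows: int) -> str:
    """Return the name of the teradataml function suggested for n_rows."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    return "fastload" if n_rows > FASTLOAD_THRESHOLD else "copy_to_sql"


# heartdisease.csv has 319,795 rows, so fastload is suggested
print(choose_loader(319_795))
```

With a pandas DataFrame, the row count to pass in would be `len(df)`.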


##### Load CSV to pandas DataFrame

In [21]:
heartdisease = pd.read_csv('data/heartdisease.csv')

In [22]:
heartdisease.head()

Unnamed: 0,ID,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,0,No,16.6,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No


In [23]:
heartdisease.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   ID                319795 non-null  int64  
 1   HeartDisease      319795 non-null  object 
 2   BMI               319795 non-null  float64
 3   Smoking           319795 non-null  object 
 4   AlcoholDrinking   319795 non-null  object 
 5   Stroke            319795 non-null  object 
 6   PhysicalHealth    319795 non-null  float64
 7   MentalHealth      319795 non-null  float64
 8   DiffWalking       319795 non-null  object 
 9   Sex               319795 non-null  object 
 10  AgeCategory       319795 non-null  object 
 11  Race              319795 non-null  object 
 12  Diabetic          319795 non-null  object 
 13  PhysicalActivity  319795 non-null  object 
 14  GenHealth         319795 non-null  object 
 15  SleepTime         319795 non-null  float64
 16  Asthma            31

#### Load data to an existing table
In this example, we assume the table already exists and you want to append data to it. The connection object returned by the context's connect() method can run SQL directly through its execute() method; the cell that creates the table below uses teradataml's execute_sql() function instead.

##### Get the connection object from the context to execute a direct SQL query

In [24]:
conn = td_context.connect()

##### Create the table first if you want control over datatypes

In [27]:
from teradataml import execute_sql

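# Create the heartdisease table using SQL for easier control over datatypes.
# Note: AgeCategory VARCHAR(10) is too short for '80 or older' (11 characters),
# so that value is stored truncated; widen the column to avoid this.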
SQLstr = f"""CREATE MULTISET TABLE {user}_heartdisease,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      ID BIGINT,
      HeartDisease VARCHAR(5) CHARACTER SET UNICODE NOT CASESPECIFIC,
      BMI DECIMAL(10,4),
      Smoking VARCHAR(5) CHARACTER SET UNICODE NOT CASESPECIFIC,
      AlcoholDrinking VARCHAR(5) CHARACTER SET UNICODE NOT CASESPECIFIC,
      Stroke VARCHAR(5) CHARACTER SET UNICODE NOT CASESPECIFIC,
      PhysicalHealth DECIMAL(10,4),
      MentalHealth  DECIMAL(10,4),
      DiffWalking VARCHAR(5) CHARACTER SET UNICODE NOT CASESPECIFIC,
      Sex VARCHAR(10) CHARACTER SET UNICODE NOT CASESPECIFIC,
      AgeCategory VARCHAR(10) CHARACTER SET UNICODE NOT CASESPECIFIC,
      Race VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC,
      Diabetic VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC,
      PhysicalActivity VARCHAR(5) CHARACTER SET UNICODE NOT CASESPECIFIC,
      GenHealth VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC,
      SleepTime  DECIMAL(10,4),
      Asthma VARCHAR(5) CHARACTER SET UNICODE NOT CASESPECIFIC,
      KidneyDisease VARCHAR(5) CHARACTER SET UNICODE NOT CASESPECIFIC,
      SkinCancer VARCHAR(5) CHARACTER SET UNICODE NOT CASESPECIFIC)
PRIMARY INDEX ( ID );"""

execute_sql(SQLstr)

TeradataCursor uRowsHandle=72 bClosed=False

##### NaN and inf values in the pandas DataFrame must be converted before using fastload()

In [28]:
heartdisease = heartdisease.where(heartdisease.notnull(), None)
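
The cell above targets null values; pandas `notnull()` does not flag infinite values, so those need the same treatment. A minimal plain-Python sketch of the rule follows. The helper names are hypothetical, and it works on a list of row dictionaries rather than on a pandas DataFrame.

```python
import math


def none_if_not_finite(value):
    """Return None for float NaN, inf, or -inf; return other values unchanged."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    return value


def clean_records(records):
    """Apply none_if_not_finite to every value in a list of row dicts."""
    return [{key: none_if_not_finite(val) for key, val in row.items()}
            for row in records]


# Illustrative rows only, not taken from heartdisease.csv
rows = [
    {"ID": 1, "BMI": float("nan")},
    {"ID": 2, "BMI": float("inf")},
    {"ID": 3, "BMI": 26.58},
]
cleaned = clean_records(rows)
```

Strings and integers pass through untouched, so only non-finite floats become None.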

##### Using append for existing table

In [29]:
fl_heartdisease = fastload(df = heartdisease, table_name = f'{user}_heartdisease', if_exists='append')


Processed 106598 rows in batch 1.
Processed 106598 rows in batch 2.
Processed 106599 rows in batch 3.


##### View returned errors
fastload() returns a dict with the following keys:
1. errors_dataframe: It is a Pandas DataFrame containing error messages thrown by fastload. DataFrame is empty if there are no errors.
2. warnings_dataframe: It is a Pandas DataFrame containing warning messages thrown by fastload. DataFrame is empty if there are no warnings.
3. errors_table: Name of the table containing errors. It is None, if argument save_errors is False.
4. warnings_table: Name of the table containing warnings. It is None, if argument save_errors is False.
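
The four keys listed above can be checked programmatically after a load. The sketch below uses a hypothetical helper, `summarize_fastload_result`, that relies only on `len()`, which returns the row count for a pandas DataFrame; plain lists stand in for the DataFrames here so the example runs without a database.

```python
def summarize_fastload_result(result):
    """Summarize the dict returned by fastload() using the keys listed above."""
    n_errors = len(result["errors_dataframe"])
    n_warnings = len(result["warnings_dataframe"])
    return {
        "ok": n_errors == 0,
        "errors": n_errors,
        "warnings": n_warnings,
        # Treat an empty string the same as None: no table was saved
        "errors_table": result.get("errors_table") or None,
        "warnings_table": result.get("warnings_table") or None,
    }


# Stand-in result: lists replace the pandas DataFrames for illustration
fake_result = {
    "errors_dataframe": [],
    "warnings_dataframe": [],
    "errors_table": "",
    "warnings_table": "",
}
summary = summarize_fastload_result(fake_result)
```

In the notebook, the same helper could be called as `summarize_fastload_result(fl_heartdisease)`.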

In [30]:
fl_heartdisease

{'errors_dataframe': Empty DataFrame
 Columns: []
 Index: [],
 'warnings_dataframe': Empty DataFrame
 Columns: []
 Index: [],
 'errors_table': '',

##### Verify that data loaded to database

In [31]:
tdf_heartdisease = DataFrame(f'{user}_heartdisease')

In [32]:
tdf_heartdisease.head()

ID,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No
5,Yes,28.87,Yes,No,No,6.0,0.0,Yes,Female,75-79,Black,No,No,Fair,12.0,No,No,No
6,No,21.63,No,No,No,15.0,0.0,No,Female,70-74,White,No,Yes,Fair,4.0,Yes,No,Yes
8,No,26.45,No,No,No,0.0,0.0,No,Female,80 or olde,White,"No, borderline diabetes",No,Fair,5.0,No,Yes,No
9,No,40.69,No,No,No,0.0,0.0,Yes,Male,65-69,White,No,Yes,Good,10.0,No,No,No
7,No,31.64,Yes,No,No,5.0,0.0,Yes,Female,80 or olde,White,Yes,No,Good,9.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or olde,White,No,Yes,Very good,7.0,No,No,No
0,No,16.6,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes


In [33]:
tdf_heartdisease.tdtypes

COLUMN NAME,TYPE
ID,BIGINT()
HeartDisease,"VARCHAR(length=5, charset='UNICODE')"
BMI,"DECIMAL(precision=10, scale=4)"
Smoking,"VARCHAR(length=5, charset='UNICODE')"
AlcoholDrinking,"VARCHAR(length=5, charset='UNICODE')"
Stroke,"VARCHAR(length=5, charset='UNICODE')"
PhysicalHealth,"DECIMAL(precision=10, scale=4)"
MentalHealth,"DECIMAL(precision=10, scale=4)"
DiffWalking,"VARCHAR(length=5, charset='UNICODE')"
Sex,"VARCHAR(length=10, charset='UNICODE')"


### Exporting Data from Vantage Database Table
#### Method:  DataFrame.to_pandas() for < 100,000 rows

- Use teradataml DataFrame to read from database. 

In [34]:
tdf_species = DataFrame(f'{user}_species')
print(tdf_species.tdtypes)
tdf_species.head()

species_id    VARCHAR(length=1024, charset='UNICODE')
genus         VARCHAR(length=1024, charset='UNICODE')
species       VARCHAR(length=1024, charset='UNICODE')
taxa          VARCHAR(length=1024, charset='UNICODE')


species_id,genus,species,taxa
AS,Ammodramus,savannarum,Bird
CB,Campylorhynchus,brunneicapillus,Bird
CM,Calamospiza,melanocorys,Bird
CQ,Callipepla,squamata,Bird
CT,Cnemidophorus,tigris,Reptile
CU,Cnemidophorus,uniparens,Reptile
CS,Crotalus,scutalatus,Reptile
BA,Baiomys,taylori,Rodent
AH,Ammospermophilus,harrisi,Rodent
AB,Amphispiza,bilineata,Bird


Then use the `DataFrame.to_pandas()` method to bring it to your local machine.

In [35]:
pd_species = tdf_species.to_pandas()
pd_species.head()

Unnamed: 0,species_id,genus,species,taxa
0,CM,Calamospiza,melanocorys,Bird
1,SS,Spermophilus,spilosoma,Rodent
2,AH,Ammospermophilus,harrisi,Rodent
3,ST,Spermophilus,tereticaudus,Rodent
4,RF,Reithrodontomys,fulvescens,Rodent


#### Method: fastexport() for > 100,000 rows
This example uses the heartdisease teradataml DataFrame.

##### A note about open_sessions:

- **Use case 1: Workload Manager allows a maximum of 4 sessions.**
   - The user sets `open_sessions=3`. The operation succeeds.

- **Use case 2: Workload Manager allows a maximum of 4 sessions.**
   - The user sets `open_sessions=5`. The operation fails.

- **Use case 3: Workload Manager allows a maximum of 4 sessions, the system has 6 AMPs, and `open_sessions` is not set.**
    - When `open_sessions` is not specified, teradataml uses the smaller of 8 and the number of AMPs, so it tries to open 6 sessions. Workload Manager allows only 4, so the operation fails. Try a lower value, such as `open_sessions=3`.
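
The three use cases above reduce to one comparison: the number of sessions requested against the Workload Manager limit. The sketch below models that reasoning only. The function names are hypothetical, it does not call teradataml, and the default of "the smaller of 8 and the number of AMPs" is taken from the third use case.

```python
MAX_DEFAULT_SESSIONS = 8  # cap on the default, as described in use case 3


def default_open_sessions(num_amps):
    """Sessions requested when open_sessions is not specified."""
    return min(num_amps, MAX_DEFAULT_SESSIONS)


def sessions_allowed(wlm_max_sessions, num_amps, open_sessions=None):
    """Return True if the requested sessions fit within the Workload Manager limit."""
    if open_sessions is None:
        requested = default_open_sessions(num_amps)
    else:
        requested = open_sessions
    return requested <= wlm_max_sessions


# Use cases 1-3, assuming a 6-AMP system throughout
case_1 = sessions_allowed(wlm_max_sessions=4, num_amps=6, open_sessions=3)
case_2 = sessions_allowed(wlm_max_sessions=4, num_amps=6, open_sessions=5)
case_3 = sessions_allowed(wlm_max_sessions=4, num_amps=6)
```

Here `case_1` is True, while `case_2` and `case_3` are False, matching the outcomes described in the list.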
    
##### Use in_schema function when defining the DataFrame to ensure that the schema is specified.

In [36]:
tdf_heartdisease = DataFrame(in_schema(defaultDB, f'{user}_heartdisease'))
tdf_heartdisease.show_query()

'select * from "INOUDWTRAINING2024"."tlugtu_heartdisease"'

In [37]:
# using tdf_heartdisease from above
pd_heartdisease = fastexport(tdf_heartdisease, export_to='pandas', index_column='ID', open_sessions=1)

Errors: []


In [38]:
pd_heartdisease.head()

ID,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
297985,No,36.61,No,No,No,5.0,15.0,No,Female,40-44,White,Yes (during pregnancy),Yes,Fair,6.0,Yes,No,No
269008,No,24.69,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Very good,11.0,No,No,No
112194,No,23.75,Yes,No,No,0.0,0.0,No,Male,18-24,White,No,Yes,Excellent,10.0,No,No,No
239093,No,20.98,Yes,No,No,0.0,15.0,No,Female,25-29,White,No,Yes,Good,7.0,No,No,No
130302,No,30.9,No,No,No,5.0,5.0,Yes,Female,70-74,White,No,Yes,Fair,8.0,No,No,No


#### Clean up tables

In [None]:
# Drop each demo table; ignore failures such as a table that does not exist
for table in ('plots', 'species', 'surveys', 'heartdisease'):
    try:
        db_drop_table(f'{user}_{table}')
    except Exception:
        pass

In [None]:
remove_context()

<span style="font-size:16px;">For online documentation on Teradata Vantage analytic functions, refer to the [Teradata Developer Portal](https://docs.teradata.com/) and search for phrases "Python User Guide" and "Python Function Reference".</span>