# DRAFT - IBM Cloud Pak for Data - Data Virtualization - DRAFT

## Introduction
* What is Data Virtualization
* What are the benefits to my business

### Where to find this sample online
You can find a copy of this notebook at https://github.com/Db2-DTE-POC/db2dmc.

## Getting Started

### Using Jupyter notebooks


### Connecting to IBM Cloud Pak for Data
For this lab you will be assigned two IBM Cloud Pak for Data user IDs: a Data Engineer userid and an end-user userid.
* **Engineer:**
    * ID: LABDATAENGINEER1
    * PASSWORD: password
* **User:**
    * ID: LABUSER1
    * PASSWORD: password

To get started, sign in using your Engineer id:
1. Click the following link to open the IBM Cloud Pak for Data Console: https://services-uscentral.skytap.com:9152/
2. Sign in using the Engineer userid and password
3. Click the icon at the very top right of the webpage
4. Click **Profile and settings**
5. Click **Permissions** and review the user permissions for this user
6. Click the **three bar menu** at the very top left of the console webpage
7. Click **Data Virtualization**
8. Click the caret symbol beside **Menu** below the Data Virtualization title
This displays the actions available to your user. Different users have access to more or fewer menu options depending on their role in Data Virtualization. 

As a Data Engineer you can:
* Add and modify data sources. Each source is a connection to a single database, either inside or outside of IBM Cloud Pak for Data.
* Virtualize data. This makes tables in other data sources look and act like tables that are local to the Data Virtualization database.
* Work with the data you have virtualized.
* Write SQL to access and join data that you have virtualized.
* See detailed information on how to connect external analytic tools and applications to your virtualized data.

As a User you can only:
* Work with data that has been virtualized for you
* Write SQL to work with that data
* See detailed connection information

As an Administrator (only available to the course instructor) you can also:
* Manage IBM Cloud Pak for Data User Access and Roles
* Create and Manage Data Caches to accelerate performance
* Change key service settings
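
The three role descriptions above can be read as nested sets of capabilities. The following is only an illustrative sketch of that reading: the capability labels are hypothetical shorthand for the bullet points, not identifiers used by IBM Cloud Pak for Data, and treating the Administrator as a superset of the Engineer is an assumption based on the word "also".

```python
# Hypothetical labels that paraphrase the bullet lists above; they are not
# product identifiers or API names.
USER = {"work with virtualized data", "write SQL", "view connection information"}
ENGINEER = USER | {"manage data sources", "virtualize data"}
ADMINISTRATOR = ENGINEER | {
    "manage user access and roles",
    "manage data caches",
    "change service settings",
}

ROLES = {"User": USER, "Engineer": ENGINEER, "Administrator": ADMINISTRATOR}


def can(role, capability):
    """Return True if the given role includes the capability."""
    return capability in ROLES[role]


print(can("User", "virtualize data"))
print(can("Engineer", "virtualize data"))
```

Modeling the roles as sets makes it easy to see that each role only adds capabilities to the one before it.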

### Exploring Data Source Connections
Let's start by looking at the Data Source Connections that are already available. 

    **Insert Graphic**

1. Click the Data Virtualization menu and select **Data Sources**.
2. Click the **icon below the menu with a circle with three connected dots**.
    This displays the Data Source Graph with 8 active data sources:
    * 4 Db2 Family Databases hosted on premises, IBM Cloud, Azure and AWS
    * 1 EDB Postgres Database on Azure
    * 1 z/OS VSAM file
    * 1 Informix Database running on premises 
We are not going to add a new data source, but we will walk through the steps so you can review the available data sources:
1. Click **+ Add** at the left of the console screen
2. Select **Add data source**
You can see a history of data source connection information that was used before. This history is maintained to make reconnecting to data sources easier and faster.
3. Click **Add connection**
4. Click the field below **Connection type**
5. Scroll through all the **available data sources** to see the available connection types
6. Select **different data connection types** from the list to see the information required to connect to a new data source. 
At a minimum, you typically need the host URL and port, database name, userid and password. You can also connect using an SSL certificate that can be dragged and dropped directly into the console interface. 
7. Click **Cancel**
8. Click **Cancel**
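
To make the minimum connection details concrete, here is a small sketch that assembles them into a keyword=value connection string of the style used by Db2 client drivers such as `ibm_db`. The function name and every value shown are placeholders chosen for illustration, and the optional certificate handling is an assumption about how an SSL connection might be expressed; other connection types in the list need different details.

```python
def build_connection_string(database, host, port, userid, password,
                            ssl_certificate=None):
    """Assemble a keyword=value connection string from the minimum details."""
    parts = [
        f"DATABASE={database}",
        f"HOSTNAME={host}",
        f"PORT={port}",
        "PROTOCOL=TCPIP",
        f"UID={userid}",
        f"PWD={password}",
    ]
    if ssl_certificate is not None:
        # Assumption: an SSL connection adds these two keywords.
        parts.append("SECURITY=SSL")
        parts.append(f"SSLServerCertificate={ssl_certificate}")
    return ";".join(parts) + ";"


# Placeholder values only; replace them with the details of a real data source.
print(build_connection_string("MYDB", "db.example.com", 50000,
                              "myuser", "mypassword"))
```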


### Basic Data Virtualization

#### Part One Exploring the Interface
Now that you understand how to connect to data sources you can start virtualizing data. Much of the work has already been done for you. IBM Cloud Pak for Data searches through the available data sources and compiles a single large inventory of all the tables and data available to virtualize in IBM Cloud Pak for Data. 
1. Click the Data Virtualization menu and select **Data Sources**.
2. Select to browse for **Tables**
3. Check the total number of available tables at the top of the list. There should be well over 500 available.
4. Enter "STOCK" into the search field and press **Enter**. Any table with the string **STOCK** in its table name, its schema name, or one of its column names will appear in the search results. 
5. Move your mouse pointer to the far right side of the search results table. An **eye** icon appears on each row as you move your mouse. 
6. Click the **eye** icon beside one table. This displays a preview of the data in the selected table.
7. Click **X** at the top right of the dialog box to return to the search results.
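
The search behavior described in step 4 can be pictured as a filter over the table inventory. This is a minimal sketch, not the product's implementation: the inventory entries, the column names, and the case-insensitive matching are all assumptions made for illustration. Only the `STOCK` search term, the `DVDEMO` schema, and the table names come from this lab.

```python
# Hypothetical inventory entries; the column names are made up for illustration.
INVENTORY = [
    {"schema": "DVDEMO", "table": "STOCK_HISTORY", "columns": ["SYMBOL", "CLOSE"]},
    {"schema": "DVDEMO", "table": "CUSTOMERS", "columns": ["CUSTID", "STATE"]},
    {"schema": "DVDEMO", "table": "ACCOUNTS", "columns": ["CUSTID", "STOCK_VALUE"]},
]


def search(inventory, term):
    """Return entries whose table name, schema, or any column contains term."""
    term = term.upper()
    return [
        entry for entry in inventory
        if term in entry["table"].upper()
        or term in entry["schema"].upper()
        or any(term in column.upper() for column in entry["columns"])
    ]


for entry in search(INVENTORY, "stock"):
    print(entry["schema"], entry["table"])
```

In this sketch `ACCOUNTS` matches only because of its hypothetical `STOCK_VALUE` column, which mirrors how a table can appear in the results even when its name does not contain the search string.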

#### Part Two Creating a New Table
So that each user in this lab can have their own data to virtualize, you will create your own table in a database.
In this part of the lab you will use this Jupyter notebook and Python code to connect to a source database, create a simple table and populate it with data. IBM Cloud Pak for Data will automatically detect the change in the source database and make the new table available for virtualization.

The first step is to connect to one of our remote data sources directly, as if we were part of the team building a new business application. Since each lab user will create their own table in their own schema, the first thing you need to do is update and run the cell below with your engineer name. 
1. Click on the cell below.
2. Update the name in quotes to match your engineer name
3. Click **Run** from the Jupyter notebook menu above. 

In [1]:
# Setting your userID
engineer = 'LABDATAENGINEER1'

The next part of the lab relies on a Jupyter notebook extension, commonly referred to as a "magic" command, to connect to a Db2 database. To use the commands, you load the extension by running another notebook called db2 that contains all the required code:
<pre>
&#37;run db2.ipynb
</pre>
The cell below loads the Db2 extension. Note that it will take a few seconds for the extension to load, so you should generally wait until the "Db2 Extensions Loaded" message is displayed in your notebook. 
1. Click the cell below
2. Click **Run**

In [2]:
%run db2.ipynb

Db2 Extensions Loaded.


#### Connecting to Db2

Before any SQL commands can be issued, a connection needs to be made to the Db2 database that you will be using. 

The Db2 magic command tracks whether or not a connection has occurred in the past and saves this information between notebooks and sessions. When you start up a notebook and issue a command, the program will reconnect to the database using your credentials from the last session. In the event that you have not connected before, the system will prompt you for all the information it needs to connect. This information includes:

- Database name
- Hostname
- PORT 
- Userid
- Password
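
The "reconnect if settings were saved, otherwise prompt" behavior can be sketched as follows. This is not the Db2 magic command's actual code: the file name, the JSON format, and the function names are hypothetical, and the password is deliberately left out of the saved settings in this sketch.

```python
import json
import os
import tempfile


def save_settings(path, settings):
    """Write connection settings to a file so a later session can reuse them."""
    with open(path, "w") as f:
        json.dump(settings, f)


def load_settings(path):
    """Return saved settings, or None if nothing has been saved yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


# Hypothetical file location used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "connection.json")

first = load_settings(path)   # nothing saved yet: a real tool would prompt here
save_settings(path, {"database": "MYDB", "hostname": "db.example.com",
                     "port": 50000, "userid": "myuser"})
second = load_settings(path)  # the saved values are found and can be reused
print(first, second["database"])
```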

Run the next cell.

In [3]:
%sql CONNECT TO BLUDB USER user999 USING t1cz?K9-X1_Y-2Wi HOST services-uscentral.skytap.com PORT 9094

Connection successful.


To check that the connection is working, run the following cell. It lists the tables in the database in the **DVDEMO** schema. Only the first 5 tables are listed.

In [4]:
%sql select TABNAME, OWNER from syscat.tables where TABSCHEMA = 'DVDEMO'

Unnamed: 0,TABNAME,OWNER
0,ACCOUNTS,USER999
1,CUSTOMERS,USER999
2,STOCK_HISTORY,USER999
3,STOCK_SYMBOLS,USER999
4,STOCK_TRANSACTIONS,USER999


Now that you can successfully connect to the database, you are going to create two tables with the same name and column in two different schemas. In the following steps of the lab you are going to virtualize these tables in IBM Cloud Pak for Data and fold them together into a single table. 

The next cell sets the default schema to your engineer name followed by 'A'. Notice how you can set a Python variable and substitute it into the SQL statement in the cell. The **-e** option echoes the command. 

Run the next cell.

In [8]:
schema = engineer+'A'
%sql -e SET CURRENT SCHEMA {schema}

Command completed.
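
The `{schema}` substitution in the cell above can be pictured as ordinary Python string formatting. This is only an analogy: how the Db2 magic command performs the substitution internally is not shown in this notebook.

```python
engineer = "LABDATAENGINEER1"
schema = engineer + "A"

# str.format replaces the {schema} placeholder with the variable's value,
# producing the statement text that would be sent to the database.
statement = "SET CURRENT SCHEMA {schema}".format(schema=schema)
print(statement)
```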


Run the next cell to create a table with a single INTEGER column containing values from 1 to 10.

In [9]:
%%sql
CREATE TABLE DISCOVER (A INT);
INSERT INTO DISCOVER VALUES 1,2,3,4,5,6,7,8,9,10;
SELECT * FROM DISCOVER;

Unnamed: 0,A
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
9,10


Run the next two cells to create the same table in a schema ending in **B**. It is populated with values from 11 to 20.

In [11]:
schema = engineer+'B'
print(schema)
%sql SET CURRENT SCHEMA {schema}

LABDATAENGINEER1B
Command completed.


In [12]:
%%sql
CREATE TABLE DISCOVER (A INT);
INSERT INTO DISCOVER VALUES 11,12,13,14,15,16,17,18,19,20;
SELECT * FROM DISCOVER;

Unnamed: 0,A
0,11
1,12
2,13
3,14
4,15
5,16
6,17
7,18
8,19
9,20


Run the next cell to see all the tables in the database called **DISCOVER**. You may see tables created by other people running the lab. 

In [13]:
%sql SELECT TABSCHEMA, TABNAME FROM SYSCAT.TABLES WHERE TABNAME = 'DISCOVER'

Unnamed: 0,TABSCHEMA,TABNAME
0,DATAENGINEER1A,DISCOVER
1,DATAENGINEER1B,DISCOVER
2,LABDATAENGINEER1A,DISCOVER
3,LABDATAENGINEER1B,DISCOVER


#### Virtualizing your new Tables
Now that you have created two new tables you can virtualize that data and make it look like a single table in your database.
1. Return to the IBM Cloud Pak for Data Console
2. Click **Virtualize** in the Data Virtualization menu
3. Enter **DISCOVER** in the search bar and press Enter. Your new tables have been discovered automatically by IBM Cloud Pak for Data. They are listed under the LABDATAENGINEER schemas you used when you created them. You will also likely see tables from other lab participants.
4. Select the two tables you just created by clicking the **check box** beside each table
5. Click **Add to Cart**
6. Click **View Cart**
7. Change the name of your two tables from DISCOVER to **DISCOVERA** and **DISCOVERB**. These are the new names that you will be able to use to find your tables in the Data Virtualization database. Don't change the Schema name. It is unique to your current userid. 

8. Click the **back arrow** beside **Review cart and virtualize tables**. We are going to add one more thing to your cart.
9. Click the checkbox beside **Automatically group tables**. Notice how all the tables called **DISCOVER** have been grouped together into a single entry.
10. Select the row where all the DISCOVER tables have been grouped together
11. Click **Add to cart**
12. Click **View cart**
13. Hover over the ellipsis icon at the right side of the list for the **DISCOVER** table
14. Select **Edit grouped tables**
15. Deselect all the tables except for those in the schemas you created. You should now have two tables selected. 
16. Click **Apply**
17. Change the name of the new combined table to **DISCOVERFOLD**
18. Select a project from the drop down list that corresponds to your current user id. 
19. From the ellipsis menu select **Preview** for each of the three tables in your list. The new virtualized table **DISCOVERA** should contain values from 1-10. The new virtualized table **DISCOVERB** should contain values from 11-20. The **DISCOVERFOLD** virtualized table should contain values from 1-20.
20. Click **Virtualize**. You should see that three new virtual tables have been created. 
21. Click **View my virtualized data**
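
One way to picture the result of folding is as a `UNION ALL` over the two source tables. The sketch below uses Python's built-in `sqlite3` module as a local stand-in for the remote database. It is an analogy only: this lab does not describe how Data Virtualization implements grouped tables internally.

```python
import sqlite3

# An in-memory SQLite database stands in for the remote data source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DISCOVERA (A INT);
    CREATE TABLE DISCOVERB (A INT);
    CREATE VIEW DISCOVERFOLD AS
        SELECT A FROM DISCOVERA
        UNION ALL
        SELECT A FROM DISCOVERB;
""")
conn.executemany("INSERT INTO DISCOVERA VALUES (?)", [(n,) for n in range(1, 11)])
conn.executemany("INSERT INTO DISCOVERB VALUES (?)", [(n,) for n in range(11, 21)])

# The folded view returns the rows of both tables as if they were one table.
values = [row[0] for row in conn.execute("SELECT A FROM DISCOVERFOLD ORDER BY A")]
print(len(values), values[0], values[-1])
```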

#### Work with your new tables
1. Enter **DISCOVER** in the Find field. 
You should see the three virtual tables you just created. Notice that you do not see tables that other users have created. By default, Data Engineers only see tables they have virtualized themselves or virtual tables to which other users have given them access. 
2. Click **Preview** beside the **DISCOVERFOLD** table to confirm that it contains 20 rows. 
3. Click **SQL Editor** from the Data Virtualization menu
4. Click **Blank** to create a new blank SQL Script
5. Enter **SELECT * FROM DISCOVERFOLD;** into the SQL Editor
6. Click **Run All**. You should see 20 rows returned in the result. 

Notice that you didn't have to specify the schema for your new virtual tables. The SQL Editor automatically uses the schema associated with your userid that was used when you created your new tables. 
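
The effect of a default schema can be sketched as a simple rule: an unqualified table name is prefixed with the current schema, while a qualified name is left alone. The function below is a simplified illustration that ignores quoted identifiers, and `MYSCHEMA` is a hypothetical placeholder for the schema associated with your userid.

```python
def qualify(table_name, current_schema):
    """Prefix an unqualified table name with the current schema."""
    if "." in table_name:
        return table_name  # already schema-qualified, leave unchanged
    return f"{current_schema}.{table_name}"


# MYSCHEMA is a hypothetical placeholder, not a schema from this lab.
print(qualify("DISCOVERFOLD", "MYSCHEMA"))
print(qualify("DVDEMO.STOCK_HISTORY", "MYSCHEMA"))
```

The same rule explains why a different user has to write the schema explicitly to reach a table that lives in your schema.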

Now you can:
* Create a connection to a remote data source 
* Make a new or existing table in that remote data source look and act like a local table 
* Fold data from different tables, in the same data source or across data sources, into a single virtual table

In the next steps you will use more complex data structures and learn how to combine data from multiple tables into easy-to-consume views and then share them with other users.

### Advanced Data Virtualization 
#### Combining Data Together
The IBM Cloud Pak for Data Virtualization Administrator has set up more complex data from multiple sources for the next steps. The administrator has also given you access to this virtualized data. You may have noticed this in previous steps. 
1. Select **My virtualized data** from the Data Virtualization menu. All of these virtualized tables look and act like normal Db2 tables. 
2. Click **Preview** for any of the tables to see what they contain. 

The virtualized tables in the **FOLDING** schema have all been created by combining the same tables from different data sources. Folding isn't restricted to tables in the same data source, as it was in the simple example you just completed.

The virtualized tables in the **TRADING** schema are views of complex queries that were used to combine data from multiple data sources to answer specific business questions. 

3. Select **SQL Editor** from the Data Virtualization menu.
4. Select **Script Library**
5. Search for **OHIO**
6. Select and expand the **OHIO Customer** query
7. Click the **Open a script to edit** icon to open the script in the SQL Editor
8. Click **Run All**

This script is a complex SQL join query that uses data from all the virtualized data sources you explored in the first steps of this lab. While the SQL looks complex, the author of the query did not have to be aware that the data was coming from multiple sources. Everything used in this query looks like it comes from a single database, not eight different data sources across eight different systems on premises or in the cloud. 

#### Making Complex SQL Simple to Consume
You can easily make this complex query easy for a user to consume. Instead of sharing this query with other users, you can wrap the query into a view that looks and acts like a simple table. 
1. Enter **CREATE VIEW MYOHIOQUERY AS** in the SQL Editor at the first line below the comment and before the **WITH** clause
2. Click **Run all**
3. Click **+** to **Add a new script**
4. Click **Blank**
5. Enter **SELECT * FROM MYOHIOQUERY;**
6. Click **Run all**
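
The pattern in these steps, placing `CREATE VIEW ... AS` in front of a query that starts with `WITH`, can be tried locally with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the tables, columns, and rows are hypothetical, and the real OHIO Customer script is not reproduced here. Only the view name `MYOHIOQUERY` and the `TOTAL` column name come from this lab.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables and data used only for this illustration.
    CREATE TABLE CUSTOMERS (CUSTID INT, STATE TEXT);
    CREATE TABLE PURCHASES (CUSTID INT, AMOUNT INT);
    INSERT INTO CUSTOMERS VALUES (1, 'OH'), (2, 'NY'), (3, 'OH');
    INSERT INTO PURCHASES VALUES (1, 100), (1, 250), (2, 75), (3, 40);

    -- Placing CREATE VIEW ... AS in front of the WITH clause stores the
    -- whole query under a single name.
    CREATE VIEW MYOHIOQUERY AS
    WITH OHIO AS (SELECT CUSTID FROM CUSTOMERS WHERE STATE = 'OH')
    SELECT P.CUSTID, SUM(P.AMOUNT) AS TOTAL
    FROM PURCHASES P JOIN OHIO O ON O.CUSTID = P.CUSTID
    GROUP BY P.CUSTID;
""")

# Consumers of the view only need a simple SELECT.
rows = conn.execute("SELECT * FROM MYOHIOQUERY ORDER BY CUSTID").fetchall()
print(rows)
```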

Now you have a very simple virtualized table that is pulling data from eight different data sources, combining the data together to resolve a complex business problem. In the next step you will share your new virtualized data with a user.

#### Sharing Virtualized Tables
1. Select **My virtualized data** from the Data Virtualization Menu.
2. Click **Manage Access** from the ellipsis menu to the right of the **MYOHIOQUERY** virtualized table
3. Click **Grant access**
4. Select the **LABUSERx** id associated with your lab. If you are LABDATAENGINEER5, then select LABUSER5.
5. Click **Add**

You should now see that your **LABUSER** id has view-only access to the new virtualized table. Next, switch to your LABUSERx id to check that you can see the data you have just granted access to.
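
In SQL terms, view-only access corresponds to the `SELECT` privilege. The sketch below builds a standard `GRANT SELECT` statement as a string to show what such a grant looks like. Whether the console issues this exact statement is an assumption, the helper function is hypothetical, and `MYSCHEMA` is a placeholder for your engineer schema.

```python
def grant_select(schema, view, user):
    """Build a GRANT statement that gives a user read-only access to a view."""
    return f"GRANT SELECT ON {schema}.{view} TO USER {user}"


# MYSCHEMA is a hypothetical placeholder for the engineer's schema.
statement = grant_select("MYSCHEMA", "MYOHIOQUERY", "LABUSER1")
print(statement)
```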

6. Click the user icon at the very top right of the console
7. Click **Log out**
8. Sign in using the LABUSER id specified by your lab instructor
9. Click the three bar menu at the top left of the IBM Cloud Pak for Data console
10. Select **Data Virtualization**

You should see the **MYOHIOQUERY** with the schema from your engineer userid in the list of virtualized data.

11. Make a note of the schema of the MYOHIOQUERY in your list of virtualized tables. It starts with **USER**.
12. Select the **SQL Editor** from the Data virtualization menu
13. Click **Blank** to open a new SQL Editor window
14. Enter **SELECT * FROM USERxxxx.MYOHIOQUERY** where xxxx is the user number of your engineer user. The view created by your engineer user was created in their default schema. 
15. Click **Run all**
16. Add the following to your query: **WHERE TOTAL > 3000 ORDER BY TOTAL**
17. Click **</>** to format the query so it is easier to read
18. Click **Run all**

You can see how you have just made a very complex data set extremely easy for a data user to consume. They don't have to know how to connect to multiple data sources or how to combine the data using complex SQL. You can hide that complexity while ensuring only the right user has access to the right data. 

In the next steps you will learn how to access virtualized data from outside of IBM Cloud Pak for Data.

### Accessing Virtualized Data from outside of IBM Cloud Pak for Data
In the next set of steps you will connect to virtualized data from outside of IBM Cloud Pak for Data. The connection appears just like connecting to a single database. All the complexity of dozens of tables across multiple databases, on premises and across different cloud providers, is now as simple as connecting to a single database and querying a table. 

We are going to connect to the IBM Cloud Pak for Data Virtualization database in exactly the same way we connected to a Db2 database earlier in this lab. However, we need to change the detailed connection information. 

In [None]:
%sql CONNECT TO BLUDB USER user999 USING t1cz?K9-X1_Y-2Wi HOST services-uscentral.skytap.com PORT 9094
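
Because only the connection details change, it can help to keep them in one place and assemble the command text from them. The helper below is a hypothetical convenience, not part of the Db2 magic commands, and every value shown is a placeholder. Take the real database name, host, port and credentials from the connection information for your Data Virtualization service.

```python
def connect_command(database, userid, password, host, port):
    """Build the text of a %sql CONNECT command from its five parts."""
    return (
        f"%sql CONNECT TO {database} USER {userid} "
        f"USING {password} HOST {host} PORT {port}"
    )


# Placeholder values only; they do not refer to a real service.
command = connect_command("MYDB", "myuser", "mypassword", "dv.example.com", 50000)
print(command)
```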

### Show how to create a join view:
* Select table STOCK_TRANSACTIONS
* Select table STOCK_SYMBOLS
* Click Join View
* In table STOCK_SYMBOLS: deselect SYMBOL
* In table STOCK_HISTORY: deselect _ID
* Click STOCK_TRANSACTIONS.SYMBOL and drag to STOCK_SYMBOLS.SYMBOL
* Click Open in SQL Editor
* Click Back button on top of the screen
* Click JOIN
* Type view name VIEW__STOCK_TRANSACTIONS__STOCK_SYMBOLS
* Type schema name DVDEMO
* Click NEXT
* Select project DVDEMO
* Click CREATE VIEW -> Popup window "Join view created" appears
* Click View my virtualized data
* Click the 3 dots beside object VIEW__STOCK_TRANSACTIONS__STOCK_SYMBOLS
* Click Preview
* Click Manage Access
* Click Grant Access
* Select user ctp
* Click Add
* Click Back button on top of the screen -> take you back to My virtualized data screen
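
The join view built in the console corresponds to an ordinary SQL view over a join on the `SYMBOL` column. The sketch below reproduces the idea locally with Python's built-in `sqlite3` module. Only the table names, the view name, and the `SYMBOL` column come from the steps above. All other columns and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical columns and rows, apart from SYMBOL.
    CREATE TABLE STOCK_SYMBOLS (SYMBOL TEXT, COMPANY TEXT);
    CREATE TABLE STOCK_TRANSACTIONS (CUSTID INT, SYMBOL TEXT, QUANTITY INT);
    INSERT INTO STOCK_SYMBOLS VALUES ('AAA', 'Example Corp'), ('BBB', 'Sample Inc');
    INSERT INTO STOCK_TRANSACTIONS VALUES (1, 'AAA', 10), (2, 'BBB', 5), (1, 'BBB', 3);

    -- SYMBOL is selected from one table only, so it appears once in the view.
    CREATE VIEW VIEW__STOCK_TRANSACTIONS__STOCK_SYMBOLS AS
    SELECT T.CUSTID, T.SYMBOL, T.QUANTITY, S.COMPANY
    FROM STOCK_TRANSACTIONS T
    JOIN STOCK_SYMBOLS S ON T.SYMBOL = S.SYMBOL;
""")

rows = conn.execute(
    "SELECT * FROM VIEW__STOCK_TRANSACTIONS__STOCK_SYMBOLS ORDER BY CUSTID, SYMBOL"
).fetchall()
print(rows)
```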


### Show SQL editor and preconfigured scripts:
**Prep steps:**
* Go to SQL Editor and open the following scripts: Ohio customers, 30 day moving average
* Click Ohio Customers to make sure this is preselected

**Steps:**
* Click Menu
* Click SQL Editor
* Click 30 day moving average
* Click Script Library
* Click arrow beside 3% stocks
* Click icon Open a script to edit
* Click Run all -> the query result is displayed

### Show how to deploy the CPD DV service from an external Jupyter notebook
* Open the Connections Info page of the DV service and show the connection information
* Switch to a different browser window that shows the external Jupyter Notebooks console
* Run through the notebook from top to bottom and show the connect to the DV layer and the execution of the sample queries.

In [None]:
schema = engineer+'A'
%sql DROP TABLE {schema}.DISCOVER 

In [None]:
schema = engineer+'B'
%sql DROP TABLE {schema}.DISCOVER

#### Credits: IBM 2019, Peter Kohlmann [kohlmann@ca.ibm.com]