# Getting Started with the Iguazio Data-Science Platform

Learn how to quickly start using the Iguazio Data Science Platform to collect, ingest, and explore data, and perform additional data science tasks:

- [Overview](#overview)
  - [Jupyter Notebook Basics](#jupyter-notebook-basics)
- [Built-In Platform Tools for Data Collection and Exploration](#builtin-product-data-collection-n-expoloration-tools)
- [Collecting and Ingesting Data](#data-collection-and-ingestion)
  - [Ingesting Data From an External Database to a NoSQL Table](#ingest-from-external-db-to-no-sql)
  - [Ingesting Files from Amazon S3 to the Platform](#ingest-from-amazon-s3)
  - [Streaming Data From an External Streaming Engine](#streaming-data-from-an-external-streaming-engine)
  - [Ingesting Data Using the Platform's RESTful Web APIs](#ingest-with-web-apis)
- [Exploring and Processing Data](#data-exploration-and-processing)
  - [Exploring Data Using Spark DataFrames](#data-exploration-spark)
  - [Exploring Data Using V3IO Frames and pandas DataFrames](#data-exploration-v3io-frames-n-pandas)
  - [Exploring Data Using SQL](#data-exploration-sql)
- [Getting-Started Example](#getting-started-example)

<a id="overview"></a>
## Overview

The **GettingStarted** tutorial-notebooks directory contains information and code examples to help you with your first steps using the Iguazio Data Science Platform (**"the platform"**).<br>
For an overview of the platform and how it can be used to implement a full data science workflow, see the [**Welcome**](../Welcome.ipynb) tutorial notebook.<br>
For full end-to-end platform use-case application demos, see the [**demos**](demos/README.ipynb) tutorial-notebooks directory.

<a id="jupyter-notebook-basics"></a>
### Jupyter Notebook Basics

The platform's Jupyter Notebook service displays the JupyterLab UI, which consists of a collapsible left sidebar, a main work area (on the right), and a top menu bar.
For details, see the [JupyterLab documentation](https://jupyterlab.readthedocs.io/en/stable/user/interface.html#the-jupyterlab-interface).

The main work area (on the right) contains tabs of documents and activities &mdash; for creating, viewing, editing, and running interactive notebooks, shell terminals, or consoles, as well as viewing and editing other common file types.<br>
To create a new notebook or terminal, select the **New Launcher** option (`+` icon) from the top action toolbar in the left sidebar.

The top menu bar exposes the available top-level actions.

The left-sidebar menu contains commonly used tabs, including a **File Browser** (directory icon) for browsing files.
The root file-browser directory of the platform's Jupyter Notebook service contains the following files and directories:

- <a id="v3io-mount"></a>**v3io** directory &mdash; displays the contents of the `v3io` platform cluster data mount for browsing the contents of the cluster's shared data containers.
  You can also browse the contents of the data containers from the **Data** page of the platform dashboard.
  To learn how to set platform data paths for the different platform programming interfaces, see [Setting Data Paths](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/fundamentals/#data-paths).
  
  You can ingest data into the platform's data containers using various alternative APIs, and retrieve and run queries on the ingested data using various APIs, tools, and services.
  For more information, see the [Working with Data Containers](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/containers/) and [Ingesting and Consuming Files](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/ingest-n-consume-files/) platform tutorials and the other sections of the current [GettingStarted](GettingStarted.ipynb) notebook.<br>
  A cluster security administrator can restrict user access to specific data containers, or to specific directories and files within them, through data-access policies.
  For more information, see [Data-Access Authorization](https://www.iguazio.com/docs/concepts/latest-release/security/#data-access-authorization).

- <a id="running-user-dir"></a>The contents of the running-user home directory &mdash; **users/&lt;running user&gt;**.
  The platform cluster has a predefined "users" container, which is designed to contain **&lt;username&gt;** directories that provide individual development environments for storing user-specific data.
  The platform's Jupyter Notebook, Zeppelin, and web-based shell "command-line services" automatically create such a directory for the running user of the service and set it as the home directory of the service environment, and therefore you can see its contents at the root level of the JupyterLab file browser.
  You can leverage the following environment variables, which are predefined in the platform's command-line services, to access this running-user directory from your code:
  
  - `V3IO_USERNAME` &mdash; set to the username of the running user of the Jupyter Notebook service.
  - `V3IO_HOME` &mdash; set to the running-user directory in the "users" container &mdash; **users/&lt;running user&gt;**.
  - `V3IO_HOME_URL` &mdash; set to the fully qualified `v3io` path to the running-user directory (which is required when using Spark DataFrames and Hadoop FS commands) &mdash; `v3io://users/<running user>`.

  In local file-system commands, you can also use the predefined `User` mount to reference the running-user home directory (`/User[/<data path>]`).

  The running-user home directory also contains the platform's [tutorial Jupyter notebooks](https://github.com/v3io/tutorials):

  - [**Welcome.ipynb**](../Welcome.ipynb) &mdash; a documentation notebook that provides a short introduction to the platform and useful development resources, including the provided tutorial notebooks.
  - **GettingStarted** &mdash; a directory containing getting-started tutorials that explain and demonstrate how to perform basic platform operations &mdash; such as data collection, ingestion, and analysis &mdash; as detailed in the current [GettingStarted](GettingStarted.ipynb) notebook.
  - **demos** &mdash; a directory containing [end-to-end application use-case demos](#demo-tutorials).
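
As a minimal sketch of how the environment variables described above can be combined in Python code, the following cell derives the three common path forms for an **examples** subdirectory of the running-user directory. The `running_user_paths` helper and its fallback values are illustrative assumptions and not part of any platform API; in the platform's services the variables are already set, so the fallbacks are not used.

```python
import os

def running_user_paths(subdir="examples", env=None):
    """Return the local, v3io:// and /User forms of a running-user subdirectory."""
    env = os.environ if env is None else env
    # The fallback values below are hypothetical and only keep the sketch
    # runnable outside of a platform service
    username = env.get("V3IO_USERNAME", "unknown-user")
    home = env.get("V3IO_HOME", "users/" + username)
    home_url = env.get("V3IO_HOME_URL", "v3io://" + home)
    return {
        "local": "/v3io/{}/{}".format(home, subdir),  # local file-system commands
        "url": "{}/{}".format(home_url, subdir),      # Spark DataFrames and Hadoop FS
        "user_mount": "/User/{}".format(subdir),      # predefined User mount
    }

print(running_user_paths())
```

The `local` and `user_mount` entries refer to the same directory; the `url` entry is the form to pass to Spark DataFrames and Hadoop FS commands.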

<a id="builtin-product-data-collection-n-expoloration-tools"></a>
## Built-In Platform Tools for Data Collection and Exploration

The platform provides various ways to collect data from different sources, such as databases, files, and streaming engines.<br>
You can collect data as a one-time operation (for example, from a Jupyter or Zeppelin notebook) or on an ongoing basis by using Nuclio functions.<br>
The examples below reference notebooks that explain how to import data into the platform from Jupyter.

<a id="data-collection-and-ingestion"></a>
## Collecting and Ingesting Data

The platform supports various alternative methods for collecting and ingesting data into the platform's data store (data containers), as demonstrated in the following examples.
For more information, see the [Welcome](../Welcome.ipynb#data-collection-and-ingestion) platform tutorial Jupyter notebook.

<a id="ingest-from-external-db-to-no-sql"></a>
### Ingesting Data From an External Database to a NoSQL Table

See the [ReadingFromExternalDB](ReadingFromExternalDB.ipynb) getting-started tutorial notebook to learn how to collect data from different databases &mdash; such as MySQL, Oracle, and PostgreSQL &mdash; and write it to a NoSQL table in the platform.

<a id="ingest-from-amazon-s3"></a>
### Ingesting Files from Amazon S3 to the Platform

<a id="ingest-from-amazon-s3-using-curl"></a>
#### Ingesting Files from Amazon S3 Using curl

You can use a simple [curl](https://curl.haxx.se/) command to ingest a file (object) from an external web data source, such as an Amazon S3 bucket, to the platform's data containers (i.e., into the platform's distributed file system).
For example, the following commands read a CSV file from the [Iguazio sample data-sets](http://iguazio-sample-data.s3.amazonaws.com/) public Amazon S3 bucket and save it to an **examples** directory in the running-user directory of the "users" container (`/v3io/users/$V3IO_USERNAME` = `/v3io/$V3IO_HOME` = `/User`):

In [None]:
!mkdir -p /v3io/${V3IO_HOME}/examples

In [None]:
!curl -L "iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv" > /v3io/${V3IO_HOME}/examples/stocks.csv
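
If you prefer to stay in Python, the same download can be sketched with the standard library alone. The `download_file` helper below is a hypothetical name introduced for illustration; like `curl -L`, `urllib` follows HTTP redirects by default.

```python
import shutil
import urllib.request
from pathlib import Path

def download_file(url, dest):
    """Stream the object at `url` into the local file `dest`; return its size in bytes."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks instead of loading all data
    return dest.stat().st_size
```

To reproduce the curl example, call the helper with the URL `http://iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv` and the destination `"/v3io/" + os.environ["V3IO_HOME"] + "/examples/stocks.csv"` (after importing `os`).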

<a id="ingest-from-amazon-s3-to-nosql-table-using-v3io-frames-n-pandas"></a>
#### Ingesting Data from Amazon S3 to a NoSQL Table Using V3IO Frames and pandas

See the [frames](frames.ipynb) getting-started tutorial notebook to learn how to import data from Amazon S3 and save it into a NoSQL table in the platform using V3IO Frames and pandas DataFrames.

<a id="streaming-data-from-an-external-streaming-engine"></a>
### Streaming Data From an External Streaming Engine

To read data from an external streaming engine &mdash; such as Kafka, Kinesis, or RabbitMQ &mdash; create a Nuclio function that listens on the stream, and write the stream data to a NoSQL or time-series database (TSDB) table:

1. In the dashboard's side navigation menu, select **Functions** to display the Nuclio serverless functions dashboard.
2. Create a new Nuclio project or select an existing project.
3. In the action toolbar, select **Create Function**.
4. Enter a function name, select an appropriate template, such as **kafka-to-tsdb**, configure the required template parameters, and apply your changes.
5. Select **Deploy** from the action toolbar to deploy your function.
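
To give a rough idea of what such a function looks like, the following is a minimal sketch of a Nuclio Python handler. It is not the code that the **kafka-to-tsdb** template generates: the `parse_record` helper, the assumption that messages are JSON encoded, and the returned summary are all illustrative, and the actual write to a NoSQL or TSDB table is left as a placeholder.

```python
import json

def parse_record(body):
    """Normalize an event body (bytes, str, or an already-parsed dict) into a dict."""
    if isinstance(body, dict):
        return body
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return json.loads(body)

def handler(context, event):
    """Nuclio entry point, invoked once for each message read from the stream."""
    record = parse_record(event.body)
    context.logger.info("received record with fields: " + ", ".join(sorted(record)))
    # Placeholder: write `record` to a NoSQL or TSDB table here,
    # for example by using a data client that is created once per worker
    return json.dumps({"status": "ok", "fields": len(record)})
```

Keeping the parsing logic in a separate function makes it possible to test it locally, without deploying the function or connecting to a stream.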

<a id="ingest-with-web-apis"></a>
### Ingesting Data Using the Platform's RESTful Web APIs

You can use the platform's RESTful web APIs to ingest data into the platform by sending HTTP or HTTPS requests to the endpoint of your cluster's web-APIs service.
To get the URL of this endpoint, go to the **Services** dashboard page and copy the HTTPS link in the **API** column of the "Web APIs" service.
For detailed documentation, see the [web-APIs reference](https://www.iguazio.com/docs/reference/latest-release/api-reference/web-apis/).
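
As a hedged illustration, the following sketch only builds (and does not send) an HTTP `PUT` request that uploads a small object. The `build_put_object_request` helper is a hypothetical name, and the URL layout (`<endpoint>/<container>/<object path>`) and the `X-v3io-session-key` authentication header are assumptions that you should verify against the web-APIs reference for your platform version.

```python
import urllib.request

def build_put_object_request(webapi_url, container, object_path, data, access_key=None):
    """Build a PUT request that uploads `data` (bytes) as an object in a data container."""
    # Assumed URL layout: <web-APIs endpoint>/<container name>/<object path>
    url = "{}/{}/{}".format(webapi_url.rstrip("/"), container, object_path.lstrip("/"))
    request = urllib.request.Request(url, data=data, method="PUT")
    if access_key:
        # Assumed authentication header; confirm it in the web-APIs reference
        request.add_header("X-v3io-session-key", access_key)
    return request
```

Passing the returned object to `urllib.request.urlopen` sends the request. Building the request separately makes it easy to inspect the URL and headers before any data leaves the notebook.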

<a id="data-exploration-and-processing"></a>
## Exploring and Processing Data

After you have ingested data into the platform's data containers, you can use various alternative techniques and tools to explore and analyze the data from Jupyter Notebook.
For more information, see the [Welcome](../Welcome.ipynb#data-exploration-and-processing) notebook.
Following are examples of using different tools to explore data in the platform from a Jupyter notebook:

<a id="data-exploration-spark"></a>
### Exploring Data Using Spark DataFrames

Spark is a distributed computing framework for data analytics.
You can easily run distributed Spark jobs on your platform cluster that use Spark DataFrames to access data files (objects), tables, or streams in the platform's data store.
For more information and examples, see the [SparkSQLAnalytics](SparkSQLAnalytics.ipynb) getting-started tutorial notebook.

<a id="data-exploration-v3io-frames-n-pandas"></a>
### Exploring Data Using V3IO Frames and pandas DataFrames

Iguazio's V3IO Frames open-source data-access library <font color="#00BCF2">\[Tech Preview\]</font> provides a unified high-performance DataFrames API for accessing NoSQL, stream, and time-series data in the platform's data store.
These DataFrames can also be used to analyze the data with pandas. 
For details and examples, see the [frames](frames.ipynb) getting-started tutorial notebook.
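
As a small sketch of the read side of this API, the following helper reads a NoSQL table into a pandas DataFrame. It assumes a NoSQL table that already exists in the client's container; `read_nosql_table` is a hypothetical wrapper introduced for illustration, and it relies only on the `backend`, `table`, and `columns` arguments of the Frames client's `read` method.

```python
def read_nosql_table(client, table_path, columns=None):
    """Read a NoSQL ("kv" backend) table into a pandas DataFrame via a Frames client."""
    # `table_path` is relative to the container that was set when creating the client
    kwargs = {"backend": "kv", "table": table_path}
    if columns:
        kwargs["columns"] = list(columns)  # restrict the read to specific attributes
    return client.read(**kwargs)
```

For example, with a client created as `v3f.Client('framesd:8081', container='users')` (after `import v3io_frames as v3f`), calling `read_nosql_table(client, os.getenv('V3IO_USERNAME') + '/examples/stocks_example_tab')` returns a DataFrame on which you can call pandas methods such as `head()` or `describe()`.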

<a id="data-exploration-sql"></a>
### Exploring Data Using SQL

You can run SQL statements (`SELECT` only) on top of NoSQL tables in the platform's data store.
To do this, use the `%sql` or `%%sql` IPython magic in a Jupyter notebook, followed by an SQL statement.
The platform supports standard ANSI SQL semantics.
Under the hood, the SQL statements are executed via [Presto](https://prestodb.github.io/), which is a distributed SQL engine designed from the ground up for fast analytics queries.

In the example in the following cell, as a preparation for the SQL query, the **stocks.csv** file that was ingested to the **users/&lt;running user&gt;/examples** platform data-container directory in the previous [Ingesting Files from Amazon S3 to the Platform](#ingest-from-amazon-s3) example is written to a **stocks_example_tab** NoSQL table in the same directory.
Then, an SQL `SELECT` query is run on this table.

In [None]:
# Write the CSV file that was ingested in the "Ingesting Files from Amazon S3 Using curl" example
# to a NoSQL table by using V3IO Frames; run that example first so that stocks.csv exists
import os
import pandas as pd
import v3io_frames as v3f

client = v3f.Client('framesd:8081', container='users')

# Read the CSV file through the local v3io data mount
df = pd.read_csv('/v3io/users/' + os.getenv('V3IO_USERNAME') + '/examples/stocks.csv')

# The table path is relative to the container that was set for the client ("users")
tablename = os.getenv('V3IO_USERNAME') + '/examples/stocks_example_tab'
client.write('kv', tablename, df)

In [None]:
table_path = 'v3io.users."' + os.getenv('V3IO_USERNAME') + '/examples/stocks_example_tab"'
%sql select * from $table_path limit 10
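
The table identifier in the query above has the form `v3io.<container name>."<table path>"`, with the table path in double quotes because it contains slashes. As a sketch, a small helper can build this identifier for any container and table path; `presto_table` is a hypothetical name introduced for illustration and not a platform function.

```python
def presto_table(container, table_path, catalog="v3io"):
    """Build a <catalog>.<container>."<table path>" identifier for use in %sql queries."""
    # Double any embedded double quotes, as required in quoted SQL identifiers
    quoted = table_path.strip("/").replace('"', '""')
    return '{}.{}."{}"'.format(catalog, container, quoted)
```

You can assign the result to a variable and reference it from the magic in the same way as `$table_path` in the cell above.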

<a id="getting-started-example"></a>
## Getting-Started Example

Follow the tutorial by running the code cells in order of appearance.

> **Tip:** You can also browse the files and directories that you write to the "users" container in this tutorial from the platform dashboard: in the side navigation menu, select **Data**, and then select the **users** container from the table. On the container data page, select the **Browse** tab, and then use the side directory-navigation tree to browse the directories. Selecting a file or directory in the browse table displays its metadata.

<a id="getting-started-example-step-ingest-csv"></a>
### Step 1: Ingest a sample CSV file from Amazon S3

Use `curl` to download a sample stocks-data CSV file from the [Iguazio sample data-sets](http://iguazio-sample-data.s3.amazonaws.com/) public Amazon S3 bucket.
The sample data is part of the Deutsche Börse public data set.
For additional public data sets, see the [Registry of Open Data on AWS](https://registry.opendata.aws/).

> **Note:** All the platform tutorial notebook examples store the data in an **examples** directory in the running-user directory of the predefined "users" container &mdash; **users/&lt;running user&gt;/examples**.
> The running-user directory is automatically created by the Jupyter Notebook service.
> The `V3IO_HOME` environment variable is used to reference the **users/&lt;running user&gt;** directory.
> To save the data to a different root container directory or to a different container, you need to specify the data path in the local file-system commands as `/v3io/<container name>/<data path>`, and in Spark DataFrames or Hadoop FS commands as a fully qualified path of the format `v3io://<container name>/<table path>`.
> For more information, see the [v3io-mount](#v3io-mount) and [running-user directory](#running-user-dir) information in the [Jupyter Notebook Basics](#jupyter-notebook-basics) section of this notebook.

In [None]:
%%sh 
mkdir -p /v3io/${V3IO_HOME}/examples

# Download a sample stocks CSV file from the Iguazio sample data-set Amazon S3 bucket
curl -L "iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv" > /v3io/${V3IO_HOME}/examples/stocks.csv

<a id="getting-started-example-step-convert-csv-to-nosql-table"></a>
### Step 2: Convert the sample CSV file to a NoSQL table

Read the sample **stocks.csv** file that you downloaded and ingested in the previous step into a Spark DataFrame, and write the data in NoSQL format to a new "stocks_tab" table in the same container directory (**users/&lt;running user&gt;/examples/stocks_tab**). 

> **Note**
> - To use the Iguazio Spark Connector, set the DataFrame's data-source format to `io.iguaz.v3io.spark.sql.kv`.
> - The data path in the Spark DataFrame is specified by using the `V3IO_HOME_URL` environment variable, which is set to `v3io://users/<running user>`.
>   See the [running-user directory](#running-user-dir) information.

In [None]:
import os
from pyspark.sql import SparkSession

# Create a new Spark session
spark = SparkSession.builder.appName("Iguazio getting-started example").getOrCreate()

file_path = os.getenv('V3IO_HOME_URL') + '/examples'

# Read the sample stocks.csv file into a Spark DataFrame, and let Spark infer the schema of the CSV file
df = spark.read.option("header", "true").option("inferSchema", "true").csv(file_path + '/stocks.csv')

# Show the DataFrame data
df.show()

# Write the DataFrame data to a stocks_tab table in the users/<running user>/examples container directory,
# and define the "ISIN" column (attribute) as the table's primary key
(df.write.format("io.iguaz.v3io.spark.sql.kv")
    .mode("append")
    .option("key", "ISIN")
    .option("allow-overwrite-schema", "true")
    .save(file_path + '/stocks_tab/'))


<a id="getting-started-example-step-run-sql-queries"></a>
### Step 3: Run interactive SQL queries

Use the `%sql` Jupyter magic to run SQL queries on the "stocks_tab" table that was created in the previous step.
(The queries are processed using Presto.)
The example runs a `SELECT` query that reads the first ten table items.

In [None]:
table_path = 'v3io.users."' + os.getenv('V3IO_USERNAME') + '/examples/stocks_tab"'
%sql select * from $table_path limit 10

<a id="getting-started-example-step-convert-nosql-table-to-parquet"></a>
### Step 4: Convert the NoSQL table to a Parquet table

Use a Spark DataFrame `write` command to write the data in the Spark DataFrame that was created in [Step 2](#getting-started-example-step-convert-csv-to-nosql-table) to a new **users/&lt;running user&gt;/examples/stocks_prqt** Parquet table.

In [8]:
df.write.mode('overwrite').parquet(file_path + '/stocks_prqt')

<a id="getting-started-example-step-browse-the-examples-dir"></a>
### Step 5: Browse the example container directory

Use a file-system bash-shell command to list the contents of the **users/&lt;running user&gt;/examples** data-container directory, to which all the data that was ingested in the previous steps was saved.
You should see in this directory the **stocks.csv** file, **stocks_tab** NoSQL table directory, and **stocks_prqt** Parquet table directory that you created in the previous steps.
The following cells demonstrate how to issue the same command using the local file system and using Hadoop FS.

In [9]:
# List the contents of the users/<running user>/examples directory using a local file-system command
!ls -lrt /v3io/${V3IO_HOME}/examples

total 0
-rw-r--r-- 1 root nogroup 882055 Mar 31 09:43 stocks.csv
drwxrwxrwx 2 root nogroup      0 Mar 31 09:43 stocks_tab
drwxr-xr-x 2 root nogroup      0 Mar 31 09:43 stocks_prqt


In [None]:
%%sh

# List the contents of the users/<running user>/examples directory using a Hadoop FS command
hadoop fs -ls ${V3IO_HOME_URL}/examples

<a id="getting-started-example-deleting-data"></a>
### Deleting Data

When you are done, you can delete any of the directories or files that you created.
See the instructions in the [Creating and Deleting Container Directories](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/containers/#create-delete-container-dirs) tutorial.
The following example uses a local file-system bash-shell command to delete the entire contents of the **users/&lt;running user&gt;/examples** directory that was created in this example, but not the directory itself.

In [11]:
# Delete all files under my example directory
!rm -rf /v3io/${V3IO_HOME}/examples/*
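
The same cleanup can be sketched in Python with the standard library. The `clear_directory` helper is a hypothetical name introduced for illustration. Unlike `rm -rf <directory>/*`, it also removes hidden entries (names that begin with a period), because the shell `*` pattern skips them by default.

```python
import shutil
from pathlib import Path

def clear_directory(directory):
    """Delete everything inside `directory` but keep the directory itself."""
    removed = 0
    for entry in Path(directory).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subdirectory and all of its contents
        else:
            entry.unlink()        # remove a regular file or a symbolic link
        removed += 1
    return removed  # number of top-level entries that were deleted
```

For the directory used in this tutorial, the call would be `clear_directory("/v3io/" + os.environ["V3IO_HOME"] + "/examples")` (after importing `os`).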

<a id="getting-started-example-release-spark-resources"></a>
### Releasing Spark Resources

When you are done, run the following command to stop your Spark session and release its computation and memory resources:

In [None]:
spark.stop()