# Getting Started with the Iguazio Data-Science Platform

Learn how to quickly start using the Iguazio Data Science Platform to collect, ingest, and explore data, and perform additional data science tasks:

- [Overview](#gs-overview)
  - [Platform Data Containers](#platform-data-containers)
- [Collecting and Ingesting Data](#gs-data-collection-and-ingestion)
  - [Ingestion Examples Overview](#data-collection-and-ingestion-examples-overview)
  - [Ingesting Data From an External Database to a NoSQL Table Using V3IO Frames](#ingest-from-external-db-to-no-sql-using-frames)
  - [Ingesting Files from Amazon S3](#ingest-from-amazon-s3)
  - [Streaming Data From an External Streaming Engine Using Nuclio](#streaming-data-from-an-external-streaming-engine-using-nuclio)
- [Exploring and Processing Data](#gs-data-exploration-and-processing)
  - [Exploring Data Using Spark DataFrames](#data-exploration-spark)
  - [Exploring Data Using V3IO Frames and pandas DataFrames](#data-exploration-v3io-frames-n-pandas)
  - [Exploring Data Using SQL](#data-exploration-sql)
- [Getting-Started Example](#getting-started-example)

<a id="gs-overview"></a>
## Overview

The **GettingStarted** tutorial-notebooks directory contains information and code examples to help you with your first steps using the Iguazio Data Science Platform (**"the platform"**).<br>
For an overview of the platform and how it can be used to implement a full data science workflow, see the [**Welcome**](../Welcome.ipynb) tutorial notebook.<br>
For full end-to-end platform use-case application demos, see the [**demos**](../demos/README.ipynb) tutorial-notebooks directory.

<a id="platform-data-containers"></a>
### Platform Data Containers

Data is stored within data containers in the platform's distributed file system (DFS).
All platform clusters have two predefined containers:

- <a id="default-container"></a> The default **"bigdata"** container.
- <a id="users-container"></a>The **"users"** container, which is designed to contain **&lt;username&gt;** directories that provide individual development environments for storing user-specific data.
  The platform's Jupyter Notebook, Zeppelin, and web-based shell "command-line services" automatically create such a directory for the running user of the service and set it as the home directory of the service environment.
  You can leverage the following environment variables, which are predefined in the platform's command-line services, to access this running-user directory from your code:

  - `V3IO_USERNAME` &mdash; set to the username of the running user of the Jupyter Notebook service.
  - `V3IO_HOME` &mdash; set to the running-user directory in the "users" container &mdash; **users/&lt;running user&gt;**.
  - `V3IO_HOME_URL` &mdash; set to the fully qualified `v3io` path to the running-user directory &mdash; `v3io://users/<running user>`.

The data containers and their contents are referenced differently depending on the programming interface.
For example, in local file-system commands you use the predefined `v3io` root data mount &mdash; `/v3io/<container name>[/<data path>]` &mdash; or the predefined `User` mount to the **users/&lt;running user&gt;** directory &mdash; `/User[/<data path>]` (= `/v3io/users/$V3IO_USERNAME[/<data path>]`).
But in Hadoop FS or Spark DataFrame commands, you use a fully qualified path of the format `v3io://<container name>/<data path>`.
For detailed information on how to set the data path for each interface, see [Setting Data Paths](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/fundamentals/#data-paths).
For information and examples of how to use the different platform interfaces to create and delete containers and container directories, browse their contents, and ingest and consume container data, see the [Working with Data Containers](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/containers/) and [Ingesting and Consuming Files](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/ingest-n-consume-files/) platform quick-start tutorials.
See also the examples in the platform's getting-started tutorial Jupyter notebooks, such as the [getting-started example](#getting-started-example) in the current notebook.
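
As a rough illustration of the two path styles described above, the following sketch builds a local file-system path and a fully qualified `v3io` path for the same data. The helper names (`local_path`, `v3io_url`) and the fallback username `demo` are assumptions made for this example and aren't part of any platform API.

```python
import os
import posixpath


def local_path(container, data_path=""):
    """Return the local file-system path under the predefined /v3io mount."""
    return posixpath.join("/v3io", container, data_path.strip("/")).rstrip("/")


def v3io_url(container, data_path=""):
    """Return the fully qualified v3io path used by Hadoop FS and Spark."""
    return "v3io://" + posixpath.join(container, data_path.strip("/")).rstrip("/")


# V3IO_USERNAME is predefined in the platform's command-line services;
# the "demo" fallback is a hypothetical value for running outside the platform.
username = os.getenv("V3IO_USERNAME", "demo")
examples_dir = username + "/examples"

print(local_path("users", examples_dir))
print(v3io_url("users", examples_dir))
```

Both calls point at the same **users/&lt;running user&gt;/examples** directory; only the prefix changes with the programming interface.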

A cluster security administrator can restrict user access to specific data containers, or to directories and files within them, through data-access policies.
For more information, see [Data-Access Authorization](https://www.iguazio.com/docs/concepts/latest-release/security/#data-access-authorization).

<a id="gs-data-collection-and-ingestion"></a>
## Collecting and Ingesting Data

The platform supports various alternative methods for collecting and ingesting data into its data containers (i.e., its data store).
For more information, see the [Welcome](../Welcome.ipynb#data-collection-and-ingestion) platform tutorial Jupyter notebook.
The data collection and ingestion can be done as a one-time operation, using different platform APIs &mdash; which can be run from your preferred programming interface, such as an interactive web-based Jupyter or Zeppelin notebook &mdash; or as an ongoing ingestion stream, using Nuclio serverless functions.
The getting-started tutorial examples demonstrate how to import and ingest data into the platform using code that's run from a Jupyter notebook.

<a id="data-collection-and-ingestion-examples-overview"></a>
### Ingestion Examples Overview

The platform's getting-started tutorial Jupyter notebooks and the quick-start tutorials in the platform's documentation site feature code examples that demonstrate how to use the different data ingestion methods supported by the platform.
You can browse the examples by different criteria:

<a id="ingestion-examples-data-source"></a>
#### Examples by Data Source

- External database
  - [Ingesting Data From an External Database to a NoSQL Table Using V3IO Frames](#ingest-from-external-db-to-no-sql-using-frames)
- Amazon S3
  - [Ingesting Files from Amazon S3 to the Platform](#ingest-from-amazon-s3)
- External streaming engine
  - [Streaming Data From an External Streaming Engine Using Nuclio](#streaming-data-from-an-external-streaming-engine-using-nuclio)

<a id="ingestion-examples-by-api"></a>
#### Examples by API

- curl or Botocore
  - [Ingesting Files from Amazon S3 to the Platform File System Using curl](#ingest-from-amazon-s3-using-curl)
- Spark DataFrames &mdash; see the [getting-started example](#getting-started-example), the [SparkSQLAnalytics](SparkSQLAnalytics.ipynb) and [FilesAccess](FilesAccess.ipynb) getting-started tutorial notebooks, and the [Getting Started with Data Ingestion Using Spark](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/data-ingestion-w-spark-qs/) platform quick-start tutorial.
- Platform web APIs &mdash; you can use the platform's RESTful web APIs to ingest data into the platform by sending HTTP requests to the APIs endpoint URL of your cluster's web-APIs service, which is available from the **Services** page of the platform dashboard.
  For detailed documentation and examples, see the [Sending HTTP Requests](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/fundamentals/#sending-http-requests) and [Ingesting and Consuming Files](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/ingest-n-consume-files/) platform quick-start tutorials and the platform's [web-API references](https://www.iguazio.com/docs/reference/latest-release/api-reference/web-apis/).
- Nuclio functions
  - [Streaming Data From an External Streaming Engine Using Nuclio](#streaming-data-from-an-external-streaming-engine-using-nuclio)
- V3IO Frames and pandas DataFrames
  - [Ingesting Data From an External Database to a NoSQL Table Using V3IO Frames](#ingest-from-external-db-to-no-sql-using-frames)
  - [Ingesting Data from Amazon S3 to a NoSQL Table Using V3IO Frames and pandas](#ingest-from-amazon-s3-to-nosql-table-using-v3io-frames-n-pandas)
- Platform dashboard &mdash; see the [Ingesting and Consuming Files](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/ingest-n-consume-files/) platform quick-start tutorial.

<a id="ingestion-examples-by-file-type"></a>
#### Examples by File Type
  - CSV files
    - Using Spark DataFrames and curl or Botocore to ingest CSV files &mdash; see the [getting-started example](#getting-started-example), the [FilesAccess](FilesAccess.ipynb) and [SparkSQLAnalytics](SparkSQLAnalytics.ipynb) getting-started tutorial notebooks, and the [Getting Started with Data Ingestion Using Spark](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/data-ingestion-w-spark-qs/) platform quick-start tutorial.
    - Using V3IO Frames to ingest CSV files &mdash; see the [frames](frames.ipynb) getting-started tutorial notebook.
  - Parquet tables
    - Using Spark DataFrames to ingest a Parquet table or convert a CSV file into a Parquet table &mdash; see the [getting-started example](#getting-started-example-step-convert-data-to-parquet) and [Getting Started with Data Ingestion Using Spark](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/data-ingestion-w-spark-qs/) tutorial.
    - Using pandas DataFrames &mdash; see the [ReadWriteFromParquet](ReadWriteFromParquet.ipynb) notebook.
  - Ingesting binary image files &mdash; see the [Ingesting and Consuming Files](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/ingest-n-consume-files/) tutorial.

<a id="ingest-from-external-db-to-no-sql-using-frames"></a>
### Ingesting Data From an External Database to a NoSQL Table Using V3IO Frames

For an example of how to collect data from an external database &mdash; such as MySQL, Oracle, or PostgreSQL &mdash; and ingest (write) it into a NoSQL table in the platform, using the V3IO Frames API, see the [ReadingFromExternalDB](ReadingFromExternalDB.ipynb) getting-started tutorial.
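
The collection half of this flow can be sketched with Python's built-in `sqlite3` module standing in for the external database. This is an assumption made to keep the example self-contained; a MySQL, Oracle, or PostgreSQL source would use its own driver. The `fetch_rows` helper and the `stocks` table are hypothetical.

```python
import sqlite3


def fetch_rows(connection, query, params=()):
    """Run a query and return the result rows as a list of dictionaries."""
    cursor = connection.execute(query, params)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# In-memory database used as a stand-in for the external data source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (isin TEXT PRIMARY KEY, mnemonic TEXT, price REAL)")
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?, ?)",
    [("FR0000131104", "BNP", 59.3), ("DE0005408884", "LEO", 52.52)],
)

# Collect only the rows of interest before ingesting them
rows = fetch_rows(conn, "SELECT * FROM stocks WHERE price > ? ORDER BY isin", (55,))
print(rows)
conn.close()
```

Rows collected this way can be loaded into a pandas DataFrame and written to a NoSQL table with the V3IO Frames client, as the referenced tutorial notebook demonstrates.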

<a id="ingest-from-amazon-s3"></a>
### Ingesting Files from Amazon S3

<a id="ingest-from-amazon-s3-using-curl"></a>
#### Ingesting Files from Amazon S3 to the Platform File System Using curl

You can use a simple [curl](https://curl.haxx.se/) command to ingest a file (object) from an external web data source, such as an Amazon S3 bucket, to the platform's distributed file system (i.e., into the platform's data store).
This is demonstrated in the following code example as well as in the [getting-started example](#getting-started-example).
The [SparkSQLAnalytics](GettingStarted/SparkSQLAnalytics.ipynb) getting-started tutorial notebook demonstrates a similar ingestion using [Botocore](https://github.com/boto/botocore).

The example in the following cells uses curl to read a CSV file from the [Iguazio sample data-sets](http://iguazio-sample-data.s3.amazonaws.com/) public Amazon S3 bucket and save it to an **examples** directory in the running-user directory of the predefined "users" data container (`/v3io/users/$V3IO_USERNAME` = `/v3io/$V3IO_HOME` = `/User`).

In [1]:
!mkdir -p /v3io/${V3IO_HOME}/examples

In [2]:
!curl -L "iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv" > /v3io/${V3IO_HOME}/examples/stocks.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  861k  100  861k    0     0   599k      0  0:00:01  0:00:01 --:--:--  599k
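
The same download can also be done from Python code instead of a shell command. The following is a minimal sketch that uses only the Python standard library; the `download_file` helper name is an assumption for this example.

```python
import os
import shutil
import urllib.request


def download_file(url, dest_path):
    """Download the object at `url` and save it as `dest_path`.

    Parent directories are created when missing (like `mkdir -p`).
    Returns the number of bytes written.
    """
    parent = os.path.dirname(dest_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Stream the response to disk instead of loading it all into memory
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return os.path.getsize(dest_path)
```

To mirror the curl command above, you could call `download_file("http://iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv", "/v3io/" + os.environ["V3IO_HOME"] + "/examples/stocks.csv")` from a notebook cell.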


<a id="ingest-from-amazon-s3-to-nosql-table-using-v3io-frames-n-pandas"></a>
#### Ingesting Data from Amazon S3 to a NoSQL Table Using V3IO Frames and pandas

For an example of how to import data from Amazon S3 and save it into a NoSQL table in the platform's data store by using V3IO Frames and pandas DataFrames, see the [frames](GettingStarted/frames.ipynb) getting-started tutorial notebook.

<a id="streaming-data-from-an-external-streaming-engine-using-nuclio"></a>
### Streaming Data From an External Streaming Engine Using Nuclio

To read data from an external streaming engine &mdash; such as Kafka, Kinesis, or RabbitMQ &mdash; create a Nuclio function that listens on the stream and writes the stream data to a NoSQL or time-series database (TSDB) table:

1. In the dashboard's side navigation menu, select **Functions** to display the Nuclio serverless functions dashboard.
2. Create a new Nuclio project or select an existing project.
3. In the action toolbar, select **Create Function**.
4. Enter a function name, select an appropriate template, such as **kafka-to-tsdb**, configure the required template parameters, and apply your changes.
5. Select **Deploy** from the action toolbar to deploy your function.
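
For orientation, the code of such a function might look like the following sketch. Nuclio Python functions conventionally expose a `handler(context, event)` entry point. The JSON message layout, the `parse_message` helper, and the in-memory `sink` list (a stand-in for the NoSQL or TSDB write) are all assumptions made for this example and do not reproduce the contents of the **kafka-to-tsdb** template.

```python
import json

# Hypothetical stand-in for the table write; a real function would write each
# record to a NoSQL or TSDB table instead of appending to a list.
sink = []


def parse_message(body):
    """Decode one stream message, assumed to carry a single JSON object."""
    if isinstance(body, dict):
        # The body may already have been decoded before reaching the handler
        return body
    if isinstance(body, (bytes, bytearray)):
        body = body.decode("utf-8")
    record = json.loads(body)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    return record


def handler(context, event):
    """Nuclio entry point: called once for each message read from the stream."""
    try:
        record = parse_message(event.body)
    except (ValueError, TypeError) as err:
        # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses
        context.logger.warn("Skipping malformed message: " + str(err))
        return "skipped"
    sink.append(record)
    return "ok"
```

Keeping the parsing step in its own function makes it possible to test the message handling without a running stream.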

<a id="gs-data-exploration-and-processing"></a>
## Exploring and Processing Data

After you have ingested data into the platform's data containers, you can use various alternative methods and tools to explore and analyze the data.
Data scientists typically use Jupyter Notebook to run the exploration phase.
As outlined in the [Welcome](../Welcome.ipynb#data-exploration-and-processing) tutorial notebook, the platform's Jupyter Notebook service has a wide range of pre-deployed popular data science tools (such as Spark and Presto) and allows installation of additional tools and packages, enabling you to use different APIs to access the same data from a single Jupyter notebook.
Following are examples of using different tools to explore data in the platform from a Jupyter notebook.

<a id="data-exploration-spark"></a>
### Exploring Data Using Spark DataFrames

Spark is a distributed computing framework for data analytics.
You can easily run distributed Spark jobs on your platform cluster that use Spark DataFrames to access data files (objects), tables, or streams in the platform's data store.
For more information and examples, see the [SparkSQLAnalytics](SparkSQLAnalytics.ipynb) getting-started tutorial notebook.

<a id="data-exploration-v3io-frames-n-pandas"></a>
### Exploring Data Using V3IO Frames and pandas DataFrames

Iguazio's V3IO Frames open-source data-access library <font color="#00BCF2">\[Tech Preview\]</font> provides a unified high-performance DataFrames API for accessing NoSQL, stream, and time-series data in the platform's data store.
These DataFrames can also be used to analyze the data with pandas. 
For details and examples, see the [frames](frames.ipynb) getting-started tutorial notebook.

<a id="data-exploration-sql"></a>
### Exploring Data Using SQL

You can run SQL statements (`SELECT` only) on top of NoSQL tables in the platform's data store.
To do this, use the `%sql` or `%%sql` Jupyter (IPython) magic followed by an SQL statement.
The platform supports standard ANSI SQL semantics.
Under the hood, the SQL statements are executed via [Presto](https://prestodb.github.io/), which is a distributed SQL engine designed from the ground up for fast analytics queries.

In the example in the following cell, as a preparation for the SQL query, the **stocks.csv** file that was ingested to the **users/&lt;running user&gt;/examples** platform data-container directory in the previous [Ingesting Files from Amazon S3 to the Platform](#ingest-from-amazon-s3) example is written to a **stocks_example_tab** NoSQL table in the same directory.
Then, an SQL `SELECT` query is run on this table.

In [3]:
# Convert the CSV file that was ingested in the AWS S3 data-collection example into a NoSQL table by using the V3IO Frames library.
# NOTE: Make sure to first create a V3IO Frames service from the "Services" page of the platform dashboard, and run the
# "Ingesting Files from Amazon S3 to the Platform File System Using curl" example to create users/$V3IO_USERNAME/examples/stocks.csv.
import os
import pandas as pd
import v3io_frames as v3f

client = v3f.Client('framesd:8081', container='users')

# Read the ingested CSV file from the local /v3io data mount into a pandas DataFrame
df = pd.read_csv(os.path.join('/v3io/users', os.getenv('V3IO_USERNAME'), 'examples/stocks.csv'))

# Write the DataFrame to a NoSQL (key-value) table; the path is relative to the "users" container
table_path = os.path.join(os.getenv('V3IO_USERNAME'), 'examples/stocks_example_tab')
client.write('kv', table_path, df)

In [4]:
# Use Presto to query the stocks NoSQL table that was created in the previous step
table_path = 'v3io.users."' + os.getenv('V3IO_USERNAME') + '/examples/stocks_example_tab"'
%sql select * from $table_path limit 10

Done.


securitydesc,securitytype,time,isin,minprice,date,endprice,numberoftrades,mnemonic,currency,securityid,maxprice,tradedvolume,startprice
BNP PARIBAS INH. EO 2,Common stock,08:34,FR0000131104,59.3,2018-03-26,59.3,1,BNP,EUR,2505185,59.3,25,59.3
I2-I.MSCI USA QD.UETF DLD,ETF,08:18,IE00BKM4H312,26.225,2018-03-26,26.225,1,QDVD,EUR,2505429,26.225,100,26.225
COMST.-F.A.Z.IDX U.ETF I,ETF,08:55,LU0650624025,27.495,2018-03-26,27.495,1,C006,EUR,2506038,27.495,431,27.495
VANG.FTSE A.-WO.U.ETF DL,ETF,08:53,IE00B3RBWM25,67.39,2018-03-26,67.39,1,VGWL,EUR,2749247,67.39,348,67.39
LEONI AG NA O.N.,Common stock,08:56,DE0005408884,52.5,2018-03-26,52.52,9,LEO,EUR,2504929,52.56,538,52.56
BAYWA AG VINK.NA. O.N.,Common stock,08:41,DE0005194062,28.35,2018-03-26,28.35,1,BYW6,EUR,2504903,28.35,22,28.35
TLG IMMOBILIEN AG,Common stock,08:55,DE000A12B8Z4,22.7,2018-03-26,22.7,6,TLG,EUR,2504555,22.7,446,22.7
"RIO TINTO PLC LS-,10",Common stock,08:59,GB0007188757,41.285,2018-03-26,41.285,1,RIO1,EUR,2505378,41.285,12,41.285
"NOVARTIS NAM. SF 0,50",Common stock,08:49,CH0012005267,65.0,2018-03-26,65.0,1,NOT,EUR,2504217,65.0,26,65.0
IS EO H.Y.CO.BD U.ETF EOD,ETF,08:50,IE00B66F4759,104.795,2018-03-26,104.795,1,EUNW,EUR,2505762,104.795,100,104.795
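
The table reference used in the query follows the pattern `v3io.<container name>."<table path>"`, as seen in the query cell above. The following sketch wraps that pattern in a small helper; the `presto_table` and `select_query` names are assumptions for this example, and the `iguazio` username is a placeholder.

```python
def presto_table(container, table_path):
    """Return a Presto table reference of the form v3io.<container>."<path>"."""
    if '"' in table_path:
        raise ValueError("table path must not contain double quotes")
    return 'v3io.{}."{}"'.format(container, table_path.strip("/"))


def select_query(container, table_path, limit=10):
    """Build a SELECT statement that reads the first `limit` table items."""
    return "select * from {} limit {}".format(
        presto_table(container, table_path), int(limit)
    )


print(select_query("users", "iguazio/examples/stocks_example_tab"))
```

The resulting string can be passed to the `%sql` magic through a Python variable, in the same way that `$table_path` is used above.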


<a id="getting-started-example"></a>
## Getting-Started Example

Follow the tutorial by running the code cells in order of appearance.
See also the [Converting a CSV File to a NoSQL Table](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/ingest-n-consume-files/#convert-csv-to-nosql) platform quick-start tutorial.

> **Tip:** You can also browse the files and directories that you write to the "users" container in this tutorial from the platform dashboard: in the side navigation menu, select **Data**, and then select the **users** container from the table. On the container data page, select the **Browse** tab, and then use the side directory-navigation tree to browse the directories. Selecting a file or directory in the browse table displays its metadata.

<a id="getting-started-example-step-ingest-csv"></a>
### Step 1: Ingest a sample CSV file from Amazon S3

Use `curl` to download a sample stocks-data CSV file from the [Iguazio sample data-set](http://iguazio-sample-data.s3.amazonaws.com/) public Amazon S3 bucket.
For additional public data sets, check out [Registry of Open Data on AWS](https://registry.opendata.aws/).

> **NOTE:** All the platform tutorial notebook examples store the data in an **examples** directory in the running-user directory of the predefined "users" container &mdash; **users/&lt;running user&gt;/examples**.
> The running-user directory is automatically created by the Jupyter Notebook service.
> The `V3IO_HOME` environment variable is used to reference the **users/&lt;running user&gt;** directory.
> To save the data to a different root container directory or to a different container, you need to specify the data path in the local file-system commands as `/v3io/<container name>/<data path>`, and in Spark DataFrames or Hadoop FS commands as a fully qualified path of the format `v3io://<container name>/<data path>`.
> For more information, see the [v3io-mount](#v3io-mount) and [running-user directory](#running-user-dir) information in the [Jupyter Notebook Basics](#jupyter-notebook-basics) section of this notebook.

In [None]:
%%sh 
mkdir -p /v3io/${V3IO_HOME}/examples

# Download a sample stocks CSV file from the Iguazio sample data-set Amazon S3 bucket
curl -L "iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv" > /v3io/${V3IO_HOME}/examples/stocks.csv

<a id="getting-started-example-step-convert-csv-to-nosql-table"></a>
### Step 2: Convert the sample CSV file to a NoSQL table

Read the sample **stocks.csv** file that you downloaded and ingested in the previous step into a Spark DataFrame, and write the data in NoSQL format to a new "stocks_tab" table in the same container directory (**users/&lt;running user&gt;/examples/stocks_tab**). 

> **Note**
> - To use the Iguazio Spark Connector, set the DataFrame's data-source format to `io.iguaz.v3io.spark.sql.kv`.
> - The data path in the Spark DataFrame is specified by using the `V3IO_HOME_URL` environment variable, which is set to `v3io://users/<running user>`.
>   See the [running-user directory](#running-user-dir) information.

In [None]:
import os
from pyspark.sql import SparkSession

# Create a new Spark session
spark = SparkSession.builder.appName("Iguazio getting-started example").getOrCreate()

# Fully qualified v3io path to the users/<running user>/examples directory
file_path = os.getenv('V3IO_HOME_URL') + '/examples'

# Read the sample stocks.csv file into a Spark DataFrame, and let Spark infer the schema of the CSV file
df = spark.read.option("header", "true").option("inferSchema", "true").csv(file_path + '/stocks.csv')

# Show the DataFrame data
df.show()

# Write the DataFrame data to a stocks_tab table in the users/<running user>/examples container directory,
# and define the "ISIN" column (attribute) as the table's primary key
df.write.format("io.iguaz.v3io.spark.sql.kv").mode("append").option("key", "ISIN").option("allow-overwrite-schema", "true").save(os.path.join(file_path)+'/stocks_tab/')


<a id="getting-started-example-step-run-sql-queries"></a>
### Step 3: Run interactive SQL queries

Use the `%sql` Jupyter magic to run SQL queries on the "stocks_tab" table that was created in the previous step.
(The queries are processed using Presto.)
The example runs a `SELECT` query that reads the first ten table items.

In [None]:
table_path = 'v3io.users."' + os.getenv('V3IO_USERNAME') + '/examples/stocks_tab"'
%sql select * from $table_path limit 10

<a id="getting-started-example-step-convert-data-to-parquet"></a>
### Step 4: Convert the data to a Parquet table

Use a Spark DataFrame `write` command to write the data in the Spark DataFrame &mdash; which was created from the CSV file and used to create the NoSQL table in [Step 2](#getting-started-example-step-convert-csv-to-nosql-table) &mdash; to a new **users/&lt;running user&gt;/examples/stocks_prqt** Parquet table.

In [8]:
df.write.mode('overwrite').parquet(os.path.join(file_path)+'/stocks_prqt')

<a id="getting-started-example-step-browse-the-examples-dir"></a>
### Step 5: Browse the example container directory

Use a file-system bash-shell command to list the contents of the **users/&lt;running user&gt;/examples** data-container directory, to which the data ingested in the previous steps was saved.
You should see in this directory the **stocks.csv** file, **stocks_tab** NoSQL table directory, and **stocks_prqt** Parquet table directory that you created in the previous steps.
The following cells demonstrate how to issue the same command using the local file system and using Hadoop FS.

In [9]:
# List the contents of the users/<running user>/examples directory using a local file-system command
!ls -lrt /v3io/${V3IO_HOME}/examples

total 0
-rw-r--r-- 1 root nogroup 882055 Mar 31 09:43 stocks.csv
drwxrwxrwx 2 root nogroup      0 Mar 31 09:43 stocks_tab
drwxr-xr-x 2 root nogroup      0 Mar 31 09:43 stocks_prqt


In [None]:
%%sh

# List the contents of the users/<running user>/examples directory using a Hadoop FS command
hadoop fs -ls ${V3IO_HOME_URL}/examples
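
If you prefer to stay in Python, a comparable listing can be produced with the standard library. The following sketch mimics the oldest-first ordering of `ls -lrt`; the `list_oldest_first` helper name is an assumption for this example.

```python
import os


def list_oldest_first(directory):
    """Return (name, kind, size) tuples sorted by modification time, oldest first."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            info = entry.stat()
            kind = "dir" if entry.is_dir() else "file"
            entries.append((info.st_mtime, entry.name, kind, info.st_size))
    # Tuples sort by modification time first, then by name for equal times
    entries.sort()
    return [(name, kind, size) for _, name, kind, size in entries]
```

For the directory used in this tutorial, the call would be `list_oldest_first("/v3io/" + os.environ["V3IO_HOME"] + "/examples")`.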

<a id="getting-started-example-deleting-data"></a>
### Deleting Data

When you are done, you can delete any of the directories or files that you created.
See the instructions in the [Creating and Deleting Container Directories](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/containers/#create-delete-container-dirs) tutorial.
The following example uses a local file-system bash-shell command to delete the entire contents of the **users/&lt;running user&gt;/examples** directory that was created in this example, but not the directory itself.

In [11]:
# Delete all files under my example directory
!rm -rf /v3io/${V3IO_HOME}/examples/*
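
A Python counterpart of this cleanup is sketched below. The `clear_directory` helper is an assumption for this example; unlike the `rm -rf .../*` shell glob, it also removes hidden entries whose names begin with a dot.

```python
import os
import shutil


def clear_directory(directory):
    """Delete everything inside `directory` but keep the directory itself.

    Returns the number of top-level entries that were removed.
    """
    removed = 0
    # Collect the entries first so that deletion doesn't disturb the iteration
    with os.scandir(directory) as it:
        entries = list(it)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.remove(entry.path)
        removed += 1
    return removed
```

As with the shell command, the deletion is permanent, so check the path before you call the function.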

<a id="getting-started-example-release-spark-resources"></a>
### Releasing Spark Resources

When you are done, run the following command to stop your Spark session and release its computation and memory resources:

In [None]:
spark.stop()