## 1. Introduction to Tuplex
[Tuplex](https://tuplex.cs.brown.edu) is a novel big data analytics framework that executes user-defined Python functions at native code speed.

<img src="https://tuplex.cs.brown.edu/_static/img/logo.png" width="128px" style="float: right;" />

The following notebook allows you to run Tuplex interactively and learn key concepts on the way!


### 1.1 Setup via pip
Tuplex can be installed with the following pip command (Python 3.7 to 3.9). For other versions, please build Tuplex from source.

```
pip3 install tuplex
```

If you're running this notebook from the Docker container `tuplex/tuplex`, you can skip the setup: Tuplex is already installed!

### 1.2 Creating and configuring a context object
All computation starts by importing the module and creating a Tuplex context object.

In [None]:
import tuplex

c = tuplex.Context()

A Tuplex context object can be configured in multiple ways. To see a list of available options, call `.options()` on an existing Context object.

In [None]:
c.options()

In the following, let's walk through a few important general options for tuning Tuplex:

- `executorCount` determines how many threads Tuplex should use in addition to the main thread. For example, when set to 3, Tuplex uses 4-way parallelism.
- `driverMemory`/`executorMemory` determine how much memory the main thread and each executor thread should use during computation. Each can be given either as a number in bytes or as a string like `2G` for 2 gibibytes. For example, with `driverMemory=1G`, `executorMemory=1g` and `executorCount=5`, Tuplex uses a total of 6 gibibytes (1 for the main thread plus 1 for each of the 5 executors).
- `partitionSize` determines the task size: each task corresponds to one block of memory of this size.
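
The memory arithmetic from the list above can be sketched in plain Python. The helpers `parse_size` and `total_memory` are hypothetical names introduced only for illustration and are not part of the Tuplex API. The sketch also assumes that all unit suffixes are binary multiples and are case-insensitive, which the text above only states for `2G`.

```python
import re

# Hypothetical helpers for illustration; these are NOT part of the Tuplex API.
# Assumption: suffixes are binary multiples (K = 1024 bytes) and case-insensitive.
UNITS = {"": 1, "B": 1,
         "K": 2**10, "KB": 2**10,
         "M": 2**20, "MB": 2**20,
         "G": 2**30, "GB": 2**30}

def parse_size(value):
    """Convert an int (bytes) or a string such as '2G' or '400KB' to bytes."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", value)
    if match is None or match.group(2).upper() not in UNITS:
        raise ValueError(f"cannot parse memory size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit.upper()]

def total_memory(driver_memory, executor_memory, executor_count):
    """Driver budget plus one executor budget per executor thread."""
    return parse_size(driver_memory) + executor_count * parse_size(executor_memory)

# The example from the text: 1G for the driver and 1g for each of 5 executors.
total = total_memory("1G", "1g", 5)
print(total // 2**30, "GiB")
```

With `executorCount=0`, as in the tiny contexts created below, only the driver budget remains under this model.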

Other settings, which change the behavior of the compiler or enable/disable certain optimizations, are discussed later in this Intro series.

In [None]:
# create a tiny, single-threaded context
c2 = tuplex.Context(executorCount=0, driverMemory='400KB')

In [None]:
# create context using options from a Python dictionary
conf = {'executorCount' : 0, 
        'driverMemory': '400KB'}
        
c3 = tuplex.Context(conf)

In [None]:
# configure context object via YAML file

In [None]:
%%file config.yaml

# this creates a new config file in the current directory which can be passed to tuplex as well.
tuplex:
    -   driverMemory: 400KB
    -   executorCount: 0

In [None]:
c4 = tuplex.Context('config.yaml')

In [None]:
# release small contexts
del c2 
del c3
del c4

### 1.3 Writing your first pipeline in Tuplex
Pipelines in Tuplex are composed by calling operations on datasets: each operator transforms one dataset into another.

To pass data into a pipeline, a source is declared using one of the operations available on the Tuplex context object. Examples of such operations are:

- `parallelize` passes Python objects to Tuplex as a source
- `csv` passes one or more files (selected via a UNIX wildcard pattern) to Tuplex
- ...

In [None]:
# this creates a new dataset
ds = c.parallelize([1, 2, 3, 4])

In [None]:
# To pass the data back as python objects, use .collect()
ds.collect()

Of course, passing data to and from Tuplex is only a first step; it's far more interesting to work with the actual data!

For this, Tuplex provides the ability to use user-defined functions (UDFs) with various operators.

In [None]:
# Lambda expressions can be used
ds.map(lambda x: x * x).collect()

In [None]:
# functions as well
def f(x):
    return x * x + 1

ds.map(f).collect()

Besides transforming objects via functions, we can also use a UDF to filter out certain elements. For example, to retain only the even numbers, we can use the condition `x % 2 == 0`:

In [None]:
ds.filter(lambda x: x % 2 == 0).collect()

Naturally, pipelines can be composed of multiple operations. 

In [None]:
g = lambda x: x % 2 == 0

ds.map(f).filter(g).collect()

Each operation lazily creates a new dataset, so intermediate datasets can be assigned to variables and reused.

In [None]:
ds_transformed = ds.map(f)

ds_filtered = ds_transformed.filter(g)

ds_filtered.collect()
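
As a rough analogy for this lazy behavior, the same pipeline can be written with Python's built-in `map` and `filter`, which are also lazy. This is plain Python and not Tuplex code; it says nothing about how Tuplex compiles or schedules the pipeline. The `calls` list exists only to show when `f` actually runs.

```python
# Plain-Python analogy, NOT Tuplex code: built-in map/filter are lazy as well.
calls = []

def f(x):
    calls.append(x)              # record when the function actually runs
    return x * x + 1

g = lambda x: x % 2 == 0

data = [1, 2, 3, 4]
transformed = map(f, data)           # nothing is computed yet
filtered = filter(g, transformed)    # still nothing
calls_before = list(calls)           # snapshot taken before evaluation

result = list(filtered)              # plays the role of .collect()
print(calls_before, result)          # prints: [] [2, 10]
```

Here `f` maps the inputs to 2, 5, 10 and 17, and the filter keeps the even values 2 and 10.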

### 1.4 Columns, tuples and names

In the above examples, single integers were used as input. Yet, similar to a standard relational database, it often makes sense to structure data in a tabular format. Indeed, many input formats are given in tabular form.

In Tuplex, the elementary objects are tuples. Each element within a tuple may be indexed via a name. To understand this better, let's start with a simple example where multiple tuples are transformed and explore the various valid syntax options Tuplex provides:

In [None]:
ds = c.parallelize([(1, 2), (3, 4), (5, 6)])

In [None]:
# show displays the data in tabular form
ds.show()

Individual columns/elements of the underlying tuples can be accessed either via multiple parameters (their number must match the number of elements) or via the standard indexing syntax in Python:

In [None]:
ds.map(lambda a, b: a + b).collect()

In [None]:
ds.map(lambda t: t[0] + t[1]).collect()

Because remembering indices for tuples with many elements may be cumbersome, elements may also be indexed via strings within UDFs:

In [None]:
ds = c.parallelize([(1, 2), (3, 4), (5, 6)], columns=['first', 'second'])

In [None]:
ds.map(lambda t: t['first'] + t['second']).show()

When using Tuplex, both the string access and the integer based access syntax may be freely mixed.

In [None]:
ds.map(lambda t: t[0] + t['second']).show()
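
To make the mixed access syntax concrete, here is a minimal plain-Python model of a row that accepts both integer and string keys. The `Row` class is hypothetical and written only for this illustration; how Tuplex represents rows internally may differ.

```python
# Hypothetical Row class for illustration; NOT Tuplex's internal row type.
class Row:
    def __init__(self, values, columns):
        self.values = tuple(values)
        self.columns = list(columns)

    def __getitem__(self, key):
        # String keys are translated to a position; integers index directly.
        if isinstance(key, str):
            key = self.columns.index(key)
        return self.values[key]

rows = [Row(t, ['first', 'second']) for t in [(1, 2), (3, 4), (5, 6)]]

by_index = [t[0] + t[1] for t in rows]
by_name = [t['first'] + t['second'] for t in rows]
mixed = [t[0] + t['second'] for t in rows]
print(by_index, by_name, mixed)  # each list is [3, 7, 11]
```

All three access styles read the same underlying elements, which is why they can be mixed within a single UDF.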

Looking at the output of `show`, we see that Tuplex has not assigned a column name to the end result. In the next section, we'll take a look at how to work with columns.
However, there's also a convenient syntax for providing a column name using only the `map` operation: return a dictionary whose keys are all strings:

In [None]:
ds.map(lambda t: {'Result' : t[0] + t['second']}).show()


### 1.5 Manipulating individual columns, adding new columns

Sometimes, only individual columns need to be manipulated or new ones should be created from existing ones. Tuplex provides two helper functions for this:

`mapColumn` and `withColumn`, each taking a string to identify a column and a UDF to apply.

In addition, to change the name/key associated with a column, the Tuplex API provides a `renameColumn` operation.

Let's first take a look at how individual columns can be associated with names:

In [None]:
# name columns by passing in information to parallelize
ds = c.parallelize([(1, 2), (3, 4), (5, 6)], columns=['first', 'second'])

ds.show()

In [None]:
ds = c.parallelize([(1, 2), (3, 4), (5, 6)]).renameColumn(0, 'first')
ds.show()

In [None]:
ds = ds.renameColumn(1, 'second').renameColumn('first', 'FIRST')

ds.show()

In [None]:
ds.columns

In [None]:
ds = ds.renameColumn(0, 'first')

As a next step, we can create new columns based on all columns via `withColumn` or manipulate a single column using `mapColumn`:

In [None]:
ds.withColumn('third', lambda a, b: a + b).show()

In [None]:
ds.mapColumn('first', lambda x: x - 1).show()

Lastly, sometimes only a subset of columns is required. For this, Tuplex provides a `selectColumns` operation.

In [None]:
ds.withColumn('third', lambda a, b: a + b).selectColumns(['first', 'third']).show()
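
The effect of these three operations on the example data can be modeled in plain Python. The functions `with_column`, `map_column` and `select_columns` below are hypothetical stand-ins that only mimic the behavior described in this section; they are not Tuplex's implementation. The sketch assumes that `with_column` is given a column name that does not exist yet.

```python
# Plain-Python model of the column operations; NOT Tuplex's implementation.
columns = ['first', 'second']
rows = [(1, 2), (3, 4), (5, 6)]

def with_column(columns, rows, name, udf):
    """Append a new column computed from all existing columns (assumes a new name)."""
    return columns + [name], [row + (udf(*row),) for row in rows]

def map_column(columns, rows, name, udf):
    """Apply udf to a single column and leave the others untouched."""
    i = columns.index(name)
    return columns, [row[:i] + (udf(row[i]),) + row[i + 1:] for row in rows]

def select_columns(columns, rows, names):
    """Keep only the named columns, in the order given."""
    idx = [columns.index(n) for n in names]
    return list(names), [tuple(row[i] for i in idx) for row in rows]

cols3, rows3 = with_column(columns, rows, 'third', lambda a, b: a + b)
_, rows_mapped = map_column(columns, rows, 'first', lambda x: x - 1)
cols_sel, rows_sel = select_columns(cols3, rows3, ['first', 'third'])

print(cols3, rows3)        # ['first', 'second', 'third'] [(1, 2, 3), (3, 4, 7), (5, 6, 11)]
print(rows_mapped)         # [(0, 2), (2, 4), (4, 6)]
print(cols_sel, rows_sel)  # ['first', 'third'] [(1, 3), (3, 7), (5, 11)]
```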

This notebook showed some basic interactions with and manipulations of datasets. However, the data displayed here is rather small, and the benefits of a compiling framework only become apparent when dealing with larger amounts of data. In the [next part II](02_Working_with_files.ipynb), we'll therefore learn more about how to work efficiently with files in Tuplex!

(c) 2017 - 2022 Tuplex team