## Koalas: Making an Easy Transition from Pandas to Apache Spark

What's Koalas?
- Announced April 24, 2019

For pandas users
- Scale out pandas code using Koalas
- Make learning PySpark much easier

For PySpark users
- Be more productive with pandas-like functions

APIs for Spark users:
- to_koalas(), to_spark()
- DataFrame.spark.cache(), ks.sql()

### A short example

In [None]:
import databricks.koalas as ks

kdf = ks.read_csv("path")  # placeholder: path to a CSV file

kdf.columns = ['x', 'y', 'z']
kdf['x2'] = kdf.x * kdf.x

kdf

### Convert from/to Pandas DataFrame

In [None]:
import pandas as pd

pdf = pd.DataFrame([[1, 10.0, 'a'], 
                    [2, 20.0, 'b'],
                    [3, 30.0, 'c']], columns=['x', 'y', 'z'])

In [None]:
kdf = ks.from_pandas(pdf)
kdf

In [None]:
kdf.to_pandas()

### Convert from/to Spark DataFrame

In [None]:
sdf = spark.createDataFrame([[1, 10.0, 'a'], 
                    [2, 20.0, 'b'],
                    [3, 30.0, 'c']], schema=['x', 'y', 'z'])

In [None]:
kdf = sdf.to_koalas()
kdf.to_spark().show()

### Specify index columns

In [None]:
kdf = sdf.to_koalas(index_col='x')  # column 'x' becomes the index
kdf

In [None]:
kdf.to_spark(index_col='x').show()  # write the index back out as column 'x'

### Transform and apply functions
`transform` and `apply`

In [None]:
kdf = ks.DataFrame({'a': [1, 2, 3],
                   'b': [4, 5, 6]})

def pandas_plus(item):
    return item + 1 # should always return the same length as input

kdf.transform(pandas_plus)

In [None]:
kdf = ks.DataFrame({'a': [1, 2, 3],
                   'b': [4, 5, 6]})

def pandas_plus(item):
    return item[item % 2 == 1] # allows an arbitrary length

kdf.apply(pandas_plus)

`transform_batch` and `apply_batch`

In [None]:
kdf = ks.DataFrame({'a': [1, 2, 3],
                   'b': [4, 5, 6]})

def pandas_plus(item):
    return item + 1 # should always return the same length as input

kdf.transform_batch(pandas_plus)

In [None]:
kdf = ks.DataFrame({'a': [1, 2, 3],
                   'b': [4, 5, 6]})

def pandas_plus(item):
    return item[item.a > 1] # allows an arbitrary length

kdf.apply_batch(pandas_plus)
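
The two families above differ mainly in what the function receives: `transform` and `apply` pass each column as a pandas Series, while `transform_batch` and `apply_batch` pass a chunk of the Koalas DataFrame as a pandas DataFrame. The sketch below imitates that difference with plain pandas only. The helper `apply_in_batches` and the batch size of 2 are hypothetical and not part of Koalas, which decides the batch boundaries itself.

```python
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3],
                    'b': [4, 5, 6]})

def apply_in_batches(frame, func, batch_size=2):
    """Hypothetical helper: call func on consecutive row chunks, then concatenate."""
    chunks = [frame.iloc[i:i + batch_size]
              for i in range(0, len(frame), batch_size)]
    return pd.concat([func(chunk) for chunk in chunks])

# Column-wise style: the function is handed one pandas Series per column.
seen_columnwise = []

def plus_one_per_column(pser):
    seen_columnwise.append(type(pser).__name__)
    return pser + 1

pdf.apply(plus_one_per_column)

# Batch style: the function is handed a pandas DataFrame chunk,
# so it can refer to several columns at once (here, column 'a').
plus_one = apply_in_batches(pdf, lambda batch: batch + 1)           # same length as input
filtered = apply_in_batches(pdf, lambda batch: batch[batch.a > 1])  # arbitrary length
```

Because a batch function only ever sees one chunk at a time, anything that needs the whole dataset (a global mean, for example) would be computed per chunk in this model, which is worth keeping in mind when choosing between the two styles.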

### Spark Schema and Data Type

In [None]:
# Get the spark schema without the index columns
kdf.spark.schema().simpleString()

# Get the spark schema including the index columns
kdf.spark.schema(index_col='index').simpleString()

In [None]:
kdf.spark.print_schema()

kdf.spark.print_schema(index_col='index')

In [None]:
kdf['a'].spark.data_type  # kdf has columns 'a' and 'b'

### DataFrame.spark.apply

In [None]:
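# 'sdf' is a Spark DataFrame; 'index' is kept in the select so that
# index_col='index' can turn it back into the Koalas index

kdf.spark.apply(lambda sdf: sdf.select('index', sdf['a'] * sdf['b']),
                index_col='index').head()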

### Series.spark.transform

In [None]:
# 'scol' is a Spark Column; the accessor is called on a Series, not a DataFrame
kdf['a'].spark.transform(lambda scol: scol.cast('int')).head()

### DataFrame.spark.explain()

In [None]:
(kdf + 1).spark.explain()