# Overview

In this notebook we look at the Koalas library for Python, which makes Spark DataFrames look and behave like pandas DataFrames.

This notebook assumes you have read the prerequisite topics noted in the [README.md](README.md).

We will cover the following:

1. Why Koalas
2. What is Koalas
3. Important Caveats
4. Hands On Keyboard Examples
5. Cleanup

# 1. Why Koalas
Apache Spark is a distributed computing environment. As such, the computations and the corresponding data are sharded across a pool of remote machines. The Spark API (and thus PySpark) provides abstractions that effectively hide the fact that the data is sharded and where it lives. The Spark API includes two notable objects for interacting with data: RDDs and DataFrames. The Spark DataFrame is the recommended object for data scientists to use. It is built on top of the more complicated RDD and offers a number of creature comforts and a simplicity that RDDs do not.

Unfortunately, the Spark DataFrame is very different from the pandas DataFrame. The learning curve isn't extremely steep, but the fact that we need to use a different technology does pose a problem: we cannot easily lift and shift code from our local machines to a Spark cluster.

This is where Koalas comes in.

# 2. What is Koalas
The Koalas library provides an API for interacting with data stored on an Apache Spark cluster.
Koalas implements the pandas DataFrame API on top of Apache Spark, which means that Spark is going to look and feel a lot like pandas. Koalas also exposes all the functionality of the vanilla Spark API, such as Spark SQL.

Koalas was introduced in 2019. At the time of writing, the latest version of Koalas is 1.8. Koalas supports Apache Spark 3.1 and below, as it will be officially included in PySpark in the upcoming Apache Spark 3.2.

Koalas currently has bi-weekly releases and a very active community. The most common pandas functions have been implemented in Koalas, but many functions are still missing. As of July 2020, the current state is as follows:

<center><img src="images/koalas_current_state.jpg" /></center>

More recently I have come to understand that over 90% of the pandas API has been implemented. That being said, I found a few features that were not yet implemented while writing the notebooks in this directory.


# 3. Important Caveats

## 3.1. Data is sharded
Recall that Koalas is built on top of Spark. Generally speaking, the data is not on the local machine. Some operations require all the data to be collected on a single machine. This may break things if our worker nodes do not have enough resources to accommodate the operation.
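
To make the cost of collecting concrete, here is a plain-Python analogy (it is not Spark or Koalas code). The `partitions` list is a hypothetical stand-in for shards held by different workers: a sum can be computed shard by shard and then combined, while a collect-style operation has to copy every row to one place first.

```python
# Plain-Python analogy: each inner list stands in for one shard on one worker.
partitions = [[1, 2], [3, 4], [5]]

# A distributed-friendly aggregate: every shard computes a small partial
# result, and only those partial results travel to a single machine.
partial_sums = [sum(shard) for shard in partitions]
total = sum(partial_sums)

# A collect-style operation: every row is copied to a single machine,
# so that machine needs enough memory for the whole dataset.
collected = [row for shard in partitions for row in shard]

print(partial_sums)  # [3, 7, 5]
print(total)         # 15
print(collected)     # [1, 2, 3, 4, 5]
```

With five rows the difference is invisible; with billions of rows, the second pattern is the one that tends to exhaust memory.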

## 3.2. We can configure how data is sharded
We can configure how and where data is placed. This is an advanced topic that we will not cover here, but it is an important consideration for future growth.

## 3.3. Spark is Lazy
Spark is "lazy", meaning computations happen only when they are needed. If you don't use the data, no calculations are executed. Keep this in mind, as you may see unexpected results or slow performance if you're not careful.
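
A Python generator behaves in a similar way and makes a handy mental model. The sketch below is only an analogy, not Spark code; `expensive_double` and `calls` are hypothetical names used to show when the work actually runs.

```python
# Plain-Python analogy for lazy evaluation (this is not Spark code).
calls = []

def expensive_double(x):
    # Record each call so we can see when work actually happens.
    calls.append(x)
    return x * 2

# Building the generator only describes the computation; nothing runs yet.
lazy_result = (expensive_double(x) for x in [1, 2, 3])
calls_before = list(calls)

# Consuming the generator plays the role of a Spark action and triggers the work.
values = list(lazy_result)

print(calls_before)  # []
print(values)        # [2, 4, 6]
print(calls)         # [1, 2, 3]
```

The practical consequence is the same in both worlds: the cost of a long chain of transformations shows up at the step that finally asks for results, which may not be the step you expect.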

## 3.4. Compatibility Is Behind
Koalas [does not support](https://koalas.readthedocs.io/en/latest/getting_started/install.html#python-version-support) the bleeding edge of Python. As of version 1.8.2, the release notes and documentation state that Koalas supports Python 3.5 to 3.8. The official documentation states:
> Koalas support for Python 3.5 is deprecated and will be dropped in the future release. At that point, existing Python 3.5 workflows that use Koalas will continue to work without modification, but Python 3.5 users will no longer get access to the latest Koalas features and bugfixes. We recommend that you upgrade to Python 3.6 or newer.

# 4. Hands On Keyboard Examples

By this time you should be familiar with the SparkContext object and how to create it. If not, go back and review the prerequisite materials mentioned in the [README.md](README.md) file.

For this exercise we will use a local Spark cluster.

In [1]:
import findspark
findspark.init()

import pyspark
sparkConf = pyspark.SparkConf()
sparkConf.setAppName("spark-jupyter-koalas-demo")

<pyspark.conf.SparkConf at 0x7fa018770a90>

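Ignore a Koalas warning about timezones by setting the following environment variable before any Spark context is started:

In [2]:
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

# Create the SparkContext from the configuration above; `sc` is stopped in the Cleanup section.
sc = pyspark.SparkContext.getOrCreate(conf=sparkConf)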

## 4.1. Creating A Simple DataFrame
As we can see, this looks just like pandas!

In [3]:
from databricks import koalas

koalas_dataframe = koalas.DataFrame({
    "A" : [1,2,3,4,5],
    "B" : ["a", "b", "c", "b", "a"],
    "C" : [333, 444, 555, 222, 333]
})

koalas_dataframe

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | a | 333 |
| 1 | 2 | b | 444 |
| 2 | 3 | c | 555 |
| 3 | 4 | b | 222 |
| 4 | 5 | a | 333 |


## 4.2. Indexing
Again, just like pandas!

In [None]:
koalas_dataframe["A"]

In [None]:
koalas_dataframe.iloc[1:4]

In [None]:
koalas_dataframe.loc[0]

In [None]:
koalas_dataframe[koalas_dataframe["B"] == "a"]

In [None]:
koalas_dataframe[["A","C"]]

## 4.4. Data Manipulation

In [None]:
koalas_dataframe["D"] = [10, 11, 12, 13, 14]
koalas_dataframe

In [None]:
koalas_dataframe[["C", "D"]].groupby("C").apply(lambda x: x.mean())["D"]
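
As a sanity check on the group-by above, the same per-group mean of `D` can be worked out with the standard library alone. This is only a sketch for checking expectations on five rows, not a replacement for the Koalas call; the `rows` list simply repeats the `C` and `D` values used in this notebook.

```python
from collections import defaultdict

# (C, D) pairs copied from the example DataFrame in this notebook.
rows = [(333, 10), (444, 11), (555, 12), (222, 13), (333, 14)]

# Bucket the D values by their C key.
groups = defaultdict(list)
for c, d in rows:
    groups[c].append(d)

# Average each bucket; sorting the keys keeps the printout stable.
group_means = {c: sum(ds) / len(ds) for c, ds in sorted(groups.items())}
print(group_means)  # {222: 13.0, 333: 12.0, 444: 11.0, 555: 12.0}
```

Only the key `333` has two rows (`D` values 10 and 14), so it is the only group where the mean differs from a single input value.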

## 4.5. Loading a CSV 

Please see the corresponding notebooks listed in the README, as loading a CSV is a bit complicated and depends on our SparkContext.

## 4.6. Accessing Underlying Spark Objects

Convert the Koalas DataFrame to a Spark DataFrame

In [None]:
spark_dataframe = koalas_dataframe.to_spark()
print(type(spark_dataframe))
spark_dataframe

Convert the Koalas DataFrame to a Spark RDD

In [None]:
koalas_dataframe.to_spark().rdd

Check how many partitions our data is sharded into across the cluster

In [None]:
koalas_dataframe.to_spark().rdd.getNumPartitions()

## 4.7. Using Spark SQL
Spark SQL is Apache Spark's module for working with structured data. It is an integrated API that allows users to seamlessly mix SQL queries with Spark programs and to query structured data in DataFrames using SQL.

This is a very powerful feature, as it provides a common way to access a variety of data sources, including SQL Server, CSV files, Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources!

https://spark.apache.org/sql/

Koalas allows us to apply this functionality to a Koalas DataFrame by referencing the Python variable in curly braces inside the query. We can select or insert data in a Koalas DataFrame. We can also join information between DataFrames regardless of their underlying source.

In [None]:
sql_query = "Select A from {koalas_dataframe}"
koalas.sql(sql_query)

In [None]:
koalas.sql("select * from {koalas_dataframe} where C > 300")
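
Koalas documents that Python variables in scope can be embedded in a query by wrapping their names in curly braces. A rough way to picture the first step, finding those names, is sketched here with the standard library. `placeholder_names` is a hypothetical helper written for illustration; it is not part of Koalas and is not necessarily how Koalas works internally.

```python
from string import Formatter

def placeholder_names(query):
    """Return the names that appear inside {...} in a query string."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text, so those entries are skipped.
    return [field for _, field, _, _ in Formatter().parse(query) if field]

query = "select * from {koalas_dataframe} where C > 300"
names = placeholder_names(query)
print(names)  # ['koalas_dataframe']
```

Each name found this way has to match a variable that exists where the query is issued, which is why the placeholder is spelled exactly like the DataFrame variable.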

In [4]:
kdf1 = koalas.DataFrame({
    "A" : [1,2,3,4,5],
    "B" : ["a", "b", "c", "b", "a"],
})
kdf2 = koalas.DataFrame({
    "A" : [1,2,3,4,5],
    "C" : [333, 444, 555, 222, 333]
})
koalas.sql("""
select kdf1.A, kdf1.B, kdf2.C from {kdf1} kdf1
join {kdf2} kdf2
on kdf1.A = kdf2.A
where kdf2.C > 300
""")

|   | A | B | C |
|---|---|---|---|
| 0 | 5 | a | 333 |
| 1 | 1 | a | 333 |
| 2 | 3 | c | 555 |
| 3 | 2 | b | 444 |
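
Notice that the rows come back in no particular order; Spark does not guarantee row order unless the query asks for one. As a cross-check of which rows to expect, the same join can be run against an in-memory SQLite database from the standard library. This is a stand-in under stated assumptions: SQLite is not Spark SQL, the tables `kdf1` and `kdf2` are created here only for illustration, and an `order by` is added so the output is deterministic.

```python
import sqlite3

# Build two small tables that mirror kdf1 and kdf2 from the cell above.
conn = sqlite3.connect(":memory:")
conn.execute("create table kdf1 (A integer, B text)")
conn.execute("create table kdf2 (A integer, C integer)")
conn.executemany("insert into kdf1 values (?, ?)",
                 list(zip([1, 2, 3, 4, 5], ["a", "b", "c", "b", "a"])))
conn.executemany("insert into kdf2 values (?, ?)",
                 list(zip([1, 2, 3, 4, 5], [333, 444, 555, 222, 333])))

# Same join and filter as the Koalas query, plus an explicit ordering.
joined = conn.execute("""
select kdf1.A, kdf1.B, kdf2.C from kdf1
join kdf2 on kdf1.A = kdf2.A
where kdf2.C > 300
order by kdf1.A
""").fetchall()
conn.close()

print(joined)  # [(1, 'a', 333), (2, 'b', 444), (3, 'c', 555), (5, 'a', 333)]
```

The row with `A = 4` is missing in both results because its `C` value, 222, fails the `C > 300` filter.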


In [5]:
kdf1 = koalas.DataFrame({
    "A" : [1,2,3,4,5],
    "B" : ["a", "b", "c", "b", "a"],
})
kdf2 = koalas.DataFrame({
    "A" : [6,7,8,9,10],
    "B" : ["a", "b", "c", "b", "a"]
})
koalas.sql("""
select * from {kdf1} kdf1
union
select * from {kdf2} kdf2
""")

|   | A | B |
|---|---|---|
| 0 | 2 | b |
| 1 | 4 | b |
| 2 | 3 | c |
| 3 | 5 | a |
| 4 | 9 | b |
| 5 | 6 | a |
| 6 | 7 | b |
| 7 | 8 | c |
| 8 | 10 | a |
| 9 | 1 | a |
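
One detail worth keeping in mind: in SQL, `union` removes duplicate rows while `union all` keeps them. The two frames above share no rows, so all ten survive. The sketch below uses SQLite from the standard library (not Spark SQL) with hypothetical tables `t1` and `t2` that do overlap, to show the difference.

```python
import sqlite3

# Two tiny tables that share the row (2, 'b').
conn = sqlite3.connect(":memory:")
conn.execute("create table t1 (A integer, B text)")
conn.execute("create table t2 (A integer, B text)")
conn.executemany("insert into t1 values (?, ?)", [(1, "a"), (2, "b")])
conn.executemany("insert into t2 values (?, ?)", [(2, "b"), (3, "c")])

# `union` drops the duplicate row; `union all` keeps both copies.
union_rows = conn.execute(
    "select * from t1 union select * from t2 order by A").fetchall()
union_all_rows = conn.execute(
    "select * from t1 union all select * from t2 order by A").fetchall()
conn.close()

print(union_rows)      # [(1, 'a'), (2, 'b'), (3, 'c')]
print(union_all_rows)  # [(1, 'a'), (2, 'b'), (2, 'b'), (3, 'c')]
```

If duplicates matter for your analysis, choose between the two forms deliberately rather than relying on the data happening not to overlap.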


# 5. Cleanup

In [None]:
# Stop the active SparkContext to release its resources.
pyspark.SparkContext.getOrCreate().stop()