# Overview

1. Why Koalas
2. What is Koalas
3. Important Caveats
4. Hands On Keyboard Examples


# 1. Why Koalas
Apache Spark is a distributed computing environment. As such, the computations and the corresponding data are sharded across a pool of remote machines. The Spark API (and thus PySpark) provides abstractions that effectively hide the fact that the data is sharded and where it lives. The Spark API includes two notable objects for interacting with data: RDDs and DataFrames. The Spark DataFrame is the recommended object for data scientists to use. It is built on top of the more complicated RDD and offers a number of creature comforts and a simplicity that RDDs do not.

Unfortunately, the Spark DataFrame is very different from the pandas DataFrame. The learning curve isn't extremely steep, but having to use a different technology does pose a problem: we cannot easily lift and shift code from our local machines to a Spark cluster.

This is where Koalas comes in.

# 2. What is Koalas
Koalas implements the pandas DataFrame API on top of Apache Spark. This means that Spark is going to look and feel a lot like pandas.

Koalas was introduced in 2019. At the time of writing this article, the latest version of Koalas is 1.8. Koalas supports Apache Spark 3.1 and below, as it will be officially included in PySpark in the upcoming Apache Spark 3.2.

Currently Koalas is undergoing bi-weekly releases with a very active community. The most common pandas functions have been implemented in Koalas, but there are still many that aren't. As of July 2020, the current state is as follows:

<center><img src="koalas_current_state.jpg" /></center>

More recently I have come to understand that over 90% of the pandas API has been implemented. That being said, while writing the notebooks in this directory I found a few instances where features were not yet implemented.

# 3. Important Caveats

Recall that Koalas is built on top of Spark. Generally speaking, the data is not on the local machine. Some operations will require all the data to be collected on a single machine. This may break things if our worker nodes do not have enough resources to accommodate the operation.
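
To make this caveat concrete, here is a toy, pure-Python sketch (no Spark involved) of why some operations can stay distributed while others need all the data in one place. The `partitions` list and both helper functions are hypothetical stand-ins for data sharded across workers; they are not Spark or Koalas APIs.

```python
# Toy model: each inner list stands in for the rows held by one worker.
# This is an illustration only, not how Spark stores or moves data.
partitions = [[1, 2], [3], [4, 5]]


def distributed_sum(parts):
    """Sum without gathering rows: each partition yields one partial result."""
    partial_sums = [sum(part) for part in parts]  # done "on each worker"
    return sum(partial_sums)  # only the small partial results are combined


def collected_median(parts):
    """Naive median: every row is pulled into a single list first."""
    all_rows = sorted(row for part in parts for row in part)  # full copy in one place
    mid = len(all_rows) // 2
    if len(all_rows) % 2:
        return all_rows[mid]
    return (all_rows[mid - 1] + all_rows[mid]) / 2


print(distributed_sum(partitions))   # 15
print(collected_median(partitions))  # 3
```

In this sketch the sum only ever moves one number per partition, while the naive median needs a machine large enough to hold every row at once, which is the situation the caveat warns about.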

Additionally we can configure how and where data is placed. This is very advanced. We will not cover this topic but it is an important consideration for future growth.

One last note: Spark is "lazy", meaning computations happen only when they are needed. If you don't use the data, no calculations are executed. Keep this in mind, as you may see unexpected results or slow performance if you're not careful.
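
As a rough analogy for laziness, the sketch below uses a plain Python generator rather than Spark. The `expensive_double` function and the `calls` list are hypothetical names used only to record when work actually happens; nothing here uses or imitates a Spark API.

```python
# Analogy only: Python generators are lazy in a similar spirit to Spark.
calls = []


def expensive_double(x):
    calls.append(x)  # record that real work happened
    return x * 2


# Building the pipeline does no work yet.
pipeline = (expensive_double(x) for x in [1, 2, 3])
work_before = len(calls)

# Consuming the result triggers the computation.
result = list(pipeline)
work_after = len(calls)

print(work_before, work_after, result)  # 0 3 [2, 4, 6]
```

The cost shows up at the point where the result is consumed, not where the pipeline is defined, which is why a seemingly cheap line such as displaying a DataFrame can be the slow one.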

# 4. Hands On Keyboard Examples

By this time you should be familiar with the SparkContext object and how to create it. If not, go back and review the prerequisite materials mentioned in the [README.md](README.md) file.

We will initialize our SparkContext object using a utility module from this directory. It creates a SparkContext for a Spark cluster running on Kubernetes.

In [1]:
import create_spark_context

Setting SPARK_HOME
c:\spark\spark-3.1.1-bin-hadoop2.7

Running findspark.init() function
['c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python', 'c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python\\lib\\py4j-0.10.9-src.zip', 'c:\\program files\\python36\\python36.zip', 'c:\\program files\\python36\\DLLs', 'c:\\program files\\python36\\lib', 'c:\\program files\\python36', '', 'c:\\program files\\python36\\lib\\site-packages', 'c:\\program files\\python36\\lib\\site-packages\\win32', 'c:\\program files\\python36\\lib\\site-packages\\win32\\lib', 'c:\\program files\\python36\\lib\\site-packages\\Pythonwin', 'c:\\program files\\python36\\lib\\site-packages\\IPython\\extensions', 'C:\\Users\\Administrator\\.ipython']

Setting PYSPARK_PYTHON
/usr/bin/python3

Create SparkContext
The ip was detected as: 15.1.1.23



## 4.1. Creating A Simple DataFrame
We can see this looks just like pandas!

In [17]:
from databricks import koalas

koalas_dataframe = koalas.DataFrame({
    "A" : [1,2,3,4,5],
    "B" : ["a", "b", "c", "b", "a"],
    "C" : [333, 444, 555, 222, 333]
})

koalas_dataframe

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | a | 333 |
| 1 | 2 | b | 444 |
| 2 | 3 | c | 555 |
| 3 | 4 | b | 222 |
| 4 | 5 | a | 333 |


## 4.2. Indexing
Again, just like pandas!

In [18]:
koalas_dataframe["A"]

0    1
1    2
2    3
3    4
4    5
Name: A, dtype: int64

In [19]:
koalas_dataframe.iloc[1:4]

|   | A | B | C |
|---|---|---|---|
| 1 | 2 | b | 444 |
| 2 | 3 | c | 555 |
| 3 | 4 | b | 222 |


In [20]:
koalas_dataframe.loc[0]

A      1
B      a
C    333
Name: 0, dtype: object

In [21]:
koalas_dataframe[koalas_dataframe["B"] == "a"]

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | a | 333 |
| 4 | 5 | a | 333 |


In [22]:
koalas_dataframe[["A","C"]]

|   | A | C |
|---|---|---|
| 0 | 1 | 333 |
| 1 | 2 | 444 |
| 2 | 3 | 555 |
| 3 | 4 | 222 |
| 4 | 5 | 333 |


## 4.4. Data Manipulation

In [32]:
koalas_dataframe["D"] = [10, 11, 12, 13, 14]
koalas_dataframe

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | a | 333 | 10 |
| 1 | 2 | b | 444 | 11 |
| 2 | 3 | c | 555 | 13 |
| 3 | 4 | b | 222 | 12 |
| 4 | 5 | a | 333 | 14 |


In [39]:
koalas_dataframe[["C", "D"]].groupby("C").apply(lambda x: x.mean())["D"]

C
222    12.0
333    12.0
444    11.0
555    13.0
Name: D, dtype: float64
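
For readers who want to check the result by hand, the following pure-Python sketch reproduces the same group-wise mean without Koalas. The `rows` list is an assumption: it holds the (C, D) pairs copied from the table displayed above, and `group_means` is a hypothetical name used only for this illustration.

```python
from collections import defaultdict

# (C, D) pairs copied from the table displayed above.
rows = [(333, 10), (444, 11), (555, 13), (222, 12), (333, 14)]

# Collect the D values belonging to each distinct value of C.
groups = defaultdict(list)
for c, d in rows:
    groups[c].append(d)

# Average each group, ordered by the group key.
group_means = {c: sum(ds) / len(ds) for c, ds in sorted(groups.items())}
print(group_means)  # {222: 12.0, 333: 12.0, 444: 11.0, 555: 13.0}
```

The only group with more than one row is `333`, whose D values 10 and 14 average to 12.0.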

## 4.5. Loading a CSV 

Please see the corresponding notebooks listed in the README, as loading a CSV is a bit complicated and depends on our SparkContext.

## 4.6. Accessing Underlying Spark Objects

Convert the Koalas DataFrame to a Spark DataFrame.

In [27]:
spark_dataframe = koalas_dataframe.to_spark()
print(type(spark_dataframe))
spark_dataframe

<class 'pyspark.sql.dataframe.DataFrame'>


DataFrame[A: bigint, B: string, C: bigint]

Convert the Koalas DataFrame to a Spark RDD

In [30]:
koalas_dataframe.to_spark().rdd

MapPartitionsRDD[106] at javaToPython at <unknown>:0

Check how our data is sharded across the cluster.

In [23]:
koalas_dataframe.to_spark().rdd.getNumPartitions()

6
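
Note that the DataFrame has only five rows but six partitions, so at least one partition must be empty. The toy sketch below shows one way five rows could be dealt into six chunks. The `split_evenly` helper is hypothetical and is only an assumption about an even, contiguous split; it is not presented as Spark's actual partitioning logic.

```python
def split_evenly(rows, num_partitions):
    """Deal rows into num_partitions contiguous chunks of near-equal size."""
    chunks = []
    for i in range(num_partitions):
        start = i * len(rows) // num_partitions
        end = (i + 1) * len(rows) // num_partitions
        chunks.append(rows[start:end])
    return chunks


rows = [1, 2, 3, 4, 5]  # stand-ins for the five rows of the DataFrame
chunks = split_evenly(rows, 6)
print([len(chunk) for chunk in chunks])  # [0, 1, 1, 1, 1, 1]
```

For a dataset this small, most of the work is scheduling overhead rather than computation, which is worth remembering when a tiny example feels slower than plain pandas.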