## Setup

### Install OpenJDK 8

```
$ apt-get install openjdk-8-jdk-headless
$ echo $JAVA_HOME
/home/wengong/jdk-11.0.8+10
```
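
The package installed above is OpenJDK 8, while the echoed `JAVA_HOME` points at a JDK 11 directory, so it can be worth confirming which major version is in use. Below is a minimal sketch of a version-string parser; `java_major_version` is a hypothetical helper name, and it only assumes the two common numbering schemes (legacy `1.8.0_265` and modern `11.0.8`).

```python
import re


def java_major_version(version: str) -> int:
    """Return the major Java version from a version string.

    Legacy scheme:  "1.8.0_265" -> 8
    Modern scheme:  "11.0.8"    -> 11
    """
    match = re.match(r"(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognised Java version string: {version!r}")
    first, second = match.group(1), match.group(2)
    if first == "1" and second is not None:
        # Legacy "1.x" numbering: the real major version is the second field.
        return int(second)
    return int(first)


print(java_major_version("1.8.0_265"))  # 8
print(java_major_version("11.0.8"))     # 11
```

The string to feed in can be taken from the output of `java -version` or from the directory name under `JAVA_HOME`.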


### Install Apache Spark

```
$ cd ~/spark
$ tar xvf ~/Downloads/tmp/spark-3.0.1-bin-hadoop2.7.tgz
```

Add to `.bash_path`:

```
export SPARK_HOME=~/spark/spark-3.0.1-bin-hadoop2.7
export SPARK_VERSION=spark-3.0.1-bin-hadoop2.7
export PYSPARK_PYTHON=python3
if [ -d "$SPARK_HOME/bin" ] ; then
    PATH="$SPARK_HOME/bin:$PATH"
fi
```

```
pip3 install pyspark
```

Download the GraphFrames jar from https://spark-packages.org/package/graphframes/graphframes
into `$SPARK_HOME/jars/`.


To check which Scala version is compatible with Spark, run:

    $ spark-submit --version
    version 3.0.1
    Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.8
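
The Spark and Scala versions reported by `spark-submit --version` are also what determine the GraphFrames package coordinate passed to `--packages`. The sketch below derives that coordinate from the version output; `graphframes_coordinate` is a hypothetical helper, and the default GraphFrames version `0.8.1` is an assumption taken from the coordinate used in these notes. Not every Spark/Scala combination is necessarily published, so the result should be checked against the spark-packages listing.

```python
import re


def graphframes_coordinate(version_output: str, graphframes_version: str = "0.8.1") -> str:
    """Build a GraphFrames package coordinate from `spark-submit --version` text."""
    # Spark version: a "version X.Y" that is not preceded by "Scala ".
    spark = re.search(r"(?<!Scala )version (\d+)\.(\d+)", version_output)
    # Scala version: "Scala version X.Y".
    scala = re.search(r"Scala version (\d+)\.(\d+)", version_output)
    if spark is None or scala is None:
        raise ValueError("could not find Spark and Scala versions in the output")
    spark_mm = f"{spark.group(1)}.{spark.group(2)}"
    scala_mm = f"{scala.group(1)}.{scala.group(2)}"
    return f"graphframes:graphframes:{graphframes_version}-spark{spark_mm}-s_{scala_mm}"


# Sample text copied from the version output shown in these notes.
sample = """version 3.0.1
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.8"""

print(graphframes_coordinate(sample))
# graphframes:graphframes:0.8.1-spark3.0-s_2.12
```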


### Install Scala

    $ cd ~/scala
    $ tar xvf ~/Downloads/tmp/scala-2.12.10.tgz

Add to `.bash_path`:

```
export SCALA_HOME=~/scala/scala-2.12.10
if [ -d "$SCALA_HOME/bin" ] ; then
    PATH="$SCALA_HOME/bin:$PATH"
fi
```

### Run pyspark in Jupyter Notebook

```
$ cd ~/projects/graph/graph-algo/Graph-Algo-git/notebooks
$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS=notebook
$ pyspark \
--driver-memory 2g \
--executor-memory 6g \
--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
```

In [1]:

    from pyspark import *
    from pyspark.sql import *

In [2]:

    from graphframes import *

In [3]:

    spark = SparkSession.builder.appName('fun').getOrCreate()
    vertices = spark.createDataFrame([('1', 'Carter', 'Derrick', 50),
                                      ('2', 'May', 'Derrick', 26),
                                      ('3', 'Mills', 'Jeff', 80),
                                      ('4', 'Hood', 'Robert', 65),
                                      ('5', 'Banks', 'Mike', 93),
                                      ('98', 'Berg', 'Tim', 28),
                                      ('99', 'Page', 'Allan', 16)],
                                     ['id', 'name', 'firstname', 'age'])
    edges = spark.createDataFrame([('1', '2', 'friend'),
                                   ('2', '1', 'friend'),
                                   ('3', '1', 'friend'),
                                   ('1', '3', 'friend'),
                                   ('2', '3', 'follows'),
                                   ('3', '4', 'friend'),
                                   ('4', '3', 'friend'),
                                   ('5', '3', 'friend'),
                                   ('3', '5', 'friend'),
                                   ('4', '5', 'follows'),
                                   ('98', '99', 'friend'),
                                   ('99', '98', 'friend')],
                                  ['src', 'dst', 'type'])

In [4]:

    vertices.show()

    +---+------+---------+---+
    | id|  name|firstname|age|
    +---+------+---------+---+
    |  1|Carter|  Derrick| 50|
    |  2|   May|  Derrick| 26|
    |  3| Mills|     Jeff| 80|
    |  4|  Hood|   Robert| 65|
    |  5| Banks|     Mike| 93|
    | 98|  Berg|      Tim| 28|
    | 99|  Page|    Allan| 16|
    +---+------+---------+---+

In [5]:

    g = GraphFrame(vertices, edges)
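
Before querying the graph, it can help to know what answers to expect. The sketch below is a plain-Python stand-in (not GraphFrames itself) that computes in-degrees and connected components for the same edge list, using a small union-find. Its results are what GraphFrames attributes such as `g.inDegrees` and methods such as `g.connectedComponents()` would be expected to agree with, assuming the graph was built from the edges above.

```python
from collections import Counter

# (src, dst) pairs copied from the edges DataFrame above; edge types are omitted.
edges = [("1", "2"), ("2", "1"), ("3", "1"), ("1", "3"), ("2", "3"),
         ("3", "4"), ("4", "3"), ("5", "3"), ("3", "5"), ("4", "5"),
         ("98", "99"), ("99", "98")]

# In-degree: number of edges pointing at each vertex.
in_degree = Counter(dst for _, dst in edges)

# Union-find over vertex ids, treating edges as undirected.
parent = {}


def find(x):
    """Return the representative of x's component, compressing the path."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


for src, dst in edges:
    parent[find(src)] = find(dst)

# Group vertices by their component representative.
components = {}
for v in list(parent):
    components.setdefault(find(v), set()).add(v)

print(in_degree["3"])    # 4
print(len(components))   # 2
```

Vertex `3` is the best-connected node, and vertices `98` and `99` form a component separate from `1` through `5`.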

## Ref

https://github.com/wgong/py4kids/blob/master/lesson-17-pyspark/spark-guide/notebook/chapter-30-graph.ipynb