# Welcome to the onboarding tutorial

We will run some short examples to get you familiar with the platform.

## Setup
### Copy the HDFS config files to the session
Until this process is automated, the hdfs-site.xml and core-site.xml files must be copied into the session before you connect to HDFS for the first time.

In the file overview, open a terminal by clicking the icon in the top right corner.


In [7]:
#Making the file executable
!ls -la getHDFSSite.sh
!chmod ug+x getHDFSSite.sh

-rwxr-xr-- 1 cdsw cdsw 570 Apr 14 07:08 getHDFSSite.sh


In the terminal, run the provided script by typing ./getHDFSSite.sh (or use tab to autocomplete).
When prompted, enter your workspace password to complete the file transfer.

If this fails because the connection takes too long, try running the script in a non-Jupyter session.

The config files are stored in the conf folder, so in any future sessions it is enough to run:

In [6]:
!cp ~/conf/*site.xml /etc/hadoop/conf
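
Before copying, it can help to confirm that both files actually arrived in the conf folder. The following is a minimal sketch, assuming the script stores them in ~/conf as described above; the helper name `missing_site_files` is hypothetical and not part of the platform.

```python
from pathlib import Path

# The two Hadoop client config files named in the setup step above.
REQUIRED_FILES = ("hdfs-site.xml", "core-site.xml")


def missing_site_files(conf_dir):
    """Return the required config files that are not present in conf_dir."""
    conf_path = Path(conf_dir).expanduser()
    return [name for name in REQUIRED_FILES if not (conf_path / name).is_file()]


# Assumption: getHDFSSite.sh stores the files in ~/conf.
missing = missing_site_files("~/conf")
if missing:
    print("Missing config files:", ", ".join(missing))
else:
    print("Both config files are present.")
```

If anything is reported as missing, rerunning ./getHDFSSite.sh in a terminal is the likely fix.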

## Supported Kernels
Currently, the Jupyter notebook is configured for Python only.

To use R or Scala with the relevant magic commands, please switch to the Workbench editor.

## Python dependency management

In [3]:

# Using pip to install python packages
!pip3 install argparse

# Manage your dependencies in requirements.txt
!cat requirements.txt

# Run startup script in interactive sessions
!./cdsw-build.sh

# List installed packages
# !pip3 list

Collecting argparse
  Downloading https://files.pythonhosted.org/packages/f2/94/3af39d34be01a24a6e65433d19e107099374224905f1e0cc6bbe1fd22a2f/argparse-1.4.0-py2.py3-none-any.whl
Installing collected packages: argparse
Successfully installed argparse-1.4.0
You should consider upgrading via the 'pip install --upgrade pip' command.
cat: requirements.txt: No such file or directory
/bin/sh: ./cdsw-build.sh: No such file or directory
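
The output above shows that requirements.txt does not exist yet in a fresh project. Below is a minimal sketch, in plain Python, of reading the file defensively before installing from it. The helper name `read_requirements` is hypothetical, and the parsing only covers simple files with one requirement per line.

```python
from pathlib import Path


def read_requirements(path="requirements.txt"):
    """Return requirement specifiers from a requirements file, or [] if it is missing."""
    req_file = Path(path)
    if not req_file.is_file():
        return []
    requirements = []
    for line in req_file.read_text().splitlines():
        spec = line.strip()
        # Skip blank lines and whole-line comments.
        if not spec or spec.startswith("#"):
            continue
        # Drop a trailing comment that is separated from the requirement by a space.
        spec = spec.split(" #", 1)[0].strip()
        requirements.append(spec)
    return requirements


print(read_requirements())
```

If the list is non-empty, `!pip3 install -r requirements.txt` installs everything in one step.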


## Running some Python examples

### Estimating Pi

In [8]:
#Estimating Pi

from __future__ import print_function
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = 2
n = 100000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()


Pi is roughly 3.145000
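
The Spark job above counts how many random points in the square [-1, 1] x [-1, 1] fall inside the unit circle. That fraction approaches pi/4, so multiplying it by 4 gives the estimate. The same logic without Spark, as a minimal local sketch (the function name `estimate_pi` and the seed are illustrative choices, not part of the original example):

```python
import random


def estimate_pi(n, seed=None):
    """Monte Carlo estimate of pi from n random points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        # Same test as f(_) in the Spark version: is the point inside the unit circle?
        if x ** 2 + y ** 2 < 1:
            inside += 1
    return 4.0 * inside / n


print("Pi is roughly %f" % estimate_pi(200000, seed=42))
```

In the Spark version, `map(f)` performs the per-point test on the executors and `reduce(add)` sums the hits, which corresponds to the loop and counter here.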


### Connecting to HDFS

In [9]:
# Connecting to HDFS
from __future__ import print_function
import re
import subprocess
import sys
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

spark = SparkSession\
    .builder\
    .appName("HDFSTest")\
    .getOrCreate()
 
# Inspect the effective Spark configuration
spark.sparkContext.getConf().getAll()

REMOTE_HDFS_MASTER = 'tst-env-public-lake-master1.tst-env.hm0v-a3xe.cloudera.site'

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = spark.createDataFrame(data)

df.write.csv(path="hdfs://" + REMOTE_HDFS_MASTER + "/tmp/csv_example", mode="overwrite")

df_load = spark.read.csv("hdfs://" + REMOTE_HDFS_MASTER + "/tmp/csv_example")

df_load.show()



+------+---+
|   _c0|_c1|
+------+---+
| Third|  3|
|Fourth|  4|
| Fifth|  5|
| First|  1|
|Second|  2|
+------+---+
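
The loaded DataFrame shows the generic column names `_c0` and `_c1` because the CSV was written without a header and read back without one. The sketch below writes named columns with a header and restores them on read. The helper names, the column names `name` and `position`, and any target path passed in are assumptions for illustration only.

```python
def hdfs_uri(master, path):
    """Build an hdfs:// URI from a namenode host and an absolute path."""
    return "hdfs://" + master + "/" + path.lstrip("/")


def csv_roundtrip_with_header(spark, uri):
    """Write a small DataFrame with named columns to uri and read it back."""
    data = [("First", 1), ("Second", 2), ("Third", 3)]
    # Column names are illustrative; without them Spark falls back to _1, _2.
    df = spark.createDataFrame(data, ["name", "position"])
    # header=True writes the column names as the first line of each CSV part file.
    df.write.csv(path=uri, mode="overwrite", header=True)
    # header=True restores the names; inferSchema=True parses `position` as a number.
    return spark.read.csv(uri, header=True, inferSchema=True)
```

With the session and `REMOTE_HDFS_MASTER` from the cell above, a call such as `csv_roundtrip_with_header(spark, hdfs_uri(REMOTE_HDFS_MASTER, "/tmp/csv_example_header")).show()` should print the columns as `name` and `position`; the path `/tmp/csv_example_header` is a made-up example.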



### Connecting to an S3 bucket

To connect to S3 with your default credentials, the path must exist.

The default path is s3a://caap-cloudera-data/cdp_datalake/projects/<project-group>/<project-name>/
    
Depending on your permissions, you may not be able to access the project folder 'raw'.    
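
The default path follows a fixed pattern, so it can be assembled from the project group and project name. A small sketch follows; the function name `project_s3_path` and the values `my-group` and `my-project` are placeholders, not real projects.

```python
S3_PROJECTS_ROOT = "s3a://caap-cloudera-data/cdp_datalake/projects"


def project_s3_path(project_group, project_name, *subdirs):
    """Return the default S3 path for a project, optionally extended by sub-folders."""
    parts = [S3_PROJECTS_ROOT, project_group, project_name, *subdirs]
    # Strip stray slashes so the pieces join cleanly, then end with a single "/".
    return "/".join(part.strip("/") for part in parts) + "/"


print(project_s3_path("my-group", "my-project"))
```

Passing an extra argument, for example `project_s3_path("my-group", "my-project", "raw")`, appends a sub-folder such as the 'raw' folder mentioned above.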


In [10]:
# Connecting to S3
from __future__ import print_function
import sys
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

spark = SparkSession\
    .builder\
    .appName("PythonSQL")\
    .config("spark.executor.memory", "4g")\
    .config("spark.executor.instances", 2)\
    .config("spark.yarn.access.hadoopFileSystems","s3a://caap-cloudera-data/cdp_datalake/cdp-test/")\
    .config("spark.driver.maxResultSize","4g")\
    .getOrCreate()

# Inspect the effective Spark configuration
spark.sparkContext.getConf().getAll()

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = spark.createDataFrame(data)


df.write.format("csv").mode("Overwrite").save("s3a://caap-cloudera-data/cdp_datalake/cdp-test/test.csv")

df_load = spark.read.csv("s3a://caap-cloudera-data/cdp_datalake/cdp-test/test.csv")
df_load.show()

spark.stop()


+------+---+
|   _c0|_c1|
+------+---+
| Third|  3|
|Fourth|  4|
| Fifth|  5|
| First|  1|
|Second|  2|
+------+---+



### Connecting to Hive

Hive can be queried via spark.sql:

In [3]:
from __future__ import print_function
import os
import sys
from pyspark.sql import SparkSession
from pyspark.sql.types import Row,StructField,StructType,StringType,IntegerType
import sys, re, subprocess

spark = SparkSession\
    .builder\
    .appName("PythonSQL")\
    .config("spark.executor.instances", 2)\
    .config("spark.yarn.access.hadoopFileSystems","s3a://caap-cloudera-data/cdp_datalake/warehouse/tablespace/external/hive/")\
    .config("spark.sql.warehouse.dir", "s3://caap-cloudera-data/cdp_datalake/warehouse/tablespace/external/hive/")\
    .config("hive.mapred.supports.subdirectories", "true")\
    .config("hive.supports.subdirectories", "true")\
    .config("mapred.input.dir.recursive", "true")\
    .getOrCreate()
spark.sql("SHOW databases").show()


+------------------+
|      databaseName|
+------------------+
|           default|
|information_schema|
|               sys|
|              test|
+------------------+
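
Once the databases are listed, the same `spark.sql` call can run other statements. Here is a minimal sketch that builds a statement listing the tables in one database. The helper name `show_tables_query` is hypothetical, and `test` is simply the database name that appears in the output above.

```python
import re


def show_tables_query(database):
    """Build a SHOW TABLES statement after checking the database name."""
    # Accept only plain identifiers so the name can be placed into the SQL text safely.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", database):
        raise ValueError("unexpected database name: %r" % database)
    return "SHOW TABLES IN " + database


query = show_tables_query("test")
print(query)
# With the session from the cell above: spark.sql(query).show()
```

Checking the name first is a cautious design choice, since the database name is concatenated directly into the SQL text.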



## Modifying Spark configs

As some of the examples showed, you can pass custom configs to Spark.

Project-wide defaults are defined in the file spark-defaults.conf.

You can also take a look at the docs: https://docs.cloudera.com/machine-learning/cloud/spark/topics/ml-managing-dependencies-for-spark-2-jobs.html

The docs include examples of adding files and other dependencies to Spark, as well as changing log levels and other settings.

Please note that some settings are applied at session startup and may only take effect in a new session.
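
spark-defaults.conf holds one property per line, with the key and value separated by whitespace and `#` starting a comment line. A small sketch that reads such text into a dictionary, for example to compare the project defaults with what a session reports. The function name and the sample values are illustrative assumptions, and only the whitespace-separated form is handled.

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf style text into a {key: value} dictionary."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # The key ends at the first run of whitespace; the rest is the value.
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value
    return settings


sample = """
# Illustrative values only
spark.executor.memory    4g
spark.executor.instances 2
"""
print(parse_spark_defaults(sample))
```

To read the real file, pass its contents instead of `sample`, for example `parse_spark_defaults(open("spark-defaults.conf").read())`, assuming the file sits in the current working directory.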

Thank you and have fun!