## Technical Setup for Using PySpark on the Cloud or a Local Machine

- This course is about using cloud technologies (such as Google Cloud Platform (GCP), Amazon Web Services (AWS) and Databricks) for Big Data processing and analytics, mainly with Spark.

- In this course, we use Spark through its Python API, PySpark.

- There are several ways to set up PySpark: on cloud services such as Google Cloud Platform (GCP), Google Colaboratory (Google Colab), Databricks or Amazon Web Services (AWS), or on a local machine.

<img src="Big_Data_Components.png" width="450" height="450">

When computational resources (CPUs, RAM, etc.) are available, Big Data platforms such as Spark or Hadoop can distribute the analytical computations across that hardware.
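
The idea of spreading one computation over several workers can be sketched without Spark. The following is a minimal illustration, assuming a thread pool stands in for the machines of a cluster; `split_into_partitions` and `distributed_sum` are hypothetical helper names, not Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` chunks get one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

def distributed_sum(data, num_workers=4):
    """Sum each partition on its own worker, then combine the partial sums."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum, partitions))
    return sum(partial_sums)

total = distributed_sum(list(range(1, 101)), num_workers=4)
print(total)  # 5050
```

Spark follows the same split, compute, combine pattern, but the partitions live on different machines and the framework handles scheduling and failures.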

## Set Up PySpark in Google Colab

- After the following steps, you can write and run PySpark code in Google Colab
- Disclaimer: Google Colab does not provide the resources needed for Big Data processing
- A drawback of this method: every time we close the notebook and come back to the code later, we have to repeat the steps

In [None]:
# Install Java 8, which Spark 3.0.x can run on
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Download Spark 3.0.3 and unpack it. The file is listed on
# https://spark.apache.org/downloads.html as spark-3.0.3-bin-hadoop2.7.tgz
# Alternative mirror:
# !wget https://mirror.jframeworks.com/apache/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz
!wget https://dlcdn.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz
!tar xvf spark-3.0.3-bin-hadoop2.7.tgz

!pip install -q findspark
!pip install pyspark==3.0.3

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# Relative to the notebook's working directory, where the archive was unpacked
os.environ["SPARK_HOME"] = "spark-3.0.3-bin-hadoop2.7"

from pyspark.sql import SparkSession

# local[1] runs Spark inside this notebook's VM with a single worker thread
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
rdd_sample = spark.sparkContext.parallelize(data, 2)  # spread the list over 2 partitions
rdd_sample.take(5)  # first 5 elements
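
Because `JAVA_HOME` and `SPARK_HOME` are typed by hand in the cell above, a mistyped path is an easy mistake to make. The following is a small sketch of a sanity check that could be run before creating the `SparkSession`; `check_spark_env` is a hypothetical helper, not part of PySpark or findspark.

```python
import os

def check_spark_env(env=None):
    """Return a list of human-readable problems with JAVA_HOME / SPARK_HOME."""
    env = os.environ if env is None else env
    problems = []
    for name in ("JAVA_HOME", "SPARK_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")
    return problems

# Print whatever is wrong in the current process; prints nothing if all is fine.
for problem in check_spark_env():
    print(problem)
```

Passing a plain dictionary instead of `os.environ` makes it possible to try the check without touching the real environment.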

## Use PySpark in Dataproc (GCP)

- Dataproc is a managed service to run Hadoop and Spark jobs
- A Dataproc cluster comes with preinstalled Spark
- We will typically use the PySpark REPL or spark-submit with Python scripts
- Caution: delete your cluster when you are done working, or you will waste all of your Google credits!
- Step 1: create a Google account
- Step 2: set up the Google Cloud account
    - Redeem your credits
- Visit console.cloud.google.com
- Follow the steps in `Intro_to_Google_Cloud.pptx` and `Technical_Setup.pptx`
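
A script meant for `spark-submit` is an ordinary Python file that builds its own `SparkSession`. The following is a minimal word-count sketch, assuming the file is saved under the hypothetical name `wordcount.py`; the helper `to_pairs` and the app name are illustrative choices, not taken from the course material.

```python
import sys

def to_pairs(line):
    """Turn one line of text into (word, 1) pairs, lower-cased."""
    return [(word.lower(), 1) for word in line.split()]

def main(input_path):
    # pyspark is imported here so the helper above stays usable without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
    lines = spark.sparkContext.textFile(input_path)
    counts = lines.flatMap(to_pairs).reduceByKey(lambda a, b: a + b)
    for word, count in counts.take(10):
        print(word, count)
    spark.stop()

# Only start Spark when an input path was given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

On the cluster this would be started with something like `spark-submit wordcount.py <input-path>`, where the input path is a text file the cluster can read.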

### Dataproc Online Tutorials:
    
- How to Create Google Cloud Dataproc Clusters for Spark: https://www.youtube.com/watch?v=nccCsk_MHDs
- Apache Spark & Jupyter on Google Cloud Dataproc Cluster: https://www.youtube.com/watch?v=5OYT2SSMGo8
- Using PySpark on Dataproc Hadoop Cluster to process large CSV file: https://www.youtube.com/watch?v=Y6OXGc0mmYM

## If we want to configure GCP or read data from GCP on our local machine

- Install the gcloud CLI (optional): https://cloud.google.com/sdk/docs/install. Instead of doing the steps manually in the GCP console, we can configure GCP from the command line on our local machine (laptop)

- For example, we can upload a CSV file to GCP Storage and do the data analysis locally

<img src="Storage_GCP.png" width="800" height="800">

`pip install gcsfs`

`df = pd.read_csv('gs://angular-amp-303119/Data/Churn_Modelling.csv')` (make the data public first so it can be read)
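
The read above can be wrapped in a few small helpers. This is a sketch under stated assumptions: the object has been made public, `pandas` and `gcsfs` are installed, and the helper names `gcs_uri`, `public_https_url` and `read_public_csv` are hypothetical. Passing `storage_options={"token": "anon"}` asks gcsfs for anonymous access.

```python
def gcs_uri(bucket, blob_path):
    """Build the gs:// URI that gcsfs-aware libraries understand."""
    return f"gs://{bucket}/{blob_path.lstrip('/')}"

def public_https_url(bucket, blob_path):
    """Build the plain HTTPS URL of a publicly readable object."""
    return f"https://storage.googleapis.com/{bucket}/{blob_path.lstrip('/')}"

def read_public_csv(bucket, blob_path):
    """Read a public CSV from Cloud Storage without credentials."""
    import pandas as pd  # imported lazily; needs pandas and gcsfs installed
    return pd.read_csv(gcs_uri(bucket, blob_path),
                       storage_options={"token": "anon"})

print(gcs_uri("angular-amp-303119", "Data/Churn_Modelling.csv"))
```

For a public object, the HTTPS form can also be passed to `pd.read_csv` directly, which avoids the `gcsfs` dependency.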

## Use Databricks for PySpark

- Databricks provides a free trial of its notebooks

- Create a Databricks account here: https://community.cloud.databricks.com/

## Configure a Local Machine for PySpark

- By following the steps below, you can run PySpark on a local Linux machine
- The purpose of doing this is only to learn PySpark syntax
- A local machine usually does not have the resources for Big Data, so PySpark will not run fast there; the goal here is learning the syntax, not performance

In [None]:
Spark on local machine:

1- Install JDK 8 or higher via https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
2- tar zxvf jdk-8u281-linux-x64.tar.gz
3- pip install pyspark
4- pip install findspark
5- optional: gedit ~/.bashrc (for example, to export JAVA_HOME and SPARK_HOME permanently)
6- Download Spark from https://spark.apache.org/downloads.html, then tar zxvf spark-3.0.2-bin-hadoop2.7.tgz
7- python (type it in the terminal)
8- import os, findspark
9- findspark.init(os.path.expanduser('~/Downloads/spark-3.0.2-bin-hadoop2.7'))  # the Spark folder itself, not its bin subfolder; Python does not expand '~' on its own
10- exit()
11- pyspark (type it in the terminal)
from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # the pyspark shell already created a context, so reuse it
ls = [1, 2, 3, 4]
rdd_sample = sc.parallelize(ls, 2)
rdd_sample.take(2)
sc.stop()
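
Python does not expand `~` inside a path string on its own, so the Spark folder from the steps above is best resolved explicitly before it is handed to `findspark`. A minimal sketch, where `resolve_spark_home` is a hypothetical helper and the folder name is the one used in the steps:

```python
import os

def resolve_spark_home(path):
    """Expand '~' and environment variables, then return an absolute path."""
    return os.path.abspath(os.path.expanduser(os.path.expandvars(path)))

spark_home = resolve_spark_home("~/Downloads/spark-3.0.2-bin-hadoop2.7")
print(spark_home)                 # absolute path under the current user's home
print(os.path.isdir(spark_home))  # True only once Spark has been unpacked there
```

The resolved value can then be passed on, for example as `findspark.init(spark_home)`.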

## AWS

- We can run a Jupyter Notebook on AWS through its SageMaker service, with PySpark pre-installed

- We can upload a large dataset to S3 (Simple Storage Service) and read it using Boto. First, do the following in the terminal on your local machine:

    - pip install awscli 

    - $ aws configure 

    - AWS Access Key ID [None]: ...

    - AWS Secret Access Key [None]: ...

    - Default region name [None]: ...

    - Default output format [None]: ...
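
`aws configure` stores the answers in plain INI files; the two keys go to `~/.aws/credentials`, which is where `boto3` looks them up by default. The following sketch parses such a file with the standard library to check that a profile is complete. The sample text uses placeholder values and `list_profiles` is a hypothetical helper.

```python
import configparser

# Placeholder content in the layout of ~/.aws/credentials (not real keys).
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = EXAMPLEKEYID
aws_secret_access_key = exampleSecretValue
"""

def list_profiles(credentials_text):
    """Map each profile name to whether it has both key fields."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    required = ("aws_access_key_id", "aws_secret_access_key")
    return {name: all(key in parser[name] for key in required)
            for name in parser.sections()}

print(list_profiles(SAMPLE_CREDENTIALS))  # {'default': True}
```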

In [None]:
# Access our dataset on S3 and create a pandas DataFrame from it
import pandas as pd
import boto3

bucket = "makeschooldata"
file_name = "data/Churn_Modelling.csv"

# 's3' is the service name; the client uses the default configuration
# (the credentials stored by `aws configure`)
s3 = boto3.client('s3')

# Fetch the object stored under this key (file path) in the bucket
obj = s3.get_object(Bucket=bucket, Key=file_name)

# obj['Body'] is a file-like stream of the object's contents
df = pd.read_csv(obj['Body'])
print(df.head())
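
The S3 read above needs real credentials and network access, which makes it awkward to try out. One possible approach, sketched here, is to pass the client in as a parameter so that an in-memory stand-in can replace it. `read_csv_rows`, `FakeS3Client`, the bucket name and the sample data are all hypothetical, and the standard `csv` module is used instead of pandas to keep the sketch dependency-free.

```python
import csv
import io

def read_csv_rows(s3_client, bucket, key):
    """Fetch an object through a boto3-style client and parse it as CSV."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    text = response["Body"].read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

class FakeS3Client:
    """Hypothetical in-memory stand-in for boto3.client('s3')."""
    def __init__(self, objects):
        self._objects = objects  # maps (bucket, key) -> bytes

    def get_object(self, Bucket, Key):
        # Mimic the response shape: a dict whose 'Body' is a readable stream.
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}

fake = FakeS3Client({("demo-bucket", "data/sample.csv"): b"id,score\n1,0.5\n2,0.9\n"})
rows = read_csv_rows(fake, "demo-bucket", "data/sample.csv")
print(rows[0]["score"])  # 0.5
```

With real credentials in place, the same function should accept `boto3.client('s3')` in place of the fake client.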