
In [0]:

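# Install a headless Java 8 JDK; Spark runs on the JVM
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Download Spark 2.4.5 (prebuilt for Hadoop 2.7) from the Apache download site
# (superseded releases are moved to https://archive.apache.org/dist/spark/)
!wget -q https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz

# Extract the archive
!tar xf spark-2.4.5-bin-hadoop2.7.tgz

# Install findspark, which locates Spark and adds PySpark to sys.path
!pip install -q findspark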





In [2]:
# List the JVMs that are already installed
!ls /usr/lib/jvm

default-java		   java-11-openjdk-amd64     java-8-openjdk-amd64
java-1.11.0-openjdk-amd64  java-1.8.0-openjdk-amd64


In [3]:
!pip install -U pyarrow
# Use case: converting a Spark DataFrame into a pandas DataFrame.
# Plain Python serialization is very slow for this.
# Apache Arrow (via pyarrow) is an in-memory columnar data format that Spark uses
# to transfer data efficiently between JVM and Python processes.
# https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html

Collecting pyarrow
  Downloading https://files.pythonhosted.org/packages/00/d2/695bab1e1e7a4554b6dbd287d55cca096214bd441037058a432afd724bb1/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl (63.1MB)
     |████████████████████████████████| 63.2MB 65kB/s
Installing collected packages: pyarrow
  Found existing installation: pyarrow 0.14.1
    Uninstalling pyarrow-0.14.1:
      Successfully uninstalled pyarrow-0.14.1
Successfully installed pyarrow-0.16.0
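
To make the use case above concrete, here is a minimal sketch of switching Arrow on before converting to pandas. It assumes a running `SparkSession` and Spark 2.4, where the setting is named `spark.sql.execution.arrow.enabled`; the helper name `to_pandas_with_arrow` is made up for illustration and is not part of PySpark.

```python
# Config key used by Spark 2.3/2.4; Spark 3.x renamed it to
# spark.sql.execution.arrow.pyspark.enabled.
ARROW_KEY = "spark.sql.execution.arrow.enabled"


def to_pandas_with_arrow(spark, sdf):
    """Enable Arrow-based transfer on `spark`, then convert `sdf` to pandas.

    Hypothetical helper: `spark` is a SparkSession, `sdf` a Spark DataFrame.
    """
    spark.conf.set(ARROW_KEY, "true")  # runtime SQL setting, can be set after startup
    return sdf.toPandas()
```

Once a session and a DataFrame exist, `to_pandas_with_arrow(spark, some_df)` would return a pandas DataFrame. The Spark documentation also notes that PyArrow 0.15.0 and later need the environment variable `ARROW_PRE_0_15_IPC_FORMAT=1` when used with Spark 2.3.x or 2.4.x, which may matter for pandas UDFs given the 0.16.0 install above.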


In [0]:
# Point JAVA_HOME and SPARK_HOME at the Java 8 and Spark installs from above
import os 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
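
A wrong path here usually only surfaces later as an opaque error when the session starts, so a quick check can save time. The following is a minimal sketch using only the standard library; `missing_env_dirs` is a hypothetical helper, not part of findspark or PySpark.

```python
import os


def missing_env_dirs(names=("JAVA_HOME", "SPARK_HOME")):
    """Return the names whose environment variable is unset or not a directory."""
    missing = []
    for name in names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


# An empty list means both directories exist; otherwise the names to fix are listed.
print("Missing or invalid:", missing_env_dirs())
```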

In [0]:
#https://medium.com/@sushantgautam_930/apache-spark-in-google-collaboratory-in-3-steps-e0acbba654e6
#https://www.youtube.com/watch?v=d9g9xbNc5qA

In [0]:

import findspark
# Locate Spark via SPARK_HOME and add PySpark to sys.path
findspark.init()
from pyspark.sql import SparkSession

# SparkSession is the entry point for working with Spark.
# Use a local master (all cores) since there are no distributed nodes.
# Memory settings go on the builder: calling spark.conf.set() on them after
# the session exists does not resize the already-running JVM.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.executor.memory", "4g")
    .config("spark.driver.memory", "4g")
    .config("spark.memory.fraction", "0.9")
    .getOrCreate()
)


#  **Testing Spark**

The steps are similar whether Spark runs locally or on Colab.



In [0]:
import sys, tempfile
import urllib.request  # `import urllib` alone does not load the request submodule

In [0]:
# Set a base directory
BASE_DIR = '/tmp'
# Output path for the downloaded credit data
OUTPUT_FILE = os.path.join(BASE_DIR, 'credit_data.csv')

In [0]:
# Download the UCI credit-screening data; urlretrieve returns a (local_path, headers) tuple
credit_data = urllib.request.urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data', OUTPUT_FILE)
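
Before handing the file to Spark, it can help to sanity-check the download. The sketch below uses only the standard library; `summarize_csv` is a hypothetical helper, and it assumes missing values are written as `?`, which is how the UCI credit-screening files mark them.

```python
import csv
from collections import Counter


def summarize_csv(path, missing_token="?"):
    """Count rows, the field counts seen per row, and missing-value tokens."""
    rows = 0
    widths = Counter()
    missing = 0
    with open(path, newline="") as f:
        for record in csv.reader(f):
            if not record:  # skip blank lines
                continue
            rows += 1
            widths[len(record)] += 1
            missing += sum(1 for field in record if field.strip() == missing_token)
    return {"rows": rows, "widths": dict(widths), "missing": missing}
```

Calling `summarize_csv(OUTPUT_FILE)` returns a small dictionary; a single entry under `"widths"` would suggest every row has the same number of fields.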

In [7]:
!ls /tmp

blockmgr-68cd3739-791f-4004-b027-f447c9360cf7
credit_data.csv
hsperfdata_root
spark-72ccd5a3-7466-427b-9a8d-348202a7d50b
spark-b52ce3d7-4186-4b8a-befe-624711e4cc9d


In [0]:
# Read the file (it has no header row) and let Spark infer the column types
credit_df = spark.read.option("inferSchema", "true").csv("/tmp/credit_data.csv", header=False)

In [9]:
credit_df.head()


Row(_c0='b', _c1='30.83', _c2=0.0, _c3='u', _c4='g', _c5='w', _c6='v', _c7=1.25, _c8='t', _c9='t', _c10=1, _c11='f', _c12='g', _c13='00202', _c14=0, _c15='+')
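
In the row above, `_c1` comes back as the string `'30.83'` rather than a number. A likely cause is that the UCI file marks missing values with `?`, so schema inference falls back to string for any column containing one. Below is a minimal sketch of one way to handle this, assuming the `?` convention and borrowing the attribute names `A1` to `A16` from the dataset description as column names; `read_credit_data` itself is a hypothetical helper.

```python
# Column names A1..A16; A16 is the class label (+ or -) in the dataset description.
COLUMN_NAMES = [f"A{i}" for i in range(1, 17)]


def read_credit_data(spark, path="/tmp/credit_data.csv"):
    """Read the CSV, treating '?' as null so numeric columns can be inferred."""
    return (
        spark.read
        .option("inferSchema", "true")
        .option("nullValue", "?")  # map the missing-value marker to null
        .csv(path, header=False)
        .toDF(*COLUMN_NAMES)       # replace _c0.._c15 with readable names
    )
```

With a session available, `read_credit_data(spark).printSchema()` shows the inferred types. Whether a given column then infers as numeric depends on its remaining values.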

In [10]:
type(credit_df)

pyspark.sql.dataframe.DataFrame

In [11]:
credit_df.show()

+---+-----+------+---+---+---+---+-----+---+---+----+----+----+-----+-----+----+
|_c0|  _c1|   _c2|_c3|_c4|_c5|_c6|  _c7|_c8|_c9|_c10|_c11|_c12| _c13| _c14|_c15|
+---+-----+------+---+---+---+---+-----+---+---+----+----+----+-----+-----+----+
|  b|30.83|   0.0|  u|  g|  w|  v| 1.25|  t|  t|   1|   f|   g|00202|    0|   +|
|  a|58.67|  4.46|  u|  g|  q|  h| 3.04|  t|  t|   6|   f|   g|00043|  560|   +|
|  a|24.50|   0.5|  u|  g|  q|  h|  1.5|  t|  f|   0|   f|   g|00280|  824|   +|
|  b|27.83|  1.54|  u|  g|  w|  v| 3.75|  t|  t|   5|   t|   g|00100|    3|   +|
|  b|20.17| 5.625|  u|  g|  w|  v| 1.71|  t|  f|   0|   f|   s|00120|    0|   +|
|  b|32.08|   4.0|  u|  g|  m|  v|  2.5|  t|  f|   0|   t|   g|00360|    0|   +|
|  b|33.17|  1.04|  u|  g|  r|  h|  6.5|  t|  f|   0|   t|   g|00164|31285|   +|
|  a|22.92|11.585|  u|  g| cc|  v| 0.04|  t|  f|   0|   f|   g|00080| 1349|   +|
|  b|54.42|   0.5|  y|  p|  k|  h| 3.96|  t|  f|   0|   f|   g|00180|  314|   +|
|  b|42.50| 4.915|  y|  p|  