# Running Spark in Google Colab

This notebook explains how to download and install Apache Spark in a Google Colab environment.
You can open this notebook directly in Colab using the following button:

<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/02.000%20(optional)%20Setup_Spark_in_Google_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# find out OS details
! cat /etc/os-release

In [None]:
# see if java is available
! java -version

openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu218.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu218.04, mixed mode, sharing)


In [None]:
# Spark needs the JAVA_HOME and SPARK_HOME environment variables set.
# To set JAVA_HOME, we first have to locate Java.
! whereis java

java: /usr/bin/java /usr/share/java /usr/share/man/man1/java.1.gz


In [None]:
# list the OpenJDK files on disk so we can find the exact directory to use for JAVA_HOME
! find / -iname "*openjdk-*"

/var/lib/dpkg/info/openjdk-11-jdk-headless:amd64.prerm
/var/lib/dpkg/info/openjdk-11-jre-headless:amd64.conffiles
/var/lib/dpkg/info/openjdk-11-jdk-headless:amd64.md5sums
/var/lib/dpkg/info/openjdk-11-jre-headless:amd64.list
/var/lib/dpkg/info/openjdk-11-jre-headless:amd64.postinst
/var/lib/dpkg/info/openjdk-11-jdk-headless:amd64.list
/var/lib/dpkg/info/openjdk-11-jre:amd64.prerm
/var/lib/dpkg/info/openjdk-11-jre:amd64.md5sums
/var/lib/dpkg/info/openjdk-11-jre:amd64.postinst
/var/lib/dpkg/info/openjdk-11-jre:amd64.list
/var/lib/dpkg/info/openjdk-11-jdk-headless:amd64.postinst
/var/lib/dpkg/info/openjdk-11-jre-headless:amd64.prerm
/var/lib/dpkg/info/openjdk-11-jdk-headless:amd64.preinst
/var/lib/dpkg/info/openjdk-11-jre-headless:amd64.md5sums
/var/lib/dpkg/info/openjdk-11-jre-headless:amd64.postrm
/usr/lib/jvm/java-11-openjdk-amd64
/usr/lib/jvm/.java-1.11.0-openjdk-amd64.jinfo
/usr/lib/jvm/java-1.11.0-openjdk-amd64
/usr/lib/debug/usr/lib/jvm/java-11-openjdk-amd64
/usr/lib/debug/usr/lib/
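
Rather than scanning the `find` output by eye, the JDK directory can also be derived from the `java` binary itself. The following is a minimal sketch, assuming the conventional `<JAVA_HOME>/bin/java` layout; the helper name `java_home_from_binary` is hypothetical and not part of the original notebook.

```python
import os
import shutil


def java_home_from_binary(java_path):
    """Resolve symlinks on a java executable and return its JDK root.

    Assumption: the real binary lives at <JAVA_HOME>/bin/java.
    """
    real_binary = os.path.realpath(java_path)  # follow every symlink in the chain
    return os.path.dirname(os.path.dirname(real_binary))  # strip "/bin/java"


java_on_path = shutil.which("java")  # None when java is not on PATH
if java_on_path is not None:
    print(java_home_from_binary(java_on_path))
```

On a machine where `/usr/bin/java` is a symlink into `/usr/lib/jvm/...`, this should print the directory that the `find` listing above shows under `/usr/lib/jvm`.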

In [None]:
# grab spark
# as of Dec 2022, 3.2.3 is the newest release in the 3.2 line; get the link from Apache Spark's download page
! wget -q https://dlcdn.apache.org/spark/spark-3.2.3/spark-3.2.3-bin-hadoop3.2.tgz
# extract the tarball
! tar xf spark-3.2.3-bin-hadoop3.2.tgz
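
The download link above is tied to one release, and `dlcdn.apache.org` only hosts current releases, with older ones moved to `archive.apache.org/dist/spark/`. As a rough sketch, the candidate URLs can be built from the version number so a fallback is at hand. The function name `spark_download_urls` and its parameters are hypothetical, and the tarball naming (`bin-hadoop3.2`) is assumed to match the 3.2.x releases used here; other release lines may name their packages differently.

```python
def spark_download_urls(version="3.2.3", hadoop_profile="hadoop3.2"):
    """Return candidate URLs for a Spark binary tarball, preferred mirror first."""
    tarball = f"spark-{version}-bin-{hadoop_profile}.tgz"
    return [
        # CDN that serves current releases
        f"https://dlcdn.apache.org/spark/spark-{version}/{tarball}",
        # archive that keeps older releases
        f"https://archive.apache.org/dist/spark/spark-{version}/{tarball}",
    ]


for url in spark_download_urls():
    print(url)
```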

In [None]:
# install findspark package
!pip install -q findspark

In [None]:
# Colab's working directory is /content; list it to confirm Spark was extracted there
! ls ../content

sample_data  spark-3.2.3-bin-hadoop3.2	spark-3.2.3-bin-hadoop3.2.tgz


In [None]:
# provide the JAVA_HOME and SPARK_HOME variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.3-bin-hadoop3.2"
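
A wrong path in either variable usually only surfaces later, when the session fails to start. The sketch below checks both variables up front. It assumes a JDK exposes `bin/java` and a Spark distribution exposes `bin/spark-submit`; the helper `check_spark_env` is hypothetical.

```python
import os


def check_spark_env(environ=os.environ):
    """Return a list of human-readable problems; empty when both homes look usable."""
    # Assumption: these marker files exist in a normal JDK / Spark install.
    expected = {
        "JAVA_HOME": os.path.join("bin", "java"),
        "SPARK_HOME": os.path.join("bin", "spark-submit"),
    }
    problems = []
    for name, marker in expected.items():
        home = environ.get(name)
        if not home:
            problems.append(f"{name} is not set")
        elif not os.path.exists(os.path.join(home, marker)):
            problems.append(f"{name}={home} does not contain {marker}")
    return problems


print(check_spark_env() or "JAVA_HOME and SPARK_HOME look usable")
```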

In [None]:
# Now we initialize spark like before

In [None]:
# Step 1: initialize findspark
import findspark
findspark.init()

In [None]:
# Step 2: import pyspark
import pyspark
from pyspark.sql import SparkSession
pyspark.__version__

'3.2.3'

In [None]:
# Step 3: Create a spark session

# 'local[1]' runs Spark on 1 core of the local machine (the Ubuntu VM on Colab in this case);
# here we use 'local[*]' to engage as many cores as are available.
# Use .config("spark.some.config.option", "some-value") for additional configuration.

spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName("Setup-Spark-in-Google-Colab") \
    .getOrCreate()
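
Creating the session does not by itself prove that Spark can run a job. One possible smoke test, shown here as a sketch with a hypothetical helper name, is to build a tiny DataFrame with `spark.range` and count its rows.

```python
def spark_smoke_test(spark, n=5):
    """Run a trivial job; True when the session can actually execute work."""
    # spark.range(n) builds a one-column DataFrame with ids 0..n-1
    return spark.range(n).count() == n
```

In the notebook this would be called as `spark_smoke_test(spark)` once the session exists.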

In [None]:
spark

In [1]:
# Let's download and unzip the MovieLens 25M Dataset as well.

In [3]:
! wget -q https://files.grouplens.org/datasets/movielens/ml-25m.zip

In [4]:
! mkdir -p ./data
! ls

data  ml-25m.zip  ml-25m.zip.1	sample_data
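
The `ml-25m.zip.1` entry in the listing is what `wget` produces when the download cell is run a second time. One way to make the cell safe to re-run is to skip the download when the file already exists (`wget -nc` does the same on the command line). The following is a minimal Python sketch; the helper `download_once` is hypothetical.

```python
import os
import urllib.request


def download_once(url, dest):
    """Download url to dest unless dest already exists; return True if fetched."""
    if os.path.exists(dest):
        return False  # already downloaded, nothing to do
    partial = dest + ".part"  # write to a temporary name first
    urllib.request.urlretrieve(url, partial)
    os.replace(partial, dest)  # rename only after the download completes
    return True
```

For this notebook the call would be `download_once("https://files.grouplens.org/datasets/movielens/ml-25m.zip", "ml-25m.zip")`.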


In [5]:
! unzip ./ml-25m.zip -d ./data/

Archive:  ./ml-25m.zip
   creating: ./data/ml-25m/
  inflating: ./data/ml-25m/tags.csv  
  inflating: ./data/ml-25m/links.csv  
  inflating: ./data/ml-25m/README.txt  
  inflating: ./data/ml-25m/ratings.csv  
  inflating: ./data/ml-25m/genome-tags.csv  
  inflating: ./data/ml-25m/genome-scores.csv  
  inflating: ./data/ml-25m/movies.csv  


In [6]:
! ls ./data/ml-25m/

genome-scores.csv  links.csv   ratings.csv  tags.csv
genome-tags.csv    movies.csv  README.txt
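
With the files in place, each CSV can be read into a Spark DataFrame. The sketch below assumes the files carry a header row and leaves type detection to `inferSchema`, which costs an extra pass over the data; the helper names `movielens_path` and `load_movielens_table` are hypothetical.

```python
import os


def movielens_path(name, data_dir="./data/ml-25m"):
    """Build the path to one MovieLens CSV, e.g. name='movies'."""
    return os.path.join(data_dir, f"{name}.csv")


def load_movielens_table(spark, name, data_dir="./data/ml-25m"):
    """Read one MovieLens CSV through an existing SparkSession."""
    return spark.read.csv(
        movielens_path(name, data_dir),
        header=True,       # assumption: first line holds the column names
        inferSchema=True,  # let Spark guess column types
    )
```

In the notebook, `load_movielens_table(spark, "movies").printSchema()` would show the columns Spark detected.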
