# Google Colab Setup
## Get Spark and download the MovieLens dataset

This notebook explains how to download and install Apache Spark in a Google Colab environment.
It also shows how to download the MovieLens dataset used in the workshop.
You can open this notebook directly in Colab with the following button:

<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/03-Spark/001.01%20-%20Setup%20and%20suchlike.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# find out OS details
! cat /etc/os-release

In [None]:
# see if java is available
! java -version

In [None]:
# Spark needs the JAVA_HOME and SPARK_HOME environment variables set.
# to set JAVA_HOME, we first have to locate java
! whereis java

In [None]:
# let's check more details so we can supply the exact path to JAVA_HOME
# (2>/dev/null hides error messages from directories that cannot be read)
! find / -iname "*openjdk-*" 2>/dev/null

# typically - /usr/lib/jvm/java-11-openjdk-amd64
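
The lookup above can also be done from Python. The following is a minimal sketch, not part of the original notebook: `java_home_from_binary` is a hypothetical helper, and it assumes the JDK layout `<JDK root>/bin/java`, which matches the Java 11 path shown above (some older packages, such as Java 8, place the binary under `jre/bin` instead).

```python
import os
import shutil


def java_home_from_binary(java_binary):
    """Return the JDK root for a resolved .../bin/java path (hypothetical helper)."""
    # Assumption: JAVA_HOME is the directory that contains bin/java,
    # i.e. two levels above the binary itself.
    return os.path.dirname(os.path.dirname(java_binary))


# shutil.which finds `java` on PATH; os.path.realpath follows any symlinks
# so we end up at the real JDK location rather than a shortcut in /usr/bin.
java_on_path = shutil.which("java")
if java_on_path is None:
    print("java not found on PATH")
else:
    print(java_home_from_binary(os.path.realpath(java_on_path)))
```

If the printed path differs from the one used for `JAVA_HOME` later in this notebook, prefer the path that actually exists on your VM.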

In [None]:
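# grab Spark
# as of 2023-06-23, the latest version is 3.4.1; get the link from the downloads page on Apache Spark's website
# if this link stops working, older releases are kept under https://archive.apache.org/dist/spark/
! wget -q https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
# extract the archive (it is a gzipped tarball, not a zip file)
! tar xf spark-3.4.1-bin-hadoop3.tgz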
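
Because the version number appears in the download link, the archive name, and later in `SPARK_HOME`, it can help to derive all of them from one value. This is a small sketch under that assumption: `spark_package` is a hypothetical helper, and the `archive.apache.org` URL follows the pattern Apache uses for older releases, so check it against the downloads page before relying on it.

```python
def spark_package(version, hadoop="3"):
    """Build the file names and URLs for a Spark binary package (hypothetical helper)."""
    folder = f"spark-{version}-bin-hadoop{hadoop}"  # name of the extracted folder
    archive = f"{folder}.tgz"                       # name of the downloaded file
    return {
        "folder": folder,
        "archive": archive,
        # current releases
        "cdn_url": f"https://dlcdn.apache.org/spark/spark-{version}/{archive}",
        # older releases (assumed fallback location)
        "archive_url": f"https://archive.apache.org/dist/spark/spark-{version}/{archive}",
    }


pkg = spark_package("3.4.1")
print(pkg["cdn_url"])
print(pkg["folder"])
```

Changing the version string in one place then keeps the download link and the extracted folder name in step.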

In [None]:
# install pyspark and plotly
# (the pip pyspark may be redundant, since findspark will use the Spark download above)
! pip install -q pyspark plotly

In [None]:
# install findspark package
!pip install -q findspark

In [None]:
# the folder we are working in is /content; the extracted Spark folder should be listed here
! ls ../content

In [None]:
# got to provide the JAVA_HOME and SPARK_HOME variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
# IMPORTANT - UPDATE THE SPARK_HOME PATH BASED ON THE PACKAGE YOU DOWNLOAD
# it must point to the extracted folder, not to the .tgz archive
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"
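
A common slip at this step is pointing `SPARK_HOME` at the downloaded archive rather than at the extracted folder. The check below is a rough sketch, not something the notebook requires: `spark_home_problems` is a hypothetical helper, and it assumes a Spark binary distribution ships a `bin/spark-submit` launcher script.

```python
import os


def spark_home_problems(spark_home):
    """Return a list of problems found; an empty list means the path looks usable."""
    problems = []
    if not os.path.isdir(spark_home):
        # Catches an unset variable, a typo, or a path to the .tgz file itself.
        problems.append(f"{spark_home} is not a directory")
        return problems
    # Assumption: a Spark binary package contains bin/spark-submit.
    if not os.path.isfile(os.path.join(spark_home, "bin", "spark-submit")):
        problems.append(f"{spark_home} has no bin/spark-submit")
    return problems


print(spark_home_problems(os.environ.get("SPARK_HOME", "")))
```

Running this before `findspark.init()` gives a clearer message than the error raised when Spark cannot be found.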

In [None]:
# Now we initialize spark like before

In [None]:
# Step 1: initialize findspark
import findspark
findspark.init()

In [None]:
# Step 2: import pyspark
import pyspark
from pyspark.sql import SparkSession
pyspark.__version__

In [None]:
# Step 3: Create a spark session

# 'local[1]' runs Spark on 1 core of the local machine (the Ubuntu VM on Colab in this case);
# change the number to use more cores - we'll use 'local[*]' here to engage as many cores as are available
# use .config("spark.some.config.option", "some-value") for additional configuration

spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName("Setup-Spark-in-Google-Colab") \
    .getOrCreate()
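
To make the master setting concrete, here is a short sketch of how the local-mode string is formed. `local_master` is a hypothetical helper, not a Spark API; only the `local[N]` and `local[*]` strings themselves come from Spark.

```python
import os


def local_master(cores=None):
    """Return a Spark local-mode master string: 'local[*]' or 'local[N]'."""
    if cores is None:
        return "local[*]"  # let Spark use every core it can see
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return f"local[{cores}]"


print(local_master())    # the setting used in this notebook
print(local_master(2))   # limit Spark to two cores
print(os.cpu_count())    # number of logical cores the VM reports
```

The value returned by `local_master` could be passed to `.master(...)` in place of the literal string.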

In [None]:
spark

In [None]:
# Let's download and unzip the MovieLens 25M Dataset as well.

In [None]:
! wget -q https://files.grouplens.org/datasets/movielens/ml-25m.zip

In [None]:
! mkdir -p ./data
! ls

In [None]:
! unzip -o ./ml-25m.zip -d ./data/

In [None]:
! ls ./data/ml-25m/
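
Instead of reading the listing by eye, the check can be scripted. This is a minimal sketch: `missing_files` is a hypothetical helper, and the file names in `EXPECTED` are assumed from the MovieLens 25M README, so adjust the list if your copy differs.

```python
import os

# Data files expected in the ML-25M archive (assumption based on its README).
EXPECTED = [
    "ratings.csv",
    "movies.csv",
    "tags.csv",
    "links.csv",
    "genome-scores.csv",
    "genome-tags.csv",
]


def missing_files(directory, expected=EXPECTED):
    """Return the expected file names that are not present in `directory`."""
    present = set(os.listdir(directory)) if os.path.isdir(directory) else set()
    return [name for name in expected if name not in present]


print(missing_files("./data/ml-25m"))
```

An empty list means every expected file was extracted; any names printed point to an incomplete download or extraction.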