# Data Analysis using __PySpark__  
*Fun with the __MovieLens__ dataset*  

**Part 4: Data Analysis using ratings.csv from the MovieLens dataset**

<font color='green'>__Support for Google Colab__  </font>

<font color='green'>Uncomment and execute the cell below to set up and run this Spark notebook on Google Colab.</font>

In [None]:
# # SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on windows)

# # grab spark
# # as of 2023-06-23, the latest version is 3.4.1, get the link from Apache Spark's website
# ! wget -q https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
# # unzip spark
# !tar xf spark-3.4.1-bin-hadoop3.tgz
# # install findspark package
# !pip install -q findspark
# # Let's download and unzip the MovieLens 25M Dataset as well.
# ! mkdir ./../data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-25m.zip
# ! unzip ./ml-25m.zip -d ./../data/

# # need to set the JAVA_HOME and SPARK_HOME environment variables
# import os
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
# # IMPORTANT - UPDATE THE SPARK_HOME PATH BASED ON THE PACKAGE YOU DOWNLOAD
# os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"
# ! echo "DONE"

## Start the local/colab Spark Cluster

In [None]:
# Step 1: initialize findspark
import findspark

findspark.init()

# Step 2: import pyspark
import pyspark
from pyspark.sql import SparkSession

print(pyspark.__version__)

# Step 3: Create a spark session
#     using local[*] to use as many logical cores as available, use 1 when in doubt
#     'local[1]' indicates spark on 1 core on the local machine or specify the number of cores needed
#     use .config("spark.some.config.option", "some-value") for additional configuration

spark = (
    SparkSession.builder.master("local[*]")
    .appName("Analyzing Movielens Data")
    .getOrCreate()
)

# spark

# Problem Set 3  - ```ratings.csv```

1. Find the number of films for each rating, i.e. the number of films that have at least one rating of 1, the number of films that have at least one rating of 2, and so on.  

1. List user-IDs in order of number of films they have rated, descending.  

1. Are there users who have given multiple ratings to the same film?  

## Load Ratings data from the MovieLens dataset

In [None]:
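# explicit imports avoid shadowing Python builtins such as sum, min and max
from pyspark.sql.functions import col, countDistinct
from pyspark.sql.types import FloatType, StringType, StructField, StructType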

#
datalocation = "../data/ml-25m/"
file_path_ratings = datalocation + "ratings.csv"
#
schema_ratings = StructType(
    [
        StructField("userId", StringType(), False),
        StructField("movieId", StringType(), False),
        StructField("rating", FloatType(), True),
        StructField("timestamp", StringType(), True),
    ]
)
#
ratings_raw = (
    spark.read.format("csv")
    .option("encoding", "UTF-8")
    .option("header", True)
    .option("sep", ",")
    .option("escape", '"')
    .schema(schema_ratings)
    .load(file_path_ratings)
)

In [None]:
ratings_raw.show(10, False)

## Solutions to Problem Set 3

### Find number of films for each rating

* i.e. the number of films that have at least one rating of 1, the number of films that have at least one rating of 2, and so on.
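
Note the difference between counting ratings and counting films: a film that received the rating 4.0 from many users should add one to the 4.0 bucket, not one per user. The following is a minimal plain-Python sketch on a handful of made-up rows (the `sample_ratings` list and its values are illustrative assumptions, not taken from ratings.csv) showing how the two counts diverge.

```python
from collections import defaultdict

# Made-up (userId, movieId, rating) rows, for illustration only.
sample_ratings = [
    ("u1", "m1", 4.0),
    ("u2", "m1", 4.0),
    ("u3", "m1", 5.0),
    ("u1", "m2", 4.0),
    ("u2", "m3", 5.0),
]

ratings_per_value = defaultdict(int)  # how many rating rows carry each value
films_per_value = defaultdict(set)    # which distinct films received each value

for _user_id, movie_id, rating in sample_ratings:
    ratings_per_value[rating] += 1
    films_per_value[rating].add(movie_id)

distinct_films_per_value = {r: len(films) for r, films in films_per_value.items()}

print(dict(ratings_per_value))    # rows per rating value
print(distinct_films_per_value)   # distinct films per rating value
```

In Spark terms, the first count corresponds to a plain `groupBy("rating").count()`, while the second corresponds to aggregating with `countDistinct("movieId")`.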

In [None]:
# to refresh, here's what ratings data looks like
ratings_raw.show(10)

In [None]:
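# count distinct films per rating value (a plain .count() would count ratings, not films)
absolute_freq_ratings = ratings_raw.groupBy("rating").agg(countDistinct("movieId").alias("count"))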

In [None]:
absolute_freq_ratings.orderBy(col("count").desc()).show()

In [None]:
# setup matplotlib before starting plotting
import matplotlib.pyplot as plt

# jupyter mpl magic
%matplotlib inline

# note: to use the widget or notebook magic instead,
# make sure ipympl is installed as well.
# other steps may be needed too,
# e.g. the widget backend may need Node.js to be set up and enabled

In [None]:
# collect once, ordered by rating, so the x and y lists stay aligned
rows = absolute_freq_ratings.orderBy("rating").collect()
absolute_freq_ratings_x = [row["rating"] for row in rows]
absolute_freq_ratings_y = [row["count"] for row in rows]

In [None]:
plt.figure(figsize=(18, 5))
# ratings are 0.5 apart, so keep the bars narrower than that to avoid overlap
plt.bar(absolute_freq_ratings_x, absolute_freq_ratings_y, width=0.4)
plt.title("Absolute Frequencies of Ratings")
plt.xlabel("Rating")
plt.ylabel("Number Of Movies")
plt.show()

### List user-IDs in order of number of films they have rated, descending

In [None]:
rating_freq_by_user = ratings_raw.groupBy("userId").count()

In [None]:
rating_freq_by_user.orderBy(col("count").desc()).show()

In [None]:
rating_freq_by_user.count()

That's a lot of ratings by a lot of users...  

Some of these, like ```72315```, definitely look like a bot, or a human who has spent a lot of time regularly watching films. If we estimate a film at 90 minutes on average, this user's ratings come to about 48,303 hours, which is roughly 5.5 years of 24/7 movie-watching! In practice this would have taken a person 8 to 10 times longer (at around 3 hours of movies daily, with no holidays), so 44 to 55 years of movies. I'll bet this was some automated process.
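
The arithmetic behind that estimate can be checked quickly. This is a rough sketch: the rating count of 32,202 is back-computed from the 48,303-hour figure at 90 minutes per film, and the 3-hours-a-day pace is the same assumption used in the text.

```python
# Back-of-the-envelope check of the viewing-time estimate.
# Assumption: 32,202 rated films, inferred from 48,303 hours at 1.5 hours each.
films_rated = 32_202
hours_per_film = 1.5

total_hours = films_rated * hours_per_film
years_nonstop = total_hours / (24 * 365)        # watching 24/7
years_at_3h_per_day = total_hours / (3 * 365)   # watching 3 hours every day

print(f"total hours: {total_hours:,.0f}")
print(f"non-stop: {years_nonstop:.1f} years")
print(f"at 3 h/day: {years_at_3h_per_day:.1f} years")
```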



### Are there users who have given multiple ratings to the same film?

In [None]:
usr_movie_count = ratings_raw.groupBy("userId", "movieId").count()

In [None]:
# sort descending: any (userId, movieId) pair rated more than once would appear at the top
usr_movie_count.orderBy(col("count").desc()).show(10)

It doesn't seem like any user has rated the same movie more than once.  
*[think]* Is there another way to confirm this?
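
One other way to confirm it is to compare the total number of rows with the number of distinct (userId, movieId) pairs: if the two match, no pair occurs twice. Below is a minimal plain-Python sketch of that idea on made-up rows (the `sample_pairs` data is hypothetical and deliberately contains one duplicate so that the check has something to find).

```python
from collections import Counter

# Hypothetical (userId, movieId) pairs; ("u1", "m1") appears twice on purpose.
sample_pairs = [
    ("u1", "m1"),
    ("u1", "m2"),
    ("u2", "m1"),
    ("u1", "m1"),
]

total_rows = len(sample_pairs)
distinct_pairs = len(set(sample_pairs))

# If the counts differ, at least one user rated some film more than once.
has_duplicates = total_rows != distinct_pairs

# List the offending pairs together with how often each occurs.
repeated = {pair: n for pair, n in Counter(sample_pairs).items() if n > 1}

print(total_rows, distinct_pairs, has_duplicates)
print(repeated)
```

On the Spark side, the same comparison would be `ratings_raw.count()` against `ratings_raw.select("userId", "movieId").distinct().count()`. Both are full scans of the dataset, so expect them to take a while.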

# Clear cache and stop the Spark cluster

In [None]:
# clear cache
spark.catalog.clearCache()

In [None]:
# stop spark
spark.stop()

# Insights

We are practicing many of the same techniques as before; however, ```ratings``` is a substantially larger dataset (roughly 25 million ratings), so we need to be more careful with expensive operations such as joins.

# Next

We continue our data analysis exercises with multiple data files loaded.