# Distributed Data Analysis with Dask  
*__[Dask](https://www.dask.org/)__ with the __MovieLens__ dataset* 

**Part 2: Playing with the Movies data**

### <font color='green'>__Support for Google Colab__  </font>  
    
Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/04-Dask/02.02-MovieLens-and-Dask.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

**Note**: The Dask Dashboard is not available on Google Colab unless you register with a tunnelling service such as Saturn Cloud or ngrok. Both are good approaches, but I have not built support for them into this workshop for folks running it on Colab. Your contributions / PRs would be very welcome.
  
<font color='green'>Uncomment and execute the cell below to set up and run this notebook on Google Colab.</font>

### Graphviz or Cytoscape (ipycytoscape)

For some Dask exercises, we may want to visualize the task graph. 
To do so we'll need: [GraphViz](https://graphviz.org/) or [Cytoscape](https://cytoscape.org/download.html) and [ipycytoscape](https://ipycytoscape.readthedocs.io/en/latest/installing.html)

These come with ready installers. Install, restart Jupyter if needed. 

In [1]:
# # SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on windows)

# # grab dask - in most cases it should already be available in colab
# ! python -m pip install --quiet --upgrade --no-cache-dir "dask[complete]"
# # Let's download and unzip the MovieLens 25M Dataset as well.
# ! mkdir -p ./../data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-25m.zip
# ! unzip ./ml-25m.zip -d ./../data/

# ! echo "DONE"

**Citation**:  
*F. Maxwell Harper and Joseph A. Konstan.* 2015.  
The MovieLens Datasets: History and Context.  
ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>  

# Setup Dask, Pandas and Numpy

## Setup the Dask Cluster

### Installation

A local install is as simple as ```pip install "dask[complete]"```  
  
Unlike Spark, Dask is incredibly easy to set up. Check out the [Dask Installation Docs](https://docs.dask.org/en/stable/install.html).  

In [2]:
# Step 1: numpy and pandas

import numpy as np
import pandas as pd

print("numpy version: ", np.__version__)
print("pandas version: ", pd.__version__)

numpy version:  1.26.2
pandas version:  2.1.3


In [3]:
# Step 2: import dask and related
import dask
import dask.array as da
import dask.bag as db
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

print("dask version: ", dask.__version__)

dask version:  2023.12.0


In [4]:
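# Step 3: Create a Dask Cluster and a Client
try:
    cluster = LocalCluster()
    # alternative, when you want to specify the dashboard address/port
    # (8787 is the usual dashboard port; 8786 is usually the scheduler's)
    # cluster = LocalCluster(dashboard_address="localhost:8787")
    # make sure the cluster shuts down when it is closed
    if not cluster.shutdown_on_close:
        cluster.shutdown_on_close = True
    client = Client(cluster)
except Exception as err:
    # report the failure instead of hiding it, otherwise later cells
    # fail with a confusing NameError on `client`
    print("could not start the local Dask cluster:", err)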

In [5]:
# see cluster information
# cluster

In [6]:
# see client information
# client

# Problem Set 1  - ```movies.csv```

1. How many unique movies exist in the ```movies.csv``` dataset?  

1. Prepare a yearwise list of movies - Extract the year of release in movies.csv into a new column year_of_release   

# Locate the data

In [9]:
datalocation = "../data/ml-25m/"

In [10]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"
file_path_genome_tags = datalocation + "genome-tags.csv"
file_path_genome_scores = datalocation + "genome-scores.csv"

# Load the dataset(s)

From the ```README.txt``` file in the small MovieLens dataset:
The dataset files are written as [**comma-separated values**](http://en.wikipedia.org/wiki/Comma-separated_values) files with a **single header row**. Columns that contain commas (`,`) are **escaped using double-quotes (`"`)**. These files are encoded as **UTF-8**. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.

So, we specify:
* Separator - ```,```
* Quote character (what the README calls escaping) - ```"```
* Encoding - ```UTF-8```

Together, these settings are often called the **dialect** of the CSV file.
Dialects vary from file to file, so they need our attention.
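To see why the quote character matters, here is a minimal sketch using only Python's standard `csv` module. The two data rows are copied from outputs shown in this notebook; the variable names are just for illustration.

```python
import csv
import io

# Header plus two rows in the layout of movies.csv.
# The second title contains a comma, so it is wrapped in double quotes.
raw = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    '1682,"Truman Show, The (1998)",Comedy|Drama|Sci-Fi\n'
)

# Dialect-aware parsing: separator and quote character are stated explicitly.
reader = csv.reader(io.StringIO(raw), delimiter=",", quotechar='"')
header, *rows = list(reader)

# Naive parsing: splitting on every comma breaks the quoted title apart.
naive = raw.splitlines()[2].split(",")

print(len(rows[1]), rows[1])   # 3 fields, title kept whole
print(len(naive), naive)       # 4 fields, title split in two
```

The same idea carries over to `dd.read_csv`, which receives `sep` and `quotechar` in the cells that follow.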

In [11]:
# dask dataframes parallelize pandas dataframes
# so many of the idioms are similar
csv_separator = ","
csv_escapechar = '"'
csv_quotechar = csv_escapechar
csv_encoding = "utf-8"

## Movies

Let's specify the [```dtypes```](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) of each of the columns in the movies file. 

In [12]:
# schema, inferred from the README.txt file
movies_schema = {"movieId": "Int32", "title": "string", "genres": "string"}

Two of the columns are [strings of text](https://pandas.pydata.org/docs/user_guide/text.html#working-with-text-data). Pandas may treat those as ```object```, but we want to use the dedicated [```pandas.StringDtype```](https://pandas.pydata.org/docs/reference/api/pandas.StringDtype.html#pandas-stringdtype) here.
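A small pandas-only sketch of what the `"string"` and `"Int32"` names in the schema mean. The sample values are made up for illustration; the point is that both are nullable extension dtypes, so a missing value does not force the column to `object` or `float`.

```python
import pandas as pd

# Made-up sample values, including one missing entry in each column.
titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)", None], dtype="string")
movie_ids = pd.Series([1, 2, None], dtype="Int32")

# Both columns keep their declared dtype despite the missing value.
print(titles.dtype)
print(movie_ids.dtype)

# Missing entries are tracked as missing, not as the text "None" or as NaN floats.
print(titles.isna().tolist())     # [False, False, True]
print(movie_ids.isna().tolist())  # [False, False, True]
```

Note the capital `I` in `"Int32"`: it selects the nullable pandas integer type rather than NumPy's `int32`.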

In [13]:
# we are using dd - dask.dataframe
movies = dd.read_csv(
    file_path_movies,
    dtype=movies_schema,
    sep=csv_separator,
    quotechar=csv_quotechar,
    encoding=csv_encoding,
)

In [14]:
# data types of each column
movies.dtypes

movieId              Int32
title      string[pyarrow]
genres     string[pyarrow]
dtype: object

# Solutions to Problem Set 1

## How many unique movies?

In [15]:
# lazy
number_of_unique_movies_task = movies["title"].nunique()

In [20]:
# you know you want to...
# number_of_unique_movies_task.visualize()

In [21]:
# use compute() on scalar to get final value
number_of_unique_movies = number_of_unique_movies_task.compute()
print(number_of_unique_movies)

62325


#### What if we just needed to count the number of rows in the dataset?

In [22]:
# shape property helps
# compute number of rows
# needs to be collected across all workers
movies.shape[0].compute()

62423

In [23]:
# compute number of columns
# consistent across all workers
movies.shape[1]

3

In [24]:
number_of_rows_in_movies = movies.shape[0].compute()

In [25]:
number_of_rows_in_movies

62423

In [26]:
# seems like some movie titles appear more than once
print(number_of_rows_in_movies - number_of_unique_movies)

98


#### Of course we need to find these...

In [27]:
movie_title_counts = (
    movies.groupby(by="title").count().sort_values(by="movieId", ascending=False)
)

In [30]:
# as expected, this one's kinda straightforward
# movie_title_counts.visualize()

In [31]:
movie_title_counts.head()

title,movieId,genres
The Plague (2006),2,2
Seven Years Bad Luck (1921),2,2
Another World (2014),2,2
Interrogation (2016),2,2
Sing (2016),2,2


In [32]:
repeating_titles = movie_title_counts[movie_title_counts["movieId"] > 1]

In [33]:
repeating_titles.head()

title,movieId,genres
The Plague (2006),2,2
Seven Years Bad Luck (1921),2,2
Another World (2014),2,2
Interrogation (2016),2,2
Sing (2016),2,2


In [34]:
repeating_titles.shape[0].compute()

98
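Two different numbers are in play here: the surplus rows (rows minus unique titles) and the number of titles that repeat. They only agree when every repeated title appears exactly twice, which is what the matching 98s above suggest. A minimal sketch with made-up titles, using only the standard library:

```python
from collections import Counter

# Made-up titles: "A" appears twice, "C" appears three times.
titles = ["A (2000)", "B (2001)", "A (2000)", "C (2002)", "C (2002)", "C (2002)"]

title_counts = Counter(titles)

rows = len(titles)                       # like movies.shape[0]
unique_titles = len(title_counts)        # like movies["title"].nunique()
surplus_rows = rows - unique_titles      # rows beyond the first copy of each title
repeated_titles = sum(1 for n in title_counts.values() if n > 1)

print(rows, unique_titles, surplus_rows, repeated_titles)  # 6 3 3 2
```

In this toy data the two numbers differ (3 versus 2) because one title appears three times.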

## Year-wise list of movies 
- Extract the year of release in movies.csv into a new column year_of_release

In [35]:
import re

# regex:
# 1st capture group: match a single (
# 2nd capture group: match exactly 4 digits
# 3rd capture group: match a single ) at the end of the string
year_regex_pattern = "([(])([0-9]{4})([)]$)"
# alternative: use \d{4} instead of [0-9]{4}
print("No of groups in the regex: ", re.compile(year_regex_pattern).groups)

No of groups in the regex:  3
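To check the idea on a few strings before applying it to the whole column, here is a plain-Python sketch. It uses a simplified variant of the pattern with a single capture group, which is my own simplification and not the pattern used in this notebook; `extract_year` and the sample titles are likewise only illustrative.

```python
import re

# Simplified variant: only the 4 digits are captured,
# the parentheses are matched but not captured.
year_pattern = re.compile(r"\((\d{4})\)$")


def extract_year(title):
    """Return the trailing (YYYY) as an int, or None when it is absent."""
    match = year_pattern.search(title)
    return int(match.group(1)) if match else None


samples = [
    "Toy Story (1995)",
    "Truman Show, The (1998)",
    "2001: A Space Odyssey (1968)",  # digits elsewhere in the title are ignored
    "Title without a year",
    "Trailing space (2001) ",        # the $ anchor rejects trailing whitespace
]

years = [extract_year(t) for t in samples]
print(years)  # [1995, 1998, 1968, None, None]
```

The last sample shows a limitation of anchoring at the end of the string: if any titles carry trailing whitespace, stripping them first would be needed.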


In [36]:
# just like pandas except lazy
# extract() returns one column per capture group;
# column 1 is the 2nd group, i.e. the 4 digits
movies["year"] = dd.to_numeric(
    movies["title"].str.extract(year_regex_pattern, flags=re.X, expand=False)[1]
)

In [37]:
# don't ask me to do it...
# movies.visualize()

In [38]:
movies[movies["year"] == 1995].head(10)

(index),movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995
5,6,Heat (1995),Action|Crime|Thriller,1995
6,7,Sabrina (1995),Comedy|Romance,1995
7,8,Tom and Huck (1995),Adventure|Children,1995
8,9,Sudden Death (1995),Action,1995
9,10,GoldenEye (1995),Action|Adventure|Thriller,1995


In [39]:
movies[movies["year"] == 1998].head(10)

(index),movieId,title,genres,year
867,887,Talk of Angels (1998),Drama,1998
1541,1598,Desperate Measures (1998),Crime|Drama|Thriller,1998
1593,1655,Phantoms (1998),Drama|Horror|Thriller,1998
1617,1679,Chairman of the Board (1998),Comedy,1998
1618,1680,Sliding Doors (1998),Drama|Romance,1998
1620,1682,"Truman Show, The (1998)",Comedy|Drama|Sci-Fi,1998
1661,1727,"Horse Whisperer, The (1998)",Drama|Romance,1998
1666,1732,"Big Lebowski, The (1998)",Comedy|Crime,1998
1669,1735,Great Expectations (1998),Drama|Romance,1998
1670,1738,Vermin (1998),Comedy,1998


In [40]:
movies[movies["year"] == 2009].head(10)

(index),movieId,title,genres,year
12536,60684,Watchmen (2009),Action|Drama|Mystery|Sci-Fi|Thriller|IMAX,2009
12695,62265,"Accidental Husband, The (2009)",Comedy|Romance,2009
12756,63072,"Road, The (2009)",Adventure|Drama|Thriller,2009
12970,65585,Bride Wars (2009),Comedy|Romance,2009
12973,65601,My Bloody Valentine 3-D (2009),Horror|Thriller,2009
12988,65682,Underworld: Rise of the Lycans (2009),Action|Fantasy|Horror|Thriller,2009
13004,65802,Paul Blart: Mall Cop (2009),Action|Comedy|Crime,2009
13005,65810,Notorious (2009),Drama|Musical,2009
13006,65813,"Unborn, The (2009)",Horror|Mystery|Thriller,2009
13012,65882,"Uninvited, The (2009)",Drama|Horror|Mystery|Thriller,2009


#### How many movies each year?

In [41]:
# using movies - knowing that 98 rows are duplicate
# todo - build a 'unique' only dataframe and use it for further analysis

movies_each_year = (
    movies.groupby(by="year").count().sort_values(by="movieId", ascending=False)
)

In [42]:
movies_each_year.shape[0].compute()

135

In [43]:
movies_each_year.head(135)

year,movieId,title,genres
2015,2513,2513,2513
2016,2488,2488,2488
2014,2403,2403,2403
2017,2374,2374,2374
2013,2164,2164,2164
...,...,...,...
1878,1,1,1
1887,1,1,1
1883,1,1,1
1880,1,1,1
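One thing to keep in mind when reading these counts: a pandas-style `groupby` drops rows whose key is missing, so titles for which no year could be extracted do not show up in any group. A pandas-only sketch with a made-up four-row frame (the last title is hypothetical):

```python
import pandas as pd

# Made-up sample: the last row has no year.
sample = pd.DataFrame(
    {
        "title": [
            "Toy Story (1995)",
            "Jumanji (1995)",
            "Sliding Doors (1998)",
            "Untitled entry",
        ],
        "year": [1995, 1995, 1998, None],
    }
)

# Same shape of query as above: count per year, largest first.
per_year = sample.groupby("year")["title"].count().sort_values(ascending=False)
print(per_year)

# Rows without a year are silently left out of the groups.
rows_without_year = int(sample["year"].isna().sum())
print("rows without a year:", rows_without_year)
```

Comparing the sum of the per-year counts with the total row count is a quick way to see how many titles fell through.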


# Bonus Round   
## Find out how many times the word 'The' appears across all the titles in the movies dataset.  

Hint: this is just like the UDFs we discovered in Spark

In [44]:
movies['title'].head()

0                      Toy Story (1995)
1                        Jumanji (1995)
2               Grumpier Old Men (1995)
3              Waiting to Exhale (1995)
4    Father of the Bride Part II (1995)
Name: title, dtype: string

In [45]:
# custom function
def count_the(title):
    # lowercase the title, then split it on single spaces
    words_in_title = str(title).lower().split(" ")
    # return how many of the resulting words are exactly 'the'
    return words_in_title.count("the")

In [46]:
count_of_the = movies['title'].map(count_the).sum()

In [51]:
# what!?!
# count_of_the.visualize()

In [52]:
count_of_the.compute()

19583
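It helps to try the function on a handful of titles before trusting the total. The sketch below repeats `count_the` so it runs on its own, and compares it with a regex-based variant, `count_the_word`, which is my own alternative and not part of the notebook. The last sample title is hypothetical.

```python
import re


def count_the(title):
    # Same logic as above: lowercase, split on spaces, count exact 'the' tokens.
    return str(title).lower().split(" ").count("the")


def count_the_word(title):
    # Alternative (an assumption, not the notebook's method):
    # count 'the' as a whole word, even when punctuation touches it.
    return len(re.findall(r"\bthe\b", str(title), flags=re.IGNORECASE))


samples = [
    "Truman Show, The (1998)",
    "The Plague (2006)",
    "Father of the Bride Part II (1995)",
    "Road, The (2009)",
    "Hypothetical (The Sequel)",  # '(the' is not an exact token after splitting
]

split_total = sum(count_the(t) for t in samples)
regex_total = sum(count_the_word(t) for t in samples)
print(split_total, regex_total)  # 4 5
```

So the split-based count can miss occurrences where "The" is attached to punctuation; whether that matters depends on how you define a word.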

# Wrap Up the cluster

In [None]:
# wrap up like this
client.retire_workers()
# QQ - do we really need cluster.close() here?
# cluster.close()
client.shutdown()

# Insights

* Things are very much Pandas-like,
* but *lazy*: most operations only build a task graph, and ```compute()``` turns it into a result.
* ```.shape[0].compute()``` gives the number of rows. Rows are spread across partitions and workers, so they have to be counted.
* ```.shape[1]``` gives the number of columns. This is the same everywhere, so it doesn't need ```compute()```.

# Next

* Let's play with the MovieLens dataset some more.