# Distributed Data Analysis with Dask  
*__[Dask](https://www.dask.org/)__ with the __MovieLens__ dataset*  

**Part 1: Getting Started, Load the MovieLens dataset**

### <font color='green'>__Support for Google Colab__  </font>  
    
Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/02-Pandas/02.01-Data-Wrangling-with-MovieLens-and-Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

**Note**: The Dask Dashboard is not available on Google Colab unless you register with a tunnelling system like Saturn Cloud or ngrok. Both are good approaches, but I have not built support for them into this workshop for folks running on Colab. Your contributions / PRs would be very welcome.
  
<font color='green'>Uncomment and execute the cell below to set up and run this notebook on Google Colab.</font>

In [None]:
# # SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on windows)

# # grab dask - in most cases it should already be available in colab
# ! python -m pip install --quiet --upgrade --no-cache-dir "dask[complete]"
# # Let's download and unzip the MovieLens 25M Dataset as well.
# ! mkdir ./../data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-25m.zip
# ! unzip ./ml-25m.zip -d ./../data/

# ! echo "DONE"

## Overview
Select, filter, join, groupby, pivot, and windows.  

Instead of toy examples and '10 minutes to xx' we load an actual dataset and ask meaningful questions about it.
  
We'll use the [MovieLens](https://grouplens.org/datasets/movielens/) dataset for these exercises.  
This dataset is non-trivial and should expand to about __1GB__ on your local disk.  

Download and unzip [MovieLens 25M Dataset](https://grouplens.org/datasets/movielens/25m/) for this analysis.

Either ensure the data is in the ```"../data/ml-25m"``` folder (the location the cells below expect) or update the path to the data below.

**Citation**:  
*F. Maxwell Harper and Joseph A. Konstan.* 2015.  
The MovieLens Datasets: History and Context.  
ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>  

## Approach

The idea is to tackle simple Dask use-cases first and move on to more complex ones.  

Starting with simply loading the data into a dask distributed dataframe, we then perform a data evaluation, some cleanup and finally analysis. We first ask questions based on individual data files, then move on to combining data from multiple files.

We are going to try to avoid the more mathematically involved parts of exploratory data analysis (e.g. statistical analysis of various features) - the core focus is the ability to grok Dask functions and have fun while doing it.  

By the end you'll have not only an idea of Dask, but also of how we ask questions and analyze a chunk of data.  

_You may also end up with a watch-list to binge on your next weekend._ :)   

# Setup Dask, Pandas and Numpy

## Setup the Dask Cluster

### Installation

A local install is as simple as ```pip install "dask[complete]"```  
  
Unlike Spark, Dask is incredibly easy to set up - check out the [Dask Installation Docs](https://docs.dask.org/en/stable/install.html)  

In [1]:
# Step 1: numpy and pandas

import numpy as np
import pandas as pd

print("numpy version: ", np.__version__)
print("pandas version: ", pd.__version__)

numpy version:  1.26.0
pandas version:  2.1.1


In [2]:
# Step 2: import dask and related
import dask
import dask.array as da
import dask.bag as db
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

print("dask version: ", dask.__version__)

dask version:  2023.9.2


![Dask](https://docs.dask.org/en/stable/_images/dask-overview.svg)

<font size = '2px' color='green'>Image from docs.dask.org</font>

In [3]:
# Step 3: Create a Dask Cluster and a Client
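try:
    cluster = LocalCluster()
    # alternative, when you want to specify the dashboard address/port
    # (8787 is the default dashboard port; 8786 is the default scheduler port)
    # cluster = LocalCluster(dashboard_address="localhost:8787")
    if not cluster.shutdown_on_close:
        cluster.shutdown_on_close = True
    client = Client(cluster)
except Exception as err:
    # report the problem instead of silently continuing without a cluster
    print("Could not start the Dask cluster/client:", err)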

In [11]:
# see cluster information
# cluster

In [13]:
# see client information
# client

# Locate the data

In [14]:
datalocation = "../data/ml-25m/"

In [15]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"
file_path_genome_tags = datalocation + "genome-tags.csv"
file_path_genome_scores = datalocation + "genome-scores.csv"
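
Before loading anything, it can help to confirm that the files are where we think they are. Here is a minimal sketch using only the standard library; `EXPECTED_FILES` and `missing_files` are hypothetical names introduced for this illustration, not part of Dask or pandas.

```python
from pathlib import Path

# File names used in this notebook (hypothetical constant for this sketch).
EXPECTED_FILES = [
    "movies.csv",
    "links.csv",
    "ratings.csv",
    "tags.csv",
    "genome-tags.csv",
    "genome-scores.csv",
]


def missing_files(data_dir):
    """Return the expected file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


# An empty list means every file was found at the assumed location.
print(missing_files("../data/ml-25m/"))
```

If the list is not empty, fix the data location before moving on.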

## Schema Spec

# Load the dataset(s)

From the ```README.txt``` file in the small MovieLens dataset:
The dataset files are written as [**comma-separated values**](http://en.wikipedia.org/wiki/Comma-separated_values) files with a **single header row**. Columns that contain commas (`,`) are **escaped using double-quotes (`"`)**. These files are encoded as **UTF-8**. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.

So, we specify:
* Separator - ```,```
* Quote character (what the README calls escaping) - ```"```
* Encoding - ```UTF-8```

Together, these settings are often called the **dialect** of the CSV file.
Dialects vary from file to file, so they need our attention.

In [16]:
# dask dataframes parallelize pandas dataframes
# so many of the idioms are similar
csv_separator = ","
# the README "escapes" commas with double quotes; pandas calls this the quotechar
csv_escapechar = '"'
csv_quotechar = csv_escapechar
csv_encoding = "utf-8"

## Movies

Let's specify the [```dtypes```](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) of each of the columns in the movies file.

In [17]:
# schema, inferred from the README.txt file
movies_schema = {"movieId": "Int32", "title": "string", "genres": "string"}

Two of the columns are [strings of text](https://pandas.pydata.org/docs/user_guide/text.html#working-with-text-data). Pandas may treat those as ```object```, but we want to use the [```pandas.StringDtype```](https://pandas.pydata.org/docs/reference/api/pandas.StringDtype.html#pandas-stringdtype) here.
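
To see what the schema changes, here is a small sketch in plain pandas that reads the same one-row sample twice: once with inferred dtypes and once with the schema above. The sample text is made up in the MovieLens format. What the inferred text columns look like depends on the pandas version (on pandas 2.1 they typically show up as ```object```).

```python
import io

import pandas as pd

# One-row sample in the movies.csv layout (illustrative only).
csv_text = "movieId,title,genres\n1,Toy Story (1995),Adventure|Animation\n"

# Let pandas infer the column types.
inferred = pd.read_csv(io.StringIO(csv_text))

# Declare the types explicitly, as in movies_schema.
declared = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"movieId": "Int32", "title": "string", "genres": "string"},
)

print(inferred.dtypes)
print(declared.dtypes)
```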

In [18]:
# we are using dd - dask.dataframe
movies = dd.read_csv(
    file_path_movies,
    dtype=movies_schema,
    sep=csv_separator,
    quotechar=csv_quotechar,
    encoding=csv_encoding,
)

In [20]:
# movies.visualize()

In [None]:
# show the first 15 lines
movies.head(15)

Look at Row #10 - "American President, The (1995)" - so pandas seems to have correctly interpreted the quotation marks.

For now we'll keep things simple and let pandas give us an index.  
In some cases it would be interesting to use the ```movieId``` column [as the index](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv:~:text=index_colHashable%2C%20Sequence%20of%20Hashable%20or%20False%2C%20optional).  



In [None]:
# data types of each column
movies.dtypes

Just for practice, let's load the other datasets too...

## Links

In [None]:
# schema, inferred from the README.txt file
# load imdbId,tmdbId as strings because they are a part of a URL.
#   IMDB: http://www.imdb.com/title/imdbId/
#   TMDB: https://www.themoviedb.org/movie/tmdbId
links_schema = {"movieId": "Int32", "imdbId": "string", "tmdbId": "string"}

In [None]:
links = pd.read_csv(
    file_path_links,
    dtype=links_schema,
    sep=csv_separator,
    quotechar=csv_quotechar,
    encoding=csv_encoding,
)

In [None]:
links.head(15)

In [None]:
links.dtypes
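
There is another reason to keep these IDs as strings: they can carry leading zeros, which a numeric type would drop. Here is a small sketch with a one-row sample in the ```links.csv``` layout.

```python
import io

import pandas as pd

# One row in the links.csv layout; note the leading zero in imdbId.
csv_text = "movieId,imdbId,tmdbId\n1,0114709,862\n"

# Inferred types: imdbId becomes an integer and loses its leading zero.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Declared types: imdbId stays exactly as written in the file.
as_strings = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"movieId": "Int32", "imdbId": "string", "tmdbId": "string"},
)

print(as_numbers.loc[0, "imdbId"])
print(as_strings.loc[0, "imdbId"])
```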

## Ratings

Reading through the ```README``` file:  
Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).  
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.  

Ooh! Ooh! We got our first [DateTime](https://pandas.pydata.org/docs/reference/api/pandas.DatetimeTZDtype.html#pandas.DatetimeTZDtype)!
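
To get a feel for what "seconds since the epoch" means, here is a small sketch that converts two hand-picked values with ```pd.to_datetime```. The values are illustrative and do not come from the dataset.

```python
import pandas as pd

# Seconds since 1970-01-01 00:00:00 UTC, stored as nullable integers.
raw = pd.Series([0, 1_000_000_000], dtype="Int64")

# unit="s" says the numbers are seconds; utc=True makes the result timezone-aware.
converted = pd.to_datetime(raw, unit="s", utc=True)

print(converted)
```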

In [None]:
# schema, inferred from the README.txt file
# read timestamps as integers then convert to dates later.

ratings_schema = {
    "userId": "Int32",
    "movieId": "Int32",
    "rating": "Float32",
    "timestamp": "Int64",
}
#

In [None]:
ratings = pd.read_csv(
    file_path_ratings,
    dtype=ratings_schema,
    sep=csv_separator,
    quotechar=csv_quotechar,
    encoding=csv_encoding,
)

In [None]:
ratings.head()

In [None]:
# now let's add a datetime column that we derive from the raw timestamp
ratings["datetime"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)

In [None]:
ratings.dtypes

#### ```pandas.Series.dt.date```  
Let's [extract the dates](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.date.html#pandas-series-dt-date) into a new column.

...just in case you are wondering what that [```dt``` part](https://github.com/pandas-dev/pandas/blob/9e1096e8373bc99675fd1b3490cfb7cf26041395/pandas/core/series.py#L5777C1-L5777C2) is and want to dive into the code: it's defined as a [CachedAccessor](https://github.com/pandas-dev/pandas/blob/9e1096e8373bc99675fd1b3490cfb7cf26041395/pandas/core/accessor.py#L196) (see ```pandas/core/accessor.py```) for ```datetimelike``` values.

In [None]:
ratings["date"] = ratings["datetime"].dt.date

In [None]:
ratings.head()

Niiice!  
Wait, let's check the data types once.

In [None]:
ratings.dtypes

We'd prefer the date column to be a datetime type as well.  
Let's prepare the date column again, wrapping it in ```pd.to_datetime()```.


In [None]:
ratings["date"] = pd.to_datetime(ratings["datetime"].dt.date)

In [None]:
# check the data types again
ratings.dtypes
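
The difference between the two attempts is easier to see on a tiny series. The sketch below also shows ```Series.dt.normalize()```, an alternative that is not used in this notebook: it resets the time to midnight while keeping a datetime dtype and the timezone. The timestamps are made up.

```python
import pandas as pd

# Two illustrative epoch timestamps, converted to timezone-aware datetimes.
stamps = pd.to_datetime(
    pd.Series([1_000_000_000, 1_500_000_000]), unit="s", utc=True
)

# .dt.date yields plain Python date objects, so the dtype is object.
as_objects = stamps.dt.date

# Wrapping in pd.to_datetime brings back a datetime dtype (timezone-naive).
rewrapped = pd.to_datetime(as_objects)

# Alternative: normalize() drops the time of day but keeps the timezone.
normalized = stamps.dt.normalize()

print(as_objects.dtype, rewrapped.dtype, normalized.dtype)
```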

Ah! Much better.  
Why, you ask?  
We can easily extract and manipulate dates this way.  
For example:

In [None]:
# extract the day, month and year of each rating
ratings["day"] = ratings["date"].dt.day
ratings["month"] = ratings["date"].dt.month
ratings["year"] = ratings["date"].dt.year

In [None]:
ratings.dtypes

Very clean!

In [None]:
ratings.head()

Let's do this for the Tags data set too - just for practice.


## Tags

From ```README```:  
Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.  
  
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

In [None]:
# schema, inferred from the README.txt file
# read timestamps as integers then convert to dates later.
# userId,movieId,tag,timestamp
tags_schema = {
    "userId": "Int32",
    "movieId": "Int32",
    "tag": "string",
    "timestamp": "Int64",
}
#

In [None]:
tags = pd.read_csv(
    file_path_tags,
    dtype=tags_schema,
    sep=csv_separator,
    quotechar=csv_quotechar,
    encoding=csv_encoding,
)

In [None]:
tags.head()

Just like before, let's add a more readable ```datetime``` column here.

In [None]:
tags["datetime"] = pd.to_datetime(tags["timestamp"], unit="s", utc=True)

In [None]:
tags.dtypes

In [None]:
# extract date into a new column
tags["date"] = pd.to_datetime(tags["datetime"].dt.date)

In [None]:
tags.dtypes

In [None]:
tags.head(5)

umm... go nuts.  
Extract the day, month and year from date, because why not?

In [None]:
tags["day"] = tags["date"].dt.day
tags["month"] = tags["date"].dt.month
tags["year"] = tags["date"].dt.year

In [None]:
tags.head()
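
Since ratings and tags go through exactly the same timestamp steps, the steps could be collected into one helper. Here is a minimal sketch; ```add_date_parts``` is a hypothetical name, and the sample row is made up.

```python
import pandas as pd


def add_date_parts(frame, timestamp_col="timestamp"):
    """Return a copy of frame with datetime, date, day, month and year columns."""
    out = frame.copy()
    # raw epoch seconds -> timezone-aware datetime
    out["datetime"] = pd.to_datetime(out[timestamp_col], unit="s", utc=True)
    # date only, re-wrapped so the column keeps a datetime dtype
    out["date"] = pd.to_datetime(out["datetime"].dt.date)
    out["day"] = out["date"].dt.day
    out["month"] = out["date"].dt.month
    out["year"] = out["date"].dt.year
    return out


# One made-up row in the tags.csv layout.
sample_tags = pd.DataFrame(
    {"userId": [3], "movieId": [260], "tag": ["classic"], "timestamp": [1_500_000_000]}
)

tagged = add_date_parts(sample_tags)
print(tagged[["tag", "date", "day", "month", "year"]])
```

The helper returns a new frame and leaves the input untouched, which makes it safe to re-run a cell.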

# Wrap Up the cluster

In [None]:
# wrap up like this
client.retire_workers()
# QQ - do we really need cluster.close() here?
# cluster.close()
client.shutdown()
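
On the question in the cell above: one way to avoid deciding by hand what to close is to let context managers do the teardown. Here is a minimal sketch, assuming an in-process cluster (```processes=False```) with the dashboard switched off (```dashboard_address=None```). These settings are illustrative choices for a quick check, not the configuration used in this workshop.

```python
from dask.distributed import Client, LocalCluster

# Both the cluster and the client are closed automatically when the
# with-blocks exit, even if the work inside raises an exception.
with LocalCluster(
    n_workers=1,
    threads_per_worker=2,
    processes=False,
    dashboard_address=None,
) as cluster:
    with Client(cluster) as client:
        # submit a trivial task to confirm the cluster is alive
        total = client.submit(sum, [1, 2, 3]).result()

print(total)
```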

# Insights

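1. For CSV data, pay attention to the dialect.
2. EscapeChar vs QuoteChar in pandas: what the README calls escaping with double-quotes maps to ```quotechar```.
3. Opinion: a safe approach for timestamps is to import them as Integers/Numeric and convert using ```pd.to_datetime```.
4. ```pandas.Series.dt.date``` gives back an ```object``` column; wrap it in ```pd.to_datetime``` to keep a datetime type.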
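
To make the second point concrete, here is a small sketch in plain pandas that reads the same title from two differently written samples: one protects the comma with quotes (```quotechar```), the other with a backslash (```escapechar```). The backslash variant is made up for contrast; the MovieLens files use quotes.

```python
import io

import pandas as pd

# Variant 1: the field containing a comma is wrapped in double quotes.
quoted = 'movieId,title\n11,"American President, The (1995)"\n'

# Variant 2 (made up): the comma itself is escaped with a backslash.
escaped = "movieId,title\n11,American President\\, The (1995)\n"

from_quoted = pd.read_csv(io.StringIO(quoted), quotechar='"')
from_escaped = pd.read_csv(io.StringIO(escaped), escapechar="\\")

print(from_quoted.loc[0, "title"])
print(from_escaped.loc[0, "title"])
```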

# Next

* Let's play with the MovieLens dataset some more.