# Analysis with Polars  
*__[Polars](https://www.pola.rs/)__ with the __MovieLens__ dataset*  

**Getting Started, Load the MovieLens dataset, A quick look at Arrow, and some analysis**

### <font color='green'>__Support for Google Colab__  </font>  
    
Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/02-Pandas/02.01-Data-Wrangling-with-MovieLens-and-Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  
  
<font color='green'>Uncomment and execute the cell below to set up and run this notebook on Google Colab.</font>

In [1]:
# # SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on windows)
# !pip install polars
# # Let's download and unzip the Small MovieLens Dataset
# ! mkdir ./../data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# ! unzip ./ml-latest-small.zip -d ./../data/

### Get the _Small_ MovieLens Dataset

We'll use the [small MovieLens dataset](https://grouplens.org/datasets/movielens/#:~:text=Small%3A%20100%2C000%20ratings%20and%203%2C600%20tag%20applications) here.

Download it and unzip to the data folder under the name `ml-latest-small`.

This dataset expands to about 3.2 MB on your local disk. 

# Locate the data

In [2]:
datalocation = "./../data/ml-latest-small/"

In [3]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"

# Set up Polars, Pandas and NumPy

In [4]:
import numpy as np
import pandas as pd
import polars as pl

print("numpy version: ", np.__version__)
print("pandas version: ", pd.__version__)
print("polars version: ",pl.__version__)

numpy version:  1.26.0
pandas version:  2.1.1
polars version:  0.19.5


# A note on Apache Arrow and the Columnar Memory Model

**Apache Arrow? What?**
[Apache Arrow](https://arrow.apache.org/) is an open-source, cross-language development platform for in-memory data. It specifies a standardized, language-agnostic Columnar Memory Format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.  

**Aside** - [Wes McKinney's Apache Arrow and the “10 Things I Hate About pandas”](https://wesmckinney.com/blog/apache-arrow-pandas-internals/), read up!  

**Importance and Benefits**: Arrow enables data systems to process and transfer data quickly. Its columnar memory format allows systems to avoid serialization costs, improving performance for analytics and data interchange.  

That word again - **Columnar**.  
To refresh:

![Columnar Representation](./../images/Column-Wise-Representation.drawio.png)

## **Arrow Memory Model**:  
   - **Arrow’s Columnar Memory Layout**: Instead of storing data row-wise, Arrow stores data column-wise, which means data for a single column is stored together in memory.
   - **Benefits for Analytics**: Storing data column-wise is beneficial for analytics because operations tend to focus on subsets of columns rather than entire rows.

Here's how Apache Arrow's columnar memory format benefits various operations:

### 1. **Better Cache Locality**

Traditional row-wise storage often leads to unnecessary cache misses when performing operations on specific columns. In a columnar format, since each column's data is stored contiguously in memory, operations on a column benefit from better cache locality. 

**Example**: Consider summing the values in a column. 

In a row-wise format, you'd have to jump through memory for every row, leading to potential cache misses. In a columnar format, the summation is performed on contiguous blocks of memory.
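
As a rough illustration of the two layouts, the sketch below stores the same three records row-wise (a list of dicts) and column-wise (one typed, contiguous `array` per column), then sums the `age` column both ways. The records and variable names are made up for illustration. Plain Python will not show the cache effect itself; the point is only which values each loop has to touch.

```python
from array import array

# Row-wise: each record keeps all of its fields together.
rows = [
    {"name": "Alice", "age": 25, "score": 9.5},
    {"name": "Bob", "age": 31, "score": 7.0},
    {"name": "Charlie", "age": 35, "score": 8.25},
]

# Column-wise: each column keeps all of its values together.
# array("q") stores 64-bit signed integers in one contiguous buffer.
columns = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": array("q", [25, 31, 35]),
    "score": array("d", [9.5, 7.0, 8.25]),
}

# Row-wise sum: visit every record and pick out one field.
row_sum = sum(record["age"] for record in rows)

# Column-wise sum: scan a single contiguous buffer.
col_sum = sum(columns["age"])

print(row_sum, col_sum)
```

Both sums agree; only the memory access pattern differs.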

### 2. **Vectorized Operations using SIMD**

Arrow's columnar format pairs perfectly with [SIMD (Single Instruction, Multiple Data)](https://en.wikipedia.org/wiki/Flynn%27s_taxonomy). SIMD allows a single operation to be performed on multiple data points simultaneously. When data is stored in columns, vectorized operations can efficiently process multiple data points in a single column at once.

**Example**: Consider multiplying every value in a column by 2.

With SIMD and columnar storage, multiple entries from that column can be loaded into large registers and multiplied by 2 simultaneously, speeding up the operation.
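
A minimal sketch of the multiply-by-2 example, using a NumPy array as a familiar stand-in for a contiguous column (the data is illustrative). Whether SIMD instructions are actually used depends on the NumPy build and the CPU, so treat this as a picture of the vectorized style rather than a guarantee about the hardware path.

```python
import numpy as np

# A column of values stored contiguously (illustrative data).
column = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.int64)

# One vectorized expression over the whole column instead of a Python-level loop.
doubled = column * 2

print(doubled.tolist())
```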

### 3. **Efficient Compression and Encoding**

Columnar data can be compressed more effectively than row-wise data. This is because adjacent values in a column often have similar lengths and patterns, making them suitable for compression algorithms.

**Example**: A column containing the dates for every day in a year would have repetitive year and month values. This repetitiveness can be exploited by compression and encoding schemes to store the column in less space.
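
To make the date example concrete, here is a simplified run-length encoding sketch in pure Python. It assumes the dates are split into separate year and month columns, and the helper `run_length_encode` is a hypothetical name introduced here. It illustrates the idea only and is not the encoding that Arrow or Parquet actually apply.

```python
from datetime import date, timedelta
from itertools import groupby

# Every day of 2023 (an illustrative, non-leap year), split into component columns.
start = date(2023, 1, 1)
days = [start + timedelta(days=i) for i in range(365)]
year_col = [d.year for d in days]
month_col = [d.month for d in days]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


year_rle = run_length_encode(year_col)
month_rle = run_length_encode(month_col)

# Number of stored entries before and after encoding.
print(len(year_col), "->", len(year_rle))
print(len(month_col), "->", len(month_rle))
```

The 365 year values collapse into a single run and the 365 month values into twelve runs, which is why a sorted, repetitive column compresses so well.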

### 4. **Sparse Data Handling**

In datasets with many missing or null values, columnar storage can be more space-efficient.  
A whole column of missing values can be represented compactly.  

Consider the following example:  

In [5]:
# Arrow marks the missing `name` and `age` values in a compact validity bitmap (one bit per row).
import pyarrow as pa

data = {
    'name': ["Alice", "Bob", "Charlie", None, "Eve"],
    'age': [25, None, 35, None, 40]
}

table = pa.table(data)
print(table)

pyarrow.Table
name: string
age: int64
----
name: [["Alice","Bob","Charlie",null,"Eve"]]
age: [[25,null,35,null,40]]
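
Arrow tracks nulls with a validity bitmap stored alongside the values buffer, using one bit per row. The pure-Python sketch below mimics that idea for the `age` column above. It is a simplification for illustration, not pyarrow's implementation, and the helper `is_valid` is a hypothetical name.

```python
ages = [25, None, 35, None, 40]

# Values buffer: one slot per row; null slots hold an arbitrary placeholder (0 here).
values = [a if a is not None else 0 for a in ages]

# Validity bitmap: bit i is 1 when row i holds a value, 0 when it is null
# (least-significant bit first).
bitmap = 0
for i, a in enumerate(ages):
    if a is not None:
        bitmap |= 1 << i


def is_valid(i):
    """Return True when row i is not null."""
    return (bitmap >> i) & 1 == 1


null_count = len(ages) - bin(bitmap).count("1")
print(values, format(bitmap, "05b"), null_count)
```

The cost of tracking missing values is one bit per row, regardless of the column's data type.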


### 5. **Less I/O**

When querying large datasets stored on disk, columnar formats enable more efficient I/O.  
If only specific columns are required, only those columns' data blocks need to be loaded, reducing the I/O overhead.   
   
Let's look at another example:

In [6]:
# when reading the Parquet file, 
# only the 'name' column's data blocks are loaded, 
# making the read operation faster.
import pyarrow.parquet as pq

# Writing table to a Parquet file (columnar storage)
pq.write_table(table,'data.parquet')

# Reading only the 'name' column - fast as only 'name' column is read.
table_subset = pq.read_table('data.parquet', columns=['name'])
print(table_subset)

pyarrow.Table
name: string
----
name: [["Alice","Bob","Charlie",null,"Eve"]]


# Back to Polars

Polars is still nascent.  
But growing real fast!  
  
The syntax varies from both [Pandas](https://pola-rs.github.io/polars/user-guide/migration/pandas/) and [Spark](https://pola-rs.github.io/polars/user-guide/migration/spark/).  

# Load the dataset(s)

From the ```README.txt``` file in the small MovieLens dataset:
The dataset files are written as [**comma-separated values**](http://en.wikipedia.org/wiki/Comma-separated_values) files with a **single header row**. Columns that contain commas (`,`) are **escaped using double-quotes (`"`)**. These files are encoded as **UTF-8**. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.

So, we specify:
* Separator - ```,```
* Escape Character - ```"```
* Encoding - ```UTF-8```
* Quote Character - ```"```

Together, these settings are often called the **dialect** of the CSV file.
Dialects vary from file to file, so they deserve our attention.
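
To see why the quote character matters, here is a small sketch using Python's standard `csv` module rather than Polars. The sample rows are illustrative and only mimic the MovieLens layout; the second title contains a comma, so it must be wrapped in double quotes to stay in one field.

```python
import csv
import io

# Illustrative sample in the MovieLens layout (not read from the real file).
sample = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation\n"
    '73,"Misérables, Les (1995)",Drama|War\n'
)

# Spell out the dialect: comma separator, double-quote as the quote character.
reader = csv.reader(io.StringIO(sample), delimiter=",", quotechar='"')
header, *records = list(reader)

print(header)
for record in records:
    print(record)
```

With the wrong quote character, the quoted title would be split at its comma and the row would end up with four fields instead of three.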

In [7]:
csv_separator = ","
csv_escapechar = '"'
# Polars spells this "utf8", not "utf-8"
csv_encoding = "utf8"
csv_quotechar = csv_escapechar

## Movies

One of the reasons Polars is fast is its insistence on strict data types.

In [8]:
# schema, inferred from the README.txt file
# use polars constants
movies_schema = {
    "movieId": pl.UInt32,
    "title": pl.Utf8,
    "genres": pl.Utf8,
}

In [9]:
# schema instead of dtypes
# separator instead of sep
# quote_char instead of quotechar
movies = pl.read_csv(
    source=file_path_movies,
    has_header=True,
    schema=movies_schema,
    separator=csv_separator,
    quote_char=csv_quotechar,
    encoding=csv_encoding
)

In [10]:
movies.head()

movieId,title,genres
u32,str,str
1,"""Toy Story (199…","""Adventure|Anim…"
2,"""Jumanji (1995)…","""Adventure|Chil…"
3,"""Grumpier Old M…","""Comedy|Romance…"
4,"""Waiting to Exh…","""Comedy|Drama|R…"
5,"""Father of the …","""Comedy"""


### Expressions  
Polars has this notion of [Expressions](https://pola-rs.github.io/polars/user-guide/expressions/operators/) that is central to its approach.  
If you've seen method chains in JavaScript - you will feel very comfortable with Expressions.  
Each 'expression' is effectively an operation that can be performed in parallel on data.  

In [11]:
# select one column
# returns a one-column DataFrame with all the titles
movies.select(pl.col('title')).head(10)

title
str
"""Toy Story (199…"
"""Jumanji (1995)…"
"""Grumpier Old M…"
"""Waiting to Exh…"
"""Father of the …"
"""Heat (1995)"""
"""Sabrina (1995)…"
"""Tom and Huck (…"
"""Sudden Death (…"
"""GoldenEye (199…"


In [12]:
# select all columns
movies.select(pl.col('*')).head(10)

movieId,title,genres
u32,str,str
1,"""Toy Story (199…","""Adventure|Anim…"
2,"""Jumanji (1995)…","""Adventure|Chil…"
3,"""Grumpier Old M…","""Comedy|Romance…"
4,"""Waiting to Exh…","""Comedy|Drama|R…"
5,"""Father of the …","""Comedy"""
6,"""Heat (1995)""","""Action|Crime|T…"
7,"""Sabrina (1995)…","""Comedy|Romance…"
8,"""Tom and Huck (…","""Adventure|Chil…"
9,"""Sudden Death (…","""Action"""
10,"""GoldenEye (199…","""Action|Adventu…"


In [13]:
# select all columns - option 2
movies.select(pl.all()).head(10)

movieId,title,genres
u32,str,str
1,"""Toy Story (199…","""Adventure|Anim…"
2,"""Jumanji (1995)…","""Adventure|Chil…"
3,"""Grumpier Old M…","""Comedy|Romance…"
4,"""Waiting to Exh…","""Comedy|Drama|R…"
5,"""Father of the …","""Comedy"""
6,"""Heat (1995)""","""Action|Crime|T…"
7,"""Sabrina (1995)…","""Comedy|Romance…"
8,"""Tom and Huck (…","""Adventure|Chil…"
9,"""Sudden Death (…","""Action"""
10,"""GoldenEye (199…","""Action|Adventu…"



## Tags

From ```README```:  
Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.  
  
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
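
As a quick check of what "seconds since the epoch" means, this sketch converts one timestamp value from `tags.csv` (it appears in the output further down in this notebook) using only the standard library. The variable names are illustrative.

```python
from datetime import datetime, timezone

# A timestamp as stored in tags.csv: seconds since 1970-01-01 00:00:00 UTC.
ts_seconds = 1445714994

# Interpret it explicitly as UTC rather than local time.
as_utc = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
print(as_utc.isoformat())

# Millisecond-based datetime types need the value scaled by 1000 first.
ts_millis = ts_seconds * 1000
print(ts_millis)
```

The same scaling by 1000 is needed whenever seconds are converted into a datetime type that counts in milliseconds.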

In [14]:
# schema, inferred from the README.txt file
# read timestamps as integers then convert to dates later.
# userId,movieId,tag,timestamp
tags_schema = {
    "userId": pl.UInt32,
    "movieId": pl.UInt32,
    "tag": pl.Utf8,
    "timestamp": pl.UInt64,
}
#

In [15]:
tags = pl.read_csv(
    source=file_path_tags,
    has_header=True,
    schema=tags_schema,
    separator=csv_separator,
    quote_char=csv_quotechar,
    encoding=csv_encoding
)

In [16]:
tags.head()

userId,movieId,tag,timestamp
u32,u32,str,u64
2,60756,"""funny""",1445714994
2,60756,"""Highly quotabl…",1445714996
2,60756,"""will ferrell""",1445714992
2,89774,"""Boxing story""",1445715207
2,89774,"""MMA""",1445715200


In [17]:
# include all columns but exclude the timestamp
tags.select(pl.all().exclude('timestamp')).head(10)
# also
# tags.select(pl.col('*').exclude('timestamp')).head(10)

userId,movieId,tag
u32,u32,str
2,60756,"""funny"""
2,60756,"""Highly quotabl…"
2,60756,"""will ferrell"""
2,89774,"""Boxing story"""
2,89774,"""MMA"""
2,89774,"""Tom Hardy"""
2,106782,"""drugs"""
2,106782,"""Leonardo DiCap…"
2,106782,"""Martin Scorses…"
7,48516,"""way too long"""


In [18]:
# select just 2 columns
tags.select(pl.col('movieId','tag')).head(10)

movieId,tag
u32,str
60756,"""funny"""
60756,"""Highly quotabl…"
60756,"""will ferrell"""
89774,"""Boxing story"""
89774,"""MMA"""
89774,"""Tom Hardy"""
106782,"""drugs"""
106782,"""Leonardo DiCap…"
106782,"""Martin Scorses…"
48516,"""way too long"""


In [19]:
# add a date-time column
# the timestamp is in seconds, so scale it to milliseconds before casting
# does with_columns remind you of Spark's withColumn?

tags = tags.with_columns(
    (pl.col("timestamp")*1000).cast(pl.Datetime).dt.with_time_unit("ms").alias("datetime")
)

In [20]:
tags.head()

userId,movieId,tag,timestamp,datetime
u32,u32,str,u64,datetime[ms]
2,60756,"""funny""",1445714994,2015-10-24 19:29:54
2,60756,"""Highly quotabl…",1445714996,2015-10-24 19:29:56
2,60756,"""will ferrell""",1445714992,2015-10-24 19:29:52
2,89774,"""Boxing story""",1445715207,2015-10-24 19:33:27
2,89774,"""MMA""",1445715200,2015-10-24 19:33:20


In [21]:
# add a date column

tags = tags.with_columns(
    (pl.col("datetime")).cast(pl.Date).alias("date")
)

In [22]:
tags.head()

userId,movieId,tag,timestamp,datetime,date
u32,u32,str,u64,datetime[ms],date
2,60756,"""funny""",1445714994,2015-10-24 19:29:54,2015-10-24
2,60756,"""Highly quotabl…",1445714996,2015-10-24 19:29:56,2015-10-24
2,60756,"""will ferrell""",1445714992,2015-10-24 19:29:52,2015-10-24
2,89774,"""Boxing story""",1445715207,2015-10-24 19:33:27,2015-10-24
2,89774,"""MMA""",1445715200,2015-10-24 19:33:20,2015-10-24


# Problem Set 1

* How many unique movies in Tags? How many in Movies?

## Solutions to Problem Set 1

### How many unique movies in Tags? How many in Movies?

In [23]:
# we could use the full movies dataset, but a single column is all we need
# one-column DataFrame of all the titles
movie_titles = movies.select(pl.col('title'))
movie_titles.head()

title
str
"""Toy Story (199…"
"""Jumanji (1995)…"
"""Grumpier Old M…"
"""Waiting to Exh…"
"""Father of the …"


Polars offers two ways to [count unique values](https://pola-rs.github.io/polars/user-guide/expressions/functions/#count-unique-values): an exact count and an approximation.

In [24]:
unique_movies_movies = movie_titles.select(
    pl.col('title').n_unique().alias('unique'),
    pl.approx_n_unique('title').alias('unique_approx')
)

In [25]:
unique_movies_movies

unique,unique_approx
u32,u32
9737,9641


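In [26]:
# NOTE: this cell selects from `movies`, so it counts unique movieIds in Movies.
# To count the movies that appear in Tags, run the same expressions on `tags`.
unique_movie_ids_movies = movies.select(
    pl.col('movieId').n_unique().alias('unique'),
    pl.approx_n_unique('movieId').alias('unique_approx')
)

In [27]:
unique_movie_ids_movies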

unique,unique_approx
u32,u32
9742,9791


# Insights

1. Polars runs on Apache Arrow
2. Columnar
3. Strict data types
4. Syntax varies from Pandas and Spark - exercise care
5. Still growing (and fast!)

# Next

* Let's play with the MovieLens dataset some more.