# Data Wrangling in Python  
*__[Pandas](https://pandas.pydata.org/)__ with the __MovieLens__ dataset*  

**Part 1: Getting Started, Load the MovieLens dataset**

### <font color='green'>__Support for Google Colab__  </font>  
    
Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/02-Pandas/02.01-Data-Wrangling-with-MovieLens-and-Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

  
<font color='green'>Uncomment and execute the cell below to set up and run this notebook on Google Colab.</font>

In [1]:
# # SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on Windows)
# # Let's download and unzip the Small MovieLens Dataset
# ! mkdir ./../data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# ! unzip ./ml-latest-small.zip -d ./../data/

### Get the _Small_ MovieLens Dataset

We'll use the [small MovieLens dataset](https://grouplens.org/datasets/movielens/#:~:text=Small%3A%20100%2C000%20ratings%20and%203%2C600%20tag%20applications) here.

Download it and unzip to the data folder under the name `ml-latest-small`.

This dataset expands to about 3.2 MB on your local disk. 

# Locate the data

In [4]:
datalocation = "./../data/ml-latest-small/"

In [5]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"
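
Before loading anything, it can help to confirm that the four files are where we expect them. Here is a minimal sketch using `pathlib`; the helper name `movielens_paths` is made up for this example and is not part of the notebook's own code.

```python
from pathlib import Path


def movielens_paths(data_dir):
    """Map each table name to its expected CSV path (hypothetical helper)."""
    base = Path(data_dir)
    return {name: base / f"{name}.csv"
            for name in ("movies", "links", "ratings", "tags")}


paths = movielens_paths("./../data/ml-latest-small/")

# list whichever files are not on disk yet (empty once the dataset is unzipped)
missing = [str(p) for p in paths.values() if not p.exists()]
print("missing files:", missing)
```

If `missing` is not empty, re-check the download and unzip step before moving on.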

# Set up pandas and NumPy

In [6]:
import numpy as np
import pandas as pd

print("numpy version: ",np.__version__)
print("pandas version: ",pd.__version__)

numpy version:  1.26.0
pandas version:  2.1.1


# Load the dataset(s)

From the ```README.txt``` file in the small MovieLens dataset:
The dataset files are written as [**comma-separated values**](http://en.wikipedia.org/wiki/Comma-separated_values) files with a **single header row**. Columns that contain commas (`,`) are **escaped using double-quotes (`"`)**. These files are encoded as **UTF-8**. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.

So, we specify:
* Separator - ```,```
* Escape Character - ```"```
* Encoding - ```UTF-8```

In [22]:
csv_separator = ','
csv_escapechar='\"'
csv_encoding = 'utf-8'

## Movies

Let's specify the [```dtypes```](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) of each of the columns in the movies file. 

In [24]:
# schema, inferred from the README.txt file
movies_schema = {
	'movieId':'Int32',
	'title':'string',
	'genres':'string'
}

Two of the columns are [strings of text](https://pandas.pydata.org/docs/user_guide/text.html#working-with-text-data). Pandas may treat those as ```object```, but we want to use [```pandas.StringDtype```](https://pandas.pydata.org/docs/reference/api/pandas.StringDtype.html#pandas-stringdtype) here.

In [30]:
# gives an error.
# movies = pd.read_csv(file_path_movies, dtype=movies_schema, sep=csv_separator, escapechar=csv_escapechar, encoding=csv_encoding)

In [31]:
# we need the quote character, not the escape character
# just to keep things readable, let's create another variable
csv_quotechar = csv_escapechar
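
To see what the quote character does without touching the real file, here is a small sketch that parses a hand-written, in-memory sample laid out like `movies.csv`. The sample text and the variable names `sample_csv` and `sample` are assumptions for illustration only.

```python
import io

import pandas as pd

# hand-written sample in the same layout as movies.csv (not read from the real file)
sample_csv = (
    'movieId,title,genres\n'
    '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n'
    '11,"American President, The (1995)",Comedy|Drama|Romance\n'
)

sample = pd.read_csv(io.StringIO(sample_csv),
                     dtype={'movieId': 'Int32', 'title': 'string', 'genres': 'string'},
                     sep=',',
                     quotechar='"')

# the comma inside the double quotes stays part of the title
print(sample.loc[1, 'title'])
```

The quoted title comes back as a single field, so the frame still has three columns.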

In [32]:
movies = pd.read_csv(file_path_movies, 
					 dtype=movies_schema, 
					 sep=csv_separator, 
					 quotechar=csv_quotechar, 
					 encoding=csv_encoding)

In [33]:
# show the first 15 lines
movies.head(15)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


Look at row #10, "American President, The (1995)": the title contains a comma, and pandas has correctly treated the double quotes around it as quoting rather than splitting the title into two columns.

For now we'll keep things simple and let pandas give us an index.  
In some cases it would be interesting to use the ```movieId``` column [as the index](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv:~:text=index_colHashable%2C%20Sequence%20of%20Hashable%20or%20False%2C%20optional).  
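
For reference, using ```movieId``` as the index might look like the sketch below. It runs on a two-row in-memory sample rather than the real file, and the name `movies_by_id` is an assumption for this example.

```python
import io

import pandas as pd

# two hand-copied rows in the same layout as movies.csv
sample_csv = (
    'movieId,title,genres\n'
    '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n'
    '2,Jumanji (1995),Adventure|Children|Fantasy\n'
)

movies_by_id = pd.read_csv(io.StringIO(sample_csv),
                           dtype={'title': 'string', 'genres': 'string'},
                           index_col='movieId')

# rows can now be looked up by movieId with .loc
print(movies_by_id.loc[2, 'title'])
```

With this layout ```movieId``` is no longer a regular column, so anything that expects it as a column would need `reset_index()` first.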



In [29]:
# data types of each column
movies.dtypes

movieId             Int32
title      string[python]
genres     string[python]
dtype: object

Just for practice, let's load the other datasets too...

## Links

In [84]:
# schema, inferred from the README.txt file
# load imdbId and tmdbId as strings because they are part of a URL.
#   IMDB: http://www.imdb.com/title/imdbId/
#   TMDB: https://www.themoviedb.org/movie/tmdbId
links_schema = {
	'movieId':'Int32',
	'imdbId':'string',
	'tmdbId':'string'
}

In [85]:
links = pd.read_csv(file_path_links, 
					dtype=links_schema, 
					sep=csv_separator, 
					quotechar=csv_quotechar, 
					encoding=csv_encoding)

In [86]:
links.head(15)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862
1,2,113497,8844
2,3,113228,15602
3,4,114885,31357
4,5,113041,11862
5,6,113277,949
6,7,114319,11860
7,8,112302,45325
8,9,114576,9091
9,10,113189,710


In [87]:
links.dtypes

movieId             Int32
imdbId     string[python]
tmdbId     string[python]
dtype: object
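
Keeping the IDs as strings makes it easy to assemble links from them. Here is a rough sketch on a two-row hand-built frame, using the TMDB URL pattern from the schema comment above; the `links_sample` frame and the `tmdb_url` column name are assumptions for this example.

```python
import pandas as pd

# small stand-in for the links DataFrame, with the same dtypes as links_schema
links_sample = pd.DataFrame({
    'movieId': pd.array([1, 2], dtype='Int32'),
    'tmdbId': pd.array(['862', '8844'], dtype='string'),
})

# string concatenation works element-wise on a string column
links_sample['tmdb_url'] = 'https://www.themoviedb.org/movie/' + links_sample['tmdbId']

print(links_sample.loc[0, 'tmdb_url'])
```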

## Ratings

Reading through the ```README``` file:  
Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).  
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.  

Ooh! Ooh! We got our first [DateTime](https://pandas.pydata.org/docs/reference/api/pandas.DatetimeTZDtype.html#pandas.DatetimeTZDtype)!

In [77]:
# schema, inferred from the README.txt file
# read timestamps as integers then convert to dates later.

ratings_schema = {
	'userId':'Int32',
	'movieId':'Int32',
	'rating':'Float32',
	'timestamp':'Int64'
}
# 

In [78]:
ratings = pd.read_csv(file_path_ratings, 
					  dtype=ratings_schema, 
					  sep=csv_separator, 
					  quotechar=csv_quotechar, 
					  encoding=csv_encoding)

In [79]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
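
The README's description of the rating scale can be turned into a quick sanity check. The sketch below runs on a few made-up values rather than the real column; `ratings_sample`, `in_range`, `half_steps` and `is_valid` are names chosen for this example.

```python
import pandas as pd

# made-up ratings using the same nullable dtype as ratings_schema
ratings_sample = pd.DataFrame({
    'rating': pd.array([4.0, 0.5, 5.0, 3.5], dtype='Float32'),
})

# ratings should lie between 0.5 and 5.0 stars (inclusive) ...
in_range = ratings_sample['rating'].between(0.5, 5.0)
# ... and be multiples of half a star, i.e. twice the rating is a whole number
half_steps = (ratings_sample['rating'] * 2) % 1 == 0

is_valid = in_range & half_steps
print(is_valid.all())
```

Swapping `ratings_sample` for the loaded `ratings` frame should apply the same check to the full column.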


In [94]:
# now let's add a datetime column that we derive from the raw timestamp
ratings['datetime'] = pd.to_datetime(ratings['timestamp'], unit='s',utc=True)

In [82]:
ratings.dtypes

userId                    Int32
movieId                   Int32
rating                  Float32
timestamp                 Int64
datetime     datetime64[s, UTC]
dtype: object
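
To make the conversion concrete, here is a minimal sketch that converts the first two raw timestamps from the ratings preview on their own. The variable names `raw` and `as_utc` are assumptions for this example.

```python
import pandas as pd

# first two timestamp values shown in the ratings preview
raw = pd.Series([964982703, 964981247])

# unit='s' reads the integers as seconds since the Unix epoch;
# utc=True makes the result timezone-aware (UTC)
as_utc = pd.to_datetime(raw, unit='s', utc=True)

print(as_utc.iloc[0])  # 2000-07-30 18:45:03+00:00
```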

Let's [extract the dates](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.date.html#pandas-series-dt-date) into a new column.

In [109]:
ratings['date'] = ratings['datetime'].dt.date

In [110]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,datetime,date
0,1,1,4.0,964982703,2000-07-30 18:45:03+00:00,2000-07-30
1,1,3,4.0,964981247,2000-07-30 18:20:47+00:00,2000-07-30
2,1,6,4.0,964982224,2000-07-30 18:37:04+00:00,2000-07-30
3,1,47,5.0,964983815,2000-07-30 19:03:35+00:00,2000-07-30
4,1,50,5.0,964982931,2000-07-30 18:48:51+00:00,2000-07-30


Niiice!


## Tags

From ```README```:  
Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.  
  
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

In [88]:
# schema, inferred from the README.txt file
# read timestamps as integers then convert to dates later.
# userId,movieId,tag,timestamp
tags_schema = {
	'userId':'Int32',
	'movieId':'Int32',
	'tag':'string',
	'timestamp':'Int64'
}
# 

In [89]:
tags = pd.read_csv(file_path_tags, 
				   dtype=tags_schema, 
				   sep=csv_separator, 
				   quotechar=csv_quotechar, 
				   encoding=csv_encoding)

In [90]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


Just like before, let's add a more readable ```datetime``` column here.

In [96]:
tags['datetime'] = pd.to_datetime(tags['timestamp'],unit='s',utc=True)

In [105]:
tags.dtypes

userId                    Int32
movieId                   Int32
tag              string[python]
timestamp                 Int64
datetime     datetime64[s, UTC]
dtype: object

In [106]:
# extract date into a new column
tags['date'] = tags['datetime'].dt.date

In [107]:
tags.dtypes

userId                    Int32
movieId                   Int32
tag              string[python]
timestamp                 Int64
datetime     datetime64[s, UTC]
date                     object
dtype: object

In [108]:
tags.head(5)

Unnamed: 0,userId,movieId,tag,timestamp,datetime,date
0,2,60756,funny,1445714994,2015-10-24 19:29:54+00:00,2015-10-24
1,2,60756,Highly quotable,1445714996,2015-10-24 19:29:56+00:00,2015-10-24
2,2,60756,will ferrell,1445714992,2015-10-24 19:29:52+00:00,2015-10-24
3,2,89774,Boxing story,1445715207,2015-10-24 19:33:27+00:00,2015-10-24
4,2,89774,MMA,1445715200,2015-10-24 19:33:20+00:00,2015-10-24


Umm... go nuts.  
[Extract the day, month and year](https://docs.python.org/3/library/datetime.html#datetime.date) from the date, because why not?
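
One possible way to do that is sketched here with the pandas `.dt` accessor on a timezone-aware datetime column, rather than on the Python `date` objects. The one-row input reuses the first tag timestamp from the preview above; `stamps` and `parts` are names made up for this example.

```python
import pandas as pd

# first raw timestamp from the tags preview, converted as before
stamps = pd.to_datetime(pd.Series([1445714994]), unit='s', utc=True)

# .dt exposes the calendar components as integer columns
parts = pd.DataFrame({
    'year': stamps.dt.year,
    'month': stamps.dt.month,
    'day': stamps.dt.day,
})

print(parts)
```

The same three accessors should work on `tags['datetime']` or `ratings['datetime']` directly.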

# Next

* Numpy and Pandas