# Data Wrangling in Python  
*__[Pandas](https://pandas.pydata.org/)__ with the __MovieLens__ dataset*  

**Part 2: Playing with the Movies and Ratings data**

### <font color='green'>__Support for Google Colab__  </font>  
    
Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/02-Pandas/02.02-Data-Wrangling-with-MovieLens-and-Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

  
<font color='green'>Uncomment and execute the cell below to set up and run this notebook on Google Colab.</font>

In [1]:
# # SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on windows)
# # Let's download and unzip the Small MovieLens Dataset
# ! mkdir ./../data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# ! unzip ./ml-latest-small.zip -d ./../data/

### Get the _Small_ MovieLens Dataset

We'll use the [small MovieLens dataset](https://grouplens.org/datasets/movielens/#:~:text=Small%3A%20100%2C000%20ratings%20and%203%2C600%20tag%20applications) here.

Download it and unzip to the data folder under the name `ml-latest-small`.

This dataset expands to about 3.2 MB on your local disk. 

# Locate the data

In [2]:
datalocation = "./../data/ml-latest-small/"

In [3]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"

# Setup Pandas and Numpy

In [4]:
import numpy as np
import pandas as pd

print("numpy version: ", np.__version__)
print("pandas version: ", pd.__version__)

numpy version:  1.26.2
pandas version:  2.1.4


# Load the dataset(s)

From the ```README.txt``` file in the small MovieLens dataset:
The dataset files are written as [**comma-separated values**](http://en.wikipedia.org/wiki/Comma-separated_values) files with a **single header row**. Columns that contain commas (`,`) are **escaped using double-quotes (`"`)**. These files are encoded as **UTF-8**. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.

So, we specify:
* Separator - ```,```
* Escape Character - ```"```
* Encoding - ```UTF-8```  
  
We saw in the last notebook that what the README file really meant was that the **Quote Character** is ```"```, so additionally:  
* Quote Character - ```"```
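
To see why the quote character matters, here is a minimal sketch that parses a tiny in-memory CSV. The sample rows and the names `sample_csv` and `sample` are made up for illustration; only the `sep` and `quotechar` settings mirror the ones listed above.

```python
import io

import pandas as pd

# two made-up rows in the same shape as movies.csv;
# the first title contains a comma, so it is wrapped in double quotes
sample_csv = (
    "movieId,title,genres\n"
    '11,"American President, The (1995)",Comedy|Drama|Romance\n'
    "1,Toy Story (1995),Adventure|Animation\n"
)

sample = pd.read_csv(io.StringIO(sample_csv), sep=",", quotechar='"')

# the quoted comma stays inside the title instead of splitting the row
print(sample.shape)
print(sample.loc[0, "title"])
```

If the title were not quoted, the extra comma would produce a fourth field on that row and the parse would no longer line up with the header.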

In [5]:
csv_separator = ","
csv_escapechar = '"'
csv_encoding = "utf-8"
csv_quotechar = csv_escapechar

## Movies

Let's specify the [```dtypes```](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) of each column in the movies file.

In [6]:
# schema, inferred from the README.txt file
movies_schema = {"movieId": "Int32", "title": "string", "genres": "string"}

In [7]:
movies = pd.read_csv(
    file_path_movies,
    dtype=movies_schema,
    sep=csv_separator,
    quotechar=csv_quotechar,
    encoding=csv_encoding,
)

In [8]:
# show the first 5 lines
movies.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [9]:
# data types of each column
movies.dtypes

movieId             Int32
title      string[python]
genres     string[python]
dtype: object

## Ratings

Reading through the ```README``` file:  
Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).  
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.  
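
As a quick sanity check of that definition, the sketch below converts one raw timestamp with nothing but the standard library. The value `964982703` is the timestamp of the first rating shown in this notebook; the variable names are just for illustration.

```python
from datetime import datetime, timezone

# raw timestamp taken from the first row of the ratings data
raw_timestamp = 964982703

# interpret it as seconds since the Unix epoch, in UTC
as_datetime = datetime.fromtimestamp(raw_timestamp, tz=timezone.utc)
print(as_datetime.isoformat())
```

This should agree with what ```pd.to_datetime(..., unit="s", utc=True)``` produces for the same row.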

In [10]:
# schema, inferred from the README.txt file
# read timestamps as integers then convert to dates later.
ratings_schema = {
    "userId": "Int32",
    "movieId": "Int32",
    "rating": "Float32",
    "timestamp": "Int64",
}
#

In [11]:
ratings = pd.read_csv(
    file_path_ratings,
    dtype=ratings_schema,
    sep=csv_separator,
    quotechar=csv_quotechar,
    encoding=csv_encoding,
)

# now let's add a datetime column that we derive from the raw timestamp
ratings["datetime"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)
ratings["date"] = pd.to_datetime(ratings["datetime"].dt.date)
ratings["day"] = ratings["date"].dt.day
ratings["month"] = ratings["date"].dt.month
ratings["year"] = ratings["date"].dt.year

In [12]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,datetime,date,day,month,year
0,1,1,4.0,964982703,2000-07-30 18:45:03+00:00,2000-07-30,30,7,2000
1,1,3,4.0,964981247,2000-07-30 18:20:47+00:00,2000-07-30,30,7,2000
2,1,6,4.0,964982224,2000-07-30 18:37:04+00:00,2000-07-30,30,7,2000
3,1,47,5.0,964983815,2000-07-30 19:03:35+00:00,2000-07-30,30,7,2000
4,1,50,5.0,964982931,2000-07-30 18:48:51+00:00,2000-07-30,30,7,2000


In [13]:
# now let's add a datetime column that we derive from the raw timestamp
ratings["datetime"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)

In [14]:
ratings.dtypes

userId                    Int32
movieId                   Int32
rating                  Float32
timestamp                 Int64
datetime     datetime64[s, UTC]
date             datetime64[ns]
day                       int32
month                     int32
year                      int32
dtype: object

We already [extracted the dates](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.date.html#pandas-series-dt-date) into a new ```date``` column when loading the ratings, along with ```day```, ```month``` and ```year```:

In [15]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,datetime,date,day,month,year
0,1,1,4.0,964982703,2000-07-30 18:45:03+00:00,2000-07-30,30,7,2000
1,1,3,4.0,964981247,2000-07-30 18:20:47+00:00,2000-07-30,30,7,2000
2,1,6,4.0,964982224,2000-07-30 18:37:04+00:00,2000-07-30,30,7,2000
3,1,47,5.0,964983815,2000-07-30 19:03:35+00:00,2000-07-30,30,7,2000
4,1,50,5.0,964982931,2000-07-30 18:48:51+00:00,2000-07-30,30,7,2000


# Problem Set 1

1. That comma and quotechar thing... how many titles in the movies data set have commas in them?
1. Can we extract the year of release from the movie title and put it in a separate column?
1. How many movies in the data set are from each year? How many from each decade?
1. Can we calculate an average rating for each movie?
1. How many times was each movie rated? Is there a wide margin between the number of ratings one movie has received vs another?
1. Is there a way I can query a movieId and get its title and average rating back?

## Solutions to Problem Set 1

### How many titles in the movies data set have commas in them?

In [16]:
# 1
# Series of all the titles
movies['title'].head(15)

0                       Toy Story (1995)
1                         Jumanji (1995)
2                Grumpier Old Men (1995)
3               Waiting to Exhale (1995)
4     Father of the Bride Part II (1995)
5                            Heat (1995)
6                         Sabrina (1995)
7                    Tom and Huck (1995)
8                    Sudden Death (1995)
9                       GoldenEye (1995)
10        American President, The (1995)
11    Dracula: Dead and Loving It (1995)
12                          Balto (1995)
13                          Nixon (1995)
14               Cutthroat Island (1995)
Name: title, dtype: string

In [17]:
# 2
# Test if a title has a comma or not
movies['title'].str.contains(',').head(15)

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11    False
12    False
13    False
14    False
Name: title, dtype: boolean

See how #10 matches?  
That's a clue to how we isolate such titles.  

We can build a [filter](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-and-selecting-data) using: ```new_df = df[condition]```
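
Here is a minimal sketch of that filter pattern on a made-up three-row DataFrame (the names `titles`, `has_comma` and `with_commas` are illustrative only). Since `True` counts as 1, summing the boolean mask is a shortcut for counting the matches.

```python
import pandas as pd

# a tiny, made-up stand-in for the movies dataframe
titles = pd.DataFrame(
    {
        "title": [
            "Toy Story (1995)",
            "American President, The (1995)",
            "Cry, the Beloved Country (1995)",
        ]
    }
)

# boolean mask: True where the title contains a literal comma
has_comma = titles["title"].str.contains(",", regex=False)

# keep only the rows where the mask is True
with_commas = titles[has_comma]

print(len(with_commas))
print(int(has_comma.sum()))
```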

In [18]:
movies_with_commas = movies[movies['title'].str.contains(',') == True]

In [19]:
movies_with_commas.head(5)

Unnamed: 0,movieId,title,genres
10,11,"American President, The (1995)",Comedy|Drama|Romance
28,29,"City of Lost Children, The (Cité des enfants p...",Adventure|Drama|Fantasy|Mystery|Sci-Fi
36,40,"Cry, the Beloved Country (1995)",Drama
46,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
49,54,"Big Green, The (1995)",Children|Comedy


In [20]:
# total movies
movies.count()

movieId    9742
title      9742
genres     9742
dtype: int64

In [21]:
# number of movies with commas in their titles
movies_with_commas.count()

movieId    2079
title      2079
genres     2079
dtype: int64

### Extract the year of release from the movie title to a separate column

We'll use regex to match here.
Something like [regex101](https://regex101.com/r/pWPPbM/1) is really helpful in building the expression.

In [22]:
# select a couple of complex-ish titles for building the regex
print(movies_with_commas.loc[10]['title'])
print(movies_with_commas.loc[28]['title'])

American President, The (1995)
City of Lost Children, The (Cité des enfants perdus, La) (1995)


In [23]:
import re
# regex: 
# 1st capture group: match a single (
# 2nd capture group: match exactly 4 digits
# 3rd capture group: match a single )
# at the end of the string
year_regex_pattern = '([(])([0-9]{4})([)]$)'
# alternative: use \d{4} instead of [0-9]{4}
print('No of groups in the regex: ',re.compile(year_regex_pattern).groups)

No of groups in the regex:  3


We can use [Pandas Series' ```str.extract()```](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html#pandas-series-str-extract) method here.

In [24]:
# see if the regex works
print(re.search(year_regex_pattern, movies_with_commas.loc[10]['title']))
print(re.search(year_regex_pattern, movies_with_commas.loc[28]['title']))

<re.Match object; span=(24, 30), match='(1995)'>
<re.Match object; span=(57, 63), match='(1995)'>


```str.extract()``` will return 3 columns, one for each capture group.
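
If only the year is needed, one alternative (a sketch, not the approach used in this notebook) is to capture just the digits. The pattern `single_group_pattern` and the sample titles below are assumptions for illustration. With a single capture group and ```expand=False```, ```str.extract()``` returns a Series, so there is no column to pick afterwards.

```python
import pandas as pd

titles = pd.Series(
    [
        "Toy Story (1995)",
        "City of Lost Children, The (Cité des enfants perdus, La) (1995)",
        "A Title Without A Year",
    ]
)

# only the four digits are captured; the parentheses are matched but not captured
single_group_pattern = r"\((\d{4})\)$"

# one capture group + expand=False gives back a Series of strings
years = titles.str.extract(single_group_pattern, expand=False)

# titles that do not end in "(YYYY)" come back as missing values
print(years.tolist())
```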

In [25]:
# we are interested in the second capture group
movies['year'] = movies['title'].str.extract(year_regex_pattern, flags = re.X, expand=False)[1]

In [26]:
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995


In [27]:
# check
print(movies.loc[10]['year'])
print(movies.loc[28]['year'])

1995
1995


In [28]:
# wait... let's look at dtypes
movies.dtypes

movieId             Int32
title      string[python]
genres     string[python]
year       string[python]
dtype: object

In [29]:
# year needs to be int
movies['year'] = pd.to_numeric(movies['year'],downcast='integer')
movies.dtypes

movieId             Int32
title      string[python]
genres     string[python]
year                Int16
dtype: object

In [30]:
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995


### How many movies in the data set from each year?  

In [31]:
movies_by_year_groupby = movies.groupby(by = 'year')

In [32]:
dir(movies_by_year_groupby)

['_DataFrameGroupBy__examples_dataframe_doc',
 '__annotations__',
 '__class__',
 '__class_getitem__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__orig_bases__',
 '__parameters__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_accessors',
 '_agg_examples_doc',
 '_agg_general',
 '_agg_py_fallback',
 '_aggregate_frame',
 '_aggregate_with_numba',
 '_apply_filter',
 '_apply_to_column_groupbys',
 '_ascending_count',
 '_choose_path',
 '_concat_objects',
 '_constructor',
 '_cumcount_array',
 '_cython_agg_general',
 '_cython_transform',
 '_define_paths',
 '_deprecate_axis',
 '_descending_count',
 '_dir_additions',
 '_dir_deletions',
 '_fill

In [33]:
# groupby.groups is a dict with all unique values of 'year' as keys
print('data type of movies_by_year_groupby.groups is: ', type(movies_by_year_groupby.groups))

data type of movies_by_year_groupby.groups is:  <class 'pandas.io.formats.printing.PrettyDict'>


In [34]:
print(movies_by_year_groupby.groups.keys())

dict_keys([1902, 1903, 1908, 1915, 1916, 1917, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018])


In [35]:
print('number of unique years in movies data: ',len(movies_by_year_groupby.groups.keys()))

number of unique years in movies data:  106


In [36]:
# compare with the series.nunique() method
print('number of unique years in movies data: ',movies['year'].nunique())

number of unique years in movies data:  106


In [37]:
count_movies_by_year = movies_by_year_groupby.count()

In [38]:
# show latest years first
count_movies_by_year.sort_values(by='year', ascending=False).head()

year,movieId,title,genres
2018,41,41,41
2017,147,147,147
2016,218,218,218
2015,274,274,274
2014,277,277,277


### How many from each decade?

#### [```pandas.DataFrame.apply()```](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#pandas-dataframe-apply)
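```apply()``` calls a Python function on each element (here, on each value of a single column).  
Because that is a Python-level loop rather than a vectorized operation, it gets really slow really fast as the data grows.  
Exercise caution.  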

We'll define a function that returns a decade for a given year.  
Then ```apply()``` it to the ```year``` column.

In [39]:
# define a trivial function to return a decade
def get_decade(year):
	# in case year is missing
	if pd.isna(year):
		return 0
	return int(year // 10 * 10)
# 
print(get_decade(1924))
print(get_decade(1972))
print(get_decade(2001))
print(get_decade(2018))

1920
1970
2000
2010


In [40]:
# add a decade column to movies
movies['decade'] = movies['year'].apply(get_decade)
movies.head()

Unnamed: 0,movieId,title,genres,year,decade
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1990
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,1990
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,1990
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,1990
4,5,Father of the Bride Part II (1995),Comedy,1995,1990


In [41]:
movies.dtypes

movieId             Int32
title      string[python]
genres     string[python]
year                Int16
decade              int64
dtype: object
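
For comparison, the same decade arithmetic can be written without ```apply()```, as a vectorized expression on the whole column. This is a sketch on a made-up Series (the names `years`, `decades` and `decades_filled` are illustrative); it assumes a nullable integer dtype like the ```Int16``` used for ```year``` above.

```python
import pandas as pd

# made-up years, including one missing value
years = pd.Series([1924, 1972, 2001, None], dtype="Int16")

# integer-divide by 10, then multiply back: applied to the whole column at once
decades = years // 10 * 10

# missing years stay missing; fill with 0 to mirror get_decade()
decades_filled = decades.fillna(0)
print(decades_filled.tolist())
```

One difference worth noting: the vectorized version keeps the nullable integer dtype, whereas the ```apply()``` version above ended up as plain ```int64```.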

In [42]:
movies_by_decade_groupby = movies.groupby(by = 'decade')

In [43]:
movies_by_decade_groupby.count()

decade,movieId,title,genres,year
0,24,24,24,0
1900,3,3,3,3
1910,7,7,7,7
1920,37,37,37,37
1930,136,136,136,136
1940,197,197,197,197
1950,279,279,279,279
1960,401,401,401,401
1970,499,499,499,499
1980,1177,1177,1177,1177


### Average rating for each movie

In [44]:
# select 2 columns from the ratings dataframe - supply the columns as a list
ratings_movieId_groupby = ratings[['movieId', 'rating']].groupby(by = 'movieId')

In [45]:
ratings_movieId_groupby.mean()

movieId,rating
1,3.92093
2,3.431818
3,3.259615
4,2.357143
5,3.071429
...,...
193581,4.0
193583,3.5
193585,3.5
193587,3.5


### How many times was each movie rated?

In [46]:
rating_counts = ratings_movieId_groupby.count()
rating_counts.sort_values(by = 'rating', ascending=False).head(10)

movieId,rating
356,329
318,317
296,307
593,279
2571,278
260,251
480,238
110,237
589,224
527,220
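
The average and the number of ratings can also be computed in a single pass with ```agg()```. The sketch below runs on a made-up three-row ratings frame; `ratings_sample` and `rating_summary` are illustrative names, not part of the notebook.

```python
import pandas as pd

# made-up ratings: movie 1 was rated twice, movie 2 once
ratings_sample = pd.DataFrame(
    {"movieId": [1, 1, 2], "rating": [4.0, 5.0, 3.0]}
)

# one row per movie, one column per aggregate
rating_summary = ratings_sample.groupby("movieId")["rating"].agg(["mean", "count"])
print(rating_summary)
```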


### Is there a wide margin between the number of ratings one movie has received vs another?

In [47]:
rating_counts.describe()

Unnamed: 0,rating
count,9724.0
mean,10.369807
std,22.401005
min,1.0
25%,1.0
50%,3.0
75%,9.0
max,329.0


Seems like about 75% of the movies have 9 ratings or fewer.

In [48]:
# all the movies with over 9 ratings
movies_with_more_than_9_ratings = rating_counts[rating_counts['rating']>9]

In [49]:
movies_with_more_than_9_ratings.describe()

Unnamed: 0,rating
count,2269.0
mean,35.749669
std,35.986989
min,10.0
25%,14.0
50%,22.0
75%,43.0
max,329.0


In [50]:
# all the movies with 9 ratings or less
movies_with_9_or_fewer_ratings = rating_counts[rating_counts['rating']<=9]

In [51]:
movies_with_9_or_fewer_ratings.describe()

Unnamed: 0,rating
count,7455.0
mean,2.645205
std,2.181174
min,1.0
25%,1.0
50%,2.0
75%,4.0
max,9.0


In [52]:
# share of movies with 9 ratings or fewer: (count with <= 9) / (total count)
7455.0/9724.0

0.7666598107774578
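
Rather than typing the two counts by hand, the same share can be derived from the data: the mean of a boolean mask is the fraction of `True` values. A sketch on made-up counts (`sample_counts` and `share_9_or_fewer` are hypothetical names):

```python
import pandas as pd

# hypothetical number of ratings for five movies
sample_counts = pd.Series([1, 1, 3, 9, 329])

# True where a movie has 9 ratings or fewer; the mean of booleans is the share of True
share_9_or_fewer = (sample_counts <= 9).mean()
print(share_9_or_fewer)

# the 75th percentile, the same statistic describe() reports as "75%"
print(sample_counts.quantile(0.75))
```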

### BTW - what's that movie that has the most ratings?

In [53]:
movies[movies['movieId'] == 356]

Unnamed: 0,movieId,title,genres,year,decade
314,356,Forrest Gump (1994),Comedy|Drama|Romance|War,1994,1990


### Query a movieId and get its title and average rating back

In [54]:
# create the average rating dataframe
avg_ratings = ratings_movieId_groupby.mean()
avg_ratings.head()

movieId,rating
1,3.92093
2,3.431818
3,3.259615
4,2.357143
5,3.071429


In [55]:
# let's rename the rating column for clarity
avg_ratings.rename(columns={'rating':'average_rating'}, inplace=True)
avg_ratings.head()

movieId,average_rating
1,3.92093
2,3.431818
3,3.259615
4,2.357143
5,3.071429


#### [```pandas.DataFrame.merge```](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)

```merge``` is how pandas does database style joins.  

Joins are a way to match data between two tables, allowing us to combine columns from one or more tables into a new table.

[Read up more!](https://pandas.pydata.org/docs/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging)

N.B. [```pandas.DataFrame.join()```](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html#pandas-dataframe-join) also exists; it uses ```pandas.merge``` internally.
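
To make the left join concrete, here is a small sketch on made-up frames (`movies_sample` and `avg_sample` are hypothetical). The ```indicator=True``` option adds a ```_merge``` column showing which rows found a match, which is handy for spotting movies without any ratings.

```python
import pandas as pd

movies_sample = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})

# movie 3 has no ratings, so it has no row here
avg_sample = pd.DataFrame({"movieId": [1, 2], "average_rating": [3.9, 3.4]})

# how="left" keeps every movie; unmatched movies get a missing average_rating
merged = movies_sample.merge(avg_sample, on="movieId", how="left", indicator=True)
print(merged)
```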

In [56]:
# merge (join) with the movies dataframe
movies = movies.merge(avg_ratings, on='movieId', how='left')
movies.head()

Unnamed: 0,movieId,title,genres,year,decade,average_rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1990,3.92093
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,1990,3.431818
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,1990,3.259615
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,1990,2.357143
4,5,Father of the Bride Part II (1995),Comedy,1995,1990,3.071429


In [57]:
def get_movie_title_and_avg_rating(movieId):
	if movieId:
		movie = movies[movies['movieId'] == movieId]
		# an unknown movieId matches no rows, so only read values when a row was found
		if not movie.empty:
			# .values array to extract the raw value from a field 
			movie_title = movie['title'].values[0]
			movie_avg_rating = movie['average_rating'].values[0]
			# pay attention to double quotes and single quotes when constructing the string
			return f"Movie Title: {movie_title} - has an average rating of: {movie_avg_rating}"
	return 'Incorrect movieId'

Let's test this on the movies that got lots of ratings...

In [58]:
get_movie_title_and_avg_rating(356)

'Movie Title: Forrest Gump (1994) - has an average rating of: 4.164133548736572'

In [59]:
get_movie_title_and_avg_rating(318)

'Movie Title: Shawshank Redemption, The (1994) - has an average rating of: 4.429022312164307'

In [60]:
get_movie_title_and_avg_rating(296)

'Movie Title: Pulp Fiction (1994) - has an average rating of: 4.197068214416504'

In [61]:
get_movie_title_and_avg_rating(593)

'Movie Title: Silence of the Lambs, The (1991) - has an average rating of: 4.161290168762207'

In [62]:
get_movie_title_and_avg_rating(2571)

'Movie Title: Matrix, The (1999) - has an average rating of: 4.192446231842041'

...and those movies we used to test commas.

In [63]:
get_movie_title_and_avg_rating(11)

'Movie Title: American President, The (1995) - has an average rating of: 3.671428680419922'

In [64]:
get_movie_title_and_avg_rating(29)

'Movie Title: City of Lost Children, The (Cité des enfants perdus, La) (1995) - has an average rating of: 4.013157844543457'
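
An alternative design for this kind of lookup, sketched here under the assumption that `movieId` is unique, is to make `movieId` the index and use ```.loc```. The frame, the rounded ratings and the function name `describe_movie` are made up for illustration.

```python
import pandas as pd

# made-up stand-in for the merged movies dataframe
sample = pd.DataFrame(
    {
        "movieId": [1, 2],
        "title": ["Toy Story (1995)", "Jumanji (1995)"],
        "average_rating": [3.92, 3.43],
    }
)

# index by movieId so that a lookup is a label access instead of a full scan
by_id = sample.set_index("movieId")


def describe_movie(movie_id):
    # unknown ids are reported instead of raising a KeyError
    if movie_id not in by_id.index:
        return "Incorrect movieId"
    row = by_id.loc[movie_id]
    return f"{row['title']} has an average rating of {row['average_rating']:.2f}"


print(describe_movie(1))
print(describe_movie(99))
```

Formatting with ```:.2f``` also avoids the long float tails seen in the outputs above.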

# Insights

1. For CSV data pay attention to the dialect
2. EscapeChar vs QuoteChar in Pandas
3. Opinion: Safe approach for timestamps - import as Integers/Numeric and convert using ```pd.to_datetime```
4. ```pandas.Series.dt.date```
5. ```pandas.DataFrame.apply```
6. ```pandas.DataFrame.merge```

# Next

* Let's play with the MovieLens dataset some more.