# Data Wrangling in Python  
*Exploring the __MovieLens__ dataset using the Python Collections module*  

**Part 1: Basic collections in Python**

### <font color='green'>__Support for Google Colab__  </font>  
    
Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/00-Python-Collections/01.01%20Data-Wrangling-with-Plain-Old-Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

  
<font color='green'>Uncomment and execute the cell below to set up and run this notebook on Google Colab.</font>

In [1]:
# # SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on windows)
# # Let's download and unzip the Small MovieLens Dataset
# ! mkdir ./../data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# ! unzip ./ml-latest-small.zip -d ./../data/

### Get the _Small_ MovieLens Dataset

We'll use the [small MovieLens dataset](https://grouplens.org/datasets/movielens/#:~:text=Small%3A%20100%2C000%20ratings%20and%203%2C600%20tag%20applications) here.

Download it and unzip to the data folder under the name `ml-latest-small`.

This dataset expands to about 3.2 MB on your local disk. 

In [2]:
datalocation = "./../data/ml-latest-small/"

In [3]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"

#### Exploring the data

In [4]:
# old school: print the header plus the first `numlines` rows of a file
def renderlines(file_name=file_path_movies, numlines=10):
    # the with-block closes the file even if printing fails
    with open(file_name, "r", encoding="utf-8") as file_data:
        c = 0
        for line in file_data:
            print(line)
            # the limit is tested after printing, so numlines + 1 lines
            # are printed: the header and numlines data rows
            c = c + 1
            if c > numlines:
                break

In [5]:
# movies file is default
renderlines()

movieId,title,genres

1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy

2,Jumanji (1995),Adventure|Children|Fantasy

3,Grumpier Old Men (1995),Comedy|Romance

4,Waiting to Exhale (1995),Comedy|Drama|Romance

5,Father of the Bride Part II (1995),Comedy

6,Heat (1995),Action|Crime|Thriller

7,Sabrina (1995),Comedy|Romance

8,Tom and Huck (1995),Adventure|Children

9,Sudden Death (1995),Action

10,GoldenEye (1995),Action|Adventure|Thriller



In [6]:
renderlines(file_path_links, 10)

movieId,imdbId,tmdbId

1,0114709,862

2,0113497,8844

3,0113228,15602

4,0114885,31357

5,0113041,11862

6,0113277,949

7,0114319,11860

8,0112302,45325

9,0114576,9091

10,0113189,710



In [7]:
renderlines(file_path_tags, 10)

userId,movieId,tag,timestamp

2,60756,funny,1445714994

2,60756,Highly quotable,1445714996

2,60756,will ferrell,1445714992

2,89774,Boxing story,1445715207

2,89774,MMA,1445715200

2,89774,Tom Hardy,1445715205

2,106782,drugs,1445715054

2,106782,Leonardo DiCaprio,1445715051

2,106782,Martin Scorsese,1445715056

7,48516,way too long,1169687325



In [8]:
renderlines(file_path_ratings, 10)

userId,movieId,rating,timestamp

1,1,4.0,964982703

1,3,4.0,964981247

1,6,4.0,964982224

1,47,5.0,964983815

1,50,5.0,964982931

1,70,3.0,964982400

1,101,5.0,964980868

1,110,4.0,964982176

1,151,5.0,964984041

1,157,5.0,964984100



### **Basic Python Collections**
Examples using the MovieLens Dataset

#### 1. **Lists**Oordered sequences of elements, and they are mutable.

**Example:** Extract all movie titles from the dataset into a list.

In [9]:
# using list comprehensions
movie_titles = [
    line.split(",")[1]
    for line in open(
        file_path_movies, "r", encoding="utf-8", newline="\r\n"
    ).readlines()[1:]
]
#
print("FIRST 5")
print("\n".join(movie_titles[:5]))  # Print the first 5 movie titles, each in a new line
print("---\nLAST 5")
print("\n".join(movie_titles[-5:]))  # Print the last 5 movie titles, each in a new line

FIRST 5
Toy Story (1995)
Jumanji (1995)
Grumpier Old Men (1995)
Waiting to Exhale (1995)
Father of the Bride Part II (1995)
---
LAST 5
Black Butler: Book of the Atlantic (2017)
No Game No Life: Zero (2017)
Flint (2017)
Bungo Stray Dogs: Dead Apple (2018)
Andrew Dice Clay: Dice Rules (1991)
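
Because lists are mutable, a list like `movie_titles` can be changed in place after it is built. The sketch below is a minimal illustration; it uses a short hard-coded list named `titles` (an assumption, so it runs without the dataset) rather than the full list above.

```python
# A small stand-in for movie_titles, hard-coded so the sketch runs on its own
titles = ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"]

titles.append("Heat (1995)")  # add an element to the end
titles[1] = "Jumanji"         # replace an element in place
removed = titles.pop(0)       # remove and return the first element
titles.sort()                 # sort in place, alphabetically

print(removed)
print(titles)
```

Each of these operations modifies the same list object; none of them creates a copy.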


#### 2. **Tuples**
Tuples are ordered, immutable sequences.

**Example:** Pair each movie with its genres in a tuple.

In [10]:
# read the file and skip the header,
# take the second and third values in each row,
# and build a list of (title, genres) tuples
movie_genre_pairs = [
    (line.split(",")[1], line.split(",")[2])
    for line in open(
        file_path_movies, "r", encoding="utf-8", newline="\r\n"
    ).readlines()[1:]
]
#
print("FIRST 5")
print(movie_genre_pairs[:5])  # Print the first 5 (movie, genre) pairs
print("---\nLAST 5")
print(movie_genre_pairs[-5:])  # Print the last 5 (movie, genre) pairs

FIRST 5
[('Toy Story (1995)', 'Adventure|Animation|Children|Comedy|Fantasy\r\n'), ('Jumanji (1995)', 'Adventure|Children|Fantasy\r\n'), ('Grumpier Old Men (1995)', 'Comedy|Romance\r\n'), ('Waiting to Exhale (1995)', 'Comedy|Drama|Romance\r\n'), ('Father of the Bride Part II (1995)', 'Comedy\r\n')]
---
LAST 5
[('Black Butler: Book of the Atlantic (2017)', 'Action|Animation|Comedy|Fantasy\r\n'), ('No Game No Life: Zero (2017)', 'Animation|Comedy|Fantasy\r\n'), ('Flint (2017)', 'Drama\r\n'), ('Bungo Stray Dogs: Dead Apple (2018)', 'Action|Animation\r\n'), ('Andrew Dice Clay: Dice Rules (1991)', 'Comedy\r\n')]
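
The pairs above still carry the trailing `\r\n` from the file. The sketch below, using one hard-coded pair (an assumption, so it runs without the dataset), shows tuple unpacking, stripping the line ending, and what happens when code tries to modify a tuple.

```python
# One (title, genres) pair, hard-coded so the sketch runs on its own
pair = ("Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy\r\n")

# Tuples unpack positionally into names
title, genres = pair
genre_list = genres.strip().split("|")  # strip() drops the trailing "\r\n"

# Tuples are immutable: item assignment raises TypeError
try:
    pair[0] = "Toy Story"
    mutated = True
except TypeError:
    mutated = False

print(title, genre_list, mutated)
```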


#### 3. **Dictionaries**
Dictionaries map keys to values. They are mutable and keys are unique.

**Example:** Create a dictionary where movie IDs are keys and movie titles are values.

In [11]:
movie_dict = {}
with open(file_path_movies, "r", encoding="utf-8", newline="\r\n") as file:
    next(file)  # skip the header
    for line in file:
        data = line.split(",")
        movie_id, title = data[0], data[1]
        movie_dict[movie_id] = title
#
print("FIRST 5")
print(list(movie_dict.items())[:5])  # Print the first 5 id-title pairs
print("---\nLAST 5")
print(list(movie_dict.items())[-5:])  # Print the last 5 id-title pairs

FIRST 5
[('1', 'Toy Story (1995)'), ('2', 'Jumanji (1995)'), ('3', 'Grumpier Old Men (1995)'), ('4', 'Waiting to Exhale (1995)'), ('5', 'Father of the Bride Part II (1995)')]
---
LAST 5
[('193581', 'Black Butler: Book of the Atlantic (2017)'), ('193583', 'No Game No Life: Zero (2017)'), ('193585', 'Flint (2017)'), ('193587', 'Bungo Stray Dogs: Dead Apple (2018)'), ('193609', 'Andrew Dice Clay: Dice Rules (1991)')]


With this dictionary, we can look up a movie title by its ID. Note that the keys are strings, because they were read from the file as text.

In [12]:
movie_dict["260"]

'Star Wars: Episode IV - A New Hope (1977)'
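
Indexing with a key that is not in the dictionary raises `KeyError`. A minimal sketch of safer lookups, using a tiny stand-in dictionary named `sample_movies` (a hypothetical name, hard-coded so the example runs without the dataset):

```python
# A tiny stand-in for movie_dict (keys are strings, as read from the file)
sample_movies = {"1": "Toy Story (1995)", "2": "Jumanji (1995)"}

# get() returns a default instead of raising KeyError for a missing key
missing = sample_movies.get("999", "unknown title")

has_toy_story = "1" in sample_movies  # membership tests look at the keys
has_int_key = 1 in sample_movies      # the integer 1 is not the string "1"

print(missing, has_toy_story, has_int_key)
```

If integer IDs are more convenient, converting with `int(movie_id)` while building the dictionary is one option.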

#### 4. **Sets**
Sets are unordered collections of unique elements.

**Example:** Extract all unique genres from the dataset.

In [13]:
genres_set = set()
with open(file_path_movies, "r", encoding="utf-8", newline="\r\n") as file:
    next(file)  # skip the header
    #
    lines_with_too_many_commas = 0
    for line in file:
        row_data = line.split(",")
        genres = row_data[2].split("|")
        # some titles may have commas
        # we'll switch to the CSV reader to address it better
        # we can implement our own CSV reader but
        # it's probably a distraction for this notebook
        if len(row_data) > 3 and lines_with_too_many_commas < 15:
            lines_with_too_many_commas += 1
            print(genres)
        genres_set.update(genres)
# the movie titles have commas in them, find a way to escape those
print(genres_set)

[' The (1995)"']
[' The (Cité des enfants perdus']
[' the Beloved Country (1995)"']
[' The (1995)"']
[' The (1995)"']
[' The (Postino']
[' The (1995)"']
[' Les (1995)"']
[' The (1995)"']
[' The (1996)"']
[' The (Badkonake sefid) (1995)"']
[' La) (1995)"']
[' The (1995)"']
[' The (1995)"']
[' Steal Little (1995)"']
{' The (1957)"', ' A (Zhestokij Romans) (1984)"', ' An (1995)"', ' The (Bad ma ra khahad bord) (1999)"', ' The) (2013)"', ' The (Badkonake sefid) (1995)"', ' the Ironman (Tetsuo) (1988)"', ' The (a.k.a. Caravan of Courage: An Ewok Adventure) (1984)"', ' The (Les douze travaux d\'Astérix) (1976)"', ' The (Yôkai daisensô) (2005)"', ' An (Chien andalou', ' La (2005)"', ' The (Scaphandre et le papillon', ' M.D. (1963)"', ' Les) (1953)"', ' The (1951)"', ' The (Carabiniers', ' The (Ville est tranquille', ' Il) (1992)"', ' An (Un été inoubliable) (1994)"', ' The (Entre les murs) (2008)"', 'Western', ' Summer', ' The (Le mari de la coiffeuse) (1990)"', ' A (Panique au village) (2009

In [14]:
# for now, let's read the last element in each row
# otherwise the same code as before...
genres_set = set()
with open(file_path_movies, "r", encoding="utf-8", newline="\r\n") as file:
    next(file)  # skip the header
    #
    lines_with_too_many_commas = 0
    for line in file:
        row_data = line.split(",")
        genres = row_data[-1].split("|")
        # some titles may have commas
        if len(row_data) > 3 and lines_with_too_many_commas < 15:
            lines_with_too_many_commas += 1
            print(genres)
        genres_set.update(genres)
# the movie titles have commas in them, but now we should only see genres...
print(genres_set)

['Comedy', 'Drama', 'Romance\r\n']
['Adventure', 'Drama', 'Fantasy', 'Mystery', 'Sci-Fi\r\n']
['Drama\r\n']
['Crime', 'Mystery', 'Thriller\r\n']
['Children', 'Comedy\r\n']
['Comedy', 'Drama', 'Romance\r\n']
['Adventure', 'Children', 'Fantasy\r\n']
['Drama', 'War\r\n']
['Action', 'Crime', 'Drama', 'Thriller\r\n']
['Drama', 'Thriller\r\n']
['Children', 'Drama\r\n']
['Crime', 'Drama\r\n']
['Drama', 'Romance\r\n']
['Crime', 'Drama\r\n']
['Comedy\r\n']
{'Drama', 'Thriller', 'Sci-Fi\r\n', 'Action\r\n', 'War\r\n', 'Mystery\r\n', 'Thriller\r\n', 'Film-Noir\r\n', 'Film-Noir', 'Drama\r\n', 'Musical\r\n', 'Adventure', 'Adventure\r\n', 'Western', 'Children', 'Comedy\r\n', 'Action', 'Horror\r\n', 'IMAX\r\n', 'War', 'Mystery', 'Horror', 'Documentary\r\n', 'Sci-Fi', 'Animation\r\n', '(no genres listed)\r\n', 'Romance\r\n', 'Children\r\n', 'Fantasy\r\n', 'Animation', 'Romance', 'Western\r\n', 'Comedy', 'Documentary', 'Fantasy', 'Crime\r\n', 'Crime', 'Musical'}
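
The set above still contains duplicates such as `'Drama'` and `'Drama\r\n'`, because the line ending is attached to the last genre in each row. One way to handle both the quoted commas and the line endings is the standard-library `csv` module. The sketch below is a minimal illustration on a few inlined sample rows in the layout of `movies.csv` (the `sample_csv` string is an assumption, so the example runs without the dataset).

```python
import csv
import io

# Sample rows in the same layout as movies.csv, inlined so the sketch runs alone.
# The second title contains a comma, so it is quoted.
sample_csv = (
    "movieId,title,genres\r\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\r\n"
    '11,"American President, The (1995)",Comedy|Drama|Romance\r\n'
    "9,Sudden Death (1995),Action\r\n"
)

genres_seen = set()
reader = csv.reader(io.StringIO(sample_csv, newline=""))
next(reader)  # skip the header
for movie_id, title, genres in reader:
    # csv.reader keeps the quoted comma inside the title
    # and removes the line terminator from the last field
    genres_seen.update(genres.split("|"))

print(sorted(genres_seen))
```

With the real file, the same loop would read from `open(file_path_movies, "r", encoding="utf-8", newline="")` instead of the `StringIO` object.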


### **The `collections` Module**

The `collections` module in Python offers a set of alternative collection datatypes that can augment the basic ones.
This gives us powerful tools for specialized tasks. 


#### 1. **`namedtuple`: Tuple Subclass with Named Fields**
- Provides clarity without the memory overhead of a full class.
- Useful in scenarios where you might use structs in C.

**Example:** Representing a movie with its title, genre, and rating.

In [15]:
from collections import namedtuple

Movie = namedtuple("Movie", ["title", "genre", "rating"])
star_wars = Movie("Star Wars: Episode IV - A New Hope (1977)", "Sci-Fi", 8.8)
print(star_wars.title)  # Output: Star Wars: Episode IV - A New Hope (1977)

Star Wars: Episode IV - A New Hope (1977)


[todo: load movies as named tuples]
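
One possible approach to the todo above, sketched on inlined sample rows rather than the real file: the `MovieRow` type and its field names are hypothetical, chosen to mirror the `movies.csv` header.

```python
import csv
import io
from collections import namedtuple

# Field names mirror the movies.csv header; MovieRow is a hypothetical name
MovieRow = namedtuple("MovieRow", ["movie_id", "title", "genres"])

# Sample rows inlined so the sketch runs without the dataset
sample_csv = (
    "movieId,title,genres\r\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\r\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\r\n"
)

reader = csv.reader(io.StringIO(sample_csv, newline=""))
next(reader)  # skip the header
# _make() builds a namedtuple from any iterable of the right length
movie_rows = [MovieRow._make(row) for row in reader]

print(movie_rows[0].title)
print(movie_rows[1].genres.split("|"))
```

Fields are then available by name (`row.title`) as well as by position (`row[1]`).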

#### 2. **`deque`: Double-Ended Queue**
- Allows for fast appends and pops from both ends.
- Useful for maintaining a sliding window or implementing certain types of parsers.

**Example:** Maintaining a fixed-size window of the last N ratings for a movie.

In [16]:
from collections import deque

last_five_ratings = deque(maxlen=5)
for rating in [7.5, 8.0, 8.3, 7.8, 8.2, 8.0]:
    last_five_ratings.append(rating)

print(last_five_ratings)  # Output: deque([8.0, 8.3, 7.8, 8.2, 8.0], maxlen=5)

deque([8.0, 8.3, 7.8, 8.2, 8.0], maxlen=5)
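
A fixed-size `deque` is a natural fit for a moving average, since the oldest value is dropped automatically once the window is full. A minimal sketch with made-up ratings (the values and the window size of 3 are assumptions for illustration):

```python
from collections import deque

window = deque(maxlen=3)  # keeps only the 3 most recent ratings
moving_averages = []
for rating in [4.0, 5.0, 3.0, 4.0, 2.0]:
    window.append(rating)  # once full, the oldest rating is dropped
    moving_averages.append(sum(window) / len(window))

print(moving_averages)
print(list(window))
```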


#### 3. **`Counter`: Counting Elements**
- Facilitates counting elements in an iterable.
- Useful for tasks like token counting or histogram creation.

**Example:** Counting genres in a list of movies.

In [17]:
from collections import Counter

genres = ["Action", "Drama", "Action", "Sci-Fi", "Drama", "Sci-Fi", "Action"]
genre_count = Counter(genres)
print(genre_count)  # Output: Counter({'Action': 3, 'Drama': 2, 'Sci-Fi': 2})

Counter({'Action': 3, 'Drama': 2, 'Sci-Fi': 2})


In [18]:
genres = []
with open(file_path_movies, "r", encoding="utf-8", newline="\r\n") as file:
    next(file)  # skip the header
    for line in file:
        row_data = line.split(",")
        # read from end of line, avoid running into commas in titles
        genres_in_row = row_data[-1].split("|")
        # strip the newlines
        genres_in_row = [g.strip() for g in genres_in_row]
        genres.append(genres_in_row)

In [19]:
print(genres[:10])

[['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'], ['Adventure', 'Children', 'Fantasy'], ['Comedy', 'Romance'], ['Comedy', 'Drama', 'Romance'], ['Comedy'], ['Action', 'Crime', 'Thriller'], ['Comedy', 'Romance'], ['Adventure', 'Children'], ['Action'], ['Action', 'Adventure', 'Thriller']]


In [20]:
# flatten genres - list comprehensions
# they need you to be smart
# and sometimes make people who read your code feel dumb
genres = [genre for genre_row in genres for genre in genre_row]
# remove this for the workshop.

In [21]:
movielens_genre_count = Counter(genres)
print(movielens_genre_count)

Counter({'Drama': 4361, 'Comedy': 3756, 'Thriller': 1894, 'Action': 1828, 'Romance': 1596, 'Adventure': 1263, 'Crime': 1199, 'Sci-Fi': 980, 'Horror': 978, 'Fantasy': 779, 'Children': 664, 'Animation': 611, 'Mystery': 573, 'Documentary': 440, 'War': 382, 'Musical': 334, 'Western': 167, 'IMAX': 158, 'Film-Noir': 87, '(no genres listed)': 34})
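
A `Counter` also answers "top N" questions directly and treats missing keys as zero. The sketch below uses a few counts copied from the output above, hard-coded so it runs without the dataset.

```python
from collections import Counter

# A few counts copied from the MovieLens output above
genre_counts = Counter(
    {"Drama": 4361, "Comedy": 3756, "Thriller": 1894, "Film-Noir": 87}
)

top_two = genre_counts.most_common(2)  # (genre, count) pairs, largest first
missing = genre_counts["Musical"]      # absent keys count as 0, no KeyError

# update() adds to existing counts instead of replacing them
genre_counts.update(["Drama", "Musical"])

print(top_two, missing, genre_counts["Drama"], genre_counts["Musical"])
```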


#### 4. **`OrderedDict`: Dict subclass that remembers the order entries were added**
- Maintains the order of insertion.
- Especially relevant for tasks where order matters, like configuration parsing or specific serialization tasks.

**Example:** Storing movies and their ratings in the order they were rated.

In [22]:
from collections import OrderedDict

movie_ratings = OrderedDict()
movie_ratings["Inception"] = 8.8
movie_ratings["Interstellar"] = 8.6
movie_ratings["Dunkirk"] = 7.9

for movie, rating in movie_ratings.items():
    print(f"{movie}: {rating}")  # same order as added

Inception: 8.8
Interstellar: 8.6
Dunkirk: 7.9


[todo: load movies data and ratings, create an ordered dict based on timestamps - may be too advanced for the workshop]
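
A rough sketch of the timestamp idea, using four rows for user 1 copied from the ratings preview earlier in this notebook (hard-coded so it runs without the dataset). Since Python 3.7 a plain `dict` also preserves insertion order; `OrderedDict` additionally offers reordering methods such as `move_to_end()`.

```python
from collections import OrderedDict

# (movieId, rating, timestamp) for user 1, copied from the ratings preview
rows = [
    ("1", 4.0, 964982703),
    ("3", 4.0, 964981247),
    ("6", 4.0, 964982224),
    ("101", 5.0, 964980868),
]

# Insert in timestamp order so iteration follows the order of rating
by_time = OrderedDict(
    (movie_id, rating)
    for movie_id, rating, ts in sorted(rows, key=lambda r: r[2])
)
rated_order = list(by_time)  # keys, earliest rating first

# move_to_end() is specific to OrderedDict; plain dicts do not have it
by_time.move_to_end("101")

print(rated_order)
print(list(by_time))
```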

#### 5. **`defaultdict`: Dict subclass with a default value for missing keys**
- Helps in avoiding "key not found" errors.
- Commonly used in graph algorithms, multi-value dictionaries, or accumulators.

**Example:** Storing a list of ratings for each movie.

In [23]:
from collections import defaultdict

movie_ratings = defaultdict(list)
movie_ratings["Inception"].extend([8.5, 8.6, 8.7])
print(movie_ratings["Inception"])  # Output: [8.5, 8.6, 8.7]
print(movie_ratings["Unknown Movie"])  # Output: [], without error

[8.5, 8.6, 8.7]
[]


[todo: create a `defaultdict` keyed by movie title that gives a count of each rating given - may be too advanced for the workshop]
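
One way this could look is a `defaultdict` whose default value is a `Counter`. The `(title, rating)` pairs below are hypothetical; in the notebook they would come from joining `ratings.csv` to `movies.csv` on `movieId`.

```python
from collections import Counter, defaultdict

# Hypothetical (title, rating) pairs, hard-coded so the sketch runs on its own
rated = [
    ("Toy Story (1995)", 4.0),
    ("Toy Story (1995)", 4.0),
    ("Toy Story (1995)", 5.0),
    ("Heat (1995)", 4.0),
]

# Each title seen for the first time starts with an empty Counter
rating_counts = defaultdict(Counter)
for title, rating in rated:
    rating_counts[title][rating] += 1

print(dict(rating_counts["Toy Story (1995)"]))

# Caution: merely looking up a missing key inserts it with the default value
unseen = rating_counts["Unseen Movie"]
print(len(unseen), "Unseen Movie" in rating_counts)
```

Use `rating_counts.get(title)` or an `in` test when a lookup should not create an entry.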

In [24]:
# renderlines(file_path_ratings, 15)

# Next

We look at `itertools` and `functools`.