# Data Wrangling in Python  
*Comparing Dictionaries, Named Tuples and Data Classes*  
*Using the __MovieLens__ dataset*  

**Part 1.1: Comparing Dictionaries, Named Tuples and Data Classes**  
  
![Comparing Dictionaries, Named Tuples and Data Classes](./../images/data_munging_00-Python-Collections-015.png)

### <font color='green'>__Support for Google Colab__  </font>  
    
Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/00-Python-Collections/01.02%20Playing%20with%20Itertools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

  
<font color='green'>Uncomment and execute the cell below to set up and run this notebook on Google Colab.</font>

In [1]:
# # SETUP FOR COLAB: select all the lines below and uncomment (Ctrl+/ on Windows)
# # Let's download and unzip the Small MovieLens Dataset
# # -p avoids an error if the data folder already exists
# ! mkdir -p ./../data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# ! unzip ./ml-latest-small.zip -d ./../data/

### Get the _Small_ MovieLens Dataset

We'll use the [small MovieLens dataset](https://grouplens.org/datasets/movielens/#:~:text=Small%3A%20100%2C000%20ratings%20and%203%2C600%20tag%20applications) here.

Download it and unzip it into the data folder (`./../data/`), so that the CSV files end up under `./../data/ml-latest-small/`.

This dataset expands to about 3.2 MB on your local disk. 

In [1]:
datalocation = "./../data/ml-latest-small/"

In [2]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"
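
The problems below all start from rows of these CSV files. As a minimal sketch (not necessarily how the later notebooks load the data), each file can be read into a list of dictionaries with the standard library's `csv.DictReader`. The helper name `read_csv_rows` is an assumption made for this illustration.

```python
import csv


def read_csv_rows(path):
    """Read a CSV file that has a header row into a list of dictionaries.

    Each row becomes a dict keyed by the column names in the header.
    All values are returned as strings; convert them as needed.
    """
    # newline="" is the setting recommended by the csv module docs,
    # so that quoted fields containing line breaks are parsed correctly.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

With the paths defined above, `movies = read_csv_rows(file_path_movies)` would give one dictionary per movie. Titles that contain commas are quoted in the file, and the `csv` module handles that quoting for us.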

# Easy Problems with Dictionaries  
1. **Count Movies by Genre**: Build a dictionary where keys are movie genres and values are the number of movies in each genre.
2. **Find Highest Rated Movies**: Create a dictionary where keys are movie IDs and values are their average ratings. Then, identify the top N highest-rated movies.
3. **Group Movies by Release Decade**: Organize movies into a dictionary where keys are decades (e.g., 1980s, 1990s) and values are lists of movie IDs released in that decade.
4. **Identify Most Active Users**: Find the users who have rated the most movies, storing the user ID and count in a dictionary.
5. **Recommend Movies Based on Similar Users**: Implement a simple recommendation system where you recommend movies liked by users similar to the target user based on their shared ratings.
6. **Analyze Rating Distribution**: Create a dictionary where keys are rating values (0.5 to 5.0, in half-star steps) and values are the percentage of all ratings that have each value.
7. **Identify Users with Similar Rating Patterns**: Implement a function to identify pairs of users with similar rating distributions for specific genres or movie types.
8. **Analyze User Preferences**: Build a dictionary for each user storing their preferred genres based on their rating history. (future state: if we find a way to join IMDb data in subsequent notebooks, it would be interesting to revisit this exercise to find out preferred directors, actors etc. for users)
9. **Identify Trending Movies**: Implement a function that analyzes rating timestamps and identifies movies with the highest recent activity or rating spikes.
10. **Create a Basic Movie Recommendation Engine** (possible duplicate, let's consider it for now): Combine user preferences and movie features to recommend movies likely to be enjoyed by a specific user.
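
To give a feel for the shape of a solution, here is one possible sketch of problems 1 and 2 using plain dictionaries. It assumes each row is a dictionary keyed by the CSV column names (`movieId`, `title`, `genres` for movies; `userId`, `movieId`, `rating`, `timestamp` for ratings). The function names are hypothetical, the three movie rows are a tiny excerpt of `movies.csv`, and the rating rows are made up purely for illustration.

```python
def count_movies_by_genre(movies):
    """Problem 1: map each genre to the number of movies tagged with it."""
    counts = {}
    for movie in movies:
        # The genres column is pipe-separated, e.g. "Comedy|Romance".
        for genre in movie["genres"].split("|"):
            counts[genre] = counts.get(genre, 0) + 1
    return counts


def average_rating_by_movie(ratings):
    """Problem 2, step 1: map each movie ID to its average rating."""
    totals = {}  # movieId -> sum of ratings
    counts = {}  # movieId -> number of ratings
    for row in ratings:
        movie_id = row["movieId"]
        totals[movie_id] = totals.get(movie_id, 0.0) + float(row["rating"])
        counts[movie_id] = counts.get(movie_id, 0) + 1
    return {movie_id: totals[movie_id] / counts[movie_id] for movie_id in totals}


def top_n_movies(averages, n):
    """Problem 2, step 2: the n (movieId, average) pairs with the highest average."""
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


# A tiny excerpt in the movies.csv layout.
sample_movies = [
    {"movieId": "1", "title": "Toy Story (1995)",
     "genres": "Adventure|Animation|Children|Comedy|Fantasy"},
    {"movieId": "2", "title": "Jumanji (1995)",
     "genres": "Adventure|Children|Fantasy"},
    {"movieId": "3", "title": "Grumpier Old Men (1995)",
     "genres": "Comedy|Romance"},
]

# Made-up rows in the ratings.csv layout (NOT real MovieLens ratings).
sample_ratings = [
    {"userId": "1", "movieId": "1", "rating": "4.0", "timestamp": "0"},
    {"userId": "2", "movieId": "1", "rating": "5.0", "timestamp": "0"},
    {"userId": "1", "movieId": "2", "rating": "3.0", "timestamp": "0"},
    {"userId": "2", "movieId": "3", "rating": "3.5", "timestamp": "0"},
]

genre_counts = count_movies_by_genre(sample_movies)
averages = average_rating_by_movie(sample_ratings)
print(genre_counts)
print(top_n_movies(averages, 2))
```

On the full dataset, an average based on only one or two ratings can dominate the top N, so it may be worth requiring a minimum number of ratings per movie before ranking.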



# Next

Next, we look at `itertools` and `functools`.