### What is MovieLens?

MovieLens is a movie recommender system developed by GroupLens, a computer science research lab at the University of Minnesota. It recommends movies to users based on the ratings they have given. The name also refers to the rating datasets released from the service, which are widely used in research and teaching.

### Tutorial Outline

This tutorial is broken down into several steps, starting with loading the data and exploratory data analysis. We will then explore different recommendation techniques: popularity-based recommendation, content-based filtering, collaborative filtering, two-tower neural networks, and large language models.

### Chapter 1: Load Data and Exploratory Data Analysis (EDA)

In [None]:
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Suppress FutureWarnings
warnings.filterwarnings('ignore', category=FutureWarning)
# Display floats with no decimal places (this also rounds the values shown by describe())
pd.set_option('display.float_format', lambda x: '%.0f' % x)

# Reading ratings file
ratings = pd.read_csv('data-1m/ratings.csv', 
                    sep=_______, # What's the separator?
                    encoding='latin-1',
                    engine='python',
                    index_col=0
                     ) 

# Reading users file
users = pd.read_csv('data-1m/users.csv', 
                    sep=_______,
                    encoding='latin-1',
                    engine='python',
                    index_col=0
                     )

# Reading movies file
movies = pd.read_csv('data-1m/movies.csv', 
                    sep=_______,
                    encoding='latin-1',
                    engine='python',
                    index_col=0
                     ) 
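
If the separator is not obvious, one option is to look at a few raw lines before calling `read_csv`. The sketch below is a minimal illustration on an invented semicolon-separated sample, not on the real files; the column names `user_id`, `movie_id` and `rating` are hypothetical. It uses `csv.Sniffer` from the standard library to guess the delimiter.

```python
import csv
import io

import pandas as pd

# Invented sample text; the real files may use a different separator.
sample_text = "user_id;movie_id;rating\n1;1193;5\n1;661;3\n"

# csv.Sniffer guesses the delimiter from a sample of the text.
dialect = csv.Sniffer().sniff(sample_text, delimiters=",;\t|")
print(repr(dialect.delimiter))

# Pass the detected delimiter on to read_csv.
demo = pd.read_csv(io.StringIO(sample_text), sep=dialect.delimiter)
print(demo.shape)
```

For a file on disk, reading the first few lines with `open(path).readline()` and inspecting them by eye works just as well.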

### Movies Dataset

In [None]:
# Print the first 5 rows
_______

In [None]:
# Inspect the dataset
_______
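
As a reference for the inspection cells, here is a short sketch of common pandas inspection calls on an invented toy frame. The values are made up; only the `title` and `genres` column names follow this notebook.

```python
import pandas as pd

# Toy frame standing in for a loaded table; the values are invented.
toy = pd.DataFrame(
    {
        "title": ["Movie A (1995)", "Movie B (1996)", None],
        "genres": ["Comedy", "Drama|Romance", "Action"],
    }
)

print(toy.head())            # first rows (5 by default)
print(toy.shape)             # (rows, columns)
print(toy.dtypes)            # type of each column
missing = toy.isna().sum()   # missing values per column
print(missing)
toy.info()                   # index, dtypes and non-null counts in one report
```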

In [None]:
# Import the wordcloud library
_______
from wordcloud import WordCloud, STOPWORDS

# Create a wordcloud of the movie titles
movies['title'] = movies['title'].fillna("").astype('str')
title_corpus = ' '.join(movies['title'])
title_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(_______)

# Plot the wordcloud
plt.figure(figsize=(16,8))
plt.imshow(_______)
plt.axis('off')
plt.show()
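
A word cloud draws frequent words larger, so it can help to check the underlying counts numerically. The sketch below counts words in a few invented titles with `collections.Counter`. The titles and the small stop-word set are assumptions for illustration; `WordCloud` applies its own tokenization and its own `STOPWORDS` list, so the numbers will not match it exactly.

```python
import re
from collections import Counter

# Invented titles in a "Title (Year)" style; not taken from the dataset.
titles = ["The Red Door (1995)", "Red River Night (1996)", "Night of the Door (1997)"]

# Tiny hand-picked stop-word set, for illustration only.
stop_words = {"the", "of", "a", "an", "and"}

words = []
for title in titles:
    # Keep alphabetic tokens only, which also drops the year in parentheses.
    for token in re.findall(r"[A-Za-z]+", title.lower()):
        if token not in stop_words:
            words.append(token)

word_counts = Counter(words)
print(word_counts.most_common(3))
```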

In [None]:
import seaborn as sns
all_genres = movies['genres'].str.split('|').explode()

# Count the frequency of each genre
genre_counts = _______

# Create a bar plot using seaborn
plt.figure(figsize=(12, 6))
sns.barplot(x=genre_counts.values, y=_______ , palette='viridis')
plt.title('Distribution of Movie Genres')
plt.xlabel('Number of Movies')
plt.ylabel('Genre')
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
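
To see what `str.split('|').explode()` does, here is a minimal sketch on three invented movies that use the same pipe-separated layout. It is one way to tally labels, not necessarily the intended solution for the exercise.

```python
import pandas as pd

# Invented rows using the same pipe-separated genre layout as the cell above.
toy_movies = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "genres": ["Comedy|Romance", "Comedy", "Drama|Comedy"],
    }
)

# Split each string into a list, then give every list element its own row.
exploded = toy_movies["genres"].str.split("|").explode()

# Tally how often each label occurs.
toy_counts = exploded.value_counts()
print(toy_counts)
```

A movie with several genres is counted once per genre, so the counts add up to more than the number of movies.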

### Users Dataset

In [None]:
# Inspect the users dataframe
_______

In [None]:
# Inspect the dataset
_______

In [None]:
# Create a figure with subplots
fig = plt.figure(figsize=(20, 12))

# 1. Gender Distribution
plt.subplot(2, 2, 1)
gender_dist = users['gender']._______
sns.barplot(x=_______, y=gender_dist.values)
plt.title('Gender Distribution')
plt.ylabel('Count')

# 2. Age Group Distribution
plt.subplot(2, 2, 2)
age_desc_dist = users[_______].value_counts()
sns.barplot(x=_______, y=_______)
plt.title('Age Group Distribution')
plt.xlabel('Count')

# 3. Top 10 Occupations
plt.subplot(2, 2, 3)
occ_dist = users['occ_desc']._______().head(10)
sns.barplot(x=_______, y=_______)
plt.title('Top 10 Occupations')
plt.xlabel('Count')

# 4. Age vs Occupation (Box Plot). Please make sure to complete this.
plt.subplot(_______)
_______(_______ = users, y='occ_desc', x='age', order=occ_dist.index)
plt.title('Age Distribution by Top Occupations')
plt.xlabel('Age')

plt.tight_layout()
plt.show()

From the charts above, can you describe the age and gender distributions of the MovieLens users?
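
To back a description with numbers, shares are often easier to compare than raw counts. The sketch below uses four invented users; only the `gender` and `age` column names follow the cells above.

```python
import pandas as pd

# Invented users; the values are made up for illustration.
toy_users = pd.DataFrame(
    {
        "gender": ["M", "M", "F", "M"],
        "age": [25, 35, 25, 18],
    }
)

# normalize=True returns fractions that sum to 1 instead of raw counts.
gender_share = toy_users["gender"].value_counts(normalize=True)
age_share = toy_users["age"].value_counts(normalize=True).sort_index()
print(gender_share)
print(age_share)
```

In this notebook the `display.float_format` option is set to show no decimals, so fractions would print as 0 or 1 here; multiplying by 100 to get percentages, or resetting the option, avoids that.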

### Ratings Dataset

In [None]:
# Inspect the ratings dataset
_______

In [None]:
ratings = ratings.drop(columns=_______)
_______

In [None]:
# Inspect the dataframe
_______

In [None]:
# What is the distribution of the ratings?
_ = plt.hist(ratings[_______], bins=20)
plt.title('Histogram of Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Describe the ratings
print(_______.describe())
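
When a column takes only a handful of distinct values, a count per value can be easier to read than a 20-bin histogram, which leaves most bins empty. The sketch below assumes ratings on a whole-star 1 to 5 scale and uses invented values.

```python
import pandas as pd

# Invented ratings on a 1-5 whole-star scale (an assumption about the data).
toy_ratings = pd.Series([5, 3, 4, 4, 5, 1, 4], name="rating")

# Count each distinct value and order by rating rather than by frequency.
rating_counts = toy_ratings.value_counts().sort_index()
print(rating_counts)

# Mean and median summarize the centre of the distribution.
print(round(toy_ratings.mean(), 2), toy_ratings.median())
```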

In [None]:
# Merge the datasets
df = pd.merge(pd.merge(_______, _______),_______)
df._______

# Save the combined dataset
df.to_csv('data-1m/dataset_combined.csv', index=False)
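
When `pd.merge` is called without `on`, it joins on every column name the two frames share. A more defensive variant names the key and the expected relationship explicitly. The sketch below uses invented tables, and the key names `user_id` and `movie_id` are hypothetical, so adapt them to the actual columns.

```python
import pandas as pd

# Invented tables; the key names user_id and movie_id are hypothetical.
toy_ratings = pd.DataFrame(
    {"user_id": [1, 1, 2], "movie_id": [10, 20, 10], "rating": [5, 3, 4]}
)
toy_users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
toy_movies = pd.DataFrame({"movie_id": [10, 20], "title": ["Movie A", "Movie B"]})

# validate raises an error if a key is duplicated on the "one" side,
# which would otherwise silently multiply rows.
merged = (
    toy_ratings
    .merge(toy_users, on="user_id", how="left", validate="many_to_one")
    .merge(toy_movies, on="movie_id", how="left", validate="many_to_one")
)

# A left join from ratings should keep exactly one row per rating.
assert len(merged) == len(toy_ratings)
print(merged)
```

Comparing the row count before and after the merge is a quick sanity check that no ratings were dropped or duplicated.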