## Let's start from this end instead.

In [1]:
import polars as pl

polars=pl.read_csv('../Data/ml-latest/movies.csv',
    columns=['movieId','genres'],
    dtypes={
        'movieId':pl.Int32,
        'genres':pl.Utf8,
        }
    )

In [2]:
%%timeit
ratings = pl.scan_csv("../Data/ml-latest/ratings.csv")
ratings.fetch(5)

457 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


The `pl.scan_csv` method builds a lazy query and doesn't read the whole file into memory. If we then call `.fetch(5)`, the query runs on only the first 5 rows of the file (not a random sample, as the output further down shows).<br> Because Polars never has to load or parse the rest of the data, it is ridiculously fast.

In [3]:
# So if we quickly want to peek at a large CSV without loading all of it, we can just scan it lazily and fetch a few rows
ratings = pl.scan_csv("../Data/ml-latest/ratings.csv")
ratings.fetch(5)

userId,movieId,rating,timestamp
i64,i64,f64,i64
1,307,3.5,1256677221
1,481,3.5,1256677456
1,1091,1.5,1256677471
1,1257,4.5,1256677460
1,1449,4.5,1256677264


Before we get into the user ratings, let's create some features for genres.<br>
Most movies are tagged with multiple genres separated by |. Let's change that.

In [4]:
polars = polars.filter(pl.col("genres") != "(no genres listed)") # filter out movies with no genres

# Testing with the most popular genres
movies = ["Action","Horror","Drama","Comedy","Documentary","Adventure","Fantasy","Children","Sci-Fi","Romance","Mystery","Animation","Thriller"]

# DataFrame.with_columns is a powerful way to create or transform columns
# here we run str.contains() on the genres column once per genre and store each result in a new boolean column

for i in movies:
    polars = polars.with_columns([
            pl.col('genres').str.contains(i).alias(i)
    ])

polars = polars.select([pl.exclude('genres')]) # remove old 'genres' column

# if this works we can go on and add more columns for all genres

polars.head(10)

movieId,Action,Horror,Drama,Comedy,Documentary,Adventure,Fantasy,Children,Sci-Fi,Romance,Mystery,Animation,Thriller
i32,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool
1,False,False,False,True,False,True,True,True,False,False,False,True,False
2,False,False,False,False,False,True,True,True,False,False,False,False,False
3,False,False,False,True,False,False,False,False,False,True,False,False,False
4,False,False,True,True,False,False,False,False,False,True,False,False,False
5,False,False,False,True,False,False,False,False,False,False,False,False,False
6,True,False,False,False,False,False,False,False,False,False,False,False,True
7,False,False,False,True,False,False,False,False,False,True,False,False,False
8,False,False,False,False,False,True,False,True,False,False,False,False,False
9,True,False,False,False,False,False,False,False,False,False,False,False,False
10,True,False,False,False,False,True,False,False,False,False,False,False,True


Now we have genres as boolean values, which should make it easier to find patterns. Probably? Anyway, let's add the ratings.

In [5]:
ratings=pl.read_csv('../Data/ml-latest/ratings.csv',
    columns=['movieId', 'userId','rating'],
    dtypes={
        'movieId':pl.Int32,
        'userId':pl.Int32,
        'rating':pl.Float32,
        }
    )

In [6]:
best_movies = ratings.filter(pl.col("rating") >= 4.5) # keep only ratings of 4.5 and 5.0
worst_movies = ratings.filter(pl.col("rating") < 4.5) # keep only ratings below 4.5; this will probably not be used at all

Now we join the two dataframes.

In [7]:
df_inner_join = polars.join(best_movies, on='movieId', how='inner')

In [8]:
df_inner_join.sort(by='movieId')

movieId,Action,Horror,Drama,Comedy,Documentary,Adventure,Fantasy,Children,Sci-Fi,Romance,Mystery,Animation,Thriller,userId,rating
i32,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,i32,f32
1,false,false,false,true,false,true,true,true,false,false,false,true,false,10,5.0
1,false,false,false,true,false,true,true,true,false,false,false,true,false,14,4.5
1,false,false,false,true,false,true,true,true,false,false,false,true,false,27,5.0
1,false,false,false,true,false,true,true,true,false,false,false,true,false,31,5.0
1,false,false,false,true,false,true,true,true,false,false,false,true,false,32,4.5
1,false,false,false,true,false,true,true,true,false,false,false,true,false,38,5.0
1,false,false,false,true,false,true,true,true,false,false,false,true,false,43,5.0
1,false,false,false,true,false,true,true,true,false,false,false,true,false,55,5.0
1,false,false,false,true,false,true,true,true,false,false,false,true,false,74,5.0
1,false,false,false,true,false,true,true,true,false,false,false,true,false,79,5.0


In [9]:
df_inner_join.null_count() # let's check if everything seems ok

movieId,Action,Horror,Drama,Comedy,Documentary,Adventure,Fantasy,Children,Sci-Fi,Romance,Mystery,Animation,Thriller,userId,rating
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


No nulls. Great!

My plan is to have a dataframe with only high ratings. When a new user submits movies they like, they will be matched with the other users in the dataframe who rated the same movies highly.

In [10]:
df_inner_join.estimated_size("mb")

83.69224643707275

83 MB is still kinda heavy. But let's try it. Now is also a good time to map the movie titles to movieId.

In [11]:
titles=pl.read_csv('../Data/ml-latest/movies.csv',
    columns=['movieId','title'],
    dtypes={
        'movieId':pl.Int32,
        'title':pl.Utf8,
        }
    )

In [12]:
with_titles = df_inner_join.join(titles, on='movieId', how='inner')

In [13]:
with_titles.head(3)

movieId,Action,Horror,Drama,Comedy,Documentary,Adventure,Fantasy,Children,Sci-Fi,Romance,Mystery,Animation,Thriller,userId,rating,title
i32,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,i32,f32,str
1257,False,False,False,True,False,False,False,False,False,True,False,False,False,1,4.5,"""Better Off Dea..."
1449,False,False,False,True,False,False,False,False,False,False,False,False,False,1,4.5,"""Waiting for Gu..."
2134,False,False,False,True,False,False,True,False,True,False,False,False,False,1,4.5,"""Weird Science ..."


In [14]:
# I wonder how big of a size difference the title string column made

with_titles.estimated_size("mb")

293.1260347366333

YIKES

It's apparently not the best idea to join the titles like this: every title string gets repeated once per rating row. It's probably better to stick with movieId and map movieId to the title string only when needed.

In [15]:
df_inner_join.estimated_size("mb")

83.69224643707275

In [16]:
titles.head()

movieId,title
i32,str
1,"""Toy Story (199..."
2,"""Jumanji (1995)..."
3,"""Grumpier Old M..."
4,"""Waiting to Exh..."
5,"""Father of the ..."


## Find movie by string search query

User: I like movie "12 Angry Men" / user_rated(movieId=1203, rating=4.5)

In [17]:
user_input = "12 Angry Men"

In [18]:
# This scans the entire movies.csv to get movieId(s) from string

find_movieId = (
    pl.scan_csv('../Data/ml-latest/movies.csv')
    .filter(pl.col('title').str.contains(user_input))
    .select(['movieId'])
)

find_movieId.collect()#.to_pandas() # if we want to convert it to pandas

movieId
i64
1203
77846


TODO: If multiple movies are found in search query, there should be a function which either:
- Lets user select movie from list

OR

- Automatically selects the most popular one

In [19]:
find_movieId.collect().shape

(2, 1)

In [20]:
matches = find_movieId.collect() # collect once instead of re-scanning the CSV for every check
find_movie_Id = matches['movieId'][0] # if several titles match, we take the first one for now

## Get genre of movie

In [21]:
# We search the previously created dataframe (df_inner_join) to check the genre. This is very redundant.
find_row = df_inner_join.filter(pl.col('movieId') == find_movie_Id)
find_row = find_row.head(1)
find_row

movieId,Action,Horror,Drama,Comedy,Documentary,Adventure,Fantasy,Children,Sci-Fi,Romance,Mystery,Animation,Thriller,userId,rating
i32,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,i32,f32
1203,False,False,True,False,False,False,False,False,False,False,False,False,False,15,5.0


## Find other users who rated movie in search query high

In [22]:
# Find users who rated the searched movieId high
movie_ID = find_movieId.collect()['movieId'][0] # first match, as a plain int rather than a one-row DataFrame

# This scans the entire ratings.csv to get userId(s) from the movieId search,
# keeping only users who rated the movie 5 stars

find_users = (
    pl.scan_csv('../Data/ml-latest/ratings.csv')
    .filter(pl.col('movieId') == movie_ID)
    .filter(pl.col("rating") >= 5) # This should be adjustable depending on results. If not many 5 star ratings are found, it should adjust to lower ratings
    .select(['userId'])
)

find_users.collect()[::400] # slicing because very many users were found (~6k+)

userId
i64
15
17748
35561
53292
71019
90770
109823
128838
147636
165072


In [23]:
# Same thing as above cell, but for the dataframe we merged with genre bools

find_users = (
    df_inner_join
    .filter(pl.col('movieId') == movie_ID)
    .filter(pl.col("rating") >= 5) # This should be adjustable depending on results. If not many 5 star ratings are found, it should adjust to lower ratings
    .select(['userId'])
)

find_users[::400] # collect() is not needed when we run it on a df like this

userId
i32
15
17748
35561
53292
71019
90770
109823
128838
147636
165072


## Find movies also rated high by users who rated query movie high

In [25]:
# Now we use find_users to find which other movies those users also rated the highest

# For proof of concept, let's just use a single userId who rated the input movie 5 stars

# TODO: fix this loop. find_top_movies is overwritten on every iteration, so only the last user's movies survive
similar_users = find_users

for i in range(similar_users.shape[0]):
    find_top_movies = (
        pl.scan_csv('../Data/ml-latest/ratings.csv')
        .filter(pl.col('userId') == similar_users[i])
        
        # Here we should implement the filter for genre to only get matching genres
        .filter(pl.col("rating") >= 5)
        .select(['movieId'])
    )
find_top_movies.collect()[::10] 

movieId
i64
111
428
802
1032
1198
1219
1270


In [32]:
# Same as above but filters df_inner_join instead of scanning ratings.csv

# Just one user as of now
similar_users = find_users['userId'][0] # first userId as a plain int

find_top_movies = (
    df_inner_join
    .filter(pl.col('userId') == similar_users)
    
    # Here we should implement the filter for genre to only get matching genres
    .filter(pl.col('Drama')) # TODO: This should be set by an expression matching get_genre
    
    .filter(pl.col("rating") >= 5)
    .select(['movieId'])
)
find_top_movies.head()

movieId
i32
296
356
778
1203
1206


In [38]:
# Find the corresponding movie title (reverse lookup from movieId)

movies_found = find_top_movies.head()

find_movie = (
    pl.scan_csv('../Data/ml-latest/movies.csv')
    .filter(pl.col('movieId') == movies_found['movieId'][0])
    .select(['title'])
)

find_movie.collect()[0]

title
str
"""Pulp Fiction (..."


In [39]:
print(f'Similar users have enjoyed {find_movie.collect()[0].item()}. Watch it?')

Similar users have enjoyed Pulp Fiction (1994). Watch it?


I mean, while it's not wrong to recommend Pulp Fiction in this scenario, it seems like it will be a very common recommendation if the only input from the user is 'I like 12 Angry Men'.

But we only got recommendation from one user who liked 12 Angry Men. Let's see if we can fix that loop and get multiple users' ratings.

Also: it can recommend the query movie itself (1203 shows up in the results above), but that should be easy to fix.

## Echo chamber

There seems to be an issue with the most popular movies being recommended over and over again.

This happens when algorithms and personalized recommendations on social media platforms and other websites show users content that aligns with their past behavior or preferences, which reinforces their existing beliefs and interests. As a result, popular or trending content may be repeatedly recommended to users who are already interested in that topic, while less popular or more diverse content may be overlooked. This can contribute to a lack of exposure to new ideas and perspectives, and limit opportunities for discovery and growth.
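
One common way to soften this, sketched here with made-up toy numbers and not something this notebook implements, is to rank candidates by the share of their fans who are similar users instead of by raw counts. The names `co_likes`, `total_likes`, `popular_hit` and `niche_drama` are hypothetical.

```python
# Made-up toy counts, purely for illustration.
# co_likes: how many similar users gave the movie 5 stars.
# total_likes: how many users overall gave the movie 5 stars.
co_likes = {"popular_hit": 900, "niche_drama": 60}
total_likes = {"popular_hit": 30000, "niche_drama": 120}

# Ranking by raw count favors whatever is popular with everyone.
by_raw_count = sorted(co_likes, key=co_likes.get, reverse=True)

# Ranking by share favors movies that are specifically liked by similar users.
share = {movie: co_likes[movie] / total_likes[movie] for movie in co_likes}
by_share = sorted(share, key=share.get, reverse=True)

print(by_raw_count)
print(by_share)
```

With very small totals the share gets noisy, so in practice it would probably need a minimum number of ratings before a movie is considered.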