Tags
- Tags are user-generated metadata about movies.
- Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag are determined by each user.
- Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

Links
- movieId is an identifier for movies used by movielens.org.
- imdbId is an identifier for movies used by imdb.com.
- tmdbId is an identifier for movies used by themoviedb.org.

Movies
- Movie titles are entered manually or imported from themoviedb.org.
- Genres are a pipe-separated list, and are selected from the following:
  - Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western, (no genres listed)

Ratings
- Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).


movieId - the same id refers to the same movie across these four data files.

userId - user ids are consistent between `ratings.csv` and `tags.csv`.

## 1. Read data

In [1]:
import pandas as pd
import numpy as np
tags = pd.read_csv("https://github.com/tiagofassoni/useful-datasets/raw/main/ml-latest-small/tags.csv")
links = pd.read_csv("https://github.com/tiagofassoni/useful-datasets/raw/main/ml-latest-small/links.csv")
movies = pd.read_csv("https://github.com/tiagofassoni/useful-datasets/raw/main/ml-latest-small/movies.csv")
ratings = pd.read_csv("https://github.com/tiagofassoni/useful-datasets/raw/main/ml-latest-small/ratings.csv")

## Exploratory Analysis and Data Cleaning

## 1.1 Tags

In [2]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [3]:
print(f"""

Info
{tags.info()}

shape
{tags.shape}

Describe
{tags.describe()}

Missing Value
{tags.isna().sum()}

Duplicates
{tags.duplicated().sum()}

""")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


Info 
None

shape
(3683, 4)

Describe 
            userId        movieId     timestamp
count  3683.000000    3683.000000  3.683000e+03
mean    431.149335   27252.013576  1.320032e+09
std     158.472553   43490.558803  1.721025e+08
min       2.000000       1.000000  1.137179e+09
25%     424.000000    1262.500000  1.137521e+09
50%     474.000000    4454.000000  1.269833e+09
75%     477.000000   39263.000000  1.498457e+09
max     610.000000  193565.000000  1.537099e+09 

Missing Value
userId       0
movieId      0
tag          0
timestamp    0
dtype: int64 

Duplicates
0 
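
`DataFrame.info()` prints its report immediately and returns `None`, which is why the info table appears ahead of the formatted text above and the `Info` label is followed by `None`. A minimal sketch of one way around this, assuming a hypothetical helper named `summarize` (not part of the original notebook) and a tiny hand-made frame in place of the real data:

```python
import io
import pandas as pd

def summarize(df: pd.DataFrame) -> str:
    """Return info, shape, describe, missing values and duplicate count as one string."""
    buffer = io.StringIO()
    df.info(buf=buffer)  # write the info report into a buffer instead of printing it
    return "\n".join([
        "Info", buffer.getvalue(),
        "Shape", str(df.shape), "",
        "Describe", df.describe().to_string(), "",
        "Missing values", df.isna().sum().to_string(), "",
        "Duplicates", str(df.duplicated().sum()),
    ])

# Tiny hand-made frame standing in for tags, links, movies or ratings
demo = pd.DataFrame({"userId": [2, 2, 7], "tag": ["funny", "funny", None]})
report = summarize(demo)
print(report)
```

The same helper could be called as `print(summarize(tags))`, `print(summarize(links))` and so on, in place of the repeated f-string cells.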




In [4]:
# Changing the timestamp's data type from int to datetime (the values are seconds since the epoch)
tags["timestamp"] = pd.to_datetime(tags["timestamp"], unit="s")

tags["timestamp"] = tags.timestamp.dt.strftime("%Y-%m-%d %H:%M:%S")  # formatting as a string
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,2015-10-24 19:29:54
1,2,60756,Highly quotable,2015-10-24 19:29:56
2,2,60756,will ferrell,2015-10-24 19:29:52
3,2,89774,Boxing story,2015-10-24 19:33:27
4,2,89774,MMA,2015-10-24 19:33:20
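
MovieLens timestamps are seconds since the Unix epoch, whereas `pd.to_datetime` treats bare integers as nanoseconds unless a unit is given. A minimal sketch of the difference, using the first three raw values shown by `tags.head()` above (the variable names are illustrative only):

```python
import pandas as pd

# First three raw timestamps from tags.head() (seconds since the epoch)
raw = pd.Series([1445714994, 1445714996, 1445714992])

as_nanoseconds = pd.to_datetime(raw)        # default unit: nanoseconds
as_seconds = pd.to_datetime(raw, unit="s")  # intended unit: seconds

print(as_nanoseconds.iloc[0])  # lands about 1.4 s after 1970-01-01
print(as_seconds.iloc[0])      # lands in October 2015
```

Passing `unit="s"` is what turns the values into real calendar dates; without it every row collapses onto 1 January 1970.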


In [5]:
tags.groupby("userId").count()

Unnamed: 0_level_0,movieId,tag,timestamp
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2,9,9,9
7,1,1,1
18,16,16,16
21,4,4,4
49,3,3,3
62,370,370,370
63,2,2,2
76,2,2,2
103,5,5,5
106,2,2,2


## 1.2 Links

In [6]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [7]:
print(f"""

Info
{links.info()}

shape
{links.shape}

Describe
{links.describe()}

Missing Value
{links.isna().sum()}

Duplicates
{links.duplicated().sum()}

""")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


Info 
None

shape
(9742, 3)

Describe 
             movieId        imdbId         tmdbId
count    9742.000000  9.742000e+03    9734.000000
mean    42200.353623  6.771839e+05   55162.123793
std     52160.494854  1.107228e+06   93653.481487
min         1.000000  4.170000e+02       2.000000
25%      3248.250000  9.518075e+04    9665.500000
50%      7300.000000  1.672605e+05   16529.000000
75%     76232.000000  8.055685e+05   44205.750000
max    193609.000000  8.391976e+06  525662.000000 

Missing Value
movieId    0
imdbId     0
tmdbId     8
dtype: int64 

Duplicates
0 




In [8]:
# Removing the 8 rows with a missing tmdbId
links = links.loc[~links.tmdbId.isna(), :]
links.isna().sum()

movieId    0
imdbId     0
tmdbId     0
dtype: int64

## 1.3 Movies



In [9]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
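
Because `genres` is a pipe-separated string, counting or filtering by genre is easier after splitting it into one row per genre. A small sketch on hand-typed rows copied from the output above; the names `sample`, `long_genres` and `genre_counts` are illustrative only:

```python
import pandas as pd

# Hand-typed sample mirroring the first rows of movies.csv
sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance"],
})

# One row per (movie, genre) pair
long_genres = sample.assign(genre=sample["genres"].str.split("|")).explode("genre")

# How many of the sample movies carry each genre
genre_counts = long_genres["genre"].value_counts()
print(genre_counts)
```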


In [10]:
movies.title.iloc[700:710]

700                      Meet Me in St. Louis (1944)
701                         Wizard of Oz, The (1939)
702                        Gone with the Wind (1939)
703                          My Favorite Year (1982)
704    Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)
705                              Citizen Kane (1941)
706                     2001: A Space Odyssey (1968)
707                             All About Eve (1950)
708                                Women, The (1939)
709                                   Rebecca (1940)
Name: title, dtype: object

In [11]:
print(f"""

Info
{movies.info()}

shape
{movies.shape}

Describe
{movies.describe()}

Missing Value
{movies.isna().sum()}

Duplicates
{movies.duplicated().sum()}

""")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


Info 
None

shape
(9742, 3)

Describe 
             movieId
count    9742.000000
mean    42200.353623
std     52160.494854
min         1.000000
25%      3248.250000
50%      7300.000000
75%     76232.000000
max    193609.000000 

Missing Value
movieId    0
title      0
genres     0
dtype: int64 

Duplicates
0 




In [12]:
# Extracting the year (without the brackets) into a new column
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)', expand=False)

# Removing the year from the title column; regex=True is needed because recent pandas versions default to a literal match
movies['title'] = movies['title'].str.replace(r'\s*\(\d{4}\)', '', regex=True).str.strip()



In [13]:
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995


In [14]:
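# 13 titles yield no year, so the cast to int fails and errors='ignore' leaves the column as object
movies["year"]= movies["year"].astype('int', errors='ignore')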
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
 3   year     9729 non-null   object
dtypes: int64(1), object(3)
memory usage: 304.6+ KB


In [15]:
#movies["year"] = pd.to_numeric(movies["year"]).astype('int', errors='ignore')
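
The `astype('int', errors='ignore')` call leaves `year` as `object` because some titles yield no year and a missing value cannot be stored in a plain integer column. One possible alternative, shown here on hand-made titles rather than the real data, is pandas' nullable `Int64` dtype:

```python
import pandas as pd

# Hand-made titles; the last one has no "(YYYY)" suffix
titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)", "Untitled Project"])

# Strings where a year is found, missing otherwise
year_text = titles.str.extract(r"\((\d{4})\)", expand=False)

# Convert to numbers, then to the nullable integer dtype that tolerates missing values
year = pd.to_numeric(year_text, errors="coerce").astype("Int64")

print(year.dtype)
print(year.isna().sum())
```

With this dtype the column supports numeric comparisons such as `year >= 1990` while the rows without a year stay missing.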

## 1.4 Ratings

In [16]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [17]:
print(f"""

Info
{ratings.info()}

shape
{ratings.shape}

Describe
{ratings.describe()}

Missing Value
{ratings.isna().sum()}

Duplicates
{ratings.duplicated().sum()}

""")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


Info 
None

shape
(100836, 4)

Describe 
              userId        movieId         rating     timestamp
count  100836.000000  100836.000000  100836.000000  1.008360e+05
mean      326.127564   19435.295718       3.501557  1.205946e+09
std       182.618491   35530.987199       1.042529  2.162610e+08
min         1.000000       1.000000       0.500000  8.281246e+08
25%       177.000000    1199.000000       3.000000  1.019124e+09
50%       325.000000    2991.000000       3.500000  1.186087e+09
75%       477.000000    8122.000000       4.000000  1.435994e+09
max       610.000000  193609.00

In [18]:
# Changing the timestamp's data type from int to datetime (the values are seconds since the epoch)
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit="s")
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,2000-07-30 18:45:03
1,1,3,4.0,2000-07-30 18:20:47
2,1,6,4.0,2000-07-30 18:37:04
3,1,47,5.0,2000-07-30 19:03:35
4,1,50,5.0,2000-07-30 18:48:51
...,...,...,...,...
100831,610,166534,4.0,2017-05-03 21:53:22
100832,610,168248,5.0,2017-05-03 22:21:31
100833,610,168250,5.0,2017-05-08 19:50:47
100834,610,168252,5.0,2017-05-03 21:19:12


In [19]:
ratings["timestamp"]= ratings.timestamp.dt.strftime("%Y-%m-%d %H:%M:%S") # formatting as a string

In [20]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,2000-07-30 18:45:03
1,1,3,4.0,2000-07-30 18:20:47
2,1,6,4.0,2000-07-30 18:37:04
3,1,47,5.0,2000-07-30 19:03:35
4,1,50,5.0,2000-07-30 18:48:51
...,...,...,...,...
100831,610,166534,4.0,2017-05-03 21:53:22
100832,610,168248,5.0,2017-05-03 22:21:31
100833,610,168250,5.0,2017-05-08 19:50:47
100834,610,168252,5.0,2017-05-03 21:19:12


## Popularity Ranking - Recommending Top Rated Movies
This function takes an integer n (how many movies to display) and returns the top n movies. A 5-star movie with only one rating should not be recommended, so the score multiplies the mean rating by the number of ratings.

In [21]:
def top_movies():
    n = int(input("How many top movies do you wish to display: "))
    # Mean rating and number of ratings per movie
    popular = ratings.groupby('movieId')['rating'].agg(["mean", "count"])
    # Weight the mean by the rating count so that rarely rated movies do not reach the top
    popular['overall_rating'] = popular['mean'] * popular['count'] * 0.1
    # Attach title, genres and year
    ranked = popular.merge(movies, how='left', on='movieId')
    return ranked.sort_values(by='overall_rating', ascending=False).head(n)

top_movies()

How many top movies do you wish to display: 6


Unnamed: 0,movieId,mean,count,overall_rating,title,genres,year
277,318,4.429022,317,140.4,"Shawshank Redemption, The",Crime|Drama,1994
314,356,4.164134,329,137.0,Forrest Gump,Comedy|Drama|Romance|War,1994
257,296,4.197068,307,128.85,Pulp Fiction,Comedy|Crime|Drama|Thriller,1994
1938,2571,4.192446,278,116.55,"Matrix, The",Action|Sci-Fi|Thriller,1999
510,593,4.16129,279,116.1,"Silence of the Lambs, The",Crime|Horror|Thriller,1991
224,260,4.231076,251,106.2,Star Wars: Episode IV - A New Hope,Action|Adventure|Sci-Fi,1977


## Item-based Collaborative Filtering
This function takes a movie id and a number n, and returns the n movies most similar to the given one.

In [22]:
def top_similar_movies():
    movie_id = int(input("Enter a movie ID : "))
    n = int(input("Enter no of similar rated movies to be displayed : "))

    # Users x movies matrix of ratings (NaN where a user has not rated a movie)
    ratings_crosstab = pd.pivot_table(data=ratings, values='rating', index='userId', columns='movieId')

    # Ratings of the chosen movie, excluding users who have not rated it
    movie_ratings = ratings_crosstab[movie_id].dropna()

    # Pearson correlation of every movie with the chosen one
    similar_to_movie = ratings_crosstab.corrwith(movie_ratings)
    corr_movie = pd.DataFrame(similar_to_movie, columns=['PearsonR'])
    corr_movie.dropna(inplace=True)

    # Number of ratings per movie
    rating_count = ratings.groupby('movieId')['rating'].count().rename('rating_count')

    movie_corr_summary = corr_movie.join(rating_count)
    movie_corr_summary.drop(movie_id, inplace=True)

    # Keep only movies with at least 150 ratings
    top_n = movie_corr_summary[movie_corr_summary['rating_count']>=150].sort_values('PearsonR', ascending=False).head(n)

    return top_n.merge(movies, left_index=True, right_on="movieId")

top_similar_movies()

Enter a movie ID : 8
Enter no of similar rated movies to be displayed : 7




Unnamed: 0,PearsonR,rating_count,movieId,title,genres,year
0,0.968246,215,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
123,0.968246,201,150,Apollo 13,Adventure|Drama|IMAX,1995
3141,0.944911,159,4226,Memento,Mystery|Thriller,2000
546,0.944911,162,648,Mission: Impossible,Action|Adventure|Mystery|Thriller,1996
898,0.866025,211,1196,Star Wars: Episode V - The Empire Strikes Back,Action|Adventure|Sci-Fi,1980
509,0.850963,189,592,Batman,Action|Crime|Thriller,1989
325,0.790569,157,367,"Mask, The",Action|Comedy|Crime|Fantasy,1994
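
`corrwith` computes each correlation only over the users who rated both movies, while the `rating_count >= 150` filter counts all ratings of the candidate movie. A high `PearsonR` can therefore still rest on a handful of shared users. The sketch below, on a hand-made matrix with the assumed names `target_id` and `n_common`, counts the shared raters so that they can serve as an extra filter:

```python
import numpy as np
import pandas as pd

# Hand-made user x movie matrix (NaN = not rated); column labels are movie ids
pivot = pd.DataFrame(
    {
        8:   [4.0, 3.0, np.nan, 2.0, np.nan],
        1:   [5.0, 4.0, 3.0, 3.0, 4.0],
        150: [np.nan, 2.0, 4.0, 1.0, 5.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)
target_id = 8

corr = pivot.corrwith(pivot[target_id])

# Number of users who rated both the target movie and each other movie
rated = pivot.notna().astype(int)
overlap = rated.T.dot(rated[target_id])

summary = pd.DataFrame({"PearsonR": corr, "n_common": overlap}).drop(target_id)
filtered = summary[summary["n_common"] >= 3]  # 3 is an arbitrary threshold for this toy data
print(filtered)
```

In this toy matrix movie 150 correlates perfectly with the target, but only two users rated both, so the overlap filter removes it.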


## User-based Collaborative Filtering
This function takes a user's userId and a number n, and returns the n most recommended movies based on the cosine similarity between that user and the other users.

In [23]:
from sklearn.metrics.pairwise import cosine_similarity

def user_based_CF(user_id, n):
    users_items = pd.pivot_table(data=ratings,
                                 values='rating',
                                 index='userId',
                                 columns='movieId')

    # Cosine similarity can't be computed with NaNs, so replace them with zero
    users_items.fillna(0, inplace=True)

    # Compute cosine similarities between all pairs of users
    user_similarities = pd.DataFrame(cosine_similarity(users_items),
                                     columns=users_items.index,
                                     index=users_items.index)

    # Similarities to every other user (the user itself is excluded), normalised to sum to 1
    other_users = user_similarities.query("userId != @user_id")[user_id]
    weights = other_users / other_users.sum()

    # Movies the user has not rated, as rated by everyone else
    not_seen_movies = users_items.loc[users_items.index != user_id, users_items.loc[user_id, :] == 0]
    weighted_averages = pd.DataFrame(not_seen_movies.T.dot(weights), columns=["predicted_rating"])

    # Find the top n movies from the rating predictions
    recommendations = weighted_averages.merge(movies, left_index=True, right_on="movieId")
    return recommendations.sort_values("predicted_rating", ascending=False).head(n)

user_based_CF(1,5)



Unnamed: 0,predicted_rating,movieId,title,genres,year
277,2.654727,318,"Shawshank Redemption, The",Crime|Drama,1994
507,2.087327,589,Terminator 2: Judgment Day,Action|Sci-Fi,1991
659,1.859548,858,"Godfather, The",Crime|Drama,1972
2078,1.663564,2762,"Sixth Sense, The",Drama|Horror|Mystery,1999
3638,1.62482,4993,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy,2001
