# Content-Based Recommendation System

This section builds a content-based recommendation system that suggests similar movies based on their metadata (e.g., genres, keywords, overview). We'll use cosine similarity to measure the similarity between movies, following these steps:
1.	Combine the features that you believe contribute to the similarity between movies. For simplicity, let's use genres, keywords, and overview.
2.	For text features, preprocess the data by converting to lowercase, removing stopwords, and applying stemming or lemmatization if necessary.
3.	Use TfidfVectorizer from sklearn.feature_extraction.text to convert the text data into a matrix of TF-IDF features.
4.	Calculate the cosine similarity between all movies based on the TF-IDF matrix. This will give us a similarity matrix where each element represents the similarity score between a pair of movies.
5.	Write a function that takes a movie title as input, finds that movie's index in the dataset, and returns a list of movies sorted by their similarity score in descending order.
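
The steps above can be sketched end to end on a tiny corpus. This is a minimal illustration, not the notebook's actual data: the three titles, their tag strings, and the helper `most_similar` are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy "tags" strings standing in for combined movie metadata
toy_tags = {
    "Space Battle": "action science fiction space war alien",
    "Star Quest": "adventure science fiction space alien planet",
    "Love in Paris": "romance drama paris love",
}
titles = list(toy_tags)

# Step 3: turn each tags string into a TF-IDF vector
toy_matrix = TfidfVectorizer().fit_transform(list(toy_tags.values()))

# Step 4: pairwise cosine similarity between all movies
toy_sim = cosine_similarity(toy_matrix)

def most_similar(title):
    """Step 5: return the other titles ranked by similarity to `title`."""
    idx = titles.index(title)
    order = toy_sim[idx].argsort()[::-1]  # highest score first
    return [titles[i] for i in order if i != idx]

print(most_similar("Space Battle"))  # ['Star Quest', 'Love in Paris']
```

"Star Quest" ranks first because it shares four terms with "Space Battle", while "Love in Paris" shares none and scores zero.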

In [260]:
import pandas as pd
import ast

In [261]:
df=pd.read_csv("tmdb_5000_movies.csv")

In [262]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [263]:
df=df[['original_title','genres','keywords','overview']]
df=df.rename(columns={'original_title':'title'})
df.head()

Unnamed: 0,title,genres,keywords,overview
0,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di..."
1,Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha..."
2,Spectre,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",A cryptic message from Bond’s past sends him o...
3,The Dark Knight Rises,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Following the death of District Attorney Harve...
4,John Carter,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","John Carter is a war-weary, former military ca..."


In [264]:
def convert(text):
    # Parse the JSON-like string into a list of dicts and keep only each 'name'
    return [item['name'] for item in ast.literal_eval(text)]

In [265]:
df['genres']=df['genres'].apply(convert)
df['keywords']=df['keywords'].apply(convert)
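
To make the conversion concrete, here is what happens to a single cell. The sample string is a shortened, hand-written value in the same format as the `genres` column. Since the cells are valid JSON, `json.loads` would likely work as well, which the sketch checks for this one sample only.

```python
import ast
import json

# A shortened sample in the same format as the 'genres' column
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval safely evaluates the string into a list of dicts
parsed = ast.literal_eval(raw)
names = [entry["name"] for entry in parsed]
print(names)  # ['Action', 'Adventure']

# For this sample, the JSON parser yields the same structure
print(json.loads(raw) == parsed)  # True
```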

In [266]:
df.head()

Unnamed: 0,title,genres,keywords,overview
0,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di..."
1,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha..."
2,Spectre,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...",A cryptic message from Bond’s past sends him o...
3,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...",Following the death of District Attorney Harve...
4,John Carter,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","John Carter is a war-weary, former military ca..."


In [267]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def combine_tags(row):
    return row['genres']+row['keywords']

def remove_stopwords(text):
    # Missing overviews (NaN) become an empty string rather than the literal 'nan'
    if pd.isna(text):
        return ''
    words = str(text).split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)
    



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\uzair\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [268]:
df['tags']=df.apply(combine_tags, axis=1)

In [269]:
df['overview']=df['overview'].apply(remove_stopwords)
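
As an alternative to a separate stopword pass, `TfidfVectorizer` can lowercase and drop stop words itself. The sketch below uses two made-up sentences. Note that `stop_words='english'` uses scikit-learn's built-in list, which is not identical to the NLTK list used above, so results on the real data may differ slightly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up overview-like sentences
sample_docs = [
    "In the 22nd century, a paraplegic Marine is dispatched to the moon",
    "The Marine is a Captain of the ship",
]

# lowercase=True is the default; stop_words='english' selects scikit-learn's built-in list
vec = TfidfVectorizer(stop_words="english")
vec.fit(sample_docs)
vocab = set(vec.vocabulary_)

print("the" in vocab)     # False: removed as a stop word
print("marine" in vocab)  # True: kept, and lowercased
```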

In [270]:
df.head()

Unnamed: 0,title,genres,keywords,overview,tags
0,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","22nd century, paraplegic Marine dispatched moo...","[Action, Adventure, Fantasy, Science Fiction, ..."
1,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed dead, come bac...","[Adventure, Fantasy, Action, ocean, drug abuse..."
2,Spectre,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...",cryptic message Bond’s past sends trail uncove...,"[Action, Adventure, Crime, spy, based on novel..."
3,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","Following death District Attorney Harvey Dent,...","[Action, Crime, Drama, Thriller, dc comics, cr..."
4,John Carter,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","John Carter war-weary, former military captain...","[Action, Adventure, Science Fiction, based on ..."


In [271]:
df['tags']=df['tags'].apply(lambda x: ' '.join(x))

In [272]:
df['tags'].head()

0    Action Adventure Fantasy Science Fiction cultu...
1    Adventure Fantasy Action ocean drug abuse exot...
2    Action Adventure Crime spy based on novel secr...
3    Action Crime Drama Thriller dc comics crime fi...
4    Action Adventure Science Fiction based on nove...
Name: tags, dtype: object

In [273]:
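# Note: '+' inserts no space, so the last keyword fuses with the first overview word; df['tags'] + ' ' + df['overview'] avoids that
df['tags']=df['tags']+df['overview']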

In [274]:
df['tags'].head()

0    Action Adventure Fantasy Science Fiction cultu...
1    Adventure Fantasy Action ocean drug abuse exot...
2    Action Adventure Crime spy based on novel secr...
3    Action Crime Drama Thriller dc comics crime fi...
4    Action Adventure Science Fiction based on nove...
Name: tags, dtype: object

In [275]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=2000)

In [276]:
X = tfidf.fit_transform(df['tags'])

In [277]:
X

<4803x2000 sparse matrix of type '<class 'numpy.float64'>'
	with 124412 stored elements in Compressed Sparse Row format>

In [278]:
movie2idx = pd.Series(df.index, index=df['title'])
movie2idx

title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4803, dtype: int64
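
A title-to-index Series like this returns a single integer for a unique title but a whole Series when a title occurs more than once. The sketch below shows both cases with hypothetical titles, which is why a lookup function needs to handle the Series case.

```python
import pandas as pd

# Hypothetical titles; "Batman" appears twice, as remakes do in real catalogues
toy_titles = pd.Series(["Batman", "Heat", "Batman"])
toy_title2idx = pd.Series(toy_titles.index, index=toy_titles)

unique_hit = toy_title2idx["Heat"]        # a single integer position
duplicate_hit = toy_title2idx["Batman"]   # a Series holding both positions

print(int(unique_hit))                    # 1
print(list(duplicate_hit))                # [0, 2]
```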

In [279]:
from sklearn.metrics.pairwise import cosine_similarity


In [280]:
similarity = cosine_similarity(X, X)
similarity


array([[1.        , 0.03229036, 0.0113861 , ..., 0.01422906, 0.00611833,
        0.        ],
       [0.03229036, 1.        , 0.0325986 , ..., 0.03413101, 0.        ,
        0.00973103],
       [0.0113861 , 0.0325986 , 1.        , ..., 0.02015251, 0.        ,
        0.        ],
       ...,
       [0.01422906, 0.03413101, 0.02015251, ..., 1.        , 0.03871513,
        0.06228915],
       [0.00611833, 0.        , 0.        , ..., 0.03871513, 1.        ,
        0.04476233],
       [0.        , 0.00973103, 0.        , ..., 0.06228915, 0.04476233,
        1.        ]])
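
Cosine similarity is the dot product of two vectors divided by the product of their lengths. `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so for TF-IDF rows the plain dot product gives the same numbers. The sketch below checks this on three made-up documents. `linear_kernel` is shown only as a possible cheaper substitute, not as what the notebook uses.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three made-up documents
demo_docs = ["action space war", "space adventure alien", "romance drama"]
demo_X = TfidfVectorizer().fit_transform(demo_docs)  # rows have unit length by default

cos = cosine_similarity(demo_X)  # normalizes, then takes dot products
dot = linear_kernel(demo_X)      # dot products only

print(np.allclose(cos, dot))           # True
print(np.allclose(np.diag(cos), 1.0))  # True: each movie matches itself perfectly
```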

In [281]:
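def recommend(title):
    # Look up the row position of the title
    idx = movie2idx[title]
    # Duplicate titles return a Series; keep the first match
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    query = X[idx]
    # Similarity of the query movie to every movie, as a flat array
    scores = cosine_similarity(query, X).flatten()
    # Sort descending and skip position 0, which is normally the query movie itself
    recommended_idx = (-scores).argsort()[1:6]
    return list(zip(df['title'].iloc[recommended_idx], scores[recommended_idx]))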

In [282]:
recommend("The Godfather")

[('The Godfather: Part II', 0.3313878735253196),
 ('Summer of Sam', 0.2851162393117139),
 ('Snabba Cash', 0.27768537062480075),
 ('Blood Ties', 0.27099672344131376),
 ('Mambo Italiano', 0.256177762349392)]

In [283]:
recommend("The Dark Knight")

[('The Dark Knight Rises', 0.4355238583272938),
 ('Batman Begins', 0.42061517260508174),
 ('Batman v Superman: Dawn of Justice', 0.3425987451096608),
 ('Batman Forever', 0.3382246387949149),
 ('Sherlock Holmes: A Game of Shadows', 0.310962854170411)]
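
Skipping the first sorted position assumes the query movie always ranks first, which can fail when another movie has identical tags. A title missing from the index would also raise a `KeyError`. The following is a hedged sketch of a more defensive variant on toy data. `recommend_safe`, `toy_df`, and the other `toy_` names are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data: "A" and "B" deliberately share identical tags
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "tags": ["space war alien", "space war alien", "space romance", "cooking show"],
})
toy_X = TfidfVectorizer().fit_transform(toy_df["tags"])
toy_lookup = pd.Series(toy_df.index, index=toy_df["title"])

def recommend_safe(title, k=5):
    """Hypothetical variant: unknown titles return [], the query is dropped by index."""
    if title not in toy_lookup.index:
        return []
    idx = toy_lookup[title]
    if isinstance(idx, pd.Series):    # duplicate titles: keep the first match
        idx = idx.iloc[0]
    scores = cosine_similarity(toy_X[idx], toy_X).flatten()
    order = np.argsort(-scores, kind="stable")
    order = order[order != idx][:k]   # remove the query itself, wherever it ranks
    return list(zip(toy_df["title"].iloc[order], scores[order]))

print([title for title, _ in recommend_safe("B")])  # ['A', 'C', 'D']
print(recommend_safe("Not In Catalogue"))           # []
```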

# Predictive Modeling for Movie Success

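This section develops a machine learning model that predicts the success of a movie from features such as budget, director, cast, and a sentiment score of the movie's overview. The code below uses revenue as the target and budget, crew, cast, and popularity as features.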

 1.	Convert categorical variables (e.g., director, cast) into numeric values using techniques like one-hot encoding or feature hashing
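
For columns whose cells are lists of names, two encodings mentioned above can be sketched as follows. `MultiLabelBinarizer` is swapped in here as the list-valued analogue of one-hot encoding (one column per distinct name), and `FeatureHasher` illustrates feature hashing. The cast lists and the choice of `n_features=16` are assumptions for illustration only.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical cast lists, one list of names per movie
toy_cast = [
    ["Sam Worthington", "Zoe Saldana"],
    ["Johnny Depp", "Orlando Bloom"],
    ["Zoe Saldana", "Johnny Depp"],
]

# Multi-hot encoding: one column per distinct name, so width grows with the vocabulary
mlb = MultiLabelBinarizer()
multi_hot = mlb.fit_transform(toy_cast)
print(multi_hot.shape)  # (3, 4)

# Feature hashing: fixed width regardless of how many distinct names exist
hasher = FeatureHasher(n_features=16, input_type="string")
hashed = hasher.transform(toy_cast)
print(hashed.shape)     # (3, 16)
```

Hashing keeps the feature matrix small when there are thousands of distinct names, at the cost of occasional collisions and columns that are no longer interpretable.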

In [284]:
credits_df=pd.read_csv("tmdb_5000_credits.csv")
movies_df=pd.read_csv("tmdb_5000_movies.csv")
credits_df.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [285]:

credits_df=credits_df.rename(columns={'movie_id':'id'})
movies_df=movies_df.merge(credits_df,on='id')
movies_df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title_x,vote_average,vote_count,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
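
Because both files carry a `title` column, the merge produces `title_x` and `title_y`. One possible way to avoid the duplicate, sketched here on two hypothetical two-row frames, is to drop the redundant column before merging. The `validate` argument is an optional safety check and is not used in the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the two files
toy_movies = pd.DataFrame({"id": [1, 2], "title": ["Avatar", "Spectre"], "budget": [237, 245]})
toy_credits = pd.DataFrame({"movie_id": [1, 2], "title": ["Avatar", "Spectre"], "cast": ["[]", "[]"]})

# Drop the redundant title before merging so no _x/_y suffixes are created
toy_merged = toy_movies.merge(
    toy_credits.drop(columns="title").rename(columns={"movie_id": "id"}),
    on="id",
    validate="one_to_one",  # raises if an id is duplicated on either side
)
print(list(toy_merged.columns))  # ['id', 'title', 'budget', 'cast']
```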


In [286]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [287]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

In [288]:
selected_features = ['budget', 'crew', 'cast','popularity']
# Copy the selection so later column assignments modify X itself, not a view of movies_df
X = movies_df[selected_features].copy()
y = movies_df['revenue']

In [289]:
# Replace each JSON-like string with the list of names it contains
X['crew']=X['crew'].apply(convert)
X['cast']=X['cast'].apply(convert)


In [290]:
X.head()

Unnamed: 0,budget,crew,cast,popularity
0,237000000,"[Stephen E. Rivkin, Rick Carter, Christopher B...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",150.437577
1,300000000,"[Dariusz Wolski, Gore Verbinski, Jerry Bruckhe...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",139.082615
2,245000000,"[Thomas Newman, Sam Mendes, Anna Pinnock, John...","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",107.376788
3,250000000,"[Hans Zimmer, Charles Roven, Christopher Nolan...","[Christian Bale, Michael Caine, Gary Oldman, A...",112.31295
4,260000000,"[Andrew Stanton, Andrew Stanton, John Lasseter...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...",43.926995


Split the dataset into a training set and a test set so the model's performance can be evaluated on data it has not seen.

In [291]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
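
With `test_size=0.2`, one fifth of the rows are held out. For the 4803 rows here that should come to 961 test rows and 3842 training rows. Fixing `random_state` makes the split repeatable. A small sketch on made-up arrays (the `demo_` names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

demo_features = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features
demo_target = np.arange(10)

f_train, f_test, t_train, t_test = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=42
)
print(f_train.shape, f_test.shape)  # (8, 2) (2, 2)

# The same seed reproduces the same split
_, f_test_again, _, _ = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=42
)
print(np.array_equal(f_test, f_test_again))  # True
```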


Data Preprocessing

In [298]:
from sklearn.base import BaseEstimator, TransformerMixin

class ListToStringTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # ColumnTransformer passes a DataFrame slice here, not a Python list,
        # so join the names cell by cell; the encoder then sees plain strings.
        def join_names(names):
            return ', '.join(names) if isinstance(names, list) else names
        return pd.DataFrame(X).apply(lambda column: column.map(join_names))

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['budget', 'popularity']),  # Standardize numeric columns
        ('cast_crew', Pipeline(steps=[
            ('list_to_string', ListToStringTransformer()),  # Convert lists to strings
            # One-hot encode; ignore cast/crew strings that appear only in the test set
            ('one_hot_encoding', OneHotEncoder(handle_unknown='ignore'))
        ]), ['cast', 'crew']),  # Apply to 'cast' and 'crew' columns
    ],
    remainder='passthrough'
)

A regression model suits this task because the target is continuous (e.g., box office revenue, rating). Options include Linear Regression, Random Forest Regressor, and Gradient Boosting Regressor.

In [299]:
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor())
])

In [300]:
model.fit(X_train, y_train)
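
The imports above include `mean_squared_error` and three candidate regressors, but the notebook stops before any evaluation. Once a pipeline fits, the candidates could be compared on held-out data roughly as sketched below. The data here is synthetic (`make_regression`), not the TMDB features, and the hyperparameters are arbitrary assumptions, so the numbers say nothing about movie revenue.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): NOT the TMDB features
syn_X, syn_y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
sX_train, sX_test, sy_train, sy_test = train_test_split(
    syn_X, syn_y, test_size=0.2, random_state=42
)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

rmse = {}
for name, estimator in candidates.items():
    estimator.fit(sX_train, sy_train)
    predictions = estimator.predict(sX_test)
    # RMSE is expressed in the same units as the target
    rmse[name] = float(np.sqrt(mean_squared_error(sy_test, predictions)))

# Print the candidates from lowest to highest held-out error
for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {value:.2f}")
```

Taking the square root of the mean squared error keeps the metric in the target's own units, which is easier to interpret for a quantity like revenue.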