# Movie Recommendation System Using TF-IDF & Random Forest

This project is a movie recommendation system that suggests films based on a user's preferences. It combines content-based features (genre, director, and cast) with a Random Forest classifier to generate personalised movie recommendations. It is built in Python and trained on IMDb's Top 1000 Movies dataset.

#### Step 1: Import Required Libraries
We'll use `pandas` and `numpy` for data processing, and `scikit-learn` (`sklearn`) for feature extraction and modeling.

In [15]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


#### Step 2: Load Movie Dataset
We use IMDb's Top 1000 Movies dataset as the source for recommendations.

In [16]:
movies = pd.read_csv("../Data/imdb_top_1000.csv")
movies.head()

| | Poster_Link | Title | Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |


#### Step 3: Clean and Preprocess the Data
We split the `Genre` column into up to three separate columns, drop the columns we don't need, and rename the remaining ones for simplicity.

In [17]:
# Convert year to int and split genre
movies['Year'] = movies['Year'].astype('int')
movies = movies.join(movies['Genre'].str.split(', ', expand=True)).drop([
    'Genre', 'Poster_Link', 'IMDB_Rating', 'Certificate', 'Overview',
    'Meta_score', 'No_of_Votes', 'Gross'
], axis=1)

# Rename columns
movies = movies.set_axis([
    'Title', 'Year', 'Length','Director','Star1','Star2',
    'Star3','Star4','Genre1','Genre2','Genre3'
], axis=1)

movies.head()

| | Title | Year | Length | Director | Star1 | Star2 | Star3 | Star4 | Genre1 | Genre2 | Genre3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | The Shawshank Redemption | 1994 | 142 min | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | Drama | | |
| 1 | The Godfather | 1972 | 175 min | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | Crime | Drama | |
| 2 | The Dark Knight | 2008 | 152 min | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | Action | Crime | Drama |
| 3 | The Godfather: Part II | 1974 | 202 min | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | Crime | Drama | |
| 4 | 12 Angry Men | 1957 | 96 min | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | Crime | Drama | |
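
The genre split above can be tried in isolation. The following is a minimal sketch on a three-row toy frame (titles and genres copied from the rows shown above). It names the new columns explicitly instead of relying on column order; the `toy` and `genre_cols` names are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Title": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
    "Genre": ["Drama", "Crime, Drama", "Action, Crime, Drama"],
})

# expand=True returns one column per genre position; shorter lists are padded with missing values
genre_cols = toy["Genre"].str.split(", ", expand=True)
genre_cols.columns = [f"Genre{i + 1}" for i in range(genre_cols.shape[1])]

# Replace the original comma-separated column with the split columns
toy = toy.drop(columns="Genre").join(genre_cols)
print(toy)
```

Naming the split columns before joining means a later change to the list of dropped columns cannot silently shift the names assigned by `set_axis`.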


#### Step 4: User Input
Select your favorite movies from the first 100 titles in the dataset, listed alphabetically.

In [None]:
# Display top 100 movie titles for user selection
movie_titles = np.sort(movies['Title'].head(100).dropna().unique().tolist())

# Print a numbered list so the user can pick titles by number
print("Top 100 Movies:\n")
for idx, title in enumerate(movie_titles, 1):
    print(f"{idx}. {title}")

raw_selection = input("\nEnter the numbers of your favorite movies separated by commas (e.g., 1, 5, 10): ")
selected_indexes = [int(i.strip())-1 for i in raw_selection.split(",") if i.strip().isdigit()]
# Skip out-of-range numbers (0 would otherwise wrap around to the last title)
selected_titles = [movie_titles[i] for i in selected_indexes if 0 <= i < len(movie_titles)]

# Display selections
if selected_titles:
    print("\nYou selected the following movies:")
    for title in selected_titles:
        print(f"- {title}")
else:
    print("No movies selected.")

Top 100 Movies:

1. 12 Angry Men
2. 1917
3. 3 Idiots
4. Alien
5. American Beauty
6. American History X
7. Amélie
8. Anand
9. Andhadhun
10. Apocalypse Now
11. Avengers: Endgame
12. Avengers: Infinity War
13. Ayla: The Daughter of War
14. Babam ve Oglum
15. Back to the Future
16. Capharnaüm
17. Casablanca
18. Cidade de Deus
19. City Lights
20. Coco
21. Dangal
22. Django Unchained
23. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
24. Drishyam
25. Eternal Sunshine of the Spotless Mind
26. Fight Club
27. Forrest Gump
28. Gisaengchung
29. Gladiator
30. Good Will Hunting
31. Goodfellas
32. Hamilton
33. Hotaru no haka
34. Il buono, il brutto, il cattivo
35. Incendies
36. Inception
37. Inglourious Basterds
38. Interstellar
39. It's a Wonderful Life
40. Jagten
41. Jodaeiye Nader az Simin
42. Joker
43. Kimi no na wa.
44. La vita è bella
45. Léon
46. Memento
47. Miracle in cell NO.7
48. Modern Times
49. Mononoke-hime
50. Nuovo Cinema Paradiso
51. Oldeuboi
52. Once Upon a Tim
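
The input parsing in this step can also be factored into a small function, which makes it testable without calling `input()`. This is a sketch only: `parse_selection` is a hypothetical helper, not part of the notebook, and it assumes that duplicates and out-of-range numbers should be silently ignored.

```python
def parse_selection(raw, titles):
    """Turn a string like '1, 5, 10' into a list of titles.

    Non-numeric tokens, out-of-range numbers, and duplicates are skipped.
    """
    chosen = []
    for token in raw.split(","):
        token = token.strip()
        if not token.isdecimal():
            continue
        idx = int(token) - 1  # the printed list is 1-based
        if 0 <= idx < len(titles) and titles[idx] not in chosen:
            chosen.append(titles[idx])
    return chosen


titles = ["12 Angry Men", "1917", "3 Idiots", "Alien"]
print(parse_selection("1, 4, 0, 99, abc, 4", titles))
# → ['12 Angry Men', 'Alien']
```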

#### Step 5: Label Movies as Liked or Not
We label the selected movies as "liked" (1) and others as "not liked" (0).

In [19]:
liked_movies = movies[movies['Title'].isin(selected_titles)]
movies['Liked'] = movies['Title'].isin(liked_movies['Title']).astype(int)
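
Because labeling matches on the title string alone, every row that shares a selected title is marked as liked. The sketch below illustrates this on a toy frame (the rows are illustrative, not read from the dataset) and prints the class counts, which is a quick way to see how imbalanced the target is before training.

```python
import pandas as pd

# Toy data: two different films that share one title
toy = pd.DataFrame({
    "Title": ["Drishyam", "Drishyam", "Alien", "Joker"],
    "Year": [2013, 2015, 1979, 2019],
})
selected = ["Drishyam", "Joker"]

# isin() compares titles only, so both "Drishyam" rows get label 1
toy["Liked"] = toy["Title"].isin(selected).astype(int)
print(toy)
print(toy["Liked"].value_counts())
```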

#### Step 6: Feature Engineering
We combine the first two genres, the director, and the four lead actors into a single string for modeling. Note that `Genre3` is not included in the combined string.

In [20]:
# Combine genre, director, and cast into one string
movies['features'] = (
    movies['Genre1'].fillna('') + ' ' +
    movies['Genre2'].fillna('') + ' ' +
    movies['Director'].fillna('') + ' ' +
    movies['Star1'].fillna('') + ' ' +
    movies['Star2'].fillna('') + ' ' +
    movies['Star3'].fillna('') + ' ' +
    movies['Star4'].fillna('')
)

#### Step 7: TF-IDF Vectorization
Transform the combined text features into a sparse numerical matrix using TF-IDF.

In [21]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(movies['features'])

# Target variable
y = movies['Liked']
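
One thing worth knowing about this step: with default settings, `TfidfVectorizer` lowercases the text and splits it into word tokens, so a name such as "Francis Ford Coppola" becomes three separate features. The sketch below shows this on two hand-written feature strings, then shows one possible alternative (an assumption, not what the notebook does) that joins each name with underscores so a person maps to a single token. `as_token` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

features = [
    "Crime Drama Francis Ford Coppola Marlon Brando Al Pacino",
    "Action Crime Christopher Nolan Christian Bale Heath Ledger",
]

# Default behaviour: names are broken into individual lowercase words
default_vec = TfidfVectorizer()
default_vec.fit(features)
print(sorted(default_vec.vocabulary_))


def as_token(name):
    """Hypothetical helper: keep a multi-word name together as one token."""
    return name.replace(" ", "_")


# Alternative: underscores count as word characters, so each name stays whole
joined = [" ".join(["Crime", "Drama", as_token("Francis Ford Coppola"), as_token("Al Pacino")])]
name_vec = TfidfVectorizer()
name_vec.fit(joined)
print(sorted(name_vec.vocabulary_))
```

Whether whole-name tokens help depends on the data; split tokens let "Christopher" match unrelated people, while whole-name tokens only match exact repeat appearances.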

#### Step 8: Train Random Forest Classifier
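Train a Random Forest classifier on 80% of the movies to learn the user's preferences. The remaining 20% is held out, although this notebook does not go on to evaluate the model on it.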

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)
model = RandomForestClassifier()
model.fit(X_train, y_train)

| Parameter | Value |
| --- | --- |
| n_estimators | 100 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |
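
The held-out split could be used to check how well the model ranks liked movies. The following is a rough sketch on synthetic, imbalanced data from `make_classification`, since the real labels depend on user input. The `random_state` values, `class_weight="balanced"`, and the choice of average precision as the metric are all assumptions made for illustration and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 5% positives, mimicking a handful of liked movies
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.95, 0.05], random_state=0
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=0
)

# class_weight="balanced" is one common way to compensate for rare positives
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)

# Score the held-out rows only, then compare against the positive rate
scores = clf.predict_proba(X_te)[:, 1]
ap = average_precision_score(y_te, scores)
baseline = y_te.mean()
print(f"average precision: {ap:.3f} (positive rate: {baseline:.3f})")
```

With only a few liked movies, any such score will be noisy, so it is better read as a sanity check than as a precise measurement.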


#### Step 9: Predict Likelihood of Liking Each Movie
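Use the trained model to assign each movie a predicted probability of being liked. Note that `X` includes the movies the model was trained on, not only unseen ones.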

In [23]:
movies['Predicted_Probability'] = model.predict_proba(X)[:, 1]
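
The `[:, 1]` index assumes that the second column of `predict_proba` corresponds to the "liked" class. The columns follow the order of `model.classes_`, so the index can be looked up instead of hard-coded. The sketch below uses random toy data; `liked_col` is an illustrative name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data with a minority positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] > 1.0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# predict_proba columns are ordered like clf.classes_, so find the column for label 1
liked_col = list(clf.classes_).index(1)
proba = clf.predict_proba(X_demo)[:, liked_col]
print(clf.classes_, proba[:5])
```

If no movies were selected, the target would contain a single class and the probability matrix would have only one column, so the lookup also gives a clearer place to detect that case.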

#### Step 10: Top 10 Movie Recommendations
Show the 10 movies with the highest predicted probability, excluding those the user already marked as liked.

In [24]:
recommendations = movies[movies['Liked'] == 0].sort_values(
    by='Predicted_Probability', ascending=False
).head(10)

recommendations[['Title', 'Predicted_Probability']]

| | Title | Predicted_Probability |
| --- | --- | --- |
| 779 | Ray | 0.22 |
| 505 | Mystic River | 0.17 |
| 36 | The Prestige | 0.12 |
| 63 | The Dark Knight Rises | 0.1 |
| 155 | Batman Begins | 0.1 |
| 329 | The Martian | 0.1 |
| 790 | Black Hawk Down | 0.08 |
| 217 | Ford v Ferrari | 0.08 |
| 896 | Hell or High Water | 0.08 |
| 777 | The Bourne Supremacy | 0.07 |
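
Most of the candidate movies were in the training set with label 0, which tends to push their predicted probabilities toward zero. One possible variation (an assumption, not the notebook's method) is to score every movie with a model that never saw it, using out-of-fold predictions from `cross_val_predict`. The sketch below uses synthetic data and placeholder titles.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the TF-IDF matrix and the Liked labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Each row is scored by a model trained on the other four folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = cross_val_predict(
    RandomForestClassifier(random_state=0),
    X_demo, y_demo, cv=cv, method="predict_proba",
)[:, 1]

demo = pd.DataFrame({
    "Title": [f"movie_{i}" for i in range(len(y_demo))],  # placeholder titles
    "Liked": y_demo,
    "Score": oof,
})

# Rank only the movies that were not marked as liked
top10 = demo[demo["Liked"] == 0].nlargest(10, "Score")
print(top10[["Title", "Score"]])
```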
