**Table of contents**<a id='toc0_'></a>    
- [Importing packages](#toc1_1_1_1_)    
    - [Loading dataset](#toc1_1_2_)    
      - [Basic checks on the dataset](#toc1_1_2_1_)    
      - [Checking Null values](#toc1_1_2_1_1_)    
      - [Imputing null values](#toc1_1_2_2_)    
      - [Feature extraction](#toc1_1_2_3_)    
      - [TFIDF Vectorizer](#toc1_1_2_4_)    
      - [Scaling the values using StandardScaler](#toc1_1_2_5_)    
      - [Reducing the dimensions](#toc1_1_2_6_)    
      - [Training with DBSCAN](#toc1_1_2_7_)    
- [Recommendation system logic](#toc1_1_2_8_)    


#### <a id='toc1_1_1_1_'></a>[Importing packages](#toc0_)

In [105]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

### <a id='toc1_1_2_'></a>[Loading dataset](#toc0_)

In [106]:
df = pd.read_csv(r'C:\Users\SRIKANTH ADIPIREDDY\Desktop\credit_default\movie_recommendation\imdb_top_1000.csv')

In [107]:
df.head()

| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |


#### <a id='toc1_1_2_1_'></a>[Basic checks on the dataset](#toc0_)

In [108]:
df.shape

(1000, 16)

In [109]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Series_Title   1000 non-null   object 
 2   Released_Year  1000 non-null   object 
 3   Certificate    899 non-null    object 
 4   Runtime        1000 non-null   object 
 5   Genre          1000 non-null   object 
 6   IMDB_Rating    1000 non-null   float64
 7   Overview       1000 non-null   object 
 8   Meta_score     843 non-null    float64
 9   Director       1000 non-null   object 
 10  Star1          1000 non-null   object 
 11  Star2          1000 non-null   object 
 12  Star3          1000 non-null   object 
 13  Star4          1000 non-null   object 
 14  No_of_Votes    1000 non-null   int64  
 15  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 125.1+ KB


##### <a id='toc1_1_2_1_1_'></a>[Checking Null values](#toc0_)

In [110]:
df.isna().sum()

Poster_Link        0
Series_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64

#### <a id='toc1_1_2_2_'></a>[Imputing null values](#toc0_)

In [111]:
df.isna().sum()/len(df)*100

Poster_Link       0.0
Series_Title      0.0
Released_Year     0.0
Certificate      10.1
Runtime           0.0
Genre             0.0
IMDB_Rating       0.0
Overview          0.0
Meta_score       15.7
Director          0.0
Star1             0.0
Star2             0.0
Star3             0.0
Star4             0.0
No_of_Votes       0.0
Gross            16.9
dtype: float64

In [112]:
# Fill missing certificates with the most frequent value.
# Assign the result back instead of calling fillna(..., inplace=True) on a column,
# which pandas warns will stop working in 3.0.
df['Certificate'] = df['Certificate'].fillna(df['Certificate'].mode()[0])


In [113]:
# Fill missing Meta scores with the column mean, assigning the result back
# rather than using the chained inplace=True form.
df['Meta_score'] = df['Meta_score'].fillna(df['Meta_score'].mean())
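
The pandas message also mentions a dictionary form, `df.method({col: value})`, which fills several columns in one call. The sketch below shows that form on a small made-up frame; the `toy` frame and its values are illustrative assumptions, not rows from the IMDB file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the IMDB data (values are made up for illustration).
toy = pd.DataFrame({
    "Certificate": ["A", "UA", None, "A"],
    "Meta_score": [80.0, np.nan, 90.0, 100.0],
})

# Build one mapping of column -> fill value, then fill in a single call.
fill_values = {
    "Certificate": toy["Certificate"].mode()[0],  # most frequent category
    "Meta_score": toy["Meta_score"].mean(),       # mean of the observed scores
}
toy = toy.fillna(fill_values)
print(toy)
```

Computing the fill values first and assigning the result back avoids the chained-assignment pattern that triggers the warning.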


#### <a id='toc1_1_2_3_'></a>[Feature extraction](#toc0_)

In [114]:
# Concatenate every descriptive column into one text field for TF-IDF.
df['content'] = (
    df['Genre'] + ' ' + df['Director'] + ' ' + df['Overview'] + ' '
    + df['Star1'] + ' ' + df['Star2'] + ' ' + df['Star3'] + ' ' + df['Star4'] + ' '
    + df['Certificate'] + ' ' + df['Runtime'] + ' '
    + df['No_of_Votes'].astype(str) + ' ' + df['Gross'].astype(str) + ' '
    + df['Meta_score'].astype(str) + ' ' + df['IMDB_Rating'].astype(str) + ' '
    + df['Poster_Link'].astype(str) + ' ' + df['Series_Title'] + ' '
    + df['Released_Year'].astype(str)
)

In [115]:
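# Store the lower-cased text back; calling str.lower() alone does not modify the column
df['content'] = df['content'].str.lower()
df['content']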

0      drama frank darabont two imprisoned men bond o...
1      crime, drama francis ford coppola an organized...
2      action, crime, drama christopher nolan when th...
3      crime, drama francis ford coppola the early li...
4      crime, drama sidney lumet a jury holdout attem...
                             ...                        
995    comedy, drama, romance blake edwards a young n...
996    drama, western george stevens sprawling epic c...
997    drama, romance, war fred zinnemann in hawaii i...
998    drama, war alfred hitchcock several survivors ...
999    crime, mystery, thriller alfred hitchcock a ma...
Name: content, Length: 1000, dtype: object

In [116]:
# Guard against any missing 'content' values, assigning the result back to the column
df['content'] = df['content'].fillna(df['content'].mode()[0])

In [117]:
df.IMDB_Rating

0      9.3
1      9.2
2      9.0
3      9.0
4      9.0
      ... 
995    7.6
996    7.6
997    7.6
998    7.6
999    7.6
Name: IMDB_Rating, Length: 1000, dtype: float64

#### <a id='toc1_1_2_4_'></a>[TFIDF Vectorizer](#toc0_)

In [118]:

# TF-IDF vectorizer: English stop words removed, vocabulary capped at the 300 most frequent terms
tfidf = TfidfVectorizer(stop_words='english', max_features=300)
# Transform the combined 'content' text into TF-IDF features
overview_tfidf = tfidf.fit_transform(df['content'])

# Combine the numeric rating with the dense TF-IDF features
features = np.hstack([df[['IMDB_Rating']].values, overview_tfidf.toarray()])
features.shape

(1000, 301)

In [119]:
features

array([[9.3       , 0.        , 0.        , ..., 0.32764796, 0.        ,
        0.        ],
       [9.2       , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [9.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [7.6       , 0.2536674 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [7.6       , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [7.6       , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

#### <a id='toc1_1_2_5_'></a>[Scaling the values using StandardScaler](#toc0_)

In [120]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

#### <a id='toc1_1_2_6_'></a>[Reducing the dimensions](#toc0_)

In [121]:
pca = PCA(n_components=50)
features_pca = pca.fit_transform(features_scaled)

In [122]:
features_pca.shape

(1000, 50)
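
The choice of 50 components is fixed here without a check on how much variance they retain. One common way to inspect this is the cumulative `explained_variance_ratio_` of a fitted `PCA`. The sketch below uses synthetic data as a stand-in for `features_scaled`; the data, the 95% threshold, and the names `pca_demo` and `n_for_95` are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled feature matrix: 200 rows, 20 columns,
# where most of the variance lives in 3 latent directions.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 20))
X_scaled = StandardScaler().fit_transform(X)

pca_demo = PCA(n_components=10).fit(X_scaled)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3))
print("components for 95% variance:", n_for_95)
```

Running the same check on the notebook's `pca` object would show whether 50 components keep enough of the signal for clustering.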

#### <a id='toc1_1_2_7_'></a>[Training with DBSCAN](#toc0_)

In [123]:
DBSCAN_clustering = DBSCAN(eps=10, min_samples=10, metric='euclidean')
df['cluster']=DBSCAN_clustering.fit_predict(features_pca)
df['cluster'].value_counts()


cluster
 0    682
-1    307
 1     11
Name: count, dtype: int64
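
With `eps=10` and `min_samples=10`, 307 of the 1000 films are labelled as noise (`-1`) and almost all the rest fall into one cluster. A frequently used heuristic for picking `eps` is to look at the sorted distance from each point to its `min_samples`-th nearest neighbour. The sketch below applies it to synthetic data standing in for `features_pca`; the data and the 90th-percentile rule are assumptions for illustration, not a tuned setting for this dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
# Synthetic stand-in for features_pca: two tight blobs plus scattered noise.
blob_a = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(100, 2))
noise = rng.uniform(low=-3.0, high=8.0, size=(20, 2))
X = np.vstack([blob_a, blob_b, noise])

min_samples = 10
# Distance from each point to its min_samples-th nearest neighbour
# (the query point itself counts as the first neighbour here).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Hypothetical heuristic: take a high percentile of the k-distances as a starting eps.
eps_guess = float(np.percentile(k_distances, 90))
labels = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(X)
print("eps guess:", round(eps_guess, 3))
print("clusters:", len(set(labels) - {-1}), "noise points:", int((labels == -1).sum()))
```

Plotting `k_distances` and reading off the bend in the curve is the usual visual version of the same idea.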

In [139]:
df[df['Series_Title'] == 'Inception']['cluster'].values

array([0])

#### <a id='toc1_1_2_8_'></a>[Recommendation system logic](#toc0_)

In [143]:
def recommend_movies(title, df, top_n=6):
    # Return early if the title is not in the dataset
    if title not in df['Series_Title'].values:
        return "Movie not found in the database."

    # Row index of the first film matching the title
    movie_index = df[df['Series_Title'] == title].index[0]

    # Keep only the films that share the query film's cluster
    cluster_label = df.loc[movie_index, 'cluster']
    cluster_movies = df[df['cluster'] == cluster_label]

    # TF-IDF row for the query film (relies on df keeping its default RangeIndex)
    movie_vector = overview_tfidf[movie_index]

    # Cosine similarity between the query film and every film in its cluster
    similarities = cosine_similarity(movie_vector, overview_tfidf[cluster_movies.index]).flatten()

    # Positions of the top_n most similar films; the last sorted entry is skipped
    # on the assumption that it is the query film itself (similarity 1.0)
    similar_indices = similarities.argsort()[-(top_n + 1):-1][::-1]

    # Return the top_n most similar films
    recommendation = cluster_movies.iloc[similar_indices][['Series_Title', 'Overview', 'IMDB_Rating', 'Poster_Link']]
    return recommendation.reset_index(drop=True)

recommend_movies('Inception', df)

| | Series_Title | Overview | IMDB_Rating | Poster_Link |
|---|---|---|---|---|
| 0 | Interstellar | A team of explorers travel through a wormhole ... | 8.6 | https://m.media-amazon.com/images/M/MV5BZjdkOT... |
| 1 | How to Train Your Dragon | A hapless young Viking who aspires to hunt dra... | 8.1 | https://m.media-amazon.com/images/M/MV5BMjA5ND... |
| 2 | Akira | A secret military project endangers Neo-Tokyo ... | 8.0 | https://m.media-amazon.com/images/M/MV5BM2ZiZT... |
| 3 | 28 Days Later... | Four weeks after a mysterious, incurable virus... | 7.6 | https://m.media-amazon.com/images/M/MV5BYTFkM2... |
| 4 | Ex Machina | A young programmer is selected to participate ... | 7.7 | https://m.media-amazon.com/images/M/MV5BMTUxNz... |
| 5 | Predator | A team of commandos on a mission in a Central ... | 7.8 | https://m.media-amazon.com/images/M/MV5BY2QwYm... |
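
`recommend_movies` drops the query film by skipping the last entry of the sorted similarities, which holds only while the query itself has the highest score and no other film ties with it. A possible alternative is to remove the query by position before ranking. The sketch below shows this on a made-up four-film catalogue; the function name `recommend_in_cluster`, the titles, and the cluster labels are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue; titles, text, and cluster labels are made up for illustration.
toy = pd.DataFrame({
    "Series_Title": ["Space One", "Space Two", "Heist One", "Space Three"],
    "content": [
        "astronaut wormhole space mission",
        "astronaut space station rescue",
        "bank robbery heist crew",
        "space wormhole explorers",
    ],
    "cluster": [0, 0, 1, 0],
})
toy_tfidf = TfidfVectorizer().fit_transform(toy["content"])

def recommend_in_cluster(title, frame, matrix, top_n=2):
    """Rank same-cluster titles by cosine similarity, dropping the query by position."""
    titles = frame["Series_Title"].to_numpy()
    clusters = frame["cluster"].to_numpy()
    matches = np.flatnonzero(titles == title)
    if matches.size == 0:
        return []
    query_pos = int(matches[0])
    candidate_pos = np.flatnonzero(clusters == clusters[query_pos])
    candidate_pos = candidate_pos[candidate_pos != query_pos]  # exclude the query itself
    if candidate_pos.size == 0:
        return []
    sims = cosine_similarity(matrix[[query_pos]], matrix[candidate_pos]).ravel()
    order = np.argsort(sims)[::-1][:top_n]
    return titles[candidate_pos[order]].tolist()

print(recommend_in_cluster("Space One", toy, toy_tfidf))
```

Working with row positions throughout also removes the dependence on the DataFrame keeping its default index.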


In [144]:
df.to_csv('movie_recommendation_with_clusters.csv', index=False)

import pickle

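# Note: this saves the TF-IDF matrix (overview_tfidf), not the fitted vectorizer (tfidf).
# The context manager closes the file once the dump completes.
with open('tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(overview_tfidf, f)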
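
If a downstream application needs to embed new text as well as look up existing films, it may help to save the fitted vectorizer alongside the matrix. The sketch below is a minimal round trip under that assumption; the documents, the dictionary layout, and the file name `tfidf_artifacts.pkl` are hypothetical and not part of the notebook.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["crime drama mafia family", "space drama wormhole", "crime heist crew"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tfidf_artifacts.pkl"  # hypothetical file name
    # Store both objects together so they cannot drift apart.
    with open(path, "wb") as fh:
        pickle.dump({"vectorizer": vectorizer, "matrix": matrix}, fh)
    with open(path, "rb") as fh:
        loaded = pickle.load(fh)

# The restored vectorizer can embed unseen text into the same feature space.
new_vec = loaded["vectorizer"].transform(["a new crime drama"])
print(new_vec.shape, loaded["matrix"].shape)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.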