In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
df=pd.read_csv('/content/RS-A2_A3_movie.csv').fillna('')
df['genres']=df['genres'].str.lower().str.strip()
print(df.head(5))
print(df.columns)
print(df.info())
tfidf=TfidfVectorizer(token_pattern=r"[^|]+")
tfidf_matrix=tfidf.fit_transform(df['genres'])
cosine_similarity_matrix=cosine_similarity(tfidf_matrix,tfidf_matrix)
indices=pd.Series(df.index,index=df['title']).drop_duplicates()

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  adventure|animation|children|comedy|fantasy  
1                   adventure|children|fantasy  
2                               comedy|romance  
3                         comedy|drama|romance  
4                                       comedy  
Index(['movieId', 'title', 'genres'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int64 
 1   title    27278 non-null  object
 2   genres   27278 non-null  object
dtypes: int64(1), object(2)
memory usage: 639.5+ KB
None


In [3]:
#Recommend movies based on given movie title
def recommend_movies(title, top_n=5):
    if title not in indices:
        return f"Movie '{title}' not found."
    idx = indices[title]
    sim_scores = list(enumerate(cosine_similarity_matrix[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]
    movie_indices = [i[0] for i in sim_scores]
    return df[['title','genres']].iloc[movie_indices]
print(recommend_movies('Toy Story (1995)', top_n=5))

                                               title  \
2209                                     Antz (1998)   
3027                              Toy Story 2 (1999)   
3663  Adventures of Rocky and Bullwinkle, The (2000)   
3922                Emperor's New Groove, The (2000)   
4790                           Monsters, Inc. (2001)   

                                           genres  
2209  adventure|animation|children|comedy|fantasy  
3027  adventure|animation|children|comedy|fantasy  
3663  adventure|animation|children|comedy|fantasy  
3922  adventure|animation|children|comedy|fantasy  
4790  adventure|animation|children|comedy|fantasy  


In [4]:
#Evaluating recommendation system
def evaluate_model():
    genre_score=[]
    for genre in ['action','comedy','drama','horror','romance']:
        df_genre=df[df['genres'].str.contains(genre)]
        if len(df_genre)>1:
            data_index=df_genre.index.tolist()
            cos_sim=cosine_similarity_matrix[data_index][:,data_index]
            avg_sim=cos_sim.mean()
            genre_score.append((genre,round(avg_sim,3)))
    return pd.DataFrame(genre_score,columns=['Genre','Average Similarity'])

In [5]:
print(evaluate_model())

     Genre  Average Similarity
0   action               0.478
1   comedy               0.507
2    drama               0.443
3   horror               0.623
4  romance               0.628


### Code Explanation and Output Significance

This notebook sets up a movie recommendation system based on genre similarity using TF-IDF and cosine similarity.

#### Cell 1: Importing Libraries
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
```
- `import pandas as pd`: Imports the `pandas` library, essential for data manipulation and analysis, and aliases it as `pd` for convenience.
- `from sklearn.feature_extraction.text import TfidfVectorizer`: Imports `TfidfVectorizer` from `scikit-learn`. This tool converts text data (like movie genres) into numerical vectors using the Term Frequency-Inverse Document Frequency (TF-IDF) method: a term's weight rises with how often it appears in a document and falls with the number of documents that contain it, so a genre shared by most movies contributes less than a rare one.
- `from sklearn.metrics.pairwise import cosine_similarity`: Imports `cosine_similarity` from `scikit-learn`. This function calculates the cosine of the angle between two non-zero vectors, which serves as a measure of similarity between them. In this context, it will measure the similarity between movie genre vectors.
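
To make the two imported tools concrete, here is a minimal hand-rolled sketch of what they compute. It uses only the standard library and follows scikit-learn's documented defaults (smoothed IDF, L2-normalized rows); the genre strings are made up for illustration and are not taken from the dataset.

```python
import math
import re
from collections import Counter

# Made-up genre strings in the same pipe-separated format as the dataset.
docs = [
    "adventure|children|fantasy",
    "comedy|romance",
    "adventure|children|fantasy",
    "horror",
]

# Same pattern as the notebook: every run of non-pipe characters is one token.
tokenised = [re.findall(r"[^|]+", d.lower()) for d in docs]
vocab = sorted({t for doc in tokenised for t in doc})

n_docs = len(docs)
doc_freq = Counter(t for doc in tokenised for t in set(doc))
# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenised genre string."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenised]

def cosine(u, v):
    # The vectors are already unit length, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

sim_identical = cosine(vectors[0], vectors[2])  # same genre string
sim_disjoint = cosine(vectors[0], vectors[3])   # no genre in common
print(round(sim_identical, 3), round(sim_disjoint, 3))  # 1.0 0.0
```

Identical genre strings score 1 and strings with no genre in common score 0; everything else falls in between, weighted toward the rarer shared genres.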

#### Cell 2: Data Loading, Preprocessing, and Similarity Calculation
```python
df=pd.read_csv('/content/RS-A2_A3_movie.csv').fillna('')
df['genres']=df['genres'].str.lower().str.strip()
print(df.head(5))
print(df.columns)
print(df.info())
tfidf=TfidfVectorizer(token_pattern=r"[^|]+")
tfidf_matrix=tfidf.fit_transform(df['genres'])
cosine_similarity_matrix=cosine_similarity(tfidf_matrix,tfidf_matrix)
indices=pd.Series(df.index,index=df['title']).drop_duplicates()
```
- `df=pd.read_csv('/content/RS-A2_A3_movie.csv').fillna('')`: Loads the movie data from the specified CSV file into a pandas DataFrame called `df`. `.fillna('')` replaces any missing values (NaN) in the DataFrame with empty strings.
- `df['genres']=df['genres'].str.lower().str.strip()`: Cleans the 'genres' column by converting all genre strings to lowercase and removing any leading or trailing whitespace. This ensures consistency for accurate similarity calculations.
- `print(df.head(5))`: Displays the first 5 rows of the processed DataFrame, providing a quick look at the data structure and content.
- `print(df.columns)`: Shows the names of all columns in the DataFrame.
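- `print(df.info())`: Provides a summary of the DataFrame, including the number of entries, column names, non-null counts, data types, and memory usage. This helps in understanding the data's completeness and structure. `df.info()` writes the summary itself and returns `None`, which is why the output ends with a stray `None` line; calling `df.info()` without `print` avoids it.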
- `tfidf=TfidfVectorizer(token_pattern=r"[^|]+")`: Initializes `TfidfVectorizer`. The `token_pattern=r"[^|]+"` argument tells the vectorizer to treat any sequence of characters *not* a pipe (`|`) as a token (i.e., a single genre). This is crucial because genres are separated by `|` in the dataset.
- `tfidf_matrix=tfidf.fit_transform(df['genres'])`: Learns the vocabulary and IDF from the 'genres' column and then transforms the genre strings into a numerical TF-IDF matrix. Each row in this matrix represents a movie, and each column represents a genre term, with values indicating the importance of that genre to the movie.
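- `cosine_similarity_matrix=cosine_similarity(tfidf_matrix,tfidf_matrix)`: Calculates the cosine similarity between every pair of movies using their TF-IDF genre vectors. The result is a square matrix where `cosine_similarity_matrix[i][j]` represents the genre similarity between movie `i` and movie `j`. Because `cosine_similarity` returns a dense array by default, the 27,278 movies here produce a 27,278 × 27,278 matrix, roughly 6 GB as 64-bit floats.
- `indices=pd.Series(df.index,index=df['title']).drop_duplicates()`: Creates a pandas Series that maps movie titles to their corresponding integer row positions in the DataFrame, used to look up a movie's row given its title. Note that `.drop_duplicates()` compares the Series *values* (the row numbers, which are all distinct), not the title labels, so it removes nothing here. If the dataset contains repeated titles, `indices[title]` returns several rows instead of a single integer; deduplicating on the index (for example with `indices.index.duplicated()`) would be needed to guarantee one row per title.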
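
The behaviour of the title lookup with repeated titles can be checked on a small example. The sketch below uses a hypothetical three-row frame (the titles are invented) and contrasts the notebook's construction with one possible alternative that deduplicates on the title labels; it is an illustration, not a change the original notebook makes.

```python
import pandas as pd

# Hypothetical toy frame with a repeated title (not taken from the dataset).
toy = pd.DataFrame({"title": ["Alpha (1990)", "Beta (1995)", "Alpha (1990)"]})

# The notebook's construction: drop_duplicates() compares the values (0, 1, 2),
# which are all distinct, so no entry is removed.
by_value = pd.Series(toy.index, index=toy["title"]).drop_duplicates()

# Alternative: keep only the first row for each title label.
full = pd.Series(toy.index, index=toy["title"])
by_label = full[~full.index.duplicated(keep="first")]

print(len(by_value), len(by_label))             # 3 2
print(type(by_value["Alpha (1990)"]).__name__)  # Series (two rows match)
print(int(by_label["Alpha (1990)"]))            # 0
```

Keeping the first occurrence is an arbitrary choice; a dataset with genuinely distinct movies under one title would need a different key, such as `movieId`.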

#### Cell 3: Movie Recommendation Function
```python
#Recommend movies based on given movie title
def recommend_movies(title, top_n=5):
    if title not in indices:
        return f"Movie '{title}' not found."
    idx = indices[title]
    sim_scores = list(enumerate(cosine_similarity_matrix[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]
    movie_indices = [i[0] for i in sim_scores]
    return df[['title','genres']].iloc[movie_indices]
print(recommend_movies('Toy Story (1995)', top_n=5))
```
- `def recommend_movies(title, top_n=5):`: Defines a function named `recommend_movies` that takes a movie title and an optional number of recommendations (`top_n`, default 5) as input.
- `if title not in indices: return f"Movie '{title}' not found."`: Checks if the given movie title exists among the labels of our `indices` mapping. If not, it returns a message string instead of a DataFrame, so callers need to handle both return types.
- `idx = indices[title]`: Retrieves the internal DataFrame index for the provided movie title.
- `sim_scores = list(enumerate(cosine_similarity_matrix[idx]))`: Fetches the row from the `cosine_similarity_matrix` corresponding to `idx`. This row contains similarity scores between the input movie and all other movies. `enumerate` pairs each score with its original index.
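- `sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]`: Sorts these similarity scores in descending order. `key=lambda x: x[1]` sorts based on the similarity score itself. `[1:top_n+1]` skips the first entry and keeps the next `top_n`. The intent is to exclude the input movie, whose similarity with itself is 1. This works when the input movie sorts first, as it does for 'Toy Story (1995)' at row 0. Python's sort is stable, though, so if a movie with an identical genre vector appears earlier in the DataFrame, that movie takes the first slot and the input movie can show up among its own recommendations.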
- `movie_indices = [i[0] for i in sim_scores]`: Extracts only the DataFrame indices of these top `top_n` recommended movies.
- `return df[['title','genres']].iloc[movie_indices]`: Returns a new DataFrame containing only the 'title' and 'genres' of the recommended movies.
- `print(recommend_movies('Toy Story (1995)', top_n=5))`: Calls the `recommend_movies` function for 'Toy Story (1995)' and prints the top 5 genre-similar movie recommendations.
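
The slicing step assumes the input movie is the first entry after sorting. The sketch below uses a hypothetical 4×4 similarity matrix, in which rows 0 and 2 stand for movies with identical genres, to show when that assumption fails and one way to exclude the input movie by index instead. The function names are illustrative and do not appear in the notebook.

```python
# Hypothetical similarity matrix: rows 0 and 2 have identical genre vectors.
sim = [
    [1.0, 0.2, 1.0, 0.0],
    [0.2, 1.0, 0.2, 0.5],
    [1.0, 0.2, 1.0, 0.0],
    [0.0, 0.5, 0.0, 1.0],
]

def top_by_slice(idx, top_n):
    # The notebook's approach: sort everything, then skip the first entry.
    scores = sorted(enumerate(sim[idx]), key=lambda x: x[1], reverse=True)
    return [i for i, _ in scores[1:top_n + 1]]

def top_excluding_self(idx, top_n):
    # Alternative: remove the query row by index before ranking.
    scores = [(i, s) for i, s in enumerate(sim[idx]) if i != idx]
    scores.sort(key=lambda x: x[1], reverse=True)
    return [i for i, _ in scores[:top_n]]

slice_result = top_by_slice(2, 2)
explicit_result = top_excluding_self(2, 2)
print(slice_result, explicit_result)  # [2, 1] [0, 1]
```

For query row 2, the slice drops row 0 (the earlier tie) and returns the query itself, while the explicit filter returns rows 0 and 1.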

**Significance of Output (Cell 3):**
The output for `recommend_movies('Toy Story (1995)', top_n=5)` lists 'Antz (1998)', 'Toy Story 2 (1999)', 'Adventures of Rocky and Bullwinkle, The (2000)', 'Emperor's New Groove, The (2000)', and 'Monsters, Inc. (2001)'. All five carry exactly the same genre string as 'Toy Story (1995)' (adventure|animation|children|comedy|fantasy), so each has a cosine similarity of 1 with it. This demonstrates the core functionality of the recommendation system: it retrieves movies whose genre profile matches that of the input movie. It also shows a limitation: when many movies tie at the top score, the order among them is simply their order in the DataFrame, because genre is the only signal used.

#### Cell 4: Model Evaluation Function
```python
#Evaluating recommendation system
def evaluate_model():
    genre_score=[]
    for genre in ['action','comedy','drama','horror','romance']:
        df_genre=df[df['genres'].str.contains(genre)]
        if len(df_genre)>1:
            data_index=df_genre.index.tolist()
            cos_sim=cosine_similarity_matrix[data_index][:,data_index]
            avg_sim=cos_sim.mean()
            genre_score.append((genre,round(avg_sim,3)))
    return pd.DataFrame(genre_score,columns=['Genre','Average Similarity'])
```
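- `def evaluate_model():`: Defines a function that summarizes intra-genre similarity for a few genres. This measures how homogeneous the genre vectors are within each genre; it does not compare recommendations against user ratings or any other ground truth.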
- `genre_score=[]`: Initializes an empty list to store the evaluation results for each genre.
- `for genre in ['action','comedy','drama','horror','romance']:`: Loops through a predefined set of key genres to evaluate.
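- `df_genre=df[df['genres'].str.contains(genre)]`: Filters the main `df` to create a subset containing only movies that include the current genre in their 'genres' string. `str.contains` does a substring (regular-expression) match on the whole string rather than comparing whole genre tokens, which is fine for these five names but would also match a longer tag that merely contains the word.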
- `if len(df_genre)>1:`: Ensures that there are at least two movies within the filtered genre to calculate meaningful similarity scores.
- `data_index=df_genre.index.tolist()`: Gets the original DataFrame indices of the movies belonging to the current genre.
- `cos_sim=cosine_similarity_matrix[data_index][:,data_index]`: Extracts a sub-matrix from the `cosine_similarity_matrix` containing only the similarity scores *between movies of the current genre*. This allows us to assess how similar movies *within* a specific genre are to each other.
- `avg_sim=cos_sim.mean()`: Calculates the average of all entries in this genre-specific sub-matrix. The average includes the diagonal (each movie's similarity of 1 with itself) and counts every pair twice; for a genre with many movies the diagonal's effect is small.
- `genre_score.append((genre,round(avg_sim,3)))`: Appends the genre name and its rounded average similarity score to the `genre_score` list.
- `return pd.DataFrame(genre_score,columns=['Genre','Average Similarity'])`: Returns a pandas DataFrame presenting the evaluation results, with 'Genre' and 'Average Similarity' columns.
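
The genre filter matches substrings rather than whole genre names. The sketch below illustrates the difference on hypothetical genre strings; the tag 'docudrama' is invented purely to show the collision and is not claimed to exist in the dataset.

```python
import pandas as pd

# Hypothetical genre strings; 'docudrama' is invented to show the collision.
genres = pd.Series(["drama|romance", "docudrama", "comedy"])

# Substring test, as in evaluate_model(): 'docudrama' also matches 'drama'.
by_substring = genres.str.contains("drama", regex=False)

# Exact-token test: split on the pipe and compare whole genre names.
by_token = genres.str.split("|").apply(lambda tokens: "drama" in tokens)

print(by_substring.tolist())  # [True, True, False]
print(by_token.tolist())      # [True, False, False]
```

For the five genre names used in the notebook the two tests are likely to agree, so this matters mainly if the genre list is extended.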

#### Cell 5: Printing Model Evaluation
```python
print(evaluate_model())
```
- `print(evaluate_model())`: Executes the `evaluate_model` function and prints the resulting DataFrame.

**Significance of Output (Cell 5):**
The output of `evaluate_model()` shows a DataFrame with genres and their 'Average Similarity' scores ('action': 0.478, 'comedy': 0.507, 'drama': 0.443, 'horror': 0.623, 'romance': 0.628). Each score describes how alike the genre vectors are among movies that share the genre. Higher scores (like 'horror' and 'romance' here) suggest that movies carrying the tag tend to have similar overall genre profiles, which may happen when the tag co-occurs with a small, recurring set of other genres. Lower scores (like 'drama') suggest the tag is combined with a wider variety of other genres. Because the recommender ranks purely by this same similarity, the table describes the genre structure of the data rather than recommendation accuracy; judging accuracy would need an external signal such as user ratings.
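
One detail behind these averages is that `cos_sim.mean()` runs over every entry of the sub-matrix, including each movie's similarity with itself. The sketch below, on a hypothetical 3×3 sub-matrix, compares that figure with the mean over distinct pairs only; the numbers are invented, and averaging over the upper triangle is one possible alternative rather than the notebook's method.

```python
import numpy as np

# Hypothetical similarity sub-matrix for three movies sharing a genre.
cos_sim = np.array([
    [1.0, 0.4, 0.1],
    [0.4, 1.0, 0.7],
    [0.1, 0.7, 1.0],
])

# Mean over every entry, as in evaluate_model(): includes the diagonal.
mean_all = cos_sim.mean()

# Mean over distinct pairs only: the strict upper triangle.
rows, cols = np.triu_indices_from(cos_sim, k=1)
mean_pairs = cos_sim[rows, cols].mean()

print(round(float(mean_all), 3), round(float(mean_pairs), 3))  # 0.6 0.4
```

The gap is large for three movies and shrinks as the group grows, since the diagonal holds only n of the n² entries.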