# Recommender Systems Development for Marketing (Assignment): Content-Based Recommender System for Yelp, MovieLens, Netflix

## Patrik Vojta: 500825187
## Can Senturk: 500810355
## Sander van Duin: 500934074

In this notebook, we will walk through the stages of data cleaning, exploratory data analysis (EDA), and feature engineering.

# Content-Based Recommender System

In this notebook, we focus on developing a content-based recommender system for the Yelp, MovieLens, and Netflix datasets. Content-based recommender systems recommend items to users based on the attributes of the items and the user's preferences. For Yelp, we use the attributes associated with businesses to recommend restaurants to users; for MovieLens and Netflix, we use movie genres.

### Approach and Methodology:

- **Data Preparation**: We begin by loading and preprocessing the datasets, ensuring that they are in a suitable format for building the content-based recommender system.
  
- **Feature Engineering**: Next, we perform feature engineering on the datasets, extracting relevant attributes that describe each item. For Yelp, the restaurant attributes were already extracted in the Yelp EDA.
  
- **Similarity Calculation**: Using techniques such as cosine similarity, we compute the similarity between restaurants based on their attributes. This allows us to identify restaurants that are similar to each other.
  
- **Recommendation Generation**: With the similarity scores computed, we generate recommendations for users by selecting restaurants that are most similar to the ones they have previously liked or rated positively.
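
The four steps above can be sketched end to end on a toy example. The restaurant labels and attribute values below are invented for illustration and are not taken from the Yelp data, and `recommend_similar` is a hypothetical helper rather than a function defined in this notebook.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy item-attribute matrix (invented values, one row per restaurant)
items = pd.DataFrame(
    {
        "Alcohol": [1, 1, 0, 0],
        "OutdoorSeating": [1, 1, 0, 1],
        "RestaurantsDelivery": [0, 1, 1, 1],
    },
    index=["A", "B", "C", "D"],
)

# Pairwise cosine similarity between all restaurants
sim = pd.DataFrame(
    cosine_similarity(items.values), index=items.index, columns=items.index
)

def recommend_similar(liked_item, sim_matrix, top_n=2):
    """Return the top_n items most similar to liked_item, excluding the item itself."""
    scores = sim_matrix[liked_item].drop(liked_item)
    return scores.sort_values(ascending=False).head(top_n)

# Restaurants most similar to one the user liked
recommendations = recommend_similar("A", sim)
print(recommendations)
```

In this toy case, restaurant `B` shares both of `A`'s attributes and is ranked first, followed by `D`, which shares one.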

### Libraries Used:

- `pandas`: Essential for data manipulation and preprocessing.
  
- `sklearn.metrics.pairwise.cosine_similarity`: Utilized to compute cosine similarity between restaurant attributes.
  
- `matplotlib.pyplot` and `seaborn`: Used for visualizing the data and similarity scores.
  
### Implementation:

The implementation of the content-based recommender system involves loading the datasets, performing feature engineering, computing similarity scores, and generating recommendations. Each step is detailed with code snippets and explanations to provide a clear understanding of the process.

### Evaluation and Validation:

To validate the performance of the content-based recommender system, we assess its ability to recommend relevant restaurants to users based on their preferences. Evaluation metrics such as precision, recall, and accuracy may be utilized to measure the effectiveness of the recommendations.
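
As a rough illustration of how precision and recall could be computed for a single user, consider the sketch below. The function `precision_recall_at_k` and the item ids are hypothetical, and what counts as a "relevant" item (for example, a held-out rating of 4 or higher) is an assumption that would have to be fixed before evaluating.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Compute precision@k and recall@k for one user.

    recommended: ranked list of item ids produced by the recommender
    relevant: set of item ids the user actually liked (held-out data)
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: 4 ranked recommendations, 3 relevant items, evaluated at k=3
precision, recall = precision_recall_at_k(
    ["r1", "r2", "r3", "r4"], {"r2", "r4", "r9"}, k=3
)
print(precision, recall)
```

Averaging these values over all users gives one summary score per metric.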

The subsequent sections will delve into each stage of the content-based recommender system development, showcasing the implementation details and results obtained.



In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
import os

# 1. Dataset Import
This code snippet is designed to facilitate the import of datasets for further analysis. Let's break down its functionality step by step:

1. **User Input Prompt**: 
    - The code starts by prompting the user to choose a dataset among "movielens", "netflix", or "yelp". This input is case-insensitive.

2. **Dataset Paths Configuration**:
    - After obtaining the user's choice, the code constructs the file paths for the chosen dataset based on a specified base path.
    - The base path is predefined as `path`, pointing to the directory where the dataset files are stored.
    - Dataset paths are stored in a dictionary named `dataset_paths`, with keys corresponding to dataset names and values representing the full paths to the dataset files.

3. **Dataset Loading**:
    - Using the chosen dataset name, the code retrieves the corresponding file path from `dataset_paths`.
    - It then attempts to load the dataset using the `pd.read_parquet()` function from the Pandas library. This function is specifically designed for reading Parquet files, which are commonly used for storing structured data efficiently.
    - If the dataset is successfully loaded, the code prints a confirmation message indicating the successful loading of the dataset, along with the dimensions of the loaded dataset (number of rows and columns).
    - In case of any error during the dataset loading process, the code catches the exception and prints an error message indicating the failure to load the dataset.



In [2]:
# Prompt the user to choose one of the three datasets (case-insensitive)
dataset_choice = input("Choose the dataset to import (movielens, netflix, or yelp): ").strip().lower()

# Specify the base path for datasets
# path = '/Users/sandervanduin/Desktop/'
path = r"C:\Users\patri\OneDrive - HvA\Systems Development\RecSys Repository\SystemDevelopment - Marketing\System-Development\Yelp\Final Paquet Files"

# Map each dataset name to the full path of its Parquet file
dataset_paths = {
    'movielens': os.path.join(path, 'movielens_content.parquet'),
    'netflix': os.path.join(path, 'content_sampled.parquet'),
    'yelp': os.path.join(path, 'model_merged_df.parquet'),
}

# df_content stays None if the choice is invalid or loading fails
df_content = None

if dataset_choice not in dataset_paths:
    print("Invalid choice. Please enter 'movielens', 'netflix', or 'yelp'.")
else:
    try:
        df_content = pd.read_parquet(dataset_paths[dataset_choice], engine='pyarrow')
        print(f"{dataset_choice.capitalize()} dataset loaded successfully.")
    except Exception as e:
        print(f"Failed to load the {dataset_choice} dataset: {e}")

# Continue with the analysis only if a dataset was loaded
if df_content is not None:
    print(f"Dataset contains {df_content.shape[0]} rows and {df_content.shape[1]} columns.")
else:
    print("No dataset was loaded.")



Yelp dataset loaded successfully.
Dataset contains 51256 rows and 11 columns.


In [3]:
df_content.head()

Unnamed: 0,user_id,business_id,stars,name,Alcohol,Caters,RestaurantsDelivery,OutdoorSeating,RestaurantsTakeOut,Open24Hours,BusinessParking
0,79730,4362,2,Compère Lapin,1,1,1,1,1,0,1
1,89341,4362,3,Compère Lapin,1,1,1,1,1,0,1
2,79781,4362,5,Compère Lapin,1,1,1,1,1,0,1
3,66952,4362,5,Compère Lapin,1,1,1,1,1,0,1
4,79594,4362,5,Compère Lapin,1,1,1,1,1,0,1


In [4]:
# Harmonize column names across datasets
if dataset_choice == 'yelp':
    # 'business_id' is kept as-is because the Yelp-specific cells below refer to it by that name
    df_content = df_content.rename(columns={'stars': 'rating', 'user_id': 'userid'})
elif dataset_choice == 'movielens':
    df_content = df_content.rename(columns={'movieId': 'itemid', 'genres': 'genre', 'userId': 'userid'})

# 2. Data Filtering Function for Recommender Systems

The `data_filtering` function is a utility that preprocesses and filters the Yelp, Netflix, and MovieLens datasets used in this notebook. Here's an overview of the function and its sub-components:

## 2.1. Functionality

- **Generic Filtering**: The main function `data_filtering` acts as an entry point, determining the dataset context and dispatching to the appropriate filtering subroutine.
  
- **Dataset-Specific Logic**: Sub-functions `filter_yelp_data` and `filter_movie_data` encapsulate filtering logic unique to the Yelp and movie-related datasets respectively.

## Parameters

- `df`: The DataFrame to be filtered.
- `dataset_name`: A string identifier for the dataset, guiding the function to the appropriate filtering logic.
- `**kwargs`: A set of keyword arguments that specify filtering thresholds and criteria.

## Returns

- A DataFrame that has been filtered according to the specified parameters and dataset context.
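
The idea of threshold-based filtering can be illustrated on a toy DataFrame. This is a minimal sketch with invented values; it uses `groupby(...).transform('size')`, which is an alternative to the `reset_index`/`isin` approach taken in the notebook's own functions, and the variable names are chosen for this example only.

```python
import pandas as pd

# Toy review data (invented values)
df = pd.DataFrame(
    {
        "userid": [1, 1, 2, 3, 3, 3],
        "business_id": [10, 11, 10, 10, 11, 12],
        "rating": [5, 4, 3, 2, 5, 4],
    }
)

min_reviews_per_user = 2

# Number of reviews written by the user of each row, aligned with df's index
reviews_per_user = df.groupby("userid")["rating"].transform("size")

# Keep only rows belonging to sufficiently active users
filtered = df[reviews_per_user >= min_reviews_per_user]
print(filtered)
```

Here user 2 has a single review and is removed, while users 1 and 3 are kept.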



In [5]:
import pandas as pd

def data_filtering(df, dataset_name, **kwargs):
    """
    Filter the DataFrame based on specified criteria depending on the dataset.
    
    Parameters:
    - df: pandas DataFrame to filter.
    - dataset_name: Name of the dataset ('netflix', 'yelp', or 'movielens').
    - **kwargs: Additional keyword arguments for dataset-specific filtering criteria.
    
    Returns:
    - Filtered pandas DataFrame.
    """
    if dataset_name.lower() == 'yelp':
        return filter_yelp_data(df, **kwargs)
    elif dataset_name.lower() in ['netflix', 'movielens']:
        return filter_movie_data(df, **kwargs)
    else:
        raise ValueError("Invalid dataset name. Please provide 'netflix', 'yelp', or 'movielens'.")

def filter_yelp_data(df, minimum_reviews_per_user=1, minimum_reviews_per_restaurant=1):
    """
    Filter Yelp dataset based on specified criteria.
    """
    # Count reviews per user and per restaurant, then keep only those meeting the thresholds
    ratings_per_user = df.groupby('userid').size().reset_index(name='number_of_reviews_per_user')
    ratings_per_restaurant = df.groupby('business_id').size().reset_index(name='number_of_reviews_per_restaurant')

    eligible_users = ratings_per_user[ratings_per_user['number_of_reviews_per_user'] >= minimum_reviews_per_user]
    df_filtered_users = df[df['userid'].isin(eligible_users['userid'])]

    eligible_restaurants = ratings_per_restaurant[ratings_per_restaurant['number_of_reviews_per_restaurant'] >= minimum_reviews_per_restaurant]
    df_filtered = df_filtered_users[df_filtered_users['business_id'].isin(eligible_restaurants['business_id'])]

    print(f'Removing users with fewer than {minimum_reviews_per_user} reviews')
    print(f'Removing restaurants with fewer than {minimum_reviews_per_restaurant} reviews')
    print(f'Filtered dataset size: {df_filtered.shape[0]}\n')

    return df_filtered

def filter_movie_data(df, mimimum_percentage_movies_user_rated=0.02, minimum_ratings_per_movie=50):
    """
    Filter Netflix/Movielens dataset based on specified criteria.
    """
    # Count ratings per movie and per user, then keep active users and sufficiently rated movies
    ratings_per_movie = df.groupby('itemid').size().reset_index(name='ratings_per_movie')
    ratings_per_user = df.groupby('userid').size().reset_index(name='number_rated')

    eligible_users = ratings_per_user[ratings_per_user['number_rated'] / ratings_per_movie.shape[0] > mimimum_percentage_movies_user_rated]
    df_filtered_users = df[df['userid'].isin(eligible_users['userid'])]

    eligible_movies = ratings_per_movie[ratings_per_movie['ratings_per_movie'] >= minimum_ratings_per_movie]
    df_filtered = df_filtered_users[df_filtered_users['itemid'].isin(eligible_movies['itemid'])]

    print(f'Removing users who have rated less than {mimimum_percentage_movies_user_rated*100}% of all movies')
    print(f'Filtered dataset size after removing less active users: {df_filtered_users.shape[0]}\n')
    print(f'Removing movies that have been rated by fewer than {minimum_ratings_per_movie} users')
    print(f'Filtered dataset size after removing less rated movies: {df_filtered.shape[0]}\n')

    return df_filtered

# Apply the dataset-specific filter
if dataset_choice.lower() == 'yelp':
    yelp_filter_params = {
        'minimum_reviews_per_user': 1,
        'minimum_reviews_per_restaurant': 1
    }
    sample_df = data_filtering(df_content, dataset_name='yelp', **yelp_filter_params)
    print(sample_df.head(5))
    print('The dimensions of the sampled Yelp dataset are:', sample_df.shape)
else:
    movie_filter_params = {
        'mimimum_percentage_movies_user_rated': 0.02,
        'minimum_ratings_per_movie': 50
    }
    # Pass the actual choice so that Netflix runs are not labelled as Movielens
    sample_df = data_filtering(df_content, dataset_name=dataset_choice, **movie_filter_params)
    print(sample_df.head(5))
    print(f'The dimensions of the sampled {dataset_choice} dataset are:', sample_df.shape)


Removing users with fewer than 1 reviews
Removing restaurants with fewer than 1 reviews
Filtered dataset size: 51256

   userid  business_id  rating           name  Alcohol  Caters  \
0   79730         4362       2  Compère Lapin        1       1   
1   89341         4362       3  Compère Lapin        1       1   
2   79781         4362       5  Compère Lapin        1       1   
3   66952         4362       5  Compère Lapin        1       1   
4   79594         4362       5  Compère Lapin        1       1   

   RestaurantsDelivery  OutdoorSeating  RestaurantsTakeOut  Open24Hours  \
0                    1               1                   1            0   
1                    1               1                   1            0   
2                    1               1                   1            0   
3                    1               1                   1            0   
4                    1               1                   1            0   

   BusinessParking  
0            

# 3. Data Segregation

Depending on the dataset being processed (`yelp` or movie-related datasets such as `netflix` or `movielens`), the data is segregated into meaningful subsets:

- For Yelp:
  - **Restaurant Details**:
    A DataFrame named `df_restaurant_details` is created containing columns such as `business_id`, `name`, and various attributes relevant to restaurants like `Alcohol`, `Caters`, `RestaurantsDelivery`, `OutdoorSeating`, `RestaurantsTakeOut`, `Open24Hours`, and `BusinessParking`. This DataFrame is deduplicated to ensure uniqueness.
  - **User Reviews**:
    A separate DataFrame `df_user_reviews` includes user-centric information such as `userid`, `business_id`, and `rating`.
  - The shapes of these DataFrames are printed along with the unique count of `business_id` to understand the diversity of the restaurant data.

- For Movie Datasets:
  - **Movie Details**:
    DataFrames `df_content_items_movies` and `df_content_items_users` are created to separately hold movie information (`itemid`, `title`, `genre`) and user interactions (`userid`, `itemid`, `rating`, `year`).
  - **Genre Analysis**:
    A `genre_dummies` DataFrame is generated by converting the genre information into dummy variables, facilitating easier analysis of genre-specific trends.
  - The final movie content DataFrame `df_content_items` is formed by combining `df_content_items_movies` with `genre_dummies` and removing redundant columns.

These steps help in simplifying the complex dataset, making it more accessible for further analysis and model training.
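
To make the genre encoding concrete, the sketch below applies `str.get_dummies` to a toy movie table. The titles and genres are invented for illustration; only the pipe-separated genre format is taken from the notebook.

```python
import pandas as pd

# Toy movie table with pipe-separated genres (invented values)
movies = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "genre": ["Action|Comedy", "Drama", "Comedy|Drama"],
    }
)

# One binary column per genre
genre_dummies = movies["genre"].str.get_dummies(sep="|")

# Combine titles with the binary genre columns
movie_features = pd.concat([movies[["title"]], genre_dummies], axis=1)
print(movie_features)
```

Each movie is now described by a binary vector over genres, which is the representation the similarity calculation works on.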


In [6]:
if dataset_choice.lower() == 'yelp':
    # Create a new dataframe with only the relevant columns
    # DataFrame with restaurant details
    df_restaurant_details = sample_df[['business_id', 'name', 'Alcohol', 'Caters', 
                                    'RestaurantsDelivery', 'OutdoorSeating', 
                                    'RestaurantsTakeOut', 'Open24Hours', 'BusinessParking']].drop_duplicates()

    # DataFrame with user reviews
    df_user_reviews = sample_df[['userid', 'business_id', 'rating']].drop_duplicates()

    # Printing shapes of the new dataframes
    print(df_restaurant_details.shape)
    print(df_user_reviews.shape)

    # Print unique business_id count
    print(len(df_restaurant_details['business_id'].unique()))
else:
    # make two separate dataframes, one with movieid, title, and genre and one with userid, movieid, rating, and year
    df_content_items_movies = sample_df[['itemid', 'title', 'genre']].drop_duplicates()
    df_content_items_users = sample_df[['userid', 'itemid', 'rating', 'year']].drop_duplicates()
    print(df_content_items_movies.shape)
    print(df_content_items_users.shape)
    # unique itemid
    print(len(df_content_items_movies['itemid'].unique()))

    # create genre_dummies
    genre_dummies = df_content_items_movies['genre'].str.get_dummies(sep='|')
    df_content_items = pd.concat([df_content_items_movies, genre_dummies], axis=1)
    df_content_items = df_content_items.drop(columns=['genre', 'itemid'])
    display(df_content_items.head(10))

(22698, 9)
(51135, 3)
22698


# 4. Implementing Cosine Similarity

A custom function is designed to calculate the cosine similarity between two entities (restaurants or movies) based on attributes like genre for movies and various characteristics for restaurants. This measure is critical in recommendation systems to evaluate the similarity between items.

- **Yelp Restaurants**:
  - `cosine_similarity_restaurants`: The function takes the names of two restaurants and a DataFrame containing restaurant attributes. It first verifies the existence of the specified restaurants in the DataFrame.
  - It then retrieves attribute vectors for each restaurant and calculates the cosine similarity.
  - An example usage is provided with "Monk's Cafe" and "Henkes Tavern", showcasing how the similarity score is obtained and printed.

- **Movie Datasets**:
  - `cosine_similarity_calculation`: This function performs a similar operation for movie datasets. It takes two movie titles and a DataFrame with genre information in dummy variable format.
  - It extracts genre vectors for the specified movies and calculates the cosine similarity between them.
  - An example calculation between "The Rise and Fall of ECW" and "What the #$*! Do We Know!?" illustrates the usage of the function.

This approach to calculating similarity forms the basis of content-based filtering in recommendation systems, allowing for personalized recommendations based on item attributes.
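
For two attribute vectors $a$ and $b$, cosine similarity is defined as $\cos(a, b) = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The sketch below computes it by hand for two invented binary vectors and checks the result against scikit-learn; the vectors do not correspond to any restaurant or movie in the datasets.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two invented binary attribute vectors (shape 1 x n, as scikit-learn expects)
a = np.array([[1, 1, 1, 0]])
b = np.array([[1, 1, 0, 1]])

# Manual computation: dot product divided by the product of the vector norms
manual = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

# Same computation using scikit-learn
library = cosine_similarity(a, b)[0][0]

print(manual, library)
```

The two items share two attributes and each has three attributes set, so the similarity is 2 / (√3 · √3) = 2/3. For binary vectors the value ranges from 0 (no shared attributes) to 1 (identical attribute sets).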


In [7]:
# Custom functions to compute cosine similarity between two items based on their attribute vectors
# (restaurant attributes for Yelp, genre dummies for the movie datasets)

from sklearn.metrics.pairwise import cosine_similarity

if dataset_choice.lower() == 'yelp':
    # Function to calculate cosine similarity
    def cosine_similarity_restaurants(restaurant1, restaurant2, df_attributes):
        if restaurant1 not in df_attributes['name'].values or restaurant2 not in df_attributes['name'].values:
            return None

        # Extract the attributes for each restaurant
        restaurant1_attributes = df_attributes[df_attributes['name'] == restaurant1].iloc[0, 2:].values.reshape(1, -1)
        restaurant2_attributes = df_attributes[df_attributes['name'] == restaurant2].iloc[0, 2:].values.reshape(1, -1)

        # Calculate cosine similarity
        similarity = cosine_similarity(restaurant1_attributes, restaurant2_attributes)

        return similarity[0][0]

    # Example usage
    restaurant1_name = "Monk's Cafe"  
    restaurant2_name = "Henkes Tavern" 

    # Calculate cosine similarity between restaurant1 and restaurant2
    similarity_score_attributes = cosine_similarity_restaurants(restaurant1_name, restaurant2_name, df_restaurant_details)
    print(f'The cosine similarity between {restaurant1_name} and {restaurant2_name} is {similarity_score_attributes}')

else:
    def cosine_similarity_calculation(movie1, movie2, df_genre):
        if movie1 not in df_genre['title'].values or movie2 not in df_genre['title'].values:
            return None

        movie1_genres = df_genre[df_genre['title'] == movie1].iloc[0, 1:].values.reshape(1, -1)
        movies2_genres = df_genre[df_genre['title'] == movie2].iloc[0, 1:].values.reshape(1, -1)
        
        similarity = cosine_similarity(movie1_genres, movies2_genres)

        return similarity[0][0]

    movie1_name = "The Rise and Fall of ECW"
    movie2_name = "What the #$*! Do We Know!?"

    similarity_score_genre = cosine_similarity_calculation(f'{movie1_name}', f'{movie2_name}', df_content_items)
    print(f'The cosine similarity between {movie1_name} and {movie2_name} is {similarity_score_genre}')


The cosine similarity between Monk's Cafe and Henkes Tavern is 0.7071067811865477


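### 4.1. Result:

In this run, the Yelp dataset was selected, so only the restaurant similarity was computed:

- The cosine similarity between Monk's Cafe and Henkes Tavern is `0.7071067811865477` (about 0.71), which indicates that the two restaurants share a substantial part, but not all, of their attributes.
- The cosine similarity between "The Rise and Fall of ECW" and "What the #$*! Do We Know!?" is only computed when a movie dataset is selected, so no value is reported here.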


# 5. Genre Analysis in Movie Datasets

In this section, we delve into the genre analysis of two popular movie datasets: MovieLens and Netflix. The analysis aims to understand the distribution of movie genres and identify prominent trends in these datasets.

### 5.1. Process Overview:

1. **Data Preparation**: Depending on the dataset chosen (either MovieLens or Netflix), we start by preparing our dataset. This involves creating a working copy of the original movie dataset.

2. **Genre Distribution Analysis**: We split the genres for each movie, allowing each genre to be analyzed individually. This step helps in understanding the variety and frequency of genres present in the dataset.

3. **Visualization of Genre Frequencies**: We employ visualizations to display the frequency of each genre. This aids in identifying the most common genres within the chosen dataset.

4. **Advanced Genre Analysis**:
   - We create dummy variables for each genre to facilitate a more granular analysis.
   - The focus then shifts to identifying the top genres based on their prevalence in the dataset.
   - We aggregate these top genres and explore their distribution, again using visual tools for clearer insights.

5. **Comparative Analysis**: Because the notebook processes one dataset per run, the plots from a MovieLens run and a Netflix run can be compared to identify common or dataset-specific genre trends.

This genre analysis forms a crucial part of understanding viewer preferences and trends, which can be instrumental in enhancing movie recommendation systems or guiding content acquisition strategies.


In [8]:
if dataset_choice == 'movielens' or dataset_choice == 'netflix':
    # Create a copy of the original DataFrame
    df_first_genre = df_content_items_movies.copy()

    # Splitting the genres and expanding so that each genre has its own row
    df_first_genre['genre'] = df_content_items_movies['genre'].str.split('|')
    df_first_genre = df_first_genre.explode('genre')

    # Counting how often each genre occurs in the dataset (a movie counts once per genre it lists)
    plt.figure(figsize=(10, 5))
    sns.countplot(y='genre', 
                data=df_first_genre, 
                order=df_first_genre['genre']
                .value_counts()
                .index)
    plt.title('Number of Movies per Genre in the Dataset')
    plt.xlabel('Count')
    plt.ylabel('Genre')
    plt.show()

    ###############################################################################################

    # Create dummy variables for all genres
    genre_dummies = df_first_genre['genre'].str.get_dummies(sep='|')

    # Get the top 7 genres based on their frequency
    top_genres = genre_dummies.sum().sort_values(ascending=False).head(7).index.tolist()
    # Copy the slice so that adding a column below does not trigger SettingWithCopyWarning
    top_genre_dummies = genre_dummies[top_genres].copy()
    top_genre_dummies['itemid'] = df_first_genre['itemid']

    # Group by movieId, and sum the dummies to get one row per movieId
    grouped_dummies = top_genre_dummies.groupby('itemid').sum().reset_index()

    # Join with the original movies DataFrame to get the titles
    df_movies_with_top_genres = pd.merge(df_first_genre[['itemid', 'title']].drop_duplicates(), grouped_dummies, on='itemid')

    ###############################################################################################

    # Plotting the top 7 genres
    plt.figure(figsize=(10, 5))
    sns.countplot(y='genre', 
                data=df_first_genre[df_first_genre['genre'].isin(top_genres)], 
                order=df_first_genre[df_first_genre['genre'].isin(top_genres)]['genre']
                .value_counts()
                .index)
    plt.title('Top 7 Genres in the Dataset')
    plt.xlabel('Count')
    plt.ylabel('Genre')
    plt.show()
else:
    pass


# 6. Genre Analysis and Data Transformation for Movie Recommendation Systems

In this section, we focus on genre analysis and data transformation for two major movie datasets, MovieLens and Netflix. The objective is to refine the datasets to emphasize the top genres, which are critical for further analysis and recommendation model building.

### Process Steps:

1. **Genre Selection**:
   - We begin by defining a list of top genres. These genres are selected based on their popularity and relevance in the datasets.
   - A temporary DataFrame `movies_data_1` is created as a copy of the original dataset for manipulation.

2. **Filtering Top Genres**:
   - A custom function `keep_top_genres` is defined. This function filters and retains only the genres listed in our predefined top genres.
   - We apply this function to the `genre` column of `movies_data_1`. This step ensures that each movie is associated only with the top genres, simplifying subsequent analyses.

3. **Genre Dummy Variable Creation**:
   - Dummy variables are created for each of the top genres. This is done using the `get_dummies` method, which transforms each genre into a separate binary column.
   - These dummy variables are then joined with the original dataset, resulting in a new DataFrame `df_with_genres`.

4. **Dataset Filtering**:
   - The DataFrame `df_with_genres` is filtered to include only those rows where at least one of the top genres is present.
   - Optionally, the original `genre` column can be dropped, as the information is now encoded in the dummy variables.

5. **Ratings Data Analysis**:
   - A new DataFrame `ratings_data` is created to analyze the average ratings for each movie.
   - This data can be used for further insights into how different genres are rated by the audience.

6. **Data Display**:
   - The transformed DataFrames `df_with_genres`, `movies_data_1`, and the head of `ratings_data` are displayed for review and analysis.

This transformation and filtering process are crucial for focusing the analysis on the most significant genres, thereby enhancing the efficiency and accuracy of the recommendation systems derived from these datasets.


In [9]:
if dataset_choice == 'movielens' or dataset_choice == 'netflix':
    # Top 7 genres list
    top_genres = ['Drama', 'Comedy', 'Thriller', 'Action', 'Romance', 'Adventure', 'Crime']
    #top_genres = ['Drama', 'Comedy', 'Thriller', 'Action', 'Romance', 'Adventure', 'Crime', 'Horror', 'Mystery', 'Sci-Fi', 'Fantasy', 'War', 'Musical', 'Documentary', 'Animation', 'Western', 'Children', 'IMAX', 'Film-Noir']
    movies_data_1 = sample_df.copy()

    # Define a function that keeps only top genres
    def keep_top_genres(genres):
        genre_list = genres.split('|')
        top_genre_list = [genre for genre in genre_list if genre in top_genres]
        return '|'.join(top_genre_list)

    # Apply the function to the 'genre' column so that it only contains the top genres
    movies_data_1['genre'] = movies_data_1['genre'].apply(keep_top_genres)


    ###############################################################################################

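    # Create dummy variables for the top genres, row-aligned with sample_df so the join below matches every row
    # (reindex guards against a top genre that is absent from the data)
    genre_dummies = sample_df['genre'].str.get_dummies(sep='|').reindex(columns=top_genres, fill_value=0)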

    # Join the genre dummies with the original dataframe
    df_with_genres = sample_df.join(genre_dummies)

    # Filter rows to keep only those that contain at least one of the top genres
    df_with_genres = df_with_genres[(df_with_genres[top_genres].sum(axis=1) > 0)]

    # Now we can drop the original 'genres' column if you no longer need it
    df_with_genres = df_with_genres.drop('genre', axis=1)


    ###############################################################################################


    # 'ratings_data' is a DataFrame with the average ratings for each movie
    ratings_data = sample_df.groupby('title')['rating'].mean().reset_index()

    display(df_with_genres, movies_data_1, ratings_data.head())
else:
    pass

# 7. Ratings Analysis for Yelp Dataset

In this section, we focus on analyzing the Yelp dataset, specifically examining the average ratings for restaurants. This analysis is key to understanding customer preferences and the overall performance of restaurants in the Yelp database.

### Process Overview:

1. **Conditional Execution**:
   - The analysis is performed only if `dataset_choice` is set to 'yelp'. This conditional check ensures that the code is executed specifically for the Yelp dataset.

2. **Average Ratings Calculation**:
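   - We compute the average rating for each restaurant by grouping the data by the restaurant `name` and calculating the mean of the `rating` column. Because the grouping key is `name` rather than `business_id`, restaurants that share a name (for example, several locations of a chain) are averaged together.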
   - The result is stored in a DataFrame called `ratings_data`. This DataFrame provides a concise view of the average ratings that each restaurant has received, serving as an essential metric for performance analysis.

3. **Data Display**:
   - We display the first 50 rows of the `ratings_data` DataFrame, which are ordered alphabetically by restaurant name, to get an initial overview of the ratings.
   - Because the rows are not ordered by rating, identifying the top-rated restaurants requires sorting by `rating` first.

This analysis is particularly useful for identifying standout restaurants in the Yelp dataset and can guide further detailed exploration, such as identifying factors contributing to high ratings or areas where restaurants can improve.
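
One possible way to surface top-rated restaurants is to sort by mean rating while requiring a minimum number of reviews, so that a single 5-star review does not dominate the ranking. The sketch below uses invented data; the `min_reviews` threshold and the column names `mean_rating` and `n_reviews` are assumptions made for this example.

```python
import pandas as pd

# Toy review data (invented values)
reviews = pd.DataFrame(
    {
        "business_id": [1, 1, 1, 2, 3, 3],
        "name": ["Cafe X", "Cafe X", "Cafe X", "Diner Y", "Bistro Z", "Bistro Z"],
        "rating": [5, 4, 5, 5, 3, 4],
    }
)

# Mean rating and review count per restaurant, keyed by business_id to keep locations apart
summary = (
    reviews.groupby(["business_id", "name"])["rating"]
    .agg(mean_rating="mean", n_reviews="count")
    .reset_index()
)

# Keep restaurants with enough reviews and rank them by mean rating
min_reviews = 2
top_rated = summary[summary["n_reviews"] >= min_reviews].sort_values(
    "mean_rating", ascending=False
)
print(top_rated)
```

In this toy case, "Diner Y" has a perfect average but only one review, so it is excluded from the ranking.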


In [10]:
if dataset_choice == 'yelp':
    # 'ratings_data' is a DataFrame with the average ratings for each restaurant
    ratings_data = df_content.groupby('name')['rating'].mean().reset_index()

    display(ratings_data.head(50))
else:
    pass

Unnamed: 0,name,rating
0,$5 Fresh Burger Stop,5.0
1,&pizza - UPenn,2.5
2,&pizza - Walnut,4.666667
3,'feine,4.25
4,'za,3.0
5,1 2 Tea,5.0
6,"1, 2, Tea",5.0
7,10 Arts Bistro,1.0
8,10 Barrel Brewing - Boise,3.0
9,10-01 Food & Drink,3.333333


# 8. Data Preparation for Movie and Restaurant Recommendation Systems

This section involves preparing data for recommendation systems, focusing on two distinct datasets: MovieLens and Netflix for movies, and Yelp for restaurants. The preparation includes resetting indices, dropping unnecessary columns, and restructuring data for further analysis.

### Process Overview:

1. **Movie Datasets (MovieLens and Netflix)**:
   - **Index Resetting**: First, we reset the index of the DataFrame `df_with_genres` to avoid any future indexing issues. This results in `user_movies` having a clean, zero-based index.
   - **Dropping Unnecessary Columns**: We create a DataFrame `user_genre_df` by dropping columns like `itemid`, `title`, `userid`, `rating`, and `year`. This step reduces memory usage and focuses on the relevant data for genre analysis.
   - **Creating Genre ID DataFrame**: Another DataFrame `genre_id` is created, indexed by `itemid`, and also drops the same set of columns.
   - **Data Inspection**: We print the shapes of both `genre_id` and `user_genre_df` to understand the dimensions of our data, and display samples from these DataFrames for a quick review.

2. **Restaurant Dataset (Yelp)**:
   - **Index Resetting**: Similar to the movie datasets, the index of the DataFrame `df_content` is reset, resulting in `user_restaurants` having a fresh index.
   - **Dropping Unnecessary Columns**: For restaurant data, we drop columns like `business_id`, `name`, `userid`, and `rating` from `user_restaurants` to create `user_restaurant_attributes`. This focuses the DataFrame on attributes relevant to restaurant analysis.
   - **Creating Restaurant ID DataFrame**: A DataFrame `restaurant_id` is created, indexed by `business_id`. It also undergoes a similar process of dropping specific columns.
   - **Data Inspection**: The shapes of `restaurant_id` and `user_restaurant_attributes` are printed for dimensional analysis, and samples are displayed for a preliminary inspection.


In [11]:
if dataset_choice == 'movielens' or dataset_choice == 'netflix':
    # Resetting the index to avoid future issues
    user_movies = df_with_genres.reset_index(drop=True)

    # Dropping unnecessary columns to save memory and avoid issues
    user_genre_df = user_movies.drop(['itemid', 'title', 'userid', 'rating', 'year'], axis='columns')

    genre_id = df_with_genres.set_index(df_with_genres['itemid'])
    genre_id.drop(['itemid', 'title', 'userid', 'rating', 'year'], axis='columns', inplace=True)

    print(genre_id.shape)
    print(user_genre_df.shape)
    display(genre_id.sample(5), user_genre_df.head())
else:
    # Resetting the index to avoid future issues
    user_restaurants = df_content.reset_index(drop=True)

    # Dropping unnecessary columns due to memory and to avoid issues
    user_restaurant_attributes = user_restaurants.drop(['business_id', 'name', 'userid', 'rating'], axis='columns')

    # Creating a DataFrame indexed by 'business_id'
    restaurant_id = df_content.set_index(df_content['business_id'])
    restaurant_id.drop(['business_id', 'name', 'userid', 'rating'], axis='columns', inplace=True)

    # Print shapes and display samples
    print(restaurant_id.shape)
    print(user_restaurant_attributes.shape)
    display(restaurant_id.sample(5), user_restaurant_attributes.head())

(51256, 7)
(51256, 7)


             Alcohol  Caters  RestaurantsDelivery  OutdoorSeating  RestaurantsTakeOut  Open24Hours  BusinessParking
business_id
60181              1       1                    1               1                   1            0                1
55377              1       1                    1               1                   1            0                0
59974              1       1                    1               1                   1            0                1
40807              1       1                    1               1                   1            0                1
26769              1       0                    1               1                   1            0                1


   Alcohol  Caters  RestaurantsDelivery  OutdoorSeating  RestaurantsTakeOut  Open24Hours  BusinessParking
0        1       1                    1               1                   1            0                1
1        1       1                    1               1                   1            0                1
2        1       1                    1               1                   1            0                1
3        1       1                    1               1                   1            0                1
4        1       1                    1               1                   1            0                1


# 9. Interaction Data Preparation for Movie and Restaurant Recommendation Systems

This section details the process of preparing interaction data for recommendation systems, focusing separately on MovieLens and Netflix for movie recommendations, and Yelp for restaurant recommendations. The primary goal is to extract and organize user interaction data relevant for building personalized recommendation systems.

### Process Overview:

1. **Movie Datasets (MovieLens and Netflix)**:
   - **Data Selection and Rearranging**: We select `title`, `rating`, and `itemid` from `df_with_genres`, in that order. All other columns, including `userid`, `year`, and the genre indicators, are left out of `interaction_movies`.
   - **Data Inspection**: We print the shape of the resulting `interaction_movies` DataFrame to understand its structure and display the first few entries. This gives an initial glimpse into the type of movie interaction data we will be working with.
   - **Code Snippet Insights**: 
     - The rearranged DataFrame `interaction_movies` emphasizes the importance of user ratings and movie titles, which are critical for understanding user preferences.
     - The shape of the DataFrame is printed to confirm the dimensions and ensure data integrity.

2. **Restaurant Dataset (Yelp)**:
   - **Interaction DataFrame Creation**: For the Yelp dataset, we create `interaction_restaurants` by selecting `business_id`, `rating`, and `name` from `df_content`. This DataFrame is tailored to capture user interactions with restaurants.
   - **Data Display**: We then print the shape of `interaction_restaurants` for dimensional analysis and display the first few records. This provides insight into the user ratings and restaurant names in the Yelp dataset.
   - **Code Snippet Insights**: 
     - The DataFrame `interaction_restaurants` is designed to capture key aspects of user interactions, such as ratings and restaurant names, which are crucial for building a restaurant recommendation system.
     - The initial data displayed helps in understanding the structure and nature of restaurant interactions within the dataset.


In [12]:
if dataset_choice == 'movielens' or dataset_choice == 'netflix':
    # Keep only the columns that describe an interaction (copy to get an independent DataFrame)
    interaction_movies = df_with_genres[['title', 'rating', 'itemid']].copy()
    print(interaction_movies.shape)
    print(interaction_movies.head())
else:
    # Create a DataFrame for interactions, focusing on business_id, user rating, and restaurant name
    interaction_restaurants = df_content[['business_id', 'rating', 'name']]

    # Print the shape and display the head of the DataFrame
    print(interaction_restaurants.shape)
    print(interaction_restaurants.head())


(51256, 3)
   business_id  rating           name
0         4362       2  Compère Lapin
1         4362       3  Compère Lapin
2         4362       5  Compère Lapin
3         4362       5  Compère Lapin
4         4362       5  Compère Lapin


# 10. User Profile Creation for Movie and Restaurant Recommendation Systems

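In this section, we create user profiles for the recommendation systems using the MovieLens and Netflix datasets for movies and the Yelp dataset for restaurants. A profile is the dot product of the transposed item-attribute matrix (genres or restaurant attributes) with the ratings vector, so each attribute receives the sum of the ratings given to items that have it. Since no filter on `userid` is applied here, the profile aggregates every rating in the dataset rather than those of a single user. It provides the foundation for the recommendation scores that follow.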

### Process Overview:

1. **Movie Datasets (MovieLens and Netflix)**:
   - **Index Verification**: We start by verifying that the indices of `user_genre_df` and the `rating` column in `interaction_movies` are aligned. This ensures consistency in data, as both DataFrames should originally have the same row order.
   - **Index Realignment**: If necessary, the index of `interaction_movies['rating']` is realigned to match `user_genre_df.index`.
   - **User Profile Calculation**: The user profile is created by computing the dot product of the transposed `user_genre_df` and the `rating` column in `interaction_movies`. This process effectively combines user ratings with genre data to create a profile.
   - **Profile Inspection**: We output the shape of the resulting user profile to understand its dimensions and display the first few entries (or the first 7 genres).

2. **Restaurant Dataset (Yelp)**:
   - **Index Verification and Realignment**: Similar to the movie datasets, we verify and align the indices of `user_restaurant_attributes` and `interaction_restaurants['rating']`.
   - **User Profile Creation**: For the Yelp dataset, the user profile is computed as the dot product between the transposed `user_restaurant_attributes` and the `rating` column in `interaction_restaurants`. This reflects user preferences based on restaurant attributes.
   - **Profile Inspection**: We then print the shape of the user profile and display the initial entries for review.
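
To make the dot product concrete, here is a minimal sketch on three invented reviews with two binary attributes. The values are hypothetical; only the operation mirrors the notebook.

```python
import pandas as pd

# Hypothetical example: three reviews, two binary attributes.
attributes = pd.DataFrame({"Alcohol": [1, 0, 1], "Caters": [0, 1, 1]})
ratings = pd.Series([5, 2, 4], name="rating")

# DataFrame.dot matches the columns of the left operand against the index of
# the right operand, so both must carry the same labels (here 0, 1, 2).
profile = attributes.T.dot(ratings)

print(profile)
```

Each entry of `profile` is the sum of the ratings of the rows in which that attribute equals 1, which is why the values in the notebook output grow with the number of reviews.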



In [13]:
if dataset_choice == 'movielens' or dataset_choice == 'netflix':
    # Check whether both objects already share the same index
    print(user_genre_df.index.equals(interaction_movies['rating'].index))

    # Relabel the ratings positionally with the genre matrix index.
    # This assumes that both DataFrames originally had the same row order.
    # A new Series is used because assigning to the index of a column
    # selection is not guaranteed to persist.
    ratings = interaction_movies['rating'].set_axis(user_genre_df.index)

    # Dot product of the (genres x rows) matrix with the ratings vector
    user_profile = user_genre_df.T.dot(ratings)

    # Output the shape and head of the user profile
    print(user_profile.shape)  # Should print (19,) if all 19 genres are taken into account, or (7,) if there are 7 genres
    print(user_profile.head(7)) # (19)
else:
    # Check whether both objects already share the same index
    print(user_restaurant_attributes.index.equals(interaction_restaurants['rating'].index))

    # Relabel the ratings positionally with the attribute matrix index
    ratings = interaction_restaurants['rating'].set_axis(user_restaurant_attributes.index)

    # Dot product of the (attributes x rows) matrix with the ratings vector
    user_profile = user_restaurant_attributes.T.dot(ratings)

    # Output the shape and head of the user profile
    print(user_profile.shape)
    print(user_profile.head())



True
(7,)
Alcohol                171767
Caters                 172512
RestaurantsDelivery    184418
OutdoorSeating         177556
RestaurantsTakeOut     189536
dtype: int64


# 11. User Preference Application in Movie and Restaurant Recommendation Systems

In this part of our recommendation system, we apply user preferences, represented by the user profile, to the datasets for both movies (MovieLens and Netflix) and restaurants (Yelp). This step helps in tailoring recommendations to individual user tastes.

### Process Overview:

1. **Movie Datasets (MovieLens and Netflix)**:
   - **Applying User Profile to Movie Genres**: For the movie datasets, we multiply the first three rows of `user_genre_df` by the `user_profile`. This operation applies the user's preferences to the genre data, effectively weighting each genre according to the user's profile.
   - **Insight**: This step is crucial for enhancing the recommendation system's ability to suggest movies that align more closely with the user's historical preferences and tastes.

2. **Restaurant Dataset (Yelp)**:
   - **Applying User Profile to Restaurant Attributes**: For the Yelp dataset, we perform the same operation, multiplying the first three rows of `user_restaurant_attributes` by the `user_profile` as a preview of the weighting.
   - **Output Display**: The `user_restaurant_attributes` DataFrame is then printed. Because the product is not assigned back, the printed DataFrame still contains the original 0/1 attribute values.
   - **Insight**: This multiplication serves to personalize the dataset, making it a crucial step in developing a more user-centric restaurant recommendation system.
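
A minimal sketch of this weighting, using invented values, shows how pandas lines up the profile with the attribute columns. The name `weighted` is illustrative and not part of the notebook.

```python
import pandas as pd

attributes = pd.DataFrame({"Alcohol": [1, 0], "Caters": [1, 1]})
profile = pd.Series({"Alcohol": 9, "Caters": 6})

# Multiplying a DataFrame by a Series matches the Series index against the
# DataFrame columns and returns a new object; `attributes` itself is unchanged.
weighted = attributes * profile

print(weighted)
```

To keep the weighted values, the result has to be assigned to a variable as done here; an expression on its own line inside an `if` block is evaluated and then discarded.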



In [14]:
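if dataset_choice == 'movielens' or dataset_choice == 'netflix':
    # Preview only: weight the first three rows by the user profile (result is not stored)
    user_genre_df.head(3) * user_profile
else:
    # Preview only: weight the first three rows by the user profile (result is not stored)
    user_restaurant_attributes.head(3) * user_profile
    # Prints the original, unweighted attribute matrix
    print(user_restaurant_attributes)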

       Alcohol  Caters  RestaurantsDelivery  OutdoorSeating  \
0            1       1                    1               1   
1            1       1                    1               1   
2            1       1                    1               1   
3            1       1                    1               1   
4            1       1                    1               1   
...        ...     ...                  ...             ...   
51251        1       1                    1               1   
51252        1       1                    1               1   
51253        1       1                    1               1   
51254        1       1                    1               1   
51255        1       1                    1               1   

       RestaurantsTakeOut  Open24Hours  BusinessParking  
0                       1            0                1  
1                       1            0                1  
2                       1            0                1  
3          

# 12. Generating Recommendation Scores for Movie and Restaurant Recommendation Systems

In this section, we define functions to generate recommendation scores for both movie and restaurant datasets. These scores are crucial for identifying the most suitable recommendations based on user profiles.

### Defining Functions:

1. **Function for Movie Recommendations (`get_recommendation_df`)**:
   - **Purpose**: This function calculates recommendation scores for movies based on user profiles and genre data.
   - **Process**: It multiplies each row in `genre_id` with the `user_profile` and sums the values in each row. This sum is then normalized by dividing it by the sum of the user profile values. The result is a set of recommendation scores for each movie.
   - **Return Value**: The function returns a Series of recommendation scores, indicating how well each movie aligns with the user's preferences.

2. **Function for Restaurant Recommendations (`get_restaurant_recommendations`)**:
   - **Purpose**: Similarly, this function computes recommendation scores for restaurants based on user profiles and restaurant attributes.
   - **Process**: It performs the same operation as the movie recommendation function but on the `attribute_id` DataFrame, which represents restaurant attributes.
   - **Return Value**: The function outputs a Series of recommendation scores for restaurants.

### Applying Functions:

- **Movie Recommendations (MovieLens and Netflix)**:
   - **Recommendation Score Calculation**: Using the `get_recommendation_df` function, we calculate recommendation scores for the movie datasets.
   - **Output Inspection**: We print the shape of the resulting Series `recommendation_df` to see how many rows were scored. Additionally, we display the first entries of `recommendation_df` to get a glimpse of the scores.

- **Restaurant Recommendations (Yelp)**:
   - **Recommendation Score Calculation**: The `get_restaurant_recommendations` function is used to compute scores for the Yelp dataset.
   - **Output Inspection**: The shape and first few entries of the resulting `recommendation_scores` are printed, showing how each restaurant is rated according to the user's profile.
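
The scoring formula can be illustrated with a small, hypothetical example (three invented restaurants, two attributes). It follows the same expression as the functions in the next cell.

```python
import pandas as pd

item_attributes = pd.DataFrame(
    {"Alcohol": [1, 0, 1], "Caters": [1, 1, 0]},
    index=pd.Index([10, 20, 30], name="business_id"),
)
profile = pd.Series({"Alcohol": 9, "Caters": 6})

# Weighted sum of the attributes an item has, divided by the total profile weight.
scores = (item_attributes * profile).sum(axis=1) / profile.sum()

print(scores)
```

With binary attributes and non-negative weights, the score lies between 0 and 1, and an item that has every attribute scores exactly 1.0.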


In [15]:
def get_recommendation_df(genre_id, user_profile):
    # Multiply each row of genre_id by user_profile and sum across the row,
    # then normalize by dividing by user_profile.sum()
    recommendation_scores = ((genre_id * user_profile).sum(axis=1)) / user_profile.sum()
    return recommendation_scores

def get_restaurant_recommendations(attribute_id, user_profile):
    # Multiplying each row in attribute_id DataFrame with the user_profile
    recommendation_scores = ((attribute_id * user_profile).sum(axis=1)) / user_profile.sum()
    return recommendation_scores


if dataset_choice == 'movielens' or dataset_choice == 'netflix':
    recommendation_df = get_recommendation_df(genre_id, user_profile)

    print(recommendation_df.shape)
    print(recommendation_df.head())
else:
    recommendation_scores = get_restaurant_recommendations(restaurant_id, user_profile)

    print(recommendation_scores.shape)
    print(recommendation_scores.head())


(51256,)
business_id
4362    0.999614
4362    0.999614
4362    0.999614
4362    0.999614
4362    0.999614
dtype: float64


# 13. Sorting Recommendation Scores for Movie and Restaurant Recommendation Systems

This segment of the recommendation system involves sorting the calculated recommendation scores in descending order. This sorting is essential for both movie (MovieLens and Netflix) and restaurant (Yelp) datasets, as it helps to surface the most suitable recommendations based on user profiles.

### Process Overview:

1. **Movie Recommendations (MovieLens and Netflix)**:
   - **Sorting**: The `recommendation_df` Series, which contains the recommendation scores for movies, is sorted in descending order. This brings the movies with the highest scores (those most aligned with the user profile) to the top.
   - **Displaying Top Recommendations**: We display the top entries in the sorted `recommendation_df` to preview the highest-ranked movie recommendations. This offers an immediate view of the most relevant movies according to the user's profile.

2. **Restaurant Recommendations (Yelp)**:
   - **Sorting**: Similarly, the `recommendation_scores` for restaurants are sorted in descending order, so that restaurants most closely matching the profile come first. The scores are still indexed per review row, so the same `business_id` can appear several times in the sorted output.
   - **Displaying Top Recommendations**: The top entries in the sorted `recommendation_scores` are printed. These entries represent the restaurants that are most likely to satisfy the user's preferences, based on their profile.
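
As a hedged sketch of how the sorted scores could be reduced to one entry per item, the example below sorts invented scores and then drops repeated index labels. The deduplication step is an assumption added for illustration; the notebook cell only sorts.

```python
import pandas as pd

# Hypothetical scores with one entry per review, so business_id repeats.
scores = pd.Series(
    [0.4, 1.0, 1.0, 0.6],
    index=pd.Index([20, 10, 10, 30], name="business_id"),
)

ranked = scores.sort_values(ascending=False)

# Keep only the first (highest-ranked) entry for each business_id.
unique_ranked = ranked[~ranked.index.duplicated(keep="first")]

print(unique_ranked)
```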


In [16]:
if dataset_choice == 'movielens' or dataset_choice == 'netflix':
    # Sort our recommendations in descending order
    recommendation_df = recommendation_df.sort_values(ascending=False)
    print(recommendation_df.head())
else:
    # Sort our recommendations in descending order
    recommendation_scores = recommendation_scores.sort_values(ascending=False)
    print(recommendation_scores.head())



business_id
55761    1.0
13668    1.0
55761    1.0
55761    1.0
55761    1.0
dtype: float64


# 14. Calculating Average Ratings for Movie and Restaurant Recommendation Systems

This part of the recommendation system focuses on calculating the average ratings for items in the MovieLens and Netflix datasets for movies, and the Yelp dataset for restaurants. These average ratings provide valuable insights into user preferences and the overall performance of each item.

### Process Overview:

1. **Movie Datasets (MovieLens and Netflix)**:
   - **Average Rating Calculation**: We start by grouping the `sample_df` by `itemid` and calculating the mean of the `rating` for each movie. This gives us an average rating for each movie in the dataset.
   - **Data Rounding**: The average ratings are rounded to two decimal places for readability and simplicity.
   - **Data Inspection**: We print the first five rows of `ratings_data` to inspect the average ratings of the movies. This initial view helps in understanding the overall user reception of different movies.

2. **Restaurant Dataset (Yelp)**:
   - **Average Rating Calculation**: For the Yelp dataset, a similar calculation is performed. We group the `df_content` by `business_id` and compute the mean of the `rating` for each restaurant.
   - **Data Rounding**: The average ratings are rounded to two decimal places. The column is already named `rating`, so no renaming is needed.
   - **Data Inspection**: The top five rows of `ratings_data` are printed to showcase the average ratings of restaurants, providing an insight into their popularity and customer satisfaction levels.
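
The aggregation can be sketched on a few invented reviews. Adding a review count next to the mean is our own assumption, not something the notebook computes; it is included because an average based on a single review is less informative than one based on many.

```python
import pandas as pd

reviews = pd.DataFrame(
    {"business_id": [10, 10, 20, 20, 20], "rating": [5, 4, 3, 3, 4]}
)

# Mean rating and number of reviews per restaurant; only the mean is rounded.
avg = (
    reviews.groupby("business_id")["rating"]
    .agg(["mean", "count"])
    .round({"mean": 2})
    .reset_index()
)

print(avg)
```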

In [17]:
if dataset_choice == 'movielens' or dataset_choice == 'netflix':
    ratings_data = sample_df.groupby('itemid')['rating'].mean().reset_index().round(2)
    print(ratings_data.head(5)) 
else:
    # Average rating per restaurant
    ratings_data = df_content.groupby('business_id')['rating'].mean().reset_index().round(2)

    # The column is already named 'rating', so no rename is needed
    print(ratings_data.head(5))

   business_id  rating
0            6     1.0
1            7     4.5
2            8     5.0
3            9     4.0
4           12     2.0


# 15. Generating and Displaying Top Recommendations in Movie and Restaurant Systems

This section outlines the process of generating top recommendations for both movie (MovieLens and Netflix) and restaurant (Yelp) recommendation systems. It involves combining user profiles with item-specific data to yield a ranked list of suggestions.

### Functions to Generate Recommendations:

1. **Movie Recommendations (`get_recommendation_df`)**:
   - **Purpose**: This function calculates recommendation scores for movies by considering both the user's genre preferences and the average ratings of movies.
   - **Process**:
     - Each genre indicator is weighted by the user's profile.
     - These weighted scores are summed to get an overall genre score for each movie.
     - The genre scores are normalized and then combined with the average movie ratings.
     - The top `n` recommendations are selected based on these combined scores.
   - **Return Values**: It returns the top movie scores and the average ratings.

2. **Restaurant Recommendations (`get_restaurant_recommendations`)**:
   - **Purpose**: Similarly, this function computes scores for restaurants, taking into account the user's preferences and average restaurant ratings.
   - **Process**:
     - Each attribute is weighted by the user's profile.
     - The weighted attribute scores are summed and normalized.
     - These scores are then combined with the average restaurant ratings.
     - The top `n` recommendations are determined from the combined scores.
   - **Return Values**: The function outputs the top restaurant scores and average ratings.

### Applying Functions and Displaying Results:

- **Movie Recommendations (MovieLens and Netflix)**:
   - **Recommendation Generation**: We use `get_recommendation_df` to generate recommendation scores for movies.
   - **Fetching Movie Details**: The details of the top-rated movies are retrieved, ensuring no duplicates and merging with the average ratings.
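   - **Output**: The first 20 rows of the resulting table are displayed, with details such as title, genre, year, and average rating. Because the rows are selected with `isin`, they appear in dataset order rather than in score order.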

- **Restaurant Recommendations (Yelp)**:
   - **Recommendation Generation**: The `get_restaurant_recommendations` function is used to calculate recommendation scores for restaurants.
   - **Fetching Restaurant Details**: Similar to movies, we retrieve the details of top-rated restaurants and merge them with average ratings.
   - **Output**: The first 20 rows of the resulting table are shown, including business ID, name, and average rating. Because the rows are selected with `isin`, they appear in dataset order rather than in score order.
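
The sketch below, on invented values, illustrates two points about combining the scores. First, the attribute score lies in [0, 1] while ratings are on a 1 to 5 scale, so a plain sum is driven mostly by the rating. Second, a left merge on the ranked scores keeps the ranking order in the final table. The rescaling step is an alternative we assume for illustration, not the method used in the notebook.

```python
import pandas as pd

attribute_scores = pd.Series({10: 1.0, 20: 0.4, 30: 0.6})  # already in [0, 1]
average_ratings = pd.Series({10: 3.0, 20: 5.0, 30: 4.0})   # 1-5 star scale

# Plain sum, as in the notebook: the rating term dominates the ranking.
summed = attribute_scores + average_ratings

# Assumed alternative: map ratings to [0, 1] so both terms share a scale.
rescaled = attribute_scores + (average_ratings - 1) / 4

details = pd.DataFrame({"business_id": [10, 20, 30], "name": ["A", "B", "C"]})

# Start from the ranked scores and left-merge the details, so the output
# keeps the ranking order instead of the order of the details table.
top = (
    rescaled.nlargest(2)
    .rename_axis("business_id")
    .reset_index(name="score")
    .merge(details, on="business_id", how="left")
)

print(top)
```

In this toy case the plain sum ranks restaurant 20 first, while the rescaled version ranks restaurant 10 first, which shows how much the choice of scale influences the result.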



In [18]:
def get_recommendation_df(genre_id, user_profile, movies_data, ratings_data, top_n=50):
    # Weight each genre indicator by the user profile
    weighted_scores = genre_id.mul(user_profile, axis=1)
    # Sum the weighted scores for each movie to get an overall genre score
    genre_scores = weighted_scores.sum(axis=1)
    # Normalize the genre scores by the sum of the user profile weights
    normalized_genre_scores = genre_scores / user_profile.sum()
 
    average_ratings = ratings_data.set_index('itemid')['rating']
    # Combine the normalized genre scores with the average ratings
    # Here you might adjust your logic; for simplicity, let's just sum them
    combined_scores = normalized_genre_scores + average_ratings
    
    # Return the top_n recommendations based on the combined score
    top_indices = combined_scores.nlargest(top_n).index
    top_scores = combined_scores.loc[top_indices]
    
    return top_scores, average_ratings

# Function to get recommendations
def get_restaurant_recommendations(attribute_id, user_profile, ratings_data, top_n=50):
    # Weight each attribute by the user profile
    weighted_scores = attribute_id.mul(user_profile, axis=1)
    
    # Sum the weighted scores for each restaurant to get an overall attribute score
    attribute_scores = weighted_scores.sum(axis=1)
    
    # Normalize the attribute scores by the sum of the user profile weights
    normalized_attribute_scores = attribute_scores / user_profile.sum()
    
    # Calculate the average rating for each restaurant
    average_ratings = ratings_data.set_index('business_id')['rating']
    
    # Combine the normalized attribute scores with the average ratings
    combined_scores = normalized_attribute_scores + average_ratings
    
    # Return the top_n recommendations based on the combined score
    top_indices = combined_scores.nlargest(top_n).index
    top_scores = combined_scores.loc[top_indices]
    
    return top_scores, average_ratings

if dataset_choice == 'movielens' or dataset_choice == 'netflix':
    # Fetch movie details and prepare top recommendations list
    # Ensure you're using the correct ratings_data and movies_data
    recommendation_scores, average_ratings = get_recommendation_df(genre_id, user_profile, movies_data_1, ratings_data, top_n=50)
    top_movies = movies_data_1[movies_data_1['itemid'].isin(recommendation_scores.index)]
    top_movies = top_movies.drop_duplicates(subset='itemid')
    top_movies = top_movies.merge(average_ratings.rename('average_rating'), left_on='itemid', right_index=True)
    top_movies = top_movies[['itemid', 'title', 'genre', 'year', 'average_rating']]

    # Now display your recommendations
    print(top_movies.head(20))
else:
    # Generate recommendation scores
    recommendation_scores, average_ratings = get_restaurant_recommendations(restaurant_id, user_profile, ratings_data, top_n=50)

    # Fetch restaurant details and prepare top recommendations list. 
    top_restaurants = df_content[df_content['business_id'].isin(recommendation_scores.index)]
    top_restaurants = top_restaurants.drop_duplicates(subset='business_id')
    top_restaurants = top_restaurants.merge(average_ratings.rename('average_rating'), left_on='business_id', right_index=True)
    top_restaurants = top_restaurants[['business_id', 'name', 'average_rating']]

    # Now display your recommendations
    print(top_restaurants.head(20))


      business_id                                 name  average_rating
260           554                    Basimo Beach Cafe             5.0
668           440                    Pho Orchid Uptown             5.0
2221          567             La Patisserie Chouquette             5.0
2312          249                        City Barbeque             5.0
2519          533                    Treme Coffeehouse             5.0
2829          315                   365 Caffe Italiano             5.0
2927          480                       The Cubby Hole             5.0
3484          498         Harvest Bowl Eatery & Market             5.0
3788           58                     Sunset 44 Bistro             5.0
3926          305          Frady's One Stop Food Store             5.0
4278           31                          Joe's Pizza             5.0
4437          450                 August Rhodes Bakery             5.0
4530           21  Tony's Restaurant & 3rd Street Cafe             5.0
4686  

# 16. Data Preparation for Yelp and Movie Recommendation Systems for User Recommendation based on attributes/genres

This segment of the code focuses on preparing the data for recommendation systems, specifically for the Yelp dataset (restaurants) and a generic movie dataset. The process involves selecting and organizing relevant data for both user interactions and item attributes.

### Process Overview:

1. **Yelp Dataset (Restaurants)**:
   - **User Reviews Data**:
     - A new DataFrame `user_reviews_df` is created from `df_content`. This DataFrame contains columns `userid`, `business_id`, and `rating`, which are crucial for understanding user interactions with restaurants.
   - **Restaurant Attributes Data**:
     - Another DataFrame `restaurant_attributes_df` is extracted from `df_content`. It includes the `business_id` and various attributes like `Alcohol`, `Caters`, `RestaurantsDelivery`, `OutdoorSeating`, `RestaurantsTakeOut`, `Open24Hours`, and `BusinessParking`. These attributes are essential for characterizing each restaurant.

2. **Movie Dataset**:
   - **User Reviews Data**:
     - For the movie dataset, `user_reviews_df` is formed by selecting `userid`, `itemid`, and `rating` from `df_content`. This DataFrame is analogous to the one in the Yelp dataset but is tailored for movies.
   - **Movie Attributes Data**:
     - The DataFrame `movie_attributes` is set to `user_genre_df`, which holds the genre indicator columns with one row per rating. This provides the attributes for movies.
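
A minimal sketch of this split is shown below on an invented table. Reducing the attribute table to one row per restaurant with `drop_duplicates` is an assumption on our side (the notebook keeps one row per review); it presumes that attributes do not vary within a `business_id`.

```python
import pandas as pd

# Toy stand-in for df_content; values are made up.
df = pd.DataFrame(
    {
        "userid": [1, 2, 1],
        "business_id": [10, 10, 20],
        "rating": [5, 3, 4],
        "Alcohol": [1, 1, 0],
        "Caters": [0, 0, 1],
    }
)

# Interaction table: who rated what, and how.
reviews = df[["userid", "business_id", "rating"]]

# Attribute table with one row per restaurant, indexed by business_id.
item_attributes = (
    df[["business_id", "Alcohol", "Caters"]]
    .drop_duplicates(subset="business_id")
    .set_index("business_id")
)

print(reviews.shape, item_attributes.shape)
```

A unique index on the attribute table avoids repeated rows when the two tables are merged later.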

In [19]:
if dataset_choice == 'yelp':
    # Step 1: Prepare Data
    # Creating user_reviews_df from df_content
    user_reviews_df = df_content[['userid', 'business_id', 'rating']]

    # Creating restaurant_attributes_df from df_content
    restaurant_attributes_df = df_content[['business_id', 'Alcohol', 'Caters', 'RestaurantsDelivery', 'OutdoorSeating', 'RestaurantsTakeOut', 'Open24Hours', 'BusinessParking']]
else:
    user_reviews_df = df_content[['userid', 'itemid', 'rating']]
    movie_attributes = user_genre_df


In [20]:
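# Step 2: Define Functions
def create_user_profile(userid, reviews_df, attributes_df):
    # Reviews written by this user
    user_reviews = reviews_df[reviews_df['userid'] == userid]
    # Attach restaurant attributes; if attributes_df has several rows per
    # business_id, each review is repeated once per matching row
    user_data = user_reviews.merge(attributes_df, on='business_id')
    # Profile = mean of each attribute over the merged rows (ratings are not used as weights)
    profile = user_data.drop(['userid', 'business_id', 'rating'], axis=1).mean()
    return profile

def get_restaurant_recommendations(attribute_id, user_profile, ratings_data, top_n=50):
    # Weight each attribute by the user profile and sum per row
    weighted_scores = attribute_id.mul(user_profile, axis=1)
    attribute_scores = weighted_scores.sum(axis=1)
    # Normalize by the total profile weight
    normalized_attribute_scores = attribute_scores / user_profile.sum()
    # Ratings indexed by business_id, taken as-is from ratings_data
    average_ratings = ratings_data.set_index('business_id')['rating']
    combined_scores = normalized_attribute_scores + average_ratings
    # Keep the top_n highest combined scores
    top_indices = combined_scores.nlargest(top_n).index
    top_scores = combined_scores.loc[top_indices]
    return top_scores

def create_user_profile_movies(user_id, user_reviews_df, attributes_df):
    # Ratings given by this user
    user_ratings = user_reviews_df[user_reviews_df['userid'] == user_id]
    # Attach genre indicators; attributes_df is expected to be indexed by itemid
    user_data = user_ratings.merge(attributes_df, left_on='itemid', right_index=True)
    # Profile = mean of each genre indicator over the merged rows
    profile = user_data.drop(['userid', 'itemid', 'rating'], axis=1).mean()
    return profile

def get_movie_recommendations(attribute_id, user_profile, ratings_data, top_n=50):
    # Weight each genre by the user profile and sum per row
    weighted_scores = attribute_id.mul(user_profile, axis=1)
    attribute_scores = weighted_scores.sum(axis=1)
    # Normalize by the total profile weight
    normalized_attribute_scores = attribute_scores / user_profile.sum()
    # ratings_data is a Series here; it is added by index label
    combined_scores = normalized_attribute_scores + ratings_data
    # Keep the top_n highest combined scores
    top_indices = combined_scores.nlargest(top_n).index
    top_scores = combined_scores.loc[top_indices]
    return top_scores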

In [21]:
import random

# 17. Generating and Displaying Personalized Recommendations

This section outlines the steps to generate personalized recommendations for users in the Yelp restaurant dataset and a movie dataset, followed by displaying the top recommendations.

### Process Overview:

#### Yelp Dataset (Restaurants):

1. **Generate User Profile**:
   - A specific `user_id_to_recommend` is chosen (in this case, 1354).
   - The `create_user_profile` function is used to create a user profile based on the user's previous ratings and restaurant attributes.

2. **Get Recommendations**:
   - Provided the user profile is not empty, the `get_restaurant_recommendations` function is invoked to generate recommendation scores.
   - The function combines the user's profile with the ratings passed in through `user_reviews_df`, which holds individual review ratings rather than per-restaurant averages.

3. **Process and Display Recommendations**:
   - The recommended restaurants are merged with the original dataset to include names and ratings.
   - The data is grouped to get unique restaurants, sorted by recommendation scores, and the top 10 restaurants are selected.
   - The top recommendations are displayed, showing the `business_id`, `name`, and `rating` of each recommended restaurant.

#### Movie Dataset:

1. **Generate User Profile**:
   - A random `user_id_to_recommend` is selected from unique user IDs in the sample dataset.
   - The user's movie preferences are aggregated using `create_user_profile_movies`.

2. **Filter Out Seen Movies**:
   - Movies already watched by the user are filtered out from the recommendation process to ensure novel suggestions.

3. **Get Recommendations**:
   - The `get_movie_recommendations` function generates scores for movies, considering the user profile and filtered movie ratings.
   - Recommendations are merged with the movie details, grouped for uniqueness, and the top 10 movies are sorted based on the scores.

4. **Display Top Movie Recommendations**:
   - The system outputs the top 10 movie recommendations, displaying `itemid`, `title`, and `rating` for each.
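
The overall flow can be sketched end to end on invented data. The helper names `build_profile` and `recommend` are hypothetical and simplified: the profile is the plain mean of the attributes of the items a user rated, and already-rated items are removed from the candidates before scoring. This is a sketch under those assumptions, not a replacement for the notebook functions.

```python
import pandas as pd

# Invented interactions and a one-row-per-item attribute table.
reviews = pd.DataFrame(
    {"userid": [1, 1, 2], "business_id": [10, 20, 30], "rating": [5, 4, 2]}
)
item_attributes = pd.DataFrame(
    {"Alcohol": [1, 0, 1, 1], "Caters": [0, 1, 1, 0]},
    index=pd.Index([10, 20, 30, 40], name="business_id"),
)


def build_profile(user_id):
    """Mean attribute vector over the items this user has rated."""
    seen = reviews.loc[reviews["userid"] == user_id, "business_id"].tolist()
    return item_attributes.loc[seen].mean()


def recommend(user_id, top_n=2):
    """Score unseen items against the user's profile and return the best ones."""
    profile = build_profile(user_id)
    seen = reviews.loc[reviews["userid"] == user_id, "business_id"].tolist()
    candidates = item_attributes.drop(index=seen)  # exclude already-rated items
    scores = (candidates * profile).sum(axis=1) / profile.sum()
    return scores.nlargest(top_n)


print(recommend(1))
```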


In [22]:
if dataset_choice == 'yelp':
    # Step 3: Generate User Profile
    user_id_to_recommend = 1354
    user_profile = create_user_profile(user_id_to_recommend, user_reviews_df, restaurant_attributes_df)
else:
    # Step 3: Generate User Profile
    user_id_to_recommend = random.choice(sample_df['userid'].unique())
    user_profile = create_user_profile_movies(user_id_to_recommend, user_reviews_df, genre_id)

In [23]:
if dataset_choice == 'yelp':
    # Check if the user profile was created
    if user_profile is not None and not user_profile.empty:
        # Step 4: Get Recommendations
        recommendation_scores = get_restaurant_recommendations(restaurant_attributes_df.set_index('business_id'), user_profile, user_reviews_df, top_n=50)

        # Step 5: Process Recommendations to get unique top 10 restaurants
        recommended_restaurants = df_content.merge(recommendation_scores.rename('recommendation_score'), left_on='business_id', right_index=True)
        recommended_restaurants = recommended_restaurants.groupby('business_id').agg({'name': 'first', 'recommendation_score': 'first', 'rating': 'mean'}).reset_index()
        top_recommended_restaurants = recommended_restaurants.sort_values(by='recommendation_score', ascending=False).head(10)

        # Display the top 10 recommended restaurants
        print("Top 10 Recommended Restaurants for User ID {}: ".format(user_id_to_recommend))
        print(top_recommended_restaurants[['business_id', 'name', 'rating']])
    else:
        print(f"No user profile could be created for User ID {user_id_to_recommend}. This could be due to insufficient data.")
else:
    if user_profile is not None and not user_profile.empty:
        # Step 4: Filter out movies already seen by the user
        user_seen_movies = user_reviews_df[user_reviews_df['userid'] == user_id_to_recommend]['itemid']

        # Exclude seen movies from the ratings data
        ratings_data_filtered = user_reviews_df[~user_reviews_df['itemid'].isin(user_seen_movies)]

        # Step 5: Get Recommendations
        recommendation_scores = get_movie_recommendations(genre_id, user_profile, ratings_data_filtered['rating'])
        recommended_movies = df_content.merge(recommendation_scores.rename('recommendation_score'), left_index=True, right_index=True)
        recommended_movies = recommended_movies.groupby('itemid').agg({'title': 'first', 'recommendation_score': 'first', 'rating': 'mean'}).reset_index()
        top_recommended_movies = recommended_movies.sort_values(by='recommendation_score', ascending=False).head(10)

        # Display the top 10 recommended movies
        print("Top 10 Recommended Movies for User ID {}: ".format(user_id_to_recommend))
        print(top_recommended_movies[['itemid', 'title', 'rating']])
    else:
        print(f"No user profile could be created for User ID {user_id_to_recommend}. This could be due to insufficient data.")

Top 10 Recommended Restaurants for User ID 1354: 
    business_id                                         name    rating
1           206                             The Twisted Tail  4.454545
2          1163                                  Square 1682  4.000000
7          4647                         SPOT Gourmet Burgers  4.000000
10        10462                                J & J Seafood  5.000000
11        10944                      Caddy's Treasure Island  3.500000
12        11392                  Charlie Gitto's On the Hill  5.000000
0            86  Gaylord Opryland Resort & Convention Center  3.666667
8          5455                        Two Chicks Café - CBD  4.500000
14        12144                              Dunedin Brewery  4.714286
3          1380                           Slim Goodies Diner  3.300000
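
The merge, group, and sort steps that produce this table can be reproduced on a small example. The sketch below assumes that the content table holds one row per review, which is why a business can appear several times after the merge; the data and the names `reviews`, `scores`, and `ranked` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the content table: one row per review,
# so the same business can appear several times.
reviews = pd.DataFrame({
    "business_id": [1, 1, 2, 3, 3, 3],
    "name": ["A", "A", "B", "C", "C", "C"],
    "rating": [5, 4, 3, 2, 4, 3],
})

# Hypothetical similarity scores, indexed by business_id.
scores = pd.Series({1: 0.90, 2: 0.95, 3: 0.40}, name="recommendation_score")

# Attach the score to every review row of the matching business.
merged = reviews.merge(scores, left_on="business_id", right_index=True)

# Collapse to one row per business, then rank by score (not by rating).
ranked = (
    merged.groupby("business_id")
    .agg(
        name=("name", "first"),
        recommendation_score=("recommendation_score", "first"),
        rating=("rating", "mean"),
    )
    .reset_index()
    .sort_values("recommendation_score", ascending=False)
    .head(2)
)
print(ranked)
```

The rows are ordered by `recommendation_score`, so the `rating` column is not necessarily monotonic, and the leftmost index keeps the row positions from before the sort. The same applies to the printed table above.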


### Conclusion:

The table above lists the top 10 restaurants recommended for User ID 1354, together with each restaurant's average rating.

The recommendations cover a range of venues, from a diner and a burger restaurant to a brewery and a resort. Their average ratings run from 3.30 to 5.00, because the list is ordered by recommendation score rather than by rating.

In [24]:
# Define a function to calculate precision, recall, and accuracy
def calculate_metrics(actual_ratings, recommended_restaurants):
    # IDs of the recommended restaurants, stored as a set for fast membership tests
    recommended_ids = set(recommended_restaurants['business_id'])
    
    # Calculate metrics
    true_positives = sum(1 for business_id, rating in actual_ratings if business_id in recommended_ids and rating >= 4)
    false_positives = sum(1 for business_id, rating in actual_ratings if business_id not in recommended_ids and rating >= 4)
    false_negatives = sum(1 for business_id, rating in actual_ratings if business_id in recommended_ids and rating < 4)
    true_negatives = sum(1 for business_id, rating in actual_ratings if business_id not in recommended_ids and rating < 4)
    
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    accuracy = (true_positives + true_negatives) / len(actual_ratings)
    
    return precision, recall, accuracy

def calculate_metrics_movies(actual_ratings, recommended_movies):
    # IDs of the recommended movies, stored as a set for fast membership tests
    recommended_ids = set(recommended_movies['itemid'])
    
    # Calculate metrics
    true_positives = sum(1 for itemid, rating in actual_ratings if itemid in recommended_ids and rating >= 4)
    false_positives = sum(1 for itemid, rating in actual_ratings if itemid not in recommended_ids and rating >= 4)
    false_negatives = sum(1 for itemid, rating in actual_ratings if itemid in recommended_ids and rating < 4)
    true_negatives = sum(1 for itemid, rating in actual_ratings if itemid not in recommended_ids and rating < 4)
    
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    accuracy = (true_positives + true_negatives) / len(actual_ratings)
    
    return precision, recall, accuracy
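
For comparison, the sketch below spells out the conventional confusion-matrix definitions for a recommendation list. Under this convention, a false positive is a recommended item rated below the threshold, and a false negative is a relevant item missing from the list. These roles are the reverse of the `false_positives` and `false_negatives` variables in the functions above, so under this convention the precision and recall values would trade places. The helper names and toy data are hypothetical.

```python
def confusion_counts(actual_ratings, recommended_ids, threshold=4):
    """Count TP, FP, FN, TN over (item_id, rating) pairs.

    An item is 'relevant' if its rating is at least `threshold` and
    'recommended' if its ID is in `recommended_ids`.
    """
    recommended = set(recommended_ids)
    tp = fp = fn = tn = 0
    for item_id, rating in actual_ratings:
        is_relevant = rating >= threshold
        is_recommended = item_id in recommended
        if is_recommended and is_relevant:
            tp += 1  # recommended and liked
        elif is_recommended:
            fp += 1  # recommended but not liked
        elif is_relevant:
            fn += 1  # liked but not recommended
        else:
            tn += 1  # neither recommended nor liked
    return tp, fp, fn, tn


def precision_recall_accuracy(tp, fp, fn, tn):
    """Return the three metrics, with 0 for an undefined ratio."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    accuracy = (tp + tn) / total if total > 0 else 0
    return precision, recall, accuracy


# Hypothetical (item_id, rating) pairs and a two-item recommendation list.
toy_ratings = [(1, 5), (1, 3), (2, 4), (3, 5), (4, 2), (5, 5)]
counts = confusion_counts(toy_ratings, recommended_ids=[1, 2])
metrics = precision_recall_accuracy(*counts)
print(counts, metrics)
```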

In [25]:
if dataset_choice == 'yelp':
    # Assumes df_user_reviews holds the ground-truth (business_id, rating) pairs
    actual_ratings = df_user_reviews[['business_id', 'rating']].values

    # Calculate precision, recall, and accuracy
    precision, recall, accuracy = calculate_metrics(actual_ratings, top_recommended_restaurants)

    # Print the results
    print("Precision:", precision)
    print("Recall:", recall)
    print("Accuracy:", accuracy)
else:
    # Extract the actual ratings from the dataset
    actual_ratings = df_content_items_users[['itemid', 'rating']].values

    # Calculate precision, recall, and accuracy
    precision, recall, accuracy = calculate_metrics_movies(actual_ratings, top_recommended_movies)

    # Print the results
    print("Precision:", precision)
    print("Recall:", recall)
    print("Accuracy:", accuracy)

Precision: 0.001675913083680074
Recall: 0.7733333333333333
Accuracy: 0.32400508458003324
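
The three values printed above are ratios of the counts computed in the metric functions, where a rating of 4 or higher marks an item as relevant and $N$ is `len(actual_ratings)`:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{Accuracy} = \frac{TP + TN}{N}
$$

Precision and recall are set to 0 when their denominator is 0.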


# Conclusion: Performance Analysis of Recommendation Systems

In evaluating the performance of recommendation systems for the MovieLens, Netflix, and Yelp datasets, we employed key metrics: Precision, Recall, and Accuracy. Here's a comprehensive analysis of what these metrics reveal about each system:

### MovieLens Dataset:
- **Precision (0.0027)**: Indicates a small fraction of recommended movies were relevant.
- **Recall (0.6231)**: Suggests the system retrieved a large portion of relevant movies.
- **Accuracy (0.6525)**: About 65% of ratings were classified correctly as relevant or irrelevant, the highest accuracy of the three datasets.

### Netflix Dataset:
- **Precision (0.0001)**: Extremely low, implying few recommended movies were relevant.
- **Recall (0.3800)**: Moderately effective in identifying relevant movies.
- **Accuracy (0.5514)**: About 55% of ratings were classified correctly, which leaves considerable room for improvement.

### Yelp Dataset:
- **Precision (0.0017)**: Exhibits a low precision rate, indicating that a minimal portion of recommended restaurants aligns with user preferences.
- **Recall (0.7733)**: Reflects a substantial capability in retrieving relevant restaurants.
- **Accuracy (0.3240)**: Only about 32% of ratings were classified correctly as relevant or irrelevant, the lowest accuracy of the three datasets.

### Overall Insights:
- **MovieLens vs. Netflix**: Both systems displayed low precision. MovieLens, however, showed better recall (0.6231 vs. 0.3800) and better accuracy (0.6525 vs. 0.5514).
- **Yelp's Performance**: Yelp has the highest recall of the three (0.7733), but its low precision and low accuracy indicate a need to improve the relevance of its recommendations.
- **Improvement Areas**: All three systems need higher precision. Netflix also needs better recall, and Yelp needs better accuracy.


In conclusion, the Yelp system retrieves a large portion of relevant restaurants, but its low precision and accuracy leave clear room for improvement, and raising them could substantially improve the user experience. The MovieLens and Netflix systems, particularly Netflix, likewise need refinement to improve precision and provide more relevant recommendations.
