<a href="https://colab.research.google.com/github/shakirsayeed/Machine-Learning-Recommended-Systems-Assessment/blob/main/Machine_Learning_Assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - ML Case Study EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

The food industry, spanning both restaurants and food delivery services, continues to grow rapidly, and the widening variety of menu items can leave customers unsure what to order. Deciding often takes time, and the result can be unsatisfying if the dish does not match the customer's preferences. This creates an opportunity for a food recommendation system that provides personalized recommendations tailored to each customer's tastes.

Food recommendation systems aim to improve the customer experience by suggesting suitable menu items based on a customer's previous choices, the ratings they have given, or preferences they state explicitly [1]. By leveraging techniques such as data analysis, machine learning, and natural language processing, these systems can produce more accurate and relevant recommendations over time.

Apart from increasing customer satisfaction, a food recommendation system can also help food business owners increase sales by directing customers to menu items they are more likely to enjoy. The intended result is a more enjoyable dining experience, greater customer loyalty, and a stronger competitive position in the food industry.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Based on the conditions described above, the company will develop a personalized food recommendation system for its customers that answers the following questions:

*   Based on customer data, how can a personalized recommendation system be built with content-based filtering?
*   Given the available rating data, how can the restaurant recommend other foods that a customer might like and has never ordered?

**Goals**

To answer these questions, the recommendation system has the following goals:

*   Generate a number of personalized food recommendations for customers using content-based filtering.
*   Produce a number of food recommendations that match customer preferences and that the customer has never ordered, using collaborative filtering.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Data Understanding***
To achieve these goals, the following steps will be taken:

1.   **Data Understanding**: The initial stage of the project, used to understand the available data. In this case, there are 2 separate files, one with food descriptions and one with ratings.
2.   **Univariate Exploratory Data Analysis**: Each variable in the data is analyzed and explored. Where needed, the relationships between variables are explored further.
3.   **Data Preparation**: The data is prepared for modeling by handling missing values, duplicate data, and dirty data.
4.   **Model Development with Content Based Filtering**: A recommendation system is developed using content-based filtering, which recommends items that are similar, in their features or description, to items the user likes. A Count Vectorizer produces a feature representation from how often each word of the food category or description occurs, and cosine similarity measures how similar two foods are. Food recommendations are then made from these similarities.
5.   **Model Development with Collaborative Filtering**: The system recommends a number of foods based on the ratings given previously. From the user rating data, it identifies foods that match the user's preferences and that the user has never ordered.

The data used in this project is the "Food Recommendation System" dataset, downloaded through the Kaggle API. It contains two files: the first describes the foods, including their ingredients and cuisines, and the second holds the user ratings used by the recommendation system.

The first dataset has 400 rows with 5 features, consisting of non-numeric features such as Name, C_Type, Veg_Non, and Describe, as well as a numeric feature, namely Food_ID. Meanwhile, the second dataset has 512 rows of data with 3 numerical features, namely User_ID, Food_ID, and Rating.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import os
import numpy as np
import tensorflow as tf
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
# import os

base_path = '/content/drive/MyDrive/DS_Notes/Module6/Assessment'

# List files in the directory
files = os.listdir(base_path)
print(files)

food_file = 'foods.csv'
ratings_file = 'ratings.csv'

food_path = os.path.join(base_path, food_file)
ratings_path = os.path.join(base_path, ratings_file)
foods = pd.read_csv(food_path)
ratings = pd.read_csv(ratings_path)
print(f"shape of foods: {foods.shape}")
print(f"shape of ratings: {ratings.shape}")

### Dataset First View

In [None]:
# Dataset First Look
foods.head()

In [None]:
ratings.head()

### Dataset Rows & Columns count

In [None]:
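# Dataset Rows & Columns count
print(f"foods: {foods.shape[0]} rows, {foods.shape[1]} columns")
print(f"ratings: {ratings.shape[0]} rows, {ratings.shape[1]} columns")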

### Dataset Information

In [None]:
# Dataset Info
foods.info()
ratings.info()

#### Duplicate Values

In [None]:
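# Dataset Duplicate Value Count
# Print both counts; a notebook cell only displays its last expression
print(f"duplicate rows in foods: {foods.duplicated().sum()}")
print(f"duplicate rows in ratings: {ratings.duplicated().sum()}")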

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(foods.isna().sum())
print(ratings.isna().sum())

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(foods.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in Foods')
plt.show()


In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(ratings.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in Ratings')
plt.show()

### What did you know about your dataset?

The first dataset has 400 rows with 5 features, consisting of non-numeric features such as Name, C_Type, Veg_Non, and Describe, as well as a numeric feature, namely Food_ID. Meanwhile, the second dataset has 512 rows of data with 3 numerical features, namely User_ID, Food_ID, and Rating.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
foods.columns

In [None]:
ratings.columns

In [None]:
# Dataset Describe
foods.describe()

In [None]:
ratings.describe()

### Variables Description

The variables in the two datasets are as follows:

**Foods:**

*   Food_ID: Unique ID for each food in the dataset.
*   Name: Name of the food. Consists of 400 unique values.
*   C_Type: Type of food that falls into a certain category, for example Chinese, Dessert, Indian, etc. Consists of 16 unique values (top: Indian).
*   Veg_Non: Vegetarian or non-vegetarian status of the food. Consists of 2 unique values (top: veg).
*   Describe: A brief description of the food, including key ingredients, flavors, or other characteristics. Consists of 397 unique values (top: variety of rice).

**Ratings:**

*   User_ID: Unique ID for each user or customer.
*   Food_ID: ID of the rated food, matching Food_ID in the foods dataset.
*   Rating: The rating a user gave to a particular food, showing how much the user likes it. Scores range from 1 to 10, where higher scores indicate higher satisfaction.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
foods.nunique()

In [None]:
ratings.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
rate_up = ratings[ratings['Rating'] >= 8]
rate_up['Food_ID'].value_counts().head(5)

In [None]:
rate_down = ratings[ratings['Rating'] <= 3]
rate_down['Food_ID'].value_counts().head(5)

Note that the Korean category appears twice in C_Type because one of the values has leading whitespace. Splitting each value on whitespace and re-joining it removes the stray spaces and merges the duplicates.

In [None]:
foods['C_Type'] = foods['C_Type'].apply(lambda x: ' '.join(x.split()))
foods['C_Type'].unique()

In [None]:
for feature in ['C_Type', 'Veg_Non', 'Describe']:
    foods[feature] = foods[feature].apply(lambda x: re.sub(r'[^a-zA-Z]', ' ', x.lower()))

### What all manipulations have you done and insights you found?

The ratings data has 511 rows, but User_ID only takes values from 1 to 100 and Food_ID from 1 to 309. This indicates that some customers rated more than one food and that some foods were rated more than once. In addition, the Rating feature uses a scale from 1 (lowest) to 10 (highest).
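
A quick way to confirm that users and foods repeat is to compare the row count with the number of distinct IDs. The sketch below uses a small made-up frame (`toy_ratings` and its values are illustrative, not taken from `ratings.csv`); the same calls should work on the real `ratings` frame.

```python
import pandas as pd

# Illustrative data only: 6 ratings from 3 users over 4 foods
toy_ratings = pd.DataFrame({
    'User_ID': [1, 1, 2, 2, 2, 3],
    'Food_ID': [10, 11, 10, 12, 13, 10],
    'Rating':  [7, 3, 9, 5, 8, 2],
})

n_rows = len(toy_ratings)
n_users = toy_ratings['User_ID'].nunique()
n_foods = toy_ratings['Food_ID'].nunique()

# More rows than distinct IDs means some users rated several foods
# and some foods were rated by several users
ratings_per_user = toy_ratings['User_ID'].value_counts()
ratings_per_food = toy_ratings['Food_ID'].value_counts()

print(n_rows, n_users, n_foods)
print(ratings_per_user.max(), ratings_per_food.max())
```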

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(8,6))

sns.countplot(x='Rating', data=ratings)
plt.xticks(rotation=30)
plt.tight_layout()

plt.savefig("ratings.png", bbox_inches='tight')

##### 2. What is/are the insight(s) found from the chart?

The ratings given by users are spread fairly evenly across the values from 1 to 10.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Average Rating by Cuisine Type
merged_data = pd.merge(ratings, foods, on='Food_ID')
avg_rating_by_cuisine = merged_data.groupby('C_Type')['Rating'].mean().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
avg_rating_by_cuisine.plot(kind='bar', color='salmon', edgecolor='black')
plt.title('Average Rating by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Average Rating')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


##### 2. What is/are the insight(s) found from the chart?

Users tend to give higher ratings to the beverage category than to other cuisine types. A higher average rating may indicate that users prefer that cuisine type.
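
An average on its own can be misleading when a category has only a few ratings. As a hedged sketch (the `toy` frame below is made up for illustration and is not the project data), the mean can be reported together with the number of ratings behind it; on the real data the same aggregation could be applied to `merged_data`.

```python
import pandas as pd

# Illustrative data only, not taken from the project files
toy = pd.DataFrame({
    'C_Type': ['beverage', 'indian', 'indian', 'indian', 'dessert', 'dessert'],
    'Rating': [9, 6, 7, 5, 8, 4],
})

# Report the mean together with the number of ratings behind it
summary = (
    toy.groupby('C_Type')['Rating']
       .agg(['mean', 'count'])
       .sort_values('mean', ascending=False)
)
print(summary)
```

In this toy case the top category rests on a single rating, which is the kind of situation the count column is meant to expose.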

#### Chart - 3

In [None]:
# Chart - 3 visualization code
from wordcloud import WordCloud

# Concatenate all descriptions into a single string
descriptions = ' '.join(merged_data['Describe'].dropna())

# Generate word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(descriptions)

# Plot word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud of Food Descriptions')
plt.axis('off')  # Turn off axis
plt.show()


## Content Based Filtering

Content-based recommendation uses the attributes of items a user has already shown interest in as a reference for recommending other items. The goal is to measure how similar items are to one another from their attributes, so that foods resembling the ones a user likes can be suggested.

In [None]:
features = ['C_Type', 'Veg_Non', 'Describe']
foods[features]

In [None]:
foods['soup'] = foods[features].apply(lambda x: ' '.join(x), axis=1)
foods.sample(5)

The features that describe each food (C_Type, Veg_Non, and Describe) are joined into a single text column, soup, which serves as the food's profile for matching against user preferences.

**Count Vectorizer**

In [None]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(foods['soup'])

count.get_feature_names_out()

Note: for the modeling stage, CountVectorizer converts the text into a numerical representation by counting how often each word occurs in each document. Each entry of the resulting matrix is the number of times a word appears in one food's soup text.

In [None]:
count_matrix.todense()

In [None]:
pd.DataFrame(
    count_matrix.todense(),
    columns=count.get_feature_names_out(),
    index=foods['Name']
).sample(22, axis=1).sample(10, axis=0)

The CountVectorizer matrix sample above shows that the description of fish curry contains the word lemon, which appears as a value of 1 in the lemon column. Beetroot and green apple soup are also marked as containing lemon, and so on. Because the rows and columns are sampled at random, the entries shown vary between runs.

**Cosine Similarity**

In [None]:
cosine_sim = cosine_similarity(count_matrix)
cosine_sim

cosine_similarity computes a similarity score for every pair of foods from the count vectors generated above.

In [None]:
cosine_sim_df = pd.DataFrame(cosine_sim, index=foods['Name'], columns=foods['Name'])
print('Shape:', cosine_sim_df.shape)

cosine_sim_df.sample(5, axis=1).sample(10, axis=0)

Note: the matrix above holds the cosine similarity scores between foods. The higher the score, the more similar two foods are in terms of the features used in the calculation, so the scores can be used to recommend foods that resemble the ones a user prefers.
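
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below checks this by hand on three made-up count vectors (`a`, `b`, `c` and the helper `cosine` are illustrative names) and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative count vectors over a 4-word vocabulary
a = np.array([1, 1, 0, 2])
b = np.array([1, 0, 0, 2])
c = np.array([0, 0, 3, 0])

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

manual_ab = cosine(a, b)  # vectors share most words
manual_ac = cosine(a, c)  # vectors share no words

# scikit-learn computes the same quantity for every pair of rows
sim = cosine_similarity(np.vstack([a, b, c]))
print(round(manual_ab, 3), round(manual_ac, 3))
```

Vectors with no words in common score 0, and every vector scores 1 against itself, which is why the diagonal of the similarity matrix is all ones.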

In [None]:
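def food_recommendations(nama_makanan, similarity_data=cosine_sim_df,
                         items=foods[['Name', 'C_Type', 'Veg_Non']], k=5):
    # Similarity of every food to the query food; drop the query itself
    # so that it is never recommended
    scores = similarity_data.loc[:, nama_makanan].drop(nama_makanan, errors='ignore')

    # Keep the k most similar foods, most similar first
    closest = scores.sort_values(ascending=False).head(k).index

    # Attach the cuisine type and veg/non-veg label of each recommended food
    return pd.DataFrame({'Name': closest}).merge(items, on='Name')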

A function has now been defined that recommends foods using the CountVectorizer representation and the cosine similarity between foods. Below, it is tested on one of the foods.

In [None]:
foods.sample(1)

Note that eggless coffee cupcakes fall in the dessert and veg categories, so we expect the recommendations to come from similar categories.

In [None]:
food_recommendations('eggless coffee cupcakes')

The function returns 5 recommendations: 4 desserts and 1 beverage, all of them veg.

## Collaborative Filtering

Collaborative filtering uses records of how users interact with items, here the ratings they gave, to recommend items based on user behavior. The assumption is that users who rated foods similarly in the past will tend to like the same foods in the future. Collaborative filtering is generally divided into two methods, user-based and item-based; this project takes the user-based view, in which similarity between users is measured from their interaction history with items. The model learns patterns of user preferences for foods from the preferences of other users with similar tastes.
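
The user-based idea can be illustrated on a tiny rating table. This is only a sketch on made-up data (`toy`, `user_sim`, and `recommend_for` are hypothetical names), and it is not the embedding model trained later in this notebook: each unrated food is scored by the ratings of other users, weighted by how similar those users are.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative ratings only; 0 marks "not rated"
toy = pd.DataFrame(
    [[9, 8, 0, 1],
     [8, 9, 7, 0],
     [1, 0, 2, 9]],
    index=['u1', 'u2', 'u3'],
    columns=['dal', 'naan', 'lassi', 'sushi'],
)

# Similarity between users based on their rating rows
user_sim = pd.DataFrame(cosine_similarity(toy), index=toy.index, columns=toy.index)

def recommend_for(user, k=1):
    # Weight every other user's ratings by their similarity to `user`
    weights = user_sim[user].drop(user)
    scores = toy.drop(index=user).mul(weights, axis=0).sum() / weights.sum()
    # Keep only foods the user has not rated yet
    unseen = toy.columns[(toy.loc[user] == 0).to_numpy()]
    return scores[unseen].sort_values(ascending=False).head(k)

print(recommend_for('u1'))
```

Here `u1` rates foods much like `u2` does, so the score for the one food `u1` has not rated is dominated by `u2`'s rating of it.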

In [None]:
num_user = ratings['User_ID'].nunique()
num_food = foods['Food_ID'].nunique()

min_rating = ratings['Rating'].min()
max_rating = ratings['Rating'].max()

With the number of users, the number of foods, and the minimum and maximum ratings defined, the data is shuffled with the sample method before it is split into training and validation sets.

In [None]:
ratings = ratings.sample(frac=1, random_state=42)
ratings

The data has been successfully randomized and is then ready to be split into the following x and y variables.

In [None]:
x = ratings[['User_ID', 'Food_ID']].values
y = ratings['Rating'].apply(lambda x: round((x - min_rating) / (max_rating - min_rating), 2)).values

train_indices = int(0.8 * ratings.shape[0])
x_train, x_val, y_train, y_val = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:]
)

print(x, y)

The data has been split into training and validation sets, each separated into independent (x) and dependent (y) features. Next, a RecommenderNet class is built that uses embeddings to predict a match score between a user and a food for the recommendation system.
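
As a rough illustration of how an embedding-based match score works, the NumPy sketch below computes one score per (user, food) pair: the dot product of the two embedding vectors, plus a bias for the user and for the food, passed through a sigmoid. The sizes, the random embeddings, and the helper `match_score` are assumptions made for this example; in the trained model these values are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 users, 6 foods, embedding dimension 3
user_emb = rng.normal(size=(4, 3))
food_emb = rng.normal(size=(6, 3))
user_bias = np.zeros(4)
food_bias = np.zeros(6)

def match_score(user_ids, food_ids):
    # One dot product per (user, food) pair, plus the two bias terms
    dot = np.sum(user_emb[user_ids] * food_emb[food_ids], axis=1)
    logits = dot + user_bias[user_ids] + food_bias[food_ids]
    # Sigmoid squashes the score into (0, 1), the range of the scaled ratings
    return 1.0 / (1.0 + np.exp(-logits))

pairs_users = np.array([0, 0, 3])
pairs_foods = np.array([1, 5, 2])
scores = match_score(pairs_users, pairs_foods)
print(scores.shape)
```

The important detail is that the sum runs over the embedding axis only, so a batch of three pairs yields three separate scores.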

In [None]:
class RecommenderNet(tf.keras.Model):

    def __init__(self, num_users, num_food, embedding_size, **kwargs):
        super(RecommenderNet, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_food = num_food
        self.embedding_size = embedding_size
        self.user_embedding = tf.keras.layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer='glorot_uniform',
            embeddings_regularizer=tf.keras.regularizers.l2(1e-6)
        )
        self.user_bias = tf.keras.layers.Embedding(num_users, 1)
        self.food_embedding = tf.keras.layers.Embedding(
            num_food,
            embedding_size,
            embeddings_initializer='glorot_uniform',
            embeddings_regularizer=tf.keras.regularizers.l1_l2(1e-6)
        )
        self.food_bias = tf.keras.layers.Embedding(num_food, 1)

    def call(self, inputs):
        user_vector = self.user_embedding(inputs[:, 0])
        user_bias = self.user_bias(inputs[:, 0])
        food_vector = self.food_embedding(inputs[:, 1])
        food_bias = self.food_bias(inputs[:, 1])

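        # Per-sample dot product: multiply element-wise and sum over the
        # embedding axis. tf.tensordot(..., 2) would also contract over the
        # batch axis and return a single scalar for the whole batch.
        dot_user_food = tf.reduce_sum(user_vector * food_vector, axis=1, keepdims=True)
        x = dot_user_food + (user_bias + food_bias)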

        return tf.nn.sigmoid(x)

Next, carry out the compilation process of the model.

In [None]:
model = RecommenderNet(num_user+1, num_food+1, 100)

model.compile(
    loss = tf.keras.losses.BinaryCrossentropy(),
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.005),
    metrics=[tf.keras.metrics.RootMeanSquaredError()]
)

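The model uses binary cross-entropy as the loss function, SGD (stochastic gradient descent) as the optimizer, and root mean squared error (RMSE) as the evaluation metric. Binary cross-entropy is applicable here because the ratings were scaled to the range 0 to 1 and the model output passes through a sigmoid.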

In [None]:
history = model.fit(
    x = x_train,
    y = y_train,
    batch_size = 8,
    epochs = 100,
    validation_data = (x_val, y_val)
)

In [None]:
plt.plot(history.history['root_mean_squared_error'])
plt.plot(history.history['val_root_mean_squared_error'])
plt.title('model_metrics')
plt.ylabel('root_mean_squared_error')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()


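Note that training proceeds fairly smoothly and the model appears to converge by around 100 epochs. The final RMSE is around 0.30 on the training data and 0.32 on the validation data. Because the ratings were scaled to the range 0 to 1, a validation RMSE of 0.32 corresponds to roughly 2.9 points on the original 1 to 10 rating scale, so the predictions are better treated as a rough ranking signal than as precise rating estimates.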
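
The RMSE is measured on the scaled target, so it helps to convert values back to the original rating scale when reading it. The sketch below assumes the 1 to 10 rating bounds described earlier; `scale` and `unscale` are illustrative helper names, not functions from the notebook.

```python
import numpy as np

# Rating bounds as described for this dataset (1 to 10)
min_rating, max_rating = 1, 10

def scale(r):
    # Min-max scaling used for the training target
    return (r - min_rating) / (max_rating - min_rating)

def unscale(s):
    # Inverse transform: back to the original rating scale
    return s * (max_rating - min_rating) + min_rating

ratings_raw = np.array([1, 4, 10])
scaled = scale(ratings_raw)

# An error on the scaled target converts by the range only (no offset)
rmse_scaled = 0.32
rmse_points = rmse_scaled * (max_rating - min_rating)

print(scaled, unscale(scaled), rmse_points)
```

The same `unscale` step could be applied to the model's predictions to display them as ratings between 1 and 10.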

In [None]:
user_id = ratings['User_ID'].sample(1).iloc[0]
food_by_user = ratings[ratings['User_ID'] == user_id]

food_not_rate = foods[~foods['Food_ID'].isin(food_by_user['Food_ID'].values)]['Food_ID']
food_not_rate = list(
    set(food_not_rate)
    .intersection(set(foods['Food_ID']))
)

resto_not_visited = [[x] for x in food_not_rate]
user_resto_array = np.hstack(
    ([[user_id]] * len(resto_not_visited), resto_not_visited)
)

In [None]:
rate = model.predict(user_resto_array).flatten()

top_ratings_indices = rate.argsort()[-10:][::-1]
recommended_resto_ids = [
    resto_not_visited[x][0] for x in top_ratings_indices
]

print('Showing recommendations for user: {}'.format(user_id))
print('===' * 9)
print('Food with high ratings from user')
print('----' * 8)

top_resto_user = (
    food_by_user.sort_values(
        by = 'Rating',
        ascending=False
    )
    .head(5)
    ['Food_ID'].values
)

resto_df_rows = foods[foods['Food_ID'].isin(top_resto_user)]
for row in resto_df_rows.itertuples():
    print(f'{row.Name}: {row.C_Type} ({row.Veg_Non})')

print('----' * 8)
print('Top 10 food recommendation')
print('----' * 8)

# Look up the foods in ranked order; isin() would return them in dataset order
recommended_resto = foods.set_index('Food_ID').loc[recommended_resto_ids]
for row in recommended_resto.itertuples():
    print(f'{row.Name}: {row.C_Type} ({row.Veg_Non})')

Note that the recommendations cover food categories that match the user's ratings: 2 recommendations in the Indian category, 2 in healthy food, 2 in Japanese, and 1 each in Korean, Italian, Chinese, and Vietnamese.

These recommendations are consistent with user_id 67, whose highest-rated foods include the Indian, Italian, Thai, dessert, and Japanese categories.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***