# Table of contents

[<h3>1. Exploratory data analysis</h3>](#1)

[<h3>2. Collaborative Recommendation System</h3>](#2)

[<h3>3. Recommendations</h3>](#3)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive



The dataset contains food ratings from dummy data. Its composition in numbers:
* 892,586 ratings
* user IDs ranging from 1 to 7120
* 1273 foods in makanan.csv (food IDs 1 to 1273; the ratings file also uses food ID 0)

<h2> Content:</h2>

**foodrating3.csv contains the ratings that users gave to foods:**
* userId
* foodId
* rating
* timestamp

**makanan.csv contains the food information:**
* foodId
* Nama
* Tipe





# 1. Exploratory data analysis<a class="anchor" id="1"></a>




In [None]:
# Let's have a look at the CSV files
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
print(os.listdir("/content/drive/MyDrive/dataset/collaborative2"))

['foodrating3.csv', 'makanan.csv']


In [None]:
rating = pd.read_csv('/content/drive/MyDrive/dataset/collaborative2/foodrating3.csv')
rating.shape

(892586, 4)

In [None]:
rating.head()

Unnamed: 0,userId,foodId,rating,timestamp
0,1,1259,3.5,02/04/2005 23:53
1,1,692,3.5,02/04/2005 23:31
2,1,756,3.5,02/04/2005 23:33
3,1,772,3.5,02/04/2005 23:32
4,1,447,3.5,02/04/2005 23:29


In [None]:
# Drop the timestamp column since it is irrelevant for the recommendations
rating = rating.drop(columns="timestamp")

In [None]:
rating.describe()

Unnamed: 0,userId,foodId,rating
count,892586.0,892586.0,892586.0
mean,3531.553646,635.7985,3.552189
std,2021.18622,367.556937,1.050816
min,1.0,0.0,0.5
25%,1813.0,317.0,3.0
50%,3534.0,635.0,4.0
75%,5261.0,954.0,4.0
max,7120.0,1273.0,5.0


In [None]:
rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892586 entries, 0 to 892585
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   userId  892586 non-null  int64  
 1   foodId  892586 non-null  int64  
 2   rating  892586 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 20.4 MB


In [None]:
rating.isnull().sum()

userId    0
foodId    0
rating    0
dtype: int64

The dataset is clean and has no missing values.


The rating table only contains food IDs. To show the food names as well, we will later join it with the names from makanan.csv.

**makanan.csv:**

In [None]:
food = pd.read_csv('/content/drive/MyDrive/dataset/collaborative2/makanan.csv')

In [None]:
food.head(5)

Unnamed: 0,foodId,Nama,Tipe
0,1,Sosis Bakar,ayam-daging
1,2,Ngohiong Ayam Udang,ayam-daging
2,3,Rawon Ayam,ayam-daging
3,4,Usus Goreng Crispy,ayam-daging
4,5,Ceker Rica Rica,ayam-daging


In [None]:
food.describe()

Unnamed: 0,foodId
count,1273.0
mean,637.0
std,367.627756
min,1.0
25%,319.0
50%,637.0
75%,955.0
max,1273.0


In [None]:
food.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1273 entries, 0 to 1272
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   foodId  1273 non-null   int64 
 1   Nama    1273 non-null   object
 2   Tipe    1273 non-null   object
dtypes: int64(1), object(2)
memory usage: 30.0+ KB


makanan.csv also appears to be clean: none of its columns has missing values.


In [None]:
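# Merge both DataFrames on foodId to attach the food names to the ratings.
# pd.merge defaults to an inner join, so ratings without a matching food are dropped.
df = pd.merge(food, rating, on='foodId')

# Keep only the columns Nama, userId and rating
df = df[['Nama', 'userId', 'rating']]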

In [None]:
# Show the result
df.head(5)

Unnamed: 0,Nama,userId,rating
0,Sosis Bakar,14,3.5
1,Sosis Bakar,18,4.5
2,Sosis Bakar,21,4.0
3,Sosis Bakar,22,4.0
4,Sosis Bakar,24,4.0
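
The merged table ends up with fewer rows (891,874) than the rating table (892,586), most likely because the inner join drops ratings whose foodId has no entry in makanan.csv. The following is a minimal sketch of how such unmatched rows could be audited. The tables `toy_food` and `toy_rating` and their values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Illustrative stand-ins for the food and rating tables (not real data)
toy_food = pd.DataFrame({
    "foodId": [1, 2, 3],
    "Nama": ["Sosis Bakar", "Ngohiong Ayam Udang", "Rawon Ayam"],
})
toy_rating = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "foodId": [0, 1, 2, 2],
    "rating": [3.5, 4.0, 5.0, 2.5],
})

# Inner join (the pd.merge default): only foodIds present in both tables survive
inner = pd.merge(toy_food, toy_rating, on="foodId")

# Outer join with indicator=True adds a "_merge" column that says
# where each row came from: "both", "left_only" or "right_only"
audit = pd.merge(toy_food, toy_rating, on="foodId", how="outer", indicator=True)
counts = audit["_merge"].value_counts()

print(len(toy_rating), "ratings ->", len(inner), "rows after the inner join")
print(counts)
```

Rows marked `right_only` are ratings without a matching food, and rows marked `left_only` are foods that nobody rated.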


In [None]:
# Count the ratings per food name to see which foods were rated the most
count_rating = df.groupby("Nama")['rating'].count().sort_values(ascending=False)
count_rating.head(10)

Nama
Sambal Goreng Kentang Udang    766
Mie Ayam Bangka                764
Nasi Goreng Magelangan         761
Tahu Bubuk                     761
Gyoza                          757
Pesmol Ikan Nila               757
Lontong Sayur Padang           755
Tongseng Sapi                  755
Sambal Cibiuk                  755
Kepiting Lada Hitam            754
Name: rating, dtype: int64

In [None]:
# Select the foods with at least 500 ratings
min_ratings = 500
popular_foods = count_rating[count_rating >= min_ratings].index

# Keep only the foods with at least 500 ratings in the DataFrame
df_r = df[df['Nama'].isin(popular_foods)]

In [None]:
# Display the number of ratings for each food,
# keeping only the foods with at least 500 ratings
df_r.groupby("Nama")['rating'].count().sort_values(ascending=False)

Nama
Sambal Goreng Kentang Udang    766
Mie Ayam Bangka                764
Nasi Goreng Magelangan         761
Tahu Bubuk                     761
Gyoza                          757
                              ... 
Madu Mongso                    641
Kering Tempe (Orek Tempe)      639
Nasi Liwet Sunda Teri          628
Jambal Roti Tumis              624
Cheese Tart                    619
Name: rating, Length: 1273, dtype: int64
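
As an alternative sketch, the same kind of popularity filter can be written in one step with `groupby(...).transform('count')`, which attaches each food's rating count to every one of its rows. The table `toy` and the threshold of 2 are illustrative assumptions chosen so that the filter actually removes something.

```python
import pandas as pd

# Illustrative ratings table (not real data)
toy = pd.DataFrame({
    "Nama": ["Gyoza", "Gyoza", "Gyoza", "Kimbab", "Tahu Bubuk", "Tahu Bubuk"],
    "userId": [1, 2, 3, 1, 2, 3],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

min_ratings = 2  # hypothetical threshold for this toy example

# For every row, the number of ratings its food received
ratings_per_food = toy.groupby("Nama")["rating"].transform("count")

# Keep only rows belonging to foods that reach the threshold
kept = toy[ratings_per_food >= min_ratings]
print(kept)
```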

In [None]:
before = len(df.Nama.unique())
after = len(df_r.Nama.unique())
rows_before = df.shape[0]
rows_after = df_r.shape[0]
print(f'''There are {before} foods in the dataset before filtering and {after} foods after the filtering.

{before} foods => {after} foods
{rows_before} rows before filtering => {rows_after} rows after filtering''')

There are 1273 foods in the dataset before filtering and 1273 foods after the filtering.

1273 foods => 1273 foods
891874 rows before filtering => 891874 rows after filtering

Every food in the dataset has at least 619 ratings, so the 500-rating threshold removes nothing here.


# 2. Collaborative Recommendation System<a class="anchor" id="2"></a>

In [None]:
# Create a matrix with userId as rows and the food names as columns.
# Each cell holds the rating given by the user to the food.
# There will be a lot of NaN values, because each user has rated only a small share of the foods
foods = df_r.pivot_table(index='userId',columns='Nama',values='rating')
foods.iloc[:5,:5]

Nama,Abon Ikan Tongkol,Abon Sapi,Abon Tuna,Acar Timun,Almond Crispy
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,,,,,3.5
2,,,,,
3,,,,5.0,5.0
4,,,,,
5,,,,,
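
To make the shape of this user-item matrix concrete, here is a small sketch on made-up data. The table `toy` is an illustrative assumption. Note that `pivot_table` aggregates with the mean by default, so if a user rated the same food more than once, the cell holds the average of those ratings.

```python
import pandas as pd

# Illustrative long-format ratings (not real data).
# User 1 rated "Gyoza" twice to show the default mean aggregation.
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 3],
    "Nama": ["Gyoza", "Gyoza", "Kimbab", "Kimbab", "Gyoza"],
    "rating": [3.0, 5.0, 4.0, 2.0, 1.0],
})

# Rows: users, columns: foods, cells: (mean) rating or NaN if not rated
matrix = toy.pivot_table(index="userId", columns="Nama", values="rating")
print(matrix)

# Share of empty cells, a rough measure of how sparse the matrix is
sparsity = matrix.isna().to_numpy().mean()
print(f"Share of missing cells: {sparsity:.2f}")
```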


In [None]:
# Let's choose a popular food
# (named food_name so that the food DataFrame is not overwritten)
food_name = 'Mie Ayam Bangka'

# Display the first ratings of the users for this food
foods[food_name].head(5)

userId
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
Name: Mie Ayam Bangka, dtype: float64

In [None]:
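# Function to find the foods whose ratings correlate with a given food
def find_corr(df_foods, food_name):
    """Return all foods sorted by their correlation with food_name."""
    # Correlate every column with the chosen food's column; use the matrix
    # passed in (df_foods) rather than the global foods DataFrame
    similar_to_food = df_foods.corrwith(df_foods[food_name])
    similar_to_food = pd.DataFrame(similar_to_food, columns=['Correlation'])
    similar_to_food = similar_to_food.sort_values(by='Correlation', ascending=False)
    return similar_to_food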
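
As a small illustration of what `corrwith` computes here: for each food column, it takes the Pearson correlation with the chosen column, using only the users who rated both foods. The sketch below uses made-up ratings (`toy` and the columns `A`, `B`, `C` are illustrative assumptions) and also counts how many users each correlation is based on, since a correlation built on very few common raters can be unreliable.

```python
import numpy as np
import pandas as pd

# Illustrative user-item matrix (not real data); NaN means "not rated"
toy = pd.DataFrame(
    {
        "A": [5.0, 4.0, 1.0, 2.0, np.nan],
        "B": [4.0, 5.0, 2.0, 1.0, 3.0],
        "C": [1.0, np.nan, 5.0, 4.0, 2.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

# Pearson correlation of every column with column "A",
# computed on the users who rated both foods
corr = toy.corrwith(toy["A"])

# Number of users who rated both the food and "A"
rated = toy.notna().astype(int)
overlap = rated.mul(rated["A"], axis=0).sum()

print(pd.DataFrame({"Correlation": corr, "CommonRaters": overlap}))
```

In this toy matrix, `B` is rated similarly to `A` (positive correlation), while `C` is rated in the opposite way (negative correlation).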

# 3. Recommendations <a class="anchor" id="3"></a>




## 3.1. Mie Ayam Bangka




In [None]:
# Let's try with the second most-rated food
food_name = 'Mie Ayam Bangka'
find_corr(foods, food_name).head(5)

Unnamed: 0_level_0,Correlation
Nama,Unnamed: 1_level_1
Mie Ayam Bangka,1.0
Ayam Kecap Jahe,0.440634
Ratatouille,0.412222
Pempek Lenggang,0.384053
Sambal Cumi,0.362979


We can see that users who rated Mie Ayam Bangka highly are most likely to also like Ayam Kecap Jahe, because it has the highest correlation (0.44) after the food itself. The foods at the bottom of the full sorted list, with the lowest correlations, are the ones these users are least likely to like.

## 3.2. Sambal Goreng Kentang Udang

In [None]:
# Let's try with the most-rated food
food_name = 'Sambal Goreng Kentang Udang'
find_corr(foods, food_name).head(5)

Unnamed: 0_level_0,Correlation
Nama,Unnamed: 1_level_1
Sambal Goreng Kentang Udang,1.0
Martabak Mie,0.384575
Kue Tusuk Gigi,0.384289
Martabak Manis Keju Susu,0.376577
Kimbab,0.373212


This correlation-based recommendation system is simple, yet it is effective as a basic recommender. It works well for recommending foods to someone who likes one specific food.
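
The lookups above always return the chosen food itself in first place with a correlation of 1.0. A minimal sketch of a helper that drops it and returns only the top suggestions is shown below. The function `recommend`, its parameter `n`, and the table `toy` are hypothetical names and data introduced for illustration only.

```python
import numpy as np
import pandas as pd

def recommend(user_item, food_name, n=3):
    """Hypothetical helper: the n foods most correlated with food_name."""
    corr = user_item.corrwith(user_item[food_name])
    # Remove the food itself and any food without a defined correlation
    corr = corr.drop(labels=food_name).dropna()
    return corr.sort_values(ascending=False).head(n)

# Illustrative user-item matrix (not real data); NaN means "not rated"
toy = pd.DataFrame(
    {
        "A": [5.0, 4.0, 1.0, 2.0, np.nan],
        "B": [4.0, 5.0, 2.0, 1.0, 3.0],
        "C": [1.0, np.nan, 5.0, 4.0, 2.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

top = recommend(toy, "A", n=1)
print(top)
```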