# **0. References**
## **0-1. Websites**
- 데엔잘하고싶은데엔 - Recommender Systems 02. Implementing Content-Based Filtering | [[Blog link]](https://pearlluck.tistory.com/666)

## **0-2. Dataset Sources**
- Kaggle - Zomato Restaurant Dataset | [[Dataset link]](https://www.kaggle.com/datasets/deewakarchakraborty/zomato-restaurant-dataset)
- Kaggle - Chennai Zomato Restaurants Data | [[Dataset link]](https://www.kaggle.com/datasets/phiitm/chennai-zomato-restaurants-data)


In [263]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np
import ast
import os
import re

In [264]:
ROOT_PATH    = '/home/jovyan/project/MidnightTable'
DATASET_PATH = f'{ROOT_PATH}/TableSeekTag/src/data/zomato' 

## **1. Building a recommender with the first dataset**
### **1-1. EDA and data preprocessing**
- Rows whose rating is '-' or 'New' are assigned a review score of 0.0.
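The same cleanup can also be written without a helper function. The following is a minimal sketch, assuming every non-numeric placeholder should become 0.0; the toy `ratings` Series is illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy ratings column (illustrative values, not from the dataset).
ratings = pd.Series(['4.2', '-', 'New', '3.9'])

# errors='coerce' turns anything that is not a number into NaN,
# which fillna then replaces with 0.0.
cleaned = pd.to_numeric(ratings, errors='coerce').fillna(0.0)
print(cleaned.tolist())  # [4.2, 0.0, 0.0, 3.9]
```

Compared with a `try`/`except` helper, this keeps the conversion vectorized, but it also maps every other unparseable value to 0.0, which may or may not be desired.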


In [265]:
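# Replace the commas in a cuisine string with spaces.
split_cuisine = lambda cuisine: ' '.join(cuisine.split(','))

def rescoring(rating):
    # Non-numeric placeholders such as '-' or 'New' become 0.0.
    try: return float(rating)
    except (TypeError, ValueError): return 0.0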

csv = pd.read_csv(f'{DATASET_PATH}/HyderabadResturants.csv')[['names', 'cuisine', 'price for one', 'ratings']]
csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 657 entries, 0 to 656
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   names          657 non-null    object
 1   cuisine        657 non-null    object
 2   price for one  657 non-null    int64 
 3   ratings        657 non-null    object
dtypes: int64(1), object(3)
memory usage: 20.7+ KB


In [266]:
csv['cuisine'] = csv['cuisine'].apply(split_cuisine)
csv.sort_values('ratings', ascending = True).head(5)

Unnamed: 0,names,cuisine,price for one,ratings
500,Heavenly Biryani,Biryani Beverages,150,-
543,Hangout,Biryani Chinese Shake,250,-
544,20 Curries Meals,South Indian North Indian,250,-
547,Cost To Cost Curry Point,South Indian,250,-
549,Non Veg And Veg Bojanam,Andhra Street Food,150,-


In [267]:
csv['ratings'] = csv['ratings'].apply(rescoring)
csv.sort_values('ratings', ascending = True).head(5)

Unnamed: 0,names,cuisine,price for one,ratings
489,Jai Balaji Mithai Bhandar,Mithai Street Food,150,0.0
435,Hussain's Food Court,Chinese South Indian,150,0.0
433,1 Kg Curries,South Indian,150,0.0
557,Telugu Vari Inti Ruchulu,South Indian,150,0.0
558,Smoodies,Beverages,150,0.0


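### **1-2. Computing cosine similarity**
- The cuisine strings in the DataFrame's 'cuisine' column are vectorized, and their pairwise cosine similarity is computed.
- The DataFrame of recommended restaurants is sorted by rating in descending order.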
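The two steps above can be sketched on a handful of toy cuisine strings. The `cuisines` list is illustrative and not taken from the dataset; the vectorizer settings mirror the ones used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy cuisine strings (illustrative, not from the dataset).
cuisines = [
    'Burger Fast Food',        # row 0
    'Fast Food',               # row 1
    'South Indian',            # row 2
    'Burger Fast Food Shake',  # row 3
]

# Count unigrams, bigrams and trigrams of each cuisine string.
vectors = CountVectorizer(ngram_range=(1, 3)).fit_transform(cuisines)

# sims[i, j] is the cosine similarity between rows i and j.
sims = cosine_similarity(vectors)

# For each row, the positions of all rows ordered from most to least similar.
ranked = sims.argsort()[:, ::-1]
print(ranked[0])  # [0 3 1 2]
```

Row 0 is most similar to itself, then to row 3 (which contains all of its n-grams), then to row 1, and shares nothing with row 2. Note that `ranked` holds row positions, not index labels.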


In [268]:
vectorizer   = CountVectorizer(ngram_range = (1, 3))
cuisine_vecs = vectorizer.fit_transform(csv['cuisine'])
cuisine_sims = cosine_similarity(cuisine_vecs, cuisine_vecs).argsort()[:, ::-1]

In [269]:
def filtering(name, top = 15):
    restaurant_idx = csv[csv['names'] == name].index.values
    sim_index      = cuisine_sims[restaurant_idx, :top].reshape(-1)
    
    sim_index = sim_index[sim_index != restaurant_idx] 
    result    = csv.iloc[sim_index].sort_values(['ratings', 'price for one'], ascending = [False, True])
    
    return result

In [271]:
filtering('Burger King')

Unnamed: 0,names,cuisine,price for one,ratings
496,US Live Pops,Fast Food,150,4.6
307,Burger It Up,Burger Fast Food Beverages Shake,200,4.2
486,Occasionkart Cakes,Bakery Desserts Pizza Sandwich Burger Fas...,100,4.1
569,V Cafe- Meals By PVR,Burger Sandwich Fast Food,100,4.0
607,PopKing Popcorn,Fast Food,250,4.0
1,KFC,Burger Fast Food Biryani Desserts Beverages,100,3.9
488,Govind Di Maggie,Fast Food,100,3.8
581,Zaika Bites,Fast Food,100,3.6
522,Turkish Shawarma And Burgers,Fast Food Shawarma,100,3.6
509,Burger Hub's,Burger Fast Food Desserts,150,3.6


## **2. Building a recommender with the second dataset**
### **2-1. EDA and data preprocessing**

In [279]:
csv2 = pd.read_csv(f'{DATASET_PATH}/Zomato Chennai Listing 2020.csv')
data = csv2.iloc[:, 1: ]
data.head(2)

Unnamed: 0,Name of Restaurant,Address,Location,Cuisine,Top Dishes,Price for 2,Dining Rating,Dining Rating Count,Delivery Rating,Delivery Rating Count,Features
0,Yaa Mohaideen Briyani,"336 & 338, Main Road, Pallavaram, Chennai",Pallavaram,['Biryani'],"['Bread Halwa', ' Chicken 65', ' Mutton Biryan...",500.0,4.3,1500,4.3,9306,"['Home Delivery', 'Indoor Seating']"
1,Sukkubhai Biriyani,"New 14, Old 11/3Q, Railway Station Road, MKN ...",Alandur,"['Biryani', ' North Indian', ' Mughlai', ' Des...","['Beef Biryani', ' Beef Fry', ' Paratha', ' Pa...",1000.0,4.4,3059,4.1,39200,"['Home Delivery', 'Free Parking', 'Table booki..."


In [280]:
print(f'before filtering : {len(data)}')

# Drop rows without ratings or with invalid dish lists; copy so the column
# assignments below do not write to a slice of the original frame.
data = data[data['Dining Rating'] != 'None']
data = data[data['Delivery Rating'] != 'None']
data = data[data['Top Dishes'] != 'Invalid'].copy()

print(f'after filtering : {len(data)}')

data['Price for 2'] = data['Price for 2'].astype(float)
data['Dining Rating'] = data['Dining Rating'].astype(float)
data['Dining Rating Count'] = data['Dining Rating Count'].astype(int)
data['Delivery Rating'] = data['Delivery Rating'].astype(float)
data['Delivery Rating Count'] = data['Delivery Rating Count'].astype(int)

data.info()

before filtering : 12032
after filtering : 1678
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1678 entries, 0 to 11134
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Name of Restaurant     1678 non-null   object 
 1   Address                1678 non-null   object 
 2   Location               1678 non-null   object 
 3   Cuisine                1678 non-null   object 
 4   Top Dishes             1678 non-null   object 
 5   Price for 2            1678 non-null   float64
 6   Dining Rating          1678 non-null   float64
 7   Dining Rating Count    1678 non-null   int64  
 8   Delivery Rating        1678 non-null   float64
 9   Delivery Rating Count  1678 non-null   int64  
 10  Features               1678 non-null   object 
dtypes: float64(3), int64(2), object(6)
memory usage: 157.3+ KB


In [281]:
## Combine each rating with its rating count into a single weighted score
## (applied to both the dining rating and the delivery rating).

def weighted_rating(df, column, alpha = 0.9):
    m = df[column].mean()                        # mean rating over all restaurants
    q = df[f'{column} Count'].quantile(alpha)    # alpha-quantile of the rating counts

    v = df[f'{column} Count']                    # rating count per restaurant
    r = df[column]                               # rating per restaurant

    return ((r / m) + (q / v))

data['Dining Score']   = weighted_rating(data, 'Dining Rating')
data['Delivery Score'] = weighted_rating(data, 'Delivery Rating')
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1678 entries, 0 to 11134
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Name of Restaurant     1678 non-null   object 
 1   Address                1678 non-null   object 
 2   Location               1678 non-null   object 
 3   Cuisine                1678 non-null   object 
 4   Top Dishes             1678 non-null   object 
 5   Price for 2            1678 non-null   float64
 6   Dining Rating          1678 non-null   float64
 7   Dining Rating Count    1678 non-null   int64  
 8   Delivery Rating        1678 non-null   float64
 9   Delivery Rating Count  1678 non-null   int64  
 10  Features               1678 non-null   object 
 11  Dining Score           1678 non-null   float64
 12  Delivery Score         1678 non-null   float64
dtypes: float64(5), int64(2), object(6)
memory usage: 183.5+ KB
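
For comparison, a commonly used weighted-rating formulation (often attributed to IMDb) shrinks a rating toward the global mean when it is backed by few votes. This is a different formula from the one used in this notebook and is shown only as a sketch on toy data; the function name `bayesian_weighted_rating` and the sample values are hypothetical.

```python
import pandas as pd

def bayesian_weighted_rating(df, column, alpha=0.9):
    """Hypothetical helper: shrink each rating toward the global mean."""
    m = df[column].mean()                      # mean rating over all rows
    q = df[f'{column} Count'].quantile(alpha)  # vote-count threshold
    v = df[f'{column} Count']                  # votes per restaurant
    r = df[column]                             # rating per restaurant
    # Many votes: score stays close to r. Few votes: score moves toward m.
    return (v / (v + q)) * r + (q / (v + q)) * m

# Toy data (illustrative, not from the dataset).
toy = pd.DataFrame({
    'Dining Rating':       [4.5, 4.5, 3.0],
    'Dining Rating Count': [10, 1000, 500],
})
toy['Dining Score'] = bayesian_weighted_rating(toy, 'Dining Rating')
print(toy)
```

With this variant the two restaurants rated 4.5 receive different scores: the one with 1000 votes stays closer to 4.5 than the one with 10 votes, and every score remains within the original rating range.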


In [282]:
data = data[['Name of Restaurant', 'Address', 'Location',
            'Cuisine', 'Top Dishes', 'Price for 2', 'Features',
             'Dining Score', 'Delivery Score']]

data.tail()

Unnamed: 0,Name of Restaurant,Address,Location,Cuisine,Top Dishes,Price for 2,Features,Dining Score,Delivery Score
10867,Perambur Sri Srinivasa,"98, MTH Road, Padi, Ambattur, Chennai",Ambattur,"['North Indian', ' South Indian', ' Chinese', ...","['Fried Rice', ' Panneer Butter Masala']",400.0,"['Breakfast', 'Home Delivery', 'Vegetarian Onl...",7.872212,115.428223
10964,Hot Breads,"147, Ground Floor, Sivagami Square, GN Chetty...",T. Nagar,"['Bakery', ' Fast Food', ' Beverages']","['Choco Truffle', ' Cupcake']",450.0,"['Home Delivery', 'Catering Available', 'Desse...",18.522958,380.691735
11035,Uzo Sandwiches,"71, Pilayar Koil Street, SRM Nagar, Potheri, ...",Potheri,"['Sandwich', ' Burger', ' Wraps', ' Fast Food'...","['Burgers', ' Chicken Sandwich', ' Pizza Sandw...",300.0,"['Home Delivery', 'Free Parking', 'Indoor Seat...",2.90754,2.229818
11080,The Cake World,"40 A, Velachery Main Road, Velachery, Chennai",Velachery,['Bakery'],['Blackforest Cake'],500.0,"['Home Delivery', 'Indoor Seating', 'Desserts ...",15.162027,186.627493
11134,Nandhiniee Sweets,"55, South Usman Road, T. Nagar, Chennai",T. Nagar,"['Mithai', ' Fast Food', ' North Indian']","['Chaat', ' Rasmalai']",400.0,"['Home Delivery', 'Vegetarian Only', 'Free Par...",17.906532,10.0501


In [283]:
## Parse the stringified lists and join their items into one space-separated string.
data['Cuisine']  = data['Cuisine'].apply(ast.literal_eval).apply(lambda x: ' '.join(x))
data['Features'] = data['Features'].apply(ast.literal_eval).apply(lambda x: ' '.join(x))
data['Top Dishes'] = data['Top Dishes'].apply(ast.literal_eval).apply(lambda x: ' '.join(x))

In [284]:
data.head()

Unnamed: 0,Name of Restaurant,Address,Location,Cuisine,Top Dishes,Price for 2,Features,Dining Score,Delivery Score
0,Yaa Mohaideen Briyani,"336 & 338, Main Road, Pallavaram, Chennai",Pallavaram,Biryani,Bread Halwa Chicken 65 Mutton Biryani Chick...,500.0,Home Delivery Indoor Seating,1.704344,2.001089
1,Sukkubhai Biriyani,"New 14, Old 11/3Q, Railway Station Road, MKN ...",Alandur,Biryani North Indian Mughlai Desserts Beve...,Beef Biryani Beef Fry Paratha Paya Brinjal...,1000.0,Home Delivery Free Parking Table booking recom...,1.448078,1.265242
2,SS Hyderabad Biryani,"98/339, Arcot Road, Opposite Gokulam Chit Fun...",Kodambakkam,Biryani North Indian Chinese Arabian,Brinjal Curry Tandoori Chicken Chicken Grill...,500.0,Home Delivery Indoor Seating,1.761054,1.92468
3,KFC,"10, Periyar Nagar, 70 Feet Road, Near Sheeba ...",Perambur,Burger Fast Food Finger Food Beverages,Zinger Burger,500.0,Home Delivery Free Parking Card Upon Delivery ...,1.825403,1.772309
4,Tasty Kitchen,"135B, SRP Colony, Peravallur, Near Perambur, ...",Perambur,Chinese Biryani North Indian Chettinad Ara...,Mutton Biryani Chicken Rice Tomato Rice Sha...,450.0,Home Delivery Indoor Seating,2.472273,1.425061


In [285]:
vectorizer   = CountVectorizer(ngram_range = (1, 3))
cuisine_vecs = vectorizer.fit_transform(data['Cuisine'])
cuisine_sims = cosine_similarity(cuisine_vecs, cuisine_vecs).argsort()[:, ::-1]

In [286]:
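def filtering(name, top = 15):
    # cuisine_sims holds row positions, so look the restaurant up by position:
    # after filtering, the index labels (0 to 11134) no longer match the 1678 positions.
    restaurant_idx = np.flatnonzero((data['Name of Restaurant'] == name).to_numpy())
    sim_index      = cuisine_sims[restaurant_idx, :top].reshape(-1)
    
    # Drop the queried restaurant itself (also works when the name matches several rows).
    sim_index = sim_index[~np.isin(sim_index, restaurant_idx)]
    result    = data.iloc[sim_index].sort_values('Price for 2', ascending = True)
    
    return result

In [287]:
## The label-based lookup originally used here raised
## "IndexError: index 8790 is out of bounds for axis 0 with size 1678".
filtering('KFC')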
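
`argsort` returns row positions, whereas `.index` returns row labels, and once rows have been filtered out the two no longer coincide. A minimal sketch on a toy frame (the `toy` data is illustrative) shows the difference:

```python
import pandas as pd

# Toy frame whose labels (0, 5, 9) no longer match positions (0, 1, 2),
# as happens after rows are filtered out.
toy = pd.DataFrame({'name': ['A', 'B', 'C']}, index=[0, 5, 9])

label = toy[toy['name'] == 'C'].index.values[0]  # label 9
position = toy.index.get_loc(label)              # position 2

print(toy.iloc[position]['name'])  # positional lookup works: prints C

try:
    toy.iloc[label]                # label 9 used as a position
except IndexError as err:
    print('IndexError:', err)
```

Calling `reset_index(drop=True)` right after filtering is another way to keep labels and positions aligned, at the cost of discarding the original row labels.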