# Library

## Imports

In [2]:
!pip install kagglehub



In [3]:
import pandas as pd
import kagglehub


# Dataset loading and filtering

## Rationale for the dataset choice

The following datasets were chosen for exploration and for use with **SVD++** because they describe either the relationships between users and businesses (the information the RecSys is meant to reconstruct) or features of the users or items themselves (as the business dataset does):
- business dataset
- tip dataset
- review dataset

The user dataset contains only user-related information and nothing about user-item (U-I) relations, and the checkin dataset contains only a potential feature for businesses, so neither is used.

Normally the datasets are downloaded from Kaggle, but to save time a local path is used instead.
To download the datasets again, uncomment the code below.

In [4]:
path = "/Users/simon/.cache/kagglehub/datasets/yelp-dataset/yelp-dataset/versions/4"

# path = kagglehub.dataset_download("yelp-dataset/yelp-dataset")
# 
# print("Path to dataset files:", path)
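The choice between a cached local copy and a fresh download can also be made automatically. The following is a minimal sketch, not part of the original notebook: `resolve_dataset_path` is a hypothetical helper, and the download step is passed in as a callable so that nothing is fetched unless the local directory is missing.

```python
from pathlib import Path
from typing import Callable


def resolve_dataset_path(local_path: str, download: Callable[[], str]) -> str:
    """Return local_path if it is an existing directory, else call download().

    Hypothetical helper for illustration: `download` is any zero-argument
    callable that returns the directory containing the dataset files.
    """
    if Path(local_path).is_dir():
        # The cached copy exists, so skip the download entirely.
        return local_path
    # No local copy: fall back to the (potentially slow) download.
    return download()
```

In this notebook it could be called as `path = resolve_dataset_path(path, lambda: kagglehub.dataset_download("yelp-dataset/yelp-dataset"))`, which keeps the hard-coded path as a shortcut rather than a requirement.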

Business features:
- `business_id` - id of a business
- `name` - name of the business
- `address` - address of the business (geographical data)
- `city` - city of the business (geographical data)
- `state` - state of the business (geographical data)
- `postal_code` - postal code of the business (geographical data)
- `latitude` - latitude of the business (geographical data)
- `longitude` - longitude of the business (geographical data)
- `review_count` - number of reviews gathered for the business
- `is_open` - whether the business is still open (closed businesses are not relevant for recommendations)
- `attributes` - business attributes (no context provided)
- `categories` - categories the business belongs to
- `hours` - opening hours

In [5]:
business_df = pd.read_json(f"{path}/yelp_academic_dataset_business.json", lines=True)
business_df

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150341,IUQopTMmYQG-qRtBk-8QnA,Binh's Nails,3388 Gateway Blvd,Edmonton,AB,T6J 5H2,53.468419,-113.492054,3.0,13,1,"{'ByAppointmentOnly': 'False', 'RestaurantsPri...","Nail Salons, Beauty & Spas","{'Monday': '10:0-19:30', 'Tuesday': '10:0-19:3..."
150342,c8GjPIOTGVmIemT7j5_SyQ,Wild Birds Unlimited,2813 Bransford Ave,Nashville,TN,37204,36.115118,-86.766925,4.0,5,1,"{'BusinessAcceptsCreditCards': 'True', 'Restau...","Pets, Nurseries & Gardening, Pet Stores, Hobby...","{'Monday': '9:30-17:30', 'Tuesday': '9:30-17:3..."
150343,_QAMST-NrQobXduilWEqSw,Claire's Boutique,"6020 E 82nd St, Ste 46",Indianapolis,IN,46250,39.908707,-86.065088,3.5,8,1,"{'RestaurantsPriceRange2': '1', 'BusinessAccep...","Shopping, Jewelry, Piercing, Toy Stores, Beaut...",
150344,mtGm22y5c2UHNXDFAjaPNw,Cyclery & Fitness Center,2472 Troy Rd,Edwardsville,IL,62025,38.782351,-89.950558,4.0,24,1,"{'BusinessParking': '{'garage': False, 'street...","Fitness/Exercise Equipment, Eyewear & Optician...","{'Monday': '9:0-20:0', 'Tuesday': '9:0-20:0', ..."


Reasons for dropping features:
- geographical data cannot be used, since we have no comparable location information about users
- some features carry no information relevant to the RecSys (e.g. `name` or `hours`)
- the `attributes` feature comes without the context needed to analyze it

Features that remain:
- `business_id` - id of the business
- `review_count` - can potentially contribute to an **implicit rating**
- `categories` - can potentially contribute to an **implicit rating**; it also needs to be split out into a separate entity
- `is_open` - the ratio of open to closed businesses needs to be checked, and businesses filtered on this feature, since recommending a closed business makes no sense

In [None]:
filtered_business_id = business_df[['business_id', 'review_count', 'is_open', 'categories']]
filtered_business_id
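The two follow-up steps named for `is_open` and `categories` can be sketched on a small stand-in frame. This is only an illustration under stated assumptions: `toy_business` and its values are made up, and splitting `categories` on commas assumes the comma-separated format visible in the output above.

```python
import pandas as pd

# Toy stand-in for filtered_business_id (values are made up for illustration).
toy_business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3", "b4"],
    "review_count": [7, 15, 22, 80],
    "is_open": [0, 1, 1, 1],
    "categories": [
        "Doctors, Health & Medical",
        "Shipping Centers, Notaries",
        None,
        "Restaurants, Food",
    ],
})

# Share of open (1) vs. closed (0) businesses.
open_ratio = toy_business["is_open"].value_counts(normalize=True)

# Keep only businesses that are still open.
open_business = toy_business[toy_business["is_open"] == 1]

# Separate entity: one row per (business_id, category) pair.
# Businesses without any categories are dropped from this table.
business_categories = (
    open_business[["business_id", "categories"]]
    .dropna(subset=["categories"])
    .assign(category=lambda df: df["categories"].str.split(","))
    .explode("category")
    .reset_index(drop=True)
    .assign(category=lambda df: df["category"].str.strip())
    .drop(columns="categories")
)
```

Keeping the categories in a long-format table leaves the business table with one row per business while still allowing joins on `business_id`.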

Review features:
- `review_id` | `user_id` | `business_id` - id of the review and foreign keys (one user can leave several reviews for one item)
- `stars` - **explicit rating** given by the user to a particular item at a particular moment
- `useful` | `funny` | `cool` - counts of the corresponding user flags on the review
- `text` - content of the review (can be useful for potential sentiment analysis)
- `date` - timestamp of the review

All of these features can be used in future development and need to be analysed.
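Because one user can review the same business several times, a rating matrix needs some rule for collapsing repeated pairs. The sketch below shows one possible rule, keeping the most recent review per (user, business) pair. That policy is an assumption for illustration, not a decision made in this notebook, and `toy_reviews` contains made-up values.

```python
import pandas as pd

# Toy stand-in for review_df (values are made up for illustration).
toy_reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3", "r4"],
    "user_id": ["u1", "u1", "u2", "u1"],
    "business_id": ["b1", "b1", "b1", "b2"],
    "stars": [3, 5, 4, 2],
    "date": pd.to_datetime(
        ["2018-07-07", "2019-01-03", "2014-02-05", "2015-01-04"]
    ),
})

# How many reviews each (user, business) pair has, and which pairs repeat.
reviews_per_pair = toy_reviews.groupby(["user_id", "business_id"]).size()
repeated_pairs = reviews_per_pair[reviews_per_pair > 1]

# Assumed policy: keep only the latest review of each pair.
latest_reviews = (
    toy_reviews.sort_values("date")
    .drop_duplicates(subset=["user_id", "business_id"], keep="last")
    .sort_index()
)
```

Other rules, such as averaging `stars` per pair, would fit the same structure. Which one suits SVD++ best is left to the analysis that follows.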

In [7]:
review_df = pd.read_json(f"{path}/yelp_academic_dataset_review.json", lines=True)
review_df

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15
...,...,...,...,...,...,...,...,...,...
6990275,H0RIamZu0B0Ei0P4aeh3sQ,qskILQ3k0I_qcCMI-k6_QQ,jals67o91gcrD4DC81Vk6w,5,1,2,1,Latest addition to services from ICCU is Apple...,2014-12-17 21:45:20
6990276,shTPgbgdwTHSuU67mGCmZQ,Zo0th2m8Ez4gLSbHftiQvg,2vLksaMmSEcGbjI5gywpZA,5,2,1,2,"This spot offers a great, affordable east week...",2021-03-31 16:55:10
6990277,YNfNhgZlaaCO5Q_YJR4rEw,mm6E4FbCMwJmb7kPDZ5v2Q,R1khUUxidqfaJmcpmGd4aw,4,1,0,0,This Home Depot won me over when I needed to g...,2019-12-30 03:56:30
6990278,i-I4ZOhoX70Nw5H0FwrQUA,YwAMC-jvZ1fvEUum6QkEkw,Rr9kKArrMhSLVE9a53q-aA,5,1,0,0,For when I'm feeling like ignoring my calorie-...,2022-01-19 18:59:27


Tip features:
- `user_id` | `business_id` - ids of the user and the business; neither the ids nor their pairs are guaranteed to be unique
- `text` - content of the tip (can be useful for potential sentiment analysis)
- `date` - publication date
- `compliment_count` - how many users complimented the tip
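The uncertainty about uniqueness noted for `user_id` and `business_id` can be quantified directly. A minimal sketch on made-up data follows. `toy_tips` is a hypothetical stand-in for the tip dataset, and the same calls would apply to the real frame once it is loaded.

```python
import pandas as pd

# Toy stand-in for the tip dataset (values are made up for illustration).
toy_tips = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "business_id": ["b1", "b1", "b1", "b2"],
    "text": ["Great pizza", "Try the calzone", "Cash only", "Open late"],
    "date": pd.to_datetime(
        ["2012-05-18", "2013-02-05", "2013-08-18", "2017-06-27"]
    ),
    "compliment_count": [0, 0, 1, 0],
})

# Distinct users and businesses that appear in the tips.
n_users = toy_tips["user_id"].nunique()
n_businesses = toy_tips["business_id"].nunique()

# Rows whose (user_id, business_id) pair occurs more than once.
repeated_pair_rows = toy_tips.duplicated(
    subset=["user_id", "business_id"], keep=False
)
n_repeated_rows = int(repeated_pair_rows.sum())
```

If `n_repeated_rows` is zero on the real data, each pair is unique and no aggregation is needed before building user-item relations from tips.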

In [8]:
tip_df = pd.read_json(f"{path}/yelp_academic_dataset_tip.json", lines=True)
tip_df

Unnamed: 0,user_id,business_id,text,date,compliment_count
0,AGNUgVwnZUey3gcPCJ76iw,3uLgwr0qeCNMjKenHJwPGQ,Avengers time with the ladies.,2012-05-18 02:17:21,0
1,NBN4MgHP9D3cw--SnauTkA,QoezRbYQncpRqyrLH6Iqjg,They have lots of good deserts and tasty cuban...,2013-02-05 18:35:10,0
2,-copOvldyKh1qr-vzkDEvw,MYoRNLb5chwjQe3c_k37Gg,It's open even when you think it isn't,2013-08-18 00:56:08,0
3,FjMQVZjSqY8syIO-53KFKw,hV-bABTK-glh5wj31ps_Jw,Very decent fried chicken,2017-06-27 23:05:38,0
4,ld0AperBXk1h6UbqmM80zw,_uN0OudeJ3Zl_tf6nxg5ww,Appetizers.. platter special for lunch,2012-10-06 19:43:09,0
...,...,...,...,...,...
908910,eYodOTF8pkqKPzHkcxZs-Q,3lHTewuKFt5IImbXJoFeDQ,Disappointed in one of your managers.,2021-09-11 19:18:57,0
908911,1uxtQAuJ2T5Xwa_wp7kUnA,OaGf0Dp56ARhQwIDT90w_g,Great food and service.,2021-10-30 11:54:36,0
908912,v48Spe6WEpqehsF2xQADpg,hYnMeAO77RGyTtIzUSKYzQ,Love their Cubans!!,2021-11-05 13:18:56,0
908913,ckqKGM2hl7I9Chp5IpAhkw,s2eyoTuJrcP7I_XyjdhUHQ,Great pizza great price,2021-11-20 16:11:44,0


Only one feature was dropped, `compliment_count`: it describes **user-to-user** relationships and would be useful if we wanted to recommend *users to other users*, whereas the RecSys under development recommends **items to users**.

In [None]:
# Drop the user-to-user feature; tip_df has no `categories` column.
filtered_tip_df = tip_df.drop(columns=['compliment_count'])

# EDA

## Reviews dataset