# Introduction

The National Restaurant Association projected United States restaurant industry sales of $899 billion for 2020, about 5 percent of the country's gross domestic product. An estimated 99% of companies in the industry are family-owned small businesses with fewer than 50 employees. As of February 2020, the industry directly employed more than 15 million people, about 10% of the workforce. Given the industry's size and employment footprint, I decided to build a model that helps business owners and investors identify the features that best predict a restaurant's success or failure.

Based on this analysis, investors can decide whether to invest in a particular business given the likelihood that it will close in the future; existing businesses can intervene and improve on the features that put them at risk; and new businesses can assess their prospects before entering the market.

The data for this project comes from the Yelp Open Dataset, which is publicly available (as of Mar 16, 2021) for educational and academic purposes. The dataset is 11 GB and consists of five JSON files covering businesses, reviews, users, tips, and check-ins. This project focuses on the business data and the associated reviews. The complete dataset contains 160,585 businesses and 8,635,403 reviews, but this project uses only the data for restaurants in Tampa, Florida.

In part 1, I import the dataset and extract the required data from the .json files. In part 2, the data is cleaned, mainly by removing duplicates and removing or imputing null values; feature engineering then creates custom columns that can help predict the success or failure of a restaurant. This is followed by an Exploratory Data Analysis (EDA) to understand the final cleaned dataset. In part 3, models are built and evaluated on various performance metrics.

## Importing libraries

We import the pandas and json modules to extract and load the data.

In [1]:
import pandas as pd 
import json

In [2]:
business_df = pd.read_json('yelp_academic_dataset_business.json', lines=True, encoding='utf-8')
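
Before loading a multi-gigabyte file in full, it can help to look at just the first record. The following is a minimal sketch, assuming the file is in JSON Lines format (one JSON object per line), as `lines=True` implies. `peek_jsonl` is a hypothetical helper name, and the demo writes a tiny stand-in file rather than reading the real dataset:

```python
import json
import os
import tempfile
from itertools import islice


def peek_jsonl(path, n=1):
    """Return the first n records of a JSON Lines file without reading the rest."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in islice(f, n)]


# Tiny stand-in file so the sketch runs without the Yelp data (made-up rows)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_business.json')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write(json.dumps({'business_id': 'a1', 'city': 'Tampa', 'stars': 4.0}) + '\n')
    f.write(json.dumps({'business_id': 'b2', 'city': 'Largo', 'stars': 3.5}) + '\n')

first = peek_jsonl(demo_path, n=1)
print(sorted(first[0].keys()))
```

Pointing `peek_jsonl` at `'yelp_academic_dataset_business.json'` would list the field names of the first business record without holding the whole file in memory.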

In [3]:
# check the count
print(f'Number of businesses: {business_df.shape[0]: ,}')
print(f'Number of features: {business_df.shape[1]: ,}')

# view the data
business_df.head(3)

Number of businesses:  150,346
Number of features:  14


Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."


In [4]:
# extracting every business whose categories mention 'restaurant' (no location filter yet)
Tampa_restaurants = business_df[business_df['categories'].str.lower().str.contains('restaurant', na = False)]

# checking number of restaurants
print(f'Number of Restaurants: {Tampa_restaurants.shape[0]: ,}')
print(f'Number of features: {Tampa_restaurants.shape[1]: ,}')

Number of Restaurants:  52,286
Number of features:  14
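
The substring match above keeps any business whose `categories` text contains "restaurant". A stricter alternative, sketched here on a made-up three-row frame, splits the comma-separated string and tests for the exact `Restaurants` label. The rows and the helper name `has_category` are illustrative assumptions, not part of the dataset:

```python
import pandas as pd


def has_category(categories, label='Restaurants'):
    """True if the comma-separated categories string contains the exact label."""
    if not isinstance(categories, str):  # missing categories arrive as None/NaN
        return False
    return label in [c.strip() for c in categories.split(',')]


# Made-up rows for illustration only
demo = pd.DataFrame({
    'name': ["Joe's Pizza", 'Chef Supply Co', 'Unknown'],
    'categories': ['Restaurants, Pizza', 'Restaurant Supplies, Shopping', None],
})

# Substring match: also keeps the supply store
substring_mask = demo['categories'].str.contains('restaurant', case=False, na=False)
# Exact label match: keeps only rows tagged with the 'Restaurants' category
exact_mask = demo['categories'].apply(has_category)

print(substring_mask.tolist())
print(exact_mask.tolist())
```

Whether the two approaches give different counts on the real data is not shown here; comparing `substring_mask.sum()` with `exact_mask.sum()` on `business_df` would answer that.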


In [5]:
Tampa_restaurants.head(5)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
5,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,615 S Main St,Ashland City,TN,37015,36.269593,-87.058943,2.0,6,1,"{'BusinessParking': 'None', 'BusinessAcceptsCr...","Burgers, Fast Food, Sandwiches, Food, Ice Crea...","{'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', '..."
8,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,MO,63123,38.565165,-90.321087,3.0,19,0,"{'Caters': 'True', 'Alcohol': 'u'full_bar'', '...","Pubs, Restaurants, Italian, Bars, American (Tr...",
9,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,2312 Dickerson Pike,Nashville,TN,37207,36.208102,-86.76817,1.5,10,1,"{'RestaurantsAttire': ''casual'', 'Restaurants...","Ice Cream & Frozen Yogurt, Fast Food, Burgers,...","{'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', '..."
11,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,,Tampa Bay,FL,33602,27.955269,-82.45632,4.0,10,1,"{'Alcohol': ''none'', 'OutdoorSeating': 'None'...","Vietnamese, Food, Restaurants, Food Trucks","{'Monday': '11:0-14:0', 'Tuesday': '11:0-14:0'..."


In [6]:
# extracting the Florida restaurants data (state filter only; the city is not checked here)
Tampa_restaurants = business_df[(business_df['state'] == 'FL') &
                                (business_df['categories'].str.lower().str.contains('restaurant', na=False))].reset_index(drop=True)

# checking number of restaurants
print(f'Number of Restaurants: {Tampa_restaurants.shape[0]: ,}')
print(f'Number of features: {Tampa_restaurants.shape[1]: ,}')

Number of Restaurants:  8,732
Number of features:  14


In [7]:
Tampa_restaurants

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,,Tampa Bay,FL,33602,27.955269,-82.456320,4.0,10,1,"{'Alcohol': ''none'', 'OutdoorSeating': 'None'...","Vietnamese, Food, Restaurants, Food Trucks","{'Monday': '11:0-14:0', 'Tuesday': '11:0-14:0'..."
1,0bPLkL0QhhPO5kt1_EXmNQ,Zio's Italian Market,2575 E Bay Dr,Largo,FL,33771,27.916116,-82.760461,4.5,100,0,"{'OutdoorSeating': 'False', 'RestaurantsGoodFo...","Food, Delis, Italian, Bakeries, Restaurants","{'Monday': '10:0-18:0', 'Tuesday': '10:0-20:0'..."
2,uI9XODGY_2_ieTE6xJ0myw,Roman Forum,10440 N Dale Mabry Hwy,Tampa,FL,33618,28.046203,-82.505053,4.0,23,0,"{'BusinessParking': '{'garage': False, 'street...","Restaurants, American (New), Italian","{'Monday': '11:30-21:0', 'Tuesday': '11:30-21:..."
3,JgpnXv_0XhV3SfbfB50nxw,Joe's Pizza,2038 N Dale Mabry Hwy,Tampa,FL,33607,27.960514,-82.506127,4.0,35,0,"{'BusinessParking': '{'garage': False, 'street...","Restaurants, Pizza","{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'..."
4,pJfh3Ct8iL58NZa8ta-a5w,Top Shelf Sports Lounge,3173 Cypress Ridge Blvd,Wesley Chapel,FL,33544,28.196252,-82.380615,4.5,95,1,"{'BestNights': '{'monday': False, 'tuesday': F...","Burgers, Sports Bars, Bars, Lounges, Restauran...","{'Monday': '11:30-22:0', 'Tuesday': '11:30-23:..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8727,Scd-rcsQCn60t1sHHFv-og,First Watch,"4045 N Tyrone Blvd, Ste 204",St. Petersburg,FL,33709,27.808314,-82.752110,3.5,183,1,"{'RestaurantsPriceRange2': '2', 'OutdoorSeatin...","Cafes, Restaurants, Breakfast & Brunch, Americ...","{'Monday': '0:0-0:0', 'Tuesday': '7:0-14:30', ..."
8728,8MzF1Tlgz0pOkxmhP5dYzA,El Cap Restaurant,3500 4th St N,St. Petersburg,FL,33704,27.804140,-82.638855,3.5,414,1,"{'GoodForKids': 'True', 'BikeParking': 'True',...","American (Traditional), Burgers, Restaurants","{'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'..."
8729,-bZQH8yjm7ntTyGeLQwh8Q,Farmer's Kitchen Restaurant,3500 E Bay Dr,Largo,FL,33771,27.916787,-82.750395,4.0,6,0,"{'RestaurantsReservations': 'True', 'Restauran...","Sandwiches, Restaurants, Diners",
8730,BIyT7Kr7tMJqlfp4oOOYQg,Copper Bell Cafe,11228 Boyette Rd,Riverview,FL,33569,27.853745,-82.316887,3.5,49,0,"{'BikeParking': 'True', 'RestaurantsReservatio...","Breakfast & Brunch, Cafes, Restaurants","{'Monday': '7:30-14:30', 'Tuesday': '7:30-14:3..."


In [8]:
# open the review file in input folder and read it line by line;
# the context manager closes the file even if the loop is interrupted
tampa_ids = set(Tampa_restaurants['business_id'])  # set membership avoids scanning an array per line
data = []

with open('yelp_academic_dataset_review.json', encoding='utf8') as data_file:
    for i, line in enumerate(data_file):
        review = json.loads(line)
        # keep the review only if it belongs to one of the selected restaurants
        if review['business_id'] in tampa_ids:
            data.append(review)
            print(i, end='\r')

# add data in dataframe
review_df = pd.DataFrame(data)

# check size of data
print(f'Number of Reviews: {review_df.shape[0]: ,}')
print(f'Number of features: {review_df.shape[1]: ,}')

1560137

KeyboardInterrupt: 
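
The output above shows that the run was interrupted partway through the review file. Parsing several million records one at a time in a Python loop can be slow however membership is tested. A possible alternative reads the file in chunks with `pd.read_json` and filters each chunk with `isin`. This is a sketch rather than the method used in this notebook: `filter_jsonl_by_business` and the chunk size are hypothetical choices, and the demo uses a small stand-in file with made-up rows:

```python
import json
import os
import tempfile

import pandas as pd


def filter_jsonl_by_business(path, business_ids, chunksize=100_000):
    """Keep only the records whose business_id is in business_ids."""
    wanted = set(business_ids)
    kept = []
    # With lines=True and chunksize set, read_json yields one DataFrame per chunk
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
        kept.append(chunk[chunk['business_id'].isin(wanted)])
    return pd.concat(kept, ignore_index=True)


# Stand-in review file (made-up rows) so the sketch runs without the Yelp data
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_review.json')
rows = [
    {'review_id': 'r1', 'business_id': 'b1', 'stars': 5},
    {'review_id': 'r2', 'business_id': 'b9', 'stars': 2},
    {'review_id': 'r3', 'business_id': 'b2', 'stars': 4},
    {'review_id': 'r4', 'business_id': 'b9', 'stars': 1},
    {'review_id': 'r5', 'business_id': 'b1', 'stars': 3},
]
with open(demo_path, 'w', encoding='utf-8') as f:
    for row in rows:
        f.write(json.dumps(row) + '\n')

demo_reviews = filter_jsonl_by_business(demo_path, ['b1', 'b2'], chunksize=2)
print(demo_reviews['review_id'].tolist())
```

On the real data the call would look like `filter_jsonl_by_business('yelp_academic_dataset_review.json', Tampa_restaurants['business_id'])`. Only one chunk plus the kept rows are in memory at a time.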

In [None]:
tip_df = pd.read_json('yelp_academic_dataset_tip.json', lines=True, encoding='utf-8')

In [None]:
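# narrowing the selection to Florida restaurants whose city name contains 'tampa'
Tampa_restaurants = business_df[(business_df['city'].str.lower().str.contains('tampa', na=False)) &
                                (business_df['state'] == 'FL') &
                                (business_df['categories'].str.lower().str.contains('restaurant', na=False))]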
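
A substring match on `city` keeps every spelling that contains "tampa", including values such as "Tampa Bay" that appear in the data. The sketch below uses a made-up frame (not the real dataset) to show how to list the city values a substring filter captures and how an exact match compares:

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    'city': ['Tampa', 'Tampa Bay', 'tampa', 'Largo', None],
    'state': ['FL', 'FL', 'FL', 'FL', 'FL'],
})

# Substring match, as in the cell above
mask = demo['city'].str.contains('tampa', case=False, na=False) & (demo['state'] == 'FL')
matched_cities = demo.loc[mask, 'city'].value_counts()
print(matched_cities)

# Exact match on the normalised city name, for comparison
exact_mask = demo['city'].str.strip().str.lower().eq('tampa') & (demo['state'] == 'FL')
print(int(mask.sum()), int(exact_mask.sum()))
```

Running `value_counts()` on the `city` column of `Tampa_restaurants` gives the same kind of listing for the real selection, which makes it easy to decide whether the broader match is the intended one.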

In [None]:
Tampa_restaurants.head(3)

In [None]:
# open the review file in input folder and read it line by line;
# the context manager closes the file even if the loop is interrupted
tampa_ids = set(Tampa_restaurants['business_id'])  # set membership avoids scanning an array per line
data = []

with open('yelp_academic_dataset_review.json', encoding='utf8') as data_file:
    for i, line in enumerate(data_file):
        review = json.loads(line)
        # keep the review only if it belongs to one of the selected restaurants
        if review['business_id'] in tampa_ids:
            data.append(review)
            print(i, end='\r')

# add data in dataframe
review_df = pd.DataFrame(data)

# check size of data
print(f'Number of Reviews: {review_df.shape[0]: ,}')
print(f'Number of features: {review_df.shape[1]: ,}')

In [None]:
# open the tip file in input folder and read it line by line
tampa_ids = set(Tampa_restaurants['business_id'])  # set membership avoids scanning an array per line
data = []

with open('yelp_academic_dataset_tip.json', encoding='utf8') as data_file:
    for i, line in enumerate(data_file):
        tip = json.loads(line)
        # keep the tip only if it belongs to one of the selected restaurants
        if tip['business_id'] in tampa_ids:
            data.append(tip)
            print(i, end='\r')

# add data in dataframe
tip1_df = pd.DataFrame(data)

# check size of data (report the tips frame, not review_df)
print(f'Number of Tips: {tip1_df.shape[0]: ,}')
print(f'Number of features: {tip1_df.shape[1]: ,}')

In [None]:
tip1_df.head()

In [None]:
tip1_df.shape

In [None]:
# save the filtered content as csv file in data folder
#Tampa_restaurants.to_csv(r'data\Tampa_restaurants.csv', encoding='utf-8', index=False)
#review_df.to_csv(r'data\Tampa_restaurants_reviews.csv', encoding='utf-8', index=False)
#tip1_df.to_csv(r'data\Tampa_restaurants_tip.csv', encoding='utf-8', index=False)
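
The commented-out paths above use backslashes, which act as separators only on Windows. The following is a portable sketch with `pathlib`, assuming a `data` folder relative to the working directory. The demo writes a made-up stand-in frame to a temporary directory instead of the real output location:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the project's data folder; Path('data') would be used in the notebook
out_dir = Path(tempfile.mkdtemp()) / 'data'
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

# Made-up frame standing in for Tampa_restaurants
demo = pd.DataFrame({'business_id': ['b1', 'b2'], 'stars': [4.0, 3.5]})
out_file = out_dir / 'Tampa_restaurants.csv'
demo.to_csv(out_file, encoding='utf-8', index=False)

# Read the file back to confirm the save worked
round_trip = pd.read_csv(out_file)
print(round_trip.shape)
```

Note that dict-valued columns such as `attributes` and `hours` are written to CSV as their string representation, so they need to be parsed again after reloading.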