# What factors determine the success or popularity of a business?
In this notebook we are going to see how we could make sure our dream business venture succeeds despite neighboring competition all around. There will be a lot of factors to take into account, such as the kind of food we serve, our location, and even the cost of our products. We haven't yet decided what kind of business to create; in this notebook we're just looking to analyze the market on a cursory level. In subsequent notebooks we plan on going further in depth with regressions including multiple independent variables, time-series analysis, etc.

To begin we'll import some needed libraries.

In [1]:
from google.oauth2 import service_account
from google.cloud import bigquery
import configparser
import pandas as pd

Now we'll set up authentication to connect to our data warehouse in Google BigQuery.

In [3]:
KEY_PATH = "/mnt/c/Users/Ron/git-repos/yelp-data/gourmanddwh-f75384f95e86.json"
CREDS = service_account.Credentials.from_service_account_file(KEY_PATH)
client = bigquery.Client(credentials=CREDS, project=CREDS.project_id)
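
The `configparser` import above is not used in the cells shown, so one possible use for it is keeping the key path out of the notebook itself. The sketch below is only an illustration: the helper name `load_key_path`, the file name `config.ini`, the section `gcp`, and the option `key_path` are all hypothetical and not part of the original project.

```python
import configparser
from pathlib import Path


def load_key_path(config_file, section="gcp", option="key_path"):
    """Return the service-account key path stored in an INI file.

    The section and option names are hypothetical; adjust them to
    whatever layout the real config file uses.
    """
    parser = configparser.ConfigParser()
    # ConfigParser.read returns the list of files it parsed successfully,
    # so an empty list means the file was missing or unreadable.
    if not parser.read(config_file):
        raise FileNotFoundError(f"Could not read config file: {config_file}")
    return Path(parser.get(section, option))
```

With such a file in place, the cell above could start with `KEY_PATH = load_key_path("config.ini")` instead of a hardcoded string.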

Using the BigQuery API we can run select statements and place the results in pandas dataframes, so we'll do so for a couple of queries.


In [4]:
# Read the SQL text; the with block closes the file afterwards
with open('county_growth_est.sql', 'r') as cg_file:
    county_growth_query = cg_file.read()
cg_dataframe = (
    client.query(county_growth_query)
    .result()
    .to_dataframe()
)

In [5]:
# Read the SQL text; the with block closes the file afterwards
with open('business_daily_holding.sql') as holding_file:
    holding_query = holding_file.read()

holding_dataframe = (
    client.query(holding_query)
    .result()
    .to_dataframe()
)

In [6]:
# Read the SQL text; the with block closes the file afterwards
with open('business_category_location.sql') as bus_cat_file:
    bus_cat_query = bus_cat_file.read()

bus_cat_dataframe = (
    client.query(bus_cat_query)
    .result()
    .to_dataframe()
)
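
The three cells above repeat the same read, query, and convert steps. A small helper could express that once. This is a sketch only: the function name `query_file_to_dataframe` is hypothetical, and it assumes `client` is a `bigquery.Client` like the one created earlier.

```python
from pathlib import Path


def query_file_to_dataframe(client, sql_path):
    """Run the SQL stored in sql_path and return the result as a DataFrame.

    `client` is expected to behave like the bigquery.Client created
    earlier; the helper name itself is hypothetical.
    """
    # read_text opens, reads, and closes the file in one step.
    sql = Path(sql_path).read_text()
    return client.query(sql).result().to_dataframe()
```

Each cell would then reduce to a single call, for example `cg_dataframe = query_file_to_dataframe(client, 'county_growth_est.sql')`.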

We'll pause to serialize the data before we start any transforms, to retain its state.

In [7]:
# float_format is left at its default so coordinates and ratings keep full precision
cg_dataframe.to_csv('county_growth_est.csv', sep='|', index=False)
holding_dataframe.to_csv('holding.csv', sep='|', index=False)
bus_cat_dataframe.to_csv('bus_cat.csv', sep='|', index=False)

Now we'll read the data back in and conduct some exploratory analysis

In [8]:
cg_dataframe = pd.read_csv('county_growth_est.csv',sep='|', low_memory=True)
holding_dataframe = pd.read_csv('holding.csv',sep='|', low_memory=True)
bus_cat_dataframe = pd.read_csv('bus_cat.csv',sep='|', low_memory=True)
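
Since the goal of the save and reload is to retain the state of the data, it can be worth checking that a frame survives the round trip unchanged. The sketch below uses a small toy frame and an in-memory buffer rather than the real files; the values are illustrative stand-ins, not query results.

```python
import io

import pandas as pd

# Toy stand-in for one of the query results.
original = pd.DataFrame({
    "BusinessName": ["the-shoppe-geneva", "porky-barn-geneva"],
    "Longitude": [-85.865918, -85.885675],
    "ReviewCount": [2, 5],
})

# Write and read with the same separator, using an in-memory buffer
# instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, sep="|", index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, sep="|")

# Raises AssertionError if shape, column names, dtypes, or values differ
# (floats are compared with a small tolerance by default).
pd.testing.assert_frame_equal(original, restored)
```

The same `assert_frame_equal` call could be pointed at the real frames, provided copies of the originals are kept before they are overwritten by `read_csv`.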

The dataset `cg_dataframe` shows us some estimated population statistics for the year 2019. We will look into what impact these 2019 estimates may have on the rest of our data, which comes from 2021/2022.

In [9]:
cg_dataframe

Unnamed: 0,CountyName,EstimationYear,EstimatedPopulation
0,Williamson County,2019,590551
1,Williamson County,2019,590551
2,Williamson County,2019,590551
3,Webb County,2019,276652
4,Washakie County,2019,7805
...,...,...,...
7584,Bartow County,2019,107738
7585,Bartow County,2019,107738
7586,Bartow County,2019,107738
7587,Bartholomew County,2019,83779
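
The preview above contains repeated rows (Williamson County appears three times with identical values). If those repeats turn out to be true duplicates, `drop_duplicates` would collapse them. The sketch below rebuilds only the first five preview rows as a toy frame. Note that county names can repeat across states, so this is only safe when the rows are genuinely the same record.

```python
import pandas as pd

# Rows copied from the first five lines of the preview above.
preview = pd.DataFrame({
    "CountyName": ["Williamson County"] * 3 + ["Webb County", "Washakie County"],
    "EstimationYear": [2019] * 5,
    "EstimatedPopulation": [590551, 590551, 590551, 276652, 7805],
})

# Keep the first occurrence of each fully identical row.
deduplicated = preview.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```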


The dataset `holding_dataframe` shows us the business rating and review count with which a business closed out each day.

In [10]:
holding_dataframe

Unnamed: 0,BusinessName,ChainName,BusinessRating,ReviewCount
0,hoodsport-coffee-company-hoodsport-2,Hoodsport Coffee Company,0.0,0
1,hoodsport-coffee-company-hoodsport-2,Hoodsport Coffee Company,0.0,0
2,hoodsport-coffee-company-hoodsport-2,Hoodsport Coffee Company,0.0,0
3,hermans-bakery-cambridge-2,Hermans Bakery,0.0,0
4,hermans-bakery-cambridge-2,Hermans Bakery,0.0,0
...,...,...,...,...
198468,happy-buffet-onley,Happy Buffet,2.5,14
198469,happy-buffet-onley,Happy Buffet,2.5,13
198470,l-and-m-firehouse-orting,L & M Firehouse,2.5,68
198471,l-and-m-firehouse-orting,L & M Firehouse,2.5,67


The dataset `bus_cat_dataframe` provides additional information about each business, such as its location and the categories it offers. As such, it contains duplicate businesses (one business instance for each of its categories).

In [11]:
bus_cat_dataframe

Unnamed: 0,BusinessKey,BusinessName,ChainName,PaymentLevelName,Longitude,Latitude,BusinessCategoryName,CityName,CountyName,StateName,CountryName
0,3,the-shoppe-geneva,The Shoppe,Very Low,-85.865918,31.034793,Sandwiches,Geneva,Geneva County,Alabama,US
1,3,the-shoppe-geneva,The Shoppe,Very Low,-85.865918,31.034793,Ice Cream & Frozen Yogurt,Geneva,Geneva County,Alabama,US
2,3,the-shoppe-geneva,The Shoppe,Very Low,-85.865918,31.034793,Burgers,Geneva,Geneva County,Alabama,US
3,7,porky-barn-geneva,Porky Barn,Very Low,-85.885675,31.048576,Breakfast & Brunch,Geneva,Geneva County,Alabama,US
4,110,mellow-mushroom-mobile-mobile,Mellow Mushroom Mobile,Low,-88.173363,30.689876,Pizza,Mobile,Mobile County,Alabama,US
...,...,...,...,...,...,...,...,...,...,...,...
126522,63180,rock-bottom-cafe-and-gifts-glenrock,Rock Bottom Cafe & Gifts,Low,-105.872560,42.862490,Steakhouses,Glenrock,Converse County,Wyoming,US
126523,63189,chutes-restaurant-douglas-3,Chutes Restaurant,Unknown,-105.406297,42.763781,American (Traditional),Douglas,Converse County,Wyoming,US
126524,63189,chutes-restaurant-douglas-3,Chutes Restaurant,Unknown,-105.406297,42.763781,Indian,Douglas,Converse County,Wyoming,US
126525,63189,chutes-restaurant-douglas-3,Chutes Restaurant,Unknown,-105.406297,42.763781,Bars,Douglas,Converse County,Wyoming,US


To begin, `bus_cat_dataframe` and `holding_dataframe` will be merged to get some summary statistics.

In [12]:
bus_cat_holding = bus_cat_dataframe.merge(right=holding_dataframe, how='inner', on='BusinessName')

In [13]:
bus_cat_holding

Unnamed: 0,BusinessKey,BusinessName,ChainName_x,PaymentLevelName,Longitude,Latitude,BusinessCategoryName,CityName,CountyName,StateName,CountryName,ChainName_y,BusinessRating,ReviewCount
0,3,the-shoppe-geneva,The Shoppe,Very Low,-85.865918,31.034793,Sandwiches,Geneva,Geneva County,Alabama,US,The Shoppe,4.5,2
1,3,the-shoppe-geneva,The Shoppe,Very Low,-85.865918,31.034793,Sandwiches,Geneva,Geneva County,Alabama,US,The Shoppe,4.5,2
2,3,the-shoppe-geneva,The Shoppe,Very Low,-85.865918,31.034793,Sandwiches,Geneva,Geneva County,Alabama,US,The Shoppe,4.5,2
3,3,the-shoppe-geneva,The Shoppe,Very Low,-85.865918,31.034793,Sandwiches,Geneva,Geneva County,Alabama,US,The Shoppe,4.5,2
4,3,the-shoppe-geneva,The Shoppe,Very Low,-85.865918,31.034793,Sandwiches,Geneva,Geneva County,Alabama,US,The Shoppe,4.5,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
396790,63189,chutes-restaurant-douglas-3,Chutes Restaurant,Unknown,-105.406297,42.763781,Bars,Douglas,Converse County,Wyoming,US,Chutes Restaurant,5.0,6
396791,63189,chutes-restaurant-douglas-3,Chutes Restaurant,Unknown,-105.406297,42.763781,Bars,Douglas,Converse County,Wyoming,US,Chutes Restaurant,5.0,6
396792,63259,fort-laramie-historic-site-fort-laramie,Fort Laramie Historic Site,Unknown,-104.546270,42.206905,Landmarks & Historical Buildings,Fort Laramie,Goshen County,Wyoming,US,Fort Laramie Historic Site,5.0,6
396793,63259,fort-laramie-historic-site-fort-laramie,Fort Laramie Historic Site,Unknown,-104.546270,42.206905,Landmarks & Historical Buildings,Fort Laramie,Goshen County,Wyoming,US,Fort Laramie Historic Site,5.0,6
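
The merged frame is larger than either input because one side has a row per business and category while the other has a row per business and day, so each business contributes categories times days rows. Both inputs also carry `ChainName`, which is why it appears as `ChainName_x` and `ChainName_y`. The toy sketch below illustrates both effects, and merges on both shared columns to avoid the suffixed pair. That variant assumes `ChainName` always agrees between the two frames, which has not been verified here.

```python
import pandas as pd

# One row per business and category.
categories = pd.DataFrame({
    "BusinessName": ["the-shoppe-geneva"] * 3,
    "ChainName": ["The Shoppe"] * 3,
    "BusinessCategoryName": ["Sandwiches", "Ice Cream & Frozen Yogurt", "Burgers"],
})

# One row per business and day (two days in this toy example).
daily = pd.DataFrame({
    "BusinessName": ["the-shoppe-geneva"] * 2,
    "ChainName": ["The Shoppe"] * 2,
    "BusinessRating": [4.5, 4.5],
    "ReviewCount": [2, 2],
})

# Merging on both shared columns keeps a single ChainName column.
merged = categories.merge(daily, how="inner", on=["BusinessName", "ChainName"])

print(len(merged))           # 3 categories x 2 days = 6 rows
print(list(merged.columns))
```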


Now we'll group by category name and check out some aggregate results.

In [14]:
cat_groups = bus_cat_holding.groupby(['BusinessCategoryName'], as_index=False)[['ReviewCount','BusinessRating']].agg({"ReviewCount": ['sum', 'mean', 'max'], "BusinessRating": ['mean', 'max']})
cat_groups

Unnamed: 0_level_0,BusinessCategoryName,ReviewCount,ReviewCount,ReviewCount,BusinessRating,BusinessRating
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,max,mean,max
0,ATV Rentals/Tours,476,26.444444,55,4.250000,5.0
1,Acai Bowls,15481,103.899329,530,4.436242,5.0
2,Accessories,238,7.677419,53,4.274194,5.0
3,Active Life,20,5.000000,5,4.500000,4.5
4,Aerial Tours,508,127.000000,128,3.000000,3.0
...,...,...,...,...,...,...
526,Wraps,45539,67.565282,1057,4.207715,5.0
527,Yelp Events,630,27.391304,129,4.913043,5.0
528,Yoga,406,16.240000,35,4.740000,5.0
529,Ziplining,3780,75.600000,183,4.620000,5.0
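
Passing a dict of lists to `agg` produces the two-level column header seen above, which is why later sorting has to address columns with tuples such as `('ReviewCount', 'mean')`. Named aggregation is an alternative that yields flat column names. The sketch below runs on toy rows, and the output names (`review_sum` and so on) are hypothetical choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows in the shape of bus_cat_holding.
toy = pd.DataFrame({
    "BusinessCategoryName": ["Pizza", "Pizza", "Yoga"],
    "ReviewCount": [10, 30, 16],
    "BusinessRating": [3.5, 4.5, 5.0],
})

# Each keyword names an output column: (input column, aggregation).
flat_groups = toy.groupby("BusinessCategoryName", as_index=False).agg(
    review_sum=("ReviewCount", "sum"),
    review_mean=("ReviewCount", "mean"),
    review_max=("ReviewCount", "max"),
    rating_mean=("BusinessRating", "mean"),
    rating_max=("BusinessRating", "max"),
)

# Flat names can be passed to sort_values as plain strings.
print(flat_groups.sort_values(by=["review_mean", "rating_mean"], ascending=False))
```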


Here we took a sum of the review count, which isn't the best statistic considering our data consists of review counts at the end of each day; however, it does give some insight into the most visited businesses.\
To make more sense of the data it would be best to sort on some columns.

In [16]:
cat_groups_Sorted = cat_groups.sort_values(by=[('ReviewCount', 'mean'), ('BusinessRating', 'mean'), ('ReviewCount', 'sum')], ascending=False)
cat_groups_Sorted.head(10)

Unnamed: 0_level_0,BusinessCategoryName,ReviewCount,ReviewCount,ReviewCount,BusinessRating,BusinessRating
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,max,mean,max
382,Public Markets,17082,5694.0,5700,4.5,4.5
418,Shanghainese,63141,1540.02439,6560,4.073171,4.5
176,Eritrean,3480,1160.0,1160,4.5,4.5
254,Iberian,6767,1127.833333,2107,4.0,4.0
84,Burmese,39381,1036.342105,7012,4.328947,5.0
293,Live/Raw Food,88064,926.989474,7566,4.305263,5.0
266,Izakaya,37686,897.285714,3224,4.214286,4.5
481,Train Stations,4926,821.0,1430,4.25,4.5
73,Brasseries,22950,819.642857,3976,4.232143,4.5
438,South African,4641,773.5,1544,4.75,5.0


What's interesting is that these categories seem to be very niche and, aside from the public markets, don't seem like they would be frequented by a diverse range of visitors.\
We're going to modify our sort and see what happens.

In [19]:
cat_groups_Sorted = cat_groups.sort_values(by=[('ReviewCount', 'sum'), ('ReviewCount', 'mean'),  ('BusinessRating', 'mean')], ascending=False)
cat_groups_Sorted.head(10)

Unnamed: 0_level_0,BusinessCategoryName,ReviewCount,ReviewCount,ReviewCount,BusinessRating,BusinessRating
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,max,mean,max
75,Breakfast & Brunch,3821273,208.778506,14412,4.045512,5.0
10,American (Traditional),3526088,126.67366,14412,3.836201,5.0
9,American (New),3034454,258.868282,8947,4.05865,5.0
413,Seafood,2661587,226.88492,7566,3.950217,5.0
408,Sandwiches,2152298,111.912334,13230,3.877236,5.0
43,Bars,1944257,171.693483,8855,3.96843,5.0
362,Pizza,1814484,90.633566,6442,3.77972,5.0
83,Burgers,1692777,86.556067,5932,3.422892,5.0
305,Mexican,1568992,84.631965,5082,3.88125,5.0
265,Italian,1557850,172.32854,6442,3.984403,5.0


Here we get a different picture, but it could be misleading, as shown by the following business category counts.

In [23]:
bus_cat_holding.groupby(['BusinessCategoryName'], as_index=False)['BusinessName'].count().sort_values(by=[
    'BusinessName'], ascending=False).head(10)

Unnamed: 0,BusinessCategoryName,BusinessName
10,American (Traditional),27836
362,Pizza,20020
83,Burgers,19557
408,Sandwiches,19232
305,Mexican,18539
75,Breakfast & Brunch,18303
123,Coffee & Tea,15815
184,Fast Food,15703
413,Seafood,11731
9,American (New),11722
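
Because `bus_cat_holding` has one row per business, category, and day, `count()` on `BusinessName` counts merged rows rather than distinct businesses. If the number of competing businesses is what matters, `nunique` may be closer to the intended figure. The sketch below compares the two on toy rows; the business names are made up for illustration.

```python
import pandas as pd

# Toy rows in the shape of bus_cat_holding: "a-pizza" appears on two days.
toy = pd.DataFrame({
    "BusinessCategoryName": ["Pizza", "Pizza", "Pizza", "Burgers"],
    "BusinessName": ["a-pizza", "a-pizza", "b-pizza", "c-burgers"],
})

per_category = (
    toy.groupby("BusinessCategoryName", as_index=False)
    .agg(
        # count tallies every row, including repeated days
        row_count=("BusinessName", "count"),
        # nunique tallies each business once
        distinct_businesses=("BusinessName", "nunique"),
    )
    .sort_values(by="distinct_businesses", ascending=False)
)
print(per_category)
```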


Here a decision has to be made: would we really want to go into the above industries knowing there is going to be much competition? Maybe that's a good sign, showing fewer barriers to entry?
Checking the first set of criteria, those business categories seem rather difficult to break into, so we will decide to stick with the more common businesses.
