# Cleaning Data

## Table of Contents

1. [Find restaurants in Business data](#Find-restaurants-in-Business-data)
2. [Select restaurant reviews from Review data](#Select-restaurant-reviews-from-Review-data)
3. [Clean Review Data](#Clean-Review-Data)
4. [Clean Restaurant data](#Clean-restaurant-data)



### Import Libraries

In [1]:
import csv
import os
import pandas as pd
from time import time

### Set filepaths

In [2]:
raw_data_directory     = os.path.join('..', 'data', 'raw')
interim_data_directory = os.path.join('..', 'data', 'interim')

review_filepath            = os.path.join(raw_data_directory, 'yelp_academic_dataset_review.csv')
business_filepath          = os.path.join(raw_data_directory, 'yelp_academic_dataset_business.csv')
restaurant_review_filepath = os.path.join(interim_data_directory, 'restaurant_review.csv')
restaurant_filepath        = os.path.join(interim_data_directory, 'restaurant.csv')

### Find restaurants in Business data
Restaurants are businesses with "Restaurants" in the `categories` column.

In [3]:
%%time

restaurant_ids = set()

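# newline = '' lets the csv module handle line endings and embedded newlines itself
with open(restaurant_filepath, mode = 'w', encoding = 'utf_8', newline = '') as f_out:
    with open(business_filepath, encoding = 'utf_8', newline = '') as f_in: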

        reader = csv.DictReader(f_in)
        columns = reader.fieldnames
        
        writer = csv.DictWriter(f_out, fieldnames = columns)
        writer.writeheader()

        # iterate through each line in the file
        for row in reader:

            # skip businesses that are not restaurants
            # (rows with a missing `categories` value are skipped as well)
            categories = row.get('categories')
            if not categories or 'Restaurants' not in categories:
                continue

            # add business_id to restaurant_ids set
            restaurant_ids.add(row['business_id'])
            
            # write row
            writer.writerow(row)            

restaurant_ids = frozenset(restaurant_ids)

print (f'Found {len(restaurant_ids)} restaurants.')

Found 57173 restaurants.
CPU times: user 7.19 s, sys: 223 ms, total: 7.41 s
Wall time: 7.59 s
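The loop above uses a substring test, so any category whose name merely contains the text "Restaurants" also counts as a match. The sketch below reproduces that filter in pandas and compares it with a stricter exact-token variant. It assumes `categories` is a comma-separated string; the sample rows, `sample_df`, and `has_exact_category` are invented for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the business CSV.
sample_df = pd.DataFrame({
    'business_id': ['a1', 'b2', 'c3', 'd4'],
    'categories': ['Mexican, Restaurants', 'Hair Salons, Beauty & Spas',
                   None, 'Pop-Up Restaurants, Food'],
})

# Substring match, equivalent to the loop above; na=False treats missing values as non-matches.
substring_mask = sample_df['categories'].str.contains('Restaurants', regex=False, na=False)

def has_exact_category(categories, target='Restaurants'):
    """Return True only if `target` is one of the comma-separated category tokens."""
    if not isinstance(categories, str):
        return False
    return target in [token.strip() for token in categories.split(',')]

exact_mask = sample_df['categories'].apply(has_exact_category)

substring_ids = frozenset(sample_df.loc[substring_mask, 'business_id'])
exact_ids = frozenset(sample_df.loc[exact_mask, 'business_id'])
print(sorted(substring_ids), sorted(exact_ids))
```

Whether the looser or the stricter match is preferable depends on how the categories are labelled in the data, so it may be worth comparing the two counts before choosing.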


### Select restaurant reviews from Review data

Save restaurant reviews to a separate file.

In [36]:
%%time

# Change to True to run this code block. Estimated runtime: 3 minutes
run = False

if run:
    num_reviews = 0

    # Write restaurant reviews to csv file
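    # newline = '' lets the csv module handle newlines embedded in review text
    with open(restaurant_review_filepath, 'w', encoding = 'utf_8', newline = '') as f_out:

        # Open all reviews csv file
        with open(review_filepath, encoding = 'utf_8', newline = '') as f_in: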
            # Instantiate reader
            reader = csv.DictReader(f_in)

            # Get column names
            columns = reader.fieldnames

            # Instantiate writer
            writer = csv.DictWriter(f_out, fieldnames = columns)

            # Write column names
            writer.writeheader()

            # Loop through all reviews
            for row in reader:            
                # Skip reviews that are not about a restaurant
                if row['business_id'] not in restaurant_ids:
                    continue

                # Write row    
                writer.writerow(row)
                num_reviews += 1

    print (f'Found {num_reviews} restaurant reviews.')


Found 3654797 restaurant reviews.
CPU times: user 2min 17s, sys: 5.62 s, total: 2min 23s
Wall time: 2min 25s
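The row-by-row filter above keeps memory use low. A possible alternative, sketched here under the assumption that the file parses cleanly with `pd.read_csv`, is to read the reviews in chunks and filter each chunk with `isin`. The in-memory CSV and the tiny `restaurant_ids` set are stand-ins invented for this example.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the review CSV.
review_csv = io.StringIO(
    'review_id,business_id,stars\n'
    'r1,a1,5\n'
    'r2,b2,3\n'
    'r3,a1,1\n'
    'r4,c3,4\n'
)
restaurant_ids = frozenset({'a1', 'c3'})  # assumed result of the previous step

kept_chunks = []
# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(review_csv, chunksize=2):
    kept_chunks.append(chunk[chunk['business_id'].isin(restaurant_ids)])

restaurant_reviews = pd.concat(kept_chunks, ignore_index=True)
print(f'Found {len(restaurant_reviews)} restaurant reviews.')
```

On the full dataset a much larger `chunksize` would be appropriate; the value 2 is only there to make the chunking visible on four rows.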


#### Load Restaurant Review data

In [4]:
review_df = pd.read_csv(restaurant_review_filepath)
review_df.head()

Unnamed: 0,text,cool,funny,review_id,date,stars,business_id,useful,user_id
0,The pizza was okay. Not the best I've had. I p...,0,0,x7mDIiDB3jEiPGPHOmDzyw,2011-02-25,2,iCQpiavjjPzJ5_3gPD5Ebg,0,msQe1u7Z_XuqjGoqhB0J5g
1,I love this place! My fiance And I go here atl...,0,0,dDl8zu1vWPdKGihJrwQbpw,2012-11-13,5,pomGBqfbxcqPv14c3XH-ZQ,0,msQe1u7Z_XuqjGoqhB0J5g
2,Terrible. Dry corn bread. Rib tips were all fa...,1,1,LZp4UX5zK3e-c5ZGSeo3kA,2014-10-23,1,jtQARsP6P-LbkyjbO1qNGg,3,msQe1u7Z_XuqjGoqhB0J5g
3,Back in 2005-2007 this place was my FAVORITE t...,0,0,Er4NBWCmCD4nM8_p1GRdow,2011-02-25,2,elqbBhBfElMNSrjFqW3now,2,msQe1u7Z_XuqjGoqhB0J5g
4,Delicious healthy food. The steak is amazing. ...,0,0,jsDu6QEJHbwP2Blom1PLCA,2014-09-05,5,Ums3gaP2qM3W1XcA5r6SsQ,0,msQe1u7Z_XuqjGoqhB0J5g


### Get the reviewed restaurants

In [5]:
business_df = pd.read_csv(business_filepath)

In [14]:
business_df.nlargest(5, 'review_count')[['business_id', 'name', 'review_count']]

Unnamed: 0,business_id,name,review_count
137635,4JNXUYY8wbaaDmk3BPzlWw,Mon Ami Gabi,7968
185167,RESDUcs7fIiihp38-d6_6g,Bacchanal Buffet,7866
62723,K7lWdNUhCbcnEvI0NhGewg,Wicked Spoon,6446
188309,cYwJA2A6I12KNkm2rtXd5g,Gordon Ramsay BurGR,5472
170129,f4x1YBxkLrZg652xt2KR5g,Hash House A Go Go,5382


#### Cross reference with review data

In [7]:
review_df['business_id'].value_counts().head()

4JNXUYY8wbaaDmk3BPzlWw    7968
RESDUcs7fIiihp38-d6_6g    7861
K7lWdNUhCbcnEvI0NhGewg    6447
cYwJA2A6I12KNkm2rtXd5g    5472
f4x1YBxkLrZg652xt2KR5g    5382
Name: business_id, dtype: int64

Mon Ami Gabi has the most reviews in both the business data and the review data, so I will use it for the rest of this project.
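The two outputs above agree for Mon Ami Gabi but differ slightly for some other restaurants (for example 7866 versus 7861). A minimal sketch of how such differences could be listed systematically, using invented miniature frames (`business_sample`, `review_sample`) in place of `business_df` and `review_df`:

```python
import pandas as pd

# Hypothetical miniature versions of the business and review data.
business_sample = pd.DataFrame({
    'business_id': ['a1', 'b2'],
    'name': ['Cafe A', 'Diner B'],
    'review_count': [3, 2],
})
review_sample = pd.DataFrame({'business_id': ['a1', 'a1', 'a1', 'b2']})

# Count the reviews actually present for each business.
actual_counts = (review_sample.groupby('business_id').size()
                 .rename('actual_reviews').reset_index())

# Put the reported and the actual counts side by side.
comparison = business_sample.merge(actual_counts, on='business_id', how='left')
comparison['difference'] = comparison['review_count'] - comparison['actual_reviews']

# Show only the businesses where the two counts disagree.
print(comparison[comparison['difference'] != 0])
```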

### Clean Review Data

#### Check for missing data

In [32]:
review_df.isna().sum()

text           1
cool           0
funny          0
review_id      0
date           0
stars          0
business_id    0
useful         0
user_id        0
dtype: int64

There is one review with missing text, so let's remove it.

#### Drop rows with missing reviews

In [33]:
review_df[review_df['text'].isna()]

Unnamed: 0,text,cool,funny,review_id,date,stars,business_id,useful,user_id
2882650,,0,0,QW01qOsaqlxMKoMazOw1Bg,2018-06-17,0,FgNgBLayRFm6H6Qr66ecbQ,0,wrqh88xVEE1U5d_5TjzN4Q


In [34]:
review_df = review_df.dropna(subset = ['text'])

In [35]:
review_df[review_df['text'].isna()]

Unnamed: 0,text,cool,funny,review_id,date,stars,business_id,useful,user_id
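The same clean-up can be expressed with `dropna(subset=...)`, which also makes it easy to report how many rows were removed. This is a small sketch on invented data (`sample_reviews`), not the notebook's actual frame.

```python
import pandas as pd

# Hypothetical sample with one missing review text.
sample_reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3'],
    'text': ['Great food.', None, 'Too salty.'],
})

rows_before = len(sample_reviews)
# Drop only the rows whose `text` is missing; other columns are not considered.
sample_reviews = sample_reviews.dropna(subset=['text'])
rows_dropped = rows_before - len(sample_reviews)

print(f'Dropped {rows_dropped} row(s) with missing text.')
```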


#### Add actual `business_name` to Review Data

In [38]:
# Build a dictionary to map business_id : business_name 
business_id_name_mapper = dict(zip(business_df['business_id'], business_df['name']))

In [39]:
%%time
# add business name to dataframe
review_df['business_name'] = review_df['business_id'].map(business_id_name_mapper)

CPU times: user 1.62 s, sys: 55.8 ms, total: 1.67 s
Wall time: 1.7 s
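`Series.map` returns `NaN` for any `business_id` that has no entry in the mapping, so it can be worth counting unmatched rows after the lookup. The sketch below uses a Series indexed by `business_id` in place of the dictionary; both frames and the id `'zz'` are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature business and review data.
business_sample = pd.DataFrame({
    'business_id': ['a1', 'b2'],
    'name': ['Cafe A', 'Diner B'],
})
review_sample = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3'],
    'business_id': ['a1', 'b2', 'zz'],  # 'zz' has no matching business
})

# A Series indexed by business_id works as a lookup table for map().
name_by_id = business_sample.set_index('business_id')['name']
review_sample['business_name'] = review_sample['business_id'].map(name_by_id)

# Ids without a match end up as NaN in the new column.
unmatched = int(review_sample['business_name'].isna().sum())
print(f'{unmatched} review(s) could not be matched to a business name.')
```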


In [40]:
review_df.head()

Unnamed: 0,text,cool,funny,review_id,date,stars,business_id,useful,user_id,business_name
0,The pizza was okay. Not the best I've had. I p...,0,0,x7mDIiDB3jEiPGPHOmDzyw,2011-02-25,2,iCQpiavjjPzJ5_3gPD5Ebg,0,msQe1u7Z_XuqjGoqhB0J5g,Secret Pizza
1,I love this place! My fiance And I go here atl...,0,0,dDl8zu1vWPdKGihJrwQbpw,2012-11-13,5,pomGBqfbxcqPv14c3XH-ZQ,0,msQe1u7Z_XuqjGoqhB0J5g,Leticia's Mexican Cocina
2,Terrible. Dry corn bread. Rib tips were all fa...,1,1,LZp4UX5zK3e-c5ZGSeo3kA,2014-10-23,1,jtQARsP6P-LbkyjbO1qNGg,3,msQe1u7Z_XuqjGoqhB0J5g,H&H BBQ Plus 2
3,Back in 2005-2007 this place was my FAVORITE t...,0,0,Er4NBWCmCD4nM8_p1GRdow,2011-02-25,2,elqbBhBfElMNSrjFqW3now,2,msQe1u7Z_XuqjGoqhB0J5g,Pin Kaow Thai Restaurant
4,Delicious healthy food. The steak is amazing. ...,0,0,jsDu6QEJHbwP2Blom1PLCA,2014-09-05,5,Ums3gaP2qM3W1XcA5r6SsQ,0,msQe1u7Z_XuqjGoqhB0J5g,Braddah's Island Style


#### Replace `\n` characters with spaces

In [41]:
review_df['text'] = review_df['text'].str.replace('\n', ' ', regex = False)

In [42]:
# Confirm '\n' characters are removed
review_df[review_df['text'].str.contains('\n')]

Unnamed: 0,text,cool,funny,review_id,date,stars,business_id,useful,user_id,business_name
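Replacing each `\n` with a single space leaves double spaces where a review had blank lines, and it does not touch `\r` or tab characters. If that matters downstream, one option is to collapse every run of whitespace into one space, as in this sketch on invented strings:

```python
import pandas as pd

# Hypothetical review texts containing different kinds of line breaks.
sample_text = pd.Series([
    'Great food.\n\nWill return!',
    'Line one\r\nline two',
    'No breaks',
])

# \s+ matches any run of whitespace (spaces, tabs, \r, \n); strip() trims the ends.
cleaned_text = sample_text.str.replace(r'\s+', ' ', regex=True).str.strip()
print(cleaned_text.tolist())
```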


#### Keep relevant columns

In [43]:
review_df = review_df[['date', 'stars', 'text', 'review_id', 'business_id', 'business_name']]

In [77]:
review_df[review_df['business_name']=='Euro Shawarma']

Unnamed: 0,date,stars,text,review_id,business_id,business_name
29656,2014-02-27,2,"I've had better shawarama, it just tastes a li...",BZqwk1avJKbaEVU9r94G7A,NNs2v3f6FIdZ7cSu3LRaHw,Euro Shawarma
36953,2014-04-09,5,One of the best if not THE best beef shawarma ...,JcOoOV98S_vqzOtNAluBtA,NNs2v3f6FIdZ7cSu3LRaHw,Euro Shawarma
140389,2016-02-18,3,If u go during rush time good luck because it ...,Lb33clc09RDML39TbMll4Q,NNs2v3f6FIdZ7cSu3LRaHw,Euro Shawarma
181709,2013-06-08,3,"The restaurant is very clean and modern, I was...",d19BqUan5Ke27qPgJ4HYqg,NNs2v3f6FIdZ7cSu3LRaHw,Euro Shawarma
191503,2014-10-07,4,Very good service there.... always have fresh ...,zU1te6zVk4Jd3Nkp5x9maw,NNs2v3f6FIdZ7cSu3LRaHw,Euro Shawarma
219112,2015-01-02,5,This place has the best prepared sides of any ...,VRWcyii54hqPI58PVXpbmw,NNs2v3f6FIdZ7cSu3LRaHw,Euro Shawarma
254957,2015-01-21,4,My pick for the best shawarma in town. Service...,NUnkOEoPenLixVNNS0c6zg,NNs2v3f6FIdZ7cSu3LRaHw,Euro Shawarma
283079,2014-12-20,2,No rice for you! I felt like I was in a Seinf...,KICPIFVqWNjBXv_iJZqiFQ,NNs2v3f6FIdZ7cSu3LRaHw,Euro Shawarma
286156,2015-10-08,5,Chicken shawarma plate every time I go! It's ...,qPjm25kJVPOQA3WL9tJruw,NNs2v3f6FIdZ7cSu3LRaHw,Euro Shawarma
401455,2017-10-14,5,One of the best shawarma places in the city. N...,lJ7m7gk2Ssj21kVfmq6l8Q,NNs2v3f6FIdZ7cSu3LRaHw,Euro Shawarma


In [59]:
review_df.head()

Unnamed: 0,date,stars,text,review_id,business_id,business_name
0,2011-02-25,2,The pizza was okay. Not the best I've had. I p...,x7mDIiDB3jEiPGPHOmDzyw,iCQpiavjjPzJ5_3gPD5Ebg,Secret Pizza
1,2012-11-13,5,I love this place! My fiance And I go here atl...,dDl8zu1vWPdKGihJrwQbpw,pomGBqfbxcqPv14c3XH-ZQ,Leticia's Mexican Cocina
2,2014-10-23,1,Terrible. Dry corn bread. Rib tips were all fa...,LZp4UX5zK3e-c5ZGSeo3kA,jtQARsP6P-LbkyjbO1qNGg,H&H BBQ Plus 2
3,2011-02-25,2,Back in 2005-2007 this place was my FAVORITE t...,Er4NBWCmCD4nM8_p1GRdow,elqbBhBfElMNSrjFqW3now,Pin Kaow Thai Restaurant
4,2014-09-05,5,Delicious healthy food. The steak is amazing. ...,jsDu6QEJHbwP2Blom1PLCA,Ums3gaP2qM3W1XcA5r6SsQ,Braddah's Island Style


In [45]:
review_df.shape

(3654796, 6)

#### Save clean review data

In [60]:
%%time
review_df.to_csv(restaurant_review_filepath, index = False)

CPU times: user 1min 3s, sys: 11.6 s, total: 1min 15s
Wall time: 1min 27s
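Passing `index = False` keeps the row index out of the file, so reloading it does not produce an extra `Unnamed: 0` column. A quick round-trip check on a small invented frame, written to a temporary directory rather than the project's data folder, might look like this:

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample including a comma and quotes inside the text.
sample_reviews = pd.DataFrame({
    'date': ['2011-02-25', '2012-11-13'],
    'stars': [2, 5],
    'text': ['The pizza was okay, not great.', 'He said "wow" twice.'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample_review.csv')
    sample_reviews.to_csv(sample_path, index=False)
    reloaded = pd.read_csv(sample_path)

# The reloaded frame should have the same columns and the same text values.
same_columns = list(reloaded.columns) == list(sample_reviews.columns)
same_text = reloaded['text'].tolist() == sample_reviews['text'].tolist()
print(same_columns, same_text)
```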


### Clean restaurant data

#### Load restaurant data

In [47]:
restaurant_df = pd.read_csv(restaurant_filepath)


In [48]:
restaurant_df.head()

Unnamed: 0,attributes.WiFi,hours.Monday,attributes.RestaurantsPriceRange2,hours.Sunday,attributes.CoatCheck,attributes.RestaurantsGoodForGroups,neighborhood,attributes.DriveThru,city,attributes.NoiseLevel,...,stars,attributes.RestaurantsAttire,hours.Thursday,attributes.Music,review_count,attributes.BikeParking,name,attributes.OutdoorSeating,attributes.WheelchairAccessible,attributes.Corkage
0,,8:30-17:0,2.0,,,True,,,Calgary,average,...,4.0,casual,11:0-21:0,,24,False,Minhas Micro Brewery,False,,
1,no,,2.0,17:0-23:0,,True,,False,Henderson,,...,4.5,casual,,,3,False,CK'S BBQ & Catering,True,True,
2,free,10:0-22:0,2.0,10:0-22:0,,True,Rosemont-La Petite-Patrie,,Montréal,average,...,4.0,casual,10:0-22:0,,5,True,La Bastringue,False,,
3,,,2.0,,,True,Ridgewood,,Mississauga,,...,2.0,casual,,,7,,Thai One On,False,,
4,no,0:0-0:0,1.0,0:0-0:0,,True,,,Avondale,average,...,2.5,casual,0:0-0:0,,40,True,Filiberto's Mexican Food,False,True,


In [49]:
restaurant_df.shape

(57173, 61)

In [50]:
restaurant_df.columns

Index(['attributes.WiFi', 'hours.Monday', 'attributes.RestaurantsPriceRange2',
       'hours.Sunday', 'attributes.CoatCheck',
       'attributes.RestaurantsGoodForGroups', 'neighborhood',
       'attributes.DriveThru', 'city', 'attributes.NoiseLevel', 'address',
       'attributes.BestNights', 'attributes.GoodForKids',
       'attributes.GoodForMeal', 'attributes.ByAppointmentOnly',
       'attributes.Open24Hours', 'attributes.RestaurantsTakeOut',
       'postal_code', 'hours.Friday', 'attributes.BusinessAcceptsBitcoin',
       'attributes.RestaurantsReservations', 'business_id', 'hours.Wednesday',
       'attributes.RestaurantsCounterService', 'attributes.HappyHour',
       'attributes.HairSpecializesIn', 'attributes.GoodForDancing',
       'attributes.AgesAllowed', 'attributes.Caters', 'is_open', 'attributes',
       'categories', 'attributes.RestaurantsDelivery', 'attributes.Alcohol',
       'latitude', 'hours.Saturday', 'attributes.DietaryRestrictions',
       'attributes.DogsAllow

### Cleaning Restaurant Data

#### Drop rows with missing `postal_code` or `city`

In [51]:
restaurant_df['postal_code'].isna().sum() / restaurant_df.shape[0]

0.002028929739562381

In [52]:
restaurant_df['city'].isna().sum() / restaurant_df.shape[0]

3.498154723383415e-05

About 0.2% of restaurants are missing a postal code, and far fewer (about 0.003%) are missing a city. Because so few rows are affected, I will remove them.

In [53]:
restaurant_df = restaurant_df.dropna(subset=['postal_code', 'city'])

In [54]:
len(restaurant_df)

57056
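The fractions computed above can also be obtained in one step, because the mean of a boolean mask is the share of `True` values. The following sketch, on an invented four-row frame, shows the per-column fractions, the share of rows missing either field, and the effect of `dropna`:

```python
import pandas as pd

# Hypothetical sample with one missing postal code and one missing city.
sample_restaurants = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'postal_code': ['89002', None, 'T2E 6L6', '85323'],
    'city': ['Henderson', 'Calgary', None, 'Avondale'],
})

location_columns = ['postal_code', 'city']

# Fraction of missing values per column.
missing_fraction = sample_restaurants[location_columns].isna().mean()

# Fraction of rows missing at least one of the two columns.
missing_either = sample_restaurants[location_columns].isna().any(axis=1).mean()

cleaned = sample_restaurants.dropna(subset=location_columns)
print(missing_fraction, missing_either, len(cleaned), sep='\n')
```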

#### Keep relevant columns

In [55]:
columns = ['name', 'business_id', 'stars', 'review_count', 
           'categories', 'longitude', 'latitude', 'postal_code', 'city', 'state']

restaurant_df = restaurant_df[columns]


In [56]:
restaurant_df.head()

Unnamed: 0,name,business_id,stars,review_count,categories,longitude,latitude,postal_code,city,state
0,Minhas Micro Brewery,Apn5Q_b6Nz61Tq4XzPdf9A,4.0,24,"Tours, Breweries, Pizza, Restaurants, Food, Ho...",-114.031675,51.091813,T2E 6L6,Calgary,AB
1,CK'S BBQ & Catering,AjEbIBw6ZFfln7ePHha9PA,4.5,3,"Chicken Wings, Burgers, Caterers, Street Vendo...",-114.939821,35.960734,89002,Henderson,NV
2,La Bastringue,O8S5hYJ1SMc8fA4QBtVujA,4.0,5,"Breakfast & Brunch, Restaurants, French, Sandw...",-73.5993,45.540503,H2G 1K7,Montréal,QC
3,Thai One On,6OuOZAok8ikONMS_T3EzXg,2.0,7,"Restaurants, Thai",-79.632763,43.712946,L4T 1A8,Mississauga,ON
4,Filiberto's Mexican Food,8-NRKkPY1UiFXW20WXKiXg,2.5,40,"Mexican, Restaurants",-112.341302,33.448106,85323,Avondale,AZ


#### Last check for missing values

In [57]:
restaurant_df.isna().sum()

name            0
business_id     0
stars           0
review_count    0
categories      0
longitude       0
latitude        0
postal_code     0
city            0
state           0
dtype: int64

#### Save clean restaurant data

In [58]:
restaurant_df.to_csv(restaurant_filepath, index = False)
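When this file is read back later, pandas infers column types again. Postal codes that are purely numeric may then be parsed as integers, which would drop any leading zeros. A hedged sketch of how `dtype` can keep them as text; the two-row CSV is invented and whether the real data contains such codes has not been checked here:

```python
import io
import pandas as pd

# Hypothetical CSV content with purely numeric postal codes.
csv_text = 'name,postal_code\nCafe A,02134\nDiner B,89002\n'

# Default type inference parses the column as integers.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing str keeps the codes exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'postal_code': str})

print(inferred['postal_code'].tolist())
print(as_text['postal_code'].tolist())
```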