**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - EDA Checkpoint

# Names

- Car
- Rab
- Lew
- Jac
- Zif

# Research Question

Do restaurants further from college/university campuses have higher or lower ratings than those that are closer? Can we determine if the reviewers of these restaurants are students and if so, do students increase or lower Yelp ratings at these restaurants?

## Background and Prior Work

Colleges and universities host many students every year and as the number of students grow business around these places also grow, some even forming college towns where businesses’ livelihoods depend on students like at UC Davis. We were curious as to specifically restaurants in the vicinity of a college or university and as students of UCSD, it follows that we were curious about restaurants around that specific area. As a restaurant’s success or profits are hard to measure and obtain, we settled on trying to see if restaurants’ ratings are affected by their vicinity to UCSD.

One article published by QSR Magazine seems to indicate so, citing data from “College & University Keynote Report” from Datassential saying that 58% of students eat off campus and 49% of students consider themselves foodies and are more conscious about what they eat.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) This means that more than half of the student body is regularly eating at restaurants around their campus and reviewing and recommending those restaurants. Another study that more closely looks at customer satisfaction with food service, points out several important factors that can contribute to a restaurant’s rating such as their food quality, service quality, decor quality, and most importantly price.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) The conclusions they found is that good service and then good food were the best indicators of a high customer satisfaction. These point out significant attributes that aren’t considered in our research question, and possibly could be much greater of a factor of a restaurant’s rating than distance.

1. Baltazar, Amanda. “Restaurants Would Be Wise to Court College Students.” QSR Magazine, 7 July 2023, https://www.qsrmagazine.com/operations/business-advice/restaurants-would-be-wise-court-college-students/
2. Serhan, Mireille, and Carole Serhan. “The Impact of Food Service Attributes on Customer Satisfaction in a Rural University Campus Environment.” International Journal of Food Science, Hindawi, 31 Dec. 2019, https://www.hindawi.com/journals/ijfs/2019/2154548/

# Hypothesis


We believe that restaurants further from the university/college campuses will have similar ratings as those closer to university/college campuses. While an influx of students in the area can have an impact on nearby restaurants, we believe that this impact will be minimal and negligible. This is because students don’t make up a majority of a restaurants’ clientele, especially in more metropolitan areas. There are many other customers that either live in the area or are traveling that can give ratings to restaurants.

# Data

## Data overview


Dataset #1  
- Dataset Name: Yelp Academic Dataset
- Link to the dataset: https://www.yelp.com/dataset
- Number of observations: 6990280 reviews, 150346 businesses
- Number of variables: 9, 14

This dataset contains information on a selection of businesses, reviews, and user data centered around different metropolitan areas from the app Yelp. It is separated into 5 different json files, businesses, reviews, checkin, tip, user. We only particularly care about the businesses and reviews. For businesses, the variables we care about are the business_id, city, longitude, latitude, stars, review_count. For reviews, the variables we care about are the business_id, stars and text.

## Yelp Academic Dataset

In [1]:
# import necessary libraries

import numpy as np
import pandas as pd

First let's load the business data into a dataframe.

In [2]:
# load business data
business = pd.read_json('https://drive.usercontent.google.com/download?id=1HGtRB3g1Hx1t1j2vPqCdTEfG-WJtTFVN&confirm=xxx', lines=True)

In [3]:
business.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."


Let's drop all the observations with missing values in the important columns

In [4]:
business = business.dropna(subset=['latitude', 'longitude', 'stars', 'review_count', 'categories']) # this changes nothing though

Restaurants and cafes are the businesses that we care about, so let's filter our business dataframe by category.

In [5]:
def identify_restaurants(data, keywords):
    keywords = [keyword.lower() for keyword in keywords]
    def check_categories(category):
        return any(keyword in category.lower() for keyword in keywords)
    return data[data['categories'].apply(check_categories)]

In [6]:
keywords = ['Restaurants', 'Food', 'Coffee & Tea']
identify_restaurants(business, keywords)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."
5,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,615 S Main St,Ashland City,TN,37015,36.269593,-87.058943,2.0,6,1,"{'BusinessParking': 'None', 'BusinessAcceptsCr...","Burgers, Fast Food, Sandwiches, Food, Ice Crea...","{'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', '..."
8,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,MO,63123,38.565165,-90.321087,3.0,19,0,"{'Caters': 'True', 'Alcohol': 'u'full_bar'', '...","Pubs, Restaurants, Italian, Bars, American (Tr...",
9,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,2312 Dickerson Pike,Nashville,TN,37207,36.208102,-86.768170,1.5,10,1,"{'RestaurantsAttire': ''casual'', 'Restaurants...","Ice Cream & Frozen Yogurt, Fast Food, Burgers,...","{'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', '..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150327,cM6V90ExQD6KMSU3rRB5ZA,Dutch Bros Coffee,1181 N Milwaukee St,Boise,ID,83704,43.615401,-116.284689,4.0,33,1,"{'WiFi': ''free'', 'RestaurantsGoodForGroups':...","Cafes, Juice Bars & Smoothies, Coffee & Tea, R...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-17:0', '..."
150328,1jx1sfgjgVg0nM6n3p0xWA,Savaya Coffee Market,11177 N Oracle Rd,Oro Valley,AZ,85737,32.409552,-110.943073,4.5,41,1,"{'BusinessParking': '{'garage': False, 'street...","Specialty Food, Food, Coffee & Tea, Coffee Roa...","{'Monday': '0:0-0:0', 'Tuesday': '6:0-14:0', '..."
150336,WnT9NIzQgLlILjPT0kEcsQ,Adelita Taqueria & Restaurant,1108 S 9th St,Philadelphia,PA,19147,39.935982,-75.158665,4.5,35,1,"{'WheelchairAccessible': 'False', 'Restaurants...","Restaurants, Mexican","{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'..."
150339,2O2K6SXPWv56amqxCECd4w,The Plum Pit,4405 Pennell Rd,Aston,DE,19014,39.856185,-75.427725,4.5,14,1,"{'RestaurantsDelivery': 'False', 'BusinessAcce...","Restaurants, Comfort Food, Food, Food Trucks, ...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W..."


In [7]:
d = business['city'].value_counts()
d[:30]

city
Philadelphia        14560
Tucson               9244
Tampa                9042
Indianapolis         7537
Nashville            6969
New Orleans          6208
Reno                 5929
Edmonton             5053
Saint Louis          4824
Santa Barbara        3828
Boise                2934
Clearwater           2221
Saint Petersburg     1660
Metairie             1639
Sparks               1621
Wilmington           1444
Franklin             1414
St. Louis            1255
St. Petersburg       1185
Meridian             1043
Brandon              1032
Largo                1000
Carmel                966
Cherry Hill           960
West Chester          838
Goleta                798
Brentwood             767
Palm Harbor           665
Greenwood             649
New Port Richey       603
Name: count, dtype: int64

From above we can see that the dataset has data for restaurants from many different cities. We will pick a few and find universities/colleges within to base our research around. Our initial selection was based on the top cities (with most reviews) in our dataset. However, upon further analysis, we identified that 51% of the cities in the top 29 in our dataset didn't have universities that fit our research criteria. We then refined our list to only include cities with universities in the city limits.

In [8]:
universities = pd.read_csv('./yelp_dataset/universities.csv')

FileNotFoundError: [Errno 2] No such file or directory: './yelp_dataset/universities.csv'

In [9]:
universities = universities.set_index('City')

NameError: name 'universities' is not defined

In [10]:
universities

NameError: name 'universities' is not defined

Let's also filter our business dataframe to only include businesses inside these cities.

In [11]:
business = business[business['city'].isin(universities.index)]
business

NameError: name 'universities' is not defined

We can then use this information on universities to calculate whether a business is close or far from a university using their latitude and longitude positions.

In [12]:
# this is a lat long distance calculator from https://community.esri.com/t5/coordinate-reference-systems-blog/distance-on-a-sphere-the-haversine-formula/ba-p/902128#:~:text=All%20of%20these%20can%20be,longitude%20of%20the%20two%20points

def haversine(coord1, coord2):
    import math

    # Coordinates in decimal degrees (e.g. 2.89078, 12.79797)
    lon1, lat1 = coord1
    lon2, lat2 = coord2

    R = 6371000  # radius of Earth in meters
    phi_1 = math.radians(lat1)
    phi_2 = math.radians(lat2)

    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)

    a = math.sin(delta_phi / 2.0)**2 + math.cos(phi_1) * math.cos(phi_2) * math.sin(delta_lambda / 2.0)**2

    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

    meters = R * c  # output distance in meters
    km = meters / 1000.0  # output distance in kilometers

    meters = round(meters, 3)
    km = round(km, 3)


#     print(f"Distance: {meters} m")
#     print(f"Distance: {km} km‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍")
    return km

In [13]:
def calc_distance(df):
    new_df = df.copy()
    lat, long = universities.loc[df['city'], ['Latitude', 'Longitude']]
    lat1, long1 = df[['latitude', 'longitude']]
    # threshold 10 km
    new_df['close_to_university'] = haversine((lat, long), (lat1, long1)) < 10
    return new_df

We decided to use a threshold value of 10 km. A business would be considered close to an university/college if it is within 10 km of one and far otherwise.

In [14]:
business = business.apply(calc_distance, axis=1)

NameError: name 'universities' is not defined

Now let's create a dataframe for our reviews. There are simply too many reviews (6990280!) and the kernel cannot handle loading that many observations into memory, thus we have cut down the file using python to just the first 100000 reviews.

In [15]:
review = pd.read_json('https://drive.usercontent.google.com/download?id=1xE5dbDWd1Mp8kFQwoMmtLVj5xPq9tpuG&confirm=xxx', lines=True)

In [16]:
review.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15


Again we drop observations with missing values in the important variables.

In [17]:
review.dropna(subset=['stars', 'text'])

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15
...,...,...,...,...,...,...,...,...,...
99995,pAEbIxvr6ebx2bHc1XvguA,SMH5CeiLvKx61lKwtLZ_PA,lV0k3BnslFRkuWD_kbKd0Q,4,0,0,0,Came here for lunch with a group. They were bu...,2018-05-30 22:28:56
99996,xH1AoE-4nf2ECGQJRjO4_g,2clTdtp-BjphxLjN83CpUA,G0xz3kyRhRi6oZl7KfR0pA,1,1,0,0,The equipment is so old and so felty! I just u...,2015-04-05 23:31:52
99997,GatIbXTz-WDru5emONUSIg,MRrN6DH3QGCFcDv5RENYVg,C4lZdhasjZVQyDlOiXY1sA,4,0,0,0,This is one of my favorite Mexican restaurants...,2016-06-04 00:59:15
99998,6NfkodAdhvI89xONXuBC3A,rnNQzeKJbvqVCsYsL10mkQ,dChRGpit9fM_kZK5pafNyA,2,0,0,0,Came here for brunch - had an omlette ($19 + t...,2018-06-11 12:45:08


To figure out which reviews are written by college students, we queried reviews that have any mention of words related to them. To do this, we chose keywords such as student, students, college, colleges, university, universities, and uni. These keywords were chosen in the hopes that the review was written by a person who is mentioning their status as a student to be used as their way of reliability to other reviewees.

We also want to disclose concerns of using such a method. By querying reviews written by students this way, we could be counting false positives and missing false negatives. In this case, a false positive is when a review incorrectly classifies a non-student's review as a student, while a false negative is when a review incorrectly classifies a student's review as a non-student. Here are a few examples:
1. False Positive: A non-student brought up a "college" nearby.
2. False Positive: The "students" mentioned in a review could be students in highschool, not college.
3. False Positive: A non-student talks about how a restaurant is often visited by many "students"
4. False Negative: The review could have been written by a student, but the person did not mention that they were a student in their review.

In [18]:
def identify_student_reviews(data, keywords):

    keywords = [keyword.lower() for keyword in keywords]


    def check_keywords(review):

        return any(keyword in review.lower() for keyword in keywords)

    data['student_or_not'] = data['text'].apply(check_keywords)
    return data

In [19]:
keywords = ['student', 'students', 'college', 'colleges', 'university', 'universities', 'uni', "univ", "penn", "upenn", "ua", "uarizona", "usf", "purdue", "vanderbilt", "vandy", "vu", "unr", "u of a", "ualberta", "washu", "wustl", "ucsb", "uc"]
review_student = identify_student_reviews(review, keywords)

We have included some common abbreviations of the universities/colleges we are focusing our analysis around. Below are the translations:
- University of Pennsylvania -> "Penn" and "UPenn" (2)
- University of Arizona Tuscon -> "UA" and "UArizona" (2)
- University of South Florida -> "USF" (1)
- Purdue University -> "Purdue" (1)
- Vanderbilt University -> "Vanderbilt" and "Vandy" and "VU" (3)
- Tulane University -> i couldn't find any informal names for this (0)
- University of Nevada -> "UNR" (university of nevada, reno) (1)
- University of Alberta -> "U of A" and "UAlberta" (2)
- Washington University in St. Louis -> "WashU" and "WUSTL" (2)
- UC Santa Barbara -> "UCSB" (1)

In [20]:
review_student.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,student_or_not
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11,True
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18,True
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30,True
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03,False
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15,True


We can then inner join these two dataframe's on business_id to get a dataframe where the observations are reviews with information on if the review was written by a student and if the business the review was written for is close or far from a university/college.

In [21]:
review_businesses = pd.merge(review_student, business, how='inner', on='business_id')

Let's reduce this dataframe to the columns that we care about, namely the review rating, whether its written by a student, business' average rating, the amount of reviews that business has, and whether that business is close or far from a university/college.

In [22]:
review_businesses = review_businesses[['city', 'name', 'stars_x', 'student_or_not', 'stars_y', 'review_count', 'close_to_university']]
review_businesses = review_businesses.rename(columns={'name': 'restaurant_name', 'stars_x': 'rating', 'stars_y': 'avg_rating'})

KeyError: "['close_to_university'] not in index"

In [23]:
review_businesses

Unnamed: 0,review_id,user_id,business_id,stars_x,useful,funny,cool,text,date,student_or_not,...,state,postal_code,latitude,longitude,stars_y,review_count,is_open,attributes,categories,hours
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11,True,...,PA,19454,40.210196,-75.223639,3.0,169,1,"{'NoiseLevel': 'u'average'', 'HasTV': 'False',...","Restaurants, Breakfast & Brunch, Food, Juice B...","{'Monday': '7:30-15:0', 'Tuesday': '7:30-15:0'..."
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18,True,...,PA,19119,39.952103,-75.172753,5.0,144,0,"{'BusinessAcceptsCreditCards': 'True', 'GoodFo...","Active Life, Cycling Classes, Trainers, Gyms, ...","{'Monday': '6:30-20:30', 'Tuesday': '6:30-20:3..."
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30,True,...,AZ,85713,32.207233,-110.980864,3.5,47,1,"{'RestaurantsReservations': 'True', 'BusinessP...","Restaurants, Breakfast & Brunch",
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03,False,...,PA,19114,40.079848,-75.025080,4.0,181,1,"{'Caters': 'True', 'Ambience': '{'romantic': F...","Halal, Pakistani, Restaurants, Indian","{'Tuesday': '11:0-21:0', 'Wednesday': '11:0-21..."
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15,True,...,LA,70119,29.962102,-90.087958,4.0,32,0,"{'BusinessParking': '{'garage': False, 'street...","Sandwiches, Beer, Wine & Spirits, Bars, Food, ...","{'Monday': '0:0-0:0', 'Friday': '11:0-17:0', '..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99991,pAEbIxvr6ebx2bHc1XvguA,SMH5CeiLvKx61lKwtLZ_PA,lV0k3BnslFRkuWD_kbKd0Q,4,0,0,0,Came here for lunch with a group. They were bu...,2018-05-30 22:28:56,False,...,IN,46260,39.913046,-86.200355,4.0,175,0,"{'RestaurantsTableService': 'True', 'OutdoorSe...","American (Traditional), Breakfast & Brunch, Re...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-20:0', '..."
99992,xH1AoE-4nf2ECGQJRjO4_g,2clTdtp-BjphxLjN83CpUA,G0xz3kyRhRi6oZl7KfR0pA,1,1,0,0,The equipment is so old and so felty! I just u...,2015-04-05 23:31:52,True,...,PA,19401,40.147359,-75.318160,2.5,55,1,"{'GoodForKids': 'False', 'ByAppointmentOnly': ...","Yoga, Gyms, Trainers, Fitness & Instruction, A...","{'Monday': '0:0-0:0', 'Tuesday': '5:0-22:0', '..."
99993,GatIbXTz-WDru5emONUSIg,MRrN6DH3QGCFcDv5RENYVg,C4lZdhasjZVQyDlOiXY1sA,4,0,0,0,This is one of my favorite Mexican restaurants...,2016-06-04 00:59:15,False,...,PA,19355,40.042104,-75.541083,3.5,107,1,"{'NoiseLevel': 'u'quiet'', 'RestaurantsTakeOut...","Mexican, Restaurants","{'Monday': '11:30-21:0', 'Tuesday': '11:30-21:..."
99994,6NfkodAdhvI89xONXuBC3A,rnNQzeKJbvqVCsYsL10mkQ,dChRGpit9fM_kZK5pafNyA,2,0,0,0,Came here for brunch - had an omlette ($19 + t...,2018-06-11 12:45:08,False,...,PA,19103,39.950656,-75.170899,4.0,618,1,"{'BusinessAcceptsCreditCards': 'True', 'Restau...","Restaurants, American (New), Breakfast & Brunc...","{'Monday': '0:0-0:0', 'Tuesday': '17:0-21:0', ..."


Note there are many restaurants with the same name, however as these are different restaurants of the same chain, we have decided to keep them as the information about close or far could be different would be useful to study there. The other reason is that there are many reviews for the same restaurant so in many cases it is still referring to the same restaurant.

# Results

## Exploratory Data Analysis

Carry out whatever EDA you need to for your project.  Because every project will be different we can't really give you much of a template at this point. But please make sure you describe the what and why in text here as well as providing interpretation of results and context.

Now we can use the combined reviews and businesses dataframe to answer our research question: Do restaurants further from college/university campuses have higher or lower ratings than those that are closer? Can we determine if the reviewers of these restaurants are students and if so, do students increase or lower Yelp ratings at these restaurants?

First let's count how many restaurants are close to an university/college and how many aren't.

What is the average Yelp rating of the restaurants that are close versus the restaurants that are far?

What can we understand from this?

Let's also check the counts of close and far restaurants per university/college.

What is the average Yelp rating of those? What about the average across all these universities, is it the same or different? What does that similarity/difference mean?

Do restaurants closer to university/college campuses have higher or lower ratings than those that are farther?

Ultimately, we want to determine how students from these campuses impact these restaurants, i.e. do they increase or decrease restaurant's Yelp ratings? Let's start by checking the reviews of the restaurants that are close. How many are of these reviews are from students and how many are from non students?

What is the average Yelp rating among the reviews from students? What is the average Yelp rating among the reviews from non students?

What can we understand from this?

We can again split this up by university to see if some universities have different results from others. 

### Section 1 of EDA - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotate figures and text that puts these results into context.

In [24]:
import seaborn as sns
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm

ModuleNotFoundError: No module named 'patsy'

### 1-3

How many businesses are close to a university?

In [25]:
review_businesses[review_businesses['close_to_university'] == True].shape[0]

KeyError: 'close_to_university'

How many businesses are not close to a university?

In [26]:
review_businesses[review_businesses['close_to_university'] == False].shape[0]

KeyError: 'close_to_university'

Average rating for business close to the university

In [27]:
review_businesses[review_businesses['close_to_university'] == True]['avg_rating'].mean()

KeyError: 'close_to_university'

Average rating for businesses not close to the university

In [28]:
review_businesses[review_businesses['close_to_university'] == False]['avg_rating'].mean()

KeyError: 'close_to_university'

In [29]:
test_statistic = review_businesses[review_businesses['close_to_university'] == True]['avg_rating'].mean() - review_businesses[review_businesses['close_to_university'] == False]['avg_rating'].mean()
test_statistic

KeyError: 'close_to_university'

When we compare the average ratings of restaurants close to and far from university campuses, the average ratings closer to the university are higher, but barely as their difference is less than 1. Because this difference is less than 1, we will be considering the average ratings to be similar. This similarity means that location and distance from a university does not have much impact on the average ratings of restaurants.

We'll be using this value as the test statistic to see whether this difference is observable by chance.

In [30]:
def permutation_tests():
    diff_array = list()
    shuffled_df = review_businesses.copy()
    for i in range (1000):
        shuffled_df['close_to_university'] = np.random.permutation(shuffled_df['close_to_university'])
        close = shuffled_df[shuffled_df['close_to_university'] == True]['avg_rating'].mean()
        not_close = shuffled_df[shuffled_df['close_to_university'] == False]['avg_rating'].mean()
        diff_array.append((close-not_close))
    return np.array(diff_array)

results = permutation_tests()
sns.histplot(results)
results.mean()
np.mean(test_statistic < results)

KeyError: 'close_to_university'

### 7 Linear Regression
We can perform linear regression to see if distance from a chosen university is a predictor for Yelp ratings. (We would need to create a column with exact distances.)

First let's create a new column for our dataframe that contains the distance from a university/college. We can adapt our calc_distance function from before to calculate distance by removing the threshold check. 

In [31]:
def calc_distance(df):
    new_df = df.copy()
    lat, long = universities.loc[df['city'], ['Latitude', 'Longitude']]
    lat1, long1 = df[['latitude', 'longitude']]
    new_df['distance'] = haversine((lat, long), (lat1, long1))
    return new_df

In [32]:
business_lg = business.apply(calc_distance, axis=1)

NameError: name 'universities' is not defined

In [33]:
business_lg

NameError: name 'business_lg' is not defined

In [34]:
review_business_lg = pd.merge(review_student, business_lg, how='inner', on='business_id')
review_business_lg = review_business_lg[['city', 'name', 'stars_x', 'student_or_not', 'stars_y', 'review_count', 'close_to_university', 'distance']]
review_business_lg = review_business_lg.rename(columns={'name': 'restaurant_name', 'stars_x': 'rating', 'stars_y': 'avg_rating'})

NameError: name 'business_lg' is not defined

Let's check the distribution on distances.

In [35]:
sns.histplot(data=business_lg['distance'])

f1 = plt.gcf()

NameError: name 'business_lg' is not defined

It looks like there's an outlier, lets remove it and check the distribution again.

In [36]:
business_lg = business_lg.drop(business_lg['distance'].idxmax())

NameError: name 'business_lg' is not defined

In [37]:
sns.histplot(data=business_lg['distance'])
f1 = plt.gcf()

NameError: name 'business_lg' is not defined

Interestingly there are no restaurants at a distance of between around 30 kilometers to 60 kilometers. Let's check the scatterplot to see if we can spot a linear relation between the Yelp ratings of restaurants and the distance they are from a university/college.

In [38]:
sns.scatterplot(data=review_business_lg, y='rating', x='distance')

NameError: name 'review_business_lg' is not defined

In [39]:
sns.scatterplot(data=review_business_lg, x='distance', y='avg_rating')

NameError: name 'review_business_lg' is not defined

From the scatterplot there seems to be no relation between Yelp ratings and distance as there is about the same amount of low and high ratings for restaurants of all distances from an university/college. (Note that the scatterplot looks like this because the Yelp ratings are at hard intervals) If we were to draw a line between these points it should look flat. Lets check using linear regression. 

In [40]:
outcome, predictors = patsy.dmatrices('rating ~ distance', review_business_lg)
mod = sm.OLS(outcome, predictors)
res_1 = mod.fit()

NameError: name 'patsy' is not defined

In [41]:
print(res_1.summary())

NameError: name 'res_1' is not defined

In [42]:
outcome_2, predictors_2 = patsy.dmatrices('avg_rating ~ distance', review_business_lg)
mod_2 = sm.OLS(outcome_2, predictors_2)
res_2 = mod_2.fit()

NameError: name 'patsy' is not defined

In [43]:
print(res_2.summary())

NameError: name 'res_2' is not defined

In both linear regressions, the first trying to predict review rating, and the second trying to predict a restaurant's average rating, both using distance, we see a high pvalue in the distance row meaning that we can't really conclude that distance affects the Yelp Rating either in the review rating or the average rating.

However, it might be the case for the average ratings, that because there are many repeated values, it could affect the model. Lets revert back to just the business_lg dataframe and run linear regression using that.

In [44]:
outcome_3, predictors_3 = patsy.dmatrices('stars ~ distance', business_lg)
mod_3 = sm.OLS(outcome_3, predictors_3)
res_3 = mod_3.fit()

NameError: name 'patsy' is not defined

In [45]:
print(res_3.summary())

NameError: name 'res_3' is not defined

Now we see a pvalue of 0! This means that we are pretty confidant in our coefficient. However, the coefficient for distance is -0.0014, a value that is pretty much 0. So even in the case that we are confident, distance barely affects the Yelp rating of a restaurant (when we work in a scale of 0.5, changes on a scale of 0.0014 don't really matter).

Lets check these coefficients per university to see if this is different for some specific universities.

In [46]:
# do linear regression per city, since we only picked one university/college per city
for city in review_business_lg['city'].unique():
    out, pred = patsy.dmatrices('avg_rating ~ distance', review_business_lg[review_business_lg['city']==city])
    mod = sm.OLS(out, pred)
    res = mod.fit()
    print(city, 'distance coef, pvalue:', res.params[1], res.pvalues[1])

NameError: name 'review_business_lg' is not defined

In [47]:
for city in business_lg['city'].unique():
    out, pred = patsy.dmatrices('stars ~ distance', business_lg[business_lg['city']==city])
    mod = sm.OLS(out, pred)
    res = mod.fit()
    print(city, 'distance coef, pvalue:', res.params[1], res.pvalues[1])

NameError: name 'business_lg' is not defined

Even separated among specific universities/colleges, we see mostly the same results. Except curiosly in Tuscon when using the combined dataframe. However, that is likely due to the extra repeated information in the average rating column.

### 8 - 10

How many are of these reviews are from students and how many are from non students?

In [48]:
close_reviews=review_businesses[review_businesses['close_to_university'] == True]
is_student=close_reviews[close_reviews['student_or_not']==True].shape[0]
print(is_student,'of the reviews of restaurants that are close to campus are from students')
not_student=close_reviews[close_reviews['student_or_not']==False].shape[0]
print(not_student, 'of the reviews of restaurants that are close to campus are not from students')

KeyError: 'close_to_university'

What is the average Yelp rating among the reviews from students? What is the average Yelp rating among the reviews from non students?

In [49]:
avg_rating_student=close_reviews[close_reviews['student_or_not']==True]['avg_rating'].mean()
print(avg_rating_student,'is the average Yelp rating among the reviews from students')
avg_rating_nonstudent=close_reviews[close_reviews['student_or_not']==False]['avg_rating'].mean()
print(avg_rating_nonstudent,' is the average Yelp rating among the reviews from non students')

NameError: name 'close_reviews' is not defined

44470 of the reviews of restaurants that are close to campus are from students
55530 of the reviews of restaurants that are close to campus are not from students

What is the average Yelp rating among the reviews from students? What is the average Yelp rating among the reviews from non students?

3.738700247357769 is the average Yelp rating among the reviews from students
3.9263641274986494  is the average Yelp rating among the reviews from non students

What can we understand from this?

1.Among the reviews of restaurants that are close to campus, are there more student reviewers or non-student reviewers? 2.Compare the avg Yelp rating between students reviewers and non-students reviewers.

### Section 2 of EDA if you need it  - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotate figures and text that puts these results into context.

In [50]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

We need inclusive data representation, by ensuring the dataset adequately represents different types of restaurants and doesn’t exclude any significant group. We must ensure that the data sources that we use comply with ethical standards and privacy laws. The data should be publicly available and not include any personal information about individuals who submitted the ratings. We must review the data and results to monitor for biases.

The data on Yelp and Google Maps is self-uploaded, so some restaurants may not appear. Since the data is self-uploaded, we believe the restaurants would be open to having other people see it. The restaurants are self-uploaded. After a restaurant is uploaded, they cannot be taken down unless the restaurant has closed down. Since closed restaurants do not show up, the data could be biased. For instance, restaurants that have closed down could potentially have lower ratings, but these lower ratings are no longer part of the data. Another bias could be with online reviews in general. Restaurants could encourage positive reviews by offering a discount or a free dessert to customers. Another concern is that people who have a negative experience might feel frustrated and post negative reviews while a person who has a great experience has no problems, so they also don’t feel the need to make a review. Negative reviews may be overrepresented. 

We must develop a methodology that addresses these issues, such as figuring out how we are going to include each level of budget, restaurant type, etc. We have to be sure to exclude personal identifiers (such as names) for those who submitted reviews. We also have to make sure to remove the reviewer’s names. We will include restaurants in the main price categories and all cuisines that are ordered to avoid bias.

# Team Expectations 

* Team members are expected to attend regular discord meetings, contribute ideas and actively participate in project discussions.
* Team members should be communicative and complete their assigned works by the agreed-upon deadlines.
* Team members should provide feedback to each other’s work in a respectful manner.

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Changes the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date   | Meeting Time    | Completed Before Meeting                                                         | Discuss at Meeting                                                       |
|:---------------|:----------------|:--------------------------------------------------------------------------------|:--------------------------------------------------------------------------|
| 1/20           | 1 PM            | Read & Think about COGS 108 expectations; brainstorm topics/questions       | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research |
| 1/26           | 10 AM           | Do background research on topic                                                | Discuss ideal dataset(s) and ethics; draft project proposal              |
| 2/1            | 10 AM           | Edit, finalize, and submit proposal; Search for datasets                     | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part |
| 2/14           | 4:15 PM         | Gather and Import Data (Jac, Rab); Wrangle and Preprocess Data (Lew, Car, Zif) |                                                                           |
| 2/23           | 4:15 PM         | Finalize wrangling/EDA; Begin Analysis (Team)                             | Discuss/edit Analysis; Complete project check-in                        |
| 2/25           | 2:30 PM         | Complete Data Checkpoint                                                       |                                                                           |
| 3/1 | 4:15 PM         | Discuss EDA Start EDA                                                            |                                                                           |
| 3/8           | 4:15 PM           | Complete analysis; Draft results/conclusion/discussion (Wasp)                  | Discuss/edit full project                                                |
| 3/20           | Before 11:59 PM | NA                                                                            | Turn in Final Project & Group Project Surveys                            |