# COGS 108 - Final Project (change this to your project's title)

<h1><a href="https://drive.google.com/file/d/1xmfes9IP1iozrRmKh65glpn_2M0vhymA/view?usp=sharing"> Presentation Video </a> </h1>

# Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)

* [  ] YES - make available
* [  ] NO - keep private

# Names

- Yasushi Oh
- Nancy Shen
- Taggert Smith
- Katelyn Villamin

# Abstract

Please write one to four paragraphs that give a very brief overview of why you did this, how you did it, and the major findings and conclusions.

# Research Question

Has the presence of PokéStops at Californian restaurants or cafes increased their traffic, as measured by the frequency of Yelp check-ins? We will examine Yelp check-in frequency alongside the number of Pokémon Go players over time, focusing on the initial release period in 2016.

## Background and Prior Work


- Include a general introduction to your topic
- Include explanation of what work has been done previously
- Include citations or links to previous work

This section will present the background and context of your topic and question in a few paragraphs. Include a general introduction to your topic and then describe what information you currently know about the topic after doing your initial research. Include references to other projects who have asked similar questions or approached similar problems. Explain what others have learned in their projects.

Find some relevant prior work, and reference those sources, summarizing what each did and what they learned. Even if you think you have a totally novel question, find the most similar prior work that you can and discuss how it relates to your project.

References can be research publications, but they need not be. Blogs, GitHub repositories, company websites, etc., are all viable references if they are relevant to your project. It must be clear which information comes from which references. (2-3 paragraphs, including at least 2 references)

  **Use inline citation through HTML footnotes to specify which references support which statements** 

For example: After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Use a minimum of 2 or 3 citations, but we prefer more.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) You need enough to fully explain and back up important facts. 

Note that if you click a footnote number in the paragraph above it will transport you to the proper entry in the footnotes list below.  And if you click the ^ in the footnote entry, it will return you to the place in the main text where the footnote is made.

To understand the HTML here, `<a name="..."></a>` is a tag that allows you to produce a named reference for a given location. Markdown has the construction `[text with hyperlink](#named reference)` that will produce a clickable link that transports you to the named reference.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.


# Hypothesis


We predict that PokéStops in Pokémon Go positively impact foot traffic at nearby locations, as seen by the frequency of Yelp check-ins. Since Pokémon Go saw a surge in popularity at launch, we believe that restaurants designated as PokéStops would have experienced a noticeable increase in visitors, resulting in a rise in the number of check-ins as well.

# Data

## Data overview


- Dataset #1: 
  - Dataset Name: PokéStop Coordinates
  - Link to the dataset: https://www.pogomap.info/ 
  - Number of observations: 386
  - Number of variables: 2 (latitude, longitude)
- Dataset #2: 
  - Dataset Name: Yelp Reviews: business.json
  - Link to the dataset: https://www.yelp.com/dataset/documentation/main 
  - Number of observations: 150,000
  - Number of variables: 8 - business_id, name, address, city, state, zip, latitude, longitude
- Dataset #3: 
  - Dataset Name: Yelp Reviews: checkin.json 
  - Link to the dataset: https://www.yelp.com/dataset/documentation/main 
  - Number of observations: 131,930
  - Number of Variables: 2 - business_id, date 


The first dataset contains the latitude and longitude of PokéStops and Gyms, the locations players are most likely to visit to collect rewards. The current dataset was scraped from the site linked above and covers PokéStops and Gyms within the downtown Los Angeles area. If needed, our scraping code can be pointed at other locations to collect PokéStops and Gyms there instead. The data needs no further cleaning, since it consists of a single list of coordinates, one entry per PokéStop. If needed, we can also distinguish PokéStops from Gyms within the dataset to analyze which of the two correlates more strongly with Yelp reviews.

Dataset 2 is taken from Yelp's official dataset and includes information on the businesses with Yelp reviews: their IDs, names, addresses, and coordinates. We plan to use the addresses and coordinates to match businesses against the PokéStop information from Dataset 1. We cleaned this dataset by filtering for categories that match restaurants and then filtering again for restaurants in California.
Dataset 3 is also taken from Yelp's official dataset. It records customer interactions with businesses and has 2 columns: business_id, a unique identifier for each business, and date, the dates on which customers checked in at that business. We cleaned this dataset by merging it with the already cleaned Dataset 2 on business_id in order to find out when customers went to each restaurant.
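
As an illustration of how the check-in dates can be turned into per-month totals, here is a minimal sketch. It assumes each row of checkin.json packs all of a business's check-ins into one comma-separated string; the toy values and the names `checkin_raw`, `checkin_long`, and `monthly_counts` are made up for this example and are not part of our pipeline.

```python
import pandas as pd

# Toy stand-in for checkin.json: one row per business, with all check-in
# timestamps packed into a single comma-separated string (made-up values).
checkin_raw = pd.DataFrame({
    "business_id": ["a1", "b2"],
    "date": [
        "2016-06-30 12:00:00, 2016-07-08 18:30:00, 2016-07-20 09:15:00",
        "2016-07-10 20:45:00",
    ],
})

# One row per check-in: split the string, explode, then reset the index.
checkin_long = (
    checkin_raw.assign(date=checkin_raw["date"].str.split(","))
    .explode("date")
    .reset_index(drop=True)
)

# Strip the space left after each comma and parse the timestamps.
checkin_long["date"] = pd.to_datetime(checkin_long["date"].str.strip())

# Label each check-in with its calendar month, e.g. "2016-07".
checkin_long["month"] = checkin_long["date"].dt.to_period("M").astype(str)

# Count check-ins per business per month, one column per month.
monthly_counts = (
    checkin_long.groupby(["business_id", "month"])
    .size()
    .unstack(fill_value=0)
)
print(monthly_counts)
```

In this toy example, business `a1` ends up with 1 check-in in 2016-06 and 2 in 2016-07, while `b2` has 0 and 1. Months with no check-ins anywhere do not get a column, so the real analysis may need to reindex the columns over the full observation period.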

We will combine these datasets as follows:

1. We merge Datasets #2 and #3 on business_id to create a combined dataset with business names and their check-in data, and then filter for businesses that identify as restaurants.
2. We filter the resulting dataset by coordinates to include only restaurants in our target area (likely downtown LA).
3. We calculate the total number of check-ins for each business in Dataset #2 by month, for each month in our observation period, and add them as columns to each observation in the resulting dataset.
4. We split the resulting dataset into “Near Pokestop” restaurants and “No Pokestop” restaurants by checking against Dataset #1 to determine whether each restaurant has a PokéStop within 80m (the maximum interaction distance in Pokémon GO).
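
The final split could be sketched as below. This is only one possible approach, assuming great-circle (haversine) distance is an acceptable way to measure the 80m radius; the function names `haversine_m` and `flag_near_pokestop`, the `near_pokestop` column, and the toy coordinates are all hypothetical.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters
MAX_DISTANCE_M = 80         # interaction radius quoted in the text above


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; inputs in degrees, broadcastable arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def flag_near_pokestop(restaurants, pokestops, max_distance_m=MAX_DISTANCE_M):
    """Return a copy of `restaurants` with a boolean `near_pokestop` column."""
    # Broadcast to an (n_restaurants, n_pokestops) matrix of distances.
    distances = haversine_m(
        restaurants["latitude"].to_numpy()[:, None],
        restaurants["longitude"].to_numpy()[:, None],
        pokestops["latitude"].to_numpy()[None, :],
        pokestops["longitude"].to_numpy()[None, :],
    )
    flagged = restaurants.copy()
    # A restaurant counts as "near" if its closest Pokestop is within range.
    flagged["near_pokestop"] = distances.min(axis=1) <= max_distance_m
    return flagged


# Made-up coordinates for illustration only.
toy_restaurants = pd.DataFrame({
    "Name": ["Cafe A", "Diner B"],
    "latitude": [34.4208, 34.4300],
    "longitude": [-119.6982, -119.7100],
})
toy_pokestops = pd.DataFrame({"latitude": [34.4210], "longitude": [-119.6980]})

flagged = flag_near_pokestop(toy_restaurants, toy_pokestops)
near = flagged[flagged["near_pokestop"]]
far = flagged[~flagged["near_pokestop"]]
```

The distance matrix has one row per restaurant row, so on the merged data (one row per check-in) it would be cheaper to drop duplicate `business_id` values first and merge the flag back afterwards.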

## Imports

In [None]:
import pandas as pd


## Pokéstop Coordinates

## Yelp Businesses

In [None]:
# Load business.json as a pandas dataframe
business_data = pd.read_json('yelp_academic_dataset_business.json', lines=True)

# Use json_normalize to flatten the structure
business = pd.json_normalize(business_data.to_dict(orient="records"))

# Take only the necessary columns
business_cleaned = business[['business_id', 'name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'categories']]

# Rename columns
business_cleaned = business_cleaned.rename(columns={'name': 'Name', 'address': 'Street Address', 'city': 'City', 'state': 'State', 'postal_code': 'Zip Code'})

## Yelp Check-ins

In [None]:
# Load checkin.json as a pandas dataframe
checkin_data = pd.read_json('yelp_academic_dataset_checkin.json', lines=True)

# Use json_normalize to flatten the structure
checkin = pd.json_normalize(checkin_data.to_dict(orient="records"))

# Split the comma-separated check-in string into one row per check-in
checkin['date'] = checkin['date'].str.split(',')
checkin = checkin.explode('date')

# Strip the space left after each comma and parse the timestamps
checkin['date'] = pd.to_datetime(checkin['date'].str.strip())

In [None]:
# Merge businesses with the check-ins
merged = pd.merge(business_cleaned, checkin, how='left', on='business_id')

# Keep only food-related businesses, based on their categories
categories = ['Food', 'Restaurants', 'Coffee', 'Bars']
pattern = '|'.join(categories)
restaurants = merged[merged['categories'].str.contains(pattern, case=False, na=False)]

# Keep only businesses in CA; copy so later column assignments act on this frame directly
restaurants = restaurants[restaurants['State'] == 'CA'].copy()
restaurants = restaurants.rename(columns={'date': 'Check-in Date'})

restaurants.head()

# Results

## Exploratory Data Analysis

Carry out whatever EDA you need to for your project.  Because every project will be different we can't really give you much of a template at this point. But please make sure you describe the what and why in text here as well as providing interpretation of results and context.

### Check-ins by City Barplot

To choose a region of focus, we first want to find out which city had the highest number of check-ins. We grouped the restaurant data by city and counted the check-ins in each, then visualized the result as a bar plot of check-in counts per city.

In [None]:
# Count check-ins by city

# Merge the duplicate spelling of Santa Barbara into a single city
restaurants['City'] = restaurants['City'].replace('Santa  Barbara', 'Santa Barbara')

checkin_count = restaurants.groupby(['City'])['Check-in Date'].count().reset_index().sort_values(by='Check-in Date', ascending=False)
checkin_count.head()

In [None]:
# Bar graph to visualize which city in California had the most check-ins
import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.barplot(data=checkin_count, x='City', y='Check-in Date', palette='pastel')

ax.set_title("Most Check-ins by City in California")
ax.set_xlabel("City")
ax.set_ylabel("Check-in Count")

# Rotate the city labels so they do not overlap
plt.setp(ax.get_xticklabels(), rotation=40, ha="right")

plt.tight_layout()
plt.show()

Here, we see that Santa Barbara had the highest number of check-ins within our timeframe. As such, we decided to focus this project on that city. To ensure that the restaurants in our dataset fall within the area covered by our Santa Barbara PokéStop data, we computed the minimum and maximum latitude and longitude of the PokéStops and compared them with those of the restaurants.

In [None]:
#load the data.json file for all the Santa Barbara Pokestop coordinates
pokestop_sb = pd.read_json('data.json')

#split the data so we can see latitude and longitude in separate columns
pokestop_sb[['latitude', 'longitude']] = pokestop_sb[0].str.split(', ', expand=True)

#remove the column that had combined latitudes and longitudes
pokestop_sb = pokestop_sb.drop(columns=0)

#convert str to float for each column
pokestop_sb['latitude'] = pokestop_sb['latitude'].astype('float64')
pokestop_sb['longitude'] = pokestop_sb['longitude'].astype('float64')

pokestop_sb.head()

In [None]:
#find the min and max latitudes and longitudes of pokestops
min_latitude = pokestop_sb['latitude'].min()
min_longitude = pokestop_sb['longitude'].min()
max_latitude = pokestop_sb['latitude'].max()
max_longitude = pokestop_sb['longitude'].max()

print(f"min_dtsb_coords: {min_latitude}, {min_longitude}; max_dtsb_coords: {max_latitude}, {max_longitude}")

In [None]:
#find the min and max latitudes and longitudes of restaurants
restaurants_latitude_max = restaurants['latitude'].max()
restaurants_latitude_min = restaurants['latitude'].min()
restaurants_longitude_max = restaurants['longitude'].max()
restaurants_longitude_min = restaurants['longitude'].min()


print(f"min_restaurant_coords: {restaurants_latitude_min}, {restaurants_longitude_min}; max_restaurant_coords: {restaurants_latitude_max}, {restaurants_longitude_max}")

In [None]:
# Filter the restaurants dataframe so that it only contains restaurants within the Santa Barbara Pokestop bounding box
latitude_query = (restaurants['latitude'] >= min_latitude) & (restaurants['latitude'] <= max_latitude)
longitude_query = (restaurants['longitude'] >= min_longitude) & (restaurants['longitude'] <= max_longitude)
restaurants_sb = restaurants[latitude_query & longitude_query]
restaurants_sb.head()

### Pokestops Heat Map

As a first look at the relationship between PokéStops and check-ins, we compared the coordinates of restaurant check-ins with the coordinates of PokéStops in Santa Barbara. From this, we saw that restaurants with nearby PokéStops tended to have more check-ins than restaurants without PokéStops nearby.
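
One way such a heat map could be built is sketched below, assuming the check-ins are simply binned into a latitude/longitude grid and the Pokestops are overlaid as points. The function names `checkin_density` and `plot_density`, the bin count, and the toy coordinates are hypothetical and not taken from our actual notebook cells.

```python
import numpy as np


def checkin_density(latitudes, longitudes, bins=20):
    """Bin check-in coordinates into a 2D grid of counts (rows = latitude bins)."""
    counts, lat_edges, lon_edges = np.histogram2d(latitudes, longitudes, bins=bins)
    return counts, lat_edges, lon_edges


def plot_density(counts, lat_edges, lon_edges, stop_lats, stop_lons):
    """Draw the grid as a heat map and overlay Pokestop locations."""
    # Imported here so the binning step above can run without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    mesh = ax.pcolormesh(lon_edges, lat_edges, counts, cmap="viridis")
    ax.scatter(stop_lons, stop_lats, color="red", s=10, label="Pokestop")
    fig.colorbar(mesh, ax=ax, label="Check-in count")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_title("Check-in density with Pokestops")
    ax.legend()
    return ax


# Made-up check-in coordinates: three close together, one farther away.
toy_lats = [34.410, 34.411, 34.412, 34.430]
toy_lons = [-119.700, -119.699, -119.698, -119.680]

counts, lat_edges, lon_edges = checkin_density(toy_lats, toy_lons, bins=2)
```

With the real data, the inputs would be the `latitude` and `longitude` columns of `restaurants_sb` (one row per check-in) and the coordinates in `pokestop_sb`, passed to `plot_density` followed by `plt.show()`.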

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

## ETC AD NAUSEAM

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of who it composes, and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

# Discussion and Conclusion

Wrap it all up here.  Somewhere between 3 and 10 paragraphs roughly.  A good time to refer back to your Background section and review how this work extended the previous stuff. 


# Team Contributions

Specify who did what.  This should be pretty granular, perhaps bullet points, no more than a few sentences per person.