# Acquire Service Area Data from Instacart

Using BeautifulSoup, the following code allowed me to scrape the Instacart service regions page, collect the areas serviced as of 1.9.2020, and insert them into a NoSQL DB (MongoDB) for later use.

### Import dependencies

In [1]:
from splinter import Browser
from bs4 import BeautifulSoup
import os
import pandas as pd
import pymongo

### Setup Splinter, DB connection and establish collection for storage

In [2]:
# Setup configuration variables to enable Splinter to interact with browser
executable_path = {'executable_path': 'resources/chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=False)

In [3]:
# Setup connection to MongoDB
conn = 'mongodb://localhost:27017'
client = pymongo.MongoClient(conn)

In [4]:
# Create the DB and collection to receive the data; drop the collection first so each run starts fresh with no carry-over
db = client.food_desert_db
collection = db.instacart_cities
collection.drop()

In [5]:
collection = db.instacart_cities

### Define URL to scrape, inform the browser to visit the page and set variables to capture underlying HTML and pass back to BeautifulSoup

In [6]:
url = 'https://www.instacart.com/grocery-delivery/regions'
browser.visit(url)

In [7]:
html = browser.html
soup = BeautifulSoup(html, 'html.parser')

### The names of the cities are housed within the links on the page.  
Counting the links shows how many iterations are needed and helps identify the starting and ending indices, so that unnecessary links such as the header and footer items are not pulled in.

In [8]:
len(soup.find_all('a'))

10779
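As an alternative to locating the start and end positions by hand, the link text could be filtered by its shape. The following is only a sketch: it assumes every city link reads as `City, ST` with a two-letter state code, and `filter_city_links` and `CITY_STATE` are hypothetical names that are not part of the notebook. Header or footer links that happen to match the same pattern would still slip through, so the result is worth checking against the index-based count.

```python
import re

# Assumed pattern: "<city name>, <two-letter state code>", e.g. "Adamsville, AL".
CITY_STATE = re.compile(r"^[^,]+, [A-Z]{2}$")


def filter_city_links(link_texts):
    """Return only the strings that look like 'City, ST' (hypothetical helper)."""
    stripped = (text.strip() for text in link_texts)
    return [text for text in stripped if CITY_STATE.match(text)]


# Made-up sample mixing navigation links with city links.
sample = ["Log in", "Help", "Adamsville, AL", "Auburn University, AL", "Wilson, WY", "Careers"]
cities = filter_city_links(sample)
```

In the notebook, the input could be built with `[a.text for a in soup.find_all('a')]`.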

### Iterate over links in page to scrape text from each which reflects the City, State serviced.  Note: the range used for i reflected the links associated with the cities and eliminated the header and footer links.

As a backup, I also collected the link text in a list that could be converted to a DataFrame and transferred, but that turned out to be unnecessary. I did use it, though, to confirm that the last value was the final city on the page and to count the number of entries expected, for comparison against my collection in MongoDB. Additionally, I created a reference CSV file of the states serviced by Instacart for use in limiting the other datasets we will refer to.

Before I figured out how to iterate over the links, I had also found a way to pull each href, which includes the city name, but that would have required much more transformation and splitting in Pandas had the iteration method not worked out. You'll find the start of both approaches commented out at the end of this code.

In [9]:
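textContent = []
links = soup.find_all('a')  # collect the links once instead of re-searching the page on every iteration
for i in range(30, 10762):  # indices 30 through 10761 hold the city links
    link_text = links[i].text
    post = {'City/State': link_text}
    textContent.append(link_text)
    collection.insert_one(post)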
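With more than ten thousand entries, one round trip per document can be slow. A possible variation, sketched here rather than taken from the notebook, is to build all the documents first and hand them to pymongo's `insert_many` in a single call. `build_documents` and `bulk_insert` are hypothetical helper names; the `'City/State'` key matches the one used above.

```python
def build_documents(names):
    """Turn a list of 'City, ST' strings into MongoDB-ready documents."""
    return [{'City/State': name} for name in names]


def bulk_insert(collection, names):
    """Insert all names with one insert_many call and return how many were sent."""
    docs = build_documents(names)
    if docs:  # insert_many rejects an empty list, so skip the call in that case
        collection.insert_many(docs)
    return len(docs)


# Small demonstration that needs no database connection.
docs = build_documents(['Adamsville, AL', 'Alabaster, AL'])
```

In the notebook this would be called as `bulk_insert(collection, textContent)` after the list has been filled.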

In [10]:
len(textContent)

10732

In [11]:
textContent[0]

'Adamsville, AL'

In [12]:
textContent[10731]

'Wilson, WY'

In [13]:
find_states = pd.DataFrame(textContent)
find_states.columns = ["city_state"]
find_states.head()

Unnamed: 0,city_state
0,"Adamsville, AL"
1,"Alabaster, AL"
2,"Albertville, AL"
3,"Anniston, AL"
4,"Arab, AL"


In [14]:
find_states[['city','state']] = find_states['city_state'].str.split(', ', n=1, expand=True)
find_states.head()

Unnamed: 0,city_state,city,state
0,"Adamsville, AL",Adamsville,AL
1,"Alabaster, AL",Alabaster,AL
2,"Albertville, AL",Albertville,AL
3,"Anniston, AL",Anniston,AL
4,"Arab, AL",Arab,AL


In [15]:
# Name the index and the count column explicitly so the result does not depend on the pandas version
states = (find_states['state'].value_counts()
          .rename_axis('state')
          .reset_index(name='counts'))
states.head()

Unnamed: 0,state,counts
0,NY,851
1,PA,850
2,CA,704
3,TX,524
4,OH,513


In [16]:
os.makedirs("Data", exist_ok=True)  # make sure the output folder exists before writing
states.to_csv(os.path.join("Data", "instacart_states.csv"))

In [17]:
listings = db.instacart_cities.find()

for listing in listings:
    print(listing)

{'_id': ObjectId('5e1a883d49ffc32c34c05001'), 'City/State': 'Adamsville, AL'}
{'_id': ObjectId('5e1a883d49ffc32c34c05002'), 'City/State': 'Alabaster, AL'}
{'_id': ObjectId('5e1a883d49ffc32c34c05003'), 'City/State': 'Albertville, AL'}
{'_id': ObjectId('5e1a883d49ffc32c34c05004'), 'City/State': 'Anniston, AL'}
{'_id': ObjectId('5e1a883d49ffc32c34c05005'), 'City/State': 'Arab, AL'}
{'_id': ObjectId('5e1a883d49ffc32c34c05006'), 'City/State': 'Ashford, AL'}
{'_id': ObjectId('5e1a883d49ffc32c34c05007'), 'City/State': 'Athens, AL'}
{'_id': ObjectId('5e1a883d49ffc32c34c05008'), 'City/State': 'Attalla, AL'}
{'_id': ObjectId('5e1a883d49ffc32c34c05009'), 'City/State': 'Auburn University, AL'}
{'_id': ObjectId('5e1a883d49ffc32c34c0500a'), 'City/State': 'Auburn, AL'}
{'_id': ObjectId('5e1a883d49ffc32c34c0500b'), 'City/State': 'Bayou La Batre, AL'}
{'_id': ObjectId('5e1a883d49ffc32c34c0500c'), 'City/State': 'Bessemer, AL'}
{'_id': ObjectId('5e1a883d49ffc32c34c0500d'), 'City/State': 'Birmingham, AL'}
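The comparison between the number of scraped entries and the number of stored documents can also be done in code instead of by eye. This is a minimal sketch assuming pymongo's `count_documents` method on the collection; `verify_count` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in so the example runs without a MongoDB server.

```python
from types import SimpleNamespace


def verify_count(collection, expected):
    """Return True when the collection holds exactly `expected` documents."""
    stored = collection.count_documents({})  # an empty filter counts every document
    return stored == expected


# Stand-in for a real collection (assumption for illustration only).
fake_collection = SimpleNamespace(count_documents=lambda query: 3)
matches = verify_count(fake_collection, 3)
```

In the notebook, the equivalent check would be `verify_count(collection, len(textContent))`.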

In [None]:
#textContent

In [None]:
#cities = pd.DataFrame(textContent)
#cities.head()

In [None]:
#cities.columns=["City/State"]
#cities.head()

In [None]:
#links = [a.get('href') for a in soup.find_all('a', href=True)]

In [None]:
#links