# Sean Juel Ayo Alabi Tech Skill Swap Tutorial

## This tutorial outlines the process of parsing a webpage for spatial data in the form of addresses, passing a list of those addresses through a geocoder (in this case opencagedata), and building a dataframe from the coordinates it returns. Once the spatial data is organized in a dataframe, we'll write a simple spatial query in Python and map our results with Leaflet through folium.

This tutorial is structured as a Jupyter (IPython) notebook, which is one way to write Python. Notebooks are composed of cells. This cell is a markdown cell; markdown cells are used to comment on the code. Code is written in "code" cells. Below this markdown cell is the first code cell of the notebook, and all it does is import the dependencies that the rest of the code relies on.

In [1]:
import urllib
import urllib2
from lxml import html
import unicodedata
import folium
import pandas
import geopandas
import shapely
import shapely.geometry
import fiona
import fiona.crs

Now that we've imported the necessary dependencies, we can begin in earnest. The first thing we'll learn to do is find spatial data online and organize it in a useful format. In this tutorial, we'll use a webpage that lists the addresses of Starbucks locations in Seattle. As you may know, addresses can't be plugged directly into GIS software; they have to be geocoded first. What's more, the addresses aren't organized neatly in a downloadable table; they're simply listed on the webpage. In the following cell, we'll use CSS selectors to parse the webpage and organize the addresses, and the names of the stores they correspond to, into lists.

In [2]:
# This is text that has been "commented out" by placing the # symbol in front of it.
# Commenting within code cells is useful for explaining the code right where it is written.
# The following chunk of code defines several variables. First, the "url" variable is declared and
# assigned the URL of our Starbucks webpage as a "string", that is, a sequence of text characters.

url = "https://www.seattlemet.com/articles/2015/8/17/every-single-starbucks-in-seattle-ranked"
req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"}) 
con = urllib2.urlopen( req )
doc_text = con.read()
doc = html.fromstring(doc_text)
doc.make_links_absolute(url) # This last line in this chunk of code isn't defining a variable; it takes an existing
# variable, "doc", and calls a method on it, in this case the bit that follows the period. The part in the parentheses is
# the parameter of the method. Here "make_links_absolute" is given one parameter, but methods can take several.

names = []  # We need a place to store the names and addresses of the stores, so here we create two lists.
addresses = []  # These lists are created with empty closed brackets, and are empty to start with, but we'll populate them shortly.

# this bit of code is where we collect the data from the webpage.
# In the first line we have a couple important concepts, a loop to go through each piece of data we want in the webpage,
# and a cssselector to define what part of the webpage we're interested in.
for row in doc.cssselect("body div.site-wrapper main article div div.c-body h6"):
    currentname = row.text_content()
    currentname = currentname.replace(u'\xe2', ' ').encode('utf-8')
    names.append(currentname)
#after the for loop goes through each item that meets the requirements of the cssselector, we may still need to clean up the data.
names[13] = "14. Seattle Childrens Hospital and Kiosk"
names[17] = "18. 1200 Westlake Avenue"

#This is very similar to the loop we just used, only now we're collecting the addresses instead of the names.
for row in doc.cssselect("body div.site-wrapper main article div div.c-body em"):
    currentaddress = row.text_content()
    if "Sea-Tac" not in currentaddress:
        currentaddress = currentaddress[:-13]
    currentaddress = currentaddress.strip(',')
    currentaddress = currentaddress.replace(u'\xc2', ' ').encode('utf-8')
    currentaddress += ", Seattle"
    addresses.append(currentaddress)
del addresses[0]
del addresses[81]

#These lines of code will "print" out the lists we just made in the space below this cell, which we can use to check our work.
print names
print addresses

['1. Madison Park', '2. Ballard Drive-Thru', '3. Seventh and Pike', '4. 23rd and Jackson', '5. Columbia Tower 40th Floor', '6. Century Square', '7. Roy Street Coffee and Tea', '8. Olive Way', '9. Russell Investments Center', '10. Broadway and East Pike', '11. Every Other Airport Starbucks', '12. University Village North', '13. Northgate Way', '14. Seattle Childrens Hospital and Kiosk', '15. Fourth and Seneca', '16. Terry and Republican', '17. City University', '18. 1200 Westlake Avenue', '19. Eighth and Virginia', '20. SoDo Lobby', '21. 15th Ave E', '22. Leschi', '23. Fourth and Union', '24. University Village South', '25. Magnolia', '26. Swedish Medical Center', '27. Second and Cherry', '28. Fourth and Diagonal', '29. Columbia Center', '30. Swedish First Hill Lobby', '31. University Way', '32. 1211 Dexter Ave', '33. MLK Way and Graham', '34. Greenlake', '35. Second and Seneca', '36. Queen Anne', '37. Pacific Place', '38. 12th and Columbia', '39. Lake City Way and 120th', '40. Westlake

It should be noted that the previous cell is written specifically for the Starbucks web page and would need to be significantly altered to work for a different page.
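To see the idea independently of that particular page, here is a minimal sketch of the same kind of extraction on a tiny inline snippet. It is written for Python 3 and uses the standard library's `HTMLParser` rather than lxml and CSS selectors, so it is a stand-in for the technique, not the code used in this notebook. The sample markup, the `TagTextCollector` class, and the store names are all made up for illustration; only the tag names `h6` and `em` come from the selectors in the cell above.

```python
# Python 3 sketch (the tutorial's own cells are Python 2).
from html.parser import HTMLParser

# Hypothetical stand-in for the article markup: names in <h6>, addresses in <em>.
SAMPLE_HTML = """
<div class="c-body">
  <h6>1. Example Store</h6>
  <p><em>123 Example St</em></p>
  <h6>2. Another Store</h6>
  <p><em>456 Sample Ave</em></p>
</div>
"""


class TagTextCollector(HTMLParser):
    """Collects the text found inside each occurrence of the wanted tags."""

    def __init__(self, wanted_tags):
        HTMLParser.__init__(self)
        self.wanted_tags = set(wanted_tags)
        self.current_tag = None
        self.found = {tag: [] for tag in wanted_tags}

    def handle_starttag(self, tag, attrs):
        # Start a new, empty entry whenever one of the wanted tags opens.
        if tag in self.wanted_tags:
            self.current_tag = tag
            self.found[tag].append("")

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None

    def handle_data(self, data):
        # Text is only kept while we are inside a wanted tag.
        if self.current_tag is not None:
            self.found[self.current_tag][-1] += data


collector = TagTextCollector(["h6", "em"])
collector.feed(SAMPLE_HTML)
sample_names = [text.strip() for text in collector.found["h6"]]
sample_addresses = [text.strip() for text in collector.found["em"]]
print(sample_names)
print(sample_addresses)
```

A CSS selector does the same job in one line, which is why the notebook uses lxml; the sketch only makes the "find every matching element and keep its text" step explicit.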

In the following cell, we'll pass the addresses from the "scraper" cell through a geocoder API provided by opencagedata that returns latitude and longitude coordinates in an XML page. By using CSS selectors again, we'll organize all the data we have so far (names, latitude, and longitude) into a nested list.
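Before running the geocoder, it may help to see the two steps in isolation: building the request URL and pulling the coordinates out of the XML. The sketch below is Python 3 and uses only the standard library (`xml.etree.ElementTree` instead of lxml). The XML is a hand-written sample, not a real API response; only the element nesting is taken from the selectors used in this tutorial. The function names and the `YOUR_KEY` placeholder are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote  # Python 3; the Python 2 cells use urllib.quote


def build_geocode_url(address, key):
    """Builds a request URL in the same shape as the one used in this tutorial."""
    return ("http://api.opencagedata.com/geocode/v1/xml?q="
            + quote(address) + "&key=" + key)


# Hand-written sample, NOT a real API response. Only the nesting
# (response > results > result > geometry > lat/lng) comes from the tutorial.
SAMPLE_XML = """
<response>
  <results>
    <result>
      <geometry>
        <lat>47.6341227</lat>
        <lng>-122.2807885</lng>
      </geometry>
    </result>
  </results>
</response>
"""


def first_coordinates(xml_text):
    """Returns (lat, lng) of the first result, or None if there are no results."""
    root = ET.fromstring(xml_text.strip())
    geometry = root.find("./results/result/geometry")
    if geometry is None:
        return None
    return float(geometry.findtext("lat")), float(geometry.findtext("lng"))


print(build_geocode_url("123 Example St, Seattle", "YOUR_KEY"))
print(first_coordinates(SAMPLE_XML))
```

Returning `None` when nothing matches is a design choice of this sketch; the notebook cell instead indexes `[0]` directly, which raises an error if the geocoder finds no result.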

In [3]:
#opencagedata is a geocoder api with free and paid licensing options. I've used a free license, which allows 2500 requests a day.
key = "YOUR_OPENCAGE_API_KEY" # replace with the key from the free opencagedata account you created
data = []
for i, row in enumerate(addresses):  # i is the position in the list, used to look up the matching name
    url = "http://api.opencagedata.com/geocode/v1/xml?q=" + urllib.quote(row) + "&key=" + key
    req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
    con = urllib2.urlopen( req )
    doc_text = con.read()
    doc = html.fromstring(doc_text)
    doc.make_links_absolute(url)
    # take the coordinates of the first result in the XML response
    currentlat = float(doc.cssselect("response results result geometry lat")[0].text)
    currentlong = float(doc.cssselect("response results result geometry lng")[0].text)
    data.append([names[i],currentlat,currentlong])
print data

[['1. Madison Park', 47.6341227, -122.2807885], ['2. Ballard Drive-Thru', 47.6668892, -122.3766051], ['3. Seventh and Pike', 47.60621, -122.33207], ['4. 23rd and Jackson', 47.59949545, -122.301972385268], ['5. Columbia Tower 40th Floor', 47.60621, -122.33207], ['6. Century Square', 47.60621, -122.33207], ['7. Roy Street Coffee and Tea', 47.6253026, -122.321085], ['8. Olive Way', 47.6194572, -122.325070491391], ['9. Russell Investments Center', 47.60621, -122.33207], ['10. Broadway and East Pike', 47.6141035, -122.3164747], ['11. Every Other Airport Starbucks', 47.60621, -122.33207], ['12. University Village North', 47.6644178, -122.2988381], ['13. Northgate Way', 47.60621, -122.33207], ['14. Seattle Childrens Hospital and Kiosk', 47.663215, -122.2843412], ['15. Fourth and Seneca', 47.60621, -122.33207], ['16. Terry and Republican', 47.60621, -122.33207], ['17. City University', 47.6177084, -122.3444623], ['18. 1200 Westlake Avenue', 47.6299052653061, -122.342425387755], ['19. Eighth an
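Several rows in the printed output above share exactly the same coordinates (47.60621, -122.33207). That may indicate that the geocoder could not resolve those addresses and fell back to a coarser match, so it is worth flagging them before mapping. The following Python 3 sketch shows one way to do that; the rows are copied from the output above, and the variable names are only illustrative.

```python
from collections import Counter

# A few rows copied from the printed output above.
sample_data = [
    ['1. Madison Park', 47.6341227, -122.2807885],
    ['2. Ballard Drive-Thru', 47.6668892, -122.3766051],
    ['3. Seventh and Pike', 47.60621, -122.33207],
    ['5. Columbia Tower 40th Floor', 47.60621, -122.33207],
    ['6. Century Square', 47.60621, -122.33207],
]

# Count how many rows land on each (lat, long) pair.
coordinate_counts = Counter((lat, lng) for _, lat, lng in sample_data)

# Any store whose coordinates are shared with another store deserves a second look.
suspect = [name for name, lat, lng in sample_data
           if coordinate_counts[(lat, lng)] > 1]
print(suspect)
```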

Now that we have a nested list with the names, latitudes, and longitudes of our locations, we'll convert this list into useful formats for GIS work, starting with a pandas dataframe. As we go through each step in this process, we'll print out the data in its current form.
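One detail is easy to miss in this conversion: point geometries are written as (x, y), which means longitude first and latitude second. The sketch below illustrates this by building a GeoJSON-style structure by hand with Python 3's standard `json` module. It is not what geopandas does internally, and `rows_to_geojson` is a hypothetical helper; the sample row is taken from the geocoder output.

```python
import json


def rows_to_geojson(rows):
    """Turns [name, lat, long] rows into a GeoJSON FeatureCollection dict."""
    features = []
    for name, lat, lng in rows:
        features.append({
            "type": "Feature",
            "properties": {"name": name, "lat": lat, "long": lng},
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lng, lat]},
        })
    return {"type": "FeatureCollection", "features": features}


sample_rows = [['1. Madison Park', 47.6341227, -122.2807885]]
collection = rows_to_geojson(sample_rows)
print(json.dumps(collection, indent=2))
```

This is the same reason the notebook zips the `long` column before the `lat` column when it builds the coordinate tuples.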

In [4]:
import shapely
import shapely.geometry
import fiona
import fiona.crs
import pandas

pandas_df = pandas.DataFrame(data,columns=['name','lat','long']) #creates the dataframe from the nested list.

print "The original Pandas DataFrame:"
print pandas_df

print
CoordinateTuples_list = zip(pandas_df['long'], pandas_df['lat']) #creates a Coordinate tuples list
print "CoordinateTuples_list: "
print CoordinateTuples_list 

print
geometry_list = [shapely.geometry.Point(CoordinateTuple) for CoordinateTuple in CoordinateTuples_list]
print "geometry_list: "
print geometry_list

print
geometry_gs = geopandas.GeoSeries(geometry_list)
print "geometry_gs, a GeoSeries:"
print str(geometry_gs) 

print
geopandas_gdf = geopandas.GeoDataFrame(
    pandas_df,
    geometry=geometry_gs,
)
geopandas_gdf.crs=fiona.crs.from_epsg(4326)   # sets the proper crs for the coordinates provided by the geocoder we used.
print "GeoDataFrame:"
print geopandas_gdf

#You'll need to edit the following pathnames to reflect a location on your disk.
geopandas_gdf.to_file("C:\\Users\\Sean\\Downloads\\starbucks.geojson", driver="GeoJSON") #exports the geodataframe to a geojson file.
geopandas_gdf.to_file("C:\\Users\\Sean\\Downloads\\starbucks.shp", driver="ESRI Shapefile") #exports the geodataframe to a shapefile

The original Pandas DataFrame:
                                         name        lat        long
0                             1. Madison Park  47.634123 -122.280788
1                       2. Ballard Drive-Thru  47.666889 -122.376605
2                         3. Seventh and Pike  47.606210 -122.332070
3                         4. 23rd and Jackson  47.599495 -122.301972
4                5. Columbia Tower 40th Floor  47.606210 -122.332070
5                           6. Century Square  47.606210 -122.332070
6                7. Roy Street Coffee and Tea  47.625303 -122.321085
7                                8. Olive Way  47.619457 -122.325070
8               9. Russell Investments Center  47.606210 -122.332070
9                  10. Broadway and East Pike  47.614103 -122.316475
10          11. Every Other Airport Starbucks  47.606210 -122.332070
11               12. University Village North  47.664418 -122.298838
12                          13. Northgate Way  47.606210 -122.332070
13 

The following cell performs a simple distance-based spatial query that selects all the points in a geodataframe that lie within a certain distance of a point defined by the user. As we go through the geodataframe, a new nested list is built from the selected points, and a web map is created with folium. Folium is a software package that allows Leaflet to be used from Python.
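The distance test itself can be sketched as a small standalone function. It treats one degree of latitude as roughly 69 miles and shrinks a degree of longitude by the cosine of the latitude, then applies the Pythagorean theorem. This is a flat-earth approximation that is reasonable over a few miles, not an exact geodesic distance. The function name and constant name are hypothetical; the constant 69 and the sample coordinates are taken from this tutorial.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # rough figure, the same constant the cell uses


def approx_distance_miles(lat1, lon1, lat2, lon2):
    """Approximate distance in miles between two nearby lat/long points."""
    north_south = (lat1 - lat2) * MILES_PER_DEGREE_LAT
    # A degree of longitude covers less ground the farther you are from the equator.
    east_west = (lon1 - lon2) * MILES_PER_DEGREE_LAT * math.cos(math.radians(lat1))
    return math.hypot(north_south, east_west)


chosen = (47.65, -122.3)                    # the chosen point used in the cell
madison_park = (47.6341227, -122.2807885)   # first row of the geocoder output
distance = approx_distance_miles(chosen[0], chosen[1],
                                 madison_park[0], madison_park[1])
print(round(distance, 2))
print(distance < 4)  # would this store be selected with a 4-mile radius?
```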

In [12]:
import folium
import math
map_starbucks = folium.Map(location=[47.65, -122.3],
                            zoom_start=11,
                            tiles='Stamen Terrain')

#define a point by latitude and longitude, and set a distance (in miles) to check for other points within
chosenlat = 47.65
chosenlong = -122.3
chosendistance = 4

selections = []
folium.RegularPolygonMarker([chosenlat,chosenlong], popup="chosenpoint", fill_color = 'blue').add_to(map_starbucks)
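for row in geopandas_gdf.itertuples():  # row[0] is the index, row[1] the name, row[2] the latitude, row[3] the longitude
    asquared = ((chosenlat - row[2])*69)**2  # north-south offset in miles, squared (one degree of latitude is roughly 69 miles)
    bsquared = ((chosenlong - row[3])*(69*math.cos(math.radians(chosenlat))))**2  # east-west offset, squared; a degree of longitude shrinks with the cosine of the latitude
    c = (asquared + bsquared)**(1/2.0)  # approximate straight-line distance in miles
    if c < chosendistance:
        # within the chosen distance: mark in green and keep the row
        folium.RegularPolygonMarker([row[2],row[3]], popup=row[1], fill_color = 'green').add_to(map_starbucks)
        selections.append([row[1],row[2],row[3]])
    else:
        # outside the chosen distance: mark in white
        folium.RegularPolygonMarker([row[2],row[3]], popup=row[1], fill_color = 'white').add_to(map_starbucks)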

selections_df = pandas.DataFrame(selections,columns=['name','lat','long']) #creates the dataframe from the nested list.
CoordinateTuples_list = zip(selections_df['long'], selections_df['lat']) #creates a Coordinate tuples list
geometry_list = [shapely.geometry.Point(CoordinateTuple) for CoordinateTuple in CoordinateTuples_list]
geometry_gs = geopandas.GeoSeries(geometry_list)
selections_gdf = geopandas.GeoDataFrame(
    selections_df,
    geometry=geometry_gs,
)
selections_gdf.crs=fiona.crs.from_epsg(4326)   # sets the proper crs for the coordinates provided by the API we used.
print "GeoDataFrame:"
print selections_gdf

#change the directories below to a path on your disk
selections_gdf.to_file("C:\\Users\\Sean\\Downloads\\selections.geojson", driver="GeoJSON") #exports the selections geodataframe to a geoJSON file
selections_gdf.to_file("C:\\Users\\Sean\\Downloads\\selections.shp", driver="ESRI Shapefile") #exports the selections geodataframe to a shapefile

map_starbucks.save('C:\\Users\\Sean\\Downloads\\starbucksmap.html')  # writes the map out to html on disk   

map_starbucks   # display interactive map of Starbucks locations in the notebook below             


GeoDataFrame:
                                        name        lat        long  \
0                            1. Madison Park  47.634123 -122.280788   
1                      2. Ballard Drive-Thru  47.666889 -122.376605   
2                        3. Seventh and Pike  47.606210 -122.332070   
3                        4. 23rd and Jackson  47.599495 -122.301972   
4               5. Columbia Tower 40th Floor  47.606210 -122.332070   
5                          6. Century Square  47.606210 -122.332070   
6               7. Roy Street Coffee and Tea  47.625303 -122.321085   
7                               8. Olive Way  47.619457 -122.325070   
8              9. Russell Investments Center  47.606210 -122.332070   
9                 10. Broadway and East Pike  47.614103 -122.316475   
10         11. Every Other Airport Starbucks  47.606210 -122.332070   
11              12. University Village North  47.664418 -122.298838   
12                         13. Northgate Way  47.606210 -122.33