#### Downloading CSVs directly from the web

Imagine you are doing market research on crime in an area.

To determine how often crime occurs and to examine seasonal trends, you need historical crime data covering as many years as possible.

The city of Chicago provides a large selection of crime data indicating the time, location, and type of each crime.
For an example, see the crime data for 2015: https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9

We can download the data as a CSV file by clicking the blue "export" button at the top right of the page, and then clicking the "CSV" link that appears under "Download As".

Notice that the link to download the csv is: https://data.cityofchicago.org/api/views/vwwp-7yr9/rows.csv

Also notice that the download link for a CSV data file from the city of Chicago has the following format: "https://data.cityofchicago.org/api/views/", then the last part of the page's URL (vwwp-7yr9), then "/rows.csv".
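
As a minimal sketch of that pattern, the helper below builds the CSV download link from a dataset page URL. The name `csv_url_for` is made up for this illustration, and the sketch assumes every dataset page URL ends with the dataset identifier, as the 2015 example does.

```python
# Base of the CSV download links, taken from the 2015 example
API_BASE = "https://data.cityofchicago.org/api/views/"

def csv_url_for(page_url):
    """Build the CSV download link from a dataset page URL (hypothetical helper)."""
    # The dataset identifier is the last part of the page URL, e.g. "vwwp-7yr9"
    dataset_id = page_url.rstrip("/").split("/")[-1]
    return API_BASE + dataset_id + "/rows.csv"

page_url = "https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9"
print(csv_url_for(page_url))
```

For the 2015 page this produces the same link that the "CSV" export option points to.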

If we knew the URL of each CSV file, we could easily have Python download all of the crime data automatically.

Our goal for this activity is to write a script that finds the links for all of those files and then downloads the CSV files for us.

To find the datasets, you can use the search box on the page "https://data.cityofchicago.org/".

This is the URL for the first page of search results for the crime data:
https://www.metrochicagodata.org/browse?q=crime&page=1

Ideally, we want a script that goes through EVERY page of results, collects the name and link of each dataset, and then downloads the data.
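
Going through every page mostly comes down to changing the page number at the end of the results URL. Here is a rough sketch, assuming the results span 4 pages (an assumption; check the site for the actual count) and that only the `page` parameter changes from one page to the next. The names `RESULTS_URL`, `LAST_PAGE`, and `page_urls` are made up for this illustration.

```python
# Results URL with a placeholder where the page number goes
RESULTS_URL = "https://www.metrochicagodata.org/browse?q=crime&page={}"

# Assumed number of result pages; adjust to what the site actually shows
LAST_PAGE = 4

# range(1, LAST_PAGE + 1) yields 1, 2, ..., LAST_PAGE
page_urls = [RESULTS_URL.format(page) for page in range(1, LAST_PAGE + 1)]

for page_url in page_urls:
    print(page_url)
```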

The following script shows how to download results page 1 and parse it into an HTML tree, so that we can find the links in the page.

In [None]:
import requests
import lxml.html as ET

#This is the url for the first page of results for the crime data
url = "https://www.metrochicagodata.org/browse?q=crime&page=1"

#send the request
content = requests.get(url)

#save the page source code (UTF-8 encoded) to a variable called content_string
content_string = content.text.encode("utf-8")

#pass the page source to our html parser
doc = ET.document_fromstring(content_string)

#### Extracting the links

Look at the page source on the data page: https://www.metrochicagodata.org/browse?q=crime&page=1

Our goal is to know what the links are for each dataset so we can use them to download the data.

What tag does each link have around it, and what properties do those tags have?

You should notice that each data link is surrounded by an <a> tag.

All of the links to datasets have class='browse2-result-name-link'.

Therefore, we want to access all of the <a> tags with the class equal to browse2-result-name-link.
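
To see what this kind of selection does in isolation, here is a small sketch that runs on a hand-written HTML snippet instead of the real site. The snippet, its href values, and the extra class `featured` are made up for illustration. It also shows one caveat: comparing `@class` with `=` only matches tags whose class attribute is exactly that string, so a tag carrying additional classes needs the longer `contains(...)` form.

```python
import lxml.html as ET

# A small hand-written page (made up for illustration, not the real site)
sample_html = """
<html><body>
  <a class="browse2-result-name-link" href="/d/aaaa-1111">Crimes - 2015</a>
  <a class="browse2-result-name-link featured" href="/d/bbbb-2222">Crimes - 2014</a>
  <a class="other-link" href="/about">About</a>
</body></html>
"""

sample_doc = ET.document_fromstring(sample_html)

# Exact match: the class attribute must be exactly this string
exact = sample_doc.xpath("//a[@class='browse2-result-name-link']")

# Token match: also finds tags that carry additional classes
token = sample_doc.xpath(
    "//a[contains(concat(' ', normalize-space(@class), ' '), "
    "' browse2-result-name-link ')]"
)

print("exact matches: {}".format(len(exact)))
print("token matches: {}".format(len(token)))
```

If the exact form finds every dataset link on the real page, it is the simpler choice; the token form is only needed when the tags carry more than one class.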


In [None]:
links = doc.xpath("//a[@class='browse2-result-name-link']")

#### Retrieving the links

For every link that matched the criteria we asked for, we can get its URL, which is stored in the 'href' attribute within its <a> tag.
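
The sketch below reads the 'href' attribute and the link text from a hand-written snippet (the snippet and its href values are made up for illustration). It also shows that `link.text` is `None` when the text sits inside a nested tag, in which case `text_content()` still returns it. Whether the real page nests its link text like this is an assumption to check in the page source.

```python
import lxml.html as ET

# Hand-written snippet: the second link keeps its text inside a <span>
sample_html = """
<html><body>
  <a class="browse2-result-name-link" href="/d/aaaa-1111">Crimes - 2015</a>
  <a class="browse2-result-name-link" href="/d/bbbb-2222"><span>Crimes - 2014</span></a>
</body></html>
"""

sample_doc = ET.document_fromstring(sample_html)
sample_links = sample_doc.xpath("//a[@class='browse2-result-name-link']")

sample_urls = {}
for link in sample_links:
    # text_content() gathers the text of the tag and everything nested in it
    title = link.text_content().strip()
    # the url is stored in the href attribute
    sample_urls[title] = link.attrib['href']
    print("{!r} -> {}".format(link.text, sample_urls[title]))
```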

In [None]:
#make an empty dictionary to hold the data urls, with their titles as the keys
data_urls = {}

#for every link in our list of links
for link in links:

    #check that the data is relevant by seeing if the link text has "Crimes - " in it, which is the format the dataset titles tend to follow
    #(link.text is None when the <a> tag has no text directly inside it, so check for that first)
    if link.text and "Crimes - " in link.text:
        #if the link text is relevant to our interest, add it to our url dictionary
        data_urls[link.text] = link.attrib['href']

#print() with parentheses works in both Python 2 and Python 3
print(data_urls)

#### Finish the process - Make a complete script to download all of the Crime data

Make a complete script that downloads the crime data and saves it to your computer.

The comments below outline the code you should write, and their indentation shows where each piece of code belongs.

In [None]:
#### Write the full code here

#make an empty dictionary to hold the urls and titles of the data file pages


#for every page number from 1-4

    #have requests go to the results page for the page number
    #remember, the results page url for page 1 was: "https://www.metrochicagodata.org/browse?q=crime&page=1"
    
    
    #get the page's source code and save it to a variable that is a string
    
    
    #parse the source code into an html tree and save as a variable called doc
    
    
    #get the list of dataset links from the results page
    
    
    #for every link in the list of links
        
        #add the link to the dictionary, using the link text as the key
        
    
#For every key,url in the dictionary.items():
    
    #get the ending of the url (the part right after the last /) and save it as a variable called linkending
    #Hint: use .split("/")[-1]
    #.split() breaks up a string on a character of your choice and returns a list of the pieces that were separated by that character
    #"hello/how/are/you".split("/") would produce ["hello","how","are","you"]
    
    
    #download the csv
    #The code to download files from urls in Python is:


    import urllib #import the library to grab the file and save it to your computer
    #(this is the Python 2 form; in Python 3, use "import urllib.request" and call urllib.request.urlretrieve instead)


    url = "https://domain.com/subdirectory/filename.csv" #edit this to be what you need

    #remember, the url for city of Chicago data uses the following structure:
    #"https://data.cityofchicago.org/api/views/linkending/rows.csv"
    #linkending should be the information you parsed when you used split
    #to join strings together, you can use the + sign
    #"hello" + "!" results in "hello!"


    filename = "whatever_you_want_to_name_the_file.csv" #edit this to be the link text
    urllib.urlretrieve(url, filename) #saves the file found at the url to your computer using the filename you provided
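
One detail to watch when using the link text as the file name: titles can contain spaces or characters such as "/" that are awkward or invalid in file names. The sketch below is one possible way to handle that, not the only one. The helper names `safe_filename` and `download_csv` are made up for this illustration, and the import is written so that it works in both Python 2 and Python 3.

```python
import re

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

def safe_filename(title):
    """Turn a dataset title into a file name (hypothetical helper)."""
    # Replace every run of characters other than letters and digits with "_"
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return cleaned + ".csv"

def download_csv(linkending, title):
    """Download one dataset's CSV and return the file name it was saved under."""
    url = "https://data.cityofchicago.org/api/views/" + linkending + "/rows.csv"
    filename = safe_filename(title)
    # saves the file found at the url to the current folder
    urlretrieve(url, filename)
    return filename

print(safe_filename("Crimes - 2015"))
```

With these helpers, calling `download_csv("vwwp-7yr9", "Crimes - 2015")` would save the 2015 data as `Crimes_2015.csv` in the current folder.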
