The download page for the IPEDS data center is accessed via a JavaScript form, but I've stored a current (fall 2015) copy of the page. This notebook downloads all of the zipped data files and their dictionaries and puts them in the correct directories. The full collection of files is more than a gigabyte, zipped, so I'm not storing it in git.

In [None]:
from bs4 import BeautifulSoup
import re

## Parse the page and grab the links

In [None]:
# Use a context manager so the file handle is closed after parsing.
with open('../data_archives/IPEDS Data Center.html') as page:
    soup = BeautifulSoup(page, 'lxml')

In [None]:
rows = soup.find_all('tr', "idc_gridviewrow")

In [None]:
datalinks = []
dictionarylinks = []

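# Match link text that contains none of these words, i.e. the plain CSV download.
csvfile = re.compile("^((?!STATA|SPSS|SAS|Dictionary).)*$")

for row in rows:
    # `string=` is the current name for the older `text=` argument.
    datalinks.append(row.find("a", string=csvfile).attrs['href'])
    dictionarylinks.append(row.find("a", string="Dictionary").attrs['href'])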
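
The regular expression works by checking a negative lookahead before every character, so a link label matches only if none of the excluded words starts anywhere in it. A small sketch of that behaviour, using made-up labels (the real link text on the IPEDS page may differ):

```python
import re

# Same pattern as in the cell above.
csvfile = re.compile("^((?!STATA|SPSS|SAS|Dictionary).)*$")

# Hypothetical link labels, for illustration only.
labels = ["HD2014", "HD2014_STATA", "HD2014_SPSS", "HD2014_SAS", "Dictionary"]

# Keep only the labels that contain none of the excluded words.
csv_labels = [label for label in labels if csvfile.match(label)]
print(csv_labels)
```

Only the first label survives the filter, which is why the pattern picks out the plain data file in each row.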

In [None]:
len(datalinks)

In [None]:
datalinks[0]

The form of the link shows that we will have to prepend the bulk of the web address.
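
As a rough illustration of prepending the address, here is a sketch using `urllib.parse.urljoin`. The base URL is the one used by the download function in the next section; the relative link is a hypothetical example, not a value taken from the page.

```python
from urllib.parse import urljoin

# Base address of the IPEDS data center (trailing slash matters for urljoin).
linkbase = "https://nces.ed.gov/ipeds/datacenter/"

# Hypothetical relative link, standing in for an entry of `datalinks`.
relative_link = "data/HD2014.zip"

full_url = urljoin(linkbase, relative_link)
print(full_url)
```

Plain string concatenation, as in the function below, gives the same result when the base ends with a slash and the link is relative.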


## Download and unpack the files

In [None]:
import requests
import os
import zipfile

In [None]:
def download_and_unpack(link, directory, redownload=False, verbose=False):
    """Download an IPEDS data file and unzip it.
    
    Could use some error handling.
    """
    os.chdir("../" + directory)

    zipfilename = link.split("/")[-1]
    filebase = zipfilename.split(".")[0]
    zipfilepath = "../data_archives/" + zipfilename

    linkbase = "https://nces.ed.gov/ipeds/datacenter/"
    if redownload or not os.path.exists(zipfilepath):
        try:
            r = requests.get(linkbase + link)
            r.raise_for_status()
        except requests.exceptions.RequestException as err:
            # Stop here so an error page is never written to disk as a zip file.
            print(err)
            return
            
        chunk_size = 1024
        with open(zipfilepath, 'wb') as fd:
            for chunk in r.iter_content(chunk_size):
                fd.write(chunk)
        if verbose:
            print("Downloaded {}".format(zipfilepath))
            
        # Extract into the current working directory and close the archive afterwards.
        with zipfile.ZipFile(zipfilepath) as csvzip:
            csvzip.extractall()
        if verbose:
            print("unpacked")
    else:
        if verbose:
            print("file already downloaded: {}".format(zipfilepath))
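
The function above relies on `os.chdir` so that `extractall()` unpacks into the right place. One possible alternative, sketched here with a hypothetical helper `unpack_to`, is to pass the target directory to `extractall` and leave the working directory alone. The demo builds a throwaway archive in a temporary directory instead of touching the real data.

```python
import os
import tempfile
import zipfile


def unpack_to(zipfilepath, directory):
    """Extract every member of the archive into `directory`; return the member names."""
    with zipfile.ZipFile(zipfilepath) as archive:
        archive.extractall(path=directory)
        return archive.namelist()


# Demo with a throwaway archive (file names here are made up).
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "example.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("example.csv", "UNITID,INSTNM\n")

    target = os.path.join(tmp, "data")
    names = unpack_to(zip_path, target)
    extracted_ok = os.path.exists(os.path.join(target, "example.csv"))
```

This avoids the side effect of changing the notebook's working directory on every call.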

In [None]:
download_and_unpack(dictionarylinks[-3], "dictionaries", verbose=True)

In [None]:
download_and_unpack(datalinks[1], "data", verbose=True)

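Let's add some throttling to be kind to their webserver. The `time.sleep` calls below are commented out; uncomment them before running a fresh bulk download.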

In [None]:
import time

for link in datalinks:
    download_and_unpack(link, 'data', verbose=True)
    #time.sleep(3)
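
A minimal sketch of one way to throttle, assuming a hypothetical generator `throttle` that pauses between consecutive items rather than after each one. The `sleep` parameter is only there so the pause can be swapped out when trying the helper.

```python
import time


def throttle(items, delay=3.0, sleep=time.sleep):
    """Yield items one at a time, pausing `delay` seconds between consecutive items."""
    for index, item in enumerate(items):
        if index:
            # No pause before the first item, only between items.
            sleep(delay)
        yield item


# Try it with a recording stand-in for time.sleep, so nothing actually waits.
pauses = []
seen = list(throttle(["a", "b", "c"], delay=0.5, sleep=pauses.append))
```

It could be used as `for link in throttle(datalinks): download_and_unpack(link, 'data', verbose=True)`. Note that this pauses even for files that are already downloaded.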
    

In [None]:
for link in dictionarylinks:
    download_and_unpack(link, 'dictionaries', verbose=True)
    #time.sleep(3)