<a id='Top of document'></a>

# Wrangle OpenStreetMap Data

## Data Wrangling

* [Quiz: Parsing CSV Files](#quiz_parsing_csv)
    * [Testing Code](#testing_code_1)
    * [My Solution: Parsing CSV Files](#my_sol_parsing_csv)
    * [Instructor Solution: Parsing CSV Files](#inst_sol_parsing_csv)
    * [With CSV Module](#csv_module)
* [Intro to XLRD](#intro_xlrd)
    * [Quiz: Reading Excel Files](#qz_reading_excel_files)
    * [My Solution: Reading Excel Files](#my_sol_reading_excel_files)
    * [Test: Reading Excel Files](#test_reading_excel_files)
* [Quiz: JSON](#quiz_json)
* [Quiz: Using CSV Module](#quiz_csv_mod)
    * [My Solution](#my_sol_using_csv_mod)
* [Quiz: Excel to CSV](#quiz_excel_csv)
    * [My Solution](#my_sol_excel_csv)
    * [Instructor Solution](#inst_sol_excel_csv)
    * [Test](#test_excel_csv)
* [Quiz: Wrangling JSON](#quiz_wrangling_json)
    * [My Solution](#my_sol_wrangling_json)
    * [Instructor Solution](#inst_sol_wrangling_json)
* [Data in More Complex Formats](#data_complex_formats)
    * [XML](#data_complex_formats)
    * [Quiz: Extracting Data XML](#quiz_extracting_data_xml)
    * [Quiz: Handling Attributes XML](#quiz_handling_attributes_xml)
    * [Quiz: Using Beautiful Soup](#quiz_beautiful_soup)
    * [Quiz: Carrier List](#quiz_carrier_list)
    * [Quiz: Airport List](#quiz_airport_list)
    * [Quiz: Processing All](#quiz_processing_all)
    * [Quiz: Patent Database](#quiz_patent_database)
    * [Quiz: Result of Parsing the Datafile](#quiz_result_parsing_datafile)
    * [Quiz: Processing Patents](#quiz_processing_patents)
* [Data Quality](#data_quality)
    * [Quiz: Correcting Validity](#quiz_correcting_validity)
    * [Quiz: Auditing Data Quality](#quiz_auditing_data_quality)
    * [Quiz: Fixing the Area](#quiz_fixing_area)
    * [Quiz: Fixing Name](#quiz_fixing_name)
    * [Quiz: Crossfield Auditing](#quiz_crossfield_auditing)

[Back to top](#Top of document)
<a id='quiz_parsing_csv'></a>

## Quiz: Parsing CSV Files

Your task is to read the input DATAFILE line by line. For each of the first 10 lines (not counting the header), split the line on "," and create a dictionary in which each key is a header title and each value is that field's value in the row.
The function parse_file should return a list of dictionaries, with each data line in the file being a single list entry.
Field names and values should not contain extra whitespace, such as spaces or newline characters.
You can use the Python string method strip() to remove the extra whitespace.
You only have to parse the first 10 data lines in this exercise, so the returned list should have 10 entries!

[Back to top](#Top of document)
<a id='testing_code_1'></a>

### Testing Code

In [None]:
import os

DATADIR = os.getcwd()
DATAFILE = "beatles-diskography.csv"

datafile = os.path.join(DATADIR, DATAFILE)

print(datafile)

data = []
with open(datafile, "r") as f:
    for line in f:
        music = dict()
        line = line.strip()
        words = line.split(',')
        music['Title'] = words[0]
        music['UK Chart Position'] = words[3]
        music['Label'] = words[2]
        music['Released'] = words[1]
        music['US Chart Position'] = words[4]
        music['RIAA Certification'] = words[6]
        music['BPI Certification'] = words[5]
        data.append(music)
    print(data[1:11])

In [None]:
data[10]

[Back to top](#Top of document)
<a id='my_sol_parsing_csv'></a>

### My Solution: Parsing CSV Files

In [None]:
import os
import pprint

DATADIR = os.getcwd()
DATAFILE = "beatles-diskography.csv"

def parse_file(datafile):
    data = []
    with open(datafile, "r") as f:
        for line in f:
            music = dict()
            line = line.strip()
            words = line.split(',')
            music['Title'] = words[0]
            music['UK Chart Position'] = words[3]
            music['Label'] = words[2]
            music['Released'] = words[1]
            music['US Chart Position'] = words[4]
            music['RIAA Certification'] = words[6]
            music['BPI Certification'] = words[5]
            data.append(music)

    print('My Solution')
    pprint.pprint(data[1:11])
    return data[1:11]

[Back to top](#Top of document)
<a id='inst_sol_parsing_csv'></a>

### Instructor Solution: Parsing CSV Files

In [None]:
import os
import pprint

DATADIR = os.getcwd()
DATAFILE = "beatles-diskography.csv"

def parse_file(datafile):
    data = []
    with open(datafile, "r") as f:
        header = f.readline().split(',')
        counter = 0
        for line in f:
            if counter == 10:
                break
                
            fields = line.split(',')
            entry = {}
            
            for i, value in enumerate(fields):
                entry[header[i].strip()] = value.strip()
                
            data.append(entry)
            counter += 1
            
    print('Instructor Solution')
    pprint.pprint(data)
    return data

### Test: Parsing CSV Files

In [None]:
def test():
    # a simple test of your implementation
    datafile = os.path.join(DATADIR, DATAFILE)
    d = parse_file(datafile)
    firstline = {'Title': 'Please Please Me', 'UK Chart Position': '1', 'Label': 'Parlophone(UK)', 'Released': '22 March 1963', 
                 'US Chart Position': '-', 'RIAA Certification': 'Platinum', 'BPI Certification': 'Gold'}
    tenthline = {'Title': '', 'UK Chart Position': '1', 'Label': 'Parlophone(UK)', 'Released': '10 July 1964', 
                 'US Chart Position': '-', 'RIAA Certification': '', 'BPI Certification': 'Gold'}

    assert d[0] == firstline
    assert d[9] == tenthline

    
test()

[Back to top](#Top of document)
<a id='csv_module'></a>

### With CSV Module

https://docs.python.org/2/library/csv.html

In [None]:
import os
import pprint
import csv

DATADIR = os.getcwd()
DATAFILE = "beatles-diskography.csv"

def parse_csv(datafile):
    data = []
    with open(datafile, 'r') as sd:
        r = csv.DictReader(sd)
        for line in r:
            data.append(line)
    return data

In [None]:
datafile = os.path.join(DATADIR, DATAFILE)
d = parse_csv(datafile)
pprint.pprint(d)
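
One reason to prefer the csv module over `line.split(',')` is that it handles quoted fields that contain commas. The sketch below is only an illustration: `SAMPLE` is made-up data (not the contents of beatles-diskography.csv) and `first_rows` is a hypothetical helper name. It also shows `itertools.islice` as one way to stop after a fixed number of data rows.

```python
import csv
import io
from itertools import islice

# Made-up sample data; the first data row has a comma inside a quoted field.
SAMPLE = (
    'Title,Released,Label\n'
    '"Album One, Remastered",1 January 2000,Example Records\n'
    'Album Two,2 February 2001,Example Records\n'
    'Album Three,3 March 2002,Example Records\n'
)


def first_rows(text, limit):
    """Return at most `limit` data rows as dictionaries keyed by header."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in islice(reader, limit)]


rows = first_rows(SAMPLE, 2)

# Naive splitting breaks the quoted title into two pieces.
naive = SAMPLE.splitlines()[1].split(',')

print(rows[0]['Title'])
print(len(naive), "fields from the naive split, but the header has 3")
```

If the real file never quotes its fields, both approaches give the same result; the csv module just removes the need to check.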

[Back to top](#Top of document)
<a id='intro_xlrd'></a>

## Intro to XLRD

http://xlrd.readthedocs.io/en/latest/

In [None]:
import xlrd

datafile = '2013_ERCOT_Hourly_Load_Data.xls'

def parse_file(datafile):
    workbook = xlrd.open_workbook(datafile)
    sheet = workbook.sheet_by_index(0)
    
    data = [[sheet.cell_value(r, col)
        for col in range(sheet.ncols)]
            for r in range(sheet.nrows)]
    
    print("\nList Comprehension")
    print("data[3][2]:", data[3][2])

    print("\nCells in a nested loop:")
    for row in range(sheet.nrows):
        for col in range(sheet.ncols):
            if row == 50:
                print(sheet.cell_value(row, col))


    ### other useful methods:
    print("\nROWS, COLUMNS, and CELLS:")
    print("Number of rows in the sheet:", sheet.nrows)
    print("Type of data in cell (row 3, col 2):", sheet.cell_type(3, 2))
    print("Value in cell (row 3, col 2):", sheet.cell_value(3, 2))
    print("Get a slice of values in column 3, from rows 1-3:", sheet.col_values(3, start_rowx=1, end_rowx=4))

    print("\nDATES:")
    print("Type of data in cell (row 1, col 0):", sheet.cell_type(1, 0))
    
    exceltime = sheet.cell_value(1, 0)
    
    print("Time in Excel format:", exceltime)
    print("Convert time to a Python datetime tuple, from the Excel float:", xlrd.xldate_as_tuple(exceltime, 0))

    return data

data = parse_file(datafile)

[Back to top](#Top of document)
<a id='qz_reading_excel_files'></a>

## Quiz: Reading Excel Files

Your task is as follows:
- read the provided Excel file
- find and return the min, max and average values for the COAST region
- find and return the time value for the min and max entries
- the time values should be returned as Python tuples

Please see the test function for the expected return format

In [None]:
import xlrd
from zipfile import ZipFile
datafile = "2013_ERCOT_Hourly_Load_Data.xls"


def open_zip(datafile):
    with ZipFile('{0}.zip'.format(datafile), 'r') as myzip:
        myzip.extractall()

[Back to top](#Top of document)
<a id='my_sol_reading_excel_files'></a>

### My Solution: Reading Excel Files

#### example on how you can get the data
* sheet_data = [[sheet.cell_value(r, col) for col in range(sheet.ncols)] for r in range(sheet.nrows)]

#### other useful methods:
* print("\nROWS, COLUMNS, and CELLS:")
* print("Number of rows in the sheet:", sheet.nrows)
* print("Type of data in cell (row 3, col 2):", sheet.cell_type(3, 2))
* print("Value in cell (row 3, col 2):", sheet.cell_value(3, 2))
* print("Get a slice of values in column 3, from rows 1-3:", sheet.col_values(3, start_rowx=1, end_rowx=4))
* print("\nDATES:")
* print("Type of data in cell (row 1, col 0):", sheet.cell_type(1, 0))
* exceltime = sheet.cell_value(1, 0)
* print("Time in Excel format:", exceltime)
* print("Convert time to a Python datetime tuple, from the Excel float:", xlrd.xldate_as_tuple(exceltime, 0))
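
To see what the date conversion in the last bullet amounts to, here is a standard-library sketch for the 1900 date system (datemode 0). It is not xlrd's implementation: `excel_serial_to_tuple` and `EXCEL_EPOCH_1900` are hypothetical names, and the helper deliberately refuses serials before 1 March 1900, where Excel's fictitious 29 February 1900 makes the simple offset wrong.

```python
from datetime import datetime, timedelta

# Day zero for the 1900 date system. The offset absorbs Excel's fictitious
# 29 February 1900, so it is only valid for serials >= 61 (1 March 1900).
EXCEL_EPOCH_1900 = datetime(1899, 12, 30)


def excel_serial_to_tuple(serial):
    """Convert an Excel serial date (float) to (year, month, day, h, m, s)."""
    if serial < 61:
        raise ValueError("serials before 1 March 1900 are not handled here")
    # Round to whole seconds to hide floating-point noise in the day fraction.
    moment = EXCEL_EPOCH_1900 + timedelta(seconds=round(serial * 86400))
    return (moment.year, moment.month, moment.day,
            moment.hour, moment.minute, moment.second)


# The integer part counts days, the fractional part is the time of day.
print(excel_serial_to_tuple(41275.5))
```

In the notebook itself `xlrd.xldate_as_tuple(exceltime, 0)` remains the call to use; the sketch is only meant to make the float format less mysterious.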

In [None]:
import numpy as np
import pprint

def parse_file(datafile):
    workbook = xlrd.open_workbook(datafile)
    sheet = workbook.sheet_by_index(0)
    
    header = sheet.row_values(0, start_colx=0, end_colx=sheet.ncols)
    coast_loc = header.index('COAST')
    coast_col = sheet.col_values(coast_loc, start_rowx=1, end_rowx=sheet.nrows)
    
    maxtime_loc = coast_col.index(max(coast_col))
    mintime_loc = coast_col.index(min(coast_col))
    
    # Use +1 because coast_col doesn't include the header (which is the first value)
    maxtime_excel = sheet.cell_value(maxtime_loc+1, 0) 
    mintime_excel = sheet.cell_value(mintime_loc+1, 0)
    
    data = {
            'maxtime': xlrd.xldate_as_tuple(maxtime_excel, 0),
            'maxvalue': max(coast_col),
            'mintime': xlrd.xldate_as_tuple(mintime_excel, 0),
            'minvalue': min(coast_col),
            'avgcoast': np.mean(coast_col)
            }
    
    pprint.pprint(data)
    
    return data

[Back to top](#Top of document)
<a id='test_reading_excel_files'></a>

### Test: Reading Excel Files

In [None]:
def test():
    #open_zip(datafile)
    data = parse_file(datafile)

    assert data['maxtime'] == (2013, 8, 13, 17, 0, 0)
    assert round(data['maxvalue'], 10) == round(18779.02551, 10)


test()

[Back to top](#Top of document)
<a id='quiz_json'></a>

## Quiz: JSON

* [JSON Tutorial](http://www.w3schools.com/js/js_json_intro.asp)
* http://www.json.org
* [Requests Documentation](http://requests.readthedocs.io/en/latest/)

In [None]:
import json
import requests

BASE_URL = "http://musicbrainz.org/ws/2/"
ARTIST_URL = BASE_URL + "artist/"

# query parameters are given to the requests.get function as a dictionary; this
# variable contains some starter parameters.
query_type = {  "simple": {},
                "atr": {"inc": "aliases+tags+ratings"},
                "aliases": {"inc": "aliases"},
                "releases": {"inc": "releases"}}

In [None]:
def query_site(url, params, uid="", fmt="json"):
    # This is the main function for making queries to the musicbrainz API.
    # A json document should be returned by the query.
    params["fmt"] = fmt
    r = requests.get(url + uid, params=params)
    print("requesting", r.url)

    if r.status_code == requests.codes.ok:
        return r.json()
    else:
        r.raise_for_status()

In [None]:
def query_by_name(url, params, name):
    # This adds an artist name to the query parameters before making
    # an API call to the function above.
    params["query"] = "artist:" + name
    return query_site(url, params)
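
Note that `query_site` and `query_by_name` write into the dictionary they are given, so the entries of `query_type` pick up extra keys from one call to the next. A minimal sketch of building the parameters on a copy instead; `build_params` is a hypothetical helper, and `urlencode` is used only to show roughly what such a dictionary looks like as a query string (the exact encoding that requests produces is not checked here).

```python
from urllib.parse import urlencode

# Same shape as the starter parameters above.
query_type = {"simple": {},
              "atr": {"inc": "aliases+tags+ratings"}}


def build_params(base, name, fmt="json"):
    """Return a new dict with the query and format added; `base` is untouched."""
    params = dict(base)
    params["query"] = "artist:" + name
    params["fmt"] = fmt
    return params


params = build_params(query_type["atr"], "Nirvana")
query_string = urlencode(params)

print(query_string)
print(query_type["atr"])  # still only contains "inc"
```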

In [None]:
def pretty_print(data, indent=4):
    # After we get our output, we can format it to be more readable
    # by using this function.
    if type(data) == dict:
        print(json.dumps(data, indent=indent, sort_keys=True))
    else:
        print(data)

In [None]:
def main():
    '''
    Modify the function calls and indexing below to answer the questions on
    the next quiz. HINT: Note how the output we get from the site is a
    multi-level JSON document, so try making print statements to step through
    the structure one level at a time or copy the output to a separate output
    file.
    '''
    results = query_by_name(ARTIST_URL, query_type["simple"], "Nirvana")
    pretty_print(results)

    artist_id = results["artists"][1]["id"]
    print("\nARTIST:")
    pretty_print(results["artists"][1])

    '''
    artist_data = query_site(ARTIST_URL, query_type["releases"], artist_id)
    releases = artist_data["releases"]
    print("\nONE RELEASE:")
    pretty_print(releases[0], indent=2)
    release_titles = [r["title"] for r in releases]

    print("\nALL TITLES:")
    for t in release_titles:
        print(t)
    '''
    
    print("\nAlias:")
    pretty_print(results["artists"][0]['aliases'])

In [None]:
main()

## Quiz: Exploring JSON Data

1. How many bands named 'First Aid Kit'? 2
2. Begin_Area Name for Queen? London
3. Spanish Alias for Beatles? Los Beatles
4. Nirvana Disambiguation?  90s US grunge band
5. When was One Direction formed?  2010-07

[Back to top](#Top of document)
<a id='quiz_csv_mod'></a>

## Quiz: Using the CSV Module

You can check the data in the dropdown in the top-left corner of the quiz starter code.

The data comes from the [NREL](http://www.nrel.gov) website. The datafile in this exercise is a small subset of the full file for one of the stations. You can download it from the Downloadables section, or see the full data files for other stations on the [National Solar Radiation Data Base](http://rredc.nrel.gov/solar/old_data/nsrdb/1991-2005/tmy3/by_USAFN.html).

[Documentation on csv.reader on docs.python.org](http://docs.python.org/2/library/csv.html#csv.reader)

[Documentation on Reader object methods on docs.python.org](http://docs.python.org/2/library/csv.html#reader-objects)

Your task is to process the supplied file and use the csv module to extract data from it.
The data comes from the NREL (National Renewable Energy Laboratory) website. Each file
contains information from one meteorological station, in particular the amount of
solar and wind energy for each hour of the day.

Note that the first line of the datafile is neither a data entry nor a header. It is a line
describing the data source. You should extract the name of the station from it.

The data should be returned as a list of lists (not dictionaries).
You can use the csv module's "reader" method to get data in such a format.
Another useful method is next(), which gets the next line from the iterator.
You should only change the parse_file function.
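
The `next()` approach described above can be sketched on an in-memory file. Everything in `SAMPLE` is made up to mimic the three-part layout (source line, header line, data rows), and `parse_text` is a hypothetical name, so treat this as an illustration rather than the quiz answer.

```python
import csv
import io

# Made-up data in the same layout: source line, header line, then data rows.
SAMPLE = (
    '000000,"EXAMPLE STATION",XX,-8.0\n'
    'Date (MM/DD/YYYY),Time (HH:MM),Value\n'
    '01/01/2005,01:00,0\n'
    '01/01/2005,02:00,3\n'
)


def parse_text(text):
    """Return (station name, data rows) from text in the layout above."""
    reader = csv.reader(io.StringIO(text))
    source_line = next(reader)      # first line describes the data source
    name = source_line[1]           # station name is the second field
    next(reader)                    # skip the header line
    data = [row for row in reader]  # everything left is data
    return name, data


name, data = parse_text(SAMPLE)
print(name)
print(data[0])
```

Compared with counting rows in the loop, calling `next()` twice up front keeps the loop body free of special cases.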

[Back to top](#Top of document)
<a id='my_sol_using_csv_mod'></a>

## My Solution

In [None]:
import csv
import os


DATADIR = os.getcwd()
DATAFILE = "745090.csv"

datafile = os.path.join(DATADIR, DATAFILE)

def parse_file(datafile):
    name = ""
    data = []
    counter = 0
    with open(datafile, 'r') as f:
        test_reader = csv.reader(f)
        for row in test_reader:
            if counter == 0:
                name = row[1]
            if counter >= 2:
                data.append(row)
            
            counter += 1

    #print(name, data)
    # Do not change the line below
    return (name, data)

In [None]:
parse_file(DATAFILE)

In [None]:
def test():
    datafile = os.path.join(DATADIR, DATAFILE)
    name, data = parse_file(datafile)

    assert name == "MOUNTAIN VIEW MOFFETT FLD NAS"
    assert data[0][1] == "01:00"
    assert data[2][0] == "01/01/2005"
    assert data[2][5] == "2"


if __name__ == "__main__":
    test()

[Back to top](#Top of document)
<a id='quiz_excel_csv'></a>

## Quiz: Excel to CSV

Find the time and value of max load for each of the regions
COAST, EAST, FAR_WEST, NORTH, NORTH_C, SOUTHERN, SOUTH_C, WEST
and write the result out in a csv file, using pipe character | as the delimiter.

An example output can be seen in the "example.csv" file.

See the csv module documentation on how to use different delimiters for csv.writer: http://docs.python.org/2/library/csv.html
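
A small sketch of the pipe-delimited output, written to an in-memory buffer so it can be tried without the Excel file. The FAR_WEST values are the ones the test further down expects; writing to `io.StringIO` instead of a real file is an assumption made for the example.

```python
import csv
import io

header = ["Station", "Year", "Month", "Day", "Hour", "Max Load"]
# Values for FAR_WEST are taken from the test in this notebook.
rows = [["FAR_WEST", 2013, 6, 26, 17, 2281.2722140000024]]

# Write with "|" as the delimiter instead of the default ",".
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|")
writer.writerow(header)
writer.writerows(rows)

text = buffer.getvalue()
print(text)

# Reading it back needs the same delimiter; all values come back as strings.
read_back = list(csv.DictReader(io.StringIO(text, newline=""), delimiter="|"))
print(read_back[0]["Station"], read_back[0]["Hour"])
```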

In [None]:
import xlrd
import os
import csv
from zipfile import ZipFile
import pprint

DATADIR = os.getcwd()
DATAFILE = "2013_ERCOT_Hourly_Load_Data.xls"
datafile = os.path.join(DATADIR, DATAFILE)

OUTFILE = "2013_Max_Loads.csv"
outfile = os.path.join(DATADIR, OUTFILE)

In [None]:
def open_zip(datafile):
    with ZipFile('{0}.zip'.format(datafile), 'r') as myzip:
        myzip.extractall()

[Back to top](#Top of document)
<a id='my_sol_excel_csv'></a>

## My Solution: Excel to CSV

In [None]:
def parse_file(datafile):
    workbook = xlrd.open_workbook(datafile)
    sheet = workbook.sheet_by_index(0)

    header = sheet.row_values(0, start_colx=0, end_colx=sheet.ncols)
    
    data = [('Station', 'Year', 'Month', 'Day', 'Hour', 'Max Load')]
    
    for station in header[1:-1]:
        station_loc = header.index(station)
        station_col = sheet.col_values(station_loc, start_rowx=1, end_rowx=sheet.nrows)
    
        maxtime_loc = station_col.index(max(station_col))
        
        # Use +1 because station_col doesn't include the header (which is the first value)
        maxtime_excel = sheet.cell_value(maxtime_loc+1, 0) 
        
        xldate = xlrd.xldate_as_tuple(maxtime_excel, 0)[0:4]
        
        data.append((station, xldate[0], xldate[1], xldate[2], xldate[3], max(station_col)))
        
        
    return(data)

In [None]:
parse_file(datafile)

In [None]:
def save_file(data, filename):
    with open(filename, 'w', newline='') as myfile:
        wr = csv.writer(myfile, delimiter = '|')
        for row in data:
            wr.writerow(row)

[Back to top](#Top of document)
<a id='inst_sol_excel_csv'></a>

## Instructor Solution: Excel to CSV

In [None]:
def parse_file(datafile):
    workbook = xlrd.open_workbook(datafile)
    sheet = workbook.sheet_by_index(0)
    data = {}
    # process all rows that contain station data
    for n in range (1, 9):
        station = sheet.cell_value(0, n)
        cv = sheet.col_values(n, start_rowx=1, end_rowx=None)

        maxval = max(cv)
        maxpos = cv.index(maxval) + 1
        maxtime = sheet.cell_value(maxpos, 0)
        realtime = xlrd.xldate_as_tuple(maxtime, 0)
        data[station] = {"maxval": maxval,
                         "maxtime": realtime}

    print(data)
    return data

In [None]:
def save_file(data, filename):
    with open(filename, "w", newline="") as f:
        w = csv.writer(f, delimiter='|')
        w.writerow(["Station", "Year", "Month", "Day", "Hour", "Max Load"])
        for s in data:
            year, month, day, hour, _ , _= data[s]["maxtime"]
            w.writerow([s, year, month, day, hour, data[s]["maxval"]])

[Back to top](#Top of document)
<a id='test_excel_csv'></a>

## Test: Excel to CSV

In [None]:
def test():
    #open_zip(datafile)
    data = parse_file(datafile)
    save_file(data, outfile)

    number_of_rows = 0
    stations = []

    ans = {'FAR_WEST': {'Max Load': '2281.2722140000024',
                        'Year': '2013',
                        'Month': '6',
                        'Day': '26',
                        'Hour': '17'}}
    correct_stations = ['COAST', 'EAST', 'FAR_WEST', 'NORTH',
                        'NORTH_C', 'SOUTHERN', 'SOUTH_C', 'WEST']
    fields = ['Year', 'Month', 'Day', 'Hour', 'Max Load']

    with open(outfile) as of:
        csvfile = csv.DictReader(of, delimiter="|")
        for line in csvfile:
            station = line['Station']
            if station == 'FAR_WEST':
                for field in fields:
                    # Check if 'Max Load' is within .1 of answer
                    if field == 'Max Load':
                        max_answer = round(float(ans[station][field]), 1)
                        max_line = round(float(line[field]), 1)
                        assert max_answer == max_line

                    # Otherwise check for equality
                    else:
                        assert ans[station][field] == line[field]

            number_of_rows += 1
            stations.append(station)

        # Output should be 8 lines not including header
        assert number_of_rows == 8

        # Check Station Names
        assert set(stations) == set(correct_stations)

In [None]:
test()

[Back to top](#Top of document)
<a id='quiz_wrangling_json'></a>

## Quiz: Wrangling JSON

This exercise shows some important concepts that you should be aware of:
- using the codecs module to write unicode files
- using authentication with web APIs
- using an offset when accessing web APIs

To run this code locally you have to register at the NYTimes developer site 
and get your own API key. You will be able to complete this exercise in our UI
without doing so, as we have provided a sample result. (See the file 
'popular-viewed-1.json' from the tabs above.)

Your task is to modify the article_overview() function to process the saved
file that represents the most popular articles (by view count) from the last
day, and return a tuple of variables containing the following data:
- labels: list of dictionaries, where the keys are the "section" values and
  values are the "title" values for each of the retrieved articles.
- urls: list of URLs for all 'media' entries with "format": "Standard Thumbnail"

All your changes should be in the article_overview() function. See the test() 
function for examples of the elements of the output lists.
The rest of functions are provided for your convenience, if you want to access
the API by yourself.

You can check the data in the dropdown in the top-left corner of the quiz starter code.

If you want to know more, or query the site by yourself, please read the [NYTimes Developer Documentation for the Most Popular API](http://developer.nytimes.com/docs/most_popular_api/) and [apply for your own API Key for NY Times](http://developer.nytimes.com/page).
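
The shape of the two return values is easier to see on a tiny document. The sketch below uses a made-up, heavily trimmed stand-in for 'popular-viewed-1.json' (titles and example.com URLs are invented), and `overview` is a hypothetical name; it only assumes the nesting described above: articles, then 'media', then 'media-metadata'.

```python
import json

# Made-up stand-in for the saved API result; the second article has no media.
SAMPLE = json.loads("""
[
  {"section": "Opinion", "title": "First Example Title",
   "media": [{"media-metadata": [
       {"format": "Standard Thumbnail", "url": "http://example.com/a-thumb.jpg"},
       {"format": "Large", "url": "http://example.com/a-large.jpg"}]}]},
  {"section": "Sports", "title": "Second Example Title"}
]
""")


def overview(articles):
    """Return (labels, urls) in the format the quiz describes."""
    labels, urls = [], []
    for article in articles:
        # One single-entry dictionary per article: {section: title}
        labels.append({article["section"]: article["title"]})
        # Articles without a 'media' key simply contribute no URLs.
        for media in article.get("media", []):
            for meta in media.get("media-metadata", []):
                if meta.get("format") == "Standard Thumbnail":
                    urls.append(meta["url"])
    return labels, urls


labels, urls = overview(SAMPLE)
print(labels)
print(urls)
```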

In [None]:
import os
import json
import codecs
import requests
from pprint import pprint

DATADIR = os.getcwd()
DATAFILE = "popular-{0}-{1}.json"
datafile = os.path.join(DATADIR, DATAFILE)

URL_MAIN = "http://api.nytimes.com/svc/"
URL_POPULAR = URL_MAIN + "mostpopular/v2/"
API_KEY = { "popular": "",
            "article": ""}

def get_from_file(kind, period):
    filename = datafile.format(kind, period)
    with open(filename, "r") as f:
        return json.loads(f.read())

[Back to top](#Top of document)
<a id='my_sol_wrangling_json'></a>

## My Solution: Wrangling JSON

In [None]:
pprint(get_from_file('viewed', 1))

In [None]:
def article_overview(kind, period):
    data = get_from_file(kind, period)

    titles = []
    urls =[]
    
    for d in data:
        section = d['section']
        title = d['title']
        titles.append({section: title})
        for m in d['media']:
            for mm in m['media-metadata']:
                if mm['format'] == 'Standard Thumbnail':
                    urls.append(mm['url'])
 
    return (titles, urls)

In [None]:
pprint(article_overview('viewed', 1))

[Back to top](#Top of document)
<a id='inst_sol_wrangling_json'></a>

## Instructor Solution: Wrangling JSON

In [None]:
def article_overview(kind, period):
    data = get_from_file(kind, period)
    titles = []
    urls =[]

    for article in data:
        section = article["section"]
        title = article["title"]
        titles.append({section: title})
        if "media" in article:
            for m in article["media"]:
                for mm in m["media-metadata"]:
                    if mm["format"] == "Standard Thumbnail":
                        urls.append(mm["url"])

    return (titles, urls)

## Test and Supplied Code: Wrangling JSON

In [None]:
def query_site(url, target, offset):
    # This will set up the query with the API key and offset.
    # Web services often use an offset parameter to return data in small chunks.
    # NYTimes returns 20 articles per request; if you want the next 20,
    # you have to provide the offset parameter.
    if API_KEY["popular"] == "" or API_KEY["article"] == "":
        print("You need to register for NYTimes Developer account to run this program.")
        print("See Instructor notes for information")
        return False
    params = {"api-key": API_KEY[target], "offset": offset}
    r = requests.get(url, params = params)

    if r.status_code == requests.codes.ok:
        return r.json()
    else:
        r.raise_for_status()

In [None]:
def get_popular(url, kind, days, section="all-sections", offset=0):
    # This function will construct the query according to the requirements of the site
    # and return the data, or print an error message if called incorrectly
    if days not in [1,7,30]:
        print("Time period can be 1,7, 30 days only")
        return False
    if kind not in ["viewed", "shared", "emailed"]:
        print("kind can be only one of viewed/shared/emailed")
        return False

    url += "most{0}/{1}/{2}.json".format(kind, section, days)
    data = query_site(url, "popular", offset)

    return data

In [None]:
def save_file(kind, period):
    # This will process all results, by calling the API repeatedly with supplied offset value,
    # combine the data and then write all results in a file.
    data = get_popular(URL_POPULAR, kind, period)
    num_results = data["num_results"]
    full_data = []
    with codecs.open("popular-{0}-{1}.json".format(kind, period), encoding='utf-8', mode='w') as v:
        for offset in range(0, num_results, 20):        
            data = get_popular(URL_POPULAR, kind, period, offset=offset)
            full_data += data["results"]
        
        v.write(json.dumps(full_data, indent=2))
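
The offset loop can be tried without an API key by swapping in a stand-in for the API call. In this sketch `fake_fetch` is entirely hypothetical (it just slices a list of numbers), and `fetch_all` is an illustrative name; the only things taken from the code above are the page size of 20 and the "num_results" / "results" keys.

```python
PAGE_SIZE = 20  # page size stated in the query_site comments


def fake_fetch(offset, total=45):
    """Hypothetical stand-in for get_popular: returns one page of fake results."""
    items = list(range(total))
    return {"num_results": total,
            "results": items[offset:offset + PAGE_SIZE]}


def fetch_all(fetch):
    """Collect every page by stepping the offset until num_results is reached."""
    first = fetch(0)
    full_data = list(first["results"])
    # Start at the second page; the first response is reused, not re-requested.
    for offset in range(PAGE_SIZE, first["num_results"], PAGE_SIZE):
        full_data += fetch(offset)["results"]
    return full_data


everything = fetch_all(fake_fetch)
print(len(everything))
```

Reusing the first response saves one request compared with fetching offset 0 a second time inside the loop.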

In [None]:
def test():
    titles, urls = article_overview("viewed", 1)
    assert len(titles) == 20
    assert len(urls) == 30
    assert titles[2] == {'Opinion': 'Professors, We Need You!'}
    assert urls[20] == 'http://graphics8.nytimes.com/images/2014/02/17/sports/ICEDANCE/ICEDANCE-thumbStandard.jpg'

In [None]:
test()

[Back to top](#Top of document)
<a id='data_complex_formats'></a>

## Data in More Complex Formats

## XML

XML - Extensible Markup Language
* https://www.w3schools.com/xml/xml_whatis.asp
* https://www.w3.org/TR/xml/#sec-origin-goals
* http://www.tizag.com/xmlTutorial/index.php
* https://docs.python.org/2/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
* https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree

In [None]:
import xml.etree.ElementTree as ET
import pprint as pp

tree = ET.parse('exampleResearchArticle.xml')
root = tree.getroot()

title = root.find('./fm/bibl/title')
title_text = ''
for p in title:
    title_text += p.text
print('\nTitle:\n', title_text)

print('\nAuthor email addresses:')
for a in root.findall('./fm/bibl/aug/au'):
    email = a.find('email')
    if email is not None:
        print(email.text)

[Back to top](#Top of document)
<a id='quiz_extracting_data_xml'></a>

## Quiz: Extracting Data XML

1. Your task here is to extract data about the authors of an article from the XML and add it to a list, one item per author.
2. See the provided data structure for the expected format.
3. The tags for first name, surname and email should map directly to the dictionary keys.
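
Calling `.text` on the result of `find()` raises an AttributeError when the element is missing. A sketch of one way around that, using `findtext()`, which returns None for a missing child. The XML fragment is made up (it only mirrors the fm/bibl/aug/au path) and `authors_from` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Made-up fragment shaped like the fm/bibl/aug/au path; the second author
# has no <email> element.
SAMPLE = """
<art><fm><bibl><aug>
  <au><fnm>Ada</fnm><snm>Example</snm><email>ada@example.com</email></au>
  <au><fnm>Bob</fnm><snm>Sample</snm></au>
</aug></bibl></fm></art>
""".strip()


def authors_from(root):
    """Return one dictionary per <au>, with None for missing child tags."""
    authors = []
    for au in root.findall('./fm/bibl/aug/au'):
        # findtext returns the child's text, or None if the child is absent.
        authors.append({tag: au.findtext(tag) for tag in ('fnm', 'snm', 'email')})
    return authors


sample_root = ET.fromstring(SAMPLE)
sample_authors = authors_from(sample_root)
print(sample_authors)
```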

In [None]:
import xml.etree.ElementTree as ET

article_file = "exampleResearchArticle.xml"

def get_root(fname):
    tree = ET.parse(fname)
    return tree.getroot()

## My Solution

In [None]:
def get_authors(root):
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        data = {
                "fnm": None,
                "snm": None,
                "email": None
        }
        # MY CODE HERE

        data['fnm'] = author.find('fnm').text
        data['snm'] = author.find('snm').text
        data['email'] = author.find('email').text

        authors.append(data)

    return authors

In [None]:
get_authors(root)

## Instructor Solution

In [None]:
def get_authors(root):
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        data = {
                "fnm": None,
                "snm": None,
                "email": None
        }
        data["fnm"] = author.find('./fnm').text
        data["snm"] = author.find('./snm').text
        data["email"] = author.find('./email').text

        authors.append(data)

    return authors

## Test

In [None]:
def test():
    solution = [{'fnm': 'Omer', 'snm': 'Mei-Dan', 'email': 'omer@extremegate.com'},
                {'fnm': 'Mike', 'snm': 'Carmont', 'email': 'mcarmont@hotmail.com'},
                {'fnm': 'Lior', 'snm': 'Laver', 'email': 'laver17@gmail.com'},
                {'fnm': 'Meir', 'snm': 'Nyska', 'email': 'nyska@internet-zahav.net'},
                {'fnm': 'Hagay', 'snm': 'Kammar', 'email': 'kammarh@gmail.com'},
                {'fnm': 'Gideon', 'snm': 'Mann', 'email': 'gideon.mann.md@gmail.com'},
                {'fnm': 'Barnaby', 'snm': 'Clarck', 'email': 'barns.nz@gmail.com'},
                {'fnm': 'Eugene', 'snm': 'Kots', 'email': 'eukots@gmail.com'}]
    
    root = get_root(article_file)
    data = get_authors(root)

    assert data[0] == solution[0]
    assert data[1]["fnm"] == solution[1]["fnm"]

In [None]:
test()

[Back to top](#Top of document)
<a id='quiz_handling_attributes_xml'></a>

## Quiz: Handling Attributes XML

1. Your task here is to extract data about the authors of an article from the XML and add it to a list, one item per author.
2. See the provided data structure for the expected format.
3. The tags for first name, surname and email should map directly to the dictionary keys, but you have to extract the attributes from the "insr" tag and add them to the list for the dictionary key "insr".
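
Attributes live on the element itself rather than in its text. A minimal sketch on a made-up `<au>` fragment; the last `<insr/>` deliberately has no "iid" attribute to show that `Element.get()` returns None instead of raising a KeyError the way `attrib["iid"]` would.

```python
import xml.etree.ElementTree as ET

# Made-up author element; the third <insr/> has no iid attribute.
SAMPLE = '<au><fnm>Ada</fnm><insr iid="I1"/><insr iid="I2"/><insr/></au>'

author = ET.fromstring(SAMPLE)

# .attrib is a plain dictionary of the element's attributes.
first_attrs = author.find("insr").attrib
print(first_attrs)

# .get() is the forgiving lookup; skip elements that lack the attribute.
iids = [insr.get("iid")
        for insr in author.findall("insr")
        if insr.get("iid") is not None]
print(iids)
```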

In [None]:
import xml.etree.ElementTree as ET

article_file = "exampleResearchArticle.xml"


def get_root(fname):
    tree = ET.parse(fname)
    return tree.getroot()

## My Solution

In [None]:
def get_authors(root):
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        data = {
                "fnm": None,
                "snm": None,
                "email": None,
                "insr": []
        }

        data['fnm'] = author.find('./fnm').text
        data['snm'] = author.find('./snm').text
        data['email'] = author.find('./email').text
        
        for insr in author.iter('insr'):
            #print(insr.attrib)
            data['insr'].append(insr.attrib['iid'])

        authors.append(data)

    return authors

In [None]:
get_authors(root)

## Instructor Solution

In [None]:
def get_authors(root):
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        data = {
                "fnm": None,
                "snm": None,
                "email": None,
                "insr": []
        }
        data["fnm"] = author.find('./fnm').text
        data["snm"] = author.find('./snm').text
        data["email"] = author.find('./email').text
        insr = author.findall('./insr')
        for i in insr:
            data["insr"].append(i.attrib["iid"])
        authors.append(data)

    return authors

## Test

In [None]:
def test():
    solution = [{'insr': ['I1'], 'fnm': 'Omer', 'snm': 'Mei-Dan', 'email': 'omer@extremegate.com'},
                {'insr': ['I2'], 'fnm': 'Mike', 'snm': 'Carmont', 'email': 'mcarmont@hotmail.com'},
                {'insr': ['I3', 'I4'], 'fnm': 'Lior', 'snm': 'Laver', 'email': 'laver17@gmail.com'},
                {'insr': ['I3'], 'fnm': 'Meir', 'snm': 'Nyska', 'email': 'nyska@internet-zahav.net'},
                {'insr': ['I8'], 'fnm': 'Hagay', 'snm': 'Kammar', 'email': 'kammarh@gmail.com'},
                {'insr': ['I3', 'I5'], 'fnm': 'Gideon', 'snm': 'Mann', 'email': 'gideon.mann.md@gmail.com'},
                {'insr': ['I6'], 'fnm': 'Barnaby', 'snm': 'Clarck', 'email': 'barns.nz@gmail.com'},
                {'insr': ['I7'], 'fnm': 'Eugene', 'snm': 'Kots', 'email': 'eukots@gmail.com'}]

    root = get_root(article_file)
    data = get_authors(root)

    assert data[0] == solution[0]
    assert data[1]["insr"] == solution[1]["insr"]

In [None]:
test()

[Back to top](#Top of document)
<a id='quiz_beautiful_soup'></a>

## Quiz: Using Beautiful Soup

1. Please note that the function 'make_request' is provided for your reference only.
2. You will not be able to actually use it from within the Udacity web UI.
3. Your task is to process the HTML using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), extract the hidden form field values for `__EVENTVALIDATION` and `__VIEWSTATE`, and set the appropriate values in the data dictionary.

4. All your changes should be in the 'extract_data' function
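
As a dependency-free cross-check of the same idea, the sketch below pulls hidden form fields out of a page with the standard library's `html.parser` instead of BeautifulSoup. The `HiddenFieldParser` class and the shortened `sample_html` page are hypothetical and are not part of the quiz files.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the value of every <input type="hidden"> tag, keyed by its id."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "id" in attrs:
            self.fields[attrs["id"]] = attrs.get("value", "")


# Hypothetical, shortened page; the real field values are much longer.
sample_html = """
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTI" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWjAkCoIj1ng0" />
  <input type="submit" name="Submit" value="Submit" />
</form>
"""

parser = HiddenFieldParser()
parser.feed(sample_html)
data = {"eventvalidation": parser.fields["__EVENTVALIDATION"],
        "viewstate": parser.fields["__VIEWSTATE"]}
print(data)
```

The BeautifulSoup solutions below do the same lookup with `soup.find(id=...)`, which is shorter and more tolerant of malformed HTML.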

In [None]:
from bs4 import BeautifulSoup
import requests
import json

html_page = "page_source.html"

## My Solution

In [None]:
# Use a context manager so the file handle is closed after parsing
with open(html_page) as html:
    soup = BeautifulSoup(html, 'html.parser')

In [None]:
def is_form_field(tag):
    # Match tags that carry all four attributes of a form input field
    return tag.has_attr('type') and tag.has_attr('name') and tag.has_attr('id') and tag.has_attr('value')

soup.find_all(is_form_field)

In [None]:
# Avoid naming the variable "list", which would shadow the built-in type
field_ids = ['__EVENTVALIDATION', '__VIEWSTATE']
for item in field_ids:
    my_tag = soup.find(id=item)
    print(my_tag['value'])
    print('\n' * 6)

In [None]:
def extract_data(page):
    data = {"eventvalidation": "",
            "viewstate": ""}
    
    with open(page) as html:
        soup = BeautifulSoup(html, 'html.parser')

    # Map each hidden field id to the key it fills in the result
    fields = {'__EVENTVALIDATION': 'eventvalidation',
              '__VIEWSTATE': 'viewstate'}
    for field_id, key in fields.items():
        data[key] = soup.find(id=field_id)['value']
    
    return data

extract_data(html_page)

## Instructor Solution

In [None]:
def extract_data(page):
    data = {"eventvalidation": "",
            "viewstate": ""}
    with open(page, "r") as html:
        soup = BeautifulSoup(html, "lxml")
        ev = soup.find(id="__EVENTVALIDATION")
        data["eventvalidation"] = ev["value"]

        vs = soup.find(id="__VIEWSTATE")
        data["viewstate"] = vs["value"]

    return data

In [None]:
def make_request(data):
    eventvalidation = data["eventvalidation"]
    viewstate = data["viewstate"]

    r = requests.post("http://www.transtats.bts.gov/Data_Elements.aspx?Data=2",
                    data={'AirportList': "BOS",
                          'CarrierList': "VX",
                          'Submit': 'Submit',
                          "__EVENTTARGET": "",
                          "__EVENTARGUMENT": "",
                          "__EVENTVALIDATION": eventvalidation,
                          "__VIEWSTATE": viewstate
                    })

    return r.text

## Test

In [None]:
def test():
    data = extract_data(html_page)
    assert data["eventvalidation"] != ""
    assert data["eventvalidation"].startswith("/wEWjAkCoIj1ng0")
    assert data["viewstate"].startswith("/wEPDwUKLTI")
    
test()

[Back to top](#Top of document)
<a id='quiz_carrier_list'></a>

## Quiz: Carrier List

1. Your task in this exercise is to modify 'extract_carriers()' to get a list of all airlines.
2. Exclude all of the combination values like 'All U.S. Carriers' from the data that you return.
3. You should return a list of codes for the carriers.
4. All your changes should be in the 'extract_carriers()' function.
5. The 'options.html' file in the tab above is a stripped down version of what is actually on the website, but should provide an example of what you should get from the full file.
6. Please note that the function 'make_request()' is provided for your reference only. You will not be able to actually use it from within the Udacity web UI.

All data is from:
[TranStats website](https://www.transtats.bts.gov/Data_Elements.aspx?Data=2)

## My Solution

In [None]:
from bs4 import BeautifulSoup
html_page = "options.html"


def extract_carriers(page):
    data = []

    with open(page, "r") as html:
        soup = BeautifulSoup(html, "lxml")
        cl = soup.find(id='CarrierList')

        tag = cl.find_all('option')
        for value in tag:
            # Skip the combination entries
            if value['value'] not in ('All', 'AllUS', 'AllForeign'):
                data.append(value['value'])

    return data

In [None]:
extract_carriers(html_page)

In [None]:
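def make_request(data):
    # Reference only. Assumes `requests` is imported (see the previous quiz)
    # and that `data` also carries a "viewstategenerator" entry, which the
    # quizzes above do not extract.
    eventvalidation = data["eventvalidation"]
    viewstate = data["viewstate"]
    viewstategenerator = data["viewstategenerator"]
    airport = data["airport"]
    carrier = data["carrier"]

    s = requests.Session()
    r = s.post("https://www.transtats.bts.gov/Data_Elements.aspx?Data=2",
               data = (("__EVENTTARGET", ""),
                       ("__EVENTARGUMENT", ""),
                       ("__VIEWSTATE", viewstate),
                       ("__VIEWSTATEGENERATOR",viewstategenerator),
                       ("__EVENTVALIDATION", eventvalidation),
                       ("CarrierList", carrier),
                       ("AirportList", airport),
                       ("Submit", "Submit")))

    return r.text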

## Test

In [None]:
def test():
    data = extract_carriers(html_page)
    assert len(data) == 16
    assert "FL" in data
    assert "NK" in data

In [None]:
test()

[Back to top](#Top of document)
<a id='quiz_airport_list'></a>

## Quiz: Airport List

1. Complete the 'extract_airports()' function so that it returns a list of airport codes, excluding any combinations like "All".
2. Refer to the 'options.html' file in the tab above for a stripped down version of what is actually on the website. The test() assertions are based on the given file.

In [None]:
from bs4 import BeautifulSoup
html_page = "options.html"

## My Solution

In [None]:
def extract_airports(page):
    data = []
    with open(page, "r") as html:
        soup = BeautifulSoup(html, "lxml")
        
        al = soup.find(id='AirportList')

        tag = al.find_all('option')
        for value in tag:
            # Skip the combination entries
            if value['value'] not in ('All', 'AllMajors', 'AllOthers'):
                data.append(value['value'])

    return data

In [None]:
extract_airports(html_page)

## Test

In [None]:
def test():
    data = extract_airports(html_page)
    assert len(data) == 15
    assert "ATL" in data
    assert "ABR" in data

In [None]:
test()

[Back to top](#Top of document)
<a id='quiz_processing_all'></a>

## Quiz: Processing All

Let's assume that you combined the code from the previous 2 exercises with code
from the lesson on how to build requests, and downloaded all the data locally.
The files are in a directory "data", named after the carrier and airport:
"{}-{}.html".format(carrier, airport), for example "FL-ATL.html".

The table with flight info has class="dataTDRight". Your task is to
use 'process_file()' to extract the flight data from that table as a list of
dictionaries, each dictionary containing relevant data from the file and table
row. This is an example of the data structure you should return:

```python
data = [{"courier": "FL",
         "airport": "ATL",
         "year": 2012,
         "month": 12,
         "flights": {"domestic": 100,
                     "international": 100}
        },
        {"courier": "..."}
]
```

Note - year, month, and the flight data should be integers.
You should skip the rows that contain the TOTAL data for a year.

There are a couple of helper functions to deal with the data files.
Please do not change them for grading purposes.
All your changes should be in the 'process_file()' function.

The 'data/FL-ATL.html' file in the tab above is only a part of the full data,
covering data through 2003. The test() code will be run on the full table, but
the given file should provide an example of what you will get.
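
Two details of this task are easy to get wrong: the counts arrive as strings such as "1,077" or as a non-breaking space, and every row needs its own dictionary. The sketch below illustrates both on hypothetical rows; `parse_count` and the sample values are assumptions made for illustration, not part of the provided helpers.

```python
def parse_count(text):
    """Convert a table cell such as '1,077' to an int; blank cells count as 0."""
    cleaned = text.replace(",", "").replace("\xa0", "").strip()
    return int(cleaned) if cleaned else 0


# Hypothetical rows: (year, month, domestic, international)
rows = [("2002", "10", "815,489", "92,565"),
        ("2002", "11", "766,775", "\xa0")]

shared = {"courier": "FL", "airport": "ATL"}
aliased, independent = [], []
for year, month, domestic, international in rows:
    fields = {"year": int(year), "month": int(month),
              "flights": {"domestic": parse_count(domestic),
                          "international": parse_count(international)}}
    shared.update(fields)
    aliased.append(shared)  # the same object is appended every time
    independent.append({"courier": "FL", "airport": "ATL", **fields})

print([entry["month"] for entry in aliased])      # [11, 11]
print([entry["month"] for entry in independent])  # [10, 11]
```

Appending the shared dictionary leaves every list element pointing at the last row, which is why the solutions need to build a fresh dictionary per row.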

In [None]:
from bs4 import BeautifulSoup
from zipfile import ZipFile
import os

html_page = "AA-ATL.html"
#html_page = "FL-ATL.html"

datadir = "data"


def open_zip(datadir):
    with ZipFile('{0}.zip'.format(datadir), 'r') as myzip:
        myzip.extractall()


def process_all(datadir):
    files = os.listdir(datadir)
    return files

## My Solution 1

In [None]:
def process_file(f):
    '''
    This function extracts data from the file given as the function argument in
    a list of dictionaries. This is example of the data structure you should
    return:

    data = [{"courier": "FL",
             "airport": "ATL",
             "year": 2012,
             "month": 12,
             "flights": {"domestic": 100,
                         "international": 100}
            },
            {"courier": "..."}
    ]

    Note - year, month, and the flight data should be integers.
    You should skip the rows that contain the TOTAL data for a year.
    '''
    data = []
    info = {}
    #info["courier"], info["airport"] = f[:6].split("-")
    # Note: create a new dictionary for each entry in the output data list.
    # If you use the info dictionary defined here each element in the list 
    # will be a reference to the same info dictionary.
    with open("{}/{}".format(datadir, f), "r") as html:

        soup = BeautifulSoup(html, 'lxml')
        
        # Extract the carrier and airport from within the file
        def has_value_selected(tag):
            return tag.has_attr('value') and tag.has_attr('selected')
        
        cl = soup.find(id='CarrierList')
        cl_value = cl.find(has_value_selected)['value']
        
        al = soup.find(id='AirportList')
        al_value = al.find(has_value_selected)['value']
        
        # Ignore values = All, AllMajors and AllOthers
        if al_value != 'All' and al_value != 'AllMajors' and al_value != 'AllOthers':

            # Parse the flight information table
            for row in soup.select('table.dataTDRight tr')[1:]:
                aux = row.findAll('td')
                if aux[1].string != 'TOTAL':

                    aux_2 = int((aux[2].string).replace(',', ''))
                    aux_3 = int(((aux[3].string).replace(u'\xa0', u'0')).replace(',', ''))

                    data.append({'courier': cl_value,
                                 'airport': al_value,
                                 'year': int(aux[0].string), 'month': int(aux[1].string),
                                 'flights': {'domestic': aux_2,
                                             'international': aux_3}
                                 })

    return data

In [None]:
process_file(html_page)

## My Solution 2

In [None]:
def process_file(f):

    data = []
    info = {}
    
    # Use info dict to supply the courier and airport code
    info["courier"], info["airport"] = f[:6].split("-")

    with open("{}/{}".format(datadir, f), "r") as html:

        soup = BeautifulSoup(html, 'lxml')
        
        # Parse the flight data table
        for row in soup.select('table.dataTDRight tr')[1:]:
            aux = row.findAll('td')
            if aux[1].string != 'TOTAL':

                aux_2 = int((aux[2].string).replace(',', ''))
                aux_3 = int(((aux[3].string).replace(u'\xa0', u'0')).replace(',', ''))

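                # Build a new dict per row; appending the shared info dict
                # would make every list element reference the same object.
                entry = dict(info)
                entry.update({'year': int(aux[0].string), 'month': int(aux[1].string),
                              'flights': {'domestic': aux_2, 'international': aux_3}
                              })

                data.append(entry)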

    return data

In [None]:
process_file(html_page)

## Instructor Solution

In [None]:
def process_file(f):

    data = []
    info = {}
    info["courier"], info["airport"] = f[:6].split("-")

    with open("{}/{}".format(datadir, f), "r") as html:
        soup = BeautifulSoup(html, 'lxml')
    
        table_row = soup.find_all("tr", class_ = 'dataTDRight')
        table_data = []
        for row in table_row:
            table_col = []
            for col in row.find_all("td"):
                table_col.append(col.get_text())
            table_data.append(table_col)    
        
        for cell in table_data:
            if cell[1] == "TOTAL":
                continue
            else: 
                
                infoN ={}
                flight = {}
        
                infoN['year'] = int(cell[0])
                infoN['month'] = int(cell[1])
                flight['domestic'] = int(cell[2].replace(",",""))
                flight['international'] = int(cell[3].replace(",","")) 
                infoN['flights'] = flight 
                
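                # Copy the shared courier/airport info into the new row dict
                # so each list element is an independent dictionary.
                infoN.update(info)
                data.append(infoN)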
    
    return data

## Test

In [None]:
def test():
    print('Running a simple test...')
    #open_zip(datadir)
    data = process_file(html_page)

    assert len(data) == 170  # Total number of rows
    for entry in data[:3]:
        assert type(entry["year"]) == int
        assert type(entry["month"]) == int
        assert type(entry["flights"]["domestic"]) == int
        assert len(entry["airport"]) == 3
        assert len(entry["courier"]) == 2
    assert data[0]["courier"] == 'AA'
    assert data[0]["month"] == 10
    assert data[-1]["airport"] == "ATL"
    assert data[-1]["flights"] == {'international': 0, 'domestic': 1077}
    
    print('... success!')

In [None]:
test()

[Back to top](#Top of document)
<a id='quiz_patent_database'></a>

## Quiz: Patent Database

This exercise and the next one use the US Patent database. The patent.data
file is a small excerpt of the much larger datafiles that are available for
download from the US Patent website. These files are large (>100 MB each).
The original file is about 600 MB, so you might not be able to open it in a
text editor.

The data itself is XML, but there is a problem with how it is formatted.
Please run this script and observe the error. Then find the line that is
causing the error. You can do that by just looking at the datafile in the web
UI, or programmatically. For quiz purposes it does not matter, but as an
exercise we suggest that you try to do it programmatically.

NOTE: You do not need to correct the error - for now, just find where the error
is occurring.
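
One way to locate the error programmatically is to catch the parser's exception and read its `position` attribute, which `xml.etree.ElementTree.ParseError` exposes as a `(line, column)` tuple. The sketch below uses a small hypothetical stand-in for patent.data; `find_error_position` is an illustrative helper, not part of the quiz code.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for patent.data: two XML documents in one stream.
sample = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
          b'<patent id="1"/>\n'
          b'<?xml version="1.0" encoding="UTF-8"?>\n'
          b'<patent id="2"/>\n')


def find_error_position(xml_bytes):
    """Return (line, column) of the first parse error, or None if well-formed."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        return err.position
    return None


position = find_error_position(sample)
print(position)
```

For a file on disk, the same `try`/`except` can wrap `ET.parse(fname)`.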

In [None]:
import xml.etree.ElementTree as ET

PATENTS = 'patent.data'

def get_root(fname):

    tree = ET.parse(fname)
    print(tree)
    return tree.getroot()

In [None]:
get_root(PATENTS)

[Back to top](#Top of document)
<a id='quiz_result_parsing_datafile'></a>

## Quiz: Result of Parsing the Datafile

There are multiple XML trees in the same file.

[Back to top](#Top of document)
<a id='quiz_processing_patents'></a>

## Quiz: Processing Patents

1. The problem is that the gigantic file is not actually valid XML:
    * It has several root elements
    * It also has several XML declarations
2. It is a collection of many concatenated XML documents.
3. One solution is to split the file into separate documents, so you can process the resulting files as valid XML documents.
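
The splitting idea can be tried in memory before writing any files. This is a minimal sketch on a hypothetical two-document sample; `split_documents` is an illustrative name, and it assumes each declaration sits on its own line, as in patent.data.

```python
import xml.etree.ElementTree as ET

DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'


def split_documents(text):
    """Split concatenated XML text into one string per document."""
    documents = []
    for line in text.splitlines(keepends=True):
        if line.strip() == DECLARATION:
            documents.append([])   # a declaration starts a new document
        if documents:              # ignore anything before the first declaration
            documents[-1].append(line)
    return ["".join(lines) for lines in documents]


sample = (DECLARATION + '\n<patent id="1"/>\n' +
          DECLARATION + '\n<patent id="2"/>\n')

docs = split_documents(sample)
# Encode to bytes so the parser reads the encoding from the declaration.
ids = [ET.fromstring(doc.encode("utf-8")).get("id") for doc in docs]
print(ids)
```

Each piece now parses on its own, which is what the file-based solution below relies on.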

In [None]:
import xml.etree.ElementTree as ET
PATENTS = 'patent.data'

def get_root(fname):
    tree = ET.parse(fname)
    return tree.getroot()

## My Solution

In [None]:
def split_file(filename):
    """
    Split the input file into separate files, each containing a single patent.
    As a hint - each patent declaration starts with the same line that was
    causing the error found in the previous exercises.
    
    The new files should be saved with filename in the following format:
    "{}-{}".format(filename, n) where n is a counter, starting from 0.
    """
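    expected_first_line = '<?xml version="1.0" encoding="UTF-8"?>'

    n = 0
    outfile = None
    with open(filename, 'r') as infile:
        for line in infile:
            # A declaration marks the start of the next patent document
            if line.strip() == expected_first_line:
                if outfile:
                    outfile.close()
                outfile = open('{}-{}'.format(filename, n), 'w')
                n += 1

            # Write the line unchanged so the original line breaks survive
            if outfile:
                outfile.write(line)

    if outfile:
        outfile.close()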

In [None]:
split_file(PATENTS)

## Test

In [None]:
def test():
    split_file(PATENTS)
    for n in range(4):
        fname = "{}-{}".format(PATENTS, n)
        try:
            with open(fname, "r") as f:
                if not f.readline().startswith("<?xml"):
                    print('You have not split the file {} in the correct boundary!'.format(fname))
        except IOError:
            print('Could not find file {}. Check if the filename is correct!'.format(fname))

In [None]:
test()

[Back to top](#Top of document)
<a id='data_quality'></a>

## Data Quality

[Back to top](#Top of document)
<a id='quiz_correcting_validity'></a>

## Quiz: Correcting Validity

Your task is to check the "productionStartYear" of the DBPedia autos datafile for valid values.
The following things should be done:
1. Check if the field "productionStartYear" contains a year
2. Check if the year is in range 1886-2014
3. Convert the value of the field to be just a year (not full datetime)
4. The rest of the fields and values should stay the same
5. If the value of the field is a valid year in the range as described above, write that line to the output_good file
6. If the value of the field is not a valid year as described above,  write that line to the output_bad file
7. Discard rows (neither write to good nor bad) if the URI is not from dbpedia.org
8. You should use the provided way of reading and writing data (DictReader and DictWriter), they will take care of dealing with the header.

You can write helper functions for checking the data and writing the files, but we will call only the 
'process_file' with 3 arguments (inputfile, output_good, output_bad).
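
The year check in steps 1 to 3 can be isolated in a small helper before wiring it into the file processing. This is a sketch under the rules listed above; `extract_year` and the sample values are hypothetical.

```python
def extract_year(value, first=1886, last=2014):
    """Return the leading four characters as an int year if in range, else None."""
    candidate = value[:4]
    if len(candidate) != 4:
        return None
    try:
        year = int(candidate)
    except ValueError:  # e.g. 'NULL' or other non-numeric text
        return None
    return year if first <= year <= last else None


# Hypothetical field values
for raw in ["1989-01-01T00:00:00+02:00", "NULL", "2089-01-01T00:00:00+02:00", "19"]:
    print(raw, "->", extract_year(raw))
```

A row whose value maps to `None` belongs in the bad file; any other row goes to the good file with the field replaced by the returned year.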

## My Solution

In [None]:
import csv
import pprint

INPUT_FILE = 'autos.csv'
OUTPUT_GOOD = 'autos-valid.csv'
OUTPUT_BAD = 'FIXME-autos.csv'

def process_file(input_file, output_good, output_bad):
    
    with open(input_file, "r") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        
        auto_good = []
        auto_bad = []
        
        uri_match = 'http://dbpedia.org/resource'
    
        for row in reader:
                        
            if row['URI'].startswith(uri_match):

                row['productionStartYear'] = row['productionStartYear'][0:4].strip()
                prod_year = row['productionStartYear']

                # 'NULL' and any other non-numeric value count as bad rows
                try:
                    valid = 1886 <= int(prod_year) <= 2014
                except ValueError:
                    valid = False

                if valid:
                    print(prod_year, ' Good')
                    auto_good.append(row)
                else:
                    print(prod_year, ' Bad')
                    auto_bad.append(row)

    # This is just an example on how you can use csv.DictWriter
    # Remember that you have to output 2 files
    with open(output_good, "w", newline='') as g: # newline for python3 (no extra lines)
        writer = csv.DictWriter(g, fieldnames= header)
        writer.writeheader()
        for row in auto_good:
            writer.writerow(row)
            
    with open(output_bad, "w", newline='') as b: # newline for python3 (no extra lines)
        writer = csv.DictWriter(b, fieldnames= header)
        writer.writeheader()
        for row in auto_bad:
            writer.writerow(row)

In [None]:
process_file(INPUT_FILE, OUTPUT_GOOD, OUTPUT_BAD)

## Instructor Solution

In [None]:
def process_file(input_file, output_good, output_bad):
    # store data into lists for output
    data_good = []
    data_bad = []
    with open(input_file, "r") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        for row in reader:
            # validate URI value
            if row['URI'].find("dbpedia.org") < 0:
                continue

            ps_year = row['productionStartYear'][:4]
            try: # use try/except to filter valid items
                ps_year = int(ps_year)
                row['productionStartYear'] = ps_year
                if (ps_year >= 1886) and (ps_year <= 2014):
                    data_good.append(row)
                else:
                    data_bad.append(row)
            except ValueError: # non-numeric strings caught by exception
                if ps_year == 'NULL':
                    data_bad.append(row)

    # Write processed data to output files
    with open(output_good, "w") as good:
        writer = csv.DictWriter(good, delimiter=",", fieldnames= header)
        writer.writeheader()
        for row in data_good:
            writer.writerow(row)

    with open(output_bad, "w") as bad:
        writer = csv.DictWriter(bad, delimiter=",", fieldnames= header)
        writer.writeheader()
        for row in data_bad:
            writer.writerow(row)

## Test

In [None]:
def test():

    process_file(INPUT_FILE, OUTPUT_GOOD, OUTPUT_BAD)

In [None]:
test()

[Back to top](#Top of document)
<a id='quiz_auditing_data_quality'></a>

## Quiz: Auditing Data Quality

In this problem set, you work with cities infobox data, audit it, come up with a
cleaning idea and then clean it up. In the first exercise we want you to audit
the datatypes that can be found in some particular fields in the dataset.
The possible types of values can be:

- NoneType if the value is a string "NULL" or an empty string ""
- list, if the value starts with "{"
- int, if the value can be cast to int
- float, if the value can be cast to float, but CANNOT be cast to int.
   For example, '3.23e+07' should be considered a float because it can be cast
   as float but int('3.23e+07') will throw a ValueError
- 'str', for all other values

The audit_file function should return a dictionary containing fieldnames and a 
SET of the types that can be found in the field. e.g.
{"field1": set([type(float()), type(int()), type(str())]),
 "field2": set([type(str())]),
  ....
}
The type() function returns a type object describing the argument given to the 
function. You can also use examples of objects to create type objects, e.g.
type(1.1) for a float: see the test function below for examples.

Note that the first three rows (after the header row) in the cities.csv file
are not actual data points. The contents of these rows should not be included
when processing data types. Be sure to include functionality in your code to
skip over or detect these rows.
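
The classification rules above can be written as a single function and tried on a few values first. This is a minimal sketch; `audit_type` and the sample strings are hypothetical. It also shows a pitfall worth knowing: `('NULL' or '')` evaluates to `'NULL'`, so comparing against it never matches the empty string.

```python
def audit_type(value):
    """Classify a raw CSV string using the rules listed above."""
    if value in ("NULL", ""):
        return type(None)
    if value.startswith("{"):
        return list
    try:
        int(value)
        return int
    except ValueError:
        pass
    try:
        float(value)
        return float
    except ValueError:
        return str


samples = ["NULL", "", "{1.0|2.0}", "42", "3.23e+07", "Kumhari"]
names = [audit_type(s).__name__ for s in samples]
print(names)

# Pitfall: the right-hand side is just 'NULL', so this prints False.
print("" == ("NULL" or ""))
```

Testing membership with `value in ("NULL", "")` covers both cases.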

In [None]:
import codecs
import csv
import json
import pprint

cities = 'cities.csv'
cities_full = 'cities_full.csv'

FIELDS = ["name", "timeZone_label", "utcOffset", "homepage", "governmentType_label",
          "isPartOf_label", "areaCode", "populationTotal", "elevation",
          "maximumElevation", "minimumElevation", "populationDensity",
          "wgs84_pos#lat", "wgs84_pos#long", "areaLand", "areaMetro", "areaUrban"]


## My Solution

In [None]:
def is_list(s):
    # startswith() is safe on empty strings, unlike s[0]
    return s.startswith('{')

def is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

def is_float(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

In [None]:
def audit_file(filename, fields):
    fieldtypes = {}
    
    for field in fields:
        fieldtypes[field] = set()

    with open(filename, 'r', encoding='utf8') as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        for row in reader:
            # validate URI value
            if row['URI'].find("dbpedia.org") < 0:
                continue

            for field in fields:

                # ('NULL' or '') evaluates to 'NULL', so test membership instead
                if row[field] in ('NULL', ''):
                    fieldtypes[field].add(type(None))
                elif is_list(row[field]) is True:
                    fieldtypes[field].add(type([]))
                elif is_int(row[field]) is True:
                    fieldtypes[field].add(type(1))
                elif is_float(row[field]) is True:
                    fieldtypes[field].add(type(1.1))
                else:
                    fieldtypes[field].add(type(''))


    return fieldtypes

## Test

In [None]:
def test():
    fieldtypes = audit_file(cities, FIELDS)

    pprint.pprint(fieldtypes)

    assert fieldtypes["areaLand"] == set([type(1.1), type([]), type(None)])
    assert fieldtypes['areaMetro'] == set([type(1.1), type(None)])

test()

[Back to top](#Top of document)
<a id='quiz_fixing_area'></a>

## Quiz: Fixing the Area

In this problem set you work with cities infobox data, audit it, come up with a
cleaning idea and then clean it up.

Since in the previous quiz you made a decision on which value to keep for the
"areaLand" field, you now know what has to be done.

Finish the function fix_area(). It will receive a string as an input, and it
has to return a float representing the value of the area or None.
You have to change the function fix_area. You can use extra functions if you
like, but changes to process_file will not be taken into account.
The rest of the code is just an example on how this function can be used.
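
One possible cleaning rule, and the one the solution below follows, is to keep the candidate written with more significant digits. The sketch below shows that rule in isolation; `significant_digits`, `pick_most_precise`, and the sample value are hypothetical names and data.

```python
def significant_digits(number_text):
    """Length of the mantissa text once trailing zeros are removed."""
    mantissa = number_text.lower().split("e")[0]
    return len(mantissa.rstrip("0"))


def pick_most_precise(raw):
    """From '{a|b|...}' return the candidate with the longest mantissa as a float."""
    candidates = raw.strip("{}").split("|")
    # max() keeps the first candidate when two are equally long
    return float(max(candidates, key=significant_digits))


sample = "{1.019e+08|1.01787e+08}"  # hypothetical areaLand value
print(pick_most_precise(sample))
```

Counting mantissa characters is only a rough proxy for precision, but it is enough to choose between two spellings of the same measurement.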

In [None]:
import codecs
import csv
import json
import pprint

CITIES = 'cities.csv'

## My Solution

In [None]:
def fix_area(areas):

    def is_list(s):
        # startswith() is safe on empty strings, unlike s[0]
        return s.startswith('{')
        
    def is_float(s):
        try:
            float(s)
            return True
        except ValueError:
            return False
    
    values = []
    
    clean_areas =  str.maketrans('', '', '}{')
    
    # ('NULL' or '') evaluates to 'NULL', so test membership instead
    if areas in ('NULL', ''):
        return None
        
    elif is_float(areas) is True:
        return float(areas)
        
    elif is_list(areas) is True:
        # Drop the braces, then split the candidates on '|'
        areas = areas.translate(clean_areas).split('|')
                
        for area in areas:
            area_len = len(((area.lower()).split('e')[0]).rstrip('0'))
            values.append(area_len)
            
        if values[0] >= values[1]:
            return float(areas[0])
        else:
            return float(areas[1])
            

In [None]:
def process_file(filename):
    # CHANGES TO THIS FUNCTION WILL BE IGNORED WHEN YOU SUBMIT THE EXERCISE
    data = []

    with open(filename, "r") as f:
        reader = csv.DictReader(f)

        #skipping the extra metadata
        for i in range(3):
            l = next(reader)

        # processing file
        for line in reader:
            # calling your function to fix the area value
            if "areaLand" in line:
                print('Old: ', line['areaLand'])
                line["areaLand"] = fix_area(line["areaLand"])
                print('New: ', line['areaLand'], '\n')
            data.append(line)

    return data

## Test

In [None]:
def test():
    data = process_file(CITIES)

    print('Printing three example results:')
    for n in range(5,8):
        pprint.pprint(data[n]["areaLand"])

    assert data[3]["areaLand"] == None        
    assert data[8]["areaLand"] == 55166700.0
    assert data[20]["areaLand"] == 14581600.0
    assert data[33]["areaLand"] == 20564500.0
    
test()

[Back to top](#Top of document)
<a id='quiz_fixing_name'></a>

## Quiz: Fixing Name

In this problem set you work with cities infobox data, audit it, come up with a
cleaning idea and then clean it up.

In the previous quiz you recognized that the "name" value can be an array (or
list in Python terms). It would make it easier to process and query the data
later if all values for the name are in a Python list, instead of being
just a string separated with special characters, like now.

Finish the function fix_name(). It will receive a string as an input, and it
will return a list of all the names. If there is only one name, the list will
have only one item in it; if the name is "NULL", the list should be empty.
The rest of the code is just an example on how this function can be used.
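
The three cases described above (a brace-delimited array, a single name, and "NULL") can be sketched compactly with `str.strip` and `str.split`. The function name `split_names` and the sample inputs are hypothetical.

```python
def split_names(raw):
    """Turn '{A|B}' into ['A', 'B'], a plain name into [name], and 'NULL' or '' into []."""
    if raw in ("NULL", ""):
        return []
    if raw.startswith("{"):
        return raw.strip("{}").split("|")
    return [raw]


for raw in ["{Negtemiut|Nightmute}", "Kumhari", "NULL"]:
    print(repr(raw), "->", split_names(raw))
```

Always returning a list means later code can iterate over names without first checking the type.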

In [None]:
import codecs
import csv
import pprint

CITIES = 'cities.csv'

## My Solution

In [None]:
# from string import maketrans  # use if python 2.7

def clean_string(dirty_string):
    string_strip = str.maketrans('', '', '}{')
    #string_strip =  maketrans('}{', '  ')  # use if python 2.7
    
    # Drop the braces, trim whitespace, then split the names on '|'
    dirty_string = dirty_string.translate(string_strip).strip().split('|')
    
    return dirty_string


def is_list(s):
    # startswith() is safe on empty strings, unlike s[0]
    return s.startswith('{')

In [None]:
sample = '{Negtemiut|Nightmute}'
if is_list(sample) is True:
    print(clean_string(sample))
else:
    print('Not a list')

In [None]:
def fix_name(names):

    # ('NULL' or '') evaluates to 'NULL', so test membership instead
    if names in ('NULL', ''):
        return []
    
    elif is_list(names) is True:
        return clean_string(names)
        
    else:
        return [names]
    

In [None]:
def process_file(filename):
    data = []
    with open(filename, "r") as f:
        reader = csv.DictReader(f)
        #skipping the extra metadata
        for i in range(3):
            l = next(reader)
        # processing file
        for line in reader:
            # calling your function to fix the name value
            if "name" in line:
                #print('Old: ', line['name'])
                line["name"] = fix_name(line["name"])
                #print('New: ', line['name'], '\n')
            data.append(line)
    return data

## Test

In [None]:
def test():
    data = process_file(CITIES)

    print('Printing 20 results:')
    for n in range(20):
        pprint.pprint(data[n]["name"])

    assert data[14]["name"] == ['Negtemiut', 'Nightmute']
    assert data[9]["name"] == ['Pell City Alabama']
    assert data[3]["name"] == ['Kumhari']
    
test()

[Back to top](#Top of document)
<a id='quiz_crossfield_auditing'></a>

## Quiz: Crossfield Auditing

In this problem set you work with cities infobox data, audit it, come up with a
cleaning idea and then clean it up.

If you look at the full city data, you will notice that there are a couple of
values that seem to provide the same information in different formats: "point"
seems to be the combination of "wgs84_pos#lat" and "wgs84_pos#long". However,
we do not know if that is the case and should check if they are equivalent.

Finish the function check_loc(). It will receive 3 strings: first, the combined
value of "point" followed by the separate "wgs84_pos#" values. You have to
extract the lat and long values from the "point" argument and compare them to
the "wgs84_pos#" values, returning True or False.

Note that you do not have to fix the values, only determine if they are
consistent. To fix them in this case you would need more information. Feel free
to discuss possible strategies for fixing this on the discussion forum.

The rest of the code is just an example on how this function can be used.
Changes to "process_file" function will not be taken into account for grading.
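
The quiz compares the values as strings, which treats "33.080" and "33.08" as different. As an alternative sketch, and not the graded approach, the comparison can be done numerically with `math.isclose`; `same_location` is a hypothetical name.

```python
import math


def same_location(point, lat, longi):
    """Compare the 'lat long' pair in point with the separate values numerically."""
    try:
        point_lat, point_long = (float(part) for part in point.split())
        return (math.isclose(point_lat, float(lat)) and
                math.isclose(point_long, float(longi)))
    except ValueError:
        return False  # missing, extra, or non-numeric parts


print(same_location("33.080 75.28", "33.08", "75.28"))
print(same_location("44.57833333333333 -91.21833333333333", "44.5783", "-91.2183"))
```

With the default tolerance, values that differ only by rounding to four decimals still count as inconsistent, matching the expectation in the test below.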

In [None]:
import csv
import pprint

CITIES = 'cities.csv'

## My Solution

In [None]:
def check_loc(point, lat, longi):
    parts = point.split()

    # Compare as strings: exactly two parts, and both must match
    return len(parts) == 2 and parts[0] == lat and parts[1] == longi

In [None]:
def process_file(filename):
    data = []
    with open(filename, "r") as f:
        reader = csv.DictReader(f)
        #skipping the extra metadata
        for i in range(3):
            l = next(reader)
        # processing file
        for line in reader:
            # calling your function to check the location
            print('Old: {}: Point {} Lat {} Long {}'.format(line["name"], line["point"], line["wgs84_pos#lat"], line["wgs84_pos#long"]))
            result = check_loc(line["point"], line["wgs84_pos#lat"], line["wgs84_pos#long"])
            if not result:
                print('{}: {} != {} {}'.format(line["name"], line["point"], line["wgs84_pos#lat"], line["wgs84_pos#long"]))
            data.append(line)
            print('\n')

In [None]:
process_file(CITIES)

## Test

In [None]:
def test():
    assert check_loc("33.08 75.28", "33.08", "75.28") == True
    assert check_loc("44.57833333333333 -91.21833333333333", "44.5783", "-91.2183") == False
    
test()