# Data Ingestion and Wrangling Make-Up Assignment - September 2016

### Completed by Veronica Helms as part of the [Georgetown Data Science Certificate Program](http://scs.georgetown.edu/programs/375/data-science/) 

## Part One: Short Answer

### General

**1. During the ingestion phase, it is important to store the data in as close to its original form as possible.  No wrangling or changes should be made to data during this phase.  Why do you think this is an important point?**

In data science, it is best practice to keep ingested raw data in its "purest" form. Storing raw data unmanipulated is a core concept of data management and an industry standard for several reasons. First, many programmers and analysts may have access to raw data sources. When multiple people have access, mistakes can easily occur, especially during mundane, repetitive tasks, and these mistakes can unintentionally corrupt the data. Safeguarding raw data (i.e., allowing read but not write access to the original data) protects data integrity. Additionally, storing raw, unmanipulated data supports research integrity and replicability: the original data remains available for future access (e.g., study replication) and for double-coding processes.

### WORM Stores & Data Lakes

**2. The process of ingestion is taking data from its source to its resting place in a WORM store.  The slides make reference to it being a safe place to store data as well as a starting point for the Wrangling process.  Please list and explain three properties of a WORM store that make it a safe place.**

A write once, read many (WORM) store is a safe place to store data for three primary reasons:
1. First, data in a WORM store cannot be modified after it is written. Write protection prevents tampering once the data has landed in the store.
2. Second, a WORM store provides a single place to keep all relevant data sources for a project in raw, unmanipulated form. As mentioned above, this systematically protects data and research integrity.
3. Lastly, a WORM store prevents accidental or intentional erasure or alteration of data.
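
The write-once property can be approximated on an ordinary file system. The sketch below is only an illustration, not how a real WORM product works: the `write_once` helper and the file name are hypothetical, and read-only permissions can still be undone by the file's owner.

```python
import os
import stat
import tempfile


def write_once(path, data):
    """Write bytes to path exactly once, then mark the file read-only."""
    # Mode "xb" raises FileExistsError if the file already exists,
    # so an earlier copy can never be silently overwritten.
    with open(path, "xb") as f:
        f.write(data)
    # Remove write permission: read-only for owner, group, and others.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


raw_dir = tempfile.mkdtemp()
raw_path = os.path.join(raw_dir, "feed-2016-09-28.xml")  # hypothetical name
write_once(raw_path, b"<rss>original payload</rss>")

try:
    write_once(raw_path, b"<rss>tampered</rss>")
    overwritten = True
except FileExistsError:
    overwritten = False

print("second write succeeded:", overwritten)
```

The second call fails because the file already exists, which is the behavior that makes a raw store a trustworthy starting point for wrangling.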


**3. The slides refer to a Data Lake as the WORM store for an entire business or organization.  With this in mind, please read Martin Fowler's post concerning Data Lakes at http://martinfowler.com/bliki/DataLake.html.  Then choose what you think are his three most important points and summarize them.**

This question is really interesting because the government agency that I work for is currently developing a Data Lake. After reading Martin Fowler's post about Data Lakes, the following three points emerged as key points: 

1. I think Fowler's first point about differentiating Data Lakes from data warehouses is important because data warehouses have been around for a long time. By immediately identifying key differences, Fowler both highlights the main advantages of a Data Lake and educates the reader. In summary, perhaps the largest advantage of a Data Lake over traditional data warehousing is that a Data Lake stores raw data in whatever form or schema the source provides. This allows data users to work with data in more realistic ways, since storage isn't limited to a single unified data model. 

2. I also appreciated how Fowler stressed the importance of clearly documenting the historical source (i.e. time and place) of all data put into a Data Lake. Although Data Lakes represent chaos, managers must promote organized chaos to ensure a historical record. 

3. Lastly, Fowler closes his piece by discussing data privacy and security. As a data analyst who works with large quantities of personally identifiable information (PII), this is an ongoing concern. As my organization works to develop a Data Lake, restricting access to a small group will be crucial. Additionally, as mentioned by Fowler, guidelines regarding accountability must be established. 
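
One way to picture the provenance point is a small metadata record saved alongside each raw payload. The sketch below is an assumption about what such a record might contain; the `provenance_record` helper and its field names are hypothetical and are not taken from Fowler's post.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(source_url, payload, fetched_at=None):
    """Describe where and when a raw payload (bytes) was obtained."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    return {
        "source": source_url,                           # place
        "fetched_at": fetched_at.isoformat(),           # time
        "sha256": hashlib.sha256(payload).hexdigest(),  # detects later changes
        "size_bytes": len(payload),
    }


record = provenance_record(
    "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
    b"<rss>raw feed bytes</rss>",
)
print(json.dumps(record, indent=2))
```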

**4. HDFS is an excellent tool for use as a WORM store due to its distributed nature and feature set.  Please choose three features of HDFS that support this and explain how they are useful.**

The Hadoop Distributed File System (HDFS) is an excellent tool for use as a WORM store. Three key features of HDFS include:

1. *Scalability*: HDFS allows for multiple machines to act as a single file system. Additionally, the system breaks large files into pieces saved throughout the system - a process that ensures scalability and usability for big data. For example, the system has demonstrated capacity for holding 200 PB of data across a 4,500 machine cluster.
2. *Fault-Tolerance*: HDFS stores redundant copies of each block of data on different machines, so its distributed design provides a very high level of fault tolerance. The system also allows concurrent access, a key feature for companies with complex data management structures. Lastly, because HDFS behaves like a file system, it is fairly intuitive to use and administer.
3. *Cost-Effectiveness*: HDFS is integrated with analytical tools, which adds to the cost-effectiveness of adopting the system. Additionally, HDFS storage costs approximately one thousand dollars per terabyte, roughly 10% of the cost of other data management systems. 
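
The block splitting and redundancy described above can be made concrete with a little arithmetic. This is a rough sketch that assumes a 128 MB block size and a replication factor of 3, which are common HDFS defaults; real clusters may be configured differently, and `hdfs_footprint` is a hypothetical helper.

```python
import math

BLOCK_SIZE_MB = 128      # assumption: common HDFS default block size
REPLICATION_FACTOR = 3   # assumption: common HDFS default replication


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Estimate how a single file is spread across a cluster."""
    blocks = math.ceil(file_size_mb / block_size_mb)  # pieces the file is cut into
    replicas = blocks * replication                   # block copies stored in total
    raw_storage_mb = file_size_mb * replication       # disk space actually consumed
    return blocks, replicas, raw_storage_mb


blocks, replicas, raw_mb = hdfs_footprint(1024)  # a 1 GB file
print(blocks, replicas, raw_mb)
```

Under these assumptions, a 1 GB file becomes 8 blocks stored as 24 block copies, so losing any single machine still leaves two copies of every block.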

### Common Data Formats

**5. Explain why CSVs are best for tabular data.  Then describe a means to store non-tabular data (such as hierarchical data) in a CSV format.**

The Character Separated Values (CSV) format is best for tabular data for many reasons. Foremost, its structure mirrors a table: each line is a row and each delimited field is a column. CSVs are also widely supported and understood. Dating back to the 1960s, the simple syntax of CSVs allows for easy human readability. Additionally, CSVs are well supported by desktop and server applications. For these reasons, CSVs are best for tabular data. 

A variety of Python packages (e.g., csv and unicodecsv) can be used to write non-tabular data in a CSV format. For example, BeautifulSoup can parse an HTML document so that the extracted values can be written out as CSV rows.
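
One common approach for hierarchical data, sketched here as an assumption rather than as the method from the course slides, is to flatten nested keys into dotted column names before writing. The `flatten` helper and the sample record are hypothetical.

```python
import csv
import io


def flatten(record, parent_key="", sep="."):
    """Turn nested dictionaries into one flat dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))  # recurse into children
        else:
            flat[full_key] = value
    return flat


physician = {
    "name": "Jane Doe",
    "practice": {"city": "Washington", "address": {"zip": "20001"}},
}
row = flatten(physician)

# Write the flattened record as a one-row CSV held in memory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=sorted(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

The hierarchy survives in the column names (`practice.address.zip`), so the nested structure can be rebuilt later by splitting on the separator.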

**6. Provide two benefits for using JSON as a data format and then explain how these benefits have contributed to its quick adoption throughout the tech community.**

Although newer than other formats, JSON has quickly been adopted throughout the tech community. Two benefits of using JSON include:
1. *Conciseness*: JSON uses data objects and only supports limited data types: string, number, object, array, true, false, and null. Additionally, with roots in JavaScript, many techies already have rudimentary knowledge of the JSON format. 
2. *Human Readability*: Due to its simplified syntax, JSON is user friendly and easy to learn. Additionally, there is an official standard available. 
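
As a small illustration of both points, the standard-library `json` module round-trips a record using only the handful of types listed above. The sample record is made up for this example.

```python
import json

record = {
    "title": "Press release",   # string
    "pages": 3,                 # number
    "tags": ["doj", "news"],    # array
    "published": True,          # true
    "editor": None,             # null
}

encoded = json.dumps(record, sort_keys=True)  # Python object -> JSON text
decoded = json.loads(encoded)                 # JSON text -> Python object
print(encoded)
```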

### Web APIs & REST

**7. Providing web services is one of the more common means of sharing data these days making it a relatively common ingestion source.  Please find and list 3 NON RESTful web services that provide data in a JSON format.**

The following three NON RESTful web services provide data in a JSON format: 
1. *JavaScript Object Notation Web-Service Protocol (JSON-WSP)*
2. *Simple Object Access Protocol (SOAP)*
3. *JSON SRV Library*

**8. The HTTP protocol was covered in class in order to explain the underlying structure of RESTful endpoints.  Please describe the intended usage for the POST, GET, PUT, and DELETE HTTP verbs within a RESTful website.**

There are many HTTP verbs. The intended usage of the following four within a RESTful website is briefly described below. 
- **POST:** POST is used to create new resources. Specifically, POST creates subordinate resources (i.e., resources that depend on a parent resource): the server processes the request body as a new subordinate of the URL being posted to. 
- **GET:** GET is read-only. It is used to read a representation of a resource and instructs the server to transmit the data identified by the URL.  
- **PUT:** PUT is used to update. It creates or replaces the resource identified by the URL. 
- **DELETE:** Consistent with its name, DELETE deletes the resource identified by the URL of the request. 
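
To make the four verbs concrete, the sketch below builds (but does not send) one request per verb with the standard library. The base URL and payload are hypothetical placeholders, not a real service.

```python
import json
from urllib.request import Request

BASE_URL = "http://example.com/physicians"  # hypothetical endpoint
payload = json.dumps({"name": "Jane Doe"}).encode("utf-8")
headers = {"Content-Type": "application/json"}

requests_by_verb = {
    # Create a new resource under the collection URL.
    "POST": Request(BASE_URL, data=payload, headers=headers, method="POST"),
    # Read a representation of one resource; no request body.
    "GET": Request(BASE_URL + "/1", method="GET"),
    # Create or replace the resource at this exact URL.
    "PUT": Request(BASE_URL + "/1", data=payload, headers=headers, method="PUT"),
    # Remove the resource at this URL; no request body.
    "DELETE": Request(BASE_URL + "/1", method="DELETE"),
}

for verb, req in requests_by_verb.items():
    print(verb, req.full_url, "with body" if req.data else "no body")
```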

**9. Imagine you are responsible for developing the RESTful endpoints for an application dealing with the medical industry.  Provide a list of the endpoints needed to manage the details about physicians.  See the slide titled "Interacting with Endpoints" for an example.**

Below is an example of RESTful endpoints for an application dealing with the medical industry. Specifically, it relates to physicians. 

GET /physicians: Returns a list of physicians. 

POST /physicians: Creates a new physician entry with the data posted. 

GET /physicians/:id: Gets detailed information about a single physician. 

PUT /physicians/:id: Modifies information about a physician. 

DELETE /physicians/:id: Deletes a physician on the server. 
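
The behavior of these endpoints can be sketched with a small in-memory dispatcher. This is only an illustration under simplifying assumptions (no web framework, no persistence, hypothetical `handle` function and status-code choices), not a production design.

```python
import itertools

physicians = {}              # in-memory stand-in for a database table
_ids = itertools.count(1)    # hands out 1, 2, 3, ... as new ids


def handle(method, path, body=None):
    """Return (status_code, response_data) for a request."""
    parts = [p for p in path.split("/") if p]
    if parts[:1] != ["physicians"] or len(parts) > 2:
        return 404, None

    if len(parts) == 1:                      # /physicians
        if method == "GET":
            return 200, list(physicians.values())
        if method == "POST":
            new_id = next(_ids)
            physicians[new_id] = dict(body, id=new_id)
            return 201, physicians[new_id]
        return 405, None

    # /physicians/:id
    if not parts[1].isdigit() or int(parts[1]) not in physicians:
        return 404, None
    physician_id = int(parts[1])
    if method == "GET":
        return 200, physicians[physician_id]
    if method == "PUT":
        physicians[physician_id] = dict(body, id=physician_id)
        return 200, physicians[physician_id]
    if method == "DELETE":
        del physicians[physician_id]
        return 204, None
    return 405, None


status, created = handle("POST", "/physicians", {"name": "Jane Doe"})
status_put, updated = handle("PUT", "/physicians/1", {"name": "Jane Q. Doe"})
status_del, _ = handle("DELETE", "/physicians/1")
status_missing, _ = handle("GET", "/physicians/1")
print(status, status_put, status_del, status_missing)
```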

### Ingestion Systems

**10. Stream processing vs Batch processing is an often discussed topic in ingestion systems.  Describe to a non-technical audience the difference between the two process models.**

Put simply, there are two primary models for ingestion systems: (1) stream processing and (2) batch processing. Stream processing handles one data element (or a small subset of elements) at a time, as soon as it arrives, while batch processing collects a larger quantity of data and processes it all together. Generally, stream processing suits simple data elements that must be handled one by one with little delay, while batch processing suits more complex work on grouped data, where processing in bulk improves performance. In other words, stream processing works on small pieces of data at a time, while batch processing works on large accumulated datasets. 
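
The contrast can be sketched in a few lines. The function names and sample readings below are hypothetical; the point is only that the streaming version produces a result after every element, while the batch version waits for the whole collection.

```python
def stream_totals(readings):
    """Stream model: update and emit a running total per element."""
    total = 0
    for value in readings:
        total += value
        yield total  # a result is available immediately after each element


def batch_total(readings):
    """Batch model: wait for the full collection, then compute once."""
    return sum(readings)


readings = [3, 5, 2, 10]
running = list(stream_totals(iter(readings)))
final = batch_total(readings)
print(running, final)
```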

**11. Provide 2 examples of streaming data and explain why they should be considered streaming.**

Streaming data is data generated continuously, often by thousands of sources simultaneously, and it takes a wide variety of forms. Two examples of streaming data include: 

1. An online gaming company continuously collects player interaction data. As the data streams in, the platform analyzes it using machine learning and feeds the results back into the game in real time, offering player incentives and user-focused experiences. The data should be considered streaming because it is generated continuously and must be acted on while the player is still playing. 

2. A coupon website (e.g., LivingSocial) tracks mobile device data and makes recommendations based on users' geolocation and prior purchase history. The data should be considered streaming because location updates arrive continuously and a recommendation is only useful while the user is still nearby. 

**12. The slide titled "Basic Ingestion Example" lists some reasons as to why using your single laptop to download a large dataset likely isn't the best choice.  Please provide two more reasons to support this argument.**

In addition to the items listed on the "Basic Ingestion Example" slide, the following reasons also support the argument that a single laptop should not be used to download a large dataset. 
1. **Data Security:** What if the data contains personally identifiable information (PII)? Sensitive data should be downloaded to a secure location (likely not your laptop!). 
2. **Internet Speed**: What if your home internet connection is not fast enough to download the dataset? This could lead to long (and unnecessary) delays. 

**13. Please provide brief descriptions of Apache Hadoop and Apache Spark.  Be sure to include the benefits of each as well as high level differences.**

Although Apache Hadoop and Apache Spark are both big data platforms, they are very different. Note the key characteristics (and differences) of these two platforms below: 

* **Apache Hadoop:** 
  * Distributed data infrastructure
  * Indexes data
  * Resilient to system failures
  * Contains a processing component called MapReduce

* **Apache Spark:** 
  * Data-processing tool
  * Operates on distributed data collections
  * DOES NOT do distributed storage (does not have its own file management system) 
  * Generally faster than Hadoop's MapReduce, largely because it can keep working data in memory 
  * Data objects are stored in resilient distributed datasets (RDDs), but the system as a whole is not necessarily resilient 
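
The MapReduce model mentioned above can be illustrated on a single machine in plain Python. This sketch uses neither Hadoop nor Spark; it only mimics the map, shuffle, and reduce steps on a made-up word-count example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


lines = ["big data big ideas", "data lakes hold raw data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

On a real cluster, each map and reduce call could run on a different machine, which is what makes the model scale.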

**14. Python Celery (http://www.celeryproject.org/) is a task/queue system for distributing work such as ingestion.  Please list some of the benefits of using this library and then describe a scenario where you might want to use it for ingestion purposes.**

Python Celery is a task/queue system for distributing work such as ingestion. Benefits of using Celery include: 
* Focus on real-time operation 
* Support scheduling 
* Tasks are executed concurrently 
* Tasks can execute asynchronously or synchronously
* Can process millions of tasks a day
* Easy to integrate with web frameworks

Celery is often used when traditional methods are not successful in creating an ingestion system. The library essentially allows users to build their own ingestion system. 
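
The idea of handing ingestion jobs to a pool of workers can be sketched with the standard library. Note that this is a stand-in using `concurrent.futures`, not Celery itself, which additionally relies on a message broker and separate worker processes. The `ingest` function and source names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def ingest(source_name):
    """Placeholder task: a real one would download and store one source."""
    return source_name, len(source_name)


sources = ["nyt-homepage", "doj-press-releases", "weather-feed"]

# Submit every task to the pool, then collect results as they finish.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {name: pool.submit(ingest, name) for name in sources}
    results = {name: future.result() for name, future in futures.items()}

print(results)
```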

## Part Two: Coding Assignments

### RSS Feed Ingestion

**1. The first assignment involves RSS feeds as an ingestion source.  Using the provided template (https://gist.github.com/looselycoupled/8f6b0226e92f8e3a13c0), fill in the missing code to get the desired functionality.  When finished, your program should pull the New York Times RSS feed (url is provided) using the feedparser (https://pythonhosted.org/feedparser/introduction.html) library.  Then it should loop through each entry and download the HTML for the article using the Requests library.  Each entry contains a number of attributes - one of which will be the URL for the actual web page.  After the article is downloaded, the program should then save the file to disk. When saving to disk, use the title of the article (from the RSS entry object) as the filename.  Because this string may not be suitable for filenames, a function (slugify) has been provided which takes a string, and then returns a modified version which should be safe to use as a filename.  The "open" Python function is used to write data to disk and is well documented.  A common problem with this lab is that the HTML may contain invalid characters (UTF-8 vs ASCII).  To get around this you may need to use something like: "f.write(content.encode('utf-8'))"**

In [67]:
import os
import re
import requests as rqst
import feedparser as fp

In [68]:
RSS_URL = 'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'

In [69]:
feed = fp.parse(RSS_URL)

In [70]:
def slugify(value):
    """
    Converts to ASCII. Converts spaces to hyphens. Removes characters that
    aren't alphanumerics, underscores, or hyphens. Converts to lowercase.
    Also strips leading and trailing whitespace.
    """
    # operate on the string passed in, not on the global feed object
    value = value.encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)

def save_article(title):
    """
    Save HTML content using a slugged version of the title as the basis for
    the filename
    """
    pass

for entry in feed.entries:
    print entry.title

U.S. to Bar Arbitration Clauses in Nursing Home Contracts
Congress Votes to Override Obama Veto on 9/11 Victims Bill
Congress Approves Spending Bill, Averting Government Shutdown
New Debate Strategy for Donald Trump: Practice, Practice, Practice
Hillary Clinton Struggles to Win Back Young Voters From Third Parties
Colin Kaepernick Says Presidential Candidates Were Trying to ‘Debate Who’s Less Racist’
President and Michelle Obama Lash Out at Donald Trump
‘She Has a Name,’ Alicia Machado, and It Is Everywhere
Your Evening Briefing
California Today: California Today: First-Day Jitters for Kevin Durant
More Wealth, More Jobs, but Not for Everyone: What Fuels the Backlash on Trade
Your Daily Mini Crossword
Elvis Costello’s New York Soul

The 2016 Race: If Your Vote Doesn’t Really Count, Is There Anything You Can Do?
Contributing Op-Ed Writer: The One Question You Should Ask About Every New Job
Life Without Shimon Peres? In Many Ways, His Israel Faded Long Ago
Obama and Bill Clinton to Trave
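
The `save_article` function in the template is left as a stub. A minimal sketch of how it might be completed follows, written for Python 3 and assuming the HTML has already been downloaded (for example with `requests.get(entry.link).text`). The `content` and `directory` parameters are assumptions added for illustration and are not part of the original template.

```python
import os
import re
import tempfile


def slugify(value):
    """Reduce a title to lowercase ASCII words joined by hyphens."""
    value = value.encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[-\s]+", "-", value)


def save_article(title, content, directory):
    """Save HTML content under a slugged version of the title."""
    path = os.path.join(directory, slugify(title) + ".html")
    # Encode explicitly so non-ASCII characters in the HTML do not fail.
    with open(path, "wb") as f:
        f.write(content.encode("utf-8"))
    return path


out_dir = tempfile.mkdtemp()
saved = save_article(
    "Your Evening Briefing",
    "<html><body>Elvis Costello’s New York Soul</body></html>",
    out_dir,
)
print(saved)
```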

### RESTful Website Ingestion

**2.  The second assignment involves downloading data from a RESTful website using the Requests library.  Specifically, you should provide the missing code in the https://gist.github.com/looselycoupled/5fd93fe80c39a24d64c1 template file.  Similar to the previous coding assignment, the goal is to download the 5 latest press releases from the U.S. Department of Justice REST web service and save each one to disk.  Documentation for the DOJ web service can be found at https://www.justice.gov/developer/api-documentation/api_v1.  The Requests documentation website should contain all of the explanations and example code you should need for this assignment.**

In [71]:
import os
import json
import requests

In [72]:
DOJ_RELEASES_URL = 'http://www.justice.gov/api/v1/press_releases.json?pagesize=5'

In [73]:
def fetch_press_releases():
    """
    Performs a GET on the DOJ web service and return the array found in the
    'results' attribute of the JSON response
    """
    # execute a GET request and store the results
    r = requests.get(DOJ_RELEASES_URL)
    # decode as json and store the results
    json_data = json.loads(r.text)
    # print the decoded response for inspection
    print json_data
    # return the 'results' array of press releases
    return json_data['results']
    
def main():
    """
    Main execution function to perform required actions
    """
    # fetch array of press releases
    press_releases = fetch_press_releases()
    
##########################################################################
## Execution
##########################################################################
if __name__ == '__main__':
    main()

{u'results': [{u'body': u'<root><div class="presscontenthdr-container">\n<div class="presscontenthdr-container">\n<div class="presscontenthdr">\n<p>WASHINGTON<strong> - </strong>INTERPOL Washington, the United States National Central Bureau (USNCB), announced the capture and return of Shilo Watts, 38, a United States citizen and resident of Atascosa County, Texas from Oman to the United States. Watts is wanted in Texas for charges of aggravated sexual assault of a minor, beginning when the minor was three years old and continuing over a prolonged period of time. In 2012, Watts fled the United States, resulting in the issuance of federal felony charge of unlawful flight to avoid prosecution.</p>\n</div>\n</div>\n\n<div class="presscontenttext">\n<p>In April, INTERPOL Washington expedited the publication of an INTERPOL Red Notice, or international wanted persons notice, for Watts based on the charges in Texas. The Red Notice was disseminated via INTERPOL\'s network to its 190 member coun