# Data Wrangling in Python
-----------------------------------------
By: Ranysha Ware (<rwjanee@gmail.com>)

Machine learning algorithms require data. In the age of information, there are many ways to acquire a lot of data for free. Today, we're going to review some ways you can gather data on the Web using Python, by collecting what we need to answer the question: **What new 2016 shows should I watch this fall?**

The outline of today's lecture is as follows:
  1. 5 Minute Intro: Python
  2. 5 Minute Intro: How The Internet Works
  3. Relevant Python packages 
  4. How to access data via a public API
  5. Scraping Wikipedia TV data
  6. Data cleaning

## 5 Minute Intro: Python
-------------------------------
Python is a [dynamically typed](https://en.wikipedia.org/wiki/Type_system#Dynamic_type_checking_and_runtime_type_information) [interpreted language](https://en.wikipedia.org/wiki/Interpreted_language). 

There are two supported versions of Python, Python 2.7 and Python 3. Use the latest version of Python 3.  The code in this lecture has been tested with Python 3.5.

**Everything in Python is an object.** Objects have attributes and methods.  Attributes are named values stored on an object.  Methods are functions attached to an object, often used to read or modify it.  

Some useful builtin objects include: <br>
`None` - null object <br>
`int` - a whole number <br>
`float` - a floating point number <br>
`str` - a string, i.e. a sequence of characters <br>
`list` - a mutable sequence of objects <br>
`tuple` - an immutable sequence of objects <br>
`set` - a mutable, unordered collection of unique objects <br>
`dict` - a mutable set of key-value pairs

In [1]:
example = {
    "hey i'm a dictionary key" : "and here's my value",
    "here's another value" : [1,2,3,4],
    10 : (1,2),
    'a' : None 
}

In [2]:
example['a'] = 'new value'
example

{"hey i'm a dictionary key": "and here's my value",
 10: (1, 2),
 "here's another value": [1, 2, 3, 4],
 'a': 'new value'}

In [3]:
example["here's another value"].append(5)
example

{"hey i'm a dictionary key": "and here's my value",
 10: (1, 2),
 "here's another value": [1, 2, 3, 4, 5],
 'a': 'new value'}

Even functions are just objects that contain some code to run when you call them.  Functions are defined with the `def` keyword. By default, a function returns `None`; use the `return` keyword to return a different value.

Other useful language constructs include `for` loops and list comprehensions. In addition, to use packages you can import them into your current project with the `import` keyword.

We'll see illustrative examples of functions, loops, comprehensions, and imports later.
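
As a quick preview before those, here is a minimal sketch of the constructs just mentioned. The function names (`greet`, `shout`) and the sample values are made up for illustration and are not part of the lecture's code.

```python
import math  # `import` makes a package available, here the standard math module


def greet(name):
    """Print a greeting; there is no `return`, so the call evaluates to None."""
    print("hello, " + name)


def shout(name):
    """Return an upper-case version of `name`."""
    return name.upper()


result = greet("world")   # prints a greeting and hands back None
loud = shout("world")

# a for loop visits each item of a sequence in order
squares = []
for n in [1, 2, 3, 4]:
    squares.append(n * n)

# a list comprehension builds the same list in a single expression
squares_again = [n * n for n in [1, 2, 3, 4]]

# functions are objects too, so they can be stored in a list and called later
functions = [shout, math.sqrt]
```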

## 5 Minute Intro: How The Internet Works
-----------------------------------------------------
You all interact with the Internet via the World Wide Web, typically through a web browser.  When communicating over the Web, your browser (or computer) is acting as a *client*.  **Clients send data to and receive data from a *server*.**  **A server receives requests from clients and sends back some data.** The way clients and servers communicate is via some well-defined *protocol*. **A protocol is a set of rules for how clients and servers should exchange messages.**

The protocol we'll be using to gather data is *HTTP*. (Don't worry about what that actually stands for; it *really* doesn't matter.)  Every time you type a website address into your browser and press enter, you are sending one or more HTTP requests to an HTTP server.  The HTTP server will send you back some data.  In the case of web browsing, this data is typically an *HTML* string that your browser will use to render the website you're trying to visit.
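
To make the request/response exchange concrete, here is a rough sketch of what the text of a simple HTTP/1.1 `GET` request and the first line of a response look like. The helper names `build_get_request` and `parse_status_line` are hypothetical, and a real browser sends many more headers than the two shown here. In practice a library builds and parses these messages for you.

```python
def build_get_request(host, path):
    """Return the text of a minimal HTTP/1.1 GET request (illustrative only)."""
    lines = [
        "GET {} HTTP/1.1".format(path),  # request line: method, path, protocol version
        "Host: {}".format(host),         # header naming the server we want to reach
        "Accept: text/html",             # header saying we would like HTML back
        "",                              # a blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)


def parse_status_line(status_line):
    """Split a response status line such as 'HTTP/1.1 200 OK' into its parts."""
    version, code, reason = status_line.split(" ", 2)
    return {"version": version, "code": int(code), "reason": reason}


request_text = build_get_request("en.wikipedia.org", "/wiki/Python")
status = parse_status_line("HTTP/1.1 404 Not Found")
```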

[comment]: <> (HTML consists of a series of tags that can tell a web browser how to render the text and images. While HTML and HTTP alone were sufficient for sharing data over the internet in the beginning, in today's "application-centric" world, many developers want to be able to receive data from )

[comment]: <> (Note, there are way more protocols that make the internet work than just HTTP, but those are beyond the scope of this lecture.)

## Python Packages We'll Need
----------------------------------------

In [4]:
import requests 

`requests` will let us send HTTP requests to an HTTP server and receive the response without worrying about the low-level details. 

In [5]:
# send request to get the Wikipedia Terms of Use
response = requests.get(
    'https://wikimediafoundation.org/wiki/Terms_of_Use')
# this raises an AssertionError if the response status is an error (4xx or 5xx)
assert response.ok
# print the first 200 characters of the text of the response
print(response.text[:200])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Terms of Use - Wikimedia Foundation</title>
<script>document.documentElement.className = document.d


In [6]:
from bs4 import BeautifulSoup   # `from ... import` pulls a single name out of a package

We'll use the `bs4` (BeautifulSoup 4) package to search the HTML text we get back in an HTTP response. This way we can extract the parts of the web page we need.

In [7]:
# parse the html text
parser = BeautifulSoup(response.text, 'html.parser')
# find all the <b> tags
bold_text = parser.find_all('b')
# display the first 10 tags
bold_text[:10]

[<b>In other languages</b>,
 <b><strong class="selflink">English</strong></b>,
 <b>This is a summary of the Terms of Use.  To read the full terms, scroll down or <a href="/wiki/Terms_of_Use#Our_Terms_of_Use" title="Terms of Use">click here</a>.</b>,
 <b>summary</b>,
 <b>Part of our mission is to</b>,
 <b>Empower and Engage</b>,
 <b>Disseminate</b>,
 <b>You are free to</b>,
 <b>Read and Print</b>,
 <b>Share and Reuse</b>]

In [8]:
import re

The built-in `re` module provides regular expressions for searching strings.

In [9]:
# use regex to find all bold tags that contain the string 'languages'
parser.find_all('b', string=re.compile('languages'))

[<b>In other languages</b>]
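
Outside of BeautifulSoup, the same `re` module works on plain strings. A small sketch, using a few page titles like the ones that appear later in this lecture:

```python
import re

titles = [
    "2014–15 United States network television schedule",
    "2015–16 United States network television schedule",
    "2016 in American television",
]

# a pattern that matches a four-digit year at the very start of a string
year_pattern = re.compile(r"^(\d{4})")

# search returns a match object, or None when the pattern is not found
start_years = [year_pattern.search(t).group(1) for t in titles]

# keep only the titles that contain the word 'schedule'
schedules = [t for t in titles if re.search(r"schedule", t)]
```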

In [10]:
import pandas as pd

`pandas` provides DataFrames, tabular data structures similar to data frames in `R`.
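
As a minimal sketch of what a DataFrame looks like, the block below builds one from a list of dictionaries. The rows are placeholder values chosen for illustration, not scraped data.

```python
import pandas as pd

# each dictionary becomes one row; the keys become the column names
rows = [
    {"title": "Show A", "channel": "ABC", "num_seasons": 1},
    {"title": "Show B", "channel": "NBC", "num_seasons": 3},
    {"title": "Show C", "channel": "ABC", "num_seasons": 2},
]
df_example = pd.DataFrame(rows)

# filter rows with a boolean mask, and aggregate a single column
abc_only = df_example[df_example["channel"] == "ABC"]
total_seasons = df_example["num_seasons"].sum()
```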

## Scraping Wikipedia TV Data
---------------------------------------
### Some General Web Scraping Rules
- *First, read the terms of service.* Unfortunately, many websites explicitly forbid web scraping of any kind. Example: [Craigslist](https://www.craigslist.org/about/terms.of.use.en)
- *Prefer an API.* Websites with lots of data that they are willing to share will typically expose a special API for requesting data.  These are typically REST APIs that return nicely formatted JSON, which is easier to parse than HTML. 
- *Obey rate limits and authentication requirements.* You'll get blocked if you don't.
- *Anticipate errors.* Web pages do not all follow one format. Always be prepared to handle the errors that will inevitably occur.
- *Validate assumptions about data after scraping.*  Ensure fields that are supposed to be non-null actually are.  Ensure fields you expect to be numbers actually are all numbers.
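
The last two rules can be sketched in code. The helpers below (`call_with_retries` and `validate_records`) are hypothetical names, not part of any library. `fetch` stands for whatever function actually makes the request, and the field names mirror the ones collected later in this lecture.

```python
import time


def call_with_retries(fetch, retries=3, delay_seconds=1.0):
    """Call `fetch()` until it succeeds, pausing between attempts.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay_seconds)  # wait before trying again


def validate_records(records, required_fields, numeric_fields):
    """Return a list of human-readable problems found in scraped records."""
    problems = []
    for i, record in enumerate(records):
        for field in required_fields:
            if record.get(field) is None:
                problems.append("record {}: '{}' is missing".format(i, field))
        for field in numeric_fields:
            value = record.get(field)
            if value is not None and not str(value).strip().isdigit():
                problems.append(
                    "record {}: '{}' is not a number: {!r}".format(i, field, value))
    return problems


records = [
    {"title": "Show A", "num_seasons": "2"},
    {"title": None, "num_seasons": "two"},
]
problems = validate_records(records, ["title"], ["num_seasons"])
```

Here `problems` ends up with one entry per failed check, so an empty list means the scraped records passed every assumption that was tested.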


### Wikimedia API

[comment]: <> (need to add information here about why we prefer wikimedia to calling requests to render the pages we want directly)

Documentation: https://www.mediawiki.org/wiki/API:Main_page

Endpoint: https://en.wikipedia.org/w/api.php
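
A request to this endpoint is just the URL plus a query string of parameters, and the JSON that comes back maps directly onto Python dictionaries and lists. The sketch below uses only the standard library and makes no network call. The JSON text is a hand-written stand-in shaped like an API reply, not real output.

```python
import json
from urllib.parse import urlencode

endpoint = 'https://en.wikipedia.org/w/api.php'
params = {'action': 'query', 'list': 'categorymembers', 'format': 'json'}

# the parameters are appended to the endpoint as a query string
full_url = endpoint + '?' + urlencode(params)

# hand-written stand-in for a JSON reply (illustrative values only)
reply_text = '{"query": {"categorymembers": [{"pageid": 1, "title": "Example page"}]}}'

# json.loads turns JSON objects into dicts and JSON arrays into lists
reply = json.loads(reply_text)
first_title = reply['query']['categorymembers'][0]['title']
```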

### Collecting TV Data

There are a few different kinds of Wikipedia pages we can use:

1. [2015-16 United States network television schedule](https://en.wikipedia.org/wiki/2015%E2%80%9316_United_States_network_television_schedule)

2. [2016 in American television](https://en.wikipedia.org/wiki/2016_in_American_television)

3. [Category:2016 American television series debuts](https://en.wikipedia.org/wiki/Category:2016_American_television_series_debuts)

Let's use 1 and focus only on network television: ABC, CBS, The CW, Fox, and NBC.

[comment]: <> (Should mention that JSON is more readable than HTML for machines and is equivalent to a Python dictionary.)

For a few years of schedules, let's collect each show's name, release date, and number of seasons.

We can use this category page to find the other pages like 1:

[Category: United States primetime network television schedules](https://en.wikipedia.org/wiki/Category:United_States_primetime_network_television_schedules)

**1 -- Get all the pages in the category**

In [11]:
wikipedia_api_url = 'https://en.wikipedia.org/w/api.php'
request_params = {'action': 'query',
       'list': 'categorymembers',
       'cmlimit': 500, # assume the category has no more than 500 pages
       'cmtitle' : 'Category:United_States_primetime_network_television_schedules',
       'format' : 'json'}
response = requests.get(url=wikipedia_api_url, params=request_params)
category_members = response.json()['query']['categorymembers']

In [12]:
category_members[-5:]

[{'ns': 0,
  'pageid': 35690173,
  'title': '2012–13 United States network television schedule'},
 {'ns': 0,
  'pageid': 38473731,
  'title': '2013–14 United States network television schedule'},
 {'ns': 0,
  'pageid': 40770098,
  'title': '2014–15 United States network television schedule'},
 {'ns': 0,
  'pageid': 44341616,
  'title': '2015–16 United States network television schedule'},
 {'ns': 0,
  'pageid': 47762863,
  'title': '2016–17 United States network television schedule'}]
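
The request above assumes everything fits in one reply. When a list is longer than the limit, the MediaWiki API includes a `continue` object in the reply, and sending its values back as extra parameters fetches the next batch. A hedged sketch of that loop follows. `get_all_category_members` and its `fetch_json` argument are hypothetical names, with `fetch_json` standing for any function that takes a parameter dictionary and returns the parsed JSON reply.

```python
def get_all_category_members(fetch_json, category_title, limit=500):
    """Collect every member of a category, following `continue` values."""
    params = {'action': 'query',
              'list': 'categorymembers',
              'cmtitle': category_title,
              'cmlimit': limit,
              'format': 'json'}
    members = []
    while True:
        reply = fetch_json(params)
        members.extend(reply['query']['categorymembers'])
        if 'continue' not in reply:
            return members  # no more batches to fetch
        # copy the continuation values into the next request
        params = dict(params, **reply['continue'])
```

With `requests`, `fetch_json` could be something like `lambda p: requests.get(wikipedia_api_url, params=p).json()`.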

Let's walk through gathering data for 2012-13 and then write some functions to gather data for the other years.

Use the 'Inspect Element' feature of Firefox (or 'Developer Tools' in Chrome) to view the HTML and CSS for the elements you want.

**2 -- request page using Wikipedia API**

In [13]:
wikipedia_api_url = 'https://en.wikipedia.org/w/api.php'
request_params = {'action': 'parse', # the kind of action we're doing
                'pageid' : 35690173,   
                'format' : 'json'}  # format of data returned
response = requests.get(url=wikipedia_api_url, params=request_params)

**3 -- the text is just an HTML blob; we'll use BeautifulSoup to parse it**

In [14]:
parser = BeautifulSoup(
    response.json()['parse']['text']['*'],'html.parser')

**4 -- this will give us the table of programs on ABC**

In [15]:
abc_table = parser.find('span', id='ABC').find_next('div')

**5 -- this will give us the name of new series**

In [16]:
abc_new_shows = abc_table.find_all('td')[1].find_all('li')
abc_new_shows[0].find('a')['title']

'666 Park Avenue'
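
To see why that chain of `find`, `find_next`, and `find_all` calls works, here is the same pattern applied to a tiny hand-written HTML snippet. The snippet is a simplified assumption about the page layout (a heading `span` followed by a `div` whose second table cell lists the new series), not a copy of Wikipedia's markup.

```python
from bs4 import BeautifulSoup

html = """
<h3><span id="ABC">ABC</span></h3>
<div>
  <table><tr>
    <td><ul><li><a title="Returning Show">Returning Show</a></li></ul></td>
    <td><ul>
      <li><a title="New Show A">New Show A</a></li>
      <li><a title="New Show B">New Show B</a></li>
    </ul></td>
  </tr></table>
</div>
"""
toy_parser = BeautifulSoup(html, 'html.parser')

# find the heading span by id, then the first <div> that follows it
toy_table = toy_parser.find('span', id='ABC').find_next('div')

# take the second table cell and collect the link titles inside it
new_show_items = toy_table.find_all('td')[1].find_all('li')
new_show_titles = [item.find('a')['title'] for item in new_show_items]
```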

**6 -- parse TV show page and get the number of seasons**

In [17]:
wikipedia_api_url = 'https://en.wikipedia.org/w/api.php'
request_params = {'action': 'parse',                 # the kind of action we're doing
                'page' : '20/20 (U.S. TV series)',   # title of page
                'format' : 'json'}                   # format of data returned
response = requests.get(url=wikipedia_api_url, params=request_params)

parser = BeautifulSoup(
    response.json()['parse']['text']['*'],'html.parser')
table = parser.find(
    "table", class_="infobox vevent")
table_row = table.find(
    string=re.compile("of seasons")).find_parent("tr")
num_seasons = table_row.find("td").text

In [18]:
num_seasons

'38'

**7 -- write functions to get all the data we want**

In [19]:
def get_show_data(title):
    """Return a dict with a show's title, number of seasons, and release date."""
    wikipedia_api_url = 'https://en.wikipedia.org/w/api.php'
    request_params = {'action': 'parse',  # the kind of action we're doing
                'page' : title,   # title of page
                'format' : 'json'}   # format of data returned
    response = requests.get(url=wikipedia_api_url, params=request_params)

    parser = BeautifulSoup(
        response.json()['parse']['text']['*'],'html.parser')
    table = parser.find("table", class_="infobox vevent")
    if table is None:
        num_seasons = None
        release_date = None
    else:
        # get number of seasons
        try:
            table_row = table.find(
                string=re.compile("of seasons")).find_parent("tr")
            num_seasons = table_row.find("td").text
        except AttributeError:  # the infobox may not have an "of seasons" row
            num_seasons = None
        
        # get release date
        try:
            release_date = table.find(
                'span', class_='bday dtstart published updated').text
        except AttributeError:  # the infobox may not have a release date
            release_date = None

    return {'title':title, 
            'num_seasons':num_seasons, 
            'release_date':release_date}

In [20]:
channels = ['ABC', 'CBS', 'The_CW', 'Fox', 'NBC']
def get_shows(pageid):
    """Given the page ID of a US network television schedule,
    return a list of dicts describing the new shows on each channel."""
    request_params = {'action': 'parse',                      
                'pageid' : pageid,   
                'format' : 'json'}                          
    response = requests.get(url=wikipedia_api_url, params=request_params)
    parser = BeautifulSoup(
        response.json()['parse']['text']['*'],'html.parser')
    
    # create list of data we're collecting
    all_shows = []
    for channel in channels:
        channel_table = parser.find(
            'span', id=channel).find_next('div')
        new_shows = channel_table.find_all('td')[1].find_all('li')
        for show in new_shows:
            show_title = show.find('a')['title']
            show_data = get_show_data(show_title)
            show_data['channel'] = channel
            all_shows.append(show_data)
    return all_shows

In [21]:
show_data_1213 = get_shows(35690173)

In [22]:
df = pd.DataFrame(show_data_1213)  # each dict in the list becomes one row
df.head()

Unnamed: 0,channel,num_seasons,release_date,title
0,ABC,1,2012-09-30,666 Park Avenue
1,ABC,2,2013-04-13,Bet on Your Baby
2,ABC,1,2013-05-01,Family Tools
3,ABC,1,2013-04-03,How to Live with Your Parents (For the Rest of...
4,ABC,1,2012-09-27,Last Resort (U.S. TV series)


Cool, we have some data. We would repeat this for all the seasons we want to get the shows from.
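
One way to repeat this, sketched under some assumptions: pick the schedule pages whose titles start with the years of interest and pass each page ID to a function like `get_shows`. The helper name `collect_seasons` and its `get_shows_fn` parameter are hypothetical. Passing the function in as an argument keeps the sketch runnable without network access.

```python
def collect_seasons(category_members, start_years, get_shows_fn):
    """Gather show records for every schedule page that starts in `start_years`."""
    all_records = []
    for member in category_members:
        title = member['title']
        # schedule titles begin with the starting year, e.g. '2012–13 United States ...'
        if title[:4] in start_years:
            for record in get_shows_fn(member['pageid']):
                record['schedule'] = title  # remember which schedule page the show came from
                all_records.append(record)
    return all_records
```

For example, `collect_seasons(category_members, {'2012', '2013', '2014'}, get_shows)` would gather three seasons, and the resulting list can be handed to `pd.DataFrame` as before. Each call triggers many HTTP requests, so consider pausing between pages.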

## Data Cleaning

We have some shows that have an original release date that is either not a valid date or is before 2012.  Let's convert invalid dates to `NaT` (the datetime equivalent of NaN) in our dataframe.

In [23]:
# coerce to a valid date; values that cannot be parsed become NaT
df['release_date'] = pd.to_datetime(
    df['release_date'], errors='coerce')
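
To show what `errors='coerce'` does in isolation, here is a small sketch on placeholder values (not scraped data). It also applies the same idea to a numeric column with `pd.to_numeric`, which this lecture does not otherwise use.

```python
import pandas as pd

# placeholder values standing in for scraped strings
toy = pd.DataFrame({
    'release_date': ['2012-09-30', 'not a date', None],
    'num_seasons': ['1', 'TBA', '4'],
})

# values that cannot be parsed become NaT (dates) or NaN (numbers)
toy['release_date'] = pd.to_datetime(toy['release_date'], errors='coerce')
toy['num_seasons'] = pd.to_numeric(toy['num_seasons'], errors='coerce')

# count how many values in each column are missing after coercion
missing_counts = toy.isna().sum()
```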

In [24]:
from datetime import datetime
# show the rows whose release date is before 2012
df[df['release_date'] < datetime(year=2012, day=1, month=1)]

Unnamed: 0,channel,num_seasons,release_date,title
32,The_CW,12,1998-08-05,Whose Line Is It Anyway? (U.S. TV series)
35,Fox,15,1999-01-01,Fox College Football


In [25]:
df.loc[
 df['release_date'] < datetime(year=2012, day=1, month=1), 'release_date']=None

In [26]:
df.head(10)

Unnamed: 0,channel,num_seasons,release_date,title
0,ABC,1,2012-09-30,666 Park Avenue
1,ABC,2,2013-04-13,Bet on Your Baby
2,ABC,1,2013-05-01,Family Tools
3,ABC,1,2013-04-03,How to Live with Your Parents (For the Rest of...
4,ABC,1,2012-09-27,Last Resort (U.S. TV series)
5,ABC,1,2012-11-02,Malibu Country
6,ABC,4,2013-06-03,Mistresses (U.S. TV series)
7,ABC,4,2013-02-03,Motive (TV series)
8,ABC,4,2012-10-10,Nashville (2012 TV series)
9,ABC,2,2012-09-26,The Neighbors (2012 TV series)


Finally, we can save the data to a file to use later.

In [27]:
df.to_csv('tv_show_data.csv')