<a href="https://colab.research.google.com/github/zackives/upenn-cis5450-hw/blob/main/Module_1_Data_Acquisition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview of Part I of Big Data Analytics

As we start our journey into Big Data Analytics, the first thing we need to do is **get the data** in the form we need for analysis!  We'll start with an overview of how to acquire and *wrangle* data.

This notebook will be built incrementally to consider several tasks:

* Acquiring data from files and remote sources
* Information extraction over HTML content
* A basic "vocabulary" of operators over tables (the relational algebra)
* Basic manipulation using SQL in DuckDB

* "Data wrangling" or integration:
  * Cleaning and filtering data, using rules and rule-based operations
  * Linking data across dataframes or relations
  * The need for approximate match and record linking
  * Different techniques


## The Motivating Question
To illustrate the principles, we focus on the question of **how old company CEOs and founders** (in general, leaders) are.  The question was in part motivated by the following New York Times article:

* Founders of Successful Tech Companies Are Mostly Middle-Aged: https://www.nytimes.com/2019/08/29/business/tech-start-up-founders-nest.html?searchResultPosition=2

So let's test this hypothesis!

## Initial Libraries

We'll be using [DuckDB](https://duckdb.org/) as a means of managing our tables.  DuckDB works like a Python library, but manages a full SQL database (in files).  It also integrates very nicely with Pandas, so we'll use it in this course.

In [None]:
!pip3 install duckdb

In [None]:
!pip3 install lxml

In [None]:
%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

In [None]:
!pip3 install penngrader-client

For quiz credit you'll need to update your student ID here!

In [None]:
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW
# WHOM TO ASSIGN POINTS TO IN OUR BACKEND
STUDENT_ID = 99999999 # YOUR PENN-ID GOES HERE AS AN INTEGER

Quizzes will cumulatively count as HW9... Don't edit this...

In [None]:
%set_env HW_ID=cis5450_25f_HW9

In [None]:
import os
from penngrader.grader import *

grader = PennGrader('notebook-config.yaml', os.environ['HW_ID'], STUDENT_ID, STUDENT_ID)

In [None]:
# Imports we'll use through the notebook, collected here for simplicity

# For parsing dates and being able to compare
import datetime

# For fetching remote data
import urllib.error
import urllib.parse
import urllib.request

# Pandas dataframes and operations
import pandas as pd

# Numpy matrix and array operations
import numpy as np

# DuckDB is an embedded SQL database that integrates with Pandas
import duckdb

# Data visualization
import matplotlib



# 1. Acquiring External Data

To test our hypothesis, we might want:

1. A list of companies (and, for further details, perhaps their lines of business)
2. A list of company CEOs
3. Ages of the CEOs

We'll go through each of these using real data from the web.

### Reading Structured Data Sources

Let's start by looking up data about companies.  We are using a dataset from https://www.kaggle.com/datasets/peopledatalabssf/free-7-million-company-dataset?resource=download, but we have a copy of it at an alternate site for convenience of downloading.

## 1.1. External CSV Data

Comma-separated values are generally easy to read. The main questions are column headings (which are in an optional row that isn't always provided) and datatypes (which might default to the wrong thing).
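To make those two questions concrete, here is a small sketch on made-up inline data (the column names `name`, `zip_code`, and `year_founded` are hypothetical and are not the columns of the company dataset). It shows how a default read can pick the wrong datatype, and how `header`, `names`, and `dtype` let us take control.

```python
import io

import pandas as pd

# Made-up CSV text, with and without a header row.
csv_with_header = "name,zip_code,year_founded\nAcme,02134,1990\nGlobex,19104,\n"
csv_without_header = "Acme,02134,1990\nGlobex,19104,\n"

# Default read: zip_code is inferred as an integer (dropping the leading zero),
# and year_founded becomes a float because of the missing value.
default_df = pd.read_csv(io.StringIO(csv_with_header))

# Explicit datatypes: keep zip codes as strings and years as nullable integers.
typed_df = pd.read_csv(
    io.StringIO(csv_with_header),
    dtype={"zip_code": str, "year_founded": "Int64"},
)

# No header row in the file: say so, and supply the column names ourselves.
no_header_df = pd.read_csv(
    io.StringIO(csv_without_header),
    header=None,
    names=["name", "zip_code", "year_founded"],
    dtype={"zip_code": str},
)

print(default_df.dtypes)
print(typed_df)
print(no_header_df)
```

Checking `df.dtypes` right after loading is a cheap way to catch these surprises before they affect later joins or comparisons.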

In [None]:
!wget -nc https://storage.googleapis.com/penn-cis5450/companies_sorted.csv

In [None]:
# This would read the file remotely on every run:

# data = urllib.request.urlopen(\
#        'https://storage.googleapis.com/penn-cis5450/companies_sorted.csv')
# company_data_df = pd.read_csv(data)

# To avoid multiple fetches, we instead read the local copy downloaded by wget above.

company_data_df = pd.read_csv('companies_sorted.csv')

company_data_df

Now let's use DuckDB to work with the dataframe.

In [None]:
# We can ask for the contents of a Pandas Dataframe through DuckDB, in the SQL language.
duckdb.sql("""SELECT *
              FROM company_data_df""")

## 1.2. Storing Data Locally and Re-Loading it

DuckDB nicely integrates with Pandas and Python. If you create a connection to a file, this results in the creation of a database stored within that file.

Normally we need to `CREATE TABLE` with the table name and columns. But we can actually create the table to match the *schema* of the DataFrame, as follows.

In [None]:
con = duckdb.connect('local.db')
con.sql("""CREATE TABLE IF NOT EXISTS company_data AS
           SELECT *
           FROM company_data_df""")

# query the table
con.table("company_data").show()

In [None]:
company_data_df = con.table("company_data").df()

company_data_df

## 1.3. Companies' CEOs: a Web Table

Now we need to figure out who the CEOs are for corporations.  One place to look is Wikipedia, which has an HTML table describing the CEOs.

https://en.wikipedia.org/wiki/List_of_chief_executive_officers#List_of_CEOs

Pandas actually makes it easy to read HTML tables...

In [None]:
# Now let's read an HTML table!

company_ceos_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_chief_executive_officers#List_of_CEOs')[1]

company_ceos_df
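`pd.read_html` returns a *list* of dataframes, one per `<table>` it finds, which is why the cell above indexes the result with `[1]`. Here is a small sketch on a made-up HTML string showing selection by position, and also selection by content using the `match` parameter, which can be more robust if the page layout changes.

```python
import io

import pandas as pd

# A made-up page with two tables; only the second one is interesting.
html_page = """
<html><body>
<table>
  <tr><th>Legend</th></tr>
  <tr><td>navigation box</td></tr>
</table>
<table>
  <tr><th>Company</th><th>Executive</th></tr>
  <tr><td>Acme</td><td>Jane Example</td></tr>
  <tr><td>Globex</td><td>John Sample</td></tr>
</table>
</body></html>
"""

# One dataframe per table, in document order.
tables = pd.read_html(io.StringIO(html_page))
by_position = tables[1]

# Only keep tables whose text matches the given string or regex.
by_content = pd.read_html(io.StringIO(html_page), match="Executive")[0]

print(len(tables))
print(by_position)
```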

In [None]:
con.sql('create table if not exists company_ceos as select * from company_ceos_df')

con.table('company_ceos').show()

## 1.4. The Problem Gets Harder... Extracting Fields from Tagged Data on the Web

So far we have companies and CEOs.  But we don't have information on how old the CEOs are!

For a solution, we're going to go back to Wikipedia -- this time looking at the web pages for the CEOs!

This involves "crawling" the CEO pages, and "scraping" the relevant content.  In other words we have to do *information extraction*.  For this particular problem, we will do extraction over very regular parts of Wikipedia.

We'll start by constructing a list of CEO web pages, from the Company CEO dataframe above.  For this, we need to take the names and do a bit of tweaking, for example adding underscores instead of spaces.

(Later we'll see how to do this over more free-form text.)

In [None]:
def get_ceo_urls(company_ceos_df):
  crawl_list = []

  for executive in company_ceos_df['Executive']:
    crawl_list.append('https://en.wikipedia.org/wiki/' + executive.replace(' ', '_'))
  return crawl_list

crawl_list = get_ceo_urls(company_ceos_df)

crawl_list

In [None]:
# Use urllib.request.urlopen to crawl all pages in crawl_list, and store the response
# of each page in the list pages

pages = []

for url in crawl_list:
    # An issue: some of the accented characters won't work in a raw URL.  We need to
    # percent-encode them.  We'll split the URL, then use "parse.quote" on the path
    # component, then re-form the URL
    url_list = list(urllib.parse.urlsplit(url))
    url_list[2] = urllib.parse.quote(url_list[2])
    url_ascii = urllib.parse.urlunsplit(url_list)
    try:
        response = urllib.request.urlopen(url_ascii)
        # Save the response for later use.
        pages.append(response)
    except urllib.error.URLError as e:
        print(e.reason)
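To see what the split / quote / re-form step does on its own, here is a small sketch with no network access. The helper name `to_ascii_url` and the people's names are made up for illustration.

```python
import urllib.parse

def to_ascii_url(url):
    """Percent-encode the path component of a URL, leaving the rest untouched."""
    parts = list(urllib.parse.urlsplit(url))   # [scheme, netloc, path, query, fragment]
    parts[2] = urllib.parse.quote(parts[2])    # encode non-ASCII characters in the path
    return urllib.parse.urlunsplit(parts)

# A hypothetical name with an accented character, turned into a Wikipedia-style URL.
accented_url = 'https://en.wikipedia.org/wiki/' + 'José Example'.replace(' ', '_')
encoded_url = to_ascii_url(accented_url)

# A plain ASCII URL passes through unchanged.
plain_url = to_ascii_url('https://en.wikipedia.org/wiki/Jane_Example')

print(encoded_url)
print(plain_url)
```

Note that `quote` also encodes the `%` character, so applying it to a URL that is already percent-encoded would encode it a second time. It should be applied exactly once, to the raw path.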


## 1.5. Crawling: Populating the Table with Executives

In [None]:
# Use lxml.etree.HTML(...) on the HTML content of each page to get a DOM tree that
# can be processed via XPath to extract the bday information.  Store the CEO name,
# webpage, and the birthdate (born) in exec_df.

# We first check that the HTML content has a table of type `vcard`,
# and then extract the `bday` information.  If there is no birthdate, the datetime
# value is NaT ("Not a Time", the missing-value marker for datetimes).

from lxml import etree

rows = []
for page in pages:
  url = page.geturl()
  print (url)
  content = page.read().decode("utf-8")
  tree = etree.HTML(content)  #create a DOM tree of the page
  bday = tree.xpath('//table[contains(@class,"vcard")]//span[@class="bday"]/text()')
  if len(bday) > 0:
    name = url[url.rfind('/')+1:] # The part of the URL after the last /
    rows.append({'name': name, 'page': url,
                 'born': datetime.datetime.strptime(bday[0], '%Y-%m-%d')})
  else:
    rows.append({'name': url[url.rfind('/')+1:], 'page': url,
                 'born': np.datetime64('NaT')})

exec_df = pd.DataFrame(rows)
exec_df
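The XPath expression in the cell above can be tried without any crawling. The sketch below runs the same expression over two made-up HTML fragments, one with a `bday` span inside a `vcard` table and one without. The helper `extract_birthdate` and the sample HTML are hypothetical, and `pd.NaT` is used here as the missing value.

```python
import datetime

import pandas as pd
from lxml import etree

def extract_birthdate(html_text):
    """Return the birthdate found in a vcard table, or NaT if there is none."""
    tree = etree.HTML(html_text)
    # Any table whose class attribute contains "vcard", then any span inside it
    # whose class is exactly "bday"; text() returns the span's text content.
    bday = tree.xpath('//table[contains(@class,"vcard")]//span[@class="bday"]/text()')
    if bday:
        return datetime.datetime.strptime(bday[0], '%Y-%m-%d')
    return pd.NaT

page_with_bday = """
<html><body>
  <table class="infobox biography vcard">
    <tr><th>Born</th>
        <td>Jane Example (<span class="bday">1970-01-31</span>)</td></tr>
  </table>
</body></html>
"""

page_without_bday = """
<html><body>
  <table class="infobox vcard">
    <tr><th>Occupation</th><td>Executive</td></tr>
  </table>
</body></html>
"""

born = extract_birthdate(page_with_bday)
missing = extract_birthdate(page_without_bday)

print(born)
print(missing)
```

Because `contains(@class, "vcard")` is a substring test, it matches tables with several class names, whereas `@class="bday"` requires the attribute to be exactly `bday`.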

In [None]:
con.sql('create table if not exists executives as select * from exec_df')

con.table('executives').show()

## Section 1 Exercises

Extract, as a dataframe, the list of cities in Pennsylvania (see https://en.wikipedia.org/wiki/List_of_cities_in_Pennsylvania). Store these in the dataframe `pa_cities_df`.

In [None]:
pa_cities_df = # Do something here!

pa_cities_df

Let's check (and record) your answer...  You can retry until you get things right!

In [None]:
grader.grade('cities_quiz', answer=pa_cities_df)

### Here is a helper function to visualize a DOM tree

It uses the GraphViz library.  Run the next couple of cells to define `show_dom`, which takes a string of HTML.

In [None]:
!pip install graphviz

In [None]:
import graphviz
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString
import bs4
from IPython.display import Image

def visualize_dom_tree(html_content):
    """Visualizes a DOM tree from HTML content using Graphviz.

    Args:
        html_content (str): The HTML content to parse.

    Returns:
        graphviz.Digraph: The generated Graphviz graph object.
    """

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Create a Graphviz graph object
    graph = graphviz.Digraph(comment='DOM Tree Visualization')

    # Create a recursive function to traverse the DOM tree and add nodes and edges
    def add_node_and_edges(tag, parent_node):
        # Create a node for the tag
        node_id = f"node_{id(tag)}"
        if isinstance(tag, bs4.element.Tag):
          graph.node(node_id, label=tag.name)
        else:
          graph.node(node_id, label=tag.string)

        # Add an edge from the parent node to this node
        if parent_node:
            graph.edge(parent_node, node_id)

        # Recursively add child nodes
        if isinstance(tag, bs4.element.Tag):
          for child in tag.children:
              add_node_and_edges(child, node_id)

    # Start the recursion with the root element
    add_node_and_edges(soup, None)

    return graph

def show_dom(content):
  graph = visualize_dom_tree(content.strip())
  graph.render('dom_tree.gv', view=True, format='png')
  return Image('dom_tree.gv.png')

Here's a simple example using `content`, which still holds the last executive page read in Section 1.5...

In [None]:
from lxml import etree

## Simple example

tree = etree.HTML(content)  #create a DOM tree of the page
example = tree.xpath('//table[contains(@class,"vcard")]')

# example is a list of nodes
# let's take the first element of the list (in our case there should only be one)
# and let's convert it back to a Unicode string using etree.tostring(...)

show_dom(etree.tostring(example[0], encoding='unicode'))

Now let's actually do our task! You can refer back to Sections 1.4 and 1.5 for example code.

* Use `urllib.request.urlopen` to fetch https://en.wikipedia.org/wiki/List_of_women_CEOs_of_Fortune_500_companies.

* Read, decode, and parse the page into a DOM tree called `tree`.



In [None]:
page = urllib.request.urlopen( # TODO

tree = # TODO


Now compute a list called `rows` with all rows (`tr`) within the single `table` in the DOM tree. Remember to use `//` and `/` appropriately. Beware that the table has a `tbody` between the `table` and `tr`.
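As a reminder of the difference between the two separators, here is a small sketch on a made-up fragment that deliberately is not a table: `/` steps to direct children only, while `//` reaches descendants at any depth. All element ids and class names below are hypothetical.

```python
from lxml import etree

snippet = """
<div id="card">
  <span class="label">Name</span>
  <div class="details">
    <span class="label">Born</span>
  </div>
</div>
"""

toy_tree = etree.HTML(snippet)

# "/" : only spans that are direct children of the card div.
direct_children = toy_tree.xpath('//div[@id="card"]/span')

# "//" : spans anywhere underneath the card div, at any depth.
all_descendants = toy_tree.xpath('//div[@id="card"]//span')

# Spelling out each intermediate step with "/" reaches the nested span.
nested_text = toy_tree.xpath('//div[@id="card"]/div/span/text()')

print(len(direct_children), len(all_descendants), nested_text)
```

If a `/` path returns an empty list, a common cause is an intermediate element that was left out of the path.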

In [None]:
rows = tree.xpath(# TODO

show_dom(etree.tostring(rows[1], encoding='unicode'))

Using XPath, compute, in the variable `oracle_ceo`, the DOM element for the *name* of the CEO of `Oracle` corporation.  You might want to look at https://en.wikipedia.org/wiki/List_of_women_CEOs_of_Fortune_500_companies in your browser and View source.

In [None]:
oracle_ceo = tree.xpath(# TODO


# This will help you see what you got. A list of elements, a Unicode
# string, or a DOM tree
if len(oracle_ceo) > 1:
  print ('Multiple answers: {}'.format(oracle_ceo))
elif isinstance(oracle_ceo[0], etree._ElementUnicodeResult):
  print(str(oracle_ceo[0]))
else:
  show_dom(etree.tostring(oracle_ceo[0], encoding='unicode'))

In [None]:
# Once you actually get a person's name that you think is right, submit here.
grader.grade('ceo', answer=str(oracle_ceo[0]))