<a href="https://colab.research.google.com/github/zackives/upenn-cis-2450/blob/main/Module_1_Data_Acquisition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview of Part I of Big Data Analytics

As we start our journey into Big Data Analytics, the first thing we need to do is **get the data** in the form we need for analysis!  We'll start with an overview of how to acquire and *wrangle* data.

This notebook will be built incrementally to consider several tasks:

* Acquiring data from files and remote sources
* Information extraction over HTML content
* A basic "vocabulary" of operators over tables (the relational algebra)
* Basic manipulation using SQL in DuckDB

* "Data wrangling" or integration:
  * Cleaning and filtering data, using rules and rule-based operations
  * Linking data across dataframes or relations
  * The need for approximate match and record linking
  * Different techniques


## The Motivating Question
To illustrate the principles, we focus on the question of **how old company CEOs and founders** (in general, leaders) are.  The question was in part motivated by the following New York Times article:

* Founders of Successful Tech Companies Are Mostly Middle-Aged: https://www.nytimes.com/2019/08/29/business/tech-start-up-founders-nest.html?searchResultPosition=2

So let's test this hypothesis!

## Initial Libraries

We'll be using [DuckDB](https://duckdb.org/) as a means of managing our tables.  DuckDB is used like an ordinary Python library, but it provides a full SQL database, held either in memory or in a local file.  It also integrates very nicely with Pandas, so we'll use it in this course.

In [None]:
!pip3 install duckdb



In [None]:
!pip3 install lxml



In [None]:
%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

Writing notebook-config.yaml


In [None]:
!pip3 install penngrader-client

Collecting penngrader-client
  Downloading penngrader_client-0.5.2-py3-none-any.whl.metadata (15 kB)
Collecting dill (from penngrader-client)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Downloading penngrader_client-0.5.2-py3-none-any.whl (10 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Installing collected packages: dill, penngrader-client
Successfully installed dill-0.3.8 penngrader-client-0.5.2


For quiz credit you'll need to update your student ID here!

In [None]:
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW
# WHOM TO ASSIGN POINTS TO IN OUR BACKEND
STUDENT_ID = 99999999 # YOUR PENN-ID GOES HERE AS AN INTEGER

Quizzes will cumulatively count as HW9... Don't edit this...

In [None]:
%set_env HW_ID=cis2450_fall24_HW9

env: HW_ID=cis2450_fall24_HW9


In [None]:
import os
from penngrader.grader import *

grader = PennGrader('notebook-config.yaml', os.environ['HW_ID'], STUDENT_ID, STUDENT_ID)

PennGrader initialized with Student ID: 99999999

Make sure this correct or we will not be able to store your grade


In [None]:
# Imports we'll use through the notebook, collected here for simplicity

# For parsing dates and being able to compare
import datetime

# For fetching remote data (and for URL encoding and fetch errors)
import urllib.error
import urllib.parse
import urllib.request

# Pandas dataframes and operations
import pandas as pd

# Numpy matrix and array operations
import numpy as np

# DuckDB is an embedded SQL database
import duckdb

# Data visualization
import matplotlib



# 1. Acquiring External Data

To test our hypothesis, we might want:

1. A list of companies (and, for further details, perhaps their lines of business)
2. A list of company CEOs
3. Ages of the CEOs

We'll go through each of these using real data from the web.

### Reading Structured Data Sources

Let's start by looking up data about companies.  We are using a dataset from Kaggle (https://www.kaggle.com/datasets/peopledatalabssf/free-7-million-company-dataset?resource=download), but for convenient downloading we have placed a copy of it at an alternate site.

## 1.1. External CSV Data

Comma-separated values are generally easy to read. The main questions are column headings (which are in an optional row that isn't always provided) and datatypes (which might default to the wrong thing).
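
Both concerns can be seen on a tiny in-memory file before touching the real dataset. The following is a minimal sketch: the CSV text and the column names are made up for illustration and are not part of the company dataset.

```python
import io
import pandas as pd

# Hypothetical CSV with NO header row; the zip codes have leading zeros.
raw = "acme,02139,1999\nglobex,19104,2005\n"

# Default parsing treats the first line as column names and guesses the types.
naive_df = pd.read_csv(io.StringIO(raw))

# Explicit parsing: say there is no header row, supply the names ourselves,
# and keep the zip code as text so the leading zero survives.
typed_df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["name", "zip", "year_founded"],
    dtype={"zip": str},
)

print(naive_df.shape)            # the first record was consumed as the header
print(typed_df["zip"].tolist())  # zip codes kept as strings
```

With the defaults, the first record silently becomes the header and is lost as data, which is why it is worth checking the column names and `dtypes` right after loading.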

In [None]:
!wget -nc https://storage.googleapis.com/penn-cis5450/companies_sorted.csv

--2024-08-26 22:45:18--  https://storage.googleapis.com/penn-cis5450/companies_sorted.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.101.207, 142.251.2.207, 142.250.141.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.101.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1085578742 (1.0G) [text/csv]
Saving to: ‘companies_sorted.csv’


2024-08-26 22:45:36 (57.3 MB/s) - ‘companies_sorted.csv’ saved [1085578742/1085578742]



In [None]:
# We could read the CSV directly from the remote URL, but that would
# re-download the whole file every time this cell runs:
#
# data = urllib.request.urlopen(\
#        'https://storage.googleapis.com/penn-cis5450/companies_sorted.csv')
# company_data_df = pd.read_csv(data)

# Instead, we read the local copy that wget saved above.

company_data_df = pd.read_csv('companies_sorted.csv')

company_data_df

Unnamed: 0.1,Unnamed: 0,name,domain,year founded,industry,size range,locality,country,linkedin url,current employee estimate,total employee estimate
0,5872184,ibm,ibm.com,1911.0,information technology and services,10001+,"new york, new york, united states",united states,linkedin.com/company/ibm,274047,716906
1,4425416,tata consultancy services,tcs.com,1968.0,information technology and services,10001+,"bombay, maharashtra, india",india,linkedin.com/company/tata-consultancy-services,190771,341369
2,21074,accenture,accenture.com,1989.0,information technology and services,10001+,"dublin, dublin, ireland",ireland,linkedin.com/company/accenture,190689,455768
3,2309813,us army,goarmy.com,1800.0,military,10001+,"alexandria, virginia, united states",united states,linkedin.com/company/us-army,162163,445958
4,1558607,ey,ey.com,1989.0,accounting,10001+,"london, greater london, united kingdom",united kingdom,linkedin.com/company/ernstandyoung,158363,428960
...,...,...,...,...,...,...,...,...,...,...,...
7173421,1494427,certiport vouchers,certiportvouchers.com,2011.0,information technology and services,1 - 10,,,linkedin.com/company/certiport-vouchers,0,1
7173422,1494429,black tiger fight club,blacktigerclub.com,2006.0,"health, wellness and fitness",1 - 10,"peking, beijing, china",china,linkedin.com/company/black-tiger-club-hero,0,6
7173423,4768462,catholic bishop of chicago,,,religious institutions,1 - 10,"inverness, illinois, united states",united states,linkedin.com/company/catholic-bishop-of-chicago,0,1
7173424,1494436,medexo robotics ltd,,,research,1 - 10,"london, london, united kingdom",united kingdom,linkedin.com/company/medexo-robotics-ltd,0,2


Now let's use DuckDB to work with the dataframe.

In [None]:
# We can ask for the contents of a Pandas Dataframe through DuckDB, in the SQL language.
duckdb.sql('select * from company_data_df')

┌────────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┐
│ Unnamed: 0 │         name         │ … │     linkedin url     │ current employee e…  │ total employee est…  │
│   int64    │       varchar        │   │       varchar        │        int64         │        int64         │
├────────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┤
│    5872184 │ ibm                  │ … │ linkedin.com/compa…  │               274047 │               716906 │
│    4425416 │ tata consultancy s…  │ … │ linkedin.com/compa…  │               190771 │               341369 │
│      21074 │ accenture            │ … │ linkedin.com/compa…  │               190689 │               455768 │
│    2309813 │ us army              │ … │ linkedin.com/compa…  │               162163 │               445958 │
│    1558607 │ ey                   │ … │ linkedin.com/compa…  │               158363 │               428960 │
│

## 1.2. Storing Data Locally and Re-Loading it

DuckDB nicely integrates with Pandas and Python. If you create a connection to a file, this results in the creation of a database stored within that file.

Normally we need to `CREATE TABLE` with the table name and columns. But we can actually create the table to match the *schema* of the DataFrame, as follows.

In [None]:
con = duckdb.connect('local.db')
con.sql('create table if not exists company_data as select * from company_data_df')

# query the table
con.table("company_data").show()

┌────────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┐
│ Unnamed: 0 │         name         │ … │     linkedin url     │ current employee e…  │ total employee est…  │
│   int64    │       varchar        │   │       varchar        │        int64         │        int64         │
├────────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┤
│    5872184 │ ibm                  │ … │ linkedin.com/compa…  │               274047 │               716906 │
│    4425416 │ tata consultancy s…  │ … │ linkedin.com/compa…  │               190771 │               341369 │
│      21074 │ accenture            │ … │ linkedin.com/compa…  │               190689 │               455768 │
│    2309813 │ us army              │ … │ linkedin.com/compa…  │               162163 │               445958 │
│    1558607 │ ey                   │ … │ linkedin.com/compa…  │               158363 │               428960 │
│

In [None]:
company_data_df = con.table("company_data").df()

company_data_df

Unnamed: 0.1,Unnamed: 0,name,domain,year founded,industry,size range,locality,country,linkedin url,current employee estimate,total employee estimate
0,5872184,ibm,ibm.com,1911.0,information technology and services,10001+,"new york, new york, united states",united states,linkedin.com/company/ibm,274047,716906
1,4425416,tata consultancy services,tcs.com,1968.0,information technology and services,10001+,"bombay, maharashtra, india",india,linkedin.com/company/tata-consultancy-services,190771,341369
2,21074,accenture,accenture.com,1989.0,information technology and services,10001+,"dublin, dublin, ireland",ireland,linkedin.com/company/accenture,190689,455768
3,2309813,us army,goarmy.com,1800.0,military,10001+,"alexandria, virginia, united states",united states,linkedin.com/company/us-army,162163,445958
4,1558607,ey,ey.com,1989.0,accounting,10001+,"london, greater london, united kingdom",united kingdom,linkedin.com/company/ernstandyoung,158363,428960
...,...,...,...,...,...,...,...,...,...,...,...
7173421,1494427,certiport vouchers,certiportvouchers.com,2011.0,information technology and services,1 - 10,,,linkedin.com/company/certiport-vouchers,0,1
7173422,1494429,black tiger fight club,blacktigerclub.com,2006.0,"health, wellness and fitness",1 - 10,"peking, beijing, china",china,linkedin.com/company/black-tiger-club-hero,0,6
7173423,4768462,catholic bishop of chicago,,,religious institutions,1 - 10,"inverness, illinois, united states",united states,linkedin.com/company/catholic-bishop-of-chicago,0,1
7173424,1494436,medexo robotics ltd,,,research,1 - 10,"london, london, united kingdom",united kingdom,linkedin.com/company/medexo-robotics-ltd,0,2


## 1.3. Companies' CEOs: a Web Table

Now we need to figure out who the CEOs are for corporations.  One place to look is Wikipedia, which has an HTML table describing the CEOs.

https://en.wikipedia.org/wiki/List_of_chief_executive_officers#List_of_CEOs

Pandas actually makes it easy to read HTML tables...

In [None]:
# Now let's read an HTML table!

company_ceos_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_chief_executive_officers#List_of_CEOs')[1]

company_ceos_df

Unnamed: 0,Company,Executive,Title,Since,Notes,Updated
0,Accenture,Julie Sweet,CEO[1],2019,"Succeeded Pierre Nanterme, died",2019-01-31
1,Aditya Birla Group,Kumar Mangalam Birla,Chairman[2],1995[2],Part of the Birla family business house in India,2018-10-01
2,Adobe Systems,Shantanu Narayen,"Chairman, president and CEO[3]",2007,Formerly with Apple,2018-10-01
3,Airbus,Guillaume Faury,CEO[4],2012,Succeeded Louis Gallois,2017-11-14
4,Alibaba,Eddie Wu,Director and CEO[5],2023[6],,2024-07-19
...,...,...,...,...,...,...
130,Warner Brothers,Ann Sarnoff,Chairwoman and CEO[122],2019,First woman to hold the position at the compan...,2019-10-10
131,WarnerMedia,Jason Kilar,CEO[123],2020,Previously with Hulu and Amazon,2020-11-19
132,Wells Fargo,Charles Scharf,CEO and president[124],2019,"Succeeded John Stumpf, previously COO",
133,Whole Foods Market,John Mackey,CEO[125],1980,Co-founder,2017-11-11
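
Note that `pd.read_html` returns a *list* of dataframes, one per `<table>` element it finds, which is why the cell above picks an entry by position with `[1]`. Positions shift whenever the page is edited, so it can be more robust to select the table by its content. The sketch below tries this on a small hand-written HTML string; the page content is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical page with two tables; only the second holds the data we want.
html = """
<html><body>
<table><tr><th>Legend</th></tr><tr><td>ignore me</td></tr></table>
<table>
  <tr><th>Company</th><th>Executive</th></tr>
  <tr><td>Acme</td><td>Jane Doe</td></tr>
  <tr><td>Globex</td><td>John Roe</td></tr>
</table>
</body></html>
"""

# Every <table> on the page becomes one dataframe in the returned list.
tables = pd.read_html(io.StringIO(html))
print(len(tables))

# `match` keeps only the tables whose text matches the given string or regex.
ceo_tables = pd.read_html(io.StringIO(html), match="Executive")
toy_ceos_df = ceo_tables[0]
print(toy_ceos_df)
```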


In [None]:
con.sql('create table if not exists company_ceos as select * from company_ceos_df')

con.table('company_ceos').show()

┌──────────────────────┬──────────────────────┬──────────────────────┬─────────┬──────────────────────────┬────────────┐
│       Company        │      Executive       │        Title         │  Since  │          Notes           │  Updated   │
│       varchar        │       varchar        │       varchar        │ varchar │         varchar          │  varchar   │
├──────────────────────┼──────────────────────┼──────────────────────┼─────────┼──────────────────────────┼────────────┤
│ Accenture            │ Julie Sweet          │ CEO[1]               │ 2019    │ Succeeded Pierre Nante…  │ 2019-01-31 │
│ Aditya Birla Group   │ Kumar Mangalam Birla │ Chairman[2]          │ 1995[2] │ Part of the Birla fami…  │ 2018-10-01 │
│ Adobe Systems        │ Shantanu Narayen     │ Chairman, presiden…  │ 2007    │ Formerly with Apple      │ 2018-10-01 │
│ Airbus               │ Guillaume Faury      │ CEO[4]               │ 2012    │ Succeeded Louis Gallois  │ 2017-11-14 │
│ Alibaba              │ Eddie W

## 1.4. The Problem Gets Harder... Extracting Fields from Tagged Data on the Web

So far we have companies and CEOs.  But we don't have information on how old the CEOs are!

For a solution, we're going to go back to Wikipedia -- this time looking at the web pages for the CEOs!

This involves "crawling" the CEO pages, and "scraping" the relevant content.  In other words we have to do *information extraction*.  For this particular problem, we will do extraction over very regular parts of Wikipedia.

We'll start by constructing a list of CEO web pages, from the Company CEO dataframe above.  For this, we need to take the names and do a bit of tweaking, for example adding underscores instead of spaces.

(Later we'll see how to do this over more free-form text.)

In [None]:
crawl_list = []

for executive in company_ceos_df['Executive']:
  crawl_list.append('https://en.wikipedia.org/wiki/' + executive.replace(' ', '_'))

crawl_list

['https://en.wikipedia.org/wiki/Julie_Sweet',
 'https://en.wikipedia.org/wiki/Kumar_Mangalam_Birla',
 'https://en.wikipedia.org/wiki/Shantanu_Narayen',
 'https://en.wikipedia.org/wiki/Guillaume_Faury',
 'https://en.wikipedia.org/wiki/Eddie_Wu',
 'https://en.wikipedia.org/wiki/Andy_Jassy',
 'https://en.wikipedia.org/wiki/Lisa_Su',
 'https://en.wikipedia.org/wiki/Stephen_Squeri',
 'https://en.wikipedia.org/wiki/Joseph_R._Swedish',
 'https://en.wikipedia.org/wiki/Tim_Cook',
 'https://en.wikipedia.org/wiki/Lakshmi_Niwas_Mittal',
 'https://en.wikipedia.org/wiki/John_Stankey',
 'https://en.wikipedia.org/wiki/Charles_Woodburn',
 'https://en.wikipedia.org/wiki/Tapan_Singhel',
 'https://en.wikipedia.org/wiki/Carlos_Torres_Vila',
 'https://en.wikipedia.org/wiki/Brian_Moynihan',
 'https://en.wikipedia.org/wiki/C.S._Venkatakrishnan',
 'https://en.wikipedia.org/wiki/Warren_Buffett',
 'https://en.wikipedia.org/wiki/Hubert_Joly',
 'https://en.wikipedia.org/wiki/Sunil_Bharti_Mittal',
 'https://en.wiki

In [None]:
# Use urllib.request.urlopen to fetch every page in crawl_list, and store each
# response object in the list `pages`

pages = []

for url in crawl_list:
    # An issue: names with accented characters are not valid in a raw URL.
    # We split the URL, percent-encode its path with urllib.parse.quote,
    # then re-assemble the URL
    url_list = list(urllib.parse.urlsplit(url))
    url_list[2] = urllib.parse.quote(url_list[2])
    url_ascii = urllib.parse.urlunsplit(url_list)
    try:
        response = urllib.request.urlopen(url_ascii)
        # Save the response; its content is read and parsed in the next step
        pages.append(response)
    except urllib.error.URLError as e:
        # Report pages that could not be fetched and move on
        print(e.reason)
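
The two name tweaks (underscores for spaces, then percent-encoding) can be checked in isolation without any network access. This is a minimal sketch; `wiki_url` is a helper introduced here for illustration, and the accented name is made up.

```python
import urllib.parse


def wiki_url(name):
    """Build a percent-encoded Wikipedia URL from a display name."""
    path = "/wiki/" + name.replace(" ", "_")
    # quote() leaves "/" and unreserved characters (letters, digits, _ . - ~)
    # alone and encodes everything else as UTF-8 percent escapes.
    return "https://en.wikipedia.org" + urllib.parse.quote(path)


print(wiki_url("Tim Cook"))
print(wiki_url("José Núñez"))   # hypothetical name with accented characters
```

Only the path is quoted, not the whole URL, because quoting the full string would also encode the `:` in `https://`.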


## 1.5. Crawling: Populating the Table with Executives

In [None]:
# Use lxml.etree.HTML(...) on the HTML content of each page to get a DOM tree that
# can be processed via XPath to extract the bday information.  Store the CEO name,
# webpage, and the birthdate (born) in exec_df.

# We look for a table whose class includes `vcard` (the Wikipedia infobox) and
# extract the `bday` span inside it.  If there is no birthdate, the datetime
# value is NaT ("Not a Time", the missing-value marker for datetimes).

from lxml import etree

rows = []
for page in pages:
    url = page.geturl()
    print(url)
    content = page.read().decode("utf-8")
    tree = etree.HTML(content)  # create a DOM tree of the page
    bday = tree.xpath('//table[contains(@class,"vcard")]//span[@class="bday"]/text()')
    name = url[url.rfind('/')+1:]  # the part of the URL after the last /
    if len(bday) > 0:
        born = datetime.datetime.strptime(bday[0], '%Y-%m-%d')
    else:
        born = np.datetime64('NaT')
    rows.append({'name': name, 'page': url, 'born': born})

exec_df = pd.DataFrame(rows)
exec_df

https://en.wikipedia.org/wiki/Julie_Sweet
https://en.wikipedia.org/wiki/Kumar_Mangalam_Birla
https://en.wikipedia.org/wiki/Shantanu_Narayen
https://en.wikipedia.org/wiki/Guillaume_Faury
https://en.wikipedia.org/wiki/Eddie_Wu
https://en.wikipedia.org/wiki/Andy_Jassy
https://en.wikipedia.org/wiki/Lisa_Su
https://en.wikipedia.org/wiki/Stephen_Squeri
https://en.wikipedia.org/wiki/Joseph_R._Swedish
https://en.wikipedia.org/wiki/Tim_Cook
https://en.wikipedia.org/wiki/Lakshmi_Niwas_Mittal
https://en.wikipedia.org/wiki/John_Stankey
https://en.wikipedia.org/wiki/Charles_Woodburn
https://en.wikipedia.org/wiki/Tapan_Singhel
https://en.wikipedia.org/wiki/Carlos_Torres_Vila
https://en.wikipedia.org/wiki/Brian_Moynihan
https://en.wikipedia.org/wiki/C.S._Venkatakrishnan
https://en.wikipedia.org/wiki/Warren_Buffett
https://en.wikipedia.org/wiki/Hubert_Joly
https://en.wikipedia.org/wiki/Sunil_Bharti_Mittal
https://en.wikipedia.org/wiki/Stephen_A._Schwarzman
https://en.wikipedia.org/wiki/Mike_Henry
http

Unnamed: 0,name,page,born
0,Julie_Sweet,https://en.wikipedia.org/wiki/Julie_Sweet,NaT
1,Kumar_Mangalam_Birla,https://en.wikipedia.org/wiki/Kumar_Mangalam_B...,1967-06-14 00:00:00
2,Shantanu_Narayen,https://en.wikipedia.org/wiki/Shantanu_Narayen,1963-05-27 00:00:00
3,Guillaume_Faury,https://en.wikipedia.org/wiki/Guillaume_Faury,1968-02-22 00:00:00
4,Eddie_Wu,https://en.wikipedia.org/wiki/Eddie_Wu,NaT
...,...,...,...
130,Ann_Sarnoff,https://en.wikipedia.org/wiki/Ann_Sarnoff,NaT
131,Jason_Kilar,https://en.wikipedia.org/wiki/Jason_Kilar,1971-04-26 00:00:00
132,Charles_Scharf,https://en.wikipedia.org/wiki/Charles_Scharf,1965-04-24 00:00:00
133,John_Mackey,https://en.wikipedia.org/wiki/John_Mackey,NaT
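
The XPath expression used above can be tested offline on a small snippet, which is a handy way to debug it when many rows come back as `NaT`. The HTML below is a hand-written stand-in for a Wikipedia infobox, not a copy of a real page.

```python
import datetime

from lxml import etree

# Hypothetical page: one `bday` span inside the vcard table, one outside it.
html = """
<html><body>
<table class="infobox biography vcard">
  <tr><th>Born</th>
      <td>Jane Doe (<span class="bday">1970-01-31</span>)</td></tr>
</table>
<p>Unrelated text <span class="bday">1900-01-01</span></p>
</body></html>
"""

tree = etree.HTML(html)

# contains(@class, "vcard") matches even though the class attribute lists
# several names; the second span is skipped because it is not in that table.
bday = tree.xpath('//table[contains(@class,"vcard")]//span[@class="bday"]/text()')
print(bday)

born = datetime.datetime.strptime(bday[0], "%Y-%m-%d")
print(born)
```

Pages without an infobox, or with an infobox that lists only a birth year, produce an empty list here, which is consistent with the `NaT` entries in the table above.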


In [None]:
con.sql('create table if not exists executives as select * from exec_df')

con.table('executives').show()

┌──────────────────────┬────────────────────────────────────────────────────┬─────────────────────┐
│         name         │                        page                        │        born         │
│       varchar        │                      varchar                       │      timestamp      │
├──────────────────────┼────────────────────────────────────────────────────┼─────────────────────┤
│ Julie_Sweet          │ https://en.wikipedia.org/wiki/Julie_Sweet          │ NULL                │
│ Kumar_Mangalam_Birla │ https://en.wikipedia.org/wiki/Kumar_Mangalam_Birla │ 1967-06-14 00:00:00 │
│ Shantanu_Narayen     │ https://en.wikipedia.org/wiki/Shantanu_Narayen     │ 1963-05-27 00:00:00 │
│ Guillaume_Faury      │ https://en.wikipedia.org/wiki/Guillaume_Faury      │ 1968-02-22 00:00:00 │
│ Eddie_Wu             │ https://en.wikipedia.org/wiki/Eddie_Wu             │ NULL                │
│ Andy_Jassy           │ https://en.wikipedia.org/wiki/Andy_Jassy           │ 1968-01-13 00:00:00 │


## Section 1 Exercises

Extract, as a dataframe, the list of cities in Pennsylvania (see https://en.wikipedia.org/wiki/List_of_cities_in_Pennsylvania). Store these in the dataframe `pa_cities_df`.

In [None]:
pa_cities_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_cities_in_Pennsylvania')[0]

pa_cities_df

Unnamed: 0,Name,Type,County[3],Class,Population (2020 Census),Incorporation date (as city),Area (sq miles)[4],Area (km2)
0,Aliquippa,City,Beaver,Third,"9,238[5]",1987,4.19,10.9
1,Allentown†,City,Lehigh,Third,"125,845[6]",1867,17.55,45.5
2,Altoona,City,Blair,Third,"43,963[7]",1868,9.91,25.7
3,Arnold,City,Westmoreland,Third,"4,772[8]",1939,0.73,1.9
4,Beaver Falls,City,Beaver,Third,"9,005[9]",1928,2.13,5.5
5,Bethlehem,City,Lehigh Northampton,Third,"75,781[10]",1917,19.1,49.5
6,Bradford,City,McKean,Third,"7,849[11]",1879,3.35,8.7
7,Butler†,City,Butler,Third,"13,502[12]",1816,2.72,7.0
8,Carbondale,City,Lackawanna,Third,"8,828[13]",1851,3.24,8.4
9,Chester,City,Delaware,Third,"32,605[14]",1866,4.84,12.5


Let's check (and record) your answer...  You can retry until you get things right!

In [None]:
grader.grade('cities_quiz', answer=pa_cities_df)

Correct! You earned 1/1 points. You are a star!

Your submission has been successfully recorded in the gradebook.
