<a href="https://colab.research.google.com/github/zackives/upenn-cis-2450/blob/main/Module_1_Data_Acquisition_Wrangling_Linking_iii.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview of Part I of Big Data Analytics

As we start our journey into Big Data Analytics, the first thing we need to do is **get the data** in the form we need for analysis!  We'll start with an overview of how to acquire and *wrangle* data.

This notebook will be built incrementally to consider several tasks:

* Acquiring data from files and remote sources
* Information extraction over HTML content
* A basic "vocabulary" of operators over tables (the relational algebra)
* Basic manipulation using SQL in DuckDB

* "Data wrangling" or integration:
  * Cleaning and filtering data, using rules and rule-based operations
  * Linking data across dataframes or relations
  * The need for approximate match and record linking
  * Different techniques


## The Motivating Question
To illustrate the principles, we focus on the question of **how old company CEOs and founders** (in general, leaders) are.  The question was in part motivated by the following New York Times article:

* Founders of Successful Tech Companies Are Mostly Middle-Aged: https://www.nytimes.com/2019/08/29/business/tech-start-up-founders-nest.html?searchResultPosition=2

So let's test this hypothesis!

## Initial Libraries

We'll be using [DuckDB](https://duckdb.org/) as a means of managing our tables.  DuckDB works like a Python library, but manages a full SQL database (in files).  It also integrates very nicely with Pandas, so we'll use it in this course.

In [None]:
!pip3 install duckdb



In [None]:
!pip3 install lxml



In [None]:
%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

Writing notebook-config.yaml


In [None]:
!pip3 install penngrader-client

Collecting penngrader-client
  Downloading penngrader_client-0.5.2-py3-none-any.whl.metadata (15 kB)
Collecting dill (from penngrader-client)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Downloading penngrader_client-0.5.2-py3-none-any.whl (10 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Installing collected packages: dill, penngrader-client
Successfully installed dill-0.3.8 penngrader-client-0.5.2


For quiz credit you'll need to update your student ID here!

In [None]:
#PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW WHO
#TO ASSIGN POINTS TO YOU IN OUR BACKEND
STUDENT_ID = 99999999 # YOUR PENN-ID GOES HERE AS AN INTEGER

Quizzes will cumulatively count as HW9... Don't edit this...

In [None]:
%set_env HW_ID=cis2450_fall24_HW9

env: HW_ID=cis2450_fall24_HW9


In [None]:
import os
from penngrader.grader import *

grader = PennGrader('notebook-config.yaml', os.environ['HW_ID'], STUDENT_ID, STUDENT_ID)

PennGrader initialized with Student ID: 99999999

Make sure this correct or we will not be able to store your grade


In [None]:
# Imports we'll use through the notebook, collected here for simplicity

# For parsing dates and being able to compare
import datetime

# For fetching remote data (parse and error are used when crawling below)
import urllib.error
import urllib.parse
import urllib.request

# Pandas dataframes and operations
import pandas as pd

# Numpy matrix and array operations
import numpy as np

# DuckDB is an in-process SQL database
import duckdb

# Data visualization
import matplotlib



# 1. Acquiring External Data

To test our hypothesis, we might want:

1. A list of companies (and, for further details, perhaps their lines of business)
2. A list of company CEOs
3. Ages of the CEOs

We'll go through each of these using real data from the web.

### Reading Structured Data Sources

Let's start by looking up data about companies.  We are using a dataset from: https://www.kaggle.com/datasets/peopledatalabssf/free-7-million-company-dataset?resource=download

but we have a copy of it at an alternate site for convenience of downloading.

## 1.1. External CSV Data

Comma-separated values are generally easy to read. The main questions are column headings (which are in an optional row that isn't always provided) and datatypes (which might default to the wrong thing).
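
To make those two questions concrete, here is a minimal sketch on a tiny in-memory CSV with no header row. The column names and the choice of the nullable `string` and `Int64` dtypes are assumptions made for illustration; they are not settings this notebook uses when it loads the real file. The rows echo values that appear in the company dataset.

```python
import io

import pandas as pd

# A tiny CSV with no header row; the last row has missing fields.
raw = (
    "ibm,ibm.com,1911\n"
    "accenture,accenture.com,1989\n"
    "medexo robotics ltd,,\n"
)

# header=None tells pandas that the first line is data, not column names.
# names= supplies the headings, and dtype= pins the types instead of relying
# on inference (which would turn the year into a float because of the gap).
toy_df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["name", "domain", "year_founded"],
    dtype={"name": "string", "domain": "string", "year_founded": "Int64"},
)

print(toy_df.dtypes)
print(toy_df)
```

With default inference, a column of years that contains a missing value is read as floating point, which is why the real dataset displays values such as `1911.0`.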

In [None]:
!wget -nc https://storage.googleapis.com/penn-cis5450/companies_sorted.csv

--2024-08-26 22:45:18--  https://storage.googleapis.com/penn-cis5450/companies_sorted.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.101.207, 142.251.2.207, 142.250.141.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.101.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1085578742 (1.0G) [text/csv]
Saving to: ‘companies_sorted.csv’


2024-08-26 22:45:36 (57.3 MB/s) - ‘companies_sorted.csv’ saved [1085578742/1085578742]



In [None]:
# We could read the CSV directly from the remote URL:

# data = urllib.request.urlopen(\
#        'https://storage.googleapis.com/penn-cis5450/companies_sorted.csv')
# company_data_df = pd.read_csv(data)

# ...but to avoid repeated fetches, we read the local copy downloaded by wget above.

company_data_df = pd.read_csv('companies_sorted.csv')

company_data_df

Unnamed: 0.1,Unnamed: 0,name,domain,year founded,industry,size range,locality,country,linkedin url,current employee estimate,total employee estimate
0,5872184,ibm,ibm.com,1911.0,information technology and services,10001+,"new york, new york, united states",united states,linkedin.com/company/ibm,274047,716906
1,4425416,tata consultancy services,tcs.com,1968.0,information technology and services,10001+,"bombay, maharashtra, india",india,linkedin.com/company/tata-consultancy-services,190771,341369
2,21074,accenture,accenture.com,1989.0,information technology and services,10001+,"dublin, dublin, ireland",ireland,linkedin.com/company/accenture,190689,455768
3,2309813,us army,goarmy.com,1800.0,military,10001+,"alexandria, virginia, united states",united states,linkedin.com/company/us-army,162163,445958
4,1558607,ey,ey.com,1989.0,accounting,10001+,"london, greater london, united kingdom",united kingdom,linkedin.com/company/ernstandyoung,158363,428960
...,...,...,...,...,...,...,...,...,...,...,...
7173421,1494427,certiport vouchers,certiportvouchers.com,2011.0,information technology and services,1 - 10,,,linkedin.com/company/certiport-vouchers,0,1
7173422,1494429,black tiger fight club,blacktigerclub.com,2006.0,"health, wellness and fitness",1 - 10,"peking, beijing, china",china,linkedin.com/company/black-tiger-club-hero,0,6
7173423,4768462,catholic bishop of chicago,,,religious institutions,1 - 10,"inverness, illinois, united states",united states,linkedin.com/company/catholic-bishop-of-chicago,0,1
7173424,1494436,medexo robotics ltd,,,research,1 - 10,"london, london, united kingdom",united kingdom,linkedin.com/company/medexo-robotics-ltd,0,2


Now let's use DuckDB to work with the dataframe.

In [None]:
# We can ask for the contents of a Pandas Dataframe through DuckDB, in the SQL language.
duckdb.sql('select * from company_data_df')

┌────────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┐
│ Unnamed: 0 │         name         │ … │     linkedin url     │ current employee e…  │ total employee est…  │
│   int64    │       varchar        │   │       varchar        │        int64         │        int64         │
├────────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┤
│    5872184 │ ibm                  │ … │ linkedin.com/compa…  │               274047 │               716906 │
│    4425416 │ tata consultancy s…  │ … │ linkedin.com/compa…  │               190771 │               341369 │
│      21074 │ accenture            │ … │ linkedin.com/compa…  │               190689 │               455768 │
│    2309813 │ us army              │ … │ linkedin.com/compa…  │               162163 │               445958 │
│    1558607 │ ey                   │ … │ linkedin.com/compa…  │               158363 │               428960 │
│

## 1.2. Storing Data Locally and Re-Loading it

DuckDB nicely integrates with Pandas and Python. If you create a connection to a file, this results in the creation of a database stored within that file.

Normally we need to `CREATE TABLE` with the table name and columns. But we can actually create the table to match the *schema* of the DataFrame, as follows.

In [None]:
con = duckdb.connect('local.db')
con.sql('create table if not exists company_data as select * from company_data_df')

# query the table
con.table("company_data").show()

┌────────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┐
│ Unnamed: 0 │         name         │ … │     linkedin url     │ current employee e…  │ total employee est…  │
│   int64    │       varchar        │   │       varchar        │        int64         │        int64         │
├────────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┤
│    5872184 │ ibm                  │ … │ linkedin.com/compa…  │               274047 │               716906 │
│    4425416 │ tata consultancy s…  │ … │ linkedin.com/compa…  │               190771 │               341369 │
│      21074 │ accenture            │ … │ linkedin.com/compa…  │               190689 │               455768 │
│    2309813 │ us army              │ … │ linkedin.com/compa…  │               162163 │               445958 │
│    1558607 │ ey                   │ … │ linkedin.com/compa…  │               158363 │               428960 │
│

In [None]:
company_data_df = con.table("company_data").df()

company_data_df

Unnamed: 0.1,Unnamed: 0,name,domain,year founded,industry,size range,locality,country,linkedin url,current employee estimate,total employee estimate
0,5872184,ibm,ibm.com,1911.0,information technology and services,10001+,"new york, new york, united states",united states,linkedin.com/company/ibm,274047,716906
1,4425416,tata consultancy services,tcs.com,1968.0,information technology and services,10001+,"bombay, maharashtra, india",india,linkedin.com/company/tata-consultancy-services,190771,341369
2,21074,accenture,accenture.com,1989.0,information technology and services,10001+,"dublin, dublin, ireland",ireland,linkedin.com/company/accenture,190689,455768
3,2309813,us army,goarmy.com,1800.0,military,10001+,"alexandria, virginia, united states",united states,linkedin.com/company/us-army,162163,445958
4,1558607,ey,ey.com,1989.0,accounting,10001+,"london, greater london, united kingdom",united kingdom,linkedin.com/company/ernstandyoung,158363,428960
...,...,...,...,...,...,...,...,...,...,...,...
7173421,1494427,certiport vouchers,certiportvouchers.com,2011.0,information technology and services,1 - 10,,,linkedin.com/company/certiport-vouchers,0,1
7173422,1494429,black tiger fight club,blacktigerclub.com,2006.0,"health, wellness and fitness",1 - 10,"peking, beijing, china",china,linkedin.com/company/black-tiger-club-hero,0,6
7173423,4768462,catholic bishop of chicago,,,religious institutions,1 - 10,"inverness, illinois, united states",united states,linkedin.com/company/catholic-bishop-of-chicago,0,1
7173424,1494436,medexo robotics ltd,,,research,1 - 10,"london, london, united kingdom",united kingdom,linkedin.com/company/medexo-robotics-ltd,0,2


## 1.3. Companies' CEOs: a Web Table

Now we need to figure out who the CEOs are for corporations.  One place to look is Wikipedia, which has an HTML table describing the CEOs.

https://en.wikipedia.org/wiki/List_of_chief_executive_officers#List_of_CEOs

Pandas actually makes it easy to read HTML tables...

In [None]:
# Now let's read an HTML table!

company_ceos_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_chief_executive_officers#List_of_CEOs')[1]

company_ceos_df

Unnamed: 0,Company,Executive,Title,Since,Notes,Updated
0,Accenture,Julie Sweet,CEO[1],2019,"Succeeded Pierre Nanterme, died",2019-01-31
1,Aditya Birla Group,Kumar Mangalam Birla,Chairman[2],1995[2],Part of the Birla family business house in India,2018-10-01
2,Adobe Systems,Shantanu Narayen,"Chairman, president and CEO[3]",2007,Formerly with Apple,2018-10-01
3,Airbus,Guillaume Faury,CEO[4],2012,Succeeded Louis Gallois,2017-11-14
4,Alibaba,Eddie Wu,Director and CEO[5],2023[6],,2024-07-19
...,...,...,...,...,...,...
130,Warner Brothers,Ann Sarnoff,Chairwoman and CEO[122],2019,First woman to hold the position at the compan...,2019-10-10
131,WarnerMedia,Jason Kilar,CEO[123],2020,Previously with Hulu and Amazon,2020-11-19
132,Wells Fargo,Charles Scharf,CEO and president[124],2019,"Succeeded John Stumpf, previously COO",
133,Whole Foods Market,John Mackey,CEO[125],1980,Co-founder,2017-11-11


In [None]:
con.sql('create table if not exists company_ceos as select * from company_ceos_df')

con.table('company_ceos').show()

┌──────────────────────┬──────────────────────┬──────────────────────┬─────────┬──────────────────────────┬────────────┐
│       Company        │      Executive       │        Title         │  Since  │          Notes           │  Updated   │
│       varchar        │       varchar        │       varchar        │ varchar │         varchar          │  varchar   │
├──────────────────────┼──────────────────────┼──────────────────────┼─────────┼──────────────────────────┼────────────┤
│ Accenture            │ Julie Sweet          │ CEO[1]               │ 2019    │ Succeeded Pierre Nante…  │ 2019-01-31 │
│ Aditya Birla Group   │ Kumar Mangalam Birla │ Chairman[2]          │ 1995[2] │ Part of the Birla fami…  │ 2018-10-01 │
│ Adobe Systems        │ Shantanu Narayen     │ Chairman, presiden…  │ 2007    │ Formerly with Apple      │ 2018-10-01 │
│ Airbus               │ Guillaume Faury      │ CEO[4]               │ 2012    │ Succeeded Louis Gallois  │ 2017-11-14 │
│ Alibaba              │ Eddie W

## 1.4. The Problem Gets Harder... Extracting Fields from Tagged Data on the Web

So far we have companies and CEOs.  But we don't have information on how old the CEOs are!

For a solution, we're going to go back to Wikipedia -- this time looking at the web pages for the CEOs!

This involves "crawling" the CEO pages, and "scraping" the relevant content.  In other words we have to do *information extraction*.  For this particular problem, we will do extraction over very regular parts of Wikipedia.

We'll start by constructing a list of CEO web pages, from the Company CEO dataframe above.  For this, we need to take the names and do a bit of tweaking, for example adding underscores instead of spaces.

(Later we'll see how to do this over more free-form text.)

In [None]:
crawl_list = []

for executive in company_ceos_df['Executive']:
  crawl_list.append('https://en.wikipedia.org/wiki/' + executive.replace(' ', '_'))

crawl_list

['https://en.wikipedia.org/wiki/Julie_Sweet',
 'https://en.wikipedia.org/wiki/Kumar_Mangalam_Birla',
 'https://en.wikipedia.org/wiki/Shantanu_Narayen',
 'https://en.wikipedia.org/wiki/Guillaume_Faury',
 'https://en.wikipedia.org/wiki/Eddie_Wu',
 'https://en.wikipedia.org/wiki/Andy_Jassy',
 'https://en.wikipedia.org/wiki/Lisa_Su',
 'https://en.wikipedia.org/wiki/Stephen_Squeri',
 'https://en.wikipedia.org/wiki/Joseph_R._Swedish',
 'https://en.wikipedia.org/wiki/Tim_Cook',
 'https://en.wikipedia.org/wiki/Lakshmi_Niwas_Mittal',
 'https://en.wikipedia.org/wiki/John_Stankey',
 'https://en.wikipedia.org/wiki/Charles_Woodburn',
 'https://en.wikipedia.org/wiki/Tapan_Singhel',
 'https://en.wikipedia.org/wiki/Carlos_Torres_Vila',
 'https://en.wikipedia.org/wiki/Brian_Moynihan',
 'https://en.wikipedia.org/wiki/C.S._Venkatakrishnan',
 'https://en.wikipedia.org/wiki/Warren_Buffett',
 'https://en.wikipedia.org/wiki/Hubert_Joly',
 'https://en.wikipedia.org/wiki/Sunil_Bharti_Mittal',
 'https://en.wiki

In [None]:
# Use urllib.request.urlopen to crawl all pages in crawl_list, and store the
# response for each page in the list pages

pages = []

for url in crawl_list:
    # An issue: accented characters are not valid in a raw URL, so we need to
    # percent-encode them.  We split the URL, use "parse.quote" on the path
    # component, then re-form the URL
    url_list = list(urllib.parse.urlsplit(url))
    url_list[2] = urllib.parse.quote(url_list[2])
    url_ascii = urllib.parse.urlunsplit(url_list)
    try:
        response = urllib.request.urlopen(url_ascii)
        # Save the response for later use
        pages.append(response)
    except urllib.error.URLError as e:
        print(e.reason)
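
The percent-encoding step can be tried in isolation, without any network access. This is a minimal sketch: `to_wiki_url` is a hypothetical helper introduced only for illustration, and it combines the underscore substitution with the split/quote/unsplit steps used in the crawl loop.

```python
import urllib.parse


def to_wiki_url(name):
    """Build a Wikipedia URL from a person's name (hypothetical helper)."""
    url = 'https://en.wikipedia.org/wiki/' + name.replace(' ', '_')
    parts = list(urllib.parse.urlsplit(url))
    # Index 2 is the path component; quote() percent-encodes non-ASCII
    # characters as UTF-8 bytes and leaves '/', '_' and '.' untouched.
    parts[2] = urllib.parse.quote(parts[2])
    return urllib.parse.urlunsplit(parts)


print(to_wiki_url('Tim Cook'))
print(to_wiki_url('Ola Källenius'))
```

A plain ASCII name passes through unchanged, while the accented name comes out as `Ola_K%C3%A4llenius`, the same form that shows up in the crawled URLs.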


## 1.5. Crawling: Populating the Table with Executives
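
Before running the extraction over every crawled page, it may help to see the XPath query on a small, hand-written HTML snippet. This is a sketch under assumptions: the snippet only imitates the shape of a Wikipedia infobox, and `extract_birthdate` is a hypothetical helper that returns `None` rather than NaT when no birthdate is found.

```python
import datetime

from lxml import etree


def extract_birthdate(html):
    """Return the infobox birthdate as a datetime, or None (hypothetical helper)."""
    tree = etree.HTML(html)  # parse the HTML into a DOM tree
    # Find a table whose class attribute contains "vcard", then any
    # descendant <span class="bday">, and take its text content.
    bday = tree.xpath('//table[contains(@class,"vcard")]//span[@class="bday"]/text()')
    if not bday:
        return None
    return datetime.datetime.strptime(bday[0], '%Y-%m-%d')


# Hand-written stand-ins for real pages (assumed structure, for illustration).
with_bday = '''<html><body><table class="infobox biography vcard">
<tr><th>Born</th><td><span class="bday">1967-06-14</span></td></tr>
</table></body></html>'''
without_bday = '<html><body><p>No infobox here.</p></body></html>'

print(extract_birthdate(with_bday))
print(extract_birthdate(without_bday))
```

Using `contains(@class, "vcard")` instead of an exact match matters because the table usually carries several class names at once.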

In [None]:
# Use lxml.etree.HTML(...) on the HTML content of each page to get a DOM tree that
# can be processed via XPath to extract the bday information.  Store the CEO name,
# webpage, and the birthdate (born) in exec_df.

# We first check that the HTML content has a table whose class includes `vcard`,
# and then extract the `bday` information.  If there is no birthdate, the datetime
# value is NaT (Not a Time).

from lxml import etree

rows = []
for page in pages:
  url = page.geturl()
  print(url)
  content = page.read().decode("utf-8")
  tree = etree.HTML(content)  # create a DOM tree of the page
  bday = tree.xpath('//table[contains(@class,"vcard")]//span[@class="bday"]/text()')
  name = url[url.rfind('/')+1:]  # the part of the URL after the last /
  if len(bday) > 0:
    born = datetime.datetime.strptime(bday[0], '%Y-%m-%d')
  else:
    born = np.datetime64('NaT')
  rows.append({'name': name, 'page': url, 'born': born})

exec_df = pd.DataFrame(rows)
exec_df

https://en.wikipedia.org/wiki/Julie_Sweet
https://en.wikipedia.org/wiki/Kumar_Mangalam_Birla
https://en.wikipedia.org/wiki/Shantanu_Narayen
https://en.wikipedia.org/wiki/Guillaume_Faury
https://en.wikipedia.org/wiki/Eddie_Wu
https://en.wikipedia.org/wiki/Andy_Jassy
https://en.wikipedia.org/wiki/Lisa_Su
https://en.wikipedia.org/wiki/Stephen_Squeri
https://en.wikipedia.org/wiki/Joseph_R._Swedish
https://en.wikipedia.org/wiki/Tim_Cook
https://en.wikipedia.org/wiki/Lakshmi_Niwas_Mittal
https://en.wikipedia.org/wiki/John_Stankey
https://en.wikipedia.org/wiki/Charles_Woodburn
https://en.wikipedia.org/wiki/Tapan_Singhel
https://en.wikipedia.org/wiki/Carlos_Torres_Vila
https://en.wikipedia.org/wiki/Brian_Moynihan
https://en.wikipedia.org/wiki/C.S._Venkatakrishnan
https://en.wikipedia.org/wiki/Warren_Buffett
https://en.wikipedia.org/wiki/Hubert_Joly
https://en.wikipedia.org/wiki/Sunil_Bharti_Mittal
https://en.wikipedia.org/wiki/Stephen_A._Schwarzman
https://en.wikipedia.org/wiki/Mike_Henry
http

Unnamed: 0,name,page,born
0,Julie_Sweet,https://en.wikipedia.org/wiki/Julie_Sweet,NaT
1,Kumar_Mangalam_Birla,https://en.wikipedia.org/wiki/Kumar_Mangalam_B...,1967-06-14 00:00:00
2,Shantanu_Narayen,https://en.wikipedia.org/wiki/Shantanu_Narayen,1963-05-27 00:00:00
3,Guillaume_Faury,https://en.wikipedia.org/wiki/Guillaume_Faury,1968-02-22 00:00:00
4,Eddie_Wu,https://en.wikipedia.org/wiki/Eddie_Wu,NaT
...,...,...,...
130,Ann_Sarnoff,https://en.wikipedia.org/wiki/Ann_Sarnoff,NaT
131,Jason_Kilar,https://en.wikipedia.org/wiki/Jason_Kilar,1971-04-26 00:00:00
132,Charles_Scharf,https://en.wikipedia.org/wiki/Charles_Scharf,1965-04-24 00:00:00
133,John_Mackey,https://en.wikipedia.org/wiki/John_Mackey,NaT


In [None]:
con.sql('create table if not exists executives as select * from exec_df')

con.table('executives').show()

┌──────────────────────┬────────────────────────────────────────────────────┬─────────────────────┐
│         name         │                        page                        │        born         │
│       varchar        │                      varchar                       │      timestamp      │
├──────────────────────┼────────────────────────────────────────────────────┼─────────────────────┤
│ Julie_Sweet          │ https://en.wikipedia.org/wiki/Julie_Sweet          │ NULL                │
│ Kumar_Mangalam_Birla │ https://en.wikipedia.org/wiki/Kumar_Mangalam_Birla │ 1967-06-14 00:00:00 │
│ Shantanu_Narayen     │ https://en.wikipedia.org/wiki/Shantanu_Narayen     │ 1963-05-27 00:00:00 │
│ Guillaume_Faury      │ https://en.wikipedia.org/wiki/Guillaume_Faury      │ 1968-02-22 00:00:00 │
│ Eddie_Wu             │ https://en.wikipedia.org/wiki/Eddie_Wu             │ NULL                │
│ Andy_Jassy           │ https://en.wikipedia.org/wiki/Andy_Jassy           │ 1968-01-13 00:00:00 │


# 2.0 Data Transformation and Querying

We'll begin by looking at the data we need to clean, using *projection*: selecting a subset of the columns.

Generally, we can extract one "narrower" table from another by using **double brackets**.
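
The difference between double and single brackets is easiest to see from the types that come back. A minimal sketch on a two-row toy dataframe (the rows mirror values from `exec_df`; the name `toy_df` is only for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'name': ['Kumar_Mangalam_Birla', 'Lisa_Su'],
    'page': ['https://en.wikipedia.org/wiki/Kumar_Mangalam_Birla',
             'https://en.wikipedia.org/wiki/Lisa_Su'],
    'born': pd.to_datetime(['1967-06-14', '1969-11-07']),
})

# Double brackets: a list of column names goes in, a DataFrame comes out.
narrow = toy_df[['name', 'born']]

# Single brackets: one column name goes in, a Series comes out.
column = toy_df['name']

print(type(narrow).__name__, list(narrow.columns))
print(type(column).__name__)
```

The inner pair of brackets is just a Python list, so `toy_df[['name']]` still yields a one-column DataFrame rather than a Series.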

In [None]:
# Let's take a look at the data.  Here's a way of PROJECTING the exec_df dataframe into
# a smaller table

exec_df[['name', 'born']]

Unnamed: 0,name,born
0,Julie_Sweet,NaT
1,Kumar_Mangalam_Birla,1967-06-14 00:00:00
2,Shantanu_Narayen,1963-05-27 00:00:00
3,Guillaume_Faury,1968-02-22 00:00:00
4,Eddie_Wu,NaT
...,...,...
130,Ann_Sarnoff,NaT
131,Jason_Kilar,1971-04-26 00:00:00
132,Charles_Scharf,1965-04-24 00:00:00
133,John_Mackey,NaT


In [None]:
# In SQL it's SELECT with the fields FROM the table
con.sql('select name, born from executives')

┌──────────────────────┬─────────────────────┐
│         name         │        born         │
│       varchar        │      timestamp      │
├──────────────────────┼─────────────────────┤
│ Julie_Sweet          │ NULL                │
│ Kumar_Mangalam_Birla │ 1967-06-14 00:00:00 │
│ Shantanu_Narayen     │ 1963-05-27 00:00:00 │
│ Guillaume_Faury      │ 1968-02-22 00:00:00 │
│ Eddie_Wu             │ NULL                │
│ Andy_Jassy           │ 1968-01-13 00:00:00 │
│ Lisa_Su              │ 1969-11-07 00:00:00 │
│ Stephen_Squeri       │ NULL                │
│ Joseph_R._Swedish    │ 1951-05-17 00:00:00 │
│ Tim_Cook             │ 1960-11-01 00:00:00 │
│    ·                 │          ·          │
│    ·                 │          ·          │
│    ·                 │          ·          │
│ Vittorio_Colao       │ 1961-10-03 00:00:00 │
│ Herbert_Diess        │ 1958-10-24 00:00:00 │
│ Robert_Iger          │ 1951-02-10 00:00:00 │
│ Stefano_Pessina      │ 1941-06-04 00:00:00 │
│ Doug_McMill

In [None]:
# If I use single brackets, I can extract a single column as a Series.
exec_df['name']

Unnamed: 0,name
0,Julie_Sweet
1,Kumar_Mangalam_Birla
2,Shantanu_Narayen
3,Guillaume_Faury
4,Eddie_Wu
...,...
130,Ann_Sarnoff
131,Jason_Kilar
132,Charles_Scharf
133,John_Mackey


In [None]:
# We can also run SQL directly over the dataframe, rather than over the stored table
duckdb.sql('select name from exec_df')

┌──────────────────────┐
│         name         │
│       varchar        │
├──────────────────────┤
│ Julie_Sweet          │
│ Kumar_Mangalam_Birla │
│ Shantanu_Narayen     │
│ Guillaume_Faury      │
│ Eddie_Wu             │
│ Andy_Jassy           │
│ Lisa_Su              │
│ Stephen_Squeri       │
│ Joseph_R._Swedish    │
│ Tim_Cook             │
│    ·                 │
│    ·                 │
│    ·                 │
│ Vittorio_Colao       │
│ Herbert_Diess        │
│ Robert_Iger          │
│ Stefano_Pessina      │
│ Doug_McMillon        │
│ Ann_Sarnoff          │
│ Jason_Kilar          │
│ Charles_Scharf       │
│ John_Mackey          │
│ Rich_Barton          │
├──────────────────────┤
│ 135 rows (20 shown)  │
└──────────────────────┘

In [None]:
# Notice anything awry?

for person in exec_df['name']:
    print (person)

Julie_Sweet
Kumar_Mangalam_Birla
Shantanu_Narayen
Guillaume_Faury
Eddie_Wu
Andy_Jassy
Lisa_Su
Stephen_Squeri
Joseph_R._Swedish
Tim_Cook
Lakshmi_Niwas_Mittal
John_Stankey
Charles_Woodburn
Tapan_Singhel
Carlos_Torres_Vila
Brian_Moynihan
C.S._Venkatakrishnan
Warren_Buffett
Hubert_Joly
Sunil_Bharti_Mittal
Stephen_A._Schwarzman
Mike_Henry
Oliver_Zipse
Dave_Calhoun
Rich_Lesser
Bob_Dudley
Hock_Tan
Denise_Morrison
Mark_Shuttleworth
Richard_Fairbank
Jim_Umpleby
Evan_Greenberg
Chuck_Robbins
Jane_Fraser
James_Quincey
Brian_L._Roberts
Thomas_Gottstein
Ola_K%C3%A4llenius
Michael_Dell
Ed_Bastian
Christian_Sewing
Tobias_Meyer
Edward_D._Breen
Devin_Wenig
B%C3%B6rje_Ekholm
Darren_Woods
Carmine_Di_Sibio
Mark_Zuckerberg
Frederick_W._Smith
Sergio_Marchionne
Abigail_Johnson
James_Hackett
Terry_Gou
Lachlan_Murdoch
Phebe_Novakovic
H._Lawrence_Culp_Jr.
Mary_T._Barra
Emma_Walmsley
David_M._Solomon
Sundar_Pichai
C_Vijayakumar
Antonio_Neri
Darius_Adamczyk
Noel_Quinn
Arvind_Krishna
Salil_Parekh
Pat_Gelsinger
Jame

In [None]:
def to_space(x):
  return x.replace('_', ' ')

# Let's use *apply* to call a function over each element, returning a new Series
exec_df['name'].apply(to_space)

Unnamed: 0,name
0,Julie Sweet
1,Kumar Mangalam Birla
2,Shantanu Narayen
3,Guillaume Faury
4,Eddie Wu
...,...
130,Ann Sarnoff
131,Jason Kilar
132,Charles Scharf
133,John Mackey


In [None]:
# The same thing, using a lambda (anonymous function) in place of to_space
exec_df['name'].apply(lambda x: x.replace('_', ' '))

Unnamed: 0,name
0,Julie Sweet
1,Kumar Mangalam Birla
2,Shantanu Narayen
3,Guillaume Faury
4,Eddie Wu
...,...
130,Ann Sarnoff
131,Jason Kilar
132,Charles Scharf
133,John Mackey


In [None]:
# I can also use *apply* to call a function over the rows of a dataframe
exec_df.apply(lambda x: x['name'].replace('_', ' '), axis=1)

Unnamed: 0,0
0,Julie Sweet
1,Kumar Mangalam Birla
2,Shantanu Narayen
3,Guillaume Faury
4,Eddie Wu
...,...
130,Ann Sarnoff
131,Jason Kilar
132,Charles Scharf
133,John Mackey


In [None]:
# Let's clean the name by removing underscores...
exec_df['clean_name'] = exec_df['name'].apply(lambda x: x.replace('_', ' '))

exec_df

Unnamed: 0,name,page,born,clean_name
0,Julie_Sweet,https://en.wikipedia.org/wiki/Julie_Sweet,NaT,Julie Sweet
1,Kumar_Mangalam_Birla,https://en.wikipedia.org/wiki/Kumar_Mangalam_Birla,1967-06-14 00:00:00,Kumar Mangalam Birla
2,Shantanu_Narayen,https://en.wikipedia.org/wiki/Shantanu_Narayen,1963-05-27 00:00:00,Shantanu Narayen
3,Guillaume_Faury,https://en.wikipedia.org/wiki/Guillaume_Faury,1968-02-22 00:00:00,Guillaume Faury
4,Eddie_Wu,https://en.wikipedia.org/wiki/Eddie_Wu,NaT,Eddie Wu
...,...,...,...,...
130,Ann_Sarnoff,https://en.wikipedia.org/wiki/Ann_Sarnoff,NaT,Ann Sarnoff
131,Jason_Kilar,https://en.wikipedia.org/wiki/Jason_Kilar,1971-04-26 00:00:00,Jason Kilar
132,Charles_Scharf,https://en.wikipedia.org/wiki/Charles_Scharf,1965-04-24 00:00:00,Charles Scharf
133,John_Mackey,https://en.wikipedia.org/wiki/John_Mackey,NaT,John Mackey
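
Replacing underscores does not handle the percent-encoded names that appeared in the list earlier, such as `Ola_K%C3%A4llenius`. One possible extension, offered as a sketch and not as the cleaning step this notebook performs, is to decode them with `urllib.parse.unquote` before replacing underscores. The helper name `clean_name` is hypothetical.

```python
import urllib.parse

import pandas as pd


def clean_name(raw):
    """Decode percent-escapes, then turn underscores into spaces (hypothetical helper)."""
    return urllib.parse.unquote(raw).replace('_', ' ')


# Three names taken from the list printed earlier in the notebook.
names = pd.Series(['Julie_Sweet', 'Ola_K%C3%A4llenius', 'B%C3%B6rje_Ekholm'])

# apply calls clean_name on each element and returns a new Series.
cleaned = names.apply(clean_name)
print(cleaned.tolist())
```

Decoding matters for later matching: a name that still contains `%C3%A4` will not compare equal to the same name written with its accented character.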


In [None]:
# rename returns a new dataframe with the column renamed; exec_df itself is unchanged
exec_df.rename(columns={'name': 'old_name'})

Unnamed: 0,old_name,page,born,clean_name
0,Julie_Sweet,https://en.wikipedia.org/wiki/Julie_Sweet,NaT,Julie Sweet
1,Kumar_Mangalam_Birla,https://en.wikipedia.org/wiki/Kumar_Mangalam_Birla,1967-06-14 00:00:00,Kumar Mangalam Birla
2,Shantanu_Narayen,https://en.wikipedia.org/wiki/Shantanu_Narayen,1963-05-27 00:00:00,Shantanu Narayen
3,Guillaume_Faury,https://en.wikipedia.org/wiki/Guillaume_Faury,1968-02-22 00:00:00,Guillaume Faury
4,Eddie_Wu,https://en.wikipedia.org/wiki/Eddie_Wu,NaT,Eddie Wu
...,...,...,...,...
130,Ann_Sarnoff,https://en.wikipedia.org/wiki/Ann_Sarnoff,NaT,Ann Sarnoff
131,Jason_Kilar,https://en.wikipedia.org/wiki/Jason_Kilar,1971-04-26 00:00:00,Jason Kilar
132,Charles_Scharf,https://en.wikipedia.org/wiki/Charles_Scharf,1965-04-24 00:00:00,Charles Scharf
133,John_Mackey,https://en.wikipedia.org/wiki/John_Mackey,NaT,John Mackey


In [None]:
# We can do the same via SQL, using the replace() string function in the SELECT list.

duckdb.sql("select name, replace(name, '_', ' ') as clean_name from exec_df")

┌──────────────────────┬──────────────────────┐
│         name         │      clean_name      │
│       varchar        │       varchar        │
├──────────────────────┼──────────────────────┤
│ Julie_Sweet          │ Julie Sweet          │
│ Kumar_Mangalam_Birla │ Kumar Mangalam Birla │
│ Shantanu_Narayen     │ Shantanu Narayen     │
│ Guillaume_Faury      │ Guillaume Faury      │
│ Eddie_Wu             │ Eddie Wu             │
│ Andy_Jassy           │ Andy Jassy           │
│ Lisa_Su              │ Lisa Su              │
│ Stephen_Squeri       │ Stephen Squeri       │
│ Joseph_R._Swedish    │ Joseph R. Swedish    │
│ Tim_Cook             │ Tim Cook             │
│    ·                 │    ·                 │
│    ·                 │    ·                 │
│    ·                 │    ·                 │
│ Vittorio_Colao       │ Vittorio Colao       │
│ Herbert_Diess        │ Herbert Diess        │
│ Robert_Iger          │ Robert Iger          │
│ Stefano_Pessina      │ Stefano Pessina

## 2.1. Selecting a subset of the rows

In [None]:
# Here's a column

exec_df['clean_name']

Unnamed: 0,clean_name
0,Julie Sweet
1,Kumar Mangalam Birla
2,Shantanu Narayen
3,Guillaume Faury
4,Eddie Wu
...,...
130,Ann Sarnoff
131,Jason Kilar
132,Charles Scharf
133,John Mackey


In [None]:
# We can apply a test (predicate) to each value in a column, returning a Series of boolean true/false values

exec_df['clean_name'] == 'Julie Sweet'

Unnamed: 0,clean_name
0,True
1,False
2,False
3,False
4,False
...,...
130,False
131,False
132,False
133,False


In [None]:
# If we index the dataframe with that boolean Series, we'll get only those rows where the condition was True

exec_df[exec_df['clean_name'] == 'Julie Sweet']

Unnamed: 0,name,page,born,clean_name
0,Julie_Sweet,https://en.wikipedia.org/wiki/Julie_Sweet,NaT,Julie Sweet
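
Boolean Series can also be combined before they are used to select rows. A minimal sketch on a three-row toy dataframe (values mirror `exec_df`; the 1968 cutoff is an arbitrary choice for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'clean_name': ['Julie Sweet', 'Kumar Mangalam Birla', 'Lisa Su'],
    'born': pd.to_datetime([None, '1967-06-14', '1969-11-07']),
})

cutoff = pd.Timestamp('1968-01-01')

# & is element-wise AND, | is element-wise OR.  Each comparison needs its own
# parentheses because & and | bind more tightly than == and <.
known_and_early = toy_df[toy_df['born'].notna() & (toy_df['born'] < cutoff)]
julie_or_early = toy_df[(toy_df['clean_name'] == 'Julie Sweet') | (toy_df['born'] < cutoff)]

print(known_and_early['clean_name'].tolist())
print(julie_or_early['clean_name'].tolist())
```

Comparisons against a missing date (NaT) evaluate to False, so the row with no birthdate only appears when it is matched by name.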


SQL keywords are case-insensitive, but the convention is to capitalize keywords such as `SELECT`, `FROM`, and `WHERE` to aid readability.  SQL string literals use single quotes, so we'll typically wrap the SQL command itself in double quotes.

In [None]:
duckdb.sql("SELECT * FROM exec_df WHERE clean_name='Julie Sweet'")

┌─────────────┬───────────────────────────────────────────┬───────────┬─────────────┐
│    name     │                   page                    │   born    │ clean_name  │
│   varchar   │                  varchar                  │ timestamp │   varchar   │
├─────────────┼───────────────────────────────────────────┼───────────┼─────────────┤
│ Julie_Sweet │ https://en.wikipedia.org/wiki/Julie_Sweet │ NULL      │ Julie Sweet │
└─────────────┴───────────────────────────────────────────┴───────────┴─────────────┘

In [None]:
exec_df[exec_df['clean_name'] == 'Julie Sweet'][['page']]

Unnamed: 0,page
0,https://en.wikipedia.org/wiki/Julie_Sweet



Here we'll use the triple-quote syntax for Python strings, which allows us to pass a multi-line string to SQL...


In [None]:
duckdb.sql("""SELECT clean_name
            FROM exec_df
            WHERE clean_name='Julie Sweet'""")

┌─────────────┐
│ clean_name  │
│   varchar   │
├─────────────┤
│ Julie Sweet │
└─────────────┘

In [None]:
# dropna drops the rows whose 'born' value is missing (NaT)

exec_df.dropna(subset=['born'])

Unnamed: 0,name,page,born,clean_name
1,Kumar_Mangalam_Birla,https://en.wikipedia.org/wiki/Kumar_Mangalam_Birla,1967-06-14 00:00:00,Kumar Mangalam Birla
2,Shantanu_Narayen,https://en.wikipedia.org/wiki/Shantanu_Narayen,1963-05-27 00:00:00,Shantanu Narayen
3,Guillaume_Faury,https://en.wikipedia.org/wiki/Guillaume_Faury,1968-02-22 00:00:00,Guillaume Faury
5,Andy_Jassy,https://en.wikipedia.org/wiki/Andy_Jassy,1968-01-13 00:00:00,Andy Jassy
6,Lisa_Su,https://en.wikipedia.org/wiki/Lisa_Su,1969-11-07 00:00:00,Lisa Su
...,...,...,...,...
127,Robert_Iger,https://en.wikipedia.org/wiki/Robert_Iger,1951-02-10 00:00:00,Robert Iger
128,Stefano_Pessina,https://en.wikipedia.org/wiki/Stefano_Pessina,1941-06-04 00:00:00,Stefano Pessina
129,Doug_McMillon,https://en.wikipedia.org/wiki/Doug_McMillon,1966-10-17 00:00:00,Doug McMillon
131,Jason_Kilar,https://en.wikipedia.org/wiki/Jason_Kilar,1971-04-26 00:00:00,Jason Kilar


In [None]:
duckdb.sql("""SELECT *
            FROM exec_df
            WHERE born IS NOT NULL""")

┌──────────────────────┬──────────────────────────────────────────────────┬─────────────────────┬──────────────────────┐
│         name         │                       page                       │        born         │      clean_name      │
│       varchar        │                     varchar                      │      timestamp      │       varchar        │
├──────────────────────┼──────────────────────────────────────────────────┼─────────────────────┼──────────────────────┤
│ Kumar_Mangalam_Birla │ https://en.wikipedia.org/wiki/Kumar_Mangalam_B…  │ 1967-06-14 00:00:00 │ Kumar Mangalam Birla │
│ Shantanu_Narayen     │ https://en.wikipedia.org/wiki/Shantanu_Narayen   │ 1963-05-27 00:00:00 │ Shantanu Narayen     │
│ Guillaume_Faury      │ https://en.wikipedia.org/wiki/Guillaume_Faury    │ 1968-02-22 00:00:00 │ Guillaume Faury      │
│ Andy_Jassy           │ https://en.wikipedia.org/wiki/Andy_Jassy         │ 1968-01-13 00:00:00 │ Andy Jassy           │
│ Lisa_Su              │ https:/

## 2.2. Joining Data

We start with a simple join between company_ceos_df and exec_df and persist it to the database.  We then check how many companies did not have a match on CEO name.

In [None]:
exec_df[['clean_name', 'born']]

Unnamed: 0,clean_name,born
0,Julie Sweet,NaT
1,Kumar Mangalam Birla,1967-06-14 00:00:00
2,Shantanu Narayen,1963-05-27 00:00:00
3,Guillaume Faury,1968-02-22 00:00:00
4,Eddie Wu,NaT
...,...,...
130,Ann Sarnoff,NaT
131,Jason Kilar,1971-04-26 00:00:00
132,Charles Scharf,1965-04-24 00:00:00
133,John Mackey,NaT


In [None]:
# Remove any duplicate executive entries

exec_df = exec_df.drop_duplicates()

In [None]:
company_ceos_df[['Executive', 'Company']]

Unnamed: 0,Executive,Company
0,Julie Sweet,Accenture
1,Kumar Mangalam Birla,Aditya Birla Group
2,Shantanu Narayen,Adobe Systems
3,Guillaume Faury,Airbus
4,Eddie Wu,Alibaba
...,...,...
130,Ann Sarnoff,Warner Brothers
131,Jason Kilar,WarnerMedia
132,Charles Scharf,Wells Fargo
133,John Mackey,Whole Foods Market


In [None]:
company_ceos_df[['Executive', 'Company']].merge(exec_df[['clean_name', 'born']],
                                                left_on=['Executive'],
                                                right_on=['clean_name'])

Unnamed: 0,Executive,Company,clean_name,born
0,Julie Sweet,Accenture,Julie Sweet,NaT
1,Kumar Mangalam Birla,Aditya Birla Group,Kumar Mangalam Birla,1967-06-14 00:00:00
2,Shantanu Narayen,Adobe Systems,Shantanu Narayen,1963-05-27 00:00:00
3,Guillaume Faury,Airbus,Guillaume Faury,1968-02-22 00:00:00
4,Eddie Wu,Alibaba,Eddie Wu,NaT
...,...,...,...,...
127,Ann Sarnoff,Warner Brothers,Ann Sarnoff,NaT
128,Jason Kilar,WarnerMedia,Jason Kilar,1971-04-26 00:00:00
129,Charles Scharf,Wells Fargo,Charles Scharf,1965-04-24 00:00:00
130,John Mackey,Whole Foods Market,John Mackey,NaT


In SQL, we express the same join with `JOIN ... ON` in the `FROM` clause.

In [None]:
duckdb.sql("""
            SELECT Executive, Company, born
            FROM company_ceos_df JOIN exec_df ON Executive=clean_name
          """)

┌──────────────────────┬──────────────────────────┬─────────────────────┐
│      Executive       │         Company          │        born         │
│       varchar        │         varchar          │      timestamp      │
├──────────────────────┼──────────────────────────┼─────────────────────┤
│ Julie Sweet          │ Accenture                │ NULL                │
│ Kumar Mangalam Birla │ Aditya Birla Group       │ 1967-06-14 00:00:00 │
│ Shantanu Narayen     │ Adobe Systems            │ 1963-05-27 00:00:00 │
│ Guillaume Faury      │ Airbus                   │ 1968-02-22 00:00:00 │
│ Eddie Wu             │ Alibaba                  │ NULL                │
│ Andy Jassy           │ Amazon                   │ 1968-01-13 00:00:00 │
│ Lisa Su              │ Advanced Micro Devices   │ 1969-11-07 00:00:00 │
│ Stephen Squeri       │ American Express         │ NULL                │
│ Joseph R. Swedish    │ Anthem                   │ 1951-05-17 00:00:00 │
│ Tim Cook             │ Apple        

Note there is another form you'll sometimes see, dating from older versions of SQL: list both tables in the `FROM` clause and put the join condition in `WHERE`:

In [None]:
duckdb.sql("""
            SELECT Executive, Company, born
            FROM company_ceos_df, exec_df
            WHERE Executive=clean_name
          """)

┌──────────────────────┬──────────────────────────┬─────────────────────┐
│      Executive       │         Company          │        born         │
│       varchar        │         varchar          │      timestamp      │
├──────────────────────┼──────────────────────────┼─────────────────────┤
│ Julie Sweet          │ Accenture                │ NULL                │
│ Kumar Mangalam Birla │ Aditya Birla Group       │ 1967-06-14 00:00:00 │
│ Shantanu Narayen     │ Adobe Systems            │ 1963-05-27 00:00:00 │
│ Guillaume Faury      │ Airbus                   │ 1968-02-22 00:00:00 │
│ Eddie Wu             │ Alibaba                  │ NULL                │
│ Andy Jassy           │ Amazon                   │ 1968-01-13 00:00:00 │
│ Lisa Su              │ Advanced Micro Devices   │ 1969-11-07 00:00:00 │
│ Stephen Squeri       │ American Express         │ NULL                │
│ Joseph R. Swedish    │ Anthem                   │ 1951-05-17 00:00:00 │
│ Tim Cook             │ Apple        
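
To check that the two join syntaxes really are interchangeable, here is a minimal sketch that runs both and compares the results. It uses the standard-library `sqlite3` module as a stand-in for DuckDB (an assumption made for portability); the tables are tiny hand-built copies, and the `Jane Doe` / `Example Corp` row is invented to show a CEO with no match.

```python
import sqlite3

# Stand-in engine: sqlite3, with small hand-built tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company_ceos (Executive TEXT, Company TEXT);
    CREATE TABLE execs (clean_name TEXT, born TEXT);
    INSERT INTO company_ceos VALUES ('Lisa Su', 'Advanced Micro Devices'),
                                    ('Tim Cook', 'Apple'),
                                    ('Jane Doe', 'Example Corp');
    INSERT INTO execs VALUES ('Lisa Su', '1969-11-07'),
                             ('Tim Cook', '1960-11-01');
""")

# Join condition written with JOIN ... ON.
explicit_join = conn.execute("""
    SELECT Executive, Company, born
    FROM company_ceos JOIN execs ON Executive = clean_name
    ORDER BY Executive
""").fetchall()

# Same join condition written in the WHERE clause.
where_join = conn.execute("""
    SELECT Executive, Company, born
    FROM company_ceos, execs
    WHERE Executive = clean_name
    ORDER BY Executive
""").fetchall()

print(explicit_join == where_join)
```

The unmatched `Jane Doe` row is absent from both results, since an inner join keeps only the pairs that satisfy the condition.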

OK, let's drop the cases where we don't have a CEO's birthday: these aren't useful!

In [None]:
# Shall we skip the cases where we don't have the birthday?
duckdb.sql("""
            SELECT Executive, Company, born
            FROM company_ceos_df JOIN exec_df ON Executive=clean_name
            WHERE born IS NOT NULL
          """)

┌──────────────────────┬─────────────────────────────────┬─────────────────────┐
│      Executive       │             Company             │        born         │
│       varchar        │             varchar             │      timestamp      │
├──────────────────────┼─────────────────────────────────┼─────────────────────┤
│ Kumar Mangalam Birla │ Aditya Birla Group              │ 1967-06-14 00:00:00 │
│ Shantanu Narayen     │ Adobe Systems                   │ 1963-05-27 00:00:00 │
│ Guillaume Faury      │ Airbus                          │ 1968-02-22 00:00:00 │
│ Andy Jassy           │ Amazon                          │ 1968-01-13 00:00:00 │
│ Lisa Su              │ Advanced Micro Devices          │ 1969-11-07 00:00:00 │
│ Joseph R. Swedish    │ Anthem                          │ 1951-05-17 00:00:00 │
│ Tim Cook             │ Apple                           │ 1960-11-01 00:00:00 │
│ Lakshmi Niwas Mittal │ Arcelor Mittal                  │ 1950-06-15 00:00:00 │
│ Charles Woodburn     │ BAE

## 2.4. Finding the misses in the join with OUTER JOINs.

Note that the join above resulted in 174 rows.  However, there are more rows in `company_ceos_df`, so we are missing some companies.  We can see which are missed using a LEFT OUTER JOIN (aka LEFT JOIN); in Pandas, setting `indicator=True` adds a `_merge` column that shows which tuples in `company_ceos_df` failed to find a match (`left_only`, e.g., rows 24 and 172).

In [None]:
pd.set_option('display.max_rows', 200)
display(company_ceos_df[['Executive', 'Company']].merge(exec_df[['clean_name', 'born']],
                                                left_on=['Executive'],
                                                right_on=['clean_name'], how="left", indicator=True))



Unnamed: 0,Executive,Company,clean_name,born,_merge
0,Julie Sweet,Accenture,Julie Sweet,NaT,both
1,Kumar Mangalam Birla,Aditya Birla Group,Kumar Mangalam Birla,1967-06-14 00:00:00,both
2,Shantanu Narayen,Adobe Systems,Shantanu Narayen,1963-05-27 00:00:00,both
3,Guillaume Faury,Airbus,Guillaume Faury,1968-02-22 00:00:00,both
4,Eddie Wu,Alibaba,Eddie Wu,NaT,both
5,Andy Jassy,Amazon,Andy Jassy,1968-01-13 00:00:00,both
6,Lisa Su,Advanced Micro Devices,Lisa Su,1969-11-07 00:00:00,both
7,Stephen Squeri,American Express,Stephen Squeri,NaT,both
8,Joseph R. Swedish,Anthem,Joseph R. Swedish,1951-05-17 00:00:00,both
9,Tim Cook,Apple,Tim Cook,1960-11-01 00:00:00,both


In [None]:
pd.set_option('display.max_rows', 50)
result_df = company_ceos_df[['Executive', 'Company']].merge(exec_df[['clean_name', 'born']],
                                                left_on=['Executive'],
                                                right_on=['clean_name'], how="outer", indicator=True)

result_df[result_df['_merge'] != 'both']


Unnamed: 0,Executive,Company,clean_name,born,_merge
37,Ola Källenius,Daimler AG,,,left_only
44,Börje Ekholm,Ericsson,,,left_only
98,Michael O'Leary,Ryanair,,,left_only
135,,,Ola K%C3%A4llenius,1969-06-11 00:00:00,right_only
136,,,B%C3%B6rje Ekholm,NaT,right_only
137,,,Michael O%27Leary,NaT,right_only


We can also do this in SQL (there is no indicator but we can test for NULL):

In [None]:
duckdb.sql("""
            SELECT Executive, Company, clean_name, born
            FROM company_ceos_df FULL JOIN exec_df ON Executive=clean_name
            WHERE clean_name IS NULL OR Company IS NULL
          """)

┌─────────────────┬────────────┬────────────────────┬─────────────────────┐
│    Executive    │  Company   │     clean_name     │        born         │
│     varchar     │  varchar   │      varchar       │      timestamp      │
├─────────────────┼────────────┼────────────────────┼─────────────────────┤
│ NULL            │ NULL       │ Ola K%C3%A4llenius │ 1969-06-11 00:00:00 │
│ NULL            │ NULL       │ B%C3%B6rje Ekholm  │ NULL                │
│ NULL            │ NULL       │ Michael O%27Leary  │ NULL                │
│ Michael O'Leary │ Ryanair    │ NULL               │ NULL                │
│ Ola Källenius   │ Daimler AG │ NULL               │ NULL                │
│ Börje Ekholm    │ Ericsson   │ NULL               │ NULL                │
└─────────────────┴────────────┴────────────────────┴─────────────────────┘
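
If we only care about the left-hand misses, a `LEFT JOIN` plus a `NULL` test on a right-hand column is enough. This pattern is sometimes called an anti-join. The sketch below is illustrative only: it uses the standard-library `sqlite3` module rather than DuckDB, and two hand-copied rows per table.

```python
import sqlite3

# Stand-in engine: sqlite3, with rows copied by hand from the tables above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company_ceos (Executive TEXT, Company TEXT);
    CREATE TABLE execs (clean_name TEXT, born TEXT);
    INSERT INTO company_ceos VALUES ('Tim Cook', 'Apple'),
                                    ('Börje Ekholm', 'Ericsson');
    INSERT INTO execs VALUES ('Tim Cook', '1960-11-01'),
                             ('B%C3%B6rje Ekholm', NULL);
""")

# LEFT JOIN keeps every left row; unmatched rows get NULL in right-hand columns.
left_only = conn.execute("""
    SELECT Executive, Company
    FROM company_ceos LEFT JOIN execs ON Executive = clean_name
    WHERE clean_name IS NULL
""").fetchall()

print(left_only)
```

Testing `clean_name` (the join key) rather than `born` matters here: `born` can be `NULL` even for rows that did match.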

## 2.3. Composing Joins

Of course, we can join the results of a join with another table -- representing a *composition*!



Let's join with company data!

In [None]:
con.sql("""SELECT Executive, Company, born from company_ceos
            JOIN executives ON Executive=replace(name, '_', ' ')
            JOIN company_data cd ON Company=cd.name
            WHERE born is not null""")

┌───────────┬─────────┬───────────┐
│ Executive │ Company │   born    │
│  varchar  │ varchar │ timestamp │
├───────────┴─────────┴───────────┤
│             0 rows              │
└─────────────────────────────────┘

Hmm, what is wrong here?

In [None]:
con.sql('SELECT * from company_data')

┌────────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┐
│ Unnamed: 0 │         name         │ … │     linkedin url     │ current employee e…  │ total employee est…  │
│   int64    │       varchar        │   │       varchar        │        int64         │        int64         │
├────────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┤
│    5872184 │ ibm                  │ … │ linkedin.com/compa…  │               274047 │               716906 │
│    4425416 │ tata consultancy s…  │ … │ linkedin.com/compa…  │               190771 │               341369 │
│      21074 │ accenture            │ … │ linkedin.com/compa…  │               190689 │               455768 │
│    2309813 │ us army              │ … │ linkedin.com/compa…  │               162163 │               445958 │
│    1558607 │ ey                   │ … │ linkedin.com/compa…  │               158363 │               428960 │
│

Notice the case for `name`?

In [None]:
con.sql("""SELECT Executive, Company, born from company_ceos
            JOIN executives ON Executive=replace(name, '_', ' ')
            JOIN company_data cd ON lower(Company)=lower(cd.name)
            WHERE born is not null
            ORDER BY Company""")

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

┌──────────────────────┬──────────────────────────┬─────────────────────┐
│      Executive       │         Company          │        born         │
│       varchar        │         varchar          │      timestamp      │
├──────────────────────┼──────────────────────────┼─────────────────────┤
│ Kumar Mangalam Birla │ Aditya Birla Group       │ 1967-06-14 00:00:00 │
│ Lisa Su              │ Advanced Micro Devices   │ 1969-11-07 00:00:00 │
│ Guillaume Faury      │ Airbus                   │ 1968-02-22 00:00:00 │
│ Guillaume Faury      │ Airbus                   │ 1968-02-22 00:00:00 │
│ Andy Jassy           │ Amazon                   │ 1968-01-13 00:00:00 │
│ Andy Jassy           │ Amazon                   │ 1968-01-13 00:00:00 │
│ Joseph R. Swedish    │ Anthem                   │ 1951-05-17 00:00:00 │
│ Tim Cook             │ Apple                    │ 1960-11-01 00:00:00 │
│ Tim Cook             │ Apple                    │ 1960-11-01 00:00:00 │
│ Tim Cook             │ Apple        

Hmm, there are duplicates!  The `company_data` table has several rows with the same `name` that differ only in fields we don't care about, so each of them matches and repeats the CEO row. We can remove the duplicates via `SELECT DISTINCT`.

In [None]:
con.sql("""SELECT DISTINCT Executive, Company, born from company_ceos
            JOIN executives ON Executive=replace(name, '_', ' ')
            JOIN company_data cd ON lower(Company)=lower(cd.name)
            WHERE born is not null
            ORDER BY Company""")

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

┌──────────────────────┬──────────────────────────┬─────────────────────┐
│      Executive       │         Company          │        born         │
│       varchar        │         varchar          │      timestamp      │
├──────────────────────┼──────────────────────────┼─────────────────────┤
│ Kumar Mangalam Birla │ Aditya Birla Group       │ 1967-06-14 00:00:00 │
│ Lisa Su              │ Advanced Micro Devices   │ 1969-11-07 00:00:00 │
│ Guillaume Faury      │ Airbus                   │ 1968-02-22 00:00:00 │
│ Andy Jassy           │ Amazon                   │ 1968-01-13 00:00:00 │
│ Joseph R. Swedish    │ Anthem                   │ 1951-05-17 00:00:00 │
│ Tim Cook             │ Apple                    │ 1960-11-01 00:00:00 │
│ Charles Woodburn     │ BAE Systems              │ 1971-03-11 00:00:00 │
│ Oliver Zipse         │ BMW                      │ 1964-02-07 00:00:00 │
│ Bob Dudley           │ BP                       │ 1955-09-14 00:00:00 │
│ Brian Moynihan       │ Bank of Ameri
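
To see where such duplicates come from, here is a small sketch under stated assumptions: it uses the standard-library `sqlite3` module instead of DuckDB, and a made-up `company_data` table with a hypothetical `domain` column and two invented rows for the same company name.

```python
import sqlite3

# Stand-in engine: sqlite3, with invented rows for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company_ceos (Executive TEXT, Company TEXT);
    CREATE TABLE company_data (name TEXT, domain TEXT);
    INSERT INTO company_ceos VALUES ('Tim Cook', 'Apple');
    -- Two rows share a name but differ in a column we never select.
    INSERT INTO company_data VALUES ('apple', 'first.example'),
                                    ('apple', 'second.example');
""")

query = """
    SELECT {distinct} Executive, Company
    FROM company_ceos JOIN company_data cd ON lower(Company) = lower(cd.name)
"""

# Without DISTINCT, one output row per matching company_data row.
with_dupes = conn.execute(query.format(distinct="")).fetchall()
# With DISTINCT, identical output rows collapse into one.
deduped = conn.execute(query.format(distinct="DISTINCT")).fetchall()

print(len(with_dupes), len(deduped))
```

The join itself is behaving correctly; the repeats only look like duplicates because the distinguishing column is projected away.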

Can we do all of this in Pandas? Of course!

First, we need to lowercase the company names.

In [None]:
# The vectorized .str accessor lowercases every value (and tolerates missing values)
company_ceos_df['company_lc'] = company_ceos_df['Company'].str.lower()

Notice this is slower than DuckDB?

In [None]:
company_ceos_df.merge(exec_df.dropna(),
                      left_on=['Executive'],
                      right_on=['clean_name']).\
                      merge(company_data_df,
                            left_on='company_lc',
                            right_on='name')[['Executive','Company','born']].drop_duplicates().sort_values('Company')

Unnamed: 0,Executive,Company,born
0,Kumar Mangalam Birla,Aditya Birla Group,1967-06-14 00:00:00
5,Lisa Su,Advanced Micro Devices,1969-11-07 00:00:00
1,Guillaume Faury,Airbus,1968-02-22 00:00:00
3,Andy Jassy,Amazon,1968-01-13 00:00:00
6,Joseph R. Swedish,Anthem,1951-05-17 00:00:00
...,...,...,...
136,Vittorio Colao,Vodafone,1961-10-03 00:00:00
139,Stefano Pessina,Walgreens Boots Alliance,1941-06-04 00:00:00
140,Doug McMillon,Walmart,1966-10-17 00:00:00
138,Robert Iger,Walt Disney Company,1951-02-10 00:00:00


# 3.0: Validating and Cleaning Data

How do we know our data is good?  We can create rules that trigger when the data fails some particular set of **constraints**.

In [None]:
# Test with validation rules

replace_item = ''

failed = False
for name in exec_df['clean_name']:
  if not name.replace(' ', replace_item).\
          replace('.', replace_item).\
          replace('\'',replace_item).\
          replace('-',replace_item).isalpha():
    print ("Illegal name %s"%name)
    failed = True

if failed:
  print('Found illegal names!')

Illegal name Ola K%C3%A4llenius
Illegal name B%C3%B6rje Ekholm
Illegal name Michael O%27Leary
Found illegal names!
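
The same rule can be packaged as a small reusable predicate, which makes it easier to collect all violations at once. This is only a sketch of one way to organize the check; the helper name `is_legal_name` and the constant `ALLOWED_PUNCTUATION` are made up for illustration.

```python
# Characters we tolerate in a name besides letters (same set as the loop above).
ALLOWED_PUNCTUATION = " .'-"

def is_legal_name(name: str) -> bool:
    """Return True if name has only letters plus the allowed punctuation."""
    stripped = "".join(ch for ch in name if ch not in ALLOWED_PUNCTUATION)
    return stripped.isalpha()

names = ["Joseph R. Swedish", "Michael O'Leary",
         "Ola K%C3%A4llenius", "Börje Ekholm"]

# Collect every name that breaks the constraint.
violations = [name for name in names if not is_legal_name(name)]
print(violations)
```

Note that `str.isalpha` is Unicode-aware, so a properly decoded name such as "Börje Ekholm" passes; only the percent-encoded form is flagged.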


## 3.1. Data Cleaning: Fixing the Errors

One approach is to recognize that these strings are *percent-encoded*, as is done for (Wikipedia) URLs: each byte of a character's UTF-8 encoding is written as `%` followed by two hex digits.  For instance, `%C3%A4` encodes the accented letter "ä", and `%27` encodes an apostrophe.  We can use a function called `unquote` to decode them...


In [None]:
from urllib.parse import unquote

exec_df['clean_name'].apply(unquote)

Unnamed: 0,clean_name
0,Julie Sweet
1,Kumar Mangalam Birla
2,Shantanu Narayen
3,Guillaume Faury
4,Eddie Wu
...,...
130,Ann Sarnoff
131,Jason Kilar
132,Charles Scharf
133,John Mackey
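
The rows shown above are ones that were already clean, so the effect is easier to see on the three names that failed validation. A minimal sketch using only the standard library:

```python
from urllib.parse import unquote

# The three names flagged by the validation rule.
encoded_names = ["Ola K%C3%A4llenius", "B%C3%B6rje Ekholm", "Michael O%27Leary"]

# unquote decodes each %XX sequence, interpreting the bytes as UTF-8.
decoded_names = [unquote(name) for name in encoded_names]
print(decoded_names)
```

Keep in mind that `apply(unquote)` returns a new Series; to keep the repaired names, the result would have to be assigned back to a column.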


## 3.2. More Generally: Data Validation Tools

Are there tools to help us validate data?  Of course!  We see an [example](https://validators.readthedocs.io/en/latest/) of one such tool, simply called `validators`, below.  There are many others in the data cleaning literature.

In [None]:
!pip install validators

Collecting validators
  Downloading validators-0.33.0-py3-none-any.whl.metadata (3.8 kB)
Downloading validators-0.33.0-py3-none-any.whl (43 kB)
Installing collected packages: validators
Successfully installed validators-0.33.0


In [None]:
import validators.url

# Are all of the URLs valid?
exec_df['page'].apply(validators.url)

Unnamed: 0,page
0,True
1,True
2,True
3,True
4,True
...,...
130,True
131,True
132,True
133,True


## 3.3. Validation against a master list

We can also compare against "master" lists (in tables).  For example, company info about stock ticker symbols can be compared against the full list of symbols; states and countries can be compared against definitive lists.

Suppose we want to confirm that `company_info_df`, which includes companies' home countries, has valid country codes.  


In [None]:
import urllib.request

data = urllib.request.urlopen(
       'https://gist.github.com/jvilledieu/c3afe5bc21da28880a30/raw/a344034b82a11433ba6f149afa47e57567d4a18f/Companies.csv')

company_info_df = pd.read_csv(data)


In [None]:
company_info_df[['name','country_code']]

Unnamed: 0,name,country_code
0,#waywire,USA
1,&TV Communications,USA
2,'Rock' Your Paper,EST
3,(In)Touch Network,GBR
4,+n (PlusN),USA
...,...,...
47753,Zzish,GBR
47754,ZZNode Science and Technology,CHN
47755,Zzzzapp Wireless ltd.,HRV
47756,[a]list games,


Some of the country codes are missing (NaN), so we'll drop those rows before validating.

For the master list, here's a nice list of all countries and region codes, published on GitHub.

In [None]:
countries_df = pd.read_csv("https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv")

display(countries_df)

Unnamed: 0,name,alpha-2,alpha-3,country-code,iso_3166-2,region,sub-region,intermediate-region,region-code,sub-region-code,intermediate-region-code
0,Afghanistan,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,,142.0,34.0,
1,Åland Islands,AX,ALA,248,ISO 3166-2:AX,Europe,Northern Europe,,150.0,154.0,
2,Albania,AL,ALB,8,ISO 3166-2:AL,Europe,Southern Europe,,150.0,39.0,
3,Algeria,DZ,DZA,12,ISO 3166-2:DZ,Africa,Northern Africa,,2.0,15.0,
4,American Samoa,AS,ASM,16,ISO 3166-2:AS,Oceania,Polynesia,,9.0,61.0,
...,...,...,...,...,...,...,...,...,...,...,...
244,Wallis and Futuna,WF,WLF,876,ISO 3166-2:WF,Oceania,Polynesia,,9.0,61.0,
245,Western Sahara,EH,ESH,732,ISO 3166-2:EH,Africa,Northern Africa,,2.0,15.0,
246,Yemen,YE,YEM,887,ISO 3166-2:YE,Asia,Western Asia,,142.0,145.0,
247,Zambia,ZM,ZMB,894,ISO 3166-2:ZM,Africa,Sub-Saharan Africa,Eastern Africa,2.0,202.0,14.0


In [None]:
validated = company_info_df[['name','country_code']].dropna().merge(countries_df, left_on=['country_code'], right_on=['alpha-3'],
                      how='left', indicator=True)

validated[validated['_merge'] != 'both']

Unnamed: 0,name_x,country_code,name_y,alpha-2,alpha-3,country-code,iso_3166-2,region,sub-region,intermediate-region,region-code,sub-region-code,intermediate-region-code,_merge
38,123ContactForm,ROM,,,,,,,,,,,,left_only
567,Access Point,ROM,,,,,,,,,,,,left_only
3526,Avito.ru,ROM,,,,,,,,,,,,left_only
3600,Axigen Messaging,ROM,,,,,,,,,,,,left_only
4694,BitDefender,ROM,,,,,,,,,,,,left_only
4898,Blogvio,ROM,,,,,,,,,,,,left_only
5216,Boommy Fashion,ROM,,,,,,,,,,,,left_only
7618,Client24,ROM,,,,,,,,,,,,left_only
8953,CreditCardsOnline,ROM,,,,,,,,,,,,left_only
9991,DesignFace IT,ROM,,,,,,,,,,,,left_only
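
The idea behind master-list validation is just set membership. The sketch below shows it in plain Python, under stated assumptions: `valid_codes` is a tiny hand-picked subset of ISO 3166-1 alpha-3 codes standing in for the full list, and the company rows are copied by hand from the outputs above.

```python
# A tiny hand-picked subset of alpha-3 codes, standing in for the full master list.
valid_codes = {"USA", "GBR", "EST", "CHN", "HRV", "ROU"}

companies = [
    ("#waywire", "USA"),
    ("'Rock' Your Paper", "EST"),
    ("BitDefender", "ROM"),
    ("[a]list games", None),   # missing country code
]

# Step 1: drop rows with no code at all (the dropna() step).
with_code = [(name, code) for name, code in companies if code is not None]

# Step 2: keep the rows whose code is not in the master list (the left_only rows).
invalid = [(name, code) for name, code in with_code if code not in valid_codes]
print(invalid)
```

A flagged code such as `ROM` is not necessarily garbage: it may be an outdated or non-standard spelling of a real country, which is a cue to add a cleaning rule rather than to delete the row.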


## 3.4. Record Linking: Working around the Errors

Rather than figuring out how to clean these characters, we'll instead look at approximate matching.

We'll need to import some similarity-matching code to do an approximate match between the original names and those returned by Wikipedia.


In [None]:
!pip3 install git+https://github.com/anhaidgroup/py_stringsimjoin.git@rel_0_3_6

Collecting git+https://github.com/anhaidgroup/py_stringsimjoin.git@rel_0_3_6
  Cloning https://github.com/anhaidgroup/py_stringsimjoin.git (to revision rel_0_3_6) to /tmp/pip-req-build-lb_nu7a6
  Running command git clone --filter=blob:none --quiet https://github.com/anhaidgroup/py_stringsimjoin.git /tmp/pip-req-build-lb_nu7a6
  Running command git checkout -b rel_0_3_6 --track origin/rel_0_3_6
  Switched to a new branch 'rel_0_3_6'
  Branch 'rel_0_3_6' set up to track remote branch 'rel_0_3_6' from 'origin'.
  Resolved https://github.com/anhaidgroup/py_stringsimjoin.git to commit d3ab2b31a8f9515e11bfa7abf9c33722cf6c9938
  Preparing metadata (setup.py) ... done
Collecting PyPrind>=2.9.3 (from py-stringsimjoin==0.3.6)
  Downloading PyPrind-2.11.3-py2.py3-none-any.whl.metadata (1.1 kB)
Collecting py_stringmatching>=0.2.1 (from py-stringsimjoin==0.3.6)
  Downloading py-stringmatching-0.4.6.tar.gz (849 kB)

You'll need to restart your kernel after this one...

In [None]:
!pip3 install linktransformer

Collecting linktransformer
  Downloading linktransformer-0.1.15-py3-none-any.whl.metadata (58 kB)
Collecting faiss-cpu==1.8.0 (from linktransformer)
  Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting hdbscan==0.8.36 (from linktransformer)
  Downloading hdbscan-0.8.36-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting networkx==3.1 (from linktransformer)
  Downloading networkx-3.1-py3-none-any.whl.metadata (5.3 kB)
Collecting torch==2.3.0 (from linktransformer)
  Downloading torch-2.3.0-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting sentence-transformers==2.3.1 (from linktransformer)
  Downloading sentence_transformers-2.3.1-py3-none-any.whl

In [None]:
# Approximate string matching
import py_stringsimjoin as ssj
import py_stringmatching as sm

import linktransformer as lt

ModuleNotFoundError: No module named 'linktransformer'

In [None]:
# We are going to match the strings approximately, via "n-grams" or "q-grams" (sequences of n or q characters)
# Here it's five-grams

tok = sm.QgramTokenizer(qval=5,return_set=True)
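
To make the idea concrete, here is a plain-Python sketch of q-gram tokenization and Jaccard similarity (the size of the intersection of two q-gram sets divided by the size of their union). It is not the `py_stringmatching` implementation: the helper names `qgrams` and `jaccard` are made up, and the padding scheme (`#` in front, `$` behind) is an assumption about how such tokenizers commonly pad strings.

```python
def qgrams(text: str, q: int = 5, pad: bool = True) -> set:
    """Return the set of overlapping q-character substrings of text."""
    if pad:
        # Assumed padding scheme: q-1 '#' in front and q-1 '$' behind.
        text = "#" * (q - 1) + text + "$" * (q - 1)
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def jaccard(a: str, b: str, q: int = 5) -> float:
    """Jaccard similarity of the q-gram sets of a and b."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    return len(grams_a & grams_b) / len(grams_a | grams_b)

same = jaccard("Tim Cook", "Tim Cook")
close = jaccard("Michael O'Leary", "Michael O%27Leary")
far = jaccard("Tim Cook", "Lisa Su")
print(same, close, far)
```

Identical strings score 1.0, unrelated names score near 0, and a name that differs only by a percent-encoded character lands in between, which is what lets a threshold separate likely matches from non-matches.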

In [None]:
# Now let's do a similarity join

# We'll reset the index, so there is a unique index field in the company_ceos_df dataframe
company_ceos_df.reset_index(inplace=True)

output_pairs = ssj.jaccard_join(company_ceos_df, exec_df, 'index', 'page', 'Executive', 'clean_name', tok, 0.35,
                                l_out_attrs=['Executive'], r_out_attrs=['name'])

output_pairs[output_pairs['_sim_score'] < 1.0]

In [None]:
# At last! Company info + CEO info, together!

total = company_ceos_df.merge(output_pairs,left_on=['Executive'],right_on=['l_Executive']).\
        merge(exec_df.dropna(),left_on=['r_page'],right_on=['page']).\
        merge(company_data_df, left_on='company_lc', right_on='name', how="left")

total

In [None]:
# Let's get ready to plot

%matplotlib inline

In [None]:
total

## Exercises

# 4.0 Simple Analysis of Linked Data: Grouping and Analytics

The `groupby` method lets us collect rows into groups.  In Pandas, grouping produces a special `GroupBy` object in which each group holds a set of rows.  We can inspect a single group with the `get_group` method.

In [None]:
total[['born','Company','Executive']].drop_duplicates().sort_values('born')

In [None]:
total[['born','Company','Executive']].drop_duplicates().groupby(by='born').get_group(datetime.datetime.strptime('1935-11-01', '%Y-%m-%d'))[['Company','Executive','born']]

We can apply computations, such as `count`, to the items in the group.  Values that are empty (NaN) do not count (no pun intended).

In [None]:
total[['born','Company','Executive']].drop_duplicates().groupby(by='born').count()

In [None]:
# We can do this in SQL too...

# By default, SQL GROUP BY keeps the NaN value (called NULL in SQL) as its own group.
# If we want to exclude it, we need to include WHERE ... IS NOT NULL.

duckdb.sql("""SELECT born, count(Company)
            FROM total
            WHERE born IS NOT NULL
            GROUP BY born""")

CatalogException: Catalog Error: Table with name total does not exist!
Did you mean "temp.information_schema.tables"?
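
The `NULL`-group behaviour is easy to check on a toy table. The sketch below assumes a three-row stand-in for `total`, with values copied by hand from earlier outputs, and uses the standard-library `sqlite3` module rather than DuckDB so that it runs without the linked dataframe.

```python
import sqlite3

# Stand-in engine and a toy version of the table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE total (Company TEXT, born TEXT);
    INSERT INTO total VALUES ('Apple', '1960-11-01'),
                             ('Accenture', NULL),
                             ('Alibaba', NULL);
""")

# Without a filter, the NULL birthdays form a group of their own.
all_groups = conn.execute(
    "SELECT born, count(Company) FROM total GROUP BY born"
).fetchall()

# With the filter, only known birthdays are grouped.
non_null_groups = conn.execute("""
    SELECT born, count(Company)
    FROM total
    WHERE born IS NOT NULL
    GROUP BY born
""").fetchall()

print(all_groups, non_null_groups)
```

This differs from the Pandas `groupby` above, which drops missing keys unless told otherwise.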

In [None]:
total2 = total.dropna()
total2['born'] = total2['born'].astype('datetime64[ns]')

In [None]:
# Let's look at when the CEOs were born

birthdays = total2.groupby(by='born').count()[['Executive']]
birthdays.index = pd.to_datetime(birthdays.index, unit='s')

birthdays

We can use `resample` over dates, with a frequency parameter, to group, e.g., by decade (`10A` means 10 years, where `A` is the code for year-end; newer Pandas releases deprecate `A` in favor of `YE`):

In [None]:
birthdays.resample('10A').count().plot(kind='bar')

Maybe that's a little weird.  Can we do something more along the lines of what we expect, i.e., "1920s, 1930s, ..."?

In [None]:
# Get rid of the nulls!
bdays = total[['born']].dropna()

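# Label each birthday with its decade, e.g. 1967 -> '1960s' (floor division avoids float rounding)
bdays['born'] = bdays['born'].apply(lambda bday: str(bday.year // 10 * 10) + 's')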

bdays.reset_index().groupby('born').count().plot(kind='bar')
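
The decade labelling can be checked in isolation without any dataframe. A minimal sketch, assuming a hypothetical helper `decade_label` and five birthdays copied by hand from the tables above:

```python
from collections import Counter
from datetime import date

def decade_label(day: date) -> str:
    """Map a date to a label such as '1960s' using floor division."""
    return f"{day.year // 10 * 10}s"

# Five birthdays copied from the executive table shown earlier.
birthdays = [date(1967, 6, 14), date(1963, 5, 27), date(1968, 2, 22),
             date(1951, 5, 17), date(1941, 6, 4)]

# Count how many birthdays fall into each decade.
per_decade = Counter(decade_label(day) for day in birthdays)
for decade in sorted(per_decade):
    print(decade, per_decade[decade])
```

Unlike `resample('10A')`, whose bins are anchored at the first date in the data, these labels always align with calendar decades.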

# An Exercise

Is there a correlation between the kind of company and the age of the CEO?

Does the company's line of business matter?