<a href="https://colab.research.google.com/github/zackives/upenn-cis-2450/blob/main/5_Module_2_Data_Modeling_and_Decomposition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture Module 2: Logical Design: Conceptual Data Representation

## LinkedIn Social Analysis

Our second module explores concepts in:

* Designing data representations to capture important relationships
* Reasoning over graphs
* Exploring and traversing graphs


Subsequently, in the next module, we'll look at how *physical design* (indexing, data layout) and *algorithms* can affect performance.

## Generality of Data Models

We have claimed that data can be represented as a tree, as tables, or as graphs -- and all are equivalent. We'll see this in action here.

## Hierarchical data



### Preliminaries

We'll use MongoDB on the cloud as a sample NoSQL database.

We'll first collect Colab's host IP address, which you might need if you aren't able to connect to the database.

In [None]:
!curl ipecho.net/plain

130.211.240.130

In [None]:
!pip3 install pymongo[srv]
!pip3 install lxml
!pip3 install duckdb

Collecting pymongo[srv]
  Downloading pymongo-4.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
[0mCollecting dnspython<3.0.0,>=1.16.0 (from pymongo[srv])
  Downloading dnspython-2.6.1-py3-none-any.whl.metadata (5.8 kB)
Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m307.7/307.7 kB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pymongo-4.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m28.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: dnspython, pymongo
Successfully installed dnspython-2.6.1 pymongo-4.8.0


In [None]:
import pandas as pd
import numpy as np

# JSON parsing
import json

# HTML parsing
from lxml import etree
import urllib

# DuckDB RDBMS
import duckdb

# Time conversions
import time

# NoSQL DB
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure

## Our Example Dataset

A crawl of LinkedIn, stored as a sequence of JSON objects (one per line).  Here's a scan through the sample dataset, taken from Kaggle (https://www.kaggle.com/linkedindata/linkedin-crawled-profiles-dataset).  We have subsequently removed all names of individuals.

In [None]:
!wget -nc https://storage.googleapis.com/penn-cis5450/linkedin_anon.jsonl

--2024-09-14 19:50:16--  https://storage.googleapis.com/penn-cis5450/linkedin_anon.jsonl
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.203.207, 74.125.204.207, 64.233.187.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.203.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 179851696 (172M) [application/octet-stream]
Saving to: ‘linkedin_anon.jsonl’


2024-09-14 19:50:26 (20.1 MB/s) - ‘linkedin_anon.jsonl’ saved [179851696/179851696]



In [None]:
!pip3 install randomname

Collecting randomname
  Downloading randomname-0.2.1.tar.gz (64 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.2/64.2 kB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting fire (from randomname)
  Downloading fire-0.6.0.tar.gz (88 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m88.4/88.4 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: randomname, fire
  Building wheel for randomname (setup.py) ... [?25l[?25hdone
  Created wheel for randomname: filename=randomname-0.2.1-py3-none-any.whl size=89195 sha256=fdd9a402a9f575dfe6ba7eaf959a59acae212026ddaad07709101da8f2975cec
  Stored in directory: /root/.cache/pip/wheels/10/50/8a/25f3820d26a431ffed1834d72ff2eb349123cf2b44c5a45727
  Building wheel for fire (setup.py) ... [?25l[?25hdone
  Created wheel for fire: filename=fire-0.6.0-py2.py3-no

This code randomizes people's names

In [None]:
import randomname
import json
import pandas as pd
import os.path

def capsFirst(x):
  return x[0].upper() + x[1:]

if os.path.exists('linkedin_small.json'):
  linked_in = open('linkedin_small.json')

  people = []
  count = 0
  LIMIT = 50000

  for line in linked_in:
      person = json.loads(line)
      person['name'] = {'family_name': capsFirst(randomname.generate('names/surnames/scottish')),
                      'given_name': randomname.generate('names/people/butlers')}
      del person['also_view']
      person['_id'] = randomname.generate()
      person['url'] = 'https://www.linkedin.com/in/' + person['_id']
      if 'homepage' in person:
        del person['homepage']
      if 'overview_html' in person:
        del person['overview_html']
      people.append(person)
      count = count + 1
      if count >= LIMIT:
          break

  people_df = pd.DataFrame(people)
  print ("%d records"%len(people_df))

  with open('linkedin_anon.jsonl', 'wt') as my_file:
    for line in people_df.iterrows():
        jline = line[1].to_json()
        my_file.write(f'{jline}\n')


In [None]:
%%time
# 50K records from linkedin
linked_in = open('linkedin_anon.jsonl')

people = []

for line in linked_in:
    person = json.loads(line)
    people.append(person)

people_df = pd.DataFrame(people)
print ("%d records"%len(people_df))

people_df

50000 records
CPU times: user 3.03 s, sys: 872 ms, total: 3.9 s
Wall time: 5.51 s


Unnamed: 0,_id,name,locality,skills,industry,summary,url,education,group,interval,experience,specilities,events,interests,honors
0,moist-vodka,"{'family_name': 'Post', 'given_name': 'Belvede...",United States,"[Key Account Development, Strategic Planning, ...",Medical Devices,SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJ...,https://www.linkedin.com/in/moist-vodka,,,,,,,,
1,adagio-catalyst,"{'family_name': 'Watt', 'given_name': 'Brunton'}","Antwerp Area, Belgium","[Molecular Biology, Biomarkers]",Pharmaceuticals,Ph.D. scientist with background in cancer rese...,https://www.linkedin.com/in/adagio-catalyst,"[{'start': '2008', 'major': 'Economics', 'end'...","{'affilition': ['ASMALLWORLD.net', 'Biomarker ...",20.0,"[{'org': 'Johnson and Johnson', 'title': 'Seni...","Biomarkers in Oncology, Cancer Genomics, Molec...","[{'from': 'Sahlgrenska University Hospital', '...",,
2,tart-acorn,"{'family_name': 'Hannay', 'given_name': 'Passe...","San Francisco, California","[DNA, Nanotechnology, Molecular Biology, Softw...",Research,I am interested in inventing new methods to co...,https://www.linkedin.com/in/tart-acorn,"[{'major': 'Biophysics', 'end': '2009', 'name'...",,0.0,"[{'org': 'UCSF', 'title': 'Assistant Professor...",,[{'from': 'Wyss Institute for Biologically Ins...,"personal genomics, nanotechnology",
3,objective-riesling,"{'family_name': 'Carnegie', 'given_name': 'Pas...",San Francisco Bay Area,,Information Technology and Services,OBJECTIVE<Primary> Work on an interesting and ...,https://www.linkedin.com/in/objective-riesling,,"{'affilition': ['Big Data, Low Latency', 'Expe...",5.0,"[{'org': '<Online Recruiting Company>', 'desc'...",,"[{'from': '<Employee Benefits, Administration ...",,
4,generative-amberjack,"{'family_name': 'Duncan', 'given_name': 'Merri...","Chennai Area, India","[Program Management, French, Avionics, Embedde...",Aviation & Aerospace,"Experience in Avionics Systems, Embedded Syste...",https://www.linkedin.com/in/generative-amberjack,"[{'start': '1988', 'end': '1989', 'name': 'Eco...",{'member': 'Member of Project Management Insti...,,,,,"Literature, Philosophy, Music",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,glowing-flush,"{'family_name': 'Kincaid', 'given_name': 'Pass...",Greater Chicago Area,"[Spanish-speaking, Cicerone-Certified Beer Ser...",Marketing and Advertising,Sales and marketing professional specializing ...,https://www.linkedin.com/in/glowing-flush,"[{'start': '2007', 'major': 'PR and Advertisin...",,5.0,"[{'org': 'Louis Glunz Beer Inc.', 'title': 'On...",,"[{'from': 'Peet's Coffee and Tea', 'to': 'Expl...","craft beer industry, coffee industry, running,...",
49996,grouchy-flight,"{'family_name': 'Ogilvy', 'given_name': 'Bulli...",Greater Atlanta Area,,Financial Services,Accomplished business development manager expe...,https://www.linkedin.com/in/grouchy-flight,"[{'major': 'Marketing Focus', 'end': '2008', '...","{'member': 'Sigma Chi Fraternity', 'affilition...",14.0,"[{'org': 'Georgia-Pacific LLC', 'title': 'Acco...",,"[{'from': 'Bayer Advanced', 'to': 'BBDO', 'tit...",,[National Deans List]
49997,dense-bell,"{'family_name': 'Macdougall', 'given_name': 'B...","Calgary, Canada Area","[Project Management, Electrical Engineering, M...",Design,Brad Gibson is a recognized expert in power qu...,https://www.linkedin.com/in/dense-bell,[{'major': 'Engineering Physics (Solid State E...,"{'member': 'IEEE, APEGGA, APEGBC, PEO, APEGS, ...",42.0,"[{'org': 'DIALOG', 'desc': 'Electrical Enginee...","Data center design, high reliability power, po...","[{'from': 'Current Thinking Inc.', 'to': 'The ...",,
49998,brave-hoops,"{'family_name': 'Forsyth', 'given_name': 'Cadb...",San Francisco Bay Area,"[Corporate Social Responsibility, Public Polic...",Public Policy,Brad Kane's multi-faceted career in the govern...,https://www.linkedin.com/in/brave-hoops,"[{'major': 'Law', 'end': '1984', 'name': 'Univ...",{'affilition': ['Association for Public Policy...,26.0,"[{'org': 'The Bipartisan Bridge', 'title': 'Ex...","Brad has led initiatives, public policy, and p...","[{'from': 'Congresswoman Cardiss Collins', 'to...",,


## NoSQL storage

For this part we will give you read-only access to our copy of MongoDB.  

We may need to tell MongoDB to add your Colab IP address (so you can talk to the machine).

In [None]:
# Store in MongoDB and in an in-memory list

START = 0
# We already have the data loaded into MongoDB, so we won't actually
# read all 50000 records.  We'll test by reading + writing the first
# 3700 though!
LIMIT = 3700

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
uri = "mongodb+srv://cis2450:UWcHn7ofLNCik0XQ@test2450.3emsbl6.mongodb.net/?retryWrites=true&w=majority&appName=Test2450"
# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))

linkedin_db = client['linkedin']
linked_in = open('linkedin_anon.jsonl')

people = 0
for line in linked_in:
    person = json.loads(line)
    if people >= START:
        try:
            linkedin_db.posts.insert_one(person)
        except DuplicateKeyError:
            pass
        except OperationFailure:
            # If the above still uses our cluster, you'll get this error in
            # attempting to write to our MongoDB client
            pass
    people = people + 1
    if people >= LIMIT:
        break

In [None]:
# Build a list of the JSON elements
list_for_comparison = []

people = 0
for line in linked_in:
    person = json.loads(line)
    if people >= START:
        try:
            list_for_comparison.append(person)
        except DuplicateKeyError:
            pass
        except OperationFailure:
            # If the above still uses our cluster, you'll get this error in
            # attempting to write to our MongoDB client
            pass
    people = people + 1
    if people >= LIMIT:
        break

In [None]:
list_for_comparison

[{'_id': 'bisque-muntin',
  'name': {'family_name': 'Blair', 'given_name': 'Bullimore'},
  'locality': 'Washington D.C. Metro Area',
  'skills': None,
  'industry': 'Investment Management',
  'summary': "Adam Steiner is the Managing Director, and a founder, of the Steiner Family Office. The Family Office is an investment group and back office for the Steiner family and engages in venture capital, real estate and other traditional investment vehicles. Prior to this, Adam was the president/owner of Branch Electric Supply, a $360M Electrical Wholesale-Distributor, where he utilized expertise in process re-engineering, managing technology change, integrating acquisitions and building highly effective teams to drive performance. Branch was a recognized leader in the industry and sold for top industry multiples to Rexel, SA, the largest electrical wholesale-distribution company in the world, in 2000. Adam managed a successful transition to Rexel, post-sale, as Rexel/Branch’s Division Preside

In [None]:
# Two ways of looking up skills, one based on an in-memory
# list, one based on MongoDB queries

def find_skills_in_list(skill):
    for post in list_for_comparison:
        if 'skills' in post:
            skills = post['skills']
            if skills is not None:
              for this_skill in skills:
                  if this_skill == skill:
                      return post
    return None

def find_skills_in_mongodb(skill):
    return linkedin_db.posts.find_one({'skills': skill})

In [None]:
%%time
find_skills_in_list('Marketing')

CPU times: user 40 µs, sys: 4 µs, total: 44 µs
Wall time: 48.9 µs


{'_id': 'thick-manuscript',
 'name': {'family_name': 'Macdonnell', 'given_name': 'Brunton'},
 'locality': 'Elkhart, Indiana Area',
 'skills': ['Business Development',
  'Social Media',
  'Marketing',
  'Intellectual Property',
  'Alternative Dispute Resolution',
  'Strategic Planning',
  'Team Building',
  'Team Leadership',
  'Project Management',
  'Research',
  'New Business Development',
  'Microsoft Office'],
 'industry': 'Management Consulting',
 'summary': '• 10+ years of management experience leading business development, customer service & branding efforts with regional, national, international and not-for-profit organizations• Extremely active in local community (Multiple Local Chambers of Commerce, Rotary International, Greater Elkhart Chamber Ambassador’s Council, Horse Feathers, ITT Technical Institute Curriculum Committee, March of Dimes “March for Babies” Chairman Committee).• Passionate customer focus and driven to provide top-level customer service & solutions• Proven 

In [None]:
%%time
find_skills_in_mongodb('Marketing')

CPU times: user 1.47 ms, sys: 0 ns, total: 1.47 ms
Wall time: 187 ms


{'_id': 'plain-torpedo',
 'name': {'family_name': 'Hamilton', 'given_name': 'Brunton'},
 'locality': 'Hyderabad Area, India',
 'skills': ['Microbiology',
  'Vaccines',
  'International Sales',
  'Market Intelligence',
  'International Business Development',
  'Microsoft Excel',
  'Strategic Thinking',
  'Strategy',
  'PowerPoint',
  'Market Access',
  'Immunohistochemistry',
  'Marketing',
  'Operations Management',
  'Business Development',
  'Pharmaceutical Industry',
  'Six Sigma',
  'DMAIC',
  'Green Belt'],
 'industry': 'Biotechnology',
 'summary': '•Having 12 Yrs of Experience in Marketing & International Business Development in Pharmaceutical sector.•Experienced in managing operations in large business area, formulating and implementing strategies, developing new markets for business excellence.•Adept at developing the distribution network infrastructure and channel management.•Proven track record acquiring product registrations for several products in ASEAN Region.•An effective

## Designing a relational schema from hierarchical data

Given that we already have a predefined set of fields / attributes / features, we don't need to spend a lot of time defining our table *schemas*, except that we need to unnest data.

* Nested relationships can be captured by creating a second table, which has a **foreign key** pointing to the identifier (key) for the main (parent) table.
* Ordered lists can be captured by encoding an index number or row number.

In [None]:
'''
Simple code to pull out data from JSON and load into DuckDB.
'''
import ast

linked_in = open('linkedin_anon.jsonl')

START = 0
LIMIT = 10000

def get_df(rel):
    ret = pd.DataFrame(rel)
    return ret

lines = []
i = 1
for line in linked_in:
    if i > START + LIMIT:
        break
    elif i >= START:
        person = json.loads(line)

        lines.append(person)
    i = i + 1

people_df = get_df(pd.DataFrame(lines))



In [None]:
people_df

Unnamed: 0,_id,name,locality,skills,industry,summary,url,education,group,interval,experience,specilities,events,interests,honors
0,moist-vodka,"{'family_name': 'Post', 'given_name': 'Belvede...",United States,"[Key Account Development, Strategic Planning, ...",Medical Devices,SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJ...,https://www.linkedin.com/in/moist-vodka,,,,,,,,
1,adagio-catalyst,"{'family_name': 'Watt', 'given_name': 'Brunton'}","Antwerp Area, Belgium","[Molecular Biology, Biomarkers]",Pharmaceuticals,Ph.D. scientist with background in cancer rese...,https://www.linkedin.com/in/adagio-catalyst,"[{'start': '2008', 'major': 'Economics', 'end'...","{'affilition': ['ASMALLWORLD.net', 'Biomarker ...",20.0,"[{'org': 'Johnson and Johnson', 'title': 'Seni...","Biomarkers in Oncology, Cancer Genomics, Molec...","[{'from': 'Sahlgrenska University Hospital', '...",,
2,tart-acorn,"{'family_name': 'Hannay', 'given_name': 'Passe...","San Francisco, California","[DNA, Nanotechnology, Molecular Biology, Softw...",Research,I am interested in inventing new methods to co...,https://www.linkedin.com/in/tart-acorn,"[{'major': 'Biophysics', 'end': '2009', 'name'...",,0.0,"[{'org': 'UCSF', 'title': 'Assistant Professor...",,[{'from': 'Wyss Institute for Biologically Ins...,"personal genomics, nanotechnology",
3,objective-riesling,"{'family_name': 'Carnegie', 'given_name': 'Pas...",San Francisco Bay Area,,Information Technology and Services,OBJECTIVE<Primary> Work on an interesting and ...,https://www.linkedin.com/in/objective-riesling,,"{'affilition': ['Big Data, Low Latency', 'Expe...",5.0,"[{'org': '<Online Recruiting Company>', 'desc'...",,"[{'from': '<Employee Benefits, Administration ...",,
4,generative-amberjack,"{'family_name': 'Duncan', 'given_name': 'Merri...","Chennai Area, India","[Program Management, French, Avionics, Embedde...",Aviation & Aerospace,"Experience in Avionics Systems, Embedded Syste...",https://www.linkedin.com/in/generative-amberjack,"[{'start': '1988', 'end': '1989', 'name': 'Eco...",{'member': 'Member of Project Management Insti...,,,,,"Literature, Philosophy, Music",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,gourmet-block,"{'family_name': 'Drummond', 'given_name': 'Alf...",Singapore,,Chemicals,Site manager for a chemical MC with extensive ...,https://www.linkedin.com/in/gourmet-block,"[{'major': 'Applied Finance', 'end': '2007', '...","{'affilition': ['COMPANY PHARMA TALENT', 'Chem...",22.0,"[{'org': 'Perstorp Singapore Pte Ltd', 'title'...","Operations, Production chemical petrochemical,...",[{'from': 'Stazione Sperimentale per I Combust...,,
9996,matte-curve,"{'family_name': 'Rollo', 'given_name': 'Bullim...","Sevilla y alrededores, España","[Automotive, Marketing Strategy, Product Manag...",Sector automovilístico,,https://www.linkedin.com/in/matte-curve,[{'name': 'Licenciado en Marketing e investiga...,,9.0,"[{'org': 'Glassdrive España', 'title': 'Direct...",,"[{'from': 'Saint-Gobain Glassdrive España', 't...",,
9997,adiabatic-pilot,"{'family_name': 'Barclay', 'given_name': 'Simo...","Gijón y alrededores, España","[Intranet, Spanish, Personnel Management, Inte...",Minería y metalurgia,"In my current position, I've faced two key cha...",https://www.linkedin.com/in/adiabatic-pilot,"[{'major': 'HR Management', 'end': '2001', 'na...","{'affilition': ['ArcelorMittal Group', 'Basket...",19.0,"[{'org': 'ArcelorMittal', 'title': 'Head of In...",,"[{'from': 'Secades, Lozano y Tejon', 'to': 'Pr...",,
9998,advanced-object,"{'family_name': 'Hope', 'given_name': 'Jeeves'}","Nice Area, France","[Amadeus, Project Coordination, Project Manage...",Technologies et services de l'information,,https://www.linkedin.com/in/advanced-object,[{'major': 'Computer science and telecommunica...,,0.0,"[{'org': 'Amadeus IT Group', 'title': 'Impleme...",,"[{'from': 'Vodafone IT', 'to': 'Reply', 'title...",,


In [None]:
def get_nested_dict(rel, name):
  # This evaluates the string that describes the dictionary, as a dictionary
  # definition
  ret = rel.copy()
  # ret[name] = rel[name].map(lambda x: ast.literal_eval(x) if len(x) else np.NaN)
  ret = ret.dropna()
  # This joins rows on the index
  return ret.drop(columns=name).join(pd.DataFrame(ret[name].tolist()))

def get_nested_list(rel, name):
  # This evaluates the string that describes the dictionary, as a dictionary
  # definition
  ret = rel.copy()
  ret = ret.dropna().explode(name).dropna()
  ret = ret.join(pd.DataFrame(ret[name].tolist())).drop(columns=name).drop_duplicates()
  return ret.rename(columns={0: name})

def get_nested_list_dict(rel, name):
  ret = rel.copy()

  ret = ret.dropna().explode(name)

  exploded_pairs = pd.DataFrame(ret.apply(lambda x: {'_id': x['_id']} | x[name] if isinstance(x[name], dict) else {'_id': x['_id']}, axis=1).tolist())

  return ret.merge(exploded_pairs, on='_id').drop(columns=name)
  #pd.DataFrame(ret[name].tolist())).drop(columns=name).drop_duplicates()

# Take the lists, drop any blank strings
specialties_df = people_df[['_id','specilities']].explode('specilities').rename(columns={'_id': 'person'})
specialties_df.dropna(inplace=True)
interests_df = people_df[['_id','interests']].explode('interests').rename(columns={'_id': 'person'})
interests_df.dropna(inplace=True)

names_df = get_nested_dict(people_df[['_id','name']], 'name')

education_df = get_nested_list_dict(people_df[['_id','education']], 'education')
experience_df = get_nested_list_dict(people_df[['_id','experience']], 'experience')
skills_df = get_nested_list(people_df[['_id','skills']], 'skills')
honors_df = get_nested_list(people_df[['_id','honors']], 'honors')
events_df = get_nested_list_dict(people_df[['_id','events']], 'events')

groups_df = get_nested_dict(people_df[['_id','group']], 'group')

#people_df = people_df.drop(columns=['name','education','group','skills','experience','honors','events','specilities','interests'])


In [None]:
events_df

Unnamed: 0,_id,from,to,title1,start,title2,end
0,adagio-catalyst,Sahlgrenska University Hospital,Memorial Sloan Kettering Cancer Center,Research Scientist,24022.0,Post Doctoral Research Fellow,24036.0
1,adagio-catalyst,Memorial Sloan Kettering Cancer Center,Columbia University,Post Doctoral Research Fellow,24036.0,Associate Research Scientist,24079.0
2,adagio-catalyst,Columbia University,Albert Einstein Medical Center,Associate Research Scientist,24079.0,Associate at Dept of Molecular Genetics,24104.0
3,adagio-catalyst,Albert Einstein Medical Center,Johnson and Johnson,Associate at Dept of Molecular Genetics,24104.0,"Senior Scientist, Oncology Biomarkers",24118.0
4,adagio-catalyst,Sahlgrenska University Hospital,Memorial Sloan Kettering Cancer Center,Research Scientist,24022.0,Post Doctoral Research Fellow,24036.0
...,...,...,...,...,...,...,...
229237,advanced-object,Amadeus,Amadeus IT Group,Product definition engineer,24102.0,Implementation engineer,24121.0
229238,inflammable-tarragon,Koch Media srl,Atari Games,Sales Manager Italy,24039.0,sales director,24070.0
229239,inflammable-tarragon,Atari Games,Namco Bandai Partners,sales director,24070.0,Sales Director,24114.0
229240,inflammable-tarragon,Koch Media srl,Atari Games,Sales Manager Italy,24039.0,sales director,24070.0


In [None]:
interests_df

Unnamed: 0,person,interests
2,tart-acorn,"personal genomics, nanotechnology"
4,generative-amberjack,"Literature, Philosophy, Music"
5,salty-section,"travelling,the sea,trying new things, trying t..."
9,rich-laser,Marketing and statistical marketing applicatio...
10,cold-reveal,"Fashion Photography, Public Relations, Marketi..."
...,...,...
9980,delicious-voxel,Programming and Clean Code
9984,cerulean-gabardine,"free software, copyleft, open source, linux, p..."
9987,linear-foie-gras,"Keen on motorbike travelling, sailing, diving ..."
9990,radiant-rottweiler,"Arte, Diseño, 3D, Programación, Flash, Motion ..."


In [None]:
specialties_df

Unnamed: 0,person,specilities
1,adagio-catalyst,"Biomarkers in Oncology, Cancer Genomics, Molec..."
5,salty-section,"A passion for Brands, coupled with experience ..."
13,forgiving-desert,"Internet Marketing, Interactive Marketing, Dig..."
16,plain-torpedo,"Marketing , Operations Management , P&L Head, ..."
23,cheerful-mackerel,"SQL, DB2, COBOL,JCL, JAVA"
...,...,...
9984,cerulean-gabardine,"Web Design, Web Development, Web Standards, We..."
9988,camel-mason,"Innovation, comunication, social media"
9992,callous-horror,"Marketing, Advertising, Commercial & Business ..."
9995,gourmet-block,"Operations, Production chemical petrochemical,..."


In [None]:
names_df

Unnamed: 0,_id,family_name,given_name
0,moist-vodka,Post,Belvedere
1,adagio-catalyst,Watt,Brunton
2,tart-acorn,Hannay,Passepartout
3,objective-riesling,Carnegie,Passepartout
4,generative-amberjack,Duncan,Merriman
...,...,...,...
9995,gourmet-block,Drummond,Alfred
9996,matte-curve,Rollo,Bullimore
9997,adiabatic-pilot,Barclay,Simonides
9998,advanced-object,Hope,Jeeves


In [None]:
education_df

Unnamed: 0,_id,start,major,end,name,desc,degree
0,adagio-catalyst,2008,Economics,2008,Columbia University - Columbia Business School,"Coursework ""Principals of Economics"" ECON1105\...",
1,adagio-catalyst,2007,,2007,Columbia University - Columbia Business School,,
2,adagio-catalyst,1996,Cancer genomics,2001,Göteborgs universitet,"Thesis: ""The role of p53 in tumor progression ...",Ph.D.
3,adagio-catalyst,1994,"Biology, Medicine;German Language",1995,Universität Regensburg,,"Cancer Research, Coursework"
4,adagio-catalyst,1989,Biology,1994,Göteborgs universitet,,Master
...,...,...,...,...,...,...,...
60872,advanced-object,2005,Computer science and telecommunications,2008,Politecnico di Torino,"Degree thesis: ""Vehicle-ground communication b...",Telematics engineer
60873,advanced-object,2002,Computer science,2005,Politecnico di Torino,,Computer engineer
60874,advanced-object,2005,Computer science and telecommunications,2008,Politecnico di Torino,"Degree thesis: ""Vehicle-ground communication b...",Telematics engineer
60875,advanced-object,2002,Computer science,2005,Politecnico di Torino,,Computer engineer


In [None]:
experience_df

Unnamed: 0,_id,org,title,end,start,desc
0,adagio-catalyst,Johnson and Johnson,"Senior Scientist, Oncology Biomarkers",Present,November 2009,Biomarker Leader for compounds in clinical dev...
1,adagio-catalyst,Albert Einstein Medical Center,Associate at Dept of Molecular Genetics,,September 2008,Single Cell Gene expression.
2,adagio-catalyst,Columbia University,Associate Research Scientist,,August 2006,Work on peptide to restore wt p53 function in ...
3,adagio-catalyst,Memorial Sloan Kettering Cancer Center,Post Doctoral Research Fellow,,January 2003,Molecular profiling of colorectal cancer.
4,adagio-catalyst,Sahlgrenska University Hospital,Research Scientist,,November 2001,Cancer Research at Dept of Surgery.Molecular p...
...,...,...,...,...,...,...
339192,inflammable-tarragon,Atari Games,sales director,,November 2005,"Sales Director Italiy , for a Branch of Atari ..."
339193,inflammable-tarragon,Koch Media srl,Sales Manager Italy,,April 2003,Sales Manager Italy distributor of software an...
339194,inflammable-tarragon,Namco Bandai Partners,Sales Director,Present,July 2009,
339195,inflammable-tarragon,Atari Games,sales director,,November 2005,"Sales Director Italiy , for a Branch of Atari ..."


In [None]:
groups_df


Unnamed: 0,_id,affilition,member
1,adagio-catalyst,"[Big Data, Low Latency, Experts Answer's, Link...",
3,objective-riesling,"[Canadian Marketing Association, LeadingLoyalt...",
4,generative-amberjack,"[CFA Institute Candidates, Economist Intellige...",Associate Member of SAMRA
6,chalky-tenement,"[BMW Group, BPO Executives, Engineering jobs B...",
9,rich-laser,"[Annamalai University Alumni, Annamalai Univer...",
...,...,...,...
9992,callous-horror,,
9993,equidistant-lumen,,
9994,savory-dimension,,
9995,gourmet-block,,


In [None]:
conn = duckdb.connect('linkedin.db')

conn.sql('drop table if exists people')
conn.sql('drop table if exists names')
conn.sql('drop table if exists education')
conn.sql('drop table if exists groups')
conn.sql('drop table if exists skills')
conn.sql('drop table if exists experience')
conn.sql('drop table if exists honors')
conn.sql('drop table if exists events')
conn.sql('drop table if exists specialties')
conn.sql('drop table if exists interests')

In [None]:
# Save these to the SQLite database

conn.sql("""
  CREATE TABLE IF NOT EXISTS people AS
   SELECT * FROM people_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS names AS
   SELECT * FROM names_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS education AS
   SELECT * FROM education_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS groups AS
   SELECT * FROM groups_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS skills AS
   SELECT * FROM skills_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS experience AS
   SELECT * FROM experience_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS honors AS
   SELECT * FROM honors_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS events AS
   SELECT * FROM events_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS specialties AS
   SELECT * FROM specialties_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS interests AS
   SELECT * FROM interests_df
""")

In [None]:
conn.sql("""
  SELECT experience._id, org
  FROM people
  JOIN experience ON people._id=experience._id""")

┌─────────────────┬────────────────────────────────────────┐
│       _id       │                  org                   │
│     varchar     │                varchar                 │
├─────────────────┼────────────────────────────────────────┤
│ adagio-catalyst │ Johnson and Johnson                    │
│ adagio-catalyst │ Albert Einstein Medical Center         │
│ adagio-catalyst │ Columbia University                    │
│ adagio-catalyst │ Memorial Sloan Kettering Cancer Center │
│ adagio-catalyst │ Sahlgrenska University Hospital        │
│ adagio-catalyst │ Johnson and Johnson                    │
│ adagio-catalyst │ Albert Einstein Medical Center         │
│ adagio-catalyst │ Columbia University                    │
│ adagio-catalyst │ Memorial Sloan Kettering Cancer Center │
│ adagio-catalyst │ Sahlgrenska University Hospital        │
│       ·         │          ·                             │
│       ·         │          ·                             │
│       ·         │     

In [None]:
conn.sql("""
  SELECT experience._id, group_concat(org) AS experience
  FROM people
  LEFT JOIN experience ON people._id=experience._id
  GROUP BY experience._id""")

┌───────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────┐
│        _id        │                                            experience                                            │
│      varchar      │                                             varchar                                              │
├───────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ adagio-catalyst   │ Johnson and Johnson,Albert Einstein Medical Center,Columbia University,Memorial Sloan Ketterin…  │
│ tart-acorn        │ UCSF,Wyss Institute for Biologically Inspired Engineering,UCSF,Wyss Institute for Biologically…  │
│ chalky-tenement   │ Canadian MedicAlert Foundation,CMAF,CMAF,Canadian MedicAlert Foundation,CMAF,CMAF,Canadian Med…  │
│ sensitive-estuary │ Complete IT Systems Ltd,Complete IT Systems Ltd,NTS (UK) Ltd,Altrigen Solutions Limited,Leeds …  │
│ rich-laser        │ Ericsson R

## Views

The following code starts a transaction (we can either `commit` or `rollback` at the end), removes an existing view, and creates a new one.

In [None]:
conn.sql('BEGIN TRANSACTION')
conn.sql('DROP VIEW IF EXISTS people_experience')
conn.execute("""
  CREATE VIEW IF NOT EXISTS people_experience AS
    SELECT experience._id, group_concat(org) AS experience
    FROM people
    LEFT JOIN experience ON people._id=experience._id
    GROUP BY experience._id""")
conn.execute('COMMIT')

# Treat the view as a table, see what's there
conn.sql('SELECT * FROM people_experience')

┌───────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────┐
│        _id        │                                            experience                                            │
│      varchar      │                                             varchar                                              │
├───────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ adagio-catalyst   │ Johnson and Johnson,Albert Einstein Medical Center,Columbia University,Memorial Sloan Ketterin…  │
│ tart-acorn        │ UCSF,Wyss Institute for Biologically Inspired Engineering,UCSF,Wyss Institute for Biologically…  │
│ chalky-tenement   │ Canadian MedicAlert Foundation,CMAF,CMAF,Canadian MedicAlert Foundation,CMAF,CMAF,Canadian Med…  │
│ sensitive-estuary │ Complete IT Systems Ltd,Complete IT Systems Ltd,NTS (UK) Ltd,Altrigen Solutions Limited,Leeds …  │
│ rich-laser        │ Ericsson R

## Deep Dive: Converting a Complex Tree to Relations

Now that we've seen the basics of taking hierarchical data and turning it into relations, let's put the LinkedIn data on the stack for a brief time, and try a more difficult exercise representing (and querying) tree-structured data.

We'll take the HTML data from Wikipedia pages, seen in the Lecture 1 Notebook, and "shred" the HTML into tables.

Briefly, if we think of the HTML as a tree of nodes, e.g.:

```
   <html>
   |   |
<head> <body>
   |    |   |
<title> <h1> <p>
   |     |    \
 ABC    ABC    DEF
```

Then we can give a **node ID** to each node in the tree; an a **position** (0, 1, ...) to each sibling at a level in the tree.  We will "slice" the tree into segments, each of which becomes a row in a table.  The row will include the node ID, the node label or type ("h1" or "text()"), the node value if the type is text(), and the position.

In [None]:
import urllib
from lxml import etree
import pandas as pd

## HTML as edges

Each time we parse an HTML node, we can give it a new ID.  If we record the ID of its parent, we essentially get an _edge_ going back to the parent.

In [None]:
# Recursively crawl the node and add rows to the html_tree table
def traverse_html(node, parent, pos, nodes_list) -> list:
    if node.text and parent > -1 and len(str(node.text).strip()):
        text_id = len(nodes_list)
        entry = {'node_id': text_id, 'parent_node_id': parent, 'type_or_label': 'text()', 'pos': pos, 'value': str(node.text).strip()}
        print (str(entry))
        nodes_list.append(entry)

    if node.tag:
        node_id = len(nodes_list)
        entry = {'node_id': node_id, 'parent_node_id': parent, 'type_or_label': node.tag, 'pos': pos, 'value': ''}
        nodes_list.append(entry)
        print (str(entry))
        index = 0
        for child in list(node):
            (child_id, nodes_list) = traverse_html(child, node_id, index, nodes_list)
            index = index + 1

    if node.tail:
        text_id = len(nodes_list)
        entry = {'node_id': text_id, 'parent_node_id': parent, 'type_or_label': 'text()', 'pos': pos, 'value': node.tail}
        print (str(entry))
        nodes_list.append(entry)
    return (node_id, nodes_list)

pages_list = []
nodes_list = []


# Crawl these pages
page_list = ['https://en.wikipedia.org/wiki/Tim_Cook',
             'https://en.wikipedia.org/wiki/Chan_Zuckerberg_Initiative']
for page in page_list:
    page_content = urllib.request.urlopen(page).read()
    page_tree = etree.HTML(page_content)
    (root_node,nodes_list) = traverse_html(page_tree, -1, 0, nodes_list)
    pages_list.append({'url': page, 'root_id': root_node})

pages_df = pd.DataFrame(pages_list)
pages_df

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
{'node_id': 7736, 'parent_node_id': 7699, 'type_or_label': 'li', 'pos': 9, 'value': ''}
{'node_id': 7737, 'parent_node_id': 7736, 'type_or_label': 'text()', 'pos': 0, 'value': 'Free'}
{'node_id': 7738, 'parent_node_id': 7736, 'type_or_label': 'a', 'pos': 0, 'value': ''}
{'node_id': 7739, 'parent_node_id': 7699, 'type_or_label': 'text()', 'pos': 9, 'value': '\n'}
{'node_id': 7740, 'parent_node_id': 7699, 'type_or_label': 'li', 'pos': 10, 'value': ''}
{'node_id': 7741, 'parent_node_id': 7740, 'type_or_label': 'text()', 'pos': 0, 'value': 'Mag'}
{'node_id': 7742, 'parent_node_id': 7740, 'type_or_label': 'a', 'pos': 0, 'value': ''}
{'node_id': 7743, 'parent_node_id': 7699, 'type_or_label': 'text()', 'pos': 10, 'value': '\n'}
{'node_id': 7744, 'parent_node_id': 7699, 'type_or_label': 'li', 'pos': 11, 'value': ''}
{'node_id': 7745, 'parent_node_id': 7744, 'type_or_label': 'text()', 'pos': 0, 'value': 'Shox'}
{'node_id': 7746, '

Unnamed: 0,url,root_id
0,https://en.wikipedia.org/wiki/Tim_Cook,0
1,https://en.wikipedia.org/wiki/Chan_Zuckerberg_...,8387


Let's see the nodes and the edges to their parents.  A parent of `-1` is a root node.

In [None]:
node_df = pd.DataFrame(nodes_list)
node_df

Unnamed: 0,node_id,parent_node_id,type_or_label,pos,value
0,0,-1,html,0,
1,1,0,head,0,
2,2,1,meta,0,
3,3,1,text(),0,\n
4,4,1,text(),1,Tim Cook - Wikipedia
...,...,...,...,...,...
12731,12731,8467,text(),4,\n
12732,12732,8467,text(),5,"{""@context"":""https:\/\/schema.org"",""@type"":""Ar..."
12733,12733,8467,script,5,
12734,12734,8467,text(),5,\n


Let's look at the pages and the IDs of their root nodes.

In [None]:
pages_df

Unnamed: 0,url,root_id
0,https://en.wikipedia.org/wiki/Tim_Cook,0
1,https://en.wikipedia.org/wiki/Chan_Zuckerberg_...,8387


From the pages, we can join the root nodes and see what the tags are.

In [None]:
# Find all document roots
pages_df.merge(node_df,left_on=['root_id'],right_on=['node_id'])

Unnamed: 0,url,root_id,node_id,parent_node_id,type_or_label,pos,value
0,https://en.wikipedia.org/wiki/Tim_Cook,0,0,-1,html,0,
1,https://en.wikipedia.org/wiki/Chan_Zuckerberg_...,8387,8387,-1,html,0,


Now let's consider an XPath query to find all text within paragraphs.

This would be `//p/text()`. We can evaluate this easily by just looking for `p` elements whose children are text. This can be done by joining between nodes with `p`s and nodes with text.

In [None]:
# Return the contents of all text() nodes inside of <p> tags

node_df[node_df['type_or_label']=='p'][['node_id']].\
    merge(node_df[node_df['type_or_label']=='text()'], \
          left_on=['node_id'], right_on=['parent_node_id'])[['value']]

Unnamed: 0,value
0,Timothy Donald Cook
1,"(born November 1, 1960)"
2,is an American business executive who is the ...
3,chief executive officer
4,of
...,...
431,and has fewer other transparency requirements...
432,"Under this legal structure, as"
433,"wrote it, ""Zuckerberg will still control the ..."
434,The Chan Zuckerberg Initiative publicly lists...


It's potentially more informative to see a bit of context.  Let's show (1) the node ID of the parent of the `p` tag, (2) the node ID of the `p` tag, (3) the node ID of the text node.

In [None]:
p_text_nodes = node_df[node_df['type_or_label']=='p'][['parent_node_id','node_id']].\
    merge(node_df[node_df['type_or_label']=='text()'][['parent_node_id','node_id']], \
          left_on=['node_id'], right_on=['parent_node_id']).\
    rename(columns={'parent_node_id_x': 'p_parent_node_id', 'node_id_y': 'text_node_id'}).\
    drop(columns='node_id_x').rename(columns={'parent_node_id_y': 'p_node_id'})

p_text_nodes

Unnamed: 0,p_parent_node_id,p_node_id,text_node_id
0,1131,1214,1215
1,1131,1214,1217
2,1131,1214,1225
3,1131,1214,1226
4,1131,1214,1228
...,...,...,...
431,9241,10031,10100
432,9241,10031,10129
433,9241,10031,10133
434,9241,10031,10148


What can we say about the types of the parent's of the `p`-nodes?

In [None]:
current_items_df = p_text_nodes.rename(columns={'p_parent_node_id': 'ancestor_node_id'})

parents_df = current_items_df[['ancestor_node_id','text_node_id']].\
    merge(node_df,\
    left_on=['ancestor_node_id'],right_on=['node_id'])\
    [['parent_node_id','text_node_id','type_or_label']].\
rename(columns={'parent_node_id': 'ancestor_node_id'})

parents_df

Unnamed: 0,ancestor_node_id,text_node_id,type_or_label
0,1130,1215,div
1,1130,1217,div
2,1130,1225,div
3,1130,1226,div
4,1130,1228,div
...,...,...,...
431,9240,10100,div
432,9240,10129,div
433,9240,10133,div
434,9240,10148,div


And we can even traverse once more, to the parents of the parents!

In [None]:
current_items_df = parents_df

grandparents_df = current_items_df[['ancestor_node_id','text_node_id']].drop_duplicates().\
    merge(node_df,\
    left_on=['ancestor_node_id'],right_on=['node_id'])\
    [['parent_node_id','text_node_id','type_or_label']].\
rename(columns={'parent_node_id': 'ancestor_node_id'}).drop_duplicates()

grandparents_df

Unnamed: 0,ancestor_node_id,text_node_id,type_or_label
0,1113,1215,div
1,1113,1217,div
2,1113,1225,div
3,1113,1226,div
4,1113,1228,div
...,...,...,...
431,9229,10100,div
432,9229,10129,div
433,9229,10133,div
434,9229,10148,div


## Recursively find all ancestors!

We can start with the text nodes, then find their parents, then find their parents, then ...

This is a recursive process that stops when there aren't any more parents, and is called a *transitive closure* because it includes the full set of all transitively related nodes.

In [None]:
def find_ancestor_nodes(node_df, current_items_df):
    if len(current_items_df) == 0:
        return current_items_df
    else:
        parents_df = current_items_df[['ancestor_node_id','text_node_id']].drop_duplicates().\
            merge(node_df,\
            left_on=['ancestor_node_id'],right_on=['node_id'])\
            [['parent_node_id','text_node_id','type_or_label']].\
        rename(columns={'parent_node_id': 'ancestor_node_id'}).drop_duplicates()

        return pd.concat([parents_df,find_ancestor_nodes(node_df, parents_df)]).drop_duplicates()

nodes_ancestors = find_ancestor_nodes(node_df, p_text_nodes.rename(columns={'p_parent_node_id': 'ancestor_node_id'}))

nodes_ancestors

Unnamed: 0,ancestor_node_id,text_node_id,type_or_label
0,1130,1215,div
1,1130,1217,div
2,1130,1225,div
3,1130,1226,div
4,1130,1228,div
...,...,...,...
431,-1,10100,html
432,-1,10129,html
433,-1,10133,html
434,-1,10148,html


In [None]:
# Can we find ONLY text from the Tim Cook (0th) document?

nodes_ancestors[nodes_ancestors['ancestor_node_id']==pages_df.iloc[0]['root_id']].\
    merge(node_df, left_on=['text_node_id'],right_on=['node_id'])[['text_node_id','value']]

Unnamed: 0,text_node_id,value
0,1215,Timothy Donald Cook
1,1217,"(born November 1, 1960)"
2,1225,is an American business executive who is the ...
3,1226,chief executive officer
4,1228,of
...,...,...
278,2486,LGBTQ youth dealing with homelessness
279,2488,and
280,2489,suicide
281,2491,hope that their situation could get better.


## Exercises

In [142]:
%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

Writing notebook-config.yaml


In [143]:
!pip3 install penngrader-client

Collecting penngrader-client
  Downloading penngrader_client-0.5.2-py3-none-any.whl.metadata (15 kB)
Collecting dill (from penngrader-client)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Downloading penngrader_client-0.5.2-py3-none-any.whl (10 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: dill, penngrader-client
Successfully installed dill-0.3.8 penngrader-client-0.5.2


In [144]:
#PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW WHO
#TO ASSIGN POINTS TO YOU IN OUR BACKEND
STUDENT_ID = 99999999 # YOUR PENN-ID GOES HERE AS AN INTEGER##PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW WHO

In [145]:
%set_env HW_ID=cis2450_fall24_HW9

env: HW_ID=cis2450_fall24_HW9


In [146]:
import os
from penngrader.grader import *

grader = PennGrader('notebook-config.yaml', os.environ['HW_ID'], STUDENT_ID, STUDENT_ID)

PennGrader initialized with Student ID: 99999999

Make sure this correct or we will not be able to store your grade


Can we find *all paragraphs* from the Zuckerberg (position-1) document, that have text children? (Hint: you can get all text nodes in the document, then find their parents)

In [147]:
# TODO: Return (node_id, type_or_label) as results_df

results_df = # TODO

This quick-check verifies your schema...

In [148]:
assert list(results_df.columns)==['node_id', 'type_or_label']

And submit!

In [149]:
grader.grade('zuckerberg', results_df)

Correct! You earned 1/1 points. You are a star!

Your submission has been successfully recorded in the gradebook.
