<a href="https://colab.research.google.com/github/zackives/upenn-cis5450-hw/blob/main/5_Module_2_Data_Modeling_and_Decomposition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture Module 2: Logical Design: Conceptual Data Representation

## LinkedIn Social Analysis

Our second module explores concepts in:

* Designing data representations to capture important relationships
* Reasoning over graphs
* Exploring and traversing graphs


Subsequently, in the next module, we'll look at how *physical design* (indexing, data layout) and *algorithms* can affect performance.

## Generality of Data Models

We have claimed that data can be represented as a tree, as tables, or as graphs -- and all are equivalent. We'll see this in action here.
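
As a small illustration of that claim, here is a minimal sketch built on a made-up record (the `person` dictionary and its field names are hypothetical, not taken from the dataset). It writes the same facts as a nested tree, as flat rows linked by a key, and as labeled graph edges, then rebuilds the tree from the rows to check that nothing was lost.

```python
# Hypothetical record, used only for illustration
person = {"_id": "p1", "locality": "Philadelphia", "skills": ["SQL", "Python"]}

# 1. Tree: the nested dictionary itself (skills hang beneath the person)

# 2. Tables: one row per person, one row per (person, skill) pair
people_rows = [{"_id": person["_id"], "locality": person["locality"]}]
skill_rows = [{"person": person["_id"], "pos": i, "skill": s}
              for i, s in enumerate(person["skills"])]

# 3. Graph: (source, label, target) edges
edges = [(person["_id"], "locality", person["locality"])]
edges += [(person["_id"], "skill", s) for s in person["skills"]]

# Rebuild the tree from the tables to show the two forms carry the same data
rebuilt = dict(people_rows[0])
rebuilt["skills"] = [r["skill"]
                     for r in sorted(skill_rows, key=lambda r: r["pos"])
                     if r["person"] == rebuilt["_id"]]
print(rebuilt == person)
```

The graph form as written drops list order; adding the position to each edge label would preserve it.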

## Hierarchical data



### Preliminaries

We'll use MongoDB on the cloud as a sample NoSQL database.

We'll first collect Colab's host IP address, which you might need if you aren't able to connect to the database.  If you get an authorization error in connecting to MongoDB, you'll need to post this IP address to Ed Discussion so we can add permissions to make a request.

In [None]:
#TODO: fill in from Ed
%env PASSWORD=#TODO

In [None]:
!curl ipecho.net/plain

In [None]:
!pip3 install pymongo
!pip3 install lxml
!pip3 install duckdb

In [None]:
import pandas as pd
import numpy as np

# JSON parsing
import json

# HTML parsing
from lxml import etree
import urllib

# DuckDB RDBMS
import duckdb

# Time conversions
import time

# NoSQL DB
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure

## Our Example Dataset

A crawl of LinkedIn, stored as a sequence of JSON objects (one per line).  Here's a scan through the sample dataset, taken from Kaggle (https://www.kaggle.com/linkedindata/linkedin-crawled-profiles-dataset).  We have subsequently removed all names of individuals.

In [None]:
!wget -nc https://storage.googleapis.com/penn-cis5450/linkedin_anon.jsonl

In [None]:
%%time
# 50K records from LinkedIn
people = []

# Read one JSON object per line; the with-block closes the file afterwards
with open('linkedin_anon.jsonl') as linked_in:
    for line in linked_in:
        people.append(json.loads(line))

people_df = pd.DataFrame(people)
print(f"{len(people_df)} records")

people_df

## NoSQL storage

For this part we will give you read-only access to our copy of MongoDB.

We may need to tell MongoDB to add your Colab IP address (so you can talk to the machine).

In [None]:
import os
# Store in MongoDB and in an in-memory list

START = 0
# We already have the data loaded into MongoDB, so we won't actually
# read all 50000 records.  We'll test by reading + writing the first
# 37 though (raise LIMIT to 3700 for a larger test)!
LIMIT = 37

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
import pymongo.client_options
import pymongo
password = os.getenv('PASSWORD')

if password is None:
    raise Exception("You must set the PASSWORD environment variable")

uri = "mongodb+srv://cis5450:" + password + "@test2450.3emsbl6.mongodb.net/?retryWrites=true&w=majority&appName=Test2450"
# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))

linkedin_db = client['linkedin']
linked_in = open('linkedin_anon.jsonl')

print('MongoDB has the following databases: ' + str(client.list_database_names()))

people = 0
for line in linked_in:
    person = json.loads(line)
    if people >= START:
        try:
            linkedin_db.posts.insert_one(person)
        except DuplicateKeyError as e:
            print (e)
            pass
        except OperationFailure as e:
            # If the above still uses our cluster, you'll get this error in
            # attempting to write to our MongoDB client because we haven't
            # given you write access
            if ("user is not allowed to do action [insert]" not in str(e)):
              print (e)
            pass
    people = people + 1
    if people >= LIMIT:
        break

In [None]:
# Build a list of the JSON elements
list_for_comparison = []

# Reopen the file: the previous cell's loop already consumed the first
# LIMIT lines of the old handle, so reusing it would skip those records
with open('linkedin_anon.jsonl') as linked_in:
    for people, line in enumerate(linked_in):
        if people >= LIMIT:
            break
        if people >= START:
            list_for_comparison.append(json.loads(line))

In [None]:
list_for_comparison

In [None]:
# Two ways of looking up skills, one based on an in-memory
# list, one based on MongoDB queries

def find_skills_in_list(skill):
    for post in list_for_comparison:
        if 'skills' in post:
            skills = post['skills']
            if skills is not None:
              for this_skill in skills:
                  if this_skill == skill:
                      return post
    return None

def find_skills_in_mongodb(skill):
    return linkedin_db.posts.find_one({'skills': skill})

In [None]:
%%time
find_skills_in_list('Marketing')

In [None]:
%%time
find_skills_in_mongodb('Marketing')

## Designing a relational schema from hierarchical data

Given that we already have a predefined set of fields / attributes / features, we don't need to spend a lot of time defining our table *schemas*, except that we need to unnest data.

* Nested relationships can be captured by creating a second table, which has a **foreign key** pointing to the identifier (key) for the main (parent) table.
* Ordered lists can be captured by encoding an index number or row number.
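
The two rules above can be sketched with a small example. It uses Python's built-in `sqlite3` module only so that it runs without extra installs (the notebook itself loads data into DuckDB), and the `person` and `person_skill` tables, their columns, and the sample records are hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical nested records (not from the dataset)
records = [
    {"_id": "p1", "locality": "Philadelphia", "skills": ["SQL", "Python"]},
    {"_id": "p2", "locality": "Boston", "skills": ["Marketing"]},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (_id TEXT PRIMARY KEY, locality TEXT)")
# Child table: the foreign key points back to the parent row, and
# pos records each skill's place in the original ordered list
conn.execute("""
    CREATE TABLE person_skill (
        person TEXT REFERENCES person(_id),
        pos INTEGER,
        skill TEXT
    )""")

for rec in records:
    conn.execute("INSERT INTO person VALUES (?, ?)",
                 (rec["_id"], rec["locality"]))
    conn.executemany(
        "INSERT INTO person_skill VALUES (?, ?, ?)",
        [(rec["_id"], pos, skill) for pos, skill in enumerate(rec["skills"])],
    )

# Joining on the key and ordering by pos recovers each nested list
rows = conn.execute("""
    SELECT person._id, pos, skill
    FROM person JOIN person_skill ON person._id = person_skill.person
    ORDER BY person._id, pos""").fetchall()
print(rows)
```

Storing `pos` explicitly matters because a relational table has no inherent row order; without it, the original list order could not be recovered.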

In [None]:
'''
Simple code to pull out data from JSON and load into DuckDB.
'''
import ast

linked_in = open('linkedin_anon.jsonl')

START = 0
LIMIT = 10000

def get_df(rel):
    ret = pd.DataFrame(rel)
    return ret

lines = []
i = 1
for line in linked_in:
    if i > START + LIMIT:
        break
    elif i >= START:
        person = json.loads(line)

        lines.append(person)
    i = i + 1

people_df = get_df(pd.DataFrame(lines))



In [None]:
people_df

In [None]:
def get_nested_dict(rel, name):
  # Expand a column of dictionaries into one column per dictionary key
  ret = rel.copy()
  # Reset the index after dropping nulls so that the positional join
  # below lines up row-for-row with the expanded dictionaries
  ret = ret.dropna().reset_index(drop=True)
  # This joins rows on the index
  return ret.drop(columns=name).join(pd.DataFrame(ret[name].tolist()))

def get_nested_list(rel, name):
  ret = rel.copy()
  # explode() repeats the original row label for every list element, so
  # reset the index before joining on it
  ret = ret.dropna().explode(name).dropna().reset_index(drop=True)
  ret = ret.join(pd.DataFrame(ret[name].tolist())).drop(columns=name).drop_duplicates()
  return ret.rename(columns={0: name})

def get_nested_list_dict(rel, name):
  ret = rel.copy()

  ret = ret.dropna().explode(name)

  exploded_pairs = pd.DataFrame(ret.apply(lambda x: {'_id': x['_id']} | x[name] if isinstance(x[name], dict) else {'_id': x['_id']}, axis=1).tolist())

  # Deduplicate the parent side first; otherwise the merge on _id repeats
  # every entry once per exploded row of the same person
  parents = ret.drop(columns=name).drop_duplicates()
  return parents.merge(exploded_pairs, on='_id')

# Take the lists, drop any blank strings
specialties_df = people_df[['_id','specilities']].explode('specilities').rename(columns={'_id': 'person'})
specialties_df.dropna(inplace=True)
interests_df = people_df[['_id','interests']].explode('interests').rename(columns={'_id': 'person'})
interests_df.dropna(inplace=True)

names_df = get_nested_dict(people_df[['_id','name']], 'name')

education_df = get_nested_list_dict(people_df[['_id','education']], 'education')
experience_df = get_nested_list_dict(people_df[['_id','experience']], 'experience')
skills_df = get_nested_list(people_df[['_id','skills']], 'skills')
honors_df = get_nested_list(people_df[['_id','honors']], 'honors')
events_df = get_nested_list_dict(people_df[['_id','events']], 'events')

groups_df = get_nested_dict(people_df[['_id','group']], 'group')

people_df = people_df.drop(columns=['name','education','group','skills','experience','honors','events','specilities','interests'])


In [None]:
events_df

In [None]:
interests_df

In [None]:
specialties_df

In [None]:
names_df

In [None]:
education_df

In [None]:
experience_df

In [None]:
groups_df


In [None]:
conn = duckdb.connect('linkedin.db')

conn.sql('drop table if exists people')
conn.sql('drop table if exists names')
conn.sql('drop table if exists education')
conn.sql('drop table if exists groups')
conn.sql('drop table if exists skills')
conn.sql('drop table if exists experience')
conn.sql('drop table if exists honors')
conn.sql('drop table if exists events')
conn.sql('drop table if exists specialties')
conn.sql('drop table if exists interests')

In [None]:
# Save these to the DuckDB database

conn.sql("""
  CREATE TABLE IF NOT EXISTS people AS
   SELECT * FROM people_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS names AS
   SELECT * FROM names_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS education AS
   SELECT * FROM education_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS groups AS
   SELECT * FROM groups_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS skills AS
   SELECT * FROM skills_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS experience AS
   SELECT * FROM experience_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS honors AS
   SELECT * FROM honors_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS events AS
   SELECT * FROM events_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS specialties AS
   SELECT * FROM specialties_df
""")

conn.sql("""
  CREATE TABLE IF NOT EXISTS interests AS
   SELECT * FROM interests_df
""")

In [None]:
conn.sql("""
  SELECT experience._id, org
  FROM people
  JOIN experience ON people._id=experience._id""")

In [None]:
conn.sql("""
  SELECT people._id, group_concat(org) AS experience
  FROM people
  LEFT JOIN experience ON people._id=experience._id
  GROUP BY people._id""")

## Views

The following code starts a transaction (we can either `commit` or `rollback` at the end), removes an existing view, and creates a new one.
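
To see what the commit-or-rollback choice means in practice, here is a minimal standalone sketch. It uses Python's built-in `sqlite3` rather than DuckDB, purely so that it runs without extra installs, and the table `t` and view `v` are made-up names. The view is first created inside a transaction that is rolled back, so it never becomes visible; the second attempt is committed and persists.

```python
import sqlite3

# isolation_level=None turns off the module's implicit transactions,
# so BEGIN / COMMIT / ROLLBACK are issued explicitly
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (x INTEGER)")

conn.execute("BEGIN TRANSACTION")
conn.execute("CREATE VIEW v AS SELECT x FROM t")
conn.execute("ROLLBACK")  # undo everything since BEGIN, including the view

views_after_rollback = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'view'").fetchall()

conn.execute("BEGIN TRANSACTION")
conn.execute("CREATE VIEW v AS SELECT x FROM t")
conn.execute("COMMIT")  # make the view permanent

views_after_commit = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'view'").fetchall()

print(views_after_rollback, views_after_commit)
```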

In [None]:
conn.sql('BEGIN TRANSACTION')
conn.sql('DROP VIEW IF EXISTS people_experience')
conn.execute("""
  CREATE VIEW IF NOT EXISTS people_experience AS
    SELECT people._id, group_concat(org) AS experience
    FROM people
    LEFT JOIN experience ON people._id=experience._id
    GROUP BY people._id""")
conn.execute('COMMIT')

# Treat the view as a table, see what's there
conn.sql('SELECT * FROM people_experience')

## Deep Dive: Converting a Complex Tree to Relations

Now that we've seen the basics of taking hierarchical data and turning it into relations, let's put the LinkedIn data on the stack for a brief time, and try a more difficult exercise representing (and querying) tree-structured data.

We'll take the HTML data from Wikipedia pages, seen in the Lecture 1 Notebook, and "shred" the HTML into tables.

Briefly, if we think of the HTML as a tree of nodes, e.g.:

```
   <html>
   |   |
<head> <body>
   |    |   |
<title> <h1> <p>
   |     |    \
 ABC    ABC    DEF
```

Then we can give a **node ID** to each node in the tree, and a **position** (0, 1, ...) to each sibling at a level in the tree.  We will "slice" the tree into segments, each of which becomes a row in a table.  The row will include the node ID, the node label or type ("h1" or "text()"), the node value if the type is text(), and the position.
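
As a rough sketch of this slicing, the snippet below shreds the small tree from the diagram into rows. It is a simplified, independent illustration: it uses the standard library's `xml.etree.ElementTree` instead of `lxml`, ignores tail text, and the `shred` helper is a hypothetical name, so its IDs and parent assignments need not match those produced by the notebook's own traversal code.

```python
import xml.etree.ElementTree as ET

doc = ("<html><head><title>ABC</title></head>"
       "<body><h1>ABC</h1><p>DEF</p></body></html>")

def shred(node, parent_id, pos, rows):
    """Append one row per element and per text node; return the element's ID."""
    node_id = len(rows)
    rows.append({"node_id": node_id, "parent_node_id": parent_id,
                 "type_or_label": node.tag, "pos": pos, "value": ""})
    child_pos = 0
    if node.text and node.text.strip():
        # Text directly inside this element becomes its first child
        rows.append({"node_id": len(rows), "parent_node_id": node_id,
                     "type_or_label": "text()", "pos": child_pos,
                     "value": node.text.strip()})
        child_pos += 1
    for child in node:
        shred(child, node_id, child_pos, rows)
        child_pos += 1
    return node_id

rows = []
shred(ET.fromstring(doc), -1, 0, rows)
for row in rows:
    print(row)
```

Each printed row is one "slice" of the tree: the parent ID is what lets the tree be reassembled later with joins.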

In [None]:
import urllib
from lxml import etree
import pandas as pd
import requests


## HTML as edges

Each time we parse an HTML node, we can give it a new ID.  If we record the ID of its parent, we essentially get an _edge_ going back to the parent.

In [None]:
def import_html(url: str):
  # Fetch the page's raw HTML, sending a browser-like User-Agent header
  headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
  }

  return requests.get(url, headers=headers).text


# Recursively crawl the node and add rows to the html_tree table
def traverse_html(node, parent, pos, nodes_list) -> list:
    if node.text and parent > -1 and len(str(node.text).strip()):
        text_id = len(nodes_list)
        entry = {'node_id': text_id, 'parent_node_id': parent, 'type_or_label': 'text()', 'pos': pos, 'value': str(node.text).strip()}
        print (str(entry))
        nodes_list.append(entry)

    if node.tag:
        node_id = len(nodes_list)
        entry = {'node_id': node_id, 'parent_node_id': parent, 'type_or_label': node.tag, 'pos': pos, 'value': ''}
        nodes_list.append(entry)
        print (str(entry))
        index = 0
        for child in list(node):
            (child_id, nodes_list) = traverse_html(child, node_id, index, nodes_list)
            index = index + 1

    if node.tail:
        text_id = len(nodes_list)
        entry = {'node_id': text_id, 'parent_node_id': parent, 'type_or_label': 'text()', 'pos': pos, 'value': node.tail}
        print (str(entry))
        nodes_list.append(entry)
    return (node_id, nodes_list)

pages_list = []
nodes_list = []


# Crawl these pages
page_list = ['https://en.wikipedia.org/wiki/Tim_Cook',
             'https://en.wikipedia.org/wiki/Chan_Zuckerberg_Initiative']
for page in page_list:
    page_content = import_html(page)
    page_tree = etree.HTML(page_content)
    (root_node,nodes_list) = traverse_html(page_tree, -1, 0, nodes_list)
    pages_list.append({'url': page, 'root_id': root_node})

pages_df = pd.DataFrame(pages_list)
pages_df

Let's see the nodes and the edges to their parents.  A parent of `-1` is a root node.

In [None]:
node_df = pd.DataFrame(nodes_list)
node_df

Let's look at the pages and the IDs of their root nodes.

In [None]:
pages_df

From the pages, we can join the root nodes and see what the tags are.

In [None]:
# Find all document roots
pages_df.merge(node_df,left_on=['root_id'],right_on=['node_id'])

Now let's consider an XPath query to find all text within paragraphs.

This would be `//p/text()`. We can evaluate this easily by just looking for `p` elements whose children are text. This can be done by joining between nodes with `p`s and nodes with text.
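
The join idea can be tried on a toy node table first. This is a minimal sketch with hypothetical rows for `<div><p>Hello</p><span>Bye</span></div>`; it uses the built-in `sqlite3` module (not the notebook's pandas or DuckDB) so that it runs on its own, and expresses `//p/text()` as a self-join of the node table.

```python
import sqlite3

# Hypothetical node table: (node_id, parent_node_id, type_or_label, value)
nodes = [
    (0, -1, "div", ""),
    (1, 0, "p", ""),
    (2, 1, "text()", "Hello"),
    (3, 0, "span", ""),
    (4, 3, "text()", "Bye"),
]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE node (node_id INTEGER, parent_node_id INTEGER,
                                   type_or_label TEXT, value TEXT)""")
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", nodes)

# //p/text(): pair each p node with the text nodes whose parent it is
p_text = conn.execute("""
    SELECT t.value
    FROM node AS p JOIN node AS t ON t.parent_node_id = p.node_id
    WHERE p.type_or_label = 'p' AND t.type_or_label = 'text()'""").fetchall()
print(p_text)
```

Only the text under the `p` node is returned; the text under `span` is filtered out by the label condition on the parent side of the join.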

In [None]:
# Return the contents of all text() nodes inside of <p> tags

node_df[node_df['type_or_label']=='p'][['node_id']].\
    merge(node_df[node_df['type_or_label']=='text()'], \
          left_on=['node_id'], right_on=['parent_node_id'])[['value']]

It's potentially more informative to see a bit of context.  Let's show (1) the node ID of the parent of the `p` tag, (2) the node ID of the `p` tag, (3) the node ID of the text node.

In [None]:
p_text_nodes = node_df[node_df['type_or_label']=='p'][['parent_node_id','node_id']].\
    merge(node_df[node_df['type_or_label']=='text()'][['parent_node_id','node_id']], \
          left_on=['node_id'], right_on=['parent_node_id']).\
    rename(columns={'parent_node_id_x': 'p_parent_node_id', 'node_id_y': 'text_node_id'}).\
    drop(columns='node_id_x').rename(columns={'parent_node_id_y': 'p_node_id'})

p_text_nodes

What can we say about the types of the parents of the `p`-nodes?

In [None]:
current_items_df = p_text_nodes.rename(columns={'p_parent_node_id': 'ancestor_node_id'})

parents_df = current_items_df[['ancestor_node_id','text_node_id']].\
    merge(node_df,\
    left_on=['ancestor_node_id'],right_on=['node_id'])\
    [['parent_node_id','text_node_id','type_or_label']].\
rename(columns={'parent_node_id': 'ancestor_node_id'})

parents_df

And we can even traverse once more, to the parents of the parents!

In [None]:
current_items_df = parents_df

grandparents_df = current_items_df[['ancestor_node_id','text_node_id']].drop_duplicates().\
    merge(node_df,\
    left_on=['ancestor_node_id'],right_on=['node_id'])\
    [['parent_node_id','text_node_id','type_or_label']].\
rename(columns={'parent_node_id': 'ancestor_node_id'}).drop_duplicates()

grandparents_df

## Recursively find all ancestors!

We can start with the text nodes, then find their parents, then find their parents, then ...

This is a recursive process that stops when there aren't any more parents, and is called a *transitive closure* because it includes the full set of all transitively related nodes.
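
The stopping condition is easier to see on a tiny example. Below is a minimal sketch in plain Python, assuming a hypothetical `parent_of` map in which `-1` marks a root; it repeatedly extends a frontier of (ancestor, original node) pairs until no new parents appear.

```python
# Hypothetical child -> parent edges; -1 marks "no parent" (a root)
parent_of = {0: -1, 1: 0, 2: 1, 3: 2}

def ancestors(start, parent_of):
    """Return the set of (ancestor, node) pairs for every node in start."""
    closure = set()
    # frontier holds (current ancestor, original node) pairs still to extend
    frontier = {(parent_of[n], n) for n in start if parent_of[n] != -1}
    while frontier:  # stop when no new parents remain
        closure |= frontier
        frontier = {(parent_of[a], n) for (a, n) in frontier
                    if parent_of[a] != -1} - closure
    return closure

print(sorted(ancestors([3], parent_of)))
```

Each pass of the loop plays the role of one parent-finding join; the loop ends once a pass yields no new pairs.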

In [None]:
def find_ancestor_nodes(node_df, current_items_df):
    if len(current_items_df) == 0:
        return current_items_df
    else:
        parents_df = current_items_df[['ancestor_node_id','text_node_id']].drop_duplicates().\
            merge(node_df,\
            left_on=['ancestor_node_id'],right_on=['node_id'])\
            [['parent_node_id','text_node_id','type_or_label']].\
        rename(columns={'parent_node_id': 'ancestor_node_id'}).drop_duplicates()

        return pd.concat([parents_df,find_ancestor_nodes(node_df, parents_df)]).drop_duplicates()

nodes_ancestors = find_ancestor_nodes(node_df, p_text_nodes.rename(columns={'p_parent_node_id': 'ancestor_node_id'}))

nodes_ancestors

In [None]:
# Can we find ONLY text from the Tim Cook (0th) document?

nodes_ancestors[nodes_ancestors['ancestor_node_id']==pages_df.iloc[0]['root_id']].\
    merge(node_df, left_on=['text_node_id'],right_on=['node_id'])[['text_node_id','value']]

## Exercises

In [None]:
%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

In [None]:
!pip3 install penngrader-client

In [None]:
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW
# WHOM TO ASSIGN POINTS TO IN OUR BACKEND
STUDENT_ID = 99999999 # YOUR PENN-ID GOES HERE AS AN INTEGER

In [None]:
%set_env HW_ID=cis5450_25f_HW9

In [None]:
import os
from penngrader.grader import *

grader = PennGrader('notebook-config.yaml', os.environ['HW_ID'], STUDENT_ID, STUDENT_ID)



Can we find *all paragraphs* in the Chan Zuckerberg Initiative (position-1) document that have text children? (Hint: you can get all text nodes in the document, then find their parents.) We'll give you a starting point here:

In [None]:
zuckerberg_root_id = pages_df.iloc[1]['root_id']
zuckerberg_root_id

In [None]:
# TODO: Return (node_id, type_or_label) as results_df

results_df = # something

results_df

Quick-check for your columns.

In [None]:
assert list(results_df.columns)==['node_id', 'type_or_label']

And submit!

In [None]:
grader.grade('zuckerberg', results_df)