<a href="https://colab.research.google.com/github/zackives/upenn-cis5450-hw/blob/main/6_Module_2_Part_II_Query_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture Module 2.2: Making Choices about Data Processing

## LinkedIn Social Analysis

Our next module explores concepts in:

* Algorithmic implications of design choices
* Techniques for indexing, parallelism, and sequence

It sets the stage for Module 3, which focuses on cloud/cluster-compute data processing.



In [None]:
!wget -nc https://storage.googleapis.com/penn-cis5450/linkedin_anon.jsonl

## Access Patterns

Let's create two data structures, a list and a dictionary (hash map), each holding the same 5,000,000 (integer, string) pairs. We'll then time how long it takes to read every entry from each.

In [None]:
intlist = []
for i in range(0,5000000):
  intlist.append((i+1,'a value'))

intdict = {}
for i in range(0,5000000):
  intdict[i] = ((i+1,'a value'))

In [None]:
%%time
count = 0
for i in range(0,len(intlist)):
  count += intlist[i][0]

In [None]:
%%time
count = 0
for i in range(0,len(intdict)):
  count += intdict[i][0]
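
The two timed loops above can also be compared side by side with `timeit`. The following is a minimal sketch, assuming a much smaller collection than the one above; the size `N` and the helper names `sum_list` and `sum_dict` are illustrative and not part of the original notebook. Absolute timings depend on the machine, so none are shown here.

```python
import timeit

N = 100_000  # assumption: far smaller than the 5,000,000 entries used above

small_list = [(i + 1, 'a value') for i in range(N)]
small_dict = {i: (i + 1, 'a value') for i in range(N)}

def sum_list():
    # Positional access: the list finds the entry directly from the index
    total = 0
    for i in range(len(small_list)):
        total += small_list[i][0]
    return total

def sum_dict():
    # Keyed access: the dictionary hashes the key, then probes its table
    total = 0
    for i in range(len(small_dict)):
        total += small_dict[i][0]
    return total

list_time = timeit.timeit(sum_list, number=5)
dict_time = timeit.timeit(sum_dict, number=5)
print(f"list: {list_time:.3f}s  dict: {dict_time:.3f}s")
```

Both functions compute the same sum, so any difference in time comes from the access path rather than from the work done per entry.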

In [None]:
%%time
# Copy all 50,000+ LinkedIn records ten times, with line buffering
# (buffering=1 flushes the output after every line)
linked_in = open('linkedin_anon.jsonl')

copied_data = open('linkedin_anon_copy.jsonl','w', buffering=1)

count = 0
for repeat in range(0,10):
  linked_in.seek(0)
  for line in linked_in:
    count += 1
    copied_data.write(line)

print (f"Copied {count} records")

In [None]:
%%time
# Same copy of all 50,000+ LinkedIn records, but with a 4096-byte output
# buffer, so data reaches the OS in larger chunks
linked_in = open('linkedin_anon.jsonl')

copied_data = open('linkedin_anon_copy.jsonl','w', buffering=4096)

count = 0
for repeat in range(0,10):
  linked_in.seek(0)
  for line in linked_in:
    count += 1
    copied_data.write(line)

print (f"Copied {count} records")
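
The effect of the `buffering` argument can be isolated from the LinkedIn file itself. Below is a hedged sketch that writes synthetic lines to temporary files under both settings; the helper `write_lines` and the sample data are assumptions made for illustration. Timings vary by system, so the sketch only checks that both settings produce the same output.

```python
import os
import tempfile
import time

def write_lines(path, lines, buffering):
    """Write all lines with the given buffering policy; return elapsed seconds."""
    start = time.perf_counter()
    with open(path, 'w', buffering=buffering) as out:
        for line in lines:
            out.write(line)
    return time.perf_counter() - start

# Assumption: synthetic lines stand in for the LinkedIn records
sample_lines = [f'{{"_id": {i}}}\n' for i in range(50_000)]

with tempfile.TemporaryDirectory() as tmp:
    line_path = os.path.join(tmp, 'line_buffered.jsonl')
    block_path = os.path.join(tmp, 'block_buffered.jsonl')

    # buffering=1 requests line buffering (text mode only)
    line_secs = write_lines(line_path, sample_lines, buffering=1)
    # buffering=4096 requests a 4096-byte buffer for the underlying file
    block_secs = write_lines(block_path, sample_lines, buffering=4096)

    same_size = os.path.getsize(line_path) == os.path.getsize(block_path)

print(f"line-buffered: {line_secs:.3f}s, 4 KB buffer: {block_secs:.3f}s")
print(f"identical output size: {same_size}")
```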

# Big Data Takes a Long Time to Process

Now that we've seen how to run fairly complex queries over relations, let's return to our big data example, the LinkedIn dataset. Recall that earlier in this module we worked with a segment of the LinkedIn input file.

In [None]:
!pip3 install lxml
!pip3 install duckdb

In [None]:
import pandas as pd
import numpy as np

# JSON parsing
import json

# HTML parsing
from lxml import etree
import urllib

# DuckDB RDBMS
import duckdb

# Time conversions
import time

In [None]:
%%time
# Parse every record from LinkedIn, then filter in Pandas
linked_in = open('linkedin_anon.jsonl')

people = []

for line in linked_in:
    person = json.loads(line)
    people.append(person)

people_df = pd.DataFrame(people)
people_df[people_df['industry'] == 'Medical Devices']

In [None]:
%%time
# Same file, but filter each record as it is parsed, keeping only the matches
linked_in = open('linkedin_anon.jsonl')

people = []

for line in linked_in:
    person = json.loads(line)
    if 'industry' in person and person['industry'] == 'Medical Devices':
        people.append(person)

people_df = pd.DataFrame(people)
people_df
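
The two cells above differ only in *when* the filter runs. Here is a small sketch of the same idea using the standard library alone; the in-memory `raw` buffer and its four records are assumptions standing in for `linkedin_anon.jsonl`.

```python
import io
import json

# Assumption: a tiny in-memory stand-in for linkedin_anon.jsonl
raw = io.StringIO(
    '{"_id": 1, "industry": "Medical Devices"}\n'
    '{"_id": 2, "industry": "Banking"}\n'
    '{"_id": 3}\n'
    '{"_id": 4, "industry": "Medical Devices"}\n'
)

# Strategy 1: materialize everything, then filter
raw.seek(0)
everyone = [json.loads(line) for line in raw]
filter_late = [p for p in everyone if p.get('industry') == 'Medical Devices']

# Strategy 2: filter while reading, keeping only the matching records
raw.seek(0)
filter_early = []
for line in raw:
    person = json.loads(line)
    if person.get('industry') == 'Medical Devices':
        filter_early.append(person)

# Strategy 1 held 4 records in memory; strategy 2 held only the 2 matches
print(len(everyone), len(filter_early))
```

Both strategies still parse every line; filtering early saves the memory and the DataFrame construction cost for records that would be discarded anyway.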

## SQL query without an index

SQL databases automatically "push down" selection and projection where feasible. They also store data in an already-parsed form, so a query does not need to re-parse the JSON text each time.

Let's load `people_df` into tables, as in our prior notebook.

In [None]:
'''
Simple code to pull out data from JSON and load into DuckDB.
'''
import ast

linked_in = open('linkedin_anon.jsonl')

START = 0
LIMIT = 50000

def get_df(rel):
    ret = pd.DataFrame(rel)
    return ret

lines = []
i = 1
for line in linked_in:
    if i > START + LIMIT:
        break
    elif i >= START:
        person = json.loads(line)

        lines.append(person)
    i = i + 1

people_df = get_df(pd.DataFrame(lines))



In [None]:
people_df

In [None]:
def get_nested_dict(rel, name):
  # The (disabled) line below evaluates a string that describes a dictionary,
  # as a dictionary definition
  ret = rel.copy()
  # ret[name] = rel[name].map(lambda x: ast.literal_eval(x) if len(x) else np.NaN)
  ret = ret.dropna()
  # Expand each dictionary into columns, reusing ret's index so that the
  # join on the index stays aligned even after dropna() removed rows
  expanded = pd.DataFrame(ret[name].tolist(), index=ret.index)
  return ret.drop(columns=name).join(expanded)

def get_nested_list(rel, name):
  ret = rel.copy()
  ret = ret.dropna().explode(name).dropna()
  ret = ret.join(pd.DataFrame(ret[name].tolist())).drop(columns=name).drop_duplicates()
  return ret.rename(columns={0: name})

def get_nested_list_dict(rel, name):
  ret = rel.copy()

  ret = ret.dropna().explode(name)

  exploded_pairs = pd.DataFrame(ret.apply(lambda x: {'_id': x['_id']} | x[name] if isinstance(x[name], dict) else {'_id': x['_id']}, axis=1).tolist())

  return ret.merge(exploded_pairs, on='_id').drop(columns=name)
  #pd.DataFrame(ret[name].tolist())).drop(columns=name).drop_duplicates()

# Take the lists, drop any blank strings
specialties_df = people_df[['_id','specilities']].explode('specilities').rename(columns={'_id': 'person'})
specialties_df.dropna(inplace=True)
interests_df = people_df[['_id','interests']].explode('interests').rename(columns={'_id': 'person'})
interests_df.dropna(inplace=True)

names_df = get_nested_dict(people_df[['_id','name']], 'name')

education_df = get_nested_list_dict(people_df[['_id','education']], 'education')
experience_df = get_nested_list_dict(people_df[['_id','experience']], 'experience')
skills_df = get_nested_list(people_df[['_id','skills']], 'skills')
honors_df = get_nested_list(people_df[['_id','honors']], 'honors')
events_df = get_nested_list_dict(people_df[['_id','events']], 'events')

groups_df = get_nested_dict(people_df[['_id','group']], 'group')

people_only_df = people_df.drop(columns=['name','education','group','skills','experience','honors','events','specilities','interests'])

In [None]:
## This is just to reset things so we don't have an index.
## CREATE OR REPLACE lets the cell be rerun against an existing linkedin.db
conn = duckdb.connect('linkedin.db')
conn.execute('BEGIN TRANSACTION')
conn.execute('DROP TABLE IF EXISTS people')
conn.execute('DROP INDEX IF EXISTS people_industry')
conn.execute('CREATE TABLE people AS SELECT * FROM people_df')
conn.execute('CREATE OR REPLACE TABLE education AS SELECT * FROM education_df')
conn.execute('CREATE OR REPLACE TABLE experience AS SELECT * FROM experience_df')
conn.execute('CREATE OR REPLACE TABLE skills AS SELECT * FROM skills_df')
conn.execute('CREATE OR REPLACE TABLE honors AS SELECT * FROM honors_df')
conn.execute('CREATE OR REPLACE TABLE events AS SELECT * FROM events_df')
conn.execute('CREATE OR REPLACE TABLE groups AS SELECT * FROM groups_df')
conn.execute('CREATE OR REPLACE TABLE specialties AS SELECT * FROM specialties_df')
conn.execute('CREATE OR REPLACE TABLE interests AS SELECT * FROM interests_df')
conn.execute('COMMIT')

In [None]:
%%time

conn.sql("""
  SELECT *
  FROM people JOIN experience ON people._id = experience._id
  WHERE industry='Medical Devices'""")

## Let's build an index now...

Our data is very small, so the index probably won't speed anything up at this scale. But it can be created and the database will use it *transparently*!


In [None]:
conn.execute('BEGIN TRANSACTION')
conn.execute('DROP INDEX IF EXISTS people_industry')
conn.execute("CREATE INDEX people_industry ON people(industry)")
conn.execute('COMMIT')
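
To see what "transparently" means, it helps to look at a query plan before and after an index exists. The sketch below deliberately swaps in Python's built-in `sqlite3` module instead of DuckDB so that it runs without extra installs; the toy `people` table and the `plan` helper are assumptions for illustration, and DuckDB's own planner output will look different.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE people (_id INTEGER, industry TEXT)')
db.executemany(
    'INSERT INTO people VALUES (?, ?)',
    [(i, 'Medical Devices' if i % 10 == 0 else 'Banking') for i in range(1000)],
)

query = "SELECT * FROM people WHERE industry = 'Medical Devices'"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes one plan step
    return ' | '.join(row[-1] for row in db.execute('EXPLAIN QUERY PLAN ' + sql))

plan_before = plan(query)
rows_before = db.execute(query).fetchall()

# The query text stays the same; only the physical design changes
db.execute('CREATE INDEX people_industry ON people(industry)')

plan_after = plan(query)
rows_after = db.execute(query).fetchall()

print('before:', plan_before)
print('after: ', plan_after)
```

The query returns the same rows both times. Only the plan changes, which is why an index can be added or dropped without rewriting any queries.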

In [None]:
%%time
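# Define a view over the join, then treat it as a table and see what's there.
# OR REPLACE lets the cell be rerun without a "view already exists" error
conn.sql("""
 CREATE OR REPLACE VIEW people_medicine AS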
  SELECT *
  FROM people JOIN experience ON people._id = experience._id
  WHERE industry='Medical Devices'""")

conn.sql("""
  SELECT *
  FROM people_medicine""")

# In our tests, this was 5x faster!

In [None]:
%%time

conn.sql("""
  SELECT name.given_name, name.family_name
  FROM people
  WHERE name.given_name='Jeeves'""")

In [None]:
people_df2 = conn.sql('select * from people limit 500').df()
experience_df2 = conn.sql('select * from experience limit 5000').df()
skills_df2 = conn.sql('select * from skills limit 8000').df()

print ("%d people"%len(people_df2))
print ("%d experiences"%len(experience_df2))
print ("%d skills"%len(skills_df2))

In [None]:
def merge(S,T,l_on,r_on):
    ret = []
    count = 0
    s_pos = S.columns.get_loc(l_on)
    t_pos = T.columns.get_loc(r_on)
    for s_index in range(0, len(S)):
        for t_index in range(0, len(T)):
            count = count + 1
            if S.iat[s_index, s_pos] == T.iat[t_index, t_pos]:
              ret.append(S.iloc[s_index].to_dict() | T.iloc[t_index].to_dict())

    print('Merge compared %d tuples'%count)
    return pd.DataFrame(ret)

In [None]:
%%time
# Here's a test join, with people and their experiences.  We can see how many
# comparisons are made

merge(people_df2, experience_df2, '_id', '_id')
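
The cost of `merge` comes from its two nested loops: every row of `S` is compared with every row of `T`. A minimal sketch of the same nested-loop join over plain lists of dictionaries makes the count easy to verify; the function name `nested_loop_join` and the toy rows are assumptions, not part of the notebook.

```python
def nested_loop_join(S, T, key):
    """Join two lists of dicts on `key`, counting every comparison."""
    matches = []
    comparisons = 0
    for s in S:
        for t in T:
            comparisons += 1
            if s[key] == t[key]:
                matches.append(s | t)
    return matches, comparisons

# Assumption: toy rows standing in for people_df2 and experience_df2
people = [{'_id': i, 'industry': 'Banking'} for i in range(5)]
experience = [{'_id': i % 5, 'org': f'org{i}'} for i in range(20)]

joined, comparisons = nested_loop_join(people, experience, '_id')
print(f'{len(joined)} matches from {comparisons} comparisons')
```

The number of comparisons is always `len(S) * len(T)`, regardless of how many rows match. With up to 500 people and 5,000 experiences, that is up to 2,500,000 comparisons.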

In [None]:
# Let's find all people (by ID) who have Molecular Biology as a skill

mbio_df = skills_df2[skills_df2['skills'] == 'Molecular Biology'].reset_index()[['_id']]
mbio_df

In [None]:
%%time
merge(merge(people_df2, experience_df2, '_id', '_id'), mbio_df, '_id', '_id')

In [None]:
%%time
merge(merge(people_df2, mbio_df, '_id', '_id'), experience_df2, '_id', '_id')
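
The two cells above compute the same result but in different join orders. The following sketch, which uses assumed toy data and a hypothetical helper `count_nested_loop`, counts comparisons for each order to show why joining with the small, selective input first is cheaper.

```python
def count_nested_loop(S, T, key):
    """Nested-loop join on `key`; returns (rows, number of comparisons)."""
    rows, comparisons = [], 0
    for s in S:
        for t in T:
            comparisons += 1
            if s[key] == t[key]:
                rows.append(s | t)
    return rows, comparisons

# Assumption: toy data; only two people have the skill being searched for
people = [{'_id': i} for i in range(50)]
experience = [{'_id': i % 50, 'org': f'org{i}'} for i in range(200)]
skilled = [{'_id': 3}, {'_id': 7}]

# Order A: join the two large inputs first, then apply the small one
pe, cost_pe = count_nested_loop(people, experience, '_id')
result_a, cost_a2 = count_nested_loop(pe, skilled, '_id')

# Order B: shrink people with the small input first, then join experience
ps, cost_ps = count_nested_loop(people, skilled, '_id')
result_b, cost_b2 = count_nested_loop(ps, experience, '_id')

total_a = cost_pe + cost_a2
total_b = cost_ps + cost_b2
print(f'order A: {total_a} comparisons, order B: {total_b} comparisons')
```

On this toy data, order A needs 10,400 comparisons and order B needs 500, for identical output. A query optimizer makes this kind of choice automatically, using estimates of how many rows each intermediate result will have.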

In [None]:
%%time

conn.sql("""select distinct s._id,s.skills from people p join skills s on p._id=s._id join
                  experience ex on s._id=ex._id and s.skills='Molecular Biology'""")

In [None]:
%%time

conn.sql("""select distinct s._id,s.skills from skills s join
                  experience ex on s._id=ex._id join people p on p._id=s._id where s.skills='Molecular Biology'""")

In [None]:
conn.sql("select count(distinct _id) from skills where skills='Molecular Biology'")

In [None]:
# Join using a *hash map*
# from each key to the list of rows that share it
def merge_map(S,T,l_on,r_on):
    ret = []
    T_map = {}
    count = 0
    # Take each value in the r_on field, and
    # make a map entry for it
    t_pos = T.columns.get_loc(r_on)
    for t_index in range(0, len(T)):
        # Make sure we aren't overwriting an entry!
        # Use positional access (iloc) to match iat, whatever T's index is
        if (T.iat[t_index,t_pos] not in T_map):
          T_map[T.iat[t_index,t_pos]] = [T.iloc[t_index]]
        else:
          T_map[T.iat[t_index,t_pos]].append(T.iloc[t_index])
        count = count + 1

    # Now find matches
    S2 = S.reset_index().drop(columns=['index'])
    for s_index in range(0, len(S2)):
        count = count + 1
        if S2.loc[s_index, l_on] in T_map:
          for item in T_map[S2.loc[s_index, l_on]]:
            ret.append(S2.loc[s_index].to_dict() | item.drop(labels=r_on).to_dict())

    print('Merge compared %d tuples'%count)
    return pd.DataFrame(ret)

In [None]:
%%time

# Here's a test join, with people and their experiences.  We can see how many
# comparisons are made
merge_map(experience_df2, people_df2, '_id', '_id')
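
Stripped of the Pandas details, `merge_map` is a hash join with a build phase and a probe phase. Below is a minimal sketch over lists of dictionaries; the name `hash_join` and the toy rows are assumptions for illustration.

```python
from collections import defaultdict

def hash_join(S, T, key):
    """Join lists of dicts on `key` by building a hash map over T."""
    operations = 0

    # Build phase: one pass over T, grouping rows that share a key
    buckets = defaultdict(list)
    for t in T:
        operations += 1
        buckets[t[key]].append(t)

    # Probe phase: one lookup per row of S
    rows = []
    for s in S:
        operations += 1
        for t in buckets.get(s[key], []):
            rows.append(s | t)
    return rows, operations

# Assumption: toy rows standing in for people_df2 and experience_df2
people = [{'_id': i, 'industry': 'Banking'} for i in range(5)]
experience = [{'_id': i % 5, 'org': f'org{i}'} for i in range(20)]

joined, operations = hash_join(experience, people, '_id')
print(f'{len(joined)} matches from {operations} operations')
```

The count here is `len(S) + len(T)` (25 on this data), compared with `len(S) * len(T)` (100) for the nested-loop version on the same inputs. The trade-off is the extra memory needed to hold the map.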

## Exercise

In [None]:
%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

In [None]:
!pip3 install penngrader-client

In [None]:
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW
# WHOM TO ASSIGN POINTS TO IN OUR BACKEND
STUDENT_ID = 99999999 # YOUR PENN-ID GOES HERE AS AN INTEGER

In [None]:
%set_env HW_ID=cis5450_25f_HW9

In [None]:
import os
from penngrader.grader import *

grader = PennGrader('notebook-config.yaml', os.environ['HW_ID'], STUDENT_ID, STUDENT_ID)

Take the following query and use the `merge` or `merge_map` functions to execute it.  You can use Pandas to pre-apply or post-apply any filter conditions (selections) on dataframes.

```
SELECT p._id, industry, skills
FROM people_df2 p JOIN skills_df2 s ON p._id = s._id
WHERE industry = 'Pharmaceuticals'
```

In [None]:
# TODO: compute results_df as per the above
results_df = None  # TODO: replace None with your result

results_df

In [None]:
grader.grade('pharma', results_df)