<a href="https://colab.research.google.com/github/sanket1009/dsrs-kb-prototype/blob/main/DSRS_KB_Prototype.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DSRS Knowledge-Base Prototype

**What’s inside**  
Prototype KB (`dsrs_kb_source.csv`), a NetworkX graph converted to an
interactive PyVis HTML (`dsrs_graph.html`), and a fully-reproducible notebook
(`DSRS_KB_Prototype.ipynb`).

**Steps to run**

1. Open `DSRS_KB_Prototype.ipynb` and run *Runtime ▸ Run all*.
2. When execution finishes, double-click `dsrs_graph.html` in the file browser
   to explore the graph.

**Key files**

| File | Purpose |
| ---  | --- |
| `dsrs_kb_source.csv` | Source rows for the knowledge-base (23 rows) |
| `dsrs_graph.html`    | Interactive graph (23 nodes / 20 edges) |
| `DSRS_KB_Prototype.ipynb` | Notebook that builds the CSV, graph and demo queries |

**Sources**  
Each KB row cites one of:

* `DSRS Strategy.pdf`
* Public pages on **dsrs.illinois.edu**
* LinkedIn post permalinks (September 2024 to May 2025) and a LinkedIn Insights snapshot


In [1]:
!pip install --quiet pandas networkx pyvis beautifulsoup4


In [2]:
import pandas, networkx, pyvis
print("Versions:", pandas.__version__, networkx.__version__, pyvis.__version__)


Versions: 2.2.2 3.4.2 0.3.2



Chunk purpose
-------------
1. Define a minimal Knowledge-Base schema (dict keys: id, title, type, subunit, summary, source, last_updated).
2. Add the 5 DSRS sub-units with one-line summaries.
3. Save (or append) these rows to `dsrs_kb_source.csv`.
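
The schema in step 1 can be checked mechanically before rows are written. The following is a minimal sketch, not part of the notebook: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the example summary is shortened for illustration.

```python
# Hypothetical helper: verify that a KB row carries every key of the minimal schema.
REQUIRED_KEYS = ("id", "title", "type", "subunit", "summary", "source", "last_updated")


def missing_keys(row: dict) -> list[str]:
    """Return the schema keys absent from `row`, in schema order."""
    return [key for key in REQUIRED_KEYS if key not in row]


example_row = {
    "id": "sub_01",
    "title": "Data Hub",
    "type": "subunit",
    "subunit": "Data Hub",
    "summary": "Manages all research data for DSRS.",  # shortened for the example
    "source": "DSRS Strategy.pdf",
    "last_updated": "2025-05-31",
}

print(missing_keys(example_row))                # complete row
print(missing_keys({"id": "x", "type": "y"}))   # incomplete row
```

A check like this would flag the metric row added later in the notebook, which carries no `summary` key.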


In [3]:
import pandas as pd
from pathlib import Path
from datetime import date


In [4]:
KB_CSV_PATH = Path("dsrs_kb_source.csv")   # central place for the prototype KB
TODAY       = date.today()

In [5]:
# ---------------------------------------------------------------------------
# Data rows
# ---------------------------------------------------------------------------
strategy_rows = [
    {
        "id"          : "sub_01",
        "title"       : "Data Hub",
        "type"        : "subunit",
        "subunit"     : "Data Hub",
        "summary"     : (
            "Manages all research data for DSRS — procurement, documentation, storage, and secure access via "
            "our private cloud and Microsoft Azure. Current work centres on rigorous data-management workflows, "
            "while the roadmap includes usage tracking, dataset expansion, interactive tutorials, AI chat-bots, "
            "and ultimately a unified research Data Lake for faculty and students."
        ),
        "source"      : "DSRS Strategy.pdf",
        "last_updated": TODAY,
    },
    {
        "id"          : "sub_02",
        "title"       : "Services",
        "type"        : "subunit",
        "subunit"     : "Services",
        "summary"     : (
            "Core consulting arm offering a broad menu of data-science services detailed on the DSRS website. "
            "Closely integrated with Data Hub and Infrastructure, the team has already supported 70+ faculty and "
            "is actively expanding its catalogue to embrace new technologies and maximise research impact."
        ),
        "source"      : "DSRS Strategy.pdf",
        "last_updated": TODAY,
    },
    {
        "id"          : "sub_03",
        "title"       : "Infrastructure",
        "type"        : "subunit",
        "subunit"     : "Infrastructure",
        "summary"     : (
            "Runs the entire technical stack behind DSRS, including the website, databases, cloud compute, and a "
            "640-core HPC cluster (2.5 TB RAM, 30 TB shared storage). The unit safeguards security and uptime for "
            "Services and Data Hub, with plans to scale capacity, modernise documentation, and build internal apps."
        ),
        "source"      : "DSRS Strategy.pdf",
        "last_updated": TODAY,
    },
    {
        "id"          : "sub_04",
        "title"       : "Engagement",
        "type"        : "subunit",
        "subunit"     : "Engagement",
        "summary"     : (
            "Amplifies DSRS visibility via LinkedIn, workshops, hackathons, events, newsletters, blogs, and training "
            "sessions. The goal is to deepen ties across campus and showcase DSRS capabilities to students, faculty, "
            "and external partners."
        ),
        "source"      : "DSRS Strategy.pdf",
        "last_updated": TODAY,
    },
    {
        "id"          : "sub_05",
        "title"       : "Growth",
        "type"        : "subunit",
        "subunit"     : "Growth",
        "summary"     : (
            "Charts DSRS’s long-term expansion: engaging industry, extending services beyond Gies, developing "
            "proprietary data-science packages, and pursuing grants and other funding. Future plans also include "
            "R&D collaborations and teaching support through reusable data-science use-cases."
        ),
        "source"      : "DSRS Strategy.pdf",
        "last_updated": TODAY,
    },
]


In [6]:
subunit_df = pd.DataFrame(strategy_rows)

In [7]:
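# Merge with the existing file (if any) so that re-running the cell adds no duplicate ids.
# On an id clash the row already in the CSV wins (drop_duplicates keeps the first occurrence).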
if KB_CSV_PATH.exists():
    existing_df = pd.read_csv(KB_CSV_PATH)
    combined_df = (
        pd.concat([existing_df, subunit_df])
          .drop_duplicates(subset="id")
          .reset_index(drop=True)
    )
else:
    combined_df = subunit_df

combined_df.to_csv(KB_CSV_PATH, index=False)

print(f"{KB_CSV_PATH} now holds {len(combined_df)} rows.")
display(combined_df)

dsrs_kb_source.csv now holds 23 rows.


Unnamed: 0,id,title,type,subunit,summary,source,last_updated,methods,hpc_cores,hpc_storage_tb,azure_credits_usd,coverage_years,post_date,snapshot_date,total_employees,median_tenure_yrs,growth_6m_pct,growth_1y_pct,growth_2y_pct,function_mix
0,sub_01,Data Hub,subunit,Data Hub,Manages all research data for DSRS — procureme...,DSRS Strategy.pdf,2025-05-31,,,,,,,,,,,,,
1,sub_02,Services,subunit,Services,Core consulting arm offering a broad menu of d...,DSRS Strategy.pdf,2025-05-31,,,,,,,,,,,,,
2,sub_03,Infrastructure,subunit,Infrastructure,"Runs the entire technical stack behind DSRS, i...",DSRS Strategy.pdf,2025-05-31,,,,,,,,,,,,,
3,sub_04,Engagement,subunit,Engagement,"Amplifies DSRS visibility via LinkedIn, worksh...",DSRS Strategy.pdf,2025-05-31,,,,,,,,,,,,,
4,sub_05,Growth,subunit,Growth,Charts DSRS’s long-term expansion: engaging in...,DSRS Strategy.pdf,2025-05-31,,,,,,,,,,,,,
5,svc_01,Research Support,service,Services,DSRS assists Gies College of Business faculty ...,https://dsrs.illinois.edu/,2025-05-31,,,,,,,,,,,,,
6,svc_02,Student Internships,service,Services,The DSRS is regularly looking for data science...,https://dsrs.illinois.edu/,2025-05-31,,,,,,,,,,,,,
7,svc_03,Data Science Consulting,service,Services,Assessment and recommendations of statistical ...,https://dsrs.illinois.edu/,2025-05-31,,,,,,,,,,,,,
8,cap_01,Assistance,capability,Services,Advises researchers on the best statistical an...,https://dsrs.illinois.edu/researchers,2025-05-31,"inferential statistics, social media analytics...",,,,,,,,,,,,
9,cap_02,Analyses,capability,Services,Core DSRS service run by student interns; deli...,https://dsrs.illinois.edu/researchers,2025-05-31,,,,,,,,,,,,,
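
The merge cell keeps the row already in the CSV when an `id` appears twice, so an edited summary never reaches the file on a re-run. If updates should win instead, one option is `keep="last"`. The sketch below assumes pandas is installed; `upsert_rows` and the two toy frames are hypothetical and only illustrate the behaviour.

```python
import pandas as pd


def upsert_rows(existing: pd.DataFrame, new: pd.DataFrame, key: str = "id") -> pd.DataFrame:
    """Append `new` to `existing`; on a key clash keep the row from `new`."""
    return (
        pd.concat([existing, new], ignore_index=True)
          .drop_duplicates(subset=key, keep="last")   # last occurrence = row from `new`
          .reset_index(drop=True)
    )


# Toy data, not taken from the real KB.
existing = pd.DataFrame([
    {"id": "sub_01", "summary": "old text"},
    {"id": "sub_02", "summary": "unchanged"},
])
new = pd.DataFrame([{"id": "sub_01", "summary": "new text"}])

merged = upsert_rows(existing, new)
print(merged)
```

The trade-off is row order: an updated row moves to the position of its latest occurrence, so sort by `id` afterwards if a stable order matters.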


Chunk purpose
-------------
1. Grab the most up-to-date list of DSRS service offerings from the website.
2. Convert each service into a Knowledge-Base row (type='service').
3. Append to dsrs_kb_source.csv, avoiding duplicates by 'id'.
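
Steps 1 and 2 can be tried offline on a static snippet before hitting the live site. The sketch below swaps BeautifulSoup for the standard-library `html.parser`, so it only approximates the notebook's `find_next_sibling("p")` logic: it pairs each `<h3>` with the first `<p>` seen before the next `<h3>`. `HeadingParagraphParser`, `SAMPLE_HTML`, and the placeholder summaries are assumptions made for illustration.

```python
from html.parser import HTMLParser


class HeadingParagraphParser(HTMLParser):
    """Pair each <h3> with the first <p> that follows it (stdlib approximation)."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # finished (heading, paragraph) tuples
        self._tag = None       # tag whose text is currently being collected
        self._buffer = []      # text fragments of the current tag
        self._heading = None   # latest <h3> text still waiting for a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "h3" or (tag == "p" and self._heading is not None):
            self._tag, self._buffer = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "h3":
            self._heading = text
        else:  # closing </p>: pair it with the waiting heading
            self.pairs.append((self._heading, text))
            self._heading = None
        self._tag = None


# Placeholder markup; the real page structure may differ.
SAMPLE_HTML = """
<h3>Research Support</h3><p>Placeholder summary A.</p>
<h3>Heading without paragraph</h3>
<h3>Student Internships</h3><p>Placeholder summary B.</p>
"""

parser = HeadingParagraphParser()
parser.feed(SAMPLE_HTML)

rows = [
    {"id": f"svc_{n:02}", "title": title, "type": "service", "summary": summary}
    for n, (title, summary) in enumerate(parser.pairs, start=1)
]
print(rows)
```

Here the ids are numbered over the kept pairs only, whereas the notebook numbers by heading position, so a heading without a paragraph leaves a gap in the notebook's ids but not in this sketch.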

In [8]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from pathlib import Path
from datetime import date


In [9]:
import pandas as pd, requests, bs4
from pathlib import Path
from datetime import date

ROOT_URL  = "https://dsrs.illinois.edu/"
KB_PATH   = Path("dsrs_kb_source.csv")
TODAY     = date.today()

# -------------------------------------------------------------
# 1. Download DSRS home page
# -------------------------------------------------------------
resp  = requests.get(ROOT_URL, timeout=10)
resp.raise_for_status()   # fail fast on HTTP errors instead of parsing an error page
html  = resp.text
soup  = bs4.BeautifulSoup(html, "html.parser")

# -------------------------------------------------------------
# 2. Extract <h3> + next <p>
# -------------------------------------------------------------
rows = []
for idx, h3 in enumerate(soup.find_all("h3"), start=1):
    # find the next paragraph after this heading
    p = h3.find_next_sibling("p")
    if not p:
        continue

    rows.append({
        "id"          : f"svc_{idx:02}",
        "title"       : h3.get_text(strip=True),
        "type"        : "service",
        "subunit"     : "Services",
        "summary"     : p.get_text(strip=True),
        "source"      : ROOT_URL,
        "last_updated": TODAY
    })

if not rows:
    raise SystemExit("No service cards found. Check the selector or page structure.")

print(f"Scraped {len(rows)} service items.")

# -------------------------------------------------------------
# 3. Convert to DataFrame
# -------------------------------------------------------------
svc_df = pd.DataFrame(rows)

# -------------------------------------------------------------
# 4. Merge into (or create) KB CSV
# -------------------------------------------------------------
if KB_PATH.exists():
    combined = (pd.concat([pd.read_csv(KB_PATH), svc_df])
                .drop_duplicates(subset="id")
                .reset_index(drop=True))
else:
    combined = svc_df

combined.to_csv(KB_PATH, index=False)
print(f"📊  {KB_PATH.name} now has {len(combined)} total rows.")

# -------------------------------------------------------------
# 5. Show what we just added
# -------------------------------------------------------------
display(svc_df)


Scraped 3 service items.
📊  dsrs_kb_source.csv now has 23 total rows.


Unnamed: 0,id,title,type,subunit,summary,source,last_updated
0,svc_01,Research Support,service,Services,DSRS assists Gies College of Business faculty ...,https://dsrs.illinois.edu/,2025-05-31
1,svc_02,Student Internships,service,Services,The DSRS is regularly looking for data science...,https://dsrs.illinois.edu/,2025-05-31
2,svc_03,Data Science Consulting,service,Services,Assessment and recommendations of statistical ...,https://dsrs.illinois.edu/,2025-05-31


In [10]:
import pandas as pd
from datetime import date
from pathlib import Path

KB_PATH = Path("dsrs_kb_source.csv")   # same file you’ve been using
TODAY   = date.today()

cap_rows = [
    {
        "id":"cap_01","title":"Assistance","type":"capability","subunit":"Services",
        "summary":"Advises researchers on the best statistical and data-science methods.",
        "methods":"inferential statistics, social media analytics, text mining, NLP, machine learning, deep learning, data viz, dashboard design, web scraping, data storage, database usage, data cleaning, data gathering, scalable computing, image processing, app development",
        "source":"https://dsrs.illinois.edu/researchers","last_updated":TODAY,
    },
    {
        "id":"cap_02","title":"Analyses","type":"capability","subunit":"Services",
        "summary":"Core DSRS service run by student interns; delivers end-to-end analyses across multiple disciplines.",
        "source":"https://dsrs.illinois.edu/researchers","last_updated":TODAY,
    },
    {
        "id":"cap_03","title":"Data","type":"capability","subunit":"Services",
        "summary":"Maintains an in-house dataset library and helps acquire/licence external data for faculty projects.",
        "source":"https://dsrs.illinois.edu/researchers","last_updated":TODAY,
    },
    {
        "id":"cap_04","title":"Computational capability","type":"capability","subunit":"Infrastructure",
        "summary":"Operates a 600-core, 30 TB cluster plus campus and cloud compute resources; brokers NCSA jobs when cheaper.",
        "hpc_cores":600,"hpc_storage_tb":30,
        "source":"https://dsrs.illinois.edu/researchers","last_updated":TODAY,
    },
    {
        "id":"cap_05","title":"Microsoft Azure","type":"capability","subunit":"Infrastructure",
        "summary":"> $160 k in Azure credits enable large-scale computations and exploratory research.",
        "azure_credits_usd":160000,
        "source":"https://dsrs.illinois.edu/researchers","last_updated":TODAY,
    },
]

cap_df = pd.DataFrame(cap_rows)

# merge idempotently
if KB_PATH.exists():
    combined = (
        pd.concat([pd.read_csv(KB_PATH), cap_df])
          .drop_duplicates(subset="id")
          .reset_index(drop=True)
    )
else:
    combined = cap_df

combined.to_csv(KB_PATH, index=False)
print(f"📊  {KB_PATH.name} now holds {len(combined)} total rows.")
display(cap_df)


📊  dsrs_kb_source.csv now holds 23 total rows.


Unnamed: 0,id,title,type,subunit,summary,methods,source,last_updated,hpc_cores,hpc_storage_tb,azure_credits_usd
0,cap_01,Assistance,capability,Services,Advises researchers on the best statistical an...,"inferential statistics, social media analytics...",https://dsrs.illinois.edu/researchers,2025-05-31,,,
1,cap_02,Analyses,capability,Services,Core DSRS service run by student interns; deli...,,https://dsrs.illinois.edu/researchers,2025-05-31,,,
2,cap_03,Data,capability,Services,Maintains an in-house dataset library and help...,,https://dsrs.illinois.edu/researchers,2025-05-31,,,
3,cap_04,Computational capability,capability,Infrastructure,"Operates a 600-core, 30 TB cluster plus campus...",,https://dsrs.illinois.edu/researchers,2025-05-31,600.0,30.0,
4,cap_05,Microsoft Azure,capability,Infrastructure,> $160 k in Azure credits enable large-scale c...,,https://dsrs.illinois.edu/researchers,2025-05-31,,,160000.0


In [12]:
"""
Chunk: Data Hub datasets  ➜  add six 'dataset' records
------------------------------------------------------
Manual seed—because the Data Hub pages are JS-rendered and hard to scrape
with requests/BeautifulSoup inside Colab.
"""

import pandas as pd
from datetime import date
from pathlib import Path

KB_PATH = Path("dsrs_kb_source.csv")          # same master CSV
TODAY   = date.today()

ds_rows = [
    {
        "id"      : "ds_01",
        "title"   : "BvD Historicals",
        "type"    : "dataset",
        "subunit" : "Data Hub",
        "summary" : "Financials, ownership structures and company profiles for robust corporate-finance research.",
        "coverage_years": "various",
        "source"  : "https://dsrs.illinois.edu/datahub/category/bvd-historicals/overview",
        "last_updated": TODAY
    },
    {
        "id"      : "ds_02",
        "title"   : "Gies Consumer Credit Panel (GCCP)",
        "type"    : "dataset",
        "subunit" : "Data Hub",
        "summary" : "Long-run Experian credit-bureau micro-data (2004-2022) for consumers and small businesses.",
        "coverage_years": "2004-2022",
        "source"  : "https://dsrs.illinois.edu/datahub/category/gies-consumer-credit-panel/overview",
        "last_updated": TODAY
    },
    {
        "id"      : "ds_03",
        "title"   : "PitchBook",
        "type"    : "dataset",
        "subunit" : "Data Hub",
        "summary" : "Private-market intelligence on companies, deals, funds, investors and service providers.",
        "source"  : "https://dsrs.illinois.edu/datahub/category/pitchbook",
        "last_updated": TODAY
    },
    {
        "id"      : "ds_04",
        "title"   : "WRDS",
        "type"    : "dataset",
        "subunit" : "Data Hub",
        "summary" : "Web-based query platform hosting 350 TB+ of accounting, banking, ESG, finance and marketing data.",
        "source"  : "https://dsrs.illinois.edu/datahub/category/wrds/overview",
        "last_updated": TODAY
    },
    {
        "id"      : "ds_05",
        "title"   : "Nielsen Datasets",
        "type"    : "dataset",
        "subunit" : "Data Hub",
        "summary" : "US consumer-purchase panel, weekly retail-scanner data and survey add-ons dating back to 2004.",
        "source"  : "https://dsrs.illinois.edu/datahub/category/nielsen-datasets",
        "last_updated": TODAY
    },
    {
        "id"      : "ds_06",
        "title"   : "S&P Capital IQ Pro",
        "type"    : "dataset",
        "subunit" : "Data Hub",
        "summary" : "Market-intelligence platform with data, news and analytics across banking, energy, TMT and more.",
        "source"  : "https://dsrs.illinois.edu/datahub/category/sp-capital-iq-pro",
        "last_updated": TODAY
    },
]

ds_df = pd.DataFrame(ds_rows)

# merge idempotently
if KB_PATH.exists():
    combined = (pd.concat([pd.read_csv(KB_PATH), ds_df])
                .drop_duplicates(subset="id")
                .reset_index(drop=True))
else:
    combined = ds_df

combined.to_csv(KB_PATH, index=False)
print(f"📊  {KB_PATH.name} now has {len(combined)} total rows.")
display(ds_df)


📊  dsrs_kb_source.csv now has 23 total rows.


Unnamed: 0,id,title,type,subunit,summary,coverage_years,source,last_updated
0,ds_01,BvD Historicals,dataset,Data Hub,"Financials, ownership structures and company p...",various,https://dsrs.illinois.edu/datahub/category/bvd...,2025-05-31
1,ds_02,Gies Consumer Credit Panel (GCCP),dataset,Data Hub,Long-run Experian credit-bureau micro-data (20...,2004-2022,https://dsrs.illinois.edu/datahub/category/gie...,2025-05-31
2,ds_03,PitchBook,dataset,Data Hub,"Private-market intelligence on companies, deal...",,https://dsrs.illinois.edu/datahub/category/pit...,2025-05-31
3,ds_04,WRDS,dataset,Data Hub,Web-based query platform hosting 350 TB+ of ac...,,https://dsrs.illinois.edu/datahub/category/wrd...,2025-05-31
4,ds_05,Nielsen Datasets,dataset,Data Hub,"US consumer-purchase panel, weekly retail-scan...",,https://dsrs.illinois.edu/datahub/category/nie...,2025-05-31
5,ds_06,S&P Capital IQ Pro,dataset,Data Hub,"Market-intelligence platform with data, news a...",,https://dsrs.illinois.edu/datahub/category/sp-...,2025-05-31


In [13]:
"""
Chunk: LinkedIn news posts  ➜  add three 'news' records
-------------------------------------------------------
Manual seed (LinkedIn is behind login, so we capture the key facts by hand).
"""

import pandas as pd
from datetime import date
from pathlib import Path

KB_PATH = Path("dsrs_kb_source.csv")   # same master CSV
TODAY   = date.today()

news_rows = [
    {
        "id"      : "news_01",
        "title"   : "Join the DSRS Team for Summer 2025!",
        "type"    : "news",
        "subunit" : "Engagement",
        "summary" : "Announced three open Summer-2025 positions for UIUC students, promising promotion paths and hands-on project work.",
        "post_date": "2025-05-10",          # about 3 weeks before the 2025-05-31 run
        "source"  : "https://www.linkedin.com/posts/dsrs_opportunities-dsrs-activity-7326295803306614784-UhwU",
        "last_updated": TODAY
    },
    {
        "id"      : "news_02",
        "title"   : "Project Manager Opportunity at DSRS",
        "type"    : "news",
        "subunit" : "Engagement",
        "summary" : "Hiring a Project Manager to drive tech-enabled research; application deadline 15 Nov 2024.",
        "post_date": "2024-11-01",
        "source"  : "https://www.linkedin.com/posts/dsrs_dsrs-project-manager-gies-college-of-business-activity-7259248802430124032-Mdl7",
        "last_updated": TODAY
    },
    {
        "id"      : "news_03",
        "title"   : "Exciting Volunteer Opportunities at DSRS!",
        "type"    : "news",
        "subunit" : "Engagement",
        "summary" : "Seeking volunteer Data-Viz Analysts & ML Engineers; deadline 18 Oct 2024; hands-on experience with cutting-edge tools.",
        "post_date": "2024-09-30",
        "source"  : "https://www.linkedin.com/posts/dsrs_opportunities-dsrs-activity-7252433007226032129-tZ9s",
        "last_updated": TODAY
    },
]

news_df = pd.DataFrame(news_rows)

# merge safely
if KB_PATH.exists():
    combined = (
        pd.concat([pd.read_csv(KB_PATH), news_df])
          .drop_duplicates(subset="id")
          .reset_index(drop=True)
    )
else:
    combined = news_df

combined.to_csv(KB_PATH, index=False)
print(f"📊  {KB_PATH.name} now holds {len(combined)} total rows.")
display(news_df)


📊  dsrs_kb_source.csv now holds 23 total rows.


Unnamed: 0,id,title,type,subunit,summary,post_date,source,last_updated
0,news_01,Join the DSRS Team for Summer 2025!,news,Engagement,Announced three open Summer-2025 positions for...,2025-05-10,https://www.linkedin.com/posts/dsrs_opportunit...,2025-05-31
1,news_02,Project Manager Opportunity at DSRS,news,Engagement,Hiring a Project Manager to drive tech-enabled...,2024-11-01,https://www.linkedin.com/posts/dsrs_dsrs-proje...,2025-05-31
2,news_03,Exciting Volunteer Opportunities at DSRS!,news,Engagement,Seeking volunteer Data-Viz Analysts & ML Engin...,2024-09-30,https://www.linkedin.com/posts/dsrs_opportunit...,2025-05-31
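
The `post_date` strings are ISO-formatted, so they can be ordered and aged without pandas. The sketch below is illustrative only: `SNAPSHOT` is an assumed reference date taken from the `last_updated` values in the outputs, and `timeline` is a hypothetical name.

```python
from datetime import date

SNAPSHOT = date(2025, 5, 31)  # assumed reference: the last_updated date in the outputs

news = [
    ("news_01", "2025-05-10"),
    ("news_02", "2024-11-01"),
    ("news_03", "2024-09-30"),
]

# Parse the ISO dates and sort oldest first, as a timeline would show them.
timeline = sorted((date.fromisoformat(d), news_id) for news_id, d in news)

for posted, news_id in timeline:
    age_days = (SNAPSHOT - posted).days
    print(f"{posted.isoformat()}  {news_id}  {age_days} days before snapshot")
```

Storing `post_date` in ISO format is what makes this cheap: the strings parse directly and also sort correctly as plain text.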


In [14]:

KB_PATH = Path("dsrs_kb_source.csv")
TODAY   = date.today()

metric_row = {
    "id"           : "metric_2025_05",
    "type"         : "metric",
    "subunit"      : "Engagement",
    "title"        : "LinkedIn head-count snapshot",
    "snapshot_date": "2025-05-01",
    # ---- core KPIs ----
    "total_employees"   : 14,
    "median_tenure_yrs" : 1.1,
    "growth_6m_pct"     : -7,
    "growth_1y_pct"     : 17,
    "growth_2y_pct"     : 600,
    # ---- functional mix (as simple JSON string) ----
    "function_mix" : '{"Engineering":0.45,"Information Technology":0.25,"Product Management":0.09,"Research":0.21}',
    "source"       : "LinkedIn Insights panel (screenshot 2025-05-31)",
    "last_updated" : TODAY,
}

metric_df = pd.DataFrame([metric_row])

combined = (
    pd.concat([pd.read_csv(KB_PATH), metric_df])
      .drop_duplicates(subset="id")
      .reset_index(drop=True)
)
combined.to_csv(KB_PATH, index=False)
print(f"📊  {KB_PATH.name} now has {len(combined)} total rows.")
display(metric_df)


📊  dsrs_kb_source.csv now has 23 total rows.


Unnamed: 0,id,type,subunit,title,snapshot_date,total_employees,median_tenure_yrs,growth_6m_pct,growth_1y_pct,growth_2y_pct,function_mix,source,last_updated
0,metric_2025_05,metric,Engagement,LinkedIn head-count snapshot,2025-05-01,14,1.1,-7,17,600,"{""Engineering"":0.45,""Information Technology"":0...",LinkedIn Insights panel (screenshot 2025-05-31),2025-05-31
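
Because `function_mix` is stored as a JSON string, it has to be decoded before use. A minimal sketch of reading it back, using the exact string from the metric row; the variable names are hypothetical:

```python
import json

# The JSON string exactly as stored in the metric row.
function_mix_raw = (
    '{"Engineering":0.45,"Information Technology":0.25,'
    '"Product Management":0.09,"Research":0.21}'
)

function_mix = json.loads(function_mix_raw)   # dict: function name -> share

# Sanity check: the shares should add up to 1 (allowing for float rounding).
total = sum(function_mix.values())
shares_sum_to_one = abs(total - 1.0) < 1e-9

# Function with the largest share of head-count.
largest = max(function_mix, key=function_mix.get)

print(function_mix, shares_sum_to_one, largest)
```

After a round trip through the CSV the value comes back as a plain string column, so the same `json.loads` call applies to `df.loc[..., "function_mix"]`.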


In [26]:
# 5.1  Install (quiet) & import ------------------------------------------------
!pip install --quiet jinja2==3.1.2 pyvis==0.3.2


import pandas as pd, networkx as nx, json
from pathlib import Path
from pyvis.network import Network

# -----------------------------------------------------------------------------
# 5.2  Load KB and create graph
KB_CSV = Path("dsrs_kb_source.csv")
df = pd.read_csv(KB_CSV)

G = nx.DiGraph()

# add nodes with attributes
for _, row in df.iterrows():
    attrs   = row.dropna().to_dict()
    node_id = attrs.pop("id")
    G.add_node(node_id, **attrs)

# add generic ownership edges (sub-unit → children)
for node, data in G.nodes(data=True):
    if data.get("type") in {"service", "capability", "dataset", "news", "metric"}:
        parent_title = data.get("subunit")
        if parent_title:
            parent_row = df.loc[(df["type"] == "subunit") & (df["title"] == parent_title)]
            if not parent_row.empty:
                parent_id = parent_row["id"].values[0]
                G.add_edge(parent_id, node, relation="OWNS")

# -----------------------------------------------------------------------------
# 5.3  Link Growth → PitchBook & S&P Capital IQ Pro
growth_id = "sub_05"
for title in ["PitchBook", "S&P Capital IQ Pro"]:
    child_id = df.loc[df["title"] == title, "id"].values[0]
    G.add_edge(growth_id, child_id, relation="ENABLES")

print("Neighbors of Growth :", list(G.successors(growth_id)))
print("Nodes :", G.number_of_nodes(), "| Edges :", G.number_of_edges())  # expect 23 nodes, 20 edges

# -----------------------------------------------------------------------------
# 5.4  PyVis visual
net = Network(height="600px", width="100%", notebook=False, directed=True)

type_color = {
    "subunit"   : "orange",
    "service"   : "lightblue",
    "capability": "lightgreen",
    "dataset"   : "violet",
    "news"      : "gold",
    "metric"    : "gray"
}

for node, data in G.nodes(data=True):
    net.add_node(
        node,
        label=data.get("title", node),
        color=type_color.get(data.get("type"), "#cccccc"),
        title=data.get("summary", data.get("type"))  # hover tooltip
    )

for u, v, d in G.edges(data=True):
    net.add_edge(u, v, title=d.get("relation", ""))

net.write_html("dsrs_graph.html")
print("Graph saved → dsrs_graph.html (open from the file browser)")


Neighbors of Growth : ['ds_03', 'ds_06']
Nodes : 23 | Edges : 20
Graph saved → dsrs_graph.html (open from the file browser)
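
The edge rules in the cell above can be stated without NetworkX: every non-subunit row gets an `OWNS` edge from the sub-unit named in its `subunit` column, and Growth gets two extra `ENABLES` edges. The sketch below uses a four-row excerpt of the KB and a title-to-id dictionary built once, rather than a DataFrame lookup per node. The names `subunit_id_by_title` and `edges` are hypothetical.

```python
# Four-row excerpt of the KB (ids and titles as in the notebook).
rows = [
    {"id": "sub_01", "type": "subunit", "title": "Data Hub", "subunit": "Data Hub"},
    {"id": "sub_05", "type": "subunit", "title": "Growth", "subunit": "Growth"},
    {"id": "ds_03", "type": "dataset", "title": "PitchBook", "subunit": "Data Hub"},
    {"id": "ds_06", "type": "dataset", "title": "S&P Capital IQ Pro", "subunit": "Data Hub"},
]

# Build the lookup once instead of filtering the table for every node.
subunit_id_by_title = {r["title"]: r["id"] for r in rows if r["type"] == "subunit"}

edges = []
for r in rows:
    if r["type"] == "subunit":
        continue                                  # sub-units have no parent
    parent_id = subunit_id_by_title.get(r["subunit"])
    if parent_id is not None:
        edges.append((parent_id, r["id"], "OWNS"))

# Extra cross-links: Growth ENABLES the two market-data platforms.
id_by_title = {r["title"]: r["id"] for r in rows}
for title in ("PitchBook", "S&P Capital IQ Pro"):
    edges.append(("sub_05", id_by_title[title], "ENABLES"))

print(edges)
```

On the full KB the same rules give 18 `OWNS` edges (one per non-subunit row) plus 2 `ENABLES` edges, which matches the 20 edges reported above.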


In [27]:
print(df[["id", "title"]].sort_values("title").to_string(index=False))

            id                                     title
        cap_02                                  Analyses
        cap_01                                Assistance
         ds_01                           BvD Historicals
        cap_04                  Computational capability
        cap_03                                      Data
        sub_01                                  Data Hub
        svc_03                   Data Science Consulting
        sub_04                                Engagement
       news_03 Exciting Volunteer Opportunities at DSRS!
         ds_02         Gies Consumer Credit Panel (GCCP)
        sub_05                                    Growth
        sub_03                            Infrastructure
       news_01       Join the DSRS Team for Summer 2025!
metric_2025_05              LinkedIn head-count snapshot
        cap_05                           Microsoft Azure
         ds_05                          Nielsen Datasets
         ds_03                 

In [28]:
# 1. Show every node Growth points to (here only its ENABLES targets)
print("Growth leverages:", list(G.successors("sub_05")))

# 2. Simple centrality ranking
dc = nx.degree_centrality(G)
print(sorted(dc.items(), key=lambda x: -x[1])[:5])

# 3. News posts with their dates (tabular input for a timeline plot)
df_news = df[df["type"]=="news"][["title","post_date"]]
display(df_news)

Growth leverages: ['ds_03', 'ds_06']
[('sub_01', 0.2727272727272727), ('sub_02', 0.2727272727272727), ('sub_04', 0.18181818181818182), ('sub_03', 0.09090909090909091), ('sub_05', 0.09090909090909091)]


Unnamed: 0,title,post_date
19,Join the DSRS Team for Summer 2025!,2025-05-10
20,Project Manager Opportunity at DSRS,2024-11-01
21,Exciting Volunteer Opportunities at DSRS!,2024-09-30
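
The centrality values printed above follow from simple arithmetic: NetworkX's degree centrality is a node's degree divided by `n - 1`, and for a directed graph the degree counts incoming and outgoing edges together. The sketch below reproduces the five numbers by hand. The degree counts are derived from the KB rows (for example, Data Hub owns six datasets), and `N_NODES` and `degree` are hypothetical names.

```python
N_NODES = 23  # nodes in the graph, as printed by the graph-building cell

# Degree of each sub-unit, counted from the KB rows:
degree = {
    "sub_01": 6,  # Data Hub: six datasets
    "sub_02": 6,  # Services: three services + three capabilities
    "sub_04": 4,  # Engagement: three news posts + one metric
    "sub_03": 2,  # Infrastructure: two capabilities
    "sub_05": 2,  # Growth: two ENABLES links
}

# Degree centrality = degree / (n - 1)
centrality = {node: deg / (N_NODES - 1) for node, deg in degree.items()}

for node, value in centrality.items():
    print(node, round(value, 4))
```

`ds_03` and `ds_06` also have degree 2 (one `OWNS` and one `ENABLES` edge each), so they tie with `sub_03` and `sub_05`. The top-5 list shows the sub-units only because Python's sort is stable and they were added to the graph first.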
