<a href="https://colab.research.google.com/github/rr2020/csc786-ethics-demo/blob/main/CSC786_ethic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# CSC 786 – Data Ethics & Reproducibility Workshop  

This notebook demonstrates a complete ethical, reproducible data-collection workflow:

- Ethical handling of APIs and environment variables  
- Data collection using both key-based and public APIs  
- Provenance logging and metadata documentation  
- Responsible data storage and reproducible version control  
- Pushing results to a GitHub repository  

All steps run directly in Google Colab.


## The Big Picture
Think of your Colab notebook as the entry point to your research repo.
The notebook does the work (collects data, logs metadata), while the repo (on GitHub) stores the evidence — code, data samples, metadata logs, and ethical documentation.

As a prerequisite, you need to create the GitHub repo first (empty). See the next cell for details.



## Create an empty GitHub repo (UI steps)
1. Sign in to GitHub.
2. Click the + (top-right) → New repository.
3. Repository name: e.g., csc786-ethics-demo.
4. Owner: your account.
5. Visibility: Public (recommended for this class) or Private.
6. Important: Do NOT check “Add a README”, “Add .gitignore”, or “Choose a license”. Leaving these unchecked keeps the repo truly empty, which makes the first push from Colab simplest.
7. Click Create repository.
8. On the next page, copy the HTTPS URL. You will use it later in the notebook.

## Create (or confirm) a GitHub Personal Access Token (PAT) for Colab pushes
You’ll push from Colab using HTTPS + a token (safer/simpler than SSH during class).
1. Go to https://github.com/settings → Developer settings → Personal access tokens. Choose “Fine-grained tokens” (preferred).
2. Click Generate new token and fill in:
   - Token name (e.g., colab-demo)
   - Only select repositories → choose the course repository
   - Permissions → Add permissions → Contents → Access: Read and write
3. Generate the token and copy it once (you won’t see it again).

Tip: keep this token handy just for the class; you can revoke it afterward.

Setup Cell
Run once per session

In [2]:
#%env GITHUB_TOKEN=token_here

!git config --global user.name "Ronald Rivera" ## Display name not necessarily your username
!git config --global user.email "seabeerrivera@gmail.com"

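## One time only: Connect the empty repo from Colab (first push)

In [2]:
# Create the project folder first. If it is missing, %cd fails and
# git init runs in /content, committing Colab's own .config files.
!mkdir -p /content/csc786-ethics-demo
%cd /content/csc786-ethics-demo
!rm -rf .git

!git init
!git add .
# --allow-empty lets the first commit succeed even before any files exist
!git commit --allow-empty -m "Initial reproducibility demo"
!git branch -M main

# Replace rr2020 and the repo name with your own.
# $GITHUB_TOKEN is read from the environment (see the Setup Cell).

!git remote add origin https://rr2020:$GITHUB_TOKEN@github.com/rr2020/csc786-ethics-demo.git

!git push -u origin main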

[Errno 2] No such file or directory: '/content/csc786-ethics-demo'
/content
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /content/.git/
[master (root-commit) d4706ca] Initial reproducibility demo
 21 files changed, 51025 insertions(+)
 create mode 100644 .config/.last_opt_in_prompt.yaml
 create mode 100644 .config/.last_survey_prompt.yaml
 create mode 100644 .config/.last_update_check.json
 create mode 100644 .config/active_config
 create mode 100644 .config/config_sentinel
 create mode 100644 .config/configurations/co

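# When you reopen Colab next time
You’ll simply clone your GitHub repo back into /content instead of re-initializing a new one. The clone cell under “Colab-specific access details” below shows that reconnect workflow.

Before ending the first session, add a .gitignore so that data files, caches, and secrets stay out of version control: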

In [3]:
%%bash
# --- Create and push .gitignore for clean, ethical repo ---

cat > .gitignore << 'EOF'
.ipynb_checkpoints/
__pycache__/
data/*
.env
*.env
EOF

git add .gitignore
git commit -m "Add .gitignore for data, cache, and secrets"
git push


[main f808417] Add .gitignore for data, cache, and secrets
 1 file changed, 5 insertions(+)
 create mode 100644 .gitignore


To https://github.com/rr2020/csc786-ethics-demo.git
   d4706ca..f808417  main -> main
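
Even with a .gitignore in place, a token can still slip into a tracked file, for example through a saved cell output. A minimal sketch of a check to run before committing follows. The function name `find_secret_like_lines` and the patterns are illustrative assumptions, not an exhaustive scanner; `ghp_` and `github_pat_` are the prefixes GitHub uses for classic and fine-grained tokens.

```python
import re

# Illustrative patterns only (assumption): GitHub token prefixes and a
# generic "api key = value" assignment. Real scanners cover far more cases.
SECRET_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]{20,}"),
    re.compile(r"github_pat_[A-Za-z0-9_]{20,}"),
    re.compile(r"(?i)api[_-]?key\s*[=:]\s*['\"]?[A-Za-z0-9-]{16,}"),
]


def find_secret_like_lines(text):
    """Return (line_number, line) pairs that match any secret-like pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append((number, line))
    return hits


# Made-up sample text; the key below is a placeholder, not a real credential.
sample = "url = 'https://example.org'\nNVD_API_KEY=abcd1234-abcd-1234-abcd-1234567890ab\n"
for number, _line in find_secret_like_lines(sample):
    print(f"Possible secret on line {number}")
```

Running the function over the text of each file you are about to `git add` gives a quick warning; it does not replace revoking a token that has already been pushed.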


In [16]:
# You can always check what's currently configured by:

!git config --global --list

user.name=Ronald Rivera
user.email=seabeerrivera@gmail.com


## Colab-specific access details
Note: While we work in Colab, everything inside /content/ is temporary and disappears when the session ends; only what you push to GitHub persists.
As you run the notebook:
1. It creates the folder /content/csc786-ethics-demo/data/ for your CSVs.
2. It appends provenance info to /content/csc786-ethics-demo/DATA_README.md.
3. You can add extra markdown files manually.

In [4]:
# 1. Clone your existing repo from GitHub
!git clone https://github.com/rr2020/csc786-ethics-demo.git # todo update url
%cd csc786-ethics-demo


# 2. Optional: verify remote
!git remote -v


# 3. If you make changes and want to push again
!git remote set-url origin https://rr2020:$GITHUB_TOKEN@github.com/rr2020/csc786-ethics-demo.git
 # todo update url

!git add .
!git commit -m "Update from Colab session"
!git push


Cloning into 'csc786-ethics-demo'...
remote: Enumerating objects: 31, done.
remote: Counting objects:  70% (22/31)

## Step 1 – Setup Environment

In [5]:
!pip install python-dotenv --quiet
import os, pandas as pd, requests, hashlib, json, sys, time
from datetime import datetime, timezone
from pathlib import Path

ROOT = Path("/content/csc786-ethics-demo") ## todo: may update repo name if needed
DATA = ROOT / "data"
DATA.mkdir(parents=True, exist_ok=True)  # also creates ROOT if the repo folder is missing
print("Environment ready. Files will be stored in:", DATA)


Environment ready. Files will be stored in: /content/csc786-ethics-demo/data


## Step 2 – Create Reproducibility Documentation Files


### Ethical Reminder

Before collecting any data:

- Check Terms of Service and rate limits.  
- Avoid collecting or storing personally identifiable information (PII).  
- Document every endpoint, parameter, and date of collection.  
- Keep secrets (API keys) out of public repositories.  


In [6]:
from pathlib import Path
ROOT = Path("/content/csc786-ethics-demo")

# 1 - README.md  (general project overview)
readme_text = """# Reproducibility Demo – CSC 786

This repository demonstrates an ethical, reproducible data-collection workflow used in the CSC 786 course.

## Overview (update as necessary)
This project collects a sample of recent vulnerability metadata from the NIST NVD,
logs all collection parameters and metadata, and stores them in a version-controlled repository.

## Files
| File | Purpose |
|------|----------|
| `README.md` | Project overview and usage instructions |
| `ETHICS.md` | Ethical statement for transparency |
| `DATA_README.md` | Auto-logged metadata for every data collection event |


"""
(ROOT / "README.md").write_text(readme_text)


# 2 - ETHICS.md  → ethical statement / responsible data use
ethics_text = """## Ethical Statement

- Data sources are open and public.
- No personally identifiable information (PII) is collected.
- All API usage complies with provider Terms of Service and rate limits.
- API keys for the NIST NVD are stored securely using environment variables.
- Every dataset generated is logged with parameters, timestamps, and hashes in `DATA_README.md`.
- This workflow aligns with academic integrity and reproducibility standards at Dakota State University.

- Potential risks (bias, privacy, security)
  - Actionable intelligence: the raw data contains descriptions of known, exploitable flaws (CVEs). Unauthorized individuals could use this information to exploit systems.
  - Biased coverage: the data covers only public vulnerabilities and may not include zero-days or private threats.

- Mitigations (data handling, bias checks)
  - Controlled access: the NVD API key is handled securely and is not committed to the public GitHub repo.
  - Responsible sharing: the collected sample data is small and public, and the code is designed for read-only collection.

- Limitations (known constraints)
  - Latency: data may not be real-time, since there is a delay between when a CVE is published and when it is analyzed.
  - Rate limits: collection is limited by the NVD's rate limits, so large data pulls are not possible.


"""
(ROOT / "ETHICS.md").write_text(ethics_text)


# 3 - DATA_README.md  → provenance log (append-only)
data_readme_path = ROOT / "DATA_README.md"
if not data_readme_path.exists():
    data_readme_path.write_text("""# Data Provenance Log
Each entry below documents a data-collection event.
Auto-generated by the notebook.

Example entry format (NIST NVD Data):
- {"timestamp_utc": "...", "endpoint": "...", "params": {...}, "output_file": "nvd_vulnerabilities_....csv", "sha256": "...", "data_columns": ["cve.id", "cve.published", "..."]}

---
""")

print("Created reproducibility files:")
!ls -lh /content/csc786-ethics-demo | grep -F ".md"

Created reproducibility files:
-rw-r--r-- 1 root root  328 Oct 25 05:46 DATA_README.md
-rw-r--r-- 1 root root 1.3K Oct 25 05:46 ETHICS.md
-rw-r--r-- 1 root root  615 Oct 25 05:46 README.md


## Step 3 – Managing Secrets (Key-based API Example)

In [7]:


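# Store the key in this Colab session only (it is not written to the repo).
# You must replace YOUR_KEY_HERE with your own key; visit the NIST NVD Data Feed
# page and follow the instructions. Note that %env echoes the value into the
# cell output, so clear the output before saving or sharing the notebook.
%env NVD_API_KEY=YOUR_KEY_HERE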

API_KEY = os.getenv("NVD_API_KEY")
print("Key loaded:", API_KEY[:6] + "****" if API_KEY else "No key found")


env: NVD_API_KEY=YOUR_KEY_HERE
Key loaded: YOUR_K****
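
Because `%env` echoes the value it sets, the key ends up in the saved cell output. One alternative, sketched below under the assumption that an interactive prompt is acceptable in class, reads the key with `getpass` so it is never displayed. The helper names `load_secret` and `mask_secret` are hypothetical.

```python
import os
from getpass import getpass


def mask_secret(value, visible=4):
    """Show only the first few characters of a secret."""
    if not value:
        return "No key found"
    return value[:visible] + "****"


def load_secret(name, prompt=getpass):
    """Read a secret from the environment, or prompt without echoing it."""
    value = os.environ.get(name)
    if not value:
        value = prompt(f"Enter {name}: ")
        os.environ[name] = value  # visible to later cells in this session only
    return value


# Usage in a notebook cell (prompts once per session):
# API_KEY = load_secret("NVD_API_KEY")
# print("Key loaded:", mask_secret(API_KEY))
print(mask_secret("example-not-a-real-key"))
```

The prompt function is a parameter so the helper can be exercised without typing anything, for instance in an automated check.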


## Step 4 – Public API Example (NIST NVD)

This example queries the NIST NVD with the key from Step 3. For your own project, you will work with your own key-based API.

In [17]:
import pandas as pd
import os, requests, json
from datetime import datetime

# Configuration: publication-date window and sample size
RESULTS_COUNT = 30
PUB_START_DATE = "2025-09-01T00:00:00.000"
PUB_END_DATE = "2025-10-01T00:00:00.000"

# Load API key
API_KEY = os.getenv("NVD_API_KEY", None)
if not API_KEY:
    raise ValueError("⚠️ No NVD API key found. Please run: %env NVD_API_KEY=your_key")

# API endpoint
url = "https://services.nvd.nist.gov/rest/json/cves/2.0"

# Query parameters (logged later as provenance in Step 5)
PARAMS = {
    "pubStartDate": PUB_START_DATE + "Z",
    "pubEndDate": PUB_END_DATE + "Z",
    "resultsPerPage": RESULTS_COUNT
}

headers = {"apiKey": API_KEY}

print("Fetching data from NVD API...")
# A timeout keeps the cell from hanging if the service does not respond
response = requests.get(url, params=PARAMS, headers=headers, timeout=30)

if response.status_code == 200:
    data = response.json()
    cves = []
    for item in data.get("vulnerabilities", []):
        cve = item.get("cve", {})
        cve_id = cve.get("id")
        published = cve.get("published")
        last_modified = cve.get("lastModified")

        # Severity is only read from CVSS v3.1; other versions stay None
        severity = None
        metrics = cve.get("metrics", {})
        if metrics.get("cvssMetricV31"):
            severity = metrics["cvssMetricV31"][0]["cvssData"]["baseSeverity"]

        cves.append({
            "cve.id": cve_id,
            "cve.published": published,
            "cve.lastModified": last_modified,
            "cve.metrics.cvssMetricV31.0.cvssData.baseSeverity": severity
        })

    df_clean = pd.DataFrame(cves)
    print(f"✅ Successfully collected {len(df_clean)} CVEs.")
else:
    # Stop here so later cells never run against a missing df_clean
    raise RuntimeError(f"❌ Request failed with status {response.status_code}. Response: {response.text[:300]}")


display(df_clean.head())


Fetching data from NVD API...
✅ Successfully collected 30 CVEs.


| | cve.id | cve.published | cve.lastModified | cve.metrics.cvssMetricV31.0.cvssData.baseSeverity |
|---|--------|---------------|------------------|---------------------------------------------------|
| 0 | CVE-2025-9751 | 2025-09-01T00:15:34.580 | 2025-09-08T14:06:05.217 | HIGH |
| 1 | CVE-2025-9752 | 2025-09-01T01:15:46.817 | 2025-09-04T18:47:25.440 | HIGH |
| 2 | CVE-2025-9753 | 2025-09-01T01:15:47.060 | 2025-09-04T18:46:50.757 | LOW |
| 3 | CVE-2025-9754 | 2025-09-01T02:15:45.223 | 2025-09-04T18:46:58.453 | LOW |
| 4 | CVE-2025-9755 | 2025-09-01T02:15:45.493 | 2025-09-05T19:54:52.480 | MEDIUM |
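
The Step 4 cell reads severity only from `cvssMetricV31`, so any CVE scored under a different CVSS version would be recorded with no severity. A hedged sketch of a fallback lookup follows. The function name `extract_severity` is hypothetical, and the other metric keys and the location of `baseSeverity` in older entries are assumptions about the NVD JSON layout that should be checked against the NVD schema before use.

```python
# Assumed metric keys, in order of preference. Verify against the NVD schema.
METRIC_KEYS = ["cvssMetricV40", "cvssMetricV31", "cvssMetricV30", "cvssMetricV2"]


def extract_severity(cve):
    """Return the first available base severity for a CVE record, else None."""
    metrics = cve.get("metrics", {})
    for key in METRIC_KEYS:
        entries = metrics.get(key) or []
        if not entries:
            continue
        first = entries[0]
        # Assumption: newer versions nest the value under "cvssData";
        # older entries may carry it on the metric entry itself.
        severity = first.get("cvssData", {}).get("baseSeverity") or first.get("baseSeverity")
        if severity:
            return severity
    return None


# Hand-made record with a placeholder ID, not real NVD data.
example = {
    "id": "CVE-0000-0000",
    "metrics": {"cvssMetricV2": [{"baseSeverity": "MEDIUM", "cvssData": {}}]},
}
print(extract_severity(example))
```

If you adopt a fallback like this, log which CVSS version supplied each value, since severities from different versions are not directly comparable.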


## Step 5 – Save Data and Log Provenance

In [13]:
url = "https://services.nvd.nist.gov/rest/json/cves/2.0"

timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")

out_csv = DATA / "nvd_cve_data.csv"
df_clean.to_csv(out_csv, index=False)
print(f"✅ Data saved to {out_csv}")

file_hash = hashlib.sha256(out_csv.read_bytes()).hexdigest()  # read_bytes() closes the file for us

meta = {
    "status": "SUCCESS",
    "timestamp_utc": timestamp,
    "data_source": "NIST NVD API v2.0",
    "endpoint": url,
    "params": PARAMS,
    "rows_collected": len(df_clean),
    "output_file": str(out_csv),
    "sha256": file_hash,
    "python": sys.version.split()[0],
    "pandas": pd.__version__,
}

with open(ROOT / "DATA_README.md", "a") as f:
    f.write(f"\n- {json.dumps(meta, indent=None)}")

print("📘 Provenance entry successfully added to DATA_README.md")


✅ Data saved to /content/csc786-ethics-demo/data/nvd_cve_data.csv
📘 Provenance entry successfully added to DATA_README.md
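
A logged hash only helps reproducibility if someone can re-check it later. The sketch below, which assumes provenance lines keep the `- {json}` shape written in Step 5, recomputes the SHA-256 of the named file and compares it with the logged value. The helper names `sha256_of` and `verify_entry` are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_entry(log_line):
    """Check one '- {json}' provenance line against the file it names."""
    meta = json.loads(log_line.lstrip("- ").strip())
    return sha256_of(meta["output_file"]) == meta["sha256"]


# Self-contained demonstration with a temporary file (not real NVD data).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo.csv"
    csv_path.write_text("cve.id\nCVE-0000-0000\n")
    entry = "- " + json.dumps({"output_file": str(csv_path), "sha256": sha256_of(csv_path)})
    print("Hash matches:", verify_entry(entry))
    csv_path.write_text("tampered\n")
    print("Hash matches after edit:", verify_entry(entry))
```

Note that the Step 5 entry stores an absolute Colab path in `output_file`, so a check like this only works where the file sits at that same path.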


## Step 6 – Verify Before Pushing

You can verify everything before pushing.

In [24]:
%cd /content/csc786-ethics-demo

!ls -lh
!ls -lh data
!head -n 5 README.md
!tail -n 5 DATA_README.md

/content/csc786-ethics-demo
total 68K
-rw-r--r-- 1 root root  48K Oct 25 06:47 CSC786_ethic.ipynb
drwxr-xr-x 2 root root 4.0K Oct 25 05:48 data
-rw-r--r-- 1 root root  825 Oct 25 05:48 DATA_README.md
-rw-r--r-- 1 root root 1.3K Oct 25 05:46 ETHICS.md
-rw-r--r-- 1 root root  615 Oct 25 05:46 README.md
drwxr-xr-x 2 root root 4.0K Oct 25 05:41 sample_data
total 4.0K
-rw-r--r-- 1 root root 2.1K Oct 25 05:48 nvd_cve_data.csv
# Reproducibility Demo – CSC 786

This repository demonstrates an ethical, reproducible data-collection workflow used in the CSC 786 course.

## Overview (udpate as necessary)
- {"timestamp_utc": "...", "endpoint": "...", "params": {...}, "output_file": "nvd_vulnerabilities_....csv", "sha256": "...", "data_columns": ["cve.id", "cve.published", "..."]}

---

- {"status": "SUCCESS", "timestamp_utc": "2025-10-25T05:48:17.000Z", "data_source": "NIST NVD API v2.0", "endpoint": "https://services.nvd.nist.gov/rest/json/cves/2.0", "params": {"pubStartDate": "2025-09-01T00:00:00

## Step 7 – Push to GitHub

In [21]:

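%cd /content/csc786-ethics-demo

!git remote set-url origin https://rr2020:$GITHUB_TOKEN@github.com/rr2020/csc786-ethics-demo.git

# Colab does not save the running notebook under /content. Upload a copy there
# first (or use File → Save a copy in GitHub); otherwise the cp below fails.
!cp "/content/CSC786_ethic.ipynb" /content/csc786-ethics-demo/
!ls

# git add . stages the notebook along with every other change
!git add .
!git commit -m "Update from Colab session"
!git push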

/content/csc786-ethics-demo
cp: cannot stat '/content/CSC786_ethic.ipynb': No such file or directory
data  DATA_README.md  ETHICS.md  README.md  sample_data
fatal: pathspec 'CSC786_ethic.ipynb' did not match any files
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
Everything up-to-date


In [5]:
!git clone https://github.com/rr2020/csc786-ethics-demo.git
%cd csc786-ethics-demo

Cloning into 'csc786-ethics-demo'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 36 (delta 5), reused 35 (delta 4), pack-reused 0 (from 0)
Receiving objects: 100% (36/36), 8.42 MiB | 19.96 MiB/s, done.
Resolving deltas: 100% (5/5), done.
/content/csc786-ethics-demo


In [6]:
!ls /content


csc786-ethics-demo  sample_data


In [4]:
%cd /content/csc786-ethics-demo
!cp /content/CSC786_ethic.ipynb .
!git add CSC786_ethic.ipynb
!git commit -m "Add notebook"
!git remote set-url origin https://rr2020:$GITHUB_TOKEN@github.com/rr2020/csc786-ethics-demo.git
!git push


[Errno 2] No such file or directory: '/content/csc786-ethics-demo'
/content
cp: cannot stat '/content/CSC786_ethic.ipynb': No such file or directory
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git



### In this demo we:
- Accessed both key-based and open APIs ethically.  
- Created transparency files: README.md, ETHICS.md, DATA_README.md.  
- Logged complete metadata (endpoint, params, hash, timestamp).  
- Pushed the entire reproducible workflow to GitHub.  

### Now think:
- How could you adapt this structure for your own project?
   
   This structure could organize my scripts, notebooks, and modules and keep everything in one place. The README would explain the purpose, setup, and execution of the tool I am building. The repo would also be useful for storing JSON/YAML files and tracking changes.
    
- What extra metadata might your discipline require (license, consent, citation)?  

   Metadata could include the consent of system owners and the scope and rules of the collection. It could also record which artifacts were collected, with timestamps, and document source dependencies.
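
The extra metadata discussed above could be recorded in the same one-line JSON style as the provenance log. The sketch below is one way to do that; every field name in `CollectionRecord` is a hypothetical choice for illustration, and the URL, hash, and values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class CollectionRecord:
    """Provenance entry extended with hypothetical discipline-specific fields."""

    endpoint: str
    output_file: str
    sha256: str
    license: str = "unspecified"
    consent_reference: str = "not applicable"
    scope: str = ""
    citation: str = ""
    timestamp_utc: str = field(
        default_factory=lambda: datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    )

    def to_log_line(self):
        """Serialize as a single '- {json}' line, like the DATA_README.md entries."""
        return "- " + json.dumps(asdict(self))


record = CollectionRecord(
    endpoint="https://example.org/api",  # placeholder URL
    output_file="data/example.csv",
    sha256="0" * 64,  # placeholder hash
    license="CC0-1.0",
    scope="lab hosts listed in the signed authorization",
)
print(record.to_log_line())
```

Giving the optional fields explicit defaults means a missing license or consent reference shows up in the log as "unspecified" or "not applicable" instead of being silently absent.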