# Lab 1: Data Discovery & Classification

**Data Discovery: Harnessing AI, AGI & Vector Databases - Day 1**

| Duration | Libraries | Sections |
|---|---|---|
| 90 min | pandas, scikit-learn, sentence-transformers, chromadb | 5 |

In this lab, you'll explore:
- Profiling synthetic data assets and identifying quality issues
- Building a text classifier with TF-IDF and RandomForest
- Extracting metadata using regex patterns
- Discovering data clusters with KMeans and PCA
- Building a vector catalogue with semantic search

---

## Student Notes & Background

### What is Data Discovery?

**Data discovery** is the process of finding, understanding, and cataloguing data assets across an organisation. In a typical enterprise, data lives in dozens of systems — relational databases, cloud storage buckets, shared drives, SaaS platforms, and legacy mainframes. Without a systematic discovery process, teams waste hours searching for data they need, duplicate efforts, and risk using stale or incorrect information.

Modern data discovery goes beyond simple file listings. It combines **profiling** (understanding what's in the data), **classification** (labelling data by department, sensitivity, or type), and **search** (finding relevant assets using natural language queries). In this lab, you'll build all three capabilities from scratch.

### Key Concepts

#### 1. Data Profiling
**Data profiling** examines a dataset to collect statistics and identify quality issues. A thorough profile includes:
- **Shape and structure** — how many rows, columns, and what data types?
- **Value distributions** — what are the most common values in each column?
- **Missing data** — which fields have gaps, and how severe are they?
- **Sensitivity breakdown** — how much data is public vs. restricted?

Profiling is always the first step in data discovery because you cannot classify or govern data you don't understand.
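
As a rough illustration of the checklist above, the sketch below profiles a four-row toy table. The table, its column names, and the layout of the `profile` summary are made up for this example and are not the lab's dataset.

```python
import pandas as pd

# Hypothetical toy catalogue with deliberate gaps in `owner` and `row_count`
assets = pd.DataFrame({
    "name": ["hr_table", "fin_report", "eng_log", "legal_file"],
    "sensitivity": ["Restricted", "Internal", "Public", "Internal"],
    "owner": ["alice", None, "bob", None],
    "row_count": [1200, 45000, None, 300],
})

# Shape and structure
print(f"Shape: {assets.shape}")

# One row per column: data type, missing values, and distinct values
profile = pd.DataFrame({
    "dtype": assets.dtypes.astype(str),
    "missing": assets.isna().sum(),
    "missing_pct": (assets.isna().mean() * 100).round(1),
    "unique": assets.nunique(),
})
print(profile)

# Sensitivity breakdown as a share of all assets
sensitivity_share = assets["sensitivity"].value_counts(normalize=True)
print(sensitivity_share)
```

Here half of the assets have no owner, which is exactly the kind of gap a profile is meant to surface before any classification work begins.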

#### 2. TF-IDF (Term Frequency–Inverse Document Frequency)
**TF-IDF** converts text into numerical vectors by measuring how important each word is to a document relative to the entire corpus:
- **Term Frequency (TF):** How often a word appears in a single document (higher = more relevant to that document)
- **Inverse Document Frequency (IDF):** How rare a word is across all documents (rarer words carry more information)
- **TF-IDF = TF × IDF** — words that are frequent in one document but rare overall get the highest scores

For example, "employee" in an HR document scores high on TF, but if it appears across many departments, its IDF is lower. "Payroll," which appears almost exclusively in HR documents, gets a very high TF-IDF score for HR assets.

#### 3. Random Forest Classification
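A **Random Forest** is an ensemble of decision trees whose predictions are combined into a final answer. For text classification:
1. Each tree is trained on a bootstrap sample of the training documents and considers only a random subset of the TF-IDF features at each split
2. Each tree makes an independent prediction
3. The forest combines the trees' predictions by majority vote (scikit-learn averages the trees' class probabilities, which usually gives the same answer)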

Random Forests are less prone to overfitting than a single decision tree, handle high-dimensional data well (TF-IDF can produce thousands of features), and provide feature importance scores that reveal which words contribute most to the model's predictions overall.

#### 4. Metadata Extraction with Regex
**Regular expressions (regex)** are pattern-matching rules that can extract structured information from unstructured text. In data discovery, regex is used to:
- Find business terms (e.g., "salary," "revenue," "compliance") in asset descriptions
- Detect PII patterns (e.g., SSN format `XXX-XX-XXXX`, email addresses)
- Extract identifiers (e.g., invoice numbers, employee IDs)

Regex extraction is fast and deterministic — the same pattern always produces the same result, making it ideal for automated metadata tagging.
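
The patterns below sketch the three uses listed above. They are simplified for illustration: the email pattern is far looser than real address validation, and the `INV-` invoice format is a hypothetical identifier scheme, not one used by the lab's data.

```python
import re

# Simplified illustrative patterns (assumptions, not production-grade validators)
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
INVOICE_PATTERN = re.compile(r"\bINV-\d{4,}\b")  # hypothetical invoice ID format

text = "Contact jane.doe@example.com about invoice INV-20231 (SSN 123-45-6789)."

ssns = SSN_PATTERN.findall(text)
emails = EMAIL_PATTERN.findall(text)
invoices = INVOICE_PATTERN.findall(text)
print(ssns, emails, invoices)

# Word boundaries (\b) make matching exact: the singular term does not match the plural
singular_hit = re.search(r"\binvoice\b", "Accounts payable invoices", re.IGNORECASE)
plural_hit = re.search(r"\binvoices?\b", "Accounts payable invoices", re.IGNORECASE)
print(singular_hit is None, plural_hit is not None)
```

The last two lines show a practical consequence of determinism: a pattern matches exactly what it says, so plural or inflected forms need to be covered explicitly.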

#### 5. KMeans Clustering & PCA
**KMeans clustering** groups data points into *k* clusters by minimising the squared distance between each point and its assigned cluster centre. Applied to TF-IDF vectors, KMeans discovers natural groupings in the catalogue: assets with similar descriptions end up in the same cluster, even without labelled training data.

**PCA (Principal Component Analysis)** reduces high-dimensional vectors to 2D or 3D for visualisation. It finds the axes of maximum variance in the data, so the 2D projection preserves as much structure as possible. The resulting scatter plot reveals whether clusters are well-separated or overlapping.
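
The sketch below applies both steps to synthetic points rather than TF-IDF vectors, so the outcome is easy to reason about. The two groups, their five dimensions, and the noise level are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two well-separated synthetic groups in 5 dimensions (hypothetical data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(20, 5))
group_b = rng.normal(loc=3.0, scale=0.1, size=(20, 5))
points = np.vstack([group_a, group_b])

# KMeans assigns every point to the nearest of k=2 cluster centres
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(points)

# PCA projects the 5-D points onto the two directions of greatest variance
pca = PCA(n_components=2)
coords = pca.fit_transform(points)

print("Cluster sizes:", np.bincount(labels))
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

Because almost all of the variance here is the gap between the two groups, the first principal component captures nearly everything. Real TF-IDF data is much less tidy, which is why the variance percentages on the plot axes are worth reading.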

#### 6. Vector Catalogues & Semantic Search
A **vector catalogue** stores data asset descriptions as dense embeddings (from a model like `all-MiniLM-L6-v2`) in a vector database (ChromaDB). Unlike TF-IDF, these embeddings capture **semantic meaning** — "employee compensation" and "staff salary" will have similar vectors even though they share no words.

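**Semantic search** embeds a natural language query with the same model and returns the assets whose embeddings lie closest to it. Cosine distance is a common choice of metric; the metric ChromaDB actually uses depends on how the collection is configured. This is the foundation of modern data discovery platforms.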
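
To make the ranking step concrete without downloading a model, the sketch below uses hand-made 3-dimensional vectors as stand-ins for embeddings. The vectors, the asset names, and the `cosine_distance` helper are all assumptions for illustration; real `all-MiniLM-L6-v2` embeddings have 384 dimensions and are produced by the model, not written by hand.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-in embeddings, one per catalogue entry
catalogue = {
    "staff salary records": np.array([0.9, 0.1, 0.0]),
    "quarterly revenue report": np.array([0.1, 0.9, 0.1]),
    "application server logs": np.array([0.0, 0.1, 0.9]),
}

# Stand-in embedding for the query "employee compensation"
query = np.array([0.8, 0.2, 0.0])

# Rank assets from closest to furthest
ranked = sorted(catalogue, key=lambda name: cosine_distance(query, catalogue[name]))
for name in ranked:
    print(f"{cosine_distance(query, catalogue[name]):.3f}  {name}")
```

The query shares no words with "staff salary records", yet that asset ranks first because its vector points in nearly the same direction as the query's.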

### What You'll Build

In this lab, you will:
1. **Profile** a synthetic catalogue of 500 data assets to understand its structure, distributions, and quality issues
2. **Train** a TF-IDF + Random Forest classifier that predicts an asset's department from its text description
3. **Extract** business terms from descriptions using regex patterns and analyse their frequency distribution
4. **Cluster** assets with KMeans on TF-IDF vectors and visualise the clusters in 2D using PCA
5. **Build** a ChromaDB vector catalogue with sentence-transformer embeddings and perform semantic searches

### Prerequisites
- Basic Python: lists, dictionaries, functions, f-strings
- Familiarity with pandas DataFrames (selecting columns, filtering rows, value counts)
- No prior machine learning experience required — all concepts are introduced as needed

### Tips
- The synthetic data is generated after calling `np.random.seed(42)`, so your numbers should match the solution as long as you run the cells in order from a fresh kernel
- When evaluating the classifier, look at the **per-category F1 scores** in the classification report, not just overall accuracy
- For clustering, try different values of `n_clusters` (3, 5, 7) and observe how the PCA visualisation changes
- Semantic search results include a **distance** metric — lower distance means higher relevance
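
The tip about per-category F1 scores is easiest to see on an imbalanced toy example. The labels below are invented: a model that always predicts the majority class looks acceptable on accuracy while being useless for the minority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical test set: 8 HR assets, 2 Legal assets
y_true = ["HR"] * 8 + ["Legal"] * 2
# A lazy model that predicts "HR" for everything
y_pred = ["HR"] * 10

accuracy = accuracy_score(y_true, y_pred)

# average=None returns one F1 score per class, in the order given by `labels`
class_names = ["HR", "Legal"]
per_class_f1 = dict(zip(
    class_names,
    f1_score(y_true, y_pred, labels=class_names, average=None, zero_division=0),
))

print(f"Accuracy: {accuracy:.2f}")
for name, score in per_class_f1.items():
    print(f"F1 for {name}: {score:.2f}")
```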

---

## Setup

First, let's import the necessary libraries.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
from collections import Counter

# ML libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Vector database
from sentence_transformers import SentenceTransformer
import chromadb

# Settings
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries loaded successfully!")

## Part 1: Generate Synthetic Data Assets

We'll create a synthetic catalogue of 500 data assets representing a typical enterprise.

In [None]:
np.random.seed(42)

categories = ['HR', 'Finance', 'Marketing', 'Engineering', 'Legal']
sources = ['PostgreSQL', 'S3 Bucket', 'SharePoint', 'Salesforce', 'MongoDB']
data_types = ['Table', 'Document', 'Spreadsheet', 'Log File', 'Report']
sensitivity_levels = ['Public', 'Internal', 'Confidential', 'Restricted']

descriptions_pool = {
    'HR': [
        'Employee personal records including name address and date of birth',
        'Annual performance review scores and manager feedback',
        'Payroll data with salary deductions and tax withholdings',
        'Recruitment pipeline tracking applicant status and interview notes',
        'Benefits enrollment records for health dental and vision plans',
        'Employee onboarding documentation and training completion',
        'Workforce diversity and inclusion metrics by department',
        'Time and attendance records with overtime calculations',
        'Employee termination records and exit interview summaries',
        'Compensation benchmarking data across industry roles',
    ],
    'Finance': [
        'Quarterly revenue reports broken down by business unit',
        'Accounts payable invoices and payment processing records',
        'Annual budget forecasts with departmental allocations',
        'Customer billing records including credit card transactions',
        'Expense reimbursement claims with receipt attachments',
        'General ledger entries and journal adjustments',
        'Tax filing documents and regulatory compliance records',
        'Cash flow projections and working capital analysis',
        'Vendor payment terms and contract financial summaries',
        'Audit trail logs for financial transaction approvals',
    ],
    'Marketing': [
        'Campaign performance metrics including click rates and conversions',
        'Customer segmentation profiles based on purchase behaviour',
        'Social media analytics with engagement and reach data',
        'Email marketing subscriber lists with opt-in preferences',
        'Brand sentiment analysis from customer reviews and surveys',
        'Website traffic analytics and user journey tracking',
        'Lead scoring models and marketing qualified lead reports',
        'Content calendar and editorial planning documents',
        'Competitive intelligence reports and market research data',
        'Event registration lists with attendee contact information',
    ],
    'Engineering': [
        'Application server logs with error traces and stack dumps',
        'CI/CD pipeline metrics including build times and failure rates',
        'Infrastructure monitoring data from cloud resources',
        'API usage statistics and rate limiting configurations',
        'Database schema documentation and migration scripts',
        'Code repository commit history and pull request reviews',
        'Load testing results and performance benchmarks',
        'Security vulnerability scan reports and remediation tracking',
        'Microservice dependency maps and architecture diagrams',
        'Incident response logs and post-mortem analysis documents',
    ],
    'Legal': [
        'Active contract repository with vendor agreements and SLAs',
        'Intellectual property filings including patents and trademarks',
        'Regulatory compliance audit findings and remediation plans',
        'Data processing agreements under GDPR Article 28',
        'Litigation case files and legal correspondence records',
        'Corporate governance meeting minutes and board resolutions',
        'Privacy impact assessments for new data processing activities',
        'Non-disclosure agreement tracking and expiration dates',
        'Employment law compliance documentation by jurisdiction',
        'Insurance policy records and claims history',
    ],
}

n_assets = 500
records = []

for i in range(n_assets):
    cat = np.random.choice(categories)
    desc = np.random.choice(descriptions_pool[cat])
    # Add slight variation
    if np.random.random() < 0.3:
        desc += ' updated ' + np.random.choice(['weekly', 'monthly', 'quarterly', 'annually'])
    records.append({
        'asset_id': f'ASSET-{i+1:04d}',
        'name': f'{cat.lower()}_{np.random.choice(["report", "dataset", "log", "file", "table"])}_{i+1:04d}',
        'description': desc,
        'category': cat,
        'source': np.random.choice(sources),
        'data_type': np.random.choice(data_types),
        'sensitivity': np.random.choice(sensitivity_levels, p=[0.15, 0.35, 0.30, 0.20]),
        'owner': np.random.choice(['alice', 'bob', 'carol', 'dave', 'eve', None], p=[0.2, 0.2, 0.2, 0.2, 0.15, 0.05]),
        'row_count': np.random.randint(100, 1_000_000) if np.random.random() > 0.2 else None,
        'last_updated': pd.Timestamp('2023-01-01') + pd.Timedelta(days=int(np.random.randint(0, 730))),
    })

df = pd.DataFrame(records)
print(f"Generated {len(df)} data asset records")
df.head(10)

## Section 1.1: Data Profiling

The code below profiles the data asset catalogue — examining its shape, distributions, and quality issues. This is always the first step in data discovery because you cannot classify or govern data you don't understand.

In [None]:
# Shape and data types
print(f"Shape: {df.shape}")
print(f"\nData Types:")
print(df.dtypes)
print(f"\nMemory Usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

In [None]:
# Value counts
print("Category distribution:")
print(df['category'].value_counts())
print("\nSource distribution:")
print(df['source'].value_counts())
print("\nData type distribution:")
print(df['data_type'].value_counts())

In [None]:
# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
print("Missing values:")
print(pd.DataFrame({'Missing': missing, 'Percent': missing_pct}).query('Missing > 0'))

In [None]:
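# Sensitivity distribution
# Reindex so the bars follow the Public -> Restricted order that the colours assume
# (value_counts() alone sorts by frequency, which would mismatch bars and colours)
fig, ax = plt.subplots(figsize=(8, 5))
sensitivity_counts = df['sensitivity'].value_counts().reindex(sensitivity_levels)
sensitivity_counts.plot(kind='bar', ax=ax, color=['#10b981', '#3b82f6', '#f59e0b', '#ef4444'])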
ax.set_title('Data Asset Sensitivity Distribution')
ax.set_xlabel('Sensitivity Level')
ax.set_ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

### Analysis Questions

1. Which categories have the most assets? Is the distribution balanced across departments?
2. What percentage of assets are missing an owner? What governance risk does this create?
3. How does the sensitivity distribution compare to what you'd expect in a real enterprise?

## Section 1.2: Text Classification with TF-IDF + RandomForest

The code below builds a text classifier that predicts an asset's department from its description. It uses TF-IDF to convert descriptions into numerical vectors, then trains a Random Forest ensemble to learn the mapping from text features to categories.

In [None]:
def build_classifier(df):
    """Build a TF-IDF + RandomForest text classifier."""
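    # Vectorise descriptions
    # Note: the vectoriser is fitted on all rows before the split, so the test set
    # influences the vocabulary and IDF weights. That is acceptable for this lab;
    # in production, fit on the training split only.
    tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
    X_tfidf = tfidf.fit_transform(df['description'])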
    y = df['category']

    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X_tfidf, y, test_size=0.2, random_state=42
    )

    # Train classifier
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)

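    # Evaluate
    y_pred = clf.predict(X_test)
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

    # Confusion matrix: rows are actual categories, columns are predicted categories
    print("Confusion Matrix:")
    print(pd.DataFrame(
        confusion_matrix(y_test, y_pred, labels=clf.classes_),
        index=clf.classes_, columns=clf.classes_
    ))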

    return tfidf, clf, X_tfidf

tfidf, clf, X_tfidf = build_classifier(df)

### Analysis Questions

1. Which department has the highest F1-score? Which has the lowest? Why might some departments be harder to classify?
2. Look at the confusion matrix — which categories get confused with each other? What do their descriptions have in common?
3. If you were deploying this classifier in production, what accuracy threshold would you require?

## Section 2.1: Metadata Extraction

The code below extracts business terms from data asset descriptions using regex patterns. This automated metadata tagging helps data stewards quickly understand what each asset contains without reading every description manually.

In [None]:
def extract_business_terms(text):
    """Extract business terms from a data asset description."""
    terms = [
        'salary', 'revenue', 'customer', 'employee', 'invoice',
        'compliance', 'contract', 'performance', 'billing', 'payroll',
        'budget', 'marketing', 'security', 'legal', 'audit'
    ]
    found = []
    for term in terms:
        if re.search(rf'\b{term}\b', text, re.IGNORECASE):
            found.append(term)
    return found

# Apply to all descriptions
df['business_terms'] = df['description'].apply(extract_business_terms)

# Count most common terms
all_terms = [term for terms_list in df['business_terms'] for term in terms_list]
term_counts = Counter(all_terms).most_common(10)
print("Top 10 business terms:")
for term, count in term_counts:
    print(f"  {term:15} {count}")

### Analysis Questions

1. Which business terms appear most frequently? Do they align with the department distribution?
2. Are there terms that appear across multiple departments? What does this tell you about data overlap?

## Section 2.2: Unsupervised Discovery with Clustering

The code below uses KMeans clustering on TF-IDF vectors to discover natural groupings in the data catalogue, then visualises them in 2D using PCA. Unlike the supervised classifier in Section 1.2, clustering finds structure without any labels.

In [None]:
def cluster_and_visualise(X_tfidf, df, n_clusters=5):
    """Cluster data assets using KMeans and visualise with PCA."""
    # Fit KMeans
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(X_tfidf)

    # Reduce to 2D
    pca = PCA(n_components=2)
    coords = pca.fit_transform(X_tfidf.toarray())

    # Plot
    fig, ax = plt.subplots(figsize=(12, 8))
    scatter = ax.scatter(coords[:, 0], coords[:, 1], c=clusters, cmap='viridis', alpha=0.6, s=30)
    plt.colorbar(scatter, label='Cluster')
    ax.set_title('Data Asset Clusters (PCA Projection)')
    ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
    ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
    plt.tight_layout()
    plt.show()

    # Print cluster composition
    df_temp = df.copy()
    df_temp['cluster'] = clusters
    print("\nCluster composition (actual categories):")
    print(pd.crosstab(df_temp['cluster'], df_temp['category']))

    return clusters

clusters = cluster_and_visualise(X_tfidf, df)

### Analysis Questions

1. Do the 5 clusters correspond to the 5 actual categories? Where do they diverge?
2. How much variance do PC1 and PC2 capture? Is a 2D projection sufficient to understand the data?
3. Look at the cluster-category crosstab — which cluster is the "purest" and which is the most mixed?

## Section 2.3: Vector Catalogue with Semantic Search

The code below builds a vector catalogue using SentenceTransformer embeddings stored in ChromaDB, then performs semantic searches with natural language queries. Unlike TF-IDF, these dense embeddings capture meaning — "employee compensation" and "staff salary" will have similar vectors even though they share no words.

In [None]:
def build_vector_catalogue(df):
    """Build a vector catalogue with ChromaDB."""
    # Load model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Encode descriptions
    descriptions = df['description'].tolist()
    embeddings = model.encode(descriptions)

    # Create ChromaDB collection (get_or_create avoids an "already exists" error if the cell is re-run)
    client = chromadb.Client()
    collection = client.get_or_create_collection("data_catalogue")

    # Add to collection
    collection.add(
        embeddings=embeddings.tolist(),
        documents=descriptions,
        ids=df['asset_id'].tolist(),
        metadatas=[{'category': cat, 'sensitivity': sens}
                   for cat, sens in zip(df['category'], df['sensitivity'])]
    )

    print(f"Vector catalogue built with {collection.count()} assets")
    return collection, model

collection, model = build_vector_catalogue(df)

In [None]:
def semantic_search(collection, queries, model, n_results=5):
    """Perform semantic searches against the vector catalogue."""
    for query in queries:
        # Embed the query with the same model that encoded the catalogue,
        # rather than relying on ChromaDB's default embedding function
        query_embedding = model.encode([query]).tolist()
        results = collection.query(
            query_embeddings=query_embedding,
            n_results=n_results
        )
        print(f"\nQuery: '{query}'")
        print("-" * 60)
        for i, (doc, dist, meta) in enumerate(zip(
            results['documents'][0],
            results['distances'][0],
            results['metadatas'][0]
        )):
            print(f"  {i+1}. [{meta['category']:12}] {doc[:70]}... (dist: {dist:.3f})")

# Test queries
test_queries = [
    "customer financial transactions",
    "employee personal information",
    "software development metrics",
]

semantic_search(collection, test_queries, model)

### Analysis Questions

1. For the query "customer financial transactions", why do certain results appear despite not containing those exact words?
2. Compare the distance scores of the top 5 results — is there a clear drop-off between relevant and irrelevant results?
3. How would semantic search help a data steward who doesn't know the exact terminology used in the catalogue?

## Summary

In this lab, you learned how to:

1. **Profile** synthetic data asset catalogues and identify quality issues
2. **Classify** data assets using TF-IDF + RandomForest text classification
3. **Extract** business metadata from descriptions using regex patterns
4. **Cluster** data assets with KMeans to discover natural groupings
5. **Build** a vector catalogue with semantic search using ChromaDB

---

*Data Discovery: Harnessing AI, AGI & Vector Databases | AI Elevate*