# Lab 1: Data Discovery & Classification

**Data Discovery: Harnessing AI, AGI & Vector Databases - Day 1**

| Duration | Difficulty | Libraries | Exercises |
|---|---|---|---|
| 90 min | Beginner | pandas, scikit-learn, sentence-transformers, chromadb | 5 |

In this lab, you'll practice:
- Profiling synthetic data assets and identifying quality issues
- Building a text classifier with TF-IDF and RandomForest
- Extracting metadata using regex patterns
- Discovering data clusters with KMeans and PCA
- Building a vector catalogue with semantic search

---

## Setup

First, let's import the necessary libraries.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
from collections import Counter

# ML libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Vector database
from sentence_transformers import SentenceTransformer
import chromadb

# Settings
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries loaded successfully!")

## Part 1: Generate Synthetic Data Assets

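We'll create a synthetic catalogue of 500 data assets representing a typical enterprise. Some `owner` and `row_count` values are deliberately left missing so there are quality issues to find later.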

In [None]:
np.random.seed(42)

categories = ['HR', 'Finance', 'Marketing', 'Engineering', 'Legal']
sources = ['PostgreSQL', 'S3 Bucket', 'SharePoint', 'Salesforce', 'MongoDB']
data_types = ['Table', 'Document', 'Spreadsheet', 'Log File', 'Report']
sensitivity_levels = ['Public', 'Internal', 'Confidential', 'Restricted']

descriptions_pool = {
    'HR': [
        'Employee personal records including name address and date of birth',
        'Annual performance review scores and manager feedback',
        'Payroll data with salary deductions and tax withholdings',
        'Recruitment pipeline tracking applicant status and interview notes',
        'Benefits enrollment records for health dental and vision plans',
        'Employee onboarding documentation and training completion',
        'Workforce diversity and inclusion metrics by department',
        'Time and attendance records with overtime calculations',
        'Employee termination records and exit interview summaries',
        'Compensation benchmarking data across industry roles',
    ],
    'Finance': [
        'Quarterly revenue reports broken down by business unit',
        'Accounts payable invoices and payment processing records',
        'Annual budget forecasts with departmental allocations',
        'Customer billing records including credit card transactions',
        'Expense reimbursement claims with receipt attachments',
        'General ledger entries and journal adjustments',
        'Tax filing documents and regulatory compliance records',
        'Cash flow projections and working capital analysis',
        'Vendor payment terms and contract financial summaries',
        'Audit trail logs for financial transaction approvals',
    ],
    'Marketing': [
        'Campaign performance metrics including click rates and conversions',
        'Customer segmentation profiles based on purchase behaviour',
        'Social media analytics with engagement and reach data',
        'Email marketing subscriber lists with opt-in preferences',
        'Brand sentiment analysis from customer reviews and surveys',
        'Website traffic analytics and user journey tracking',
        'Lead scoring models and marketing qualified lead reports',
        'Content calendar and editorial planning documents',
        'Competitive intelligence reports and market research data',
        'Event registration lists with attendee contact information',
    ],
    'Engineering': [
        'Application server logs with error traces and stack dumps',
        'CI/CD pipeline metrics including build times and failure rates',
        'Infrastructure monitoring data from cloud resources',
        'API usage statistics and rate limiting configurations',
        'Database schema documentation and migration scripts',
        'Code repository commit history and pull request reviews',
        'Load testing results and performance benchmarks',
        'Security vulnerability scan reports and remediation tracking',
        'Microservice dependency maps and architecture diagrams',
        'Incident response logs and post-mortem analysis documents',
    ],
    'Legal': [
        'Active contract repository with vendor agreements and SLAs',
        'Intellectual property filings including patents and trademarks',
        'Regulatory compliance audit findings and remediation plans',
        'Data processing agreements under GDPR Article 28',
        'Litigation case files and legal correspondence records',
        'Corporate governance meeting minutes and board resolutions',
        'Privacy impact assessments for new data processing activities',
        'Non-disclosure agreement tracking and expiration dates',
        'Employment law compliance documentation by jurisdiction',
        'Insurance policy records and claims history',
    ],
}

n_assets = 500
records = []

for i in range(n_assets):
    cat = np.random.choice(categories)
    desc = np.random.choice(descriptions_pool[cat])
    # Add slight variation
    if np.random.random() < 0.3:
        desc += ' updated ' + np.random.choice(['weekly', 'monthly', 'quarterly', 'annually'])
    records.append({
        'asset_id': f'ASSET-{i+1:04d}',
        'name': f'{cat.lower()}_{np.random.choice(["report", "dataset", "log", "file", "table"])}_{i+1:04d}',
        'description': desc,
        'category': cat,
        'source': np.random.choice(sources),
        'data_type': np.random.choice(data_types),
        'sensitivity': np.random.choice(sensitivity_levels, p=[0.15, 0.35, 0.30, 0.20]),
        'owner': np.random.choice(['alice', 'bob', 'carol', 'dave', 'eve', None], p=[0.2, 0.2, 0.2, 0.2, 0.15, 0.05]),
        'row_count': np.random.randint(100, 1_000_000) if np.random.random() > 0.2 else None,
        'last_updated': pd.Timestamp('2023-01-01') + pd.Timedelta(days=int(np.random.randint(0, 730))),
    })

df = pd.DataFrame(records)
print(f"Generated {len(df)} data asset records")
df.head(10)

## Exercise 1.1: Data Profiling

Explore the data asset catalogue. Compute summary statistics, counts by source/type/category, and identify missing values.

**Your Task:** Profile the dataset to understand its structure and quality.
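
The profiling steps can be tried first on a tiny frame. The sketch below is one possible approach, not the reference solution; the four-row `toy` DataFrame and its values are hypothetical stand-ins for the lab's `df`.

```python
import pandas as pd

# Hypothetical four-row stand-in for the lab's `df`
toy = pd.DataFrame({
    "category": ["HR", "Finance", "HR", "Legal"],
    "owner": ["alice", None, "bob", None],
    "row_count": [1200.0, None, 530.0, 88000.0],
})

# Shape and dtypes give a first view of the structure
print(toy.shape)
print(toy.dtypes)

# Frequency of each category value
category_counts = toy["category"].value_counts()
print(category_counts)

# Percentage of missing values per column
missing_pct = toy.isnull().sum() / len(toy) * 100
print(missing_pct.round(1))
```

The same calls should carry over to `df` unchanged; only the column names differ.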

In [None]:
# TODO: Print the shape and data types of the dataset
pass

In [None]:
# TODO: Show value counts for category, source, and data_type columns
pass

In [None]:
# TODO: Check for missing values and compute the percentage missing per column
# Hint: Use df.isnull().sum() and divide by len(df)
pass

In [None]:
# TODO: Show the distribution of sensitivity levels
# Hint: Use df['sensitivity'].value_counts() and plot as a bar chart
pass

## Exercise 1.2: Text Classification with TF-IDF + RandomForest

Build a classifier that predicts the category of a data asset based on its text description.

**Your Task:** Implement the TF-IDF + RandomForest pipeline and evaluate its performance.
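
As a warm-up, here is a minimal sketch of the same pipeline on a toy corpus. The twelve descriptions, the two labels, and the variable names are assumptions made for illustration; the exercise itself uses `df['description']` and `df['category']` with the parameters listed in the docstring below.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: two categories, six short descriptions each
texts = [
    "payroll salary records", "employee salary review", "employee payroll tax",
    "salary benefits employee", "employee onboarding payroll", "payroll benefits records",
    "server error logs", "api server metrics", "build pipeline logs",
    "server build metrics", "api error logs", "pipeline server logs",
]
labels = ["HR"] * 6 + ["Engineering"] * 6

# Turn each description into a sparse TF-IDF vector
vectoriser = TfidfVectorizer(stop_words="english")
X = vectoriser.fit_transform(texts)

# Hold out a stratified test set so both classes appear in it
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42, stratify=labels
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out descriptions only
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

Note that the vectoriser here is fitted before the split, as the exercise steps also do. That keeps the code short, but it lets test-set vocabulary influence the IDF weights, so the scores may look slightly better than they would on truly unseen text.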

In [None]:
def build_classifier(df):
    """Build a TF-IDF + RandomForest text classifier.
    
    Steps:
    1. Vectorise descriptions with TfidfVectorizer (max_features=1000, stop_words='english')
    2. Train/test split (test_size=0.2, random_state=42)
    3. Train RandomForestClassifier (n_estimators=100, random_state=42)
    4. Print classification_report on test set
    
    Returns: (tfidf, clf, X_tfidf) tuple
    """
    # TODO: Create TfidfVectorizer and fit_transform on descriptions
    # TODO: Split into train/test
    # TODO: Train RandomForestClassifier
    # TODO: Print classification_report
    # TODO: Return (tfidf, clf, X_tfidf)
    pass

result = build_classifier(df)
if result:
    tfidf, clf, X_tfidf = result

## Exercise 2.1: Metadata Extraction

Extract business terms from data asset descriptions using regex patterns.

**Your Task:** Write a function that extracts key business terms (e.g., 'salary', 'revenue', 'customer', 'compliance') from descriptions.
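
The sketch below shows one way to do case-insensitive term matching with word boundaries. It uses only a four-term subset, and the names `TERMS`, `PATTERN`, and `find_terms` are hypothetical; treat it as an illustration of the regex technique, not the expected solution.

```python
import re
from collections import Counter

# Illustrative subset of terms; the exercise asks for a longer list
TERMS = ["salary", "payroll", "customer", "audit"]

# \b word boundaries stop 'audit' from matching inside 'auditor'
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, TERMS)) + r")\b", re.IGNORECASE
)

def find_terms(text):
    """Return the distinct matched terms, lowercased, in order of first appearance."""
    seen = []
    for match in PATTERN.findall(text):
        term = match.lower()
        if term not in seen:
            seen.append(term)
    return seen

samples = [
    "Payroll data with salary deductions and tax withholdings",
    "Customer billing records including credit card transactions",
    "Audit trail logs for financial transaction approvals",
]
found = [find_terms(s) for s in samples]

# Count how often each term occurs across all samples
counts = Counter(term for terms in found for term in terms)
print(found)
print(counts.most_common(3))
```

Strict word boundaries also mean that plurals such as "customers" are not matched. Whether that is desirable is a design choice; dropping the trailing `\b` would accept them at the cost of more false positives.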

In [None]:
def extract_business_terms(text):
    """Extract business terms from a data asset description.
    
    Use regex to find known business terms in the text.
    Terms to detect: salary, revenue, customer, employee, invoice,
    compliance, contract, performance, billing, payroll, budget,
    marketing, security, legal, audit
    
    Returns: list of matched terms (lowercase)
    """
    # TODO: Define a list of business terms
    # TODO: Use regex to find each term in the text (case-insensitive)
    # TODO: Return list of found terms
    pass

# TODO: Apply to all descriptions and add as a new column 'business_terms'
# TODO: Show the 10 most common business terms across all assets
pass

## Exercise 2.2: Unsupervised Discovery with Clustering

Use KMeans clustering on TF-IDF vectors to discover natural groupings, then visualise with PCA.

**Your Task:** Cluster the data assets and create a 2D scatter plot coloured by cluster.

In [None]:
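def cluster_and_visualise(X_tfidf, df, tfidf, n_clusters=5):
    """Cluster data assets using KMeans and visualise with PCA.
    
    Steps:
    1. Fit KMeans with n_clusters on X_tfidf
    2. Reduce to 2D with PCA
    3. Create scatter plot coloured by cluster
    4. Print top terms per cluster (term names come from the fitted tfidf vectoriser)
    
    Returns: cluster labels array
    """
    # TODO: Fit KMeans on X_tfidf
    # TODO: Reduce to 2D with PCA
    # Hint: Older scikit-learn versions need a dense array, i.e. X_tfidf.toarray()
    # TODO: Create scatter plot (figsize 12, 8)
    # TODO: Print top terms per cluster
    # Hint: Use kmeans.cluster_centers_ and tfidf.get_feature_names_out()
    # TODO: Return cluster labels
    pass

if result:
    clusters = cluster_and_visualise(X_tfidf, df, tfidf)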

## Exercise 2.3: Vector Catalogue with Semantic Search

Build a vector catalogue using SentenceTransformer embeddings and ChromaDB, then perform semantic searches.

**Your Task:** Create embeddings, store in ChromaDB, and query with natural language.
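
Before wiring up the real components, it may help to see what a semantic search does underneath: rank stored vectors by their distance to a query vector. The sketch below is a plain NumPy stand-in, not SentenceTransformer or ChromaDB. The 3-dimensional vectors are made up for illustration (all-MiniLM-L6-v2 produces 384 dimensions), and `cosine_distance` is a hypothetical helper.

```python
import numpy as np

# Made-up 3-dimensional "embeddings" standing in for real model output
documents = [
    "Payroll data with salary deductions",
    "Application server logs with error traces",
    "Customer billing records",
]
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.6, 0.7, 0.1],
])
query_vector = np.array([0.8, 0.3, 0.0])  # pretend embedding of "employee pay"

def cosine_distance(matrix, vector):
    """Return 1 - cosine similarity between each row of `matrix` and `vector`."""
    sims = matrix @ vector / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    )
    return 1.0 - sims

distances = cosine_distance(doc_vectors, query_vector)
ranking = np.argsort(distances)  # smallest distance first

# Show the two closest documents
for idx in ranking[:2]:
    print(f"{distances[idx]:.3f}  {documents[idx]}")
```

A vector database performs the same ranking at scale with an index instead of a full scan. The distance values ChromaDB reports depend on how the collection is configured, so they may not match cosine distance exactly.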

In [None]:
def build_vector_catalogue(df):
    """Build a vector catalogue with ChromaDB.
    
    Steps:
    1. Load SentenceTransformer('all-MiniLM-L6-v2')
    2. Encode all descriptions
    3. Create a ChromaDB collection called 'data_catalogue'
    4. Add embeddings with documents and IDs
    
    Returns: (collection, model) tuple
    """
    # TODO: Load SentenceTransformer model
    # TODO: Encode descriptions
    # TODO: Create ChromaDB client and collection
    # TODO: Add embeddings, documents, and IDs
    # TODO: Return (collection, model)
    pass

catalogue_result = build_vector_catalogue(df)

In [None]:
def semantic_search(collection, model, queries, n_results=5):
    """Perform semantic searches against the vector catalogue.
    
    For each query string, encode it with the same model used for the
    documents, retrieve the top n_results matches, and print them
    with their distances.
    """
    # TODO: For each query, encode it with model.encode() and call
    #       collection.query(query_embeddings=..., n_results=n_results)
    # TODO: Print the query and top results with distances
    pass

# Test queries
test_queries = [
    "customer financial transactions",
    "employee personal information",
    "software development metrics",
]

if catalogue_result:
    collection, model = catalogue_result
    semantic_search(collection, model, test_queries)

## Summary

In this lab, you learned how to:

1. **Profile** synthetic data asset catalogues and identify quality issues
2. **Classify** data assets using TF-IDF + RandomForest text classification
3. **Extract** business metadata from descriptions using regex patterns
4. **Cluster** data assets with KMeans to discover natural groupings
5. **Build** a vector catalogue with semantic search using ChromaDB

---

*Data Discovery: Harnessing AI, AGI & Vector Databases | AI Elevate*