## Data Source Disclaimer
The dataset used in this capstone project has been synthetically generated and does not represent real vendor information or actual third-party risk assessments.

# Rationale for Using Synthetic Data
This project uses synthetic data for three reasons:
1) Data Privacy and Confidentiality: Real vendor risk assessments contain highly sensitive business information, financial data, and proprietary security details that cannot be shared publicly or used in academic projects without violating confidentiality agreements.
2) Regulatory Compliance: Actual third-party risk data often includes personally identifiable information (PII) and falls under strict data protection regulations (GDPR, CCPA) that prohibit unauthorized use or disclosure.
3) Competitive Sensitivity: Genuine vendor risk scores, security incident histories, and compliance ratings are commercially sensitive information that companies protect as trade secrets.

# Dataset Overview
This synthetic dataset contains 1,000 third-party vendor profiles designed to simulate real-world vendor risk assessment scenarios for machine learning model development.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

In [2]:
np.random.seed(42)
n_vendors = 1000

# Privacy & Compliance Features
gdpr_compliance_score = np.random.normal(75, 15, n_vendors)
gdpr_compliance_score = np.clip(gdpr_compliance_score, 0, 100)

consent_management_rating = np.random.choice(['Poor', 'Fair', 'Good', 'Excellent'], 
                                           n_vendors, p=[0.15, 0.25, 0.35, 0.25])

data_subject_rights_score = np.random.normal(70, 20, n_vendors)
data_subject_rights_score = np.clip(data_subject_rights_score, 0, 100)

# Data Quality Features
data_completeness_pct = np.random.normal(85, 12, n_vendors)
data_completeness_pct = np.clip(data_completeness_pct, 50, 100)

data_accuracy_score = np.random.normal(82, 15, n_vendors)
data_accuracy_score = np.clip(data_accuracy_score, 40, 100)

bias_detection_capability = np.random.choice(['None', 'Basic', 'Advanced', 'Expert'], 
                                           n_vendors, p=[0.2, 0.3, 0.35, 0.15])

# Security Features
encryption_level = np.random.choice(['Basic', 'Standard', 'Advanced', 'Military'], 
                                   n_vendors, p=[0.1, 0.3, 0.5, 0.1])

access_control_rating = np.random.normal(7.5, 1.5, n_vendors)
access_control_rating = np.clip(access_control_rating, 1, 10)

security_incidents_last_year = np.random.poisson(1.2, n_vendors)

# Vendor Reliability Features
financial_stability_score = np.random.normal(75, 18, n_vendors)
financial_stability_score = np.clip(financial_stability_score, 20, 100)

sla_performance_pct = np.random.normal(95, 8, n_vendors)
sla_performance_pct = np.clip(sla_performance_pct, 70, 100)

years_in_business = np.random.exponential(8, n_vendors)
years_in_business = np.clip(years_in_business, 1, 50)

vendor_size = np.random.choice(['Startup', 'Small', 'Medium', 'Large', 'Enterprise'], 
                              n_vendors, p=[0.15, 0.25, 0.3, 0.2, 0.1])
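
A note on the clipping used above: `np.clip` does not discard out-of-range draws, it moves them onto the boundary. For `sla_performance_pct` (mean 95, standard deviation 8, clipped to [70, 100]) that boundary mass is noticeable. The sketch below estimates it with a separate, locally seeded generator (`rng` and the sample size of 10,000 are assumptions made for illustration, chosen so the notebook's global `np.random.seed(42)` stream is left untouched) and compares the result with the normal tail probability.

```python
import math
import numpy as np

# Local generator: illustrative only, does not touch the global NumPy seed.
rng = np.random.default_rng(0)

mean, std, upper = 95.0, 8.0, 100.0  # same parameters as sla_performance_pct
sample = np.clip(rng.normal(mean, std, 10_000), 70, upper)

# Share of draws that sit exactly on the upper bound after clipping.
empirical_share = float(np.mean(sample == upper))

# Theoretical tail mass P(X > upper) for X ~ Normal(mean, std).
z = (upper - mean) / std
analytic_share = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

print(f"empirical share at {upper:.0f}: {empirical_share:.3f}")
print(f"analytic tail probability: {analytic_share:.3f}")
```

Roughly a quarter of the simulated vendors end up with an SLA value of exactly 100, which is worth keeping in mind when reading histograms or summary statistics for this column.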


In [3]:
# Create risk score based on weighted features
risk_score = (
    (100 - gdpr_compliance_score) * 0.15 +
    (100 - data_subject_rights_score) * 0.10 +
    (100 - data_completeness_pct) * 0.12 +
    (100 - data_accuracy_score) * 0.13 +
    security_incidents_last_year * 8 +
    (100 - financial_stability_score) * 0.15 +
    (100 - sla_performance_pct) * 0.10 +
    np.maximum(0, 10 - years_in_business) * 2
)

# Add Gaussian noise so the score is not a deterministic function of the features
risk_score += np.random.normal(0, 5, n_vendors)
risk_score = np.clip(risk_score, 0, 100)

# Create risk categories; include_lowest=True keeps a score of exactly 0 in the 'Low' bin
risk_level = pd.cut(risk_score, bins=[0, 30, 60, 100], labels=['Low', 'Medium', 'High'],
                    include_lowest=True)

# Create DataFrame
dataset = pd.DataFrame({
    'vendor_id': range(1, n_vendors + 1),
    'gdpr_compliance_score': np.round(gdpr_compliance_score, 1),
    'consent_management_rating': consent_management_rating,
    'data_subject_rights_score': np.round(data_subject_rights_score, 1),
    'data_completeness_pct': np.round(data_completeness_pct, 1),
    'data_accuracy_score': np.round(data_accuracy_score, 1),
    'bias_detection_capability': bias_detection_capability,
    'encryption_level': encryption_level,
    'access_control_rating': np.round(access_control_rating, 1),
    'security_incidents_last_year': security_incidents_last_year,
    'financial_stability_score': np.round(financial_stability_score, 1),
    'sla_performance_pct': np.round(sla_performance_pct, 1),
    'years_in_business': np.round(years_in_business, 1),
    'vendor_size': vendor_size,
    'risk_score': np.round(risk_score, 2),
    'risk_level': risk_level
})
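
To make the weighting in the risk formula concrete, here is a small worked example for a single vendor. The weights are copied from the cell above; the helper name `deterministic_risk` and the vendor's feature values are hypothetical, and the random noise term is deliberately left out so the arithmetic can be followed by hand.

```python
import pandas as pd

def deterministic_risk(gdpr, rights, completeness, accuracy,
                       incidents, financial, sla, years):
    """Noise-free part of the weighted risk formula (hypothetical helper)."""
    return (
        (100 - gdpr) * 0.15
        + (100 - rights) * 0.10
        + (100 - completeness) * 0.12
        + (100 - accuracy) * 0.13
        + incidents * 8
        + (100 - financial) * 0.15
        + (100 - sla) * 0.10
        + max(0, 10 - years) * 2
    )

# Hypothetical vendor: values invented for illustration only.
score = deterministic_risk(gdpr=80, rights=70, completeness=90, accuracy=85,
                           incidents=1, financial=75, sla=95, years=6)

# Same bins and labels as the notebook; include_lowest=True is used here
# so that a score of exactly 0 would still fall into the first bin.
level = pd.cut([score], bins=[0, 30, 60, 100],
               labels=['Low', 'Medium', 'High'], include_lowest=True)[0]

print(round(score, 2), level)
```

In this example the single security incident and the four years short of a ten-year track record contribute 8 points each, more than half of the total. One incident weighs as much as a drop of about 53 points in `gdpr_compliance_score` (8 / 0.15).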


In [5]:
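# Save to CSV (create the output directory first so to_csv does not fail)
from pathlib import Path
Path('data').mkdir(parents=True, exist_ok=True)
dataset.to_csv('data/third_party_vendor_risk_dataset.csv', index=False)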
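
One caveat when reading the file back: depending on the pandas version, `read_csv` may interpret the literal string `None` (a valid `bias_detection_capability` category here) as a missing value by default. The sketch below shows a round trip on a tiny stand-in frame (the rows are hypothetical, and an in-memory buffer replaces the real file) using `keep_default_na=False` to keep the label as text.

```python
import io
import pandas as pd

# Tiny stand-in frame with the same category labels (hypothetical rows).
frame = pd.DataFrame({
    "vendor_id": [1, 2, 3],
    "bias_detection_capability": ["None", "Basic", "Expert"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# keep_default_na=False stops read_csv from converting strings such as
# "None" or "NA" into NaN while parsing.
reloaded = pd.read_csv(buffer, keep_default_na=False)

print(reloaded["bias_detection_capability"].tolist())
```

The same keyword can be passed when loading `data/third_party_vendor_risk_dataset.csv`. This dataset contains no genuinely missing values, so disabling the default NA markers should not hide anything.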

In [6]:
print("Dataset created successfully!")
print(f"Number of vendors: {len(dataset)}")
print(f"Number of features: {len(dataset.columns) - 2}")  # Excludes vendor_id and risk_level; the count still includes risk_score
print("File saved as: third_party_vendor_risk_dataset.csv")

print("\nDataset Overview:")
print(dataset.head(10))

print("\nRisk Level Distribution:")
print(dataset['risk_level'].value_counts())

print("\nDataset Summary Statistics:")
print(dataset.describe())

print("\n" + "="*80)
print("DATASET FEATURE DESCRIPTIONS")
print("="*80)
print("PRIVACY & COMPLIANCE FEATURES:")
print("• gdpr_compliance_score: GDPR compliance rating (0-100)")
print("• consent_management_rating: Quality of consent management (Poor/Fair/Good/Excellent)")
print("• data_subject_rights_score: Data subject rights handling score (0-100)")
print("\nDATA QUALITY FEATURES:")
print("• data_completeness_pct: Percentage of complete data records (50-100%)")
print("• data_accuracy_score: Data accuracy assessment score (40-100)")
print("• bias_detection_capability: Bias detection capabilities (None/Basic/Advanced/Expert)")
print("\nSECURITY FEATURES:")
print("• encryption_level: Data encryption level (Basic/Standard/Advanced/Military)")
print("• access_control_rating: Access control system rating (1-10)")
print("• security_incidents_last_year: Number of security incidents in past year")
print("\nVENDOR RELIABILITY FEATURES:")
print("• financial_stability_score: Financial stability assessment (20-100)")
print("• sla_performance_pct: SLA performance percentage (70-100%)")
print("• years_in_business: Number of years in operation")
print("• vendor_size: Company size category (Startup/Small/Medium/Large/Enterprise)")
print("\nTARGET VARIABLES:")
print("• risk_score: Continuous risk score (0-100, higher = more risky)")
print("• risk_level: Categorical risk level (Low/Medium/High)")
print("="*80)

Dataset created successfully!
Number of vendors: 1000
Number of features: 14
File saved as: third_party_vendor_risk_dataset.csv

Dataset Overview:
   vendor_id  gdpr_compliance_score consent_management_rating  \
0          1                   82.5                      Fair   
1          2                   72.9                      Poor   
2          3                   84.7                      Good   
3          4                   97.8                      Good   
4          5                   71.5                      Poor   
5          6                   71.5                 Excellent   
6          7                   98.7                      Poor   
7          8                   86.5                      Good   
8          9                   68.0                      Good   
9         10                   83.1                 Excellent   

   data_subject_rights_score  data_completeness_pct  data_accuracy_score  \
0                       63.8                   75.8          