# Generate Synthetic Classification Dataset

This notebook generates a synthetic binary classification dataset with:
- 10,000 samples
- 100 features (`feature_0` through `feature_99`)
- 1 binary label (0 or 1)

The dataset is written to a Databricks Delta table: `ryuta.ray.synthetic_data`
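
The table name above is a three-part identifier. As a minimal sketch, assuming it is read as `catalog.schema.table`, the name can be split and checked before any write so that a malformed name fails early. The helper `split_table_name` and the `TableName` type are hypothetical and are not used elsewhere in this notebook.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str


def split_table_name(full_name: str) -> TableName:
    """Split 'catalog.schema.table' into its parts (hypothetical helper).

    Assumes plain identifiers; backtick-quoted names containing dots
    are not handled.
    """
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected catalog.schema.table, got {full_name!r}")
    return TableName(*parts)


name = split_table_name("ryuta.ray.synthetic_data")
print(name.catalog, name.schema, name.table)
```

A name with a missing or empty part, such as `ray.synthetic_data`, raises `ValueError` instead of reaching the write step.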

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from pyspark.sql import SparkSession

In [None]:
# Seed NumPy's global RNG; make_classification is seeded separately via random_state
np.random.seed(42)

# Configuration
N_SAMPLES = 10000  # Number of samples
N_FEATURES = 100   # Number of features
N_INFORMATIVE = 50 # Number of informative features
N_REDUNDANT = 25   # Number of redundant features
N_CLASSES = 2      # Binary classification
CLASS_BALANCE = [0.6, 0.4]  # Slight class imbalance

In [None]:
# Generate synthetic classification dataset
print(f"Generating synthetic dataset with {N_SAMPLES} samples and {N_FEATURES} features...")

X, y = make_classification(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=N_INFORMATIVE,
    n_redundant=N_REDUNDANT,
    n_classes=N_CLASSES,
    weights=CLASS_BALANCE,
    flip_y=0.01,  # Add 1% label noise
    random_state=42
)

print("Dataset generated successfully!")
print(f"Feature matrix shape: {X.shape}")
print(f"Label vector shape: {y.shape}")
print(f"Class distribution: {np.bincount(y)}")

In [None]:
# Create a pandas DataFrame
# Feature columns named as feature_0, feature_1, ..., feature_99
feature_columns = [f'feature_{i}' for i in range(N_FEATURES)]
df = pd.DataFrame(X, columns=feature_columns)

# Add the binary label column
df['label'] = y

# Display basic statistics
print("\nDataset Info:")
print(f"Total rows: {len(df)}")
print(f"Total columns: {len(df.columns)}")
print("\nLabel distribution:")
print(df['label'].value_counts())

# Show first few rows
print("\nFirst few rows:")
df.head()

In [None]:
# Display summary statistics
print("Summary statistics for first 5 features and label:")
df[feature_columns[:5] + ['label']].describe()

In [None]:
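# Convert pandas DataFrame to Spark DataFrame
# Databricks predefines `spark`; getOrCreate() reuses that session or creates one elsewhere
spark = SparkSession.builder.getOrCreate()
print("Converting to Spark DataFrame...")
spark_df = spark.createDataFrame(df)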

# Show schema
print("\nSpark DataFrame Schema:")
spark_df.printSchema()

# Show first few rows
print("\nFirst few rows in Spark DataFrame:")
spark_df.show(5)

In [None]:
# Define the Delta table name
TABLE_NAME = "ryuta.ray.synthetic_data"

print(f"Writing dataset to Delta table: {TABLE_NAME}...")

# Write to Delta table (overwrite mode replaces any existing data in the table)
(
    spark_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(TABLE_NAME)
)

print(f"\nDataset successfully written to {TABLE_NAME}!")

In [None]:
# Verify the table was created and data was written
print(f"Verifying table {TABLE_NAME}...")

# Read from the Delta table
verify_df = spark.table(TABLE_NAME)

print(f"\nTable row count: {verify_df.count()}")
print("\nTable schema:")
verify_df.printSchema()

print("\nSample data from table:")
verify_df.show(5)

# Show label distribution in the table
print("\nLabel distribution in table:")
verify_df.groupBy('label').count().orderBy('label').show()

In [None]:
# Optional: Display table details
print(f"Table details for {TABLE_NAME}:")
spark.sql(f"DESCRIBE DETAIL {TABLE_NAME}").show(truncate=False)

## Summary

This notebook has:
1. Generated a synthetic dataset with 100 features and 1 binary label
2. Created 10,000 samples suitable for classification tasks
3. Written the data to the Delta table: `ryuta.ray.synthetic_data`

The dataset includes:
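- 50 informative features that carry the class signal
- 25 redundant features (random linear combinations of the informative features)
- 25 random noise features
- Slight class imbalance (roughly a 60/40 split)
- 1% label noise (`flip_y=0.01`) to make it more realistic

Note that `make_classification` shuffles the feature columns by default, so a column's index does not indicate whether it is informative, redundant, or noise.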
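
The figures in the list above follow from the configuration values. The sketch below reproduces the arithmetic in plain Python, assuming scikit-learn's documented behaviour: features that are not informative, redundant, or repeated are noise, and `flip_y` is the fraction of samples whose label is reassigned uniformly at random. `N_REPEATED = 0` is the library default and is not set in this notebook.

```python
# Values copied from the notebook's configuration cell.
N_SAMPLES = 10_000
N_FEATURES = 100
N_INFORMATIVE = 50
N_REDUNDANT = 25
N_REPEATED = 0  # assumption: make_classification default, not set in the notebook
CLASS_BALANCE = [0.6, 0.4]
FLIP_Y = 0.01

# Features that are neither informative, redundant, nor repeated are noise.
n_noise = N_FEATURES - N_INFORMATIVE - N_REDUNDANT - N_REPEATED

# A fraction flip_y of samples has its label redrawn uniformly at random,
# so each class keeps (1 - flip_y) of its share and gains flip_y / n_classes.
n_classes = len(CLASS_BALANCE)
expected_share = [w * (1 - FLIP_Y) + FLIP_Y / n_classes for w in CLASS_BALANCE]
expected_count = [round(N_SAMPLES * s) for s in expected_share]

print(f"Noise features: {n_noise}")
print(f"Expected class shares: {[round(s, 3) for s in expected_share]}")
print(f"Expected class counts: {expected_count}")
```

Because a redrawn label lands on its original class about half the time, only around 0.5% of labels actually change. The counts printed by `np.bincount(y)` should be close to these expectations but will not match them exactly.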

You can now use this table for machine learning classification tasks!
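
As one possible next step, the sketch below runs a stratified train/test split and fits a baseline model. It regenerates the data in memory with the same settings as a stand-in for reading the table, so it runs without a Spark session. The choice of logistic regression, the 80/20 split, and `max_iter=1000` are assumptions made for illustration and are not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for reading the table: regenerate the data with the notebook's settings.
# On Databricks this could instead start from spark.table(TABLE_NAME).toPandas().
X, y = make_classification(
    n_samples=10_000,
    n_features=100,
    n_informative=50,
    n_redundant=25,
    n_classes=2,
    weights=[0.6, 0.4],
    flip_y=0.01,
    random_state=42,
)

# Stratify on the label so both splits keep roughly the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline classifier (an assumption; any scikit-learn classifier would fit here).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(f"Train rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"Baseline test accuracy: {accuracy:.3f}")
```

Stratifying matters here because of the class imbalance: an unstratified split could leave the test set with a noticeably different class ratio than the training set.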