# Basic Data Exploration, Manipulation & Visualization

This notebook demonstrates fundamental data science workflows using:
- Scikit-learn datasets
- Synthetic data generation
- Data exploration techniques
- Data manipulation with pandas
- Visualization with matplotlib and seaborn

## Environment Setup for Positron/VS Code

To use this notebook with the project's virtual environment in **Positron** or **VS Code**:

### Option 1: VS Code / Positron - Select Kernel
1. Open this notebook in VS Code or Positron
2. Click on the kernel selector in the top-right corner (or press `Ctrl/Cmd + Shift + P` and search "Select Kernel")
3. Choose **"Python Environments..."**
4. Select the `.venv` interpreter from this project: `.venv/bin/python` (on Windows: `.venv\Scripts\python.exe`)
5. If the `.venv` doesn't appear, you may need to:
   - Run `source .venv/bin/activate` in the terminal first (on Windows: `.venv\Scripts\activate`)
   - Reload the window (`Ctrl/Cmd + Shift + P` → "Reload Window")

### Option 2: Command Line
If you prefer to run Jupyter from the terminal:
```bash
# From the project root directory
source .venv/bin/activate
jupyter lab notebooks/00_basic_data_exploration.ipynb
```

### Verify Setup
Run the first code cell below. If it executes without errors, your environment is correctly configured!
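
As an optional extra check, the sketch below reports which interpreter the kernel is running and whether it sits inside a virtual environment. It uses only the standard library, and the variable names are illustrative.

```python
import sys

# Path of the Python interpreter backing the current kernel
interpreter_path = sys.executable

# Inside a venv, sys.prefix points at the environment while
# sys.base_prefix points at the base Python installation
in_virtual_env = sys.prefix != sys.base_prefix

print(f"Interpreter: {interpreter_path}")
print(f"Running inside a virtual environment: {in_virtual_env}")
```

If the printed path does not point into the project's `.venv` directory, the wrong kernel is probably selected.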

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.datasets import make_classification, make_regression, make_blobs

# Set style for better-looking plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("Libraries loaded successfully!")

## Part 1: Exploring Scikit-learn Datasets

### 1.1 Wine Dataset

In [None]:
# Load the wine dataset
wine = datasets.load_wine()

# Convert to DataFrame
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target
wine_df['target_name'] = wine_df['target'].map({0: 'class_0', 1: 'class_1', 2: 'class_2'})

print("Wine Dataset Loaded")
print(f"Shape: {wine_df.shape}")
print(f"\nFeatures: {wine.feature_names[:5]}...")
wine_df.head()
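
The cell above builds the DataFrame by hand. As an alternative sketch, assuming scikit-learn 0.23 or newer, `load_wine(as_frame=True)` returns pandas objects directly, and the class names can be taken from the dataset instead of being hard-coded. The names `wine_bunch` and `wine_frame` are illustrative.

```python
from sklearn.datasets import load_wine

# as_frame=True returns pandas objects; .frame holds the features plus a 'target' column
wine_bunch = load_wine(as_frame=True)
wine_frame = wine_bunch.frame

# Build the label mapping from the names shipped with the dataset
label_names = {i: str(name) for i, name in enumerate(wine_bunch.target_names)}
wine_frame['target_name'] = wine_frame['target'].map(label_names)

print(wine_frame.shape)
print(sorted(wine_frame['target_name'].unique()))
```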

### Basic Exploration

In [None]:
# Dataset info
print("Dataset Info:")
wine_df.info()  # info() prints its report directly and returns None
print("\n" + "="*50 + "\n")

# Summary statistics
print("Summary Statistics:")
wine_df.describe()

In [None]:
# Check for missing values
print("Missing Values:")
print(wine_df.isnull().sum())
print("\n" + "="*50 + "\n")

# Target distribution
print("Target Distribution:")
print(wine_df['target_name'].value_counts())
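
Raw counts are easier to compare as proportions. The following sketch reloads the wine labels so it runs on its own and uses `value_counts(normalize=True)`; the names `labels` and `class_share` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine_data = load_wine()
labels = pd.Series(wine_data.target).map(lambda i: str(wine_data.target_names[i]))

# normalize=True returns the fraction of rows per class instead of raw counts
class_share = labels.value_counts(normalize=True).round(3)
print(class_share)
```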

### Data Visualization - Wine Dataset

In [None]:
# Distribution of alcohol content by wine class
# The wine data has three classes, so one subplot is needed per class
fig, axes = plt.subplots(1, 3, figsize=(14, 5))

# Histogram
wine_df.hist(column='alcohol', by='target_name', ax=axes, bins=15, edgecolor='black', alpha=0.7)
plt.suptitle('Alcohol Content Distribution by Wine Class', y=1.02, fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Box plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=wine_df, x='target_name', y='alcohol', palette='Set2')
plt.title('Alcohol Content by Wine Class', fontsize=14, fontweight='bold')
plt.xlabel('Wine Class')
plt.ylabel('Alcohol Content')
plt.show()

In [None]:
# Scatter plot: Alcohol vs Color Intensity
plt.figure(figsize=(10, 6))
scatter = sns.scatterplot(data=wine_df, x='alcohol', y='color_intensity', 
                          hue='target_name', palette='viridis', s=100, alpha=0.7)
plt.title('Alcohol vs Color Intensity', fontsize=14, fontweight='bold')
plt.xlabel('Alcohol Content')
plt.ylabel('Color Intensity')
plt.legend(title='Wine Class')
plt.show()

In [None]:
# Correlation heatmap (select subset of features for readability)
features_to_plot = ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols']
plt.figure(figsize=(10, 8))
sns.heatmap(wine_df[features_to_plot].corr(), annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, fmt='.2f')
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
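
A heatmap shows the overall pattern, but a ranked list can be easier to scan. The sketch below, which is one possible approach rather than part of the original workflow, extracts the most strongly correlated feature pairs across all 13 wine features. It assumes scikit-learn 0.23 or newer for `as_frame=True`.

```python
import numpy as np
from sklearn.datasets import load_wine

wine_features = load_wine(as_frame=True).data
corr = wine_features.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna()

# Order pairs by absolute correlation, strongest first
order = pairs.abs().sort_values(ascending=False).index
top_pairs = pairs.reindex(order).head(5)
print(top_pairs.round(2))
```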

### Data Manipulation - Wine Dataset

In [None]:
# Filter: High alcohol wines (>13%)
high_alcohol = wine_df[wine_df['alcohol'] > 13]
print(f"High Alcohol Wines: {len(high_alcohol)} out of {len(wine_df)} ({len(high_alcohol)/len(wine_df)*100:.1f}%)")
high_alcohol.head()
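
Filters can also combine several conditions. The sketch below shows two equivalent ways to do so, a boolean mask and `DataFrame.query`. The second threshold (`color_intensity > 5`) is an arbitrary value chosen for illustration only.

```python
from sklearn.datasets import load_wine

wine_frame = load_wine(as_frame=True).frame

# Boolean masks: parenthesise each condition and combine with & (and) or | (or)
mask = (wine_frame['alcohol'] > 13) & (wine_frame['color_intensity'] > 5)
strong_wines = wine_frame[mask]

# The same filter written as a query string
strong_wines_query = wine_frame.query('alcohol > 13 and color_intensity > 5')

print(len(strong_wines), len(strong_wines_query))
```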

In [None]:
# Group by wine class and compute the mean and standard deviation of selected features
wine_by_class = wine_df.groupby('target_name').agg({
    'alcohol': ['mean', 'std'],
    'malic_acid': ['mean', 'std'],
    'total_phenols': ['mean', 'std'],
    'color_intensity': ['mean', 'std']
}).round(2)

print("Feature Mean and Standard Deviation by Wine Class:")
wine_by_class
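
Passing a dictionary of lists to `agg` produces two-level column labels. When flat column names are preferred, pandas named aggregation is one option. The output column names below are illustrative.

```python
from sklearn.datasets import load_wine

wine_frame = load_wine(as_frame=True).frame

# Named aggregation: output_column=(input_column, function)
flat_summary = wine_frame.groupby('target').agg(
    alcohol_mean=('alcohol', 'mean'),
    alcohol_std=('alcohol', 'std'),
    phenols_mean=('total_phenols', 'mean'),
).round(2)

print(flat_summary)
```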

In [None]:
# Create new features
wine_df['phenol_intensity_ratio'] = wine_df['total_phenols'] / wine_df['color_intensity']
wine_df['alcohol_category'] = pd.cut(wine_df['alcohol'], 
                                      bins=[0, 12, 13, 15], 
                                      labels=['Low', 'Medium', 'High'])

print("New Features Created:")
print(wine_df[['alcohol', 'alcohol_category', 'phenol_intensity_ratio']].head(10))
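
It helps to know how `pd.cut` treats values that fall exactly on a bin edge. The small sketch below uses the same bins and labels on a handful of made-up values (the sample numbers are illustrative, not taken from the dataset).

```python
import pandas as pd

sample_alcohol = pd.Series([11.5, 12.0, 12.01, 13.0, 13.5])

# Intervals are right-closed by default: (0, 12], (12, 13], (13, 15]
categories = pd.cut(sample_alcohol, bins=[0, 12, 13, 15], labels=['Low', 'Medium', 'High'])
print(categories.tolist())

# Values outside the outermost edges are not assigned to any bin and become NaN
out_of_range = pd.cut(pd.Series([15.5]), bins=[0, 12, 13, 15], labels=['Low', 'Medium', 'High'])
print(out_of_range.isna().tolist())
```

A value of exactly 13.0 therefore lands in `Medium`, which matches the `> 13` filter used for high-alcohol wines earlier.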

## Part 2: Synthetic Data Generation

### 2.1 Classification Dataset

In [None]:
# Generate synthetic classification data
X_class, y_class = make_classification(
    n_samples=500,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42
)

class_df = pd.DataFrame(X_class, columns=['feature_1', 'feature_2'])
class_df['target'] = y_class

print(f"Classification Dataset Shape: {class_df.shape}")
class_df.head()
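
Before plotting, it can be worth confirming how the samples are split between classes. This sketch regenerates the data with the same parameters and counts the labels with `np.bincount`; the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification

X_demo, y_demo = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42,
)

# np.bincount counts how often each non-negative integer label occurs
class_counts = np.bincount(y_demo)
print(class_counts.tolist())
```

With the default `weights=None`, `make_classification` aims for roughly balanced classes.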

In [None]:
# Visualize synthetic classification data
plt.figure(figsize=(10, 6))
scatter = plt.scatter(class_df['feature_1'], class_df['feature_2'], 
                     c=class_df['target'], cmap='coolwarm', alpha=0.6, s=50)
plt.colorbar(scatter, label='Target Class')
plt.title('Synthetic Classification Dataset', fontsize=14, fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, alpha=0.3)
plt.show()

### 2.2 Regression Dataset

In [None]:
# Generate synthetic regression data
X_reg, y_reg = make_regression(
    n_samples=300,
    n_features=1,
    noise=20,
    random_state=42
)

reg_df = pd.DataFrame(X_reg, columns=['feature'])
reg_df['target'] = y_reg

print(f"Regression Dataset Shape: {reg_df.shape}")
print("\nTarget Statistics:")
print(reg_df['target'].describe())
reg_df.head()
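
Because the data is synthetic, the underlying linear relationship is known. As a rough sketch, `make_regression(coef=True)` also returns the true coefficient, which can be compared with a simple least-squares fit from `np.polyfit`. The `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True additionally returns the coefficient of the underlying linear model
X_demo, y_demo, true_coef = make_regression(
    n_samples=300, n_features=1, noise=20, coef=True, random_state=42
)
true_slope = float(np.ravel(true_coef)[0])

# Degree-1 least-squares fit: polyfit returns [slope, intercept]
fitted_slope, fitted_intercept = np.polyfit(X_demo[:, 0], y_demo, deg=1)

print(f"True coefficient: {true_slope:.2f}")
print(f"Fitted slope:     {fitted_slope:.2f}")
```

The two values should be close but not identical, since `noise=20` adds Gaussian noise to the target.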

In [None]:
# Visualize synthetic regression data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot
axes[0].scatter(reg_df['feature'], reg_df['target'], alpha=0.5, s=30)
axes[0].set_title('Feature vs Target', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Feature')
axes[0].set_ylabel('Target')
axes[0].grid(True, alpha=0.3)

# Distribution of target
axes[1].hist(reg_df['target'], bins=30, edgecolor='black', alpha=0.7, color='skyblue')
axes[1].set_title('Target Distribution', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Target Value')
axes[1].set_ylabel('Frequency')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

### 2.3 Clustering Dataset (Blobs)

In [None]:
# Generate synthetic clustering data
X_blobs, y_blobs = make_blobs(
    n_samples=400,
    n_features=2,
    centers=4,
    cluster_std=1.0,
    random_state=42
)

blobs_df = pd.DataFrame(X_blobs, columns=['x', 'y'])
blobs_df['cluster'] = y_blobs

print(f"Clustering Dataset Shape: {blobs_df.shape}")
print("\nCluster Distribution:")
print(blobs_df['cluster'].value_counts().sort_index())
blobs_df.head()
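
A quick sanity check on the generated blobs is to compare each cluster's empirical centroid with the centre it was sampled around. The sketch below assumes scikit-learn 0.23 or newer, where `make_blobs` accepts `return_centers=True`; the `_demo` names are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# return_centers=True also returns the true centre of each blob
X_demo, y_demo, true_centers = make_blobs(
    n_samples=400, n_features=2, centers=4, cluster_std=1.0,
    random_state=42, return_centers=True,
)

demo_df = pd.DataFrame(X_demo, columns=['x', 'y'])
demo_df['cluster'] = y_demo

# Empirical centroid of each cluster: the mean of its points
centroids = demo_df.groupby('cluster')[['x', 'y']].mean()

print(centroids.round(2))
print(true_centers.round(2))
```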

In [None]:
# Visualize synthetic clustering data
plt.figure(figsize=(10, 8))
scatter = plt.scatter(blobs_df['x'], blobs_df['y'], 
                     c=blobs_df['cluster'], cmap='viridis', 
                     alpha=0.6, s=50, edgecolor='black', linewidth=0.5)
plt.colorbar(scatter, label='Cluster ID')
plt.title('Synthetic Clustering Dataset (Blobs)', fontsize=14, fontweight='bold')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.grid(True, alpha=0.3)
plt.show()

## Part 3: Time Series Synthetic Data

In [None]:
# Generate synthetic time series data
np.random.seed(42)
date_range = pd.date_range(start='2023-01-01', end='2024-12-31', freq='D')
n_days = len(date_range)

# Create trend + seasonality + noise
trend = np.linspace(100, 200, n_days)
seasonality = 20 * np.sin(2 * np.pi * np.arange(n_days) / 365)
noise = np.random.normal(0, 10, n_days)
values = trend + seasonality + noise

ts_df = pd.DataFrame({
    'date': date_range,
    'value': values
})
ts_df['month'] = ts_df['date'].dt.month
ts_df['year'] = ts_df['date'].dt.year

print(f"Time Series Dataset Shape: {ts_df.shape}")
ts_df.head()

In [None]:
# Visualize time series
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Full time series
axes[0].plot(ts_df['date'], ts_df['value'], linewidth=1, alpha=0.7)
axes[0].set_title('Synthetic Time Series (Trend + Seasonality + Noise)', 
                  fontsize=12, fontweight='bold')
axes[0].set_xlabel('Date')
axes[0].set_ylabel('Value')
axes[0].grid(True, alpha=0.3)

# Monthly average by year
monthly_avg = ts_df.groupby(['year', 'month'])['value'].mean().reset_index()
for year in monthly_avg['year'].unique():
    year_data = monthly_avg[monthly_avg['year'] == year]
    axes[1].plot(year_data['month'], year_data['value'], marker='o', label=f'Year {year}')

axes[1].set_title('Monthly Average by Year', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Month')
axes[1].set_ylabel('Average Value')
axes[1].set_xticks(range(1, 13))
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
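
Grouping on separate `year` and `month` columns works, but a datetime index allows the same monthly averaging through `resample`. The sketch below uses stand-in random values (not the series generated above) purely to show the call.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2023-01-01', end='2024-12-31', freq='D')
rng = np.random.default_rng(42)  # seeded generator, used only for the demo values
demo_ts = pd.Series(rng.normal(150, 10, len(dates)), index=dates)

# 'MS' groups by calendar month and labels each group with the month's first day
monthly_mean = demo_ts.resample('MS').mean()

print(len(dates), len(monthly_mean))
print(monthly_mean.head(3).round(2))
```

The date range covers 731 days because 2024 is a leap year, giving 24 monthly averages.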

## Part 4: Data Manipulation Examples

In [None]:
# Rolling statistics on time series
ts_df['rolling_mean_7d'] = ts_df['value'].rolling(window=7).mean()
ts_df['rolling_mean_30d'] = ts_df['value'].rolling(window=30).mean()

plt.figure(figsize=(14, 6))
plt.plot(ts_df['date'], ts_df['value'], alpha=0.3, label='Original', linewidth=1)
plt.plot(ts_df['date'], ts_df['rolling_mean_7d'], label='7-Day Rolling Mean', linewidth=2)
plt.plot(ts_df['date'], ts_df['rolling_mean_30d'], label='30-Day Rolling Mean', linewidth=2)
plt.title('Time Series with Rolling Averages', fontsize=14, fontweight='bold')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
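
The rolling-mean lines start later than the original series because a full window is required by default. The toy example below (values 1 to 10, chosen for illustration) shows the effect and how `min_periods` changes it.

```python
import pandas as pd

demo_values = pd.Series(range(1, 11), dtype=float)

# Default: a full window is required, so the first window-1 results are NaN
strict_mean = demo_values.rolling(window=7).mean()

# min_periods=1 averages whatever observations are available so far
partial_mean = demo_values.rolling(window=7, min_periods=1).mean()

print(strict_mean.isna().sum(), partial_mean.isna().sum())
```

Early values computed with `min_periods=1` are based on fewer observations, so they are noisier than the rest of the smoothed series.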

In [None]:
# Aggregation: monthly summary statistics for each year
monthly_stats = ts_df.groupby(['year', 'month'])['value'].agg(
    ['mean', 'std', 'min', 'max']
).round(2)

print("Monthly Statistics:")
monthly_stats.head(10)
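
The same kind of monthly summary can be laid out as a pivot table, with one row per year and one column per month. This is a minimal sketch on placeholder values (a plain linear ramp), not the notebook's generated series.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2023-01-01', end='2024-12-31', freq='D')
demo_df = pd.DataFrame({
    'value': np.linspace(100, 200, len(dates)),  # placeholder values for the demo
    'year': dates.year,
    'month': dates.month,
})

# Rows = year, columns = month, cells = mean value for that year and month
monthly_pivot = demo_df.pivot_table(
    index='year', columns='month', values='value', aggfunc='mean'
)

print(monthly_pivot.round(1))
```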

## Part 5: Comparative Visualization

In [None]:
# Create a comparison of distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Wine alcohol distribution
axes[0, 0].hist(wine_df['alcohol'], bins=20, edgecolor='black', alpha=0.7, color='coral')
axes[0, 0].set_title('Wine: Alcohol Distribution', fontweight='bold')
axes[0, 0].set_xlabel('Alcohol Content')
axes[0, 0].set_ylabel('Frequency')

# Classification features
axes[0, 1].scatter(class_df['feature_1'], class_df['feature_2'], 
                   c=class_df['target'], cmap='coolwarm', alpha=0.5, s=20)
axes[0, 1].set_title('Synthetic: Classification Features', fontweight='bold')
axes[0, 1].set_xlabel('Feature 1')
axes[0, 1].set_ylabel('Feature 2')

# Regression relationship
axes[1, 0].scatter(reg_df['feature'], reg_df['target'], alpha=0.5, s=20, color='green')
axes[1, 0].set_title('Synthetic: Regression Relationship', fontweight='bold')
axes[1, 0].set_xlabel('Feature')
axes[1, 0].set_ylabel('Target')

# Time series trend
axes[1, 1].plot(ts_df['date'], ts_df['rolling_mean_30d'], linewidth=2, color='navy')
axes[1, 1].set_title('Synthetic: Time Series Trend', fontweight='bold')
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Value (30-Day MA)')

plt.tight_layout()
plt.show()

## Summary

This notebook demonstrated:

1. **Dataset Exploration**
   - Loading scikit-learn datasets (Wine)
   - Basic statistics and info methods
   - Checking for missing values
   - Understanding target distributions

2. **Data Visualization**
   - Histograms and box plots
   - Scatter plots with color encoding
   - Correlation heatmaps
   - Time series plots

3. **Data Manipulation**
   - Filtering rows based on conditions
   - Grouping and aggregation
   - Creating new features
   - Rolling statistics

4. **Synthetic Data**
   - Classification datasets
   - Regression datasets
   - Clustering datasets (blobs)
   - Time series with trend and seasonality