# Data Sampling - Smart Parking IoT System

## Overview
This notebook draws a 30% random sample from the 100-taxis SFpark dataset and saves it as `smart_parking_full.csv` for analysis.

- **Source Dataset**: `sfpark_filtered_136_247_100taxis.csv` (332MB, 5M+ rows)
- **Target**: Sample 30% of rows for manageable analysis
- **Output**: `smart_parking_full.csv`

In [2]:
# Import libraries
import pandas as pd
import numpy as np
from pathlib import Path

print("Libraries imported successfully!")

Libraries imported successfully!


## 1. Load Source Dataset

In [None]:
# Define paths
current_dir = Path.cwd()
print(f"Current directory: {current_dir}")

# Go up one level to project root since we're in notebooks folder
project_root = current_dir.parent
source_path = project_root / "data" / "raw" / "extracted" / "sfpark_filtered_136_247_100taxis.csv"
output_path = project_root / "data" / "raw" / "smart_parking_full.csv"

print(f"Project root: {project_root}")
print(f"Source path: {source_path}")
print(f"Output path: {output_path}")

# Check if source exists
if source_path.exists():
    file_size = source_path.stat().st_size / (1024 * 1024)
    print(f"✅ Source dataset found: {file_size:.1f} MB")
else:
    print("❌ Source dataset not found!")
    print("Checking available files:")
    extracted_dir = project_root / "data" / "raw" / "extracted"
    if extracted_dir.exists():
        for file in sorted(extracted_dir.iterdir()):
            if file.is_file():
                size_mb = file.stat().st_size / (1024 * 1024)
                print(f"  - {file.name} ({size_mb:.1f} MB)")

Source path: c:\Users\vedp3\OneDrive\Desktop\AAI_530_Final_Project\AAI530-Group10-smart-parking-iot-forecasting\notebooks\data\raw\extracted\sfpark_filtered_136_247_100taxis.csv
Output path: c:\Users\vedp3\OneDrive\Desktop\AAI_530_Final_Project\AAI530-Group10-smart-parking-iot-forecasting\notebooks\data\raw\smart_parking_full.csv
❌ Source dataset not found!
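
The recorded output shows the source path resolving under the `notebooks` folder, so a fixed "one level up" assumption may not hold for every working directory. One possible alternative, sketched below, is to walk upward from the current directory until a folder containing `data/raw/extracted` is found. The helper name `find_project_root` and its `marker` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Optional


def find_project_root(
    start: Path,
    marker: Path = Path("data") / "raw" / "extracted",
) -> Optional[Path]:
    """Walk upward from `start`; return the first directory that contains `marker`.

    Hypothetical helper: assumes the project keeps its raw data under
    <project_root>/data/raw/extracted, as the paths in this notebook suggest.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None  # marker not found anywhere above `start`


project_root = find_project_root(Path.cwd())
if project_root is None:
    print("Could not locate data/raw/extracted above the current directory.")
else:
    print(f"Project root: {project_root}")
```

If a root is found, `source_path` and `output_path` can be built from it exactly as in the cell above.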


## 2. Load and Sample Data

In [5]:
# Load the full 100 taxis dataset
print("🔄 Loading 100 taxis dataset...")
df_full = pd.read_csv(source_path, sep=';')

print(f"✅ Full dataset loaded: {df_full.shape}")
print(f"   - Rows: {df_full.shape[0]:,}")
print(f"   - Columns: {df_full.shape[1]}")

# Sample 30% of the data
print("\n🎲 Sampling 30% of the data...")
df_sample = df_full.sample(frac=0.3, random_state=42)

print(f"✅ Sampled dataset: {df_sample.shape}")
print(f"   - Rows: {df_sample.shape[0]:,}")
print(f"   - Columns: {df_sample.shape[1]}")
print(f"   - Sample percentage: {len(df_sample) / len(df_full) * 100:.1f}%")

🔄 Loading 100 taxis dataset...


FileNotFoundError: [Errno 2] No such file or directory: 'c:\\Users\\vedp3\\OneDrive\\Desktop\\AAI_530_Final_Project\\AAI530-Group10-smart-parking-iot-forecasting\\notebooks\\data\\raw\\extracted\\sfpark_filtered_136_247_100taxis.csv'
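
The cell above reads the whole source file before sampling. If memory is tight, a chunked variant is one option. This is a minimal sketch, not the notebook's method: it samples each chunk independently, so the result is roughly 30% of the rows but will not contain the same rows as `df_full.sample(frac=0.3, random_state=42)`. The function name `sample_csv_in_chunks` and the default `chunksize` are assumptions chosen for illustration.

```python
import pandas as pd


def sample_csv_in_chunks(path, frac=0.3, chunksize=500_000, sep=";", seed=42):
    """Sample roughly `frac` of a CSV's rows without loading the full file at once.

    Hypothetical helper: each chunk is sampled on its own, so the overall
    fraction is approximate and differs from a single global sample.
    """
    sampled_parts = []
    reader = pd.read_csv(path, sep=sep, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Offset the seed per chunk so every chunk does not draw the same positions.
        sampled_parts.append(chunk.sample(frac=frac, random_state=seed + i))
    if not sampled_parts:
        # No data rows were read: return an empty frame with the file's columns.
        return pd.read_csv(path, sep=sep, nrows=0)
    return pd.concat(sampled_parts, ignore_index=True)
```

With the paths defined earlier, a call would look like `df_sample = sample_csv_in_chunks(source_path)`. The fixed `seed` keeps the result reproducible across runs as long as `chunksize` stays the same.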

## 3. Save Sampled Data

In [None]:
# Save the sampled dataset
print(f"💾 Saving sampled data to {output_path}...")

df_sample.to_csv(output_path, sep=';', index=False)

# Verify the saved file
if output_path.exists():
    output_size = output_path.stat().st_size / (1024 * 1024)
    print("✅ Sampled data saved successfully!")
    print(f"   - File size: {output_size:.1f} MB")
    print(f"   - Expected size: ~{file_size * 0.3:.1f} MB")
else:
    print("❌ Failed to save sampled data!")

## 4. Verification

In [None]:
# Load and verify the saved dataset
print("🔍 Verifying saved dataset...")
df_verify = pd.read_csv(output_path, sep=';')

print(f"✅ Verification successful: {df_verify.shape}")
print(f"   - Rows: {df_verify.shape[0]:,}")
print(f"   - Columns: {df_verify.shape[1]}")

# Show first few rows
print("\n📄 First 5 rows:")
display(df_verify.head())

# Show column names
print("\n📋 Column names:")
for i, col in enumerate(df_verify.columns):
    print(f"   {i+1:2d}. {col}")
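
The verification cell prints the shape and columns of the reloaded file but does not compare them against the in-memory sample. A small explicit check could make a mismatch visible. The sketch below is an assumption about what such a check might look like. `check_roundtrip` is a hypothetical helper, and it only compares shape and column names, not cell values.

```python
import pandas as pd


def check_roundtrip(df_saved: pd.DataFrame, df_expected: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means both checks passed.

    Hypothetical helper: compares only shape and column names/order.
    """
    problems = []
    if df_saved.shape != df_expected.shape:
        problems.append(f"shape mismatch: {df_saved.shape} vs {df_expected.shape}")
    if list(df_saved.columns) != list(df_expected.columns):
        problems.append("column names or order differ")
    return problems
```

In this notebook the call would be `check_roundtrip(df_verify, df_sample)`, with any returned messages printed or raised as an error.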

## 5. Summary

### ✅ Completed Tasks:
1. **Loaded** 100 taxis SFpark dataset (5M+ rows)
2. **Sampled** 30% of the data (~1.5M rows)
3. **Saved** as `smart_parking_full.csv` for analysis
4. **Verified** the saved dataset integrity

### 📊 Dataset Information:
- **Source**: 100 taxis variant (332MB)
- **Sample**: 30% random sample (~100MB)
- **Format**: Semicolon-delimited CSV
- **Columns**: 24 (timestamp, segmentid, capacity, occupied, observed1-10, diff1-10)
- **Period**: June 13 - July 24, 2013

### 🎯 Ready for Analysis:
The sampled `smart_parking_full.csv` is now ready for data overview and analysis in the main notebook!