# 🎯 Learn to Build SmartAutoDataLoader
## Step-by-Step Programming Tutorial

**Welcome to your programming journey!** 🚀

This notebook will teach you how to build a professional data loader from scratch. Each section contains:
- 📝 **Clear explanations** of what each function should do
- 🔧 **Empty functions** for you to implement
- 💡 **Tips and hints** to guide your thinking
- ✅ **Learning goals** for each step

### What You'll Build:
A **SmartAutoDataLoader** that can:
- 📊 Load CSV, Excel, and JSON files automatically
- 🔍 Detect file formats, encodings, and delimiters
- 📅 Find and parse datetime columns automatically
- 📋 Generate comprehensive loading reports
- 💾 Estimate memory usage for large files

### Learning Philosophy:
**"The best way to learn programming is by writing code yourself!"**

Ready to start coding? Let's begin! 🎉

## 📚 Section 1: Import Required Libraries

**Learning Goal:** Understand which libraries we need and why

### What This Section Does:
Import all the essential Python libraries that our SmartAutoDataLoader will use:

- **`pandas`** - For reading and manipulating data (CSV, Excel, JSON)
- **`pathlib.Path`** - For handling file paths in a modern way
- **`time`** - For measuring loading performance
- **`re`** - For pattern matching (finding dates in text)
- **`typing`** - For type hints (helps catch errors early)
- **`dataclasses`** - For creating structured data containers

### Your Task:
Import all the necessary libraries in the cell below. Think about what each library will be used for!

In [31]:
# TODO: Import all required libraries here
# Hint: You'll need pandas, pathlib, time, re, typing, and dataclasses

# Your imports go here:
import pandas as pd
from pathlib import Path
import time
import re
from typing import Optional, List, Union, Dict, Any
from dataclasses import dataclass, field



## 📊 Section 2: Create LoadReport Data Class

**Learning Goal:** Learn how to create structured data containers

### What This Section Does:
Create a `LoadReport` class that stores comprehensive information about file loading results.

### Key Concepts:
- **`@dataclass`** - A decorator that automatically creates constructor and other methods
- **Type hints** - Specify what type of data each field should contain
- **Data organization** - Group related information together

### Your Task:
Create a LoadReport dataclass with these fields:
- `file_path` (str) - Path to the loaded file
- `file_size_mb` (float) - Size of file in megabytes
- `detected_format` (str) - File format (csv, excel, json)
- `detected_encoding` (str) - Character encoding used
- `detected_delimiter` (str) - CSV delimiter character
- `has_header` (bool) - Whether file has header row
- `total_rows` (int) - Number of data rows
- `total_columns` (int) - Number of columns
- `column_info` (Dict[str, str]) - Information about each column
- `date_columns_found` (List[str]) - Names of datetime columns
- `date_formats_detected` (Dict[str, str]) - Date formats for each datetime column
- `loading_time_seconds` (float) - How long loading took
- `quality_score` (int) - Data quality score (0-100)
- `warnings` (List[str]) - Any warnings during loading
- `errors` (List[str]) - Any errors encountered
- `success` (bool) - Whether loading was successful
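
The fields listed above have no defaults, so every one must be passed when a report is constructed. As a minimal sketch, assuming a hypothetical trimmed-down `MiniReport` class that is not part of the tutorial, `field(default_factory=list)` is one way to give list fields a safe default that is created fresh for each instance:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MiniReport:
    """Hypothetical, trimmed-down stand-in for LoadReport."""
    file_path: str
    success: bool = False
    # default_factory builds a new list for every instance,
    # so reports never share the same warnings list
    warnings: List[str] = field(default_factory=list)


first = MiniReport(file_path="a.csv")
second = MiniReport(file_path="b.csv")
first.warnings.append("empty column")

print(first)            # auto-generated __repr__ shows all fields
print(second.warnings)  # the second report's list is unaffected
```

The decorator also generates `__eq__`, so two reports with the same field values compare equal.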

In [32]:
# TODO: Create the LoadReport dataclass
# Hint: Use @dataclass decorator and include all the fields listed above

# Your LoadReport class goes here:
@dataclass
class LoadReport:
  file_path: str
  file_size_mb: float
  detected_format: str
  detected_encoding: str
  detected_delimiter: str
  has_header: bool
  total_rows: int
  total_columns: int
  column_info: Dict[str, str]
  date_columns_found: List[str]
  date_formats_detected: Dict[str, str]
  loading_time_seconds: float
  quality_score: int
  warnings: List[str]
  errors: List[str]
  success: bool


## 🔧 Section 3: Initialize SmartAutoDataLoader Class

**Learning Goal:** Learn how to create class constructors and instance variables

### What This Section Does:
Create the main SmartAutoDataLoader class with its constructor (`__init__` method).

### Key Concepts:
- **Class definition** - Creating a blueprint for objects
- **Constructor (`__init__`)** - Method that runs when creating a new instance
- **Instance variables (`self.variable`)** - Data that belongs to each instance
- **Default parameters** - Providing sensible defaults for optional arguments

### Your Task:
Create the SmartAutoDataLoader class with:

1. **Constructor parameters:**
   - `verbose` (bool, default=True) - Whether to print progress messages

2. **Instance variables:**
   - `self.verbose` - Store the verbose setting
   - `self.date_patterns` - List of regex patterns for detecting dates

3. **Date patterns to include:**
   - ISO format: `r'\d{4}-\d{2}-\d{2}'` → `'%Y-%m-%d'`
   - EU format: `r'\d{2}/\d{2}/\d{4}'` → `'%d/%m/%Y'`
   - German format: `r'\d{2}\.\d{2}\.\d{4}'` → `'%d.%m.%Y'`
   - UK format: `r'\d{2}-\d{2}-\d{4}'` → `'%d-%m-%Y'`
   - Year-first slash format: `r'\d{4}/\d{2}/\d{2}'` → `'%Y/%m/%d'`

4. **Print welcome message** if verbose is True
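
Each entry pairs a regex that recognizes the *shape* of a date with the `strptime` format that parses it. The sketch below illustrates that pairing using only the standard library; `parse_date` is a hypothetical helper, and it uses `re.fullmatch` for strictness where the tutorial's own code uses `re.search`:

```python
import re
from datetime import datetime

# Same (regex, format) pairing as three of the tutorial's date patterns
date_patterns = [
    (r'\d{4}-\d{2}-\d{2}', '%Y-%m-%d'),    # ISO format
    (r'\d{2}/\d{2}/\d{4}', '%d/%m/%Y'),    # EU format
    (r'\d{2}\.\d{2}\.\d{4}', '%d.%m.%Y'),  # German format
]


def parse_date(text: str):
    """Hypothetical helper: parse with the first pattern that fits, else None."""
    for pattern, date_format in date_patterns:
        if re.fullmatch(pattern, text):
            try:
                return datetime.strptime(text, date_format)
            except ValueError:
                # The shape matched but the values are invalid (e.g. month 13)
                continue
    return None


print(parse_date("2024-03-15"))
print(parse_date("15/03/2024"))
print(parse_date("31.12.2023"))
print(parse_date("2024-13-45"))  # right shape, not a real date
```

The last call shows why the regex alone is not enough: it only checks digit positions, while the format string validates the actual calendar values.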

In [33]:
# TODO: Create the SmartAutoDataLoader class with __init__ method
# Hint: Store verbose setting and date patterns as instance variables

class SmartAutoDataLoader:
    """Smart automatic data loader for CSV, Excel, and JSON files"""
    def __init__(self, verbose: bool = True):
        """
        Initialize the SmartAutoDataLoader

        Args:
            verbose: Whether to print progress messages
        """
        # Your initialization code goes here:
        self.verbose = verbose
        self.date_patterns = [
            (r'\d{4}-\d{2}-\d{2}', '%Y-%m-%d'),      # ISO format
            (r'\d{2}/\d{2}/\d{4}', '%d/%m/%Y'),      # EU format
            (r'\d{2}\.\d{2}\.\d{4}', '%d.%m.%Y'),    # German format
            (r'\d{2}-\d{2}-\d{4}', '%d-%m-%Y'),      # UK format
            (r'\d{4}/\d{2}/\d{2}', '%Y/%m/%d'),      # Year-first slash format
        ]

        if self.verbose:
            print("🎯 SmartAutoDataLoader initialized!")

    # TODO: Add this method to your SmartAutoDataLoader class
    # Hint: Use Path(source).suffix.lower() to get the file extension

    def detect_format(self, source: str) -> str:
        """
        Detect file format based on extension

        Args:
            source: Path to the file

        Returns:
            Format name: 'csv', 'excel', or 'json'
        """
        path = Path(source)
        suffix = path.suffix.lower()

        format_map = {
            '.csv': 'csv',
            '.tsv': 'csv',  # Treat TSV as CSV for simplicity
            '.txt': 'csv',  # Treat TXT as CSV for simplicity
            '.xlsm': 'excel',
            '.xlsx': 'excel',
            '.xls': 'excel',
            '.json': 'json',
        }
        detected = format_map.get(suffix, 'csv') # Default to CSV if not recognized
        if self.verbose:
            print(f"🔍 Detected format: {detected} for file: {source}")

        return detected

    # TODO: Add this method to your SmartAutoDataLoader class
    # Hint: Use a for loop with try-except to test each encoding

    def detect_encoding(self, source: str) -> str:
        """
        Detect file encoding by trying common encodings

        Args:
            source: Path to the file

        Returns:
            Detected encoding name
        """
        # latin-1 (alias iso-8859-1) can decode any byte sequence, so it acts
        # as a catch-all: encodings listed after it are never reached.
        encodings_to_try = ['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']

        for encoding in encodings_to_try:
            try:
                with open(source, 'r', encoding=encoding) as f:
                    f.read(1024)  # Decoding a small sample is enough to test
                if self.verbose:
                    print(f"🔤 Encoding detected: {encoding}")
                return encoding
            except UnicodeDecodeError:
                continue
        return 'utf-8'  # Fallback

    # TODO: Add this INTERNAL helper method to your SmartAutoDataLoader class
    # Hint: Use max(counts, key=counts.get) to find the delimiter with highest count

    def _sniff_delimiter(self, path: Path, encoding: str) -> str:
        """
        Internal method: Detect CSV delimiter by analyzing first line

        Args:
            path: Path object for the file
            encoding: Encoding to use for reading

        Returns:
            Most likely delimiter character
        """
        try:
            with open(path, 'r', encoding=encoding) as f:
                first_line = f.readline()

            delimiters = [',', ';', '\t', '|']
            counts = {delimiter: first_line.count(delimiter) for delimiter in delimiters}
            return max(counts, key=counts.get) if max(counts.values()) > 0 else ','
        except Exception:
            return ','

    # TODO: Add this method to your SmartAutoDataLoader class
    # Hint: Call your other detection methods and combine results

    def sniff_csv_params(self, source: str) -> Dict[str, Any]:
        """
        Detect comprehensive CSV parameters

        Args:
            source: Path to the CSV file

        Returns:
            Dictionary with detected parameters
        """
        detected_encoding = self.detect_encoding(source)
        detected_delimiter = self._sniff_delimiter(Path(source), detected_encoding)

        params = {
            'delimiter': detected_delimiter,
            'encoding': detected_encoding,
            'has_header': True,  # Assume files have headers for now
        }

        if self.verbose:
            print(f"🔍 Detected CSV parameters: {params}")

        return params

    # TODO: Add this method to your SmartAutoDataLoader class
    # Hint: Use re.search() to check patterns and pd.to_datetime() to convert

    def parse_datetimes(self, df: 'pd.DataFrame') -> 'pd.DataFrame':
        """
        Automatically detect and parse datetime columns

        Args:
            df: Input DataFrame

        Returns:
            DataFrame with datetime columns converted
        """
        import pandas as pd
        df_copy = df.copy()
        date_columns = []

        for col in df_copy.columns:
            if df_copy[col].dtype != 'object':  # Text columns only
                continue

            sample = df_copy[col].dropna().astype(str).head(10)
            if sample.empty:
                continue  # All-null column: nothing to match against

            for pattern, date_format in self.date_patterns:
                matches = sum(1 for val in sample if re.search(pattern, val))

                if matches >= len(sample) * 0.5:  # 50% threshold
                    try:
                        df_copy[col] = pd.to_datetime(df_copy[col], format=date_format, errors='coerce')
                    except Exception:
                        continue
                    date_columns.append(col)
                    if self.verbose:
                        print(f"📅 Parsed datetime column: {col} ({date_format})")
                    break  # Column converted; skip the remaining patterns

        if not date_columns and self.verbose:
            print("📅 No datetime columns found")
        elif self.verbose and date_columns:
            print(f"   📅 Total date columns found: {len(date_columns)}")
        return df_copy

    # TODO: Add this MAIN method to your SmartAutoDataLoader class
    # Hint: Use self.detect_format() then delegate to specific load methods

    def load(self, source: str, **kwargs) -> 'pd.DataFrame':
        """
        Universal loading method - auto-detects format and delegates

        Args:
            source: Path to the file to load
            **kwargs: Additional arguments to pass to specific loaders

        Returns:
            Loaded DataFrame
        """
        import pandas as pd

        if self.verbose:
            print(f"Loading file: {source}")

        detected_format = self.detect_format(source)

        if detected_format in ['csv', 'tsv']:
            return self.load_csv(source, **kwargs)
        elif detected_format == 'excel':
            return self.load_excel(source, **kwargs)
        elif detected_format == 'json':
            return self.load_json(source, **kwargs)
        else:
            raise ValueError(f"Unsupported file format: {detected_format}")

    def load_csv(self, source: str, **kwargs) -> 'pd.DataFrame':
        """
        Load CSV files with auto-detected parameters

        Args:
            source: Path to CSV file
            **kwargs: Additional pandas.read_csv arguments

        Returns:
            Loaded and processed DataFrame
        """
        import pandas as pd

        if self.verbose:
            print(f"Loading CSV: {source}")

        encoding = self.detect_encoding(source)
        delimiter = self._sniff_delimiter(Path(source), encoding)

        df = pd.read_csv(source, encoding=encoding, sep=delimiter, **kwargs)
        df = self.parse_datetimes(df)

        if self.verbose:
            print(f"Loaded CSV: {source} ({df.shape[0]} rows, {df.shape[1]} columns)")

        return df

    def load_excel(self, source: str, **kwargs) -> 'pd.DataFrame':
            """
            3/5 Excel loading with smart detection (README: 80% priority - HIGH)

            Implements ExcelLoader functionality with:
            - Sheet selection and detection (README requirement)
            - Auto-delegation from unified interface
            """
            import pandas as pd
            if self.verbose:
                print(f"Loading Excel: {source}")
            try:
                # Check if file exists
                path = Path(source)
                if not path.exists():
                    raise FileNotFoundError(f"Excel file not found: {source}")

                # Get sheet name from kwargs or detect best sheet
                sheet_name = kwargs.get('sheet_name', None)

                if sheet_name is None:
                    # Auto-detect best sheet (README: sheet selection)
                    xl_file = pd.ExcelFile(source)
                    sheet_names = xl_file.sheet_names

                    if self.verbose:
                        print(f"   📋 Available sheets: {sheet_names}")

                    # Use the only sheet, or pick the sheet with the most columns
                    if len(sheet_names) == 1:
                        sheet_name = sheet_names[0]
                    else:
                        # Read one row per sheet and compare column counts
                        max_cols = 0
                        best_sheet = sheet_names[0]

                        for sheet in sheet_names:
                            try:
                                temp_df = pd.read_excel(source, sheet_name=sheet, nrows=1)
                                if len(temp_df.columns) > max_cols:
                                    max_cols = len(temp_df.columns)
                                    best_sheet = sheet
                            except Exception:
                                continue

                        sheet_name = best_sheet

                    if self.verbose:
                        print(f"   ✅ Selected sheet: '{sheet_name}'")

                # Load Excel with selected sheet
                df = pd.read_excel(source, sheet_name=sheet_name, **{k: v for k, v in kwargs.items() if k != 'sheet_name'})

                # Check if DataFrame is empty
                if df.empty:
                    if self.verbose:
                        print("   ⚠️ Warning: Excel file loaded but DataFrame is empty")
                    return df

                # Auto-detect and parse datetimes
                df = self.parse_datetimes(df)

                if self.verbose:
                    print(f"✅ Excel loaded: {len(df)} rows, {len(df.columns)} columns")
                    print(f"   📊 Column names: {list(df.columns)}")

                return df

            except Exception as e:
                error_msg = f"Error loading Excel file: {str(e)}"
                if self.verbose:
                    print(f"❌ {error_msg}")
                raise ValueError(error_msg) from e

    def load_json(self, source: str, **kwargs) -> 'pd.DataFrame':
            """
            4/5 JSON loading with smart detection (README: 70% priority - MEDIUM)

            Implements JsonLoader functionality with:
            - Structure flattening (README requirement)
            - Auto-delegation from unified interface
            """
            import pandas as pd

            if self.verbose:
                print("🗂️ Loading JSON file...")

            # Load JSON into a DataFrame. Note: pd.read_json does not flatten
            # nested objects; pd.json_normalize is the usual tool for that.
            df = pd.read_json(source)

            # Auto-detect and parse datetimes
            df = self.parse_datetimes(df)

            # Check if DataFrame is empty
            if df.empty:
                if self.verbose:
                    print("   ⚠️ Warning: JSON file loaded but DataFrame is empty")
                return df

            if self.verbose:
                print(f"✅ JSON loaded: {len(df)} rows, {len(df.columns)} columns")

            return df

## 🔍 Section 4: Implement Format Detection

**Learning Goal:** Learn how to analyze file extensions and make decisions

### What This Section Does:
Create a method that automatically detects the file format based on the file extension.

### Key Concepts:
- **File extensions** - The part after the dot in filenames (.csv, .xlsx, .json)
- **Dictionary mapping** - Associating extensions with format names
- **Default values** - What to return when extension is unknown
- **String methods** - Using `.lower()` and `.suffix` for consistent handling

### Business Logic:
**Priority System:**
- CSV/TSV/TXT → "csv" (95% priority - CRITICAL)
- XLSX/XLSM/XLS → "excel" (80% priority - HIGH)
- JSON → "json" (70% priority - MEDIUM)

### Your Task:
Add this method to your SmartAutoDataLoader class:

```python
def detect_format(self, source: str) -> str:
```

**Requirements:**
1. Take a file path as input
2. Extract the file extension (hint: use `Path(source).suffix.lower()`)
3. Map extensions to format names using a dictionary
4. Return the detected format ("csv", "excel", or "json")
5. Default to "csv" for unknown extensions
6. Print the result if verbose mode is on
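
To see how `Path(...).suffix` behaves on awkward filenames, here is a small standard-library sketch. The `guess_format` function and its shortened mapping are hypothetical stand-ins for the method you are about to write:

```python
from pathlib import Path

# Shortened, illustrative mapping (the tutorial's version has more entries)
format_map = {'.csv': 'csv', '.xlsx': 'excel', '.json': 'json'}


def guess_format(name: str) -> str:
    """Hypothetical helper: map a filename to a format, defaulting to 'csv'."""
    suffix = Path(name).suffix.lower()  # '.XLSX' and '.xlsx' are treated alike
    return format_map.get(suffix, 'csv')


for name in ["data/sales.csv", "Report.XLSX", "archive.tar.gz", "notes"]:
    print(f"{name!r} -> suffix {Path(name).suffix!r} -> {guess_format(name)}")
```

Two things stand out: `suffix` returns only the last extension (`.gz` for `archive.tar.gz`), and a file with no extension yields an empty string. Both cases fall through to the `'csv'` default, which is convenient but means an unsupported file is never rejected at this stage.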

In [34]:
# TODO: Add this method to your SmartAutoDataLoader class
# Hint: Use Path(source).suffix.lower() to get the file extension

def detect_format(self, source: str) -> str:
    """
    Detect file format based on extension

    Args:
        source: Path to the file

    Returns:
        Format name: 'csv', 'excel', or 'json'
    """
    path = Path(source)
    suffix = path.suffix.lower()

    format_map = {
        '.csv': 'csv',
        '.tsv': 'csv',  # Treat TSV as CSV for simplicity
        '.txt': 'csv',  # Treat TXT as CSV for simplicity
        '.xlsm': 'excel',
        '.xlsx': 'excel',
        '.xls': 'excel',
        '.json': 'json',
    }
    detected = format_map.get(suffix, 'csv') # Default to CSV if not recognized
    if self.verbose:
        print(f"🔍 Detected format: {detected} for file: {source}")

    return detected


# Tip: Create a dictionary mapping extensions like {'.csv': 'csv', '.xlsx': 'excel', etc.}

## 🔤 Section 5: Implement Encoding Detection

**Learning Goal:** Learn how to handle different text encodings robustly

### What This Section Does:
Create a method that automatically detects the text encoding of a file by trying common encodings.

### Key Concepts:
- **Text encoding** - How text characters are stored as bytes (UTF-8, Latin-1, etc.)
- **Try-except blocks** - Handling errors gracefully
- **File reading** - Opening and reading file contents
- **UnicodeDecodeError** - What happens when encoding is wrong

### Why This Matters:
Files can be saved in different encodings. Using the wrong encoding causes garbled text or errors. We need to detect the correct encoding automatically.
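
Both failure modes can be reproduced in memory, without any files. The sketch below is only an illustration of the problem, not the detection method itself:

```python
text = "café"
utf8_bytes = text.encode("utf-8")      # 'é' becomes two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # 'é' becomes one byte: 0xE9

# Wrong guess that "succeeds": latin-1 maps every byte to some character
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # prints: cafÃ©

# Wrong guess that fails loudly: a lone 0xE9 is not valid UTF-8
try:
    latin1_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print(f"UnicodeDecodeError: {exc.reason}")

# latin-1 accepts all 256 byte values, so decoding with it never raises
decoded = bytes(range(256)).decode("latin-1")
print(len(decoded))  # prints: 256
```

Because `latin-1` never raises `UnicodeDecodeError`, a try-each-encoding loop always stops at it. Its position in the list therefore matters: anything placed after it is never tried.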

### Your Task:
Add this method to your SmartAutoDataLoader class:

```python
def detect_encoding(self, source: str) -> str:
```

**Requirements:**
1. Try these encodings in order: ['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']
2. For each encoding, try to read the first 1024 characters of the file
3. If reading succeeds without errors, return that encoding
4. If all encodings fail, return 'utf-8' as fallback
5. Print the detected encoding if verbose mode is on

**Algorithm:**
```python
for encoding in encodings_to_try:
    try:
        # Try to read file with this encoding
        # If successful, return the encoding
    except UnicodeDecodeError:
        # Try next encoding
        continue
```

In [35]:
# TODO: Add this method to your SmartAutoDataLoader class
# Hint: Use a for loop with try-except to test each encoding

def detect_encoding(self, source: str) -> str:
    """
    Detect file encoding by trying common encodings

    Args:
        source: Path to the file

    Returns:
        Detected encoding name
    """
    # latin-1 (alias iso-8859-1) can decode any byte sequence, so it acts
    # as a catch-all: encodings listed after it are never reached.
    encodings_to_try = ['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']

    for encoding in encodings_to_try:
        try:
            with open(source, 'r', encoding=encoding) as f:
                f.read(1024)  # Decoding a small sample is enough to test
            if self.verbose:
                print(f"🔤 Encoding detected: {encoding}")
            return encoding
        except UnicodeDecodeError:
            continue
    return 'utf-8'  # Fallback

# Tip: Use "with open(source, 'r', encoding=encoding) as f: f.read(1024)" to test

## 📋 Section 6: Implement CSV Delimiter Sniffing

**Learning Goal:** Learn how to analyze text content to extract patterns

### What This Section Does:
Create a helper method that analyzes the first line of a CSV file to detect what character is used as the delimiter (comma, semicolon, tab, etc.).

### Key Concepts:
- **Delimiter** - The character that separates values in CSV files
- **String analysis** - Counting character occurrences
- **Dictionary operations** - Finding the key with maximum value
- **Internal methods** - Methods that start with `_` (private/helper methods)

### Common CSV Delimiters:
- `,` (comma) - Most common, used in English locales
- `;` (semicolon) - Common in European locales  
- `\t` (tab) - Used in TSV (Tab-Separated Values) files
- `|` (pipe) - Less common but sometimes used

### Your Task:
Add this **internal helper method** to your SmartAutoDataLoader class:

```python
def _sniff_delimiter(self, path: Path, encoding: str) -> str:
```

**Requirements:**
1. Read the first line of the file using the provided encoding
2. Count how many times each potential delimiter appears in that line
3. Return the delimiter that appears most frequently
4. If no delimiters are found, return comma (`,`) as default
5. Handle any file reading errors gracefully

**Algorithm:**
```python
delimiters = [',', ';', '\t', '|']
counts = {delimiter: first_line.count(delimiter) for delimiter in delimiters}
return delimiter_with_highest_count
```
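
The counting step can be tried on plain strings before any file handling is added. In this sketch, `pick_delimiter` is a hypothetical string-only version of the helper:

```python
def pick_delimiter(first_line: str, candidates=(',', ';', '\t', '|')) -> str:
    """Hypothetical helper: return the most frequent candidate, or ',' if none occur."""
    counts = {d: first_line.count(d) for d in candidates}
    best = max(counts, key=counts.get)
    # max() returns the first candidate on ties, so the all-zero case
    # must be checked explicitly before trusting the result
    return best if counts[best] > 0 else ','


print(repr(pick_delimiter("id;name;price")))
print(repr(pick_delimiter("id\tname\tprice")))
print(repr(pick_delimiter("single_column")))

# Limitation: commas inside a quoted value are counted too
print(repr(pick_delimiter('id;"Smith, John, Jr., PhD";note')))
```

The last call returns a comma even though the line is semicolon-separated, because raw counting does not understand quoting. For a learning exercise that trade-off is reasonable; the standard library's `csv.Sniffer` is one option when quoting needs to be handled.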

In [36]:
# TODO: Add this INTERNAL helper method to your SmartAutoDataLoader class
# Hint: Use max(counts, key=counts.get) to find the delimiter with highest count

def _sniff_delimiter(self, path: Path, encoding: str) -> str:
    """
    Internal method: Detect CSV delimiter by analyzing first line

    Args:
        path: Path object for the file
        encoding: Encoding to use for reading

    Returns:
        Most likely delimiter character
    """
    try:
        with open(path, 'r', encoding=encoding) as f:
            first_line = f.readline()

        delimiters = [',', ';', '\t', '|']
        counts = {delimiter: first_line.count(delimiter) for delimiter in delimiters}
        return max(counts, key=counts.get) if max(counts.values()) > 0 else ','
    except Exception:
        return ','

# Tip: Count occurrences of [',', ';', '\t', '|'] in the first line

## 🔧 Section 7: Implement CSV Parameter Detection

**Learning Goal:** Learn how to combine multiple detection functions

### What This Section Does:
Create a method that combines encoding and delimiter detection into a comprehensive CSV parameter detection function.

### Key Concepts:
- **Function composition** - Using one function's output as another's input
- **Dictionary returns** - Returning structured data as dictionaries
- **Method calling** - Using `self.method_name()` to call other methods in the same class

### Your Task:
Add this method to your SmartAutoDataLoader class:

```python
def sniff_csv_params(self, source: str) -> Dict[str, Any]:
```

**Requirements:**
1. Call `self.detect_encoding(source)` to get the encoding
2. Call `self._sniff_delimiter(Path(source), encoding)` to get the delimiter
3. Return a dictionary with detected parameters
4. Print the results if verbose mode is on

**Return Dictionary Structure:**
```python
{
    'delimiter': detected_delimiter,
    'encoding': detected_encoding,
    'has_header': True  # Assume files have headers for now
}
```

This method brings together your encoding and delimiter detection into one convenient function!

In [37]:
# TODO: Add this method to your SmartAutoDataLoader class
# Hint: Call your other detection methods and combine results

def sniff_csv_params(self, source: str) -> Dict[str, Any]:
    """
    Detect comprehensive CSV parameters

    Args:
        source: Path to the CSV file

    Returns:
        Dictionary with detected parameters
    """
    detected_encoding = self.detect_encoding(source)
    detected_delimiter = self._sniff_delimiter(Path(source), detected_encoding)

    params = {
        'delimiter': detected_delimiter,
        'encoding': detected_encoding,
        'has_header': True,  # Assume files have headers for now
    }

    if self.verbose:
        print(f"🔍 Detected CSV parameters: {params}")

    return params

# Tip: Use self.detect_encoding() and self._sniff_delimiter() methods you created

## 📅 Section 8: Implement DateTime Parsing

**Learning Goal:** Learn how to detect and convert datetime patterns in data

### What This Section Does:
Create a method that automatically finds columns containing dates and converts them to proper datetime format.

### Key Concepts:
- **Regular expressions (regex)** - Pattern matching for finding dates in text
- **pandas datetime conversion** - Using `pd.to_datetime()` to convert strings to dates
- **Data sampling** - Looking at a few rows to determine column patterns
- **Data type checking** - Using `df[col].dtype == 'object'` to find text columns

### Your Task:
Add this method to your SmartAutoDataLoader class:

```python
def parse_datetimes(self, df: 'pd.DataFrame') -> 'pd.DataFrame':
```

**Requirements:**
1. Create a copy of the input DataFrame
2. Initialize an empty list to track found date columns
3. For each column in the DataFrame:
   - Skip non-text columns (only check `object` dtype)
   - Take a sample of 10 non-null values
   - Check each date pattern from `self.date_patterns`
   - If at least 50% of sample values match a pattern, convert the column
4. Print results if verbose mode is on
5. Return the modified DataFrame

**Algorithm:**
```python
for col in df.columns:
    if df[col].dtype == 'object':  # Text columns only
        sample = df[col].dropna().astype(str).head(10)
        for pattern, date_format in self.date_patterns:
            matches = sum(1 for val in sample if re.search(pattern, val))
            if matches >= len(sample) * 0.5:  # 50% threshold
                # Convert column to datetime
```
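
The threshold test has one edge case worth checking in isolation: with an empty sample, `0 >= 0 * 0.5` is `True`, so a column containing only nulls would pass. The sketch below uses a hypothetical `looks_like_dates` helper on plain lists to show the check with a guard added:

```python
import re

ISO_PATTERN = r'\d{4}-\d{2}-\d{2}'


def looks_like_dates(sample, pattern=ISO_PATTERN, threshold=0.5) -> bool:
    """Hypothetical helper: True when at least `threshold` of the sample matches."""
    values = [str(v) for v in sample]
    if not values:
        # Guard: without it, 0 >= 0 * threshold is True for an empty sample
        return False
    matches = sum(1 for v in values if re.search(pattern, v))
    return matches >= len(values) * threshold


print(looks_like_dates(["2024-01-01", "2024-02-01", "n/a"]))  # 2 of 3 match
print(looks_like_dates(["abc", "2024-01-01", "x", "y"]))      # 1 of 4 match
print(looks_like_dates([]))                                   # empty sample
```

In the DataFrame version, the equivalent guard is to skip a column when its non-null sample is empty.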

In [38]:
# TODO: Add this method to your SmartAutoDataLoader class
# Hint: Use re.search() to check patterns and pd.to_datetime() to convert

def parse_datetimes(self, df: 'pd.DataFrame') -> 'pd.DataFrame':
    """
    Automatically detect and parse datetime columns

    Args:
        df: Input DataFrame

    Returns:
        DataFrame with datetime columns converted
    """
    import pandas as pd
    df_copy = df.copy()
    date_columns = []

    for col in df_copy.columns:
        if df_copy[col].dtype != 'object':  # Text columns only
            continue

        sample = df_copy[col].dropna().astype(str).head(10)
        if sample.empty:
            continue  # All-null column: nothing to match against

        for pattern, date_format in self.date_patterns:
            matches = sum(1 for val in sample if re.search(pattern, val))

            if matches >= len(sample) * 0.5:  # 50% threshold
                try:
                    df_copy[col] = pd.to_datetime(df_copy[col], format=date_format, errors='coerce')
                except Exception:
                    continue
                date_columns.append(col)
                if self.verbose:
                    print(f"📅 Parsed datetime column: {col} ({date_format})")
                break  # Column converted; skip the remaining patterns

    if not date_columns and self.verbose:
        print("📅 No datetime columns found")
    elif self.verbose and date_columns:
        print(f"   📅 Total date columns found: {len(date_columns)}")
    return df_copy

# Tip: Use try-except when calling pd.to_datetime() in case conversion fails

## 🎯 Section 9: Implement Universal Load Method

**Learning Goal:** Learn how to create a main method that coordinates other methods

### What This Section Does:
Create the main `load()` method that automatically detects file format and calls the appropriate loading function.

### Key Concepts:
- **Method delegation** - Using one method to call other specialized methods
- **Conditional logic** - Using if/elif/else to make decisions
- **Error handling** - Raising appropriate errors for unsupported formats

### Your Task:
Add this **main method** to your SmartAutoDataLoader class:

```python
def load(self, source: str, **kwargs) -> 'pd.DataFrame':
```

**Requirements:**
1. Print loading message with filename if verbose
2. Call `self.detect_format(source)` to determine file type
3. Based on detected format, call appropriate method:
   - 'csv' or 'tsv' → call `self.load_csv(source, **kwargs)`
   - 'excel' → call `self.load_excel(source, **kwargs)`  
   - 'json' → call `self.load_json(source, **kwargs)`
4. Raise `ValueError` for unsupported formats
5. Return the loaded DataFrame

**Algorithm:**
```python
detected_format = self.detect_format(source)
if detected_format in ['csv', 'tsv']:
    return self.load_csv(source, **kwargs)
elif detected_format == 'excel':
    return self.load_excel(source, **kwargs)
# ... etc
```

**Note:** You'll implement the specific loading methods (load_csv, load_excel, load_json) in the next sections!
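
Delegation can also be written as a dictionary lookup instead of an if/elif chain. The sketch below is an alternative design, not the tutorial's required solution: `MiniLoader` is hypothetical, its `load_*` methods only return labels, and unlike `detect_format` it rejects unknown extensions instead of defaulting to CSV:

```python
from pathlib import Path


class MiniLoader:
    """Hypothetical stand-in that shows dispatch only; no files are read."""

    FORMAT_MAP = {'.csv': 'csv', '.tsv': 'csv', '.xlsx': 'excel', '.json': 'json'}

    def load_csv(self, source, **kwargs):
        return f"csv:{source}"

    def load_excel(self, source, **kwargs):
        return f"excel:{source}"

    def load_json(self, source, **kwargs):
        return f"json:{source}"

    def load(self, source, **kwargs):
        suffix = Path(source).suffix.lower()
        detected = self.FORMAT_MAP.get(suffix)  # None for unknown extensions
        handlers = {
            'csv': self.load_csv,
            'excel': self.load_excel,
            'json': self.load_json,
        }
        if detected not in handlers:
            raise ValueError(f"Unsupported file format: {suffix!r}")
        # Look up the bound method and call it with the same arguments
        return handlers[detected](source, **kwargs)


loader = MiniLoader()
print(loader.load("sales.tsv"))
print(loader.load("report.xlsx"))
```

Adding a new format then means adding one mapping entry and one method, without touching the branching logic.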

In [39]:
# TODO: Add this MAIN method to your SmartAutoDataLoader class
# Hint: Use self.detect_format() then delegate to specific load methods

def load(self, source: str, **kwargs) -> 'pd.DataFrame':
    """
    Universal loading method - auto-detects format and delegates

    Args:
        source: Path to the file to load
        **kwargs: Additional arguments to pass to specific loaders

    Returns:
        Loaded DataFrame
    """
    import pandas as pd

    if self.verbose:
        print(f"Loading file: {source}")

    detected_format = self.detect_format(source)

    if detected_format in ['csv', 'tsv']:
        return self.load_csv(source, **kwargs)
    elif detected_format == 'excel':
        return self.load_excel(source, **kwargs)
    elif detected_format == 'json':
        return self.load_json(source, **kwargs)
    else:
        raise ValueError(f"Unsupported file format: {detected_format}")

# Tip: Use Path(source).name to get just the filename for logging

## 📊 Section 10: Implement CSV Loading

**Learning Goal:** Learn how to use pandas to load CSV files with auto-detected parameters

### What This Section Does:
Create the specialized CSV loading method that uses your detection functions.

### Your Task:
Add this method to your SmartAutoDataLoader class:

```python
def load_csv(self, source: str, **kwargs) -> 'pd.DataFrame':
```

**Requirements:**
1. Print loading message if verbose
2. Auto-detect encoding using `self.detect_encoding(source)`
3. Auto-detect delimiter using `self._sniff_delimiter(Path(source), encoding)`
4. Load CSV using `pd.read_csv(source, encoding=encoding, sep=delimiter)`
5. Auto-parse datetimes using `self.parse_datetimes(df)`
6. Print success message with row/column count
7. Return the processed DataFrame

This method brings together all your detection work to load CSV files intelligently!

In [40]:
# TODO: Add this method to your SmartAutoDataLoader class

def load_csv(self, source: str, **kwargs) -> 'pd.DataFrame':
    """
    Load CSV files with auto-detected parameters

    Args:
        source: Path to CSV file
        **kwargs: Additional pandas.read_csv arguments

    Returns:
        Loaded and processed DataFrame
    """
    import pandas as pd

    if self.verbose:
        print(f"Loading CSV: {source}")

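    # Values passed explicitly by the caller override auto-detection
    # (this also avoids a duplicate-keyword TypeError in pd.read_csv)
    encoding = kwargs.pop('encoding', None) or self.detect_encoding(source)
    delimiter = kwargs.pop('sep', None) or self._sniff_delimiter(Path(source), encoding)

    df = pd.read_csv(source, encoding=encoding, sep=delimiter, **kwargs)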
    df = self.parse_datetimes(df)

    if self.verbose:
        print(f"Loaded CSV: {source} ({df.shape[0]} rows, {df.shape[1]} columns)")

    return df

In [41]:
# TODO: Add this method to your SmartAutoDataLoader class
def load_excel(self, source: str, **kwargs) -> 'pd.DataFrame':
        """
        Load Excel files with automatic sheet selection

        Args:
            source: Path to Excel file
            **kwargs: Additional pandas.read_excel arguments (e.g. sheet_name)

        Returns:
            Loaded and processed DataFrame
        """
        import pandas as pd
        if self.verbose:
            print(f"Loading Excel: {source}")
        try:
            # Check if file exists
            path = Path(source)
            if not path.exists():
                raise FileNotFoundError(f"Excel file not found: {source}")

            # Get sheet name from kwargs or detect best sheet
            sheet_name = kwargs.get('sheet_name', None)

            if sheet_name is None:
                # Auto-detect best sheet (README: sheet selection)
                xl_file = pd.ExcelFile(source)
                sheet_names = xl_file.sheet_names

                if self.verbose:
                    print(f"   📋 Available sheets: {sheet_names}")

                # Use the only sheet, or pick the sheet with the most columns
                if len(sheet_names) == 1:
                    sheet_name = sheet_names[0]
                else:
                    # Only one row per sheet is read (cheap), so the column
                    # count is used as a rough proxy for "most data"
                    max_cols = 0
                    best_sheet = sheet_names[0]

                    for sheet in sheet_names:
                        try:
                            temp_df = pd.read_excel(source, sheet_name=sheet, nrows=1)
                            if len(temp_df.columns) > max_cols:
                                max_cols = len(temp_df.columns)
                                best_sheet = sheet
                        except Exception:
                            # Skip sheets that cannot be read
                            continue

                    sheet_name = best_sheet

                if self.verbose:
                    print(f"   ✅ Selected sheet: '{sheet_name}'")

            # Load Excel with selected sheet
            df = pd.read_excel(source, sheet_name=sheet_name, **{k: v for k, v in kwargs.items() if k != 'sheet_name'})

            # Check if DataFrame is empty
            if df.empty:
                if self.verbose:
                    print("   ⚠️ Warning: Excel file loaded but DataFrame is empty")
                return df

            # Auto-detect and parse datetimes
            df = self.parse_datetimes(df)

            if self.verbose:
                print(f"✅ Excel loaded: {len(df)} rows, {len(df.columns)} columns")
                print(f"   📊 Column names: {list(df.columns)}")

            return df

        except Exception as e:
            error_msg = f"Error loading Excel file: {e}"
            if self.verbose:
                print(f"❌ {error_msg}")
            # Chain the original exception so the full traceback is preserved
            raise ValueError(error_msg) from e



In [42]:
def load_json(self, source: str, **kwargs) -> 'pd.DataFrame':
        """
        Load JSON files with automatic datetime parsing

        Note: pd.read_json does not flatten nested objects; nested values
        stay in their column as dicts or lists.

        Args:
            source: Path to JSON file
            **kwargs: Additional pandas.read_json arguments

        Returns:
            Loaded and processed DataFrame
        """
        import pandas as pd

        if self.verbose:
            print(f"🗂️ Loading JSON: {source}")

        # Forward caller options (e.g. orient, lines) to pandas
        df = pd.read_json(source, **kwargs)

        # Auto-detect and parse datetimes
        df = self.parse_datetimes(df)

        # Check if DataFrame is empty
        if df.empty:
            if self.verbose:
                print("   ⚠️ Warning: JSON file loaded but DataFrame is empty")
            return df

        if self.verbose:
            print(f"✅ JSON loaded: {len(df)} rows, {len(df.columns)} columns")

        return df
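
`pd.read_json` does not expand nested objects into separate columns. If your JSON records are nested, one way to flatten them is sketched below in plain Python. The helper name `flatten_record` and the dotted key format are illustrative assumptions; pandas also ships `pd.json_normalize`, which does a similar job on a list of records.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into a single level with dotted keys.

    Hypothetical helper for illustration; lists are left untouched.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


nested = {"id": 1, "user": {"name": "Ada", "address": {"city": "London"}}}
flat = flatten_record(nested)
print(flat)
```

Each flattened record is a plain dict, so a list of them can be passed straight to `pd.DataFrame(...)` to get one column per dotted key.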

## 🧪 Section 11: Test Your Implementation

**Learning Goal:** Learn how to test and validate your code

### What This Section Does:
Create test cases to verify your SmartAutoDataLoader works correctly.

### Your Task:
Test your complete implementation:

1. **Create an instance** of your SmartAutoDataLoader
2. **Test format detection** with different file extensions
3. **Test encoding detection** with a real file
4. **Test the full loading process** with a sample file
5. **Verify datetime parsing** is working
6. **Check verbose logging** is providing useful information

### Success Criteria:
- ✅ No error messages during execution
- ✅ Correct format detection for .csv, .xlsx, .json files
- ✅ Successful file loading with proper DataFrame shape
- ✅ Clear, helpful log messages
- ✅ Automatic datetime conversion when applicable
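
A log line saying a column was parsed does not guarantee the values survived. One way to check the last criterion is to measure how many values in each datetime column ended up as `NaT`. The sketch below is a minimal example; the helper name `nat_share` and the sample data are assumptions for illustration, and it shows how a format string that does not match the data turns every value into `NaT` when `errors="coerce"` is used.

```python
import pandas as pd


def nat_share(df: pd.DataFrame) -> dict:
    """Return the fraction of missing (NaT) values per datetime column.

    Hypothetical helper for illustration only.
    """
    datetime_cols = df.select_dtypes(include=["datetime"]).columns
    return {col: float(df[col].isna().mean()) for col in datetime_cols}


raw = pd.Series(["01/04/2023", "01/05/2023"])

# A format that does not match the data: every value is coerced to NaT
mismatched = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
# A format that matches the data: every value is parsed
matching = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

sample_df = pd.DataFrame({"mismatched": mismatched, "matching": matching})
report = nat_share(sample_df)
print(report)
```

Running `nat_share(df)` on your loaded DataFrame gives a quick signal: a share close to 1.0 for a column that had values in the source file suggests the detected format did not fit the data.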

### Congratulations! 🎉
If your tests pass, you've successfully built a professional-grade data loader from scratch!

**What You've Learned:**
- Object-oriented programming with classes and methods
- File handling and text encoding detection  
- Data analysis with pandas
- Regular expressions for pattern matching
- Error handling with try-except blocks
- Method composition and code organization

In [43]:
# TODO: Test your SmartAutoDataLoader implementation

# Test 1: Create an instance
print("=== Testing SmartAutoDataLoader ===")
loader = SmartAutoDataLoader(verbose=True)


# Test 2: Test format detection
print("Testing format detection:")
print(f"CSV: {loader.detect_format('test.csv')}")
print(f"Excel: {loader.detect_format('test.xlsx')}")
print(f"JSON: {loader.detect_format('test.json')}")

# Test 3: Test full loading (replace with your actual file path)
df = loader.load("/content/cohort_data.csv")
print(f"Loaded DataFrame: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"Data types:\n{df.dtypes}")

print("🎉 Ready to test your implementation!")

=== Testing SmartAutoDataLoader ===
🎯 SmartAutoDataLoader  initialized!
Testing format detection:
🔍 Detected format: csv for file: test.csv
CSV: csv
🔍 Detected format: excel for file: test.xlsx
Excel: excel
🔍 Detected format: json for file: test.json
JSON: json
Loading file: /content/cohort_data.csv
🔍 Detected format: csv for file: /content/cohort_data.csv
Loading CSV: /content/cohort_data.csv
🔤 Encoding detected: utf-8
📅 Parsed datetime column: session_start (%Y-%m-%d)
📅 Parsed datetime column: session_end (%Y-%m-%d)
📅 Parsed datetime column: birthdate (%Y-%m-%d)
📅 Parsed datetime column: sign_up_date (%Y-%m-%d)
📅 Parsed datetime column: check_in_time (%Y-%m-%d)
📅 Parsed datetime column: check_out_time (%Y-%m-%d)
📅 Parsed datetime column: departure_time (%Y-%m-%d)
📅 Parsed datetime column: return_time (%Y-%m-%d)
   📅 Total date columns found: 8
Loaded CSV: /content/cohort_data.csv (49211 rows, 41 columns)
Loaded DataFrame: (49211, 41)
Columns: ['session_id', 'user_id', 'trip_id', 'ses

In [44]:
df

Unnamed: 0,session_id,user_id,trip_id,session_start,session_end,flight_discount,hotel_discount,flight_discount_amount,hotel_discount_amount,flight_booked,...,destination_airport,seats,return_flight_booked,departure_time,return_time,checked_bags,trip_airline,destination_airport_lat,destination_airport_lon,base_fare_usd
0,529787-62d1bb241814424c840e483c5e3cc27c,529787,529787-84517d15fc57417d8b481c1ce9d219e1,NaT,NaT,False,True,,0.05,True,...,JAX,1.0,True,NaT,NaT,1.0,American Airlines,30.494,-81.688,252.07
1,540557-d35969ecb54b460abc85879d6afe5914,540557,540557-ad27b94684f044319029502f3a06ee47,NaT,NaT,False,False,,,True,...,YOW,1.0,True,NaT,NaT,2.0,Porter Airlines,45.323,-75.669,134.80
2,533048-d4aee0f0d2764bd6bbdc5dd38935cc10,533048,533048-24646a85fdc24795b42f39a72e544027,NaT,NaT,False,True,,0.15,False,...,,,,NaT,NaT,,,,,
3,539320-563701cd4360453d8627fdc7896c4a5a,539320,539320-6128d40ec84e4bc1a98a12df7cd5fe14,NaT,NaT,False,False,,,False,...,,,,NaT,NaT,,,,,
4,593385-eb3a98579fbb41c584f8450b49545fa5,593385,593385-e158202c7504428f99cc99daa2b7355f,NaT,NaT,False,True,,0.10,True,...,YYC,1.0,True,NaT,NaT,1.0,WestJet,51.114,-114.020,547.01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49206,659128-b9f79d8d13214b44b0011dcdfe97219f,659128,659128-a02f1af346a34ee7be3a82d80e928e71,NaT,NaT,False,True,,0.05,True,...,LAX,1.0,True,NaT,NaT,1.0,American Airlines,33.942,-118.408,733.29
49207,522800-0a42cbcbfcb844349348ac5e8676acc9,522800,522800-e2ed815aa13448178241b6a65b35a1aa,NaT,NaT,False,False,,,True,...,CMH,1.0,True,NaT,NaT,1.0,Southwest Airlines,39.998,-82.892,142.50
49208,542095-371e53bf2a0e463d887c5ce5339cd7d8,542095,542095-da1172e2431342eba39c1630cbf81673,NaT,NaT,False,False,,,True,...,LGA,1.0,True,NaT,NaT,1.0,Southwest Airlines,40.640,-73.779,283.94
49209,406210-aa8ddc159f934ca48b8211436e95aa6a,406210,406210-6c29806425804fe2b4da095371e25cfb,NaT,NaT,False,False,,,True,...,ITM,1.0,False,NaT,NaT,0.0,All Nippon Airways,34.785,135.438,536.61
