# Dataset Evaluation Process
## Systematic Selection for Data Science Portfolio

**Date:** June 28, 2025  
**Purpose:** Evaluate 6 available datasets to select the optimal one for portfolio development  
**Methodology:** Based on Data Quality, Business Relevance, and Technical Complexity

---

## 🎯 Evaluation Objective

This notebook systematically evaluates all available datasets to identify which one best demonstrates data science skills while providing meaningful business insights for potential employers.

### Evaluation Criteria:
1. **Data Quality Assessment** 
2. **Business Relevance Evaluation**   
3. **Technical Complexity & Skill Showcase** 

## 1. Import Required Libraries

We import pandas for data manipulation, NumPy for numerical operations, `os` for file discovery, and `warnings` to keep the output readable.

In [8]:
# Import required libraries for dataset evaluation
import pandas as pd
import numpy as np
import os
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")
print("📊 Ready to begin systematic dataset evaluation...")


✅ Libraries imported successfully!
📊 Ready to begin systematic dataset evaluation...


## 2. Data Quality Assessment Functions

We define two functions: one computes completeness, uniqueness, and an overall quality score for a DataFrame, and the other prints those metrics. Applying the same functions to every dataset keeps the evaluation consistent.

In [9]:
def calculate_data_quality_metrics(df, dataset_name):
    """
    Calculate comprehensive data quality metrics for a dataset
    
    Parameters:
    - df: pandas DataFrame to evaluate
    - dataset_name: string name of the dataset
    
    Returns:
    - quality_metrics: dictionary containing all quality metrics
    """
    
    # Basic metrics
    total_rows = df.shape[0]
    total_columns = df.shape[1]
    total_cells = total_rows * total_columns
    
    # Missing data metrics
    missing_cells = df.isnull().sum().sum()
    missing_ratio = missing_cells / total_cells if total_cells > 0 else 0
    completeness_score = (1 - missing_ratio) * 100
    
    # Duplicate analysis
    duplicate_rows = df.duplicated().sum()
    duplicate_ratio = duplicate_rows / total_rows if total_rows > 0 else 0
    uniqueness_score = (1 - duplicate_ratio) * 100
    
    # Column counts by data type (numeric, object, datetime)
    numeric_cols = len(df.select_dtypes(include=[np.number]).columns)
    object_cols = len(df.select_dtypes(include=['object']).columns)
    datetime_cols = len(df.select_dtypes(include=['datetime64']).columns)
    
    # Overall quality score (weighted average: 50% completeness, 50% uniqueness)
    overall_quality = (completeness_score * 0.5 + uniqueness_score * 0.5)
    
    quality_metrics = {
        'dataset_name': dataset_name,
        'total_rows': total_rows,
        'total_columns': total_columns,
        'missing_cells': missing_cells,
        'missing_ratio': missing_ratio,
        'completeness_score': completeness_score,
        'duplicate_rows': duplicate_rows,
        'uniqueness_score': uniqueness_score,
        'numeric_columns': numeric_cols,
        'object_columns': object_cols,
        'datetime_columns': datetime_cols,
        'overall_quality_score': overall_quality
    }
    
    return quality_metrics

print("✅ Data quality metrics function defined!")

✅ Data quality metrics function defined!
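
As a quick sanity check of the scoring formulas used in `calculate_data_quality_metrics`, the sketch below applies them to a small hand-made DataFrame. The toy data and variable names are illustrative assumptions and are not part of the evaluated datasets.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 4 rows x 2 columns, one missing cell, one exact duplicate row
toy = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, np.nan],
    "b": ["x", "y", "y", "z"],
})

total_cells = toy.shape[0] * toy.shape[1]                 # 8 cells
missing_cells = int(toy.isnull().sum().sum())             # 1 missing cell
completeness = (1 - missing_cells / total_cells) * 100    # 87.5

duplicate_rows = int(toy.duplicated().sum())              # 1 (third row repeats the second)
uniqueness = (1 - duplicate_rows / toy.shape[0]) * 100    # 75.0

# Equal weighting, as in the notebook's overall quality score
overall = 0.5 * completeness + 0.5 * uniqueness           # 81.25
print(f"completeness={completeness:.1f}, uniqueness={uniqueness:.1f}, overall={overall:.2f}")
```

Note that `DataFrame.duplicated()` flags only the second and later occurrences of a row, so a row that appears twice counts as one duplicate.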


In [10]:
def display_quality_metrics(quality_metrics):
    """
    Display data quality metrics in a formatted, professional way
    
    Parameters:
    - quality_metrics: dictionary containing quality metrics from calculate_data_quality_metrics
    """
    
    print(f"\n📊 DATA QUALITY METRICS:")
    print(f"  🎯 Completeness Score: {quality_metrics['completeness_score']:.1f}%")
    print(f"  🔄 Uniqueness Score: {quality_metrics['uniqueness_score']:.1f}%")
    print(f"  🏆 Overall Quality Score: {quality_metrics['overall_quality_score']:.1f}%")
    
    if quality_metrics['duplicate_rows'] > 0:
        print(f"  ⚠️  Duplicate rows: {quality_metrics['duplicate_rows']:,}")
    
    # Quality assessment with visual indicators
    score = quality_metrics['overall_quality_score']
    if score >= 90:
        quality_level = "Excellent ⭐⭐⭐"
    elif score >= 75:
        quality_level = "Good ⭐⭐"
    elif score >= 60:
        quality_level = "Fair ⭐"
    else:
        quality_level = "Poor ⚠️"
    
    print(f"  📈 Quality Level: {quality_level}")

print("✅ Display quality metrics function defined!")

✅ Display quality metrics function defined!
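
The quality levels use inclusive lower bounds, so a score of exactly 90 is rated Excellent while 89.9 is rated Good. The sketch below isolates that mapping in a hypothetical helper, `quality_level`, which is not defined in the notebook itself.

```python
def quality_level(score: float) -> str:
    """Map an overall quality score (0-100) to the label thresholds used above."""
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 60:
        return "Fair"
    return "Poor"

# Probe the boundaries of each band
for s in (90.0, 89.9, 75.0, 60.0, 59.9):
    print(f"{s:5.1f} -> {quality_level(s)}")
```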


## 3. Dataset Loading and Inspection Functions

These functions handle the loading and comprehensive inspection of both CSV files and Excel files (including multi-sheet Excel files). They provide detailed information about each dataset's structure, data types, and quality metrics.

In [11]:
def display_dataset_info(filename, dataset_name, df=None, excel_data=None):
    """
    Display comprehensive information for a dataset (CSV or Excel with all sheets)
    
    Parameters:
    - filename: name of the file
    - dataset_name: clean name for display
    - df: pandas DataFrame (for CSV files)
    - excel_data: dictionary of DataFrames (for Excel files)
    """
    
    print(f"\n{'='*60}")
    print(f"DATASET: {dataset_name.upper()}")
    print(f"File: {filename}")
    
    # Handle CSV files
    if df is not None and excel_data is None:
        print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
        print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
        
        print(f"\nColumn Names:")
        for i, col in enumerate(df.columns, 1):
            print(f"  {i:2d}. {col}")
        
        print(f"\nData Types:")
        print(df.dtypes.value_counts())
        
        # Missing data analysis
        print(f"\nMissing Data Analysis:")
        total_cells = df.shape[0] * df.shape[1]
        missing_cells = df.isnull().sum().sum()
        missing_percentage = (missing_cells / total_cells) * 100 if total_cells > 0 else 0
        
        print(f"  • Total missing cells: {missing_cells:,} ({missing_percentage:.1f}%)")
        
        # Calculate and display quality metrics for CSV
        quality_metrics = calculate_data_quality_metrics(df, dataset_name)
        display_quality_metrics(quality_metrics)
    
    # Handle Excel files with all sheets
    elif excel_data is not None:
        sheet_names = list(excel_data.keys())
        total_sheets = len(sheet_names)
        print(f"📋 File Type: Excel with {total_sheets} sheet(s)")
        print(f"📄 Sheet Names: {sheet_names}")
        
        # Calculate total statistics across all sheets
        total_rows = sum(excel_data[sheet].shape[0] for sheet in sheet_names)
        total_memory = sum(excel_data[sheet].memory_usage(deep=True).sum() for sheet in sheet_names) / 1024**2
        
        print(f"📊 Total Data: {total_rows:,} rows across all sheets")
        print(f"💾 Total Memory usage: {total_memory:.2f} MB")
        
        # Display information for each sheet
        for i, sheet_name in enumerate(sheet_names, 1):
            sheet_df = excel_data[sheet_name]
            print(f"\n{'-'*50}")
            print(f"SHEET {i}/{total_sheets}: {sheet_name.upper()}")
            print(f"{'-'*50}")
            print(f"📊 Shape: {sheet_df.shape[0]:,} rows × {sheet_df.shape[1]} columns")
            
            print(f"\n📋 Column Names:")
            for j, col in enumerate(sheet_df.columns, 1):
                print(f"  {j:2d}. {col}")
            
            print(f"\n🔍 Data Types:")
            print(sheet_df.dtypes.value_counts())
            
            # Missing data analysis for each sheet
            print(f"\n🕳️  Missing Data Analysis:")
            total_cells = sheet_df.shape[0] * sheet_df.shape[1]
            missing_cells = sheet_df.isnull().sum().sum()
            missing_percentage = (missing_cells / total_cells) * 100 if total_cells > 0 else 0
            
            print(f"  • Total missing cells: {missing_cells:,} ({missing_percentage:.1f}%)")
            
            # Calculate and display quality metrics for each sheet
            sheet_quality_metrics = calculate_data_quality_metrics(sheet_df, f"{dataset_name}_{sheet_name}")
            display_quality_metrics(sheet_quality_metrics)
        
        # Calculate overall quality metrics for the entire Excel file (all sheets combined)
        print(f"\n{'='*50}")
        print(f"OVERALL FILE QUALITY: {dataset_name.upper()}")
        print(f"{'='*50}")
        
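        # Combine all sheets into one large DataFrame for overall analysis.
        # Note: pd.concat aligns on column names, so sheets with different columns
        # are padded with NaN, which lowers the combined completeness score.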
        combined_df = pd.concat(excel_data.values(), ignore_index=True)
        overall_quality_metrics = calculate_data_quality_metrics(combined_df, f"{dataset_name}_OVERALL")
        
        print(f"📊 Combined Analysis Across All {total_sheets} Sheets:")
        display_quality_metrics(overall_quality_metrics)

print("✅ Comprehensive dataset display function defined!")

✅ Comprehensive dataset display function defined!
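
One caveat about the combined Excel analysis: `pd.concat` aligns sheets on column names and fills the gaps with `NaN`, so sheets with different schemas look less complete together than they are individually. The sketch below shows the effect on two hypothetical sheets and, as one possible alternative (an assumption, not the notebook's method), pools missing cells and total cells across sheets instead.

```python
import pandas as pd

# Two hypothetical sheets with no columns in common and no missing values
orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
returns = pd.DataFrame({"returned": ["Yes", "No"]})

def completeness(df):
    """Percentage of non-missing cells in a DataFrame."""
    total = df.shape[0] * df.shape[1]
    ratio = df.isnull().sum().sum() / total if total > 0 else 0
    return (1 - ratio) * 100

# Stacking the sheets pads the non-shared columns with NaN
combined = pd.concat([orders, returns], ignore_index=True)
print(completeness(orders), completeness(returns))   # both sheets are complete
print(combined.shape, round(completeness(combined), 1))

# Alternative: pool the counts per sheet so no artificial NaN values are introduced
sheets = [orders, returns]
pooled_missing = sum(int(s.isnull().sum().sum()) for s in sheets)
pooled_cells = sum(s.shape[0] * s.shape[1] for s in sheets)
pooled = (1 - pooled_missing / pooled_cells) * 100
print(pooled)
```

Here the stacked frame has 15 cells, 7 of which are padding, even though neither sheet has a missing value.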


In [12]:
def load_and_inspect_dataset(filename, dataset_name):
    """
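    Load a dataset, store it in the global `datasets` dictionary, and print its summary.
    Handles both CSV and Excel files (including multi-sheet Excel files).
    
    Relies on the module-level `data_path` and `datasets` variables, which must
    be defined before this function is called.
    
    Parameters:
    - filename: name of the file to load
    - dataset_name: clean name for the dataset
    
    Returns:
    - tuple: (True, "Success") on success, or (None, error_message) on failure
    """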
    try:
        file_path = os.path.join(data_path, filename)
        
        # Load dataset based on file extension
        if filename.endswith('.csv'):
            df = pd.read_csv(file_path)
            
            # Store dataset in global dictionary
            datasets[dataset_name] = df
            
            # Display comprehensive information for CSV
            display_dataset_info(filename, dataset_name, df=df)
            
            return True, "Success"
            
        elif filename.endswith(('.xlsx', '.xls')):
            # Load all sheets from Excel file
            excel_file = pd.ExcelFile(file_path)
            sheet_names = excel_file.sheet_names
            
            # Load all sheets into a dictionary
            excel_data = {}
            for sheet_name in sheet_names:
                # Parse from the already-open workbook instead of re-reading the file
                sheet_df = excel_file.parse(sheet_name)
                excel_data[sheet_name] = sheet_df
                
                # Store each sheet individually in datasets dictionary
                if len(sheet_names) == 1:
                    sheet_key = dataset_name
                else:
                    sheet_key = f"{dataset_name}_{sheet_name}"
                datasets[sheet_key] = sheet_df
            
            # Display comprehensive information for all sheets
            display_dataset_info(filename, dataset_name, excel_data=excel_data)
            
            return True, "Success"
                
        else:
            return None, f"Unsupported file format for {filename}"
        
    except Exception as e:
        return None, f"Error loading {filename}: {str(e)}"

print("✅ Dataset loading function defined!")

✅ Dataset loading function defined!


## 4. Dataset Loading and Evaluation Execution

Now we'll execute our systematic evaluation process across all available datasets in the `Datasource/` directory. This will automatically discover, load, and evaluate each dataset using our comprehensive framework.
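
The discovery step can be sketched in isolation as follows. This is a minimal illustration using `pathlib` and a throwaway directory. The helper name `discover_datasets` and the file names are hypothetical, and unlike the notebook's `endswith` checks it matches extensions case-insensitively.

```python
import tempfile
from pathlib import Path

SUPPORTED = {".csv", ".xlsx", ".xls"}  # the same extensions the notebook handles

def discover_datasets(folder):
    """Return supported data files in `folder`, skipping hidden and temporary files."""
    found = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        # Skip hidden files and the "~$" temporary files left by an open workbook
        if path.name.startswith(".") or path.name.startswith("~$"):
            continue
        if path.suffix.lower() in SUPPORTED:
            found.append(path)
    return found

# Demonstration on a temporary directory with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["sales.csv", "REPORT.XLSX", "~$REPORT.XLSX", ".hidden.csv", "notes.txt"]:
        (Path(tmp) / name).touch()
    names = [p.name for p in discover_datasets(tmp)]
    print(names)
```

Sorting the directory listing makes the evaluation order reproducible, which `os.listdir` alone does not guarantee.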

## 5. Evaluation Summary and Dataset Selection

Based on our systematic evaluation, we can now make an informed decision about which dataset best meets our criteria for a senior-level data science portfolio project.

### Key Findings:
- **Data Quality Scores:** Computed per dataset as the equal-weighted average of completeness and uniqueness
- **Business Relevance:** Assessed based on real-world applicability  
- **Technical Complexity:** Evaluated for demonstration of diverse skills

### Next Steps:
1. Review the quality metrics reported for each dataset
2. Select the optimal dataset based on our evaluation framework
3. Proceed with comprehensive analysis of the selected dataset
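
To make the comparison step concrete, the per-dataset scores can be collected into a single table and sorted. The sketch below uses three hypothetical stand-in DataFrames rather than the real files, and it ranks by data quality only. Business relevance and technical complexity are not computed here.

```python
import numpy as np
import pandas as pd

def overall_quality(df):
    """Equal-weighted mean of completeness and uniqueness, as a percentage."""
    cells = df.shape[0] * df.shape[1]
    missing_ratio = df.isnull().sum().sum() / cells if cells > 0 else 0
    duplicate_ratio = df.duplicated().sum() / len(df) if len(df) > 0 else 0
    return 0.5 * (1 - missing_ratio) * 100 + 0.5 * (1 - duplicate_ratio) * 100

# Hypothetical stand-ins for the loaded datasets
toy_datasets = {
    "clean": pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}),
    "gappy": pd.DataFrame({"a": [1, np.nan, 3, np.nan], "b": [5, 6, np.nan, 8]}),
    "repeated": pd.DataFrame({"a": [1, 1, 2, 2], "b": [5, 5, 6, 6]}),
}

# One row per dataset, best overall quality first
ranking = (
    pd.DataFrame(
        [
            {"dataset": name, "rows": len(df), "overall_quality": overall_quality(df)}
            for name, df in toy_datasets.items()
        ]
    )
    .sort_values("overall_quality", ascending=False)
    .reset_index(drop=True)
)
print(ranking)
```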

In [13]:
# Initialize data path and storage
data_path = "../Datasource/"  # Relative path from notebook location
datasets = {}  # Dictionary to store all loaded datasets
count = 0      # Counter for successfully loaded datasets

print("🚀 Starting systematic dataset evaluation...")
print(f"📁 Looking for datasets in: {data_path}")
print("="*60)

# Load and evaluate all datasets from the specified directory
for filename in os.listdir(data_path):
    # Skip temporary files and hidden files
    if filename.startswith('.') or filename.startswith('~$'):
        continue
        
    # Only process supported file types
    if not filename.endswith(('.csv', '.xlsx', '.xls')):
        continue
        
    # Extract clean dataset name from filename
    dataset_name = os.path.splitext(filename)[0]
    
    # Load and inspect dataset
    result, message = load_and_inspect_dataset(filename, dataset_name)
    if result is None:
        print(f"❌ {message}")
    else:
        count += 1

print(f"\n{'='*60}")
print(f"✅ EVALUATION COMPLETE!")
print(f"📊 Successfully loaded and evaluated {count} datasets from '{data_path}'")
print(f"💾 All datasets stored in 'datasets' dictionary for further analysis")
print("="*60)

🚀 Starting systematic dataset evaluation...
📁 Looking for datasets in: ../Datasource/

DATASET: TITANIC PASSENGER LIST
File: titanic passenger list.csv
Shape: 1,309 rows × 14 columns
Memory usage: 0.52 MB

Column Names:
   1. pclass
   2. survived
   3. name
   4. sex
   5. age
   6. sibsp
   7. parch
   8. ticket
   9. fare
  10. cabin
  11. embarked
  12. boat
  13. body
  14. home.dest

Data Types:
object     7
int64      4
float64    3
Name: count, dtype: int64

Missing Data Analysis:
  • Total missing cells: 3,855 (21.0%)

📊 DATA QUALITY METRICS:
  🎯 Completeness Score: 79.0%
  🔄 Uniqueness Score: 100.0%
  🏆 Overall Quality Score: 89.5%
  📈 Quality Level: Good ⭐⭐
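
The Titanic figures can be reproduced by hand from the reported shape and missing-cell count. The short check below uses only numbers shown in the output above. The zero duplicate count is inferred from the 100.0% uniqueness score and the absence of a duplicate-rows warning.

```python
# Figures taken from the Titanic output above
rows, cols = 1309, 14
missing_cells = 3855
duplicate_rows = 0  # inferred from the 100.0% uniqueness score

total_cells = rows * cols                                # 18,326 cells
completeness = (1 - missing_cells / total_cells) * 100
uniqueness = (1 - duplicate_rows / rows) * 100
overall = 0.5 * completeness + 0.5 * uniqueness

print(f"{completeness:.1f} {uniqueness:.1f} {overall:.1f}")
```

The overall score lands about half a point below the 90 threshold, which is why the dataset is rated Good rather than Excellent.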


In [None]:
# Display summary of all loaded datasets
print("📋 DATASET INVENTORY SUMMARY")
print("="*50)
print(f"Total datasets loaded: {len(datasets)}")
print("\nDataset keys in memory:")
for i, dataset_key in enumerate(sorted(datasets.keys()), 1):
    df = datasets[dataset_key]
    print(f"  {i:2d}. {dataset_key}: {df.shape[0]:,} rows × {df.shape[1]} columns")

print("\n" + "="*50)
print("🎯 RECOMMENDATION: Based on our systematic evaluation,")
print("the AIRBNB dataset provides the optimal balance of:")
print("✅ High data quality (97.8% overall score)")
print("✅ Strong business relevance (pricing/location/beds/room types/Property Type)")  
print("✅ Technical complexity (geospatial, pricing, multi-feature)")
print("✅ Portfolio impact (demonstrates diverse DS skills)")
print("="*50)

📋 DATASET INVENTORY SUMMARY
Total datasets loaded: 12

Dataset keys in memory:
   1. SpotifyFeatures: 232,725 rows × 18 columns
   2. ai_adoption_dataset: 145,000 rows × 9 columns
   3. airbnb: 30,478 rows × 13 columns
   4. netflix_titles_netflix_titles: 6,236 rows × 9 columns
   5. netflix_titles_netflix_titles_cast: 44,311 rows × 2 columns
   6. netflix_titles_netflix_titles_category: 13,670 rows × 2 columns
   7. netflix_titles_netflix_titles_countries: 7,179 rows × 2 columns
   8. netflix_titles_netflix_titles_directors: 4,852 rows × 2 columns
   9. sample_-_superstore_Orders: 9,994 rows × 21 columns
  10. sample_-_superstore_People: 4 rows × 2 columns
  11. sample_-_superstore_Returns: 296 rows × 2 columns
  12. titanic passenger list: 1,309 rows × 14 columns

🎯 RECOMMENDATION: Based on our systematic evaluation,
the AIRBNB dataset provides the optimal balance of:
✅ High data quality (97.8% overall score)
✅ Strong business relevance (real estate/hospitality)
✅ Technical complexit