🧰 Essential EDA and Data Cleaning Helpers for Any DataFrame. This collection of functions is designed to accelerate exploratory data analysis (EDA), quickly surface data quality issues, and offer high-level insights into the structure and content of your dataset.

andrei-vataselu/data-science-snippets

🧠 data-science-snippets

data-science-snippets is a modular collection of production-ready Python snippets: curated, reusable utilities for the day-to-day workflows of senior data scientists and machine learning engineers.

It includes tools for EDA, cleaning, validation, text processing, feature engineering, visualization, model evaluation, time series, and more, organized by task to keep your work clean and efficient.


🚀 Features

✅ Covers every major step in the data science lifecycle
✅ Clean, modular structure by task
✅ Built for reusability in real-world projects
✅ Lightweight: only depends on pandas, numpy, matplotlib, seaborn by default
✅ Compatible with Python 3.9+


📁 Folder Structure

data-science-snippets/
├── eda/
│   ├── most_frequent_values.py
│   ├── data_summary.py
│   ├── cardinality_report.py
│   └── basic_statistics.py
├── data_cleaning/
│   ├── missing_data_summary.py
│   ├── outlier_detection.py
│   └── duplicate_removal.py
├── preprocessing/
│   ├── minmax_scaling.py
│   ├── encoding.py
│   └── normalize_columns.py
├── loading/
│   ├── load_csv_with_info.py
│   ├── safe_parquet_loader.py
│   └── load_large_file_chunks.py
├── visualization/
│   ├── missing_data_heatmap.py
│   ├── distribution_plot.py
│   ├── correlation_matrix.py
│   └── color_palette_utils.py
├── feature_engineering/
│   ├── create_datetime_features.py
│   ├── binning.py
│   ├── interaction_terms.py
│   └── rare_label_encoding.py
├── automated_eda/
│   ├── quick_eda_report.py
│   └── profile_report_wrapper.py
├── model_evaluation/
│   ├── classification_report_extended.py
│   ├── confusion_matrix_plot.py
│   ├── cross_validation_metrics.py
│   └── roc_auc_plot.py
├── text_processing/
│   ├── clean_text.py
│   ├── tokenize_text.py
│   └── tfidf_features.py
├── time_series/
│   ├── lag_features.py
│   ├── rolling_statistics.py
│   └── datetime_indexing.py
├── modeling/
│   ├── model_training.py
│   ├── pipeline_builder.py
│   └── hyperparameter_tuner.py
├── data_validation/
│   ├── schema_check.py
│   ├── unique_constraints.py
│   └── value_range_check.py
├── utils/
│   ├── memory_optimization.py
│   ├── execution_timer.py
│   └── logging_setup.py
└── README.md

🔹 eda/ – Exploratory Data Analysis

  • most_frequent_values.py: Shows the most common (modal) value per column, its frequency, and its percentage of non-null values.
  • data_summary.py: Summarizes dtypes, nulls, uniques, and memory usage for quick inspection.
  • cardinality_report.py: Flags categorical columns with high cardinality.
  • basic_statistics.py: Returns mean, median, min, max, std, and other summary statistics.
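
As a rough illustration of the kind of helper described for most_frequent_values.py, here is a minimal sketch. The function name, its signature, and the output column names are assumptions made for this example and are not taken from the repository's code.

```python
import pandas as pd


def most_frequent_values(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: modal value, its count, and its share of non-null values."""
    rows = []
    for col in df.columns:
        non_null = df[col].dropna()
        if non_null.empty:
            # Column is entirely null, so there is no mode to report.
            rows.append({"column": col, "most_frequent": None, "count": 0, "percent": 0.0})
            continue
        counts = non_null.value_counts()  # sorted by frequency, descending
        top_value, top_count = counts.index[0], int(counts.iloc[0])
        rows.append({
            "column": col,
            "most_frequent": top_value,
            "count": top_count,
            # Percentage is computed against non-null rows only.
            "percent": round(100 * top_count / len(non_null), 2),
        })
    return pd.DataFrame(rows).set_index("column")


df = pd.DataFrame({"city": ["Cluj", "Cluj", "Iasi", None], "score": [1, 2, 2, 2]})
report = most_frequent_values(df)
print(report)
```

Ties are broken by whichever value `value_counts` lists first, which may or may not match the behavior of the actual snippet.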

🔹 data_cleaning/

  • missing_data_summary.py: Shows missing value count and percentage per column, along with data types.
  • outlier_detection.py: Detects outliers using IQR or Z-score methods.
  • duplicate_removal.py: Identifies and removes duplicate rows or records.
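
The IQR method mentioned for outlier_detection.py can be sketched as follows. This is an illustrative version only: the function name `iqr_outlier_mask` and the multiplier parameter `k` are assumptions, and the repository's implementation may differ.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN evaluate to False, so missing values are never flagged.
    return (series < lower) | (series > upper)


values = pd.Series([10, 12, 11, 13, 12, 95])
mask = iqr_outlier_mask(values)
print(values[mask])  # only the value 95 falls outside the fences
```

A multiplier of 1.5 is the conventional default; a larger `k` flags only more extreme points.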

🔹 preprocessing/

  • minmax_scaling.py: Scales numeric values to a [0, 1] range.
  • encoding.py: Label encoding and one-hot encoding utilities.
  • normalize_columns.py: Z-score standardization and column normalization helpers.
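
Min-max scaling to [0, 1], as described for minmax_scaling.py, amounts to (x - min) / (max - min) per column. The sketch below is one plain-pandas way to do it; the name `minmax_scale` and the handling of constant columns are assumptions for illustration.

```python
import pandas as pd


def minmax_scale(df: pd.DataFrame, columns=None) -> pd.DataFrame:
    """Hypothetical helper: rescale numeric columns to the [0, 1] range."""
    scaled = df.copy()
    # Default to every numeric column when no explicit list is given.
    cols = columns if columns is not None else df.select_dtypes(include="number").columns
    for col in cols:
        col_min, col_max = df[col].min(), df[col].max()
        span = col_max - col_min
        # Assumption: a constant column maps to 0.0 to avoid dividing by zero.
        scaled[col] = 0.0 if span == 0 else (df[col] - col_min) / span
    return scaled


df = pd.DataFrame({"a": [0, 5, 10], "b": [3, 3, 3], "name": ["x", "y", "z"]})
scaled = minmax_scale(df)
print(scaled)
```

Non-numeric columns such as `name` pass through unchanged, and the input DataFrame is not modified.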

🔹 loading/

  • load_csv_with_info.py: Loads CSVs and prints metadata like shape, dtypes, and missing values.
  • safe_parquet_loader.py: Robust parquet file loader with fallback options.
  • load_large_file_chunks.py: Loads large files in chunks with progress reporting.
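
Chunked loading with progress reporting, as described for load_large_file_chunks.py, could look roughly like the sketch below. It relies on the `chunksize` parameter of `pandas.read_csv`; the function name `load_in_chunks` and the progress message format are assumptions, and an in-memory CSV stands in for a large file.

```python
import io

import pandas as pd


def load_in_chunks(source, chunksize: int = 100_000, **read_csv_kwargs) -> pd.DataFrame:
    """Hypothetical helper: read a CSV piece by piece and report progress."""
    chunks = []
    rows_read = 0
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        rows_read += len(chunk)
        print(f"Read {rows_read} rows so far")
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)


# A small in-memory CSV used in place of a large file on disk.
csv_text = "id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n"
df = load_in_chunks(io.StringIO(csv_text), chunksize=2)
print(df.shape)
```

Concatenating all chunks still needs enough memory for the full table; for files that do not fit, each chunk would be filtered or aggregated inside the loop instead.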

🔹 visualization/

  • missing_data_heatmap.py: Visualizes missing values with a Seaborn heatmap.
  • distribution_plot.py: Plots distributions of numeric variables.
  • correlation_matrix.py: Draws a correlation heatmap of numeric features.

🔹 feature_engineering/

  • create_datetime_features.py: Extracts features like day, month, year, weekday from datetime columns.
  • binning.py: Performs binning (equal-width or quantile) on continuous variables.
  • interaction_terms.py: Creates interaction features (e.g., feature1 * feature2).
  • rare_label_encoding.py: Groups rare categorical labels into 'Other'.
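
Grouping rare categorical labels into 'Other', as described for rare_label_encoding.py, can be sketched like this. The function name `group_rare_labels` and the `min_freq` threshold are assumptions chosen for the example, and the sketch assumes an object-dtype (string) column rather than a pandas Categorical.

```python
import pandas as pd


def group_rare_labels(series: pd.Series, min_freq: float = 0.05,
                      other_label: str = "Other") -> pd.Series:
    """Hypothetical helper: replace labels rarer than min_freq with other_label."""
    # Relative frequency of each label among non-null values.
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < min_freq].index
    # Keep values that are not rare; missing values stay missing.
    return series.where(~series.isin(rare), other_label)


colors = pd.Series(["red"] * 10 + ["blue"] * 9 + ["teal"])
grouped = group_rare_labels(colors, min_freq=0.10)
print(grouped.value_counts())
```

In a modeling workflow the set of rare labels would normally be learned on the training data and then reused on validation and test data.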

🔹 automated_eda/

  • quick_eda_report.py: Generates a summary of shape, dtypes, nulls, basic stats.
  • profile_report_wrapper.py: Wrapper for pandas-profiling / ydata-profiling report generation.

🔹 model_evaluation/

  • classification_report_extended.py: Displays precision, recall, F1 with support for multiple averages.
  • confusion_matrix_plot.py: Annotated confusion matrix visual.
  • cross_validation_metrics.py: Computes metrics across folds and aggregates results.
  • roc_auc_plot.py: Plots ROC curve and calculates AUC score.
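
To make the precision, recall, and F1 terminology above concrete, here is a plain-NumPy sketch that computes them per class and then macro-averages F1. It is not the repository's code (which may well rely on scikit-learn); `per_class_metrics` and `macro_f1` are hypothetical names.

```python
import numpy as np


def per_class_metrics(y_true, y_pred) -> dict:
    """Hypothetical helper: precision, recall, F1, and support for each label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    metrics = {}
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp = int(np.sum((y_pred == label) & (y_true == label)))
        fp = int(np.sum((y_pred == label) & (y_true != label)))
        fn = int(np.sum((y_pred != label) & (y_true == label)))
        # Assumption: undefined ratios (zero denominator) are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label.item()] = {
            "precision": precision, "recall": recall, "f1": f1, "support": tp + fn,
        }
    return metrics


def macro_f1(metrics: dict) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)


y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
metrics = per_class_metrics(y_true, y_pred)
print(metrics)
print(round(macro_f1(metrics), 4))
```

A weighted average would instead weight each class's F1 by its support.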

🔹 text_processing/

  • clean_text.py: Removes punctuation, stopwords, numbers, and lowercases text.
  • tokenize_text.py: Word and sentence tokenizers with NLTK or spaCy support.
  • tfidf_features.py: Builds TF-IDF matrix from text columns.
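
The cleaning steps listed for clean_text.py (lowercasing, removing numbers, punctuation, and stopwords) can be sketched with the standard library alone. The stopword set below is a tiny illustrative sample, not a real stopword list, and the function signature is an assumption.

```python
import re
import string

# Tiny illustrative stopword set; a real list (e.g. from NLTK) would be much larger.
STOPWORDS = frozenset({"a", "an", "and", "the", "is", "in", "of", "to"})


def clean_text(text: str, stopwords=STOPWORDS) -> str:
    """Hypothetical helper: lowercase, strip digits and punctuation, drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # replace digit runs with a space
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop ASCII punctuation
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


cleaned = clean_text("The 3 cats, and 2 dogs, ran to the park in 2024!")
print(cleaned)  # cats dogs ran park
```

Only ASCII punctuation is removed here; typographic quotes and other Unicode punctuation would need extra handling.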

🔹 time_series/

  • lag_features.py: Generates lagged versions of a column for time-aware modeling.
  • rolling_statistics.py: Rolling mean, median, std, and min/max features.
  • datetime_indexing.py: Time-based slicing, filtering, and resampling helpers.
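
Lag and rolling features of the kind described above are typically built with `Series.shift` and `Series.rolling`. The sketch below assumes rows are already sorted by time; the function names and the generated column names are made up for this example.

```python
import pandas as pd


def add_lag_features(df: pd.DataFrame, column: str, lags=(1, 2)) -> pd.DataFrame:
    """Hypothetical helper: add one shifted copy of `column` per lag."""
    out = df.copy()
    for lag in lags:
        # shift(lag) moves values down, so each row sees the value from `lag` steps earlier.
        out[f"{column}_lag_{lag}"] = out[column].shift(lag)
    return out


def add_rolling_mean(df: pd.DataFrame, column: str, window: int = 3) -> pd.DataFrame:
    """Hypothetical helper: add a trailing rolling mean over `window` rows."""
    out = df.copy()
    out[f"{column}_rollmean_{window}"] = out[column].rolling(window).mean()
    return out


sales = pd.DataFrame(
    {"sales": [10, 20, 30, 40]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
features = add_rolling_mean(add_lag_features(sales, "sales"), "sales", window=2)
print(features)
```

The first rows of each new column are NaN because no earlier observations exist. Note that the rolling window here includes the current row; shifting the column first avoids leaking the current value into a feature used to predict it.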

🔹 modeling/

  • model_training.py: Trains scikit-learn models with optional cross-validation and logging.
  • pipeline_builder.py: Builds preprocessing + modeling pipelines using Pipeline or ColumnTransformer.
  • hyperparameter_tuner.py: Wraps GridSearchCV or RandomizedSearchCV with easy setup and evaluation.

🔹 data_validation/

  • schema_check.py: Validates schema based on expected dtypes and column names.
  • unique_constraints.py: Ensures unique values for IDs or compound keys.
  • value_range_check.py: Checks for valid value ranges in numeric columns.
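
A schema check against expected column names and dtypes, as described for schema_check.py, might be sketched as below. The function name `check_schema`, the dict-based schema format, and the wording of the messages are assumptions for illustration.

```python
import pandas as pd


def check_schema(df: pd.DataFrame, expected: dict) -> list:
    """Hypothetical helper: compare a DataFrame against {column: dtype string}."""
    problems = []
    for col, dtype in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    for col in df.columns:
        if col not in expected:
            problems.append(f"unexpected column: {col}")
    return problems


df = pd.DataFrame({"id": [1, 2], "price": [1.5, 2.0], "note": ["a", "b"]})
expected = {"id": "int64", "price": "float64", "qty": "int64"}
problems = check_schema(df, expected)
print(problems)
```

Returning a list of problems rather than raising on the first one lets a caller report every mismatch at once; an empty list means the frame matches the expected schema.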

🔹 utils/

  • memory_optimization.py: Downcasts numerical columns to save memory.
  • execution_timer.py: Times function execution with decorators or context managers.
  • logging_setup.py: Sets up consistent logging configuration for larger projects.
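
A timing decorator of the kind described for execution_timer.py can be written with the standard library only. This is a sketch under assumptions: the decorator name `timed` and the printed message format are invented for the example.

```python
import functools
import time


def timed(func):
    """Hypothetical decorator: print how long each call to `func` takes."""
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failed calls are timed too.
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} took {elapsed:.4f}s")
    return wrapper


@timed
def slow_sum(n: int) -> int:
    return sum(range(n))


result = slow_sum(1_000)
print(result)
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.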

🛠️ Usage

Copy-paste the snippets you need into your project 📦

📚 Requirements

  • Python ≥ 3.9
  • pandas ≥ 1.5.3
  • numpy ≥ 1.24.4
  • seaborn ≥ 0.12.2
  • matplotlib ≥ 3.6.3

🔐 Security

Please see our SECURITY.md for vulnerability disclosure guidelines.


👥 Authors

  • Vataselu Andrei
  • Nicola-Diana Sincaru

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


🌟 Contributions

We welcome contributions! If you have a reusable function or snippet that you think belongs in a senior data scientist's toolkit, feel free to open a pull request.
