data-science-snippets is a modular, production-ready collection of curated, reusable Python utilities used in the day-to-day workflows of senior data scientists and machine learning engineers.
It includes tools for EDA, cleaning, validation, text processing, feature engineering, visualization, model evaluation, time series, and more, organized by task to keep your work clean and efficient.
- Covers every major step in the data science lifecycle
- Clean, modular structure by task
- Built for reusability in real-world projects
- Lightweight: depends only on pandas, numpy, matplotlib, and seaborn by default
- Compatible with Python 3.9+
data-science-snippets/
├── eda/
│   ├── most_frequent_values.py
│   ├── data_summary.py
│   ├── cardinality_report.py
│   └── basic_statistics.py
├── data_cleaning/
│   ├── missing_data_summary.py
│   ├── outlier_detection.py
│   └── duplicate_removal.py
├── preprocessing/
│   ├── minmax_scaling.py
│   ├── encoding.py
│   └── normalize_columns.py
├── loading/
│   ├── load_csv_with_info.py
│   ├── safe_parquet_loader.py
│   └── load_large_file_chunks.py
├── visualization/
│   ├── missing_data_heatmap.py
│   ├── distribution_plot.py
│   ├── correlation_matrix.py
│   └── color_palette_utils.py
├── feature_engineering/
│   ├── create_datetime_features.py
│   ├── binning.py
│   ├── interaction_terms.py
│   └── rare_label_encoding.py
├── automated_eda/
│   ├── quick_eda_report.py
│   └── profile_report_wrapper.py
├── model_evaluation/
│   ├── classification_report_extended.py
│   ├── confusion_matrix_plot.py
│   ├── cross_validation_metrics.py
│   └── roc_auc_plot.py
├── text_processing/
│   ├── clean_text.py
│   ├── tokenize_text.py
│   └── tfidf_features.py
├── time_series/
│   ├── lag_features.py
│   ├── rolling_statistics.py
│   └── datetime_indexing.py
├── modeling/
│   ├── model_training.py
│   ├── pipeline_builder.py
│   └── hyperparameter_tuner.py
├── data_validation/
│   ├── schema_check.py
│   ├── unique_constraints.py
│   └── value_range_check.py
├── utils/
│   ├── memory_optimization.py
│   ├── execution_timer.py
│   └── logging_setup.py
└── README.md
- most_frequent_values.py: Shows the most common (modal) value per column, its frequency, and its percentage of non-null values.
- data_summary.py: Summarizes dtypes, nulls, uniques, and memory usage for quick inspection.
- cardinality_report.py: Reports high-cardinality columns among categorical features.
- basic_statistics.py: Returns mean, median, min, max, std, and other summary statistics.
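
As an illustration of the modal-value report described above, here is a minimal sketch using only pandas. The function name `most_frequent_values`, its signature, and the output column names are assumptions made for this example; the file in the repository may be organized differently.

```python
import pandas as pd


def most_frequent_values(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: modal value, its count, and its share of non-null values."""
    rows = []
    for col in df.columns:
        non_null = df[col].dropna()
        if non_null.empty:
            # A column with no data has no mode.
            rows.append({"column": col, "mode": None, "frequency": 0, "percent": 0.0})
            continue
        counts = non_null.value_counts()  # sorted by count, descending
        rows.append({
            "column": col,
            "mode": counts.index[0],
            "frequency": int(counts.iloc[0]),
            "percent": round(100 * counts.iloc[0] / len(non_null), 2),
        })
    return pd.DataFrame(rows).set_index("column")


df = pd.DataFrame({"city": ["Rome", "Rome", "Oslo", None], "n": [1, 2, 2, 2]})
report = most_frequent_values(df)
print(report)
```

The percentage is computed against non-null rows only, so a missing value in `city` does not dilute the share of "Rome".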
- missing_data_summary.py: Shows missing value count and percentage per column, along with data types.
- outlier_detection.py: Detects outliers using IQR or Z-score methods.
- duplicate_removal.py: Identifies and removes duplicate rows or records.
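
The two outlier rules mentioned above can be sketched as boolean masks over a pandas Series. The function names, the default multiplier `k=1.5`, and the default threshold of 3 are assumptions chosen for illustration, not necessarily the defaults used in outlier_detection.py.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


def zscore_outlier_mask(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Hypothetical helper: flag values whose absolute z-score exceeds the threshold."""
    z = (series - series.mean()) / series.std(ddof=0)  # population std
    return z.abs() > threshold


values = pd.Series([10, 12, 11, 13, 12, 11, 95])
iqr_mask = iqr_outlier_mask(values)
z_mask = zscore_outlier_mask(values, threshold=2.0)
print(values[iqr_mask])
```

The example passes `threshold=2.0` because in a seven-value sample no single point can reach an absolute z-score of 3, which is one reason the IQR rule is often preferred for small samples.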
- minmax_scaling.py: Scales numeric values to a [0, 1] range.
- encoding.py: Label encoding and one-hot encoding utilities.
- normalize_columns.py: Z-score standardization and column normalization helpers.
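
For reference, min-max scaling maps each value to (x - min) / (max - min), and z-score standardization maps it to (x - mean) / std. A minimal pandas sketch follows; the function names and the handling of constant columns are assumptions for this example.

```python
import pandas as pd


def minmax_scale(series: pd.Series) -> pd.Series:
    """Hypothetical helper: rescale a numeric Series to the [0, 1] range."""
    span = series.max() - series.min()
    if span == 0:
        # Constant column: avoid division by zero and map everything to 0.0.
        return pd.Series(0.0, index=series.index, name=series.name)
    return (series - series.min()) / span


def zscore_standardize(series: pd.Series) -> pd.Series:
    """Hypothetical helper: center to mean 0 and scale to unit population std."""
    # A constant column has std 0 and would produce NaN here.
    return (series - series.mean()) / series.std(ddof=0)


scaled = minmax_scale(pd.Series([2, 4, 6]))
standardized = zscore_standardize(pd.Series([1.0, 2.0, 3.0]))
print(scaled.tolist())
```

When scaling features for a model, the minimum, maximum, mean, and standard deviation should come from the training split only and then be reused on validation and test data.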
- load_csv_with_info.py: Loads CSVs and prints metadata like shape, dtypes, and missing values.
- safe_parquet_loader.py: Robust parquet file loader with fallback options.
- load_large_file_chunks.py: Loads large files in chunks with progress reporting.
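
Chunked loading can be sketched with the `chunksize` parameter of `pandas.read_csv`, which returns an iterator of DataFrames instead of one large frame. The function name `load_in_chunks` and the plain `print` progress message are assumptions for this example; an in-memory string stands in for a large file.

```python
import io

import pandas as pd


def load_in_chunks(source, chunksize: int = 100_000) -> pd.DataFrame:
    """Hypothetical helper: read a CSV piece by piece and report progress."""
    parts = []
    rows_read = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        rows_read += len(chunk)
        print(f"read {rows_read} rows so far")
        parts.append(chunk)
    # Concatenating at the end still needs memory for the full frame;
    # filter or aggregate each chunk inside the loop if that is too much.
    return pd.concat(parts, ignore_index=True)


csv_text = "id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n"
df = load_in_chunks(io.StringIO(csv_text), chunksize=2)
print(df.shape)
```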
- missing_data_heatmap.py: Visualizes missing values with a Seaborn heatmap.
- distribution_plot.py: Plots distributions of numeric variables.
- correlation_matrix.py: Draws a correlation heatmap of numeric features.
- create_datetime_features.py: Extracts features like day, month, year, weekday from datetime columns.
- binning.py: Performs binning (equal-width or quantile) on continuous variables.
- interaction_terms.py: Creates interaction features (e.g., feature1 * feature2).
- rare_label_encoding.py: Groups rare categorical labels into 'Other'.
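
Two of the feature engineering helpers above, sketched with pandas. The function names, the `min_share` threshold, and the generated column suffixes are assumptions for this example rather than the repository's actual interface.

```python
import pandas as pd


def group_rare_labels(series: pd.Series, min_share: float = 0.05, other: str = "Other") -> pd.Series:
    """Hypothetical helper: replace labels rarer than min_share with a single label."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # where() keeps values for which the condition is True and replaces the rest.
    return series.where(~series.isin(rare), other)


def add_datetime_features(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: add year, month, day, and weekday columns."""
    out = df.copy()
    dt = pd.to_datetime(out[column])
    out[f"{column}_year"] = dt.dt.year
    out[f"{column}_month"] = dt.dt.month
    out[f"{column}_day"] = dt.dt.day
    out[f"{column}_weekday"] = dt.dt.weekday  # Monday is 0
    return out


colors = pd.Series(["red"] * 6 + ["blue"] * 3 + ["teal"])
grouped = group_rare_labels(colors, min_share=0.2)
events = add_datetime_features(pd.DataFrame({"ts": ["2024-03-15"]}), "ts")
print(grouped.value_counts())
```

As with scaling, the set of rare labels is best learned on training data and then applied unchanged to new data.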
- quick_eda_report.py: Generates a summary of shape, dtypes, nulls, and basic stats.
- profile_report_wrapper.py: Wrapper for pandas-profiling / ydata-profiling report generation.
- classification_report_extended.py: Displays precision, recall, F1 with support for multiple averages.
- confusion_matrix_plot.py: Annotated confusion matrix visual.
- cross_validation_metrics.py: Computes metrics across folds and aggregates results.
- roc_auc_plot.py: Plots ROC curve and calculates AUC score.
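
To make the "multiple averages" idea concrete, the sketch below computes per-class precision, recall, and F1 by hand with numpy, then combines them as a macro average (every class counts equally) and a weighted average (classes count in proportion to their support). This is a dependency-free illustration with hypothetical function names; the repository's module may well rely on scikit-learn instead.

```python
import numpy as np


def per_class_prf(y_true, y_pred) -> dict:
    """Hypothetical helper: precision, recall, F1, and support for each label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    report = {}
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp = int(np.sum((y_pred == label) & (y_true == label)))
        fp = int(np.sum((y_pred == label) & (y_true != label)))
        fn = int(np.sum((y_pred != label) & (y_true == label)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label.item()] = {
            "precision": precision, "recall": recall, "f1": f1, "support": tp + fn,
        }
    return report


def macro_f1(report: dict) -> float:
    """Unweighted mean of per-class F1."""
    return sum(m["f1"] for m in report.values()) / len(report)


def weighted_f1(report: dict) -> float:
    """Mean of per-class F1 weighted by class support."""
    total = sum(m["support"] for m in report.values())
    return sum(m["f1"] * m["support"] for m in report.values()) / total


report = per_class_prf([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
print(macro_f1(report), weighted_f1(report))
```

On imbalanced data the two averages can differ noticeably, which is why reporting both is useful.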
- clean_text.py: Removes punctuation, stopwords, and numbers, and lowercases text.
- tokenize_text.py: Word and sentence tokenizers with NLTK or spaCy support.
- tfidf_features.py: Builds TF-IDF matrix from text columns.
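
The cleaning steps listed for clean_text.py can be sketched with the standard library alone. The tiny `STOPWORDS` set below is a placeholder invented for this example; a real project would more likely take its stopword list from NLTK or spaCy.

```python
import re
import string

# Illustrative stopword list only; not the list used by the repository.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

# Map every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str, stopwords=STOPWORDS) -> str:
    """Hypothetical helper: lowercase, drop digits, punctuation, and stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove runs of digits
    text = text.translate(_PUNCT_TABLE)   # punctuation becomes whitespace
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


cleaned = clean_text("The 3 quick foxes, and the lazy dog!")
print(cleaned)
```

Replacing punctuation with spaces splits contractions such as "don't" into two tokens; whether that is acceptable depends on the downstream tokenizer.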
- lag_features.py: Generates lagged versions of a column for time-aware modeling.
- rolling_statistics.py: Rolling mean, median, std, and min/max features.
- datetime_indexing.py: Time-based slicing, filtering, and resampling helpers.
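
Lag and rolling features map directly onto `Series.shift` and `Series.rolling` in pandas. The sketch below assumes a frame already sorted by time; the function names and the generated column names are hypothetical.

```python
import pandas as pd


def add_lag_features(df: pd.DataFrame, column: str, lags) -> pd.DataFrame:
    """Hypothetical helper: add one shifted copy of `column` per lag."""
    out = df.copy()
    for lag in lags:
        out[f"{column}_lag{lag}"] = out[column].shift(lag)
    return out


def add_rolling_features(df: pd.DataFrame, column: str, window: int) -> pd.DataFrame:
    """Hypothetical helper: add rolling mean and std over a fixed-size window."""
    out = df.copy()
    rolling = out[column].rolling(window=window)
    out[f"{column}_roll_mean{window}"] = rolling.mean()
    out[f"{column}_roll_std{window}"] = rolling.std()
    return out


sales = pd.DataFrame(
    {"sales": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
features = add_rolling_features(add_lag_features(sales, "sales", [1, 2]), "sales", window=2)
print(features)
```

The rolling window here includes the current row. If the feature feeds a model that predicts the current value, shift the column by one step before rolling so the target does not leak into its own feature.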
- model_training.py: Trains scikit-learn models with optional cross-validation and logging.
- pipeline_builder.py: Builds preprocessing + modeling pipelines using Pipeline or ColumnTransformer.
- hyperparameter_tuner.py: Wraps GridSearchCV or RandomizedSearchCV with easy setup and evaluation.
- schema_check.py: Validates schema based on expected dtypes and column names.
- unique_constraints.py: Ensures unique values for IDs or compound keys.
- value_range_check.py: Checks for valid value ranges in numeric columns.
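
The three validation checks above can be sketched as small pandas functions that return findings instead of raising. All names and return shapes here are assumptions for illustration; the repository's modules may report problems differently.

```python
import pandas as pd


def check_schema(df: pd.DataFrame, expected: dict) -> list:
    """Hypothetical helper: compare columns and dtype names against `expected`."""
    problems = []
    for col, dtype in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    for col in df.columns:
        if col not in expected:
            problems.append(f"unexpected column: {col}")
    return problems


def check_unique(df: pd.DataFrame, keys: list) -> bool:
    """Hypothetical helper: True if no two rows share the same key values."""
    return not df.duplicated(subset=keys).any()


def out_of_range(series: pd.Series, low, high) -> pd.Series:
    """Hypothetical helper: values outside [low, high]; NaN is also returned."""
    return series[~series.between(low, high)]  # between() is inclusive by default


df = pd.DataFrame({
    "id": pd.Series([1, 2, 2], dtype="int64"),
    "age": pd.Series([25, 130, 40], dtype="int64"),
})
problems = check_schema(df, {"id": "int64", "age": "float64", "name": "object"})
print(problems)
```

Returning a list of problems makes it easy to log every violation in one pass rather than stopping at the first failure.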
- memory_optimization.py: Downcasts numerical columns to save memory.
- execution_timer.py: Times function execution with decorators or context managers.
- logging_setup.py: Sets up consistent logging configuration for larger projects.
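
Downcasting can be sketched with `pandas.to_numeric` and its `downcast` argument, which picks the smallest dtype that can hold a column's values. The function name `downcast_numeric` is an assumption for this example.

```python
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: shrink integer and float columns to smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 halves memory but reduces precision.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


df = pd.DataFrame({
    "small": pd.Series([1, 2, 3], dtype="int64"),
    "ratio": pd.Series([0.5, 1.5, 2.5], dtype="float64"),
    "name": ["a", "b", "c"],
})
slim = downcast_numeric(df)
print(slim.dtypes)
print(df.memory_usage(deep=True).sum(), "->", slim.memory_usage(deep=True).sum())
```

Non-numeric columns are left untouched, and the original frame is not modified.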
Copy-Paste
- Python ≥ 3.9
- pandas ≥ 1.5.3
- numpy ≥ 1.24.4
- seaborn ≥ 0.12.2
- matplotlib ≥ 3.6.3
Please see our SECURITY.md for vulnerability disclosure guidelines.
- Vataselu Andrei
- Nicola-Diana Sincaru
This project is licensed under the MIT License. See the LICENSE file for details.
We welcome contributions! If you have a reusable function or snippet that you think belongs in a senior data scientist's toolkit, feel free to open a pull request.