Awesome Data Science with Python

A curated list of awesome resources for practicing data science using Python, including not only libraries, but also links to tutorials, code snippets, blog posts and talks.


pandas - Data structures built on top of numpy.
scikit-learn - Core ML library.
matplotlib - Plotting library.
seaborn - Data visualization library based on matplotlib.
pandas_summary - Basic statistics using DataFrameSummary(df).summary().
pandas_profiling - Descriptive statistics using ProfileReport.
sklearn_pandas - Helpful DataFrameMapper class.
janitor - Clean messy column names.
missingno - Missing data visualization.
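
A quick first look at a DataFrame needs nothing beyond pandas itself. The sketch below is not the pandas_summary or pandas_profiling API; it is a plain-pandas approximation of the same kind of overview, and the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (made-up values) with some missing entries.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "city": ["Vienna", None, "Graz", "Vienna"],
})

# Numeric summary: count, mean, std, min, quartiles, max.
numeric_stats = df.describe()

# Per-column overview: dtype, number of missing values, number of distinct values.
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
    "unique": df.nunique(),
})

print(numeric_stats)
print(overview)
```

The dedicated packages listed above go further (histograms, correlations, HTML reports), so this is only a starting point.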

Pandas and Jupyter

General tricks: link
Python debugger (pdb) - blog post, video, cheatsheet
cookiecutter-data-science - Project template for data science projects.
nteract - Open Jupyter Notebooks with a double-click.
modin - Parallelization library for faster pandas DataFrame.
swifter - Apply any function to a pandas dataframe faster.
xarray - Extends pandas to n-dimensional arrays.
blackcellmagic - Code formatting for jupyter notebooks.
pivottablejs - Drag-and-drop Pivot Tables and Charts for jupyter notebooks.
qgrid - Pandas DataFrame sorting.
ipysheet - Jupyter spreadsheet widget.
nbdime - Diff two notebook files. Alternative GitHub App: ReviewNB.


textract - Extract text from any document.
camelot - Extract text from PDF.

Big Data

spark - DataFrame for big data, cheatsheet, tutorial.
sparkit-learn, spark-deep-learning - ML frameworks for spark.
koalas - Pandas API on Apache Spark.
dask, dask-ml - Pandas DataFrame for big data and machine learning library, resources, talk1, talk2, notebooks, videos.
turicreate - Helpful SFrame class for out-of-memory dataframes.
h2o - Helpful H2OFrame class for out-of-memory dataframes.
datatable - Data Table for big data support.
cuDF - GPU DataFrame Library.
ray - Flexible, high-performance distributed execution framework.
mars - Tensor-based unified framework for large-scale data computation.
bottleneck - Fast NumPy array functions written in C.
bcolz - A columnar data container that can be compressed.
cupy - NumPy-like API accelerated with CUDA.
vaex - Out-of-Core DataFrames.
petastorm - Data access library for parquet files by Uber.

Command line tools

ni - Command line tool for big data.
xsv - Command line tool for indexing, slicing, analyzing, splitting and joining CSV files.
csvkit - Another command line tool for CSV files.
csvsort - Sort large csv files.


Common statistical tests explained
Bland-Altman Plot - Plot for agreement between two methods of measurement.
scikit-posthocs - Statistical post-hoc tests for pairwise multiple comparisons.


Null Hypothesis Significance Testing (NHST), Correlation, Cohen's d, Confidence Interval, Equivalence, non-inferiority and superiority testing, Bayesian two-sample t test, Distribution of p-values when comparing two groups, Understanding the t-distribution and its normal approximation
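
As a small worked example for one of the topics above, Cohen's d for two independent samples can be computed with NumPy. This is a minimal sketch using the pooled-standard-deviation definition; the helper name `cohens_d` and the sample values are illustrative, not taken from any of the linked resources.

```python
import numpy as np


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


d = cohens_d([2, 4, 6], [1, 3, 5])
print(d)  # 0.5: the means differ by 1 and the pooled standard deviation is 2
```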

Exploration and Cleaning

impyute - Imputations.
fancyimpute - Matrix completion and imputation algorithms.
imbalanced-learn - Resampling for imbalanced datasets.
tspreprocess - Time series preprocessing: Denoising, Compression, Resampling.
Kaggler - Utility functions (OneHotEncoder(min_obs=100))
pyupset - Visualizing intersecting sets.
pyemd - Earth Mover's Distance, similarity between histograms.

Feature Engineering

sklearn - Pipeline, examples.
pdpipe - Pipelines for DataFrames.
few - Feature engineering wrapper for sklearn.
skoot - Pipeline helper functions.
categorical-encoding - Categorical encoding of variables, vtreat (R package).
dirty_cat - Encoding dirty categorical variables.
patsy - R-like syntax for statistical models.
mlxtend - LDA.
featuretools - Automated feature engineering, example.
tsfresh - Time series feature engineering.
pypeln - Concurrent data pipelines.
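
For the sklearn Pipeline entry above, a minimal sketch looks like the following. The dataset (iris) and the two steps chosen here are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Each step is a (name, estimator) pair; all but the last must be transformers.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() runs fit_transform on the scaler, then fits the classifier on the scaled data.
pipe.fit(X, y)
train_accuracy = pipe.score(X, y)
print(train_accuracy)
```

Bundling preprocessing and model into one object keeps the scaler from being fitted on validation data when the pipeline is used inside cross-validation.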

Feature Selection

Blog post series - 1 Univariate Selection, 2 Linear Models and Regularization, 3 Random Forests, 4 Stability selection and RFE Tutorial, Talk
sklearn - Feature selection.
eli5 - Feature selection using permutation importance.
scikit-feature - Feature selection algorithms.
stability-selection - Stability selection.
scikit-rebate - Relief-based feature selection algorithms.
scikit-genetic - Genetic feature selection.
boruta_py - Feature selection, explanation, example.
linselect - Feature selection package.
mlxtend - Exhaustive feature selection.
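
Univariate selection, the first topic of the blog post series above, can be sketched with sklearn's SelectKBest. The dataset (iris), the scoring function and k=2 are illustrative choices, not recommendations from the linked posts.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every feature independently with an ANOVA F-test and keep the best two.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

# Column indices of the features that were kept.
kept = selector.get_support(indices=True)
print(kept, X_selected.shape)
```

Univariate scores ignore interactions between features, which is one reason the list also includes wrapper and stability-based methods.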

Dimensionality Reduction

prince - Dimensionality reduction, factor analysis (PCA, MCA, CA, FAMD).
sklearn - Multidimensional scaling (MDS).
sklearn - t-distributed Stochastic Neighbor Embedding (t-SNE), intro. Faster implementations: lvdmaaten, MulticoreTSNE.
sklearn - Truncated SVD (aka LSA).
mdr - Dimensionality reduction, multifactor dimensionality reduction (MDR).
umap - Uniform Manifold Approximation and Projection.
FIt-SNE - Fast Fourier Transform-accelerated Interpolation-based t-SNE.
scikit-tda - Topological Data Analysis, paper, talk.
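
The Truncated SVD (LSA) entry above can be illustrated on a tiny text corpus. The documents are made up and two components is an arbitrary choice; this is only a sketch of the usual TF-IDF plus SVD combination.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example documents.
docs = [
    "pandas makes data frames easy",
    "numpy arrays are fast",
    "data frames are built on numpy arrays",
    "plotting libraries draw charts",
]

# Sparse document-term matrix; TruncatedSVD works on it without densifying.
tfidf = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(tfidf)  # one 2-d vector per document
print(doc_topics.shape)
```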


All charts, Austrian monuments.
cufflinks - Dynamic visualization library, wrapper for plotly, medium, example.
physt - Better histograms, talk, notebook.
matplotlib_venn - Venn diagrams.
joypy - Draw stacked density plots.
mosaic plots - Categorical variable visualization, example.
scikit-plot - ROC curves and other visualizations for ML models.
yellowbrick - Visualizations for ML models (similar to scikit-plot).
bokeh - Interactive visualization library, Examples, Examples.
animatplot - Animate plots built on matplotlib.
plotnine - ggplot for Python.
altair - Declarative statistical visualization library.
bqplot - Plotting library for IPython/Jupyter Notebooks.
hvplot - High-level plotting library built on top of holoviews.
dtreeviz - Decision tree visualization and model interpretation.
chartify - Generate charts.
VivaGraphJS - Graph visualization (JS package).
pm - Navigable 3D graph visualization (JS package), example.
python-ternary - Triangle plots.
falcon - Interactive visualizations for big data.


dash - Dashboarding solution by Plotly. Tutorial: 1, 2, 3, 4, 5, example
bokeh - Dashboarding solution.
visdom - Dashboarding library by facebook.
bowtie - Dashboarding solution.
panel - Dashboarding solution.
altair example - Video

Geographical Tools

folium - Plot geographical maps using the Leaflet.js library, jupyter plugin.
stadiamaps - Plot geographical maps.
datashader - Draw millions of points on a map.
sklearn - BallTree, Example.
pynndescent - Nearest neighbor descent for approximate nearest neighbors.
geocoder - Geocoding of addresses, IP addresses.
Conversion of different geo formats: talk, repo
geopandas - Tools for geographic data
Low Level Geospatial Tools (GEOS, GDAL/OGR, PROJ.4)
Vector Data (Shapely, Fiona, Pyproj)
Raster Data (Rasterio)
Plotting (Descartes, Cartopy)
Predict economic indicators from Open Street Map ipynb.
PySal - Python Spatial Analysis Library.
geography - Extract countries, regions and cities from a URL or text.
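
For the sklearn BallTree entry above, a common use is nearest-neighbour search on latitude/longitude pairs with the haversine metric. The coordinates below are approximate and only for illustration; the Earth radius of 6371 km is the usual mean value.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Approximate (latitude, longitude) in degrees; illustrative values.
cities = {
    "Vienna": (48.21, 16.37),
    "Graz": (47.07, 15.44),
    "Salzburg": (47.81, 13.04),
    "Berlin": (52.52, 13.40),
}
names = list(cities)

# The haversine metric expects [lat, lon] in radians.
coords_rad = np.radians(np.array(list(cities.values())))
tree = BallTree(coords_rad, metric="haversine")

# k=2 because the closest point to Vienna in the tree is Vienna itself.
dist_rad, idx = tree.query(coords_rad[:1], k=2)

EARTH_RADIUS_KM = 6371.0
nearest = names[idx[0, 1]]
distance_km = dist_rad[0, 1] * EARTH_RADIUS_KM
print(nearest, round(distance_km))
```

Distances come back in radians on the unit sphere, so they need to be multiplied by the sphere's radius to get kilometres.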

Recommender Systems

Examples: 1, 2, 2-ipynb, 3.
surprise - Recommender, talk.
turicreate - Recommender.
implicit - Fast Collaborative Filtering for Implicit Feedback Datasets.
spotlight - Deep recommender models using PyTorch.
lightfm - Recommendation algorithms for both implicit and explicit feedback.
funk-svd - Fast SVD.
pywFM - Factorization.

Decision Tree Models

Intro to Decision Trees and Random Forests, Intro to Gradient Boosting
lightgbm - Gradient boosting (GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, doc.
xgboost - Gradient boosting (GBDT, GBRT or GBM) library, doc, Methods for CIs: link1, link2.
catboost - Gradient boosting.
thundergbm - GBDTs and Random Forest.
h2o - Gradient boosting.
forestci - Confidence intervals for random forests.
scikit-garden - Quantile Regression.
grf - Generalized random forest.
dtreeviz - Decision tree visualization and model interpretation.
rfpimp - Feature Importance for RandomForests using Permutation Importance.
Why the default feature importance for random forests is wrong: link
treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.
bartpy - Bayesian Additive Regression Trees.
infiniteboost - Combination of RFs and GBDTs.
merf - Mixed Effects Random Forest for Clustering, video
rrcf - Robust Random Cut Forest algorithm for anomaly detection on streams.

Natural Language Processing (NLP) / Text Processing

talk-nb, nb2, talk.
Text classification Intro, Preprocessing blog post.
gensim - NLP, doc2vec, word2vec, text processing, topic modelling (LSA, LDA), Example, Coherence Model for evaluation.
Embeddings - GloVe ([1], [2]), StarSpace, wikipedia2vec.
magnitude - Vector embedding utility package.
pyldavis - Visualization for topic modelling.
spaCy - NLP.
NLTK - NLP, helpful KMeansClusterer with cosine_distance.
pytext - NLP from Facebook.
fastText - Efficient text classification and representation learning.
annoy - Approximate nearest neighbor search.
faiss - Approximate nearest neighbor search.
pysparnn - Approximate nearest neighbor search.
infomap - Cluster (word-)vectors to find topics, example.
datasketch - Probabilistic data structures for large data (MinHash, HyperLogLog).
flair - NLP Framework by Zalando.
stanfordnlp - NLP Library.


Search Engine Correlation

Image Processing

cv2 - OpenCV, classical algorithms: Gaussian Filter, Morphological Transformations.
scikit-image - Image processing.
mahotas - Image processing (Bioinformatics), example.
imagepy - Software package for bioimage analysis.
CellProfiler - Biological image analysis.

Neural Networks


Convolutional Neural Networks for Visual Recognition course - Lessons 1-7, Lessons 8-14
Tensorflow without a PhD - Neural Network course by Google.
Feature Visualization: Blog, PPT
Tensorflow Playground
Visualization of optimization algorithms

Image Related

keras preprocessing - Preprocess images.
imgaug - More sophisticated image preprocessing.
imgaug_extension - Extension for imgaug.
albumentations - Wrapper around imgaug and other libraries.
Augmentor - Image augmentation library.
tcav - Interpretability method.
cutouts-explorer - Image Viewer.

Text Related

ktext - Utilities for pre-processing text for deep learning in Keras.
textgenrnn - Ready-to-use LSTM for text generation.


keras - Neural Networks on top of tensorflow, examples.
keras-contrib - Keras community contributions.
hyperas - Keras + Hyperopt: Convenient hyperparameter optimization wrapper.
elephas - Distributed Deep learning with Keras & Spark.
tflearn - Neural Networks on top of tensorflow.
tensorlayer - Neural Networks on top of tensorflow, tricks.
tensorforce - Tensorflow for applied reinforcement learning.
fastai - Neural Networks in pytorch.
ignite - High-level library for pytorch.
skorch - Scikit-learn compatible neural network library that wraps pytorch.
Detectron - Object Detection by Facebook.
autokeras - AutoML for deep learning.
simpledet - Object Detection and Instance Recognition.
PlotNeuralNet - Plot neural networks.
lucid - Neural network interpretability, Activation Maps.
AdaBound - Optimizer that trains as fast as Adam and generalizes as well as SGD.
caffe - Deep learning framework, pretrained models.
foolbox - Adversarial examples that fool neural networks.
hiddenlayer - Training metrics.
imgclsmob - Pretrained models.
netron - Visualizer for deep learning and machine learning models.
torchcv - Deep Learning in Computer Vision.

Applications and Snippets

CycleGAN and Pix2pix - Various image-to-image tasks.
SPADE - Semantic Image Synthesis.
Entity Embeddings of Categorical Variables, code, kaggle
Image Super-Resolution - Super-scaling using a Residual Dense Network.
Cell Segmentation - Talk, Blog Posts: 1, 2
CenterNet - Object detection.


cuML - Run traditional tabular ML tasks on GPUs.
thundergbm - GBDTs and Random Forest.
thundersvm - Support Vector Machines.


Understanding SVM Regression: slides, forum, paper

pyearth - Multivariate Adaptive Regression Splines (MARS), tutorial.
pygam - Generalized Additive Models (GAMs), Explanation.
GLRM - Generalized Low Rank Models.
tweedie - Specialized distribution for zero inflated targets, Talk.


Talk, Notebook
Blog post: Probability Scoring
All classification metrics
DESlib - Dynamic classifier and ensemble selection


pyclustering - All sorts of clustering algorithms.
somoclu - Self-organizing map.
hdbscan - Clustering algorithm.
nmslib - Similarity search library and toolkit for evaluation of k-NN methods.
buckshotpp - Outlier-resistant and scalable clustering algorithm.
merf - Mixed Effects Random Forest for Clustering, video

Interpretable Classifiers and Regressors

skope-rules - Interpretable classifier, IF-THEN rules.
sklearn-expertsys - Interpretable classifiers, Bayesian Rule List classifier.

Multi-label classification

scikit-multilearn - Multi-label classification, talk.

Signal Processing and Filtering

Kalman Filter book - Focuses on intuition using Jupyter Notebooks. Includes Bayesian and various Kalman filters.
Interactive Tool for FIR and IIR filters, Examples.
The Scientist & Engineer's Guide to Digital Signal Processing (1999).
filterpy - Kalman filtering and optimal estimation library.

Time Series

statsmodels - Time series analysis, seasonal decompose example, SARIMA, granger causality.
pyramid, pmdarima - Wrapper for (Auto-) ARIMA.
pyflux - Time series prediction algorithms (ARIMA, GARCH, GAS, Bayesian).
prophet - Time series prediction library.
pm-prophet - Time series prediction and decomposition library.
htsprophet - Hierarchical Time Series Forecasting using Prophet.
nupic - Hierarchical Temporal Memory (HTM) for Time Series Prediction and Anomaly Detection.
tensorflow - LSTM and others, examples: link, link, link, Explain LSTM, seq2seq: 1, 2, 3, 4
tspreprocess - Preprocessing: Denoising, Compression, Resampling.
tsfresh - Time series feature engineering.
thunder - Data structures and algorithms for loading, processing, and analyzing time series data.
gatspy - General tools for Astronomical Time Series, talk.
gendis - shapelets, example.
tslearn - Time series clustering and classification, TimeSeriesKMeans.
pastas - Simulation of time series.
fastdtw - Dynamic Time Warp Distance.
fable - Time Series Forecasting (R package).
CausalImpact - Causal Impact Analysis (R package).
pydlm - Bayesian time series modeling (R package, Blog post)
PyAF - Automatic Time Series Forecasting.
luminol - Anomaly Detection and Correlation library from Linkedin.
matrixprofile-ts - Detecting patterns and anomalies, website, ppt.
stumpy - Another matrix profile library.
obspy - Seismology package. Useful classic_sta_lta function.
RobustSTL - Robust Seasonal-Trend Decomposition.
seglearn - Time Series library.
pyts - Time series transformation and classification, Imaging time series.
Turn time series into images and use Neural Nets: example, example.

Time Series Evaluation

TimeSeriesSplit - Sklearn time series split.
tscv - Evaluation with gap.
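
The TimeSeriesSplit entry above can be sketched as follows. The ten-point series is a placeholder; the point is that every training window ends before its test window starts.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten consecutive observations standing in for a time-ordered dataset.
X = np.arange(10).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in splitter.split(X)]

for train, test in splits:
    # Training indices always precede test indices, so no future data leaks in.
    print("train:", train, "test:", test)
```

Unlike a shuffled K-fold, the training set grows with each split while the test set slides forward in time.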

Financial Data

pyfolio - Portfolio and risk analytics.
zipline - Algorithmic trading.
alphalens - Performance analysis of predictive stock factors.

Survival Analysis

Time-dependent Cox Model in R.
lifelines - Survival analysis, Cox PH Regression, talk, talk2.
scikit-survival - Survival analysis.
xgboost - "objective": "survival:cox" NHANES example
survivalstan - Survival analysis, intro.
convoys - Analyze time lagged conversions.
RandomSurvivalForests (R packages: randomForestSRC, ggRandomForests).

Outlier Detection & Anomaly Detection

sklearn - Isolation Forest and others.
pyod - Outlier Detection / Anomaly Detection.
eif - Extended Isolation Forest.
AnomalyDetection - Anomaly detection (R package).
luminol - Anomaly Detection and Correlation library from Linkedin.
Distances for comparing histograms and detecting outliers - Talk: Kolmogorov-Smirnov, Wasserstein, Energy Distance (Cramer), Kullback-Leibler divergence
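
A minimal sketch of the sklearn Isolation Forest entry above, on synthetic data: 200 points drawn from a standard normal distribution plus one planted outlier. The data, seed and parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[8.0, 8.0]])  # planted far away from the cloud
X = np.vstack([inliers, outlier])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

scores = forest.score_samples(X)  # lower score = more anomalous
labels = forest.predict(X)        # -1 = outlier, 1 = inlier

most_anomalous = int(np.argmin(scores))
print(most_anomalous, labels[most_anomalous])
```

Isolated points need fewer random splits to separate from the rest, which is what the score reflects.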


lightning - Large-scale linear classification, regression and ranking.


SLIM - Scoring systems for classification, Supersparse linear integer models.

Probabilistic Modeling and Bayes

Intro, Guide
PyMC3 - Bayesian modelling, intro
pomegranate - Probabilistic modelling, talk.
pmlearn - Probabilistic machine learning.
arviz - Exploratory analysis of Bayesian models.
zhusuan - Bayesian deep learning, generative models.
dowhy - Estimate causal effects.
edward - Probabilistic modeling, inference, and criticism, Mixture Density Networks (MDNs), MDN Explanation.
Pyro - Deep Universal Probabilistic Programming.
tensorflow probability - Deep learning and probabilistic modelling, talk, example.

Stacking Models and Ensembles

Model Stacking Blog Post
mlxtend - EnsembleVoteClassifier, StackingRegressor, StackingCVRegressor for model stacking.
vecstack - Stacking ML models.
StackNet - Stacking ML models.
mlens - Ensemble learning.

Model Evaluation

pycm - Multi-class confusion matrix.
pandas_ml - Confusion matrix.
Plotting learning curve: link.
yellowbrick - Learning curve.

Model Explanation, Interpretability, Feature Importance

Book, Examples
shap - Explain predictions of machine learning models, talk.
treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.
lime - Explaining the predictions of any machine learning classifier, talk, Warning (Myth 7).
lime_xgboost - Create LIMEs for XGBoost.
eli5 - Inspecting machine learning classifiers and explaining their predictions.
lofo-importance - Leave One Feature Out Importance, talk, examples: 1, 2, 3.
pybreakdown - Generate feature contribution plots.
FairML - Model explanation, feature importance.
pycebox - Individual Conditional Expectation Plot Toolbox.
pdpbox - Partial dependence plot toolbox, example.
partial_dependence - Visualize and cluster partial dependence.
skater - Unified framework to enable model interpretation.
anchor - High-Precision Model-Agnostic Explanations for classifiers.
l2x - Instancewise feature selection as methodology for model interpretation.
contrastive_explanation - Contrastive explanations.
DrWhy - Collection of tools for explainable AI.
lucid - Neural network interpretability.
xai - An eXplainability toolbox for machine learning.
innvestigate - A toolbox to investigate neural network predictions.

Automated Machine Learning

AdaNet - Automated machine learning based on tensorflow.
tpot - Automated machine learning tool, optimizes machine learning pipelines.
auto_ml - Automated machine learning for analytics & production.
autokeras - AutoML for deep learning.
nni - Toolkit for neural architecture search and hyper-parameter tuning by Microsoft.
automl-gs - Automated machine learning.

Evolutionary Algorithms & Optimization

deap - Evolutionary computation framework (Genetic Algorithm, Evolution strategies).
evol - DSL for composable evolutionary algorithms, talk.
platypus - Multiobjective optimization.
autograd - Efficiently computes derivatives of numpy code.
nevergrad - Derivative-free optimization.
gplearn - Sklearn-like interface for genetic programming.
blackbox - Optimization of expensive black-box functions.
Optometrist algorithm - paper.
DeepSwarm - Neural architecture search.

Hyperparameter Tuning

sklearn - GridSearchCV, RandomizedSearchCV.
hyperopt - Hyperparameter optimization.
hyperopt-sklearn - Hyperopt + sklearn.
optuna - Hyperparameter optimization, Talk.
skopt - BayesSearchCV for Hyperparameter search.
tune - Hyperparameter search with a focus on deep learning and deep reinforcement learning.
hypergraph - Global optimization methods and hyperparameter optimization.
bbopt - Black box hyperparameter optimization.
dragonfly - Scalable Bayesian optimisation.
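
The GridSearchCV entry above can be sketched like this. The estimator (SVC), the dataset (iris) and the parameter grid are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 3 values of C x 2 kernels = 6 candidate settings, each scored with 5-fold CV.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

RandomizedSearchCV follows the same pattern but samples a fixed number of settings instead of trying them all, which scales better to large grids.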

Incremental Learning, Online Learning

sklearn - PassiveAggressiveClassifier, PassiveAggressiveRegressor.
creme-ml - Incremental learning framework.
Kaggler - Online Learning algorithms.

Active Learning

modAL - Active learning framework.

Reinforcement Learning

YouTube, YouTube
Intro to Monte Carlo Tree Search (MCTS) - 1, 2, 3
AlphaZero methodology - 1, 2, 3, Cheat Sheet
RLLib - Library for reinforcement learning.
Horizon - Facebook RL framework.


h2o - Scalable machine learning.
turicreate - Apple Machine Learning Toolkit.
astroml - ML for astronomical data.

Deployment and Lifecycle Management

m2cgen - Transpile trained ML models into other languages.
sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
mlflow - Manage the machine learning lifecycle, including experimentation, reproducibility and deployment.
modelchimp - Experiment Tracking.
skll - Command-line utilities to make it easier to run machine learning experiments.
BentoML - Package and deploy machine learning models for serving in production


dvc - Versioning for ML projects.
daft - Render probabilistic graphical models using matplotlib.
unyt - Working with units.
scrapy - Web scraping library.
VowpalWabbit - ML Toolkit from Microsoft.
metric-learn - Metric learning.

General Python Programming

funcy - Fancy and practical functional tools.
more_itertools - Extension of itertools.
dill - Serialization, alternative to pickle.
attrs - Python classes without boilerplate.
dateparser - A better date parser.
jellyfish - Approximate string matching.


PocketCluster - Blog.

