last updated: 10/22/2014
Useful libraries for data science in Python
This is not meant to be a complete list of all Python libraries out there that are related to scientific computing and data analysis -- printed on paper and stacked one on top of the other, the stack could easily reach a height of 238,857 miles, the distance from the Earth to the Moon.
Table of Contents
- Fundamental Libraries for Scientific Computing
- Math and Statistics
- Machine Learning
- Plotting and Visualization
- Data formatting and storage
Fundamental Libraries for Scientific Computing
IPython is an alternative Python command line shell for interactive computing with lots of useful enhancements over the "default" Python interpreter.
IPython Notebooks, its browser-based documents, are a great environment for scientific computing: they not only execute code but also hold informative documentation via Markdown, HTML, LaTeX, embedded images, and inline data plots (e.g., via matplotlib).
NumPy is probably the most fundamental package for efficient scientific computing in Python, including its linear algebra routines. One of NumPy's major strengths is that most operations are implemented in compiled C/C++ and Fortran code for efficiency. At its core, NumPy works with multi-dimensional array objects that support broadcasting and lead to efficient, vectorized code.
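As a minimal sketch of what broadcasting and vectorization look like in practice (the array values and the `weights` vector are made up for illustration), the column means below are subtracted from every row without an explicit loop:

```python
import numpy as np

# A 3x2 array of made-up measurements (rows = samples, columns = features).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# col_means has shape (2,); broadcasting stretches it across all three rows.
col_means = data.mean(axis=0)
centered = data - col_means

# A vectorized matrix-vector product replaces a Python-level loop over rows.
weights = np.array([0.5, 0.1])  # hypothetical feature weights
scores = centered @ weights

print(centered)
print(scores)
```

The subtraction works because NumPy aligns the trailing dimensions of the two arrays, so no copy of `col_means` per row is ever written out by hand.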
pandas is a library for working with table-like structures. It comes with a powerful DataFrame object, a two-dimensional labeled table that supports efficient numerical operations similar to NumPy's ndarray and adds further functionality on top, such as labeled rows and columns.
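A small sketch of the DataFrame in use, with invented column names and values, might look like this:

```python
import pandas as pd

# A tiny table with made-up values; each dict key becomes a labeled column.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.3, 6.0, 5.1],
})

# Label-based column access combined with a boolean row filter.
long_petals = df[df["petal_length"] > 2.0]

# Group rows by a categorical column and aggregate a numeric one.
mean_by_species = df.groupby("species")["petal_length"].mean()

print(long_petals)
print(mean_by_species)
```

Selecting by column label and grouping by value are the kind of table operations that would take noticeably more code with a plain ndarray.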
SciPy is considered to be one of the core packages for scientific computing routines. As a useful expansion of the NumPy core functionality, it contains a broad range of functions for linear algebra, interpolation, integration, clustering, and much more.
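To illustrate two of the areas mentioned, here is a short sketch of numerical integration and interpolation; the sample function and points are chosen only for demonstration:

```python
import numpy as np
from scipy import integrate, interpolate

# Numerically integrate sin(x) from 0 to pi; the analytic result is 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Linear interpolation through a few sample points of y = x**2.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = x ** 2
f = interpolate.interp1d(x, y)  # defaults to linear interpolation
y_mid = float(f(1.5))           # halfway between (1, 1) and (2, 4)

print(area, abs_error)
print(y_mid)
```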
Math and Statistics
SymPy is a Python library for symbolic mathematical computations. It has a broad range of features, ranging from calculus, algebra, and geometry to discrete mathematics and even quantum physics. It also includes basic plotting functionality and print functions with LaTeX support.
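A brief sketch of symbolic calculus, equation solving, and LaTeX printing, using an arbitrary example expression:

```python
import sympy as sp

x = sp.symbols("x")
expr = sp.sin(x) * sp.exp(x)  # arbitrary expression for illustration

# Symbolic differentiation and integration with respect to x.
derivative = sp.diff(expr, x)
integral = sp.integrate(expr, x)

# Solve the quadratic x**2 - 5x + 6 = 0 exactly.
roots = sp.solve(x**2 - 5*x + 6, x)

# Render the derivative as a LaTeX string.
print(sp.latex(derivative))
print(roots)
```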
Statsmodels is a Python library centered on statistical data analysis, mainly through linear models, and includes a variety of statistical tests.
Machine Learning
Scikit-learn is probably the most popular general machine learning library for Python. It includes a broad range of different classifiers, cross-validation and other model selection methods, dimensionality reduction techniques, modules for regression and clustering analysis, and a useful data-preprocessing module.
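The following sketch combines preprocessing, a classifier, and cross-validation. It assumes a reasonably recent scikit-learn release (where cross-validation lives in `sklearn.model_selection`); the choice of logistic regression and the bundled Iris dataset is purely illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the small Iris dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Chain scaling and the classifier so the scaler is refit on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation; returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Wrapping the scaler in the pipeline keeps information from the held-out fold from leaking into preprocessing.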
Shogun is a machine learning library focussed on large-scale kernel methods. Its particular strength is Support Vector Machines (SVMs), and it comes with a range of different SVM implementations.
PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library) is a machine learning library built around neural networks, with a focus on supervised learning, reinforcement learning, and evolutionary methods.
PyLearn2 is a machine learning research library (a library for studying machine learning) focussed on deep and convolutional neural networks, restricted Boltzmann machines, and auto-encoders.
PyMC focuses on Bayesian statistics and comes with a broad range of algorithms (including Markov chain Monte Carlo, MCMC) for model fitting.
Plotting and Visualization
Bokeh is a plotting library that is focussed on aesthetic layouts and interactivity to produce high-quality plots for web browsers.
d3py is a plotting library to create interactive data visualizations based on d3.
ggplot is a port of R's popular ggplot2 library that brings its alternative syntax and unique visualization style to Python.
Matplotlib is Python's most popular and comprehensive plotting library that is especially useful in combination with NumPy/SciPy.
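A minimal sketch of plotting NumPy data with matplotlib follows. The non-interactive `Agg` backend and the in-memory PNG buffer are assumptions made so the example runs without a display; in a notebook you would typically show the figure inline instead:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 100)

# Draw two curves on a single set of axes and label them.
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Render the figure to PNG bytes in memory instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```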
Plotly is a plotting library that is focussed on adding interactivity to data visualizations and sharing them via the web for collaborative data analysis. To produce interactive plots, plotly requires an internet connection to stream data to the plotly servers. However, plots can also be saved in common image formats for offline use.
Prettyplotlib is a nice enhancement-library that turns matplotlib's default styles into beautiful, presentation-ready plots based on information design and color perception studies.
Seaborn is based on matplotlib's core functionality and adds features (e.g., violin plots) and visual enhancements to create even more beautiful plots.
Data formatting and storage
csvkit, also known as the "Swiss Army knife for comma-delimited data files", offers additional functionality and features over Python's built-in csv module. It comes with several shell command-line tools, e.g., csvgrep, csvsort, etc., but of course it can also be imported as a library in Python.
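For comparison, here is a sketch of the built-in csv module mentioned above, doing a simple filter and sort in plain Python. The data and the score threshold are invented, and the example does not use csvkit itself:

```python
import csv
import io

# Made-up CSV content held in memory instead of a file on disk.
raw = "name,score\nalice,90\nbob,72\ncarol,85\n"

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))

# Filter rows by a numeric condition, then sort all rows by score (descending).
high = [r for r in rows if int(r["score"]) >= 80]
ranked = sorted(rows, key=lambda r: int(r["score"]), reverse=True)

# Write the sorted rows back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "score"], lineterminator="\n")
writer.writeheader()
writer.writerows(ranked)
print(out.getvalue())
```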
PyTables is a library that combines HDF5 and NumPy for working with very large datasets efficiently. PyTables also makes use of C extensions (via Cython) for fast data access and for pulling data into NumPy arrays or pandas data structures.
Although sqlite3 is part of Python's Standard Library, it is still worth mentioning this classic that provides a Python interface to SQLite databases. SQLite is an open-source SQL database engine that is ideal for smaller workgroups, because it is a single locally stored database file (up to 140 TB in size) that does not require any server infrastructure, in contrast to client-server SQL databases.
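A minimal sketch of the sqlite3 interface is shown below. The table name, columns, and values are hypothetical, and an in-memory database is used so that no file is created; passing a file path to `connect` would give the single-file database described above:

```python
import sqlite3

# ":memory:" creates a temporary database; a file path would persist it on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sample TEXT, value REAL)")

# Parameterized inserts keep data separate from the SQL text.
conn.executemany(
    "INSERT INTO measurements (sample, value) VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("a", 3.5)],
)
conn.commit()

# Aggregate per sample with ordinary SQL.
rows = conn.execute(
    "SELECT sample, AVG(value) FROM measurements GROUP BY sample ORDER BY sample"
).fetchall()
conn.close()

print(rows)
```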