Sebastian Raschka
last updated: 10/22/2014

Useful libraries for data science in Python

This is not meant to be a complete list of all Python libraries related to scientific computing and data analysis. Printed on paper and stacked one on top of the other, such a list could easily reach a height of 238,857 miles, the distance from the Earth to the Moon.

I nevertheless welcome additions and suggestions.
Please feel free to drop me a note via Twitter, email, or Google+.




Fundamental Libraries for Scientific Computing


IPython Notebook

Website: http://ipython.org/notebook.html

IPython is an alternative Python command line shell for interactive computing with many useful enhancements over the "default" Python interpreter.
The browser-based IPython Notebooks are a great environment for scientific computing: they not only execute code, but also let you add informative documentation via Markdown, HTML, LaTeX, embedded images, and inline data plots (e.g., via matplotlib).


NumPy

Website: http://www.numpy.org

NumPy is probably the most fundamental package for efficient scientific computing in Python through its array operations and linear algebra routines. One of NumPy's major strengths is that most operations are implemented in compiled C/C++ and Fortran code for efficiency. At its core, NumPy works with multi-dimensional array objects that support broadcasting and lead to efficient, vectorized code.
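
As a minimal sketch of the broadcasting and vectorization mentioned above (the array values are made up for illustration), the column means of a small array can be subtracted from every row without an explicit Python loop:

```python
import numpy as np

# A 3x2 array of made-up measurements (rows = samples, columns = features).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# The column means have shape (2,); broadcasting stretches them across all rows.
col_means = data.mean(axis=0)
centered = data - col_means

# Vectorized arithmetic replaces an explicit loop over the elements.
squared_norms = (centered ** 2).sum(axis=1)

print(centered)
print(squared_norms)
```

The subtraction works because NumPy aligns the trailing dimensions of the `(3, 2)` array and the `(2,)` vector.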


pandas

Website: http://pandas.pydata.org

pandas is a library for working with table-like structures. It comes with a powerful DataFrame object, a two-dimensional, labeled data structure that builds on NumPy's ndarray for efficient numerical operations and adds functionality such as row and column labels and columns of different data types.
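
A small sketch of what working with a DataFrame can look like; the column names and values below are invented purely for illustration:

```python
import pandas as pd

# A tiny table with one text column and one numeric column (made-up values).
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 5.5, 6.5],
})

# Select a column by label and aggregate it per group.
mean_by_species = df.groupby("species")["petal_length"].mean()

print(df.shape)
print(mean_by_species)
```

Columns are addressed by name rather than by position, which is one of the main conveniences over a plain NumPy array.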


SciPy

Website: http://scipy.org/scipylib/index.html

SciPy is considered to be one of the core packages for scientific computing routines. As a useful expansion of the NumPy core functionality, it contains a broad range of functions for linear algebra, interpolation, integration, clustering, and much more.
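
To give a flavor of two of the areas listed above, here is a short sketch using `scipy.integrate.quad` for numerical integration and `scipy.linalg.solve` for a linear system; the integrand and the matrix are arbitrary examples:

```python
import numpy as np
from scipy import integrate, linalg

# Numerically integrate sin(x) from 0 to pi; the exact value is 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Solve the linear system A x = b for x.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)

print(area, abs_error)
print(x)
```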



Math and Statistics


SymPy

Website: http://www.sympygamma.com

SymPy is a Python library for symbolic mathematical computations. It has a broad range of features, ranging from calculus, algebra, geometry, and discrete mathematics to quantum physics. It also includes basic plotting functionality and print functions with LaTeX support.
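
A brief sketch of symbolic calculus and LaTeX output with SymPy; the expression is an arbitrary example:

```python
import sympy as sp

x = sp.symbols("x")
expr = sp.sin(x) * sp.exp(x)

derivative = sp.diff(expr, x)            # symbolic derivative
antiderivative = sp.integrate(expr, x)   # symbolic antiderivative
latex_source = sp.latex(derivative)      # LaTeX source as a plain string

print(derivative)
print(antiderivative)
print(latex_source)
```

Unlike a numerical library, the results are exact expressions that can be simplified, differentiated again, or typeset.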


Statsmodels

Website: http://statsmodels.sourceforge.net

Statsmodels is a Python library centered around statistical data analysis, mainly through linear models, and it includes a variety of statistical tests.



Machine Learning


Scikit-learn

Website: http://scikit-learn.org/stable/

Scikit-learn is probably the most popular general machine learning library for Python. It includes a broad range of different classifiers, cross-validation and other model selection methods, dimensionality reduction techniques, modules for regression and clustering analysis, and a useful data-preprocessing module.
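
The following sketch combines three of the pieces named above (preprocessing, a classifier, and cross-validation) on the Iris dataset bundled with the library. It assumes a recent scikit-learn release, where cross-validation helpers live in `sklearn.model_selection`; the choice of logistic regression and 5 folds is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the bundled Iris data as a feature matrix X and a label vector y.
X, y = load_iris(return_X_y=True)

# Chain feature standardization and a classifier into a single estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# Estimate the accuracy with 5-fold cross-validation.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()

print(scores)
print(mean_accuracy)
```

Wrapping the scaler and the classifier in a pipeline ensures that the scaling parameters are re-estimated on the training part of each fold only.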


Shogun

Website: http://www.shogun-toolbox.org

Shogun is a machine learning library that is focussed on large-scale kernel methods. Its particular strength is Support Vector Machines (SVMs), and it comes with a range of different SVM implementations.


PyBrain

Website: http://pybrain.org


PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library) is a machine learning library built around neural networks, with a focus on supervised learning, reinforcement learning, and evolutionary methods.


PyLearn2

Website: http://deeplearning.net/software/pylearn2/

PyLearn2 is a machine learning research library (that is, a library for studying machine learning) focussed on deep and convolutional neural networks, restricted Boltzmann machines, and auto-encoders.


PyMC

Website: http://pymc-devs.github.io/pymc/index.html

PyMC focuses on Bayesian statistics and comes with a broad range of algorithms (including Markov chain Monte Carlo, MCMC) for model fitting.



Plotting and Visualization


Bokeh

Website: http://bokeh.pydata.org

Bokeh is a plotting library that is focussed on aesthetic layouts and interactivity to produce high-quality plots for web browsers.


d3py

Website: https://github.com/mikedewar/d3py

d3py is a plotting library for creating interactive data visualizations based on D3.js.


ggplot

Website: https://github.com/yhat/ggplot

ggplot is a port of R's popular ggplot2 library that brings its alternative syntax and unique visualization style to Python.


matplotlib

Website: http://matplotlib.org

Matplotlib is Python's most popular and comprehensive plotting library that is especially useful in combination with NumPy/SciPy.
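
A minimal sketch of the NumPy and matplotlib combination: NumPy generates the data, and matplotlib draws it. The non-interactive `Agg` backend and the in-memory buffer are choices made here so that the example runs without a display; in practice a file name can be passed to `savefig` instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Sample two functions on a regular grid.
x = np.linspace(0.0, 2.0 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Render the figure as PNG into an in-memory buffer.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```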


plotly

Website: https://plot.ly

Plotly is a plotting library that is focussed on adding interactivity to data visualizations and sharing them via the web for collaborative data analysis. To produce interactive plots, plotly requires an internet connection to stream data to the plotly servers; however, plots can also be saved in common image formats for offline use.


prettyplotlib

Website: http://olgabot.github.io/prettyplotlib/

Prettyplotlib is a nice enhancement library that turns matplotlib's default styles into beautiful, presentation-ready plots based on information design and color perception studies.


seaborn

Website: http://web.stanford.edu/~mwaskom/software/seaborn/

Seaborn builds on matplotlib's core functionality and adds further features (e.g., violin plots) and visual enhancements to create even more beautiful plots.



Data Formatting and Management


csvkit

Website: https://csvkit.readthedocs.org

csvkit, also known as the "Swiss Army knife for comma-delimited data files", offers additional functionality and features over Python's built-in csv module. It comes with several shell command-line tools (e.g., csvgrep and csvsort), but of course it can also be imported as a library in Python.


PyTables

Website: http://www.pytables.org

PyTables is a library that combines HDF5 and NumPy for working with very large datasets efficiently. PyTables also makes use of C extensions (via Cython) for fast data access and for pulling data into NumPy arrays or pandas data structures.


sqlite3

Website: https://docs.python.org/3.4/library/sqlite3.html

Although sqlite3 is part of Python's Standard Library, this classic is still worth mentioning: it provides a Python interface to SQLite databases. SQLite is an open-source SQL database engine that is ideal for smaller workgroups because it stores the whole database in a single local file (up to 140 TB in size) and, in contrast to client-server database systems, does not require any server infrastructure.
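
A short sketch of the sqlite3 interface; the table name, column names, and values are hypothetical. The special name `":memory:"` keeps the database in RAM for this example, whereas passing a file path would create the single database file described above.

```python
import sqlite3

# ":memory:" creates a temporary in-memory database; use a file path to persist it.
conn = sqlite3.connect(":memory:")

conn.execute("CREATE TABLE measurements (sample TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("c", 4.0)],
)
conn.commit()

# Parameterized query: the "?" placeholder is filled in safely by sqlite3.
rows = conn.execute(
    "SELECT sample, value FROM measurements WHERE value > ? ORDER BY value",
    (2.0,),
).fetchall()
conn.close()

print(rows)
```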