<center><h1>Python For Data Analysis Wiki</h1></center>

This wiki does not teach Python itself. Instead, it points the reader to resources on how Python is used for data analysis.

## Setting up a Python 3.X environment / Using virtual environments / Popular IDEs

- **Anaconda / Miniconda3 distribution by Continuum Analytics (highly recommended for Windows users)**
    - Get the Miniconda3 installer [here](http://conda.pydata.org/miniconda.html).  **Use Python 3!**
    - [How-To conda command](http://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/conda/conda_howto.ipynb): making virtual environments, installing/uninstalling packages, etc.
        - [Official Conda cheat sheet](http://conda.pydata.org/docs/_downloads/conda-cheatsheet.pdf)
- **Linux** - Python 3.3+ ships with the [venv](https://docs.python.org/3.3/library/venv.html) module for creating virtual environments (run it as `python3 -m venv`; the original `pyvenv` script has been deprecated since Python 3.6). Alternatively, use the [virtualenv](http://docs.python-guide.org/en/latest/dev/virtualenvs/) library
    - How I build a [Python 3 environment](http://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/wiki/Building_Python_3_Linux.ipynb) on Linux
- **Popular IDEs or text editors**
    - [PyCharm Community Edition](https://www.jetbrains.com/pycharm/features/): a more resource-intensive IDE
    - [Sublime Text](https://www.sublimetext.com/): a fast, lightweight text editor
        - [How I set up Sublime Text 3 for interactive data analysis](http://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/sublime_text/ST3_For_Interactive_Data_Analysis.ipynb)
    - [Jupyter Notebook](http://jupyter.org/) (not an IDE, but very useful for exploratory data analysis and reproducibility.  Also check out [literate programming](https://en.wikipedia.org/wiki/Literate_programming)).
        - [Installing and setting up jupyter notebook server](http://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/jupyter/Installing_Jupyter_Notebook_Server.ipynb)
    - [Yhat Rodeo](https://www.yhat.com/products/rodeo): an IDE for data science
    - [Spyder IDE](https://pythonhosted.org/spyder/) (more familiar to people with a MATLAB background)
    - VIM :)
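
For readers who prefer to script the virtual environment step mentioned above, here is a minimal sketch using the standard-library `venv` module. The directory name `analysis-env` and the helper `create_env` are hypothetical, and `with_pip=False` is chosen only to keep the sketch fast and offline; this is an illustration, not the setup described in the linked notebooks.

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir: Path) -> Path:
    """Create a virtual environment at env_dir and return the path to its pyvenv.cfg."""
    # with_pip=False skips bootstrapping pip; pass True for a usable working environment.
    venv.create(env_dir, with_pip=False)
    return env_dir / "pyvenv.cfg"


# Hypothetical location: a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / "analysis-env")
    env_created = cfg.exists()
    print("environment created:", env_created)
```

On the command line, the equivalent is `python3 -m venv analysis-env`, followed by activating the environment with the script in its `bin` (or `Scripts` on Windows) directory.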

## Data Ingestion From Various Sources

- Reading several file types as input using [Pandas](http://pandas.pydata.org/pandas-docs/stable/io.html)
- Connecting to an ODBC data source using [pyodbc](https://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/pandas/PandasCheatSheet.ipynb#database). Drivers are also available for working with MySQL / MariaDB / PostgreSQL, or even ORDBs, using SQLAlchemy
- Web-scraping using [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) or [lxml](http://lxml.de/)
- [Excel files](http://www.python-excel.org/)
- "Largish" data ingestion using [Dask](http://dask.pydata.org) (for data that is larger than RAM but still fits on a hard drive, while letting you use Pandas-like functions)
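
As a rough illustration of the ingestion routes listed above, the sketch below reads a CSV and a SQL query result into pandas DataFrames. The sample data, the `parts` table, and the in-memory SQLite database are all assumptions made so the example is self-contained; a real ODBC or SQLAlchemy connection would take the place of `conn`.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample data standing in for a real CSV file on disk.
csv_text = "part,qty\nbolt,10\nnut,25\nwasher,40\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# In-memory SQLite database standing in for a real database connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part TEXT, qty INTEGER)")
conn.executemany("INSERT INTO parts VALUES (?, ?)", [("bolt", 10), ("nut", 25)])
conn.commit()

# Let the database do the filtering, then load only the result into pandas.
df_sql = pd.read_sql_query("SELECT part, qty FROM parts WHERE qty > 15", conn)
conn.close()

print(df_csv)
print(df_sql)
```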

## Data Clean-up / Transformation

- [Tidying Data](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Tidy_Data_In_Python?lang=en)
- [Pandas Cheat Sheet](https://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/pandas/PandasCheatSheet.ipynb)
- [Advanced pandas examples](http://nbviewer.jupyter.org/github/TomAugspurger/PyDataSeattle/tree/master/notebooks/)
- Using [xlwings](https://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/xlwings/Excel_Formatting.ipynb) to replace Excel VBA
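
To give a flavor of what "tidying" means in practice, here is a small sketch that reshapes a wide table into one row per observation with `DataFrame.melt`. The column names and values are made up for illustration and do not come from the linked articles.

```python
import pandas as pd

# Hypothetical "wide" table: one row per model, one column per year.
wide = pd.DataFrame({
    "model": ["A", "B"],
    "2014": [5, 7],
    "2015": [6, 9],
})

# Melt the year columns into a single variable column plus a value column.
tidy = wide.melt(id_vars="model", var_name="year", value_name="claims")
tidy["year"] = tidy["year"].astype(int)
tidy = tidy.sort_values(["model", "year"]).reset_index(drop=True)

print(tidy)
```

Once the data is in this long form, group-by summaries and most plotting libraries tend to work with it more directly.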

## Data Visualization

- [Matplotlib](http://matplotlib.org/)
    - [This lecture notebook](http://nbviewer.jupyter.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb) is good enough that it should be the official tutorial
- Don't forget, [pandas](http://pandas.pydata.org/pandas-docs/stable/visualization.html) has a wrapper around matplotlib, so you can plot [directly](http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-plotting) from pandas dataframes/series
    - So you can do something like `df.plot.hist()`, `df.plot.bar()`, or `df.plot.box()`, etc.
- [seaborn](https://stanford.edu/~mwaskom/software/seaborn/): for nicer-looking statistical charts
- [bokeh](http://bokeh.pydata.org/en/latest/): JavaScript-based
- [plotly](https://plot.ly/python/): JavaScript-based
- [lightning](http://lightning-viz.org/): JavaScript-based
- Yhat [ggplot](http://blog.yhat.com/posts/ggplot-for-python.html) (development may have stagnated)
- Dashboard frameworks similar to R's Shiny
    - [Spyre](https://github.com/adamhajari/spyre/blob/master/README.md)
    - [Pyxley](http://multithreaded.stitchfix.com/blog/2015/07/16/pyxley/)
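
The pandas plotting wrappers mentioned above can be sketched as follows. The random data, column names, and the choice of the non-interactive `Agg` backend are assumptions made so the example runs without a display; in a Jupyter notebook you would normally skip the backend line and let the figures render inline.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import numpy as np
import pandas as pd

# Hypothetical data: two normally distributed columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(0, 1, 200),
    "b": rng.normal(1, 2, 200),
})

ax_hist = df.plot.hist(bins=20, alpha=0.5)  # overlaid histograms, one per column
ax_box = df.plot.box()                      # one box per column
ax_bar = df.head(5).abs().plot.bar()        # grouped bars for the first five rows

# Render one figure to an in-memory PNG instead of writing a file.
buf = io.BytesIO()
ax_hist.figure.savefig(buf, format="png")
print("PNG bytes written:", buf.getbuffer().nbytes)
```

Each call returns a matplotlib `Axes` object, so titles, labels, and limits can still be adjusted with the usual matplotlib methods afterwards.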

## Data Summarization / Statistics using the SciPy, NumPy, Statsmodels, Pandas stack

- [Comparing 2 distributions using Mann-Whitney-Wilcoxon or Kolmogorov-Smirnov 2-Sample Tests](http://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/statistics/Comparing2Distributions.ipynb)
- [Comparing distributions using rank sum test and one-way ANOVA](http://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/statistics/Distribution_Comparisons_and_ANOVA.ipynb)
- [Distribution fitting and Q-Q Plot](http://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/statistics/Testing_For_Normality_and_Distribution_Fitting.ipynb)
- [Intro to linear regression example](http://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/statistics/Intro_Linear_Regression.ipynb)
- Bootstrap method
- Shuffling method
- Failure forecasting using [Weibull Analysis](http://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/weibull/Weibull_Analysis.ipynb)
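
The bootstrap and shuffling entries above have no linked notebook yet, so here is a minimal NumPy sketch of both ideas: a percentile bootstrap confidence interval for a mean, and a shuffling (permutation) test for a difference in means. The sample data and the helper names `bootstrap_ci` and `permutation_pvalue` are hypothetical, and this is one common variant of each method rather than a definitive implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical samples, e.g. measurements from two groups.
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=11.0, scale=2.0, size=50)


def bootstrap_ci(sample, stat=np.mean, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for stat(sample)."""
    n = len(sample)
    # Resample with replacement and recompute the statistic each time.
    stats = np.array([
        stat(rng.choice(sample, size=n, replace=True))
        for _ in range(n_resamples)
    ])
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])


def permutation_pvalue(x, y, n_permutations=2000):
    """Two-sided p-value for a difference in means, obtained by shuffling labels."""
    observed = abs(np.mean(x) - np.mean(y))
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(np.mean(shuffled[:len(x)]) - np.mean(shuffled[len(x):]))
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (count + 1) / (n_permutations + 1)


ci_low, ci_high = bootstrap_ci(group_a)
p_value = permutation_pvalue(group_a, group_b)
print(f"95% bootstrap CI for mean of group_a: ({ci_low:.2f}, {ci_high:.2f})")
print(f"permutation p-value for difference in means: {p_value:.4f}")
```

Both methods avoid assuming a particular distribution for the data, which makes them a useful complement to the parametric tests linked above.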

## For the aspiring Polyglot

- Using R from Python within a Jupyter notebook via [rpy2](https://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/pandas/PandasCheatSheet.ipynb#rpy2)
- Using Python from within R via [rPython](http://www.r-bloggers.com/calling-python-from-r-with-rpython/) (no Windows support) or [rPython-win](https://github.com/cjgb/rPython-win) (Windows support)
- [Comparison](http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/TomAugspurger/6e052140eaa5fdb6e8c0/raw/811585624e843f3f80b9b6fe89e18119d7d2d73c/dplyr_pandas.ipynb) of R's dplyr and Pandas
- Python clone of R's dplyr called [dplython](https://github.com/dodger487/dplython)!
    - My dplython and pandas [example](http://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/dplython/dplython_example.ipynb)
- Example of using R's [forecast](http://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/R/Using_forecast_package.ipynb) library
- Other related polyglot technologies inspired by the Jupyter project
    - [Beaker Notebook](http://beakernotebook.com/) (Java-based)
    - [Apache Zeppelin](https://zeppelin.incubator.apache.org/) (Java-based)

## Other useful resources

- Probably the most [comprehensive resource](http://people.duke.edu/~ccc14/sta-663/) on using Python for scientific computing I have ever seen
- Practical Business Python [website](http://pbpython.com/)
- Chris Albon's Python and R [scripts](http://chrisalbon.com/)
- [Gallery](https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks) of interesting Jupyter notebooks
- Most [viewed](http://nb.bianp.net/sort/views/) Jupyter notebooks