<center><h1>Python For Data Analysis Wiki</h1></center>

This wiki does not teach Python itself. It assumes basic prior knowledge of Python programming and focuses on using Python for data analysis.

## Setting up a Python 3.X environment / Using virtual environments / Popular IDEs

- **Anaconda / Miniconda3 distribution by Continuum Analytics (highly recommended for Windows users)**
    - Get the Miniconda3 installer [here](http://conda.pydata.org/miniconda.html)
    - [How-To conda command](../conda/conda_how_to.ipynb): making virtual environments, installing/uninstalling packages, etc.
- **Using pip to install/uninstall packages**
    - Example usage:

          pip install <package-name>
          pip uninstall <package-name>
          pip freeze  # list all installed packages

- **Popular IDEs**
    - [PyCharm Community Edition](https://www.jetbrains.com/pycharm/features/) A full-featured but more resource-intensive IDE
    - [Sublime Text](https://www.sublimetext.com/) A fast, lightweight text editor (not a full IDE)
    - [Jupyter Notebook](http://jupyter.org/) (not an IDE, but very useful for exploratory data analysis and reproducibility.  Also check out [literate programming](https://en.wikipedia.org/wiki/Literate_programming)).
    - [Yhat Rodeo](https://www.yhat.com/products/rodeo) An IDE for data science
    - [Spyder IDE](https://pythonhosted.org/spyder/) (familiar to people with a MATLAB background)
    - VIM :)

### My Python environment:

- Python 3.X installed in a virtual environment using Miniconda3, both at home and at work
- Jupyter notebook for prototyping and debugging, then finalizing the code in Sublime Text 3 in VIM emulation mode

## Data Ingestion From Various Sources

- Reading several file types as input using [Pandas](http://pandas.pydata.org/pandas-docs/stable/io.html)
- Connecting to an ODBC data source using [pyodbc](https://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/pandas/PandasCheatSheet.ipynb#database). Drivers are also available for working with MySQL / MariaDB / PostgreSQL, or even ORDBs, using SQLAlchemy
- Web-scraping using [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) or [lxml](http://lxml.de/)
- [Excel files](http://www.python-excel.org/)
- "Largish" data ingestion using [Dask](http://dask.pydata.org) (data that is larger than RAM but still fits on a hard drive). You should still try to use a database, though!
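
As a minimal sketch of the first two items above, the snippet below reads CSV text and then queries a database with pandas. The sample data, the `parts` table name, and the in-memory SQLite database are assumptions made for illustration; a real ODBC source would use a `pyodbc` or SQLAlchemy connection instead.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = """part,qty,price
bolt,10,0.25
nut,25,0.10
washer,40,0.05
"""

# pd.read_csv accepts a path, a URL, or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

# An in-memory SQLite database stands in for a real database server.
conn = sqlite3.connect(":memory:")
df_csv.to_sql("parts", conn, index=False)

# pd.read_sql_query runs a query and returns the result as a DataFrame.
df_sql = pd.read_sql_query(
    "SELECT part, qty FROM parts WHERE qty > 15 ORDER BY qty", conn
)
conn.close()

print(df_csv.dtypes)
print(df_sql)
```

Pushing the filter into the SQL query, rather than loading everything and filtering in pandas, is usually the better choice when the table is large.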

## Data Clean-up / Transformation

- [Pandas Cheat Sheet](https://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/pandas/PandasCheatSheet.ipynb)
- [Tidying Data](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Tidy_Data_In_Python?lang=en)
- Using [xlwings](https://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/xlwings/Excel_Formatting.ipynb) to replace Excel VBA
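
To give a rough idea of what tidying looks like in pandas, here is a small sketch that reshapes a "wide" table into long (tidy) form with `melt`. The table and its column names are made up for illustration and do not come from the linked article.

```python
import pandas as pd

# Hypothetical "wide" table: one row per plant, one column per month.
wide = pd.DataFrame({
    "plant": ["A", "B"],
    "jan": [100, 80],
    "feb": [110, 95],
})

# melt reshapes to long (tidy) form: one row per observation.
tidy = wide.melt(id_vars="plant", var_name="month", value_name="units")

# Tidy data works directly with groupby and similar operations.
totals = tidy.groupby("plant")["units"].sum()

print(tidy)
print(totals)
```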

## Data Visualization

- [Matplotlib](http://matplotlib.org/)
    - But [this lecture notebook](http://nbviewer.jupyter.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb) should be the official tutorial
- Don't forget, [pandas](http://pandas.pydata.org/pandas-docs/stable/visualization.html) has a wrapper around matplotlib, so you can plot directly from pandas DataFrames/Series
- [seaborn](https://stanford.edu/~mwaskom/software/seaborn/)
- [bokeh](http://bokeh.pydata.org/en/latest/)
- [plotly](https://plot.ly/python/)
- [lightning](http://lightning-viz.org/)
- Yhat [ggplot](http://blog.yhat.com/posts/ggplot-for-python.html) (development dead?)
- Dashboard frameworks similar to R's Shiny
    - [Spyre](https://github.com/adamhajari/spyre/blob/master/README.md)
    - [Pyxley](http://multithreaded.stitchfix.com/blog/2015/07/16/pyxley/)
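
As a small sketch of plotting directly from pandas (the matplotlib wrapper mentioned in the list above), the snippet below draws a line chart from a DataFrame. The data is invented for illustration, and the `Agg` backend line is only an assumption for running outside a notebook.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import pandas as pd

# Hypothetical monthly figures, for illustration only.
df = pd.DataFrame(
    {"sales": [10, 12, 9, 15], "returns": [1, 2, 1, 3]},
    index=["Jan", "Feb", "Mar", "Apr"],
)

# DataFrame.plot returns a matplotlib Axes, so it can be customized further.
ax = df.plot(kind="line", title="Monthly sales and returns")
ax.set_xlabel("month")
ax.set_ylabel("count")

# Render to an in-memory PNG; pass a file name instead to save to disk.
buf = io.BytesIO()
ax.figure.savefig(buf, format="png")
```

Because the return value is an ordinary matplotlib `Axes`, anything from the matplotlib tutorials can be applied to it afterwards.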

## Data Summarization / Statistics using SciPy or Pandas

- Summary statistics using Pandas
- Distribution fitting and Q-Q Plot
- Linear regression / correlation
- Bootstrap method
- Shuffling method

## For the aspiring Polyglot

- [Using R from Python via Jupyter notebook](https://nbviewer.jupyter.org/github/pybokeh/jupyter_notebooks/blob/master/pandas/PandasCheatSheet.ipynb#rpy2)
- Using Python from within R using [rPython](http://www.r-bloggers.com/calling-python-from-r-with-rpython/) (Windows supported?) [rPython-win](https://github.com/cjgb/rPython-win)
- Other related polyglot technologies inspired by the Jupyter project
    - [Beaker notebook](http://beakernotebook.com/) (Java-based)
    - [Apache Zeppelin](https://zeppelin.incubator.apache.org/) (Java-based)

## Other useful resources

- Probably the most [comprehensive resource](http://people.duke.edu/~ccc14/sta-663/) on using Python for scientific computing I have ever seen