# python

[`python`](https://www.python.org/) is one of the most popular scripting and programming languages in the world. there are, [like, ninefinity different ways of ranking programming languages](https://www.python.org/), and `python` sits in the top 5 of almost every one of them.

I have used it on every single project I've ever worked on. I am an unashamed `python` fanboy.

I'm also not really one for the `R` vs. `python` holy wars -- the two have different use cases and any conversation that attempts to settle "which is better" is already fundamentally flawed, in my opinion. That being said, I'd like to make the following case as to why you should *learn* `python`, even if it doesn't become your go-to data science language:

##### `python` is much more common than `R`

this is partly a product of a number of biases, but it also speaks to something fundamental about the differences between the languages: `R` is a very *deep* language in a very *narrow* field of concepts (namely, statistics), whereas `python` is among the most *broad* and *flexible* languages without a central purpose.

Another reason this matters: the barrier to entry for a company or government agency's IT department will be lower (or already surpassed) for `python` and `python` packages if for no other reason than that computer engineers are familiar with it by default.

##### there is a package for that

this is a corollary of the previous point: if there is a thing you want to do, it is very likely that someone has already done it in `python`, and that their work is available for you to use.

For a point of comparison, there are 11,282 `R` packages on `CRAN`, and 115,497 `python` packages on `PyPI` (the Python Package Index).

```python
import antigravity
```

##### it does all the most basic things well

although it is available and well-supported on every OS, `python` is very much a linux-focused language. it "grew up" in linux as an alternative scripting language (alternative to `bash` and other `shell` scripts). because of this, it acquired some of the linux philosophy points, and specifically those that focus on simplicity.

many of the current iterations of linux tools are actually calling `python` scripts under the hood, which means that essential things like web scraping, emailing, scheduling and timing, networking, logging, and database access are all possible and highly optimized in `python`.

##### it is fun

`python` was created with the express interest of being as simple to program in as possible. most of the syntax and rules are specifically designed to make the language read as much like pseudo-code as possible, so code is easy to read.

the community also has an edge to it. a good example: start a python session and type

```python
import this
```


The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


## version 2 vs. 3

I'd be remiss not to mention one of the major red marks against `python` -- the infamous "2 vs. 3" upgrade controversy.

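in order to make some very low-level changes to the language (primarily to clean up long-standing design warts and to make unicode text the default, which is what supporting international languages requires), the developer community chose to make a new major version of `python` that is not backwards compatible with the old one: `python3`.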

the process caused a lot of confusion among newcomers to the language -- which was exploding in popularity at around the same time -- and also put a large burden of uncertainty on corporate developers and development.

the bottom line, in my opinion, is this:

***unless you have no other option, you should always use python 3 and only python 3***
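
A few of the differences are easy to see for yourself. The snippet below is a small illustrative sketch (nowhere near a full list of what changed) of three things that behave differently between the two versions: text handling, division, and `print`. The example word and variable names are just placeholders I made up.

```python
# written for python 3; the comments note where python 2 differed

# text is unicode by default, so an accented letter is a single character
word = "caf\u00e9"                     # the word "café"
n_chars = len(word)                    # 4 characters
n_bytes = len(word.encode("utf-8"))    # 5 bytes: é takes two bytes in UTF-8

# "/" is always true division (python 2 gave 1 for 3 / 2)
half = 3 / 2
# "//" is floor division in both versions
floored = 3 // 2

# print is a function, not a statement
print(word, n_chars, n_bytes, half, floored)
```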

## packages

a given file of executable `python` code is probably best referred to as a "script", but a collection of scripts which exposes some sort of interface to a user to do "something" is generally called a "library" or a "package".

This is mostly the same convention as in the `R` community -- think of the differences between scripts you wrote and `dplyr` and all the other stuff Hadley wrote.

### my favorite packages

So what sorts of `python` packages should you use?

first of all, the builtin packages are pretty great, and cover a wide range of the most necessary use cases for a programming language (e.g. file i/o and os utilities and tie-ins). The ones I use most often are:

+ `argparse` - reading in and parsing command line arguments
+ `collections` - sets of "collection" objects (e.g. ordered dictionaries, named tuples, default dictionaries)
+ `csv` - for reading and writing delimited files
+ `datetime` - the fundamental date object and utilities library
+ `functools` - functional tools, including fancy stuff like partial function definitions and caching
+ `itertools` - an awesome library of utilities for iterating through collections of items
+ `json` - for parsing and constructing well formatted JSON
+ `logging` - for logging messages to console, file, etc
+ `os` - operating system interaction (I use this in almost every single program)
+ `pickle` - a `python`-native serialization protocol, for saving `python` stuff
+ `random` - a decent (if not special) randomization library
+ `re` - regular expression parsing library
+ `time` - a generic OS-level time interface

for any `python` installation, these *already exist* -- no installation necessary
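
To give a flavor of what "no installation necessary" buys you, here is a minimal sketch that touches four of the modules above (`collections`, `itertools`, `functools`, `json`). The `Package` record and the sample data are made up for illustration.

```python
import json
from collections import Counter, namedtuple
from functools import lru_cache
from itertools import combinations

# collections: lightweight record types and counting
Package = namedtuple("Package", ["name", "kind"])
packages = [
    Package("os", "builtin"),
    Package("json", "builtin"),
    Package("pandas", "third-party"),
]
kind_counts = Counter(p.kind for p in packages)

# itertools: every unordered pair of package names
pairs = list(combinations([p.name for p in packages], 2))

# functools: cache the results of a recursive function so it stays fast
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# json: serialize plain python objects to a string and back again
payload = json.dumps({"counts": dict(kind_counts), "fib_30": fib(30)}, sort_keys=True)
restored = json.loads(payload)
print(payload)
```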

there are also a ton of great open-source libraries for just about any purpose you might imagine. Again, the ones I use most often:

+ `flask` - a `python` web framework (for standing up webpages)
+ `ipython` - the best interactive shell, it just makes the normal python program look silly
+ `jupyter` - the notebook extension of the above (`ipython`); this is what was used to make this bodacious document you see before you
+ `lxml` - a fast and flexible XML / HTML library
+ `matplotlib` - a plotting library that is super useful but will make `R` users dream of their former glory
+ `nltk` - Natural Language Tool Kit, a library for language processing and text analytics
+ `numpy` - NUMerical PYthon, a lot of super duper array and linear algebra glue code to make C and FORTRAN routines available in `python`.
+ `pandas` - PANel DAta, a dataframe interface for feature data. This is the main data science library in `python` and, again, I use it in almost every single program
+ `plotly` - an amazing plotting library
+ `psycopg2` - a `postgres` library
+ `requests` - the main web GET and POST library
+ `scipy` - SCIentific PYthon, an extension of `numpy` that adds more scientific utilities
+ `scrapy` - a flexible but easy web scraping framework
+ `seaborn` - something you import whenever you use `matplotlib` to make your plots non-heinous (also has some useful functions that no one has discovered yet)
+ `selenium` - a browser automation library (for when `requests` isn't good enough, e.g. pages that need javascript to render)
+ `sklearn` - the other half of the primary data science workflow, an all-purpose modeling library (installed as `scikit-learn`, imported as `sklearn`)
+ `sqlalchemy` - an ORM library for most sql databases. It's pretty flashy and when you finally need it, you'll know in your heart.
+ `tqdm` - a fancy-pants progress bar library. You don't need it, but you want it.
+ `yaml` - a library for parsing the world's greatest configuration format, YAML (originally "Yet Another Markup Language", these days "YAML Ain't Markup Language"); installed as `pyyaml`, imported as `yaml`

### installing packages

So, let's take a journey together.

Unlike `R`, the folks who put `python` together thought that people should care about the versions of the packages they installed. They didn't really do anything to make this happen in a sane way, though, so there were like ten different ways to install packages. 

If you learned `python` in the early days, you probably heard it was hard to install packages. Well, it was. Maybe it still is, depending on your attitude. That's right, I'm blaming the victim.

Really, though, I'm sorry. If you're coming to `python` from `R` this probably feels silly. Why not just have an `install` function and install whatever you want? 

Why? Basically, because that's a bad idea for writing production-level software.

production-level software is meant to be deterministic, and to be stable. Software that has the ability to install packages within the language has several disadvantages:

1. avoids administrator oversight
    1. having to ask your admin to install something is a *good* thing
2. could install something malicious or broken without anyone knowing
3. could install different versions on different machines at different times

Basically, the versions of all your packages matter, so you should care about that stuff. The `python` community is a pretty big stickler about that and has gone to great lengths (and, like, 15 different methods) to try and solve that problem. And today, that means that everyone is doing one of the following:

+ using `pip` ("Pip Installs Packages", and yes, recursive acronyms are annoying)
+ using `pip`, but in a virtual environment
+ using `conda` (virtual environments on steroids or amphetamines, depending on whether you're a data scientist or a sysadmin, respectively)

I advocate using `conda` for many reasons -- more on this later.
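
Whichever installer you use, it helps to be able to ask "which version of this package do I actually have?" Here is a minimal sketch using `importlib.metadata`, which is in the standard library only from python 3.8 onward (an assumption on my part about your interpreter). The helper name `installed_version` and the package names in the loop are just for illustration.

```python
from importlib import metadata  # standard library in python 3.8+


def installed_version(package_name):
    """Return the installed version string for a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# what you see here depends entirely on the environment you run this in
for name in ["pip", "numpy", "surely-not-a-real-package"]:
    print(name, "->", installed_version(name))
```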

## environments

### basic environments

Let's take a quick python version poll:

on your laptop (not your `ec2` instance), what `python` version do you have installed?

```bash
python --version
```


Python 3.6.1 :: Continuum Analytics, Inc.


different versions of `python` (and different versions of installed packages) have different files defining the language's behavior and thus different levels of compatibility. personally, I think knowing that these files exist is among the more important things to understand.

***the way that the code you wrote behaves depends on these files***
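
You can ask the same version question from inside `python` itself, which is handy when a script needs to check what it is running on. This is a small sketch using the standard `platform` and `sys` modules; the function name `describe_interpreter` is my own invention.

```python
import platform
import sys


def describe_interpreter():
    """Summarize which python is running this code."""
    return {
        "version": platform.python_version(),                # e.g. "3.6.1"
        "implementation": platform.python_implementation(),  # e.g. "CPython"
        "is_python3": sys.version_info >= (3,),
    }


info = describe_interpreter()
print(info)
```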

recall the `which` command, which will tell us the path that will actually be called when we type in a command

```bash
which python3
```


/home/zlamberty/anaconda3/envs/bs/bin/python3


your out-of-the-box `ec2` instances will likely return `/usr/bin/python3`. so when you type `python3` on the command line, you will actually call the executable file `/usr/bin/python3`.
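
The standard library has its own version of `which`, if you would rather do this lookup from `python`. A quick sketch follows; the made-up command name at the end of the list is only there to show what "not found" looks like.

```python
import shutil
import sys

# shutil.which mimics the shell's `which`: it searches PATH and returns the
# first matching executable, or None if nothing is found
commands = ["python3", "python", "no-such-command-hopefully"]
resolved = {command: shutil.which(command) for command in commands}

for command, path in resolved.items():
    print(command, "->", path)

# the interpreter running this code (it may differ from both of the above)
print("running under:", sys.executable)
```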

the same sort of thing is going on for individual `python` modules we import. Almost every module has a "private" member `__file__` which lists the path to the file used to define that module (the exceptions are modules compiled directly into the interpreter, like `sys`):

```python
import os
os.__file__
```

'/home/zlamberty/anaconda3/envs/bs/lib/python3.6/os.py'
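
If you want to check this for more than one module, here is a small sketch. The helper name `module_file` is hypothetical, and it returns `None` for modules that are compiled into the interpreter and so have no file of their own.

```python
import importlib


def module_file(module_name):
    """Import a module by name and return the file that defines it (or None)."""
    module = importlib.import_module(module_name)
    # modules compiled into the interpreter (e.g. sys) have no __file__
    return getattr(module, "__file__", None)


# the paths printed here will differ from machine to machine
for name in ["os", "json", "sys"]:
    print(name, "->", module_file(name))
```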

let's look at that file!

```bash
# for you, it is:
less /usr/lib/python3.5/os.py

#for me, right now, it'll be different -- hence the craziness below. sorry!
```

```bash
# ask python where os.py lives, then print that file
OS_FILE=$(python -c "import os; print(os.__file__)")
cat "$OS_FILE"
```

r"""OS routines for NT or Posix depending on what system we're on.

This exports:
  - all functions from posix or nt, e.g. unlink, stat, etc.
  - os.path is either posixpath or ntpath
  - os.name is either 'posix' or 'nt'
  - os.curdir is a string representing the current directory (always '.')
  - os.pardir is a string representing the parent directory (always '..')
  - os.sep is the (or a most common) pathname separator ('/' or '\\')
  - os.extsep is the extension separator (always '.')
  - os.altsep is the alternate pathname separator (None or '/')
  - os.pathsep is the component separator used in $PATH etc
  - os.linesep is the line separator in text files ('\r' or '\n' or '\r\n')
  - os.defpath is the default search path for executables
  - os.devnull is the file path of the null device ('/dev/null', etc.)

Programs that import and use 'os' stand a better chance of being
portable between different platforms.  Of course, they must then
only use functions that are defined by all pla

if you change that file, or your friend (who is running your code) doesn't have that same file, the code that uses `os` will behave differently.

the same caveat goes for every file or environment variable used by your python process on any machine. This collection of files is often called the "`python` environment", and it can be different on any system.
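
To make "the environment" a little more concrete, the sketch below collects a few of the places `python` looks when it starts up and when you `import` something. The function name `describe_environment` and the choice of which items to report are mine; a real environment involves more than these three.

```python
import os
import sys


def describe_environment():
    """Collect a few of the locations that define the current environment."""
    return {
        "prefix": sys.prefix,             # root directory of this python install
        "search_path": list(sys.path),    # directories searched on `import`
        "PYTHONPATH": os.environ.get("PYTHONPATH"),  # extra search dirs, if set
    }


env = describe_environment()
for key, value in env.items():
    print(key, "->", value)
```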

in the real world, the implication is immediate: if one of my programs only works with version 1.2 of some library, and another only works with version 2.1, and the `GOVERNMENT AGENCY NAME REDACTED` sysad just installed version 1.0 of that library and *that* took two years, this will probably be a problem.

It would be nice if this problem was solved...

### virtual environments

"virtual environments" are ways of isolating out the contents (the files) of libraries you're installing.

This is something you've probably done in `R` without knowing it. if you've ever tried installing a package but didn't have admin rights, the `R` interpreter prompts you to see if there's some other place you'd like to install things (usually in your home directory).

that is a system-level isolation of the files you want to install. When the interpreter is told to load a package, it looks first for your local copy to see if you have anything spicy, and then the global copy, and then it cries.

So, generalize that idea: let's make *many* separate environments (collections of files defining how our `python` code behaves).

We can generalize this beyond just "global" and "user" (as with `R`), even creating a separate environment for each process or code base.

On a very basic level, all we're doing here is re-installing packages into a special sub-directory somewhere on the machine, and then telling `python` (through environment variables like the `PATH` variable) where to look to find them. 

We're tricking `python` into doing the right thing. and `python` is cool about it; once it realizes it's been tricked it's not even mad or anything, it's strong in our relationship and knows that it was all a bit of a goof and what's more, we all actually really had a great time and made some good memories.
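
To see that an environment really is "just a directory", here is a minimal sketch using `venv`, the virtual environment module that ships with python 3. Note that `venv` is a stand-in I picked because it needs no installation, and the names `make_environment` and `demo-env` are made up. It builds a throwaway environment in a temporary folder and lists what was created.

```python
import os
import sys
import tempfile
import venv


def make_environment(parent_dir, name):
    """Create a bare virtual environment and return its directory."""
    env_dir = os.path.join(parent_dir, name)
    # with_pip=False keeps this fast and offline; a real environment would
    # usually want pip available
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as scratch:
    env_dir = make_environment(scratch, "demo-env")
    created = sorted(os.listdir(env_dir))
    print(created)  # a config file plus folders for executables and libraries

# when this code is run by a venv's own python, these two values differ
print("in a venv-style environment:", sys.prefix != sys.base_prefix)
```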

Oftentimes finished `python` projects will ship with a `requirements.txt` file, which lists each `python` package that should be installed and the exact version it was tested against. The expectation is that the project will be executed by a system with the same packages and versions.

The "virtual environment" is an isolated set of packages that will meet that requirement.
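
For a sense of what is in such a file, here is a rough sketch that reads pinned `name==version` lines into a dictionary. The file contents are a hypothetical example, and the parser deliberately ignores everything else a real requirements file can contain (version ranges, extras, `-r` includes); `pip` is the real authority on that format.

```python
def parse_pinned_requirements(text):
    """Map package name -> pinned version for simple `name==version` lines."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "==" not in line:
            continue  # skip blanks and anything this sketch doesn't understand
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


# hypothetical requirements.txt contents
example = """
# pinned versions this project was tested against
requests==2.18.4
Flask==0.12.2   # web framework
pandas==0.20.3
"""

pins = parse_pinned_requirements(example)
print(pins)
```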

The original way of creating a virtual environment was the python utility `virtualenv`, which is awesome and worth checking out. That being said, it's not what I'll recommend. Instead, I'll recommend...

## generalizing virtual environments: `conda`

`conda` is the package and environment manager that ships with `anaconda`, a *distribution* of python. it takes the virtual environment concept above and adds a special wrinkle: while most virtual environment managers allow you to install different versions of `python` *packages*, `conda` allows you to install different versions of `python` *itself*.

this should help you deal with any `python2` vs. `python3` problems you may experience.

so, let's go ahead and install `conda`, create a virtual environment, and install something.

*note: I would recommend you install `conda` on both your laptop and your `ec2` instance, but we will **require** you to install it on your `ec2` instance (it's part of the homework), so you may want to use that instance to do all of this right now*

#### installing `conda`

the full `anaconda` installer, by default, comes with many of the most commonly downloaded `python` packages. This is great because it gives you a pretty solid working base without any modification, *BUT* given our time and bandwidth limits, I'm going to recommend you install the `miniconda` version (the bare bones) and install packages *as needed* instead of up front.

+ [`anaconda`](https://www.continuum.io/downloads): a big installation, which will take a few minutes, and pre-installs several of the "must haves" (many of the above, and maybe more)
+ [`miniconda`](https://conda.io/miniconda.html): a bare-bones implementation of the above for the *discerning* gentleprogrammer

Download that stuff. Then follow the instructions on the download page, which will probably say:

```bash
bash Miniconda_some_other_stuff_.sh
```

And then, once everything is done:

```bash
conda update conda
```

<div align="center">**everyone installs `conda`**</div>

recall that we previously called

```bash
which python3
```

and got `/usr/bin/python3`, and we also checked the file path to the `os` package (from within a `python` shell):

```python
import os
os.__file__
```

what do we get now, after installing `conda`?

*everything* about `conda` is installed in one and only one directory. "uninstalling" `conda` is equivalent to simply deleting that directory.

the act of creating an environment creates a new folder under the `envs` sub-directory in that main `conda` directory, and installs all of our required packages there. Let's look into that right now:

```bash
conda create -n l33tmode python=3
```

will use `conda` to create an environment named "`l33tmode`" with `python` version 3 installed.

as the little dialog will state after you create the environment, you have to "activate" that environment if you want to use it. You have to do this any time you want to use a virtual environment.

what we're *actually* doing here is updating the `PATH` environment variable to "point" `python` to our newly created set of files. Now, when we wish to use `python`, we will be using our specialized, isolated versions.

So let's do that:

```bash
# mac or linux:
source activate l33tmode

# windows
activate l33tmode

# conda 4.4 and newer, on any OS:
conda activate l33tmode
```

This should have made our terminal prompt 10 times l33t3r. To verify that we're now looking at different files:

```bash
which python3
```
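
If you're curious why the answer changed, here is a rough sketch of the core of what activation does to `PATH`. Real `conda` activation does more than this (it sets other environment variables and changes your prompt), and the install location below is a hypothetical guess, so treat this as an illustration rather than a description of `conda` internals.

```python
import os
import shutil


def activated_path(env_bin_dir, current_path):
    """Return a PATH string with the environment's bin directory searched first."""
    return os.pathsep.join([env_bin_dir, current_path])


# hypothetical location; yours depends on where conda was installed
env_bin = os.path.join(
    os.path.expanduser("~"), "miniconda3", "envs", "l33tmode", "bin"
)
new_path = activated_path(env_bin, os.environ.get("PATH", ""))

# the first matching PATH entry wins, so python3 would resolve inside the
# environment if that directory exists on this machine
print(shutil.which("python3", path=new_path))
```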

Now let's install some stuff

```bash
conda install jupyter ipython
```

and then try it out

```bash
ipython
```

this should open a fancier python interpreter (`ipython`).

## `ipython` and `jupyter notebook`

the documents and slideshows we've been using as lecture notes this whole time were created with `jupyter`, a cool `python` package which allows you to execute interpreted `python` commands in a "notebook" format, where commands and notes are isolated into separate "cells" that can be executed on demand.

There are a couple of popular "ways" of developing `python` code, and `jupyter` notebooks are probably the most popular.

I highly recommend becoming familiar with both `ipython` and `jupyter`, but particularly `jupyter`!

<div align="center">***YESSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS***</div>
<img src="http://ih0.redbubble.net/image.13413141.8561/flat,550x550,075,f.u3.jpg"></img>

# END OF LECTURE

next lecture: [AWS identity access management (IAM)](005_iam.ipynb)