# python

## why might you use `python`

[`python`](https://www.python.org/) is one of the most popular scripting and programming languages in the world. there are, [like, ninefinity different ways of ranking programming languages](https://en.wikipedia.org/wiki/Measuring_programming_language_popularity#Indices), and `python` sits in the top 5 of almost every one of them.

I have used it on every single project I've ever worked on. I am an unashamed `python` fanboy.

though I like to poke fun at `R`, I'm not really one for the `R` vs. `python` holy wars -- the two have different use cases and any conversation that attempts to settle "which is better" is already fundamentally flawed, in my opinion. That being said, I'd like to make the following case as to why you should *learn* `python`, even if it doesn't become your go-to data science language:

##### `python` is much more common than `R` *outside* the statistics community

this is partly a product of a number of biases, but it also speaks to something fundamental about the differences between the languages: `R` is a very *deep* language in a very *narrow* field of concepts (namely, statistics), whereas `python` is among the most *broad* and *flexible* languages without a central purpose.

Another reason this matters: the barrier to entry for a company or government agency's IT department will be lower (or already surpassed) for `python` and `python` packages if for no other reason than that computer engineers are either already familiar with it or much more comfortable expanding into it.

##### the deep learning field is very much entrenched on the `python` side of the fence

though there are ways to communicate out of `R` and into `python` sessions for access to the deep learning frameworks, the action is all in `python`.

if you are looking to do deep learning, the easiest path forward will be in `python` rather than `R`.

*note: you could correctly argue that the action is actually **really** all in `C` and `C++` but the action `api`s are in `python`*

##### there is a package for that

this is a corollary of the previous point: if there is a thing you want to do, it is very likely that someone has already done it in `python`, and that their work is available for you to use.

For a point of comparison, there are [14,949 `R` packages on `CRAN`](https://cran.r-project.org/web/packages/), and [197,199 on `pypi`](https://pypi.org/).

In [None]:
import antigravity

##### it does all the most basic things well

although it is available and well-supported on every OS, `python` is very much a linux-focused language. it "grew up" in linux as an alternative scripting language (alternative to `bash` and other `shell` scripts). because of this, it acquired some of the linux philosophy points, and specifically those that focus on simplicity.

many of the current iterations of linux tools are actually calling `python` scripts under the hood, which means that essential things like web scraping, emailing, scheduling and timing, networking, logging, and database access are all possible and highly optimized in `python`

##### it is fun

`python` was created with the express interest of being as simple to program in as possible. most of the syntax and rules are specifically generated to make the language function as much like pseudo-code as possible, so code is easy to read.

the community also has an edge to it. a good example: start a python session and type

```python
import this
```

In [None]:
import this

okay, so it's a particular type of fun for a particular type of person. but given the prior that you're in this class, I suppose that's a safer prediction

## why might you not?

### version 2 vs. 3

I'd be remiss not to mention one of the major red marks against `python` -- the infamous "2 vs. 3" upgrade controversy.

in order to make some very low-level changes to the language (primarily for performance improvements and to support international languages), the developer community chose to make a new major version of `python`: `python3`.

the process caused a lot of confusion among newcomers to the language -- which was exploding in popularity at around the same time -- and also put a large burden of uncertainty on corporate developers and development.

the bottom line, in my opinion, is this:

***unless you have no other option, you should always use python 3 and only python 3***

### it's not *built* for statistics

every single decision made in the development of the `R` programming language has been optimized for a particular audience (statisticians) with a particular task (statistics). as such, it does some things that computer scientists and software developers find baffling but are extremely intuitive to data scientists.

the river often flows in the other direction in `python` world

**<div align="center">open floor: if you like `python`, why? if you don't, why not?</div>**

# packages

a given file of executable `python` code is probably best referred to as a "script", but a collection of scripts which expose some sort of interface to a user to do "something" is generally called a "package" (or sometimes a "library", especially for the built-in packages).

This is mostly the same convention as in the `R` community -- think of the differences between scripts you wrote and `dplyr` and all the other stuff Hadley wrote.

## my favorite packages

So what sorts of `python` packages should you use?

first of all, the builtin packages are pretty great, and cover a wide range of the most necessary use cases for a programming language (e.g. file i/o and os utilities and tie-ins). The ones I use most often are:

+ `argparse` - reading in and parsing command line arguments
+ `collections` - sets of "collection" objects (e.g. ordered dictionaries, named tuples, default dictionaries)
+ `csv` - for reading and writing delimited files
+ `datetime` - the fundamental date object and utilities package
+ `functools` - functional tools, including fancy stuff like partial function definitions and caching
+ `itertools` - an awesome package of utilities for iterating through collections of items
+ `json` - for parsing and constructing well formatted JSON
+ `logging` - for logging messages to console, file, etc
+ `os` - operating system interaction (I use this in almost every single program)
+ `pickle` - a `python`-native serialization protocol, for saving `python` stuff
+ `random` - a decent (if not special) randomization package
+ `re` - regular expression parsing package
+ `time` - a generic OS-level time interface

for any `python` installation, these *already exist* -- no installation necessary
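as a quick taste, here is a minimal sketch that leans on three of the packages above (`re`, `collections`, and `json`). the sample sentence and the variable names are made up for illustration:

```python
import collections
import json
import re

# a made-up sentence to count words in
text = 'the quick brown fox jumps over the lazy dog and the cat'

# `re` pulls the lowercase words out of the text
words = re.findall(r'[a-z]+', text.lower())

# `collections.Counter` tallies how often each word appears
counts = collections.Counter(words)

# `json` turns the three most common words into a JSON string
top_three = dict(counts.most_common(3))
print(json.dumps(top_three))
```

nothing here needed to be installed: all three imports ship with `python` itself.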

there are also a ton of great open-source libraries for just about any purpose you might imagine. Again, the ones I use most often:

+ `flask` - a `python` web framework (for standing up webpages)
+ `ipython` - the best interactive shell, it just makes the normal python program look silly
+ `jupyter` - the notebook extension of the above (`ipython`); this is what was used to make this bodacious document you see before you
+ `lxml` - a fast and flexible XML / HTML package
+ `matplotlib` - a plotting package that is super useful but will make `R` users dream of their former glory
+ `numpy` - NUMerical PYthon, a lot of super duper array and linear algebra glue code to make C and FORTRAN routines available in `python`.
+ `pandas` - PANel DAta, a dataframe interface for feature data. This is the main data science package in `python` and, again, I use it in almost every single program
+ `plotly` - an amazing plotting package
+ `psycopg2` - a `postgres` package
+ `requests` - the main web GET and POST package
+ `scipy` - SCIentific PYthon, an extension of `numpy` that adds more scientific utilities
+ `scrapy` - a flexible but easy web scraping framework
+ `seaborn` - something you import whenever you use `matplotlib` to make your plots non-heinous (also has some useful functions that no one has discovered yet)
+ `selenium` - a browser automation package that can run a page's javascript (for when `requests` isn't good enough)
+ `sklearn` - the other half of the primary data science workflow, an all-purpose modeling package
+ `spacy`, sometimes `nltk` - NLP libraries
+ `sqlalchemy` - an ORM package for most sql databases. It's pretty flashy and when you finally need it, you'll know in your heart.
+ `tensorflow`, `torch`, and `keras` - the three big libraries for deep learning implementations
+ `tqdm` - a fancy-pants progress bar package. You don't need it, but you want it.
+ `yaml` - a package for parsing the world's greatest configuration format, YAML (originally "Yet Another Markup Language", now officially "YAML Ain't Markup Language")

## installing packages

So, let's take a journey together.

Unlike `R`, the folks who put `python` together thought that people should care about the versions of the packages they installed. They didn't really do anything to make this happen in a sane way, though, so there were like ten different ways to install packages. 

If you learned `python` in the early days, you probably heard it was hard to install packages. Well, it was. Maybe it still is, depending on your attitude. That's right, I'm blaming the victim.

Really, though, I'm sorry. If you're coming to `python` from `R` this probably feels silly. Why not just have an `install` function and install whatever you want? 

Why? Basically, because that's a bad idea for writing production-level software.

production-level software is meant to be deterministic, and to be stable. Software that has the ability to install packages within the language has several disadvantages:

1. avoids administrator oversight
    1. having to ask your admin to install something is a *good* thing
2. could install something malicious or broken without anyone knowing
3. could install different versions on different machines at different times

Basically, the versions of all your packages matter, so you should care about that stuff. The `python` community is a real stickler about that and has gone to great lengths (and, like, 15 different methods) to try and solve that problem. And today, that means that everyone is doing one of the following (and then some):

+ using `pip` ("Pip Installs Packages", and yes, recursive acronyms are annoying)
+ using `pip`, but in a virtual environment
    + often done using `virtualenv` or `pyenv`
+ using `conda` (virtual environments on steroids or amphetamines, depending on whether you're a data scientist or sysad (resp))

I advocate using `conda` for many reasons -- more on this in the next lecture
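whichever installer you end up using, you can always ask `python` which version of a package actually got installed. the sketch below uses the standard library's `importlib.metadata` (available in `python` 3.8 and later); the helper name `installed_version` and the fake package name are made up for illustration:

```python
from importlib import metadata


def installed_version(package_name):
    """return the installed version string, or None if it is not installed"""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# `pip` is present in most installations, but that is an assumption;
# the second name is deliberately something that should not exist
for name in ['pip', 'definitely-not-a-real-package-xyz']:
    print(name, installed_version(name))
```

running this on two different machines is a cheap way to see whether they really agree on what is installed.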

# actually writing and executing `python` code

## interactive shells: `ipython` and `jupyter notebook`

the default `python` command opens a vanilla `python` shell, where you can execute any of the `python` commands your heart desires. that being said, the experience is obviously lacking the bells and whistles of any modern code development or execution environment.

for your personal use, `IPython` (interactive `python` shell) and `jupyter notebook`s are as close to *must install* packages as it gets.

I personally think of `ipython` as being the primary means of developing software, and `jupyter` as being almost exclusively for exploratory documents and presentations, but you should do whatever works for you!

the documents and slideshows we've been using as lecture notes this whole time were created with `jupyter`, and I believe you used them extensively in 510. if it is still new to you, though, `jupyter` is a cool `python` package which allows you to execute interpreted `python` commands in a "notebook" format, where commands and notes are isolated into separate "cells" that can be executed on demand.

there are a couple of popular "ways" of developing `python` code, and `jupyter` notebooks are probably the most popular.

I highly recommend becoming familiar with both, but particularly `jupyter`!

### what `jupyter` actually is

based on my understanding of `jupyter` usage in 510, I am assuming most of you are familiar with how to *use* `jupyter notebook`s. I think it can still be helpful to know what `jupyter` actually *is*, and what it is *doing*.

`jupyter` is a `python` package (we can install it!) that does a lot of different things, but the main thing it does is create a **service** (a long-running process that *listens* for requests and *responds* to them with information)

In [None]:
%%bash
ps -aef | grep jupyter

that long running service knows how to read specially-formatted `json` files (`.ipynb`). these files contain code snippets that `jupyter` (hopefully) knows how to run and text blocks it can render. take this file, for example:

In [None]:
%%bash
head -n100 /Users/zach.lamberty/personal/code/gu511/005_python.ipynb
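to make the "specially-formatted `json`" part concrete, here is a minimal sketch that parses a tiny notebook with nothing but the builtin `json` package. the notebook text is a hand-written, stripped-down stand-in (real files carry much more metadata), so treat the exact contents as an assumption:

```python
import json

# a hand-written, stripped-down stand-in for the contents of an `.ipynb` file
notebook_text = '''
{
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# python"]},
    {"cell_type": "code", "metadata": {}, "execution_count": null,
     "outputs": [], "source": ["import this"]}
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 2
}
'''

notebook = json.loads(notebook_text)

# each cell records its type and its source text (a list of lines)
for cell in notebook['cells']:
    print(cell['cell_type'], '->', ''.join(cell['source']))

# pick out only the cells that hold runnable code
code_cells = [c for c in notebook['cells'] if c['cell_type'] == 'code']
```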

finally, this process knows how to start up different processes for handling calculations in different languages (kernels).

when a **client** (your web browser) accesses the `jupyter notebook` **service** (via `http`), you begin a back-and-forth communication that eventually leads to you executing `python` code here in this pretty browser window.

note: this client-server paradigm comes up a lot, right? because we're communicating over `http`, we should be able to run `jupyter notebook` services anywhere in the world, and connect to them from anywhere else, as long as that `http` message can travel. hm... I smell a homework exercise... 
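the client / service idea fits in a few lines of standard library `python`. this is a toy sketch and has nothing to do with how `jupyter` itself is implemented: `HelloHandler` and `ask_service` are made-up names, and both the service and the client live on your own machine:

```python
import http.server
import threading
import urllib.request


class HelloHandler(http.server.BaseHTTPRequestHandler):
    """a toy service: answer every GET request with the same message"""

    def do_GET(self):
        body = b'hello from the service'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # keep the example quiet instead of logging every request
        pass


def ask_service():
    # port 0 asks the OS for any free port on this machine
    server = http.server.HTTPServer(('127.0.0.1', 0), HelloHandler)
    port = server.server_address[1]
    # the service listens in the background until we shut it down
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # the "client" side: send an http request and read the response
        # (the empty ProxyHandler skips any proxy settings on the machine)
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = 'http://127.0.0.1:{}/'.format(port)
        with opener.open(url, timeout=5) as response:
            return response.read().decode('utf-8')
    finally:
        server.shutdown()
        server.server_close()


print(ask_service())
```

swap `127.0.0.1` for a machine somewhere else in the world and the client code barely changes, which is the whole point of the paradigm.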

I bring up all of the above just to say: `jupyter` is complicated! you're doing a lot with not a lot of effort. good for you. as you go on and experience problems or quirks, most of them come back to this architecture. you have a central process running as someone (maybe not you), and your attempts to run `python` code are going through this filter

### code in other languages with `jupyter`

you may have noticed a cell I ran up above:

```python
%%bash
head -n100 /Users/zach.lamberty/personal/code/gu511/005_python.ipynb
```

that `%%bash` piece is `ipython` / `jupyter` specific -- it won't work in regular `python`. in this case it enables me to write code in a different language (here, `bash`), and the `jupyter` service knows how to execute it. `jupyter` currently [supports many programming languages this way](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels) (requires some installation)
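in regular `python`, the closest analogue I know of is the builtin `subprocess` package, which runs another program and hands you back its output. a minimal sketch, where the "other program" is just a second `python` interpreter so that it works on any OS:

```python
import subprocess
import sys

# run another program and capture what it prints; `sys.executable` is the
# path of the python interpreter that is running this very code
result = subprocess.run(
    [sys.executable, '-c', 'print("hello from a child process")'],
    capture_output=True,  # collect stdout / stderr instead of printing them
    text=True,            # give us strings rather than raw bytes
    check=True,           # raise an error if the child process fails
)
print(result.stdout)
```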

## editors and `IDE`s

an **editor** is any program which allows you to edit text. these range from extremely basic (notepad) to extremely full featured.

an **`IDE`** (`i`ntegrated `d`evelopment `e`nvironment) is the combination of an editor with tools for developing, testing, and packaging software. these are often extensible via plugins and have special commands that allow you to take common actions (e.g. refactor code to rename a variable everywhere it occurs, or auto-complete function names or common design patterns as you type) very quickly.

the fundamental concept of an IDE is usually a **project** (a collection of files and metadata about how they are related) rather than a single file. you develop projects rather than edit files

as an analogy: `IDE`s are to software development what microsoft office is to writing letters. they are software that provide you with a huge range of tools to do many different things. you don't need all the tools, but it's nice to have all the ones you do need.

there are a multitude of options for developing code in `python`, and the choice really comes down to your personal preferences.

if you're looking for the simplest possible starting point for doing *exploration* (not development!), you should use `jupyter notebook`s. this is the best exploratory environment for data science that `python` has to offer

if you've "grown up" coding in `RStudio`, you probably expect a windowed environment where you can write scripts, execute blocks, visualize output, and explore objects. in that case, you might want to consider:

+ [`pycharm`](https://www.jetbrains.com/pycharm/)
+ [`rodeo`](https://www.yhat.com/products/rodeo)
+ [`spyder`](https://pythonhosted.org/spyder/)

additionally, recent updates to `RStudio` itself (using the `R` `reticulate` package) [allow you to run `python` code in `RStudio`](https://blog.rstudio.com/2018/10/09/rstudio-1-2-preview-reticulated-python/).

if you are looking for something that is a significant step up from your base text editor, but not yet all the bells and whistles of a project-based IDE, you should look into

+ [`sublimetext`](https://www.sublimetext.com/)
+ [`notepad++`](https://notepad-plus-plus.org)
+ [`emacs`](https://www.gnu.org/software/emacs/download.html)
+ [`vim`](https://www.vim.org)

finally, if you're looking to really jump into development, I suggest (in order of my own personal preference)

+ [`pycharm`](https://www.jetbrains.com/pycharm/)
+ [`vscode`](https://code.visualstudio.com/)
+ [`atom`](https://atom.io/)

personally, I'm a huge fan of `pycharm` for my work, but `vscode` is a close second.

I also advocate for the cycle of

1. do exploration in notebooks
1. convert cells or blocks of related cells into parameterized functions
1. collect related functions into packages
1. write external packages and import directly into notebooks

# a crash course of stuff you should know or learn!

I know that Prof. Price covered `python` in his course, so this may be overkill. If you're a `python` pro, bear with me -- sit back and bask in your total l33tness while we take a lightning tour of things that I think are #important.

some of these topics may feel a little out of left field, but they are things I've learned that I think are essential (but not sufficient) to being a good `python` programmer

## code structure and organization

+ [pep8](https://www.python.org/dev/peps/pep-0008/) was a really good idea. you should follow it
    + specifically pay attention to naming conventions. they are important!
        + files: `someshortword.py`
        + variables and function names: `lowercase_with_underscores` (aka `snake_case`)
        + class names: `CamelCase`

+ there are basically two types of `py` file:
    + modules: a file `thisthing.py` written so that I can run `import thisthing` in a `python` session and nothing happens, but now I have new `python` toys like `thisthing.foo` and `thisthing.bar`
    + scripts: I can run `python thisthing.py` from a bash shell and it *does a thing*
    + if your file does a combination of those two, you should ask yourself why (and probably not do that)
+ if I run `import thisthing` and *something happens*, that is almost always not a good idea

for `python` files (not `jupyter notebooks`), this is

a bad idea:

```python
# thisthing.py

import pandas as pd
import sklearn.neural_network

x = pd.read_csv('magicdata.csv')
y = pd.read_csv('easytarget.csv')
m = sklearn.neural_network.MLPClassifier(
    hidden_layer_sizes=(1E999, 1E99999999999999),
    random_state=1337)

m.fit(x, y)
```

a better idea:

```python
# thisthing.py

import pandas as pd
import sklearn.neural_network


def load_xy(xfile='magicdata.csv', yfile='easytarget.csv'):
    x = pd.read_csv(xfile)
    y = pd.read_csv(yfile)  
    return x, y
   
   
def model(x, y):
    m = sklearn.neural_network.MLPClassifier(
        hidden_layer_sizes=(1E999, 1E99999999999999), 
        random_state=1337)
    m.fit(x, y)
    return m
    

def main(xfile='magicdata.csv', yfile='easytarget.csv'):
    x, y = load_xy(xfile, yfile)
    m = model(x, y)
    print(m.coefs_)
    

# more on this later...
if __name__ == '__main__':
    main()
```
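the short version of that last `if` statement: `python` sets the variable `__name__` to `'__main__'` only when a file is run directly as a script. a minimal sketch, with the data loading swapped out for a made-up stand-in so it runs without any data files:

```python
# thisthing.py (a stripped-down, hypothetical version)


def main(xfile='magicdata.csv', yfile='easytarget.csv'):
    # stand-in for the real work: just report what would be loaded
    message = 'would load {} and {}'.format(xfile, yfile)
    print(message)
    return message


# run as `python thisthing.py`, __name__ is '__main__' and main() runs;
# after `import thisthing`, __name__ is 'thisthing' and nothing happens
if __name__ == '__main__':
    main()
```

this is what lets one file behave as a well-mannered module on import while still being able to *do a thing* from the shell.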

**<div align="center">what are your questions so far??</div>**

## defining functions

the typical function definition in `python` has a standard structure

```python
def foo(a, b, c, d):
    # stuff ...
    return result
```

I want to cover two special topics related to how you define functions -- one thing you should do more of, and one thing you should never do

### `*args` and `**kwargs`

before we dive into what `*args` and `**kwargs` are, let's quickly set terminology: in `python`, users can pass arguments to functions in two ways:

1. as positional arguments (first, second, third argument)
1. as keyword arguments (`x=...`, `date=...`, etc)

for example, the function

```python
foo(a, b, c, d)
```

can have its arguments passed in order (`a` is the first element passed in, `b` is the second, etc) or by assignment (`a=...`, etc)

```python
# with only positional arguments
foo(1, 2, 3, 4)

# with only keyword arguments
foo(a=1, b=2, c=3, d=4)

# with a mix, but with ALL keyword arguments AFTER ALL positional arguments
foo(1, 2, c=3, d=4)
```

what is more, keyword arguments could be provided completely out of order

```python
foo(1, d=4, c=3, b=2)
```

will assign `a=1` (positional) and then all the other variables as provided
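here is a small runnable sketch of those call styles. this `foo` simply returns its arguments (my choice, so we can see which value landed where):

```python
def foo(a, b, c, d):
    # return the arguments so we can see which value landed where
    return (a, b, c, d)


# all four call styles bind the same values to the same names
print(foo(1, 2, 3, 4))
print(foo(a=1, b=2, c=3, d=4))
print(foo(1, 2, c=3, d=4))
print(foo(1, d=4, c=3, b=2))

# naming an argument that was already filled positionally is an error
try:
    foo(1, a=1, c=3, d=4)
except TypeError as error:
    print('TypeError:', error)
```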

under the hood, the management of variables provided to a function in `python` is done in a standard way: through the use of "packing" and "unpacking" of positional and keyword arguments.

when reading the variables you provide for the function, `python` starts off by collecting any **positional** arguments you provide into a `tuple` we call `args` by convention. `python` requires that all positional arguments come first (otherwise, how would we know which variable name to assign that value to?)

then, it collects any **keyword** arguments you provide into a `dict` we call `kwargs`.

so when we invoke the function `foo(a, b, c, d)` we defined above as

```python
foo(1, 2, c=3, d=4)
```

this is seen as having `args = (1, 2)` and `kwargs = {'c': 3, 'd': 4}`

if we had invoked it

```python
foo(1, d=4, c=3, b=2)
```

this is seen as having `args = (1, )` and `kwargs = {'b': 2, 'c': 3, 'd': 4}`

similarly,

```python
foo(1, 2, 3, 4)
```

would have `args = (1, 2, 3, 4)` and `kwargs = {}`, and

```python
foo(a=1, b=2, c=3, d=4)
```

would have `args = ()` and `kwargs = {'a': 1, 'b': 2, 'c': 3, 'd': 4}`

knowing that this happens actually opens up a major opportunity: you might already know that `python` supports variable length argument lists for any function -- that is, if you want a function where

```python
foo(1)
foo(1, 2)
foo(1, 2, 3)
foo(1, 2, 3, 4)
```

are all defined and do basically the same thing, you can do that in `python`

there is a pretty essential builtin function that works in exactly this way, in fact:

In [None]:
max(1, 2)

In [None]:
max(1, 2, 3)

In [None]:
max(1, 2, 3, 4)

there are also functions that accept a variable number of "keyword" arguments (arguments you pass in like `foo=bar, baz=buzz`). again, in base `python`:

In [None]:
dict(a=1, b=2, c=3, x=-3, y=-2, z=-1)

both of these processes work by leveraging the fact that `args` and `kwargs` are a `tuple` and a `dict` (respectively) and can be of arbitrary size. to allow for an arbitrary number of positional or keyword arguments, we simply change the way we declare the function from

```python
def foo(a, b, c, d):
    ...
```

to

```python
def foo(*args, **kwargs):
    ...
```

you can read

```python
def foo(*args, **kwargs):
    ...
```

as "take all the positional arguments and pack them into a tuple named `args`; then put all keywords into a dictionary named `kwargs`".

in the body of the function `foo`, you could refer to `args` or `kwargs` and get *whatever* the user passed in
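as a minimal sketch of actually *using* `args` and `kwargs` inside a function body: the function `total` and its optional `scale` keyword are made up for illustration.

```python
def total(*args, **kwargs):
    # `args` is a tuple of every positional value that was passed in
    result = sum(args)
    # `kwargs` is a dict; `scale` is a made-up optional keyword
    scale = kwargs.get('scale', 1)
    return result * scale


print(total(1))
print(total(1, 2, 3))
print(total(1, 2, 3, 4, scale=10))
```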

the `*` and the `**` here are doing the "packing" or the "unpacking"

+ packing: collect positional or keyword arguments into tuples and dicts (resp)
    + take a series of elements `x, 1, y, d, ...` and *pack* them into a tuple
    + take a series of keyword-value statements `var1=val1, var2=val2, ...` and *pack* them into a dictionary
+ unpacking: given a tuple or a dictionary, "explode" the elements into a list of positional or keyword arguments
    + take a tuple `(x, 1, y, d, ...)` and convert it into a series of elements `x, 1, y, d, ...` inside a function call
    + take a dictionary `{var1: val1, var2: val2, ...}` and convert it into a series of keyword=value statements `var1=val1, var2=val2, ...` inside a function call

let's create a function to see what this looks like on both sides

In [None]:
def foo(*args, **kwargs):
    print('args = {}'.format(args))
    print('kwargs = {}'.format(kwargs))

In [None]:
# only positional args
foo(1, 2, 'a', 'b')

In [None]:
# only keyword args
foo(val1=1, val2=2, val3='a', other_val='b')

In [None]:
# a mix of both positional args and keyword args
foo(1, 2, 'a', 'b', val1=3, val2=4, val3='c', val4='d')

importantly, we can use the `*` and `**` operators to "explode" things we want to pass to a function. for example, the following function expects to receive *exactly* four arguments with names `a, b, c, d`.

In [None]:
def bar(a, b, c, d):
    print('a = {}'.format(a))
    print('b = {}'.format(b))
    print('c = {}'.format(c))
    print('d = {}'.format(d))

+ if we provide 0 positional (non-named) arguments, we must provide `a=`, `b=`, `c=`, and `d=` values in any order
+ if we provide 1 positional (non-named) argument, it will be assigned to `a`, and we must provide `b=`, `c=`, and `d=` values in any order
+ if we provide 2 positional (non-named) arguments, they will be assigned to `a` and `b` (resp), and we must provide `c=` and `d=` values in any order
+ if we provide 3 positional (non-named) arguments, they will be assigned to `a`, `b`, and `c` (resp), and we must provide a `d=` value
+ if we provide 4 positional (non-named) arguments, they will be assigned to `a`, `b`, `c`, and `d` (resp), and we provide no keyword arguments at all

watch how we define a tuple of arguments `args` and explode it via `*args`, and how we do the same with a dictionary called `kwargs` we explode via `**kwargs`

In [None]:
args = tuple()
kwargs = {'a': 'kw_arg_a',
          'b': 'kw_arg_b',
          'c': 'kw_arg_c',
          'd': 'kw_arg_d'}
bar(*args, **kwargs)

In [None]:
args = ('pos_arg_a', )
kwargs = {'b': 'kw_arg_b',
          'c': 'kw_arg_c',
          'd': 'kw_arg_d'}
bar(*args, **kwargs)

In [None]:
args = ('pos_arg_a', 'pos_arg_b', )
kwargs = {'c': 'kw_arg_c',
          'd': 'kw_arg_d'}
bar(*args, **kwargs)

In [None]:
args = ('pos_arg_a', 'pos_arg_b', 'pos_arg_c' )
kwargs = {'d': 'kw_arg_d'}
bar(*args, **kwargs)

In [None]:
args = ('pos_arg_a', 'pos_arg_b', 'pos_arg_c', 'pos_arg_d', )
kwargs = {}
bar(*args, **kwargs)
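one place where packing and unpacking show up together all the time is a "wrapper" function that forwards whatever it receives to another function. a minimal sketch, where `announce`, `add`, and `loud_add` are all made-up names:

```python
import functools


def announce(func):
    """wrap `func` so every call is printed before it runs"""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # pack whatever the caller passed ...
        print('calling {} with args={} kwargs={}'.format(
            func.__name__, args, kwargs))
        # ... and unpack it again for the wrapped function
        return func(*args, **kwargs)
    return wrapper


def add(a, b, c=0):
    return a + b + c


loud_add = announce(add)

print(loud_add(1, 2))
print(loud_add(1, b=2, c=3))
```

`wrapper` never needs to know what arguments `add` takes; it just passes them along.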

**<div align="center">what are your questions so far??</div>**

### mutable default values are very bad

defining default values for the arguments in your function is very good!

making those default values *mutable* is very bad!

in `python`, a *mutable* object is any object which can be changed after it is created. anything which is not mutable is *immutable*. among basic types,

| type | mutable |
|-|-|
| `bool` | |
| `int` | |
| `float` | |
| `str` | |
| `tuple` | |
| `list` | x |
| `set` | x |
| `dict` | x |
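a quick sketch of what the table means in practice, using a list, a tuple, and a string as representatives:

```python
# lists are mutable: the object changes in place and keeps its identity
my_list = [1, 2, 3]
list_id = id(my_list)
my_list.append(4)
same_list = id(my_list) == list_id
print(my_list, same_list)

# tuples are immutable: item assignment raises a TypeError
my_tuple = (1, 2, 3)
try:
    my_tuple[0] = 99
    tuple_changed = True
except TypeError:
    tuple_changed = False
print(my_tuple, tuple_changed)

# "changing" a string really builds a brand new object
s = 'abc'
t = s.upper()
print(s, t)
```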

you will find that there will be times that you define a function and you want one of the arguments to be a list or a dict. you will want to define the default behavior of that function, and that behavior will often be determined by some default (usually empty) value for that argument.

you *may* be tempted to write something like

```python
# DON'T DO THIS
def foo(a, b, my_list=[]):
    # do stuff...
    ...
```

under the hood this will

1. create a function object `foo`
1. create a default for the `my_list` argument which is an empty list `[]`

that default value for `my_list` is *persistent* -- it is not re-created each time you invoke the function, but rather re-used.

this means that if you do something to that list, the next time you call that function the default value is not `[]` but whatever the last state of that variable was.

the canonical example is a function which will append a value to a list, where the list is optional -- if one isn't provided we will return a list with only the one item

In [None]:
def my_append(my_val, my_list=[]):
    my_list.append(my_val)
    return my_list

putting aside for a moment that this is a really dumb and contrived way to do this, let's see how it works:

In [None]:
x = my_append(1)
x

In [None]:
x = my_append(2, x)
x

In [None]:
x = my_append(3, x)
x

looks good, right?

what happens when we try again?

In [None]:
x = my_append(1)
x

wat

the default argument we provided (`[]`) was an empty list. when the function was defined, that list was created and stored in memory in `python`. then, every time we made a change to that element it changed *that list*

```python
x = my_append(1)     # the default list is updated to [1] and returned
x = my_append(2, x)  # that same list has 2 appended to it, [1, 2]
x = my_append(3, x)  # that same list has 3 appended to it, [1, 2, 3]
x = my_append(1)     # that default is now [1, 2, 3] and has a 1 appended to it
```
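you can watch this happen, because `python` stores default values on the function object in an attribute called `__defaults__`. a minimal sketch, using a renamed copy of the function (`leaky_append` is my name for it) so it doesn't clobber anything else in this notebook:

```python
def leaky_append(my_val, my_list=[]):
    my_list.append(my_val)
    return my_list


# the default list lives on the function object itself
print(leaky_append.__defaults__)

first = leaky_append(1)
print(leaky_append.__defaults__)

second = leaky_append(2)
print(leaky_append.__defaults__)

# both calls handed back the very same list object
print(first is second)
```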

so what's the solution?

a very standard pattern in `python` for *mutable* default values is to set a default value of `None` and replace the value when nothing is provided:

In [None]:
def my_append(my_val, my_list=None):
    if my_list is None:
        my_list = []
    my_list.append(my_val)
    return my_list

now

In [None]:
x = my_append(1)
x

In [None]:
x = my_append(2, x)
x

In [None]:
x = my_append(3, x)
x

and the moment of truth...

In [None]:
x = my_append(1)
x

**<div align="center">what are your questions so far??</div>**

## iteration

one of the first-order concepts in `python` is that things which are collections should be *iterable* -- you should be able to move through them in order using a 

```python
for item in collection:
    ...  # do a thing
```

construct. basically, any time a thing *can* be iterable, it *should* be iterable.
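making your own objects iterable takes one special method, `__iter__`. a minimal sketch with a made-up `Countdown` class:

```python
class Countdown:
    """a made-up collection that counts down from `start` to 1"""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # a `for` loop calls `iter(obj)`, which calls this method
        return iter(range(self.start, 0, -1))


for item in Countdown(3):
    print(item)

# anything iterable also works with `list`, `sum`, `max`, ...
print(list(Countdown(5)))
```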

if you are coming from `R`, or have predominantly executed code in `pandas`, you know that you should try to do *vectorized* things as often as possible. that is definitely still true!!

that being said, *outside* of those contexts, you will often be using and creating iterable things, so you should know about them.

iteration is so important that there is a standard library called [`itertools`](https://docs.python.org/3/library/itertools.html) written to support some complex iteration steps. Among many other things, this will allow you to:

In [None]:
import itertools

# iterate through products of different lists
# think "nested for loops"
alist = [0, 1, 2]
blist = 'abc'

for (a, b) in itertools.product(alist, blist):
    print(a, b)

In [None]:
# do all combinations with or without replacement
for (a0, a1) in itertools.combinations(alist, 2):
    print(a0, a1)
    
print()

for (a0, a1) in itertools.combinations_with_replacement(alist, 2):
    print(a0, a1)

In [None]:
# zip together lists with repetition from a shorter list
alist = range(10)
blist = 'ab'

for (a, b) in zip(alist, blist):
    print(a, b)
    
print()

for (a, b) in zip(alist, itertools.cycle(blist)):
    print(a, b)
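a few more `itertools` functions I think are worth knowing, as a quick sketch (this selection is mine, and there are plenty more in the docs):

```python
import itertools

# chain: walk through several iterables as if they were one
chained = list(itertools.chain([0, 1, 2], 'ab'))
print(chained)

# islice: take the first few items of an iterable, even an endless one
# (`itertools.count(0, 2)` counts 0, 2, 4, ... forever)
first_five_evens = list(itertools.islice(itertools.count(0, 2), 5))
print(first_five_evens)

# accumulate: running totals
running = list(itertools.accumulate([1, 2, 3, 4]))
print(running)
```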

### list comprehensions and generators

you are no doubt familiar with the ability to create lists, sets, tuples, and dictionaries using for loops:

```python
l = []
s = set()
t = tuple()
d = {}
for i in range(10):
    l.append(i ** 2)
    s.add(i ** 2)
    t += (i ** 2, )
    d[i] = i ** 2
```

but this is actually *not* the preferred way of creating collections from simple iteration. the `pythonic` way of doing that is to use *comprehensions*

the following are equivalent:

```python
l = []
s = set()
t = tuple()
d = {}
for i in range(10):
    l.append(i ** 2)
    s.add(i ** 2)
    t += (i ** 2, )
    d[i] = i ** 2
```

and

```python
l = [i ** 2 for i in range(10)]
s = {i ** 2 for i in range(10)}
t = tuple(i ** 2 for i in range(10))
d = {i: i ** 2 for i in range(10)}
```

the second form is both faster and more compact -- win win! and if you ask the `python` developers, more readable (though that's definitely debatable)
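if you want to check the "faster" claim on your own machine, the builtin `timeit` package will do it. a rough sketch: the function names and the sizes are arbitrary choices of mine, and the timings will vary from machine to machine.

```python
import timeit


def squares_loop(n):
    result = []
    for i in range(n):
        result.append(i ** 2)
    return result


def squares_comprehension(n):
    return [i ** 2 for i in range(n)]


# same answer either way
assert squares_loop(1000) == squares_comprehension(1000)

# time 200 calls of each; the numbers depend on your machine
loop_time = timeit.timeit(lambda: squares_loop(1000), number=200)
comp_time = timeit.timeit(lambda: squares_comprehension(1000), number=200)
print('loop:          {:.4f} s'.format(loop_time))
print('comprehension: {:.4f} s'.format(comp_time))
```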

a general loop structure like

```python
l = []
for elem1 in [some iterable]:
    if [conditional check on elem1]:
        for elem2 in [some other iterable, maybe depending on elem1]:
            if [conditional check on elem2]:
                ...
                l.append([some expression])
```

can be replaced with a **comprehension**

```python
l = [[some expression]
     for elem1 in [some iterable]
     if [optional conditional check on elem1]
     for elem2 in [some other iterable, maybe depending on elem1]
     if [optional conditional check on elem2]
     ...]
```

you can think of it as 

1. moving the final line (which defines what the collection elements are) to the very top
1. keeping the `for` and `if` elements in order
1. dropping the `:` characters
1. moving everything to the same indentation level

let's verify that the comprehensions we wrote above are equivalent to the nested loop statements

In [None]:
l = []
for i in range(10):
    l.append(i ** 2)
print(l)

print() 

l = [
    i ** 2
    for i in range(10)
]

print(l)

In [None]:
d = {}
for i in range(10):
    d[i] = i ** 2
print(d)

print()

# note: unlike for loops, comprehension expressions don't
# have to be on multiple lines
d = {i: i ** 2 for i in range(10)}
print(d)

In [None]:
d = {}

for i in range(5):
    if i % 2 == 0:
        for j in range(i, 5):
            if j % 2 == 0:
                d[i, j] = (i ** 2, j ** 3)
print(d)

print()

d = {
    (i, j): (i ** 2, j ** 3)
    for i in range(5)
    if i % 2 == 0
    for j in range(i, 5)
    if j % 2 == 0
}
print(d)

### generators

you may have been tempted to try this for tuples (`(a, b, ...)`) as well:

In [None]:
s = (i ** 2 for i in range(10))
s

# note: this is the correct way to do it for a tuple:
#s = tuple(i ** 2 for i in range(10))
#s

this is not a tuple comprehension but instead a *generator*. you can think of it as a memory-optimized version of the *comprehension* construct above.

+ a *comprehension* takes some iterables and a rule for composing them into individual values, and then builds all of the items at once and stores them in memory.
+ a *generator* acts more like a factory for those items -- it doesn't create them ahead of time, but can if asked

You can iterate through either, but if you use a generator you only need to hold one of those things in memory at a time.
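
a quick way to see that difference is to compare the size of a list comprehension with the size of the equivalent generator. this is just a sketch: the variable names are mine, and `sys.getsizeof` only measures the container object itself (not the integers inside it), which is already enough to make the point

```python
import sys

n = 100000

squares_list = [i ** 2 for i in range(n)]  # every item exists right now
squares_gen = (i ** 2 for i in range(n))   # nothing has been computed yet

# the list grows with n; the generator stays tiny no matter how big n is
list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size, gen_size)

# anything that iterates can consume either one
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
print(total_from_list == total_from_gen)
```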

In [None]:
g = ((i ** 2) for i in range(10))

In [None]:
# keep running this cell and see what happens
g.send(None)

each time we run the `send` method, our generator supplies us with the calculated value (from that recipe) given the next input (from that iterator)

in other words, the generator object had an *internal state*, a "current" value of those iterated `i` values, and a recipe for converting them into an object which it would then *yield* as we called the `send` method.

we can actually see that state

In [None]:
g = ((i ** 2) for i in range(10))

In [None]:
# keep running the cell and see what happens
print(g.send(None))
g.gi_frame.f_locals

of course, you normally wouldn't just create a generator and call `send` on it, you would use the normal iterator `for` loop construct:

In [None]:
g = (i ** 2 for i in range(10))
for i2 in g:
    print(i2)
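
two related things worth knowing: the built-in `next` function is the more common way to ask a generator for a single value (`next(g)` has the same effect as `g.send(None)`), and the `yield` keyword lets you write the same kind of factory as a regular function. a small sketch, where the function name `squares` is just something I made up:

```python
def squares(n):
    # each time a value is requested, run until the next `yield`, hand that
    # value back, and pause right here until asked again
    for i in range(n):
        yield i ** 2


g = squares(3)
first = next(g)        # first value
second = g.send(None)  # same effect as next(g)
third = next(g)

# the generator is now exhausted; the second argument to `next` is a
# fallback value to return instead of raising StopIteration
leftover = next(g, 'nothing left')

print(first, second, third, leftover)
```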

## `io` operations and file objects

### `os`

the `os` module has, basically, one goal: handle all the stuff that is different between different operating systems for you. 

the best example of this is file paths. suppose I want to create a file three directories below the current location: how do I write that path?

```sh
# in windows:
subdir1\subdir2\subdir3\myfile.txt

# in linux:
subdir1/subdir2/subdir3/myfile.txt
```

it'd be sad if such a dumb difference broke our script

in steps the `os` module:

In [None]:
import os
os.path.join('subdir1', 'subdir2', 'subdir3', 'myfile.txt')

my recommendation: never write a path in `python` again, ever, for any reason. always use `os.path`

the way that `os` joins those directories together is by using the `os.sep` character

In [None]:
os.sep

note that if we want to create a path relative to the root directory, then, we could do the following:

In [None]:
os.path.join(os.sep, 'tmp', 'myfile.txt')
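
`os.path` can also take paths *apart*, which is handy when you are handed a path and need pieces of it. a quick sketch using a made-up relative path:

```python
import os

path = os.path.join('subdir1', 'subdir2', 'myfile.txt')

directory = os.path.dirname(path)        # everything before the last separator
filename = os.path.basename(path)        # just the last piece
root, ext = os.path.splitext(filename)   # split off the file extension

print(directory, filename, root, ext)

# going the other direction: rebuild the original path from the pieces
rebuilt = os.path.join(directory, root + ext)
print(rebuilt == path)
```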

another very useful part of the `os` module is the `environ` dictionary object, which is an OS-agnostic way of loading all of the environment variables:

In [None]:
os.environ

note: this is a `python` dictionary-like object:

In [None]:
os.environ['PWD']

there are a ton of other goodies in the `os` module, but if you learn nothing else, you should know

1. use `os` for building paths -- never write paths as strings!
1. you can access environment variables via `os.environ` -- this is one way to parameterize the `python` scripts that you write
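
as a sketch of what "parameterizing a script" can look like: the variable names `MY_INPUT_DIR` and `MY_N_ROWS` below are made up, and normally they would be set in the shell before `python` starts rather than inside the script itself

```python
import os

# pretend this was set in the shell, e.g. `export MY_INPUT_DIR=/tmp/data`
os.environ['MY_INPUT_DIR'] = os.path.join(os.sep, 'tmp', 'data')
# and make sure this one is *not* set, so we can see the fallback in action
os.environ.pop('MY_N_ROWS', None)

# .get works like it does on a dictionary: the second argument is a default
# for when the variable is missing, so the script doesn't die with a KeyError
input_dir = os.environ.get('MY_INPUT_DIR', os.getcwd())

# environment variable values are always strings, so convert as needed
n_rows = int(os.environ.get('MY_N_ROWS', '100'))

print(input_dir, n_rows)
```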

### reading and writing files

for many people, the idea of "opening a file" is not any different than saying "here's a file path, go get me all the stuff in it". for example, at this point, I imagine most of you are familiar with something such as:

```python
import pandas as pd

df = pd.read_csv('/path/to/my/precious/data.csv')
```

I have a file name, I open it with a function, what else is there?

there are actually many different things you might want to do with a file:

1. read the contents into a single `str` because that's what some other package requires
1. replace or remove all of the occurrences of a word
1. load the first 100 lines only
1. search through a file to find out what line a particular string is on
1. count all occurrences of a given word
1. replace certain characters for easier NLP processing

now, you could just load an entire file to a string object or `pandas` data frame, make your changes, and write it out. that's fine until you get to a file that is several GBs.

most of the things I mentioned above that you might want to do involve *iterating* through a file one character or line at a time. this is the fundamental way that `python` handles files.

`python` interacts with the file system through a concept called a "file object," which you can basically think of as a cursor pointing to a memory address at a certain point within a file. given where this cursor is currently, the file object could read the next character, the next word, the next line (etc). it could write new contents to the file.

[about 1 million steps under the hood](https://github.com/pandas-dev/pandas/blob/master/pandas/io/common.py#L480-L490), `pd.read_csv` is doing this for you (thanks, `pandas`!), making that `str` path usable.

other libraries -- important ones! -- don't take care of you that way.

1. `pickle` (builtin serialization package) works on file objects
1. `json` and `yaml`, libraries for parsing two of the most common configuration formats, work on file objects
1. several parsing libraries (`lxml`, `nltk`) require string inputs or file pointers, so `read_csv` and the like are out
1. many non-default deep learning framework file loading operations

all I mean to say is: **you will use file objects!**. it's good to know what they are, even at a high level

let's look at how you use a file object on a low level. if you don't follow right now, don't worry -- you *generally* won't have to do this, but you do very often have to create a file object and pass it to some other function that does, so it's helpful to know more

the main function for interacting with files is the `open` function. 

In [None]:
help(open)

#### file streams [advanced]

the first line of the `help` string on the `open` function says

> Open file and return a stream.

a *stream* is an abstraction for a collection of data. it is like a list in that it is an ordered sequence of data, but unlike a list in that it is not bounded (it could continue producing items forever) and implies that there are two "sides" -- upstream, where new data is generated, and downstream, where that content is sent and consumed

wherever you find streams transferring data from a producer to a consumer, you might also find a **buffer** -- an in-memory queue of content to consume that is allowed to get arbitrarily large (within memory limits).

in `python`, the `open` function returns a **stream** object which can transfer content from a file into the `python` program memory (or vice versa). this object also acts as a **buffer**, storing up content that we want to write to a file or individual lines we want to read from it.

we can get some practice using file objects to learn a bit about streams and buffers (and how IO happens in `linux` in general)

In [None]:
f_out = open('/tmp/testfile.txt', 'w')  # the 'w' says we want to 'w'rite

f_out.write('hello')

In [None]:
%%bash
less /tmp/testfile.txt

hm... nothing?

we have added the 5 characters `hello` to the *buffer*; we can push them downstream with the `flush` method

In [None]:
# this saves the writing we've done
f_out.flush()

In [None]:
%%bash
less /tmp/testfile.txt

In [None]:
f_out.write('world')
f_out.flush()

In [None]:
%%bash
less /tmp/testfile.txt

yep -- you have to write *everything*. even spaces, or new line characters:

In [None]:
f_out.write('\n')
f_out.write('hello\n')
f_out.write('world')
f_out.flush()

In [None]:
%%bash
less /tmp/testfile.txt

note: you have to **close** file objects!

In [None]:
f_out.close()

so now we have *written* a file using the file object. we use the same `open` function to create a file object to *read* contents:

In [None]:
f_in = open('/tmp/testfile.txt', 'r')

you can read the *entire* contents of the file as a single string with `.read`

In [None]:
f_in.read()

we have now consumed the entirety of the *stream* and there is nothing else left to read. if we try to run `.read` *again*, we won't re-read the file -- we will read "whatever is left" in the stream, which as of right now is nothing

In [None]:
f_in.read()

if we want to reference the contents again, we need to either

1. have saved it to a variable when we first ran it, or
1. close and re-open the file.

the ship has sailed on 1, so let's do 2

In [None]:
f_in.close()
f_in = open('/tmp/testfile.txt', 'r')
s = f_in.read()
s

In [None]:
f_in.close()

# still works after closing the file object
s
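
strictly speaking there is a third option that the list above skips: a file object opened on a regular file can move its cursor back to the start with the `seek` method. a small sketch, using a throwaway file of my own (the `seek_demo.txt` name is made up) rather than `/tmp/testfile.txt`:

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'seek_demo.txt')
f_out = open(path, 'w')
f_out.write('hello\nworld')
f_out.close()

f_in = open(path, 'r')
first = f_in.read()    # consumes the whole stream
second = f_in.read()   # nothing left, so this is an empty string
f_in.seek(0)           # move the cursor back to the very beginning
third = f_in.read()    # the full contents again
f_in.close()

os.remove(path)
print(repr(first), repr(second), repr(third))
```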

above we read *all* of the contents in a file into a single string, but some files (and especially the ones we might care most about!) will be too big to fit in memory

we can also *iterate* over the lines in the file, reading in one line at a time. here we are using our **stream** to populate the contents of a **buffer** with one line of characters, then accessing that line of characters

In [None]:
f_in = open('/tmp/testfile.txt', 'r')

f_in.readline()

In [None]:
f_in.readline()

In [None]:
f_in.readline()

In [None]:
f_in.readline()

In [None]:
f_in.close()

finally, we can use the file object as an *iterator* in a `for` loop:

In [None]:
f_in = open('/tmp/testfile.txt', 'r')

for line in f_in:
    print(line)

f_in.close()
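
iterating like this is how you would tackle a couple of the tasks from the list at the start of this section (loading only the first few lines, or finding which line a string is on) without ever holding the whole file in one string. a sketch, using a throwaway file with made-up contents rather than our `/tmp/testfile.txt`:

```python
import itertools
import os
import tempfile

# write a throwaway file with 1000 short lines
path = os.path.join(tempfile.gettempdir(), 'iteration_demo.txt')
f_out = open(path, 'w')
for i in range(1000):
    f_out.write('this is line {}\n'.format(i))
f_out.close()

# load only the first 3 lines: islice stops asking for lines after 3
f_in = open(path, 'r')
first_lines = list(itertools.islice(f_in, 3))
f_in.close()

# find which line a string is on, looking at one line at a time
found_on = None
f_in = open(path, 'r')
for line_number, line in enumerate(f_in):
    if 'line 500' in line:
        found_on = line_number
        break
f_in.close()

os.remove(path)
print(first_lines)
print(found_on)
```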

that's enough streaming for now -- let's clean up this testfile and call it a day

In [None]:
%%bash
rm /tmp/testfile.txt
# mv it day

so, this may feel a little low-level and annoying, and also overkill for some of our purposes. well, it is. people much smarter and better at programming in `python` did us a solid by writing us a bunch of libraries to handle the reading and writing of data.

that being said, *it is super common* that a function wants to take a *file object* and not a name of a file. so you should get used to the idea that you might have to take the extra step of using the `open` function to create a file object from a file name.
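
the builtin `json` module is a typical example: `json.dump` and `json.load` both want a file object, not a file name. a minimal sketch (the file name and the `config` contents are made up, and the `with` block is explained in the context managers section further down; all it does here is close the file for us):

```python
import json
import os
import tempfile

config = {'name': 'zach', 'n_rows': 100}
path = os.path.join(tempfile.gettempdir(), 'file_object_demo.json')

# json.dump takes the thing to save and a file object opened for writing
with open(path, 'w') as f:
    json.dump(config, f)

# json.load takes a file object opened for reading
with open(path, 'r') as f:
    loaded = json.load(f)

os.remove(path)
print(loaded)
```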

*in the interest of time, I'm going to comment out the `csv` section for lectures. if you're reading at home, this package is super useful, but it's less important than many of the others and I'd rather be sure to get to them than to cover this package*

**<div align="center">what are your questions so far??</div>**

### `csv` [advanced]

before `pandas` dataframes, there were lists of dictionaries:

```python
[
    {'col0': val00, 'col1': val10, 'col2': val20},
    {'col0': val01, 'col1': val11, 'col2': val21},
    {'col0': val02, 'col1': val12, 'col2': val22},
    {'col0': val03, 'col1': val13, 'col2': val23},
]
```

this is one `pythonic` way of representing a csv file: records as dictionaries, and key-value pairs corresponding to header field names and values.

the `csv` module (and specifically the `csv.DictReader` and `csv.DictWriter` objects) allows us to read and write csv files into this data structure

In [None]:
import csv

x = [
    {'a': 1, 'b': 2, 'c': 3},
    {'a': 100, 'b': 200, 'c': 300},
]

# I'll explain what this "with" thing is later
with open(os.path.join(os.sep, 'tmp', 'myfile.csv'), 'w') as f:
    c = csv.DictWriter(f, fieldnames=['a', 'b', 'c'])
    c.writeheader()
    c.writerows(x)

In [None]:
%%bash
less /tmp/myfile.csv

and now we could read it (or any csv) in:

In [None]:
# I'll explain what this "with" thing is later
with open(os.path.join(os.sep, 'tmp', 'myfile.csv'), 'r') as f:
    c = csv.DictReader(f)
    # note: c is an iterator over the rows of the file; you still need to
    # iterate through it all to get all the records!
    x = list(c)
    
x

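note: in `python` 3.8 and later the records above print as plain `dict`s; in 3.6 and 3.7 they print as `OrderedDict`s. an `OrderedDict` is a special class (from the `collections` module) which is simply a dictionary where the order of the keys is remembered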

##### why should you ever do this?

generally speaking, you will probably want to read `csv` files in with `numpy`, `scipy`, or `pandas`. however, it is possible you might be in an environment where those are not made available to you.

first, ask yourself why you are acting as a data scientist but not allowed to use actual data scientist tools. then, remember that the answer *does* exist in the standard library, and see what you can figure out.

## control flow options

you are hopefully all very familiar at this point with the `if` / `elif` / `else`, `for`, and `while` control flow keywords. we will talk about a few other options for controlling execution of code in your `python` programs

### `try`, `except`

one of the first major leaps that a beginning programmer makes is recognizing that the code that they wrote won't work for *every* imaginable situation. take, for example, a function that looks to `log`-normalize a feature in your dataset.

what happens when your feature contains negative values?

the instinct of most beginning programmers is to **proactively protect against** these dangerous scenarios:

```python
if (my_feature <= 0).any():
    handle_negs(my_feature)
else:
    log_normalize(my_feature)
```

programming in this way is called the "look before you leap" paradigm -- you check for possible problems and once you're sure none are there, you proceed with your plan.

while this will *work*, it's actually not "the right way (tm)." there are many reasons, but the most common scenario under which this is not the optimal way forward is the scenario in which the "bad thing" (e.g. negative values in your feature) is not normal behavior.

almost every time you run your code it would have been fine to just go straight to `log_normalize`, but you had to waste some (even small!) amount of time getting there. that's less efficient and becomes more confusing as more special cases are added. it's not hard to imagine you come across more and more complicated edge cases and the actual "thing" you are doing here (`log_normalize`) ends up many, many lines of code deep in your program.

when the thing you are protecting yourself from is an **exception to a rule**, you should instead use the "easier to ask forgiveness than permission" paradigm.

`try` to do the thing you want to do, and then if there is an `exception` that makes that impossible, "catch" that exception and try to do something different

`python` has keywords for this: `try` and `except`

```python
try:
    log_normalize(my_feature)
except ValueError:
    handle_negs(my_feature)
```

this will only try to fix negative values if the error (aka `Exception`) that they cause (in this case, a `ValueError`) is raised. adding the `except ValueError` block is referred to as "handling" that error, and we can handle many types of errors by adding more `except` statements (just like `elif` statements)
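
here is a small runnable sketch of stacking `except` blocks. the `safe_log` helper is hypothetical (it is not the `log_normalize` function from above); it just wraps `math.log`, which raises a `ValueError` for zero or negative numbers and a `TypeError` for things that aren't numbers at all

```python
import math


def safe_log(x):
    # hypothetical helper: return log(x), or None if x can't be logged
    try:
        return math.log(x)
    except ValueError:
        # raised for zero and negative numbers
        print('cannot take the log of {}'.format(x))
    except TypeError:
        # raised for things that aren't numbers, like strings
        print('{!r} is not a number'.format(x))
    return None


results = [safe_log(x) for x in (1, -1, 'one')]
print(results)
```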

### `try`, `except`, and custom exceptions

one major leap in your programming abilities comes from doing something pretty simple: creating your own exceptions. 

in this code

```python
try:
    log_normalize(my_feature)
except ValueError:
    handle_negs(my_feature)
```

I catch *all* `ValueError` errors, and for each of them I try to handle negative values.

what happens if the `log_normalize` function failed with a `ValueError` but **not** because of negative numbers? `ValueError` is not *really* specific to the problem I encountered, and other `ValueError`s could easily occur.

we can be more specific about what caused our errors by creating our own exceptions. for example, I could have written

```python
import math


class LogOfNegativeError(Exception):
    pass

def log_normalize(my_feature):
    # blah blah blah
    try:
        z = math.log(my_feature)
    except ValueError:
        raise LogOfNegativeError(
            "cannot log norm a feature with a negative value. rescale to positive values")
    # blah blah blah
```

this has two major advantages

1. I can tailor it to give much more information to the readers and users of my code
    + the error message could be anything -- including steps the user should take to address it
1. in code that uses this function, I can be explicit in how I handle it

```python
try:
    log_normalize(my_feature)
#except ValueError:
except LogOfNegativeError:
    handle_negs(my_feature)
```

now this won't accidentally capture things I *thought* were logs of negative values
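
to see the whole pattern run end to end, here is a self-contained sketch. both function bodies are simplified stand-ins that I made up (the real `log_normalize` and `handle_negs` are never spelled out above), so treat the details as assumptions

```python
import math


class LogOfNegativeError(Exception):
    pass


def log_normalize(values):
    # simplified stand-in: take the log of every value in a list
    try:
        return [math.log(v) for v in values]
    except ValueError as e:
        # `from e` keeps the original error attached for debugging
        raise LogOfNegativeError('found a non-positive value') from e


def handle_negs(values):
    # hypothetical fix: shift everything so the smallest value becomes 1
    shift = 1 - min(values)
    return [v + shift for v in values]


my_feature = [-2, 0, 5]
try:
    normalized = log_normalize(my_feature)
except LogOfNegativeError:
    normalized = log_normalize(handle_negs(my_feature))

print(normalized)
```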

**<div align="center">what are your questions so far??</div>**

### context managers and the `with` statement

in the [reading and writing files section above](#reading-and-writing-files) I said that every time you open a file you must close it.

for every

```python
f = open(filename, 'r')
```

there must be a 

```python
f.close()
```

data scientists are only human, and it is extremely possible that in the course of a long and complicated file processing function this simply slips your mind. you're not alone! this could happen to anyone.

I've said often that you *must* do this, but not *why*. in the case of streams to the file system -- or other sorts of connections you might make, e.g. to databases -- there is often a limit to the number of simultaneous connections that we are allowed to make. sometimes it's a system resources limitation, sometimes it's a configuration limitation (e.g. only 100 simultaneous users on a database)

additionally, you have reserved some amount of your precious `python` process memory for something you don't need any more. with only one open connection, that's not a big deal -- but what if it's hundreds? it can happen!

if you forget to close your connections, you're consuming those resources. that's not great!

wouldn't it be nice if there were a better way to `open` files? a way that would magically `.close` them as soon as we were done using them?

folks, you'll never believe it...

*there is!*:


```python
with open(filename, 'r') as f:
    # blah blah
```

`with` is a special `python` keyword statement that takes care of the *setup* and *teardown* steps you might need to take for a given object. under the hood, `python` developers have collected all of the things you need to do to *set up*, i.e. correctly open a thing (like a file), and also the things you need to do to *tear down*, i.e. close and clean up a thing (like a file)

things that can be passed to the `with` statement are called [**context managers**](https://docs.python.org/3.7/reference/datamodel.html#context-managers)

when we write

```python
with [some context manager object] as [alias]:
    # do stuff ...
```

we are

1. creating an object and giving it an alias
1. "entering" a context (defined by that object)
1. "doing stuff"
1. "exiting" a context (defined by the object)

let's look at the `open` function as a context manager

```python
with open('/tmp/testfile.txt', 'w') as f_out:
    f_out.write('test test test')
```

1. we ask `python` to create a file object and give it the alias `f_out`
    + this is analogous to `f_out = open('/tmp/testfile.txt', 'w')`
1. we "enter" a context
    + the `open()` file object `f_out` has a method `f_out.__enter__` which is called and "sets up" the file object stream
1. we do whatever we want (write to the file)
1. we "exit" the context
    + the `f_out.__exit__` method is called and "tears down" the file stream (i.e. calls `f_out.flush()` and `f_out.close()` for us)
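
the same four steps can be watched in slow motion with a toy context manager of our own. this is only a sketch (the `Announcer` class is made up for illustration); any object with `__enter__` and `__exit__` methods can be used in a `with` statement

```python
class Announcer:
    # toy context manager that records what happens, and in what order

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append('enter')
        return self  # this is what gets bound to the alias after `as`

    def __exit__(self, exc_type, exc_value, traceback):
        # called when the block ends, whether or not an error occurred
        self.events.append('exit')
        return False  # False means "don't swallow exceptions"


with Announcer() as a:
    a.events.append('doing stuff')

print(a.events)
```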

In [None]:
# these functions really do exist
f_tmp = open('/tmp/testfile.txt', 'w')

f_tmp.__enter__

In [None]:
f_tmp.__exit__

In [None]:
f_tmp.__exit__()

In [None]:
f_tmp.closed

the same sort of thing applies to many types of connections you might make in data science

+ create a connection to a database
    + open the connection on `__enter__`, maybe commits transactions but definitely closes connections on `__exit__`
+ create a cursor or a transaction while connected to a database
    + build that object on `__enter__`, maybe commit and definitely delete the object on `__exit__`
+ a `tensorflow` session
    + the `__enter__` prepares a graph for execution and the `__exit__` cleans up the `tensorflow` graph

recommendation: *never* open a file with

```python
# bad way -- very bad, no! bad!
f = open(filename, 'r')

# do stuff

f.close()
```

but *always* use context managers

```python
# yaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!
with open(filename, 'r') as f:
    # do stuff
```

this way you will *never forget* and if the context object (here, a file object) ever gets more complex or requires more clean up *you don't have to care*, and not having to care is the very heart of good programming.
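
one more point in favor of `with`: the teardown still happens if something blows up in the middle of the block. a sketch that forces an error on purpose (the file name and the `RuntimeError` are made up for the demonstration):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'with_demo.txt')

try:
    with open(path, 'w') as f:
        f.write('about to fail')
        raise RuntimeError('something went wrong mid-write')
except RuntimeError:
    pass

# the file object was closed on the way out, despite the error
print(f.closed)

# and what we wrote before the error was flushed to disk by that close
with open(path, 'r') as f_in:
    contents = f_in.read()

os.remove(path)
print(contents)
```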

**<div align="center">what are your questions so far??</div>**

## string formatting

you should read [this entire format string syntax page](https://docs.python.org/3.6/library/string.html#formatstrings).

the basic gist of it, though, is that every string object `s` in python has a member function

```python
s.format(...)
```

and this can be used to replace elements within the string that are coded within `{}` characters. There is a large and highly flexible mini-language for doing this.

in particular, we can

+ pass an arbitrary number of *positional* arguments

In [None]:
myname = 'zach'
'hello {}, how are you today'.format(myname)

In [None]:
yourname = 'caitlin'
mymood = 'great'
"hey {}, I'm {}".format(yourname, mymood)

+ parameterize the string and provide values as named arguments (for readability)

In [None]:
# values passed as named arguments
'{name} is feeling {mood} today'.format(name='zach', mood='groovy')

In [None]:
# look -- keyword explosion in the wild!!
me = {'name': 'zach', 'mood': 'groovy', }
'{name} is feeling {mood} today'.format(**me)

+ control the precision of floating point numbers we print out

In [None]:
'here are only the first three decimal places: {:.3f}'.format(0.123456789)

+ left, right, or center align
+ set a fixed size of the output
+ set a fill character (to fill the output where the thing-to-be-format-ted doesn't fill it)

In [None]:
s = 'left aligned with "x"s '
'{:x<60}'.format(s)

In [None]:
s = ' center aligned with "-"s '
'{:-^60}'.format(s)

In [None]:
s = ' right aligned with "*"s'
'{:*>60}'.format(s)

+ format datetime using the same basic commands we use in the `date` function in `bash`

In [None]:
from datetime import datetime

d = datetime.now()
'right now it is {:%Y-%m-%d %H:%M:%S}'.format(d)

In [None]:
'which is the same as {d:%F %T}, aka {d:%b %d, %Y at %T}'.format(d=d)
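
if you are on `python` 3.6 or later, the same mini-language is also available in *f-strings* (string literals prefixed with `f`), which pull values straight from the variables in scope. a sketch of a few of the examples above rewritten that way, with a fixed date of my choosing so the output is predictable:

```python
from datetime import datetime

name = 'zach'
mood = 'groovy'
x = 0.123456789
d = datetime(2020, 1, 2, 3, 4, 5)

# everything after the `:` is the same format spec used with .format
s1 = f'{name} is feeling {mood} today'
s2 = f'three decimal places: {x:.3f}'
s3 = f'{" centered ":-^20}'
s4 = f'it is {d:%Y-%m-%d %H:%M:%S}'

print(s1, s2, s3, s4, sep='\n')
```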

**<div align="center">mini exercise: write "hello world" to file</div>**

1. create a "hello world" function that
    1. takes a person's name as its first variable
    2. creates a string to write to file using string formatting expressions and the variable name 
    3. open a file `/tmp/test.txt` using the `with` context manager construct
    4. write that string to file
    5. in bash, print the results of that code to screen

In [None]:
os.path.join(os.sep, 'tmp', 'test.txt')

In [None]:
import os

def hello_world(name):
    s = 'hello {}, how are you today?'.format(name)
    filename = os.path.join(os.sep, 'tmp', 'test.txt')
    with open(filename, 'w') as f:
        f.write(s)
    
hello_world('caitlin')

In [None]:
%%bash
cat /tmp/test.txt

## `pandas`

`pandas` is *the* data science data structure package. in the latter parts of this course we will be using `pandas` and `scikit-learn` to do some of our analyses, so I'll push the learning of `pandas` off until then. Just know that you should learn `pandas`.

In many ways, `pandas` is a second programming language, because it sits part-way between `python` (where there is one and only one way to do things, and iteration is a first-order principle) and `R` (where everything is vectorized). there is also a paradigm shift relative to the `R` community: whereas the `R` community is focused on making the language easy and intuitive for the user, the `pandas` community defaults more toward the development community. this will make the learning curve particularly steep for most.

In [None]:
import pandas as pd

df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
)

df.head()

In [None]:
df.describe()

In [None]:
df.groupby('class').mean()

## plotting

okay, okay, I hear you: "but all I want to do is plot"

one common knock against `python` is that it makes plotting so much harder than `R`, and that is absolutely a valid complaint. That being said, I don't think it is *quite* as bad as people like to make it out to be

### `matplotlib`

this is the primary plotting workhorse in the `python` world, and also the source of much of the angst. the name comes from the fact that this package was actually built as an attempted replacement plotting package for *matlab* transfers.

rather than dive into a whole tutorial about `matplotlib`, let me just offer a few pieces of advice:

1. begin any notebook in which you plan to plot with `%matplotlib inline` or `%matplotlib notebook`
2. favor the object oriented method (*e.g.* `f, ax = matplotlib.pyplot.subplots(); ax.plot(x, y)`) over the "interactive" `pyplot` method (*e.g.* `matplotlib.pyplot.plot(x, y)`).
3. import `seaborn` first; it will configure almost all of the defaults for you
4. you're probably using `pandas`, so lean on the `pandas` plotting builtins

In [None]:
%matplotlib inline

import random
import matplotlib.pyplot as plt

x = [random.gauss(0, 1) for i in range(1000)]
y = [random.gauss(0, 1) for i in range(1000)]

f, ax = plt.subplots()
ax.plot(x, y, linestyle='', marker='o')

### `matplotlib` with `pandas`

the good folks behind `pandas` decided to build support for `matplotlib` plotting directly into their dataframe object interface, so often times you are best off calling a dataframe's `plot` method:

In [None]:
df = pd.DataFrame({'x': x, 'y': y})
df.head()

In [None]:
df.plot.scatter('x', 'y')

### `seaborn`

think of `seaborn` as a sort of conceptual gradient boost to `matplotlib` -- it was created to handle some of the rough edges for the statistical community, and makes first-class some familiar `R` plotting concepts such as

1. easy distribution and kde plots
2. panel plotting
3. plot types
    1. violin plots, box and whisker plots, swarmplots
4. plotting features
    1. jitter, nicer default color schemes
5. several statistically familiar datasets

if you do nothing else to make your plots better, I recommend starting every notebook with

```python
import seaborn as sns
sns.set()
```

In [None]:
import seaborn as sns
sns.set()

In [None]:
sns.jointplot(x='x', y='y', data=df)

In [None]:
sns.jointplot(x='x', y='y', data=df, kind='kde')

In [None]:
iris = sns.load_dataset('iris')
sns.pairplot(iris)

### `plotly`

finally, I strongly urge you to check out `plotly`. it has come a long way since the early days (in which it was a fundamental requirement that you post your plots to a public website -- a total non-starter for work with sensitive or proprietary information, for example). it is based on a language-independent graphical object model, and as a result not only is the process of *creating* graphs nearly identical from `R` to `python` to `matlab` to `julia` to (you get it), but converting from one language to another is actually built in to the package itself.

plus, it's a `d3` based `javascript`-first package, so it often has some of the cool web bells and whistles before most of the others do (really, before any of them do in `python`)

In [None]:
import plotly
import plotly.graph_objs as go

In [None]:
data = [go.Scatter(x=df.x,
                   y=df.y,
                   mode='markers')]

fig = go.Figure(data)
fig

in much the same way as `seaborn` provides a thin wrapper around `matplotlib` for making the most common but complicated statistical plot types, `plotly.express` wraps base `plotly` to give a second useful interface

In [None]:
import plotly.express as px

fig = px.scatter(df, x="x", y="y", marginal_x="histogram", marginal_y="histogram")
fig.show()

## that `__main__` thing [advanced]

at the very beginning I wrote a "better idea" version of a module file, and it ended with this block:

```python
if __name__ == '__main__':
    main()
```

can someone explain what is happening with that block?

as I said above, there are two types of `python` files: modules that provide functions for doing things, and scripts that actually do things. 

modules all have a "name" member variable which is accessible via

```python
mymodule.__name__
```

(pronounced "mymodule dunder name").

for example:

In [None]:
import os
os.__name__

In [None]:
import logging.config
logging.config.__name__

the `__name__` variable value can be hard-coded to be something special within the source code of the module, but by default it is the same as the module name as it gets imported. so, if you wrote a `python` file `thisthing.py`, without making any change at all you would find that

```python
import thisthing
thisthing.__name__
```

would print the string 

```
'thisthing'
```

what's going on here is roughly the following:

1. the `python` interpreter sees that you want to `import thisthing`
2. it creates a "namespace" for `thisthing`
    1. a "namespace" is a segmented place where the contents of the `thisthing` can be put
        1. helps avoid naming conflicts
        2. is basically a big dictionary with "names" and the compiled objects they point to (like functions, values)
    2. a special variable `__name__` is created inside the `thisthing` module with a value `"thisthing"`
    3. all of the functions and values in `thisthing.py` are then executed and loaded into the `thisthing` namespace
    4. within the scope of `thisthing.py`, it is known that the "name" of their namespace is `thisthing`
3. all of the items in the `thisthing` module are then made available as `thisthing.SOME_ITEM`

so when the interpreter goes to `import os`, it creates a namespace `'os'`, it creates a `__name__` value within that namespace, and it loads everything in `os`.

the end result is that there is now an object called `__name__` within the namespace `os`, aka

```python
os.__name__
```
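
the "big dictionary" description is fairly literal: the builtin `vars` function will hand you a module's namespace as a dictionary. a short sketch using `os`:

```python
import os

# vars(module) returns the namespace dictionary of that module
namespace = vars(os)

# the special __name__ variable lives in there like anything else
print(namespace['__name__'])

# and os.sep is really just a lookup in that dictionary
print(namespace['sep'] == os.sep)
print('path' in namespace)
```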

there is one special name that doesn't correspond to a module -- that is the "script environment":

In [None]:
__name__

when the interpreter starts for the first time, it's basically doing that same process without a module to `import`. it creates a *global* namespace, where the names of things are not prepended with anything. the `__name__` value is set to `"__main__"`

#### why does this matter, though?

so, given those two facts:

1. a general module, when `import`ed, will result in a module object with a `__name__` member variable equal to the string with which it was `import`ed: `thisthing.__name__`
2. the value of `__name__` in the global scope is `"__main__"`

what does it mean to have a block

```python
# a bunch of code
# ...
# ...
# ...

if __name__ == "__main__":
    do_a_thing()
```

??

the answer: `__name__` will *not* equal `"__main__"` when you *import* that file, but it will when you invoke it from the command line. that is, when you are inside a `python` session and you write

```python
import thisthing
```

the contents of the file `thisthing.py` will be loaded into a namespace `thisthing`, and the `__name__` variable there will be `thisthing`. as it is being imported, that `if` statement will evaluate as `False`.

if, on the other hand, you run `python thisthing.py` from the command line, then the code in `thisthing.py` is being evaluated directly in the main namespace, and the value of `__name__` will be `"__main__"`, and the `if` statement will evaluate as `True`

**ultimately**, if you write your code to have this "dunder main" block at the end, it will serve as a logical switch between "code is being imported" and "code is being run from the command line".

if you furthermore clean up the code above the `if __name__ == "__main__"` line to be only function definitions -- no running code or changing state -- your `python` file will be entirely *explicit*: users can directly invoke it from the command line or after `import`ing it, but it will never be invoked accidentally on `import`.
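
if you'd like to watch the switch flip, here is a sketch that writes a tiny throwaway module, imports it, and then runs that same file as a script. the module name `thisthing_demo` and its contents are made up for the demonstration

```python
import importlib
import os
import subprocess
import sys
import tempfile

source = '''
def main():
    print('main was called')

print('my __name__ is ' + __name__)

if __name__ == '__main__':
    main()
'''

# write the throwaway module into a fresh temporary directory
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'thisthing_demo.py')
with open(path, 'w') as f:
    f.write(source)

# case 1: import it. __name__ is the module name, so main() is skipped
sys.path.insert(0, tmpdir)
module = importlib.import_module('thisthing_demo')
imported_name = module.__name__

# case 2: run the same file as a script. __name__ is '__main__', so main() runs
result = subprocess.run(
    [sys.executable, path],
    stdout=subprocess.PIPE,
    universal_newlines=True,
)
script_output = result.stdout

print(imported_name)
print(script_output)
```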

**<em><div align="center">YESSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS</div></em>**
<div align="center"><img src="http://ih0.redbubble.net/image.13413141.8561/flat,550x550,075,f.u3.jpg"></div>

# END OF LECTURE

next lecture: [environment management pt. 1: anaconda](006_environments_1_anaconda.ipynb)