# python

## why might you use `python`

[`python`](https://www.python.org/) is one of the most popular scripting and programming languages in the world. there are, [like, ninefinity different ways of ranking programming languages](https://www.python.org/), and `python` sits in the top 5 of almost every one of them.

I have used it on every single project I've ever worked on. I am an unashamed `python` fanboy.

I'm also not really one for the `R` vs. `python` holy wars -- the two have different use cases and any conversation that attempts to settle "which is better" is already fundamentally flawed, in my opinion. That being said, I'd like to make the following case as to why you should *learn* `python`, even if it doesn't become your go-to data science language:

##### `python` is much more common than `R` *outside* the statistics community

this is a feature of a number of biases, but also speaks to something fundamental about the differences between the languages: `R` is a very *deep* language in a very *narrow* field of concepts (namely, statistics), whereas `python` is among the most *broad* and *flexible* languages without a central purpose.

Another reason this matters: the barrier to entry for a company or government agency's IT department will be lower (or already surpassed) for `python` and `python` packages if for no other reason than that computer engineers are familiar with it by default.

##### there is a package for that

this is a corollary of the previous point: if there is a thing you want to do, it is very likely that someone has already done it in `python`, and that their work is available for you to use.

For a point of comparison, there are [13,121 `R` packages on `CRAN`](https://cran.r-project.org/web/packages/), and [153,987 on `pypi`](https://pypi.org/).

In [None]:
import antigravity

##### it does all the most basic things well

although it is available and well-supported on every OS, `python` is very much a linux-focused language. it "grew up" in linux as an alternative scripting language (alternative to `bash` and other `shell` scripts). because of this, it acquired some of the linux philosophy points, and specifically those that focus on simplicity.

many of the current iterations of linux tools are actually calling `python` scripts under the hood, which means that essential things like web scraping, emailing, scheduling and timing, networking, logging, and database access are all possible and highly optimized in `python`

##### it is fun

`python` was created with the express interest of being as simple to program in as possible. most of the syntax and rules are specifically generated to make the language function as much like pseudo-code as possible, so code is easy to read.

the community also has an edge to it. a good example: start a python session and type

```python
import this
```

In [None]:
import this

okay, so it's a particular type of fun for a particular type of person. but given the prior that you're in this class, I suppose that's a safer prediction

## why might you not?

### version 2 vs. 3

I'd be remiss not to mention one of the major red marks against `python` -- the infamous "2 vs. 3" upgrade controversy.

in order to make some very low-level changes to the language (primarily for performance improvements and to support international languages), the developer community chose to make a new major version of `python`: `python3`.

the process caused a lot of confusion among newcomers to the language -- which was exploding in popularity at around the same time -- and also put a large burden of uncertainty on corporate developers and development.

the bottom line, in my opinion, is this:

***unless you have no other option, you should always use python 3 and only python 3***
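
if a script should never run under `python` 2 by accident, one cheap guard is to check the interpreter version at the top of the file. this is only a minimal sketch using the standard library `sys` module; the error message wording is made up for illustration:

```python
import sys

# refuse to run under python 2; sys.version_info compares like a tuple
if sys.version_info < (3,):
    raise SystemExit('this script requires python 3')

# major version of the interpreter that is running this code
major_version = sys.version_info.major
print('running python {}'.format(major_version))
```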

### it's not *built* for statistics

every single decision made in the development of the `R` programming language has been optimized for a particular audience (statisticians) with a particular task (statistics). as such, it does some things that computer scientists and software developers find baffling but are extremely intuitive to data scientists.

the river often flows in the other direction in `python` world

# packages

a given file of executable `python` code is probably best referred to as a "script", but a collection of scripts which exposes some sort of interface to a user to do "something" is generally called a "library" or a "package".

This is mostly the same convention as in the `R` community -- think of the differences between scripts you wrote and `dplyr` and all the other stuff Hadley wrote.

## my favorite packages

So what sorts of `python` packages should you use?

first of all, the builtin packages are pretty great, and cover a wide range of the most necessary use cases for a programming language (e.g. file i/o and os utilities and tie-ins). The ones I use most often are:

+ `argparse` - reading in and parsing command line arguments
+ `collections` - sets of "collection" objects (e.g. ordered dictionaries, named tuples, default dictionaries)
+ `csv` - for reading and writing delimited files
+ `datetime` - the fundamental date object and utilities library
+ `functools` - functional tools, including fancy stuff like partial function definitions and caching
+ `itertools` - an awesome library of utilities for iterating through collections of items
+ `json` - for parsing and constructing well formatted JSON
+ `logging` - for logging messages to console, file, etc
+ `os` - operating system interaction (I use this in almost every single program)
+ `pickle` - a `python`-native serialization protocol, for saving `python` stuff
+ `random` - a decent (if not special) randomization library
+ `re` - regular expression parsing library
+ `time` - a generic OS-level time interface

for any `python` installation, these *already exist* -- no installation necessary
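
to give a flavor of what "no installation necessary" buys you, here is a small sketch that leans on three of the builtin packages listed above (`collections`, `functools`, and `json`). the word list and the `fib` helper are made-up examples, not anything from a real project:

```python
import collections
import functools
import json

words = ['spam', 'eggs', 'spam', 'ham', 'spam']

# collections.Counter tallies hashable items
counts = collections.Counter(words)
most_common_word, n = counts.most_common(1)[0]


# functools.lru_cache remembers the results of previous calls
@functools.lru_cache(maxsize=None)
def fib(k):
    return k if k < 2 else fib(k - 1) + fib(k - 2)


# json round-trips a plain dictionary through a string
payload = json.dumps({'word': most_common_word, 'count': n}, sort_keys=True)
restored = json.loads(payload)

print(most_common_word, n, fib(30), restored)
```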

there are also a ton of great open-source libraries for just about any purpose you might imagine. Again, the ones I use most often:

+ `flask` - a `python` web framework (for standing up webpages)
+ `ipython` - the best interactive shell, it just makes the normal python program look silly
+ `jupyter` - the interactive extension of the above (`ipython`, this is what is used to make this bodacious document you see before you)
+ `lxml` - a fast and flexible XML / HTML library
+ `matplotlib` - a plotting library that is super useful but will make `R` users dream of their former glory
+ `nltk` - Natural Language Tool Kit, a library for language processing and text analytics
+ `numpy` - NUMerical PYthon, a lot of super duper array and linear algebra glue code to make C and FORTRAN routines available in `python`.
+ `pandas` - PANel DAta, a dataframe interface for feature data. This is the main data science library in `python` and, again, I use it in almost every single program
+ `plotly` - an amazing plotting library
+ `psycopg2` - a `postgres` library
+ `requests` - the main web GET and POST library
+ `scipy` - SCIentific PYthon, an extension of `numpy` that adds more scientific utilities
+ `scrapy` - a flexible but easy web scraping framework
+ `seaborn` - something you import whenever you use `matplotlib` to make your plots non-heinous (also has some useful functions that no one has discovered yet)
+ `selenium` - a browser automation library that drives a real, javascript-capable browser (for when `requests` isn't good enough)
+ `sklearn` - the other half of the primary data science workflow, an all-purpose modeling library
+ `sqlalchemy` - an ORM library for most sql databases. It's pretty flashy and when you finally need it, you'll know in your heart.
+ `tensorflow`, `torch`, and `keras` - the three big libraries for deep learning implementations
+ `tqdm` - a fancy-pants progress bar library. You don't need it, but you want it.
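+ `yaml` - a library for parsing the world's greatest configuration format, YAML (originally "Yet Another Markup Language", now officially "YAML Ain't Markup Language"). it is installed as `pyyaml` but imported as `yaml`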

## installing packages

So, let's take a journey together.

Unlike `R`, the folks who put `python` together thought that people should care about the versions of the packages they installed. They didn't really do anything to make this happen in a sane way, though, so there were like ten different ways to install packages. 

If you learned `python` in the early days, you probably heard it was hard to install packages. Well, it was. Maybe it still is, depending on your attitude. That's right, I'm blaming the victim.

Really, though, I'm sorry. If you're coming to `python` from `R` this probably feels silly. Why not just have an `install` function and install whatever you want? 

Why? Basically, because that's a bad idea for writing production-level software.

production-level software is meant to be deterministic, and to be stable. Software that has the ability to install packages within the language has several disadvantages:

1. avoids administrator oversight
    1. having to ask your admin to install something is a *good* thing
2. could install something malicious or broken without anyone knowing
3. could install different versions on different machines at different times

Basically, the versions of all your packages matter, so you should care about that stuff. The `python` community is a real stickler about that and has gone to great lengths (and, like, 15 different methods) to try and solve that problem. And today, that means that everyone is doing one of the following (and then some):

+ using `pip` ("Pip Installs Packages", and yes, recursive acronyms are annoying)
+ using `pip`, but in a virtual environment
+ using `conda` (virtual environments on steroids or amphetamines, depending on whether you're a data scientist or sysad (resp))

I advocate using `conda` for many reasons -- more on this in the next lecture

# actually writing and executing `python` code

## interactive shells: `ipython` and `jupyter notebook`

the default `python` command opens a vanilla `python` shell, where you can execute any of the `python` commands your heart desires. that being said, the experience is obviously lacking the bells and whistles of any modern code development or execution environment.

for your personal use, `IPython` (interactive `python` shell) and `jupyter notebook`s are as close to *must install* packages as it gets.

I personally think of `ipython` as being the primary means of developing software, and `jupyter` as being almost exclusively for exploratory documents and presentations, but you should do whatever works for you!

the documents and slideshows we've been using as lecture notes this whole time were created with `jupyter`, and I believe you used them extensively in 510. if it is still new to you, though, `jupyter` is a cool `python` package which allows you to execute interpreted `python` commands in a "notebook" format, where commands and notes are isolated into separate "cells" that can be executed on demand.

there are a couple of popular "ways" of developing `python` code, and `jupyter` notebooks are probably the most popular.

I highly recommend becoming familiar with both, but particularly `jupyter`!

### what `jupyter` actually is

based on my understanding of `jupyter` usage in 510, I am assuming most of you are familiar with how to *use* `jupyter notebook`s. I think it can still be helpful to know what `jupyter` actually *is*, and what it is *doing*.

`jupyter` is a `python` package (we can install it!) that does a lot of different things, but the main thing it does is create a **service** (a long-running process that *listens* for requests and *responds* to them with information)

In [None]:
%%bash
ps -aef | grep jupyter

that long running service knows how to read specially-formatted `json` files (`.ipynb`). these files contain code snippets that `jupyter` (hopefully) knows how to run and text blocks it can render. take this file, for example:

In [None]:
%%bash
head -n100 /Users/zach.lamberty/personal/code/gu511/005_python.ipynb

finally, this process knows how to start up different processes for handling calculations in different languages (kernels).

when a **client** (your web browser) accesses the `jupyter notebook` **service** (via `http`), you begin a back-and-forth communication that eventually leads to you executing `python` code here in this pretty browser window.

note: this client-server paradigm comes up a lot, right? because we're communicating over `http`, we should be able to run `jupyter notebook` services anywhere in the world, and connect to them from anywhere else, as long as that `http` message can travel. hm... I smell a homework exercise... 

I bring up all of the above just to say: `jupyter` is complicated! you're doing a lot with not a lot of effort. good for you. as you go on and experience problems or quirks, most of them come back to this architecture. you have a central process running as someone (maybe not you), and your attempts to run `python` code are going through this filter

### code in other languages with `jupyter`

you may have noticed a cell I ran up above:

```python
%%bash
head -n100 /Users/zach.lamberty/personal/code/gu511/005_python.ipynb
```

that `%%bash` piece is `ipython` / `jupyter` specific -- it won't work in regular `python`. in this case it enables me to write code in a different language (here, `bash`), and the `jupyter` service knows how to execute it. `jupyter` currently [supports many programming languages this way](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels) (requires some installation)
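
since `%%bash` won't work in regular `python`, it is worth knowing the plain-`python` way to run a shell command: the standard library `subprocess` module. this is a minimal sketch, and it assumes an `echo` program is available on your `PATH` (true on linux and macOS, not guaranteed elsewhere):

```python
import subprocess

# run a shell command from plain python and capture what it prints
# (assumption: an `echo` executable exists on the PATH)
result = subprocess.run(
    ['echo', 'hello from the shell'],
    stdout=subprocess.PIPE,      # capture standard output
    universal_newlines=True,     # decode the captured bytes to str
    check=True,                  # raise if the command exits non-zero
)

output = result.stdout.strip()
print(output)
```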

## editors and `IDE`s

there are a multitude of options for developing code in `python`, and the choice really comes down to your personal preferences. If you've "grown up" coding in `RStudio`, then you probably expect a windowed environment where you can write scripts, execute blocks, visualize output, and explore objects, and you are probably going to favor one of:

+ [`rodeo`](https://www.yhat.com/products/rodeo)
+ [`spyder`](https://pythonhosted.org/spyder/)

additionally, recent updates to `RStudio` itself (using the `R` `reticulate` package) [allow you to run `python` code in `RStudio`](https://blog.rstudio.com/2018/10/09/rstudio-1-2-preview-reticulated-python/).

there are a ton of good editors and I won't belabor the point. try out a bunch and see which one speaks to you. I know a lot of people who enjoy the following options:

* [sublimetext](https://www.sublimetext.com/)
* [atom](https://atom.io/)
* [vscode](https://code.visualstudio.com/)

personally, I'm a huge fan of developing code in side-by-side terminals -- one for a regular editor and another for an `IPython` session. I have the ability to copy and paste code for quick execution, but I *also* have a workflow which forces me to write real modules with callable functions I can re-import as I develop. It's a way of making sure the code I write becomes something I can deploy.

# a crash course of stuff you should know or learn!

I know that Stuart covered `python` in his course, so this may be overkill. If you're a `python` pro, bear with me -- sit back and bask in your total l33tness while we take a lightning tour of things that I think are #important.

some of these topics may feel a little out of left field, but they are things I've learned that I think are essential (but not sufficient) to being a good `python` programmer

## code structure and organization

+ [pep8](https://www.python.org/dev/peps/pep-0008/) was a really good idea. you should follow it
+ keep code in files called "someshortword.py"
+ there are basically two types of `py` file:
    + modules: I can run `import thisthing` in a `python` session and nothing happens, but now I have new `python` toys
    + scripts: I can run `python thisthing.py` from a bash shell and it *does a thing*
    + if your file does a combination of those two, you should ask yourself why
+ if I run `import thisthing` and *something happens*, that is almost always not a good idea

for *python files* (not `jupyter notebooks`, that is), this is 

a bad idea

```python
# thisthing.py

import pandas as pd
import sklearn.neural_network

x = pd.read_csv('magicdata.csv')
y = pd.read_csv('easytarget.csv')
m = sklearn.neural_network.MLPClassifier(
    hidden_layer_sizes=(1E999, 1E99999999999999),
    random_state=1337
)

m.fit(x, y)
```

a better idea

```python
# thisthing.py

import pandas as pd
import sklearn.neural_network


def load_xy(xfile='magicdata.csv', yfile='easytarget.csv'):
    x = pd.read_csv(xfile)
    y = pd.read_csv(yfile)  
    return x, y
   
   
def model(x, y):
    m = sklearn.neural_network.MLPClassifier(
        hidden_layer_sizes=(1E999, 1E99999999999999), 
        random_state=1337
    )
    m.fit(x, y)
    return m
    

def main(xfile='magicdata.csv', yfile='easytarget.csv'):
    x, y = load_xy(xfile, yfile)
    m = model(x, y)
    print(m.coefs_)
    

# more on this later...
if __name__ == '__main__':
    main()
```

## `io` operations and file objects

### reading and writing files

for many people, the idea of "opening a file" is not any different than saying "here's a file path, go get me all the stuff in it". for example, at this point, I imagine most of you are familiar with something such as:

```python
import pandas as pd

df = pd.read_csv('/path/to/my/precious/data.csv')
```

I have a file name, I open it with a function, what else is there?

there are actually many different things you might want to do with a file:

1. read the contents into a single `str` because that's what some other library requires
1. replace or remove all of the occurrences of a word
1. load the first 100 lines only
1. search through a file to find out what line a particular string is on
1. count all occurrences of a given word
1. replace certain characters for easier NLP processing

now, you could just load an entire file to a string object or `pandas` data frame, make your changes, and write it out. that's fine until you get to a file that is several GBs.

most of the things I mentioned above that you might want to do involve *iterating* through a file one character or line at a time. this is the fundamental way that `python` handles files.
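
to make the "iterate, don't load" idea concrete, here is a rough sketch of three of the tasks from the list above done one line at a time. the file name and its contents are invented for the demo, and it uses `with` blocks and comprehensions, both of which come up later in these notes:

```python
import itertools
import os
import tempfile

# build a throwaway 1000-line file (contents are made up for the demo)
path = os.path.join(tempfile.gettempdir(), 'iteration_demo.txt')
with open(path, 'w') as f:
    for i in range(1000):
        word = 'needle' if i % 250 == 0 else 'haystack'
        f.write('line {} has a {}\n'.format(i, word))

# load the first 3 lines only; we stop iterating after the third line
with open(path) as f:
    first_three = [line.rstrip('\n') for line in itertools.islice(f, 3)]

# find which (0-indexed) lines contain a particular string
with open(path) as f:
    needle_lines = [n for n, line in enumerate(f) if 'needle' in line]

# count all occurrences of a word, one line at a time
with open(path) as f:
    hay_count = sum(line.count('haystack') for line in f)

os.remove(path)
print(first_three, needle_lines, hay_count)
```

none of these ever holds more than one line of the file in a `python` variable at a time, which is why the same code keeps working when the file is several GBs.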

`python` interacts with the file system through a concept called a "file object," which you can basically think of as a cursor pointing to a memory address at a certain point within a file. given where this cursor is currently, the file object could read the next character, the next word, the next line (etc). it could write new contents to the file.
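
you can watch that cursor move with the file object's `tell` method. a small sketch, again with a made-up throwaway file:

```python
import os
import tempfile

# write a small throwaway file to play with (contents are made up)
path = os.path.join(tempfile.gettempdir(), 'cursor_demo.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\nthird line\n')

with open(path, 'r') as f:
    start = f.tell()          # cursor position before reading anything
    first = f.readline()      # reads up to and including the first newline
    after_first = f.tell()    # the cursor has moved past the first line
    # iterating picks up from wherever the cursor currently is
    rest = [line.rstrip('\n') for line in f]

os.remove(path)
print(start, repr(first), after_first, rest)
```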

[about 1 million steps under the hood](https://github.com/pandas-dev/pandas/blob/master/pandas/io/common.py#L417-L430), `pd.read_csv` is doing this for you (thanks, `pandas`!), making that `str` path usable.

other libraries -- important ones! -- don't take care of you that way.

1. `pickle` (builtin serialization library) works on file objects
1. `json` and `yaml`, libraries for parsing two of the most common configuration formats, work on file objects
1. several parsing libraries (`lxml`, `nltk`) require string inputs or file pointers, so `read_csv` and the like are out

all I mean to say is: **you will use file objects!** it's good to know what they are, even at a high level

let's look at how you use a file object on a low level. if you don't follow right now, don't worry -- you *generally* won't have to do this, but you do very often have to create a file object and pass it to some other function that does, so it's helpful to know more

the main function for interacting with files is the `open` function. 

In [None]:
help(open)

In [None]:
f = open('/tmp/testfile.txt', 'w')

f.write('hello')

In [None]:
%%bash
less /tmp/testfile.txt

hm... nothing?

In [None]:
# this saves the writing we've done
f.flush()

In [None]:
%%bash
less /tmp/testfile.txt

In [None]:
f.write('world')
f.flush()

In [None]:
%%bash
less /tmp/testfile.txt

yep -- you even have to write the new line characters:

In [None]:
f.write('\n')
f.write('hello\n')
f.write('world')
f.flush()

In [None]:
%%bash
less /tmp/testfile.txt

note: you *have to close* file objects!

In [None]:
f.close()

and clean up...

In [None]:
%%bash
rm /tmp/testfile.txt

so, this may feel a little low-level and annoying, and also overkill for some of our purposes. well, it is. people much smarter and better at programming in `python` did us a solid by writing us a bunch of libraries to handle the reading and writing of data.

that being said, *it is super common* that a function wants to take a *file object* and not a name of a file. so you should get used to the idea that you might have to take the extra step of using the `open` function to create a file object from a file name.
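
here is a sketch of what that extra step looks like for `json` and `pickle`, both of which want a file object rather than a path. the `settings` dictionary and file names are invented for the example, and the `with` construct is explained in the next section:

```python
import json
import os
import pickle
import shutil
import tempfile

settings = {'name': 'zach', 'retries': 3}   # made-up example data
tmpdir = tempfile.mkdtemp()
json_path = os.path.join(tmpdir, 'settings.json')
pickle_path = os.path.join(tmpdir, 'settings.pkl')

# json.dump / json.load take an open *text* file object, not a path
with open(json_path, 'w') as f:
    json.dump(settings, f)
with open(json_path, 'r') as f:
    from_json = json.load(f)

# pickle.dump / pickle.load take an open *binary* file object ('wb' / 'rb')
with open(pickle_path, 'wb') as f:
    pickle.dump(settings, f)
with open(pickle_path, 'rb') as f:
    from_pickle = pickle.load(f)

shutil.rmtree(tmpdir)   # clean up the temporary directory
print(from_json == settings, from_pickle == settings)
```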

### context managers and the `with` statement

so what is the deal with

```python
with open(filename, 'r') as f:
    # blah blah
```

remember how I said that you *absolutely have to close file objects*?

note how I'm not doing that here?

a [*context manager*](https://docs.python.org/3.6/reference/datamodel.html#context-managers) is a syntactical construct (way of writing the code) such that

1. you create and name some object
    1. the results of `open(filename, 'r')` are called `f`
2. you "enter" a context
    1. internally, the context object has an `__enter__` method which "does something"
        1. the file object is created
        2. a database connection is initialized
3. after you've done all the code in the indented block, you "exit" the context
    1. the context object has an `__exit__` method which "cleans up"
        1. the file object is closed
        2. the database cursor is executed, all transactions are wrapped up, and the database connection is closed
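
the steps above can be sketched with a toy class of our own. `Announcer` is a hypothetical name; the only real requirement is that the object has `__enter__` and `__exit__` methods:

```python
class Announcer:
    """Toy context manager that records when it is entered and exited."""

    def __init__(self, name, events):
        self.name = name
        self.events = events

    def __enter__(self):
        # runs when the `with` block starts; the return value is bound by `as`
        self.events.append('enter ' + self.name)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # runs when the block ends, whether or not an exception was raised
        self.events.append('exit ' + self.name)
        return False  # a false value means "do not swallow exceptions"


events = []
with Announcer('demo', events) as a:
    events.append('inside ' + a.name)

print(events)
```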

recommendation: *never* open a file with

```python
# bad way -- very bad, no! bad!
f = open(filename, 'r')

# do stuff

f.close()
```

but *always* use context managers

```python
# yaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!
with open(filename, 'r') as f:
    # do stuff
```

this way you will *never forget* and if the context object (here, a file object) ever gets more complex or requires more clean up *you don't have to care*, and not having to care is the very heart of good programming.
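
one more thing you get for free: the clean up happens even if the code inside the block blows up. a small sketch (the file name and the error are made up):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'with_demo.txt')

try:
    with open(path, 'w') as f:
        f.write('hello')
        # simulate something going wrong in the middle of the block
        raise ValueError('something went wrong mid-write')
except ValueError:
    pass

# the file object was closed on the way out, despite the exception
closed_after_error = f.closed

# closing also flushed what we had written so far
with open(path, 'r') as f2:
    contents = f2.read()

os.remove(path)
print(closed_after_error, contents)
```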

### `os`

the `os` module has, basically, one goal: handle all the stuff that is different between different operating systems for you. 

the best example of this is file paths. suppose I want to create a file three directories below the current location: how do I write that path?

```sh
# in windows:
subdir1\subdir2\subdir3\myfile.txt

# in linux:
subdir1/subdir2/subdir3/myfile.txt
```

it'd be sad if such a dumb difference broke our script

in steps the `os` module:

In [None]:
import os
os.path.join('subdir1', 'subdir2', 'subdir3', 'myfile.txt')

my recommendation: never write a path in `python` again, ever, for any reason. always use `os.path`

the way that `os` joins those directories together is by using the `os.sep` character

In [None]:
os.sep

note that if we want to create a path relative to the root directory, then, we could do the following:

In [None]:
os.path.join(os.sep, 'tmp', 'myfile.txt')

another very useful part of the `os` module is the `environ` dictionary object, which is an OS-agnostic way of loading all of the environment variables:

In [None]:
os.environ

note: this is a `python` dictionary-like object:

In [None]:
os.environ['PWD']

there are a ton of other goodies in the `os` module, but you'll learn them in due time.
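
a few of those goodies, as a quick sketch. the path is made up, and `GU511_DEMO_VAR` is a hypothetical environment variable name that is presumably not set on your machine:

```python
import os

path = os.path.join('subdir1', 'subdir2', 'myfile.txt')

# pull a path apart again, without caring which separator the OS uses
directory = os.path.dirname(path)
filename = os.path.basename(path)
stem, extension = os.path.splitext(filename)

# .get returns a default instead of raising KeyError for a missing variable
value = os.environ.get('GU511_DEMO_VAR', 'not set')

# ask the OS where we are and whether a path exists
here = os.getcwd()
exists = os.path.exists(here)

print(directory, filename, stem, extension, value, exists)
```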

*in the interest of time, I'm going to comment out the `csv` section for lectures. if you're reading at home, this package is super useful, but it's less important than many of the others and I'd rather be sure to get to them than to cover this package*

### `csv`

before `pandas` dataframes, there were lists of dictionaries:

```python
[
    {'col0': val00, 'col1': val10, 'col2': val20},
    {'col0': val01, 'col1': val11, 'col2': val21},
    {'col0': val02, 'col1': val12, 'col2': val22},
    {'col0': val03, 'col1': val13, 'col2': val23},
]
```

this is one `pythonic` way of representing a csv file: records as dictionaries, and key-value pairs corresponding to header field names and values.

the `csv` module (and specifically the `csv.DictReader` and `csv.DictWriter` objects) allows us to read and write csv files into this data structure

In [None]:
import csv

x = [
    {'a': 1, 'b': 2, 'c': 3},
    {'a': 100, 'b': 200, 'c': 300},
]

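# the "with" construct is covered in the context manager section above;
# newline='' is what the csv docs recommend when opening files for csv
with open(os.path.join(os.sep, 'tmp', 'myfile.csv'), 'w', newline='') as f: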
    c = csv.DictWriter(f, fieldnames=['a', 'b', 'c'])
    c.writeheader()
    c.writerows(x)

In [None]:
%%bash
less /tmp/myfile.csv

and now we could read it (or any csv) in:

In [None]:
# same "with" construct as in the context manager section above
with open(os.path.join(os.sep, 'tmp', 'myfile.csv'), 'r', newline='') as f:
    c = csv.DictReader(f)
    # note: c is an iterator over the rows of the file; you still need to
    # iterate through it all to get all the records!
    x = list(c)
    
x

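in `python` 3.6 and 3.7 each record comes back as an `OrderedDict`, a special class (from the `collections` module) which is simply a dictionary where the order of the keys is remembered. from `python` 3.8 on, `DictReader` returns regular dictionaries instead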

##### why should you ever do this?

generally speaking, you will probably want to read `csv` files in with `numpy`, `scipy`, or `pandas`. however, it is possible you might be in an environment where those are not made available to you.

first, ask yourself why you are acting as a data scientist but not allowed to use actual data scientist tools. then, remember that the answer *does* exist in the standard library, and see what you can figure out.

## string formatting

you should read [this entire format string syntax page](https://docs.python.org/3.6/library/string.html#formatstrings).

the basic gist of it, though, is that every string object `s` in `python` has a member function

```python
s.format(...)
```

and this can be used to replace elements within the string that are coded within `{}` characters. There is a large and highly flexible mini-language for doing this. for example

In [None]:
myname = 'zach'
print('hello {}, how are you today'.format(myname))

me = {
    'name': 'zach',
    'mood': 'groovy',
}
print('{name:} is feeling {mood:} today'.format(**me))

s = ' my title '
print('{:-^100}'.format(s))
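
a few more pieces of the mini-language that tend to come up constantly, as a quick sketch (the values are arbitrary):

```python
pi = 3.14159265

fixed = '{:.2f}'.format(pi)                    # two decimal places
padded = '{:>8}'.format('abc')                 # right-align in 8 characters
zero_filled = '{:05d}'.format(42)              # pad an integer with zeros
thousands = '{:,}'.format(1234567)             # comma as thousands separator
percent = '{:.1%}'.format(0.256)               # multiply by 100 and add '%'
by_index = '{1} before {0}'.format('a', 'b')   # refer to arguments by position

for s in (fixed, padded, zero_filled, thousands, percent, by_index):
    print(repr(s))
```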

**<div align="center">mini exercise: write "hello world" to file</div>**

1. create a "hello world" function that
    1. takes a person's name as its first variable
    2. creates a string to write to file using string formatting expressions and the variable name 
    3. open a file `/tmp/test.txt` using the `with` context manager construct
    4. write that string to file
    5. in bash, print the results of that code to screen

In [None]:
os.path.join(os.sep, 'tmp', 'test.txt')

In [None]:
import os

def hello_world(name):
    s = 'hello {}, how are you today?'.format(name)
    filename = os.path.join(os.sep, 'tmp', 'test.txt')
    with open(filename, 'w') as f:
        f.write(s)
    
hello_world('caitlin')

In [None]:
%%bash
cat /tmp/test.txt

## `pandas`

`pandas` is *the* data science data frame library. in the latter parts of this course we will be using `pandas` and `scikit-learn` to do some of our analyses, so I'll push the learning of `pandas` off until then. Just know that you should learn `pandas`.

In many ways, `pandas` is a second programming language. it sits part-way between `python` (where there is one and only one way to do things, and iteration is a first-order principle) and `R` (where everything is vectorized). there is also a paradigm shift relative to the `R` community: whereas the `R` community is focused on making the language easy and intuitive for the user, the `pandas` community defaults more toward the development community. this will make the learning curve particularly steep for most.

In [None]:
import pandas as pd

df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
)

df.head()

In [None]:
df.describe()

In [None]:
df.groupby('class').mean()

## plotting

okay, okay, I hear you: "but all I want to do is plot"

one common knock against `python` is that it makes plotting so much harder than `R`, and that is absolutely a valid complaint. That being said, I don't think it is *quite* as bad as people like to make it out to be

### `matplotlib`

this is the primary plotting workhorse in the `python` world, and also the source of much of the angst. the name comes from the fact that this package was actually built as an attempted replacement plotting library for *matlab* transfers.

rather than dive into a whole tutorial about `matplotlib`, let me just offer a few pieces of advice:

1. begin any notebook in which you plan to plot with `%matplotlib inline` or `%matplotlib notebook`
2. favor the object oriented method (*e.g.* `f, ax = matplotlib.pyplot.subplots(); ax.plot(x, y)`) over the "interactive" `pyplot` method (*e.g.* `matplotlib.pyplot.plot(x, y)`).
3. import `seaborn` first and call `sns.set()`; it will configure almost all of the defaults for you
4. you're probably using `pandas`, so lean on the `pandas` plotting builtins

In [None]:
%matplotlib inline

import random
import matplotlib.pyplot as plt

x = [random.gauss(0, 1) for i in range(1000)]
y = [random.gauss(0, 1) for i in range(1000)]

f, ax = plt.subplots()
ax.plot(x, y, linestyle='', marker='o')

### `matplotlib` with `pandas`

the good folks behind `pandas` decided to build support for `matplotlib` plotting directly into their dataframe object interface, so often times you are best off calling a dataframe's `plot` method:

In [None]:
df = pd.DataFrame({'x': x, 'y': y})
df.head()

In [None]:
df.plot.scatter('x', 'y')

### `seaborn`

think of `seaborn` as a sort of conceptual gradient boost to `matplotlib` -- it was created to handle some of the rough edges for the statistical community, and makes first-class some familiar `R` plotting concepts such as

1. easy distribution and kde plots
2. panel plotting
3. plot types
    1. violin plots, box and whisker plots, swarmplots
4. plotting features
    1. jitter, nicer default color schemes
5. several statistically familiar datasets

if you do nothing else to make your plots better, I recommend starting every notebook with

```python
import seaborn as sns
sns.set()
```

In [None]:
import seaborn as sns
sns.set()

In [None]:
sns.jointplot(x='x', y='y', data=df)

In [None]:
sns.jointplot(x='x', y='y', data=df, kind='kde')

In [None]:
iris = sns.load_dataset('iris')
sns.pairplot(iris)

### `plotly`

finally, I strongly urge you to check out `plotly`. it has come a long way since the early days (in which it was a fundamental requirement that you post your plots to a public website -- a total non-starter for work with sensitive or proprietary information, for example). it is based on a language-independent graphical object model, and as a result not only is the process of *creating* graphs nearly identical from `R` to `python` to `matlab` to `julia` to (you get it), but converting from one language to another is actually built in to the package itself.

plus, it's a `d3` based `javascript`-first library, so it often has some of the cool web bells and whistles before most of the others do (really, before any of them do in `python`)

In [None]:
import plotly
import plotly.graph_objs as go
import plotly.offline

plotly.offline.init_notebook_mode(connected=True)

In [None]:
data = [
    go.Scatter(
        x=df.x,
        y=df.y,
        mode='markers'
    )
]

plotly.offline.iplot(data)

## iteration

one of the first-order concepts in `python` is that things which are collections should be *iterable* -- you should be able to move through them in order using a 

```python
for item in collection:
    # do a thing...
```

construct. basically, any time a thing *can* be iterable, it *should* be iterable.
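
making your own objects iterable is not much work: the `for` construct only needs an `__iter__` method that hands back an iterator. a minimal sketch with a hypothetical `Countdown` class:

```python
class Countdown:
    """Toy collection that counts down from `start` to 1."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # writing __iter__ as a generator is the shortest way to
        # return a fresh iterator every time a loop starts
        n = self.start
        while n > 0:
            yield n
            n -= 1


for n in Countdown(3):
    print(n)

# anything that consumes iterables now works on our class too
as_list = list(Countdown(3))
total = sum(Countdown(4))
print(as_list, total)
```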

these constructs are so important that the community has developed a standard library called [`itertools`](https://docs.python.org/3/library/itertools.html) to support some complex iteration steps. Among many other things, this will allow you to:

In [None]:
import itertools

# iterate through products of different lists
# think "nested for loops"
alist = [0, 1, 2]
blist = 'abc'

for (a, b) in itertools.product(alist, blist):
    print(a, b)

In [None]:
# do all combinations with or without replacement
for (a0, a1) in itertools.combinations(alist, 2):
    print(a0, a1)
    
print()

for (a0, a1) in itertools.combinations_with_replacement(alist, 2):
    print(a0, a1)

In [None]:
# zip together lists with repetition from a shorter list
alist = range(10)
blist = 'ab'

for (a, b) in zip(alist, blist):
    print(a, b)
    
print()

for (a, b) in zip(alist, itertools.cycle(blist)):
    print(a, b)
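two more `itertools` helpers that tend to come up a lot are `chain` and `islice`. these are my own picks rather than part of the examples above, so treat this as a small sketch of what else is in the module:

```python
import itertools

# chain: walk through several iterables as if they were one
chained = list(itertools.chain([0, 1, 2], 'abc'))
print(chained)

# islice: take a "slice" of any iterable, even an infinite one
# (itertools.count(0, 2) yields 0, 2, 4, ... forever)
first_five_evens = list(itertools.islice(itertools.count(0, 2), 5))
print(first_five_evens)
```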

### list comprehensions and generators

you are no doubt familiar with the ability to create lists, sets, and dictionaries using for loops:

```python
l = []
s = set()
t = tuple()
d = {}
for i in range(10):
    l.append(i ** 2)
    s.add(i ** 2)
    t += (i ** 2, )
    d[i] = i ** 2
```

but this is actually *not* the preferred way of creating collections from simple iteration. the `pythonic` way of doing that is to use *comprehensions*.

a list, set, or dictionary comprehension takes the form (shown here for a list; sets and dictionaries use `{}` instead of `[]`):

```python
l = [
    some expression
    for elem1 in some iterable
    [optional conditional check on elem1]
    for elem2 in some other iterable dependent on elem1
    [optional conditional check on elem2]
    ...
]
```

you can think of it as 

1. moving the final item which gets added to the collection to the very top
2. keeping the `for` loop elements in order
3. losing the "`:`" characters
4. moving everything to the same indentation level

In [None]:
l = []
for i in range(10):
    l.append(i ** 2)
print(l)

print() 

l = [
    i ** 2
    for i in range(10)
]

print(l)

In [None]:
d = {}
for i in range(10):
    d[i] = i ** 2
print(d)

print()

# note: unlike for loops, comprehension expressions don't
# have to be on multiple lines
d = {i: i ** 2 for i in range(10)}
print(d)

In [None]:
d = {}

for i in range(5):
    if i % 2 == 0:
        for j in range(i, 5):
            if j % 2 == 0:
                d[i, j] = (i ** 2, j ** 3)
print(d)

print()

d = {
    (i, j): (i ** 2, j ** 3)
    for i in range(5)
    if i % 2 == 0
    for j in range(i, 5)
    if j % 2 == 0
}
print(d)

### generators

you may have been tempted to try this for tuples as well:

In [None]:
t = (i ** 2 for i in range(10))
t

# note: this is the correct way to do it for a tuple:
#t = tuple(i ** 2 for i in range(10))
#t

this is not a tuple comprehension but instead a *generator*. you can think of it as a memory-optimized version of the *comprehension* construct above.

in this sense, a *comprehension* takes some iterables and a rule for composing them into individual values, and then builds all of the items at once and stores them in memory. a generator, on the other hand, acts more like a factory for those items. You can iterate through either, but if you use a generator you only need to hold one of those things in memory at a time.
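a rough way to see the memory difference is `sys.getsizeof`. this is only a sketch: `getsizeof` reports the size of the container object itself (not the integers it points to), and the exact numbers depend on your `python` build, so the thing to look at is the order of magnitude.

```python
import sys

n = 100_000

squares_list = [i ** 2 for i in range(n)]   # builds all n items right now
squares_gen = (i ** 2 for i in range(n))    # builds nothing yet

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size, gen_size)

# both can be consumed the same way
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
print(total_from_list == total_from_gen)
```

the generator stays the same small size no matter how large `n` gets, because it never holds more than its current state.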

In [None]:
g = ((i ** 2) for i in range(10))

In [None]:
# keep running this cell and see what happens
g.send(None)

each time we run the `send` method, our generator supplies us with the calculated value (from that recipe) given the next input (from that iterator)

in other words, the generator object has an *internal state*: a "current" value of the iterated `i`, and a recipe for converting it into an object which it then *yields* each time we call the `send` method.

we can actually see that state

In [None]:
g = ((i ** 2) for i in range(10))

In [None]:
# keep running the cell and see what happens
print(g.send(None))
g.gi_frame.f_locals

of course, you normally wouldn't just create a generator and call `send` on it; you would use the normal iterator `for` loop construct:

In [None]:
g = (i ** 2 for i in range(10))
for i2 in g:
    print(i2)
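the word *yield* above is also a keyword: any function that contains `yield` becomes a generator function. as a minimal sketch (the function name `squares` is hypothetical), this is roughly the long-hand version of the generator expression, and it also shows that the built-in `next(g)` does the same job as `g.send(None)`:

```python
def squares(n):
    """hypothetical generator function, equivalent to (i ** 2 for i in range(n))"""
    for i in range(n):
        # execution pauses here and resumes on the next request
        yield i ** 2


g = squares(4)

first = next(g)   # same effect as g.send(None)
rest = list(g)    # consumes whatever is left

print(first)
print(rest)
```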

## that `__main__` thing

at the very beginning I wrote a "better idea" version of a module file, and it ended with this block:

```python
if __name__ == '__main__':
    main()
```

can someone explain what is happening with that block?

as I said above, there are two types of `python` files: modules that provide functions for doing things, and scripts that actually do things.

modules all have a "name" member variable which is accessible via

```python
mymodule.__name__
```

(pronounced "mymodule dunder name").

for example:

In [None]:
import os
os.__name__

In [None]:
import logging.config
logging.config.__name__

the `__name__` variable value can be hard-coded to be something special within the source code of the module, but by default it is the same as the module name as it gets imported. so, if you wrote a `python` file `thisthing.py`, without making any change at all you would find that

```python
import thisthing
thisthing.__name__
```

would print the string 

```
'thisthing'
```

what's going on here is roughly the following:

1. the `python` interpreter sees that you want to `import thisthing`
2. it creates a "namespace" for `thisthing`
    1. a "namespace" is a segmented place where the contents of the `thisthing` can be put
        1. helps avoid naming conflicts
        2. is basically a big dictionary with "names" and the compiled objects they point to (like functions, values)
    2. a special variable `__name__` is created inside the `thisthing` module with a value `"thisthing"`
    3. all of the functions and values in `thisthing.py` are then executed and loaded into the `thisthing` namespace
    4. within the scope of `thisthing.py`, it is known that the "name" of their namespace is `thisthing`
3. all of the items in the `thisthing` module are then made available as `thisthing.SOME_ITEM`

so when the interpreter goes to `import os`, it creates a namespace `'os'`, creates a `__name__` value within that namespace, and loads everything in `os` into it.

the end result is that there is now an object called `__name__` within the namespace `os`, aka

```python
os.__name__
```
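the "namespace is basically a big dictionary" idea can be poked at directly. a small sketch using `os` (nothing here is specific to `os`; any imported module would do):

```python
import os
import sys

# vars(module) returns the module's namespace dictionary (os.__dict__)
namespace = vars(os)

print(type(namespace))
print(namespace['__name__'])

# names in the dictionary point at the same objects you reach with a dot
print(namespace['getcwd'] is os.getcwd)

# imported modules are also cached by name in sys.modules
print(sys.modules['os'] is os)
```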

there is one special name that doesn't correspond to a module -- that is the "script environment":

In [None]:
__name__

when the interpreter starts for the first time, it's basically doing that same process without a module to `import`. it creates a *global* namespace, where the names of things are not prefixed with anything. the `__name__` value is set to `'__main__'`.

#### why does this matter, though?

so, given those two facts:

1. a general module, when `import`ed, will result in a module object with a `__name__` member variable equal to the string with which it was `import`ed: `thisthing.__name__`
2. the value of `__name__` in the global (script) scope is the string `'__main__'`

what does it mean to have a block

```python
# a bunch of code
# ...
# ...
# ...

if __name__ == "__main__":
    do_a_thing()
```

??
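one way to check your answer is to try it. the sketch below writes a hypothetical `thisthing.py` to a temporary directory, then starts two fresh interpreters: one runs the file as a script, the other only imports it. the helper `run_python` and the file contents are made up for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# hypothetical module contents, written to a temporary thisthing.py
SOURCE = '''\
def main():
    print("main() ran")

print("my __name__ is", __name__)

if __name__ == "__main__":
    main()
'''


def run_python(args, cwd):
    """start a fresh python interpreter and return what it printed"""
    result = subprocess.run(
        [sys.executable, *args],
        cwd=cwd, capture_output=True, text=True, check=True,
    )
    return result.stdout


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "thisthing.py").write_text(SOURCE)

    # case 1: run the file as a script
    as_script = run_python(["thisthing.py"], cwd=tmp)

    # case 2: import the file as a module from some other code
    as_import = run_python(["-c", "import thisthing"], cwd=tmp)

print(as_script)
print(as_import)
```

run as a script, the file sees `__name__ == "__main__"` and calls `main()`. when it is imported, `__name__` is `"thisthing"`, so the guarded block is skipped and the module just provides its functions.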

***<div align="center">YESSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS</div>***
<div align="center"><img src="http://ih0.redbubble.net/image.13413141.8561/flat,550x550,075,f.u3.jpg"></div>

# END OF LECTURE

next lecture: [environment management pt. 1: anaconda](006_environments_1_anaconda.ipynb)