<img src=images/ucsc_banner.png width=500>
# Anaconda, IPython, and Dataframes

This module covers tools that combine to form an interactive, exploratory environment for data science.

<img src="images/anaconda_logo.png" width=250>

Anaconda is a package manager and collection of popular scientific Python tools. Installing Anaconda is an easy way to set up a data analysis environment from scratch without worrying about the conflicting dependencies that can arise from installing many disparate pieces of software. Anaconda supports Windows, Mac OS X, and Linux.

# IPython

IPython stands for *Interactive Python*, and is a beefed-up version of the standard Python interpreter.

Type `ipython` in the terminal to open IPython.

- **M4.Q1** List some *shell* commands that you can use in IPython.
- **M4.A1** `ls`, `cd`, `pwd`, `ls -a`, `ls -l`, `ls -h`, `cd ..`, `cd bd2k`
- **M4.Q2** Why is this useful?
- **M4.A2** Because it lets you run common command-line tasks without having to exit the IPython session.

Type: `a = 'kitties'`  

We've now created `a`, a string variable. But what are some *methods* that you can perform with this string? In the standard Python interpreter, you'd have a difficult time figuring this out without looking up Python's [string documentation](https://docs.python.org/2/library/string.html).

**Terminology**: A *method* is a function that is associated with a particular *object*. An object is an instance of a datatype, such as an *int* or a *str*.

Type: `a.` 

(don't forget the period, or *dot*) and hit *tab*. This returns all of the *dot methods* associated with the *str* object. We see the first method listed is `capitalize`. Type `a.ca` and hit *tab* again -- this autocompletes the method for us! If we just hit *enter* now, the method won't actually run; IPython only shows the method object itself. This is because `capitalize` is a *method*, and all methods and functions in Python are called with `()`. So the full command we want is `a.capitalize()`.
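Outside of IPython, the same exploration can be approximated in plain Python. The sketch below uses the built-in `dir()` as a rough stand-in for tab completion and uses `upper` as the example method; the variable names `public_methods`, `method`, and `shouted` are only for illustration.

```python
a = 'kitties'

# dir() lists an object's attributes; dropping the underscore-prefixed names
# leaves roughly what tab completion shows after typing `a.`
public_methods = [name for name in dir(a) if not name.startswith('_')]
print(public_methods[:5])

# Without parentheses we only reference the method object; nothing runs.
method = a.upper
print(callable(method))

# With parentheses the method is called and returns a new string.
shouted = a.upper()
print(shouted)
```

Note that `a` itself is unchanged afterwards: string methods return a new string rather than modifying the original.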

- **M4.Q3** What does `a.capitalize()` return? 
- **M4.A3** 'Kitties'
- **M4.Q4** How would you *assign* what it returns to a new variable, `b`?
- **M4.A4** 

The reason for the parentheses is that some methods accept *arguments*. `capitalize` doesn't happen to take any arguments, as it simply upper-cases the first letter in the string, but what about `count`?

Type: `a.count()`

This will raise an error: `TypeError: count() takes at least 1 argument (0 given)`.  This tells us that *count* requires at least one argument that goes in the parentheses, but how do we know what that argument is?  

Type: `a.count?`

```
Docstring:
S.count(sub[, start[, end]]) -> int

Return the number of non-overlapping occurrences of substring sub in
string S[start:end].  Optional arguments start and end are interpreted
as in slice notation.
Type:      builtin_function_or_method
```

This tells us a lot about the `count` method. The first line explains that `count` has a required argument, `sub`, and that it returns an integer (`-> int`). The square brackets in that line mark `start` and `end` as optional, and the rest of the docstring explains how they are interpreted.

- **M4.Q5** Provide a valid `count` command.
- **M4.A5**
- **M4.Q6** Read the docstring carefully and provide a command that uses `count`'s *optional* arguments.
- **M4.A6**

IPython has a lot more goodies in it, but the above fundamentals give you the basis to explore Python without constantly having to reference documentation. 

Documentation for IPython can be found at: http://ipython.readthedocs.io/en/stable/index.html

# Jupyter Notebooks

<img src="images/jupyter_logo.png" width=250>

Jupyter notebooks used to be called IPython notebooks, but the popularity of notebooks led to the notebook framework being moved into its own project that now supports more languages than just Python.



## Setting up remote notebook

We want to work on these notebooks on our remote server, but notebooks require a GUI (a web browser) to use. We'll launch the notebook server on the remote machine, and access it from our local machine using port forwarding.

On your remote machine:

Type: `screen`

This opens a separate terminal session, so your notebook will stay running even if your SSH session gets disconnected. You can detach from the screen session by pressing `ctrl+a` and then `d`, and resume it by typing `screen -r`.

Type: `jupyter notebook --no-browser`

From your *local* machine (open a new tab or exit your ssh session):

Linux / Mac OS X
```
ssh -NL 8157:localhost:8888 ubuntu@your-remote-machine-public-dns
```

You should now be able to point your browser to http://localhost:8157 and see the Jupyter notebook startup screen.

 
## Basics

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.

Jupyter notebooks are composed of *cells* that allow a user to write small blocks of code or text and execute them. You can use the menu at the top to create / delete / arrange cells, or click **Help -> Keyboard Shortcuts** to learn the keyboard equivalents. I primarily use only two types of cells: code cells, and *Markdown* cells for text formatting.

Besides capturing a methodology in a format that places narration alongside the code used to achieve the result, notebooks are a great way to share your research with others, as they can download your notebook and run it for themselves.

We'll be using notebooks to visualize *dataframes* and execute Python code on those dataframes all in a single environment.

# Dataframes

A *dataframe* is an abstract concept for **(deep breath)** size-mutable "labeled arrays" that handle heterogeneous data. If you have used the R programming language, then you're already familiar with what a dataframe is. For those who haven't used R, a dataframe simply takes the concept of an *array* of data and applies hierarchical labeling.
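To make the definition concrete, here is a minimal sketch using pandas (introduced in the next section). The `films` table and its values are made up for illustration; they are not part of the IMDB data used later.

```python
import pandas as pd

# Hypothetical toy data: each column holds a different type (heterogeneous).
films = pd.DataFrame(
    {
        'title': ['Alpha', 'Beta', 'Gamma'],
        'year': [1999, 2005, 2012],
        'rating': [7.1, 6.4, 8.0],
    },
    index=['a', 'b', 'c'],  # row labels
)

print(films.dtypes)             # one dtype per column
print(films.loc['b', 'year'])   # look up a value by row label and column label

# Size-mutable: adding a column grows the frame.
films['is_recent'] = films['year'] > 2000
print(films.shape)
```

Rows and columns are both addressed by label here, which is the "labeled array" part of the definition.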

## Pandas

Pandas is a Python implementation of the dataframe model with a ton of cool features. I'll let the author himself provide a brief overview of Pandas: https://vimeo.com/59324550

I mostly use Pandas for exploratory data science work, some examples of which can be checked out here:

**Similarity Comparison Between Two RNA-Seq Pipelines** <br>
https://github.com/jvivian/ipython_notebooks/blob/master/RSEM_comparison/RSEM_comparison.ipynb

**Fitting a Distribution to Kallisto Bootstraps** <br>
https://github.com/jvivian/ipython_notebooks/blob/master/kallisto_boostraps/Kallisto%20Bootstraps.ipynb


## Data 

To play around with Pandas, we'll look at some data from IMDB, the internet movie database.

Download cast.csv to the **data/** directory in your forked repo with the following URL: <br>
https://drive.google.com/file/d/0ByHO8wS-fc8HTFJpZDE0T3RBcG8/view?usp=sharing

In [1]:
import pandas as pd

We'll use some CSS styling to make our tables look pretty. This doesn't carry over if you view the notebook on GitHub.

In [2]:
from IPython.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

Now let's read in our dataframes:

In [3]:
titles = pd.read_csv('data/titles.csv', index_col=None)
cast = pd.read_csv('data/cast.csv', index_col=None)
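If the CSV files aren't downloaded yet, loading can still be tried out on in-memory text. This is a small sketch using `pd.read_csv`, the standard pandas CSV reader; the CSV content and the names `csv_text`, `sample`, and `by_title` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "title,year\nAlpha,1999\nBeta,2005\n"

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
print(list(sample.columns))

# index_col=0 would instead use the first column as the row labels.
by_title = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(by_title.index))
```

With `index_col=None` (the default), pandas numbers the rows 0, 1, 2, ... itself, which is the leftmost column in the tables shown below.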

There are two main datatypes in Pandas: DataFrames and Series. It's easiest to think of a Series as a single column of a dataframe.

In [4]:
type(titles)

pandas.core.frame.DataFrame

In [5]:
type(titles.title)

pandas.core.series.Series
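The distinction matters when selecting columns. As a sketch on a made-up two-row frame (`toy` is hypothetical), single brackets give back a Series while a list of column names gives back a DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({'title': ['Alpha', 'Beta'], 'year': [1999, 2005]})

one_column = toy['year']      # single brackets -> Series
still_frame = toy[['year']]   # list of column names -> DataFrame

print(type(one_column).__name__)
print(type(still_frame).__name__)

# A Series keeps the row index of the frame it came from.
print(one_column.index.equals(toy.index))
```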

Two of the most useful pandas methods are `head` and `tail`.

In [11]:
titles.head()

|  | title | year |
|---|---|---|
| 0 | The Rising Son | 1990 |
| 1 | The Thousand Plane Raid | 1969 |
| 2 | Crucea de piatra | 1993 |
| 3 | Country | 2000 |
| 4 | Gaiking II | 2011 |


In [13]:
cast.tail()

|  | title | year | name | type | character | n |
|---|---|---|---|---|---|---|
| 3499555 | Stuttur Frakki | 1993 | Sveinbjörg Þórhallsdóttir | actress | Flugfreyja | 24.0 |
| 3499556 | Foxtrot | 1988 | Lilja Þórisdóttir | actress | Dóra | 24.0 |
| 3499557 | Niceland (Population. 1.000.002) | 2004 | Sigríður Jóna Þórisdóttir | actress | Woman in Bus | 26.0 |
| 3499558 | U.S.S.S.S... | 2003 | Kristín Andrea Þórðardóttir | actress | Afgr.dama á bensínstöð | 17.0 |
| 3499559 | Bye Bye Blue Bird | 1999 | Rosa á Rógvu | actress | Pensionatværtinde |  |
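Both methods default to five rows and accept an optional row count. A quick sketch on a made-up ten-row frame (`numbers` is hypothetical):

```python
import pandas as pd

numbers = pd.DataFrame({'n': range(10)})

first_five = numbers.head()    # defaults to 5 rows
last_two = numbers.tail(2)     # ask for a specific number of rows

print(len(first_five))
print(last_two['n'].tolist())
```

On a frame with millions of rows, such as `cast`, this is a cheap way to check that the data loaded as expected without printing everything.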


## Dataframe Operations

We can look at any column in our dataframe by using bracket notation `['column_name']` or dot (attribute) notation.

In [14]:
titles.columns

Index([u'title', u'year'], dtype='object')

In [15]:
titles['title'].head()

0             The Rising Son
1    The Thousand Plane Raid
2           Crucea de piatra
3                    Country
4                 Gaiking II
Name: title, dtype: object

In [16]:
titles.title.head()

0             The Rising Son
1    The Thousand Plane Raid
2           Crucea de piatra
3                    Country
4                 Gaiking II
Name: title, dtype: object
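The two notations are interchangeable only when the column name is a valid Python identifier that doesn't clash with an existing dataframe attribute. The sketch below uses a made-up frame (`toy`, with hypothetical `count` and `release year` columns) to show where dot notation falls short:

```python
import pandas as pd

toy = pd.DataFrame({'title': ['Alpha'], 'count': [3], 'release year': [1999]})

# For an ordinary column name, both notations return the same Series.
print(toy.title.equals(toy['title']))

# 'count' is also a DataFrame method, so the dot form finds the method,
# not the column. Brackets always return the column.
print(callable(toy.count))
print(toy['count'].iloc[0])

# A name containing a space can only be reached with brackets.
print(toy['release year'].iloc[0])
```

Bracket notation is the safer default in scripts; dot notation is mostly a convenience for interactive work.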

There are a lot of built-in dataframe methods:

In [17]:
titles.count()

title    225616
year     225616
dtype: int64

In [18]:
titles.sort_values('title').head()

|  | title | year |
|---|---|---|
| 138980 | #1 Serial Killer | 2013 |
| 158682 | #5 | 2013 |
| 84531 | #50Fathers | 2015 |
| 70275 | #66 | 2015 |
| 63318 | #73, Shaanthi Nivaasa | 2007 |


In [19]:
titles.sort_values('year').head()

|  | title | year |
|---|---|---|
| 172508 | Miss Jerry | 1894 |
| 120909 | Reproduction of the Corbett and Jeffries Fight | 1899 |
| 92240 | Trouble in Hogan's Alley | 1900 |
| 19987 | Pierrot's Problem, or How to Make a Fat Wife O... | 1900 |
| 178104 | Soldiers of the Cross | 1900 |


#### Conditionals

Filtering dataframes can be a bit unintuitive at first, but makes sense once you've done it a few times.

Say we wanted to look at every movie named **Hamlet**. How would we do that? You might try something like:

In [20]:
titles.title == 'Hamlet'

0         False
1         False
2         False
3         False
4         False
5         False
6         False
7         False
8         False
9         False
10        False
11        False
12        False
13        False
14        False
15        False
16        False
17        False
18        False
19        False
20        False
21        False
22        False
23        False
24        False
25        False
26        False
27        False
28        False
29        False
          ...  
225586    False
225587    False
225588    False
225589    False
225590    False
225591    False
225592    False
225593    False
225594    False
225595    False
225596    False
225597    False
225598    False
225599    False
225600    False
225601    False
225602    False
225603    False
225604    False
225605    False
225606    False
225607    False
225608    False
225609    False
225610    False
225611    False
225612    False
225613    False
225614    False
225615    False
Name: title, dtype: bool

Whoa, what is this? What we've gotten back is a *boolean* Series with one entry per title, stating whether or not that title is **Hamlet**; as we can see, the majority are False. We can use this to filter our original dataframe by *subsetting it*.

In [21]:
titles[titles.title == 'Hamlet'].head()

|  | title | year |
|---|---|---|
| 5807 | Hamlet | 1948 |
| 27270 | Hamlet | 2016 |
| 38985 | Hamlet | 2015 |
| 45362 | Hamlet | 1910 |
| 71187 | Hamlet | 1954 |
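The two steps (build a boolean mask, then subset with it) can be seen separately on a small made-up frame. The `toy` data and the names `mask` and `matches` are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({'title': ['Alpha', 'Beta', 'Alpha'],
                    'year': [1999, 2005, 2012]})

# Step 1: the comparison is applied to every row and yields True/False values.
mask = toy['title'] == 'Alpha'
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows where it is True.
matches = toy[mask]
print(matches['year'].tolist())

# The surviving rows keep their original index labels.
print(matches.index.tolist())
```

That last point explains why the filtered output above shows index values like 5807 and 27270 rather than 0, 1, 2.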


If you have multiple conditionals, you need to wrap each one in parentheses and combine them with the `&` (and) or `|` (or) operator.

In [22]:
titles[(titles.year < 1959) & (titles.year > 1955)].head()

|  | title | year |
|---|---|---|
| 38 | La momia azteca contra el robot humano | 1958 |
| 43 | Mavi boncuk | 1958 |
| 91 | Perdeu-se um Marido | 1957 |
| 114 | Hi no tori | 1956 |
| 131 | Quem Sabe, Sabe! | 1956 |


Working with dataframes is *functional*: most operations return a new dataframe, so you can chain many functions and operations together.

In [23]:
titles[(titles.year < 1959) & (titles.year > 1955)].sort_values('year').head(2)

|  | title | year |
|---|---|---|
| 225480 | Tischlein, deck dich | 1956 |
| 64366 | Yield to the Night | 1956 |
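Long chains become easier to read when wrapped in parentheses and split one step per line. The sketch below uses made-up data (`toy`) and, as an assumption beyond the example above, sorts on a second key so that ties within a year come out in a predictable order:

```python
import pandas as pd

toy = pd.DataFrame({'title': ['Delta', 'Alpha', 'Gamma', 'Beta'],
                    'year': [2012, 1999, 2005, 1999]})

# Each step returns a new dataframe that feeds the next step.
result = (
    toy[toy['year'] < 2010]          # filter
    .sort_values(['year', 'title'])  # sort by year, then title to break ties
    .head(2)                         # keep the first two rows
)

print(result['title'].tolist())
```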


In [24]:
titles[(titles.year < 1959) | (titles.year > 1955)].sort_values('year').head(2)

|  | title | year |
|---|---|---|
| 172508 | Miss Jerry | 1894 |
| 120909 | Reproduction of the Corbett and Jeffries Fight | 1899 |
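The difference between the last two results is easier to see on a handful of rows. In this sketch (made-up `toy` data), `&` keeps rows where both conditions hold, `|` keeps rows where at least one holds, and `~` negates a condition:

```python
import pandas as pd

toy = pd.DataFrame({'year': [1950, 1956, 1957, 1960]})

after_1955 = toy['year'] > 1955
before_1959 = toy['year'] < 1959

both = toy[after_1955 & before_1959]        # inside the range
either = toy[after_1955 | before_1959]      # every year satisfies at least one
neither = toy[~(after_1955 & before_1959)]  # outside the range

print(both['year'].tolist())
print(either['year'].tolist())
print(neither['year'].tolist())
```

Because every year is either above 1955 or below 1959, the `|` version filters nothing out, which is why the earliest films in the whole dataset appear in the output above. The parentheses around each comparison are needed because `&` and `|` bind more tightly than `<` and `>` in Python; without them the expression is parsed differently and typically raises an error.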


#### Mutability

Another thing that takes some getting used to is how Pandas handles *mutability*. Pandas prefers never to *mutate*, or change, a dataframe object unless you explicitly tell it to. Instead, an operation returns a modified copy, which you can assign to a new variable.

In [25]:
sorted_titles = titles.sort_values('year')
sorted_titles.head()

|  | title | year |
|---|---|---|
| 172508 | Miss Jerry | 1894 |
| 120909 | Reproduction of the Corbett and Jeffries Fight | 1899 |
| 92240 | Trouble in Hogan's Alley | 1900 |
| 19987 | Pierrot's Problem, or How to Make a Fat Wife O... | 1900 |
| 178104 | Soldiers of the Cross | 1900 |


If you want to force a change into a dataframe, use the `inplace=True` argument.

In [26]:
titles.sort_values('year', inplace=True)
titles.head()

|  | title | year |
|---|---|---|
| 172508 | Miss Jerry | 1894 |
| 120909 | Reproduction of the Corbett and Jeffries Fight | 1899 |
| 92240 | Trouble in Hogan's Alley | 1900 |
| 19987 | Pierrot's Problem, or How to Make a Fat Wife O... | 1900 |
| 178104 | Soldiers of the Cross | 1900 |
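The copy-versus-mutate behaviour can be checked directly on a small made-up frame (`toy` and the other names here are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'year': [2005, 1999, 2012]})

# Default behaviour: a sorted copy is returned, the original is untouched.
sorted_copy = toy.sort_values('year')
print(toy['year'].tolist())
print(sorted_copy['year'].tolist())

# With inplace=True the original is changed and the call returns None.
returned = toy.sort_values('year', inplace=True)
print(returned)
print(toy['year'].tolist())
```

Since the in-place call returns `None`, it cannot be chained with further methods. Reassigning the result, as in `toy = toy.sort_values('year')`, is a common alternative that achieves the same effect.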


## Exercises
Borrowed from the great Brandon Rhodes: http://rhodesmill.org/brandon/

Most (if not all) of these exercises can be done in a single line. Be precise! If a question asks "How many movies...", then your answer should return a number.

### What are the earliest two films listed in the titles dataframe?

### How many movies have the title "Hamlet"?

### How many movies are titled "North by Northwest"?

### When was the first movie titled "Hamlet" made?

### List all of the "Treasure Island" movies from earliest to most recent.

### How many movies were made in the year 1950?

### How many movies were made in the year 1960?

### How many movies were made from 1950 through 1959?

### In what years has a movie titled "Batman" been released?

### How many roles were there in the movie "Inception"?

### How many roles in the movie "Inception" are NOT ranked by an "n" value?

### But how many roles in the movie "Inception" did receive an "n" value?

### Display the cast of "North by Northwest" in their correct "n"-value order, ignoring roles that did not earn a numeric "n" value.

### Display the entire cast, in "n"-order, of the 1972 film "Sleuth".

### Now display the entire cast, in "n"-order, of the 2007 version of "Sleuth".

### How many roles were credited in the silent 1921 version of Hamlet?

### How many roles were credited in Branagh’s 1996 Hamlet?

### How many "Hamlet" roles have been listed in all film credits through history?

### How many people have played an "Ophelia"?

### How many people have played a role called "The Dude"?

### How many people have played a role called "The Stranger"?

### How many roles has Sidney Poitier played throughout his career?

### How many roles has Judi Dench played?

### List the supporting roles (having n=2) played by Cary Grant in the 1940s, in order by year.

### List the leading roles that Cary Grant played in the 1940s in order by year.

### How many roles were available for actors in the 1950s?

### How many roles were available for actresses in the 1950s?

### How many leading roles (n=1) were available from the beginning of film history through 1980?

### How many non-leading roles were available from the beginning of film history through 1980?

### How many roles through 1980 were minor enough that they did not warrant a numeric "n" rank?

NIH BD2K Center for Big Data in Translational Genomics, UCSC Genomics Institute