<img src=images/ucsc_banner.png width=500>

# Anaconda, Jupyter Notebooks, Pandas Dataframes

This module covers tools which combine to form an interactive and explorative environment for data science.

<img src="images/anaconda_logo.png" width=250>

Anaconda is a package manager and a collection of popular scientific Python tools. Installing Anaconda is an easy way to set up a data analysis environment from scratch without worrying about the conflicting dependencies that can arise when installing many separate pieces of software. Anaconda supports Windows, macOS, and Linux.

Anaconda can be downloaded at https://www.anaconda.com/distribution/ 

Anaconda comes with many of the packages we will be using today, such as Jupyter Notebook and pandas. Additional packages can be installed using `conda install`.

Here is a helpful cheatsheet for conda commands: https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf

# Jupyter Notebooks

<img src="images/jupyter_logo.png" width=250>

Jupyter notebooks used to be called IPython notebooks, but the popularity of notebooks led to the notebook framework being moved into its own project that now supports more languages than just Python.



## Basics

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.

Jupyter notebooks are composed of *cells* that allow a user to write small blocks of code or text and execute them. You can use the menu at the top to create, delete, and arrange cells, or click **Help -> Keyboard Shortcuts** to see the keyboard shortcuts for the same actions. I primarily use only two types of cells: code cells, and *Markdown* cells for formatted text.

Besides codifying a methodology in a format that places narration alongside the code used to achieve the result, notebooks are a great way to share your research with others, as they can download your notebook and run it for themselves.

We'll be using notebooks to visualize *dataframes* and execute Python code on those dataframes all in a single environment.

# Dataframes

A *dataframe* is an abstract concept for **(deep breath)** size-mutable "labeled arrays" that handle heterogeneous data. If you have used the R programming language, then you're already familiar with what a dataframe is. If you haven't used R, a dataframe simply takes the concept of an *array* of data and adds labels to its rows and columns (optionally hierarchical ones).

## Pandas

Pandas is a Python implementation of the dataframe model with a ton of cool features. I'll let the author himself provide a brief overview of Pandas: https://vimeo.com/59324550

John Vivian mostly uses Pandas for exploratory data science work, some examples of which can be checked out here:

**Similarity Comparison Between Two RNA-Seq Pipelines** <br>
https://github.com/jvivian/ipython_notebooks/blob/master/RSEM_comparison/RSEM_comparison.ipynb

**Fitting a Distribution to Kallisto Bootstraps** <br>
https://github.com/jvivian/ipython_notebooks/blob/master/kallisto_boostraps/Kallisto%20Bootstraps.ipynb


## Data 

To play around with Pandas, we'll look at some data from IMDb, the Internet Movie Database.

Download cast.csv to the **data/** directory in your forked repo with the following URL: <br>
https://drive.google.com/file/d/0ByHO8wS-fc8HTFJpZDE0T3RBcG8/view?usp=sharing

In [None]:
import pandas as pd

We'll use some CSS styling to make our tables look pretty. This styling is not applied when you view the notebook on GitHub.

In [None]:
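from pathlib import Path
from IPython.display import HTML

# Load both stylesheets and inject them into the notebook as one <style> block
css = Path('style-table.css').read_text() + Path('style-notebook.css').read_text()
HTML('<style>{}</style>'.format(css))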

Now let's read in our dataframes.

In [None]:
titles = pd.read_csv('data/titles.csv', index_col=None)
cast = pd.read_csv('data/cast.csv', index_col=None)

There are two main datatypes in Pandas: DataFrames and Series. It's easiest to think of a single column of a pandas dataframe as a series.

In [None]:
type(titles)

In [None]:
type(titles.title)
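
To see the two types without needing the CSV files, here is a minimal sketch using a small hand-made table. The name `toy` and its three rows are made up for illustration and are not taken from the IMDb data.

```python
import pandas as pd

# Hypothetical toy data (not from the IMDb files)
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "year": [1999, 2005, 2012],
})

# The whole table is a DataFrame; a single column is a Series
print(type(toy).__name__)           # DataFrame
print(type(toy["title"]).__name__)  # Series

# A Series carries the same row labels (index) as the DataFrame it came from
print(toy["year"].index.equals(toy.index))  # True
```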

Two of the most useful pandas methods are `head` and `tail`.

In [None]:
titles.head()

In [None]:
cast.tail()

## Dataframe Operations

We can look at any column in our dataframe by using bracket notation (`titles['column_name']`) or attribute (dot) notation (`titles.column_name`).

In [None]:
titles.columns

In [None]:
titles['title'].head()

In [None]:
titles.title.head()
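
The two notations are not always interchangeable. The sketch below uses a made-up table (`toy` and its column names are assumptions chosen for illustration) to show two cases where only bracket notation reaches the column: a name containing a space, and a name that clashes with an existing dataframe method.

```python
import pandas as pd

# Hypothetical toy data; column names chosen to show where dot notation breaks down
toy = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "release year": [1999, 2005],  # name contains a space
    "count": [10, 20],             # same name as the DataFrame.count method
})

# For a simple column name, both notations return the same Series
same = toy["title"].equals(toy.title)
print(same)  # True

# Bracket notation works for any column name
years = toy["release year"]
counts = toy["count"]

# Dot notation finds the method, not the column, when the names clash
print(callable(toy.count))  # True
```

When in doubt, bracket notation is the safer choice.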

Dataframes have a lot of built-in methods, such as `count` and `sort_values`.

In [None]:
titles.count()

In [None]:
titles.sort_values('title').head()

In [None]:
titles.sort_values('year').head()

#### Conditionals

Filtering dataframes can be a bit unintuitive at first, but it makes sense once you've done it a few times.

Say we wanted to look at every movie named **Hamlet**, how would we do that?  You might try something like:

In [None]:
titles.title == 'Hamlet'

Whoa, what is this? What we've gotten back is a *boolean* Series with one entry per title, telling us whether or not that title is **Hamlet**. As we can see, the majority are False. We can use this Series to filter our original dataframe by *subsetting it*.

In [None]:
titles[titles.title == 'Hamlet'].head()

If you have multiple conditionals, you need to wrap each one in parentheses and combine them with the `&` (and) or `|` (or) operator.

In [None]:
titles[(titles.year < 1959) & (titles.year > 1955)].head()
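
The filtering steps above can be sketched on a tiny made-up table (`toy` and its rows are hypothetical, not IMDb data). The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `==` and `<` in Python.

```python
import pandas as pd

# Hypothetical toy data (not from the IMDb files)
toy = pd.DataFrame({
    "title": ["Hamlet", "Film B", "Hamlet", "Film D"],
    "year": [1948, 1956, 1996, 1958],
})

# A comparison returns a boolean Series, one value per row
mask = toy["title"] == "Hamlet"
print(mask.tolist())  # [True, False, True, False]
print(mask.sum())     # 2 (True counts as 1)

# Indexing with the mask keeps only the rows where it is True
hamlets = toy[mask]

# Combine conditions with & (and) or | (or), each wrapped in parentheses
late_fifties = toy[(toy["year"] > 1955) & (toy["year"] < 1959)]
either = toy[(toy["title"] == "Hamlet") | (toy["year"] == 1958)]

# Series.between is an alternative for a range (inclusive on both ends by default)
inclusive = toy[toy["year"].between(1956, 1958)]
```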

Working with dataframes is *functional*: most operations return a new object, so you can chain many functions and operations together.

In [None]:
cast.head()

In [None]:
cast[(cast.year > 2000) & (cast.name == 'Zoe Saldana')].sort_values('year').tail()

In [None]:
cast[(cast.title == 'Avatar 2') | (cast.title == 'Avatar 3')].sort_values('year')
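
Long chains can be easier to read when wrapped in parentheses with one step per line. This is a sketch on made-up data (`toy_cast`, the film titles, and the actor names are all hypothetical). It also shows `isin`, which is an alternative to joining several `==` tests with `|`.

```python
import pandas as pd

# Hypothetical toy data (not from cast.csv)
toy_cast = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "year": [2009, 1998, 2014, 2011],
    "name": ["Actor X", "Actor X", "Actor Y", "Actor X"],
})

# Filter, then sort, then take the last row: one step per line
recent = (
    toy_cast[(toy_cast["year"] > 2000) & (toy_cast["name"] == "Actor X")]
    .sort_values("year")
    .tail(1)
)
print(recent["title"].tolist())  # ['Film D']

# isin matches any value in a list, replacing several == tests joined with |
picked = toy_cast[toy_cast["title"].isin(["Film A", "Film C"])].sort_values("year")
print(picked["title"].tolist())  # ['Film A', 'Film C']
```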

#### Mutability

Another thing that takes some getting used to is how Pandas handles *mutability*. Pandas prefers never to *mutate*, or change, a dataframe object unless you explicitly tell it to. Instead, most methods return a modified copy, which you can assign to a new variable.

In [None]:
sorted_titles = titles.sort_values('year')
sorted_titles.head()

If you want to change the dataframe itself instead of getting a copy, use the `inplace=True` argument.

In [None]:
titles.sort_values('year', inplace=True)
titles.head()
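
Here is a minimal sketch of the difference, using a made-up table (`toy` is hypothetical). Note that a method called with `inplace=True` returns `None`, so its result should not be assigned to a variable. Reassigning the returned copy to the same name is a common alternative.

```python
import pandas as pd

# Hypothetical toy data (not from the IMDb files)
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "year": [2012, 1999, 2005],
})

# Default behavior: the original is untouched and a sorted copy is returned
sorted_toy = toy.sort_values("year")
print(toy["year"].tolist())         # [2012, 1999, 2005]
print(sorted_toy["year"].tolist())  # [1999, 2005, 2012]

# inplace=True modifies the original and returns None
result = toy.sort_values("year", inplace=True)
print(result)                # None
print(toy["year"].tolist())  # [1999, 2005, 2012]

# Alternative to inplace: reassign the returned copy to the same name
toy = toy.sort_values("year", ascending=False)
print(toy["year"].tolist())  # [2012, 2005, 1999]
```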

## Exercises
Borrowed from the great Brandon Rhodes: http://rhodesmill.org/brandon/

Most (if not all) of these exercises can be done in a single line. Be precise! If a question asks "How many movies...", then your answer should return a number.

### What are the earliest two films listed in the titles dataframe?

### How many movies are titled "North by Northwest"?

### List all of the "Treasure Island" movies from earliest to most recent.

### How many movies were made in the year 1950?

### In what years has a movie titled "Batman" been released?

### How many roles were there in the movie "Inception"?

### How many people have played a role called "The Dude"?

### How many roles has Sidney Poitier played throughout his career?

# More pandas exercises

# TCGA 

The Cancer Genome Atlas (TCGA) is a landmark cancer project that characterized over 20,000 primary cancer samples. You can explore TCGA data on cBioPortal (https://www.cbioportal.org). In the following exercises you will use pandas to find subsets of samples that have different mutations. KRAS and EGFR are two genes that are frequently mutated in lung cancer patients. The table below (`mutation_table`) was generated using cBioPortal and shows which samples have a mutation (or multiple mutations) in KRAS and EGFR. `NaN` marks the samples that do not have a mutation in that gene.

Use pandas to answer the following questions. You will likely have to use Google!

In [None]:
mutation_table = pd.read_csv('data/luad_tcga_pan_can_atlas_2018_KRAS_EGFR_mut.txt', sep='\t', index_col=None)
mutation_table.head()
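
As a starting hint, missing values cannot be found with `==`, because `NaN` is not equal to anything, including itself. The sketch below uses a made-up table (`toy_mut`, the column `GENE_A`, and the mutation strings are hypothetical and do not reflect the layout of the real file) to show `isna` and `notna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real mutation table may be laid out differently
toy_mut = pd.DataFrame({
    "SAMPLE_ID": ["S1", "S2", "S3", "S4"],
    "GENE_A": ["A10B", np.nan, "C20D", np.nan],
})

# Comparing with == never matches NaN
print((toy_mut["GENE_A"] == np.nan).sum())  # 0

# isna / notna return boolean Series that can be counted or used as filters
missing = toy_mut["GENE_A"].isna()
print(missing.tolist())                 # [False, True, False, True]
print(missing.sum())                    # 2
print(toy_mut["GENE_A"].notna().sum())  # 2
```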

### How many samples have no mutations in KRAS or EGFR?

### How many samples have mutations in both KRAS and EGFR?

### How many samples have a KRAS mutation at position G12? (Can be G12V, G12C, etc.)

### How many samples have more than one mutation in EGFR?

NIH BD2K Center for Big Data in Translational Genomics, UCSC Genomics Institute