<img src=images/ucsc_banner.png width=500>
# Visualization

Visualization is one of the best parts of data science. Visualizing data isn't just about aesthetics: it can profoundly change the way a person understands the data they're working with, in a way that is much harder to achieve by working with numbers alone.

If you do not have seaborn installed, you can use conda to install it with the command
`conda install -c anaconda seaborn`

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline

In [None]:
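from IPython.display import HTML
from pathlib import Path
# Read both stylesheets and inject them into the notebook as one <style> block
css = Path('style-table.css').read_text() + Path('style-notebook.css').read_text()
HTML('<style>{}</style>'.format(css))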

I have a personal preference for [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/), a statistical data visualization package from Stanford for Python that builds on matplotlib and is designed around working with dataframes.

In the previous module, you computed some statistics from the *E. coli* data.  Now, we'll go ahead and create visual representations of those answers.

### What is the distribution of genome sizes?

In [None]:
df = pd.read_csv('data/ecoli_cit.csv')

In [None]:
df.head()

Let's start with a basic scatter plot of `genome_size` by sample.

In [None]:
x = np.arange(len(df))  # one integer per sample (0 to 29) for the x-axis
plt.scatter(x, df.genome_size);

We can see there's some separation in the data. Let's bin these genome sizes into a histogram.

In [None]:
# the semicolon at the end hides the text output that matplotlib would otherwise print
plt.hist(df.genome_size); 
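
To see what the histogram is doing when it bins the data, the counts and bin edges can be computed directly with `np.histogram`. This is a minimal sketch on hypothetical genome sizes invented for illustration (they are not the values in `ecoli_cit.csv`). By default, both `plt.hist` and `np.histogram` use 10 equal-width bins.

```python
import numpy as np

# Hypothetical genome sizes, invented for illustration only
sizes = np.array([4.62, 4.63, 4.60, 4.62, 4.61, 4.59, 4.62, 4.80, 4.76, 4.75])

# np.histogram returns the count in each bin and the bin edges;
# with the default of 10 bins there are 11 edges
counts, edges = np.histogram(sizes)

# Print each bin's range next to the number of samples that fall in it
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print('{:.3f} to {:.3f}: {}'.format(left, right, n))
```

Empty bins between two clusters of counts are what show up as the gap in the plotted histogram.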

### Is there a relationship between genome size and Cit status?
Because the data is already in a dataframe, we can use Seaborn to look at these relationships directly.

In [None]:
sns.boxplot(x='cit', y='genome_size', data=df);
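
The numbers behind a boxplot can be checked with pandas: each box spans the 25th to 75th percentile, and the line inside it marks the median. The sketch below computes those quartiles per Cit status on a hypothetical stand-in dataframe called `toy`; its values are invented for illustration, and only the column names `cit` and `genome_size` come from the data above.

```python
import pandas as pd

# Hypothetical stand-in for df; values are invented for illustration only
toy = pd.DataFrame({
    'cit': ['plus', 'plus', 'plus', 'minus', 'minus', 'minus', 'unknown', 'unknown'],
    'genome_size': [4.80, 4.76, 4.75, 4.62, 4.61, 4.63, 4.62, 4.60],
})

# Quartiles per Cit status: one row per group, one column per quantile
summary = toy.groupby('cit')['genome_size'].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

Running the same `groupby` on the real `df` should give the values that the boxes in the plot are drawn from.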

Instead of boxplots, we can visualize the *densities* of the Cit plus and Cit minus distributions, using our Pandas skills to select each group.

In [None]:
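# fill=True shades the area under each curve (seaborn 0.11+; older releases used shade=True)
sns.kdeplot(df[df.cit == 'plus'].genome_size, fill=True, label='plus')
sns.kdeplot(df[df.cit == 'minus'].genome_size, fill=True, label='minus')
plt.legend();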

### Is there a relationship between genome size and generation?

In [None]:
sns.jointplot(x='generation', y='genome_size', data=df, kind='reg');

A Pearson correlation of 0.4 isn't very strong, and we can see from the data that the disparity in genome size appears between generations 30,000 and 40,000. The boxplots above suggest that genome size is mostly related to Cit status.
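
Depending on your seaborn version, the joint plot may not print the correlation coefficient, so it can be useful to compute it yourself. A minimal sketch using pandas follows; the dataframe `toy` is hypothetical and its values are invented for illustration, so the result will not match the 0.4 quoted above.

```python
import pandas as pd

# Hypothetical stand-in for df; values are invented for illustration only
toy = pd.DataFrame({
    'generation': [0, 10000, 20000, 30000, 40000, 50000],
    'genome_size': [4.62, 4.61, 4.63, 4.62, 4.80, 4.76],
})

# Series.corr computes the Pearson correlation coefficient by default
r = toy['generation'].corr(toy['genome_size'])
print('Pearson r = {:.2f}'.format(r))
```

On the real data, the equivalent call would be `df['generation'].corr(df['genome_size'])`.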

## Additional Dimensions

All of the plots above show 1 or 2 dimensions' worth of data. How would you visualize 3 dimensions? 3-dimensional plots are generally avoided because they are ultimately projected onto a 2-dimensional surface, which obscures the data they are trying to represent.

*Color*, *size*, and *shape* are just a few ways we can visualize additional dimensions of data without needing a 3-dimensional plot.

This additional dimension is usually used for *categorical* rather than *continuous* data.
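
As a rough illustration of encoding a categorical column visually, the sketch below uses `sns.scatterplot` (available in seaborn 0.9 and later) to map Cit status to both color (`hue`) and marker shape (`style`). The dataframe `toy` is hypothetical, with values invented for illustration; only the column names come from the data above.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for df; values are invented for illustration only
toy = pd.DataFrame({
    'generation': [5000, 15000, 25000, 32000, 36000, 40000],
    'genome_size': [4.62, 4.61, 4.63, 4.62, 4.80, 4.76],
    'cit': ['unknown', 'unknown', 'minus', 'minus', 'plus', 'plus'],
})

# hue maps the categorical column to color, style maps it to marker shape
ax = sns.scatterplot(x='generation', y='genome_size', hue='cit', style='cit', data=toy)
ax.set_title('Cit status encoded as color and marker shape')
```

Using color and shape for the same column is redundant by design: the groups stay distinguishable in grayscale or for color-blind readers.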

In [None]:
sns.lmplot(x='generation', y='genome_size', data=df);

This shows us a regression of `genome_size` on `generation`. But how would we visualize this data with regard to each sample's Cit status?

In [None]:
sns.lmplot(x='generation', y='genome_size', data=df, hue='cit');

Now we get 3 regression lines, one per Cit status. The *unknown* and *minus* lines are almost identical, while the *plus* line is distinct.
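
Splitting by `hue` fits a separate linear regression within each group. One way to put numbers on those lines is to fit each group yourself with `np.polyfit`. This is a sketch under assumptions: `toy` is a hypothetical dataframe with invented values, and `fits` is just an illustrative name for the collected results.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; values are invented for illustration only
toy = pd.DataFrame({
    'generation': [0, 2000, 5000, 10000, 15000, 25000, 31000, 36000, 40000],
    'genome_size': [4.62, 4.61, 4.62, 4.63, 4.61, 4.62, 4.75, 4.78, 4.80],
    'cit': ['unknown'] * 3 + ['minus'] * 3 + ['plus'] * 3,
})

# Fit a degree-1 polynomial (a straight line) within each Cit group;
# np.polyfit returns coefficients from highest power down: slope, then intercept
fits = {}
for status, group in toy.groupby('cit'):
    slope, intercept = np.polyfit(group['generation'], group['genome_size'], deg=1)
    fits[status] = (slope, intercept)
    print('{}: slope={:.2e}, intercept={:.3f}'.format(status, slope, intercept))
```

Groups whose slopes and intercepts are close correspond to regression lines that nearly overlap in the plot.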

### Tips Dataset
**Tips** is a built-in example dataset from seaborn. What interesting relationships can you discover through visualization?

In [None]:
tips = sns.load_dataset("tips")

In [None]:
tips.head()

In [None]:
# Have fun! 

NIH BD2K Center for Big Data in Translational Genomics, UCSC Genomics Institute