# Visualizing Real Data

So - let's put these two concepts together! We'll use plotting functions so that we don't have to read through these big data frames by hand, and can instead examine them visually to look for interesting patterns.

First let's load in the "penguins" data set (<https://github.com/allisonhorst/palmerpenguins>).

This is a dataset which is useful for understanding plotting and manipulating data. 

The data were collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER as part of research on penguin ecology.  The dataset contains phenotype measurements for 344 penguins from 3 different penguin species.

We've provided this data as an Excel file. It can be loaded with pandas, as long as you _first_ install the dependency `openpyxl` with `mamba install openpyxl`.

In [None]:
import pandas as pd

In [None]:
%cd ../data

In [None]:
!ls

In [None]:
penguin=pd.read_excel('penguin.xlsx')
penguin

Let's filter out the rows that don't have a value for `sex` (the NaN values):

In [None]:
penguin=penguin[penguin.sex.notnull()]
penguin
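
To see what the `notnull()` filter does, here is a minimal sketch on a made-up three-row frame (the values are hypothetical, not taken from the penguin data). It also shows `dropna(subset=...)`, a pandas idiom that should select the same rows.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "sex": ["male", None, "female"],
})

# Boolean mask: True where `sex` has a value
mask = toy["sex"].notnull()

# Keep only the rows where the mask is True
kept = toy[mask]

# dropna restricted to the `sex` column selects the same rows
same = toy.dropna(subset=["sex"])

print(kept)
print(kept.equals(same))
```

Note that only the `sex` column is checked, so a row with a missing value in some other column would still be kept.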

Now, let's start by just plotting, for example, the length of their flippers vs their body mass.  How would we do that from this dataset?  We'll use `plotnine` <https://plotnine.readthedocs.io/en/stable/>, which uses something called the "grammar of graphics" to generate plots.  It is built on top of matplotlib, but with perhaps an easier interface for working with datasets. The approach comes from the R package ggplot2, and was brought over to Python by the plotnine team.

In [None]:
import matplotlib.pyplot as plt
import plotnine as p9

First - let's tell plotnine where to find the data. The main plotting function is `ggplot`:

In [None]:
(p9.ggplot(data=penguin))

Nothing is drawn, because we didn't _tell_ it to plot anything.  Let's tell it the variables, using the `flipper_length_mm` and `body_mass_g` columns:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='flipper_length_mm', y='body_mass_g')))

So we specify the "aesthetics" or `aes` to tell ggplot which variables to use.  But we still haven't told it _how_ to plot the variables. Let's use `geom_point` <https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_point.html>:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='flipper_length_mm', y='body_mass_g')) + p9.geom_point())

So this is pretty good: we can already see that there is a roughly linear relationship between flipper length and body mass.  But how does this look if we stratify or compare the species?  Is there a way we can look at that data simultaneously? Let's color code the points according to their species using `color`:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='flipper_length_mm', y='body_mass_g',  color='species', shape='species')) + p9.geom_point())

Pretty nice!  Note that we also made the markers different shapes by setting `shape`.

Ok - but some of the points are basically on top of each other, making them hard to see.  One way to resolve this overplotting is to make the points partially transparent by setting `alpha`, the opacity of each point (0 is fully transparent, 1 is fully opaque):

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='flipper_length_mm', y='body_mass_g', color='species', shape='species'))
 + p9.geom_point(alpha=.5))

But what if we have a quantitative or "continuous" third variable instead of a "categorical" one, for example bill length? When you map color to it, plotnine automatically scales the point color across the range of bill lengths.

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='flipper_length_mm', y='body_mass_g', color='bill_length_mm')) + p9.geom_point(alpha=0.5))

We can even use `size` to represent a continuous variable:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='flipper_length_mm', y='body_mass_g', size='bill_length_mm')) + p9.geom_point(alpha=0.5))

What if we want to layer the species information on top of the rest of this?  We can try:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='flipper_length_mm', y='body_mass_g', color='species', size='bill_length_mm')) + p9.geom_point(alpha=0.5))

But the problem is that the plot becomes too crowded to really understand the information.  Instead, we can **facet** the plot, splitting it into one panel per species with `facet_wrap`. We can set either the number of rows:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='flipper_length_mm', y='body_mass_g', color='bill_length_mm')) + p9.geom_point(alpha=0.5) 
 + p9.facet_wrap('species', nrow=1))

Or the number of columns:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='flipper_length_mm', y='body_mass_g', color='bill_length_mm')) + p9.geom_point(alpha=0.5) + p9.facet_wrap('species', ncol=1))

Or even let the panels wrap around onto a second row:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='flipper_length_mm', y='body_mass_g', color='bill_length_mm')) + p9.geom_point(alpha=0.5) + p9.facet_wrap('species', ncol=2))
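
Conceptually, faceting splits the data into one subset per value of the faceting variable and draws each subset in its own panel. A rough pandas analogy, using made-up rows (not the real penguin data) and a hypothetical `panels` dictionary:

```python
import pandas as pd

# Hypothetical toy rows standing in for the penguin data
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "body_mass_g": [3700, 5000, 3800, 3600],
})

# One subset per species, analogous to one panel per species
panels = {name: group for name, group in toy.groupby("species")}

for name, group in panels.items():
    print(name, len(group))
```

This is only an analogy for how the rows are divided; plotnine additionally handles the shared axes and panel layout.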

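We can also fit lines to the data using `geom_smooth`. For a small dataset like this one, the default is a local polynomial regression (loess), which may require the `scikit-misc` package to be installed: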

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='body_mass_g', y='flipper_length_mm', color='species')) 
 + p9.geom_point(alpha=0.5) + p9.geom_smooth())

Or we can fit a linear model with `method="lm"`:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='body_mass_g', y='flipper_length_mm', color='species')) + p9.geom_point(alpha=0.5) + 
 p9.geom_smooth(method="lm"))
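
A linear model here means a straight line fitted by ordinary least squares. As a rough sketch of that idea outside of plotnine, `numpy.polyfit` with degree 1 does the same kind of fit. The numbers below are made up so that they lie exactly on a known line; they are not penguin measurements.

```python
import numpy as np

# Hypothetical body masses (g) and flipper lengths (mm) lying exactly on a line
body_mass = np.array([3000.0, 3500.0, 4000.0, 4500.0, 5000.0])
flipper = 150.0 + 0.01 * body_mass

# A degree-1 polynomial fit is an ordinary least-squares straight line
slope, intercept = np.polyfit(body_mass, flipper, deg=1)

print(f"slope={slope:.4f}, intercept={intercept:.1f}")
```

Because `color` is mapped to `species` in the plot above, plotnine fits one such line per species, not a single line through all of the points.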

What if we just want to compare the _distributions_ of, say, flipper length between the different species? Now the x-axis is a **categorical** variable instead of a quantitative one:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='species', y='flipper_length_mm', color='species')) + p9.geom_point(alpha=0.5))

But even with alpha the overplotting is too much!  Let's add some "jitter" which moves the x values around to make it more visible:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='species', y='flipper_length_mm', color='species')) + p9.geom_jitter(alpha=0.5))
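
The idea behind jitter is simple: each point is shifted by a small random amount so that identical positions no longer coincide. Here is a minimal sketch of that idea with numpy; the `width` value and the positions are made up for illustration, and this is not how plotnine implements it internally.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Categories are drawn at integer x positions: 1, 2, 3, ...
x_positions = np.array([1, 1, 1, 2, 2, 3], dtype=float)

# Hypothetical jitter half-width; each point moves by a random amount within it
width = 0.2
jittered = x_positions + rng.uniform(-width, width, size=x_positions.size)

print(jittered)
```

`geom_jitter` takes `width` and `height` arguments to control how far points may move in each direction; for example, `height=0` should leave the y values exactly where they were measured.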

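We can also try a sina plot with `geom_sina`, which spreads the points out according to how dense the data is at each value: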

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='species', y='flipper_length_mm', color='species')) + p9.geom_sina(alpha=0.5))


The color is a little extraneous here - we could use that color for something else - how about using it for the penguin sex?

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='species', y='flipper_length_mm', colour='sex')) + p9.geom_jitter())

This has more information content: we can still see the differences in distribution between the species, but we can also see that, within each species, the females tend to have different flipper lengths than the males.
We can also make the categorical axis the y-axis instead:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(y='species', x='flipper_length_mm', colour='sex'))
 + p9.geom_jitter() + p9.ylab("WHICH"))

Plotting the points is informative, but we may instead want a boxplot to better summarize the distribution.  A boxplot has a box spanning the 25th to the 75th percentile of the data (the interquartile range), with a line at the median and "whiskers" that extend to the most extreme points within 1.5 X the interquartile range of the box. Points beyond the whiskers are drawn individually as outliers. We can draw one with `geom_boxplot`:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='species', y='flipper_length_mm', color='species')) + p9.geom_boxplot())
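
To make the pieces of a boxplot concrete, the sketch below computes the quartiles and the 1.5 X interquartile-range limits by hand with pandas. The lengths are made-up values with one deliberately extreme point; they are not real penguin measurements.

```python
import pandas as pd

# Hypothetical flipper lengths (mm), with one unusually large value
lengths = pd.Series([180, 185, 190, 195, 200, 205, 210, 260])

# The box spans the 25th to the 75th percentile
q1 = lengths.quantile(0.25)
q3 = lengths.quantile(0.75)
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = lengths[(lengths >= lower_limit) & (lengths <= upper_limit)]
outliers = lengths[(lengths < lower_limit) | (lengths > upper_limit)]

print(q1, q3, inside.min(), inside.max(), list(outliers))
```

The whiskers would end at `inside.min()` and `inside.max()`, and anything in `outliers` would be drawn as a separate point.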

And what happens if we try to color code by sex here?

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='species', y='flipper_length_mm', color='sex')) + p9.geom_boxplot())

The boxes are automatically "dodged", or separated side by side.  Here it becomes clear that males and females differ in flipper length, and that the species differ from each other as well.   Sometimes we want a summary that still shows the shape of the distribution. For this we can use a "violin" plot with `geom_violin`:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='species', y='flipper_length_mm', color='sex')) + p9.geom_violin())

But what if we want to _layer_ plots, like the sina plot _on top_ of the violin? We can do it like this:

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='species', y='flipper_length_mm', color='sex')) +
 p9.geom_violin()+ p9.geom_sina())