# 01. Data Analysis with Jupyter and Python

By Adam Claridge-Chang, Joses Ho and Sangyu Xu

## Load Libraries

In [None]:

import piplite
await piplite.install('seaborn')

import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline

In [None]:
# plot settings
sns.set(style='ticks', font_scale=1.2)

In [None]:
exploration_times = pd.read_csv("../data/exploration_times.csv")

In [None]:
exploration_times

## Loading data

Let's load in an example dataset. We shall load the [iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

>The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist [Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) in 1936. It is sometimes called Anderson's Iris data set because [Edgar Anderson](https://en.wikipedia.org/wiki/Edgar_Anderson) collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".



<div align="center">
<img src="https://sangyusblog.files.wordpress.com/2022/10/gaspe.jpg?w=804" width="75%">
</div>

>The data set consists of 50 samples from each of three species of Iris (*Iris setosa*, *Iris virginica* and *Iris versicolor*). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

<div align="center">
<img src="https://ars.els-cdn.com/content/image/3-s2.0-B9780128147610000034-f03-01-9780128147610.jpg" alt="versicolor" style="width:70%">
<figcaption>An iris versicolor. (Photo by Danielle Langlois. Marking by Vijay Kotu and Bala Deshpande. Licensed under Creative Commons)</figcaption>
</div>
<div align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/a/a7/Irissetosa1.jpg" alt="setosa" style="width:70%">
<figcaption>An iris setosa. (Photo by Денис Анисимов. Public Domain)</figcaption>
</div>
<div align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/f/f8/Iris_virginica_2.jpg" alt="virginica" style="width:70%">
<figcaption>An iris virginica. (Photo by Eric Hunt. Licensed under Creative Commons)</figcaption>
</div>





In [None]:
# Read iris data from the sheet with pandas
iris = pd.read_csv("../data/iris.csv")

## If you're using Windows, you may need to give the full path to the file, for example:
# iris = pd.read_csv('C://Users//whho//Downloads//IrisData - iris.csv')

You have created a new object known as a pandas `DataFrame`, with the contents of the CSV. Think of it as a spreadsheet, but with a lot more useful features for data analysis.
It has several methods we can use to handle, analyse, and plot the data.
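
As a small illustration of what a `DataFrame` offers, the sketch below builds one by hand instead of reading a file. The three rows and the name `toy` are made up for this example and are not taken from the iris data.

```python
import pandas as pd

# Hypothetical three-row table, typed in by hand for illustration only.
toy = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica"],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
})

n_rows, n_cols = toy.shape           # (number of rows, number of columns)
column_names = list(toy.columns)     # column labels, in order
longest = toy["petal_length"].max()  # a single column supports summary methods

print(n_rows, n_cols)
print(column_names)
print(longest)
```

The same attributes and methods should work on the `iris` dataframe loaded above.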

We can peek at the data using the `.head()` method.

In [None]:
iris.head() # Gives us the first 5 rows of the dataframe.
# iris.head(10) # Gives us the first 10 rows of the dataframe.




Get a summary of the data.

In [None]:
iris.describe()

Let's see what is in the `species` column.

In [None]:
iris.species.unique()

## Plot the old-fashioned bar chart

In [None]:
iris.head()

In [None]:
ax1 = sns.barplot(data = iris, 
                  x = 'species', 
                  y = 'petal_width')

# Axes should always be labelled.
# ax1.set(xlabel='Species', ylabel='Mean petal width (cm)')

## Plot a swarmplot, which shows all the data

In [None]:
ax2 = sns.swarmplot(data = iris, 
                    x = 'species', 
                    y = 'petal_length',
                   hue = 'species')

# ax2.set(xlabel='Species', ylabel='Petal length (cm)')

## The split-apply-combine workflow

Most scientific experiments follow a very simple analysis workflow: *split-apply-combine*.

You split the data from an experiment into 2 or more groups, apply some summary function to each group, and then combine the results.

<div align="center">
<img src="../images/for_ipynb/split-apply-combine.jpg" width="75%">
</div>
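
To make the three steps concrete, here is a minimal sketch that performs them one at a time on a made-up table and then compares the result with a single `groupby` call. The values and the names `toy`, `groups`, `means`, `manual` and `automatic` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical measurements for two groups (not real iris values).
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "petal_length": [1.0, 3.0, 5.0, 7.0],
})

# Split: one sub-table per group.
groups = {name: sub for name, sub in toy.groupby("species")}

# Apply: summarise each sub-table separately.
means = {name: sub["petal_length"].mean() for name, sub in groups.items()}

# Combine: collect the per-group results into a single object.
manual = pd.Series(means, name="petal_length")

# pandas performs the same three steps in one chained call.
automatic = toy.groupby("species")["petal_length"].mean()

print(manual)
print(manual.to_dict() == automatic.to_dict())
```

The chained form is what the cells in this section use; the step-by-step form is only there to show what happens underneath.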


In [None]:
iris.head()

In [None]:
iris.groupby('species').mean()

In [None]:
iris.groupby('species').sem()
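
The `.sem()` method reports the standard error of the mean, that is, the sample standard deviation divided by the square root of the number of observations. The sketch below checks this by hand on a made-up series; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values for a single group (not real iris data).
values = pd.Series([4.0, 5.0, 6.0, 7.0, 8.0])

n = values.count()
sd = values.std(ddof=1)          # sample standard deviation
sem_by_hand = sd / np.sqrt(n)    # SEM = sd / sqrt(n)

print(sem_by_hand)
print(values.sem())              # pandas' built-in, which also defaults to ddof=1
```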

The plotting package `seaborn` does this automatically for you.

In [None]:
# `catplot` is short for "categorical plot", 
# where either the x-axis or y-axis consists of categories.

ax3 = sns.catplot(data=iris, 
            kind='bar',   # there are several types of plots.
            errorbar='sd',      # plot the error bars as ± standard deviations.
            col='species' # plot each species as its own column.
           )
ax3.set_axis_labels("", "Length (cm)")


You should quickly notice that the plot isn't as informative as we want it to be.

The current plot only allows us to investigate the relationships among the four metrics within each species.

Ideally, we want to directly compare metrics between species. 

To do so, we need to _reshape_ the data.

## The Long-form vs the Wide-form of your data

Our iris dataframe is in the wide-form (below right) and we want to turn it into the long-form. In the original iris dataframe, the data is organised by unit (one flower per row) and the columns contain a mixture of variables (sepal length, sepal width, etc.). In a long-form dataframe, each column is a variable, and each row is an observation. (Please read Hadley Wickham's [Tidy Data](https://vita.had.co.nz/papers/tidy-data.pdf) paper to learn more about the tidiness of datasets.)
<div align="center">
<img src="https://seaborn.pydata.org/_images/data_structure_19_0.png" width="75%">
<figcaption>Michael Waskom, Seaborn Tutorial 2022</figcaption>

</div>
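
The difference between the two layouts is easiest to see on a tiny table. The sketch below uses two made-up flowers (hypothetical values, with an `ID` column added by hand) and reshapes them with `pd.melt`, then pivots back to confirm that no information is lost.

```python
import pandas as pd

# Hypothetical wide-form table: one row per flower, one column per measurement.
wide = pd.DataFrame({
    "ID": [0, 1],
    "species": ["setosa", "virginica"],
    "petal_length": [1.4, 6.0],
    "petal_width": [0.2, 2.5],
})

# Long-form: one row per (flower, measurement) pair.
long = pd.melt(wide, id_vars=["ID", "species"], var_name="metric", value_name="cm")
print(long)

# Pivoting the long table recovers the wide layout.
back = long.pivot(index=["ID", "species"], columns="metric", values="cm").reset_index()
print(back)
```

Two flowers with two measurements each become four rows in the long table, with the former column names stored in `metric`.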


In [None]:
iris_tidy = pd.melt(iris.reset_index(), 
                    id_vars=['index','species'], 
                    var_name='metric', 
                    value_name='cm')
iris_tidy = iris_tidy.rename(columns = {'index': 'ID'})

In [None]:
iris

In [None]:
iris_tidy

In [None]:
ax4 = sns.catplot(data=iris_tidy, 
            x='metric', 
            y='cm', 
            hue='species',
            kind='bar', 
            errorbar='sd',
            aspect=1.5
           )

In [None]:
import matplotlib.pyplot as plt

In [None]:
# f, ax = plt.subplots(1, figsize=(3,3))

ax5 = sns.catplot(data=iris_tidy, 
            kind='swarm', 
            x='species', y='cm', hue='metric',
            size=4.5,
            aspect=1.5,
            palette=['red','grey','orange','pink'],
           )

## Scatterplot and linear regression line

Alongside the categorical plot, the scatter plot is a very useful visualization tool for biological experiments. Often we want to know how one variable is correlated with another; a scatterplot gives us a quick look.
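
A scatterplot shows the relationship visually; the same relationship can also be summarised with numbers. The sketch below computes a Pearson correlation coefficient and a least-squares straight line with NumPy on made-up paired values. The arrays `x` and `y` are hypothetical, and the fit is only meant to mirror the kind of simple linear regression line drawn on a scatterplot, not to reproduce seaborn's internals.

```python
import numpy as np

# Hypothetical paired measurements (not real iris values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

# Least-squares straight line: y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

print(r, slope, intercept)
```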

In [None]:
# Draw a scatterplot of petal length versus sepal width with a simple linear regression line
ax6 = sns.regplot(data=iris, 
                  ci=95,
                  x="sepal_width", 
                  y="petal_length")


# ax6.set(xlabel='Sepal width (cm)', ylabel='Petal length (cm)')


In [None]:

for s in iris.species.unique():
    ax7 = sns.regplot(data=iris.loc[iris.species == s], 
                  ci=95,
                  x="sepal_width", 
                  y="petal_length", label = s)
    ax7.legend()


## Seaborn allows you to do that more systematically with pairplot

This is like doing a scatter plot for each pair of the variables in one go. On the diagonal, distributions of values within each species group are plotted for each variable.  



In [None]:
fig = sns.pairplot(iris, hue="species")

## Dimension Reduction with Principal Component Analysis
We measured the irises along 4 dimensions. What if we want a more concise way of describing the data? We can try to find the 2 dimensions along which the data has the most variance.
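
To show what "the dimensions with the most variance" means, here is a minimal NumPy-only sketch of the idea: centre the data, take a singular value decomposition, and keep the first two directions. The random data, the seed and all variable names are assumptions for illustration; the signs of the components may differ from those returned by a library implementation.

```python
import numpy as np

# Hypothetical data: 100 samples, 4 features, made up with a fixed seed.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4)) * np.array([3.0, 1.0, 0.5, 0.1])

# Centre each feature, then take the singular value decomposition.
centred = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centred, full_matrices=False)

components = Vt[:2]                  # the two directions of greatest variance
projected = centred @ components.T   # coordinates of each sample on PC1 and PC2
explained = S**2 / np.sum(S**2)      # fraction of total variance per component

print(projected.shape)
print(explained[:2])
```

The first entry of `explained` is the share of total variance captured by PC1, the second the share captured by PC2.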

In [None]:
from sklearn import decomposition
from sklearn import datasets


iris_pca = datasets.load_iris()
X = iris_pca.data
y = iris_pca.target

pca = decomposition.PCA(n_components=2)
pca.fit(X)
X = pca.transform(X)


f, ax8 = plt.subplots(1, figsize = (5, 5))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=[iris_pca.target_names[i] for i in y], alpha = 0.7)
ax8.set_xlabel('PC1')
ax8.set_ylabel('PC2')




In [None]:
f, ax9 = plt.subplots(2, 2, figsize = (12, 11))
titles = ['sepal length', 'sepal width', 'petal length', 'petal width']

for i in range(0, 4):
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=iris_pca.data[:, i], alpha = 0.7, ax = ax9.flatten()[i])
    ax9.flatten()[i].set_xlabel('PC1')
    ax9.flatten()[i].set_ylabel('PC2')
    ax9.flatten()[i].set_xlim(-4, 8)
#     ax9.flatten()[i].set_title(titles[i])
# f.suptitle('Colored by Original Measurements')

## Towards Publication-Ready Plots
Try to meet as many of the final figure requirements as possible in code.

In [None]:
all_metrics = iris_tidy.metric.unique()

all_metrics


In [None]:
y_titles = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
letters = ['A', 'B', 'C', 'D']

In [None]:
import matplotlib.pyplot as plt

f, ax = plt.subplots(2, 2, figsize=(10, 10))

all_axes = ax.flatten()

for i, metric in enumerate(all_metrics):
    
    current_axes = all_axes[i]
    
    sns.swarmplot(data=iris, size = 3.5, 
                  x='species', y=metric, hue = 'species',
                  ax=current_axes)

    current_axes.set(ylabel=y_titles[i])

    
    current_axes.set_ylim(0, 10)
    # Remove the legend if seaborn drew one (some versions omit it here).
    legend = current_axes.get_legend()
    if legend is not None:
        legend.remove()
    current_axes.text(-1, 10.5, letters[i], fontsize = 25, fontweight = 'semibold')



In [None]:
f.savefig("myplot.svg")
f.savefig("myplot.png", dpi = 300)

