# Week 6: Plotting

Today we learn how to plot. Since python can't plot out of the box, we first need to learn how to import modules, that is, how to access code written by other people. Here is the table of contents:
1. **Modules**  
   A simple way to use other people's code.  
2. **Plotting**  
   Line plots, scatter plots, bar charts, pie charts.  
3. **Nucleotide make-up of an RNA sequence**  
4. **Microhabitat coverage**  
   Loading a data table and plotting the data.
5. **Tracking fish**  
   Loading a data table and manipulating its columns.
6. **Showing data collection sites on a map**  
7. **Tracking wolves**  
   Visualize an animal's motion on a map.
8. **Sponge recruitment**  
   

## 1. Modules


### 1.1. What is a module?

A module is a file containing Python code that defines functions and variables. It is basically code that has been saved in a file so that it can be imported and reused later. It is similar to the functions you wrote in previous weeks, except that if you save those functions in their own file, you have made a module.

To see how this works, let's make our own module. Let's start with the transcript function from last time, which transcribes a DNA sequence into an RNA sequence:

In [None]:
def transcript(dna):
    rna = dna.replace('T','U')
    return rna

print(transcript('ATG'))

In the folder this notebook is in, create an empty file called `my_dna_module.py`. Open it in a text editor and paste the code that defines the `transcript` function above. Then execute the code below.

In [None]:
# This essentially runs the code in "my_dna_module.py" and makes whatever is defined in it available to us.
# The name after "import" should be the name of the file without the ".py" extension.
import my_dna_module
# There's a catch though. To access something defined in "my_dna_module.py" we need to type
# the name of the module, then a period, then the name of the variable or function we want to access.
my_dna_module.transcript('ATG')

That's the basic idea of a module. What makes modules so powerful, though, is that:  

- There's a special folder on your computer where you can put module files to make them available to all your python projects. Say you have a file "my_dna_module.py" that contains dozens of useful functions to work with DNA. If you put it in that special folder, it will be available to any python program you write on your computer by simply typing "import my_dna_module".  

- There are tens of thousands of additional modules freely available from the internet. No matter what project you're working on, there's a module out there that already does most of what you need to do.
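
To get a feel for the first point, you can ask python which folders it searches when it runs an `import`. The sketch below only uses the standard `sys` module. The folders it prints depend on how python was installed on your computer, so don't expect your list to match anyone else's.

```python
import sys

# sys.path is the list of folders python searches, in order, when it imports a module.
for folder in sys.path:
    print(folder)
```

Folders whose name ends in `site-packages` are typically where installed modules live.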

### 1.2. Importing modules

Some modules get installed automatically when you install python. They are already on your computer, you just didn't know about it. For example, mathematical functions are defined in the `math` module. Python itself knows about addition, subtraction, multiplication, and division, but things like square roots, exponentials, logarithms, constants like $\pi$, etc. are in the `math` module.

In [None]:
# Basic python doesn't know about pi.
pi

In [None]:
# You can get pi by importing the math module and calling the pi defined therein.
import math
print(math.pi) # This prints the value of pi.
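
The other mathematical functions mentioned above are accessed the same way: the name of the module, a period, then the name of the function. Here is a short sketch; the numbers are arbitrary.

```python
import math

print(math.sqrt(16))      # Square root of 16.
print(math.exp(1))        # Exponential of 1, i.e. the constant e.
print(math.log(math.e))   # Natural logarithm of e.
print(math.log10(1000))   # Base-10 logarithm of 1000.
```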

In [None]:
# If you don't want to type "math" every time, you can give it a shorter name.
import math as m
print(m.pi) # This also prints the value of pi.

# Be careful not to use the short name for something else. If later on you define a variable
# called m, m will no longer refer to the math module.
m = 1
print(m) # This prints 1.
print(m.pi) # This throws an error.

In [None]:
# If you're planning on using math.pi a lot and even m.pi feels like too much of a hassle,
# you can import pi like this:
from math import pi
print(pi) # This prints the value of pi.

Don't abuse the last one though. Every variable or function you import this way becomes a name you can't use for your own variables and functions any more, lest you overwrite the module's definition of it.
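
Here is a minimal sketch of what that overwriting looks like in practice. Nothing in it is specific to `pi`; the same thing happens with any imported name.

```python
from math import pi
print(pi)       # The value from the math module.

# Reusing the name for something else silently replaces the imported value...
pi = 3
print(pi)       # Now prints 3.

# ...but the module's own definition is untouched and can still be reached.
import math
print(math.pi)
```

Python raises no error or warning when the name is reused, which is why this kind of mistake can be hard to spot.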

### 1.3. Installing modules

Some of the modules we'll need today do not come preinstalled with python. We need to install them ourselves. Fortunately there's an easy way to do that. The three modules we'll need today are called "matplotlib", "pandas", and "mplleaflet". Here is how to install them.

For mac users:  

1. Open the terminal.  
2. Type and execute "conda install -y matplotlib".  
3. Type and execute "conda install -y -c conda-forge pandas".
4. Type and execute "pip install mplleaflet".

For windows users:  

1. In the winpython folder, start "WinPython Command Prompt". A terminal opens.  
2. In the terminal, type and execute "pip install matplotlib pandas mplleaflet".

*Note:* Depending on how you installed python, you may already have some of those modules. One way to check is to try to import them (in a notebook, execute "import [name of the module]"). If you don't get an error then the module is already installed. Either way, you can run the commands above. If a module is already installed they'll notice and move on.
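
If you'd rather check all three modules at once without triggering an error, one option is the standard `importlib` module. This is just a convenience sketch, not something the rest of the notebook relies on. The helper name `is_installed` is made up for this example.

```python
import importlib.util

def is_installed(name):
    # find_spec returns None when python cannot find a module with that name.
    return importlib.util.find_spec(name) is not None

for name in ['matplotlib', 'pandas', 'mplleaflet']:
    if is_installed(name):
        print(name, 'is installed')
    else:
        print(name, 'is NOT installed')
```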

## 2. Plotting

#### Why plot with python?

Python is not the easiest way to make a plot. Spreadsheet software can make plots too: enter the data in the spreadsheet, click on some kind of plot button, and out pops a plot. Then you can use a combination of menus and drag-and-drop to refine the design.

The main benefit of using python (or another programming language) to make plots is that python can also perform advanced data analysis. If your workflow involves frequent back and forth between the data analysis and the plotting, having a single tool that does both makes a huge difference.


#### The matplotlib module

Basic python has no plotting functionality, but there are plenty of python modules that do. The one we'll use is called matplotlib. It's one of the modules we just installed. Note that you only need to install it once; like any regular software, it will stay on your computer until you uninstall it.

Here are the basic steps to create a plot with matplotlib (once you've installed the module):  

1. Import the module (only once per notebook).

2. Define the data you want to plot.

3. Call one of matplotlib's plotting functions.

4. Display the plot on the screen.

There's a subtlety to step 1: the functions we need are not directly in the matplotlib module, they're in a submodule of matplotlib called pyplot (in other words, pyplot is a module stored inside the matplotlib package). We will import it with `import matplotlib.pyplot as plt`. After that, a function called `plot` defined in pyplot can be called as `plt.plot`.

### 2.1. Line plot

The purpose of line plots is usually to visualize changes over time.

In [None]:
# 1. Import the plotting module.
import matplotlib.pyplot as plt

# 2. Import or copy your data. In this case we have a list of x coordinates and y coordinates.
x = [1,2,3,4]
y = [4,7,3,5]

# 3. Use the correct function to plot. In this case to make a line plot, call plt.plot(x,y).
# x corresponds to the x coordinates and y to the y coordinates above.
plt.plot(x,y)

# 4. At the end, call plt.show() to display the plot on the screen.
# Similar to how the jupyter notebook always prints the return value of the 
# last command in your cell, it also displays the plot even if you forget 
# plt.show(). It's a good habit to use it anyway though.
plt.show()

Now we will learn to plot multiple lines in the same figure.

In [None]:
# Add other data you would like to plot.
x  = [1,2,3,4]
y1 = [4,7,3,5]
y2 = [3,5,4,4]

# To plot multiple curves, call plt.plot multiple times.
plt.plot(x,y1)
plt.plot(x,y2)

# There are other pyplot commands to add a title, axis labels,
# control the range of the axes, etc.
plt.title('Two curves') # Add a title at the top.
plt.xlabel('x') # Add a label below the x axis.
plt.ylabel('y') # Add a label left of the y axis.
plt.xlim(0,5) # Make the x axis go from 0 to 5.
plt.ylim(0,None) # Make the y axis start at 0; let pyplot choose the upper limit automatically.

# plt.show() always comes at the end. 
plt.show()
# Any plt command after this will start a new plot.

In [None]:
# To add a legend, first we need to give a name to each curve using
# the option "label" of "plt.plot":
x  = [1,2,3,4]
y1 = [4,7,3,5]
y2 = [3,5,4,4]
plt.plot(x,y1,label='First curve')
plt.plot(x,y2,label='Second curve')
# Then we call plt.legend(), which uses the labels defined above.
plt.legend()
plt.show()
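
When there are more than a couple of curves, writing one `plt.plot` line per curve gets tedious. One way around it, sketched below with made-up numbers, is to store the curves in a list and call `plt.plot` inside a loop. The lists `labels` and `all_y` and their contents are placeholders for illustration.

```python
import matplotlib.pyplot as plt

x = [1,2,3,4]
# Made-up data: one label and one list of y coordinates per curve.
labels = ['First curve','Second curve','Third curve']
all_y  = [[4,7,3,5],[3,5,4,4],[1,2,2,6]]

# One call to plt.plot per curve, all on the same figure.
for i in range(len(all_y)):
    plt.plot(x,all_y[i],label=labels[i])

plt.legend()
plt.show()
```

Adding a fourth curve then only requires adding one label and one list of values; the plotting code stays the same.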

### 2.2. Scatter plot
Another common plot is the scatter plot. The purpose of a scatter plot is to show the relationship between two variables. Coding this plot is very similar to the line plot.

In [None]:
# Same data as before.
x = [1,2,3,4]
y = [4,7,3,5]

# Use the correct function to plot. In this case we use plt.scatter.
plt.scatter(x,y)
plt.show()

# Note that the scatter plot appears in a new figure rather than on top of the
# earlier line plots: each call to plt.show() ends the current plot.

### 2.3. Bar plot
The purpose of the bar plot is to visualize differences or associations between groups. In terms of coding, it is very similar to the line and scatter plots.

In [None]:
# A bar chart also needs two lists. The first list has the name of each bar ("x axis").
# The second list has the height of each bar ("y axis").
plt.bar(['A','Z','D','E'],y)
plt.show()

### 2.4. Pie plot
The purpose of the pie plot is to visualize compositional data. The full circle (360$^\circ$) represents the total, and each slice shows the proportion of one item relative to that total. Pie charts only require one list of values. The values don't have to add up to 100; pyplot will normalize them automatically.

In [None]:
# Data
data = [13,25,275]

# Pass the list to plt.pie (you could also write the list directly inside the parentheses):
plt.pie(data)
plt.show()

# You can give a list of labels, one for each sector.
plt.pie(data,labels=['a','b','c'])
plt.show()

# It often looks better to turn off the default labels
# and make a legend instead.
plt.pie(data,labels=['a','b','c'],labeldistance=None)
# Here we specify the location of the legend by giving its x and y
# coordinates where (0,0) is the bottom left corner of the chart and 
# (1,1) is its top right corner.
plt.legend(loc=(1,0.5))
plt.show()
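
The normalization pyplot performs can be reproduced by hand, which is a handy sanity check when reading a pie chart. The sketch below uses the same three values as above; the other variable names are only for illustration.

```python
data = [13,25,275]
total = sum(data)

fractions = []
for value in data:
    fraction = value/total      # Share of the total, between 0 and 1.
    angle = 360*fraction        # Corresponding slice angle in degrees.
    fractions.append(fraction)
    print(value, round(100*fraction,1), round(angle,1))
```

Each row shows a value, its percentage of the total, and the angle of its slice.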


### 2.5. More info on matplotlib

Matplotlib has a very detailed online documentation. Here is the documentation page for the `plot` function: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html. It describes every possible argument you can give the function to tweak what it plots and how it plots it. There are similar pages for `scatter`, `bar`, and `pie`. Generally speaking, if you type "pyplot [name of a pyplot function]" into a search engine the first result will usually be that function's documentation page.

There are many more useful resources on the matplotlib website. You can learn to customize your plots at https://matplotlib.org/users/customizing.html. You can check out matplotlib's gallery at https://matplotlib.org/gallery.html to see examples of plots people made with it.

Also remember that a lot of people use matplotlib, including a lot of beginners like yourselves. As a result the internet is full of discussion about how to use it. If you're stuck you can always type "how to ... with matplotlib" into a search engine.

## 3. Nucleotide make-up of an RNA sequence

Now that you have the basic tools to start plotting, it's time to use them. In this first task we make a bar chart showing how much of each nucleotide is present in an RNA sequence.

<div class="alert alert-block alert-danger">
<b>Task 1:</b>
Load the RNA sequence in `hbb_rna.txt`. Compute the A count, the U count, the G count, and the C count. Create a bar chart showing the amount of each. Add axis labels and a title to make your graph self-explanatory (one should be able to look at the graph in isolation and understand what it shows).
</div>
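
As a reminder of one tool that may help, strings have a `count` method that returns how many times a given character appears. The sequence below is made up for illustration; it is not the content of `hbb_rna.txt`.

```python
rna = 'AUGGUGCAUCUGACU'  # Made-up sequence, for illustration only.

nucleotides = ['A','U','G','C']
counts = []
for n in nucleotides:
    counts.append(rna.count(n))

print(nucleotides)
print(counts)
```

The two lists have the shape `plt.bar` expects: names first, heights second.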

## 4. Microhabitat coverage

See Andia's slides about the study the data is from.

The data is in the excel spreadsheet `microhabitat1.xlsx`. To load it we use the function `read_excel` from the module `pandas`. This produces a data table object known as a **dataframe**. If you've used the R language before, pandas dataframes have much in common with R's dataframes.

In [None]:
# To load data we need a new module: pandas.
import pandas as pd
# We're also importing an alternative to "print". Specifically, this is what the 
# jupyter notebook uses when it automatically prints the outcome of the last line
# of every code cell. We're doing this because it renders dataframes much better
# than the regular "print" function.
from IPython.display import display

# By default read_excel reads the name of each column in the first row (row index 0).
# In this case they're in the third row (row index 2), so we need to tell it with the 
# "header" option.
data = pd.read_excel('microhabitat1.xlsx',header=2)
display(data)

In [None]:
# We can extract a single column of the table by calling it by name:
print(data['Coral'])
# data['Coral'] is not technically a list, but it behaves very much like one.
# Among other things, we can use it in pyplot's plot functions.

# Note that the name of the column needs to be an exact match (including 
# capitalization and spaces) or you'll get an error.

In [None]:
plt.plot(data['Year'],data['Coral'])
plt.xlabel('Year') # Add a label to the x axis.
plt.ylabel('% coverage') # Add a label to the y axis.
plt.title('Percentage of the microhabitat covered in coral.') # Add a title above the plot.
plt.show() # Display the plot on the screen.

In [None]:
# You can get the list of all columns like this:
print(data.columns)
# Again it's not quite a list, but it behaves very much like one.
# Among other things you can slice it and loop over it.
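
Here is a small sketch of slicing and looping over column names. To keep it self-contained it builds a tiny dataframe by hand; the name `table`, the column names, and the numbers are invented and have nothing to do with the microhabitat data.

```python
import pandas as pd

# A made-up table with one 'Year' column and two measurement columns.
table = pd.DataFrame({
    'Year':  [2001,2002,2003],
    'Algae': [10,20,30],
    'Sand':  [5,5,8],
})

# Slicing the list of columns: skip the first column, keep the rest.
other_columns = table.columns[1:]

# Looping over the remaining column names.
for name in other_columns:
    print(name, 'max:', table[name].max())
```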

<div class="alert alert-block alert-danger">
<b>Task 2:</b>
Plot every column from the data above as a function of the year on the same plot. That is, every column except the year; don't plot the year as a function of the year...
</div>

## 5. Tracking fish

The video `fish.mp4` is from an ongoing research project on collective behavior in *Astyanax mexicanus* fish. It shows two fish swimming in a shallow circular tank about 1 meter in diameter. It is shot from above the tank. It was taken on campus in the fish trilab (Alex Keene, Erik Duboue, Johanna Kowalko). The position of each fish in each frame was then extracted by a python program written by postdoc Adam Patch using the opencv computer vision module. The result is the file `fish.csv`, which contains one row per frame of the video and 5 columns: the time, the pixel coordinates of the center of the first fish (pixel column number x1 and pixel row number y1), and the pixel coordinates of the second fish (x2 and y2).

In [None]:
# The positions are in a csv file so we use pandas' read_csv. The names of the 
# columns are read from the first row (no need for the header option here; unless 
# specified otherwise the column names are assumed to be in the first row).
data = pd.read_csv('fish.csv')
display(data)

We could again extract the columns with `data[name of the column]`. Instead, let's talk about the other, more general way to extract part of a dataframe: by index. Whereas elements of a list can be accessed with a single index, elements of a table need two: a row index and a column index. Like list indices, dataframe indices start at 0. Also like list indices, they can be replaced by slices.  

The syntax is as follows: the name of the dataframe, then `.iloc[]`. Between the square brackets go two indices or slices separated by a comma: the first for rows, the second for columns.

In [None]:
# Retrieve the element at row index 0, column index 2 (first row, third column):
print(data.iloc[0,2])

# Every other row starting at 1 and stopping right before 6, columns 2 to right before 5:
display(data.iloc[1:6:2,2:5])

# The entire y1 column (start at row 0, end at the last row, column index 2):
display(data.iloc[:,2])

# Note that the three objects we just produced are of three different types.
# The first is just a number. The second is a dataframe just like "data", only
# smaller. The third is a column object. That's why they get printed differently.

Let's use this new `iloc` method to plot the trajectories of the fish.

In [None]:
# The x and y coordinates of the first fish are in the "x1" and "y1" 
# columns, respectively.
plt.plot(data.iloc[:,1],data.iloc[:,2])
# For the second fish we need the "x2" and "y2" columns.
plt.plot(data.iloc[:,3],data.iloc[:,4])
# This makes sure 1 unit along the x axis has the same on-screen size as 1 unit along
# the y axis so the trajectories are not deformed.
plt.axis('equal')
plt.show()

If you compare the trajectories we just plotted with the video, you'll notice something is off. What's happening is that the coordinates of the fish are given as a pixel column number (x) and a pixel row number (y). As per the standard image analysis convention, rows are counted from the top and columns are counted from the left. As a result the y axis is "flipped" and the trajectories are upside-down.

Fortunately, it's very easy to perform basic arithmetic operations on dataframes and columns.

In [None]:
# Let's demonstrate on the first five rows of x1.
x1 = data.iloc[:5,1]
print(x1)

# Applying basic arithmetic operations to a column object simply applies the operation to
# every value in that column. -x1 flips the sign of every value in x1. 2*x1 doubles every 
# value in x1. Etc.
print('\nMinus x1:')
print(-x1)
print('\nTwice x1:')
print(2*x1)
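
Arithmetic also works between two columns: the operation is applied row by row. As an illustration, the sketch below computes the distance between two fish in each frame. The table `fish` is made up so that the example runs on its own; with the real data you would use the columns of `data` instead.

```python
import pandas as pd

# Made-up pixel coordinates for two fish over three frames.
fish = pd.DataFrame({
    'x1': [0.0,3.0,6.0], 'y1': [0.0,4.0,8.0],
    'x2': [3.0,3.0,6.0], 'y2': [4.0,4.0,0.0],
})

# Differences between the two fish, computed row by row.
dx = fish['x1'] - fish['x2']
dy = fish['y1'] - fish['y2']

# Pythagoras: the distance is the square root of dx squared plus dy squared.
distance = (dx**2 + dy**2)**0.5
print(distance)
```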

<div class="alert alert-block alert-danger">
<b>Task 3:</b>
Replot the fish trajectories upside-up. Add axis labels, a legend, and a title to make the plot self-explanatory.
</div>

The trajectories should now match what you see in the video.

## 6. Showing data collection sites on a map

The mplleaflet module makes it very easy to plot things on a map. To make a scatter plot on a map, just make a regular scatter plot with `plt.scatter` using the longitudes as x coordinates and the latitudes as y coordinates. At the end, instead of `plt.show()`, call `mplleaflet.show()`. That will create a file called `_map.html` and open it in your web browser like a regular webpage.

In [None]:
# We already installed the mplleaflet module in section 1.3.
# Now we need to import it to make it available in this notebook.
import mplleaflet

In [None]:
# Now we load the data. It's in an excel spreadsheet so we use the read_excel function 
# from the pandas module. The names of the columns are read from the first row.
data = pd.read_excel('oil_sample_locations.xlsx')
display(data)

In [None]:
# Now we plot. First we make a regular scatter plot with pyplot using the longitude column
# as x coordinates and the latitude column as y coordinates, then we call mplleaflet.show().
# This creates the file _map.html which is then opened in your browser like a regular webpage.
# Note that you need internet access to see the map. The _map.html file contains your points, 
# but it doesn't contain the map per se, only a link to an online map. 
plt.scatter(data['LONGITUDE'],data['LATITUDE'],color='r',s=100)
mplleaflet.show()

In [None]:
# You can change the map's style with the "tiles" option.
# The default style is 'osm' (open street map).
# To get topographic lines use tiles='esri_worldtopo'.
# To get aerial views use tiles='esri_aerial'.
plt.scatter(data['LONGITUDE'],data['LATITUDE'],color='r',s=100)
mplleaflet.show(tiles='esri_aerial')

In [None]:
# Here is the full list of available styles:
print(list(mplleaflet.maptiles.tiles.keys()))

## 7. Tracking wolves

This is from an ongoing research project at the University of Minnesota Twin Cities. It involves placing GPS collars on wolves living in Voyageurs National Park. Check out the project's website at https://www.voyageurswolfproject.org; there's much more to it.

The Voyageurs Wolf Project generously agreed to share the data with us, but they ask that you **<font color='red'>do not share the data</font>** outside of this class, as it could be used to harm the wolves.

The data files are located in the `wolf_data` folder. Each file corresponds to a different wolf. They're all csv files with the column names in the first row.

In [None]:
# Here is the wolf "V028", whose location was recorded every 20 minutes for 6 months.
# Note the name of the file is now a path. "wolf_data/V028.csv" means the file "V028.csv"
# located in the folder "wolf_data", itself located in the same folder as this notebook.
data = pd.read_csv('wolf_data/V028.csv')
display(data)

In [None]:
# Let's plot it on a map.
plt.plot(data['Longitude'],data['Latitude'])
mplleaflet.show()

Next we'd like to show every wolf's trajectory on the same map. The first step is to make a list of all the wolves' data files. You can do that by hand or use the function `listdir` from the `os` module, which takes the name of a directory and returns a list of all the files in it:

In [None]:
import os
print(os.listdir('wolf_data'))
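
Note that `listdir` returns bare file names, without the folder. To load a file with `read_csv` you need the full path, which `os.path.join` can assemble. The sketch below shows one way to do it. The helper name `csv_paths` is hypothetical, and apart from `V028.csv` the file names in the example are invented.

```python
import os

def csv_paths(folder, filenames):
    # Keep only the csv files and prepend the folder name to each of them.
    paths = []
    for name in sorted(filenames):
        if name.endswith('.csv'):
            paths.append(os.path.join(folder, name))
    return paths

# Invented list of file names, standing in for what os.listdir would return.
example_names = ['V028.csv','notes.txt','V001.csv']
print(csv_paths('wolf_data', example_names))
```

With the real folder you would call `csv_paths('wolf_data', os.listdir('wolf_data'))`.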

<div class="alert alert-block alert-danger">
<b>Task 4:</b>
Write a loop over the wolf data files. For each file, load the data and plot the wolf's trajectory. Once the loop is over, use mplleaflet to render the whole thing on a map.
</div>

## 8. Sponge recruitment

See Andia's slides about sponge recruitment.

This one doesn't introduce any new concepts; it's just additional practice with the things we talked about throughout the notebook.

<div class="alert alert-block alert-danger">
<b>Task 5:</b>
Load the data in `sponge_recruits.csv`. Use it to make a pie chart showing the proportion of each species' recruits.
</div>