<div style="border: 2px solid #8A9AD0; margin: 1em 0.2em; padding: 0.5em;">

# Plotting in Python

by [Maria Christina Maniou](https://training.galaxyproject.org/hall-of-fame/mcmaniou/), [Fotis E. Psomopoulos](https://training.galaxyproject.org/hall-of-fame/fpsom/), [The Carpentries](https://training.galaxyproject.org/hall-of-fame/carpentries/), [Erasmus+ Programme](https://training.galaxyproject.org/hall-of-fame/erasmusplus/)

CC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)

**Questions**

- How can I create plots using Python in Galaxy?

**Objectives**

- Use the scientific library matplotlib to explore tabular datasets

**Time Estimation: 1H**
</div>


<p>In this lesson, we will be using Python 3 with some of its most popular scientific libraries. This tutorial assumes that the reader is familiar with the fundamentals of data analysis using the Python programming language, as well as with how to run Python programs using Galaxy. Otherwise, it is advised to follow the “Introduction to Python” and “Advanced Python” tutorials available on the same platform. We will be using Jupyter Notebook, an interactive Python environment that comes with everything we need for the lesson.</p>
<blockquote class="comment" style="border: 2px solid #ffecc1; margin: 1em 0.2em">
<h3 id="-icon-comment--comment">💬 Comment</h3>
<p>This tutorial is <strong>significantly</strong> based on <a href="https://carpentries.org">the Carpentries</a> <a href="https://swcarpentry.github.io/python-novice-inflammation/">Programming with Python</a> and <a href="https://swcarpentry.github.io/python-novice-gapminder/">Plotting and Programming in Python</a>, which are licensed CC-BY 4.0.</p>
<p>Adaptations have been made to make this work better in a GTN/Galaxy environment.</p>
</blockquote>
<blockquote class="agenda" style="border: 2px solid #86D486;display: none; margin: 1em 0.2em">
<h3 id="agenda">Agenda</h3>
<p>In this tutorial, we will cover:</p>
<ol id="markdown-toc">
<li><a href="#plot-data-using-matplotlib" id="markdown-toc-plot-data-using-matplotlib">Plot data using matplotlib</a></li>
</ol>
</blockquote>
<h1 id="plot-data-using-matplotlib">Plot data using matplotlib</h1>
<p>For the purposes of this tutorial, we will use a file with the annotated differentially expressed genes that was produced in the <a href="https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html">Reference-based RNA-Seq data analysis</a> tutorial.</p>
<p>Firstly, we read the file with the data.</p>


In [None]:
import pandas as pd
data = pd.read_csv("https://zenodo.org/record/3477564/files/annotatedDEgenes.tabular", sep = "\t", index_col = 'GeneID')
print(data)

<p>We can now use the <code>DataFrame.info()</code> method to find out more about a dataframe.</p>


In [None]:
data.info()

<p>We learn that this is a DataFrame. It consists of 130 rows and 12 columns. None of the columns contains any missing values. 6 columns contain 64-bit floating point <code>float64</code> values, 2 contain 64-bit integer <code>int64</code> values and 4 contain character <code>object</code> values. It uses 13.2KB of memory.</p>
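<p>The same facts that <code>info()</code> summarizes can also be queried one at a time. The following is a minimal sketch on a made-up three-row table (the name <code>toy</code> and all its values are assumptions for illustration, not part of the tutorial dataset):</p>

```python
import pandas as pd

# Hypothetical miniature table; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "GeneID": ["g1", "g2", "g3"],
        "Base mean": [1086.97, 6409.58, 65114.84],
        "Start": [100, 200, 300],
        "Gene name": ["geneA", "geneB", "geneC"],
    }
).set_index("GeneID")

# Data type of every column
print(toy.dtypes)

# Number of missing values per column
missing = toy.isna().sum()
print(missing)

# Number of rows and columns, as a (rows, columns) tuple
print(toy.shape)
```

<p>The same three calls should work unchanged on the <code>data</code> dataframe loaded above.</p>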
<p>We now have a basic understanding of the dataset and we can move on to creating a few plots and further explore the data. <code>matplotlib</code> is the most widely used scientific plotting library in Python, especially the <code>matplotlib.pyplot</code> module.</p>


In [None]:
import matplotlib.pyplot as plt

<p>Simple plots are then (fairly) simple to create. You can use the <code>plot()</code> function and specify the data to be displayed on the x and y axes by passing them as the first and second arguments. In the following example, we select a subset of the dataset and plot the P-value of each gene, using a line plot.</p>


In [None]:
subset = data.iloc[121:, :]

x = subset['P-value']
y = subset['Gene name']

plt.plot(x, y)
plt.xlabel('P-value')
plt.ylabel('Gene name')

<p><img src="https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-plotting/../../images/python-plotting/Figure9_Lineplot.png" alt="A line chart is shown with a y axis of gene name and an x axis of p-value. The specific content of the graph is not important other than that it is produced with the expected x and y labels." /></p>
<p>We use Jupyter Notebook and so running the cell generates the figure directly below the code. The figure is also included in the Notebook document for future viewing. However, other Python environments like an interactive Python session started from a terminal or a Python script executed via the command line require an additional command to display the figure.</p>
<p>Instruct matplotlib to show a figure:</p>


In [None]:
plt.show()

<p>This command can also be used within a Notebook - for instance, to display multiple figures if several are created by a single cell.</p>
<p>If you want to save and download the image to your local machine, you can use the <code>plt.savefig()</code> command with the name of the file (png, pdf etc) as the argument. The file is saved in the Jupyter Notebook session and then you can download it. For example:</p>


In [None]:
plt.tight_layout()
plt.savefig('foo.png')

<p><code class="language-plaintext highlighter-rouge">plt.tight_layout()</code> is used to make sure that no part of the image is cut off during saving.</p>
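<p>As a rough sketch of how <code>savefig()</code> picks the output format from the file name, the example below writes the same figure as PNG and as PDF. The plotted values, the file names and the temporary output directory are assumptions made for illustration; <code>dpi</code> is an optional <code>savefig()</code> parameter that controls the resolution of raster formats such as PNG.</p>

```python
import os
import tempfile

import matplotlib
# Assumption: a non-interactive backend so the sketch also runs as a plain
# script without a display; omit this line inside a notebook.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Made-up values, only to have something to draw
plt.plot([1, 2, 3], [0.05, 0.01, 0.2])
plt.xlabel("Sample")
plt.ylabel("P-value")
plt.tight_layout()

# Hypothetical output location
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "example.png")
pdf_path = os.path.join(out_dir, "example.pdf")

# The file extension selects the output format
plt.savefig(png_path, dpi=150)
plt.savefig(pdf_path)
plt.close()
```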
<p>When using dataframes, data is often generated and plotted to screen in one line, and <code>plt.savefig()</code> may not seem to be a possible approach. One way to save the figure to a file is then to store a reference to the current figure in a local variable (with <code>plt.gcf()</code>) and call the <code>savefig()</code> method on that variable. For example, the previous plot:</p>


In [None]:
subset = data.iloc[121:, :]

x = subset['P-value']
y = subset['Gene name']

fig = plt.gcf()
plt.plot(x, y)
fig.savefig('my_figure.png')

<h2 id="more-about-plots">More about plots</h2>
<p>You can also call the <code>plot()</code> method directly on a dataframe. To draw several lines in the same plot, select or specify more than one column. For example:</p>


In [None]:
new_subset = data.iloc[0:10, :]
new_subset.loc[:, ['P-value', 'P-adj']].plot()
plt.xticks(range(0,len(new_subset.index)), new_subset['Gene name'], rotation=60)
plt.xlabel('Gene name')

<p><img src="https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-plotting/../../images/python-plotting/Figure10_Multiple_lines_plot.png" alt="The graph is similar to the last image, now the x axis is the gene name, and the y axis runs from 0 to 7 with an annotation of 1e-65 above. Now there are two lines, one in red labelled p-adj, and one in blue labelled p-value." /></p>
<p>In this example, we select a new subset of the dataset, but plot only the two columns <code>P-value</code> and <code>P-adj</code>. Then we use the <code>plt.xticks()</code> function to change the labels and the rotation of the ticks on the x axis.</p>
<p>Another useful plot type is the bar plot. In the following example we plot the number of genes located on each chromosome in the dataset.</p>
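<p>The columns can also be named inside the call to <code>plot()</code> through its <code>x</code> and <code>y</code> parameters, with <code>rot</code> rotating the tick labels. This is only a sketch on a made-up table (<code>toy</code> and its values are assumptions), not the tutorial dataset:</p>

```python
import matplotlib
# Assumption: non-interactive backend for running outside a notebook;
# omit this line inside a notebook.
matplotlib.use("Agg")
import pandas as pd

# Hypothetical miniature table; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Gene name": ["geneA", "geneB", "geneC", "geneD"],
        "P-value": [0.001, 0.02, 0.03, 0.004],
        "P-adj": [0.004, 0.04, 0.05, 0.010],
    }
)

# One line per column listed in y; the x column provides the tick labels
ax = toy.plot(x="Gene name", y=["P-value", "P-adj"], rot=60)
ax.set_ylabel("Value")

# The legend entries are taken from the column names
handles, labels = ax.get_legend_handles_labels()
print(labels)
```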


In [None]:
bar_data = data.groupby('Chromosome').size()
bar_data.plot(kind='bar')
plt.xticks(rotation=60)
plt.ylabel('N')

<p><img src="https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-plotting/../../images/python-plotting/Figure11_Barplot.png" alt="It is now a bar plot with chromosome as an X axis and some values of N for the y axis for a few chromosomes." /></p>
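<p>To see what the bar plot is drawn from, it can help to inspect the result of <code>groupby().size()</code> on a small table first. The sketch below uses made-up gene and chromosome names (all assumptions for illustration):</p>

```python
import matplotlib
# Assumption: non-interactive backend for running outside a notebook;
# omit this line inside a notebook.
matplotlib.use("Agg")
import pandas as pd

# Hypothetical miniature table; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Gene name": ["geneA", "geneB", "geneC", "geneD", "geneE"],
        "Chromosome": ["chr2L", "chr2L", "chr3R", "chrX", "chr3R"],
    }
)

# size() counts the rows in each group and returns a Series
# indexed by the group key (here: the chromosome name)
bar_data = toy.groupby("Chromosome").size()
print(bar_data)

# One bar per entry of the Series
ax = bar_data.plot(kind="bar")
ax.set_ylabel("N")
```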
<p><code>matplotlib</code> also supports plot styles that mimic other popular plotting libraries such as ggplot and seaborn. For example, here is the previous plot in ggplot style.</p>


In [None]:
plt.style.use('ggplot')
bar_data = data.groupby('Chromosome').size()
bar_data.plot(kind='bar')
plt.xticks(rotation=60)
plt.ylabel('N')

<p><img src="https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-plotting/../../images/python-plotting/Figure12_ggplot_Barplot.png" alt="The same graph as the previous, but with a different aesthetic, the background is now light grey instead of white and the bars are red instead of blue to be a bit more like ggplot2 outputs." /></p>
<p>You can also customize the plot by changing its parameters, for example the fill color and the edge color of the bars.</p>


In [None]:
plt.style.use('default')
bar_data = data.groupby('Chromosome').size()
bar_data.plot(kind='bar', color = 'red', edgecolor = 'black')
plt.xticks(rotation=60)
plt.ylabel('N')

<p><img src="https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-plotting/../../images/python-plotting/Figure13_Barplot_2.png" alt="The same graph again, but now the bars are red with a black border." /></p>
<p>Another useful type of plot is a scatter plot. In the following example we plot the Base mean of a subset of genes.</p>


In [None]:
scatter_data = data[['Base mean', 'Gene name']].head(n = 15)

plt.scatter(scatter_data['Gene name'], scatter_data['Base mean'])
plt.xticks(rotation = 60)
plt.ylabel('Base mean')
plt.xlabel('Gene name')

<p><img src="https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-plotting/../../images/python-plotting/Figure14_Scatterplot.png" alt="A scatterplot is shown comparing gene name to base mean." /></p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<h3 id="-icon-question--question-plotting">❓ Question: Plotting</h3>
<p>Using the same dataset, create a scatterplot of the average P-value for every chromosome for the “+” and the “-” strand.</p>
<blockquote class="solution" style="border: 2px solid #B8C3EA;color: white; margin: 1em 0.2em">
<div style="color: #555; font-size: 95%;">Hint: Select the text with your mouse to see the answer</div><h3 id="-icon-solution--solution">👁 Solution</h3>
<p>First find the data and save it in a new dataframe. Then create the scatterplot. You can even go one step further and assign different colors to the different strands. Note the use of the <code>map</code> method, which assigns the colors using a dictionary as input.</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>exercise_data = data.groupby(['Chromosome', 'Strand']).agg(mean_pvalue = ('P-value', 'mean')).reset_index()

colors = {'+':'red', '-':'blue'}
plt.scatter(x = exercise_data['Chromosome'], y = exercise_data['mean_pvalue'], c = exercise_data['Strand'].map(colors))
plt.ylabel('Average P-value')
plt.xlabel('Chromosome')
</code></pre></div>    </div>
<p><img src="https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-plotting/../../images/python-plotting/Figure15_Exercise_plot.png" alt="Another scatterplot showing chromosome vs average p-value, but every column has both a blue and red point, presumably showing the values for different strands." /></p>
</blockquote>
</blockquote>
<h2 id="making-your-plots-accessible">Making your plots accessible</h2>
<p>Whenever you are generating plots to go into a paper or a presentation, there are a few things you can do to make sure that everyone can understand your plots.</p>
<p>Always make sure your text is large enough to read. Use the <code>fontsize</code> parameter in <code>xlabel</code>, <code>ylabel</code>, <code>title</code>, and <code>legend</code>, and <code>tick_params</code> with <code>labelsize</code> to increase the text size of the numbers on your axes.
Similarly, you should make your graph elements easy to see. Use <code>s</code> to increase the size of your scatterplot markers and <code>linewidth</code> to increase the width of your plot lines.
Using <code>color</code> (and nothing else) to distinguish between different plot elements will make your plots unreadable to anyone who is colorblind, or who happens to have a black-and-white office printer. For lines, the <code>linestyle</code> parameter lets you use different types of lines. For scatterplots, <code>marker</code> lets you change the shape of your points.</p>
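<p>The parameters mentioned above can be combined as in the following sketch. The data, the labels and the specific sizes are assumptions chosen for illustration; suitable values depend on where the figure will be shown.</p>

```python
import matplotlib
# Assumption: non-interactive backend for running outside a notebook;
# omit this line inside a notebook.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Made-up values, only to have something to draw
x = [1, 2, 3, 4]

# Lines are told apart by line style, not only by color
line_a, = plt.plot(x, [1, 2, 3, 4], linestyle="-", linewidth=3, label="Series A")
line_b, = plt.plot(x, [4, 3, 2, 1], linestyle="--", linewidth=3, label="Series B")

# Scatter points use a distinct marker shape and a larger marker size
plt.scatter(x, [2.5, 2.5, 2.5, 2.5], s=120, marker="^", label="Series C")

# Larger text for the axis labels, title, tick labels and legend
plt.xlabel("x value", fontsize=16)
plt.ylabel("y value", fontsize=16)
plt.title("Readable plot", fontsize=18)
plt.tick_params(labelsize=14)
plt.legend(fontsize=14)
plt.tight_layout()
```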


# Key Points

- Python has many libraries offering a variety of capabilities, which makes it popular with beginners as well as more experienced users
- You can use scientific libraries like Matplotlib to perform exploratory data analysis.

# Congratulations on successfully completing this tutorial!

Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-plotting/tutorial.html#feedback) and check there for further resources!
