<div style="border: 2px solid #8A9AD0; margin: 1em 0.2em; padding: 0.5em;">

# Advanced Python

by [Maria Christina Maniou](https://training.galaxyproject.org/hall-of-fame/mcmaniou/), [Fotis E. Psomopoulos](https://training.galaxyproject.org/hall-of-fame/fpsom/), [The Carpentries](https://training.galaxyproject.org/hall-of-fame/carpentries/), [Erasmus+ Programme](https://training.galaxyproject.org/hall-of-fame/erasmusplus/)

CC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)

**Questions**

- How can I analyze data using Python with Numpy and Pandas?

**Objectives**

- Use the scientific libraries pandas and numpy to explore tabular datasets
- Calculate basic statistics about datasets and columns

**Time Estimation: 3H**
</div>


<p>In this lesson, we will be using Python 3 with some of its most popular scientific libraries. This tutorial assumes that the reader is familiar with the fundamentals of the Python programming language, as well as how to run Python programs using Galaxy. Otherwise, it is advised to follow the “Introduction to Python” tutorial available on the same platform. We will be using Jupyter Notebook, an interactive Python environment that comes with everything we need for the lesson. Please note: Jupyter Notebook is currently only available on the <a href="https://usegalaxy.eu/">usegalaxy.eu</a> and <a href="https://usegalaxy.org/">usegalaxy.org</a> sites.</p>
<blockquote class="comment" style="border: 2px solid #ffecc1; margin: 1em 0.2em">
<h3 id="-icon-comment--comment">💬 Comment</h3>
<p>This tutorial is <strong>significantly</strong> based on <a href="https://carpentries.org">the Carpentries</a> <a href="https://swcarpentry.github.io/python-novice-inflammation/">Programming with Python</a> and <a href="https://swcarpentry.github.io/python-novice-gapminder/">Plotting and Programming in Python</a>, which are licensed CC-BY 4.0.</p>
<p>Adaptations have been made to make this work better in a GTN/Galaxy environment.</p>
</blockquote>
<blockquote class="agenda" style="border: 2px solid #86D486;display: none; margin: 1em 0.2em">
<h3 id="agenda">Agenda</h3>
<p>In this tutorial, we will cover:</p>
<ol id="markdown-toc">
<li><a href="#analyze-data-using-numpy" id="markdown-toc-analyze-data-using-numpy">Analyze data using numpy</a></li>
</ol>
</blockquote>
<h1 id="analyze-data-using-numpy">Analyze data using numpy</h1>
<p>NumPy, short for Numerical Python, is a Python library. In general, you should use it when you want to perform operations on and manipulate numerical data, especially matrices or arrays. To tell Python that we’d like to start using NumPy, we need to import it:</p>


In [None]:
import numpy as np

<p>A NumPy array contains one or more elements of the same type. To examine the basic functions of the library, we will create an array of random data. These data will correspond to arthritis patients’ inflammation. The rows are the individual patients, and the columns are their daily inflammation measurements. We will use the <code>random.randint()</code> function. It takes four arguments: <code>randint(low, high=None, size=None, dtype=int)</code>. <code>low</code> and <code>high</code> specify the limits of the random number generator; <code>low</code> is included in the range and <code>high</code> is excluded. <code>size</code> determines the shape of the array and can be an integer or a tuple.</p>


In [None]:
np.random.seed(2021)  #create reproducible work
random_data = np.random.randint(1, 25, size=(50,70))
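<p>As a small check of how <code>low</code>, <code>high</code> and <code>size</code> behave, the sketch below draws a few values. The variable names are illustrative only. Because <code>high</code> is exclusive, <code>randint(1, 25, ...)</code> can only return values from 1 to 24.</p>

```python
import numpy as np

np.random.seed(2021)  # fixed seed so the draw is repeatable

# An integer size gives a 1-D array with that many elements
one_dim = np.random.randint(1, 25, size=5)

# A tuple size gives an array with that shape (rows, columns)
two_dim = np.random.randint(1, 25, size=(3, 4))

print(one_dim.shape)  # (5,)
print(two_dim.shape)  # (3, 4)

# high is exclusive: every drawn value lies between 1 and 24
sample = np.random.randint(1, 25, size=10_000)
print(sample.min() >= 1, sample.max() <= 24)  # True True
```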

<p>If we want to check the data have been loaded, we can print the variable’s value:</p>


In [None]:
print(random_data)

<p>Now that the data are in memory, we can manipulate them. First, let’s ask what type of thing <code>random_data</code> refers to:</p>


In [None]:
print(type(random_data))

<p>The output tells us that <code>random_data</code> currently refers to an N-dimensional array, the functionality for which is provided by the NumPy library.</p>
<p>The <code>type</code> function will only tell you that a variable is a NumPy array but won’t tell you the type of thing inside the array. We can find out the type of the data contained in the NumPy array.</p>


In [None]:
print(random_data.dtype)

<p>This tells us that the NumPy array’s elements are integer numbers.</p>
<p>With the following command, we can see the array’s shape:</p>


In [None]:
print(random_data.shape)

<p>The output tells us that the data array variable contains 50 rows and 70 columns. When we created the variable <code>random_data</code> to store our arthritis data, we did not only create the array; we also created information about the array, called members or attributes. This extra information describes <code>random_data</code> in the same way an adjective describes a noun. <code>random_data.shape</code> is an attribute of <code>random_data</code> which describes the dimensions of <code>random_data</code>.</p>
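<p>Besides <code>shape</code>, NumPy arrays carry a few other attributes that are handy when exploring data. The sketch below uses a small illustrative array (<code>example</code> is not part of the lesson data) to show three of them.</p>

```python
import numpy as np

example = np.arange(12).reshape(3, 4)  # small 3x4 array for illustration

print(example.shape)  # (3, 4): rows, columns
print(example.ndim)   # 2: number of dimensions
print(example.size)   # 12: total number of elements
```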
<p>If we want to get a single number from the array, we must provide an index in square brackets after the variable name, just as we do in math when referring to an element of a matrix. Our data has two dimensions, so we will need to use two indices to refer to one specific value:</p>


In [None]:
print('first value in data:', random_data[0, 0])

In [None]:
print('middle value in data:', random_data[25, 35])

<p>The expression random_data[25, 35] accesses the element at row 25, column 35. While this expression may not surprise you, random_data[0, 0] might. Programming languages like Fortran, MATLAB and R start counting at 1 because that’s what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because it represents an offset from the first value in the array (the second value is offset by one index from the first value). As a result, if we have an M×N array in Python, its indices go from 0 to M-1 on the first axis and 0 to N-1 on the second.</p>
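<p>The index bounds can be tried out on a tiny array. The sketch below uses an illustrative 2×3 array called <code>grid</code>. It also shows that Python accepts negative indices, which count from the end of an axis, and that asking for a row beyond M-1 raises an <code>IndexError</code>.</p>

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

print(grid[0, 0])    # 0, first row, first column
print(grid[1, 2])    # 5, last valid position in a 2x3 array
print(grid[-1, -1])  # 5, negative indices count from the end

try:
    grid[2, 0]       # row index 2 does not exist in a 2-row array
except IndexError as err:
    print('IndexError:', err)
```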
<h2 id="slicing-data">Slicing data</h2>
<p>An index like [25, 35] selects a single element of an array, but we can select whole sections as well, using slicing in the same way as we did previously with strings. For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:</p>


In [None]:
print(random_data[0:4, 0:10])

<p>We don’t have to include the upper and lower bound on the slice. If we don’t include the lower bound, Python uses 0 by default; if we don’t include the upper, the slice runs to the end of the axis, and if we don’t include either (i.e., if we use ‘:’ on its own), the slice includes everything:</p>


In [None]:
small = random_data[:3, 36:]
print('small is:')
print(small)

<p>The above example selects rows 0 through 2 and columns 36 through to the end of the array.</p>
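<p>To see how the bounds of a slice translate into the shape of the result, here is a minimal sketch on a small illustrative array (<code>table</code> is a made-up name, not the lesson data).</p>

```python
import numpy as np

table = np.arange(20).reshape(4, 5)  # 4 rows, 5 columns

top_left = table[0:2, 0:3]  # rows 0-1, columns 0-2 (upper bounds excluded)
last_cols = table[:, 3:]    # every row, columns 3 to the end
everything = table[:, :]    # the whole array

print(top_left.shape)    # (2, 3)
print(last_cols.shape)   # (4, 2)
print(everything.shape)  # (4, 5)
```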
<h2 id="process-the-data">Process the data</h2>
<p>NumPy has several useful functions that take an array as input to perform operations on its values. If we want to find the average inflammation for all patients on all days, for example, we can ask NumPy to compute random_data’s mean value:</p>


In [None]:
print(np.mean(random_data))

<p>Let’s use three other NumPy functions to get some descriptive values about the dataset. We’ll also use multiple assignment, a convenient Python feature that will enable us to do this all in one line.</p>


In [None]:
maxval, minval, stdval = np.max(random_data), np.min(random_data), np.std(random_data)

print('maximum inflammation:', maxval)
print('minimum inflammation:', minval)
print('standard deviation:', stdval)

<p>How did we know what functions NumPy has and how to use them? If you are working in IPython or in a Jupyter Notebook, there is an easy way to find out. If you type the name of something followed by a dot, then you can use tab completion (e.g. type <code>np.</code> and then press Tab) to see a list of all functions and attributes that you can use. After selecting one, you can also add a question mark (e.g. <code>np.cumprod?</code>), and IPython will return an explanation of the method! This is the same as doing <code>help(np.cumprod)</code>.</p>
<p>When analyzing data, though, we often want to look at variations in statistical values, such as the maximum inflammation per patient or the average inflammation per day. One way to do this is to create a new temporary array of the data we want, then ask it to do the calculation:</p>


In [None]:
patient_0 = random_data[0, :] # 0 on the first axis (rows), everything on the second (columns)
print('maximum inflammation for patient 0:', np.max(patient_0))

<p>What if we need the maximum inflammation for each patient over all days (as in the next diagram on the left) or the average for each day (as in the diagram on the right)? As the diagram below shows, we want to perform the operation across an axis:</p>
<p><img src="https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-advanced-np-pd/../../images/python-advanced-np-pd/Figure8_Operations_across_axis.png" alt="A cartoon shows a data structure on the left with Patient 0 to N as rows in a table, and then Day 0 to N as columns of the same table. Below is written &quot;numpy.max(data, axis=1)&quot; which does an aggregation and finds the max per patient (so maximum daily value for each patient.) On the right a similar table is shown but written below is numpy.max(data, axis=0) and now the maximum is calculated per day, which patient had the maximum value." /></p>
<p>To support this functionality, most array functions allow us to specify the axis we want to work on. If we ask for the average across axis 0 (rows in our 2D example), we get:</p>


In [None]:
print(np.mean(random_data, axis=0))

<p>As a quick check, we can ask this array what its shape is:</p>


In [None]:
print(np.mean(random_data, axis=0).shape)

<p>The expression (70,) tells us we have a one-dimensional array with 70 elements, one per column, so this is the average inflammation per day for all patients. If we average across axis 1 (columns in our 2D example), we get the average inflammation per patient across all days:</p>


In [None]:
print(np.mean(random_data, axis=1))
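<p>The effect of the <code>axis</code> argument is easier to follow on an array small enough to check by hand. The sketch below uses an illustrative 2×3 array with two "patients" and three "days".</p>

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])  # 2 rows (patients), 3 columns (days)

per_day = np.mean(scores, axis=0)      # collapses the rows: one value per column
per_patient = np.mean(scores, axis=1)  # collapses the columns: one value per row

print(per_day)            # [2.5 3.5 4.5]
print(per_patient)        # [2. 5.]
print(per_day.shape)      # (3,)
print(per_patient.shape)  # (2,)
```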

<h3 id="stacking-arrays">Stacking arrays</h3>
<p>Arrays can be concatenated and stacked on top of one another, using NumPy’s <code>vstack</code> and <code>hstack</code> functions for vertical and horizontal stacking, respectively.</p>


In [None]:
import numpy as np

A = np.array([[1,2,3], [4,5,6], [7, 8, 9]])
print('A = ')
print(A)

B = np.hstack([A, A])
print('B = ')
print(B)

C = np.vstack([A, A])
print('C = ')
print(C)
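<p>For two-dimensional arrays, the same results can also be obtained with <code>np.concatenate</code> and an explicit <code>axis</code>, which ties stacking back to the axis convention used earlier. This is a minimal sketch and the variable names are illustrative.</p>

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

side_by_side = np.concatenate([A, A], axis=1)  # same result as np.hstack([A, A])
on_top = np.concatenate([A, A], axis=0)        # same result as np.vstack([A, A])

print(side_by_side.shape)  # (3, 6)
print(on_top.shape)        # (6, 3)
print(np.array_equal(side_by_side, np.hstack([A, A])))  # True
print(np.array_equal(on_top, np.vstack([A, A])))        # True
```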

<h3 id="remove-nan-values">Remove NaN values</h3>
<p>Sometimes an array contains missing values (<code>NaN</code>), which can make it difficult to perform operations on it. To remove the <code>NaN</code> values, you must first find their indexes and then replace them. The following example replaces them with <code>0</code>.</p>


In [None]:
a = np.array([[1, 2, 3], [0, 3, np.nan]])
print(a)
a[np.isnan(a)] = 0
print(a)
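<p>Replacing missing values with <code>0</code> changes statistics such as the mean. As an alternative sketch, NumPy also provides NaN-aware functions such as <code>np.nanmean</code>, which skip the missing entries instead. The array <code>values</code> below is illustrative.</p>

```python
import numpy as np

values = np.array([[1.0, 2.0, 3.0], [0.0, 3.0, np.nan]])

print(np.isnan(values).sum())  # 1 missing value
print(np.mean(values))         # nan, because one element is missing
print(np.nanmean(values))      # 1.8, mean of the five remaining values
```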

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<h3 id="-icon-question--question-selecting-and-stacking-arrays">❓ Question: Selecting and stacking arrays</h3>
<p>Write some additional code that slices the first and last columns of A, and stacks them into a 3x2 array. Make sure to print the results to verify your solution.</p>
<blockquote class="solution" style="border: 2px solid #B8C3EA;color: white; margin: 1em 0.2em">
<div style="color: #555; font-size: 95%;">Hint: Select the text with your mouse to see the answer</div><h3 id="-icon-solution--solution">👁 Solution</h3>
<p>A ‘gotcha’ with array indexing is that singleton dimensions are dropped by default. That means <code>A[:, 0]</code> is a one dimensional array, which won’t stack as desired. To preserve singleton dimensions, the index itself can be a slice or array. For example, <code>A[:, :1]</code> returns a two dimensional array with one singleton dimension (i.e. a column vector).</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>D = np.hstack((A[:, :1], A[:, -1:]))
print('D = ')
print(D)
</code></pre></div>    </div>
</blockquote>
</blockquote>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<h3 id="-icon-question--question-selecting-with-conditionals">❓ Question: Selecting with conditionals</h3>
<p>Given the following array <code>A</code>, keep only the elements that are lower than <code>0.05</code>.</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>A = np.array([0.81, 0.025, 0.15, 0.67, 0.01])
</code></pre></div>  </div>
<blockquote class="solution" style="border: 2px solid #B8C3EA;color: white; margin: 1em 0.2em">
<h3 id="-icon-solution--solution-1">👁 Solution</h3>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>A = A[A&lt;0.05]
</code></pre></div>    </div>
</blockquote>
</blockquote>
<h1 id="use-pandas-to-work-with-dataframes">Use pandas to work with dataframes</h1>
<p>Pandas (<span class="citation"><a href="#reback2020pandas">pandas development team 2020</a></span>, <span class="citation"><a href="#mckinney-proc-scipy-2010">Wes McKinney 2010 </a></span>) is a widely-used Python library for statistics, particularly on tabular data. If you are familiar with R dataframes, this is the library that provides equivalent functionality in Python. A dataframe is a 2-dimensional table with indexes and column names. The index labels identify the rows, while the column names identify the columns. You will see later that these two features are useful when you’re manipulating your data. Each column can contain a different data type.</p>
<p>Load it with <code>import pandas as pd</code>. The alias <code>pd</code> is commonly used for pandas.</p>


In [None]:
import pandas as pd

<p>There are many ways to create a pandas dataframe. For example you can use a numpy array as input.</p>


In [None]:
data = np.array([['','Col1','Col2'],
['Row1',1,2],
['Row2',3,4]])

print(pd.DataFrame(data=data[1:,1:],
index=data[1:,0],
columns=data[0,1:]))
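<p>Because a NumPy array holds a single data type, mixing labels and numbers in one array stores the numbers as strings. When the columns should keep their numeric type, a dictionary of columns is one alternative. The sketch below uses made-up toy values.</p>

```python
import pandas as pd

# Hypothetical toy values; each dictionary key becomes a column name
frame = pd.DataFrame({'Col1': [1, 3], 'Col2': [2, 4]},
                     index=['Row1', 'Row2'])

print(frame)
print(frame.dtypes)  # both columns hold integers
```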

<p>For the purposes of this tutorial, we will use a file with the annotated differentially expressed genes that was produced in the <a href="/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html">Reference-based RNA-Seq data analysis</a> tutorial.</p>
<p>We can read a tabular file with <code>pd.read_csv</code>. The first argument is the filepath of the file to be read. The <code>sep</code> argument refers to the symbol used to separate the data into different columns. You can check the rest of the arguments using the <code>help()</code> function.</p>


In [None]:
data = pd.read_csv("https://zenodo.org/record/3477564/files/annotatedDEgenes.tabular", sep = "\t")
print(data)

<p>The columns in a dataframe are the observed variables, and the rows are the observations. Pandas uses backslash <code>\</code> to show wrapped lines when output is too wide to fit the screen.</p>
<h2 id="explore-the-data">Explore the data</h2>
<p>You can use <code>index_col</code> to specify that a column’s values should be used as row headings.</p>
<p>By default row indexes are numbers, but we could use a column of the data. To pass the name of the column to <code>read_csv</code>, you can use its <code>index_col</code> parameter. Be careful though, because the row indexes must be unique for each row.</p>


In [None]:
data = pd.read_csv("https://zenodo.org/record/3477564/files/annotatedDEgenes.tabular", sep = "\t", index_col = 'GeneID')
print(data)
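<p>The effect of <code>index_col</code> can be tried without downloading anything. The sketch below reads a made-up, two-gene, tab-separated table from an in-memory string; the gene names and values are hypothetical.</p>

```python
import io
import pandas as pd

# Hypothetical two-gene table, tab-separated like the real file
text = "GeneID\tBase mean\tStrand\ngeneA\t10.5\t+\ngeneB\t3.2\t-\n"

default_index = pd.read_csv(io.StringIO(text), sep="\t")
gene_index = pd.read_csv(io.StringIO(text), sep="\t", index_col="GeneID")

print(default_index.index.tolist())  # [0, 1]
print(gene_index.index.tolist())     # ['geneA', 'geneB']
print(gene_index.index.is_unique)    # True
```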

<p>You can use the <code>DataFrame.info()</code> method to find out more about a dataframe.</p>


In [None]:
data.info()

<p>We learn that this is a DataFrame. It consists of 130 rows and 12 columns. None of the columns contains any missing values. 6 columns contain 64-bit floating point <code>float64</code> values, 2 contain 64-bit integer <code>int64</code> values and 4 contain character <code>object</code> values. It uses 13.2KB of memory.</p>
<p>The <code>DataFrame.columns</code> variable stores information about the dataframe’s columns.</p>
<p>Note that this is an attribute, not a method (it doesn’t have parentheses). An attribute like this is also called a member variable, or just a member.</p>


In [None]:
print(data.columns)

<p>You could use <code>DataFrame.T</code> to transpose a dataframe. The <code>Transpose</code> (written <code>.T</code>) doesn’t copy the data, just changes the program’s view of it. Like columns, it is a member variable.</p>


In [None]:
print(data.T)

<p>You can use <code>DataFrame.describe()</code> to get summary statistics about the data. <code>DataFrame.describe()</code> returns the summary statistics of only the columns that have numerical data.  All other columns are ignored, unless you use the argument <code>include='all'</code>. Depending on the data type of each column, the statistics that can’t be calculated are replaced with  the value <code>NaN</code>.</p>


In [None]:
print(data.describe(include='all'))

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<h3 id="-icon-question--question-using-pdhead-and-pdtail">❓ Question: Using pd.head and pd.tail</h3>
<p>After reading the data, use <code>help(data.head)</code> and <code>help(data.tail)</code> to find out what <code>DataFrame.head</code> and <code>DataFrame.tail</code> do.
	a. What method call will display the first three rows of the data?
	b. What method call will display the last three columns of this data? (Hint: you may need to change your view of the data.)</p>
<blockquote class="solution" style="border: 2px solid #B8C3EA;color: white; margin: 1em 0.2em">
<div style="color: #555; font-size: 95%;">Hint: Select the text with your mouse to see the answer</div><h3 id="-icon-solution--solution">👁 Solution</h3>
<p>a. We can check out the first five rows of the data by executing <code>data.head()</code> (allowing us to view the head of the DataFrame). We can specify the number of rows we wish to see by specifying the parameter <code>n</code> in our call to <code>data.head()</code>. To view the first three rows, execute:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>data.head(n=3)
</code></pre></div>    </div>
<table>
<thead>
<tr>
<th>GeneID</th>
<th>Base mean</th>
<th>log2(FC)</th>
<th>StdErr</th>
<th>Wald-Stats</th>
<th>P-value</th>
<th>P-adj</th>
<th>Chromosome</th>
<th>Start</th>
<th>End</th>
<th>Strand</th>
<th>Feature</th>
<th>Gene name</th>
</tr>
</thead>
<tbody>
<tr>
<td>FBgn0039155</td>
<td>1086.974295</td>
<td>-4.148450</td>
<td>0.134949</td>
<td>-30.740913</td>
<td>1.617357e-207</td>
<td>1.387207e-203</td>
<td>chr3R</td>
<td>24141394</td>
<td>24147490</td>
<td>+</td>
<td>protein_coding</td>
<td>Kal1</td>
</tr>
<tr>
<td>FBgn0003360</td>
<td>6409.577128</td>
<td>-2.999777</td>
<td>0.104345</td>
<td>-28.748637</td>
<td>9.419922e-182</td>
<td>4.039734e-178</td>
<td>chrX</td>
<td>10780892</td>
<td>10786958</td>
<td>-</td>
<td>protein_coding</td>
<td>sesB</td>
</tr>
<tr>
<td>FBgn0026562</td>
<td>65114.840564</td>
<td>-2.380164</td>
<td>0.084327</td>
<td>-28.225437</td>
<td>2.850430e-175</td>
<td>8.149380e-172</td>
<td>chr3R</td>
<td>26869237</td>
<td>26871995</td>
<td>-</td>
<td>protein_coding</td>
<td>BM-40-SPARC</td>
</tr>
</tbody>
</table>
<p>b. To check out the last three rows, we would use the command <code>data.tail(n=3)</code>, analogous to <code>head()</code> used above. However, here we want to look at the last three columns, so we need to change our view and then use <code>tail()</code>. To do so, we create a new DataFrame in which rows and columns are switched:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>data_flipped = data.T
</code></pre></div>    </div>
<p>We can then view the last three columns of the data by viewing the last three rows of data_flipped:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>data_flipped.tail(n=3)
</code></pre></div>    </div>
<table>
<thead>
<tr><th>GeneID</th><th>FBgn0039155</th><th>FBgn0003360</th><th>FBgn0026562</th><th>FBgn0025111</th><th>FBgn0029167</th><th>FBgn0039827</th><th>FBgn0035085</th><th>FBgn0034736</th><th>FBgn0264475</th><th>FBgn0000071</th><th>…</th><th>FBgn0264343</th><th>FBgn0038237</th><th>FBgn0020376</th><th>FBgn0028939</th><th>FBgn0036560</th><th>FBgn0035710</th><th>FBgn0035523</th><th>FBgn0038261</th><th>FBgn0039178</th><th>FBgn0034636</th></tr>
</thead>
<tbody>
<tr><td>Strand</td><td>+</td><td>-</td><td>-</td><td>-</td><td>+</td><td>+</td><td>+</td><td>+</td><td>+</td><td>+</td><td>…</td><td>+</td><td>-</td><td>+</td><td>+</td><td>+</td><td>-</td><td>+</td><td>+</td><td>+</td><td>-</td></tr>
<tr><td>Feature</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>lincRNA</td><td>protein_coding</td><td>…</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td><td>protein_coding</td></tr>
<tr><td>Gene name</td><td>Kal1</td><td>sesB</td><td>BM-40-SPARC</td><td>Ant2</td><td>Hml</td><td>CG1544</td><td>CG3770</td><td>CG6018</td><td>CR43883</td><td>Ama</td><td>…</td><td>CG43799</td><td>Pde6</td><td>Sr-CIII</td><td>NimC2</td><td>CG5895</td><td>SP1173</td><td>CG1311</td><td>CG14856</td><td>CG6356</td><td>CG10440</td></tr>
</tbody>
</table>
</blockquote>
</blockquote>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<h3 id="-icon-question--question-saving-in-a-csv-file">❓ Question: Saving in a csv file</h3>
<p>As well as the <code>read_csv</code> function for reading data from a file, Pandas provides a <code>to_csv</code> function to write dataframes to files. Applying what you’ve learned about reading from files, write one of your dataframes to a file called <code>processed.csv</code>. You can use <code>help</code> to get information on how to use <code>to_csv</code>.</p>
<blockquote class="solution" style="border: 2px solid #B8C3EA;color: white; margin: 1em 0.2em">
<h3 id="-icon-solution--solution-1">👁 Solution</h3>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>data_flipped.to_csv('processed.csv')
</code></pre></div>    </div>
</blockquote>
</blockquote>
<ul>
<li>Note about Pandas DataFrames/Series</li>
</ul>
<p>A <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">DataFrame</a> is a collection of <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html">Series</a>. The DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.</p>
<p>Pandas is built on top of the Numpy library, which in practice means that most of the methods defined for Numpy Arrays apply to Pandas Series/DataFrames.</p>
<p>What makes Pandas so attractive is the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames.</p>
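<p>The relationship between DataFrames, Series and NumPy can be seen in a short sketch. The table below is a hypothetical toy example, not the lesson dataset.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy table
frame = pd.DataFrame({'Start': [100, 250], 'End': [180, 400]},
                     index=['geneA', 'geneB'])

column = frame['End']
print(type(column).__name__)  # Series

# NumPy functions accept a Series directly
print(np.max(column))         # 400

# Arithmetic between Series is element-wise and aligned on the index
print((frame['End'] - frame['Start']).tolist())  # [80, 150]
```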
<h2 id="select-data">Select data</h2>
<p>To access a value at the position <code>[i,j]</code> of a DataFrame, we have two options, depending on whether <code>i</code> and <code>j</code> are positions or labels. Remember that a DataFrame provides an index as a way to identify the rows of the table; a row, then, has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.</p>
<p>You can use <code>DataFrame.iloc[..., ...]</code> to select values by their (entry) position, that is, by numerical index. This is analogous to a 2D version of character selection in strings.</p>


In [None]:
print(data.iloc[0, 0])

<p>You can also use <code>DataFrame.loc[..., ...]</code> to select values by their (entry) label, that is, by row and column name. This is analogous to a 2D version of dictionary keys.</p>


In [None]:
print(data.loc["FBgn0039155", "Base mean"])

<p>You can use Python’s usual slicing notation, to select all or a subset of rows and/or columns. For example, the following code selects all the columns of the row <code>"FBgn0039155"</code>.</p>


In [None]:
print(data.loc["FBgn0039155", :])

<p>Which would get the same result as printing <code>data.loc["FBgn0039155"]</code> (without a second index).</p>
<p>You can select multiple columns or rows using <code>DataFrame.loc</code> and a slice of labels, or <code>DataFrame.iloc</code> and the numbers corresponding to the rows and columns.</p>


In [None]:
print(data.loc['FBgn0003360':'FBgn0029167', 'Base mean':'Wald-Stats'])
print(data.iloc[1:4 , 0:3])

<ul>
<li>Note the difference between the 2 outputs.</li>
</ul>
<p>When choosing or transitioning between <code>loc</code> and <code>iloc</code>, you should keep in mind that the two methods use slightly different indexing schemes.</p>
<p><code>iloc</code> uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So <code>0:10</code> will select entries <code>0,...,9</code>. <code>loc</code>, meanwhile, indexes inclusively. So <code>0:10</code> will select entries <code>0,...,10</code>.</p>
<p>This is particularly confusing when the DataFrame index is a simple numerical list, e.g. <code>0,...,1000</code>. In this case <code>df.iloc[0:1000]</code> will return 1000 entries, while <code>df.loc[0:1000]</code> returns 1001 of them! To get 1000 elements using <code>loc</code>, you will need to go one lower and ask for <code>df.loc[0:999]</code>.</p>
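<p>The difference is easy to reproduce on a small scale. The sketch below builds a hypothetical dataframe whose index is the integers 0 to 10 and slices it both ways.</p>

```python
import pandas as pd

# Hypothetical frame whose index is the integers 0..10
df = pd.DataFrame({'value': range(11)})

by_position = df.iloc[0:5]  # positions 0..4, end excluded
by_label = df.loc[0:5]      # labels 0..5, end included

print(len(by_position))  # 5
print(len(by_label))     # 6
```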
<p>The result of slicing is a new dataframe and can be used in further operations. All the statistical operators that work on entire dataframes work the same way on slices. E.g., calculate max of a slice.</p>


In [None]:
print(data.loc['FBgn0003360':'FBgn0029167', 'Base mean'].max())

<h2 id="use-conditionals-to-select-data">Use conditionals to select data</h2>
<p>You can use conditionals to select data. A comparison is applied element by element and returns a similarly-shaped dataframe of <code>True</code> and <code>False</code> values. This boolean result can be used as a mask to subset the original dataframe. The following example creates a new dataframe consisting only of the columns ‘P-adj’ and ‘Gene name’, then keeps the rows that satisfy the expression <code class="language-plaintext highlighter-rouge">'P-adj' &lt; 0.000005</code>.</p>


In [None]:
subset = data.loc[:, ['P-adj', 'Gene name']]
print(subset)

In [None]:
mask = subset.loc[:, 'P-adj'] < 0.000005
new_data = subset[mask]
print(new_data)

<p>If we had not specified the column that the expression should be applied to, it would have been applied to the entire dataframe. But the dataframe contains different types of data, so an error would occur.</p>
<p>Consider the following example of a dataframe consisting only of numerical data. The expression and the mask would be normally applied to the data and the mask would return <code>NaN</code> for the data that don’t comply with the expression.</p>


In [None]:
subset = data.loc[:, ['StdErr',	'Wald-Stats', 'P-value', 'P-adj']]
mask = subset < 0.05
new_data = subset[mask]
print(new_data)

<p>This is very useful because NaNs are ignored by operations like max, min, average, etc.</p>


In [None]:
print(new_data.describe())
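<p>The two kinds of mask behave differently, which a small sketch can make concrete. With made-up numbers, a whole-dataframe mask keeps the shape and fills in <code>NaN</code>, while a single-column mask drops entire rows.</p>

```python
import pandas as pd

# Hypothetical numeric table
scores = pd.DataFrame({'a': [0.01, 0.20, 0.03],
                       'b': [0.50, 0.04, 0.02]})

mask = scores < 0.05      # dataframe of True/False, same shape as scores
filtered = scores[mask]   # values where the mask is False become NaN
print(filtered)
print(filtered.mean())    # NaN entries are skipped when averaging

rows = scores[scores['a'] < 0.05]  # boolean Series: keeps only matching rows
print(rows)
```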

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<h3 id="-icon-question--question-manipulating-dataframes">❓ Question: Manipulating dataframes</h3>
<p>Explain what each line in the following short program does: what is in first, second, etc.?</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>first = pd.read_csv("https://zenodo.org/record/3477564/files/annotatedDEgenes.tabular", sep = "\t", index_col = 'GeneID')
second = first[first['log2(FC)'] &gt; 0 ]
third = second.drop('FBgn0025111')
fourth = third.drop('StdErr', axis = 1)
fourth.to_csv('result.csv')
</code></pre></div>  </div>
<blockquote class="solution" style="border: 2px solid #B8C3EA;color: white; margin: 1em 0.2em">
<div style="color: #555; font-size: 95%;">Hint: Select the text with your mouse to see the answer</div><h3 id="-icon-solution--solution">👁 Solution</h3>
<p>Let’s go through this piece of code line by line.</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>first = pd.read_csv("https://zenodo.org/record/3477564/files/annotatedDEgenes.tabular", sep = "\t", index_col = 'GeneID')
</code></pre></div>    </div>
<p>This line loads the data into a dataframe called first. The <code>index_col='GeneID'</code> parameter selects which column to use as the row labels in the dataframe.</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>second = first[first['log2(FC)'] &gt; 0 ]
</code></pre></div>    </div>
<p>This line makes a selection: only those rows of first for which the ‘log2(FC)’ column contains a positive value are extracted. Notice how the Boolean expression inside the brackets is used to select only those rows where the expression is true.</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>third = second.drop('FBgn0025111')
</code></pre></div>    </div>
<p>As the syntax suggests, this line drops the row from second where the label is ‘FBgn0025111’. The resulting dataframe third has one row less than the original dataframe second.</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>fourth = third.drop('StdErr', axis = 1)
</code></pre></div>    </div>
<p>Again we apply the drop function, but in this case we are dropping not a row but a whole column. To accomplish this, we need to specify also the axis parameter.</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>fourth.to_csv('result.csv')
</code></pre></div>    </div>
<p>The final step is to write the data that we have been working on to a csv file. Pandas makes this easy with the <code>to_csv()</code> function. The only required argument to the function is the filename. Note that the file will be written in the directory from which you started the Jupyter or Python session.</p>
</blockquote>
</blockquote>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<h3 id="-icon-question--question-finding-min-max-indexes">❓ Question: Finding min-max indexes</h3>
<p>Explain in simple terms what <code>idxmin</code> and <code>idxmax</code> do in the short program below. When would you use these methods?</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>data = pd.read_csv("https://zenodo.org/record/3477564/files/annotatedDEgenes.tabular", sep = "\t", index_col = 'GeneID')

print(data['Base mean'].idxmin())
print(data['Base mean'].idxmax())
</code></pre></div>  </div>
<blockquote class="solution" style="border: 2px solid #B8C3EA;color: white; margin: 1em 0.2em">
<h3 id="-icon-solution--solution-1">👁 Solution</h3>
<p><code>idxmin</code> will return the index value corresponding to the minimum; idxmax will do the same for the maximum value.</p>
<p>You can use these functions whenever you want to get the row index of the minimum/maximum value and not the actual minimum/maximum value.</p>
<p>Output:
FBgn0063667
FBgn0026562</p>
</blockquote>
</blockquote>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<h3 id="-icon-question--question-selecting-with-conditionals">❓ Question: Selecting with conditionals</h3>
<p>Assume Pandas has been imported and the previous dataset has been loaded. Write an expression to select each of the following:
a. P-value of each gene
b. all the information of gene <code>FBgn0039178</code>
c. the information of all genes that belong to chromosome <code>chr3R</code></p>
<blockquote class="solution" style="border: 2px solid #B8C3EA;color: white; margin: 1em 0.2em">
<h3 id="-icon-solution--solution-2">👁 Solution</h3>
<p>a. <code class="language-plaintext highlighter-rouge">data['P-value']</code>
b. <code class="language-plaintext highlighter-rouge">data.loc['FBgn0039178', :]</code>
c. <code class="language-plaintext highlighter-rouge">data[data['Chromosome'] == 'chr3R']</code></p>
</blockquote>
</blockquote>
<h2 id="group-by-and-analyze-the-data">Group-by and analyze the data</h2>
<p>Many data analysis tasks can be approached using the “split-apply-combine” paradigm: split the data into groups, apply some analysis to each group, and then combine the results.</p>
<p>Pandas makes this very easy through the use of the <code>groupby()</code> method, which splits the data into groups. When the data is grouped in this way, the aggregate method <code>agg()</code> can be used to apply an aggregating or summary function to each group.</p>


In [None]:
summarised_data = data.groupby('Chromosome').agg({'Base mean':'first',
'log2(FC)': 'max'})
print(summarised_data)

<p>There are a couple of things that should be noted. The <code>agg()</code> method accepts a dictionary as input that specifies the function to be applied to each column. The output is a new dataframe in which each row corresponds to one group. The output dataframe uses the grouping column as its index. We can turn the index back into a regular column by using the <code>reset_index()</code> method.</p>


In [None]:
summarised_data = data.groupby('Chromosome').agg({'Base mean':'first',
'log2(FC)': 'max'}).reset_index()
print(summarised_data)
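<p>The split-apply-combine steps can be checked by hand on a very small table. The sketch below uses hypothetical values with the same column names as the dataset, and a different aggregation (<code>'mean'</code>) for the first column.</p>

```python
import pandas as pd

# Hypothetical toy table with the same kind of columns as the dataset
genes = pd.DataFrame({
    'Chromosome': ['chr2L', 'chr2L', 'chrX'],
    'Base mean': [10.0, 30.0, 5.0],
    'log2(FC)': [-1.0, 2.0, 0.5],
})

# One row per chromosome: mean of 'Base mean', maximum of 'log2(FC)'
summary = genes.groupby('Chromosome').agg({'Base mean': 'mean',
                                           'log2(FC)': 'max'})
print(summary)

# After reset_index(), 'Chromosome' is an ordinary column again
print(summary.reset_index().columns.tolist())
# ['Chromosome', 'Base mean', 'log2(FC)']
```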

<blockquote>
<h3 id="-icon-question--question-finding-the-max-of-each-group">❓ Question: Finding the max of each group</h3>
<p>Using the same dataset, try to find the longest genes in each chromosome.</p>
<blockquote>
<div style="color: #555; font-size: 95%;">Hint: Select the text with your mouse to see the answer</div><h3 id="-icon-solution--solution">👁 Solution</h3>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>data['Gene Length'] = data['End'] - data['Start']
data.groupby('Chromosome').agg(max_length = ('Gene Length', 'max'))
</code></pre></div>    </div>
</blockquote>
</blockquote>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<h3 id="-icon-question--question-grouping-with-multiple-variables">❓ Question: Grouping with multiple variables</h3>
<p>Using the same dataset, try to find how many genes are found on each strand of each chromosome.</p>
<blockquote class="solution" style="border: 2px solid #B8C3EA;color: white; margin: 1em 0.2em">
<h3 id="-icon-solution--solution-1">👁 Solution</h3>
<p>You can group the data according to more than one column.</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: white"><code>data.groupby(['Chromosome', 'Strand']).size()
</code></pre></div>    </div>
</blockquote>
</blockquote>
<h1 id="conclusion">Conclusion</h1>
<p>This tutorial aims to serve as an introduction to data analysis using the Python programming language. We hope you feel more confident in Python!</p>


# Key Points

- Python has many libraries offering a variety of capabilities, which makes it popular with beginners as well as more experienced users
- You can use scientific libraries like Numpy and Pandas to perform data analysis.

# Congratulations on successfully completing this tutorial!

Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-advanced-np-pd/tutorial.html#feedback) and check there for further resources!
