# 2. Pandas

Content:
- 2.1 Introduction
- 2.2 Installation
- 2.3 Series
- 2.4 DataFrame
- 2.5 Import files
- 2.6 Summary

## 2.1 Introduction

Pandas is a Python library for manipulating data. It is considered the de facto standard for reading, analyzing and visualizing tabular data from CSV files, Excel sheets, SQL tables and many other sources.

Pandas has two main data structures:
- Series: a 1D labelled array
- DataFrame: a 2D table

Older versions of pandas also had a Panel (a 3D array), but it has been removed from the library and is not discussed here.

Although the main data structure for tables is called a DataFrame, it is important to understand that it is built as a combination of Series. Generally, however, we will talk about the index (rows) and the columns.  
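
The relation between a DataFrame and its Series can be checked directly. The sketch below uses a small made-up table (the name `table` and its values are only for illustration) and shows that selecting one column gives back a Series that shares the row index of the table.

```python
import pandas as pd

# Toy data, made up for illustration
table = pd.DataFrame(
    {"counts_exp1": [12, 35], "counts_exp2": [6, 28]},
    index=["GeneA", "GeneB"],
)

column = table["counts_exp1"]            # selecting one column returns a Series
print(type(column).__name__)             # Series
print(column.index.equals(table.index))  # the column shares the row index
```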


<img src="img/pandas_dataframe.png" height=80% width=80%>

Many parallels can be drawn between pandas and the R language. If you have experience with R, [this page](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html) gives a side-by-side comparison of both.  


## 2.2 Installation

If pandas is not installed yet, install it either in Anaconda Navigator or with the following command:

In [None]:
%pip install pandas

This either installs pandas or tells you that the requirement is already satisfied. After a fresh installation you may need to restart the Notebook kernel for the change to take effect. 

After installing the library in our environment, we still have to import it in our Notebook:

In [None]:
# Importing pandas as pd is a convention
import pandas as pd

## 2.3 Series

We will start by creating a one-dimensional array, which is what a pandas Series is. The data that we gather in this array will later make up a column in our table. 
- The data can be of any kind: integers, strings, floats, Python objects, etc. 
- The data is always accompanied by an index.

There are several ways to create a Series: from a list, a dictionary or a NumPy array, or from files and other external data sources. 

Let's make our first pandas object:

In [None]:
counts = [12, 35, 45, 12, 22, 38]

countSeries = pd.Series(counts)
countSeries

In [None]:
genes = ['GeneA', 'GeneB', 'GeneC', 'GeneD', 'GeneE', 'GeneF']

genesSeries = pd.Series(genes)
genesSeries

It looks like we already have a table; however, the first column is actually the index and the second column contains the data. If the **index** is not specified when creating the pandas Series, it defaults to the integers from 0 up to the length of the array minus 1. 

*That is, range(n) where n is the array length: [0, 1, 2, ..., n-1].*
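
As a quick check of the default index described above, here is a minimal sketch (the variable names are only for illustration):

```python
import pandas as pd

values = pd.Series([12, 35, 45])

# Without an explicit index, pandas generates a RangeIndex: 0, 1, ..., len - 1
default_index = list(values.index)
print(values.index)
print(default_index)  # [0, 1, 2]
```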

The following code makes a Series where the list of genes is the index of the object and the list of counts is the data. 

In [None]:
mySerie1 = pd.Series(counts, index=genes)
mySerie1

A Series can also be made from a dictionary like so:

In [None]:
aminoacids = {
     'A': 'Ala',
     'C': 'Cys',
     'D': 'Asp',
     'E': 'Glu',
     'F': 'Phe',
     'G': 'Gly',
     'H': 'His',
     'I': 'Ile',
     'K': 'Lys',
     'L': 'Leu',
     'M': 'Met',
     'N': 'Asn',
     'P': 'Pro',
     'Q': 'Gln',
     'R': 'Arg',
     'S': 'Ser',
     'T': 'Thr',
     'V': 'Val',
     'W': 'Trp',
     'Y': 'Tyr'}

aaSeries = pd.Series(aminoacids)
aaSeries

In this case the keys of the dictionary become the index labels of the Series, and the respective values become the data associated with these labels. 

Mathematical operations can be performed on a complete Series at once. These are called vectorized operations, as the operation is applied to each element in the Series.

In [None]:
countSeries + 2

In [None]:
countSeries * 2
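
Vectorized operations also work between two Series. The following sketch, with made-up counts, suggests how values carrying the same index label are combined, and compares the result with an explicit loop over the labels. The names `exp1`, `exp2`, `total` and `total_loop` are only for illustration.

```python
import pandas as pd

genes = ["GeneA", "GeneB", "GeneC"]
exp1 = pd.Series([12, 35, 45], index=genes)
exp2 = pd.Series([6, 28, 55], index=genes)

# Element-wise addition: values with the same label are added together
total = exp1 + exp2

# The same result written as an explicit Python loop over the labels
total_loop = pd.Series({gene: exp1[gene] + exp2[gene] for gene in genes})

print(total)
print(list(total) == list(total_loop))
```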

---
### 2.3.1 Exercise

Add the following data to the `mySerie1` pandas Series object as a new row:

```
GeneK   25
```

In [None]:
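# Make a pandas Series from a dictionary and add it to the `mySerie1` pandas Series object.
# Series.append() was removed in pandas 2.0, so we use pd.concat() and reassign the result.
mySerie2 = pd.Series({"GeneK" : 25})
mySerie1 = pd.concat([mySerie1, mySerie2])
mySerie1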

---

Finally, retrieving an element is easily done with square brackets, similar to accessing elements from other data structures in Python. 

In [None]:
# Accessing a single element in pandas Series
mySerie1['GeneA']

In [None]:
# Accessing multiple values in a pandas Series object
mySerie1[['GeneA', 'GeneC', 'GeneK']]

In [None]:
# Accessing the first two values of a pandas Series object with a slice
mySerie1[:2]

## 2.4 DataFrame
A DataFrame is a two-dimensional data structure. It consists of an index for the rows and an index for the columns, alongside the data in the columns.  

|  /   | **column_index1** | **column_index2**   |
|---|---|---|
| **row_index1**   | 1  | 2  |   
| **row_index2**  |  3 | 4  |   
| **row_index3**  |  5 | 6  |   


Similar to a Series, a DataFrame can be created in several ways: from lists, dictionaries or Series. However, the most important way is probably importing a dataset from a file. 

First we'll **create** a DataFrame from a dictionary where each key is a column name and each value is a list holding the data of that column. 

In [None]:
counts_exp1 = [12, 35, 45, 12, 22, 38]
counts_exp2 = [6, 28, 55, 12, 19, 34]
genes = ['GeneA', 'GeneB', 'GeneC', 'GeneD', 'GeneE', 'GeneF']

dataframe_dict = {'counts_exp1': counts_exp1, 'counts_exp2': counts_exp2}

df = pd.DataFrame(dataframe_dict, index = genes)
df

We can **inspect** the DataFrame we just created:

In [None]:
print(df.columns)
print(df.index)
print(df.values)

In [None]:
# First five rows
df.head()

In [None]:
# Last five rows
df.tail()

Before exploring how to access rows, columns and elements from a table, let's have a look at how we can add new data or delete a column from a table.

In [None]:
# Make new column with new data
df['counts_exp3'] = [23, 24, 58, 16, 8, 5]
df

In [None]:
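# A Series can also hold the data for a new column.
# Giving it the gene names as index makes it line up with the rows of df.
counts_exp4 = [23, 24, 58, 16, 8, 5]
counts4Series = pd.Series(counts_exp4, index=genes)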
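
When a Series is assigned as a new column, pandas matches its values to the rows by index label, not by position. The sketch below uses a small made-up table (all names and values are for illustration only) to show what happens with and without matching labels.

```python
import pandas as pd

genes = ["GeneA", "GeneB", "GeneC"]
table = pd.DataFrame({"counts_exp1": [12, 35, 45]}, index=genes)

unlabelled = pd.Series([23, 24, 58])              # index 0, 1, 2
labelled = pd.Series([23, 24, 58], index=genes)   # index matches the table

table["mismatched"] = unlabelled   # no shared labels, so every value is NaN
table["aligned"] = labelled        # labels match, so values are filled in
print(table)
```

Assigning a plain list instead of a Series skips this alignment and fills the column by position.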


In [None]:
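# Deleting a column in two ways.
# The 'avg' column only exists once you have added it (see Exercise 2.4.1);
# errors='ignore' keeps this cell from failing when the column is absent.
# del df['avg']
# or
df.drop('avg', axis = 1, inplace=True, errors='ignore')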

**Accessing** a column is done by putting its name within square brackets. 
Rows can be selected by position with a slice; remember that Python starts counting from 0 and that a slice excludes its end position. 

In [None]:
# Accessing the column counts_exp1
df['counts_exp1']

In [None]:
# Accessing multiple columns 
df[['counts_exp1', 'counts_exp2']]

In [None]:
# Accessing values within a column: rows from 2 to 3. 
df['counts_exp1'][1:3]

In [None]:
# Accessing all columns from rows 2 to 4 (positions 1, 2 and 3; the end position is excluded)
df[1:4]

Another way of accessing the data in a DataFrame is by using the `.loc[]` and `.iloc[]` indexers. 
- `.loc[]`: uses primarily label(s) to access the data,
- `.iloc[]`: uses purely integer-location based indexing for selection by position.

**`.loc[]`** takes the row label(s) as its first argument. If a second argument is given, it refers to the column label(s). An elaborate explanation of the possibilities with the location method is available [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html).

In [None]:
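# Select one row: a single label returns the row as a Series
df.loc['GeneE']
# Alternatively, a list of labels returns a DataFrame (only this last result is displayed)
df.loc[['GeneE']]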

In [None]:
# Accessing multiple rows from a dataset
selected_genes = ['GeneA', 'GeneB' ,'GeneC', 'GeneF']
df.loc[selected_genes]

In [None]:
# Accessing multiple rows and a subselection of columns by passing two parameters to .loc[]
selected_counts = ['counts_exp1', 'counts_exp2']
df.loc[selected_genes, selected_counts]
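
One detail that is easy to miss: a label slice in `.loc[]` includes its end label, whereas a positional slice excludes its end position. A minimal sketch on a made-up table (names and values are for illustration only):

```python
import pandas as pd

genes = ["GeneA", "GeneB", "GeneC", "GeneD"]
table = pd.DataFrame(
    {"counts_exp1": [12, 35, 45, 12], "counts_exp2": [6, 28, 55, 12]},
    index=genes,
)

by_label = table.loc["GeneB":"GeneC", "counts_exp1"]   # GeneC is included
by_position = table["counts_exp1"].iloc[1:3]           # position 3 is excluded
print(by_label)
print(by_label.equals(by_position))
```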

Another way of slicing our dataset is by using **`.iloc[]`**. An elaborate explanation of the possibilities with this integer-location method is available [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html). Try the following lines one by one (remove the `#`). Can you figure out how this method works? 

In [None]:
#df.iloc[0]
#df.iloc[[0]]
#df.iloc[[1, 2, 3],[0,2]]

---
### 2.4.1 Exercise
- Select the number of counts in *GeneD* for the second and third experiment. 
- Add a new column to the dataframe with the average of the three experiments.

---
### 2.4.2 Exercise

- Search in the pandas documentation for the median method and add a column that describes the median value per gene.
- Search in the pandas documentation for a method that will sum all of the values per experiment and add the result as an extra row to the table. 
- Remove the row with the sum of the counts that we added in the previous step. 

---

## 2.5 Import files
Now that we know how to navigate our DataFrames and how to add new data, we'll have a look at how to import data from files. Pandas is great for working with tabular data, so the data will generally come from a spreadsheet-like file (CSV, Excel, etc.). Importing data is done by passing the path of the file to a reader function. This file can be in a local folder, or it can be read directly from the internet via its URL:

In [None]:
# annotation file in the data folder - excel
annotation_xlsx = pd.read_excel('data/annotation.xlsx')
annotation_xlsx.head()

In [None]:
# annotation file in the data folder - csv
annotation_csv = pd.read_csv('data/annotation.csv')
annotation_csv.head()

An overview of all the parameters that `pd.read_csv` accepts is available in the pandas documentation ([here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)). 
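
To give an idea of how such parameters are used, here is a small sketch with two of them, `sep` and `usecols`. The tab-separated content is made up and read from an in-memory string instead of a file, so the example does not depend on the `data` folder.

```python
import io
import pandas as pd

# Made-up tab-separated content standing in for a file on disk
raw = "gene\tlength\tstrand\nGeneA\t1200\t+\nGeneB\t860\t-\n"

# sep sets the delimiter, usecols keeps only the listed columns
table = pd.read_csv(io.StringIO(raw), sep="\t", usecols=["gene", "length"])
print(table)
```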

---
### 2.5.1 Exercise
Search for the parameters of `pd.read_csv` that you need in order to read in the `annotation.csv` file where:
- chromosomes are the index of the rows, and 
- only the first 10 rows are imported.  

--- 

In the following code block we will read in the iris dataset, which is widely known (amongst R users at least). In this case we download and import it directly from a website (a GitHub repository), as it does not ship with pandas or Python. 

In [None]:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.head()

To get an idea of the descriptive statistics and the shape of the dataset's distribution, we can use `.describe()`. This method can be applied to the whole table (by default it summarizes the numerical columns only) or to a selection of the columns:

In [None]:
iris.describe()

In [None]:
iris['sepal_length'].describe()

However, this dataset contains different kinds of data:
- numerical data (sepal_length, sepal_width, petal_length and petal_width)
- categorical data (species)

These kinds of data need to be treated differently. Luckily, pandas makes this very easy:

In [None]:
iris['species'].value_counts()

In [None]:
iris['species'].value_counts(normalize=True)

In [None]:
iris.groupby(['species']).mean()

In [None]:
iris.groupby(by='species').mean()
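
Grouping is not limited to the mean. As a sketch, the following selects a single column after grouping and computes several statistics per group with `.agg()`. The measurements are made up (they are not the real iris values) and the names `flowers` and `summary` are only for illustration.

```python
import pandas as pd

# Made-up measurements, not the real iris values
flowers = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 5.0, 6.0],
})

# Select the column first, then compute several statistics per group
summary = flowers.groupby("species")["petal_length"].agg(["mean", "max", "count"])
print(summary)
```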

In [None]:
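# The comparison returns a boolean Series (True for each versicolor row)
iris['species'] == 'versicolor'
# .loc[] uses that boolean Series as a mask to keep only the matching rows
iris.loc[iris['species'] == 'versicolor']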
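
Several conditions can be combined into one mask. The sketch below assumes a small made-up table (not the real iris values) and keeps the rows that satisfy both conditions. The names `is_versicolor`, `is_long` and `selection` are only for illustration.

```python
import pandas as pd

# Made-up measurements, not the real iris values
flowers = pd.DataFrame({
    "species": ["setosa", "versicolor", "versicolor", "virginica"],
    "petal_length": [1.4, 4.7, 3.9, 6.0],
})

is_versicolor = flowers["species"] == "versicolor"   # boolean Series
is_long = flowers["petal_length"] > 4.0

# Combine masks with & (and) or | (or); wrap inline comparisons in parentheses
selection = flowers.loc[is_versicolor & is_long]
print(selection)
```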

In [None]:
# Have a look at the values of the index of the df
iris.index.values

In [None]:
# Get unique values within a column
iris['species'].unique()

---
### 2.5.2 Exercise
Can you find a method that will retrieve the indices of all the virginica flowers? 

---
### 2.5.3 Exercise
From the file `annotation.csv`:
- Sort the table in descending order based on the counts in exons.  
- Make a subselection of all chromosomes that have at least 15 counts in introns. 

---
### 2.5.4 Exercise
In this exercise, derived from the [GTN](https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/rna-seq-viz-with-heatmap2/tutorial.html), we will prepare the data to create a heatmap (see exercise 3.2.6) of the top differentially expressed genes in an RNA-seq counts dataset. 
- [`counts`](https://zenodo.org/record/2529926/files/limma-voom_normalised_counts)
- [`de_genes`](https://zenodo.org/record/2529926/files/limma-voom_luminalpregnant-luminallactate)  

The latter file contains the results from comparing gene expression in the luminal cells in the pregnant versus lactating mice. It includes genes that are not significantly differentially expressed. We’ll call genes significantly differentially expressed in this dataset if they pass the thresholds of `adjusted P-value < 0.01` and `fold change of > 1.5 (log2FC of 0.58)`. Filter the top 20 DE genes from that table and create a joint dataframe that contains only the following columns and looks like this:

| SYMBOL_x |  MCL1.DG |  MCL1.DH |  MCL1.DI |  MCL1.DJ |   MCL1.DK |   MCL1.DL |  MCL1.LA |  MCL1.LB |  MCL1.LC |  MCL1.LD |  MCL1.LE |  MCL1.LF |
|---------:|---------:|---------:|---------:|---------:|----------:|----------:|---------:|---------:|---------:|---------:|---------:|----------|
|     Ggt1 | 6.732347 | 6.556047 | 6.558849 | 6.586562 |  6.437596 |  6.394067 | 5.193118 | 5.526432 | 4.223990 | 4.341605 | 7.243899 | 7.354535 |
|  Slc39a4 | 2.722153 | 3.027691 | 2.175532 | 1.993214 | -0.193255 | -0.016902 | 3.071502 | 2.928202 | 6.472918 | 6.526836 | 2.430346 | 1.847241 |
|      Ppl | 5.102274 | 4.900942 | 5.755087 | 5.951023 |  6.851420 |  6.881858 | 7.359977 | 7.732010 | 8.227118 | 8.437499 | 4.646145 | 4.798986 |
| ...   |     ...     |      ...    |   ...       |        ...  |    ...       |         ...  |       ...  |        ...   |      ...    |      ...    |       ...   |       ...   |

Save the file as a csv-file in the data-folder. 
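
For the saving step, a minimal sketch of writing a DataFrame to a CSV file and reading it back could look as follows. The table content and the file name `example.csv` are placeholders, and a temporary folder is used so that nothing in the `data` folder is overwritten.

```python
import os
import tempfile
import pandas as pd

# Made-up values; the file name is a placeholder
table = pd.DataFrame({"SYMBOL": ["GeneA", "GeneB"], "sample1": [6.7, 2.7]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.csv")
    table.to_csv(path, index=False)   # index=False leaves out the row numbers
    reloaded = pd.read_csv(path)

print(reloaded)
```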

## 2.6 Summary
We've seen a number of methods for reading and manipulating data with pandas. As it is such a huge library, we could still spend a lot of time discovering all its modules. However, you should now be able to come up with solutions tailored to a specific question, either by using the pandas documentation (recommended) or by searching on Stack Overflow or another forum. 


## 2.7 Next session
Explore how to visualize your data in the [next chapter](03_Visualization.ipynb)!