<img src="img/pandas_logo.png" height=60% width=60%>


Content:
- 2.1 Introduction
- 2.2 Installation
- 2.3 pandas Series
- 2.4 pandas DataFrame
- 2.5 File I/O
- 2.6 Recapitulation
- 2.7 Next session


Pandas is a Python library that allows us to easily manipulate data. It is considered the de facto standard for reading, analyzing and visualizing tabular data from CSV files, Excel tables, SQL tables and many more.

Pandas has two main data structures:
- Series: a 1D labelled array
- DataFrame: a 2D table

Older pandas versions also offered a 3D `Panel`, but it has been removed from the library and is not discussed here.

Although the main data structure for tables is called a DataFrame, it is important to understand that it is built as a combination of Series. Generally, however, we will talk about the index (rows) and the columns.  


## 2.1 Introduction

<img src="img/pandas-df.png" height=70% width=70%>


Notes: discuss Series --> Columns --> indices --> dataframe

There are many parallels between pandas and the R language. If you have experience with R, the website referenced in the notebook gives an interesting comparison of the two.

In [5]:
import pandas as pd
# data
inhabitants = [11.46, 17.28, 67.06]
area = [30689, 41543, 643801]
countries = ['Belgium', 'The Netherlands', 'France']

# Dictionary with keys = column labels and values = column data
dataframe_dict = {'Inhabitants (in millions)': inhabitants, 'Area (in km²)': area}

# Build the DataFrame from the dictionary and use the countries as row index
df = pd.DataFrame(dataframe_dict, index = countries)
df

Unnamed: 0,Inhabitants (in millions),Area (in km²)
Belgium,11.46,30689
The Netherlands,17.28,41543
France,67.06,643801


The details of the code are not important yet. We simply create a pandas DataFrame and display it as the output of the cell, which gives the table shown here.

## 2.2 Installation

There are a couple of ways to install pandas (or any other library):
- select the package manually in the Environments section of Anaconda Navigator,
- install it in a Conda environment: `conda install -c conda-forge pandas`,
- install it with pip: `pip install pandas` in a terminal, or `%pip install pandas` in a Jupyter Notebook cell.

After installing the library, import with: 
```python
import pandas as pd
```
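
To confirm that the installation worked, a quick check along these lines prints the installed version. The exact version string depends on your environment, so no particular output is assumed here.

```python
import pandas as pd

# Print the installed pandas version; the value depends on your environment
print(pd.__version__)
```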

## 2.3 pandas Series

- One-dimensional labelled array
- Represents a column with data (`int`, `str`, `float`, `dict`, ...)
- Each data point has an accompanying index label

<center> <img src="img/series.PNG"> </center>


**1. Create pandas Series**

`pd.Series(data, index, ...)` with:
- `data`: array-like, iterable like list, tuples, dictionary, etc.
- `index`: will default to RangeIndex (0, 1, 2, ..., n) if not provided.

Note: the term array comes from the underlying NumPy library, where `np.array()` is the function that creates arrays from other data structures.
`pd.Series` accepts a few more arguments, but `data` and `index` are the most essential ones.
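
As a minimal sketch of how `data` and `index` interact, the example below builds a Series from a NumPy array, once with the default index and once with explicit labels. The variable names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# np.array() builds a NumPy array from a regular Python list
values = np.array([12, 35, 45])

# No index given, so pandas falls back to a RangeIndex (0, 1, 2)
example_series = pd.Series(values)
print(example_series.index)

# An explicit index replaces the default labels
labelled_series = pd.Series(values, index=['GeneA', 'GeneB', 'GeneC'])
print(labelled_series)

# .to_numpy() returns the data as a NumPy array again
print(type(labelled_series.to_numpy()))
```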

In [6]:
# Create pandas Series from list
counts = [12, 35, 45, 12, 22, 38]

countSeries = pd.Series(counts)
countSeries

0    12
1    35
2    45
3    12
4    22
5    38
dtype: int64

In [3]:
# Create pandas Series from tuples
genes = ('GeneA', 'GeneB', 'GeneC', 'GeneD', 'GeneE', 'GeneF')

genesSeries = pd.Series(genes, index=[1, 5, 4, 3, 6, 8])
genesSeries

1    GeneA
5    GeneB
4    GeneC
3    GeneD
6    GeneE
8    GeneF
dtype: object

In [4]:
# pandas Series with a specific index
mySerie1 = pd.Series(data=counts, index=genes)
mySerie1

GeneA    12
GeneB    35
GeneC    45
GeneD    12
GeneE    22
GeneF    38
dtype: int64

In [5]:
# Create pandas Series from dictionary
aaDict = {
     'A': 'Ala',
     'C': 'Cys',
     'D': 'Asp',
     'E': 'Glu',
     'F': 'Phe',
     'G': 'Gly'} # ...

aaSeries = pd.Series(aaDict)
aaSeries

A    Ala
C    Cys
D    Asp
E    Glu
F    Phe
G    Gly
dtype: object

**2. Operations with pandas Series**

In [6]:
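# Vectorized operations: the operation is applied to every element
countSeries * 2   # not displayed: a notebook only shows the last expression of a cell
countSeries + 2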

0    14
1    37
2    47
3    14
4    24
5    40
dtype: int64

In [7]:
# Accessing a single element in pandas Series
mySerie1['GeneA']

12

In [8]:
# Accessing multiple values in a pandas Series object
mySerie1[['GeneA', 'GeneC', 'GeneF']]

GeneA    12
GeneC    45
GeneF    38
dtype: int64

In [None]:
# Slicing by position: the first two elements (positions 0 and 1)
mySerie1[:2]

In [None]:
# Apply pandas methods; print() shows both results, not only the last one
print(countSeries.mean())
print(countSeries.sum())

Note: Operations in general exclude missing data.
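
The following sketch illustrates what excluding missing data means in practice. The Series is a hypothetical example with one missing value, and `skipna=False` switches the default behaviour off.

```python
import numpy as np
import pandas as pd

# Hypothetical counts with one missing measurement (NaN)
counts_with_gap = pd.Series([12, 35, np.nan, 12])

# By default NaN is skipped: (12 + 35 + 12) / 3
mean_skipping_nan = counts_with_gap.mean()

# With skipna=False a single NaN makes the result NaN
mean_with_nan = counts_with_gap.mean(skipna=False)

print(mean_skipping_nan)
print(mean_with_nan)
```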

Which Python built-in function can you use to list all functions applicable on pandas Series?

---
### Exercise 2.3.1

Use one of the pandas Series methods to add the following data to the `mySerie1` pandas Series object as a new row:

```
GeneK   25
```

In [None]:
# Exercise 2.3.1

## 2.4 pandas DataFrame

- Two dimensional arrays
- Represents a table with data
- Combination of pandas Series

<center> <img src="img/series-and-dataframe.PNG"> </center>




Notes: The pandas DataFrame is a two-dimensional data structure; essentially, it is a combination of one or more Series objects that share the same index.
A DataFrame consists of row indices and column indices, with the data stored in the columns, as depicted in the figure above.
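
To make the relation between Series and DataFrame concrete, here is a small sketch with made-up values: two Series that share an index are combined into a DataFrame, and selecting a column returns a Series again.

```python
import pandas as pd

genes = ['GeneA', 'GeneB', 'GeneC']

# Two Series sharing the same index (the gene names)
exp1 = pd.Series([12, 35, 45], index=genes)
exp2 = pd.Series([6, 28, 55], index=genes)

# Each dictionary key becomes a column label; rows are aligned on the index
small_df = pd.DataFrame({'counts_exp1': exp1, 'counts_exp2': exp2})

# Selecting one column gives back a Series
column = small_df['counts_exp1']
print(type(column))
print(column)
```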

**1. Create pandas DataFrame**

`pd.DataFrame(data, index, columns, ...)` with:
- `data`: ndarray, iterable like list, tuples, dictionary, etc.
- `index`: row labels; will default to RangeIndex (0, 1, 2, ..., n) if not provided.
- `columns`: column labels (index or array-like); will default to RangeIndex if the data carries no column labels.

Note: here too, ndarray comes from the underlying NumPy library. Mind the difference between `np.ndarray` and `np.array()`: the former is the actual data type, while the latter is a function that creates arrays from other data structures.
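
A short sketch of this distinction, using made-up values: `np.array()` is called to create the data, the resulting object has type `np.ndarray`, and that object can be handed to `pd.DataFrame` together with row and column labels.

```python
import numpy as np
import pandas as pd

# np.array() is the function; the object it returns has type np.ndarray
matrix = np.array([[12, 6], [35, 28], [45, 55]])
print(type(matrix))

# A 2D ndarray can be passed as data; index and columns supply the labels
array_df = pd.DataFrame(matrix,
                        index=['GeneA', 'GeneB', 'GeneC'],
                        columns=['counts_exp1', 'counts_exp2'])
print(array_df)
```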

In [7]:
counts_exp1 = [12, 35, 45, 12, 22, 38]
counts_exp2 = [6, 28, 55, 12, 19, 34]
genes = ['GeneA', 'GeneB', 'GeneC', 'GeneD', 'GeneE', 'GeneF']

In [8]:
# Dictionary with keys = column labels and values = column data
dataframe_dict = {'counts_exp1': counts_exp1, 'counts_exp2': counts_exp2}

# Build the DataFrame from the dictionary and use the genes as row index
df = pd.DataFrame(dataframe_dict, index = genes)
df

Unnamed: 0,counts_exp1,counts_exp2
GeneA,12,6
GeneB,35,28
GeneC,45,55
GeneD,12,12
GeneE,22,19
GeneF,38,34


In [9]:
counts_exp1 = [12, 35, 45, 12, 22, 38]
counts_exp2 = [6, 28, 55, 12, 19, 34]
genes = ['GeneA', 'GeneB', 'GeneC', 'GeneD', 'GeneE', 'GeneF']

In [11]:
# Alternative: pass the data as a list of rows, with explicit row and column labels
df = pd.DataFrame(data = list(zip(counts_exp1, counts_exp2)), index = genes, columns = ['counts_exp1', 'counts_exp2'])
df

Unnamed: 0,counts_exp1,counts_exp2
GeneA,12,6
GeneB,35,28
GeneC,45,55
GeneD,12,12
GeneE,22,19
GeneF,38,34


**2. Inspect pandas DataFrame**

In [12]:
print(df.columns)
print(df.index)
print(df.values)

Index(['counts_exp1', 'counts_exp2'], dtype='object')
Index(['GeneA', 'GeneB', 'GeneC', 'GeneD', 'GeneE', 'GeneF'], dtype='object')
[[12  6]
 [35 28]
 [45 55]
 [12 12]
 [22 19]
 [38 34]]


In [13]:
# First three rows
df.head(3)

Unnamed: 0,counts_exp1,counts_exp2
GeneA,12,6
GeneB,35,28
GeneC,45,55


In [14]:
# Last five rows
df.tail()

Unnamed: 0,counts_exp1,counts_exp2
GeneB,35,28
GeneC,45,55
GeneD,12,12
GeneE,22,19
GeneF,38,34


**3. Add or remove data from pandas DataFrame**

In [15]:
# Make new column with new data, similar to adding data to dictionaries
df['counts_exp3'] = [23, 24, 58, 16, 8, 5]
df

Unnamed: 0,counts_exp1,counts_exp2,counts_exp3
GeneA,12,6,23
GeneB,35,28,24
GeneC,45,55,58
GeneD,12,12,16
GeneE,22,19,8
GeneF,38,34,5


In [21]:
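# Alternatively, use the insert method to choose the position of the new column.
# Running this cell a second time raises a ValueError because the column already exists.
df.insert(loc = 1, column = "counts_exp4", value = [3, 4, 35, 16, 42, 11], allow_duplicates = False)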
df

ValueError: cannot insert counts_exp4, already exists
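
One way to keep such a cell safe to re-run is to check for the column first. This is only a sketch on a small made-up DataFrame (`demo_df` is a hypothetical name), not the approach used in the notebook.

```python
import pandas as pd

demo_df = pd.DataFrame({'counts_exp1': [12, 35], 'counts_exp2': [6, 28]},
                       index=['GeneA', 'GeneB'])

# Only insert the column if it is not there yet, so the cell can be re-run
if 'counts_exp4' not in demo_df.columns:
    demo_df.insert(loc=1, column='counts_exp4', value=[3, 4])

# Running the same check again leaves the DataFrame unchanged
if 'counts_exp4' not in demo_df.columns:
    demo_df.insert(loc=1, column='counts_exp4', value=[3, 4])

print(demo_df.columns.tolist())
```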

In [24]:
# Deleting a column in two ways
# del df['counts_exp4']
# or
df.drop('counts_exp4', axis = 1, inplace=True)
df

Unnamed: 0,counts_exp1,counts_exp2,counts_exp3
GeneA,12,6,23
GeneB,35,28,24
GeneC,45,55,58
GeneD,12,12,16
GeneE,22,19,8
GeneF,38,34,5


In [27]:
# Add new row data
dict_row = {'counts_exp1': 1, 'counts_exp2': 2, 'counts_exp3': 3}

# DataFrame.append was removed in pandas 2.0; use pd.concat instead.
# Passing ignore_index=True would replace the gene labels with 0, 1, 2, ...

# First create a one-row DataFrame labelled 'GeneX'
new_row = pd.DataFrame(dict_row, index=['GeneX'])
# Concatenate it to the dataframe; the existing row indices are kept
df = pd.concat([df, new_row])

df

Unnamed: 0,counts_exp1,counts_exp2,counts_exp3
GeneA,12,6,23
GeneB,35,28,24
GeneC,45,55,58
GeneD,12,12,16
GeneE,22,19,8
GeneF,38,34,5
GeneX,1,2,3


**4. Accessing data from pandas DataFrame**

A column is accessed by putting its label within square brackets. Rows can be sliced by position; remember that Python starts counting from 0 and that a slice excludes its end position.

In [None]:
# Accessing the column counts_exp1
df['counts_exp1']

```python
# Accessing multiple columns
df[['counts_exp1', 'counts_exp2']]
# Accessing values within a column: second and third row (positions 1 and 2)
df['counts_exp1'][1:3]
# Which is the same as:
df[1:3]['counts_exp1']
# Accessing all columns for the second to fifth row (positions 1 to 4)
df[1:5]
# Accessing two columns for the third and fourth row (positions 2 and 3)
df[['counts_exp2', 'counts_exp3']][2:4]
```

Another way of accessing the data in a DataFrame is by using the `.loc[]` and `.iloc[]` indexers.
- `.loc[]`: primarily uses label(s) to access the data,
- `.iloc[]`: uses purely integer-location based indexing for selection by position.

`df.loc[row_idx, col_idx]` with `row_idx` and `col_idx` each being a:
- single label,
- list of labels,
- slice object or
- boolean array with the same length as the axis being sliced

In [28]:
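# Some examples: select the rows where counts_exp1 is larger than 15
df.loc[df['counts_exp1'] > 15]

Unnamed: 0,counts_exp1,counts_exp2,counts_exp3
GeneB,35,28,24
GeneC,45,55,58
GeneE,22,19,8
GeneF,38,34,5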
5,38,34,5


```python
# Select one row with single label
df.loc['GeneE']
# Select multiple rows with list of labels
df.loc[['GeneD', 'GeneE']]
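# Select multiple rows with a slice object (both end labels are included)
df.loc['GeneC':'GeneE']
# Select a single value with a row label and a column label
df.loc['GeneC', 'counts_exp2']
# Accessing multiple rows using a predefined list of labels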
selected_genes = ['GeneA', 'GeneB' ,'GeneC', 'GeneF']
df.loc[selected_genes]
```
A boolean array of the same length as the axis being sliced:
```python
df['counts_exp1'] > 15
df.loc[df['counts_exp1'] > 15]
```

`df.iloc[row_idx, col_idx]` with `row_idx` and `col_idx` each being a:
- single integer,
- list of integers,
- slice object or
- boolean array with the same length as the axis being sliced

In [None]:
# First four rows, second and third column
df.iloc[:4, 1:3]
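
The options listed for `.iloc[]` can be sketched as follows on a small made-up DataFrame (`demo_df` is a hypothetical name). The example also contrasts an integer slice, which excludes its end position, with a label slice in `.loc[]`, which includes its end label.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'counts_exp1': [12, 35, 45, 12], 'counts_exp2': [6, 28, 55, 12]},
    index=['GeneA', 'GeneB', 'GeneC', 'GeneD'])

# Single integers: the value in the second row, first column
single_value = demo_df.iloc[1, 0]

# List of integers: first and third row, all columns
picked_rows = demo_df.iloc[[0, 2]]

# Integer slice: the end position is excluded (rows 0 and 1)
by_position = demo_df.iloc[0:2]

# Label slice with .loc: the end label is included (GeneA up to and including GeneC)
by_label = demo_df.loc['GeneA':'GeneC']

# Boolean list with one entry per row
masked = demo_df.iloc[[True, False, False, True]]

print(single_value, len(by_position), len(by_label))
```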

---
### Exercise 2.4.1
- Select the number of counts in *GeneD* for the second and third experiment. 
- Add a new column to the dataframe with the average of the three experiments.

In [None]:
# Exercise 2.4.1

---
### Exercise 2.4.2

- Search in the pandas documentation for the median method and add a column that describes the median count values per gene.
- Search in the pandas documentation for a method that will sum all of the values of each experiment and add the result as an extra row to the table.
- Remove the row with the sum of the counts that we added in the previous step.

In [51]:
# Exercise 2.4.2


## 2.5 File I/O
Plenty of possibilities: CSV, Excel, JSON, SQL, ... [see documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)

`pd.read_csv(filename, sep, header, names, ...)`: Discover all possibilities in the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).
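
As a rough sketch of what `sep`, `header` and `names` do, the example below reads a small piece of made-up text through `io.StringIO` instead of a file on disk. The content and column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated text standing in for a file on disk
raw_text = "GeneA;12;6\nGeneB;35;28\nGeneC;45;55\n"

# sep sets the delimiter, header=None says the data has no header line,
# and names supplies the column labels instead
demo_csv = pd.read_csv(io.StringIO(raw_text),
                       sep=';',
                       header=None,
                       names=['gene', 'counts_exp1', 'counts_exp2'])
print(demo_csv)
```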

In [None]:
# metagenic classification file in the data folder - csv
metagenic_csv = pd.read_csv('data/metagenic.csv')
metagenic_csv.head()

In [None]:
# header=1: use the second line of the file (position 1) as the column names
metagenic_csv = pd.read_csv('data/metagenic.csv', header=1)
metagenic_csv.head()

### Exercise 2.5.1 - A
Search for the parameters of `pd.read_csv` that you need in order to read in the `metagenic.csv` file where:
- chromosomes are the index of the rows, and 
- only the first 10 rows are imported.  

In [2]:
# Exercise 2.5.1 - A

### Exercise 2.5.1 - B
Import the data from the `metagenic.csv` file and add a new column with the total counts for each chromosome (e.g. chromosome 21 has 88 counts), and sort the table by descending total counts per chromosome.  

In [None]:
# Exercise 2.5.1 - B

Read in a CSV file directly from a URL:

In [52]:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


**Inspect** the dataframe:

In [53]:
iris.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


This dataset contains different kinds of data:

- numerical data (sepal_length, sepal_width, petal_length and petal_width)
- categorical data (species)

These need to be treated differently. Luckily, pandas makes this very easy:

In [54]:
# Inspect datatypes of the columns:
iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

In [None]:
# Convert the species column to categorical data explicitly
iris['species'] = iris['species'].astype('category')
iris.dtypes
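
What treating the two kinds of data differently can look like is sketched below on a small made-up table standing in for the iris data (`flowers` is a hypothetical name): counting observations per category for the categorical column, and computing a summary statistic for the numerical column.

```python
import pandas as pd

# Small hypothetical stand-in for the iris table
flowers = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.3, 5.8],
    'species': ['setosa', 'setosa', 'virginica', 'virginica']})

flowers['species'] = flowers['species'].astype('category')

# The distinct categories are stored once and can be listed
print(flowers['species'].cat.categories.tolist())

# Categorical column: count the observations per category
print(flowers['species'].value_counts())

# Numerical column: summary statistics such as the mean
print(flowers['sepal_length'].mean())
```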

In [None]:
# Discover dataset


Notes:
```python
Fill in examples
```

---
### Exercise 2.5.2 
Can you find a method that will retrieve the indices of all the virginica flowers? 

In [None]:
# Exercise 2.5.2

---
### Exercise 2.5.3
From the file `metagenic.csv`:
1. Sort the table based on the counts in exons in descending order
2. Make a subselection of chromosomes with at least 15 counts in introns. 

In [None]:
# Exercise 2.5.3

---

### Exercise 2.5.4

For this exercise we will use [this dataset](https://datahub.io/core/pharmaceutical-drug-spending), which contains the pharmaceutical drug spending of a number of countries from 1971 onwards. The dataset is available in the data folder as `pharmaspending.csv`.

Make a subselection of this dataset that contains the data for Belgium and its neighbouring countries France, Germany and the Netherlands. Furthermore, we're only interested in the data starting from the year 2000.

In [None]:
#Exercise 2.5.4

---
### Exercise 2.5.5 
In this exercise, derived from the [GTN](https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/rna-seq-viz-with-heatmap2/tutorial.html), we will prepare the data to create a heatmap (see exercise 3.2.6) of the top differentially expressed genes in an RNA-seq counts dataset. 
- [`counts`](https://zenodo.org/record/2529926/files/limma-voom_normalised_counts)
- [`de_genes`](https://zenodo.org/record/2529926/files/limma-voom_luminalpregnant-luminallactate)  

The latter file contains the results from comparing gene expression in the luminal cells in the pregnant versus lactating mice. It includes genes that are not significantly differentially expressed. We’ll call genes significantly differentially expressed in this dataset if they pass the thresholds of `adjusted P-value < 0.01` and `fold change of > 1.5 (log2FC of 0.58)`. Filter the top 20 DE genes from that table and create a joint dataframe that contains only the following columns and looks like this:

| SYMBOL_x |  MCL1.DG |  MCL1.DH |  MCL1.DI |  MCL1.DJ |   MCL1.DK |   MCL1.DL |  MCL1.LA |  MCL1.LB |  MCL1.LC |  MCL1.LD |  MCL1.LE |  MCL1.LF |
|---------:|---------:|---------:|---------:|---------:|----------:|----------:|---------:|---------:|---------:|---------:|---------:|----------|
|     Ggt1 | 6.732347 | 6.556047 | 6.558849 | 6.586562 |  6.437596 |  6.394067 | 5.193118 | 5.526432 | 4.223990 | 4.341605 | 7.243899 | 7.354535 |
|  Slc39a4 | 2.722153 | 3.027691 | 2.175532 | 1.993214 | -0.193255 | -0.016902 | 3.071502 | 2.928202 | 6.472918 | 6.526836 | 2.430346 | 1.847241 |
|      Ppl | 5.102274 | 4.900942 | 5.755087 | 5.951023 |  6.851420 |  6.881858 | 7.359977 | 7.732010 | 8.227118 | 8.437499 | 4.646145 | 4.798986 |
| ...   |     ...     |      ...    |   ...       |        ...  |    ...       |         ...  |       ...  |        ...   |      ...    |      ...    |       ...   |       ...   |

Save the result as a CSV file in the data folder. You can use the layout given in the notebook.

## 2.6 Recapitulation
We've seen a number of methods to read and manipulate data with pandas. As it is such a large library, we could still spend a lot of time discovering all its modules. However, you should now be able to come up with solutions tailored to a specific question:
- either by using the pandas documentation (recommended), or
- by searching on Stack Overflow or any other forum.

## 2.7 Next session
Explore how to visualize the data in the [next chapter](03_Visualization.ipynb)!