# Data science with Python

Here, we will work with a so-called proteomics dataset from breast cancer research.

Python has a library called pandas, which can be used to analyze large datasets.

## Load a dataset

In [0]:
# import the pandas module and name it "pd"
import pandas as pd

# download the proteomics dataset
url = "https://raw.githubusercontent.com/researchschool/datalab/master/breast_cancer_study.txt"

data = pd.read_csv(url, delimiter="\t")
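If the download is not available, the same loading step can be tried on a small in-memory table. This is only a sketch: the text in `raw_text` is made-up data, and only the column names `protein` and `description` are taken from the cells further down.

```python
import io

import pandas as pd

# Hypothetical tab-separated text standing in for the downloaded file
raw_text = (
    "protein\tdescription\ttumor_a\ttumor_b\n"
    "P1\tfirst protein\t1.5\t2.0\n"
    "P2\tsecond protein\t0.5\t4.0\n"
)

# delimiter="\t" tells pandas that the columns are separated by tabs
toy = pd.read_csv(io.StringIO(raw_text), delimiter="\t")
print(toy)
```

Without `delimiter="\t"`, pandas would look for commas and put each whole line into a single column.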

In [0]:
# have a first look at the data
data.head()

What do the columns, rows and values mean?

How big is the dataset?

In [0]:
data.shape

`data` is an object (a pandas `DataFrame`). This means we can access its attributes, such as `data.shape`, or call its methods, such as `data.head()`.
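The difference between an attribute and a method can be seen on a tiny example. The table `toy` below is made up for illustration and is not part of the breast cancer dataset.

```python
import pandas as pd

# Hypothetical toy table standing in for the proteomics data
toy = pd.DataFrame(
    {
        "protein": ["P1", "P2", "P3"],
        "tumor_a": [1.0, 2.0, 4.0],
        "tumor_b": [2.0, 2.0, 8.0],
    }
)

# An attribute is stored information: no parentheses
n_rows, n_cols = toy.shape

# A method is a function attached to the object: called with parentheses
first_two = toy.head(2)

print(n_rows, n_cols)
print(first_two)
```

Methods can take arguments: `head(2)` returns the first two rows instead of the default five.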

## Tidying up the dataset

In [0]:
# get rid of string values in the dataset

# remove the descriptions
data_no_desc = data.drop(columns="description")
data_no_desc.head()

In [0]:
# assign the protein name as row name
data_no_desc.index = data_no_desc["protein"]

# show the new row names (print is needed because this is not the last line of the cell)
print(data_no_desc.index)

# delete the protein column
data_no_desc_prot = data_no_desc.drop(columns="protein")

data_no_desc_prot.head()
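The tidying steps above can also be chained into a single expression. The following is a sketch on a made-up table (`toy` and its values are hypothetical); `set_index` moves a column into the row names and removes it from the columns in one go.

```python
import pandas as pd

# Hypothetical table with the same kind of columns as the real dataset
toy = pd.DataFrame(
    {
        "protein": ["P1", "P2"],
        "description": ["first protein", "second protein"],
        "tumor_a": [1.5, 0.5],
        "tumor_b": [2.0, 4.0],
    }
)

# Drop the text column, then move the protein names into the row index
toy_tidy = toy.drop(columns="description").set_index("protein")
print(toy_tidy)
```

After this step only numeric columns remain, which is what the math functions in the next section need.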

Now we take the log2 of all the values.

In [0]:
# import numpy to get a lot of math functions
import numpy as np

# take the log2 of all the numbers in our data matrix
data_log = np.log2(data_no_desc_prot)

data_log.head()
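Why log2? One common motivation, which is an assumption here and not stated for this dataset, is that a doubling and a halving become equally large steps in opposite directions. The values below are made up to show this.

```python
import numpy as np
import pandas as pd

# Hypothetical abundance ratios: doubled, unchanged, halved
ratios = pd.DataFrame(
    {"tumor_a": [2.0, 1.0, 0.5], "tumor_b": [4.0, 1.0, 0.25]},
    index=["P1", "P2", "P3"],
)

# np.log2 is applied element-wise and returns a DataFrame again
log_ratios = np.log2(ratios)
print(log_ratios)
```

Keep in mind that `np.log2` gives `-inf` for zeros and `NaN` for negative numbers, so it is worth checking the raw values first.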

## Inspecting the data

In [0]:
# show the expression profile of one protein
data_log.loc[["TP53"]].T.iloc[1:].plot()

In [0]:
# show multiple proteins
data_log.loc[["TP53", "CLPP"]].T.iloc[1:].plot()

## Find protein complexes in breast cancer

Proteins usually do not act on their own. They form complexes with themselves or other proteins. We can find such multi-protein complexes by comparing their expression profiles with each other. Proteins in a complex are often coregulated, meaning that their expression profiles across all the tumors look similar.

In [0]:
# calculate a correlation matrix (What is that by the way?)
corr_matrix = data_log.T.corr()

# show the head of the correlation matrix
corr_matrix.head()
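To answer the question in the comment: a correlation matrix holds one correlation coefficient for every pair of proteins. Here is a minimal sketch on made-up profiles (`toy_log` is hypothetical). By default `corr()` computes the Pearson correlation between columns, which is why the table is transposed first.

```python
import numpy as np
import pandas as pd

# Hypothetical log2 profiles: rows are proteins, columns are tumors
toy_log = pd.DataFrame(
    {
        "t1": [1.0, 2.0, 5.0],
        "t2": [2.0, 4.0, 4.0],
        "t3": [3.0, 6.0, 3.0],
        "t4": [4.0, 8.0, 2.0],
    },
    index=["P1", "P2", "P3"],
)

# corr() correlates columns, so transpose to put the proteins in the columns
toy_corr = toy_log.T.corr()
print(toy_corr)

# P1 and P2 rise together, P3 falls while P1 rises
print(np.isclose(toy_corr.loc["P1", "P2"], 1.0))
print(np.isclose(toy_corr.loc["P1", "P3"], -1.0))
```

The result is a square, symmetric table with ones on the diagonal, because every protein correlates perfectly with itself.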

In [0]:
# check if the correlation matrix is correct

# a good correlation (0.83)
data_log.loc[["A1BG", "A2M"]].T.iloc[1:].plot()

In [0]:
# a bad correlation
data_log.loc[["A1BG", "AAAS"]].T.iloc[1:].plot()

### Visualise the correlation matrix

In [0]:
import seaborn as sns

sns.heatmap(corr_matrix.iloc[6000:7000, 6000:7000], cmap="RdBu_r")

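We cannot really see what is going on, because similar proteins are not placed next to each other. Let's structure the data with some machine learning: hierarchical clustering reorders the rows and columns so that proteins with similar correlation patterns end up side by side.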

### Data clustering

In [0]:
sns.clustermap(corr_matrix.iloc[6000:7000, 6000:7000], cmap="RdBu_r")
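What does `clustermap` do to the matrix? Roughly, it builds a tree of similar rows and reorders the heatmap along that tree. The sketch below imitates this with SciPy on a made-up 4x4 matrix (`toy_corr` and its values are hypothetical). The settings `method="average"` and `metric="euclidean"` match the seaborn defaults, but this is an illustration, not seaborn's actual code.

```python
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage

# Hypothetical correlation matrix with two groups (A and B), rows interleaved
names = ["A1", "B1", "A2", "B2"]
toy_corr = pd.DataFrame(
    [
        [1.0, -0.8, 0.9, -0.7],
        [-0.8, 1.0, -0.7, 0.9],
        [0.9, -0.7, 1.0, -0.8],
        [-0.7, 0.9, -0.8, 1.0],
    ],
    index=names,
    columns=names,
)

# Build the tree from row-to-row Euclidean distances with average linkage
tree = linkage(toy_corr.values, method="average", metric="euclidean")

# Leaf order of the tree: similar rows end up next to each other
order = leaves_list(tree)
reordered = toy_corr.iloc[order, order]
print(list(reordered.index))
```

In the reordered matrix the two A proteins and the two B proteins sit next to each other, which is what makes the blocks visible in a clustered heatmap.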

Let's focus on two ribosomal proteins and plot their profiles.

In [0]:
data_log.loc[["RPL24", "RPL13"]].T.iloc[1:].plot()

## Identify different tumor groups

Breast cancer comes in different subgroups that can be treated with different medications and therapy plans. Here, we try to identify these subgroups.

In [0]:
# cluster the tumors
sns.clustermap(data_log, metric="correlation", cmap="viridis")
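The clustermap shows the groups visually, but it does not hand back a group label for each tumor. One possible way to get labels, sketched here with SciPy on made-up data, is to build the tree for the tumors and cut it into a chosen number of groups. The table `toy_log` and the choice of two groups are assumptions for illustration only.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical log2 data: 6 proteins (rows) x 4 tumors (columns)
toy_log = pd.DataFrame(
    {
        "tumor_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "tumor_2": [1.1, 2.1, 2.9, 4.2, 5.1, 5.9],
        "tumor_3": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
        "tumor_4": [5.8, 5.1, 4.1, 2.9, 2.2, 1.1],
    },
    index=[f"P{i}" for i in range(1, 7)],
)

# Tumors are columns, so transpose to treat each tumor as one observation
tree = linkage(toy_log.T.values, method="average", metric="correlation")

# Cut the tree into a chosen number of groups (2 is an assumption here)
labels = fcluster(tree, t=2, criterion="maxclust")
groups = pd.Series(labels, index=toy_log.columns, name="group")
print(groups)
```

With real data the number of groups is not known in advance, so it is worth trying several values and comparing the result with the clustermap.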
