# 3. Data Manipulation Techniques

## Motivation

- So far we have been lucky that all our data have been in the same file:
    + This is not usually the case
    + A dataset may be spread over several files
        + Working with such data takes longer, and is harder, than many people realise
    + We need to combine the files before doing an analysis



## Combining data from multiple sources: Gene Clustering Example

- R has powerful functions to combine heterogeneous data sources into a single data set
- Gene clustering example data:
    + Gene expression values in ***gene.expression.txt***
    + Gene information in ***gene.description.txt***
    + Patient information in ***cancer.patients.txt***
- A breast cancer dataset with numerous patient characteristics:
    + We will concentrate on ***ER status*** (positive / negative)
    + Which genes show a statistically significant difference in expression between the ER groups?
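
As a minimal sketch of what "combining" means, the two toy tables below stand in for separate files. The sample names and values are made up for illustration and are not taken from the course data. Base R's `merge()` joins two data frames on a shared key column.

```r
# Toy stand-ins for two separate files (illustrative values only)
patients <- data.frame(
  samplename = c("S1", "S2", "S3"),
  er         = c(1, 0, 1),
  stringsAsFactors = FALSE
)
expression <- data.frame(
  samplename = c("S3", "S1", "S2"),   # note: rows are in a different order
  geneA      = c(0.42, -0.26, 0.35),
  stringsAsFactors = FALSE
)

# merge() lines the rows up by the shared key, whatever their original order
combined <- merge(patients, expression, by = "samplename")
combined
```

Matching on a key column, rather than relying on row position, guards against files that list the same samples in different orders.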

## Analysis goals

- We will show how to look up a particular gene in the dataset
- Also, how to look up genes in a given genomic region
- Perform a "sanity check" to see whether a previously known gene exhibits a difference in our dataset
- How many genes on chromosome 8 are differentially expressed?
- Create a heatmap to cluster the samples and reveal any subgroups in the data
    + Do the subgroups agree with our prior knowledge about the samples?
    
The gene expression values are read into a data frame called `normalizedValues`:

- `r nrow(normalizedValues)` rows and `r ncol(normalizedValues)` columns
- One row for each gene:
    + Rows are named according to the particular technology used to make the measurement
    + The names of each row can be returned by `rownames(normalizedValues)`, giving a vector
- One column for each patient:
    + The names of each column can be returned by `colnames(normalizedValues)`, giving a vector
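
The same accessors can be tried on a small table first. This is only a sketch: `toyValues`, its probe and patient names, and its values are hypothetical stand-ins, not part of the course data.

```r
# Toy expression table: 3 genes (rows) x 2 patients (columns); illustrative values
toyValues <- data.frame(
  patientA  = c(-0.26, -0.06, -0.31),
  patientB  = c(0.35, 0.04, 0.05),
  row.names = c("probe1", "probe2", "probe3")
)

dim(toyValues)        # number of rows, then number of columns
rownames(toyValues)   # character vector of probe IDs
colnames(toyValues)   # character vector of patient IDs

# Row and column names can be used to pull out a single measurement
toyValues["probe2", "patientB"]
```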

In [2]:
# Read the tab-delimited gene expression values into a data frame
normalizedValues <- read.delim("Basic_R_Course/gene.expression.txt")
head(normalizedValues)

Unnamed: 0,NKI_4,NKI_6,NKI_7,NKI_8,NKI_9,NKI_11,NKI_12,NKI_13,NKI_14,NKI_17,...,NKI_393,NKI_394,NKI_395,NKI_396,NKI_397,NKI_398,NKI_401,NKI_402,NKI_403,NKI_404
Contig56678_RC,-0.261,0.346,0.047,-1.14,-0.11,0.253,-1.199,-0.115,0.057,0.308,...,0.045,0.139,-0.06,-0.439,0.038,-0.576,-0.019,-0.256,-0.018,-0.1
AF026004,-0.064,0.04,-0.165,-0.031,0.33,0.049,-0.211,-0.091,0.026,-0.061,...,0.039,,,,0.08,-0.385,0.165,-0.091,-0.002,0.249
AB033049,-0.307,0.046,-0.139,0.036,-0.154,-0.024,-0.057,-0.293,-0.197,0.132,...,0.073,0.048,-0.097,,-0.018,-0.427,0.088,-0.4,0.048,-0.122
AB033050,0.582,0.216,0.091,-0.186,-0.156,0.036,-0.15,-0.015,0.075,-0.103,...,-0.09,-0.017,-0.075,-0.296,0.005,-0.252,-0.14,-0.27,-0.31,-0.397
AB033086,-2.0,0.102,-0.016,-0.358,0.153,-0.191,0.332,-0.14,-0.075,-0.089,...,-0.186,-0.115,,-0.254,0.252,0.109,0.424,-0.285,0.156,0.323
NM_003008,-0.734,-0.085,0.163,-0.233,0.344,-0.222,-0.417,-0.295,0.22,-0.032,...,0.023,-0.026,0.07,0.058,-0.106,-0.949,0.447,-0.212,0.099,0.313


In [4]:
# Read the gene annotation, keeping text columns as character vectors rather than factors
geneAnnotation <- read.delim("Basic_R_Course/gene.description.txt", stringsAsFactors = FALSE)
head(geneAnnotation)

Unnamed: 0,probe,HUGO.gene.symbol,Chromosome,Start
Contig56678_RC,Contig56678_RC,THSD4,chr15,71433788
AF026004,AF026004,CLCN2,chr3,184063973
AB033049,AB033049,ANKRD50,chr4,125585207
AB033050,AB033050,ZMIZ1,chr10,80828792
AB033086,AB033086,NLGN4X,chrX,5808083
NM_003008,NM_003008,SEMG2,chr20,43850010



- `r nrow(geneAnnotation)` rows and `r ncol(geneAnnotation)` columns
- One row for each gene
- Includes a mapping between the manufacturer's probe ID (`probe`) and the gene name (`HUGO.gene.symbol`)
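
To illustrate how such a mapping can be used for the look-ups listed in the analysis goals, the sketch below re-types the six annotation rows printed above into a small data frame called `toyAnnotation` (a hypothetical name). The region limits are chosen arbitrarily for the example.

```r
# The six annotation rows shown by head(geneAnnotation), re-typed by hand
toyAnnotation <- data.frame(
  probe            = c("Contig56678_RC", "AF026004", "AB033049",
                       "AB033050", "AB033086", "NM_003008"),
  HUGO.gene.symbol = c("THSD4", "CLCN2", "ANKRD50", "ZMIZ1", "NLGN4X", "SEMG2"),
  Chromosome       = c("chr15", "chr3", "chr4", "chr10", "chrX", "chr20"),
  Start            = c(71433788, 184063973, 125585207, 80828792, 5808083, 43850010),
  stringsAsFactors = FALSE
)

# Which probe measures a given gene?
toyAnnotation$probe[toyAnnotation$HUGO.gene.symbol == "ZMIZ1"]

# Which genes start within a genomic region? (arbitrary limits on chr4)
inRegion <- toyAnnotation$Chromosome == "chr4" &
  toyAnnotation$Start > 100000000 & toyAnnotation$Start < 150000000
toyAnnotation[inRegion, ]
```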

In [6]:
# Read the patient information, keeping text columns as character vectors rather than factors
patientMetadata <- read.delim("Basic_R_Course/cancer.patients.txt", stringsAsFactors = FALSE)
head(patientMetadata)

Unnamed: 0,samplename,age,er,grade
NKI_4,NKI_4,41,1,3
NKI_6,NKI_6,49,1,2
NKI_7,NKI_7,46,0,1
NKI_8,NKI_8,48,0,3
NKI_9,NKI_9,48,1,3
NKI_11,NKI_11,37,1,3


- One row for each patient in the study
- Each column is a different characteristic of that patient
    + e.g. whether a patient is ER positive (value of 1) or negative (value of 0)

In [8]:
table(patientMetadata$er)


  0   1 
 88 249 
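
The 0 / 1 coding can be made easier to read by attaching labels before tabulating. This is a small sketch using only the six `er` values printed by `head(patientMetadata)` above; the label names "negative" and "positive" are our own choice.

```r
# The er values of the first six patients, as printed above
er <- c(1, 1, 0, 0, 1, 1)

# Turn the numeric codes into a labelled factor, then count each group
erLabel <- factor(er, levels = c(0, 1), labels = c("negative", "positive"))
table(erLabel)
```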

To get a feel for these data, we will look at how we can subset and order them

- R allows us to do the kinds of filtering, sorting and ordering operations you might be familiar with in Excel
- For example, if we want to get information about patients that are ER negative
    + these are indicated by an entry of ***0*** in the `er` column

In [10]:
patientMetadata$er == 0
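
A short sketch of what such a comparison produces, using only the six `er` values printed by `head(patientMetadata)` earlier. The variable name `isNeg` is introduced here for illustration.

```r
# The er values of the first six patients, as printed earlier
er <- c(1, 1, 0, 0, 1, 1)

isNeg <- er == 0
isNeg          # FALSE FALSE TRUE TRUE FALSE FALSE
sum(isNeg)     # TRUE counts as 1, so this is the number of ER negative patients
which(isNeg)   # positions of the TRUE values
```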

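The comparison returns a logical vector (`TRUE` / `FALSE`) with one entry per patient. We can do the comparison within the square brackets to keep only the rows where it is `TRUE`

- Remember to include a `,` after the row condition; leaving the column position empty keeps all the columns
- It is best practice to create a new variable and leave the original data frame untouched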

In [12]:
erNegPatients <- patientMetadata[patientMetadata$er == 0,]
head(erNegPatients)

Unnamed: 0,samplename,age,er,grade
NKI_7,NKI_7,46,0,1
NKI_8,NKI_8,48,0,3
NKI_12,NKI_12,46,0,3
NKI_24,NKI_24,49,0,3
NKI_28,NKI_28,40,0,3
NKI_44,NKI_44,53,0,3
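
Conditions can also be combined, and the position after the comma can name the columns to keep. The sketch below uses a hypothetical `toyPatients` data frame built from the six rows printed by `head(patientMetadata)` earlier, so that it runs without the course files.

```r
# The six patient rows shown by head(patientMetadata), re-typed by hand
toyPatients <- data.frame(
  samplename = c("NKI_4", "NKI_6", "NKI_7", "NKI_8", "NKI_9", "NKI_11"),
  age        = c(41, 49, 46, 48, 48, 37),
  er         = c(1, 1, 0, 0, 1, 1),
  grade      = c(3, 2, 1, 3, 3, 3),
  stringsAsFactors = FALSE
)

# Rows that are ER negative AND grade 3 (both conditions must be TRUE)
erNegGrade3 <- toyPatients[toyPatients$er == 0 & toyPatients$grade == 3, ]
erNegGrade3

# Keep only some columns by naming them after the comma
erNegSummary <- toyPatients[toyPatients$er == 0, c("samplename", "age")]
erNegSummary
```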


In [13]:
# View() opens a spreadsheet-style viewer in RStudio; it is not supported in the Jupyter R kernel
View(erNegPatients)


In [15]:
sort(erNegPatients$grade)

- `sort()` returns the values of a vector in sorted order, but this is not useful in all cases
    + We have lost the extra information that we have about the patients
    
- Instead, we can use **`order()`**
- Given a vector, `order()` returns the positions (indices) of its elements in sorted order; subsetting with these positions gives an ordered version of the vector
    + The default is smallest --> largest; use `decreasing = TRUE` for the reverse


In [17]:
myvec <- c(90,100,40,30,80,50,60,20,10,70)
myvec
order(myvec)

In [25]:
# i.e. the number in position 9 is the smallest, and the number in position 8 is the second smallest:
myvec[9]
myvec[8]

In [21]:
# N.B. `order` will also work on character vectors (alphabetical order)
firstName <- c("Adam", "Eve", "John", "Mary", "Peter", "Paul", "Joanna", "Matthew", "David", "Sally")
order(firstName)

- We can use the result of `order()` to subset our original vector
- The result is an ordered vector

In [23]:
myvec.ord <- myvec[order(myvec)]
myvec.ord
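
As a quick check of this idea, the sketch below confirms that subsetting by `order()` reproduces `sort()`, and shows the `decreasing` argument of `order()` reversing the direction. The name `myvec.desc` is introduced here for illustration.

```r
myvec <- c(90, 100, 40, 30, 80, 50, 60, 20, 10, 70)

# Subsetting by order() gives the same values as sort()
identical(myvec[order(myvec)], sort(myvec))

# decreasing = TRUE puts the largest value first
myvec.desc <- myvec[order(myvec, decreasing = TRUE)]
myvec.desc
```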

- Implication: We can use `order` on a particular column of a data frame, and use the result to sort all the rows

- We might want to select the youngest ER negative patients for a follow-up study
- Here we order the `age` column and use the result to re-order the rows in the data frame


In [26]:
erNegPatientsByAge <- erNegPatients[order(erNegPatients$age),]
head(erNegPatientsByAge)

Unnamed: 0,samplename,age,er,grade
NKI_330,NKI_330,26,0,3
NKI_57,NKI_57,28,0,3
NKI_230,NKI_230,28,0,3
NKI_90,NKI_90,29,0,3
NKI_48,NKI_48,30,0,3
NKI_86,NKI_86,30,0,3
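
Two common follow-ups are keeping only the first few rows of an ordered data frame and ordering by more than one column. The sketch below is an illustration only: it uses a hypothetical `toyPatients` data frame built from the six rows printed by `head(patientMetadata)` earlier, not the ER negative subset.

```r
# The six patient rows shown by head(patientMetadata), re-typed by hand
toyPatients <- data.frame(
  samplename = c("NKI_4", "NKI_6", "NKI_7", "NKI_8", "NKI_9", "NKI_11"),
  age        = c(41, 49, 46, 48, 48, 37),
  er         = c(1, 1, 0, 0, 1, 1),
  grade      = c(3, 2, 1, 3, 3, 3),
  stringsAsFactors = FALSE
)

# The three youngest patients: order the rows by age, then keep the first three
youngest <- head(toyPatients[order(toyPatients$age), ], n = 3)
youngest$samplename

# Several keys: highest grade first (note the minus sign), then youngest first within a grade
byGradeThenAge <- toyPatients[order(-toyPatients$grade, toyPatients$age), ]
byGradeThenAge$samplename
```

Negating a numeric column is a simple way to reverse the direction of one key while leaving the others ascending.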
