
# **Functions and control structures**

**User-defined functions**

Functions are a great way to turn repetitive tasks into reusable pieces of code.

If you find yourself performing the same operations with different inputs multiple times, it's a good idea to write a function.

In R, a function takes zero or more inputs (its arguments), performs operations on them, and returns an output.

For example, here's a simple function that calculates the sum of the squares of two numbers, x and y.

In [15]:
sqSum <- function(x, y) {
  result <- x^2 + y^2
  return(result)
}
# now try the function out
sqSum(2,3)

Functions can also produce plots or print messages to the console. Here is a function that prints its result as a message:

In [16]:
sqSumPrint <- function(x, y) {
  result <- x^2 + y^2
  cat("here is the result:", result, "\n")
}
# now try the function out
sqSumPrint(2,3)

here is the result: 13 
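
Note that a function that only prints does not hand its result back to the caller: cat() returns NULL, so sqSumPrint() does too. The sketch below contrasts it with a variant that prints the message and also returns the value. The name sqSumReport is hypothetical and used only for illustration.

```r
# sqSumPrint() as defined above: cat() returns NULL, so the function does too
sqSumPrint <- function(x, y) {
  result <- x^2 + y^2
  cat("here is the result:", result, "\n")
}

# Hypothetical variant that prints the message and also returns the value
sqSumReport <- function(x, y) {
  result <- x^2 + y^2
  cat("here is the result:", result, "\n")
  invisible(result)  # returned, but not auto-printed at the console
}

a <- sqSumPrint(2, 3)   # a is NULL
b <- sqSumReport(2, 3)  # b holds the computed value, 13
is.null(a)
b
```

Using invisible() is one possible design choice here: it keeps the console output unchanged while still letting you store the result in a variable.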


In R, if statements are used to execute code only when certain conditions are met. You can use them in functions to perform different actions depending on the input values. Here's an example where we use an if statement inside a function to classify a CpG island as "Large", "Normal", or "Short" based on its length.

In [17]:
# Define a function to classify the CpG island length
classify_cpg_island <- function(length) {
  # Check if the length is greater than 1000
  if (length > 1000) {
    return("Large")  # If greater than 1000, classify as "Large"
  }
  # Check if the length is between 500 and 1000
  else if (length >= 500) {
    return("Normal")  # If between 500 and 1000, classify as "Normal"
  }
  # If the length is less than 500
  else {
    return("Short")  # If less than 500, classify as "Short"
  }
}

# Example usage of the function
result1 <- classify_cpg_island(1200)  # This should return "Large"
result2 <- classify_cpg_island(800)   # This should return "Normal"
result3 <- classify_cpg_island(300)   # This should return "Short"

# Print the results
print(result1)
print(result2)
print(result3)

[1] "Large"
[1] "Normal"
[1] "Short"


The function classify_cpg_island() takes a single argument length (the length of a CpG island).

The if statement checks:

- If the length is greater than 1000, it returns "Large".
- If the length is between 500 and 1000 (inclusive), it returns "Normal".
- If the length is less than 500, it returns "Short".
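
Because if() expects a single TRUE or FALSE value, classify_cpg_island() handles one length at a time. As a minimal sketch, the same thresholds can be applied to a whole vector with nested ifelse() calls. The function name classify_cpg_islands_vec and the example lengths are made up for illustration.

```r
# Same thresholds as classify_cpg_island(), applied to a whole vector
classify_cpg_islands_vec <- function(lengths) {
  ifelse(lengths > 1000, "Large",
         ifelse(lengths >= 500, "Normal", "Short"))
}

island_lengths <- c(1200, 800, 300, 1000, 500)  # made-up example lengths
classes <- classify_cpg_islands_vec(island_lengths)
print(classes)
```

The boundary values 1000 and 500 both fall into "Normal", matching the inclusive range described above.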

# **Loops and looping structures in R**

In R, a for-loop is used to repeat a task multiple times. It runs once for each element of a sequence, so the number of iterations is fixed in advance (a while-loop is the construct that keeps running until a condition is met). Below is an example of a for-loop that executes a task 10 times.

In [18]:
for (i in 1:10) { # number of repetitions
  cat("This is iteration", i, "\n") # the task to be repeated
}

This is iteration 1 
This is iteration 2 
This is iteration 3 
This is iteration 4 
This is iteration 5 
This is iteration 6 
This is iteration 7 
This is iteration 8 
This is iteration 9 
This is iteration 10 


Let us calculate the length of the CpG islands.

The loop calculates the length of the first 100 CpG islands by subtracting the start coordinate from the end coordinate. It iterates over the first 100 rows of the data frame and stores the calculated lengths. For large datasets, start with a few repetitions to test the loop's functionality.

In [19]:
# Example data frame with start and end coordinates for CpG islands (assuming you have such a dataset)
# Here we'll simulate a small dataset for the demonstration
cpg_islands <- data.frame(
  start = sample(1000:10000, 200, replace = TRUE),  # Random start coordinates
  end = sample(10001:20000, 200, replace = TRUE)   # Random end coordinates
)

# Function to calculate the length of CpG islands
calculate_lengths <- function(cpg_data, num_islands = 100) {
  lengths <- numeric(num_islands)  # Initialize a vector to store the lengths

  for (i in seq_len(num_islands)) {  # seq_len() is safe even when num_islands is 0
    # Calculate length by subtracting start from end
    lengths[i] <- cpg_data$end[i] - cpg_data$start[i]
  }

  return(lengths)
}

# Calculate the lengths of the first 100 CpG islands
cpg_lengths <- calculate_lengths(cpg_islands)

# Print the lengths of the first 100 CpG islands
print(cpg_lengths)

  [1]  9849  9106 10626  8250  4848 14594  8183 11490 12375  3653  5795 10237
 [13]  9467  9267  6874  5759  7326  5479  2888  9995  8845  4930  8287 14905
 [25]  4854 10639 13328  1112  9312 15961 10145  2159  8240  5883 10341 10547
 [37]  7060 10736 10678 14899 13108 14527 16173  8234 11664  8879  9660 14594
 [49]  4161  6602 11257 11851 15280  5936 15468 10401 16863  1361 10347 11977
 [61]  6232 13608  9583 11638 10011  2819  5020 15969  8983 15570  2600 16855
 [73]  5253 11895  6731  6286  4587 10903 14599  7751 10524 13207  9228  9992
 [85]  4113  7650 10931  7941 10371 11163  5679  9089 10329  7764  7932 13179
 [97] 11034  8758 17000  3583
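
To illustrate the advice about testing with a few repetitions first, here is a minimal sketch that runs the same loop on only 5 islands and cross-checks the result against a direct subtraction. The set.seed() call is an assumption added here so that the simulated data is reproducible. It is not part of the original cell, so the numbers will differ from the output above.

```r
set.seed(42)  # assumption: fixed seed so the simulated data is reproducible
cpg_islands <- data.frame(
  start = sample(1000:10000, 200, replace = TRUE),
  end   = sample(10001:20000, 200, replace = TRUE)
)

# Same loop as above, repeated here so the block runs on its own
calculate_lengths <- function(cpg_data, num_islands = 100) {
  lengths <- numeric(num_islands)
  for (i in seq_len(num_islands)) {
    lengths[i] <- cpg_data$end[i] - cpg_data$start[i]
  }
  lengths
}

# Trial run on only the first 5 islands
trial <- calculate_lengths(cpg_islands, num_islands = 5)

# Cross-check against a direct subtraction of the first 5 rows
expected <- cpg_islands$end[1:5] - cpg_islands$start[1:5]
all(trial == expected)
```

Once the small run looks right, the same call can be repeated with a larger num_islands.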


# **Apply family functions instead of loops**

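The apply family (apply(), lapply(), sapply(), mapply()) applies a function to each element of a vector or list, or to each row or column of a matrix, without an explicit loop. When dealing with large datasets, you can use their parallel counterparts from the parallel package, such as mclapply(). These functions divide the workload into smaller tasks, which are processed concurrently on separate CPU cores. The results from each worker are then combined into a single output, maintaining the order of the input data.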
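
As a minimal sketch of this idea, the block below runs the same toy task with lapply(), sapply(), and mclapply(). The square() function and the choice of 2 cores are assumptions for illustration. mclapply() relies on forking, which is not available on Windows, so the sketch falls back to a single core there.

```r
library(parallel)  # ships with base R

square <- function(x) x^2  # toy task standing in for a heavier computation

# Sequential versions from the apply family
res_list <- lapply(1:5, square)  # returns a list
res_vec  <- sapply(1:5, square)  # simplifies the result to a vector

# Parallel version; forking is not available on Windows, so use 1 core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
res_par <- mclapply(1:5, square, mc.cores = n_cores)

# The parallel results come back in the same order as the input
identical(unlist(res_list), unlist(res_par))
```

For a task this small the parallel version is unlikely to be faster, since starting the workers has its own cost. The benefit shows up when each task is expensive.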

**Vectorized functions in R**


In R, many operations don't require loops because vectorized functions operate on whole vectors directly. For example, instead of using mapply() with sum() to add the vectors Xs and Ys element by element, you can simply use the + operator. This makes the code simpler and usually faster.

In [20]:
# Define Xs and Ys vectors
Xs <- c(1, 2, 3, 4, 5)
Ys <- c(6, 7, 8, 9, 10)

# Perform the addition
result <- Xs + Ys

# Print the result
print(result)

[1]  7  9 11 13 15
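
For comparison, here is a sketch of the mapply() version mentioned above, next to the vectorized one. Both produce the same values. The variable names via_mapply and via_plus are chosen here for illustration.

```r
Xs <- c(1, 2, 3, 4, 5)
Ys <- c(6, 7, 8, 9, 10)

# Element-by-element version: mapply() calls sum() once per pair of values
via_mapply <- mapply(function(x, y) sum(x, y), Xs, Ys)

# Vectorized version: a single call handles the whole vectors
via_plus <- Xs + Ys

all.equal(via_mapply, via_plus)
```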


To get the column or row sums of a matrix, we can use the vectorized functions colSums() and rowSums(). Here's an example with colSums():

In [23]:
# Example matrix
mat <- matrix(1:9, nrow = 3, byrow = TRUE)

# Calculate the column sums
column_sums <- colSums(mat)

# Print the result
print(column_sums)

[1] 12 15 18


Similarly, to get the row sums, you can use the rowSums() function:

In [24]:
# Calculate the row sums
row_sums <- rowSums(mat)

# Print the result
print(row_sums)

[1]  6 15 24
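
The same sums can also be computed with apply() from the apply family, which is handy when no dedicated vectorized function exists for the operation you need. This is a minimal sketch. The variable names row_sums_apply and col_sums_apply are chosen here for illustration.

```r
mat <- matrix(1:9, nrow = 3, byrow = TRUE)

# MARGIN = 1 applies sum() to each row, MARGIN = 2 to each column
row_sums_apply <- apply(mat, 1, sum)
col_sums_apply <- apply(mat, 2, sum)

# Both approaches agree with rowSums() and colSums()
all(row_sums_apply == rowSums(mat))
all(col_sums_apply == colSums(mat))
```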
