# **Welcome to MXB107: Probability and Statistics**

```
.------------------------------------.
|   __  ____  ______  _  ___ _____   |
|  |  \/  \ \/ / __ )/ |/ _ \___  |  |
|  | |\/| |\  /|  _ \| | | | | / /   |
|  | |  | |/  \| |_) | | |_| |/ /    |
|  |_|  |_/_/\_\____/|_|\___//_/     |
'------------------------------------'

```


In this unit, you will be introduced to the R programming language through Google Colab, with a focus on its applications in statistical modeling. There's no need to install R or [RStudio](https://posit.co/download/rstudio-desktop/) on your computer—just an internet connection and a web browser are all you need.

## **About Google Colab**

Google Colab is a free, cloud-based platform that allows you to write and run code directly from your browser, with no setup required. Originally designed for Python, it also supports other languages like R. Colab provides access to powerful computing resources, including GPUs, making it ideal for data analysis, machine learning, and interactive coding tutorials. Its integration with Google Drive ensures that your notebooks are automatically saved and easily shareable, making collaboration seamless and efficient.

## **Switching to the R Kernel in Google Colab**

By default, Google Colab uses Python as its programming language. To use R instead, you’ll need to manually switch the kernel by going to **Runtime > Change runtime type**, and selecting R as the kernel. This allows you to run R code in the Colab environment.

However, our notebook is already configured to use R by default. Unless something goes wrong, you shouldn’t need to manually change runtime type.

### **Executing your first R code cell**

**To run a code cell**:

- Click the ▶️ button to the left of the cell,  
- Or press **Ctrl + Enter** (on Windows),  
- Or press **Cmd + Return** (on macOS).


This code cell will output **something** only if the R kernel is active; otherwise, it will produce an error.

In [None]:
if (!require("cowsay")) install.packages("cowsay"); library("cowsay")
say("Welcome to MXB107!", by = "pig")

### **Creating your first R code cell**

To create a new code cell, click the `+ Code` button at the top-left of the notebook. To insert a code cell immediately below an existing one, first click inside that code cell.



#### **Exercise**

Create a new code cell below this, enter `say("Your first code cell!", by = "cat")`, and execute it.

<details>
<summary>▶️ Click to show the solution</summary>

```r
say("Your first code cell!", by = "cat")
```

</details>

## **Introduction to the R Programming Language**

## **Atomic Data Types in R**


Atomic types are the simplest building blocks in R that store individual values such as numbers, text, or logicals. These atomic types form the foundation for creating more complex data structures such as vectors, lists, and data frames.

Think of atomic types like the individual cells in an Excel spreadsheet, each holding a single piece of data. Just as cells combine to create sheets and workbooks, atomic types combine in R to build complex datasets.

### **Double**

Double (double-precision floating-point) values represent real numbers (with or without decimals) and are one of the most common atomic types in R. For example, `0.1` is stored as a double.

In [None]:
0.1
typeof(0.1)

#### **Is This Exactly 0.1**?

Not necessarily (see [IEEE 754 - Wikipedia](https://en.wikipedia.org/wiki/IEEE_754)). There are uncountably many real numbers, but a computer has only finite memory, so most real numbers, including 0.1, are stored with a small representation error. These errors are usually negligible, but they can propagate and affect the accuracy of your results if not handled correctly (beyond the scope of this unit).

#### **Is `0.1 + 0.2` equal to `0.3`**?

In [None]:
0.1 + 0.2 == 0.3

**FYI: What about their numerical representations in our computers?**

In [None]:
writeBin(0.1 + 0.2, raw(), size = 8)  # 8 bytes = 64 bits
writeBin(0.3, raw(), size = 8)

They differ only in the least significant byte (`34` vs `33`), which `writeBin()` prints first on little-endian machines (the common case). A difference of one in that byte is the smallest possible change in a double.

**`all.equal()` to the Rescue: Safer Number Comparison in R**

In [None]:
all.equal(0.1 + 0.2, 0.3)
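
A small sketch of how this comparison might be used in practice: `all.equal()` returns `TRUE` when two numbers agree up to a small tolerance, and otherwise a character string describing the difference, so it is commonly wrapped in `isTRUE()` when a plain `TRUE`/`FALSE` is needed.

```r
0.1 + 0.2 == 0.3                   # FALSE: exact comparison of two doubles
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE: equal up to a small tolerance
all.equal(1, 1.1)                  # a character string describing the difference
isTRUE(all.equal(1, 1.1))          # FALSE: the difference is far above the tolerance
```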

### **Integer**

In R, integers represent whole numbers without decimal points. We can create them using the `L` suffix, like `5L`, and check their type with `typeof(5L)`.

In [None]:
5L
typeof(5L)
typeof(5)

**FYI**: Integers and doubles are stored differently in memory. Integers use a fixed number of bits to represent whole numbers exactly, while doubles (floating-point numbers) follow the IEEE 754 standard to represent real numbers approximately.
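
As a brief illustration of the distinction, the sketch below checks and converts between the two types using base R functions only. Note that an integer and a double holding the same value compare as equal with `==`, even though they are not identical objects.

```r
is.integer(5)          # FALSE: a plain 5 is a double
is.integer(5L)         # TRUE
typeof(as.integer(5))  # "integer": explicit conversion from double
5L == 5                # TRUE: the values are equal
identical(5L, 5)       # FALSE: the types differ
.Machine$integer.max   # 2147483647: the largest integer R can store
```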

### **Character**

Characters in R represent text data, such as letters, words, or sentences. They are enclosed in either single or double quotes, for example, `"hello"` or `'world'`.  

This data type is common to most programming languages and is used for many basic but essential tasks, such as specifying a path to a directory (e.g., `"content/path"`), naming something (e.g., an R package like `"ggplot2"`), or storing textual information.

In [None]:
"Welcome to MXB107!"
typeof("Welcome to MXB107!")

#### **Print a `character`**

Printing characters is useful when you want to display messages during execution, such as status updates or error messages (there are dedicated functions in R for issuing warnings and errors).

In [None]:
print("Faculty of Science")

#### **What Does \# Do**?

In R, anything following # on a line is treated as a comment and is not executed. It’s good practice to include comments in your code to explain what it does, making it easier to understand and maintain.
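
A short illustration of both comment styles (a full-line comment and an end-of-line comment); the arithmetic is only a placeholder:

```r
# A full-line comment: R skips this line entirely
3 * 4   # an end-of-line comment: R evaluates 3 * 4 and ignores the rest
# print("This call is commented out, so it never runs")
```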

### **Logical**

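Logical values in R are `TRUE` and `FALSE`. `T` and `F` are built-in shorthands for them, but unlike `TRUE` and `FALSE` they are ordinary variables that can be overwritten, so the full names are safer in your own code. Logical values are used for decisions, filtering, and control flow, and can be created via comparison operators or, more generally, by evaluating expressions in R.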


In [None]:
T
TRUE
F
FALSE

R can generate logical results via comparison operators:

- `==` for equality
- `<` / `<=` for less than and less than or equal to
- `>` / `>=` for greater than and greater than or equal to

In [None]:
5 < 3
"a" == "a"
"a" < "b"

More generally, R can generate logical results by evaluating expressions. For example, functions like `is.numeric()` return `TRUE` or `FALSE` depending on the input.


In [None]:
is.numeric("a")
is.numeric(1)

## **R as a Calculator**

R is essentially just a really powerful calculator, kind of like those you may have been required to purchase for high school maths, only better and with a much higher limit on what it can achieve.

Doing computation in R involves writing code. We can create a code cell, type directly into that and run code, and it will give us an answer like a calculator.

### **Double Operations**

Common double operators include:

- `+` for addition (e.g., `2.5 + 1.5`)
- `-` for subtraction (e.g., `5.0 - 2.0`)
- `*` for multiplication (e.g., `3.0 * 4.0`)
- `/` for division (e.g., `10.0 / 2.0`)
- `^` for exponentiation (e.g., `10.0 ^ 2.0`)
- `%%` for modulo (remainder) (e.g., `10 %% 2`)

These operations follow standard arithmetic rules and return results as doubles. We can use parentheses `()` to change the order of evaluation and control operator precedence in expressions. See [R: Operator Syntax and Precedence](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Syntax.html) for more details.

In [None]:
2.5 + 1.5
typeof(2.5+1.5)

#
2*(5 - 2)^2 + 3
typeof(2*(5 - 2)^2 + 3)

#### **Exercise**

Work out how to divide 50 by 4, then compute 14 multiplied by the sum of 4 and 6, divided by 2. What are the data types of the results?

<details>
<summary>▶️ Click to show the solution</summary>

```r
50/4
typeof(50/4)

14*(4+6)/2
typeof(14*(4+6)/2)
```

</details>

### **Integer Operations**

All double operators apply to integers as well, with some caveats.

#### **Integer Addition/Subtraction/Multiplication**

The output type stays `"integer"` as long as both operands are integers, unlike double operations.




In [None]:
1L+1L
typeof(1L+1L)
typeof(1+1)

#### **Integer division**

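Unlike C/C++, where dividing two integers performs integer division, the `/` operator in R always returns a double, so the result is `1.666...` rather than `1L`, as we desire. (Integer division is available through the separate `%/%` operator.)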

In [None]:
5L/3L
typeof(5L/3L)

5/3
typeof(5/3)
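
For comparison, here is a minimal sketch of base R's integer-division operator `%/%` alongside the remainder operator `%%`. When both operands are integers, both operators return an integer.

```r
5L %/% 3L          # 1: integer division (the quotient, rounded down)
5L %% 3L           # 2: the remainder
typeof(5L %/% 3L)  # "integer"
typeof(5L / 3L)    # "double": ordinary division always returns a double
(-5) %/% 3         # -2: the quotient is rounded towards minus infinity
```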

#### **Exercise**



What is the output of integer 5 raised to the power of integer 3? What is the data type of the result?

<details>
<summary>▶️ Click to show the solution</summary>

```r
5L^3L
typeof(5L^3L)
```

</details>

### **Logical Operators**

Logical operators allow you to combine or modify logical values (`TRUE`/`FALSE`):

- `&` : Element-wise AND  
- `|` : Element-wise OR  
- `!` : NOT (negation)  
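- `&&` : Short-circuit AND (expects single `TRUE`/`FALSE` values; the right-hand side is evaluated only if needed)  
- `||` : Short-circuit OR (expects single `TRUE`/`FALSE` values; the right-hand side is evaluated only if needed)

The difference between `&` and `&&` (and similarly `|` vs `||`) is subtle and mainly relevant when working with vectors: `&` and `|` compare vectors element by element, whereas `&&` and `||` work on single values (since R 4.3.0, passing them a longer vector is an error; older versions silently used only the first element).  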


In [None]:
T & F
!(T&F)
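
The sketch below contrasts the element-wise and short-circuit operators. It uses `c()` to build small logical vectors (vectors are covered later in this notebook) and `stop()` purely to show that the right-hand side is never evaluated.

```r
# Element-wise operators work position by position on vectors
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)    # TRUE FALSE FALSE
c(TRUE, FALSE, TRUE) | c(FALSE, FALSE, FALSE)  # TRUE FALSE TRUE

# Short-circuit operators stop as soon as the answer is known,
# so the call to stop() below is never reached
FALSE && stop("this is never evaluated")       # FALSE
TRUE || stop("this is never evaluated")        # TRUE
```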

#### **Exercises**

Combine `3^2 < 2^3` and `log(3) > 1.5` via the logical `AND` operator. What is the result? Verify the result explicitly in R.

<details>
<summary>▶️ Click to show the solution</summary>

```r
(3^2 < 2^3) & (log(3) > 1.5)
```

</details>

Verify that `3 < 5 < 7`.

Note that `3 < 5 < 7` is **not** a valid expression in R. You must verify each comparison separately and combine the results, e.g., `(3 < 5) & (5 < 7)`.
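
One way to check a chained inequality, sketched with the logical AND operator introduced above, is to split it into two comparisons:

```r
# 3 < 5 < 7 is split into two comparisons joined by AND
(3 < 5) & (5 < 7)   # TRUE

# The same pattern reports FALSE when either comparison fails
(3 < 9) & (9 < 7)   # FALSE
```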

## **Assignment**

Assignment in R can be done using the `<-` or `=` operators (e.g., `x <- 10`). The expression on the right side is evaluated, and its value is stored in the variable on the left.

In [None]:
x = 5^2
y = 2^5
x>y

#
x2 <- 5^2
x == x2 #Equal or not?

A cool feature in R is the right-arrow assignment operator `->`, which assigns values from left to right. It can make your code more readable or convenient in certain cases, especially when you want to write code in a more natural flow.

In [None]:
5 -> x
print(x)

Some variables in R, like `pi`, come pre-assigned with built-in constant values that we can use directly in calculations.

In [None]:
pi

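In R, reserved words like `if`, `for`, and `TRUE` cannot be used as variable names because they have special meanings in the language. The shorthands `T` and `F` are not reserved, so R will let you overwrite them, but doing so is best avoided.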

#### **Exercise**

Create a variable named `radius` and assign it the value `3`. Then, create a variable named `area` and assign it the area of a circle using that `radius`. Print the result.

<details>
<summary>▶️ Click to show the solution</summary>

```r
radius = 3
area = pi * radius^2
print(area)
```

</details>

## **Function**

A function in R is a set of instructions that performs a specific task and may return a value. You call a function by its name followed by parentheses, optionally passing arguments inside (e.g., `a_function(arg_1, arg_2, ...)`).


In R, you define a function using the `function()` keyword. For instance, here is a function called `computeArea` that computes the `area` of a circle given its radius:

In [None]:
computeArea = function(radius){
  area = pi*radius^2
  return(area)
}
computeArea(3)

Note that in R, you don’t always need to use the `return()` statement because it automatically returns the last evaluated expression. However, it is generally safer to use an explicit `return()` statement.
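
To illustrate the implicit return, here is a sketch of the same computation without `return()`, plus a variant with a default argument value. The names `computeAreaImplicit` and `computeAreaDefault` are made up for this example.

```r
# The last evaluated expression is returned automatically
computeAreaImplicit = function(radius){
  pi*radius^2
}
computeAreaImplicit(3)

# A default value is used when the argument is not supplied
computeAreaDefault = function(radius = 1){
  return(pi*radius^2)
}
computeAreaDefault()    # radius defaults to 1, so this is pi
computeAreaDefault(3)   # same value as computeAreaImplicit(3)
```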

#### **Exercise**

Create a function called `computeCirc` that takes a `radius` and outputs the circumference of a circle with that radius.

<details>
<summary>▶️ Click to show the solution</summary>

```r
computeCirc = function(radius){
  circ = 2*pi*radius
  return(circ)
}
```

</details>

## **More Complex Data Types**

### **Vector**

A vector is a one-dimensional array that holds elements of **the same type** (e.g., `numeric`, `character`, `logical`). It is one of the most basic data types in R.


#### **Creating a Vector**

Vectors are typically created using the `c()` function. For example, `c(1, 2, 3)` creates a numeric vector.

In [None]:
x = c(1,2,3)
typeof(x)
is.vector(x)

The expression `c("a", "b", "c")` creates a character vector.

In [None]:
y = c("a", "b", "c")
typeof(y)
is.vector(y)

What about `c("a", 1, 2)`?

This creates a character vector. When elements of different types are combined, R automatically converts them to a common type that can hold all values. The number `1` can be stored as the character string `"1"`, but `"a"` cannot be stored as a number.

In [None]:
z = c("a", 1, 2)
print(z)
typeof(z)
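
The same conversion rule applies to the other atomic types. As a rough sketch, R picks the most flexible type present, in the order logical, integer, double, character:

```r
typeof(c(TRUE, 2L))   # "integer": TRUE becomes 1L
typeof(c(1L, 2.5))    # "double": 1L becomes 1
typeof(c(1, "a"))     # "character": 1 becomes "1"

# Explicit conversion back to numbers only works for text that looks like a number
as.numeric("12")                   # 12
suppressWarnings(as.numeric("a"))  # NA (R would normally also print a warning)
```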

The `c()` function can also be used to combine vectors.

In [None]:
a = c(1,2,3)
b = c(4,5,6)
c(a,b)

#### **Basic Vector Operations**

**Indexing a Vector**

Indexing a vector in R is done using square brackets, e.g., `x[n]` returns the `n-th` element of `x`. We can extract multiple elements by passing a vector of indices, such as `x[c(1, 3, 5)]` to get the `1-st`, `3-rd`, and `5-th` elements.

Here, `1`, `3`, `5` are doubles. However, R implicitly converts these into integers.


In [None]:
x = c(1,4,3,6,3,24,7,2,3,4)
print(x)
print(x[c(1,3,5)])

**Arithmetic and Logical Operations**

In R, vector addition is performed element-wise. If you add two vectors of different lengths, the shorter vector is recycled (repeated) to match the length of the longer vector before the addition; R issues a warning if the longer length is not a multiple of the shorter one. The same applies to subtraction, multiplication, division, and comparison operations.

In [None]:
x = 1:5
y = seq(1,10,2)
x + y

#
a = 1:6
b = c(1,2)
a+b

#
z = c(1,4,3,6,3,24,7)
z > 5

#
n = seq(1,9,2) #1,3,5,7,9
n + 1
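
A small sketch of what happens when the lengths do not divide evenly. R still returns a result but also raises a warning, which is suppressed here only to keep the output tidy; the variable names are illustrative.

```r
# Length 6 and length 2: the shorter vector is repeated three times
evenCase = 1:6 + c(10, 20)
print(evenCase)    # 11 22 13 24 15 26

# Length 5 and length 2: the shorter vector is repeated and then cut off;
# without suppressWarnings() R warns that the lengths do not match up
unevenCase = suppressWarnings(1:5 + c(10, 20))
print(unevenCase)  # 11 22 13 24 15
```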

We can also index a vector using a logical operation. For example, if we want to know which values in a vector `x = c(1,4,3,6,3,24,7,2,3,4)` are larger than 3, we can do the following:

In [None]:
x = c(1,4,3,6,3,24,7,2,3,4)
whichIdx = x>3
print(whichIdx)
x[whichIdx]
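
A few related idioms, sketched on the same vector: `which()` turns a logical vector into positions, `sum()` counts the `TRUE` values, and conditions can be combined with `&`.

```r
x = c(1,4,3,6,3,24,7,2,3,4)

which(x > 3)        # 2 4 6 7 10: positions of the elements larger than 3
sum(x > 3)          # 5: number of elements larger than 3
x[x > 3 & x < 10]   # 4 6 7 4: elements strictly between 3 and 10
x[-1]               # a negative index drops that element (here, the first)
```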

**Modifying a Vector**

We can modify elements in a vector via the assignment operator.

In [None]:
x = 1:5
print(x)
x[1] = 5
print(x)

#### **Common Functions for Working with Vectors**

Some useful functions for working with vectors in R:

- `length(x)` — returns the number of elements in vector x.
- `sum(x)` — computes the sum of all elements in x.
- `mean(x)` — calculates the average value of x.
- `max(x)` - finds the maximum value in `x`.
- `sort(x)` — returns a sorted version of x.
- `unique(x)` — extracts unique elements from x.
- `c()` — combines values/vectors into a new vector.
- `seq(from, to, by)` — generates equally spaced sequences of numbers.
- `rep(x, times)` — repeats elements to create vectors.
- `lapply(x, FUN)` — applies a function `FUN` to each element of a vector, returns a list.
- `sapply(x, FUN)` — similar to lapply(), but results are simplified (e.g., to a vector or matrix).
- `which()` — returns indices of elements matching a condition.
- `names(x)` - returns the names of elements.
- `:`  — (not a function) creates a sequence of consecutive integers between two numbers.

In [None]:
x = seq(from = 1, to = 10, by = 2)
print(x)

y = 1:10
print(y)
sum(y)
mean(y)
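
A quick tour of several other functions from the list above, sketched on small made-up vectors (`v` and `marks` are illustrative names):

```r
v = c(3, 1, 2, 3, 1)

length(v)                       # 5
sort(v)                         # 1 1 2 3 3
unique(v)                       # 3 1 2
max(v)                          # 3
which(v == 3)                   # 1 4
rep(c(0, 1), times = 3)         # 0 1 0 1 0 1
sapply(v, function(e) e^2)      # 9 1 4 9 1

# Elements of a vector can be given names
marks = c(maths = 90, stats = 85)
names(marks)                    # "maths" "stats"
```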

#### **Exercise**

Create a new vector containing the odd numbers from 1 to 21, inclusive.
- Store this vector in a variable of your choice.
- From this vector, extract all elements that are divisible by 3 and store them in another variable.
- Compute the sum of the values that are divisible by 3.


<details>
<summary>▶️ Click to show the solution</summary>

```r
x = seq(from = 1, to = 21, by = 2) #Odd numbers 1, 3, ..., 21
#Divisible by 3 <=> x %% 3 == 0 (i.e., remainder = 0)
whichIdx = x%%3 == 0
y = x[whichIdx] #Extract a vector via a Boolean vector
print(y)
sum(y)
```

</details>

### **Matrix**

After vectors, matrices are the next level of data structure in R. A matrix is a two-dimensional structure where all elements must be of the same type, essentially a vector with dimensions.

There is a more general data structure called an `array`, which extends vectors and matrices to any number of dimensions. This topic will not be covered here.


#### **Creating and Indexing a Matrix**
We can create one using `matrix()`, and access elements with `[rowIdx, columnIdx]` (i.e., 2-d indexing). Each row or column of a matrix is a vector.

In [None]:
A = matrix(data = 1:9,
           nrow = 3,
           ncol = 3)
print(A)
print(A[1:2, 3:3])

If only a row index is provided (e.g., `A[1, ]`), R returns all columns of that row; if only a column index is provided (e.g., `A[, 1]`), R returns all rows of that column.

In [None]:
A[,1]
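
Two optional arguments are worth knowing here; the sketch below uses a separate small matrix `B` (an illustrative name) so that `A` is left untouched. `byrow = TRUE` fills the matrix row by row, and `drop = FALSE` keeps the result as a matrix instead of simplifying it to a vector.

```r
B = matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(B)                   # first row: 1 2 3, second row: 4 5 6

B[2, ]                     # 4 5 6: the second row, returned as a vector
B[, 2]                     # 2 5: the second column, returned as a vector
dim(B[, 2])                # NULL: a plain vector has no dimensions
dim(B[, 2, drop = FALSE])  # 2 1: still a matrix with 2 rows and 1 column
```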

#### **Modifying a Matrix**

We can modify a matrix similarly to a vector. The key difference is that matrices use two-dimensional indexing `[rowIdx, columnIdx]` instead of one-dimensional.

In [None]:

print(A)

# Modify the second row
A[2, ] = c(10, 11, 12)
print(A)

# Modify the first column
A[, 1] = c(100, 101, 102)
print(A)

# Modify a specific element (row 3, column 2)
A[3, 2] = 999
print(A)

#### **Column-Major Ordering**

In R, matrices are stored in **column-major** order, meaning elements are filled column by column, not row by row.

To us, a matrix is a 2-dimensional array. But to the computer, a matrix is stored as a 1-dimensional array (a vector), so 1-dimensional indexing can be used to extract elements, which can be useful for high-performance applications. However, two-dimensional indexing remains the standard and safer approach.

In [None]:
A[1:4]
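
To make the column-major layout concrete, the sketch below uses a fresh 2x3 matrix `M` (an illustrative name) and shows how a one-dimensional index maps to a row and column: element `[i, j]` sits at position `(j - 1) * nrow(M) + i`.

```r
M = matrix(1:6, nrow = 2)  # columns are (1,2), (3,4), (5,6)

as.vector(M)               # 1 2 3 4 5 6: the underlying storage order
M[3]                       # 3: the third stored value ...
M[1, 2]                    # 3: ... which is row 1, column 2

# Position of row i = 2, column j = 3
i = 2; j = 3
M[(j - 1) * nrow(M) + i]   # 6, the same as M[2, 3]
```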

Arithmetic and logical operations on matrices in R are performed element-wise, just like vectors. When a matrix is combined with a shorter vector, the vector is recycled in column-major order, i.e., down the first column, then down the second, and so on.

In [None]:
print(A)
print(A + c(1,2,3))
print(A %% c(1,2,3) == 0)
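
The effect is easier to see on a small example. In this sketch (with an illustrative matrix `M`), a vector with one value per row is added to a 2x3 matrix, so each row receives its own constant.

```r
M = matrix(1:6, nrow = 2)  # first row: 1 3 5, second row: 2 4 6

# c(10, 20) is recycled down each column: 10 goes to row 1, 20 to row 2
shifted = M + c(10, 20)
print(shifted)             # first row: 11 13 15, second row: 22 24 26
```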

#### **Common Functions for Working with Matrices**

- `dim(x)` — returns the dimensions (rows and columns) of matrix `x`.  
- `nrow(x)` — returns the number of rows.  
- `ncol(x)` — returns the number of columns.  
- `t(x)` — returns the transpose of matrix `x`.  
- `rowSums(x)` — computes the sum of each row.  
- `colSums(x)` — computes the sum of each column.  
- `rowMeans(x)` — computes the mean of each row.  
- `colMeans(x)` — computes the mean of each column.  
- `apply(x, MARGIN, FUN)` — applies a function `FUN` over rows (`MARGIN = 1`) or columns (`MARGIN = 2`).  
- `matrix(data, nrow, ncol)` — creates a matrix from the given data.  
- `cbind()` - combines two matrices by columns.
- `rbind()` - combines two matrices by rows.
- `dimnames(x)` — gets or sets row and column names.  
- `rownames(x)` — returns or sets row names.  
- `colnames(x)` — returns or sets column names.
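
A short sketch of several of these functions on two small made-up matrices, `a` and `b`:

```r
a = matrix(1:4, nrow = 2)  # first row: 1 3, second row: 2 4
b = matrix(5:8, nrow = 2)  # first row: 5 7, second row: 6 8

dim(cbind(a, b))           # 2 4: b is placed to the right of a
dim(rbind(a, b))           # 4 2: b is placed below a
t(a)                       # first row: 1 2, second row: 3 4
rowSums(a)                 # 4 6
colSums(a)                 # 3 7
apply(a, MARGIN = 1, max)  # 3 4: the maximum of each row
```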

#### **Exercise**

1. Create a 3x3 matrix `m` filled by columns with the first 9 odd numbers `(1, 3, 5, ..., 17)`.  
2. Use `rowSums()` to calculate the sum of each row.  
3. Use `colMeans()` to calculate the mean of each column.  
4. Use `apply()` to find the maximum value in each row. Recall that each row/column of a matrix is a vector.

**Hint:** Use `matrix()`, `rowSums()`, `colMeans()`, `max()`, `apply()`, and `seq()` with `by = 2`.

<details>
<summary>▶️ Click to show the solution</summary>

```r
m = matrix(seq(1, 17, by = 2), nrow = 3, ncol = 3)
print(m)
rowSums(m)
colMeans(m)
apply(m, MARGIN = 1, max)
```

</details>

### **List**

A list in R is a flexible data structure that can store elements of different types and lengths, unlike vectors which require all elements to be the same type. This makes lists particularly useful for grouping together diverse outputs.

#### **Creating a List**

In R, a list can be created via the `list()` function.

In [None]:
mxb107Info = list(faculty = "Faculty of Science", school = "School of Mathematical Sciences", year = 2025, unitCode = "MXB107",
unitName = "Probability & Statistics")
print(mxb107Info)

#### **Indexing a List**

We can access its elements using double square brackets `[[ ]]` or the `$` operator if the elements are named. For example:

In [None]:
print(mxb107Info)
mxb107Info$faculty
mxb107Info[["faculty"]]

If you use single square brackets `[ ]` on a list, R returns a sublist containing the selected elements—not the elements themselves.

In [None]:
print(mxb107Info)
mxb107Info$faculty
typeof(mxb107Info[4])
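
The distinction matters as soon as you want to compute with an element. A minimal sketch, using a small made-up list `info`:

```r
info = list(unitCode = "MXB107", year = 2025)

typeof(info["year"])     # "list": single brackets return a smaller list
typeof(info[["year"]])   # "double": double brackets return the element itself

info[["year"]] + 1       # 2026: arithmetic works on the element
info$year + 1            # 2026: $ behaves like [[ ]] for named elements
# info["year"] + 1       # would be an error: a list cannot be added to a number
```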

#### **Common Functions for Working with Lists**

- `str(x)` — displays the structure of a list — useful for inspection.
- `lapply(x, FUN)` — applies a function `FUN` to each element of a list, returns a list.
- `sapply(x, FUN)` — similar to lapply(), but results are simplified (e.g., to a vector or matrix).
- `unlist(x)` — flattens a list into a vector if possible.
- `append(x, value)` — adds elements to a list.
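
A brief sketch of these functions on a small made-up list, `scores`:

```r
scores = list(a = c(1, 2, 3), b = c(10, 20))

str(scores)                  # compact overview of the list
lapply(scores, mean)         # a list: a is 2, b is 15
sapply(scores, mean)         # a named numeric vector: a = 2, b = 15
unlist(scores)               # 1 2 3 10 20, flattened into a single vector

scores = append(scores, list(c = 5))
length(scores)               # 3: the list now has elements a, b, and c
```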

#### **Exercise**

Create a function named `Circle` that takes a `radius` and outputs the `area` and `circumference` of a circle with that `radius`.

**Hint:** Return the results in a list. Calling `return()` twice does not work because the function exits as soon as the first `return()` is evaluated.

<details>
<summary>▶️ Click to show the solution</summary>

```r
Circle = function(radius){
  areaVal = pi*radius^2
  circVal = 2*pi*radius
  return(list(area = areaVal, circ = circVal))
}
```

</details>


### **Data Frame**

A data frame is a table-like data structure in R where each column can hold different types (numeric, character, etc), and each row represents an observation. It’s widely used for storing and manipulating datasets. Think of that as an `Excel` table in R, but more efficient.

#### **Creating a Data Frame**

We can create a data frame using the `data.frame()` function. Each column in a data frame is a vector, and all columns must have the same length.

In [None]:
MXB107_ClassInfo <- data.frame(
  Class = c(
    "LEC01 01", "LEC01 01", "PRC01 01", "PRC01 01", "PRC01 02",
    "PRC01 03", "PRC01 07", "PRC01 08", "PRC01 02", "PRC01 04",
    "PRC01 05", "PRC01 06"
  ),
  Type = c(
    "Lecture (Internal)", "Lecture (Online)", "Practical (Online)", "Practical (Internal)",
    "Practical (Online)", "Practical (Internal)", "Practical (Internal)", "Practical (Internal)",
    "Practical (Internal)", "Practical (Internal)", "Practical (Internal)", "Practical (Internal)"
  ),
  Day = c(
    "Wed", "Wed", "Wed", "Thu", "Thu", "Thu", "Thu", "Thu", "Fri", "Fri", "Fri", "Fri"
  ),
  Location = c(
    "GP B117", "Online", "Online", "GP D413", "Online", "GP G216", "GP S520", "GP S517",
    "GP G216", "GP G216", "GP S502", "GP S519"
  ),
  Limit = c(
    240, 1000, 30, 35, 30, 35, 25, 30, 35, 35, 35, 35
  ),
  Teaching_Staff = c(
    "Chris Drovandi", "Chris Drovandi", "Narayan Srinivasan", "Narayan Srinivasan",
    "Oliver Vu", "Minh Long Nguyen", "Ryan Kelly", "Nicholas Gecks-Preston",
    "Oliver Vu", "Arwen Nugteren", "Arwen Nugteren", "Minh Long Nguyen"
  ),
  From = c(
    "11", "11", "16", "16", "16", "09", "14", "09",
    "9", "11", "15", "15"
  ),
  To = c(
    "13", "13", "18", "18", "18", "11", "16", "11",
    "11", "13", "17", "17"
  ),
  stringsAsFactors = FALSE
)

str(MXB107_ClassInfo)

#### **Indexing a Data Frame**

Indexing a data frame is quite similar to indexing a matrix. However, there are some additional operations specific to data frames:

- Access columns by name: `df$name` or `df[["name"]]`
- Access rows and columns by position:
  - `df[i, ]` (`i-th` row),
  - or `df[, j]` (`j-th` column),
  - or `df[i, j]` (element in `i-th` row, `j-th` column).
- Extract multiple rows or columns using vectors: `df[rowIdxVec, colIdxVec]`, `df[rowIdxVec, c("nameofCol1", "nameofCol2")]`.

In [None]:

MXB107_ClassInfo[,"Teaching_Staff"]
MXB107_ClassInfo$Teaching_Staff
MXB107_ClassInfo[["Teaching_Staff"]]

#
MXB107_ClassInfo[1,]
MXB107_ClassInfo[1:2, c("Location", "Limit")]
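
Rows can also be selected with a logical condition or reordered, just like vector elements. The sketch below uses a small made-up data frame `toy` so it can be run on its own.

```r
toy = data.frame(
  name = c("a", "b", "c"),
  score = c(10, 40, 25),
  stringsAsFactors = FALSE
)

toy[toy$score > 20, ]          # the rows for "b" and "c"
toy[toy$score > 20, "name"]    # "b" "c"
toy[order(toy$score), "name"]  # "a" "c" "b": names sorted by increasing score
```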

#### **Modifying a Data Frame**

There are various approaches to modify a data frame. Some common ones are:

- **Create a new column (e.g., `Semester`)**


In [None]:
MXB107_ClassInfo$Semester = "Semester 2" #The single value is automatically recycled to match the number of rows
str(MXB107_ClassInfo)

- **Modify a column**.

We can observe that the `From` and `To` columns are character vectors when they should be numeric. We can overwrite the existing character columns with their numeric equivalents.


In [None]:
MXB107_ClassInfo$From = as.numeric(MXB107_ClassInfo$From)
MXB107_ClassInfo$To = as.numeric(MXB107_ClassInfo$To)
str(MXB107_ClassInfo)
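
A small sketch of how `as.numeric()` treats text: strings that look like numbers are converted (leading zeros are fine), while anything else becomes `NA` with a warning. The warning is suppressed here only to keep the output tidy.

```r
as.numeric(c("09", "9", "11"))               # 9 9 11
suppressWarnings(as.numeric(c("9", "nine"))) # 9 NA: "nine" cannot be converted
```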

- **Modify a row**

In [None]:
print(MXB107_ClassInfo[1,])

MXB107_ClassInfo[1,] = list("1 LEC01 01", "Lecture", "Wed", "GP B117", 240, "Chris Drovandi", 11, 13, "Semester 2")

print(MXB107_ClassInfo[1,])

- **Modify a single entry**

This can be done the same way as modifying a single entry in a matrix.

In [None]:
print(MXB107_ClassInfo[1,])
MXB107_ClassInfo[1,"From"] = 13
MXB107_ClassInfo[1,"To"] = 15
print(MXB107_ClassInfo[1,])

#### **Common Functions for Working with Data Frames**

- `str(df)` — displays the structure of the data frame  
- `summary(df)` — provides summary statistics for each column  
- `nrow(df)` and `ncol(df)` — return the number of rows and columns  
- `head(df)` and `tail(df)` — show the first or last few rows  
- `names(df)` or `colnames(df)` — get or set the column names  
- `subset(df, condition)` — extract rows that meet a logical condition  


In [None]:
subset(MXB107_ClassInfo, MXB107_ClassInfo$Teaching_Staff == "Oliver Vu")

#### **Loading a Data Frame from a .csv File**

In practice, we may not need to create a data frame from scratch, but rather load it from an external file. A common format for storing tabular data is CSV (Comma-Separated Values). We can use the `read.csv()` function to load data from a CSV file.

Notice that our file system contains a folder named `sample_data`. We will load its `california_housing_test.csv` file into R.

In [None]:
CalTest = read.csv("./sample_data/california_housing_test.csv")
dim(CalTest)
str(CalTest)

In [None]:
head(CalTest, 3)
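
If you want to try `read.csv()` without relying on Colab's `sample_data` folder, the following self-contained sketch writes a tiny made-up data frame to a temporary CSV file and reads it back. The names `toyData`, `tmpFile`, and `loaded` are illustrative.

```r
toyData = data.frame(
  city = c("A", "B", "C"),
  population = c(100, 250, 75)
)

# Write the data frame to a temporary .csv file (without row names)
tmpFile = tempfile(fileext = ".csv")
write.csv(toyData, tmpFile, row.names = FALSE)

# Read it back in as a data frame
loaded = read.csv(tmpFile)
dim(loaded)    # 3 2
str(loaded)
```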

#### **Exercise**

Load `california_housing_train.csv` into R as a data frame. Subset the rows where `median_income < 1` and `housing_median_age < 18`. How many observations are there before and after subsetting?

<details>
<summary>▶️ Click to show the solution</summary>

```r
CalTrain = read.csv("./sample_data/california_housing_train.csv")
dim(CalTrain)
subCalTrain = subset(CalTrain, (CalTrain$median_income < 1) & CalTrain$housing_median_age < 18)
dim(subCalTrain)
```

</details>

## **Working with the R File System**

Knowing your current working directory is important when working with files in R. We can check it using the `getwd()` function.

In [None]:
getwd()


On Colab, R runs on a temporary virtual Linux environment, so `getwd()` will typically return `/content`, which is the default working directory during the session.

If you need to change the working directory, for example to access data stored in a different folder, you can use the `setwd("your/path/here")` function. Always ensure the path is correct, especially when running notebooks on different systems.
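
One cautious pattern, sketched below, is to remember the current directory before changing it so that you can always switch back. The folder name `demo_dir` is made up, and `tempdir()` is used so that nothing is left behind in `/content`.

```r
oldDir = getwd()                              # remember where we started

# file.path() joins path components with the correct separator
newDir = file.path(tempdir(), "demo_dir")
dir.create(newDir, showWarnings = FALSE)

setwd(newDir)                                 # move into the new folder
basename(getwd())                             # "demo_dir"

setwd(oldDir)                                 # move back to where we started
```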

#### **Exercise**

Create a new directory named `our_working_directory` via `dir.create("our_working_directory")`. It will be a sub-directory of `/content`. Then, change the working directory to the newly created sub-directory.



<details>
<summary>▶️ Click to show the solution</summary>

```r
dir.create("our_working_directory")
setwd("./our_working_directory")
```

</details>

Check the current working directory:

In [None]:
getwd()

All datasets and core functionalities needed for this unit are hosted online in the following GitHub repository:
`"https://github.com/edelweiss611428/MXB107-Notebooks/tree/main"`
We will clone (i.e., download a copy of) the repository onto our virtual computer and change the working directory to access its contents.

**Do not modify the following**:

In [None]:
setwd("/content")

# Remove `MXB107-Notebooks` if it already exists
if (dir.exists("MXB107-Notebooks")) {
  system("rm -rf MXB107-Notebooks")
}

# Clone the repository
system("git clone https://github.com/edelweiss611428/MXB107-Notebooks.git")

# Change working directory to "MXB107-Notebooks"
setwd("MXB107-Notebooks")

# Print the current working directory
getwd()

# List contents inside the directory
list.files()

## **Packages in R**

R packages add new features that aren’t included in base R. You install packages via `install.packages("pkg_name")` to access these extra tools and load them with `library("pkg_name")`.

Many popular R packages are already available on Colab, so you often don’t need to install them, saving time. Therefore, it’s best to check if a package is installed before attempting to install or load it blindly.

Here, `MASS` is one of the recommended packages that ship with a standard R installation, so it is already available and R won't attempt to install it.

In [None]:
if (!require("MASS")) install.packages("MASS"); library("MASS")

Most reviewed R packages are available on [CRAN](https://cran.r-project.org/web/packages/available_packages_by_name.html) and [Bioconductor](https://bioconductor.org/packages/release/bioc/), but many are also available as GitHub repositories.

For CRAN packages, it is generally sufficient to use `install.packages("pkg_name")` to install them, but installing from other repositories like GitHub can be more complicated.
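
As a sketch of an alternative check, `requireNamespace()` reports whether a package is installed without attaching it, and `pkg::fun` calls a single function from an installed package. The helper `loadOrInstall` below is a hypothetical name, not part of this unit's utilities, and the `stats` package is used only because it ships with every R installation.

```r
# TRUE if the package is installed; nothing is attached to the session
requireNamespace("stats", quietly = TRUE)

# Call one function from a package without library()
stats::sd(c(1, 2, 3))   # 1

# Hypothetical helper: install a package only if it is missing, then load it
loadOrInstall = function(pkgName){
  if (!requireNamespace(pkgName, quietly = TRUE)) install.packages(pkgName)
  library(pkgName, character.only = TRUE)
}
loadOrInstall("stats")
```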

#### **Exercise**

Check if the CRAN package `"dissimilarities"` is installed. If it is, load the package; if not, install it first and then load it.

<details>
<summary>▶️ Click to show the solution</summary>

```r
if (!require("dissimilarities")) install.packages("dissimilarities"); library("dissimilarities")
```

</details>

In general, you won’t need to manually install any packages in this unit. If you clone the unit's GitHub repository, change the working directory to `"MXB107-Notebooks"`, and run the `"R/preConfigurated.R"` script, it will automatically install and load all the necessary packages for you.

**Do not modify the following**:

In [None]:
source("R/preConfigurated.R")

This will also load some prepared utility functions that are not available in base R.

## **Getting Help in R**

If you find yourself forgetting what a particular function does or what the names of the arguments (inputs) you can pass it are, you can use the `help()` function or equivalently `?a_function`. For example, try running the following:

In [None]:
help(mean)

In [None]:
?mean

What you notice is that it prints the help info about calculating the arithmetic mean, where we are told that:
- It requires an R object x which is either a numeric vector, logical vector or date, date-time or time interval,
- And we can optionally tell it how much of the data to trim,
- And whether or not we want to drop any missing values from the calculation.
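
The two optional arguments described in the help page are `trim` and `na.rm`. A short sketch of their effect on made-up data:

```r
mean(c(1, NA, 3))                   # NA: a missing value makes the mean missing
mean(c(1, NA, 3), na.rm = TRUE)     # 2: missing values are dropped first

mean(c(1, 2, 3, 100))               # 26.5: pulled up by the extreme value
mean(c(1, 2, 3, 100), trim = 0.25)  # 2.5: the lowest and highest 25% are trimmed
```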

One disadvantage of Colab is that it does not render R documentation files very well, as it is primarily designed for Python. However, the interface and documentation display improve significantly when using dedicated IDEs like [RStudio](https://posit.co/download/rstudio-desktop/).

## **Tests (Optional)**

**Do not modify the following**:

In [None]:

if (!require("testthat")) install.packages("testthat"); library("testthat")

test_that("Test if all packages have been loaded", {

   expect_true(all(c("ggplot2", "tidyr", "dplyr", "stringr", "magrittr") %in% loadedNamespaces()))

})

test_that("Test if all utility functions have been loaded", {
  expect_true(exists("skewness"))
  expect_true(exists("kurtosis"))
})