# Script 1 - Exploring Datasets

## Acknowledgements

*Code largely based on the SI material for* **Borcard D., Gillet F. & Legendre P. 2018. Numerical Ecology with R, 2nd edition. Springer International Publishing AG**, *and on course material generously provided by A. Buttler (EPFL Numerical Ecology Class).*

# 1 Packages  -> *(install.packages(), library(), help(), ?package, ??package)*

## 1.1 Install packages -> *install.packages()*

In [None]:
install.packages("vegan")

In [None]:
install.packages("ade4")

In [None]:
install.packages("pastecs")

In [None]:
install.packages("psych")

In [None]:
install.packages("fmsb")

In [None]:
install.packages("gplots")  # needed for library(gplots) and heatmap.2() below
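Re-running `install.packages()` in every session is slow. The sketch below is one way to check first which packages are actually missing. The helper name `missing_packages()` is hypothetical (introduced here for illustration), and the installation itself is left commented out so that it stays an explicit step.

```r
# Hypothetical helper: return the packages from 'pkgs' that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("vegan", "ade4", "pastecs", "psych", "gplots", "fmsb")
to_install <- missing_packages(needed)
to_install   # character vector, empty if everything is already installed

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

`requireNamespace()` only checks that a package can be loaded; it does not attach it, so `library()` is still needed afterwards.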

## 1.2 Loading the needed packages -> *library()*

In [None]:
library(vegan)
library(ade4)
library(pastecs) 
library(psych)
library(gplots)
library(fmsb)

## 1.3 Displaying the help page for a package

### 1.3.1 Displays the help page -> *?package*

In [None]:
# Displays the help page for the package
help(vegan)
?vegan

### 1.3.2 Perform a broader keyword search in all installed documentation. -> *??package*

In [None]:
# broader keyword search in all installed documentation.
??vegan

# 2 Working Directory (*getwd(), setwd(), setwd(choose.dir())*)

## 2.1 Returns the current working directory -> *getwd()*

In [None]:
getwd()

## 2.2 Sets the working directory to the specified path -> *setwd()*

In [None]:
setwd('/multivariate_statistics_env-513/')

## 2.3 Opens a dialog to choose the working directory (Windows only) -> *setwd(choose.dir())*

In [None]:
setwd(choose.dir())
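The three calls above can be combined into a small round trip. This is a minimal sketch that uses `tempdir()` as the target only because that folder exists on every machine; in practice the path would be your own project folder.

```r
old_wd <- getwd()      # remember the current working directory
setwd(tempdir())       # move to the session's temporary directory
getwd()                # confirm the change
setwd(old_wd)          # go back to where we started
getwd()
```

Storing the original directory before calling `setwd()` makes it easy to return to it at the end of a script.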

# 3 Read data from file *(read.csv())*

## 3.1 import CSV files as dataframes -> *read.csv()*

**read.csv(file, header = TRUE, sep = ",", row.names = NULL)** 

- **file** : -> *path to the CSV file.*

- **header** : -> *indicates if the first line contains column names (`TRUE` by default).*

- **sep** : -> *field separator (`,` by default for read.csv).*

- **row.names** : -> *column to use as row names; if set to `1`, the first column is used to name the rows of the dataframe.*

In [None]:
ex1 <- read.csv("example.csv", header = TRUE, sep = ",", row.names = 1)
ex1
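To see the effect of `row.names = 1` without needing `example.csv`, the sketch below writes a tiny CSV to a temporary file and reads it back. The file content (site names and values) is invented for illustration.

```r
# Write a small, made-up CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("site,Cogo,Satr",
             "s1,0,3",
             "s2,1,5",
             "s3,0,4"), tmp)

# Read it back: the first column ("site") becomes the row names
demo_df <- read.csv(tmp, header = TRUE, sep = ",", row.names = 1)
demo_df
rownames(demo_df)   # the site labels
dim(demo_df)        # 3 rows, 2 columns (the "site" column is no longer counted)
```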

## 3.2  import CSV file chosen via dialog -> *read.csv(file.choose())*

In [None]:
ex2<-read.csv(file.choose())
ex2

# 4 Data Definition (*declare variable, create vectors*)

## 4.1 Create variables, add values, etc.

In [None]:
a = 12       # Create a variable 'a' with value 12
b = 34       # Create a variable 'b' with value 34
c = a + b    # Add 'a' and 'b', store the result in 'c' 
print(a)
print(b)
print(c)

## 4.2 Create vectors -> *vec <- c(x1,x2,x3,x4)*

In [None]:
a <-c(1,4,6)
a<-a*3 + 2
a

In [None]:
dat1 <- c(2, 3, 5)  # Create a vector 'dat1' with values 2, 3, and 5
print(dat1)

dat1 <- dat1 * 2            # Multiply each element of 'dat1' by 2 (result: 4, 6, 10)
print(dat1)

<div style="padding: 10px; border:1px solid red; font-size: 18px;">
  <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 1</span><br/>
  What is the value of <code>a</code> after running the following lines of code?<br/>
  <pre style="font-size: 20px; margin: 5px 0;"><code>a <- c(1, 4, 6)
a <- 2 + a * 3</code></pre>
  <p style="text-align: right; margin-bottom: 0px; font-style: italic;">
    You can check your answer by clicking on the "Answer" below.
  </p>
</div>

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code> c(5,14,20)</code>.
  </div>
</details>


# 5 Exploring datasets (*load(),head(),tail(),nrow(),ncol(),dim(),colnames(),rownames(),range(),apply(),sum(),mean(),median(),sd(),var(),summary(),str()*)

## 5.1 Load data -> *load()*

In [None]:
# load data ---------------------------------------------------------------
load("Doubs.RData") # Load the objects saved in "Doubs.RData" (from the working directory) into the environment
data(doubs) # Load the dataset "doubs" shipped with the ade4 package (ade4 must be attached)

## 5.2 Display the whole data frame in the console -> *df*

In [None]:
spe

## 5.3 Display only first or last rows -> *head(df)*, *tail(df)*

- **head(df)** : *shows the first 6 rows*

- **tail(df)** : *shows the last 6 rows*

In [None]:
head(spe)                 # Display only the first 6 rows
tail(spe)                 # Display only the last 6 rows

## 5.4 Display only selected rows and columns -> *df[]*

- **c()** = *selected rows/columns*

- **n:(n+k)** = *nth row/column to the (n+k)th row/column; the parentheses are required because `n:n+k` is evaluated as `(n:n)+k`*

- **df[c(),c()]**

- **df[n:(n+k),n:(n+k)]**

- **df[c(),n:(n+k)]**

- **df[n:(n+k),c()]**

In [None]:
X <- spe[c(2, 5, 7), 8:12]  # Select rows 2, 5, and 7 and columns 8 to 12 from 'spe'
X
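Operator precedence matters when a range is written with variables: `:` binds more tightly than `+`. A short sketch on the built-in `iris` data frame (used here only because it is always available; `n` and `k` are arbitrary illustrative values):

```r
n <- 2
k <- 3

n:n + k      # evaluated as (n:n) + k, i.e. a single value
n:(n + k)    # the sequence n, n+1, ..., n+k

block <- iris[n:(n + k), 1:2]   # rows 2 to 5, columns 1 to 2
block
```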

## 5.5 Display the number of rows/columns or the dimension of a dataframe -> *nrow(df)*, *ncol(df)*, *dim(df)*

 - **nrow(df)** : *Number of rows*
 
 - **ncol(df)** : *Number of columns*

 - **dim(df)** : *Dimension of the dataframe (`rows x columns`)*

In [None]:
nrow(spe)                 # Number of rows (sites)
ncol(spe)                 # Number of columns (species)
dim(spe)                  # Dimensions of spe (number of rows, number of columns)

## 5.6 Display Columns and row labels -> *colnames(df)*, *rownames(df)*

- **colnames(df)** : *Columns labels*

- **rownames(df)** : *Row names*

In [None]:
colnames(spe)             # Column labels (descriptors = species)
rownames(spe)             # Row labels (objects = sites)

## 5.7 Display Minimum and Maximum values found in a dataframe -> *range(df)*, *range(df$col)*, *apply(df,1/2,range)*

- **range(df)**: *returns the minimum and maximum values across the entire data frame `df` (all numeric values combined)*


In [None]:
range(spe) # Minimum and maximum of abundance values in the whole data set

- **range(df$col)**: *returns the minimum and maximum values of the column `col` in the data frame `df`*


In [None]:
range(spe$Cogo) # Minimum and maximum in the column Cogo

- **apply(df, 2, range)**: *applies the `range` function to each column of the data frame `df`, returning the minimum and maximum for each column*

- **apply(df, 1, range)**: *applies the `range` function to each row of the data frame `df`, returning the minimum and maximum for each row*


In [None]:
apply(spe,1,range)  # returns the min and max for each row (i.e., for each sample)

apply(spe, 2, range)  # returns the min and max for each column (i.e., for each species)
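The shape of what `apply()` returns is easier to read on a tiny table. The sketch below uses a made-up data frame `toy` (not part of the Doubs data): in both cases the result is a matrix with two rows, the first holding the minima and the second the maxima.

```r
# Made-up 3 x 3 table of abundances
toy <- data.frame(A = c(0, 2, 5),
                  B = c(1, 1, 3),
                  C = c(4, 0, 0))

range(toy)                        # min and max over all values

col_rng <- apply(toy, 2, range)   # one column of results per column of toy
col_rng

row_rng <- apply(toy, 1, range)   # one column of results per row of toy
row_rng
```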


## 5.8 Display the sum of all values in column -> *sum(df$col)*

In [None]:
sum(spe$Cogo)  # Calculate the total sum of values in the 'Cogo' column of the 'spe' data frame

## 5.9 Display mean, median, standard deviation, variance for a column -> *mean(df\$col)* *median(df\$col)*, *sd(df\$col)*, *var(df\$col)*

- **mean(df$col)**: *returns the average of the values in column `col` of data frame `df`*  

- **median(df$col)**: *returns the median (middle value) of column `col` in `df`*  

- **sd(df$col)**: *returns the standard deviation of values in column `col` (spread around the mean)* 

- **var(df$col)**: *returns the sample variance of values in column `col` (sum of squared deviations from the mean, divided by n - 1)*



In [None]:
mean(spe$Cogo)       # Computes the average of the Cogo column in the spe dataframe  
median(spe$Cogo)     # Returns the median value of the Cogo column in spe  
sd(spe$Cogo)         # Calculates the standard deviation of the Cogo column in spe  
var(spe$Cogo)        # Computes the variance of the Cogo column in spe  
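As a quick check of what `var()` and `sd()` compute, the sketch below recomputes them by hand on a small made-up vector, using the n - 1 denominator of the sample variance.

```r
x <- c(2, 4, 4, 6)          # made-up values
n <- length(x)

manual_var <- sum((x - mean(x))^2) / (n - 1)   # sample variance by hand
manual_sd  <- sqrt(manual_var)                 # standard deviation = square root of the variance

all.equal(manual_var, var(x))   # TRUE
all.equal(manual_sd, sd(x))     # TRUE
```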

## 5.10 Display descriptive statistics of a dataframe -> *summary(df), stat.desc(df), describeBy(df,group =)*

<div style="padding: 10px; border:3px solid green; font-size: 10px; text-align: left;">
    <img src="boxplot.png" width="35%">
</div>

### 5.10.1 Display **descriptive statistics** for **each column** -> *summary(df)*

- **Min.**: *minimum value*

- **1st Qu.**: *first quartile (25%)*

- **Median**: *median (50%)*

- **Mean**: *average*

- **3rd Qu.**: *third quartile (75%)*

- **Max.**: *maximum value*


In [None]:
summary(spe)              # Descriptive statistics for columns

### 5.10.2 Display statistics for each column (using pastecs library) -> *stat.desc(df)*

- **nbr.val** : *number of values (non-missing data points)*  
- **nbr.null** : *number of null values, i.e. values equal to zero*  
- **nbr.na** : *number of missing values (`NA`)*  
- **min** : *minimum value*  
- **max** : *maximum value*  
- **range** : *difference between max and min*  
- **sum** : *sum of all values*  
- **median** : *median value (50th percentile)*  
- **mean** : *average value*  
- **SE.mean** : *standard error of the mean*  
- **CI.mean.0.95** : *half-width of the 95% confidence interval of the mean (the interval is mean ± this value)*  
- **var** : *variance*  
- **std.dev** : *standard deviation*  
- **coef.var** : *coefficient of variation (std.dev divided by mean)*  

In [None]:
stat.desc(spe) # Descriptive statistics for columns
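Some of the rows reported above can be recomputed in base R to see what they mean. This is a sketch on a made-up vector, assuming the usual t-based formulas for the standard error and the confidence interval; it is worth comparing with `?stat.desc` and with the output of `stat.desc()` on the same values.

```r
x <- c(2, 4, 4, 6, 9)       # made-up values
n <- length(x)

se_mean  <- sd(x) / sqrt(n)                    # standard error of the mean
ci_half  <- qt(0.975, df = n - 1) * se_mean    # half-width of the 95% confidence interval
coef_var <- sd(x) / mean(x)                    # coefficient of variation

c(lower = mean(x) - ci_half, upper = mean(x) + ci_half)   # the interval itself
```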

### 5.10.3 Display descriptive statistics by group for a dataframe -> *describeBy(df, group = df$GroupColumn)*

- **vars** : *variable index in the dataset*  
- **n** : *number of observations*  
- **mean** : *average*  
- **sd** : *standard deviation*  
- **median** : *median (50th percentile)*  
- **trimmed** : *trimmed mean (by default, the mean after dropping the lowest and highest 10% of values)*  
- **mad** : *median absolute deviation*  
- **min** : *minimum value*  
- **max** : *maximum value*  
- **range** : *max - min*  
- **skew** : *skewness (asymmetry of the distribution)*  
- **kurtosis** : *kurtosis (tailedness of the distribution)*  
- **se** : *standard error*  




In [None]:
describeBy(iris, group = iris$Species)

## 5.11 Structure of the dataset -> *str(df)*       

- **Type**: *type of the object (e.g., data.frame, matrix, etc.)*  
- **Dimensions**: *number of rows and columns*  
- **Column types**: *data type of each column (e.g., numeric, factor, etc.)*  
- **Preview**: *first few values in each column*


In [None]:
str(spe)                  # Structure of the dataset

## 5.12 Adding a data frame (or list) to the search path, to refer to its columns directly by name without using the $ operator

- **attach(df)** : *attach a specific df*

In [None]:
attach(iris) # Attach iris dataset to search path (to access columns directly)

In [None]:
mean(Sepal.Length) # Calculate mean of Sepal.Length (after attach)

## 5.13 Remove a dataframe (or list) from the search path  

- **detach(df)** : *detach a specific df*

- **detach()** : *detach the last attached package/object*

In [None]:
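detach(iris)                                 # Detach the iris dataset to remove it from the search path
# detach()                                   # Without arguments, detaches whatever is at position 2 of the search path (here the last package attached); left commented out so the packages loaded above stay available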
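If the goal is only to avoid typing `df$` repeatedly, `with()` gives the same convenience without modifying the search path, so there is nothing to detach afterwards. A minimal sketch on the built-in `iris` data:

```r
# Columns of iris are visible by name only inside the with() call
mean_sepal <- with(iris, mean(Sepal.Length))
mean_sepal

# Same result as the explicit $ form
all.equal(mean_sepal, mean(iris$Sepal.Length))
```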

<div style="padding: 10px; border:1px solid red; font-size: 18px;">
  <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 3</span><br/>
  What does the function <code>head()</code> display in R?<br/>
  <p style="text-align: right; margin-bottom: 0px; font-style: italic;">
    You can check the answer by clicking on the "Answer" below.
  </p>
</div>

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>It displays the first 6 rows of a dataframe (or the first 6 elements of a vector) by default.</code>
  </div>
</details>

<br/>

<div style="padding: 10px; border:1px solid red; font-size: 18px;">
  <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 4</span><br/>
  I have an output of <code>30 25</code> when using <code>dim(df)</code>, what does that mean?<br/>
  <p style="text-align: right; margin-bottom: 0px; font-style: italic;">
    You can check the answer by clicking on the "Answer" below.
  </p>
</div>

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>The dataframe <code>df</code> has 30 rows and 25 columns.</code>
  </div>
</details>

<br/>

<div style="padding: 10px; border:1px solid red; font-size: 18px;">
  <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 5</span><br/>
  I have a dataframe and I only want to select the <strong>first</strong>, <strong>second</strong> and <strong>third</strong> rows and only the <strong>fifth</strong> and <strong>sixth</strong> columns, what code should I use?<br/>
  <p style="text-align: right; margin-bottom: 0px; font-style: italic;">
    You can check the answer by clicking on the "Answer" below.
  </p>
</div>


<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>df[c(1,2,3), c(5,6)]</code>
  </div>
</details>

<div style="padding: 10px; border:1px solid red; font-size: 18px;">
  <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 6</span><br/>
  How do you get the names of the <strong>rows</strong> and <strong>columns</strong> of a dataframe df?<br/>
  <p style="text-align: right; margin-bottom: 0px; font-style: italic;">
    You can check the answer by clicking on the "Answer" below.
  </p>
</div>

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>rownames(df)</code><br/>
    <code>colnames(df)</code><br/>
      
  </div>
</details>


<div style="padding: 10px; border:1px solid red; font-size: 18px;">
  <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 7</span><br/>
  How do you find the <strong>minimum</strong> and <strong>maximum</strong> values of the entire dataframe?<br/>
  <p style="text-align: right; margin-bottom: 0px; font-style: italic;">
    You can check the answer by clicking on the "Answer" below.
  </p>
</div>

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>range(df)</code><br/>
  </div>

</details>

<div style="padding: 10px; border:1px solid red; font-size: 18px;">
  <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 8</span><br/>
  Does <code>apply(df, 1, range)</code> return the range of each row or each column?<br/>
  <p style="text-align: right; margin-bottom: 0px; font-style: italic;">
    You can check the answer by clicking on the "Answer" below.
  </p>
</div>

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
   <code>apply(df, 1, range)</code> calculates the range (min and max) for each row (because 1 = row, 2 = column).<br/>
  </div>

    
</details>


<div style="padding: 10px; border:1px solid red; font-size: 18px;">
  <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 9</span><br/>
 If I have <code>df$col</code>, how do I calculate its <strong>mean</strong>? <br/>
  <p style="text-align: right; margin-bottom: 0px; font-style: italic;">
    You can check the answer by clicking on the "Answer" below.
  </p>
</div>

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
   <code>mean(df$col)</code><br/>
  </div>
</details>

<div style="padding: 10px; border:1px solid red; font-size: 18px;">
  <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 10</span><br/>
 When using <code>str(df)</code>, I see <code>Cogo: int 0 0 0 0 0 0 0 0 0 0</code>. What does the "int" mean? <br/>
  <p style="text-align: right; margin-bottom: 0px; font-style: italic;">
    You can check the answer by clicking on the "Answer" below.
  </p>
</div>

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
   <code>"int"</code> means the column is of type integer (whole number)<br/>
  </div>
</details>





# 6 Subsetting dataframes

## 6.1 Extract rows with specific condition

In [None]:
ablette<-fishtraits[fishtraits$EnglishName=="Bleak",] # Select rows where EnglishName is "Bleak"

## 6.2 Extract specific column depending on its name

In [None]:
body_length2<-fishtraits$BodyLength # Extract the 'BodyLength' column as a vector

## 6.3 Extract specific column depending on its index

In [None]:
body_length=fishtraits[,6] # Extract the 6th column of the dataframe

## 6.4 Combining extraction of specific rows and columns

In [None]:
body_length_ablette=fishtraits[fishtraits$EnglishName=="Bleak",6] # Extract 6th column for rows where EnglishName is "Bleak"
body_length_ablette # Display the result
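The same subsetting patterns can be tried on a small table that does not depend on the course data. The data frame `toy_traits` below is invented for illustration (the body lengths are made-up numbers); it shows that the condition inside the brackets is simply a logical vector, and that `drop = FALSE` keeps the result as a data frame.

```r
# Invented example table
toy_traits <- data.frame(
  EnglishName = c("Bleak", "Roach", "Pike"),
  Family      = c("Cyprinidae", "Cyprinidae", "Esocidae"),
  BodyLength  = c(15, 25, 100),
  stringsAsFactors = FALSE
)

is_pike <- toy_traits$EnglishName == "Pike"   # logical vector: FALSE FALSE TRUE
is_pike

esocidae <- toy_traits[toy_traits$Family == "Esocidae", ]      # keep matching rows, all columns
pike_len <- toy_traits[is_pike, "BodyLength"]                  # a plain vector
pike_df  <- toy_traits[is_pike, "BodyLength", drop = FALSE]    # still a data frame
```

Selecting a column by name, as in `"BodyLength"`, is less fragile than using its index, because it keeps working if the column order changes.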

<div style="padding: 10px; border:1px solid green; font-size: 10px;">
  <span style="font-size: 15px;"> <i>Please run this cell below to avoid having UTF-8 troubles</i> </span><br/>
</div>


In [None]:
# Apply this function to each column of the dataframe
fishtraits[] <- lapply(fishtraits, function(col) {
  if (is.factor(col)) {
    # If the column is a factor, convert it to character and fix encoding from latin1 to UTF-8
    iconv(as.character(col), from = "latin1", to = "UTF-8")
  } else if (is.character(col)) {
    # If the column is already character, just fix the encoding from latin1 to UTF-8
    iconv(col, from = "latin1", to = "UTF-8")
  } else {
    # If the column is neither factor nor character, keep it unchanged
    col
  }
})
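What `iconv()` does in the cell above can be seen on a single character. The sketch below builds the letter "é" as it is stored in latin1 (one byte) and converts it to UTF-8 (two bytes); the variable names are illustrative.

```r
raw_latin1 <- rawToChar(as.raw(0xe9))   # the single byte 0xE9, which is "é" in latin1
fixed <- iconv(raw_latin1, from = "latin1", to = "UTF-8")

charToRaw(raw_latin1)   # e9
charToRaw(fixed)        # c3 a9
```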

<div style="padding: 10px; border:1px solid red; font-size: 18px;">
  <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 11</span><br/>
  Complete the code below to:<br/>
    1) Extract the rows from the <code>fishtraits</code> dataframe where the family is <strong>Cyprinidae</strong><br/>
    2) Compute the <strong>mean</strong> of the <code>BodyLength</code> where family is <strong>Cyprinidae</strong><br/>
    3) Use <code>describeBy(df, group = df$GroupColumn)</code> to describe each family in the dataframe<br/>
</div>


In [None]:
# 1) Extract rows where the Family is Cyprinidae

# Replace the blanks with the correct syntax to filter the rows
cyprinidae_data <- fishtraits[fishtraits$Family == "____",]
cyprinidae_data

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">1) Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>fishtraits[fishtraits$Family == <span style="color:blue;">"Cyprinidae"</span>,]</code><br/>
  </div>
</details>

In [None]:
# 2) Calculate the mean BodyLength for Cyprinidae only
# Use the filtered table above
# Complete the two blanks
mean_length <- ____(fishtraits[fishtraits$Family == "Cyprinidae",]$____)
mean_length

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">2) Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code><span style="color:blue;">mean</span>(fishtraits[fishtraits$Family == <span style="color:blue;">"Cyprinidae"</span>,]$<span style="color:blue;">BodyLength</span>)</code><br/>
  </div>
</details>

In [None]:
# 3) Describe each family using describeBy()

# Replace the blanks with the correct dataframe and grouping column
library(psych)
describeBy(fishtraits, group = fishtraits$_____)

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">3) Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>describeBy(fishtraits, group = fishtraits$<span style="color:blue;">Family</span>)</code>
  </div>
</details>

# 7 Data visualisation (*hist(),barplot(),heatmap.2(),plot(),radarchart(),pairs(),par(mfrow = c(nrows, ncols))*)

## 7.1 Displaying histogram of a column -> *hist(df$col)*

In [None]:
hist(spe$Pato)

## 7.2 Display barplot -> *barplot(freq, xlab, ylab, col, las, main)*

**freq <- table(df$col)**: *creates a frequency table of the values in column `col` of the data frame `df`*

**barplot(freq, xlab, ylab, col, las, main)**: *creates a barplot using a frequency table with customizable axis labels, colors, orientation, and title*

- **xlab**: *label on the x-axis*
- **ylab**: *label on the y-axis*
- **col**: *color of the bars*
- **las**: *orientation of axis labels (1 = horizontal)*
- **main**: *title of the plot*


In [None]:
# Create a frequency table of abundance values for species Pato
freq <- table(spe$Pato)

# Barplot of the abundance distribution
barplot(freq,
        xlab = "Abundance class",
        ylab = "Frequency",
        col = "lightblue",
        las = 1,
        main = "Abundance distribution of species Pato")
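It can help to look at the frequency table itself before plotting it. A small sketch on a made-up abundance vector: the names of the table are the distinct values and its entries are the counts, which is exactly what `barplot()` draws.

```r
abund <- c(0, 0, 1, 3, 3, 3, 5)   # made-up abundance classes
freq_demo <- table(abund)

freq_demo
names(freq_demo)        # "0" "1" "3" "5"  (become the bar labels)
as.vector(freq_demo)    # 2 1 3 1          (become the bar heights)
```

Classes that never occur (here 2 and 4) are absent from the table, so they do not get a bar.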


## 7.3 Display heatmap -> *heatmap.2(as.matrix(df), dendrogram, trace)*

**heatmap.2(as.matrix(df), dendrogram, trace,col,main)**: *creates a heatmap from the data frame `df` converted into a matrix, with options to control clustering and cell display*

- **as.matrix(df)**: *converts the data frame `df` into a matrix suitable for heatmap plotting*  
- **dendrogram**: *controls whether row/column dendrograms are shown (`"both"`, `"row"`, `"column"`, or `"none"`)*  
- **trace**: *adds or removes trace lines inside the cells (`"row"`, `"column"`, `"both"`, or `"none"`)*  
- **col** *(optional)*: *customizes the color palette of the heatmap*
- **main** *(optional)*: *Add title*


In [None]:
# Plot heatmap for the spe dataset
heatmap.2(as.matrix(spe),         # Convert spe to matrix
          dendrogram = "none",    # No dendrogram
          trace = "none",         # No trace lines
          col = heat.colors(20),  # Color gradient
          main = "Species Abundance Heatmap")  # Title

## 7.4 Display a scatter plot -> *plot(x, y, main, xlab, ylab, pch, col)*

**plot(x, y, main, xlab, ylab, pch, col)**: *creates a scatter plot with customizable title, axis labels, point style, and color*

- **x**: *x coordinates*
- **y**: *y coordinates*
- **main**: *title of the plot*

- **xlab**: *label on the x-axis*

- **ylab**: *label on the y-axis*

- **pch**: *plotting symbol type (e.g., 19 for solid circles)*

- **col**: *color of the points*

In [None]:
x <- c(1, 2, 3, 4, 5)            # Create a numeric vector x with values 1 to 5
y <- c(3, 7, 4, 6, 8)            # Create a numeric vector y with corresponding values

plot(x, y,                       # Plot y versus x as points
     main = "Simple Scatter Plot",  # Title of the plot
     xlab = "X values",              # Label for the x-axis
     ylab = "Y values",              # Label for the y-axis
     pch = 19,                      # Plotting character: filled circles
     col = "blue")                  # Color of the points: blue

### 7.4.1 Display a scatter plot with advanced parameters -> *plot(x, y, main, xlab, ylab, pch, col, asp, cex, cex.axis)*

**plot(x, y, main, xlab, ylab, pch, col, asp, cex, cex.axis)**: *creates a scatter plot with customizable title, axis labels, point style, color, aspect ratio, point size, and axis label size*

- **x**: *x coordinates*
- **y**: *y coordinates*
- **main**: *title of the plot*
- **xlab**: *label on the x-axis*
- **ylab**: *label on the y-axis*
- **pch**: *plotting symbol type (e.g., 19 for solid circles)*
- **col**: *color of the points*
- **asp**: *aspect ratio of the plot (e.g., 1 for equal scaling on x and y axes)*
- **cex**: *scaling factor for the size of the points*
- **cex.axis**: *scaling factor for the size of axis tick labels*


In [None]:
# Example of scatter plot with advanced options
x <- c(1, 2, 3, 4, 5)           # x vector
y <- c(3, 7, 4, 6, 8)           # y vector
sizes <- c(1, 3, 2, 4, 5)       # point sizes (example variable like spe$Satr)

plot(x, y,
     main = "Enhanced Scatter Plot",  # Plot title
     xlab = "X values",                # X axis label
     ylab = "Y values",                # Y axis label
     pch = 19,                        # Solid circle points
     col = "blue",                    # Point color
     asp = 1,                        # Aspect ratio 1:1 (equal scaling on x and y)
     cex.axis = 0.8,                 # Axis label size
     cex = sizes                     # Variable point sizes
)

### 7.4.2 Adding lines to a plot -> *lines(x, y, col, lwd)*

**lines(x, y, col, lwd)**: *adds connected line segments to an existing plot with customizable color and line width*

- **x**: *x coordinates of the points to connect*
- **y**: *y coordinates of the points to connect*
- **col**: *color of the line*
- **lwd**: *line width (thickness)*


In [None]:
# Example of scatter plot with advanced options and added line
x <- c(1, 2, 3, 4, 5)           # x vector
y <- c(3, 7, 4, 6, 8)           # y vector
sizes <- c(1, 3, 2, 4, 5)       # point sizes

plot(x, y,
     main = "Enhanced Scatter Plot with Line",  # Plot title
     xlab = "X values",                          # Label for x-axis
     ylab = "Y values",                          # Label for y-axis
     pch = 19,                                  # Solid circle points
     col = "blue",                              # Points color
     asp = 1,                                   # Aspect ratio 1:1 (equal scaling)
     cex.axis = 0.8,                            # Axis label size
     cex = sizes                                # Variable point sizes
)

lines(x, y, col = "lightblue", lwd = 3)         # Adds a light blue line connecting points with line width 3
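`lines()` joins the points in the order in which they are given, so unsorted x values produce a zigzag. A minimal sketch, with made-up vectors, that sorts the points by x with `order()` before drawing the line:

```r
x_unsorted <- c(3, 1, 5, 2, 4)    # same points as above, in shuffled order
y_unsorted <- c(4, 3, 8, 7, 6)

ord <- order(x_unsorted)          # positions that sort x in increasing order

plot(x_unsorted, y_unsorted, pch = 19, col = "blue",
     xlab = "X values", ylab = "Y values")
lines(x_unsorted[ord], y_unsorted[ord], col = "lightblue", lwd = 3)
```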

## 7.5 Display a radarchart -> *radarchart(data, axistype, pcol, pfcol, plwd, cglcol, cglty, axislabcol, vlcex)*

**radarchart(data, axistype, pcol, pfcol, plwd, cglcol, cglty, axislabcol, vlcex)**: *creates a customizable radar chart with options for colors, line styles, and labels*

- **data**: *data frame where the first two rows contain the max and min values of the variables, and the following rows contain the data to plot*

- **axistype**: *type of axis labels (0 to 5; e.g., 0 = no axis labels, 1 = labels on the center axis only)*

- **pcol**: *color of the lines connecting the points*

- **pfcol**: *fill color under the lines, can include transparency*

- **plwd**: *line width*

- **cglcol**: *color of the concentric grid lines*

- **cglty**: *line type for the grid (e.g., 1 = solid, 2 = dashed)*

- **axislabcol**: *color of the axis labels*

- **vlcex**: *size of the variable labels*


In [None]:
# Example data: first 2 rows are max and min for each variable
data <- data.frame(
  Speed = c(10, 0, 7, 8),
  Strength = c(10, 0, 9, 6),
  Agility = c(10, 0, 6, 7),
  Endurance = c(10, 0, 8, 9),
  Flexibility = c(10, 0, 7, 5)
)

# Radar chart with customized colors and styles
radarchart(
  data,     # data frame whose first two rows contain the max and min values
  axistype = 1, # axis label type (1 = labels on the center axis only)
  pcol = c("red", "blue"),               # Line colors for each observation
  pfcol = c(rgb(1,0,0,0.3), rgb(0,0,1,0.3)), # Transparent fill colors
  plwd = 1,                             # Line width
  cglcol = "grey",                      # Color of concentric grid lines
  cglty = 1,                           # Grid line type (1 = solid)
  axislabcol = "darkgrey",             # Axis label color
  vlcex = 1.2                          # Variable label size
)
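When the observations already sit in their own data frame, the two extra rows expected by `radarchart()` can be added with `rbind()`. This is a sketch with invented scores and hypothetical object names (`scores`, `limits`, `radar_data`); the chart is drawn only if the fmsb package is installed.

```r
# Invented observations (one row per individual)
scores <- data.frame(Speed = c(7, 8), Strength = c(9, 6), Agility = c(6, 7))

# Row 1 = maximum of each axis, row 2 = minimum of each axis
limits <- data.frame(Speed = c(10, 0), Strength = c(10, 0), Agility = c(10, 0))

radar_data <- rbind(limits, scores)   # max, min, then the observations
radar_data

if (requireNamespace("fmsb", quietly = TRUE)) {
  fmsb::radarchart(radar_data, axistype = 1, pcol = c("red", "blue"))
}
```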

## 7.6 Display a scatterplot matrix -> *pairs(x, labels, main, pch, col, bg)*

**pairs(x, labels, main, pch, col, bg)**: *creates a scatterplot matrix to visualize pairwise relationships between variables*

- **x**: *a data frame or matrix containing the variables to plot*

- **labels**: *optional character vector for labeling the variables (axes titles)*

- **main**: *main title of the entire plot*

- **pch**: *plotting symbol type for points (e.g., 19 for solid circles)*

- **col**: *color of the plotting symbols*

- **bg**: *background (fill) color for plotting symbols (used if pch allows filling)*  


In [None]:
# Sample data: first 4 columns of iris dataset
data <- iris[1:4]

# Scatterplot matrix with customization
pairs(
  data,
  labels = colnames(data),      # Labels for each variable (axis titles)
  main = "Scatterplot Matrix",  # Main title for the plot
  pch = 19,                     # Plotting symbol (solid circle)
  col = "darkgreen",            # Color of the points
  bg = "lightgreen"             # Background (fill) color of points (only used for some symbols)
)


## 7.7 Fit a linear regression model predicting a variable from another -> *mod <- lm(df\$col1 ~ df\$col2)*

- `mod`: model created from the linear regression, predicting `col1` based on `col2`  
- `df$col1`: dependent variable (response)  
- `df$col2`: independent variable (predictor)  
- Use `summary(mod)` to view model details (coefficients, R², p-values, etc.)  
- Use `plot(df$col2, df$col1)` and `abline(mod, col = "red")` to visualize the regression line.

In [None]:
# Linear regression: predicting Sepal.Length from Petal.Length in the Iris dataset
mod <- lm(iris$Sepal.Length ~ iris$Petal.Length)

# Display a summary of the linear regression model
summary(mod)


- **Residuals:** Show the distribution of errors between observed and predicted values; ideally centered around zero.  
- **Coefficients:** Give the estimated intercept and slope of the regression line, with their standard errors and t-tests. The p-values (`Pr(>|t|)`) and the significance codes next to them indicate whether each coefficient differs significantly from zero.  
- **Residual standard error:** Estimates the typical size of the prediction errors, in the units of the response variable.  
- **Multiple R-squared:** Proportion of the response variable's variance explained by the model.  
- **F-statistic:** Tests if the model as a whole is statistically significant, i.e., if the predictors explain the response variable better than a model with no predictors.


In [None]:
# Plot Petal.Length (x-axis) vs Sepal.Length (y-axis)
plot(iris$Petal.Length, iris$Sepal.Length, 
     main = "Linear Regression: Sepal.Length vs Petal.Length", 
     xlab = "Petal Length", 
     ylab = "Sepal Length")

# Add the regression line in red
abline(mod, col = "red")
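An equivalent way to write the model is to name the columns in the formula and pass the data frame through the `data` argument. The sketch below (the names `mod2` and `pred` are illustrative) fits the same regression and shows how to pull individual pieces out of the fit; writing the model this way also lets `predict()` be used on new values.

```r
# Same regression, written with the data argument
mod2 <- lm(Sepal.Length ~ Petal.Length, data = iris)

coef(mod2)                 # intercept and slope
summary(mod2)$r.squared    # Multiple R-squared as a single number

# Predicted sepal length for a flower with a petal length of 4
pred <- predict(mod2, newdata = data.frame(Petal.Length = 4))
pred
```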

## 7.8 Dividing the plot window into multiple frames -> *par(mfrow = c(nrows, ncols))*

**par(mfrow = c(nrows, ncols))**: *splits the plotting area into a grid with `nrows` rows and `ncols` columns, allowing multiple plots to be displayed in the same window.*

- **nrows**: *number of rows in the grid*
- **ncols**: *number of columns in the grid*

After calling this, subsequent plots are filled row-wise in the grid.


In [None]:
# Complete example with 4 plots in a 2x2 layout
par(mfrow = c(2, 2))  # Divide the plot window into 2 rows and 2 columns

# Example data: first 2 rows are max and min for each variable
data <- data.frame(
  Speed = c(10, 0, 7, 8),
  Strength = c(10, 0, 9, 6),
  Agility = c(10, 0, 6, 7),
  Endurance = c(10, 0, 8, 9),
  Flexibility = c(10, 0, 7, 5)
)

# 1) Enhanced scatter plot
x <- c(1, 2, 3, 4, 5)          # Define x coordinates
y <- c(3, 7, 4, 6, 8)          # Define y coordinates
sizes <- c(1, 3, 2, 4, 5)      # Define point sizes

plot(x, y,
     main = "Enhanced Scatter Plot",  # Plot title
     xlab = "X values",                # Label for x-axis
     ylab = "Y values",                # Label for y-axis
     pch = 19,                        # Plotting symbol (solid circles)
     col = "blue",                    # Color of points
     asp = 1,                        # Aspect ratio 1:1
     cex.axis = 0.8,                 # Size of axis annotation
     cex = sizes                     # Size of points varies
)

# 2) Barplot of abundance distribution for spe$Pato (example)
freq <- table(spe$Pato)          # Frequency table of species abundance

barplot(freq,
        xlab = "Abundance class",    # Label for x-axis
        ylab = "Frequency",          # Label for y-axis
        col = "lightblue",           # Bar color
        las = 1,                    # Rotate axis labels horizontally
        main = "Abundance distribution of species Pato"  # Plot title
)

# 3) Enhanced scatter plot with a connecting line
plot(x, y,
     main = "Enhanced Scatter Plot with Line",  # Title
     xlab = "X values",
     ylab = "Y values",
     pch = 19,
     col = "blue",
     asp = 1,
     cex.axis = 0.8,
     cex = sizes
)

lines(x, y, col = "lightblue", lwd = 3)   # Add light blue line connecting points, width=3

# 4) Radar chart with customized colors and styles
radarchart(
  data,
  axistype = 1,
  pcol = c("red", "blue"),               # Line colors for each observation
  pfcol = c(rgb(1,0,0,0.3), rgb(0,0,1,0.3)), # Transparent fill colors
  plwd = 2,                             # Line width
  cglcol = "grey",                      # Color of concentric grid lines
  cglty = 1,                           # Grid line type (1 = solid)
  axislabcol = "darkgrey",             # Axis label color
  vlcex = 1.2                          # Variable label size
)

par(mfrow = c(1, 1))  # Reset plot window to single plot layout
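
The layout mechanism can also be sketched with built-in data only. The snippet below is a minimal illustration, not part of the original script: it assumes the built-in `iris` dataset and uses the hypothetical names `old_par` and `layout_used`. Because `par()` returns the previous values of the parameters it changes, the earlier layout can be restored without hard-coding `c(1, 1)`.

```r
# Switch to a 1 x 2 layout; par() invisibly returns the previous settings
old_par <- par(mfrow = c(1, 2))

# Two plots from the built-in iris data fill the grid row-wise
hist(iris$Sepal.Length,
     main = "Sepal length", xlab = "Sepal length (cm)", col = "lightblue")
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal length by species", ylab = "Sepal length (cm)")

layout_used <- par("mfrow")  # Query the layout currently in use
par(old_par)                 # Restore the layout that was active before
```

Restoring from the saved value is a design choice that keeps the cell safe to reuse inside a larger figure layout.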


<div style="padding: 10px; border:1px solid red; font-size: 18px;">
  <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 12</span><br/>
  Complete the code below to:<br/>
  1) Calculate the total number of species per sample using <code>rowSums()</code><br/>
  2) Plot the sampling points using the <code>Spa</code> coordinates<br/>
  3) Set the point size according to the total number of species in each sample<br/>
  4) Add a red line that connects all the points<br/>
</div>


In [None]:
# 1. Compute the total number of species per sample for the data frame spe
total_species <- ____ (____) # hint : rowSums(df)

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">1) Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>total_species &lt;- <span style="color:blue;">rowSums</span>(<span style="color:blue;">spe</span>)</code>
  </div>
</details>

In [None]:
# 2. Plot the sample locations using Spa coordinates
plot(____, pch = 21, bg = "lightblue", main = "Sample Locations") #hint : if a dataframe already has only two columns (X and Y), you can just use plot(df)


<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">2) Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>plot(<span style="color:blue;">spa</span>,  pch = 21, bg = "lightblue", main = "Sample Locations" )</code>
  </div>
</details>

In [None]:
# 3. Add point sizes corresponding to species richness
plot(___, pch = 21, bg = "blue", cex = ____/10,main ="Total number of species per samples location" )

# Hint: cex : sizes corresponding to the total species per sample divided by 10


<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">3) Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>
      plot(<span style="color:blue;">spa</span>, pch = 21, bg = "blue", cex = <span style="color:blue;">total_species / 10</span>, main = "Total number of species per samples location")
    </code><br/>
  </div>
</details>


In [None]:
plot(___, pch = 21, bg = "blue", cex = ___/10,main ="Total number of species per samples location" )
# 4. Draw a red line connecting all sample points
lines(____, col = ___, lwd = 5)


<details style="font-size: 18px;"> <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">4) Answer</summary> <div style="padding: 10px; border:1px solid blue; font-size: 20px;"> <code>plot(<span style="color:blue;">spa</span>, pch = 21, bg = "blue", cex = <span style="color:blue;">total_species</span>/10, main ="Total number of species per samples location" )</code><br/> <code>lines(<span style="color:blue;">spa</span>, col = <span style="color:blue;">"red"</span>, lwd = 5)</code> </div> </details>
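
One point worth checking when working with abundance tables: `rowSums()` applied directly to the table adds up the abundance values of each sample, whereas applying it to the logical table `> 0` counts how many species are present. The sketch below uses a small made-up table (`toy_spe` and its values are hypothetical, chosen only for illustration) to show the difference.

```r
# Hypothetical abundance table: 3 sites (rows) x 3 species (columns)
toy_spe <- data.frame(
  spA = c(0, 3, 5),
  spB = c(1, 0, 4),
  spC = c(0, 0, 2),
  row.names = c("site1", "site2", "site3")
)

# Sum of the abundance values in each site
total_abundance <- rowSums(toy_spe)

# Number of species with abundance > 0 in each site (species richness)
richness <- rowSums(toy_spe > 0)

total_abundance
richness
```

Either vector can be passed to `cex` in `plot()`; which one is appropriate depends on whether the map should reflect abundance or richness.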

<div style="padding: 10px; border:1px solid red; font-size: 18px;"> <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 13</span><br/> Complete the code below to:<br/> 1) Create a frequency table of the abundance values for the variable <code>iris$Sepal.Length</code>.<br/> 2) Plot a barplot of this frequency table with labels and a title. </div>

In [None]:
# 1) Create a frequency table of Sepal.Length
freq <- table(____)

# 2) Create a barplot of the frequency table
barplot(freq,
        xlab = "____",
        ylab = "____",
        col = "lightblue",
        las = 1,
        main = "Abundance distribution of Sepal Length")


<details style="font-size: 18px;"> <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">Answer</summary> <div style="padding: 10px; border:1px solid blue; font-size: 20px;"> <code>freq &lt;- table(<span style="color:blue;">iris$Sepal.Length</span>)</code><br/> <code>barplot(freq,</code><br/> <code>&nbsp;&nbsp;&nbsp;&nbsp;xlab = <span style="color:blue;">"Sepal Length classes"</span>,</code><br/> <code>&nbsp;&nbsp;&nbsp;&nbsp;ylab = <span style="color:blue;">"Frequency"</span>,</code><br/> <code>&nbsp;&nbsp;&nbsp;&nbsp;col = "lightblue",</code><br/> <code>&nbsp;&nbsp;&nbsp;&nbsp;las = 1,</code><br/> <code>&nbsp;&nbsp;&nbsp;&nbsp;main = <span style="color:blue;">"Abundance distribution of Sepal Length"</span>)</code> </div> </details>

<div style="padding: 10px; border:1px solid red; font-size: 18px;">
  <span style="text-decoration:underline; font-weight: bold; font-size: 22px;">Question 14</span><br/>
  Using the <code>iris</code> dataset, complete the following tasks:<br/>
  1) Fit a linear regression model predicting <code>Petal.Width</code> from <code>Petal.Length</code>. <br/>
  2) Display the summary of the linear regression model.<br/>
  3) From the summary, extract and report:<br/>
     - The intercept value<br/>
     - The slope value<br/>
     - Whether both coefficients are statistically significant (based on p-values).<br/>
     - How much variance is explained by the model (R-squared).<br/>
  4) Create a scatter plot of <code>Petal.Length</code> (x-axis) vs <code>Petal.Width</code> (y-axis).<br/>
  5) Add the regression line to the plot in blue color with a dashed line style.<br/>
</div>


In [None]:
# 1) Fit a linear regression model predicting Petal.Width from Petal.Length
mod <- lm(_________ ~ _________, data = iris)  # Fill in response ~ predictor

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">1) Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>mod &lt;- lm(<span style="color:blue;">Petal.Width ~ Petal.Length</span>, data = iris)</code>
  </div>
</details>


In [None]:
# 2) Display the summary of the model
summary(________)  # Fill in the model object

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">2) Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>summary(<span style="color:blue;">mod</span>)</code>
  </div>
</details>


<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">3) Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    Intercept value: (<span style="color:blue;">-0.363076</span>)<br/>
    Slope value: (<span style="color:blue;">0.415755</span>)<br/>
    Statistical significance of both coefficients: (<span style="color:blue;">yes, both are significant: p = 4.7e-16 for the intercept and p &lt; 2e-16 for the slope</span>)<br/>
    Variance explained: (<span style="color:blue;">R-squared: 0.9271</span>)<br/>
  </div>
</details>
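
The values reported in part 3 can also be pulled out of the model object instead of being read off the printed summary. This is a minimal sketch using standard accessors (`coef()`, and the `coefficients` and `r.squared` components of the summary); the object names `mod_summary`, `coefs`, `p_values` and `r_squared` are illustrative.

```r
# Fit the model from Question 14 on the built-in iris data
mod <- lm(Petal.Width ~ Petal.Length, data = iris)
mod_summary <- summary(mod)

# Named vector holding the intercept and the slope
coefs <- coef(mod)

# p-value of each coefficient, taken from the coefficient table
p_values <- mod_summary$coefficients[, "Pr(>|t|)"]

# Proportion of variance explained by the model
r_squared <- mod_summary$r.squared

round(coefs, 6)       # intercept and slope
p_values < 0.05       # TRUE where a coefficient is significant at the 5% level
round(r_squared, 4)   # multiple R-squared
```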

In [None]:
# 4) Scatter plot Petal.Length vs Petal.Width
plot(iris$________, iris$________, main="Scatter plot", xlab="Petal Length", ylab="Petal Width")


<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">4) Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>
      plot(<span style="color:blue;">iris$Petal.Length, iris$Petal.Width</span>, main = "Scatter plot", xlab = "Petal Length", ylab = "Petal Width")
    </code><br/>
  </div>
</details>


In [None]:
# 5) Scatter plot Petal.Length vs Petal.Width and regression line in blue, dashed
plot(iris$_____, iris$_______, 
     main = "Scatter plot of Petal Length vs Width", 
     xlab = "Petal Length", ylab = "Petal Width")

# Add regression line in blue, dashed
abline(____, col = "blue", lty = ___) #hint : to add a dashed line lty =2 

<details style="font-size: 18px;">
  <summary style="font-size: 20px; font-weight: bold; text-decoration: underline;">5) Answer</summary>
  <div style="padding: 10px; border:1px solid blue; font-size: 20px;">
    <code>
      plot(<span style="color:blue;">iris$Petal.Length, iris$Petal.Width</span>, main = "Scatter plot of Petal Length vs Width", xlab = "Petal Length", ylab = "Petal Width")
    </code><br/>
          <code>
      abline(<span style="color:blue;">mod </span>, col = "blue", <span style="color:blue;">lty = 2 </span>)
    </code><br/>
  </div>
</details>


# Exploring Iris Dataset

## 1.1 Load and display first Iris rows 

In [None]:
# Descriptive statistics using the iris dataset --------------------------------------------------
data(iris)                                  # Load the iris dataset
head(iris)                                  # Display the first rows of the dataset

## 1.2 Summary statistics for versicolor species

In [None]:
head(iris[iris$Species == "versicolor", ])

In [None]:
summary(iris[iris$Species == "versicolor", ])  # Summary statistics for versicolor species

## 1.3 Mean, standard deviation, median of specific columns (for all species)

In [None]:
mean(iris$Sepal.Length)                      # Calculate mean of Sepal.Length (all species)
sd(iris$Sepal.Length)                        # Calculate standard deviation of Sepal.Length (all species)
median(iris$Petal.Width)                     # Calculate median of Petal.Width (all species)

## 1.4 Mean, median, variance of specific columns for a specific species

In [None]:
mean(iris$Sepal.Length[iris$Species == "virginica"])     # Mean of Sepal.Length for virginica
median(iris$Sepal.Length[iris$Species == "virginica"]) # Median of Sepal.Length for virginica
var(iris$Sepal.Length[iris$Species == "virginica"])       # Variance of Sepal.Length for virginica

## 1.5 Display statistics for the Iris dataframe

In [None]:
# Compute descriptive statistics for the iris dataset
stat.desc(iris)

## 1.6 Display descriptive statistics by species


In [None]:
# Compute descriptive statistics for each species group in the iris dataset.
# Passing the grouping column explicitly avoids the need for attach()/detach().
describeBy(iris, group = iris$Species)
# Extract the 3rd element of the resulting list (the statistics for the 3rd group, "virginica")
describeBy(iris, group = iris$Species)[3]
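
For comparison, a grouped summary can also be obtained with base R alone. The sketch below is an alternative to `describeBy()`, not the method used in this script; it relies on `aggregate()` with a formula, and the names `species_means` and `species_sd` are illustrative.

```r
# Mean of every numeric column, computed separately for each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)

# Standard deviation of every numeric column, by species
species_sd <- aggregate(. ~ Species, data = iris, FUN = sd)

species_means
species_sd
```

`aggregate()` returns an ordinary data frame with one row per group, which is convenient when the results are needed for further computation or plotting.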


# Distribution of abundance (Doubs)

In [None]:
load("Doubs.RData")  # Load the Doubs data objects used below (e.g. spe, spa, env)

## 1.1 Get the overall minimum and maximum abundance values in the dataset `spe`

In [None]:
# Minimum and maximum of abundance values in the whole data set
range(spe)

## 1.2 Get the Minimum and maximum abundance for each species (by column) in the dataset `spe`

In [None]:
# Apply the function `range` to each column (species) of the dataset `spe`.
apply(spe, 2, range)

## 1.3 Count the frequency of each abundance value in the whole dataset `spe`

- *`unlist(spe)` converts the data frame `spe` into a single vector of all abundance values.*  
- *`table()` then counts how many times each unique abundance value appears.*  
- *`freq` stores the frequency table showing the number of occurrences per abundance class.*

In [None]:
# unlist(spe) converts the data frame spe into a single vector of all abundance values.
# table() then counts how many times each unique abundance value appears.
freq <- table(unlist(spe))

# Display the frequency table showing how many times each abundance value occurs
freq


## 1.4 Barplot of the frequency of abundance

In [None]:
# Create a barplot showing the frequency of each abundance
barplot(freq, 
        las = 1,                     # Horizontal axis labels
        xlab = "Abundance class",   # Label for x-axis
        ylab = "Frequency",         # Label for y-axis
        col = gray(5 : 0 / 5),      # Shades of gray for bars
        horiz = FALSE               # Vertical bars
)

## 1.5 Number of absences (zero values) in the data frame

In [None]:
# Number of absences
head(spe==0)
sum(spe == 0)

## 1.6 Calculates the ratio of zero values to the total number of elements in the spe data frame

In [None]:
# Proportion of zeros in the community data set
sum(spe == 0) / (nrow(spe) * ncol(spe))
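
The counting steps of this section can be checked by hand on a very small table. The sketch below uses a made-up data frame (`toy_spe` and its values are hypothetical) and also shows that `mean()` applied to the logical table gives the proportion of zeros directly.

```r
# Hypothetical abundance table: 3 sites x 3 species
toy_spe <- data.frame(
  spA = c(0, 3, 5),
  spB = c(1, 0, 4),
  spC = c(0, 0, 2)
)

# Frequency of each abundance value over the whole table
toy_freq <- table(unlist(toy_spe))

# Number of zeros, then proportion of zeros computed in two equivalent ways
n_zero <- sum(toy_spe == 0)
prop_zero_long <- n_zero / (nrow(toy_spe) * ncol(toy_spe))
prop_zero_short <- mean(toy_spe == 0)   # Mean of TRUE/FALSE = proportion of TRUE

toy_freq
prop_zero_short
```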

## 1.7 Matrix visualization of species abundance

In [None]:
# matrix visualization
heatmap.2(as.matrix(spe), dendrogram = "none", trace="none")

# Map of the locations of the sites

In [None]:
# Geographic coordinates x and y from the spa data frame
plot(spa, 
     asp = 1,            # Set aspect ratio to 1 for equal scaling of axes
     type = "n",         # Create an empty plot frame without plotting points
     main = "Site Locations",  # Main title of the plot
     xlab = "x coordinate (km)",  # Label for x-axis
     ylab = "y coordinate (km)"   # Label for y-axis
)
# Add a blue line connecting the sites along the Doubs River
lines(spa, col = "blue", lwd=3)  # Draw line connecting points in spa, blue color, line width 3

# Add the site labels at the corresponding coordinates
text(spa, row.names(spa), cex = 1, col = "red")  # Add red text labels using row names of spa, text size 1

# Add text blocks as annotations at fixed coordinates
text(68, 20, "Upstream", cex = 1.2, col = "red")   # Place "Upstream" label at coordinates (68, 20), bigger red text
text(15, 35, "Downstream", cex = 1.2, col = "red") # Place "Downstream" label at coordinates (15, 35), bigger red text

# Maps of some fish species 

In [None]:
# Divide the plot window into 4 frames, 2 per row
par(mfrow = c(2,2))

# Plot 1: Brown trout
plot(spa, 
     asp = 1,            # Set aspect ratio to 1 (equal scaling on x and y axes)
     cex.axis = 0.8,     # Scale axis annotation size to 80%
     col = "brown",      # Points colored brown
     cex = spe$Satr,     # Size of points proportional to spe$Satr values (Brown trout abundance)
     main = "Brown trout",  # Main title of the plot
     xlab = "x coordinate (km)",  # Label for x-axis
     ylab = "y coordinate (km)"   # Label for y-axis
)
lines(spa, col = "light blue", lwd=3)  # Add light blue line connecting points, width=3

# Plot 2: Grayling
plot(spa, 
     asp = 1, 
     cex.axis = 0.8, 
     col = "brown", 
     cex = spe$Thth,     # Size proportional to spe$Thth (Grayling abundance)
     main = "Grayling", 
     xlab = "x coordinate (km)", 
     ylab = "y coordinate (km)"
)
lines(spa, col = "light blue", lwd=3)

# Plot 3: Barbel
plot(spa, 
     asp = 1, 
     cex.axis = 0.8, 
     col = "brown", 
     cex = spe$Baba,     # Size proportional to spe$Baba (Barbel abundance)
     main = "Barbel", 
     xlab = "x coordinate (km)", 
     ylab = "y coordinate (km)"
)
lines(spa, col = "light blue", lwd=3)

# Plot 4: Common bream
plot(spa, 
     asp = 1, 
     cex.axis = 0.8, 
     col = "brown", 
     cex = spe$Abbr,     # Size proportional to spe$Abbr (Common bream abundance)
     main = "Common bream", 
     xlab = "x coordinate (km)", 
     ylab = "y coordinate (km)"
)
lines(spa, col = "light blue", lwd=3)
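
Since the four panels differ only in the species column and the title, the same figure can be produced with a loop. This is a sketch under stated assumptions: `toy_spa`, `toy_spe` and `titles` are made-up stand-ins for `spa`, `spe` and the species names, so the example runs without the Doubs data.

```r
# Hypothetical site coordinates and abundances (stand-ins for spa and spe)
toy_spa <- data.frame(X = c(0, 10, 20, 30), Y = c(0, 5, 3, 8))
toy_spe <- data.frame(spA = c(5, 3, 1, 0), spB = c(0, 1, 3, 5))
titles <- c(spA = "Species A", spB = "Species B")

old_par <- par(mfrow = c(1, 2))  # One panel per species, side by side

for (sp in names(toy_spe)) {
  plot(toy_spa,
       asp = 1,
       cex.axis = 0.8,
       col = "brown",
       cex = toy_spe[[sp]],       # Point size proportional to abundance
       main = titles[[sp]],
       xlab = "x coordinate (km)",
       ylab = "y coordinate (km)")
  lines(toy_spa, col = "lightblue", lwd = 3)  # Line connecting the sites
}

par(old_par)  # Restore the previous layout
```

With the real data, the loop would iterate over a vector of column names such as `c("Satr", "Thth", "Baba", "Abbr")` and index `spe` instead of `toy_spe`.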


# Compare species: number of occurrences

## 1.1 Count the number of sites where each species is present

In [None]:
# Compute the number of sites where each species is present:
# presences are summed over the sites (rows) for each species (column)
spe.pres <- apply(spe > 0,   # Logical matrix: TRUE if abundance > 0, FALSE otherwise
                  2,         # Apply function over columns (species)
                  sum)       # Count number of TRUE values (sites where species is present)

# Sort the vector 'spe.pres' in increasing order
# This arranges species by the number of sites where they are present, from fewest to most
sort(spe.pres)

## 1.2 Calculate the percentage of sites where each species is present

In [None]:
# Calculate the percentage of sites where each species is present
# spe.pres = number of sites with presence for each species
# nrow(spe) = total number of sites (rows)
# Multiply by 100 to get percentage values
spe.relf <- 100 * spe.pres / nrow(spe)
# Round the sorted output to 1 digit
round(sort(spe.relf), 1)
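
The same counts and percentages can be obtained with `colSums()` and `colMeans()`, which are shorthand for summing or averaging over columns. A minimal sketch on a made-up table (`toy_spe` and its values are hypothetical):

```r
# Hypothetical abundance table: 4 sites x 3 species
toy_spe <- data.frame(
  spA = c(0, 3, 5, 1),
  spB = c(1, 0, 0, 0),
  spC = c(0, 0, 2, 4)
)

# Number of sites where each species is present, computed two ways
pres_apply <- apply(toy_spe > 0, 2, sum)
pres_colsums <- colSums(toy_spe > 0)

# Percentage of sites where each species is present
relf <- 100 * colMeans(toy_spe > 0)

pres_colsums
relf
```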

## 1.3 Plot histograms

In [None]:
# Set the plotting window to have 1 row and 2 columns (side-by-side plots)
par(mfrow = c(1, 2)) 

# Histogram of the absolute number of sites each species occurs in
hist(spe.pres, 
     main = "Species Occurrences",        # Title of the histogram
     right = FALSE,                       # Intervals are left-closed, right-open
     las = 1,                            # Make axis labels horizontal
     xlab = "Number of occurrences",     # Label for x-axis
     ylab = "Number of species",         # Label for y-axis
     breaks = seq(0, 30, by = 5),        # Define breakpoints for histogram bins
     col = "bisque"                      # Color of the bars
)

# Histogram of the percentage frequency of occurrences per species
hist(spe.relf, 
     main = "Species Relative Frequencies", # Title of the histogram
     right = FALSE,                          # Intervals are left-closed, right-open
     las = 1,                               # Make axis labels horizontal
     xlab = "Frequency of occurrences (%)",# Label for x-axis
     ylab = "Number of species",            # Label for y-axis
     breaks = seq(0, 100, by = 10),         # Define breakpoints for histogram bins
     col = "bisque"                         # Color of the bars
)
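
The effect of `right = FALSE` is easiest to see on values that fall exactly on a break. The sketch below uses a made-up vector `x` and `plot = FALSE`, so that only the bin counts are computed.

```r
x <- c(0, 5, 5, 10)            # Two of the values sit exactly on the break at 5
breaks <- seq(0, 10, by = 5)   # Two bins: 0 to 5 and 5 to 10

# Left-closed bins: a value equal to 5 is counted in the second bin
counts_left_closed <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts

# Right-closed bins (the default): a value equal to 5 is counted in the first bin
counts_right_closed <- hist(x, breaks = breaks, right = TRUE, plot = FALSE)$counts

counts_left_closed
counts_right_closed
```

For occurrence counts, left-closed bins mean that a species present in exactly 5 sites is counted in the 5 to 10 class rather than the 0 to 5 class.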


# Compare sites: species richness

## 1.1 Compute the number of species at each site

In [None]:
# Compute the number of species present at each site
# For each site (rows), count how many species (columns) have abundance > 0
sit.pres <- apply(spe > 0,    # Logical matrix: TRUE if abundance > 0, FALSE otherwise
                  1,          # Apply function over rows (sites)
                  sum)        # Count number of TRUE values (species present at the site)

# Sort the vector 'sit.pres' in increasing order
# This arranges sites by the number of species present, from fewest to most
sort(sit.pres)


## 1.2 Species Richness Visualization: Step Plot Along Gradient and Geographic Bubble Map

In [None]:
par(mfrow = c(1, 2)) 
# Divide the plotting window into 1 row and 2 columns (2 plots side by side)

# Plot species richness (number of species per site) versus site number
plot(sit.pres, type = "s",           # type="s" draws a step plot (stairs)
     las = 1,                       # axis labels are horizontal
     col = "blue",                  # line color is blue
     main = "Species Richness vs. \n Upstream-Downstream Gradient",  # plot title with line break
     xlab = "Site numbers",         # x-axis label
     ylab = "Species richness"      # y-axis label
)
text(sit.pres, row.names(spe), cex = .8, col = "red")  
# Add site labels at points, in red color and smaller size

# Plot a bubble map of species richness using geographic coordinates
plot(spa, 
     asp = 1,                       # aspect ratio 1:1 so distances are proportional
     main = "Map of Species Richness", 
     pch = 21,                     # plotting symbol: filled circle with border
     col = "white",                # border color white
     bg = "brown",                 # fill color brown
     cex = 5 * sit.pres / max(sit.pres),  # size of bubbles proportional to species richness scaled by max value
     xlab = "x coordinate (km)", 
     ylab = "y coordinate (km)"
)
lines(spa, col = "light blue", lwd=3)  
# Draw a light blue line connecting the points in spa with thickness 3


# Calculate and display different alpha-diversity indices using the vegan R package!

## 1.1 Environmental Data 

In [None]:
# Divide the plotting window into a 2 by 2 grid (4 plots)
par(mfrow = c(2, 2))  


# Scatter plot of elevation vs. distance from the source
plot(env$dfs, env$ele,
     xlab = "Distance from the source (km)",  # Label x-axis
     ylab = "Elevation (m)",                   # Label y-axis
     pch = 16,                                # Use solid circles as plotting symbols
     col = "red",                             # Color points red
     main = "Elevation"                       # Title of the plot
)

# Line plot of water discharge vs. distance from the source
plot(env$dfs, env$dis, 
     type = "l",                             # Line plot
     xlab = "Distance from the source (km)",# Label x-axis
     ylab = "Discharge (m3/s)",              # Label y-axis
     col = "blue",                           # Line color blue
     main = "Discharge",                     # Title of the plot
     lwd = 2                                # Line width set to 2 for thickness
)


# Plot nitrate concentration vs. distance with points and lines
plot(env$dfs, env$nit, 
     type = "o",                             # Overplotted points and lines
     xlab = "Distance from the source (km)",# Label x-axis
     ylab = "Nitrate (mg/L)",                # Label y-axis
     col = "brown",                         # Color brown
     main = "Nitrate"                       # Title of the plot
)


# Plot oxygen concentration vs. distance with point size proportional to nitrate concentration
plot(env$dfs, env$oxy, 
     type = "b",                            # Both points and lines, connected
     xlab = "Distance from the source (km)",# Label x-axis
     ylab = "Oxygen (mg/L)",                # Label y-axis
     col = "green3",                       # Color green3 for points and lines
     main = "Oxygen",                      # Title of the plot
     cex = env$nit                        # Size of points scaled by nitrate concentration
)

## 1.2 Relationship between nitrate and oxygen

In [None]:
# Reset plotting window to a single plot (1 row, 1 column)
par(mfrow=c(1,1))  


# Scatter plot of oxygen concentration vs. nitrate concentration
plot(env$oxy ~ env$nit,  
     pch = 16,                        # Use solid circles for points
     xlab = "Nitrate (mg/L)",         # Label x-axis
     ylab = "Oxygen (mg/L)"           # Label y-axis
)

## 1.3 Fit a linear regression model predicting oxygen from nitrate

In [None]:
# Fit a linear regression model predicting oxygen from nitrate
mod <- lm(env$oxy ~ env$nit)  

# Display summary statistics of the linear model (coefficients, R², p-values, etc.)
summary(mod)  

## 1.4 Add the linear regression model to the graph

In [None]:
# Reset plotting window to a single plot (1 row, 1 column)
par(mfrow=c(1,1))  


# Scatter plot of oxygen concentration vs. nitrate concentration
plot(env$oxy ~ env$nit,  
     pch = 16,                        # Use solid circles for points
     xlab = "Nitrate (mg/L)",         # Label x-axis
     ylab = "Oxygen (mg/L)"           # Label y-axis
)

# Add the regression line to the plot with line width 2 and dotted line style (lty = 3)
abline(mod, # Add the regression line from the model to the plot
       lwd = 2,   # Line width = 2 (thicker line for visibility)
       lty = 3)   # Line type = 3 (dotted line; lty = 2 would give a dashed line)
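
As a quick reference for the `lty` argument, the six numbered line types of base R graphics can be drawn side by side. This is a small illustrative sketch; the vector `lty_names` simply lists the standard names in the order of their numeric codes 1 to 6.

```r
# Standard names of the numbered line types 1 to 6
lty_names <- c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash")

# Empty plotting frame, then one horizontal line per line type
plot(NULL, xlim = c(0, 1), ylim = c(0.5, 6.5),
     xlab = "", ylab = "lty", yaxt = "n", main = "Line types in base R")
axis(2, at = 1:6, las = 1)

for (i in 1:6) {
  abline(h = i, lty = i, lwd = 2)        # Line drawn with line type i
  text(0.5, i + 0.3, lty_names[i])       # Name of the line type above the line
}
```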


## 1.5 Radar plots

### 1.5.1 Setting Fixed Min/Max Reference for Environmental Radar Charts

In [None]:
# Create a data frame with max and min values for environmental variables
max_min <- data.frame(
  dfs = c(450, 0),      # Distance from source (max, min)
  ele = c(1000, 150),   # Elevation
  slo = c(50, 0),       # Slope
  dis = c(70, 0),       # Discharge
  pH  = c(9, 6),        # pH
  har = c(120, 20))     # Hardness
rownames(max_min) <- c("Max", "Min")
max_min

### 1.5.2 Overview of the Environmental Data (`env`)

In [None]:
head(env)

### 1.5.3 Stacking the min and max reference rows above the real environmental data (`env`) -> *(`env2`)*

In [None]:
# Add the real environmental data (first 6 columns) below max/min
env2 <- rbind(max_min, env[, 1:6])
head(env2)  # Show the new data frame with max, min, and actual data
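
The reference rows above use fixed, hand-picked limits. As an alternative sketch (not the approach taken in this script), the Max and Min rows can be derived from the data itself. The example assumes the built-in `mtcars` dataset so that it runs without the Doubs data; `vars`, `cars4`, `ref` and `radar_df` are illustrative names, and the chart is only drawn if the `fmsb` package is installed.

```r
# Variables to display and the observations to plot (first 4 cars)
vars <- c("mpg", "hp", "wt", "qsec")
cars4 <- mtcars[1:4, vars]

# Reference rows computed from the full dataset: Max first, then Min
ref <- as.data.frame(rbind(
  Max = sapply(mtcars[, vars], max),
  Min = sapply(mtcars[, vars], min)
))

# Same structure as env2: Max, Min, then the observations
radar_df <- rbind(ref, cars4)
radar_df

# Draw the radar chart only if fmsb is available
if (requireNamespace("fmsb", quietly = TRUE)) {
  fmsb::radarchart(radar_df)
}
```

Data-driven limits guarantee that every observation fits inside the chart, while fixed limits keep the scale comparable across datasets.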

### 1.5.4 Prepare Data for Radar Chart of a Single Site

In [None]:
# Select 3 rows: Max, Min, and site 4 (site 4 is in row 6 of env2)
sample4 <- env2[c(1, 2, 6),]
sample4

### 1.5.5 Plot radar chart for a single site (site 4) with Max and Min as scale

In [None]:
library(fmsb)

In [None]:
# Plot radar chart for a single site (site 4) with Max and Min as scale
radarchart(sample4)

### 1.5.6 Plot radar chart for 4 sites with Max and Min as scale

In [None]:
# Adjust plot margins for next radar chart (bottom = 1, left=2, top=2, right=2)
par(mar = c(1, 2, 2, 2))

# Radar chart for the first 4 sites, scaled by Max/Min
# env2[1:6,] includes rows for Max, Min, and the first 4 sites (rows 3 to 6)
radarchart(env2[1:6,]) 

# Add legend for the 4 sites (excluding Max and Min)
legend(x = "bottomright", # Position the legend at the bottom right corner
       legend = rownames(env2[3:6,]),  # Site names
       horiz = TRUE,                   # Arrange legend items horizontally
       bty = "n",                      # No box around legend
       pch = 20,                       # Solid circle symbol
       col = 1:4,                      # Colors for the 4 sites
       title = "sites")                # Add the title "sites" above the legend

## 2 Scatter plots for all pairs of environmental variables

### 2.1 Scatterplot matrix for all variables in `env`

In [None]:
# Creates a scatterplot matrix for all variables in 'env' (data frame or matrix).
# Each panel shows a scatterplot between two different variables in 'env'.
pairs(env)

### 2.2 Scatterplot matrix for only the first 5 variables in `env`

In [None]:
# Creates a scatterplot matrix for only the first 5 columns (variables) of 'env'.
# Useful to limit the display when there are many variables.
pairs(env[1:5])


### 2.3 Custom Histogram Panel Function for Pairs Plot

The function `panel.hist` draws a small histogram within a plot panel. It is typically used as the diagonal panel in pairs plots.

- It saves the current user coordinates (`par("usr")`) so that they can be restored later.
- It adjusts the y-axis limits to fit the normalized histogram heights.
- It computes the histogram of the given data without plotting it directly.
- It normalizes the histogram counts to scale between 0 and 1.
- It draws cyan-colored rectangles representing histogram bars in the current plotting area.

This allows displaying compact histograms as part of larger multi-plot visualizations.


In [None]:
panel.hist <- function(x, ...) {
  usr <- par("usr")             # Save the current user coordinates (limits of the plot region)
  on.exit(par(usr = usr))       # Restore them by name when the function exits
  
  par(usr = c(usr[1:2], 0, 1.5)) # Set new y-axis limits from 0 to 1.5, keep x-axis limits unchanged
  
  h <- hist(x, plot = FALSE)    # Compute histogram data for x without plotting
  
  breaks <- h$breaks            # Extract histogram bin boundaries
  nB <- length(breaks)          # Get number of bin boundaries
  
  y <- h$counts                 # Extract counts for each bin
  y <- y / max(y)               # Normalize counts to max = 1 for scaling
  
  rect(breaks[-nB], 0,         # Draw rectangles (bars) for histogram:
       breaks[-1],             # from each bin start (except last) ...
       y,                     # ... up to the normalized count height ...
       col = "cyan", ...)      # ... fill color cyan, plus other graphical args
}


### 2.4 Custom Correlation Panel Function for Pairs Plot

The `panel.cor` function displays the correlation coefficient between two variables in a plot panel, often used in the upper or lower panels of pairs plots.

- It saves the current plotting parameters to restore them after execution.
- It sets the coordinate system to a fixed 0-to-1 scale for both axes to position the text easily.
- It computes the Pearson correlation coefficient between `x` and `y`, rounded to two decimals.
- It creates a label text showing the correlation, e.g., "R = 0.85".
- It adjusts the text size inversely proportional to the label width, scaled by the correlation value.
- It draws the correlation text centered in the panel with the calculated size.

This function helps visualize correlation strength compactly within scatterplot matrices.


In [None]:
panel.cor <- function(x, y) {
  usr <- par("usr")          # Save the current user coordinates (limits of the plot region)
  on.exit(par(usr = usr))    # Restore them by name when the function exits
  
  par(usr = c(0, 1, 0, 1))  # Set coordinate system to [0,1] for both axes for easy text placement
  
  r <- round(cor(x, y), 2)  # Calculate Pearson correlation coefficient, rounded to 2 decimals
  
  txt <- paste0("R = ", r)  # Create text label showing correlation, e.g., "R = 0.85"
  
  cex.cor <- 0.8 / strwidth(txt)  # Calculate character expansion factor based on text width to fit nicely
  
  text(0.5, 0.5, txt, cex = cex.cor * abs(r))  # Draw centered text; size scales with |r| so negative correlations are drawn too
}


### 2.5 Custom Correlation Panel Function (Simplified) for Pairs Plot

The `panel.cor2` function displays the Pearson correlation coefficient between two variables inside a plot panel, designed for use in pairs plots.

- It saves the current graphical parameters to restore them after the function finishes.
- It sets the coordinate system to a fixed range from 0 to 1 on both axes, allowing precise text placement.
- It calculates the Pearson correlation coefficient between `x` and `y`, rounded to two decimal places.
- It creates a text label showing the correlation coefficient, e.g., "R = 0.92".
- It places this label at the center of the panel (0.5, 0.5) with a fixed large font size (`cex = 2`).

This simplified function provides a clear and readable way to display correlation values within the upper or lower panels of scatterplot matrices.


In [None]:
panel.cor2 <- function(x, y){
  usr <- par("usr")            # Save the current user coordinates (limits of the plot region)
  on.exit(par(usr = usr))      # Restore them by name when the function exits
  par(usr = c(0, 1, 0, 1))    # Set user coordinates to [0,1] for x and y axes (normalized plot region)
  
  r <- round(cor(x, y), digits=2)        # Calculate correlation between x and y, rounded to 2 decimals
  txt <- paste0("R = ", r)                # Create text string to display correlation coefficient
  
  text(0.5, 0.5, txt, cex = 2)           # Add the correlation text at center (0.5, 0.5) with font size 2
}


### 2.6 Bivariate plots with histograms and smooth curves (correlation font size depends on its value)

In [None]:
pairs(env[1:5],                    # Create pairwise scatterplots for the first 5 variables in 'env'
      panel = panel.smooth,       # Scatterplots with a lowess smoothing curve (shown in the lower panels)
      diag.panel = panel.hist,    # Use the custom histogram function on the diagonal panels
      upper.panel = panel.cor,    # Use the custom correlation display function on the upper panels
      cex.labels = 3,             # Set label size for variable names to 3 (larger)
      main = "Bivariate Plots with Histograms and Smooth Curves"  # Title of the whole plot matrix
)


### 2.7 Bivariate plots with histograms and smooth curves (fixed correlation font size)

In [None]:
pairs(env[1:5],                     # Create a pairs plot for the first 5 columns of the data frame 'env'
      panel = panel.smooth,        # Use 'panel.smooth' function to draw scatter plots with smooth curves in the lower panels
      diag.panel = panel.hist,     # Use 'panel.hist' function to draw histograms on the diagonal panels
      upper.panel = panel.cor2,    # Use 'panel.cor2' function to display correlation coefficients in the upper panels
      cex.labels = 3,              # Set the size of the variable names shown on the diagonal to 3 (larger text)
      main = "Bivariate Plots with Histograms and Smooth Curves"  # Set the main title of the plot
)
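
The coefficients printed in the upper panels can also be obtained as a plain numeric table with `cor()`. The sketch below assumes the built-in `iris` dataset so that it runs without the Doubs data; with the environmental data, the equivalent call would be `cor(env[1:5])`. The name `cor_matrix` is illustrative.

```r
# Pearson correlations between all pairs of numeric columns, rounded to 2 decimals
cor_matrix <- round(cor(iris[, 1:4]), 2)
cor_matrix

# A single coefficient can be looked up by row and column name
cor_matrix["Petal.Length", "Petal.Width"]
```

The matrix is symmetric with ones on the diagonal, which is why the pairs plot can use the diagonal and one triangle for other displays.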
