# Introduction to R

Szymon Talaga | 15 March 2020

<hr>

This is a very brief introduction to `R`. It should be enough to get a grasp of most of the crucial concepts, but it is by no means comprehensive.
Luckily, you can find a lot of resources online, for instance on platforms such as [DataCamp](https://www.datacamp.com/).

## Basic data types and data structures in `R`

The `R` language uses many different data types and data structures. Luckily for us, only a few of them are important and useful for typical users.

All data types in `R` can be divided into:

* **Atomic types.** These are values such as numbers (`1` or `7`) or strings (in technical terms strings are not atomic values, but for all practical intents and purposes they can be treated as atomic values in `R`). In `R`, vectors of values of a given type, e.g. vectors of integers, are also considered atomic.
* **Composite types.** These types are composed of other types which may be both composite and atomic. For instance a list of numbers is composite.
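
The distinction above can be checked directly with base `R` functions. The snippet below is only a small sketch; the variable names (`ints`, `strs`, `mixed`) are made up for illustration.

```r
# Atomic vectors hold values of a single type.
ints <- c(1L, 2L, 3L)
strs <- c("a", "b")
is.atomic(ints)   # TRUE
typeof(ints)      # "integer"
is.atomic(strs)   # TRUE
typeof(strs)      # "character"

# A list is composite: its elements may have different types.
mixed <- list(1L, "a", TRUE)
is.atomic(mixed)  # FALSE
is.list(mixed)    # TRUE
```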

### Most important atomic types

In [3]:
### NUMERIC (FLOATING POINT NUMBERS)
c(1, 1.5, 2.5, 7.2)

In [5]:
### INTEGERS
### (the `L` suffix is required; without it `R` stores the values as floating point numbers)
c(1L, 10L, 11L)

In [1]:
### LOGICAL VALUES
c(TRUE, FALSE)

In [2]:
#### STRINGS
c("a string", "a different string")

In [6]:
#### FACTORS (CATEGORICAL VALUES)
factor(c("level I", "level II"))

In [7]:
#### MISSING VALUES
c(NA, NULL)

Note that the `NA` value (denoting missing data) was printed, while `NULL` was dropped from the vector and not printed at all.
This is an important feature of `R`, which has two distinct kinds of missing data.

`NULL` is meant to represent a non-existent object, while `NA` is a placeholder for a missing data point.
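
The practical difference shows up as soon as we measure or summarize a vector. Below is a minimal sketch with made-up values: `NA` occupies a position in the vector, while `NULL` simply disappears.

```r
x <- c(1, NA, 3)
length(x)              # 3: NA occupies a position
is.na(x)               # FALSE TRUE FALSE
mean(x)                # NA: the mean of unknown data is unknown
mean(x, na.rm = TRUE)  # 2: missing values removed first

y <- c(1, NULL, 3)
length(y)              # 2: NULL was dropped
is.null(NULL)          # TRUE
length(NULL)           # 0
```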

### Operations on atomic data

In [8]:
#### ARITHMETIC OPERATORS
1 + 2     # addition
100 - 20  # subtraction
17 * 12   # multiplication
100 / 10  # division
2^8       # raising to a power
16^(1/4)  # taking a root (raising to a fractional power)

Of course, the operations that we want to perform have to be well-defined.

In [11]:
#### CANNOT MULTIPLY A NUMBER AND A STRING
#### THIS IS NOT A WELL-DEFINED OPERATION
10 * "10"

ERROR: Error in 10 * "10": non-numeric argument to binary operator
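
If the string actually holds a number, the operation becomes well-defined once we convert it explicitly. The sketch below uses the base functions `tryCatch` and `as.numeric`; the variable name `result` is only illustrative.

```r
# Multiplying a number by a string fails; tryCatch lets us capture the error
# instead of stopping the script.
result <- tryCatch(10 * "10", error = function(e) "error")
result                 # "error"

# Converting the string to a number first makes the operation well-defined.
10 * as.numeric("10")  # 100
```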


### Logical operators

One of the great features of `R` is its approach to handling missing data in logical operations.
As a matter of fact in `R` we do not follow the standard 2-valued logic, but we use a 3-valued
logic of Łukasiewicz.

In this 3-valued logic we have, of course, three values:

* `TRUE`
* `FALSE`
* `NA` (missing information)

The general idea is that if one of the arguments is `NA` but the other one is still enough to determine the outcome
of a logical operation then we should get this outcome. A result can be `NA` only if it is not possible to determine
whether it should be `TRUE` or `FALSE`.
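
A few one-liners illustrate this rule; they can be typed directly into the console.

```r
NA & FALSE   # FALSE: the result is FALSE whatever the missing value is
NA & TRUE    # NA: the result depends on the missing value
NA | TRUE    # TRUE: the result is TRUE whatever the missing value is
NA | FALSE   # NA: the result depends on the missing value
!NA          # NA
```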

Below we present truth tables for standard logical operations in `R`.

#### Negation (`!x`)

| `x`  | TRUE  | FALSE | NA |
| ---- | ----- | ----- | -- |
| `!x` | FALSE | TRUE  | NA |

In [20]:
#### NEGATION
x <- c(FALSE, TRUE, NA)
!x

#### Conjunction / and (`x & y`)

In [23]:
#### CONJUNCTION (AND)
x <- c(TRUE, FALSE, NA)
m <- outer(x, x, FUN = "&")
colnames(m) <- x
rownames(m) <- x
m

| `x & y` | TRUE  | FALSE | NA    |
| ------- | ----- | ----- | ----- |
| TRUE    | TRUE  | FALSE | NA    |
| FALSE   | FALSE | FALSE | FALSE |
| NA      | NA    | FALSE | NA    |


#### Disjunction / or (`x | y`)

In [25]:
#### DISJUNCTION (OR)
x <- c(TRUE, FALSE, NA)
m <- outer(x, x, FUN = "|")
colnames(m) <- x
rownames(m) <- x
m

| `x \| y` | TRUE | FALSE | NA   |
| -------- | ---- | ----- | ---- |
| TRUE     | TRUE | TRUE  | TRUE |
| FALSE    | TRUE | FALSE | NA   |
| NA       | TRUE | NA    | NA   |


#### Summary of logical operations

In [26]:
!FALSE          # Negation
TRUE & FALSE    # And
TRUE | FALSE    # Or

# Comparisons
# ===========
1 == 1      # equality test 
1 != 1      # inequality test
1 > 1       # greater than test
1 < 1       # less than test
1 >= 1      # greater than or equal test
1 <= 1      # less than or equal test

### Vectorization of operations on atomic data

Another very important and useful feature of `R` is its vectorization of operations on atomic data. In general, `R` always treats atomic data as vectors, so a variable can always hold a sequence of values of a given type instead of only a single value. Single values (scalars) in `R` are in fact represented as vectors with just one element.

Most importantly, all basic logical and arithmetic operations are fully vectorized in `R`.
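
As a quick illustration, operators applied to two vectors of the same length work element by element. The vectors `a` and `b` below are made up for this sketch.

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)
a + b            # 11 22 33
a * b            # 10 40 90
a > 2            # FALSE FALSE TRUE
a > 1 & b < 30   # FALSE TRUE FALSE
```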

In [28]:
# Vectors are often created with `c` function which is just a function
# that joins (concatenates) multiple elements in a single sequence.
x <- c(1, 15, 100, 120, 7)
# An example of a vectorized operation
x * 2
length(x)   # Length of a vector

Typically, in other programming languages an operation like that would have to be represented in a more complicated way with an explicit for-loop.
Of course in `R` we can do this too, but in general we should stick to vectorization whenever possible as vectorization is not only about a convenient
syntax. It also allows computations to be much faster.
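
One way to check the speed claim on your own machine is to time both variants with `system.time`. The sketch below assumes a vector of 100,000 numbers (an arbitrary size chosen for illustration); the exact timings depend on the machine and are therefore not shown.

```r
n <- 1e5
x <- as.numeric(seq_len(n))

# Vectorized version
vectorized <- x * 2

# Loop version, with the result vector preallocated
looped <- numeric(n)
for (i in seq_along(x)) {
    looped[i] <- x[i] * 2
}

identical(vectorized, looped)   # TRUE: both approaches give the same result

# system.time() reports how long an expression takes to evaluate
system.time(x * 2)
system.time(for (i in seq_along(x)) looped[i] <- x[i] * 2)
```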

In [29]:
#### EXPLICIT FOR-LOOP OVER AN ATOMIC VECTOR
x <- c(1, 15, 100, 120, 7)

for (i in seq_along(x)) {   # `seq_along` is safe even when `x` is empty
    x[i] <- x[i] * 2
}
x

### Vector indexing

Vectors may consist of multiple elements. We can address particular elements (or subsets of elements) using so-called _indexing_. Indexing is an operation
of extracting particular elements from a vector by addressing them with their numerical positions along the vector or with a logical mask.

In [30]:
letters[5:10]       # Sequence from the 5th to the 10th letter of the Latin alphabet
# By giving negative indexes we can drop particular elements
letters[-(5:10)]    # The Latin alphabet without letters from 5 to 10.
# Negative and positive indexes cannot be mixed.
# However, we can work around this by indexing multiple times.
# But when combining multiple indexing operations we have to
# remember that usually the size of our vector shrinks
# with every operation.
letters[5:10][-3]   # Sequence from the 5th to the 10th letter of the Latin alphabet without the 7th one.

In [31]:
### LOGICAL MASK
x <- 1:10       # numbers from 1 to 10
x %% 2 == 0     # logical vector indicating whether a given element is divisible by 2 or not

In [33]:
### LOGICAL MASK
### A logical vector of the same length can be used for filtering values
x[x %% 2 == 0]

Additionally, values in vectors can be associated with names (this gives named vectors). In such a case we can address elements by their names instead of
positions or logical masks. Below we show a very important application of this approach. We will create a simple _lookup table_, which is a data structure
that allows us to quickly map one set of values to a different set of values. We will use a lookup table to quickly _recode_ a categorical variable, that is,
change the names of its values (levels).
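
Before the full example, here is a minimal sketch of name-based indexing on a small named vector. The data (`ages`) are hypothetical.

```r
ages <- c(alice = 31, bob = 27, carol = 45)   # hypothetical data
ages["bob"]                  # a named vector of length 1
ages[c("carol", "alice")]    # elements are returned in the requested order
unname(ages["bob"])          # 27
names(ages)                  # "alice" "bob" "carol"
```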

In [34]:
### SET RANDOM SEED
set.seed(303)

levels <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
## CREATE A LONG VECTOR OF CATEGORICAL VALUES
categorical_variable <- factor(sample(levels, size = 10000, replace = TRUE))
head(categorical_variable)

In [36]:
### NOW WE DEFINE THE LOOKUP TABLE
lookup_table <- c(
    A = "new A",
    B = "new B",
    C = "new C",
    D = "new D",
    E = "new E",
    F = "new F",
    G = "new G",
    H = "new H",
    I = "new I",
    J = "new J"
)

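## `as.character` ENSURES THAT WE INDEX BY LEVEL NAMES AND NOT BY THE INTEGER CODES OF THE FACTOR
new_categorical_variable <- unname(factor(lookup_table[as.character(categorical_variable)], levels = lookup_table))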
head(new_categorical_variable)
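
A caveat worth knowing when a lookup table is indexed with a factor: `R` then uses the factor's integer codes rather than its labels, so the result is only correct when the table happens to be ordered like the factor levels. Converting the factor with `as.character` first makes the lookup name-based. A minimal sketch with made-up values (`f` and `lookup` are illustrative names):

```r
f <- factor(c("b", "a", "b"))          # levels are sorted: "a", "b"
lookup <- c(b = "new b", a = "new a")  # names in a different order than the levels

as.integer(f)                    # 2 1 2: the integer codes of the factor
unname(lookup[f])                # "new a" "new b" "new a": indexed by position (wrong)
unname(lookup[as.character(f)])  # "new b" "new a" "new b": indexed by name (right)
```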

In [38]:
## MOREOVER, IN THE CASE OF A REGULAR RENAMING LIKE THE ONE WE DID ABOVE
## WE CAN EXECUTE THE ENTIRE OPERATION EVEN QUICKER BY USING
## SIMPLE TEXT PROCESSING METHODS.
##
## INSTEAD OF DEFINING THE `lookup_table` OBJECT BY HAND WE CAN CREATE IT
## BY PASTING THE STRING `new` TO THE ORIGINAL NAMES OF THE LEVELS.
## WE WILL USE THE `paste` FUNCTION TO ACHIEVE THIS
## (SEE THE DOCUMENTATION: ?paste).
lookup_table2 <- paste("new", levels)
names(lookup_table2)
## THE LOOKUP TABLE DOES NOT HAVE NAMES YET ...
lookup_table2
## SO WE HAVE TO ADD THE `names` ATTRIBUTE
names(lookup_table2) <- levels
lookup_table2
## NOTICE THAT BOTH TABLES ARE IDENTICAL
identical(lookup_table, lookup_table2)

## NOW WE CAN MAKE THE SUBSTITUTION EXACTLY LIKE BEFORE
## (`as.character` MAKES SURE WE INDEX BY LEVEL NAMES, NOT BY INTEGER CODES)
new_categorical_variable2 <- unname(factor(lookup_table2[as.character(categorical_variable)], levels = lookup_table2))
## AND THE OUTCOME IS IDENTICAL TOO OF COURSE
identical(new_categorical_variable, new_categorical_variable2)

NULL

## Composite types

Lists are the most important composite data structure in `R` because almost all other, more complex data structures are built on top of them.

A `list` is an ordered sequence of arbitrary objects (that may be also mapped to names). Hence, a list may be composed of atomic values (vectors) but also any other kind of object, even different lists.

Lists are very useful because they allow storing many objects of different types in an organized and ordered fashion.

In [39]:
## EXAMPLE OF A LIST
## =================
## NOTE THAT LIST ELEMENTS MAY BE OF ANY TYPE, INCLUDING OTHER LISTS OR EVEN A DATA FRAME.
## MOREOVER, ELEMENTS MAY BE MAPPED TO NAMES BUT DO NOT HAVE TO, AND NAMED AND UNNAMED
## ELEMENTS MAY APPEAR SIDE BY SIDE (ALTHOUGH IT IS GENERALLY GOOD PRACTICE
## TO EITHER USE NO NAMES AT ALL OR TO NAME ALL ELEMENTS).
Lista <- list(1, a = 2, list(a = 2, b = 4), b = list(a = NULL, b = head(iris, n = 5)))
print(Lista)

## LISTS ARE ALWAYS ORDERED AND CAN BE INDEXED WITH NUMERICAL INDEXES.
Lista[2]
## NOTE THAT THIS OPERATION DID NOT RETURN A SINGLE NUMBER
## BUT A NUMBER WRAPPED IN A LIST (1-ELEMENT LONG LIST).
##
## THE `[` OPERATOR ALWAYS RETURNS A LIST, EVEN FOR A SINGLE ELEMENT.
## IN ORDER TO EXTRACT A PARTICULAR VALUE WE NEED TO USE A DIFFERENT INDEXING OPERATOR `[[`
Lista[[2]]      # CLEARLY, THIS TIME THE RESULT IS JUST A SINGLE NUMBER (LENGTH 1 VECTOR)

[[1]]
[1] 1

$a
[1] 2

[[3]]
[[3]]$a
[1] 2

[[3]]$b
[1] 4


$b
$b$a
NULL

$b$b
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
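
The difference between `[` and `[[` also applies to named elements, which can additionally be reached with the `$` operator. The sketch below uses a made-up list (`person`) to show the three access styles side by side.

```r
person <- list(name = "Ada", scores = c(90, 85), address = list(city = "London"))  # hypothetical data

person["name"]           # a list of length 1
class(person["name"])    # "list"
person[["name"]]         # "Ada"
person$name              # same as person[["name"]]
person$address$city      # "London": access to nested lists can be chained
person[["scores"]][2]    # 85
names(person)            # "name" "scores" "address"
```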




## Data frames

The most important data structure for typical statistical data analysis is the data frame. It is a rectangular ($n$-by-$m$) table in which rows correspond to
observations and columns to variables (features of the observations).

Internally, data frames in `R` are represented as lists in which every element is a vector of the same length. This implementation detail is important
because it means that column-oriented operations will typically be easier to define and more efficient.
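
The list-like nature of data frames can be seen on a tiny example. The data frame `df` below is made up for illustration; `stringsAsFactors = FALSE` is passed explicitly so that the result does not depend on the `R` version.

```r
df <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)  # hypothetical data

is.list(df)      # TRUE
length(df)       # 2: the number of columns, not rows
names(df)        # "id" "label"

# Functions that work on lists work column by column on data frames.
vapply(df, class, character(1))   # id: "integer", label: "character"
```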

In the subsequent sections we will use the famous [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) to show what one can do with data frames in `R`.

In [41]:
#### PEEK AT THE FIRST FEW ROWS
head(iris)

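|   | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
| - | ------------ | ----------- | ------------ | ----------- | ------- |
|   | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |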


### Data frames as lists

In [42]:
### DATA FRAMES SUPPORT EXTRACTION OF COLUMNS WITH THE `$` OPERATOR.
### THE RETURNED VALUE IN THIS CASE IS AN ATOMIC VECTOR REPRESENTING A GIVEN COLUMN.
head(iris$Sepal.Length)

### COLUMNS (VARIABLES / FEATURES) CAN ALSO BE EXTRACTED WITH THE `[[` OPERATOR
head(iris[["Species"]])

### THIS IS IMPORTANT BECAUSE WE CAN USE `[[` OPERATOR TO PASS A COLUMN NAME
### VIA A VARIABLE.
column_name <- "Species"
head(iris[[column_name]])

### INTERNALLY DATA FRAMES ARE LISTS SO THEY ARE ITERABLE.
### WE CAN ITERATE OVER THEM, FOR INSTANCE, WITH A FOR-LOOP.
### IN A FOR LOOP WE ITERATE OVER COLUMNS, NOT ROWS.
### (BECAUSE A DATA FRAME IS A LIST OF COLUMNS).
for (i in iris) {
    print(head(i))
}

### WE CAN EASILY CONVINCE OURSELVES THAT DATA FRAMES ARE REALLY LISTS.
### WE CAN USE THE `unclass` FUNCTION TO SEE WHAT BASE OBJECT IS USED TO REPRESENT A DATA FRAME.
L <- unclass(iris)
L
### CLEARLY, IT IS REALLY JUST A LIST OF VECTORS WITH AN ADDITIONAL `row.names` ATTRIBUTE.

### BELOW WE SHOW HOW EASILY THIS UNCLASSING PROCESS CAN BE REVERSED.
class(L) <- "data.frame"
head(L)

[1] 5.1 4.9 4.7 4.6 5.0 5.4
[1] 3.5 3.0 3.2 3.1 3.6 3.9
[1] 1.4 1.4 1.3 1.5 1.4 1.7
[1] 0.2 0.2 0.2 0.2 0.2 0.4
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica


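|   | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
| - | ------------ | ----------- | ------------ | ----------- | ------- |
|   | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |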


### Data frame indexing

The crucial difference between a data frame and a bare list is the fact that it supports two-dimensional indexing along rows and/or columns.
This is why data frames are the most important data structure in data analysis and statistics. They give us convenient means for subsetting
our data (row indexing) and variables (column indexing).

Efficient row indexing is possible because data frames force all columns to be of the same length.

In [44]:
### ELEMENTS CAN BE ADDRESSED WITH NUMERIC INDEXES
iris[5, 5]

### AT THE SAME TIME THEY CAN BE ADDRESSED BY NAMES
iris[5, "Species"]

### ROWS ALSO CAN BE INDEXED BY NAMES (IF THEY ARE DEFINED)
iris["5", "Species"]

However, indexing by single values has very limited utility. Luckily, we can also index with multiple values which can also be passed as variables.

In [45]:
x  <- c(1, 5, 6)     # row indexes
y1 <- c(1, 2)        # column indexes
y2 <- c("Sepal.Length", "Sepal.Width")  # column names
iris[x, y1]
iris[x, y2]

## IF WE WANT WE MAY TAKE ALL ELEMENTS ALONG A GIVEN AXIS
## BY PASSING NO INDEX.
head(iris[ , "Species"])    # all records for the `Species` column 
iris[1:5, ]                 # first 5 rows for all columns 

|   | Sepal.Length | Sepal.Width |
| - | ------------ | ----------- |
|   | `<dbl>` | `<dbl>` |
| 1 | 5.1 | 3.5 |
| 5 | 5.0 | 3.6 |
| 6 | 5.4 | 3.9 |

|   | Sepal.Length | Sepal.Width |
| - | ------------ | ----------- |
|   | `<dbl>` | `<dbl>` |
| 1 | 5.1 | 3.5 |
| 5 | 5.0 | 3.6 |
| 6 | 5.4 | 3.9 |

|   | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
| - | ------------ | ----------- | ------------ | ----------- | ------- |
|   | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


Indexing by positions (numeric indexes) or names is not always the most convenient approach. Sometimes we need to filter elements according to some logical test.
We can easily achieve this because data frames also support indexing with logical masks.

We can use logical masks to filter rows and columns separately, that is, to filter both rows and columns we have to pass two different masks.
Moreover, we can combine numeric or name-based indexes with logical masks. For instance, we may filter rows with a logical mask and columns by names
or positions.

A logical mask along a given axis (rows or columns) has to be of the same length as the axis (i.e. to filter by rows the mask has to have exactly one element
for each row). It must be a logical vector (`TRUE` and `FALSE` values). Only elements to which `TRUE` values are mapped will be extracted.
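
As a small sketch of these rules, the snippet below builds one mask for rows and one for columns and checks that each has the length of its axis. The names `row_mask`, `col_mask` and `selected`, as well as the threshold of 7, are chosen only for illustration.

```r
# Rows: virginica flowers with sepals longer than 7; columns: numeric ones only.
row_mask <- iris$Species == "virginica" & iris$Sepal.Length > 7
col_mask <- vapply(iris, is.numeric, logical(1))

length(row_mask) == nrow(iris)   # TRUE: one element per row
length(col_mask) == ncol(iris)   # TRUE: one element per column

selected <- iris[row_mask, col_mask]
nrow(selected) == sum(row_mask)  # TRUE: only rows mapped to TRUE are kept
names(selected)                  # the four numeric columns
```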

In [46]:
### EXTRACT ALL `setosa` RECORDS AND COLUMNS `Sepal.Length` AND `Sepal.Width`.
iris[iris$Species == "setosa", c("Sepal.Length", "Sepal.Width")]

### EXTRACT ALL COLUMNS WITH NAMES STARTING WITH `Sepal`
iris[, grepl("^Sepal\\.", names(iris), perl = TRUE)]

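|    | Sepal.Length | Sepal.Width |
| -- | ------------ | ----------- |
|    | `<dbl>` | `<dbl>` |
| 1  | 5.1 | 3.5 |
| 2  | 4.9 | 3.0 |
| 3  | 4.7 | 3.2 |
| 4  | 4.6 | 3.1 |
| 5  | 5.0 | 3.6 |
| 6  | 5.4 | 3.9 |
| 7  | 4.6 | 3.4 |
| 8  | 5.0 | 3.4 |
| 9  | 4.4 | 2.9 |
| 10 | 4.9 | 3.1 |

| Sepal.Length | Sepal.Width |
| ------------ | ----------- |
| `<dbl>` | `<dbl>` |
| 5.1 | 3.5 |
| 4.9 | 3.0 |
| 4.7 | 3.2 |
| 4.6 | 3.1 |
| 5.0 | 3.6 |
| 5.4 | 3.9 |
| 4.6 | 3.4 |
| 5.0 | 3.4 |
| 4.4 | 2.9 |
| 4.9 | 3.1 |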


However, real-world data analysis with standard indexing methods very quickly becomes unwieldy and produces hard-to-understand code.
This is why people came up with more specialized solutions for data frame processing. They provide a generic framework
for working with data frames that allows us to write very legible code while also being very efficient in terms of computation time.

## `dplyr`

The most important `R` package for working with data frames is `dplyr`. It is a core part of the [Tidyverse ecosystem](https://www.tidyverse.org/)
and one of the most popular `R` packages of all time. You can read more about it [here](https://dplyr.tidyverse.org/).

The power of `dplyr` comes from two crucial features. First of all, it is internally implemented in high-performance compiled code written in C/C++,
so it is very fast. But perhaps even more important is the fact that it allows us to write code that is very easy to understand.

This is because `dplyr` is a collection of functions corresponding to typical operations on data frames. These functions are called
verbs and work as building blocks that can be easily combined into even very complex data processing pipelines.

In [47]:
### LOAD THE PACKAGE
library(dplyr) 

### GETTING ROWS BY POSITIONS
slice(iris, 1:5)
### EXTRACTING COLUMNS BY POSITIONS
select(iris, 1:2)
### EXTRACTING COLUMNS BY NAMES
select(iris, Sepal.Length, Species)
### EXTRACTING A RANGE OF COLUMNS FROM A GIVEN COLUMN TO A DIFFERENT COLUMN (DEFINED BY NAMES)
select(iris, Sepal.Length:Petal.Length)
### DROPPING COLUMNS BY NAMES (NEGATIVE INDEXING)
select(iris, -Species)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




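| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
| ------------ | ----------- | ------------ | ----------- | ------- |
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |

| Sepal.Length | Sepal.Width |
| ------------ | ----------- |
| `<dbl>` | `<dbl>` |
| 5.1 | 3.5 |
| 4.9 | 3.0 |
| 4.7 | 3.2 |
| 4.6 | 3.1 |
| 5.0 | 3.6 |
| 5.4 | 3.9 |
| 4.6 | 3.4 |
| 5.0 | 3.4 |
| 4.4 | 2.9 |
| 4.9 | 3.1 |

| Sepal.Length | Species |
| ------------ | ------- |
| `<dbl>` | `<fct>` |
| 5.1 | setosa |
| 4.9 | setosa |
| 4.7 | setosa |
| 4.6 | setosa |
| 5.0 | setosa |
| 5.4 | setosa |
| 4.6 | setosa |
| 5.0 | setosa |
| 4.4 | setosa |
| 4.9 | setosa |

| Sepal.Length | Sepal.Width | Petal.Length |
| ------------ | ----------- | ------------ |
| `<dbl>` | `<dbl>` | `<dbl>` |
| 5.1 | 3.5 | 1.4 |
| 4.9 | 3.0 | 1.4 |
| 4.7 | 3.2 | 1.3 |
| 4.6 | 3.1 | 1.5 |
| 5.0 | 3.6 | 1.4 |
| 5.4 | 3.9 | 1.7 |
| 4.6 | 3.4 | 1.4 |
| 5.0 | 3.4 | 1.5 |
| 4.4 | 2.9 | 1.4 |
| 4.9 | 3.1 | 1.5 |

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
| ------------ | ----------- | ------------ | ----------- |
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 5.1 | 3.5 | 1.4 | 0.2 |
| 4.9 | 3.0 | 1.4 | 0.2 |
| 4.7 | 3.2 | 1.3 | 0.2 |
| 4.6 | 3.1 | 1.5 | 0.2 |
| 5.0 | 3.6 | 1.4 | 0.2 |
| 5.4 | 3.9 | 1.7 | 0.4 |
| 4.6 | 3.4 | 1.4 | 0.3 |
| 5.0 | 3.4 | 1.5 | 0.2 |
| 4.4 | 2.9 | 1.4 | 0.2 |
| 4.9 | 3.1 | 1.5 | 0.1 |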


Of course `dplyr` can do much more than just simple filtering. But before we can discuss more advanced use cases we have to introduce the very important concept
of sequential processing / pipeline computing.

Sequential processing is a very convenient approach to defining a data processing pipeline in which we start with data and then apply different transformations to it,
which eventually lead us to the desired final output.
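
In `dplyr` code, pipelines are written with the `%>%` operator, which passes the value on its left as the first argument of the function on its right. The following is a minimal sketch of that mechanism; the names `piped` and `nested` are made up for illustration.

```r
library(dplyr)   # makes the `%>%` operator available

# `x %>% f(y)` is evaluated as `f(x, y)`: the left-hand side
# becomes the first argument of the function on the right.
piped  <- iris %>% head(3)
nested <- head(iris, 3)
identical(piped, nested)   # TRUE

# Steps are read from left to right, in the order they are applied.
c(1, 4, 9) %>% sqrt() %>% sum()   # 6
```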

In [48]:
### EXAMPLE OF SIMPLE EXTRACTION OF ROWS AND COLUMNS

### NON-SEQUENTIAL APPROACH
slice(select(iris, Sepal.Length:Sepal.Width), 1:10)

### SEQUENTIAL APPROACH
iris %>% 
    select(Sepal.Length:Sepal.Width) %>%
    slice(1:10)

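| Sepal.Length | Sepal.Width |
| ------------ | ----------- |
| `<dbl>` | `<dbl>` |
| 5.1 | 3.5 |
| 4.9 | 3.0 |
| 4.7 | 3.2 |
| 4.6 | 3.1 |
| 5.0 | 3.6 |
| 5.4 | 3.9 |
| 4.6 | 3.4 |
| 5.0 | 3.4 |
| 4.4 | 2.9 |
| 4.9 | 3.1 |

| Sepal.Length | Sepal.Width |
| ------------ | ----------- |
| `<dbl>` | `<dbl>` |
| 5.1 | 3.5 |
| 4.9 | 3.0 |
| 4.7 | 3.2 |
| 4.6 | 3.1 |
| 5.0 | 3.6 |
| 5.4 | 3.9 |
| 4.6 | 3.4 |
| 5.0 | 3.4 |
| 4.4 | 2.9 |
| 4.9 | 3.1 |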


The second approach is much easier to follow. Of course in both cases the result was the same and the computations applied to the data were identical.
The only difference is syntax. However, this is an important difference, as clean syntax allows us to write better code and even reason about data
more efficiently.
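
The benefit grows with the number of steps. Below is a sketch of a slightly longer pipeline that uses a few more `dplyr` verbs (`filter`, `mutate`, `arrange`); the derived column `Sepal.Ratio` is a hypothetical name introduced only for this example.

```r
library(dplyr)

result <- iris %>%
    filter(Species == "setosa") %>%                       # keep only setosa rows
    mutate(Sepal.Ratio = Sepal.Length / Sepal.Width) %>%  # add a derived column
    arrange(desc(Sepal.Ratio)) %>%                        # sort from highest to lowest ratio
    select(Species, Sepal.Ratio) %>%                      # keep two columns
    slice(1:3)                                            # take the top three rows
result
```

Written as nested function calls, the same computation would have to be read from the inside out.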

Sequential processing syntax shines the most when used to define more complex computations. One of the greatest features of `dplyr` is how easy it makes it
to define complicated aggregation operations.

In [51]:
### AGGREGATION: computing mean and variance of `Sepal.Length` grouped by `Species`

### NAIVE APPROACH USING ONLY BASE R
c(mean(iris[iris$Species == "setosa", "Sepal.Length"]), var(iris[iris$Species == "setosa", "Sepal.Length"]))
c(mean(iris[iris$Species == "virginica", "Sepal.Length"]), var(iris[iris$Species == "virginica", "Sepal.Length"]))
c(mean(iris[iris$Species == "versicolor", "Sepal.Length"]), var(iris[iris$Species == "versicolor", "Sepal.Length"]))

### SLIGHTLY LESS NAIVE APPROACH USING THE `tapply` FUNCTION
### (you can read more about it with the following command: ?tapply)
tapply(iris$Sepal.Length, iris$Species, mean)
tapply(iris$Sepal.Length, iris$Species, var)

### LESS NAIVE APPROACH BASED ON THE `tapply` FUNCTION AND AN ANONYMOUS FUNCTION
tapply(iris$Sepal.Length, iris$Species, function(x) c(M = mean(x), VAR = var(x)))

### ANOTHER LESS NAIVE APPROACH BASED ON THE `aggregate` FUNCTION.
aggregate(iris$Sepal.Length, by = list(iris$Species), function(x) c(M = mean(x), VAR = var(x)))

### DPLYR APPROACH
iris %>%
    group_by(Species) %>%
    summarize(
        M = mean(Sepal.Length),
        VAR = var(Sepal.Length)
    )

### NOTE HOW CLEAN AND EASY TO READ THE CODE IS.
### WE START WITH THE DATASET AND THEN APPLY DIFFERENT OPERATIONS
### TO IT ONE BY ONE.

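| Group.1 | x |
| ------- | - |
| `<fct>` | `<dbl[,2]>` |
| setosa | 5.006, 0.1242490 |
| versicolor | 5.936, 0.2664327 |
| virginica | 6.588, 0.4043429 |

| Species | M | VAR |
| ------- | - | --- |
| `<fct>` | `<dbl>` | `<dbl>` |
| setosa | 5.006 | 0.124249 |
| versicolor | 5.936 | 0.2664327 |
| virginica | 6.588 | 0.4043429 |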


You can learn more about `dplyr` (and many other awesome things) in the great book by Hadley Wickham and Garrett Grolemund: [R for Data Science](https://r4ds.had.co.nz/). In particular, you may want to look at the main [chapter](https://r4ds.had.co.nz/transform.html) on `dplyr` and data manipulation.