# lapply and sapply
**Sanket Dave**

---
 In this lesson, you'll learn how to use lapply() and sapply(), the two most important members of R's *apply family of
 functions, also known as loop functions.



                                                                                                                     
 These powerful functions, along with their close relatives (vapply() and tapply(), among others) offer a concise and
 convenient means of implementing the Split-Apply-Combine strategy for data analysis.



                                                                                                                 
 Each of the *apply functions will SPLIT up some data into smaller pieces, APPLY a function to each piece, then COMBINE
 the results. A more detailed discussion of this strategy is found in Hadley Wickham's Journal of Statistical Software
 paper titled 'The Split-Apply-Combine Strategy for Data Analysis'.
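
 As a rough illustration of the three steps, here is a minimal sketch that performs them by hand with base R. The vectors scores and group are hypothetical toy data, not part of the Flags dataset.

```r
# Hypothetical toy data: six scores belonging to two groups
scores <- c(10, 20, 30, 40, 50, 60)
group  <- c("a", "b", "a", "b", "a", "b")

pieces  <- split(scores, group)  # SPLIT: a list with one vector per group
applied <- lapply(pieces, mean)  # APPLY: compute the mean of each piece
result  <- unlist(applied)       # COMBINE: flatten into a named numeric vector

result
```

 The *apply functions covered in this lesson bundle the apply and combine steps into a single call.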



                                                                                                               
 Throughout this lesson, we'll use the Flags dataset from the UCI Machine Learning Repository. This dataset contains
 details of various nations and their flags. More information may be found here:
 http://archive.ics.uci.edu/ml/datasets/Flags



                                                                                                          
 Let's jump right in so you can get a feel for how these special functions work!



                                                                                                           
 The cells below read the dataset into a variable called flags. Type head(flags) to preview the first six rows (i.e. the 'head') of
 the dataset.

In [9]:
y<- "https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data"

flags <- read.table(y, header = FALSE, sep = ",")

In [10]:
head(flags)

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,V29,V30
Afghanistan,5,1,648,16,10,2,0,3,5,...,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,...,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,...,0,0,1,1,0,0,0,0,green,white
American-Samoa,6,3,0,0,1,1,0,0,5,...,0,0,0,0,1,1,1,0,blue,red
Andorra,3,1,0,0,6,0,3,0,3,...,0,0,0,0,0,0,0,0,blue,red
Angola,4,2,1247,7,10,5,0,2,3,...,0,0,1,0,0,1,0,0,red,black


In [11]:
# Assign descriptive column names (flag.data has no header row)
names(flags) <- c("country", "landmass", "zone", "area", "population", "language",
                  "religion", "bars", "stripes", "colors", "red", "green", "blue",
                  "gold", "white", "black", "orange", "mainhue", "circles", "crosses",
                  "saltires", "quarters", "sunstars", "crescents", "triangle", "icon",
                  "animate", "text", "topleft", "botright")

In [12]:
head(flags)

country,landmass,zone,area,population,language,religion,bars,stripes,colors,...,saltires,quarters,sunstars,crescents,triangle,icon,animate,text,topleft,botright
Afghanistan,5,1,648,16,10,2,0,3,5,...,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,...,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,...,0,0,1,1,0,0,0,0,green,white
American-Samoa,6,3,0,0,1,1,0,0,5,...,0,0,0,0,1,1,1,0,blue,red
Andorra,3,1,0,0,6,0,3,0,3,...,0,0,0,0,0,0,0,0,blue,red
Angola,4,2,1247,7,10,5,0,2,3,...,0,0,1,0,0,1,0,0,red,black


In [13]:
dim(flags)

In [14]:
class(flags)

 Let's find out which class each column belongs to.

 The lapply() function takes a list as input, applies a
 function to each element of the list, then returns a
 list of the same length as the original one. Since a
 data frame is really just a list of vectors (you can see
 this with as.list(flags)), we can use lapply() to apply
 the class() function to each column of the flags
 dataset. Let's see it in action!
 
  Type cls_list <- lapply(flags, class) to apply the
 class() function to each column of the flags dataset and
 store the result in a variable called cls_list. Note
 that you just supply the name of the function you want
 to apply (i.e. class), without the usual parentheses
 after it.
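
 To make the "a data frame is a list of vectors" point concrete, the sketch below uses a small hypothetical data frame called toy (not the flags data) and checks that lapply() returns one named element per column.

```r
# Hypothetical two-column data frame used only for illustration
toy <- data.frame(id = 1:3, label = c("x", "y", "z"),
                  stringsAsFactors = FALSE)

is.list(toy)              # TRUE: a data frame is a list of columns
length(toy) == ncol(toy)  # TRUE: one list element per column

# Pass the function itself (class), not a call to it (class())
toy_cls <- lapply(toy, class)
names(toy_cls)            # the column names carry over to the result
```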

In [15]:
cls_list <- lapply(flags,class)

In [16]:
cls_list

 The 'l' in 'lapply' stands for 'list'. Type
 class(cls_list) to confirm that lapply() returned a
 list.

In [17]:
class(cls_list)

 As expected, we got a list of length 30 -- one element
 for each variable/column. The output would be
 considerably more compact if we could represent it as a
 vector instead of a list.

 You may remember from a previous lesson that lists are
 most helpful for storing multiple classes of data. In
 this case, since every element of the list returned by
 lapply() is a character vector of length one (e.g.
 "integer" or "character"), cls_list can be simplified to a
 character vector. To do this manually, type
 as.character(cls_list).

In [18]:
as.character(cls_list)
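
 One detail worth knowing about the manual approach: as.character() keeps the values but drops the element names. The sketch below, on a hypothetical data frame called toy, compares it with unlist(), which keeps the names.

```r
# Hypothetical data frame used only for illustration
toy <- data.frame(id = 1:3, label = c("x", "y", "z"),
                  stringsAsFactors = FALSE)
toy_cls <- lapply(toy, class)

plain   <- as.character(toy_cls)  # values only, names are dropped
labeled <- unlist(toy_cls)        # same values, names are kept

names(plain)    # NULL
names(labeled)  # "id" "label"
```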

 sapply() allows you to automate this process by calling
 lapply() behind the scenes, but then attempting to
 simplify (hence the 's' in 'sapply') the result for you.
 Use sapply() the same way you used lapply() to get the
 class of each column of the flags dataset and store the
 result in cls_vect. If you need help, type ?sapply to
 bring up the documentation.

In [19]:
?sapply

In [21]:
cls_vect <- sapply(flags,class)

In [22]:
class(cls_vect)

 In general, if the result is a list where every element is
 of length one, then sapply() returns a vector. If the
 result is a list where every element is a vector of the
 same length (> 1), sapply() returns a matrix. If sapply()
 can't figure things out, then it just returns a list, no
 different from what lapply() would give you.
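
 The three cases can be seen side by side in the following sketch. The input lists are hypothetical toy data chosen so that each call falls into a different case.

```r
# Case 1: every result has length one, so we get a vector
len_one <- sapply(list(a = 1:3, b = 4:6), sum)

# Case 2: every result has the same length (2), so we get a matrix
len_two <- sapply(list(a = 1:3, b = 4:6), range)

# Case 3: results have different lengths (2 and 3), so we get a list
ragged <- sapply(list(a = c(1, 1, 2), b = c(3, 4, 5)), unique)

is.vector(len_one)  # TRUE
is.matrix(len_two)  # TRUE
is.list(ragged)     # TRUE
```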
 
  Columns 11 through 17 of our dataset are indicator
 variables, each representing a different color. The value
 of the indicator variable is 1 if the color is present in
 a country's flag and 0 otherwise.
 
  Therefore, if we want to know the total number of
 countries (in our dataset) with, for example, the color
 orange on their flag, we can just add up all of the 1s and
 0s in the 'orange' column. Try sum(flags$orange) to see
 this.

In [23]:
sum(flags$orange)

 Now we want to repeat this operation for each of the
 colors recorded in the dataset.

 First, use flag_colors <- flags[, 11:17] to extract the
 columns containing the color data and store them in a new
 data frame called flag_colors. (Note the comma before
 11:17. This subsetting command tells R that we want all
 rows, but only columns 11 through 17.)

In [24]:
flag_colors <- flags[,11:17]
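
 The same kind of subset can also be written with column names instead of positions, which some readers find easier to check. The sketch below uses a hypothetical data frame called toy and also shows that selecting a single column this way returns a plain vector unless drop = FALSE is given.

```r
# Hypothetical data frame used only for illustration
toy <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8)

by_position <- toy[, 2:3]          # all rows, columns 2 through 3
by_name     <- toy[, c("b", "c")]  # all rows, the same columns by name
identical(by_position, by_name)    # TRUE

single_vec <- toy[, 2]                # one column collapses to a vector
single_df  <- toy[, 2, drop = FALSE]  # drop = FALSE keeps a data frame
```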

In [25]:
head(flag_colors)

red,green,blue,gold,white,black,orange
1,1,0,1,1,1,0
1,0,0,1,0,1,0
1,1,0,0,1,0,0
1,0,1,1,1,0,1
1,0,1,1,0,0,0
1,0,0,1,0,1,0


 To get a list containing the sum of each column of
 flag_colors, call the lapply() function with two
 arguments. The first argument is the object over which we
 are looping (i.e. flag_colors) and the second argument is
 the name of the function we wish to apply to each column
 (i.e. sum). Remember that the second argument is just the
 name of the function with no parentheses, etc.

In [26]:
lapply(flag_colors,sum)

In [27]:
sapply(flag_colors,sum)

 Use sapply() to apply the mean() function to each column
 of flag_colors. Remember that the second argument to
 sapply() should just specify the name of the function
 (i.e. mean) that you want to apply.

In [28]:
sapply(flag_colors, mean)
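
 Because each color column holds only 0s and 1s, its mean is the proportion of flags that contain that color. A small sketch with a hypothetical indicator vector makes the arithmetic explicit:

```r
# Hypothetical indicator: 1 if a color is present, 0 otherwise
has_color <- c(1, 0, 0, 1, 0)

sum(has_color)   # 2   (number of flags with the color)
mean(has_color)  # 0.4 (2 out of 5 flags)
```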

 In the examples we've looked at so far, sapply() has
 been able to simplify the result to a vector. That's
 because each element of the list returned by lapply()
 was a vector of length one. Recall that sapply() instead
 returns a matrix when each element of the list returned
 by lapply() is a vector of the same length (> 1).

 To illustrate this, let's extract columns 19 through 23
 from the flags dataset and store the result in a new
 data frame called flag_shapes. flag_shapes <- flags[,
 19:23] will do it.

In [29]:
flag_shapes <- flags[,19:23]

Each of these columns (i.e. variables) represents the
 number of times a particular shape or design appears on
 a country's flag. We are interested in the minimum and
 maximum number of times each shape or design appears.

 The range() function returns the minimum and maximum of
 its first argument, which should be a numeric vector.
 Use lapply() to apply the range function to each column
 of flag_shapes. Don't worry about storing the result in
 a new variable. By now, we know that lapply() always
 returns a list.

In [30]:
lapply(flag_shapes, range)

In [31]:
sapply(flag_shapes, range)

circles,crosses,saltires,quarters,sunstars
0,0,0,0,0
4,2,1,4,50


In [32]:
shape_mat <- sapply(flag_shapes, range)
shape_mat

circles,crosses,saltires,quarters,sunstars
0,0,0,0,0
4,2,1,4,50


In [33]:
class(shape_mat)
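
 In a matrix built this way, the first row holds the minimums and the second row the maximums, but the rows are unlabeled. As an optional step, the sketch below (using a hypothetical data frame called toy_shapes) adds row names so that individual values can be looked up by name.

```r
# Hypothetical shape counts used only for illustration
toy_shapes <- data.frame(circles = c(0, 1, 4), crosses = c(0, 2, 1))

toy_mat <- sapply(toy_shapes, range)  # 2 x 2 matrix: one column per variable
rownames(toy_mat) <- c("min", "max")  # label the two rows returned by range()

toy_mat["max", "circles"]  # 4
toy_mat["min", "crosses"]  # 0
```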

 As we've seen, sapply() always attempts to simplify the
 result given by lapply(). It has been successful in
 doing so for each of the examples we've looked at so
 far. Let's look at an example where sapply() can't
 figure out how to simplify the result and thus returns a
 list, no different from lapply().


 When given a vector, the unique() function returns a
 vector with all duplicate elements removed. In other
 words, unique() returns a vector of only the 'unique'
 elements. To see how it works, try unique(c(3, 4, 5, 5,
 5, 6, 6)).

In [34]:
unique(c(3, 4, 5, 5, 5, 6, 6))

 We want to know the unique values for each variable in
 the flags dataset. To accomplish this, use lapply() to
 apply the unique() function to each column in the flags
 dataset, storing the result in a variable called
 unique_vals.

In [35]:
unique_vals <- lapply(flags, unique)
unique_vals

 Since unique_vals is a list, you can use what you've
 learned to determine the length of each element of
 unique_vals (i.e. the number of unique values for each
 variable). Simplify the result, if possible. Hint: Apply
 the length() function to each element of unique_vals.


In [36]:
sapply(unique_vals, length)


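 The elements of unique_vals are vectors of different
 lengths, so sapply() has no obvious way to simplify them.
 Applying unique() to each column of flags with sapply()
 therefore returns the same unsimplified list that
 lapply() gave us.

In [37]:
sapply(flags,unique)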

 Occasionally, you may need to apply a function that is
 not yet defined, thus requiring you to write your own.
 Writing functions in R is beyond the scope of this
 lesson, but let's look at a quick example of how you
 might do so in the context of loop functions.


 Pretend you are interested in only the second item from
 each element of the unique_vals list that you just
 created. Since each element of the unique_vals list is a
 vector and we're not aware of any built-in function in R
 that returns the second element of a vector, we will
 construct our own function.


 lapply(unique_vals, function(elem) elem[2]) will return
 a list containing the second item from each element of
 the unique_vals list. Note that our function takes one
 argument, elem, which is just a 'dummy variable' that
 takes on the value of each element of unique_vals, in
 turn.


In [38]:
lapply(unique_vals, function(elem) elem[2])

 The only difference between previous examples and this
 one is that we are defining and using our own function
 right in the call to lapply(). Our function has no name
 and disappears as soon as lapply() is done using it.
 So-called 'anonymous functions' can be very useful when
 one of R's built-in functions isn't an option.
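
 For comparison, the sketch below shows the same idea on a hypothetical list called toy_vals, written three ways: with an anonymous function, with a named helper (second is a made-up name), and by passing the extraction operator `[` directly. The last form relies on lapply() forwarding any extra arguments to the function being applied.

```r
# Hypothetical list used only for illustration
toy_vals <- list(a = c(5, 6, 7), b = c("x", "y"))

# Anonymous function, defined inside the call
anon_res <- lapply(toy_vals, function(elem) elem[2])

# The same logic as a named helper function
second <- function(elem) elem[2]
named_res <- lapply(toy_vals, second)

# Extra arguments after the function are passed on to it,
# so this calls `[`(elem, 2) for each element
bracket_res <- lapply(toy_vals, `[`, 2)

identical(anon_res, named_res)    # TRUE
identical(anon_res, bracket_res)  # TRUE
```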


 In this lesson, you learned how to use the powerful
 lapply() and sapply() functions to apply an operation
 over the elements of a list. In the next lesson, we'll
 take a look at some close relatives of lapply() and
 sapply().

**Source: [Swirlstats](https://swirlstats.com/student.html)**