![](https://www.r-project.org/Rlogo.png)

____________________________________________________________________________________
This tutorial is the third part of a series. You can start with the first part of the tutorial [here](https://www.kaggle.com/rtatman/getting-started-in-r-first-steps/).

In this part of the tutorial, we'll:

* make sure our data_frame has the right data types
* summarize our data
* find the summary statistics for a specific variable
* summarize a variable by groups

____________________________________________________________________________________


### Learning goals:

By the end of this tutorial, you will be able to do the following things. (Don't worry if you don't know what all these things are yet; we'll get there together!)

* [Be familiar with basic concepts: functions, variables, data types and vectors](https://www.kaggle.com/rtatman/getting-started-in-r-first-steps/)
* [Load data into R](https://www.kaggle.com/rtatman/getting-started-in-r-load-data-into-r)
* [Summarize your data](https://www.kaggle.com/rtatman/getting-started-in-r-summarize-data)
* [Graph data and save your graphs](https://www.kaggle.com/rtatman/getting-started-in-r-graphing-data/)

______

### Your turn!

Throughout this tutorial, you'll have lots of opportunities to practice what you've just learned. Look for the phrase "your turn!" to find these exercises.

In [3]:
library(reticulate)
os <- import("os")
os$listdir("../input")

> **This notebook is interactive! It will have errors until you fork it and add all the code it needs to run correctly. Don't worry: if you've been working through this tutorial you already know everything you need to get it working correctly and I'll give you instructions and reminders to help you out.**

## Load our data into R

Because this is a new notebook & it's not connected to our last notebook, we need to tell R that we're going to be using the tidyverse package, read in our data and remove the first row all over again. This is because this notebook is in a new session, and every time we start a new session R forgets everything we did in the last one. 

> Session: your current R working environment and anything that you've created within it, including libraries, datasets, variables and any functions that you've written.

This is a great chance for you to review what we learned in the last part of the tutorial! Try to follow the steps below on your own first. If you forget, you can also look [back at the last section](https://www.kaggle.com/rtatman/getting-started-in-r-load-data-into-r) for help.

In [4]:
# Your turn!
# call the "tidyverse" library using the library() function
library(tidyverse)
# read our data file into R and assign it to a variable called "chocolateData". 
# Remember that you can find out where the data is by expanding the "Input Files"
# box above by clicking the + sign in the left corner.
chocolateData <- read_csv("../input/flavors_of_cacao.csv")
# remove the first row of the chocolateData data_frame using a negative index
chocolateData <- chocolateData[-1,]
# check the first few rows of your data using the head() function to make sure it
# looks alright
head(chocolateData)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.4     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


── Column specification ────────────────────────────────────────────────────────
cols(
  `Company 
(Maker-if known)` = col_character(),
  `Specific Bean Origin
or Bar Name` = col_character(),
  REF = col_double(),
  `Review
Date` = col_double(),
  `Cocoa
Percent` = col_cha

Company (Maker-if known),Specific Bean Origin or Bar Name,REF,Review Date,Cocoa Percent,Company Location,Rating,Bean Type,Broad Bean Origin
<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>
A. Morin,Kpime,1676,2015,70%,France,2.75,,Togo
A. Morin,Atsane,1676,2015,70%,France,3.0,,Togo
A. Morin,Akata,1680,2015,70%,France,3.5,,Togo
A. Morin,Quilla,1704,2015,70%,France,3.5,,Peru
A. Morin,Carenero,1315,2014,70%,France,2.75,Criollo,Venezuela
A. Morin,Cuba,1315,2014,70%,France,3.5,,Cuba


In [5]:
# Before we get going, let's get rid of the white space in the column names of this
# dataset. This will make it possible for us to refer to columns by their names, since
# any white space in a name will mess R up.

names(chocolateData) <- gsub("[[:space:]]+", "_", names(chocolateData))
str(chocolateData)

tibble [1,794 × 9] (S3: tbl_df/tbl/data.frame)
 $ Company _(Maker-if_known)       : chr [1:1794] "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
 $ Specific_Bean_Origin_or_Bar_Name: chr [1:1794] "Kpime" "Atsane" "Akata" "Quilla" ...
 $ REF                             : num [1:1794] 1676 1676 1680 1704 1315 ...
 $ Review_Date                     : num [1:1794] 2015 2015 2015 2015 2014 ...
 $ Cocoa_Percent                   : chr [1:1794] "70%" "70%" "70%" "70%" ...
 $ Company_Location                : chr [1:1794] "France" "France" "France" "France" ...
 $ Rating                          : num [1:1794] 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 2.75 ...
 $ Bean_Type                       : chr [1:1794] " " " " " " " " ...
 $ Broad_Bean_Origin               : chr [1:1794] "Togo" "Togo" "Togo" "Peru" ...
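
If you'd like to see the renaming step on something smaller, here's a minimal sketch. The data frame `toy` and its contents are made up for illustration and aren't part of the chocolate dataset.

```r
# A tiny, made-up data frame whose column names contain white space.
# check.names = FALSE stops data.frame() from changing the names for us.
toy <- data.frame(`Review Date` = c(2015, 2014),
                  `Cocoa Percent` = c("70%", "63%"),
                  check.names = FALSE)
names(toy)

# Replace every run of white space in the names with an underscore
names(toy) <- gsub("[[:space:]]+", "_", names(toy))
names(toy)

# Now the names can be used directly after the dollar sign
toy$Review_Date
```

The data inside the columns isn't touched at all: only the names change.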


## Data Cleaning

Excellent! Now that we've read our data into R, we have lots of questions that we want to ask about it.

The first thing that I usually do with a new dataset, however, is make sure that all of the data that I *think* should be a particular data type actually *is* that data type. We can do this using the str() function we learned about in [the first part of this tutorial](https://www.kaggle.com/rtatman/getting-started-in-r-first-steps). 

In [6]:
# Your turn!

# Use the str() function to check the data type of the columns in chocolateData
str(chocolateData)

tibble [1,794 × 9] (S3: tbl_df/tbl/data.frame)
 $ Company _(Maker-if_known)       : chr [1:1794] "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
 $ Specific_Bean_Origin_or_Bar_Name: chr [1:1794] "Kpime" "Atsane" "Akata" "Quilla" ...
 $ REF                             : num [1:1794] 1676 1676 1680 1704 1315 ...
 $ Review_Date                     : num [1:1794] 2015 2015 2015 2015 2014 ...
 $ Cocoa_Percent                   : chr [1:1794] "70%" "70%" "70%" "70%" ...
 $ Company_Location                : chr [1:1794] "France" "France" "France" "France" ...
 $ Rating                          : num [1:1794] 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 2.75 ...
 $ Bean_Type                       : chr [1:1794] " " " " " " " " ...
 $ Broad_Bean_Origin               : chr [1:1794] "Togo" "Togo" "Togo" "Peru" ...


This has a lot of output, but don't be scared! We'll work through it together. Your output should look something like this:

![](https://image.ibb.co/eKYCK5/Screenshot_from_2017_08_29_16_09_09.png)

The first row shows you the class of the object and its size. Like I mentioned in the last section, a data_frame is a special type of data.frame called a tibble, which is abbreviated here to tbl. In the output printed above, you can see that this tibble is 1,794 rows long and 9 columns wide.
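
You don't have to read the size off the str() output, by the way. Here's a small sketch using a made-up data frame (`toy` is just for illustration) and three base R functions that report the size directly.

```r
# A made-up data frame with 3 rows and 2 columns
toy <- data.frame(Rating = c(2.75, 3, 3.5),
                  Company_Location = c("France", "France", "Peru"))

nrow(toy)  # number of rows
ncol(toy)  # number of columns
dim(toy)   # both at once: rows first, then columns
```

The same three functions should work on chocolateData as well.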

In R, the dollar sign ($) has a special meaning. It means that whatever comes directly after it is a column in a data_frame. You can use this to look at specific columns in a data_frame, like so:

In [7]:
#print the first few values from the column named "Rating" in the dataframe "chocolateData" 
head(chocolateData$Rating)

So all the text that comes after each dollar sign ($) here is just telling you that that's a column and it has that specific name. If you count them, you can see that there are 9 columns, just like it says in the first row of the output.
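
Here's a small sketch of the dollar sign in action. The data frame `toy` is made up for illustration; the point is just that a column pulled out with $ is an ordinary vector.

```r
# A made-up data frame with two numeric columns
toy <- data.frame(Rating = c(2.75, 3, 3.5),
                  Review_Date = c(2015, 2015, 2014))

# These two lines return the same vector of values
toy$Rating
toy[["Rating"]]

# Because a column is just a vector, we can hand it to any function
# that works on vectors
mean(toy$Rating)
```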

The next thing we can see from this is the type of data in each column, which is printed after the colon (:). In the screenshot, every column has "chr" printed there, which means that all of those columns are of the data type "character". (In the output printed above, a few columns, like REF, Review_Date and Rating, already show "num", which means they're numeric.) We learned about data types in the very first part of the tutorial, [here](https://www.kaggle.com/rtatman/getting-started-in-r-first-steps/). If this doesn't sound familiar, you might want to peek back for a quick refresher.

The last thing we can see is the first couple of observations in each column. Looking at these, we can see that not every column really holds text. In fact, we want columns like REF, Review_Date and Rating to be numeric instead. If any of them are still characters, we *could* convert each one using the as.numeric() function we learned in the first part of the tutorial.

Fortunately, we don't have to. The tidyverse has a lot of nice utility functions that can make our lives much easier. One of them is type_convert(), which will look at the values in each character column, guess what the data type of that column should be and then convert that column into that data type.

In [8]:
# automatically convert the data types of our data_frame
chocolateData <- type_convert(chocolateData)


── Column specification ────────────────────────────────────────────────────────
cols(
  `Company _(Maker-if_known)` = col_character(),
  Specific_Bean_Origin_or_Bar_Name = col_character(),
  Cocoa_Percent = col_character(),
  Company_Location = col_character(),
  Bean_Type = col_character(),
  Broad_Bean_Origin = col_character()
)
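
To get a feel for what type_convert() does, here's a minimal sketch on a made-up data frame where every column starts out as text. The data frame `toy` and its values are assumptions for illustration only.

```r
library(readr)

# A made-up data frame where every column was stored as text
toy <- data.frame(REF = c("1676", "1680", "1315"),
                  Rating = c("2.75", "3.5", "3"),
                  Company_Location = c("France", "France", "Peru"),
                  stringsAsFactors = FALSE)
str(toy)

# type_convert() guesses a better type for each character column
toy <- type_convert(toy)
str(toy)
```

The columns that only contain numbers should come back as numeric, while the column of country names stays a character column.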




In [9]:
# Your turn!

# Check out the structure of the converted data using the str() function. What do you notice?
# Are all the columns the data type you expected?
str(chocolateData)

tibble [1,794 × 9] (S3: tbl_df/tbl/data.frame)
 $ Company _(Maker-if_known)       : chr [1:1794] "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
 $ Specific_Bean_Origin_or_Bar_Name: chr [1:1794] "Kpime" "Atsane" "Akata" "Quilla" ...
 $ REF                             : num [1:1794] 1676 1676 1680 1704 1315 ...
 $ Review_Date                     : num [1:1794] 2015 2015 2015 2015 2014 ...
 $ Cocoa_Percent                   : chr [1:1794] "70%" "70%" "70%" "70%" ...
 $ Company_Location                : chr [1:1794] "France" "France" "France" "France" ...
 $ Rating                          : num [1:1794] 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 2.75 ...
 $ Bean_Type                       : chr [1:1794] " " " " " " " " ...
 $ Broad_Bean_Origin               : chr [1:1794] "Togo" "Togo" "Togo" "Peru" ...


So it looks like there's still a bit of a problem: the column Cocoa_Percent is a character, not a numeric value (which is what we'd usually want a percent to be). This is probably because the data in this column contains the percent symbol (%), which isn't a number. Let's remove all the percent symbols from this column & try again. 

In [10]:
# remove all the percent signs in the fifth column. gsub() swaps every "%" for
# nothing (""), and fixed = TRUE tells it to treat "%" as plain text.
chocolateData$Cocoa_Percent <- gsub("%", "", chocolateData$Cocoa_Percent, fixed = TRUE)

# try the type_convert() function again
chocolateData <- type_convert(chocolateData)

# check the structure to make sure the column is now numeric
str(chocolateData)


── Column specification ────────────────────────────────────────────────────────
cols(
  `Company _(Maker-if_known)` = col_character(),
  Specific_Bean_Origin_or_Bar_Name = col_character(),
  Cocoa_Percent = col_double(),
  Company_Location = col_character(),
  Bean_Type = col_character(),
  Broad_Bean_Origin = col_character()
)




tibble [1,794 × 9] (S3: tbl_df/tbl/data.frame)
 $ Company _(Maker-if_known)       : chr [1:1794] "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
 $ Specific_Bean_Origin_or_Bar_Name: chr [1:1794] "Kpime" "Atsane" "Akata" "Quilla" ...
 $ REF                             : num [1:1794] 1676 1676 1680 1704 1315 ...
 $ Review_Date                     : num [1:1794] 2015 2015 2015 2015 2014 ...
 $ Cocoa_Percent                   : num [1:1794] 70 70 70 70 70 70 70 70 70 70 ...
 $ Company_Location                : chr [1:1794] "France" "France" "France" "France" ...
 $ Rating                          : num [1:1794] 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 2.75 ...
 $ Bean_Type                       : chr [1:1794] " " " " " " " " ...
 $ Broad_Bean_Origin               : chr [1:1794] "Togo" "Togo" "Togo" "Peru" ...
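
Here's the same idea on a tiny, made-up vector, in case you'd like to see the percent sign removal on its own. The vector `percent_text` is an assumption for illustration. The second option uses parse_number() from readr, which this tutorial doesn't otherwise cover, so treat it as an optional extra.

```r
library(readr)

# A made-up character vector of percentages
percent_text <- c("70%", "63%", "100%")

# Option 1: remove the "%" and then convert the text to numbers
as.numeric(gsub("%", "", percent_text, fixed = TRUE))

# Option 2: parse_number() from readr drops the non-numeric characters for us
parse_number(percent_text)
```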


Alright, that all looks good: now we're ready to begin our analysis! The steps we've done so far, getting to the point where we can do our analysis, are usually called "data cleaning", and they're a surprisingly large part of data analysis. 

> As the joke goes: “80 percent of data science is preparing data, and the other 20 percent is complaining about preparing data.”

## Summarizing data

Ok, our data is in R and clean. Now let's start summarizing it! There are a couple options for how to do this in R. 

> One thing you'll learn about R is that there are often multiple ways to do the same thing. This flexibility is really nice once you're comfortable in the language, but I also remember it being very frustrating when I was learning.

Let's try two functions. The first, summary(), is from base R, while the second, summarise_all(), is part of the Tidyverse. We'll run both and then compare the outputs.

You can learn more about any function by looking at the documentation for that function. You can do that in a kernel by running a cell with a question mark in front of the function name, with no parentheses after it. (If you do this for more than one function in the same cell, you'll only see the documentation for the last one.) You can also use a search engine to find more information.
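
As a quick sketch, here are two other ways to peek at a function, using sd() as the example. Both help() and args() are part of base R.

```r
# help("sd") opens the same documentation page as ?sd
help("sd")

# args() prints just the arguments a function accepts
args(sd)
```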

> **Protip**: Never feel embarrassed about looking things up. Professional programmers look things up all the time; no one knows everything about every programming language!

In [11]:
# run this cell to learn more about the summary() function
?summary

summary {base},R Documentation

object,an object for which a summary is desired.
x,a result of the default method of summary().
maxsum,"integer, indicating how many levels should be shown for factors."
digits,"integer, used for number formatting with signif() (for summary.default) or format() (for summary.data.frame). In summary.default, if not specified (i.e., missing(.)), signif() will not be called anymore (since R >= 3.4.0, where the default has been changed to only round in the print and format methods)."
quantile.type,"integer code used in quantile(*, type=quantile.type) for the default method."
...,additional arguments affecting the summary produced.


In [12]:
# run this cell to learn more about the summarise_all() function
?summarise_all

summarise_all {dplyr},R Documentation

.tbl,A tbl object.
.funs,"A function fun, a quosure style lambda ~ fun(.) or a list of either form."
...,"Additional arguments for the function calls in .funs. These are evaluated only once, with tidy dots support."
.predicate,A predicate function to be applied to the columns or a logical vector. The variables for which .predicate is or returns TRUE are selected. This argument is passed to rlang::as_function() and thus supports quosure-style lambda functions and strings representing function names.
.vars,"A list of columns generated by vars(), a character vector of column names, a numeric vector of column positions, or NULL."
.cols,This argument has been renamed to .vars to fit dplyr's terminology and is deprecated.


In [13]:
# summary function from base R (base R means no packages)
summary(chocolateData)

# summary function from the Tidyverse (specifically dplyr). To use this function, you need
# to tell it what dataset to summarize and also what function to use. In this case I'm
# asking for the average using the function mean()
summarise_all(chocolateData, mean)

 Company _(Maker-if_known) Specific_Bean_Origin_or_Bar_Name      REF      
 Length:1794               Length:1794                      Min.   :   5  
 Class :character          Class :character                 1st Qu.: 576  
 Mode  :character          Mode  :character                 Median :1069  
                                                            Mean   :1035  
                                                            3rd Qu.:1502  
                                                            Max.   :1952  
  Review_Date   Cocoa_Percent   Company_Location       Rating     
 Min.   :2006   Min.   : 42.0   Length:1794        Min.   :1.000  
 1st Qu.:2010   1st Qu.: 70.0   Class :character   1st Qu.:2.812  
 Median :2013   Median : 70.0   Mode  :character   Median :3.250  
 Mean   :2012   Mean   : 71.7                      Mean   :3.186  
 3rd Qu.:2015   3rd Qu.: 75.0                      3rd Qu.:3.500  
 Max.   :2017   Max.   :100.0                      Max.   :5.000  
  Bean

“argument is not numeric or logical: returning NA”
“argument is not numeric or logical: returning NA”
“argument is not numeric or logical: returning NA”
“argument is not numeric or logical: returning NA”
“argument is not numeric or logical: returning NA”


Company _(Maker-if_known),Specific_Bean_Origin_or_Bar_Name,REF,Review_Date,Cocoa_Percent,Company_Location,Rating,Bean_Type,Broad_Bean_Origin
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,1035.436,2012.323,71.70318,,3.185619,,
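
The "argument is not numeric or logical" warnings show up because mean() was also asked to average the text columns. One way around that, sketched here on a made-up data frame, is to summarise only the numeric columns using across() and where(), which are available in dplyr 1.0.0 and later. The data frame `toy` and the name `numericMeans` are assumptions for illustration.

```r
library(dplyr)

# A made-up data frame with one text column and two numeric columns
toy <- data.frame(Company_Location = c("France", "France", "Peru"),
                  Cocoa_Percent = c(70, 63, 80),
                  Rating = c(2.75, 3.5, 3.25))

# Only summarise the columns that are numeric, so that mean() is never
# asked to average text
numericMeans <- summarise(toy, across(where(is.numeric), mean))
numericMeans
```

The result only has columns for Cocoa_Percent and Rating, so there are no NA columns to skip over.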


In [14]:
# Your turn!

# Use the summarise_all() function to find the standard deviation of each numeric column.
# The function to find the standard deviation is sd()
summarise_all(chocolateData, sd)

“NAs introduced by coercion”
“NAs introduced by coercion”
“NAs introduced by coercion”
“NAs introduced by coercion”
“NAs introduced by coercion”


Company _(Maker-if_known),Specific_Bean_Origin_or_Bar_Name,REF,Review_Date,Cocoa_Percent,Company_Location,Rating,Bean_Type,Broad_Bean_Origin
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,552.6843,2.926739,6.321543,,0.47801,,
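
If you're curious what sd() is actually calculating, here's a small sketch that does the same calculation by hand on a made-up vector of ratings (the vector `ratings` is just for illustration). The standard deviation is the square root of the sum of squared differences from the mean, divided by one less than the number of values.

```r
# A made-up vector of ratings
ratings <- c(2.75, 3, 3.5, 3.5)

# sd() does the calculation for us...
sd(ratings)

# ...which gives the same answer as doing it by hand
sqrt(sum((ratings - mean(ratings))^2) / (length(ratings) - 1))
```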


## Summarizing a specific variable

The functions we used above give you an overview of the entire dataset, but often you're only interested in one or two variables. We can look at specific variables really easily using the summarise() function and pipes. Pipes are part of the Tidyverse package we loaded in the beginning: if you try to use them without loading the package, you'll get an error.

> A pipe, which looks like this: %>%, is a special operator. It takes all the output from the left side and passes it to whatever is on the right side.

Let's take our chocolate dataset and then pipe it to the summarise() function. The summarise() function will return a data_frame, where each column contains a specific type of information we've asked for and has a name we've given it. In this example, we're going to get back two columns. One, called "averageRating" will have the average of the Rating column in it, while the second, called "sdRating" will have the standard deviation of the Rating column in it. 

In [15]:
# return a data_frame with the mean and sd of the Rating column, from the chocolate
# dataset in it
chocolateData %>%
    summarise(averageRating = mean(Rating),
             sdRating = sd(Rating))

averageRating,sdRating
<dbl>,<dbl>
3.185619,0.47801
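
A pipe is really just another way of writing a function call: whatever is on the left of %>% becomes the first argument of the function on the right. Here's a minimal sketch on a made-up data frame that shows both versions side by side. The names `toy`, `piped` and `unpiped` are assumptions for illustration.

```r
library(dplyr)

# A made-up data frame of ratings
toy <- data.frame(Rating = c(2.75, 3, 3.5, 4))

# With a pipe: start with the data, then say what to do with it
piped <- toy %>%
    summarise(averageRating = mean(Rating),
              sdRating = sd(Rating))

# Without a pipe: the data is the first argument of summarise()
unpiped <- summarise(toy,
                     averageRating = mean(Rating),
                     sdRating = sd(Rating))

# Check that both versions give the same result
all.equal(piped, unpiped)
```

Pipes start to pay off once you chain several steps together, because you can read the code from top to bottom instead of from the inside out.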


> ## Why are there line breaks after "%>%" and "mean(Rating)," in the code block above?
> 
> So far all the functions we've seen have been on one line. As we're learning more functions and stringing them together, however, some of our lines are going to get really, really long. Breaking up a line can help make your code easier to read. (Easy to read is good! The person who's most likely to need to read your code in the future is you after you've forgotten everything about it. Make future you's life a little easier.)
>
> You can't break a line up just anywhere, though. Lines of code are like lines of text in a book: you can't just wrap a line anywhere you want.
>
>> Th
>>
>>is is p
>>
>>retty hard t
>>
>> o read.
>
> In text, you need to wrap lines between words, or between syllables of words with a hyphen (-) in between. This lets the reader know that the word continues on the next line. For R, some of the "hyphens" that let the computer know that your line continues on the next line are the comma (,), the pipe (%>%) and the plus sign (+), which we'll talk about later. If you split your line directly after one of these characters, R knows to keep looking for the rest of the code on the next line.
>
> You should also indent any wrapped lines after the first. This isn't necessary, but it makes your code easier to read. You can do this either by hitting TAB once, or space four times. (There are a lot of arguments online about which is better, but both work.)

In [16]:
# this is fine! :)
mean(c(5,6,25,16))

# this is fine! :)
mean(c(5,6,
       25,16))

# this won't break your code, but it's hard to read :(
mean(c(5,6,
25,16))

# this will break your code :'(
mean(c(5,6,2
      5,16))

ERROR: Error in parse(text = x, srcfile = src): <text>:14:7: unexpected numeric constant
13: mean(c(5,6,2
14:       5
          ^


In [17]:
# Your turn!

# Can you use a pipe (%>%) and the summarise() function to return a data_frame
# with the average and sd of the Cocoa_Percent column? You can name the new columns
# whatever you'd like, but keep in mind that clear names are the most helpful.
chocolateData %>%
    summarise(
        averagePercent = mean(Cocoa_Percent),
        sdPercent = sd(Cocoa_Percent)
             )

averagePercent,sdPercent
<dbl>,<dbl>
71.70318,6.321543


## Summarize a specific variable by group

At first pass, it may seem a bit silly to do things like calculate the mean and standard deviation this way. You can see why this is such a powerful technique, however, when we look at the same variable across groups. 

We can do this with a handy function called group_by(). When you pipe a dataset into the group_by() function and tell it the name of a specific column, it will look at all the values in that column and group together all the rows that have the same value in that column. Then, when you pipe that data into the summarise() function, it will return the values you asked for for each group separately. Like so:

In [18]:
# Return the average and sd of ratings by the year a rating was given
chocolateData %>%
    group_by(Review_Date) %>%
    summarise(averageRating = mean(Rating),
             sdRating = sd(Rating))

Review_Date,averageRating,sdRating
<dbl>,<dbl>,<dbl>
2006,3.125,0.7691224
2007,3.162338,0.6998193
2008,2.994624,0.5442118
2009,3.073171,0.4591195
2010,3.148649,0.4663426
2011,3.256061,0.4899536
2012,3.178205,0.4835962
2013,3.197011,0.4461178
2014,3.189271,0.4148615
2015,3.246491,0.381096
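
To see exactly what group_by() is doing, here's a minimal sketch on a made-up data frame that's small enough to check by hand. The names `toy` and `byYear` are assumptions for illustration, and n() is a dplyr helper (not used elsewhere in this tutorial) that counts the rows in each group.

```r
library(dplyr)

# A made-up data frame: two reviews from 2014 and three from 2015
toy <- data.frame(Review_Date = c(2014, 2014, 2015, 2015, 2015),
                  Rating = c(2.5, 3.5, 3, 3.5, 4))

# One row per year, with the mean rating and number of reviews in that year
byYear <- toy %>%
    group_by(Review_Date) %>%
    summarise(averageRating = mean(Rating),
              numberOfReviews = n())
byYear

# The 2015 row matches the mean of just the 2015 ratings
mean(toy$Rating[toy$Review_Date == 2015])
```

Each row of the result is the same summary you'd get if you pulled out that year's rows yourself and summarised them on their own.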


In [21]:
# Your turn!

# Can you return a data_frame with the average and sd Cocoa_Percent by the year the reviews 
# were written?

chocolateData %>%
    group_by(Review_Date) %>%
        summarise(
            avPercent = mean(Cocoa_Percent),
            sdPercent = sd(Cocoa_Percent)
        )

Review_Date,avPercent,sdPercent
<dbl>,<dbl>,<dbl>
2006,71.0,7.42474
2007,72.03896,6.951792
2008,72.69892,8.412962
2009,70.44309,6.895057
2010,70.77928,7.424678
2011,70.9697,5.377714
2012,71.52821,5.725056
2013,72.2663,8.325992
2014,72.25304,5.201014
2015,72.01404,5.258777


This is a really efficient way to start understanding your data. For example, it looks like chocolate bar ratings might be trending slightly upwards by year. In the mid-2000s they were around 3.0 to 3.2, and by 2015 they're closer to 3.25.

To get a better understanding of this, however, I really want to graph this data so that I can see if there's been reliable change over time. So let's move on to the final part of this tutorial: graphing!

## Next step: [Graphing Data](https://www.kaggle.com/rtatman/getting-started-in-r-graphing-data/)