
# Chapter 1 - Introduction to Statistical Thinking  

*“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.” - H.G. Wells*

## 1.1 Starting out 

Life is messy and complex - things like people, animals, weather, even microscopic bacteria can vary between each other in a multitude of ways, for a multitude of reasons. How do we make sense of this complexity and still describe our world in reasonably accurate ways? Statistics is the discipline that helps us with this. It is how we turn variation in the world into variation in data, and how we analyze that data in order to give us answers about the essential structure or function of the world.    

The foundations of statistics come primarily from mathematics, but also from computer science, psychology, and other fields of study. From this interdisciplinary crucible, statistics has emerged as not just a collection of facts or equations. Statistics is a way of *thinking*, distinct from other approaches to knowledge. In particular, statistics can answer the sorts of questions where human intuition fails. 

For example, in recent years political ads have emphasized violent crime as a problem in the US, and most Americans have reported that they see violent crime as a serious societal issue ([Pew Research Center](https://www.pewresearch.org/short-reads/2022/10/31/violent-crime-is-a-key-midterm-voting-issue-but-what-does-the-data-say)). However, a statistical analysis of the actual crime data shows that violent crime has in fact decreased steeply since the 1990s. Intuition fails us because we rely upon best guesses (which psychologists refer to as *heuristics*) that can often get it wrong. In this case, humans often judge the prevalence of some event (like violent crime) using an availability heuristic – that is, how easily we can think of an example of violent crime. For this reason, our judgments of crime rates may be more reflective of increasing news coverage and political discourse than of the actual decrease in the rate of crime. Statistical thinking provides us with the tools to more accurately understand the world and overcome the biases of human judgment.

<img src="images/ch2-pew.png" width="500">

*Graph from the Pew Research Center showing the number of violent victimizations per 1,000 Americans age 12 and older in each year between 1993 and 2021, and that number broken down into different categories of violent crime (simple assault, aggravated assault, robbery, and rape/sexual assault). In all categories, the number of incidents has declined since 1993.*

It is important to appreciate how statistical thinking is different from our default intuitions, and to practice using it before doing any math or touching any data. In this chapter, we will cover the types of questions that statistics can answer for us, what is needed from us to answer those questions, and how to start out using those core skills.

## 1.2 What can statistics do for us?

There are three major things we can do with statistics:

- **Describe**: The world is complex and we often need to describe it in a simplified way that we can understand.
- **Predict**: We often wish to make predictions about new situations based on our knowledge of previous situations.
- **Infer**: Beyond knowing what is likely to happen, we also try to understand *why* it happens. 

Let’s look at an example of each of these use cases in action. 

### Describe

How do we know what’s healthy to eat? There are many different sources of guidance: government dietary guidelines, diet books, and bloggers, just to name a few. Let’s focus in on a specific question: Is saturated fat in our diet a bad thing?

One way that we might answer this question is common sense. If we eat fat, then it’s going to turn straight into fat in our bodies, right? And we have all seen photos of arteries clogged with fat, so eating fat is going to clog our arteries, right? Could be, or this could be an availability heuristic again fed by media sources.  

Another way that we might answer this question is to look at actual data on the subject. One large-scale study, called the PURE study, examined diets and health outcomes (including death) in more than 135,000 people from 18 different countries. In one of the analyses of this dataset (published in The Lancet in 2017; [Dehghan et al. (2017)](https://pubmed.ncbi.nlm.nih.gov/28864332/)), the PURE investigators reported how intake of various classes of macronutrients (including saturated fats and carbohydrates) was related to the likelihood of dying during the time that people were followed. The plot below displays some of the data from the study (extracted from the paper), showing the relationship between the intake of both saturated fats and carbohydrates and the risk of dying from any cause.

<img src="images/ch2-satfats.png" width="400">

*A plot of data from the PURE study describing death rates among people based on the amount of saturated fats and carbohydrates they consume. The death rate seems to increase for larger amounts of carbohydrates, and slightly decrease for larger amounts of saturated fats.*

Don't worry if this plot is hard to read at this time - you'll get there! For now, notice the ten points that the lines run through. To obtain the numbers represented by these points, the researchers split the group of 135,335 study participants (which we call the **sample**) into 5 groups based on how much of each type of nutrient (carbohydrates and saturated fats) they ate. The first group contains the 20% of people with the lowest intake, and the 5th group contains the 20% with the highest intake. The researchers then computed how often people in each of those groups died during the time the study was conducted. The figure expresses this on the Y-axis in terms of the relative risk of dying in comparison to the lowest group: If this number is greater than one, it means that people in the group are more likely to die than are people in the lowest group, whereas if it’s less than one, it means that people in the group are less likely to die. According to this figure, people who ate more saturated fat had lower relative risk of death during the study. The opposite is seen for carbohydrates; the more carbs a person ate, the higher the relative risk of death during the study. 

This example shows how we can use statistics to describe a complex dataset with much simpler numbers. It would be very hard to tell the relative risk of death for a particular carbohydrate intake level by looking at the whole dataset at once (135,335 points of data!). By using statistics, we can *aggregate* and *compare* sets of information to make the takeaway information easier to see. Specifically, the family of statistical tools that help with this kind of insight is called **descriptive statistics**.
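To make the idea of grouping and comparing more concrete, here is a minimal R sketch of the same kind of procedure. The data are simulated for illustration and are not the PURE data, and the names `intake`, `died`, and `group` are made up for this example. You don't need to follow every line yet; the point is that a thousand data points get boiled down to five numbers.

```r
# Simulated data for illustration only (NOT the PURE study data)
set.seed(1)
n <- 1000
intake <- runif(n, min = 40, max = 80)               # hypothetical intake level
died   <- rbinom(n, size = 1, prob = intake / 200)   # 1 = died, 0 = survived

# Split people into 5 equally sized groups, from lowest to highest intake
cutoffs <- quantile(intake, probs = seq(0, 1, by = 0.2))
group   <- cut(intake, breaks = cutoffs, include.lowest = TRUE, labels = 1:5)

# Aggregate: the death rate within each group
death_rate <- tapply(died, group, mean)

# Compare: risk in each group relative to the lowest-intake group
relative_risk <- death_rate / death_rate[1]
relative_risk
```

By construction the first group always has a relative risk of exactly 1, because it is being compared to itself.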

### Predict

These data help us see what relative death risk one can expect based on one's intake level of saturated fats or carbs. But these numbers describe the people in the PURE study specifically. We might also want to make predictions about other people *not* included in the dataset, or outcomes that haven't happened yet. For example, a life insurance company might want to guess how long someone is likely to live in order to set their premium amount. They base that prediction on some combination of information about the person, and this can include information about their intake of fat and carbohydrates. The type of statistics that help us make predictions about new data are known as **predictive models**. 

If we are to make predictions about new people, it's important that our predictions are good. You wouldn't want to trust a model that did no better than if you were to randomly guess a group's relative risk of death. You also wouldn't want to use a model that was based only on a particular group of people, like vegetarians, as that might not apply to people with other sorts of diets. In addition to building predictive models, statistics also gives us the tools to evaluate these models' predictive accuracy. 
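As a rough sketch of what evaluating a model can look like, the R code below builds a simple model on one set of simulated people and then checks its predictions on people it has never seen. Everything here is an assumption made for illustration: the data are invented, the names `people`, `intake`, and `outcome` are hypothetical, and the comparison is against a naive guess (the average outcome) rather than a literal random guess.

```r
# Simulated data for illustration only
set.seed(2)
n <- 200
intake  <- runif(n, min = 0, max = 10)          # hypothetical predictor
outcome <- 2 + 3 * intake + rnorm(n, sd = 1)    # hypothetical outcome
people  <- data.frame(intake, outcome)

# Build the model on some people, then test it on people it has never seen
train <- people[1:150, ]
test  <- people[151:200, ]
model <- lm(outcome ~ intake, data = train)

# Prediction error of the model on the new people (root mean squared error)
model_guess <- predict(model, newdata = test)
model_error <- sqrt(mean((test$outcome - model_guess)^2))

# Prediction error of a naive guess: the average outcome in the training data
naive_guess <- mean(train$outcome)
naive_error <- sqrt(mean((test$outcome - naive_guess)^2))

c(model = model_error, naive = naive_error)
```

A model worth trusting should have a clearly smaller error than the naive guess when it is tested on new data.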

### Infer

The numbers in this figure seem to show that some groups of people have different relative risks of death. Why? If your answer included the word "*because*", then you just made an inference - a guess about the process that generated these death risk data. In this case, it looks like consumption of fat and carbohydrates at least partially explains why someone has a lower or higher risk of death. Given this information, we might decide to make changes in our own lives in order to influence our own risk of death.  

But we also know that there is a lot of uncertainty in these data; there are some people who died early even though they ate a low-carb diet, and some people who ate a ton of carbs but lived to a ripe old age. Given this variability, we want to decide whether the patterns that we see in the data are large enough that we wouldn’t expect them to occur randomly if there was not truly a connection between diet and longevity. In other words, we want to be confident that the pattern we are seeing is real, and that carbohydrates are very likely the reason for this increased risk of death. 
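One way to get a feel for what "occurring randomly" means is to shuffle the data and see what patterns chance alone produces. The sketch below is only an illustration on simulated data, not the method used in the PURE study; the group labels, the death probabilities, and the helper `rate_diff()` are all invented for this example.

```r
# Simulated data for illustration only
set.seed(3)
diet <- rep(c("low_carb", "high_carb"), each = 500)   # hypothetical groups
died <- rbinom(1000, size = 1,
               prob = ifelse(diet == "high_carb", 0.12, 0.08))

# Hypothetical helper: difference in death rates between the two groups
rate_diff <- function(d, g) {
  mean(d[g == "high_carb"]) - mean(d[g == "low_carb"])
}
observed <- rate_diff(died, diet)

# Shuffle the diet labels many times, which breaks any real link
# between diet and death, and record the difference each time
shuffled <- replicate(2000, rate_diff(died, sample(diet)))

# Proportion of shuffles with a difference at least as large as the observed one
mean(abs(shuffled) >= abs(observed))
```

If only a small share of the shuffles produce a difference as large as the one observed, chance alone becomes a less convincing explanation for the pattern.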

Often people from the outside view these kinds of insights as absolute answers - we are *proving* a relationship between diet and death risk. But as we will see throughout the course, this need for black-and-white decisions based on fuzzy evidence has often led researchers astray. Thus, there are also methods within statistics to tell us how confident we can be with our conclusions, known as **inferential statistics**. 

## 1.3 The big ideas of statistics

Since statistics is a way of thinking, there are some basic principles that are important to remember for doing statistics well.

### Learning from data

At its core, statistics is the pursuit of knowledge via data. In any situation, we start with a set of ideas or hypotheses about what might be the case. In the PURE study, the researchers may have started out with the expectation that eating more fat would lead to higher death rates, given the prevailing negative dogma about saturated fats. Maybe they also had their own experience dealing with health problems while on a high-fat diet. But this hypothesis is not where the researchers stopped - they collected many data points to test it. In the end, the patterns in the data revealed a different reality. Data thus can help us solidify our beliefs, update them, or even inspire new ideas. It is the central currency of doing a science like psychology. 

### Aggregation

Although you do statistics using data, another way to think of the process is how best to *throw away* data. In the example of the PURE study above, we took more than 100,000 numbers and condensed them into ten. It is this kind of aggregation that is one of the most important concepts in statistics. When it was first advanced, this was revolutionary: If we throw out all of the details about every one of the participants, then how can we be sure that we aren’t missing something important? As we will see, statistics provides us ways to characterize the aggregates of data in a way that still preserves information about the total - it finds patterns among noise. However, it’s also important to keep in mind that aggregation can go too far, and later we will encounter cases where a summary can provide a very misleading picture of the data being summarized.

### Sampling from a population

The concept of aggregation implies that we can make useful insights by collapsing across data. But how much data do we need, and from whom? The idea of sampling says that we can reliably summarize an entire **population** based on a small number of data points from the population, as long as those data are obtained in the right way. For example, the PURE study enrolled a sample of about 135,000 people, but its goal was to provide insights about the billions of humans who make up the population from which those people were sampled. The way that the study sample is obtained is critical - it determines how broadly we can generalize our results. Another fundamental insight about sampling we will learn about is that while larger samples are always better (in terms of their ability to accurately represent the entire population), there are diminishing returns as the sample gets larger. 
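The diminishing returns of larger samples can be seen in a small simulation. This is a sketch under made-up assumptions: the `population` below is invented, and `typical_error()` is a hypothetical helper that measures how far off a sample average usually is from the true population average.

```r
# Simulated population for illustration only
set.seed(4)
population <- rnorm(100000, mean = 50, sd = 10)

# Hypothetical helper: for a given sample size, how far off
# is the sample average from the population average, typically?
typical_error <- function(n) {
  sample_means <- replicate(500, mean(sample(population, n)))
  mean(abs(sample_means - mean(population)))
}

sample_sizes <- c(10, 100, 1000, 10000)
errors <- sapply(sample_sizes, typical_error)
data.frame(sample_sizes, errors)
```

The error shrinks every time the sample gets bigger, but going from 10 to 100 people buys a much bigger improvement than going from 1,000 to 10,000.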

### Operationalization

Think of the statement "dogs are good pets." On the surface, it seems like a pretty simple, straightforward claim. But now think of what it would take to support this statement with evidence. What counts in the category "dogs?" What actions or qualities are relevant to being evaluated as a good pet? And what even does "good" mean?  

Assigning concrete meaning to abstract or vague concepts like this is called **operationalization,** and is required for statistics to give you usable answers. In order to do any analysis, you need to be able to break a hypothesis down into smaller conceptual units and define what those mean in easy-to-understand ways. This enables you to 1) define what set of data to use (e.g., what counts as a "dog" and what doggy actions are relevant to you); 2) choose what kind of analysis to do on the data; and 3) evaluate how much your analysis contributes to understanding broader topics of all pet ownership.

### Uncertainty

The world is an uncertain place. We now know that cigarette smoking causes lung cancer, but this relationship is probabilistic. Consider a 68-year-old man who smoked two packs a day for the past 50 years and continues to smoke. He has about a 15% (roughly 1 out of 7) risk of getting lung cancer, which is much higher than the chance of lung cancer in a nonsmoker. However, it also means that there will be many people like him who smoke their entire lives and never get lung cancer. Furthermore, it's possible that even when all measurable variables are the same, e.g., this man has a genetically identical twin brother who smokes the exact same amount and has the exact same lifestyle, one of them might get cancer while the other does not. This is a case of uncertainty: we can never be *sure* that something will happen, only more or less confident. Statistics provides us with the tools to characterize this uncertainty and to make decisions under uncertainty.
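A short simulation shows what a probabilistic relationship looks like. The sketch below imagines 1,000 men who all share the same 15% risk; the number of men and the object name `outcomes` are made up for illustration.

```r
set.seed(5)

# 1 = develops lung cancer, 0 = does not; every man has the same 15% risk
outcomes <- rbinom(1000, size = 1, prob = 0.15)

sum(outcomes)          # how many of the 1,000 develop lung cancer
mean(outcomes == 0)    # proportion who never do, despite the identical risk
```

Even though every simulated man has exactly the same risk, their individual outcomes differ, and most of them never develop the disease.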

One often sees journalists write that scientific researchers have “proven” some hypothesis. But statistical analysis can never “prove” a hypothesis, in the sense of demonstrating that it must be true (as one would in a logical or mathematical proof). Statistics can provide us with evidence, but it’s always tentative and subject to the uncertainty that is always present in the real world.

## 1.4 How to do statistics

The principles of statistical thinking will get you far as a consumer of statistical results when reading the news, research articles, etc. They are also the foundation you need to be able to produce your own statistical insights. But you will also need the mathematical tools to formalize those insights. This moves us from the domain of learning *about* statistics to learning to *do* statistics. 

Most people have learned something about doing statistics before they get to college. Some of you have even taken whole courses in statistics before this one. If you have, you have probably heard about some or all of these things: mean, variance, standard deviation, t-test, p, F, ANOVA, regression, chi-square, normal distribution, z score, and so on.

With such a long list, it’s no surprise that many students see statistics as a daunting subject to learn (especially if you're not very keen on math). This is completely normal! Even professional statisticians find it hard sometimes. They are always learning new things, and deepening their understanding. But don't get discouraged - if it feels hard, that just means you are making progress and pushing the boundaries of what you know. It does not mean that you aren’t capable of it. There is no such thing as a "math person", only people who have not yet found the math approach that clicks with them. In this course, we want to get you started along the pathway to understanding. At the end of the course you will understand more than you do now, and hopefully that will be useful to you. To do so, this course is structured around three pedagogical principles:

### Developing deep understanding

Most of the math classes you have taken before focused on solving problems and remembering equations. Because of how big the field of statistics can be, this course will work a bit differently. Instead of remembering a laundry list of disparate tools, we will focus more on understanding the deeper concepts that motivate them, and practicing statistical thinking. Understanding over memorizing, reasoning over calculating. There will still be some math to do of course, and some equations do need to be simply remembered to be helpful. But all these things are a lot more interconnected than traditionally taught, so we will focus on those connections.

### Practicing and making mistakes

You can't really learn how to ride a bike just by reading about it or listening to a lecture. That can make you aware of the principles of gyroscopic precession, but you ultimately need to put yourself in the bike seat to teach your body how to balance and move. A similar principle applies to learning statistics in a way that will actually be useful to you, rather than something that vanishes from your mind as soon as you walk out of the final. You will not succeed in this class if you only read the text in these pages. You also need to practice using stats and making statistical decisions, over and over. For this reason, this chapter and the ones that follow include interspersed windows where you can interactively try out new ideas as you learn them. In class we will also do collaborative problem-solving, and each week you will receive a problem set to complete. This does mean there is a lot of work to do in this class. But because of the importance of good statistical training in doing science well, I promise this hard work will pay off. 

Other classes you have taken in the past may have given you the expectation that the teacher will teach the student the right steps to follow for solving problems, and then the student's job is to remember the steps. This type of approach is not very helpful for learning statistics, and it's not going to help you learn coding. The best way to practice statistics is to try things and see what happens. Try something out, tweak it, see what changed, or think about why it didn’t work! Trial and error can be frustrating if we are not used to learning this way, and it may seem inefficient. But trial and error is a great way to learn because we learn from wrong answers as well as right ones. In this course we might sometimes ask you to do something wrong just to see what happens!

By embracing the process of trial and error in your education, your progress will not always go in a straight line. It will be more like experimenting and exploring, making discoveries as you go. The benefit of exploring is that you will get a more thorough sense of statistics!

### Learning through coding

Statistics has also been traditionally taught through a combination of hand-solved equations and point-and-click software. This worked well enough for simple statistical questions about small amounts of data, but data analysis in the 21st century no longer works that way. Instead, doing data analysis by writing computer code has become much more common. There are three big reasons for this: 

- Large data - Analyses that would have taken *months* to calculate by hand in the 1950s can now be completed in a few seconds on a standard laptop computer using computer code. This change unleashes the ability to manage much larger amounts of data, and to ask questions in new and powerful ways.
- Reproducibility - The [replication crisis in psychology](https://www.psychologytoday.com/us/basics/replication-crisis) over the last decade found that many prior research results could not be replicated. In some cases, even repeat analyses on the *same data* could not get the same originally published result. This is partially because, when statistics are done with point-and-click software, you can't verify whether someone clicked the right button. There's no ability to "show your work." In contrast, doing statistics with code enables you to save all the code and share it, so anyone can verify what exactly you did and where any errors are, if any. 
- Practical skill - Increasingly, the world is a [data-driven place](https://www.forbes.com/sites/googlecloud/2020/05/20/how-the-world-became-data-driven-and-whats-next/?sh=61586b2557fc). Therefore, more and more employers value employees who know how to work with data and the computers that hold them. Even if you decide not to have a career in psychology or anything that uses statistics (though, as we said earlier, statistical thinking is all around us!), coding will be a valuable skill on your resume. 

Thus, in this course we will introduce you to some computer coding at the same time as you learn statistics. While this may seem daunting at this point in the course (learning two whole topics at once?!), research has shown that students actually learn statistics better by doing it through coding. It lets you get closer to the data for understanding it, allows you the autonomy to try out different things, and teaches you how to think through statistical problems logically. Therefore, most of the coding in this course will be learned a couple of steps at a time, along with the statistical lesson it can help you implement. 

## 1.5 Starting to code in R

Let's get started with learning some coding right now! The specific coding language we will use in this course is simply called "R." It is a very popular tool for doing statistics among data analysts in many fields, and is a relatively easy coding language for first-time users to pick up. In this final section of the chapter, we'll teach you the fundamentals of R, which are also fundamental concepts in many other computer coding languages. It all may seem a bit abstract at first, but once you practice with it you will understand more complex things you can do with code and also the statistical concepts we will put into code.

For example, here’s a bit of R code. Code functions as instructions for a computer, so to make the computer do something, you need to "run" the code (aka execute it). 

Read the code in the window below. Before you run it, what do you think it will do?

<div class="alert alert-block alert-info">
<b>Note</b>: To run code in a Google Colab notebook like this, hover over a code cell, and press the play button that appears on the left. 
</div>

Press the "Run" button and see what happens:

```r
# execute your first bit of R code

print("Hello world!")
```

Congrats, you are officially coding! Printing out the words "Hello world!" is the traditional first coding task when learning any computer language. Now, "language" is something to pay attention to here - think of coding as learning how to say things in a different way. While in English, you might tell someone "Print the sentence 'hello world!'", a computer needs you to speak to it in its own language for it to understand what you want. The fundamentals of this chapter will help you learn the "vocabulary" of the language that is R, so that you can represent English concepts in ways that the computer understands.

In a bit we'll cover the vocabulary you used above (the "print" word, parentheses, quotation marks, etc.). For now, let's go back to something you already know how to write that a computer can understand - arithmetic. 

Try running the code in the window below.

```r
# a few basic arithmetic things
5 + 1
10 - 3
2 * 4
9 / 3
```

Basic math symbols like ```+```, ```-```, ```*```, ```/```, etc. can be used in R. For each line of code, R will evaluate it and return an output.   

Notice that you can put more than one line of code in a single R window. When you press the Run button, all the commands in the window will be run, one after the other, in the order in which they appear.

### Comments vs. commands

Notice in the code block above that the four lines that included arithmetic statements produced printed results, but the first line with words on it didn't seem to do anything. That first line is what is called a **comment** - a section of code that we want a human to be able to read in the code file, but a computer to ignore when executing the code. In R, we use a '#' symbol at the beginning of the line to denote a comment. Any line that starts with a '#' will be ignored by R. In contrast, any line without a '#' at the beginning will be considered a **command** - a statement about what you want the computer to do for you.

In practice, comments are used by the authors of code to communicate with anyone trying to use that code. This includes describing the purpose of a chunk of code, what options are available for changing the code, keeping notes about what features will be added later, etc. You should get into the habit of using comments often. Not only do they help another person trying to use your code, sometimes that other person is you in the future who had forgotten why you did something! In this course we will use comments as a way to give you instructions for R exercises, and we will expect you to use comments for describing what you're doing in your submitted work. 

<div class="alert alert-block alert-info">
<b>Note</b>: If you want to write a comment that takes more than one line, it’s a good idea to put a # at the beginning of each line.
</div>

In the code window below, try typing whatever you want after a '#' at the front of the line. Then press Run.

```r
# type whatever you want
# see... blah blah blah
```

Notice that pressing 'Run' for this code chunk doesn't do anything, because lines that start with a # are ignored by R. 
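A comment does not have to take up a whole line. R ignores everything from a '#' to the end of that line, so a short note can follow a command on the same line. The small sketch below illustrates this; the notes themselves are just examples.

```r
# This whole line is a comment, so R skips it
5 + 1   # R runs the command on the left and skips this note

# 10 - 3   (this command is "commented out", so it does not run)
```

Running this window returns only one result, 6, because the subtraction sits behind a '#'.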

### Objects

Have you ever had an experience where you have forgotten to save your work? It’s a terrible feeling. Saving your work is also important in R. In R, we don’t just type calculations and look at the results on the R console. We usually save the results of the calculations somewhere we can find them and use later.

Pretty much anything can be saved as an R **object**. Think of an object like a box that you can put anything into - a number, a message, etc. The **value** of the object is whatever is inside the box, while the **name** of the object is whatever you choose to name the object so that both you and the computer can refer to it later. After creating an object and assigning it a value, you can use the name of the object in later commands to stand in for its value. 

To assign an object (i.e., assign a value to the name), you need to use an **assignment operator**. Much like ```+``` and ```-``` are operators that tell the computer to do some math, an assignment operator tells the computer to assign a value to an object name. In R, the assignment operator looks like an arrow: ```<-```. 

Here’s a simple example to show how it’s done. 

```r
# This code will assign the number 47 to the R object favorite_number
favorite_number <- 47

# This code returns the value of favorite_number. Notice that you don't need to use the print() function 
# to print the contents of an R object; you can just type the name of the object
favorite_number

# now try making a new object below this line, called 'birthday', and assign it the day of the month 
# that you were born.

```


Anything can be saved into an object, even if it's a complex command with lots of actions (or other objects!) in it. For example, compare the value of ```step3``` to the value of ```all_steps``` by printing them out and evaluating the answers.

```r
step1 <- 2*3
step2 <- 9/3
step3 <- step1 + step2

all_steps <- (2*3) + (9/3)

# type 'step3' and 'all_steps' below on separate lines to return their values

```


You can name your objects almost anything. There are just a few rules to follow:

- R is case-sensitive. Any little change in the name of an object will be considered a separate object (e.g., ```step1``` vs. ```Step1```). 
- You can use letters, numbers, underscores, and periods in your names, but a name should always start with a letter. Hyphens are not allowed, because R reads ```-``` as a minus sign.
- Names need to be all one word (no spaces). 
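Here is a small sketch of these rules in action. The object names are made up for this example, and the two invalid lines are left as comments so that the window still runs; remove the first '#' from one of them if you want to see the error for yourself.

```r
# Case matters: these are two different objects
score <- 10
Score <- 99
score == Score    # FALSE, because the two objects hold different values

# An underscore is a handy way to join words into one name
quiz_score <- 8

# These would cause errors if you removed the first '#':
# 1st_score <- 5    # a name cannot start with a number
# my score <- 5     # a name cannot contain a space
```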

In addition to these requirements by the language, the R user community has decided on preferred naming conventions, called a [style guide](http://adv-r.had.co.nz/Style.html), in order to make code easier to read and consistent across people. You don't have to follow this style guide for this course, but it would be a good idea to make these naming conventions into habit. 

Lastly, it's important to remember that R code is evaluated in order, from the top of the page to the bottom. Objects created first will be "remembered" in later code lines unless you overwrite them with a new value; but trying to access an object at the beginning of a code window when it isn't created until lower down will result in an error.  
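The short sketch below shows both ideas: an object being overwritten, and what happens when a name is used before it exists. The names `pets` and `toys` are made up for this example.

```r
pets <- 2
pets               # 2

pets <- pets + 1   # overwrite pets, using its old value to make the new one
pets               # 3

# Using a name before it has been created causes an error.
# Remove the '#' from the next line to see it:
# toys + 1
```

After the second assignment the old value of 2 is gone; R only remembers the most recent value given to a name.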

<div class="alert alert-block alert-info">
<b>Note</b>: In most other computing languages, this concept is actually called a "variable." However, users of R use the word "object" instead, because the field of statistics also has a concept called a variable and we try not to use overlapping nomenclature. That way we don't confuse the two.  
</div>

### Functions

So far you know about operators, like doing arithmetic or assigning values to objects. All of these operators tell the computer to "do a thing" (add numbers, create an object, etc.)

Oftentimes, we want the computer to do more complex things than there are operator symbols for. For this case, we use what's called a **function**. Functions will still do an operation, but are not limited to what individual symbols mean (in fact, operators are just special kinds of functions). In thinking about the grammar of a coding language, objects are like nouns and functions are like verbs. 

We used a function at the very beginning of this chapter: ```print("Hello world!")```. 

Functions have three basic parts. The first part is the name of the function (e.g., ```print```). The second part is the input to the function, which goes inside a pair of parentheses. We call these inputs **arguments**. Arguments are whatever objects or values you want to do operations with. Lastly, the **output** of a function is whatever result comes out of the operation.

<div class="alert alert-block alert-info">
<b>Note</b>: In R, you can work with natural language by putting words or sentences between a pair of quotation marks "". Everything between the marks is called a string. We'll talk more about different data types in Chapter 2.
</div>

Sometimes a function takes only one argument (e.g., ```print("my message")```). Sometimes, a function can take two or more, which are separated by commas (e.g., ```sum(1,2,3)```). Each function is unique in what it does, and in what arguments it requires to do its operation. As we move through the course, you will learn some important functions, as well as how to look up other less common ones.  
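The sketch below puts the three parts of a function side by side: the name, the arguments, and the output. It uses `round()`, a built-in R function not mentioned so far, as an extra example of a function whose second argument has a name; the object name `total` is made up.

```r
# One argument
print("my message")

# Several arguments, separated by commas
sum(1, 2, 3)                  # 6

# Some arguments have names: round() takes a number and how many digits to keep
round(3.14159, digits = 2)    # 3.14

# The output of a function can be saved into an object and used later
total <- sum(2, 4, 6)
total / 3                     # 4
```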

Here we’ve put some instructions (as comments) into the code window. Write your code as a new line under each comment. See if your code works by clicking 'Run'. 

```r
# Use the print() function to print the word "hello" (with the quotations)

# Use the sum() function to add up the numbers 5, 10, and 15. 

```


Notice that the actual R code is the set of lines you wrote in the code window, such as  ```sum(5,10,15)```  or  ```print("hello")``` . The output or result of the code (e.g., 30) appears in a new area underneath the buttons after you click 'Run'. The instructions are not returned, because they are comments and only readable by humans.

### Errors

R is a very flexible language, with literally thousands of functions that can do many different things. However, one thing to be aware of is that R is very, very picky. For example, go back to your code above and delete the last parens in one of the functions. What happens?

You probably got a returned message that said "Error..." Congratulations, you just caused your first computer bug! Get comfortable with this, because you will see a *lot* of errors in your time coding, throughout this class and beyond. Everyone, even the most experienced coders, will have errors in their code at first. When this happens to you, consider it the "first draft" of your code that you then refine until your code does what you want. No one writes a perfect essay on their first shot, and likewise no one writes perfect code the first time.

Figuring out why your code had an error is a big part of programming. Usually when an error occurs, a message will appear telling you what it is. Unfortunately, while R is considered a relatively user-friendly coding language to write, its error messages are not very easy to understand.  

For instance, if you removed the last parens from the ```print()``` command, you probably got an error that said:

```
Error in parse(text = x, srcfile = src): <text>:6:0: unexpected end of input
4: print('hello'
5: 
  ^
Traceback:
```

An easy error message would say something like "looks like you forgot a parens!" Sadly, computers are very literal with the instructions you give them - they don't know your *intent,* only the explicit commands you gave them. So when it tried to run ```print("hello"``` as code, the computer didn't know what that meant - it expects functions to have both open and close parentheses. In this case, the error says "unexpected end of input" because R reached the end of the code without seeing the close parenthesis it would expect if you were asking it to execute a function. Below that, the location of the ^ symbol tells you where in the code this error happened - after the line without the close parens. 
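If you'd like to look at this error without breaking your own code, here is one possible sketch. It hands the unfinished command to R as a string using ```parse()```, and wraps that in ```try()``` so the error is captured instead of stopping everything. Neither function has been covered in this chapter, and the object names ```unfinished``` and ```result``` are our own choices, so treat this as a preview rather than something to memorize.

```r
# The unfinished command, stored as a string (note the missing close parens)
unfinished <- 'print("hello"'

# parse() asks R to read the string as code; try() captures the error it causes
result <- try(parse(text = unfinished), silent = TRUE)

# Display the captured error message
cat(result)
```

The message that comes back should include the same "unexpected end of input" wording discussed above.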

You're likely to get more complex errors than this which are harder to understand. If you feel frustrated, remember this is part of the territory. To solve it, try asking your peers to review your code, or copying and pasting the error message into Google. If you're getting an error, likely someone else has gotten the same one before, and an answer will be out there if you do enough digging. You'll also get better at identifying and fixing your bugs as you get used to what errors you're likely to make.

Below is some code with an error. See if you can figure out what it is, and fix it. (Hint: the rules that apply to object names also apply to function names!)

In [None]:
# Run the code below by pressing Run

# Now debug the code - fix the mistake and press Run

Sum(1,2)


One thing to keep in mind - the sorts of errors that give you an error message are called **run-time errors.** These prevent the code from completing, so you know when they've happened. You can also get more insidious errors - the kind where your command was a real code command, but it didn't actually do what you intended. For example, maybe you wanted to print out the result of ```5 + 1```, but accidentally typed ```5 - 1```. The computer will still give you an answer without a warning, because ```5 - 1``` is a legal code command. But it will be the wrong answer. Double-check your work to watch out for these!
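To see how quietly this kind of mistake slips by, here is a small sketch using the same two calculations. The object names ```intended``` and ```typed``` are ours, picked only for illustration, and the ```==``` comparison may be new to you.

```r
# What we meant to calculate
intended <- 5 + 1

# What we accidentally typed: still legal code, so R gives no error or warning
typed <- 5 - 1
print(typed)

# Comparing the two shows that the results do not match
print(intended == typed)
```

R runs every line without complaint. Only a deliberate check like the last line reveals that something went wrong.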

### Packages

When you first install R on your computer, it comes with many base functions like ```print()``` and ```sum()```. However, the R coding community is always creating new functions to download, kind of like when you download DLC to expand what you can do in a video game. In R, these collections of functions for download are called **packages**. Much of what you will do in this course uses base functions that come pre-installed, but sometimes you will have to install and load new packages. If you try to use functions from these packages without installing and loading them first, R won't know what you're trying to do and will report an error. 

For example, try to run the code below. What happens? 

In [None]:
str_to_upper('aug 30')

Installing new packages into your R environment is a two-step process. First, you need to download them from the online repository where they are stored. The default way to do this is to use the function ```install.packages('PACKAGE_NAME')```. This adds the package, and all the functions bundled in it, to your computer (stored in what R calls your **library**). But that isn't enough. As the second step, you need to load the package from your library so that R can use it. To do this, you use the function ```library(PACKAGE_NAME)```. Try it out below. 

In [None]:
#stringr is a package that makes it easy to modify text data in R. 
#Step 1, download it (this can take a few moments)
install.packages("stringr")

#next, load it so that all the functions are available to R
library(stringr)

#now, try re-running the code from above. What happens?
str_to_upper('aug 30')

Notice how ```install.packages()``` wants the package name to be in quotation marks, but ```library()``` doesn't. Some functions have different requirements for their arguments, and it can get confusing to memorize them all. English has weird spelling conventions sometimes, and coding languages are no different. 
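One possible way to explore these two steps without downloading anything is sketched below. It uses ```stats```, a package that ships with R, as a stand-in for a package like ```stringr```. The function ```requireNamespace()``` and the object name ```is_installed``` are not covered elsewhere in this chapter and are shown only for illustration.

```r
# Check whether a package is already in your library.
# "stats" ships with R, so nothing needs to be downloaded for this check.
is_installed <- requireNamespace("stats", quietly = TRUE)
print(is_installed)

# library() accepts the package name without quotation marks...
library(stats)

# ...and it also accepts the name with them
library("stats")
```

A check like this can save time, since a package only needs to be installed once but has to be loaded in every new R session.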

Even though you will learn a lot in this course, there are literally thousands of functions in R, more than anyone could remember, even advanced users. One way to cope is to create and maintain your own "R cheatsheet" - some document where you write down how to use the functions you frequently need. In addition, you can search on the internet for functions you can’t remember. Thousands of packages are hosted on the online repository called [CRAN](https://cran.r-project.org/web/packages/available_packages_by_name.html), or you can just Google "R package for _____". Not only will you find some new functions, but you’ll also find endless discussions about which ones are better than others. 

Additionally, if you know a function's name but don't remember how it works, type a ```?``` in front of the function name and R will show you its help page, where you can read about it. E.g., ```?str_to_upper``` (this works once the package containing the function has been loaded). 
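As a small sketch of looking things up, the code below asks for help on ```sum()```, a base function that needs no package. The ```help()``` function does the same job as typing ```?``` in front of a name. ```args()``` is an extra we are adding here for illustration, and ```sum_arguments``` is our own object name.

```r
# Open the help page for sum(); this is the same as typing ?sum
help("sum")

# args() gives a quick summary of the arguments a function accepts
sum_arguments <- args(sum)
print(sum_arguments)
```

The help page is usually the better place to start, because it also describes what each argument means and often ends with runnable examples.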

## Chapter summary

After reading this chapter, you should be able to:

- Explain the difference between using statistical reasoning and using intuition
- Come up with different describe, decide, and predict use cases for data
- Summarize the major ideas in statistics and what they mean
- Do some basic math in R
- Explain the difference between commands and comments
- Describe the three parts of objects
- Explain what functions are, and how to download more of them
- Embrace the prospect of making errors, and think about how to start solving them when they happen

## New concepts

- **sample**: A subset of a much larger group, whose values are used to represent characteristics of the larger group. 
- **descriptive statistics**: Statistical tools that quantitatively describe or summarize features of a sample of data. 
- **predictive models**: Statistical tools that predict values of data not yet seen, based on relationships between variables in a sample. This can refer to entities not in the sample or future values. 
- **inferential statistics**: Statistical tools that infer properties of a larger group of data based on information in a data sample, and provide confidence estimates about those properties. 
- **population**: The entirety of some group one wishes to understand. This can be all humans, all children, people from a specific cultural group, etc. 
- **operationalization**: The process of concretely defining the measurement of a phenomenon.
- **comment**: A line in code (started with a # symbol in R) that allows the coder to write human-readable prose that is ignored by the computer. The purpose of a comment is to communicate intentions and instructions to other people trying to use the code. 
- **command**: A line in code that instructs the computer to carry out an action. 
- **object**: In R, an object is an entity that stores values for later use.
- [object] **value**: The word, number, command, etc. that is stored within an object. 
- [object] **name**: How an object is referenced in order to access its value for use in further code. 
- **assignment operator**: A coding symbol used to assign a value to an object name. In R, the assignment operator is ```<-``` with the object name on the left of the arrow and value to be assigned on the right. 
- **function**: An instance of code that runs some predefined action when called.
- **output**: The resulting value of a function call. 
- **argument**: The input values on which a function runs operations. 
- **run-time errors**: Errors in code that prevent the code from running.
- **package**: A set of functions written by other people that can be downloaded and added to one's coding project. 
- **library**: The computer directory where a downloaded package is stored and from where it can be loaded for use in an R session.

[Next: Chapter 2 - What are Data](https://colab.research.google.com/github/smburns47/Psyc158/blob/main/chapter-2.ipynb)