# POLSCI 3

## Week 1, Notebook Lecture 1: Analyzing Data using R in Jupyter Notebooks


Welcome! In this notebook, you will start to learn how to use a Jupyter Notebook (like this one!) and the R programming language to analyze quantitative data.

# Jupyter Notebooks

A Jupyter Notebook is an online, interactive computing environment that we will use in PS 3 to learn and practice __data analysis__ skills.

You can open up this Notebook for Week 1 by typing the following URL into a web browser: https://tidyurl.xyz/zOXB.

**Please get out your laptop or other device and open that URL now, so you can follow along with me!**

(Note: you should have permission to access the Notebook through bCourses; you may need to authenticate on CalNet. If you cannot access it but your neighbor can, please follow along with your neighbor. Or just watch the lecture today, and see a GSI about this issue after class.)

There are a few advantages of Jupyter notebooks for a class like PS 3:

- You don't need to install software on your computer (and potentially have to troubleshoot it if it doesn't work).
- Some questions can be *auto-graded*, so you'll get your grades within minutes -- and your GSIs will have extra time to help you if you're stuck.
- For your *group* assignments, you'll see whether you got the right answer *as you're working on the assignment*!

Notebooks are composed of different types of __cells__. Cells are chunks of code or text that are used to break up a larger notebook into smaller, more manageable parts and to let the viewer modify and interact with the elements of the notebook.

Notice that the notebook consists of 2 different kinds of cells: **text** and **code**. A text cell (like this one) contains text, while a code cell contains expressions in R, the programming language that you will be using. 

If you want to see the underlying commands used to format a text cell, you can double-click on the cell.  Then press __Shift + Enter__ to return to the regular formatting.  Try it now on this cell!

### Running Cells

"Running" a cell is similar to pressing 'Enter' on a calculator once you've typed in an expression; it computes all of the expressions contained within the cell.

To run a code cell, you can do one of the following:
- press __Shift + Enter__
- click __-> Run__ in the toolbar at the top of the screen.

You can navigate the cells by either clicking on them or by using your up and down arrow keys. Try running the cell below to see what happens. 

In [3]:
5 + 5

The input of the cell consists of the text/code that is contained within the cell's enclosing box. Here, the input is an expression in R that adds two numbers; when you run the cell, R evaluates the expression and prints the result. 

The output of running a cell is shown in the line immediately after it. Notice that markdown (text) cells have no output.

Each line of a cell runs an operation.

In [4]:
# Addition
20 + 20

In [5]:
# Multiplication
10 * 8.5

In [6]:
# Division
625 / 25

In [7]:
# A series of arithmetic operations
(2 - 4 * 5 + 7) + 18 * 2

Note that code after a # (hashtag) is not run, so we use lines starting with hashtags to add comments or notes on our code. Here's an example.

In [8]:
# By using a comment at the beginning of the cell, we can describe what will occur when you run the cell.
# Add ten to 8
10 + 8 # Note how we can add a comment after the expression

In [9]:
# If you create two lines, each line will run on its own.
2 + 2
5 + 5

### R Variables

Aside from numbers, R has **variables**, names that act as placeholders for certain values. For example, let the variables `x` and `y` equal 10 and 9, respectively. This action is called "assigning" a variable.

In [10]:
# Assign your variable by using <-
x <- 10
y <- 9

Notice that assigning a number to a variable name such as `x` produces no output. To view the value of `x`, place it at the end of a code cell, like below.

In [11]:
# Note: you need to run the cell above (which assigns the value of 10 to the variable x) before you run this cell to print the value!
x

Now, we can use the variables `x` and `y` in expressions.

In [12]:
10 + 9

In [13]:
x + y

Now what happens when the value of `x` changes?

In [14]:
x <- 12 * 2

Then, the value of any expression that relies on `x` also changes.

In [15]:
x

In [16]:
x + y

**This is why the order in which you run code cells is important.** The expression `x + y` can yield different results depending on which cells you ran before.

In [17]:
x <- 5
y <- 20
x + y

What happens if you try to use a variable without assigning it to a value first?

In [18]:
x + y + z

ERROR: Error: object 'z' not found


You'll see that R outputs an error: `object 'z' not found`. R tried to find the value of `z`, but `z` hadn't been defined yet!

**Important:** If you see this error again in this notebook or in future notebooks, it is an indication that you might not have run all the previous cells or that you might be using variables without assigning values to them first.

Run the next cell to define `z`.

In [19]:
# Defining z here
z <- 2019

In [20]:
# Good to go! Now that we've assigned a value to z, the code runs.
x + y + z
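If you're unsure whether a variable has been assigned yet, one option is to ask R directly. The sketch below uses base R's `exists()` function. The names `a`, `a.defined`, `other.defined`, and `not.defined.yet` are made up for this illustration and are not part of this notebook.

```r
# Assign one variable, and deliberately leave another one unassigned
a <- 3

# exists() takes the variable's name in quotes and returns TRUE or FALSE
a.defined <- exists("a")
other.defined <- exists("not.defined.yet")

a.defined      # TRUE, because we assigned a above
other.defined  # FALSE, because nothing was ever assigned to that name
```

This is only a convenience check; re-running the earlier cells in order is usually the simplest fix for an `object not found` error.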

## Reading in and Using Datasets

R is meant to be used as a tool for statistical computing and graphics. Naturally, R allows us to read in data sets.

During most of the semester, we're going to be reading in real datasets from real studies about the real world.

But first we'll start with something simple. In the next cell, we will read a comma-separated values **(CSV) file** that includes the sales of the three most popular clothes at the Cal Berkeley store! (This is fake, obviously.) We *assign* this dataset to a variable named `berkeleyStore`.

In [2]:
# This stores, or assigns, the dataset as berkeleyStore
# Note that the file lecture_notebook1.csv is stored in the same directory as this Notebook, at the URL you used

berkeleyStore <- read.csv('lecture_notebook1.csv')
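To see what `read.csv()` does without needing the course file, here is a minimal sketch that first writes a tiny CSV to a temporary file and then reads it back. The file contents are a two-row excerpt typed in by hand, and the names `tmp.file` and `tinyData` are made up for this illustration.

```r
# Create a path for a temporary file (tempfile() is part of base R)
tmp.file <- tempfile(fileext = ".csv")

# Write three lines of comma-separated text: a header row and two data rows
writeLines(c("Month,Hoodie", "Jan,250", "Feb,224"), tmp.file)

# read.csv() turns the file into a dataset; the first line becomes the column names
tinyData <- read.csv(tmp.file)
tinyData

# Delete the temporary file now that we've read it
unlink(tmp.file)
```

In the notebook itself you don't need the temporary-file steps, because `lecture_notebook1.csv` already exists next to the Notebook.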

Running the name of the dataset by itself prints out the data set. **For the in-class assignment on Thursday, you'll use the approach in the next cell to answer Question 1.**

In [21]:
berkeleyStore

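| Month | Hoodie | Sweater | Shirts |
|-------|--------|---------|--------|
| `<chr>` | `<int>` | `<int>` | `<int>` |
| Jan | 250 | 220 | 130 |
| Feb | 224 | 250 | 154 |
| March | 178 | 220 | 196 |
| April | 160 | 200 | 224 |
| May | 129 | 185 | 250 |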


- Every dataset will have multiple **rows**. Each row represents one **observation**. For example, in this dataset, one row corresponds to one month, so we can learn multiple things about what happened in that month by looking across the row. 
- Every dataset will have multiple **columns**. Each column represents one **variable**. A variable is a characteristic that can be measured or counted for every observation (that is, every row). 
- The intersection of a row and a column in a dataset contains the value of one variable for one particular observation. For example, the `berkeleyStore` dataset shows that in February, the store sold 250 sweaters. 
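The ideas of rows, columns, and their intersection can be sketched in code. The block below rebuilds the same numbers by hand with `data.frame()` so that it runs on its own (an assumption for this illustration; in the notebook, `berkeleyStore` comes from `read.csv()`). The names `storeCopy` and `feb.sweaters` are hypothetical, and the square-bracket indexing is shown only as a preview.

```r
# Rebuild the dataset by hand: each argument becomes one column (variable)
storeCopy <- data.frame(
  Month   = c("Jan", "Feb", "March", "April", "May"),
  Hoodie  = c(250, 224, 178, 160, 129),
  Sweater = c(220, 250, 220, 200, 185),
  Shirts  = c(130, 154, 196, 224, 250)
)

nrow(storeCopy)  # number of rows (observations)
ncol(storeCopy)  # number of columns (variables)

# Row 2 is February; this picks the Sweater value for that row
feb.sweaters <- storeCopy$Sweater[2]
feb.sweaters
```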

In this class, you'll also see a **codebook** that tells you what each variable means. Here's what the variables mean in this dataset:

- `Month`: What month of the year the sales refer to
- `Hoodie`: How many hoodies were sold that month
- `Sweater`: How many sweaters were sold that month
- `Shirts`: How many shirts were sold that month

As we can see, R successfully interpreted our file!

While the dataset is useful by itself, it would be nice to do some operations with it. R allows us to work with individual columns of the dataset: `dataSetName$varName` returns the specified column of our dataset.

**For the in-class assignment on Thursday, you'll use the approach in the next cell to answer Question 2.**

(The **R Cheat Sheet** in bCourses will contain a list of everything we learn how to do in R and how to do it!)

In [22]:
# dataSetName$varName gives us the specified column.
berkeleyStore$Hoodie

Furthermore, it is possible to do different operations on this variable. In this case, let's compute the average number of hoodies sold in these five months. Let's use R's `mean()` function, which computes the mean (i.e., the average) of a given variable.

In [23]:
mean(berkeleyStore$Hoodie)
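To make clear what `mean()` is computing, here is a short sketch that calculates the same average by hand: add up the values, then divide by how many there are. The five hoodie counts are typed in from the dataset so the block runs on its own; the names `hoodie.counts` and `by.hand` are made up for this illustration.

```r
# The five monthly hoodie counts from the dataset, typed in by hand
hoodie.counts <- c(250, 224, 178, 160, 129)

# Add up the values, then divide by the number of values
by.hand <- sum(hoodie.counts) / length(hoodie.counts)
by.hand

# mean() gives the same answer in one step
mean(hoodie.counts)
```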

You can even save the results into a new variable which just contains a single value. **For the in-class assignment, you'll use the approach in the next cell to answer Questions 3 and 4.**

In [24]:
hoodie.mean <- mean(berkeleyStore$Hoodie) # This assigns the value to hoodie.mean
hoodie.mean # This line just prints what is in hoodie.mean

Once we've assigned a variable, whether with a line like `x <- 5` or with the command above, it behaves just like a number. So you can do arithmetic with these variables, too, just like we showed above.

For example, suppose that you want to estimate how many hoodies the Berkeley store is likely to sell in an entire year. Let's take our `hoodie.mean` (the monthly average) and multiply by 12.

**For the in-class assignment, you'll use the approach in the next cell to answer Question 5 (although not using multiplication).**

In [25]:
hoodies.per.year <- hoodie.mean * 12
hoodies.per.year
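The same pattern works with other arithmetic, such as subtraction. As a sketch, the block below compares average monthly sweater sales with average monthly hoodie sales. The counts are typed in by hand from the dataset so the block runs on its own, and the names `hoodie.avg`, `sweater.avg`, and `avg.difference` are made up for this illustration.

```r
# Monthly averages, computed from the counts in the dataset
hoodie.avg  <- mean(c(250, 224, 178, 160, 129))
sweater.avg <- mean(c(220, 250, 220, 200, 185))

# Subtract one saved value from the other, and save the result
avg.difference <- sweater.avg - hoodie.avg
avg.difference
```

A positive result here means the store sold more sweaters than hoodies in an average month.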

### Note on Variable Types

Not every variable is a number. For example, the `Month` variable is what is called a **string** (R labels it `<chr>`, short for "character"), because it contains letters and words, not just numbers. Therefore, taking the `mean()` of `berkeleyStore$Month` does not give us a number: R returns `NA` along with a warning.

In [26]:
mean(berkeleyStore$Month)

“argument is not numeric or logical: returning NA”
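One way to check a variable's type before calling `mean()` is sketched below, using base R's `class()` and `is.numeric()` functions. The values are typed in by hand so the block runs on its own, and the names `month.labels` and `hoodie.counts` are made up for this illustration. (Hand-typed numbers are reported as `"numeric"`; whole-number columns read in by `read.csv()` are typically reported as `"integer"`, which also counts as numeric.)

```r
# One text variable and one number variable, typed in by hand
month.labels  <- c("Jan", "Feb", "March", "April", "May")
hoodie.counts <- c(250, 224, 178, 160, 129)

# class() reports what kind of values a variable holds
class(month.labels)   # "character"
class(hoodie.counts)  # "numeric"

# is.numeric() answers TRUE or FALSE: can we do arithmetic with this?
is.numeric(month.labels)
is.numeric(hoodie.counts)
```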


## **Conclusion**

Over the course of this notebook, you were introduced to the basic types of objects in R, how to store them, and how to use them.

**Often, you will be watching lectures that use Notebooks to build your R skills as recordings. Most weeks, you will need to watch these in advance of Thursday's class, so that you can do the in-class and group assignments on Thursdays!**  (If you need to watch a recorded lecture prior to Tuesday's class, I will let you know in the respective week).

### **Getting help!**

If you had trouble with any content in this notebook, **Data Science Peer Consultants are here to help!** 

You can view their locations and availabilities at this link: https://cdss.berkeley.edu/dsus/advising/data-science-peer-advising-dspa. Peer Consultants are there to answer all data-related questions, whether about the content of this notebook, applications of data science in the world, or other data science courses offered at Berkeley -- make sure to take advantage of this wonderful resource!

As mentioned in the Syllabus, we also encourage you to make use of the **D-Lab consulting service**: https://dlab.berkeley.edu/frontdesk. They are open Monday through Friday, 9am - 4pm.

Finally, your GSIs are your most important resource!  **You can go to the office hours of any GSI, not just the one to whose section you are assigned.**  Please refer to the Syllabus and the bCourses page for more information!

## More about class

This was *Week 1, Notebook Lecture 1*. I have posted on bCourses the Notebook called *Week 1, Notebook Lecture 2* and a recording of the lecture using that Notebook. The material in the second Notebook and short recorded lecture gives you some more introduction to Jupyter Notebook and R.  

**Please watch the short recorded lecture before class on Thursday!** In class, you'll work on *Week 1, Activity Notebook 1*. In that notebook, you'll use what you've learned in this notebook to solve very similar problems. Because it's the first week, the notebook will be graded only for completion.

On almost every Thursday, you'll answer that week's *Activity Notebook 1* **twice**:

- You'll first have 30 minutes to answer the questions individually. The notebooks will be available at 8:10 AM (students with DSP accommodations can start early; we will contact you individually about this) and be due at 8:40 AM.
- You'll then have 20 minutes to answer the *same* questions as a group from 8:45 - 9:05 AM. (For some of those questions, you'll be able to see whether you're getting the answer right or wrong.)
- Finally, we'll regroup and go over the right answers as a class.

**If you need help with an in-class assignment:**

- Check out the R Cheat Sheet in bCourses.
- Find a GSI and ask for help!

## Last thing: Downloading Your Notebook and Submitting it to be graded

When you finish assignments, you'll download your notebook and then submit it by uploading it to Gradescope. There will be instructions at the end of the assignments on how to do so.

**You do NOT need to submit THIS notebook. These instructions apply to the in-class assignments and problem sets, NOT the lecture notebooks (like this one).**