# Module 1: Entering Data

In this module, we look at how to enter data into R, both directly and from a file. We work with three datasets: one on enrollment at SFU, one on weights of cereal boxes, and one on brain size and intelligence.

## Entering Data Directly: SFU Enrollment

We start with the SFU enrollment dataset. This dataset contains information about enrollment for each faculty at SFU in the 2015-2016 academic year. The two variables in this dataset are called "faculty" and "students". The "faculty" variable gives the name of each faculty, and "students" gives the number of students enrolled in that faculty (i.e., the student count). The "faculty" variable is categorical, while "students" is numeric. The individuals (also called units) are students, and each student is from one of the faculties.

### Making Variables

To enter data directly into R, we must first create variables. For simplicity's sake, we will create the two variables separately. Variable names cannot contain spaces, and we will always start them with a letter. Beyond that, a name may contain only letters, digits, periods and underscores. It is common in R to use a period in place of a space.

For the SFU enrollment data, we need the variables "faculty" and "students". We could name these variables "x1" and "IReallyHopeIPassThisCourse", but it is good practice to name your variables something meaningful. Let's call them "faculty.input" and "students.input". 

We create a variable using the "c()" function. The format of this function is "variable.name = c(obs1, obs2,...,obsN)", where "N" is the number of values we have for that variable.

Important tips:  <br /> 
$\bullet$ R is case-sensitive, so "Faculty.input" and "faculty.input" are distinct objects  <br /> 
$\bullet$ In the "c()" function, the observations are separated by commas <br /> 
$\bullet$ If the variable values are words instead of numbers, we must enclose each value in double quotation marks (e.g., c("red", "blue")) <br />
$\bullet$ The order of values in all of the variables must match
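
As a small illustration of the tips above, the following sketch creates two objects whose names differ only in capitalization. The names "colours.input" and "Colours.input" are made up for this example and are not used elsewhere in this module.

```r
# Words must be in quotation marks; values are separated by commas.
colours.input = c("red", "blue", "green")

# A different object: R is case-sensitive, so the capital C matters.
Colours.input = c(10, 20, 30)

print(colours.input)
print(Colours.input)
```

Running this should show that the two names refer to two different sets of values.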

We can now create these two variables. The R code for doing this is in the next cell. To run the R code, click on the next cell and press the "run" button, which looks like a play button (see the introduction document for more information).

In [1]:
faculty.input = c("ASci", "FASS", "Bus", "CAT", "Educ", "Env", "HSci", "Sci", "Other")
students.input = c(3557, 11516, 3791, 3008, 1425, 1027, 1412, 3824, 32)

Notice that if we don't match the order of values in these two variables, then we might end up with 1027 students in Business and 3791 students in Environment.
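
To see why order matters, here is a sketch that deliberately swaps two of the counts. The name "students.wrong" is hypothetical, and the square brackets (which pick out the value at a given position) are not otherwise covered in this module.

```r
faculty.input = c("ASci", "FASS", "Bus", "CAT", "Educ", "Env", "HSci", "Sci", "Other")
students.input = c(3557, 11516, 3791, 3008, 1425, 1027, 1412, 3824, 32)

# Same counts, but with the 3rd and 6th values swapped by mistake.
students.wrong = c(3557, 11516, 1027, 3008, 1425, 3791, 1412, 3824, 32)

# R pairs values by position only, so it cannot detect the mistake.
print(faculty.input[3])    # the third faculty
print(students.input[3])   # the count entered in the correct order
print(students.wrong[3])   # the count after the accidental swap
```

R gives no warning here, so it is up to us to check the order when entering data.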

Also, notice that the values of faculty are entered in quotation marks. This tells R that they are strings (words, rather than names of variables). Any time you want to enter strings into R, you have to put them in quotation marks.
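
The sketch below shows what is likely to happen if the quotation marks are left out: R looks for objects called ASci and FASS and, assuming no such objects exist, reports an error. The "try()" function is used here only so that the cell keeps running after the error; it is not needed elsewhere in this module, and the name "faculty.short" is made up for this example.

```r
# Without quotation marks, R treats ASci and FASS as names of objects.
# Assuming no objects with these names exist, this line produces an error.
result = try(c(ASci, FASS), silent = TRUE)
print(class(result))

# With quotation marks, the same values are read as strings.
faculty.short = c("ASci", "FASS")
print(faculty.short)
```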

### Making a Data Frame

The variables that we created in the last section, "faculty.input" and "students.input", are separate, unconnected objects. We now need to combine these variables into a single data set, connecting faculties to their student counts. 

In R, data sets are called data frames. A data frame is a single object that contains multiple variables. We make a data frame using the "data.frame()" function. 

Inside the brackets of "data.frame()", we set the name of each variable and tell R what we want that variable to equal. In this case, we want a variable called "faculty" that is equal to the list of values we called "faculty.input" and a variable called "students" that is equal to "students.input". 

Let's put this all together to make a data frame called "data.faculty" using the R code in the next cell. Remember, you run a cell by first clicking on it and then pressing the run button.

In [2]:
data.faculty = data.frame(faculty = faculty.input, students = students.input)
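
One way to confirm that the two variables were combined as intended is to count the rows and columns of the result. This sketch repeats the code from the cells above and then uses the "nrow()" and "ncol()" functions, which are not otherwise covered in this module.

```r
faculty.input = c("ASci", "FASS", "Bus", "CAT", "Educ", "Env", "HSci", "Sci", "Other")
students.input = c(3557, 11516, 3791, 3008, 1425, 1027, 1412, 3824, 32)
data.faculty = data.frame(faculty = faculty.input, students = students.input)

print(nrow(data.faculty))  # number of rows (one per faculty)
print(ncol(data.faculty))  # number of variables
```

We expect one row for each of the nine faculties and one column for each of the two variables.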

### Printing Variables and Data Frames

We've now made two separate objects containing lists of values and combined them into a data frame called "data.faculty". We can see exactly what information is contained in this data frame using the "print()" function. Just put the name of the object you want to print inside the brackets.

Let's use the "print()" function to examine both lists of values and our new data frame. We could include the three lines of R code all in one cell, but we instead use three separate cells so that it's clear what each line of code is printing. Run each of the following three cells to see how the "print()" function works.

In [3]:
print(faculty.input)

[1] "ASci"  "FASS"  "Bus"   "CAT"   "Educ"  "Env"   "HSci"  "Sci"   "Other"


In [4]:
print(students.input)

[1]  3557 11516  3791  3008  1425  1027  1412  3824    32


In [5]:
print(data.faculty)

  faculty students
1    ASci     3557
2    FASS    11516
3     Bus     3791
4     CAT     3008
5    Educ     1425
6     Env     1027
7    HSci     1412
8     Sci     3824
9   Other       32


Notice that the rows in our dataset are numbered. This is just for convenience, and won't actually affect our analysis.

### Printing Variables Inside Data Frames

Suppose that we want to print the values of a single variable from a data frame containing multiple variables. We do this using the "\$" operator.

The general format for using "\$" is "my.data.frame\$my.variable". This gives us the values of "my.variable" from inside the data frame called "my.data.frame". 

The next cell shows R code that extracts the "faculty" variable out of the data frame "data.faculty", and prints that variable using the "print()" function.

In [6]:
print(data.faculty$faculty)

[1] ASci  FASS  Bus   CAT   Educ  Env   HSci  Sci   Other
Levels: ASci Bus CAT Educ Env FASS HSci Other Sci


We could have used a similar approach to print the "students" variable by writing "students" after the "\$".
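
For instance, the sketch below rebuilds the data frame and prints the "students" variable. It also checks how the "faculty" variable is stored, using the "class()" function (not otherwise covered in this module). The "Levels:" line in the output above appears because "faculty" was stored as a factor. In R version 4.0.0 and later, "data.frame()" keeps strings as plain text by default, so your output may instead show the names in quotation marks with no "Levels:" line.

```r
faculty.input = c("ASci", "FASS", "Bus", "CAT", "Educ", "Env", "HSci", "Sci", "Other")
students.input = c(3557, 11516, 3791, 3008, 1425, 1027, 1412, 3824, 32)
data.faculty = data.frame(faculty = faculty.input, students = students.input)

# Extract and print the student counts.
print(data.faculty$students)

# "factor" in older versions of R, "character" in R 4.0.0 and later.
print(class(data.faculty$faculty))
```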

## Importing Data From a File: CFSB

The next dataset we will work with is for a cereal called Chocolate Frosted Sugar Bombs (CFSB for short). Boxes of CFSB are advertised as containing 20 oz. of cereal. In reality, each box differs slightly from this target. This dataset is a record of actual weights (in ounces) for 100 boxes of CFSB cereal. Weight is a numeric variable. The units are boxes of cereal.

We tell R to read a dataset from a file using the "read.csv()" function. This function reads a ".csv" file and automatically creates a data frame in R to store it. 

The "read.csv()" function has one main input called "file". The "file" input tells R the name of the file containing your dataset. Note that this file needs to be in the same folder as this Jupyter notebook. You can either navigate to the folder containing a notebook before uploading a dataset, or use the "Move" button (see the Introduction document).

The "read.csv" function has an optional input called "header", which tells R whether the first row of your dataset file contains variable names or just another observation. If "header" is equal to "TRUE" (in all capitals), then the first row of your dataset is read as variable names. If "header" is equal to "FALSE" (also in all capitals), then the first row of your dataset is read as just another observation. To be certain we are reading data from a csv file correctly, we must always check the first row of the dataset and use the appropriate setting for "header".

We use a comma to separate the two inputs for the "read.csv()" function. Whenever a function in R has multiple inputs, we always separate those inputs with commas.

The default value of "header" in the "read.csv()" function is TRUE. Therefore, if the "header" option is not included, then R reads the first row of the dataset as variable names.
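
The effect of "header" can be seen in the following sketch. It writes a tiny stand-in file to a temporary location (using "tempfile()" and "writeLines()", which are not otherwise covered in this module) and reads it three ways. The file contents and object names are for illustration only.

```r
# Write a small csv file: a variable name followed by two observations.
example.file = tempfile(fileext = ".csv")
writeLines(c("Weight", "20.44", "20.244"), example.file)

with.header = read.csv(file = example.file, header = TRUE)
no.header = read.csv(file = example.file, header = FALSE)
default.header = read.csv(file = example.file)  # "header" left out

print(with.header)     # first row used as the variable name
print(no.header)       # first row treated as just another observation
print(default.header)  # same result as header = TRUE
```

With "header = FALSE", the word Weight ends up as a data value and R chooses a variable name of its own.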

Let's now read the CFSB dataset into R using the read.csv function. Remember that we can use any name we want for our new data frame as long as it starts with a letter and has no spaces. For clarity, we will call our new data frame "data.cfsb".

In [7]:
data.cfsb = read.csv(file="CFSB.csv", header=TRUE)

Instead of setting "header=TRUE", we could have just omitted the optional argument "header" because the default value is "TRUE".

Let's print our new data frame using the "print()" function.

In [8]:
print(data.cfsb)

    Weight
1   20.440
2   20.244
3   20.549
4   20.755
5   20.740
6   20.847
7   20.063
8   20.453
9   20.719
10  20.283
11  20.362
12  20.162
13  20.131
14  20.304
15  20.345
16  20.076
17  20.386
18  20.419
19  20.527
20  20.427
21  20.435
22  20.426
23  20.769
24  20.483
25  20.463
26  20.397
27  20.894
28  20.673
29  20.975
30  20.369
31  20.832
32  20.178
33  20.608
34  20.680
35  20.884
36  20.483
37  20.395
38  20.635
39  20.424
40  20.652
41  20.211
42  20.331
43  20.196
44  20.427
45  20.494
46  20.506
47  20.435
48  20.939
49  20.152
50  20.353
51  19.984
52  20.790
53  20.244
54  20.369
55  20.652
56  20.593
57  20.675
58  20.619
59  20.226
60  20.277
61  20.639
62  20.565
63  20.312
64  20.452
65  20.526
66  20.612
67  20.528
68  20.318
69  20.877
70  20.597
71  20.514
72  20.666
73  20.672
74  20.373
75  20.315
76  20.722
77  20.260
78  20.188
79  20.642
80  20.628
81  20.941
82  20.789
83  20.761
84  20.523
85  20.500
86  20.591
87  20.495
88  20.289
89  20.145
90  20.666

It is a good idea to print a new dataset when you first read it into R to make sure that the variables were read in correctly. We should now have a dataset with a single variable called "Weight", and 100 observations for this variable.
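
A quick way to check the number of observations without counting printed rows is the "dim()" function, which is not otherwise covered in this module. The sketch below uses a small stand-in data frame holding the first three weights from the output above. Running "dim(data.cfsb)" in the same way should report the rows and columns of the full dataset.

```r
# Stand-in data frame: the first three weights from the printed output.
data.example = data.frame(Weight = c(20.440, 20.244, 20.549))

# dim() gives the number of rows first, then the number of columns.
print(dim(data.example))
```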

Especially with large datasets, it is often preferable to look at just the first few rows. By default, the "head()" function prints just the first six rows of a dataset.

In [9]:
head(data.cfsb)

Weight
20.44
20.244
20.549
20.755
20.74
20.847


The "head" function formats our data a little differently than the print function, but it makes examining large datasets much more manageable.
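
If six rows are not enough, "head()" has an optional input called "n" that sets how many rows to show. The sketch below uses a made-up data frame with 20 rows (not the CFSB data) so that it can be run on its own.

```r
# Stand-in data frame with 20 rows (made-up values).
data.example = data.frame(value = 1:20)

first.six = head(data.example)          # default: first 6 rows
first.ten = head(data.example, n = 10)  # ask for 10 rows instead

print(nrow(first.six))
print(nrow(first.ten))
```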

Important Note: We use ".csv" files instead of a more common Excel format (e.g., .xlsx) because they are the easiest to work with in R. The ".csv" file type is used for storing lists of data, and "csv" stands for "Comma-Separated Values". You can open ".csv" files in either Jupyter or Excel, and Excel files can be saved in the ".csv" format.

Advanced Topic: The file containing your dataset does not actually need to be in the same folder as the Jupyter notebook that uses the file. If it is in a different folder, you need to specify the file path as part of the "file" input to the "read.csv()" function. For example: file = "Module_1/Datasets/my_data.csv".
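
A sketch of reading from a different folder is given below. So that it can be run anywhere, it creates a folder and a small stand-in file inside R's temporary directory. The folder name, file name and file contents are for illustration only, and "tempdir()", "dir.create()", "file.path()" and "writeLines()" are not otherwise covered in this module.

```r
# Create a folder called Datasets inside R's temporary directory.
example.folder = file.path(tempdir(), "Datasets")
dir.create(example.folder, showWarnings = FALSE)

# Write a small stand-in csv file into that folder.
example.path = file.path(example.folder, "my_data.csv")
writeLines(c("Weight", "20.44", "20.244"), example.path)

# The "file" input can be a path to the file, not just a file name.
data.example = read.csv(file = example.path, header = TRUE)
print(data.example)
```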

## Importing Data With No Variable Names: Brain Size

Our next dataset comes from psychology, where the researchers want to identify whether the size of a person's brain is associated with their intelligence. The first variable in our dataset is the person's IQ. The second variable is the person's brain size, measured as the number of pixels it occupies on an MRI scan. These are both quantitative variables. The individuals in this study are people.

As in the last example, we import our new dataset to R using the "read.csv()" function. Unlike the CFSB dataset, the "Brain_Size.csv" file has no variable names. This means we need to set the "header" input to "FALSE" (in all capitals). We can then add variable names once we have read in the data.

Let's start by reading the dataset into a data frame in R and printing the first few observations to make sure it worked. We call this data frame "data.brsi" (for BRainSIze).

In [10]:
data.brsi = read.csv(file = "Brain_Size.csv", header = FALSE)
head(data.brsi)

V1,V2
133,816932
140,1001121
139,1038437
133,965353
137,951545
99,928799


The default variable names chosen by R, "V1" and "V2", are not very informative. We can change them using the "names()" function. Let's first run the "names" function on our data frame. Notice that it gives us back the current variable names.

In [11]:
names(data.brsi)

This can be helpful if you have a data frame with lots of variables and you want to check what they are all called. 

However, if we want to assign new variable names, the command we use looks like "names(my.data) = c("var1", "var2",...,"varN")". In this example, "my.data" is a data frame with N variables, and "var1" to "varN" are new names for these variables.

Let's give the variables in our "data.brsi" data frame more sensible names.

In [12]:
names(data.brsi) = c("IQ", "Size")

Order matters here. The first name in our list is assigned to the first column of our data frame, and so on for the other names. We therefore have to make sure that we match the order of names to the order of variables in the data frame.
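
The sketch below shows the renaming step on a small stand-in data frame, built from the first two rows of the brain size output above, so that it can be run on its own. The name "data.example" is made up for this example.

```r
# Stand-in for data.brsi: two rows, with R's default variable names.
data.example = data.frame(V1 = c(133, 140), V2 = c(816932, 1001121))

print(names(data.example))  # current names

# The first new name goes to the first column, the second to the second.
names(data.example) = c("IQ", "Size")

print(names(data.example))  # new names
print(data.example)
```

Writing c("Size", "IQ") instead would run without any error, but the IQ scores would then be labelled as brain sizes.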

Let's print out the first few observations to make sure that our variables now have the right names.

In [13]:
head(data.brsi)

IQ,Size
133,816932
140,1001121
139,1038437
133,965353
137,951545
99,928799
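
As an alternative sketch, the variable names can be supplied while the file is being read, using the "col.names" input that "read.csv()" passes along to the underlying "read.table()" function. This is not the approach used in this module. The file below is a temporary stand-in containing the first two rows shown above, and the object names are made up for this example.

```r
# Temporary stand-in file with no header row.
example.file = tempfile(fileext = ".csv")
writeLines(c("133,816932", "140,1001121"), example.file)

# Name the variables while reading, instead of calling names() afterwards.
data.example = read.csv(file = example.file, header = FALSE,
                        col.names = c("IQ", "Size"))
print(data.example)
```

As with "names()", the order of the names must match the order of the columns in the file.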
