# Lesson 4: Data Frames

Today: Working with data frames
+ The `dplyr` package
+ Adding new columns (a second way)
+ Sorting rows
+ Filtering rows
+ Grouping and aggregating data

## Working with columns and rows of data frames

Yesterday, you were introduced to the R package `datasets`, which contains built-in data that we can explore right away without importing anything from an outside source.  Today, we will work with another important R package, `dplyr`.  Unlike `datasets`, which contains data, `dplyr` contains tools that allow us to explore data.

To use `dplyr` in a Jupyter notebook, you first have to load it with the `library()` function.  (Unlike the `datasets` package, it is not loaded automatically.)
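
A minimal sketch of the loading step, assuming `dplyr` is already installed in your environment (whether it is depends on your setup). The `search()` check is only there to confirm that the package is attached.

```r
# Load dplyr for the current session.
# This has to be run again whenever the notebook kernel is restarted.
library(dplyr)

# Confirm that dplyr is now attached to the session.
"package:dplyr" %in% search()
```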

There are several important functions that come with the `dplyr` package:
+ `mutate()`
+ `arrange()`
+ `filter()`
+ `group_by()`
+ `summarize()`

### 1. The `mutate()` function
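
As a rough illustration of what `mutate()` does, here is a sketch on a small made-up data frame. The data frame `quiz` and its columns are hypothetical and are not part of the Berkeley dataset. It adds the same ratio column twice, once with base R and once with `mutate()`.

```r
library(dplyr)

# Made-up data for illustration only.
quiz <- data.frame(
  student   = c("A", "B", "C"),
  correct   = c(8, 5, 9),
  attempted = c(10, 10, 12)
)

# First way: base R, assigning to a new column with $.
quiz$rate_base <- quiz$correct / quiz$attempted

# Second way: mutate() returns a new data frame with the extra column,
# so the result has to be assigned back to keep it.
quiz <- mutate(quiz, rate_mutate = correct / attempted)

quiz
```

Inside `mutate()`, column names are written without the `quiz$` prefix because the data frame is already given as the first argument.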


**Example**

(We will use the UC Berkeley 1973 Graduate Admission dataset again.)

In [None]:
berkeleydata <- read.csv('../../shared/datasets/berkeley73.csv')

In [None]:
# Two ways to add a new column to a data frame

# First way ( without using mutate() )

# task: adding a new column called Men_AdmissionRate into the berkeleydata dataframe
#  this column is the Men_Admitted column divided by the Men_Applicants column




In [None]:
# A second way to add a new column: using the mutate() function

# task: adding a new column called Women_AdmissionRate into the berkeleydata dataframe
#  this column is the Women_Admitted column divided by the Women_Applicants column





In [None]:
# Check




### 2. The `arrange()` function

`arrange( DATAFRAMENAME, COLUMNNAME)` is used to **sort** the rows of a data frame based on the values in a particular column, in ascending order by default.
+ inputs: a data frame and a column name in that data frame
+ output: a data frame that is sorted based on the values in the specified column
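
The sketch below shows `arrange()` on a small made-up data frame. The data frame `pets` and its columns are hypothetical and are not the NYC dataset. It also uses `desc()`, the `dplyr` helper for descending order, and passes more than one column to break ties.

```r
library(dplyr)

# Made-up data for illustration only.
pets <- data.frame(
  name       = c("MAX", "BELLA", "MAX", "COCO"),
  birth_year = c(2012, 2015, 2009, 2011)
)

# Ascending order (A to Z) is the default.
arrange(pets, name)

# Wrapping the column in desc() reverses the order (Z to A).
arrange(pets, desc(name))

# Several columns: sort by name first, then break ties by
# birth_year from most recent to oldest.
sorted <- arrange(pets, name, desc(birth_year))
sorted
```

Like `mutate()`, `arrange()` does not change the original data frame; the sorted result has to be assigned to a name if you want to keep it.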

**2-minute Group Exercise**

Import the NYC Dog Licenses dataset in the `shared/datasets` directory as an R data frame called `nyc_dogs`.  Then,
+ Use `arrange()` to sort the entries by `AnimalName` column, from A to Z
+ Sort by `AnimalName` from Z to A
+ Pick a column (your choice) and sort it either in descending or ascending order
+ Sort the entries by the `AnimalName` column from A to Z, then by the `AnimalBirthMonth` column (which, despite its name, is actually the birth year) in descending order

In [None]:
nyc_dogs <- read.csv('../../shared/datasets/NYC_Dog_Licensing_small.csv' )




### 3.  The `filter()` function
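
`filter()` keeps only the rows for which a condition is `TRUE`. Here is a minimal sketch on a made-up data frame. The data frame `pets` and the column names `sex` and `birth_year` are hypothetical, and the columns in `nyc_dogs` may be named differently.

```r
library(dplyr)

# Made-up data for illustration only.
pets <- data.frame(
  name       = c("MAX", "BELLA", "ROCKY", "COCO"),
  sex        = c("M", "F", "M", "F"),
  birth_year = c(2012, 2015, 2009, 2011)
)

# Keep the rows where a single condition is TRUE.
females <- filter(pets, sex == "F")

# Conditions separated by commas must all be TRUE (logical AND).
young_males <- filter(pets, sex == "M", birth_year >= 2010)

# nrow() counts the rows that passed the filter.
nrow(young_males)
```

Note the double equals sign `==` for comparison; a single `=` inside `filter()` produces an error.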


**Example**

Filter the `nyc_dogs` data frame to consist only of dogs whose breed is 'Poodle'

**Exploration**
1. Filter the `nyc_dogs` data frame to consist only of rows that correspond to dogs
    + who are male
    + who were born in 2010 or after <br><br>

2. How many rows in this dataset correspond to dogs who are male and were born in 2010 or after?

3. How many rows in this dataset correspond to dogs named 'OTIS'?  Do all these rows correspond to different dogs, or do some of them describe the same dog?  How about 'LUCY'?

4. What is the percentage of rows that correspond to female dogs?

### 4. `group_by()` and `summarize()`

The two functions above are usually used together when we want to get summary information about groups of rows.

**Example**

Suppose that we would like to
1. group the rows based on the dog's gender and then
2. count how many rows there are for each group.

**Method 1**: We could do that using the `filter()` function above and the `dim()` function.
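
A sketch of Method 1 on made-up data (the data frame `pets` and its `sex` column are hypothetical): filter once per group, then read the row count from `dim()`.

```r
library(dplyr)

# Made-up data for illustration only.
pets <- data.frame(
  name = c("MAX", "BELLA", "ROCKY", "COCO", "LUNA"),
  sex  = c("M", "F", "M", "F", "F")
)

# dim() returns c(number of rows, number of columns),
# so the first entry is the row count for that group.
n_male   <- dim(filter(pets, sex == "M"))[1]
n_female <- dim(filter(pets, sex == "F"))[1]

n_male
n_female
```

This works, but it needs one `filter()` call per group, which becomes tedious when a column has many distinct values.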

**Method 2**:  Here is a second way we can do it:
+ First group the rows by gender using `group_by()`
+ Count how many rows there are for each group using `summarize()`

`group_by( DATAFRAMENAME, COLUMNNAME)`
+ inputs: a data frame, along with the name of a categorical variable/column
+ output: a grouped data frame. It looks exactly the same as the input data frame, but internally R has divided the rows into groups based on the given categorical variable.

`summarize( DATAFRAMENAME, NEWSUMMARYCOLUMNNAME = AN AGGREGATE FUNCTION )`
+ inputs: a grouped data frame, along with what information we want summarized about each particular group
+ output: a summary data frame
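
A sketch of Method 2 on made-up data (the data frame `pets` and its columns are hypothetical). It uses `n()`, the `dplyr` function that counts the rows in the current group, and `mean()` as a second example of an aggregate function.

```r
library(dplyr)

# Made-up data for illustration only.
pets <- data.frame(
  name       = c("MAX", "BELLA", "ROCKY", "COCO", "LUNA"),
  sex        = c("M", "F", "M", "F", "F"),
  birth_year = c(2012, 2015, 2009, 2011, 2015)
)

# Step 1: group the rows by a categorical column.
by_sex <- group_by(pets, sex)

# Step 2: produce one summary row per group.
# n() counts the rows in each group.
counts <- summarize(by_sex, count = n())
counts

# Any aggregate function can be used, for example mean().
averages <- summarize(by_sex, mean_birth_year = mean(birth_year))
averages
```

The summary data frame has one row per group and one column per summary that was requested, in addition to the grouping column itself.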

**Exploration**

1. Group the rows based on the dog's birth year and count how many rows there are for each group.
2. Group the rows based on the dog's zipcode and count how many rows there are for each group.
3. Group the rows based on both the dog's zipcode and gender; count how many rows there are for each group.
4. Suppose we assume that each dog appears only once in this dataset, and that each dog is still alive and well today.  Find the average age of the dogs within each zipcode.