# Lesson 3: Exploring Data

Today:
1. Data frames
    + Useful functions: `dim()`, `names()`, `head()`
    + Getting datasets: (1) built-in datasets; (2) datasets in your computer or from the internet using `read.csv()`
    + Accessing columns of a data frame
    + Accessing entries of a data frame
2. Working with data frames
    +  Arithmetic with lists and columns
    +  More useful functions: `sum()`, `length()`, `max()`, `min()`
    +  Finding average of a column using `sum()` and `length()`; using `mean()`
    +  Finding proportions/percentages

## 1. Data Frames

Data frames are what R calls **tables of data** (similar to Excel spreadsheets).

### 1.1. Built-In Datasets

R comes with some datasets that are ready for us to explore.  They come in an R package called `datasets`.  



At this point, this `datasets` package is a bit of a mystery.  What is in it?  

To find out more about any R package, you can type 
    
    ? packagename
and run the cell.  Try it below.

You should see a message saying that you can find the list of all datasets by typing and running 
    
    library(help = 'datasets')

Do it in the following code cell.

You will see that one such built-in dataset is the `faithful` dataset.
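
The same listing can also be obtained as data rather than as a help page. The sketch below is an addition to the lesson, not part of it: it assumes base R's `data()` function, whose result keeps the dataset names in the `Item` column of its `results` matrix. The variable names `listing` and `dataset_names` are chosen here for illustration.

```r
# List the datasets shipped in the built-in `datasets` package
listing <- data(package = "datasets")

# `results` is a character matrix; its "Item" column holds the dataset names
dataset_names <- listing$results[, "Item"]

# Check that `faithful` is among them
"faithful" %in% dataset_names
```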

### 1.2. Useful functions for working with data frames

As with lists, there are some basic functions that are useful for working with data frames.  These are:
+ `dim( DATAFRAMENAME )`: to find the number of rows and columns in a data frame
+ `names( DATAFRAMENAME )`: to find the column names of a data frame
+ `head( DATAFRAMENAME )`: to preview the first few rows (six by default) of a data frame.  
    
    If you want to see the first $n$ rows (where $n$ is any number of your choice), you can run `head(DATAFRAMENAME, n)`.

Try these functions in the code cells below.
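
As one possible way to try these out, the sketch below applies each function to the built-in `faithful` data frame. Any other data frame would work the same way.

```r
# Number of rows and number of columns, in that order
dim(faithful)

# Column names
names(faithful)

# First six rows (the default)
head(faithful)

# First 10 rows
head(faithful, 10)
```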

### 1.3. Importing a file from a directory in your computer or from the internet



**Exercise**

In the `/shared/datasets` folder, there is a file called `NYC_Dog_Licensing_small.csv`.

In the code cell below, please import that dataset into an R data frame named `nyc_dogs`.

    nyc_dogs <-  

The above dataset comes from NYC Open Data, a site where the NYC government makes public data available.

[The following is the description for the above dataset](https://data.cityofnewyork.us/Health/NYC-Dog-Licensing-Dataset/nu7n-tubp): "All dog owners residing in NYC are required by law to license their dogs. The data is sourced from the DOHMH Dog Licensing System (https://a816-healthpsi.nyc.gov/DogLicense), where owners can apply for and renew dog licenses. Each record represents a unique dog license that was active during the year, but not necessarily a unique record per dog, since a license that is renewed during the year results in a separate record of an active license period. Each record stands as a unique license period for the dog over the course of the yearlong time frame."

The original [NYC Dog Licensing dataset](https://data.cityofnewyork.us/Health/NYC-Dog-Licensing-Dataset/nu7n-tubp) is very large (more than 300,000 rows of data).  The above smaller dataset contains 1000 randomly chosen rows from the original dataset.

We can also import a csv file directly from the internet using 
    
    read.csv(  'LINK TO THE CSV FILE' )
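
To see `read.csv()` at work without depending on a particular file or link, here is a minimal sketch that first writes a tiny csv file to a temporary location and then reads it back. The file contents, the column names `name` and `age`, and the variable names `csv_path` and `pets` are all made up for illustration. A link to a csv file would go in the same place as `csv_path`.

```r
# Write a tiny csv file to a temporary location so the example is self-contained.
# The column names and values are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Rex,3", "Bella,5"), csv_path)

# read.csv() takes the location of the file (a path or a link) as a string
pets <- read.csv(csv_path)

dim(pets)    # 2 rows, 2 columns
names(pets)  # "name" "age"
```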

**Example**<br>
We will import a dataset from the NY State government open data site.  This page contains a description of a dataset listing criminal justice agencies: https://data.ny.gov/Public-Safety/Directory-of-Criminal-Justice-Agencies/gugp-n5ip

+ Click on the above link
+ Click on the `Export` button on that page
+ Right click on the `CSV` button
+ Choose "copy link address"

Read the csv file linked from that address and store it as an R data frame called `criminal_justice_agencies`.

    criminal_justice_agencies <-   



### 1.4. Accessing columns of a data frame

Each column of a data frame is simply a list!  

Given a data frame, there are two easy ways to obtain a list containing just one of its columns.  
1. `DATAFRAMENAME$COLUMNNAME`: Gives you a list containing all entries in the column `COLUMNNAME` of the data frame `DATAFRAMENAME`
2. `DATAFRAMENAME[, COLNUM ]`: Gives you a list containing all entries in column number `COLNUM` of the data frame `DATAFRAMENAME`
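
A short sketch of both ways, using the built-in `faithful` data frame, whose first column is `eruptions`. The variable names on the left are chosen here for illustration.

```r
# Way 1: by column name
eruptions_by_name <- faithful$eruptions

# Way 2: by column number (eruptions is column 1)
eruptions_by_number <- faithful[, 1]

# Both give the same list of numbers
identical(eruptions_by_name, eruptions_by_number)

# Preview the first few entries
head(eruptions_by_name)
```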

### 1.5. Accessing an entry in a data frame

There are also two ways to access an entry in a data frame:
1. `DATAFRAMENAME$COLUMNNAME[ ROWNUM ]`: Gives you the entry in row number `ROWNUM` of the column called `COLUMNNAME`
2. `DATAFRAMENAME[ ROWNUM, COLNUM ]`: Gives you the entry in row number `ROWNUM` and column number `COLNUM`
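
For instance, both ways can be tried on the built-in `faithful` data frame, where `waiting` is the second column. This is only a sketch on a different dataset; the same pattern applies to any data frame.

```r
# Way 1: row 3 of the column named `waiting`
faithful$waiting[3]

# Way 2: row 3, column 2 (waiting is the second column)
faithful[3, 2]
```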

#### Example

1. Find the name of the dog in the 327th row in the dataset `nyc_dogs`
2. Find the zipcode of the dog in the 596th row in the dataset

## 2. Working with data frames

### 2.1. Arithmetic with lists and columns

Recall the data frame `berkeleydata` from before.  Suppose that we would like to compute the admission rates of men and women in each of the departments.
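
Since `berkeleydata` is not reproduced here, the sketch below uses two short lists of made-up numbers to show the idea: when you divide one list by another of the same length, R divides entry by entry. The names `admitted`, `applicants`, and `rates` are hypothetical and are not columns of `berkeleydata`.

```r
# Made-up numbers for three hypothetical departments
admitted   <- c(80, 20, 50)
applicants <- c(100, 50, 200)

# Division is done entry by entry: 80/100, 20/50, 50/200
rates <- admitted / applicants
rates

# Multiplying by a single number applies to every entry (here: percentages)
rates * 100
```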

### 2.2. Adding a new column to a data frame

    DATAFRAMENAME$NEWCOLUMNNAME <- list containing entries for the new column
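
As a small sketch of this pattern on a dataset that is available everywhere, the code below works on a copy of the built-in `faithful` data frame and adds a column holding the waiting time in hours instead of minutes. The names `geyser` and `waiting_hours` are made up for this illustration.

```r
# Work on a copy so the built-in dataset stays unchanged
geyser <- faithful

# New column: waiting time converted from minutes to hours
geyser$waiting_hours <- geyser$waiting / 60

# The data frame now has three columns
names(geyser)
head(geyser)
```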

**Example**

Suppose that we would like to create a new column called `Women_AdmissionRate` in the `berkeleydata` dataframe.

**Exercise**<br>
+ Create a new column called `Men_AdmissionRate`, which consists of the admission rate of men into each department
+ Create a new column called `Total_Admitted`, which consists of the total number of men and women admitted to each department.
+ Create a new column called `Total_Applicants`, which consists of the total number of men and women who applied to each department.
+ Create a new column called `Overall_Admission_Rate`, which consists of the overall admission rate (men and women combined) for each department.

### 2.3. Finding sums and averages of a column

Recall that the function `sum( LISTNAME )` takes the sum of the numbers in the list `LISTNAME`.

Since a column in a data frame is simply a list, we can use `sum()` for a column of a data frame.
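
For example, here is a sketch using the built-in `faithful` data frame rather than `berkeleydata`. The column can be picked out by name or by number before it is passed to `sum()`.

```r
# Total of the `waiting` column, selected by name
total_by_name <- sum(faithful$waiting)

# Same total, selecting the column by number (waiting is column 2)
total_by_number <- sum(faithful[, 2])

total_by_name
total_by_number
```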

**Example**<br>
Compute the total number of admitted women across the six departments.

We can also do arithmetic operations on columns of a data frame.

**Example**<br>
Compute the average number of admitted women across the six departments.
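
The same kind of computation can be sketched on the built-in `faithful` data frame: an average is the sum of the entries divided by how many entries there are, and `mean()` gives the same answer in one step. The variable names below are chosen for illustration.

```r
waiting <- faithful$waiting

# Average "by hand": total divided by the number of entries
average_by_hand <- sum(waiting) / length(waiting)

# Average using the built-in function
average_builtin <- mean(waiting)

# The two agree
all.equal(average_by_hand, average_builtin)
```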