# From Data to Discovery (NYU CORE UA 111)
# R Reference Sheet

## Basic Functions
+ `print( NAME )`: to print the string or value stored in `NAME`
+ `sqrt( NAME )`: to take the square root of the value stored in `NAME`
+ `abs( NAME )`: to take the absolute value of the value stored in `NAME`
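
As a quick illustration of the functions above, here is a minimal sketch; the name `x` and its value are made up for the example.

```r
# Hypothetical value stored under the name x
x <- -16

print(x)               # prints the stored value
abs(x)                 # absolute value: 16
sqrt(abs(x))           # square root of 16: 4
```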


## Basic Functions on Lists
+ `c( VALUE1, VALUE2, ... )`: to put values into one list
+ `NUM1:NUM2`: to make a list of consecutive whole numbers from `NUM1` to `NUM2`
+ `length( LISTNAME )`: to find the number of entries in the list `LISTNAME`
+ `max( LISTNAME )`: to find the largest value in the list `LISTNAME`
+ `min( LISTNAME )`: to find the smallest value in the list `LISTNAME`
+ `sum( LISTNAME )`: to find the sum of values in the list `LISTNAME`
+ `mean( LISTNAME )`: to find the average of values in the list `LISTNAME`
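
The list functions above can be tried on a small example. This is only a sketch; the names `scores` and `one_to_five` and the numbers are invented for illustration.

```r
# Hypothetical list of six values
scores <- c(4, 8, 15, 16, 23, 42)

# Whole numbers from 1 to 5
one_to_five <- 1:5

length(scores)   # 6 entries
max(scores)      # 42
min(scores)      # 4
sum(scores)      # 108
mean(scores)     # 18
```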

## Basic Functions on Data Frames
+ `dim( DATAFRAMENAME )`: to find the number of rows and columns of the data frame `DATAFRAMENAME`
+ `names( DATAFRAMENAME )`: to find the column names of the data frame `DATAFRAMENAME`
+ `head( DATAFRAMENAME )`: to preview the first few rows of the data frame `DATAFRAMENAME` (the default is 6 rows)
+ `head( DATAFRAMENAME, NUMBER)`: to preview the first `NUMBER` rows of the data frame `DATAFRAMENAME`
+ `data.frame( COLUMNNAME1 = LIST1, COLUMNNAME2 = LIST2, ...)`: to create a new data frame with columns `COLUMNNAME1`, `COLUMNNAME2` (and possibly more or fewer columns), where the entries in `COLUMNNAME1` are specified in `LIST1`, the entries in `COLUMNNAME2` are specified in `LIST2`, etc.
+ `data.frame( COLUMNNAME1 = double( NUMBER ), COLUMNNAME2 = double( NUMBER ), ... )`: to create a placeholder data frame containing columns called `COLUMNNAME1`, `COLUMNNAME2` (and possibly more or fewer columns), with `NUMBER` rows.  Every cell starts out as 0, ready to be filled in later.  We use `double` to indicate that the cells hold numbers of type `double` (numbers that are not necessarily whole numbers).
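
A minimal sketch of the data frame functions above, using a made-up data frame called `students` (the column names and values are assumptions for illustration):

```r
# Build a small data frame from two lists of equal length
students <- data.frame(
  name  = c("Ana", "Ben", "Cy"),
  score = c(90, 85, 77)
)

dim(students)       # 3 rows, 2 columns
names(students)     # "name" "score"
head(students, 2)   # preview the first 2 rows

# Placeholder data frame with 4 rows; every cell starts as 0
blank <- data.frame(x = double(4), y = double(4))
dim(blank)          # 4 rows, 2 columns
```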

## Importing Datasets
+ `read.csv( 'FILENAME.csv')`: to "read" a csv file and import it as a data frame in R
+ `read.csv( url('http://webaddress.etc') )`: to "read" a csv file from a URL (web address) and import it as a data frame in R
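
To try `read.csv()` without needing a particular file, the sketch below first writes a tiny csv file to a temporary location and then reads it back. The file contents and the names `csv_path` and `cities` are invented for the example.

```r
# Create a small csv file in a temporary location (hypothetical contents)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("city,population",
    "Springfield,30000",
    "Shelbyville,25000"),
  csv_path
)

# Import the file as a data frame
cities <- read.csv(csv_path)

dim(cities)     # 2 rows, 2 columns
names(cities)   # "city" "population"
```

The first line of the file supplies the column names, and each later line becomes one row.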

## `dplyr` Functions (for working with data frames)
+ `arrange( DATAFRAMENAME, COLUMNNAME )`: to sort rows by values in the column called `COLUMNNAME`, in ascending order
+ `arrange( DATAFRAMENAME , desc( COLUMNNAME ) )`: to sort rows by values in the column called `COLUMNNAME`, in descending order
+ `filter( DATAFRAMENAME, CRITERIA )`: to produce a new data frame that contains only the rows of the data frame `DATAFRAMENAME` that satisfy the criteria specified in `CRITERIA`
+ `group_by( DATAFRAMENAME, COLUMNNAME )`: to group rows of `DATAFRAMENAME` by their values in the column `COLUMNNAME`
+ `summarize( GROUPEDDATAFRAMENAME, NEWCOLUMN = FORMULA )`: to compute a summary quantity from the grouped data `GROUPEDDATAFRAMENAME` (usually the output of `group_by()`), where the summary quantity is computed by `FORMULA` and stored in a new column called `NEWCOLUMN` (the name is up to you).
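
The `dplyr` functions above can be combined as in the following sketch. It assumes the `dplyr` package is installed; the data frame `pets` and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical data: one row per pet
pets <- data.frame(
  species = c("cat", "dog", "cat", "dog", "bird"),
  weight  = c(4, 20, 6, 30, 1)
)

# Sort rows from heaviest to lightest
heavy_first <- arrange(pets, desc(weight))

# Keep only the rows with weight above 5
over_five <- filter(pets, weight > 5)

# Group by species, then compute the average weight in each group
by_species <- group_by(pets, species)
avg_weight <- summarize(by_species, mean_weight = mean(weight))
avg_weight   # one row per species: bird 1, cat 5, dog 25
```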

## `ggplot2` Functions (for visualizing data)
+ `ggplot( DATAFRAMENAME, aes( x = COLUMNNAME1, y = COLUMNNAME2 ) ) + geom_point()`: to create a scatterplot with data in `DATAFRAMENAME`, with `COLUMNNAME1` on the x-axis and `COLUMNNAME2` on the y-axis
+ `ggplot( DATAFRAMENAME, aes( x = COLUMNNAME1, y = COLUMNNAME2 ) ) + geom_col()`: to create a bar chart with data in `DATAFRAMENAME`, with `COLUMNNAME1` on the x-axis and `COLUMNNAME2` on the y-axis
+ `ggplot( DATAFRAMENAME, aes( x = COLUMNNAME ) ) + geom_bar()`: to create a bar chart with data in `DATAFRAMENAME`, with `COLUMNNAME` on the x-axis and the number of observations on the y-axis
+ `ggplot( DATAFRAMENAME, aes( x = COLUMNNAME ) ) + geom_histogram()`: to create a histogram of the data in column `COLUMNNAME` in the data frame `DATAFRAMENAME`; the default number of bins is 30

    `ggplot( DATAFRAMENAME, aes( x = COLUMNNAME ) ) + geom_histogram( bins = NUMBER )`: to create a histogram of the data in column `COLUMNNAME` in the data frame `DATAFRAMENAME`, with the number of bins equal to `NUMBER`

    `ggplot( DATAFRAMENAME, aes( x = COLUMNNAME ) ) + geom_histogram( breaks = LIST )`: to create a histogram of the data in column `COLUMNNAME` in the data frame `DATAFRAMENAME`, where the bin boundaries are specified by `LIST`
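
The plotting templates above might be used as in this sketch. It assumes the `ggplot2` package is installed; the data frame `plants` and its columns are invented for the example.

```r
library(ggplot2)

# Hypothetical data: one row per plant
plants <- data.frame(
  height = c(10, 12, 15, 15, 18, 21, 24, 30),
  width  = c(3, 4, 4, 5, 6, 6, 8, 9),
  kind   = c("fern", "fern", "moss", "fern", "moss", "moss", "fern", "moss")
)

# Scatterplot of width (x-axis) against height (y-axis)
scatter <- ggplot(plants, aes(x = width, y = height)) + geom_point()

# Bar chart counting how many plants there are of each kind
counts <- ggplot(plants, aes(x = kind)) + geom_bar()

# Histograms of height: one with 4 bins, one with explicit bin boundaries
hist_bins   <- ggplot(plants, aes(x = height)) + geom_histogram(bins = 4)
hist_breaks <- ggplot(plants, aes(x = height)) + geom_histogram(breaks = c(10, 20, 30))
```

Storing a plot under a name does not draw it; typing the name on its own line (for example `scatter`) displays the plot.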

## Defining New Functions
    NEWFUNCTIONNAME <- function( INPUT1, INPUT2, ... ){        
        ( ... describe task here ... )        
        OUTPUT
    }
+ `NEWFUNCTIONNAME` is the name of this new function.
+ `INPUT1`, `INPUT2`, etc.: the inputs to this new function.
+ `OUTPUT`: output value to be returned (if any)
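
As one possible instance of the template above, the sketch below defines a function that computes the area of a rectangle. The function name `rectangle_area` and its inputs are hypothetical.

```r
# Hypothetical function with two inputs
rectangle_area <- function(width, height) {
  area <- width * height   # the task: multiply the two inputs
  area                     # the output value that is returned
}

rectangle_area(3, 4)   # 12
```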

## Conditionals
    if( CONDITION ){
        ( ... task to be completed if CONDITION is TRUE ... )
    } else{
        ( ... task to be completed if CONDITION is FALSE ... )
    }
+ `CONDITION` takes a boolean value (either `TRUE` or `FALSE`)
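
A minimal sketch of a conditional, with a made-up variable `temperature`. Note that `else` is written on the same line as the closing `}` of the `if` block; when the code is run line by line, R otherwise treats the `if` statement as finished before it reaches `else`.

```r
temperature <- 35   # hypothetical value

if (temperature > 30) {
  advice <- "stay in the shade"   # runs because the condition is TRUE
} else {
  advice <- "enjoy the sun"       # would run if the condition were FALSE
}

print(advice)
```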

## While Loop
    while( CONDITION ){
        ( ... task to be completed as long as CONDITION is TRUE ... )    
    }
+ `CONDITION` takes a boolean value (either `TRUE` or `FALSE`)
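
For illustration, the following sketch uses a while loop to add up the whole numbers from 1 to 5. The names `total` and `n` are made up for the example.

```r
total <- 0
n <- 1

# Keep going as long as n is at most 5
while (n <= 5) {
  total <- total + n   # add the current number to the running total
  n <- n + 1           # move on to the next number (without this, the loop never ends)
}

print(total)   # 15
```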

## Sampling and Statistics
+ `sample( LIST_OFPOSSIBLEOUTCOMES, NUMBER, replace = ..., prob = LIST_OFPROBABILITIES )`: to take `NUMBER` samples from `LIST_OFPOSSIBLEOUTCOMES`, where the probability of each outcome in this list is given by the corresponding entry of `LIST_OFPROBABILITIES`.  If the sampling is with replacement, then `replace = TRUE`; if the sampling is without replacement, then `replace = FALSE`.
+ `quantile( LISTNAME, P , type = 1 )`: to find the percentile specified by `P` among the numbers in the list `LISTNAME`.  (Here, `P` is a number between 0 and 1.)
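
A short sketch of sampling and percentiles follows. The coin probabilities, the seed, and the list `values` are assumptions chosen for illustration; `set.seed()` is not part of the sheet above and is used here only so that the random draws can be repeated.

```r
set.seed(1)   # assumption: fixes the random draws so the example is repeatable

# 10 flips of an unfair coin: "H" with probability 0.7, "T" with probability 0.3
flips <- sample(c("H", "T"), 10, replace = TRUE, prob = c(0.7, 0.3))
length(flips)   # 10

# 3 different faces of a die, drawn without replacement
# (when prob is left out, all outcomes are equally likely)
faces <- sample(1:6, 3, replace = FALSE)

# The 50th percentile of five values
values <- c(10, 20, 30, 40, 50)
quantile(values, 0.5, type = 1)   # 30
```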

## Miscellaneous
+ `knn( TRAINING_DATA, TEST_DATA, LABELS, K )` (from the `class` package): to predict the label of each row of the `TEST_DATA` data frame; the `TRAINING_DATA` data frame contains the same number of (feature) columns as `TEST_DATA`, and `LABELS` holds the known label of each row of `TRAINING_DATA`, in the same order; `K` is a positive integer for the number of nearest neighbors to be considered.
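
The sketch below shows one way `knn()` might be called. It assumes the function comes from the `class` package, and the training points, labels, and test points are all invented for the example.

```r
library(class)

# Hypothetical training data: two clusters of points with known labels
training <- data.frame(
  x = c(1, 1, 2, 8, 9, 9),
  y = c(1, 2, 1, 8, 8, 9)
)
training_labels <- c("small", "small", "small", "large", "large", "large")

# Two new points whose labels we want to predict
testing <- data.frame(
  x = c(1.5, 8.5),
  y = c(1.5, 8.5)
)

# Predict each test row's label from its 3 nearest training rows
predicted <- knn(training, testing, training_labels, 3)
predicted   # small, large
```

Each prediction is the most common label among the `K` nearest training rows, so the first test point, which sits among the "small" points, is labeled "small".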