# Essentials of Microsoft R Server: A self-guided tutorial

Welcome! This self-guided tutorial is intended to help you learn the 
essentials of Microsoft R Server (MRS): its strengths, when to use it, and how to 
use its core functions. 


# Pre-requisites

This tutorial uses the [Jupyter Notebook platform](https://jupyter.org/), which may not be familiar to everyone. If you are new to this environment, take a few moments to work through an introduction, such as this [excellent tutorial presented at useR! 2016](https://github.com/michhar/useR2016-tutorial-jupyter).

To use Jupyter and R together, the [appropriate R kernel](https://github.com/IRkernel/IRkernel) must be installed on the machine running Jupyter.

The version of R that the kernel points to must also be [Microsoft R (Server or Client)](https://msdn.microsoft.com/en-us/microsoft-r/index).

One easy way to get all of these components is to use the [Data Science Virtual Machine on Azure](https://azure.microsoft.com/en-us/marketplace/partners/microsoft-ads/standard-data-science-vm/).

## Essential Tips

A very brief summary of the critical components and commands:

1. Press `Ctrl+Enter` to run (or render) the current cell.
2. Output prints in the notebook, directly below the cell. You may have to scroll up to see it all.
3. Get help for any function by typing a question mark and then its name into
   a code cell: `?rxLinMod`. This splits the window and brings up the documentation for 
   that function below.
4. Get help for *all* the MRS functions with: `help("RevoScaleR")`
5. Files will appear in the specified directory. You can find them by selecting File in the menu bar and selecting "Open...". This will open a new browser window with a file navigator.
6. List the R objects in your session by typing `ls()` in an R cell.
7. Run all the example code!

There are a number of hands-on exercises in the document, so while you can run the notebook from beginning to end, you will get a lot more out of it by actually walking through cell-by-cell, and filling out the corresponding exercises.


## What is Microsoft R Server?

The statistical programming language R is an open-source, immensely popular
and useful tool for data analysis and machine learning. It's widely taught in
universities, has thousands of packages that extend its capabilities, and it's
free!

But R has two crucial limitations that make using it for big data challenging:
1. It's single-threaded. No matter how many cores your computer has, R will only use one for its operations.
2. It's memory-bound. All of the data you want to process has to fit into your computer's RAM. Megabytes of data? No problem. Terabytes? Trouble.

Microsoft R Server (MRS) extends open-source R in three critical ways to overcome these limitations:

1. Multi-threaded Math: MRS is built with the Intel Math Kernel Library (MKL), which enables multi-threaded 
   matrix operations - very useful for matrix-heavy computations!
2. Parallel Algorithms:  MRS comes with a suite of functions called ScaleR that 
   allows computationally intense algorithms to be distributed across many cores
   and/or nodes in a cluster.
3. Data In External Memory: MRS doesn't require data to be in-memory for ScaleR algorithms to work.

The native data store is the 'eXternal Data Frame', or XDF file. This file is
optimized for data analysis in a number of ways (see below), but MRS also
works well with databases, HDFS, Spark, and so on.

## The XDF File

The standard way to store bigger-than-memory data for MRS is to create an
'eXternal Data Frame' (XDF) file. XDF files are optimized for fast, distributed
computation in a few essential ways:

1. Data is written to disk in chunks of N rows. This makes it natural to
   read chunks of data, either in serial or for distribution to nodes, and
   easy to append new data to an existing XDF file without re-writing the 
   entire file.
2. Within each chunk, values are stored column-wise. In statistics and machine
   learning, it's often more useful to have fast access to an entire column
   than fast access to an entire record (e.g., when calculating a mean).
3. Essential stats, like the low and high values of numeric variables and the 
   unique levels of categorical variables, are pre-computed when the XDF file 
   is created or modified, providing a quick overview of the dataset without
   processing it.


You can also create "composite" XDF files in which chunks of data spread across
many files are treated as one XDF - useful when you're working with a 
distributed file system like HDFS.
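
To make the chunked layout more concrete, here is a minimal base R sketch. It does not use RevoScaleR or a real XDF file: the toy vector `values` and the names `chunk_size` and `chunks_list` are made up for illustration. It computes a mean by carrying only a running sum and a row count from chunk to chunk, which is the general pattern a chunk-aware algorithm can follow.

```r
# Toy stand-in for one column of a large dataset (hypothetical values)
values <- c(12, 7, 3, 25, 8, 14, 9, 30, 2, 10)

# Split the column into chunks of 4 rows, mimicking "chunks of N rows"
chunk_size <- 4
chunk_id <- ceiling(seq_along(values) / chunk_size)
chunks_list <- split(values, chunk_id)

# Process one chunk at a time, keeping only a running sum and row count
running_sum <- 0
running_n <- 0
for (chunk in chunks_list) {
    running_sum <- running_sum + sum(chunk)
    running_n <- running_n + length(chunk)
}

chunked_mean <- running_sum / running_n

# The chunked result matches the in-memory result
all.equal(chunked_mean, mean(values))
```

Only one chunk needs to be in memory at a time, and the final answer is the same as if the whole column had been loaded at once.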

## Import to XDF

You can use the `rxImport()` function to create an XDF file from a wide variety
of sources: text files, SAS and SPSS files, ODBC connections, and so on.
`rxImport()` has just two essential arguments: 

- `inData`: a pointer to the file or connection you want to pull from, and 
- `outFile`: a filepath where you'd like your new XDF to appear. 

It has plenty of options for importing subsets, specifying
variable types, and so on - but `inData` and `outFile` are the only essentials.

In this example, use `rxImport()` to pull in some data from a CSV file. We'll
work with a dataset of all of the flights originating in three New York
City-area airports in 2013. The original data are available directly from the [`nycflights13` package in R](https://cran.r-project.org/web/packages/nycflights13/index.html), but we have already processed the data here to simplify the tutorial.

Here's an overview of its variables:
- year, month, day, hour, minute: the date and time on which the flight departed
- dep_time: the time of departure
- dep_delay: how late the plane was in departing (in minutes)
- arr_time, arr_delay: same as dep_time and dep_delay, but for arrival
- carrier: the shorthand code for the airline operating the flight
- tailnum: the tail number (and unique identifier) for each plane
- flight: the flight number
- origin: the origin airport - New York's JFK and LaGuardia (LGA), and New Jersey's Newark (EWR)
- dest: the destination airport (many)
- air_time: the time the plane spent aloft (doesn't include taxi time)
- distance: number of miles flown

To start, let's import just one quarter of the flights.


In [None]:
# First, let's create a very simple R character object called "flightsCsv" that points to
# the CSV file, flights_q1.csv (which is in the same directory as this notebook)
flightsCsv <- "flights_q1.csv"

# Next, let's create another R character object called "flightsXdf" that holds the
# filepath for the XDF file, flights.xdf.
# Note: This doesn't actually create the file! It just tells R where we want the file to
# go. Again, this file will be created in the current working directory.
flightsXdf <- "flights.xdf"

# Now, let's import flightsCsv into our new flightsXdf. If you're an R user,
# this might seem kind of strange: we're not assigning the results of rxImport()
# to an object, because all it does is create a new file.
# Even though optional, the `overwrite` option here is used in case the file already exists.
rxImport(inData = flightsCsv, outFile = flightsXdf, overwrite = TRUE)

# It's always a good idea to check the results. `rxGetInfo()` retrieves that
# pre-computed metadata mentioned above, which makes it essential for checking
# newly-imported data. In its simplest form, you can just give it the path to
# the XDF file:
rxGetInfo(flightsXdf)

# But you can get detailed information on the variables by adding
# getVarInfo = TRUE
# and peek at the first few rows of the dataset by adding
# numRows = 5
rxGetInfo(flightsXdf, getVarInfo = TRUE, numRows = 5)

In [None]:
###################
# Exercise 1: Append the Remaining Flights
###################

# XDF files have one other crucial feature: once created, they can only be
# modified in two ways:
#   1. You can add new rows by adding the argument append = TRUE
#   2. You can overwrite the existing file by adding overwrite = TRUE
# Very large XDF files can take a while to create, so MRS includes this small
# hurdle to make sure that any changes you make are intentional. In this
# tutorial, we'll use append and overwrite quite a bit - but you should be
# careful about using them often in production.

# For this exercise, use `rxImport()` to append the flights_q234.csv file onto your
# existing flightsXdf file. You'll need to use the `inData`, `outFile`, and `append`
# arguments.

# Here's a pointer to the CSV file:
moreFlightsCsv <- "flights_q234.csv"


# Append moreFlightsCsv to the *existing* XDF here:
# Reminder: R is case-sensitive!

# Can you report how many rows are in the dataset using rxGetInfo?
# How does this number compare to the number of rows in only flights_q1?
# What would happen if you ran the `rxImport()` command with append set to true more than once?


In [None]:
# Import a second table

# We'll also use a table of data about airports in this analysis.
# Load airports.csv into a new XDF file

# Here's a pointer to the CSV file:
airportsCsv <- "airports.csv"

# Here's the location of the XDF file:
airportsXdf <- "airports.xdf"

# Import the CSV to a new XDF here:
rxImport(inData = airportsCsv,
        outFile = airportsXdf,
        overwrite = TRUE)

In [None]:
## Exercise 2.

# Check the airports xdf file using rxGetInfo()

# Data Management

Now that you have an XDF, let's look at how MRS handles the standard data
management tasks.



### Sorting XDFs


The `rxSort()` function handles sorting datasets. Just like `rxImport()`, it takes
the `inData` and `outFile` arguments; it also takes the `sortByVars` argument,
for naming the variables you want to sort by.

In [None]:
# Sort by arrival delay
flightsSortedXdf <- "flightsSorted.xdf"

# Here's a complete and common rxSort() command.
# Note that arr_delay is quoted.
rxSort(inData = flightsXdf,
       outFile = flightsSortedXdf,
       sortByVars = "arr_delay")

### Check the Data!

A more general-purpose function for doing data processing is `rxDataStep()`. Like many other `RevoScaleR` functions, it takes standard `inData` and `outFile` arguments, as well as many others. We will explore this function in more depth below, but for now, it provides a simple interface for extracting the first few rows of a dataset. In the next cell, we use the `numRows` argument to extract the first 10 rows of the dataset.

In [None]:
# Leaving the `outFile` argument as NULL reads the data into the R session as
# a standard R data.frame, which allows us to view the data within the R console:
# numRows is a numeric argument that indicates how many rows to read and process.
rxDataStep( inData = flightsSortedXdf, numRows = 10)

In [None]:
# Might make more sense to sort the other way. If so, just set decreasing = TRUE
# in the call to `rxSort()`
# We're going to overwrite our previous flightsSorted.xdf, so be sure to set
# overwrite = TRUE
rxSort(inData = flightsXdf,
       outFile = flightsSortedXdf,
       sortByVars = "arr_delay",
       decreasing = TRUE,
       overwrite = TRUE)



# Check the results
rxDataStep(flightsSortedXdf, numRows = 10)

In [None]:
###################
# Exercise 3: Sort an XDF
###################

# Sort flightsXdf by decreasing origin and then by increasing distance.
# To sort by two variables, use the c() function to combine them:
# sortByVars = c("origin", "distance")
# To change the sort order, you'll do something very similar:
# decreasing = c(TRUE, FALSE)
# Remember that you can bring up the documentation by typing ?rxSort
# in a code cell and executing that cell

# Write the sorting code here:

# Check the results
rxDataStep(flightsSortedXdf, numRows = 15)

## Deduplication

Deduplication is another common data processing task, and it is also handled by 
`rxSort()`. For example, to keep only one entry for each origin airport
and month, use those variables as the sort keys and specify `removeDupKeys = TRUE`. Use the help to examine which record is kept.
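
For readers who know base R, a rough in-memory analogue of this step is sorting with `order()` and then dropping repeated keys with `duplicated()`. The sketch below uses a small made-up data frame (`toy_flights` is a hypothetical name) and keeps the first row for each key combination. It only illustrates the idea and makes no claim about which record `rxSort()` keeps.

```r
# Small made-up dataset with repeated (origin, month) combinations
toy_flights <- data.frame(
    origin = c("JFK", "EWR", "JFK", "LGA", "EWR", "JFK"),
    month  = c(1, 1, 1, 2, 1, 2),
    flight = c(101, 202, 303, 404, 505, 606),
    stringsAsFactors = FALSE
)

# Sort by the key variables
sorted <- toy_flights[order(toy_flights$origin, toy_flights$month), ]

# Keep only the first row for each (origin, month) combination
deduped <- sorted[!duplicated(sorted[, c("origin", "month")]), ]
deduped
```

Six rows go in and four come out, one per distinct combination of `origin` and `month`.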



In [None]:
?rxSort

In [None]:
rxSort(inData = flightsXdf,
       outFile = flightsSortedXdf,
       sortByVars = c("origin", "month"),
       removeDupKeys = TRUE,
       overwrite = TRUE)



# Check the results
rxDataStep(flightsSortedXdf)

# How many rows are present in the sorted dataset now?

## Merging XDFs

Merging two tables is another common task in data processing. In MRS, this is done with the `rxMerge()` function.

Merging two XDF files with `rxMerge()` is straightforward if you know SQL. 

- Specify the two datasets with `inData1` and `inData2`. 
- Use the `type` argument to specify the kind of join.
- Use `matchVars` to name the matching keys on the two XDF files.

Unlike SQL, MRS expects the keys to have the same name on both datasets. You
can't just say that x.person_id = y.PersonID, but you *can* temporarily
rename variables for the purposes of matching with the `newVarNames1` and
`newVarNames2` arguments.
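
As a point of comparison, here is a minimal base R sketch of the same idea on in-memory data frames: rename the key so that both tables share a name, then perform a left join. The tables `flights_small` and `airports_small` and their contents are made up for illustration, and base R's `merge()` is used here as a stand-in rather than as a description of how `rxMerge()` works internally.

```r
flights_small <- data.frame(
    flight = c(1, 2, 3),
    origin = c("JFK", "EWR", "XXX"),   # "XXX" has no match in the airports table
    stringsAsFactors = FALSE
)
airports_small <- data.frame(
    faa  = c("EWR", "JFK", "LGA"),
    name = c("Newark Liberty Intl", "John F Kennedy Intl", "La Guardia"),
    stringsAsFactors = FALSE
)

# Rename the key on the second table so both tables call it "origin"
names(airports_small)[names(airports_small) == "faa"] <- "origin"

# Left join: keep every flight, even when no airport matches
joined <- merge(flights_small, airports_small, by = "origin", all.x = TRUE)
joined
```

The unmatched flight is kept with a missing airport name, which is the behavior to expect from a left join.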


In this example, we merge the flights table with the airports table, based on the details of the *origin* airport in the flights table. This will map to the `faa` variable in the airports table.

In [None]:
# Examine the two tables to confirm the key names
rxDataStep(flightsXdf, numRows = 5)
rxDataStep(airportsXdf, numRows = 5)

# Merge on the origin airport details
# Name of the output Xdf file
originMergeXdf <- "originMerge.xdf"

rxMerge(inData1 = flightsXdf,
        inData2 = airportsXdf,
        outFile = originMergeXdf,
        overwrite = TRUE,
        
        # Type of join
        type = "left",
        
        # Temporarily rename the key
        newVarNames2 = c(faa = "origin"), # Original name (faa) on the left,
                                          # new name (origin) quoted on the right
        
        # Name the key variable(s)
        matchVars = "origin"
)

# Check it
rxDataStep(originMergeXdf, numRows = 5)

In [None]:
###################
# Exercise 4: Merge on the Destination Airport Details
###################


# Now, merge the dataset you just created (originMerge.xdf) on the airport 
# table for the *destination* airport.
# Add one new argument: duplicateVarExt = c("origin", "dest")
# Since we're merging with airportsXdf twice, we'll end up with
# duplicate variables. duplicateVarExt will append "origin"
# to the existing variables, and "dest" to the new ones.

# Resulting dataset location
destMergeXdf <- "destMerge.xdf"

# do the merge!

# Check the results using rxDataStep()


## Subset rows with criteria


Selecting particular rows is another common data manipulation
task. For example, maybe we want to create a subset of all the flights
for a single carrier.

In MRS, `rxDataStep()` is the workhorse function for that kind
of data manipulation. Its argument `rowSelection` takes
a series of criteria and returns matching records.

These are standard R comparisons:

- Check equality with the DOUBLE equals sign: `x == y`
- Perform a logical "AND" with `&`: `x == y & x > 10`
- Perform a logical "OR" with `|`: `x == y | x > 10`
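
These operators can be tried out on ordinary R vectors before using them in `rowSelection`. The short sketch below uses made-up values (the vectors are hypothetical and are not taken from the flights data) to show that each comparison yields a logical vector with one entry per row.

```r
# Made-up values standing in for three columns of a dataset
carrier  <- c("UA", "AA", "UA", "DL")
origin   <- c("LGA", "JFK", "EWR", "LGA")
distance <- c(733, 2475, 1400, 762)

# Equality: TRUE where carrier is "UA"
is_united <- carrier == "UA"

# AND: both conditions must hold
united_long <- carrier == "UA" & distance > 1000

# OR: either condition may hold
united_or_lga <- carrier == "UA" | origin == "LGA"

# A logical vector selects the matching elements
distance[united_long]
```
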

In the example below, I create a subset of all United flights:

In [None]:
# the name of the file that we want to store the filtered
# data in
united_flights <- "united.xdf"

# Using rowSelection to select flights
rxDataStep(inData = flightsXdf,
           outFile = united_flights,
           rowSelection = carrier == "UA",
           overwrite = TRUE)

# Show the results
rxDataStep(united_flights, numRows = 5)

In [None]:
###################
# Exercise 5: Subset to United Flights Originating at LaGuardia
###################

# Let's try the subsetting task again, and this time subset to just
# United flights that originated in LaGuardia (LGA)

# Let's just overwrite the united_flights XDF
united_flights <- "united.xdf"

rxDataStep(united_flights, numRows = 5)

## Transforming variables

`rxDataStep()` is also the function to use for creating and
modifying variables. The key argument there is `transforms`.
`transforms` is a little unusual for R - it takes a list, which is a 
highly flexible R data structure. Inside that list, each new
variable gets an entry like this:

`newVar = someFunction(oldVar)`

Each entry must be named (e.g. "newVar"), and is separated from any
following entries by a comma.

For example, let's convert year-month-day into a proper Date variable.

In [None]:
rxDataStep(inData = flightsXdf,
           outFile = flightsXdf,
           transforms = list(      # Note the list(
               
               # First, let's just paste the components together
               flightDateString = paste(year, month, day, sep = "-"),
               
               # Then convert it into a Date
               flightDate = as.Date(flightDateString),
               
               # Finally, format it to day of week
               dayOfWeek = format(flightDate, format = "%A")
           ),
           overwrite = TRUE
)
rxDataStep(flightsXdf, numRows = 5)

## Multiple Transforms

One thing to note from the prior `rxDataStep()` is that the functions we specify in the `transforms` list can actually be used in sequence. The new variable `flightDateString` is defined in the list, but is used to create `flightDate`, which, in turn is used to define `dayOfWeek`.
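
To see what each step in that sequence produces, the same three expressions can be run on plain R vectors. This is a base R sketch with three made-up dates, not a call to `rxDataStep()`. Note that `format(..., "%A")` returns day names in the language of the current locale.

```r
# Three made-up dates, split into components like the flights data
year  <- c(2013, 2013, 2013)
month <- c(1, 7, 12)
day   <- c(1, 4, 25)

# Step 1: paste the components together
flightDateString <- paste(year, month, day, sep = "-")

# Step 2: convert the string into a Date (uses the result of step 1)
flightDate <- as.Date(flightDateString)

# Step 3: format the Date as a day of the week (uses the result of step 2)
dayOfWeek <- format(flightDate, format = "%A")

data.frame(flightDateString, flightDate, dayOfWeek)
```
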

In [None]:
###################
# Exercise 6: Calculate Speed
###################


# Add speed to each flight's records using the transforms argument:
# Be sure to measure speed in an appropriate fashion (e.g. miles per hour)

rxDataStep(inData = flightsXdf, numRows = 5)

## Complex Transformations

MRS was built from the very beginning with distributed computation in mind.
While that allows for some tremendous performance gains, it also comes with
some crucial limitations. For example, when in a distributed context, we
can't assume that all or even most of the dataset is available at any
given time. Each node may only have a small window on the dataset.

More precisely, *ANY* transformation that depends on a value from another row - 
even just the next row over - has to assume that value won't always be
available. Other statistics depend on *every* value in the dataset to be 
correctly computed - like the mean, minimum, and maximum. Let's take a look
at how things can go wrong.

Most ScaleR functions already take this into account and take care of the
details for you. Unlike those functions, `rxDataStep()` lets you run whatever
R code you like on your dataset. The price for that flexibility is that
you need to watch out for transformations that depend on values from other
rows.

To show how this happens, we create a tiny dataset called `chunks_df`
that has 9 rows in it. The three variables in this dataset correspond to:

- `chunk`: the chunk of data that is processed (with values 1, 2, or 3)
- `date`: a series of 9 dates
- `x`: a randomly ordered sequence of the values 21-29

We generate and look at this data with the following cell:


In [None]:
# To show you how this happens, I'm going to set up a tiny dataset:
chunks_df <- data.frame(chunk = (1:9 + 2) %/% 3,
                        date = seq(Sys.Date(), length.out = 9, by = "1 day"),
                        x = sample(21:29, size = 9),
                        stringsAsFactors = FALSE
)


# Take a look at the results: a data frame with just 9 rows, a chunk ID, a date, 
# and a variable x
chunks_df

## Create the tiny xdf

Next, we import this data into an XDF file in order to demonstrate how chunking works. 
We overwrite the file when writing the first chunk so that the file ends up with 
only 9 rows, and we append for every later chunk so that each
write adds a new chunk of data:

In [None]:
# Now read it into an XDF file. I'm going to append three rows at a time so that
# the XDF file will have three chunks, despite its small size.
chunks <- "chunks.xdf"
    
for(i in 1:3) {
    rxImport(inData = chunks_df[chunks_df$chunk %in% i, ],
             outFile = chunks,
             overwrite = i == 1,
             append = i > 1)
}
    
rxGetInfo(chunks, getVarInfo = TRUE, numRows = 9)

## Problem

Now that we have a tiny data set (with tiny chunks), we can explore what happens 
in a very common scenario. Let's say that you would like to scale variable `x` 
so that its minimum value = 0 and its maximum value = 1. In practice, this type 
of transformation is common because it frequently makes it easier to compare 
variables with very different ranges.

If I just wrote the transformation in the classic R style, it might look like the
following cell:



In [None]:
rxDataStep(inData = chunks,
           outFile = chunks,
           
           # Subtract the minimum of x from each value of x, then divide by
           # the difference between the maximum and minimum values
           transforms = list( xScaledNaive = (x - min(x)) / 
                                             (max(x) - min(x)) ),
           overwrite = TRUE
)


# Check the results.
rxDataStep(chunks)

## Understanding what happened 

If you compare the original and scaled values, you can see that multiple values
that aren't the true min and max of x are scaled to zero and one. That's
because transformations in `rxDataStep()` work on *one chunk at a time*, and
apply the R code you've written to each chunk. So `min()` and `max()` above are
correctly returning the minima and maxima of *each chunk*. For the
transformation to work as intended, we need the **global** min and max.
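
The chunk-by-chunk behavior can be imitated in base R without any XDF file. The sketch below is an illustration under assumptions: it uses a fixed vector `x` instead of a random one, a made-up helper `scale01()`, and `ave()` to apply that helper separately within each chunk, which stands in for how a per-chunk transformation behaves.

```r
# Fixed values 21-29 in three chunks of three rows each
x <- c(21, 25, 29, 22, 27, 24, 23, 28, 26)
chunk <- rep(1:3, each = 3)

# Hypothetical helper: scale a vector to the range [0, 1]
scale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# Naive scaling: min() and max() only see one chunk at a time
x_scaled_naive <- ave(x, chunk, FUN = scale01)

# Intended scaling: use the global minimum and maximum
x_scaled_global <- (x - min(x)) / (max(x) - min(x))

data.frame(chunk, x, x_scaled_naive, x_scaled_global)
```

In the naive column every chunk contains its own 0 and its own 1, while the global column contains exactly one of each.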

It's easy to detect the error in this small dataset, but on big datasets with
bigger chunks, the error can be much more subtle. So how can we address this?

The MRS approach is to calculate the global min and max (or any other global
statistic) independently, using a chunk-aware ScaleR function (i.e., anything
except `rxDataStep()`), and then pass the results back to `rxDataStep()` for the
transformation.

The min and max are conveniently precomputed in the XDF metadata, so we
can just extract them from there:

In [None]:
# Get the min and max from metadata:
xMin <- rxGetVarInfo(chunks)$x$low
xMax <- rxGetVarInfo(chunks)$x$high

xMin
xMax


## What to do next?

Once we have the global min and max, we make them available to `rxDataStep()`
by passing them in through the `transformObjects` argument.
This takes the two values and makes them available inside
the environment of `rxDataStep()`.

In [None]:

rxDataStep(inData = chunks,
           outFile = chunks,
           
           # We can use a list() to contain the two values. They have to have
           # names, and those names unfortunately have to be different than
           # the object names, so I'll just call them minValue and maxValue:
           transformObjects = list(minValue = xMin, maxValue = xMax),
           
           # I can then refer to minValue and maxValue inside my transformation
           # code, instead of the original min(x) and max(x):
           transforms = list( xScaledCorrect = (x - minValue) / 
                                               (maxValue - minValue)  ),
           overwrite = TRUE
)



# Compare the results of xScaledNaive and xScaledCorrect
rxGetInfo(chunks, numRows = 9)



# If this seems complex, that's because it is! Thankfully, you won't
# need to make transformations like this very often - but it is one of the
# hazards for new MRS programmers.

## What about other transformations?

We often want to apply other transformations, such as computing z-scores or centering on a mean or median.

We compute the statistics these need - means, standard deviations, quantiles, etc - 
using ScaleR's other summary functions, which we'll see below.

Once we have the relevant statistics, the next step of passing them to `rxDataStep()` is the same - we use the `transformObjects` argument to make those values accessible to the computation.
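
The two-pass pattern for a z-score can be sketched in base R. This is an illustration only: it uses a plain list of vectors (`chunk_list`, a made-up name) in place of an XDF file and accumulates simple sums by hand, whereas with real data the global statistics would come from a chunk-aware summary function.

```r
# Nine values split into three chunks of three rows each
x <- c(21, 25, 29, 22, 27, 24, 23, 28, 26)
chunk_list <- split(x, rep(1:3, each = 3))

# Pass 1: accumulate what is needed for a global mean and standard deviation
n <- 0
total <- 0
total_sq <- 0
for (v in chunk_list) {
    n <- n + length(v)
    total <- total + sum(v)
    total_sq <- total_sq + sum(v^2)
}
global_mean <- total / n
# Textbook sum-of-squares formula; fine for a toy example, but it can lose
# precision on large or badly scaled data
global_sd <- sqrt((total_sq - n * global_mean^2) / (n - 1))

# Pass 2: transform chunk by chunk, using the global values
z <- unlist(lapply(chunk_list, function(v) (v - global_mean) / global_sd),
            use.names = FALSE)

# Same answer as computing the z-scores on the whole vector at once
all.equal(z, (x - mean(x)) / sd(x))
```
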



## Factors

There's one other complex transformation to watch out for: creating categorical variables (or factors in R parlance). Factors are stored in each row as an integer index that maps that row's value to an entry in a dictionary of labels.

Because the mapping between integer and label could potentially differ from chunk to chunk,
it's easiest to just use `rxFactors()` to manage them. Here's an example.
Feel free to skip over it if you want, but be sure to run the code.

The key argument in `rxFactors()` is `factorInfo`, which takes a list naming each
variable you'd like to create or modify.
Each element in that list takes ANOTHER list.
Each element of THAT list is an argument that specifies one 
attribute of a factor - the variable to be converted (e.g. varName), the levels, etc.
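
The mapping problem itself is easy to reproduce in base R. In the sketch below, two made-up character vectors (`chunk_a` and `chunk_b`, hypothetical names) stand in for two chunks of the same column. Converting each one to a factor independently gives the same label different integer codes, while supplying a shared set of levels keeps the codes consistent.

```r
# Two "chunks" of the same character column
chunk_a <- c("UA", "AA", "UA")
chunk_b <- c("UA", "WN", "UA")

# Factors built independently: levels are inferred from each chunk alone
f_a <- factor(chunk_a)   # levels: "AA", "UA"
f_b <- factor(chunk_b)   # levels: "UA", "WN"
as.integer(f_a)          # "UA" is stored as 2
as.integer(f_b)          # "UA" is stored as 1

# Factors built with one shared dictionary of levels
all_levels <- sort(unique(c(chunk_a, chunk_b)))
g_a <- factor(chunk_a, levels = all_levels)
g_b <- factor(chunk_b, levels = all_levels)
as.integer(g_a)          # "UA" is stored as 2
as.integer(g_b)          # "UA" is stored as 2
```
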

In [None]:
# Example
rxFactors(inData = flightsXdf,
          outFile = flightsXdf,
          factorInfo = list(
              
              # The simplest use of rxFactors: create new factors from existing
              # character variables. I'll make three new factors, carrier_F,
              # origin_F, and dest_F, from the original character variables:
              carrier_F = list(varName = "carrier"),
              origin_F = list(varName = "origin"),
              dest_F = list(varName = "dest"),
              
              # If I want to specify the order of the levels manually, I can
              # add the "levels" argument to the list and just type out the
              # names of the levels:
              dayOfWeek_F = list(varName = "dayOfWeek",
                                 levels = c("Sunday", "Monday", "Tuesday",
                                            "Wednesday", "Thursday", "Friday",
                                            "Saturday"))
          ),
          overwrite = TRUE
)


# Check the results
rxGetVarInfo(flightsXdf)

## Summary Statistics

Thankfully, most MRS functions know they're working on chunked data and handle
computations accordingly. In this section, we'll look at the essential summary
statistics functions in MRS:

- `rxSummary()` for classic quantitative descriptive stats
- `rxQuantile()` for quantiles and the median
- `rxCrossTabs()` and `rxCube()` for crosstabulation

### rxSummary()

`rxSummary()` is the primary function for summarizing numeric
variables, and it works like summary functions in many other languages.

There is one twist: `rxSummary()` uses R's formula syntax to identify which
variables should be summarized. The formula syntax is very helpful for
specifying complex relationships between variables, but a little confusing in
this simple case. For `rxSummary()`, you can just type a tilde (~) and then name
the variables you'd like to summarize, separated by a plus sign:

In [None]:
# Continuous variable example with multiple variables to be summarized 
rxSummary( ~ arr_delay + dayOfWeek_F, data = flightsXdf)

### Group-wise summaries

If you want a groupwise summary, put the variable you want to summarize on the
*left* of the tilde, and your grouping variable on the right.
The grouping variable MUST be a factor!
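
If the formula syntax is new to you, it can help to try it in base R first. The sketch below uses a small made-up data frame (`toy` is a hypothetical name) and base R's `aggregate()`, which accepts the same "response on the left, grouping variable on the right" layout. It is an analogy for reading the formula, not a description of how `rxSummary()` computes its results.

```r
toy <- data.frame(
    origin    = factor(c("EWR", "JFK", "EWR", "LGA", "JFK", "LGA")),
    dep_delay = c(10, -2, 20, 5, 4, 15)
)

# One-sided formula: nothing on the left, it only names variables
all.vars(~ dep_delay + origin)

# Two-sided formula: summarize dep_delay within each level of origin
group_means <- aggregate(dep_delay ~ origin, data = toy, FUN = mean)
group_means
```
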

In [None]:
# Here, we can see average departure delay by origin airport
rxSummary(dep_delay ~ origin_F, data = flightsXdf)

In [None]:
# You can get quantiles using rxQuantile() - but only one variable
# at a time. The median flight was four minutes early!
rxQuantile(varName = "arr_delay", data = flightsXdf)

In [None]:
###################
# Exercise 7: Summarize some variables
###################

# 1. Try summarizing a few different variables in flightsXdf.
# 2. Summarize arr_delay for each level of one of the factors.

### Summarizing Categorical Variables

To summarize categorical values, use `rxCrossTabs()` or `rxCube()`.
These two functions have the same syntax and functionality, but have different outputs and different defaults. 

These use R's formula syntax
as well - but instead of separating the right-hand variables with plus signs,
they're separated by colons. This is R's arcane way of expressing that we
want to know about the relationship between these two variables, but without
designating one the predictor and one the predicted.

In [None]:
# Compare the output of these two identical calls:
rxCrossTabs( ~ dest_F : origin_F, data = flightsXdf)

rxCube( ~ dest_F : origin_F, data = flightsXdf)

### Differences between rxCrossTabs() and rxCube()

The primary difference between `rxCrossTabs()` and `rxCube()` is the output structure. As seen above, `rxCrossTabs()` returns an N-dimensional table of counts, which is stored in memory in the R session. `rxCube()` returns a `list()` object that can be coerced to a `data.frame()`-like object, which can easily be written to an XDF file. There are options within `rxCube()` as well: you can return an actual data.frame by setting the argument `returnDataFrame = TRUE`, and, as with many other ScaleR functions, you can write directly to an XDF file by assigning a non-NULL value to the `outFile` argument.

The other main difference between `rxCrossTabs()` and `rxCube()` (which isn't apparent in this example) concerns their defaults when you interact a single numeric variable with categorical variables. Both functions can compute group-wise means or sums, but by default `rxCrossTabs()` computes group-wise sums and `rxCube()` computes group-wise means. See the `means` argument for each function and its default value.

See the help for more information (`?rxCrossTabs` and `?rxCube`)
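
The difference in output shape is loosely analogous to the difference between a base R `table()` and the long data frame obtained from it. The sketch below uses a made-up data frame (`toy` is a hypothetical name) and base R only. The exact structure of the `rxCrossTabs()` and `rxCube()` return values differs from these objects, so treat this as an analogy.

```r
toy <- data.frame(
    origin = c("EWR", "JFK", "EWR", "LGA", "JFK", "EWR"),
    dest   = c("ORD", "LAX", "ORD", "ORD", "LAX", "LAX"),
    stringsAsFactors = TRUE
)

# Wide, table-shaped counts: one cell per combination of levels
wide_counts <- table(dest = toy$dest, origin = toy$origin)
wide_counts

# Long, data-frame-shaped counts: one row per combination of levels
long_counts <- as.data.frame(wide_counts, responseName = "Counts")
long_counts
```

The long form is the one that is convenient to write back to disk or to pass on to further processing.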



In [None]:
###################
# Exercise 8: Crosstabs
###################

# Now try adding a third variable to both rxCrossTabs() and rxCube(), and
# compare their output.
# What happens if that third variable is numeric? a factor?

## Predictive Modeling

Finally, MRS includes a suite of fast, distributed algorithms for statistical
modeling and machine learning, which includes:
- Linear regression (`rxLinMod()`)
- Logistic regression (`rxLogit()`)
- Generalized linear models (`rxGLM()`)
- Decision trees (`rxDTree()`)
- Gradient boosted decision trees (`rxBTree()`)
- Decision forests (`rxDForest()`)
- K-means (`rxKmeans()`)
- Naive Bayes (`rxNaiveBayes()`)


## Predicting Arrival Delay
Model syntax in MRS is usually very similar to open-source R. You estimate the
model using the relevant call (e.g. `rxLinMod()`), and then inspect it
using standard function calls (`summary()`, `coef()`, etc).

Here's a simple linear regression predicting arrival delay, using the formula
syntax again. I put my dependent variable on the left-hand side of the tilde,
and my predictor on the right:

In [None]:
# Example model syntax.
mod1 <- rxLinMod(arr_delay ~ origin_F, data = flightsXdf)

# To see the results of the model, use R's summary() function on it:
summary(mod1)

## Saving the model for a later session

Similar to open source R, we can save the model for later by saving it to disk using `save()`, 
and then load it back into memory later with `load()`.

Let's save the estimated model first:

In [None]:
# This model object can be saved to a file:
save(mod1, file = "model.Rdata")

# remove the model from the workspace:
rm(mod1)

# show that it's no longer available:
ls(pattern = "mod1")

## Loading the model

Now let's load it back and make sure it looks the same as above.

In [None]:

# And later re-loaded and used for predictions:
load("model.Rdata")

# it's back!
ls(pattern = "mod1")

summary(mod1)

## Generating Predictions

Predictions are also generated in a very similar way to predictions within
open source R. Rather than using `predict()`, we use `rxPredict()` for a ScaleR
model object.
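
For comparison, here is the open-source R workflow that this mirrors, as a minimal sketch on a made-up data frame (`toy`, `fit`, and `new_data` are hypothetical names). It uses base R's `lm()` and `predict()`, not the ScaleR functions.

```r
toy <- data.frame(
    origin    = factor(c("EWR", "EWR", "JFK", "JFK", "LGA", "LGA")),
    arr_delay = c(10, 14, 4, 6, 8, 12)
)

# Fit a linear model with a single categorical predictor
fit <- lm(arr_delay ~ origin, data = toy)

# New data must use the same factor levels as the training data
new_data <- data.frame(origin = factor(c("EWR", "JFK", "LGA"),
                                       levels = levels(toy$origin)))

# With one categorical predictor, the predictions are the group means
preds <- predict(fit, newdata = new_data)
preds
```
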



In [None]:
# New data with an appropriate variable:
newData <- data.frame(origin_F = factor(c("EWR", "LGA", "JFK"),
                                        levels = c("EWR", "LGA", "JFK")))

# Then use rxPredict to apply the model to the new data
rxPredict(modelObject = mod1, data = newData)

In [None]:
###################
# Exercise 9: Build a Model
###################


# Expand the linear regression above into a more complex model that accounts for more variance
# Try one of the other algorithms for predicting arrival delay
# If you feel like getting fancy, use rxSplit() (see help via ?rxSplit) to separate the data into
# testing and training sets!

# Quick formula examples:
# To predict y as a function of x:                   y ~ x
# To predict y as a function of x and z:             y ~ x + z
# To estimate an interaction between x and z:        y ~ x * z