# Lab 4: Submitting R Code on a Supercomputer

### Objective:
Learn how to run parallel R code on a supercomputer.

### Successful outcome:
Be able to write a parallel R script with foreach, and also to write the PBS and Bash scripts needed to submit jobs from the terminal.

## Step 1: Preliminaries

This first step is very important. Skipping it can leave stale variables and mismatched package dependencies across scripts, which may prevent other people from reproducing your results.

In [1]:
## clear all variables from the current workspace
rm(list=ls())

## This function will check if a package is installed, and if not, install it
pkgTest <- function(x) {
  if (!require(x, character.only = TRUE))
  {
    install.packages(x, dependencies = TRUE)
    if(!require(x, character.only = TRUE)) stop("Package not found")
  }
}

## These lines load the required packages using 'pkgTest' function
packages <- c("readxl", "data.table", "ggmap", "ggplot2")
lapply(packages, pkgTest)
    
## the output should be all "NULL"

Loading required package: readxl
Loading required package: data.table
Loading required package: ggmap
Loading required package: ggplot2


In [2]:
## find the current working directory
getwd()

## Step 2: Preprocessing

#### Main Goal: 
1. Reading the required files and datasets
2. Adding the fields for modeling

This is usually the second step you perform on the data. In this step, you may need to remove NAs or convert columns from "character" to "factor" (or the other way around). Sometimes you also need to create a new field for modeling. For example, converting daily dates into quarters is one transformation we use frequently.
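
As a rough illustration of the three operations mentioned above, here is a minimal sketch on a made-up data frame. The `toy` table and its values are hypothetical and only loosely mimic the daily data used in this lab.

```r
## Toy data standing in for a daily table (all values are made up)
toy <- data.frame(
  Date   = c("2017-01-15", "2017-04-02", NA, "2017-08-20"),
  Status = c("A", "R", "B", "R"),
  Price  = c(65, 70, 80, NA),
  stringsAsFactors = FALSE
)

## 1. Drop rows that contain any NA
toy <- na.omit(toy)

## 2. Convert a character column to a factor
toy$Status <- as.factor(toy$Status)

## 3. Derive a quarterly field from the daily date
toy$date    <- as.Date(toy$Date, format = "%Y-%m-%d")
toy$Quarter <- paste(format(toy$date, "%Y"), quarters(toy$date))

print(toy)
```

Only the two complete rows survive `na.omit()`, and each gets a label such as "2017 Q1". Whether dropping incomplete rows is appropriate depends on the dataset, so treat this as one option rather than a rule.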

In [3]:
## Read the input files under the current working directory
US_P <- readRDS("United_States_Property.rds")
US_D <- readRDS("United_States_Daily.rds")

In [4]:
## now we can examine US_P & US_D by printing their first 6 rows.
head(US_P)

Property.ID,Host.ID,Listing.Title,Property.Type,Listing.Type,Created.Date,Last.Scraped.Date,Country,State,City,⋯,Count.Reservation.Days.LTM,Count.Available.Days.LTM,Count.Blocked.Days.LTM,Number.of.Photos,Business.Ready,Instantbook.Enabled,Listing.URL,Listing.Main.Image.URL,Latitude,Longitude
87601,68654,Eugene Get-Away (Sun Room),House,Private room,2017-03-16,2017-05-31,United States,Oregon,Eugene,⋯,17,0,13,13,False,No,https://www.airbnb.com/rooms/87601,https://a0.muscache.com/im/pictures/1868780/1bdccf9f_original.jpg?aki_policy=x_large,44.00863,-123.19604
26012,109589,Sunny 2-story townhouse w garden,House,Entire home/apt,2010-04-15,2017-06-02,United States,New York,New York,⋯,100,5,260,31,True,No,https://www.airbnb.com/rooms/26012,https://a0.muscache.com/im/pictures/68664532/f74dd09e_original.jpg?aki_policy=x_large,40.68157,-73.98989
82963,451748,Palm Desert Oasis,House,Entire home/apt,2011-03-18,2017-06-01,United States,California,Palm Desert,⋯,3,27,0,36,False,No,https://www.airbnb.com/rooms/82963,https://a0.muscache.com/im/pictures/70531770/3eb2f137_original.jpg?aki_policy=x_large,33.71463,-116.37367
23472,91808,Cozy Cottage,Cabin,Entire home/apt,2010-03-12,2017-05-30,United States,Wisconsin,Verona,⋯,47,286,1,10,False,No,https://www.airbnb.com/rooms/23472,https://a0.muscache.com/im/pictures/123650/2df7b8c1_original.jpg?aki_policy=x_large,43.03177,-89.62797
44020,101521,2 Bedroom 1 Block to Fullerton L,Apartment,Entire home/apt,2010-08-07,2017-06-01,United States,Illinois,Chicago,⋯,65,128,17,18,False,No,https://www.airbnb.com/rooms/44020,https://a0.muscache.com/im/pictures/730606/d0ff831f_original.jpg?aki_policy=x_large,41.92673,-87.65731
94059,486032,"""Psycho""- Sigmund Freud Suite",Bed & Breakfast,Entire home/apt,2011-04-11,2017-06-01,United States,Massachusetts,Southbridge,⋯,34,172,8,35,False,No,https://www.airbnb.com/rooms/94059,https://a0.muscache.com/im/pictures/6978898/adb70aab_original.jpg?aki_policy=x_large,42.07687,-72.03798


In [5]:
## do the same for the daily data
head(US_D)

Property.ID,Date,Status,Price,Booked.Date,Reservation.ID
6,2014-11-01,A,65,,
6,2014-11-02,A,65,,
6,2014-11-03,A,65,,
6,2014-11-04,A,65,,
6,2014-11-05,A,65,,
6,2014-11-06,A,65,,


In [6]:
## Creating the city & state field
US_P$City_State <- paste(US_P$City,"_", US_P$State, sep="")

In [7]:
## select all unique markets
Markets  <- unique(US_P$City_State, incomparables=FALSE)

## for this exercise we will only use 5 markets (indices 1, 3, 4, 5 and 6)
Markets1 <- Markets[c(1,3,4,5,6)]
print(Markets1)

[1] "Eugene_Oregon"             "Palm Desert_California"   
[3] "Verona_Wisconsin"          "Chicago_Illinois"         
[5] "Southbridge_Massachusetts"


In [8]:
## creating a directory to store the outputs
if(!dir.exists("output")){
    dir.create("output")
}
dir.exists("output")

## Step 3: Modeling & Plotting

This is usually the step where parallelization pays off in a data analytics application. In our case, we need to go through every city in the market list, compute its "Revenue" and "Potential Revenue", and plot both.

In [9]:
## for the serial example, we pick a single market to work through
y <- Markets[1]
y

In [34]:
## Choose market
TH_P <- subset(US_P, City_State==y)
head(TH_P)

Unnamed: 0,Property.ID,Host.ID,Listing.Title,Property.Type,Listing.Type,Created.Date,Last.Scraped.Date,Country,State,City,⋯,Count.Available.Days.LTM,Count.Blocked.Days.LTM,Number.of.Photos,Business.Ready,Instantbook.Enabled,Listing.URL,Listing.Main.Image.URL,Latitude,Longitude,City_State
1,87601,68654,Eugene Get-Away (Sun Room),House,Private room,2017-03-16,2017-05-31,United States,Oregon,Eugene,⋯,0,13,13,False,No,https://www.airbnb.com/rooms/87601,https://a0.muscache.com/im/pictures/1868780/1bdccf9f_original.jpg?aki_policy=x_large,44.00863,-123.196,Eugene_Oregon
563,37609,161959,Feels like a forest cabin!,Bungalow,Entire home/apt,2010-07-06,2017-03-25,United States,Oregon,Eugene,⋯,45,13,6,False,No,https://www.airbnb.com/rooms/37609,https://a0.muscache.com/im/pictures/67162e4f-98cc-4087-9d1a-d03f421512e1.jpg,44.07781,-123.0899,Eugene_Oregon
1086,87035,105888,Quiet Suite in South Eugene Home,Guest suite,Private room,2011-04-01,2017-05-31,United States,Oregon,Eugene,⋯,122,78,26,False,No,https://www.airbnb.com/rooms/87035,https://a0.muscache.com/im/pictures/1816567/b5c170a5_original.jpg?aki_policy=x_large,44.01144,-123.0795,Eugene_Oregon
2922,49056,223619,Fabulous Views. Unique. Close to UO.,House,Entire home/apt,2010-09-05,2017-06-01,United States,Oregon,Eugene,⋯,76,2,67,False,No,https://www.airbnb.com/rooms/49056,https://a0.muscache.com/im/pictures/20004102/d97d6aba_original.jpg?aki_policy=x_large,44.04468,-123.0555,Eugene_Oregon
3457,27219,68654,Eugene Get-Away (Garden Room),House,Private room,2017-03-16,2017-06-01,United States,Oregon,Eugene,⋯,0,13,23,False,No,https://www.airbnb.com/rooms/27219,https://a0.muscache.com/im/pictures/1831641/76f9c73f_original.jpg?aki_policy=x_large,44.00631,-123.1984,Eugene_Oregon
3598,40218,169866,Downtown Bungalow,House,Private room,2010-07-18,2017-05-31,United States,Oregon,Eugene,⋯,218,18,15,False,No,https://www.airbnb.com/rooms/40218,https://a0.muscache.com/im/pictures/3102298/054c7636_original.jpg?aki_policy=x_large,44.04887,-123.1062,Eugene_Oregon


In [35]:
## Subset Daily Data by Market
TH_ID <- as.vector(TH_P$Property.ID)
TH_D <- US_D[which(US_D$Property.ID %in% TH_ID),]
head(TH_D)

Unnamed: 0,Property.ID,Date,Status,Price,Booked.Date,Reservation.ID
1455810,27219,2017-03-01,B,70,,
1455811,27219,2017-03-02,B,70,,
1455812,27219,2017-03-03,B,70,,
1455813,27219,2017-03-04,B,70,,
1455814,27219,2017-03-05,B,70,,
1455815,27219,2017-03-06,B,70,,


In [36]:
# Define Market Characteristics
City <- TH_P$City[1]
State <- TH_P$State[1]

## Modify date format
TH_D$date <- as.Date(as.character(TH_D$Date), format="%Y-%m-%d")
TH_D$datepos <- as.POSIXlt(TH_D$date)
print("final form of TH_D")
head(TH_D)

[1] "final form of TH_D"


Unnamed: 0,Property.ID,Date,Status,Price,Booked.Date,Reservation.ID,date,datepos
1455810,27219,2017-03-01,B,70,,,2017-03-01,2017-03-01
1455811,27219,2017-03-02,B,70,,,2017-03-02,2017-03-02
1455812,27219,2017-03-03,B,70,,,2017-03-03,2017-03-03
1455813,27219,2017-03-04,B,70,,,2017-03-04,2017-03-04
1455814,27219,2017-03-05,B,70,,,2017-03-05,2017-03-05
1455815,27219,2017-03-06,B,70,,,2017-03-06,2017-03-06


In [37]:
## Compute Property Level Potential Revenue
TH_D$PotentialRevenue <- 0
TH_D$PotentialRevenue[which(TH_D$Status=="A" | TH_D$Status=="R")] <- TH_D$Price[which(TH_D$Status=="A" | TH_D$Status=="R")]

## Compute Property Level Revenue
TH_D$Revenue <- 0
TH_D$Revenue[which(TH_D$Status=="R")] <- TH_D$Price[which(TH_D$Status=="R")]

## Compute Market Level Potential Revenue
PotentialRevenue <- aggregate(PotentialRevenue ~ date, data=TH_D, FUN=sum, na.rm=TRUE)

## Compute Market Level Revenue
Revenue <- aggregate(Revenue ~ date, data=TH_D, FUN=sum, na.rm=TRUE)

## Merge Revenue Data
TH_Daily <- merge(PotentialRevenue,Revenue, by = c("date"))
TH_Daily$RevenueFraction <- as.numeric(TH_Daily$Revenue/TH_Daily$PotentialRevenue)
TH_Daily$City <- City
TH_Daily$State <- State

head(TH_Daily)

date,PotentialRevenue,Revenue,RevenueFraction,City,State
2014-08-01,13206,6838,0.5177949,Eugene,Oregon
2014-08-02,13297,5743,0.4319019,Eugene,Oregon
2014-08-03,13903,4281,0.3079192,Eugene,Oregon
2014-08-04,14528,4604,0.3169053,Eugene,Oregon
2014-08-05,14570,4381,0.3006863,Eugene,Oregon
2014-08-06,14292,4793,0.3353624,Eugene,Oregon
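
To make the revenue logic above easier to follow, here is a small sketch on made-up data. The `toy_D` table is hypothetical, and reading the status codes as "A" = available, "R" = reserved and "B" = blocked is an assumption; the lab code itself only tells us that "A" and "R" days count toward potential revenue and "R" days toward revenue.

```r
## Toy daily data for two hypothetical properties (values are made up)
toy_D <- data.frame(
  Property.ID = c(1, 1, 2, 2),
  date   = as.Date(c("2017-03-01", "2017-03-02", "2017-03-01", "2017-03-02")),
  Status = c("R", "A", "B", "R"),
  Price  = c(100, 100, 80, 80)
)

## Potential revenue counts days that are "A" or "R"; revenue counts only "R"
toy_D$PotentialRevenue <- ifelse(toy_D$Status %in% c("A", "R"), toy_D$Price, 0)
toy_D$Revenue          <- ifelse(toy_D$Status == "R", toy_D$Price, 0)

## Sum both columns per date in a single aggregate() call
toy_Daily <- aggregate(cbind(PotentialRevenue, Revenue) ~ date,
                       data = toy_D, FUN = sum)
toy_Daily$RevenueFraction <- toy_Daily$Revenue / toy_Daily$PotentialRevenue

print(toy_Daily)
```

This variant uses `ifelse()` and one `aggregate()` call instead of the `which()` indexing and `merge()` used in the lab. It should give the same per-date totals, but it is an alternative formulation, not the lab's own code.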


In [None]:
## SaveRDS
data_name = paste("output/", y, ".rds", sep="")  
saveRDS(TH_Daily, data_name)

In [None]:
# Plot Potential and Actual Revenue by Date
p <- ggplot(TH_Daily, aes(x = date)) +
  geom_line(aes(y = PotentialRevenue), colour = 'black') +
  geom_line(aes(y = Revenue), colour = 'red')
p

# save the plot to a file
graph_name = paste("output/Revenue_", y, ".png", sep="")
ggsave(graph_name, plot = p, width = 8, height = 5)

In [39]:
## This function will check if a package is installed, and if not, install it
pkgTest <- function(x) {
  if (!require(x, character.only = TRUE))
  {
    install.packages(x, dependencies = TRUE)
    if(!require(x, character.only = TRUE)) stop("Package not found")
  }
}

## These lines load the required packages using 'pkgTest' function
packages <- c("foreach", "doParallel")
lapply(packages, pkgTest)

Loading required package: doParallel
Loading required package: iterators
Loading required package: parallel


## Step 3.1: Creating the parallel function

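To use "foreach", we first wrap the per-market steps from Step 3 in a single function that takes a market name and the two datasets. foreach can then call this function once per market, with each call running independently of the others.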

In [40]:
findModel <- function(y, US_P, US_D){  

  print(paste0(y, " beginning"))

  ## Choose market
  TH_P <- subset(US_P, City_State==y)
  
  ## Subset Daily Data by Market
  TH_ID <- as.vector(TH_P$Property.ID)
  TH_D <- US_D[which(US_D$Property.ID %in% TH_ID),]
  
  if(nrow(TH_D) == 0){
    print(paste0(y," had 0 terms"))
  } else {

  # Define Market Characteristics
  City <- TH_P$City[1]
  State <- TH_P$State[1]
  
  ## Modify date format
  TH_D$date <- as.Date(as.character(TH_D$Date), format="%Y-%m-%d")
  TH_D$datepos <- as.POSIXlt(TH_D$date)
  
  ## Compute Property Level Potential Revenue
  TH_D$PotentialRevenue <- 0
  TH_D$PotentialRevenue[which(TH_D$Status=="A" | TH_D$Status=="R")] <- TH_D$Price[which(TH_D$Status=="A" | TH_D$Status=="R")]
  
  ## Compute Property Level Revenue
  TH_D$Revenue <- 0
  TH_D$Revenue[which(TH_D$Status=="R")] <- TH_D$Price[which(TH_D$Status=="R")]
  
  ## Compute Market Level Potential Revenue
  PotentialRevenue <- aggregate(PotentialRevenue ~ date, data=TH_D, FUN=sum, na.rm=TRUE)
  
  ## Compute Market Level Revenue
  Revenue <- aggregate(Revenue ~ date, data=TH_D, FUN=sum, na.rm=TRUE)
  
  ## Merge Revenue Data
  TH_Daily <- merge(PotentialRevenue,Revenue, by = c("date"))
  TH_Daily$RevenueFraction <- as.numeric(TH_Daily$Revenue/TH_Daily$PotentialRevenue)
  TH_Daily$City <- City
  TH_Daily$State <- State
  
  ## SaveRDS
  data_name = paste("output/", y, ".rds", sep="")
  saveRDS(TH_Daily, data_name)
  
  # Plot Potential and Actual Revenue by Date
  p <- ggplot(TH_Daily, aes(x = date)) +
    geom_line(aes(y = PotentialRevenue), colour = 'black') +
    geom_line(aes(y = Revenue), colour = 'red')
  
  # save the plot to a file (pass the plot explicitly instead of relying on last_plot())
  graph_name = paste("output/Revenue_", y, ".png", sep="")
  ggsave(graph_name, plot = p, width = 8, height = 5)
  }
  # print finished city
  print(y)

}
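
Before calling a large function in parallel, it can help to see the foreach pattern on a toy problem. The sketch below is not part of the lab's workflow: it squares four numbers on a two-worker cluster, and the worker count is an arbitrary choice made for illustration.

```r
library(foreach)
library(doParallel)

## Start a small cluster; 2 workers is an arbitrary choice for this sketch
cl <- makeCluster(2)
registerDoParallel(cl)

## Each iteration runs on a worker; .combine = "c" concatenates the results
squares <- foreach(i = 1:4, .combine = "c") %dopar% {
  i^2
}

## Always release the workers when done
stopCluster(cl)

print(squares)
```

Replacing `%dopar%` with `%do%` runs the same loop sequentially, which is often a convenient way to debug the loop body before going parallel.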

## Step 3.2: Running the foreach parallelization

In [43]:
np <- detectCores()
np

In [44]:
## use 4 cores for testing; for a real run, use (np - 1) cores
num_cores <- 4 

## create a fork cluster with this many worker processes (Unix-like systems only)
cl <- makeForkCluster(num_cores)

## register the cluster as the parallel backend for foreach
registerDoParallel(cl)
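
For a real run, the worker count can be derived from the detected core count instead of being hard-coded. The following is a minimal sketch of the (np - 1) rule; the NA guard is a precaution added here, since `detectCores()` is documented to return NA when the count cannot be determined.

```r
library(parallel)

## detectCores() can return NA on some platforms, so fall back to 1
np <- detectCores()
if (is.na(np)) np <- 1L

## Leave one core free, but never go below one worker
num_cores <- max(1L, np - 1L)
print(num_cores)
```

Keep in mind that `detectCores()` reports the cores of the whole machine, which on a shared system may be more than your job is allowed to use.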

In [45]:
## export variables to the workers; the list is empty here because forked workers already inherit the current workspace
clusterExport(cl, c())

In [46]:
## show the starting time
Sys.time()
print("starting parallel")

## running the foreach parallelization
## NOTE: in a real run, use Markets instead of the 5 markets in Markets1.
foreach(i=seq_along(Markets1), .combine='c') %dopar% {
  findModel(Markets1[i], US_P, US_D)
}

## end the parallelization
stopCluster(cl)

## show the ending time
Sys.time()
print("ended parallel")

[1] "2017-07-11 20:01:10 UTC"

[1] "ended parallel"
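
Printing the start and end times works, but the elapsed time is usually the number of interest. Here is a minimal sketch that computes it directly; the `Sys.sleep()` call is only a placeholder for the real workload.

```r
start_time <- Sys.time()

## Placeholder workload; in the lab this would be the foreach loop
Sys.sleep(0.2)

## Elapsed wall-clock time in seconds
elapsed <- difftime(Sys.time(), start_time, units = "secs")
print(elapsed)
```

Comparing this value between a serial run and a parallel run gives a rough idea of the speedup obtained.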


## Step 4: Submit job on Supercomputer

Submitting the job uses a Bash script rather than an R script.
The instructions here assume the Roger supercomputer.
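
On the R side, a script that runs under a batch scheduler should take its worker count from the job request rather than from a value typed into a notebook. The sketch below reads it from an environment variable. The helper `workers_from_env` is hypothetical, and the variable name `PBS_NP` is an assumption: check which variable, if any, your scheduler sets.

```r
## Hypothetical helper: read the worker count from an environment variable.
## "PBS_NP" is an assumption; check which variable your scheduler sets.
workers_from_env <- function(var = "PBS_NP", default = 1L) {
  value <- suppressWarnings(as.integer(Sys.getenv(var, unset = NA)))
  ## Fall back to the default if the variable is unset or not a positive integer
  if (is.na(value) || value < 1L) default else value
}

num_cores <- workers_from_env()
print(num_cores)
```

When the variable is not set, for example in an interactive session, the helper falls back to a single worker, so the same script can run in both settings.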

### Question 1
We have used the "foreach" library to perform parallelization in this exercise. Are there other libraries that can do the same thing? List at least 3 other parallelization libraries.

### Question 2
Pick one of these parallelization libraries and try it in this exercise, using only 4 cores.

In [None]:
## use 4 cores for testing; for a real run, use (np - 1) cores
num_cores <- 4 

## TODO: Use an example you found online or adapt this exercise, but make sure you use at most 4 cores!
## Section 3.2 of this exercise shows the general pattern; your solution should look similar.