# Overview

The following example demonstrates a data science workflow from start to finish using real-world hotel booking demand data. The data files are named H1.csv and H2.csv and can be obtained from https://support.sas.com/documentation/onlinedoc/viya/examples.htm. The data were originally published at https://www.sciencedirect.com/science/article/pii/S2352340918315191.

# Load the SWAT Library and Connect to the CAS Server

Load the SWAT library, then create a connection to the CAS server with the CAS function and assign the connection object to the variable s. The first argument specifies the host name, and the second argument specifies the port.

In [None]:
import swat
# change the host and port to match your site
s = swat.CAS("cloud.example.com", 10065)
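
The host and port can also be kept out of the notebook. The following is a minimal sketch, assuming the site settings are stored in environment variables; the names CAS_HOST and CAS_PORT are hypothetical and chosen only for this illustration.

```python
import os

def cas_connection_settings(env=os.environ):
    """Return (host, port) for the CAS connection.

    CAS_HOST and CAS_PORT are hypothetical variable names used for this
    sketch; the fallbacks are the placeholder values from the cell above.
    """
    host = env.get("CAS_HOST", "cloud.example.com")
    port = int(env.get("CAS_PORT", "10065"))
    return host, port

# An empty mapping falls back to the placeholder defaults.
host, port = cas_connection_settings({})
print(host, port)

# With a reachable server, the values would be passed on unchanged:
# s = swat.CAS(host, port)
```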

# Load the Data

## Load the Data from a Caslib

There are two data files, H1 and H2, that first need to be imported and then combined (appended) together. The default method of loading data is to load the data from the data source portion of a caslib, which is known as a server-side load. This requires the data files to be saved in the active caslib (Casuser). Once the files have been saved to the caslib, use a table.loadTable action for each CSV file to load the data files into memory.

In the importOptions parameter, specify CSV for the fileType.
For encoding, specify latin1.
For guessRows, specify a number that includes all rows in the data set, such as 100,000 so that all rows will be scanned to determine the appropriate data type for each column.

In [None]:
s.table.loadTable(path="H1.csv",
                  caslib="casuser",
                  casOut={"name":"H1",
                          "caslib":"casuser",
                          "replace":True},
                  importOptions={"fileType":"csv",
                                 "encoding":"latin1",
                                 "guessRows":"100000"})

s.table.loadTable(path="H2.csv",
                  caslib="casuser",
                  casOut={"name":"H2",
                          "caslib":"casuser",
                          "replace":True},
                  importOptions={"fileType":"csv",
                                 "encoding":"latin1",
                                 "guessRows":"100000"})

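The two loadTable calls above differ only in the file name. One way to avoid the duplication is to build the parameters once, as in the sketch below. The helper load_csv_params is hypothetical (it is not part of SWAT) and only assembles the same parameters shown above into a dictionary.

```python
def load_csv_params(name, caslib="casuser", guess_rows=100000):
    """Build the keyword arguments for table.loadTable for one CSV file.

    Hypothetical helper for this sketch; it mirrors the parameters used
    in the cell above and does not contact the server.
    """
    return {
        "path": f"{name}.csv",
        "caslib": caslib,
        "casOut": {"name": name, "caslib": caslib, "replace": True},
        "importOptions": {"fileType": "csv",
                          "encoding": "latin1",
                          "guessRows": guess_rows},
    }

params = {name: load_csv_params(name) for name in ("H1", "H2")}
print(params["H2"]["path"])

# With an open connection s, each file would be loaded like this:
# for p in params.values():
#     s.table.loadTable(**p)
```
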
# Explore the Data

## Count the Number of Rows

Count the number of rows in each table. Use the table.recordCount action to verify that the H1 and H2 data files were imported with the correct number of rows. H1 should have 40,060 rows and H2 should have 79,330 rows.

In [None]:
s.table.recordCount(table={"caslib":"casuser",
                           "name":"H1"})

s.table.recordCount(table={"caslib":"casuser",
                           "name":"H2"})

## Examine the Columns

Examine the columns in each table using the table.columnInfo action to make sure the columns have the correct type. The result includes the names of columns, and information about each column, including its label (if applicable), type, length, and format.

In [None]:
s.table.columnInfo(table="H1")

In [None]:
s.table.columnInfo(table="H2")

## Specify Column Type When Loading Data

By default, the children column is imported as a double type in table H1, but it is imported as a varchar type in table H2 (due to missing values). A column must have the same type in both tables for the tables to be appended. Therefore, it is necessary to add a vars subparameter to the loadTable action for H2 so that the children column is imported as a double type. Then use the columnInfo action to confirm the column types in table H2.

In [None]:
s.table.loadTable(path="H2.csv",
                  caslib="casuser",
                  casOut={"name":"H2",
                          "caslib":"casuser",
                          "replace":True},
                  importOptions={"fileType":"csv",
                                 "encoding":"latin1",
                                 "guessRows":"100000",
                                 "vars":{"children":{"type":"double"}}})

s.table.columnInfo(table="H2")

The column information for table H2 now correctly specifies that the Children column has a double type.
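
The effect of a text token on type detection can be reproduced locally with pandas. This is only an analogy, not a description of how CAS scans the file, and it assumes the missing values are written as a token such as NA (an assumption about the file contents). The small CSV below is made up for the illustration.

```python
import io
import pandas as pd

# Made-up sample; "NA" stands in for a missing value written as text.
csv_text = "adults,children\n2,0\n2,NA\n1,1\n"

# Treat "NA" as ordinary text: the column can no longer be read as numeric.
as_text = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Force a numeric type; the non-numeric token becomes a missing value (NaN).
as_number = as_text.assign(
    children=pd.to_numeric(as_text["children"], errors="coerce"))

print(as_text["children"].dtype, as_number["children"].dtype)
```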

# Prepare the Data

## Create a New Column to Identify Hotel Type

Table H1 contains booking information about a resort hotel and table H2 contains booking information about a city hotel. Prior to appending the tables, use the table.copyTable action with the computedVars and computedVarsProgram subparameters to create a new column in each table named "hotel" to identify whether the bookings come from the resort hotel or city hotel. Set the values of hotel to "R" in table H1 to represent the resort hotel and "C" in table H2 to represent the city hotel.

In [None]:
s.table.copyTable(casOut={"caslib":"casuser", 
                          "name":"H1_new", 
                          "replace":True},
                   table={"caslib":"casuser", 
                          "name":"H1", 
                          "computedVars":{"name":"hotel"}, 
                          "computedVarsProgram":"hotel='R'"})

In [None]:
s.table.copyTable(casOut={"caslib":"casuser", 
                          "name":"H2_new", 
                          "replace":True},
                   table={"caslib":"casuser", 
                          "name":"H2", 
                          "computedVars":{"name":"hotel"}, 
                          "computedVarsProgram":"hotel='C'"})

## Append the Tables

The next step is to append the two tables. Use table.append to append the rows of the source table H1_new to the target table H2_new. Then use table.alterTable to rename the appended table H2_new to hotel_bookings.

The target parameter specifies H2_new as the table that will have the source table appended to it.
The source parameter specifies H1_new as the table that will be appended to the target table.

In [None]:
s.table.append(target={"caslib":"casuser", 
                       "name":"H2_new"},
               source={"caslib":"casuser", 
                       "name":"H1_new"})

s.table.alterTable(name="H2_new", 
                   caslib="casuser", 
                   rename="hotel_bookings")
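
For readers who think in pandas terms, the two steps above (adding the hotel column, then appending) correspond roughly to assign and concat. The sketch below runs on small made-up DataFrames on the client. It is an analogy for the logic, not a replacement for the server-side actions.

```python
import pandas as pd

# Made-up stand-ins for the two tables.
h1 = pd.DataFrame({"adults": [2, 1], "children": [0.0, 1.0]})
h2 = pd.DataFrame({"adults": [2, 2, 3], "children": [0.0, None, 2.0]})

# Tag each table with its hotel type, as computedVarsProgram does above.
h1_new = h1.assign(hotel="R")
h2_new = h2.assign(hotel="C")

# Append the source (h1_new) to the target (h2_new).
hotel_bookings = pd.concat([h2_new, h1_new], ignore_index=True)
print(hotel_bookings["hotel"].value_counts().to_dict())
```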

## Examine Column Information and Count Distinct and Missing Values

Check the column data types and check for missing values. Use columnInfo to check each column's data type. Use simple.distinct to count the distinct and missing values in each column.

In [None]:
s.table.columnInfo(table="hotel_bookings")

In [None]:
s.simple.distinct(table="hotel_bookings")

The simple.distinct action shows that there are 4 missing values in the Children column.

## Replace Missing Values with Zeros

Use the table.update action to replace missing values in the Children column with zero. Use a simple.distinct action to ensure that there are no missing values.

In [None]:
s.table.update(table={"name":"hotel_bookings",
                      "caslib":"casuser",
                      "where":"Children is null"},
                 set=[{"var":"Children", 
                       "value":"0"}])

s.simple.distinct(table={"name":"hotel_bookings",
                         "caslib":"casuser",
                         "vars":[{"name":"children"}]})

## Subset the Data to Exclude Invalid Rows

Use a table.copyTable action with an expression in the where parameter to keep only the rows where children, adults, or babies is greater than zero. This removes rows where the adults, babies, and children columns are all zero, because a booking with no guests is not valid. Then use table.recordCount to see how many rows remain in the subset.

In [None]:
s.table.copyTable(table={"name":"hotel_bookings",
                         "where":"children > 0 | adults > 0 | babies > 0"},
                  casOut={"name":"hotel_bookings_subset", 
                          "replace":True})

In [None]:
s.table.recordCount(table={"caslib":"casuser", 
                           "name":"hotel_bookings_subset"})

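The logic of the last two steps (replace missing values, then keep rows with at least one guest) can be checked on a small made-up DataFrame. This is a client-side pandas sketch of the same conditions, not the CAS actions themselves.

```python
import pandas as pd

# Made-up rows covering the cases of interest.
bookings = pd.DataFrame({
    "adults":   [2, 0, 1, 0],
    "children": [0.0, None, None, 0.0],
    "babies":   [0, 0, 1, 0],
})

# Replace missing values in children with zero.
bookings["children"] = bookings["children"].fillna(0)

# Keep a row if any of the three guest counts is greater than zero.
has_guest = ((bookings["children"] > 0)
             | (bookings["adults"] > 0)
             | (bookings["babies"] > 0))
subset = bookings[has_guest]
print(len(bookings), len(subset))
```
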
# Analyze the Data

## Create a Frequency Table

Load the freqTab action set and use the freqTab.freqTab action to create a frequency distribution for country and include only bookings that have not been canceled.

In the order parameter, specify FREQ to sort rows by descending frequency count.
In the vars subparameter, specify country as the column that the frequencies will be calculated on.
Use the where subparameter to subset the data so that frequencies are calculated only on bookings that are not canceled.

In [None]:
s.loadActionSet("freqTab")

s.freqTab.freqTab(table={"caslib":"casuser",
                         "name":"hotel_bookings_subset",
                         "vars":[{"name":"country"}],
                         "where":"iscanceled = 0"},
                  order="FREQ")
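
As a rough client-side counterpart, the same kind of table (frequencies of country for bookings that are not canceled, sorted by descending count) can be sketched with pandas value_counts. The rows below are made up for the illustration.

```python
import pandas as pd

# Made-up rows; iscanceled = 1 marks a canceled booking.
bookings = pd.DataFrame({
    "country":    ["PRT", "GBR", "PRT", "ESP", "PRT", "GBR", "PRT"],
    "iscanceled": [0, 0, 1, 0, 0, 0, 0],
})

# Keep only bookings that were not canceled, then count by country.
not_canceled = bookings[bookings["iscanceled"] == 0]
freq = not_canceled["country"].value_counts()  # descending by default
print(freq)
```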

## Calculate Summary Statistics by Month and Hotel Type

Examine how the nightly price varies over the year. Use simple.summary to calculate the average (mean) daily rate of bookings by month for each hotel type (Resort and City), excluding canceled bookings, and save the results to separate tables for each hotel type, named "bookings_summary_resort" and "bookings_summary_city". For each simple.summary action:

In the table parameter, specify arrivaldatemonth as the column for the groupBy subparameter so that statistics are calculated for each month in the resulting output table.

In the where subparameter, specify an expression that selects only bookings that are not canceled and the type of hotel.
In the inputs parameter, specify adr so that statistics are calculated on this column.

In the subset parameter, specify MEAN to calculate the mean of the column specified in the inputs parameter, adr.

Use a table.fetch action to fetch the arrivaldatemonth and _Mean_ columns from the output tables. The months are returned in alphabetical order, and will need to be sorted in the correct order.

In [None]:
s.simple.summary(table={"caslib":"casuser",
                        "name":"hotel_bookings_subset",
                        "groupBy":[{"name":"arrivaldatemonth"}],
                        "where":"iscanceled = 0 & hotel='R'"},
                 inputs={"adr"},
                 subset={"MEAN"},
                 casout={"name":"bookings_summary_resort", 
                         "replace":True})

s.simple.summary(table={"caslib":"casuser",
                        "name":"hotel_bookings_subset",
                        "groupBy":[{"name":"arrivaldatemonth"}],
                        "where":"iscanceled = 0 & hotel='C'"},
                 inputs={"adr"},
                 subset={"MEAN"},
                 casout={"name":"bookings_summary_city", 
                         "replace":True})

s.table.fetch(table={"name":"bookings_summary_resort"},
              # a list, unlike a set, keeps the columns in this order
              fetchVars=["arrivaldatemonth", 
                         "_Mean_"])

In [None]:
s.table.fetch(table={"name":"bookings_summary_city"},
              # a list, unlike a set, keeps the columns in this order
              fetchVars=["arrivaldatemonth", 
                         "_Mean_"])

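Once a result has been fetched to the client, the alphabetical month order can also be corrected in pandas with an ordered categorical. The sketch below uses a made-up DataFrame in place of the fetched table; the values are illustrative and are not results from the data.

```python
import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Stand-in for a fetched summary table (illustrative values only).
fetched = pd.DataFrame({
    "ArrivalDateMonth": ["April", "August", "December", "February"],
    "_Mean_": [75.0, 180.0, 68.0, 54.0],
})

# An ordered categorical sorts by calendar position instead of by spelling.
fetched["ArrivalDateMonth"] = pd.Categorical(
    fetched["ArrivalDateMonth"], categories=MONTHS, ordered=True)
in_calendar_order = fetched.sort_values("ArrivalDateMonth").reset_index(drop=True)
print(in_calendar_order)
```
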
## Sort Results by Month

To display the tables with the month column sorted in order, first use the table.copyTable action with the computedVars and computedVarsProgram parameters to create a numeric column named "monthno" containing the month number for each month based on the value of arrivaldatemonth. The result is saved to a table named "hotel_bookings_subset_monthno". For the simple.summary actions, include monthno in the groupBy parameter so that the fetched tables can be sorted by month number.

In [None]:
s.table.copyTable(casout={"caslib":"casuser", 
                          "name":"hotel_bookings_subset_monthno", 
                          "replace":True},
                  table={"caslib":"casuser", 
                         "name":"hotel_bookings_subset", 
                         "computedVars":[{"name":"monthno"}],
                         "computedVarsProgram":"""if arrivaldatemonth='January' then monthno=1;
                                                  else if arrivaldatemonth='February' then monthno=2;
                                                  else if arrivaldatemonth='March' then monthno=3;
                                                  else if arrivaldatemonth='April' then monthno=4;
                                                  else if arrivaldatemonth='May' then monthno=5;
                                                  else if arrivaldatemonth='June' then monthno=6;
                                                  else if arrivaldatemonth='July' then monthno=7;
                                                  else if arrivaldatemonth='August' then monthno=8;
                                                  else if arrivaldatemonth='September' then monthno=9;
                                                  else if arrivaldatemonth='October' then monthno=10;
                                                  else if arrivaldatemonth='November' then monthno=11;
                                                  else if arrivaldatemonth='December' then monthno=12"""})

s.simple.summary(table={"caslib":"casuser",
                        "name":"hotel_bookings_subset_monthno",
                        "groupBy":[{"name":"arrivaldatemonth"}, 
                                   {"name":"monthno"}, 
                                   {"name":"hotel"}],
                        "where":"iscanceled = 0 & hotel='R'"},
                inputs={"adr"},
                subset={"MEAN"},
                casout={"name":"bookings_summary_resort_monthno", 
                        "replace":True})

s.simple.summary(table={"caslib":"casuser",
                        "name":"hotel_bookings_subset_monthno",
                        "groupBy":[{"name":"arrivaldatemonth"}, 
                                   {"name":"monthno"}, 
                                   {"name":"hotel"}],
                        "where":"iscanceled = 0 & hotel='C'"},
                inputs={"adr"},
                subset={"MEAN"},
                casout={"name":"bookings_summary_city_monthno", 
                        "replace":True})

s.table.alterTable(caslib="casuser",
                   columns=[{"name":"_Mean_", "rename":"Mean Average Daily Rate"}],
                   name="bookings_summary_resort_monthno")

s.table.alterTable(caslib="casuser",
                   columns=[{"name":"_Mean_", "rename":"Mean Average Daily Rate"}],
                   name="bookings_summary_city_monthno")

resort = s.table.fetch(table={"name":"bookings_summary_resort_monthno"},
                       # a list, unlike a set, keeps the columns in this order
                       fetchVars=["arrivaldatemonth", 
                                  "Mean Average Daily Rate", 
                                  "hotel"],
                       sortBy=[{"name":"monthno"}])

city = s.table.fetch(table={"name":"bookings_summary_city_monthno"},
                     fetchVars=["arrivaldatemonth", 
                                "Mean Average Daily Rate", 
                                "hotel"],
                     sortBy=[{"name":"monthno"}])

display(resort, city)
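
The twelve-branch if/else chain passed to computedVarsProgram above can also be generated rather than typed by hand. The following sketch uses a hypothetical helper, monthno_program, that builds the same program text from a list of month names.

```python
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def monthno_program(column="arrivaldatemonth", target="monthno"):
    """Build the if/else-if chain that maps month names to month numbers.

    Hypothetical helper for this sketch; it only produces a string.
    """
    clauses = [f"if {column}='{month}' then {target}={number}"
               for number, month in enumerate(MONTHS, start=1)]
    # Join the clauses as "...; else if ...", matching the chain above.
    return ";\nelse ".join(clauses)

program = monthno_program()
print(program)
```

The resulting string could then be supplied as the value of computedVarsProgram in the copyTable call.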

# Visualize the Data

## Create Line Charts to Visualize Results by Month

The pandas plot.line method can be used to visualize data that has been preprocessed using CAS actions. This method calls the matplotlib library through pandas, so a chart can be drawn directly from a DataFrame with less code than calling matplotlib functions directly. The method is used here to create line charts that show the mean average daily rate by month for each hotel type.

In [None]:
resort_df = resort["Fetch"]
resortMonthVBookings = resort_df.plot.line(x="ArrivalDateMonth", y="Mean Average Daily Rate", rot=90)
resortMonthVBookings.set_xlabel("Month")
resortMonthVBookings.set_ylabel("Mean Average Daily Rate")
resortMonthVBookings.set_title("Mean Average Daily Rate by Month for Resort Hotel")

In [None]:
city_df = city["Fetch"]
cityMonthVBookings = city_df.plot.line(x="ArrivalDateMonth", y="Mean Average Daily Rate", rot=90)
cityMonthVBookings.set_xlabel("Month")
cityMonthVBookings.set_ylabel("Mean Average Daily Rate")
cityMonthVBookings.set_title("Mean Average Daily Rate by Month for City Hotel")

The sns.lineplot function from the Seaborn library can be used to create a grouped line chart with the month variable on the x-axis and the hotel type as the group variable. Unlike the pandas method, sns.lineplot does not require reshaping the data from long to wide. Before creating the line chart, append the summary tables for the resort and city hotels, rename the appended table, and then fetch its rows and save the result as an object named bookings_appended.

In [None]:
s.table.append(target={"caslib":"casuser",
                       "name":"bookings_summary_city_monthno"},
               source={"caslib":"casuser",
                       "name":"bookings_summary_resort_monthno"})            

In [None]:
s.table.alterTable(name="bookings_summary_city_monthno",
                   rename="bookings_summary_appended",
                   caslib="casuser")

In [None]:
bookings_appended = s.table.fetch("bookings_summary_appended",
                                  to=25)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
appended_df = bookings_appended["Fetch"]
g = sns.lineplot(data=appended_df, 
                 x='monthno', 
                 y='Mean Average Daily Rate', 
                 hue='hotel')
g.set(xlabel="Month",
      ylabel="Mean Average Daily Rate")
plt.xticks(rotation=70)
plt.title("Mean Average Daily Rate by Month and Hotel Type")
plt.legend(title='Hotel Type')
g.set_xticks([1, 2, 3, 
              4, 5, 6, 
              7, 8, 9, 
              10, 11, 12])
g.set_xticklabels(["Jan", "Feb", "Mar", 
                   "Apr", "May", "Jun", 
                   "Jul", "Aug", "Sep", 
                   "Oct", "Nov", "Dec"])
plt.show()
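
For comparison, the long-to-wide reshaping that the Seaborn approach avoids can be sketched with pandas pivot. The DataFrame below is a made-up stand-in for the appended summary table, and its values are illustrative only.

```python
import pandas as pd

# Long format: one row per (month, hotel) pair. Illustrative values only.
long_df = pd.DataFrame({
    "monthno": [1, 1, 2, 2],
    "hotel": ["C", "R", "C", "R"],
    "Mean Average Daily Rate": [80.0, 50.0, 85.0, 55.0],
})

# Wide format: one row per month and one column per hotel type.
wide_df = long_df.pivot(index="monthno",
                        columns="hotel",
                        values="Mean Average Daily Rate")
print(wide_df)
```

With the data in this wide shape, wide_df.plot.line() draws one line per hotel column, which is the pandas counterpart of the hue argument used above.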