# Overview

This example demonstrates how to create several types of data visualizations with Python libraries from data that has been preprocessed using CAS actions. The data consists of 28,347 crime incidents reported in 2021, provided by the City of Washington, DC.

The data can be obtained from https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/Crime_Incidents_in_2021.csv and is originally from https://opendata.dc.gov/datasets/DCGIS::crime-incidents-in-2021/about.

# Load the SWAT Library and Connect to the CAS Server

Load the SWAT library and then create a connection to the CAS server using the CAS function and assign the CAS connection object to the variable s. The first argument specifies the host name, and the second argument specifies the port.

In [None]:
import swat
# change the host and port to match your site
s = swat.CAS("cloud.example.com", 10065)
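
Hard-coding the host and port makes the notebook harder to share between sites. As a minimal sketch, the connection settings could instead be read from environment variables. The variable names CAS_HOST and CAS_PORT and the helper cas_connection_args are hypothetical and not part of SWAT; the fallback values repeat the placeholders used above.

```python
import os

def cas_connection_args(default_host="cloud.example.com", default_port=10065):
    """Return (host, port) for swat.CAS, read from hypothetical env variables."""
    # CAS_HOST and CAS_PORT are illustrative names, not SWAT conventions.
    host = os.environ.get("CAS_HOST", default_host)
    port = int(os.environ.get("CAS_PORT", default_port))
    return host, port

host, port = cas_connection_args()
# The connection call itself is unchanged: s = swat.CAS(host, port)
```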

# Load the Data

A data file can be loaded in two ways. The first is to load the data from the data source portion of a caslib (server-side load). The second is to upload the data from a location that the client can reach and that is not associated with a caslib (client-side load).
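
Both loading methods described in this section accept an importOptions dictionary that tells CAS how to parse the CSV file. As a minimal sketch, the options could be built once and passed to either call. The helper csv_import_options is hypothetical; the keys and values are the ones used in the calls in the two subsections that follow.

```python
def csv_import_options(guess_rows=30000):
    """Build one importOptions dictionary to pass to either load call."""
    return {
        "fileType": "csv",
        "encoding": "latin1",
        # Number of rows scanned to infer column types and lengths.
        "guessRows": guess_rows,
        # Read ward as character data, as in the load calls in this section.
        "vars": {"ward": {"type": "varchar"}},
    }

opts = csv_import_options()
# Hypothetical usage with the connection s from earlier:
#   s.table.loadTable(path="Crime_Incidents_in_2021.csv", caslib="casuser",
#                     casOut={"name": "crimes", "replace": True},
#                     importOptions=opts)
```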

## Load the Data from a Caslib

The default method of loading data is to load the data from the data source portion of a caslib, which is known as a server-side load. This requires the data file to be saved in the active caslib (Casuser). Once the file has been saved to the caslib, use the table.loadTable action to load the Crime_Incidents_in_2021.csv file from the data source portion of the caslib into memory as a CAS table named crimes.

In [None]:
s.table.loadTable(path="Crime_Incidents_in_2021.csv",
                  caslib="casuser",
                  casOut={"name":"crimes", 
                          "replace":True},
                  importOptions={"fileType":"CSV",
                                 "encoding":"latin1",
                                 "guessrows":30000,
                                 "vars":{"ward":{"type":"varchar"}}})

## Load a Client-Side Data File into CAS

Another method of loading data into CAS memory is to upload it from a source that the client can reach, such as a local file or a URL. This example uses the SWAT upload_file method to perform a client-side load.

In [None]:
s.upload_file("https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/Crime_Incidents_in_2021.csv",
              casOut={"name":"crimes",
                      "caslib":"casuser",
                      "replace":True},
              importOptions={"fileType":"csv",
                             "encoding":"latin1",
                             "guessRows":10000,
                             "vars":{"ward":{"type":"varchar"}}})

# Explore the Data

## Create a Reference to an In-Memory Table

Use the CASTable function to reference the crimes table and save the result in an object named tbl. Any action or method that is run on the CASTable object tbl then includes the table parameters stored in tbl.

In [None]:
tbl = s.CASTable(name='crimes', caslib='casuser')

## Examine the Rows

Use the table.fetch action to retrieve the first five rows from the crimes table.

In [None]:
tbl.fetch(to=5)

## Examine the Columns

Use the table.columnInfo action to obtain metadata about the table. The result includes the name of each column and information about it, including its label (if applicable), type, length, and format.

In [None]:
tbl.columnInfo()

## Examine Unique and Missing Values

Run the simple.distinct action to identify the number of distinct values and the number of missing values for each column.

In [None]:
tbl.distinct()
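
To make the two quantities concrete, here is a minimal pure-Python sketch that computes them for a few made-up rows. It does not show how CAS computes them. The sample values are invented, and the sketch counts only non-missing values as distinct, which may differ from how the CAS action treats missing values.

```python
import csv
import io

# Made-up sample rows; the column names mirror the crimes table, the values do not.
sample_csv = """SHIFT,WARD,BLOCK
DAY,1,100 BLOCK OF A ST
EVENING,,200 BLOCK OF B ST
DAY,2,
MIDNIGHT,1,100 BLOCK OF A ST
"""

def distinct_and_missing(text):
    """Return {column: (n_distinct_nonmissing, n_missing)} for CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    summary = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        # An empty field is treated as a missing value in this sketch.
        missing = sum(1 for v in values if v == "")
        distinct = len({v for v in values if v != ""})
        summary[col] = (distinct, missing)
    return summary

summary = distinct_and_missing(sample_csv)
```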

# Visualize the Data 

The SWAT package and other libraries in Python can be used to visualize data that has been preprocessed using CAS actions. This example demonstrates how to create several common types of data visualizations, including a bar chart, histogram, box plot, and line plot.

## Import Libraries

Import the pandas, matplotlib, and seaborn libraries.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Create a Bar Chart 

Create a bar chart that shows the number of crime incidents by the shift variable.

### Aggregate the Data 

First, use the simple.freq action to count the number of rows by shift and save the result to an object named tbl_shift. The resulting tbl_shift object is a dictionary containing the key Frequency, which is associated with the output frequency table stored in a SASDataFrame.

In [None]:
tbl_shift = s.simple.freq(tbl,
                          inputs={"SHIFT"})
tbl_shift

### Save the Data Frame

Use the Frequency key in the tbl_shift dictionary to select the output frequency table and assign it to an object named df_shift. The object df_shift now contains a SASDataFrame, which can be used like a pandas DataFrame.

In [None]:
df_shift = tbl_shift["Frequency"]
df_shift
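
Conceptually, the frequency table holds one row per level of the input column together with its row count. The following sketch reproduces that idea on the client with made-up values. It is an illustration only and not how the simple.freq action is implemented.

```python
from collections import Counter

# Made-up SHIFT values standing in for one column of the crimes table.
shifts = ["DAY", "EVENING", "DAY", "MIDNIGHT", "EVENING", "EVENING"]

# Count rows per level, which is the core of a one-way frequency table.
counts = Counter(shifts)

# List of (level, frequency) pairs in descending order of frequency.
freq_desc = counts.most_common()
```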

### Plot the Data

Sort the values of df_shift in descending order by the Frequency column. Use the pandas plot.bar method to create a bar chart with FmtVar (shift) on the X-axis and Frequency on the Y-axis. Specify axis labels, title, and subtitle.

In [None]:
df_shift_sorted = df_shift.sort_values('Frequency', ascending=False)

df_ShiftVFreq = df_shift_sorted.plot.bar(x="FmtVar", y="Frequency")
df_ShiftVFreq.set_xlabel("Shift")
df_ShiftVFreq.set_ylabel("Number of Crime Incidents")
plt.suptitle("Number of Crime Incidents by Shift")
plt.title("Washington, DC 2021")

## Create a Histogram

Create a histogram that shows how the number of crime incidents is distributed across voting precincts.

### Aggregate the Data

Run the simple.freq action to count the number of rows by the voting precinct and save the result to an object named tbl_vp. The resulting tbl_vp object is a dictionary containing the key Frequency, which is associated with the output frequency table stored in a SASDataFrame.

In [None]:
tbl_vp = s.simple.freq(tbl,
                       inputs={"VOTING_PRECINCT"})
tbl_vp

### Save the Data Frame

Use the Frequency key in the tbl_vp dictionary to select the output frequency table and assign it to an object named df_vp. The object df_vp now contains a SASDataFrame, which can be used like a pandas DataFrame.

In [None]:
df_vp = tbl_vp["Frequency"]
df_vp

### Plot the Data

Use the plt.subplots function and the hist method to plot the Frequency column of df_vp. Specify axis labels, title, and subtitle. 

In [None]:
fig, ax = plt.subplots()
ax.hist(df_vp["Frequency"], bins=60)
ax.set_xlabel("Crime Incidents")
ax.set_ylabel("Count of Voting Precincts")
plt.suptitle("Distribution of Crime Incidents")
plt.title("Washington, DC 2021")
plt.show()
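
The bins argument splits the range between the smallest and largest value into equal-width intervals and counts the values that fall into each one. Here is a minimal sketch of that binning with made-up counts and four bins instead of 60. The helper equal_width_histogram is hypothetical and assumes the data contains at least two different values.

```python
def equal_width_histogram(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The last interval is closed on the right, so the maximum lands in it.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

# Made-up incident counts for ten precincts.
incidents = [5, 12, 18, 20, 25, 31, 40, 44, 90, 105]
counts = equal_width_histogram(incidents, bins=4)
```

With these values each interval is 25 units wide, so the long right tail shows up as an empty third bin followed by two precincts in the last bin.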

## Create a Box Plot

Create a box plot that shows summary statistics for the number of crime incidents among neighborhood clusters by ward.

### Aggregate the Data 

Load the aggregation action set. Specify neighborhood cluster and ward as the columns to use in the groupBy parameter for tbl. Use the aggregate action to count the number of crime incidents by neighborhood cluster and ward and save the resulting output table as CrimesByNCWard. The result shows a DataFrame with information about the output CAS table CrimesByNCWard.

In [None]:
s.builtins.loadActionSet("aggregation")

tbl.groupBy = [{"name":"NEIGHBORHOOD_CLUSTER"}, 
               {"name":"WARD"}]

tbl.aggregate(varSpecs=[{"name":"CCN",
                         "subset":{"N"}}],
              casOut={"name":"CrimesByNCWard",
                      "caslib":"casuser",
                      "replace":True})

del tbl.groupBy

### View the Aggregated Data

Run the table.fetch action to return all 63 rows of the CrimesByNCWard table and save the result to tbl_nc_ward. The resulting tbl_nc_ward object is a dictionary containing the key Fetch, which is associated with the output aggregated data stored in a SASDataFrame.

In [None]:
tbl_nc_ward = s.table.fetch(table={"name":"CrimesByNCWard"},
                            to=63)
tbl_nc_ward

### Save the Data Frame

Use the Fetch key in the tbl_nc_ward dictionary to select the aggregated output table and assign it to an object named df_nc_ward. The object df_nc_ward now contains a SASDataFrame, which can be used like a pandas DataFrame.

In [None]:
df_nc_ward = tbl_nc_ward["Fetch"]

### Plot the Data

Use the sns.boxplot function from the seaborn library to create vertical box plots that show summary statistics for the number of crime incidents per neighborhood cluster, grouped by ward. Specify WARD_f as the X-axis variable and _CCN_Summary_NObs_ as the Y-axis variable. Specify the axis labels, title, and subtitle.

In [None]:
g = sns.boxplot(x=df_nc_ward["WARD_f"],
                y=df_nc_ward["_CCN_Summary_NObs_"])
g.set(xlabel="Ward",
      ylabel="Number of Crime Incidents")
plt.suptitle("Box Plot of Crime Incidents by Ward")
plt.title("Neighborhood Clusters, Washington, DC")
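
Each box summarizes the cluster-level counts within one ward. As a rough illustration, the sketch below groups made-up rows by ward and computes the minimum, quartiles, and maximum for each group. The rows and the helper five_number_summary are hypothetical. The whiskers that seaborn draws by default reach only to the most extreme points within 1.5 times the interquartile range, so they can differ from the minimum and maximum computed here.

```python
from collections import defaultdict
from statistics import quantiles

# Made-up (neighborhood cluster, ward, incident count) rows.
rows = [
    ("Cluster 1", "1", 10), ("Cluster 2", "1", 20), ("Cluster 3", "1", 30),
    ("Cluster 4", "1", 40), ("Cluster 5", "1", 50),
    ("Cluster 6", "2", 5), ("Cluster 7", "2", 15), ("Cluster 8", "2", 25),
]

# Collect the cluster-level counts for each ward.
by_ward = defaultdict(list)
for _cluster, ward, count in rows:
    by_ward[ward].append(count)

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using linear interpolation."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

summaries = {ward: five_number_summary(v) for ward, v in by_ward.items()}
```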

## Create a Line Plot

Create a line plot to show the number of crime incidents by month in 2021.

### Create a Date Variable

To aggregate data by a date variable, the date column must first be converted from character to a numeric type with a date format.
First, use the SWAT eval method to create a computed column named date: scan takes the text in front of the "+" delimiter, inputn reads it with the informat 'ANYDTDTM.', and datepart keeps only the date.
Use the copyTable action to create a new table named crimes_date from the crimes table.
Use the CASTable function to create a reference to the crimes_date table and save the result to an object named tbl_date.
Use the alterTable action to assign the display format 'date9.' to the date column in the crimes_date table.

In [None]:
tbl.eval("date = datepart(inputn(scan(REPORT_DAT,1,'+'), 'ANYDTDTM.'))")

In [None]:
tbl.copyTable(casout={'name':'crimes_date','caslib':'casuser', 'replace':True})

In [None]:
tbl_date = s.CASTable(name='crimes_date', caslib='casuser')

In [None]:
tbl_date.alterTable(columns = [{'name':'date', 'format':'date9.'}])

In [None]:
tbl_date.columnInfo()
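
The same conversion can be illustrated on the client with the standard library. This is a minimal sketch that assumes REPORT_DAT values look like '2021/03/15 14:22:10+00'. That layout and the helper report_date are assumptions for illustration, and unlike the ANYDTDTM. informat the sketch accepts only this one layout.

```python
from datetime import date, datetime

def report_date(report_dat):
    """Return the calendar date from a 'YYYY/MM/DD HH:MM:SS+00' style string."""
    # Keep the text in front of the first '+', like scan(REPORT_DAT, 1, '+').
    stamp = report_dat.split("+")[0]
    # Parse the datetime, then drop the time of day, like datepart().
    return datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S").date()

# Made-up value in the assumed layout.
example = report_date("2021/03/15 14:22:10+00")
```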

### Aggregate the Data

Run the aggregation.aggregate action to count the number of rows by month and save the resulting output as a table named CrimesByMonth.

In [None]:
s.aggregation.aggregate(table={"name":"crimes_date",
                               "caslib":"casuser",
                               "groupBy":{"date"}},
                        varSpecs=[{"name":"CCN",
                                   "subset":{"N"}}],
                        ID="date",
                        interval="MONTH",
                        casOut={"name":"CrimesByMonth",
                                "caslib":"casuser",
                                "replace":True})

### View the Rows

Run the table.fetch action to return only the first 12 rows from the CrimesByMonth table and assign the result to the object tbl_month. This will exclude the one row containing data from January 2022.

In [None]:
tbl_month = s.table.fetch(table={"name":"CrimesByMonth"},
                          to=12)
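
The monthly grouping and the exclusion of the January 2022 row can be pictured with a small client-side sketch. The dates are made up, and filtering on the year is shown here only as an alternative to limiting the number of fetched rows. It is not what the notebook does.

```python
from collections import Counter
from datetime import date

# Made-up report dates, including one that falls in January 2022.
dates = [
    date(2021, 1, 5), date(2021, 1, 20), date(2021, 2, 3),
    date(2021, 12, 31), date(2022, 1, 1),
]

# Map every date to a "YYYY-MM" label and count rows per month.
by_month = Counter(d.strftime("%Y-%m") for d in dates)

# Keep only 2021 by testing the year instead of relying on row order.
by_month_2021 = {m: n for m, n in sorted(by_month.items()) if m.startswith("2021")}
```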

### Save the Data Frame

Use the Fetch key in the tbl_month dictionary to select the aggregated output table and assign it to an object named df_month. The object df_month now contains a SASDataFrame, which can be used like a pandas DataFrame.

In [None]:
df_month = tbl_month["Fetch"]

### Plot the Data

Convert the date column to datetime and then use the dt.to_period method to extract only the month and year values from the date column and store the values in a column named month_year.

In [None]:
df_month['date'] = pd.to_datetime(df_month['date'])
df_month['month_year'] = df_month['date'].dt.to_period('M')

Use the plot.line pandas method to create a line plot with month_year as the X-axis variable and _CCN_Summary_NObs_ as the Y-axis variable. Specify axis labels, title, and subtitle. 

In [None]:
CrimesVMonth = df_month.plot.line(x="month_year", y="_CCN_Summary_NObs_", rot=45)
CrimesVMonth.set_xlabel("Month")
CrimesVMonth.set_ylabel("Number of Crime Incidents")
plt.suptitle("Number of Crime Incidents by Month")
plt.title("Washington, DC 2021")