# Overview

This example demonstrates how to create several types of data visualizations with Python libraries, using data that has been preprocessed with CAS actions. The data for this example consists of 28,347 crime incidents reported in 2021, provided by the City of Washington, DC.

The data can be obtained from https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/Crime_Incidents_in_2021.csv and is originally from https://opendata.dc.gov/datasets/DCGIS::crime-incidents-in-2021/about.

# Load the SWAT Library and Connect to the CAS Server

Load the SWAT library, create a connection to the CAS server with swat.CAS, and assign the CAS connection object to the variable s. The first argument specifies the host name, and the second argument specifies the port.

In [None]:
import swat
# change the host and port to match your site
s = swat.CAS("cloud.example.com", 10065)
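
To avoid editing the notebook for each site, the host and port can be read from the environment before connecting. The following is a minimal sketch, not part of the original example. The variable names CAS_HOST and CAS_PORT and the helper cas_connection_params are hypothetical and chosen only for illustration.

```python
import os

def cas_connection_params(default_host="cloud.example.com", default_port=10065):
    """Return (host, port) for the CAS connection.

    CAS_HOST and CAS_PORT are hypothetical environment variable names
    used only in this sketch.
    """
    host = os.environ.get("CAS_HOST", default_host)
    port = int(os.environ.get("CAS_PORT", default_port))
    return host, port

host, port = cas_connection_params()
# s = swat.CAS(host, port)  # connect as in the cell above
```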

# Load the Data

There are two methods that can be used to load a data file. The first method is to load the data from a caslib (server-side load). The second method is to load a file that is accessible to the client but not associated with a caslib (client-side load).

## Load the Data from a Caslib

The default method of loading data is to load the data from the data source portion of a caslib, which is known as a server-side load. This requires the data file to be saved in the active caslib (Casuser). Once the file has been saved to the caslib, use the table.loadTable action to load the Crime_Incidents_in_2021.csv file from the data source portion of the caslib into memory as a CAS table named crimes.

In [None]:
s.table.loadTable(path="Crime_Incidents_in_2021.csv",
                  caslib="casuser",
                  casOut={"name":"crimes", 
                          "caslib":"casuser",
                          "replace":True},
                  importOptions={"fileType":"CSV",
                                 "encoding":"latin1",
                                 "guessrows":30000,
                                 "vars":{"ward":{"type":"varchar"}}})
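
The guessRows option controls how many rows are scanned when column types are inferred, which is why the value here exceeds the number of rows in the file. The following plain-Python sketch only illustrates the idea. It is not how CAS implements the scan, and guess_type is a hypothetical helper.

```python
def guess_type(values, guess_rows):
    """Guess 'double' if every scanned value parses as a number, else 'varchar'."""
    scanned = values[:guess_rows]
    try:
        for v in scanned:
            float(v)
    except ValueError:
        return "varchar"
    return "double"

# Hypothetical column: numeric-looking values followed by a late text value.
column = ["1", "2", "3", "4", "N/A"]
print(guess_type(column, guess_rows=3))  # double
print(guess_type(column, guess_rows=5))  # varchar
```

Scanning too few rows misses the late text value, which is also one reason to set the type of a column such as ward explicitly.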

## Load a Client-Side Data File into CAS

Another method of loading data into CAS memory is to load the data from a file or URL that the client can read. This example uses the SWAT upload_file method to perform a client-side load.

In [None]:
tbl = s.upload_file("https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/Crime_Incidents_in_2021.csv",  
                    casOut={'name':'crimes',
                            'caslib':'casuser',
                            'replace':True},
                    importOptions={"fileType":"csv",
                                   "encoding":"latin1",
                                   "guessRows":10000,
                                   "vars":{"ward":{"type":"varchar"}}})

# Explore the Data

## Create a Reference to an In-Memory Table

Use the CASTable function to reference the crimes table and save the result in an object named tbl. Any action or method that is run on tbl then uses the table parameters stored in that object.

In [None]:
tbl = s.CASTable(name='crimes', caslib='casuser')

## Examine the Rows

Run the head method to retrieve the first five rows from the crimes table.

In [None]:
tbl.head(n=5)

## Examine the Columns

Use the table.columnInfo action to obtain metadata about the table. The result includes the names of columns, and information about each column, including its label (if applicable), type, length, and format. 

In [None]:
tbl.columnInfo()

## Examine Unique and Missing Values

Run the simple.distinct action to identify the number of distinct values and the number of missing values for each column.

In [None]:
tbl.distinct()
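
For comparison, similar counts can be computed locally in pandas on data that has already been downloaded. This is a sketch on made-up sample rows. The column names n_distinct and n_missing are chosen here and are not the output columns of the CAS action, and the two may treat missing values differently.

```python
import pandas as pd

# Hypothetical sample rows; the values are made up for illustration.
sample = pd.DataFrame({
    "SHIFT": ["DAY", "EVENING", "DAY", "MIDNIGHT"],
    "WARD": ["1", "2", None, "2"],
})

summary = pd.DataFrame({
    "n_distinct": sample.nunique(dropna=True),  # missing values are not counted
    "n_missing": sample.isna().sum(),
})
print(summary)
```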

# Visualize the Data 

The SWAT package and other libraries in Python can be used to visualize data that has been preprocessed using CAS actions. This example demonstrates how to create several common types of data visualizations, including a bar chart, histogram, box plot, and line plot.

## Import Libraries

Import the pandas, matplotlib, and seaborn libraries.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Create a Bar Chart 

Create a bar chart that shows the number of crime incidents by the shift variable.

### Aggregate the Data 

First, use the value_counts SWAT method to count the number of rows by shift and save the resulting pandas Series to an object named df_shift. The value_counts SWAT method mimics the pandas value_counts method but runs on the distributed CAS server.

In [None]:
df_shift = tbl["SHIFT"].value_counts()
display(df_shift)
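
The pandas method that value_counts mimics can be tried locally on a small Series. This sketch uses made-up values and assumes nothing about the actual counts in the crimes table.

```python
import pandas as pd

# Hypothetical shift values for six incidents.
shifts = pd.Series(["DAY", "EVENING", "DAY", "MIDNIGHT", "EVENING", "DAY"],
                   name="SHIFT")

# Counts are returned in descending order, indexed by the distinct values.
counts = shifts.value_counts()
print(counts)
```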

### Plot the Data

Use the plt.subplots() function to create a figure and axes, and then use the bar method to draw a bar chart with the shift (index) on the X-axis and frequency counts on the Y-axis. Specify axis labels and a two-line title.

In [None]:
fig, ax = plt.subplots()
ax.bar(df_shift.index, df_shift.values)
ax.set_title('Number of Crime Incidents by Shift \n Washington DC 2021')
ax.set_ylabel('Number of Crime Incidents')
ax.set_xlabel('Shift')
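
Because value_counts orders the bars by frequency, the shifts might not appear in chronological order. One option is to reindex the Series before plotting. This is a sketch on made-up counts, and the labels DAY, EVENING, and MIDNIGHT are assumed to be the values of the shift variable.

```python
import pandas as pd

# Hypothetical counts, in the frequency order that value_counts might return.
counts = pd.Series({"EVENING": 2, "DAY": 3, "MIDNIGHT": 1})

# Assumed category labels, listed in the order they should appear on the X-axis.
shift_order = ["DAY", "EVENING", "MIDNIGHT"]
ordered = counts.reindex(shift_order)
print(ordered)
```

The reordered Series can then be passed to ax.bar in the same way as df_shift.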

## Create a Histogram

Create a histogram that shows how the number of crime incidents is distributed across voting precincts.

### Aggregate the Data

Use the value_counts SWAT method to count the number of rows by voting precinct and save the resulting pandas Series to an object named df_vp.

In [None]:
df_vp = tbl["VOTING_PRECINCT"].value_counts()
df_vp

### Plot the Data

Use the plt.subplots() function and the hist method to plot the values of df_vp. Specify axis labels, title, and subtitle. 

In [None]:
# Fig, ax method

fig, ax = plt.subplots()
ax.hist(df_vp.values, bins=25, edgecolor='black', linewidth=1)
ax.set_xlabel("Number of Crime Incidents")
ax.set_ylabel("Number of Voting Precincts")
ax.yaxis.set(ticks=range(0, 41, 10))
plt.suptitle("Distribution of the Frequency of Crime Incidents by Voting Precincts")
plt.title("Washington, DC 2021")
plt.show()
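
The hist method bins the data with numpy.histogram, so the bar heights can be inspected without drawing the chart. The following sketch uses made-up per-precinct counts and five bins instead of 25 to keep the output small.

```python
import numpy as np

# Hypothetical number of incidents in each of ten voting precincts.
incidents_per_precinct = np.array([5, 12, 18, 20, 33, 41, 47, 60, 88, 100])

# bin_counts[i] is the number of precincts whose incident count falls
# between bin_edges[i] and bin_edges[i + 1].
bin_counts, bin_edges = np.histogram(incidents_per_precinct, bins=5)
print(bin_counts)
print(bin_edges)
```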

## Create a Box Plot

Create a box plot that shows summary statistics for the number of crime incidents among neighborhood clusters by ward.

### Aggregate the Data 

Use the SWAT groupby method, which mimics the pandas groupby() method, to count the number of observations (using the CCN variable) by each neighborhood cluster and ward, and reset the index.

In [None]:
df_nc_ward = tbl.groupby(["NEIGHBORHOOD_CLUSTER", "WARD"])["CCN"].count()
df_nc_ward = df_nc_ward.reset_index()
df_nc_ward
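
The same pattern can be tried on a local pandas DataFrame to see the shape of the result. This sketch uses made-up rows. The cluster names and CCN values are hypothetical.

```python
import pandas as pd

# Hypothetical incidents: one row per incident, identified by CCN.
sample = pd.DataFrame({
    "NEIGHBORHOOD_CLUSTER": ["Cluster 1", "Cluster 1", "Cluster 2",
                             "Cluster 2", "Cluster 2"],
    "WARD": ["1", "1", "1", "2", "2"],
    "CCN": [101, 102, 103, 104, 105],
})

# Count incidents per (cluster, ward) pair, then turn the group keys
# back into ordinary columns.
counts = (sample.groupby(["NEIGHBORHOOD_CLUSTER", "WARD"])["CCN"]
                .count()
                .reset_index())
print(counts)
```

After reset_index, the CCN column holds the count for each pair, which is why the box plot later uses CCN as its Y-axis variable.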

### Plot the Data

Use the sns.boxplot() function from the seaborn library to create vertical box plots that show summary statistics for the number of crime incidents among neighborhood clusters by ward. Specify WARD as the X-axis variable and CCN as the Y-axis variable. Specify the axis labels, title, and subtitle.

In [None]:
# Seaborn method

g = sns.boxplot(x=df_nc_ward["WARD"],
                y=df_nc_ward["CCN"])
g.set(xlabel="Ward",
      ylabel="Number of Crime Incidents")
g.yaxis.grid(True)
g.set_axisbelow(True)
plt.suptitle("Box Plot of Crime Incidents by Ward")
plt.title("Neighborhood Clusters in Washington DC, 2021")
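
Each box spans the first to third quartile, with a line at the median. Those values can also be computed directly, which helps when reading the chart. This is a sketch on made-up counts, not the actual data.

```python
import pandas as pd

# Hypothetical incident counts for three neighborhood clusters in each ward.
df = pd.DataFrame({
    "WARD": ["1", "1", "1", "2", "2", "2"],
    "CCN": [10, 20, 60, 5, 15, 25],
})

# One row per ward, one column per quantile (0.25, 0.5, 0.75).
stats = df.groupby("WARD")["CCN"].quantile([0.25, 0.5, 0.75]).unstack()
print(stats)
```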

## Create a Line Plot

Create a line plot to show the number of crime incidents by month in 2021.

### Create a Date Variable

To aggregate data by a date variable, the date column must be converted from a character type to a numeric type with a date format.
First, use the SWAT eval method to create a new column named date that extracts the date value in front of the delimiter "+" with the informat 'ANYDTDTM.'.
Use the copyTable action to create a new table named crimes_date from the crimes table.
Use the CASTable function to create a reference to the crimes_date table and save the result to an object named tbl_date.
Use the alterTable action to assign the display format 'date9.' to the date column in the crimes_date table.

In [None]:
tbl.eval("date = datepart(inputn(scan(REPORT_DAT,1,'+'), 'ANYDTDTM.'))")

In [None]:
tbl.copyTable(casout={'name':'crimes_date','caslib':'casuser', 'replace':True})

In [None]:
tbl_date = s.CASTable(name='crimes_date', caslib='casuser')

In [None]:
tbl_date.alterTable(columns = [{'name':'date', 'format':'date9.'}])

In [None]:
tbl_date.columnInfo()
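
The expression passed to eval keeps the text in front of the "+" delimiter, reads it as a datetime, and keeps only the date part. A rough plain-Python equivalent is sketched below. It assumes that REPORT_DAT values look like 2021/03/14 15:47:07+00, which is an assumption about the file and not something the example states, and report_date is a hypothetical helper.

```python
from datetime import datetime

def report_date(report_dat):
    """Return the date portion of a REPORT_DAT-style string."""
    # Keep the text in front of "+", like scan(REPORT_DAT, 1, '+').
    timestamp = report_dat.split("+")[0]
    # Parse the datetime and keep only the date, like datepart().
    return datetime.strptime(timestamp, "%Y/%m/%d %H:%M:%S").date()

print(report_date("2021/03/14 15:47:07+00"))  # 2021-03-14
```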

### Aggregate the Data

Run the aggregation.aggregate action to count the number of rows by month and save the resulting output as a table named CrimesByMonth. Create a CASTable object named tbl_month to reference the CrimesByMonth table.

In [None]:
s.builtins.loadActionSet("aggregation")

s.aggregation.aggregate(table={"name":"crimes_date",
                               "caslib":"casuser",
                               "groupBy":["date"]},
                        varSpecs=[{"name":"CCN",
                                   "subset":["N"]}],
                        ID="date",
                        interval="MONTH",
                        casOut={"name":"CrimesByMonth",
                                "caslib":"casuser",
                                "replace":True})

tbl_month = s.CASTable(name='CrimesByMonth', caslib='casuser')
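
Conceptually, a monthly interval assigns each date to the month that contains it and counts the rows in each month. The following plain-Python sketch shows that idea on made-up dates. It is an illustration only and does not reproduce the output layout of the aggregate action.

```python
from collections import Counter
from datetime import date

# Hypothetical incident dates, including one that falls in January 2022.
dates = [date(2021, 1, 5), date(2021, 1, 20), date(2021, 2, 3), date(2022, 1, 1)]

# Map every date to the first day of its month, then count rows per month.
per_month = Counter(d.replace(day=1) for d in dates)
for month_start, n in sorted(per_month.items()):
    print(month_start, n)
```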

### View the Rows

Use the head method to return only the first 12 rows from the CrimesByMonth table and assign the result to the object df_month. This excludes the one row containing data from January 2022.

In [None]:
df_month = tbl_month.head(n=12)
df_month
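
Taking the first 12 rows relies on the rows being returned in date order. An alternative that does not depend on row order is to filter by year after the result is on the client. This is a sketch on a made-up pandas DataFrame. The counts are hypothetical, and only the column name _CCN_Summary_NObs_ is taken from this example.

```python
import pandas as pd

# Hypothetical monthly counts, including a stray row for January 2022.
monthly = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2021-12-01", "2021-11-01"]),
    "_CCN_Summary_NObs_": [1, 2100, 2300],
})

# Keep only the 2021 rows and put them in chronological order.
only_2021 = monthly[monthly["date"].dt.year == 2021].sort_values("date")
print(only_2021)
```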

### Plot the Data

Convert the date column to datetime and then use the dt.to_period method to extract only the month and year values from the date column and store the values in a column named month_year. Convert the column to a string type.

In [None]:
df_month['date'] = pd.to_datetime(df_month['date'])
df_month['month_year'] = df_month['date'].dt.to_period('M')
df_month['month_year'] = df_month['month_year'].astype(str)
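
The effect of these three steps can be checked on a small Series. This sketch uses two made-up dates.

```python
import pandas as pd

# Hypothetical month start dates.
dates = pd.Series(pd.to_datetime(["2021-01-01", "2021-02-01"]))

# to_period("M") keeps only year and month; astype(str) gives plain labels
# that matplotlib treats as categories on the X-axis.
labels = dates.dt.to_period("M").astype(str)
print(labels.tolist())  # ['2021-01', '2021-02']
```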

In [None]:
df_month

Use the plt.subplots() function and the plot method to create a line plot with month_year as the X-axis variable and _CCN_Summary_NObs_ as the Y-axis variable. Specify axis labels, title, and subtitle.

In [None]:
# Fig, ax method

fig, ax = plt.subplots()

ax.plot(df_month["month_year"], df_month["_CCN_Summary_NObs_"], marker='.')
ax.set_xlabel("Month")
ax.set_ylabel("Number of Crime Incidents")
plt.xticks(rotation=45)
plt.suptitle("Number of Crime Incidents by Month")
plt.title("Washington, DC 2021")
plt.show()