# Simple Statistics in Python

The actions in CAS cover a wide variety of statistical analyses.  While we can't cover all of them here, we'll at least get you started on some of the simpler ones.

First we need to get a CAS connection set up.

In [1]:
import swat

# Replace host, port, username, and password with the values for your CAS server.
conn = swat.CAS(host, port, username, password)
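
If you would rather not hard-code the connection details, one option is to read them from environment variables. The sketch below is only an illustration: the variable names (`CASHOST`, `CASPORT`, `CASUSER`, `CASPASSWORD`), the fallback host and port, and the helper `cas_connection_params` are assumptions made for this example, not something this notebook defines.

```python
import os


def cas_connection_params(env=None):
    """Collect CAS connection settings from environment variables.

    The variable names and the fallback values are assumptions made for
    this sketch; adjust them to match your own site.
    """
    env = os.environ if env is None else env
    return {
        "hostname": env.get("CASHOST", "localhost"),
        "port": int(env.get("CASPORT", "5570")),
        "username": env.get("CASUSER"),
        "password": env.get("CASPASSWORD"),
    }


# Example with an explicit mapping, so no real environment is needed.
params = cas_connection_params({"CASHOST": "cas.example.com", "CASPORT": "5570"})
print(params["hostname"], params["port"])

# With a reachable CAS server, the settings could then be passed on:
#   import swat
#   conn = swat.CAS(**params)
```

Keeping credentials out of the notebook itself also makes it easier to share the notebook without editing it first.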

## The `simple` Action Set

The basic statistics action set in CAS is called **simple** and should already be loaded.  If you are using IPython, you can see which actions are available using the **?** operator.

In [2]:
conn.simple?

You can also use Python's **help** function.

In [3]:
help(conn.simple)

Help on Simple in module swat.cas.actions object:

class Simple(CASActionSet)
 |  Analytics
 |  
 |  Actions
 |  -------
 |  simple.correlation : Generates a matrix of Pearson product-moment correlation
 |                       coefficients
 |  simple.crosstab    : Performs one-way or two-way tabulations
 |  simple.distinct    : Computes the distinct number of values of the variables in
 |                       the variable list
 |  simple.freq        : Generates a frequency distribution for one or more
 |                       variables
 |  simple.groupby     : Builds BY groups in terms of the variable value
 |                       combinations given the variables in the variable list
 |  simple.mdsummary   : Calculates multidimensional summaries of numeric variables
 |  simple.numrows     : Shows the number of rows in a Cloud Analytic Services table
 |  simple.paracoord   : Generates a parallel coordinates plot of the variables in
 |                       the variable list
 |  simpl

Let's start with the **summary** action.  We'll need some data, so we'll load a CSV file from a URL into a CAS table.  Then we'll run the action on it.

In [4]:
cars = conn.read_csv('https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv')
out = cars.summary()
out

,Column,Min,Max,N,NMiss,Mean,Sum,Std,StdErr,Var,USS,CSS,CV,TValue,ProbT
0,MSRP,10280.0,192465.0,428.0,0.0,32774.85514,14027638.0,19431.716674,939.267478,377591600.0,620985400000.0,161231600000.0,59.28849,34.894059,4.160412e-127
1,Invoice,9875.0,173560.0,428.0,0.0,30014.700935,12846292.0,17642.11775,852.763949,311244300.0,518478900000.0,132901300000.0,58.778256,35.196963,2.684398e-128
2,EngineSize,1.3,8.3,428.0,0.0,3.196729,1368.2,1.108595,0.053586,1.228982,4898.54,524.7754,34.679034,59.656105,3.133745e-209
3,Cylinders,3.0,12.0,426.0,2.0,5.807512,2474.0,1.558443,0.075507,2.428743,15400.0,1032.216,26.834946,76.913766,1.515569e-251
4,Horsepower,73.0,500.0,428.0,0.0,215.885514,92399.0,71.836032,3.472326,5160.415,22151100.0,2203497.0,33.275059,62.173176,4.185344e-216
5,MPG_City,10.0,60.0,428.0,0.0,20.060748,8586.0,5.238218,0.253199,27.43892,183958.0,11716.42,26.111777,79.229235,1.866284e-257
6,MPG_Highway,12.0,66.0,428.0,0.0,26.843458,11489.0,5.741201,0.277511,32.96139,322479.0,14074.51,21.387709,96.729204,1.665621e-292
7,Weight,1850.0,7190.0,428.0,0.0,3577.953271,1531364.0,758.983215,36.686838,576055.5,5725125000.0,245975700.0,21.212776,97.52689,5.812547e-294
8,Wheelbase,89.0,144.0,428.0,0.0,108.154206,46290.0,8.311813,0.401767,69.08624,5035958.0,29499.82,7.68515,269.196577,0.0
9,Length,143.0,238.0,428.0,0.0,186.36215,79763.0,14.357991,0.69402,206.1519,14952830.0,88026.87,7.704349,268.525733,0.0
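
Several of the columns in this table can be derived from one another. As a rough guide, and assuming the columns follow the usual textbook definitions (USS as the uncorrected sum of squares, CSS as the sum of squared deviations from the mean, CV as the standard deviation expressed as a percentage of the mean, and TValue as the mean divided by its standard error), the sketch below computes the same quantities in plain Python. The function name `summary_stats` and the sample values are made up for illustration; this is not how CAS computes them internally.

```python
import math
import statistics


def summary_stats(values):
    """Compute summary-style statistics for a list of numbers."""
    n = len(values)
    total = sum(values)
    mean = total / n
    uss = sum(v * v for v in values)            # uncorrected sum of squares
    css = sum((v - mean) ** 2 for v in values)  # corrected sum of squares
    var = css / (n - 1)                         # sample variance
    std = math.sqrt(var)
    stderr = std / math.sqrt(n)                 # standard error of the mean
    return {
        "Min": min(values),
        "Max": max(values),
        "N": n,
        "Mean": mean,
        "Sum": total,
        "Std": std,
        "StdErr": stderr,
        "Var": var,
        "USS": uss,
        "CSS": css,
        "CV": 100 * std / mean,      # coefficient of variation, in percent
        "TValue": mean / stderr,     # t statistic for testing mean == 0
    }


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = summary_stats(sample)
for name, value in stats.items():
    print(f"{name:>7}: {value}")
```

The MSRP row above is consistent with these definitions: dividing Std (19431.716674) by the square root of N (428) gives roughly the StdErr shown (939.27), and dividing Mean by StdErr gives roughly the TValue shown (34.89).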


The result is a CASResults object, which is a subclass of a Python dictionary.  In this case it has only one key, "Summary", whose value is a DataFrame.  Storing that DataFrame in a variable makes it easier to work with, and we can then apply any of the standard Pandas DataFrame operations to it.  Here we set the first column as the index of the DataFrame so that selecting data is easier later on.

In [5]:
df = out['Summary']
df.set_index(df.columns[0], inplace=True)
df

Column,Min,Max,N,NMiss,Mean,Sum,Std,StdErr,Var,USS,CSS,CV,TValue,ProbT
MSRP,10280.0,192465.0,428.0,0.0,32774.85514,14027638.0,19431.716674,939.267478,377591600.0,620985400000.0,161231600000.0,59.28849,34.894059,4.160412e-127
Invoice,9875.0,173560.0,428.0,0.0,30014.700935,12846292.0,17642.11775,852.763949,311244300.0,518478900000.0,132901300000.0,58.778256,35.196963,2.684398e-128
EngineSize,1.3,8.3,428.0,0.0,3.196729,1368.2,1.108595,0.053586,1.228982,4898.54,524.7754,34.679034,59.656105,3.133745e-209
Cylinders,3.0,12.0,426.0,2.0,5.807512,2474.0,1.558443,0.075507,2.428743,15400.0,1032.216,26.834946,76.913766,1.515569e-251
Horsepower,73.0,500.0,428.0,0.0,215.885514,92399.0,71.836032,3.472326,5160.415,22151100.0,2203497.0,33.275059,62.173176,4.185344e-216
MPG_City,10.0,60.0,428.0,0.0,20.060748,8586.0,5.238218,0.253199,27.43892,183958.0,11716.42,26.111777,79.229235,1.866284e-257
MPG_Highway,12.0,66.0,428.0,0.0,26.843458,11489.0,5.741201,0.277511,32.96139,322479.0,14074.51,21.387709,96.729204,1.665621e-292
Weight,1850.0,7190.0,428.0,0.0,3577.953271,1531364.0,758.983215,36.686838,576055.5,5725125000.0,245975700.0,21.212776,97.52689,5.812547e-294
Wheelbase,89.0,144.0,428.0,0.0,108.154206,46290.0,8.311813,0.401767,69.08624,5035958.0,29499.82,7.68515,269.196577,0.0
Length,143.0,238.0,428.0,0.0,186.36215,79763.0,14.357991,0.69402,206.1519,14952830.0,88026.87,7.704349,268.525733,0.0


Now that we have an index, we can use the **loc** property of the DataFrame to select rows based on index values as well as columns based on names.

In [6]:
df.loc[['MSRP', 'Invoice'], ['Min', 'Mean', 'Max']]

Column,Min,Mean,Max
MSRP,10280.0,32774.85514,192465.0
Invoice,9875.0,30014.700935,173560.0
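
The same pattern can be tried without a CAS server. In the sketch below, a plain dictionary stands in for the CASResults object and a small hand-built DataFrame stands in for the "Summary" table; both are assumptions for illustration, with values copied from the output above. It also uses **set_index** without `inplace=True`, which returns a new DataFrame and leaves the table stored in the results untouched.

```python
import pandas as pd

# Illustrative stand-in for the 'Summary' table; the three rows and
# three statistics are copied from the output shown earlier.
summary = pd.DataFrame({
    "Column": ["MSRP", "Invoice", "EngineSize"],
    "Min": [10280.0, 9875.0, 1.3],
    "Mean": [32774.85514, 30014.700935, 3.196729],
    "Max": [192465.0, 173560.0, 8.3],
})

# A plain dict stands in for the CASResults object here.
out = {"Summary": summary}

# Without inplace=True, set_index returns a new DataFrame, so the
# table held in `out` keeps its original "Column" column.
df = out["Summary"].set_index("Column")

# Select rows by index label and columns by name.
prices = df.loc[["MSRP", "Invoice"], ["Min", "Max"]]
print(prices)

# A single cell can be read with one row label and one column label.
max_msrp = df.loc["MSRP", "Max"]
print(max_msrp)
```

Whether to modify the result in place or work on a copy is a matter of taste; the copy is safer if you plan to reuse the original results later.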


## DataFrame Methods on CASTable Objects

In the previous example, we called the **summary** action directly.  This gave us a CASResults object containing a DataFrame with the result of the action.  You can also use many of the Pandas DataFrame methods directly on the CASTable object, so in many ways the two are interchangeable.  One of the most commonly used methods on a Pandas DataFrame is **describe**.  Its output includes statistics that you would otherwise obtain by running variations of the **summary**, **distinct**, **topk**, and **percentile** actions.  The CASTable version runs these for you and produces the same output you would get from an actual Pandas DataFrame.  The difference is that the CASTable version can handle much larger data sets, because the computation happens in CAS rather than in the Python client.

In [7]:
cars.describe()

,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
count,428.0,428.0,428.0,426.0,428.0,428.0,428.0,428.0,428.0,428.0
mean,32774.85514,30014.700935,3.196729,5.807512,215.885514,20.060748,26.843458,3577.953271,108.154206,186.36215
std,19431.716674,17642.11775,1.108595,1.558443,71.836032,5.238218,5.741201,758.983215,8.311813,14.357991
min,10280.0,9875.0,1.3,3.0,73.0,10.0,12.0,1850.0,89.0,143.0
25%,20329.5,18851.0,2.35,4.0,165.0,17.0,24.0,3103.0,103.0,178.0
50%,27635.0,25294.5,3.0,6.0,210.0,19.0,26.0,3474.5,107.0,187.0
75%,39215.0,35732.5,3.9,6.0,255.0,21.5,29.0,3978.5,112.0,194.0
max,192465.0,173560.0,8.3,12.0,500.0,60.0,66.0,7190.0,144.0,238.0


Other DataFrame methods that work on CASTable objects include **min**, **max**, and **std**.  Each of these calls **simple.summary** in the background, so if you need more than one of them, you may be better off calling **describe** once to get them all.

In [8]:
cars.min()

Make                Acura
Model          3.5 RL 4dr
Type               Hybrid
Origin               Asia
DriveTrain            All
MSRP                10280
Invoice              9875
EngineSize            1.3
Cylinders               3
Horsepower             73
MPG_City               10
MPG_Highway            12
Weight               1850
Wheelbase              89
Length                143
Name: min, dtype: object

In [9]:
cars.max()

Make                             Volvo
Model          Z4 convertible 3.0i 2dr
Type                             Wagon
Origin                             USA
DriveTrain                        Rear
MSRP                            192465
Invoice                         173560
EngineSize                         8.3
Cylinders                           12
Horsepower                         500
MPG_City                            60
MPG_Highway                         66
Weight                            7190
Wheelbase                          144
Length                             238
Name: max, dtype: object

In [10]:
cars.std()

MSRP           19431.716674
Invoice        17642.117750
EngineSize         1.108595
Cylinders          1.558443
Horsepower        71.836032
MPG_City           5.238218
MPG_Highway        5.741201
Weight           758.983215
Wheelbase          8.311813
Length            14.357991
dtype: float64
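
To illustrate the suggestion of calling **describe** once, the sketch below pulls the minimum, maximum, and standard deviation out of a single **describe** result. A local Pandas DataFrame stands in for the CASTable, on the assumption (stated above) that the two produce the same output; the sample values are made up for illustration and are not rows from cars.csv.

```python
import pandas as pd

# Made-up sample values used only for illustration.
sample = pd.DataFrame({
    "MSRP": [10280.0, 27635.0, 192465.0],
    "Horsepower": [73.0, 210.0, 500.0],
})

# One describe call computes all of the statistics at once.
desc = sample.describe()

# Rows of the describe result are labeled by statistic name, so
# several statistics can be selected together with loc.
picked = desc.loc[["min", "max", "std"]]
print(picked)
```

Note that **describe** summarizes only numeric columns by default, so the character columns that appear in the **min** and **max** output above would not be included.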

## Conclusion

Although we have barely scratched the surface, you should now be able to get some basic statistical results about your data.  Whether you use the action API directly or the familiar Pandas DataFrame methods is up to you.

In [11]:
conn.close()