# Best Practices

As with all software, VerticaPy has a performance cost. To get the best performance, we need to understand the architectures of Vertica and VerticaPy. In this lesson, we'll go through some optimization steps.

## 1. Optimize your architecture at the database-level

At the end of the day, VerticaPy is an abstraction of SQL, so any database-level optimizations you make carry over to VerticaPy.

Optimizing Vertica mostly comes down to optimizing projections. Think about your data architecture before creating a vDataFrame so that you select only the essential columns.

In the following example, we use the 'usecols' parameter of the vDataFrame to select the most essential columns in our dataset.

In [1]:
import verticapy as vp
from verticapy.datasets import load_titanic
load_titanic() # loading the dataset in Vertica in case we do not have it
vdf = vp.vDataFrame("public.titanic",
                    usecols = ["survived", "pclass", "age"])
display(vdf)

Unnamed: 0,123pclassInteger,123survivedInteger,123ageNumeric(8)
1,1,0,2.0
2,1,0,30.0
3,1,0,25.0
4,1,0,39.0
5,1,0,71.0
6,1,0,47.0
7,1,0,[null]
8,1,0,24.0
9,1,0,36.0
10,1,0,25.0


## 2. Save the current relation when you can

The vDataFrame works just like a view. If the final generated relation uses a lot of different functions, it will drastically increase the computation time for each method call.
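To make the view analogy concrete, here is a toy sketch of how stacked transformations turn into nested subqueries that every later computation has to run through. The `ToyRelation` class and its methods are hypothetical illustrations, not VerticaPy's actual implementation.

```python
class ToyRelation:
    """Toy stand-in for a view-like object; NOT VerticaPy's implementation."""

    def __init__(self, source: str):
        # The current relation starts as the plain source table.
        self.relation = source
        self.depth = 0

    def transform(self, select_list: str) -> "ToyRelation":
        # Each transformation wraps the previous relation in a new subquery
        # instead of materializing any data.
        self.relation = f"(SELECT {select_list} FROM {self.relation}) sub{self.depth}"
        self.depth += 1
        return self

    def save_as(self, table: str) -> "ToyRelation":
        # Mimics saving the result: later queries read the saved table directly.
        self.relation = table
        self.depth = 0
        return self

    def query(self, aggregate: str) -> str:
        # Every computation runs on top of whatever the current relation is.
        return f"SELECT {aggregate} FROM {self.relation}"


rel = ToyRelation('"public"."titanic"')
rel.transform('COALESCE("age", 30) AS "age", "fare"')
rel.transform('"age", "fare" * 2 AS "fare"')
nested_sql = rel.query('AVG("age")')
print(nested_sql)

rel.save_as('"public"."titanic_clean"')
saved_sql = rel.query('AVG("age")')
print(saved_sql)
```

In this toy model, the first query carries both transformations with it, while the query issued after saving only reads the saved table.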

Smaller transformations won't slow down the process much, but heavier transformations (multiple joins, heavy use of advanced analytical functions, moving windows, etc.) can cause a noticeable slowdown. If you make these kinds of changes, you should save the vDataFrame structure. Let's look at an example.

In [2]:
# Performing multiple operations
vdf = vp.vDataFrame("public.titanic")
vdf["sex"].label_encode()["boat"].fillna(method = "0ifnull")["name"].str_extract(
    r' ([A-Za-z]+)\.').eval("family_size", expr = "parch + sibsp + 1").drop(
    columns = ["cabin", "body", "ticket", "home.dest"])["fare"].fill_outliers().fillna()
print(vdf.current_relation())

795 elements were filled.
(
   SELECT
     "pclass",
     "survived",
     "name",
     "sex",
     "age",
     "sibsp",
     "parch",
     COALESCE("fare", 32.9113074018842) AS "fare",
     "embarked",
     "boat",
     "family_size" 
   FROM
 (
   SELECT
     "pclass",
     "survived",
     REGEXP_SUBSTR("name", ' ([A-Za-z]+)\.') AS "name",
     DECODE("sex", 'female', 0, 'male', 1, 2) AS "sex",
     COALESCE("age", 30.1524573721163) AS "age",
     "sibsp",
     "parch",
     (CASE WHEN "fare" < -176.6204982585513 THEN -176.6204982585513 WHEN "fare" > 244.5480856064831 THEN 244.5480856064831 ELSE "fare" END) AS "fare",
     COALESCE("embarked", 'S') AS "embarked",
     DECODE("boat", NULL, 0, 1) AS "boat",
     parch + sibsp + 1 AS "family_size" 
   FROM
 (
                
   SELECT
     
                    "pclass",
     "survived",
     "name",
     "sex",
     "age",
     "sibsp",
     "parch",
     "fare",
     "embarked",
     "boat" 
                
   FROM
 "public"."titani

We can look at the query plan of the new relation. This will help us understand how Vertica will execute the different aggregations.

In [3]:
print(vdf.explain())

------------------------------ 
QUERY PLAN DESCRIPTION: 

EXPLAIN SELECT /*+LABEL('vDataframe.explain')*/ * FROM (SELECT "pclass", "survived", "name", "sex", "age", "sibsp", "parch", COALESCE("fare", 32.9113074018842) AS "fare", "embarked", "boat", "family_size" FROM (SELECT "pclass", "survived", REGEXP_SUBSTR("name", ' ([A-Za-z]+)\.') AS "name", DECODE("sex", 'female', 0, 'male', 1, 2) AS "sex", COALESCE("age", 30.1524573721163) AS "age", "sibsp", "parch", (CASE WHEN "fare" < -176.6204982585513 THEN -176.6204982585513 WHEN "fare" > 244.5480856064831 THEN 244.5480856064831 ELSE "fare" END) AS "fare", COALESCE("embarked", 'S') AS "embarked", DECODE("boat", NULL, 0, 1) AS "boat", parch + sibsp + 1 AS "family_size" FROM ( SELECT "pclass", "survived", "name", "sex", "age", "sibsp", "parch", "fare", "embarked", "boat" FROM "public"."titanic") VERTICAPY_SUBTABLE) VERTICAPY_SUBTABLE) VERTICAPY_SUBTABLE

Access Path:
+-STORAGE ACCESS for titanic [Cost: 67, Rows: 1K (NO STATISTICS)] (PATH ID: 1

We performed many operations, and we must keep in mind that each method call will use this relation for its computations. We can save the result as a table in the Vertica database and use the 'inplace' parameter to replace the current relation of the vDataFrame with the new one.

In [4]:
vp.drop("public.titanic_clean", method = "table")
vdf.to_db("public.titanic_clean",
          relation_type = "table",
          inplace = True)
print(vdf.current_relation())

"public"."titanic_clean"


When we're dealing with very large datasets, we have to think a bit before saving some transformations. Ideally, you'll want to do a proper data exploration first and then perform the heavier transformations only when they're really needed.

## 3. Stick to essential columns

Columnar databases perform faster queries when there are fewer columns. Vertica is a columnar MPP database, so it's important to understand that most optimizations are made in projections. VerticaPy doesn't manage this part, so it's important that the data you're working with is well-organized, particularly for larger volumes of data.

Most vDataFrame methods will automatically pick up all numerical columns - even if doing so has a significant performance impact - so it's important to be a little picky and only select the essential columns for any given use case. Let's look at an example.

In [5]:
vdf = vp.vDataFrame("public.titanic")
vp.set_option("sql_on", True)
vdf.avg()

Unnamed: 0,avg
"""pclass""",2.28444084278768
"""survived""",0.364667747163695
"""age""",30.1524573721163
"""sibsp""",0.504051863857374
"""parch""",0.378444084278768
"""fare""",33.9637936739659
"""body""",164.14406779661


Here, we didn't use the 'columns' parameter to pick any specific columns, so we ended up computing the average of all the numerical columns of the vDataFrame. This isn't a big deal when we're dealing with smaller volumes of data (less than a TB), but with larger volumes of data, we have to be more careful with which columns we use.

In [6]:
vdf.avg(columns = ["age", "survived"])

Unnamed: 0,avg
"""age""",30.1524573721163
"""survived""",0.364667747163695


If you just want to exclude a few columns, you can get the list of all the columns with the 'get_columns' method and specify the unwanted ones with its 'exclude_columns' parameter.

In [7]:
vdf.get_columns()

['"pclass"',
 '"survived"',
 '"name"',
 '"sex"',
 '"age"',
 '"sibsp"',
 '"parch"',
 '"ticket"',
 '"fare"',
 '"cabin"',
 '"embarked"',
 '"boat"',
 '"body"',
 '"home.dest"']

In [8]:
vdf.get_columns(exclude_columns = ["boat", "embarked"])

['"pclass"',
 '"survived"',
 '"name"',
 '"sex"',
 '"age"',
 '"sibsp"',
 '"parch"',
 '"ticket"',
 '"fare"',
 '"cabin"',
 '"body"',
 '"home.dest"']

If you only want numerical columns, you can use the 'numcol' method. This works the same way as 'get_columns', so you can also exclude columns in the same way.

In [9]:
vdf.numcol()

['"pclass"', '"survived"', '"age"', '"sibsp"', '"parch"', '"fare"', '"body"']

In [10]:
vdf.numcol(exclude_columns = ["body", "sibsp"])

['"pclass"', '"survived"', '"age"', '"parch"', '"fare"']

Let's compute a correlation matrix of our numerical columns excluding 'parch' and 'sibsp'.

In [11]:
vp.set_option("plotting_lib","highcharts")
vdf.corr(columns = vdf.numcol(exclude_columns = ["parch", "sibsp"]))

Let's turn off the SQL code generation.

In [12]:
vp.set_option("sql_on", False)

## 4. Use the help function

The 'help' function is very useful for quickly viewing parameters.

In [13]:
help(vdf.agg)

Help on method aggregate in module verticapy.core.vdataframe._aggregate:

aggregate(func: Annotated[Union[str, list[str], ForwardRef('StringSQL'), list['StringSQL']], ''], columns: Optional[Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns']] = None, ncols_block: int = 20, processes: int = 1) -> verticapy.core.tablesample.base.TableSample method of verticapy.core.vdataframe.base.vDataFrame instance
    Aggregates the vDataFrame using the input functions.
    
    Parameters
    ----------
    func: SQLExpression
        | List of the different aggregations:
    
        |    **aad**: average absolute deviation
        |    **approx_median**: approximate median
        |    **approx_q%**: approximate q quantile
                            (ex: approx_50% for the
                            approximate median)
        |    **approx_unique**: approximative cardinality
        |    **count**: number of non-missing elements
        |    **cvar**: conditio

## 5. Close your connections

More connections to the database will increase the concurrency on the system, so try to close all of your connections after using them. VerticaPy simplifies the connection process by allowing the user to create an auto-connection, but it will not close it until you use the 'close_connection' function.

To demonstrate, let's create a database connection. Once we're done, we'll close it.

In [14]:
import verticapy as vp
# vp.connect("VerticaDSN")

We can use it to create a vDataFrame and perform some operations on the data.

In [15]:
vdf = vp.vDataFrame("public.titanic")
vdf["sex"].label_encode()["boat"].fillna(method = "0ifnull")["name"].str_extract(
    r' ([A-Za-z]+)\.').eval("family_size", expr = "parch + sibsp + 1").drop(
    columns = ["cabin", "body", "ticket", "home.dest"])["fare"].fill_outliers().fillna()

795 elements were filled.


Unnamed: 0,123pclassInteger,123survivedInteger,AbcnameVarchar(164),123sexInteger,123ageNumeric(18),123sibspInteger,123parchInteger,123fareNumeric(20),AbcembarkedVarchar(20),123boatInteger,123family_sizeInteger
1,1,0,Miss.,0,2.0,1,2,151.55,S,0,4
2,1,0,Mr.,1,30.0,1,2,151.55,S,0,4
3,1,0,Mrs.,0,25.0,1,2,151.55,S,0,4
4,1,0,Mr.,1,39.0,0,0,0.0,S,0,1
5,1,0,Mr.,1,71.0,0,0,49.5042,C,0,1
6,1,0,Col.,1,47.0,1,0,227.525,C,0,2
7,1,0,Mr.,1,30.1524573721163,0,0,25.925,S,0,1
8,1,0,Mr.,1,24.0,0,1,244.5480856064831,C,0,2
9,1,0,Mr.,1,36.0,0,0,75.2417,C,1,1
10,1,0,Mr.,1,25.0,0,0,26.0,C,0,1


We can then close the connection when we are done.

In [16]:
# vp.close_connection()

It is very important to follow the previous process when you are working in an environment with multiple users.
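One way to make sure the closing step is never skipped is to wrap the open and close calls in a context manager. The sketch below is a generic pattern, not a VerticaPy feature: `managed_connection` is a hypothetical helper, and `fake_open` and `fake_close` are stand-ins that only record what happened so the example runs without a database.

```python
from contextlib import contextmanager


@contextmanager
def managed_connection(open_fn, close_fn):
    """Open a connection, hand it to the caller, and always close it afterwards."""
    conn = open_fn()
    try:
        yield conn
    finally:
        # Runs even if the body raises, so the connection is not leaked.
        close_fn()


# Stand-ins for real connect/close calls, used only to keep the sketch runnable.
events = []


def fake_open():
    events.append("open")
    return "connection"


def fake_close():
    events.append("close")


with managed_connection(fake_open, fake_close) as conn:
    events.append(f"work on {conn}")

print(events)
```

With VerticaPy, the two callables could plausibly be `lambda: vp.connect("VerticaDSN")` and `vp.close_connection`, the calls shown in the cells above.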

## 6. Understand the time complexity of methods

Some techniques are more computationally expensive than others. For example, a 'kendall' correlation is very expensive compared to a 'pearson' correlation. This is because a 'kendall' correlation uses a cross join, giving it a time complexity of O(n*n) (where 'n' is the number of rows). We'll demonstrate this with the 'titanic' dataset.

In [17]:
import time
vdf = vp.vDataFrame("public.titanic")
start_time = time.time()
x = vdf.corr(method = "pearson", show = False)
print("Pearson, time: {0}".format(time.time() - start_time))
start_time = time.time()
x = vdf.corr(method = "kendall", show = False)
print("Kendall, time: {0}".format(time.time() - start_time))

Pearson, time: 0.03157639503479004


Kendall, time: 3.26833176612854


As you can see, computing a Kendall correlation matrix is noticeably slower (around 100 times slower than Pearson in this run) because of its time complexity. Keep this in mind when using methods on larger datasets.
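To see why the cost grows so quickly, the following plain-Python sketch counts the row pairs a naive Kendall computation has to compare. It is a textbook tau-a written for illustration, not how Vertica or VerticaPy computes the coefficient; a cross join visits on the order of n*n row pairs, while this sketch counts only the n(n-1)/2 distinct ones.

```python
def kendall_pair_count(n: int) -> int:
    """Number of distinct row pairs a naive pairwise comparison must visit."""
    return n * (n - 1) // 2


def naive_kendall_tau(x, y):
    """Textbook tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Positive product: the pair is ordered the same way in x and y.
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / kendall_pair_count(n)


# Ten times more rows means roughly a hundred times more pairs.
for n in (1_000, 10_000, 100_000):
    print(n, kendall_pair_count(n))

tau = naive_kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
print(tau)
```

A Pearson correlation, by contrast, can be computed from running sums in a single pass over the rows, which is why its cost grows only linearly with the row count.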

## 7. Limit the number of elements in a plot

Graphics are a powerful way to understand data, but graphs can be difficult to parse if they have too many elements. Let's draw a multi-histogram where one column is categorical with more than a thousand categories.

In [18]:
vdf.bar(["name", "survived"])

VerticaPy will try to draw it, but it could take time with a large dataset. Worse, it might be completely incomprehensible. Instead, we should try to create graphics with as few categories as possible.

In [19]:
vdf.hist(["pclass", "survived"])

Try to always check the cardinality of your variables before creating graphics.
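One way to act on that check is a small helper that keeps only low-cardinality columns as plot candidates. This is a minimal sketch: `plottable_columns` and the threshold of 10 categories are assumptions made for illustration, and the hard-coded counts are the approximate cardinalities that 'nunique' reports for the titanic columns.

```python
def plottable_columns(cardinalities: dict, max_categories: int = 10) -> list:
    """Return the columns whose distinct-value count is at most max_categories."""
    return [col for col, n in cardinalities.items() if n <= max_categories]


# Approximate cardinalities of the titanic columns, as reported by nunique.
cardinalities = {
    "pclass": 3, "survived": 2, "name": 1233, "sex": 2, "age": 96,
    "sibsp": 7, "parch": 8, "ticket": 888, "fare": 275, "cabin": 181,
}

candidates = plottable_columns(cardinalities)
print(candidates)
```

Columns such as 'name' or 'ticket' fall out of the list, which matches the intuition that they would produce unreadable bar charts.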

In [20]:
vdf.nunique()

Unnamed: 0,approx_unique
"""pclass""",3.0
"""survived""",2.0
"""name""",1233.0
"""sex""",2.0
"""age""",96.0
"""sibsp""",7.0
"""parch""",8.0
"""ticket""",888.0
"""fare""",275.0
"""cabin""",181.0


## 8. Filter unneeded data

Filtering should be the first action you perform when you prepare your data. Proper filtering will help you avoid unnecessary computation and therefore drastically improve the performance of every method you call. While the performance impact isn't as important for small datasets, when working with large ones, you'll always want to filter your data in some way.

In the following example, our goal is to analyze the passengers on the Titanic who had a lifeboat. Since we aren't concerned at all with passengers without a lifeboat, we can simply filter them out of the dataset and move on from there.

In [21]:
vdf.filter("boat IS NOT NULL")

795 elements were filtered.


Unnamed: 0,123pclassInteger,123survivedInteger,AbcnameVarchar(164),AbcsexVarchar(20),123ageNumeric(8),123sibspInteger,123parchInteger,AbcticketVarchar(36),123fareNumeric(12),AbccabinVarchar(30),AbcembarkedVarchar(20),AbcboatVarchar(100),123bodyInteger,Abchome.destVarchar(100)
1,1,0,,male,36.0,0,0,13050,75.2417,C6,C,A,[null],
2,1,0,,male,[null],0,0,PC 17600,30.6958,[null],C,14,[null],
3,1,1,,female,29.0,0,0,24160,211.3375,B5,S,2,[null],
4,1,1,,male,0.92,1,2,113781,151.55,C22 C26,S,11,[null],
5,1,1,,male,48.0,0,0,19952,26.55,E12,S,3,[null],
6,1,1,,female,63.0,1,0,13502,77.9583,D7,S,10,[null],
7,1,1,,female,53.0,2,0,11769,51.4792,C101,S,D,[null],
8,1,1,,female,18.0,1,0,PC 17757,227.525,C62 C64,C,4,[null],
9,1,1,,female,24.0,0,0,PC 17477,69.3,B35,C,9,[null],
10,1,1,,male,80.0,0,0,27042,30.0,A23,S,B,[null],


## 9. Filter unneeded columns

Again, you should always try to cut down your dataset to the essential columns, but writing an explicit command to exclude columns can be cumbersome. Another way to do this is by simply dropping the columns with the 'drop' method.

In [22]:
vdf.drop(["name", "body"])

Unnamed: 0,123pclassInteger,123survivedInteger,AbcsexVarchar(20),123ageNumeric(8),123sibspInteger,123parchInteger,AbcticketVarchar(36),123fareNumeric(12),AbccabinVarchar(30),AbcembarkedVarchar(20),AbcboatVarchar(100),Abchome.destVarchar(100)
1,1,0,male,36.0,0,0,13050,75.2417,C6,C,A,
2,1,0,male,[null],0,0,PC 17600,30.6958,[null],C,14,
3,1,1,female,29.0,0,0,24160,211.3375,B5,S,2,
4,1,1,male,0.92,1,2,113781,151.55,C22 C26,S,11,
5,1,1,male,48.0,0,0,19952,26.55,E12,S,3,
6,1,1,female,63.0,1,0,13502,77.9583,D7,S,10,
7,1,1,female,53.0,2,0,11769,51.4792,C101,S,D,
8,1,1,female,18.0,1,0,PC 17757,227.525,C62 C64,C,4,
9,1,1,female,24.0,0,0,PC 17477,69.3,B35,C,9,
10,1,1,male,80.0,0,0,27042,30.0,A23,S,B,


By using the 'drop' method, VerticaPy will simply exclude the specified columns from the SELECT query during SQL code generation.

In [23]:
print(vdf.current_relation())

(
   SELECT
     * 
   FROM
 (
                
   SELECT
     
                    "pclass",
     "survived",
     "sex",
     "age",
     "sibsp",
     "parch",
     "ticket",
     "fare",
     "cabin",
     "embarked",
     "boat",
     "home.dest" 
                
   FROM
 "public"."titanic") 
VERTICAPY_SUBTABLE WHERE (boat IS NOT NULL)) 
VERTICAPY_SUBTABLE


## 10. Maximize your resources

You might encounter datasets with hundreds of columns. These datasets can be resource intensive because you have to compute many aggregations at the same time. VerticaPy allows you to control the number of queries you'll send to the system, allowing for some useful optimizations.
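To make "controlling the number of queries" concrete, here is a rough sketch of how a list of columns can be split into blocks, with one query per block. The `column_blocks` helper is hypothetical and not part of VerticaPy; it only mirrors the idea behind the 'ncols_block' parameter used in the cells of this section.

```python
def column_blocks(columns, ncols_block):
    """Split a list of column names into consecutive blocks of at most ncols_block."""
    return [columns[i:i + ncols_block] for i in range(0, len(columns), ncols_block)]


columns = [f"x{i}" for i in range(600)]

one_query = column_blocks(columns, 600)    # a single block holding every column
six_queries = column_blocks(columns, 100)  # six blocks of 100 columns each

print(len(one_query), len(six_queries))
print(six_queries[1][0], six_queries[-1][-1])
```

Under this model, 600 columns with a block size of 100 translate into six smaller queries instead of one large one.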

Let's generate a large dataset and see what we can do to handle it.

In [24]:
from verticapy.datasets import gen_dataset
features_ranges = {}
for i in range(600):
    features_ranges[f"x{i}"] = {"type": float, "range": [0, 1]}
vp.drop("test_dataset", method = "table")
vdf = gen_dataset(features_ranges, nrows = 10000).to_db("test_dataset", 
                                                        relation_type = "table", 
                                                        inplace = True)
vdf

Unnamed: 0,123x0Float(22),123x1Float(22),123x2Float(22),123x3Float(22),123x4Float(22),123x5Float(22),123x6Float(22),123x7Float(22),123x8Float(22),123x9Float(22),123x10Float(22),123x11Float(22),123x12Float(22),123x13Float(22),123x14Float(22),123x15Float(22),123x16Float(22),123x17Float(22),123x18Float(22),123x19Float(22),123x20Float(22),123x21Float(22),123x22Float(22),123x23Float(22),123x24Float(22),...,123x575Float(22),123x576Float(22),123x577Float(22),123x578Float(22),123x579Float(22),123x580Float(22),123x581Float(22),123x582Float(22),123x583Float(22),123x584Float(22),123x585Float(22),123x586Float(22),123x587Float(22),123x588Float(22),123x589Float(22),123x590Float(22),123x591Float(22),123x592Float(22),123x593Float(22),123x594Float(22),123x595Float(22),123x596Float(22),123x597Float(22),123x598Float(22),123x599Float(22)
1,0.0001960028894245,0.223276673816144,0.569054801017046,0.146288738353178,0.953899076674134,0.0837573639582843,0.89617644296959,0.430673168972135,0.91946047032252,0.60745054576546,0.0478525573853403,0.684267201926559,0.777295779902488,0.669061989989132,0.997448976850137,0.73087637568824,0.674936565104872,0.422331467736512,0.903067613719031,0.867233776254579,0.292366462992504,0.763601926388219,0.536887338617817,0.710121874930337,0.418818174628541,...,0.229242044733837,0.764463270548731,0.360086736502126,0.783899937057868,0.189401683397591,0.264250955777243,0.728645598050207,0.703070270363241,0.161886852234602,0.786843160865828,0.177094651618972,0.351087041897699,0.3317554758396,0.605905545176938,0.0202365254517645,0.879995567258447,0.546785854035988,0.981807304313406,0.322994549991563,0.194810706656426,0.864404631080106,0.493528368882835,0.0382401132956147,0.981283981585875,0.0058721215464174
2,0.0003493577241897,0.373498801840469,0.455233322223648,0.76446100580506,0.350019642850384,0.631780713796616,0.0129033336415887,0.0487699867226183,0.988948704442009,0.568891275441274,0.731783567694947,0.694827085360885,0.893559980904683,0.70381576824002,0.396100029814988,0.107085749972612,0.542411925038323,0.229081634432077,0.174546503461897,0.983786367811263,0.128322220873088,0.61485845525749,0.583496453473344,0.457593141589314,0.543760802596807,...,0.756734935799614,0.119025048334152,0.9494756560307,0.205647339578718,0.0212743829470128,0.290157475974411,0.948894704226404,0.561747586121783,0.678131439490244,0.199145698919892,0.380760907195508,0.764390038326383,0.336998986778781,0.716369953937829,0.360648927511647,0.622318131383508,0.160239834105596,0.0013183676637709,0.293792468961328,0.0396779901348054,0.385879191337153,0.287090267520398,0.136456688633189,0.734194091986865,0.21533956239
3,0.0003974714782088,0.541172124678269,0.0819042101502419,0.612209769198671,0.165250096470118,0.790135562187061,0.471942559583113,0.837421969277784,0.397448896197602,0.578222304116935,0.205473734531552,0.393783261533827,0.221182169625536,0.987276840023696,0.50043906015344,0.101590456906706,0.922020850237459,0.195601245155558,0.70166488410905,0.763824791647494,0.747480938443914,0.665059823775664,0.853508043335751,0.249596880516037,0.284520755754784,...,0.870145999127999,0.839568693889305,0.0595940668135881,0.415375182405114,0.408280453179032,0.520656133070588,0.788697505835444,0.999560160795227,0.328312843339518,0.703751656459644,0.542353689903393,0.480476793367416,0.0089707348961383,0.848017371725291,0.624103717505932,0.0095096910372376,0.616531784180552,0.599590308731422,0.844192487886176,0.846037429058924,0.487896479200572,0.706647804239765,0.822090843925253,0.562016223790124,0.095338967628777
4,0.000591944437474,0.184637373546138,0.677951882593334,0.187467935029417,0.0279353931546211,0.463902011280879,0.16278133308515,0.789235214702785,0.284193359082565,0.934725100873038,0.645646793534979,0.516607904806733,0.685419567162171,0.0983414296060801,0.0763954375870526,0.727678582305089,0.22128353221342,0.0310072842985392,0.674436809727922,0.79346641083248,0.490932506974787,0.780443067196757,0.240949733415619,0.262164728017524,0.757681918796152,...,0.714937340468168,0.532553632743657,0.532863085623831,0.24184009176679,0.658881615148857,0.774733348051086,0.705402012914419,0.29164276458323,0.337121942080557,0.842093809507787,0.255741101223975,0.373403426259756,0.18764799204655,0.61292251595296,0.902382029453292,0.098786182468757,0.424539287108928,0.580491655506194,0.281026278622448,0.250902089523152,0.869504528352991,0.285651515470818,0.119645486352965,0.0466237645596266,0.0754034735728055
5,0.0006170670967549,0.105564348399639,0.584340079687536,0.953900561202317,0.326041501248255,0.170486767310649,0.569921227870509,0.319427700480446,0.321388313546777,0.206756635569036,0.295853655319661,0.0198231625836343,0.819703753106296,0.72213381761685,0.454652979271486,0.978009321959689,0.826862358720973,0.154828996863216,0.187985793687403,0.742480967659503,0.109980063512921,0.992332164430991,0.0584800047799945,0.969711811980233,0.95329147647135,...,0.37454177159816,0.897310346597806,0.223332173423842,0.368544396245852,0.354954629670829,0.504379642661661,0.222957924474031,0.665943633997813,0.147880225908011,0.863684917567298,0.192000756273046,0.827987043419853,0.0424004790838808,0.008015566272661,0.0995148345828056,0.237968357512727,0.235770049039274,0.273310487391427,0.139218844473362,0.0545498847495764,0.310334933456033,0.257199371233582,0.0438421927392483,0.447895624442026,0.517899657366797
6,0.0008906864095479,0.0003899734001606,0.0188652069773525,0.0226375588681549,0.829312362940982,0.448174893856049,0.13616539654322,0.326967870583758,0.489925740519539,0.862510353559628,0.696743687614799,0.980027993908152,0.71159599465318,0.960573458345607,0.202505079098046,0.718040975276381,0.844679709291086,0.821532419417053,0.640146883204579,0.157010502414778,0.563411245821044,0.86328223766759,0.976324273506179,0.644337182631716,0.992767357267439,...,0.236641296884045,0.279278548900038,0.829369797371328,0.651575671276078,0.939473843434826,0.906315282220021,0.7361893823836,0.113148809410632,0.366884063929319,0.396213509142399,0.751739027677104,0.107069255784154,0.444786676205695,0.38816038123332,0.708308018278331,0.843149495078251,0.529179016593844,0.709357707761228,0.0336217009462416,0.270866733975708,0.587035881122574,0.785210688598454,0.626110471086577,0.753253973089159,0.0204890330787748
7,0.0009910960216075,0.318886847002432,0.945748614147305,0.995994537835941,0.864479599054903,0.93280887324363,0.556491694878787,0.878133221995085,0.91909973626025,0.439732129219919,0.375389935914427,0.87227628310211,0.567174246301875,0.960531617049128,0.760713666677475,0.234736167127267,0.316927049309015,0.0959935167338699,0.32661251607351,0.774928649654612,0.519779916619882,0.424948895350099,0.884039038093761,0.136508940951899,0.149809438269585,...,0.40674111014232,0.457084704888985,0.632804575143382,0.277281341841444,0.882114335428923,0.971609567524865,0.852457026951015,0.0565257731359452,0.393363466951996,0.450378278503194,0.651532074203715,0.0145144085399806,0.729614259209484,0.137245898135006,0.985782707808539,0.683276666328311,0.780729256104678,0.63319678325206,0.120685623027384,0.101446935441345,0.597757416078821,0.548413366544992,0.663553500315174,0.727297874167562,0.855718491366133
8,0.0011122152209281,0.127737686270848,0.988597585819662,0.271905181929469,0.185093800537288,0.800268192775548,0.719612940214574,0.173993803327903,0.500969029497355,0.682003990979865,0.43805263354443,0.28433684585616,0.542928143637255,0.129249114310369,0.6218430576846,0.916100636590272,0.283548012143001,0.609062026254833,0.233680115779862,0.142073763068765,0.895518846111372,0.964889961061999,0.200529202818871,0.226821770658717,0.924190223915502,...,0.511151976650581,0.181896143360063,0.42599389469251,0.437802878208458,0.442645092494786,0.47630655975081,0.0523433464113623,0.204150154488161,0.659303980879486,0.894879574887455,0.765794385923073,0.14462918927893,0.0729044564068317,0.0634878608398139,0.758046169765294,0.858581097098067,0.507445923984051,0.106576903956011,0.543448687996715,0.406176104443148,0.220278148772195,0.800542004173622,0.690266371006146,0.623673480469733,0.84685589838773
9,0.0011383290402591,0.514727840432897,0.559510929044336,0.836967127164826,0.215422366978601,0.330432499758899,0.635496387025341,0.277483627200127,0.13808006234467,0.603859294904396,0.640630841255188,0.863297674804926,0.71823794930242,0.65601344476454,0.760873788967729,0.976938360137865,0.313614796148613,0.181986263487488,0.785577922128141,0.644765353295952,0.664857805008069,0.490055622300133,0.707063964335248,0.935479175765067,0.0218988535925746,...,0.987041938817129,0.949876811821014,0.694530924083665,0.669157133204862,0.255534769734368,0.890554925426841,0.475527179427445,0.939421659335494,0.61769586103037,0.687824450666085,0.706856720848009,0.368430522270501,0.626537130679935,0.695024225162342,0.444609229918569,0.913069888250902,0.9948617646005,0.751152554294094,0.193772099446505,0.749380331952125,0.584758857265115,0.0151752014644444,0.0385855652857572,0.331208134768531,0.742575838929042
10,0.0011410950683057,0.16232419735752,0.217413087841123,0.90710126189515,0.452350787352771,0.408846958074719,0.0475887660868466,0.942377260886133,0.38022549287416,0.36730993213132,0.629348946269602,0.860214521177113,0.498138072201982,0.144945178413764,0.784107675775886,0.275741598568857,0.567494807066396,0.834918017266318,0.449279004475102,0.237954310839996,0.278288401430473,0.948916923021898,0.419836819870397,0.434775270987302,0.241078492254019,...,0.251380854984745,0.460777881089598,0.130139776039869,0.392375285271555,0.0530416590627283,0.618939404841512,0.788307816721499,0.886184368049726,0.243720447411761,0.905561363790184,0.225589220412076,0.526406581746414,0.111439529107884,0.399240131257102,0.781824928708375,0.0774167242925614,0.917795746820047,0.694827311206609,0.906841398915276,0.938551354222,0.663351311581209,0.0137427840381861,0.496990589192137,0.264630533521995,0.704548572655767


To see what is happening when you compute aggregations, turn on the SQL code generation and turn off the cache.

In [25]:
vp.set_option("sql_on", True)
vp.set_option("cache", False)

When you compute aggregations, VerticaPy allows you to send multiple queries iteratively or at the same time. To see this in action, let's compute the average of each dataset column. 

You can see that sending one big query is quite resource-intensive:

In [26]:
display(vdf.avg(ncols_block = 600))

Unnamed: 0,avg
"""x0""",0.504701972361747
"""x1""",0.501500424734084
"""x2""",0.496360847263644
"""x3""",0.501922676050873
"""x4""",0.501098696065019
"""x5""",0.501922402327834
"""x6""",0.497511838229094
"""x7""",0.4995424558128
"""x8""",0.501139220085368
"""x9""",0.498195640844759


By using blocks of 100 columns, you reduce your impact on the system.

In [27]:
display(vdf.avg(ncols_block = 100))

Unnamed: 0,avg
"""x0""",0.504701972361747
"""x1""",0.501500424734084
"""x2""",0.496360847263644
"""x3""",0.501922676050873
"""x4""",0.501098696065019
"""x5""",0.501922402327834
"""x6""",0.497511838229094
"""x7""",0.4995424558128
"""x8""",0.501139220085368
"""x9""",0.498195640844759


You can also send multiple queries at the same time. To do this, you specify the number of 'processes', which can be understood as the number of workers involved in the computation. Each child process will create a DB connection and send its query. In the following example, we will use 6 'processes'.

In [28]:
display(vdf.avg(ncols_block = 100, processes = 6))

Unnamed: 0,avg
"""x0""",0.500139651982998
"""x1""",0.501554523924948
"""x2""",0.501206196407368
"""x3""",0.500016289698682
"""x4""",0.504199397301208
"""x5""",0.498168718028883
"""x6""",0.500166695084632
"""x7""",0.500573250358412
"""x8""",0.498125163944671
"""x9""",0.500822512500989


You should always measure the impact you'll have on the system. Sometimes it is better to send multiple queries iteratively or in parallel rather than one big query. The optimal method depends on the use case.
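As a way to practice that kind of measurement without touching a database, the sketch below times the same set of block queries run one after another and then concurrently. Everything here is a stand-in: `fake_block_query` sleeps instead of querying Vertica, `run_blocks` is a hypothetical helper, and a thread pool is used for simplicity where the text above describes child processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_block_query(block):
    """Stand-in for one aggregation query; sleeps instead of hitting a database."""
    time.sleep(0.05)
    return {col: 0.5 for col in block}


def run_blocks(blocks, workers=1):
    """Run one fake query per block, sequentially or with a pool of workers."""
    start = time.perf_counter()
    if workers == 1:
        results = [fake_block_query(block) for block in blocks]
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(fake_block_query, blocks))
    # Merge the per-block results into a single column -> value mapping.
    merged = {}
    for partial in results:
        merged.update(partial)
    return merged, time.perf_counter() - start


columns = [f"x{i}" for i in range(600)]
blocks = [columns[i:i + 100] for i in range(0, len(columns), 100)]

sequential, t_seq = run_blocks(blocks, workers=1)
parallel, t_par = run_blocks(blocks, workers=6)
print(f"sequential: {t_seq:.2f}s, parallel: {t_par:.2f}s")
```

On a real system the picture is less clear-cut than with sleeping stand-ins, since parallel queries compete for the same database resources, which is exactly why measuring matters.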