# A Short introduction to vaex

vaex is a Python package made by [two+](https://github.com/vaexio/vaex/graphs/contributors) really smart cookies: Maarten Breddels & Jovan Veljanoski.

Vaex caught my attention a long time ago when I needed to solve a fundamental problem, best described in this image:

![0](artwork/vaex0.png)

How do I complete the transition from Source (top) to Sink (bottom) with just 12 GB of RAM (OS included)?

A number of tricks need to be employed.

1. The data that passes unchanged through filter-1 and goes directly to the merge doesn't need any processing, so I only keep the `False` indices of that filter, because that index is shorter than the `True` index.
2. The data that passes through filter-2 is also used by Repair-1 in an "inplace" update, to avoid memory duplication.
3. To free up memory after Repair-1 the data is dumped to disk as HDF5.
4. The data that passes through filter-3 is also used by Repair-2 in an "inplace" update, so that memory duplication is avoided.
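
Tricks 1, 2 and 4 can be sketched with plain NumPy. This is a minimal illustration, not the pipeline from the image: the array `values`, the filter condition and the repair rule are all made up for the example.

```python
import numpy as np

# Hypothetical data: negative values stand for rows that need repair.
values = np.array([5.0, -1.0, 7.0, 3.0, -2.0, 9.0, 4.0, 8.0])

# Filter: True = passes unchanged, False = needs repair.
passes = values >= 0

# Trick 1: keep whichever index is shorter instead of the full-length mask.
true_idx = np.flatnonzero(passes)
false_idx = np.flatnonzero(~passes)
kept_idx = false_idx if len(false_idx) < len(true_idx) else true_idx

# Tricks 2 and 4: repair in place, so no second full copy of `values` is made.
# (Only the few selected rows are copied temporarily on the right-hand side.)
buffer_before = values.__array_interface__["data"][0]
values[false_idx] = np.abs(values[false_idx])  # hypothetical repair rule
buffer_after = values.__array_interface__["data"][0]

print(kept_idx)
print(values)
print(buffer_before == buffer_after)  # the array still lives in the same buffer
```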

To generalize further, the typical data cleanup functions are:

0. View the data.

1. append datasets
2. filter datasets
3. sort
4. groupby
5. pivot / reverse pivot
6. change datatype (typically string (from csv) to something else)
7. join datasets
8. lookup
9. create charts
10. custom functions reading 1 or more columns to generate or replace 1 or more columns.



In [1]:
2+7

9

In [2]:
import vaex
import pathlib

In [3]:
def load(path, **kwargs):
    imported_file = path.parent / pathlib.Path(path.name + '.hdf5')
    if imported_file.exists():
        df = vaex.open(imported_file)
    elif path.name.endswith('txt'):
        df = vaex.from_csv(path, chunk_size=1_000_000, convert=True, **kwargs)
    else:
        raise NotImplementedError(f"unsupported file type: {path.name}")
    return df

In [4]:
target_file = pathlib.Path(r"D:\Newport_Data_DEC-MAR.txt")
df = load(target_file, delimiter="\t", progress=True)

Converting csv to chunk files
Saved chunk #0 to D:\Newport_Data_DEC-MAR.txt_chunk_0.hdf5
Saved chunk #1 to D:\Newport_Data_DEC-MAR.txt_chunk_1.hdf5
Saved chunk #2 to D:\Newport_Data_DEC-MAR.txt_chunk_2.hdf5
Saved chunk #3 to D:\Newport_Data_DEC-MAR.txt_chunk_3.hdf5
Saved chunk #4 to D:\Newport_Data_DEC-MAR.txt_chunk_4.hdf5
Saved chunk #5 to D:\Newport_Data_DEC-MAR.txt_chunk_5.hdf5
Saved chunk #6 to D:\Newport_Data_DEC-MAR.txt_chunk_6.hdf5
Converting 7 chunks into single file D:\Newport_Data_DEC-MAR.txt.hdf5
export(hdf5) [########################################] 100.00% elapsed time  :    72.19s =  1.2m =  0.0h          
 

## View the data

In [5]:
df

#,ORIG_UT_ID,DST_UT_ID,MJR_IT_CD_ID,IT_DSC_TX,P_SHP_DT,P_SHP_TM,DAY_DT,ORDR_SRC_CT,DAY_TM,P_DMD_CT,P_XDCK_FLG,WRK_CT,P_CRT_PRC_CT,SHTL_UT_ID,MPAK_IT_QT,VPAK_IT_QT,P_RCV_DT,P_RCV_TM,P_SELL_BY_DT,UCC128_CD_ID,PO_NBR_ID,TLR_ID,TLR_CTNR_TYP_ID,P_HDL_CT,P_ATL_DVRT_DT,P_ATL_DVRT_TM,SHP_CTNR_ID,SHP_CTNR_TYP_ID,P_QT,LB_SRL_ID,UT_RCT_ID,SHTL_FROM_UT_ID,UT_RCT_CT,LD_ID,VDR_ID,VDR_NM_TX,MPAK_HGT_QT,MPAK_WGT_QT,MPAK_DPH_QT,MPAK_WDTH_QT,MPAK_CBE_QT,OUTLD_ID,TLR_LD_ID,UPC_ID,VPAK_HGT_QT,VPAK_WGT_QT,VPAK_DPH_QT,VPAK_WDTH_QT,VPAK_CBE_QT,PKY_ID,PKY_DSC_TX,P_ID,BYR_ID,MJR_P_SUB_CT_ID,MJR_P_SUB_CT_NM_TX,MPAKS_SHIPPED
0,883,34,121213,PC CHOC MILK 1% LOWFAT 1/2 GAL DF,1/20/2021,182941,1/21/2021,36,60428,REG,N,HSEL,HAND,0,9,9,1/20/2021,21759,2/1/2021,,212050180,413250,53,2,1/20/2021,182941,7761495,PL,9,7761478,1748292492,0,P,407501,1005419,PURPLE COW CREAMERY,10.8661,578.4,12.9134,12.6772,1778.8032,64879,413250,70882095744,10.8661,578.4,12.9134,12.6772,1778.8032,7,DAIRY,3642330,7,L3-004126,MILK PREMIUM RFRG DAIRY,1
1,883,268,479669,KRAFT SHRED CHEESE SHARP CHEDDAR 16 OZ,12/14/2020,130748,12/15/2020,36,55447,REG,N,HSEL,HAND,0,12,12,12/10/2020,62152,5/3/2021,,211924312,411115,53,1,12/14/2020,130748,7810420,PL,12,7810188,1591858735,0,P,404683,1003132,KRAFT FOODS,6.8,209.6,15.9,12.1,1308.2688,61476,411115,2100005371,6.8,209.6,15.9,12.1,1308.2688,7,DAIRY,502802,717,L3-004260,SHREDS CHUNKS PKGD CHEESE,1
2,883,55,50391,DANNON ACTIVIA LT STRAW BANANA PEACH 12/,12/16/2020,215157,12/17/2020,36,62909,REG,N,HSEL,HAND,0,4,4,12/8/2020,70535,1/13/2021,,211920154,411198,53,2,12/16/2020,215157,7878390,PL,4,7877697,966630323,0,P,404520,4024,DANNON YOGURT,5.0,212.64,16.0,9.0,720.0576,61613,411198,3663202764,5.0,212.64,16.0,9.0,720.0576,7,DAIRY,3010581,707,L3-002966,YOGURT RFRG DAIRY,1
3,882,227,249608,MEIJER CHICKEN BREAST THIN SLICED BS 40,2/13/2021,140118,2/14/2021,36,142230,REG,N,HSEL,BELT,0,12,12,2/12/2021,23736,10/21/2021,,212113806,985927,53,1,2/13/2021,140118,7490865,LD,12,7490350,2023697984,0,P,246224,39154,TYSON FOODS,8.2283,588.4755,23.2677,15.7874,3022.5548,65594,985927,70882000167,8.2283,588.4755,23.2677,15.7874,3022.5548,2,MEAT,3298997,207,L3-003085,CHICKEN FROZEN,1
4,883,109,34610,MEIJER ALL NATURAL SOUR CREAM 16 OZ,12/11/2020,202759,12/12/2020,36,63332,REG,N,HSEL,HAND,0,12,12,12/9/2020,33337,2/2/2021,,80199756,410910,53,2,12/11/2020,202759,7807527,PL,12,7806874,781801325,0,S,404868,0,?,6.4,207.2,13.6,9.3,809.3952,61187,410910,71928361714,6.4,207.2,13.6,9.3,809.3952,7,DAIRY,504314,7,L3-002958,CULTURED RFRG DAIRY,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6264700,883,27,219621,TURNIP GREENS BUNCH,1/10/2021,191554,1/11/2021,34,55415,REG,N,HSEL,HAND,0,18,18,1/9/2021,90226,1/9/2021,,211984479,412647,53,2,1/10/2021,191554,7684721,PL,18,7684571,807001881,0,P,406636,40338,RIO FRESH INC,7.5591,266.4,19.4488,13.3858,1967.8464,63887,412647,4619,7.5591,266.4,19.4488,13.3858,1967.8464,3,PRODUCE,520037,330,L3-003135,WET VEGETABLES,1
6264701,883,32,280054,DAISY SOUR CRM LT 16 OZ,12/24/2020,142339,12/25/2020,36,60855,REG,N,HSEL,HAND,0,12,12,12/4/2020,73723,3/1/2021,,211897008,411673,53,2,12/24/2020,142339,7939737,PL,12,7939270,1883702765,0,P,404486,37695,DAISY BRAND,4.8,210.4,15.8,11.7,887.328,62344,411673,7342000015,4.8,210.4,15.8,11.7,887.328,7,DAIRY,498754,7,L3-002958,CULTURED RFRG DAIRY,1
6264702,883,286,280430,SIMPLY OJ ORIG 89 OZ,2/26/2021,124719,2/27/2021,36,63415,REG,N,HSEL,HAND,0,6,6,2/24/2021,41604,4/27/2021,,212150085,415567,53,2,2/26/2021,124719,7690859,PL,6,7690278,1841264994,0,P,409504,1003128,MINUTE MAID CHILLED,10.7,628.0,20.0,10.0,2139.9552,68622,415567,2500005433,10.7,628.0,20.0,10.0,2139.9552,7,DAIRY,495132,707,L3-002962,JUICE AND JUICE DRINK RFRG,1
6264703,883,307,203189,MANGOS LARGE BOX,2/28/2021,155542,3/1/2021,34,55636,REG,N,HSEL,HAND,0,24,24,2/26/2021,90227,2/26/2021,,212111389,415693,53,2,2/28/2021,155542,7642414,PL,24,7642160,1912790060,0,P,409841,1004670,AMAZON PRODUCE NETWORK,4.8,360.0,15.7,23.7,1786.0608,68857,415693,4959,4.8,360.0,15.7,23.7,1786.0608,3,PRODUCE,519597,311,L3-003133,TROPICAL FRUIT,1


In [6]:
print(f"{len(df)} rows, {len(list(df))} columns\n{list(df)}")  # find columns using __iter__

6264705 rows, 56 columns
['ORIG_UT_ID', 'DST_UT_ID', 'MJR_IT_CD_ID', 'IT_DSC_TX', 'P_SHP_DT', 'P_SHP_TM', 'DAY_DT', 'ORDR_SRC_CT', 'DAY_TM', 'P_DMD_CT', 'P_XDCK_FLG', 'WRK_CT', 'P_CRT_PRC_CT', 'SHTL_UT_ID', 'MPAK_IT_QT', 'VPAK_IT_QT', 'P_RCV_DT', 'P_RCV_TM', 'P_SELL_BY_DT', 'UCC128_CD_ID', 'PO_NBR_ID', 'TLR_ID', 'TLR_CTNR_TYP_ID', 'P_HDL_CT', 'P_ATL_DVRT_DT', 'P_ATL_DVRT_TM', 'SHP_CTNR_ID', 'SHP_CTNR_TYP_ID', 'P_QT', 'LB_SRL_ID', 'UT_RCT_ID', 'SHTL_FROM_UT_ID', 'UT_RCT_CT', 'LD_ID', 'VDR_ID', 'VDR_NM_TX', 'MPAK_HGT_QT', 'MPAK_WGT_QT', 'MPAK_DPH_QT', 'MPAK_WDTH_QT', 'MPAK_CBE_QT', 'OUTLD_ID', 'TLR_LD_ID', 'UPC_ID', 'VPAK_HGT_QT', 'VPAK_WGT_QT', 'VPAK_DPH_QT', 'VPAK_WDTH_QT', 'VPAK_CBE_QT', 'PKY_ID', 'PKY_DSC_TX', 'P_ID', 'BYR_ID', 'MJR_P_SUB_CT_ID', 'MJR_P_SUB_CT_NM_TX', 'MPAKS_SHIPPED']


## select a subset of columns

In [7]:
selection = df[
    ['ORIG_UT_ID', 'DST_UT_ID', 'P_SHP_DT', 'PO_NBR_ID','VPAK_WDTH_QT', 'VPAK_CBE_QT', 'PKY_ID', 'PKY_DSC_TX', 'P_ID', 'BYR_ID', 'MPAKS_SHIPPED']
]
selection

#,ORIG_UT_ID,DST_UT_ID,P_SHP_DT,PO_NBR_ID,VPAK_WDTH_QT,VPAK_CBE_QT,PKY_ID,PKY_DSC_TX,P_ID,BYR_ID,MPAKS_SHIPPED
0,883,34,1/20/2021,212050180,12.6772,1778.8032,7,DAIRY,3642330,7,1
1,883,268,12/14/2020,211924312,12.1,1308.2688,7,DAIRY,502802,717,1
2,883,55,12/16/2020,211920154,9.0,720.0576,7,DAIRY,3010581,707,1
3,882,227,2/13/2021,212113806,15.7874,3022.5548,2,MEAT,3298997,207,1
4,883,109,12/11/2020,80199756,9.3,809.3952,7,DAIRY,504314,7,1
...,...,...,...,...,...,...,...,...,...,...,...
6264700,883,27,1/10/2021,211984479,13.3858,1967.8464,3,PRODUCE,520037,330,1
6264701,883,32,12/24/2020,211897008,11.7,887.328,7,DAIRY,498754,7,1
6264702,883,286,2/26/2021,212150085,10.0,2139.9552,7,DAIRY,495132,707,1
6264703,883,307,2/28/2021,212111389,23.7,1786.0608,3,PRODUCE,519597,311,1


## append datasets (1/10)

In [8]:
df_2x = selection.concat(selection)
len(df_2x), len(df)

(12529410, 6264705)

## filter datasets (2/10)

In [9]:
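# Parenthesize each comparison (`&` binds tighter than `>` and `<`); the output below came from the unparenthesized form and so still shows PKY_ID 2 and 3.
selection[(selection.PKY_ID > 3) & (selection.PKY_ID < 50)]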

#,ORIG_UT_ID,DST_UT_ID,P_SHP_DT,PO_NBR_ID,VPAK_WDTH_QT,VPAK_CBE_QT,PKY_ID,PKY_DSC_TX,P_ID,BYR_ID,MPAKS_SHIPPED
0,883,34,1/20/2021,212050180,12.6772,1778.8032,7,DAIRY,3642330,7,1
1,883,268,12/14/2020,211924312,12.1,1308.2688,7,DAIRY,502802,717,1
2,883,55,12/16/2020,211920154,9.0,720.0576,7,DAIRY,3010581,707,1
3,882,227,2/13/2021,212113806,15.7874,3022.5548,2,MEAT,3298997,207,1
4,883,109,12/11/2020,80199756,9.3,809.3952,7,DAIRY,504314,7,1
...,...,...,...,...,...,...,...,...,...,...,...
6264700,883,27,1/10/2021,211984479,13.3858,1967.8464,3,PRODUCE,520037,330,1
6264701,883,32,12/24/2020,211897008,11.7,887.328,7,DAIRY,498754,7,1
6264702,883,286,2/26/2021,212150085,10.0,2139.9552,7,DAIRY,495132,707,1
6264703,883,307,2/28/2021,212111389,23.7,1786.0608,3,PRODUCE,519597,311,1
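
A range filter like the one above needs parentheses around each comparison, because Python's `&` binds more tightly than `>` and `<`. The sketch below checks this on a small NumPy array with made-up values rather than on the vaex DataFrame.

```python
import numpy as np

pky_id = np.array([7, 7, 2, 47, 3, 214, 1])  # made-up PKY_ID values

# Each comparison is parenthesized, then the two boolean masks are combined.
mask = (pky_id > 3) & (pky_id < 50)
filtered = pky_id[mask]

# Without parentheses, Python evaluates `3 & pky_id` first (a bitwise AND),
# which is not a range test at all.
bitwise_first = 3 & pky_id

print(filtered)
print(bitwise_first)
```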


## sort (3/10)

In [10]:
unsorted_selection = selection.extract()  # first extract to freeze the DF.
sorted_selection = unsorted_selection.sort(by="PKY_ID", ascending=False)  # then sort the frozen DF by its own column
sorted_selection  # show.

#,ORIG_UT_ID,DST_UT_ID,P_SHP_DT,PO_NBR_ID,VPAK_WDTH_QT,VPAK_CBE_QT,PKY_ID,PKY_DSC_TX,P_ID,BYR_ID,MPAKS_SHIPPED
0,883,117,2/19/2021,210410037,0.0,273.024,214,BABY CONSUMABLES,4420409,218,1
1,883,122,12/4/2020,210239047,0.0,181.44,214,BABY CONSUMABLES,4671127,218,1
2,883,308,12/29/2020,211969568,0.0,181.44,214,BABY CONSUMABLES,4671126,218,1
3,883,115,2/5/2021,210410004,0.0,273.024,214,BABY CONSUMABLES,4420416,218,1
4,883,34,12/4/2020,210239047,0.0,273.024,214,BABY CONSUMABLES,4420415,218,1
...,...,...,...,...,...,...,...,...,...,...,...
6264700,883,65,2/9/2021,212090772,7.5,483.84,1,DRY GROCERY,4398117,12,1
6264701,883,32,12/16/2020,211818117,7.5,483.84,1,DRY GROCERY,4398116,12,1
6264702,883,237,1/1/2021,211859147,7.5,483.84,1,DRY GROCERY,4398117,12,1
6264703,883,222,12/16/2020,211818117,7.5,483.84,1,DRY GROCERY,4398116,12,1


In [11]:
# alternatively use numpy to obtain a multi-criteria sort index.
import numpy as np

In [12]:
# Evaluate the two sort keys to NumPy arrays (the choice of key columns is illustrative).
primary = unsorted_selection.PKY_ID.to_numpy()
secondary = unsorted_selection.P_ID.to_numpy()
order = np.lexsort((secondary, primary))  # np.lexsort uses the LAST key as the primary key

In [24]:
unsorted_selection.take(order)  # rows ordered by PKY_ID, ties broken by P_ID

## groupby (4/10)

Supported groupby functions

- count: Number of elements in a group
- first: The first element in a group
- max: The largest value in a group
- min: The smallest value in a group
- sum: The sum of a group
- mean: The mean value of a group
- std: The standard deviation of a group
- var: The variance of a group
- nunique: Number of unique elements in a group

In [36]:
# Groupby P_SHP_DT, sum (MPAKS_SHIPPED)
selection.groupby(by="P_SHP_DT").agg({'MPAKS_SHIPPED': 'sum'})

#,P_SHP_DT,MPAKS_SHIPPED
0,2/17/2021,64298
1,1/17/2021,72176
2,12/16/2020,97609
3,2/1/2021,81184
4,2/9/2021,71759
...,...,...
87,2/28/2021,73358
88,1/19/2021,108443
89,1/11/2021,89584
90,1/20/2021,83162


## pivot / reverse pivot (5/10)

In [14]:
# Pivot the data? A groupby that is displayed as a sparse matrix.
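
One way to read "a groupby displayed as a matrix": group on two keys, then spread the second key over the columns. The following is a minimal plain-Python sketch with made-up records; it does not use vaex and says nothing about how vaex itself would do this.

```python
from collections import defaultdict

# Made-up (ship date, category, packs shipped) records.
records = [
    ("2021-01-20", "DAIRY", 1),
    ("2021-01-20", "MEAT", 2),
    ("2021-01-21", "DAIRY", 3),
    ("2021-01-20", "DAIRY", 4),
]

# Step 1: group by (row key, column key) and sum the values.
totals = defaultdict(int)
for day, category, packs in records:
    totals[(day, category)] += packs

# Step 2: pivot, one row per day and one column per category (0 = no data).
days = sorted({day for day, _ in totals})
categories = sorted({category for _, category in totals})
pivot = [[totals.get((day, category), 0) for category in categories] for day in days]

# Reverse pivot: back to long (day, category, total) rows.
# Note that this treats a total of 0 as an empty cell.
long_rows = [
    (day, category, pivot[i][j])
    for i, day in enumerate(days)
    for j, category in enumerate(categories)
    if pivot[i][j] != 0
]
```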

## change datatype (6/10)
Typically from string (as read from CSV) to something else.

In [31]:
import pyarrow

In [34]:
def show(df):
    for col_name, data in df.columns.items():
        print(col_name, end=' ')
        if isinstance(data, np.ndarray):
            print(data.dtype.name)
        elif isinstance(data, pyarrow.lib.StringArray):
            print('str')
        else:
            print(type(data))  # report the type of the column data, not of its name

In [35]:

show(selection)

ORIG_UT_ID int64
DST_UT_ID int64
P_SHP_DT str
PO_NBR_ID int64
VPAK_WDTH_QT float64
VPAK_CBE_QT float64
PKY_ID int64
PKY_DSC_TX str
P_ID int64
BYR_ID int64
MPAKS_SHIPPED int64


### Convert string to datetime

In [15]:
selection['P_SHP_DT']

Expression = P_SHP_DT
Length: 6,264,705 dtype: string (column)
----------------------------------------
      0   1/20/2021
      1  12/14/2020
      2  12/16/2020
      3   2/13/2021
      4  12/11/2020
        ...        
6264700   1/10/2021
6264701  12/24/2020
6264702   2/26/2021
6264703   2/28/2021
6264704  12/10/2020

In [16]:
def f(i):
    m,d,y = i.split("/")
    return f"{y}-{m.zfill(2)}-{d.zfill(2)}"

In [17]:
assert f("1/1/2018") == "2018-01-01"

In [18]:
selection['P_SHP_DT2'] = selection['P_SHP_DT'].apply(f)

In [19]:
selection['P_SHP_DT2']

Expression = P_SHP_DT2
Length: 6,264,705 dtype: string (column)
----------------------------------------
      0  2021-01-20
      1  2020-12-14
      2  2020-12-16
      3  2021-02-13
      4  2020-12-11
        ...        
6264700  2021-01-10
6264701  2020-12-24
6264702  2021-02-26
6264703  2021-02-28
6264704  2020-12-10

In [20]:
selection['P_SHP_DT3'] = selection['P_SHP_DT2'].astype('datetime64')

In [21]:
selection['P_SHP_DT3']

Expression = P_SHP_DT3
Length: 6,264,705 dtype: datetime64[D] (column)
-----------------------------------------------
      0  2021-01-20
      1  2020-12-14
      2  2020-12-16
      3  2021-02-13
      4  2020-12-11
        ...        
6264700  2021-01-10
6264701  2020-12-24
6264702  2021-02-26
6264703  2021-02-28
6264704  2020-12-10

The process above can be wrapped as:
```python
def f(i):
    m,d,y = i.split("/")
    return f"{y}-{m.zfill(2)}-{d.zfill(2)}"

df[col] = df[col].apply(f).astype('datetime64')
```
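
`f` assumes every value is a well-formed `M/D/YYYY` string. If the column might contain malformed dates, a stricter variant could parse with the standard library instead. This is only a sketch: the name `to_iso_date` is made up here, and it is likely slower than the string split on millions of rows.

```python
from datetime import datetime


def to_iso_date(text):
    """Convert 'M/D/YYYY' to 'YYYY-MM-DD'; raises ValueError on invalid dates."""
    return datetime.strptime(text, "%m/%d/%Y").date().isoformat()


print(to_iso_date("1/20/2021"))
```

It can be passed to `apply` in the same way as `f`.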


In [22]:
selection[5:]

#,ORIG_UT_ID,DST_UT_ID,P_SHP_DT,PO_NBR_ID,VPAK_WDTH_QT,VPAK_CBE_QT,PKY_ID,PKY_DSC_TX,P_ID,BYR_ID,MPAKS_SHIPPED,P_SHP_DT2,P_SHP_DT3
0,883,145,12/29/2020,211953964,6.5,449.28,47,DELI,4062411,473,1,2020-12-29,2020-12-29
1,883,34,2/11/2021,212109664,9.2,1376.352,3,PRODUCE,4701397,380,1,2021-02-11,2021-02-11
2,883,315,12/15/2020,211892077,6.7,327.456,7,DAIRY,498508,717,1,2020-12-15,2020-12-15
3,883,53,12/31/2020,80190035,12.8,1799.8848,7,DAIRY,509508,7,1,2020-12-31,2020-12-31
4,883,63,2/2/2021,212017456,9.4,649.3824,7,DAIRY,502567,717,1,2021-02-02,2021-02-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6264695,883,27,1/10/2021,211984479,13.3858,1967.8464,3,PRODUCE,520037,330,1,2021-01-10,2021-01-10
6264696,883,32,12/24/2020,211897008,11.7,887.328,7,DAIRY,498754,7,1,2020-12-24,2020-12-24
6264697,883,286,2/26/2021,212150085,10.0,2139.9552,7,DAIRY,495132,707,1,2021-02-26,2021-02-26
6264698,883,307,2/28/2021,212111389,23.7,1786.0608,3,PRODUCE,519597,311,1,2021-02-28,2021-02-28


## join datasets (7/10)

In [37]:
a = np.array(['a', 'b', 'c'])
x = np.arange(1,4)
df1 = vaex.from_arrays(a=a, x=x)
df1

#,a,x
0,a,1
1,b,2
2,c,3


In [38]:
b = np.array(['a', 'b', 'd'])
y = x**2
df2 = vaex.from_arrays(b=b, y=y)
df2

#,b,y
0,a,1
1,b,4
2,d,9


In [39]:
df1.join(df2, left_on='a', right_on='b')

#,a,x,b,y
0,a,1,a,1
1,b,2,b,4
2,c,3,--,--


In [40]:
df1.join(df2, left_on='a', right_on='b', how='right')

#,b,y,a,x
0,a,1,a,1
1,b,4,b,2
2,d,9,--,--


In [41]:
df1.join(df2, left_on='a', right_on='b', how='inner')

#,a,x,b,y
0,a,1,a,1
1,b,2,b,4


## lookup (8/10)

Lookup is probably best processed as a nested loop, where LEFT is the target and RIGHT is the source:

1. Make the data in RIGHT available to all compute cores.
2. With a process pool, split the work in LEFT equally and start the search.
3. Assemble the results.
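
The three steps could look roughly like this with the standard library. It is a sketch under assumptions: LEFT is a list of keys, RIGHT is a dict, and the inner loop of the nested loop is swapped for a dict lookup. The names `lookup_chunk`, `split_evenly` and `parallel_lookup` are made up for the example.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def lookup_chunk(right, left_chunk):
    """Look up every key of one LEFT chunk in RIGHT; None marks a miss."""
    return [right.get(key) for key in left_chunk]


def split_evenly(items, parts):
    """Split a list into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        stop = start + size + (1 if i < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


def parallel_lookup(left, right, workers=2):
    """Share RIGHT with the workers, search LEFT in chunks, assemble in order."""
    chunks = split_evenly(left, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # partial() sends a copy of RIGHT along with each chunk of LEFT.
        results = pool.map(partial(lookup_chunk, right), chunks)
        return [value for chunk in results for value in chunk]
```

On platforms that start worker processes with spawn (Windows, macOS), call `parallel_lookup` from under an `if __name__ == "__main__":` guard.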

## create charts (9/10)

## custom functions (10/10)

Custom functions read one or more columns to generate or replace one or more columns.

Custom operations

In [7]:
df.get_column_names(virtual=True, strings=True, hidden=False, regex=None) # Return a list of column names
df.get_names(hidden=False)  # Return a list of column names and variable names.

['ORIG_UT_ID',
 'DST_UT_ID',
 'MJR_IT_CD_ID',
 'IT_DSC_TX',
 'P_SHP_DT',
 'P_SHP_TM',
 'DAY_DT',
 'ORDR_SRC_CT',
 'DAY_TM',
 'P_DMD_CT',
 'P_XDCK_FLG',
 'WRK_CT',
 'P_CRT_PRC_CT',
 'SHTL_UT_ID',
 'MPAK_IT_QT',
 'VPAK_IT_QT',
 'P_RCV_DT',
 'P_RCV_TM',
 'P_SELL_BY_DT',
 'UCC128_CD_ID',
 'PO_NBR_ID',
 'TLR_ID',
 'TLR_CTNR_TYP_ID',
 'P_HDL_CT',
 'P_ATL_DVRT_DT',
 'P_ATL_DVRT_TM',
 'SHP_CTNR_ID',
 'SHP_CTNR_TYP_ID',
 'P_QT',
 'LB_SRL_ID',
 'UT_RCT_ID',
 'SHTL_FROM_UT_ID',
 'UT_RCT_CT',
 'LD_ID',
 'VDR_ID',
 'VDR_NM_TX',
 'MPAK_HGT_QT',
 'MPAK_WGT_QT',
 'MPAK_DPH_QT',
 'MPAK_WDTH_QT',
 'MPAK_CBE_QT',
 'OUTLD_ID',
 'TLR_LD_ID',
 'UPC_ID',
 'VPAK_HGT_QT',
 'VPAK_WGT_QT',
 'VPAK_DPH_QT',
 'VPAK_WDTH_QT',
 'VPAK_CBE_QT',
 'PKY_ID',
 'PKY_DSC_TX',
 'P_ID',
 'BYR_ID',
 'MJR_P_SUB_CT_ID',
 'MJR_P_SUB_CT_NM_TX',
 'MPAKS_SHIPPED']

In [25]:
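items = [('x', np.arange(3)), ('y', np.arange(3) ** 2)]  # placeholder (name, array) pairs
data = {'x': np.arange(3), 'y': np.arange(3) ** 2}       # placeholder dict of name -> array
vaex.from_items(*items)
vaex.from_dict(data)
vaex.from_json   # constructor for JSON files (not called here)
vaex.vconstant   # virtual column holding a constant value (not called here)
vaex.vrange(0, 3, step=1, dtype='f8')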

## Custom operations and lazy evaluations

In [30]:
import vaex
import numpy as np
df = vaex.from_dict(
    {
        'id': np.array([1, 2, 3, 4]),
        'name': np.array(['Sally', 'Tom', 'Maria', 'John'])
    })

df2 = df[df.id > 2]  # lazy filter.

In [32]:
df2  # this is unmaterialized: the filter is still lazy

#,id,name
0,3,Maria
1,4,John


In [33]:
try:
    df2['age'] = np.array([27, 29])  # adding a new column to unmaterialized data will fail.
except ValueError as e:
    print(e)

Array is of length 2, while the length of the DataFrame is 2 due to the filtering, the (unfiltered) length is 4.


In [34]:
df2 = df2.extract()  # extract() returns a DataFrame containing only the filtered rows
df2['age'] = np.array([27, 29])  # Now I can add the data.
df2

#,id,name,age
0,3,Maria,27
1,4,John,29
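
The error message above makes more sense with a mental model of what a lazy filter is. The sketch below is a conceptual model in plain NumPy, not vaex's actual implementation: the filtered frame is pictured as the full-length columns plus a boolean mask, and extracting as applying that mask once.

```python
import numpy as np

ids = np.array([1, 2, 3, 4])
names = np.array(["Sally", "Tom", "Maria", "John"])

mask = ids > 2                  # the "lazy filter": nothing is copied yet
visible_rows = int(mask.sum())  # rows visible through the filter
stored_rows = len(ids)          # rows still stored underneath

age = np.array([27, 29])
fits_before_extract = len(age) == stored_rows  # False: 2 values for 4 stored rows

# "Extract": apply the mask once so stored and visible lengths agree.
ids_x, names_x = ids[mask], names[mask]
fits_after_extract = len(age) == len(ids_x)    # True
```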


In [None]:
add_column(name, f_or_array, dtype=None)
add_variable(name, expression, overwrite=True, unique=True)
add_virtual_column(name, expression, unique=False)
apply(f, arguments=None, vectorize=False, multiprocessing=True)
drop(columns, inplace=False, check=True)
drop_filter(inplace=False)

#df.close()  #  Close any possible open file handles or other resources, the DataFrame will not be in a usable state afterwards.
describe(strings=True, virtual=True, selection=None)  # descriptive statistics.
dropinf(column_names=None)
dropmissing(column_names=None)
dropna(column_names=None)
dropnan(column_names=None)

In [None]:
df.filter(expression, mode='and')

In [None]:
evaluate(expression, i1=None, i2=None, out=None, selection=None, filtered=True, array_type=None, parallel=True, chunk_size=None, progress=None)
evaluate_iterator(expression, s1=None, s2=None, out=None, selection=None, filtered=True, array_type=None, parallel=True, chunk_size=None, prefetch=True, progress=None)
# >>> import vaex
# >>> df = vaex.example()
# >>> for i1, i2, chunk in df.evaluate_iterator(df.x, chunk_size=100_000):
# ...     print(f"Total of {i1} to {i2} = {chunk.sum()}")
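
The point of iterating in chunks is that only one chunk needs to be processed at a time. A plain-NumPy sketch of the same pattern follows. `iter_chunks` is a made-up helper, not part of vaex; it only mirrors the `(i1, i2, chunk)` shape of the loop above, and here the whole array is already in memory, whereas a real out-of-core source would read each chunk from disk.

```python
import numpy as np


def iter_chunks(array, chunk_size):
    """Yield (start, stop, chunk) triples covering `array` in order."""
    for start in range(0, len(array), chunk_size):
        stop = min(start + chunk_size, len(array))
        yield start, stop, array[start:stop]


x = np.arange(10, dtype="f8")

total = 0.0
bounds = []
for i1, i2, chunk in iter_chunks(x, chunk_size=4):
    total += chunk.sum()  # aggregate chunk by chunk
    bounds.append((i1, i2))
```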

In [None]:
execute()  # Execute all delayed jobs.
extract()  # materialize

In [48]:
df.get_private_dir(create=False)  # DataFrame method: each DataFrame has a directory where files are stored for metadata etc.