### Import Relevant Libraries

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Installing Vaex

In [None]:
pip install --upgrade vaex

Using Conda, you can install Vaex via the following:

**conda install -c conda-forge vaex**

In [None]:
import vaex

### Make Artificial Data for Tutorial

In [None]:
import numpy as np
n_rows = 30000000 #30 million rows
n_cols = 10 #10 variables
df = pd.DataFrame(np.random.randint(100000000, 1000000000, size=(n_rows, n_cols)), columns=['c%d' % i for i in range(n_cols)])

In [None]:
df.info(memory_usage='deep') 
## 2.2GB of artificial data
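
The 2.2GB figure can be reproduced with a bit of arithmetic. The sketch below assumes every value is stored as an 8-byte integer (int64, the usual NumPy default on Linux and macOS). That dtype is an assumption, so check `df.dtypes` on your own machine.

```python
# Back-of-the-envelope memory estimate for the artificial DataFrame.
# Assumption: every value is an 8-byte integer (int64).
n_rows = 30_000_000
n_cols = 10
bytes_per_value = 8

total_bytes = n_rows * n_cols * bytes_per_value
total_gib = total_bytes / 1024**3  # 1 GiB = 1024^3 bytes

print(f"{total_bytes:,} bytes = {total_gib:.2f} GiB")
```

If the number reported by `df.info()` is noticeably different, the columns are probably stored with a different dtype than assumed here.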

In [None]:
%%time
# Save the artificial data so we can read it back in with both pandas and Vaex later
df.to_csv('data.csv', index=False)

### Comparing the time it takes to read in a CSV (pandas vs. Vaex)

In [None]:
%%time
pandas_df = pd.read_csv("data.csv", low_memory=False)

In [None]:
%%time
vaex_df = vaex.from_csv("data.csv",copy_index=False)

In [None]:
%%time
vaex_chunk_df = vaex.from_csv("data.csv",copy_index=False, convert=True, chunk_size=5_000)

From the above results, we see that using Vaex to read in the 2.2GB of artificial data takes slightly less time than reading the same data with pandas. Reading the data in chunks with Vaex actually takes much longer (note that the chunked call also passes `convert=True`, which asks Vaex to convert the CSV to HDF5 and so adds extra work). The Vaex documentation proudly boasts that "Vaex is a high performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It calculates statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid for more than a billion (10^9) samples/rows per second". The gains don't seem to be that big. What is going on here?

There are good articles that analyze the speed of Vaex in comparison to pandas and other big data tools (e.g. Dask, PySpark):

- https://towardsdatascience.com/how-to-analyse-100s-of-gbs-of-data-on-your-laptop-with-python-f83363dda94
- https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13
- https://www.kdnuggets.com/2021/05/vaex-pandas-1000x-faster.html

These articles show how efficient and fast Vaex can be across various operations, from reading in big data, joining and sorting to calculating basic summary statistics like the mean.

Vaex is built for "big data", and its design seems to work against it on smaller data that pandas handles comfortably. Here our artificial data was 2.2GB in memory, which may not have been big enough to justify using Vaex. **Keep in mind that for datasets small enough for pandas to read in, plain pandas may actually be more efficient than a big data library.**
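
A single `%%time` run can be noisy (disk caching, other processes), so small differences between libraries are hard to interpret. One way to get a steadier comparison is to repeat each call and keep the best time. The helper below is a minimal sketch. `best_time` is a hypothetical name introduced here, not part of pandas or Vaex, and the `sum` call is only a stand-in workload so the cell runs anywhere.

```python
import time

def best_time(fn, *args, repeats=3, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result

# Tiny stand-in workload (hypothetical) so the sketch is self-contained.
seconds, total = best_time(sum, range(1_000_000))
print(f"best of 3: {seconds:.4f}s")
```

In this notebook the same helper could wrap the two readers, for example `best_time(pd.read_csv, "data.csv", low_memory=False)` and `best_time(vaex.from_csv, "data.csv", copy_index=False)`. With 30 million rows each repeat takes a while, so a small `repeats` value is probably enough.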

Another article compares the performance of read_csv across datasets of various sizes and shows that Vaex is not necessarily faster than pandas when the dataset is not that big:
- https://towardsdatascience.com/is-something-better-than-pandas-when-the-dataset-fits-the-memory-7e8e983c4fe5

### Basic Operations

One advantage of Vaex over other big data libraries is its **HIGH SIMILARITY to the PANDAS syntax and API**. Most of the functions and syntax are either exactly the same or very similar.

##### View some rows of the dataset

In [None]:
vaex_df.head(3)

In [None]:
# Function unique to Vaex: print the first few and the last few rows of the dataframe
vaex_df.head_and_tail_print(2)

##### Describe the data

In [None]:
vaex_df.describe()

##### Drop a column/variable

In [None]:
vaex_df = vaex_df.drop("c9")

In [None]:
list(vaex_df.columns) # "c9" column has been successfully dropped

##### groupby -- aggregate calculations

In [None]:
gender = ['male','male','female','male','female']

IQ = [120, 83, 52, 160, 97]

example_vaex_df = vaex.from_arrays(gender=gender, IQ=IQ)

In [None]:
example_vaex_df.groupby(by='gender').agg({'IQ':'mean'})
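
As a sanity check on what the aggregation above should return, the same group means can be computed by hand. This is a plain-Python sketch of the idea behind a groupby mean, not how Vaex implements it.

```python
from collections import defaultdict

gender = ['male', 'male', 'female', 'male', 'female']
IQ = [120, 83, 52, 160, 97]

# Collect the IQ values that belong to each gender.
groups = defaultdict(list)
for g, iq in zip(gender, IQ):
    groups[g].append(iq)

# Mean per group: conceptually what agg({'IQ': 'mean'}) computes.
group_means = {g: sum(vals) / len(vals) for g, vals in groups.items()}
print(group_means)  # {'male': 121.0, 'female': 74.5}
```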

##### Dealing with Missing Data

In [None]:
test = [0,np.nan,np.nan,12,28,1932,130234]

test_vaex_df = vaex.from_arrays(co=test)

In [None]:
test_vaex_df.fillna(1)

There are three main functions for identifying missing values:

- ismissing(): Returns True where there are missing values (masked arrays), missing strings or None
- isna(): Returns a boolean expression indicating if the values are Not Available (missing or NaN).
- isnan(): Returns an array where there are NaN values

and three drop functions, one for each of these identifying functions:

- test_vaex_df.dropna( )
- test_vaex_df.dropnan( )
- test_vaex_df.dropmissing( )

In [None]:
test_vaex_df.co.ismissing()

In [None]:
test_vaex_df.co.isna()

In [None]:
test_vaex_df.co.isnan()
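
The difference between the three checks is easy to mix up, so here is a pure-Python sketch that mirrors the definitions listed above. The helper names `is_missing`, `is_nan` and `is_na` are hypothetical stand-ins written for illustration (they are not Vaex functions), and `None` is used here to represent a masked or missing entry.

```python
import math

def is_nan(x):
    # True only for a floating-point NaN.
    return isinstance(x, float) and math.isnan(x)

def is_missing(x):
    # None stands in for a masked / missing entry in this sketch.
    return x is None

def is_na(x):
    # "Not Available": either missing or NaN.
    return is_missing(x) or is_nan(x)

values = [0, float("nan"), None, 12]
for v in values:
    print(f"{v!r:>5}  missing={is_missing(v)}  nan={is_nan(v)}  na={is_na(v)}")
```

Under these definitions, a column like `co` that contains NaN but no masked or None entries should give all False for the missing check, while the NaN and NA checks agree with each other.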

Most other operations you would normally think of doing in pandas are available as well, such as joining, sorting and string operations (e.g. lower, contains, endswith, alphanumeric check).

The next section walks through one example from the official documentation of how supervised learning works in vaex-ml, using the iris dataset.

### Machine learning with Vaex-ml

There is even a separate machine learning library for Vaex whose syntax is similar to sklearn. Clustering, PCA and supervised learning are all available via vaex-ml.

In [None]:
del pandas_df
del vaex_df
del vaex_chunk_df

In [None]:
import gc
gc.collect()

In [None]:
%%time
from vaex.ml.sklearn import Predictor
from sklearn.ensemble import GradientBoostingClassifier

iris_df = vaex.ml.datasets.load_iris()

features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
target = 'class_'

model = GradientBoostingClassifier(random_state=42)
vaex_model = Predictor(features=features, target=target, model=model, prediction_name='prediction')

vaex_model.fit(df=iris_df)

iris_df = vaex_model.transform(iris_df)

The overall flow looks the same as in sklearn, but one difference you can see above is the use of the "Predictor" from vaex.ml.sklearn, which acts as a kind of wrapper that holds the model together with its settings (e.g. features, target, which algorithm to use). In a typical sklearn setting, the feature and target data would instead be passed directly to the algorithm instance (in this case the GradientBoostingClassifier( ) instance) when calling fit.
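
To make the wrapper idea concrete, here is a toy sketch in plain Python. It is not Vaex's implementation: `SimplePredictor` and `MajorityClassModel` are hypothetical classes written for illustration, and the "dataframe" is just a dict of column lists with made-up values.

```python
from collections import Counter

class MajorityClassModel:
    """Toy stand-in for an sklearn estimator: always predicts the most common label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

class SimplePredictor:
    """Hypothetical wrapper that bundles a model with the column names it uses."""
    def __init__(self, features, target, model, prediction_name="prediction"):
        self.features = features
        self.target = target
        self.model = model
        self.prediction_name = prediction_name

    def _rows(self, table):
        # Turn the selected feature columns into a list of rows.
        return [list(row) for row in zip(*(table[f] for f in self.features))]

    def fit(self, table):
        self.model.fit(self._rows(table), table[self.target])
        return self

    def transform(self, table):
        # Return a copy of the table with a new prediction column added.
        out = dict(table)
        out[self.prediction_name] = self.model.predict(self._rows(table))
        return out

# Made-up miniature table, only for demonstration.
table = {
    "petal_length": [1.4, 4.7, 5.1],
    "petal_width": [0.2, 1.4, 1.9],
    "class_": [0, 1, 1],
}
wrapper = SimplePredictor(
    features=["petal_length", "petal_width"],
    target="class_",
    model=MajorityClassModel(),
)
wrapper.fit(table)
result = wrapper.transform(table)
print(result["prediction"])  # [1, 1, 1]
```

The point of the design is that the wrapper, not the caller, knows which columns are features and which is the target, so fitting and predicting only need the dataframe itself.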

Refer to the official documentation and the tutorials there for further detailed info! https://vaex.io/docs/index.html