farsante

Fake Pandas / PySpark DataFrame creator.

Install

pip install farsante

PySpark

Here's how to quickly create a 7-row DataFrame with first_name and last_name fields.

import farsante

df = farsante.quick_pyspark_df(['first_name', 'last_name'], 7)
df.show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|     Tommy|     Hess|
|    Arthur| Melendez|
|  Clemente|    Blair|
|    Wesley|   Conrad|
|    Willis|   Dunlap|
|     Bruna|  Sellers|
|     Tonda| Schwartz|
+----------+---------+

Here's how to create a 5-row DataFrame with Mexican Spanish first and last names.

import farsante
from mimesis import Person

mx = Person('es-mx')

df = farsante.pyspark_df([mx.first_name, mx.last_name], 5)
df.show()

+-----------+---------+
| first_name|last_name|
+-----------+---------+
|     Connie|    Xicoy|
|  Oliverios|   Merino|
|     Castel|    Yáñez|
|Guillelmina|   Prieto|
|     Gezane|   Campos|
+-----------+---------+

Pandas

Here's how to quickly create a 3-row DataFrame with first_name and last_name fields.

import farsante

df = farsante.quick_pandas_df(['first_name', 'last_name'], 3)
print(df)

  first_name last_name
0       Toby   Rosales
1      Gregg    Hughes
2    Terence       Ray

Here's how to create a 5-row DataFrame with Russian first and last names.

import farsante
from mimesis import Person

ru = Person('ru')

df = farsante.pandas_df([ru.first_name, ru.last_name], 5)
print(df)

  first_name   last_name
0      Амиль  Ханженкова
1  Славентий  Голумидова
2    Паладин   Волосиков
3       Акша    Бабашова
4       Ника    Синусова
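Both pandas_df and pyspark_df take a list of provider methods and a row count. One way to picture this is a loop that calls each function once per row and names each column after its function. The sketch below illustrates that idea with the standard library only. It is not farsante's actual implementation, and build_columns, first_name, and last_name are hypothetical names made up for the example.

```python
import random


def build_columns(funs, num_rows):
    """Call each zero-argument function num_rows times.

    Each column is keyed by the function's __name__, which for a bound
    method such as Person().first_name is 'first_name'.
    """
    return {fun.__name__: [fun() for _ in range(num_rows)] for fun in funs}


# Hypothetical stand-ins for mimesis provider methods
def first_name():
    return random.choice(["Toby", "Gregg", "Terence"])


def last_name():
    return random.choice(["Rosales", "Hughes", "Ray"])


columns = build_columns([first_name, last_name], 3)
# A dict of equal-length lists like this can be passed to pandas.DataFrame(columns)
```

Any zero-argument callable works here, which is why methods from differently configured mimesis providers can be mixed in one list.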

Fake files

Here's how to create a CSV file with some fake data:

import farsante
from mimesis import Person
from mimesis import Address
from mimesis import Datetime

person = Person()
address = Address()
datetime = Datetime()
df = farsante.pandas_df([person.full_name, person.email, address.city, address.state, datetime.datetime], 3)
df.to_csv('./tmp/fake_data.csv', index=False)

Here's how to create a Parquet file with fake data, reusing the person, address, and datetime providers defined above:

df = farsante.pandas_df([person.full_name, person.email, address.city, address.state, datetime.datetime], 3)
df.to_parquet('./tmp/fake_data.parquet', index=False)
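Both snippets write into ./tmp, and pandas raises an error if that directory does not exist. A minimal sketch of creating the parent directory first is shown below. ensure_parent_dir is a hypothetical helper, not part of farsante, and the demo runs in a throwaway temporary directory rather than in ./tmp.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return it as a Path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demo in a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as base:
    target = ensure_parent_dir(Path(base) / "tmp" / "fake_data.csv")
    parent_exists = target.parent.is_dir()

# With the examples above this would look like:
#   df.to_csv(ensure_parent_dir('./tmp/fake_data.csv'), index=False)
```

Note that df.to_parquet also needs a Parquet engine such as pyarrow or fastparquet to be installed.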

h2o dataset creation

h2o is a popular benchmark for data processing engines. Farsante uses Rust to generate h2o datasets.

The following datasets are currently supported:

name         rows           cols  column types                        nulls
groupby      n              9     6 id cols, 2 int cols, 1 float col  optional
join_big     n              7     6 id cols, 1 float col              no
join_big_na  n              7     6 id cols, 1 float col              optional
join_medium  n / 1000       5     4 id cols, 1 float col              optional
join_small   n / 1_000_000  4     3 id cols, 1 float col              optional
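To get a feel for the sizes involved, the row counts in the table can be worked out for a given n. The sketch below assumes integer division for the smaller join tables; how the generator actually rounds is not stated here, and ROW_DIVISORS and expected_rows are hypothetical names used only for this example.

```python
# Divisor applied to n for each dataset, taken from the "rows" column above
ROW_DIVISORS = {
    "groupby": 1,
    "join_big": 1,
    "join_big_na": 1,
    "join_medium": 1_000,
    "join_small": 1_000_000,
}


def expected_rows(ds_type, n):
    """Return the approximate row count for ds_type (assumes integer division)."""
    return n // ROW_DIVISORS[ds_type]


sizes = {name: expected_rows(name, 10_000_000) for name in ROW_DIVISORS}
# With n = 10_000_000: join_medium has 10_000 rows and join_small has 10 rows
```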

Python

To create one of the above datasets, use the generate_h2o_dataset() function from farsante.h2o_dataset_create:

from farsante import generate_h2o_dataset

generate_h2o_dataset(
    ds_type="join_big",
    n=10_000_000,
    k=10,
    nas=10,
    seed=10,
)

To create all of the above datasets in parallel, use the h2o_dataset_create_all.py script:

python h2o_dataset_create_all.py --n 10000000 --k 10 --nas 10 --seed 42
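If you would rather stay inside Python, a plain loop over the dataset names is one option. The sketch below is sequential, not parallel, and it assumes that ds_type accepts every name from the table (only "join_big" is shown in the example above). generate_all is a hypothetical helper; the generator function is passed in as an argument so the loop can be dry-run without farsante installed.

```python
# Dataset names from the table above (assumed to be valid ds_type values)
DATASET_TYPES = ["groupby", "join_big", "join_big_na", "join_medium", "join_small"]


def generate_all(generate, n, k, nas, seed):
    """Call generate once per dataset type, one after another."""
    for ds_type in DATASET_TYPES:
        generate(ds_type=ds_type, n=n, k=k, nas=nas, seed=seed)


# Dry run with a stand-in that only records the keyword arguments it receives
recorded = []
generate_all(lambda **kwargs: recorded.append(kwargs), n=1_000, k=10, nas=10, seed=42)

# Real usage would pass the farsante function instead:
#   from farsante import generate_h2o_dataset
#   generate_all(generate_h2o_dataset, n=10_000_000, k=10, nas=10, seed=42)
```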

Rust

To generate these datasets in Rust:

Install Rust
Install Cargo (it ships with Rust when you install through rustup)
Build and install the program and its Rust dependencies: cargo install --path .
List the available run options: cargo run --release -- --help

cargo run --release -- --n 10000000 --k 10 --nas 10 --seed 42

Contributing

If you would like to help make Farsante better, take a look at our Contributing Guide.
