pyspark

Here are 1,120 public repositories matching this topic...

ibis-project / ibis

the portable Python dataframe library

Updated Jun 7, 2024
Python

uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

machine-learning deep-learning tensorflow pytorch pyspark parquet parquet-files sysml pyarrow

Updated Dec 2, 2023
Python

AlexIoannides / pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

python data-science spark etl pyspark data-engineering etl-pipeline etl-job

Updated Jan 1, 2023
Python

hi-primus / optimus

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

data-science machine-learning spark bigdata data-transformation pyspark data-extraction data-analysis data-wrangling dask data-exploration data-preparation data-cleaning data-profiling data-cleansing big-data-cleaning data-cleaner cudf dask-cudf

Updated Jun 3, 2024
Python

jupyter-incubator / sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

magic spark kernel jupyter notebook cluster pandas-dataframe jupyter-notebook sql-query pyspark kerberos livy

Updated Jun 7, 2024
Python

lyhue1991 / eat_pyspark_in_10_days

pyspark🍒🥭 is delicious, just eat it!😋😋

spark pyspark

Updated Sep 22, 2022
Python

HariSekhon / DevOps-Python-tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

Updated Jun 4, 2024
Python

MrPowers / quinn

pyspark methods to enhance developer productivity 📣 👯 🎉

apache-spark pyspark

Updated May 17, 2024
Python

MrPowers / chispa

PySpark test helper methods with beautiful error messages

testing pyspark

Updated May 30, 2024
Python

Nike-Inc / koheesio

Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

python pyspark data-engineering pydantic delta-lake

Updated Jun 7, 2024
Python

capitalone / datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!

python data-science data spark numpy pandas pyspark compare dask dataframes fugue polars

Updated Jun 7, 2024
Python

ekampf / PySpark-Boilerplate

A boilerplate for writing PySpark Jobs

python boilerplate apache-spark pyspark

Updated Jan 21, 2024
Python

commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark

spark pyspark sparksql wet commoncrawl common-crawl warc-files wat-files

Updated Apr 8, 2024
Python

CamDavidsonPilon / tdigest

t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark

python estimate distributed-computing quantile pyspark mapreduce percentile

Updated May 4, 2023
Python

cartershanklin / pyspark-cheatsheet

PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

big-data spark apache-spark pyspark

Updated Sep 19, 2022
Python

databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines

python spark faker pyspark spark-streaming data-generation databricks synthetic-data datagen datagenerator deltalake datageneration delta-live-tables

Updated Jun 8, 2024
Python

MrPowers / mack

Delta Lake helper methods in PySpark

pyspark deltalake

Updated Feb 7, 2024
Python

quintoandar / butterfree

A tool for building feature stores.

python package data-science etl pyspark data-engineering etl-framework feature-store

Updated Jun 7, 2024
Python

Morphl-AI / MorphL-Community-Edition

MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization

kubernetes machine-learning cassandra pipeline artificial-intelligence pyspark user-experience data-driven-design conversion-rate-optimization front-end-development product-development hadoop-hdfs morphl-platform

Updated Oct 2, 2019
Python

runawayhorse001 / LearningApacheSpark

LearningApacheSpark

python html tutorial latex spark pyspark

Updated Jan 3, 2024
Python
