# datalake

Here are 66 public repositories matching this topic...

Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.

data-science data csv sql database ai pandas data-analysis datalake gpt-3 gpt-4 llm

Updated May 24, 2024
Python

activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

Updated May 25, 2024
Python

martandsingh / ApacheSpark

This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. The course ends with a few case studies.

sql database spark hive hadoop etl pyspark data-engineering spark-streaming data-analysis databricks datalake spark-sql timetravel apachespark etl-pipeline deltalake

Updated Dec 28, 2023
Python

vim89 / datapipelines-essentials-python

Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.

python big-data spark apache-spark hadoop etl xml python3 xml-parsing pyspark data-pipeline datalake hadoop-mapreduce spark-sql etl-framework hadoop-hdfs etl-pipeline etl-components

Updated May 6, 2023
Python

awslabs / aws-orbit-workbench

A Data Platform built for AWS, powered by Kubernetes.

kubernetes aws jupyter analytics gpu jupyterhub data-analysis redshift mach workbench datalake dataengineering eks eks-cluster orbit-workbench

Updated Jul 24, 2023
Python

UncoderIO / Uncoder_IO

An IDE and translation engine for detection engineers and threat hunters. Be faster, write smarter, keep 100% privacy.

translation xdr siem sigma datalake edr threathunting roota uncoder uncoderio

Updated May 25, 2024
Python

PaloAltoNetworks / pan-cortex-data-lake-python

Idiomatic Python SDK for Cortex™ Data Lake.

Updated Jan 7, 2022
Python

abdullahkhawer / aws-auto-terminate-idle-emr

AWS Auto Terminate Idle AWS EMR Clusters Framework is an AWS-based solution that uses AWS CloudWatch and an AWS Lambda function running a Python script with Boto3 to terminate AWS EMR clusters that have been idle for a specified period of time.

Updated Sep 13, 2021
Python

aws-samples / aws-insurancelake-etl

This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog post "Deploy data lake ETL jobs using CDK Pipelines", and complements the InsuranceLake Infrastructure project.

aws insurance glue datalake cdk

Updated Mar 27, 2024
Python

openEDI / open-data-access-tools

OEDI Data Lake Access

aws datalake nrel renewable-energy open-energy oedi

Updated May 6, 2024
Python

hifxit / dataligo

A library to accelerate ML and ETL pipelines by connecting all data sources.

python database nosql datawarehouse datalake etl-pipeline ml-pipeline

Updated May 3, 2023
Python

mehroosali / s3-redshift-batch-etl-pipeline

Built a functional Python ETL script whose functions initialize Spark clusters using the PySpark library to extract song data stored in an S3 bucket. Partitioned the song data by year and artist_id and wrote it to compressed Parquet output files to improve load performance. Used Spark's overwrite mode to ensure every new run of the ETL script is overwritten in th…

aws airflow sql spark etl analytics s3 python3 pyspark redshift datalake spark-sql airflow-dags

Updated Dec 28, 2021
Python

aws-samples / aws-insurancelake-infrastructure

This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog post "Deploy data lake ETL jobs using CDK Pipelines", and complements the InsuranceLake ETL with CDK Pipelines project.

aws insurance datalake cdk

Updated Mar 3, 2024
Python

brfulu / datalake-spark-etl

Udacity Data Engineer Nanodegree - Datalake Spark ETL

python aws spark etl s3 datalake

Updated Jul 24, 2020
Python

aymanibrahim / data-lake-spark

Udacity's Data Engineering Nanodegree project: Data Lake with Spark.

python emr json udacity spark etl s3 nanodegree parquet datalake dataengineering

Updated Sep 16, 2022
Python

praveen-gopal-reddy / ETL-Spark-EMR-AWS-MusicData

Implements a data lake using S3 and Spark on an EMR cluster in an AWS Cloud9 environment, and develops an ETL pipeline that extracts data from S3, processes it with Spark, and loads it back into S3 as a set of dimensional tables.

python bootstrap spark pyspark cloud9 s3-storage datalake emr-cluster

Updated Jul 30, 2021
Python

legout / pydala

Poor man's simple Python API for creating a local or remote data lake based on several (PyArrow) datasets using DuckDB.

datalake pyarrow duckdb

Updated Jul 14, 2023
Python

poshkaran04 / stocks_data_transform

Adds additional stock data using dbt.

bigquery airflow gcp data-visualization python3 data-engineering data-analysis dbt powerbi datawarehouse datalake gcs-bucket

Updated Apr 18, 2023
Python

mfilipelino / kafka2hdfs

PySpark streaming from Kafka (0.8.2) to HDFS.

kafka spark spark-streaming hdfs datalake

Updated Dec 13, 2018
Python

dbbatalha / human-resources-analytics

Repository for storing the project's code.

python docker machine-learning airflow minio datalake human-resources pycaret

Updated Dec 2, 2021
Python
