datalake
Here are 66 public repositories matching this topic...
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Updated May 25, 2024 - Python
This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. The course ends with a few case studies.
Updated Dec 28, 2023 - Python
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, plus SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
Updated May 6, 2023 - Python
A Data Platform built for AWS, powered by Kubernetes.
Updated Jul 24, 2023 - Python
An IDE and translation engine for detection engineers and threat hunters. Be faster, write smarter, keep 100% privacy.
Updated May 25, 2024 - Python
Python idiomatic SDK for Cortex™ Data Lake.
Updated Jan 7, 2022 - Python
AWS Auto Terminate Idle AWS EMR Clusters Framework is an AWS-based solution that uses AWS CloudWatch and AWS Lambda, with a Python script built on Boto3, to terminate AWS EMR clusters that have been idle for a specified period of time.
Updated Sep 13, 2021 - Python
This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog post "Deploy data lake ETL jobs using CDK Pipelines" and complements the InsuranceLake Infrastructure project.
Updated Mar 27, 2024 - Python
OEDI Data Lake Access
Updated May 6, 2024 - Python
A library to accelerate ML and ETL pipelines by connecting all data sources.
Updated May 3, 2023 - Python
Built a functional Python ETL script whose functions initialize Spark clusters using the PySpark library to extract songs stored in an S3 bucket. Partitioned song data by year and artist_id and compressed it into Parquet output files to improve load performance. Used Spark's overwrite mode to ensure every new run of the ETL script is overwritten in th…
Updated Dec 28, 2021 - Python
This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog post "Deploy data lake ETL jobs using CDK Pipelines" and complements the InsuranceLake ETL with CDK Pipelines project.
Updated Mar 3, 2024 - Python
Implements a data lake using S3 and Spark on an EMR cluster from an AWS Cloud9 environment, and develops an ETL pipeline that extracts data from S3, processes it with Spark, and loads it back into S3 as a set of dimensional tables.
Updated Jul 30, 2021 - Python
Adds additional stock data using dbt.
Updated Apr 18, 2023 - Python
PySpark Streaming from Kafka (0.8.2) to HDFS.
Updated Dec 13, 2018 - Python
Repository for storing the project's code.
Updated Dec 2, 2021 - Python