Analysis performed on data from the Steam platform using Apache Spark and Cloud services such as Amazon Web Services.
EMR + Hadoop to Redshift ELT workflow using the Spark Steps API and orchestrated by Apache Airflow, which ingests disparate datasets centered on 7 GB of I94 arrivals data to produce a simple star schema in Redshift.
An AWS Lambda function that starts an EMR cluster and runs a MapReduce job.
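A minimal sketch of what such a Lambda handler might look like, using boto3's EMR client. The bucket names, script names, cluster size, and release label below are hypothetical placeholders, not details from the repository.

```python
# Sketch of a Lambda handler that launches a transient EMR cluster and
# submits a single Hadoop streaming (MapReduce) step. Bucket names,
# script names, and instance settings are hypothetical placeholders.

def build_streaming_step(name, input_uri, output_uri, mapper, reducer):
    """Build the EMR step definition for a Hadoop streaming job."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-input", input_uri,
                "-output", output_uri,
                "-mapper", mapper,
                "-reducer", reducer,
            ],
        },
    }

def lambda_handler(event, context):
    import boto3  # bundled with the AWS Lambda Python runtime
    emr = boto3.client("emr")
    step = build_streaming_step(
        "wordcount",
        "s3://my-bucket/input/",    # hypothetical input location
        "s3://my-bucket/output/",   # hypothetical output location
        "wordcount_mapper.py",
        "wordcount_reducer.py",
    )
    response = emr.run_job_flow(
        Name="lambda-launched-cluster",
        ReleaseLabel="emr-5.29.0",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # Terminate the cluster once the step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[step],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return {"JobFlowId": response["JobFlowId"]}
```

Keeping the step definition in its own pure function makes the cluster launch easy to inspect and unit-test without touching AWS.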
Daily incremental-load ETL pipeline for an e-commerce company using AWS Lambda and an AWS EMR cluster, deployed with Apache Airflow in a Docker container.
Implements the random forest machine learning algorithm with PySpark on AWS EMR to classify wines. The model is then deployed in a Docker container.
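A sketch of how such a PySpark random forest classifier might be structured, assuming a CSV of wine features with a numeric `quality` column; the column names, threshold, and tree count are assumptions, not details from the repository.

```python
# Sketch of a PySpark random forest wine classifier. The input schema
# (a "quality" column plus numeric feature columns) and the quality
# threshold are hypothetical assumptions.

def quality_to_label(quality, threshold=7):
    """Binarize wine quality: 1 = good (>= threshold), 0 = otherwise."""
    return 1 if quality >= threshold else 0

def train(csv_path):
    # Imports are local so the helper above stays usable without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("wine-rf").getOrCreate()
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Derive a binary label from the raw quality score.
    label_udf = udf(quality_to_label, IntegerType())
    df = df.withColumn("label", label_udf(df["quality"]))

    # Assemble every non-label column into a single feature vector.
    feature_cols = [c for c in df.columns if c not in ("quality", "label")]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    train_df, test_df = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

    rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                                numTrees=100)
    model = rf.fit(train_df)
    return model, model.transform(test_df)
```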
ETL data pipeline using AWS services.
Loads data from the Million Song Dataset into a final dimensional model stored in S3.
A cloud-based Reddit stock sentiment analyzer that derives overall sentiment for each stock from a configurable selection of stock subreddits. The architecture uses AWS MSK (Kafka), AWS EMR (PySpark), and AWS Lambda (Python 3) for scalability, and the OpenAI API for sentiment analysis through prompt engineering.
Data Pipeline Analytics Platform is an end-to-end generic big data pipeline. Tech stack: AWS S3, AWS Redshift, AWS EMR, Apache Spark, Apache Airflow.
An ETL pipeline that extracts JSON files from an AWS S3 bucket, transforms them with an AWS EMR Spark cluster, and stores the results back in S3 in Parquet format.
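The core of such a JSON-to-Parquet job could be sketched as below, assuming Spark runs on EMR with S3 access configured. The column names (`song_id`, `year`) and bucket layout are illustrative assumptions, not details from the repository.

```python
# Sketch of a JSON -> Parquet ETL step on EMR. The field names used in
# the transform ("song_id", "year") and the S3 layout are hypothetical.

def parquet_output_path(bucket, table):
    """Build the S3 destination for one output table (naming is illustrative)."""
    return f"s3a://{bucket}/{table}/"

def run_etl(input_path, output_path):
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

    # Read raw JSON (one object per line) from S3.
    df = spark.read.json(input_path)

    # Example transform: drop duplicates and rows missing the id column.
    cleaned = df.dropDuplicates().na.drop(subset=["song_id"])

    # Write back to S3 as Parquet, partitioned by the assumed "year" column.
    cleaned.write.mode("overwrite").partitionBy("year").parquet(output_path)
```

Parquet's columnar layout and per-column statistics make downstream analytical reads from S3 substantially cheaper than rescanning the raw JSON.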
Built a data model, data warehouse, and pipeline for extracting, transforming, and loading data into a star-schema-based data model in a Redshift database.
In this repo, I build a LogisticRegression prediction model with Dask and PySpark and initialize an AWS EMR cluster to run the entire pipeline.
PySpark RDD and DataFrame Examples
Developing a Flow with EMR and Airflow
With this app, you can see what programming skills are most in-demand in the current job market.
Udacity project: implementing an ETL pipeline to process data with Apache Spark and store it in AWS S3 storage.