Load data from the Million Song Dataset into a final dimensional model stored in S3.
Updated May 17, 2020 - Python
TU Berlin Cloud Computing - correctly implemented assignment 4
MLP for sentiment analysis on movie reviews.
Define a big data architecture and perform distributed machine learning calculations on an EMR cluster using AWS
Realtime data pipeline
A cloud-based Reddit stock sentiment analyzer that computes the overall sentiment for each stock across a configurable selection of stock subreddits. The architecture uses AWS MSK (Kafka), AWS EMR (PySpark), and AWS Lambda (Python 3) for scalability, and the OpenAI API for sentiment analysis through prompt engineering.
An ETL pipeline that extracts JSON files from an AWS S3 bucket, transforms them on an AWS EMR Spark cluster, and writes the results back to an AWS S3 bucket in Parquet format.
A scalable prototype of an image recognition engine deployed on AWS.
This project applies the skills learned in the Big Data Fundamentals unit to load, filter, and visualize a large real-world dataset in a cloud-based distributed computing environment using Hadoop, Spark, Hive, and the S3 filesystem.
Stand-alone Scala & Java tool to anonymize OOXML Documents (DOCX)
Data Pipeline Analytics Platform is an end-to-end generic Big Data pipeline. It uses the following tech stack: AWS S3, AWS Redshift, AWS EMR Cluster, Apache Spark, Apache Airflow.
Built a data model, data warehouse, and pipeline for extracting, transforming, and loading data into a star-schema data model in a Redshift database.
In this repo, I build a LogisticRegression prediction model with Dask and PySpark and initialize an AWS EMR cluster to run the entire pipeline.
A CNN is deployed in AWS to extract image features in the context of distributed computing.
PySpark RDD and DataFrame Examples
Developing a Flow with EMR and Airflow