This repository contains the implementation of three IoT use cases on AWS together with deliverables for the practicum project in collaboration with the Center for Deep Learning (CDL) at Northwestern University.
The Center for Deep Learning’s mission is to act as a resource for companies seeking to establish or improve access to artificial intelligence (AI) by providing technical capacity and expertise. Its recent work includes serving for deep learning, model architecture redesign, AI for IoT and general streaming, and prediction or scoring confidence. Please refer to the following resource for more information regarding CDL.
The Center for Deep Learning is developing REFIT, a novel system built to consume and capitalize on IoT infrastructure by ingesting device data and employing modern machine learning approaches to infer the status of various components of the IoT system. It is built upon several open-source components with state-of-the-art artificial intelligence, and it is notably distinguished from other IoT systems in several regards.
- Develop and implement three IoT use cases based on public data.
- Build an end-to-end solution for each use case on AWS, mimicking the general architecture leveraged in REFIT.
- Assess the potential pros and cons of implementing a streaming-based solution in AWS versus REFIT.
- A comprehensive final report detailing the three IoT use cases, the end-to-end solution implemented in AWS, and a preliminary comparison between AWS and REFIT.
- Source code and thorough documentation as provided in this GitHub repository.
- Point of Contact - Borchuluun Yadamsuren
- Technical Adviser - Diego Klabjan
- Supporting Staff - Raman Khurana
The project was completed by the following MLDS students at Northwestern University: Yi (Betty) Chen, Henry Liang, Sharika Mahadevan, Ruben Nakano, Riu Sakaguchi, Sam Swain, and Yumeng (Rena) Zhang.
A Chicago-based bike share system, Divvy Bikes provides an affordable and convenient mode of transportation throughout the city. The raw dataset provided publicly by Divvy contains information at the trip level, including the starting and ending station and time. The business objective revolves around predicting the number of trips at various stations for the next hour to facilitate resourceful restocking of bikes. The Divvy Bikes use case leverages an LSTM model that accounts for long-term seasonal dependencies to predict demand.
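The preprocessing implied above — aggregating trip-level records into hourly counts per station and slicing them into supervised windows for the LSTM — could be sketched as follows. The field names (`start_station`, `start_hour`) are assumptions for illustration, not the actual Divvy schema.

```python
from collections import Counter

def hourly_counts(trips, station_id):
    """Count trips starting at one station per hour.

    `trips` is a list of dicts with hypothetical keys
    'start_station' and 'start_hour' (e.g. '2023-05-01T14').
    """
    counts = Counter(t["start_hour"] for t in trips
                     if t["start_station"] == station_id)
    return [counts[h] for h in sorted(counts)]

def make_windows(series, lookback):
    """Build (input window, next-hour target) pairs for an LSTM."""
    return [(series[i:i + lookback], series[i + lookback])
            for i in range(len(series) - lookback)]
```

Each `(window, target)` pair maps the previous `lookback` hours of demand to the next hour, which is the prediction target described above.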
Servers comprise hard disk drives aggregated together to form a storage pod. In particular, hard drives serve as the foundation for both the storage and retrieval of data through rotating disks. The relevant data are amassed by Backblaze through the monitoring of various sensors in select hard disk drives. The ultimate objective is to identify hard drives that are close to failure, facilitating efficient predictive maintenance of server centers. More specifically, this particular use case capitalizes on an XGBoost framework to predict the useful lifetime of hard drives.
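Framing "useful lifetime" as a regression target requires labeling each daily sensor snapshot with the days remaining until that drive's recorded failure. A minimal sketch of that labeling step, assuming records for a single drive are sorted by date and the final record is the failure day (a simplification, not the project's exact labeling rule):

```python
def label_remaining_life(daily_records):
    """Attach a remaining-useful-life (RUL) target, in days, to each
    daily snapshot of a single drive. Assumes the list is sorted by
    date and the last record corresponds to the failure day."""
    n = len(daily_records)
    return [dict(rec, rul_days=n - 1 - i)
            for i, rec in enumerate(daily_records)]
```

An XGBoost regressor would then be trained on the sensor columns with `rul_days` as the target.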
The MotionSense data originates from an experiment involving 24 participants performing 6 activities across 15 trials in the same environment with fixed conditions. The activities comprise moving upstairs, going downstairs, walking, jogging, sitting, and standing. The dataset consists of accelerometer and gyroscope measurements generated by sensors in the devices carried by the participants during the experiment. The MotionSense use case also implements an LSTM model for the primary objective: to predict the type of activity from the sensor readings.
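For sequence classification like this, the continuous sensor stream is typically cut into fixed-length windows, each labeled with its majority activity before being fed to the LSTM. A minimal sketch of that segmentation step (window length, step size, and the per-timestep layout are assumptions, not the project's exact settings):

```python
from collections import Counter

def segment(readings, labels, window, step):
    """Slice a multichannel sensor stream into fixed-length windows,
    labeling each window with its majority activity. Assumes
    readings[i] is one timestep's channel values (acc/gyro) and
    labels[i] is that timestep's activity."""
    segments = []
    for start in range(0, len(readings) - window + 1, step):
        window_labels = labels[start:start + window]
        majority = Counter(window_labels).most_common(1)[0][0]
        segments.append((readings[start:start + window], majority))
    return segments
```

Each `(window, activity)` pair is then one training example for the classifier.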
The instructions to run and test the end-to-end AWS solution for the use cases are provided here.
Data Sources
Data Ingestion
- Kinesis Data Streams
  - `divvy-stream`
  - `harddrive-stream`
  - `motionsense-stream`
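A producer feeding one of these streams (for example `divvy-stream`) would serialize each record and call the Kinesis `put_record` API. The helper below builds the call's keyword arguments; the commented `boto3` call is how a producer would typically use it, but the partition-key choice here is an assumption.

```python
import json

def build_kinesis_record(stream_name, payload, partition_key):
    """Build the keyword arguments for a Kinesis put_record call."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": partition_key,
    }

# With boto3 available, a producer would then do roughly:
#   kinesis = boto3.client("kinesis")
#   kinesis.put_record(**build_kinesis_record("divvy-stream", trip, trip["station"]))
```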
Data Preparation
- AWS Glue
  - `divvy_static_etl`
- AWS Lambda
Data Storage
- AWS Lambda
  - `transform_and_stream_to_S3` (Divvy Bikes)
  - `motionsense-streamtoS3`
  - `harddrive-streamtoS3`
- Amazon S3 (stores raw streaming data)
  - `divvy-stream-data`
  - `harddrive-stream-data`
  - `motionsense-stream-data`
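The transform Lambdas above receive base64-encoded Kinesis records and write them to the raw-data buckets. The sketch below shows the decode-and-key-building portion using the standard Kinesis-to-Lambda event shape; the object-key layout is an assumption, and a real handler would follow with `s3.put_object(Bucket=..., Key=key, Body=body)` for each pair.

```python
import base64
import json
from datetime import datetime, timezone

def transform_records(event):
    """Decode base64-encoded Kinesis records from a Lambda event and
    return (s3_key, body) pairs for upload to the raw-data bucket."""
    objects = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        ts = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M%S%f")
        objects.append((f"raw/{ts}.json", json.dumps(payload)))
    return objects
```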
Model Inference
- Amazon EC2 (hosts model endpoint)
  - `divvy_api`
  - `harddrive_api`
  - `motionsense_api`
- AWS Lambda (calls model API and sends prediction to WebSocket)
  - `divvybikes-getprediction-send2websocket`
  - `lambda-getprediction-send2websocket` (Motion Sense)
  - `harddrive-getprediction-send2websocket`
- AWS Lambda (calls model API and saves prediction to S3)
  - `divvybikes-getprediction-savetoS3`
  - `motionsense-getprediction-savetoS3`
  - `harddrive-getprediction-savetoS3`
- Amazon S3
  - `divvy-predictions`
  - `harddrive-predictions`
  - `motionsense-predictions`
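The inference Lambdas above call the EC2-hosted model API over HTTP and then forward the result to a WebSocket or S3. A minimal sketch of both steps — the endpoint path, request schema, and record fields are assumptions, not the project's actual API contract:

```python
import json
from urllib import request

def fetch_prediction(api_url, features):
    """POST a feature payload to the EC2-hosted model API and return
    its JSON reply (URL and payload schema are hypothetical)."""
    req = request.Request(
        api_url,
        data=json.dumps({"features": features}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

def prediction_record(use_case, entity_id, prediction):
    """Shape one prediction for the WebSocket push or the S3 save."""
    return {"use_case": use_case, "id": entity_id, "prediction": prediction}
```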
Display Predictions
- Amazon EventBridge
- AWS Lambda
- WebSocket
  - `websocket-1`
- DynamoDB
  - `websocket-connections`
  - `websocket-connections-divvybikes`
  - `websocket-connections-harddrive`
Model Retraining
- Amazon S3
  - `divvy-retraining`
  - `harddrive-retraining`
  - `motionsense-retraining`
- Amazon EventBridge
- AWS Lambda
  - `trigger-motionsense-retrain`
  - `trigger-harddrive-retrain`
  - `trigger-divvy-retrain`
  - `stop-motionsense-retrain`
  - `stop-harddrive-retrain`
  - `stop-divvy-retrain`
- Amazon EC2
  - `motionsense_retrain`
  - `divvy_retrain`
  - `harddrive_retrain`
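The trigger/stop Lambdas above pair an EventBridge rule with the EC2 retraining instance it controls. One way to sketch that dispatch is to parse the rule name from the event's resource ARN and map it to a start or stop action; the instance map and the boto3 call in the comment are assumptions about how the handlers are wired.

```python
def retrain_action(rule_arn, instance_map):
    """Map an EventBridge rule (e.g. trigger-divvy-retrain or
    stop-divvy-retrain) to the EC2 API verb and instance it controls.
    `instance_map` maps instance names to instance ids (hypothetical)."""
    rule_name = rule_arn.rsplit("/", 1)[-1]
    action, use_case, _ = rule_name.split("-", 2)
    if action not in ("trigger", "stop"):
        raise ValueError(f"unexpected rule: {rule_name}")
    verb = "start_instances" if action == "trigger" else "stop_instances"
    return verb, instance_map[f"{use_case}_retrain"]

# Inside the Lambda handler, with boto3, this would become roughly:
#   verb, instance_id = retrain_action(event["resources"][0], INSTANCE_MAP)
#   getattr(boto3.client("ec2"), verb)(InstanceIds=[instance_id])
```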
The combined cost of the end-to-end AWS solution for the three use cases is estimated to reach an annual total of $3,207.22 USD or, equivalently, $267.27 USD per month. Amazon API Gateway and AWS Glue are two of the more costly AWS services employed as part of the comprehensive solution. A detailed breakdown of the cost estimate by service can be found here.
- Low latency streaming
- Real-time predictions
- Simplified process to start up the EC2 instances and stop retraining modules
- Strong model performance across all three use cases
  - Hard Drives: $R^2$ score of 0.96
  - Motion Sense: 95.2% test accuracy
  - Divvy Bikes: MAPE of 22.86%
- Cost-effective cloud implementation
- Visualization
  - Currently, the predictions are generated as raw values.
  - Adding a service to visualize past and current predictions could further improve the AWS solution.
- Throughput
  - Throughput appeared to decrease in inverse proportion to the stream size.
  - The Lambda function sending records to EC2 was identified as the likely bottleneck limiting the maximum potential throughput.
  - Increasing its compute power and memory could serve as a potential solution.
The final scope and objectives of the project shifted slightly from the original proposal, which also included implementing the three use cases on REFIT and designing a model-agnostic feature selection algorithm for time series data. These items could serve as potential avenues for future projects with CDL.
The final report detailing the entire 8-month practicum project can be found here.