A data platform composed of Twitter content and a sentiment analysis pipeline
- Elasticsearch
- Apache Pulsar (Stream & Stream Processing)
- FastAPI + WebSockets (Query Tweets + Sentiment Scores)
- Batch + Streaming Pipeline (Tweet Acquisition + Scoring)
Elasticsearch stores tweets processed through Pulsar and Pulsar Functions. The crawler is a set of scripts/functions that load/stream data into Pulsar, parse it, and run NLP sentiment analysis. The API queries Elasticsearch and triggers crawler jobs that load content into Elasticsearch.
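The flow can be sketched end to end in plain Python. The list and dict below are in-memory stand-ins for a Pulsar topic and an Elasticsearch index, and the keyword scorer is a placeholder for the real NLP model — all names here are illustrative, not the project's actual identifiers.

```python
# Conceptual sketch: crawler -> Pulsar topic -> scoring function -> Elasticsearch.
# `tweets_topic` and `index` are in-memory stand-ins for the real services.
import json

tweets_topic: list = []   # stands in for a Pulsar topic
index: dict = {}          # stands in for an Elasticsearch index

def crawl(raw_tweets):
    """Crawler scripts publish raw tweets onto the topic."""
    for tweet in raw_tweets:
        tweets_topic.append(json.dumps(tweet))

def score(text):
    """Placeholder for the NLP sentiment model run by a Pulsar Function."""
    return 1.0 if "love" in text.lower() else 0.0

def consume():
    """Pulsar Function side: enrich each message and index the result."""
    while tweets_topic:
        tweet = json.loads(tweets_topic.pop(0))
        tweet["sentiment"] = score(tweet["text"])
        index[tweet["id"]] = tweet
```

The API then reads from the index side only; it never touches the topic directly.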
Run Pulsar locally
docker run -it \
-p 6650:6650 \
-p 8080:8080 \
--mount source=pulsardata,target=/pulsar/data \
--mount source=pulsarconf,target=/pulsar/conf \
apachepulsar/pulsar:2.8.1 \
bin/pulsar standalone
Run Elasticsearch locally
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.15.2
docker run -p 127.0.0.1:9200:9200 -p 127.0.0.1:9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.15.2
conda env create -f environment.yml
conda activate sentiment
python setup.py develop
./sentiment_api/scripts/runserver-dev.sh
Requires Elasticsearch to be running on localhost:9200
python tweet_analysis/orchestrator.py
- Fetch Latest Content (param: days_back)
- Fetch Top Trending Content
- Search for Content
- Scrape Top Trending Tags for content -> Elasticsearch
- Perform Sentiment Analysis on content in Elasticsearch
- Query the last 10 minutes of tweets in ES
- Score tweet text
- Push updates to ES
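The batch pass above can be sketched as follows. The index/field names (`created_at`, `text`, `sentiment`) and the keyword-based `score_text()` are assumptions to be swapped for the real mappings and NLP model; the returned query dict would go to the elasticsearch-py client's `search`, and the update bodies to a bulk request.

```python
# Sketch of the batch scoring pass: query N minutes back, score, push updates.
from datetime import datetime, timedelta, timezone

def recent_tweets_query(minutes_back: int = 10) -> dict:
    """Range query for tweets indexed in the last N minutes (field name assumed)."""
    since = datetime.now(timezone.utc) - timedelta(minutes=minutes_back)
    return {"query": {"range": {"created_at": {"gte": since.isoformat()}}}}

def score_text(text: str) -> float:
    """Placeholder scorer; the pipeline runs a real NLP model here."""
    positive = {"good", "great", "love"}
    negative = {"bad", "awful", "hate"}
    words = text.lower().split()
    if not words:
        return 0.0
    return (sum(w in positive for w in words) - sum(w in negative for w in words)) / len(words)

def scored_updates(hits: list) -> list:
    """Map query hits to partial-update bodies for a bulk request back to ES."""
    return [
        {"_id": hit["_id"], "doc": {"sentiment": score_text(hit["_source"]["text"])}}
        for hit in hits
    ]
```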
- Pulsar Function that performs sentiment analysis on content as it is streamed into pulsar topics
- Research pulsar-admin and how to deploy functions in production (Dockerfile? docker_run.sh?)
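A minimal sketch of that streaming scorer, written as a plain Python function module of the kind `pulsar-admin functions create --py ...` can deploy. The keyword-based `score_text()` stands in for the real NLP model, and the tweet JSON shape is an assumption.

```python
# Pulsar Function sketch: score each tweet as it streams through a topic.
# Deployed, Pulsar calls process() once per message on the input topic and
# publishes the return value to the output topic.
import json

POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def score_text(text: str) -> float:
    """Placeholder; the deployed function would load the real NLP model once."""
    words = text.lower().split()
    if not words:
        return 0.0
    return (sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)) / len(words)

def process(input):
    """Enrich the incoming tweet JSON with a sentiment score."""
    tweet = json.loads(input)
    tweet["sentiment"] = score_text(tweet.get("text", ""))
    return json.dumps(tweet)
```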
- Endpoints:
- Query a user's home timeline (tweepy)
- Fetch tweets by keyword (elasticsearch)
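For the keyword endpoint, the request can translate into a simple match query against the tweet text. The index and field names below are assumptions; a FastAPI route would pass its `keyword` query parameter here and forward the dict to the elasticsearch-py client's `search`.

```python
# Illustrative query builder for the fetch-tweets-by-keyword endpoint.
def keyword_search_query(keyword: str, size: int = 20) -> dict:
    """Match tweets containing the keyword, newest first, scores included."""
    return {
        "size": size,
        "sort": [{"created_at": {"order": "desc"}}],
        "query": {"match": {"text": keyword}},
        "_source": ["text", "created_at", "sentiment"],
    }
```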
Will likely need a docker-compose.yml for easy launch of a local dev instance (Pulsar and Elasticsearch included)
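A starting point for that compose file, mirroring the two docker run commands above; the service and volume names are assumptions.

```yaml
# Sketch only: image tags match the docker run commands in this README.
version: "3.8"
services:
  pulsar:
    image: apachepulsar/pulsar:2.8.1
    command: bin/pulsar standalone
    ports:
      - "6650:6650"
      - "8080:8080"
    volumes:
      - pulsardata:/pulsar/data
      - pulsarconf:/pulsar/conf
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.15.2
    environment:
      - discovery.type=single-node
    ports:
      - "127.0.0.1:9200:9200"
      - "127.0.0.1:9300:9300"
volumes:
  pulsardata:
  pulsarconf:
```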