For our final project at Zip Code Wilmington, we chose to perform a sentiment analysis of the Twitter conversation surrounding the COVID-19 vaccine in the United States. We streamed the tweets using the Twitter API and stored the data in an AWS SQL database. We then cleaned the data with Spark and wrote it back to the database.
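To give a feel for the cleaning step, here is a minimal sketch in plain Python. The project ran its cleaning in Spark, and the exact rules are not listed here, so the function name `clean_tweet` and the specific rules below (lowercasing, dropping URLs and @mentions, keeping hashtag words, stripping punctuation) are illustrative assumptions rather than the project's actual code.

```python
import re

# Hypothetical cleaning rules; the project's real Spark job may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> str:
    """Normalize a raw tweet into lowercase, space-separated words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # drop links
    text = MENTION_RE.sub(" ", text)    # drop @mentions
    text = text.replace("#", "")        # keep the hashtag word, drop the symbol
    text = NON_ALPHA_RE.sub(" ", text)  # drop digits, punctuation, emoji
    return " ".join(text.split())       # collapse repeated whitespace


if __name__ == "__main__":
    print(clean_tweet("Got my #vaccine today! https://t.co/abc @CDCgov"))
```

A function like this could be applied to the tweet text column in Spark (for example, wrapped as a user-defined function), but that wiring is not shown here.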
After acquiring this data, we used NLTK machine learning models to analyze the sentiment of the tweets. We then separated the tweets into four tables based on the region of the United States they came from, and created visualizations of the data using Wordcloud and Matplotlib. The whole process was automated with an Apache Airflow DAG.
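The labeling and region split can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: it assumes each tweet already carries a numeric sentiment score between -1 and 1 (as NLTK's VADER analyzer produces in its `compound` field), uses the commonly cited ±0.05 cutoffs, and assumes a hypothetical `region` field holding one of four region names. The function names and the region names are made up for the example.

```python
from collections import defaultdict


def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a score in [-1, 1] to a category (assumed cutoffs of +/-0.05)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def split_by_region(tweets):
    """Group tweet records into one list per region (one 'table' each)."""
    tables = defaultdict(list)
    for tweet in tweets:
        tables[tweet["region"]].append(tweet)
    return dict(tables)


if __name__ == "__main__":
    # Hypothetical records; the scores and regions are invented for illustration.
    tweets = [
        {"text": "so relieved to get my shot", "compound": 0.6, "region": "Northeast"},
        {"text": "appointment site crashed again", "compound": -0.4, "region": "South"},
        {"text": "vaccine clinic opens monday", "compound": 0.0, "region": "South"},
    ]
    for tweet in tweets:
        tweet["sentiment"] = label_sentiment(tweet["compound"])
    for region, rows in split_by_region(tweets).items():
        print(region, [row["sentiment"] for row in rows])
```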
Lastly, we made an interactive data visualization using Tableau, where you can view the sentiment analysis for the USA as a whole or isolate each region.
Below are the basic steps of our program, followed by a flowchart showing how all of the technologies worked together. View our PowerPoint presentation by clicking here.
In the sentiment analysis, tweets were split into three categories: positive, negative and neutral. Using Word Cloud, we generated images of the key words for each category. The larger the word, the more common it was.
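Word size in a word cloud follows word frequency, so the underlying computation is a word count per category. The sketch below shows that count using only the standard library; the stopword list and the function name `word_frequencies` are assumptions made for illustration, not the project's code.

```python
from collections import Counter

# Small illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "my", "i"}


def word_frequencies(tweets, stopwords=STOPWORDS):
    """Count how often each non-stopword appears across cleaned tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(w for w in text.lower().split() if w not in stopwords)
    return counts


if __name__ == "__main__":
    # Hypothetical cleaned tweets from a single sentiment category.
    positive = ["got the vaccine today", "vaccine is safe", "so grateful today"]
    print(word_frequencies(positive).most_common(3))
```

A mapping like this can be passed to the `wordcloud` library's `WordCloud.generate_from_frequencies` to render the image, with the most frequent words drawn largest.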
Here are the key words found in positive tweets:
Here are key words found in negative tweets:
- Airflow
- AWS Lightsail MySQL
- pandas
- Matplotlib
- Tableau
- NLTK
- Papermill
- PySpark