Group solution to the Kaggle problem titled "Natural Language Processing with Disaster Tweets". The task is to classify text data from 10,000 tweets into one of two classes: tweets about real disasters (1) and tweets that are not about real disasters (0).
A dashboard implementing our solution is available here.
- Kaggle's top score: 0.86117
- Our top prediction score: 0.84155
- Related Projects
- Theory
- Setup
- How to run
- Contributing
- Architecture
- Directory Structure
- Learning Materials
- Development Schedule
- Project duration
- Stack
- Examples
- Configurations
- Tips
- Warnings
- Troubleshooting
- Contact
- License
- Kaggle problem: "Natural Language Processing with Disaster Tweets"
- NLP Disaster Tweets online dashboard
- Twitter BOT built for this project: Disaster Retweeter
- NLP Disaster Tweets - Dashboard Models Kaggle notebook
Theory has been moved to the repo's wiki
Take these steps before the "How to run" section
virtualenv venv                    # create a virtual environment
source venv/bin/activate           # activate it (Linux/macOS)
venv\Scripts\activate.bat          # activate it (Windows)
pip install -r requirements.txt    # install the dependencies
Follow the steps in the "Setup" section, which describe how to install all the dependencies
The dashboard is deployed on Heroku and is live at https://nlp-disaster-tweets.herokuapp.com/
- Clone the repo to the destination of your choice
git clone https://github.com/SzymkowskiDev/nlp-disaster-tweets.git
- Open a command prompt (e.g. Anaconda Prompt) and change the directory to the root of the project
cd nlp-disaster-tweets
- In the terminal run the command
python app.py
- The app will launch in your web browser at the address http://127.0.0.1:8050/
- Clone the repo to the destination of your choice
git clone https://github.com/bswck/disaster-retweeter
- In a command prompt (e.g. Anaconda Prompt) change the directory to the root of the project
cd disaster-retweeter
- In the terminal run the command `uvicorn retweeter_web.app.main:app`
- The app will launch in your web browser at the address http://127.0.0.1:8000/
In the first iteration of the project, running it amounts to downloading a Jupyter notebook from the "notebooks" directory and launching it with Jupyter. Jupyter is available for download as part of the Anaconda suite from https://www.anaconda.com/.
When feeding a Jupyter notebook with data, use the data provided in the "train_split" directory here.
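For instance, a notebook cell along these lines would read the split into a DataFrame; note that the exact file name inside "train_split" is an assumption for this sketch:

```python
import pandas as pd

# NOTE: "train.csv" inside "train_split" is an assumed file name
train_df = pd.read_csv("train_split/train.csv")
print(train_df.head())
```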
├───assets
├───dashboard/src
│   ├───data
│   ├───models/production
│   └───tabs
├───disaster-retweeter (git submodule of https://github.com/bswck/disaster-retweeter)
├───notebooks
├───reports
├───submissions
├───app.py
└───requirements.txt
More resources are available on the team's Google Drive (discordnlp7@gmail.com); ask a team member for the password.
Also check the repo's wiki.
- A wonderful book on the basics of NLP: "Speech and Language Processing"
- Kaggle's introductory tutorial to NLP: NLP Getting Started Tutorial
- How does CountVectorizer work? towardsdatascience.com article (see the demo after this list)
- Data Mining and Business Analytics with R - Johannes Ledolter
- Dash tutorial
- Plotly docs
- Markdown in Dash
- Dash HTML Components Gallery & code snippets
- Dash Core Components Gallery & code snippets
- Bootstrap components for Dash
- Font awesome icons
- Colorscales for Plotly charts
- How to make a choropleth map or globe with plotly.graph_objects (go)
- layout.geo reference
- Bootstrap theme explorer
- ML Crash Course from Google Developers
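As a quick complement to the CountVectorizer article above, here is a minimal self-contained demo; the sample sentences are made up for this sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents; CountVectorizer learns a vocabulary and counts tokens
docs = ["Forest fire near La Ronge", "Fire evacuation ordered in La Ronge"]
vect = CountVectorizer()
X = vect.fit_transform(docs)

print(vect.get_feature_names_out())  # learned vocabulary
print(X.toarray())                   # per-document token counts
```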
Version 1.0.0 MILESTONE: classification model
- First Solution to the Kaggle's problem as a Jupyter notebook
Version 1.1.0 MILESTONE: model + dashboard
- Deployment of a blank dashboard (and Dash integration)
- Exploratory Data Analysis tab
- Introduction
- Data Quality Issues
- Keyword
- Location (+interactive globe data viz)
- Text (Word frequency (+Wordcloud))
- Target
- Customized classification tab
- Best performing
- Make a prediction
- Twitter BOT Analytics (blank)
- About page
Version 1.2.0 MILESTONE: twitter bot
- The Disaster Retweeter
- BOT: crawler
- BOT: classification
- BOT: retweeter
- Database 1: PostgreSQL
- Database 2: Redis
- Rest API 1: analytics.py router
- Rest API 2: logs.py router
- Client part 1: module
- Client part 2: dashboard end
- tests
03/07/2022 - 15/09/2022 (74 days)
- Python
- pandas
- scikit-learn
- nltk
- dash
- PostgreSQL
- Redis
- FastAPI
Example 1. Measuring performance metrics with generate_perf_report()
To generate the model performance report, use `generate_perf_report()`. It compares predictions based on the provided training data (`X`) to the expected results (`y`) and gathers classification metrics such as precision, accuracy, etc.:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from models.production.generate_perf_report import generate_perf_report
# Load the training data, prepare the TF-IDF vectorizer just for this demo
df = pd.read_csv(r"data\original\train.csv")
tfidf_vect = TfidfVectorizer(max_features=5000)
# Prepare training data and target values
X = tfidf_vect.fit_transform(df['text'])
y = df["target"].copy()
# Generate and print the report
report = generate_perf_report(
X, y, name="demo report", description="tfidf vectorizer and no preprocessing"
)
print(report)
Output:
Date 2022-07-29 00:17:16
Description tfidf vectorizer and no preprocessing
Test Size 0.15
Precision 0.875
Recall 0.679208
F1 Score 0.764771
Accuracy 0.815236
Roc_auc_score 0.801142
Name: demo report, dtype: object
The report's name, description, test size, and date format can optionally be specified.
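For instance, a call overriding those defaults could look like the sketch below (continuing from Example 1); `test_size` and `date_format` are assumed parameter names inferred from the report's "Test Size" and "Date" fields, not confirmed by the source:

```python
# Hypothetical call: `test_size` and `date_format` are assumed parameter names
report = generate_perf_report(
    X, y,
    name="custom report",
    description="tfidf vectorizer, custom split",
    test_size=0.2,            # assumed: fraction of data held out for testing
    date_format="%Y-%m-%d",   # assumed: strftime-style format for the Date field
)
```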
Example 2. Performing vectorization of choice with vectorize_data()
Function `vectorize_data()` takes two parameters:
* `data` -
* `method` - available options are: `"tfidf"`
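A minimal usage sketch follows, assuming the function lives next to `generate_perf_report` in `models/production` and returns the vectorized matrix; both are assumptions, not confirmed by the repo:

```python
import pandas as pd

# Assumed import path, mirroring generate_perf_report's location
from models.production.vectorize_data import vectorize_data

df = pd.read_csv("data/original/train.csv")
# Vectorize the tweet texts with TF-IDF (the only documented method option)
X = vectorize_data(df["text"], method="tfidf")
```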
MIT License © 2019-2020 Kamil Szymkowski