Group solution to the Kaggle problem titled "Natural Language Processing with Disaster Tweets". The task is to classify text data from 10,000 tweets into one of two classes: tweets about real disasters (1) and tweets that are not about real disasters (0).
A dashboard implementing our solution is available here.
- Kaggle's top score: 0.86117
- Our top prediction score: 0.84155
- Related Projects
- Theory
- Setup
- How to run
- Contributing
- Architecture
- Directory Structure
- Learning Materials
- Development Schedule
- Project duration
- Stack
- Examples
- Configurations
- Tips
- Warnings
- Troubleshooting
- Contact
- License
- Kaggle problem: "Natural Language Processing with Disaster Tweets"
- NLP Disaster Tweets online dashboard
- Twitter BOT built for this project: Disaster Retweeter
- NLP Disaster Tweets - Dashboard Models Kaggle notebook
Theory has been moved to the repo's wiki
Take these steps before the "How to run" section
virtualenv venv                    # create a virtual environment
source venv/bin/activate           # activate it (Linux/macOS)
venv\Scripts\activate.bat          # activate it (Windows)
pip install -r requirements.txt    # install the dependencies
Follow the steps in the "Setup" section, which describe how to install all the dependencies
The dashboard is deployed on Heroku and is live at https://nlp-disaster-tweets.herokuapp.com/
- Clone the repo to the destination of your choice
git clone https://github.com/SzymkowskiDev/nlp-disaster-tweets.git
- Open a command prompt (e.g. Anaconda Prompt) and change the directory to the root of the project
cd nlp-disaster-tweets
- In the terminal run the command
python app.py
- The app will launch in your web browser at the address http://127.0.0.1:8050/
- Clone the repo to the destination of your choice
git clone https://github.com/bswck/disaster-retweeter
- In a command prompt (e.g. Anaconda Prompt) change the directory to the root of the project
cd disaster-retweeter
- In the terminal run the command `uvicorn retweeter_web.app.main:app`
- The app will launch in your web browser at the address http://127.0.0.1:8000/
In the first iteration of the project, running it amounts to downloading a Jupyter notebook from the "notebooks" directory and launching it with Jupyter. Jupyter is available for download as part of the Anaconda suite from https://www.anaconda.com/.
When feeding a Jupyter notebook with data, use the data provided in the "train_split" directory here.
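For instance, a notebook cell along these lines would read the split into a DataFrame; note that the exact file name inside "train_split" is an assumption for this sketch:

```python
import pandas as pd

# NOTE: "train.csv" inside "train_split" is an assumed file name
train_df = pd.read_csv("train_split/train.csv")
print(train_df.head())
```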
├───assets
├───dashboard/src
│   ├───data
│   ├───models/production
│   └───tabs
├───disaster-retweeter (git submodule of https://github.com/bswck/disaster-retweeter)
├───notebooks
├───reports
├───submissions
├───app.py
└───requirements.txt
More resources are available on the team's Google Drive (discordnlp7@gmail.com); ask a team member for the password.
Also check the repo's wiki.
- A wonderful book on the basics of NLP: "Speech and Language Processing"
- Kaggle's introductory tutorial to NLP: NLP Getting Started Tutorial
- How does CountVectorizer work? towardsdatascience.com article (see the demo after this list)
- Data Mining and Business Analytics with R - Johannes Ledolter
- Dash tutorial
- Plotly docs
- Markdown in Dash
- Dash HTML Components Gallery & code snippets
- Dash Core Components Gallery & code snippets
- Bootstrap components for Dash
- Font awesome icons
- Colorscales for Plotly charts
- How to make a choropleth map or globe with plotly.graph_objects (go)
- layout.geo reference
- Bootstrap theme explorer
- ML Crash Course from Google Developers
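As a quick complement to the CountVectorizer article above, here is a minimal self-contained demo; the sample sentences are made up for this sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents; CountVectorizer learns a vocabulary and counts tokens
docs = ["Forest fire near La Ronge", "Fire evacuation ordered in La Ronge"]
vect = CountVectorizer()
X = vect.fit_transform(docs)

print(vect.get_feature_names_out())  # learned vocabulary
print(X.toarray())                   # per-document token counts
```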
Version 1.0.0 MILESTONE: classification model
- First Solution to the Kaggle's problem as a Jupyter notebook
Version 1.1.0 MILESTONE: model + dashboard
- Deployment of a blank dashboard (and Dash integration)
- Exploratory Data Analysis tab
- Introduction
- Data Quality Issues
- Keyword
- Location (+interactive globe data viz)
- Text (Word frequency (+Wordcloud))
- Target
- Customized classification tab
- Best performing
- Make a prediction
- Twitter BOT Analytics (blank)
- About page
Version 1.2.0 MILESTONE: twitter bot
- The Disaster Retweeter
- BOT: crawler
- BOT: classification
- BOT: retweeter
- Database 1: PostgreSQL
- Database 2: Redis
- Rest API 1: analytics.py router
- Rest API 2: logs.py router
- Client part 1: module
- Client part 2: dashboard end
- tests
03/07/2022 - 15/09/2022 (74 days)
- Python
- pandas
- scikit-learn
- nltk
- dash
- PostgreSQL
- Redis
- FastAPI
Example 1. Measuring performance metrics with generate_perf_report()
To generate the model performance report, use `generate_perf_report()`. It compares predictions based on the provided training data (`X`) to the expected results (`y`) and gathers classification metrics such as precision, accuracy, etc.:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from models.production.generate_perf_report import generate_perf_report
# Load the training data, prepare the TF-IDF vectorizer just for this demo
df = pd.read_csv(r"data\original\train.csv")
tfidf_vect = TfidfVectorizer(max_features=5000)
# Prepare training data and target values
X = tfidf_vect.fit_transform(df['text'])
y = df["target"].copy()
# Generate and print the report
report = generate_perf_report(
X, y, name="demo report", description="tfidf vectorizer and no preprocessing"
)
print(report)
Output:
Date 2022-07-29 00:17:16
Description tfidf vectorizer and no preprocessing
Test Size 0.15
Precision 0.875
Recall 0.679208
F1 Score 0.764771
Accuracy 0.815236
Roc_auc_score 0.801142
Name: demo report, dtype: object
The report's name, description, test size, and date format can optionally be specified.
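For instance, a call overriding those defaults could look like the sketch below (continuing from Example 1); `test_size` and `date_format` are assumed parameter names inferred from the report's "Test Size" and "Date" fields, not confirmed by the source:

```python
# Hypothetical call: `test_size` and `date_format` are assumed parameter names
report = generate_perf_report(
    X, y,
    name="custom report",
    description="tfidf vectorizer, custom split",
    test_size=0.2,            # assumed: fraction of data held out for testing
    date_format="%Y-%m-%d",   # assumed: strftime-style format for the Date field
)
```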
Example 2. Performing vectorization of choice with vectorize_data()
Function `vectorize_data()` takes two parameters:
* `data` -
* `method` - available options are: `"tfidf"`
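A minimal usage sketch follows, assuming the function lives next to `generate_perf_report` in `models/production` and returns the vectorized matrix; both are assumptions, not confirmed by the repo:

```python
import pandas as pd

# Assumed import path, mirroring generate_perf_report's location
from models.production.vectorize_data import vectorize_data

df = pd.read_csv("data/original/train.csv")
# Vectorize the tweet texts with TF-IDF (the only documented method option)
X = vectorize_data(df["text"], method="tfidf")
```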
MIT License © 2019-2020 Kamil Szymkowski