NLP with Disaster Tweets (group project)

Group solution to the Kaggle problem titled "Natural Language Processing with Disaster Tweets". The task is to classify the text of 10,000 tweets into one of two classes: tweets about a real disaster (1) and tweets that are not about a real disaster (0).

A dashboard implementing our solution is available here.

  • ⭐ Kaggle's top score: 0.86117
  • ⭐ Our top prediction score: 0.84155

dashboard

Contents

  1. πŸ”— Related Projects
  2. πŸ‘“ Theory
  3. βš™οΈ Setup
  4. πŸš€ How to run
  5. πŸ‘¨β€πŸ’» Contributing
  6. πŸ›οΈ Architecture
  7. πŸ“‚ Directory Structure
  8. πŸŽ“ Learning Materials
  9. πŸ“… Development Schedule
  10. πŸ†• Project duration
  11. πŸ€– Stack
  12. πŸ“ Examples
  13. βš™ Configurations
  14. πŸ’‘ Tips
  15. 🚧 Warnings
  16. 🧰 Troubleshooting
  17. πŸ“§ Contact
  18. πŸ“„ License

πŸ”— Related Projects

πŸ‘“ Theory

Theory has been moved to the repo's wiki

βš™οΈ Setup

Complete these steps before the "πŸš€ How to run" section:

  • Create a virtual environment using virtualenv venv
  • Activate the virtual environment by running source venv/bin/activate
  • On Windows use venv\Scripts\activate.bat
  • Install the dependencies using pip install -r requirements.txt

    πŸš€ How to run

    Follow the steps in the "βš™οΈ Setup" section to install all the dependencies.
    

    How to access the web app?

    The dashboard is deployed on Heroku and is live at https://nlp-disaster-tweets.herokuapp.com/

    How to run the web app ("NLP Disaster Tweets") locally?

    1. Clone the repo to the destination of your choice: git clone https://github.com/SzymkowskiDev/nlp-disaster-tweets.git
    2. Open a terminal (e.g. Anaconda Prompt) and change the directory to the root of the project nlp-disaster-tweets
    3. In the terminal run the command python app.py
    4. The app will launch in your web browser at the address http://127.0.0.1:8050/

    How to run the REST API development server ("Disaster Retweeter Web") locally?

    1. Clone the repo to the destination of your choice: git clone https://github.com/bswck/disaster-retweeter
    2. In a terminal (e.g. Anaconda Prompt) change the directory to the root of the project disaster-retweeter
    3. In the terminal run the command uvicorn retweeter_web.app.main:app
    4. The app will launch in your web browser at the address http://127.0.0.1:8000/

    How to run a Jupyter notebook?

    In the first iteration of the project, running it amounts to downloading a Jupyter notebook from the "notebooks" directory and launching it with Jupyter. Jupyter is available as part of the Anaconda suite from https://www.anaconda.com/.

    When feeding a Jupyter notebook with data, use data provided in directory "train_split" here.
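The training CSV follows the Kaggle competition schema (id, keyword, location, text, target). A minimal loading sketch with pandas, using an inline sample in place of the real file (the exact file name inside "train_split" is an assumption; substitute your own path in the notebook):

```python
# Minimal sketch: the competition CSV has columns id, keyword, location,
# text, target. An inline sample stands in for the real file; in a notebook
# you would point pd.read_csv at the CSV inside the "train_split" directory.
import io

import pandas as pd

sample = io.StringIO(
    "id,keyword,location,text,target\n"
    '1,flood,,"Flood warning issued for the river valley",1\n'
    '2,,,"What a lovely sunny day",0\n'
)
df = pd.read_csv(sample)
print(df[["text", "target"]])
```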

    πŸ‘¨β€πŸ’» Contributing

    πŸ›οΈ Architecture

    architecture

    πŸ“‚ Directory Structure

    β”œβ”€β”€β”€assets
    β”œβ”€β”€β”€dashboard/src
    |   β”œβ”€β”€β”€data
    |   β”œβ”€β”€β”€models/production
    |   └───tabs
    β”œβ”€β”€β”€disaster-retweeter (git module to https://github.com/bswck/disaster-retweeter)
    β”œβ”€β”€β”€notebooks
    β”œβ”€β”€β”€reports
    β”œβ”€β”€β”€submissions
    β”œβ”€β”€β”€app.py
    └───requirements.txt
    

    πŸŽ“ Learning/Reference Materials

    ❗ More resources are available on the team's Google Drive (discordnlp7@gmail.com); ask a team member for the password ❗

    ❗ Also check the repo's wiki ❗

    πŸ“… Development Schedule

    Version 1.0.0 MILESTONE: classification model

    • First Solution to the Kaggle's problem as a Jupyter notebook

    Version 1.1.0 MILESTONE: model + dashboard

    • Deployment of a blank dashboard (and integrate Dash)
    • Exploratory Data Analysis tab
      • Introduction
      • Data Quality Issues
      • Keyword
      • Location (+interactive globe data viz)
      • Text (Word frequency (+Wordcloud))
      • Target
    • Customized classification tab
    • Best performing
    • Make a prediction
    • Twitter BOT Analytics (blank)
    • About page

    Version 1.2.0 MILESTONE: twitter bot

    • The Disaster Retweeter
      • BOT: crawler
      • BOT: classification
      • BOT: retweeter
      • Database 1: PostgreSQL
      • Database 2: Redis
      • Rest API 1: analytics.py router
      • Rest API 2: logs.py router
      • Client part 1: module
      • Client part 2: dashboard end
      • tests

    πŸ†• Project duration

    03/07/2022 - 15/09/2022 (74 days)

    πŸ€– Stack

    • Python
    • pandas
    • scikit-learn
    • nltk
    • dash
    • PostgreSQL
    • Redis
    • FastAPI

    πŸ“ Examples

    Example 1. Measuring performance metrics with generate_perf_report()

    To generate a model performance report, use generate_perf_report(). It compares predictions made from the provided training data (X) against the expected labels (y) and gathers classification metrics such as precision, accuracy, etc.:

    import pandas as pd 
    from sklearn.feature_extraction.text import TfidfVectorizer
    from models.production.generate_perf_report import generate_perf_report
    
    # Load the training data, prepare the TF-IDF vectorizer just for this demo
    df = pd.read_csv(r"data\original\train.csv")
    tfidf_vect = TfidfVectorizer(max_features=5000)
    
    # Prepare training data and target values
    X = tfidf_vect.fit_transform(df['text'])
    y = df["target"].copy()
    
    # Generate and print the report
    report = generate_perf_report(
        X, y, name="demo report", description="tfidf vectorizer and no preprocessing"
    )
    print(report)

    Output:

    Date                               2022-07-29 00:17:16
    Description      tfidf vectorizer and no preprocessing
    Test Size                                         0.15
    Precision                                        0.875
    Recall                                        0.679208
    F1 Score                                      0.764771
    Accuracy                                      0.815236
    Roc_auc_score                                 0.801142
    Name: demo report, dtype: object
    

    Name, description, test size and date format in the report can be optionally specified.
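For illustration, a hypothetical re-implementation sketch of generate_perf_report() (the real one lives in models/production and its internals may differ; the LogisticRegression classifier here is an assumption). It holds out a test split, fits a model, and collects the metrics from the report above into a pandas Series:

```python
# Hypothetical sketch of generate_perf_report(): hold out a test split,
# fit a simple classifier (an assumption), and gather the report metrics.
from datetime import datetime

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def generate_perf_report(X, y, name="report", description="", test_size=0.15):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return pd.Series(
        {
            "Date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "Description": description,
            "Test Size": test_size,
            "Precision": precision_score(y_te, pred),
            "Recall": recall_score(y_te, pred),
            "F1 Score": f1_score(y_te, pred),
            "Accuracy": accuracy_score(y_te, pred),
            "Roc_auc_score": roc_auc_score(y_te, pred),
        },
        name=name,
    )
```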

    Example 2. Performing vectorization of choice with vectorize_data()

    Function vectorize_data() takes two parameters:

    • data
    • method - available options are: "tfidf"
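A hypothetical sketch of what vectorize_data() might look like for the documented "tfidf" option (the real implementation is in the repo; the data parameter is assumed here to be an iterable of raw text documents):

```python
# Hypothetical sketch of vectorize_data(); only the "tfidf" option is
# documented, and data is assumed to hold raw text documents.
from sklearn.feature_extraction.text import TfidfVectorizer


def vectorize_data(data, method="tfidf"):
    """Vectorize raw text documents with the chosen method."""
    if method == "tfidf":
        return TfidfVectorizer().fit_transform(data)
    raise ValueError(f"unknown vectorization method: {method!r}")


X = vectorize_data(["flood warning issued", "what a lovely sunny day"])
```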

    πŸ“§ Contact

    πŸ“„ License

    MIT License ©️ 2019-2020 Kamil Szymkowski

    banner
