Natural Language Processing with Disaster Tweets

This Kaggle project builds a machine learning model that predicts which tweets are about real disasters and which are not. The dataset consists of 10,000 hand-classified tweets. The challenge is to separate the two classes even when a tweet uses disaster-related words metaphorically or contains potentially offensive content.

Competition Description

Twitter has become an important communication channel during emergencies. Because people can announce an emergency in real time, disaster relief organizations and news agencies have an interest in monitoring the platform. However, determining whether a tweet is actually announcing a disaster can be challenging, especially for machines.

The goal of this project is to build a machine learning model that can accurately predict if a given tweet is about a real disaster (1) or not (0).

Dataset

The dataset for this competition contains potentially profane, vulgar, or offensive text. The necessary files include:

train.csv: The training set
test.csv: The test set
sample_submission.csv: A sample submission file in the correct format

Each sample in the train and test set contains the following information:

The text of a tweet
A keyword from that tweet (may be blank)
The location the tweet was sent from (may be blank)

Columns

id: A unique identifier for each tweet
text: The text of the tweet
location: The location the tweet was sent from (may be blank)
keyword: A particular keyword from the tweet (may be blank)
target: In train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
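
As a minimal sketch of how these columns could be loaded with pandas, the example below reads a small in-memory CSV in the same layout. The two rows are invented for illustration and are not taken from the dataset, and the helper name load_tweets is hypothetical. Replacing blank keyword and location values with empty strings is an assumption about how missing values might be handled, not necessarily what the notebook does.

```python
import io

import pandas as pd

# Hypothetical rows in the train.csv layout; not taken from the real dataset.
SAMPLE_CSV = """id,keyword,location,text,target
1,,,Forest fire near the lake forces evacuation,1
2,ablaze,London,This new album is ablaze,0
"""


def load_tweets(source):
    """Read a CSV in the competition layout and fill blank keyword/location with ''."""
    df = pd.read_csv(source)
    for col in ("keyword", "location"):
        # Blank cells are read as NaN; replace them so later string operations work.
        df[col] = df[col].fillna("")
    return df


train_df = load_tweets(io.StringIO(SAMPLE_CSV))
print(train_df[["id", "keyword", "location", "target"]])
```

For the real data, the same helper could be called as load_tweets("train.csv").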

Model

This project uses a Logistic Regression model with TfidfVectorizer for feature extraction. The model achieved an F1 score of 0.78516. The following steps were taken to preprocess the data and train the model:

Remove URLs from the text
Convert the text to lowercase
Remove non-alphanumeric characters
Combine the keyword and text columns
Split the data into a training and validation set
Convert the text to a tf-idf matrix
Train a Logistic Regression model
Evaluate the model using accuracy and F1 score
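
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the notebook's exact code: the regular expressions, the 75/25 split, random_state, and max_iter are all choices made for this example, the helper names clean_text and combine are hypothetical, and the eight tweets are invented so that the snippet runs without train.csv.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Assumed patterns; the original project may use different expressions.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_PATTERN = re.compile(r"[^a-z0-9\s]")


def clean_text(text):
    """Remove URLs, lowercase, and replace non-alphanumeric characters with spaces."""
    text = URL_PATTERN.sub(" ", text)
    text = text.lower()
    text = NON_ALNUM_PATTERN.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def combine(keyword, text):
    """Prepend the (possibly blank) keyword to the cleaned tweet text."""
    return f"{keyword} {clean_text(text)}".strip()


# Hypothetical (keyword, text, target) rows, invented for illustration only.
ROWS = [
    ("wildfire", "Wildfire spreading fast near the highway http://t.co/x1", 1),
    ("flood", "Flood waters have closed the main bridge!", 1),
    ("earthquake", "Earthquake damages several buildings downtown", 1),
    ("", "Evacuation ordered as the storm makes landfall", 1),
    ("ablaze", "My inbox is ablaze with birthday wishes :)", 0),
    ("", "That concert was a total blast", 0),
    ("flood", "A flood of memories looking at old photos", 0),
    ("crush", "I have a huge crush on this new song", 0),
]

documents = [combine(keyword, text) for keyword, text, _ in ROWS]
labels = [target for _, _, target in ROWS]

# Hold out a stratified validation set (split ratio is an assumption).
X_train, X_val, y_train, y_val = train_test_split(
    documents, labels, test_size=0.25, stratify=labels, random_state=42
)

# TF-IDF features feed directly into logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, val_pred)
val_f1 = f1_score(y_val, val_pred, zero_division=0)
print(f"accuracy={val_accuracy:.3f}  f1={val_f1:.3f}")
```

Wrapping the vectorizer and classifier in one pipeline means the tf-idf vocabulary is learned from the training split only, which keeps validation text from leaking into the features. Scores on a toy set this small say nothing about the real dataset.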

Results and Conclusion

The Logistic Regression model, combined with the TfidfVectorizer, achieved an F1 score of 0.78516. This performance indicates that the model is reasonably effective at predicting whether a tweet is about a real disaster or not. Further improvements may be possible with more advanced natural language processing techniques or more complex machine learning models.
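
For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The function name f1_from_counts and the counts in the example are hypothetical and unrelated to the 0.78516 result.

```python
def f1_from_counts(tp, fp, fn):
    """Return the F1 score given true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0  # no correct positive predictions at all
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 disaster tweets found, 2 false alarms, 2 missed.
example_f1 = f1_from_counts(tp=8, fp=2, fn=2)
print(f"{example_f1:.3f}")
```

Unlike accuracy, this metric ignores true negatives, so a model cannot score well simply by labelling every tweet as non-disaster.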

Execution

Install the required libraries (the code also uses re, which is part of the Python standard library and needs no installation):
- pandas
- numpy
- scikit-learn
Place the datasets (train.csv and test.csv) in the same directory as the code.
Run the provided code to preprocess the data, train the model, and make predictions on the test set.
The predictions will be saved in a submission.csv file in the correct format.
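
As a rough sketch of the final step, the snippet below assembles predictions into a two-column table and serializes it as CSV. It assumes the submission uses the id and target columns described under Columns. The ids and predictions are placeholders, and the helper name build_submission is hypothetical.

```python
import io

import pandas as pd


def build_submission(test_ids, predictions):
    """Pair each test id with its predicted label (1 = real disaster, 0 = not)."""
    return pd.DataFrame(
        {"id": list(test_ids), "target": [int(p) for p in predictions]}
    )


# Placeholder ids and predictions, for illustration only.
submission = build_submission([0, 2, 3], [1, 0, 1])

# Written to an in-memory buffer here so the example leaves no file behind;
# submission.to_csv("submission.csv", index=False) would write the real file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing index=False keeps the pandas row index out of the file, so the output contains only the two expected columns.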

License

This project is licensed under the MIT License. The MIT License is a permissive open source license that allows for free use, copying, modification, and distribution of the software, as long as the copyright notice and permission notice are included in all copies or substantial portions of the software. This license is suitable for both academic and commercial projects.

Reference

Howard, A., devrishi, Culliton, P., & Guo, Y. (2019). Natural Language Processing with Disaster Tweets. Kaggle. Retrieved from https://kaggle.com/competitions/nlp-getting-started

Author

Zeyong Jin

April 21st, 2023
