Sentiment categorization system using classical ML algorithms for tweets | A2 for COL772 course (Fall 21)


subhalingamd/nlp-tweet-sentiment


Tweet Sentiment Mining

  1. Motivation
  2. Problem Statement
  3. Dataset
    1. Train Data
    2. Test Data
  4. Methodology
  5. Running the code
    1. Directory Structure
    2. Requirements
    3. Training
    4. Testing
  6. Results
  7. Credits

Motivation

The motivation of this assignment is to get practice with text categorization using classical Machine Learning algorithms.

Problem Statement

The goal of the assignment is to build a sentiment categorization system for tweets. The input of the code will be a set of tweets and the output will be a prediction for each tweet – positive or negative.

Dataset

Train Data

For training, we use the (processed version of) Sentiment140 dataset. The format of each line in the training dataset is <“label”, “tweet”>. It has a total of 1.6 million tweets. A label of 0 means negative sentiment and a label of 4 means positive sentiment.
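To make the format concrete, here is a minimal sketch of loading a file in this <"label", "tweet"> format with pandas; the column names `label` and `tweet` are chosen here purely for illustration:

```python
import io

import pandas as pd

# Sketch of reading the <"label", "tweet"> training format; the column
# names are illustrative, not mandated by the dataset.
sample = io.StringIO('"0","this movie was awful"\n"4","what a lovely day"\n')
df = pd.read_csv(sample, header=None, names=["label", "tweet"], quotechar='"')

# label 0 = negative sentiment, label 4 = positive sentiment
negatives = df[df["label"] == 0]
```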

A note from the creators:

"Our approach was unique because our training data was automatically created, as opposed to having humans manual annotate tweets. In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative. We used the Twitter Search API to collect these tweets by using keyword search"

Training data can be found at _data/training.zip (to be unzipped).

Test Data

The final program will take input a set of tweets. Each line will have one tweet (without double quotes). The program will output predictions (0 or 4) one per line – matching one prediction per tweet.
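The I/O contract above can be sketched as follows; `predict_one` here is a hypothetical stand-in for the trained model, not the repository's actual predictor:

```python
# Sketch of the test-time contract: one tweet per input line, one
# prediction (0 or 4) per output line, in the same order.
def predict_one(tweet: str) -> int:
    # Placeholder rule for illustration only; the real system uses the
    # trained TF-IDF + logistic-regression model.
    return 4 if ":)" in tweet else 0

def predict_lines(lines):
    return [str(predict_one(line.rstrip("\n"))) for line in lines]

preds = predict_lines(["great game :)\n", "stuck in traffic again\n"])
```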

Test data can be found at _data/test/.

Methodology

Each tweet undergoes several pre-processing steps. After several experiments, we settled on the strategy implemented in preprocessing.py, which includes, among others:

  • replacement of hashtags, mentions, URLs, punctuation and emoticons with placeholder tokens;
  • normalization of words with letters repeated for intensity, using an internet slang dictionary;
  • removal of contractions;
  • reversal of polarity after negations (until the next punctuation mark);
  • stemming with the Porter Stemmer.
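A simplified sketch of this pre-processing strategy is shown below; the placeholder tokens and regexes are illustrative, and the slang lookup, negation handling and Porter stemming from preprocessing.py are omitted:

```python
import re

# Simplified, illustrative version of the pre-processing described above;
# the exact placeholders and rules live in preprocessing.py.
def preprocess(tweet: str) -> str:
    t = tweet.lower()
    t = re.sub(r"https?://\S+", " <url> ", t)    # URLs -> placeholder
    t = re.sub(r"@\w+", " <mention> ", t)        # mentions -> placeholder
    t = re.sub(r"#(\w+)", r" <hashtag> \1 ", t)  # hashtags -> placeholder + word
    t = re.sub(r"(.)\1{2,}", r"\1\1", t)         # "soooo" -> "soo" (intensity)
    return re.sub(r"\s+", " ", t).strip()

out = preprocess("Sooooo happy!! visit http://x.co @bob #win")
```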

After pre-processing, we settled on TfidfVectorizer for feature extraction and LogisticRegression for prediction. The pipeline is defined in the main code (main.py).
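A minimal sketch of such a TfidfVectorizer + LogisticRegression pipeline is shown below; the hyperparameters and toy data are illustrative, not the tuned configuration from the repository:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Minimal sketch of the feature-extraction + classifier pipeline named
# above; hyperparameters are illustrative defaults, not the tuned ones.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Toy training data in the dataset's 0 (negative) / 4 (positive) scheme.
texts = ["i love this game", "love it so much",
         "this is terrible", "awful, i hate it"]
labels = [4, 4, 0, 0]
clf.fit(texts, labels)
pred = clf.predict(["i love it"])[0]
```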

Running the code

Directory Structure

The main code (for both training and testing) is main.py. The program uses constants.py, preprocessing.py and utils.py (the file names are self-explanatory). slang.txt contains the list of internet slang terms (one per line) used for pre-processing.

stats.py is made for EDA (not used by the main code).

_data/ contains the dataset.

Requirements

The following Python packages are required:

 pandas==1.1.3
 numpy==1.19.5
 scikit-learn==0.24.2
 scipy==1.5.3
 nltk==3.5

Make sure nltk's punkt tokenizer data is downloaded (e.g. via nltk.download('punkt')); it is needed for the tokenization step that precedes PorterStemmer.

Training

bash run-train.sh <data_directory> <model_directory>

The script reads training data from <data_directory>/training.csv and saves the model after training at <model_directory>.

For format of input data, see section 3.i.

Testing

bash run-test.sh <model_directory> <input_file_path> <output_file_path>

The script loads the trained model from <model_directory>, scores the tweets in <input_file_path> and writes the predictions to <output_file_path>.

For format of input/output data, see section 3.ii.

Results

Accuracy on test set: 0.7833

Credits

  • Internet slangs dictionary: [Link1] [Link2]

  • Emoticons dictionary: [Link]

  • Punctuation dictionary: [Link]

  • Contractions dictionary: [Link]

  • [Forum] Sentiwordnet/POS tagging might not work well for tweets: [Link]


This README uses texts from the assignment problem document provided in the course.
