Sentiment categorization system using classical ML algorithms for tweets | A2 for COL772 course (Fall 21)


subhalingamd/nlp-tweet-sentiment


Tweet Sentiment Mining

  1. Motivation
  2. Problem Statement
  3. Dataset
    1. Train Data
    2. Test Data
  4. Methodology
  5. Running the code
    1. Directory Structure
    2. Requirements
    3. Training
    4. Testing
  6. Results
  7. Credits

Motivation

The motivation of this assignment is to get practice with text categorization using classical Machine Learning algorithms.

Problem Statement

The goal of the assignment is to build a sentiment categorization system for tweets. The input of the code will be a set of tweets and the output will be a prediction for each tweet – positive or negative.

Dataset

Train Data

For training, we use the (processed version of) Sentiment140 dataset. The format of each line in the training dataset is <“label”, “tweet”>. It has a total of 1.6 million tweets. A label of 0 means negative sentiment and a label of 4 means positive sentiment.
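To make the format concrete, here is a minimal sketch of loading a file in this <"label", "tweet"> format with pandas; the column names `label` and `tweet` are chosen here purely for illustration:

```python
import io

import pandas as pd

# Sketch of reading the <"label", "tweet"> training format; the column
# names are illustrative, not mandated by the dataset.
sample = io.StringIO('"0","this movie was awful"\n"4","what a lovely day"\n')
df = pd.read_csv(sample, header=None, names=["label", "tweet"], quotechar='"')

# label 0 = negative sentiment, label 4 = positive sentiment
negatives = df[df["label"] == 0]
```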

A note from the creators:

"Our approach was unique because our training data was automatically created, as opposed to having humans manual annotate tweets. In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative. We used the Twitter Search API to collect these tweets by using keyword search"

Training data can be found at _data/training.zip (to be unzipped).

Test Data

The final program will take input a set of tweets. Each line will have one tweet (without double quotes). The program will output predictions (0 or 4) one per line – matching one prediction per tweet.
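The I/O contract above can be sketched as follows; `predict_one` here is a hypothetical stand-in for the trained model, not the repository's actual predictor:

```python
# Sketch of the test-time contract: one tweet per input line, one
# prediction (0 or 4) per output line, in the same order.
def predict_one(tweet: str) -> int:
    # Placeholder rule for illustration only; the real system uses the
    # trained TF-IDF + logistic-regression model.
    return 4 if ":)" in tweet else 0

def predict_lines(lines):
    return [str(predict_one(line.rstrip("\n"))) for line in lines]

preds = predict_lines(["great game :)\n", "stuck in traffic again\n"])
```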

Test data can be found at _data/test/.

Methodology

Each tweet undergoes several pre-processing steps. After several experiments, we settled on the strategy implemented in preprocessing.py, which includes, among others:

  • replacement of hashtags, mentions, URLs, punctuation and emoticons with placeholder tokens;
  • normalization of words with letters repeated for intensity, using an internet slang dictionary;
  • removal of contractions;
  • reversal of polarity after negations (until the next punctuation mark);
  • stemming with the Porter Stemmer.
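A simplified sketch of this pre-processing strategy is shown below; the placeholder tokens and regexes are illustrative, and the slang lookup, negation handling and Porter stemming from preprocessing.py are omitted:

```python
import re

# Simplified, illustrative version of the pre-processing described above;
# the exact placeholders and rules live in preprocessing.py.
def preprocess(tweet: str) -> str:
    t = tweet.lower()
    t = re.sub(r"https?://\S+", " <url> ", t)    # URLs -> placeholder
    t = re.sub(r"@\w+", " <mention> ", t)        # mentions -> placeholder
    t = re.sub(r"#(\w+)", r" <hashtag> \1 ", t)  # hashtags -> placeholder + word
    t = re.sub(r"(.)\1{2,}", r"\1\1", t)         # "soooo" -> "soo" (intensity)
    return re.sub(r"\s+", " ", t).strip()

out = preprocess("Sooooo happy!! visit http://x.co @bob #win")
```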

After pre-processing, we settled on TfidfVectorizer for feature extraction and LogisticRegression for prediction. The pipeline is defined in the main code (main.py).
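A minimal sketch of such a TfidfVectorizer + LogisticRegression pipeline is shown below; the hyperparameters and toy data are illustrative, not the tuned configuration from the repository:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Minimal sketch of the feature-extraction + classifier pipeline named
# above; hyperparameters are illustrative defaults, not the tuned ones.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Toy training data in the dataset's 0 (negative) / 4 (positive) scheme.
texts = ["i love this game", "love it so much",
         "this is terrible", "awful, i hate it"]
labels = [4, 4, 0, 0]
clf.fit(texts, labels)
pred = clf.predict(["i love it"])[0]
```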

Running the code

Directory Structure

The main code (for both training and testing) is main.py. The program uses constants.py, preprocessing.py and utils.py (the file names are self-explanatory). slang.txt contains the list of internet slang terms (one per line) used for pre-processing.

stats.py is made for EDA (not used by the main code).

_data/ contains the dataset.

Requirements

The following Python packages are required:

 pandas==1.1.3
 numpy==1.19.5
 scikit-learn==0.24.2
 scipy==1.5.3
 nltk==3.5

Make sure nltk's punkt tokenizer data is downloaded (e.g. via nltk.download('punkt')); it is needed for the tokenization step that precedes PorterStemmer.

Training

bash run-train.sh <data_directory> <model_directory>

The script reads training data from <data_directory>/training.csv and saves the model after training at <model_directory>.

For format of input data, see section 3.i.

Testing

bash run-test.sh <model_directory> <input_file_path> <output_file_path>

The script loads the trained model from <model_directory>, scores the tweets in <input_file_path> and writes the predictions to <output_file_path>.

For format of input/output data, see section 3.ii.

Results

Accuracy on test set: 0.7833

Credits

  • Internet slangs dictionary: [Link1] [Link2]

  • Emoticons dictionary: [Link]

  • Punctuation dictionary: [Link]

  • Contractions dictionary: [Link]

  • [Forum] Sentiwordnet/POS tagging might not work well for tweets: [Link]


This README uses texts from the assignment problem document provided in the course.
