This project aims to classify sentences as either subjective (expressing opinions, feelings, or personal views) or objective (stating facts or impartial information). By applying advanced natural language processing techniques, the goal is to identify distinguishing features of subjectivity and achieve high classification performance.
## Table of Contents
- Dataset
- Preprocessing
- Exploratory Data Analysis (EDA)
- Model Selection
- Evaluation Metrics
- Results
- Requirements
- How to Run
- References
## Dataset
The dataset contains labeled sentences, where:
- SUBJ indicates subjective sentences.
- OBJ indicates objective sentences.
The data was split into training and testing sets for model training and evaluation.
## Preprocessing
Several preprocessing steps were applied to clean and prepare the data:
- Lowercasing: Converted all text to lowercase for uniformity.
- Removing Punctuation and Special Characters: Eliminated unnecessary symbols.
- Tokenization: Split sentences into individual tokens (words).
- Stop-Word Removal: Removed common stop words while retaining relevant ones for subjectivity detection.
- Lemmatization: Converted words to their base forms.
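The steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not the project's `preprocessing.py`: the stop-word set and lemma table below are tiny stand-ins for the real resources (e.g. NLTK's stop-word list and a WordNet lemmatizer).

```python
import re

# Toy stop-word list for illustration; the real pipeline uses a fuller list
# while retaining words that carry subjectivity cues.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

# Tiny lemma lookup standing in for a real lemmatizer.
LEMMAS = {"movies": "movie", "felt": "feel"}

def preprocess(sentence):
    # Lowercasing: convert all text to lowercase for uniformity.
    text = sentence.lower()
    # Remove punctuation and special characters.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Tokenization: split the sentence into individual tokens.
    tokens = text.split()
    # Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Lemmatization: map each token to its base form where known.
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("The movies felt overrated, honestly!"))
# → ['movie', 'feel', 'overrated', 'honestly']
```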
## Exploratory Data Analysis (EDA)
EDA was conducted to understand the data better:
- Class Distribution: Analyzed the balance between subjective and objective sentences.
- Most Frequent Words: Identified common words in both categories.
- Word Pair Analysis: Explored co-occurrences of words.
- Sentiment Distribution: Examined sentiment variations across labels.
- Part of Speech Analysis: Analyzed the grammatical patterns in sentences.
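The first two analyses above (class distribution and most frequent words) can be reproduced with the standard library alone. The labeled sample below is hypothetical, standing in for the real training split:

```python
from collections import Counter

# Hypothetical labeled sample standing in for the real training data.
data = [
    ("i loved this film", "SUBJ"),
    ("the film was released in 1999", "OBJ"),
    ("i think the plot is brilliant", "SUBJ"),
    ("the film runs 120 minutes", "OBJ"),
]

# Class distribution: how balanced are SUBJ and OBJ?
label_counts = Counter(label for _, label in data)
print(label_counts)

# Most frequent words per class.
word_counts = {"SUBJ": Counter(), "OBJ": Counter()}
for text, label in data:
    word_counts[label].update(text.split())
print(word_counts["OBJ"].most_common(3))
```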
## Model Selection
Various models were implemented and compared:
- Traditional Machine Learning:
  - Naive Bayes
  - Random Forest
  - Gradient Boosting
  - K-Nearest Neighbors
- Deep Learning:
  - Convolutional Neural Networks (CNNs)
  - Recurrent Neural Networks (RNNs)
- Transformer Models:
  - BERT
BERT achieved the highest performance due to its contextual understanding and pre-trained features.
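As a minimal sketch of the traditional baselines, the snippet below trains a TF-IDF + Naive Bayes pipeline with scikit-learn. The six training sentences are invented for illustration; the real scripts read the TSV splits instead.

```python
# Minimal traditional-ML baseline: TF-IDF features + Multinomial Naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data standing in for the real TSV split.
train_texts = [
    "i loved it",
    "i hated the acting",
    "what a wonderful story",
    "the film was released in 1999",
    "the runtime is 120 minutes",
    "it was filmed in london",
]
train_labels = ["SUBJ", "SUBJ", "SUBJ", "OBJ", "OBJ", "OBJ"]

# Vectorizer and classifier chained into a single fit/predict pipeline.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["i loved the story", "the film runs 120 minutes"]))
```

On this toy data the two test sentences land in SUBJ and OBJ respectively; the real comparison swaps in the other classifiers listed above behind the same pipeline interface.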
## Evaluation Metrics
The models were evaluated using:
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix
- Area Under Curve (AUC)
- Precision-Recall Curve
- Matthews Correlation Coefficient (MCC)
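The core binary metrics above can be computed directly from the confusion-matrix counts. This is a pure-Python sketch treating SUBJ as the positive class; in practice the scikit-learn equivalents do the same work.

```python
import math

def binary_metrics(gold, pred, positive="SUBJ"):
    # Confusion-matrix counts with SUBJ as the positive class.
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    accuracy = (tp + tn) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Matthews Correlation Coefficient: stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

gold = ["SUBJ", "OBJ", "OBJ", "SUBJ", "OBJ"]
pred = ["SUBJ", "OBJ", "SUBJ", "SUBJ", "OBJ"]
print(binary_metrics(gold, pred))
```

On this example (2 true positives, 2 true negatives, 1 false positive) accuracy and F1 are both 0.8 while MCC is about 0.67, illustrating why MCC is worth reporting alongside accuracy on an imbalanced dataset.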
## Results
- BERT outperformed the other models, with balanced accuracy across subjective and objective sentences.
- The data imbalance (objective sentences dominating the dataset) significantly impacted model performance, particularly for traditional and deep learning models.
## How to Run

### 1. Preprocess the Data
1. Navigate to the `model/preprocessing/` directory.
2. Open the `preprocessing.py` file and update the `input_file` variable with the path to your input dataset. For example:
   ```python
   input_file = "path/to/your/input_file.tsv"
   ```
3. Run the script using the following command:
   ```bash
   python preprocessing.py
   ```
4. The preprocessed data will be saved in the output location specified inside the script.
### 2. Run a Model
1. Locate the model you want to run in the `model/model_implementation/` directory.
2. Open the script for the model you wish to test (e.g., `CNN.py`, `RNN.py`, or `BERT.py`).
3. Update the `test_data` or `test_df` line to point to the path of your test dataset. For example:
   ```python
   test_df = pd.read_csv("path/to/your/test_data.tsv", sep="\t")
   ```
4. Run the model script using:
   ```bash
   python <model_name>.py
   ```
5. The predictions will be saved in the `model_outputs/` directory, in a file named after the model you ran.
### 3. Evaluate the Results
1. Navigate to the `model/` directory.
2. Open the `model_scorer.py` file and update the following variables:
   - `gold_file_path`: path to the gold-standard test dataset.
   - `pred_file_path`: path to the output predictions from your model.
   - `output_figures`: desired name for the result figures.

   For example:
   ```python
   gold_file_path = "path/to/gold_file.tsv"
   pred_file_path = "path/to/predictions.tsv"
   output_figures = "model_evaluation_results"
   ```
3. Run the evaluation script:
   ```bash
   python model_scorer.py
   ```
4. The evaluation results, including figures and metrics, will be saved in the `evaluation_scores_data/` directory.
## Requirements
To run the project, install the following libraries:
- transformers
- tensorflow
- scikit-learn
- pandas
- matplotlib
- seaborn
- nltk
- spacy
- numpy
- xgboost
- imbalanced-learn
Install the dependencies using:
```bash
pip install -r requirements.txt
```