This project aims to classify sentences as either subjective (expressing opinions, feelings, or personal views) or objective (stating facts or impartial information). By applying advanced natural language processing techniques, the goal is to identify distinguishing features of subjectivity and achieve high classification performance.
## Table of Contents
- Dataset
- Preprocessing
- Exploratory Data Analysis (EDA)
- Model Selection
- Evaluation Metrics
- Results
- Requirements
- How to Run
- References
## Dataset
The dataset contains labeled sentences, where:
- SUBJ indicates subjective sentences.
- OBJ indicates objective sentences.
The data was split into training and testing sets for model training and evaluation.
## Preprocessing
Several preprocessing steps were applied to clean and prepare the data:
- Lowercasing: Converted all text to lowercase for uniformity.
- Removing Punctuation and Special Characters: Eliminated unnecessary symbols.
- Tokenization: Split sentences into individual tokens (words).
- Stop-Word Removal: Removed common stop words while retaining relevant ones for subjectivity detection.
- Lemmatization: Converted words to their base forms.
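The steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not the project's `preprocessing.py`: the stop-word set and lemma table below are tiny stand-ins for the real resources (e.g. NLTK's stop-word list and a WordNet lemmatizer).

```python
import re

# Toy stop-word list for illustration; the real pipeline uses a fuller list
# while retaining words that carry subjectivity cues.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

# Tiny lemma lookup standing in for a real lemmatizer.
LEMMAS = {"movies": "movie", "felt": "feel"}

def preprocess(sentence):
    # Lowercasing: convert all text to lowercase for uniformity.
    text = sentence.lower()
    # Remove punctuation and special characters.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Tokenization: split the sentence into individual tokens.
    tokens = text.split()
    # Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Lemmatization: map each token to its base form where known.
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("The movies felt overrated, honestly!"))
# → ['movie', 'feel', 'overrated', 'honestly']
```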
## Exploratory Data Analysis (EDA)
EDA was conducted to understand the data better:
- Class Distribution: Analyzed the balance between subjective and objective sentences.
- Most Frequent Words: Identified common words in both categories.
- Word Pair Analysis: Explored co-occurrences of words.
- Sentiment Distribution: Examined sentiment variations across labels.
- Part of Speech Analysis: Analyzed the grammatical patterns in sentences.
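The first two analyses above (class distribution and most frequent words) can be reproduced with the standard library alone. The labeled sample below is hypothetical, standing in for the real training split:

```python
from collections import Counter

# Hypothetical labeled sample standing in for the real training data.
data = [
    ("i loved this film", "SUBJ"),
    ("the film was released in 1999", "OBJ"),
    ("i think the plot is brilliant", "SUBJ"),
    ("the film runs 120 minutes", "OBJ"),
]

# Class distribution: how balanced are SUBJ and OBJ?
label_counts = Counter(label for _, label in data)
print(label_counts)

# Most frequent words per class.
word_counts = {"SUBJ": Counter(), "OBJ": Counter()}
for text, label in data:
    word_counts[label].update(text.split())
print(word_counts["OBJ"].most_common(3))
```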
## Model Selection
Various models were implemented and compared:
- Traditional Machine Learning:
  - Naive Bayes
  - Random Forest
  - Gradient Boosting
  - K-Nearest Neighbors
- Deep Learning:
  - Convolutional Neural Networks (CNNs)
  - Recurrent Neural Networks (RNNs)
- Transformer Models:
  - BERT
BERT achieved the highest performance due to its contextual understanding and pre-trained features.
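As a minimal sketch of the traditional baselines, the snippet below trains a TF-IDF + Naive Bayes pipeline with scikit-learn. The six training sentences are invented for illustration; the real scripts read the TSV splits instead.

```python
# Minimal traditional-ML baseline: TF-IDF features + Multinomial Naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data standing in for the real TSV split.
train_texts = [
    "i loved it",
    "i hated the acting",
    "what a wonderful story",
    "the film was released in 1999",
    "the runtime is 120 minutes",
    "it was filmed in london",
]
train_labels = ["SUBJ", "SUBJ", "SUBJ", "OBJ", "OBJ", "OBJ"]

# Vectorizer and classifier chained into a single fit/predict pipeline.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["i loved the story", "the film runs 120 minutes"]))
```

On this toy data the two test sentences land in SUBJ and OBJ respectively; the real comparison swaps in the other classifiers listed above behind the same pipeline interface.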
## Evaluation Metrics
The models were evaluated using:
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix
- Area Under Curve (AUC)
- Precision-Recall Curve
- Matthews Correlation Coefficient (MCC)
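The core binary metrics above can be computed directly from the confusion-matrix counts. This is a pure-Python sketch treating SUBJ as the positive class; in practice the scikit-learn equivalents do the same work.

```python
import math

def binary_metrics(gold, pred, positive="SUBJ"):
    # Confusion-matrix counts with SUBJ as the positive class.
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    accuracy = (tp + tn) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Matthews Correlation Coefficient: stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

gold = ["SUBJ", "OBJ", "OBJ", "SUBJ", "OBJ"]
pred = ["SUBJ", "OBJ", "SUBJ", "SUBJ", "OBJ"]
print(binary_metrics(gold, pred))
```

On this example (2 true positives, 2 true negatives, 1 false positive) accuracy and F1 are both 0.8 while MCC is about 0.67, illustrating why MCC is worth reporting alongside accuracy on an imbalanced dataset.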
## Results
- BERT outperformed the other models, with balanced accuracy across subjective and objective sentences.
- The data imbalance (objective sentences dominating the dataset) significantly impacted model performance, particularly for traditional and deep learning models.
## How to Run

### 1. Preprocess the Data
1. Navigate to the `model/preprocessing/` directory.
2. Open the `preprocessing.py` file and update the `input_file` variable with the path to your input dataset. For example:
   ```python
   input_file = "path/to/your/input_file.tsv"
   ```
3. Run the script using the following command:
   ```bash
   python preprocessing.py
   ```
4. The preprocessed data will be saved in the output location specified inside the script.
### 2. Run a Model
1. Locate the model you want to run in the `model/model_implementation/` directory.
2. Open the script for the model you wish to test (e.g., `CNN.py`, `RNN.py`, or `BERT.py`).
3. Update the `test_data` or `test_df` line to point to the path of your test dataset. For example:
   ```python
   test_df = pd.read_csv("path/to/your/test_data.tsv", sep="\t")
   ```
4. Run the model script using:
   ```bash
   python <model_name>.py
   ```
5. The predictions will be saved in the `model_outputs/` directory, in a file named after the model you ran.
### 3. Evaluate the Results
1. Navigate to the `model/` directory.
2. Open the `model_scorer.py` file and update the following variables:
   - `gold_file_path`: path to the gold-standard test dataset.
   - `pred_file_path`: path to the output predictions from your model.
   - `output_figures`: desired name for the result figures.

   For example:
   ```python
   gold_file_path = "path/to/gold_file.tsv"
   pred_file_path = "path/to/predictions.tsv"
   output_figures = "model_evaluation_results"
   ```
3. Run the evaluation script:
   ```bash
   python model_scorer.py
   ```
4. The evaluation results, including figures and metrics, will be saved in the `evaluation_scores_data/` directory.
## Requirements
To run the project, install the following libraries:
- transformers
- tensorflow
- scikit-learn
- pandas
- matplotlib
- seaborn
- nltk
- spacy
- numpy
- xgboost
- imbalanced-learn
Install the dependencies using:
```bash
pip install -r requirements.txt
```