summer-waves/Tweet_Annotation_Analysis_Using_Logistic_Regression_Python

Tweet Annotation Analysis Using Logistic Regression

Contributors:

  • Amberly Cavazos
  • Marco Ortiz
  • Chinmeri Nwagwu
  • Shutao Zhang

📃 Table of Contents

  1. Project Overview
  2. Files
  3. Methodology
  4. Results
  5. Tools and Libraries
  6. Output
  7. Future Improvements
  8. Key Takeaways

📝 Project Overview

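Built logistic regression models in Python to classify 1,000 tweets on two binary labels: political vs. non-political and tech vs. non-tech. Features came from TF-IDF, TextBlob sentiment scoring, and structural text features. The best models reached a micro F1 (equal to accuracy for these single-label binary tasks) of 0.9650 on the political task and 0.6650 on the technology task, so the 85%+ accuracy figure applies to the political task only. Findings are documented using Jupyter Notebook, Jupyter Lab, and GitHub.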


📁 Files

  • Final_Annotated_Comments.xlsx – Annotated dataset used for modeling.
  • best_model_predictions_tech.csv – Results from the best model for technology classification.
  • best_model_predictions_pol.csv – Results from the best model for political classification.
  • final_project_modeling.ipynb – Jupyter Notebook with all code and outputs.

🧪 Methodology

Preprocessing

  • Removed missing values in comment or label fields.
  • Converted labels to binary:
    • Technology: "tech" → 1, "NoneTech" → 0
    • Political: "Pol" → 1, "NoPol" → 0
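The two preprocessing steps above can be sketched with pandas as follows. The column names (`comment`, `tech_label`, `pol_label`) and the sample rows are assumptions for illustration; the real headers in Final_Annotated_Comments.xlsx are not shown in this README. Only the label strings and their 0/1 targets come from the list above.

```python
import pandas as pd

# Toy stand-in for the annotated spreadsheet. With the real file this would
# start from pd.read_excel("Final_Annotated_Comments.xlsx") instead.
# Column names here are hypothetical.
df = pd.DataFrame({
    "comment": ["New GPU drivers are out", None, "Vote on Tuesday!", "Nice weather today"],
    "tech_label": ["tech", "NoneTech", "NoneTech", None],
    "pol_label": ["NoPol", "NoPol", "Pol", "NoPol"],
})

# Drop rows where the comment or either label is missing.
df = df.dropna(subset=["comment", "tech_label", "pol_label"]).copy()

# Convert the string labels to binary targets.
df["tech"] = df["tech_label"].map({"tech": 1, "NoneTech": 0})
df["pol"] = df["pol_label"].map({"Pol": 1, "NoPol": 0})

print(df[["comment", "tech", "pol"]])
```

Note that `map` yields NaN for any label string outside the dictionary, so an unexpected spelling in the real data would show up as a missing value rather than being silently coded as 0.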

Feature Engineering

Three types of features were extracted:

| Feature Type | Description |
|---|---|
| TF-IDF | Term Frequency–Inverse Document Frequency matrix using the top 1,000 features |
| Sentiment (Lexicon) | Sentiment polarity and subjectivity via TextBlob |
| Structural | Counts of capital letters and exclamation marks, plus total text length |
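As a rough illustration of the structural feature set, the three counts can be computed with plain Python string methods. The function name and the dictionary keys are hypothetical and may differ from what the notebook uses.

```python
def structural_features(text):
    """Return the three structural counts for one tweet (illustrative names)."""
    return {
        "n_capitals": sum(ch.isupper() for ch in text),  # capital letters
        "n_exclamations": text.count("!"),               # exclamation marks
        "length": len(text),                             # total characters
    }


features = structural_features("VOTE today!!")
print(features)  # {'n_capitals': 4, 'n_exclamations': 2, 'length': 12}
```

The sentiment features would be built the same way, one row per tweet, from the `polarity` and `subjectivity` values that TextBlob exposes on `TextBlob(text).sentiment`.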

Modeling

  • Classifier: Logistic Regression
  • Evaluation: Macro F1 Score and Micro F1 Score
  • Split: 80% training, 20% testing (stratified)
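A minimal sketch of this setup for the TF-IDF feature set is shown below. The toy texts, `random_state=42`, and `max_iter=1000` are assumptions made so the example runs on its own; the notebook's actual settings are not stated in this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data: 10 tech tweets (label 1) and 10 non-tech tweets (label 0).
tech = [f"new gpu chip and software update number {i}" for i in range(10)]
other = [f"sunny walk in the park with friends day {i}" for i in range(10)]
texts = tech + other
labels = [1] * 10 + [0] * 10

# 80/20 split, stratified so both sets keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# TF-IDF capped at 1,000 features, followed by logistic regression.
model = make_pipeline(
    TfidfVectorizer(max_features=1000),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

macro_f1 = f1_score(y_test, y_pred, average="macro")
micro_f1 = f1_score(y_test, y_pred, average="micro")
print(f"Macro F1: {macro_f1:.4f}  Micro F1: {micro_f1:.4f}")
```

Putting the vectorizer inside the pipeline means its vocabulary is learned from the training split only, which keeps test tweets from leaking into the features.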

📊 Results

Technology Classification

| Feature Set | Macro F1 | Micro F1 |
|---|---|---|
| TF-IDF | 0.5726 | 0.6650 |
| Lexicon | 0.3865 | 0.6300 |
| Structural | 0.3990 | 0.6300 |

📉 Best Model: TF-IDF

Political Classification

| Feature Set | Macro F1 | Micro F1 |
|---|---|---|
| TF-IDF | 0.4872 | 0.9500 |
| Lexicon | 0.4872 | 0.9500 |
| Structural | 0.7217 | 0.9650 |

📉 Best Model: Structural
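Several rows above pair a high micro F1 with a low macro F1. One way to read them is to compare against a baseline that assigns every test tweet to the majority class. The sketch below computes that baseline by hand. The test-set size of 200 (20% of 1,000) and the minority counts of 10 and 74 are assumptions inferred from the reported micro F1 values, not figures taken from the notebook.

```python
def f1(tp, fp, fn):
    """F1 for one class; taken as 0 when the class has no true positives."""
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def majority_baseline(n_total, n_minority):
    """Macro and micro F1 when every example is assigned the majority class."""
    n_majority = n_total - n_minority
    # Majority class: all its members are hits; minority members are false positives.
    f1_majority = f1(tp=n_majority, fp=n_minority, fn=0)
    # Minority class: never predicted, so every member is a false negative.
    f1_minority = f1(tp=0, fp=0, fn=n_minority)
    macro = (f1_majority + f1_minority) / 2
    micro = n_majority / n_total  # equals accuracy for single-label data
    return round(macro, 4), round(micro, 4)


# Assumed test size: 200 tweets (20% of 1,000).
political = majority_baseline(200, 10)   # 5% minority class
technology = majority_baseline(200, 74)  # 37% minority class
print(political)   # (0.4872, 0.95)
print(technology)  # (0.3865, 0.63)
```

These values match the TF-IDF and Lexicon rows of the political table and the Lexicon row of the technology table, which suggests those models predicted a single class for every test tweet. Under the same assumptions, the structural model on the technology task (macro F1 0.3990, micro F1 0.6300) sits only slightly above this baseline.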


🔧 Tools and Libraries

  • Python 3
  • pandas, numpy
  • scikit-learn (LogisticRegression, TfidfVectorizer for feature extraction, metrics, train_test_split)
  • TextBlob (sentiment analysis)
  • GitHub
  • Jupyter Notebook
  • Jupyter Lab

📈 Output

Final predictions for each best-performing model were saved in CSV format:

  • best_model_predictions_tech.csv
  • best_model_predictions_pol.csv

These contain:

  • Original tweet (comment)
  • True label (true_label)
  • Predicted label (predicted_label)
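A file with these three columns could be written with pandas roughly as follows. The sample rows and the temporary output directory are placeholders; only the column names and the file name come from this README.

```python
import os
import tempfile

import pandas as pd

# Placeholder rows; in the notebook these would come from the test split
# and the best model's predictions.
predictions = pd.DataFrame({
    "comment": ["New GPU drivers are out", "Nice weather today"],
    "true_label": [1, 0],
    "predicted_label": [1, 1],
})

# Written to a temporary directory here so the sketch has no side effects.
out_path = os.path.join(tempfile.mkdtemp(), "best_model_predictions_tech.csv")
predictions.to_csv(out_path, index=False)

# Reading the file back confirms the layout.
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Passing `index=False` keeps the DataFrame's row index out of the file, so the CSV holds exactly the three columns listed above.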

🚀 Future Improvements

  • Incorporate deep learning models (e.g., BERT) for context-aware classification.
  • Use multi-label classification to allow tweets to be both political and technological.
  • Visualize model performance using confusion matrices and ROC curves.

🧠 Key Takeaways

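  • TF-IDF features gave the best technology results (macro F1 0.5726, micro F1 0.6650), ahead of the sentiment and structural feature sets.
  • Structural cues were the only feature set that separated political tweets well (macro F1 0.7217). TF-IDF and sentiment features both scored a macro F1 of 0.4872 despite a micro F1 of 0.9500, a pattern consistent with predicting the majority class for every test tweet.
  • Because the political micro F1 is high for every feature set, macro F1 is the more informative metric on that imbalanced task.
  • Combining multiple feature types may improve performance further.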
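One possible way to combine feature types is to stack the TF-IDF matrix and the structural counts side by side before fitting the classifier. This is a sketch of the idea rather than anything the notebook is known to do; the sample texts and variable names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "VOTE today!!",
    "New GPU drivers are out",
    "Congress passes the budget bill",
    "Nice weather today",
]

# Sparse TF-IDF block, capped at 1,000 features as in the Methodology section.
tfidf = TfidfVectorizer(max_features=1000)
X_tfidf = tfidf.fit_transform(texts)

# Dense structural block: capital letters, exclamation marks, text length.
X_struct = np.array(
    [[sum(ch.isupper() for ch in t), t.count("!"), len(t)] for t in texts],
    dtype=float,
)

# Column-wise concatenation: one row per tweet, TF-IDF columns then 3 counts.
X_combined = hstack([X_tfidf, csr_matrix(X_struct)]).tocsr()
print(X_combined.shape)
```

Raw counts such as text length sit on a much larger scale than TF-IDF weights, so scaling the structural columns before fitting logistic regression would likely be needed in practice.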
