
xyn-1127/deeplearning-project

Influencer vs Observer Classification

A Kaggle competition project that predicts whether a user is an Influencer or an Observer based on tweet content.

Project Structure

Kaggle2025/
├── train.jsonl              # Training data
├── kaggle_test.jsonl        # Test data
├── baseline.ipynb           # Original baseline notebook
├── requirements.txt         # Dependencies
├── README.md               # Project documentation
│
├── src/                    # Source code
│   ├── data_exploration.py # Data exploration
│   ├── feature_engineering.py # Feature engineering
│   ├── train.py            # Main training script
│   └── models/             # Model implementations
│       ├── gradient_boosting.py  # XGBoost/LightGBM
│       ├── text_classifier.py    # TF-IDF + Traditional ML
│       └── transformer_classifier.py # CamemBERT
│
└── outputs/                # Output files (predictions)

Quick Start

1. Install Dependencies

pip install -r requirements.txt

# Download NLTK data
python -c "import nltk; nltk.download('stopwords')"

2. Data Exploration

cd src
python data_exploration.py
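
As a rough illustration of what an exploration step over a `.jsonl` file can look like, the sketch below reads one JSON object per line and counts records per class. It is not the content of `data_exploration.py`; the helper names and the `label` field are assumptions made for the example.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def label_counts(records, label_key="label"):
    """Count records per class. `label_key` is an assumed field name."""
    return Counter(record.get(label_key) for record in records)
```

From `src/`, a call such as `label_counts(read_jsonl("../train.jsonl"))` would then report the class balance, provided the label field really is named `label`.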

3. Train Models

cd src

# Train all models
python train.py --model all

# Or train a specific model
python train.py --model baseline    # TF-IDF + LR
python train.py --model features    # Numerical features + LR
python train.py --model combined    # TF-IDF + Numerical features + LR
python train.py --model xgboost     # XGBoost (numerical features only)
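
For readers curious how a `--model` switch like this is commonly wired up, here is a minimal `argparse` sketch. It is not taken from `train.py`; the function names and the default value of `all` are assumptions.

```python
import argparse

# Choices mirror the commands listed above.
MODEL_CHOICES = ["all", "baseline", "features", "combined", "xgboost"]


def parse_args(argv=None):
    """Parse the --model flag; defaulting to 'all' is an assumption."""
    parser = argparse.ArgumentParser(
        description="Train Influencer vs Observer models"
    )
    parser.add_argument(
        "--model",
        choices=MODEL_CHOICES,
        default="all",
        help="which model to train",
    )
    return parser.parse_args(argv)


def models_to_run(selected):
    """Expand 'all' into every concrete model name."""
    if selected == "all":
        return [name for name in MODEL_CHOICES if name != "all"]
    return [selected]
```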

4. Train Transformer Model (Optional)

cd src/models
python transformer_classifier.py

Model Overview

Model                Description                                Expected Accuracy
Baseline (Original)  TF-IDF (1000 features) + LR                ~63%
Baseline (Improved)  TF-IDF (5000 features) + LR                ~64-65%
Numerical Features   User/Engagement/Text statistics + LR       ~65-70%
Combined             TF-IDF + Numerical features + LR           ~68-72%
XGBoost              Numerical features + XGBoost               ~68-72%
CamemBERT            Fine-tuned French pre-trained model        ~72-78%
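
To make the "TF-IDF + LR" rows concrete, the following scikit-learn sketch builds such a pipeline. It assumes "LR" stands for logistic regression and that scikit-learn is installed; `build_baseline` is a hypothetical helper, and the four sentences are invented toy inputs, not competition data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_baseline(max_features=5000):
    """TF-IDF features followed by logistic regression (hypothetical helper)."""
    return make_pipeline(
        TfidfVectorizer(max_features=max_features),
        LogisticRegression(max_iter=1000),
    )


# Invented toy inputs: 1 = Influencer, 0 = Observer.
texts = [
    "nouvelle vidéo en ligne, abonnez-vous",
    "merci à tous pour votre soutien",
    "je regarde le match ce soir",
    "il pleut encore aujourd'hui",
]
labels = [1, 1, 0, 0]

model = build_baseline()
model.fit(texts, labels)
predictions = model.predict(["abonnez-vous à la chaîne", "il pleut ce soir"])
```

Passing `max_features=1000` would correspond to the original baseline row, and `5000` to the improved one.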

Key Features

User Features (Most Important!)

  • follower_friend_ratio: Follower-to-following ratio (core indicator of Influencer status)
  • followers_count: Number of followers
  • listed_count: Number of lists the user is added to
  • is_verified: Verification status

Engagement Features

  • retweet_count, favorite_count: Retweets / Likes
  • total_engagement: Total engagement count

Text Features

  • hashtag_count, mention_count: Number of hashtags / mentions
  • text_length, word_count: Text length and word count
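
The features listed above could be derived from a single tweet record roughly as follows. This is a sketch rather than the code in `feature_engineering.py`: the input layout (a nested `user` object with `friends_count` and `verified`, plus a top-level `text`) and the `+ 1` in the ratio's denominator are assumptions.

```python
def extract_features(tweet):
    """Map one tweet record (a dict) to the numerical features named above."""
    user = tweet.get("user", {})  # assumed: user fields are nested under "user"
    followers = user.get("followers_count", 0)
    friends = user.get("friends_count", 0)  # assumed field name
    retweets = tweet.get("retweet_count", 0)
    favorites = tweet.get("favorite_count", 0)
    text = tweet.get("text", "")  # assumed field name
    tokens = text.split()
    return {
        # +1 avoids division by zero for accounts that follow nobody (assumption)
        "follower_friend_ratio": followers / (friends + 1),
        "followers_count": followers,
        "listed_count": user.get("listed_count", 0),
        "is_verified": int(bool(user.get("verified", False))),
        "retweet_count": retweets,
        "favorite_count": favorites,
        "total_engagement": retweets + favorites,
        "hashtag_count": sum(token.startswith("#") for token in tokens),
        "mention_count": sum(token.startswith("@") for token in tokens),
        "text_length": len(text),
        "word_count": len(tokens),
    }
```

Counting hashtags and mentions by token prefix is deliberately simple; a tokenizer aware of punctuation would give more accurate counts.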

Submission Format

CSV file with two columns:

  • ID: challenge_id
  • Prediction: 0 (Observer) or 1 (Influencer)
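
A file in this format can be written with the standard `csv` module. The sketch below is one possible way to do it; `write_submission` is a hypothetical helper, not a function from this repository.

```python
import csv


def write_submission(path, challenge_ids, predictions):
    """Write a two-column CSV: ID (challenge_id) and Prediction (0 or 1)."""
    if len(challenge_ids) != len(predictions):
        raise ValueError("challenge_ids and predictions differ in length")
    if any(pred not in (0, 1) for pred in predictions):
        raise ValueError("predictions must be 0 (Observer) or 1 (Influencer)")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ID", "Prediction"])
        for challenge_id, pred in zip(challenge_ids, predictions):
            writer.writerow([challenge_id, int(pred)])
```

Validating before the file is opened means a bad input does not leave a half-written submission behind.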
