A Kaggle competition project that predicts whether a user is an Influencer or an Observer based on tweet content.
Kaggle2025/
├── train.jsonl # Training data
├── kaggle_test.jsonl # Test data
├── baseline.ipynb # Original baseline notebook
├── requirements.txt # Dependencies
├── README.md # Project documentation
│
├── src/ # Source code
│ ├── data_exploration.py # Data exploration
│ ├── feature_engineering.py # Feature engineering
│ ├── train.py # Main training script
│ └── models/ # Model implementations
│     ├── gradient_boosting.py # XGBoost/LightGBM
│     ├── text_classifier.py # TF-IDF + Traditional ML
│     └── transformer_classifier.py # CamemBERT
│
└── outputs/ # Output files (predictions)
pip install -r requirements.txt
# Download NLTK data
python -c "import nltk; nltk.download('stopwords')"

cd src
python data_exploration.py

cd src
# Train all models
python train.py --model all
# Or train a specific model
python train.py --model baseline # TF-IDF + LR
python train.py --model features # Numerical features + LR
python train.py --model combined # TF-IDF + Numerical features + LR
python train.py --model xgboost # XGBoost (numerical features only)

cd src/models
python transformer_classifier.py

| Model | Description | Expected Accuracy |
|---|---|---|
| Baseline (Original) | TF-IDF (1000 features) + LR | ~63% |
| Baseline (Improved) | TF-IDF (5000 features) + LR | ~64-65% |
| Numerical Features | User/Engagement/Text statistics + LR | ~65-70% |
| Combined | TF-IDF + Numerical features + LR | ~68-72% |
| XGBoost | Numerical features + XGBoost | ~68-72% |
| CamemBERT | Fine-tuned French pre-trained model | ~72-78% |
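
As a rough illustration of the "TF-IDF (5000 features) + LR" row, the following is a minimal sketch using scikit-learn. It is not the project's `train.py`; the toy texts and labels are invented for illustration only, and `max_iter=1000` is an assumed setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples for illustration only; these are NOT taken from train.jsonl.
texts = [
    "Nouvelle vidéo en ligne, merci pour votre soutien",
    "je regarde le match ce soir",
    "Annonce officielle de notre partenariat",
    "trop fatigué aujourd'hui",
]
labels = [1, 0, 1, 0]  # 1 = Influencer, 0 = Observer

# TF-IDF capped at 5000 features, followed by logistic regression.
model = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000),  # assumed value, raised to help convergence
)
model.fit(texts, labels)

predictions = model.predict(["Merci à tous pour le soutien"])
```

With the real data, `texts` and `labels` would come from `train.jsonl`, and accuracy should be measured on a held-out split rather than on the training set.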
- follower_friend_ratio: Follower-to-following ratio (core indicator of Influencer status)
- followers_count: Number of followers
- listed_count: Number of lists the user is added to
- is_verified: Verification status

- retweet_count, favorite_count: Retweets / Likes
- total_engagement: Total engagement count

- hashtag_count, mention_count: Number of hashtags / mentions
- text_length, word_count: Text length and word count
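
The features above could be computed along the following lines. This is a sketch under stated assumptions, not the contents of `feature_engineering.py`: it assumes each record follows the classic Twitter API layout (a nested `user` object with `friends_count` and `verified`), that `total_engagement` is retweets plus likes, and that the ratio adds 1 to the denominator to avoid division by zero. The helper name `extract_features` is hypothetical.

```python
import re


def extract_features(tweet: dict) -> dict:
    """Compute the numerical features for one tweet record.

    Assumption: field names follow the classic Twitter API layout.
    """
    user = tweet.get("user", {})
    followers = user.get("followers_count", 0)
    friends = user.get("friends_count", 0)  # accounts the user follows
    text = tweet.get("text", "")
    retweets = tweet.get("retweet_count", 0)
    favorites = tweet.get("favorite_count", 0)
    return {
        "followers_count": followers,
        # Assumed smoothing: +1 avoids division by zero for users following nobody.
        "follower_friend_ratio": followers / (friends + 1),
        "listed_count": user.get("listed_count", 0),
        "is_verified": int(bool(user.get("verified", False))),
        "retweet_count": retweets,
        "favorite_count": favorites,
        # Assumed definition: retweets plus likes.
        "total_engagement": retweets + favorites,
        "hashtag_count": len(re.findall(r"#\w+", text)),
        "mention_count": len(re.findall(r"@\w+", text)),
        "text_length": len(text),
        "word_count": len(text.split()),
    }


# Made-up record for illustration only.
example = {
    "text": "Bonjour @ami, voici #python et #kaggle",
    "retweet_count": 3,
    "favorite_count": 7,
    "user": {"followers_count": 199, "friends_count": 99,
             "listed_count": 2, "verified": False},
}
features = extract_features(example)
```

If the actual records in `train.jsonl` use different keys, only the lookups at the top of the function need to change.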
CSV file with two columns:
- ID: challenge_id
- Prediction: 0 (Observer) or 1 (Influencer)
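
A submission in this format can be written with the standard `csv` module. The sketch below is illustrative: the helper `write_submission` and the sample IDs are hypothetical, and only the two column names come from the description above.

```python
import csv
import io


def write_submission(rows, fileobj):
    """Write (challenge_id, label) pairs as a two-column CSV with a header."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["ID", "Prediction"])
    for challenge_id, label in rows:
        if label not in (0, 1):
            raise ValueError(f"Prediction must be 0 or 1, got {label!r}")
        writer.writerow([challenge_id, label])


# Made-up IDs for illustration; an in-memory buffer stands in for a real file.
buffer = io.StringIO()
write_submission([(101, 1), (102, 0)], buffer)
submission_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")`, for example a file under `outputs/`.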