Classifies music into 10 genres using audio features extracted with librosa and an XGBoost classifier trained on the GTZAN dataset.
Supported Genres: Blues, Classical, Country, Disco, Hip-Hop, Jazz, Metal, Pop, Reggae, Rock
git clone https://github.com/<your-username>/tracktype.git
cd tracktype
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt

The pre-trained model is included in the repo, so you can start using it right away:
Web app (upload any song and see the prediction):
streamlit run app/app.py

CLI (predict from the terminal):
python -m src.predict path/to/song.mp3

========================================
Predicted Genre : jazz
Confidence : 87.3%
========================================
Top predictions:
1. jazz 87.3%
2. blues 6.2%
3. classical 3.1%
If you want to reproduce the training or experiment with the model yourself:
# download the GTZAN dataset from Kaggle (requires a free Kaggle account)
python -m src.download_data
# train the model (~91% accuracy, saves artifacts to models/)
python -m src.train

The download script also places one sample audio file per genre into data/sample_audio/ for quick testing.
tracktype/
├── app/
│ └── app.py # Streamlit web interface
├── data/ # Dataset CSVs and sample audio (auto-downloaded)
├── models/ # Pre-trained model, scaler, label encoder
├── src/
│ ├── config.py # Paths, feature names, constants
│ ├── download_data.py # Downloads GTZAN dataset from Kaggle
│ ├── feature_extractor.py # Extracts audio features with librosa
│ ├── model.py # Loads model and runs inference
│ ├── predict.py # CLI prediction tool
│ └── train.py # Training script
├── requirements.txt
└── README.md
The audio file gets split into 3-second segments to match the training data. For each segment, librosa extracts 57 features (chroma, spectral centroid, bandwidth, rolloff, zero-crossing rate, MFCCs, etc.). These features are scaled with MinMaxScaler and fed to an XGBoost classifier. For files with multiple segments, the final genre is decided by majority vote across all segments.
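For a feel of what that extraction step involves, here is a rough librosa sketch. It is illustrative only: it computes the mean and variance of a representative subset of features for one 3-second segment, not the exact 57-feature list or column order produced by src/feature_extractor.py, and the file path is just a placeholder.

```python
import numpy as np
import librosa

def extract_segment_features(y, sr):
    """Mean/variance summaries of common librosa features for one segment.

    Illustrative subset only -- the real extractor emits 57 features in a
    fixed order matching the GTZAN CSV columns.
    """
    feats = {}

    def mean_var(name, values):
        feats[f"{name}_mean"] = float(np.mean(values))
        feats[f"{name}_var"] = float(np.var(values))

    mean_var("chroma_stft", librosa.feature.chroma_stft(y=y, sr=sr))
    mean_var("rms", librosa.feature.rms(y=y))
    mean_var("spectral_centroid", librosa.feature.spectral_centroid(y=y, sr=sr))
    mean_var("spectral_bandwidth", librosa.feature.spectral_bandwidth(y=y, sr=sr))
    mean_var("rolloff", librosa.feature.spectral_rolloff(y=y, sr=sr))
    mean_var("zero_crossing_rate", librosa.feature.zero_crossing_rate(y))

    # 20 MFCCs, each summarized by mean and variance
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    for i, coeff in enumerate(mfccs, start=1):
        mean_var(f"mfcc{i}", coeff)

    return feats

# Summarize the first 3-second window of a file
y, sr = librosa.load("path/to/song.mp3", sr=22050)
features = extract_segment_features(y[: 3 * sr], sr)
```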
The model was trained on the GTZAN features_3_sec.csv with a 70/30 split and reaches ~91% accuracy.
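For reference, a minimal sketch of that training setup is shown below. It assumes the standard Kaggle layout of features_3_sec.csv (a filename column, a length column, 57 feature columns, and a label column) and a data/ path chosen for illustration; the actual src/train.py also persists the fitted scaler and label encoder to models/.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Load the 3-second feature CSV and separate features from labels
df = pd.read_csv("data/features_3_sec.csv")
X = df.drop(columns=["filename", "length", "label"])
y = LabelEncoder().fit_transform(df["label"])

# Scale features to [0, 1] and hold out 30% for evaluation
X_scaled = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

# Train an XGBoost classifier and report held-out accuracy
model = XGBClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```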
The GTZAN dataset comes with two CSVs: features_30_sec.csv (~1,000 rows, one per full clip) and features_3_sec.csv (~10,000 rows, each clip split into 3-second chunks). I initially trained on the 30-sec version and got ~71% accuracy. Switching to the 3-sec version bumped it to ~91% simply because the model gets 10x more training examples from the same data.
That choice created a mismatch problem at inference time though. Users upload full-length songs, not 3-second clips. If you extract features from a whole song and feed that to a model trained on 3-second windows, the feature distributions don't match and predictions are unreliable. So the prediction pipeline splits the input audio into the same 3-second segments, classifies each one independently, and uses majority voting to pick the final genre. This keeps the inference distribution aligned with what the model saw during training.
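A simplified sketch of that inference loop, reusing the hypothetical extract_segment_features helper from the extraction sketch above and assuming an already-loaded model, scaler, and label encoder (the real logic lives in src/model.py and src/predict.py):

```python
import numpy as np
import librosa
from collections import Counter

def predict_genre(path, model, scaler, label_encoder, segment_seconds=3, sr=22050):
    """Split a full-length file into 3-second segments, classify each one,
    and pick the final genre by majority vote across segments."""
    y, _ = librosa.load(path, sr=sr)
    samples_per_segment = segment_seconds * sr

    votes = []
    for start in range(0, len(y) - samples_per_segment + 1, samples_per_segment):
        segment = y[start : start + samples_per_segment]
        feats = extract_segment_features(segment, sr)  # see extraction sketch above
        X = scaler.transform(np.array(list(feats.values())).reshape(1, -1))
        votes.append(label_encoder.inverse_transform(model.predict(X))[0])

    # The most common per-segment prediction wins
    return Counter(votes).most_common(1)[0][0]
```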