##### Overview for Final Capstone Project
##### Genre Classification for Music Library
##### Stu Alden, Thinkful Data Science Flex

This project is large enough that, unlike the previous Capstones, it seemed best to distribute it over several notebooks.  Here they are, in "pipeline" order (which is also the order recommended for review):

*  **overview.ipynb** - This notebook.
*  **feature_\*.ipynb** (one for each of six genres - blues, country, folk, jazz, new_age, and world) - Feature extraction from the digital music (MP3) files.
    * These use the `librosa` library of audio tools to extract statistics from the waveforms.
    * They are separated by genre to facilitate a "poor man's parallelization" (running simultaneous notebooks on an 8-core processor) to speed up the run-times.
    * The first genre (alphabetically) is **blues**, and that notebook includes some helpful visualizations on a single test track of music.  The remaining genre notebooks contain only the feature extraction code.
* **ead_and_feature_selection.ipynb** - Basic feature analysis, including some supervised learning to discern feature importance, and PCA for potential dimensionality reduction.  The final step here is to write out files to be used as input to the supervised and unsupervised learning notebooks.
* **supervised.ipynb** - Supervised algorithms applied to the features and tuned to maximize predictive performance.
* **unsupervised.ipynb** - Clustering algorithms applied to the features to evaluate whether clusters can recover the existing genres and perhaps point to genres not yet identified.
* **summary.ipynb** - Conclusion, with summary of findings, obstacles, lessons learned, and suggested next steps.
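
As a rough illustration of the feature extraction step, the sketch below reduces frame-level `librosa` features to per-track summary statistics. It is not the code from the feature notebooks: the helper names `summarize_frames` and `extract_track_features`, the particular features chosen, and the `n_mfcc=13` setting are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def summarize_frames(name, frames):
    """Collapse one frame-level feature series into per-track statistics."""
    values = [float(v) for v in frames]
    return {f"{name}_mean": fmean(values), f"{name}_std": pstdev(values)}


def extract_track_features(path, n_mfcc=13):
    """Return a flat dict of summary statistics for one audio file.

    The feature list is illustrative only, not the project's actual list.
    """
    import librosa  # imported lazily so summarize_frames works without it

    # sr=None keeps the file's native sample rate; mono=True mixes down to one channel
    y, sr = librosa.load(path, sr=None, mono=True)

    features = {}
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
    features.update(summarize_frames("spectral_centroid", centroid))

    zcr = librosa.feature.zero_crossing_rate(y)[0]
    features.update(summarize_frames("zero_crossing_rate", zcr))

    # mfcc has shape (n_mfcc, n_frames); summarize each coefficient separately
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    for i, row in enumerate(mfcc):
        features.update(summarize_frames(f"mfcc_{i}", row))
    return features
```

Each track then becomes one row of fixed-length statistics, whatever its duration, which is the shape the later analysis and modeling notebooks need.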

Please note:
* Because I've used `hvplot` in places, these notebooks may not show properly in Google Colab.  Please use nbviewer or native Jupyter Notebook instead if available.
* For the presentation, I anticipate showing a subset of cells from all notebooks in the order listed above, subject to any recommendations from the reviewer.
* Where useful, I'll be using the RISE notebook extension to show the cells as slides.

### Genre Classification for Music Library
#### Thinkful Data Science program - Final Capstone
#### Stu Alden
#### May 12, 2021

**Note:**  This notebook likely will not be properly viewable in Google Colab - please use nbviewer (https://nbviewer.jupyter.org/) or native Jupyter Notebook instead. To view select cells as presentation slides, please use the `RISE` notebook extension.

### Background
* I have a very large music library, built from ripping files off of my CDs and converting the files to MP3 format to save on storage space.  
* The files are organized at the top level by "genre," which has proved a more challenging concept than I originally expected - for example, many artists straddle multiple genres.
* Genre seems to be a function not only of instrumentation, but tempo, rhythm, volume and other factors.

* I wanted to see if the genre of a particular musical track could be deduced from a close examination of the sound waveform.  

* Besides offering an interesting data science challenge, such a tool could help validate the existing genre tagging in my library.

* It would also be valuable for suggesting a genre for new music, and possibly even for suggesting new genres.
* To keep the project more manageable, I have avoided the rock and classical genres for the time being.


* The goal of this project was to make use of the concepts we've learned in the Thinkful course, including:

    * Data gathering and manipulation
    * Exploratory data analysis and feature selection
    * Supervised learning
    * Unsupervised learning

* Also, although this project wouldn't meet a modern definition of "big data," I deliberately chose a dataset large enough to "stress" the computational power (speed and memory) I currently have available.  I address some of the resulting findings in the project summary.

In addition to the Thinkful course, I picked up a number of ideas from the following project:

"Music Genre Classification Using Supervised and Unsupervised Learning" by William Easterby, May 13, 2020

https://medium.com/@weasterby/music-genre-classification-using-supervised-and-unsupervised-learning-cf1f0837d725

This article was particularly helpful in introducing me to the capabilities of the `librosa` audio analysis library and in suggesting code to help with visualizations.

The article also introduced me to some other measures of clustering success (besides silhouette coefficients).  My project goes further, particularly in terms of

* The size and breadth of the data;
* The extent of EDA performed on the features;
* The degree of tuning of the machine learning algorithms (my scores are better); and
* The manner in which the clustering is evaluated and visualized.