# ***Artistic Intent Analysis and Visualization Tool***

How often have you listened to a new album and thought to yourself partway through, "Is this still the same song?", only to realize that several songs have already gone by?  Or perhaps you find that the songs on an album are enjoyable when played separately, but the album as a whole feels rather flat.  The experiential "flow" of an album can be by design or by accident.  

Major music streaming distributors (Spotify, and I assume others) have created song-level analytics that help recommend songs to listeners based on acoustic features.  By comparison, I have found little information about ***album-level*** "feature flow" analysis.  Music producers, artists, and mixing engineers sometimes do not pay enough attention to the holistic listening experience of each album release.  One reason is that they lack easy-to-use tools to analyze and visualize acoustic "feature flow" for albums.  It is also difficult to compare the acoustic flow of one album with another album or group of albums.  With better tools for album-level analytics, producers and artists could more easily design and deliver desired album-level flows.  Better tools are needed to fill this void. 

## Project Goals
To fill this void in album analytics tools, this project created a software application and tools to help people visualize and analyze acoustic features for individual albums and collections of albums.  The resulting software provides visual insights based on quantitative song metrics, and makes it possible to remodel acoustic features to alter listener experiences.  Users can visually see and compare trends in album acoustic feature flows and identify anomalies that impact listener experiences.  

The target users for the software are collectively called "creatives", which means people involved in the creation of song sequences on albums, playlists, or other releases of song collections.  

The software includes an integrated database of song-level acoustic metrics data that were originally pulled from the Spotify API.  To accelerate the project, an existing database of song metrics was found and used; the database has Spotify acoustic feature metrics for the popular Billboard Top 200 album lists going back 50+ years.  The original database needed to be cleaned, refined, and updated to include Discogs album and tracklist information for verification.   

The software provides visualizations, descriptive statistics, and regression analysis.  With these tools and data in hand, it becomes possible to develop more advanced multivariate analytics that use machine learning and GANs.

The intended audience for this tool is a person who is involved with music creation and production for the purpose of consumption and distribution as a collective body (album, playlist, set, etc.). 

In order to have a feeling for what the features might mean for a song, you may need to do some careful listening and examination of different releases of the same song on Spotify.  A good way to investigate this would be to compare collections like Village People Studio songs to their club mixes and single mixes, or Led Zeppelin Albums with multiple remasters.  Other variants to examine are radio mixes and remixes.

**Disclaimer:** Descriptive statistics and individual feature trends of single albums are one by-product of this project, and they were found useful to visualize and gain further insight on the data.

- Written By: Robert Blindt
- Project Start: 11/28/2022
- Updated On: 03/09/2023

# **Terminology**
Some terms are used in this report that may not be universally understood, so they are defined here:
 - 'Creative(s)' - "Person(s) of interest of the report."  Artists, Producers, and Recording/Mixing Engineers are the primary target audience, and others with an interest in album analytics.  
 - 'Artistic Intent' - "The goal of the creative in conveying their work using computer-derived features and statistics of albums."  It is a quantitative profile of a creative's selection of albums to report on. 
 - Assertions of Quality - Music is subjective.  If the words "better" or "worse" are used in this project report, it is not meant to describe anything other than statistical fit or visual flow.  

# **Project & Work Product Description**

**Project Statement:**
   - To develop a visual data analytics application to analyze album-level acoustic features, in order to help people discover and visualize trends and patterns in music, and to better identify or refine guidelines for conveying a creative's intent.
    
   - The sequence of songs on an album is important to its "listenability", so it is important to ensure individual songs are mixed so the album maintains a listener's interest throughout.  While listening to albums or 'curated playlists', e.g., DJ sets, workout playlists, etc., people may subconsciously ***expect*** a particular structure or pattern.  If their expectations are not fulfilled, they may experience an unwanted dissonance.  The visual analytics tools from this project were developed to help people uncover potentially unwanted dissonances or incongruities that album listeners may experience.  

   - Trends in acoustic features affect a listener's qualitative experience, so analyzing and visualizing these trends offers an important way to see acoustic qualities.  The patterns and trends found using the software may be used to tweak mixes or production choices to help a creative achieve their goals in expression.   
    
   - Spotify analyzes each song in their database and creates a simplified set of 7 acoustic metrics and 3 other experiential variables.  These metrics provide only a single data point measure per song.  Using a single value to measure an acoustic feature for an entire song is not ideal, but this is how Spotify quantifies music today.  This project provides tools for creatives to quantify, visualize, and report on trends and statistics for the "feature flows" across one or more albums.  The purpose and utility is to improve a creative's ability to convey their artistic intent for an entire album. 
   
**Main Development Goals:** 
  - Strengthen skills in data analytics software design, development, debugging, and delivery 
  - Develop tools to manage and explore large amounts of data through visualizations and statistical analysis
  - Develop a solid understanding and skills working with REST APIs to build data analytics tools
  - Refactor and refine the "components.one" database to be more scalable and easier to maintain.  
  - Provide a database that can be used for advanced analytics using GANs and other AI technologies.
  - Develop better software project management skills
  
**Main Deliverables:** 
  - An end-to-end music analytics software application for people to interactively visualize and analyze acoustic features of albums, and generate PDF reports to save the results of their analyses.  
  - Ancillary software tools and modules needed to accomplish the project goals.

## Project Overview Flow Chart:

![Overall_project_flow.png](attachment:8ebf0444-070a-43e8-9989-7654ac50dfcd.png)

## Description of Solution:
The software was built using popular Python data science packages (Pandas, NumPy, Plotly, and others).  It uses data from the Spotify API and Discogs API to display and report on Spotify acoustic feature album statistics.  To save time, I started with an existing database from "components.one", which contains Spotify features for the Billboard Top 200 Album sets from 1963 to 2019.  Unfortunately, it contained some inaccurate data, such as incorrect track lists attached to album names.  I redesigned the database to make it more scalable, and cleaned the data using the Discogs API to source the correct track lists for albums.  All data is stored and managed -- using SQL through Python -- in a local SQLite3 database.  Streamlit was used to create an interactive front-end with functions to select, analyze, and visualize data from the album analytics database.  The Python package "FPDF2" is integrated into the application to permit users to save final reports from iterative analysis in PDF format with a unique name for each analysis.  Users can iteratively select, analyze, and visualize album acoustic features in a very short period of time.  
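
As a rough illustration of "SQL through Python", the sketch below stores a few track rows and reads them back in album order using Python's built-in `sqlite3` module.  The table and column names (`albums`, `tracks`, `energy`, and so on) and the sample rows are hypothetical placeholders, not the project's actual schema, and an in-memory database is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical schema for illustration only; the project's real tables differ.
SCHEMA = """
CREATE TABLE albums (
    album_id INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    artist   TEXT NOT NULL
);
CREATE TABLE tracks (
    track_id     INTEGER PRIMARY KEY,
    album_id     INTEGER NOT NULL REFERENCES albums(album_id),
    track_number INTEGER NOT NULL,
    duration_ms  INTEGER NOT NULL,
    energy       REAL
);
"""


def album_feature_rows(conn, title):
    """Return (track_number, duration_ms, energy) rows in album order."""
    query = """
        SELECT t.track_number, t.duration_ms, t.energy
        FROM tracks AS t
        JOIN albums AS a ON a.album_id = t.album_id
        WHERE a.title = ?
        ORDER BY t.track_number
    """
    return conn.execute(query, (title,)).fetchall()


conn = sqlite3.connect(":memory:")  # the app itself uses a local database file
conn.executescript(SCHEMA)
conn.execute("INSERT INTO albums VALUES (1, 'Example Album', 'Example Artist')")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 180000, 0.2), (2, 1, 2, 240000, 0.8), (3, 1, 3, 180000, 0.5)],
)
rows = album_feature_rows(conn, "Example Album")
```

Passing the album title as a `?` parameter rather than formatting it into the SQL string avoids quoting problems with album names that contain apostrophes.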

### High level solution design block diagrams:
There were two substantial outputs of this project:  
- Streamlit app that creates a PDF report  
- Redesigned "components.one" database.  

The redesigned database provided additional capabilities which made it easier to expand and more logical to use.

#### Final app work flow:
![Final_app_workflow.png](attachment:09d9b2b7-f551-4971-829b-2bce8b72407e.png) 

The app is designed in a way that it should be very easy to use and understand, even without understanding statistics.  Good and bad statistical fit is represented by Red = bad, Yellow = almost good, and Green = Good, and by default the visualizations only show the album lines with 'good statistical fit'.  There are options in the Streamlit side bar to add lines that do not have a good regression fit, as well as an option to switch to an 'auto-scale' option for the comparison graphs.

#### Database Design:

![album_analytics_database_design_final.png](attachment:d94ba432-1e9b-4036-819b-918410b7e963.png) 

Due to issues in the original database, I had to find a way to auto-verify that the album in the Spotify feature tables was in fact the correct album and artist.  I did so by using the Discogs API.  

Discogs is a multi-vendor marketplace (like Etsy or Amazon without the warehouses) for music that provides a database of information about audio recordings, including commercial releases, promotional releases, and bootleg or off-label releases from the world-wide music community.  Due to albums having multiple release formats, countries, and appearances, their database was an ideal place to query. 

In order to compare albums of different lengths, both by number of songs and total run time, I normalized each album to a scale of 'time completeness' (on scale of 0 to 100 percent).  This is explained in depth in the "Solution Code Description" section below.

## Solution Design (high-level): 
**A) Exploratory Data Analysis and Visualization Tools using Streamlit**:
  1) Created app to display the acoustic features from the Spotify API
      - Enter Album name
      - Access Spotify API
      - Produce visualizations of album(s) analytics (using charts, graphs, and tables) 
  2) Created app to display sets of album data of the 'Billboard Top 200' database
      - Select timespan or artist
      - Access SQL Database
      - Create Visualization

**B) Data Management, Data Cleaning, and Database Redesign**:

  3) Create a new database structure (data model) and access tools that enable better data maintenance and easier expansion
  4) Capture data from the Discogs API to verify existing Spotify album data

**C) Generate Final Analytics Reports**:
  
  5) Create a PDF-formatted report from the analysis and visualization outputs
      - Enter Report "name" to identify analysis, along with a list of albums covered in the report
        - Maximum quantity for ease of viewing in Streamlit is 6 albums
      - Access database and do statistical analysis 
      - Create visualizations with options       
      - Saves a PDF report of the selected information



## Solution Code Description: 
Throughout this project I used Python code to interact with the Spotify API and the Discogs API to obtain and clean my data.  An SQLite3 database was used to store and manage data.  Finally, Streamlit, Pandas, Matplotlib, and Plotly were used to interactively select, visualize, and save data.  The Python package "FPDF2" was used to generate final reports.

### Main App - Modular Architecture:

The Main Module is named `regression_report_app.py` and is implemented as a Streamlit app.py file.

The Main module `regression_report_app.py` calls functions within the following application support modules:

- `database_query_module.py` - functions used to query the database and do word processing to output the album/artist data
- `prepare_and_quantify.py` - functions that do time-normalization and calculate statistics
- `visualizations.py` - functions that create different graphs and tables from the prepared data
- `fpdf2_report_class.py` - A custom FPDF class to create the PDF report

### Supporting Programs
**Exploratory Data Analysis**
- `initial_spotify_api_exploration.py` - Initial test app to get Spotify feature data directly from their API.  Can be used to gain a feeling for what affects individual song feature intensity. 
- `top200_timespan_features_singlepull.py` - Generates 3D graphs for each averaged feature for a user-selected number of entries of the top 200 entries over a time period.  This is used to gain a better understanding of the source database and how to treat the data.
- `search_by_artists.py` - Generates a visualization of the averages for individual features of bands' discographies.  

**Redesign and Data Cleaning files** \
*You should not need to use these files, but I am documenting them for the sake of completeness.  I was dealing with power blackouts, and issues with revision control during this time so there may be some gaps or issues within these files.*
- `functions_for_database_creation.py` - Start to create the new database format
- `transfer_data_tonew_database.py` - Transfer data from old database to new database. Do Discogs verification.

*There are a few other small supporting python files that contain lists that would take a long time to create from querying, and basic underlying commands needed to interact with SQL and the APIs.*

## **Exploratory Data Analysis:**

I started this project thinking that if I treated the Spotify feature data like a digitized photo of album acoustic feature sets (like a 6-channel time normalized 2000-point interpolated representation of an album), then I could put the image data into a neural network to generate 'target lines' for song sequences.  However, I didn't know how to prepare or analyze the data for neural network analysis, so I started off by examining the data to see if any apparent trends or features stood out.  The three Python files in the Exploratory Data Analysis section are where I did those things.  

When observing the data pulled directly from Spotify, I started looking at data for albums of people I knew, whose albums did not fully enchant me.  The songs were good, but something caused dissonance for me when I listened to them as an album.  As I suspected, the features had little variation, no apparent trend, or coordination with other features over the album.  When I compared that against other albums of similar styles, I noticed a higher variance in some features, stronger trends, and greater coherence between features.  At that time I had not ironed out what interpolation or statistical method I would use, but I had enough information to continue my investigation.

***Album I wasn't enchanted by:***

![vchenzo010199_features.PNG](attachment:7ef220e8-9dfc-40b2-af83-519a3161a955.PNG)

***Album I was more satisfied with:*** 

![mr_wonderful_features.PNG](attachment:ec1a0df1-4e86-4672-8780-6bbee23da416.PNG)

### Selected Findings from Exploratory Data Analysis

#### Finding 1

A very interesting result I got using this application was that segmentation and subtle mixing of music can affect the way the system interprets music.  I saw this when I compared different remastered versions of the same album with almost identical albums that also contained their radio or club mixes.  

In the image below, one can see that the features for the same songs are different from one another.  The specific  features do not matter for the sake of this display, but as I spent more time with the application, I could get a feel for changes in song mixing and segmentation to get a different desired result.

***Whole Lotta Love by Led Zeppelin II Remastered comparison:*** 

![Whole lotta love compare.png](attachment:8b1beb71-209c-42db-b03d-624325da09e0.png)

***21st Century Schizoid Man by King Crimson Studio vs Radio Mix:***

![21stcentry_compare.png](attachment:d9a708bd-6518-4d0e-b4f2-4fd6867eb10f.png)

At this point I realized that if I wanted to use this kind of information for 'album playlists analysis', I needed to find a list of items to start building a database from, or find one that already existed.  Fortunately, I found a database containing the Billboard Top 200 lists and (mostly) corresponding Spotify features from "components.one".  

By visualizing time periods with different numbers of albums and entire band discographies, I saw interesting trends that pointed toward how to treat the data. 

***Monthly Top 5 Album Acousticness Average from 1963 to 2019:*** 

![MonlyAcousticness_5avg.png](attachment:b17466f4-2ed9-4b8a-9983-2a3b51772d79.png)

Some of the features, like Acousticness, seemed to have average shifts over time, but this did not get me closer to finding listening trends within music.  When observing the same data by band discography, I saw a similar output that would be better described by a single value than by a trend or relationship. 

***Beatles Discography:*** 

![Beatles_discograpy.PNG](attachment:838b10d0-da10-4f38-9751-dab7b74c2f2b.PNG)

***Taylor Swift Discography:*** 

![Taylorswift_discograpy.PNG](attachment:7c1c029e-92a6-4aea-9d91-43a2e20862ef.PNG)

***Average Feature Lines Comparing Artists:*** 

![artist_features_compare.PNG](attachment:4a14185a-d329-4825-88d5-a604154251ee.PNG)

Averaging this data was not suitable for spotting the kinds of things that I wanted to see.  However, during early phases of exploratory data analysis I found some dirty data that required me to spend some time cleaning and redesigning the database.  This is common in data science and analytics projects.  

#### **Database Cleaning and Redesign:**

Cleaning the database was a crucial step to ensure that I was using representative data sets.  I found albums like "Waylon + Willie" (The Willie Nelson and Waylon Jennings album from 1978) that had the correct "Top 200" album and artist name, but the Spotify features information was for a completely incorrect album.  In this case it contained a 2018 Hip-Hop/Country-Rap album "Waylon & Willie 2" by Jelly Roll and Struggle Jennings.  By filtering the "Top 200" album names into the Discogs API to retrieve the earliest track list for the album by that name, I could automate verification of albums.  This process also cleaned certain albums that had bonus tracks that were not on the original album release. 

***Original Database Entry Vs. Intended Album:***

![Willie_list.PNG](attachment:09391263-97da-44ea-a386-8fdc95c84eea.PNG) 

***Original Database Entry Vs. Actual Album:***

![Rap_willie_list.PNG](attachment:69097d36-ef78-4cfb-821e-eac642ae012e.PNG) 

The Discogs database contains two different types of identifiers for most albums in order to handle the many album versions that exist.  They have a 'master_id' to track the album as a whole, and a 'release_id' to track each individual release of an album.  For each master and release ID there is a corresponding track list.  The 'master_id' contains the earliest version available.  Master IDs do not exist for all albums, so I had to create some secondary requests to find a possible matching track list.  Because music data is not formally standardized, there were some minor issues with matching song lists and album names.  To address these issues, I developed a data cleaning method and an improved database model to guarantee unique references and verification between albums and track lists.  I captured the data for albums that did not successfully make a connection in a separate set of tables so that I could later explore new data cleaning methods without needing to contact the Discogs API again.  The block diagram below illustrates the data capture, cleaning, and storage process.

***Discogs Album Verification Process:***

![Discogs_API_verification.png](attachment:d7e83600-ac45-4d7b-9418-a124b5f4886f.png)
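
The track list comparison at the heart of this verification can be sketched in plain Python.  This is not the project's actual matching code: the function names, the normalization rules, and the 0.8 threshold are all assumptions chosen for illustration, and the titles are placeholders.

```python
import re


def normalize_title(title):
    """Lowercase a track title and strip version notes and punctuation."""
    title = title.lower()
    title = re.sub(r"\s*[\(\[].*?[\)\]]", "", title)  # "(Radio Edit)", "[Live]"
    title = re.sub(r"\s+-\s+.*$", "", title)          # " - 2011 Remaster"
    title = re.sub(r"[^a-z0-9 ]", "", title)          # remaining punctuation
    return " ".join(title.split())


def tracklist_match_ratio(spotify_titles, discogs_titles):
    """Fraction of Discogs reference titles also found in the Spotify list."""
    reference = {normalize_title(t) for t in discogs_titles}
    candidate = {normalize_title(t) for t in spotify_titles}
    if not reference:
        return 0.0
    return len(reference & candidate) / len(reference)


def is_verified(spotify_titles, discogs_titles, threshold=0.8):
    """Treat the album as verified when enough reference titles match."""
    return tracklist_match_ratio(spotify_titles, discogs_titles) >= threshold


spotify = ["Song One - 2011 Remaster", "Song Two (Radio Edit)", "Song Three"]
discogs = ["Song One", "Song Two", "Song Three"]
verified = is_verified(spotify, discogs)
```

The normalization here is deliberately aggressive: it would also truncate a title that legitimately contains " - ", and it drops non-ASCII characters, so a real cleaning pass would need gentler rules for those cases.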


### **Final Version of Album Analytics Software Application:**

The final app design is relatively simple.  It has a main page with a single form for data entry fields, and a "Submit" button.  After filling out the form and hitting "Submit", the application retrieves and analyzes the selected data and generates graphs and tables of statistics.  At the bottom of the page there is an option to name and save a PDF report of the results.  There is a text box to enter the PDF name, and a button to output the PDF report.  There is a side bar for fast scrolling and visualization adjustments as well.  Using these functions it is easy to rapidly select albums from the database and generate sophisticated visualizations for album-level acoustic features.

Unfortunately, due to current incompatibilities between Streamlit and Plotly, interactive adjustments made to a Plotly chart in Streamlit are not rendered in the final report.  As a workaround, a set of defaults must be selected for the PDF report.  The default settings show only the interpolated album line and the regression line of the lowest degree that achieves a good r-squared fit (the cutoff threshold is an r-squared above 0.7). 
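
The "lowest degree with a good fit" rule can be sketched as follows, assuming the regression is an ordinary polynomial fit (the text does not confirm the exact method).  The function names and the `max_degree` cap are hypothetical; only the 0.7 threshold comes from the description above.

```python
import numpy as np

R2_THRESHOLD = 0.7  # cutoff for a "good" fit, as described above


def r_squared(y, y_hat):
    """Coefficient of determination for fitted values y_hat."""
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    if ss_tot == 0:
        return 0.0  # a perfectly flat line has no variance to explain
    ss_res = np.sum((y - y_hat) ** 2)
    return 1.0 - ss_res / ss_tot


def lowest_good_degree(x, y, max_degree=5, threshold=R2_THRESHOLD):
    """Return (degree, r2) for the first polynomial degree above the threshold."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, degree)
        r2 = r_squared(y, np.polyval(coeffs, x))
        if r2 > threshold:
            return degree, r2
    return None, None  # no degree up to max_degree fits well enough


x = np.linspace(0, 100, 50)               # percent of album completed
degree, r2 = lowest_good_degree(x, 0.005 * x + 0.2)
```

Stopping at the first degree that clears the threshold favors the simplest curve that describes the album, since higher degrees will almost always fit at least as well.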


#### ***Front Page When Application Is First Launched:***

![blank_final_app_page.PNG](attachment:21929d65-ac45-43b7-b7df-09cd015c1b1e.PNG)

#### ***Heatmaps Displayed:***

![Albums_heatmap.PNG](attachment:359d1853-9bf5-4102-880a-5ce050124a0c.PNG)

These heatmaps are full-album feature graphs.  The line graphs I used in the earlier apps were very hard to read, and they became much harder to read when I switched from linear point-to-point interpolation (a gradient shift from point to point) to 'previous point' interpolation (the value stays the same until the next data point is reached).  Because I cared more about the regression fit of individual features, which is handled by the following set of graphs, heatmaps were an adequate way to display the full album data.  

#### ***Interpolated Feature Comparison:***

![R-Squared_table.PNG](attachment:ee9eba6d-dd23-410f-97fa-8399e14e03c8.PNG)

#### ***PDF Output:***

![pdf_front_page_image.PNG](attachment:fe8de5cf-2857-4fea-a11e-1d24105482ac.PNG)

Albums can be compared because each album is "time-normalized" to its 'relative completeness' (or "percentage done").  Time-normalization is done by taking the length of each song and creating dividing points on a theoretical timeline where songs end, and then dividing by the total runtime for all songs.  This gives useful positional references for where songs end relative to the completed album.  Below is a simple example.

![Basic_time_normalization.png](attachment:0036d056-9d0e-4147-bd77-e77a355d4511.png) 
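
The same calculation can be written in a few lines of NumPy.  This is a minimal sketch: the function name `song_end_positions` is hypothetical and the durations are made-up values in seconds.

```python
import numpy as np


def song_end_positions(durations):
    """Return each song's end point as a percentage of total album runtime."""
    cumulative = np.cumsum(np.asarray(durations, dtype=float))
    # Divide by the final cumulative value so the last song ends at exactly 100.
    return cumulative / cumulative[-1] * 100.0


ends = song_end_positions([180, 240, 180])  # songs end at 30, 70, and 100 percent
```

Because every album ends at 100 percent regardless of its runtime or number of songs, feature values from different albums can be placed on the same horizontal axis.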

I used the 'previous' interpolation method to interpolate my data using the function `interp1d()`.  Due to the song data being average values *per song*, no assumptions about the intra-song feature trends can be made.
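
A minimal sketch of 'previous' interpolation with SciPy's `interp1d` is shown below.  The function name `interpolate_album` and the sample values are hypothetical; the 2000-point grid follows the interpolation size mentioned elsewhere in this report.

```python
import numpy as np
from scipy.interpolate import interp1d


def interpolate_album(durations, feature_values, n_points=2000):
    """Hold each song's feature value constant across its share of the album."""
    cumulative = np.cumsum(np.asarray(durations, dtype=float))
    ends = cumulative / cumulative[-1] * 100.0   # song end points in percent
    values = np.asarray(feature_values, dtype=float)

    # Each value starts where the previous song ended; the last value is
    # repeated at 100% so the final grid point stays inside the data range.
    x = np.concatenate(([0.0], ends))
    y = np.concatenate((values, values[-1:]))

    step = interp1d(x, y, kind="previous")
    grid = np.linspace(0.0, 100.0, n_points)
    return grid, step(grid)


grid, flow = interpolate_album([180, 240, 180], [0.2, 0.8, 0.5])
```

The result is a step function: the feature value changes only at song boundaries, which matches the fact that nothing is known about how a feature moves within a song.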

For the albums without good regression fits, the descriptive statistics are output to the report.  The descriptive statistics are not very actionable, because they cannot be easily plotted against the interpolated line for each album, but they can still be used as a guide.

The descriptive statistics describe:

| Terminology | Definition |
| :--- | :--- |
| **Average** | Mean value of the time normalized feature. On a scale of 0 to 1, it is a general intensity or 'probability of being'. 0 - Not prevalent or likely, 1 - Most prevalent or likely |
| **Variance** | The total accumulation of differences of points from the mean (multiplied by 2000 for the sake of interpolation) |
| **Skew** | A coefficient value judging whether the majority of the data points are higher or lower than the mean and median values of the album. Positive - Larger quantity of data points towards the lower end of the extreme (Left Leaning), Negative - Large Quantity of data points towards the upper end of the extremes (Right Leaning). |
| **Kurtosis** | A coefficient value expressing how much of the data sits in the 'tails' of the distribution compared with a normal distribution. Positive - Heavier tails and a sharper peak (more extreme values far from the mean), Negative - Lighter tails and a flatter distribution (fewer extreme values). |
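
These four statistics can be computed with Pandas, as in the sketch below.  The function name `describe_feature` is hypothetical, the sample values are made up, and the scaling by 2000 mentioned in the table is not applied here.

```python
import pandas as pd


def describe_feature(flow):
    """Summarize one time-normalized feature line with four statistics."""
    series = pd.Series(flow, dtype=float)
    return {
        "average": series.mean(),
        "variance": series.var(),    # sample variance (divides by n - 1)
        "skew": series.skew(),
        "kurtosis": series.kurt(),   # excess kurtosis: 0 for a normal distribution
    }


stats = describe_feature([0.1, 0.2, 0.3, 0.4, 0.5])
```

Note that Pandas reports excess kurtosis, so positive and negative values are measured relative to a normal distribution.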



# **Application Use - Step-By-Step**

## **Final App:**

### Start the Streamlit application `regression_report_app.py`:

1) From the command line, type `streamlit run regression_report_app.py`
2) Fill in your name, purpose, and notes (Information for self tracking reports)
3) Select Albums you would like to investigate
4) Hit "Submit" button to submit form
5) Investigate visualizations and change parameters in the side bar
6) Input desired PDF output file name
7) Press the export button to save the PDF to your local directory
    
## **Exploratory Data Analysis**

This section has three Streamlit apps.  Note that they do not have built-in capabilities to output reports.  

*When creating these apps, I did not bother to suppress the error outputs so that I could continue learning how to address things.  I did not fix this because these were not intended outputs of my project.  When opening and using these apps, users will see an error saying something like: 'list_name is empty', or 'list_name referenced before creation'.  However, it should work after adding data to the data entry fields.*

**Startup Streamlit app for `initial_spotify_api_exploration.py`:**

1) From the command line, type `streamlit run initial_spotify_api_exploration.py`
2) Type the name of an album or artist to investigate
    (The underlying search command defaults to searching by album name, but the Spotify search engine is forgiving.)
3) Select the albums you would like to see the features of (Checkbox)
4) Adjust Slider to add additional columns for comparison
5) Press "Submit" button (There are two select boxes to view data being passed into the graphs, but they do not add much value.)
6) Investigate the graphs in the containers

**Startup Streamlit app for `top200_timespan_features_singlepull.py`:**

1) From the command line, type `streamlit run top200_timespan_features_singlepull.py`
2) Select a week to start your selection
3) Select a week to end your selection
4) Select the frequency to sample the database
5) Select how many of top 200 albums you would like to average between
6) After the graphs and statistics are generated, interpret and analyze the results.
   
**Startup Streamlit app for `search_by_artists.py`:**

1) From the command line, type `streamlit run search_by_artists.py`
2) Select the artist(s) you want to investigate
3) After the graphs and statistics are generated, interpret and analyze the results.


## **Main Application Use - Tips & Tricks**:   

This tool is meant to be used iteratively to refine your targets until you are happy with the outputs of your report.

- Make sure to use the Notes data entry field to keep track of what you're thinking about when creating a report.  
- Similarly creating a PDF name that is related to the purpose and notes is probably a good idea.

## Additional Tips & Tricks:

If you are interested in how mixing and segmentation affects the way Spotify interprets the features within your songs, use the Streamlit app `initial_spotify_api_exploration.py` and compare remastered versions of the same song or Studio vs Radio vs Club Mixes.

# Installation

***REQUIRED***

1) Use conda to create a new virtual environment with Python 3.8. 
    
2) Activate the virtual environment just created. 
3) Use a command console (terminal) to install the following packages into the activated conda environment: 

    `pip install pandas streamlit FPDF2 matplotlib plotly` 


***[OPTIONAL]*** Install PyCharm and any other programs you want in your environment (Jupyter, Spyder, command line utilities, etc).  PyCharm or any other IDE is not required, because the applications are started from the command line.  PyCharm and other IDEs have nicely integrated command line terminals that make it easy to run multiple applications at once.  

***[OPTIONAL]*** Note: *You do not need Spotify API unless you want to use the app: `initial_spotify_api_exploration.py`*

`pip install spotipy`
    
***[OPTIONAL]*** *Note: You do not need Discogs API unless you plan to do tracklist verification*

`pip install discogs_api`

### Using Discogs or Spotify APIs

To use the Discogs or Spotify API, you must sign up to get your own API key.  Their sign-up forms can be found here:

- https://www.discogs.com/developers/
- https://developer.spotify.com/dashboard/login

After getting API keys, copy them into environment variables by the name of spotifiy_client, spotify_secret, and discogs_token.

From there, use the steps described in the "Application Use" section to start the apps.
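
Before launching an app that needs the APIs, it can help to confirm that the keys are visible to Python.  The sketch below is a hypothetical helper, not part of the project; the variable names are copied exactly as written above, so adjust them if the project code reads different names.

```python
import os

# Environment variable names copied as written in the text above.
REQUIRED_KEYS = ("spotifiy_client", "spotify_secret", "discogs_token")


def missing_api_keys(environ=os.environ, names=REQUIRED_KEYS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in names if not environ.get(name)]


missing = missing_api_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```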

# Future Enhancements

1) Create a page to add items from the Spotify database.
    - Create a Discogs API verification, or create a third type of ID, e.g., 'user_verified'.  The Exploratory Data Analysis framework `initial_spotify_api_exploration.py` has this function, but it needs SQL commands for data inserts. 
        - Make sure to pull 'Acousticness' if you do this upgrade.  At some point it looks like I dropped or forgot about it in the exploration app.

2) Implement a secondary cleaning process to catch more correct matches from the error tables.  For some albums (e.g., the recent Spider Man soundtrack), the Discogs name and Spotify name contained different descriptive information attached to track names.  This caused a match failure when it seemed like it should have matched.
  
3) Create a function(s) to do multi-variable analysis to find 'dominant feature pairs'.

4) Create the function(s) to describe variance in an actionable way for both descriptive stats and regression.

5) Create a custom Streamlit element or context manager to grab the modified images off the page before the page is 'refreshed'.  Whenever buttons are pressed or certain controls are selected in a Streamlit app, it reruns the underlying Python script.  This triggers Plotly to regenerate the figures and reset them to their initial defaults, which discards any changes the user made by interacting with the graphs.  There is currently no way to cache the Plotly chart images to preserve user-modified graphs.  By modifying the context manager for the exit condition, it may be possible to directly grab the image using Streamlit.  A worst-case workaround would be to use Selenium to scrape and save each image to file; this would need to be done headlessly to not disrupt user interface flow.

6) Create a flag for albums matched by "release_id" instead of "master_id" for future querying
    - Some albums that have only a digital release have only a 'release_id' in the Discogs database and no 'master_id'.  There are commands to go from release_id to master_id, but using them with the issues I saw would require more error handling blocks, which I thought were unnecessary at the time.  

### Advanced Analytics Using Machine Learning, Deep Learning, or AI: 

More research is needed to identify how to use the data collected for future applications.  I have not yet found a consistent metric that could be used as a measure of "goodness" for acoustic feature flow for albums.  I need to learn more about multivariate analyses for high-dimensional data.  Unsupervised machine learning seems worth looking into further, but I must learn how to apply it to this data set.  Higher order regression techniques will naturally provide better fits, but the patterns we are trying to observe (or predict) should drive the selection of higher order analytical techniques.  Better fitting equations will help with comparing and predicting patterns required to achieve desired acoustic feature flows.  Deep learning and other advanced analytics techniques may help with these types of higher order analysis.  

# Lessons Learned

### Skills Development Lessons: 

#### Databases
SQA - Software Quality Assurance Skills:  I developed Data Quality Assurance skills to identify data quality problems, and then developed a reliable process and software tools to repair them.  I learned not to take data sources at face value, but to examine data closely to see whether any problems exist.  The "component.one" database contained about 350,000 songs from 35,000 albums, sourced from Spotify.  It might have taken many weeks or months to build such a large database on my own, so a lot of time was saved by using it as a starting point.  However, some issues required database re-design and data cleaning.  The album names and tracklists were not correctly cross-referenced, so searches by album name were retrieving the wrong tracklists.  To fix this and the other issues mentioned earlier, I redesigned the "component.one" database model, clarified the data tables' primary key identifiers, and learned how to link component.one data originally sourced from the Spotify API with the Discogs reference database.  I developed a data verification process that pulled data from the Discogs database and cross-referenced it against the Spotify data.  This gave me a straightforward way to reliably check and clean data.  The full data verification and cleaning process was implemented as Python scripts.  
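
The kind of cross-reference check described above can be sketched as follows.  This is an illustration under stated assumptions, not the project's actual verification scripts: the function names `normalize_title` and `tracklist_mismatches` and the normalization rules are hypothetical, and the tracklists are assumed to have already been retrieved as plain lists of titles.

```python
import re
import unicodedata
from itertools import zip_longest


def normalize_title(title: str) -> str:
    """Reduce a track title to a comparable form (hypothetical rules)."""
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop parenthetical or bracketed suffixes such as "(Remastered)" or "[Live]".
    text = re.sub(r"[\(\[][^\)\]]*[\)\]]", " ", text)
    # Lowercase, then replace every run of punctuation or whitespace with one space.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def tracklist_mismatches(spotify_titles, discogs_titles):
    """Return (track_number, spotify_title, discogs_title) for positions that disagree."""
    mismatches = []
    pairs = zip_longest(spotify_titles, discogs_titles)
    for number, (spotify, discogs) in enumerate(pairs, start=1):
        # A track missing on either side (None) always counts as a mismatch.
        if (
            spotify is None
            or discogs is None
            or normalize_title(spotify) != normalize_title(discogs)
        ):
            mismatches.append((number, spotify, discogs))
    return mismatches


# Made-up tracklists for illustration only
spotify = ["Intro", "Song A (Live)", "Outro"]
discogs = ["Intro", "Song A"]
print(tracklist_mismatches(spotify, discogs))
```

An empty result suggests the two tracklists describe the same album; any returned rows point to the track positions that need manual review.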

#### Work Within the Limitations **Before** Working Against them

Streamlit is a useful tool for data analysis apps, but it has its limits.  Streamlit's original design center was to make it easy for developers to create small GUI apps for interactive data analytics applications.  Within its intended use cases and platform design center, it expedites solution development.  However, it is not a general purpose web app development or data reporting tool.  Before starting my final app design, I identified some issues with Streamlit and quickly found ways to work with them.  Some solutions were not elegant or streamlined, but would have given the final app greater usability.  I was advised to try custom Streamlit components and experimental features, and lost almost a month on them when my original system of checkboxes and sliders would have sufficed.  As a relatively new Python developer, developing custom JavaScript components was outside my comfort zone and beyond the time available for this project.  My real development contribution was clarifying and reporting the logical workflow and caching functions required to fix the problems with Plotly when it is used inside Streamlit.  As a result of the Plotly limitations, only two visualization options could be developed for the output report within the available time.  

#### Pandas Vs SQL Querying and Table Joining

I learned when to use Python lists or NumPy arrays versus when to use SQL queries or Pandas dataframes for retrieving and processing data.  Choosing which data structure(s) to use in specific situations can be difficult.  This gave me good insight into how to balance code performance and complexity in future projects.
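
As a small illustration of this trade-off, the sketch below computes the same per-album average twice: once by letting SQL perform the join and aggregation, and once by fetching raw rows and joining them in plain Python.  The table names, column names, and values are hypothetical and do not reflect the project's actual schema.  The same join could also be expressed with a Pandas `merge` followed by `groupby`.

```python
import sqlite3

# In-memory database with a made-up two-table schema
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (album_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tracks (album_id INTEGER, track_number INTEGER, energy REAL);
    INSERT INTO albums VALUES (1, 'Album One'), (2, 'Album Two');
    INSERT INTO tracks VALUES (1, 1, 0.5), (1, 2, 0.75), (2, 1, 0.25);
""")

# Option 1: let SQL do the join and the aggregation, returning one row per album.
sql_result = conn.execute("""
    SELECT a.title, AVG(t.energy)
    FROM albums AS a
    JOIN tracks AS t ON t.album_id = a.album_id
    GROUP BY a.album_id
    ORDER BY a.album_id
""").fetchall()

# Option 2: fetch raw rows and do the same join in Python with a dict lookup.
titles = dict(conn.execute("SELECT album_id, title FROM albums"))
energies = {}
for album_id, energy in conn.execute("SELECT album_id, energy FROM tracks"):
    energies.setdefault(album_id, []).append(energy)
python_result = [
    (titles[album_id], sum(values) / len(values))
    for album_id, values in sorted(energies.items())
]

print(sql_result)
print(python_result)
```

The SQL version keeps the work in the database and returns only the summary rows, while the Python version moves every track row into memory first, which is simpler to debug on small data but scales less well.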

#### Software Development Lifecycle (SDLC) Skills:  
I learned a lot more about the SDLC process for software products, i.e., how to go from concept, through requirements analysis, prototyping, refinement, debugging, and finally to documentation.  

### Music Industry Lessons:

#### Every Music Tracking System Uses Different Ways to Track Data

Album names and song releases have minor variations, sometimes many.  Music industry data management systems do not have fully traceable version control.  Some do not care about tracking versions at all.
 
Tracking IDs for songs and albums in the legal world are not reused by Spotify, Discogs, or many other systems that I looked at.  The music industry seemingly has not created externally facing data standards that are required in other industries, such as the banking, medical devices, aerospace, or automotive industries.  Spotify is simply a user of music data coming from artists and production companies.  

In lieu of global standards for song and album releases, proper data management can still be done.  This means using or creating your own standardized identifiers (primary keys) for data coming from each source system.  This allows software applications to create reliable connections to reference data from different sources.
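
One possible way to organize such identifiers is sketched below, assuming a simple in-memory lookup.  The class `AlbumKeyRegistry`, its method names, and the source IDs are all hypothetical; in practice the same mapping would more likely live in a database table.

```python
from itertools import count


class AlbumKeyRegistry:
    """Map (source, source_id) pairs to one internal album key (hypothetical design)."""

    def __init__(self):
        self._next_key = count(1)
        self._by_source = {}  # (source, source_id) -> internal key

    def register(self, source: str, source_id: str) -> int:
        """Return the internal key for a source ID, creating one on first sight."""
        pair = (source, source_id)
        if pair not in self._by_source:
            self._by_source[pair] = next(self._next_key)
        return self._by_source[pair]

    def link(self, source: str, source_id: str, internal_key: int) -> None:
        """Attach an ID from another source to an existing internal key."""
        if internal_key not in self._by_source.values():
            raise KeyError(f"unknown internal key: {internal_key}")
        self._by_source[(source, source_id)] = internal_key

    def lookup(self, source: str, source_id: str):
        """Return the internal key for a source ID, or None if it is unknown."""
        return self._by_source.get((source, source_id))


# Placeholder IDs for illustration only
registry = AlbumKeyRegistry()
key = registry.register("spotify", "spotify-album-abc")
registry.link("discogs", "master-123", key)
print(registry.lookup("discogs", "master-123") == key)
```

With one internal key per album, records from Spotify and Discogs can be joined on that key regardless of how each source names or numbers the release.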



# Appendix:

Database from: \
https://components.one/datasets/billboard-200