# Data Visualization Project: WASABI

# Project Description

## 1. User Description (1 paragraph)

Potential users of a music-related app will likely show interest in in-depth information related to music genres, their prevalence, spread, etc. across time and countries. As such, users might want access to the following:

- A Sankey diagram showing how an artist's genres evolve over time, how genres relate to one another, and the weight of each genre (e.g. an artist's career starts in Soul and moves to Pop and Neo Soul in later years).

- A data visualization tool that showcases the prevalence of a music genre within a country and within a time period (e.g. Heavy Metal has been *heavily* represented in the Nordics since the 1980s). A choropleth would allow a user to select a genre and a time period and see the number of bands in that genre divided by the country's total population (a density representation used in music-specialized magazines).

## 2. Visual Tasks To Implement

- **Historical Evolution of the artists' genres**: A Sankey diagram with time periods (specifically five-year periods, or quinquennia) on the horizontal axis and, within each quinquennium, one node per genre. The thickness of the band between two nodes is determined by the number of albums.

- **Music Genre Prevalence by Country**: A (zoomable?) world choropleth map where each country is colored, heatmap-style, by the number of bands playing a specific genre during a specific decade *per capita*. Hovering over a country opens a tooltip with more information (e.g. absolute number of bands, total population of the country).
  - See section 7 for a proof-of-concept picture.
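
The per-capita ratio behind the choropleth can be sketched as follows. This is a minimal illustration on invented numbers, not the project's final pipeline: the data frame `bands`, its columns `n_bands` and `population`, and the scaling to bands per million inhabitants are all assumptions made for the example.

```r
# Toy input: number of bands of one genre in one decade, per country.
# All values are invented for illustration.
bands <- data.frame(
  country    = c("Finland", "France"),
  decade     = c(1980, 1980),
  n_bands    = c(50, 120),
  population = c(5e6, 60e6),
  stringsAsFactors = FALSE
)

# Density = bands divided by population, scaled to "per million inhabitants".
# The scaling factor is an assumption; any constant gives the same coloring.
bands$bands_per_million <- bands$n_bands / bands$population * 1e6

print(bands)
```

A country with fewer bands in absolute terms can still rank higher once the count is divided by population, which is the effect the map is meant to show.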

## 3. Name of visualization technique and the name of the member of the group who is going to implement it

- **Sankey**: Mariana

- **Choropleth**: Quentin

## 4. Needed attributes from the WASABI dataset

- **Sankey**: 
    - Needed fields from the [Album] table: X_id (unique identifier of the album), genre, id_artist (unique identifier of the artist), publicationDate (year when the album was published), title.
    - Needed fields from the [Artist] table: X_id (unique identifier of the artist), name_accent_fold (artist's name without special characters)

- **Choropleth**:
    - Needed fields from the [Album] table: id_artist (unique identifier of the artist to perform a join with the [Artist] table), genre, dateRelease
    - Needed fields from the [Artist] table: id_artist (unique identifier of the artist to perform a join with the [Album] table), country (unique identifier of the country to perform a join with the [Country] table), ended, begin, end
    - Needed fields from the [Country] table (available on GitHub [here](https://github.com/datasets/population/blob/master/data/population.csv)): country (unique identifier of the country to perform a join with the [Artist] table), year, value
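
The joins listed for the choropleth could look roughly like the sketch below. It runs on small invented tables and uses the column names given above. It also assumes that `dateRelease` starts with a four-digit year and that the population table has one row per country and year. Neither assumption has been checked against the real data.

```r
library(dplyr)

# Invented sample rows using the field names listed above.
albums <- data.frame(
  id_artist   = c("a1", "a1", "a2"),
  genre       = c("Heavy Metal", "Heavy Metal", "Pop"),
  dateRelease = c("1984", "1986", "1991"),
  stringsAsFactors = FALSE
)
artists <- data.frame(
  id_artist = c("a1", "a2"),
  country   = c("Finland", "France"),
  stringsAsFactors = FALSE
)
population <- data.frame(
  country = c("Finland", "Finland", "France"),
  year    = c(1984, 1986, 1991),
  value   = c(4.9e6, 4.9e6, 58e6),
  stringsAsFactors = FALSE
)

joined <- albums %>%
  # Assumption: the first four characters of dateRelease are the year.
  mutate(year = as.numeric(substr(dateRelease, 1, 4))) %>%
  inner_join(artists, by = "id_artist") %>%             # Album -> Artist
  inner_join(population, by = c("country", "year"))     # Artist -> Country

print(joined)
```

`inner_join()` silently drops albums whose artist or country has no match, so comparing row counts before and after each join is a cheap sanity check.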

## 5. Informal description of the data processing of the raw data

- **Sankey**: 
    - Parse JSON
    - Variable selection
    - Transform blank spaces into NA 
    - Clean the genre variable 
    - Filter for only those observations that have information about the artist, the year of publication of the album and genre 
    - Since the analysis is at the quinquennium level, keep only those artists who have been active for at least 5 years, and create a quinquennium variable.
    - Join the artist and albums datasets.
    - Group the number of albums by artist, genre, and quinquennium. 
    - Change from long format to ‘source and target’ format required for the Sankey diagram (most important step)

- **Choropleth**: 
    - Parse JSON
    - Variable selection
    - Transform blank spaces into NA
    - Standardize the variables (e.g. genre variable: ``Gothic Rock&#x200F;&#x200E;`` -> ``Gothic Rock``)
    - Filter out unusable rows from both the Albums and Artists datasets (e.g. missing genre, country, or id_artist)
    - Perform a join between both datasets on the id_artist key
    - Group and format the data based on a selected JSON structure
    - Save the result as a JSON file that D3.js can load
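
The last two choropleth steps (grouping, then export for D3.js) might be sketched as below. The target JSON structure has not been fixed yet, so the array-of-records layout, the `n_bands` column, and the output file name are placeholders chosen for illustration. The `jsonlite` package is also an assumption, as this document does not name a JSON library.

```r
library(dplyr)
library(jsonlite)

# Invented joined data: one row per album, with artist, country, and year.
joined <- data.frame(
  id_artist = c("a1", "a1", "a2", "a3"),
  country   = c("Finland", "Finland", "Finland", "France"),
  genre     = rep("Heavy Metal", 4),
  year      = c(1984, 1986, 1988, 1985),
  stringsAsFactors = FALSE
)

# Count distinct bands per country, genre, and decade.
grouped <- joined %>%
  mutate(decade = (year %/% 10) * 10) %>%
  group_by(country, genre, decade) %>%
  summarise(n_bands = n_distinct(id_artist), .groups = "drop")

# Serialize as an array of records, one object per country/genre/decade.
json_text <- toJSON(grouped, dataframe = "rows", pretty = TRUE)

# Hypothetical output location; a temporary directory keeps the sketch tidy.
out_path <- file.path(tempdir(), "choropleth_data.json")
writeLines(json_text, out_path)
```

An array of flat records is easy to filter by genre and decade on the D3.js side; a nested structure keyed by country would work as well if lookups by country dominate.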

### Example of code for variable selection:

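  ```r
  library(readr)  # read_csv()
  library(dplyr)  # %>%, select(), all_of()

  # Columns needed for the Sankey diagram
  sankey_var_albums <- c("_id", "id_artist", "genre", "publicationDate", "title")
  sankey_var_artists <- c("_id", "nameVariations_fold")

  # Columns needed for the choropleth ("XX" is left as a placeholder)
  choro_var_albums <- c("id_artist", "genre", "dateRelease")
  choro_var_artists <- c("_id", "location.country", "members.XX.ended", "members.XX.begin", "members.XX.end")

  # Load the albums, keep one slim copy per visualization, then free memory
  albums <- read_csv("wasabi_albums.csv")
  albums_lite_sankey <- albums %>% select(all_of(sankey_var_albums))
  albums_lite_choro <- albums %>% select(all_of(choro_var_albums))
  rm(albums)

  # Same for the artists; all_of() makes a missing column an explicit error
  artists <- read_csv("wasabi_artists.csv")
  artists_lite_sankey <- artists %>% select(all_of(sankey_var_artists))
  artists_lite_choro <- artists %>% select(all_of(choro_var_artists))
  rm(artists)
  ```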
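
The "blank spaces into NA" and genre standardization steps listed above could be handled by a small helper such as the one below. `clean_genre` is a hypothetical name, and the sketch assumes the stray characters in the data are the actual Unicode marks U+200E and U+200F rather than literal HTML entity text.

```r
# Invented sample values, including the invisible direction marks
# seen after "Gothic Rock" in the raw data.
genre_raw <- c("Gothic Rock\u200F\u200E", "  Pop ", "", "Heavy Metal")

# Hypothetical helper: strip direction marks, trim, and turn blanks into NA.
clean_genre <- function(x) {
  x <- gsub("\u200E", "", x, fixed = TRUE)  # left-to-right mark
  x <- gsub("\u200F", "", x, fixed = TRUE)  # right-to-left mark
  x <- trimws(x)                            # leading/trailing whitespace
  x[!is.na(x) & x == ""] <- NA_character_   # blank strings become NA
  x
}

genre_clean <- clean_genre(genre_raw)
print(genre_clean)
```

Cleaning before grouping matters because `Gothic Rock` with and without the invisible marks would otherwise be counted as two different genres.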

## 6. Visual mapping of variables available in your data set

- **Sankey**: [data visualization catalogue page for Sankey 1](https://datavizcatalogue.com/methods/sankey_diagram.html), [data visualization catalogue page for Sankey 2](https://www.d3-graph-gallery.com/sankey.html)

| artist | source genre | target genre | quinquennium | value |
| --- | --- | --- | --- | --- |
| Artist name | Source node of genre | Target node of genre | year when the quinquennium starts | number of albums from the source to the target |  
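
A table in this shape could be derived from the long format roughly as follows. This is only a sketch on invented albums. It assumes that quinquennia are aligned on multiples of five, that a link connects an artist's genre in one quinquennium to each of that artist's genres in the next one, and that `value` is the album count of the source node. The final definition of `value` may differ.

```r
library(dplyr)

# Invented albums for a single artist.
albums <- data.frame(
  artist = c("A", "A", "A", "A"),
  genre  = c("Soul", "Soul", "Pop", "Neo Soul"),
  year   = c(2001, 2003, 2006, 2008),
  stringsAsFactors = FALSE
)

# Long format: number of albums per artist, genre, and quinquennium.
long <- albums %>%
  mutate(quinquennium = (year %/% 5) * 5) %>%   # 2001 -> 2000, 2006 -> 2005
  count(artist, genre, quinquennium, name = "n_albums")

# Candidate targets: the same rows, shifted back by one quinquennium so that
# they line up with the period of their source node.
targets <- long %>%
  transmute(artist, target = genre, quinquennium = quinquennium - 5)

# Source/target format: one row per link between consecutive quinquennia.
links <- long %>%
  rename(source = genre, value = n_albums) %>%
  inner_join(targets, by = c("artist", "quinquennium")) %>%
  select(artist, source, target, quinquennium, value)

print(links)
```

On this toy input the two Soul albums from 2000-2004 produce two links, one to Pop and one to Neo Soul, both starting in the 2000 quinquennium.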

- **Choropleth**: [data visualization catalogue page for Choropleth 1](https://datavizcatalogue.com/methods/choropleth.html), [data visualization catalogue page for Choropleth 2](https://www.d3-graph-gallery.com/choropleth)

1. <u>Albums data:</u>
    
| id_artist | genre | dateRelease |
| --- | --- | --- |
| **test** | **String** | **String** | 
| Unique Id of the artist | Album genre | Album publicationDate |

2. <u>Artists data:</u>
    
| id_artist | genres | country | ended | begin | end |
| --- | --- | --- | --- | --- | --- |
| **test** | **String** | **String** | **Boolean** | **String** | **String** |
| Unique Id of the artist | The artist's genres. It can be null | Country of origin of the artist or group | Whether the artist's or group's career has ended | Date of birth of the artist or group | Date of the end of activity |

3. <u>Country data:</u>
    
| country | year | value |
| --- | --- | --- |
| **String** | **String** | **String** |
| Country name | Year of Data | Population total |

<span style="color:red">We could perform a join between the three sets of data via the Artist's **name**</span>.

## 7. Example of visualization

**Choropleth:**

![img](https://i.imgur.com/n4wOCye.png)

### WASABI API Documentation

The documentation and the data fields can be found [here](https://wasabi.i3s.unice.fr/apidoc/).
The GitHub repository can be found [here](https://github.com/micbuffa/WasabiDataset).

### Country Population Data

The population data is available on GitHub [here](https://github.com/datasets/population/blob/master/data/population.csv).