## Background

>**Bulgarian pop-folk** (hereinafter referred to as **chalga**) is a dance genre stemming from ethno-pop, with strong hints of Oriental rhythms and instrumentals. Chalga is one of many branches of Balkan folk found throughout the peninsula (turbofolk in Serbia, manele in Romania, etc.). After the fall of communism in Central and Eastern Europe in 1989, chalga rapidly found a place in everyday life.

>Chalga relies on provocation, and tracks commonly contain sexually explicit lyrics. Because of this, it causes much controversy in society, and scientific work in the field is sparse. Nevertheless, chalga is becoming an increasingly popular musical style. As such, we believe it must be subject to development. Finding its 'evolution' constitutes the main scientific motivation behind this study.

**This work serves the purpose of providing a general Exploratory Data Analysis (EDA) of the [Bulgarian popfolk songs](https://www.kaggle.com/astronasko/payner) dataset.**

## Preliminary
The main visualisation tools in the present notebook are ``matplotlib`` and ``seaborn``. Please note that you may have to install the ``transliterate`` package manually.

In [None]:
!pip install transliterate

# Including main libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import matplotlib.colors
from matplotlib.ticker import AutoMinorLocator
import seaborn as sns
from transliterate import translit
import datetime as dt

def shorten_name(name):
    '''Shortens the artist name for visualisation. Keeps the artist's first name unchanged.'''
    # Splits the input string in words
    name_count = len(name.split())
    
    if (name_count==1 or name=="Desi Slava"):
        return name
    
    # Keeps the first two names only
    forename, surname = name.split()[:2]
    #  Abbreviates the surname
    shortened_name = "{0:} {1:}.".format(forename, surname[0])
    return shortened_name
    
def top_feature_songs(data, n, feature):
    '''Returns the first n songs, sorted by the feature in question.'''
    columns = ['artist_1','artist_2','artist_3','track_name', feature]
    
    # Styler.hide(axis="index") replaces hide_index(), which was removed in pandas 2.0
    out = data.nlargest(n, columns=feature,keep='all')[columns].style.hide(axis="index")
    
    display(out)

# Load the entire dataset in data
data = pd.read_csv("/kaggle/input/payner/payner.csv")

## Pre-processing
The pre-processing of this EDA consists in:
- **Taking into account the year of release only**, as tracks are subject to periodic tendencies within a calendar year (e.g. more upbeat songs during summer). In this analysis, songs are not assessed at a temporal resolution finer than one year.
- **Disregarding Spotify popularity**, as tracks are uploaded at different times, and this may bias towards older songs.
- **Disregarding instrumentalness** (whether a track contains no vocals), as all tracks have vocal content.
- **Disregarding liveness** (confidence measure of live audience presence), as all tracks have been assumed to be recorded in a studio.
- **Disregarding mode, key and time signature** because of my lack of technical competence in the field.
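
The year extraction and the duration conversion can be tried on a toy table first. The sketch below is only an illustration, not the notebook's own code: it assumes `datetime` holds ISO-formatted strings and `duration` is given in milliseconds, and both rows are invented.

```python
import pandas as pd

# Two invented rows; the exact datetime format of the real dataset is an assumption
toy = pd.DataFrame({
    "datetime": ["2016-07-01 10:00:00", "2019-01-15 18:30:00"],
    "duration": [210000, 195000],  # milliseconds
})

# Year of release via datetime parsing (an alternative to slicing the first four characters)
toy["year"] = pd.to_datetime(toy["datetime"]).dt.year

# Milliseconds -> minutes
toy["duration_min"] = toy["duration"] / 1000 / 60

print(toy[["year", "duration_min"]])
```

Parsing the timestamp is slightly more defensive than string slicing, since a malformed value raises an error instead of silently producing a wrong year.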

In [None]:
# Get year of release
data['year'] = data.datetime.apply(lambda x: x[0:4]).astype(int)

# Disregard aforementioned columns
data = data.drop(
    columns=[
        'track_id',
        'popularity',
        'mode',
        'key',
        'time_signature',
        'instrumentalness',
        'liveness',
        'datetime'
    ])

# Reorder remaining columns
new_cols = ['track_name', 'year', 'artist_1', 'artist_2', 'artist_3',
            'danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
            'valence', 'tempo', 'duration']

data = data[new_cols]

#Shorten artist names
data.artist_1 = data.artist_1.apply(shorten_name)
data.artist_2 = data.artist_2.apply(shorten_name)
data.artist_3 = data.artist_3.apply(shorten_name)

# Transform duration in minutes
data.duration = data.duration.apply(lambda x: x/1000/60)

## Artist-specific statistics

### Artists with most resolved tracks.

In [None]:
first_second_third = pd.concat([data.artist_1,data.artist_2,data.artist_3])

data_artists = pd.DataFrame({
    'artist_1': data.artist_1.value_counts(dropna=False),
    'artist_2': data.artist_2.value_counts(dropna=False),
    'artist_3': data.artist_3.value_counts(dropna=False)
}).fillna(0)

key = first_second_third.value_counts().index[1:16].tolist()

data_artists.reindex(key).plot(
    kind='bar',
    stacked=True,
    figsize=(12, 4),
    color=['gold','darkgray','saddlebrown'],
    edgecolor="black",
    linewidth=0.5,
    zorder=2)

plt.title("Most prevalent artists, PlanetaOfficial, 2014-2019",fontsize=14)
plt.legend(labels=['First', 'Second', 'Third'], title="Order in track", loc='best')
plt.grid(axis='y', linewidth=0.5, zorder=0)
plt.xticks(rotation='horizontal', wrap=True)
plt.ylim(0,30)
plt.ylabel("Track count")
plt.tight_layout()

Colour coding shows how each artist's tracks are distributed by order of mention in the track.

### Artists with most resolved solo tracks.

In [None]:
# Artists by solo songs - important for correlation matrices!
# Get only solo tracks
data_solo = data[(data.artist_2=='None')&(data.artist_3=='None')].drop(columns=['artist_2','artist_3'])

# Get all authors by solo track count
count_solo = pd.DataFrame({
    'solist': data_solo.artist_1.value_counts(dropna=False),
})

# Get names of first 10 authors by solo track count
key_solo = data_solo.artist_1.value_counts().index[0:10].tolist()

# Plot only first 10 authors from all authors
count_solo.reindex(key_solo).plot(
    kind='bar',
    figsize=(12, 4),
    color='tab:blue',
    edgecolor="black",
    linewidth=0.5,
    zorder=2)

plt.title("Top authors by solo tracks, Planeta Payner, 2014-2019",fontsize=14)
plt.legend(labels=['Solo tracks'], loc='best')
plt.xticks(rotation='horizontal', wrap=True)
plt.grid(axis='y', linewidth=0.5, zorder=0)
plt.tight_layout()

Solo songs are extremely important for further individual analysis of artists, as they may indicate individual trends and correlations in artists' discography. Naturally, artists with more solo songs will be of greater interest. As an example of the importance of solo tracks, we will further focus on the top three artists: **Dzhena**, **Maria** and **Preslava**.

## Dataset statistics
> **After random sampling (n=100), it may be inferred that the purity of this dataset is (0.87, 0.97), 95% C.I.**

In order to minimise the number of noise tracks in the dataset, some form of data filtering must be conducted. In this work, a track is kept only if its first-mentioned artist is mentioned first in at least three tracks. Although this is a rather aggressive method of filtering, it should increase the purity of the data. Furthermore, it ensures that only 'representative' tracks are selected, which in turn increases the confidence of subsequent findings.
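
The filtering condition can be checked on a small invented table before it is applied to the real data. This is a minimal sketch; the function name `filter_by_lead_count` and the toy rows are hypothetical.

```python
import pandas as pd

# Invented toy data: artist A leads three tracks, B two, C one
toy = pd.DataFrame({
    "artist_1": ["A", "A", "A", "B", "B", "C"],
    "track_name": ["a1", "a2", "a3", "b1", "b2", "c1"],
})

def filter_by_lead_count(frame, min_tracks=3):
    """Keep rows whose first-mentioned artist leads at least min_tracks tracks."""
    counts = frame["artist_1"].value_counts()
    # Map each row's artist to that artist's track count, then threshold
    mask = frame["artist_1"].map(counts) >= min_tracks
    return frame[mask]

filtered = filter_by_lead_count(toy)
print(filtered)
```

Only the three tracks led by artist A survive, which shows how aggressively the condition prunes artists with few tracks.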

In [None]:
# Has this artist got at least 3 tracks in which they are mentioned first?
# Preparing a boolean mask of the condition
mask = data.artist_1.map(data.artist_1.value_counts())  >= 3
# Applying this boolean mask to data
data = data[mask]

## Feature analysis

### Danceability
> Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

In [None]:
top_feature_songs(data, n=5, feature='danceability')

The track with highest resolved danceability is ['Yako mi e'](https://www.youtube.com/watch?v=M3m4GMSckgE) by Dzhena (0.883). Tracks ranked second and third are ['Angelat'](https://youtu.be/NkbpLaMF_yo) (0.869) and ['Nyama da te bavya'](https://www.youtube.com/watch?v=XGhkY80Ddg4) (0.864). Two tracks split fourth place with danceability of 0.860: ['Blokiran'](https://youtu.be/IaZ07L2xdRA) and ['Sto nyuansa rozovo'](https://youtu.be/sx8ytImqjiI).

### Energy
> Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

In [None]:
top_feature_songs(data, n=5, feature='energy')

The track with highest resolved energy is ['Noshtta garmi'](https://youtu.be/D3vBCYwS34Y?t=1) by Galin and Lorena, scoring 0.995! Tracks ranked second and third are both by Roksana: ['Ot gordost da boli'](https://youtu.be/QsKkHFRXdAw), duet with Toni S. (0.993) and ['Selfi'](https://youtu.be/eCaXDlqmYgM) (0.989). Track ranked fourth is Galin's ['Gotina kola'](https://youtu.be/K9iEjkmoR3A?t=1) (0.984). Two songs share the fifth place of 0.982 energy: ['Roka-laka'](https://youtu.be/n9HclKtkV04) and ['Neka da e tayno'](https://youtu.be/pNCdXt1MXps).

### Loudness
> The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude).

It is important to note that the decibel scale is logarithmic, as human hearing responds logarithmically to auditory stimuli according to the [Weber-Fechner law](https://en.wikipedia.org/wiki/Weber%E2%80%93Fechner_law?oldformat=true).
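
Because the scale is logarithmic, a difference in decibels corresponds to a ratio of amplitudes, not a difference. The sketch below uses the standard decibel definitions (20·log10 for amplitude, 10·log10 for power); the helper names are illustrative only.

```python
def db_to_amplitude_ratio(delta_db):
    """Amplitude ratio corresponding to a level difference in dB."""
    return 10 ** (delta_db / 20)

def db_to_power_ratio(delta_db):
    """Power ratio corresponding to a level difference in dB."""
    return 10 ** (delta_db / 10)

# A track that is 6 dB louder has roughly twice the amplitude
print(round(db_to_amplitude_ratio(6), 3))
# A 20 dB difference is a tenfold amplitude ratio
print(db_to_amplitude_ratio(20))
```

This is why loudness differences of a few dB between tracks, which look small on the axis, are clearly audible.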

In [None]:
top_feature_songs(data, n=5, feature='loudness')

The loudest resolved song by PlanetaOfficial in 2014-2019 is ['Noshtta garmi'](https://youtu.be/D3vBCYwS34Y?t=1) by Galin and Lorena. Note that the same song also ranked first in ``energy``. Second in loudness comes Roksana's ['Selfi'](https://youtu.be/eCaXDlqmYgM), which ranked third in ``energy``. A good question arises: **Are energy and loudness correlated?**

### Speechiness
> Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

In [None]:
top_feature_songs(data, n=5, feature='speechiness')

The song with the most speech detected is ['Chuzda staya'](https://youtu.be/OUyolqrl-OM), scoring 32.8 per cent. This comes as no surprise, as the main artist is Ustata, a rapper associated with the Payner Ltd record label. Nevertheless, no track exceeds the threshold of 33 per cent, as expected.

It is interesting to see that the artist Galin is also well represented in terms of speechiness: ['Vse napred'](https://youtu.be/YafXXQeBnaI) and ['112'](https://youtu.be/gq0QbIHycg4) are ranked second and third among all tracks.

### Acousticness
> A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. 

In [None]:
top_feature_songs(data, n=5, feature='acousticness')

The track with highest resolved acousticness is Fiki's ['Is This Love'](https://youtu.be/MZSEnQJsRms) (0.811), well above the rest. The runner-up track is ['Spomeni'](https://youtu.be/8_boh8HoIZQ) by Roksana (0.731). Extra Nina's ['Molitva'](https://youtu.be/9kf0Es25zio) (0.689) takes third place in acousticness.

### Valence
> A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

In [None]:
top_feature_songs(data, n=5, feature='valence')

The Christmas-themed song ['Koledni zhelaniya'](https://youtu.be/5H5bwR1Ly-g) by Tedi A. is the most valent resolved track, scoring 0.970. The second most danceable song, ['Angelat'](https://youtu.be/NkbpLaMF_yo) by Tsvetelina Y., is also the second most valent in the dataset (0.969). Two tracks share third place: ['Profesor'](https://youtu.be/NzpdukkDn5U) by Milko K. and ['Umna i krasiva'](https://youtu.be/l44EVvCurak) by Veselin M.

### Tempo
> The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

In [None]:
top_feature_songs(data, n=5, feature='tempo')

The song with highest resolved tempo is ['La Vida Amiga'](https://youtu.be/UYN53ZHUxN8) by Avi B., with almost 200 BPM. Interestingly enough, this song does not follow the characteristics of Bulgarian pop-folk and is, in fact, entirely in Spanish. A worthy opponent is the ['S teb ili s nikoy'](https://youtu.be/AeYnY-aW_m0) duet by Preslava and Fiki (198 BPM), reaching over 17 million views on YouTube.

## Feature correlations

### Global feature correlations

In [None]:
f, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=True)

# Numeric feature columns only: recent pandas raises on text columns in corr()
features = data.drop(columns=['year']).select_dtypes(include='number')

# Mask the upper triangle (the builtin bool replaces np.bool, removed from NumPy)
mask = np.triu(np.ones_like(features.corr(), dtype=bool))

ax_pearson = sns.heatmap(
    features.corr(method='pearson'),
    mask=mask,
    vmax=1,
    vmin=-1,
    square=True,
    annot=True,
    fmt=".2f",
    cbar=False,
    cmap="RdBu_r",
    ax=axes[0])

ax_pearson.set_title('Pearson',fontsize=14)

ax_spearman = sns.heatmap(
    features.corr(method='spearman'),
    mask=mask,
    vmax=1,
    vmin=-1,
    square=True,
    annot=True,
    fmt=".2f",
    cbar=False,
    cmap="RdBu_r",
    ax=axes[1])

ax_spearman.set_title('Spearman',fontsize=14)

plt.tight_layout()
plt.suptitle("Features correlation of tracks, PlanetaOfficial, 2014-2019", fontsize=18, y=1.10)
plt.subplots_adjust(hspace = 0.10)

The correlation matrix proves to be a powerful tool. In our case, it shows that the energy-loudness correlation has a linear (Pearson) coefficient of 0.68 and a monotonic (Spearman) coefficient of 0.63. This is expected, as loudness is an argument in the energy function, according to the Spotify API documentation. Plotting all points on an energy-loudness diagram for confirmation:

In [None]:
plt.figure(figsize=(9, 6))
plt.title("Co-dependency of energy and loudness, PlanetaOfficial, 2014-2019", fontsize=14, y=1.04)
plt.xlabel("Energy coefficient")
plt.ylabel("Loudness (dB)")
sns.scatterplot(x="energy", y="loudness", data=data, alpha=0.3);
sns.regplot(x="energy", y="loudness", data=data, scatter=False);

### Artist-specific feature correlations
Alternatively, an individual correlation matrix may be constructed for each artist. **This can highlight artist-specific trends throughout solo songs, and allows such trends to be compared to the global dataset.** Here, outliers further than 3$\sigma$ were discarded in correlation calculations. Let us return to **Dzhena**, **Maria** and **Preslava**, as discussed.
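
The effect of the 3$\sigma$ cut on the two coefficients can be illustrated on synthetic data. The sketch below is a hedged example: the data are randomly generated, one extreme outlier is injected by hand, and the helper `drop_outliers` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, roughly linear energy-loudness relation (not real track data)
rng = np.random.default_rng(0)
energy = rng.uniform(0.5, 1.0, size=200)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": -12 + 8 * energy + rng.normal(0, 0.5, size=200),
})
toy.loc[0, "loudness"] = -40.0  # hand-injected extreme outlier

def drop_outliers(frame, threshold=3.0):
    """Drop rows where any column lies further than `threshold` sigma from its mean."""
    z = np.abs(stats.zscore(frame.to_numpy(), axis=0))
    return frame[(z < threshold).all(axis=1)]

clean = drop_outliers(toy)
pearson = clean.corr(method="pearson").loc["energy", "loudness"]
spearman = clean.corr(method="spearman").loc["energy", "loudness"]
print(len(toy), len(clean), round(pearson, 2), round(spearman, 2))
```

Pearson is sensitive to single extreme points, whereas Spearman works on ranks; filtering before computing both keeps the two matrices comparable.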

In [None]:
f, axes = plt.subplots(3, 2, figsize=(12, 14), sharex=True)

for i in range(3):
    
    artist = key_solo[i]
    
    data_artist = data_solo.loc[data_solo.artist_1 == artist].drop(columns=['track_name','year', 'artist_1'])
    data_artist = data_artist[(np.abs(stats.zscore(data_artist)) < 3).all(axis=1)]
    
    # Mask the upper triangle; the builtin bool replaces np.bool, which was removed from NumPy
    mask = np.triu(np.ones_like(data_artist.corr(), dtype=bool))
    # Pearson
    sns.heatmap(
        data=data_artist.corr(method='pearson'),
        mask=mask,
        vmax=1,
        vmin=-1,
        ax=axes[i, 0],
        square=True,
        annot=True,
        fmt=".2f",
        cbar=False,
        cmap="RdBu_r"
    )
    # Spearman
    sns.heatmap(
        data=data_artist.corr(method='spearman'),
        mask=mask,
        vmax=1,
        vmin=-1,
        ax=axes[i, 1],
        square=True,
        annot=True,
        fmt=".2f",
        cbar=False,
        cmap="RdBu_r"
    )
    # Left-hand-side labels
    # Look the count up by artist name; integer access on a label index is deprecated
    pop_size = count_solo.solist.loc[artist]
    sam_size = len(data_artist)
    
    axes[i, 0].set_ylabel(
        '{0:}, n={1:}\n({2:} solo tracks)'.format(artist, sam_size, pop_size),
        fontsize=14,
        rotation=0,
        labelpad=80,
        va='center',
        linespacing=1.4
        
    )
    
    axes[i,0].set_title('Pearson', fontsize=14)
    axes[i,1].set_title('Spearman', fontsize=14)

plt.suptitle("Features correlation of artists, PlanetaOfficial, 2014-2019", fontsize=18, y=1.04)
plt.subplots_adjust(hspace = 0.15)
plt.tight_layout()

This set of 6 correlation matrices provides a wealth of information, if one is interested in artist-specific trends in discography. Suppose we compare the **duration-speechiness** Pearson correlation coefficient of **Dzhena (-0.45)** with that of **Preslava (0.50)**. Let us plot their solo tracks on a duration-speechiness diagram, against the whole PlanetaOfficial collection.

In [None]:
f, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True, sharex=True)

plt.suptitle("Tendencies of artists, PlanetaOfficial, 2014-2019", fontsize=16, y=1.05)
plt.xlim(2.5, 5)
plt.ylim(0, 0.3)
plt.xticks(
    ticks = np.linspace(2.5,5,6),
    labels = ['2:30','3:00','3:30','4:00','4:30','5:00'])

# DZHENA
axes[0].set_title("Solo tracks by Dzhena")
# All dots
sns.scatterplot(
    x="duration",
    y="speechiness",
    data=data,
    alpha=0.15,
    color='gray',
    ax=axes[0])
# Dzhena scatter
sns.scatterplot(
    x="duration",
    y="speechiness",
    data = data_solo[data_solo['artist_1']=="Dzhena"],
    color="tab:blue",
    ax=axes[0])
# Dzhena reg
sns.regplot(
    x="duration",
    y="speechiness",
    data=data_solo[data_solo['artist_1']=="Dzhena"],
    color="tab:blue",
    scatter=False,
    ax=axes[0])

# PRESLAVA
axes[1].set_title("Solo tracks by Preslava")
# All dots
sns.scatterplot(
    x="duration",
    y="speechiness",
    data=data,
    alpha=0.15,
    color='gray',
    ax=axes[1])
# Scatter
sns.scatterplot(
    x="duration",
    y="speechiness",
    data = data_solo[data_solo['artist_1']=="Preslava"],
    color="tab:orange",
    ax=axes[1])
# Reg
sns.regplot(
    x="duration",
    y="speechiness",
    data=data_solo[data_solo['artist_1']=="Preslava"],
    color="tab:orange",
    scatter=False,
    ax=axes[1]);

It may be deduced that Dzhena tends to speak proportionally less in her longer songs. In contrast, Preslava tends to speak proportionally more as track duration increases. Note that this was inferred using only the artist-specific correlation matrix, which illustrates its value in the search for tendencies.
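
The same comparison can be expressed as a single number per artist. The following sketch computes a per-artist correlation coefficient on invented data; the artists `A` and `B`, their values, and the helper `correlation_by_artist` are all hypothetical.

```python
import pandas as pd

# Invented solo tracks: A gets less speech-like with length, B more
toy = pd.DataFrame({
    "artist_1": ["A"] * 4 + ["B"] * 4,
    "duration": [3.0, 3.5, 4.0, 4.5] * 2,
    "speechiness": [0.20, 0.15, 0.12, 0.05, 0.04, 0.06, 0.11, 0.15],
})

def correlation_by_artist(frame, x, y, method="pearson"):
    """Correlation between columns x and y, computed separately for each artist."""
    return pd.Series({
        artist: group[x].corr(group[y], method=method)
        for artist, group in frame.groupby("artist_1")
    })

r = correlation_by_artist(toy, "duration", "speechiness")
print(r.round(2))
```

Opposite signs for the two artists correspond to regression lines that slope in opposite directions on the duration-speechiness diagram.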

## Change of tracks over time
The central question is whether tracks change over time. For presentation purposes, tracks are binned into three groups by their year of release; that is, they are split into **'old' (2014-2015), 'medium' (2016-2017) and 'new' (2018-2019) songs.**
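
As an alternative to three separate frames, the binning can be stored as a label column. This is a minimal sketch with invented years, assuming `pandas.cut` with right-closed intervals; the column name `period` is hypothetical.

```python
import pandas as pd

# Invented release years
toy = pd.DataFrame({"year": [2014, 2015, 2016, 2017, 2018, 2019, 2019]})

# Right-closed bins: (2013, 2015], (2015, 2017], (2017, 2019]
toy["period"] = pd.cut(
    toy["year"],
    bins=[2013, 2015, 2017, 2019],
    labels=["old", "medium", "new"],
)

counts = toy["period"].value_counts(sort=False)
print(counts)
```

A label column makes it possible to group or facet by period later without maintaining three copies of the data.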

In [None]:
# Three-fold binning of data
data_old = data[data.year.isin([2014,2015])]
data_med = data[data.year.isin([2016,2017])]
data_new = data[data.year.isin([2018,2019])]

# Providing axis limits for features
feature_limits = {
    'danceability': (0.4, 0.9),
    'energy': (0.6, 1),
    'loudness': (-8, 0),
    'speechiness': (0, 0.3),
    'acousticness': (0, 0.3),
    'liveness': (0, 0.6),
    'valence': (0.2, 1),
    'tempo': (50, 250)
}

### Change of track duration over time
Consider the kernel density estimate (KDE) of tracks by duration in the three categories.

In [None]:
f, axes = plt.subplots(3, 1, figsize=(8, 8), sharex='all', sharey='all')

bins = np.linspace(2.5,5,31)

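# A separate loop variable avoids overwriting the global `data` frame
for row, subset in enumerate([data_old, data_med, data_new]):

    # histplot with kde=True draws the normalised histogram and the KDE curve
    # (distplot is deprecated in recent seaborn releases)
    ax = sns.histplot(
        subset.duration,
        stat='density',
        kde=True,
        ax=axes[row],
        color='tab:blue',
        bins=bins)
        
    ax.grid(axis='y', which='major', color='k', linestyle='-', alpha=0.2, zorder=100)
    ax.grid(axis='x', which='major', color='k', linestyle='-', alpha=0.2, zorder=100)
    
    ax.set_xlim(2.5, 5)
    # Fix the tick positions so that the labels below line up with them
    ax.set_xticks(np.linspace(2.5, 5, 6))
    ax.set_xticklabels(['2:30','3:00','3:30','4:00','4:30','5:00'])
    ax.minorticks_on()
    ax.xaxis.set_minor_locator(AutoMinorLocator(6))
    
    ax.set_xlabel("Track duration")
    ax.set_ylabel("KDE")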
    

axes[0].set_title(r"2014-2015 $(n=131)$")
axes[1].set_title(r"2016-2017 $(n=122)$")
axes[2].set_title(r"2018-2019 $(n=104)$")

plt.suptitle("Duration of tracks by PlanetaOfficial, 2014-2019", fontsize=14, y=1.02)
plt.subplots_adjust(top=0.99)
plt.grid(axis='x', linewidth=0.5, zorder=0)
plt.tight_layout()

In 2014-2015, a distribution plateau can be observed between 3:30 and 3:50. Since then, the distribution maximum has slowly moved to the left. In 2018-2019, there is a clear prevalence of songs between 3:30 and 3:35 long. **It can be argued that songs from 2018-2019 are slightly shorter than those from several years earlier.**
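
The position of the distribution maximum can also be estimated numerically instead of being read off the plot. The sketch below uses `scipy.stats.gaussian_kde` on synthetic durations; the sample, the helper names `kde_mode` and `as_min_sec`, and the grid limits are assumptions for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_mode(values, grid_min=2.5, grid_max=5.0, points=501):
    """Location of the KDE maximum on a regular grid (in minutes)."""
    grid = np.linspace(grid_min, grid_max, points)
    density = gaussian_kde(values)(grid)
    return grid[np.argmax(density)]

def as_min_sec(minutes):
    """Format a duration in decimal minutes as m:ss."""
    total_seconds = int(round(minutes * 60))
    return "{}:{:02d}".format(total_seconds // 60, total_seconds % 60)

# Synthetic durations centred on 3.55 min (not real track data)
rng = np.random.default_rng(0)
sample = rng.normal(3.55, 0.3, size=300)

mode = kde_mode(sample)
print(as_min_sec(mode))
```

Applying such a helper to each period would give one peak position per period, which is easier to compare than three curves.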

## Change of features over time

A good way to represent feature changes over time is to draw heatmaps of each feature against duration, for all three time periods. In these heatmaps, regions with more tracks present are redder. In this way, each heatmap can be viewed as a colour-coded two-dimensional KDE. For completeness, all feature heatmaps are displayed over the three time intervals. Again, outliers further than $3\sigma$ are disregarded.
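
What such a heatmap encodes can be made explicit by evaluating a two-dimensional KDE on a grid. This is a sketch on synthetic data using `scipy.stats.gaussian_kde`, not the plotting routine used in this notebook; all values and grid limits are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic duration (minutes) and tempo (BPM) values, not real track data
rng = np.random.default_rng(1)
duration = rng.normal(3.6, 0.3, size=150)
tempo = rng.normal(120, 30, size=150)

# gaussian_kde expects an array of shape (n_dimensions, n_points)
kde = gaussian_kde(np.vstack([duration, tempo]))

# Evaluate the density on a regular duration-tempo grid
xs = np.linspace(2.5, 5, 50)
ys = np.linspace(50, 250, 60)
grid_x, grid_y = np.meshgrid(xs, ys)
density = kde(np.vstack([grid_x.ravel(), grid_y.ravel()])).reshape(grid_x.shape)

# The reddest region of a heatmap corresponds to the largest density value
peak_row, peak_col = np.unravel_index(np.argmax(density), density.shape)
print("densest region near", round(xs[peak_col], 2), "min and", round(ys[peak_row]), "BPM")
```

Each cell of `density` plays the role of one colour value in the heatmap, so comparing periods amounts to comparing these grids.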

In [None]:
f, axes = plt.subplots(7, 3, figsize=(9, 21),sharey='row', sharex='row')

periods = [data_old, data_med, data_new]

# Features 'danceability' ... 'tempo'; 'duration' (last column) is the x-axis
for row, feature in enumerate(data_old.columns[5:-1]):
    
    for col, subset in enumerate(periods):
        # Filled 2-D KDE; keyword x/y and fill= replace the positional
        # arguments and shade=/shade_lowest= of older seaborn releases
        sns.kdeplot(
            x=subset['duration'],
            y=subset[feature],
            fill=True,
            ax=axes[row, col],
            cmap="YlOrRd")
    
    # Axes are shared per row, so limits are set on the first panel only
    axes[row, 0].set_xlim(2.5, 5)
    axes[row, 0].set_xticks(np.linspace(2.5, 5, 6))
    axes[row, 0].set_xticklabels(['2:30','3:00','3:30','4:00','4:30','5:00'])
    
    axes[row, 0].set_ylim(*feature_limits[feature])

# Column titles only need to be set once
axes[0,0].set_title(r"2014-2015 $(n=131)$")
axes[0,1].set_title(r"2016-2017 $(n=122)$")
axes[0,2].set_title(r"2018-2019 $(n=104)$")

plt.suptitle("Feature heatmaps over time, PlanetaOfficial, 2014-2019", fontsize=16, y=1.02)
plt.subplots_adjust(top=0.95)
plt.tight_layout()

In this EDA, only the two heatmaps that indicate the most change are discussed: the **duration-tempo** and **duration-loudness** heatmaps.

### Change of tempo over time

In [None]:
f, axes = plt.subplots(1, 3, figsize=(12, 4),sharey='row', sharex=True)

# One filled 2-D KDE per period; keyword x/y and fill= are required by recent seaborn
for ax, subset in zip(axes, [data_old, data_med, data_new]):
    sns.kdeplot(
        x=subset['duration'],
        y=subset['tempo'],
        fill=True,
        ax=ax,
        cmap="YlOrRd")

# Axes are shared, so limits are set on the first panel only
axes[0].set_xlim(2.5, 5)
axes[0].set_xticks(np.linspace(2.5, 5, 6))
axes[0].set_xticklabels(['2:30','3:00','3:30','4:00','4:30','5:00'])

axes[0].set_ylim(*feature_limits['tempo'])

axes[0].set_title(r"2014-2015 $(n=131)$")
axes[1].set_title(r"2016-2017 $(n=122)$")
axes[2].set_title(r"2018-2019 $(n=104)$")

plt.suptitle("Duration-tempo over time, PlanetaOfficial, 2014-2019", fontsize=14, y=1.02)
plt.subplots_adjust(top=0.95)
plt.tight_layout()

Throughout all intervals, it can be seen that most songs are clustered by tempo in two main regions - around 90 BPM (*slow group*) and around 175 BPM (*fast group*). 

**During 2014-2015, the slow group dominates in count.** It is interesting to see the core of the slow group is elongated horizontally; this indicates the presence of *long* and *short* tracks that are in the slow group. Later on, the slow group retains dominance, but songs are more centered around the 3:30 mark. **Suddenly, in 2018-2019 the fast group becomes at least as big as the slow group in count! We can deduce that in 2018-2019 PlanetaOfficial relies on more songs in the fast group.** However, their mean tempo is slightly lower (about 160 BPM).
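
The relative size of the two groups can be quantified as the share of fast tracks per period. The sketch below uses invented tempo values and a 125 BPM split between the groups; the helper `fast_group_share` and the `period` column are hypothetical.

```python
import pandas as pd

# Invented tempos (BPM) for two periods
toy = pd.DataFrame({
    "period": ["2014-2015"] * 4 + ["2018-2019"] * 4,
    "tempo": [88, 90, 92, 175, 90, 160, 165, 170],
})

def fast_group_share(frame, threshold_bpm=125):
    """Fraction of tracks above the tempo threshold, per period."""
    is_fast = frame["tempo"] > threshold_bpm
    # The mean of a boolean column is the fraction of True values
    return is_fast.groupby(frame["period"]).mean()

share = fast_group_share(toy)
print(share)
```

A rising share from one period to the next would support the visual impression that the fast group gains weight.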

**The finding of this duration-tempo trend is the main achievement of this EDA.**

### Change of loudness over time

In [None]:
f, axes = plt.subplots(1, 3, figsize=(12, 4),sharey='row', sharex=True)

# One filled 2-D KDE per period; keyword x/y and fill= are required by recent seaborn
for ax, subset in zip(axes, [data_old, data_med, data_new]):
    sns.kdeplot(
        x=subset['duration'],
        y=subset['loudness'],
        fill=True,
        ax=ax,
        cmap="YlOrRd")

# Axes are shared, so limits are set on the first panel only
axes[0].set_xlim(2.5, 5)
axes[0].set_xticks(np.linspace(2.5, 5, 6))
axes[0].set_xticklabels(['2:30','3:00','3:30','4:00','4:30','5:00'])

axes[0].set_ylim(*feature_limits['loudness'])

axes[0].set_title(r"2014-2015 $(n=131)$")
axes[1].set_title(r"2016-2017 $(n=122)$")
axes[2].set_title(r"2018-2019 $(n=104)$")

plt.suptitle("Duration-loudness over time, PlanetaOfficial, 2014-2019", fontsize=14, y=1.02)
plt.subplots_adjust(top=0.95)
plt.tight_layout()

Another observation is that the distribution of songs by loudness shrinks over time and moves downwards. **In other words, tracks are becoming a bit quieter and more normalised.** This is very likely due to the introduction of [Spotify Normalization](https://artists.spotify.com/faq/mastering-and-loudness#what-is-loudness-normalization-and-why-is-it-used) in later years:
> Audio files are delivered to Spotify from distributors all over the world and are often mixed/mastered at different volume levels. We want to ensure the best listening experience for users, so we apply Loudness Normalization to create a balance.
>It also levels the playing field between soft and loud masters. Louder tracks have often been cited as sounding better to listeners, so Loudness Normalization removes any unfair advantage.

There are indications that loudness may have correlated with duration in 2014-2015, judging by the shape of the core. That is to say, tracks around the 4:00 mark may have been marginally louder than ones around the 3:00 mark, on average.  It is interesting to see that in the intermediate period, 2016-2017, a subset of tracks have moved rightwards and downwards. In 2018-2019, tracks are distributed in a well-defined center.
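
Both parts of this observation, the downward shift and the shrinking spread, can be summarised with a mean and a standard deviation per period. The values below are invented and only illustrate the computation; the `period` column is a hypothetical label.

```python
import pandas as pd

# Invented loudness values (dB) for two periods
toy = pd.DataFrame({
    "period": ["2014-2015"] * 4 + ["2018-2019"] * 4,
    "loudness": [-2.0, -3.0, -5.0, -6.0, -4.5, -5.0, -5.0, -5.5],
})

# Mean captures the vertical position, standard deviation the spread
summary = toy.groupby("period")["loudness"].agg(["mean", "std"])
print(summary.round(2))
```

A lower mean together with a smaller standard deviation in the later period is the numerical counterpart of a heatmap core that moves down and contracts.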

## Conclusion
Throughout this work, several conclusions could be drawn:
* **There is strong correlation between loudness and energy**, likely because energy takes loudness as an argument in the Spotify API;
* **Artists have unique feature correlations in their solo songs**, e.g. Dzhena vs Preslava. This is in favour of the 'Every artist is unique!' idea. Moreover, these artist-specific tendencies in solo songs are recognised as a good opportunity for future machine learning (artist classification);
* **PlanetaOfficial seems to have three short-term changes in tracks**:
    1. **Duration of songs was slightly lowered in 2018-2019,** though still kept to the industry standard (around 3:30);
    2. **Songs are grouped in two main groups by tempo: high-tempo (>125 BPM) and low-tempo (<125 BPM). High-tempo songs rose much in count in 2018-2019,** compared to 2014-2017. This gives reason to think such a transition was intentional on the part of PlanetaOfficial;
    3. **Songs became quieter.** Their overall loudness was normalised, as a consequence of the Spotify Normalization campaign.

**To the best of our knowledge, this is the first quantitative analysis of Bulgarian pop-folk that is publicly available.**

Please note that this is an ongoing project. Feedback and constructive criticism are very much appreciated! :)