Analysis of Classical Composers and the historical context of their largest compositions...

DATA CLEANING: Classical Composers from Kaggle

In [None]:

import pandas as pd
df_composers = pd.read_csv('/Users/whitneyhollman/Desktop/classical_composers.csv', encoding='ISO-8859-1')

df_composers.head()

In [None]:
df_composers = df_composers.drop(df_composers.index[0])
df_composers.head()


In [None]:
df_composers = df_composers.drop(100)
df_composers


EXPLORATORY DATA ANALYSIS (EDA)

In [None]:
average_duration = df_composers['Duration of Biggest Piece(mins)'].mean()
average_duration


In [None]:
# Fill the NaN values with the average duration computed above (rounded to one decimal)

df_composers.fillna({'Duration of Biggest Piece(mins)': round(average_duration, 1)}, inplace=True)

df_composers
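
The fill step above can be sketched on a small example. This is only an illustration under stated assumptions: the frame `toy` and its values are hypothetical, and only the column name is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real values come from classical_composers.csv
col = 'Duration of Biggest Piece(mins)'
toy = pd.DataFrame({'Composer': ['A', 'B', 'C'], col: [30.0, np.nan, 50.0]})

# mean() skips NaN by default, so the fill value is the mean of the known rows
fill_value = toy[col].mean()

# Assigning the result back avoids relying on inplace=True
toy[col] = toy[col].fillna(fill_value)
print(toy)
```

Computing the fill value from the column, rather than typing it in, keeps the imputation consistent if rows are dropped or added earlier in the notebook.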

In [None]:
# First we will sort by composer and the duration of their biggest piece

df_composers.sort_values(by=['Composer', 'Duration of Biggest Piece(mins)'], ascending=[True, False])


In [None]:
# Group composer by duration

group = df_composers['Composer'].groupby(df_composers['Duration of Biggest Piece(mins)'])

group.head()


In [None]:
# What is the average duration of the biggest piece for composers who died in the same year?
# Group the durations by year of death (not the other way around) before averaging
grouped = df_composers['Duration of Biggest Piece(mins)'].groupby(df_composers['Died'])
grouped.mean()


In [None]:
# Removed Composer Guillaume Dufay because his Nationality was one of two places and it was not clear which one was correct
df_composers = df_composers[df_composers['Composer'] != 'Guillaume Dufay']
df_composers


New dataframe that includes each composer's age at death.

In [None]:
print(df_composers.columns)

subset = df_composers[df_composers['Died'] - df_composers[' Born']>15].copy()

subset['AgeAtDeath'] = subset['Died' ]- subset[' Born']
subset

In [None]:
subset_sorted = subset.sort_values(by='AgeAtDeath', ascending=True)
subset_sorted


In [None]:
subset_sorted.columns = subset_sorted.columns.str.strip()
subset_sorted['AgeAtDeath'] = subset_sorted['Died' ]- subset_sorted['Born']
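
As a small illustration of why the column names are stripped before the subtraction, the sketch below uses a hypothetical two-row frame whose header has the same leading space as `' Born'`. Note that subtracting years can overstate the true age by up to one year, because months and days are not available.

```python
import pandas as pd

# Hypothetical rows; note the leading space in ' Born', as in the raw CSV header
toy = pd.DataFrame({'Composer': ['A', 'B'], ' Born': [1700, 1800], 'Died': [1760, 1835]})

# Strip stray whitespace from every column name so 'Born' can be used directly
toy.columns = toy.columns.str.strip()

# Age at death approximated as the difference between the two years
toy['AgeAtDeath'] = toy['Died'] - toy['Born']
print(toy)
```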

In [None]:
from tabulate import tabulate

average_age_at_death = subset_sorted.groupby('Nationality')['AgeAtDeath'].mean()
average_age_at_death

average_age_at_death_nationality_df = average_age_at_death.reset_index()

sorted_df = average_age_at_death_nationality_df.sort_values(by='AgeAtDeath', ascending=True)
print(tabulate(sorted_df, headers='keys', tablefmt='rst'))



In [None]:
# Data visualization for age at death by nationality

import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")

sns.catplot(
    data = sorted_df, 
    x = 'AgeAtDeath', y='Nationality',
    hue='AgeAtDeath',
    kind="swarm"
)



In [None]:
# subset_sorted by nationality
for name, group in subset_sorted.groupby('Nationality'):
    print(f"Nationality : {name}")
    print(group[['Composer', 'AgeAtDeath']], "\n")

In [None]:
# Nationality of composer and age at death in ascending order
ncd_df = subset_sorted[['Nationality', 'Composer', 'AgeAtDeath']]

ncd_df_ = ncd_df.groupby('Nationality')
print(tabulate(ncd_df, headers="keys", tablefmt="rst"))


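In this dataset, the youngest composer at death is Ludwig van Beethoven, listed at age 21, and the oldest is Jean Sibelius at age 92. The Beethoven figure is worth checking against the source data, since he is generally recorded as having lived from 1770 to 1827.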

In [26]:
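# Import the function (not the module) so earlier cells that call tabulate(...) still work when re-run
from tabulate import tabulate
import altair as alt 

nd_df = subset_sorted[['Nationality', 'AgeAtDeath']]

nd_df_ = nd_df.groupby('Nationality')
print(tabulate(nd_df, headers="keys", tablefmt="rst"))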


chart = alt.Chart(nd_df).mark_bar().encode(
    x=alt.X('Nationality', title='Nationality'),
    y=alt.Y('AgeAtDeath', title='Age at death'),
    color='Nationality',
    tooltip=['Nationality', 'AgeAtDeath']
).properties(
    title='Composers by Nationality and Age at Death'
)

chart.display()

  ..  Nationality       AgeAtDeath
   1  German                    21
  81  Italian                   26
   7  Austrian                  31
  73  Italian                   34
   2  Austrian                  35
  33  English                   36
  58  French                    37
  13  German                    38
  63  American                  39
  12  Polish                    39
  40  German                    40
  91  English                   42
  31  Russian                   42
  57  Russian                   43
  87  Austrian                  43
  42  Italian                   45
  11  German                    46
  71  Austrian                  50
  64  Italian                   51
  18  Austrian                  51
  98  Russian                   53
   8  Russian                   53
  72  Russian                   54
  14  French                    56
  85  Italian                   57
  77  Italian                   58
  53  French                    59
  55  Czech         

This bar chart gives us a first glance at the interplay between the composers, their country of origin, and the historical context of their compositions. A further dive into the data will hopefully reveal how historical, cultural, and personal factors shaped the lifespans of these creative historical figures.

DESCRIPTIVE ANALYSIS

Composer's Productivity Over Their Lifetime...

In [None]:
# How does a composer's lifespan correlate with their productivity? Productivity is measured here by the duration of the longest piece.

correlation = subset_sorted['AgeAtDeath'].corr(subset_sorted['Duration of Biggest Piece(mins)'])

print(f'Correlation Coefficient: {correlation}')


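With a correlation coefficient of -0.4763305, there is a moderate negative correlation between a composer's age at death and the duration of their longest piece: composers who lived longer tended to have somewhat shorter longest pieces. The coefficient is not close to zero, but it is also far from -1. Squaring it gives roughly 0.23, so lifespan accounts for only about 23% of the variation in duration. How long a composer lived is therefore a weak predictor of how long their longest piece was, and vice versa.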
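
To make the size of that coefficient concrete, here is a minimal sketch that squares the reported value and also computes a Pearson correlation on a small hypothetical sample. The `age` and `duration` values are invented for illustration; only the coefficient -0.4763305 comes from the notebook.

```python
import pandas as pd

# Coefficient reported by the notebook cell above
r_reported = -0.4763305

# r squared is the share of variance in one variable explained by a linear fit on the other
r_squared = r_reported ** 2
print(f'r^2 = {r_squared:.3f}')

# Hypothetical toy sample showing the same call the notebook uses (Series.corr, Pearson by default)
age = pd.Series([30, 40, 50, 60, 70])
duration = pd.Series([50, 45, 48, 30, 35])
r_toy = age.corr(duration)
print(f'toy r = {r_toy:.3f}')
```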

In [None]:
# There are 98 composers
subset_sorted['Composer'].count()


Summary statistics to find any relationships between a composer and the longest piece they ever wrote.

In [None]:
# summary stats for the longest piece collectively
subset_sorted['Duration of Biggest Piece(mins)'].describe()

In [None]:
# Summary statistic for each composer's longest piece

composer_summary = subset_sorted.groupby('Composer')['Duration of Biggest Piece(mins)'].describe()


composer_summary


Nationality and Musical Output

In [None]:
composer_nationality_means = subset_sorted.groupby(['Composer', 'Nationality'])['Duration of Biggest Piece(mins)'].mean().reset_index()

In [None]:
# Sorting composers by mean duration
composer_nationality_means_sorted = composer_nationality_means.sort_values(by='Duration of Biggest Piece(mins)', ascending=False)

top_10 = composer_nationality_means_sorted.head(10)
bottom_10 = composer_nationality_means_sorted.tail(10)


In [None]:
# Combining the two (top 10 and bottom 10) into a single DataFrame

top_10 = pd.concat([top_10, bottom_10], axis=0).reset_index(drop=True)

In [None]:
import plotly.express as px

fig_top_10 = px.bar(top_10, 
                    x='Composer',
                    y='Duration of Biggest Piece(mins)',
                    color = 'Duration of Biggest Piece(mins)',
                    hover_data=['Nationality'],
                    title='Top and Bottom 10 Composers by Mean Duration of Longest Piece', 
                    labels={'Duration of Biggest Piece(mins)': 'Mean Duration(mins)', 'Composer': 'Composer'})

fig_top_10.update_layout(xaxis_title='Composer',
                  yaxis_title='Mean Duration (mins)',
                  xaxis={'categoryorder':'total descending'},
                  yaxis=dict(type='linear'),
                  template='plotly_white')

fig_top_10.show()

According to the plot above, the 10 composers with the longest pieces are mostly of German or Austrian descent, whereas the 10 composers with the shortest pieces are predominantly of French and English descent. One could conjecture that there is an association between a composer's nationality and the duration of their longest piece, although 20 composers is a small sample.

Age groups of Composers vs. the duration of their longest piece

In [None]:
viz = subset_sorted[['Composer', 'AgeAtDeath', 'Duration of Biggest Piece(mins)']].copy() 
viz

# Group by age in 10s
viz['AgeGroup'] = pd.cut(viz['AgeAtDeath'], bins=range(0, 101, 10), right=False)
grouped_age = viz.groupby('AgeGroup').size()

grouped_age
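
The binning above uses `right=False`, which makes each bin include its lower edge and exclude its upper edge. The sketch below illustrates that behaviour on a few hypothetical ages; the values are not taken from the dataset.

```python
import pandas as pd

# Hypothetical ages at death, chosen to sit on and around bin edges
ages = pd.Series([21, 39, 40, 92])

# Same bins as the notebook: [0, 10), [10, 20), ..., [90, 100)
groups = pd.cut(ages, bins=range(0, 101, 10), right=False)
print(groups)

# An age of exactly 100 falls outside the last bin [90, 100) and becomes NaN
out_of_range = pd.cut(pd.Series([100]), bins=range(0, 101, 10), right=False)
print(out_of_range)
```

If a composer in the data had reached 100, extending the bins to `range(0, 111, 10)` would be one way to keep that row.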


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


sns.set_theme(style="whitegrid")

ax =sns.catplot(
    data = viz, 
    x = 'AgeGroup', y='Duration of Biggest Piece(mins)',
    hue='AgeAtDeath',
    kind="swarm",
)
ax.set_xticklabels(rotation=60)




This visualization tells us that across all age groups the average piece is about 40 minutes. Composers who died before the age of forty have pieces of average duration or less. Composers in the ninety-to-one-hundred age group also wrote compositions of forty minutes or less. Interestingly, the longest pieces were written by composers in the sixty-to-eighty age groups. From the previous analysis we also know that the top three composers by duration are German and Italian, and that the shortest compositions were written by English and French composers.

COMPARATIVE ANALYSIS

Era vs. Composition Lengths

Baroque: 1600–1750

Classical: 1750–1820

Romantic: 1820–1900

Modern: 1900–present

In [None]:
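# .copy() makes this an independent frame, so adding the 'Era' column later does not trigger a SettingWithCopyWarning
historical_era = subset_sorted[['Composer','Nationality', 'Born', 'Died', 'AgeAtDeath', 'Duration of Biggest Piece(mins)']].copy()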
historical_era

df_historical_era = pd.DataFrame(historical_era)
df_historical_era.head()

In [None]:
def categorize_era(row):
    # Assign an era from the composer's birth year; anyone born before 1750 counts as Baroque
    if row['Born'] < 1750:
        return 'Baroque'
    elif row['Born'] < 1820:
        return 'Classical'
    elif row['Born'] < 1900:
        return 'Romantic'
    else:
        return 'Modern'


df_historical_era['Era'] = df_historical_era.apply(categorize_era, axis=1)
df_historical_era.head()
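
The same era boundaries can also be applied without a row-wise `apply`. The following is a sketch of one alternative, not the notebook's method: it uses `pd.cut` with open-ended outer bins, and the birth years are hypothetical values chosen to sit on the boundaries.

```python
import numpy as np
import pandas as pd

# Hypothetical birth years, including the boundary years 1750, 1820 and 1900
born = pd.Series([1685, 1750, 1770, 1820, 1899, 1900])

# right=False gives [lower, upper) bins, matching the "< 1750", "< 1820", "< 1900" checks
era = pd.cut(
    born,
    bins=[-np.inf, 1750, 1820, 1900, np.inf],
    labels=['Baroque', 'Classical', 'Romantic', 'Modern'],
    right=False,
)
print(pd.DataFrame({'Born': born, 'Era': era}))
```

The result is a categorical column, which also keeps the eras in chronological order when grouping or plotting.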

In [None]:
agg_data = df_historical_era.groupby('Era')['Duration of Biggest Piece(mins)'].mean().reset_index()

agg_data

In [None]:
# Pass the aggregated records to Altair as inline data (alt.Chart also accepts a DataFrame directly)

chart_data = alt.Data(values=agg_data.to_dict('records'))

In [None]:

chart = alt.Chart(chart_data).mark_bar().encode(
    y=alt.Y('Era:N', title='Musical Era', sort='-x'),  # Note the sort is now '-x'
    x=alt.X('mean(Duration of Biggest Piece(mins)):Q', title='Average Length of Longest Composition (mins)'),
    color='Era:N'
).properties(title='Average Length of Longest Compositions by Musical Era')

chart.display()


Grouping the data by musical era reveals a clear trend: compositions start long and elaborate in the Baroque era, maintain their length but gain clarity in the Classical period, and then gradually shorten. Romantic pieces dial back on duration, favoring emotion over length. The Modern era continues this trend towards conciseness, blending brevity with innovative twists. This pattern reveals not just shifts in musical preferences but also echoes broader cultural transformations over time.

Average Age of Composer at Peak Creativity per Era and Nationality

In [None]:
# New dataframe 

peak_creativity = subset_sorted[['Composer', 'Nationality', 'Born', 'Died', 'Duration of Biggest Piece(mins)', 'AgeAtDeath']].copy()

peak_creativity_df=pd.DataFrame(peak_creativity)
peak_creativity_df.head()

peak_creativity_df['Era'] = peak_creativity_df.apply(categorize_era, axis=1)
peak_creativity_df.head()
    


In [None]:
# What is the average age at death for each era and duration of biggest piece? 

era_age_at_death = peak_creativity_df.groupby('Era')['AgeAtDeath'].mean().reset_index()
era_age_at_death

In [None]:
combined_era_peak = pd.merge(agg_data, era_age_at_death, on='Era', how='inner')
combined_era_peak

In [None]:
nationality_peak = peak_creativity_df.groupby('Nationality')['Duration of Biggest Piece(mins)'].mean().reset_index()
nationality_peak

In [None]:
creative_peak = df_historical_era.groupby(['Nationality', 'Era']).agg({
    'AgeAtDeath': 'mean',
    'Duration of Biggest Piece(mins)': 'mean'
}).reset_index()



In [None]:
# Rename columns for average dataframe

creative_peak.rename(columns={
    'AgeAtDeath': "Average Age at Death",
    'Duration of Biggest Piece(mins)': 'Average Piece(mins)',
    'Era': 'Historical Era'
}, inplace=True)
creative_peak

In [None]:
import altair as alt


facet_chart = alt.Chart(creative_peak).mark_bar().encode(
    x='Average Age at Death:Q',
    y='Average Piece(mins):Q',
    color='Nationality:N',
    tooltip=['Nationality', 'Historical Era', 'Average Piece(mins)', 'Average Age at Death']
).properties(
    width=200,
    height=150
).facet(
    column='Historical Era:N'
)

facet_chart.display() 


These patterns reveal cultural undercurrents across eras. The Romantic era's uniformity hints at strong shared values and an emphasis on innovation, individualism, and emotional expression, leading composers to produce similar works despite diverse backgrounds. The Classical era's consistent music style, appealing to a broad age range, indicates its cross-generational resonance. The Baroque era, with its wide range in music and lifespans, showcases a period rich in experimentation and innovation. Meanwhile, the Modern era's concise compositions and uniform lifespans reflect societal shifts towards efficiency and a collective move towards more direct, impactful expression.

MACHINE LEARNING WITH CLUSTERING

In [None]:
# Use k-means to cluster the composers into 3 groups based on their birth year and the duration of their biggest piece

from sklearn.cluster import KMeans
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Select the columns to cluster
cluster_data = peak_creativity_df[['Born', 'Duration of Biggest Piece(mins)']].copy()

# Create a pipeline to scale the data and then cluster it 
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # fixed seed so the cluster labels are reproducible




In [None]:
pipeline = make_pipeline(scaler, kmeans)

# Fit the pipeline to the data
pipeline.fit(cluster_data)


In [None]:
# Add the cluster labels to the original data
cluster_labels = pipeline.predict(cluster_data)
peak_creativity_df['Cluster'] = cluster_labels
peak_creativity_df.head()


In [None]:
cluster_mean = cluster_data.groupby(cluster_labels)[['Born', 'Duration of Biggest Piece(mins)']].mean()
cluster_mean

In [None]:
cluster_standard_deviation = cluster_data.groupby(cluster_labels)[['Born', 'Duration of Biggest Piece(mins)']].std()
cluster_standard_deviation

In [None]:
# Visualize the clusters
sns.scatterplot(data=peak_creativity_df, x='Born', y='Duration of Biggest Piece(mins)', hue='Cluster', size='Era', sizes=(30,150), style='Era', palette='viridis')
plt.title('Clusters with Historical Era Context')
plt.show()

Clustering with the k-means algorithm surfaced a pattern that the rest of the analysis had not touched on. The Baroque era is the only era that spans all three clusters. Classical, Romantic, and Modern composers fall into tighter groups, with small standard deviations around each cluster mean, which suggests a consistent relationship between era and composition length. Because the clustering used only birth year and piece duration, any conclusion about the homogeneity of composers' lifespans has to come from the earlier age-at-death analysis rather than from the clusters themselves.

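Why would the Baroque era appear in every cluster?
It suggests that the Baroque era, with its rich array of musical styles and forms, acts as a bridge connecting the characteristics of earlier and later periods of classical music. One likely contributor is also visible in the code: categorize_era labels every composer born before 1750 as Baroque, so that era covers the widest range of birth years, and birth year is one of the two clustering features.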
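
One way to check how eras are spread across clusters is a cross-tabulation. The sketch below is only illustrative: the `Era` and `Cluster` values are hypothetical stand-ins for the columns in `peak_creativity_df`, not the notebook's actual labels.

```python
import pandas as pd

# Hypothetical era and cluster assignments for six composers
toy = pd.DataFrame({
    'Era': ['Baroque', 'Baroque', 'Baroque', 'Classical', 'Romantic', 'Modern'],
    'Cluster': [0, 1, 2, 1, 2, 2],
})

# Rows are eras, columns are cluster labels, cells are composer counts
counts = pd.crosstab(toy['Era'], toy['Cluster'])
print(counts)

# Number of distinct clusters in which each era appears
clusters_per_era = (counts > 0).sum(axis=1)
print(clusters_per_era)
```

Running the same two calls on the real `Era` and `Cluster` columns would show directly whether only the Baroque era reaches all three clusters.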

FINAL ADVANCED ANALYSIS

Influence of Nationality on Career Lengths

Do composers from certain countries have longer careers than others?

In [None]:
# Rerun peak_creativity_df without the cluster column and new name
influence = subset_sorted[['Composer', 'Nationality', 'Born', 'Died', 'Duration of Biggest Piece(mins)']].copy()

influence_df=pd.DataFrame(influence)
influence_df.head()

influence_df['Era'] = influence_df.apply(categorize_era, axis=1)
influence_df.head()

In [None]:
# This is the composer's lifespan (Died - Born), used here as a stand-in for career length
influence_df['CareerLength'] = influence_df['Died'] - influence_df['Born']
influence_df.head()

In [None]:
nationality_career_lengths = influence_df.groupby('Nationality')['CareerLength'].agg(['mean', 'median', 'std', 'min', 'max']).reset_index()
nationality_career_lengths

In [None]:
import pandas as pd

df_nationality_career_lengths = pd.DataFrame(nationality_career_lengths)
print(df_nationality_career_lengths)


In [None]:
nationality_career_filtered = df_nationality_career_lengths.dropna(subset=['std'])
nationality_career_filtered

In [1]:
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd



data = {
    'Nationality': ['American', 'Austrian', 'Czech', 'English', 'French', 'German', 'Hungarian', 'Italian', 'Russian', 'Spanish'],
    'mean': [72.0, 57.2, 65.666667, 66.2, 69.176471, 65.588235, 69.5, 60.071429, 60.909091, 66.0],
    'std': [19.634154, 17.573971, 7.371115, 17.242067, 12.430986, 18.668352, 7.778175, 16.909098, 13.787346, 5.656854],
    'median': [80.0, 56.5, 63.0, 65.0, 66.0, 70.0, 69.5, 61.5, 62.0, 66.0],
    'min': [39.0, 31.0, 60.0, 36.0, 37.0, 21.0, 64.0, 26.0, 42.0, 62.0],
    'max': [90.0, 77.0, 74.0, 86.0, 86.0, 87.0, 75.0, 88.0, 89.0, 70.0]
}
df = pd.DataFrame(data)

# Build a new list so the module-level Plotly palette is not mutated in place
colors = px.colors.qualitative.Plotly + px.colors.qualitative.Alphabet

color_map = {nat: colors[i] for i, nat in enumerate(df['Nationality'])}

fig = go.Figure()


fig.add_trace(go.Bar(
    x=df['Nationality'], 
    y=df['mean'], 
    name='Mean Career Length',
    marker_color=[color_map[nat] for nat in df['Nationality']],
    error_y=dict(
        type='data',
        array=df['std'],
        visible=True,
        color='black'),
    opacity=0.7
))


fig.add_trace(go.Scatter(
    x=df['Nationality'], 
    y=df['median'], 
    mode='markers', 
    name='Median',
    marker=dict(color=[color_map[nat] for nat in df['Nationality']], size=10)
    ))

fig.add_trace(go.Scatter(
    x=df['Nationality'], 
    y=df['min'], 
    mode='markers', 
    name='Min',
    marker=dict(color='rgba(135, 206, 250, 0.8)', size=10),
    ))

fig.add_trace(go.Scatter(
    x=df['Nationality'], 
    y=df['max'], 
    mode='markers', 
    name='Max',
    marker=dict(color='rgba(255, 165, 0, 0.8)', size=10)))


fig.update_layout(
    title='Career Length Statistics by Nationality',
    xaxis_title='Nationality',
    yaxis_title='Career Length (years)',
    template='plotly_white')

fig.update_layout({
    'plot_bgcolor': 'rgba(0,0,0,0)', 
    'paper_bgcolor': 'rgba(0,0,0,0)',  
    'title_font': {'size': 24},
    'font_family': "Arial, sans-serif",   
    'font_color': "black",
    'xaxis_title': "Nationality",
    'yaxis_title': "Career Length (years)",
    'xaxis': {'tickangle': 45, 'title_standoff': 25},
})

fig.show()


The standard deviation of career length varies little for many nationalities, but not all: about 40% of the nationalities show widely spread career lengths. The uniformity of the other 60% still offers several insights. The small variation in career lengths indicated by the 'std' values suggests a consistency in the conditions that may have influenced composers' career lengths regardless of nationality. In other words, there may be universal factors influencing a career as a composer that go beyond cultural or geographical differences.

The uniformity of career lengths might indicate that factors such as historical events, economic conditions, education, and resources, as well as the classical music community, standardized the external effects on a composer's career. It could also reflect cultural and systemic support for composers in various countries, which would suggest that the classical music tradition has a robust worldwide system that nurtures and sustains composers' careers.
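
Because the mean career length differs by nationality, one hedged way to compare spread is the coefficient of variation (standard deviation divided by mean). The sketch below reuses the `mean` and `std` values from the plotting cell above; the `cv` column and the ranking are additions for illustration, not part of the original analysis.

```python
import pandas as pd

# mean and std copied from the 'data' dictionary in the plotting cell
stats = pd.DataFrame({
    'Nationality': ['American', 'Austrian', 'Czech', 'English', 'French',
                    'German', 'Hungarian', 'Italian', 'Russian', 'Spanish'],
    'mean': [72.0, 57.2, 65.666667, 66.2, 69.176471,
             65.588235, 69.5, 60.071429, 60.909091, 66.0],
    'std': [19.634154, 17.573971, 7.371115, 17.242067, 12.430986,
            18.668352, 7.778175, 16.909098, 13.787346, 5.656854],
}).set_index('Nationality')

# Coefficient of variation: spread relative to the typical value
stats['cv'] = stats['std'] / stats['mean']

# Smallest relative spread first
ranked = stats['cv'].sort_values()
print(ranked.round(3))
```

Any cut-off for calling a nationality "uniform" or "diverse" on this scale is a judgment call, so it is worth stating the threshold next to a percentage split.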