## ML Analysis: Predicting Happiness Group from Music Sentiment

**Goal:** Test whether a country’s average music sentiment predicts whether it belongs to the high-happiness or the low-happiness group.

**Target (y):** `happiness_group` (0 = low, 1 = high), created via a median split on `Ladder score`  
**Feature (X):** `Avg_Music_Sentiment` (country-level average)

## 1) Setup

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


## 2) Load dataset

In [2]:
df = pd.read_csv("final_project_dataset.csv")
print(df.shape)
df.head()


(95, 13)


Unnamed: 0,Country_Code,Avg_Music_Sentiment,Country_Name,Ranking,Country,Regional indicator,Ladder score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,PH,0.162485,Philippines,53,Philippines,Southeast Asia,60476,575533,7089,68,95708,24622,23652
1,LK,0.149765,Sri Lanka,125,Sri Lanka,South Asia,38981,635814,72914,73,67536,35942,5363
2,NZ,0.13387,New Zealand,11,New Zealand,North America and ANZ,70292,845752,94458,75,86357,56406,835
3,GE,0.104214,Georgia,89,Georgia,Commonwealth of Independent States,51847,685312,61243,71,78743,0,30305
4,PK,0.083081,Pakistan,105,Pakistan,South Asia,46567,499198,3714,65,62774,36035,12861


## 3) Cleaning and Type Conversion

In [3]:
print("Rows total:", len(df))
print(df[["Country", "Ladder score", "Avg_Music_Sentiment"]].head(10))


Rows total: 95
       Country Ladder score  Avg_Music_Sentiment
0  Philippines       6,0476             0.162485
1    Sri Lanka       3,8981             0.149765
2  New Zealand       7,0292             0.133870
3      Georgia       5,1847             0.104214
4     Pakistan       4,6567             0.083081
5    Australia       7,0569             0.076077
6        India       4,0541             0.074970
7       Canada       6,8996             0.072201
8      Nigeria       4,8808             0.070058
9       Kuwait       6,9514             0.066855


In [12]:
# Convert Ladder score from '6,0476' -> 6.0476
df["Ladder score"] = (
    df["Ladder score"].astype(str).str.replace(",", ".", regex=False)
)
df["Ladder score"] = pd.to_numeric(df["Ladder score"], errors="coerce")

# Ensure sentiment is numeric
df["Avg_Music_Sentiment"] = pd.to_numeric(df["Avg_Music_Sentiment"], errors="coerce")

# Drop rows missing the key columns
df = df.dropna(subset=["Country", "Ladder score", "Avg_Music_Sentiment"]).copy()

print("Rows after cleaning:", len(df))
print(df[["Country", "Ladder score", "Avg_Music_Sentiment"]].head().to_string())


Rows after cleaning: 95
       Country  Ladder score  Avg_Music_Sentiment
0  Philippines        6.0476             0.162485
1    Sri Lanka        3.8981             0.149765
2  New Zealand        7.0292             0.133870
3      Georgia        5.1847             0.104214
4     Pakistan        4.6567             0.083081

## 4) Create the binary target (median split)

Countries with `Ladder score >= median` are labeled 1 (high), otherwise 0 (low).

In [14]:
median_score = df["Ladder score"].median()
df["happiness_group"] = (df["Ladder score"] >= median_score).astype(int)

print("Median Ladder score:", median_score)
print(df["happiness_group"].value_counts())


Median Ladder score: 6.0598
happiness_group
1    48
0    47
Name: count, dtype: int64
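
The 48/47 split follows from the `>=` rule: with an odd number of countries (95), the median is itself one of the observed scores, so the country sitting exactly at the median falls into the high group. A minimal sketch on made-up scores (the values and the `toy_` names are illustrative, not taken from the dataset) shows the same effect using only the standard library:

```python
from statistics import median

# Illustrative scores only; these are NOT taken from the dataset.
toy_scores = [3.9, 4.7, 5.2, 6.0, 6.9, 7.0, 7.1]

# With an odd number of values, the median is one of the observed values.
toy_median = median(toy_scores)

# Same rule as the notebook: score >= median -> 1 (high), otherwise 0 (low).
toy_groups = [1 if score >= toy_median else 0 for score in toy_scores]

n_high = sum(toy_groups)
n_low = len(toy_groups) - n_high
print("median:", toy_median)
print("high:", n_high, "low:", n_low)
```

Using `>` instead of `>=` would move the median country to the low group and flip which class is larger by one.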


In [7]:
print(df[["Country", "Ladder score", "happiness_group"]].head().to_string())


       Country  Ladder score  happiness_group
0  Philippines        6.0476                0
1    Sri Lanka        3.8981                0
2  New Zealand        7.0292                1
3      Georgia        5.1847                0
4     Pakistan        4.6567                0

## 5) Train–test split

In [15]:
X = df[["Avg_Music_Sentiment"]]
y = df["happiness_group"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train size:", len(X_train), "Test size:", len(X_test))

Train size: 76 Test size: 19


## 6) Model 1: Logistic Regression (linear baseline)

In [9]:
# Pipeline, StandardScaler, LogisticRegression, and the metric functions are imported in Setup.

logreg = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(random_state=42))
])

logreg.fit(X_train, y_train)
pred1 = logreg.predict(X_test)

print("LogReg Accuracy:", accuracy_score(y_test, pred1))
print(confusion_matrix(y_test, pred1))
print(classification_report(y_test, pred1))


LogReg Accuracy: 0.3684210526315789
[[2 7]
 [5 5]]
              precision    recall  f1-score   support

           0       0.29      0.22      0.25         9
           1       0.42      0.50      0.45        10

    accuracy                           0.37        19
   macro avg       0.35      0.36      0.35        19
weighted avg       0.35      0.37      0.36        19
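
To make the report easier to read, its headline numbers can be recomputed by hand from the confusion matrix above, where rows are true classes and columns are predicted classes (scikit-learn's convention). This is only a worked-arithmetic sketch; the variable names are illustrative:

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[2, 7],
      [5, 5]]

tn, fp = cm[0]  # true 0 predicted as 0, true 0 predicted as 1
fn, tp = cm[1]  # true 1 predicted as 0, true 1 predicted as 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)  # share of "high" predictions that are truly high
recall_1 = tp / (tp + fn)     # share of truly high countries that were found
precision_0 = tn / (tn + fn)  # share of "low" predictions that are truly low
recall_0 = tn / (tn + fp)     # share of truly low countries that were found

print(f"accuracy    = {accuracy:.2f}")
print(f"class 0: precision = {precision_0:.2f}, recall = {recall_0:.2f}")
print(f"class 1: precision = {precision_1:.2f}, recall = {recall_1:.2f}")
```

The same arithmetic applies to any other 2x2 confusion matrix in this notebook by swapping in its four counts.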



## 7) Model 2: Random Forest (non-linear comparison)

In [10]:
# RandomForestClassifier is imported in Setup.

rf = RandomForestClassifier(
    n_estimators=500,
    random_state=42,
    class_weight="balanced"
)

rf.fit(X_train, y_train)
pred2 = rf.predict(X_test)

print("RF Accuracy:", accuracy_score(y_test, pred2))
print(confusion_matrix(y_test, pred2))
print(classification_report(y_test, pred2))


RF Accuracy: 0.3157894736842105
[[5 4]
 [9 1]]
              precision    recall  f1-score   support

           0       0.36      0.56      0.43         9
           1       0.20      0.10      0.13        10

    accuracy                           0.32        19
   macro avg       0.28      0.33      0.28        19
weighted avg       0.27      0.32      0.28        19



## 8) Baseline 0: Dummy classifier (majority class)

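The dummy classifier always predicts the most frequent class in the training set, so its accuracy is the floor that any useful model should beat. Because the two classes are almost perfectly balanced, that floor sits close to 50%.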


In [11]:
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
pred_dummy = dummy.predict(X_test)

print("Dummy Accuracy:", accuracy_score(y_test, pred_dummy))


Dummy Accuracy: 0.47368421052631576


## 9) Model comparison

In [17]:
results = pd.DataFrame({
    "Model": ["Dummy (most frequent)", "Logistic Regression", "Random Forest"],
    "Accuracy": [
        accuracy_score(y_test, pred_dummy),
        accuracy_score(y_test, pred1),
        accuracy_score(y_test, pred2),
    ],
})

results.sort_values("Accuracy", ascending=False)


                   Model  Accuracy
0  Dummy (most frequent)  0.473684
1    Logistic Regression  0.368421
2          Random Forest  0.315789


## 10) Interpretation

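Both models score below the Dummy baseline on the test set (0.37 for Logistic Regression and 0.32 for Random Forest, versus 0.47), so this analysis finds no evidence that `Avg_Music_Sentiment` carries predictive information for `happiness_group`. With only 19 test countries, each prediction moves accuracy by about 5 percentage points, so these estimates are noisy. Scores this close to chance are better read as "no detectable signal" than as proof that no relationship exists.

This is consistent with the EDA and hypothesis-testing results (e.g., highly overlapping distributions and non-significant group differences).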
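
To give a sense of how uncertain accuracies measured on 19 countries are, the sketch below computes an approximate 95% Wilson score interval for each model. This is an illustrative addition rather than part of the original analysis: the helper `wilson_interval` is a hypothetical name, and the counts of correct predictions (9, 7, and 6 out of 19) are read off the accuracies reported above.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion correct / n."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Correct predictions out of 19 test countries, from the reported accuracies.
counts = {
    "Dummy (most frequent)": 9,
    "Logistic Regression": 7,
    "Random Forest": 6,
}
intervals = {name: wilson_interval(k, 19) for name, k in counts.items()}

for name, (low, high) in intervals.items():
    print(f"{name}: {low:.2f} to {high:.2f}")
```

Under this approximation each interval spans roughly 40 percentage points and all three include 0.5, so the test set is too small to separate the models from each other or from chance.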


## 11) Limitations & future work

- Only one aggregated feature (`Avg_Music_Sentiment`) is used here to isolate its relationship with happiness.
- The sample is small (95 countries, 19 of them in the test set), so the accuracy estimates carry wide uncertainty.
- Country-level averages may hide within-country variability (genre, language, demographics, listening context).
- Future work: richer features (genre distribution, temporal trends, variance of sentiment, number of songs), and/or multi-feature models with careful interpretation.
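
One way to reduce the dependence on a single 76/19 split would be repeated stratified cross-validation. The sketch below is a suggestion, not something the original analysis ran: it uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical and contain pure noise) so that it runs on its own. In the notebook, `X` and `y` from the train–test split section would take their place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): one noise feature, 47 low / 48 high labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(95, 1))
y_demo = np.array([0] * 47 + [1] * 48)
rng.shuffle(y_demo)

# Same kind of pipeline as the logistic-regression model in this notebook.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# 5 folds repeated 20 times gives 100 accuracy estimates instead of one.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"mean accuracy: {scores.mean():.3f}")
print(f"std across folds: {scores.std():.3f}")
```

The spread across folds gives a rough picture of how much a single split can mislead, although with 95 countries the estimate stays coarse.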
