**Logistic Regression to predict if song titles contain love**

For this project, we worked with a songs dataset from the Spotify API. We found that "love" was one of the most frequent words in song titles, so we decided to explore which variables are associated with a song being about love, measured by whether its title contains the word "love". Our question of interest was:    
Is popularity score or valence associated with whether a song is about love?    
To answer this question, we wanted our output to be the predicted probability that a song has "love" in its title, which is the kind of output a logistic regression produces.
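
As a sketch of what that output looks like, a logistic regression on the two predictors named in the question models the probability as follows. The symbols are notation assumed here for illustration: $x_{\text{pop}}$ is the popularity score, $x_{\text{val}}$ is valence, and the $\beta$ terms are the fitted coefficients.

$$
P(\text{love} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{\text{pop}} + \beta_2 x_{\text{val}})}}
$$

A positive $\beta_1$ or $\beta_2$ would mean the predicted probability rises with that predictor, holding the other fixed.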

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import pandas as pd
import seaborn as sns


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
#Read in datasets
spotify_data = pd.read_csv("../data/spotify_data.csv")

In [None]:
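# Add the 'Has Word Love?' column
song_attribute_data = spotify_data

# Split titles into words (not used in the cells shown here)
titles = song_attribute_data["Song"]
titles = titles.str.split(pat=" ", expand=True)

# Case-sensitive substring match: flags "Love" and "Lovely", but not lowercase "love"
song_attribute_data["Has Word Love?"] = song_attribute_data['Song'].str.contains("Love")
song_attribute_data_love = song_attribute_data[song_attribute_data["Has Word Love?"]]

# Drop rows with any missing values
song_attribute_data = song_attribute_data.dropna()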
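
How "love" is matched changes which titles count as being about love. The sketch below uses a few made-up titles (not rows from the dataset) to compare a case-sensitive substring match with two alternatives. The variable names are illustrative, and `na=False` is an added assumption that treats missing titles as not containing the word.

```python
import pandas as pd

# Made-up titles for illustration only
titles = pd.Series(["Love Me Do", "Endless love", "Gloves Off", "Lovely Day", None])

# Case-sensitive substring match on "Love" (na=False maps missing titles to False)
case_sensitive = titles.str.contains("Love", na=False)

# Case-insensitive substring match: catches "love", but "Gloves" and "Lovely" match too
any_case = titles.str.contains("love", case=False, na=False)

# Whole-word, case-insensitive match using regex word boundaries
whole_word = titles.str.contains(r"\blove\b", case=False, regex=True, na=False)

comparison = pd.DataFrame({
    "title": titles,
    "case_sensitive": case_sensitive,
    "any_case": any_case,
    "whole_word": whole_word,
})
print(comparison)
```

Which rule is appropriate depends on how strictly "about love" should be read; the whole-word version is the most conservative of the three.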


In [None]:
#Prepare X and y
y = song_attribute_data["Has Word Love?"]
X = song_attribute_data.loc[:,["speechiness", "valence", "tempo", "time_signature", "Popularity Points Awarded"]]

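def minmax(z):
    """Rescale values to the [0, 1] range."""
    return (z - min(z)) / (max(z) - min(z))

# NOTE: axis=1 rescales each row across its five features, not each feature
# across songs. Per-feature scaling would use axis=0, as the later cells do.
# The scores printed below were produced with axis=1.
X = X.apply(minmax,axis=1)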

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.2, random_state=100)
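
The `axis` argument passed to `apply` decides what gets rescaled. The sketch below uses a toy DataFrame with made-up values (not taken from the dataset) to show the difference between `axis=0` and `axis=1`; the helper mirrors `minmax` but uses the Series methods `.min()` and `.max()`.

```python
import numpy as np
import pandas as pd

def minmax(z):
    """Rescale a Series to the [0, 1] range."""
    return (z - z.min()) / (z.max() - z.min())

# Toy values for illustration only
toy = pd.DataFrame({
    "valence": [0.25, 0.75, 0.50],
    "tempo": [90.0, 150.0, 120.0],
})

by_column = toy.apply(minmax, axis=0)  # each feature rescaled across songs
by_row = toy.apply(minmax, axis=1)     # each song rescaled across its own features

print(by_column)
print(by_row)
```

In this toy case, `axis=1` maps tempo to 1 and valence to 0 in every row, because tempo always has the largest raw value, so the rows become indistinguishable. With `axis=0`, each feature keeps its variation across songs.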

In [None]:
#Run Regression
reg = LogisticRegression(penalty = None,
                         fit_intercept=True,
                         solver = 'newton-cholesky',
                         max_iter=1000).fit(X_train,y_train)

print("Training accuracy" , reg.score(X_train, y_train))
print("Test accuracy", reg.score(X_test, y_test))
# Note: score() returns mean accuracy for a classifier, not R^2.
# These look like good accuracy scores for these predictors.


Training accuracy 0.9174862912350944
Test accuracy 0.9136790810998956


In [None]:
# Run regression with all variables
X_n = song_attribute_data.iloc[:, [8] + [9] + list(range(11, 24))]
# NOTE: axis=1 rescales each row across its features; axis=0 would rescale each feature
X_n = X_n.apply(minmax,axis=1)

dummies = pd.DataFrame([])
new_dummies = pd.get_dummies(song_attribute_data.loc[:,"spotify_track_explicit"], drop_first=True, dtype=int)
dummies = pd.concat([dummies, new_dummies], axis=1, ignore_index=True)

X2 = pd.concat([X_n,dummies],axis=1)
X2.columns = X2.columns.astype(str)

X2_train, X2_test, y2_train, y2_test = train_test_split(X2,y, test_size=.2, random_state=100)

reg2 = LogisticRegression(penalty = None,
                         fit_intercept=True,
                         solver = 'newton-cholesky',
                         max_iter=1000).fit(X2_train,y2_train)

print("Training accuracy", reg2.score(X2_train, y2_train))
print("Test accuracy", reg2.score(X2_test, y2_test))
# Exactly the same accuracy values as the first model?

Training accuracy 0.9174862912350944
Test accuracy 0.9136790810998956


The `value_counts` cell below shows that our target variable is highly imbalanced. Almost 92% of the songs do not have the word 'love' in the title, so a model can reach an accuracy of about 0.92 simply by predicting False for every song. This explains why the accuracy stays close to 0.91 no matter which features, or how many features, we use.

Below, we therefore add the parameter class_weight="balanced", which weights each class inversely to its frequency so that errors on the underrepresented class count more during fitting. In this case the underrepresented class is True: the title has the word 'love'.

In [None]:
print(song_attribute_data["Has Word Love?"].value_counts(normalize=True))


Has Word Love?
False    0.916725
True     0.083275
Name: proportion, dtype: float64
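
To make the majority-class effect concrete, here is a minimal sketch on synthetic data (not the Spotify dataset). It assumes labels that are True about 8.3% of the time and features unrelated to the label, and uses scikit-learn's `DummyClassifier` as a baseline that always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
n = 10_000

# Synthetic labels: roughly 8.3% True, mimicking the class balance above
y_demo = rng.random(n) < 0.083
# Synthetic features that carry no information about the label
X_demo = rng.normal(size=(n, 3))

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)
majority_share = 1 - y_demo.mean()

print("Baseline accuracy:", baseline_accuracy)
print("Share of majority class:", majority_share)
```

The two printed numbers coincide, which is why an accuracy near the majority share says little about whether the features are useful.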


Let's see how the accuracy scores change after adding class_weight="balanced", which tells the model to focus more on the underrepresented class.
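
As a rough illustration of what "balanced" means, scikit-learn weights each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to toy labels (11 zeros and 1 one, chosen to roughly match the 92% / 8% split) and compares it with `compute_class_weight`; the toy labels and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 11 titles without 'love' (0) and 1 with it (1), about 92% vs 8%
y_toy = np.array([0] * 11 + [1])

# The 'balanced' heuristic: n_samples / (n_classes * count of each class)
manual_weights = len(y_toy) / (2 * np.bincount(y_toy))

# The same weights as computed by scikit-learn
sk_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)

print(manual_weights)
print(sk_weights)
```

With this split, each minority-class example counts about eleven times as much as a majority-class example during fitting.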

In [None]:
#Prepare X and y
y = song_attribute_data["Has Word Love?"]
X = song_attribute_data.loc[:,["speechiness", "valence", "tempo", "time_signature", "Popularity Points Awarded"]]

# Same row-wise scaling (axis=1) as in the first model
X = X.apply(minmax,axis=1)

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.2, random_state=100)

#Run Regression
reg = LogisticRegression(penalty = None,
                         fit_intercept=True,
                         solver = 'newton-cholesky',
                         max_iter=1000,
                         class_weight='balanced').fit(X_train,y_train)

print("Training accuracy" , reg.score(X_train, y_train))
print("Test accuracy", reg.score(X_test, y_test))
# The accuracy decreased

Training accuracy 0.324310209765863
Test accuracy 0.33379742429516185


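Without adjusting:

- Training accuracy 0.9174862912350944
- Test accuracy 0.9136790810998956

After adjusting:

- Training accuracy 0.324310209765863
- Test accuracy 0.33379742429516185

The accuracy decreased. With balanced weights the model now predicts True for many songs that do not have 'love' in the title, and because those songs make up about 92% of the data, each of these errors pulls plain accuracy down.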
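
Because plain accuracy is dominated by the majority class, it can help to look at recall and balanced accuracy alongside it. The sketch below is not run on the Spotify data: it assumes synthetic data with about 8.3% positives and one weakly informative feature, and it uses scikit-learn's default regularization rather than `penalty=None`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(100)
n = 5000

# Synthetic imbalanced labels and two features; only the first carries a weak signal
y_syn = (rng.random(n) < 0.083).astype(int)
X_syn = rng.normal(size=(n, 2))
X_syn[:, 0] += 0.5 * y_syn

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=100, stratify=y_syn
)

results = {}
for name, cw in [("unweighted", None), ("balanced", "balanced")]:
    model = LogisticRegression(class_weight=cw, max_iter=1000).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "recall": recall_score(y_te, pred, zero_division=0),
        "balanced_accuracy": balanced_accuracy_score(y_te, pred),
    }

for name, metrics in results.items():
    print(name, metrics)
```

In this synthetic setup the unweighted model has the higher accuracy but finds almost none of the positives, while the balanced model trades accuracy for recall. Whether the same pattern holds for the song data would need to be checked on that data.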

In [None]:
# Run regression with all variables
X_n = song_attribute_data.iloc[:, [8] + [9] + list(range(11, 24))]
# NOTE: axis=1 rescales each row across its features; axis=0 would rescale each feature
X_n = X_n.apply(minmax,axis=1)

dummies = pd.DataFrame([])
new_dummies = pd.get_dummies(song_attribute_data.loc[:,"spotify_track_explicit"], drop_first=True, dtype=int)
dummies = pd.concat([dummies, new_dummies], axis=1, ignore_index=True)

X2 = pd.concat([X_n,dummies],axis=1)
X2.columns = X2.columns.astype(str)

X2_train, X2_test, y2_train, y2_test = train_test_split(X2,y, test_size=.2, random_state=100)

reg2 = LogisticRegression(penalty = None,
                         fit_intercept=True,
                         solver = 'newton-cholesky',
                         max_iter=1000,
                         class_weight='balanced'
                         ).fit(X2_train,y2_train)

print("Training accuracy", reg2.score(X2_train, y2_train))
print("Test accuracy", reg2.score(X2_test, y2_test))

Training accuracy 0.5130124466881365
Test accuracy 0.5144448311869126


The accuracy is still lower than without class weights, but using all the variables gives a higher accuracy (about 0.51) than the five hand-picked features (about 0.33). Note that score() reports the share of correct predictions, not the share of variance explained.

Let's try some more feature combinations using class_weight = 'balanced'.

In [None]:
# Do more popular songs have the word 'love'?
#Prepare X and y
y = song_attribute_data["Has Word Love?"]
X = song_attribute_data.loc[:,["Popularity Points Awarded"]]

# minmax is already defined above; axis=0 rescales each feature (column) across songs
X = X.apply(minmax,axis=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=100)

#Run Regression
reg = LogisticRegression(penalty = None,
                         fit_intercept=True,
                         solver = 'newton-cholesky',
                         max_iter=1000,
                         class_weight='balanced'
                         ).fit(X_train,y_train)

print("Training accuracy" , reg.score(X_train, y_train))
print("Test accuracy", reg.score(X_test, y_test))

Training accuracy 0.4069109583079467
Test accuracy 0.40932822833275323


In [None]:
#Prepare X and y
y = song_attribute_data["Has Word Love?"]
X = song_attribute_data.loc[:,["liveness"]]

X = X.apply(minmax,axis=0)

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.2, random_state=100)

#Run Regression
reg = LogisticRegression(penalty = None,
                         fit_intercept=True,
                         solver = 'newton-cholesky',
                         max_iter=1000,
                         class_weight='balanced').fit(X_train,y_train)

print("Training accuracy" , reg.score(X_train, y_train))
print("Test accuracy", reg.score(X_test, y_test))
# This is the highest accuracy so far among the class-weighted models

Training accuracy 0.6430498737923231
Test accuracy 0.6359206404455273


In [None]:
# Trying a mix
# Prepare X and y
y = song_attribute_data["Has Word Love?"]
X = song_attribute_data.loc[:,["valence", "acousticness", "liveness", "instrumentalness"]]

X = X.apply(minmax,axis=0)

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.2, random_state=100)

#Run Regression
reg = LogisticRegression(penalty = None,
                         fit_intercept=True,
                         solver = 'newton-cholesky',
                         max_iter=1000,
                         class_weight='balanced').fit(X_train,y_train)

print("Training accuracy" , reg.score(X_train, y_train))
print("Test accuracy", reg.score(X_test, y_test))

Training accuracy 0.6167638610845156
Test accuracy 0.608771319178559
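
The cells above repeat the same steps for each feature set. One way to make such comparisons less error-prone is a small helper, sketched below. The function `score_feature_set`, the synthetic stand-in data, and the column `has_love` are all hypothetical names introduced for illustration. The sketch also differs from the cells above in two ways: it scales each column with `MinMaxScaler` inside a pipeline fit on the training split only, and it keeps scikit-learn's default regularization.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

def score_feature_set(df, features, target, seed=100):
    """Fit a class-weighted logistic regression on `features`; return test metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[features], df[target], test_size=0.2, random_state=seed
    )
    # MinMaxScaler rescales each column and is fit on the training split only
    model = make_pipeline(
        MinMaxScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {
        "features": ", ".join(features),
        "accuracy": model.score(X_te, y_te),
        "balanced_accuracy": balanced_accuracy_score(y_te, pred),
    }

# Synthetic stand-in data (random values, not the Spotify dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "valence": rng.random(2000),
    "liveness": rng.random(2000),
    "has_love": rng.random(2000) < 0.1,
})

feature_sets = [["valence"], ["liveness"], ["valence", "liveness"]]
summary = pd.DataFrame([score_feature_set(demo, f, "has_love") for f in feature_sets])
print(summary)
```

Collecting the results in one table makes it easier to compare feature sets side by side, and reporting balanced accuracy next to plain accuracy guards against the majority-class effect discussed earlier.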
