This notebook introduces feature selection for models where both the features and the target are numerical.

The original dataset can be found here: https://www.kaggle.com/datasets/paakhim10/taylor-swift-the-myth-the-legend?select=taylorswift-Features.csv

In [24]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import pickle

df = pd.read_csv("taylorswift-Features.csv")

df.head()

Unnamed: 0.1,Unnamed: 0,album_id,album_name,id,track_name,danceability,swiftiness,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,0,1o59UpKw81iHR0HPiSkJR0,1989 (Taylor's Version) [Deluxe],4WUepByoeqcedHoYhSNHRt,Welcome To New York (Taylor's Version),0.757,100,0.61,7,-4.84,1,0.0327,0.00942,3.7e-05,0.367,0.685,116.998
1,1,1o59UpKw81iHR0HPiSkJR0,1989 (Taylor's Version) [Deluxe],0108kcWLnn2HlH2kedi1gn,Blank Space (Taylor's Version),0.733,100,0.733,0,-5.376,1,0.067,0.0885,0.0,0.168,0.701,96.057
2,2,1o59UpKw81iHR0HPiSkJR0,1989 (Taylor's Version) [Deluxe],3Vpk1hfMAQme8VJ0SNRSkd,Style (Taylor's Version),0.511,100,0.822,11,-4.785,0,0.0397,0.000421,0.0197,0.0899,0.305,94.868
3,3,1o59UpKw81iHR0HPiSkJR0,1989 (Taylor's Version) [Deluxe],1OcSfkeCg9hRC2sFKB4IMJ,Out Of The Woods (Taylor's Version),0.545,100,0.885,0,-5.968,1,0.0447,0.000537,5.6e-05,0.385,0.206,92.021
4,4,1o59UpKw81iHR0HPiSkJR0,1989 (Taylor's Version) [Deluxe],2k0ZEeAqzvYMcx9Qt5aClQ,All You Had To Do Was Stay (Taylor's Version),0.588,100,0.721,0,-5.579,1,0.0317,0.000656,0.0,0.131,0.52,96.997


In [25]:
# let's drop some columns we don't need
df = df.drop(["Unnamed: 0", "album_id", "album_name", "id", "track_name"], axis=1)

df.head()

Unnamed: 0,danceability,swiftiness,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,0.757,100,0.61,7,-4.84,1,0.0327,0.00942,3.7e-05,0.367,0.685,116.998
1,0.733,100,0.733,0,-5.376,1,0.067,0.0885,0.0,0.168,0.701,96.057
2,0.511,100,0.822,11,-4.785,0,0.0397,0.000421,0.0197,0.0899,0.305,94.868
3,0.545,100,0.885,0,-5.968,1,0.0447,0.000537,5.6e-05,0.385,0.206,92.021
4,0.588,100,0.721,0,-5.579,1,0.0317,0.000656,0.0,0.131,0.52,96.997


In [26]:
df.shape

(246, 12)

Next we clean the data. We need to do the following:

- check for missing values and handle them
- encode any categorical data. Two columns are technically categorical (mode, key), but they are already integer-encoded, so we're good there
- remove outliers. Let's assume we want to keep all the data points, since we don't have a ton
- split the data into training and testing sets
- scale the features
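
As a rough sketch of what the first three checks in this list could look like in pandas, the snippet below runs on a small made-up frame. The `toy` DataFrame, its values, and the 1.5 × IQR outlier rule are illustrative assumptions, not part of the dataset or of this notebook's workflow.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "energy": [0.61, 0.73, 0.82, 0.88, 0.72, None],
    "key": [7, 0, 11, 0, 0, 5],
    "tempo": [117.0, 96.1, 94.9, 92.0, 97.0, 250.0],
})

# 1. count missing values per column
missing = toy.isna().sum()

# 2. confirm every column is numeric (unencoded categories would be object dtype)
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)

# 3. flag outliers in one column with the 1.5 * IQR rule (an assumed convention)
q1, q3 = toy["tempo"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["tempo"] < q1 - 1.5 * iqr) | (toy["tempo"] > q3 + 1.5 * iqr)

print(missing)
print("all numeric:", all_numeric)
print("tempo outliers:", int(is_outlier.sum()))
```

Flagging outliers is separate from dropping them; with only 246 rows, keeping every point and just inspecting the flagged ones is a reasonable choice.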

In [27]:
# check for missing data
df.isna().sum()

danceability        0
swiftiness          0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
dtype: int64

No missing data! So we can move forward with splitting our data into a training and testing set.

In [28]:
# split the target from the features

yDF = pd.DataFrame(df["danceability"])

yDF.head()

Unnamed: 0,danceability
0,0.757
1,0.733
2,0.511
3,0.545
4,0.588


In [29]:
xDF = df.drop(columns="danceability")

xDF.head()

Unnamed: 0,swiftiness,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,100,0.61,7,-4.84,1,0.0327,0.00942,3.7e-05,0.367,0.685,116.998
1,100,0.733,0,-5.376,1,0.067,0.0885,0.0,0.168,0.701,96.057
2,100,0.822,11,-4.785,0,0.0397,0.000421,0.0197,0.0899,0.305,94.868
3,100,0.885,0,-5.968,1,0.0447,0.000537,5.6e-05,0.385,0.206,92.021
4,100,0.721,0,-5.579,1,0.0317,0.000656,0.0,0.131,0.52,96.997


In [30]:
yDF.head()

Unnamed: 0,danceability
0,0.757
1,0.733
2,0.511
3,0.545
4,0.588


In [31]:
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
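
Before using `KFold` on the real data, it may help to see what `split` yields. The sketch below uses ten toy samples (an assumption for illustration) and `shuffle=False` so the folds come out in order; the notebook itself shuffles with a fixed `random_state`.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy samples, one feature

# shuffle=False keeps the folds in order so the output is easy to read
kf = KFold(n_splits=5, shuffle=False)

folds = []
for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    # each fold yields two integer index arrays that together cover all rows
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print(f"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every sample lands in the test set exactly once across the five folds, which is why averaging the per-fold scores gives an estimate that uses all of the data.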

In [32]:
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# lists for finding average scores
r2Scores = []
rmseScores = []

# cv.split yields two arrays per fold: train indices and test indices
# enumerate numbers the folds, starting from 0
# for each fold we create the training and testing sets by
# converting the indices to lists and using iloc to index the original data

for i, (train_index, test_index) in enumerate(cv.split(xDF, yDF)):

    ### making training and validation sets
    # Convert indices to list
    train_index = train_index.tolist()
    test_index = test_index.tolist()
    
    # Split the data into training and testing sets for this fold
    xTrain, xTest = xDF.iloc[train_index], xDF.iloc[test_index]
    yTrain, yTest = yDF.iloc[train_index], yDF.iloc[test_index]

    ### feature scaling
    xScaler = StandardScaler()
    xColNames = xTrain.columns.values.tolist()
    # train the scaler and apply it to the training set
    xTrainScaled = xScaler.fit_transform(xTrain[xColNames])
    # apply the scaling to the testing set
    xTestScaled = xScaler.transform(xTest[xColNames])

    ### model training
    # instantiate the model
    clf = MLPRegressor()
    # train the regressor on the training data
    # (note: this fits on the unscaled xTrain; xTrainScaled above is not used)
    clf.fit(xTrain, yTrain)
    
    ### model prediction and evaluation
    # make predictions on the test data (also unscaled)
    y_pred = clf.predict(xTest)
    
    # Calculate metrics and store them
    r2Score = r2_score(yTest, y_pred)
    r2Scores.append(r2Score)

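    # RMSE is the square root of the MSE (the squared=False argument
    # is deprecated in recent scikit-learn releases)
    rmseScore = mean_squared_error(yTest, y_pred) ** 0.5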
    rmseScores.append(rmseScore)

    print(f"Completed Fold {i}")

### Calculate the mean scores across all folds
avgR2Score = sum(r2Scores) / len(r2Scores)
print("Mean r squared score:", avgR2Score)

avgRMSE = sum(rmseScores) / len(rmseScores)
print("Mean rmse:", avgRMSE)

Completed Fold 0
Completed Fold 1
Completed Fold 2
Completed Fold 3
Completed Fold 4
Mean r squared score: -1295.6693843675828
Mean rmse: 2.875300298817814
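
A mean R² this far below zero means the model does much worse than always predicting the mean danceability. One likely contributor is that the loop computes `xTrainScaled` and `xTestScaled` but then calls `fit` and `predict` on the unscaled `xTrain` and `xTest`. The sketch below shows one way to make the scaling impossible to skip, by wrapping the scaler and the network in a scikit-learn pipeline. It runs on synthetic data from `make_regression` so that it is self-contained; the data, `max_iter=500`, and `random_state=0` are assumptions for illustration, and the scores it prints say nothing about the Taylor Swift dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_validate
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; for the real data one would pass xDF and a
# 1-D target such as yDF.values.ravel() instead
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The pipeline refits the scaler on each training fold and applies it to the
# matching test fold, so the network never sees unscaled inputs
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(max_iter=500, random_state=0),
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(
    model, X, y, cv=cv,
    scoring=("r2", "neg_root_mean_squared_error"),
)

r2_per_fold = results["test_r2"]
# scikit-learn reports RMSE negated so that higher is better; flip the sign back
rmse_per_fold = -results["test_neg_root_mean_squared_error"]

print("Mean r squared score:", r2_per_fold.mean())
print("Mean rmse:", rmse_per_fold.mean())
```

Passing the target as a 1-D array also avoids the `DataConversionWarning` that scikit-learn raises when `y` is a single-column DataFrame.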
