# Prediction on the `"Million Song Dataset"`

This notebook walks through the full process of selecting and applying machine learning and deep learning algorithms to predict a song's release year.

---

# First, Load the Data into the Notebook

In [None]:
pip install gdown



In [None]:
import gdown

gdown.download("https://drive.google.com/uc?id=1f8eaAZY-7YgFxLcrL3OkvSRa3onNNLb9")

Downloading...
From (original): https://drive.google.com/uc?id=1f8eaAZY-7YgFxLcrL3OkvSRa3onNNLb9
From (redirected): https://drive.google.com/uc?id=1f8eaAZY-7YgFxLcrL3OkvSRa3onNNLb9&confirm=t&uuid=14dc81d2-947c-4218-84fe-153c04cb1273
To: /content/midterm-regresi-dataset.csv
100%|██████████| 443M/443M [00:08<00:00, 51.6MB/s]


'midterm-regresi-dataset.csv'

**Next:** Once the download succeeds and the data is saved at that path, we can display its contents. Why? So that we know what the data looks like before it is fed into a model.

In [None]:
import pandas as pd


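# This dataset has no header row.
# Values such as 2001 in the first column are the TARGET (year), not a feature/column name.
# Note: read_csv without header=None treats the first record as column names, as the output below shows.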
file_path = '/content/midterm-regresi-dataset.csv'
df = pd.read_csv(file_path)

df.head()

Unnamed: 0,2001,49.94357,21.47114,73.0775,8.74861,-17.40628,-13.09905,-25.01202,-12.23257,7.83089,...,13.0162,-54.40548,58.99367,15.37344,1.11144,-23.08793,68.40795,-1.82223,-27.46348,2.26327
0,2001,48.73215,18.4293,70.32679,12.94636,-10.32437,-24.83777,8.7663,-0.92019,18.76548,...,5.66812,-19.68073,33.04964,42.87836,-9.90378,-32.22788,70.49388,12.04941,58.43453,26.92061
1,2001,50.95714,31.85602,55.81851,13.41693,-6.57898,-18.5494,-3.27872,-2.35035,16.07017,...,3.038,26.05866,-50.92779,10.93792,-0.07568,43.2013,-115.00698,-0.05859,39.67068,-0.66345
2,2001,48.2475,-1.89837,36.29772,2.58776,0.9717,-26.21683,5.05097,-10.34124,3.55005,...,34.57337,-171.70734,-16.96705,-46.67617,-12.51516,82.58061,-72.08993,9.90558,199.62971,18.85382
3,2001,50.9702,42.20998,67.09964,8.46791,-15.85279,-16.81409,-12.48207,-9.37636,12.63699,...,9.92661,-55.95724,64.92712,-17.72522,-1.49237,-7.50035,51.76631,7.88713,55.66926,28.74903
4,2001,50.54767,0.31568,92.35066,22.38696,-25.5187,-19.04928,20.67345,-5.19943,3.63566,...,6.59753,-50.69577,26.02574,18.9443,-0.3373,6.09352,35.18381,5.00283,-11.02257,0.02263


In [None]:
print("df.shape:", df.shape)  # 515344 rows and 91 columns (1 target + 90 features); the first record was read as the header

df.shape: (515344, 91)
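
To see why the shape above is one row short, the following minimal sketch reads a tiny in-memory CSV twice. The three-row `csv_text` sample is made up for illustration; only `pd.read_csv` and its `header` parameter come from pandas.

```python
import io

import pandas as pd

# Hypothetical three-record file without a header row (values are illustrative).
csv_text = "2001,49.9,21.4\n2001,48.7,18.4\n2001,50.9,31.8\n"

# Default behaviour: the first record is consumed as column names.
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every record as data and numbers the columns 0, 1, 2.
with_header_none = pd.read_csv(io.StringIO(csv_text), header=None)

print("default header:", with_default.shape)
print("header=None:   ", with_header_none.shape)
```

With the default setting only two data rows remain, while `header=None` keeps all three. The same effect explains the missing row in the full dataset.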


# Preprocessing Data

Now that the data is loaded, we know it has no `header`, so we will add one. The dataset itself does not say where the columns after the year come from, so for now we simply label them `fitur_1, fitur_2,` and so on.

**This code**:
Adds a header to the dataset and displays its summary information



In [None]:
df = pd.read_csv('/content/midterm-regresi-dataset.csv', header=None)

# Create a list of column names
col_names = ['years'] + [f'fitur_{i}' for i in range(1, 91)]

# Assign column names to the DataFrame
df.columns = col_names

# Display the first 5 rows
print("First 5 rows of the DataFrame:")
display(df.head())

# Print a concise summary of the DataFrame
print("\nDataFrame Info:")
df.info()

First 5 rows of the DataFrame:


Unnamed: 0,years,fitur_1,fitur_2,fitur_3,fitur_4,fitur_5,fitur_6,fitur_7,fitur_8,fitur_9,...,fitur_81,fitur_82,fitur_83,fitur_84,fitur_85,fitur_86,fitur_87,fitur_88,fitur_89,fitur_90
0,2001,49.94357,21.47114,73.0775,8.74861,-17.40628,-13.09905,-25.01202,-12.23257,7.83089,...,13.0162,-54.40548,58.99367,15.37344,1.11144,-23.08793,68.40795,-1.82223,-27.46348,2.26327
1,2001,48.73215,18.4293,70.32679,12.94636,-10.32437,-24.83777,8.7663,-0.92019,18.76548,...,5.66812,-19.68073,33.04964,42.87836,-9.90378,-32.22788,70.49388,12.04941,58.43453,26.92061
2,2001,50.95714,31.85602,55.81851,13.41693,-6.57898,-18.5494,-3.27872,-2.35035,16.07017,...,3.038,26.05866,-50.92779,10.93792,-0.07568,43.2013,-115.00698,-0.05859,39.67068,-0.66345
3,2001,48.2475,-1.89837,36.29772,2.58776,0.9717,-26.21683,5.05097,-10.34124,3.55005,...,34.57337,-171.70734,-16.96705,-46.67617,-12.51516,82.58061,-72.08993,9.90558,199.62971,18.85382
4,2001,50.9702,42.20998,67.09964,8.46791,-15.85279,-16.81409,-12.48207,-9.37636,12.63699,...,9.92661,-55.95724,64.92712,-17.72522,-1.49237,-7.50035,51.76631,7.88713,55.66926,28.74903



DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515345 entries, 0 to 515344
Data columns (total 91 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   years     515345 non-null  int64  
 1   fitur_1   515345 non-null  float64
 2   fitur_2   515345 non-null  float64
 3   fitur_3   515345 non-null  float64
 4   fitur_4   515345 non-null  float64
 5   fitur_5   515345 non-null  float64
 6   fitur_6   515345 non-null  float64
 7   fitur_7   515345 non-null  float64
 8   fitur_8   515345 non-null  float64
 9   fitur_9   515345 non-null  float64
 10  fitur_10  515345 non-null  float64
 11  fitur_11  515345 non-null  float64
 12  fitur_12  515345 non-null  float64
 13  fitur_13  515345 non-null  float64
 14  fitur_14  515345 non-null  float64
 15  fitur_15  515345 non-null  float64
 16  fitur_16  515345 non-null  float64
 17  fitur_17  515345 non-null  float64
 18  fitur_18  515345 non-null  float64
 19  fitur_19  515345 non-null  

# Data Cleaning

### Goal:
Identify input errors, formatting problems, missing values, and outliers in the data


**Step 1**:
Each column is checked for missing values so that any found can be removed, and the per-column counts are then displayed



In [None]:
print("Missing values before outlier handling:")
print(df.isnull().sum().to_string())

Missing values before outlier handling:
years       0
fitur_1     0
fitur_2     0
fitur_3     0
fitur_4     0
fitur_5     0
fitur_6     0
fitur_7     0
fitur_8     0
fitur_9     0
fitur_10    0
fitur_11    0
fitur_12    0
fitur_13    0
fitur_14    0
fitur_15    0
fitur_16    0
fitur_17    0
fitur_18    0
fitur_19    0
fitur_20    0
fitur_21    0
fitur_22    0
fitur_23    0
fitur_24    0
fitur_25    0
fitur_26    0
fitur_27    0
fitur_28    0
fitur_29    0
fitur_30    0
fitur_31    0
fitur_32    0
fitur_33    0
fitur_34    0
fitur_35    0
fitur_36    0
fitur_37    0
fitur_38    0
fitur_39    0
fitur_40    0
fitur_41    0
fitur_42    0
fitur_43    0
fitur_44    0
fitur_45    0
fitur_46    0
fitur_47    0
fitur_48    0
fitur_49    0
fitur_50    0
fitur_51    0
fitur_52    0
fitur_53    0
fitur_54    0
fitur_55    0
fitur_56    0
fitur_57    0
fitur_58    0
fitur_59    0
fitur_60    0
fitur_61    0
fitur_62    0
fitur_63    0
fitur_64    0
fitur_65    0
fitur_66    0
fitur_67    0
fitur_68

**Step 2**:
Since there are no missing values, we next check for outliers. Outliers can distort the data and make it harder for the model to learn the underlying patterns, so they need to be handled. The code below clips values to the IQR bounds (Q1 - 1.5·IQR and Q3 + 1.5·IQR) rather than dropping rows, and the loop runs over every column, including the target `years`.



In [None]:
for column in df.columns:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)

print("DataFrame after outlier handling (first 5 rows):")
display(df.head())

DataFrame after outlier handling (first 5 rows):


Unnamed: 0,years,fitur_1,fitur_2,fitur_3,fitur_4,fitur_5,fitur_6,fitur_7,fitur_8,fitur_9,...,fitur_81,fitur_82,fitur_83,fitur_84,fitur_85,fitur_86,fitur_87,fitur_88,fitur_89,fitur_90
0,2001,49.94357,21.47114,73.0775,8.74861,-17.40628,-13.09905,-25.01202,-12.23257,7.83089,...,13.0162,-54.40548,58.99367,15.37344,1.11144,-23.08793,68.40795,-1.82223,-27.46348,2.26327
1,2001,48.73215,18.4293,70.32679,12.94636,-10.32437,-24.83777,8.7663,-0.92019,18.76548,...,5.66812,-19.68073,33.04964,42.87836,-9.90378,-32.22788,70.49388,12.04941,58.43453,26.92061
2,2001,50.95714,31.85602,55.81851,13.41693,-6.57898,-18.5494,-3.27872,-2.35035,16.07017,...,3.038,26.05866,-50.92779,10.93792,-0.07568,43.2013,-115.00698,-0.05859,39.67068,-0.66345
3,2001,48.2475,-1.89837,36.29772,2.58776,0.9717,-26.21683,5.05097,-10.34124,3.55005,...,34.57337,-171.70734,-16.96705,-46.67617,-12.51516,82.58061,-72.08993,9.90558,199.62971,18.85382
4,2001,50.9702,42.20998,67.09964,8.46791,-15.85279,-16.81409,-12.48207,-9.37636,12.63699,...,9.92661,-55.95724,64.92712,-17.72522,-1.49237,-7.50035,51.76631,7.88713,55.66926,28.74903
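
A possible variation, sketched here on a made-up toy frame, restricts IQR clipping to the feature columns so the target stays untouched. The `toy` data and the `feature_cols` name are illustrative assumptions and not part of the notebook.

```python
import pandas as pd

# Toy frame: one target column and one feature with an obvious outlier (100.0).
toy = pd.DataFrame({
    "years": [1950, 1999, 2001, 2003, 2005],
    "fitur_1": [1.0, 2.0, 3.0, 4.0, 100.0],
})

# Clip only the features; the target column is excluded on purpose.
feature_cols = [c for c in toy.columns if c != "years"]

q1 = toy[feature_cols].quantile(0.25)
q3 = toy[feature_cols].quantile(0.75)
iqr = q3 - q1

clipped = toy.copy()
# axis=1 aligns the per-column bounds (Series indexed by column name) with the columns.
clipped[feature_cols] = toy[feature_cols].clip(
    lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1
)

print(clipped)
```

Clipping keeps the number of rows unchanged: the outlier is pulled back to the upper bound instead of being deleted.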


## Apply Standard Scaling

### Subtask:
Apply standard scaling to the features to normalize their ranges, which is crucial for many machine learning algorithms, including Lasso.

### Reasoning:
To apply standard scaling, the features (independent variables) and the target variable (dependent variable) must first be separated. The 'years' column will be designated as the target variable (y), and all 'fitur_X' columns will be considered features (X). This separation is a standard preprocessing step before applying any scaling or machine learning models.

**Reasoning**:
Separate the 'years' column into 'y' (target) and the remaining 'fitur_X' columns into 'X' (features) as instructed, which is a necessary step before applying standard scaling.



In [None]:
y = df['years']
X = df.drop('years', axis=1)

print("First 5 rows of features (X):")
display(X.head())

print("\nFirst 5 rows of target (y):")
display(y.head())

First 5 rows of features (X):


Unnamed: 0,fitur_1,fitur_2,fitur_3,fitur_4,fitur_5,fitur_6,fitur_7,fitur_8,fitur_9,fitur_10,...,fitur_81,fitur_82,fitur_83,fitur_84,fitur_85,fitur_86,fitur_87,fitur_88,fitur_89,fitur_90
0,49.94357,21.47114,73.0775,8.74861,-17.40628,-13.09905,-25.01202,-12.23257,7.83089,-2.46783,...,13.0162,-54.40548,58.99367,15.37344,1.11144,-23.08793,68.40795,-1.82223,-27.46348,2.26327
1,48.73215,18.4293,70.32679,12.94636,-10.32437,-24.83777,8.7663,-0.92019,18.76548,4.5921,...,5.66812,-19.68073,33.04964,42.87836,-9.90378,-32.22788,70.49388,12.04941,58.43453,26.92061
2,50.95714,31.85602,55.81851,13.41693,-6.57898,-18.5494,-3.27872,-2.35035,16.07017,1.39518,...,3.038,26.05866,-50.92779,10.93792,-0.07568,43.2013,-115.00698,-0.05859,39.67068,-0.66345
3,48.2475,-1.89837,36.29772,2.58776,0.9717,-26.21683,5.05097,-10.34124,3.55005,-6.36304,...,34.57337,-171.70734,-16.96705,-46.67617,-12.51516,82.58061,-72.08993,9.90558,199.62971,18.85382
4,50.9702,42.20998,67.09964,8.46791,-15.85279,-16.81409,-12.48207,-9.37636,12.63699,0.93609,...,9.92661,-55.95724,64.92712,-17.72522,-1.49237,-7.50035,51.76631,7.88713,55.66926,28.74903



First 5 rows of target (y):


Unnamed: 0,years
0,2001
1,2001
2,2001
3,2001
4,2001


**Reasoning**:
Now that features and target are separated, the next step is to import `StandardScaler` from `sklearn.preprocessing`, initialize it, fit it to the feature data `X`, transform `X`, and display the first few rows of the scaled feature DataFrame to verify the transformation.



In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
scaler = StandardScaler()

# Fit the scaler to X and transform X
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

print("First 5 rows of scaled features (X_scaled):")
display(X_scaled.head())


First 5 rows of scaled features (X_scaled):


Unnamed: 0,fitur_1,fitur_2,fitur_3,fitur_4,fitur_5,fitur_6,fitur_7,fitur_8,fitur_9,fitur_10,...,fitur_81,fitur_82,fitur_83,fitur_84,fitur_85,fitur_86,fitur_87,fitur_88,fitur_89,fitur_90
0,1.112485,0.397622,1.945061,0.558669,-0.493884,-0.27829,-1.649419,-1.398047,0.413709,-0.678833,...,-0.033271,0.094808,0.224528,-0.298001,0.038803,-0.46833,0.687365,-0.537783,-0.329912,0.090036
1,0.904488,0.334814,1.861896,0.851632,-0.169928,-1.241573,0.805808,0.117009,1.532557,0.423758,...,-0.33512,0.351123,-0.044286,0.07764,-0.781772,-0.5725,0.702763,0.735707,0.324435,1.559275
2,1.286512,0.612049,1.423255,0.884473,0.001402,-0.725547,-0.069702,-0.074531,1.256768,-0.075524,...,-0.443162,0.68874,-0.9144,-0.358578,-0.049631,0.287186,-0.666597,-0.375872,0.181497,-0.084356
3,0.821275,-0.084911,0.833065,0.128701,0.346802,-1.354739,0.535754,-1.144743,-0.024315,-1.287172,...,0.85227,-0.771037,-0.562523,-1.145425,-0.976306,0.736002,-0.349785,0.538892,1.40002,1.078605
4,1.288755,0.825837,1.764327,0.539079,-0.422821,-0.583147,-0.738661,-1.015518,0.905478,-0.147223,...,-0.160187,0.083354,0.286006,-0.750036,-0.155167,-0.290674,0.564517,0.353588,0.30337,1.668224
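
As a quick check of what `StandardScaler` does, each value x in a column is replaced by z = (x - mean) / std. The sketch below compares it with a manual computation on a made-up array; `X_toy` is an illustrative assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix with two columns on very different scales.
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Manual z-score using the population standard deviation (NumPy's default, ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and unit variance, which is why the values in the table above are mostly between about -2 and 2.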


# Task
**Task**: Split the `X_scaled` (features) and `y` (target) data into training, validation, and testing sets with a 40/40/20 ratio. This will involve two `train_test_split` operations: first to separate out the test set (20%), and then to divide the remaining 80% into training (40%) and validation (40%) sets.

## Data Splitting

### Subtask:
Split the preprocessed data (X_scaled and y) into training (40%), validation (40%), and testing (20%) sets. This will involve two `train_test_split` operations.


**Reasoning**:
To split the data into training, validation, and testing sets as per the instructions, the `train_test_split` function needs to be imported, and then two sequential splits will be performed. Finally, the shapes of the resulting sets will be printed for verification.



In [None]:
from sklearn.model_selection import train_test_split

# Split X_scaled and y into temporary training/validation set (80%) and test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Split the temporary set into final training set (40% of original) and validation set (40% of original)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_val:", X_val.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_val:", y_val.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (206138, 90)
Shape of X_val: (206138, 90)
Shape of X_test: (103069, 90)
Shape of y_train: (206138,)
Shape of y_val: (206138,)
Shape of y_test: (103069,)
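
The 40/40/20 ratio follows from simple arithmetic: the first split keeps 80% of the rows, and halving that 80% gives 0.8 × 0.5 = 0.4 for both training and validation. A minimal sketch on 100 made-up rows (all `*_toy` names are illustrative, chosen so they do not overwrite the notebook's variables):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy rows so that the resulting sizes read directly as percentages.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# First split: hold out 20% as the test set.
X_tmp_toy, X_test_toy, y_tmp_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Second split: halve the remaining 80% into training and validation.
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_tmp_toy, y_tmp_toy, test_size=0.5, random_state=42
)

sizes = (len(X_train_toy), len(X_val_toy), len(X_test_toy))
print(sizes)
```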


## RFE Dimensionality Reduction

### Subtask:
Apply Recursive Feature Elimination (RFE) to the training features (X_train) using a Linear Regression estimator to select a subset of the most relevant features. The number of features to select will be determined in this step, or a default will be used and noted for potential tuning.


**Reasoning**:
To begin the RFE dimensionality reduction subtask, I need to import the necessary classes for RFE and Linear Regression from scikit-learn. Then, I will instantiate a LinearRegression model as the estimator for RFE and initialize RFE with a specified number of features to select.



In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Instantiate a Linear Regression model
estimator = LinearRegression()

# Initialize RFE with the estimator and set n_features_to_select
# A reasonable number for initial exploration could be 30 out of 90 features.
n_features_to_select = 30 # Can be tuned later
rfe = RFE(estimator=estimator, n_features_to_select=n_features_to_select, step=1)

print(f"Initialized RFE with LinearRegression estimator and n_features_to_select={n_features_to_select}")

Initialized RFE with LinearRegression estimator and n_features_to_select=30


**Reasoning**:
Now that RFE is initialized, the next step is to fit the RFE model to the training data (X_train, y_train), then transform the training, validation, and test sets using the fitted RFE model, and finally print the shapes of the transformed datasets to verify the number of selected features.



In [None]:
rfe.fit(X_train, y_train)

X_train_rfe = rfe.transform(X_train)
X_val_rfe = rfe.transform(X_val)
X_test_rfe = rfe.transform(X_test)

print("Shape of X_train_rfe:", X_train_rfe.shape)
print("Shape of X_val_rfe:", X_val_rfe.shape)
print("Shape of X_test_rfe:", X_test_rfe.shape)

Shape of X_train_rfe: (206138, 30)
Shape of X_val_rfe: (206138, 30)
Shape of X_test_rfe: (103069, 30)
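
To find out which columns RFE kept, the fitted selector exposes `get_support()` and `ranking_`. The sketch below uses synthetic data from `make_regression` as a stand-in for the notebook's features; the sizes and the `selector` name are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: 10 features, of which 3 carry signal.
X_toy, y_toy = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=0.1, random_state=0
)

selector = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
selector.fit(X_toy, y_toy)

# Column indices of the retained features; ranking_ is 1 for every retained feature.
selected = selector.get_support(indices=True)
print("selected columns:", selected)
print("ranking:", selector.ranking_)
```

On the notebook's data, the same idea applied to the fitted `rfe` object (for example `X_train.columns[rfe.get_support()]`) would list the 30 retained `fitur_*` names.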


## Polynomial Feature Transformation

### Subtask:
Transform the RFE-selected features from the training, validation, and test sets into polynomial features. This step will prepare the data for polynomial regression.


**Reasoning**:
To perform polynomial feature transformation, I need to import the `PolynomialFeatures` class, instantiate it with a chosen degree, fit it to the training data, and then transform all three datasets (training, validation, and test).



In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Instantiate PolynomialFeatures with degree=2
degree = 2  # This can be tuned later
poly = PolynomialFeatures(degree=degree)

# Fit and transform the training data
X_train_poly = poly.fit_transform(X_train_rfe)

# Transform the validation and test data
X_val_poly = poly.transform(X_val_rfe)
X_test_poly = poly.transform(X_test_rfe)

print(f"Polynomial feature transformation with degree={degree}:")
print("Shape of X_train_poly:", X_train_poly.shape)
print("Shape of X_val_poly:", X_val_poly.shape)
print("Shape of X_test_poly:", X_test_poly.shape)

Polynomial feature transformation with degree=2:
Shape of X_train_poly: (206138, 496)
Shape of X_val_poly: (206138, 496)
Shape of X_test_poly: (103069, 496)
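
The 496 columns above follow from a counting formula: with n input features and degree d, `PolynomialFeatures` (including the bias column) produces C(n + d, d) outputs. For n = 30 and d = 2 that is 1 bias + 30 linear + 465 quadratic terms. A short sketch that checks the formula against scikit-learn on a dummy row:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 30, 2

# Closed-form count of monomials of total degree <= 2 in 30 variables.
expected = comb(n_features + degree, degree)

# Count produced by scikit-learn on a single dummy row of zeros.
actual = PolynomialFeatures(degree=degree).fit_transform(
    np.zeros((1, n_features))
).shape[1]

print(expected, actual)
```

The count grows quickly with the number of inputs, which is one practical reason to reduce 90 features to 30 before expanding them.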


# Task
Train a Polynomial Regression model by iterating through polynomial degrees 1 and 2, evaluating each degree on the validation set using Mean Squared Error, and identifying the optimal degree.

## Hyperparameter Tuning for Polynomial Regression (Degree <= 2)

### Subtask:
Iterate through polynomial degrees 1 and 2, train a Linear Regression model for each, evaluate its performance on the validation set using Mean Squared Error, and identify the optimal degree.


**Reasoning**:
To begin hyperparameter tuning, I will import the necessary libraries, define the range of polynomial degrees to test, and then create a loop to train and evaluate Linear Regression models for each degree. I will store the Mean Squared Error for each degree to determine the optimal one.



In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

mse_results = {}

degrees = [1, 2]

for degree in degrees:
    print(f"\n--- Evaluating Polynomial Degree: {degree} ---")

    # Instantiate PolynomialFeatures with the current degree
    poly_current = PolynomialFeatures(degree=degree)

    # Fit and transform training data
    X_train_poly_current_degree = poly_current.fit_transform(X_train_rfe)

    # Transform validation data
    X_val_poly_current_degree = poly_current.transform(X_val_rfe)

    # Instantiate and train Linear Regression model
    model = LinearRegression()
    model.fit(X_train_poly_current_degree, y_train)

    # Make predictions on the validation set
    y_pred_val = model.predict(X_val_poly_current_degree)

    # Calculate Mean Squared Error
    mse = mean_squared_error(y_val, y_pred_val)
    mse_results[degree] = mse

    print(f"Mean Squared Error for degree {degree}: {mse:.4f}")

# Identify the optimal degree
optimal_degree = min(mse_results, key=mse_results.get)
min_mse = mse_results[optimal_degree]

print(f"\nOptimal polynomial degree: {optimal_degree} with MSE: {min_mse:.4f}")


--- Evaluating Polynomial Degree: 1 ---
Mean Squared Error for degree 1: 65.4015

--- Evaluating Polynomial Degree: 2 ---
Mean Squared Error for degree 2: 59.8950

Optimal polynomial degree: 2 with MSE: 59.8950


**Reasoning**:
With the optimal degree chosen on the validation set, the next cell refits the polynomial transformation and a Linear Regression model on the training data and evaluates them on the test set using MSE and R-squared. The cell relies on `X_train_rfe` and `X_test_rfe` from the RFE step, so the preceding cells must have been run in the same kernel session; otherwise a `NameError` is raised.



In [None]:
from sklearn.metrics import r2_score

# 1. Instantiate PolynomialFeatures with the optimal_degree
poly_final = PolynomialFeatures(degree=optimal_degree)

# 2. Transform X_train_rfe using the instantiated PolynomialFeatures
X_train_final_poly = poly_final.fit_transform(X_train_rfe)

# 3. Transform X_test_rfe using the same instantiated PolynomialFeatures
X_test_final_poly = poly_final.transform(X_test_rfe)

# 4. Instantiate a LinearRegression model
final_model = LinearRegression()

# 5. Train the LinearRegression model using X_train_final_poly and y_train
final_model.fit(X_train_final_poly, y_train)

# 6. Make predictions on the X_test_final_poly data
y_pred_test = final_model.predict(X_test_final_poly)

# 7. Calculate the Mean Squared Error (MSE)
mse_test = mean_squared_error(y_test, y_pred_test)

# 8. Calculate the R-squared score
r2_test = r2_score(y_test, y_pred_test)

# 9. Print the calculated MSE and R-squared values for the test set
print(f"\nFinal Model Evaluation on Test Set (Polynomial Degree = {optimal_degree}):")
print(f"Mean Squared Error (MSE): {mse_test:.4f}")
print(f"R-squared (R2): {r2_test:.4f}")


Final Model Evaluation on Test Set (Polynomial Degree = 2):
Mean Squared Error (MSE): 59.5699
R-squared (R2): 0.3263
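
Because MSE is in squared years, taking the square root gives an error in years that is easier to interpret. The sketch below only converts the MSE values already reported above; no new results are computed.

```python
import math

# MSE values copied from the outputs above.
reported_mse = {
    "validation, degree 1": 65.4015,
    "validation, degree 2": 59.8950,
    "test, degree 2": 59.5699,
}

# RMSE is the square root of MSE and is expressed in years.
reported_rmse = {name: math.sqrt(value) for name, value in reported_mse.items()}

for name, value in reported_rmse.items():
    print(f"{name}: RMSE = {value:.2f} years")
```

All three values land at roughly 7.7 to 8.1 years.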


## Summary of Model Performance

### Subtask:
Provide a summary of the model's performance, including the optimal degree and the evaluation metrics on the test set.


In [None]:
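from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Relies on the notebook-level `mse_results` dict and the RFE-transformed splits defined above.
def hyperparameter_tuning(degree):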
  print(f"\n--- Evaluating Polynomial Degree: {degree} ---")

  # Instantiate PolynomialFeatures with the current degree
  poly_current = PolynomialFeatures(degree=degree)

  # Fit and transform training data
  X_train_poly_current_degree = poly_current.fit_transform(X_train_rfe)

  # Transform validation data
  X_val_poly_current_degree = poly_current.transform(X_val_rfe)

  # Instantiate and train Linear Regression model
  model = LinearRegression()
  model.fit(X_train_poly_current_degree, y_train)

  # Make predictions on the validation set
  y_pred_val = model.predict(X_val_poly_current_degree)

  # Calculate Mean Squared Error
  mse = mean_squared_error(y_val, y_pred_val)
  mse_results[degree] = mse

  print(f"Mean Squared Error for degree {degree}: {mse:.4f}")

  # Identify the optimal degree
  optimal_degree = min(mse_results, key=mse_results.get)
  min_mse = mse_results[optimal_degree]

  print(f"\nOptimal polynomial degree: {optimal_degree} with MSE: {min_mse:.4f}")

  # 1. Instantiate PolynomialFeatures with the optimal_degree
  poly_final = PolynomialFeatures(degree=optimal_degree)

  # 2. Transform X_train_rfe using the instantiated PolynomialFeatures
  X_train_final_poly = poly_final.fit_transform(X_train_rfe)

  # 3. Transform X_test_rfe using the same instantiated PolynomialFeatures
  X_test_final_poly = poly_final.transform(X_test_rfe)

  # 4. Instantiate a LinearRegression model
  final_model = LinearRegression()

  # 5. Train the LinearRegression model using X_train_final_poly and y_train
  final_model.fit(X_train_final_poly, y_train)

  # 6. Make predictions on the X_test_final_poly data
  y_pred_test = final_model.predict(X_test_final_poly)

  # 7. Calculate the Mean Squared Error (MSE)
  mse_test = mean_squared_error(y_test, y_pred_test)

  # 8. Calculate the R-squared score
  r2_test = r2_score(y_test, y_pred_test)

  # 9. Print the calculated MSE and R-squared values for the test set
  print(f"\nFinal Model Evaluation on Test Set (Polynomial Degree = {optimal_degree}):")
  print(f"Mean Squared Error (MSE): {mse_test:.4f}")
  print(f"R-squared (R2): {r2_test:.4f}")


In [None]:
hyperparameter_tuning(1)
hyperparameter_tuning(2)


--- Evaluating Polynomial Degree: 1 ---
Mean Squared Error for degree 1: 65.4015

Optimal polynomial degree: 2 with MSE: 59.8950

Final Model Evaluation on Test Set (Polynomial Degree = 2):
Mean Squared Error (MSE): 59.5699
R-squared (R2): 0.3263

--- Evaluating Polynomial Degree: 2 ---
Mean Squared Error for degree 2: 59.8950

Optimal polynomial degree: 2 with MSE: 59.8950

Final Model Evaluation on Test Set (Polynomial Degree = 2):
Mean Squared Error (MSE): 59.5699
R-squared (R2): 0.3263
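
The first call above already reports degree 2 as optimal because `mse_results` still holds the values from the earlier loop. One way to avoid that kind of hidden state, sketched here under assumptions, is a helper that returns the validation MSE and leaves the comparison to the caller. The `validation_mse` helper and the synthetic data are hypothetical, not the notebook's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def validation_mse(degree, X_tr, y_tr, X_va, y_va):
    """Fit polynomial regression of the given degree and return validation MSE."""
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_va, model.predict(X_va))


# Synthetic data with a quadratic term, so degree 2 should fit better than degree 1.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = X_toy[:, 0] ** 2 + X_toy[:, 1] + rng.normal(scale=0.1, size=400)
X_tr, X_va = X_toy[:300], X_toy[300:]
y_tr, y_va = y_toy[:300], y_toy[300:]

# Collect all scores first, then pick the best degree once.
scores = {d: validation_mse(d, X_tr, y_tr, X_va, y_va) for d in (1, 2)}
best_degree = min(scores, key=scores.get)
print(scores, best_degree)
```

Keeping the selection outside the helper means the test set is evaluated only once, after the degree has been chosen.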


# Why Is the Model's Performance So Low?

Several factors contribute:
* The model is limited by its own form. An $R^2$ of `0.3263`, as obtained in the code above, is low because the algorithm is simple relative to the complexity of the dataset, which leads to the next point.
* The dataset is **very complex and very large**, so simple models such as linear and polynomial regression cannot capture its characteristics with a closed-form formula. A more advanced model is needed to raise performance significantly.

# What Are We Dealing With?

1. Dataset name: Million Song Dataset (YearPredictionMSD).
2. Feature context: audio timbre features (12 averages + 78 covariances).
3. Problem: predict a song's release year from its sound characteristics.

## Reality Check: How Hard Is This Dataset?
Frankly, this is a notoriously difficult regression dataset in machine learning.
* The relationship between "sound" (timbre) and "release year" is weak and abstract. There is no fixed rule such as "bass-heavy sound = year 2000".
* Industry benchmark: researchers and data science practitioners typically reach "only" an $R^2$ of about 0.30 - 0.35, or an RMSE of about 8 - 9 years, on this dataset with conventional (non-deep-learning) methods.

So the $R^2$ of 0.32 obtained earlier is **ALREADY GOOD** (on par with the typical standard). Can it be pushed higher? Yes, but only slightly. Because the data is dense and continuous (small decimal values), our strategy has to shift from "handling zeros" to "handling subtle nuances".
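
$R^2$ and MSE are linked by $R^2 = 1 - \mathrm{MSE} / \mathrm{Var}(y)$, where the variance is taken over the evaluated targets. The sketch below first checks this identity on a few made-up values (`y_true` and `y_hat` are illustrative), then applies it to the test metrics reported earlier to back out the implied spread of the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, only to verify the identity.
y_true = np.array([1999.0, 2001.0, 2005.0, 1987.0, 2010.0])
y_hat = np.array([2000.0, 2000.0, 2001.0, 1995.0, 2006.0])

r2_library = r2_score(y_true, y_hat)
r2_manual = 1.0 - mean_squared_error(y_true, y_hat) / np.var(y_true)
print(np.isclose(r2_library, r2_manual))

# Test-set metrics reported for the polynomial model above.
reported_mse, reported_r2 = 59.5699, 0.3263
implied_variance = reported_mse / (1.0 - reported_r2)
implied_std = implied_variance ** 0.5
print(f"implied target std: {implied_std:.1f} years")
```

Read this way, an $R^2$ around 0.33 means the model's squared error is about two thirds of what predicting the mean year for every song would give.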

## New Strategy: LightGBM / HistGradientBoosting

Because the data consists of dense decimal values and has many rows (515,345), XGBoost may be rather slow, and polynomial regression cannot capture the nuances of timbre.

A strong choice for this Million Song dataset is LightGBM or HistGradientBoosting. These algorithms group each continuous feature into histogram bins before searching for splits, which keeps training fast on large, dense numeric data.

The code below is the final optimization attempt. If it plateaus as well, the timbre features have likely reached the limit of the information they can provide.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# We use HistGradientBoosting (scikit-learn's histogram-based gradient boosting, inspired by LightGBM)
# It is much faster on large datasets of dense decimal values
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

# 1. DATA SETUP
# Make sure X and y are ready
# X holds every feature except 'years'
# y is the 'years' column, which is our target
X = df.drop('years', axis=1)  # All timbre features
y = df['years']   # Target year

# 2. SPLIT DATA
# MSD normally comes with a specific split (463,715 train, 51,630 test).
# A plain random split is also acceptable here.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# 3. MODELING (HistGradientBoosting)
# This algorithm groups decimal values into histogram 'bins' (buckets)
# Well suited to audio timbre features, whose values are densely packed
model = HistGradientBoostingRegressor(
    max_iter=1000,        # Maximum number of iterations (trees); early stopping may use fewer
    learning_rate=0.1,    # Standard learning rate
    max_depth=10,         # Tree depth (can go deeper because the data is complex)
    l2_regularization=0.1,# Helps prevent overfitting
    random_state=42,
    verbose=1             # Show training progress
)

print("Training started...")
model.fit(X_train, y_train)

# 4. EVALUATION
y_pred = model.predict(X_test)

# Round predictions to the nearest integer (a year cannot be fractional)
y_pred_rounded = np.round(y_pred)

r2 = r2_score(y_test, y_pred_rounded)
# Calculate MSE first, then take the square root for RMSE
mse = mean_squared_error(y_test, y_pred_rounded)
rmse = np.sqrt(mse)

print("\n--- FINAL RESULTS (Million Song Dataset) ---")
print(f"R-Squared: {r2:.4f}")
print(f"RMSE: {rmse:.4f} years")

# Error analysis
errors = np.abs(y_test - y_pred_rounded)
print(f"Mean absolute error: {np.mean(errors):.2f} years")

Training started...
Binning 0.301 GB of training data: 6.977 s
Binning 0.033 GB of validation data: 0.595 s
Fitting gradient boosted rounds:
Fit 780 trees in 151.762 s, (24180 total leaves)
Time spent computing histograms: 120.088s
Time spent finding best splits:  9.027s
Time spent applying splits:      7.068s
Time spent predicting:           1.053s

--- FINAL RESULTS (Million Song Dataset) ---
R-Squared: 0.3777
RMSE: 7.4123 years
Mean absolute error: 5.51 years
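
The dataset-specific split mentioned in the code comments (463,715 training rows followed by 51,630 test rows) is positional rather than random. A minimal sketch of such a split is shown below; the `positional_split` helper and the toy frame are hypothetical, and applying the official sizes to `df` assumes the CSV preserves the original row order.

```python
import numpy as np
import pandas as pd

# Split sizes quoted in the notebook's comments; together they cover all 515,345 rows.
N_TRAIN_OFFICIAL = 463_715
N_TEST_OFFICIAL = 51_630


def positional_split(frame, n_train):
    """Return the first n_train rows as train and the remaining rows as test."""
    return frame.iloc[:n_train], frame.iloc[n_train:]


# Toy frame standing in for the real data (values are illustrative).
toy = pd.DataFrame({"years": np.arange(10), "fitur_1": np.arange(10) * 0.5})
toy_train, toy_test = positional_split(toy, n_train=9)

print(len(toy_train), len(toy_test))
print(N_TRAIN_OFFICIAL + N_TEST_OFFICIAL)
```

On the real data the call would be `positional_split(df, N_TRAIN_OFFICIAL)`, with features and target separated afterwards as in the cell above.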


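# Predictions from HistGradientBoosting and the Polynomial Model

Next, we compare the results of the two models to decide which one to use in the end. The sample rows come from the head of the DataFrame and may overlap with the training data, so the sample table is illustrative only.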

In [None]:
# Take the first 8 rows of the original DataFrame
sample_data_comparison = df.head(8).copy()

# Actual target
actual_years = sample_data_comparison['years'].values

# --- Predictions from HistGradientBoostingRegressor ---
# Prepare the features for HistGradientBoostingRegressor (no scaling or RFE for this sample prediction)
X_sample_histgb = sample_data_comparison.drop('years', axis=1)
predictions_histgb = model.predict(X_sample_histgb)
predictions_histgb_rounded = np.round(predictions_histgb)

# --- Predictions from Polynomial Regression + RFE ---
# 1. RFE transformation of the sample data (it must be scaled first)
X_sample_rfe_for_poly = rfe.transform(scaler.transform(sample_data_comparison.drop('years', axis=1)))
# 2. Polynomial transformation of the RFE-reduced sample data
X_sample_final_poly_for_poly = poly_final.transform(X_sample_rfe_for_poly)
# 3. Predict with final_model (Polynomial Regression)
predictions_poly_model = final_model.predict(X_sample_final_poly_for_poly)
predictions_poly_rounded_model = np.round(predictions_poly_model)

# Build the comparison DataFrame
comparison_df_final = pd.DataFrame({
    'Actual_Years': actual_years,
    'Predicted_HistGB': predictions_histgb_rounded,
    'Predicted_Polynomial': predictions_poly_rounded_model
})

print("\n--- Prediction Comparison: HistGradientBoosting vs Polynomial Regression (First 8 Rows) ---")
display(comparison_df_final)

print("\n--- Model Performance Summary ---")
print(f"R-squared HistGradientBoosting: {r2:.4f}")
print(f"R-squared Polynomial Regression: {r2_test:.4f}")

# The messages below are f-strings so that the percentages are actually interpolated.
if r2 > r2_test:
    print(f"\nBased on R-squared, the HistGradientBoostingRegressor model performs better.\nIt explains more of the variance in song release year (about {r2*100:.2f}%) than Polynomial Regression (about {r2_test*100:.2f}%).")
else:
    print(f"\nBased on R-squared, the Polynomial Regression model performs better.\nIt explains more of the variance in song release year (about {r2_test*100:.2f}%) than HistGradientBoosting (about {r2*100:.2f}%).")


--- Prediction Comparison: HistGradientBoosting vs Polynomial Regression (First 8 Rows) ---




Unnamed: 0,Actual_Years,Predicted_HistGB,Predicted_Polynomial
0,2001,1997.0,1997.0
1,2001,1997.0,1999.0
2,2001,1998.0,1998.0
3,2001,2000.0,2002.0
4,2001,1999.0,1999.0
5,2001,1999.0,1999.0
6,2001,2002.0,2002.0
7,2001,1994.0,1996.0



--- Model Performance Summary ---
R-squared HistGradientBoosting: 0.3777
R-squared Polynomial Regression: 0.3263

Based on R-squared, the HistGradientBoostingRegressor model performs better.
It explains more of the variance in song release year (about 37.77%) than Polynomial Regression (about 32.63%).


### Is there another model formulation that does not predict the year directly but is more effective and lightweight?

Yes. One example is predicting whether a song was released before 2000, after 2000, or in 2000. However, the assignment instructions call for prediction, and that kind of bucketing is no longer prediction (regression) but classification.
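
For illustration only, the three-way bucketing described above could be derived from the year column as follows. The label names and the sample years are assumptions, and this reformulation is outside the regression task the notebook addresses.

```python
import numpy as np

# Made-up release years standing in for df['years'].
sample_years = np.array([1985, 1999, 2000, 2001, 2010])

# Conditions are checked in order; anything that matches neither falls to the default.
labels = np.select(
    [sample_years < 2000, sample_years == 2000],
    ["before_2000", "year_2000"],
    default="after_2000",
)

print(list(labels))
```

The resulting labels could then serve as the target of a classifier instead of a regressor.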