# Continuation of IV. Popularity Classification

## Model Building (using undersampling):
_As mentioned in the previous notebook, I'm trying undersampling here to see whether it yields a model that generalizes without overfitting. Undersampling discards some data, but since the dataset is large (110k+ rows), it seemed a viable option to experiment with._
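To make the idea concrete, here is a minimal sketch of random undersampling, assuming the common default of cutting every class down to the size of the rarest one. The helper name `random_undersample` and the toy class counts are illustrative only and are not part of this notebook's pipeline.

```python
import random
from collections import Counter, defaultdict

def random_undersample(labels, seed=0):
    """Return sorted row indices that keep an equal number of rows per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Every class is cut down to the size of the rarest class.
    n_min = min(len(rows) for rows in by_class.values())
    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, n_min))
    return sorted(kept)

# Toy labels with hypothetical class counts (not the real dataset).
toy_labels = [0] * 50 + [1] * 30 + [2] * 10
kept_idx = random_undersample(toy_labels)
balanced = Counter(toy_labels[i] for i in kept_idx)
print(balanced)
```

In this toy case 60 of the 90 rows are thrown away to balance the classes, which is the data loss mentioned above.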

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_pickle('cleaneddata_2.pkl')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 113549 entries, 0 to 113999
Data columns (total 18 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   duration_ms               113549 non-null  int64  
 1   danceability              113549 non-null  float64
 2   energy                    113549 non-null  float64
 3   loudness                  113549 non-null  float64
 4   speechiness               113549 non-null  float64
 5   acousticness              113549 non-null  float64
 6   instrumentalness          113549 non-null  float64
 7   liveness                  113549 non-null  float64
 8   valence                   113549 non-null  float64
 9   tempo                     113549 non-null  float64
 10  key_sin                   113549 non-null  float64
 11  key_cos                   113549 non-null  float64
 12  mode_1                    113549 non-null  int32  
 13  time_signature_1          113549 non-null  int32 

In [4]:
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
import time

In [5]:
#Target and Feature split
X = df.drop(['popularity_class_encoded'], axis=1)
y = df['popularity_class_encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

#Applying RandomUnderSampler to the training data
under_sampler = RandomUnderSampler(random_state=0)
X_train_under, y_train_under = under_sampler.fit_resample(X_train, y_train)

#Models for evaluation (the third element of each tuple flags whether the model needs scaled features)
models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000), True),
    ("KNN", KNeighborsClassifier(), True),
    ("Decision Tree", DecisionTreeClassifier(), False),
    ("Random Forest", RandomForestClassifier(), False),
    ("AdaBoost", AdaBoostClassifier(), False),
    ("Naive Bayes", GaussianNB(), True),
    ("XGBoost", XGBClassifier(use_label_encoder=False, eval_metric='mlogloss'), False)
]

In [6]:
results_list = []

#Dictionary to store each model
fitted_models = {}

#Scaling features if necessary
if any(model[2] for model in models):  #Checking if any model requires scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_under)
    X_test_scaled = scaler.transform(X_test)
else:
    X_train_scaled, X_test_scaled = X_train_under, X_test
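The cell above fits the scaler on the under-sampled training data only and then reuses it on the test set. A small sketch of why that matters, using plain Python instead of `StandardScaler`; the helper names and the tempo values are hypothetical.

```python
from statistics import mean, pstdev

def fit_standardizer(column):
    """Learn the mean and population standard deviation of one feature."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # fall back to 1.0 for constant columns
    return mu, sigma

def standardize(column, mu, sigma):
    """Apply z = (x - mu) / sigma with previously learned statistics."""
    return [(value - mu) / sigma for value in column]

train_tempo = [90.0, 120.0, 150.0]  # hypothetical training values
test_tempo = [120.0, 180.0]         # hypothetical held-out values

# Statistics come from the training column only...
mu, sigma = fit_standardizer(train_tempo)
train_scaled = standardize(train_tempo, mu, sigma)
# ...and are reused unchanged on the test column, so no test
# information leaks into the preprocessing step.
test_scaled = standardize(test_tempo, mu, sigma)
print(train_scaled, test_scaled)
```

The scaled training column has mean 0 and unit variance, while the test column is simply expressed on the same scale and need not be centred.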

In [7]:
#Training and evaluating models
for name, model, needs_scaling in models:
    start_time = time.time()
    
    #Fitting the model
    model.fit(X_train_scaled if needs_scaling else X_train_under, y_train_under)
    
    #Storing the fitted model
    fitted_models[name] = model
    
    #Making predictions
    y_pred_train = model.predict(X_train_scaled if needs_scaling else X_train_under)
    y_pred_test = model.predict(X_test_scaled if needs_scaling else X_test)
    
    #Evaluating performance
    train_accuracy = accuracy_score(y_train_under, y_pred_train)
    test_accuracy = accuracy_score(y_test, y_pred_test)
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_test, average='weighted')
    time_taken = time.time() - start_time
    
    #Appending results to the list
    results_list.append({"Model": name, "Accuracy_Train": train_accuracy, 
                         "Accuracy_Test": test_accuracy, "Precision": precision,
                         "Recall": recall, "F1_Score": f1, "Time_Taken": time_taken})

results_df = pd.DataFrame(results_list)

In [8]:
results_df

| | Model | Accuracy_Train | Accuracy_Test | Precision | Recall | F1_Score | Time_Taken (s) |
|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.428002 | 0.412359 | 0.449937 | 0.412359 | 0.415682 | 0.224231 |
| 1 | KNN | 0.679923 | 0.522325 | 0.541571 | 0.522325 | 0.522158 | 19.903474 |
| 2 | Decision Tree | 0.991489 | 0.585792 | 0.604294 | 0.585792 | 0.586554 | 1.926924 |
| 3 | Random Forest | 0.991489 | 0.658007 | 0.670067 | 0.658007 | 0.660145 | 46.634097 |
| 4 | AdaBoost | 0.455271 | 0.426538 | 0.466915 | 0.426538 | 0.430162 | 5.361146 |
| 5 | Naive Bayes | 0.405392 | 0.349185 | 0.457924 | 0.349185 | 0.321734 | 0.100839 |
| 6 | XGBoost | 0.723956 | 0.567151 | 0.593868 | 0.567151 | 0.570957 | 2.441031 |
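Note that the Recall column equals Accuracy_Test for every model. That is expected with `average='weighted'`: each class's recall is weighted by its share of the test set, so the weights cancel and the sum reduces to overall accuracy. A small sketch with made-up labels (the two helper functions are illustrative, not scikit-learn's implementation):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights equal to class support."""
    n = len(y_true)
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # weight = support / n, recall = hits / support
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)

# Made-up labels for three classes, purely for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 0, 2, 0, 2]

acc = accuracy(y_true, y_pred)
rec = weighted_recall(y_true, y_pred)
print(acc, rec)
```

Because of this, the weighted Recall column adds no information beyond accuracy; per-class or macro-averaged recall would be needed to see how each popularity class fares.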


## Insights for the under-sampled modelling: 
- Compared with the over-sampled results, the under-sampled models perform similarly but slightly worse: test accuracy drops for every model while train accuracy stays roughly the same.
- The over-sampled Random Forest, before hyper-parameter tuning, reached a test accuracy of 69.8%, while the under-sampled Random Forest reaches a test accuracy of 65.8%, a clear reduction. Similarly, the over-sampled XGBoost reached 59.4% before tuning, while the under-sampled XGBoost reaches 56.7%.
- I'm not performing hyper-parameter tuning here: given how the test accuracy dropped relative to the over-sampled models, I don't expect these models to outperform them.
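The comparison can be quantified directly from the figures reported in this notebook. The sketch below computes the train/test accuracy gap for each under-sampled model and the drop in test accuracy against the untuned over-sampled baselines; the dictionary names are my own, and the numbers are copied from the results table and the bullets above.

```python
# (train accuracy, test accuracy) for the under-sampled models, from the results table
results = {
    "Logistic Regression": (0.428002, 0.412359),
    "KNN": (0.679923, 0.522325),
    "Decision Tree": (0.991489, 0.585792),
    "Random Forest": (0.991489, 0.658007),
    "AdaBoost": (0.455271, 0.426538),
    "Naive Bayes": (0.405392, 0.349185),
    "XGBoost": (0.723956, 0.567151),
}

# A large train/test gap is a sign of overfitting.
gaps = {name: train - test for name, (train, test) in results.items()}
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} gap = {gap:.3f}")

# Untuned test accuracy in percent: (over-sampled, under-sampled)
baselines = {"Random Forest": (69.8, 65.8), "XGBoost": (59.4, 56.7)}
drops = {name: round(over - under, 1) for name, (over, under) in baselines.items()}
print(drops)
```

The tree-based models show the widest gaps, so undersampling did not remove the overfitting it was meant to address.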

# V. Result comparison between Regression vs Classification

## Model Complexity and Interpretability
- **Regression**: Predicted continuous popularity scores, dealing with non-linear relationships and skewness. Best model (Random Forest) achieved moderate R²=0.54 with significant overfitting.
- **Classification**: Simplified prediction by categorizing popularity into classes based on the distribution of popularity. Tuned Random Forest and XGBoost models achieved ~70% accuracy with balanced scores, demonstrating better performance.

## Accuracy and Generalization
- **Regression**: Struggled with high errors; a low test R² alongside a high train R² indicated poor generalization.
- **Classification**: Showed better accuracy and generalization, particularly after hyper-parameter tuning and class balancing with SMOTE.

## Model Training and Evaluation
- **Regression**: Involved feature engineering and transformations but still showed overfitting.
- **Classification**: Benefitted from SMOTE for class balance and hyper-parameter tuning, leading to models with better discrimination capabilities.
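For readers unfamiliar with SMOTE, the core idea is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified sketch of that idea only, not the imbalanced-learn implementation used in the earlier notebook; the function name `smote_like`, its parameters, and the toy points are hypothetical.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class in a two-feature space (illustrative values).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority_points, n_new=5)
print(new_points)
```

Unlike undersampling, no rows are discarded: the minority classes grow with interpolated points, which fits the observation that the over-sampled models retained more test accuracy.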

## Feature Importance and Insights
- **Regression**: Identified features like acousticness and danceability as important, but overfitting hindered accurate predictions.
- **Classification**: Provided clearer insights into feature influence on popularity classes, with balanced importance across features.

## Computational Efficiency
- **Regression**: Required significant resources for training and tuning without substantial performance improvements.
- **Classification**: Demanded computational power for tuning but yielded better performance, justifying the investment.

Classification outperformed regression in predicting song popularity across accuracy, generalization, and interpretability. It effectively addressed class imbalance and provided actionable insights, making it the preferred approach for this task.

# VI. Conclusion

After extensive data analysis, feature engineering, and model evaluation through both regression and classification approaches, the analysis concludes with the recommendation of a Random Forest classification model, refined through hyper-parameter tuning and oversampling techniques to handle class imbalances effectively. This model demonstrates a robust ability to classify songs into their respective popularity categories, with substantial accuracy, precision, recall, and F1 score metrics.

## Key Insights:

- **Musical Features Matter**: Features such as acousticness, duration, danceability, and energy levels play significant roles in predicting song popularity. This insight can guide artists and producers in creating music that aligns with current trends and preferences.
- **Handling Class Imbalance**: The use of oversampling (SMOTE) proved more effective than undersampling in this context, leading to better model performance. This strategy ensures that the model learns adequately from all classes, reducing bias towards more common outcomes.
- **Predictive Modeling as a Tool**: The developed model can serve as a strategic tool for stakeholders in the music industry to predict song popularity with a reasonable degree of accuracy. It can complement existing decision-making processes, from A&R (Artists and Repertoire) decisions to marketing strategy formulation.

## Strategic Recommendations:

- **Data-Driven A&R Decisions**: Utilize the model's insights in the artist signing and song selection process, favoring songs with characteristics that align with patterns of higher popularity classes.
- **Tailored Marketing Strategies**: Allocate marketing resources more efficiently by focusing on songs predicted to achieve medium to high popularity, using targeted promotion strategies to maximize their market impact.
- **Creative Guidance**: Artists and songwriters can use the model's insights into influential features to guide their creative process, potentially increasing their songs' appeal to a broader audience.
- **Continuous Improvement and Adaptation**: Regularly update the model with new data to adapt to changing musical trends and preferences, ensuring its predictions remain relevant and accurate.

## Future Directions:

- Exploring advanced machine learning and deep learning techniques to enhance model performance further.
- Integrating sentiment analysis of lyrics and social media trends to add another dimension to the predictive capabilities.
- Expanding the model to include regional and genre-specific popularity predictions, offering more insights for targeted marketing strategies.

In conclusion, the predictive model offers valuable insights and a strategic advantage in the highly competitive music industry, enabling stakeholders to make informed decisions based on data-driven predictions of song popularity.
