# Random Forest Classifier
Random Forest is a popular ensemble machine learning algorithm used for classification and regression tasks. It is a bagging method: many decision trees are each trained on a bootstrap sample of the training data, with a random subset of features considered at each split, and their predictions are aggregated (majority vote for classification, averaging for regression) to make the final prediction.
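
To make the bagging idea concrete, here is a minimal standard-library sketch, not scikit-learn's implementation. It assumes toy data and uses hypothetical helpers (`fit_stump`, `fit_bagged`, `predict_bagged`) with one-split decision stumps as base learners. A real random forest grows deeper trees and also samples a random subset of features at each split, which this sketch omits.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Find the single (feature, threshold) split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    if best is None:
        # Every row in the sample is identical: always predict the majority label.
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


def fit_bagged(rows, labels, n_estimators=25, seed=42):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return ensemble


def predict_bagged(ensemble, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in ensemble)
    return votes.most_common(1)[0][0]


# Toy data (an assumption for illustration): only the first feature separates the classes.
rows = [(0.1, 5.0), (0.2, 3.0), (0.3, 4.0), (0.4, 1.0),
        (0.6, 2.0), (0.7, 5.0), (0.8, 3.0), (0.9, 1.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

ensemble = fit_bagged(rows, labels)
low_pred = predict_bagged(ensemble, (0.05, 3.0))
high_pred = predict_bagged(ensemble, (0.95, 3.0))
print(low_pred, high_pred)
```

Because each stump sees a different bootstrap sample, the individual thresholds vary; the vote smooths out that variation, which is the variance-reduction effect bagging relies on.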

## Benefits of Random Forest
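- It can handle high-dimensional data well; categorical features also work, although scikit-learn's implementation needs them encoded numerically first (as with the one-hot `key_*` and `mode_*` columns used here)
- It is relatively fast to train and make predictions
- It is resistant to overfitting: a single decision tree overfits easily, but averaging many trees trained on different bootstrap samples reduces the variance of the final prediction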
- It can provide feature importance scores, which can be useful for feature selection and understanding the dataset
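
As a rough illustration of the feature importance scores mentioned above, the sketch below fits a forest on synthetic data (generated with `make_classification`, not the Spotify table) and ranks the impurity-based `feature_importances_`. The names `X_demo`, `y_demo`, and `feature_names` are placeholders introduced for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 6 features, of which only the first 2 carry signal
# (shuffle=False keeps the informative columns first).
X_demo, y_demo = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances: one non-negative score per feature, normalized to sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The same pattern should carry over to the model trained later in this notebook by using `X.columns` as the index. Impurity-based scores can favor features with many distinct values, so they are best read as a rough ranking.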

## Limitations of Random Forest
- It can be difficult to interpret the model, since it is an ensemble of decision trees and the overall decision process is not transparent
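- On highly imbalanced datasets it can favor the majority class unless class weights or resampling are used
- It may be less accurate than other models, such as gradient-boosted trees, on some complex datasets
- It can be more resource-intensive to train and tune than simpler models such as a single decision tree, because many trees must be built and stored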

In [135]:
from sqlalchemy import create_engine
import pandas as pd
from config import db_password
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, classification_report, confusion_matrix, accuracy_score

In [136]:
# DATABASE
# Build the connection string for the local PostgreSQL database
db_string = f"postgresql://postgres:{db_password}@127.0.0.1:5432/Spotify_data"
# Create the database engine
engine = create_engine(db_string)

In [137]:
df_encoded = pd.read_sql("SELECT * FROM encoded", con=engine, index_col="track_id")

In [138]:
df_encoded.head()

track_id,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,...,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#,mode_Major,mode_Minor
00021Wy6AyMbLP2tqij86e,0.234,0.617,169173,0.862,0.976,0.141,-12.855,0.0514,129.578,0.886,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
000CzNKC8PEt1yC3L8dqwV,0.249,0.518,130653,0.805,0.0,0.333,-6.248,0.0407,79.124,0.841,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
000DfZJww8KiixTKuk9usJ,0.366,0.631,357573,0.513,4e-06,0.109,-6.376,0.0293,120.365,0.307,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
000EWWBkYaREzsBplYjUag,0.815,0.768,104924,0.137,0.922,0.113,-13.284,0.0747,76.43,0.56,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
000xQL6tZNLJzIrtIgxqSl,0.131,0.748,188491,0.627,0.0,0.0852,-6.029,0.0644,120.963,0.524,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


## Splitting the data

Our target column is the binary **`is_popular`** column created in the data preparation phase. The feature columns are all the remaining columns, since the columns not needed by the machine learning model were already dropped.

In [139]:
# Split the data into features and target
y = df_encoded["is_popular"]
X = df_encoded.drop(columns="is_popular")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Create the scaler
scaler = StandardScaler()

# Fit the scaler to the training data
scaler.fit(X_train)

# Scale the training and test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)




In [141]:
# Evaluate the model
# Calculate the confusion matrix.
cm = confusion_matrix(y_test, y_pred)

# Create a DataFrame from the confusion matrix.
cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"])

In [142]:
# Calculating the accuracy score.
acc_score = accuracy_score(y_test, y_pred)

In [143]:
# Displaying results
print("Confusion Matrix")
display(cm_df)
print(f"Accuracy Score : {acc_score}")
print("Classification Report")
print(classification_report(y_test, y_pred))

Confusion Matrix


,Predicted 0,Predicted 1
Actual 0,36540,1674
Actual 1,1984,41359


Accuracy Score : 0.9551479333472296
Classification Report
              precision    recall  f1-score   support

           0       0.95      0.96      0.95     38214
           1       0.96      0.95      0.96     43343

    accuracy                           0.96     81557
   macro avg       0.95      0.96      0.95     81557
weighted avg       0.96      0.96      0.96     81557



## Interpretation

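In this report, the classifier has a precision of 0.95 for class 0 and 0.96 for class 1, a recall of 0.96 for class 0 and 0.95 for class 1, and an f1-score of 0.95 for class 0 and 0.96 for class 1.

Overall, the classifier performs well on this test set. Accuracy is 0.955 (shown as 0.96 in the report after rounding), with 3,658 of 81,557 tracks misclassified: 1,674 false positives and 1,984 false negatives. The two classes are close to balanced (38,214 versus 43,343 test samples), so accuracy is a reasonable summary here, and precision, recall, and f1-score are similar for both classes.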
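
The per-class figures quoted above can be re-derived by hand from the four confusion-matrix counts. The sketch below uses only the counts shown in the output; the variable names are illustrative, and the majority-class baseline is an added point of comparison that the original notebook did not compute.

```python
# Counts copied from the confusion matrix output (rows = actual, columns = predicted).
tn, fp = 36540, 1674   # actual class 0
fn, tp = 1984, 41359   # actual class 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Class 1: treat class 1 as the "positive" label.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0: the same formulas with the roles of the two classes swapped.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# Accuracy of a trivial model that always predicts the larger class.
baseline = max(tn + fp, fn + tp) / total

print(f"accuracy={accuracy:.4f}  baseline={baseline:.4f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The baseline comes out at about 0.53, so the reported accuracy is well above what always guessing the majority class would achieve.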