# Introduction
This Jupyter Notebook analyzes YouTube trending videos using machine learning. It predicts whether a video has a "high view count", defined here as a view count above the dataset median, from video and channel attributes. A Random Forest classifier is evaluated with stratified 5-fold cross-validation on the training data and then on a held-out test set. Finally, the model's feature importances are examined.

# Data Preparation
- We start by importing the necessary libraries.

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder

- Load the dataset.

In [2]:
# Read the cleaned trending-videos dataset
df = pd.read_csv('cleaned_youtube_trending_videos_global.csv')

- Preprocess the categorical features by label encoding them as integers.

In [3]:
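# Encode categorical features as integer codes.
# The same LabelEncoder is refit for every column, so after this cell
# it only holds the mapping of the last column encoded.
label_encoder = LabelEncoder()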
df['video_category_id'] = label_encoder.fit_transform(df['video_category_id'])
df['video_dimension'] = label_encoder.fit_transform(df['video_dimension'])
df['video_definition'] = label_encoder.fit_transform(df['video_definition'])
df['channel_country'] = label_encoder.fit_transform(df['channel_country'])
df['channel_have_hidden_subscribers'] = label_encoder.fit_transform(df['channel_have_hidden_subscribers'].astype(str))
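
Because a single encoder is refit for each column, the integer codes cannot later be mapped back to the original labels. A minimal sketch of one alternative, keeping one encoder per column, is shown below. The toy DataFrame and the `encoders` dictionary are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "video_definition": ["hd", "sd", "hd"],
    "channel_country": ["US", "IN", "US"],
})

# One encoder per column, so every learned mapping stays available.
encoders = {}
for col in ["video_definition", "channel_country"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Recover the original labels from the integer codes.
decoded = encoders["channel_country"].inverse_transform(toy["channel_country"])
print(list(decoded))
```

Keeping the fitted encoders around also makes it possible to apply the same mapping to new data with `transform`.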

- Create the binary target variable (high_view_count): 1 if a video's view count is above the median, 0 otherwise.

In [4]:
# Split at the median; videos exactly at the median fall into class 0
median_view_count = df['video_view_count'].median()
df['high_view_count'] = (df['video_view_count'] > median_view_count).astype(int)
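
The strict `>` comparison means that videos sitting exactly at the median are labeled 0, so the two classes are only approximately balanced. The sketch below illustrates this on a made-up series of view counts (the numbers are assumptions chosen to exaggerate the effect of ties).

```python
import pandas as pd

# Made-up view counts with several ties at the median.
views = pd.Series([100, 200, 200, 200, 900])

median = views.median()
high = (views > median).astype(int)

# Share of each class; ties at the median all land in class 0.
balance = high.value_counts(normalize=True)
print(median, high.tolist())
print(balance)
```

Checking `value_counts` on the real target in the same way is a quick test of how balanced the classes actually are.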

# Feature Selection
- Select relevant features for training the machine learning model.

In [5]:
# Define features and target
features = ['video_category_id', 'video_duration', 'video_dimension', 'video_definition', 'channel_view_count', 
            'channel_subscriber_count', 'channel_video_count', 'channel_have_hidden_subscribers']
X = df[features]
y = df['high_view_count']

# Train-Test Split
- Split the data into training and test sets.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
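
The split above is random but not stratified. As a possible variant (not what the notebook does), `train_test_split` accepts a `stratify` argument that preserves the class ratio in both partitions. The sketch uses small synthetic arrays, `X_toy` and `y_toy`, which are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 10 per class.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# stratify=y_toy keeps the 50/50 class ratio in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(y_te), int(y_te.sum()))
```

With a median-based target the classes are already close to balanced, so stratification matters less here than it would for a skewed target.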

# Model Training and Cross-Validation
- Train a Random Forest Classifier and evaluate it using stratified k-fold cross-validation.

In [7]:
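# max_depth=None and min_samples_split=2 are the scikit-learn defaults
# (fully grown trees). No random_state is set, so exact scores can vary
# slightly between runs.
rf_classifier = RandomForestClassifier(
    n_estimators=50,
    max_depth=None,
    min_samples_split=2,
)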

# Perform stratified k-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(rf_classifier, X_train, y_train, cv=cv, scoring='accuracy')
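
To illustrate what "stratified" means here, the sketch below runs `StratifiedKFold` on a small imbalanced label vector and records the share of positives in each validation fold. The arrays `X_toy` and `y_toy` are synthetic assumptions; the splitter settings mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 15 negatives and 5 positives (25% positive).
y_toy = np.array([0] * 15 + [1] * 5)
X_toy = np.zeros((20, 1))  # features are irrelevant for the split itself

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Share of positives in each validation fold.
fold_positive_rates = []
for train_idx, val_idx in cv_toy.split(X_toy, y_toy):
    fold_positive_rates.append(float(y_toy[val_idx].mean()))

print(fold_positive_rates)
```

Each validation fold keeps the overall 25% positive rate, which is the property that makes fold scores comparable.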

- Let's take a look at the cross-validation scores and accuracy results.

In [8]:
print("Cross-validation scores:")
cv_scores

Cross-validation scores:


array([0.99339769, 0.99362088, 0.99382382, 0.99351264, 0.99417558])

In [9]:
print("Mean cross-validation accuracy:")
cv_scores.mean()

Mean cross-validation accuracy:


np.float64(0.9937061228773182)
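
Alongside the mean, the spread of the fold scores shows how stable the estimate is. The sketch below recomputes the mean and adds the standard deviation, using the five fold accuracies copied from the output above. The variable names are illustrative.

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above.
scores = np.array([0.99339769, 0.99362088, 0.99382382, 0.99351264, 0.99417558])

mean_acc = scores.mean()
std_acc = scores.std()  # population standard deviation across the 5 folds

print(f"CV accuracy: {mean_acc:.4f} +/- {std_acc:.4f}")
```

The standard deviation comes out at roughly 0.0003, so the five folds agree closely with one another.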

# Model Evaluation
- Evaluate the trained model on the test set and report accuracy, classification report, and feature importance.

In [10]:
# Train the model
rf_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
print("Accuracy:")
accuracy_score(y_test, y_pred)

Accuracy:


0.9944637764296499

- Let's evaluate the model's accuracy on the training data.

In [11]:
y_train_pred = rf_classifier.predict(X_train)
print("Train Accuracy:")
accuracy_score(y_train, y_train_pred)

Train Accuracy:


0.9980625897168431

- Let's take a look at the classification report for the test data.

In [12]:
print("Classification Report:\n")
# Print the report so newlines render instead of showing the raw string
print(classification_report(y_test, y_pred))

Classification Report:

              precision    recall  f1-score   support

           0       1.00      0.99      0.99     92416
           1       0.99      1.00      0.99     92367

    accuracy                           0.99    184783
   macro avg       0.99      0.99      0.99    184783
weighted avg       0.99      0.99      0.99    184783
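
The per-class precision and recall in the report are derived from the confusion matrix. As a small illustration, the sketch below computes both for class 1 by hand on made-up labels (`y_true_toy` and `y_pred_toy` are assumptions, not the notebook's predictions).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration.
y_true_toy = [0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)  # of predicted positives, how many are correct
recall_1 = tp / (tp + fn)     # of actual positives, how many are found

print(cm)
print(precision_1, recall_1)
```

Running `confusion_matrix(y_test, y_pred)` on the real predictions would show the raw error counts behind the rounded figures in the report.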

- Let's analyze the feature importances from the random forest model.

In [13]:
# Analyze feature importance
feature_importances = pd.DataFrame(rf_classifier.feature_importances_,
                                   index=features,
                                   columns=['importance']).sort_values('importance', ascending=False)

print("Feature Importances:\n")
feature_importances

Feature Importances:



                                 importance
video_duration                     0.308634
channel_view_count                 0.276251
channel_subscriber_count           0.224004
channel_video_count                0.142087
video_category_id                  0.048644
video_definition                   0.000380
video_dimension                    0.000000
channel_have_hidden_subscribers    0.000000
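
The values above are impurity-based importances, which are computed on the training data and can favor features with many distinct values. One way to cross-check them, a different technique from the one used in this notebook, is permutation importance on held-out data. The sketch below is a minimal example on synthetic data; the arrays, the split, and the feature names `signal` and `noise` are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic data: column 0 determines the label, column 1 is pure noise.
X_toy = rng.normal(size=(300, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Simple hold-out split: first 200 rows for training, the rest for scoring.
X_fit, y_fit = X_toy[:200], y_toy[:200]
X_held, y_held = X_toy[200:], y_toy[200:]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle one column at a time and measure the drop in held-out accuracy.
result = permutation_importance(model, X_held, y_held, n_repeats=5, random_state=0)

for name, score in zip(["signal", "noise"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```

Applied to this notebook, the same call would take `rf_classifier`, `X_test`, and `y_test`, and the resulting ranking could be compared with the table above.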
