<h1>YouTube Views Feature Selection</h1>
<h3>Goals for this project:</h3>
<ul><li>The website <a href=https://socialblade.com/>https://socialblade.com/</a> assigns grades to content creators on various platforms, based on their popularity and reach. </li><li>Therefore, one can ask which YouTube metric has the greatest influence on the rankings of this website. </li><li>We will use a dataset scraped from socialblade.com (<a href="https://www.kaggle.com/datasets/balaka18/youtubers-popularity-dataset">here</a>).</li></ul>

In [1]:
import pandas as pd
import numpy as np 

df = pd.read_csv('usa_top_500.csv')
df

Unnamed: 0,Rank,Grade,Ch_name,Uploads,Subscriptions,Views
0,1st,A++,Cocomelon - Nursery Rhymes,518,78.2M,57088982878
1,2nd,A++,✿ Kids Diana Show,692,50.9M,24179259602
2,3rd,A++,Like Nastya,401,52.3M,30609490114
3,4th,A++,Movieclips,35216,36.3M,35071065049
4,5th,A++,Vlad and Nikita,219,37.7M,18086626003
...,...,...,...,...,...,...
245,246th,A,RomeoSantosVEVO,161,9M,9396425529
246,247th,A,Moonbug Kids - Cartoons & Nursery …,505,1.74M,464543675
247,248th,A,Coco Jelly - Kids Songs,599,220K,322073020
248,249th,A,Linkin Park,497,15.8M,8088860682


<h1>Clean the Data</h1>
<ul><li>Let us look for null values.</li></ul>

In [2]:
df.isnull().sum()

Rank              0
Grade            43
Ch_name           0
Uploads           0
Subscriptions     0
Views             0
dtype: int64

<ul><li>The Grade column has 43 null values, so we will have to address that later.</li>
<li>Now let's make sure the values are the correct dtypes.</li></ul>

In [3]:
df.dtypes

Rank             object
Grade            object
Ch_name          object
Uploads          object
Subscriptions    object
Views            object
dtype: object

<ul><li>Every column is stored as strings (dtype object), so we should convert the numerical columns to floats.</li><li>First, let's strip the ordinal suffixes (st, nd, rd, th) from the Rank column.</li></ul>

In [4]:
# Remove the trailing ordinal suffix ('1st' -> '1', '22nd' -> '22'), then convert to float.
# A regex anchored at the end of the string avoids rstrip's character-set semantics.
df.Rank = df.Rank.str.replace(r'(st|nd|rd|th)$', '', regex=True).astype('float64')

<ul><li>We should also convert the Uploads, Subscriptions and Views columns to numeric (float) values.</li></ul>

In [5]:
df.Uploads = df.Uploads.apply(lambda x : x.replace(',',''))
df.Uploads = df.Uploads.astype('float')

def alpha_to_numeric(x):
    if x.find('B') != -1: return float(x.rstrip('B'))*1000000000
    elif x.find('M') != -1: return float(x.rstrip('M'))*1000000
    elif x.find('K') != -1: return float(x.rstrip('K'))*1000
    else: return float(x)
    
df.Subscriptions = df.Subscriptions.apply(alpha_to_numeric)

df.Views = df.Views.apply(lambda x : x.replace(',',''))
df.Views = df.Views.astype('float64')

df.dtypes

Rank             float64
Grade             object
Ch_name           object
Uploads          float64
Subscriptions    float64
Views            float64
dtype: object
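
<ul><li>As a side note, the suffix handling above can be sketched as a single lookup-based helper. This is only an illustration: <code>parse_count</code> and <code>SUFFIX_MULTIPLIERS</code> are hypothetical names that do not appear in the notebook, and the sample strings are chosen to mirror the formats seen in the dataset.</li></ul>

```python
# Hypothetical helper (not part of the notebook): parse counts such as '78.2M' or '35,216'.
SUFFIX_MULTIPLIERS = {'K': 1e3, 'M': 1e6, 'B': 1e9}

def parse_count(text):
    """Convert a string like '220K', '1.74M' or '35,216' to a float."""
    cleaned = text.strip().replace(',', '')
    suffix = cleaned[-1].upper() if cleaned else ''
    if suffix in SUFFIX_MULTIPLIERS:
        # Drop the suffix letter and scale the remaining number.
        return float(cleaned[:-1]) * SUFFIX_MULTIPLIERS[suffix]
    # No recognised suffix: treat the whole string as a plain number.
    return float(cleaned)

samples = ['78.2M', '220K', '1.74M', '35,216', '497']
parsed = [parse_count(s) for s in samples]
print(parsed)
```

<ul><li>Keeping the multipliers in a dictionary means a new suffix only needs a new entry, and the same helper could be applied to the Uploads, Subscriptions and Views columns alike.</li></ul>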

<h1>Importance Selection</h1>
<ul><li>The uploader of the dataset describes the Grade column simply as 'Grade assigned (A++, A+, A)' and the Rank column simply as 'Rank of the channel', so let us try to see how these were determined.</li><li>It will be helpful to convert the Grade column to numeric values with the scheme A++ --> 3, A+ --> 2, A --> 1.</li></ul>

In [6]:
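# Named differently from the earlier alpha_to_numeric so that the
# subscription parser is not overwritten if cells are re-run.
def grade_to_numeric(x):
    if x == 'A++': return 3
    elif x == 'A+': return 2
    elif x == 'A': return 1
    # Missing grades (NaN) fall through and stay missing.

df.Grade = df.Grade.apply(grade_to_numeric)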

df.dtypes

Rank             float64
Grade            float64
Ch_name           object
Uploads          float64
Subscriptions    float64
Views            float64
dtype: object

In [7]:
df

Unnamed: 0,Rank,Grade,Ch_name,Uploads,Subscriptions,Views
0,1.0,3.0,Cocomelon - Nursery Rhymes,518.0,78200000.0,5.708898e+10
1,2.0,3.0,✿ Kids Diana Show,692.0,50900000.0,2.417926e+10
2,3.0,3.0,Like Nastya,401.0,52300000.0,3.060949e+10
3,4.0,3.0,Movieclips,35216.0,36300000.0,3.507107e+10
4,5.0,3.0,Vlad and Nikita,219.0,37700000.0,1.808663e+10
...,...,...,...,...,...,...
245,246.0,1.0,RomeoSantosVEVO,161.0,9000000.0,9.396426e+09
246,247.0,1.0,Moonbug Kids - Cartoons & Nursery …,505.0,1740000.0,4.645437e+08
247,248.0,1.0,Coco Jelly - Kids Songs,599.0,220000.0,3.220730e+08
248,249.0,1.0,Linkin Park,497.0,15800000.0,8.088861e+09


<ul><li>We will fit a machine learning model that predicts the Rank column from the Uploads, Subscriptions, and Views columns. The intent is to measure how important each of these features is to the prediction.</li>
<li>We will first split our data into training and validation sets (using scikit-learn's API), then add a scaling step for preprocessing. StandardScaler rescales each feature to zero mean and unit variance; it does not change the shape of the feature's distribution.</li><li>Our ML model will be a simple linear regressor trained with stochastic gradient descent (SGDRegressor).</li></ul>
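
<ul><li>To make the scaling step concrete, here is a minimal pure-Python sketch of the per-column computation that standardization performs (subtract the mean, divide by the population standard deviation). The <code>standardize</code> helper and the toy values are assumptions for illustration only; the notebook itself relies on scikit-learn's StandardScaler.</li></ul>

```python
import statistics

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Toy, heavily skewed values (illustrative only, not taken from the dataset).
views = [1.0, 2.0, 3.0, 4.0, 90.0]
z = standardize(views)

print(round(statistics.fmean(z), 6))   # mean of the scaled values (numerically zero)
print(round(statistics.pstdev(z), 6))  # standard deviation of the scaled values
print(max(z) > 1.9)                    # the outlier is still far from the rest
```

<ul><li>The scaled column has zero mean and unit variance, but the single large value remains an outlier: the skew of the original data is preserved.</li></ul>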

In [8]:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train_X, val_X, train_y, val_y = train_test_split(df[['Uploads', 'Subscriptions', 'Views']], 
                                                  df['Rank'], random_state=1)

model = make_pipeline(StandardScaler(), # scaling layer
                      SGDRegressor(random_state = 1))

model.fit(train_X, train_y)

In [9]:
# Use the mean absolute error (measured in rank positions) to gauge the model's accuracy
from sklearn.metrics import mean_absolute_error

mean_absolute_error(val_y, model.predict(val_X))

63.21218124611387
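
<ul><li>A mean absolute error is easier to judge next to a baseline. The sketch below computes the metric by hand and compares it with a predictor that always outputs the mean target. All numbers are made up for illustration, and <code>mean_absolute_error_py</code> is a hypothetical stand-in for scikit-learn's function.</li></ul>

```python
def mean_absolute_error_py(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy ranks and predictions (illustrative only, not from the notebook).
true_ranks = [10.0, 50.0, 120.0, 200.0]
predicted = [30.0, 40.0, 150.0, 170.0]

model_mae = mean_absolute_error_py(true_ranks, predicted)

# Baseline: always predict the mean target. For brevity this uses the mean of
# the toy targets themselves; in practice one would use the training-set mean.
mean_rank = sum(true_ranks) / len(true_ranks)
baseline_mae = mean_absolute_error_py(true_ranks, [mean_rank] * len(true_ranks))

print(model_mae, baseline_mae)
```

<ul><li>If the model's error is not clearly below the baseline's, its feature importances should be read with caution.</li></ul>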

<ul><li>Our measure of importance will be permutation importance: with the model already fitted, we take each feature in turn, randomly shuffle its values across the validation instances, and observe how much the model's validation score drops (here the pipeline's default score, R²). The model is not retrained; a feature whose shuffling hurts the score more is more important.</li></ul>
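
<ul><li>The idea can be sketched in plain Python for an already-fitted model, which is how eli5's PermutationImportance operates in its default 'prefit' mode. Everything below is an assumption made for illustration: <code>permutation_importance_py</code>, <code>r2_score_py</code>, the toy model, and the toy data are not part of the notebook or of eli5.</li></ul>

```python
import random

def r2_score_py(y_true, y_pred):
    """Coefficient of determination (R^2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def permutation_importance_py(predict, X, y, n_iter=5, seed=1):
    """Mean drop in R^2 when each column of X is shuffled.

    `predict` is an already-fitted model; it is never retrained here.
    """
    rng = random.Random(seed)
    base = r2_score_py(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_iter):
            # Shuffle one column while leaving the others untouched.
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(base - r2_score_py(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_iter)
    return importances

# Hypothetical fitted model: uses feature 0 only and ignores feature 1.
def toy_predict(row):
    return 2.0 * row[0]

X_val = [[float(i), float((7 * i) % 10)] for i in range(10)]
y_val = [2.0 * row[0] for row in X_val]

weights = permutation_importance_py(toy_predict, X_val, y_val)
print(weights)
```

<ul><li>Shuffling the ignored feature leaves the predictions unchanged, so its importance is zero, while shuffling the feature the model depends on lowers the score.</li></ul>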

In [10]:
import eli5
from eli5.sklearn import PermutationImportance

importance = PermutationImportance(model, random_state=1).fit(val_X, val_y)
eli5.show_weights(importance, feature_names = val_X.columns.tolist())

Weight,Feature
0.3577  ± 0.2720,Views
0.0003  ± 0.0050,Uploads
-0.0004  ± 0.0021,Subscriptions


<ul><li>Let us do the same for the modified Grade column. However, we must first deal with the null values.</li></ul>

In [11]:
df[df.Grade.isnull()]

Unnamed: 0,Rank,Grade,Ch_name,Uploads,Subscriptions,Views
5,6.0,,Peppa Pig - Official Channel,863.0,15200000.0,9308419000.0
6,7.0,,Animal World,318.0,2570000.0,721166300.0
7,8.0,,WWE,47027.0,57500000.0,41314290000.0
8,9.0,,Little Baby Bum - Nursery Rhymes &…,1165.0,27200000.0,24198550000.0
9,10.0,,BabyBus - Nursery Rhymes,1373.0,14900000.0,9262979000.0
10,11.0,,LooLoo Kids - Nursery Rhymes and C…,453.0,26900000.0,11563410000.0
24,25.0,,MandaPanda Toy Collector,310.0,425000.0,228364900.0
25,26.0,,Mother Goose Club Playhouse,987.0,9200000.0,8712855000.0
26,27.0,,Max Steel,42.0,825000.0,193927.0
27,28.0,,SonyMusicIndiaVEVO,3054.0,28900000.0,13752690000.0


<ul><li>Upon inspection, we can qualitatively note that many of the missing values in the Grade column correspond to children's YouTube channels or sports YouTube channels. We cannot conclude much more than this since there are no other relevant columns to explore.</li><li>There is also no apparent pattern among successive rows that would let us infer the missing grades, so the best we can do is to exclude these rows from the analysis.</li></ul>

In [12]:
# Drop the rows whose Grade is missing (Grade is the only column with nulls).
df_dropped = df.dropna(subset=['Grade'])

train_X, val_X, train_y, val_y = train_test_split(df_dropped[['Uploads', 'Subscriptions', 'Views']], 
                                                  df_dropped['Grade'], random_state=1)
model = make_pipeline(StandardScaler(), 
                      SGDRegressor(random_state = 1))

model.fit(train_X, train_y)

<ul><li>Let's take the mean absolute error to check the accuracy of our model, then examine the relative importance of each column.</li></ul>

In [13]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(val_y, model.predict(val_X))

0.17632020128433723

In [14]:
importance = PermutationImportance(model, random_state=1).fit(val_X, val_y)
eli5.show_weights(importance, feature_names = val_X.columns.tolist())

Weight,Feature
0.0624  ± 0.1312,Views
0.0227  ± 0.0407,Uploads
-0.0502  ± 0.1450,Subscriptions
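
<ul><li>One rough way to read the two weight tables above is to ask whether each weight differs from zero by more than its reported spread. The sketch below applies that check to the printed numbers. Treating "weight ± spread" as an interval is a heuristic assumed here for illustration, not a formal significance test, and <code>clearly_nonzero</code> is a hypothetical helper.</li></ul>

```python
# Weights and spreads copied from the two eli5 tables above.
results = {
    'Rank': {'Views': (0.3577, 0.2720), 'Uploads': (0.0003, 0.0050),
             'Subscriptions': (-0.0004, 0.0021)},
    'Grade': {'Views': (0.0624, 0.1312), 'Uploads': (0.0227, 0.0407),
              'Subscriptions': (-0.0502, 0.1450)},
}

def clearly_nonzero(weight, spread):
    """True when the interval [weight - spread, weight + spread] excludes zero."""
    return weight - spread > 0 or weight + spread < 0

for target, features in results.items():
    for name, (weight, spread) in features.items():
        print(target, name, clearly_nonzero(weight, spread))
```

<ul><li>Under this heuristic only the Views weight for Rank stands apart from zero; the Grade weights all overlap it.</li></ul>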


<h1>Conclusion</h1>
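<ul><li>Views seems to be the most important determining factor in socialblade.com's rankings and grades. The evidence is much stronger for Rank (0.3577 ± 0.2720) than for Grade (0.0624 ± 0.1312), where the reported spread of every weight is larger than the weight itself.</li></ul>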