# Predicting YouTube views: Large language models

In the previous notebook we used TF-IDF vectorisation combined with classical ML methods to classify YouTube videos by view count. Now we can explore another approach: fine-tuning a pre-trained LLM. For this task we'll be using DistilBERT, a compact version of Google's BERT model. We will combine four text features (the video category, the channel title, the video title and the video description) into a single input sequence for the model. Let's import the training data and fill NA values with empty strings.

In [1]:
#pip install --upgrade transformers



In [2]:
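#!pip install tf-keras
import os
# Make TensorFlow use the legacy Keras 2 API from the tf-keras package.
# This must be set before TensorFlow is imported.
os.environ['TF_USE_LEGACY_KERAS'] = '1'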



In [3]:
import pandas as pd

samples = pd.read_csv('https://raw.githubusercontent.com/tommyliphysics/tommyli-ml/main/youtube_predictor/data/train.csv', lineterminator='\n')
samples

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
0,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271,0
1,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389,1
2,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385,1
3,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802,0
4,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282,0
...,...,...,...,...,...,...,...
25323,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115,0
25324,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098,1
25325,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341,0
25326,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217,1


In [4]:
samples = samples.fillna('')

The video categories are stored as integer IDs, so we will convert them into their text names.

In [5]:
video_categories = {1: 'Film & Animation',
                    2: 'Autos & Vehicles',
                    10: 'Music',
                    15: 'Pets & Animals',
                    17: 'Sports',
                    19: 'Travel & Events',
                    20: 'Gaming',
                    22: 'People & Blogs',
                    23: 'Comedy',
                    24: 'Entertainment',
                    25: 'News & Politics',
                    26: 'Howto & Style',
                    27: 'Education',
                    28: 'Science & Technology',
                    29: 'Nonprofits & Activism'}

samples['video_category'] = samples['video_category'].apply(lambda category: video_categories[category])
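The lookup `video_categories[category]` raises a `KeyError` for any ID missing from the dictionary. The following is a minimal sketch of a more forgiving lookup, assuming an unknown ID should become a placeholder string. The helper `category_name`, the `'Unknown'` default and the shortened table are hypothetical and not part of the original notebook.

```python
# Shortened copy of the category table, for illustration only.
example_categories = {20: 'Gaming', 22: 'People & Blogs', 27: 'Education'}

def category_name(category_id, table, default='Unknown'):
    """Return the category name, or a placeholder if the ID is not in the table."""
    return table.get(category_id, default)

print(category_name(22, example_categories))   # known ID
print(category_name(999, example_categories))  # ID absent from the table
```

A function like this could be passed to `apply` in place of the lambda if the data might contain IDs not listed above.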

Next, we will perform an 80/20 train/validation split for the purpose of tuning the learning rate scheduler:

In [6]:
val = samples.sample(frac=0.2, random_state=524)
train = samples.drop(val.index)

In [7]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
0,BKTVOK,People & Blogs,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271,0
1,Rockit14,Gaming,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389,1
2,MAD ABOUT SCIENCE,People & Blogs,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385,1
3,BKTVOK,People & Blogs,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802,0
4,HVTraining,Sports,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282,0
...,...,...,...,...,...,...,...
25323,Khanish,People & Blogs,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115,0
25324,TungaloyCorporation,Science & Technology,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098,1
25325,Sansad TV,News & Politics,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341,0
25326,The Truth Show,Education,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217,1


In [8]:
val

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
11683,CrashCourse,Education,Micro-Biology: Crash Course History of Science...,It's all about the SUPER TINY in this episode ...,66,5.778978,1
14892,Padhle Tenthies,Education,Chemical Reactions and Equations Class 10 Scie...,Chemical Reactions and Equations Class 10 Scie...,23,5.384982,1
21540,Coding with Lewis,Science & Technology,What code editor should you use? 👩‍💻 #technolo...,,21,6.074111,1
4075,Hitesh Gohel,People & Blogs,Redox reaction class 11th Science gujarati,Balancing Redox reaction in acidic medium,60,4.482845,0
2171,MCQ Questions Hub,Education,Magnetism (चुम्बकत्व) MCQ | Physics MCQ | MCQ ...,Magnetism (चुम्बकत्व) MCQ | Physics MCQ | MCQ ...,35,3.995591,0
...,...,...,...,...,...,...,...
23360,Espiri Ibarra,People & Blogs,"October 9, 2022",,18,0.301030,0
11858,Happy Day,People & Blogs,Kirchhoff's Law,Kirchhoff’s Law-Pfis19-2B-Mohammad Nafis Nailu...,46,2.607455,0
12996,Book Scribe,Education,NCERT Science Class 6 தமிழ் | Chapter 13 | Fun...,NCERT Science Class 6 - Chapter 13: Fun with m...,27,3.859559,0
7290,OBD Việt Nam - Dịch vụ TỪ XA,Autos & Vehicles,Introduction SCR (Selective catalytic reductio...,#obdvietnam #service \n-----------------------...,1,1.568202,0


We will now define a function that loads the pre-trained tokenizer and model from the Hugging Face transformers library.

In [19]:
from transformers import TFDistilBertForSequenceClassification, DistilBertConfig
from transformers import DistilBertTokenizerFast

def get_tokenizer_model():
    # Load the cased DistilBERT tokenizer and the matching pre-trained weights.
    tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')
    # num_labels=2 adds a new, randomly initialised two-class classification head.
    model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=2)
    return tokenizer, model

In [29]:
tokenizer,model = get_tokenizer_model()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

We will use Keras to fine-tune the pre-trained model. Let's define a function that converts text samples into a batched dataset of token IDs and labels, and then prepare the training and validation data.

In [30]:
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

sep_token = tokenizer.convert_tokens_to_ids('[SEP]')
cls_token = tokenizer.convert_tokens_to_ids('[CLS]')

def create_dataset(samples):
    # Tokenise each text column on its own, without special tokens, so the fields can be joined by hand.
    encodings = {col: tokenizer(samples[col].tolist(), add_special_tokens=False)['input_ids'] for col in ['channel_title', 'video_title', 'video_description', 'video_category']}
    # Build [CLS] channel [SEP] title [SEP] description [SEP] category [SEP] for every sample.
    encodings = [[cls_token] + channel_title + [sep_token] + video_title + [sep_token] + video_description + [sep_token] + video_category + [sep_token]
                  for channel_title, video_title, video_description, video_category
                  in zip(encodings['channel_title'], encodings['video_title'], encodings['video_description'], encodings['video_category'])]
    # Pad or truncate at the end to the model's maximum input length.
    encodings = pad_sequences(encodings, maxlen=tokenizer.model_max_length, padding='post', truncating='post', value=tokenizer.pad_token_id)

    # Only the input IDs are passed to the model; no attention mask is supplied.
    return tf.data.Dataset.from_tensor_slices((encodings, samples['label'].tolist())).shuffle(len(samples)).batch(16)

In [22]:
train_dataset = create_dataset(train)
val_dataset = create_dataset(val)

Token indices sequence length is longer than the specified maximum sequence length for this model (627 > 512). Running this sequence through the model will result in indexing errors
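The warning above comes from tokenising descriptions that are longer than the model's limit. The sequences are cut to length afterwards by `pad_sequences`. The sketch below imitates that layout and the `padding='post', truncating='post'` behaviour in plain Python, to show what is kept when a description is long. The token IDs and the tiny `max_len` are made up for illustration; real IDs come from the tokenizer, and the real limit is `tokenizer.model_max_length`.

```python
# Hypothetical token IDs, chosen only for illustration.
CLS, SEP, PAD = 101, 102, 0

def join_fields(fields):
    """Concatenate tokenised fields as [CLS] f1 [SEP] f2 [SEP] ... fn [SEP]."""
    ids = [CLS]
    for field in fields:
        ids += field + [SEP]
    return ids

def pad_or_truncate_post(ids, max_len):
    """Cut the sequence at the end if too long, otherwise pad it at the end."""
    ids = ids[:max_len]
    return ids + [PAD] * (max_len - len(ids))

channel, title, category = [11, 12], [21, 22, 23], [41]
short_description = [31, 32]
long_description = list(range(300, 320))  # 20 made-up tokens

max_len = 16  # small stand-in for tokenizer.model_max_length

short = pad_or_truncate_post(join_fields([channel, title, short_description, category]), max_len)
long = pad_or_truncate_post(join_fields([channel, title, long_description, category]), max_len)

print(short)  # fits: ends with the category, [SEP] and padding
print(long)   # too long: the end of the description and the category are cut off
print(category[0] in short, category[0] in long)
```

Because the category is the last field, it is the first thing lost when the description overflows. Truncating the description before concatenation, or moving the category earlier in the sequence, would be one way to keep it, though the original notebook does neither.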


Next we can define our functions to compile and train the model.

In [31]:
import numpy as np

from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import EarlyStopping,ModelCheckpoint,LearningRateScheduler

import math

import keras
from keras.callbacks import Callback

last_epoch = 0

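def compile_model(model):
    # No loss is passed, so the transformers model falls back to its built-in classification loss.
    # learning_rate is a global variable that must be set before this function is called.
    model.compile(optimizer=RMSprop(learning_rate=learning_rate),
                  metrics=['accuracy'])
    model.config.id2label = {0: 'under 50k views', 1: 'over 50k views'}

def fit_model(model):
    # train_dataset is already batched (16 samples per batch in create_dataset),
    # so no batch_size is passed to fit.
    history = model.fit(train_dataset,
        epochs=epochs,
        callbacks=[LearningRateScheduler(lr_scheduler),
                   EarlyStopping(monitor="val_loss", patience=3)],
        validation_data=val_dataset,
        verbose=1)
    return history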
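The `EarlyStopping(monitor="val_loss", patience=3)` callback ends training once the validation loss has gone three epochs in a row without improving on its best value. The following is a simplified plain-Python model of that rule, not the Keras implementation itself. The function `epochs_run` and the loss values are hypothetical and do not come from an actual training run.

```python
def epochs_run(val_losses, patience=3):
    """Return how many epochs complete before the patience rule stops training."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# Made-up validation losses: the best value is reached at epoch 2.
hypothetical_losses = [0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
print(epochs_run(hypothetical_losses))
```

With these made-up losses the rule stops after epoch 5, so the lower loss at epoch 6 is never reached. That is the usual trade-off of a small patience value.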



We can now compile and fine-tune the Keras model. We'll train the model for up to 20 epochs, starting at a learning rate of $10^{-5}$ and halving it after every epoch. Training stops early if the validation loss fails to improve for three consecutive epochs.

In [32]:
learning_rate = 1e-5
compile_model(model)

In [33]:
def lr_scheduler(epoch, lr):
    return learning_rate*(0.5**epoch)
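To get a feel for how quickly this schedule decays, the sketch below tabulates the learning rate for a few epochs. It re-declares the base rate as `base_lr` and the schedule as `halving_schedule` so that it runs on its own; both names are introduced here for illustration, and the formula is the same as in `lr_scheduler`.

```python
base_lr = 1e-5  # same value as learning_rate in the notebook

def halving_schedule(epoch, lr=None):
    """Learning rate for a zero-based epoch index; the current lr is ignored."""
    return base_lr * (0.5 ** epoch)

schedule = [halving_schedule(epoch) for epoch in range(20)]

for epoch in (0, 1, 2, 5, 10, 19):
    print(f"epoch {epoch + 1:2d}: lr = {schedule[epoch]:.3e}")
```

The rate drops by a factor of $2^{10} = 1024$ over the first ten epochs, so most of the weight updates happen in the first few epochs.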

In [34]:
epochs = 20
batch_size=128

In [35]:
fit_model(model)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
