<a href="https://colab.research.google.com/github/saishdesai23/Youtube-Video-Classification/blob/main/Youtube_Video_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Youtube_video_classification
## Author: Saish Desai
## Dataset creation credits - https://www.kaggle.com/datasets/rajatrc1705/youtube-videos-dataset

**Connecting to Google Drive**

In [1]:
# connecting google drive to the notebook
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


**Importing required libraries**

In [2]:
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup

import nltk
import spacy

**Importing data**

In [3]:
data = pd.read_csv("/content/gdrive/MyDrive/Kaggle Competitions/Youtube_Video_Classification/data/youtube.csv")

In [4]:
data.head()

Unnamed: 0,link,title,description,category
0,JLZlCZ0,Ep 1| Travelling through North East India | Of...,Tanya Khanijow\n671K subscribers\nSUBSCRIBE\nT...,travel
1,i9E_Blai8vk,Welcome to Bali | Travel Vlog | Priscilla Lee,Priscilla Lee\n45.6K subscribers\nSUBSCRIBE\n*...,travel
2,r284c-q8oY,My Solo Trip to ALASKA | Cruising From Vancouv...,Allison Anderson\n588K subscribers\nSUBSCRIBE\...,travel
3,Qmi-Xwq-ME,Traveling to the Happiest Country in the World!!,Yes Theory\n6.65M subscribers\nSUBSCRIBE\n*BLA...,travel
4,_lcOX55Ef70,Solo in Paro Bhutan | Tiger's Nest visit | Bhu...,Tanya Khanijow\n671K subscribers\nSUBSCRIBE\nH...,travel


In [5]:
data['category'].unique()

array(['travel', 'food', 'art_music', 'history'], dtype=object)

In [6]:
data['category'].value_counts()

travel       1156
art_music     947
food          903
history       593
Name: category, dtype: int64

**Data Cleaning and Preprocessing**

In [44]:
# Initialization for data cleaning
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
stemmer = nltk.stem.PorterStemmer()
nltk.download('wordnet')
lemmatizer = nltk.stem.WordNetLemmatizer()



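def data_cleaning_preprocessing(raw_text: str) -> str:
  """
  Clean and preprocess a raw title or description.

  Strips HTML, keeps letters only, lowercases, removes English stopwords
  and tokens shorter than 3 or longer than 14 characters, then lemmatizes.
  """
  # strip any HTML markup
  text = BeautifulSoup(raw_text, "html.parser").get_text()
  # replace every non-letter character with a space
  text = re.sub("[^a-zA-Z]", " ", text)
  text = text.lower()
  tokens = [lemmatizer.lemmatize(ele) for ele in text.split()
            if ele not in stopwords and 15 > len(ele) > 2]
  return " ".join(tokens)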

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
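
The effect of the regex, stopword and length filters can be seen on a single title. The sketch below is a simplified stand-in, not the notebook's function: it uses a tiny hypothetical stopword set (`STOPWORDS`) instead of NLTK's English list and skips HTML stripping and lemmatization, so it runs without any downloads.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's full English list.
STOPWORDS = {"to", "the", "in", "my", "of", "through"}

def simple_clean(raw_text: str) -> str:
    # Replace every non-letter character with a space, then lowercase.
    text = re.sub("[^a-zA-Z]", " ", raw_text).lower()
    # Keep tokens that are not stopwords and have between 3 and 14 characters.
    tokens = [tok for tok in text.split()
              if tok not in STOPWORDS and 2 < len(tok) < 15]
    return " ".join(tokens)

cleaned = simple_clean("Ep 1| Travelling through North East India")
print(cleaned)
```

The digit and the pipe become spaces, "ep" is dropped by the length filter, and "through" is dropped as a stopword.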


In [45]:
data['clean_title'] = data['title'].apply(data_cleaning_preprocessing)
data['clean_description'] = data['description'].apply(data_cleaning_preprocessing)

data['text'] = data['clean_title'] + " " + data['clean_description']

In [10]:
data.head()

Unnamed: 0,link,title,description,category,clean_title,clean_description,text
0,JLZlCZ0,Ep 1| Travelling through North East India | Of...,Tanya Khanijow\n671K subscribers\nSUBSCRIBE\nT...,travel,ep travelling north east india arunachal journ...,tanya khanijow k subscriber subscribe journey ...,ep travelling north east india arunachal journ...
1,i9E_Blai8vk,Welcome to Bali | Travel Vlog | Priscilla Lee,Priscilla Lee\n45.6K subscribers\nSUBSCRIBE\n*...,travel,welcome bali travel vlog priscilla lee,priscilla lee k subscriber subscribe disclaime...,welcome bali travel vlog priscilla lee priscil...
2,r284c-q8oY,My Solo Trip to ALASKA | Cruising From Vancouv...,Allison Anderson\n588K subscribers\nSUBSCRIBE\...,travel,solo trip alaska cruising vancouver anchorage,allison anderson k subscriber subscribe spent ...,solo trip alaska cruising vancouver anchorage ...
3,Qmi-Xwq-ME,Traveling to the Happiest Country in the World!!,Yes Theory\n6.65M subscribers\nSUBSCRIBE\n*BLA...,travel,traveling happiest country world,yes theory subscriber subscribe black friday d...,traveling happiest country world yes theory su...
4,_lcOX55Ef70,Solo in Paro Bhutan | Tiger's Nest visit | Bhu...,Tanya Khanijow\n671K subscribers\nSUBSCRIBE\nH...,travel,solo paro bhutan tiger nest visit bhutan trave...,tanya khanijow k subscriber subscribe presenti...,solo paro bhutan tiger nest visit bhutan trave...


In [46]:
data['clean_title'][0]


'travelling north east india arunachal journey begin pasighat'

In [47]:
data['clean_description'][0]

'tanya khanijow subscriber subscribe journey arunachal north east india begin train journey guwahati murkongselek head pasighat travel companion getting started exploring tiny glimpse arunachal far market bridge adventure get better next video show'

In [48]:
data['text'][0]

'travelling north east india arunachal journey begin pasighat tanya khanijow subscriber subscribe journey arunachal north east india begin train journey guwahati murkongselek head pasighat travel companion getting started exploring tiny glimpse arunachal far market bridge adventure get better next video show'

In [49]:
# take an explicit copy so the column assignment below does not trigger
# pandas' SettingWithCopyWarning
data_clean = data[['text','category']].copy()
data_clean['category'] = data_clean['category'].astype('category')
data_clean.head()
  


Unnamed: 0,text,category
0,travelling north east india arunachal journey ...,travel
1,welcome bali travel vlog priscilla lee priscil...,travel
2,solo trip alaska cruising vancouver anchorage ...,travel
3,traveling happiest country world yes theory su...,travel
4,solo paro bhutan tiger nest visit bhutan trave...,travel
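
pandas can emit a `SettingWithCopyWarning` when a column is assigned on a selection taken from another DataFrame. A minimal sketch of the subsetting step, using a small hypothetical frame (`df`) rather than the real data, shows the explicit-copy pattern:

```python
import pandas as pd

# Hypothetical miniature version of the notebook's DataFrame.
df = pd.DataFrame({
    "link": ["x1", "x2"],
    "text": ["welcome bali travel vlog", "pop hit playlist"],
    "category": ["travel", "art_music"],
})

# .copy() makes the subset an independent DataFrame, so assigning a column
# on it is unambiguous and leaves the original frame untouched.
subset = df[["text", "category"]].copy()
subset["category"] = subset["category"].astype("category")

print(subset.dtypes)
```

The original `df` keeps its plain string column, while `subset` holds the categorical version.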


**Train Test Split**

In [95]:
from sklearn.model_selection import train_test_split
X = data_clean[['text']]
y = data_clean['category']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state = 42)

Comparing the value counts of the train and test sets shows that the class proportions are similar in both (each class share differs by less than five percentage points).

In [213]:
print("Frequency count for train data set")
y_train.value_counts()

Frequency count for train data set


travel       776
art_music    690
food         639
history      414
Name: category, dtype: int64

In [215]:
print("Frequency count for test data set")
y_test.value_counts()

Frequency count for test data set


travel       380
food         264
art_music    257
history      179
Name: category, dtype: int64
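
The class shares can be checked directly from the counts printed above. The sketch below only re-uses those counts; the helper name `shares` is introduced here for illustration.

```python
# Class counts copied from the value_counts() outputs above.
train_counts = {"travel": 776, "art_music": 690, "food": 639, "history": 414}
test_counts = {"travel": 380, "food": 264, "art_music": 257, "history": 179}

def shares(counts):
    # Convert absolute counts into fractions of the set size.
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

train_share = shares(train_counts)
test_share = shares(test_counts)

for label in train_counts:
    diff = test_share[label] - train_share[label]
    print(f"{label:10s} train={train_share[label]:.3f} "
          f"test={test_share[label]:.3f} diff={diff:+.3f}")
```

If closer agreement between the two sets is wanted, `train_test_split` accepts a `stratify` argument (for example `stratify=y`) that preserves the class proportions in both splits.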

In [98]:
X_train.head()

Unnamed: 0,text
2670,hit radio live radio pop music hit best englis...
276,brother ghana travel blogger dubai vlog brothe...
2387,pop hit top pop song playlist popular pop musi...
2942,chak india title song shah rukh khan sukhvinde...
2383,minute timer pop music kid fun timer classroom...


**Word Embedding**

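Now we will encode the text of each video as word embeddings. A word2vec model is first trained on the 'text8' corpus, and a second model is then trained on the vocabulary observed in the dataset text. Note that the second `Word2Vec(...)` call rebinds `model`, so the features computed further down come from the model trained on the dataset text, not from the 'text8' model.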

In [32]:
import gensim.downloader as api

In [33]:
corpus = api.load('text8')



**Training the model on the 'text8' corpus**

In [34]:
from gensim.models.word2vec import Word2Vec
model = Word2Vec(corpus)

**Training the model on the data set vocabulary**

In [86]:
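# tokenize every cleaned text into a list of words
my_corpus = [ele.split() for ele in data_clean['text']]

# `size` is the gensim 3.x argument name; gensim 4.0+ renamed it to `vector_size`
model = Word2Vec(sentences=my_corpus, size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")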

In [130]:
train_features = []
for ele in X_train['text']:
  # average the word vectors of a text into a single 100-dimensional vector;
  # model.wv[...] replaces the deprecated model[...] lookup
  train_features.append(np.mean(model.wv[ele.split()], axis=0))


**DataFrame creation for training**

Using the averaged word embeddings as features, we build a DataFrame with one row per text and one column per embedding dimension (100 columns).
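
The averaging step can be illustrated without a trained model. The sketch below assumes a hypothetical dictionary of 3-dimensional vectors (`toy_vectors`) in place of the Word2Vec model; skipping unknown tokens and returning a zero vector for empty text are assumptions of this sketch, not behaviour of the notebook code.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for the trained Word2Vec model.
toy_vectors = {
    "solo": np.array([1.0, 0.0, 2.0]),
    "trip": np.array([3.0, 2.0, 0.0]),
    "alaska": np.array([2.0, 4.0, 1.0]),
}

def sentence_vector(text, vectors, dim=3):
    # Look up each known token; unknown tokens are skipped (assumption of this sketch).
    rows = [vectors[tok] for tok in text.split() if tok in vectors]
    if not rows:
        # No usable token: fall back to a zero vector (assumption of this sketch).
        return np.zeros(dim)
    # Mean over tokens gives one fixed-length vector per text.
    return np.mean(rows, axis=0)

texts = ["solo trip alaska", "trip", ""]
features = np.array([sentence_vector(t, toy_vectors) for t in texts])
print(features.shape)
```

Every text, whatever its length, ends up as one row with the same number of columns, which is what allows the result to be passed to a classifier.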

In [216]:
# list to numpy array
train_features = np.array(train_features)

# numpy array to pandas dataframe
df_train = pd.DataFrame(train_features)

print("Train Data")

df_train

Train Data


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.543152,-0.474838,-1.590685,-0.117498,-0.100035,-0.088057,-0.417259,2.085554,-1.631940,0.420533,...,0.608796,-0.392851,1.034701,0.417928,-0.499790,-0.365881,0.530502,-0.317301,-0.114568,0.694621
1,0.375676,-0.164515,-0.762357,0.076611,-0.097052,0.075910,-0.211070,0.969503,-0.612564,0.218435,...,0.338122,-0.198298,0.618516,0.111413,-0.121832,-0.099441,0.222590,-0.114414,-0.058468,0.405780
2,0.634623,-0.487159,-1.512009,0.008553,-0.111199,0.006033,-0.483153,2.090725,-1.320328,0.376740,...,0.654796,-0.404338,1.215353,0.293952,-0.292783,-0.360393,0.440977,-0.314642,-0.251997,0.879024
3,0.388568,-0.110772,-0.875813,0.020317,-0.088182,0.043812,-0.176177,0.970341,-0.715505,0.226798,...,0.412236,-0.227099,0.667700,0.133940,-0.113329,-0.064843,0.272767,-0.135443,-0.010178,0.352153
4,0.318853,-0.098652,-0.728615,-0.004456,-0.052695,0.017839,-0.157296,0.787827,-0.586874,0.164410,...,0.453045,-0.223019,0.671777,0.127017,-0.044245,-0.072742,0.209067,-0.104454,-0.027643,0.304154
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2514,0.517925,0.002131,-1.364986,-0.185532,-0.067471,-0.069308,-0.110016,1.001332,-0.977803,0.246661,...,0.662033,-0.280311,0.883489,0.045476,0.162004,0.199510,0.472871,-0.301088,0.042010,0.280037
2515,0.399685,-0.119503,-0.821010,0.135167,-0.137346,0.140662,-0.217769,1.099018,-0.691899,0.258823,...,0.430096,-0.236903,0.759944,0.176301,-0.219436,-0.081227,0.261021,-0.081732,-0.015844,0.390747
2516,0.365934,-0.122759,-0.660062,0.201532,-0.150597,0.175995,-0.231826,1.012891,-0.561905,0.234543,...,0.346654,-0.200343,0.673241,0.148986,-0.237537,-0.112944,0.190087,-0.046750,-0.026955,0.397573
2517,0.416492,-0.043006,-0.964231,-0.054241,-0.056875,0.016375,-0.126387,0.819644,-0.687954,0.209801,...,0.580225,-0.264543,0.799892,0.095529,0.071372,0.051180,0.291585,-0.155606,0.012360,0.315717


In [133]:
test_features = []
for ele in X_test['text']:
  # same averaging as for the training texts
  test_features.append(np.mean(model.wv[ele.split()], axis=0))


In [217]:
# list to numpy array
test_features = np.array(test_features)

print("Test Data")

# numpy array to pandas dataframe
df_test = pd.DataFrame(test_features)
df_test

Test Data


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.462388,-0.071301,-1.175730,-0.146009,-0.025926,-0.029549,-0.151913,0.998285,-0.861383,0.255205,...,0.618051,-0.281709,0.850921,0.131302,0.113780,0.070796,0.364353,-0.207334,-0.034069,0.392574
1,0.551034,-0.027345,-1.476295,-0.252048,-0.006267,-0.084543,-0.127317,1.088644,-1.065543,0.295179,...,0.806459,-0.350243,1.040541,0.151116,0.206931,0.150462,0.465418,-0.266798,-0.004296,0.397669
2,0.682003,-0.581143,-1.947197,-0.216980,-0.070619,-0.163573,-0.494157,2.404702,-1.796731,0.449704,...,0.674995,-0.462455,1.208670,0.354461,-0.364262,-0.395126,0.628881,-0.481992,-0.221232,0.904846
3,0.743699,-0.400386,-2.011714,-0.228352,-0.061704,-0.160684,-0.405077,2.210406,-1.680077,0.408173,...,0.755060,-0.488946,1.305045,0.283314,-0.159843,-0.257226,0.632765,-0.481505,-0.185301,0.817787
4,0.431699,-0.163205,-0.936407,-0.026650,-0.043559,0.017903,-0.191027,0.961564,-0.745054,0.261509,...,0.503631,-0.288747,0.796734,0.148661,-0.057135,-0.105379,0.262847,-0.135868,-0.024989,0.447045
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1075,0.753460,-0.405436,-2.043983,-0.236463,-0.060888,-0.166613,-0.408703,2.239264,-1.707122,0.414015,...,0.767715,-0.497300,1.323811,0.287628,-0.159891,-0.260324,0.643435,-0.489992,-0.186801,0.827646
1076,0.718133,-0.244773,-1.407063,0.427099,-0.398885,0.352656,-0.447960,2.164487,-1.258633,0.485924,...,0.560913,-0.344171,1.237629,0.255320,-0.613077,-0.134705,0.504844,-0.159074,-0.008433,0.638325
1077,0.391744,-0.059399,-0.995554,-0.104431,-0.043432,0.007534,-0.120393,0.848417,-0.748010,0.259622,...,0.468657,-0.215684,0.640487,0.124751,0.045265,0.073763,0.317123,-0.147557,-0.004079,0.329879
1078,0.659894,-0.046712,-1.532826,-0.052219,-0.168106,0.009120,-0.198048,1.342105,-1.124352,0.315290,...,0.580789,-0.255621,0.874066,0.020125,0.043403,0.136039,0.522785,-0.329698,0.043808,0.410210


**Model Training**

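Now we will train a list of models on the training features. Each model is wrapped in a pipeline with a `MinMaxScaler`, which rescales every feature to the range [0, 1]; this matters because `MultinomialNB` requires non-negative inputs and the averaged embeddings contain negative values.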

In [218]:
# importing packages required for model training
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# list of models used for training
training_models = [SVC(), MultinomialNB()]
trained_models = []

# a separate loop variable avoids overwriting the Word2Vec `model` defined earlier
for estimator in training_models:
  clf = make_pipeline(MinMaxScaler(), estimator)
  trained_models.append(clf.fit(df_train, y_train))
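
Min-max scaling with the default range maps each feature to (x - min) / (max - min), where min and max are taken from the training data only. The NumPy sketch below reproduces that arithmetic on hypothetical toy arrays (`train`, `test`); it is an illustration of the formula, not a replacement for `MinMaxScaler`.

```python
import numpy as np

# Hypothetical toy features: 3 training rows and 1 test row, 2 columns each.
train = np.array([[-1.0, 2.0],
                  [ 1.0, 4.0],
                  [ 0.0, 3.0]])
test = np.array([[-2.0, 5.0]])

# Per-column statistics are computed on the training data only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def min_max_scale(x):
    # (x - min) / (max - min), applied column by column.
    return (x - col_min) / (col_max - col_min)

train_scaled = min_max_scale(train)
test_scaled = min_max_scale(test)
print(train_scaled)
print(test_scaled)
```

The training rows land exactly in [0, 1], while a test value outside the training range can fall below 0 or above 1, which is worth keeping in mind for models that expect non-negative inputs.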

**Model Performance Evaluation**

In [225]:
model_selecton ={}
for ele in trained_models:
  model_selecton[str(ele[1])] = ele

model_selecton

{'MultinomialNB()': Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
                 ('multinomialnb', MultinomialNB())]),
 'SVC()': Pipeline(steps=[('minmaxscaler', MinMaxScaler()), ('svc', SVC())])}

In [231]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


def model_performance_eval(model_name):
  """Print accuracy, confusion matrix and classification report for a trained model."""
  y_pred = model_selecton[model_name].predict(df_test)
  # scikit-learn metrics take the true labels first, then the predictions
  acc = accuracy_score(y_test, y_pred)
  print("*****************************************************")
  print("Model Accuracy")
  print(acc)
  print("*****************************************************")
  # rows are true classes, columns are predicted classes
  cm = confusion_matrix(y_test, y_pred)
  print("Model Confusion Matrix")
  print(cm)
  print("*****************************************************")
  cr = classification_report(y_test, y_pred)
  print("Model Classification Report")
  print(cr)

# valid keys of model_selecton
model_names = ['MultinomialNB()', 'SVC()']
model_name = input("Enter the model name: ")
model_performance_eval(model_name)

Enter the model name: SVC()
*****************************************************
Model Accuracy
0.8305555555555556
*****************************************************
Model Confusion Matrix
[[228   3   9  17]
 [  3 216  36   9]
 [  1  25 108  45]
 [  7  13  15 345]]
*****************************************************
Model Classification Report
              precision    recall  f1-score   support

   art_music       0.95      0.89      0.92       257
        food       0.84      0.82      0.83       264
     history       0.64      0.60      0.62       179
      travel       0.83      0.91      0.87       380

    accuracy                           0.83      1080
   macro avg       0.82      0.80      0.81      1080
weighted avg       0.83      0.83      0.83      1080
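
The accuracy and the per-class F1 scores can be recomputed by hand from the confusion-matrix counts of the SVC() run. In the sketch below the counts are written with true labels on rows (class order: art_music, food, history, travel); both quantities are unchanged if the matrix is transposed, so the orientation does not affect the result. The helper name `accuracy_and_f1` is introduced here for illustration.

```python
import numpy as np

# Counts from the SVC() confusion matrix, true labels on rows.
cm = np.array([
    [228,   3,   9,  17],
    [  3, 216,  36,   9],
    [  1,  25, 108,  45],
    [  7,  13,  15, 345],
])

def accuracy_and_f1(matrix):
    diag = np.diag(matrix)
    # Accuracy: correctly classified samples over all samples.
    accuracy = diag.sum() / matrix.sum()
    # F1 = 2*TP / (row total + column total), symmetric in rows and columns.
    f1 = 2 * diag / (matrix.sum(axis=1) + matrix.sum(axis=0))
    return accuracy, f1

accuracy, f1 = accuracy_and_f1(cm)
print(round(accuracy, 4), np.round(f1, 2))
```

The history class has the lowest F1 score, and most of its errors are confusions with travel and food.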

