# Data Science Analysis of The Joe Rogan Experience

## Group D
   
   >Simon Chamlers
   >
   >Mitchell Whyte
   >
   >Jack Moore
   
## Introduction

The Joe Rogan Experience is an extremely popular podcast that hosts a variety of different guests. The type of guests and the topics covered on the podcast vary greatly and each guest individually garners a unique reaction from the audience.

## Problem Statement and Goals

Can we predict the ratio of likes to dislikes on YouTube videos using YouTube video data?

## Our Data

Our data was collected from the Joe Rogan Experience YouTube channel using the YouTube Data API. We queried the API for the video ID of each episode, then used these video IDs to query for the statistics, dates and comments, which are returned in JSON format.

The comments were obtained in a separate file from the video statistics and dates. The comments were then processed through our own custom NLP algorithm to apply a sentiment analysis to each individual comment. The sentiment analysis used a 'Bag of Words' model to evaluate the sentiment of each comment. Each comment was given a numbered label identifying the video it belonged to.

### About the NLP

Each comment is treated as a document and is broken down into a bag of words. Each word in the comment is then tokenised using the spaCy Python library and lemmatised to get its root form. A list of 144,000 words with sentiment scores attached was obtained for free from SentiWords, a commonly used resource for comparing words to obtain their sentiment. After processing a comment, each word in the bag of words has been given a sentiment score, and the comment's total sentiment score is the average of these scores. The total magnitude is also summed from the sentiment of each word.
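The scoring described above can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: `TOY_LEXICON` and `TOY_LEMMAS` are hypothetical stand-ins for SentiWords and the spaCy lemmatiser, the regex tokeniser is a simplification, and the sketch assumes that only words found in the lexicon contribute to the average and that magnitude is the sum of absolute word scores.

```python
import re

# Hypothetical miniature lexicon standing in for SentiWords (lemma -> score).
TOY_LEXICON = {"great": 0.8, "love": 0.7, "boring": -0.6, "bad": -0.7}

# Hypothetical lemma table standing in for spaCy's lemmatiser.
TOY_LEMMAS = {"loved": "love", "loves": "love", "worse": "bad"}


def score_comment(comment, lexicon=TOY_LEXICON, lemmas=TOY_LEMMAS):
    """Return (sentiment, magnitude) for a single comment.

    sentiment: mean score of the words found in the lexicon
    magnitude: sum of absolute word scores (an assumed definition)
    """
    # Simplified tokenisation: lower-case alphabetic runs only.
    tokens = re.findall(r"[a-z']+", comment.lower())
    # Map each token to its root form where the toy table knows one.
    lemmatised = [lemmas.get(tok, tok) for tok in tokens]
    # Keep only the words that have a score in the lexicon.
    scores = [lexicon[word] for word in lemmatised if word in lexicon]
    if not scores:
        # Nothing recognisable (too short, emoji only, misspelled words).
        return 0.0, 0.0
    sentiment = sum(scores) / len(scores)
    magnitude = sum(abs(score) for score in scores)
    return sentiment, magnitude
```

A comment with no recognisable words, such as one made only of emojis, comes back as `(0.0, 0.0)` under these assumptions.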

After the comments were processed, the sentiment scores and magnitudes were grouped by the video they belonged to. We took the descriptive statistics of the grouped comments for each video and joined them to our DataFrame, along with the other columns of our YouTube video DataFrame.

## Importing Libraries

In [5]:
import json
import csv
import re

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#import wikipedia
from pandas.plotting import register_matplotlib_converters
from scipy import stats

from sklearn import linear_model
from sklearn.cluster import KMeans
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

## Importing and Cleaning the Data



In [6]:
raw_scores = pd.read_csv('data/raw_scores.csv')
raw_scores = raw_scores.drop(['Unnamed: 0'],axis=1)
print(raw_scores.shape)
raw_scores.head()

FileNotFoundError: [Errno 2] File b'data/raw_scores.csv' does not exist: b'data/raw_scores.csv'

The raw_scores dataframe holds the magnitude and sentiment scores for all of our comments. Each comment has an idx column entry that associates it with the index of its video.

In [2]:
len(raw_scores[(raw_scores.magnitude == 0) & (raw_scores.sentiment == 0)])

NameError: name 'raw_scores' is not defined

Nearly 300,000 comments have 0 sentiment and 0 magnitude. This is usually because they are too short, and sometimes because the characters were unreadable (emojis and such) or the words were misspelled. We remove these from our dataframe.
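One way the removal could look is sketched below on a toy frame. The values in `toy_scores` are made up for illustration; only the column names `idx`, `sentiment` and `magnitude` come from our data.

```python
import pandas as pd

# Toy stand-in for raw_scores with the same three columns.
toy_scores = pd.DataFrame({
    "idx": [0, 0, 1, 1],
    "sentiment": [0.4, 0.0, 0.0, -0.2],
    "magnitude": [1.2, 0.0, 0.0, 0.5],
})

# A comment counts as unscored when both sentiment and magnitude are exactly zero.
unscored = (toy_scores["magnitude"] == 0) & (toy_scores["sentiment"] == 0)

# Keep every row that is not unscored.
scored_only = toy_scores[~unscored].reset_index(drop=True)
```

Filtering on both columns together keeps comments whose positive and negative words cancel out to a sentiment of 0 but still carry a non-zero magnitude.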

In [None]:
groups = raw_scores.groupby(['idx']).describe()
groups.head()

We group the sentiments and magnitudes according to the column 'idx', which represents the video index in the video dataframe. idx = 0 is the first video in the vids dataframe, so all of the sentiments with idx = 0 belong to the comments of video 0. We append the descriptive statistics of the sentiments and magnitudes to the videos dataframe as features to use in our analysis.
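To make the shape of the grouped result concrete, here is a small sketch on made-up data. `toy_scores` and `toy_vids` are hypothetical. Note that the column assignment aligns on the index, so it relies on the row labels of the video frame matching the `idx` values.

```python
import pandas as pd

# Made-up comment scores for two videos (idx 0 and idx 1).
toy_scores = pd.DataFrame({
    "idx": [0, 0, 0, 1, 1],
    "sentiment": [0.2, 0.4, 0.6, -0.1, 0.1],
    "magnitude": [1.0, 2.0, 3.0, 0.5, 1.5],
})

# describe() yields one row per idx and two-level columns: (variable, statistic).
groups = toy_scores.groupby("idx").describe()

# Selecting the outer level and then the statistic gives a Series indexed by idx.
sent_mean = groups["sentiment"]["mean"]

# Made-up video frame whose default index (0, 1) lines up with idx.
toy_vids = pd.DataFrame({"title": ["episode A", "episode B"]})
toy_vids["sentMean"] = sent_mean  # aligned on index, not on position
```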

In [None]:
vids = pd.read_csv('files/labeled_vids.csv')
vids = vids.drop(['Unnamed: 0','Unnamed: 0.1','sentiment','magnitude'],axis=1)
print(vids.shape)
vids.head()

In [None]:
label_list = ['mean','std','min','25%','50%','75%','max']
magnitude_label_list = ['magMean','magStd','magMin','magLq','magMedian','magUq','magMax']
sentiment_label_list = ['sentMean','sentStd','sentMin','sentLq','sentMedian','sentUq','sentMax']
for i,label in enumerate(label_list):
    vids[magnitude_label_list[i]] = groups['magnitude'][label]
    vids[sentiment_label_list[i]] = groups['sentiment'][label]

In [None]:
vids.max()

In [None]:
print(vids.iloc[119].ratio)
print(vids.iloc[119].likeCount)
print(vids.iloc[119].dislikeCount)
vids.at[119,'dislikeCount'] = 1
vids.at[119,'ratio'] = 62

In [None]:
vids.ratio.max()

We see there is an inf value in our ratio column, so we find the culprit and deal with it. The ratio is derived by dividing likeCount by dislikeCount, and dividing a non-zero count by 0 yields inf. This inf value causes problems when analysing the data: many methods do not run if a column contains inf. We deal with this by changing the dislikeCount to 1, which removes the inf value.
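The same fix can be written without hard-coding a row number. The sketch below uses made-up counts in a hypothetical `toy_vids` frame and assumes the ratio is simply likeCount divided by dislikeCount.

```python
import numpy as np
import pandas as pd

# Made-up counts; the second video has no dislikes.
toy_vids = pd.DataFrame({
    "likeCount": [500, 62, 900],
    "dislikeCount": [25, 0, 30],
})
toy_vids["ratio"] = toy_vids["likeCount"] / toy_vids["dislikeCount"]

# Locate every row whose ratio came out infinite.
inf_rows = np.isinf(toy_vids["ratio"])

# Treat zero dislikes as one dislike, then recompute the ratio for those rows.
toy_vids.loc[inf_rows, "dislikeCount"] = 1
toy_vids.loc[inf_rows, "ratio"] = (
    toy_vids.loc[inf_rows, "likeCount"] / toy_vids.loc[inf_rows, "dislikeCount"]
)
```

Using a mask rather than a fixed position means the fix still applies if more videos with zero dislikes appear in a later data pull.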

In [None]:
vids.iloc[119]

In [None]:
vids[vids.title.duplicated()]

In [None]:
print("Shape before removing duplicates: ",vids.shape)
vids = vids[~vids.title.duplicated()]
print("Shape after removing duplicates: ",vids.shape)

Because of the way we collected the video IDs used to query the YouTube API, we have some duplicate rows. We remove the duplicated rows from the dataframe.

In [None]:
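# Drop rows more than 3 standard deviations from the mean in any numeric column
# (the first three columns are skipped)
for col in vids.columns[3:]:
    if str(vids[col].dtype) != 'object':
        vids = vids[(np.abs(stats.zscore(vids[col])) < 3)]
# Keep only videos with enough comments for the sentiment statistics to be robust
vids = vids[vids['commentCount'] > 100]
# Index the rows by date
vids.index = pd.to_datetime(vids.date)
print("Shape after removing outliers",vids.shape)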

In [None]:
#6 equally distributed categories of likes/dislikes ratio
bin_labels = [0,1,2,3,4,5]
vids['ratio_bins'] = pd.qcut(vids['ratio'], q=6, labels = bin_labels)

We removed rows where any column value was more than 3 standard deviations from the mean. This is to reduce any skewing that might occur from outliers. We also decided to select only videos with over 100 comments, because we feel that if the number of comments is too small, we won't get robust results from the sentiment analysis of the comments. We also created a categorical version of our ratio column for classification modelling later.
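The two steps can be illustrated on synthetic data. The frame `toy` below is generated at random and is not our data. The z-score is computed by hand with the population standard deviation (`ddof=0`), which is also the default used by `stats.zscore`.

```python
import numpy as np
import pandas as pd

# Synthetic ratios: 120 roughly normal values plus one planted outlier.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"ratio": rng.normal(30, 5, size=120)})
toy.loc[0, "ratio"] = 500.0

# z-score of each value relative to the column mean and standard deviation.
z = (toy["ratio"] - toy["ratio"].mean()) / toy["ratio"].std(ddof=0)

# Keep only rows within 3 standard deviations of the mean.
kept = toy[np.abs(z) < 3].copy()

# qcut places the bin edges at quantiles, so each bin holds about the same number of rows.
kept["ratio_bins"] = pd.qcut(kept["ratio"], q=6, labels=[0, 1, 2, 3, 4, 5])
bin_sizes = kept["ratio_bins"].value_counts()
```

Because the bins are defined by quantiles, the six classes are balanced by construction, but the bin widths on the ratio scale can differ considerably.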

# Data Exploration and Visualisation

# Data Analysis

## Basic Modelling

In [None]:
vids = pd.read_csv('files/df_sans_zero_sentiments.csv')
print("dataframe before filtering: ",vids.shape)
# filter out outliers: rows more than 3 standard deviations from the mean in any numeric column
for col in vids.columns[3:]:
    if str(vids[col].dtype) != 'object':
        vids = vids[(np.abs(stats.zscore(vids[col])) < 3)]
# filter out videos with few comments (not enough comments for sentiment analysis to be robust)
vids = vids[vids['commentCount'] > 100]
vids = vids.drop(['Unnamed: 0'],axis=1)
vids.index = pd.to_datetime(vids.date)
# 6 equally populated categories of likes/dislikes ratio
bin_labels = [0,1,2,3,4,5]
vids['ratio_bins'] = pd.qcut(vids['ratio'], q=6, labels = bin_labels)
print("dataframe shape after filtering outliers and low comment rows: ",vids.shape)
vids = vids[~vids.title.duplicated()]
print("dataframe shape after removing duplicates",vids.shape)

In [None]:
drop_parameters = ['date','title','ratio','ratio_bins','likeCount','dislikeCount']
X = vids.drop(drop_parameters,axis=1)
y = vids['ratio_bins']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,random_state=42)

In [None]:
X_train.shape

## Linear Regression

In [None]:
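predictions = []
# Fit RFE for every feature count from 1 to 17 and store the test-set predictions
for i in range(1, 18):
    estimator = linear_model.LinearRegression()
    # n_features_to_select must be passed by keyword in recent scikit-learn releases
    selector = RFE(estimator, n_features_to_select=i)
    selector.fit(X_train,y_train)
    predictions.append(selector.predict(X_test))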

In [None]:
mse, rmse, rsquared, mae = ([] for i in range(4))
for prediction in predictions:
    mse.append(mean_squared_error(y_test, prediction))
    rmse.append(np.sqrt(mean_squared_error(y_test, prediction)))
    rsquared.append(r2_score(y_test, prediction))
    mae.append(mean_absolute_error(y_test, prediction))
prediction_df = pd.DataFrame(
    {'mse': mse,
     'rmse': rmse,
    'rsquared': rsquared,
     'mae': mae
    })

In [None]:
cols = ['mse','rmse','rsquared','mae']
titleList = ['Mean Squared Error','Root Mean Squared Error','R Squared','Mean Absolute Error']

fig,axes = plt.subplots(2, 2, sharex=False, sharey=False,figsize = (20,20),constrained_layout = True)
#plt.tight_layout()
fig.suptitle('Linear Regression Performance with RFE', size='40',y=1.05)
for i, ax in enumerate(axes.flat):
    # row 0 of prediction_df holds the 1-feature model, so shift the index by one
    sns.lineplot(data=prediction_df,x=prediction_df.index + 1,y=cols[i],ax=ax,marker="o")
    ax.set_title(titleList[i],size='24')
    ax.set_xlabel('No. of features',size='20')
    ax.set_ylabel(cols[i],size='20')

In [None]:
drop_parameters = ['date','title','ratio','ratio_bins']
X = vids.drop(drop_parameters,axis=1)
y = vids['ratio']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,random_state=42)
estimator = linear_model.LinearRegression()
selector = RFE(estimator, n_features_to_select=1)
selector.fit(X_train, y_train)
prediction = selector.predict(X_test)
selector.ranking_

In [None]:
feature_rankings = pd.DataFrame(
    {'features': X_train.columns,
     'ranking': selector.ranking_})
feature_rankings.sort_values(by=['ranking'])

We see that the linear regression performs the best with all of the features. We observe the ranking of importance of these features.
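As a reading aid for the ranking table, the sketch below runs RFE on synthetic data where the true importance order is known. `X_toy` and `y_toy` are generated for illustration only. In scikit-learn, `ranking_` assigns 1 to the features that were kept, and larger numbers to features that were eliminated earlier.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Three synthetic features on the same scale; the target depends strongly on
# the first, weakly on the second and not at all on the third.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = 5.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# Eliminate one feature at a time until a single feature remains.
selector = RFE(LinearRegression(), n_features_to_select=1)
selector.fit(X_toy, y_toy)

# Rank 1 is the surviving feature; the highest rank was dropped first.
ranking = list(selector.ranking_)
```

With a linear model, RFE drops the feature with the smallest coefficient magnitude at each step, so the ranking is only comparable across features that share a similar scale.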

## Logistic Regression

In [None]:
drop_parameters = ['date','title','ratio','ratio_bins','dislikeCount','likeCount','commentCount','viewCount','magMax']
X = vids.drop(drop_parameters,axis=1)
y = vids['ratio_bins']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,random_state=42)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
predicted = logreg.predict(X_test)
print(accuracy_score(y_test,predicted))
print(f1_score(y_test,predicted,average='macro'))
print(recall_score(y_test,predicted,average='macro'))
print(precision_score(y_test,predicted,average='macro'))
print(confusion_matrix(y_test,predicted))

In [None]:
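predictions = []
# Fit RFE for every feature count from 1 to 19 and store the test-set predictions
for i in range(1, 20):
    estimator = LogisticRegression()
    # n_features_to_select must be passed by keyword in recent scikit-learn releases
    selector = RFE(estimator, n_features_to_select=i)
    selector.fit(X_train,y_train)
    predictions.append(selector.predict(X_test))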

In [None]:
accuracy, f1, recall, precision = ([] for i in range(4))
for prediction in predictions:
    accuracy.append(accuracy_score(y_test,prediction))
    f1.append(f1_score(y_test,prediction,average='macro'))
    recall.append(recall_score(y_test,prediction,average='macro'))
    precision.append(precision_score(y_test,prediction,average='macro'))
    
prediction_df = pd.DataFrame(
    {'accuracy': accuracy,
     'f1': f1,
    'recall': recall,
     'precision': precision
    })

In [None]:
cols = ['accuracy','f1','recall','precision']
titleList = ['Accuracy','F1','Recall','Precision']

fig,axes = plt.subplots(2, 2, sharex=False, sharey=False,figsize = (20,20),constrained_layout = True)
#plt.tight_layout()
fig.suptitle('Logistic Regression Performance with RFE', size='40',y=1.05)
for i, ax in enumerate(axes.flat):
    # row 0 of prediction_df holds the 1-feature model, so shift the index by one
    sns.lineplot(data=prediction_df,x=prediction_df.index + 1,y=cols[i],ax=ax,marker="o")
    ax.set_title(titleList[i],size='24')
    ax.set_xlabel('No. of features',size='20')
    ax.set_ylabel(cols[i],size='20')

In [None]:
estimator = LogisticRegression()
selector = RFE(estimator, n_features_to_select=1)
selector.fit(X_train,y_train)

In [None]:
feature_rankings = pd.DataFrame(
    {'features': X_train.columns,
     'ranking': selector.ranking_})
feature_rankings.sort_values(by=['ranking'])

# Improved Modelling

## Linear Regression

## Other Model

## Improved Models Compared to Basic Models

# Summary and Future Improvement

# Conclusion