![youtube.jpg](attachment:youtube.jpg)

Welcome to our YouTube Analysis Project! With over 1 billion hours of content watched daily, YouTube is the second most visited site globally, reaching 44% of internet users, and it accounts for 37% of mobile internet traffic. In this project, we explore YouTube data using Python libraries. We analyze viewership trends, user engagement metrics, and content performance to uncover insights into one of the internet's most influential platforms.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Loading csv data

In [4]:
comments = pd.read_csv(r"E:\Youtube/UScomments.csv" , on_bad_lines='warn')

Skipping line 41589: expected 4 fields, saw 11
Skipping line 51628: expected 4 fields, saw 7
Skipping line 114465: expected 4 fields, saw 5

Skipping line 142496: expected 4 fields, saw 8
Skipping line 189732: expected 4 fields, saw 6
Skipping line 245218: expected 4 fields, saw 7

Skipping line 388430: expected 4 fields, saw 5


In [5]:
comments.head()

Unnamed: 0,video_id,comment_text,likes,replies
0,XpVt6Z1Gjjo,Logan Paul it's yo big day ‼️‼️‼️,4,0
1,XpVt6Z1Gjjo,I've been following you from the start of your...,3,0
2,XpVt6Z1Gjjo,Say hi to Kong and maverick for me,3,0
3,XpVt6Z1Gjjo,MY FAN . attendance,3,0
4,XpVt6Z1Gjjo,trending 😉,3,0


In [6]:
comments.shape

(691400, 4)

In [7]:
comments.isnull().sum()

video_id         0
comment_text    26
likes            0
replies          0
dtype: int64

In [8]:
comments.dropna(inplace = True)

### Sentiment analysis

TextBlob assigns each comment a polarity score between -1 (most negative) and 1 (most positive). The score lets us separate the comments into positive, negative, and neutral sentiment, which helps an organization target its audience more precisely.

In [9]:
!pip install textblob



In [10]:
from textblob import TextBlob

In [None]:
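polarity = []
for comment in comments['comment_text']:
    try:
        # TextBlob polarity ranges from -1 (negative) to 1 (positive)
        polarity.append(TextBlob(comment).sentiment.polarity)
    except Exception:
        # Treat comments that TextBlob cannot process as neutral
        polarity.append(0)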

In [None]:
comments['Polarity'] = polarity

In [None]:
comments.head()

In [None]:
Sentiment = []
for i in comments['Polarity']:
    if i > 0:
        Sentiment.append('Positive')
    elif i == 0:
        Sentiment.append('Neutral')
    else:
        Sentiment.append('Negative')
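
The same labelling can be written without an explicit loop. The following is a minimal sketch using `np.select`; the `demo` frame and its scores are made up for illustration, and only the `Polarity` and `Sentiment` column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical polarity scores standing in for the TextBlob output
demo = pd.DataFrame({'Polarity': [0.8, 0.0, -0.35, 1.0]})

# Conditions are checked in order; rows matching neither fall through to the default
conditions = [demo['Polarity'] > 0, demo['Polarity'] < 0]
labels = ['Positive', 'Negative']
demo['Sentiment'] = np.select(conditions, labels, default='Neutral')

print(demo)
```

On a frame with several hundred thousand rows, a vectorized version like this is usually faster than appending to a list row by row.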

In [None]:
comments['Sentiment'] = Sentiment

In [None]:
comments.head()

In [None]:
polarity_count = comments.groupby(by = 'Sentiment')[['Polarity']].count()
polarity_count

In [None]:
sns.barplot(data = polarity_count, x = polarity_count.index, y ='Polarity')

In [None]:
polarity_mean = comments.groupby(by = 'Sentiment')[['Polarity']].mean()
polarity_mean

In [None]:
sns.barplot(data = polarity_mean, x = polarity_mean.index, y ='Polarity')

### Wordcloud

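A word cloud shows which words viewers use most often in positive and negative comments. Here the positive cloud is built from comments with a polarity of exactly 1 and the negative cloud from comments with a polarity of exactly -1. Separating the two gives a rough indication of how the content is received, and the positive comments can help us increase the reach of particular videos.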
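
A word cloud sizes each word by how often it occurs once stop words are removed. The sketch below shows that counting step with the standard library only. It is not the `wordcloud` library's own tokenizer, and the sample comments and the tiny `stop_words` set are hypothetical.

```python
from collections import Counter
import re

# Hypothetical comments and a tiny stop-word list, for illustration only
sample_comments = ["I love this song", "love love the video", "this is the best"]
stop_words = {"i", "this", "the", "is"}

# Lowercase each comment, split it into words, and drop the stop words
words = [
    word
    for text in sample_comments
    for word in re.findall(r"[a-z']+", text.lower())
    if word not in stop_words
]

word_counts = Counter(words)
print(word_counts.most_common(3))
```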

In [None]:
!pip install wordcloud

In [None]:
from wordcloud import WordCloud, STOPWORDS

In [None]:
comments_positive = comments[comments['Polarity'] == 1][['comment_text']]
comments_negative = comments[comments['Polarity'] == -1][['comment_text']]

In [None]:
type(comments['comment_text'])  # WordCloud expects one string, so the Series must be joined into a single string first

In [None]:
total_comment_positive = ' '.join(comments_positive['comment_text'])
total_comment_negative = ' '.join(comments_negative['comment_text'])

In [None]:
set(STOPWORDS)

In [None]:
wordcloud_positive = WordCloud(stopwords = set(STOPWORDS)).generate(total_comment_positive)

In [None]:
plt.imshow(wordcloud_positive)

In [None]:
wordcloud_negative = WordCloud(stopwords = set(STOPWORDS)).generate(total_comment_negative)

In [None]:
plt.imshow(wordcloud_negative)

### Emoji Analysis

Emoji analysis makes it easier to categorize comments. The emojis viewers use give an indication of content quality, which YouTube can use to increase its reach.

In [None]:
!pip install emoji==2.2.0

In [None]:
import emoji

In [None]:
emoji.__version__

In [None]:
comments['comment_text'].head()

In [None]:
comment = 'trending 😉'
[i for i in comment if i in emoji.EMOJI_DATA]
        

In [None]:
all_emojis_list = []
for comment in comments['comment_text']:
    for char in comment:
        if char in emoji.EMOJI_DATA:
            all_emojis_list.append(char)
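
The scan above can also feed a `Counter` directly, which skips the intermediate list. This is a small sketch under stated assumptions: `KNOWN_EMOJIS` is a hypothetical stand-in for the keys of `emoji.EMOJI_DATA`, and the sample comments are invented.

```python
from collections import Counter

# Hypothetical stand-in for the keys of emoji.EMOJI_DATA
KNOWN_EMOJIS = {'😂', '😍', '🔥', '😉'}
sample_comments = ['trending 😉', 'so good 😂😂', 'fire 🔥😂']

# Pass a generator of matching characters straight into Counter
emoji_counts = Counter(
    char
    for text in sample_comments
    for char in text
    if char in KNOWN_EMOJIS
)

print(emoji_counts.most_common(2))
```

Because the scan works one character at a time, an emoji made of several code points (a flag or a skin-tone variant, for example) is likely to be counted as its separate parts rather than as one symbol.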

In [None]:
from collections import Counter

In [None]:
Counter(all_emojis_list).most_common(10)

In [None]:
emojis = [symbol for symbol, _ in Counter(all_emojis_list).most_common(10)]

In [None]:
emojis

In [None]:
freqs = [count for _, count in Counter(all_emojis_list).most_common(10)]

In [None]:
freqs

In [None]:
import plotly.graph_objs as go
from plotly.offline import iplot

In [None]:
trace =  go.Bar(x= emojis, y=freqs)

In [None]:
iplot([trace])

Since the most used emojis are positive in nature, this suggests that viewers like the content.

### Data Collection

In [None]:
import os

In [None]:
files = os.listdir(r'E:\Youtube\additional_data')

In [None]:
files

In [None]:
files_csv = [file for file in files if file.endswith('.csv')]

In [None]:
files_csv

In [None]:
import warnings

In [None]:
from warnings import filterwarnings

In [None]:
filterwarnings('ignore')

In [None]:
path = r"E:\Youtube\additional_data"

# Read every CSV first, then concatenate once instead of growing a DataFrame inside the loop
frames = [
    pd.read_csv(os.path.join(path, file), encoding='iso-8859-1', on_bad_lines='warn')
    for file in files_csv
]
full_df = pd.concat(frames, ignore_index=True)

In [None]:
full_df.shape

### Transforming data

In [None]:
full_df[full_df.duplicated()].shape

In [None]:
full_df.drop_duplicates(inplace=True)

In [None]:
full_df.shape

### Exporting bulk data to CSV, JSON, and databases

In [None]:
#full_df.to_csv(r'E:\Youtube\Full_data/youtube_sample.csv', index = False)

In [None]:
#full_df.to_json(r'E:\Youtube\Full_data/youtube_sample.json')

In [None]:
#from sqlalchemy import create_engine

In [None]:
#engine = create_engine(r'sqlite:///E:\Youtube\Full_data/youtube_sample.sqlite')

In [None]:
#full_df.to_sql('Users', con = engine, if_exists = 'append')
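
To check the database export without writing to disk, a round trip through an in-memory SQLite database is one option. The sketch below swaps the SQLAlchemy engine for a plain `sqlite3` connection from the standard library; the two-row `demo` frame is hypothetical, and only the table name `Users` comes from the notebook.

```python
import sqlite3
import pandas as pd

# Hypothetical two-row frame standing in for full_df
demo = pd.DataFrame({'video_id': ['a1', 'b2'], 'views': [100, 250]})

# An in-memory database avoids touching the file system
conn = sqlite3.connect(':memory:')
demo.to_sql('Users', con=conn, if_exists='replace', index=False)

# Read the table back to confirm the export worked
restored = pd.read_sql('SELECT * FROM Users', conn)
conn.close()

print(restored)
```

The sketch uses `if_exists='replace'` so that re-running the cell does not duplicate rows; with `'append'`, each run adds the same rows to the table again.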

### Analysing the most liked category

In [None]:
full_df.head()

In [None]:
full_df['category_id'].unique()

In [None]:
json_df = pd.read_json(r'E:\Youtube\additional_data/US_category_id.json')

In [None]:
json_df.head()

In [None]:
json_df['items'][1]

In [None]:
cat_dict = {}
for item in json_df['items'].values:
    cat_dict[int(item['id'])] = item['snippet']['title']

In [None]:
cat_dict

In [None]:
full_df['category_name'] = full_df['category_id'].map(cat_dict)
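
The mapping step can be tried on a small excerpt first. The JSON below is a hypothetical sample that assumes the same nesting the notebook reads (`items`, then `id` and `snippet.title`), with the id stored as a string.

```python
import json

# Hypothetical excerpt with the nesting the notebook reads: items -> id, snippet.title
raw = '''
{"items": [
  {"id": "10", "snippet": {"title": "Music"}},
  {"id": "24", "snippet": {"title": "Entertainment"}}
]}
'''
items = json.loads(raw)["items"]

# The id is a string in this sample, so cast it to int to match category_id
cat_dict = {int(item["id"]): item["snippet"]["title"] for item in items}

# Look up a few ids; 999 is deliberately missing from the dictionary
category_ids = [24, 10, 999]
names = [cat_dict.get(cid) for cid in category_ids]

print(names)
```

In pandas, `Series.map` behaves like the `.get` lookup here: ids missing from the dictionary become `NaN`, so it can be worth checking `category_name` for nulls after mapping.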

In [None]:
full_df.head()

In [None]:
plt.figure(figsize = (12,8))
sns.boxplot(data = full_df, x = 'category_name', y = 'likes')
plt.xticks(rotation = 90)
plt.title('Most Liked category over Youtube')
plt.show()

Insights: Music and Entertainment are the most liked categories, followed by Nonprofits & Activism and People & Blogs. Shows, Movies, and Trailers sit at the bottom of the list, making them the least liked categories.

### Audience Engagement over various categories

In [None]:
full_df.columns

In [None]:
full_df['like_rate'] = (full_df['likes'] / full_df['views'])*100
full_df['dislike_rate'] = (full_df['dislikes'] / full_df['views'])*100
full_df['comment_count_rate'] = (full_df['comment_count'] / full_df['views'])*100
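
Each rate is the count expressed as a percentage of views. Here is a worked example for a single hypothetical video; the `rate` helper and the numbers are illustrative and not part of the notebook.

```python
def rate(count, views):
    """Return count as a percentage of views, or None when views is zero."""
    if views == 0:
        return None
    return count * 100 / views

# Hypothetical video: 10,000 views, 500 likes, 20 dislikes, 150 comments
views = 10_000
like_rate = rate(500, views)
dislike_rate = rate(20, views)
comment_count_rate = rate(150, views)

print(like_rate, dislike_rate, comment_count_rate)
```

The column arithmetic in the notebook has no zero check, so a row with zero views would produce `inf` or `NaN` rather than `None`.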

In [None]:
plt.figure(figsize = (8,6))
sns.boxplot(data=full_df, x = 'category_name', y = 'like_rate')
plt.xticks(rotation = 90)
plt.title('Like rate per Category showing audience engagement')
plt.show()

Insights: The Entertainment category has the highest like rate. Except for Shows, Movies, and Trailers, most categories show a substantial like rate.

In [None]:
plt.figure(figsize = (8,6))
sns.boxplot(data=full_df, x = 'category_name', y = 'dislike_rate')
plt.xticks(rotation = 90)
plt.title('Dislike rate per Category showing audience engagement')
plt.show()

Insights: From the above boxplot, a high dislike rate is evident among the most liked categories, which indicates high audience engagement with mixed sentiment. Categories such as People & Blogs, News & Politics, and Nonprofits & Activism show a high dislike rate and a moderate like rate.

In [None]:
plt.figure(figsize = (8,6))
sns.boxplot(data=full_df, x = 'category_name', y = 'comment_count_rate')
plt.xticks(rotation = 90)
plt.title('Comment count rate per Category showing audience engagement')
plt.show()

Insights: Entertainment, People & Blogs, Science & Technology, and Gaming have high comment count rates, which suggests that these categories can actively engage the audience.

In [None]:
sns.regplot(data=full_df, x = 'views', y='likes')

In [None]:
full_df[['likes', 'dislikes', 'views']].corr()

In [None]:
sns.heatmap(full_df[['likes', 'dislikes', 'views']].corr(), annot = True)

Insights: The regression plot and heatmap show a high positive correlation between views and likes.
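
By default, `DataFrame.corr()` computes the Pearson correlation coefficient. As a rough illustration of what the heatmap cells contain, the sketch below computes it by hand for hypothetical view and like counts; `pearson_r` is a helper written for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical views and likes for five videos
views = [1_000, 5_000, 20_000, 50_000, 100_000]
likes = [40, 210, 900, 1_800, 4_500]
r = pearson_r(views, likes)

print(round(r, 3))
```

A value close to 1 means likes rise roughly in proportion to views. It does not by itself show that views cause likes.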

### Channels having largest number of trending videos

In [None]:
full_df.head()

In [None]:
# Count trending entries per channel; naming both columns explicitly works across pandas versions
cdf = full_df['channel_title'].value_counts().rename_axis('channel_title').reset_index(name='total_videos')

In [None]:
cdf

In [None]:
import plotly.express as px

In [None]:
px.bar(data_frame = cdf[0:20], x = 'channel_title', y='total_videos')

Insights: Channels from the Entertainment category have the most trending videos. The Late Show with Stephen Colbert, WWE, Late Night with Seth Meyers, and TheEllenShow are the top four channels.

### Impact of title punctuation on likes, dislikes, and views

In [None]:
import string

In [None]:
string.punctuation

In [None]:
full_df['title'][0]

In [None]:
len([char for char in full_df['title'][0] if char in string.punctuation])

In [None]:
def punc_count(text):
    return len([char for char in text if char in string.punctuation])
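
A quick check of the counting logic on a few made-up titles. The function is redefined here so the snippet runs on its own, and the titles are hypothetical.

```python
import string

def punc_count(text):
    """Count the characters of text that appear in string.punctuation."""
    return sum(char in string.punctuation for char in text)

# Hypothetical titles, for illustration only
print(punc_count("Hello, world!"))
print(punc_count("OFFICIAL TRAILER (2018) | HD"))
print(punc_count("No punctuation here"))
```

`string.punctuation` contains ASCII punctuation only, so characters such as curly quotes in a title are not counted.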

In [None]:
full_df['count_punc'] = full_df['title'].apply(punc_count)

In [None]:
full_df['count_punc']

In [None]:
plt.figure(figsize= (8,6))
sns.boxplot(data = full_df, x = 'count_punc', y = 'views')
plt.xticks(rotation = 0)
plt.show()

Insights: Titles with 8 punctuation marks get the most views, followed by titles with 2 and 7 punctuation marks.

In [None]:
plt.figure(figsize= (8,6))
sns.boxplot(data = full_df, x = 'count_punc', y = 'likes')
plt.xticks(rotation = 0)
plt.show()

Insights: Titles with 1 and 3 punctuation marks have the most likes. Overall, titles with two to three punctuation marks appear to get the most likes and views.