# YouTube Streamer Analysis 

Analysing YouTube data about streamers (their subscriber counts, visits, likes and comments) in order to build a recommendation system from the results.

In [None]:
# Installing and loading the packages needed to run the project
# (ggplot2 is attached by tidyverse; stats ships with base R and cannot be installed from CRAN)

install.packages(c("tidyverse", "plotly", "corrplot", "proxy"))
library(tidyverse)
library(plotly)

In [None]:
# Setting maximum rows and columns to view everything in the dataset
options(repr.matrix.max.rows=250, repr.matrix.max.cols=50)

# Set display option to show complete numbers (R's default is 7 significant digits)
options(digits = 10)

# To restore the default display later, run:
# options(digits = 7)

## 1. Data Exploration

Exploring the data by first loading it into Kaggle, viewing it, filling in missing values and then identifying outliers.

### 1.a Loading the data

In [None]:
# load the dataset

notebook_path <- "/kaggle/input/youtube-streamer-analysis/youtubers_df.csv"
youtube_df <- read_csv(notebook_path)

In [None]:
head(youtube_df) # view the data 

#### Observations
1. In the rows shown above, ***tseries*** has the most subscribers (249500000), is from ***India*** and falls under the ***Música y baile*** category.
2. The subscribers column is misspelt as *Suscribers*.

In [None]:
dim(youtube_df) # return the dimensions of the data

The data consists of 1000 data points and 9 columns

In [None]:
str(youtube_df) # display the structure of the data

In [None]:
# Checking the total number of missing values for each column
for (col in names(youtube_df)) {
  if (sum(is.na(youtube_df[[col]])) > 0) {
    print(paste(col, "has", sum(is.na(youtube_df[[col]])), "missing values"))
  }
}

#### Observations
- It can be seen that 306 fields are missing in the column **Categories**
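
The same per-column check can be written without a loop. The sketch below is only an illustration: `toy_df` is a made-up stand-in for `youtube_df`, and its values are invented.

```r
# Toy data frame standing in for youtube_df (all values are made up)
toy_df <- data.frame(
  Username   = c("a", "b", "c", "d"),
  Categories = c("Music", NA, NA, "Education"),
  Likes      = c(10, NA, 30, 40)
)

# Count the missing values in every column at once
missing_counts <- colSums(is.na(toy_df))

# Keep only the columns that actually contain missing values
missing_counts[missing_counts > 0]
```

`colSums(is.na(...))` returns a named vector, so the result can be filtered or sorted like any other vector.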

In [None]:
# Filling in missing values in the "Categories" column with the mode
mode_category <- names(which.max(table(youtube_df$Categories)))
youtube_df$Categories[is.na(youtube_df$Categories)] <- mode_category

# Print the mode category
cat("Mode Category:", mode_category, "\n")

#### Observation
- **Música y baile** is the most frequent category (the mode), so it is used to fill in the missing values
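
Mode imputation can be sketched on a small vector as follows. `get_mode` is a hypothetical helper and the category names are made up; the `was_missing` flag is an extra, assumed step that keeps imputed rows distinguishable from observed ones in later counts.

```r
# Hypothetical helper: most frequent non-missing value of a vector
get_mode <- function(x) {
  counts <- table(x)                 # table() drops NA by default
  names(counts)[which.max(counts)]
}

toy_categories <- c("Music", "Gaming", NA, "Music", NA, "Education")
mode_value <- get_mode(toy_categories)

# Remember which entries were missing before filling them in
was_missing <- is.na(toy_categories)
toy_categories[was_missing] <- mode_value

# Cross-tabulate the filled values against the imputation flag
table(toy_categories, was_missing)
```

Filling roughly a third of a column with one value inflates that value's count, so a flag like this may help when interpreting category frequencies later on.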

In [None]:
# Replacing any remaining missing values in the "Categories" column with the mode
# (a no-op if the previous cell has already been run)
youtube_df$Categories[is.na(youtube_df$Categories)] <- mode_category

In [None]:
# Checking for missing values in the entire data frame
sum(is.na(youtube_df))

- The missing values have been updated with Música y baile and there are no missing values

In [None]:
col_names <- names(youtube_df)
print(col_names)

- Viewing the columns in the table and noting that one of the column names has been spelt incorrectly

In [None]:
# Renaming the misspelt column; running this again changes nothing because no column matches 'Suscribers' any more
names(youtube_df)[names(youtube_df) == 'Suscribers'] <- 'Subscribers'

In [None]:
col_names <- names(youtube_df)
print(col_names)

- All the column names are now correctly spelt

In [None]:
summary(youtube_df)

In [None]:
numerical_df <- youtube_df[c('Subscribers', 'Visits', 'Likes', 'Comments')]
head(numerical_df)

### 1.b Visualising outliers in the data

In [None]:
# Creating separate box plots for each variable
fig <- plot_ly(data = numerical_df, y = ~Subscribers, type = "box", name = "Subscribers") %>%
  add_trace(y = ~Visits, type = "box", name = "Visits") %>%
  add_trace(y = ~Likes, type = "box", name = "Likes") %>%
  add_trace(y = ~Comments, type = "box", name = "Comments")

# Adding a title
fig <- fig %>% layout(title = "BOX PLOT SHOWING THE OUTLIERS ACROSS VARIABLES")

# Display the plot
fig

#### Observations
- Subscribers has the most outliers above the third quartile
- Visits has the second-highest number of outliers
- Likes has the third-highest number of outliers
- Comments has the fewest outliers of the four variables
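
The visual ranking can be backed with counts. The sketch below assumes the usual box plot convention of fences at 1.5 times the interquartile range; `count_iqr_outliers` is a hypothetical helper and `toy_metrics` holds made-up numbers, not the real data.

```r
# Hypothetical helper: count values outside the 1.5 * IQR fences
count_iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Made-up metrics: one extreme subscriber count, no extreme comment count
toy_metrics <- data.frame(
  Subscribers = c(10, 12, 11, 13, 12, 11, 100),
  Comments    = c(1, 2, 3, 2, 1, 2, 3)
)

# One outlier count per column
outlier_counts <- sapply(toy_metrics, count_iqr_outliers)
outlier_counts
```

Applied to `numerical_df`, the same call would give one count per variable, which is easier to compare than points on a plot.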

### 1.c Checking for outliers

In [None]:
# columns that we want to check for outliers
columns <- c('Subscribers', 'Visits', 'Likes', 'Comments')

# setting a threshold for outliers (2 standard deviations from the mean)
threshold <- 2

# keeping the actual values for outliers and replacing non-outliers with NA;
# scale() returns a one-column matrix, so as.vector() turns the z-scores back into a plain vector
zscore_outliers <- numerical_df[columns] %>%
  mutate(across(all_of(columns), ~ifelse(abs(as.vector(scale(.x))) > threshold, .x, NA)))

In [None]:
head(zscore_outliers)

In [None]:
# Counting outliers for each variable
subscriber_outliers <- sum(!is.na(zscore_outliers$Subscribers))
cat('subscriber_outliers: ', subscriber_outliers, '\n')

visit_outliers <- sum(!is.na(zscore_outliers$Visits))
cat('visit_outliers: ', visit_outliers, '\n')

likes_outliers <- sum(!is.na(zscore_outliers$Likes))
cat('likes_outliers: ', likes_outliers, '\n')

comment_outliers <- sum(!is.na(zscore_outliers$Comments))
cat('comment_outliers: ', comment_outliers, '\n')
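
The four counts can also be obtained in a single call. This is a minimal sketch on made-up data (`toy_metrics` is not the real dataset), using the same threshold of 2 standard deviations.

```r
# Made-up metrics with one extreme subscriber count
toy_metrics <- data.frame(
  Subscribers = c(10, 12, 11, 13, 12, 11, 100),
  Visits      = c(5, 6, 5, 7, 6, 5, 6)
)

threshold <- 2

# scale() standardises each column: (value - column mean) / column sd
z <- scale(toy_metrics)

# Count, per column, the values more than `threshold` sds from the mean
outlier_counts <- colSums(abs(z) > threshold)
outlier_counts
```

Because the mean and standard deviation are themselves pulled by extreme values, a z-score rule can flag fewer points than a quartile-based rule on heavily skewed data.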

## 2. Trend Analysis

### 2.a Popular Categories based on the number of streamers

In [None]:
# Finding the number of streamers by category
popular_category <- table(youtube_df$Categories)

# Converting the result to a data frame
popular_category_df <- as.data.frame(popular_category)

# Resetting the column names
colnames(popular_category_df) <- c("Category", "Total No. of Streamers / Category")

# Inspecting our data
head(popular_category_df)

In [None]:
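# Creating a bar graph of the ten categories with the most streamers
# (table() orders categories alphabetically, so the rows are sorted by count first)
top_categories_df <- popular_category_df[order(-popular_category_df[[2]]), ][1:10, ]
fig <- ggplot(top_categories_df, aes(x = Category, y = `Total No. of Streamers / Category`, fill = Category)) +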
  geom_bar(stat = "identity") +
  labs(title = "BAR GRAPH SHOWING THE POPULAR CATEGORIES BY STREAMERS")

# Display the plot
print(fig)

### 2.b Correlation between the number of subscribers, likes and comments

In [None]:
library(corrplot) # visual exploratory tool on correlation matrix

# Extracting the relevant columns for correlation
correlation_matrix <- cor(select(youtube_df, Subscribers, Likes, Comments))

# Creating a heatmap
corrplot(correlation_matrix, method = "color", col = colorRampPalette(c("navyblue", "white", "darkred"))(50), 
         type = "upper", order = "hclust", tl.cex = 0.7)

# Adding title
title("CORRELATION MATRIX BETWEEN SUBSCRIBERS, LIKES AND COMMENTS")

#### Observation
- All the pairwise correlations are close to 0
- There is therefore no strong linear relationship among subscribers, likes and comments
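
`cor()` computes Pearson correlation by default, which only measures linear association and is sensitive to extreme values. As a hedged complement, a rank-based (Spearman) correlation can be compared with it; the numbers below are made up for illustration.

```r
# Made-up data: likes rise with subscribers, but not in a straight line
toy_metrics <- data.frame(
  Subscribers = c(1e6, 2e6, 5e6, 1e7, 2e8),
  Likes       = c(100, 400, 900, 1600, 2500)
)

# Pearson measures linear association on the raw values
pearson <- cor(toy_metrics, method = "pearson")

# Spearman measures monotonic association on the ranks
spearman <- cor(toy_metrics, method = "spearman")

pearson["Subscribers", "Likes"]
spearman["Subscribers", "Likes"]
```

If the Spearman values on the real data were also near 0, that would strengthen the observation above; if they were clearly higher, the weak Pearson values might be driven by a few very large channels.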

## 3. Audience Study

### 3.a Studying the audience based on the streamers they watch the most

In [None]:
# Grouping by 'Country' and 'Username' and counting the rows for each pair
# (a username normally appears once, so most counts are 1)
streamers_country <- youtube_df %>%
  group_by(Country, Username) %>%
  summarise(`Number of Streamers` = n(), .groups = "drop")

# Renaming columns
colnames(streamers_country) <- c('Country', 'Streamers', 'Number of Streamers')


In [None]:
# Displaying the first few rows
head(streamers_country)

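#### Observation
- Arabia Saudita appears in the first rows, but the output is ordered alphabetically by country and each row is a single country and username pair, so `head()` alone does not show which country has the most streamers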
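
Counting streamers per country requires grouping by country only. The sketch below shows one way to do that in base R; `toy_df` and its values are made up.

```r
# Made-up stand-in for youtube_df
toy_df <- data.frame(
  Username = c("u1", "u2", "u3", "u4", "u5"),
  Country  = c("India", "India", "Estados Unidos", "India", "Unknown")
)

# Tabulate the countries and sort from most to fewest streamers
country_counts <- sort(table(toy_df$Country), decreasing = TRUE)

# Convert to a data frame with readable column names
country_counts_df <- as.data.frame(country_counts, stringsAsFactors = FALSE)
colnames(country_counts_df) <- c("Country", "Number of Streamers")

country_counts_df
```

The first row of the sorted result is the country with the most streamers.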

In [None]:
# Creating a bar chart with plotly
fig <- plot_ly(
  data = streamers_country,
  x = ~Country,
  y = ~`Number of Streamers`,
  type = "bar",
  color = ~Streamers,
  text = ~paste("Streamers: ", Streamers, "<br>Number of Streamers: ", `Number of Streamers`),
  hoverinfo = "text"
) %>%
  layout(title = "TOP STREAMERS BY COUNTRY")

# Displaying the plot
fig

### 3.b Studying the audience based on the category and country

In [None]:
# Grouping by 'Country' and 'Categories' and counting the streamers in each pair
country_category_df <- youtube_df %>%
  group_by(Country, Categories) %>%
  summarise(`Number of categories` = n(), .groups = "drop") %>%
  arrange(desc(`Number of categories`))

In [None]:
head(country_category_df)

In [None]:
# Creating a bar chart with plotly
fig <- plot_ly(
  data = country_category_df[1:10, ],
  x = ~Categories,
  y = ~`Number of categories`,
  type = "bar",
  color = ~Country,
  text = ~paste("Country: ", Country, "<br>Number of categories: ", `Number of categories`),
  hoverinfo = "text"
) %>%
  layout(title = "TOP CATEGORIES BY COUNTRY")

# Displaying the plot
fig

#### Observation
- Among streamers from India, Música y baile is the most common category. Note that the 306 missing categories were filled with Música y baile, which inflates its counts.

## 4. Performance Metrics

### 4.a Average metrics

In [None]:
# Calculate the mean metrics
var_average <- youtube_df %>%
  summarise(
    mean_subscribers = mean(Subscribers, na.rm = TRUE),
    mean_visits = mean(Visits, na.rm = TRUE),
    mean_likes = mean(Likes, na.rm = TRUE),
    mean_comments = mean(Comments, na.rm = TRUE)
  )

# Display the mean metrics
var_average

In [None]:
# Creating a bar plot with plot_ly
# (unlist() turns the one-row summary into the plain numeric vector that plot_ly expects)
fig <- plot_ly(
  x = names(var_average),
  y = unlist(var_average),
  type = "bar",
  marker = list(color = "purple")
) %>%
  layout(title = "Average Metrics", xaxis = list(title = "Metrics"), yaxis = list(title = "Mean Value"))

# Displaying the plot
fig

The following can be deduced from the graph:

1. The averages fall at each step from Subscribers to Visits to Likes to Comments.
2. The bars compare four different metrics at one point in time, so the drop does not by itself show a decline in engagement or popularity over time.
3. What it does suggest is that only a fraction of subscribers go on to visit, and fewer still like or comment.
4. Judging whether streamers are struggling to retain or attract an audience would need data collected over time.
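
One way to compare the metrics on a common footing is to express likes and comments as rates per visit. This is only a sketch: `toy_df` holds made-up numbers, and the `like_rate` and `comment_rate` columns are hypothetical names that do not exist in the dataset.

```r
# Made-up stand-in for youtube_df
toy_df <- data.frame(
  Username = c("u1", "u2", "u3"),
  Visits   = c(1000, 5000, 0),
  Likes    = c(100, 250, 0),
  Comments = c(10, 50, 0)
)

# Rates per visit; channels with no visits get NA instead of dividing by zero
toy_df$like_rate    <- ifelse(toy_df$Visits > 0, toy_df$Likes / toy_df$Visits, NA)
toy_df$comment_rate <- ifelse(toy_df$Visits > 0, toy_df$Comments / toy_df$Visits, NA)

# Average rates across channels, ignoring the undefined ones
avg_rates <- colMeans(toy_df[c("like_rate", "comment_rate")], na.rm = TRUE)
avg_rates
```

Rates like these describe how engaged an audience is relative to its size, which raw averages on very different scales cannot show.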

## 5. Content Categories

### 5.a Top Categories by Streamers

In [None]:
# Finding the number of streamers by category
popular_category <- table(youtube_df$Categories)

# Converting the result to a data frame with one row per category
popular_category_df <- as.data.frame(popular_category, stringsAsFactors = FALSE)

# Renaming the columns
colnames(popular_category_df) <- c('Category', 'Total No. of Streamers / Category')

# Sorting by streamer count so that the first ten rows are the top ten categories
popular_category_df <- popular_category_df[order(-popular_category_df[[2]]), ]

# Creating a bar plot with plot_ly
fig <- plot_ly(
  data = popular_category_df[1:10, ],
  x = ~Category,
  y = ~`Total No. of Streamers / Category`,
  type = "bar",
  marker = list(color = "black")
) %>%
  layout(title = "BAR GRAPH SHOWING THE TOP CATEGORIES BY STREAMERS", xaxis = list(title = "Category"), yaxis = list(title = "Total No. of Streamers / Category"))

# Displaying the plot
fig

#### Observation
- Música y baile has the most streamers: it is the mode found in section 1, and the 306 filled-in rows add to its count

### 5.b Top Categories which have exceptional performance metrics

In [None]:
# Creating a performance metric as the row-wise mean of subscribers, visits, likes, and comments
youtube_df$Performance_Metric <- rowMeans(youtube_df[, c("Subscribers", "Visits", "Likes", "Comments")], na.rm = TRUE)

# Calculating the sum of the performance metric for each category
category_metrics <- aggregate(Performance_Metric ~ Categories, youtube_df, sum)

# Sorting and selecting the top 10 categories by performance metric
top_category_metrics <- category_metrics[order(-category_metrics$Performance_Metric), ][1:10, ]

# Viewing the results
top_category_metrics

#### Observation
- Música y baile has the highest total performance metric of all the categories
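
Because `rowMeans` averages raw counts, the metric is likely dominated by whichever columns have the largest values. A possible alternative, sketched here under the assumption that each metric should carry equal weight, is to rescale every column to the 0 to 1 range first. `min_max` is a hypothetical helper and the numbers are made up.

```r
# Hypothetical helper: rescale a numeric vector to the 0-1 range
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant column
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up metrics on very different scales
toy_df <- data.frame(
  Subscribers = c(1e6, 5e6, 9e6),
  Visits      = c(2e5, 1e5, 3e5),
  Likes       = c(100, 300, 200),
  Comments    = c(30, 10, 20)
)

# Raw mean: driven almost entirely by Subscribers
raw_metric <- rowMeans(toy_df)

# Scaled mean: each column contributes on the same 0-1 scale
scaled_metric <- rowMeans(as.data.frame(lapply(toy_df, min_max)))

raw_metric
scaled_metric
```

Whether equal weighting is appropriate is a modelling choice; the raw mean used above effectively ranks channels by their largest counts.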

In [None]:
# Creating a bar chart to visualize the data
# (the bar colour is set through marker; `colors` only applies when a variable is mapped to color)
fig <- plot_ly(top_category_metrics, x = ~Categories, y = ~Performance_Metric, type = "bar",
               marker = list(color = "red", line = list(color = 'black', width = 1)))
fig <- fig %>% layout(title = "TOP CATEGORIES BY PERFORMANCE METRIC",
                      xaxis = list(title = "Categories"),
                      yaxis = list(title = "Performance Metric"))
fig

### 5.c Heatmap to view the categories with exceptional performance metrics

In [None]:
# Calculating average metrics by categories
category_avg_metrics <- youtube_df %>%
  group_by(Categories) %>%
  summarise(
    avg_subscribers = mean(Subscribers, na.rm = TRUE),
    avg_visits = mean(Visits, na.rm = TRUE),
    avg_likes = mean(Likes, na.rm = TRUE),
    avg_comments = mean(Comments, na.rm = TRUE)
  )

# Merging data into a long format suitable for ggplot
category_avg_metrics_long <- category_avg_metrics %>%
  pivot_longer(cols = starts_with("avg_"), names_to = "Metric", values_to = "Value")

# Creating a heatmap with ggplot
ggplot(category_avg_metrics_long, aes(x = Categories, y = Metric, fill = Value)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = sprintf("%.2f", Value)), vjust = 1, color = "black") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = median(category_avg_metrics_long$Value, na.rm = TRUE)) +
  labs(
    title = "Average Metrics by Categories",
    subtitle = "Subscribers, Likes, Visits, and Comments"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## 6. Brand and Collaboration

No information has been provided for this section, so it has been left empty.

## 7. Benchmarking

### 7.a Above Average Streamers

In [None]:
# Selecting the identifying columns and the four metrics for each streamer
streamer_metrics <- youtube_df %>%
  select(Username, Categories, Country, Subscribers, Visits, Likes, Comments)

# Calculating the average metrics
metrics_avg <- summarise_at(streamer_metrics, vars(Subscribers, Visits, Likes, Comments), mean, na.rm = TRUE)

# Identifying streamers with above-average performance
above_avg_streamers <- streamer_metrics %>%
  filter(
    Subscribers > metrics_avg$Subscribers,
    Visits > metrics_avg$Visits,
    Likes > metrics_avg$Likes,
    Comments > metrics_avg$Comments
  )

In [None]:
# Viewing our data
cat("The above average streamers are:\n")
print(above_avg_streamers)

In [None]:
# Counting the number of above-average streamers
streamers <- nrow(above_avg_streamers)
cat("There are", streamers, "above-average YouTube streamers\n")

#### Observation
- MrBeast is the top streamer in the Videojuegos, Humor category, with 183500000 subscribers
- There are 38 above average streamers

### 7.b Top Performing Creators based on Subscribers

In [None]:
# Top content creators by subscribers
top_content_creators_subscribers <- above_avg_streamers %>%
  arrange(desc(Subscribers)) %>%
  slice_head(n = 10) %>%
  select(Username, Subscribers)

# Printing the top performing content creators by subscribers
cat("The top performing content creators by subscribers are:\n")
print(top_content_creators_subscribers)

### 7.c Top Performing Creators based on Visits

In [None]:

# Top content creators by visits
top_content_creators_visits <- above_avg_streamers %>%
  arrange(desc(Visits)) %>%
  slice_head(n = 10) %>%
  select(Username, Visits)

# Printing the top performing content creators by visits
cat("The top performing content creators by visits are:\n")
print(top_content_creators_visits)

### 7.d Top Performing Creators based on Likes

In [None]:
# Top content creators by likes
top_content_creators_likes <- above_avg_streamers %>%
  arrange(desc(Likes)) %>%
  slice_head(n = 10) %>%
  select(Username, Likes)

# Printing the top performing content creators by likes
cat("The top performing content creators by likes are:\n")
print(top_content_creators_likes)

### 7.e Top Performing Creators based on Comments

In [None]:
# Top content creators by comments
top_content_creators_comments <- above_avg_streamers %>%
  arrange(desc(Comments)) %>%
  slice_head(n = 10) %>%
  select(Username, Comments)

# Printing the top performing content creators by comments
cat("The top performing content creators by comments are:\n")
print(top_content_creators_comments)

#### Observation
- Even though MrBeast has the most subscribers among the above-average streamers, he is no longer the top streamer when the ranking metric changes. The top performer therefore depends on the metric used.

## 8. Recommendations

In [None]:
# Label encoding below only needs factor() from base R, so no extra package is installed here

# Creating a data frame for label encoding
label_encoding_df <- youtube_df %>%
  select(Categories)

# Applying label encoding to 'Categories'
label_encoding_df$Category_Encoded <- as.numeric(factor(label_encoding_df$Categories))

# Displaying the data frame after label encoding
print(label_encoding_df)

In [None]:
# Fit and transform data using label encoding
youtube_df$Category_Encoded <- as.numeric(factor(youtube_df$Categories))

# Display the encoded values
head(youtube_df[c("Categories", "Category_Encoded")])
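
The codes produced by `as.numeric(factor(...))` follow the sorted order of the category names. The sketch below, on made-up category names, shows how to inspect that mapping and how to decode the integers again.

```r
# Made-up category labels
toy_categories <- c("Music", "Gaming", "Music", "Education")

# factor() sorts the distinct labels; each label's position becomes its code
category_factor <- factor(toy_categories)
codes <- as.integer(category_factor)

# Lookup table from label to code
lookup <- setNames(seq_along(levels(category_factor)), levels(category_factor))
lookup

# Decoding: index the levels with the codes to recover the labels
decoded <- levels(category_factor)[codes]
decoded
```

The codes are identifiers only: a larger code does not mean a "larger" category, so they should not be treated as a numeric feature with an order.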

In [None]:
# Creating a unique user-category matrix -- identifying unique users and categories in our dataset
unique_users <- unique(youtube_df$Username)
unique_category <- unique(youtube_df$Category_Encoded)

# Creating a dataframe with the above
user_category_df <- as.data.frame(matrix(0, nrow = length(unique_users), ncol = length(unique_category)))
colnames(user_category_df) <- as.character(unique_category)
rownames(user_category_df) <- as.character(unique_users)

# Populating the dataframe with the performance metric
# This loop goes through each row of youtube_df and fills the matching user-category cell with that row's performance metric
for (i in 1:nrow(youtube_df)) {
  user <- as.character(youtube_df$Username[i])
  category <- as.character(youtube_df$Category_Encoded[i])
  performance_metrics <- youtube_df$Performance_Metric[i]
  
  user_category_df[user, category] <- performance_metrics
}

# Filling in missing values where there are NaN
user_category_df[is.na(user_category_df)] <- 0

In [None]:
head(user_category_df)

In [None]:
# Calculate the cosine similarity in our data
similarity_matrix <- proxy::simil(x = t(user_category_df), method = "cosine")

# Convert it to a matrix
similarity_matrix <- as.matrix(similarity_matrix)

# Convert it to a dataframe (check.names = FALSE keeps the numeric category codes as column names)
similarity_df <- data.frame(similarity_matrix, check.names = FALSE)

In [None]:
head(similarity_df)
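
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following base R sketch spells that out; `cosine_similarity` is a hypothetical helper and the vectors are made up.

```r
# Hypothetical helper: cosine of the angle between two numeric vectors
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

u <- c(1, 0, 2)
v <- c(2, 0, 4)   # points in the same direction as u
w <- c(0, 3, 0)   # shares no non-zero position with u

sim_uv <- cosine_similarity(u, v)   # close to 1: same direction
sim_uw <- cosine_similarity(u, w)   # 0: orthogonal

sim_uv
sim_uw
```

The second case matters here: if every username has a value in only one category column of the user-category matrix, any two category columns share no non-zero rows, so their cosine similarity is 0.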

In [None]:
# Choosing a target category - selecting a category for which I want to make recommendation
target_category <- 'Videojuegos, Humor'

# Extracting the metrics of the streamers that belong to my chosen category
target_category_data <- youtube_df[youtube_df$Categories == target_category, c('Subscribers', 'Visits', 'Likes', 'Comments')]

In [None]:
# Cosine similarity between every streamer and the target category's mean profile,
# computed in base R so that each score lines up with its row in youtube_df
metric_cols <- c('Subscribers', 'Visits', 'Likes', 'Comments')
metrics_matrix <- as.matrix(youtube_df[, metric_cols])
target_profile <- colMeans(target_category_data, na.rm = TRUE)
similarity_scores <- as.vector(metrics_matrix %*% target_profile) /
  (sqrt(rowSums(metrics_matrix^2)) * sqrt(sum(target_profile^2)))

# Add similarity scores to the dataframe
youtube_df$'similarity score' <- similarity_scores

# Identify unseen categories - rows whose category differs from the target category
unseen_categories <- youtube_df[youtube_df$Categories != target_category, ]

# Calculate weighted recommendation - the similarity of each unseen row to the target category
weighted_recommendation <- unseen_categories$'similarity score'

# Sort and recommend - sorting the rows by score and keeping the ten most similar
top_recommendation <- unseen_categories[order(-weighted_recommendation), ][1:10, ]

# Present recommendation - these are the categories that are similar to the target category based on performance metrics
recommended_categories <- unique(top_recommendation$Categories)

cat(paste("If you liked", target_category, ", you might also enjoy these categories:\n", paste(recommended_categories, collapse = ', '), "\n"))