# YouTube Thumbnail Complexity Data Analysis

## Objective
This notebook analyzes thumbnail data by following the data science life cycle. This civic project is a collaboration between Minerva University (Fall 2021, Sophomore, Seoul) and Sandbox Network. Sandbox is a leading multi-channel network (MCN) in South Korea that supports over 450 digital creators and their content.

Data science life cycle reference: <br>
https://towardsdatascience.com/stoend-to-end-data-science-life-cycle-6387523b5afc

Information about Sandbox:<br>
https://www.kedglobal.com/kunicornsView/kun0004 <br>
https://www.kedglobal.com/newsView/ked202107220014

## Contents
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modelling
5. Evaluation
6. Deployment

## 1. Business Understanding
### 1.1 — A case study of Netflix 

Source: https://twitter.com/TrungTPhan/status/1445768087832182796

#### Q1: Why are thumbnails so important in the creator economy?
"A Netflix user will browse the app for 90 seconds and leave if they find nothing. Thumbnail artwork is actually NFLX's most effective lever to influence a viewer's choice. A user will look at one for only 1.8 seconds, so NFLX spends huge to optimize them. Humans are visual animals. Our eyes move 3-4x per second to process information and we can analyze an image in as little as 13 milliseconds. In 2014, Netflix consumer research showed that thumbnail artwork is: 1) the biggest influence to watch content, and 2) the focus of 82% of browsing. A user looks at one for only 1.8 seconds. If they can't find Netflix content in 90 seconds, they'll leave the app."

#### Q2: How does Netflix optimize for thumbnail selection?
"The process is called Aesthetic Visual Analysis (AVA):
1. Pull all the frames from a video (e.g. a 1-hr episode of Stranger Things has 86k frames)
2. Tag each frame with metadata identifying key attributes (i.e. "Frame Annotation"):
    - Frame number
    - Saliency
    - Brightness / Contrast
    - Nudity probability
    - Face / skin tone
3. Grade each frame based on these variables (i.e. assign priority value)
    - Visual (brightness, contrast, color, motion blur)
    - Contextual (face detection / shot angle)
    - Composition (photography principles like "rule of thirds", symmetry, depth of field)
4. Rank frames, choosing thumbnails that are most likely to be clicked (i.e. heaps?)
    
Winning traits:
- expressive faces, main characters, brightness
- good localization; contain attributes most attractive to a specific region
- thumbnails with villainous characters outperform
- thumbnails with more than 3 people vastly underperform"
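
Steps 3 and 4 above (grade each frame, then rank) can be illustrated with a toy sketch. This is not Netflix's pipeline: the two features (mean brightness and contrast as the standard deviation of pixel values), the equal weights, and the function names `frame_scores` and `rank_frames` are all assumptions chosen for illustration, and the frames are taken to be grayscale arrays with values in 0..255.

```python
import numpy as np

def frame_scores(frame):
    """Return (brightness, contrast) for a grayscale frame with values in 0..255."""
    gray = np.asarray(frame, dtype=np.float64)
    # brightness = mean pixel value, contrast = standard deviation of pixel values
    return gray.mean(), gray.std()

def rank_frames(frames, w_brightness=0.5, w_contrast=0.5):
    """Rank frame indices from best to worst by a weighted score (hypothetical weights)."""
    scores = []
    for frame in frames:
        brightness, contrast = frame_scores(frame)
        # normalise both features to 0..1: brightness by 255,
        # contrast by 127.5 (the largest possible std for values in 0..255)
        scores.append(w_brightness * brightness / 255.0 + w_contrast * contrast / 127.5)
    # argsort is ascending, so reverse it to put the highest score first
    return [int(i) for i in np.argsort(scores)[::-1]]
```

A real grader would add the contextual and composition features from the list (faces, shot angle, rule of thirds) and would learn the weights from click data rather than fixing them by hand.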

#### Q3: How does Netflix use Machine Learning to optimize thumbnail selection?

"With these findings, Netflix started creating its own thumbnails because those provided by studios weren't optimized for streaming. Netflix applies Machine Learning to select a thumbnail based on your recent watch history. They also A/B test thumbnails and the artwork is constantly changing.
As with any ML algorithm, results can be curious. In 2018, Netflix was accused of creating artwork based on race. For example, for a majority Caucasian film, Black users were served a thumbnail with Black characters, instead of the Caucasian main characters. However, Netflix said it makes artwork only on viewing history, not demographics."

### 1.2 — Possible benefits for Sandbox Network

A clickable thumbnail is important to Sandbox's target customers, namely YouTube creators. Along with the title, it constitutes a video's first impression and often determines whether a viewer clicks into the video. Given a dataset of thumbnail images, Sandbox can conduct exploratory data analyses to understand which image attributes most strongly predict click rates. In turn, Sandbox can advise digital creators on which image attributes to focus on when designing a thumbnail.

Armed with this business insight, Sandbox could later develop SaaS (software as a service) tools that automate and support the thumbnail generation process for YouTube creators:
- a grader that scores thumbnails manually created by creators on key image variables
- a thumbnail suggester that chooses an image from all the video frames
- a thumbnail editor that uses Machine Learning to edit image attributes based on key variables (to ensure visual appeal) and thumbnail history (to ensure consistency)

#### Q1: Success criteria
What is the goal of the proposed project? What are the criteria for a successful or useful outcome? Consider resources, constraints, assumptions, requirements.

The business objective is exploratory rather than concrete. Sandbox wants the team to explore what can be mined from the thumbnail dataset and to provide insights for developing new tools to support creators. Success criteria are usually defined by KPIs (Key Performance Indicators). However, since the outcome of this project will not be deployed into Sandbox's business operations, we do not define formal success criteria. The team, consisting of three computer science majors from Minerva University, has two months (Oct - Nov 2021) to work on exploratory data analyses. Resources include our (limited) prior knowledge, our Sandbox mentor Joseph, and the internet. Because of limited time, compounded by each team member's academic workload, there is no formal requirement on the output. However, the team wants to at least begin defining image attributes (Data Preparation), run basic exploratory data analyses (EDA) such as regression, correlation, and significance tests, and then provide future directions on what to delve into (Feature Engineering).

#### Q2: Statistical criteria 
Translate business objective and metric for success to data mining goals. If the business goal is to increase sales, the data mining goal might be to predict variance in sales based on advertising money spent, production costs, etc.

#### Q3: Project plan 
Produce a project plan specifying the steps to be taken throughout the rest of the project, including initial assessment of tools and techniques, durations, dependencies, etc.

1. Set up github and repositories (*Oct 30*)
2. Fill in business understanding and project plan (*Nov 16, step 1: business understanding*)
3. Load data with Pandas + OpenCV (*step 2: data understanding*)
    - define image attributes to be worked with
    <br><br>
4. Run EDA (*step 2: data understanding*)
    - descriptive statistics
    - data visualizations
    - regression + correlation + significance tests
   <br><br>
5. Feature selection (*step 3: data preparation*)
    - data cleaning
    - feature engineering (composite variable of multiple attributes)

## 2. Data Understanding

#### 2.1 — Dataset
Sandbox provided the team with a dataset (thumbnail_and_sound_analysis.json) that consists of 50 videos and their respective thumbnails. It includes descriptive data: date, video_id, channel_id, title, status (private or public), game_tag (null or not; whether the video has a game tag), is_paid (null; whether it is paid to watch), is_music_claim (null; whether it contains copyrighted music), description, thumbnail (URL), and publish date.

However, we need more data than the above: data for our predictor (independent) variables and for our response (dependent) variable. Our predictors will be thumbnail image attributes (specifics to be determined), which we obtain by feeding the image URLs to OpenCV. Our response variable will likely be a video metric that creators value, such as views, like counts, like/dislike ratios, or like/view ratios; these require YouTube API requests.
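
As a small illustration of the ratio metrics mentioned above, the sketch below derives two candidate response variables from raw counts. The input column names (`view_count`, `like_count`, `dislike_count`) match the ones created later in this notebook; the output names `like_view_ratio` and `like_share` and the helper `add_engagement_ratios` are hypothetical.

```python
import numpy as np
import pandas as pd

def add_engagement_ratios(df):
    """Return a copy of df with like/view and like/(like+dislike) ratios added."""
    out = df.copy()
    # counts may arrive as strings from the API, so coerce them to numbers first
    views = pd.to_numeric(out["view_count"], errors="coerce")
    likes = pd.to_numeric(out["like_count"], errors="coerce")
    dislikes = pd.to_numeric(out["dislike_count"], errors="coerce")
    # replace zero denominators with NaN so the ratio is NaN instead of inf
    out["like_view_ratio"] = likes / views.replace(0, np.nan)
    out["like_share"] = likes / (likes + dislikes).replace(0, np.nan)
    return out
```

Ratios like these partly control for channel size, which raw view counts do not, but they are noisy for videos with very few views.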

In [1]:
import pandas as pd # pandas for data manipulation + data analysis
import numpy as np # numpy for math
import matplotlib.pyplot as plt # pyplot for visualization
import datetime # datetime for computing times
import seaborn as sns # seaborn for visualization
import statsmodels.api as statsmodels # statsmodels for regression
from scipy import stats # scipy.stats for regression and significance tests
from sklearn import linear_model # regression model
from pylab import rcParams
rcParams['figure.figsize'] = 15, 8 # set figsize for all plots

In [2]:
# data source 1: Sandbox-provided basic info

# Load json dataset into pandas dataframe
df = pd.read_json("thumbnail_and_sound_analysis.json")

# remove unnecessary columns
df = df.drop(columns=['channel_id', 'status', 'game_tag', 'is_paid', 'is_music_claim', 
                      'description', 'title', 'published_at'])
df.head()

| | date | video_id | thumbnail |
|---|---|---|---|
| 0 | 2020-07-22 | ___8hOuoAKw | https://i.ytimg.com/vi/___8hOuoAKw/maxresdefau... |
| 1 | 2020-08-20 | ___NoMi5pp0 | https://i.ytimg.com/vi/___NoMi5pp0/sddefault.jpg |
| 2 | 2021-07-27 | __1e1QJ8-y8 | https://i.ytimg.com/vi/__1e1QJ8-y8/maxresdefau... |
| 3 | 2020-05-14 | __3fHmFbnhU | https://i.ytimg.com/vi/__3fHmFbnhU/maxresdefau... |
| 4 | 2020-05-06 | __4sPuqw6s0 | https://i.ytimg.com/vi/__4sPuqw6s0/maxresdefau... |


In [32]:
# DATA SOURCE 2: YouTube API - our dependent variable

# !pip install --upgrade google-api-python-client
# !pip install --upgrade google-auth-oauthlib google-auth-httplib2
import os
from googleapiclient.discovery import build

def get_stats(video_id):
    
    # create youtube resource object; read the API key from an environment variable
    # (YOUTUBE_API_KEY is an assumed name) instead of hardcoding it in the notebook
    youtube = build("youtube", "v3", developerKey=os.environ["YOUTUBE_API_KEY"])
    
    # get the video statistics
    request = youtube.videos().list(part='statistics', id=video_id)
    response = request.execute()
    
    # return None if request has no result, e.g. private video
    if not response['items']:
        return None
    
    statistics = response['items'][0]['statistics']
    
    # use .get() because a field can be absent from the response (e.g. hidden like counts)
    fields = ['viewCount', 'likeCount', 'dislikeCount', 'favoriteCount', 'commentCount']
    
    return tuple(statistics.get(field, np.nan) for field in fields)

def add_apidata(df):
    
    columns = ['view_count', 'like_count', 'dislike_count', 'favorite_count', 'comment_count']
    
    for index, row in df.iterrows():
        # named video_stats to avoid shadowing the scipy `stats` import above
        video_stats = get_stats(row["video_id"])
        
        # fill with NaN when the request returned nothing, e.g. private video
        if video_stats is None:
            video_stats = [np.nan] * len(columns)
        
        for column, value in zip(columns, video_stats):
            # API values arrive as strings; store them as floats
            df.loc[index, column] = float(value)

    return df

df = add_apidata(df)
df.head()

| | date | video_id | thumbnail | view_count | like_count | dislike_count | favorite_count | comment_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-07-22 | ___8hOuoAKw | https://i.ytimg.com/vi/___8hOuoAKw/maxresdefau... | NaN | NaN | NaN | NaN | NaN |
| 1 | 2020-08-20 | ___NoMi5pp0 | https://i.ytimg.com/vi/___NoMi5pp0/sddefault.jpg | 33654.0 | 1004.0 | 18.0 | 0.0 | 141.0 |
| 2 | 2021-07-27 | __1e1QJ8-y8 | https://i.ytimg.com/vi/__1e1QJ8-y8/maxresdefau... | 6645.0 | 260.0 | 10.0 | 0.0 | 60.0 |
| 3 | 2020-05-14 | __3fHmFbnhU | https://i.ytimg.com/vi/__3fHmFbnhU/maxresdefau... | 44244.0 | 396.0 | 17.0 | 0.0 | 109.0 |
| 4 | 2020-05-06 | __4sPuqw6s0 | https://i.ytimg.com/vi/__4sPuqw6s0/maxresdefau... | 111873.0 | 1250.0 | 69.0 | 0.0 | 2.0 |
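
The loop in `add_apidata` issues one API request per video. The `id` parameter of `videos().list` also accepts a comma-separated list of video IDs, so IDs can be grouped to reduce the number of requests. The sketch below only builds the grouped `id` strings and makes no network calls. The batch size of 50 is an assumption that should be checked against the current YouTube Data API documentation, and `chunked` and `build_id_params` are hypothetical helper names.

```python
def chunked(ids, size=50):
    """Yield successive lists of at most `size` IDs."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def build_id_params(video_ids, size=50):
    """Return comma-joined ID strings, one string per API request."""
    return [",".join(chunk) for chunk in chunked(video_ids, size)]

# example with three IDs from the dataset and a batch size of 2
params = build_id_params(["___NoMi5pp0", "__1e1QJ8-y8", "__3fHmFbnhU"], size=2)
# params == ["___NoMi5pp0,__1e1QJ8-y8", "__3fHmFbnhU"]
```

Each string would be passed as `id=...` in a single request. Private or deleted videos are simply missing from the returned `items`, so results need to be matched back to rows by video ID rather than by position.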


In [None]:
# DATA SOURCE 3: Thumbnail image attributes through OpenCV

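# requirements: opencv-contrib-python, numpy, imutils (urllib is part of the standard library)
import cv2
import os, sys
import numpy as np
from imutils.object_detection import non_max_suppression
import urllib.request # a bare `import urllib` does not guarantee that urllib.request is loaded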

# OpenCV images are represented as 3-D NumPy arrays of shape (height, width, 3), with channels in BGR order
def get_hsv(image):
    
    # convert image from BGR to HSV format
    img_hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    
    # take the means of the image's respective channels
    h, s, v = img_hsv[:, :, 0].mean(), img_hsv[:, :, 1].mean(), img_hsv[:, :, 2].mean()

    return round(h, 2), round(s, 2), round(v, 2)

# Uses EAST (Efficient Accurate Scene Text detector)
def east_detect(image):
    
    layerNames = ["feature_fusion/Conv_7/Sigmoid", "feature_fusion/concat_3"]
    
    orig = image.copy()
    
    if len(image.shape) == 2:
        image = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)
    
    (H, W) = image.shape[:2]
    
    # set new width and height and determine the ratio of change; w and h should be multiple of 32
    (newW, newH) = (320, 320)
    
    rW = W / float(newW)
    rH = H / float(newH)
    
    # resize the image and grab the new image dimensions
    image = cv2.resize(image, (newW, newH))
    
    (H, W) = image.shape[:2]
    
    net = cv2.dnn.readNet("frozen_east_text_detection.pb")
    
    blob = cv2.dnn.blobFromImage(image, 1.0, (W, H), (123.68, 116.78, 103.94), swapRB=True, crop=False)    
    net.setInput(blob)
    (scores, geometry) = net.forward(layerNames)
    
    (numRows, numCols) = scores.shape[2:4]
    rects = []
    confidences = []
    
    # loop over the number of rows
    for y in range(0, numRows):
        # extract the scores (probabilities), followed by the geometrical
        # data used to derive potential bounding box coordinates that surround text
        scoresData = scores[0, 0, y]
        xData0 = geometry[0, 0, y]
        xData1 = geometry[0, 1, y]
        xData2 = geometry[0, 2, y]
        xData3 = geometry[0, 3, y]
        anglesData = geometry[0, 4, y]
    
        for x in range(0, numCols):
            # if our score does not have sufficient probability, ignore it; set minimum confidence as required
            if scoresData[x] < 0.5:
                continue
                
            # compute the offset factor as our resulting feature maps will be 4x smaller than the input image
            (offsetX, offsetY) = (x * 4.0, y * 4.0)
            
            # extract the rotation angle for the prediction and then compute the sin and cosine
            angle = anglesData[x]
            cos = np.cos(angle)
            sin = np.sin(angle)
            
            # use the geometry volume to derive the width and height of the bounding box
            h = xData0[x] + xData2[x]
            w = xData1[x] + xData3[x]
            
            # compute both the starting and ending (x, y)-coordinates for the text prediction bounding box
            endX = int(offsetX + (cos * xData1[x]) + (sin * xData2[x]))
            endY = int(offsetY - (sin * xData1[x]) + (cos * xData2[x]))
            startX = int(endX - w)
            startY = int(endY - h)
            
            # add the bounding box coordinates and probability score to our respective lists
            rects.append((startX, startY, endX, endY))
            confidences.append(scoresData[x])
                        
    boxes = non_max_suppression(np.array(rects), probs=confidences)
    
    # loop over the bounding boxes
    areas = []
    for (startX, startY, endX, endY) in boxes:
        
        # scale the bounding box coordinates based on the respective ratios
        startX = int(startX * rW)
        startY = int(startY * rH)
        endX = int(endX * rW)
        endY = int(endY * rH)
        
        # draw the bounding box on the image
        cv2.rectangle(orig, (startX, startY), (endX, endY), (0, 255, 0), 2)
        
        # get area of bounding boxes and append
        area = abs(startX-endX)*abs(startY-endY)
        areas.append(area)
    
    text_area = sum(areas)
    
    return text_area

def get_text_cover(image):
    
    text_area = east_detect(image)
    h, w = image.shape[:2]
    img_area = h*w
    percentage_cover = text_area/img_area
    
    return round(percentage_cover, 2)

# Uses the Haar Cascades method, which is not very accurate; should use YOLOv5 if possible
def get_faces(image):
    
    face_cascade = cv2.CascadeClassifier('haarcascade_frontalface.xml')
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    
    return len(faces)

def url_to_image(url):
    
    # download the image
    resp = urllib.request.urlopen(url)
    # convert image to NumPy array
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    # read NumPy array into OpenCV
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    
    return image

def add_imgdata(df):
    
    for index, row in df.iterrows():
        
        try:
            image = url_to_image(row["thumbnail"])
            h, s, v = get_hsv(image)
            text_cover = get_text_cover(image)
            n_faces = get_faces(image)
                    
        # on any failure (e.g. download or decoding error), fill with np.nan rather than Python None
        except Exception:
            h, s, v = np.nan, np.nan, np.nan
            text_cover = np.nan
            n_faces = np.nan
        
        df.loc[index, 'hue'] = h
        df.loc[index, 'saturation'] = s
        df.loc[index, 'value'] = v
        df.loc[index, 'text_cover'] = text_cover
        df.loc[index, 'n_faces'] = n_faces

    return df

df = add_imgdata(df)
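
One caveat about `text_cover`: `east_detect` sums the areas of the individual bounding boxes, so any region covered by two overlapping boxes is counted twice. A possible alternative, sketched below under the assumption that boxes are `(startX, startY, endX, endY)` tuples in pixel coordinates, is to paint the boxes onto a boolean mask and count the covered pixels once. `union_area` is a hypothetical helper, not part of OpenCV or imutils.

```python
import numpy as np

def union_area(boxes, height, width):
    """Count pixels covered by at least one box, clipping boxes to the image bounds."""
    mask = np.zeros((height, width), dtype=bool)
    for (start_x, start_y, end_x, end_y) in boxes:
        # clip coordinates to the image so out-of-bounds boxes do not wrap around
        x0, x1 = np.clip([start_x, end_x], 0, width)
        y0, y1 = np.clip([start_y, end_y], 0, height)
        # min/max guards against boxes whose start and end are swapped
        mask[min(y0, y1):max(y0, y1), min(x0, x1):max(x0, x1)] = True
    return int(mask.sum())
```

Dividing the result by `height * width` gives a cover fraction that cannot exceed 1, unlike the summed-area version.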

In [34]:
df

| | date | video_id | thumbnail | view_count | like_count | dislike_count | favorite_count | comment_count | hue | saturation | value | text_cover | n_faces |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-07-22 | ___8hOuoAKw | https://i.ytimg.com/vi/___8hOuoAKw/maxresdefau... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2020-08-20 | ___NoMi5pp0 | https://i.ytimg.com/vi/___NoMi5pp0/sddefault.jpg | 33654.0 | 1004.0 | 18.0 | 0.0 | 141.0 | 52.15 | 45.87 | 98.22 | 0.0 | 0.0 |
| 2 | 2021-07-27 | __1e1QJ8-y8 | https://i.ytimg.com/vi/__1e1QJ8-y8/maxresdefau... | 6645.0 | 260.0 | 10.0 | 0.0 | 60.0 | 89.38 | 170.28 | 67.16 | 0.03 | 0.0 |
| 3 | 2020-05-14 | __3fHmFbnhU | https://i.ytimg.com/vi/__3fHmFbnhU/maxresdefau... | 44244.0 | 396.0 | 17.0 | 0.0 | 109.0 | 56.6 | 84.42 | 131.32 | 0.12 | 0.0 |
| 4 | 2020-05-06 | __4sPuqw6s0 | https://i.ytimg.com/vi/__4sPuqw6s0/maxresdefau... | 111873.0 | 1250.0 | 69.0 | 0.0 | 2.0 | 59.55 | 137.95 | 136.34 | 0.16 | 0.0 |
| 5 | 2021-02-18 | __566nRGAt4 | https://i.ytimg.com/vi/__566nRGAt4/maxresdefau... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 2021-04-11 | __5V-i5MfRw | https://i.ytimg.com/vi/__5V-i5MfRw/maxresdefau... | 274835.0 | 8994.0 | 174.0 | 0.0 | 561.0 | 30.75 | 48.92 | 185.62 | 0.02 | 0.0 |
| 7 | 2021-07-31 | __6NIUmDAMA | https://i.ytimg.com/vi/__6NIUmDAMA/maxresdefau... | 48019.0 | 729.0 | 7.0 | 0.0 | 206.0 | 90.2 | 58.65 | 156.28 | 0.32 | 1.0 |
| 8 | 2020-05-14 | __7ba5-FNAc | https://i.ytimg.com/vi/__7ba5-FNAc/maxresdefau... | 108761.0 | 1192.0 | 54.0 | 0.0 | 480.0 | 78.09 | 141.49 | 151.88 | 0.2 | 0.0 |
| 9 | 2020-06-09 | __arsuxE_P8 | https://i.ytimg.com/vi/__arsuxE_P8/maxresdefau... | 28586.0 | 222.0 | 12.0 | 0.0 | 59.0 | 50.72 | 86.79 | 125.94 | 0.04 | 0.0 |


<b>December 9, 2021</b>

Today's symposium presentation marks the end of the official civic project partnership between Minerva University and Sandbox for the Fall semester of 2021. The team is still in step 2 (Data Understanding) of the data science project life cycle, collecting data. So far, video statistics have been collected through the YouTube API, image color attributes through OpenCV, and text and face detection data using the EAST algorithm and a Haar Cascade classifier.
<br><br>
The quality of any data insight depends heavily on the quality of the data collected. Anyone picking up where the team left off should focus on refining the accuracy of the data, especially by 1) correcting the bias introduced by taking arithmetic means of hue, saturation, and value (hue is an angle, so its plain mean can be misleading), 2) fine-tuning bounding boxes for text detection, and 3) increasing the accuracy of face detection with a stronger detector such as YOLOv5. This person should also gather more data, including the number of colors and the color distribution.
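
One way to address point 1 for hue is a circular mean, sketched below. It assumes the 8-bit OpenCV convention in which hue is stored as 0..179 with each unit equal to 2 degrees; `circular_hue_mean` is a hypothetical helper. With an arithmetic mean, two reddish hues of 5 and 175 average to 90 (a green-cyan), while the circular mean stays at the red end of the scale.

```python
import numpy as np

def circular_hue_mean(hue_channel):
    """Circular mean of an OpenCV 8-bit hue channel (values 0..179, 2 degrees per unit)."""
    # convert hue units to angles in radians
    angles = np.deg2rad(np.asarray(hue_channel, dtype=np.float64) * 2.0)
    # average the unit vectors, then take the angle of the resulting vector
    mean_angle = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
    # map the angle back to 0..360 degrees, then to OpenCV hue units
    return float(np.rad2deg(mean_angle) % 360.0 / 2.0)
```

This could replace `img_hsv[:, :, 0].mean()` in `get_hsv`. Even so, a single mean hue remains a coarse summary for multi-colored thumbnails, which is one reason a color distribution is worth collecting as well.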
<br><br>
Once data accuracy and data diversity are secured, this person should first conduct Exploratory Data Analysis (correlation, regression, significance tests, etc.) and then continue with the Data Science Project Life Cycle. Good luck.
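
As a possible starting point for that EDA, the sketch below ranks image attributes by their Spearman rank correlation with a chosen response variable. Spearman is suggested here as an assumption, because view counts are heavily skewed and a rank-based measure is less sensitive to outliers than Pearson. The column names follow the dataframe built in this notebook; `spearman_with_target` is a hypothetical helper.

```python
import pandas as pd

def spearman_with_target(df, target, features):
    """Spearman correlation of each feature with the target, strongest first."""
    cols = [target] + list(features)
    # coerce to numeric and drop rows with missing values (e.g. private videos)
    numeric = df[cols].apply(pd.to_numeric, errors="coerce").dropna()
    corr = numeric.corr(method="spearman")
    # keep only the target column, drop its self-correlation, sort by magnitude
    return corr[target].drop(target).sort_values(key=lambda s: s.abs(), ascending=False)

# usage with this notebook's dataframe:
# spearman_with_target(df, "view_count", ["hue", "saturation", "value", "text_cover", "n_faces"])
```

With only a handful of usable rows, any coefficient from this dataset would be very unstable, so the numbers become meaningful only after more videos have been collected.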