# Ranking analysis for vtalks.net

## Table of contents:

* [Introduction](#introduction)
    * [Setup & Configuration](#setup-and-configuration)
    * [Load the Data Set](#load-the-data-set)
    
* [Youtube Statistics Analysis](#youtube-statistics-analysis)
    * [Youtube Views](#youtube-views)
    * [Youtube Likes](#youtube-likes)
    * [Youtube Dislikes](#youtube-dislikes)
    * [Youtube Favorites](#youtube-favorites)
    
* [Statistics Analysis](#statistics-analysis)
    * [Views](#views)
    * [Likes](#likes)
    * [Dislikes](#dislikes)
    * [Favorites](#favorites)
    
* [Youtube Statistics Histograms](#youtube-statistics-histograms)
    * [Youtube Views Histogram](#youtube-views-histogram)
    * [Youtube Likes Histogram](#youtube-likes-histogram)
    * [Youtube Dislikes Histogram](#youtube-dislikes-histogram)
    * [Youtube Favorites Histogram](#youtube-favorites-histogram)
    
* [Statistics Histograms](#statistics-histograms)
    * [Views Histogram](#views-histogram)
    * [Likes Histogram](#likes-histogram)
    * [Dislikes Histogram](#dislikes-histogram)
    * [Favorites Histogram](#favorites-histogram)

## Introduction <a class="anchor" id="introduction"></a>

This Jupyter notebook describes an exploratory data analysis of a data set of talks published on the [vtalks.net](http://www.vtalks.net) website.

We use numpy and pandas to load and analyze the data set, and the matplotlib and seaborn Python libraries to
plot the results.

In [3]:
!pwd

/Users/raul/Projects/vtalks/jupyter


### Setup & Configuration <a class="anchor" id="setup-and-configuration"></a>

In [4]:
import numpy as np
import pandas as pd
import pandas_profiling as pp
import matplotlib.pyplot as plt
import seaborn

Now we configure matplotlib to ensure we have some pretty plots :)

In [5]:
%matplotlib inline

seaborn.set()
plt.rc('figure', figsize=(16,8))
plt.style.use('bmh')

plt.style.available

['seaborn-dark',
 'seaborn-darkgrid',
 'seaborn-ticks',
 'fivethirtyeight',
 'seaborn-whitegrid',
 'classic',
 '_classic_test',
 'fast',
 'seaborn-talk',
 'seaborn-dark-palette',
 'seaborn-bright',
 'seaborn-pastel',
 'grayscale',
 'seaborn-notebook',
 'ggplot',
 'seaborn-colorblind',
 'seaborn-muted',
 'seaborn',
 'Solarize_Light2',
 'seaborn-paper',
 'bmh',
 'seaborn-white',
 'dark_background',
 'seaborn-poster',
 'seaborn-deep']

### Load the Data Set <a class="anchor" id="load-the-data-set"></a>

Finally, we load our data set. Note that several data sets are available.

The first is a general data set with all the information available from the start (around mid-2010) until now. The others contain the same data split by year.

In [540]:
# data_source = "../.dataset/vtalks_dataset_2018.csv"
# data_source = "../.dataset/vtalks_dataset_2017.csv"
# data_source = "../.dataset/vtalks_dataset_2016.csv"
# data_source = "../.dataset/vtalks_dataset_2015.csv"
# data_source = "../.dataset/vtalks_dataset_2014.csv"
# data_source = "../.dataset/vtalks_dataset_2013.csv"
# data_source = "../.dataset/vtalks_dataset_2012.csv"
# data_source = "../.dataset/vtalks_dataset_2011.csv"
# data_source = "../.dataset/vtalks_dataset_2010.csv"
data_source = "../.dataset/vtalks_dataset_all.csv"

data_set = pd.read_csv(
    data_source,
    parse_dates=[1],
    dtype={
        'id': int,
        'youtube_view_count': int, 
        'youtube_like_count': int,
        'youtube_dislike_count': int,
        'youtube_favorite_count': int,
        'view_count': int, 
        'like_count': int,
        'dislike_count': int,
        'favorite_count': int
    })
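
As a rough illustration of how the per-year files relate to the full data set, the sketch below loads a small in-memory CSV with the same `parse_dates` and `dtype` approach and then splits it by the year of `created`. The sample rows and the `per_year` dictionary are hypothetical and only assume the column layout shown in this notebook; this is not necessarily how the per-year CSV files were produced.

```python
import io

import pandas as pd

# Hypothetical three-row sample using a subset of the vtalks CSV columns
sample_csv = io.StringIO(
    "id,created,youtube_view_count,youtube_like_count\n"
    "1,2016-03-01 10:00:00,120,3\n"
    "2,2017-09-19 03:24:39,4017,0\n"
    "3,2017-11-02 08:15:00,980,12\n"
)

sample = pd.read_csv(
    sample_csv,
    parse_dates=['created'],
    dtype={'id': int, 'youtube_view_count': int, 'youtube_like_count': int},
)

# One data frame per year, keyed by the year of the 'created' timestamp
per_year = {
    year: frame
    for year, frame in sample.groupby(sample.created.dt.year)
}

for year, frame in per_year.items():
    print(year, len(frame))
```

Grouping on `created.dt.year` keeps a single source file as the reference and derives the yearly subsets on demand.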

In [523]:
data_set.dtypes

id                                 int64
created                   datetime64[ns]
youtube_view_count                 int64
youtube_like_count                 int64
youtube_dislike_count              int64
youtube_favorite_count             int64
view_count                         int64
like_count                         int64
dislike_count                      int64
favorite_count                     int64
dtype: object

In [483]:
data_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20330 entries, 0 to 20329
Data columns (total 10 columns):
id                        20330 non-null int64
created                   20330 non-null datetime64[ns]
youtube_view_count        20330 non-null int64
youtube_like_count        20330 non-null int64
youtube_dislike_count     20330 non-null int64
youtube_favorite_count    20330 non-null int64
view_count                20330 non-null int64
like_count                20330 non-null int64
dislike_count             20330 non-null int64
favorite_count            20330 non-null int64
dtypes: datetime64[ns](1), int64(9)
memory usage: 1.6 MB


In [478]:
data_set.head()

| | id | created | youtube_view_count | youtube_like_count | youtube_dislike_count | youtube_favorite_count | view_count | like_count | dislike_count | favorite_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017-09-19 03:24:39 | 4017.0 | 0.0 | 0.0 | 0.0 | 42 | 0 | 0 | 0 |
| 1 | 2 | 2017-09-19 03:24:26 | 2373.0 | 0.0 | 0.0 | 0.0 | 75 | 0 | 0 | 0 |
| 2 | 3 | 2017-09-19 03:24:27 | 2405.0 | 0.0 | 0.0 | 0.0 | 30 | 0 | 0 | 0 |
| 3 | 4 | 2017-09-19 03:24:28 | 2430.0 | 0.0 | 0.0 | 0.0 | 48 | 0 | 0 | 0 |
| 4 | 5 | 2017-09-19 03:24:28 | 3432.0 | 0.0 | 0.0 | 0.0 | 41 | 0 | 0 | 0 |


In [10]:
data_set.describe()

| | id | youtube_view_count | youtube_like_count | youtube_dislike_count | youtube_favorite_count | view_count | like_count | dislike_count | favorite_count |
|---|---|---|---|---|---|---|---|---|---|
| count | 20330.0 | 20330.0 | 20330.0 | 20330.0 | 20330.0 | 20330.0 | 20330.0 | 20330.0 | 20330.0 |
| mean | 10187.76183 | 3555.019 | 38.988146 | 2.021938 | 0.0 | 11.636006 | 0.001426 | 0.00059 | 0.000639 |
| std | 5873.905444 | 39499.62 | 387.537066 | 42.574971 | 0.0 | 11.90473 | 0.059081 | 0.051536 | 0.02528 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 5104.25 | 192.0 | 1.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 |
| 50% | 10191.5 | 555.0 | 5.0 | 0.0 | 0.0 | 9.0 | 0.0 | 0.0 | 0.0 |
| 75% | 15273.75 | 1798.75 | 18.0 | 1.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.0 |
| max | 20356.0 | 3966597.0 | 35694.0 | 4271.0 | 0.0 | 227.0 | 7.0 | 7.0 | 1.0 |


In [11]:
pp.ProfileReport(data_set)

0,1
Number of variables,10
Number of observations,20330
Total Missing (%),0.0%
Total size in memory,1.6 MiB
Average record size in memory,80.0 B

0,1
Numeric,5
Categorical,0
Boolean,1
Date,1
Text (Unique),0
Rejected,3
Unsupported,0

Variable: created (Date)

0,1
Distinct count,17614
Unique (%),86.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Minimum,2010-06-09 21:52:45
Maximum,2018-07-16 16:41:20

Variable: dislike_count (Numeric)

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.00059026
Minimum,0
Maximum,7
Zeros (%),100.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,0
Maximum,7
Range,7
Interquartile range,0

0,1
Standard deviation,0.051536
Coef of variation,87.311
Kurtosis,16774
Mean,0.00059026
MAD,0.0011802
Skewness,125.04
Sum,12
Variance,0.002656
Memory size,158.9 KiB

Value,Count,Frequency (%),Unnamed: 3
0,20324,100.0%,
1,5,0.0%,
7,1,0.0%,

Variable: favorite_count (Boolean)

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.00063945

0,1
0,20317
1,13

Value,Count,Frequency (%),Unnamed: 3
0,20317,99.9%,
1,13,0.1%,

Variable: id (Numeric)

0,1
Distinct count,20330
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,10188
Minimum,1
Maximum,20356
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,1017.5
Q1,5104.2
Median,10192.0
Q3,15274.0
95-th percentile,19340.0
Maximum,20356.0
Range,20355.0
Interquartile range,10170.0

0,1
Standard deviation,5873.9
Coef of variation,0.57656
Kurtosis,-1.1988
Mean,10188
MAD,5086.2
Skewness,-0.0017771
Sum,207117198
Variance,34503000
Memory size,158.9 KiB

Value,Count,Frequency (%),Unnamed: 3
2047,1,0.0%,
4775,1,0.0%,
10912,1,0.0%,
8865,1,0.0%,
15010,1,0.0%,
12963,1,0.0%,
2724,1,0.0%,
677,1,0.0%,
6822,1,0.0%,
19116,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1,1,0.0%,
2,1,0.0%,
3,1,0.0%,
4,1,0.0%,
5,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
20352,1,0.0%,
20353,1,0.0%,
20354,1,0.0%,
20355,1,0.0%,
20356,1,0.0%,

Variable: like_count (Numeric)

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.0014265
Minimum,0
Maximum,7
Zeros (%),99.9%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,0
Maximum,7
Range,7
Interquartile range,0

0,1
Standard deviation,0.059081
Coef of variation,41.418
Kurtosis,9774.2
Mean,0.0014265
MAD,0.0028497
Skewness,87.001
Sum,29
Variance,0.0034905
Memory size,158.9 KiB

Value,Count,Frequency (%),Unnamed: 3
0,20307,99.9%,
1,22,0.1%,
7,1,0.0%,

Variable: view_count (Numeric)

0,1
Distinct count,118
Unique (%),0.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,11.636
Minimum,0
Maximum,227
Zeros (%),11.7%

0,1
Minimum,0
5-th percentile,0
Q1,5
Median,9
Q3,13
95-th percentile,35
Maximum,227
Range,227
Interquartile range,8

0,1
Standard deviation,11.905
Coef of variation,1.0231
Kurtosis,24.018
Mean,11.636
MAD,7.4364
Skewness,3.3613
Sum,236560
Variance,141.72
Memory size,158.9 KiB

Value,Count,Frequency (%),Unnamed: 3
0,2384,11.7%,
9,1586,7.8%,
10,1517,7.5%,
8,1473,7.2%,
11,1472,7.2%,
7,1249,6.1%,
12,1179,5.8%,
3,934,4.6%,
13,932,4.6%,
2,909,4.5%,

Value,Count,Frequency (%),Unnamed: 3
0,2384,11.7%,
1,196,1.0%,
2,909,4.5%,
3,934,4.6%,
4,400,2.0%,

Value,Count,Frequency (%),Unnamed: 3
144,1,0.0%,
148,2,0.0%,
162,1,0.0%,
214,1,0.0%,
227,1,0.0%,

Variable: youtube_dislike_count (Rejected, highly correlated)

0,1
Correlation,0.92353

Variable: youtube_favorite_count (Rejected, constant)

0,1
Constant value,0

Variable: youtube_like_count (Rejected, highly correlated)

0,1
Correlation,0.95825

Variable: youtube_view_count (Numeric)

0,1
Distinct count,5635
Unique (%),27.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3555
Minimum,0
Maximum,3966597
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,48.0
Q1,192.0
Median,555.0
Q3,1798.8
95-th percentile,11583.0
Maximum,3966597.0
Range,3966597.0
Interquartile range,1606.8

0,1
Standard deviation,39500
Coef of variation,11.111
Kurtosis,6666.6
Mean,3555
MAD,4798.5
Skewness,75.099
Sum,72273528
Variance,1560200000
Memory size,158.9 KiB

Value,Count,Frequency (%),Unnamed: 3
50,55,0.3%,
48,55,0.3%,
51,48,0.2%,
55,47,0.2%,
113,42,0.2%,
57,42,0.2%,
71,41,0.2%,
124,40,0.2%,
87,40,0.2%,
109,38,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0,9,0.0%,
1,3,0.0%,
2,5,0.0%,
3,5,0.0%,
4,4,0.0%,

Value,Count,Frequency (%),Unnamed: 3
708440,1,0.0%,
759192,1,0.0%,
1743224,1,0.0%,
2929460,1,0.0%,
3966597,1,0.0%,

Sample rows:

Unnamed: 0,id,created,youtube_view_count,youtube_like_count,youtube_dislike_count,youtube_favorite_count,view_count,like_count,dislike_count,favorite_count
0,1,2017-09-19 03:24:39,4017,0,0,0,42,0,0,0
1,2,2017-09-19 03:24:26,2373,0,0,0,75,0,0,0
2,3,2017-09-19 03:24:27,2405,0,0,0,30,0,0,0
3,4,2017-09-19 03:24:28,2430,0,0,0,48,0,0,0
4,5,2017-09-19 03:24:28,3432,0,0,0,41,0,0,0


## Youtube Ranking Analysis <a class="anchor" id="youtube-statistics-analysis"></a>

### Youtube Wilson Score <a class="anchor" id="youtube-views"></a>

In [588]:
import math

def wilson_score(ups, downs, z=1.96):
    """ Lower bound of the Wilson score interval, used for sorting
    (popularized by reddit's best comment system)
    http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
    """
    # Compare by value: `ups is 0` tests object identity, not equality
    if ups == 0:
        return 0.0
    n = ups + downs
    p = ups / n
    sqrtexpr = (p * (1 - p) + z * z / (4 * n)) / n
    res = (p + z * z / (2 * n) - z * math.sqrt(sqrtexpr)) / (1 + z * z / n)
    return res

data_set = pd.DataFrame({
    'created': data_set.created,
    'youtube_like_count': data_set.youtube_like_count,
    'youtube_dislike_count': data_set.youtube_dislike_count,
})


ranking_data_set = pd.DataFrame({
    'created': data_set.created,
    # Plain float64 zeros; a structured dtype here would store 1-tuples such as (0.0,)
    'wilson_score': np.zeros(len(data_set.created), dtype=np.float64),
})

ranking_data_set

#
#zero_floats

# ranking_data_set.fillna(0)

# ranking_data_set['wilson_score'].astype(float)
# ranking_data_set['wilson_score'] = wilson_score(data_set.youtube_like_count, data_set.youtube_dislike_count)

# final_data_set = data_set.merge(ranking_data_set)

# final_data_set

Unnamed: 0,created,wilson_score
0,2017-09-19 03:24:39,0.0
1,2017-09-19 03:24:26,0.0
2,2017-09-19 03:24:27,0.0
3,2017-09-19 03:24:28,0.0
4,2017-09-19 03:24:28,0.0
5,2017-09-19 03:24:29,0.0
6,2017-09-19 03:24:29,0.0
7,2017-09-19 03:24:30,0.0
8,2017-09-19 03:24:31,0.0
9,2017-09-19 03:24:31,0.0

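The commented-out lines above suggest that the score is meant to be computed for every talk from its like and dislike counts. Because `math.sqrt` only accepts scalars, calling `wilson_score` on whole columns would not work as written. One possible approach, sketched below, is a vectorized variant built on numpy. The function name `wilson_lower_bound` and the sample vote counts are hypothetical, not taken from the vtalks data set; the formula is the same lower bound as in `wilson_score`, with talks that have no votes scored as 0.0.

```python
import numpy as np
import pandas as pd


def wilson_lower_bound(ups, downs, z=1.96):
    """Vectorized lower bound of the Wilson score interval."""
    ups = np.asarray(ups, dtype=np.float64)
    downs = np.asarray(downs, dtype=np.float64)
    n = ups + downs
    # Use 1 as a placeholder where there are no votes, to avoid dividing by zero
    safe_n = np.where(n > 0, n, 1.0)
    p = ups / safe_n
    z2 = z * z
    centre = p + z2 / (2 * safe_n)
    margin = z * np.sqrt((p * (1 - p) + z2 / (4 * safe_n)) / safe_n)
    score = (centre - margin) / (1 + z2 / safe_n)
    # Rows without any votes get 0.0, matching the scalar function
    return np.where(n > 0, score, 0.0)


# Hypothetical like/dislike counts, for illustration only
votes = pd.DataFrame({
    'youtube_like_count': [0, 1, 10, 100, 90],
    'youtube_dislike_count': [0, 0, 0, 0, 10],
})
votes['wilson_score'] = wilson_lower_bound(
    votes.youtube_like_count, votes.youtube_dislike_count)

# Highest lower bound first
ranked = votes.sort_values('wilson_score', ascending=False)
print(ranked)
```

With these sample counts, a talk with 90 likes and 10 dislikes ranks above one with 10 likes and no dislikes, because the larger number of votes narrows the interval.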