# Concept
Predict how much revenue a given video idea will produce for Tiger Fitness. We also want to predict views, average percent viewed, and change in subscribers.

# Features (input 1-7, output 8-14)
1. video duration (bucketed, analytics based)
2. title polarity
3. title length (characters)
4. n links in description
5. category of video (vlog, training, nutrition, personal story, supplements and steroids) 
6. product price range (bucketed, intuition based)
7. product type (apparel, supplements, food, coaching)
8. sales made
9. views
10. change in subscribers
11. average percent viewed
12. likes
13. impressions
14. impressions click-through rate

# Models
1. revenue prediction (predict sales volume, then multiply by the price range to give a revenue range)
2. views prediction
3. change in subscribers prediction
4. average percent viewed prediction
5. likes prediction
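
The revenue model in item 1 turns a predicted sales volume into a revenue range. A minimal sketch of that arithmetic follows. The bucket names and dollar bounds are hypothetical placeholders, not values from the Tiger Fitness catalogue, and `revenue_range` is an illustrative helper rather than part of the notebook.

```python
# Hypothetical price buckets as (low, high) in dollars; real bounds would
# come from the product catalogue.
PRICE_BUCKETS = {
    "low": (0.0, 25.0),
    "mid": (25.0, 75.0),
    "high": (75.0, 200.0),
}


def revenue_range(predicted_sales, bucket):
    """Turn a predicted sales volume into a (min, max) revenue range."""
    low, high = PRICE_BUCKETS[bucket]
    return predicted_sales * low, predicted_sales * high


print(revenue_range(40, "mid"))  # (1000.0, 3000.0)
```

The width of the resulting range depends entirely on how wide the price bucket is, so narrower buckets give a more useful revenue estimate.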

# Data sources

Numbers refer to the feature list above.

manual entry or unsupervised learning: 5

YouTube Analytics report - video analytics overview: 2, 3, 9, 10, 11, 12

YouTube Analytics report - video analytics engagement: 8, 13, 14

## Import statements

In [42]:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
from textblob import TextBlob

# In-memory SQLite engine
engine = create_engine('sqlite://', echo=False)

## Merging the two reports

In [7]:
df1 = pd.read_csv(r'C:\Users\aacjp\OneDrive\Desktop\Table0.csv').iloc[1:].set_index('Video')
df2 = pd.read_csv(r'C:\Users\aacjp\OneDrive\Desktop\Table data.csv').iloc[1:]

# Add impressions and impressions click-through rate to each video in df2,
# looked up by video ID in df1 (which is indexed by 'Video').
# A video ID missing from df1 yields NaN instead of raising a KeyError.
for col in ['Impressions', 'Impressions click-through rate (%)']:
    df2[col] = df2['Video'].map(df1[col])
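
The lookup above could also be written as a join, which makes it easier to spot videos that appear in only one report. The sketch below uses small made-up frames rather than the real exports; the names `overview` and `engagement` and the video IDs are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the two reports; the real column names come from the exports.
engagement = pd.DataFrame({
    "Video": ["a1", "b2", "c3"],
    "Impressions": [810, 1251, 188],
    "Impressions click-through rate (%)": [8.52, 8.63, 11.7],
})
overview = pd.DataFrame({
    "Video": ["a1", "b2", "d4"],
    "Views": [182, 183, 52],
})

# A left join keeps every row of the overview report. `validate` raises if a
# video ID is duplicated in either frame, and `indicator=True` adds a `_merge`
# column showing whether a match was found in the other report.
merged = overview.merge(engagement, on="Video", how="left",
                        validate="one_to_one", indicator=True)

# Videos present in the overview report but absent from the engagement report.
missing = merged.loc[merged["_merge"] == "left_only", "Video"].tolist()
print(missing)  # ['d4']
```

Rows listed in `missing` have NaN impressions and would need to be dropped or filled before training.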

## Adding title polarity and length

In [13]:
# Title length in characters and TextBlob sentiment polarity (-1.0 to 1.0)
df2['title length'] = df2['Video title'].str.len()
df2['title polarity'] = df2['Video title'].apply(lambda title: TextBlob(title).sentiment.polarity)

In [15]:
df2.head()

| | Video | Video title | Video publish time | Likes | Subscribers gained | Views | Average view duration | Average percentage viewed (%) | Impressions | Impressions click-through rate (%) | title length | title polarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ewdr2H2qmVY | Ux of 4 (prod. CashMoneyAP) | Oct 7, 2018 | 8 | 3 | 182 | 0:01:13 | 31.63 | 810 | 8.52 | 27 | 0.0 |
| 2 | ApXSnKR8lQU | Retarded genius (prod. Letzer) | Nov 4, 2018 | 7 | 1 | 183 | 0:01:04 | 33.02 | 1251 | 8.63 | 30 | -0.8 |
| 3 | Hde5iY6CG8s | I tried to move to LA!!!!! | Oct 1, 2019 | 4 | 0 | 52 | 0:04:39 | 35.42 | 188 | 11.7 | 26 | 0.0 |
| 4 | QVYOgNfa1Mk | Your first Machine Learning Model \| Beginner P... | Jun 13, 2021 | 2 | 3 | 82 | 0:02:20 | 8.94 | 62 | 4.84 | 77 | 0.25 |
| 5 | f-4wjPBSROo | Smart kids = hamster brain ?? | Jun 27, 2019 | 2 | 0 | 14 | 0:01:53 | 30.67 | 165 | 6.06 | 29 | 0.214286 |


In [24]:
def toSeconds(duration):
    timesplits = duration.split(':')
    return int(timesplits[0])*3600 + int(timesplits[1])*60 + int(timesplits[2])
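
`toSeconds` expects exactly three colon-separated fields. As a hedged alternative, the sketch below accepts any number of fields, in case a duration ever arrives as `M:SS`; whether the export can actually produce that shorter form is an assumption. The name `to_seconds` is illustrative.

```python
def to_seconds(duration):
    """Convert 'H:MM:SS' or 'M:SS' to a whole number of seconds."""
    seconds = 0
    # Each field is worth 60 times the field to its right.
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


print(to_seconds("0:01:13"))  # 73
print(to_seconds("4:39"))     # 279
```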

In [28]:
df2['Average view duration'] = df2['Average view duration'].apply(toSeconds)  # convert H:MM:SS strings to integer seconds
df2.head()

| | Video | Video title | Video publish time | Likes | Subscribers gained | Views | Average view duration | Average percentage viewed (%) | Impressions | Impressions click-through rate (%) | title length | title polarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ewdr2H2qmVY | Ux of 4 (prod. CashMoneyAP) | Oct 7, 2018 | 8 | 3 | 182 | 73 | 31.63 | 810 | 8.52 | 27 | 0.0 |
| 2 | ApXSnKR8lQU | Retarded genius (prod. Letzer) | Nov 4, 2018 | 7 | 1 | 183 | 64 | 33.02 | 1251 | 8.63 | 30 | -0.8 |
| 3 | Hde5iY6CG8s | I tried to move to LA!!!!! | Oct 1, 2019 | 4 | 0 | 52 | 279 | 35.42 | 188 | 11.7 | 26 | 0.0 |
| 4 | QVYOgNfa1Mk | Your first Machine Learning Model \| Beginner P... | Jun 13, 2021 | 2 | 3 | 82 | 140 | 8.94 | 62 | 4.84 | 77 | 0.25 |
| 5 | f-4wjPBSROo | Smart kids = hamster brain ?? | Jun 27, 2019 | 2 | 0 | 14 | 113 | 30.67 | 165 | 6.06 | 29 | 0.214286 |


## Calculating total duration

In [41]:
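def getVideoLength(avd, apv):
    # Total length (s) = average view duration (s) / average percentage viewed * 100.
    # Vectorized so it does not rely on the index labels starting at 1.
    return (avd / apv * 100).astype(int)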

df2['Video duration'] = getVideoLength(df2['Average view duration'], df2['Average percentage viewed (%)'])
df2.head()

| | Video | Video title | Video publish time | Likes | Subscribers gained | Views | Average view duration | Average percentage viewed (%) | Impressions | Impressions click-through rate (%) | title length | title polarity | Video duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ewdr2H2qmVY | Ux of 4 (prod. CashMoneyAP) | Oct 7, 2018 | 8 | 3 | 182 | 73 | 31.63 | 810 | 8.52 | 27 | 0.0 | 230 |
| 2 | ApXSnKR8lQU | Retarded genius (prod. Letzer) | Nov 4, 2018 | 7 | 1 | 183 | 64 | 33.02 | 1251 | 8.63 | 30 | -0.8 | 193 |
| 3 | Hde5iY6CG8s | I tried to move to LA!!!!! | Oct 1, 2019 | 4 | 0 | 52 | 279 | 35.42 | 188 | 11.7 | 26 | 0.0 | 787 |
| 4 | QVYOgNfa1Mk | Your first Machine Learning Model \| Beginner P... | Jun 13, 2021 | 2 | 3 | 82 | 140 | 8.94 | 62 | 4.84 | 77 | 0.25 | 1565 |
| 5 | f-4wjPBSROo | Smart kids = hamster brain ?? | Jun 27, 2019 | 2 | 0 | 14 | 113 | 30.67 | 165 | 6.06 | 29 | 0.214286 | 368 |
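
The `Video duration` column is an estimate: total length equals average view duration divided by the fraction of the video watched. A short worked check against the rows shown above follows. The helper name `video_length` is illustrative and not part of the notebook.

```python
def video_length(avg_view_seconds, avg_percent_viewed):
    """Estimate total video length in seconds, truncated to an int."""
    return int(avg_view_seconds / avg_percent_viewed * 100)


print(video_length(73, 31.63))  # 230
print(video_length(140, 8.94))  # 1565
```

Because the percentage in the report is rounded to two decimals and `int` truncates, the estimate can be off by a second or so (the second example is 1565.99 before truncation). That is harmless if the value is only used for bucketing.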


## Bucketing total duration

In [None]:
def getPercentile(n, sample):
    sample = sorted(list(sample))
    n_lesser = 0
    ...

ip = sample[round(n * p) ]
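
The cell above is unfinished. One possible way to bucket durations by percentile is sketched here, using only the standard library. It is not a completion of `getPercentile`: the helper names, the "strictly less than" definition of percentile rank, and the choice of four buckets are all assumptions.

```python
from bisect import bisect_left


def percentile_rank(n, sample):
    """Percentage of values in `sample` that are strictly less than `n`."""
    ordered = sorted(sample)
    return 100 * bisect_left(ordered, n) / len(ordered)


def duration_bucket(n, sample, n_buckets=4):
    """Map a duration to a bucket index 0..n_buckets-1 by percentile rank."""
    rank = percentile_rank(n, sample)
    return min(int(rank * n_buckets / 100), n_buckets - 1)


# The five estimated durations (seconds) from the table above.
durations = [230, 193, 787, 1565, 368]
print([duration_bucket(d, durations) for d in durations])  # [0, 0, 2, 3, 1]
```

With only five videos the buckets are coarse; on a full channel history, `pandas.qcut` is another option for quantile-based binning.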