# Greetings, Makrwatch! I'm Wes Clark, the Insight Fellow who worked on the demographic prediction project!

# Welcome to my code notebook for the Makrwatch project! I'm going to split this up into relevant sections so you can mix and match code as you see fit.

## I've done my best to document anything, but if you have any questions, I'm reachable at wesleycclark@gmail.com and I'll also be available via the Makrwatch Slack (wes_clark)

### My deliverable to you all was a predictive model, which you've been able to see at the website link I gave to Andres. I'm hoping to migrate that to a real website at some point for you all to continue to play with, but for today this is a code notebook that contains all the relevant scripts/codes/functions that I used to build the product.

### The sections I'm going to have are:

<ol>
<li> Scraping the Data Piece-Wise and Tag Processing</li>
<li> Importing and Utilizing the Model</li>
<li> Adding to the Model in the Future</li>
<li> Additional Scripts I Wrote that may be Useful </li>
</ol>

## 1.) Scraping the Data Piece-Wise and Tag Processing

#### This first part will be basically scraping new channels and new videos for their tag information for these models. 
#### In order to utilize most of the YouTube API scripts, a Youtube *Data* key is needed for the api_key. I've left it blank for these, but you can fill in with yours. 
#### Once we have the tags, we'll process them and remove unimportant words (articles, prepositions) and move to the next step.

In [None]:
# This is just to test if the video ID that is passed is legitimate and not a bad link
# I specifically used this on the Flask app that you all have seen just to make sure I don't get bad requests

import requests

def is_url_ok(vid):

    
    full_url = 'https://www.youtube.com/watch?v='+vid
    return(200 == requests.head(url).status_code)

### This next cell contains the code for acquiring the tags from a single video!
#### (Assuming it has passed the above truth test of being a legit video ID!)

In [None]:
#As I mentioned before, you'll pass two things here: a video ID for a youtube video, and a youtube DATA api key.
import json

r = requests.get('https://www.googleapis.com/youtube/v3/videos?part=snippet%2CcontentDetails%2Cstatistics&id=' + video_id + '&key='+api_key)
data_json = r.json()

### This next loop extracts all tags and puts them into one giant string - this is easier for the next level of processing.

tag_string = ''
for i in data_json['items'][0]['snippet']['tags']:
    temp = i + ' '
    tag_string += temp


## We want a string for subsequent processing. We'll eventually turn it back into a list later to pass to the model, but I've written some pre-processing code that normally handles strings. 

### The next little code snippet is for taking a channel link, either given in the form 'youtube.com/user' or 'youtube.com/channel' and extracting the tags from the 50 most recent uploaded videos.

In [None]:
import re
if 'youtube.com/user/' in test_url:
    tag_list = []
    hits = re.findall("/user/(\w+)",test_url)
    print(hits[0])
    r = requests.get('https://www.googleapis.com/youtube/v3/channels?part=contentDetails&forUsername='+hits[0]+'&key='+api_key)
    playlist_json = r.json()
    
    try:
        playlist_id = playlist_json['items'][0]['contentDetails']['relatedPlaylists']['uploads']
    except:
        playlist_id = ''
        pass
    
    
    video_list = []
    try:   
        videos = requests.get('https://www.googleapis.com/youtube/v3/playlistItems?part=contentDetails&maxResults=50&playlistId='+playlist_id+'&key='+api_key)
        videos_json = videos.json()
        
        for i in range(len(videos_json['items'])):
            video_list.append(videos_json['items'][i]['contentDetails']['videoId'])
    
    except:
        pass
    
    
    for j in video_list:
       
        tag_request = requests.get('https://www.googleapis.com/youtube/v3/videos?part=snippet&id='+j+'&key='+ api_key)
        tag_json = tag_request.json()
        try:
            tag_list.append(tag_json['items'][0]['snippet']['tags'])
        except:
            tag_list.append('')
    
elif 'youtube.com/channel' in test_url:
    hits = re.findall('/channel/(\w+)',test_url)
    r = requests.get('https://www.googleapis.com/youtube/v3/channels?part=contentDetails&id='+hits[0]+'&key='+api_key)
    print(hits[0])
    playlist_json = r.json()
    try:
        playlist_id = playlist_json['items'][0]['contentDetails']['relatedPlaylists']['uploads']
    except:
        playlist_id = ''
        pass
    
    
    video_list = []
    try:   
        videos = requests.get('https://www.googleapis.com/youtube/v3/playlistItems?part=contentDetails&maxResults=50&playlistId='+playlist_id+'&key='+api_key)
        videos_json = videos.json()
        
        for i in range(len(videos_json['items'])):
            video_list.append(videos_json['items'][i]['contentDetails']['videoId'])
    
    except:
        pass
    
    
    for j in video_list:
       
        tag_request = requests.get('https://www.googleapis.com/youtube/v3/videos?part=snippet&id='+j+'&key='+ api_key)
        tag_json = tag_request.json()
        try:
            tag_list.append(tag_json['items'][0]['snippet']['tags'])
        except:
            tag_list.append('')
    

### And now that we have our list of tags from the channel, we'll turn it into a string for processing. 

In [None]:
tag_string = ''
for i in tag_list:
    temp = ''
    for j in range(len(i)):
        temp += i[j] + ' '
    tag_string += temp

### Now that we have our code in a giant string, we'll process it through two pre-defined functions to remove punctuation and unnecessary words. 

In [None]:
import nltk
from nltk.corpus import stopwords

punctuation = '!"#?$%“”&\'()’*+,—./:;<=>@[\\]^_`{|}~…' 
stoplist = stopwords.words('english')

### This cell contains the necessary imports for getting rid of English stop words ('the', 'to', 'a'... and so on) 
### As well as punctuation

In [None]:
tag_string = processText(removeStopwords(tag_string))
string_list = []
string_list.append(tag_string)

### And just like that, we have the tags from either a single video or from a channel of videos!

## 2.) Importing and Utilizing the Model

#### So my project was basically to build a model for you all that predicts age, gender, and country from the tags. 

#### The first thing we'll do is load the models. They're in pickle files, which are basically snapshots of code. 
#### Because of this, you can freely load them without having to re-train the models every time (which would be very time-consuming)

In [None]:
import pickle

#These next three models are associated with generating age/gender profiles.

with open('KNN_Reg_Tags_AgeGender.p', 'rb') as pickle_file:
    knn_age_gender = pickle.load(pickle_file)
with open('SVD_BigTags.p', 'rb') as pickle_file:
    svd = pickle.load(pickle_file)
with open('TFIDF_BigTags.p', 'rb') as pickle_file:
    tfidf = pickle.load(pickle_file)
    
    
# These next two are associated with generating the location breakdown. The second file is just a quick country reference
# based on the model. 

with open('knn_reg.p','rb') as pickle_file:
    knn_country_model = pickle.load(pickle_file)
        print('done')
with open('country_dict.p','rb') as pickle_file:
    country_dict = pickle.load(pickle_file)


### So now we have imported our models. The next few lines of code are for taking the above tag list that we made and getting output.

In [None]:
tag_desc = tfidf.transform(string_list)
svd_transform = svd.transform(tag_desc)
demo_data = knn_age_gender.predict(svd_transform)

### And we have our data! Demo_data is a 14 member array, whose first 7 elements are Male 13-17, 18-24, 25-34, 35-44, 45-54, 55-64, and 65+, while the last 7 elements are the respective female age breakdowns. The numeric elements are the fractional breakdown (out of 1) , so if element 2 is .25, that means that 25% of the viewers are Male from 18-24.

### Next we break down the country model, which picks up after the SVD transformation (basically breaking down the vocabulary into numbers, which can then be used as input for our predictive models.

In [None]:
import numpy 

country_demo = knn_reg.predict(svd_transform)
#Country demo is a large array containing an entry for every country that existed in the Youtube viewership. 
#As such, we only want the top 3, which the next line of code takes care of.

arr = np.argsort(country_demo[0])[::-1][:3]

# argsort returns the indices of the largest elements, and the [:3] returns the 3 largest. We're storing that in an array.
# Now we just build up the output.


country_demo_list = []
country_demo_names = []
total = 100

def dict_replace(dictionary, key):  # this little quick function returns the country name from the argsort index
    return(dictionary[key])


for i in arr:
    country_demo_list.append(country_demo[0][i]*100) #Giving a percentage 
    country_demo_names.append(dict_replace(country_dict,i)) #Giving us a two letter country code
    total -= country_demo[0][i]*100 #Just appropriately decrementing our total

country_demo_list.append(total)
country_demo_names.append('The rest')


#And now we have a list of country demographic percentages ('country_demo_list') and their names ('country_demo_names')

## So with that, we have predicted the age/gender breakdown and our country viewership numbers! That takes care of the prediction and modeling stage from our youtube video ID or youtube channel link! 

### In addition to this notebook, I'll also send over a zipped archive that contains all of the models (google drive?), as well as relevant information for the next chapter of adding to the model.

# 3.) Adding to the Model In the Future

### So you all have a model, but there's always room to grow! Generally a model gains in potency when it has more test data to sample. While there were always be edge-cases that are difficult to predict, if in a year or so from now you're able to acquire more demographic information, it's always good to update!

In [None]:
# In case you need to import all of the modules at once, I've aggregated most everything here
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_samples, silhouette_score
from nltk.corpus import stopwords
from sklearn.manifold import TSNE
from sklearn.cluster import MiniBatchKMeans
import collections
from collections import Counter, defaultdict
from string import punctuation
import pandas as pd
import numpy as np
import nltk
import re

from sklearn.decomposition import TruncatedSVD

import matplotlib.pyplot as plt
import seaborn as sns

### I've compiled a data table of all channels that Makrwatch had demographic information on for age/gender, which can be loaded as below. It's a pandas dataframe, so you can add new rows to it by concatenating to the end (pandas.concat)

In [2]:
import pickle
import pandas as pd

with open('CombinedBigTagMergeDataFrame.p','rb') as pickle_file:
    full_demo_tag_df = pickle.load(pickle_file)

In [8]:
full_demo_tag_df.tail()

Unnamed: 0,Channel,Processed_Tags,M1,M2,M3,M4,M5,M6,M7,F1,F2,F3,F4,F5,F6,F7
22223,UCrHitTb3daDNagJ7NL8qOCw,curl my hair tutorial watch me subscribe sahm ...,0.5,4.0,11.7,6.4,3.0,1.5,0.7,2.2,16.4,32.2,11.8,5.6,2.1,1.8
22224,UCvVPZBHNGg4ZAYymVvzp6sA,acrylic nails acrylic nail designs unas de acr...,0.6,2.7,3.5,1.9,1.0,0.3,0.2,10.5,34.7,27.3,10.6,4.4,1.3,1.1
22225,UCmS-nTH3QH36VqXt8SfD8Qg,diy video kids crafts easy to spider halloween...,2.6,5.7,9.2,10.4,3.7,0.9,0.9,8.1,15.4,19.7,14.5,5.4,2.0,1.5
22226,UCR_jPo2RgJLWP1Z-0s7Zccg,photoshop courses march madness layers smart o...,2.5,18.9,26.0,11.8,6.5,3.9,3.1,1.6,8.7,9.1,3.8,2.2,1.2,0.8
22227,UCdSjoa8r-bPbMJcvymeSrJQ,bonde dos bronze bronde lol jogadas de bronze ...,18.7,38.7,21.8,7.4,2.3,0.7,2.0,0.7,2.0,2.3,2.3,0.9,0.1,0.1


### And as you can see, it has the channel ID, a list of processed tags, as well as the respective age/gender groups for the YouTube demographics. 

### Similarly, there's a separate table for the country information. 

In [9]:
with open('country_full_info_table.p','rb') as pickle_file:
    full_country_df = pickle.load(pickle_file)

In [10]:
full_country_df.head()

Unnamed: 0,GY,HK,YE,GP,AL,JP,MQ,ET,RW,SY,...,F1,F2,F3,F4,F5,F6,F7,tag_list,Processed_Tags,Len_Tags
20364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.3,32.7,33.2,13.0,5.8,1.8,1.0,hair makeup wig halloween cleopatra inspired m...,hair makeup wig halloween cleopatra inspired m...,23376
1413,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.5,12.1,23.9,13.9,6.3,2.2,1.1,plus size fashion plus size ootd plus size out...,plus size fashion plus size ootd plus size out...,57102
23814,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.8,25.5,25.8,7.7,3.3,1.5,1.0,indian youtuber indian skintone brown skintone...,indian youtuber indian skintone brown skintone...,22432
8748,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,20.6,28.2,16.4,13.0,4.7,0.9,2.3,amigos amigas vs mejores amigas bff best frien...,amigos amigas vs mejores amigas bff best frien...,3748
17555,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.9,13.8,21.7,11.5,8.2,5.7,3.2,veganlovlie vegetable spaghetti spaghetti reci...,veganlovlie vegetable spaghetti spaghetti reci...,32112


### And this data frame has the 2 letter country codes with the respective breakdown per country.

### If you ever need to retrain the model, then you can do the following: 

In [None]:
tfidf = TfidfVectorizer(min_df=10, max_features=25000, strip_accents='unicode',
                           analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1,2), 
                           use_idf=1, smooth_idf=1, sublinear_tf=1)

#First we initialize the model

In [None]:
tfidf.fit(full_demo_tag_df['Processed_Tags'])

tfidf_descr = tfidf.fit_transform(full_demo_tag_df['Processed_Tags'])

# Then we create a full TF-IDF model of all the tags in our corpus.
#In this case, to feed it into the SVD model to build that too with the fit_transform

In [None]:
svd = TruncatedSVD(n_components=50, random_state=0)
svd.fit(tfidf_descr)
svd_tfidf = svd.fit_transform(tfidf_descr)
#Which we can feed into the model to fit the SVD model as well!
#In order to build the subsequent regression models, we'll need the fit_transform again. 

In [None]:
svd_test_tfidf_df =  pd.DataFrame(svd_tfidf)
X = pd.DataFrame(full_demo_tag_df, columns = svd_test_tfidf_df.columns)
y = pd.DataFrame(full_demo_tag_df, columns = ['M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7'])
y = y.convert_objects(convert_numeric=True)

# This is used to generate our features and labels for the regression model. 

In [None]:
knn_age_gender = KNeighborsRegressor(n_neighbors = 20, weights = 'uniform', p = 2)
knn_age_gender.fit(X,y)

### The last few cells have all been similar - instantiate/initialize the model, and then fit our parameters to it. And these models are the ones that we loaded from the pickled file before. 

### Also, you can use the predict function on the knn_age_gender to get out our demo_data from section 2. 

## 4.) Additional Scripts I Wrote that may be Useful

### To generate our data frames for the initial age/gender, I used the following scripts

In [12]:
import json

file_2 = './source_2_demographics.json'

channel_id_big = []
age_range_big = []
female_big = []
male_big = []

for line in open(file_2,'r'):
    #print(line)
    lineson = json.loads(line)
    channel_id_big.append(lineson['channelId'])
    age_range_big.append(lineson['range'])
    try:    
        male_big.append(lineson['male'])
    except:
        male_big.append('0')
    try:
        female_big.append(lineson['female'])
    except:
        female_big.append('0')
        
arrays = [channel_id_big,age_range_big]

multi_index_big = pd.MultiIndex.from_arrays(arrays,names=['ID','Age_Range'])

layer_df_big = pd.DataFrame({'Male':male_big,
                        'Female': female_big},
                       index = multi_index_big)

demo_df = pd.DataFrame()

demo_list = []


for i in set(channel_id_big):
    temp = layer_df_big.xs(i,level = 'ID', drop_level = False)
    temp_list = []
    temp_list.append(i)
    for j in range(7):
        try:
            temp_list.append(temp.Male[j])
        except:
            temp_list.append(0)
    for k in range(7):
        try:
            temp_list.append(temp.Female[k])
        except:
            temp_list.append(0)
    attempt.append(temp_list)
        
demo_df = pd.DataFrame(columns = ['ID','M1','M2','M3','M4','M5','M6','M7','F1','F2','F3','F4','F5','F6','F7'])
demo_df = demo_df.append(pd.DataFrame(demo_list, columns = demo_df.columns))

In [13]:
demo_df.head()

Unnamed: 0,ID,M1,M2,M3,M4,M5,M6,M7,F1,F2,F3,F4,F5,F6,F7
0,UCFtMeyaVYhicK5JTQ0mfNTA,1.6,11.8,23.6,12.3,5.4,2.0,1.2,2.7,13.4,14.5,6.5,3.1,1.1,0.7
1,UCeuG93VFLbag-hgCWv_a3uQ,5.7,27.4,44.6,9.1,3.8,0.4,0.9,0.9,3.3,3.3,0.4,0.0,0.0,0.2
2,UCcHsg3wQ8-1t4v7mhzM3sUg,1.6,12.8,30.1,19.2,10.2,6.0,4.2,0.8,4.6,5.4,2.9,1.4,0.5,0.4
3,UCpr25HXdBHjXxitvThcQgHQ,9.8,19.5,15.2,8.0,4.4,0.8,1.3,8.2,14.8,8.5,5.3,3.0,0.4,0.7
4,UCzG6VQOETsDJYGDAai_9hwQ,0.6,2.4,2.7,1.1,0.6,0.2,0.1,9.9,36.7,27.5,10.9,5.0,1.3,1.0


### And this is for generating a dataframe from the location information: 

In [17]:
file_2 = './source_2_audienceLocation.json'

import json
import numpy as np

big_json_df = pd.DataFrame()
big_channel_list = []
big_country_list = []
big_percent_list = []
with open(file_2,'r') as file:
    for line in file:
        dipdip = json.loads(line)
        big_channel_list.append(dipdip['channelId'])
        big_country_list.append(dipdip['country'])
        big_percent_list.append(dipdip['percent'])
        
big_json_df['channelId'] = big_channel_list
big_json_df['country'] = big_country_list
big_json_df['percentage'] = big_percent_list

big_df_channel_list = [i for i in  big_json_df['channelId']]
big_country_set = set(big_country_list)

big_df_channel_set = set(big_df_channel_list)
big_zero_data = np.zeros(shape=(len(big_df_channel_set),len(big_country_set)))
big_zero_country_df = pd.DataFrame(big_zero_data, index = big_df_channel_set, columns = big_country_set)

# So now we've generated a dataframe from our country location json with all of the country codes and the channel IDs

In [18]:
big_zero_country_df.head()

Unnamed: 0,KN,CZ,VG,CN,SY,TW,IQ,JP,JO,AM,...,CR,ES,SG,LU,OM,MG,PL,UY,CI,SI
UCFtMeyaVYhicK5JTQ0mfNTA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
UCIcbWJtXOjzm38IxPxaOQ_Q,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
UC5VvHEF2hG9W4lIZBLINdrA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
UComrvL3jVbXiOTis8lPUYqQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
UCcHsg3wQ8-1t4v7mhzM3sUg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
for idx,row in big_json_df.iterrows():
    x = row['channelId']
    y = row['country']
    z = row['percentage']
    big_zero_country_df.loc[x,y] = z
    
    
# And this code is used to fill the dataframe with the respective percentages per country!