In [1]:
from nltk.corpus import wordnet as wn
# Note: gensim.summarization was removed in Gensim 4.0, so this import needs gensim < 4.0
from gensim.summarization import keywords
import urllib.request as urllib2
import re
import json
import datetime

# Tutorial: A Video Recommender Based on Users' Preferences




## Introduction 
Many people watch videos on YouTube every day. YouTube says that people around the world now consume a billion hours of video content per day. This volume of viewing history reveals people's preferences: the kinds of videos they like, recent hot topics, and favorite stars. We can use this history to recommend videos to users and so increase user engagement and retention.

In this tutorial, we will:
1. Analyze what kind of videos a user likes, based on the videos the user watched in recent days
2. Use queries to search for potential videos of a similar kind
3. Further select videos using tags summarized from video descriptions
4. Finally, recommend the top-ranked videos to the user.

## Objective 
This work will help you get familiar with:
* YouTube API
* JSON vs. HTML
* HTTP Requests
* Using WordNet to analyze the semantic distance between words
* Using Gensim to summarize text

## Library Documentation
* Standard Library:
 * json
 * re
 * datetime
* Third Party 
 * nltk
 * gensim 

## Working with APIs


Before you start

1. Create a Google Account to access the Google Developers Console, request an API key, and register your application.
2. Create a project in the Google Developers Console and obtain authorization credentials so your application can submit API requests.


After creating your project, make sure the YouTube Data API is one of the services that your application is registered to use:

Go to the Developers Console and select the project that you just registered.
Open the API Library in the Google Developers Console. If prompted, select a project or create a new one. In the list of APIs, make sure the status is ON for the YouTube Data API v3.

If your application will use any API methods that require user authorization, read the authentication guide to learn how to implement OAuth 2.0 authorization.

You can find more help at: https://developers.google.com/youtube/v3/getting-started


## Q 1: Authenticated Request with the YouTube API

First, store your YouTube credentials in a local file from which you can later read your api_key. The file can have any format or structure; here, we recommend storing the key in a text file.

You can store your key by running the following in a terminal:
```bash
echo 'Youtube_API_KEY' > api_key.txt
```

You can then fetch your key from the file using:
```python
with open('api_key.txt', 'r') as f:
    api_key = f.read().replace('\n','')
```

Using the YouTube API, you can search for videos matching specific search terms, topics, locations, publication dates, and much more.


In [2]:
# Read the key from api_key.txt (created above) instead of hard-coding a secret
api_key = open('api_key.txt').read().strip()

You can use the YouTube API to retrieve and manipulate YouTube resources such as channels, videos, and playlists by requesting URLs with request parameters. The base URL is https://www.googleapis.com/youtube/v3. Parameters are appended to the resource URL as a query string: the first follows '?', and subsequent ones are separated by '&'. Each parameter identifies one or more top-level resource properties that are used to filter the returned results.
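As a minimal sketch of how such a request URL can be assembled, the snippet below uses the standard library's `urlencode` to percent-encode the query string. The helper name `build_url` and the placeholder key are assumptions made for illustration; they are not part of the YouTube API.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/youtube/v3"

def build_url(resource, params):
    """Join the base URL, a resource name, and percent-encoded query parameters."""
    return BASE_URL + "/" + resource + "?" + urlencode(params)

url = build_url("videos", {
    "part": "snippet",
    "chart": "mostPopular",
    "regionCode": "US",
    "key": "YOUR_API_KEY",  # placeholder, not a real key
})
print(url)
```

Letting `urlencode` build the query string avoids manual handling of spaces and other reserved characters in parameter values.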

The response to each request is the JSON representation of a YouTube resource, so we need to parse the JSON response from the API to get the text information we need.
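The parsing step might look like the sketch below. The response body here is a made-up, heavily trimmed example (the IDs, titles, and tags are invented), and `extract_videos` is a hypothetical helper; a real response carries many more fields.

```python
import json

# A made-up, heavily trimmed response body for illustration only
sample_body = '''
{
  "items": [
    {"id": "abc123", "snippet": {"title": "First video", "tags": ["music", "live"]}},
    {"id": "def456", "snippet": {"title": "Second video"}}
  ]
}
'''

def extract_videos(body):
    """Return (id, title, tags) tuples; tags default to an empty list when absent."""
    data = json.loads(body)
    return [
        (item["id"], item["snippet"]["title"], item["snippet"].get("tags", []))
        for item in data.get("items", [])
    ]

for vid, title, tags in extract_videos(sample_body):
    print(vid, title, tags)
```

Using `.get("tags", [])` keeps the loop from failing on videos whose snippet has no tags.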


As a test, please retrieve the list of YouTube's most popular videos for the US.

You might find this guidance helpful:
https://developers.google.com/youtube/v3/getting-started




Please note that the default value of maxResults is 5. You should use the "nextPageToken" in a response body to retrieve the top 25 results ranked by date.

The YouTube API returns a response body with the following structure:
```python 
{
  "kind": "youtube#activityListResponse",
  "etag": etag,
  "nextPageToken": string,
  "prevPageToken": string,
  "pageInfo": {
    "totalResults": integer,
    "resultsPerPage": integer
  },
  "items": [
    activity Resource
  ]
}
```



In [3]:
queryURL = 'https://www.googleapis.com/youtube/v3/videos?part=snippet&chart=mostPopular&regionCode=US&key=' + api_key
results = json.load(urllib2.urlopen(queryURL))

# Follow nextPageToken until the API stops returning one;
# .get() avoids a KeyError when the first page is also the last
tok = results.get('nextPageToken')
while tok:
    temp = json.load(urllib2.urlopen(queryURL + '&pageToken=' + tok))
    results['items'].extend(temp['items'])
    tok = temp.get('nextPageToken')

print(results)



## Q 2: Build Recommendation Video Dataset

In the second part, we are going to build a video dataset consisting of videos that a user might like.

## Q 2.1 Retrieve Tags of Videos From a User's Watching History 

Now we need to fetch the user's history records. The YouTube API only allows history searches within the authorized user's own videos; a user has no authority to get the watch history of other users. Thus, we provide a sample video history for you (stored in "userlike_sample.txt").

It can be extremely difficult to derive the characteristics of a video from its frames. Fortunately, the "snippet" part that the YouTube API provides for a video contains a "tags" attribute. These tags summarize the characteristics of videos that the user might like, so we parse the API response to get this crucial information.

Based on the history data stored in userlike_sample.txt, you should first obtain the list of tags of these videos. We refer to the set of tags of videos that the user has watched as the "user preference tags" in the rest of this tutorial.

In [4]:
with open('userlike_sample.txt', 'r') as user_sample:
    data = json.load(user_sample)
    liketag = set()
    likeid = []

    for item in data['items']:
        # Some videos have no tags; .get() avoids a KeyError for them
        liketag |= set(item['snippet'].get('tags', []))
        likeid.append(item['id'])
    
print(liketag)
print(likeid)

{'Williams', 'How to play league', 'lin-manuel miranda', 'Noxus', 'lol champ teaser', 'broadway', 'Irelia trailer', 'mr rogers documentary', 'every day', 'Champ teaser', 'dewy', 'music video', 'official', 'music', 'dear evan hansen', 'Good Morning America', 'Irelia League', 'Trailers', 'Featurettes', 'league champion teaser', 'mister rogers trailer', 'Clips', 'Wendy Williams fainting', 'Wendy Williams show', 'MOBA', 'Riot', 'Wendy', 'lol', 'morgan neville', 'host', 'wont you be my neighbor', 'fast', 'LoL', 'Swain', 'mister rogers', 'ben platt', 'Wendy Williams TV show', 'Wendy Williams health', 'pasek and paul', 'Independent Film', 'medical', 'Irelia gameplay trailer', 'champ', 'mr rogers trailer', 'how to play lol', 'look', 'mister rogers documentary', 'tom hanks mr rogers', 'League of Legends', 'league champ teaser', 'routine', 'League of Legends Champion teaser', 'hamildrops', 'hamilton music', 'tom hanks mr rogers trailer', 'makeup', 'tonight', 'klpolish', 'mr rogers movie trailer'

## Q 2.2 Retrieve Videos with Tag Query 
In the second step, we use the user preference tags to find potential videos that the user might like. We build a large dataset by querying each tag through YouTube's search endpoint to retrieve videos relevant to that tag. Here, we require that retrieved videos were uploaded to YouTube within the last 90 days (approximately three months).

Search for videos with a request for each tag, sort them by view count, and retrieve only the top 50 videos for each tag. From the response data, we can see that most newly uploaded videos have a complete description but no tag attributes. In order to calculate video similarity, we will summarize each video's description and title into a set of keywords. Thus, the stored results should contain only the video ID, title, and description; all other information can be discarded. Sort the results by view count before storing them in the tagSerachVideos.txt file. These results are the base data from which we are going to retrieve recommended videos.
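The 90-day window is passed to the search endpoint through the `publishedAfter` parameter, which expects an RFC 3339 timestamp. The sketch below shows one way to compute it in UTC; the helper name `published_after` and the fixed test date are assumptions for illustration.

```python
import datetime

def published_after(days, now=None):
    """Return an RFC 3339 UTC timestamp for `days` days before `now`."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return (now - datetime.timedelta(days=days)).strftime('%Y-%m-%dT%H:%M:%SZ')

# A fixed reference time makes the result reproducible
fixed = datetime.datetime(2018, 4, 1, 12, 0, 0, tzinfo=datetime.timezone.utc)
print(published_after(90, fixed))
```

Because the trailing `Z` denotes UTC, the timestamp is derived from a UTC clock rather than local time.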

In [5]:
from urllib.parse import quote

# publishedAfter expects an RFC 3339 timestamp; the trailing 'Z' means UTC, so use UTC time
pastTime = (datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=90)).strftime('%Y-%m-%dT%H:%M:%SZ')

parameters = {
    'part': 'snippet',
    'order': 'viewCount',
    'publishedAfter': pastTime,
    'maxResults': '50',
    'type': 'video',
    'key': api_key}

queryURL = "https://www.googleapis.com/youtube/v3/search"

# Append the shared parameters; the first follows '?', the rest follow '&'
separator = '?'
for name in parameters:
    queryURL = queryURL + separator + name + "=" + parameters[name]
    separator = '&'

results = []
vidset = set()
with open('tagSerachVideos.txt', 'w') as jfile:
    for tag in liketag:
        # Percent-encode the tag so spaces and special characters are URL-safe
        tag = quote(tag.strip())
        print(tag)
        query = queryURL + separator + "q=" + tag
        data = json.load(urllib2.urlopen(query))
        for item in data['items']:
            vid = item['id']['videoId']
            # Keep each video only once, even if several tags return it
            if vid not in vidset:
                vidset.add(vid)
                results.append({'vid': vid, 'title': item['snippet']['title'], 'dsp': item['snippet']['description']})
    jfile.write(json.dumps(results))


Williams
How%20to%20play%20league
lin-manuel%20miranda
Noxus
lol%20champ%20teaser
broadway
Irelia%20trailer
mr%20rogers%20documentary
every%20day
Champ%20teaser
dewy
music%20video
official
music
dear%20evan%20hansen
Good%20Morning%20America
Irelia%20League
Trailers
Featurettes
league%20champion%20teaser
mister%20rogers%20trailer
Clips
Wendy%20Williams%20fainting
Wendy%20Williams%20show
MOBA
Riot
Wendy
lol
morgan%20neville
host
wont%20you%20be%20my%20neighbor
fast
LoL
Swain
mister%20rogers
ben%20platt
Wendy%20Williams%20TV%20show
Wendy%20Williams%20health
pasek%20and%20paul
Independent%20Film
medical
Irelia%20gameplay%20trailer
champ
mr%20rogers%20trailer
how%20to%20play%20lol
look
mister%20rogers%20documentary
tom%20hanks%20mr%20rogers
League%20of%20Legends
league%20champ%20teaser
routine
League%20of%20Legends%20Champion%20teaser
hamildrops
hamilton%20music
tom%20hanks%20mr%20rogers%20trailer
makeup
tonight
klpolish
mr%20rogers%20movie%20trailer
TV
new%20trailers
parkland
dance
easy
no

## Q 3: Summarize Video Titles and Descriptions 

Now, we are going to summarize the characteristics of each video in the dataset with the help of Gensim. Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python.

Using Gensim, get summarized keywords of both the title and the description of a video. The results should satisfy the following requirements:
* lemmatize the keywords you retrieve
* remove accents from the keywords in the results

Note:
* In the following part, we will calculate the semantic distance between keywords, which requires each input to be a single word. Thus, if a keyword summarized by Gensim is a multi-word phrase, please split the phrase before storing it.
* Frequently repeated words are more likely to be extracted as keywords. Video authors often include URLs in the description to introduce the content of their video. To prevent "http" or "https" from being summarized as keywords, we should replace them with an empty string before using Gensim.
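The two preprocessing notes above can be sketched without Gensim. This is a variant rather than the tutorial's exact method: it removes whole URLs instead of only the "http"/"https" prefix, and the helper names `strip_urls` and `split_phrases` are hypothetical.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(text):
    """Remove whole URLs so 'http'/'https' cannot surface as keywords."""
    without_urls = URL_PATTERN.sub(" ", text)
    # Collapse the whitespace left behind by the removed URLs
    return re.sub(r"\s+", " ", without_urls).strip()

def split_phrases(keyword_phrases):
    """Break multi-word keyword phrases into a set of single words."""
    words = set()
    for phrase in keyword_phrases:
        words.update(phrase.split())
    return words

print(strip_urls("Watch the trailer https://example.com/abc now"))
print(sorted(split_phrases(["music video", "trailer", "official music"])))
```

Storing the words in a set also removes duplicates that arise when several phrases share a word.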

In [None]:
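createdTags = []
with open('tagSerachVideos.txt', 'r') as video_file:
    result = json.load(video_file)

for data in result:
    tags = set()

    # Keywords from the title; skip videos whose text Gensim cannot process
    try:
        title_tag = keywords(data['title'], lemmatize=True, deacc=True).split('\n')
    except Exception:
        continue

    # Strip URL schemes so "http"/"https" are not picked up as keywords.
    # A single pattern is needed: removing "http" first would leave a stray "s".
    dsp = re.sub(r"https?", "", data['dsp']).strip()
    try:
        dsp_tag = keywords(dsp, lemmatize=True, deacc=True).split('\n')
    except Exception:
        continue

    # Split multi-word phrases so every stored tag is a single word;
    # empty strings split into nothing and are dropped automatically
    for tag in title_tag + dsp_tag:
        tags.update(tag.split())
    createdTags.append({'vid': data['vid'], 'tag': tags})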

## Q 4: Utilize Semantic Distance of Keywords to Recommend Videos for Users

So now, we have a list of keywords for every video in the dataset and the set of user preference tags. We are ready to analyze their similarity.

Here, we use WordNet as our tool. WordNet is a lexical database that groups English words into sets of synonyms called synsets and records a number of relations among these synonym sets.

Look up every video keyword in WordNet to get its synsets. A synset is a set of synonyms that share a common meaning. Since a word can have multiple meanings, it can belong to several synsets; each synset contains one or more lemmas, each representing a specific sense of a specific word. When comparing two words, you should try every synset of each word.
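Comparing two synsets is typically done over WordNet's is-a (hypernym) hierarchy. One common measure, which NLTK exposes as `path_similarity`, is 1 / (shortest path length + 1). The toy sketch below reproduces that formula on a hypothetical miniature hierarchy; the `PARENT` table is invented for illustration and is not real WordNet data.

```python
from collections import deque

# Hypothetical miniature is-a hierarchy (child -> parent); not real WordNet data
PARENT = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}

def neighbors(node):
    """Nodes one is-a edge away from node, in either direction."""
    linked = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        linked.append(PARENT[node])
    return linked

def shortest_path_length(a, b):
    """Breadth-first search for the number of edges between a and b."""
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path: the two nodes are unrelated in this hierarchy

def path_similarity(a, b):
    """1 / (shortest path length + 1), or None when no path exists."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (dist + 1)

print(path_similarity("dog", "dog"))   # identical nodes, 0 edges
print(path_similarity("dog", "wolf"))  # dog-canine-wolf, 2 edges
print(path_similarity("dog", "cat"))   # dog-canine-mammal-feline-cat, 4 edges
```

The score is 1 for identical nodes and shrinks as the path grows, which is why the maximum over all sense pairs is a reasonable summary of how close two words can be.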

Use the method `path_similarity` to get a list of scores denoting how similar a video tag word is to each sense of each user preference tag. Use the maximum score to denote how well a video tag word matches the user's preferences. Then accumulate the scores over the video's tag set, rank the final scores, and return the ranking list. The recommender system can then make recommendations to the user according to the ranking list.
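The aggregation described above (best match per video word, then one score per video, then a ranking) can be sketched with a stand-in similarity function. Everything here is assumed for illustration: `video_score` is a hypothetical helper, the similarity table is made up, and the per-video score is an average over the words that have at least one usable similarity.

```python
def video_score(video_words, preference_words, similarity):
    """Average, over video words, of the best similarity to any preference word.

    similarity is a hypothetical callable returning a float or None.
    Words with no usable similarity are skipped, mirroring words that
    WordNet does not know.
    """
    best_scores = []
    for word in video_words:
        candidates = [similarity(word, pref) for pref in preference_words]
        candidates = [c for c in candidates if c is not None]
        if candidates:
            best_scores.append(max(candidates))
    return sum(best_scores) / len(best_scores) if best_scores else 0.0

# Made-up similarity table standing in for WordNet lookups
TABLE = {("song", "music"): 0.5, ("song", "makeup"): 0.1, ("guitar", "music"): 0.25}

def toy_similarity(a, b):
    return TABLE.get((a, b))

videos = {"v1": ["song", "guitar"], "v2": ["song"], "v3": ["xyz"]}
prefs = ["music", "makeup"]

# Highest score first, so the head of the list is the top recommendation
ranking = sorted(
    ((vid, video_score(words, prefs, toy_similarity)) for vid, words in videos.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

Averaging rather than summing keeps videos with long descriptions from outranking others simply because they have more keywords.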

In [None]:
# Collect the synsets of every user preference tag that WordNet knows
synsets = []
for tag in liketag:
    tag_synsets = wn.synsets(tag)
    if tag_synsets:
        synsets.append(tag_synsets)

scores = []
for video in createdTags:
    score = 0
    cnt = 0
    for tag in video['tag']:
        ss = wn.synsets(tag)
        if not ss:
            # WordNet does not know this word; leave it out of the average
            continue
        cnt += 1
        # Best similarity between any sense of this word and any preference sense
        maxscore = 0
        for synset in synsets:
            for video_sense in ss:
                for pref_sense in synset:
                    similarity = video_sense.path_similarity(pref_sense)
                    if similarity is not None and similarity > maxscore:
                        maxscore = similarity
        score += maxscore

    # Average over the words WordNet recognized
    if cnt != 0:
        score = score / cnt
    scores.append([video['vid'], score])

# Highest-scoring videos first, so the top of the list is the recommendation
sorted(scores, key=lambda score: score[1], reverse=True)