## Capstone Project - YouTube Most Viewed 500 Videos

### Aim of the Project :

YouTube is a collection of vast videos all over the world in different languages. Worldwide audience have different choices of websites like FaceBook videos, DailyMotion, MetaCafe, Vimeo, On.Aol, Break and Yahoo! Screen etc. YouTube is the primary choice and public platform for viewers all over the world.

In this vast platform of YouTube, 500 videos have the viewership ranging from 200 million to 2.6 billion plus views which is more than any other top videos from the other video platforms. A detailed analysis is required to measure the popularity of these top 500 videos.

This project conducts an analysis on what factors have influenced these videos to be at top 500.

The factors may be -
    1. Genre or Category of the video
    2. Title Description
    3. Ratings such as Likes and Dislikes
    4. If possible the Artist in case of Music videos
                 
For this analysis, this project considers Logistic Regression model primarily and if possible Decision Tree/Random Forest models.





### Steps involved in the Project

    1. Bring the required Data using YouTube APIs for Python -
        a. Check the YouTube APIs
        b. Get API Key from Google API Console
        c. Use Argument Parser from oath2client.tools package to establish search term
        d. Build YouTube call service
        e. Call "Search" API and process the response to capture playlist, videos and channels
        d. We require only one playlist with title "Most Viewed Videos of All Time"
        e. Using single playlist, call playlistItems API to bring the videos
        f. Bring categories and statistic of each individual video using videoCategories and videos APIs
        g. Create a dataframe using the Resultset 
        h. Create a .CSV file using the dataframe
       
    2. Data Cleaning/ Data Analysis -
        a. Load the .CSV file back into the dataframe
        b. Rearrange and clean the data
        c. Check the sorting order of the videod data with respect to view count
        d. Sort the data by view count in descending order 
        e. Divide or classify the data based on number of videos - Top 100 videos, Top      101-200 videos etc.
        f. Make sure that the data does not have any null values
        g. Check whether the dataframe contains all top 500 videos with data
        h. Describe the data
        i. Identify whether the data is suitable for the project
        
    3. Predictive Model -- To be continued
    
MG: Good job with this outline!

In [1]:
#!/usr/bin/python
import httplib2
from apiclient.discovery import build #YouTube API
from apiclient.errors import HttpError #YouTube API
from oauth2client.tools import argparser #YouTube API
import json
import urllib
import pandas as pd
import numpy as np


##### Setup YouTube Parameters

In [2]:
DEVELOPER_KEY = "AIzaSyAJPX-VI_5gAw7Hs9538Tgqw1XLIZWt5x4" #"REPLACE_ME" 
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

#### Setup search term for YouTube API

Use arg parser to set the search term for YouTube. The search term is "Top 500 videos" to get the most top viewed videos

In [3]:
argparser.add_argument("--q", help="Search term", default="Top 500 Videos")
argparser.add_argument("--max-results", help="Max results", default=25)
args = argparser.parse_known_args()
options = args
options

(Namespace(auth_host_name='localhost', auth_host_port=[8080, 8090], logging_level='ERROR', max_results=25, noauth_local_webserver=False, q='Top 500 Videos'),
 ['-f',
  '/Users/SatyaSagi/Library/Jupyter/runtime/kernel-45922c96-8f0d-4cf1-8ac3-c86af30b079a.json'])

#### Build YouTube API service with API Key

Use build method given by YouTube API with Service name, API version, Developer Key as parameters.

Get the search response with using method youtube.search( ).list( ). 

Execute the search response with execute method. The part string will return the Playlist Ids and Snippet with Playlist titles.

In [4]:
youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION, developerKey=DEVELOPER_KEY)

# Call the search.list method to retrieve results matching the specified
 # query term.
search_response = youtube.search().list(
 q=options[0].q,
 type="playlist",
 part="id,snippet",
 maxResults=options[0].max_results
).execute()





###### Print the Search Response to check the items coming back

In [5]:
print search_response

{u'nextPageToken': u'CBkQAA', u'kind': u'youtube#searchListResponse', u'items': [{u'snippet': {u'thumbnails': {u'default': {u'url': u'https://i.ytimg.com/vi/9bZkp7q19f0/default.jpg', u'width': 120, u'height': 90}, u'high': {u'url': u'https://i.ytimg.com/vi/9bZkp7q19f0/hqdefault.jpg', u'width': 480, u'height': 360}, u'medium': {u'url': u'https://i.ytimg.com/vi/9bZkp7q19f0/mqdefault.jpg', u'width': 320, u'height': 180}}, u'title': u'Most Viewed Videos of All Time\u30fb(Over 200 million views)', u'channelId': u'UCEDEKrjFZFp3Br3ENlYomdA', u'publishedAt': u'2012-10-17T22:55:58.000Z', u'liveBroadcastContent': u'none', u'channelTitle': u'MyTop100Videos', u'description': u'A complete ordered list of the Top 500 most viewed videos on YouTube (215M+ views) \u2022 Created on: 10/17/12 \u2022 Auto-updated.'}, u'kind': u'youtube#searchResult', u'etag': u'"yaAQEaHITilaGdzoEjt_tJdICqI/jBbRZIJ9mW8FwA2oE40neNurXB4"', u'id': {u'kind': u'youtube#playlist', u'playlistId': u'PLirAqAtl_h2r5g8xGajEwdXd3x1sZh

##### Combine the Playlist Ids and Titles

In [6]:
videos = dict( map( lambda obj: ( obj["id"]["playlistId"], obj["snippet"]["title"] ), search_response["items"] ) )

##### Check Videos Output to find the Playlist with title "Most Viewed Videos of All Time"

In [7]:
videos

{u'PL5sMqiHHSWqeSHminsjrYY5rdGzVPNLEn': u'Filtered pop songs top 500',
 u'PL6ZLc-zZUnxlkB9t8CcpFZeV6V5I_cVgu': u'M TOP POP MUSIC VEVO 500',
 u'PL79DF4D733125AD20': u'Country Music | Country Music Playlist 2016 | Kenny Chesney, Jason Aldean, Blake Shelton, Luke Bryan, Florida Georgia Line | BeachsideCountry 2016',
 u'PL8cFaF2b783Ls-_n2LcW25VzdHXJphxSx': u'Top Tracks - Prince',
 u'PLAB4C38FC639C24D7': u'MTV Top 500 Videos of All Time ( May 1997) # 500 at  # 472',
 u'PLAbeRqyTx1rIGWY13HgPyh0VF0LdoTQFp': u'500 greatest songs of all time',
 u'PLD8tVgPY6C1kXXb83cFaX_6xvSnnGvvYO': u'Top Tracks - Andr\xe9 Rieu',
 u'PLEXox2R2RxZKD0KvMoTYSiKnxwOn2joVU': u'Top 500 Greatest New Wave Songs',
 u'PLEXox2R2RxZKyTTlt3kxvtJbI_Cw1K1IX': u'Top 500 Greatest Synthpop Songs',
 u'PLFLjA-BCYmWH3oFUHqB_vOHkZADg6D1fG': u"2016's Top 500 Tracks Playlist- MusicAlivePlus+",
 u'PLNa8NxthA4jOntBYvZxGBX4EXtXzqnfpe': u'Top 500 Minecraft songs.',
 u'PLNxOe-buLm6cz8UQ-hyG1nm3RTNBUBv3K': u'Top 500 Classic Rock songs',
 u'P

----------------------------------------------------------------------------------------------

##### Pick up the Playlist with title "Most Viewed Video of All Time"

In [8]:
iString = 'Most Viewed Videos of All Time'
playList = [key for key, value in videos.iteritems() if iString.lower().replace(" ","") in value.lower().replace(" ","")]

MG: So you decided to go with a list comprehension after all...

In [9]:
playListId = playList[0]

##### Get the videos under the Playlist 

Prepare Playlist items request using the method youtube.playlistItems( )

In [10]:
# Retrieve the list of videos uploaded to the authenticated user's channel.
playlistitems_list_request = youtube.playlistItems().list(
    playlistId=playListId,
    part="snippet",
    maxResults=50)

##### Videos with Title, Categories, View Count, Like Count and Dislike Count

Playlist items request contains multiple Http Requests. There will be 10 requests in the request response. Each request contains 50 videos resulting into total 500 videos for 10 requests.

1. Use while loop to process each Http Request
2. Get the response using execute( ) method
3. Use 'for' loop to process the Playlist response items. The Playlist items contain Snippet with Video id and Title
4. Create 'for' loop for video list response using method videos( ) method under Playlist items to bring the category id of the video and statistics 
5. Using Category Id, bring Category Value using method videoCategories( ) method
6. Create runtime dictionary with Video Id, Video Title, Category Id, Category Value and Statistics
7. Append the dictionary values to Result Set

`MG: Nice!`

In [11]:
result=[]
video = {}

while playlistitems_list_request:
  

    playlistitems_list_response = playlistitems_list_request.execute()

    
    for playlist_item in playlistitems_list_response["items"]:

        title = playlist_item["snippet"]["title"]
        video_id = playlist_item["snippet"]["resourceId"]["videoId"]

        video[video_id] = title
        
        videos_list_response = youtube.videos().list(
            id=video_id,
            part='id,snippet,statistics').execute()


        for i in videos_list_response['items']:
           
            video_category_response = youtube.videoCategories().list(
            id= i["snippet"]["categoryId"],
            part='snippet').execute()
            
            for categ in video_category_response['items']:
                categoryValue = categ["snippet"]["title"]
                
          
            temp_res = dict(videoId = i['id'], title = video[i['id']], categoryId = i["snippet"]["categoryId"],
                           category = categoryValue)
            
            temp_res.update(i['statistics'])
            result.append(temp_res)

        
        
    playlistitems_list_request = youtube.playlistItems().list_next(
    playlistitems_list_request, playlistitems_list_response)

##### Create DataFrame with Result Set

In [18]:
dfTemp = pd.DataFrame(result)

In [19]:
df = dfTemp[['videoId', 'title', 'categoryId', 'category', 'commentCount', 'likeCount',
             'dislikeCount', 'viewCount']]

df.head()

Unnamed: 0,videoId,title,categoryId,category,commentCount,likeCount,dislikeCount,viewCount
0,9bZkp7q19f0,PSY - GANGNAM STYLE(강남스타일) M/V,10,Music,5034516,11126164,1562419,2620162554
1,RgKAFK5djSk,Wiz Khalifa - See You Again ft. Charlie Puth [...,10,Music,721334,10960983,317668,1927533947
2,OPf0YbXqDm0,Mark Ronson - Uptown Funk ft. Bruno Mars,10,Music,308047,6615916,362934,1764608510
3,e-ORhEE9VVg,Taylor Swift - Blank Space,10,Music,467036,5802214,460688,1694170544
4,fRh_vgS2dFE,Justin Bieber - Sorry (PURPOSE : The Movement),10,Music,576162,6423953,749392,1684774069


##### Create CSV file to process the data further 

In [23]:
#df.to_csv("../assets/ytvideos.csv",index=False)

Part 2 Score |  15/21
-------|-----
Identify: Articulate Problem Statement/Specific goals & success criteria        |        3
Identify: Outline proposed methods & models        | 3 
Parse: Identify risks & assumptions                | 2
Parse: Create local PostgreSQL database            | 2 (not a big deal)
Parse: Query, Sort, & Clean Data                | 3
Parse: Create a Data Dictionary                | 1
Mine: Perform & summarize EDA                | 1
Bonus! Refine: Explain how you intend to tune & evaluate your results | 0

Part 3 Score |  0/15
-------|-----
Mine: Correlate data & run statistical analysis        | 0
Refine: Plot data w visual analysis                | 0
Model: Run model on data (train subset as needed)            | 0
Present: Summarize approach & initial results                | 0
Present: Describe successes, setbacks, & lessons learned  | 0
Bonus: Use 2 or more dataviz tools | 0

MG: You've done a good job so far, but you are a little behind!  We're supposed to be to part 3 by now.  I will give you some time and grade this again when you've caught up.