Creator: Riley Cohen

# Background

In this project I analyze YouTube videos. To do this, I first use YouTube's APIs to collect data from YouTube. External APIs are a huge part of running a software business, and it's important to know what they are and how to use them.

_________________________________________________________________________________________________________________

# Google Cloud Platform: API Platform
To set up a Google Cloud Platform project, go to: https://console.developers.google.com/

    a. Create a project by clicking the down arrow next to where it says "Google API", then click "New Project"

    b. Click "Enable APIs and Services", then search for the YouTube API and enable "YouTube Data API v3"

    c. Navigate back to the cloud dashboard 

    d. Click the key symbol below where it says "API" on the left side of the screen

    e. Click "Create Credentials", then "API Key", and fill in the required info 

    f. Now you are done and can use the YouTube APIs

_________________________________________________________________________________________________________________
# Directions 

1. Hit the YouTube search.list( ) API to get random video IDs within the USA
    - The API returns many of the same video IDs, so I pass a random word as the query to make the search "random" enough.
    - I extracted 2500 IDs on each notebook run
        - The maxResults parameter helps with efficiency because it allows up to 50 results per call
    
    
2. Hit the videos.list( ) API to fetch data on those video IDs
    - The id parameter helps with efficiency because it accepts a comma-separated list of video IDs, so look into how it differs from the videoId parameter.
    

3. Hit the commentThreads.list( ) API to fetch comments for the videos
    - https://www.geeksforgeeks.org/how-to-extract-youtube-comments-using-youtube-api-python/
    - Some videos have no comments or have comments disabled, which causes the API call to raise an error.
        - These errors have to be caught.
    
4. Hit the channels.list( ) API to fetch data on the YouTube channel associated with each video ID

5. Do some basic cleaning to make the dataframe nice
    - Change column name(s) 
    - Reorder the column names 
    - Delete duplicates

6. Export the dataframe to a csv
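
The steps above differ a lot in quota cost, which is why the search step tends to exhaust a key first. The sketch below estimates the units used by one run. The per-call costs and the 10,000-unit daily quota are assumptions based on Google's published quota table and may change; the call counts follow this notebook (50 results per search, one comment request per video).

```python
import math

# Assumed quota costs per call (check the current YouTube Data API quota docs).
COST_PER_CALL = {"search.list": 100, "videos.list": 1,
                 "commentThreads.list": 1, "channels.list": 1}
DAILY_QUOTA = 10_000  # assumed default quota per project per day

def estimate_units(n_videos, batch_size=50):
    """Estimate quota units for one notebook run that collects n_videos IDs."""
    batches = math.ceil(n_videos / batch_size)
    calls = {
        "search.list": batches,           # one search per 50 results
        "videos.list": batches,           # 50 video IDs per request
        "commentThreads.list": n_videos,  # one request per video
        "channels.list": batches,         # 50 channel IDs per request
    }
    return {name: n * COST_PER_CALL[name] for name, n in calls.items()}

units = estimate_units(2500)
print(units, "total:", sum(units.values()), "of", DAILY_QUOTA)
```

Under these assumptions a single run uses most of one project's daily quota, which matches the need for several API keys.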
_________________________________________________________________________________________________________________
# General Tips 

- Start with small iterations and then go to large ones when you know your code is solid
- Try not to re-use variable names in case you mess up your code and need the original variable back 
    - If you incrementally add data to a dataset, it is useful to save copies of the old dataset as you go step by step
- Create multiple projects, and hence multiple API keys, on the Google Cloud Platform
    - Keep track of them however you see fit 
    - You may need to switch between them as you go, especially when using the search API.
- Save the data as you finish each part
    - Maybe as a csv or something else 
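
One possible way to follow the "save as you go" tip is a small checkpoint helper. This is only a sketch: the function names and the file name are hypothetical, and JSON is used here simply because the intermediate results are lists and dictionaries.

```python
import json
import tempfile
from pathlib import Path

def save_checkpoint(data, path):
    """Write intermediate results (e.g. a list of video IDs) to a JSON file."""
    path = Path(path)
    path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    return path

def load_checkpoint(path, default=None):
    """Return the saved data, or `default` if no checkpoint exists yet."""
    path = Path(path)
    if not path.exists():
        return default
    return json.loads(path.read_text(encoding="utf-8"))

# Hypothetical file name in the system temp directory; any writable path works.
checkpoint_path = Path(tempfile.gettempdir()) / "video_ids_checkpoint.json"
save_checkpoint(["jJfCI3FL9WI", "XgvR3y5JCXg"], checkpoint_path)
print(load_checkpoint(checkpoint_path))
```

Loading the checkpoint at the top of a later cell avoids spending quota again on a step that already finished.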

Here are the libraries I used:
- import random
- import requests
- from googleapiclient.discovery import build 
- import numpy as np
- import pandas as pd
- from googleapiclient.errors import HttpError #Catch errors

_________________________________________________________________________________________________________________

# First let's use the YouTube search.list( ) API to fetch some random video IDs

Note: Due to the limit on API calls, I had to create multiple Google projects to collect a sufficient amount of data. That's why you will see multiple API keys throughout this notebook.
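
Switching keys by hand works, but the idea can also be sketched as a small helper that tries each key in turn. Everything here is illustrative: `QuotaExceeded`, `call_with_key_rotation`, and `fake_search` are hypothetical names, and in the notebook the error to catch would presumably be an `HttpError` with a 403 status, which is an assumption rather than something verified here.

```python
class QuotaExceeded(Exception):
    """Stand-in for the error raised when a key has no quota left."""

def call_with_key_rotation(api_keys, make_call):
    """Return make_call(key) for the first key that still has quota."""
    for key in api_keys:
        try:
            return make_call(key)
        except QuotaExceeded:
            continue  # this key is spent, try the next one
    raise RuntimeError("all API keys are out of quota")

# Fake call for illustration: pretend the first key is exhausted.
def fake_search(key):
    if key == "key-1":
        raise QuotaExceeded(key)
    return {"used_key": key, "items": []}

result = call_with_key_rotation(["key-1", "key-2"], fake_search)
print(result["used_key"])
```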

To limit the number of duplicates, I am going to assign a random word to the query parameter q. Below is an example of how it works.

In [1]:
#First get list of words
import requests

word_site = "https://www.mit.edu/~ecprice/wordlist.10000"

response = requests.get(word_site)
words = response.content.splitlines()
words = [word.decode("utf-8") for word in words]

In [2]:
import random
random.choice(words) #Extract random word from the list

'address'

In [3]:
#Needed to run API's
from googleapiclient.discovery import build 

api_key = 'your key here' #First API key 
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

In [4]:
video_ids = []
check = []

# Create the YouTube resource object once and reuse it for every request
youtube_object = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,
                       developerKey = api_key)

for iteration in range(50):

    # Call the search.list method with a random query word
    search_keyword = youtube_object.search().list(part = "id,snippet",
                                                  regionCode = 'US',
                                                  type = 'video',
                                                  q = random.choice(words),
                                                  maxResults = 50).execute()
    # Extract the results from the search response
    results = search_keyword.get("items", [])

    for result in results:
        video_id = result.get('id').get('videoId')
        if video_id is None:
            check.append(result)  # keep unexpected results for inspection
        else:
            video_ids.append(video_id)



In [5]:
video_ids[0:5]

['jJfCI3FL9WI', 'XgvR3y5JCXg', 'wJnBTPUQS5A', '60ItHLz5WEA', 'xaPepCVepCg']
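
Random query words reduce repeats but do not rule them out, so it may help to drop repeated IDs before spending quota on them. A minimal sketch, with `unique_in_order` as a hypothetical helper name:

```python
def unique_in_order(ids):
    """Drop repeated IDs while keeping the first occurrence of each."""
    return list(dict.fromkeys(ids))

sample = ['jJfCI3FL9WI', 'XgvR3y5JCXg', 'jJfCI3FL9WI', 'wJnBTPUQS5A']
print(unique_in_order(sample))
```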

Now let's use the YouTube videos.list API to fetch data on the videos.

In [6]:
api_key = 'your key here' #Second API key to get around the quota

The videos.list API can only handle 50 video IDs at once, so I am going to break the list up.

In [7]:
import numpy as np

split_ids = []

# Step through the IDs 50 at a time; using len(video_ids) avoids an empty trailing batch
for k in range(0, len(video_ids), 50):
    back_index = k
    front_index = k + 50
    split = ",".join(video_ids[back_index:front_index])
    split_ids.append(split)
print('There are 50 comma separated lists comprised of 50 video ids. Below is an example of the first entry.')
split_ids[0]

There are 50 comma separated lists comprised of 50 video ids. Below is an example of the first entry.


'jJfCI3FL9WI,XgvR3y5JCXg,wJnBTPUQS5A,60ItHLz5WEA,xaPepCVepCg,1-xGerv5FOk,dhYOPzcsbGM,bPs0xFd4skY,H_4e85Q2EjE,NSitLOeMGMg,JW5UEW2kYvc,0a1r0JaONS4,HhjHYkPQ8F0,TTA2buWlNyM,5JRHxPOxmg4,WWCsGEarExg,M-P4QBt-FWw,en5KyaRyRbw,DDXLmYyFu4I,XGUcIDN5kZ0,IC-CwFr29no,4Oc0tBsc91s,Zy7ToV9DvJE,ha5Xm1GkX60,6uTj0uxIl54,ihUfpfcX4l4,mIxlvVlOIS0,aCKY_kxB6oY,llGQMlkgkpk,_DEHbOu6A3w,mZoexkeUMOI,WrHDquZ-tj0,WKfo9naU2Nw,1JKfCE6GUw4,RevMi122nnU,JnX2BoZE9w4,Qvuas0Yz8PU,AOeY-nDp7hI,UvJy1KxNwZc,XMCpuvTDXM8,rq_qX8dGyeQ,8EcZPEucdXk,Az-mGR-CehY,TF9I1GxNdJQ,Zm4gOozgKVg,FzRZ91mrT9k,6Htn1x-_-is,bM7SZ5SBzyY,wJJ1irx9tOA,EkDiGxkgkEU'
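
The same batching is needed again for the channel IDs later in this notebook, so it could be wrapped in a helper. This is a sketch only; `chunked_id_strings` is a hypothetical name, and the IDs in the usage lines are made up.

```python
def chunked_id_strings(ids, size=50):
    """Split a list of IDs into comma-separated strings of at most `size` IDs."""
    return [",".join(ids[i:i + size]) for i in range(0, len(ids), size)]

# Made-up IDs, just to show the batch sizes
batches = chunked_id_strings([f"id{n}" for n in range(120)])
print(len(batches), [b.count(",") + 1 for b in batches])
```

Because the range is driven by `len(ids)`, the last batch simply holds whatever is left over and no empty batch is produced.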

# Now let's get the video details by using the videos.list( ) API.

In [8]:
video_details = [] #Will hold all the video details

for ids in split_ids:

    youtube = build('youtube', 'v3', developerKey=api_key)

    request = youtube.videos().list(id = ids,
                                    part="id, snippet, contentDetails, statistics"
                                   )

    response = request.execute()

    results = response.get("items", [])
    
    video_details += results

# Start creating the dataframe by unpacking the data. Examine the packaged data above and extract the data of interest.

In [9]:
import pandas as pd

In [10]:
def extract_data(response):
    '''
    Takes in a dictionary for one item from the YouTube videos.list API response
    Outputs a dataframe with a single row
    
    '''
    #Extract data of interest 
    id_number = response.get('id')
    title = response.get('snippet').get('title') #Title
    channelId = response.get('snippet').get('channelId')
    description = response.get('snippet').get('description') #Description 
    channelTitle = response.get('snippet').get('channelTitle') #ChannelTitle
    categoryId = response.get('snippet').get('categoryId')
    content = {'id':id_number,'title':title,'description':description, 'channelTitle':channelTitle
              ,'channelId':channelId,'categoryId':categoryId}
    stats = response.get('statistics') #Stats
    stats.update(content) #Combine stats and content

    #Now initialize the dataframe
    vals = list(stats.values())
    keys = list(stats.keys())

    df = pd.DataFrame(vals).T 
    df.columns = keys
    return df

extract_data(video_details[0])

Unnamed: 0,viewCount,likeCount,dislikeCount,favoriteCount,commentCount,id,title,description,channelTitle,channelId,categoryId
0,5629349,40017,1260,0,2582,jJfCI3FL9WI,alan 阿蘭(阿兰) - 青藏高原 Tibetan Plateau 藏/中文版 Tibet...,alan阿蘭於2020年環球綜藝秀演唱新編曲藏語+中文版《青藏高原》\n\n\nSpecia...,Wei H.,UCLxFQD7RvylccQK-jan59jw,10
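
`extract_data` assumes every item carries both a `snippet` and a `statistics` part, and it updates the statistics dictionary inside the response in place. If a part were ever missing, `.get(...)` would return `None` and the next call would fail. A more defensive variant might look like the sketch below; `flatten_video` is a hypothetical name, and the example item is made up.

```python
def flatten_video(item):
    """Flatten one videos.list item into a single dict (one future row)."""
    snippet = item.get("snippet") or {}
    stats = item.get("statistics") or {}
    row = dict(stats)  # copy, so the API response itself is not modified
    row.update({
        "id": item.get("id"),
        "title": snippet.get("title"),
        "description": snippet.get("description"),
        "channelTitle": snippet.get("channelTitle"),
        "channelId": snippet.get("channelId"),
        "categoryId": snippet.get("categoryId"),
    })
    return row

# Made-up item with no statistics part
example = {"id": "abc123", "snippet": {"title": "Demo", "channelId": "UC_demo"}}
print(flatten_video(example))
```

A list of such dictionaries can be passed to `pd.DataFrame(...)` in one call, which avoids concatenating many single-row dataframes.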


In [11]:
#Now create the dataframe
df = pd.concat([extract_data(details) for details in video_details])
df = df.fillna(0)
print(df.shape)
df.head()

(2500, 11)


Unnamed: 0,viewCount,likeCount,dislikeCount,favoriteCount,commentCount,id,title,description,channelTitle,channelId,categoryId
0,5629349,40017,1260,0,2582,jJfCI3FL9WI,alan 阿蘭(阿兰) - 青藏高原 Tibetan Plateau 藏/中文版 Tibet...,alan阿蘭於2020年環球綜藝秀演唱新編曲藏語+中文版《青藏高原》\n\n\nSpecia...,Wei H.,UCLxFQD7RvylccQK-jan59jw,10
0,18197699,126507,3321,0,3432,XgvR3y5JCXg,"Alan,Alan, Alan, Alan, Alan, Ow, Ow, Steve, St...","Alan,Alan, Alan, Alan, Alan, Ow, Ow, Steve, St...",sgg17,UCv2yVXn__RBsu4j6JiRVmGg,15
0,851004923,7248864,174412,0,397236,wJnBTPUQS5A,Alan Walker - The Spectre,△ Merch @ https://store.alanwalker.no △\n\nTh...,Alan Walker,UCJrOtniJ0-NWz37R30urifQ,10
0,2952949613,21902719,575167,0,1175389,60ItHLz5WEA,Alan Walker - Faded,△ Merch @ https://store.alanwalker.no △\n\nWa...,Alan Walker,UCJrOtniJ0-NWz37R30urifQ,10
0,26763126,104285,2254,0,3848,xaPepCVepCg,Alan!.. Alan!.. Steve! | Walk on the Wild Side...,Subscribe and 🔔 to OFFICIAL BBC YouTube 👉 http...,BBC,UCCj956IF62FbT7Gouszaj9w,24


In [12]:
old_df = df.copy() #Save a copy just in case we need to go back

In [25]:
api_key = 'your key here' 

# Now let's get the comments for the videos by using the commentThreads.list( ) API.

In [14]:
from googleapiclient.errors import HttpError #Catch errors

In [27]:
comments = []
videos_with_disabled_comments = []

# Build the client once and reuse it for every video
youtube = build('youtube', 'v3', developerKey = api_key)

for video in video_ids:

    request = youtube.commentThreads().list(videoId = video,
                                            part = 'snippet,replies')

    # Execute the request only once to save quota.
    # An HttpError (e.g. status 403) means the comments are disabled or unavailable.
    try:
        response = request.execute()
    except HttpError:
        response = None

    items = response.get('items', []) if response else []

    if items: #If the video has comments then concatenate them
        comment = ''
        for info in items:
            comment = comment + ' ' + info['snippet']['topLevelComment']['snippet']['textDisplay']
        comments.append(comment)

    else: #Otherwise the video has no comments; keep track of the video id
        videos_with_disabled_comments.append(video)
        comments.append('none')
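
The loop above reads only the first page of comment threads for each video. When more results exist, the response includes a `nextPageToken`, which can be passed back as `pageToken` to get the next page. The sketch below shows the general pattern with a fake two-page response so it runs without an API key; `collect_all_items` and `FAKE_PAGES` are hypothetical, and in the notebook `fetch_page` would wrap the `commentThreads().list(...).execute()` call.

```python
def collect_all_items(fetch_page, max_pages=10):
    """Follow nextPageToken until it runs out or max_pages is reached."""
    items, token = [], None
    for _ in range(max_pages):
        page = fetch_page(token)
        items.extend(page.get("items", []))
        token = page.get("nextPageToken")
        if not token:
            break
    return items

# Fake two-page response standing in for the real API call
FAKE_PAGES = {None: {"items": ["c1", "c2"], "nextPageToken": "p2"},
              "p2": {"items": ["c3"]}}
all_items = collect_all_items(lambda token: FAKE_PAGES[token])
print(all_items)
```

Each extra page is another request against the quota, so the `max_pages` cap is a deliberate trade-off between coverage and cost.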


In [28]:
df['comments'] = comments  
df.head()

Unnamed: 0,viewCount,likeCount,dislikeCount,favoriteCount,commentCount,id,title,description,channelTitle,channelId,categoryId,comments
0,5629349,40017,1260,0,2582,jJfCI3FL9WI,alan 阿蘭(阿兰) - 青藏高原 Tibetan Plateau 藏/中文版 Tibet...,alan阿蘭於2020年環球綜藝秀演唱新編曲藏語+中文版《青藏高原》\n\n\nSpecia...,Wei H.,UCLxFQD7RvylccQK-jan59jw,10,你是歌手天才 我喜欢 The sick wrinkle archaeologically ...
0,18197699,126507,3321,0,3432,XgvR3y5JCXg,"Alan,Alan, Alan, Alan, Alan, Ow, Ow, Steve, St...","Alan,Alan, Alan, Alan, Alan, Ow, Ow, Steve, St...",sgg17,UCv2yVXn__RBsu4j6JiRVmGg,15,If your watching this in 2021 you are the pin...
0,851004923,7248864,174412,0,397236,wJnBTPUQS5A,Alan Walker - The Spectre,△ Merch @ https://store.alanwalker.no △\n\nTh...,Alan Walker,UCJrOtniJ0-NWz37R30urifQ,10,Who can do The Spectre dance? 😆Starts around ...
0,2952949613,21902719,575167,0,1175389,60ItHLz5WEA,Alan Walker - Faded,△ Merch @ https://store.alanwalker.no △\n\nWa...,Alan Walker,UCJrOtniJ0-NWz37R30urifQ,10,"Finally, I can show you what&#39;s next. <a h..."
0,26763126,104285,2254,0,3848,xaPepCVepCg,Alan!.. Alan!.. Steve! | Walk on the Wild Side...,Subscribe and 🔔 to OFFICIAL BBC YouTube 👉 http...,BBC,UCCj956IF62FbT7Gouszaj9w,24,I&#39;d like to thank you for putting CAPTION...
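
The `textDisplay` field contains HTML, which is why entities such as `&#39;` and tags such as `<a h...` show up in the comments column. If plain text is wanted for later analysis, one rough way to clean it is sketched below. The regular expression is a simple heuristic, not a full HTML parser, and `clean_comment` is a hypothetical helper.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # naive pattern for HTML tags

def clean_comment(text):
    """Remove HTML tags and decode entities such as &#39;."""
    without_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())  # collapse repeated whitespace

print(clean_comment("I&#39;d like to <b>thank</b> you"))
```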


In [29]:
df.to_csv(r'/Users/rileycohen/Desktop/data_with_comments.csv') #Export dataframe to csv

In [285]:
df.head()

Unnamed: 0.1,Unnamed: 0,viewCount,likeCount,dislikeCount,favoriteCount,commentCount,id,title,description,channelTitle,channelId,categoryId,comments
0,0,5629349.0,40017.0,1260.0,0.0,2582.0,jJfCI3FL9WI,alan 阿蘭(阿兰) - 青藏高原 Tibetan Plateau 藏/中文版 Tibet...,alan阿蘭於2020年環球綜藝秀演唱新編曲藏語+中文版《青藏高原》\n\n\nSpecia...,Wei H.,UCLxFQD7RvylccQK-jan59jw,10.0,你是歌手天才 我喜欢 The sick wrinkle archaeologically ...
1,0,18197700.0,126507.0,3321.0,0.0,3432.0,XgvR3y5JCXg,"Alan,Alan, Alan, Alan, Alan, Ow, Ow, Steve, St...","Alan,Alan, Alan, Alan, Alan, Ow, Ow, Steve, St...",sgg17,UCv2yVXn__RBsu4j6JiRVmGg,15.0,If your watching this in 2021 you are the pin...
2,0,851004900.0,7248864.0,174412.0,0.0,397236.0,wJnBTPUQS5A,Alan Walker - The Spectre,△ Merch @ https://store.alanwalker.no △\n\nTh...,Alan Walker,UCJrOtniJ0-NWz37R30urifQ,10.0,Who can do The Spectre dance? 😆Starts around ...
3,0,2952950000.0,21902719.0,575167.0,0.0,1175389.0,60ItHLz5WEA,Alan Walker - Faded,△ Merch @ https://store.alanwalker.no △\n\nWa...,Alan Walker,UCJrOtniJ0-NWz37R30urifQ,10.0,"Finally, I can show you what&#39;s next. <a h..."
4,0,26763130.0,104285.0,2254.0,0.0,3848.0,xaPepCVepCg,Alan!.. Alan!.. Steve! | Walk on the Wild Side...,Subscribe and 🔔 to OFFICIAL BBC YouTube 👉 http...,BBC,UCCj956IF62FbT7Gouszaj9w,24.0,I&#39;d like to thank you for putting CAPTION...


# Let's find all the channel information.

In [286]:
channel_ids = list(df['channelId'].astype(str))
split_channel_ids = []
# NOTE: the hard-coded 2500 only covers the first 2500 rows of df;
# range(0, len(channel_ids), 50) would cover every row.
for k in np.arange(0,2500,50):
    back_index = k
    front_index = k + 50
    split = ",".join(channel_ids[back_index:front_index])
    split_channel_ids.append(split)
print('There are 50 comma separated lists comprised of 50 channel ids. Below is an example of the first entry.')
split_channel_ids[0]

There are 50 comma separated lists comprised of 50 channel ids. Below is an example of the first entry.


'UCLxFQD7RvylccQK-jan59jw,UCv2yVXn__RBsu4j6JiRVmGg,UCJrOtniJ0-NWz37R30urifQ,UCJrOtniJ0-NWz37R30urifQ,UCCj956IF62FbT7Gouszaj9w,UCJrOtniJ0-NWz37R30urifQ,UCJrOtniJ0-NWz37R30urifQ,UCNqFDjYTexJDET3rPDrmJKg,UCbKWv2x9t6u8yZoB3KcPtnw,UCFar8ctgAwzJ4OqJI29zzCw,UCakbeXKVr0gPtcwOlkHKptg,UCbKWv2x9t6u8yZoB3KcPtnw,UCJrOtniJ0-NWz37R30urifQ,UCakbeXKVr0gPtcwOlkHKptg,UCqwMBERJTzuUyUU1dGybQKg,UCgn-j1_enys0xckwp_r0F0Q,UCJrOtniJ0-NWz37R30urifQ,UCbKWv2x9t6u8yZoB3KcPtnw,UCakbeXKVr0gPtcwOlkHKptg,UCTJTpwrK4a-ajXs4-Wry09A,UCFar8ctgAwzJ4OqJI29zzCw,UC1oPBUWifc0QOOY8DEKhLuQ,UCM01mCXo3lLJ5QQSAMDTtZw,UCFar8ctgAwzJ4OqJI29zzCw,UCM01mCXo3lLJ5QQSAMDTtZw,UCM01mCXo3lLJ5QQSAMDTtZw,UCJrOtniJ0-NWz37R30urifQ,UCFar8ctgAwzJ4OqJI29zzCw,UCp6_KuNhT0kcFk-jXw9Tivg,UCM01mCXo3lLJ5QQSAMDTtZw,UCYsOOzJ6OfvcdVqJlJKf-Aw,UCJrOtniJ0-NWz37R30urifQ,UCM01mCXo3lLJ5QQSAMDTtZw,UCM01mCXo3lLJ5QQSAMDTtZw,UCHry9N_s_BCfE9JzKARps0g,UCakbeXKVr0gPtcwOlkHKptg,UCRryRAQhcSxa0rh7mrXFpYA,UC_aEa8K-EOJ3D6gOs7HcyNg,UCM01mCXo3lLJ5QQSAMDTtZw,UCWdXa6wvgCt9UZy8dvIf4Qw

In [287]:
channel_responses = []
c = 0

for ids in split_channel_ids:
    print(c, end=' ')
    c+=1
    youtube = build('youtube','v3', 
                    developerKey= api_key) 

    channel_response = youtube.channels().list(id = ids,
                                                 part = "id,statistics"
                                                ).execute()
    
    channel_responses.append(channel_response)


0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 

Create a dataframe with the channel data

In [288]:
# Pull the list of channel items out of each response (empty list if none came back)
responses = [channel_response.get('items', []) for channel_response in channel_responses]
ch_details = []
for response in responses:
    for data in response:

        # Combine the channel statistics with the channel id
        details = data.get('statistics')
        details.update({"channelId":data.get('id')})
        ch_details.append(details)

ch_details[0]

{'viewCount': '9089996511',
 'subscriberCount': '29000000',
 'hiddenSubscriberCount': False,
 'videoCount': '864',
 'channelId': 'UC_aEa8K-EOJ3D6gOs7HcyNg'}
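
As the output above shows, the API returns the counts as strings. If numeric columns are wanted, the values could be converted before building the dataframe. A small sketch, where `to_int_fields` is a hypothetical helper and the field names are the ones in the output above:

```python
def to_int_fields(record, fields=("viewCount", "subscriberCount", "videoCount")):
    """Return a copy of record with the listed count fields converted to int."""
    converted = dict(record)
    for name in fields:
        if converted.get(name) is not None:
            converted[name] = int(converted[name])
    return converted

sample = {'viewCount': '9089996511', 'subscriberCount': '29000000',
          'hiddenSubscriberCount': False, 'videoCount': '864',
          'channelId': 'UC_aEa8K-EOJ3D6gOs7HcyNg'}
print(to_int_fields(sample))
```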

In [289]:
channel_info = pd.concat([pd.DataFrame.from_dict(detail, orient='index').T for detail in ch_details])
print(channel_info.shape)
channel_info.head()

(1929, 5)


Unnamed: 0,viewCount,subscriberCount,hiddenSubscriberCount,videoCount,channelId
0,9089996511,29000000,False,864,UC_aEa8K-EOJ3D6gOs7HcyNg


In [290]:
channel_info.to_csv(r'/Users/rileycohen/Desktop/channel_data.csv') #Export dataframe to csv

Re-index dataframe.

In [291]:
old_df = df.copy() #Make a copy 

index = old_df['id'].astype(str)
old_df = old_df.set_index('id')
#old_df = old_df.loc[index, :]
old_df = old_df.reset_index()

old_df.head() 

Unnamed: 0.1,id,Unnamed: 0,viewCount,likeCount,dislikeCount,favoriteCount,commentCount,title,description,channelTitle,channelId,categoryId,comments
0,jJfCI3FL9WI,0,5629349.0,40017.0,1260.0,0.0,2582.0,alan 阿蘭(阿兰) - 青藏高原 Tibetan Plateau 藏/中文版 Tibet...,alan阿蘭於2020年環球綜藝秀演唱新編曲藏語+中文版《青藏高原》\n\n\nSpecia...,Wei H.,UCLxFQD7RvylccQK-jan59jw,10.0,你是歌手天才 我喜欢 The sick wrinkle archaeologically ...
1,XgvR3y5JCXg,0,18197700.0,126507.0,3321.0,0.0,3432.0,"Alan,Alan, Alan, Alan, Alan, Ow, Ow, Steve, St...","Alan,Alan, Alan, Alan, Alan, Ow, Ow, Steve, St...",sgg17,UCv2yVXn__RBsu4j6JiRVmGg,15.0,If your watching this in 2021 you are the pin...
2,wJnBTPUQS5A,0,851004900.0,7248864.0,174412.0,0.0,397236.0,Alan Walker - The Spectre,△ Merch @ https://store.alanwalker.no △\n\nTh...,Alan Walker,UCJrOtniJ0-NWz37R30urifQ,10.0,Who can do The Spectre dance? 😆Starts around ...
3,60ItHLz5WEA,0,2952950000.0,21902719.0,575167.0,0.0,1175389.0,Alan Walker - Faded,△ Merch @ https://store.alanwalker.no △\n\nWa...,Alan Walker,UCJrOtniJ0-NWz37R30urifQ,10.0,"Finally, I can show you what&#39;s next. <a h..."
4,xaPepCVepCg,0,26763130.0,104285.0,2254.0,0.0,3848.0,Alan!.. Alan!.. Steve! | Walk on the Wild Side...,Subscribe and 🔔 to OFFICIAL BBC YouTube 👉 http...,BBC,UCCj956IF62FbT7Gouszaj9w,24.0,I&#39;d like to thank you for putting CAPTION...


In [292]:
df = old_df.copy() #Save a copy just in case

In [293]:
df.shape

(2542, 13)

In [294]:
len(df['channelId'].unique()) , len(channel_info['channelId'].unique())

(1852, 1837)

In [295]:
(channel_info['channelId'].isin(df['channelId'])).sum()

1929

Merge channel_info and original dataframe.

In [296]:
df = pd.merge(df, channel_info, on = 'channelId' , how = 'inner')
print(df.shape)
df.head()

(2860, 17)


Unnamed: 0.1,id,Unnamed: 0,viewCount_x,likeCount,dislikeCount,favoriteCount,commentCount,title,description,channelTitle,channelId,categoryId,comments,viewCount_y,subscriberCount,hiddenSubscriberCount,videoCount
0,jJfCI3FL9WI,0,5629349.0,40017.0,1260.0,0.0,2582.0,alan 阿蘭(阿兰) - 青藏高原 Tibetan Plateau 藏/中文版 Tibet...,alan阿蘭於2020年環球綜藝秀演唱新編曲藏語+中文版《青藏高原》\n\n\nSpecia...,Wei H.,UCLxFQD7RvylccQK-jan59jw,10.0,你是歌手天才 我喜欢 The sick wrinkle archaeologically ...,16572732,42300,False,184
1,XgvR3y5JCXg,0,18197700.0,126507.0,3321.0,0.0,3432.0,"Alan,Alan, Alan, Alan, Alan, Ow, Ow, Steve, St...","Alan,Alan, Alan, Alan, Alan, Ow, Ow, Steve, St...",sgg17,UCv2yVXn__RBsu4j6JiRVmGg,15.0,If your watching this in 2021 you are the pin...,20965898,8290,False,13
2,wJnBTPUQS5A,0,851004900.0,7248864.0,174412.0,0.0,397236.0,Alan Walker - The Spectre,△ Merch @ https://store.alanwalker.no △\n\nTh...,Alan Walker,UCJrOtniJ0-NWz37R30urifQ,10.0,Who can do The Spectre dance? 😆Starts around ...,9164787242,38000000,False,241
3,60ItHLz5WEA,0,2952950000.0,21902719.0,575167.0,0.0,1175389.0,Alan Walker - Faded,△ Merch @ https://store.alanwalker.no △\n\nWa...,Alan Walker,UCJrOtniJ0-NWz37R30urifQ,10.0,"Finally, I can show you what&#39;s next. <a h...",9164787242,38000000,False,241
4,1-xGerv5FOk,0,1136601000.0,9147128.0,204079.0,0.0,399579.0,Alan Walker - Alone,△ Merch @ https://store.alanwalker.no △\n\nTh...,Alan Walker,UCJrOtniJ0-NWz37R30urifQ,10.0,1 BILLION VIEWS!! Wow. I can&#39;t say it eno...,9164787242,38000000,False,241
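
The merge grows the data from 2542 to 2860 rows because `channel_info` contains repeated channel IDs (1929 rows but only 1837 unique IDs), and an inner merge produces one row for every matching pair. One way to avoid this is to keep a single record per channel before merging. The sketch below does that on a list of dictionaries with made-up values; `dedupe_by_key` is a hypothetical helper.

```python
def dedupe_by_key(records, key="channelId"):
    """Keep the first record for each distinct value of `key`."""
    seen = {}
    for record in records:
        seen.setdefault(record[key], record)
    return list(seen.values())

# Made-up channel records with one repeated channelId
rows = [{"channelId": "UC_a", "videoCount": "10"},
        {"channelId": "UC_b", "videoCount": "3"},
        {"channelId": "UC_a", "videoCount": "10"}]
print(dedupe_by_key(rows))
```

On a dataframe, `channel_info.drop_duplicates(subset='channelId')` has the same effect.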


# Now let's do some final cleaning of the data.

In [297]:
#Add URL
url = 'https://www.youtube.com/watch?v='
df['url'] = df['id'].apply(lambda x: url+x)
#Change column Names & drop unneeded columns
df['video_view_count'] = df['viewCount_x']
df['channel_view_count'] = df['viewCount_y']
df = df.drop(columns = ['hiddenSubscriberCount'])
df = df.drop(columns = ['viewCount_x','viewCount_y'])

#Change Ordering of columns
df = df[['id','channelId','categoryId','video_view_count','channel_view_count','commentCount','likeCount','dislikeCount','favoriteCount','subscriberCount',
   'videoCount','title','description','channelTitle','comments','url']]
#df = df.drop_duplicates(subset = ['id'])
print(df.shape)
df.head()

(2860, 16)


Unnamed: 0,id,channelId,categoryId,video_view_count,channel_view_count,commentCount,likeCount,dislikeCount,favoriteCount,subscriberCount,videoCount,title,description,channelTitle,comments,url
0,jJfCI3FL9WI,UCLxFQD7RvylccQK-jan59jw,10.0,5629349.0,16572732,2582.0,40017.0,1260.0,0.0,42300,184,alan 阿蘭(阿兰) - 青藏高原 Tibetan Plateau 藏/中文版 Tibet...,alan阿蘭於2020年環球綜藝秀演唱新編曲藏語+中文版《青藏高原》\n\n\nSpecia...,Wei H.,你是歌手天才 我喜欢 The sick wrinkle archaeologically ...,https://www.youtube.com/watch?v=jJfCI3FL9WI
1,XgvR3y5JCXg,UCv2yVXn__RBsu4j6JiRVmGg,15.0,18197700.0,20965898,3432.0,126507.0,3321.0,0.0,8290,13,"Alan,Alan, Alan, Alan, Alan, Ow, Ow, Steve, St...","Alan,Alan, Alan, Alan, Alan, Ow, Ow, Steve, St...",sgg17,If your watching this in 2021 you are the pin...,https://www.youtube.com/watch?v=XgvR3y5JCXg
2,wJnBTPUQS5A,UCJrOtniJ0-NWz37R30urifQ,10.0,851004900.0,9164787242,397236.0,7248864.0,174412.0,0.0,38000000,241,Alan Walker - The Spectre,△ Merch @ https://store.alanwalker.no △\n\nTh...,Alan Walker,Who can do The Spectre dance? 😆Starts around ...,https://www.youtube.com/watch?v=wJnBTPUQS5A
3,60ItHLz5WEA,UCJrOtniJ0-NWz37R30urifQ,10.0,2952950000.0,9164787242,1175389.0,21902719.0,575167.0,0.0,38000000,241,Alan Walker - Faded,△ Merch @ https://store.alanwalker.no △\n\nWa...,Alan Walker,"Finally, I can show you what&#39;s next. <a h...",https://www.youtube.com/watch?v=60ItHLz5WEA
4,1-xGerv5FOk,UCJrOtniJ0-NWz37R30urifQ,10.0,1136601000.0,9164787242,399579.0,9147128.0,204079.0,0.0,38000000,241,Alan Walker - Alone,△ Merch @ https://store.alanwalker.no △\n\nTh...,Alan Walker,1 BILLION VIEWS!! Wow. I can&#39;t say it eno...,https://www.youtube.com/watch?v=1-xGerv5FOk


In [298]:
df.to_csv(r'/Users/rileycohen/Desktop/Final_data.csv') #Export dataframe to csv