# Hongfan Lu - Pset 1 - YouTube, Bluesky Data Collection

In [274]:
import pandas as pd

### Q1 (10 pts): Keyword selection

- Economy: Inflation is at record-high levels, and there are new threats of trade conflicts.
- Trump: Whether Donald Trump will be the Republican candidate again will be a major concern for both parties.
- Foreign Policy: With the Israel-Gaza conflict and the Russia-Ukraine war, US foreign policy will be a major concern for US voters.
- Democracy: With the Cambridge Analytica scandal and the January 6th riot, many people are concerned that US democracy is at risk.
- China: US voters will want to know what China policy the Democratic and Republican candidates might pursue.
- Climate: Climate has long been a heated topic between left and right.
- Equality: The Black Lives Matter movement and a series of social events have drawn voters' attention to equality.
- Biden: Like Trump's, Biden's name will appear often, since he is so far the face of the Democratic Party.
- Democrat: the party on the left.
- Republican: the party on the right.

### Q2) YouTube data collection

#### 2a (15 pts): For each of the two YouTube channels (CNN and FoxNews) use the YouTube API to list all the videos uploaded on the channel starting with the most recent. Then go over the videos and check whether the video title contains any of the election-related keywords you selected. If it does save relevant information about the video. Make sure that you have identified at least 50 election-related videos (according to your keywords) per channel, i.e., 100 videos in total.

In [275]:
!pip install --upgrade google-api-python-client --quiet

In [276]:
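# YouTube API key, read from an environment variable instead of being hard-coded
# (the variable name YOUTUBE_API_KEY is an assumption)
import os
API_KEY = os.environ["YOUTUBE_API_KEY"]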
# keyword list
keyword_list = ['Economy', "Trump", "Foreign Policy", "Democracy", "China", "Climate", "Equality", "Biden", "Democrat", "Republican"]

In [277]:
# Import relevant packages
import json
import googleapiclient
import googleapiclient.discovery
import googleapiclient.errors

In [278]:
# Initializing Youtube API
youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=API_KEY)

In [279]:
import nltk
from nltk.stem import PorterStemmer

#### The same word can appear in different forms, so I use stemming to reduce words to their base form

In [280]:
# Initialize NLTK
nltk.download("punkt")
porter = PorterStemmer()
import re

[nltk_data] Downloading package punkt to /Users/hongfanlu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
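
Stemming only helps if both sides of the comparison are stemmed: a stemmed keyword will generally not match an unstemmed title. The sketch below illustrates that idea with the standard library only. `crude_stem` is a hypothetical toy suffix stripper standing in for a real stemmer, and `stem_phrase` and `matching_keywords` are illustrative names, not part of NLTK.

```python
import re

def crude_stem(word):
    """Toy stand-in for a real stemmer: lower-case and strip a few common suffixes."""
    word = word.lower()
    for suffix in ("ies", "es", "s", "y"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_phrase(text, stem=crude_stem):
    """Split `text` into alphanumeric tokens and stem each one."""
    return [stem(token) for token in re.findall(r"[A-Za-z0-9]+", text)]

def matching_keywords(title, keywords, stem=crude_stem):
    """Return the keywords whose stems occur consecutively in the stemmed title."""
    title_stems = stem_phrase(title, stem)
    hits = []
    for keyword in keywords:
        key_stems = stem_phrase(keyword, stem)
        n = len(key_stems)
        # Slide a window of n stems over the title and compare
        if n and any(title_stems[i:i + n] == key_stems
                     for i in range(len(title_stems) - n + 1)):
            hits.append(keyword)
    return hits

print(matching_keywords("Democrats debate new foreign policies on China",
                        ["Democrat", "Foreign Policy", "Economy"]))
```

With NLTK available, passing `stem=PorterStemmer().stem` should give the same structure with a proper stemmer in place of the toy one.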


In [281]:
def search_keyword_videos(keyword_list, channel_name):
    relevant_videos = []
    next_page_token = None
    while len(relevant_videos) < 50:
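        # Note: q is a free-text search across YouTube, so results may include other channels
        request = youtube.search().list(
            part='snippet,id',
            q=channel_name,
            type='video',  # only video results carry item['id']['videoId']
            maxResults=50,
            pageToken=next_page_token,
            order='date'
        )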
        response = request.execute()  
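        # Stem every word of every keyword (lower-cased first)
        stemmed_keywords = set(' '.join(porter.stem(w) for w in keyword.lower().split()) for keyword in keyword_list)
        
        for item in response.get('items', []):
            # Stem the title word by word so it is comparable with the stemmed keywords
            video_title = ' '.join(porter.stem(w) for w in re.findall(r'\w+', item['snippet']['title'].lower()))
            if any(re.search(r'\b{}\b'.format(re.escape(keyword)), video_title) for keyword in stemmed_keywords):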
                video_id = item['id']['videoId']
                # Fetch the statistics of the video to read its view count
                view_count = youtube.videos().list(
                    id=video_id,
                    part='statistics'
                ).execute()
                
                video_info = {
                    'Channel Name': channel_name,
                    'Video ID':video_id,
                    'Video title': item['snippet']['title'],
                    'Video creation time': item['snippet']['publishedAt'],
                    'Video number of views': view_count['items'][0]['statistics']['viewCount']       
                }
                relevant_videos.append(video_info)
            if len(relevant_videos) >= 50:
                break
        next_page_token = response.get('nextPageToken')
        if not next_page_token:
            break

    print(f"Found {len(relevant_videos)} keyword-related videos from {channel_name}.")
    return relevant_videos

In [282]:
cnn_keyword_videoids = search_keyword_videos(keyword_list, "CNN")
foxnews_keyword_videoids = search_keyword_videos(keyword_list, "FoxNews")

Found 50 keyword-related videos from CNN.
Found 50 keyword-related videos from FoxNews.
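
Because `search().list` with `q=` is a free-text search, its hits can come from any channel whose metadata mentions the name. One alternative, sketched here under assumptions, is to walk the channel's uploads playlist. The function takes an already built client (such as `youtube` above) and a `channel_id` that the reader must supply; `iter_channel_uploads` is an illustrative name, and the ordering of the uploads playlist is typically, though not guaranteed to be, newest first.

```python
def iter_channel_uploads(youtube, channel_id, page_size=50):
    """Yield (video_id, title) for every video in a channel's uploads playlist."""
    # Look up the ID of the playlist that holds all uploads of the channel
    channel = youtube.channels().list(part="contentDetails", id=channel_id).execute()
    uploads_id = channel["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

    page_token = None
    while True:
        page = youtube.playlistItems().list(
            part="snippet",
            playlistId=uploads_id,
            maxResults=page_size,
            pageToken=page_token,
        ).execute()
        for item in page.get("items", []):
            snippet = item["snippet"]
            yield snippet["resourceId"]["videoId"], snippet["title"]
        # Follow the pagination token until the playlist is exhausted
        page_token = page.get("nextPageToken")
        if not page_token:
            break
```

Since this is a generator, the caller can stop iterating as soon as 50 keyword matches have been found, which avoids fetching pages that are never used.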


In [283]:
# Extract all video IDs from the list of dictionaries returned by search_keyword_videos

cnn_videoids = [d['Video ID'] for d in cnn_keyword_videoids if 'Video ID' in d]
fox_videoids = [d['Video ID'] for d in foxnews_keyword_videoids if 'Video ID' in d]

#### 2b (15 pts): For each video fetch the 30 most relevant (as sorted by the API) comments. If the video has less than 30 comments, extract as many comments as there are. Make sure that this does not happen very often; if it does, understand why.


In [284]:
from googleapiclient.errors import HttpError

def extract_30_comments(video_id_list): 
    comments_per_video = []
    for vid in video_id_list:
        try:
            request = youtube.commentThreads().list(
                videoId = vid,
                part = "id,snippet,replies",
                textFormat = "plainText",
                order = "relevance",  # 2b asks for the most relevant comments
                maxResults = 30
            )
            response = request.execute()
            
            for item in response["items"]:
                comments = item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
                has_reply = item["snippet"]["totalReplyCount"]
                if has_reply != 0:
                    # The thread embeds only a subset of replies, and the key may be absent
                    replies = [reply['snippet']["textDisplay"]
                               for reply in item.get("replies", {}).get("comments", [])]
                else:
                    replies = None
                comment_info = {
                    'Video ID': vid,
                    'Comment id': item['snippet']['topLevelComment']['id'],
                    'Comment title': item['snippet']['topLevelComment']['snippet']['textOriginal'],
                    'Comment creation time': item['snippet']['topLevelComment']['snippet']['publishedAt'],
                    'Comment number of likes': item['snippet']['topLevelComment']['snippet']['likeCount'],
                    'Comment content': comments,
                    'Replies': replies
                }
                comments_per_video.append(comment_info)
        except HttpError as e:
            if e.resp.status == 403:
                # Report the video and move on; appending a plain string here would mix
                # non-dict entries into the list that later becomes a data frame
                print(f"Comments are disabled for video with ID: {vid}")
            else:
                raise  # Re-raise the exception if it's not a 403 error
    return comments_per_video

In [285]:
cnn_videos_comments = extract_30_comments(cnn_videoids)

In [288]:
fox_videos_comments = extract_30_comments(fox_videoids)
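
To see how often a video yields fewer than 30 comments, the collected rows can be counted per video. This is a minimal sketch, assuming each row is a dictionary with a `'Video ID'` key as produced above; `videos_below_quota` is a hypothetical helper name, and any non-dictionary entries are skipped defensively.

```python
from collections import Counter

def videos_below_quota(comment_rows, video_ids, quota=30):
    """Map each video ID with fewer than `quota` collected comments to its count."""
    counts = Counter(row["Video ID"] for row in comment_rows if isinstance(row, dict))
    # Videos with no rows at all (e.g. comments disabled) get a count of 0
    return {vid: counts[vid] for vid in video_ids if counts[vid] < quota}
```

Calling it as `videos_below_quota(cnn_videos_comments, cnn_videoids)` would list the CNN videos worth inspecting by hand, for example very recent uploads or videos with comments disabled.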

#### 2c (8 pts): Create a Pandas data frame that contains the relevant information about the data you extracted. Each row of your data frame should represent one comment. It should, at least, include the following columns (you can name the columns as you like):
- Channel name
- Video id
- Video title
- Video creation time
- Video number of views
- Comment id
- Comment title
- Comment creation time
- Comment number of likes

Include any other information that you think might be relevant for further analyses. The next two assignments will build on this one and you will be asked to analyze this data.
Make sure that you have at least 1000 rows per channel.
Print the number of rows in the data frame and display the first few rows of the data frame using head().

#### I now have two lists of dictionaries, generated by `search_keyword_videos` and `extract_30_comments`; I turn both into pandas data frames below

In [291]:
# Video info dataframe
cnn_video_df = pd.DataFrame(cnn_keyword_videoids)
fox_video_df = pd.DataFrame(foxnews_keyword_videoids)

# Comments info dataframe
cnn_comment_df = pd.DataFrame(cnn_videos_comments)
fox_comment_df = pd.DataFrame(fox_videos_comments)

In [292]:
# Join dataframes for cnn and fox by "Video ID"
cnn_df = pd.merge(cnn_comment_df,cnn_video_df, on = 'Video ID', how = 'left' )
fox_df = pd.merge(fox_comment_df,fox_video_df, on = 'Video ID', how = 'left' )

In [293]:
cnn_df.columns

Index(['Video ID', 'Comment id', 'Comment title', 'Comment creation time',
       'Comment number of likes', 'Comment content', 'Replies', 'Channel Name',
       'Video title', 'Video creation time', 'Video number of views'],
      dtype='object')

In [294]:
fox_df.columns

Index(['Video ID', 'Comment id', 'Comment title', 'Comment creation time',
       'Comment number of likes', 'Comment content', 'Replies', 'Channel Name',
       'Video title', 'Video creation time', 'Video number of views'],
      dtype='object')

In [295]:
# Stack the two data frames, renumbering rows so that the index is unique
yt_comments = pd.concat([cnn_df,fox_df], ignore_index=True)

In [296]:
yt_comments.head()

Unnamed: 0,Video ID,Comment id,Comment title,Comment creation time,Comment number of likes,Comment content,Replies,Channel Name,Video title,Video creation time,Video number of views
0,zH9m20-wq2M,UgzqyxD7DCUCNWkRPLZ4AaABAg,This is so accurate 😂😂😂,2024-01-29T20:04:44Z,0,This is so accurate 😂😂😂,,CNN,Every Cable News Debate #funny #skit #democrat...,2024-01-29T19:27:30Z,16
1,zH9m20-wq2M,UgxQSXilM5XR909hPXx4AaABAg,Walmart Tom Brady on the weather 🎉🎉,2024-01-29T19:50:36Z,0,Walmart Tom Brady on the weather 🎉🎉,,CNN,Every Cable News Debate #funny #skit #democrat...,2024-01-29T19:27:30Z,16
2,zH9m20-wq2M,UgzScU1Hdo0wmF1A6bp4AaABAg,projecting much????,2024-01-29T19:48:46Z,0,projecting much????,,CNN,Every Cable News Debate #funny #skit #democrat...,2024-01-29T19:27:30Z,16
3,zH9m20-wq2M,UgxQi2FRO3a6IukfvL94AaABAg,What county do you live in where the mainstrea...,2024-01-29T19:44:59Z,0,What county do you live in where the mainstrea...,,CNN,Every Cable News Debate #funny #skit #democrat...,2024-01-29T19:27:30Z,16
4,sdYmRp2K19g,UgyM3XPpNXRFrbezgWl4AaABAg,How can Trump tank the border Bill ? He not th...,2024-01-29T20:15:33Z,0,How can Trump tank the border Bill ? He not th...,,CNN,Sen. Warren weighs in on Trump&#39;s desire to...,2024-01-29T18:38:34Z,26371
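
The last title above contains the HTML entity `&#39;`, which suggests that the titles returned by the search call arrive HTML-escaped. A small sketch of decoding them with the standard library before keyword matching or analysis; `clean_title` is an illustrative name, not part of the notebook.

```python
import html

def clean_title(raw_title):
    """Decode HTML entities such as &#39; or &amp; in an API-returned title."""
    return html.unescape(raw_title)

print(clean_title("Sen. Warren weighs in on Trump&#39;s desire to..."))
```

Applied to the data frame, this could be done with `yt_comments['Video title'].map(clean_title)`.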


#### 2d (2 pt): Write code to turn the data frame into a CSV and save it in a file called “yt_comments.csv”.

In [297]:
yt_comments.to_csv('yt_comments.csv', index = True)

### Q3) Blue Sky data collection

#### 3a (15 pts): For each of the two Blue Sky accounts (The Washington Post and the New York Times) use the Blue Sky API to fetch all posts posted by each account. Go over all posts and check whether the posts’ text contains the election-related keywords you select in Q1. Save relevant information about the posts that are about the election. If you are not able to identify at least 50 posts per account, try expanding your list of keywords.

In [298]:
!pip install atproto --quiet

In [299]:
import json
from atproto import Client, models

In [300]:
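# Credentials are read from environment variables instead of being hard-coded
# (the variable names BSKY_USERNAME and BSKY_APP_PASSWORD are assumptions)
import os
USERNAME = os.environ["BSKY_USERNAME"]
APP_PASSWORD = os.environ["BSKY_APP_PASSWORD"]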
client = Client()
client.login(USERNAME, APP_PASSWORD)

ProfileViewDetailed(did='did:plc:yfftm3wglvnerwrp5z7qpnwn', handle='tomatofriedegghl.bsky.social', avatar=None, banner=None, description=None, display_name='Louise Lu', followers_count=6, follows_count=10, indexed_at='2024-01-10T01:17:53.077Z', labels=[], posts_count=3, viewer=ViewerState(blocked_by=False, blocking=None, blocking_by_list=None, followed_by=None, following=None, muted=False, muted_by_list=None, py_type='app.bsky.actor.defs#viewerState'), py_type='app.bsky.actor.defs#profileViewDetailed')

In [301]:
# feed_data_washingtonpost = client.get_author_feed(actor = 'washingtonpost.com')
# feed_data_nytimes = client.get_author_feed(actor = 'nytimes.com')

In [302]:
def get_user_posts(client, user, max_posts):
    all_post = []
    cursor = None
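    # Stem every word of every keyword (lower-cased first)
    stemmed_keywords = set(' '.join(porter.stem(w) for w in keyword.lower().split()) for keyword in keyword_list)
    
    while len(all_post) < max_posts:
        feed_data = client.get_author_feed(user, cursor=cursor)
        
        if feed_data is not None:
            for i, post in enumerate(feed_data.feed):
                # Stem the post text word by word so it is comparable with the stemmed keywords
                post_text = ' '.join(porter.stem(w) for w in re.findall(r'\w+', feed_data.feed[i].post.record.text.lower()))
                if any(re.search(r'\b{}\b'.format(re.escape(keyword)), post_text) for keyword in stemmed_keywords):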
                    post = {
                        'Account name': user,
                        'Post ID uri':feed_data.feed[i].post['uri'],
                        'Poster handle': feed_data.feed[i].post.author.handle,
                        'Post created_at': feed_data.feed[i].post.record.created_at,
                        'Post text':feed_data.feed[i].post.record.text,
                        'Post like_count': feed_data.feed[i].post['like_count'],
                        'Post reply_count': feed_data.feed[i].post['reply_count'],
                        'Post repost_count': feed_data.feed[i].post['repost_count']
                    }
                    all_post.append(post)
                if len(all_post) >= max_posts:
                    break
            cursor = feed_data.cursor
            if cursor is None:
                break
        else:
            print("No more data available.")
            break
            
    return all_post

In [303]:
washingtonpost_results = get_user_posts(client, 'washingtonpost.com', 100)

In [304]:
nytimes_result = get_user_posts(client, 'nytimes.com', 100)

In [305]:
uri_list_washington = [d['Post ID uri'] for d in washingtonpost_results if 'Post ID uri' in d]
uri_list_nytimes = [d['Post ID uri'] for d in nytimes_result if 'Post ID uri' in d]

#### 3b (15 pts): For each of the election-related posts you identified, collect all of their replies. This should include not just the direct replies, but replies of the replies, and so on; i.e., the complete conversation prompted by the post. Make sure that you have at least 1000 comments per account across all posts.

In [313]:
def get_post_replies(uri_list):
    
    all_comments_replies = []
    for uri in uri_list:
        all_data = client.get_post_thread(uri)
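        # Seed the work list with the direct replies (there may be none)
        queue = list(getattr(all_data.thread, 'replies', None) or [])
        
        while queue: # while there are unprocessed replies
            reply = queue.pop() # take the most recently added reply (depth-first order)
            # Skip deleted or blocked replies, which carry no post data
            if not hasattr(reply, 'post'):
                continue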
            result = {
                'Post ID uri':uri,
                "Reply ID URI": reply.post.uri,
                "Reply poster handle": reply.post.author.handle,
                "Reply created at": reply.post.record.created_at,
                "Reply text": reply.post.record.text,
                "Reply like_count": reply.post.like_count,
                "Reply reply_count": reply.post.reply_count,
                "Reply repost count": reply.post.repost_count
            }
            all_comments_replies.append(result)
            # loop through all replies of this reply (if any) and add them to the queue
            if reply.replies is not None:
                for sub_reply in reply.replies:
                    queue.append(sub_reply)
    
    return all_comments_replies

In [314]:
washingtonpost_comments_replies = get_post_replies(uri_list_washington)
nytimes_comments_replies = get_post_replies(uri_list_nytimes)
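
The traversal above flattens a reply tree with an explicit work list. The sketch below shows the same idea on a plain data structure, additionally recording each reply's parent and depth so that the conversation structure can be rebuilt later. `Node` is a hypothetical stand-in for a thread view and is not an `atproto` type; `flatten_thread` is likewise an illustrative name.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    """Hypothetical stand-in for a thread view: an ID, a text, and nested replies."""
    uri: str
    text: str
    replies: Optional[List["Node"]] = None

def flatten_thread(root):
    """Return one row per reply under `root`, with its parent URI and depth."""
    rows = []
    # Each stack entry is (node, URI of its parent, depth below the root post)
    stack = [(reply, root.uri, 1) for reply in (root.replies or [])]
    while stack:
        node, parent_uri, depth = stack.pop()
        rows.append({"uri": node.uri, "parent": parent_uri,
                     "depth": depth, "text": node.text})
        for child in node.replies or []:
            stack.append((child, node.uri, depth + 1))
    return rows

root = Node("post", "original post",
            [Node("r1", "first reply", [Node("r1a", "nested reply")]),
             Node("r2", "second reply")])
for row in flatten_thread(root):
    print(row)
```

Keeping the parent URI costs one extra column and makes it possible to distinguish direct replies from replies to replies in later analyses.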

#### 3c (8 pts): Create a Pandas data frame that contains the relevant information about the data you extracted. Each row of your data frame should represent one reply. It should, at least, include the following columns (you can name the columns as you like):
- Account name (Washington Post / New York Times)
- Post id
- Post text
- Post creation time
- Post number of likes
- Post number of retweets
- Reply id
- Reply text
- Reply creation time
- Reply number of likes
- Reply number of retweets
Include any other information that you think might be relevant for further analyses.
Print the number of rows in the data frame and display the first few rows of the data frame using head().

In [315]:
washingtonpost_replies_df = pd.DataFrame(washingtonpost_comments_replies)
nytimes_replies_df = pd.DataFrame(nytimes_comments_replies)

In [316]:
washingtonpost_posts = pd.DataFrame(washingtonpost_results)
nytimes_posts = pd.DataFrame(nytimes_result)

In [317]:
nytimes_all = pd.merge(nytimes_replies_df, nytimes_posts, on = 'Post ID uri', how = 'left')
washingtonpost_all = pd.merge(washingtonpost_replies_df, washingtonpost_posts, on = 'Post ID uri', how = 'left')

In [318]:
nytimes_washingtonpost = pd.concat([nytimes_all,washingtonpost_all], ignore_index=True)
nytimes_washingtonpost.head()

Unnamed: 0,Post ID uri,Reply ID URI,Reply poster handle,Reply created at,Reply text,Reply like_count,Reply reply_count,Reply repost count,Account name,Poster handle,Post created_at,Post text,Post like_count,Post reply_count,Post repost_count
0,at://did:plc:eclio37ymobqex2ncko63h4r/app.bsky...,at://did:plc:qo55faqv63ih6uxshusn3mke/app.bsky...,helmutscholz.bsky.social,2024-01-29T18:31:48.044Z,China's real estate/banking sector looks like ...,0,0,0,nytimes.com,nytimes.com,2024-01-29T18:24:43.450Z,"China Evergrande, a massive property company w...",19,1,9
1,at://did:plc:eclio37ymobqex2ncko63h4r/app.bsky...,at://did:plc:cntyk4pi5cudb7xo4ezlqdoh/app.bsky...,motown.bsky.social,2024-01-28T21:50:55.049Z,There are no conservative voters.\n\nThere's M...,8,1,1,nytimes.com,nytimes.com,2024-01-28T21:44:00.506Z,Nikki Haley is trying to find a way to diminis...,47,12,7
2,at://did:plc:eclio37ymobqex2ncko63h4r/app.bsky...,at://did:plc:p3edo3eknwague2vsigx5dk3/app.bsky...,cinemastrikesback.bsky.social,2024-01-28T21:59:52.952Z,"Man, I know plenty of never Trump moderate con...",1,2,0,nytimes.com,nytimes.com,2024-01-28T21:44:00.506Z,Nikki Haley is trying to find a way to diminis...,47,12,7
3,at://did:plc:eclio37ymobqex2ncko63h4r/app.bsky...,at://did:plc:cntyk4pi5cudb7xo4ezlqdoh/app.bsky...,motown.bsky.social,2024-01-28T22:04:13.346Z,"Exactly, that's the ""everyone else"" part.",3,1,0,nytimes.com,nytimes.com,2024-01-28T21:44:00.506Z,Nikki Haley is trying to find a way to diminis...,47,12,7
4,at://did:plc:eclio37ymobqex2ncko63h4r/app.bsky...,at://did:plc:p3edo3eknwague2vsigx5dk3/app.bsky...,cinemastrikesback.bsky.social,2024-01-28T22:23:28.400Z,Ah misunderstood you,1,0,0,nytimes.com,nytimes.com,2024-01-28T21:44:00.506Z,Nikki Haley is trying to find a way to diminis...,47,12,7


#### 3d (2 pt): Write code to turn the data frame into a CSV and save it in a file called “bsky_replies.csv”.

In [319]:
nytimes_washingtonpost.to_csv('bsky_replies.csv', index = True)