# Semester 3 Coding Portfolio Topic 1 Formative Part 2/2:
# Parsing realistic websites and using Requests, BeautifulSoup, and Selenium

This notebook covers the following topics:
 - Scraping using Selenium

This notebook is expected to take around 5 hours to complete:
 - 2 hours for the formative part
 - 3 hours of self-study on the topics covered by this notebook

Like all topics in this portfolio, this topic is split into two sections:
 - Formative 
 - Summative

<b>Formative section</b><br>
Simply complete the given functions such that they pass the automated tests. This part is graded Pass/Fail; you must get 100% correct!
You can submit your notebook through Canvas as often as you like. Make sure to start doing so early to insure that your code passes all tests!
You may ask for help from fellow students and TAs on this section, and solutions might be provided later on.

<b>Summative section</b><br>
In this section, you are asked to do original work with little guidance, based on the skills you learned in the formative part (as well as lectures and workshops).
This section is graded not just on passing automated tests, but also on quality, originality, and effort (see assessment criteria in the assignment description).

In [45]:
# TODO: Please enter your student number here
STUDENT_NUMBER = 15281914

This workshop will focus on scraping more realistic websites than last workshop's toy example. 

Now that you have a sense of how to use requests and beautifulsoup, we're going to apply it to get information from the CSSci website.

We will then look at two small research projects that uses API and scraping to acquire data, and then analyze it using an API, and then draw research conclusions.

## A realistic example: Parsing the CSSci website
Let's use this to parse the CSSci website we fetched earlier.

In [46]:
import requests
from bs4 import BeautifulSoup

url = "https://www.uva.nl/en/programmes/bachelors/computational-social-science/computational-social-science.html"

response = requests.get(url)

#Parse the html
soup = BeautifulSoup(response.text, 'html.parser')


So, how do we find the element we want, in a complex HTML website like UvA.nl?

One way is to find out the CSS selector for the element you are seeking, you can use the extremely useful Chrome Developer Tools. Open Chrome. Go to the website. Go to Menu > More Tools > Developer Tools.

You can use the "Select element", represented by a diagonal arrow in the upper right corner. 

Click the element on the page that you are interested in: the main description.

You can now see that the description text is inside a _p_ of class _lead_ which is inside a _div_ with class 'c-programmepageheader'.

We can use the CSS selector to select this element:

In [47]:
description = soup.select('p.lead')

The result is a list with a single element in it: the element we are looking for. To get the text of this element, we simply need to:

In [48]:
print(description[0].get_text())

You‚Äôre a data idealist: you care about challenges in society like climate change, inequality, and the impact of AI. And you love data, but you want more than just coding for a tech company. In this Bachelor‚Äôs programme, you‚Äôll combine data science with social and behavioural science to solve real-world problems with technical solutions. From day 1, you‚Äôll work on projects for clients and develop practical skills like programming, data analysis, and project management.


### Exercise 1: "Is Computational Social Science right for you?"

Your task is to find the list of six points on the UvA CSSci website that answers the questions of whether the CSS programme is right for you. 

Use what you've learned to fetch this list, and print each point.


In [49]:
# TODO: Find the points from the website and save them as points
# The soup object is already created in cell 4, so we can use it here
# Based on the test assertion, we need to find list items containing "become a change-maker tackling real-world challenges from day one"

# Your solution here
# Find list items that contain "change-maker" to locate the right list
all_list_items = soup.find_all('li')
target_item = None
for li in all_list_items:
    if 'change-maker' in li.get_text().lower():
        target_item = li
        break

if target_item:
    # Find the parent list (ul or ol) that contains this item
    parent_list = target_item.find_parent(['ul', 'ol'])
    if parent_list:
        # Get all list items from this parent list
        points = parent_list.find_all('li')
    else:
        points = []
else:
    # Alternative: search for headings containing "right" and find nearby lists
    headings = soup.find_all(['h2', 'h3', 'h4'], string=lambda text: text and 'right' in text.lower() if text else False)
    points = []
    for heading in headings:
        # Look for the next ul or ol after this heading
        for element in heading.next_siblings:
            if hasattr(element, 'name') and element.name in ['ul', 'ol']:
                points = element.find_all('li')
                break
        if points:
            break

for i, point in enumerate (points):
    print(f"{i+1}. {point.get_text()}")

# Sample Test
# points[0] should print ('1. become a change-maker tackling real-world challenges from day one')
# assert points[0].get_text() == 'become a change-maker tackling real-world challenges from day one'

1. become a change-maker tackling real-world challenges from day one
2. learn by doing, through weekly, hands-on projects with international classmates
3. work with real data to design digital tools that drive social impact
4. combine academic insights with tech skills to explore today‚Äôs biggest issues
5. master coding step by step, from programming basics to data visualisation
6. grow in a supportive community that values curiosity more than grades


## Selenium: Scraping dynamic pages

While _requests_ is a powerful tool for getting static HTML pages, most websites these days are not static HTML. _requests_ cannot handle Javascript pages or dynamic content.

This is where Selenium comes in. Selenium automates a web browser, allowing it to interact with the JavaScript and dynamically loaded content on the webpage, thereby providing access to content modified or loaded by JavaScript after the initial page load. Selenium can also automate interactions with the website, such as clicking buttons, filling out forms, or navigating through pages. For these points, the requests library alone would be insufficient, as it cannot interact with webpage elements or execute user-like actions.

Selenium in other words runs a complete web browser, and automates clicking on the websites. This allows it to scrape nearly any website. But it also means that it is relatively heavy and slow, compared to a _requests_ based solution. 

*Takeaway: use requests when dealing with static websites or APIs. Use Selenium when dealing with more complex dynamic websites.*


### Installing Selenium

Installing selenium can be a bit of a challenge on its own, as it is dependent on having a chrome/chromium browser installed. Expect a bit of fiddling!

Regardless of the OS, you first need to install the Selenium python package: 





In [50]:
!pip install selenium
!pip install webdriver-manager




When running the cell below, you should now see a web browser opening on your computer, we picked duckduckgo because google.com actively blocks selenium for webscraping

In [51]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

url_search = "https://duckduckgo.com/"
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url_search)


In [52]:
# Find the search bar using the name of the input field
search_bar = driver.find_element("name", "q")

# Type the search term and hit ENTER
search_bar.send_keys("university of amsterdam")

In [53]:
# Find the search bar using the name of the input field
search_bar = driver.find_element("name", "q")

# Type the search term and hit ENTER
search_bar.send_keys("university of amsterdam")
#Click enter!
search_bar.send_keys(Keys.RETURN)

# Wait for some time to let the results load
time.sleep(2) 

#The page will now have made the search!

In [54]:
# Locate the titles and URLs of the search hits.
# results = driver.find_elements(by=By.CSS_SELECTOR,value='a h3') #Select all links under the div with id search: these are the search results.
results = driver.find_elements(By.CSS_SELECTOR, "a[data-testid='result-title-a'], a.result__a")
# Extract and print the top 7 hits 
i=0
for result in results:
    if len(result.text)>0:
        title = (result.text or "").strip()
        url   = (result.get_attribute("href") or "").strip()

        i+=1
        #This is how to get the parent element in selenium. We want the <a> to get the URL.
        # parent_element = result.find_element(by=By.XPATH, value='..')
        if "duckduckgo.com" in url:          # üö´ related-search chips & internal links
            continue
        print(f"Pos:{i}, Title: {title}\nURL: {url}\n")

Pos:1, Title: University of Amsterdam
URL: https://www.uva.nl/en

Pos:2, Title: University of Amsterdam - Wikipedia
URL: https://en.wikipedia.org/wiki/University_of_Amsterdam

Pos:3, Title: University of Amsterdam in Netherlands - US News Best Global Universities
URL: https://www.usnews.com/education/best-global-universities/university-of-amsterdam-502957

Pos:4, Title: University of Amsterdam : Rankings, Fees & Courses Details ...
URL: https://www.topuniversities.com/universities/university-amsterdam

Pos:5, Title: Applying for a Bachelor's programme - University of Amsterdam
URL: https://www.uva.nl/en/education/admissions/bachelors/applying-for-a-degree-programme.html

Pos:6, Title: All 7 Universities in Amsterdam | Rankings & Reviews 2025
URL: https://www.universityguru.com/universities-amsterdam

Pos:7, Title: University of Amsterdam | World University Rankings | THE
URL: https://www.timeshighereducation.com/world-university-rankings/university-amsterdam

Pos:8, Title: University o

-----
To get more results, we can scroll to the bottom of the page, and wait for a moment

In [55]:

# Locate the button by ID and click it
button = driver.find_element(By.ID, "more-results")
button.click()
# Wait for some time to let the results load 
time.sleep(2)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Wait for a bit to allow contents to load, if any
time.sleep(2)
results = driver.find_elements(By.CSS_SELECTOR, "a[data-testid='result-title-a'], a.result__a")
# Extract and print the top 7 hits 
i=0
for result in results:
    if len(result.text)>0:
        title = (result.text or "").strip()
        url   = (result.get_attribute("href") or "").strip()

        i+=1
        #This is how to get the parent element in selenium. We want the <a> to get the URL.
        if "duckduckgo.com" in url:          # 
            continue
        print(f"Pos:{i}, Title: {title}\nURL: {url}\n")

Pos:1, Title: University of Amsterdam
URL: https://www.uva.nl/en

Pos:2, Title: University of Amsterdam - Wikipedia
URL: https://en.wikipedia.org/wiki/University_of_Amsterdam

Pos:3, Title: University of Amsterdam in Netherlands - US News Best Global Universities
URL: https://www.usnews.com/education/best-global-universities/university-of-amsterdam-502957

Pos:4, Title: University of Amsterdam : Rankings, Fees & Courses Details ...
URL: https://www.topuniversities.com/universities/university-amsterdam

Pos:5, Title: Applying for a Bachelor's programme - University of Amsterdam
URL: https://www.uva.nl/en/education/admissions/bachelors/applying-for-a-degree-programme.html

Pos:6, Title: All 7 Universities in Amsterdam | Rankings & Reviews 2025
URL: https://www.universityguru.com/universities-amsterdam

Pos:7, Title: University of Amsterdam | World University Rankings | THE
URL: https://www.timeshighereducation.com/world-university-rankings/university-amsterdam

Pos:8, Title: University o

In [56]:
# Close the browser window
driver.quit()

### Exercise 2: Find the DuckDuckgo ranking of UvA's Computational Social Science. 

Your task is to adapt the code to use selenium to search for 'computational social science', and to find where UvA shows up in the search ranking. 

1. Use Selenium to open duckduckgo.com. And search for 'computational social science'

2. Your script should keep scrolling in the search result until it finds a search result with an HREF that includes with 'uva.nl'. 

3. It should then print the number in the list of the identified link, and how many pages you had to scroll. For instance, if it is the first link found, your code should output: 'UvA was number 1 link for search result, on page 1 in the duckduckgo ranking!'


In [57]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# TODO This function fetches the ranking of a website in a search.
# Takes: a search term string
# Returns: (rank_on_page, page_number) where target domain appears, or (None, None)
# If it does not find the link in the first 5 pages, "return None, None"
def find_duckduckgo_ranking(search_term, url_to_look_for):
    # Your solution here
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    
    try:
        # Open duckduckgo
        driver.get("https://duckduckgo.com/")
        time.sleep(1)
        
        # Find search bar and enter search term
        search_bar = driver.find_element("name", "q")
        search_bar.send_keys(search_term)
        search_bar.send_keys(Keys.RETURN)
        time.sleep(2)
        
        # Track results across pages
        total_rank = 0
        max_pages = 5
        
        for page_num in range(1, max_pages + 1):
            # Get all search result links
            results = driver.find_elements(By.CSS_SELECTOR, "a[data-testid='result-title-a'], a.result__a")
            
            # Check each result
            for result in results:
                url = (result.get_attribute("href") or "").strip()
                title = (result.text or "").strip()
                
                # Skip duckduckgo internal links
                if "duckduckgo.com" in url or len(title) == 0:
                    continue
                
                total_rank += 1
                
                # Check if this result contains the target URL
                if url_to_look_for.lower() in url.lower():
                    driver.quit()
                    return total_rank, page_num
            
            # If we haven't found it, try to load more results
            if page_num < max_pages:
                # Try to click "more results" button if it exists
                try:
                    button = driver.find_element(By.ID, "more-results")
                    button.click()
                    time.sleep(2)
                except:
                    # If button doesn't exist, scroll to bottom to load more
                    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                    time.sleep(2)
        
        # Not found in first 5 pages
        driver.quit()
        return None, None
        
    except Exception as e:
        driver.quit()
        return None, None

# Run the search
rank, page = find_duckduckgo_ranking("computational social science", "uva.nl")

if rank is None:
    print(f"Uva.nl was not listed in the first 5 pages! :( We need to work on our SEO!")
else:
    print(f"UvA was number {rank} link for search result, on page {page} in the DuckDuckGo ranking!")
    

UvA was number 66 link for search result, on page 3 in the DuckDuckGo ranking!


## More advanced API: How toxic are YouTube comments? Combining YouTube API and Perspective API
In this part of the guide, we will use YouTube API to collect comments from videos. 

### About authentication
APIs often require users to sign up and use credentials. These are often based on "API keys" which link a call to the API to a particular user or registered application. There are many reason for APIs requiring authentifiatoin: by requiring credentials, API providers can control access to the data or services they offer, preventing unauthorized access and abuse, and to ensure rate limiting - that is, managing the load on the server by restricting the number of API calls from a single user or application within a given time frame.


You can sign up to the YouTube API at https://developers.google.com/youtube/v3 
Read about the process on: https://developers.google.com/youtube/v3/getting-started

Google offers a range of powerful and interesting APIs, both for data collection and analysis. Have a look and browse their offerings.



### Fetching YouTube comments
We will now use the YouTube API to fetch comments associated to a particular YouTube video.

You'll find the API documentation here: https://developers.google.com/youtube/v3/docs/commentThreads 


In [58]:
import requests
import time

# The API key is your key to the YouTube API. You will neeed to get your own. To do so, visit https://developers.google.com/youtube/v3/getting-started
# TODO Enter your API key here
api_key = "AIzaSyCQ01CdAHdDW8t1hGhxzmQy6BkNIeQ28p8"
# Your solution here - Replace YOUR_YOUTUBE_API_KEY_HERE with your actual API key
video_id = "dQw4w9WgXcQ"  
# Replace with the ID of the video you are interested in. 
# You can find the ID by going to a video in Youtube, and getting the string after v= in the URL. For instance, i0EfLMe5FGk in https://www.youtube.com/watch?v=i0EfLMe5FGk

url = f"https://www.googleapis.com/youtube/v3/commentThreads"
params = {
    'part': 'snippet',
    'videoId': video_id,
    'maxResults': 100,  # max number of comments to fetch 
    'textFormat': 'plainText',
    'key': api_key,
}

all_comments = []

maximum_pages = 3 #How many pages to get at most

for page in range(maximum_pages):
    print(f"Getting page {page}...")
    response = requests.get(url, params=params)
    if response.status_code == 200:
        result_json = response.json()
        all_comments.extend([item['snippet']['topLevelComment']['snippet']['textDisplay'] for item in result_json.get('items', [])])

        # Many APIs provide the result page by page. If there is another page, this API returns a nextPageToken, that we can
        # send to the API to get the next page in line. If there are no more comments, there will be no such token.
        if 'nextPageToken' in result_json:
            params['pageToken'] = result_json['nextPageToken']
            
            # Ensure you don't hit the quota limits by adding a delay
            time.sleep(1)
        else: #No token, so no more pages
            break
    else:
        print("Error: ", response.status_code)
        break

# Now 'all_comments' list contains all the comments from the video
print(f"Done. Fetched {len(all_comments)} comments!")


Getting page 0...


Getting page 1...
Getting page 2...
Done. Fetched 300 comments!


In [59]:
#Print the first five comments
print(all_comments[:5])

['can confirm: he never gave us up', 'What,link says FREE ROBUX', 'damn my teacher give me this as a problem for cp :_))', 'My pc wallpaper rickroll meüòä', 'yes yes!']


### Perspective Toxicity API

The Perspective API, developed by Jigsaw and Google's Counter Abuse Technology team, is a tool that leverages machine learning to score toxicity in online conversation. The API provides various models to assess different aspects of conversations, like toxicity, severe toxicity, and threat, allowing developers and service providers to automatically moderate content that is harmful, abusive, or likely to drive users away, thus fostering healthier and more respectful online interactions.

Perspective API is an example of an API that can be used to analyze your own data, rather than just fetching existing data.

In this case, I will provide the API key, as it takes some time to receive one from Google.


Perspective API is easiest to use through the Python package offered by Google. Many APIs offer Python packages to make it easier to use the API. APIs offer packages to simplify and streamline the interaction between the end-user's code and the API‚Äôs endpoints, abstracting the intricacies of HTTP requests, response handling, and error handling.


In [60]:
#Install the package
!pip install google-api-python-client



In [61]:
from googleapiclient import discovery
import time 

In [62]:
PERSPECTIVE_API_KEY = 'AIzaSyAxTjb4F0tKxk-X6_s3Nd5E1VHKbok8KuU'

# The text string you want to analyze
message = "I fart in your general direction! Your mother was a hamster and your father smelt of elderberries!"

client = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  developerKey=PERSPECTIVE_API_KEY,
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,
)

analyze_request = {
  'comment': { 'text': message },
  'requestedAttributes': {'TOXICITY': {}}
}

# Don't overload the API
time.sleep(0.5)

response = client.comments().analyze(body=analyze_request).execute()

toxicity = response['attributeScores']['TOXICITY']['summaryScore']['value']

print(f"The message scores {toxicity} in toxicity")


The message scores 0.85850734 in toxicity


### Mini-project 1: How toxic are the YouTube comments?

Your task is to write a script that: 

1. Takes a list of YouTube video IDs and collects the first 100 comments from each video.
2. Calculate the toxicity of each comment on the videos using Perspective API, and stores the result in a pandas Dataframe.
3. Shows how toxic the comments are on average according to the Perspective API. (Use for instance np.mean() to calculate the average toxicity.)

Select a couple of Youtube videos of your own choice, and use your code to analyze which of them has the most toxic comments. Reflect about the meaning of the findings.


In [63]:
import requests
import time
from googleapiclient import discovery
import pandas as pd

# TODO Enter your API key here
youtube_api_key = "AIzaSyCQ01CdAHdDW8t1hGhxzmQy6BkNIeQ28p8"
# Your solution here - Replace YOUR_YOUTUBE_API_KEY_HERE with your actual API key

PERSPECTIVE_API_KEY = 'AIzaSyAxTjb4F0tKxk-X6_s3Nd5E1VHKbok8KuU'


client = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  developerKey=PERSPECTIVE_API_KEY,
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,
)

# TODO This function should returns a list of comments (strings) associated to a video on Youtube, solution function
def fetch_comments_for_video(video_id, max_comments_to_fetch=100):

    print(f"Fetching comments for video {video_id}.")

    # Your solution here
    url = "https://www.googleapis.com/youtube/v3/commentThreads"
    params = {
        'part': 'snippet',
        'videoId': video_id,
        'maxResults': 100,  # max number of comments per page
        'textFormat': 'plainText',
        'key': youtube_api_key,
    }
    
    all_comments = []
    maximum_pages = (max_comments_to_fetch // 100) + 1  # Calculate pages needed
    
    for page in range(maximum_pages):
        if len(all_comments) >= max_comments_to_fetch:
            break
            
        print(f"Getting page {page + 1}...")
        response = requests.get(url, params=params)
        
        if response.status_code == 200:
            result_json = response.json()
            comments_on_page = [item['snippet']['topLevelComment']['snippet']['textDisplay'] 
                               for item in result_json.get('items', [])]
            all_comments.extend(comments_on_page)
            
            # Check if we've reached the desired number of comments
            if len(all_comments) >= max_comments_to_fetch:
                all_comments = all_comments[:max_comments_to_fetch]
                break
            
            # Check if there are more pages
            if 'nextPageToken' in result_json:
                params['pageToken'] = result_json['nextPageToken']
                time.sleep(1)  # Rate limiting
            else:
                break  # No more pages
        else:
            print(f"Error: {response.status_code}")
            break

    # Now 'all_comments' list contains all the comments from the video
    print(f"Done. Fetched {len(all_comments)} comments!")
    return all_comments

# This is provideed to students
def fetch_comments_for_videos(list_of_videos,max_comments_to_fetch=100):
    list_of_comments = []
    for video in list_of_videos:
        comments = fetch_comments_for_video(video['video_id'],max_comments_to_fetch)
        for comment in comments:
            list_of_comments.append(video | {'comment': comment}) #Add comment information to video information
                 
    return pd.DataFrame(list_of_comments)


# This measures the toxicity of a single message using the Perspective API
def measure_toxicity_of_message(message):
    
    analyze_request = {
      'comment': { 'text': message },
      'requestedAttributes': {'TOXICITY': {}}
    }

    time.sleep(0.1)

    response = client.comments().analyze(body=analyze_request).execute()

    toxicity = response['attributeScores']['TOXICITY']['summaryScore']['value']

    return toxicity

#Trump's state of the union vs Biden's state of the union
comments = fetch_comments_for_videos([{'video_id':'_oQQX3mTOjc'},{'video_id':'bm-hdFIEnnQ'}])


Fetching comments for video _oQQX3mTOjc.
Getting page 1...
Done. Fetched 100 comments!
Fetching comments for video bm-hdFIEnnQ.
Getting page 1...
Done. Fetched 100 comments!


In [64]:
#Prepare the dataframe
comments['toxicity'] = None
comments['analyzed'] = False    

In [65]:
#This is a simple way of structuring your code when scraping many pages.
i = 0
nrfailed = 0
while(True):    
    #Fetch a random row
    left_to_process = comments.loc[comments['analyzed']==False]
    
    if len(left_to_process)==0:
        print(f"We're done! Analysis failed for {nrfailed} of {len(comments)}.")
        break
    
    else:
        comment = left_to_process.sample(1)
        index = comment.index[0]
        message = comment.comment.values[0]

        #Keep track of progress. Every 10 measures, we print out a progress report
        i+=1
        if i%10==0:
            print(f"{len(comments.loc[comments['analyzed']==False])} comments left out of {len(comments)}...")

        try:
            #Analyze toxicity
            toxicity = measure_toxicity_of_message(message)
            comments.loc[index,'toxicity'] = toxicity

        except Exception as e:
            #The API will fail for many comments, for instance if they are too short or in the wrong language.
            nrfailed+=1
        finally:
            comments.loc[index,'analyzed'] = True

191 comments left out of 200...
181 comments left out of 200...
171 comments left out of 200...
161 comments left out of 200...
151 comments left out of 200...
141 comments left out of 200...
131 comments left out of 200...
121 comments left out of 200...
111 comments left out of 200...
101 comments left out of 200...
91 comments left out of 200...
81 comments left out of 200...
71 comments left out of 200...
61 comments left out of 200...
51 comments left out of 200...
41 comments left out of 200...
31 comments left out of 200...
21 comments left out of 200...
11 comments left out of 200...
1 comments left out of 200...
We're done! Analysis failed for 32 of 200.


In [66]:
print(comments)

        video_id                                            comment  toxicity  \
0    _oQQX3mTOjc                                                üíØüëçüòé      None   
1    _oQQX3mTOjc  Dirty Christopher Wray has a lot of very deep ...  0.319637   
2    _oQQX3mTOjc                         Wray was always a dirt-bag  0.497774   
3    _oQQX3mTOjc  All #274 Were Demon Cratics  ??? Oh Lord have ...  0.110002   
4    _oQQX3mTOjc                               You GOOO TRUMP \n‚ù§‚ù§‚ù§  0.016713   
..           ...                                                ...       ...   
195  bm-hdFIEnnQ  ◊ë◊ô◊ë◊ô ◊õ◊ú ◊î◊õ◊ë◊ï◊ì, ◊ê◊ï◊î◊ë◊ô◊ù ◊ê◊ï◊™◊ö ◊®◊ê◊© ◊î◊û◊û◊©◊ú◊î ◊©◊ú◊†◊ï ◊î◊ô◊ß...      None   
196  bm-hdFIEnnQ         Bu katil durdulamazsa ABD b√ºy√ºk tehlikede.      None   
197  bm-hdFIEnnQ  The truth is everyones, not only yours. Tell t...  0.184591   
198  bm-hdFIEnnQ                               Well done you Isreal  0.048842   
199  bm-hdFIEnnQ                Support from Sweden! Go

In [67]:
#Let's compare the toxicities.
# We look at the mean toxicity for the successfully analyzed comments:
print(comments.loc[~comments['toxicity'].isna()].groupby(['video_id'])['toxicity'].mean())

video_id
_oQQX3mTOjc    0.291141
bm-hdFIEnnQ    0.078882
Name: toxicity, dtype: object


## (Optional) Mini-project 2: Does the YouTube algorithm radicalize?

We will now go through a more complex exercise, for a small research paper.

Researchers have argued that the YouTube autoplay feature can lead to radicalization. The platform's recommendation system is designed to keep users engaged for as long as possible. The algorithm achieves this by suggesting content that it predicts the user will find interesting or compelling, based on their viewing history, search terms, and other interactions. 

However, critics argue that this approach can create a "filter bubble," where users are only exposed to content and perspectives similar to those they have already encountered, thereby reinforcing existing beliefs and opinions. There are concerns that this can lead to the incremental presentation of more extreme content, as users are gradually exposed to increasingly radical viewpoints in a bid to sustain engagement. This phenomenon, sometimes referred to as "algorithmic radicalization," has sparked debates about the ethical responsibilities of social media and content-sharing platforms and their role in the spread of misinformation, hate speech, and extremist ideologies. 

This would suggest that the comments on videos become more and more toxic as the algorithm proceeds!

In this exercise, we are going to explore this hypothesis by tracing the YouTube autoplay feature from a given starting point.

You will use the code we developed in our previous practical to fetch the comments on the video and evaluate their average toxicity. In case you did not complete that task, we will give you the solution here.

While the Youtube API gives access to some features, the "next video" feature is not accessible through the API. For this, we therefore need to scrape the interface.

#### Task: 

1. Choose a video to start from. This might for instance be a political video, where you would expect a radicalization loop to take place.

2. Write code to repeatedly go to the "next upcoming video" and store the number of steps taken, the video id and the title. (Remember to pause between each fetch, so the page has time to load.)

3. Store at least 10 steps of "next video", so that a trend can be spotted.

4. Use your code from the previous practical, where you collected comments on YouTube videos using the API, and calculated the toxicity of each comment.

5. For each video, calculate the average toxicity of the comments. 

6. Plot the trend: are the comments becoming more toxic? Do your findings fit with the YouTube autoplay radicalization hypothesis?


#### Code for analyzing how toxic text is
We will use the Perspective API to measure toxicity. It's a machine learning API that classifies how incivil a social media message is. 

Make sure that you go through this code and understand it!

In [68]:
import requests
import time
from googleapiclient import discovery
import pandas as pd
# Perspective API key
api_key = 'AIzaSyAxTjb4F0tKxk-X6_s3Nd5E1VHKbok8KuU'

client = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  developerKey=api_key,
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,
)

# This measures the toxicity of a single message using the Perspective API
def measure_toxicity_of_message(message):
    analyze_request = {
      'comment': { 'text': message },
      'requestedAttributes': {'TOXICITY': {}}
    }
    time.sleep(0.1)
    response = client.comments().analyze(body=analyze_request).execute()
    toxicity = response['attributeScores']['TOXICITY']['summaryScore']['value']
    return toxicity

#This function takes a dataframe with comments, and analyzes each comment using perspective.
# It returns an updated dataframe with toxicity information for each comment.
def analyze_toxicity_of_comments(comments):
    #Prepare the dataframe
    comments['toxicity'] = None
    comments['analyzed'] = False    
    #This is a simple way of structuring your code when scraping many pages.
    i = 0
    nrfailed = 0
    while(True):    
        #Fetch a random row
        left_to_process = comments.loc[comments['analyzed']==False]

        if len(left_to_process)==0:
            print(f"We're done! Analysis failed for {nrfailed} of {len(comments)}.")
            break

        else:
            comment = left_to_process.sample(1)
            index = comment.index[0]
            message = comment.comment.values[0]

            #Keep track of progress. Every 10 measures, we print out a progress report
            i+=1
            if i%10==0:
                print(f"{len(comments.loc[comments['analyzed']==False])} comments left out of {len(comments)}...")

            try:
                #Analyze toxicity
                toxicity = measure_toxicity_of_message(message)
                comments.loc[index,'toxicity'] = toxicity

            except Exception as e:
                #The API will fail for mant comments, for instance if they are too short or in the wrong language.
                nrfailed+=1
            finally:
                comments.loc[index,'analyzed'] = True

    return comments

In [69]:
# In the previous exercise, we used this to compare the state of the union speeches of Trump and Biden:

#Fetch the comments: 
comments = fetch_comments_for_videos([{'president':'Trump','video_id':'ATFwMO9CebA'},{'president':'Biden','video_id':'Wl6b5KnpmB4'}],max_comments_to_fetch=50)

#Analyze comments:
comments = analyze_toxicity_of_comments(comments)

#Calculate average toxicity:
print("Average video comment toxicity:")
comments.loc[~comments['toxicity'].isna()].groupby(['president'])['toxicity'].mean()

# Who was more toxic? 

Fetching comments for video ATFwMO9CebA.
Getting page 1...
Done. Fetched 50 comments!
Fetching comments for video Wl6b5KnpmB4.
Getting page 1...


Done. Fetched 50 comments!
91 comments left out of 100...
81 comments left out of 100...
71 comments left out of 100...
61 comments left out of 100...
51 comments left out of 100...
41 comments left out of 100...
31 comments left out of 100...
21 comments left out of 100...
11 comments left out of 100...
1 comments left out of 100...
We're done! Analysis failed for 3 of 100.
Average video comment toxicity:


president
Biden    0.259309
Trump      0.1378
Name: toxicity, dtype: object

### Additional snippets of code to help you 

In [70]:
# Go to the initial video URL. You can modify this to your preferred video
# driver.get('https://www.youtube.com/watch?v=dQw4w9WgXcQ')  

In [72]:
# Close cookie popup
buttons = driver.find_elements(by=By.CSS_SELECTOR,value='button')

for button in buttons:
    if 'Reject' in button.text:
        button.click()

MaxRetryError: HTTPConnectionPool(host='localhost', port=50352): Max retries exceeded with url: /session/1282b348c9241b216c6ccdd5dee289bf/elements (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x135111510>: Failed to establish a new connection: [Errno 61] Connection refused'))

In [73]:
#Get the title and id of the current video
video_id = driver.current_url.split('=')[1]
title = driver.find_element(by=By.CSS_SELECTOR,value='div#title h1')
print(f"Id: {video_id}. Title: {title.text}")

MaxRetryError: HTTPConnectionPool(host='localhost', port=50352): Max retries exceeded with url: /session/1282b348c9241b216c6ccdd5dee289bf/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x135195690>: Failed to establish a new connection: [Errno 61] Connection refused'))

In [None]:
# Click next video
nextvid = driver.find_element(by=By.CSS_SELECTOR,value='ytd-compact-video-renderer.ytd-watch-next-secondary-results-renderer a')
nextvid.click()

### Your code goes here:


In [None]:
# This function takes a video_id to start with, and then takes nr_steps of "next video" from that video.
# It returns a list of dicts, each containing with the step number, the video_id, and the title of the video
#  e.g., [{'step':0,'video_id': 'ATFwMO9CebA', 'President Trump 2018 State of the Union Address (C-SPAN)' }...]

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By

def follow_next_video(starting_video_id, nr_steps):
    # Your solution here
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    results = []
    
    try:
        # Go to the initial video
        driver.get(f'https://www.youtube.com/watch?v={starting_video_id}')
        time.sleep(2)  # Wait for page to load
        
        # Close cookie popup if it exists
        try:
            buttons = driver.find_elements(By.CSS_SELECTOR, 'button')
            for button in buttons:
                if 'Reject' in button.text or 'Accept' in button.text:
                    button.click()
                    time.sleep(1)
                    break
        except:
            pass  # No cookie popup or already closed
        
        for step in range(nr_steps + 1):  # +1 to include the starting video
            time.sleep(2)  # Wait for page to load
            
            # Get the current video ID and title
            current_url = driver.current_url
            if 'watch?v=' in current_url:
                video_id = current_url.split('watch?v=')[1].split('&')[0]  # Get video ID from URL
            else:
                video_id = current_url.split('=')[1] if '=' in current_url else current_url
            
            try:
                title_element = driver.find_element(By.CSS_SELECTOR, 'div#title h1, h1.ytd-watch-metadata')
                title = title_element.text
            except:
                title = "Unknown Title"
            
            # Store the result
            results.append({
                'step': step,
                'video_id': video_id,
                'title': title
            })
            
            # If this is not the last step, click the next video
            if step < nr_steps:
                try:
                    # Find and click the next video link
                    next_video = driver.find_element(By.CSS_SELECTOR, 'ytd-compact-video-renderer.ytd-watch-next-secondary-results-renderer a')
                    next_video.click()
                    time.sleep(2)  # Wait for the next video to load
                except Exception as e:
                    print(f"Could not find next video at step {step}: {e}")
                    break
        
        driver.quit()
        return results
        
    except Exception as e:
        print(f"Error: {e}")
        driver.quit()
        return results

lst = follow_next_video("Wl6b5KnpmB4", 10)
for row in lst:
    print(row)





Could not find next video at step 0: Message: no such element: Unable to locate element: {"method":"css selector","selector":"ytd-compact-video-renderer.ytd-watch-next-secondary-results-renderer a"}
  (Session info: chrome=142.0.7444.176); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#nosuchelementexception
Stacktrace:
0   chromedriver                        0x000000010d77c698 chromedriver + 6153880
1   chromedriver                        0x000000010d773b6a chromedriver + 6118250
2   chromedriver                        0x000000010d208a5b chromedriver + 436827
3   chromedriver                        0x000000010d25b538 chromedriver + 775480
4   chromedriver                        0x000000010d25b791 chromedriver + 776081
5   chromedriver                        0x000000010d2ac934 chromedriver + 1108276
6   chromedriver                        0x000000010d2a9c8e chromedriver + 1096846
7   chromedriver                   

In [None]:
list_of_videos= [item['video_id'] for item in lst]
#Should look more or less like this:
# [{'step': 0,
#   'video_id': 'Wl6b5KnpmB4',
#   'title': 'President Joe Biden delivers 2023 State of the Union address to Congress ‚Äî 2/7/23'},
#  {'step': 1,
#   'video_id': 'FtzvOZNyXdw',
#   'title': "Rise, fall of Sam Bankman-Fried, FTX at center of Michael Lewis' new book | 60 Minutes"},
#  {'step': 2,
#   'video_id': 'XqwGt69pDXQ',
#   'title': 'The Collapse Of FTX: Insiders Tell All | CNBC Documentary'},
#  {'step': 3,
#   'video_id': 'gqDCrdZVZnk',
#   'title': 'The world‚Äôs most dangerous arms dealer | DW Documentary'},
#  {'step': 4, 'video_id': 'G1p6rlDCxq0', 'title': 'World War One (ALL PARTS)'},
# ...

## What can you observe about the videos? 
# For instance, does the recommendation algorithm gets stuck in a cycle between two or three videos? 

In [None]:
#Now we're going to use our old code to analyze the data!

#Fetch comments for the videos
comments = fetch_comments_for_videos(lst,max_comments_to_fetch=50)

In [None]:
#Calculate toxicity of each comment. This will take a while!
comments = analyze_toxicity_of_comments(comments)

In [None]:
#Let's plot the average toxicity over time. Is there a clear trend?
comments.loc[~comments['toxicity'].isna()].groupby(['step'])['toxicity'].mean().plot()