## Fundamentals of Python - Part 4 (Solutions)
In this lecture, you will learn how to use Selenium to scrape data from websites.

## References

https://selenium-python.readthedocs.io/

### 1. Downloading, Importing Selenium and Creating a Driver
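To install Selenium, run the command <code>pip install selenium</code>. The code below also imports <code>webdriver_manager</code>, which is installed separately with <code>pip install webdriver-manager</code>. Once both are installed, we will be able to import the packages.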

In [1]:
# Subpackages from selenium that are needed
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

Now, we will create a <b>Web Driver</b>. A <b>Web Driver</b> lets your program open and control a real browser. Its main use is automated testing, such as checking whether your web application works as intended. Here, we will focus on scraping data from websites.

In [2]:
# Create a Web Driver that runs Google Chrome. Selenium also supports Firefox and Safari.

# When you run this cell, a Google Chrome window should pop up
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))




[WDM] - Current google-chrome version is 103.0.5060
[WDM] - Get LATEST chromedriver version for 103.0.5060 google-chrome
[WDM] - There is no [win32] chromedriver for browser 103.0.5060 in cache
[WDM] - About to download new driver from https://chromedriver.storage.googleapis.com/103.0.5060.53/chromedriver_win32.zip
[WDM] - Driver has been saved in cache [C:\Users\mikes\.wdm\drivers\chromedriver\win32\103.0.5060.53]


### 2. Using and Closing the Web Driver
Once the Web Driver has been created, we can start opening the websites that we want to scrape data from. For this lecture, we will be scraping the tweets on Elon Musk's profile.

In [3]:
# .get() loads the given URL in the browser window controlled by the Web Driver
driver.get('https://twitter.com/elonmusk')

In [4]:
# Closes the tab
driver.close()
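
Note that <code>close()</code> only closes the current window, whereas <code>quit()</code> closes every window and ends the driver session. If a script raises an error before reaching either call, the browser is left open. One way to guard against that, sketched here under assumptions, is a small context manager. The names <code>managed_driver</code> and <code>make_driver</code> are hypothetical and not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield the driver returned by make_driver() and always quit it afterwards.

    make_driver is a hypothetical zero-argument callable supplied by the caller.
    """
    driver = make_driver()
    try:
        yield driver
    finally:
        # quit() ends the whole browser session, even if the body raised an error
        driver.quit()


# Possible usage with the imports from section 1 (not executed here):
# with managed_driver(lambda: webdriver.Chrome(
#         service=Service(ChromeDriverManager().install()))) as driver:
#     driver.get('https://twitter.com/elonmusk')
```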

### 3. Scraping Data from Websites
Now that we know the basics of opening and closing tabs in Selenium, let's start scraping data! First, we need to import <b>By</b>. This gives us eight locator strategies (ID, name, tag name, class name, CSS selector, XPath, link text, and partial link text) for finding elements in an HTML tree, which is how we scrape data.

In [5]:
# Importing By from Selenium
from selenium.webdriver.common.by import By

We then use one of two methods to find data: <b>find_element</b> and <b>find_elements</b>. <b>find_element</b> finds the first element that matches and returns a single WebElement object. <b>find_elements</b> finds all the elements that match and returns a list of WebElement objects. Both are instance methods that you call on a WebDriver or WebElement object. Let's first try to find all the span tags on this page.
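
Before running that, one practical difference between the two methods is worth knowing: when nothing matches, <b>find_element</b> raises a <code>NoSuchElementException</code>, while <b>find_elements</b> simply returns an empty list. The helper below is a minimal sketch that relies on the second behavior. The name <code>first_text_or_none</code> is hypothetical, and <code>root</code> can be any WebDriver or WebElement.

```python
def first_text_or_none(root, by, value):
    """Return the text of the first match under root, or None if nothing matches.

    Hypothetical helper: root is any object with a Selenium-style
    find_elements(by=..., value=...) method.
    """
    # find_elements returns an empty list instead of raising when nothing matches
    matches = root.find_elements(by=by, value=value)
    return matches[0].text if matches else None
```

For example, <code>first_text_or_none(driver, By.TAG_NAME, 'span')</code> would return the text of the first span on the page, or <code>None</code> if the page has no span tags.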

In [6]:
# We need to initialize a new ChromeDriver session every time we close
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

driver.get('https://twitter.com/elonmusk')

# To do that, we use the By.TAG_NAME option
spanTags = driver.find_elements(by = By.TAG_NAME, value = 'span')

# To get all the values from the spanTags, we use the .text feature
for spanTag in spanTags:
    print(spanTag.text)




[WDM] - Current google-chrome version is 103.0.5060
[WDM] - Get LATEST chromedriver version for 103.0.5060 google-chrome
[WDM] - Driver [C:\Users\mikes\.wdm\drivers\chromedriver\win32\103.0.5060.53\chromedriver.exe] found in cache


Don’t miss what’s happening
People on Twitter are the first to know.
Log in
Log in


WebDriverException: Message: unknown error: cannot determine loading status
from unknown error: unexpected command response
  (Session info: chrome=103.0.5060.53)


Wait, but that printed only the login banner and none of the tweets. Let's look at Elon's Twitter page. If we open the browser's inspector and hover over the text of one of his tweets, we'll see that it sits inside a span tag, as shown below. Why didn't that span tag get printed?
<img src="ElonTweet.png"
     alt="Elon Tweet"
     style="float: left; margin-right: 10px;" />

This is because the content was loaded dynamically. The .get() method waits until the initial page has loaded, but that does not include dynamic content, which loads after the initial frame of the website is in place. If you open Twitter, you may notice a blue circle spinning while the page buffers; that is the dynamic content loading.
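
The usual remedy is to keep checking the page until the content you need has appeared, giving up after a timeout. Selenium's <code>WebDriverWait</code> works on this principle. The plain-Python sketch below only illustrates the idea and is not Selenium's implementation. The name <code>wait_until</code> and its parameters are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Illustrative sketch only: raises TimeoutError if the condition is
    still falsy once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Sleep briefly so we do not check in a tight loop
        time.sleep(poll_interval)
```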

In [20]:
# WebDriverWait lets us explicitly wait for a condition before continuing
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.support.ui import WebDriverWait

def enoughTags(driver):
    # True once more than 750 matching div elements are on the page
    return len(driver.find_elements(by=By.CSS_SELECTOR, value='div.css-1dbjc4n')) > 750

# We need to initialize a new ChromeDriver session every time we close
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

driver.get('https://twitter.com/elonmusk')

# To solve this, we use something called an explicit wait.
# It keeps checking a condition until it is true, and raises TimeoutException if the timeout passes first
# This line waits up to 10 seconds for enoughTags to return True
# The CSS selector 'div.css-1dbjc4n' combines a tag name (div) with a class name (css-1dbjc4n)
WebDriverWait(driver, timeout=10).until(enoughTags)

# Find all the span tags using the By.TAG_NAME option
spanTags = driver.find_elements(by=By.TAG_NAME, value='span')

# To get all the values from the spanTags, we use the .text feature
for spanTag in spanTags:
    try:
        print(spanTag.text)
    except WebDriverException:
        # Skip elements that can no longer be read, e.g. ones the page re-rendered after we found them
        pass

driver.close()



Current google-chrome version is 103.0.5060
Get LATEST chromedriver version for 103.0.5060 google-chrome
Driver [/Users/roryliao/.wdm/drivers/chromedriver/mac64/103.0.5060.53/chromedriver] found in cache


Don’t miss what’s happening
People on Twitter are the first to know.
Log in
Log in
Sign up
Sign up




Elon Musk
Elon Musk
Elon Musk


Follow
Follow
Elon Musk
Elon Musk

@elonmusk
Joined June 2009
Joined June 2009
114
114
Following
Following
99.8M
99.8M
Followers
Followers

Tweets
Tweets & replies
Media
Likes

Pinned Tweet
Elon Musk
Elon Musk
@elonmusk
·
USA birth rate has been below min sustainable levels for ~50 years
55.1K
55.1K
55.1K
55.9K
55.9K
55.9K
309K
309K
309K
Show this thread
Elon Musk
Elon Musk
@elonmusk
·
36.5K
36.5K
36.5K
76.6K
76.6K
76.6K
728.5K
728.5K
728.5K
Show this thread
Elon Musk
Elon Musk
@elonmusk
·
AI gets better every day
12.2K
12.2K
12.2K
14.7K
14.7K
14.7K
127.8K
127.8K
127.8K
Elon Musk
Elon Musk
@elonmusk
·
Some great suggestions in the comments!
102.4K
102.4K
102.4K
2,091
2,091
2,091
33.2K
33.2K
33.2K
Show this thread
Elon Musk
Elon Musk
@elonmusk
·
But sometimes they’re out of stock
youtube.com
youtube.com
Monty Python- Cheese Shop
Monty Python- Cheese Shop
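
Most strings in the output above appear two or three times in a row, probably because nested span tags each report the same text, and some lines are blank. A small post-processing step can tidy this up. The sketch below assumes each element exposes a <code>.text</code> attribute, as a WebElement does. The name <code>clean_texts</code> is hypothetical.

```python
def clean_texts(elements):
    """Return element texts with blanks and consecutive repeats removed.

    Hypothetical helper: only back-to-back duplicates are collapsed, so a
    string that legitimately recurs later (such as a name on every tweet)
    is kept each time it starts a new run.
    """
    texts = []
    for element in elements:
        text = element.text.strip()
        # Skip empty strings and anything identical to the previous kept line
        if text and (not texts or texts[-1] != text):
            texts.append(text)
    return texts
```

With the cell above, <code>clean_texts(spanTags)</code> would need to run before <code>driver.close()</code>, since elements cannot be read after the browser is closed.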

### 4. Problems With Selenium
Notice that the code above did not scrape all of the data from the website. The most likely reason is that the remaining tweets don't load until you scroll through them. Since Selenium only opens the page, there isn't any activity on it, so the rest of the tweets never load and can't be scraped.

There are several ways to solve this. One is to use the Twitter API, which gives you much more access to the data they have. This method usually costs money. Another is to try a package that specializes in a particular website. These packages usually have access to the Twitter API, and you can often use them for free provided you don't use them too much. Lastly, you can try scrolling the page using Selenium. Selenium offers so many features that, if you can think of a solution, it most likely has functions to carry it out. It all depends on your specific needs. If you have any specific data you want to scrape, I would be more than happy to help you!
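
For the scrolling option, one possible approach is to scroll to the bottom of the page repeatedly until its height stops growing. The sketch below uses <code>driver.execute_script</code> to run JavaScript in the page. The names <code>scroll_to_load</code>, <code>pause</code>, and <code>max_scrolls</code> are hypothetical, and the pause length is a guess that would need tuning for a real site.

```python
import time


def scroll_to_load(driver, pause=2.0, max_scrolls=10):
    """Scroll to the bottom until the page height stops growing.

    Hypothetical helper: returns the number of scrolls performed and
    stops after max_scrolls even if the page is still growing.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        # Give the page time to fetch and render new content
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was added, so assume we reached the end
            break
        last_height = new_height
    return scrolls
```

Some sites remove off-screen items from the page as you scroll, so it may be necessary to collect the text after each scroll rather than once at the end.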