### Scraping Twitter using Selenium

##### Task 1: Collecting our ingredients: (Guided) 

You need 
- A Python environment with Selenium.
- Google Chrome.
- ChromeDriver (Chromium)
- A Twitter Account

How to obtain these is described in the presentation PDF, which is also in this repo.

Also, we need to import the following:

In [186]:
from time import sleep # Will come in handy
from getpass import getpass # For logging in to Twitter through Python
from selenium import webdriver # Our WebDriver

# other, but necessary:
from selenium.webdriver.common.by import By # For Crawling
from selenium.webdriver.common.keys import Keys  # For Crawling
from selenium.webdriver.chrome.options import Options # For setting some options for the driver, see Appendix.
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

##### Task 2: Setting up, and starting our driver: (Guided)

##### Task 3: Open Twitter, and provide the notebook with your login: (Guided)

#### Extra: HTML and XPATH

*HTML*, which stands for HyperText Markup Language, is the foundation of every website you see on the internet. It is a simple and powerful language used to create the structure and content of web pages. Think of HTML as the skeleton that gives a web page its shape.

Example:

    <div>
        First div
        <div>
            Second div
            <input type="text" placeholder="Middle input" />
        </div>
    </div>
    <div>
        Third div
    </div>

*XPath* is a query language used to navigate and select elements from an XML or HTML document. It provides a concise way to locate specific elements or extract data based on their element structure, attributes, or content.

To get the input element in the code above (assuming the snippet sits directly inside the page's body), we would give Selenium the following XPath:
    
    /html/body/div[1]/div/input[@placeholder='Middle input']
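
To see how such a path walks the tree, here is a minimal sketch that uses Python's built-in `xml.etree.ElementTree` instead of a browser. It is only an illustration: ElementTree supports a subset of XPath and resolves paths relative to the root element, so `/html/body/...` is written as `./body/...`. The `<html><body>` wrapper and the variable names are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# The example HTML from above, wrapped in <html><body> so the path has a root.
page = """
<html><body>
<div>
    First div
    <div>
        Second div
        <input type="text" placeholder="Middle input" />
    </div>
</div>
<div>
    Third div
</div>
</body></html>
""".strip()

root = ET.fromstring(page)  # root is the <html> element

# Step by step: first <div> in <body>, then its child <div>, then the <input>.
full_path = root.find("./body/div[1]/div/input[@placeholder='Middle input']")

# Shortcut: any <input> below the root with that placeholder.
short_path = root.find(".//input[@placeholder='Middle input']")

print(full_path.attrib)
print(full_path is short_path)
```

Both expressions land on the same element. With Selenium, the corresponding call would be `driver.find_element(By.XPATH, "...")` with the full `/html/body/...` path.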



<br>



##### In our case...

The full XPath of the element where you enter your username on Twitter:

    "/html/body/div[1]/div/div/div[1]/div/div/div/div/div/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div/div/
    div[5]/label/div/div[2]/div/input"

But this also works:

    "//input[@name='text']"

This works because the value of its name attribute is unique in the whole HTML document.



##### Task 4: Our first crawling: (Guided)

##### Task 5: Our second crawling: (Try yourself) - 5 min

##### Task 5: Search for tweets mentioning "bitcoin": (Guided)

##### Extra: Twitter Advanced Search

Bitcoin traded at about 50,000 dollars in October 2021.
Bitcoin traded at about 20,000 dollars in October 2022.

To search for particular dates, we can search for:

```"bitcoin" lang:en until:2021-10-15 since:2021-10-14 -filter:links -filter:replies```

and
 
```"bitcoin" lang:en until:2022-10-15 since:2022-10-14 -filter:links -filter:replies```

These queries also filter the results so that we get: *only English tweets*, *no links* and *no replies*.

The first search query above might not work. Why?
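
Typing these search strings by hand is error-prone, so a small helper can assemble them. The sketch below is one possible way to do it; `build_search_query` and its parameters are hypothetical names made up for this example, not part of Selenium or Twitter.

```python
from datetime import date


def build_search_query(term, since, until, lang="en", exclude=("links", "replies")):
    """Assemble a Twitter advanced-search string (hypothetical helper)."""
    parts = [
        f'"{term}"',                  # exact phrase
        f"lang:{lang}",               # language filter
        f"until:{until.isoformat()}", # end date, written as YYYY-MM-DD
        f"since:{since.isoformat()}", # start date, written as YYYY-MM-DD
    ]
    # One "-filter:..." entry per kind of tweet we want to exclude.
    parts += [f"-filter:{name}" for name in exclude]
    return " ".join(parts)


query_2021 = build_search_query("bitcoin", date(2021, 10, 14), date(2021, 10, 15))
print(query_2021)
```

The resulting string can then be passed to `search_box.send_keys(...)` in place of a hard-coded query.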

##### Task 6: Click on Latest (Try yourself) - 5 min
We want to look at the latest tweets. Try to click the "Latest" tab by
1. Locating the element
2. Calling element.click()

If you have more time, try clicking "Top" again, or try clicking the "Tweet" button.

##### Task 7: Locate tweets, collect them, and combine them into a deck of "cards": (Guided)

So far, the cards are just WebElements. We can pick one card and go a bit deeper.

##### Task 8: Finding the Twitter Handle (Name of Twitter Account, not username): (Guided)

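Note: once we have selected an element and want to search inside it, the XPath has to start with "." so that it is evaluated relative to that element. Without the dot, "//" searches the whole page again.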
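
The idea of searching relative to an already selected element can be tried without a browser. The sketch below uses Python's built-in `xml.etree.ElementTree` on a made-up, heavily simplified "timeline"; the markup and the names in it are assumptions for illustration only and do not reflect Twitter's real HTML.

```python
import xml.etree.ElementTree as ET

# A made-up timeline with two "cards" (not Twitter's real markup).
timeline = ET.fromstring("""
<section>
    <article><span>Alice</span><span>@alice</span></article>
    <article><span>Bob</span><span>@bob</span></article>
</section>
""".strip())

cards = timeline.findall(".//article")  # all cards in the timeline
second_card = cards[1]

# The leading "." starts the search at second_card, not at the top of the page.
name = second_card.find("./span").text
names = [card.find("./span").text for card in cards]

print(name)
print(names)
```

With Selenium, the corresponding call on a card would be `card.find_element(By.XPATH, ".//span")`.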

##### Task 9: We can also find username and date: (Try yourself) - 10 min

First, try yourself. Username is a bit easier than date. *Hint*: Try to look for a unique identifier / tag. 

Selenium has the following ways of identifying elements:

    driver.find_element(By.ID, "id")
    driver.find_element(By.NAME, "name")
    driver.find_element(By.XPATH, "xpath")
    driver.find_element(By.LINK_TEXT, "link text")
    driver.find_element(By.PARTIAL_LINK_TEXT, "partial link text")
    driver.find_element(By.TAG_NAME, "tag name")
    driver.find_element(By.CLASS_NAME, "class name")
    driver.find_element(By.CSS_SELECTOR, "css selector")

If you have more time, try to collect other parts of the tweet. One piece of advice: leave the text of the tweet itself for later.
Try to collect:
- Likes
- Replies

##### Task 10: Finally, let's collect the tweet itself (This is a bit more complicated):

Let's extend our collection from one tweet to several.

##### Task 11: Make a function that executes all the steps above, and makes each tweet and the collected information into a tuple: (Try yourself) - 10 min

In [213]:
def collect_tweet(card):
    return

To collect more tweets we need to scroll, which can be done by:

    driver.execute_script('window.scroll(0,document.body.scrollHeight);')
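
Scrolling only helps if the page actually moves, so it is useful to compare the scroll position before and after. The helper below is a minimal sketch of that check; `scroll_to_bottom` and its parameters are hypothetical names. It relies only on `driver.execute_script`, so any Selenium driver should work as the `driver` argument.

```python
from time import sleep


def scroll_to_bottom(driver, pause=2.0, max_attempts=3):
    """Scroll to the bottom of the page.

    Returns True as soon as the scroll position changes, and False if it
    stays the same after max_attempts tries (probably the end of the page).
    """
    last_position = driver.execute_script("return window.pageYOffset;")
    for _ in range(max_attempts):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give the page time to load new tweets
        if driver.execute_script("return window.pageYOffset;") != last_position:
            return True
    return False
```

A scraping loop could then run `while scroll_to_bottom(driver): ...` and stop once the function returns False.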

#### Wrapping up:

The last part is inspired by @israel-dryer (GitHub) and updated to fit our case. 

- In particular, wherever we have to wait for the page to load, a direct lookup such as

    ```
    driver.find_elements(By.XPATH, '//article[@data-testid="tweet"]')
    ```

    is replaced by an explicit wait such as

    ```
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//input[@name='text']"))
        )
    ```

- I have also added a loading bar.
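
To make the idea behind an explicit wait concrete: it keeps checking a condition until the condition succeeds or a timeout expires. The sketch below imitates that polling loop in plain Python. It is a rough illustration only, not Selenium's implementation; `wait_until` and the fake condition are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Raises TimeoutError if that does not happen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Fake condition: the "element" only shows up on the third check.
attempts = []

def element_appears_on_third_try():
    attempts.append(1)
    return "element" if len(attempts) >= 3 else None

found = wait_until(element_appears_on_third_try, timeout=2.0, poll_interval=0.01)
print(found, len(attempts))
```

In the scraper, `WebDriverWait(driver, 10).until(...)` plays this role, with the expected condition from `EC` as the check.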

In [182]:
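from selenium.webdriver.chrome.service import Service # Selenium 4 takes the driver path via a Service object

def my_scraper(DRIVER_PATH, options, max_tweets):
    driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)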
    web_site = "https://twitter.com/home"
    driver.get(web_site)

    # Crawl:

    username = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@name='text']"))
        )
    username.send_keys(my_username)
    username.send_keys(Keys.RETURN)

    password = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@name='password']"))
        )
    password.send_keys(my_password)
    password.send_keys(Keys.RETURN)

    search_box = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@aria-label='Search query']"))
        )
    search_box.send_keys('"bitcoin" lang:en until:2021-10-15 since:2021-10-14 -filter:links -filter:replies')
    search_box.send_keys(Keys.RETURN)

    latest = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.LINK_TEXT, "Latest"))
        )
    latest.click()

    # Scrape:
    
    data = []
    tweet_ids = set() # In order to not collect duplicates
    last_position = driver.execute_script("return window.pageYOffset;")
    scrolling = True
     
    while scrolling:

        page_cards = driver.find_elements(By.XPATH, '//article[@data-testid="tweet"]')
        for card in page_cards[-15:]:
            tweet = collect_tweet(card)
            
            if tweet:
                tweet_id = ''.join(tweet)

                if tweet_id not in tweet_ids:
                    tweet_ids.add(tweet_id)
                    data.append(tweet)
        
        #Loading bar VISUALIZATION
        percent_done = int((len(data) / max_tweets)*100)
        print(f"{percent_done}% ", end="", flush=True)
                    
        scroll_attempt = 0

        while True:

            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            sleep(2)
            curr_position = driver.execute_script("return window.pageYOffset;")

            if last_position == curr_position:
                scroll_attempt += 1

                # end of scroll region
                if scroll_attempt >= 3:
                    scrolling = False
                    break

                else:
                    sleep(2) # attempt another scroll

            else:
                last_position = curr_position
                break

        if len(data) >= max_tweets: # stop as soon as we have enough tweets
            scrolling = False

    # Close the browser and end the WebDriver session
    driver.quit()
    return data


In [183]:
data = my_scraper(DRIVER_PATH, options, max_tweets=50)



0% 14% 40% 62% 86% 108% 
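
The output above ends at 108% because a whole batch of cards is added before the number of collected tweets is checked. A possible variant of the bookkeeping, sketched below, stops adding tweets once `max_tweets` is reached and caps the percentage at 100. `add_new_tweets` is a hypothetical helper, and the tuples in the usage lines are made-up placeholder data, not real tweets.

```python
def add_new_tweets(data, seen_ids, tweets, max_tweets):
    """Append unseen tweets to data and return the progress in percent.

    Each tweet is a tuple of strings. As in the scraper above, the joined
    tuple serves as an id for spotting duplicates.
    """
    for tweet in tweets:
        if len(data) >= max_tweets:
            break  # we already have enough tweets
        tweet_id = "".join(tweet)
        if tweet_id not in seen_ids:
            seen_ids.add(tweet_id)
            data.append(tweet)
    return min(100, int(len(data) / max_tweets * 100))


# Made-up placeholder data for illustration.
data, seen_ids = [], set()
first = add_new_tweets(data, seen_ids, [("a", "1"), ("b", "2"), ("a", "1")], max_tweets=4)
second = add_new_tweets(data, seen_ids, [("c", "3"), ("d", "4"), ("e", "5")], max_tweets=4)
print(first, second, len(data))
```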

## Appendix








In [None]:
# Some options worth mentioning:

options.add_experimental_option("prefs", {"download.default_directory": PLACE_YOUR_DESIRED_PATH,
                                        'download.prompt_for_download': False,
                                        'download.directory_upgrade': True,
                                        'safebrowsing.enabled': True})
# setDownloadPreferences: Sets the download preferences for the browser. 
# Here, it specifies the default download directory, disables the download prompt, 
# enables directory upgrade, and enables safe browsing.

options.add_argument('--headless=new')
# setHeadlessMode: Sets the browser in headless mode, which means it runs without a 
# graphical user interface.

options.add_argument('--disable-gpu')
# disableGPU: Disables the use of the GPU (graphics processing unit) in the browser.

options.add_argument('--no-sandbox')
# disableSandbox: Disables the sandbox mode, which provides an extra layer of security for the browser.

options.add_argument('--disable-dev-shm-usage')
# disableDevShmUsage: Disables the use of /dev/shm temporary storage in the browser.

options.add_argument("--log-level=3")
# setLogLevel: Sets the minimum severity that the browser logs (0 = info, 1 = warning, 2 = error, 3 = fatal).
# Level 3 is the least verbose setting: only fatal errors are logged.

options.add_argument("--silent")
# setSilentMode: Sets the browser in silent mode, which suppresses most browser notifications and prompts.