**Task Instructions for Students**:

**Objective**: Scrape and extract data about public universities in Germany from the Hochschulkompass website (https://www.hochschulkompass.de/en/study-in-germany.html). Specifically, you will extract the following information for each university:

1) Name of the University
2) Location
3) Governing body
4) Number of students
5) Founding year
6) Link to the website of the University

You will then save this information in a CSV file or a pandas DataFrame for analysis.

**Steps**:

1) *Navigate to the Hochschulkompass Website*:

Open the Hochschulkompass website: https://www.hochschulkompass.de/en/study-in-germany.html.

2) *Search for Public Universities*:

- Use the search functionality to filter the list of universities to only display public universities.
- You can use the filter options available on the site (e.g., school type, type of control, etc.) to narrow down the search results.

3) *Use Selenium to Automate the Process*:

- Write a Selenium script to automate the following tasks:
  - Perform the search to list public universities.
  - Navigate through the search results pages.
 
- Use Selenium to interact with the website, and then use Beautiful Soup to parse the HTML of each page to extract the required information.
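
The hand-off between the two libraries can be sketched as follows. This is only a minimal illustration: `page_to_soup` is a hypothetical helper name, and a small stand-in object replaces the real WebDriver so the snippet runs without a browser. The only assumption about the driver is that it exposes the rendered HTML as `page_source`, as Selenium's WebDriver does.

```python
from types import SimpleNamespace

from bs4 import BeautifulSoup


def page_to_soup(driver):
    """Parse whatever page the driver currently shows (hypothetical helper)."""
    # Selenium exposes the rendered HTML of the current page as `page_source`.
    return BeautifulSoup(driver.page_source, "html.parser")


# Stand-in for a real WebDriver so the sketch runs without a browser.
fake_driver = SimpleNamespace(
    page_source="<html><body><h1>Example University</h1></body></html>"
)

soup = page_to_soup(fake_driver)
heading = soup.find("h1").get_text(strip=True)
print(heading)
```

With a real browser session, the same helper would be called after `driver.get(...)` and after any explicit wait has confirmed that the page has loaded.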

4) *Extract University Details Using Beautiful Soup*:

- For each university, extract the following details using Beautiful Soup:
   - Name of the University
   - Location
   - Governing Body
   - Number of Students
   - Founding Year
   - Link to the website of the University
 
- Use Beautiful Soup’s parsing capabilities to locate and extract the text content of relevant HTML elements.
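
As a rough sketch of this kind of extraction, the snippet below parses a small invented HTML fragment. The markup and every class name in it (`profile`, `title`, `label`, `value`, `homepage`) are made up for illustration and are not the class names used on the Hochschulkompass pages; you still need to inspect the real pages to find those.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only; the real page uses different class names.
SAMPLE_HTML = """
<div class="profile">
  <h2 class="title">Example University</h2>
  <ul>
    <li><span class="label">Founding year</span> <span class="value">1900</span></li>
    <li><span class="label">Students</span> <span class="value">12,345</span></li>
  </ul>
  <a class="homepage" href="https://www.example.org">Website</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
profile = soup.find("div", class_="profile")

# Text content of a single element, with surrounding whitespace removed.
name = profile.find(class_="title").get_text(strip=True)

# Build a label -> value mapping so fields are looked up by name, not by position.
details = {
    li.find(class_="label").get_text(strip=True): li.find(class_="value").get_text(strip=True)
    for li in profile.find_all("li")
}

# Attributes of a tag are read like dictionary keys.
link = profile.find("a", class_="homepage")["href"]

record = {
    "name": name,
    "founding_year": details.get("Founding year"),
    "students": details.get("Students"),
    "link": link,
}
print(record)
```

Looking fields up by their label is one possible design choice; indexing the list items by position also works but breaks silently if a page omits one of the entries.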

5) *Save Data to CSV*:

Use Python’s csv module or pandas to save the extracted data into a CSV file.
Each row in the CSV should correspond to a different university, with columns for Name, Location, Governing Body, Number of Students, Founding Year and Link.
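
A minimal sketch of the `csv` module route is shown below. The record is placeholder data, not a real university, and `write_universities` is a hypothetical helper name. The example writes to an in-memory buffer so that it runs anywhere.

```python
import csv
import io

FIELDNAMES = [
    "Name", "Location", "Governing Body",
    "Number of Students", "Founding Year", "Link",
]

# Placeholder record for illustration, not real data.
rows = [
    {
        "Name": "Example University",
        "Location": "Example City",
        "Governing Body": "public",
        "Number of Students": "12345",
        "Founding Year": "1900",
        "Link": "https://www.example.org",
    },
]


def write_universities(rows, handle):
    """Write one CSV row per university to an open text handle (hypothetical helper)."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()    # Column names go in the first line.
    writer.writerows(rows)  # One line per dictionary.


buffer = io.StringIO()
write_universities(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open(path, "w", newline="", encoding="utf-8")`, where `path` is whatever file name you choose. With pandas, `df.to_csv(path, index=False)` achieves the same result from a DataFrame.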


6) *Handle Multiple Pages*:

If the results span multiple pages, ensure your script can handle pagination and continues to scrape data from all available pages.
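
One way to structure a pagination loop is sketched below. The names `collect_all_pages` and `fetch_page` are hypothetical, and a small dictionary stands in for the website. In a real script, `fetch_page` would use Selenium to load results page `n` and return the items found there.

```python
def collect_all_pages(fetch_page):
    """Call fetch_page(1), fetch_page(2), ... until a page yields no results."""
    results = []
    page_number = 1
    while True:
        page_results = fetch_page(page_number)
        if not page_results:  # An empty page means we have gone past the last one.
            break
        results.extend(page_results)
        page_number += 1
    return results


# Stand-in data source: three "pages" of placeholder names.
FAKE_PAGES = {1: ["Uni A", "Uni B"], 2: ["Uni C", "Uni D"], 3: ["Uni E"]}

all_results = collect_all_pages(lambda n: FAKE_PAGES.get(n, []))
print(all_results)
```

Stopping on the first empty page is an assumption about how the site behaves. An alternative is to keep clicking a "next" control until it is no longer present.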

7) *Comment Your Code*:

Make sure to comment your code to explain what each part does. This will help others understand your approach and logic.


In [2]:
# Import the required libraries
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
from get_chrome_driver import GetChromeDriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains

In [None]:
get_driver = GetChromeDriver()
get_driver.install()

In [None]:
chrome_options = Options()
chrome_options.add_argument("--headless")  # Optional: run Chrome in headless mode (no visible browser window)
chrome_options.add_argument("--disable-search-engine-choice-screen")

In [None]:
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.hochschulkompass.de/en/higher-education-institutions/search-for-a-higher-education-institution.html?tx_szhrksearch_pi1%5Bsearch%5D=1&tx_szhrksearch_pi1%5BQUICK%5D=1&tx_szhrksearch_pi1%5Bname%5D=&tx_szhrksearch_pi1%5Bhstype%5D%5B1%5D=1&tx_szhrksearch_pi1%5Btraegerschaft%5D=1")
time.sleep(2)

In [None]:
# Step 1: Click the "Results per page" dropdown

# Wait until the "Results per page" dropdown is clickable
results_dropdown = WebDriverWait(driver, 15).until(
    EC.element_to_be_clickable((By.XPATH, "________"))
)  # HINT: Find the element that contains the text '10' in the dropdown menu

# Click on the dropdown to open it
results_dropdown.________()  # HINT: What method do you use to simulate a click on a web element?


In [None]:
# Step 2: Select "100" from the dropdown options

# Wait until the "100" option is clickable
option_100 = WebDriverWait(driver, 15).until(
    EC.element_to_be_clickable((By.XPATH, "________"))
)  # HINT: Look for the element with class 'jcf-option' and text '100'

# Click on the "100" option
option_100.________()  # HINT: What method is used to click a web element?


In [None]:
# Step 3: Find all "Learn More" links on the page

# Locate all elements with "Learn More" in the link text
learn_more_links = driver.find_elements(By.XPATH, "________")  
# HINT: What XPath would you use to find all links containing the text 'Learn More'?

# Step 4: Extract the href attribute from each link and store it in a list

# Use list comprehension to extract the 'href' attribute
learn_more_urls = [link.get_attribute('________') for link in learn_more_links]
# HINT: Which attribute of an anchor tag (<a>) contains the URL?


In [None]:
driver.quit()

In [None]:
driver = webdriver.Chrome(options=chrome_options)

In [None]:
meta_data = []
meta_data_links = []

for i, url in enumerate(learn_more_urls):
    print(i)  # Progress indicator: index of the page being scraped
    driver.get(url)
    time.sleep(2)  # Pause to ensure the page has time to load
    
    # Wait for the page to load fully (adjust the locator if necessary)
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CLASS_NAME, "________"))
    )  # HINT: What class name indicates the university details section?

    # Parse the page source with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, '________')
    # HINT: What parser should be used with BeautifulSoup for HTML content?
    
    # Extract the university information section
    uni_info = soup.find('div', {'class': '________'})
    # HINT: What class name contains the university's detailed information?
    
    # Extract the university logo section (the website link is read from it later)
    link_info = soup.find('div', {'class': '________'})
    # HINT: What class name contains the university's logo information?
    
    # Store the extracted information in the lists
    meta_data.append(uni_info)
    meta_data_links.append(link_info)

# Close the browser once all detail pages have been scraped
driver.quit()


In [None]:
Name_Uni = []
Governing_Body = []
Number_of_Students = []
Founding_Year = []
Federal_State = []

for elem in meta_data:
    # Assume elem is already the 'uni-steckbrief' BeautifulSoup object from the previous step
    try:
        # Extract the university name
        Name_Uni.append(elem.find(class_="________").get_text(strip=True))
        # HINT: What class name is used to find the university's name?
        
        details = elem.find('ul').find_all('li')
        
        # Extract the governing body
        Governing_Body.append(details[0].find(class_='________').get_text(strip=True))
        # HINT: What class name holds the descriptive text for each detail?
        
        # Extract the number of students
        Number_of_Students.append(details[1].find(class_='________').get_text(strip=True))
        # HINT: What class name holds the descriptive text for each detail?
        
        # Extract the founding year
        Founding_Year.append(details[________].find(class_='________').get_text(strip=True))
        # HINT: Which index corresponds to the founding year? What class name holds the descriptive text?
        
        # Extract the federal state
        Federal_State.append(details[________].find(class_='________').get_text(strip=True))
        # HINT: Which index corresponds to the federal state? What class name holds the descriptive text?
    
    except (IndexError, AttributeError) as e:
        # Handle cases where the structure is not as expected
        Name_Uni.append(None)
        Governing_Body.append(None)
        Number_of_Students.append(None)
        Founding_Year.append(None)
        Federal_State.append(None)
        print(f"Error processing element: {e}")


In [None]:
Link_University = []
for elem in meta_data_links:
    
    try:
        # Extract the 'href' attribute of the anchor tag
        Link_University.append(elem.find('a')['________'])  
        # HINT: What attribute of an anchor tag (<a>) contains the URL?
    except (AttributeError, TypeError, KeyError):
        # Handle cases where the structure is not as expected
        Link_University.append(________)  
        # HINT: What should you append if the link is not found?


In [None]:
import pandas as pd
data_dict = {
    'University_Name': Name_Uni,
    'Governing_Body': Governing_Body,
    'Number_of_Students': Number_of_Students,
    'Founding_Year': Founding_Year,
    'Federal_State': Federal_State,
    'University_Link': Link_University
}

# Convert the dictionary into a DataFrame
df = pd.DataFrame(data_dict)