
# **Web Scraping JobStreet with Selenium**
---
This section scrapes job postings from the JobStreet website using **Selenium**.

The goal is to collect postings relevant to the data field, including the job title, company name, location, and job description, from three countries: **Indonesia, Malaysia, and Singapore.**

In [None]:
# Install all required libraries (each package is listed once)

!pip install requests nltk pandas
!pip install selenium webdriver-manager
!pip install -q google-colab-selenium

# Install Google Chrome so Selenium can drive it in headless mode
!apt-get update
!wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
!dpkg -i google-chrome-stable_current_amd64.deb || apt-get install -f -y

Collecting selenium
  Downloading selenium-4.34.2-py3-none-any.whl.metadata (7.5 kB)
Collecting urllib3~=2.5.0 (from urllib3[socks]~=2.5.0->selenium)
  Downloading urllib3-2.5.0-py3-none-any.whl.metadata (6.5 kB)
Collecting trio~=0.30.0 (from selenium)
  Downloading trio-0.30.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket~=0.12.2 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting outcome (from trio~=0.30.0->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.12.2->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Downloading selenium-4.34.2-py3-none-any.whl (9.4 MB)
Downloading trio-0.30.0-py3-none-any.whl (499 kB)

In [None]:
import requests
import selenium
import nltk
import random
import time
import pandas as pd
import logging
import string
import json
print("All libraries installed successfully!")

All libraries installed successfully!


# **1. Data Collection**

Install the required libraries and set up ChromeDriver in Google Colab.

Because Colab runs on a remote server rather than a local machine, Chrome must run in headless mode (without a GUI). `webdriver-manager` takes care of downloading a matching ChromeDriver automatically.


In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from webdriver_manager.chrome import ChromeDriverManager

import copy
import google_colab_selenium as gs

In [None]:
# Configure logging so the scraping process can be monitored
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

In [None]:
def setup():
  """Create and return a headless Chrome WebDriver for Google Colab."""

  logging.info("Setting up WebDriver for the Google Colab environment...")
  chrome_options = Options()
  chrome_options.add_argument("--headless")  # Run Chrome without a GUI
  chrome_options.add_argument("--no-sandbox")  # Required when Chrome runs as root, as it does in Colab
  chrome_options.add_argument("--disable-dev-shm-usage")  # Avoid crashes caused by a small /dev/shm
  chrome_options.add_argument("--window-size=1920,1080")
  chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

  # Let webdriver-manager download a matching ChromeDriver, then start Chrome with it
  driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
  logging.info("WebDriver is ready.")
  return driver
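
The scraping cell later in this notebook calls `driver.quit()` only after its loop finishes, so an unexpected exception could leave Chrome running in the Colab session. The sketch below shows one possible way to guarantee cleanup with a context manager. `managed_driver` and `FakeDriver` are hypothetical names introduced for illustration; in the notebook, the factory passed in would be `setup`.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Yield a driver built by `factory` and always call quit() afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the body raises
        driver.quit()


class FakeDriver:
    """Stand-in for a Selenium WebDriver, used only to demonstrate the pattern."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as driver:
        raise RuntimeError("simulated failure while scraping")
except RuntimeError:
    pass

print(fake.closed)  # True: quit() ran even though the body raised
```

With the real driver, the equivalent call would be `with managed_driver(setup) as driver: ...`.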

## Scraping JobStreet Data

This step collects data from the job listing pages on JobStreet. For each posting, we extract the following fields:
- Job title
- Company name
- Location
- Job category
- Work type (Full time, Contract, etc.)
- Salary (when listed)
- Full description (which contains the requirements)
- Link to the posting



---

To extract these fields, we use **Selenium** to locate HTML elements by the `data-automation` attributes that JobStreet assigns to them.
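
As a rough illustration of attribute-based lookup, the sketch below builds the CSS selectors from a tag and a `data-automation` value and returns a default when an element is missing. `automation_selector`, `text_or_default`, `FakeElement`, and `fake_find` are hypothetical helpers written for this example, and the fake page stands in for a live Selenium session so the snippet runs without a browser.

```python
def automation_selector(tag, name):
    """Build a CSS selector such as span[data-automation='advertiser-name']."""
    return f"{tag}[data-automation='{name}']"


def text_or_default(find, selector, default="Not specified", missing=(LookupError,)):
    """Return the stripped text of the matched element, or `default` if lookup fails.

    `find` is any callable that takes a selector and returns an object with a
    `.text` attribute; `missing` lists the exception types that mean "not found".
    """
    try:
        return find(selector).text.strip()
    except missing:
        return default


class FakeElement:
    """Minimal stand-in for a Selenium WebElement."""

    def __init__(self, text):
        self.text = text


# A fake page that contains a company name but no salary element
page = {automation_selector("span", "advertiser-name"): FakeElement("  SEEK ")}


def fake_find(selector):
    return page[selector]  # raises KeyError (a LookupError) when absent


print(text_or_default(fake_find, automation_selector("span", "advertiser-name")))    # SEEK
print(text_or_default(fake_find, automation_selector("span", "job-detail-salary")))  # Not specified
```

With Selenium, `find` could be `lambda s: driver.find_element(By.CSS_SELECTOR, s)` and `missing` could be `(NoSuchElementException,)`.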


In [None]:
keywords = ["data scientist", "data analyst", "machine learning", "data engineer", "data science", "computer science"]

# Function for scraping per country
def scrape_jobstreet(domain, country_label):
    all_hrefs = []
    for keyword in keywords:
        search_keyword = keyword.replace(" ", "-")
        search_url = f"https://{domain}/en/job-search/{search_keyword}-jobs/"
        print(f"\nSearching on {country_label.upper()} for: {keyword.upper()} jobs")
        driver.get(search_url)
        try:
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "a[data-automation='jobTitle']"))
            )
            job_links_elements = driver.find_elements(By.CSS_SELECTOR, "a[data-automation='jobTitle']")
            hrefs = [(el.get_attribute("href"), country_label) for el in job_links_elements if el.get_attribute("href")]
            all_hrefs.extend(hrefs)
            print(f"  Found {len(hrefs)} job links in {country_label} for '{keyword}'")
        except TimeoutException:
            print(f"  Timeout loading results for: {keyword} in {country_label}")
    return all_hrefs

driver = setup()

# Scrape per country
all_job_hrefs_malaysia = scrape_jobstreet("jobstreet.com.my", "malaysia")
all_job_hrefs_singapore = scrape_jobstreet("jobstreet.com.sg", "singapore")
all_job_hrefs_indonesia = scrape_jobstreet("id.jobstreet.com", "indonesia")

# Combine the links from all three countries
all_job_hrefs = all_job_hrefs_malaysia + all_job_hrefs_singapore + all_job_hrefs_indonesia

# Deduplicate
unique_href_map = {}
for href, country in all_job_hrefs:
    if href and href not in unique_href_map:
        unique_href_map[href] = country

# Scrape the details of each job posting
scraped_jobs = []
for i, (href, country_name) in enumerate(list(unique_href_map.items())):
    if not href.startswith("http"):
        continue

    print(f"\n[{i+1}] Navigating to: {href}")
    try:
        driver.get(href)
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1[data-automation='job-detail-title']"))
        )

        title = driver.find_element(By.CSS_SELECTOR, "h1[data-automation='job-detail-title']").text
        company = driver.find_element(By.CSS_SELECTOR, "span[data-automation='advertiser-name']").text
        location = driver.find_element(By.CSS_SELECTOR, "span[data-automation='job-detail-location']").text
        category = driver.find_element(By.CSS_SELECTOR, "span[data-automation='job-detail-classifications']").text
        work_type = driver.find_element(By.CSS_SELECTOR, "span[data-automation='job-detail-work-type']").text
        description = driver.find_element(By.CSS_SELECTOR, "div[data-automation='jobAdDetails']").text

        # Optional field: salary is not listed on every posting
        try:
            salary_element = driver.find_element(By.CSS_SELECTOR, "span[data-automation='job-detail-salary']")
            salary = salary_element.text
        except NoSuchElementException:
            salary = "Not specified"

        # Simple heuristic: keep description lines that mention requirements or qualifications
        requirements = ""
        for line in description.splitlines():
            if "requirement" in line.lower() or "qualification" in line.lower() or "kualifikasi" in line.lower():
                requirements += line.strip() + " | "
        if not requirements:
            requirements = "Not specified"

        scraped_jobs.append({
            "Title": title.strip(),
            "Company": company.strip(),
            "Country": country_name.strip().title(),
            "Location": location.strip(),
            "Category": category.strip(),
            "Work Type": work_type.strip(),
            "Salary": salary.strip(),
            "Requirements": requirements.strip(),
            "Description": description.strip(),
            "Link": href.strip()
        })

        print(f"  Title: {title}")
        print(f"  Company: {company}")
        print(f"  Country: {country_name}")
        print(f"  Location: {location}")
        print(f"  Category: {category}")
        print(f"  Work Type: {work_type}")
        print(f"  Salary: {salary}")
        print(f"  Requirements: {requirements}")
        print(f"  Description preview: {description[:200]}...")
        print(f"  Link: {href}")

    except (TimeoutException, NoSuchElementException) as e:
        print(f"  Error scraping {href}: {e}")

driver.quit()

df_jobs = pd.DataFrame(scraped_jobs)
print("\nTotal jobs scraped:", len(df_jobs))

Streaming output truncated to the last 5000 lines.
  Link: https://sg.jobstreet.com/job/85888961?type=standard&ref=search-standalone&origin=cardTitle#sol=ae38fac4a77feb3e04eb801b5d52510d7a00fe74

[223] Navigating to: https://sg.jobstreet.com/job/85571031?type=standard&ref=search-standalone&origin=cardTitle#sol=7943bb41585fcbaa55e453499da92f70bb707aba
  Title: Data Scientist (Tableau, SQL, Banking) Central, CBD ~
  Company: PERSOLKELLY Singapore Pte Ltd (Formerly Kelly Services Singapore Pte Ltd)
  Country: singapore
  Location: Raffles Place, Central Region
  Category: Database Development & Administration (Information & Communication Technology)
  Work Type: Contract/Temp
  Salary: $4,500 – $6,000 per month
  Requirements: Requirements | 
  Description preview: Wealth and Retail Banking Industry
 Hybrid Working arrangements
Duration: 6 months subject to extendable/convertible
Working Location: Central Business District, CBD
Working hours: 09.00am – 6.00pm (M...
  Link: h
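
The links in the output above carry tracking parameters (`?type=standard&ref=...&origin=cardTitle#sol=...`), so the same posting reached through two different keywords may not compare equal as a raw string. The sketch below shows one possible way to deduplicate on a canonical URL instead. It assumes that the path (`/job/<id>`) alone identifies a posting, which the source does not confirm; `canonical_job_url` and `dedupe_links` are hypothetical helper names.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_job_url(href):
    """Drop the query string and fragment so tracking parameters are ignored."""
    parts = urlsplit(href)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def dedupe_links(pairs):
    """Keep the first (href, country) pair seen for each canonical URL."""
    unique = {}
    for href, country in pairs:
        if not href:
            continue
        unique.setdefault(canonical_job_url(href), country)
    return unique


links = [
    ("https://sg.jobstreet.com/job/85571031?type=standard&ref=search-standalone&origin=cardTitle#sol=7943bb41585fcbaa55e453499da92f70bb707aba", "singapore"),
    # Same posting with a different (made-up) tracking fragment
    ("https://sg.jobstreet.com/job/85571031?type=standard&ref=search-standalone&origin=cardTitle#sol=example", "singapore"),
]

print(dedupe_links(links))
# {'https://sg.jobstreet.com/job/85571031': 'singapore'}
```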

In [None]:
df_jobs = pd.DataFrame(scraped_jobs)
print("\nTotal jobs scraped:", len(df_jobs))
df_jobs


Total jobs scraped: 576


|  | Title | Company | Country | Location | Category | Work Type | Salary | Requirements | Description | Link |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Scientist | Western Digital Tech and Regional Center (M) S... | Malaysia | Kuala Lumpur | Mathematics, Statistics & Information Sciences... | Full time | Not specified | Meeting users, gathering user requirements and... | Job Description\nResponsibilities\nWork with d... | https://my.jobstreet.com/job/85371197?type=sta... |
| 1 | Senior Data Scientist - GenAI (Hybrid Working) | SEEK | Malaysia | Kuala Lumpur (Hybrid) | Other (Information & Communication Technology) | Full time | Not specified | Qualifications \| Essential Qualifications, Ski... | Company Description\nAbout SEEK\nAt SEEK, we s... | https://my.jobstreet.com/job/85896672?type=sta... |
| 2 | Data Scientist | Sedgwick Singapore Pte Ltd | Malaysia | Kuala Lumpur City Centre, Kuala Lumpur | Mathematics, Statistics & Information Sciences... | Full time | Not specified | Education/Qualifications: \| Tertiary qualifica... | As a Data Scientist, you will play a critical ... | https://my.jobstreet.com/job/85936726?type=sta... |
| 3 | Data Science Engineer | Jabil Circuit Sdn Bhd | Malaysia | Penang | Engineering - Software (Information & Communic... | Full time | Not specified | Educational Qualifications \| Additional Requir... | Roles and Responsibility\nData Analysis: Perfo... | https://my.jobstreet.com/job/85468501?type=sta... |
| 4 | Data Scientist | TECHTIERA SDN. BHD. | Malaysia | Kuala Lumpur | Engineering - Software (Information & Communic... | Contract/Temp | RM 9,333 – RM 14,000 per month | Responsibilities & Requirements: \| Collaborate... | Responsibilities & Requirements:\nProven exper... | https://my.jobstreet.com/job/85811513?type=sta... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 571 | BI DEVELOPER | PT Harmoni Dinamik Indonesia | Indonesia | Jakarta | Developers/Programmers (Information & Communic... | Full time | Not specified | Collaborate with stakeholders to gather BI req... | Responsibilities:\nDesign, develop, and mainta... | https://id.jobstreet.com/job/85904948?type=sta... |
| 572 | Senior Sales Account Manager/Sales Manager ( D... | PT Datumstruct Indonesia | Indonesia | Jakarta (Hybrid) | Account & Relationship Management (Sales) | Full time | Not specified | Requirements: \| | PT. Datumstruct Indonesia\nDatumstruct Tech Di... | https://id.jobstreet.com/job/85103468?type=sta... |
| 573 | Project Manager (ERP & Digital Transformation) | PT. ALPEN FOOD INDUSTRY | Indonesia | North Jakarta, Jakarta | Programme & Project Management (Information & ... | Full time | Not specified | Lead requirements gathering, validation, and c... | Aice is on a mission to accelerate our digital... | https://id.jobstreet.com/job/85871215?type=sta... |
| 574 | Software Engineer (Java) | PT. TALENTA TEKNOLOGI SUKSES GEMILANG | Indonesia | Jakarta | Developers/Programmers (Information & Communic... | Full time | Rp 10.000.000 – Rp 14.000.000 per month | ✅ Requirements: \| | ✅ Requirements:\nBachelor’s degree in Computer... | https://id.jobstreet.com/job/85905199?type=sta... |
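
Several `Requirements` values above contain only a heading (for example `Requirements: |`), because the keyword filter keeps lines that mention "requirement" or "qualification" and drops the bullet points that follow them. The sketch below is one possible alternative and not the method used in this notebook: it captures the lines after a requirements-style heading until the next heading. It assumes that a heading is a line of at most four words, which can misfire on very short bullet points. `extract_requirements` and `HEADER` are hypothetical names.

```python
import re

# Same keywords as the scraping cell, matched case-insensitively
HEADER = re.compile(r"requirement|qualification|kualifikasi", re.IGNORECASE)


def extract_requirements(description, max_heading_words=4):
    """Collect the lines that follow a requirements-style heading."""
    collected = []
    capturing = False
    for raw in description.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Assumption: short lines are section headings, longer lines are body text
        is_heading = len(line.split()) <= max_heading_words
        if is_heading and HEADER.search(line):
            capturing = True  # start (or continue) a requirements section
            continue
        if is_heading and capturing:
            break  # a different section begins
        if capturing:
            collected.append(line)
    return " | ".join(collected) if collected else "Not specified"


sample = (
    "About us\n"
    "We build data products for banks and insurers.\n"
    "Requirements:\n"
    "Bachelor's degree in Computer Science or related field\n"
    "Three years of experience with SQL and Python\n"
    "Benefits\n"
    "Hybrid working arrangements for all staff"
)
print(extract_requirements(sample))
# Bachelor's degree in Computer Science or related field | Three years of experience with SQL and Python
```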


In [None]:
df_jobs.to_csv('jobstreet_data.csv', index=False)

## Scraping Complete

The scraper collected hundreds of job links from JobStreet across the three countries. The details of each posting were then extracted and stored in a DataFrame (576 rows in this run). The data is cleaned and processed next, in the **Data Preprocessing (Text Cleaning)** section.


# **2. Data Preprocessing (Text Cleaning)**

In [None]:
# Import additional libraries

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the NLTK resources for English
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab')

print("Libraries imported successfully.")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Libraries imported successfully.


In [None]:
# Remove HTML tags, URLs, hashtags, emoji, digits, and punctuation
def clean_noise(text):

  # Remove complete HTML tags
  text = re.sub(r'<.*?>', '', text)
  # Remove URLs
  text = re.sub(r'https?://\S+|www\.\S+', '', text)
  # Remove hashtags
  text = re.sub(r'#\w+', '', text)
  # Remove emoji and punctuation (anything that is not a word character or whitespace)
  text = re.sub(r'[^\w\s]', '', text)
  # Remove digits
  text = re.sub(r'\d+', '', text)
  # Collapse repeated whitespace
  text = re.sub(r'\s+', ' ', text).strip()
  return text
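
To make the effect of each substitution concrete, the sketch below repeats `clean_noise` as defined above and runs it on two made-up strings. The sample inputs are assumptions chosen for illustration and are not rows from the scraped data.

```python
import re


def clean_noise(text):
    text = re.sub(r'<.*?>', '', text)                   # HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
    text = re.sub(r'#\w+', '', text)                    # hashtags
    text = re.sub(r'[^\w\s]', '', text)                 # emoji and punctuation
    text = re.sub(r'\d+', '', text)                     # digits
    text = re.sub(r'\s+', ' ', text).strip()            # extra whitespace
    return text


sample = "Apply at https://example.com/jobs #hiring! 3+ years of <b>SQL</b> 🚀"
print(clean_noise(sample))        # Apply at years of SQL

# Punctuation removal also alters skill names that rely on symbols
print(clean_noise("C++ and C#"))  # C and C
```

The second call shows a side effect worth keeping in mind when analysing skills: names such as `C++` and `C#` both collapse to `C` after cleaning.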

In [None]:
# Remove stopwords

# Build the set of English stopwords
from nltk.corpus import stopwords
list_stopwords = set(stopwords.words('english'))

def remove_stopwords(text):

  # Split the sentence into words (tokenization)
  tokens = text.split()

  # Drop tokens that appear in the stopword set
  tokens_without_stopwords = [word for word in tokens if word not in list_stopwords]

  # Join the remaining tokens back into a sentence
  text = ' '.join(tokens_without_stopwords)
  return text
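
The membership test in `remove_stopwords` is case-sensitive, and NLTK's English stopword list is lowercase, so capitalized stopwords survive unless the text is lowercased first (the cleaning pipeline in this notebook does lowercase before filtering). The sketch below illustrates the difference with a small hypothetical stopword set that stands in for the NLTK list, so it runs without any downloads.

```python
# Hypothetical stand-in for NLTK's English stopword list, which is all lowercase
SAMPLE_STOPWORDS = {"the", "of", "and", "a", "in"}


def remove_stopwords_exact(text, stopwords=SAMPLE_STOPWORDS):
    """Case-sensitive filter, mirroring remove_stopwords above."""
    return " ".join(w for w in text.split() if w not in stopwords)


def remove_stopwords_caseless(text, stopwords=SAMPLE_STOPWORDS):
    """Compare the lowercased token, so 'The' is treated like 'the'."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


sentence = "The role of the Data Analyst"
print(remove_stopwords_exact(sentence))     # The role Data Analyst
print(remove_stopwords_caseless(sentence))  # role Data Analyst
```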

In [None]:
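# Stemming
# Create the stemmer used by the cleaning pipeline
stemmer = PorterStemmer()
# The lemmatizer is instantiated here but is not used by the pipeline
lemmatizer = WordNetLemmatizer()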

### **Pipeline**

In [None]:
def cleaning_pipeline(text):
    text = clean_noise(text)

    # 1. Lowercase
    text = text.lower()

    # 2. Remove non-alphabetic characters
    text = re.sub(r'[^a-z\s]', '', text)

    # 3. Tokenize
    tokens = word_tokenize(text)

    # 4. Remove stopwords
    tokens = [word for word in tokens if word not in list_stopwords]

    # 5. Stemming
    stemmed_tokens = [stemmer.stem(word) for word in tokens]

    # 6. Join back
    cleaned_text = ' '.join(stemmed_tokens)

    return cleaned_text

In [None]:
# =====================
# POST-SCRAPING PIPELINE
# =====================

import pandas as pd

df_jobs = pd.read_csv("/content/jobstreet_data.csv")

if not df_jobs.empty:
    print("\n✅ Scraping succeeded. Starting the post-processing pipeline...")

    # Columns to clean
    text_columns = ['Title', 'Company', 'Country', 'Location', 'Category',
                    'Work Type', 'Salary', 'Requirements', 'Description']

    # Keep one raw row for comparison before preprocessing
    sample_row = df_jobs.iloc[0]

    # Clean every text column; missing values become empty strings so the regexes never receive NaN
    for col in text_columns:
        df_jobs[f'Cleaned {col}'] = df_jobs[col].fillna("").astype(str).apply(cleaning_pipeline)

    print("\n--- SAMPLE CLEANING RESULTS (First 5 Rows) ---")
    print(df_jobs[[f'Cleaned {col}' for col in text_columns]].head(5))

    # Reorder the columns for readability
    final_columns = text_columns + [f'Cleaned {col}' for col in text_columns] + ['Link']
    df_jobs = df_jobs[final_columns]

    # Save to file
    output_csv = "jobstreet_final_data.csv"
    df_jobs.to_csv(output_csv, index=False)
    print(f"\n📁 Data saved to: '{output_csv}'")

    output_json = "jobstreet_final_data.json"
    df_jobs.to_json(output_json, orient='records', indent=4)
    print(f"📁 Data saved to: '{output_json}'")

else:
    print("⚠️ No data was scraped.")


✅ Scraping succeeded. Starting the post-processing pipeline...

--- SAMPLE CLEANING RESULTS (First 5 Rows) ---
                             Cleaned Title  \
0                           data scientist   
1  senior data scientist genai hybrid work   
2                           data scientist   
3                        data scienc engin   
4                           data scientist   

                            Cleaned Company Cleaned Country  \
0  western digit tech region center sdn bhd        malaysia   
1                                      seek        malaysia   
2                 sedgwick singapor pte ltd        malaysia   
3                     jabil circuit sdn bhd        malaysia   
4                         techtiera sdn bhd        malaysia   

                       Cleaned Location  \
0                          kuala lumpur   
1                   kuala lumpur hybrid   
2  kuala lumpur citi centr kuala lumpur   
3                                penang   
4                      

# Process complete and data ready to use!

The full pipeline ran successfully and covered the following steps:
1. Data Collection: Job postings were scraped from JobStreet with Selenium.

2. Data Cleaning (Preprocessing):
- Conversion of text to lowercase.
- Removal of noise (HTML tags, URLs, hashtags, digits, emoji, and punctuation).
- Removal of English stopwords using the NLTK list.
- Stemming with the Porter stemmer to standardize word forms. A `WordNetLemmatizer` is instantiated, but the pipeline does not apply lemmatization.

3. Data Export: The cleaned, structured job dataset was saved in CSV and JSON formats (jobstreet_final_data.csv and jobstreet_final_data.json).
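
Because the JSON export uses `orient='records'`, the file is a list with one object per posting, which the standard `json` module can read without pandas. The sketch below parses a two-record sample in that shape and counts postings per country. The sample is abbreviated to three keys taken from the outputs shown earlier, and `load_jobs` is a hypothetical helper name.

```python
import json
from collections import Counter


def load_jobs(path):
    """Read a records-oriented JSON export into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Abbreviated sample in the same shape as jobstreet_final_data.json
sample = """[
    {"Title": "Data Scientist", "Country": "Malaysia", "Cleaned Title": "data scientist"},
    {"Title": "Data Science Engineer", "Country": "Malaysia", "Cleaned Title": "data scienc engin"}
]"""

records = json.loads(sample)
per_country = Counter(record["Country"] for record in records)

print(per_country)                  # Counter({'Malaysia': 2})
print(records[1]["Cleaned Title"])  # data scienc engin
```

For the real file, `records = load_jobs("jobstreet_final_data.json")` would replace the inline sample.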