# Step #1 - Requirements Gathering & Proposed Solution

At this stage, you need to understand what data will be used, where it comes from, what its shape and context are, and which solution fits the problem raised by the stakeholder. The solution covers choices such as which methods to apply in the Transform step and which tools to use.

## Data sources used:
### 1. Sales Data
The *sales data* is available through the following Docker image:  
[https://hub.docker.com/r/shandytp/amazon-sales-data-docker-db](https://hub.docker.com/r/shandytp/amazon-sales-data-docker-db)
### 2. Marketing Data
The *marketing data* is available at the following link:  
[ElectronicsProductsPricingData.csv](ElectronicsProductsPricingData.csv)
### 3. Web Scraping
You are free to choose which website to *scrape*, such as a news portal or something similar. You may *scrape* by parsing the HTML structure or by calling an API. However, make sure the website permits *scraping*, and include a *disclaimer* in the documentation!
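One way to check whether a site permits scraping is to read its `robots.txt` rules before sending any requests. The sketch below uses only the standard library; the function name `is_scraping_allowed` and the sample rules are hypothetical, and a permissive `robots.txt` does not replace reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser


def is_scraping_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
sample_rules = """\
User-agent: *
Disallow: /checkout/
"""

print(is_scraping_allowed(sample_rules, "https://example.com/collections/backpack"))
print(is_scraping_allowed(sample_rules, "https://example.com/checkout/cart"))
```

For a real site, the rules text would come from fetching `<site>/robots.txt` first, and the result can be quoted in the documentation's disclaimer.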

In [1]:
# import library
from sqlalchemy import create_engine
import pandas as pd
import requests
import luigi
import json
from urllib import request

In [5]:
# pd.read_csv already returns a DataFrame, so no extra conversion is needed
marketing_data = pd.read_csv('source-marketing_data/ElectronicsProductsPricingData.csv')
marketing_data.head()

Unnamed: 0,id,prices.amountMax,prices.amountMin,prices.availability,prices.condition,prices.currency,prices.dateSeen,prices.isSale,prices.merchant,prices.shipping,...,name,primaryCategories,sourceURLs,upc,weight,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30
0,AVphzgbJLJeJML43fA0o,104.99,104.99,Yes,New,USD,"2017-03-30T06:00:00Z,2017-03-10T22:00:00Z,2017...",False,Bestbuy.com,,...,Sanus VLF410B1 10-Inch Super Slim Full-Motion ...,Electronics,https://www.amazon.com/Sanus-VLF410B1-10-Inch-...,794000000000.0,32.8 pounds,,,,,
1,AVpgMuGwLJeJML43KY_c,69.0,64.99,In Stock,New,USD,2017-12-14T06:00:00Z,True,Walmart.com,Expedited,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642000000000.0,14 pounds,,,,,
2,AVpgMuGwLJeJML43KY_c,69.0,69.0,In Stock,New,USD,2017-09-08T05:00:00Z,False,Walmart.com,Expedited,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642000000000.0,14 pounds,,,,,
3,AVpgMuGwLJeJML43KY_c,69.99,69.99,Yes,New,USD,2017-10-10T05:00:00Z,False,Bestbuy.com,,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642000000000.0,14 pounds,,,,,
4,AVpgMuGwLJeJML43KY_c,66.99,66.99,Yes,New,USD,2017-08-28T07:00:00Z,False,Bestbuy.com,,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642000000000.0,14 pounds,,,,,
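The preview shows several columns named `Unnamed: N`, which pandas generates when a CSV column has no header. A small helper can list them so they can be reviewed during the Transform step. This is a minimal sketch: `placeholder_columns` is a hypothetical name, and whether these columns should be dropped depends on confirming that they carry no data.

```python
def placeholder_columns(columns):
    """Return the column names that pandas auto-generated for header-less columns."""
    return [name for name in columns if str(name).startswith("Unnamed:")]


# Column names copied from the preview above (abridged).
preview_columns = [
    "Unnamed: 0", "id", "prices.amountMax", "name", "weight",
    "Unnamed: 26", "Unnamed: 27",
]
print(placeholder_columns(preview_columns))

# Possible use with the DataFrame loaded earlier (assumes the columns are empty):
# marketing_data = marketing_data.drop(columns=placeholder_columns(marketing_data.columns))
```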


In [22]:
def db_source_sales_engine():
    db_username = 'postgres'
    db_password = 'password123'
    db_host = 'localhost:5433'
    db_name = 'etl_db'

    engine_str = f"postgresql://{db_username}:{db_password}@{db_host}/{db_name}"
    engine = create_engine(engine_str)

    return engine
    

In [10]:
source_engine = db_source_sales_engine()
source_engine

Engine(postgresql://postgres:***@localhost:5433/etl_db)
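The connection string above hardcodes the database password in the notebook. A possible alternative is to read the settings from environment variables and URL-encode the credentials. The sketch below only builds the URL string; the variable names (`SALES_DB_USER`, `SALES_DB_PASSWORD`, `SALES_DB_HOST`, `SALES_DB_NAME`) are assumptions, not part of the original setup.

```python
import os
from urllib.parse import quote_plus


def build_sales_db_url(env=os.environ):
    """Build a PostgreSQL URL from environment-style settings (hypothetical names)."""
    user = env.get("SALES_DB_USER", "postgres")
    password = env["SALES_DB_PASSWORD"]  # required: fails loudly if missing
    host = env.get("SALES_DB_HOST", "localhost:5433")
    name = env.get("SALES_DB_NAME", "etl_db")
    # quote_plus escapes characters such as '@' or '/' that would break the URL
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}/{name}"


# Example with a made-up password that contains special characters.
example_env = {"SALES_DB_PASSWORD": "p@ss/word"}
print(build_sales_db_url(example_env))
```

The result could then be passed to `create_engine(build_sales_db_url())` in place of the hardcoded f-string.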

In [11]:
query = """
SELECT table_name 
FROM information_schema.tables
WHERE table_schema = 'public'
"""
tables_df = pd.read_sql_query(query, source_engine)
table_names = tables_df['table_name'].tolist()  # List of table names

print("Tables found:", table_names)

Tables found: ['amazon_sales_data']


In [13]:
query = "SELECT * FROM amazon_sales_data"
sales_data = pd.read_sql(query, source_engine)  # already returns a DataFrame
sales_data

Unnamed: 0.1,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price,Unnamed: 0
0,Aahwan Women's & Girls' Solid Basic Super Crop...,women's clothing,Western Wear,https://m.media-amazon.com/images/I/61Ou9rolop...,https://www.amazon.in/Aahwan-Cropped-Without-W...,,,₹399,₹999,
1,Fabme Unisex's Cold Weather Headband (PO2-ACC0...,sports & fitness,"All Sports, Fitness & Outdoors",https://m.media-amazon.com/images/I/81LVOS343V...,https://www.amazon.in/Fabme-Unisexs-Headband-P...,5,1,₹265,₹999,1110.0
2,Men's Fashion Sneakers Lace-Up Trainers Basket...,men's shoes,Casual Shoes,https://m.media-amazon.com/images/I/71sCueaM0-...,https://www.amazon.in/Fashion-Sneakers-Lace-Up...,,,,,
3,HISTORICAL INDIA - Gwalior Collection - ½ Anna...,women's clothing,Clothing,https://m.media-amazon.com/images/I/91N6W7gYl3...,https://www.amazon.in/HISTORICAL-INDIA-Gwalior...,4.4,40,₹670,"₹1,500",
4,Sonata Act Safety Watch Analog White Dial Wome...,accessories,Watches,https://m.media-amazon.com/images/I/81sf24RFnD...,https://www.amazon.in/Sonata-Safety-Analog-Wom...,3,22,,"₹3,040",
...,...,...,...,...,...,...,...,...,...,...
100887,LORENZ Analogue Black Dial Men's Watch -Combo ...,stores,Men's Fashion,https://m.media-amazon.com/images/I/71BEdDAGaI...,https://www.amazon.in/Lorenz-MK-4849A-Combo-Bl...,3.5,40,₹319,"₹1,999",7707.0
100888,Campus Men's Rampage Running Shoes,men's shoes,Sports Shoes,https://m.media-amazon.com/images/I/71cVJlYVkA...,https://www.amazon.in/Campus-Rampage-R-Slate-R...,4,31,"₹1,949","₹2,799",
100889,Sri Jagdamba Pearls 22KT Yellow Gold Chain for...,accessories,Gold & Diamond Jewellery,https://m.media-amazon.com/images/W/IMAGERENDE...,https://www.amazon.in/Sri-jagdamaba-pearls-Yel...,,,"₹1,46,905","₹1,60,260",
100890,mitushi products Boys One Piece Swimsuit,kids' fashion,Kids' Fashion,https://m.media-amazon.com/images/W/IMAGERENDE...,https://www.amazon.in/mitushi-products-Shorts-...,4.1,143,₹400,₹450,
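The `discount_price` and `actual_price` columns arrive as strings such as `₹1,46,905`, with a currency symbol and Indian-style digit grouping, and some rows are empty. One possible Transform step is to convert them to numbers. The helper below is a sketch under that assumption; `parse_rupee` is a hypothetical name, and it treats anything unparseable as missing.

```python
import math


def parse_rupee(value):
    """Convert a price string such as '₹1,46,905' to a float; return None if missing."""
    if value is None:
        return None
    text = str(value).replace("₹", "").replace(",", "").strip()
    if not text:
        return None
    try:
        number = float(text)
    except ValueError:
        return None
    # pandas represents missing values as NaN, which float() accepts as 'nan'
    return None if math.isnan(number) else number


print(parse_rupee("₹1,46,905"))
print(parse_rupee("₹399"))
print(parse_rupee(None))

# Possible use with the DataFrame loaded earlier:
# sales_data["actual_price"] = sales_data["actual_price"].map(parse_rupee)
```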


In [15]:
class ExtractMarketingData(luigi.Task):
    def requires(self):
        # No upstream tasks: this is the first step of the pipeline
        pass

    def run(self):
        # Read the source CSV
        marketing_data = pd.read_csv('source-marketing_data/ElectronicsProductsPricingData.csv')

        marketing_data.to_csv(self.output().path, index = False)

    def output(self):
        return luigi.LocalTarget('raw-data/extracted_marketing_data.csv')

In [25]:
class ExtractDatabaseSalesData(luigi.Task):
    def requires(self):
        pass
    def run(self):
        engine = db_source_sales_engine()
        query = 'SELECT * FROM amazon_sales_data'

        db_data = pd.read_sql(query, engine)
        
        db_data.to_csv(self.output().path, index = False)
    def output(self):
        return luigi.LocalTarget('raw-data/extracted_sales_data.csv')
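Both tasks write with `DataFrame.to_csv(self.output().path)`, which does not create missing directories, so the run fails if `raw-data/` does not exist yet. A small guard can create it first. This is a minimal sketch; `ensure_parent_dir` is a hypothetical helper, shown here against a temporary directory.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent directory of path if it does not exist yet."""
    parent = Path(path).parent
    parent.mkdir(parents=True, exist_ok=True)
    return parent


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target_path = Path(tmp) / "raw-data" / "extracted_sales_data.csv"
    ensure_parent_dir(target_path)
    print(target_path.parent.is_dir())
```

Inside a task's `run()`, calling `ensure_parent_dir(self.output().path)` just before `to_csv` would have the same effect.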

In [26]:
luigi.build([ExtractDatabaseSalesData()], local_scheduler = True)

DEBUG: Checking if ExtractDatabaseSalesData() is complete
INFO: Informed scheduler that task   ExtractDatabaseSalesData__99914b932b   has status   PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 11412] Worker Worker(salt=9432138340, workers=1, host=zueible, username=LENOVO, pid=11412) running   ExtractDatabaseSalesData()
INFO: [pid 11412] Worker Worker(salt=9432138340, workers=1, host=zueible, username=LENOVO, pid=11412) done      ExtractDatabaseSalesData()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   ExtractDatabaseSalesData__99914b932b   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=9432138340, workers=1, host=zueible, username=LENOVO, pid=11412) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====

Scheduled 1 t

True

In [2]:
from tqdm import tqdm
from bs4 import BeautifulSoup

In [3]:
resp_1= requests.get('https://shopee.co.id/torchidofficial#product_list')
resp_1.status_code

200

In [3]:
resp = requests.get('https://torch.id/collections/backpack-torch?itm_source=mega-menu-backpack&itm_campaign=1&limit=48')
resp.status_code

200

In [12]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import time

url = "https://www.tokopedia.com/torch-id/product"

# Setup Selenium
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
driver.get(url)

try:
    # Wait for the page to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'body'))
    )

    # Scroll down several times to load more products
    for _ in range(5):  # Adjust the number of scrolls as needed
        # Scroll using JavaScript
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # Wait for products to load

    # Collect all products
    products = driver.find_elements(By.CLASS_NAME, "prd_link-product-name")

    print(f"Total products found: {len(products)}")

    # Print the product names
    for product in products:
        print(product.text)

except Exception as e:
    print(f"An error occurred: {e}")

finally:
    # Close the browser
    driver.quit()

Total products found: 80
TORCH Tas Pouch Antro - Sling Bag Mini Anti Air Transparan Outdoor Air Pria Wanita
Tas Pouch Sling Bag Mini Transparan Stylish Anti Air Pria Wanita - TORCH Antro
Tas Selempang Ringan Anti Air Outdoor -  Torch Wonjin Sling Bag
Torch Wonjin Sling Bag - Tas Pinggang Travel Ringan Praktis Stylish
Tas Jinjing Lipat Olahraga Gym Sport Outdoor Yesan - Torch Duffle Bag
Torch Duffle Bag Stylish - Tas Jinjing Baju Olahraga Gym Lipat Yesan
Torch Dangsan Backpack - Tas Ransel Sekolah Kekinian Anak Sekolah Outdoor Laptop 14 Inch
Torch Dangsan Backpack - Tas Ransel Laptop Sekolah Anak Laptop 14 Inch Stylish
Torch Backpack Stylish Ransel Anti Air Kuliah Sekolah Kerja Harian - Purana
Torch Purana Backpack - Ransel Laptop Anti Air Aktivitas Harian Outdoor Sekolah Kuliah Kerja
Torch Travel Backpack Collection - Tas Punggung Travelling
Torch Backpack: Koleksi Ransel Sekolah Keren Nyaman Ringan Muat Banyak
Torch Ringkas Backpack Laptop Tas Ransel Punggung Kerja Kuliah Minimalis 
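The cell above scrolls a fixed five times, which may be too few or too many depending on the page. An alternative is to keep scrolling until the page height stops growing. The sketch below assumes a Selenium-style driver with `execute_script`; `scroll_until_stable` and `FakeDriver` are hypothetical names, and the fake driver only stands in for a browser so the logic can be run without one.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing; return rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Stand-in for a Selenium driver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[min(self.index, len(self.heights) - 1)]
        self.index += 1  # each scroll "loads" the next chunk
        return None


print(scroll_until_stable(FakeDriver(), pause=0))
```

With a real browser, the call would be `scroll_until_stable(driver)` in place of the fixed `for _ in range(5)` loop; `max_rounds` keeps an endlessly growing page from looping forever.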

In [4]:
soup = BeautifulSoup(resp.text, 'html.parser')
soup.find_all('div')

[<div id="cx_whatsapp_init"></div>,
 <div class="shopify-section shopify-section-group-header-group section-header" id="shopify-section-sections--16382233084060__header"><link href="//torch.id/cdn/shop/t/92/assets/component-list-menu.css?v=151968516119678728991707739999" media="print" onload="this.media='all'" rel="stylesheet"/>
 <link href="//torch.id/cdn/shop/t/92/assets/component-search.css?v=165164710990765432851707739998" media="print" onload="this.media='all'" rel="stylesheet"/>
 <link href="//torch.id/cdn/shop/t/92/assets/component-menu-drawer.css?v=85170387104997277661707739998" media="print" onload="this.media='all'" rel="stylesheet"/>
 <link href="//torch.id/cdn/shop/t/92/assets/component-cart-notification.css?v=54116361853792938221707739998" media="print" onload="this.media='all'" rel="stylesheet"/>
 <link href="//torch.id/cdn/shop/t/92/assets/component-cart-items.css?v=136978088507021421401707739999" media="print" onload="this.media='all'" rel="stylesheet"/><link href="//to

In [15]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import time

# Base URL for the product pages
base_url = "https://www.tokopedia.com/torch-id/product/page/{}"

# Setup Selenium
options = webdriver.ChromeOptions()
# Add arguments to avoid restrictions
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])

driver = webdriver.Chrome(options=options)

# List that collects every product name
all_products = []

try:
    for page in range(1, 12):
        # Build the URL for each page
        url = base_url.format(page)
        driver.get(url)

        # Wait for the page to load
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.TAG_NAME, 'body'))
        )

        # Scroll in stages to make sure all content is loaded
        for _ in range(5):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)

            # Scroll back up to the middle of the page to trigger lazy loading
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight / 2);")
            time.sleep(2)

        # Find all products using two alternative selectors
        products = driver.find_elements(By.CSS_SELECTOR, ".prd_link-product-name, [data-testid='linkProductName']")
        print(f"Page {page}: found {len(products)} products")

        # Add products to the main list, skipping names already collected
        for product in products:
            if product.text and product.text not in all_products:
                all_products.append(product.text)

except Exception as e:
    print(f"An error occurred: {e}")

finally:
    # Close the browser
    driver.quit()

# Print the overall total and a numbered list of product names
print(f"\nTotal products: {len(all_products)}")
for number, name in enumerate(all_products, start=1):
    print(f"{number}. {name}")

Page 1: found 80 products
Page 2: found 80 products
Page 3: found 80 products
Page 4: found 80 products
Page 5: found 80 products
Page 6: found 80 products
Page 7: found 80 products
Page 8: found 80 products
Page 9: found 80 products
Page 10: found 80 products
Page 11: found 60 products

Total products: 860
1. TORCH Tas Pouch Antro - Sling Bag Mini Anti Air Transparan Outdoor Air Pria Wanita
2. Tas Pouch Sling Bag Mini Transparan Stylish Anti Air Pria Wanita - TORCH Antro
3. Tas Selempang Ringan Anti Air Outdoor -  Torch Wonjin Sling Bag
4. Torch Wonjin Sling Bag - Tas Pinggang Travel Ringan Praktis Stylish
5. Tas Jinjing Lipat Olahraga Gym Sport Outdoor Yesan - Torch Duffle Bag
6. Torch Duffle Bag Stylish - Tas Jinjing Baju Olahraga Gym Lipat Yesan
7. Torch Dangsan Backpack - Tas Ransel Sekolah Kekinian Anak Sekolah Outdoor Laptop 14 Inch
8. Torch Dangsan Backpack - Tas Ransel Laptop Sekolah Anak Laptop 14 Inch Stylish
9