## Getting the dataset

For this project we will scrape laptop data from dns-shop.kz, one of the largest retailers in Kazakhstan.  
Homepage: https://www.dns-shop.kz/  
Search start page: https://www.dns-shop.kz/catalog/17a892f816404e77/noutbuki/

In [68]:
import numpy as np
print('numpy version:', np.__version__)

import pandas as pd
print('pandas version:', pd.__version__)

from bs4 import BeautifulSoup

from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium import webdriver

import sys
import pathlib

import logging

from time import sleep
import re

numpy version: 1.23.5
pandas version: 1.5.2


In [69]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logging.Formatter(fmt='[%(asctime)s: %(funcName)s: %(levelname)s] %(message)s'))
logger.addHandler(handler)

In [70]:
pathlib.Path('d', 'Applications', 'WebDriver')

WindowsPath('d/Applications/WebDriver')

In [71]:
chrome_options = Options()
# chrome_options.add_argument('--disable-extensions')
# chrome_options.add_argument('--disable-gpu')
# chrome_options.add_argument('--headless')
service = Service(executable_path=pathlib.WindowsPath('d:/Applications/WebDriver/chromedriver-112-x32.exe'))
browser = webdriver.Chrome(service=service, options=chrome_options)

Search page: https://www.dns-shop.kz/catalog/17a892f816404e77/noutbuki/  
Suffix for page navigation: `?p=2`  
Number of pages: taken from the last `li` element with class="pagination-widget__page", data-role="pagination-page", data-page-number="12"  
Product description page prefix: https://www.dns-shop.kz/product/  
Product short description selector: `body > div.container.category-child > div > div.products-page__content > div.products-page__list > div.products-list > div > div.catalog-products.view-simple > div.catalog-product.ui-button-widget`  
Product short description: `<a class="catalog-product__name ui-link ui-link_black" href="/product/43ad370591b71bb0/156-noutbuk-asus-laptop-15-x515ka-ej065w-seryj/"><span>15.6" Ноутбук ASUS Laptop 15 X515KA-EJ065W серый [Full HD (1920x1080), TN+film, Intel Celeron N4500, ядра: 2 х 1.1 ГГц, RAM 8 ГБ, SSD 128 ГБ, Intel UHD Graphics , Windows 11 Home Single Language]</span></a>`  
Product price selector: `body > div.container.category-child > div > div.products-page__content > div.products-page__list > div.products-list > div > div:nth-child(1) > div:nth-child(1) > div.product-buy.product-buy_one-line.catalog-product__buy > div > div.product-buy__price`
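
The URL rules noted above (the `?p=N` suffix for pagination and the site root as the prefix for relative product links) can be sketched with the standard library alone. The helper names `page_url` and `product_url` are hypothetical and exist only for this illustration; the two constants repeat the URLs listed above.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.dns-shop.kz'
SEARCH_PAGE = 'https://www.dns-shop.kz/catalog/17a892f816404e77/noutbuki/'


def page_url(page):
    """Return the catalog URL for a 1-based page number (hypothetical helper)."""
    # The first page is the bare search page; later pages add the ?p=N suffix.
    return SEARCH_PAGE if page == 1 else urljoin(SEARCH_PAGE, f'?p={page}')


def product_url(href):
    """Turn a relative product href from the catalog into an absolute URL."""
    return urljoin(BASE_URL, href)


print(page_url(2))
print(product_url('/product/43ad370591b71bb0/156-noutbuk-asus-laptop-15-x515ka-ej065w-seryj/'))
```

Storing absolute links this way is optional; the relative `href` values are enough as long as the base URL is kept alongside the dataset.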

In [73]:
base_url = 'https://www.dns-shop.kz'
search_page = 'https://www.dns-shop.kz/catalog/17a892f816404e77/noutbuki/'

In [74]:
laptops = {
    'id': [],
    'link': [],
    'descr_short': [],
    'price': []
}
browser.get(search_page)
soup = BeautifulSoup(browser.page_source, 'html.parser')
# The last pagination item carries the total number of pages
num_pages = int(soup.find_all('li', 'pagination-widget__page')[-1]['data-page-number'])
for page in range(1, num_pages+1):
    if page > 1:
        browser.get(search_page+f'?p={page}')
        sleep(5) # needed to give time to update prices as they seem to be loaded dynamically
        soup = BeautifulSoup(browser.page_source, 'html.parser')
    ids = soup.find_all('div', 'catalog-product ui-button-widget')
    descrs = soup.find_all('a', 'catalog-product__name ui-link ui-link_black')
    prices = soup.find_all('div', 'product-buy__price')
    print(f'Page: {page}: Got {len(ids)} ids, {len(descrs)} descriptions, {len(prices)} prices')
    for el in ids:
        laptops['id'].append(el['data-code'])
    for el in descrs:
        laptops['link'].append(el['href'])
        laptops['descr_short'].append(el.span.text)
    for el in prices:
        laptops['price'].append(el.text)

Page: 1: Got 18 ids, 18 descriptions, 18 prices
Page: 2: Got 18 ids, 18 descriptions, 18 prices
Page: 3: Got 18 ids, 18 descriptions, 18 prices
Page: 4: Got 18 ids, 18 descriptions, 18 prices
Page: 5: Got 18 ids, 18 descriptions, 18 prices
Page: 6: Got 18 ids, 18 descriptions, 18 prices
Page: 7: Got 18 ids, 18 descriptions, 18 prices
Page: 8: Got 18 ids, 18 descriptions, 18 prices
Page: 9: Got 18 ids, 18 descriptions, 18 prices
Page: 10: Got 18 ids, 18 descriptions, 18 prices
Page: 11: Got 18 ids, 18 descriptions, 18 prices
Page: 12: Got 15 ids, 15 descriptions, 15 prices
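
The fixed `sleep(5)` in the loop above works, but it always waits the full five seconds and still fails if a page is slower than that. One alternative is to poll for a condition with a timeout. Selenium ships its own explicit-wait helper (`WebDriverWait`) for this purpose; the sketch below is a plain standard-library stand-in for the same idea, and the names `wait_until` and `ready_on_third_call` are hypothetical.

```python
import time


def wait_until(predicate, timeout=10.0, poll=0.5):
    """Call predicate() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} s')
        time.sleep(poll)


# Stand-in for "prices have been rendered": becomes true on the third check.
attempts = {'n': 0}


def ready_on_third_call():
    attempts['n'] += 1
    return attempts['n'] >= 3


wait_until(ready_on_third_call, timeout=2.0, poll=0.01)
print('checks needed:', attempts['n'])
```

In the scraping loop the predicate could, for example, re-parse `browser.page_source` and report whether the number of price elements matches the number of product cards; that condition is an assumption about the page, not something the site documents.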


In [75]:
for key, value in laptops.items():
    print(key, len(value))

id 213
link 213
descr_short 213
price 213
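
Because ids, descriptions and prices are collected by three independent loops, a single product card without a price would shift every later price by one row. Equal column lengths are therefore a necessary sanity check, though not a proof of correct alignment. A minimal sketch of that check as a reusable function (the name `check_aligned` and the toy data are hypothetical):

```python
def check_aligned(columns):
    """Return the common length of all lists in `columns`, or raise if they differ."""
    lengths = {key: len(values) for key, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'column lengths differ: {lengths}')
    return next(iter(lengths.values()), 0)


# Toy data with the same keys as the `laptops` dictionary.
sample = {
    'id': ['1', '2'],
    'link': ['/product/a/', '/product/b/'],
    'descr_short': ['first', 'second'],
    'price': ['99 990 ₸', '109 990 ₸'],
}
print(check_aligned(sample))
```

Calling it once per page, right after the three `find_all` calls, would point to the exact page where a mismatch first appears.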


In [76]:
browser.quit()

In [77]:
df = pd.DataFrame(laptops)
df.head()

| | id | link | descr_short | price |
|---|---|---|---|---|
| 0 | 5074554 | /product/a9069bce37c6ed20/14-noutbuk-dexp-aqui... | 14" Ноутбук DEXP Aquilon серебристый [Full HD ... | 99 990 ₸ |
| 1 | 5074520 | /product/5b988b0337c5ed20/141-noutbuk-dexp-aqu... | 14.1" Ноутбук DEXP Aquilon серый [Full HD (192... | 109 990 ₸ |
| 2 | 5074555 | /product/0997a2e037c7ed20/156-noutbuk-dexp-aqu... | 15.6" Ноутбук DEXP Aquilon серебристый [Full H... | 109 990 ₸ |
| 3 | 5074553 | /product/1b77f39237c6ed20/156-noutbuk-dexp-aqu... | 15.6" Ноутбук DEXP Aquilon серебристый [Full H... | 119 990 ₸ |
| 4 | 4900552 | /product/5dd56779a0f76200/156-noutbuk-lenovo-i... | 15.6" Ноутбук Lenovo IdeaPad 3 15IGL05 серый [... | 139 990 ₸ |


From the short description of each laptop we can extract the main characteristics that we will use later.

In [78]:
df['screen_size'] = df['descr_short'].apply(lambda x: re.findall(pattern=r'.+?(?=")', string=x)[0].strip())
df['resolution'] = df['descr_short'].apply(lambda x: re.findall(pattern=r'\d+x\d{,4}', string=x)[0].strip())
df['cpu'] = df['descr_short'].apply(lambda x: re.findall(pattern=r',[^,]+(?:intel|amd|apple).*?,', string=x, flags=re.I)[0][1:-1].strip())
#df['cpu_cores'] = df['descr_short'].apply(lambda x: re.findall(r'(?<=ядра:\s).*(?=\sГГц)', x, flags=re.I)[0].strip())
df['ram'] = df['descr_short'].apply(lambda x: re.findall(r'(?<=ram).*?(?=гб)', x, flags=re.I)[0].strip())
df['hdd_ssd'] = df['descr_short'].apply(lambda x: re.findall(r'(?<=hdd\s|ssd\s)\d+', x, flags=re.I)[0].strip())
df['gpu'] = df['descr_short'].apply(lambda x: x[x.index('[')+1 : -1].split(',')[-2].strip())
df['os'] = df['descr_short'].apply(lambda x: x[x.index('[')+1 : -1].split(',')[-1].strip())
df['price'] = df['price'].apply(lambda x: int(''.join(re.findall(r'\d+', x))))
df.tail()

| | id | link | descr_short | price | screen_size | resolution | cpu | ram | hdd_ssd | gpu | os |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 208 | 4881039 | /product/8b8679ac3238ed20/142-noutbuk-apple-ma... | 14.2" Ноутбук Apple MacBook Pro серый [3024x19... | 1499990 | 14.2 | 3024x1964 | Apple M1 Pro | 16 | 512 | Apple M1 Pro 14-core | macOS |
| 209 | 4881038 | /product/8b8679ab3238ed20/142-noutbuk-apple-ma... | 14.2" Ноутбук Apple MacBook Pro серебристый [3... | 1572990 | 14.2 | 3024x1964 | Apple M1 Pro | 16 | 1024 | Apple M1 Pro 16-core | macOS |
| 210 | 4881029 | /product/8b8679a13238ed20/162-noutbuk-apple-ma... | 16.2" Ноутбук Apple MacBook Pro серый [3456x22... | 1574990 | 16.2 | 3456x2234 | Apple M1 Pro | 16 | 512 | Apple M1 Pro 16-core | macOS |
| 211 | 4881043 | /product/91f64b813238ed20/142-noutbuk-apple-ma... | 14.2" Ноутбук Apple MacBook Pro серый [3024x19... | 1644990 | 14.2 | 3024x1964 | Apple M1 Pro | 32 | 512 | Apple M1 Pro 14-core | macOS |
| 212 | 9970850 | /product/4ab1aff6b210b603/16-noutbuk-asus-rog-... | 16" Ноутбук ASUS ROG Strix SCAR 16 G634JY-NM03... | 1999990 | 16.0 | 2560x1600 | Intel Core i9-13980HX | 32 | 1000 | GeForce RTX 4090 для ноутбуков 16 ГБ | Windows 11 Pro |
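
The one-line lambdas above raise `IndexError` as soon as a description lacks one of the expected fields. As a hedged alternative, the same extraction can be gathered into one function that returns `None` for anything it cannot find. This is a sketch under the assumption that every description follows the `size" name [spec, spec, ..., gpu, os]` layout seen in the sample from the catalog page. The names `parse_description`, `parse_price` and `_first` are hypothetical, and the patterns are simplified variants of the ones used above, not drop-in replacements.

```python
import re


def _first(pattern, text):
    """Return the first capture group of `pattern` in `text`, or None."""
    match = re.search(pattern, text, flags=re.I)
    return match.group(1).strip() if match else None


def parse_description(descr):
    """Split a short product description into a dict of characteristics."""
    # Everything after the first '[' is the comma-separated spec list.
    start = descr.find('[')
    specs = descr[start + 1:].rstrip(']') if start != -1 else ''
    parts = [p.strip() for p in specs.split(',')] if specs else []
    return {
        'screen_size': _first(r'^\s*([\d.]+)\s*"', descr),
        'resolution': _first(r'(\d+x\d+)', descr),
        # First spec mentioning a CPU vendor; the GPU comes later in the list.
        'cpu': next((p for p in parts if re.search(r'intel|amd|apple', p, re.I)), None),
        'ram': _first(r'RAM\s*(\d+)\s*ГБ', descr),
        'hdd_ssd': _first(r'(?:HDD|SSD)\s+(\d+)', descr),
        'gpu': parts[-2] if len(parts) >= 2 else None,
        'os': parts[-1] if parts else None,
    }


def parse_price(text):
    """Join all digit groups of a price string such as '99 990 ₸' into an int."""
    digits = ''.join(re.findall(r'\d+', text))
    return int(digits) if digits else None


sample = ('15.6" Ноутбук ASUS Laptop 15 X515KA-EJ065W серый [Full HD (1920x1080), '
          'TN+film, Intel Celeron N4500, ядра: 2 х 1.1 ГГц, RAM 8 ГБ, SSD 128 ГБ, '
          'Intel UHD Graphics , Windows 11 Home Single Language]')
parsed = parse_description(sample)
print(parsed)
print(parse_price('99 990 ₸'))
```

With a function like this, `df['descr_short'].apply(parse_description).apply(pd.Series)` would give all columns at once, and rows with missing values can be inspected instead of stopping the whole cell.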


In [79]:
pathlib.Path('datasets').mkdir(exist_ok=True)  # make sure the output folder exists
df.to_csv('datasets/laptops.csv', index=False)