This is the Python code to scrape product text data and product images from the JD website. The scraper relies mainly on the Python libraries Selenium, Requests, and BeautifulSoup. In brief, the scraper works by drilling down the category tree on the JD website. First, it accesses the major category pages (e.g., Drinks), then the minor category pages (e.g., Soft Drinks), and then the individual product pages (e.g., Coca-Cola 2L). The process is iterative so that each product page is accessed separately and just once. This Python code can be adapted to scrape product text data and product images from other websites.

# 0. Required downloads

In order for this code to run, the following programs must be downloaded:

- Anaconda (https://www.anaconda.com/products/distribution). This code is designed to run in Jupyter Notebook within Anaconda. There should be a free version of Anaconda.

- Firefox browser

- Gecko driver (https://github.com/mozilla/geckodriver/releases)

# 1. Import required libraries

These are the standard libraries for web scraping in Python. You'll likely be using these same libraries to scrape from other websites. Note, I believe all of these libraries come pre-installed with Anaconda. If you receive an error when importing any of them, you can install the missing library using pip install (https://datatofish.com/install-package-python-using-pip/).

In [2]:
# import selenium - this is used to automate browsing of any website. 
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# import requests - this is used to extract the HTML from the desired webpages
import requests

# import beautifulsoup - this is used to isolate the desired information from the extracted HTML
from bs4 import BeautifulSoup

# import regular expressions for data manipulation
import re

# import numpy for manipulation of numerical data
import numpy as np

# import pandas - this is used for dataframe manipulation
import pandas as pd

# set the maximum number of columns/rows to display in pandas dataframe
pd.set_option('display.max_columns', 999)
pd.set_option('display.max_rows', 999)

# import time so that we can time how long each step takes
import time

# import library to download images based on URL
import urllib.request

# 2. Open Firefox browser using Selenium and log into JD account

There are two challenges that are specific to scraping JD. These challenges may/may not be present when scraping other websites:

- Sometimes, webpages can take >1 minute to fully load. This is problematic as the scraping time is proportional to the webpage loading time (e.g., scraping 1 million pages at a rate of 1 min/page would take 694 days!). It appears this slowness is present on both Australian and Chinese servers. For example, most of the time (but not all of the time) this page takes about 1 min 30 seconds to load: https://list.jd.com/list.html?cat=1320,1583. However, fortunately, the desired product information (i.e., product text and product images) typically finishes loading within 5 seconds of accessing a webpage, and therefore we can instruct the scraper to only wait 5 seconds for a webpage to load.

- The full range of products is only viewable after logging into an account. The website will often default to the login page if a user tries to browse without being logged into an account (i.e., https://passport.jd.com/new/login.aspx). To overcome this, we will need to manually log into an account prior to scraping. 
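To make the first point concrete, the arithmetic can be checked with a few lines of Python. This is only a rough sketch: it assumes pages are scraped one after another with no other overhead, and the function name `scraping_days` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def scraping_days(num_pages, seconds_per_page):
    """Days needed to scrape num_pages one after another at a fixed time per page."""
    return num_pages * seconds_per_page / SECONDS_PER_DAY

# waiting for a full page load (about 1 minute per page)
full_load = scraping_days(1_000_000, 60)

# waiting only 5 seconds per page, as this scraper does
short_wait = scraping_days(1_000_000, 5)

print(f"{full_load:.0f} days vs {short_wait:.0f} days")
```

Under these assumptions, capping the wait at 5 seconds cuts the time for 1 million pages from roughly 694 days to roughly 58 days.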


In [122]:
# First, we specify that we don't want the web scraper to wait for the webpages to fully load
# We therefore set the page load strategy to "none" instead of "normal"
# Later in the notebook, we instruct the scraper to wait a few seconds (typically 5) after loading each page
capa = DesiredCapabilities.FIREFOX
capa["pageLoadStrategy"] = "none" # do not block until the page has finished loading

# We then open a new Firefox browser using Selenium. For this to open, make sure to:
# (1) Update the 'executable_path' to wherever you saved geckodriver
# (2) Ensure privacy settings in system preferences allow 'geckodriver' to be opened
# Note: 'executable_path' and 'desired_capabilities' are Selenium 3 style arguments;
# recent Selenium 4 releases expect Service and Options objects instead
driver = webdriver.Firefox(executable_path = "/Users/tazmandavies/Downloads/geckodriver",
                           desired_capabilities=capa)

After the browser has opened, enter 'https://passport.jd.com/new/login.aspx' into the browser address bar, and then log into a JD account.

# 3. Create spreadsheet of subcategory URLs

Here, we need to manually feed the web scraper with the URLs for each JD category page. There are seven of them: 进口食品, 地方特产, 休闲食品, 粮油调味, 饮料冲调, 节庆食品/礼券, and 茗茶. For each category, we instruct the web scraper to construct a dataframe of subcategory URLs (steps 3.1 to 3.7), and we then combine the seven dataframes (step 3.8). Then, as shown in step 3.9, we access each subcategory page and determine the number of products. The output from this phase is saved as 'JD_subcategory_URLs.xlsx'. When Tazman ran this code on 2022/12/18, there were 660 subcategories and 2,869,885 products! (The cell outputs shown in this section report 685 subcategories and 2,706,844 products, so they presumably come from a different run.)

This section should take about 30-40 minutes to run.

If some of the Python code doesn't make sense in this section, check out this webpage: https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/


#### 3.1. 进口食品

In [4]:
# set category URL for 进口食品
进口食品 = 'https://list.jd.com/list.html?cat=1320,5019'

# access URL in firefox
driver.get(进口食品)

# wait 5 seconds to make sure the desired information has loaded
time.sleep(5)

# extract the HTML from the category page
soup=BeautifulSoup(driver.page_source, 'lxml')

# create an empty dataframe (df) for 进口食品
进口食品_df = pd.DataFrame()

# add a column for the subcategory name
进口食品_df['Subcategory'] = [x.find('a').get('onclick').split(',')[-1][1:-2] for x in soup.find_all('div', {"class":"sl-value"})[3].find_all('li')]

# add a column for the subcategory URL
进口食品_df['URL'] = ['https://search.jd.com/' + x.find('a').get('href') for x in soup.find_all('div', {"class":"sl-value"})[3].find_all('li')]

# add a column for the category
进口食品_df['Category'] = '进口食品'
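
The list comprehensions above pull the subcategory name out of each link's `onclick` attribute and build the URL by string concatenation. Below is a minimal sketch of the same two steps for a single link. The `onclick` and `href` values are invented for illustration (the real attribute contents on JD may differ), `parse_subcategory_link` is a hypothetical helper, and `urljoin` is swapped in for plain concatenation only to avoid a doubled slash when the `href` starts with `/`.

```python
from urllib.parse import urljoin

def parse_subcategory_link(onclick, href, base='https://search.jd.com/'):
    """Return (subcategory name, absolute URL) for one subcategory link."""
    # take the last comma-separated argument, e.g. "'威化饼干')",
    # then drop the opening quote and the closing quote plus bracket
    name = onclick.split(',')[-1][1:-2]
    # urljoin resolves a root-relative href without doubling the slash
    url = urljoin(base, href)
    return name, url

# hypothetical attribute values, not copied from the JD website
onclick = "searchlog(1,0,0,71,'威化饼干')"
href = '/list.html?cat=1320%2C5019%2C5020'

name, url = parse_subcategory_link(onclick, href)
print(name)
print(url)
```

The slice `[1:-2]` only works if the name is the last argument and is wrapped in single quotes, so it is worth re-checking whenever the page layout changes.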

#### 3.2. 地方特产

In [5]:
# set category URL for 地方特产
地方特产 = 'https://list.jd.com/list.html?cat=1320,1581'

# access URL in firefox
driver.get(地方特产)

# wait 5 seconds to make sure the desired information has loaded
time.sleep(5)

# extract the HTML from the category page
soup=BeautifulSoup(driver.page_source, 'lxml')

# create a dataframe (df) for 地方特产
地方特产_df = pd.DataFrame()
地方特产_df['Subcategory'] = [x.find('a').get('onclick').split(',')[-1][1:-2] for x in soup.find_all('div', {"class":"sl-value"})[2].find_all('li')]
地方特产_df['URL'] = ['https://search.jd.com/' + x.find('a').get('href') for x in soup.find_all('div', {"class":"sl-value"})[2].find_all('li')]
地方特产_df['Category'] = '地方特产'


#### 3.3. 休闲食品

In [6]:
# set category URL for 休闲食品
休闲食品 = 'https://list.jd.com/list.html?cat=1320,1583'

# access URL in firefox
driver.get(休闲食品)

# wait 5 seconds to make sure the desired information has loaded
time.sleep(5)

# extract the HTML from the category page
soup=BeautifulSoup(driver.page_source, 'lxml')

# create a dataframe (df) for 休闲食品
休闲食品_df = pd.DataFrame()
休闲食品_df['Subcategory'] = [x.find('a').get('onclick').split(',')[-1][1:-2] for x in soup.find_all('li', {"data-group":"1"})]
休闲食品_df['URL'] = ['https://search.jd.com/' + x.find('a').get('href') for x in soup.find_all('li', {"data-group":"1"})]
休闲食品_df['Category'] = '休闲食品'


#### 3.4. 粮油调味

In [7]:
# set category URL for 粮油调味
粮油调味 = 'https://list.jd.com/list.html?cat=1320,1584'

# access URL in firefox
driver.get(粮油调味)

# wait 5 seconds to make sure the desired information has loaded
time.sleep(5)

# extract the HTML from the category page
soup=BeautifulSoup(driver.page_source, 'lxml')

# create a dataframe (df) for 粮油调味
粮油调味_df = pd.DataFrame()
粮油调味_df['Subcategory'] = [x.find('a').get('onclick').split(',')[-1][1:-2] for x in soup.find_all('li', {"data-group":"2"})]
粮油调味_df['URL'] = ['https://search.jd.com/' + x.find('a').get('href') for x in soup.find_all('li', {"data-group":"2"})]
粮油调味_df['Category'] = '粮油调味'


#### 3.5. 饮料冲调

In [8]:
# set category URL for 饮料冲调
饮料冲调 = 'https://list.jd.com/list.html?cat=1320,1585'

# access URL in firefox
driver.get(饮料冲调)

# wait 5 seconds to make sure the desired information has loaded
time.sleep(5)

# extract the HTML from the category page
soup=BeautifulSoup(driver.page_source, 'lxml')

# create a dataframe (df) for 饮料冲调
饮料冲调_df = pd.DataFrame()
饮料冲调_df['Subcategory'] = [x.find('a').get('onclick').split(',')[-1][1:-2] for x in soup.find_all('li', {"data-group":"6"})]
饮料冲调_df['URL'] = ['https://search.jd.com/' + x.find('a').get('href') for x in soup.find_all('li', {"data-group":"6"})]
饮料冲调_df['Category'] = '饮料冲调'


#### 3.6. 节庆食品/礼券

In [9]:
# set category URL for 节庆食品/礼券
节庆食品 = 'https://search.jd.com/list.html?cat=1320,2641'

# access URL in firefox
driver.get(节庆食品)

# wait 5 seconds to make sure the desired information has loaded
time.sleep(5)

# extract the HTML from the category page
soup=BeautifulSoup(driver.page_source, 'lxml')

# from the HTML, extract the names and URLs of all subcategories
节庆食品_df = pd.DataFrame()
节庆食品_df['Subcategory'] = [x.find('a').get('onclick').split(',')[-1][1:-2] for x in soup.find_all('li', {"data-group":"3"})]
节庆食品_df['URL'] = ['https://search.jd.com/' + x.find('a').get('href') for x in soup.find_all('li', {"data-group":"3"})]
节庆食品_df['Category'] = '节庆食品/礼券'


#### 3.7. 茗茶

In [10]:
# set category URL for 茗茶
茗茶 = 'https://list.jd.com/list.html?cat=1320,12202'

# access URL in firefox
driver.get(茗茶)

# wait 5 seconds to make sure the desired information has loaded
time.sleep(5)

# extract the HTML from the 茗茶 page
soup=BeautifulSoup(driver.page_source, 'lxml')

# create a dataframe (df) for 茗茶
茗茶_df = pd.DataFrame()
茗茶_df['Subcategory'] = [x.find('a').get('onclick').split(',')[-1][1:-2] for x in soup.find_all('li', {"data-group":"2"})]
茗茶_df['URL'] = ['https://search.jd.com/' + x.find('a').get('href') for x in soup.find_all('li', {"data-group":"2"})]
茗茶_df['Category'] = '茗茶'


#### 3.8. Combine seven dataframes together

In [11]:
# vertically combine the seven datasets
df = pd.concat([进口食品_df, 地方特产_df, 休闲食品_df, 粮油调味_df, 饮料冲调_df, 节庆食品_df, 茗茶_df]).reset_index(drop=True)

# change order of columns
df = df[['Category', 'Subcategory', 'URL']]

# see total number of subcategories
df.shape[0]

685

#### 3.9. Determine the number of products in each subcategory

In [13]:
%%time 
# this block should take about 35 minutes to run

# create an empty list to append the number of products from each subcategory
num_products_list = []

for url in df['URL'].tolist():
    # access subcategory URL
    driver.get(url)
    
    # wait three seconds to make sure page has sufficiently loaded
    time.sleep(3)
    
    # extract the HTML from the page
    soup=BeautifulSoup(driver.page_source, 'lxml')
    
    # extract number of products where available
    try:
        # extract the number of products from the 'soup' variable where possible
        num_products = soup.find('div', {"class":"f-result-sum"}).text
        
    except AttributeError:
        # but if the element is missing from the page, just set this to 'Not sure'
        num_products = 'Not sure'
        
    # append number of products to list
    num_products_list.append(num_products)
    
# create new column in df dataset for number of products
df['n'] = num_products_list

CPU times: user 1min 1s, sys: 445 ms, total: 1min 1s
Wall time: 35min 29s


In [15]:
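# create clean variable for number of products
# e.g. '共1700+ 件商品' becomes 1700; a value given in 万 is multiplied by 10000;
# values that cannot be parsed (such as 'Not sure') are counted as 0
def parse_count(text):
    match = re.search(r'(\d+(?:\.\d+)?)\s*(万?)', text)
    if match is None:
        return 0
    return int(round(float(match.group(1)) * (10000 if match.group(2) else 1)))

df['n clean'] = [parse_count(x) for x in df['n']]

# export dataset
df.to_excel('JD_subcategory_URLs.xlsx', index=False)

# see the total number of subcategories - 685
print(df.shape[0])

# see total number of products - 2.7 million!!!
print(df['n clean'].sum())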

685
2706844


In [16]:
# this is a sample of what the output of this section will look like
df.head()

Unnamed: 0,Category,Subcategory,URL,n,n clean
0,进口食品,威化饼干,https://search.jd.com//list.html?cat=1320%2C50...,共1700+ 件商品,1700
1,进口食品,海苔片,https://search.jd.com//list.html?cat=1320%2C50...,共300+ 件商品,300
2,进口食品,水晶米,https://search.jd.com//list.html?cat=1320%2C50...,共90+ 件商品,90
3,进口食品,果味汽水,https://search.jd.com//list.html?cat=1320%2C50...,共1300+ 件商品,1300
4,进口食品,驼奶粉,https://search.jd.com//list.html?cat=1320%2C50...,共100+ 件商品,100


# 4. Narrow down category selection

JD has A LOT of products: approximately 2.7 million. I estimate that scraping product text data and image data for this many products would take about 90 days (i.e., at a rate of roughly 30K products per day). I also anticipate that scraping this many products would produce about 27 million product images (i.e., 10 images per product), which would equate to about 5400 GB of image data alone! Yikes!
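
These estimates can be reproduced with simple arithmetic. In the sketch below, the scraping rate, the images per product, and the total storage are the rough estimates stated above. The average image size is not stated anywhere; it is only what those figures imply, taking 1 GB as 1,000,000 KB.

```python
NUM_PRODUCTS = 2_700_000      # approximate total from section 3
PRODUCTS_PER_DAY = 30_000     # estimated scraping rate
IMAGES_PER_PRODUCT = 10       # estimated average number of images
TOTAL_IMAGE_GB = 5_400        # estimated total image storage

# number of days needed at the estimated scraping rate
days = NUM_PRODUCTS / PRODUCTS_PER_DAY

# total number of images to download
num_images = NUM_PRODUCTS * IMAGES_PER_PRODUCT

# average image size implied by the storage estimate
kb_per_image = TOTAL_IMAGE_GB * 1_000_000 / num_images

print(days, num_images, kb_per_image)
```

In other words, the storage estimate corresponds to about 200 KB per image; changing any of the four constants shows how sensitive the totals are to these guesses.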

I recommend we narrow down the selection of categories from JD. In the 'df' dataset, let's create a column for whether or not we would like to scrape a subcategory. As an example, below I have just scraped information for the subcategory 西柚汁.


In [24]:
# let's create a column for whether or not we want to scrape a subcategory

# For example, in this code, I have just included 西柚汁
df['Include'] = 0
df['Include'] = np.where(df['Subcategory'] == '西柚汁', 1, df['Include'])

# create list of subcategory URLs of interest
subcategory_urls = df[df['Include']==1]['URL'].tolist()

# 5. Create spreadsheet of product page URLs

In this section, we access each of the subcategory pages. For each advertised product, we collect information on the product name, product price, and product URL. The output from this phase is saved as 'JD_product_URLs.xlsx'.

The time taken to run this section depends on the number of subcategories.

In [36]:
%%time
data = []

# helper to collect the desired information for every product on one page
def collect_products(soup, category, subcategory, page_number):
    # create a list of chunks; one chunk for each product
    chunks = soup.find('div', {"id":"J_goodsList"}).find_all('li')
    
    for chunk in chunks:
        row = {}
        row['Category'] = category
        row['Subcategory'] = subcategory
        row['Page number'] = page_number
        row['Product name'] = chunk.find('a').get('title')
        row['Product price'] = chunk.find('div', {"class":"p-price"}).text.replace('\n', '')
        row['Product URL'] = 'https://item.jd.com/' + chunk.get('data-sku') + '.html'
        data.append(row)

# iterate through each of the subcategory URLs
for subcategory_url in subcategory_urls:
    print(subcategory_url)
    
    # look up the category and subcategory names for this URL
    category = df[df['URL'] == subcategory_url]['Category'].tolist()[0]
    subcategory = df[df['URL'] == subcategory_url]['Subcategory'].tolist()[0]
    
    # access the subcategory page, this will load 30 products; wait 5 seconds
    driver.get(subcategory_url)
    time.sleep(5)
    
    # scroll down to the bottom of the page to load another 30 products; wait 5 seconds
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    
    # access the HTML from the page
    soup=BeautifulSoup(driver.page_source, 'lxml')
    
    # determine the number of pages for the subcategory
    num_pages = int(soup.find('span', {"class":"fp-text"}).text.split('/')[1])
    
    # collect the products from the first page
    collect_products(soup, category, subcategory, 1)
        
    # repeat the above steps for any additional pages for the subcategory;
    # the 'page' value in the URL increases by 2 for each displayed page (3, 5, 7, ...)
    for num_page in range(3, 2*num_pages+1, 2):
        driver.get(subcategory_url + '&page=' + str(num_page))
        time.sleep(2)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
        soup=BeautifulSoup(driver.page_source, 'lxml')
        collect_products(soup, category, subcategory, (num_page+1)//2)
        

CPU times: user 402 ms, sys: 8.64 ms, total: 411 ms
Wall time: 18.5 s
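
The pagination loop in the cell above uses odd `page` values in the URL (3, 5, 7, ...) for displayed pages 2, 3, 4, and so on. My reading of the code, which is an inference rather than something JD documents, is that each displayed page is served as two batches of 30 products, so the URL value advances by 2 per displayed page. A small sketch of the mapping, using hypothetical helper names:

```python
def page_params(num_pages):
    """Return the 'page' URL values for displayed pages 2..num_pages."""
    return list(range(3, 2 * num_pages + 1, 2))

def displayed_page(page_param):
    """Map a 'page' URL value (1, 3, 5, ...) back to the displayed page number."""
    return (page_param + 1) // 2

# example: a subcategory with 4 displayed pages
params = page_params(4)
print(params)
print([displayed_page(p) for p in params])
```

A subcategory with a single page yields an empty list, which is why the loop is skipped in that case.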


In [39]:
df2 = pd.DataFrame(data)

# export dataset; the file name can also be changed to '.csv' if preferable
df2.to_excel('JD_product_URLs.xlsx', index=False)

# see sample of products
df2.head()

Unnamed: 0,Category,Subcategory,Page number,Product name,Product price,Product URL
0,饮料冲调,西柚汁,1,果满乐乐（gomolo）地中海塞浦路斯进口 100%红心西柚汁 大瓶装纯果汁 1升*4瓶,￥48.00,https://item.jd.com/100021726516.html
1,饮料冲调,西柚汁,1,上好佳大湖 Great lakes西柚汁果蔬汁饮料1L/瓶婚宴家用大瓶装（新老包装随机发）,￥35.79,https://item.jd.com/10064874457545.html
2,饮料冲调,西柚汁,1,本小青西柚汁饮料本小青新鲜果汁0脂肪网红解腻饮品300ml*6瓶,￥31.40,https://item.jd.com/100039808885.html
3,饮料冲调,西柚汁,1,冲调方便，19种口味可选,￥55.00,https://item.jd.com/45538492136.html
4,饮料冲调,西柚汁,1,云舵 【优选好物】夏季网红饮品本小青西柚汁饮料新鲜维C果汁0脂肪解 西柚汁330ml*12瓶,￥132.30,https://item.jd.com/10066285549681.html


# 6. Create spreadsheet of product information and product image URLs

In this section, we iterate through each of the product URLs to obtain the product text information and the product image URLs. Importantly, on each page we scroll down to the 商品评价 section so that this section can load and we can scrape the comments images. 

The time taken to run this section depends on the number of products.

In [187]:
%%time
data2 = []

# create a counter - this can show us how many product URLs the code has iterated through
count = 0

# iterate through each of the product URLs
product_urls = df2['Product URL'].tolist()
for product_url in product_urls:
    count += 1
    if count % 10 == 0:
        print(count)
    
    # access product URL; wait 5 seconds to load
    driver.get(product_url)
    time.sleep(5)
    
    # scroll down to the 商品评价 section so that the comments can load; wait 5 seconds
    try:
        target = driver.find_element(By.ID, 'comment')
        actions = ActionChains(driver)
        actions.move_to_element(target).perform()
        time.sleep(5)
    except Exception:
        # if the comment section cannot be found, carry on without it
        pass
    
    # extract HTML from the product page
    soup=BeautifulSoup(driver.page_source, 'lxml')

    # create a dictionary for the given product
    row = {}
    
    # create a field for the product URL
    row['001_Product URL'] = product_url
     
    # extract the category information
    try:
        row['002_Category level 1'] = soup.find('div', {"class":"crumb fl clearfix"}).find('a', {"clstag":"shangpin|keycount|product|mbNav-2"}).text
    except:
        None
    try:
        row['003_Category level 2'] = soup.find('div', {"class":"crumb fl clearfix"}).find('a', {"clstag":"shangpin|keycount|product|mbNav-3"}).text
    except:
        None
    try:
        row['004_Category level 3'] = soup.find('div', {"class":"crumb fl clearfix"}).find('a', {"clstag":"shangpin|keycount|product|mbNav-4"}).text
    except:
        None
    
    # extract the brand name
    try:
        row['005_Brand name'] = soup.find('div', {"class":"crumb fl clearfix"}).find('a', {"clstag":"shangpin|keycount|product|mbNav-5"}).text
    except:
        None
    
    # extract the product name
    try:
        row['006_Product name'] = soup.find('div', {"class":"sku-name"}).text.replace('\n', '').lstrip().rstrip()
    except:
        None
    
    # extract the product price
    try:
        row['007_Price'] = soup.find('div', {"class":"summary-price-wrap"}).find('span', {"class":"p-price"}).text.replace('\n', '')
    except:
        None
    
    # extract the URLs for the images in the left panel (first 5 images)
    try:   
        product_photos = ['https:' + x.get('src').replace('n5','n0').replace('.avif', '').replace('https:https:', 'https:') for x in soup.find('div', {"id":"spec-list"}).find_all('img')]
        num_product_photos = len(product_photos)
        for n in range(0, num_product_photos):
            if (n < 5):
                row['008_Product photo ' + str(n)] = product_photos[n].replace('.gif', '')
    except:
        None

    # extract the URLs for the images in the 商品介绍 section (first 5 images)
    try:
        #description_photos = ['https:' + x.get('src').replace('.avif', '').replace('https:https:', 'https:') for x in soup.find('div', {"id":"J-detail-content"}).find_all('img')]
        description_photos = [x.split('.avif')[0].replace("(", 'https:') for x in str(soup.find('div', {"id":"J-detail-content"})).split('background-image:url')[1:-1]]
        
        num_description_photos = len(description_photos)
        for n in range(0, num_description_photos):
            if (n < 5):
                row['009_Description photo ' + str(n)] = description_photos[n].replace('.gif', '')
    except:
        None
  
    # extract the URLs for the images in the 商品评价 section (first 10)
    try:
        comment_photos = ['https:' + x.get('src').replace('n0/s48x48_jfs/', 'shaidan/s616x405_jfs/').replace('.avif', '').replace('https:https:', '') for x in soup.find('div', {"id":"comment-0"}).find_all('img')]
        comment_photos = [x for x in comment_photos if 'https://img' == x[:11]]
        num_comment_photos = len(comment_photos)
        for n in range(0, num_comment_photos):
            if (n < 10):
                row['010_Comment photo ' + str(n)] = comment_photos[n].replace('.gif', '')
    except:
        None
    
    # append the product information to 'data2'
    data2.append(row)
    

10
20
30
40
50
60
70
80
90
100
110
120
CPU times: user 11.3 s, sys: 142 ms, total: 11.5 s
Wall time: 10min 29s
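
The image URLs in the cell above are rebuilt from the thumbnail `src` values by adding the `https:` scheme, swapping `n5` for `n0` (which appears to select a larger version of the same image), and dropping the `.avif` suffix. One caveat is that `.replace('n5', 'n0')` changes every occurrence of `n5`, including any inside the file name. The sketch below is a slightly stricter variant that only touches the `/n5/` path segment. The `src` value is made up for illustration and `full_size_url` is a hypothetical helper, not part of the scraper.

```python
def full_size_url(src):
    """Turn a thumbnail 'src' value into a full-size image URL."""
    # add the scheme only when the src is protocol-relative ('//img...')
    url = 'https:' + src if src.startswith('//') else src
    # swap only the first '/n5/' path segment, not any 'n5' inside the file name
    url = url.replace('/n5/', '/n0/', 1)
    # drop a trailing '.avif' so the underlying image file is requested
    if url.endswith('.avif'):
        url = url[:-len('.avif')]
    return url

# hypothetical thumbnail src, not copied from the JD website
src = '//img11.360buyimg.com/n5/jfs/t1/0/0/examplen5.jpg.avif'
print(full_size_url(src))
```

Checking for the `//` prefix also removes the need for the later `.replace('https:https:', 'https:')` clean-up.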


In [188]:
# create dataframe of product information
df3 = pd.DataFrame(data2)

# sort the columns by their numeric prefix, then remove the prefix (e.g., '001_')
cols = df3.columns.tolist()
cols.sort()
df3 = df3[cols]
df3.columns = [x[4:] for x in df3.columns.tolist()]

# create a variable for the SKU
df3['SKU'] = df3['Product URL'].str.split('/').str[-1].str.split('.html').str[0].astype(int)

# export dataset
df3.to_excel('JD_product_info.xlsx', index=False)

# see sample of products
df3.head(2)

Unnamed: 0,Product URL,Category level 1,Category level 2,Category level 3,Brand name,Product name,Price,Product photo 0,Product photo 1,Product photo 2,Product photo 3,Product photo 4,Description photo 0,Description photo 1,Description photo 2,Description photo 3,Description photo 4,Comment photo 0,Comment photo 1,Comment photo 2,SKU
0,https://item.jd.com/100021726516.html,饮料冲调,饮料,果蔬汁/饮料,果满乐乐（gomolo）,果满乐乐（gomolo）地中海塞浦路斯进口 100%红心西柚汁 大瓶装纯果汁 1升*4瓶,￥48.00,https://img11.360buyimg.com/n0/jfs/t1/98350/21...,https://img11.360buyimg.com/n0/jfs/t1/119583/4...,https://img11.360buyimg.com/n0/jfs/t1/34617/22...,https://img11.360buyimg.com/n0/jfs/t1/158775/3...,https://img11.360buyimg.com/n0/jfs/t1/42644/24...,https://img30.360buyimg.com/sku/jfs/t1/125420/...,https://img30.360buyimg.com/sku/jfs/t1/77393/3...,https://img30.360buyimg.com/sku/jfs/t1/78564/3...,https://img30.360buyimg.com/sku/jfs/t1/83012/3...,https://img30.360buyimg.com/sku/jfs/t1/184610/...,,,,100021726516
1,https://item.jd.com/10064874457545.html,饮料冲调,饮料,果蔬汁/饮料,雨小姐（YUXIAOJIE）,上好佳大湖 Great lakes西柚汁果蔬汁饮料1L/瓶婚宴家用大瓶装（新老包装随机发）,￥35.79,https://img10.360buyimg.com/n0/jfs/t1/29201/1/...,https://img10.360buyimg.com/n0/jfs/t1/156514/2...,https://img10.360buyimg.com/n0/jfs/t1/134967/1...,https://img10.360buyimg.com/n0/jfs/t1/169016/5...,https://img10.360buyimg.com/n0/jfs/t1/90430/3/...,,,,,,,,,10064874457545
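
The numeric prefixes on the field names (e.g., '001_', '008_') exist so that the columns can be put into a fixed order. Each product's dictionary may be missing some fields, so the column order of the dataframe would otherwise depend on which fields happened to appear first. A minimal sketch of the trick on a plain dictionary, with made-up values:

```python
# fields deliberately listed out of order, with made-up values
row = {
    '006_Product name': 'example product',
    '001_Product URL': 'https://item.jd.com/0.html',
    '010_Comment photo 0': None,
    '008_Product photo 0': None,
}

# sorting the keys restores the intended column order
ordered = sorted(row)

# dropping the first four characters ('NNN_') gives the final column names
clean = [key[4:] for key in ordered]
print(clean)
```

The prefixes need to be zero-padded to the same width (three digits here), otherwise '10_' would sort before '2_'.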


I note that the code above consistently scrapes the images from the product photo panel, but only inconsistently scrapes the images from the 商品介绍 section and the 商品评价 section. I'm not sure why this is.

# 7. Scrape product images

Here, we access the product image URLs and download the images.

The time taken to run this section depends on the number of products. The 西柚汁 subcategory alone produced 629 images!

In [189]:
# prepare image URL scraper
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'MyApp/1.0')]
urllib.request.install_opener(opener)

In [191]:
%%time

# prepare list of product SKUs to scrape
SKU_list = df3['SKU'].tolist()

# use counter
count = 0
error_count = 0

# scrape the product images for each SKU
for SKU in SKU_list:
    count += 1
    print(count)
    
    # determine list of image URLs from spreadsheet (every column after 'Price', except 'SKU')
    image_urls = df3[df3['SKU']==SKU].iloc[0][7:-1].dropna().tolist()
    num_image_urls = len(image_urls)
    
    # download each image; files are named '<SKU>|<image number + 10>.jpg'
    # (note that '|' is not a valid file name character on Windows)
    for num in range(0, num_image_urls):
        try:
            urllib.request.urlretrieve(image_urls[num], str(SKU) + '|' + str(num+10) + ".jpg") 
        except Exception:
            # count failed downloads rather than stopping the loop
            error_count += 1
    

1
2
3
...
123
CPU times: user 6.02 s, sys: 1.06 s, total: 7.07 s
Wall time: 3min 7s
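
The file names used above combine the SKU with the image number plus 10, presumably so that every suffix has two digits and the files sort in order. Below is a small sketch of the naming scheme with an underscore as the separator, which is a swap I am suggesting because '|' is not allowed in Windows file names. Both helper names are hypothetical.

```python
def sku_from_url(product_url):
    """Extract the numeric SKU from a URL such as https://item.jd.com/123.html."""
    return int(product_url.split('/')[-1].split('.html')[0])

def image_filename(sku, index, sep='_'):
    """Build a file name; index + 10 keeps every suffix two digits long."""
    return f"{sku}{sep}{index + 10}.jpg"

# product URL taken from the sample output in section 5
sku = sku_from_url('https://item.jd.com/100021726516.html')
print(image_filename(sku, 0))
```

Passing `sep='|'` reproduces the names written by the cell above.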


In [192]:
error_count

1