# Amazon UK Search Results Notebook

## Installations

Importing the modules needed to run this notebook. Make sure selenium, webdriver-manager, selectorlib, and fake-useragent are installed before running it.

In [18]:
import requests
import json
import time
import warnings
import numpy as np
import pandas as pd

In [2]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selectorlib import Extractor
from fake_useragent import UserAgent
from selenium.webdriver.support.ui import WebDriverWait

## Loading Pre-Documented Gender Stereotyped Toys

Loading predoc_stereotyped_items.csv, a CSV file containing 72 rows with one column each for BOY, GIRL, and NEUTRAL toys.

In [27]:
stereo_toys = pd.read_csv('../predoc_info/predoc_stereotyped_items.csv', delimiter =',')
stereo_toys[:10]

Unnamed: 0,BOY,GIRL,NEUTRAL
0,vehicle toys,doll,toy animals
1,sport,domestic toys,books
2,military toys,educational art,educational teaching
3,race cars,clothes,musical games
4,outer space toys,dollhouses,games
5,depots,clothing accessories,live animals
6,machines,doll accessories,
7,doll-humanoid,furnishing,
8,action figures,ballerina costume,candy land
9,gi joe action figure,barbie costume,winnie the pooh


In [29]:
len(stereo_toys)

72
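
The columns have different lengths, so the shorter ones are padded with empty cells (rows 6 and 7 of NEUTRAL in the table above). A minimal sketch of turning each column into a clean list, using only the standard library and a few of the rows shown above as inline data; `column_lists` is a hypothetical helper and the exact layout of the real CSV file is an assumption:

```python
import csv
import io

# A few rows copied from the preview above, used as illustrative input.
SAMPLE = """BOY,GIRL,NEUTRAL
depots,clothing accessories,live animals
machines,doll accessories,
doll-humanoid,furnishing,
action figures,ballerina costume,candy land
"""

def column_lists(csv_text):
    """Map each column name to its non-empty values, in row order."""
    columns = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, value in row.items():
            if value:  # skip the empty padding cells
                columns.setdefault(name, []).append(value)
    return columns

toys_by_label = column_lists(SAMPLE)
print(len(toys_by_label['BOY']), len(toys_by_label['NEUTRAL']))  # 4 2
```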

## Loading List of Toys Collected from Previous Research

all_items.txt lists one toy per line; each line is a search term that will be queried on Amazon UK. The file contains 166 lines.

In [26]:
with open('../predoc_info/all_items.txt') as f:
    all_items = f.read().splitlines()

In [5]:
len(all_items)

166

## Trial

Creating a short list of 6 toys from all_items. The trial list lets the following functions be tested on a small sample rather than on all 166 toys.

In [6]:
trial = all_items[160:]
trial

['legos', 'scooter', 'drum set', 'puzzles', 'board games', 'rock painting']

In [None]:
len(trial)

In [7]:
generic = ['toys', 'books', 'learning material', 'games', 'sports']

In [8]:
gender = ['boys', 'girls', 'neutral']

## Scraping Functions

### Unique Identifier Function

This function scrapes the ASIN (Amazon Standard Identification Number) of each toy on the search results page.

In [9]:
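def asin(driver):
    # Collect the data-asin attribute of every result container on the page.
    # Some containers carry an empty data-asin, so empty strings can appear.
    asin_list = []
    elements = driver.find_elements(By.XPATH, "//div[@data-asin]")
    for element in elements:
        asin_list.append(element.get_attribute('data-asin'))
    return asin_list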
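
The result tables further down show an empty string at the start of each ASIN list, which suggests that some matched `div` elements carry an empty `data-asin`. A minimal sketch of a post-processing step, assuming empty and repeated ASINs should be dropped; `clean_asins` is a hypothetical helper, not part of the original notebook:

```python
def clean_asins(raw_asins):
    """Drop empty strings and duplicates while keeping first-seen order."""
    # dict.fromkeys preserves insertion order, so duplicates are removed
    # without reordering the remaining ASINs.
    return [a for a in dict.fromkeys(raw_asins) if a]

sample = ['', 'B0BQDCTRHK', 'B09WF29MFV', 'B0BQDCTRHK', '']
print(clean_asins(sample))  # ['B0BQDCTRHK', 'B09WF29MFV']
```

Keeping first-seen order matters here if the position of a product in the search results is part of the later analysis.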

### Product Title Function

This function is used to scrape the name of each toy.

In [10]:
def item_info(driver):
    item = []
    elem = driver.find_elements(By.CSS_SELECTOR, 'h2')
    for i in elem:
        item.append(i.text)
    return item

### Product Link Function

This function is used to scrape the associated links of each toy.

In [11]:
def item_link(driver):
    href = []
    links = driver.find_elements('xpath', "//h2//a[@href]")
    for link in links:
        href.append(link.get_attribute('href'))
    return href
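
The links in the output further down begin with `https://www.amazon.co.uk/sspa/click`. A rough sketch for separating those links from the rest, assuming (this is not stated in the notebook) that an `/sspa/` path marks a sponsored placement; `split_links` and the sample URLs are hypothetical:

```python
from urllib.parse import urlparse

def split_links(hrefs):
    """Return (sspa_links, other_links) based on the URL path prefix."""
    sspa, other = [], []
    for href in hrefs:
        path = urlparse(href).path
        if path.startswith('/sspa/'):
            sspa.append(href)
        else:
            other.append(href)
    return sspa, other

# Illustrative URLs only; real result links carry longer query strings.
links = [
    'https://www.amazon.co.uk/sspa/click?ie=UTF8',
    'https://www.amazon.co.uk/example-toy/dp/B0BQDCTRHK',
]
sspa_links, other_links = split_links(links)
print(len(sspa_links), len(other_links))  # 1 1
```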

### Search Function

This function calls the three functions above for a single toy and a single audience: boys, girls, or neutral (searched as "for kids"). It returns the ASINs and titles as a tuple, together with the list of links.

In [12]:
def search(item, who):
    # Relies on a module-level `driver` created before the call.
    # 'neutral' is searched as "<item> for kids".
    if who == 'neutral':
        query = item + ' for kids'
    else:
        query = item + ' for ' + who
    driver.get(f'https://www.amazon.co.uk/s?k={query}&ref=nb_sb_noss')
    driver.implicitly_wait(10)
    list_asin = asin(driver)
    item_list = item_info(driver)
    item_page = item_link(driver)
    # Returns ((ASINs, titles), links)
    return (list_asin, item_list), item_page
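
The query is interpolated into the URL with its spaces intact, leaving the browser to encode them. A small sketch of building the same URL with explicit encoding through the standard library; `build_search_url` and `BASE_URL` are hypothetical names, not part of the notebook:

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.amazon.co.uk/s'

def build_search_url(item, who):
    """Build the search URL for one toy and one audience."""
    # 'neutral' is searched as "for kids", mirroring search().
    audience = 'kids' if who == 'neutral' else who
    query = f'{item} for {audience}'
    # urlencode escapes spaces and other reserved characters.
    params = urlencode({'k': query, 'ref': 'nb_sb_noss'})
    return f'{BASE_URL}?{params}'

print(build_search_url('drum set', 'neutral'))
# https://www.amazon.co.uk/s?k=drum+set+for+kids&ref=nb_sb_noss
```

Explicit encoding mainly matters for search terms that contain characters such as `&` or `#`, which would otherwise change the meaning of the URL.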

In [14]:
driver = webdriver.Chrome(ChromeDriverManager().install())
search_result = {}
for i in generic:
    gender_dict = {}
    for g in gender:
        gender_dict[g] = search(i, g)
        driver.implicitly_wait(5)
    search_result[i] = gender_dict
    time.sleep(1.5)
driver.close()

## Database Initialization

Initializing empty DataFrames to store the scraped data.

In [15]:
columns1 = ['gender', 'query', 'result']
qr = pd.DataFrame(columns = columns1)
columns2 = ['gender', 'query', 'href']
qr_link = pd.DataFrame(columns = columns2)

## Running Queries for Boys, Girls, and Kids (Neutral)

This code scrapes all relevant data for the toys in all_items. As written, the outer loop (`for item in all_items`) runs through the entire list of toys. Replacing all_items with trial in that loop facilitates testing, because the code then runs on a much smaller sample.

In [21]:
warnings.filterwarnings('ignore')
driver = webdriver.Chrome(ChromeDriverManager().install())
data1 = []
data2 = []
item = ''
for item in all_items:
    for g in gender:
        result, link = search(item, g)
        values1 = [g, item, result]
        values2 = [g, item, link]
        zipped1 = zip(columns1, values1)
        zipped2 = zip(columns2, values2)
        a_dictionary1 = dict(zipped1)
        a_dictionary2 = dict(zipped2)
        time.sleep(1.5)
        data1.append(a_dictionary1)
        data2.append(a_dictionary2)
driver.close()
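
Each pass of the inner loop pairs the column names with the values for one query via `zip`, then turns the pairs into a dictionary. A minimal sketch with made-up values standing in for a real `search` result; `make_record` is a hypothetical helper:

```python
columns1 = ['gender', 'query', 'result']

def make_record(columns, values):
    """Pair column names with values; zip stops at the shorter input."""
    return dict(zip(columns, values))

# Placeholder in the same (asins, titles) shape that search() returns.
placeholder_result = (['B0BQDCTRHK'], ['Example title'])
record = make_record(columns1, ['boys', 'legos', placeholder_result])
print(record['gender'], record['query'])  # boys legos
```

A list of such dictionaries can be passed to `pd.DataFrame` directly, with each dictionary becoming one row.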

Appending ASIN data to previously initialized dataframe.

In [22]:
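# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
qr = pd.concat([qr, pd.DataFrame(data1)], ignore_index=True)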
qr

Unnamed: 0,gender,query,result
0,boys,legos,"([, B0BQDCTRHK, B09WF29MFV, B09WF2RPK1, B09XKM..."
1,girls,legos,"([, B09TDGFW5V, B07FSMCLH9, B07Z3FSKST, B0B973..."
2,neutral,legos,"([, B0BPYDGWZX, B09FM6DRLB, B081F8VHQ9, B09XWW..."
3,boys,scooter,"([, B07X9WF8WM, B075ZQZDDM, B08BR77NKT, B0BJ1H..."
4,girls,scooter,"([, B099S4DTXW, B09881ZR77, B08ZN98H3F, B015JT..."
5,neutral,scooter,"([, B095LM4Y6R, B08BR2726L, B08K2J3RBC, B0BDLC..."
6,boys,drum set,"([, B0B5MKSZ1R, B099DL3L8P, B0852FRRGC, B09MRL..."
7,girls,drum set,"([, B0B5MKSZ1R, B099DL3L8P, B0852FRRGC, B09PTS..."
8,neutral,drum set,"([, B0B48S9KCP, B0B8HKBSPR, B0B4V3T25K, B0B5MK..."
9,boys,puzzles,"([, B08VNF359Z, B08ZNCMJYD, B0BK35FZJC, B08LG7..."


Database of toys and their associated links.

In [23]:
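# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
qr_link = pd.concat([qr_link, pd.DataFrame(data2)], ignore_index=True)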
qr_link

Unnamed: 0,gender,query,href
0,boys,legos,[https://www.amazon.co.uk/sspa/click?ie=UTF8&s...
1,girls,legos,[https://www.amazon.co.uk/sspa/click?ie=UTF8&s...
2,neutral,legos,[https://www.amazon.co.uk/sspa/click?ie=UTF8&s...
3,boys,scooter,[https://www.amazon.co.uk/sspa/click?ie=UTF8&s...
4,girls,scooter,[https://www.amazon.co.uk/sspa/click?ie=UTF8&s...
5,neutral,scooter,[https://www.amazon.co.uk/sspa/click?ie=UTF8&s...
6,boys,drum set,[https://www.amazon.co.uk/sspa/click?ie=UTF8&s...
7,girls,drum set,[https://www.amazon.co.uk/sspa/click?ie=UTF8&s...
8,neutral,drum set,[https://www.amazon.co.uk/sspa/click?ie=UTF8&s...
9,boys,puzzles,[https://www.amazon.co.uk/sspa/click?ie=UTF8&s...


## Converting Data to CSV File

In [30]:
az_uk_search_results = pd.DataFrame()

In [22]:
az_uk_search_results = pd.concat([az_uk_search_results, qr], ignore_index=True)
# az_uk_search_results_link = pd.concat([az_uk_search_results_link, qr_link], ignore_index=True)

In [24]:
az_uk_search_results

Unnamed: 0,gender,query,result
0,boys,vehicle toys,"([, B09N783Q3T, B09XQXTC74, B07WQTSPGF, B0B426..."
1,girls,vehicle toys,"([, B07WQTSPGF, B09XQXTC74, B08ZCRVGHQ, B0759M..."
2,neutral,vehicle toys,"([, B09N783Q3T, B09J8BRFSP, B08ZCRVGHQ, B09XQX..."
3,boys,sport,"([, B081GW29YC, B00I04FDCI, B09ZKX11BT, B06XFQ..."
4,girls,sport,"([, B089W2C2Z8, B01HEZMR6I, B0B8HBZCRL, B0B9WW..."
...,...,...,...
493,girls,board games,"([, B092STXL3T, B09H7J66Z2, B00000JICB, B0BLHW..."
494,neutral,board games,"([, B092STXL3T, B07B7KPTQG, B078S8D27R, B078TW..."
495,boys,rock painting,"([, B08XK67TRQ, B07Z2R7S5P, B08RDCZWC7, B087PW..."
496,girls,rock painting,"([, B08XK67TRQ, B07Z2R7S5P, B08RDCZWC7, B087PW..."


In [28]:
az_uk_search_results.to_csv('az_uk_search_results.csv', index = False)
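
The `result` column holds a tuple of two lists, and a CSV cell can only hold text, so the saved file contains each tuple in string form. A sketch of recovering the Python objects after reading the file back, assuming the cell text is the plain string form of the tuple; `parse_result` is a hypothetical helper and the titles in the sample cell are placeholders:

```python
import ast

def parse_result(cell):
    """Turn a stringified (asins, titles) tuple back into two lists."""
    # literal_eval only parses Python literals, so it does not execute code.
    asins, titles = ast.literal_eval(cell)
    return asins, titles

cell = str((['', 'B09N783Q3T', 'B09XQXTC74'], ['', 'Example title A', 'Example title B']))
asins, titles = parse_result(cell)
print(asins[1])  # B09N783Q3T
```

With pandas, the same function could be applied to the whole column after `pd.read_csv`, for example with `df['result'].apply(parse_result)`.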

In [51]:
df = pd.DataFrame.from_dict(search_result).T.reset_index()
df.rename(columns={'index':'item'}, inplace = True)
df

Unnamed: 0,item,boys,girls,neutral
0,toys,"(([, B0B9GL62T7, B0B8HZJZQF, B09X9SX2ZK, B0B42...","(([, B09CPH2Q1V, B085DGS9BN, B0B9BW3QGV, B087D...","(([, B09FT5P8RK, B09NLPL4TM, B09XBQGML5, B0B42..."
1,books,"(([, 1653075104, B09MDHH1CF, B0BJHF35GL, B09Y4...","(([, 1953424341, B0BLFSVS1G, B0B3RL89SB, 18485...","(([, B0BCCVT2S3, B0BBXZPG9J, 0241381223, 02415..."
2,learning material,"(([, B09N3S4D7N, B0BDCVF2X2, B0BHXQMQZB, B0BHX...","(([, B09N3S4D7N, B09SG8W7VS, B0BHXQMQZB, B0BHX...","(([, B08725XV15, B0B9VJYCZ7, B07M9D92SJ, B07MM..."
3,games,"(([, B0B91WFR9W, B07YRGFQHY, B088M5PZG8, B09L5...","(([, B09FTHMRGR, B0B3M432H5, 1687795347, B09MH...","(([, , B076PRWVFG, B09FTHMRGR, B0B91WFR9W, B08..."
4,sports,"(([, B09PYX62JJ, B08JZ25FFZ, B0BFF896D6, B08HG...","(([, B0BNVY8N37, B0B2JJMQ9J, B097GZD6WD, B0B4B...","(([, B09VBMYWK8, B093HB43VL, B078ZTYFWY, B09PY..."
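
`search_result` is nested as item, then audience, then `((asins, titles), links)`, so each cell of the table above holds an entire nested tuple. A sketch of flattening that structure into one row per (item, audience, ASIN), which may be easier to analyze; `flatten` is a hypothetical helper and the sample dictionary is a cut-down illustration:

```python
def flatten(search_result):
    """Return (item, audience, asin) rows from the nested dictionary."""
    rows = []
    for item, by_audience in search_result.items():
        for audience, ((asins, _titles), _links) in by_audience.items():
            for asin_value in asins:
                if asin_value:  # skip empty data-asin values
                    rows.append((item, audience, asin_value))
    return rows

# Same shape as search_result, with titles and links left as placeholders.
sample = {
    'toys': {
        'boys': ((['', 'B0B9GL62T7'], ['']), []),
        'girls': ((['', 'B09CPH2Q1V'], ['']), []),
    }
}
rows = flatten(sample)
print(rows)
# [('toys', 'boys', 'B0B9GL62T7'), ('toys', 'girls', 'B09CPH2Q1V')]
```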
