# Obtaining product categories

IMPORTANT REMARK:

Run the cells from top to bottom, in the order shown. Executing them in a different order may raise errors.

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from time import sleep
import requests
import random
from lxml import html  
from bs4 import BeautifulSoup 
import os

In [12]:
outcomesDf = pd.read_csv('./outcomes_clean.tsv',sep='\t')

The "desc" column contains a description of the auctioned product. Most of the products in the dataset are electronics (mobile phones, video games, laptops, televisions, etc.), but the dataset does not specify which category each product belongs to. This information could be useful, because auction results may differ by product category. For example, if most users were mainly interested in bidding on mobile phones, those auctions could reach much higher final prices than auctions for other types of products.

In [18]:
outcomesDf.head(2)

Unnamed: 0.1,Unnamed: 0,auction_id,product_id,item,desc,retail,price,finalprice,bidincrement,bidfee,...,freebids,endtime_str,flg_click_only,flg_beginnerauction,flg_fixedprice,flg_endprice,bids_placed,swoopo_sale_price,swoopo_profit,winner_benefit
0,0,86827.0,10009602.0,sony-ericsson-s500i-unlocked-mysterious-,Sony Ericsson S500i Unlocked Mysterious Green,499.99,13.35,13.35,0.15,0.75,...,0.0,2008-09-16 19:52:00,0.0,0.0,0.0,0.0,89.0,77.060489,-422.929511,467.14
1,1,87964.0,10009881.0,psp-slim-lite-sony-piano-black,PSP Slim & Lite Sony Piano Black,169.99,74.7,74.7,0.15,0.75,...,0.0,2008-08-28 11:17:00,0.0,0.0,0.0,0.0,498.0,431.192397,261.202397,46.54


Amazon.com is one of the largest Internet retailers in the world, and it sells or used to sell most of the products contained in the dataset. It also assigns a category to every product it sells. These categories can fill in the missing category information for the products in the dataset, and web scraping can be used to extract them.

The following function extracts Amazon links related to the product name given as input. The links are the first results that Google returns when searching for that product name with the results restricted to the Amazon website.

In [13]:
def getGoogleResultHrefs(productName):
    
    print('Performing a Google search for the product name : '+productName)
    
    #web browser header
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}

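    #the product name is URL-encoded: spaces become '+' and reserved characters such as '&' are
    #escaped (otherwise the '&' in "PSP Slim & Lite" would cut the query short)
    from urllib.parse import quote_plus
    escaped_search_term = quote_plus(productName)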
    
    #the search is performed in English, and 8 search results are requested
    number_results=8
    language_code = 'en'
    
    #The search site is www.amazon.com, and the search is performed with the defined parameters
    google_url = 'https://www.google.com/search?q=site:amazon.com+{}&num={}&hl={}'.format(escaped_search_term, number_results, language_code)
    googlePageSession = requests.Session()
    googlePage = googlePageSession.get(google_url,headers=headers)
    googleSoup = BeautifulSoup(googlePage.content, "lxml")
    googleResultDivs = googleSoup.findAll('div', {'class': 'g'})[:8]
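    #links are obtained (result blocks without an <a> tag are skipped)
    googleResultLinks = [div.find('a') for div in googleResultDivs]
    googleResultLinks = [link for link in googleResultLinks if link is not None]
    #hrefs are obtained
    googleResultHrefs = [link.get('href') for link in googleResultLinks]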
    return googleResultHrefs

For example, for the product name "PSP Slim & Lite Sony Piano Black"...

In [12]:
exampleGoogleResultHrefs = getGoogleResultHrefs('PSP Slim & Lite Sony Piano Black')

Performing a Google search for the product name : PSP Slim & Lite Sony Piano Black


These are the search results:

In [13]:
exampleGoogleResultHrefs

['https://www.amazon.com/Sony-PSP-Slim-Lite-2000-Console/dp/B000F6W1AG',
 'https://www.amazon.com/Sony-PSP-Slim-Lite-Handheld-console/dp/B000VCVR9A',
 '/search?q=site:amazon.com+PSP+Slim&num=8&hl=en&tbm=isch&tbo=u&source=univ&sa=X&ved=0ahUKEwikvePsitPbAhVBzRQKHcMCA5sQsAQISg',
 'https://www.amazon.com/Sony-PSP-Games/b?ie=UTF8&node=11075221',
 'https://www.amazon.com/Sony-PlayStation-Portable-2000-PSP-Slim-PSP/dp/B001NMKHXO',
 'https://www.amazon.com/Sony-Psp-2000fb-Playstation-Portable-Slim/dp/B000UL11SO',
 'https://www.amazon.com/Sony-PSP-2000LP-PlayStation-Portable-Slim/dp/B000UKUFXC',
 'https://www.amazon.com/Sony-PSP-Slim-Lite-PSP-2000IS-Handheld/dp/B000UL11SE']
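
Not every entry in this list is a product page: the third one is a relative Google link and the fourth one points to an Amazon category node. As a minimal sketch (not part of the original notebook), such entries could be filtered out before any request is sent. The helper name `is_amazon_product_link` is hypothetical, and the assumption that product pages contain `/dp/` or `/gp/product/` in their path is based only on the URLs that appear in this notebook.

```python
from urllib.parse import urlparse

def is_amazon_product_link(href):
    """Return True for absolute amazon.com URLs whose path looks like a product page."""
    parsed = urlparse(href)
    host = parsed.netloc.lower()
    # Relative links (empty host) and links to other sites are rejected
    if host != 'amazon.com' and not host.endswith('.amazon.com'):
        return False
    # Assumption: product pages contain one of these path segments
    return '/dp/' in parsed.path or '/gp/product/' in parsed.path

hrefs = [
    'https://www.amazon.com/Sony-PSP-Slim-Lite-2000-Console/dp/B000F6W1AG',
    '/search?q=site:amazon.com+PSP+Slim&num=8&hl=en&tbm=isch&tbo=u&source=univ&sa=X&ved=0ahUKEwikvePsitPbAhVBzRQKHcMCA5sQsAQISg',
    'https://www.amazon.com/Sony-PSP-Games/b?ie=UTF8&node=11075221',
]
product_hrefs = [h for h in hrefs if is_amazon_product_link(h)]
print(product_hrefs)
```

Filtering first would avoid spending a request (and a waiting period) on links that cannot yield a product category.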

The list returned by the search contains several links to the Amazon website. The following function sends an HTTP request to one of those links (the one at the index given as input). If the HTTP status code is not 200, the request was not fulfilled and the function raises an exception.

In [14]:
def getAmazonPage(googleResultHrefs, resultIndex):
    
    #get the Google result corresponding to the specified index
    googleResult = googleResultHrefs[resultIndex]
    #amazon product url
    amazonUrl=googleResult   
    print("The Amazon URL being exlored is : "+amazonUrl)
    
    #web browser header
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    
    page = requests.get(amazonUrl,headers=headers)
    
    #A status code of 200 indicates that the HTTP request was fulfilled.
    if (page.status_code!=200):
        #the request was not fulfilled
        print("The webpage status code is not 200")
        raise Exception('The webpage status code is not 200')
    
    return (page,amazonUrl)

As an example, the function is called with the result obtained previously (which contains several Amazon links for the product "PSP Slim & Lite Sony Piano Black") and told to request the first link (index 0). In this example, the HTTP status code is 200 (the request was fulfilled).

In [25]:
exampleAmazonPage, exampleAmazonUrl = getAmazonPage(exampleGoogleResultHrefs,0)
print(exampleAmazonPage)
print(exampleAmazonUrl)

The Amazon URL being exlored is : https://www.amazon.com/Sony-PSP-Slim-Lite-2000-Console/dp/B000F6W1AG
<Response [200]>
https://www.amazon.com/Sony-PSP-Slim-Lite-2000-Console/dp/B000F6W1AG
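
`getAmazonPage` gives up on a link as soon as the status code is not 200. One possible variation, which is not used in this notebook, is to wait and retry the same link a few times with exponentially growing, randomized delays. The sketch below illustrates that idea under stated assumptions: `backoff_delays` and `fetch_with_retries` are hypothetical helpers, `fetch` is any callable returning `(status_code, body)` that stands in for `requests.get`, and the base and cap values are arbitrary.

```python
import random
from time import sleep as real_sleep

def backoff_delays(retries, base_seconds=2.0, cap_seconds=60.0):
    """Return one waiting time per retry: exponential upper bound, capped, with random jitter."""
    delays = []
    for attempt in range(retries):
        upper = min(cap_seconds, base_seconds * 2 ** attempt)
        # The delay is drawn between half of the upper bound and the upper bound
        delays.append(random.uniform(upper / 2, upper))
    return delays

def fetch_with_retries(fetch, url, retries=3, wait=real_sleep):
    """Call fetch(url) until it reports status 200 or the retries are used up."""
    status, body = fetch(url)
    for delay in backoff_delays(retries):
        if status == 200:
            break
        wait(delay)
        status, body = fetch(url)
    if status != 200:
        raise RuntimeError('status code {} after {} retries'.format(status, retries))
    return body

# Demonstration with a fake fetcher: two failures followed by a success.
# The waiting times are recorded instead of slept, so the example runs instantly.
responses = iter([(503, ''), (503, ''), (200, '<html>ok</html>')])
waits = []
body = fetch_with_retries(lambda url: next(responses), 'https://example.invalid/item', wait=waits.append)
print(body, len(waits))
```

Whether retrying helps depends on why the request failed; a broken link will not recover, so moving on to the next link, as the notebook does, remains a reasonable default.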


The following function takes a product name as input and returns the Amazon category for that product. To do so, it calls the two functions explained above and extracts the category from the Amazon product page by searching for the corresponding HTML tag. If it cannot extract the category from any of the Amazon links (for example, because they are broken or do not point to a product page), it raises an exception.

In [15]:
def getAmazonProductCategory(productName): 

    #obtain the Google results
    googleResultHrefs = getGoogleResultHrefs(productName)
    
    resultIndex = 0
    goodAmazonWebpage = False
    
    while(goodAmazonWebpage == False):
        try:
            #waits for a random time
            sleep(2+3*random.random())
            print("Calling to getAmazonPage from try block")
            page,amazonUrl = getAmazonPage(googleResultHrefs,resultIndex)
            goodAmazonWebpage = True #if the code reaches this line, there was no exception
            #when calling getAmazonPage, so the status code was 200
            print("goodAmazonWebpage = True")
            #the while block ends here
        except Exception as e:
            print(e)
            print("Inside exception block of while loop goodAmazonWebpage == False")
            if (resultIndex == len(googleResultHrefs)-1):
                #the function did not work for any of the Google results obtained
                raise Exception('There are no more Google results in the first page')
            
            #the status code was not 200 for the amazon URL associated to the current index,
            #so the index is incremented and the while block is executed again
            resultIndex += 1
                        
    try:
                
        while(True):
        
            amazonDoc = html.fromstring(page.content)
    
            XPATH_CATEGORY = '//a[@class="a-link-normal a-color-tertiary"]//text()'
            RAW_CATEGORY = amazonDoc.xpath(XPATH_CATEGORY)
            CATEGORY = '>'.join([i.strip() for i in RAW_CATEGORY]) if RAW_CATEGORY else None         
        
            if CATEGORY is None:
                #The product category could not be obtained from the Amazon link. This means
                #that it was not a proper product link (for example, it could be a link to the comment
                #section of the product, or any other Amazon link that does not correspond
                #to a product page).
                print("the category is None")
            
                goodAmazonWebpage = False
                if (resultIndex == len(googleResultHrefs)-1):
                    #the function did not work for any of the Google results obtained
                    raise Exception('There are no more Google results in the first page')
                    
                #the Amazon link was not a proper one, so the index is incremented
                #to have a look at the next link
                resultIndex += 1
            
                while(goodAmazonWebpage == False):
                    try:
                        #waits for a random time
                        sleep(3+2*random.random())
                        #getAmazonPage returns a (page, url) tuple, so both values are unpacked
                        page,amazonUrl = getAmazonPage(googleResultHrefs,resultIndex)
                        goodAmazonWebpage = True #if the code reaches this line, there was no exception
                        #when calling getAmazonPage, so the status code was 200, and the while(True) block
                        #starts again.
                    except Exception as e:
                        if (resultIndex == len(googleResultHrefs)-1):
                            #the function did not work for any of the Google results obtained
                            raise Exception('There are no more Google results in the first page')
                        #the status code was not 200 for the amazon URL associated to the current index,
                        #so the index is incremented and the while block is executed again
                        resultIndex += 1
            else:
                #everything worked fine, so the while(True) block ends here
                break
        
        print("Product category : "+CATEGORY)
        return (CATEGORY,amazonUrl)
    except Exception as e:
        print(e)
        #the exception is re-raised so that the caller does not receive None
        raise

As an example, the previous function is called with the product name "PSP Slim & Lite Sony Piano Black". The extracted Amazon category for this product name is "Video Games>Sony PSP>Consoles".

In [32]:
exampleCategory, exampleAmazonUrl = getAmazonProductCategory('PSP Slim & Lite Sony Piano Black')

Performing a Google search for the product name : PSP Slim & Lite Sony Piano Black
Calling to getAmazonPage from try block
The Amazon URL being exlored is : https://www.amazon.com/Sony-PSP-Slim-Lite-2000-Console/dp/B000F6W1AG
goodAmazonWebpage = True
Product category : Video Games>Sony PSP>Consoles


In [33]:
print(exampleCategory)
print(exampleAmazonUrl)

Video Games>Sony PSP>Consoles
https://www.amazon.com/Sony-PSP-Slim-Lite-2000-Console/dp/B000F6W1AG
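
The control flow of `getAmazonProductCategory` boils down to "try each link in order and return the first one that yields a category". A minimal sketch of that idea as a single loop is shown below. It is a simplified restatement, not the notebook's implementation: `first_category` is a hypothetical name, `fetch_page` and `extract_category` are stand-ins for `getAmazonPage` and the XPath lookup, the random waits are omitted, and the URLs and page contents are made up.

```python
def first_category(hrefs, fetch_page, extract_category):
    """Return (category, url) for the first link that yields a category.

    fetch_page(url) returns the page content or raises on failure;
    extract_category(content) returns a string, or None if no category is found.
    """
    for url in hrefs:
        try:
            content = fetch_page(url)
        except Exception as error:
            # Broken link or non-200 status: move on to the next result
            print('Skipping {}: {}'.format(url, error))
            continue
        category = extract_category(content)
        if category is not None:
            return category, url
    raise LookupError('No link in the result list produced a category')

# Made-up pages: a failing link, a page without a category, and a product page
fake_pages = {
    'https://example.invalid/broken': None,
    'https://example.invalid/not-a-product': '',
    'https://example.invalid/product': 'Video Games>Sony PSP>Consoles',
}

def fake_fetch(url):
    content = fake_pages[url]
    if content is None:
        raise RuntimeError('The webpage status code is not 200')
    return content

def fake_extract(content):
    # An empty page stands for "no breadcrumb found"
    return content or None

category, url = first_category(list(fake_pages), fake_fetch, fake_extract)
print(category, url)
```

Passing the fetcher and the extractor as arguments keeps the fallback logic testable without sending any request.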


The following loop iterates over the product descriptions in the dataset and calls the function that extracts the product category. The fields that describe the product in the original dataset (columns "item" and "desc") are then saved in the file "productcategories.tsv", along with the extracted product category and the Amazon link from which it was extracted (in case more information, such as product characteristics or reviews, needs to be extracted from the same link in the future).

If many automated requests are sent to Google or Amazon, these websites may end up blocking them. Although a random waiting time is inserted between requests, the request pattern stays the same and is easy to detect. Once the requests are blocked, some time must pass before requests succeed again.

Therefore, each time a product category is obtained, it is appended to the existing file, and if the category for a given product has already been obtained, the search is not performed again. This is also useful because several auctions offer the same product, and it would be unnecessary to obtain the same information several times.

In [None]:
for index, row in outcomesDf.iterrows():
    item = row["item"]
    productDescription = row['desc']
    
    filename = 'productcategories.tsv'

    if os.path.exists(filename):
        append_write = 'a' # append if already exists
        categoriesDf = pd.read_csv(filename,sep='\t')
        if productDescription in categoriesDf["desc"].unique():
            continue
    
    else:
        append_write = 'w' # make a new file if not
    
    productCategory,amazonUrl = getAmazonProductCategory(productDescription)  

    f = open(filename,append_write)
    separator = '\t'
    print('item: '+item)
    print('productDescription: '+productDescription)
    print('link: '+amazonUrl)
    print('productCategory: '+productCategory)
    print("------")
    
    if append_write == 'w':
        f.write('item'+separator+'desc'+separator+'link'+separator+'category'+'\n')
    f.write(str(item)+separator+str(productDescription)+separator+amazonUrl+separator+productCategory+'\n')
    f.close()
    
    if index == 3:
        break
        

Performing a Google search for the product name : Sony Ericsson S500i Unlocked Mysterious Green
Calling to getAmazonPage from try block
The Amazon URL being exlored is : https://amazon.com/dp/B000RVUQLU?tag=bestproducts029-20
goodAmazonWebpage = True
Product category : Cell Phones & Accessories>Cell Phones>Unlocked Cell Phones
item: sony-ericsson-s500i-unlocked-mysterious-
productDescription: Sony Ericsson S500i Unlocked Mysterious Green
link: https://amazon.com/dp/B000RVUQLU?tag=bestproducts029-20
productCategory: Cell Phones & Accessories>Cell Phones>Unlocked Cell Phones
------
Performing a Google search for the product name : PSP Slim & Lite Sony Piano Black
Calling to getAmazonPage from try block
The Amazon URL being exlored is : https://www.amazon.com/Sony-PSP-Slim-Lite-2000-Console/dp/B000F6W1AG
goodAmazonWebpage = True
Product category : Video Games>Sony PSP>Consoles
item: psp-slim-lite-sony-piano-black
productDescription: PSP Slim & Lite Sony Piano Black
link: https://www.amazo
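
The resume-and-skip logic of the loop can also be sketched in isolation with the standard library. This is an illustrative alternative, not the notebook's code: `load_seen_descriptions`, `append_row` and `lookup` are hypothetical names, `lookup` stands in for `getAmazonProductCategory`, and the rows are made up. The difference from the loop above is that the file is read once and the set of known descriptions is kept in memory, instead of re-reading the file on every iteration.

```python
import csv
import os
import tempfile

FIELDS = ['item', 'desc', 'link', 'category']

def load_seen_descriptions(path):
    """Return the set of 'desc' values already stored in the TSV file (empty if it does not exist)."""
    if not os.path.exists(path):
        return set()
    with open(path, newline='', encoding='utf-8') as handle:
        return {row['desc'] for row in csv.DictReader(handle, delimiter='\t')}

def append_row(path, row):
    """Append one row, writing the header first when the file is new."""
    is_new = not os.path.exists(path)
    with open(path, 'a', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter='\t')
        if is_new:
            writer.writeheader()
        writer.writerow(row)

def lookup(desc):
    # Stand-in for the scraper: returns a made-up link and category
    return 'https://example.invalid/' + desc.replace(' ', '-'), 'Some>Category'

path = os.path.join(tempfile.mkdtemp(), 'productcategories.tsv')
descriptions = ['Product A', 'Product B', 'Product A']
seen = load_seen_descriptions(path)
lookups = 0
for desc in descriptions:
    if desc in seen:
        continue  # already stored, so no new search is needed
    link, category = lookup(desc)
    lookups += 1
    append_row(path, {'item': desc.lower(), 'desc': desc, 'link': link, 'category': category})
    seen.add(desc)
print(lookups, sorted(load_seen_descriptions(path)))
```

Because every row is written as soon as it is obtained, an interrupted run can be restarted and will continue with the descriptions that are still missing.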

After running the scraping loop above for several days with different random waiting times, and after manually fixing the categories that did not match the item description, the results were saved in the following Excel file:

In [3]:
productCategoriesDf = pd.read_excel('./product_categories.xlsx')
productCategoriesDf.head()

Unnamed: 0,item,desc,link,category
0,sony-ericsson-s500i-unlocked-mysterious-,Sony Ericsson S500i Unlocked Mysterious Green,https://www.amazon.com/Sony-Ericsson-S500i-Slo...,Cell Phones & Accessories › Cell Phones › Unlo...
1,psp-slim-lite-sony-piano-black,PSP Slim & Lite Sony Piano Black,https://www.amazon.com/Sony-PSP-Slim-Lite-2000...,Video Games › Sony PSP › Consoles
2,ipod-touch-apple-8gb-with-software-upgra,iPod Touch Apple 8GB with Software Upgrade,https://www.amazon.com/Apple-touch-Generation-...,Electronics › Portable Audio & Video › MP3 & M...
3,lg-ku990-viewty-unlocked-black,LG KU990 Viewty Unlocked Black,https://www.amazon.com/gp/product/B000W88J4Y,Cell Phones & Accessories › Cell Phones › Unlo...
4,logitech-cordless-wave-keyboard-and-mous,Logitech Cordless Wave Keyboard and Mouse,https://www.amazon.com/Logitech-Cordless-Deskt...,Electronics › Computers & Accessories › Comput...


Some of the products could not be found on Amazon because they are specific to Swoopo (for example, FreeBids vouchers, gift cards, etc.). All of these products have been assigned to the "Swoopo" category.

Also, some of the products were silver or gold bars, which are not sold on Amazon. These products have been assigned the Amazon category closest to the one they would have had (Home & Kitchen › Home Décor › Home Décor Accents › Decorative Accessories).

In [9]:
productCategoriesDf[productCategoriesDf.isnull().any(axis=1)]

Unnamed: 0,item,desc,link,category
34,-80-cash-,$80 Cash!,,Swoopo
35,-1-000-cash-,"$1,000 Cash!",,Swoopo
36,50-freebids-voucher,50 FreeBids Voucher,,Swoopo
37,300-freebids-voucher,300 FreeBids Voucher,,Swoopo
48,-320-cash-,$320 Cash!,,Swoopo
431,5-g-gold-bar-0-16-t-oz-,5 g Gold Bar (0.16 t oz),,Home & Kitchen › Home Décor › Home Décor Accen...
461,-1000-iphone-3g-gift-card,$1000 iPhone 3G Gift Card,,Swoopo
655,-500-cash-,$500 Cash!,,Swoopo
694,0,$15 Florist Voucher,,Swoopo
695,0,$30 Florist Voucher,,Swoopo


The product category can be assigned to each one of the products contained in the dataset by performing a merge on the 'desc' column.

In [5]:
outcomesDfWithCategory = outcomesDf.merge(productCategoriesDf[['desc','category']],on='desc',how='left')
outcomesDfWithCategory.head(1)

Unnamed: 0,auction_id,product_id,item,desc,retail,price,finalprice,bidincrement,bidfee,winner,...,endtime_str,flg_click_only,flg_beginnerauction,flg_fixedprice,flg_endprice,bids_placed,swoopo_sale_price,swoopo_profit,winner_benefit,category
0,86827,10009602,sony-ericsson-s500i-unlocked-mysterious-,Sony Ericsson S500i Unlocked Mysterious Green,499.99,13.35,13.35,0.15,0.75,Racer11,...,2008-09-16 19:52:00,0,0,0,0,89.0,77.060489,-422.929511,467.14,Cell Phones & Accessories › Cell Phones › Unlo...
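
A left merge keeps every auction row, but it would duplicate auctions if the category table happened to contain the same description more than once. The sketch below shows one way to guard against that on made-up miniature tables; the column values are invented, and the de-duplication step is an assumption about what might be needed, not something the notebook does.

```python
import pandas as pd

# Made-up miniature versions of the two tables
auctions = pd.DataFrame({
    'auction_id': [1, 2, 3],
    'desc': ['Product A', 'Product B', 'Product A'],
})
categories = pd.DataFrame({
    'desc': ['Product A', 'Product B', 'Product A'],
    'category': ['Cat 1', 'Cat 2', 'Cat 1'],
})

# Keep one row per description so that the left merge cannot multiply auction rows
lookup_table = categories[['desc', 'category']].drop_duplicates(subset='desc')

# validate='many_to_one' makes pandas raise if 'desc' is not unique in the right-hand table
merged = auctions.merge(lookup_table, on='desc', how='left', validate='many_to_one')
print(merged)
```

Comparing `len(merged)` with `len(auctions)` after the merge is a cheap additional check that no rows were added.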


In the end, every item in the dataset has a category assigned, so the merged dataset contains no rows with missing values:

In [6]:
outcomesDfWithCategory[outcomesDfWithCategory.isnull().any(axis=1)]

Unnamed: 0,auction_id,product_id,item,desc,retail,price,finalprice,bidincrement,bidfee,winner,...,endtime_str,flg_click_only,flg_beginnerauction,flg_fixedprice,flg_endprice,bids_placed,swoopo_sale_price,swoopo_profit,winner_benefit,category
