# FCP Image Classifier

<p>This code does several things. It scrapes one page of parts at a time from a parts listing page such as <a target="_blank" href="https://www.fcpeuro.com/BMW-parts/">this one</a>. For each part, it opens the individual <a target="_blank" href="https://www.fcpeuro.com/products/bmw-ignition-service-kit-12137841754kt">part page</a> and uses the main image. It also records the total number of images for each part on that page. Next, it gathers both the SKU and the FCP ID from a third HTML page such as <a target="_blank" href="https://www.fcpeuro.com/products/bmw-drive-shaft-flex-joint-kit-26112226527kt2/extended.html">this one</a>. For now, I have left SKU and FCP_ID out of the dataframe to save space and avoid horizontal scrolling. At the bottom of code cells 1 and 3, there are places to paste SKU and FCP_ID in order to include them in the dataframe.</p>

<p>For the images, the best way to separate FCP-taken images from stock images is to use the size/resolution or the total number of images for a particular part. After looking at the data for many part images, I found that the majority of FCP-taken images have a width or height of at least 850 px (most are exactly 900 px wide). All of the parts I have seen that have a stock photo have only one photo available. In the list of 99 parts in the dataframe below, I find only three classification errors due to large stock images, plus one case in which a stock photo is the main photo but alternate FCP-taken images exist.</p>
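The size-and-count rule described above can be sketched as a small pure function. This is only an illustration under stated assumptions: `classify_photo` and `MIN_SIDE_PX` are hypothetical names, and the threshold is read as "at least 850 px" from the text, so a side of exactly 850 px counts as FCP-taken here.

```python
# Hypothetical helper illustrating the size/count heuristic described above.
MIN_SIDE_PX = 850  # assumed threshold, taken from "at least 850 px" in the text


def classify_photo(size, total_images):
    """Return 'Good' (likely FCP-taken) or 'Bad' (likely a stock photo)."""
    width, height = size
    is_large = width >= MIN_SIDE_PX or height >= MIN_SIDE_PX
    has_alternates = total_images > 1
    return 'Good' if is_large or has_alternates else 'Bad'


# (size, total_images) pairs of the kind this notebook collects
samples = [((900, 810), 5), ((640, 480), 1), ((499, 471), 1)]
for size, count in samples:
    print(size, count, classify_photo(size, count))
```

Passing the image count in explicitly keeps the rule easy to test on its own, without running the scraper.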

<p>The challenge with the images is that quality varies widely: some stock photos are quite good and others are not. If the goal is to test how image quality affects sales, then it would be helpful to know which images have very low quality. This has not been an easy task. I have tested some image quality assessment algorithms, but they have not given reliable results. I am still looking for ways to address this issue.</p>

In [180]:
import urllib.request
import requests
# pip install beautifulsoup4
from bs4 import BeautifulSoup as soup
# pip install pandas
import pandas as pd
from PIL import Image
from io import BytesIO
import time
# Show full column contents (None replaces the deprecated value -1)
pd.set_option('display.max_colwidth', None)

# Paste the following as is below: 'SKU', 'FCP_ID',
df = pd.DataFrame(columns = ['Part_Name', # paste
                             'Image_URL', 'Image_Size', 'Image_DPI',
                             'Total_Images', 'Image_Quality'])

<h2>Scrape the initial parts <a target="_blank" href="https://www.fcpeuro.com/BMW-parts/">page</a> </h2>
<p>Any parts listing page URL can be used, and I have included several. For now, the URL must be entered manually, but this could be extended to step through the pages automatically. After selecting a URL, run this cell and the next to populate the dataframe. If you'd like to add more parts to the dataframe, change the URL and run the two cells again.</p>
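One way the manual URL entry could be automated is to generate the listing URLs from a base URL and a page range. The sketch below assumes the `?page=N` query pattern seen in the example URLs applies to the chosen category and that every page in the range exists; `listing_page_urls` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlencode


def listing_page_urls(base_url, first_page, last_page):
    """Build one listing URL per page number, first_page..last_page inclusive."""
    return [
        f"{base_url}?{urlencode({'page': page})}"
        for page in range(first_page, last_page + 1)
    ]


# Assumed example: pages 8 through 10 of one category
urls = listing_page_urls('https://www.fcpeuro.com/BMW-parts/Cooling-System/', 8, 10)
for page_url in urls:
    print(page_url)
```

Each generated URL could then be assigned to `url` in turn before running the scraping cells.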

In [189]:

#url = 'https://www.fcpeuro.com/Audi-parts/Air-Intake/'
#url = 'https://www.fcpeuro.com/BMW-parts/'
#url = 'https://www.fcpeuro.com/BMW-parts/Cooling-System/?page=8'
#url = 'https://www.fcpeuro.com/BMW-parts/Cooling-System/?page=9'
url = 'https://www.fcpeuro.com/BMW-parts/Cooling-System/?page=10'


page = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'}) 
infile = urllib.request.urlopen(page).read()
data = infile.decode('ISO-8859-1')
page_soup = soup(data, 'html.parser')
image_divs = page_soup.find_all('div', class_='hit__img')
divs = page_soup.find_all('div', class_='grid-x hit')

<p>The following cell loops through each part on the main parts page and scrapes data from the next two pages for each part. The data is added to the dataframe, which is displayed below. For image quality, I made the description Good/Bad for easy human viewing, but it can be changed to anything else. Non-existing images (boxes) have an image quality of 'null'. The program sleeps often so as not to overload the server, but this also means it takes a while to run.</p>
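The notebook uses fixed `time.sleep()` calls to stay polite to the server. As a possible alternative, not what the notebook does, a small throttle can enforce a minimum gap between requests and sleep only for the part of the gap that has not already elapsed. `Throttle` is a hypothetical class, and the 2-second interval is an assumption borrowed from the sleeps in the scraping functions.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request; None before the first

    def wait(self):
        """Sleep only for the part of the interval that has not yet elapsed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Assumed usage: call throttle.wait() right before each page or image request.
throttle = Throttle(min_interval=2.0)
```

Because time spent parsing HTML or opening an image counts toward the interval, this can shorten the total run time while keeping the same minimum spacing between requests.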

In [190]:
# scrape_part_page() gets the HTML from the individual part page (the page after clicking a part from the list).
# The main image on this page is used as that part's image.
def scrape_part_page(part_url):
    time.sleep(2)
    part_page = urllib.request.Request(part_url, headers={'User-Agent': 'Mozilla/5.0'}) 
    infile = urllib.request.urlopen(part_page).read()
    data = infile.decode('ISO-8859-1')
    part_page_soup = soup(data, 'html.parser')
    return part_page_soup
    
# The SKU and FCP_ID come from a simpler HTML page, which is gathered in scrape_inner_html()
def scrape_inner_html(part_html_url):
    time.sleep(2)
    html_page = urllib.request.Request(part_html_url, headers={'User-Agent': 'Mozilla/5.0'})
    infile = urllib.request.urlopen(html_page).read()
    data = infile.decode('ISO-8859-1')
    html_page_soup = soup(data, 'html.parser')
    return html_page_soup

# get_sku_and_fcpid() gathers both SKU and FCP_ID from the HTML gathered in the previous function
def get_sku_and_fcpid(html_page_soup):
    ul_div = html_page_soup.find_all('div', class_='extended__details')
    li_list = ul_div[0].find_all('li')[0:2]
    sku_id_list = []
    for j in range(2):
        # find_all(string=True) is the current spelling of the deprecated findAll(text=True)
        s = ''.join(li_list[j].find_all(string=True))
        info = s.split('\n')
        sku_id_list.append(info[2])
    return sku_id_list[0], sku_id_list[1]

# get_image() grabs the image to be compared
def get_image(image_url):
    response = requests.get(image_url)
    img = Image.open(BytesIO(response.content))
    return img

# This function creates clickable links out of the URLs in the dataframe for easy access to the images
def make_clickable(val):
    # target _blank to open new window
    return '<a target="_blank" href="{}">{}</a>'.format(val, val)


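# The main four-line method for separating the images.
# Note: total_images is read from the enclosing (global) scope, so the loop
# below must set it before this function is called.
def check_quality(img_size):
    if img_size[0] > 850 or img_size[1] > 850 or total_images > 1:
        img_quality = 'Good'
    else:
        img_quality = 'Bad'
    return img_quality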

# Get the pixels-per-inch information. This can vary depending on image type,
# or not be available at all. It doesn't tell much and could be removed.
def get_dpi(img):
    # Prefer the JFIF density, fall back to 'dpi', then to (0, 0) if neither is available
    return img.info.get('jfif_density', img.info.get('dpi', (0, 0)))

# Here we loop through all of the products listed on the page and add their information to the dataframe
for i, div in enumerate(image_divs):
    part_url = 'https://www.fcpeuro.com' + divs[i]['data-href']
    part_page_soup = scrape_part_page(part_url)
    part_divs = part_page_soup.find_all('div', class_ = 'extended grid-x grid-margin-x')
    part_html_url = 'https://www.fcpeuro.com' + part_divs[0]['data-load-async']
    html_page_soup = scrape_inner_html(part_html_url)
    sku, fcp_id = get_sku_and_fcpid(html_page_soup)
    
    # Get part name
    name = image_divs[i].img['alt']
    
    # Each individual part page has one or more images, which are stored in 'info'.
    # The first image is used to classify, and the total number of images is gathered.
    listing_gallery = part_page_soup.find_all('div', 'listing__gallery')
    info = listing_gallery[0].find_all('img')
    image_url = info[0]['src']
    img = get_image(image_url)
            
    # Get the size/resolution of the image
    img_size = img.size
    
    # All 'boxes' representing a non-existing image are (500, 388).
    # If one of these is found, image quality is set to 'null' and total images
    # is set to 0. Otherwise total images is set to the length of 'info' and the
    # image is classified, so that 'null' is not overwritten.
    if img_size == (500, 388):
        total_images = 0
        img_quality = 'null'
    else:
        total_images = len(info)
        img_quality = check_quality(img_size)

    jfif = get_dpi(img)
    
    # Add everything we have gathered to the dataframe.
    # DataFrame.append() was removed in pandas 2.0, so pd.concat() is used instead.
    # Paste the following below:   'SKU' : sku, 'FCP_ID' : fcp_id,
    row = {'Part_Name' : name, # paste
           'Image_URL' : image_url, 'Image_Size' : img_size,
           'Image_DPI' : jfif, 'Total_Images' : total_images,
           'Image_Quality' : img_quality}
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    # Slow the program down to not bog down the servers
    time.sleep(5)

# Display the dataframe with clickable URLs
df.style.format({'Image_URL': make_clickable})

Unnamed: 0,Part_Name,Image_URL,Image_Size,Image_DPI,Total_Images,Image_Quality
0,Audi VW Throttle Body Kit - 06F133062TKT,https://www.fcpeuro.com/public/assets/products/330700/large/KIT-06F133062TKT.jpg?1565787568,"(900, 810)","(72, 72)",5,Good
1,Audi VW Throle Body - Bosch 0280750030,https://www.fcpeuro.com/public/assets/products/107089/large/078133062B.jpg?1496421965,"(640, 480)","(72, 72)",1,Bad
2,Audi Porsche VW Throttle Body - VDO A2C59511705,https://www.fcpeuro.com/public/assets/products/241130/large/open-uri20150921-29936-1n3zn47.?1496488676,"(900, 785)","(72, 72)",3,Good
3,Audi VW Throttle Body - Bosch 06A133062BD,https://www.fcpeuro.com/public/assets/products/175229/large/open-uri20141107-2865-yercxp.?1496445377,"(900, 856)","(72, 72)",4,Good
4,Audi VW Throttle Body - VDO 03L129086,https://www.fcpeuro.com/public/assets/products/248227/large/Untitled.png?1496492183,"(499, 471)","(0, 0)",1,Bad
5,Audi VW Throttle Body - Bosch 06B133062M,https://www.fcpeuro.com/public/assets/products/151320/large/open-uri20140311-17128-n5wg77.?1496434600,"(640, 480)","(72, 72)",1,Bad
6,Audi VW Intake Manifold Kit - Genuine Audi VW 06H133201ATKT,https://www.fcpeuro.com/public/assets/products/343660/large/open-uri20200108-21504-hmp05b.?1578518599,"(900, 900)","(240, 240)",4,Good
7,Audi VW TSI Intake Manifold Kit - Genuine Audi VW KIT-535538,https://www.fcpeuro.com/public/assets/products/291254/large/open-uri20180905-3227-78u6rq.?1536182796,"(900, 600)","(28, 28)",4,Good
8,Audi VW Air Pump - Pierburg 078906601E,https://www.fcpeuro.com/public/assets/products/131579/large/open-uri20140226-13157-w6x6k2.?1496430772,"(640, 480)","(72, 72)",1,Bad
9,Audi VW Power Brake Booster Vacuum Pump - Pierburg 06E145100R,https://www.fcpeuro.com/public/assets/products/248234/large/Untitled.png?1496492186,"(649, 492)","(0, 0)",1,Bad
