<center><h1>Web Scraping Kijiji</h1><h3>Using Python and Beautiful Soup</h3></center>

In [41]:
import pandas as pd
from IPython.display import HTML
from bs4 import BeautifulSoup
import urllib.request as request
from ipywidgets import interact
pd.set_option("display.max_rows",1000)
pd.set_option("display.max_columns",20)
pd.set_option("display.max_colwidth", 200)

### For this exercise, I will only be scraping the Toronto listings

In [42]:
base_url = 'http://www.kijiji.ca'
toronto_url = 'http://www.kijiji.ca/h-city-of-toronto/1700273'
html_kijiji = request.urlopen(toronto_url)

soup_kijiji = BeautifulSoup(html_kijiji, 'lxml')
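A slightly more defensive variant of the fetch above, as a sketch only: it builds the request explicitly, sets a timeout, and closes the response when done. The helper names `build_request` and `fetch_html`, the timeout value, and the User-Agent string are assumptions made for illustration; they do not appear in the original notebook.

```python
import urllib.request as request

TORONTO_URL = 'http://www.kijiji.ca/h-city-of-toronto/1700273'

def build_request(url, user_agent='Mozilla/5.0 (notebook example)'):
    """Return a Request carrying an explicit User-Agent header (placeholder value)."""
    return request.Request(url, headers={'User-Agent': user_agent})

def fetch_html(url, timeout=10):
    """Download the page and return the raw bytes; the response is closed on exit."""
    with request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network; only fetch_html does.
req = build_request(TORONTO_URL)
print(req.full_url)
```

The bytes returned by `fetch_html(TORONTO_URL)` could be passed to `BeautifulSoup` in the same way as the response object.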

Since I will be creating a drop-down widget containing all the listing categories, I looked at the page source to find a complete list of the available categories.  It turns out that I need to grab all the **&lt;a&gt;** elements whose class attribute equals **"category-selected"**.

In [43]:
div_categories = soup_kijiji.find_all('a', class_='category-selected')

### Let's look at the first 20 rows of the category list:

In [44]:
div_categories[:20]

[<a class="category-selected" data-id="10" href="/b-buy-sell/city-of-toronto/c10l1700273">buy and sell</a>,
 <a class="category-selected" data-id="72" href="/b-services/city-of-toronto/c72l1700273">services</a>,
 <a class="category-selected" data-id="27" href="/b-cars-vehicles/city-of-toronto/c27l1700273">cars &amp; vehicles</a>,
 <a class="category-selected" data-id="112" href="/b-pets/city-of-toronto/c112l1700273">pets</a>,
 <a class="category-selected" data-id="800" href="/b-vacation-rentals/c800l1700273">vacation rentals</a>,
 <a class="category-selected" data-id="1" href="/b-community/city-of-toronto/c1l1700273">community</a>,
 <a class="category-selected" data-id="34" href="/b-real-estate/city-of-toronto/c34l1700273">real estate</a>,
 <a class="category-selected" data-id="45" href="/b-jobs/city-of-toronto/c45l1700273">jobs</a>,
 <a class="category-selected" data-id="218" href="/b-resumes/city-of-toronto/c218l1700273">resumes</a>,
 <a class="category-selected" data-id="63" href="/

From the output above, we can see the category names as the link text on the right: "buy and sell", "services", "cars & vehicles", etc.
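To try the same selection in isolation, here is a minimal sketch that runs `find_all` on a small hand-written snippet. The surrounding markup and the `other-link` anchor are invented for illustration; the two `category-selected` anchors are copied from the output above. It uses the built-in `html.parser` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hand-written markup for illustration; only two anchors carry the target class.
sample_html = """
<ul>
  <li><a class="category-selected" data-id="10" href="/b-buy-sell/city-of-toronto/c10l1700273">buy and sell</a></li>
  <li><a class="category-selected" data-id="112" href="/b-pets/city-of-toronto/c112l1700273">pets</a></li>
  <li><a class="other-link" href="/help">help</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# class_ (with a trailing underscore) filters on the HTML class attribute.
links = soup.find_all('a', class_='category-selected')

# Each match exposes its text and its attributes (dictionary-style access).
pairs = [(a.get_text(), a['href']) for a in links]
for text, href in pairs:
    print(text, '->', href)
```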

### Now I will create a Python dictionary that maps each category name to its URL

In [45]:
categories = {}
for item in div_categories:
    categories[item.get_text()] = base_url + item['href']

### Let's see what the dictionary looks like:

In [46]:
categories

{'(more categories...)': 'http://www.kijiji.ca/b-resumes/city-of-toronto/c218l1700273',
 'ATVs, snowmobiles': 'http://www.kijiji.ca/b-atv-snowmobile/city-of-toronto/c171l1700273',
 'Canada': 'http://www.kijiji.ca/b-vacation-rentals-canada/c801l1700273',
 'Caribbean': 'http://www.kijiji.ca/b-vacation-rentals-caribbean/c803l1700273',
 'Mexico': 'http://www.kijiji.ca/b-vacation-rentals-mexico/c804l1700273',
 'Other Countries': 'http://www.kijiji.ca/b-vacation-rentals-other-countries/c805l1700273',
 'RVs, campers, trailers': 'http://www.kijiji.ca/b-rv-camper-trailer/city-of-toronto/c172l1700273',
 'SUVs': 'http://www.kijiji.ca/b-cars-trucks/city-of-toronto/suv+crossover/c174l1700273a138',
 'USA': 'http://www.kijiji.ca/b-vacation-rentals-usa/c802l1700273',
 'accessories': 'http://www.kijiji.ca/b-pet-accessories/city-of-toronto/c115l1700273',
 'accounting, mgmt': 'http://www.kijiji.ca/b-accounting-management-jobs/city-of-toronto/c58l1700273',
 'activities, groups': 'http://www.kijiji.ca/b-ac
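The `'(more categories...)'` key in the output suggests that several links share the same visible text. If so, each later assignment overwrites the earlier one in a plain dictionary. A minimal sketch for spotting such collisions follows; the sample pairs are made up for illustration (in particular, which links carry the repeated label is an assumption).

```python
from collections import defaultdict

BASE_URL = 'http://www.kijiji.ca'

# (link text, href) pairs; the repeated label is invented to show a collision.
sample_links = [
    ('jobs', '/b-jobs/city-of-toronto/c45l1700273'),
    ('(more categories...)', '/b-jobs/city-of-toronto/c45l1700273'),
    ('resumes', '/b-resumes/city-of-toronto/c218l1700273'),
    ('(more categories...)', '/b-resumes/city-of-toronto/c218l1700273'),
]

# Collect every URL seen for a label instead of keeping only the last one.
urls_by_label = defaultdict(list)
for text, href in sample_links:
    urls_by_label[text].append(BASE_URL + href)

# Labels with more than one distinct URL would lose entries in a plain dict.
collisions = {label: urls for label, urls in urls_by_label.items()
              if len(set(urls)) > 1}
print(collisions)
```

Running the same check on the real `(text, href)` pairs would show which categories, if any, are missing from the drop-down.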

### Now, I will make a list of just the keys from the category dictionary I created earlier:

In [47]:
category_list = list(categories)  # iterating over a dict yields its keys

### Now for the fun part.  Below, I use the `interact` decorator from ipywidgets to create drop-down widgets.  I pass it the category_list I just made, along with the keys of a dictionary of page numbers, so that the user can step through the results one page at a time.  The rest of the function grabs each listing's image URL, title, description, details and price, builds a pandas DataFrame from them, and renders it as HTML.

In [64]:
pages = {'Page 1':'', 'Page 2':'page-2', 'Page 3':'page-3', 'Page 4':'page-4', 'Page 5':'page-5',
         'Page 6':'page-6', 'Page 7':'page-7', 'Page 8':'page-8', 'Page 9':'page-9'}

@interact
def kijiji_listings(category = sorted(category_list), page = sorted(pages)):
    if page == 'Page 1':
        print(categories[category])
        html_cars = request.urlopen(categories[category])
    else:
        url = categories[category]
        last_forward_slash = url.rfind('/')
        beginning_url = url[:last_forward_slash+1]
        ending_url = url[last_forward_slash:]
        print(beginning_url + pages[page] + ending_url)
        html_cars = request.urlopen(beginning_url + pages[page] + ending_url)

    soup_cars = BeautifulSoup(html_cars, 'lxml')
    
    # A narrower alternative (would need `import re`):
    # tables = soup_cars.find_all('table', class_=re.compile('regular-ad|top-'))
    tables = soup_cars.find_all('table')
    
    img_urls = []
    for table in tables[1:]:
        for row in table.find_all('td', class_='image'):
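            try:
                img_urls.append("<img src='" + row.div.img['src'] + "'>")
            except (AttributeError, TypeError, KeyError):
                # No <div><img src=...> wrapper in this cell; fall back to a direct <img> child
                img_urls.append("<img src='" + row.img['src'] + "'>")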
                
    titles = []
    for table in tables[1:]:
        for row in table.find_all('td', class_='description'):
            titles.append(row.a.get_text().strip())
            
    comments = []
    for table in tables[1:]:
        for row in table.find_all('td', class_='description'):
            comments.append(row.p.get_text().strip())
            
    details = []
    for table in tables[1:]:
        for row in table.find_all('td', class_='description'):
            for item in row.find_all('p', class_='details'):
                details.append(item.get_text().strip())
    
    prices = []
    for table in tables[1:]:
        for row in table.find_all('td', class_='price'):
            try:
                prices.append(float(row.get_text().replace('$','').replace(',','').strip()))
            except ValueError:
                # Price text that is not a number is recorded as 0.0
                prices.append(0.0)
    
    # Some category listings don't have a price and title, so the column lists can differ in
    # length; pandas then raises ValueError and we fall back to the columns that are present
    try:
        df = pd.DataFrame({'Price':prices, 'Image':img_urls, 'Title':titles, 'Comment':comments, 'Details':details})
        # Arrange the columns in a certain order
        df = df[['Image','Title','Comment','Details','Price']]
    except ValueError:
        df = pd.DataFrame({'Image':img_urls, 'Comment':comments, 'Details':details})
        # Arrange the columns in a certain order
        df = df[['Image','Comment','Details']]

    
    return HTML(df.to_html(escape=False))  # if escape is set to True, the images won't be rendered

http://www.kijiji.ca/b-cars-trucks/city-of-toronto/convertible__coupe__hatchback__other+body+type__sedan__wagon/c174l1700273a138
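The URL printed above is a first-page URL. For later pages, the function inserts the page slug (for example `page-2`) just before the last path segment. A compact sketch of the same idea using `str.rpartition` follows; the helper name `page_url` is hypothetical, and the example URL is taken from the category dictionary shown earlier.

```python
def page_url(category_url, page_slug):
    """Insert page_slug (e.g. 'page-2') before the last path segment.

    An empty slug means page 1, which uses the category URL unchanged.
    """
    if not page_slug:
        return category_url
    head, _, tail = category_url.rpartition('/')
    return f'{head}/{page_slug}/{tail}'

url = 'http://www.kijiji.ca/b-atv-snowmobile/city-of-toronto/c171l1700273'
print(page_url(url, ''))
print(page_url(url, 'page-2'))
```

Keeping the URL logic in a small function like this makes it possible to check the paging behaviour without making any network requests.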


Unnamed: 0,Image,Title,Comment,Details,Price
0,,2010 Chrysler 300 TOURING | P/DRIVER SEAT | F/FOG LIGHTS,BEAUTIFUL GREY ON BLACK LEATHER CHRYSLER 300 TOURING W/TOP FEATURES: POWER DRIVER SEAT // FRONT FOG LIGHTS // CRUISE CONTROL // KEYLESS ENTRY & MORE! **Unique Features** Take command of the road in…,92715km | Automatic | $ Financing Available,12987
1,,2004 Mazda RX8 GT,I have a beautiful Mazda rx8 for sale that stands out from the rest. A lot of money invested but time to move on to something different as I've owned this car for over 8 years. Add ons: Turbo xs…,17000km | Manual,5888
2,,2012 Toyota Camry XLE V6 ~ EXTRA CLEAN ~ FULLY LOADED ~ NAVIGATI,Safety & E-test included in the price. No extra fees. Warranty & CarProof report available. HST is not included. K & L Auto Sales 4699 Keele st. unit#19 (Steeles & Keele) Toronto M3J 2N8 (416)…,96150km | Automatic,16499
3,,2011 BMW 535i xDrive 535i xDrive COMFORT ACCESS NAVIGATION PARKI,WELCOME!! I AM A 2011 BMW 535i xDRIVE AND CAN'T WAIT FOR US TO MEET!! LET ME TELL YOU A LITTLE ABOUT MYSELF!! I HAVE A BEAUTIFUL WHITE EXTERIOR ON A LIGHT BEIGE INTERIOR AND COME WITH OPTIONS LIKE…,92047km | Automatic,27998
4,,2012 BMW 535I xDrive M-Sport Pkg Nav Leather Sunroof Xenons Blin,This Gorgeous Canadian 2012 BMW 535i M-Sport comes with a clean CarProof and loaded with: xDrive! M-Sport Package! Navigation! Sunroof! Leather Seating! BlindSpot Warning! 3D Camera! Xenon HID…,65659km | Automatic | CarProof,38494
5,,2003 Volkswagen Jetta 1.8t Sedan,Hello I am selling a 2003 Volkswagen Jetta 1.8t. This is an amazing German car with all maintenance up too date. This car has been very well taken care of with ZERO SIGNS OF RUST OR dents. In the…,210000km | Automatic,4000
6,,"2013 Kia Optima (K5) LX Sedan - Well Cared For, MT, Enkei Wheels","Paltinum Graphite 2013, Manual Transmission Kia Optima LX, rebadged completely (inside and out) to a K5. I'm its first and only owner thus far and it has been well cared for. Regular AMSOIL full…",56855km | Manual,17500
7,,"2006 LT v6 MINT * 134,000 km - PERFECT 10 OUT OF 10 * CERTIFIED","BRAND NEW BRAKES,BALL JOINTS & TIE ROD ENDS INCLUDED FOR THE SAFETY.Like new inside & out.Factory Remote Starter. No rust,scrapes or dents.Recent tires.Great on gas.Runs & drives like new. We want...",134000km | Automatic,5275
8,,2006 AUDI A6 4.2 QUATTRO **EXECUTIVE PGK**,"4 DOOR SILVER ON BLACK LEATHER AUDI A6 EXECUTIVE PACKAGE. NAVIGATION, BACK-UP CAMERA, AND SUNROOF PLUS MUCH MORE! LOCAL CANADIAN CAR...VERY CLEAN AND WELL MAINTAINED CAR. PLEASE CHECK OUR WEB AT…",145000km | Automatic,12999
9,,"2008 Honda Accord coupe EX-L, GREAT BUY","CERTIFIED AND ETESTED, ALL PAPERWORK AVAILABLE NEW BATTERY PUT IN, ONLY 30000km ON BRAND NEW ENGINE PUT IN BY HONDA, NOT A REBUILT ENGINE Call or text (416) 427-6895 for quick response EXTRAS: ...",177489km | Manual,7500
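The Price column comes from stripping the dollar sign and thousands separators from each price cell and converting the rest to a float, with 0.0 as the fallback. Here is that step on its own, as a small sketch; the helper name `parse_price` and the sample strings are assumptions for illustration, not values taken from the site.

```python
def parse_price(text):
    """Convert a price cell such as '$12,987.00' to a float.

    Text that is not a number (or is empty) becomes 0.0.
    """
    cleaned = text.replace('$', '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        return 0.0

# Made-up inputs: two numeric prices, one non-numeric label, one empty cell.
for raw in ['$12,987.00', ' $5,888 ', 'Please Contact', '']:
    print(repr(raw), '->', parse_price(raw))
```

Note that a fallback of 0.0 is indistinguishable from a genuinely free item, so `float('nan')` or `None` may be a better sentinel if the prices are going to be averaged or sorted.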

