<center><h1>Web Scraping Kijiji</h1><h3>Using Python and Beautiful Soup</h3></center>

In [3]:
import pandas as pd
from IPython.display import HTML
from bs4 import BeautifulSoup
import urllib.request as request
from ipywidgets import interact
pd.set_option("display.max_rows",1000)
pd.set_option("display.max_columns",20)
pd.set_option("display.max_colwidth", 200)

### For this exercise, I will only be scraping the Toronto listings

In [4]:
base_url = 'http://www.kijiji.ca'
toronto_url = 'http://www.kijiji.ca/h-city-of-toronto/1700273'
html_kijiji = request.urlopen(toronto_url)

soup_kijiji = BeautifulSoup(html_kijiji, 'lxml')

Since I will be creating a drop-down widget containing all the listing categories, I looked at the page source to find where the complete list of available categories lives.  It turns out I need to grab all the **&lt;a&gt;** elements whose class attribute equals **"category-selected"**.

In [5]:
div_categories = soup_kijiji.find_all('a', class_='category-selected')
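
To see what this call matches without touching the network, here is a minimal sketch on a hand-written HTML fragment. The fragment and the names `sample_html` and `sample_soup` are made up for illustration and only imitate the kind of markup on the real page; `html.parser` is used so the sketch does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the category links on the Kijiji page.
sample_html = """
<ul>
  <li><a class="category-selected" data-id="10" href="/b-buy-sell/city-of-toronto/c10l1700273">buy and sell</a></li>
  <li><a class="other-link" href="/help">help</a></li>
  <li><a class="category-selected" data-id="112" href="/b-pets/city-of-toronto/c112l1700273">pets</a></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with a trailing underscore) filters on the HTML class attribute,
# because `class` is a reserved word in Python.
by_find_all = sample_soup.find_all("a", class_="category-selected")

# The same elements, selected with a CSS selector instead.
by_select = sample_soup.select("a.category-selected")

link_texts = [a.get_text() for a in by_find_all]
link_hrefs = [a["href"] for a in by_find_all]
```

Both calls skip the `other-link` anchor and return the two category links in document order.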

### Let's look at the first 20 entries of the category list:

In [6]:
div_categories[:20]

[<a class="category-selected" data-id="10" href="/b-buy-sell/city-of-toronto/c10l1700273">buy and sell</a>,
 <a class="category-selected" data-id="72" href="/b-services/city-of-toronto/c72l1700273">services</a>,
 <a class="category-selected" data-id="27" href="/b-cars-vehicles/city-of-toronto/c27l1700273">cars &amp; vehicles</a>,
 <a class="category-selected" data-id="112" href="/b-pets/city-of-toronto/c112l1700273">pets</a>,
 <a class="category-selected" data-id="800" href="/b-vacation-rentals/c800l1700273">vacation rentals</a>,
 <a class="category-selected" data-id="1" href="/b-community/city-of-toronto/c1l1700273">community</a>,
 <a class="category-selected" data-id="34" href="/b-real-estate/city-of-toronto/c34l1700273">real estate</a>,
 <a class="category-selected" data-id="45" href="/b-jobs/city-of-toronto/c45l1700273">jobs</a>,
 <a class="category-selected" data-id="218" href="/b-resumes/city-of-toronto/c218l1700273">resumes</a>,
 <a class="category-selected" data-id="63" href="/

In the output above, the category name is the link text at the end of each element: "buy and sell", "services", "cars & vehicles", etc.

### Now I will create a Python dictionary to map each category name to its URL

In [7]:
categories = {}
for item in div_categories:
    categories[item.get_text()] = base_url + item['href']

### Let's see what the dictionary looks like:

In [8]:
categories

{'(more categories...)': 'http://www.kijiji.ca/b-resumes/city-of-toronto/c218l1700273',
 'ATVs, snowmobiles': 'http://www.kijiji.ca/b-atv-snowmobile/city-of-toronto/c171l1700273',
 'Canada': 'http://www.kijiji.ca/b-vacation-rentals-canada/c801l1700273',
 'Caribbean': 'http://www.kijiji.ca/b-vacation-rentals-caribbean/c803l1700273',
 'Mexico': 'http://www.kijiji.ca/b-vacation-rentals-mexico/c804l1700273',
 'Other Countries': 'http://www.kijiji.ca/b-vacation-rentals-other-countries/c805l1700273',
 'RVs, campers, trailers': 'http://www.kijiji.ca/b-rv-camper-trailer/city-of-toronto/c172l1700273',
 'SUVs': 'http://www.kijiji.ca/b-cars-trucks/city-of-toronto/suv+crossover/c174l1700273a138',
 'USA': 'http://www.kijiji.ca/b-vacation-rentals-usa/c802l1700273',
 'accessories': 'http://www.kijiji.ca/b-pet-accessories/city-of-toronto/c115l1700273',
 'accounting, mgmt': 'http://www.kijiji.ca/b-accounting-management-jobs/city-of-toronto/c58l1700273',
 'activities, groups': 'http://www.kijiji.ca/b-ac
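
One thing the output above hints at: the `'(more categories...)'` key is a generic link label rather than a category name, and because dictionary keys are unique, only the last link carrying a given label survives. Below is a minimal sketch of one way to detect such collisions. It uses made-up `(text, href)` pairs in place of the scraped `<a>` elements, and `sample_links` and `build_category_map` are hypothetical names. It also swaps string concatenation for `urllib.parse.urljoin`, which gives the same result for root-relative paths like the ones here.

```python
from urllib.parse import urljoin

base_url = "http://www.kijiji.ca"

# Made-up (link text, href) pairs standing in for the scraped <a> elements.
sample_links = [
    ("pets", "/b-pets/city-of-toronto/c112l1700273"),
    ("(more categories...)", "/b-buy-sell/city-of-toronto/c10l1700273"),
    ("(more categories...)", "/b-resumes/city-of-toronto/c218l1700273"),
]


def build_category_map(links, base):
    """Map link text to absolute URL, collecting labels seen more than once."""
    mapping = {}
    duplicates = set()
    for text, href in links:
        if text in mapping:
            duplicates.add(text)
        # A repeated label overwrites the earlier URL, as in a plain dict build.
        mapping[text] = urljoin(base, href)
    return mapping, duplicates


category_map, duplicate_labels = build_category_map(sample_links, base_url)
```

Here `duplicate_labels` ends up holding `'(more categories...)'`, which could then be dropped from the drop-down or given a more specific key.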

### Now, I will make a list of just the keys from the category dictionary I created earlier:

In [9]:
category_list = list(categories)  # iterating a dict yields its keys

### Now the fun part.  Below, I use the `interact` decorator from ipywidgets to create a drop-down widget, passing it the category_list that I just made.  The rest of the function grabs each listing's image URL, title, description, details, and price, builds a pandas data frame from them, and renders it as HTML.

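Unfortunately, this does NOT work for every listing category.  The function assumes each listing has an image, a title, a description, a details line, and a price; when a category's markup omits one of them, most likely either an attribute lookup fails or the column lists end up with different lengths and the data frame cannot be built.  It does work for most categories, especially those for selling items.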
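
One likely reason for the failures (my reading of the code, not something verified for every category): the function below builds one list per column, and `pd.DataFrame` raises a `ValueError` when those lists differ in length. A minimal reproduction with made-up data, where `titles`, `prices`, and `error_message` are illustrative names:

```python
import pandas as pd

# Made-up column lists: three titles but only two prices, as would happen
# if one listing had no price cell.
titles = ["Headphones", "Stereo", "Receiver"]
prices = [75.0, 160.0]

try:
    pd.DataFrame({"Title": titles, "Price": prices})
    error_message = None
except ValueError as exc:
    # pandas raises ValueError when the column lists differ in length.
    error_message = str(exc)
```

After this runs, `error_message` holds the text of the pandas error instead of staying `None`.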

In [17]:
@interact
def kijiji_listings(category = sorted(category_list)):
    html_cars = request.urlopen(categories[category])

    soup_cars = BeautifulSoup(html_cars, 'lxml')
    
    # Narrower alternative, not used here (it would also need `import re`):
    # tables = soup_cars.find_all('table', class_=re.compile('regular-ad|top-'))
    tables = soup_cars.find_all('table')
    
    img_urls = []
    for table in tables[1:]:
        for row in table.find_all('td', class_='image'):
            try:
                img_urls.append("<img src='" + row.div.img['src'] + "'>")
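            except (AttributeError, TypeError, KeyError):  # no <div>-wrapped image with a src; fall back to the cell's <img>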
                img_urls.append("<img src='" + row.img['src'] + "'>")
                
    titles = []
    for table in tables[1:]:
        for row in table.find_all('td', class_='description'):
            titles.append(row.a.get_text().strip())
            
    comments = []
    for table in tables[1:]:
        for row in table.find_all('td', class_='description'):
            comments.append(row.p.get_text().strip())
            
    details = []
    for table in tables[1:]:
        for row in table.find_all('td', class_='description'):
            for item in row.find_all('p', class_='details'):
                details.append(item.get_text().strip())
    
    prices = []
    for table in tables[1:]:
        for row in table.find_all('td', class_='price'):
            try:
                prices.append(float(row.get_text().replace('$','').replace(',','').strip()))
            except ValueError:  # non-numeric price text is recorded as 0.0
                prices.append(0.0)
            
    df = pd.DataFrame({'Price':prices, 'Image':img_urls, 'Title':titles, 'Comment':comments, 'Details':details})
    # Arrange the columns in a certain order
    df = df[['Image','Title','Comment','Details','Price']]

    
    return HTML(df.to_html(escape=False))  # if escape is set to True, the images won't be rendered

| | Image | Title | Comment | Details | Price |
|---|---|---|---|---|---|
| 0 | | BOXING DAY sale BRAND NEW BEATS by Dr Dre Headphones WARRANTY | EARLY BOXING DAY SALE! ===================== PRICES LOWER THEN BLACK FRIDAY! =============================== LIMITED QUANTITY ================ WHY WAIT IN LINE? GET THE DEALS EARLY!… | | 75.0 |
| 1 | | ★iTOUCH 4 AND 5 REPAIR★ ON THE SPOT - 6 REPAIR CENTERS +WARRANTY | Do you currently have an iTouch with a broken LCD digitizer screen? we can repair battery charging port and headphone jack. We use factory original parts and repairs are done on the spot with 3… | | 39.99 |
| 2 | | Clairtone Cabinet Stereo | Clairtone Cabinet Stereo - model number S 403; Walnut Cabinet. Complete with manuals and wiring diagrams. Needle assembly missing from record player. Unit in need of a good home. Make offer. | | 0.0 |
| 3 | | Harman Kardon AVR 1600 Home Theatre Cinema, with Speaker Package | *Note: All pics taken from Harman/Kardon Site. Entire set in brand new condition, including all manuals. Retail price approx $2150.Will only sell as entire set. Down sized from House to 1 BR condo .… | | 1700.0 |
| 4 | | Beats Solo 2 HD Headphones | Beats Solo2 HD Headphones Used in good condition Price negotiable (no low ballers) | | 160.0 |
| 5 | | MLB Toronto Blue Jays BIGR Audio Headphones New | Description Up for sale is a brand new (unopened) BIGR Blue Jays headphones. Officially Licensed by Major League Baseball Properties. Ready for Smart Phones: The headphones come with a cable with… | | 150.0 |
| 6 | | JBL J55i Black SEALED box #Premium #High-Performance Headphone | Brand new in sealed box, high-performance; premium quality metal plate; considerable mid-range to low price; semi- professional DJ w/ original Harman & Kardon™ engineering; unique styles 360°… | | 80.0 |
| 7 | | USED BEATS DR DRE SOLO 2.0 HEADPHONES BLACK | ON-EAR Model #: B0518 Current Retail Value : $220 (IF NEW) Source: amazon.ca USED TESTED - WORKING 14 DAY WARRANTY AUCTIONMAXX FREIGHT & SURPLUS LIQUIDATORS 200 | | 90.0 |
| 8 | | YAMAHA HTR-5063 7.1 CHANNEL 630 WATT HOME THEATRE 3D RECEIVER | MODEL: YAMAHA HTR-5063 NATURAL SOUND RECEIVER Excellent A condition. Powerful and perfectly working A condition. **Receiver only as pictured...no remote.Any cable,satellite, universal remote can ... | | 220.0 |
| 9 | | Sony On Ear Headphones White | New unopened with voice control original price $49.99 + tax | | 30.0 |
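
A possible way to make the function tolerate listings with missing pieces is to parse one listing at a time and build one record per listing, so that a missing field becomes an empty value instead of shifting every later row. The sketch below is not the approach used in this notebook. It assumes each listing sits in its own `<tr>` with the same `td` class names as above, which I have not verified against the live page, and the sample markup, the "Please Contact" price text, and the helper names `parse_price`, `text_or_none`, and `parse_listings` are all made up. Unlike the function above, it records an unparseable price as `None` rather than `0.0`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <tr> per listing. The second listing has no
# description paragraph and a non-numeric price.
sample_html = """
<table>
  <tr>
    <td class="image"><img src="a.jpg"></td>
    <td class="description"><a>Beats Solo 2</a><p>Used, good condition</p></td>
    <td class="price">$1,160.00</td>
  </tr>
  <tr>
    <td class="image"><img src="b.jpg"></td>
    <td class="description"><a>Cabinet Stereo</a></td>
    <td class="price">Please Contact</td>
  </tr>
</table>
"""


def parse_price(text):
    """Return the price as a float, or None when the text is not numeric."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def text_or_none(tag):
    """Return stripped text for a tag, or None when the tag is missing."""
    return tag.get_text().strip() if tag is not None else None


def parse_listings(html):
    """Build one dict per <tr>, so a missing field never shifts other rows."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        description = row.find("td", class_="description")
        if description is None:
            continue  # not a listing row (e.g. a header or spacer row)
        image = row.find("img")
        price_cell = row.find("td", class_="price")
        records.append({
            "Image": image.get("src") if image is not None else None,
            "Title": text_or_none(description.a),
            "Comment": text_or_none(description.p),
            "Price": parse_price(price_cell.get_text()) if price_cell is not None else None,
        })
    return records


records = parse_listings(sample_html)
```

A list of dicts like `records` can be passed straight to `pd.DataFrame`, where missing values show up as `None` or `NaN` instead of raising an error.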

