# Name: Huon Sophy
# ID: M060809
# <center> Scrape Bicycles from Khmer24 using BeautifulSoup </center>

### What is Web Scraping?
<p style="font-family: Times New Roman, serif; font-size:16pt; line-height: 1.5;"> 
     &nbsp; <b>Web scraping</b> is the process of gathering information from the Internet. The term usually refers to doing this automatically, with a program rather than by hand.
</p>

### What is Beautiful Soup?
<p style="font-family: Times New Roman, serif; font-size:16pt; line-height: 1.5;">
    &nbsp; <b>Beautiful Soup</b> is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
</p>
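As a small illustration of navigating, searching, and modifying a parse tree, the sketch below parses a hand-written HTML fragment. The fragment and its values are made up for this example and are not taken from Khmer24.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment used only for illustration.
html = """
<div class="item">
  <h2 class="item-title">Sample Bike</h2>
  <ul>
    <li>Phnom Penh</li>
    <li>3 hits</li>
  </ul>
  <span class="price">$100</span>
</div>
"""

doc = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches).
title_text = doc.find("h2", class_="item-title").get_text()

# find_all() returns a list of every matching tag.
list_items = [li.get_text() for li in doc.find_all("li")]

# Tags can also be modified in place.
doc.find("span", class_="price").string = "$90"
price_text = doc.find("span", class_="price").get_text()

print(title_text, list_items, price_text)
```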

In [26]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd

## HTTP Request
<p style="font-family: Times New Roman, serif; font-size:16pt; line-height: 1.5;">
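    Store the page URL in a variable, request it with a browser-like User-Agent header, and parse the response with Beautiful Soup.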
</p>

In [27]:

url = 'https://www.khmer24.com/en/c-bicycles.html'
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})

webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
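A request like the one above can fail or hang if the site is slow or unreachable. The following is a minimal sketch of one way to guard against that. The helper names `build_request` and `fetch_html` and the 10-second timeout are assumptions made for this example, not part of the original notebook.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError as error:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print(f"Could not fetch {url}: {error}")
        return None


# Building the request does not contact the server.
req = build_request("https://www.khmer24.com/en/c-bicycles.html")
print(req.full_url)
```

Calling `fetch_html(url)` would then return the bytes to pass to `soup(..., "html.parser")`, or `None` when the download did not succeed.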


In [28]:
# display html of the bicycles page
# page_soup

<p style="font-family: Times New Roman, serif; font-size:16pt; line-height: 1.5;">
    <b> Find the area of the page that holds the information we want to scrape. </b>
</p>

In [29]:
results = page_soup.find_all('div', class_ = 'item-detail')

In [30]:
results[0]

<div class="item-detail">
<h2 class="item-title truncate truncate-2">TREK Frame Original</h2>
<p class="description truncate truncate-2">TREK Frame
#ហាងកង់ចាក់អង្រែ
&gt; មានលក់កង់ជជុស​ជប៉ុន​​ និងកង់ថ្មីម៉ាក UPLAND DIAMON និង​ TRINX
&gt; មានសេវា <i>tel: 092510051</i></p> <ul class="list-unstyled summary">
<li>Phnom Penh</li>
<li><time datetime="2022-07-06 10:00:02">7 minutes ago</time></li>
<li>1 hits</li> </ul>
<p class="item-prices m-0 text-red"><span class="price">$195</span></p>
</div>

In [31]:
len(results)

50

<p style="font-family: Times New Roman, serif; font-size:16pt; line-height: 1.5;">
    <b> Get Title, Description, Location, and Price. </b>
</p>

In [32]:
results[0].find('h2', {'class':'item-title'}).get_text()

'TREK Frame Original'

In [33]:
results[0].find('p', {'class':'description truncate truncate-2'}).get_text().strip()

'TREK Frame\n#ហាងកង់ចាក់អង្រែ\n> មានលក់កង់ជជុស\u200bជប៉ុន\u200b\u200b និងកង់ថ្មីម៉ាក UPLAND DIAMON និង\u200b TRINX\n> មានសេវា tel: 092510051'

In [34]:
results[0].find('p', {'class':'description truncate truncate-2'}).find('i').get_text()

'tel: 092510051'

In [35]:
results[0]('ul', {'class':'list-unstyled summary'})

[<ul class="list-unstyled summary">
 <li>Phnom Penh</li>
 <li><time datetime="2022-07-06 10:00:02">7 minutes ago</time></li>
 <li>1 hits</li> </ul>]

In [36]:
results[0].find('li').get_text()

'Phnom Penh'

In [37]:
results[0].find('span', {'class':'price'}).get_text()

'$195'

<p style="font-family: Times New Roman, serif; font-size:16pt; line-height: 1.5;">
    <b> Put Title, Description, Location, and Price inside the Loop. </b>
</p>

In [38]:
# Get Title, Description, Location, and Price

title = []
description = []
phone = []
location = []
price = []

for result in results:
    
    try:
        title.append(result.find('h2', {'class':'item-title'}).get_text())
    except:
        title.append('n/a')
    try:
        description.append(result.find('p', {'class':'description truncate truncate-2'}).get_text().strip('\n'))
    except:
        description.append('n/a')
    try:
        phone.append(result.find('p', {'class':'description truncate truncate-2'}).find('i').get_text())
    except:
        phone.append('n/a')
    try:
        location.append(result.find('li').get_text())
    except:
        location.append('n/a')
    try:
        price.append(result.find('span', {'class':'price'}).get_text())
    except:
        price.append('n/a')
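The repeated `try`/`except` pattern can also be expressed as one small helper that checks whether `find` returned a tag. This is only a sketch: `text_or_default` is a hypothetical name, and the sample HTML is made up so the fallback path can be seen.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, class_name, default="n/a"):
    """Return the stripped text of the first matching tag, or `default`."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text().strip() if tag is not None else default


# Made-up listing with no price element, to exercise the fallback.
sample = BeautifulSoup(
    '<div class="item-detail"><h2 class="item-title"> Sample Bike </h2></div>',
    "html.parser",
)

sample_title = text_or_default(sample, "h2", "item-title")
sample_price = text_or_default(sample, "span", "price")
print(sample_title, sample_price)
```

Checking for `None` explicitly avoids a bare `except`, which would also hide unrelated errors such as a typo in a variable name.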
        

In [39]:
title

['TREK Frame Original',
 'Scott scale 970',
 'កង់លក់ Scott sub',
 'Scott gravel',
 'Scott',
 'TRINX',
 'ពិលកង់ 800lumen ( fast charger )',
 'Cannondale F300',
 'Panasonic MR-S vintage ( classic)',
 'Groupset SRAM SX',
 'Trek Malin 5 2022 99% 3 Times used.',
 '2022 Trek Malin 5 S27.5 99%',
 'ជជុះជប៉ុនរង្វង់27.5 នៅខ្មែរមិនទាន់ជិះ',
 'Wheetset 29er',
 'ជជុះជប៉ុនរង្វង់27.5នៅខ្មែរមិនទាន់ជិះ',
 'កង់សម្រាប់ជិះហាត់ប្រាណ រង្វង់កង់លេខ 26\u200b ៉',
 'កង់លក់',
 'Giant Talon 1 2021 for sale',
 'Specialized Epic Expert 2021 New',
 'Santacruz លក់\u200b',
 'SantaCruz លក់\u200b 99%',
 'Giant Rincon លក់ប្រញាប់',
 'Forever Brand',
 'សួរស្តី',
 'All 10$',
 'កង់បុរាណ លកតែ 180$',
 'កង់បុរាណ លកតែ 180$',
 'បញ្ចុះតម្លៃពិសេសពិលក្រោយកង់មានសុីញ៉ូប្រើតេលេបញ្ជាមានសំលេង ( Wireless remote control turn signal With horn)',
 'កង់លក់ canondle ទើបដូរដុំបាដាងហើយ នៅថ្មី',
 'កង់លក់ cannondle Tril5 សាយ Lអាស៊ី កង់នៅថ្មីណាស់ទើបដូរជង្គង់ប្រហោង និងបាដាងARC 009 ហើយ',
 'SRAM SX 12',
 'នៅថ្មីអត់ដែលជិះ រង្វង់29 150$',
 'GIANT Revel 1

## Create DataFrame

In [40]:
bicycle_data = pd.DataFrame({'Title': title, 'Description': description, 'Phone': phone, 'Location': location, 'Price': price})

In [41]:
bicycle_data

Unnamed: 0,Title,Description,Phone,Location,Price
0,TREK Frame Original,TREK Frame\n#ហាងកង់ចាក់អង្រែ\n> មានលក់កង់ជជុស​...,tel: 092510051,Phnom Penh,$195
1,Scott scale 970,Sram SX 12លេខ ប៉ូម rockshok judy ប្រាំង Mt200 ...,"tel: 0965159022,078971799",Phnom Penh,$930
2,កង់លក់ Scott sub,កង់លក់ Scott sub M 27.5 គ្រឿង Sram X7 កង់ជុលជុ...,"tel: 0965159022,078971799",Phnom Penh,$300
3,Scott gravel,Scott gravel 700 X 35 9លេខ ជង្គង់ប៉ាដាង single...,"tel: 0965159022,078971799",Phnom Penh,$400
4,Scott,Scott sub សេះស គ្រឿង sram X7 ជង្គង់ប៉ាដាងជើងប៉...,"tel: 0965159022,078971799",Phnom Penh,$400
5,TRINX,កង់ស្អាតគ្រឿង shimano Altus M2000 ប្រាំង ប្រេង...,"tel: 0965159022,078971799",Phnom Penh,$300
6,ពិលកង់ 800lumen ( fast charger ),🚴‍♂️ ពិលមូល #Rockbros 800lm ( fast charger )\n...,"tel: 0962322777,011232277",Phnom Penh,$17.85
7,Cannondale F300,010 545774 tel: 010545774,tel: 010545774,Phnom Penh,$420
8,Panasonic MR-S vintage ( classic),Shimano Deore DX (3x7) រង្វង់ កង់ 26 tel: 0105...,tel: 010545774,Phnom Penh,$195
9,Groupset SRAM SX,ធ្លាប់តែប្រើ #លីប #ដេរីយ័រ #ដៃចុច និង #ច្រវ៉ាក...,tel: 092510051,Phnom Penh,$165


## Cleaning Data for the first page

In [42]:
# remove the literal 'tel:' label, then trim surrounding whitespace
bicycle_data['Phone'] = bicycle_data['Phone'].str.replace('tel:', '', regex=False).str.strip()

bicycle_data

Unnamed: 0,Title,Description,Phone,Location,Price
0,TREK Frame Original,TREK Frame\n#ហាងកង់ចាក់អង្រែ\n> មានលក់កង់ជជុស​...,092510051,Phnom Penh,$195
1,Scott scale 970,Sram SX 12លេខ ប៉ូម rockshok judy ប្រាំង Mt200 ...,0965159022078971799,Phnom Penh,$930
2,កង់លក់ Scott sub,កង់លក់ Scott sub M 27.5 គ្រឿង Sram X7 កង់ជុលជុ...,0965159022078971799,Phnom Penh,$300
3,Scott gravel,Scott gravel 700 X 35 9លេខ ជង្គង់ប៉ាដាង single...,0965159022078971799,Phnom Penh,$400
4,Scott,Scott sub សេះស គ្រឿង sram X7 ជង្គង់ប៉ាដាងជើងប៉...,0965159022078971799,Phnom Penh,$400
5,TRINX,កង់ស្អាតគ្រឿង shimano Altus M2000 ប្រាំង ប្រេង...,0965159022078971799,Phnom Penh,$300
6,ពិលកង់ 800lumen ( fast charger ),🚴‍♂️ ពិលមូល #Rockbros 800lm ( fast charger )\n...,0962322777011232277,Phnom Penh,$17.85
7,Cannondale F300,010 545774 tel: 010545774,010545774,Phnom Penh,$420
8,Panasonic MR-S vintage ( classic),Shimano Deore DX (3x7) រង្វង់ កង់ 26 tel: 0105...,010545774,Phnom Penh,$195
9,Groupset SRAM SX,ធ្លាប់តែប្រើ #លីប #ដេរីយ័រ #ដៃចុច និង #ច្រវ៉ាក...,092510051,Phnom Penh,$165
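A note on removing the `tel:` label: Python's `str.strip` treats its argument as a set of characters to trim from both ends, not as a prefix. The sketch below shows the difference on made-up strings. `remove_tel_label` is a hypothetical helper name.

```python
import re


def remove_tel_label(raw):
    """Drop a leading 'tel:' label and the whitespace around it."""
    return re.sub(r"^\s*tel:\s*", "", raw)


# strip() removes any of the characters t, e, l, : from both ends,
# so it also eats the trailing "ll" here and leaves the leading space.
stripped = "tel: call".strip("tel:")

# The regular expression removes only the label at the start.
cleaned = remove_tel_label("tel: 0965159022,078971799")

print(repr(stripped), repr(cleaned))
```

Phone numbers end in digits, so `strip` happens to work on this column, but the prefix-based version states the intent more directly.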


<hr>

## Scraping all pages

In [43]:
title = []
description = []
phone = []
location = []
price = []

for i in range(1,42):
    url = 'https://www.khmer24.com/en/c-bicycles.html?per_page=' + str(i)
    req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
    
    
    # request to website
    website = urlopen(req).read()
    
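    # create soup object
    page_soup = soup(website, "html.parser")
    
    # find the listings on this page; without this the loop reuses page 1 results
    results = page_soup.find_all('div', class_ = 'item-detail')
    
    for result in results: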
    
        try:
            title.append(result.find('h2', {'class':'item-title'}).get_text())
        except:
            title.append('n/a')
        try:
            description.append(result.find('p', {'class':'description truncate truncate-2'}).get_text().strip('\n'))
        except:
            description.append('n/a')
        try:
            phone.append(result.find('p', {'class':'description truncate truncate-2'}).find('i').get_text())
        except:
            phone.append('n/a')
        try:
            location.append(result.find('li').get_text())
        except:
            location.append('n/a')
        try:
            price.append(result.find('span', {'class':'price'}).get_text())
        except:
            price.append('n/a')
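The page URLs in the loop can also be built up front and visited with a short pause between requests. The sketch below only constructs and prints the URLs. It reuses the `per_page` query key and the `range(1, 42)` from the loop above, while `page_url` and the 0.1-second delay are assumptions made for illustration.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://www.khmer24.com/en/c-bicycles.html"


def page_url(page_number):
    """Build the listing URL for one page using the per_page query key."""
    return f"{BASE_URL}?{urlencode({'per_page': page_number})}"


page_urls = [page_url(i) for i in range(1, 42)]

for url in page_urls[:3]:
    print(url)
    # A real crawl would fetch and parse `url` here, then pause
    # briefly so the server is not hit with back-to-back requests.
    time.sleep(0.1)
```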

In [44]:
bicycle_data = pd.DataFrame({'Title': title, 'Description': description, 'Phone': phone, 'Location': location, 'Price': price})

bicycle_data

Unnamed: 0,Title,Description,Phone,Location,Price
0,TREK Frame Original,TREK Frame\n#ហាងកង់ចាក់អង្រែ\n> មានលក់កង់ជជុស​...,tel: 092510051,Phnom Penh,$195
1,Scott scale 970,Sram SX 12លេខ ប៉ូម rockshok judy ប្រាំង Mt200 ...,"tel: 0965159022,078971799",Phnom Penh,$930
2,កង់លក់ Scott sub,កង់លក់ Scott sub M 27.5 គ្រឿង Sram X7 កង់ជុលជុ...,"tel: 0965159022,078971799",Phnom Penh,$300
3,Scott gravel,Scott gravel 700 X 35 9លេខ ជង្គង់ប៉ាដាង single...,"tel: 0965159022,078971799",Phnom Penh,$400
4,Scott,Scott sub សេះស គ្រឿង sram X7 ជង្គង់ប៉ាដាងជើងប៉...,"tel: 0965159022,078971799",Phnom Penh,$400
...,...,...,...,...,...
2045,កង់ជប៉ុន បូមមុខក្រោយមានច្រើនoption off road,សួស្ដីបងៗ​\nអតិថិជនទាំងអស់ដែលផ្តោតសំខាន់ទៅលើទី...,tel: 093688846,Phnom Penh,$390.00
2046,លក់ឡៃឡុងYETI,កង់មួយទឹក\r(មិនទាន់ប្រើស្រុកខ្មែរ)\n(ទិញលុយសុទ...,tel: 012879091,Phnom Penh,"$2,500"
2047,KTM,កង់មួយទឹក\n(ទិញលុយសុទ្ធ ឬបង់រំលោះ100%)\nKTM\nS...,tel: 012879091,Phnom Penh,$1
2048,កង់ លក់ 230$,កង់ថ្មី ណាស់ មិនសូវដែលជិះ ថ្មីដូចបកកេស\nទើបតែថ...,"tel: 016786478,092207092",Banteay Meanchey,$230


## Cleaning Phone Data column

In [45]:
# remove the literal 'tel:' label, then trim surrounding whitespace
bicycle_data['Phone'] = bicycle_data['Phone'].str.replace('tel:', '', regex=False).str.strip()

bicycle_data

Unnamed: 0,Title,Description,Phone,Location,Price
0,TREK Frame Original,TREK Frame\n#ហាងកង់ចាក់អង្រែ\n> មានលក់កង់ជជុស​...,092510051,Phnom Penh,$195
1,Scott scale 970,Sram SX 12លេខ ប៉ូម rockshok judy ប្រាំង Mt200 ...,0965159022078971799,Phnom Penh,$930
2,កង់លក់ Scott sub,កង់លក់ Scott sub M 27.5 គ្រឿង Sram X7 កង់ជុលជុ...,0965159022078971799,Phnom Penh,$300
3,Scott gravel,Scott gravel 700 X 35 9លេខ ជង្គង់ប៉ាដាង single...,0965159022078971799,Phnom Penh,$400
4,Scott,Scott sub សេះស គ្រឿង sram X7 ជង្គង់ប៉ាដាងជើងប៉...,0965159022078971799,Phnom Penh,$400
...,...,...,...,...,...
2045,កង់ជប៉ុន បូមមុខក្រោយមានច្រើនoption off road,សួស្ដីបងៗ​\nអតិថិជនទាំងអស់ដែលផ្តោតសំខាន់ទៅលើទី...,093688846,Phnom Penh,$390.00
2046,លក់ឡៃឡុងYETI,កង់មួយទឹក\r(មិនទាន់ប្រើស្រុកខ្មែរ)\n(ទិញលុយសុទ...,012879091,Phnom Penh,"$2,500"
2047,KTM,កង់មួយទឹក\n(ទិញលុយសុទ្ធ ឬបង់រំលោះ100%)\nKTM\nS...,012879091,Phnom Penh,$1
2048,កង់ លក់ 230$,កង់ថ្មី ណាស់ មិនសូវដែលជិះ ថ្មីដូចបកកេស\nទើបតែថ...,016786478092207092,Banteay Meanchey,$230
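Some listings carry more than one phone number, separated by a comma. If the numbers need to stay distinct, one option is to clean the column with pandas string methods and split it into a list per row. This is a sketch on made-up rows that mirror the shape of the scraped column, and the `PhoneList` column name is hypothetical.

```python
import pandas as pd

# Made-up rows that mirror the shape of the scraped Phone column.
phones = pd.DataFrame(
    {"Phone": ["tel: 092510051", "tel: 0965159022,078971799", "n/a"]}
)

# Remove only the leading "tel:" label, keeping the comma between numbers.
phones["Phone"] = phones["Phone"].str.replace(r"^\s*tel:\s*", "", regex=True)

# Split multiple numbers into a list per row.
phones["PhoneList"] = phones["Phone"].str.split(",")

print(phones)
```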


## Write all data to CSV and Excel

In [46]:
bicycle_data.to_csv('bicycle_data.csv', index=False)

In [47]:
bicycle_data.to_excel('bicycle_data.xlsx', index=False)
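Because the data contains Khmer text, the CSV encoding can matter when the file is opened in a spreadsheet program. A CSV saved as plain UTF-8 may show garbled characters in some versions of Excel, and writing it with the `utf-8-sig` encoding (UTF-8 with a byte-order mark) is a common workaround. The sketch below uses a made-up one-row frame and a temporary folder.

```python
import os
import tempfile

import pandas as pd

# Made-up single-row frame containing Khmer text.
sample = pd.DataFrame({"Title": ["កង់លក់"], "Price": ["$230"]})

# Write to a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "sample.csv")

    # "utf-8-sig" writes a byte-order mark at the start of the file.
    sample.to_csv(csv_path, index=False, encoding="utf-8-sig")

    with open(csv_path, "rb") as handle:
        first_bytes = handle.read(3)

    # Reading with the same encoding drops the mark again.
    round_trip = pd.read_csv(csv_path, encoding="utf-8-sig")

print(first_bytes, round_trip["Title"][0])
```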

## Read data from the file we created

In [48]:
df = pd.read_csv(r"C:\Users\Hsophy\Downloads\bicycle_data.csv")
df

Unnamed: 0,Title,Description,Phone,Location,Price
0,Cannondale F300,010 545774 tel: 010545774,010545774,Phnom Penh,$420
1,Panasonic MR-S vintage ( classic),Shimano Deore DX (3x7) រង្វង់ កង់ 26 tel: 0105...,010545774,Phnom Penh,$195
2,Groupset SRAM SX,ធ្លាប់តែប្រើ #លីប #ដេរីយ័រ #ដៃចុច និង #ច្រវ៉ាក...,092510051,Phnom Penh,$165
3,Trek Malin 5 2022 99% 3 Times used.,450$ 99% Size S 27.5.\nTrek Malin 5 2022 tel: ...,089992249,Phnom Penh,$450
4,2022 Trek Malin 5 S27.5 99%,តម្លៃ​ពិសេស​។ កង់​ថ្មី​ 99.99​ តម្លៃ​ខាង​លេី​អ...,089992249,Phnom Penh,$450
...,...,...,...,...,...
2045,កង់លក់ប្រញាប់ Branch អាមេរិច model : Trek marl...,កង់លក់ប្រញាប់ Branch អាមេរិច model : Trek marl...,085622188,Phnom Penh,$420
2046,លក់ឡៃឡុងROCKRIDER XC50 LTD (កង់បារាំង),កង់មួយទឹក(មិនទាន់ប្រើក្នុងស្រុកខ្មែរ)\n(ទិញលុយ...,012879091,Phnom Penh,$650
2047,CANYON LUX 2020,កង់មួយទឹក\r(មិនទាន់ប្រើស្រុកខ្មែរ)\n(ទិញលុយសុទ...,012879091,Phnom Penh,"$4,500"
2048,កង់បត់តួរបស់ជប៉ុន,ចង់លក់កង់ក្មេងអាយុ៤ទៅ៨ឆ្នាំជិះបានអាចបត់ដាក់គូទ...,016787849068727719,Phnom Penh,$25
