<a href="https://colab.research.google.com/github/stevcas17/codigos/blob/main/Actividad_Carros_webScraping_cars_fromColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping Used Car Web Data: Case Study tucarro.com (Colab version)
[Author: Elias Buitrago Bolivar](https://github.com/ebuitrago?tab=repositories)

This Jupyter notebook presents a Python-based web scraping routine that collects data for training a machine learning model to predict car prices. Used-car listings are extracted from [Tu Carro](https://www.tucarro.com.co). The code was tested by scraping real data, and this version is compatible with Colab.
_Updated: Jun 20, 2024_


## Install required libraries

In [17]:
!pip install lxml
!pip install scrapy
!pip3 install requests-html
!pip3 install selenium



In [18]:
%%shell
# Install chromedriver
# Credits: https://medium.com/@MinatoNamikaze02/running-selenium-on-google-colab-a118d10ca5f8
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb

wget -N https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.62/linux64/chromedriver-linux64.zip -P /tmp/
unzip -o /tmp/chromedriver-linux64.zip -d /tmp/
chmod +x /tmp/chromedriver-linux64/chromedriver
mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver

pip install selenium chromedriver_autoinstaller

[33m0% [Working][0m            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
[33m0% [Connecting to archive.ubuntu.com (91.189.91.81)] [Waiting for headers] [Con[0m                                                                               Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
[33m0% [Waiting for headers] [Waiting for headers] [Connecting to ppa.launchpadcont[0m                                                                               Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease



### Web Scraping Used Car Sales Data
This section explains the web scraping process used to obtain data from the used car sales website [Tu Carro](https://www.tucarro.com.co).

In [19]:
!pip install undetected_chromedriver



## Import required libraries


---

In [20]:
'''
credits:
https://github.com/googlecolab/colabtools/issues/3347
https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com
Sept 19, 2023
'''

#
!pip3 install chromedriver-autoinstaller



In [21]:
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

import time
import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import chromedriver_autoinstaller
import json

## Set up Chrome and ChromeDriver


---



In [22]:
# setup chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless') # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# download a chromedriver that matches the installed Chrome (if needed) and add it to PATH
chromedriver_autoinstaller.install()

'/usr/local/lib/python3.10/dist-packages/chromedriver_autoinstaller/126/chromedriver'


## Section to declare functions

---



### Function scrapebyPages

In [23]:
def scrapebyPages(brand, model, first_page, last_page):
    # Scrape the search-result pages first_page .. last_page - 1 for one brand and model.
    # It is recommended to cover a range of one hundred pages in each call.
    data = pd.DataFrame()
    for page in range(first_page, last_page):

        print('************************************')
        print(f'WEB SCRAPING FROM SEARCH PAGE #{page}')
        url = f'https://vehiculos.tucarro.com.co/{brand}/{model}/_Desde_{49*page}_NoIndex_True'

        # Load the search page and release the browser as soon as the HTML is captured,
        # so that browsers do not pile up across iterations
        driver = webdriver.Chrome(options=chrome_options)
        try:
            driver.get(url)
            driver.implicitly_wait(10)
            html = driver.page_source
        finally:
            driver.quit()
        soup = bs(html, 'lxml')

        # Get the href of every car listing on this page
        links = gethref(soup)

        p = []
        # Scrape each listing (i is independent of the page counter)
        for i in range(0, len(links)):
            print('Scrapping', i, '/', len(links), '...')
            p.append(scrapper(links[i]))
            print(f'Este es el valor de p[i]: {p[i]}')  # "This is the value of p[i]"

        # Append this page's rows to the DataFrame
        temp_df = pd.DataFrame(p)
        data = pd.concat([data, temp_df], ignore_index=True)

    return data
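
To preview which search URLs a page range produces without launching a browser, the URL pattern used in `scrapebyPages` can be factored into a small helper. This is a minimal sketch: `build_search_url` and `BASE_URL` are hypothetical names that do not appear in the notebook, and the `49 * page` offset is copied from the function above rather than checked against the site's pagination rules.

```python
BASE_URL = 'https://vehiculos.tucarro.com.co'

def build_search_url(brand: str, model: str, page: int) -> str:
    """Return the search-results URL that scrapebyPages requests for `page`."""
    offset = 49 * page  # same offset formula as in scrapebyPages
    return f'{BASE_URL}/{brand}/{model}/_Desde_{offset}_NoIndex_True'

# Preview the URLs for pages 1 to 3 of a search
for page in range(1, 4):
    print(build_search_url('kia', 'rio', page))
```

Printing the URLs first makes it easy to open one in a regular browser and confirm that it shows the expected listings before starting a long scraping run.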

### Function gethref

In [24]:
# Function to get the 'href' of each car listing on a search page
def gethref(soup):

    links = []
    # href=True skips anchors without an href, which would otherwise yield None
    for link in soup.find_all('a', href=True):
        url_car = link.get('href')
        if 'MCO-' in url_car:
            # print(url_car)          # Print each car url as a validity test
            links.append(url_car)

    print("Href obtained: ", len(links))

    return links
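
The filtering step in `gethref` can be tried on plain strings, without a browser or a parsed page. The sketch below assumes a hypothetical helper, `filter_listing_links`, and made-up sample URLs. It keeps only hrefs that contain the `MCO-` marker, skips `None` values (an anchor without an `href` attribute yields `None`, and `'MCO-' in None` raises a `TypeError`), and also drops duplicates, which the notebook's function does not do.

```python
def filter_listing_links(hrefs, marker='MCO-'):
    """Keep hrefs that contain `marker`, skipping None values and duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        if href is None or marker not in href:
            continue  # not a car listing
        if href in seen:
            continue  # already collected
        seen.add(href)
        links.append(href)
    return links

# Made-up hrefs for illustration only
sample = [
    None,
    'https://example.com/help',
    'https://example.com/MCO-1-listing',
    'https://example.com/MCO-1-listing',
    'https://example.com/MCO-2-listing',
]
print(filter_listing_links(sample))
```

Dropping duplicates is optional; it only matters if the same listing is linked more than once on a results page.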

### Function scrapper

In [25]:
# Function to call the extract_cars_features routine on each href
def scrapper(url_car):

    # set up the webdriver
    driver = webdriver.Chrome(options=chrome_options)

    try:
        # Load the listing page and capture the HTML rendered by Selenium
        driver.get(url_car)
        driver.implicitly_wait(10)
        html = driver.page_source
    finally:
        # quit() closes every window and ends the driver session, even on errors
        driver.quit()

    soup = bs(html, 'lxml')

    # Extract the variables defined for one particular car listing
    features = extract_cars_features(soup)

    return features

### Function extract_cars_features

In [26]:
# Version 1.0
def extract_cars_features(soup):

    features_list = []

    # car_name
    try:
        car_name = soup.find('h1', {'class': 'ui-pdp-title'}).text
    except AttributeError:  # element not found
        car_name = ' '
    features_list.append(car_name)

    # price
    try:
        price = soup.find('div', {'class': 'ui-pdp-price__second-line'}).text
    except AttributeError:  # element not found
        price = 0
    features_list.append(price)

    # year and kms share the subtitle element, so parse it once
    try:
        year_kms_datePub = soup.find('div', {'class': 'ui-pdp-header__subtitle'}).text.split(' ')
    except AttributeError:  # element not found
        year_kms_datePub = []

    # year_car
    year = year_kms_datePub[0] if len(year_kms_datePub) > 0 else 0
    features_list.append(year)

    # kms
    kms = year_kms_datePub[2] if len(year_kms_datePub) > 2 else 0
    features_list.append(kms)

    # color and fuel type (read from the JSON-LD metadata script)
    color, fuel = 0, 0  # defaults keep every row at six columns
    try:
        script = soup.find("script", {'type': 'application/ld+json'})
        if script:
            # Obtain script content
            script_text = json.loads(script.string)

            # Extract json keys for color and fuel type
            color = script_text.get('color', 'Color not found')
            fuel = script_text.get('fuelType', 'Fuel type not found')
        else:
            print("JSON-LD script was not found on the page.")
    except json.JSONDecodeError as e:
        print("Error decoding JSON:", str(e))
    except Exception as e:
        print("An unexpected error occurred:", str(e))
    features_list.extend([color, fuel])

    return features_list
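
The two parsing steps inside `extract_cars_features` (splitting the subtitle text and reading the JSON-LD metadata) can be exercised on plain strings. The sketch below is illustrative only: `parse_subtitle` and `parse_json_ld` are hypothetical helpers, and both sample strings are assumptions about what the page contains, chosen so that the year sits at index 0 and the kilometres at index 2 after splitting on spaces, as the function above expects.

```python
import json

def parse_subtitle(subtitle):
    """Return (year, kms) from the subtitle text; 0 replaces a missing field."""
    parts = subtitle.split(' ')
    year = parts[0] if parts[0] else 0
    kms = parts[2] if len(parts) > 2 else 0
    return year, kms

def parse_json_ld(raw):
    """Return (color, fuel) from a JSON-LD string; (0, 0) if it cannot be used."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return 0, 0
    if not isinstance(payload, dict):
        return 0, 0  # e.g. a JSON array instead of a single object
    return (payload.get('color', 'Color not found'),
            payload.get('fuelType', 'Fuel type not found'))

# Assumed sample inputs, not captured from the site
subtitle = '2022 | 22.000 km · Publicado hace 3 días'
json_ld = '{"@type": "Car", "color": "Gris", "fuelType": "Híbrido"}'

print(parse_subtitle(subtitle))
print(parse_json_ld(json_ld))
```

Keeping the parsing separate from the browser code makes it possible to check the index positions quickly whenever the site changes its subtitle format.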

## Start scraping

---

In [27]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

In [47]:
"""
 The 'scrapebyPages' function takes four input parameters. The first two are the
 brand name and the car model name; write them exactly as they appear on tucarro.com.
 The third is the initial results page (always initialize to 1) and the fourth is the
 final results page; the range excludes this last value, so (1, 8) covers pages 1 to 7.
 The final page depends on how many pages of results the portal returns for the
 chosen brand and model, so it is recommended to search the web portal first
 to find out how many pages are available.
"""

car_brand = 'Ford'   # Brand car name. E.g.: chevrolet, renault, kia.
car_model = 'escape'        # Model car name. E.g.: duster, onix, rio.
data = scrapebyPages(car_brand,car_model,1,8)

************************************
WEB SCRAPING FROM SEARCH PAGE #1
Href obtained:  48
Scrapping 0 / 48 ...
Este es el valor de p[i]: ['Ford Escape 2.5 Se Hev 4x2', '$125.500.000', '2022', '22.000', 'Gris', 'Híbrido']
Scrapping 1 / 48 ...
Este es el valor de p[i]: ['Ford Escape Se 4X4', '$57.000.000', '2013', '88.400', 'Rojo', 'Gasolina']
Scrapping 2 / 48 ...
Este es el valor de p[i]: ['Ford Escape 2.0 Sel Hev', '$145.000.000', '2022', '16.000', 'Gris', 'Híbrido']
Scrapping 3 / 48 ...
Este es el valor de p[i]: ['Ford Escape 2.0 Se 4x4', '$50.000.000', '2014', '99.589', 'Blanco', 'Gasolina']
Scrapping 4 / 48 ...
Este es el valor de p[i]: ['Ford Escape Titanium 2.0 4x4', '$92.000.000', '2018', '36.000', 'Rojo', 'Gasolina']
Scrapping 5 / 48 ...
Este es el valor de p[i]: ['Ford Escape 2.0 Se 4x2', '$51.000.000', '2014', '110.000', 'Plateado', 'Gasolina']
Scrapping 6 / 48 ...
Este es el valor de p[i]: ['Ford Escape 2.0 TITANIUM TP 4X4', '$80.000.000', '2018', '50.000', 'Azul', 'Gasolina']
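
The final-page argument can be estimated from the number of results the portal reports for a search. The sketch below assumes 48 listings per results page, which is the count shown in the output above (`Href obtained:  48`). `pages_needed` is a hypothetical helper, and the total of 100 results is an invented figure.

```python
import math

RESULTS_PER_PAGE = 48  # assumption based on the "Href obtained:  48" output

def pages_needed(total_results, per_page=RESULTS_PER_PAGE):
    """Number of results pages required to cover `total_results` listings."""
    return math.ceil(total_results / per_page)

total = 100  # invented example: number of results reported by the portal
print(pages_needed(total))
```

Because `scrapebyPages` iterates over `range(first, last)`, the last value is excluded, so a call covering every page would pass `pages_needed(total) + 1` as the fourth argument.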

In [48]:
cols = ['car_model','price','year_model','kms','color','fueltype']
data.columns = cols
print(data.shape)
data.head()

(336, 6)


| | car_model | price | year_model | kms | color | fueltype |
|---|---|---|---|---|---|---|
| 0 | Ford Escape 2.5 Se Hev 4x2 | $125.500.000 | 2022 | 22.000 | Gris | Híbrido |
| 1 | Ford Escape Se 4X4 | $57.000.000 | 2013 | 88.400 | Rojo | Gasolina |
| 2 | Ford Escape 2.0 Sel Hev | $145.000.000 | 2022 | 16.000 | Gris | Híbrido |
| 3 | Ford Escape 2.0 Se 4x4 | $50.000.000 | 2014 | 99.589 | Blanco | Gasolina |
| 4 | Ford Escape Titanium 2.0 4x4 | $92.000.000 | 2018 | 36.000 | Rojo | Gasolina |
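
The scraped `price` and `kms` values are strings that use `.` as a thousands separator, so they need converting before they can feed a price prediction model. A minimal sketch of that conversion follows; `to_int` is a hypothetical helper, and it assumes every dot is a thousands separator, as in the samples scraped above.

```python
def to_int(text):
    """Convert strings such as '$125.500.000' or '22.000' to int.

    Assumes '.' is a thousands separator, as in the scraped samples.
    Returns None when the value contains no digits.
    """
    digits = ''.join(ch for ch in str(text) if ch in '0123456789')
    return int(digits) if digits else None

# Values taken from the first rows of the scraped output
rows = [
    ('$125.500.000', '22.000'),
    ('$57.000.000', '88.400'),
]
for price, kms in rows:
    print(to_int(price), to_int(kms))
```

With pandas, the same helper could be applied to a whole column, for example `data['price'].map(to_int)`.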


In [50]:
saved_name=f'CarsFordEscape_{car_model}_200624.csv'
data.to_csv(saved_name, encoding='utf-8', index=False)
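
The file name above hard-codes both the brand and the date. One possible alternative is to derive them from the search parameters and the current day. This is a sketch, not the notebook's method: `csv_name` is a hypothetical helper, and the `DDMMYY` date layout is inferred from the `200624` suffix together with the notebook's update date.

```python
from datetime import date

def csv_name(brand, model, day=None):
    """Build a file name such as 'Cars_ford_escape_200624.csv' (date as DDMMYY)."""
    day = day or date.today()
    return f'Cars_{brand.lower()}_{model.lower()}_{day:%d%m%y}.csv'

print(csv_name('Ford', 'escape', date(2024, 6, 20)))
```

Passing the result to `data.to_csv(...)` keeps the files from different brands, models, and scraping days apart.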

### Testing code for scraping only one page
This section provides test code that scrapes a single page of search results.

In [None]:
#*****************************
#Code for testing in one page
#*****************************
brand = 'kia'   # Brand car name. E.g.: chevrolet, renault, kia.
model = 'rio'   # Model car name. E.g.: duster, onix, rio.

# url = f'https://vehiculos.tucarro.com.co/{model}-{brand}'
url = f'https://vehiculos.tucarro.com.co/{brand}/{model}/_Desde_{49*1}_NoIndex_True'
print(url)

# scrapper() and extract_cars_features() are reused from the cells above

# Load the search page, capture its HTML, and release the browser
driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get(url)
    driver.implicitly_wait(10)
    html = driver.page_source
finally:
    driver.quit()
soup = bs(html,'lxml')

#Get href (href=True skips anchors that have no href attribute)
links = []
for link in soup.find_all('a', href=True):
  url_car = link.get('href')
  if 'MCO-' in url_car:
    links.append(url_car)
print("Href obtained: ", len(links))

p = []
#Scraping
for i in range(0,len(links)):
  print('Scrapping', i, '/', len(links), '...')
  p.append(scrapper(links[i]))
  print(f'Este es el valor de p[i]: {p[i]}')  # "This is the value of p[i]"

temp_df = pd.DataFrame(p)
# data = pd.concat([data, temp_df], ignore_index=True)

temp_df.head()

https://vehiculos.tucarro.com.co/kia/rio/_Desde_49_NoIndex_True
Href obtained:  48
Scrapping 0 / 48 ...
Este es el valor de p[i]: ['Kia Rio 1.4 Xcite', '$28.500.000', '2010', '91.100', 'Plateado', 'Gasolina']
Scrapping 1 / 48 ...
Este es el valor de p[i]: ['Kia Rio 1.4 Zenith', '$78.000.000', '2024', '7.800', 'Gris', 'Gasolina']
Scrapping 2 / 48 ...
Este es el valor de p[i]: ['Kia Rio 1.5 Stylus', '$20.500.000', '2010', '115.000', 'Gris', 'Gasolina']
Scrapping 3 / 48 ...
Este es el valor de p[i]: ['Kia Rio 1.4 Vibrant At', '$58.800.000', '2019', '35.800', 'Gris', 'Gasolina']
Scrapping 4 / 48 ...
Este es el valor de p[i]: [' Kia Rio Zenith 2018', '$59.900.000', '2018', '44.307', 'Blanco', 'Gasolina']
Scrapping 5 / 48 ...
Este es el valor de p[i]: ['Kia Rio Zenit 2018', '$56.990.000', '2018', '79.693', 'Rojo', 'Gasolina']
Scrapping 6 / 48 ...
Este es el valor de p[i]: ['Kia Rio 1.4 Vibrant At', '$78.000.000', '2023', '9.850', 'Gris', 'Gasolina']
Scrapping 7 / 48 ...
Este es el valor de p

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | Kia Rio 1.4 Xcite | $28.500.000 | 2010 | 91.100 | Plateado | Gasolina |
| 1 | Kia Rio 1.4 Zenith | $78.000.000 | 2024 | 7.800 | Gris | Gasolina |
| 2 | Kia Rio 1.5 Stylus | $20.500.000 | 2010 | 115.000 | Gris | Gasolina |
| 3 | Kia Rio 1.4 Vibrant At | $58.800.000 | 2019 | 35.800 | Gris | Gasolina |
| 4 | Kia Rio Zenith 2018 | $59.900.000 | 2018 | 44.307 | Blanco | Gasolina |


## References
---



https://github.com/kiteco/kite-python-blog-post-code/blob/master/Web%20Scraping%20Tutorial/script.py

https://medium.com/geekculture/scrappy-guide-to-web-scraping-with-python-475385364381

https://stackoverflow.com/questions/47730671/python-3-using-requests-does-not-get-the-full-content-of-a-web-page