### Task

Scrape the following website: https://www.adac.de/rund-ums-fahrzeug/autokatalog/marken-modelle/ford/kuga/iii/310380/

### Evaluation

Key points to consider

1. Your decision on what the important data is
2. The data output of your choice
3. The readability of your code

<b>Time Spent:</b> approximately 6 hours

Hi all,

First, a few points on how I approached the task.
1. <b>Important data:</b> Which data is important depends on the objective one wants to achieve; the scraping is then done with that objective in mind and insights are gathered from the result. For this task the decision was left to us, so I scraped the data according to my own judgment. The web page presents the Ford Kuga (III) model in different forms: textual data (e.g. the title and the conclusion of the model), tabular data (e.g. the test result ratings of the items), and images (e.g. pictures of the model). I scraped all of these forms, because each one can support different uses: textual data can be useful for NLP tasks and image data for computer vision. I extracted the images because they also represent the car model and can be useful for model classification tasks in particular. The PDF report could also be extracted for further information, but for the sake of simplicity I did not extract it.

<br>

2. <b>Data Output:</b> Because the data comes in different forms, it can be saved in different formats, such as tabular data in a CSV file or a database and textual data in a text file. For easier viewing, I have simply displayed everything in the notebook.
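
As a minimal sketch of the storage options mentioned above, tabular data could be written to a CSV file and free text to a plain text file. The function name `save_outputs`, the file names, and the sample values in the usage note are illustrative assumptions and not part of the notebook.

```python
import csv
from pathlib import Path


def save_outputs(table: dict, text: str, out_dir: str) -> tuple:
    """Write a {label: value} dict to a CSV file and a string to a text file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    csv_path = out / "model_info.csv"   # hypothetical file name
    txt_path = out / "conclusion.txt"   # hypothetical file name

    # utf-8 keeps characters such as the euro sign and umlauts intact
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Item", "ItemValue"])
        writer.writerows(table.items())

    txt_path.write_text(text, encoding="utf-8")
    return csv_path, txt_path
```

A call such as `save_outputs({"Grundpreis": "35.434 €"}, "some conclusion text", "output")` would create both files in an `output` folder.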


### Importing Libraries

Please make sure that all the libraries used below are installed; install any missing one with the command: pip install library_name.

In [1]:
import pandas as pd
import selenium
from selenium import webdriver
import time
import os
import io
from PIL import Image
import requests
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementClickInterceptedException

### Install Chrome Driver

In [2]:
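opts = webdriver.ChromeOptions()
# Run Chrome without a visible window; the opts.headless setter is deprecated in newer Selenium releases
opts.add_argument("--headless")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=opts)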



Current google-chrome version is 100.0.4896
Get LATEST chromedriver version for 100.0.4896 google-chrome
Trying to download new driver from https://chromedriver.storage.googleapis.com/100.0.4896.60/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\Usama\.wdm\drivers\chromedriver\win32\100.0.4896.60]


### Specifying Web URL

In [3]:
url = 'https://www.adac.de/rund-ums-fahrzeug/autokatalog/marken-modelle/ford/kuga/iii/310380/'
if requests.get(url).status_code == 200:
    driver.get(url)
else:
    print("please enter a valid url")

### A function to scroll to the end of the page

In [4]:
def scroll_to_end(driver):
    
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
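
The helper above scrolls once and waits a fixed three seconds. As a hedged alternative sketch, the scroll can be repeated until the page height stops growing, which may help when content is loaded lazily. The function name `scroll_until_stable` and its parameters are hypothetical; it only assumes the driver offers `execute_script`, as a Selenium WebDriver does.

```python
import time


def scroll_until_stable(driver, pause: float = 1.0, max_rounds: int = 10) -> int:
    """Scroll down until the document height stops growing; return the final height."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so the bottom has been reached
        last_height = new_height
    return last_height
```

The `max_rounds` limit keeps the loop from running forever on pages that keep loading new content.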

### A function to get items from the path and store them in a dictionary

In [5]:
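def get_data(path: str) -> dict:
    """Return {label: value} for every element matching the XPath."""
    web_elements = driver.find_elements(By.XPATH, path)
    # For the rating list, drop the first matched element
    if "sc-hatQeL bGVFHQ" in path:
        web_elements = web_elements[1:]
    items_dict = {}
    for element in web_elements:
        lines = element.text.split('\n')
        if len(lines) >= 2:  # skip elements without a label/value pair
            items_dict[lines[0]] = lines[1]
    # Always return a dict (empty if nothing matched) so callers can use .items()
    return items_dict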
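
The dictionary-building step in `get_data` relies on each element's text having the form `label\nvalue`. As a small sketch, the same parsing can be written as a pure function that is testable without a browser. The name `parse_label_value` is hypothetical, and the sample strings in the usage note are taken from the output shown later in this notebook.

```python
def parse_label_value(texts):
    """Turn 'label\\nvalue' strings into a {label: value} dict."""
    items = {}
    for text in texts:
        # Drop empty lines and surrounding whitespace before pairing
        lines = [line.strip() for line in text.split("\n") if line.strip()]
        if len(lines) >= 2:  # need both a label and a value
            items[lines[0]] = lines[1]
    return items
```

For example, `parse_label_value(["Grundpreis\n35.434 €", "Kraftstoff\nSuper"])` gives a dictionary with the keys `Grundpreis` and `Kraftstoff`; strings with only one line are skipped.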

### A function to get text from the path

In [6]:
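def get_element_text(path: str) -> str:
    # find_element raises NoSuchElementException when nothing matches, so the
    # else branch could never run; find_elements returns an empty list instead
    web_elements = driver.find_elements(By.XPATH, path)
    if web_elements:
        return web_elements[0].text
    else:
        return "No element found"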

### Calling the functions to get textual information related to the model

In [7]:
# Title of the model
scroll_to_end(driver)
car_title = get_element_text(path="//h1")

# Information about the model
model_info = get_data(path="//*[contains(@class, 'sc-bdvvtL goIptw sc-gsDKAQ sc-hXLIYv hIxhWw gDkLYR')]")

#test result text and date
test_result_text=get_element_text(path= "//div[@class='sc-eSxRXt sc-aaqME sc-gNUafA wROml itVlFP kULHUp']/h2")
test_result_date=get_element_text(path= "//div[@class='sc-eSxRXt sc-aaqME sc-gNUafA wROml itVlFP kULHUp']/p")

#Conclusion of the model, header and paragraph
conclusion_header = get_element_text(path= "//div[@class='sc-bttaWv iYkiKq sc-kNzDjo fAMqTT sc-hWBuOZ sc-kLnunm dksvCW iCyXOG']/h2")
conclusion_paragraph = get_element_text(path= "//div[@class='sc-bttaWv iYkiKq sc-kNzDjo fAMqTT sc-hWBuOZ sc-kLnunm dksvCW iCyXOG']/p")

#strength and weakness of the model
strength_weakness = driver.find_elements(By.XPATH, "//div[@class='sc-iRFsWr sc-eZhRLC sc-kudmJA gnrIkT gPaWAU iJATmX']/div")

# Similar models to the current model
similar_header = get_element_text(path= "//div[@class='sc-gqtqkP sc-ihINtW jxxFFm cRtkmj sc-jefHZX kyAaUw']/header")
similar_articles = get_element_text(path= "//div[@class='sc-gqtqkP sc-ihINtW jxxFFm cRtkmj sc-jefHZX kyAaUw']/div/div")
similar_models = {similar_header: similar_articles.split('\n')} 
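
The XPath expressions above match the generated class strings as a whole, so they can stop matching as soon as one class name in the string changes. As a hedged sketch, a small helper can build an XPath that matches individual class tokens instead. The helper name `xpath_for_classes` is hypothetical, and whether any single class such as `sc-hatQeL` stays stable on the ADAC page is an assumption that would need checking against the live site.

```python
def xpath_for_classes(tag: str, classes: list) -> str:
    """Build an XPath that requires each class token, in any order."""
    conditions = [
        # Padding with spaces makes the test match whole tokens only
        f"contains(concat(' ', normalize-space(@class), ' '), ' {cls} ')"
        for cls in classes
    ]
    return f"//{tag}[{' and '.join(conditions)}]"
```

The resulting string can be passed to `driver.find_elements(By.XPATH, ...)` in the same way as the hand-written paths.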

### Calling the functions to get information related to the rating of the items

In [8]:
# Heading of both rating tabs
rating_tabs_web_element = driver.find_elements(By.XPATH,  "//*[contains(@class, 'sc-fUCuFg hQyzTx')]")
if rating_tabs_web_element:
    tab_texts=[text.text for text in rating_tabs_web_element]
else:
    print("No element found")

# Rating of items in Testergebnis and Zielgruppencheck
# ElementNotInteractableException is not imported in the import cell, so import it here
from selenium.common.exceptions import ElementNotInteractableException

try:
    driver.execute_script("arguments[0].click();", WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "label[for='tab-test-results']"))))
    test_result_rating=get_data(path= "//dl[@class='sc-hatQeL bGVFHQ']/div")
    driver.execute_script("arguments[0].click();", WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "label[for='tab-target-group']"))))
    target_result_rating=get_data(path= "//dl[@class='sc-hatQeL bGVFHQ']/div")
except (ElementClickInterceptedException, ElementNotInteractableException) as err:
    # A tuple catches both exception types; "A or B" only ever caught A
    print(err)
    
# Actual rating categories
truth_rating = get_data(path= "//div[@class='sc-cCKzRf iHQIhw sc-iNpzLj icBbah']/div")

### Getting Model images

In [9]:
def getting_images(path: str):
    time.sleep(2)
    web_element_imgs = driver.find_elements(By.XPATH, path)

    if web_element_imgs:
        for i, element in enumerate(web_element_imgs):
            # Zero-padded name such as 000.jpg; "{i:50}" padded the name with spaces
            file_name = f"{i:03d}.jpg"
            url = element.get_attribute('src')
            if url and 'https' in url:
                try:
                    image_content = requests.get(url).content
                except Exception as e:
                    print(f"ERROR - COULD NOT DOWNLOAD {url} - {e}")
                    continue  # nothing was downloaded, so skip saving

                try:
                    image_file = io.BytesIO(image_content)
                    image = Image.open(image_file).convert('RGB')
                    with open(file_name, 'wb') as f:
                        image.save(f, "JPEG", quality=85)
                except Exception as e:
                    print(f"ERROR - COULD NOT SAVE {url} - {e}")

        print("Saved all images in a default directory")
    else:
        print("Images element not found")

### Displaying the output in a suitable form

#### Brand and Model of the car and its information

In [10]:
print("\033[1m" + "Brand and model of the car is: " + "\033[0m" + car_title)
pd.DataFrame(model_info.items(), columns=['Item', 'ItemValue'])

Brand and model of the car is: Ford Kuga 1.5 EcoBoost ST-Line X (04/20 - 09/20)


|   | Item | ItemValue |
|---|------|-----------|
| 0 | Grundpreis | 35.434 € |
| 1 | Kraftstoff | Super |
| 2 | Verbrauch | 6,8 l/100 km |
| 3 | Leistung | 110 kW (150 PS) |


#### Testergebnis date

In [11]:
print(test_result_text + ": " + test_result_date)

Testergebnis: April 2021


#### Ratings in Testergebnis and Zielgruppencheck

In [12]:
df_testergebnis = pd.DataFrame(test_result_rating.items(), columns=['Testergebnis_Item', 'Testergebnis_ItemRating'])
df_zielgruppencheck = pd.DataFrame(target_result_rating.items(), columns=['Zielgruppencheck_Item', 'Zielgruppencheck_ItemRating'])
result = pd.concat([df_testergebnis[0:7], df_zielgruppencheck[0:7]], axis=1, join='inner')

rating_catagories = pd.DataFrame(truth_rating.items(), columns=['Catagory', 'Value'])
df_common = df_testergebnis[7:].rename({'Testergebnis_Item': 'Item', 'Testergebnis_ItemRating': 'ItemRating'}, axis=1).reset_index(drop=True)

display(result)
display(df_common)
display(rating_catagories)

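|   | Testergebnis_Item | Testergebnis_ItemRating | Zielgruppencheck_Item | Zielgruppencheck_ItemRating |
|---|-------------------|-------------------------|-----------------------|-----------------------------|
| 0 | Karosserie/Kofferraum | 25 | Familie | 24 |
| 1 | Innenraum | 21 | Stadtverkehr | 43 |
| 2 | Komfort | 25 | Senioren | 31 |
| 3 | Motor/Antrieb | 19 | Langstrecke | 26 |
| 4 | Fahreigenschaften | 24 | Transport | 23 |
| 5 | Sicherheit | 16 | Fahrspaß | 31 |
| 6 | Umwelt/EcoTest | 31 | Preis/Leistung | 24 |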


|   | Item | ItemRating |
|---|------|------------|
| 0 | ADAC Urteil Autotest | 23 |
| 1 | Autokosten | 25 |

|   | Catagory | Value |
|---|----------|-------|
| 0 | sehr gut | 0,6 - 1,5 |
| 1 | gut | 1,6 - 2,5 |
| 2 | befriedigend | 2,6 - 3,5 |
| 3 | ausreichend | 3,6 - 4,5 |
| 4 | mangelhaft | 4,6 - 5,5 |
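
The rating categories above map a numeric score to a verbal grade. As a minimal sketch, the mapping can be expressed in code using the ranges from this table. The names `RATING_BANDS`, `parse_german_decimal`, and `rating_category` are hypothetical, and the sketch assumes a score is available as a string with a decimal comma (for example `"2,3"`).

```python
# (category, lower bound, upper bound), taken from the rating categories table
RATING_BANDS = [
    ("sehr gut", 0.6, 1.5),
    ("gut", 1.6, 2.5),
    ("befriedigend", 2.6, 3.5),
    ("ausreichend", 3.6, 4.5),
    ("mangelhaft", 4.6, 5.5),
]


def parse_german_decimal(value: str) -> float:
    """Convert a string with a decimal comma, such as '2,3', to a float."""
    return float(value.strip().replace(",", "."))


def rating_category(score: float) -> str:
    """Return the verbal category for a score between 0.6 and 5.5."""
    rounded = round(score, 1)  # the bands are defined in steps of 0.1
    for name, low, high in RATING_BANDS:
        if low <= rounded <= high:
            return name
    raise ValueError(f"score {score} is outside the rating scale")
```

For example, `rating_category(parse_german_decimal("2,3"))` falls into the band `gut`.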


#### Stärken und Schwächen of the model

In [13]:
if strength_weakness:
    strength_weak_elements = {item.text.split("\n")[0]: item.text.split("\n")[1:] for item in strength_weakness}
    for header, elements in strength_weak_elements.items():
        print("\n\033[1m" + header + ":\033[0m")
        for element in elements:
            print(element)
else:
    print("strength and weakness of model not found")


Stärken:
Navigationssystem Serie
praktischer Türkantenschutz
umfangreiche Sicherheitsausstattung

Schwächen:
im Grenzbereich zu starkes Untersteuern, besteht ADAC Ausweichtest nicht
hoher Verbrauch


#### Conclusion of the model

In [14]:
print("\033[1m" + conclusion_header + ":\n \033[0m")
print(conclusion_paragraph)

Fazit zum Ford Kuga 1.5 EcoBoost ST-Line X (04/20 - 09/20):

Die SUV-Welle schwappt schon einige Zeit um die Welt, dementsprechend ist der 2019 vorgestellte Ford Kuga tatsächlich schon in der dritten Generation auf dem Markt. Er basiert auf der C2 genannten Plattform des Herstellers, die auch vom Ford Focus genutzt wird. Äußerlich nicht gleich zu erkennen, merkt man beim Blick auf das Armaturenbrett, dass der Focus verwandt sein muss: Nur Experten könnten ohne Vergleich spontan sagen, in welchem Auto sie sich befinden. Der Kuga bietet im Vergleich zum Focus die höhere Sitzposition, aber auch die höhere Ladekante. Alles wie vom SUV erwartet also. Gerade in der getesteten Ausstattung ST-Line X trägt der Kuga äußerlich dick auf und will sportliche Naturen ansprechen, die uns deutlich zu nervös ansprechende Lenkung und das ST-Line-Sportfahrwerk sollen wohl in die gleiche Kerbe schlagen. Fordert man den Kuga dann aber, wie etwa im ADAC Ausweichtest, flüchtet er sich in starkes Unte

#### Similar models to the current model

In [15]:
print("\033[1m" + similar_header + ":\033[0m")
print(similar_articles)

Auswahl ähnlicher Modelle:
Suzuki SX4
BMW X2
SsangYong Korando
VW T-Roc


#### Getting Model Images

In [16]:
getting_images(path= "//div[contains(@class, 'sc-jvvksu kiAtge sc-iWBNLc sc-inrDdN dyHdxL dUQZlk')]/picture/img")

Saved all images in a default directory


In [17]:
# quit() closes all windows and also ends the chromedriver process
driver.quit()

## THE END