# Every Ferrari Ever Made - Web Scraping, Data Cleaning and Analysis

This project scrapes data from the following website:
[https://www.supercars.net/blog/all-brands/ferrari/ferrari-model-list/](https://www.supercars.net/blog/all-brands/ferrari/ferrari-model-list/)

The following steps present the process of scraping, cleaning and exporting the data to a .csv file:

Import the necessary libraries:

In [7]:
from bs4 import BeautifulSoup
import requests

Send a GET request to the specified URL:

In [9]:
url = 'https://www.supercars.net/blog/all-brands/ferrari/ferrari-model-list/'

In [11]:
page = requests.get(url)
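The request above assumes the server answers promptly and with a success status. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page body as text, raising on network or HTTP errors."""
    # The timeout value is an assumption; without one, requests may wait indefinitely
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html(url)` would take the place of `requests.get(url)` followed by `page.text`.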

Parse the HTML content of the web page using BeautifulSoup.
**'page.text'** contains the HTML content of the page retrieved from the web.

**'BeautifulSoup'** is a Python library used for parsing HTML and XML documents.
The first argument 'page.text' is the raw HTML content that needs to be parsed.
The second argument 'lxml' specifies the parser to use for parsing the HTML.
'lxml' is a fast and feature-rich HTML/XML parser. It is a separate package that has to be installed, and it is generally faster than 'html.parser', the parser built into Python.
After parsing, 'soup' will be a BeautifulSoup object, which provides various methods to navigate and search the HTML structure easily.

In [13]:
soup = BeautifulSoup(page.text, 'lxml')
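Since 'lxml' is an optional dependency, the parsing step can fall back to the built-in parser when it is missing. This is a small sketch of that idea; `make_soup` and the sample HTML are hypothetical and only serve as an illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html):
    """Parse with lxml when it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not available
        return BeautifulSoup(html, 'html.parser')

# Hypothetical stand-in for page.text
sample_html = '<html><head><title>Ferrari Model List</title></head><body></body></html>'
sample_soup = make_soup(sample_html)
print(sample_soup.title.text)
```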

This is what the beginning of the soup looks like:

<img src="input1.jpg" width="800"/>

### Extraction of desired data
While exploring the HTML structure of the website, I discovered that the data describing each Ferrari model is located within **div** elements with the class 'block-html-content clearfix link-color-wrap'.

<img src="PythonInputs/input6.jpg">

I then collected every **div** on the page that has this class.

In [15]:
models = soup.find_all('div', class_='block-html-content clearfix link-color-wrap')
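One detail worth knowing about the call above: when `class_` is given a string with several class names, BeautifulSoup compares it with the exact value of the `class` attribute, so the order of the names matters. The sketch below uses a made-up HTML fragment to contrast this with a CSS selector, which matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second div carries the same classes in a different order
html = '''
<div class="block-html-content clearfix link-color-wrap">data</div>
<div class="clearfix block-html-content link-color-wrap">reordered</div>
<div class="other">unrelated</div>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

# A multi-class string passed to class_ matches the attribute value exactly
exact = demo_soup.find_all('div', class_='block-html-content clearfix link-color-wrap')

# A CSS selector matches elements that carry all three classes, in any order
by_selector = demo_soup.select('div.block-html-content.clearfix.link-color-wrap')

print(len(exact), len(by_selector))
```

On the scraped page the exact-string match is sufficient, because the blocks share the same attribute value.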

The following loop extracts the current Ferrari model names and the links to more information about them. There are 14 current models, but every **div** box with data is followed by another one with a short text description, which is assigned to the same class. We therefore keep only every second **div** object (through slicing: models[:28:2]).

In [17]:
for model in models[:28:2]:
    names = model.find_all('a')
    for name in names:
        print(name.text)
        print(name['href'])

Ferrari 812 GTS
https://www.supercars.net/blog/ferrari-812-gts/
Ferrari SF90 Stradale
https://www.supercars.net/blog/category/brand/ferrari/ferrari-supercars/ferrari-sf90-stradale/
Ferrari SF90 Spider
https://www.supercars.net/blog/category/brand/ferrari/ferrari-supercars/ferrari-sf90-spider/
Ferrari 296 GTB
https://www.supercars.net/blog/ferrari-296-gtb-an-in-depth-look/
Ferrari 296 GTS
https://www.supercars.net/blog/the-new-ferrari-296-gts/
Ferrari F8 Tributo
https://www.supercars.net/blog/category/brand/ferrari/ferrari-road-cars/ferrari-f8-tributo/
Ferrari F8 Spider
https://www.supercars.net/blog/category/brand/ferrari/ferrari-road-cars/ferrari-f8-tributo/ferrari-f8-tributo-spider/
Ferrari Roma
https://www.supercars.net/blog/category/brand/ferrari/ferrari-road-cars/ferrari-roma/
Ferrari Portofino M
https://www.supercars.net/blog/category/brand/ferrari/ferrari-road-cars/ferrari-portofino/ferrari-portofino-m/
Ferrari 812 Competizione
https://www.supercars.net/blog/2022-ferrari-812-com
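To illustrate how the slice `models[:28:2]` picks every second element, here is a small sketch on a plain list. The `blocks` list is a hypothetical stand-in for the scraped **div** objects, with data and description entries alternating.

```python
# Hypothetical stand-in for the scraped list: data blocks alternate with description blocks
blocks = []
for i in range(14):
    blocks.append(f'data {i}')
    blocks.append(f'description {i}')

# Start at index 0, stop before index 28, and step by 2
data_only = blocks[:28:2]

print(len(data_only))
print(data_only[:3])
```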

Extracting the **li** (list item) tags, which contain all of the data about the current models:

In [None]:
for item in models[:28]:
    li_tags = item.find_all('li')
    for li_tag in li_tags:
        print(li_tag)

Here is what the beginning of the captured data looks like:

<img src="PythonInputs/input2.jpg">

### Storing current model names and links in lists

In [351]:
current_model_names = []
current_model_links = []

for model in models[:28:2]:
    names = model.find_all('a')
    for name in names:
        current_model_names.append(name.text)
        current_model_links.append(name['href'])
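The loop above relies on every **a** tag having an `href` attribute, because `name['href']` raises a `KeyError` otherwise. A slightly more forgiving variant is sketched below on made-up HTML; the fragment and the example.com URL are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second link has no href attribute
html = '''
<div><a href="https://example.com/model-a"> Model A </a></div>
<div><a>Model B</a></div>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

names, links = [], []
for anchor in demo_soup.find_all('a'):
    # get_text(strip=True) removes surrounding whitespace from the link text
    names.append(anchor.get_text(strip=True))
    # .get() returns None instead of raising KeyError when href is missing
    links.append(anchor.get('href'))

print(names)
print(links)
```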

In [None]:
print(current_model_names)
print(current_model_links)

Here is what the lists look like:

<img src="PythonInputs/input3.jpg">

### Storing legacy model names and links in lists

In [355]:
legacy_model_names = []
legacy_model_links = []

for model in models[28::2]:
    names = model.find_all('a')
    for name in names:
        legacy_model_names.append(name.text)
        legacy_model_links.append(name['href'])

In [None]:
print(legacy_model_names)
print(legacy_model_links)

Here is what the beginning of the lists looks like:

<img src="PythonInputs/input4.jpg">

### Importing pandas library and creating the first DataFrame

In [319]:
import pandas as pd

Extracting data column header names:

In [432]:
# models[0] is the first data block; search it directly for its <li> tags
li_tags = models[0].find_all('li')
# The text before the first colon in each list item is the column name
current_model_data_columns = [li_tag.text.split(':')[0] for li_tag in li_tags]
print(current_model_data_columns)

['Base price', 'Engine', 'Power', 'Torque', '0-60 mph', '0-124 mph', 'Top Speed']
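Splitting on ':' works for these list items because each one contains a single colon. If a value ever contained a colon itself, `split(':')[-1]` would keep only the part after the last one. A hedged alternative is `str.partition`, which splits at the first colon only; `split_spec` and the second sample string are hypothetical.

```python
def split_spec(text):
    """Split 'Label: value' at the first colon only."""
    label, _, value = text.partition(':')
    return label.strip(), value.strip()

print(split_spec('Top Speed: 211 mph'))
# Made-up example of a value that contains a colon
print(split_spec('Note: ratio 2:1'))
```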


Creating the first Pandas DataFrame, starting with headers:

In [482]:
df1 = pd.DataFrame(columns = current_model_data_columns)
df1

| | Base price | Engine | Power | Torque | 0-60 mph | 0-124 mph | Top Speed |
|---|---|---|---|---|---|---|---|


Storing individual data rows for each current Ferrari model:

In [582]:
for item in models[:28:2]:
    li_tags = item.find_all('li')
    individual_row_data = [li_tag.text.split(':')[-1].strip() for li_tag in li_tags]
    #print(individual_row_data)
    length = len(df1)
    df1.loc[length] = individual_row_data
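Assigning a list through `df1.loc[length]` expects one value per column, so a model with a missing list item would not fit. One possible alternative, sketched here under the assumption that each item has the form 'Label: value', is to collect one dictionary per model and build the DataFrame in a single step. The sample rows are made up.

```python
import pandas as pd

columns = ['Engine', 'Power', 'Top Speed']
# Hypothetical scraped rows; the second one lacks a 'Power' entry
scraped_rows = [
    ['Engine: 3.9L twin-turbo V8', 'Power: 710 hp', 'Top Speed: 211 mph'],
    ['Engine: 2.0 L Colombo V12', 'Top Speed: 106 mph'],
]

records = []
for row in scraped_rows:
    record = {}
    for entry in row:
        # Split each entry at the first colon into a column label and a value
        label, _, value = entry.partition(':')
        record[label.strip()] = value.strip()
    records.append(record)

# Columns missing from a record are filled with NaN
demo_df = pd.DataFrame(records, columns=columns)
print(demo_df)
```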

In [584]:
df1

| | Base price | Engine | Power | Torque | 0-60 mph | 0-124 mph | Top Speed |
|---|---|---|---|---|---|---|---|
| 0 | US$401,000 | 6.2L naturally aspirated V12 | 789 hp @ 8900 rpm | 530 lb-ft @ 7000 rpm | 2.9 s | 8.3 s | 211 mph |
| 1 | US$507,000 | 4.0L TT V8 + 3 electric | 989 hp (combined) | 590 lb-ft | 2.5 s | 6.7 s | 211 mph |
| 2 | US$557,000 | 4.0L TT V8 + 3 electric | 989 hp (combined) | 590 lb-ft | 2.5 s | 6.7 s | 211 mph |
| 3 | US$322,986 | 2.9L TT V6 + Electric Motor | 819 hp (combined) | 546 lb-ft | 2.9 s | 7.3 s | 205 mph |
| 4 | ≈ US$340,000 | 2.9L TT V6 + Electric Motor | 819 hp (combined) | 546 lb-ft | 2.9 s | 7.6 s | 205 mph |
| 5 | US$276,000 | 3.9L twin-turbo V8 | 710 hp @ 8000 rpm | 568 lb-ft @ 3250 rpm | 2.9 s | 7.8 s | 211 mph |
| 6 | US$274,000 | 3.9L twin-turbo V8 | 710 hp @ 8000 rpm | 568 lb-ft @ 3250 rpm | 2.9 s | 7.8 s | 211 mph |
| 7 | US$222,630 | 3.9L twin-turbo V8 | 612 hp @ 7500 rom | 560 lb-ft @ 3000 rpm | 3.4 s | 9.3 s | 199 mph |
| 8 | US$245,000 | 3.9L twin-turbo V8 | 612 hp @ 7,500 rpm | 560 lb-ft @ 3000 rpm | 3.4 s | 9.3 s | 199 mph |
| 9 | US$601,570 | 6.5L naturally aspirated V12 | 819 hp @ 9,250 rpm | 510 lb-ft @ 7,000 rpm | 2.6 s | 7.0 s | 212 mph |


Inserting two columns, containing the model name and the link for more information:

In [586]:
df1.insert(0, "Model Name", current_model_names, True)
df1.insert(1, "More Info", current_model_links, True)

In [588]:
df1

| | Model Name | More Info | Base price | Engine | Power | Torque | 0-60 mph | 0-124 mph | Top Speed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Ferrari 812 GTS | https://www.supercars.net/blog/ferrari-812-gts/ | US$401,000 | 6.2L naturally aspirated V12 | 789 hp @ 8900 rpm | 530 lb-ft @ 7000 rpm | 2.9 s | 8.3 s | 211 mph |
| 1 | Ferrari SF90 Stradale | https://www.supercars.net/blog/category/brand/... | US$507,000 | 4.0L TT V8 + 3 electric | 989 hp (combined) | 590 lb-ft | 2.5 s | 6.7 s | 211 mph |
| 2 | Ferrari SF90 Spider | https://www.supercars.net/blog/category/brand/... | US$557,000 | 4.0L TT V8 + 3 electric | 989 hp (combined) | 590 lb-ft | 2.5 s | 6.7 s | 211 mph |
| 3 | Ferrari 296 GTB | https://www.supercars.net/blog/ferrari-296-gtb... | US$322,986 | 2.9L TT V6 + Electric Motor | 819 hp (combined) | 546 lb-ft | 2.9 s | 7.3 s | 205 mph |
| 4 | Ferrari 296 GTS | https://www.supercars.net/blog/the-new-ferrari... | ≈ US$340,000 | 2.9L TT V6 + Electric Motor | 819 hp (combined) | 546 lb-ft | 2.9 s | 7.6 s | 205 mph |
| 5 | Ferrari F8 Tributo | https://www.supercars.net/blog/category/brand/... | US$276,000 | 3.9L twin-turbo V8 | 710 hp @ 8000 rpm | 568 lb-ft @ 3250 rpm | 2.9 s | 7.8 s | 211 mph |
| 6 | Ferrari F8 Spider | https://www.supercars.net/blog/category/brand/... | US$274,000 | 3.9L twin-turbo V8 | 710 hp @ 8000 rpm | 568 lb-ft @ 3250 rpm | 2.9 s | 7.8 s | 211 mph |
| 7 | Ferrari Roma | https://www.supercars.net/blog/category/brand/... | US$222,630 | 3.9L twin-turbo V8 | 612 hp @ 7500 rom | 560 lb-ft @ 3000 rpm | 3.4 s | 9.3 s | 199 mph |
| 8 | Ferrari Portofino M | https://www.supercars.net/blog/category/brand/... | US$245,000 | 3.9L twin-turbo V8 | 612 hp @ 7,500 rpm | 560 lb-ft @ 3000 rpm | 3.4 s | 9.3 s | 199 mph |
| 9 | Ferrari 812 Competizione | https://www.supercars.net/blog/2022-ferrari-81... | US$601,570 | 6.5L naturally aspirated V12 | 819 hp @ 9,250 rpm | 510 lb-ft @ 7,000 rpm | 2.6 s | 7.0 s | 212 mph |
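For reference, the fourth positional argument passed to `insert` in the cell above is pandas' `allow_duplicates` flag, so `True` permits inserting a column whose name already exists. The sketch below uses made-up data to show the stricter setting, which raises a `ValueError` when the same column is inserted twice, for example after re-running a notebook cell.

```python
import pandas as pd

demo_df = pd.DataFrame({'Engine': ['V12', 'V8']})
demo_names = ['Model A', 'Model B']  # hypothetical model names

# Insert the new column at position 0; refuse duplicate column names
demo_df.insert(0, 'Model Name', demo_names, allow_duplicates=False)

try:
    # A second insert of the same column name is rejected
    demo_df.insert(0, 'Model Name', demo_names, allow_duplicates=False)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(list(demo_df.columns), duplicate_rejected)
```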


Saving the DataFrame as a .csv file:

In [590]:
df1.to_csv(r'C:\Users\dawid\Python Outputs\current_ferrari_models.csv')
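By default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. If that column is not wanted, `index=False` leaves it out. The following sketch shows the difference on a small made-up DataFrame, writing to in-memory buffers instead of files.

```python
import io
import pandas as pd

demo_df = pd.DataFrame({'Engine': ['V12', 'V8'], 'Top Speed': ['211 mph', '199 mph']})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False omits the index column from the output
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```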

### Second DataFrame creation

Extracting data column header names:

In [634]:
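# models[28] is the first legacy data block; search it directly for its <li> tags
li_tags = models[28].find_all('li')
# The text before the first colon in each list item is the column name
legacy_model_data_columns = [li_tag.text.split(':')[0] for li_tag in li_tags]
print(legacy_model_data_columns)

['Years', 'Production', 'Engine', 'Power', 'Torque', '0-60 mph', 'Top Speed']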


Creating the second Pandas DataFrame, starting with headers:

In [638]:
df2 = pd.DataFrame(columns = legacy_model_data_columns)
df2

| | Years | Production | Engine | Power | Torque | 0-60 mph | Top Speed |
|---|---|---|---|---|---|---|---|


### Storing individual data rows for each legacy Ferrari model

Four problems occurred as a result of formatting mistakes on the scraped website.

#### Unrelated text box between desired data boxes:
<img src="PythonInputs/problematic_list.jpg">

#### Unformatted model names
<img src="PythonInputs/problematic_list2.jpg">

<img src="PythonInputs/problematic_list3.jpg">

<img src="PythonInputs/problematic_list4.jpg">

To work around these problems, I split the list into slices that skip the problematic index positions [110, 134, 222, 240], which I identified by inspecting the original HTML code of the website.

In [None]:
slice1 = models[28:110:2]
slice2 = models[112:134:2]
slice3 = models[136:222:2]
slice4 = models[224:240:2]
slice5 = models[242::2]

for item in slice1 + slice2 + slice3 + slice4 + slice5:
    li_tags = item.find_all('li')
    individual_row_data2 = [li_tag.text.split(':')[-1].strip() for li_tag in li_tags]
    print(individual_row_data2)
    
    length2 = len(df2)
    df2.loc[length2] = individual_row_data2

Here is what a snippet of the captured data looks like:

<img src="PythonInputs/input5.jpg">
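The hard-coded slice positions depend on the page layout at the time of scraping. A possible alternative, assuming that only the data boxes contain **li** tags, is to filter the boxes by content instead of by position. The HTML fragment and the class name `box` below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: data boxes mixed with a description and an unrelated text box
html = '''
<div class="box"><ul><li>Years: 1948 - 1950</li><li>Top Speed: 106 mph</li></ul></div>
<div class="box"><p>Short description.</p></div>
<div class="box"><p>Unrelated text box.</p></div>
<div class="box"><ul><li>Years: 1950 - 1951</li><li>Top Speed: 112 mph</li></ul></div>
'''
demo_soup = BeautifulSoup(html, 'html.parser')
boxes = demo_soup.find_all('div', class_='box')

# Keep only the boxes that contain at least one list item
data_boxes = [box for box in boxes if box.find('li') is not None]

# Same value extraction as in the notebook: text after the last colon
rows = [
    [li.get_text().split(':')[-1].strip() for li in box.find_all('li')]
    for box in data_boxes
]
print(rows)
```

This approach would not fix the unformatted model names, so the name and link lists would still need to be checked against the rows.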

In [808]:
df2

| | Years | Production | Engine | Power | Torque | 0-60 mph | Top Speed |
|---|---|---|---|---|---|---|---|
| 0 | 1948 - 1950 | 38 units | 2.0 L Colombo V12 | 110 bhp @ 6,000 rpm | | | 106 mph |
| 1 | 1950 - 1951 | 28 units | 2.3 L Colombo V12 | 130 bhp @ 6000 rpm | | | 112 mph |
| 2 | 1951 - 1952 | 82 units | 2.6 L Colombo V12 | 150 bhp @ 6500 rpm | | | 112 mph |
| 3 | 1951 - 1952 | 27 units | 2.6 L Colombo V12 | 150 bhp @ 6500 rpm | | ~9.0 seconds | 115 mph |
| 4 | 1952 | 23 units | 4.1 L Lampredi V12 | 200 bhp @ 5000 rpm | | | 116 mph |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 136 | 1984 - 1987 | 272 | 2.9L twin-turbo V8 | 394 bhp @ 7,000 rpm | 366 lb-ft @ 3,800 rpm | 4.8 seconds | 189 mph |
| 137 | 1987 - 1992 | 1315 | 2.9L twin-turbo V8 | 478 bhp @ 7000 rpm | 425 lb-ft @ 4,000 rpm | 3.8 seconds | 201 mph |
| 138 | 1995 - 1997 | 349 | 4.7L Tipo F130B V12 | 513 hp @ 8,500 rpm | 347 lb-ft @ 6,500 rpm | 3.7 seconds | 202 mph |
| 139 | 2002 - 2004 | 400 | 6.0L Tipo F140B V12 | 660 hp @ 7,800 rpm | 485 lb-ft @ 5,500 rpm | 3.1 seconds | 217 mph |


Inserting two columns, containing the model name and the link for more information:

In [810]:
df2.insert(0, "Model Name", legacy_model_names, True)
df2.insert(1, "More Info", legacy_model_links, True)

In [812]:
df2

| | Model Name | More Info | Years | Production | Engine | Power | Torque | 0-60 mph | Top Speed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Ferrari 166 Inter | https://www.supercars.net/blog/category/brand/... | 1948 - 1950 | 38 units | 2.0 L Colombo V12 | 110 bhp @ 6,000 rpm | | | 106 mph |
| 1 | Ferrari 195 Inter | https://www.supercars.net/blog/category/brand/... | 1950 - 1951 | 28 units | 2.3 L Colombo V12 | 130 bhp @ 6000 rpm | | | 112 mph |
| 2 | Ferrari 212 Inter | https://www.supercars.net/blog/category/brand/... | 1951 - 1952 | 82 units | 2.6 L Colombo V12 | 150 bhp @ 6500 rpm | | | 112 mph |
| 3 | Ferrari 212 Export | https://www.supercars.net/blog/category/brand/... | 1951 - 1952 | 27 units | 2.6 L Colombo V12 | 150 bhp @ 6500 rpm | | ~9.0 seconds | 115 mph |
| 4 | Ferrari 342 America | https://www.supercars.net/blog/category/brand/... | 1952 | 23 units | 4.1 L Lampredi V12 | 200 bhp @ 5000 rpm | | | 116 mph |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 136 | Ferrari 288 GTO | https://www.supercars.net/blog/category/brand/... | 1984 - 1987 | 272 | 2.9L twin-turbo V8 | 394 bhp @ 7,000 rpm | 366 lb-ft @ 3,800 rpm | 4.8 seconds | 189 mph |
| 137 | Ferrari F40 | https://www.supercars.net/blog/category/brand/... | 1987 - 1992 | 1315 | 2.9L twin-turbo V8 | 478 bhp @ 7000 rpm | 425 lb-ft @ 4,000 rpm | 3.8 seconds | 201 mph |
| 138 | Ferrari F50 | https://www.supercars.net/blog/category/brand/... | 1995 - 1997 | 349 | 4.7L Tipo F130B V12 | 513 hp @ 8,500 rpm | 347 lb-ft @ 6,500 rpm | 3.7 seconds | 202 mph |
| 139 | Ferrari Enzo | https://www.supercars.net/blog/category/brand/... | 2002 - 2004 | 400 | 6.0L Tipo F140B V12 | 660 hp @ 7,800 rpm | 485 lb-ft @ 5,500 rpm | 3.1 seconds | 217 mph |


Saving the DataFrame as a .csv file:

In [814]:
df2.to_csv(r'C:\Users\dawid\Python Outputs\legacy_ferrari_models.csv')
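The exported values are still strings such as '211 mph' or '110 bhp @ 6,000 rpm'. As a starting point for further cleaning, the sketch below pulls the first number out of such a string. The helper `leading_number` is hypothetical and not part of the original notebook; it assumes that a comma inside a number is a thousands separator.

```python
import re

def leading_number(text):
    """Return the first number in a spec string as a float, or None if absent."""
    # One digit, then further digits or commas, then an optional decimal part
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    # Drop thousands separators before converting
    return float(match.group().replace(',', ''))

print(leading_number('211 mph'))
print(leading_number('110 bhp @ 6,000 rpm'))
print(leading_number('≈ US$340,000'))
```

Applied to a column, for example with `df1['Top Speed'].map(leading_number)`, this would yield numeric values suitable for sorting or plotting.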