# Exploratory Data Analysis of YallaMotors Cars Dataset

This notebook performs exploratory data analysis on the YallaMotors Cars Dataset. The dataset contains information about various cars, including their specifications, prices, and other relevant features.

In [1]:
#import necessary libraries
import re
import pandas as pd
import opendatasets as od

# About this file

This dataset was scraped from the YallaMotors website using Python and Requests-HTML. It consists of approximately 6750 rows and 9 columns, making it suitable for conducting Exploratory Data Analysis and applying Machine Learning algorithms such as Linear Regression and so on.

The dataset contains the following columns:
* Car Name: The name of the car.
* Price: The price of the car.
* Engine Capacity: The engine capacity of the car.
* Cylinder: The power of the car's cylinder.
* Horse Power: The horse power of the car.
* Top Speed: The top speed of the car.
* Seats: The number of seats in the car.
* Brand: The brand of the car.
* Country: The country in which the website sells this car.

With this dataset, you can explore various aspects of the cars, analyze their features, and perform tasks like predicting the car price using machine learning algorithms.
This dataset provides a valuable resource for conducting detailed analysis and gaining insights into the car market.

In [2]:
# download the dataset (this is a Kaggle dataset)
# during download you will be required to input your Kaggle username and password
od.download("https://www.kaggle.com/datasets/mahmoudahmed6/yallamotors-cars-dataset", force=True)

Downloading yallamotors-cars-dataset.zip to .\yallamotors-cars-dataset


100%|██████████| 111k/111k [00:00<00:00, 310kB/s]







After reading the CSV file and loading it into a DataFrame, let's explore the dataset to gain initial insights into the data.

The dataset consists of cars with various attributes, including the car name, price, engine capacity, cylinder power, horse power, top speed, number of seats, brand, and country. It contains around 6,308 rows, each representing a car entry.

Let's take a look at the first few and last rows of the dataset:

In [3]:
# Read the CSV file into a DataFrame and display it
df = pd.read_csv('./yallamotors-cars-dataset/cars.csv')
df

Unnamed: 0,car name,price,engine_capacity,cylinder,horse_power,top_speed,seats,brand,country
0,Fiat 500e 2021 La Prima,TBD,0.0,"N/A, Electric",Single,Automatic,150,fiat,ksa
1,Peugeot Traveller 2021 L3 VIP,"SAR 140,575",2.0,4,180,8 Seater,8.8,peugeot,ksa
2,Suzuki Jimny 2021 1.5L Automatic,"SAR 98,785",1.5,4,102,145,4 Seater,suzuki,ksa
3,Ford Bronco 2021 2.3T Big Bend,"SAR 198,000",2.3,4,420,4 Seater,7.5,ford,ksa
4,Honda HR-V 2021 1.8 i-VTEC LX,Orangeburst Metallic,1.8,4,140,190,5 Seater,honda,ksa
...,...,...,...,...,...,...,...,...,...
6303,Bentley Mulsanne 2021 6.75L V8 Extended Wheelbase,DISCONTINUED,6.8,8,505,296,5 Seater,bentley,uae
6304,Ferrari SF90 Stradale 2021 4.0T V8 Plug-in-Hybrid,"AED 1,766,100",4.0,8,25,800,Automatic,ferrari,uae
6305,Rolls Royce Wraith 2021 6.6L Base,"AED 1,400,000",6.6,12,624,250,4 Seater,rolls-royce,uae
6306,Lamborghini Aventador S 2021 6.5L V12 Coupe,"AED 1,650,000",6.5,,740,350,2 Seater,lamborghini,uae


# Data Information
To gain a deeper understanding of the dataset, let's examine its information using the info() method. This will provide us with essential details about the DataFrame, including the column names, data types, and the number of non-null values.

In [4]:
# Display information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6308 entries, 0 to 6307
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   car name         6308 non-null   object
 1   price            6308 non-null   object
 2   engine_capacity  6308 non-null   object
 3   cylinder         5684 non-null   object
 4   horse_power      6308 non-null   object
 5   top_speed        6308 non-null   object
 6   seats            6308 non-null   object
 7   brand            6308 non-null   object
 8   country          6308 non-null   object
dtypes: object(9)
memory usage: 443.7+ KB


From the information provided, we can observe the following:

* The dataset contains 6,308 entries (rows) and 9 columns.
* The column "Price" is currently stored as an object (string) type, which may contain a combination of text and numbers.
* The "Engine Capacity", "Cylinder", "Horse Power", "Top Speed", and "Seats" columns are also stored as object types, even though they should ideally be numerical.
* There are some missing values in the "Engine Capacity", "Cylinder", "Horse Power", and "Top Speed" columns, as indicated by the non-null counts.
* The "Engine Capacity" and "Top Speed" columns have float64 data types, indicating that they already contain numerical values.

This initial data overview highlights some data quality issues and potential areas for data cleaning and preprocessing.

# Data wrangling nad basic feature engineering

To enhance the analysis and enable comparisons, we can convert the prices from various currencies to USD. The code provided performs the following steps:

* A dictionary, currency_rates, is defined to store the conversion rates from different currencies to USD.
* Two functions, extract_currency(price) and extract_price(price), are defined to extract the currency and price value from the original "Price" column, respectively.
* The function convert_to_usd(price, currency) converts the extracted price to USD based on the provided currency and conversion rate from the currency_rates dictionary.
* Three new columns are created in the DataFrame:
* "currency" stores the extracted currency values.
* "price_currency" contains the extracted price values.
* "price_USD" holds the converted price values in USD.

To facilitate further analysis and eliminate redundancy, we can drop the initial "Price" column from the DataFrame.

To further prepare the dataset for exploratory analysis, the code provided includes the following steps:

* The function extract_seats(value) is defined to extract the number of seats from the "seats" column. It searches for a pattern that matches a number followed by the word "seater" (e.g., "8 seater"). If a match is found, the function returns the extracted number as an integer. Otherwise, it returns None.
* The function is applied to the "seats" column using the apply method, and the results are stored in a new column called "seats".
* The columns 'engine_capacity', 'cylinder', 'horse_power', and 'top_speed' are converted to numeric format using the pd.to_numeric function. The errors='coerce' parameter is used to handle any non-numeric values by converting them to NaN (Not a Number).
* The info() method is then called to display information about the DataFrame, including the data types and the number of non-null values in each column.

After performing these conversions, we can ensure that the numeric columns are in the correct format for further analysis.

In [5]:
# Define currency conversion rates to USD
currency_rates = {
    'AED': 0.27,   # UAE Dirham
    'BHD': 2.65,   # Bahraini Dinar
    'EGP': 0.064,  # Egyptian Pound
    'KWD': 3.32,   # Kuwaiti Dinar
    'OMR': 2.60,   # Omani Rial
    'QAR': 0.27,   # Qatari Rial
    'SAR': 0.27    # Saudi Rial
}

# Function to extract currency from price
def extract_currency(price):
    match = re.search(r'([a-zA-Z]{3})', price)
    if match:
        currency = match.group(1)
        if currency in currency_rates:
            return currency
    return

# Function to extract price from price
def extract_price(price):
    match = re.search(r'\d+(,\d+)?', price)
    if match:
        return float(match.group(0).replace(',', ''))
    return

# Function to convert price to USD
def convert_to_usd(price, currency):
    if currency and currency != 'USD' and currency in currency_rates:
        conversion_rate = currency_rates[currency]
        return round(price * conversion_rate, 2)
    return

# Create new columns for currency and price
df['currency'] = df['price'].apply(extract_currency)
df['price_currency'] = df['price'].apply(extract_price)

# Convert price to USD
df['price_USD'] = df.apply(lambda row: convert_to_usd(row['price_currency'], row['currency']), axis=1)

# Function to extract the number of seats
def extract_seats(value):
    if isinstance(value, str):
        match = re.search(r'\b(\d+)\s*seater\b', value, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None

# Apply the function to the "seats" column
df['seats'] = df['seats'].apply(extract_seats)

# Convert 'engine_capacity' column to numeric
df['engine_capacity'] = pd.to_numeric(df['engine_capacity'], errors='coerce')

# Convert 'cylinder' column to numeric (assuming missing values should be NaN)
df['cylinder'] = pd.to_numeric(df['cylinder'], errors='coerce')

# Convert 'horse_power' column to numeric
df['horse_power'] = pd.to_numeric(df['horse_power'], errors='coerce')

# Convert 'top_speed' column to numeric
df['top_speed'] = pd.to_numeric(df['top_speed'], errors='coerce')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6308 entries, 0 to 6307
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   car name         6308 non-null   object 
 1   price            6308 non-null   object 
 2   engine_capacity  6305 non-null   float64
 3   cylinder         5574 non-null   float64
 4   horse_power      6186 non-null   float64
 5   top_speed        5875 non-null   float64
 6   seats            5789 non-null   float64
 7   brand            6308 non-null   object 
 8   country          6308 non-null   object 
 9   currency         4979 non-null   object 
 10  price_currency   4985 non-null   float64
 11  price_USD        4979 non-null   float64
dtypes: float64(7), object(5)
memory usage: 591.5+ KB


The current DataFrame information shows that after the data conversions and cleaning steps, the dataset contains 6308 rows and 12 columns. Here is a summary of the columns and their respective non-null counts and data types.
Upon examining the DataFrame, we can observe the following information:

* The column 'car name' contains 6308 non-null values, indicating that all rows have a car name entry.
* The column 'price' also contains 6308 non-null values, indicating that all rows have a price entry.
* The column 'engine_capacity' has 6305 non-null values, implying that there are three missing values in this column.
* The column 'cylinder' has 5574 non-null values, suggesting that there are 734 missing values in this column.
* The column 'horse_power' has 6186 non-null values, indicating that there are 122 missing values in this column.
* The column 'top_speed' has 5875 non-null values, implying that there are 433 missing values in this column.
* The column 'seats' has 5789 non-null values, indicating that there are 519 missing values in this column.
* The columns 'brand' and 'country' contain 6308 non-null values, indicating that there are no missing values in these columns.
* The columns 'currency', 'price_currency', and 'price_USD' have missing values, with 4979 non-null values in each column.

This information provides an overview of the missing values present in the DataFrame, which can be further analyzed and addressed in the data exploration and cleaning process.

In [6]:
# Check the number of missing values in each column
df.isnull().sum()

car name              0
price                 0
engine_capacity       3
cylinder            734
horse_power         122
top_speed           433
seats               519
brand                 0
country               0
currency           1329
price_currency     1323
price_USD          1329
dtype: int64

To deal with missing data, we have several options:

1. Dropping data:

* Dropping the whole row: This method involves removing rows that contain missing values. It can be applied when the missing values are limited to a few rows and do not significantly impact the overall dataset. In our case, we can choose to drop specific rows if they have missing values in crucial columns.
* Dropping the whole column: This method involves removing columns that have a significant number of missing values. It is suitable when a column has a large proportion of missing values and does not provide valuable information. However, in our dataset, none of the columns have a high number of missing values, so dropping entire columns is not necessary.

2. Replacing data:

* Replacing missing values with the mean:

    This method involves replacing missing values with the mean value of the respective column. It can be applied to numeric columns where the mean value is a reasonable estimate. In our code, we haven't used this method.

* Replacing missing values with the frequency: This method involves replacing missing values with the most frequent value in the respective column. It can be applied to categorical columns where the most frequent value represents a reasonable estimate. In our code, we haven't used this method.

* Replacing missing values based on other functions:

    This method involves replacing missing values using other techniques, such as interpolation, regression models, or domain-specific knowledge. The choice of replacement method depends on the specific context and the characteristics of the data. In our code, we'll use this method later.

We performed the following actions:

* Drop the column "price":

Since we have already converted the price to separate columns for currency and USD, the original "price" column becomes redundant. We dropped it using the drop() function with axis=1 to indicate a column-wise operation.

* Drop rows with missing values:

We dropped rows with missing values in the "price_USD" column because those rows do not provide useful information for our analysis.
We also dropped rows with missing values in the "seats", "engine_capacity", "cylinder", "horse_power", and "top_speed" columns. Since these are critical characteristics of the cars and we cannot reasonably estimate or replace missing values for them, it is more appropriate to remove the rows with missing data.

After performing these actions, we checked for missing values again using isnull().sum(). The output would show the number of missing values in each column. Since all the columns in the output have zero missing values, it indicates that we have successfully dropped the rows with missing data.

The rationale behind these actions is as follows:

* The "price" column becomes redundant after extracting the separate columns for currency and USD.
* Rows without price in USD are not useful for our analysis, so we can safely drop them.
* The remaining numeric characteristics (seats, engine capacity, cylinder, horse power, top speed) are essential attributes of the cars, and since we cannot reliably estimate or replace missing values for them, it is more appropriate to drop the rows with missing data.

By dropping the irrelevant column and the rows with missing values, we ensure that our dataset contains only relevant and complete information for further analysis.

In [7]:
# Drop the column "price" from the DataFrame
df.drop("price", axis=1, inplace=True)

# Drop rows with missing values in the "price_USD" column
df.dropna(subset=['price_USD'], inplace=True)

# Drop rows with missing values in the "seats" column
df.dropna(subset=['seats'], inplace=True)

# Drop rows with missing values in the "engine_capacity" column
df.dropna(subset=['engine_capacity'], inplace=True)

# Drop rows with missing values in the "cylinder" column
df.dropna(subset=['cylinder'], inplace=True)

# Drop rows with missing values in the "horse_power" column
df.dropna(subset=['horse_power'], inplace=True)

# Drop rows with missing values in the "top_speed" column
df.dropna(subset=['top_speed'], inplace=True)

df.isnull().sum()

car name           0
engine_capacity    0
cylinder           0
horse_power        0
top_speed          0
seats              0
brand              0
country            0
currency           0
price_currency     0
price_USD          0
dtype: int64

* The "count" values for all columns indicate that there are 4096 non-null entries, suggesting that the previous data cleaning steps successfully removed rows with missing values.

* The "mean" values provide an estimate of the central tendency of the data. For example, the average engine capacity is around 5.32, the average number of cylinders is approximately 281.43, and the average horsepower is about 221.74.

* The "std" values represent the standard deviation, which measures the spread or dispersion of the data around the mean. Higher standard deviation values indicate greater variability in the data. For instance, the column with the highest variability appears to be "seats" with a standard deviation of approximately 185,915.34.

* The "min" values represent the lowest values observed in each column, while the "max" values indicate the highest values. For example, the minimum engine capacity is 3.0, and the maximum is 16.0.

* The "25%", "50%", and "75%" percentiles provide information about the distribution of the data. The 25th percentile (Q1) indicates the value below which 25% of the data falls, the 50th percentile (Q2) represents the median, and the 75th percentile (Q3) indicates the value below which 75% of the data falls. These percentiles help understand the range and distribution of the data.

Overall, the descriptive statistics provide insights into the central tendency, variability, and distribution of the numeric columns in the DataFrame. We can further analyze and interpret the statistics to gain a better understanding of the car characteristics and identify any outliers or patterns within the data.
Additionally, we can consider visualizing the data using plots and charts to explore relationships and patterns within the dataset.

In [8]:
#descriptive statistics of the DataFrame
df.describe()

Unnamed: 0,engine_capacity,cylinder,horse_power,top_speed,seats,price_currency,price_USD
count,4096.0,4096.0,4096.0,4096.0,4096.0,4096.0,4096.0
mean,111.425049,5.324707,281.428223,221.741699,4.89917,142545.509033,64229.2
std,470.061157,1.891575,183.2973,42.172588,1.450473,185915.339018,64945.58
min,0.0,3.0,67.0,120.0,2.0,1000.0,66.56
25%,2.0,4.0,160.0,185.0,4.0,17500.0,24743.64
50%,2.7,4.0,246.0,215.0,5.0,67097.5,43828.04
75%,4.0,6.0,362.0,250.0,5.0,189022.5,80570.0
max,6000.0,16.0,5050.0,350.0,18.0,997500.0,1352825.0


After analyzing the initial descriptive statistics using the describe() function, several issues were identified:

* Min price in USD: The minimum price in USD was found to be 0, which is not realistic for car prices.

* Unreal engine capacity: Some of the engine capacity values were identified to be unrealistic, including extremely high values such as 6000 liters and low values like 0.1 liters (originally 100 milliliters).

To address these issues, the following actions were taken:

* Price in USD: Rows with a price in USD less than $7000 were filtered out. This helps to remove unrealistic and irrelevant price values.

* Engine Capacity: The engine capacity values greater than 100 (originally in milliliters) were updated to represent liters instead.

After applying these actions, the DataFrame has been updated to ensure more accurate and meaningful data.

In [9]:
# Update engine capacity values greater than 100 to represent liters instead of milliliters
df.loc[df['engine_capacity'] > 100, 'engine_capacity'] = (df['engine_capacity'] / 1000).round(1)

# Filter the DataFrame to keep rows with engine_capacity greater than or equal to 1
df = df.loc[df['engine_capacity'] >= 1]

# Filter the DataFrame to keep rows with price_USD greater than or equal to 7000
df = df.loc[df['price_USD'] >= 7000]

df

Unnamed: 0,car name,engine_capacity,cylinder,horse_power,top_speed,seats,brand,country,currency,price_currency,price_USD
2,Suzuki Jimny 2021 1.5L Automatic,1.5,4.0,102.0,145.0,4.0,suzuki,ksa,SAR,98785.0,26671.95
5,Honda HR-V 2021 1.8 i-VTEC EX,1.8,4.0,140.0,190.0,5.0,honda,ksa,SAR,95335.0,25740.45
8,Renault Koleos 2021 2.5L LE (4WD),2.5,4.0,170.0,199.0,5.0,renault,ksa,SAR,116900.0,31563.00
10,Suzuki Jimny 2021 1.5L M/T,1.5,4.0,102.0,145.0,4.0,suzuki,ksa,SAR,91885.0,24808.95
11,Honda HR-V 2021 1.8 i-VTEC DX,1.8,4.0,140.0,190.0,5.0,honda,ksa,SAR,72335.0,19530.45
...,...,...,...,...,...,...,...,...,...,...,...
6270,Aston Martin DB11 2021 4.0T V8 Volante,4.0,8.0,503.0,322.0,4.0,aston-martin,uae,AED,945384.0,255253.68
6271,BMW M8 Convertible 2021 4.4T V8 Competition xD...,4.4,8.0,625.0,250.0,4.0,bmw,uae,AED,930300.0,251181.00
6273,Mercedes-Benz S Class Cabriolet 2021 S 65,6.0,12.0,630.0,250.0,4.0,mercedes-benz,uae,AED,980000.0,264600.00
6275,BMW M8 Coupe 2021 4.4T V8 Competition xDrive (...,4.4,8.0,625.0,250.0,4.0,bmw,uae,AED,905900.0,244593.00


Overall, the DataFrame has been refined, and the columns now contain meaningful and cleaned data without missing values.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3985 entries, 2 to 6277
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   car name         3985 non-null   object 
 1   engine_capacity  3985 non-null   float64
 2   cylinder         3985 non-null   float64
 3   horse_power      3985 non-null   float64
 4   top_speed        3985 non-null   float64
 5   seats            3985 non-null   float64
 6   brand            3985 non-null   object 
 7   country          3985 non-null   object 
 8   currency         3985 non-null   object 
 9   price_currency   3985 non-null   float64
 10  price_USD        3985 non-null   float64
dtypes: float64(7), object(4)
memory usage: 373.6+ KB


The data has undergone several modifications to address missing values, incorrect units, and unrealistic values.

In [11]:
df.describe()

Unnamed: 0,engine_capacity,cylinder,horse_power,top_speed,seats,price_currency,price_USD
count,3985.0,3985.0,3985.0,3985.0,3985.0,3985.0,3985.0
mean,2.798695,5.251192,273.258218,220.163614,4.929987,146029.246675,65864.76
std,1.282833,1.789188,143.585508,40.95896,1.443329,186791.351522,64964.36
min,1.0,3.0,67.0,120.0,2.0,2899.0,7800.0
25%,2.0,4.0,156.0,185.0,4.0,18900.0,26000.0
50%,2.5,4.0,245.0,211.0,5.0,70999.0,45440.0
75%,3.5,6.0,355.0,250.0,5.0,194535.0,82417.5
max,6.8,12.0,800.0,350.0,18.0,997500.0,1352825.0


After checking for duplicates in the 'car name' column, it appears that there are multiple entries with the same car name. However, upon further analysis, it can be observed that these duplicates correspond to different countries. Each car name is associated with a specific country, and the duplicates arise from having the same car model available in multiple countries.

For example, the car model "Peugeot 5008 2021 1.6T Active" appears 7 times, but each entry corresponds to a different country. The same pattern is observed for other car models as well.

Therefore, having duplicates in the 'car name' column is acceptable in this dataset since the duplicates represent the same car model available in different countries.

In [12]:
#check duplicates
df[df.duplicated(subset='car name', keep=False)]['car name'].value_counts()

Peugeot 5008 2021 1.6T Active                  7
Suzuki Ertiga 2021 1.5L GLX                    7
Mini Hatch 2021 5-Door Cooper                  7
Mini Hatch 2021 3-Door Cooper S                6
Mercedes-Benz GLA 2021 250 4MATIC              6
                                              ..
Citroen C3 Aircross 2021 1.2T Feel             2
Citroen C3 Aircross 2021 1.2T Live             2
Citroen C3 Aircross 2021 1.2T Shine            2
BMW X2 2021 sDrive20i (M sport X package)      2
Changan CS95 2022 2.0T Royal (7-Seater) AWD    2
Name: car name, Length: 935, dtype: int64

The wrangling is done.
The cleaned DataFrame has been saved to the CSV file 'wrangled_cars.csv' without including the index column.

In [13]:
# Save the DataFrame to a CSV file without including the index column
df.to_csv('wrangled_cars.csv', index=False)
df

Unnamed: 0,car name,engine_capacity,cylinder,horse_power,top_speed,seats,brand,country,currency,price_currency,price_USD
2,Suzuki Jimny 2021 1.5L Automatic,1.5,4.0,102.0,145.0,4.0,suzuki,ksa,SAR,98785.0,26671.95
5,Honda HR-V 2021 1.8 i-VTEC EX,1.8,4.0,140.0,190.0,5.0,honda,ksa,SAR,95335.0,25740.45
8,Renault Koleos 2021 2.5L LE (4WD),2.5,4.0,170.0,199.0,5.0,renault,ksa,SAR,116900.0,31563.00
10,Suzuki Jimny 2021 1.5L M/T,1.5,4.0,102.0,145.0,4.0,suzuki,ksa,SAR,91885.0,24808.95
11,Honda HR-V 2021 1.8 i-VTEC DX,1.8,4.0,140.0,190.0,5.0,honda,ksa,SAR,72335.0,19530.45
...,...,...,...,...,...,...,...,...,...,...,...
6270,Aston Martin DB11 2021 4.0T V8 Volante,4.0,8.0,503.0,322.0,4.0,aston-martin,uae,AED,945384.0,255253.68
6271,BMW M8 Convertible 2021 4.4T V8 Competition xD...,4.4,8.0,625.0,250.0,4.0,bmw,uae,AED,930300.0,251181.00
6273,Mercedes-Benz S Class Cabriolet 2021 S 65,6.0,12.0,630.0,250.0,4.0,mercedes-benz,uae,AED,980000.0,264600.00
6275,BMW M8 Coupe 2021 4.4T V8 Competition xDrive (...,4.4,8.0,625.0,250.0,4.0,bmw,uae,AED,905900.0,244593.00


# Summary

We inspected the dataset and found that it had 6,277 rows and 11 columns. We noticed missing values in some columns and decided to drop those rows to ensure data integrity. Afterward, we proceeded with data type conversions to ensure the appropriate data types for each column.

During the cleaning process, we identified and handled duplicates in the dataset, specifically focusing on the "car name" column. We discovered several cars with the same name but different specifications, likely due to different models or years. Since the quantity of duplicates for each car name did not exceed 7, which corresponds to the number of countries represented in the dataset, we concluded that this duplication was acceptable.

Additionally, we performed some basic descriptive statistics on the dataset, providing insights into the distribution and range of values for numeric columns. This allowed us to get a better understanding of the dataset's characteristics.

Finally, we saved the cleaned dataset to a CSV file named 'wrangled_cars.csv', excluding the index column to maintain a clean and organized structure.

In summary, our work involved cleaning and wrangling the car dataset, which consisted of removing missing values, converting data types, handling duplicates, and performing basic descriptive statistics. The resulting dataset is now ready for further analysis and exploration of various factors related to car specifications, brands, countries, and prices.