# Exploring the Dataset and Identifying Its Issues

In [1]:
%config IPCompleter.use_jedi = False 
%config Completer.evaluation = 'limited'
import warnings
warnings.filterwarnings('ignore')

## Importing Libraries

In [None]:
# If you haven't installed these libraries, run this cell.
!pip install -q numpy pandas seaborn scikit-learn matplotlib

In [2]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 

## Load the Dataset

In [None]:
df = pd.read_csv('../Data/quikr_car.csv')

In [4]:
df.head()

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,"45,000 kms",Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,40 kms,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,Ask For Price,"22,000 kms",Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000,"28,000 kms",Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000,"36,000 kms",Diesel


## Column Info
- `name` -> model name of the car.
- `company` -> manufacturer of the car.
- `year` -> year of the car (e.g., `2007`).
- `Price` -> price of the car (**target column**).
- `kms_driven` -> total distance the car has been driven, in kilometers.
- `fuel_type` -> fuel type of the car.

# 🧾 Issues in Dataset

## ✅ Completeness Issues

1. `name`, `company`, `year`, and `Price` have **no** null values (**Good**).
2. **Missing values** are present in the following columns (counts derived from `df.info()` on the 892 raw rows):
   - `kms_driven` (52 missing)
   - `fuel_type` (55 missing)
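
A per-column null count is one quick way to confirm which columns are incomplete. The sketch below runs on a tiny hand-made frame named `sample` (a hypothetical stand-in, not the real dataset), so the counts it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; NaN marks a missing cell.
sample = pd.DataFrame({
    "name": ["Car A", "Car B", "Car C"],
    "kms_driven": ["45,000 kms", np.nan, "22,000 kms"],
    "fuel_type": ["Petrol", "Diesel", np.nan],
})

# Count nulls per column, then keep only the columns that have any.
missing = sample.isna().sum()
missing = missing[missing > 0]
print(missing)
```

Running the same two lines on `df` should list only `kms_driven` and `fuel_type`, matching the non-null counts that `df.info()` reports.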

---

## 🧪 Quality Issues

### 🔢 Data Type Mismatches
- `year` is of type `object` (string) but should be **integer**.
- `Price` is of type `object` (string) but should be **float**.
- `kms_driven` is of type `object` (string) but should be **integer** or **float**.
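
To see why a column was read as `object`, it can help to list the entries that fail numeric parsing. This is a minimal sketch: `non_numeric_values` is a hypothetical helper name, and the sample values are a handful of `year` entries taken from this notebook.

```python
import pandas as pd

def non_numeric_values(series: pd.Series) -> list:
    """Return the distinct non-null entries that cannot be parsed as numbers."""
    # errors="coerce" turns unparseable entries into NaN instead of raising.
    parsed = pd.to_numeric(series, errors="coerce")
    # Keep entries that were present originally but failed to parse.
    mask = parsed.isna() & series.notna()
    return sorted(series[mask].unique())

year = pd.Series(["2007", "2006", "150k", "TOUR", "2018", "Zest"])
print(non_numeric_values(year))  # ['150k', 'TOUR', 'Zest']
```

Values with commas or unit suffixes (as in `Price` and `kms_driven`) would also show up here, since `pd.to_numeric` does not strip them.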

### 🗑️ Garbage or Irrelevant Entries
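- The `company` column contains entries that are not manufacturer names, such as `'I'`, `'i'`, `'selling'`, `'Sale'`, `'sell'`, `'Any'`, `'7'`, `'9'`, `'all'`, `'scratch'`, `'urgent'`. Casing is also inconsistent (e.g., `'MARUTI'`, `'tata'`, `'TATA'`).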
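
One possible way to handle these entries is to drop rows whose `company` is a known garbage token and then unify the casing. The sketch below assumes a hypothetical `GARBAGE` set copied from the list above and a small illustrative frame; it is not necessarily the cleaning applied later.

```python
import pandas as pd

# Garbage tokens copied from the list above (assumption: extend as needed).
GARBAGE = {"I", "i", "selling", "Sale", "sell", "Any",
           "7", "9", "all", "scratch", "urgent"}

sample = pd.DataFrame(
    {"company": ["Hyundai", "I", "MARUTI", "tata", "selling", "Ford"]}
)

# Drop rows whose company is a known garbage token.
clean = sample[~sample["company"].isin(GARBAGE)].copy()

# Unify casing so 'MARUTI' and 'tata' become 'Maruti' and 'Tata'.
clean["company"] = clean["company"].str.title()
print(clean["company"].tolist())  # ['Hyundai', 'Maruti', 'Tata', 'Ford']
```

Note that `str.title()` would also turn `'BMW'` into `'Bmw'`, so an explicit mapping may be preferable for acronyms.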


### ❌ Invalid Values in `year` Column
- Examples of invalid values: `'150k'`, `'TOUR'`, `'r 15'`, `'Zest'`, `'/-Rs'`, `'2 bs'`, `'arry'`, `'Eon'`, `'o...'`, `'ture'`, `'emi'`, `'car'`, `'able'`, `'no.'`, `'d...'`, `'SALE'`, `'digo'`, `'sell'`, `'d Ex'`, `'n...'`, `'e...'`, `'D...'`, `', Ac'`, `'go .'`, `'k...'`, `'o c4'`, `'zire'`, `'cent'`, `'Sumo'`, `'cab'`, `'t xe'`, `'EV2'`, `'r...'`, `'zest'`
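
A simple filter for these values, sketched here on a few entries from the list, is to keep only the rows whose `year` consists entirely of digits and then cast to integer. The frame `sample` is illustrative; the approach assumes the column has no nulls, which `df.info()` confirms for `year`.

```python
import pandas as pd

sample = pd.DataFrame({"year": ["2007", "150k", "2014", "TOUR", "r 15", "2018"]})

# Keep rows whose year string is all digits, then convert to int.
valid = sample[sample["year"].str.isnumeric()].copy()
valid["year"] = valid["year"].astype(int)
print(valid["year"].tolist())  # [2007, 2014, 2018]
```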

### 💵 Price Column Issues
- Needs to be converted to **numeric**.
- Contains **commas** (`,`) in values.
- Reports no nulls, but a price is effectively **missing** wherever the column holds the non-numeric placeholder `'Ask For Price'` (e.g., row 2).
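
The points above suggest a three-step cleanup: drop the `'Ask For Price'` rows, strip the commas, and convert. The sketch below uses a small hypothetical frame; the comma-formatted strings are assumed examples of how raw values may look.

```python
import pandas as pd

# Hypothetical raw values (assumption: commas as thousands separators).
sample = pd.DataFrame({"Price": ["80,000", "4,25,000", "Ask For Price", "3,25,000"]})

# 1. Drop rows holding the non-numeric placeholder.
priced = sample[sample["Price"] != "Ask For Price"].copy()

# 2. Remove commas, 3. convert to integer.
priced["Price"] = priced["Price"].str.replace(",", "", regex=False).astype(int)
print(priced["Price"].tolist())  # [80000, 425000, 325000]
```

Swap `astype(int)` for `astype(float)` if a float target is preferred.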

### 🚗 `kms_driven` Column Issues
- Contains the unit suffix `'kms'` and thousands-separator commas (e.g., `"54,000 kms"`).
- Has **missing values**.
- Needs conversion to **numeric**.
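
A minimal sketch of that conversion, using the first few `kms_driven` values from `df.head()` plus one assumed missing entry: strip the `' kms'` suffix and the commas, then parse. Missing cells stay `NaN`, so the resulting column is float.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({"kms_driven": ["45,000 kms", "40 kms", np.nan, "22,000 kms"]})

# Remove the unit suffix and the thousands separators.
cleaned = (
    sample["kms_driven"]
    .str.replace(" kms", "", regex=False)
    .str.replace(",", "", regex=False)
)

# Parse to numbers; anything unparseable (including NaN) becomes NaN.
sample["kms_driven"] = pd.to_numeric(cleaned, errors="coerce")
print(sample["kms_driven"])
```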

### ⛽ `fuel_type` Column
- Has **missing values**.
- No invalid values present (**Clean aside from nulls**).
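
To make the nulls in `fuel_type` visible, `value_counts(dropna=False)` includes `NaN` in the tally. The sketch below uses an illustrative frame and shows dropping the incomplete rows as one option; whether to drop or impute is a choice this section does not settle.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({"fuel_type": ["Petrol", "Diesel", np.nan, "LPG", "Petrol"]})

# Include NaN in the tally so missing entries are visible.
counts = sample["fuel_type"].value_counts(dropna=False)
print(counts)

# One option: drop the rows with no fuel type recorded.
with_fuel = sample.dropna(subset=["fuel_type"])
print(len(with_fuel))  # 4
```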

In [5]:
df.shape 

(892, 6)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        892 non-null    object
 1   company     892 non-null    object
 2   year        892 non-null    object
 3   Price       892 non-null    object
 4   kms_driven  840 non-null    object
 5   fuel_type   837 non-null    object
dtypes: object(6)
memory usage: 41.9+ KB


In [12]:
df.duplicated().sum()

np.int64(94)

In [13]:
# Let's inspect the duplicate rows
df[df.duplicated()]

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
14,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,"45,000 kms",Petrol
15,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,40 kms,Diesel
20,Mahindra Scorpio S10,Mahindra,2016,350000,"43,000 kms",Diesel
24,Hyundai i20 Sportz 1.2,Hyundai,2012,100000,"55,000 kms",Petrol
25,Hyundai i20 Sportz 1.2,Hyundai,2012,100000,"55,000 kms",Petrol
...,...,...,...,...,...,...
626,Tata Sumo Gold EX BS IV,Tata,2012,210000,"75,000 kms",Diesel
641,Maruti Suzuki Swift VDi BS IV,Maruti,2012,280000,"48,006 kms",Diesel
727,Mahindra Scorpio S4,Mahindra,2015,865000,"30,000 kms",Diesel
861,Hyundai Getz Prime 1.3 GLX,Hyundai,2009,115000,"20,000 kms",Petrol


In [14]:
# drop the duplicate rows
cars = df.drop_duplicates()

In [15]:
cars.shape

(798, 6)

**After dropping the 94 duplicated rows, 798 rows remain.**

### Checking each column individually

In [8]:
# df['name'].unique()

In [17]:
cars['company'].unique()

array(['Hyundai', 'Mahindra', 'Maruti', 'Ford', 'Skoda', 'Audi', 'Toyota',
       'Renault', 'Honda', 'Datsun', 'Mitsubishi', 'Tata', 'Volkswagen',
       'I', 'Chevrolet', 'Mini', 'BMW', 'Nissan', 'Hindustan', 'Fiat',
       'Commercial', 'MARUTI', 'Force', 'Mercedes', 'Land', 'Yamaha',
       'selling', 'URJENT', 'Swift', 'Used', 'Jaguar', 'Jeep', 'tata',
       'Sale', 'very', 'Volvo', 'i', '2012', 'Well', 'all', '7', '9',
       'scratch', 'urgent', 'sell', 'TATA', 'Any', 'Tara'], dtype=object)

In [18]:
cars['year'].unique()

array(['2007', '2006', '2018', '2014', '2015', '2012', '2013', '2016',
       '2010', '2017', '2008', '2011', '2019', '2009', '2005', '2000',
       '...', '150k', 'TOUR', '2003', 'r 15', '2004', 'Zest', '/-Rs',
       'sale', '1995', 'ara)', '2002', 'SELL', '2001', 'tion', 'odel',
       '2 bs', 'arry', 'Eon', 'o...', 'ture', 'emi', 'car', 'able', 'no.',
       'd...', 'SALE', 'digo', 'sell', 'd Ex', 'n...', 'e...', 'D...',
       ', Ac', 'go .', 'k...', 'o c4', 'zire', 'cent', 'Sumo', 'cab',
       't xe', 'EV2', 'r...', 'zest'], dtype=object)

In [22]:
# cars['Price'].unique() 

In [23]:
cars.columns

Index(['name', 'company', 'year', 'Price', 'kms_driven', 'fuel_type'], dtype='object')

In [25]:
# cars['kms_driven'].unique()

In [26]:
cars['fuel_type'].unique()

array(['Petrol', 'Diesel', nan, 'LPG'], dtype=object)