
# Loading the Dataset

In [30]:
import pandas as pd
import numpy as np
from datetime import datetime

# Load dataset
df = pd.read_csv('/content/used_cars_data.csv')
df.head()


| | Unnamed: 0 | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 kmpl | 1582 CC | 126.2 bhp | 5.0 | NaN | 12.5 |
| 1 | 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 13 km/kg | 1199 CC | 88.7 bhp | 5.0 | 8.61 Lakh | 4.5 |
| 2 | 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 kmpl | 1248 CC | 88.76 bhp | 7.0 | NaN | 6.0 |
| 3 | 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 kmpl | 1968 CC | 140.8 bhp | 5.0 | NaN | 17.74 |
| 4 | 6 | Nissan Micra Diesel XV | Jaipur | 2013 | 86999 | Diesel | Manual | First | 23.08 kmpl | 1461 CC | 63.1 bhp | 5.0 | NaN | 3.5 |


# A) Handling missing values & B) Removing Units from Columns

In [31]:
print(" Missing values before handling \n", df.isnull().sum())

 Missing values before handling 
 Unnamed: 0              0
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Mileage                 2
Engine                 36
Power                  36
Seats                  38
New_Price            5032
Price                   0
dtype: int64


In [32]:
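# Extract the leading number from strings such as '19.67 kmpl' or '1582 CC'
def extract_number(value):
    try:
        return float(str(value).split()[0].replace(',', ''))
    except (ValueError, IndexError):
        # Unparseable or empty input becomes a missing value
        return np.nan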

# Clean numeric fields
for col in ['Mileage', 'Engine', 'Power']:
    df[col] = df[col].apply(extract_number)

# Strip the 'Lakh'/'Cr' suffix from New_Price; note that 'Cr' (crore) values are not rescaled to lakh here
df['New_Price'] = df['New_Price'].str.replace('Lakh', '').str.replace('Cr', '')
df['New_Price'] = pd.to_numeric(df['New_Price'], errors='coerce')


# Impute numerical columns with median
num_cols = ['Mileage', 'Engine', 'Power', 'Seats', 'New_Price']
for col in num_cols:
    df[col] = df[col].fillna(df[col].median())

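# Impute categorical with mode (Location has no missing values above, so this is a no-op;
# the chained inplace call is what triggers the pandas warning shown after the output)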
df['Location'].fillna(df['Location'].mode()[0], inplace=True)

# Show missing after
print("\n Missing values after handling :\n", df.isnull().sum())



 Missing values after handling :
 Unnamed: 0           0
Name                 0
Location             0
Year                 0
Kilometers_Driven    0
Fuel_Type            0
Transmission         0
Owner_Type           0
Mileage              0
Engine               0
Power                0
Seats                0
New_Price            0
Price                0
dtype: int64


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Location'].fillna(df['Location'].mode()[0], inplace=True)
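
The warning above recommends assigning the result back to the column instead of calling `fillna` with `inplace=True` on a column selection. A minimal sketch of that form follows; the `toy` frame and its values are hypothetical stand-ins, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({"Location": ["Pune", None, "Pune", "Chennai"]})

# Assign the filled column back rather than using a chained inplace call
most_common = toy["Location"].mode()[0]
toy["Location"] = toy["Location"].fillna(most_common)

print(toy["Location"].tolist())
```

Written this way, the assignment targets the original DataFrame directly, so it should not depend on whether the column selection behaves as a view or a copy.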


The median was used to impute the numerical columns (Mileage, Engine, Power, Seats, and New_Price) because:

* It is robust to outliers.
* It represents the center of a skewed distribution better than the mean.
* It fills gaps with a typical observed value rather than one pulled toward the extremes.

The mode suits categorical columns such as Location, Fuel_Type, and Transmission because:

* It picks the most common value.
* It works on non-numeric data.
* It never introduces a category that is not already present in the column.

The mean was not used because:

* It is pulled toward outliers.
* In a skewed distribution it does not reflect a typical value.
* It can therefore produce biased fill-in values for missing entries.
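
To illustrate the difference between the two statistics, here is a small sketch with made-up prices (hypothetical values, not taken from the dataset) in which a single extreme entry shifts the mean but leaves the median unchanged.

```python
import pandas as pd

# Hypothetical prices in lakh; the last value is an extreme outlier
prices = pd.Series([3.5, 4.5, 6.0, 12.5, 160.0])

print("mean:  ", prices.mean())
print("median:", prices.median())

# The same series without the outlier, for comparison
print("median without outlier:", prices.iloc[:-1].median())
```

In this toy series the mean ends up larger than four of the five values, while the median stays among the typical prices.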

B) Columns after removing units:

In [33]:

print(df[['Mileage', 'Engine', 'Power', 'New_Price']].head())


   Mileage  Engine   Power  New_Price
0    19.67  1582.0  126.20      11.48
1    13.00  1199.0   88.70       8.61
2    20.77  1248.0   88.76      11.48
3    15.20  1968.0  140.80      11.48
4    23.08  1461.0   63.10      11.48
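
The cleaning cell strips the 'Lakh' and 'Cr' suffixes from New_Price without rescaling, so a crore value would end up on the same scale as a lakh value. One possible way to put both on a lakh scale is sketched below. The helper name `new_price_to_lakh`, the sample strings, and the choice to treat a value without a unit as lakh are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def new_price_to_lakh(value):
    """Convert strings like '8.61 Lakh' or '1.28 Cr' to a float in lakh."""
    if pd.isna(value):
        return np.nan
    parts = str(value).split()
    try:
        amount = float(parts[0].replace(",", ""))
    except (ValueError, IndexError):
        return np.nan
    # Assumption: a value with no unit is already in lakh
    unit = parts[1].lower() if len(parts) > 1 else "lakh"
    # 1 crore = 100 lakh
    return amount * 100 if unit.startswith("cr") else amount

# Hypothetical sample values
raw = pd.Series(["8.61 Lakh", "1.28 Cr", None])
converted = raw.apply(new_price_to_lakh)
print(converted.tolist())
```

Applying such a helper before the median imputation would likely change the imputed New_Price value, so the outputs shown in this notebook would not carry over unchanged.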


# C) One-hot Encoding

In [34]:
# One-hot encode
df = pd.get_dummies(df, columns=['Fuel_Type', 'Transmission'], drop_first=False)

# Show the resulting dummy columns
print([col for col in df.columns if 'Fuel_Type' in col or 'Transmission' in col])


['Fuel_Type_Diesel', 'Fuel_Type_Electric', 'Fuel_Type_Petrol', 'Transmission_Automatic', 'Transmission_Manual']
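
After one-hot encoding, the original Fuel_Type column no longer exists, which matters for any later step that groups by fuel type. The sketch below shows one way to rebuild a single label column from the dummy columns. The `toy` frame and its prices are hypothetical, and `idxmax` is just one option among several.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Fuel_Type": ["Diesel", "Petrol", "Diesel", "Electric"],
                    "Price": [12.5, 4.5, 6.0, 13.0]})
encoded = pd.get_dummies(toy, columns=["Fuel_Type"])

fuel_cols = [c for c in encoded.columns if c.startswith("Fuel_Type_")]
# idxmax(axis=1) returns, per row, the name of the dummy column that is set
encoded["Fuel_Type"] = (encoded[fuel_cols].astype(int).idxmax(axis=1)
                        .str.replace("Fuel_Type_", "", regex=False))

# Group on the recovered labels so every fuel type gets its own row
summary = encoded.groupby("Fuel_Type")["Price"].mean()
print(summary)
```

Alternatively, keeping a copy of the original column before calling `pd.get_dummies` avoids the reconstruction step altogether.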


# D) Adding New Feature

In [35]:
# Car age in years, relative to the year the notebook is run (2025 for the output below)
df['Car_Age'] = datetime.now().year - df['Year']
print(df[['Year', 'Car_Age']].head())


   Year  Car_Age
0  2015       10
1  2011       14
2  2012       13
3  2013       12
4  2013       12
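
Because Car_Age is computed from the current date, rerunning the notebook in a later year gives different values. If reproducible output matters, a fixed reference year is one option, sketched here. `REFERENCE_YEAR` is a hypothetical constant, set to 2025 only because that is the year implied by the output above.

```python
import pandas as pd

# Assumed reference year, chosen to match the output shown above
REFERENCE_YEAR = 2025

# Hypothetical toy frame with the first three Year values from the output
toy = pd.DataFrame({"Year": [2015, 2011, 2012]})
toy["Car_Age"] = REFERENCE_YEAR - toy["Year"]
print(toy)
```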


# E) Data Wrangling Tasks

In [36]:
# Select & Rename
selected_df = df[['Name', 'Location', 'Year', 'Mileage', 'Engine', 'Power', 'Price']].copy()
selected_df.rename(columns={'Name': 'Car_Name'}, inplace=True)
print(" Selected & Renamed Columns:\n", selected_df.head())

# Filter
filtered_df = selected_df[selected_df['Price'] > 10]
print("\n Cars priced above 10 lakhs:\n", filtered_df.head())

# Mutate: mileage (kmpl or km/kg) per cc of engine displacement
df['km_per_cc'] = df['Mileage'] / df['Engine']
print("\n New column 'km_per_cc' (Mileage / Engine):\n", df[['Mileage', 'Engine', 'km_per_cc']].head())

# Arrange
arranged_df = df.sort_values(by='Price', ascending=False)
print("\n Top 5 most expensive cars:\n", arranged_df[['Name', 'Price']].head())

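# Summarize with group by. Fuel_Type was one-hot encoded in step C, so this groups on a
# single dummy column (petrol vs. all other fuels) rather than on each fuel type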
fuel_column = 'Fuel_Type_Petrol' if 'Fuel_Type_Petrol' in df.columns else df.columns[df.columns.str.startswith('Fuel_Type')][0]
summary = df.groupby(fuel_column).agg({
    'Price': ['mean', 'max'],
    'Mileage': 'mean',
    'Power': 'mean'
}).round(2)
print("\n Grouped summary by fuel type:\n", summary)


 Selected & Renamed Columns:
                            Car_Name    Location  Year  Mileage  Engine  \
0  Hyundai Creta 1.6 CRDi SX Option        Pune  2015    19.67  1582.0   
1                      Honda Jazz V     Chennai  2011    13.00  1199.0   
2                 Maruti Ertiga VDI     Chennai  2012    20.77  1248.0   
3   Audi A4 New 2.0 TDI Multitronic  Coimbatore  2013    15.20  1968.0   
4            Nissan Micra Diesel XV      Jaipur  2013    23.08  1461.0   

    Power  Price  
0  126.20  12.50  
1   88.70   4.50  
2   88.76   6.00  
3  140.80  17.74  
4   63.10   3.50  

 Cars priced above 10 lakhs:
                              Car_Name    Location  Year  Mileage  Engine  \
0    Hyundai Creta 1.6 CRDi SX Option        Pune  2015    19.67  1582.0   
3     Audi A4 New 2.0 TDI Multitronic  Coimbatore  2013    15.20  1968.0   
5   Toyota Innova Crysta 2.8 GX AT 8S      Mumbai  2016    11.36  2755.0   
11   Land Rover Range Rover 2.2L Pure       Delhi  2014    12.70  2179.0   
