Step a: Handling Missing Values

In [3]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/train.csv')

# Check for missing values
missing_values = df.isnull().sum()

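# Impute missing values
# Note: columns that still carry units (Mileage, Engine, Power, New_Price)
# are read as object at this point, so they are filled with the mode.
for column in df.columns:
    if df[column].dtype == 'object':
        # Impute categorical columns with mode
        df[column] = df[column].fillna(df[column].mode()[0])
    else:
        # Impute numerical columns with mean
        df[column] = df[column].fillna(df[column].mean())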

# Verify that there are no missing values
print(df.isnull().sum())


Unnamed: 0           0
Name                 0
Location             0
Year                 0
Kilometers_Driven    0
Fuel_Type            0
Transmission         0
Owner_Type           0
Mileage              0
Engine               0
Power                0
Seats                0
New_Price            0
Price                0
dtype: int64



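The same imputation can also be expressed as a single DataFrame-level fillna call with a dict of per-column fill values, which avoids calling an inplace method on a column selection. The sketch below is only an illustration: the toy frame and the helper name build_fill_values are made up here and are not part of the notebook. It uses pd.api.types.is_numeric_dtype rather than comparing against 'object', on the assumption that any non-numeric column should be treated as categorical.

```python
import pandas as pd

# Toy frame standing in for train.csv (values are invented for illustration).
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", None, "Diesel", "Petrol"],
    "Seats": [5.0, None, 7.0, 5.0],
})

def build_fill_values(frame):
    """Hypothetical helper: mean for numeric columns, mode for everything else."""
    fill_values = {}
    for column in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[column]):
            fill_values[column] = frame[column].mean()
        else:
            fill_values[column] = frame[column].mode()[0]
    return fill_values

# One fillna call on the whole frame; the result is a new DataFrame.
toy_filled = toy.fillna(build_fill_values(toy))
print(toy_filled.isnull().sum())
```

Because fillna returns a new frame here, the original toy frame is left untouched, which makes it easy to compare the data before and after imputation.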

Step b: Removing Units from Attributes

In [4]:
# Remove units from specified columns
df['Mileage'] = df['Mileage'].astype(str).str.replace(' kmpl', '').str.replace(' km/kg', '').astype(float)
df['Engine'] = df['Engine'].astype(str).str.replace(' CC', '').astype(float)
df['Power'] = df['Power'].astype(str).str.replace(' bhp', '').astype(float)

# For New_Price, handle both Lakh and Crore
new_price = df['New_Price'].astype(str)
# Flag Crore values before the unit is stripped; afterwards 'Cr' can no longer be found
mask = new_price.str.contains('Cr', regex=False)
df['New_Price'] = new_price.str.replace(' Cr', '', regex=False).str.replace(' Lakh', '', regex=False).astype(float)
# Convert Crore values to Lakh (1 Crore = 100 Lakh)
df.loc[mask, 'New_Price'] *= 100

# Verify the changes
print(df[['Mileage', 'Engine', 'Power', 'New_Price']].head())

   Mileage  Engine   Power  New_Price
0    19.67  1582.0  126.20       4.78
1    13.00  1199.0   88.70       8.61
2    20.77  1248.0   88.76       4.78
3    15.20  1968.0  140.80       4.78
4    23.08  1461.0   63.10       4.78
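
As a cross-check on the Lakh/Crore handling, the conversion can be written as a small per-value parser. This is a minimal sketch, assuming New_Price strings look like '8.61 Lakh' or '1.5 Cr' (a number, a space, a unit); the function name price_to_lakh and the sample values are hypothetical and not taken from the dataset.

```python
import pandas as pd

def price_to_lakh(text):
    """Hypothetical parser: '<number> Lakh' or '<number> Cr' -> float in Lakh."""
    parts = str(text).split()
    if len(parts) != 2:
        return float("nan")  # unexpected layout, leave as missing
    value, unit = parts
    try:
        number = float(value)
    except ValueError:
        return float("nan")  # numeric part could not be parsed
    if unit == "Cr":
        return number * 100  # 1 Crore = 100 Lakh
    if unit == "Lakh":
        return number
    return float("nan")  # unknown unit

# Invented sample values covering both units and one malformed entry.
sample = pd.Series(["8.61 Lakh", "1.5 Cr", "4.78 Lakh", "unknown"])
converted = sample.map(price_to_lakh)
print(converted)
```

Returning NaN for anything that does not match the assumed layout keeps malformed entries visible instead of raising in the middle of the conversion.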
