# ML module 2
Q1. Read the Bike Details dataset into a Pandas DataFrame and display its
first 10 rows.

ans-  
###
%pip install gdown -q

import gdown
import pandas as pd

file_id = '1iKy23bMtEQShF_weneRNnYrFmzvpPOI3'
output_filename = 'Bike Details Dataset.csv'

gdown.download(id=file_id, output=output_filename, quiet=False)

df = pd.read_csv(output_filename)


print('First 10 rows of the DataFrame:')
display(df.head(10))


print('\nShape of the DataFrame:')
print(df.shape)


print('\nColumn names of the DataFrame:')
print(df.columns.tolist())
###

Q2. Check for missing values in all columns and describe your approach for
handling them.

ans-  

###
missing_values = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100

missing_df = pd.DataFrame({'Missing Count': missing_values, 'Missing Percentage (%)': missing_percentage})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values(by='Missing Count', ascending=False)

print('Missing Values in Each Column:')
display(missing_df)
###
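The question also asks for a handling approach. One possible approach is sketched below on a small made-up frame, not on the Bike Details file: drop columns that are mostly empty, fill numeric gaps with the median, and fill categorical gaps with the most frequent value. The helper name `handle_missing` and the 50% drop threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def handle_missing(frame, drop_threshold=0.5):
    """Return a copy with missing values handled (illustrative helper, not from the original cells)."""
    result = frame.copy()

    # Drop columns whose share of missing values exceeds the threshold.
    missing_share = result.isnull().mean()
    result = result.drop(columns=missing_share[missing_share > drop_threshold].index)

    for column in result.columns:
        if not result[column].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(result[column]):
            # The median is less sensitive to extreme values than the mean.
            result[column] = result[column].fillna(result[column].median())
        else:
            # Use the most frequent category for non-numeric columns.
            result[column] = result[column].fillna(result[column].mode().iloc[0])
    return result


# Toy data (made up) with one gap per column and one mostly empty column.
toy = pd.DataFrame({
    'km_driven': [1000.0, np.nan, 3000.0, 5000.0],
    'seller_type': ['Individual', 'Dealer', None, 'Individual'],
    'mostly_empty': [np.nan, np.nan, np.nan, 1.0],
})
toy_filled = handle_missing(toy)
print(toy_filled)
```

Whether dropping or imputing is the better choice depends on the percentages shown in `missing_df`, so the threshold should be picked after looking at that table.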

Q3. Plot the distribution of selling prices using a histogram and describe the
overall trend.

ans-
###
import matplotlib.pyplot as plt
import seaborn as sns


if 'Selling_Price' in df.columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(df['Selling_Price'], bins=30, kde=True, color='skyblue')
    plt.title('Distribution of Selling Prices')
    plt.xlabel('Selling Price')
    plt.ylabel('Frequency')
    plt.grid(axis='y', alpha=0.75)
    plt.show()
else:
    print("Error: 'Selling_Price' column not found in the DataFrame. Please check the column names.")
    print("Available columns:", df.columns.tolist())
###
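The column check above is case-sensitive, and the question text in Q7 spells the column as selling_price, so the casing in the file may differ from 'Selling_Price'. A minimal sketch, on made-up prices, of looking the column up case-insensitively and computing a few numbers that help describe the trend in words. The helper `resolve_column` is hypothetical and the toy values are not from the dataset.

```python
import pandas as pd


def resolve_column(frame, name):
    """Find a column by case-insensitive name (hypothetical helper); returns None if absent."""
    lookup = {str(col).lower(): col for col in frame.columns}
    return lookup.get(name.lower())


# Toy prices (made up) with a long right tail.
toy = pd.DataFrame({'selling_price': [20000, 25000, 30000, 35000, 40000, 45000, 300000]})

price_col = resolve_column(toy, 'Selling_Price')
prices = toy[price_col]

# A mean well above the median and a positive skew both point to a right-skewed distribution.
summary = {
    'mean': prices.mean(),
    'median': prices.median(),
    'skew': prices.skew(),
}
print(price_col, summary)
```

If the same pattern holds on the real data, the histogram can be described as right-skewed: most bikes sell at the lower end, with a few expensive ones stretching the tail.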

Q4. Create a bar plot to visualize the average selling price for each seller_type
and write one observation.

ans-
###
import matplotlib.pyplot as plt
import seaborn as sns
if 'seller_type' in df.columns and 'Selling_Price' in df.columns:
    # Calculate the average selling price for each seller_type
    avg_selling_price_by_seller_type = df.groupby('seller_type')['Selling_Price'].mean().sort_values(ascending=False)

    plt.figure(figsize=(10, 6))
    sns.barplot(x=avg_selling_price_by_seller_type.index, y=avg_selling_price_by_seller_type.values, palette='viridis')
    plt.title('Average Selling Price by Seller Type')
    plt.xlabel('Seller Type')
    plt.ylabel('Average Selling Price')
    plt.xticks(rotation=45)
    plt.grid(axis='y', alpha=0.75)
    plt.tight_layout()
    plt.show()
else:
    print("Error: 'seller_type' or 'Selling_Price' column not found in the DataFrame. Please check the column names.")
    print("Available columns:", df.columns.tolist())
###
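To back the written observation with numbers, it can help to print the group sizes next to the averages, because a category with very few listings can have a misleading mean. A small sketch on made-up rows; the column name `selling_price` and the values are assumptions for illustration.

```python
import pandas as pd

# Toy listings (made up): three individual sellers and a single dealer.
toy = pd.DataFrame({
    'seller_type': ['Individual', 'Individual', 'Individual', 'Dealer'],
    'selling_price': [30000, 40000, 50000, 90000],
})

# Mean, median and number of listings per seller type, highest mean first.
by_seller = (
    toy.groupby('seller_type')['selling_price']
    .agg(['mean', 'median', 'count'])
    .sort_values('mean', ascending=False)
)
print(by_seller)
```

Here the dealer average rests on one listing, which is the kind of caveat worth mentioning in the observation.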

Q5. Compute the average km_driven for each ownership type (1st owner,
2nd owner, etc.), and present the result as a bar plot.

ans-  
###
import matplotlib.pyplot as plt
import seaborn as sns

if 'owner' in df.columns and 'km_driven' in df.columns:
    # Calculate the average km_driven for each ownership type
    avg_km_driven_by_owner = df.groupby('owner')['km_driven'].mean().sort_values(ascending=False)

    plt.figure(figsize=(10, 6))
    sns.barplot(x=avg_km_driven_by_owner.index, y=avg_km_driven_by_owner.values, palette='coolwarm')
    plt.title('Average Kilometers Driven by Ownership Type')
    plt.xlabel('Ownership Type')
    plt.ylabel('Average Kilometers Driven')
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', alpha=0.75)
    plt.tight_layout()
    plt.show()
else:
    print("Error: 'owner' or 'km_driven' column not found in the DataFrame. Please check the column names.")
    print("Available columns:", df.columns.tolist())
###

Q6. Use the IQR method to detect and remove outliers from the km_driven
column. Show before-and-after summary statistics.

ans-  
###
print("Summary statistics for 'km_driven' BEFORE outlier removal:")
display(df['km_driven'].describe())

Q1 = df['km_driven'].quantile(0.25)
Q3 = df['km_driven'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_km_driven = df[(df['km_driven'] < lower_bound) | (df['km_driven'] > upper_bound)]

print(f"\nNumber of outliers detected in 'km_driven': {len(outliers_km_driven)}")
print(f"Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")


df_cleaned = df[~((df['km_driven'] < lower_bound) | (df['km_driven'] > upper_bound))]

print("\nSummary statistics for 'km_driven' AFTER outlier removal:")
display(df_cleaned['km_driven'].describe())

print(f"\nOriginal DataFrame shape: {df.shape}")
print(f"Cleaned DataFrame shape: {df_cleaned.shape}")
###
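The same IQR rule can be wrapped in a small function and checked by hand on a toy series. This is a sketch with made-up numbers; the function name `iqr_bounds` is an assumption, and `k=1.5` is the conventional multiplier used above.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy mileage values (made up): Q1 = 30, Q3 = 70, so IQR = 40.
toy_km = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 1000])
lower, upper = iqr_bounds(toy_km)

# between() is inclusive on both ends; note that it also drops NaN rows,
# whereas the mask used in the cell above keeps them.
kept = toy_km[toy_km.between(lower, upper)]
print(lower, upper, kept.tolist())
```

With these values the fences are 30 - 1.5 * 40 = -30 and 70 + 1.5 * 40 = 130, so only the 1000 entry is removed.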

Q7. Create a scatter plot of year vs. selling_price to explore the
relationship between a bike's age and its price.

ans-  
###
import matplotlib.pyplot as plt
import seaborn as sns


if 'year' in df.columns and 'Selling_Price' in df.columns:
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x='year', y='Selling_Price', data=df, hue='owner', palette='viridis', alpha=0.7)
    plt.title('Relationship between Year and Selling Price')
    plt.xlabel('Manufacturing Year')
    plt.ylabel('Selling Price')
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.tight_layout()
    plt.show()
else:
    print("Error: 'year' or 'Selling_Price' column not found in the DataFrame. Please check the column names.")
    print("Available columns:", df.columns.tolist())
###

Q8. Convert the seller_type column into numeric format using one-hot
encoding. Display the first 5 rows of the resulting DataFrame.

ans-  
###
import pandas as pd

df_encoded = df.copy()


if 'seller_type' in df_encoded.columns:
    # Apply one-hot encoding to the 'seller_type' column.
    # dtype=int gives 0/1 columns; recent pandas versions return booleans by default.
    seller_type_encoded = pd.get_dummies(df_encoded['seller_type'], prefix='seller_type', dtype=int)
    
    # Concatenate the new one-hot encoded columns with the original DataFrame
    # and drop the original 'seller_type' column
    df_encoded = pd.concat([df_encoded, seller_type_encoded], axis=1)
    df_encoded = df_encoded.drop('seller_type', axis=1)
    
    print("DataFrame after one-hot encoding 'seller_type' column (first 5 rows):")
    display(df_encoded.head())
    
    print("\nNew columns created for 'seller_type':")
    print(seller_type_encoded.columns.tolist())
else:
    print("Error: 'seller_type' column not found in the DataFrame. Please check column names.")
    print("Available columns:", df_encoded.columns.tolist())
###

Q9. Generate a heatmap of the correlation matrix for all numeric columns.
What correlations stand out the most?

ans-  
###
import matplotlib.pyplot as plt
import seaborn as sns


numeric_df = df.select_dtypes(include=['number'])

if not numeric_df.empty:
    # Calculate the correlation matrix
    correlation_matrix = numeric_df.corr()

    # Plot the heatmap
    plt.figure(figsize=(12, 10))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
    plt.title('Correlation Matrix of Numeric Columns')
    plt.show()
else:
    print("No numeric columns found in the DataFrame to compute correlation.")
###
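To answer which correlations stand out without reading them off the heatmap by eye, the matrix can be flattened into a ranked list of column pairs. A minimal sketch on made-up data; `top_correlations` is a hypothetical helper and the toy values are not from the dataset.

```python
import numpy as np
import pandas as pd


def top_correlations(frame, n=5):
    """List column pairs sorted by absolute Pearson correlation (illustrative helper)."""
    corr = frame.select_dtypes(include=['number']).corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by magnitude so strong negative correlations rank as high as positive ones.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy data (made up): newer bikes have fewer kilometers and higher prices.
toy = pd.DataFrame({
    'year': [2010, 2012, 2014, 2016, 2018],
    'km_driven': [50000, 42000, 30000, 21000, 9000],
    'selling_price': [20000, 21000, 40000, 38000, 60000],
})
ranked = top_correlations(toy)
print(ranked)
```

Running the same function on `df` would give the pairs to name in the written answer, with the sign showing the direction of each relationship.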

Q10.  Summarize your findings in a brief report:
● What are the most important factors affecting a bike's selling price?
● Mention any data cleaning or feature engineering you performed.


ans-
Most Important Factors Affecting a Bike's Selling Price:
Based on our analysis, the following factors appear to be the most influential on a bike's selling price:

- Manufacturing Year (year): Newer bikes are expected to sell for more because they are generally more desirable and have less wear and tear. The scatter plot of year vs. Selling_Price (Q7) is where this positive relationship should be visible.
- Kilometers Driven (km_driven): km_driven was not plotted directly against selling price, but the correlation heatmap (Q9) is expected to show a negative correlation with Selling_Price, since lower-mileage bikes are usually in better condition and command higher prices.
- Ownership Type (owner): The scatter plot (Q7) colors the points by owner. Bikes with fewer previous owners (e.g., 'First Owner') typically sell for more, which suggests better maintenance or perceived reliability. The bar plot in Q5 shows how km_driven varies by ownership type, which links ownership to price indirectly through mileage.
- Seller Type (seller_type): The bar plot of average selling price by seller_type (Q4) compares seller categories (e.g., individual sellers vs. dealers). Any gap between the bars suggests that the source of the sale affects the price.


Data Cleaning and Feature Engineering Performed:
- Missing Value Identification (Q2): We counted the missing values and their percentage for every column to gauge data completeness. No imputation or row removal was applied to missing data in the code cells shown.
- Outlier Removal for km_driven (Q6): The Interquartile Range (IQR) method was applied to the km_driven column, and rows outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR were removed into a separate DataFrame, df_cleaned. This makes the summary statistics more representative and can help model performance by reducing the impact of extreme values. The later cells (Q7 to Q9) still use the original df, not df_cleaned.
- One-Hot Encoding for seller_type (Q8): The categorical seller_type column was converted into a numeric format using one-hot encoding. This technique creates new binary columns for each unique category, making the feature suitable for many machine learning models that require numerical input.