# **Project Name**    - Car Sales Price Prediction




##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1 -** Swastik Tara (2210992433)
##### **Team Member 2 -** Tamanna Kansal (2210992436)
##### **Team Member 3 -** Tanika (2210992437)
##### **Team Member 4 -** Tanishq Singla (2210992438)

# **Project Summary -**

The "Car Sales Price Prediction" project aims to develop a machine learning model that predicts the sale price of cars based on various features. This falls under the domain of supervised learning and specifically regression. The dataset used for training and evaluation contains historical information about cars, including details such as model, year, mileage, brand, and other relevant features, along with their corresponding sale prices.

Key Objectives:

Data Collection: Gather a comprehensive dataset containing information on various car features and their associated sale prices.

Exploratory Data Analysis (EDA): Conduct EDA to understand the characteristics of the dataset, identify patterns, and preprocess the data. This involves handling missing values, detecting outliers, and visualizing relationships between features.

Data Preprocessing: Prepare the data for training by encoding categorical variables, scaling numerical features, and splitting the dataset into training and testing sets.

Model Selection: Choose a regression algorithm suitable for the task. Common choices include linear regression, decision tree regression, or ensemble methods like random forests.

Model Training: Train the selected model using the labeled training data, allowing the algorithm to learn the relationships between the input features and the target variable (car sales price).

Model Evaluation: Evaluate the model's performance on a separate test dataset to assess its ability to generalize to new, unseen data. Metrics such as Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE) may be used.

Hyperparameter Tuning: Fine-tune the model's hyperparameters to optimize its performance and generalization ability.

Prediction: Deploy the trained model to predict the sale prices of new cars based on their features.

Documentation: Document the entire process, including data sources, methodology, model details, and results.
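
The evaluation metrics named in the objectives can be computed directly. The following is a minimal sketch, assuming NumPy is available; the function names `mae` and `rmse` and the sample prices are illustrative and are not results from this project.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: the average size of the prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large errors more heavily than MAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical prices, for illustration only
actual = [10000, 15000, 20000]
predicted = [11000, 14000, 23000]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
print("MAE:", round(mae_value, 2))
print("RMSE:", round(rmse_value, 2))
```

Because RMSE squares the errors before averaging, a single large miss raises it more than it raises MAE, so the two metrics are often reported together.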


The project aims to deliver a robust machine learning model capable of accurately predicting car sales prices. This predictive tool can be valuable for both buyers and sellers in the automotive market, providing insights into fair pricing based on various car attributes.


# **GitHub Link -**

https://github.com/tanika04/AI-project.git

# **Problem Statement**


The automotive industry relies heavily on understanding market trends and consumer behavior to optimize sales strategies and pricing decisions. For both dealerships and private sellers, accurately pricing vehicles is essential for maximizing profits and minimizing time on the market. However, determining the optimal selling price for a vehicle can be challenging due to various factors such as market demand, vehicle condition, and geographic location.

The project aims to give both sellers and buyers a tool for making informed decisions: a model that predicts car sale prices with greater precision and contributes to a more transparent and efficient car-selling process.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

##For data manipulation
import pandas as pd # pandas for working with dataframes
import numpy as np # numpy for numerical processing

##For data visualization
import matplotlib.pyplot as plt # matplotlib for plotting graphs
import seaborn as sns # seaborn for statistical data visualization


### Dataset Loading

In [None]:
# Load Dataset

df = pd.read_csv('/content/car_prices.csv')  # df is the name of the DataFrame
# The pandas read_csv() function loads the CSV file at the given path into a DataFrame.

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of Rows and Columns: ",df.shape);

### Dataset Information

In [None]:
# Dataset Info
df.info();

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_values=df.isnull().sum()
print(null_values)

In [None]:
# Visualizing the missing values

plt.figure(figsize = (20,7))
sns.heatmap(df.isnull(), cbar=False)
plt.show()
#Visualizing missing values helps plan appropriate data preprocessing strategies like dropping columns with too many missing values or imputation methods to fill them in


### What did you know about your dataset?

1.   In our dataset, there are 19938 rows and 16 columns.
2.   It has no duplicate rows.
3.   It has no missing values.
4.   There is 1 column of int64 datatype, 11 columns of object datatype and 4 columns of float64 datatype.
5.   The heatmap gives a visual representation of the missing values across all data points. It makes it easy to spot patterns in missing data, such as missing values concentrated in a particular subset of rows.
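
The missing-value check can also be expressed as a per-column summary with counts and percentages. This is a minimal sketch; the helper name `missing_summary` and the small sample frame are assumptions made for illustration and are not part of the project data.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return the count and percentage of missing values per column, largest first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)

# Tiny made-up frame, for illustration only
sample = pd.DataFrame({
    "make": ["Kia", None, "BMW", "Ford"],
    "odometer": [16639.0, np.nan, np.nan, 5554.0],
    "sellingprice": [21500.0, 30000.0, 27750.0, 10900.0],
})

summary = missing_summary(sample)
print(summary)
```

Calling `missing_summary(df)` on the loaded dataset would give the same table for the real columns, which is easier to scan than the raw heatmap when only a few cells are missing.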

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

In [None]:
# Dataset Describe
df.describe()

### Variables Description

1. **year**:

> Represents the manufacturing year of the car.

> Data type: Numeric (integer).

2. **make**:

> Indicates the manufacturer or brand of the car.

> Data type: String (object).

3. **model**:

> Specifies the model name or number of the car.

> Data type: String (object).

4. **trim**:

> Refers to the specific version or trim level of the car model.

> Data type: String (object).

5. **body**:

> Describes the body type of the car (e.g., sedan, SUV, truck).

> Data type: String (object).

6. **transmission**:

> Indicates the type of transmission the car has (e.g., automatic, manual).

> Data type: String (object).

7. **vin**:

> Stands for Vehicle Identification Number, a unique code used to identify individual motor vehicles.

> Data type: String (object).

8. **state**:

> Represents the state in which the car is located or registered.

> Data type: String (object).

9. **condition**:

> Rates the overall condition of the car on a numeric scale.

> Data type: Numeric (float).

10. **odometer**:

> Indicates the total distance traveled by the car, typically in miles or kilometers.

> Data type: Numeric (float).

11. **color**:

> Specifies the exterior color of the car.

> Data type: String (object).

12. **interior**:

> Describes the interior color or features of the car.

> Data type: String (object).

13. **seller**:

> Indicates the entity or individual selling the car.

> Data type: String (object).

14. **mmr**:

> Possibly represents the Manheim Market Report, a pricing tool for used cars.

> Data type: Numeric (float).

15. **sellingprice**:

> Represents the sale price of the car.

> Data type: Numeric (float).

16. **saledate**:

> Indicates the date when the car was sold.

> Data type: String (object) as loaded; it can be converted to a datetime type.
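
If `saledate` is read as plain text (the default for `read_csv` without `parse_dates`), it can be converted to a datetime column before any time-based analysis. The sketch below assumes ISO-style date strings in a made-up sample; the real column may use a different layout that needs cleaning or an explicit `format` argument first. The column names `saledate_parsed`, `sale_year` and `sale_month` are hypothetical.

```python
import pandas as pd

# Made-up sale dates, for illustration only; the last entry is deliberately invalid
sample = pd.DataFrame({
    "saledate": ["2015-01-15 10:30:00", "2014-12-16 12:30:00", "not a date"],
})

# errors="coerce" turns unparseable entries into NaT instead of raising an error
sample["saledate_parsed"] = pd.to_datetime(sample["saledate"], errors="coerce")

# Derive simple calendar features from the parsed dates
sample["sale_year"] = sample["saledate_parsed"].dt.year
sample["sale_month"] = sample["saledate_parsed"].dt.month
print(sample)
```

Counting the `NaT` values after conversion is a quick way to see how many rows failed to parse.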




### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values=df.nunique()
print(unique_values)

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Remove duplicate rows based on the VIN column
df.drop_duplicates(subset='vin', inplace=True)

# Print the number of rows in the DataFrame after removing duplicates
print(df.shape)

In [None]:
# Check for missing values in each column
missing_values = df.isnull().sum()

# Print the number of missing values in each column
print(missing_values)

In [None]:
# Impute missing values in the 'year' column with the median
# (assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about)
df['year'] = df['year'].fillna(df['year'].median())

In [None]:
# Impute missing values in the 'make' column with a new category
df['make'] = df['make'].fillna('Unknown')

In [None]:
# Impute missing values in the 'model' column with the most frequent category

df['model'] = df['model'].fillna(df['model'].mode()[0])

In [None]:
# Impute missing values in the 'trim' column with the most frequent category

df['trim'] = df['trim'].fillna(df['trim'].mode()[0])

In [None]:
# Impute missing values in the 'body' column with the most frequent category
df['body'] = df['body'].fillna(df['body'].mode()[0])

In [None]:
# Impute missing values in the 'transmission' column with the most frequent category

df['transmission'] = df['transmission'].fillna(df['transmission'].mode()[0])

In [None]:
# Impute missing values in the 'condition' column with the mean

df['condition'] = df['condition'].fillna(df['condition'].mean())

In [None]:
# Impute missing values in the 'odometer' column with the median
df['odometer'] = df['odometer'].fillna(df['odometer'].median())

In [None]:
# Impute missing values in the 'interior' column with the most frequent category

df['interior'] = df['interior'].fillna(df['interior'].mode()[0])

In [None]:
#  Basic Aggregations
#1. Total sales revenue for each state
total_revenue_by_state = df.groupby('state')['sellingprice'].sum()
print("Total sales revenue by state:")
print(total_revenue_by_state)

In [None]:
# Multi-Level GroupBy
# 2. Average selling price by make and year
average_sales_by_make_year = df.groupby(['make', 'year'])['sellingprice'].mean()
print("\nAverage sales price by make and year:")
print(average_sales_by_make_year)

In [None]:
# Conditional Aggregations
#3. Grouping by 'model' and calculating total sales revenue for each model
# (summing 'sellingprice'; summing 'saledate' would only concatenate date strings)
total_sales_by_model = df.groupby('model')['sellingprice'].sum()
print("Total sales by model:\n", total_sales_by_model)

In [None]:
#4.find count of different color of vehicles in each year
find=df.groupby('year')['color'].value_counts()
print(find)

In [None]:
#5. count of occurrences for each combination of 'model' and 'interior'.
models=df.groupby('model')['interior'].value_counts()
print(models)

In [None]:
#6. will display the count of occurrences for each transmission type.
transmission_counts = df.groupby('transmission').size()
# Print the counts of occurrences for each transmission type
print(transmission_counts)

In [None]:
#7. Group by 'year' column and find the maximum selling price for each year
max_selling_price_by_year = df.groupby('year')['sellingprice'].max()

# Print the maximum selling price for each year
print(max_selling_price_by_year)

In [None]:
#8. calculate the average selling price for each combination of vehicle make and model?
average_price_by_make_model = df.groupby(['make', 'model'])['sellingprice'].mean()

# Print the average selling price for each combination of make and model
print(average_price_by_make_model)

In [None]:
#9.calculate the average selling price for each body type.
average_price_by_body = df.groupby('body')['sellingprice'].mean()
print(average_price_by_body)

In [None]:
#10. count the number of sales for each body type.
sales_count_by_body = df.groupby('body').size()
print(sales_count_by_body )

In [None]:
#11. find the maximum odometer reading for each body type.
max_odometer_by_body = df.groupby('body')['odometer'].max()
print(max_odometer_by_body)

### What all manipulations have you done and insights you found?

**Data Manipulation**: Removed rows with duplicate VINs, then imputed missing values in each column using the following strategies:

year: imputed with the median year.

make: imputed with a new category, 'Unknown'.

model: imputed with the most frequent model.

trim: imputed with the most frequent trim level.

body: imputed with the most frequent body type.

transmission: imputed with the most frequent transmission type.

condition: imputed with the mean condition.

odometer: imputed with the median odometer reading.

interior: imputed with the most frequent category.
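
The column-by-column strategies above can be collected into one helper so that the same rules are applied consistently. This is a sketch under stated assumptions: the function name `impute_columns`, its parameters and the sample frame are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute_columns(frame, median_cols=(), mean_cols=(), mode_cols=(), constant_cols=None):
    """Return a copy of `frame` with missing values filled column by column."""
    result = frame.copy()
    for col in median_cols:
        result[col] = result[col].fillna(result[col].median())
    for col in mean_cols:
        result[col] = result[col].fillna(result[col].mean())
    for col in mode_cols:
        # mode() ignores missing values and may return ties; take the first entry
        result[col] = result[col].fillna(result[col].mode()[0])
    for col, value in (constant_cols or {}).items():
        result[col] = result[col].fillna(value)
    return result

# Tiny made-up frame, for illustration only
sample = pd.DataFrame({
    "year": [2014.0, np.nan, 2012.0, 2015.0],
    "condition": [4.0, 2.0, np.nan, 3.0],
    "body": ["SUV", "Sedan", None, "Sedan"],
    "make": ["Kia", None, "BMW", "Ford"],
})

filled = impute_columns(
    sample,
    median_cols=["year"],
    mean_cols=["condition"],
    mode_cols=["body"],
    constant_cols={"make": "Unknown"},
)
print(filled)
```

Working on a copy keeps the original frame unchanged, which makes it easier to compare distributions before and after imputation.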

**Insights:**

* Trends in car prices over the years.

* Price variations across different car makes and models.

* Influence of trim levels and body types on prices.

* Impact of transmission types on pricing trends.

* Preferences for car color and interior features and their effect on prices.

* Accuracy of pricing predictions compared to market values.



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Plot the distribution of vehicle colors
plt.figure(figsize=(10, 6))
color_counts = df['color'].value_counts()
color_counts.plot(kind='bar', color=['black', 'blue', 'grey', 'silver' ,'blue','red' ,'black','gold','green' , 'red','beige','brown','orange','purple','blue','yellow','grey','turquoise','red','orange'])
plt.title('Distribution of Vehicle Colors')
plt.xlabel('Color')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart of value counts (equivalent to a countplot) is chosen because it clearly shows the distribution of car colors, a categorical variable. Its simplicity makes it easy to see at a glance which colors are most and least common in the dataset.

##### 2. What is/are the insight(s) found from the chart?


The bar colors come from a fixed list in the plotting code and do not necessarily match the vehicle color each bar represents, so the actual distribution should be read from the x-axis labels. The bar heights show which vehicle colors are most common, giving an initial view of color popularity within the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
Insights on popular vehicle colors can inform effective marketing strategies, potentially boosting customer satisfaction and sales.

Negative Growth:
Limited color diversity in the dataset may hinder customization options, leading to negative growth as customers seeking variety may be deterred, impacting market competitiveness and overall sales.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Plot the distribution of selling prices
plt.figure(figsize=(10, 6))
plt.hist(df['sellingprice'], bins=30, color='skyblue', edgecolor='black')
plt.title('Distribution of Selling Prices')
plt.xlabel('Selling Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

The histogram is chosen to depict the distribution of selling prices due to its ability to illustrate the range and frequency of prices, providing insights into common price bands. This visualization is essential for understanding the overall distribution and informing pricing strategies in the vehicle sales project.

##### 2. What is/are the insight(s) found from the chart?

Price Range Distribution: The histogram illustrates the spread of selling prices, revealing the range of prices and their respective frequencies within the dataset.

Common Price Bands: Peaks or clusters in the histogram indicate prevalent selling price ranges, offering insights into the most frequent pricing segments in the dataset.
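
The shape of the price distribution can also be summarized numerically alongside the histogram. The sketch below uses a made-up price series for illustration; on the real data, the same calls would be applied to `df['sellingprice']`.

```python
import pandas as pd

# Made-up prices with one very expensive car, for illustration only
prices = pd.Series([5000, 7000, 8000, 9000, 10000, 12000, 15000, 60000], name="sellingprice")

# Quartiles describe where the bulk of the prices lies
quantiles = prices.quantile([0.25, 0.5, 0.75])

# Positive skewness means a long tail toward high prices
skewness = prices.skew()

print(quantiles)
print("Mean:", prices.mean(), "Median:", prices.median())
print("Skewness:", round(skewness, 2))
```

A mean well above the median, together with positive skewness, indicates that a few high-priced vehicles stretch the right tail of the distribution.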

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
Insights into selling price distribution inform effective pricing strategies, potentially attracting more customers and maximizing revenue.

Negative Growth:
Skewed or limited selling price diversity may deter customers, leading to negative growth by reducing sales and limiting market appeal. A balanced and competitive pricing approach is crucial for sustained positive impact.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
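# Create a box plot to visualize the distribution of selling prices for each body type.

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='body', y='sellingprice')  # one box per body type
plt.xlabel('body')
plt.ylabel('sellingprice')
plt.xticks(rotation=90)  # rotate the body type labels so they do not overlap
plt.show()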

##### 1. Why did you pick the specific chart?

The boxplot is chosen to compare selling prices across different body types because it effectively displays the central tendency, spread, and potential outliers within each category. This visualization provides a clear and concise way to understand the distribution of selling prices based on body types, aiding in identifying patterns and making informed decisions in your vehicle sales analysis.

##### 2. What is/are the insight(s) found from the chart?

The boxplot provides the following insights:

Price Variation by Body Type: Variability in the box lengths and positions indicates differing price distributions among various body types.

Outliers: Points beyond the "whiskers" suggest potential outliers in selling prices for specific body types, highlighting exceptional cases that may require further investigation.

Market Preferences: Uniform box lengths within body types indicate consistency in selling prices, offering insights into market preferences and enabling more targeted pricing strategies for each vehicle category.
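
The points drawn beyond the whiskers follow the common 1.5 × IQR rule, which can be reproduced in code to list the outliers for each body type. This is a minimal sketch; the helper name `flag_outliers` and the sample data are assumptions made for illustration.

```python
import pandas as pd

def flag_outliers(frame, group_col, value_col, k=1.5):
    """Flag rows whose value lies more than k * IQR outside the quartiles of their group."""
    grouped = frame.groupby(group_col)[value_col]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (frame[value_col] < q1 - k * iqr) | (frame[value_col] > q3 + k * iqr)

# Made-up prices, for illustration only
sample = pd.DataFrame({
    "body": ["Sedan"] * 5 + ["SUV"] * 5,
    "sellingprice": [9000, 10000, 11000, 12000, 50000,
                     20000, 21000, 22000, 23000, 24000],
})

sample["is_outlier"] = flag_outliers(sample, "body", "sellingprice")
print(sample[sample["is_outlier"]])
```

Computing the bounds per group matters here: a price that is ordinary for one body type can be an outlier for another.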

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the boxplot, such as understanding price variation by body type, can positively impact business by informing targeted pricing strategies. Tailoring prices based on body types to align with customer preferences can enhance market appeal, potentially leading to increased sales and positive business impact.
However, if the boxplot reveals extreme outliers or significant pricing disparities among body types, it may lead to negative growth. Excessive pricing variations can create customer dissatisfaction, impacting overall sales and market competitiveness.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
#Create a scatter plot to examine the relationship between vehicle prices and odometer readings.
#This helps identify any patterns or correlations between price and mileage.
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='odometer', y='sellingprice', color='green')
plt.title('Vehicle Prices vs. Odometer Readings')
plt.xlabel('Odometer Reading')
plt.ylabel('Selling Price ($)')
plt.show()


##### 1. Why did you pick the specific chart?
The scatterplot is chosen to visualize the relationship between vehicle prices and odometer readings due to its effectiveness in highlighting potential patterns or correlations. By representing each data point, this chart provides a clear overview of how odometer readings may influence selling prices, aiding in the identification of trends and outliers for more informed decision-making in your vehicle sales analysis.

##### 2. What is/are the insight(s) found from the chart?

The scatterplot provides the following insights:

Price and Odometer Relationship: Examining the scatterplot reveals the nature of the relationship between vehicle prices and odometer readings. This helps identify whether there is a correlation, trend, or any potential patterns.

Outliers: Outlying points on the scatterplot may indicate exceptional cases where the selling price does not align with typical expectations based on odometer readings, suggesting the presence of influential factors beyond mileage.
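
The strength of the relationship seen in the scatterplot can be quantified with a correlation coefficient. The sketch below uses made-up odometer and price values for illustration; on the real data, the same call would use `df['odometer']` and `df['sellingprice']`.

```python
import pandas as pd

# Made-up values in which price falls as mileage rises, for illustration only
sample = pd.DataFrame({
    "odometer": [5000, 20000, 45000, 80000, 120000],
    "sellingprice": [30000, 26000, 20000, 13000, 7000],
})

# Series.corr computes the Pearson correlation by default
correlation = sample["odometer"].corr(sample["sellingprice"])
print("Pearson correlation:", round(correlation, 3))
```

A value close to -1 indicates a strong negative linear relationship, a value near 0 indicates little linear relationship, and Pearson correlation does not capture non-linear patterns.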

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Insights gained from the scatterplot can positively impact business by helping to identify trends and correlations between odometer readings and selling prices. This information enables more precise pricing strategies, potentially increasing customer satisfaction and sales, thus contributing to positive business impact.

However, if the scatterplot reveals no clear relationship between odometer readings and selling prices, pricing based on mileage alone would be unreliable, which may hinder growth.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#Create a KDE plot to estimate the probability density function of vehicle prices.
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df, x='sellingprice', color='skyblue', fill=True)
plt.title('Kernel Density Estimation of Vehicle Prices')
plt.xlabel('Selling Price ($)')
plt.ylabel('Density')
#plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

The KDE plot is chosen for its ability to offer a smooth and informative representation of the probability density of vehicle prices, aiding in a nuanced understanding of the overall distribution characteristics in a concise manner.

##### 2. What is/are the insight(s) found from the chart?

Probability Density: The KDE plot visually represents the probability density of vehicle prices, showing where prices are more concentrated or sparse in the dataset.

Distribution Shape: The smooth curve reveals the overall shape of the selling price distribution, indicating whether prices are skewed, symmetric, or exhibit multiple peaks.

Common Price Ranges: Concentrations in the curve highlight common price ranges, providing insights into the most frequent selling prices within the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the KDE plot insights aid in setting precise and competitive prices, potentially enhancing customer satisfaction and positively impacting sales for a beneficial business outcome.

No, the KDE plot itself does not inherently lead to negative growth. However, if the insights from the plot reveal extreme price concentration or erratic distribution patterns, inconsistent pricing strategies may emerge.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# create a pie chart showing the distribution of vehicle transmissions in the dataset
# Count occurrences of each transmission type
transmission_counts = df['transmission'].value_counts()

# Plot the pie chart
plt.figure(figsize=(4,5))
plt.pie(transmission_counts, labels=transmission_counts.index, autopct='%1.1f%%', startangle=140,colors=['red','yellow'])
plt.title('Distribution of Vehicle Transmissions')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

##### 1. Why did you pick the specific chart?

The pie chart is chosen to visualize the distribution of vehicle counts by transmission type because it provides a clear, intuitive representation of the proportion of each category. This chart is effective for showcasing the relative frequencies of different transmission types in the dataset, allowing for a quick and easy-to-understand overview.

##### 2. What is/are the insight(s) found from the chart?

Proportion of Transmission Types: The chart shows the relative proportions of different transmission types (e.g., automatic and manual) in the dataset, providing an at-a-glance understanding of their distribution.

Prevalence of Transmission Types: The larger the slice of a specific color, the more prevalent that transmission type is in the dataset. This helps in identifying the dominant transmission types among the vehicles.

Visual Comparison: The visual comparison of pie slices allows for easy identification of the distribution pattern, making it straightforward to observe any significant imbalances or trends in transmission types.
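
The percentages shown on the pie slices can be computed directly, which is useful when exact shares are needed in the write-up. The sketch below uses a made-up series for illustration; on the real data, the same call would use `df['transmission']`.

```python
import pandas as pd

# Made-up transmission values, for illustration only
transmissions = pd.Series(
    ["automatic", "automatic", "manual", "automatic"], name="transmission"
)

# normalize=True returns fractions instead of counts; multiply by 100 for percentages
shares = transmissions.value_counts(normalize=True) * 100
print(shares.round(1))
```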

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the distribution of vehicle counts by transmission type can potentially contribute to a positive business impact. Understanding the prevalence and proportion of different transmission types in the dataset allows businesses to tailor their marketing strategies, inventory management, and customer offerings to align with customer preferences.

Misinterpreting or neglecting insights related to pricing, market trends, marketing strategies, customer preferences, or temporal patterns may lead to incorrect business decisions, negatively impacting sales and overall growth. The key lies in strategic and informed responses to insights to avoid adverse consequences.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
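# Create a violin plot to visualize the distribution of selling prices for each body type.
plt.figure(figsize=(12, 6))
v = sns.violinplot(data=df, x='body', y='sellingprice')  # one violin per body type
v.set_xlabel('body')
v.set_ylabel('sellingprice')
plt.xticks(rotation=90)  # rotate the body type labels so they do not overlap
plt.show()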


##### 1. Why did you pick the specific chart?

The violin plot is chosen to illustrate the distribution of selling prices across various body types due to its comprehensive visualization capabilities. By combining box plots and kernel density estimation, it provides a nuanced representation of the data, offering insights into the central tendency, spread, and probability density of prices.

##### 2. What is/are the insight(s) found from the chart?

The violin plot highlights how selling price distributions differ among body types:

Variability: The vertical extent of each violin shows the range of selling prices for that body type.

Probability Density: The width of a violin at a given price indicates how common that price is within the category.

Comparative Analysis: Allows quick assessment of price differences among body types, guiding pricing strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Insights from the violin plot can positively impact business by informing precise pricing strategies, potentially increasing customer satisfaction and sales through a nuanced understanding of price distribution.

If the plot indicates extreme variability or inconsistent probability density in selling prices, it may lead to negative growth.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Plot the mean selling price for each sale date to visualize the overall trend.
plt.figure(figsize=(50, 20))
sns.lineplot(data=df, x='saledate', y='sellingprice', estimator='mean', color='skyblue',marker='o')
plt.title('Mean Selling Price Over Time')
plt.xlabel('Date')
plt.ylabel('Selling Price ($)')
plt.show()

##### 1. Why did you pick the specific chart?

The line plot of the mean selling price per sale date is chosen to show how selling prices change over time. The markers and the focus on the mean make the temporal evolution of selling prices easier to follow.

##### 2. What is/are the insight(s) found from the chart?

Overall Trend: The chart helps identify whether there is a consistent upward, downward, or stable trend in selling prices over the observed period.

Seasonal Patterns: Patterns or fluctuations in selling prices across different time intervals may indicate seasonal influences or periodic market dynamics.

Outliers or Anomalies: Any significant deviations from the overall trend could signify outliers or unique events affecting selling prices on specific dates.
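
Trends and seasonal patterns are easier to read when prices are aggregated by month and smoothed. This is a sketch under stated assumptions: it uses made-up sales with an already parsed datetime column, whereas the real `saledate` column would first need converting to datetime.

```python
import pandas as pd

# Made-up sales with parsed dates, for illustration only
sales = pd.DataFrame({
    "saledate": pd.to_datetime(
        ["2015-01-05", "2015-01-20", "2015-02-03", "2015-02-17", "2015-03-10"]
    ),
    "sellingprice": [10000, 12000, 11000, 15000, 14000],
})

# Mean selling price per calendar month
monthly_mean = sales.groupby(sales["saledate"].dt.to_period("M"))["sellingprice"].mean()

# A 2-month rolling mean smooths short-term fluctuations into a simple trend
trend = monthly_mean.rolling(window=2, min_periods=1).mean()

print(monthly_mean)
print(trend)
```

Plotting `monthly_mean` and `trend` together gives a much less crowded chart than one point per individual sale date.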

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
Line plots help identify pricing trends, analyze seasonal variations, and conduct competitive analysis, enabling informed pricing strategies.

Negative Growth Potential:
Failure to adapt pricing strategies to trends or seasonal changes may lead to decreased sales and loss of market share.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

# Group by 'condition' and calculate the average selling price
avg_price_by_condition = df.groupby('condition')['sellingprice'].mean().sort_values(ascending=False)

# Plot the average selling price by condition
plt.figure(figsize=(10, 6))
avg_price_by_condition.plot(kind='bar', color='skyblue')
plt.title('Average Selling Price by Condition')
plt.xlabel('Condition')
plt.ylabel('Average Selling Price ($)')
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart is selected to illustrate the average selling prices by condition, offering a clear and concise comparison of pricing variations based on the condition of vehicles in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The bar chart displays the average selling prices categorized by vehicle condition, highlighting that better conditions are associated with higher average prices. This insight assists in identifying the price variations across conditions and informs decision-making for inventory management and pricing strategies, supporting a nuanced understanding of market dynamics related to vehicle conditions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the chart can positively influence business by guiding strategic pricing and inventory decisions, potentially boosting sales. Conversely, neglecting to align strategies with condition-related trends may lead to negative growth due to a potential mismatch in market offerings and customer preferences.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Calculate the average selling price by state
avg_price_by_state = df.groupby('state')['sellingprice'].mean().sort_values(ascending=False)

# Set a suitable figure size
plt.figure(figsize=(12, 8))

# Plot the Bar Chart
avg_price_by_state.plot(kind='bar', color='blue')
plt.title('Average Selling Price by State')
plt.xlabel('State')
plt.ylabel('Average Selling Price ($)')
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

Comparison of Numerical Data: Bar charts are effective for comparing numerical data (average selling prices in this case) across different categories (states). Each bar represents a state, and the height of the bar corresponds to the average selling price for that state, making it easy to compare prices between states.

Clear Representation: Bar charts provide a clear and straightforward representation of data, making it easy for viewers to understand and interpret the average selling prices for each state at a glance.

##### 2. What is/are the insight(s) found from the chart?

Regional Price Variations: The chart highlights significant variations in average selling prices across different states. States with higher bars indicate higher average selling prices, while states with lower bars have comparatively lower average prices. This suggests that regional factors, such as market demand, economic conditions, or local preferences, influence vehicle pricing.

High-Value Markets: States with the tallest bars represent high-value markets where vehicles tend to sell at higher prices on average. Identifying these high-value markets is crucial for automakers and dealerships to allocate resources effectively and target marketing efforts towards these regions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
Insights from average selling price by state can help businesses tailor marketing strategies, optimize pricing, and allocate resources effectively in high-value markets, potentially leading to increased revenue and profitability.

Negative Impact:
Overemphasis on high-priced markets may lead to market saturation and intensified competition, while neglecting lower-priced markets could result in missed revenue opportunities and hinder overall growth potential. Additionally, customer perception issues in lower-priced markets could impact brand reputation and loyalty, leading to negative growth.
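
A state-level average can be distorted when a state has only a few sales. The sketch below reports the mean alongside the median and the number of sales per state. The toy values and the `MIN_SALES` threshold are assumptions for illustration; only the `state` and `sellingprice` column names come from the notebook.

```python
import pandas as pd

# Toy data for illustration only; the notebook would use its own `df`.
toy = pd.DataFrame({
    "state": ["ca", "ca", "ca", "tx", "tx", "ny"],
    "sellingprice": [12000, 15000, 18000, 9000, 11000, 40000],
})

# Mean, median and number of sales per state, highest mean first.
state_summary = (
    toy.groupby("state")["sellingprice"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)

# Hypothetical threshold: flag states with too few sales to trust the mean.
MIN_SALES = 2
state_summary["reliable"] = state_summary["count"] >= MIN_SALES
print(state_summary)
```

In this toy example the state with the highest mean has a single sale, so its bar would say little about the market there.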

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Countplot of Transmission Types: Illustrating the count of each transmission type.
# Set a suitable figure size
plt.figure(figsize=(10, 6))

# Plot the Countplot
sns.countplot(data=df, x='transmission')
plt.title('Count of Vehicles by Transmission Type')
plt.xlabel('Transmission Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

The countplot directly shows the counts of observations in each category, which can be informative when you want to understand the distribution of sales across different categories. It is straightforward and easy to interpret. Each bar represents the count of observations in a category, making it simple to compare between categories.

##### 2. What is/are the insight(s) found from the chart?

Prevalence of Transmission Types: The countplot shows which transmission types are most common in the dataset, for example whether automatic, manual, CVT (Continuously Variable Transmission), or other types dominate.

Consumer Preferences: If one transmission type significantly outnumbers the others, it suggests a strong consumer preference for that type. This insight could influence decisions related to inventory management, production planning, or marketing strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Meeting Consumer Demand: Understanding which transmission types are most popular among consumers allows businesses to align their product offerings with market preferences. This alignment can lead to increased sales and customer satisfaction as the company provides vehicles that meet consumer needs and preferences.

Failure to Adapt to Market Trends: If a company ignores or misinterprets insights from the countplot and continues to produce vehicles with outdated or unpopular transmission types, it may experience negative growth. Consumers may prefer competitors' vehicles with more desirable transmission options, leading to decreased market share and sales.
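
Raw counts are easier to compare once they are expressed as shares. A minimal sketch, assuming missing transmission values should be kept as their own "unknown" group (that label and the toy values are assumptions, not part of the original analysis):

```python
import pandas as pd

# Toy data for illustration only; the notebook would use df['transmission'].
toy = pd.DataFrame(
    {"transmission": ["automatic"] * 7 + ["manual"] * 2 + [None]}
)

# Percentage share of each transmission type, keeping missing values visible.
shares = (
    toy["transmission"]
    .fillna("unknown")
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)
print(shares)
```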

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Create two stacked subplots that share the year axis
fig, axs = plt.subplots(nrows=2, ncols=1, sharex=True, figsize=(10, 8))

# Plot selling price against year
axs[0].scatter(df['year'], df['sellingprice'], marker='+', color='b')
axs[0].set_title('Selling Price Over Time')
axs[0].set_xlabel('Year')
axs[0].set_ylabel('Selling Price ($)')

# Plot mileage (odometer reading) against year
axs[1].scatter(df['year'], df['odometer'], marker='+', color='r')
axs[1].set_title('Mileage Over Time')
axs[1].set_xlabel('Year')
axs[1].set_ylabel('Mileage')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Comparison: Subplots enable a direct comparison between different variables or datasets. In this case, selling price and mileage are two distinct variables that are being plotted against the same time axis (year). By placing them in separate subplots, it's easy to compare their trends and distributions side by side.

Space Efficiency: Using subplots allows you to maximize the use of space within a single figure. Rather than creating separate figures for each variable, subplots enable you to display multiple plots within the same visualization, reducing clutter and making it easier to interpret the data.

##### 2. What is/are the insight(s) found from the chart?

Trend Identification: By observing the general direction of the data points in each scatter plot, one can identify trends in selling price and mileage over time. For example, if the selling price tends to increase while mileage decreases over time, it suggests that newer vehicles with lower mileage tend to command higher prices.

Correlation Assessment: How tightly the points cluster around an overall pattern in each scatter plot indicates how strongly that variable is related to year. A tight band suggests a strong relationship, whereas widely scattered points suggest a weaker one. Because selling price and mileage are plotted in separate panels, their direct correlation is better read from the correlation heatmap (Chart 14).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the scatter plots can positively impact businesses by informing pricing strategies, optimizing inventory management, and understanding customer preferences. However, failure to address negative trends or misinterpretation of data could lead to declining sales and profitability. It's crucial for businesses to leverage these insights effectively to achieve sustainable growth in the automotive industry.
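
With many overlapping points, a trend in a scatter plot can be hard to read. One way to summarize it is the median selling price and median mileage per year, sketched here on toy values (the numbers are hypothetical; only the column names come from the notebook):

```python
import pandas as pd

# Toy data for illustration only; the notebook would use its own `df`.
toy = pd.DataFrame({
    "year": [2010, 2010, 2012, 2012, 2014, 2014],
    "sellingprice": [5000, 7000, 10000, 12000, 18000, 20000],
    "odometer": [120000, 100000, 80000, 60000, 30000, 20000],
})

# Median price and mileage for each year.
yearly = toy.groupby("year")[["sellingprice", "odometer"]].median()
print(yearly)
```

In the toy data the median price rises with year while the median mileage falls, which is the kind of pattern the two scatter plots are meant to reveal.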






#### Chart - 13

In [None]:
# Chart - 13 visualization code

# Create a violin plot by make
plt.figure(figsize=(20, 8))
sns.violinplot(x='make', y='sellingprice', data=df)
plt.title('Distribution of Vehicle Prices by Make')
plt.xlabel('Vehicle Make')
plt.ylabel('Price ($)')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A violin plot provides a concise visual summary of the distribution of a continuous variable (in this case, selling price) within each category of a categorical variable (vehicle make). It allows for easy comparison of price distributions across different vehicle makes.

##### 2. What is/are the insight(s) found from the chart?

Price Distribution Variation: The width of the violin at a given price indicates the density of vehicles sold at that price within the make, so wider sections mark the price ranges where sales are concentrated. The vertical extent of each violin shows the spread: makes with taller violins have more variable prices than makes with short, compact ones.

Central Tendency: The thicker part of the violin plot (the "body") represents the most common prices within each make. By observing the height and position of this portion, one can infer the central tendency or typical price range for each vehicle make.

Outliers: The tails of the violin plot represent the less common price points or outliers. By examining the length and thickness of the tails, one can identify which makes have more outliers or extreme price variations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
Insights can optimize pricing strategies, target marketing efforts, and enhance competitive positioning.

Negative Growth Potential:
Misjudged pricing can lead to overpricing or underpricing, risking sales and profit.
Ignoring niche markets or failing to address brand perception challenges could limit growth opportunities.
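
The shapes in a violin plot can be backed with numbers. The sketch below computes the median and interquartile range (IQR) of selling price per make on toy values; the figures are hypothetical and only the `make` and `sellingprice` column names come from the notebook.

```python
import pandas as pd

# Toy data for illustration only; the notebook would use its own `df`.
toy = pd.DataFrame({
    "make": ["Ford"] * 4 + ["BMW"] * 4,
    "sellingprice": [8000, 10000, 12000, 14000,
                     20000, 30000, 40000, 50000],
})

# Quartiles of selling price for each make, one column per quantile.
quartiles = (
    toy.groupby("make")["sellingprice"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
    .rename(columns={0.25: "q25", 0.5: "median", 0.75: "q75"})
)

# The IQR measures how spread out the middle half of the prices is.
quartiles["iqr"] = quartiles["q75"] - quartiles["q25"]
print(quartiles)
```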





#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10, 8))
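# Use numeric columns only; recent pandas versions raise an error if corr() meets text columns
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm', fmt=".2f")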
plt.title('Correlation Matrix')
plt.show()

##### 1. Why did you pick the specific chart?

The heatmap of the correlation matrix is chosen to visually represent and quantify the strength and direction of relationships between numerical variables in the dataset. It aids in identifying dependencies and is useful for making informed decisions in data analysis.

##### 2. What is/are the insight(s) found from the chart?

Strength of Relationships: The heatmap quantifies the strength of relationships between numerical variables, revealing the degree of correlation.

Direction of Correlation: Positive values indicate variables moving in the same direction, while negative values suggest variables moving in opposite directions.
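
For a regression project, the most relevant row of the correlation matrix is the one for the target. A minimal sketch, assuming `sellingprice` is the target and using toy values, ranks the numeric features by the absolute strength of their correlation with it:

```python
import pandas as pd

# Toy data for illustration only; the notebook would use its own `df`.
toy = pd.DataFrame({
    "make": ["Ford", "BMW", "Kia", "Ford", "BMW", "Kia"],
    "year": [2008, 2010, 2011, 2012, 2013, 2014],
    "odometer": [150000, 110000, 90000, 70000, 40000, 20000],
    "sellingprice": [4000, 7000, 9000, 12000, 16000, 21000],
})

# Pearson correlation on numeric columns only ('make' is excluded).
corr_matrix = toy.select_dtypes(include="number").corr()

# Correlation of each feature with the target, strongest first.
target_corr = (
    corr_matrix["sellingprice"]
    .drop("sellingprice")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```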

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Create a pairplot to visualize pairwise relationships between the numerical variables 'sellingprice', 'odometer', and 'year'.
# This plot provides a comprehensive view of correlations and distributions.
sns.pairplot(df[['sellingprice','odometer','year']])
plt.show()

##### 1. Why did you pick the specific chart?

The pairplot is selected for its ability to offer a comprehensive view of correlations and distributions between 'sellingprice', 'odometer', and 'year', allowing simultaneous examination of multiple factors for a holistic understanding in a concise manner.

##### 2. What is/are the insight(s) found from the chart?

The pairplot provides the following insights:

Correlations: By examining the scatterplots, one can identify potential correlations between 'sellingprice', 'odometer', and 'year', gaining insights into how these variables may interact.

Distributions: The diagonal histograms display the univariate distribution of each variable, offering insights into the spread and concentration of data points for 'sellingprice', 'odometer', and 'year'.

Potential Outliers: Outliers or unusual patterns in the scatterplots may be noticeable, providing insights into exceptional cases or data points that deviate from the overall trends.
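
Points that look unusual in the pairplot can be flagged numerically. The sketch below uses the common 1.5 × IQR rule on toy values; the rule and the numbers are assumptions for illustration, not the outlier treatment chosen for this project.

```python
import pandas as pd

# Toy data for illustration only; the notebook would use its own `df`.
toy = pd.DataFrame(
    {"sellingprice": [9000, 10000, 11000, 12000, 13000, 95000]}
)

# Interquartile range of the selling price.
q1 = toy["sellingprice"].quantile(0.25)
q3 = toy["sellingprice"].quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = toy[(toy["sellingprice"] < lower) | (toy["sellingprice"] > upper)]
print(outliers)
```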

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.
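
As an illustration of the workflow this section asks for, the sketch below tests one hypothetical statement: "automatic and manual cars have the same mean selling price". The statement, the toy values, and the 0.05 significance level are all assumptions; they are not results or choices from this project. It uses Welch's t-test from SciPy, which does not assume equal variances in the two groups.

```python
import pandas as pd
from scipy import stats

# Toy data for illustration only (hypothetical values).
toy = pd.DataFrame({
    "transmission": ["automatic"] * 5 + ["manual"] * 5,
    "sellingprice": [15000, 16000, 17000, 18000, 19000,
                     9000, 10000, 11000, 12000, 13000],
})

automatic = toy.loc[toy["transmission"] == "automatic", "sellingprice"]
manual = toy.loc[toy["transmission"] == "manual", "sellingprice"]

# H0: both groups have the same mean price. H1: the means differ.
t_stat, p_value = stats.ttest_ind(automatic, manual, equal_var=False)

ALPHA = 0.05  # conventional significance level, an assumption here
reject_null = p_value < ALPHA
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```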

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(Mandatory for textual datasets, e.g., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used? Explain why.

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
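
The steps named in the project summary (encoding, scaling, splitting, training, and evaluating with MAE and RMSE) can be sketched as one scikit-learn pipeline. This is a minimal sketch on synthetic data, not the project's actual model: the feature list, the 80/20 split, the choice of linear regression, and every number below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical); replace with the cleaned `df`.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "year": rng.integers(2005, 2016, size=n),
    "odometer": rng.integers(5_000, 200_000, size=n),
    "transmission": rng.choice(["automatic", "manual"], size=n),
})
toy["sellingprice"] = (
    5_000
    + 1_000 * (toy["year"] - 2000)
    - 0.03 * toy["odometer"]
    + 500 * (toy["transmission"] == "automatic")
    + rng.normal(0, 500, size=n)
)

X = toy[["year", "odometer", "transmission"]]
y = toy["sellingprice"]

# Hold out 20% of the rows for testing (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale numeric features and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["year", "odometer"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["transmission"]),
])

# Fit the algorithm.
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(X_train, y_train)

# Predict on the held-out data and score the predictions.
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"MAE: {mae:.0f}  RMSE: {rmse:.0f}")
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training rows only, which avoids leaking information from the test set.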

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
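
The save-and-reload sanity check described in the two steps above can be sketched with Python's built-in `pickle` module. The tiny model, the file name `best_model.pkl`, and the temporary directory are stand-ins chosen for illustration; the notebook would save its own best-performing model to a permanent location.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model (hypothetical); the notebook would use its best model.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
best_model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "best_model.pkl"  # hypothetical file name

    # Save the fitted model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(best_model, f)

    # Load it back as a separate object.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# Sanity check: both objects should give the same prediction on unseen data.
unseen = np.array([[5.0]])
original_pred = best_model.predict(unseen)
loaded_pred = loaded_model.predict(unseen)
print(loaded_pred)
```

A pickle file should only be loaded from a trusted source, and ideally with the same library versions that were used to save it.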

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***