# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 - Raj vardhan - 2210990710**
##### **Team Member 2 - Rohan - 2210990737**
##### **Team Member 3 - Vanshul Rana - 2210992511**
##### **Team Member 4 - Krishna - 2210931008**

# **Project Summary -**

The car price prediction project utilizes machine learning techniques to estimate the prices of cars based on various features. The dataset comprises information about different car models, including their make, model, year of manufacture, mileage, engine size, fuel type, and other relevant attributes. The goal of the project is to develop a predictive model that accurately estimates the prices of cars, helping buyers, sellers, and manufacturers make informed decisions.

The project begins with data preprocessing steps, including handling missing values, encoding categorical variables, and scaling numerical features to ensure the quality and consistency of the dataset. Various machine learning algorithms are explored, including linear regression, decision trees, random forests, and gradient boosting, to determine the most suitable model for the task.
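The three preprocessing steps named above can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the project's actual pipeline: the column names mirror the car dataset, but the values and the choice of median/mode imputation and z-score scaling are assumptions.

```python
import pandas as pd

# Hypothetical toy frame: the values are invented for illustration only.
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas", None],
    "horsepower": [111.0, 154.0, None, 102.0],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# 1. Handle missing values: median for the numeric column, mode for the categorical one.
toy["horsepower"] = toy["horsepower"].fillna(toy["horsepower"].median())
toy["fueltype"] = toy["fueltype"].fillna(toy["fueltype"].mode()[0])

# 2. Encode the categorical variable as one-hot indicator columns.
encoded = pd.get_dummies(toy, columns=["fueltype"], dtype=int)

# 3. Scale a numeric feature to zero mean and unit variance (z-score).
hp = encoded["horsepower"]
encoded["horsepower_scaled"] = (hp - hp.mean()) / hp.std(ddof=0)

print(encoded)
```

Other imputation or scaling strategies may suit the real data better; the point is only the order of the steps.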

After training and evaluating several models using techniques such as cross-validation, the performance of each model is assessed using metrics such as mean absolute error (MAE), mean squared error (MSE), and R-squared (R2) score. The model with the lowest error metrics and the highest R-squared score is selected as the final predictive model.
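As a reference for the metrics named above, the sketch below computes MAE, MSE, and R2 directly from their definitions. The helper name `regression_metrics` and the sample numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE and R2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                    # average absolute error
    mse = np.mean(errors ** 2)                       # average squared error
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "R2": r2}

# Made-up actual and predicted prices, for illustration only.
actual = [13000, 16000, 14000, 17000]
predicted = [12500, 16500, 14500, 16000]
scores = regression_metrics(actual, predicted)
print(scores)
```

Lower MAE and MSE and a higher R2 indicate a better fit, which is the selection rule described above.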

Insights gained from the analysis reveal the key features that significantly influence car prices, such as the car's make, model, year of manufacture, and mileage. The model's predictions provide valuable insights into the factors driving car prices, enabling stakeholders to make informed decisions about buying, selling, or pricing cars in the market.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


In today's automotive market, the pricing of cars is a complex and multifaceted task influenced by numerous factors such as brand reputation, vehicle specifications, market demand, economic conditions, and consumer preferences. Car manufacturers and dealerships face the challenge of accurately pricing their vehicles to remain competitive, maximize profits, and meet customer expectations. To address this challenge, the development of predictive models using Artificial Intelligence (AI) and Machine Learning (ML) techniques presents a promising solution.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each algorithm, the following format is answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.
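For the cross-validation item above, a minimal k-fold sketch might look like the following. It uses an ordinary least-squares fit from NumPy as a stand-in model; the function name `k_fold_mse` and the synthetic data are assumptions made for illustration, not part of the project.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Average held-out MSE of a least-squares linear fit over k folds."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)  # shuffled, near-equal folds
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add an intercept column and fit on the training folds only.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the held-out fold.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        scores.append(np.mean((y[test_idx] - A_test @ coef) ** 2))
    return float(np.mean(scores))

# Synthetic data: price is an exact linear function of horsepower,
# so the cross-validated error should be close to zero.
hp = np.arange(50, 150, 5.0).reshape(-1, 1)
price = 5000 + 100 * hp.ravel()
cv_mse = k_fold_mse(hp, price, k=5)
print(cv_mse)
```

Hyperparameter tuning typically repeats this loop for each candidate setting and keeps the one with the best average score.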




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline



### Dataset Loading

In [None]:
# Load Dataset
df=pd.read_csv("/content/CarPrice_project.csv")

### Dataset First View

In [None]:
# Dataset First Look
df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values for {column}:")
    print(unique_values)
    print()
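Printing every unique value can be lengthy for numeric columns. A compact alternative, sketched here on a made-up toy frame, is to count the distinct values per column with `DataFrame.nunique()`:

```python
import pandas as pd

# Hypothetical toy frame; values are for illustration only.
toy = pd.DataFrame({
    "fueltype": ["gas", "gas", "diesel"],
    "doornumber": ["two", "four", "four"],
    "enginesize": [130, 152, 109],
})

# Number of distinct values in each column, largest first.
unique_counts = toy.nunique().sort_values(ascending=False)
print(unique_counts)
```

Columns with very few distinct values are usually categorical and are candidates for encoding later on.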



## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# 1. Price of the first ten cars (stored as numbers rather than strings)
price_df = pd.DataFrame({
    'car_Id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'price': [13495, 16500, 16500, 13950, 17450, 15250, 17710, 18920, 23875, 17859.17]})
print(price_df)

In [None]:
# 2. Grouping data

car_names = {'carName': ["alfa-romero giulia", "alfa-romero stelvio", "alfa-romero Quadrifoglio", "audi 100 ls", "audi 100ls", "audi fox", "audi 100ls", "audi 5000", "audi 4000", "audi 5000s (diesel)"],
             'fueltype': ['gas', 'gas', 'gas', 'gas', 'gas', 'gas', 'gas', 'gas', 'gas', 'gas'],
             'carbody': ["convertible", "convertible", "hatchback", "sedan", "sedan", "sedan", "sedan", "wagon", "sedan", "hatchback"],
             'enginesize': [130, 130, 152, 109, 136, 136, 136, 136, 131, 131]
}

# Use a separate name so the dataset loaded into `df` is not overwritten
car_df = pd.DataFrame(car_names)
grouped = car_df.groupby('enginesize')
print(grouped.get_group(136))

In [None]:
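# 3. Concatenating the dataframes
# Build two single-column frames and join them side by side (axis=1),
# without overwriting the dataset loaded into `df`
engine_type_df = pd.DataFrame({
    'enginetype': ['dohc', 'dohc', 'ohcv', 'ohc', 'ohc', 'ohc', 'ohc', 'ohc', 'ohc', 'ohc']})
cylinder_df = pd.DataFrame({
    'cylindernumber': ['four', 'four', 'six', 'four', 'five', 'five', 'five', 'five', 'five', 'five']})
engine_df = pd.concat([engine_type_df, cylinder_df], axis=1)
engine_df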

### What all manipulations have you done and insights you found?

We have done the following manipulations:

1.   Data exploration
2.   Merging two data frames
3.   Grouping data
4.   Visualizing the data
5.   Filtering the data
6.   Concatenating the data
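The merging and filtering steps listed above could look roughly like this. It is a sketch only: the frames reuse a few values from the wrangling cells, and the shared `car_Id` key on the second frame is an assumption added so the two frames can be joined.

```python
import pandas as pd

# Small frames built from values shown in the wrangling cells above.
prices = pd.DataFrame({
    "car_Id": [1, 2, 3, 4],
    "price": [13495.0, 16500.0, 16500.0, 13950.0],
})
specs = pd.DataFrame({
    "car_Id": [1, 2, 3, 4],  # assumed key, added for the merge
    "carbody": ["convertible", "convertible", "hatchback", "sedan"],
    "enginesize": [130, 130, 152, 109],
})

# Merge: combine the two frames on their shared key.
merged = prices.merge(specs, on="car_Id", how="inner")

# Filter: keep only the cars whose engine size is at least 130.
large_engines = merged[merged["enginesize"] >= 130]
print(large_engines)
```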

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Selecting the bottom 20 rows of data
df_subset = df.tail(20)

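# Plotting pie chart for 'wheelbase': each slice is the share of rows with that wheelbase value
wheelbase_counts = df_subset['wheelbase'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(wheelbase_counts, labels=wheelbase_counts.index, autopct='%1.1f%%', startangle=140)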
plt.title('Distribution of Wheelbase')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.show()

##### 1. Why did you pick the specific chart?

Proportion Visualization: Pie charts are ideal for visualizing proportions of a whole. In this case, the pie chart shows the distribution of different wheelbase lengths within the bottom 20 rows of the dataset.

Limited Number of Categories: Pie charts work well when there are a limited number of categories to represent. Wheelbase length is a continuous variable, but it takes only a handful of distinct values in this subset, so a pie chart can show the relative frequency of each value.

##### 2. What is/are the insight(s) found from the chart?

Dominant Wheelbase Lengths: Observing the size of each slice in the pie chart can provide insights into which wheelbase lengths are most prevalent within the bottom 20 rows of the dataset. Larger slices indicate more common wheelbase lengths, while smaller slices represent less common lengths.

Variety of Wheelbase Lengths: Examining the number of distinct wheelbase lengths represented in the pie chart can reveal insights into the diversity of wheelbase options within the dataset. A larger number of distinct slices suggests a wider range of wheelbase lengths available in the dataset.

Concentration of Wheelbase Lengths: Analyzing whether certain wheelbase lengths dominate the distribution or if the distribution is relatively evenly spread across different lengths can provide insights into market preferences or manufacturing trends. Concentrated distributions may indicate that specific wheelbase lengths are more popular among consumers or more commonly produced by manufacturers.
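The slice sizes discussed above correspond to the share of rows holding each wheelbase value. A minimal sketch of that calculation, using made-up wheelbase values rather than the real subset:

```python
import pandas as pd

# Hypothetical wheelbase values, for illustration only.
wheelbase = pd.Series([94.5, 94.5, 94.5, 99.8, 99.8, 105.8], name="wheelbase")

# Fraction of rows for each distinct value; these are the pie-slice proportions.
shares = wheelbase.value_counts(normalize=True)
print(shares)

# Number of distinct values, i.e. the number of slices.
n_slices = wheelbase.nunique()
print(n_slices)
```

A few large shares point to a concentrated distribution, while many small shares point to a varied one.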

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Market Segmentation: Understanding the distribution of wheelbase lengths can help businesses tailor their product offerings to better meet the needs and preferences of different customer segments. By offering a diverse range of wheelbase options, businesses can attract a wider customer base and increase sales.

Product Development: Insights from the distribution of wheelbase lengths can inform product development strategies, allowing businesses to design vehicles that cater to specific market niches or emerging trends. Developing vehicles with popular wheelbase lengths can enhance competitiveness and drive revenue growth.

Negative Business Impact:

Misalignment with Market Demand: Misinterpreting the distribution of wheelbase lengths or failing to accurately assess market demand can lead to negative growth. For example, if a business produces vehicles with wheelbase lengths that are not aligned with customer preferences or market trends, it may struggle to attract buyers and experience declining sales.

Excess Inventory: Overestimating demand for certain wheelbase lengths and producing vehicles in excess can result in excess inventory, leading to storage costs, markdowns, and reduced profit margins. Businesses may need to implement discounting strategies or liquidate inventory to clear excess stock, resulting in financial losses

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Extracting the first 20 data points
data_subset = df.head(20)

# Plotting the bar chart
plt.figure(figsize=(10, 6))
plt.bar(data_subset['CarName'], data_subset['price'], color='skyblue')
plt.xlabel('Car Name')
plt.ylabel('Price')
plt.title('Car Price for First 20 Data Points')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
plt.show()

##### 1. Why did you pick the specific chart?

Clear Representation: Bar charts provide a clear visual representation of the data. The length of each bar corresponds directly to the value being represented (here, the car's price), making it intuitive for viewers to interpret the data.

Categorical Data: Since car names are categorical data (e.g., audi fox, audi 5000), a bar chart is appropriate for displaying this type of information. Each bar represents a distinct category, making it easy to compare prices across the first 20 cars.

##### 2. What is/are the insight(s) found from the chart?

Price Comparison: The chart facilitates a direct comparison of prices between the listed cars. Viewers can easily identify which cars are more expensive or more affordable relative to others. This comparison can inform purchasing decisions and provide insights into the perceived value of different car brands.
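To compare brands rather than individual rows, the make can be derived from the car name and prices averaged per make. The sketch below assumes the make is the first word of `CarName`; the four rows pair names and prices from the wrangling cells, and the `make` column is a hypothetical helper.

```python
import pandas as pd

cars = pd.DataFrame({
    "CarName": ["alfa-romero giulia", "alfa-romero stelvio", "audi 100 ls", "audi fox"],
    "price": [13495.0, 16500.0, 13950.0, 15250.0],
})

# Assumption: the make is the first whitespace-separated token of the car name.
cars["make"] = cars["CarName"].str.strip().str.split().str[0]

# Average price per make, most expensive first.
avg_price = cars.groupby("make")["price"].mean().sort_values(ascending=False)
print(avg_price)
```

Plotting `avg_price` as a bar chart gives one bar per make, which avoids overlapping bars when a car name appears more than once.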



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Optimized Pricing Strategies: Identifying pricing patterns across the compared cars can help businesses refine their pricing strategies. They can adjust prices for specific models to remain competitive in the market while maximizing profitability.

Market Positioning: Understanding price differences can provide insights into how the business is positioned relative to competitors. This information can guide decisions on whether to target premium or budget segments of the market and tailor marketing strategies accordingly.

Negative Business Impact:

Loss of Market Share: If the analysis reveals that a model is priced consistently higher than comparable models, it could indicate that the business is overpricing its products. This may lead to a loss of market share as customers opt for more competitively priced alternatives.

Perceived Value: Large price discrepancies between comparable models may raise questions about the perceived value of the products. Customers may become skeptical about the pricing integrity of the business, leading to a negative impact on brand reputation and customer loyalty.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

data_subset = df.head(20)

# Plotting the line chart
plt.figure(figsize=(10, 6))
plt.plot(data_subset.index, data_subset['price'], marker='*', color='skyblue', linestyle='-')
plt.xlabel('Index')
plt.ylabel('Price')
plt.title('Car Prices for First 20 Data Points')
plt.xticks(data_subset.index)  # Use data indices as x-axis ticks
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

Trend Analysis: A line chart is effective for visualizing how a value changes along an ordered axis. Here the x-axis is the row index of the first 20 records rather than a time variable, so the line shows how price varies from one record to the next and should not be read as a trend over time.

Comparison: Line charts allow for easy comparison between different categories or series. If there are multiple car makes or models in the dataset, each can be represented by a separate line on the chart, enabling viewers to compare the price trends of different cars over time.

##### 2. What is/are the insight(s) found from the chart?

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Identifying Growth Opportunities: By analyzing the trends in car prices over time, businesses can identify periods of increasing demand or rising prices. This insight can help businesses capitalize on growth opportunities by adjusting their inventory, pricing strategies, and marketing efforts accordingly.

Optimizing Inventory Management: Understanding how car prices fluctuate over time allows businesses to better manage their inventory. They can anticipate changes in demand and adjust their inventory levels to meet customer needs effectively, minimizing the risk of overstocking or stockouts.

Negative Growth Potential:

Identifying Declining Trends: While positive growth opportunities can be identified through upward trends in car prices, businesses must also be vigilant for declining trends. If the line chart reveals a consistent downward trend in car prices over time, it may indicate a shrinking market or declining consumer interest. Failing to recognize and address such trends promptly could lead to negative growth for the business.

Market Saturation: A flat or stagnant trend in car prices over time may suggest market saturation, where demand for new cars is reaching a plateau. Businesses operating in saturated markets may face increased competition and pricing pressures, making it challenging to achieve growth without innovative strategies to differentiate their offerings.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

# Scatter plot of price against horsepower
plt.scatter(df['horsepower'], df['price'], color='red', marker='+')
plt.title('Scatter plot of horsepower vs. price')
plt.xlabel('horsepower')
plt.ylabel('price')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a scatter plot because it's a commonly used chart type for visualizing the relationship between two continuous variables, such as horsepower and price in this case. Scatter plots are effective for identifying patterns, trends, and correlations in the data. By plotting price against horsepower, we can visually inspect if there's any discernible relationship between these two variables, such as whether cars with higher horsepower tend to have higher prices. If there's a relationship, it might be useful for understanding the pricing dynamics in the automotive market. Additionally, scatter plots are straightforward to interpret and communicate the data effectively to others.

##### 2. What is/are the insight(s) found from the chart?

Interpreting insights from a scatter plot involves examining the pattern or trend between the two variables plotted. Here are some potential insights that could be derived from the scatter plot of price against horsepower:

Positive correlation: If the points on the scatter plot generally slope upwards from left to right, it suggests a positive correlation between horsepower and price. In other words, cars with higher horsepower tend to have higher prices.

No correlation: If the points on the scatter plot appear scattered randomly without any clear pattern, it suggests that there is no significant relationship between horsepower and price.

Outliers: Identification of outliers can provide insights into exceptional cases where cars may have unusually high or low prices given their horsepower. These outliers might represent unique models, luxury vehicles, or instances of pricing anomalies.

Clusters or patterns: Clusters or patterns in the scatter plot may indicate specific segments of the market where certain types of cars (e.g., high-performance sports cars, economy cars) are priced differently relative to their horsepower.

Potential for non-linear relationships: While a linear trend is commonly assumed, the scatter plot might reveal that the relationship between price and horsepower is non-linear. This could indicate that factors other than horsepower also influence the pricing of cars.
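The strength of a linear relationship like the one described above can be quantified with the Pearson correlation coefficient. This is a minimal sketch on made-up horsepower and price values, not a result from the dataset:

```python
import numpy as np

# Hypothetical values, for illustration only.
horsepower = np.array([70, 90, 110, 150, 200], dtype=float)
price = np.array([7000, 9500, 13000, 18000, 30000], dtype=float)

# Pearson correlation: +1 is a perfect positive linear relationship,
# 0 is no linear relationship, -1 is a perfect negative one.
r = np.corrcoef(horsepower, price)[0, 1]
print(round(r, 3))
```

A value near zero does not rule out a non-linear relationship, so the scatter plot should still be inspected.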

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from analyzing the relationship between price and horsepower in the automotive market can indeed have a positive business impact if leveraged effectively. Here's how:

Optimized Pricing Strategies: Understanding the positive correlation between price and horsepower can help automotive companies and dealerships optimize their pricing strategies. They can price high-horsepower vehicles accordingly, maximizing profits without deterring potential buyers.

Product Development: Insight into the preferences of consumers regarding horsepower and price can guide product development efforts. Manufacturers can adjust their product offerings to cater to market demand, potentially developing new models or enhancing existing ones to align with consumer expectations.

However, there are potential negative impacts if the insights are not properly interpreted or addressed:

Overpricing or Underpricing: Relying solely on the positive correlation between price and horsepower without considering other factors may lead to overpricing or underpricing of vehicles. Overpricing can deter potential buyers, while underpricing may result in missed revenue opportunities or reduced profit margins.

Market Segmentation Issues: Focusing too narrowly on high-horsepower vehicles may neglect other segments of the market with different preferences or budget constraints. This could result in missed opportunities to capture market share from competitors or meet the needs of diverse customer segments.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Selecting only the first 30 rows of data
df_subset = df.head(30)

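# Plotting the frequency of each 'CarName' (a categorical column, so count occurrences per name)
name_counts = df_subset['CarName'].value_counts()
plt.figure(figsize=(10, 6))
plt.bar(name_counts.index, name_counts.values, color='orange', edgecolor='black')
plt.title('Frequency of Car Names')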
plt.xlabel('Car Name')
plt.ylabel('Frequency')
plt.xticks(rotation=90)
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

Comparative Analysis: By plotting histograms for multiple categorical variables, we can compare the frequency distributions of different categories within each variable. This allows for easy identification of patterns or differences in the data.

Readability: Histograms are easy to interpret and understand, making them suitable for communicating insights to others. They present the data in a clear and concise manner, facilitating quick comprehension of the frequency distribution.

##### 2. What is/are the insight(s) found from the chart?

Variety of Car Models: Analyzing the spread of frequencies across different car names can highlight the diversity of car models available in the dataset. A wide range of car models with varying frequencies suggests a diverse product portfolio.

Preference for Car Body Types: Understanding the distribution of car body types can indicate consumer preferences or trends in the market. For instance, if hatchbacks have a higher frequency compared to other body types, it suggests a preference for compact and versatile vehicles.

Identification of Outliers: Unusually tall bars or unexpected patterns in the histograms may indicate outliers or anomalies in the data. These outliers could represent rare or niche car models that require further investigation

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Product Development: Insights into consumer preferences for certain car models or body types can guide product development efforts. Businesses can prioritize the development of new models or variants that are in high demand, enhancing their competitiveness in the market.

Inventory Management: Knowledge of popular car models and body types can inform inventory management decisions, allowing businesses to optimize their stock levels to meet customer demand more effectively. This can lead to improved efficiency and reduced inventory costs.

Negative Business Impact:

Overlooked Market Segments: Focusing exclusively on the most popular car models or body types identified in the histograms may lead to overlooking niche or emerging market segments. Neglecting these segments could result in missed opportunities for growth and potential loss of market share to competitors who cater to these segments.

Limited Product Differentiation: Relying solely on insights from histograms may result in limited product differentiation strategies. Businesses may fail to differentiate their offerings sufficiently from competitors, leading to commoditization and price-based competition that can erode profit margins

#### Chart - 6

In [None]:
# Chart - 6 visualization code

# Extracting the first 20 data points
data_subset = df.head(20)

# Define attributes to compare; 'carlength' is assumed to be a numeric column of the
# dataset (bar heights must be numeric, so 'CarName' cannot be plotted this way)
attributes = ['carlength', 'horsepower', 'price']

# Plotting the multiple bar chart without a loop
plt.figure(figsize=(12, 6))

# Plotting the first attribute ('carlength')
plt.subplot(1, len(attributes), 1)
plt.bar(data_subset.index, data_subset[attributes[0]], label=attributes[0], color='skyblue')
plt.xlabel('Index')
plt.ylabel(attributes[0])
plt.title(attributes[0] + ' for First 20 Data Points')
plt.xticks(data_subset.index)
plt.legend()

# Plotting the second attribute ('horsepower')
plt.subplot(1, len(attributes), 2)
plt.bar(data_subset.index, data_subset[attributes[1]], label=attributes[1], color='lightgreen')
plt.xlabel('Index')
plt.ylabel(attributes[1])
plt.title(attributes[1] + ' for First 20 Data Points')
plt.xticks(data_subset.index)
plt.legend()

# Plotting the third attribute ('Price')
plt.subplot(1, len(attributes), 3)
plt.bar(data_subset.index, data_subset[attributes[2]], label=attributes[2], color='salmon')
plt.xlabel('Index')
plt.ylabel(attributes[2])
plt.title(attributes[2] + ' for First 20 Data Points')
plt.xticks(data_subset.index)
plt.legend()

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Categorical and Numerical Data: The chart effectively combines categorical data (car names) with numerical data (car lengths). This makes it suitable for visualizing the relationship between car names and their corresponding lengths.

Readability: Multiple bar charts are easy to interpret, making them suitable for communicating insights to a broad audience. The clear labeling of car names on the x-axis and the corresponding lengths on the y-axis facilitates easy understanding of the data.

Flexibility: This chart type provides flexibility in terms of customization. You can adjust parameters such as figure size, colors, and orientation of labels to enhance readability and aesthetics.

##### 2. What is/are the insight(s) found from the chart?

Identification of Longest and Shortest Cars: The bar chart allows for easy identification of the longest and shortest cars among the models included in the dataset. The tallest bar represents the longest car, while the shortest bar represents the shortest car.

Comparison of Car Lengths: Comparing the lengths of different car models can reveal patterns or trends in car design or market segmentation. For example, certain car manufacturers may produce longer vehicles on average, while others may focus on compact models.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Product Development: Insights into car lengths can inform product development strategies, allowing businesses to design vehicles that align with consumer preferences. By understanding which car lengths are popular or desirable, businesses can develop new models that cater to specific market segments, potentially leading to increased sales and market share.

Market Positioning: Knowledge of car lengths across different models can help businesses strategically position their products in the market. For example, if there's a demand for longer luxury vehicles, businesses can focus on marketing their high-end models with spacious interiors and advanced features to target affluent consumers.

Negative Business Impact:

Misalignment with Consumer Preferences: If businesses misinterpret or overlook insights from the chart, they may develop products that do not align with consumer preferences. For example, investing in the production of longer vehicles when there's actually a growing demand for compact cars could result in excess inventory and decreased profitability.

Market Saturation: Overemphasis on certain car lengths based on insights from the chart without considering broader market trends could lead to market saturation. If businesses flood the market with similar products, it may dilute brand value and erode pricing power, ultimately resulting in negative growth.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Selecting the first 20 rows of data
df_subset = df.head(20)

# Create violin plot
plt.figure(figsize=(12, 8))
sns.violinplot(x='citympg', y='highwaympg', data=df_subset)
plt.title('Violin Plot of Highway MPG by City MPG')
plt.xlabel('citympg')
plt.ylabel('highwaympg')
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

Distribution Comparison: Violin plots provide a clear visual representation of the distribution of data within each category. The width of the violin corresponds to the frequency or density of data points at different values of the numerical variable.

Summary Statistics: Violin plots typically include an inner box plot that marks the median and quartiles, providing additional insight into the distribution of the data within each category.

##### 2. What is/are the insight(s) found from the chart?

Variability in Highway MPG: The length and shape of each violin indicate how highway mpg varies among cars that share the same city mpg value. A longer violin suggests greater variability within that group, while a wider section shows where values are most concentrated.

Central Tendency: The median marker inside each violin represents the median highway mpg for the corresponding city mpg value; the thickest part of the violin marks the most common values rather than the median. Comparing the medians across groups shows how closely the two fuel-economy measures track each other.
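The median and quartiles that a violin plot summarizes can also be computed per group. The sketch below uses the two columns plotted in the code above, with made-up values:

```python
import pandas as pd

# Hypothetical mpg values, for illustration only.
mpg = pd.DataFrame({
    "citympg": [19, 19, 19, 24, 24, 24],
    "highwaympg": [24, 25, 26, 28, 30, 32],
})

# Lower quartile, median and upper quartile of highway mpg for each city mpg value.
summary = mpg.groupby("citympg")["highwaympg"].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

Each row of `summary` corresponds to one violin, and the gap between the 0.25 and 0.75 columns is that group's interquartile range.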

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Informed Pricing Strategies: Understanding the variability in car prices and identifying popular price segments can help businesses develop more informed pricing strategies. By aligning pricing with customer preferences and market demand, businesses can attract more customers and increase sales, leading to positive growth.

Market Segmentation Opportunities: Insights into the distribution of car prices across different car names can help identify opportunities for market segmentation. Businesses can tailor their marketing and product offerings to target specific customer segments based on their preferences for price ranges and car models, leading to increased customer satisfaction and loyalty

Negative Business Impact:

Lost Sales Opportunities: Misinterpreting pricing insights or failing to adjust pricing strategies in response to market dynamics can lead to lost sales opportunities. For example, pricing cars above the preferred price segments of target customers may result in decreased demand and negative growth.

Brand Perception Damage: Inconsistent pricing or pricing outliers compared to competitors within the same car segment can damage brand perception and erode customer trust. Customers may perceive such pricing strategies as unfair or unjustified, leading to negative word-of-mouth and reputational damage.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Extracting the first 60 data points
data_subset = df.head(60)

# Creating the strip plot
plt.figure(figsize=(15, 10))
sns.stripplot(x='carheight', y='carwidth', data=data_subset, jitter=True)
plt.xlabel('Car Height')
plt.ylabel('Car Width')
plt.title('Strip Plot between Car Height and Car Width (First 60 Data Points)')
plt.show()

##### 1. Why did you pick the specific chart?

Individual Data Points: A strip plot displays individual data points, providing a detailed view of the distribution of the numerical variable within each category. This allows for precise interpretation of the data and identification of outliers or patterns.

Categorical Comparison: Strip plots allow for easy comparison of the distribution of the numerical variable across different categories. By plotting all data points for each category along the same axis, viewers can quickly assess differences or similarities in the distribution of the variable among categories.

##### 2. What is/are the insight(s) found from the chart?

Variability in Car Width: The spread of data points along the y-axis for each car height indicates the variability in car width within that group. Widely spread points suggest a larger range of widths, while closely clustered points indicate a narrower range.

Central Tendency: The central tendency of the data points for each car height provides insight into the typical car width at that height. Comparing the central tendencies across different heights can reveal whether taller cars tend to be wider or narrower on average.

Outlier Detection: Any data points that significantly deviate from the main cluster may represent outliers. These outliers could indicate unusually wide or narrow models that warrant a closer look.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Informed Pricing Strategies: Understanding the variability in car prices and identifying popular price segments can help businesses develop more informed pricing strategies. By aligning pricing with customer preferences and market demand, businesses can attract more customers and increase sales, leading to positive growth.

Market Segmentation Opportunities: Insights into the distribution of car prices across different car names can help identify opportunities for market segmentation. Businesses can tailor their marketing and product offerings to target specific customer segments based on their preferences for price ranges and car models, leading to increased customer satisfaction and loyalty.

Negative Business Impact:

Lost Sales Opportunities: Misinterpreting pricing insights or failing to adjust pricing strategies in response to market dynamics can lead to lost sales opportunities. For example, pricing cars above the preferred price segments of target customers may result in decreased demand and negative growth.

Brand Perception Damage: Inconsistent pricing or pricing outliers compared to competitors within the same car segment can damage brand perception and erode customer trust. Customers may perceive such pricing strategies as unfair or unjustified, leading to negative word-of-mouth and reputational damage.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

# Select the first 40 rows of data
df_subset = df.head(40)

# Create a catplot
sns.catplot(data=df_subset, x='wheelbase', y='enginesize', kind='bar', height=6, aspect=2)
plt.title('Wheel base V/S engine size')
plt.xlabel('wheel base')
plt.ylabel('engine size')

plt.show()


##### 1. Why did you pick the specific chart?

Comparison Across Categories: Catplots provide an effective way to compare a continuous variable across multiple categories. In this case, we can visualize how the engine size varies across the distinct wheelbase values in the first 40 rows.
Categorical Representation: The continuous variable 'wheelbase' is not binned here; the plot treats each distinct wheelbase value as its own category, which lets us explore how engine size differs from one wheelbase value to the next.
Bar Plot Representation: Because the catplot is drawn with kind='bar', each bar shows the mean engine size for a wheelbase value, and the error bar shows the uncertainty around that mean (a 95% confidence interval by default in seaborn). It does not display medians, quartiles, or outliers; a box plot (kind='box') would be needed for those.
Space Efficiency: Catplots can efficiently represent multiple groups simultaneously, making them suitable for comparing the relationship between two variables across multiple categories.

##### 2. What is/are the insight(s) found from the chart?

Variation in Engine Size: The bar plot displays the mean engine size for each wheelbase value. It shows how engine sizes vary across different wheelbase lengths.
Central Tendency: The height of each bar represents the mean engine size for a particular wheelbase length. By comparing bar heights across different wheelbase lengths, we can observe any trends or differences in the typical engine size.
Variability: The error bar on each bar indicates how uncertain the mean is for that wheelbase length. A longer error bar suggests greater variability or fewer cars in that group, while a shorter error bar indicates a more stable estimate.
Outliers: A bar plot of means does not show individual data points, so outliers cannot be identified from this chart; a strip plot or box plot would be needed for that.
Relationship Between Wheelbase and Engine Size: By examining the bar heights, we can assess the relationship between wheelbase and engine size. If the bars rise or fall steadily as wheelbase increases, it suggests a potential relationship between these variables.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Product Development: Understanding the relationship between wheelbase and engine size can guide product development efforts. For example, if there's a trend of larger engine sizes for longer wheelbases, this insight can inform the development of larger and more powerful vehicles tailored to customer preferences.
Market Segmentation: Identifying clusters of customers based on preferences for certain combinations of wheelbase and engine size can enable targeted marketing and product offerings. For instance, if there's a segment of customers who prefer compact cars with smaller engine sizes, the company can tailor marketing campaigns and product features to appeal to this segment.
Competitive Advantage: Leveraging insights into customer preferences for specific combinations of vehicle attributes can give the company a competitive advantage. By aligning product offerings with customer preferences, the company can differentiate itself from competitors and attract more customers.
Negative Growth:

Production Costs: If there's a mismatch between customer preferences and the company's existing product lineup, it could lead to increased production costs and inefficiencies. For example, if the company continues to produce vehicles with larger engine sizes for shorter wheelbases despite a declining demand for such configurations, it may result in excess inventory and higher production costs.
Market Saturation: If the company fails to adapt its product offerings based on changing customer preferences revealed by insights from the data, it may face challenges related to market saturation. For instance, if competitors are quick to respond to emerging trends in vehicle attributes while the company lags behind, it may lose market share and experience negative growth.
Brand Perception: Ignoring insights into customer preferences and failing to innovate in line with market trends can negatively impact brand perception. Customers may perceive the company as outdated or out of touch with their needs, leading to decreased brand loyalty and negative growth in sales.

#### Chart - 10

In [None]:
# Chart - 10 visualization code


# Extracting the first 40 data points
data_subset = df.head(40)

# Creating the box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='carbody', y='carlength', data=data_subset)
plt.xlabel('Car Body')
plt.ylabel('Car Length')
plt.title('Box Plot between Car Body and Car Length (First 40 Data Points)')
plt.show()

##### 1. Why did you pick the specific chart?

I selected a box plot because it's a suitable choice for visualizing the distribution of a continuous variable ('carlength') across different categories or levels of a categorical variable ('carbody'). Here's why I chose this specific chart:

Comparison of Distributions: Box plots provide a clear visual representation of the distribution of a continuous variable across different categories or levels of a categorical variable. In this scenario, we can compare the distribution of car lengths for various car body types.
Identification of Outliers: Box plots display outliers as individual data points beyond the whiskers, making it easy to identify any extreme values or anomalies in the data.
Summary Statistics: Box plots show key summary statistics such as the median, quartiles, and range of the data, providing insights into the central tendency and variability of the distribution.
Space Efficiency: Box plots are space-efficient, making them suitable for visualizing multiple distributions simultaneously, especially when dealing with large datasets or many categories.

##### 2. What is/are the insight(s) found from the chart?

From the box plot between 'carbody' and 'carlength', we can derive several insights:

Variation in Car Length: The box plot displays the distribution of car lengths for different car body types. It shows how car lengths vary across different categories of car bodies.
Central Tendency: The median line within each box represents the central tendency of car lengths for a particular car body type. By comparing the positions of the medians across different car body types, we can observe any trends or differences in the central tendency of car lengths.
Variability: The length of the box and the whiskers indicate the variability of car lengths within each car body type category. A longer box and wider whiskers suggest greater variability, while a shorter box and narrower whiskers indicate less variability.
Outliers: Outliers, if present, are visually identifiable as individual data points beyond the whiskers of the box plot. These outliers represent extreme values of car length for specific car body types.
Relationship Between Car Body and Length: By examining the box plot, we can assess the relationship between car body type and car length. If there is a clear pattern or trend in the distribution of car lengths across different car body types, it suggests a potential relationship between these variables.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Chart - 11 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Extracting the first 20 data points
data_subset = df.head(20)

# Calculate the correlation matrix
numeric_data_subset = data_subset.select_dtypes(include=[np.number])
correlation_matrix = numeric_data_subset.corr()

# Create the correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of First 20 Data Points')
plt.show()

##### 1. Why did you pick the specific chart?

Comprehensive Visualization: A correlation heatmap allows us to examine the correlation between all pairs of variables in the dataset simultaneously. This comprehensive view helps in identifying patterns and relationships that may not be immediately apparent when examining individual correlations.

Color-Coding for Interpretation: The heatmap uses a color scale to represent the strength and direction of correlations, making it easy to interpret the relationships between variables. Warm colors (e.g., red) indicate positive correlations, while cool colors (e.g., blue) indicate negative correlations.

Annotated Values: The heatmap can be annotated with correlation coefficients, providing precise numerical information about the strength of the relationships between variables. This makes it easy to identify strong or weak correlations and prioritize further analysis accordingly.

##### 2. What is/are the insight(s) found from the chart?

Positive Correlations: Positive correlations are indicated by warmer colors (e.g., shades of red) in the heatmap. For example, we might observe strong positive correlations between certain pairs of variables such as 'carlength' and 'curbweight', or 'enginesize' and 'horsepower'. These correlations suggest that as one variable increases, the other tends to increase as well.

Negative Correlations: Negative correlations are indicated by cooler colors (e.g., shades of blue) in the heatmap. We might observe negative correlations between variables such as 'citympg' and 'price', or 'highwaympg' and 'price'. These correlations suggest that as one variable increases, the other tends to decrease.

Strong Correlations: Cells in the heatmap with darker shades (either dark red or dark blue) indicate stronger correlations between variables. Strong positive correlations suggest a strong linear relationship between variables, while strong negative correlations suggest an inverse relationship.

Weak Correlations: Lighter shades in the heatmap (e.g., pale red or pale blue) indicate weaker correlations between variables. Weak correlations suggest a weaker linear relationship between variables, which may still be meaningful but less pronounced.
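
To complement reading colors off the heatmap, the strongest pairs can also be listed numerically. The following is a minimal sketch on a made-up three-column frame; the column names `a`, `b`, `c` and their values are assumptions for illustration, not columns from the car dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the car dataset
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],   # rises almost linearly with "a"
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],   # tends to fall as "a" rises
})

corr = toy.corr()

# Keep only the upper triangle without the diagonal so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (row, column) pairs and sort by absolute correlation
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

The same pattern could be applied to `correlation_matrix` from the cell above, keeping in mind that correlations computed from only 20 rows are noisy.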

#### Chart - 12 - Pair Plot

In [None]:
# Pair Plot visualization code

# Extracting the first 10 data points
data_subset = df.head(10)

# Create the pair plot
sns.pairplot(data_subset)
plt.suptitle('Pair Plot of First 10 Data Points')
plt.show()

##### 1. Why did you pick the specific chart?

Exploratory Data Analysis: Pair plots allow for quick and easy exploration of relationships between pairs of variables. By plotting every variable against every other variable, it's possible to identify potential patterns, trends, and correlations in the data.

Identifying Correlations: Pair plots are particularly useful for identifying correlations between numerical variables. By examining the scatterplots, you can visually assess the direction and strength of the relationships between variables.

Diagnosing Multicollinearity: Multicollinearity occurs when two or more variables are highly correlated with each other. Pair plots can help diagnose multicollinearity by highlighting pairs of variables with strong correlations, which is important for regression analysis and other predictive modeling tasks.

##### 2. What is/are the insight(s) found from the chart?

Correlation Strength: By examining the scatterplots, you can assess the strength of the relationships between pairs of variables. Strong correlations are indicated by tightly clustered points that follow a clear linear trend: upward sloping for a strong positive correlation and downward sloping for a strong negative one. Widely scattered points with no clear trend indicate a weak correlation.

Correlation Direction: The direction of the relationship between variables can be identified by observing the slope of the scatterplots. Positive correlations are characterized by an upward sloping pattern, indicating that as one variable increases, the other variable tends to increase as well. Negative correlations are characterized by a downward sloping pattern, indicating that as one variable increases, the other variable tends to decrease.

Outlier Detection: Outliers, or data points that deviate significantly from the overall pattern of the scatterplot, can be identified visually. Outliers may represent unusual or anomalous observations that warrant further investigation.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1. There is a significant correlation between car price and engine size.
2. The fuel type of a car has a significant impact on its horsepower.
3. There is a significant difference in car prices between cars with different fuel types.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant correlation between car price and engine size.
Alternative Hypothesis (H1): There is a significant correlation between car price and engine size.
We will use Pearson correlation coefficient to test this hypothesis. If the p-value is less than the significance level (e.g., α = 0.05), we reject the null hypothesis and conclude that there is a significant correlation between car price and engine size. Otherwise, we fail to reject the null hypothesis.







#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import pearsonr
# Perform Statistical Test to obtain P-Value

car_price = df['price'].head(40)
engine_size = df['enginesize'].head(40)

# Perform Pearson correlation coefficient test
corr_coefficient, p_value = pearsonr(car_price, engine_size)

# Print the correlation coefficient and the p-value
print("Pearson correlation coefficient:", corr_coefficient)
print("P-value:", p_value)

##### Which statistical test have you done to obtain P-Value?

I performed the Pearson correlation coefficient test.

##### Why did you choose the specific statistical test?


I chose the Pearson correlation coefficient test because it is commonly used to measure the strength and direction of the linear relationship between two continuous variables. In this case, we want to assess whether there is a significant correlation between car price and engine size, which are both continuous variables.

The Pearson correlation coefficient test provides a measure of the strength of the linear relationship between the variables, ranging from -1 to 1, where:

1 indicates a perfect positive linear relationship,
-1 indicates a perfect negative linear relationship, and
0 indicates no linear relationship.

Additionally, the p-value associated with the Pearson correlation coefficient test allows us to determine the statistical significance of the observed correlation. If the p-value is less than a chosen significance level (e.g., α = 0.05), we reject the null hypothesis and conclude that there is a significant correlation between the variables. Otherwise, we fail to reject the null hypothesis.

Therefore, the Pearson correlation coefficient test is suitable for assessing the relationship between car price and engine size in this scenario.
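
As a minimal sketch of what `pearsonr` computes, the coefficient can be reproduced from its definition on a small sample. The values below are invented for illustration and are not rows from the dataset.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical toy values, not rows from the car dataset
engine_size = np.array([90.0, 110.0, 130.0, 150.0, 200.0])
price = np.array([6000.0, 8500.0, 10000.0, 16000.0, 30000.0])

# Pearson r = sum(x*y) / sqrt(sum(x^2) * sum(y^2)) on mean-centred values
x = engine_size - engine_size.mean()
y = price - price.mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# scipy returns the same coefficient together with a two-sided p-value
r_scipy, p_value = pearsonr(engine_size, price)

print("Manual r:", r_manual)
print("scipy r:", r_scipy, "p-value:", p_value)
```

The p-value depends on the sample size as well as on r, so the same coefficient is less convincing on 5 points than on 40.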






### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): The fuel type of a car does not have a significant impact on its horsepower.
Alternative Hypothesis (H1): The fuel type of a car has a significant impact on its horsepower.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import ttest_ind

# Extract horsepower for gas and diesel cars (up to the first 40 cars of each fuel type)
horsepower_gas = df[df['fueltype'] == 'gas']['horsepower'].head(40)
horsepower_diesel = df[df['fueltype'] == 'diesel']['horsepower'].head(40)

# Perform t-test
t_statistic, p_value = ttest_ind(horsepower_gas, horsepower_diesel, equal_var=False)

# Print the t-statistic and the p-value
print("T-Statistic:", t_statistic)
print("P-value:", p_value)

##### Which statistical test have you done to obtain P-Value?

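I performed an independent samples t-test. Because the code passes equal_var=False, this is Welch's version of the test, which does not assume that the two groups have equal variances.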

##### Why did you choose the specific statistical test?

I chose the independent samples t-test because it is appropriate for comparing the means of two independent groups when the dependent variable (horsepower in this case) is continuous and normally distributed (or approximately normally distributed) within each group.

In this scenario, we want to assess whether there is a significant difference in horsepower between gas and diesel cars, which are two independent groups. The t-test allows us to compare the means of these two groups and determine whether the observed difference in horsepower is statistically significant.

Additionally, since we are comparing two groups (gas and diesel), the t-test is suitable. If we were comparing the means across more than two groups, we would use ANOVA (Analysis of Variance) instead.

Therefore, the t-test is the appropriate statistical test for testing the hypothesis regarding the impact of fuel type on horsepower in this scenario.
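
To make the test statistic concrete, here is a small sketch that computes Welch's t-statistic by hand and compares it with `ttest_ind(..., equal_var=False)`. The horsepower values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical toy horsepower values for two fuel types
gas_hp = np.array([68.0, 88.0, 102.0, 111.0, 160.0, 184.0])
diesel_hp = np.array([52.0, 56.0, 72.0, 95.0])

# Welch's t: difference in means divided by the unpooled standard error
standard_error = np.sqrt(
    gas_hp.var(ddof=1) / len(gas_hp) + diesel_hp.var(ddof=1) / len(diesel_hp)
)
t_manual = (gas_hp.mean() - diesel_hp.mean()) / standard_error

# equal_var=False selects Welch's t-test in scipy
t_scipy, p_value = ttest_ind(gas_hp, diesel_hp, equal_var=False)

print("Manual t:", t_manual)
print("scipy t:", t_scipy, "p-value:", p_value)
```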







### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in car prices between cars with different fuel types.
Alternative Hypothesis (H1): There is a significant difference in car prices between cars with different fuel types.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import f_oneway

# Extract car prices for gas and diesel cars (up to the first 40 cars of each fuel type)
car_prices_gas = df[df['fueltype'] == 'gas']['price'].head(40)
car_prices_diesel = df[df['fueltype'] == 'diesel']['price'].head(40)

# Perform ANOVA test
f_statistic, p_value = f_oneway(car_prices_gas, car_prices_diesel)

# Print the F-statistic and the p-value
print("F-Statistic:", f_statistic)
print("P-value:", p_value)

##### Which statistical test have you done to obtain P-Value?

I performed an Analysis of Variance (ANOVA) test.







##### Why did you choose the specific statistical test?

I chose the Analysis of Variance (ANOVA) test because it compares the means of independent groups when the dependent variable (car prices in this case) is continuous. It is most commonly used for three or more groups, but it is also valid for two.

In this scenario, we want to assess whether there is a significant difference in car prices between the two fuel types in the data (gas and diesel). With only two groups, one-way ANOVA is equivalent to an independent samples t-test that assumes equal variances: the F-statistic equals the square of the t-statistic, and the p-values are identical.

Additionally, ANOVA extends directly to more than two groups: it accounts for the overall variation among the groups and provides a single p-value to assess the significance of the observed differences.

Therefore, the ANOVA test is the appropriate statistical test for testing the hypothesis regarding the difference in car prices between cars with different fuel types in this scenario.
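
When only two groups are compared, one-way ANOVA and the pooled-variance t-test give the same answer. The sketch below checks this on invented prices; the numbers are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Hypothetical toy prices for two fuel types
gas = np.array([7000.0, 9000.0, 12000.0, 15000.0, 18000.0])
diesel = np.array([9500.0, 13000.0, 17000.0, 22000.0])

f_stat, p_anova = f_oneway(gas, diesel)

# equal_var=True is the pooled-variance (Student) t-test
t_stat, p_ttest = ttest_ind(gas, diesel, equal_var=True)

print("F-statistic:", f_stat, "t squared:", t_stat ** 2)
print("ANOVA p-value:", p_anova, "t-test p-value:", p_ttest)
```

If the two groups have clearly different variances, Welch's t-test (`equal_var=False`) may be the more cautious choice, and its p-value will generally differ from the ANOVA result.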







## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
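# Handling Missing Values & Missing Value Imputation

# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)

# Impute missing numerical values with the mean of each column
# (non-numerical columns are left unchanged)
numeric_columns = df.select_dtypes(include=[np.number]).columns
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())

# Check if missing values have been handled
missing_values_after_imputation = df.isnull().sum()
print("Missing Values After Imputation:\n", missing_values_after_imputation)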

#### What all missing value imputation techniques have you used and why did you use those techniques?

In the example code provided, I used a simple imputation technique of replacing missing values with the mean of the respective column. This is a commonly used technique for handling missing numerical data. Here's why I used this technique:

Mean Imputation: I used mean imputation because it is a straightforward method and is often a reasonable approach when the missing values are assumed to be missing at random (MAR). Mean imputation leaves the mean of the variable unchanged, although it reduces the variance of the column because every imputed value is identical.
Applicability to Numerical Data: Mean imputation is applicable to numerical data types, making it suitable for handling missing values in columns containing continuous or interval data, such as 'normalizedlosses'.
Sensitivity to Outliers: Because the mean is calculated from all available data points in the column, extreme values pull it towards them, so mean imputation is not robust to outliers. For strongly skewed columns, median imputation is usually the safer choice.
While mean imputation is a simple and commonly used technique, it's important to note that it has limitations and assumptions. For example, it assumes that the missing values are missing completely at random (MCAR) or missing at random (MAR), and it may lead to biased estimates if that assumption does not hold. Additionally, mean imputation is not appropriate for categorical variables or when the missingness is related to the value itself (e.g., higher-income individuals are less likely to disclose their income).

Other imputation techniques, such as median imputation, mode imputation, k-nearest neighbors (KNN) imputation, or predictive modeling, may be more appropriate depending on the characteristics of the data and the nature of missingness. It's important to carefully consider the assumptions and limitations of each technique and choose the most suitable approach based on the specific dataset and analytical goals.
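
The difference between mean and median imputation can be seen on a small made-up column containing one extreme value. The series name and values below are assumptions for illustration, not data from the project.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one extreme value
toy = pd.Series([100.0, 110.0, 120.0, np.nan, 130.0, 900.0], name="toy_feature")

# Mean of the observed values is pulled upwards by the extreme value 900
mean_filled = toy.fillna(toy.mean())

# Median of the observed values stays near the bulk of the data
median_filled = toy.fillna(toy.median())

print("Imputed with mean:", mean_filled.iloc[3])
print("Imputed with median:", median_filled.iloc[3])
```

Here the mean-imputed value is larger than four of the five observed values, which is why the median is often preferred for skewed columns.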

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments


# Identify outliers using z-score
from scipy import stats
z_scores = np.abs(stats.zscore(df['price']))
threshold = 3
outliers = df[np.abs(z_scores) > threshold]

print("Number of outliers detected using z-score method:", len(outliers))

# Removal of outliers
df_cleaned = df[np.abs(z_scores) <= threshold]

# Transformation (log transformation)
df['log_price'] = np.log(df['price'])

# Imputation (replace outliers with median)
median_price = df_cleaned['price'].median()
df['price_imputed'] = np.where(np.abs(z_scores) > threshold, median_price, df['price'])

# Visualize the transformed and imputed data
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.histplot(df['log_price'], kde=True)
plt.title("Distribution of Log Transformed Price")
plt.subplot(1, 2, 2)
sns.histplot(df['price_imputed'], kde=True)
plt.title("Distribution of Imputed Price")
plt.show()




In [None]:
# Outlier treatments

# Identify outliers using z-score
z_scores = np.abs(stats.zscore(df['price']))
threshold = 3
outliers = df[np.abs(z_scores) > threshold]

# Remove outliers
df_cleaned = df[np.abs(z_scores) <= threshold]

# Logarithmic transformation
df['log_price'] = np.log(df['price'])

# Imputation (replace outliers with median)
median_price = df_cleaned['price'].median()
df['price_imputed'] = np.where(np.abs(z_scores) > threshold, median_price, df['price'])

# Binning
bins = [0, 10000, 20000, np.inf]
labels = ['Low', 'Medium', 'High']
df['price_bin'] = pd.cut(df['price'], bins=bins, labels=labels)

from scipy.stats.mstats import winsorize

# Apply winsorization to limit outliers
win_price = winsorize(df['price'], limits=[0.05, 0.05])
df['price_winsorized'] = win_price

# Clip outliers at a threshold
threshold = 30000
df['price_clipped'] = df['price'].clip(upper=threshold)


##### What all outlier treatment techniques have you used and why did you use those techniques?

Z-Score Method for Outlier Detection:
The z-score method is a commonly used statistical technique for identifying outliers in a dataset.
It calculates the number of standard deviations a data point is away from the mean.
Outliers are typically defined as data points whose absolute z-score exceeds a chosen threshold (e.g., 3 standard deviations from the mean).
This method provides a systematic way to identify potential outliers based on their deviation from the mean.

Winsorization for Outlier Treatment:
Winsorization handles outliers by replacing extreme values with less extreme ones instead of removing them.
Values below a lower percentile or above an upper percentile (here the 5th and 95th) are replaced with the nearest value inside that range.
This keeps every row in the dataset while limiting the influence of extreme values.
This technique is suitable when removing outliers entirely may lead to loss of valuable information or bias in the dataset.

The code also demonstrates removal of outlier rows, a log transformation, replacement of outliers with the median, binning, and clipping at a fixed upper threshold.

These techniques were chosen because they are effective and widely used in practice for outlier detection and treatment. Additionally, Winsorization is a conservative approach that preserves the integrity of the data while reducing the influence of extreme values, making it suitable for various types of datasets and analytical purposes.
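
The two techniques can be compared side by side on a small synthetic array. This is only a sketch: the 20 prices below are invented so that exactly one value is extreme, and they are not taken from the dataset.

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

# Hypothetical prices: 19 evenly spaced values plus one extreme value
prices = np.append(np.arange(7000.0, 16500.0, 500.0), 45000.0)

# Z-score detection: flag values more than 3 standard deviations from the mean
z_scores = np.abs(stats.zscore(prices))
flagged = prices[z_scores > 3]

# Winsorization: with 20 values and 5% limits, the single lowest and the
# single highest value are replaced by their nearest neighbours
winsorized = np.asarray(winsorize(prices, limits=[0.05, 0.05]))

print("Flagged by z-score:", flagged)
print("Min/max before:", prices.min(), prices.max())
print("Min/max after winsorizing:", winsorized.min(), winsorized.max())
```

Note that winsorization changes the lowest value as well as the highest, even though only the highest was flagged by the z-score rule.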

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

# Identify categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns

# One-hot encode categorical columns
data_encoded = pd.get_dummies(df, columns=categorical_columns)

# Alternatively, you can use label encoding for ordinal categorical variables
# from sklearn.preprocessing import LabelEncoder
# label_encoder = LabelEncoder()
# for col in categorical_columns:
#     df[col] = label_encoder.fit_transform(df[col])

# Print the encoded dataset
print("Encoded Dataset:")
print(data_encoded.head())

#### What all categorical encoding techniques have you used & why did you use those techniques?

In the provided code, I used one-hot encoding as the categorical encoding technique. Here's why I chose this technique:

One-Hot Encoding:
One-hot encoding is suitable for categorical variables with no ordinal relationship among categories.
It represents each category as a binary (0 or 1) feature, creating a new binary column for each category in the original categorical variable.
One-hot encoding preserves the individuality of each category and does not impose any ordinal relationship among them.
This technique is commonly used in machine learning models, especially with algorithms that require numerical inputs.

I did not use label encoding in this example, but it's another common categorical encoding technique. Here's a brief overview:

Label Encoding:
Label encoding assigns a unique integer to each category, converting a categorical variable into integer codes.
It is best suited to categorical variables with an ordinal relationship among categories (e.g., low, medium, high), provided the integers are assigned in that order. scikit-learn's LabelEncoder sorts categories alphabetically, so the intended order has to be specified explicitly, for example with OrdinalEncoder and its categories parameter.
Label encoding may introduce an artificial ordering to categorical variables, which may not always be appropriate, especially for non-ordinal variables.
Label encoding is typically used when there is a clear ordinal relationship among categories, or when the number of categories is large enough to make one-hot encoding impractical.
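
The contrast between the two encodings can be sketched with pandas alone. The toy frame below is an assumption for illustration; `price_band` is a hypothetical ordinal column, and the ordered mapping is done with `pd.Categorical` rather than scikit-learn so that the order is explicit.

```python
import pandas as pd

# Hypothetical toy frame: one nominal column and one ordinal column
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "price_band": ["Low", "High", "Medium"],
})

# Nominal variable: one binary column per category, no order implied
one_hot = pd.get_dummies(toy["fueltype"], prefix="fueltype")

# Ordinal variable: integer codes that follow an explicitly stated order
band_order = ["Low", "Medium", "High"]
toy["price_band_code"] = pd.Categorical(
    toy["price_band"], categories=band_order, ordered=True
).codes

print(one_hot)
print(toy)
```

An alphabetical encoder would instead assign High=0, Low=1, Medium=2, which does not reflect the intended order.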

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction


#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & Remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# Calculate absolute correlations between the numerical columns only
correlation_matrix = df.select_dtypes(include=[np.number]).corr().abs()

# Create a mask for the upper triangle, excluding the diagonal (k=1),
# because every feature has a correlation of 1.0 with itself
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool), k=1)

# Select upper triangle of correlation matrix
upper_triangle = correlation_matrix.where(mask)

# Find features with correlation above a threshold
threshold = 0.7
highly_correlated_features = [column for column in upper_triangle.columns if any(upper_triangle[column] > threshold)]

# Drop highly correlated features
df_cleaned = df.drop(columns=highly_correlated_features)

# Note: the source columns used below are illustrative choices from this dataset;
# replace them with the features that matter for your analysis

# Create new feature based on the interaction between two features
df['feature_interaction'] = df['carlength'] * df['carwidth']

# Create new feature by combining existing features
df['feature_combination'] = df['carlength'] + df['carwidth']

# Create new feature by applying mathematical transformations
df['feature_log'] = np.log(df['enginesize'])

# Create new feature by binning numerical feature
bins = [0, 10000, 20000, np.inf]
labels = ['Low', 'Medium', 'High']
df['feature_binned'] = pd.cut(df['price'], bins=bins, labels=labels)


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting



##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Select numerical features for transformation
numerical_features = df.select_dtypes(include=[np.number]).columns

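# Apply logarithmic transformation only to non-negative numerical features
# (np.log1p returns NaN or -inf for values <= -1)
log_features = [col for col in numerical_features if (df[col] >= 0).all()]
data_transformed = df.copy()
data_transformed[log_features] = np.log1p(data_transformed[log_features])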

# Print the transformed dataset
print("Transformed Dataset:")
print(data_transformed.head())

### 6. Data Scaling

In [None]:
# Scaling your data

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-Max Scaling (Normalization)
scaler_minmax = MinMaxScaler()
data_normalized = df.copy()
data_normalized[numerical_features] = scaler_minmax.fit_transform(data_normalized[numerical_features])

# Standardization (Z-score Scaling)
scaler_standard = StandardScaler()
data_standardized = df.copy()
data_standardized[numerical_features] = scaler_standard.fit_transform(data_standardized[numerical_features])

# Print the scaled datasets
print("Normalized Dataset:")
print(data_normalized.head())

print("\nStandardized Dataset:")
print(data_standardized.head())

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

- **Curse of Dimensionality:** As the number of features increases, the volume of the feature space grows exponentially, which can lead to sparsity and increased computational complexity. Dimensionality reduction can mitigate this by reducing the number of features while preserving most of the important information.
- **Improved Model Performance:** High-dimensional data may lead to overfitting, especially when the number of features is much larger than the number of samples. Dimensionality reduction can improve generalization by reducing overfitting.
- **Visualization:** High-dimensional data is hard to visualize. Techniques such as PCA (Principal Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding) project the data into a lower-dimensional space while preserving much of the structure and relationships among data points.
- **Feature Interpretability:** Having too many features can make a model hard to interpret. Condensing them into a few components or a smaller feature set simplifies the model, although the components themselves are combinations of the original features rather than single named attributes.
- **Reduced Computational Complexity:** Fewer dimensions reduce computation and memory requirements for training and inference, especially on large-scale datasets.

However, dimensionality reduction may not always be necessary or beneficial. It depends on the specific characteristics of the data and the modeling task. For example:

- If the dataset is small and the number of features is manageable, dimensionality reduction may not be needed.
- If the features are already highly informative and there is no redundancy, dimensionality reduction may not improve model performance.
- Dimensionality reduction techniques may also introduce some loss of information, so it's essential to carefully evaluate the trade-off between fewer dimensions and preserved information.

In summary, dimensionality reduction can help with the curse of dimensionality, model performance, visualization, and computational cost. However, it should be applied judiciously based on the specific requirements and characteristics of the data and the modeling task.

In [None]:
# Dimensionality Reduction (If needed)

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Select numerical features for PCA (the target 'price' is excluded so it does not leak into the components)
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.drop('price', errors='ignore')

# Standardize the numerical features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[numerical_features])

# Apply PCA for dimensionality reduction
pca = PCA(n_components=0.95)  # Preserve 95% of the variance
pca_features = pca.fit_transform(scaled_features)

# Create a DataFrame for the new PCA features, aligned with the index of df
pca_df = pd.DataFrame(data=pca_features, columns=[f'PCA_{i}' for i in range(1, pca.n_components_ + 1)], index=df.index)

# Replace the original numerical features with the PCA features
# (appending them to the full dataset would increase, not reduce, the number of columns)
data_final = pd.concat([df.drop(columns=numerical_features), pca_df], axis=1)

# Print the final dataset with PCA features
print("Final Dataset with PCA Features:")
print(data_final.head())

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

- **PCA's Ability to Preserve Variance:** PCA finds the directions (principal components) in the feature space that capture the maximum variance in the data. By retaining the components that explain most of the variance, PCA reduces the dimensionality of the dataset while preserving as much information as possible.
- **Efficiency and Effectiveness:** PCA is computationally efficient and widely used for dimensionality reduction. It is particularly effective when the features are linearly correlated and when variance is a good proxy for the underlying structure of the data.
- **Interpretability:** PCA produces orthogonal (uncorrelated) components. Each principal component is a linear combination of the original features, so its loadings show which features drive the most significant patterns in the data.
- **Versatility:** PCA can be applied to various types of data, including numerical data, image data, and high-dimensional datasets. It can be used for exploratory data analysis, feature extraction, and visualization.

Overall, PCA was chosen as the dimensionality reduction technique because of its ability to preserve variance, its efficiency, its interpretability, and its versatility. It is a widely used method for reducing the dimensionality of datasets while retaining most of the essential information.
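
As a rough illustration of the variance and interpretability points above, the sketch below runs PCA on a small synthetic table and inspects `explained_variance_ratio_` and the component loadings. The random data and the choice of columns are assumptions made for the example; they are not the project's dataset or results.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three strongly correlated
# size-like columns plus one independent column
rng = np.random.default_rng(0)
size = rng.normal(size=200)
demo = pd.DataFrame({
    "carlength": size + rng.normal(scale=0.1, size=200),
    "carwidth": size + rng.normal(scale=0.1, size=200),
    "curbweight": size + rng.normal(scale=0.1, size=200),
    "peakrpm": rng.normal(size=200),
})

# Standardize, then keep enough components to explain 95% of the variance
scaled = StandardScaler().fit_transform(demo)
pca_demo = PCA(n_components=0.95)
components = pca_demo.fit_transform(scaled)

# Share of the total variance captured by each retained component
names = [f"PCA_{i}" for i in range(1, pca_demo.n_components_ + 1)]
explained = pd.Series(pca_demo.explained_variance_ratio_, index=names)

# Loadings: weight of each original feature in each component
loadings = pd.DataFrame(pca_demo.components_, columns=demo.columns, index=names)

print(explained.round(3))
print(loadings.round(2))
```

Reading the loadings row by row shows which original columns contribute most to each component, which is the usual way to give a principal component a meaning.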

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.


##### What data splitting ratio have you used and why?


The data is split into 70% training data and 30% testing data, matching the `train_test_split(df, train_size=0.7, random_state=100)` call in the ML Model - 1 section. Here's why this ratio was chosen:

- **Balance Between Training and Testing Data:** The 70-30 split balances having enough data for model training against having enough data for model evaluation. Allocating 70% of the data to training gives the model sufficient samples to learn the underlying patterns, while reserving 30% for testing allows its performance to be evaluated on unseen data.
- **Commonly Used Splitting Ratio:** Splits such as 70-30 and 80-20 are commonly used in machine learning. Using a well-established ratio helps keep results consistent and comparable with other studies and experiments.
- **Reduced Risk of Overfitting:** With the larger proportion of the data allocated to training, the model has more samples to learn from, which can help reduce the risk of overfitting. Overfitting occurs when the model memorizes the training data rather than generalizing to unseen data. A substantial test set lets us assess how well the model generalizes beyond the training set.
- **Adequate Evaluation:** Reserving 30% of the data for testing provides enough data for reliable estimates of regression metrics such as MAE, MSE, and the R2 score.

Overall, the 70-30 split balances training and testing data, helps limit overfitting, and provides adequate data for model evaluation, making it a suitable choice for this task.
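
The splitting cell above is empty, so here is a minimal sketch of a reproducible hold-out split on toy data. The `train_size=0.7` and `random_state=100` values are taken from the `train_test_split` call in the ML Model - 1 section; the toy columns and their values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 rows with two made-up features and a price-like target
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "enginesize": rng.uniform(60, 330, size=100),
    "horsepower": rng.uniform(48, 290, size=100),
})
toy["price"] = 100 * toy["enginesize"] + 50 * toy["horsepower"] + rng.normal(scale=500, size=100)

# Hold out part of the rows for testing; fixing random_state makes the split reproducible
toy_train, toy_test = train_test_split(toy, train_size=0.7, random_state=100)

print(toy_train.shape)
print(toy_test.shape)
```

Changing `train_size` to `0.8` would give an 80-20 split with no other changes to the code.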

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.


Determining whether a dataset is imbalanced involves analyzing the distribution of the target variable (or outcome of interest) in the dataset. In classification tasks, imbalance refers to a significant disparity in the distribution of classes. Because the target in this project (price) is continuous, class imbalance in this sense does not directly apply; the closest analogue is a heavily skewed price distribution. For a categorical target, imbalance can be assessed as follows:

- **Check Class Distribution:** Look at the distribution of the target variable. If one class dominates the dataset while others are underrepresented, the dataset is imbalanced.
- **Visualize Class Distribution:** Plot a histogram or bar chart of the target variable. Noticeable skewness in the class frequencies suggests an imbalance.
- **Calculate Class Ratios:** Calculate the ratio of samples in each class. A high ratio between the majority class and the minority class indicates class imbalance.
- **Evaluate Performance Metrics:** Evaluate metrics such as precision, recall, and F1-score separately for each class. A notable difference in performance between classes may be due to class imbalance.
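
The distribution and ratio checks above can be sketched in a few lines of pandas. The target below is a hypothetical categorical column with invented labels and counts, used only to show the calls; it isn't taken from the project's data.

```python
import pandas as pd

# Hypothetical categorical target (assumption): 90 "low" labels and 10 "high" labels
target = pd.Series(["low"] * 90 + ["high"] * 10, name="price_band")

# Class distribution as absolute counts and as proportions
counts = target.value_counts()
proportions = target.value_counts(normalize=True)

# Ratio of the majority class to the minority class
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(proportions)
print(f"Majority/minority ratio: {imbalance_ratio:.1f}")
```

What ratio counts as "imbalanced" is a judgment call that depends on the task; the code only reports the numbers.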

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1: Linear Regression


In [None]:


# ML Model - 1 Implementation

# Fit the Algorithm
# Predict on the model

from sklearn.model_selection import train_test_split
# Min-max scaling
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm
# VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
#R-squared
from sklearn.metrics import r2_score
# Label encoding
from sklearn.preprocessing import LabelEncoder
# Importing RFE
from sklearn.feature_selection import RFE
# Importing LinearRegression
from sklearn.linear_model import LinearRegression
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Libraries for cross validation
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor

from sklearn import datasets
from sklearn.model_selection import cross_val_score, cross_val_predict



In [None]:
# fueltype
# Convert "gas" to 1 and "diesel" to 0
df['fueltype'] = df['fueltype'].map({'gas': 1, 'diesel': 0})
df.head()

In [None]:
# aspiration
# Convert "std" to 1 and "turbo" to 0
df['aspiration'] = df['aspiration'].map({'std':1, 'turbo':0})
df.head()

In [None]:
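# Converting doornumber, drivewheel and enginelocation to 1 & 0
# Note: Series.map turns any value missing from the mapping into NaN, so check that each column holds only the two listed categories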

df['doornumber'] = df['doornumber'].map({'four':1, 'two':0})
df['drivewheel'] = df['drivewheel'].map({'fwd':1, 'rwd':0})
df['enginelocation'] = df['enginelocation'].map({'front':1, 'rear':0})
df.head()

In [None]:
symboling_status = pd.get_dummies(df['symboling'],drop_first=True)
symboling_status.head()

In [None]:

symboling_status = symboling_status.rename(columns={-1:'symboling(-1)', 0:'symboling(0)', 1:'symboling(1)',2:'symboling(2)', 3:'symboling(3)'})
df = pd.concat([df,symboling_status], axis=1)
df = df.drop('symboling',axis=1)
carbody_status = pd.get_dummies(df['carbody'],drop_first=True)
carbody_status = carbody_status.rename(columns={'hardtop':'carbody(hardtop)', 'hatchback':'carbody(hatchback)', 'sedan':'carbody(sedan)','wagon':'carbody(wagon)'})
df = pd.concat([df,carbody_status], axis=1)
df = df.drop('carbody',axis=1)
enginetype_status = pd.get_dummies(df['enginetype'], drop_first=True)
enginetype_status = enginetype_status.rename(columns={'dohcv':'enginetype(dohcv)', 'l':'enginetype(l)', 'ohc':'enginetype(ohc)',
                                                      'ohcf':'enginetype(ohcf)','ohcv':'enginetype(ohcv)' ,'rotor':'enginetype(rotor)'})
df = pd.concat([df,enginetype_status], axis=1)
df = df.drop('enginetype',axis=1)
cylindernumber_status = pd.get_dummies(df['cylindernumber'], drop_first=True)
cylindernumber_status = cylindernumber_status.rename(columns={'five':'cylindernumber(five)', 'four':'cylindernumber(four)', 'six':'cylindernumber(six)',
                                                      'three':'cylindernumber(three)','twelve':'cylindernumber(twelve)' ,'two':'cylindernumber(two)'})
df = pd.concat([df,cylindernumber_status], axis=1)
df = df.drop('cylindernumber',axis=1)
fuelsystem_status = pd.get_dummies(df['fuelsystem'], drop_first=True)
fuelsystem_status = fuelsystem_status.rename(columns={'2bbl':'fuelsystem(2bbl)', '4bbl':'fuelsystem(4bbl)', 'idi':'fuelsystem(idi)',
                                                      'mfi':'fuelsystem(mfi)','mpfi':'fuelsystem(mpfi)' ,'spdi':'fuelsystem(spdi)',
                                                             'spfi':'fuelsystem(spfi)'})
df = pd.concat([df,fuelsystem_status], axis=1)
df = df.drop('fuelsystem',axis=1)
CarName_status = pd.get_dummies(df['CarName'], drop_first=True)
CarName_status = CarName_status.rename(columns={'audi':'CarCompany(audi)', 'bmw':'CarCompany(bmw)', 'buick':'CarCompany(buick)',
                                                      'chevrolet':'CarCompany(chevrolet)','dodge':'CarCompany(dodge)' ,'honda':'CarCompany(honda)',
                                                      'isuzu':'CarCompany(isuzu)','jaguar':'CarCompany(jaguar)','mazda':'CarCompany(mazda)',
                                                      'mercury':'CarCompany(mercury)','mitsubishi':'CarCompany(mitsubishi)','nissan':'CarCompany(nissan)',
                                                      'peugeot':'CarCompany(peugeot)','plymouth':'CarCompany(plymouth)','porsche':'CarCompany(porsche)',
                                                      'renault':'CarCompany(renault)','saab':'CarCompany(saab)','subaru':'CarCompany(subaru)',
                                                      'toyota':'CarCompany(toyota)','volkswagen':'CarCompany(volkswagen)','volvo':'CarCompany(volvo)'})

df = pd.concat([df,CarName_status], axis=1)
df = df.drop('CarName',axis=1)
df.head()

In [None]:
df.info()

In [None]:
df_train, df_test = train_test_split(df, train_size=0.7, random_state=100)
print(df_train.shape)
print(df_test.shape)

In [None]:
# Create a list of numeric variables. The encoded categorical variables are skipped because they are already on a 0-1 scale.
num_vars = ['wheelbase','carlength','carwidth','carheight','curbweight','enginesize','boreratio','stroke',
            'compressionratio','horsepower','peakrpm','citympg','highwaympg','price']

# Instantiate an object
scaler = MinMaxScaler()

# Fit the data in the object
df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
df_train.head()

In [None]:
df_train.describe()

In [None]:
# Popping out the 'price' column for y_train
y_train = df_train.pop('price')
# Creating X_train
X_train = df_train

In [None]:
y_train.head()

In [None]:
X_train.head()

In [None]:
lm = LinearRegression()
lm.fit(X_train, y_train)
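
The cells above fit the model on the training set only. As a minimal sketch of one way to score such a model on held-out data, the example below uses synthetic data with hypothetical column names (`x1`, `x2`, `price`). The point it illustrates is an assumption about good practice rather than the notebook's own code: the scaler fitted on the training rows is reused with `transform` (not `fit_transform`) on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (assumption): price depends linearly on two made-up features
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.uniform(0, 10, 200), "x2": rng.uniform(0, 5, 200)})
demo["price"] = 3 * demo["x1"] + 2 * demo["x2"] + rng.normal(scale=0.5, size=200)

demo_train, demo_test = train_test_split(demo, train_size=0.7, random_state=100)
demo_train, demo_test = demo_train.copy(), demo_test.copy()

# Fit the scaler on the training rows only, then apply the same scaling to the test rows
cols = ["x1", "x2", "price"]
demo_scaler = MinMaxScaler()
demo_train[cols] = demo_scaler.fit_transform(demo_train[cols])
demo_test[cols] = demo_scaler.transform(demo_test[cols])

# Separate the target from the predictors
y_tr = demo_train.pop("price")
y_te = demo_test.pop("price")

# Fit on the training set and predict on the unseen test set
model = LinearRegression().fit(demo_train, y_tr)
pred = model.predict(demo_test)

scores = {
    "MAE": mean_absolute_error(y_te, pred),
    "MSE": mean_squared_error(y_te, pred),
    "R2": r2_score(y_te, pred),
}
print(scores)
```

Because the target was scaled too, MAE and MSE are in scaled units; R2 is unaffected by that scaling.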

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart



#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
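
No tuning code is filled in for this cell. Plain `LinearRegression` has very few hyperparameters, so one hedged option is to at least cross-validate it with the `KFold` and `cross_val_score` utilities imported earlier. The sketch below runs on synthetic data; the five folds, the shuffling, and the R2 scoring are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption): three features with known coefficients plus noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=150)

# Five shuffled folds; each fold is used once as the validation set
folds = KFold(n_splits=5, shuffle=True, random_state=100)
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds, scoring="r2")

print("R2 per fold:", cv_scores.round(3))
print("Mean R2:", cv_scores.mean().round(3))
```

A large spread between the per-fold scores would suggest the single train/test split is not giving a stable estimate.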

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
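
Both steps, saving and reloading, can be sketched with the standard-library `pickle` module. The model below is a stand-in fitted on synthetic data, and the file name `car_price_model.pkl` and the temporary directory are hypothetical choices made for the example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model (assumption): a linear regression fitted on synthetic data
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(50, 2))
y_fit = X_fit @ np.array([1.5, -2.0]) + 0.1
fitted = LinearRegression().fit(X_fit, y_fit)

# Save the fitted model to a pickle file (hypothetical file name)
model_path = os.path.join(tempfile.mkdtemp(), "car_price_model.pkl")
with open(model_path, "wb") as fh:
    pickle.dump(fitted, fh)

# Load the model back and predict on unseen rows as a sanity check
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)

X_new = rng.normal(size=(5, 2))
same = np.allclose(fitted.predict(X_new), restored.predict(X_new))
print("Predictions match after reload:", same)
```

Any scaler or encoder fitted during preprocessing would need to be saved alongside the model, since new data must go through the same transformations before prediction. Pickle files should only be loaded from trusted sources.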

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***