In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing the dataset

In [None]:
df = pd.read_csv("/kaggle/input/car-data/CarPrice_Assignment.csv")
df.head()

# Information Regarding the data

**Shape of the dataset**

In [None]:
df.shape

Inference: There are 205 rows and 26 columns in the given dataset.

**List of columns**

In [None]:
df.columns

Inference: The names above are the column names in the dataset.

**Column Data Types**

In [None]:
df.dtypes

**Descriptive Statistics**

In [None]:
df.describe()

# Check for any missing data

In [None]:
df.isnull().sum()

Inference: There are no missing values in the given dataset.

# Categorical Data

In [None]:
cat = [feature for feature in df.columns if df[feature].dtypes=='O' and feature!='CarName']

`dtypes=='O'` checks whether the data type is object, which is how pandas stores the categorical (string) variables here. The variable 'CarName' is excluded as it doesn't look useful for this analysis.
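As an aside, the same selection can be sketched with `select_dtypes`. The toy frame below uses made-up values (only the column names mirror the dataset), and the columns are forced to object dtype as an assumption so the example behaves the same across pandas versions.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "car_ID": [1, 2, 3],
    "CarName": ["audi 100 ls", "bmw 320i", "honda civic"],
    "fueltype": ["gas", "gas", "diesel"],
    "price": [13950.0, 16430.0, 7295.0],
})
# Assumption: force object dtype explicitly so the selection below is
# independent of how a given pandas version infers string columns.
toy = toy.astype({"CarName": object, "fueltype": object})

# Pick the object columns, then drop CarName as in the cell above.
cat_cols = toy.select_dtypes(include="object").columns.drop("CarName").tolist()
print(cat_cols)
```

This is only an alternative spelling of the list comprehension above, not a change to the analysis.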

In [None]:
df[cat]

Inference: There are 9 categorical variables.

**Unique values**

In [None]:
for feature in cat:
    print(feature, df[feature].unique())

In [None]:
for feature in cat:
    print("There are %d unique values in the column %s." %(len(df[feature].unique()), feature))

In [None]:
for feature in cat:
    # Compute the counts once and reuse them for both axes
    counts = df[feature].value_counts()
    sns.barplot(x=counts.index, y=counts.values)
    plt.xlabel(feature)
    plt.ylabel("count")
    plt.title("Frequency of %s" %feature)
    plt.show()

Inferences:
1. The count of vehicles that use gas as a fuel is very high compared to the vehicles that use diesel.
2. The count of vehicles that have standard aspiration is very high compared to the vehicles that have turbo aspiration.
3. Vehicles with 4 doors have a higher count than vehicles with 2 doors.
4. Sedans have the highest count for car's body whereas Convertibles have the lowest count.
5. FWD drive wheel has the highest count, followed by RWD, and the lowest count being of 4WD.
6. Number of vehicles with engine location in the front greatly surpasses the ones with engine located at the rear.
7. OHC engine type has the highest count whereas DOHCV engine type has the lowest count.
8. Vehicles with 4 cylinders have the highest count whereas the ones with 3 have the lowest count.
9. MPFI fuel system has the highest count whereas SPFI fuel system has the lowest count.

Hence, in the given sample,
1. Gas fuel vehicles are highly preferred.
2. Vehicles with 4 doors are preferred.
3. Sedans are the most popular.

The other variables might be unfamiliar to many people.

**Relationship between categorical variables and price**

Using the median price within each category as the "average" instead of the mean, because the mean is distorted by outliers. When there are no outliers, the mean will be close to the median anyway.
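A small illustration of why the median is more robust, using made-up prices for one hypothetical category (these numbers are not from the dataset):

```python
import pandas as pd

# Made-up prices; the last value is a deliberate outlier.
prices = pd.Series([7000, 7500, 8000, 8500, 45000])

print(prices.mean())    # pulled upward by the single outlier
print(prices.median())  # stays at the middle value
```

Here the mean lands well above four of the five prices, while the median still describes a typical vehicle in the group.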

In [None]:
for feature in cat:
    df.groupby(feature)['price'].median().plot.bar()
    plt.show()

Inferences:
1. Gas vehicles have a lower average price than Diesel vehicles. This could be a possible reason why the count of vehicles that used gas fuel was higher than the ones that used diesel.
2. Standard aspiration vehicles have a lower average price than Turbo aspiration vehicles. This could be a possible reason why the count of vehicles that used Standard aspiration was higher than the ones that used Turbo aspiration.
3. While the two are almost equal, vehicles with 4 doors have a slightly higher average price than vehicles with 2 doors. However, the count of vehicles with 4 doors is also higher. This suggests some other factor is coming into play; for instance, people may find vehicles with 4 doors more convenient.
4. The average prices of vehicles by body type are (ascending order):
    1. Hatchbacks
    2. Sedans
    3. Wagons
    4. Convertible
    5. Hardtops
  
  Hatchbacks and Sedans have the highest counts when compared with other body types, with Sedans having a higher count than Hatchbacks. However, the average price of Hatchbacks is lower than that of Sedans. Hence, another factor could be coming into play; for instance, Sedans, being bigger, may be preferred over Hatchbacks, which are relatively smaller. Hardtops and Convertibles are usually harder to obtain, hence their low counts, and their high performance could explain their higher average price.
5. FWD drivewheel has the lowest average price, followed by 4WD, with RWD having the highest average price. This could explain why FWD vehicles have a higher count than the other two. However, the count of 4WD vehicles is the lowest even though their average price is lower than that of RWD vehicles. One possible explanation is that 4WD vehicles are specialized for off-road driving, so they are not preferred by most of the sample.
6. Vehicles with engine located at the front have a lower average price, explaining the high count of the same when compared to vehicles with engine located at the rear.
7. OHC and OHCF have the lowest average prices, with OHC being slightly lower. This could explain the high count of OHC compared to other engine types.
8. Vehicles with 3 cylinders have the lowest average price, followed by vehicles with 4 cylinders. However, the count of 4-cylinder vehicles greatly surpasses the others. One possible explanation is that 4 is simply the most common number of cylinders used in a vehicle.
9. MPFI fuel system has the highest average price as well as the highest count. This means some other factor is coming into play, for instance, MPFI systems are more efficient, and more widely used.
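The inferences above compare each category's count with its median price across two separate sets of plots. A minimal sketch of putting both numbers in one table, assuming a toy frame with made-up values (only the column names `fueltype` and `price` come from the dataset):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "fueltype": ["gas", "gas", "gas", "diesel", "diesel"],
    "price": [7000.0, 9000.0, 8000.0, 12000.0, 14000.0],
})

# One row per category: how many vehicles, and their median price.
summary = (
    toy.groupby("fueltype")["price"]
       .agg(count="count", median_price="median")
       .sort_values("median_price")
)
print(summary)
```

Reading count and median price side by side makes it easier to see where a low price coincides with a high count and where it does not. Either way, this shows association in the sample, not the cause.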

# Numerical Data

In [None]:
num = [feature for feature in df.columns if df[feature].dtype!='O' and feature!='car_ID']

In [None]:
df[num]

Inference: There are 15 numerical variables.

**Unique Values**

In [None]:
for feature in num:
    print("There are %d unique values in the column %s." %(len(df[feature].unique()), feature))

**Discrete numerical data**

Using 25 as the threshold for classifying a numerical variable as discrete or continuous. If a variable has more than 25 unique values, it is considered continuous and plotted as a histogram. Otherwise, it is considered discrete and plotted with a count plot.

In [None]:
discrete = [feature for feature in num if len(df[feature].unique()) <= 25]

In [None]:
df[discrete]

Inference: There are 2 discrete numerical variables.
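The threshold rule can also be wrapped in a small helper so the cutoff is easy to change. This is only a sketch: `split_by_cardinality` is a hypothetical name, and the toy frame below holds made-up values.

```python
import pandas as pd

def split_by_cardinality(frame, columns, threshold=25):
    """Return (discrete, continuous) column lists based on unique-value counts."""
    counts = frame[columns].nunique()
    discrete = counts[counts <= threshold].index.tolist()
    continuous = counts[counts > threshold].index.tolist()
    return discrete, continuous

# Made-up data: one column with 4 unique values, one with 40.
toy = pd.DataFrame({
    "symboling": [0, 1, 2, 3] * 10,
    "price": [5000.0 + 100 * i for i in range(40)],
})

discrete_cols, continuous_cols = split_by_cardinality(toy, ["symboling", "price"])
print(discrete_cols, continuous_cols)
```

The value 25 is a judgment call rather than a standard, so it is worth checking how the split changes with a different threshold.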

In [None]:
for feature in discrete:
    sns.countplot(data=df, x=feature)
    plt.xticks(rotation=90)
    plt.show()

Inference: 
1. Symboling corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. A value of +3 indicates that the auto is risky, -2 that it is probably pretty safe (source: GOOGLE). The plot shows that most of the vehicles are in the safe zone (-1 to +1), although very few are highly safe and a fair share are highly risky.
2. The most common peak RPM values are 5500 and 4800.

**Continuous numerical data**

In [None]:
continuous = [feature for feature in num if len(df[feature].unique()) > 25]

In [None]:
df[continuous]

Inference: There are 13 continuous numerical variables.

In [None]:
for feature in continuous:
    sns.histplot(data=df, x=feature)
    plt.show()

Inference: The distributions of Car Length, Car Width, Car Height, Bore Ratio, Stroke, City MPG and Highway MPG are roughly normal.
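Judging normality by eye can be backed up with a rough numeric check such as skewness. The sketch below uses synthetic data, not the dataset: a skewness near 0 suggests a symmetric distribution (necessary but not sufficient for normality), while a clearly positive value indicates a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic columns for illustration; the seed makes the example repeatable.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=170, scale=10, size=500),
    "right_skewed": rng.exponential(scale=100, size=500),
})

# Sample skewness per column.
skewness = toy.skew()
print(skewness)
```

Applied to the notebook's data, the equivalent call would be `df[continuous].skew()`.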


**Relationship of numerical data with price**

In [None]:
for feature in num:
    sns.scatterplot(x=df[feature], y=df['price'])
    plt.show()

Inferences:
1. There is no graspable relationship with price for Symboling, Car Height, Stroke, Compression Ratio, and Peak RPM.
2. There is a positive and roughly linear relationship with price for Curb Weight, Engine Size, and Horse Power. In most cases, when these variables increase, the price increases too.
3. There is a positive relationship for price with Wheel Base, Car Length, and Bore Ratio. In most cases, when these variables increase, the price increases too.
4. There is a negative and roughly linear relationship between price and both City MPG and Highway MPG: as MPG decreases, the price increases. Lower MPG means lower fuel efficiency, so a plausible explanation is that heavier, more powerful vehicles (see point 2) both cost more and consume more fuel.
5. Price trivially has a perfect linear relationship with itself.
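The visual impressions above can be quantified by ranking each variable's Pearson correlation with price. A minimal sketch on a toy frame with made-up values (only the column names come from the dataset):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "enginesize": [90, 110, 130, 150, 200],
    "citympg":    [38, 31, 26, 22, 16],
    "price":      [6000.0, 9000.0, 12500.0, 16000.0, 30000.0],
})

# Correlation of every column with price, excluding price itself,
# sorted from most negative to most positive.
corr_with_price = toy.corr()["price"].drop("price").sort_values()
print(corr_with_price)
```

Pearson correlation only measures linear association, so a variable with a curved relationship to price can still show a modest coefficient.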

**Correlation among numerical variables**

In [None]:
df[num].corr()

In [None]:
plt.rcParams['figure.figsize'] = (12, 6)
sns.heatmap(df[num].corr(), annot=True)
plt.show()
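A dense heatmap can be hard to scan, so one option is to list only the strongly correlated pairs. The sketch below is an illustration on made-up values, and the 0.8 cutoff is an arbitrary assumption rather than a rule.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "citympg":    [38, 31, 26, 22, 16],
    "highwaympg": [43, 37, 31, 28, 20],
    "stroke":     [3.4, 2.7, 3.5, 3.1, 3.2],
})

corr = toy.corr().abs()

# Keep only the strictly upper triangle so each pair appears once
# and the diagonal (self-correlation) is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

# Assumed cutoff for calling a pair "strongly correlated".
strong_pairs = pairs[pairs > 0.8]
print(strong_pairs)
```

Pairs that surface here (for example the two MPG columns) carry largely overlapping information, which matters if the variables are later used together in a model.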

### Thank you for reading this notebook. This is my first project here and I'm still learning, so if there is any mistake that I've made or some possible feature I can use for further analysis, let me know in the comments.