In this notebook, we will do an exploratory data analysis of the used car auction dataset. We will briefly cover what types of cars are auctioned, whether auction prices depend on the type of car sold, how the trend varies by year and month, and so on.

We will keep adding to this list as we progress. Let's begin.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import re

from collections import Counter


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

A look at the data outside this coding environment shows that a few rows have more columns than the rest. Extra commas in those rows shift the values to the right and create an additional column. To illustrate, let me pass `on_bad_lines='warn'` to the pandas `read_csv` function (it replaces the `error_bad_lines=False` and `warn_bad_lines=True` pair, which was deprecated in pandas 1.3 and removed in 2.0). Pandas reports the row numbers where this pattern is seen and drops those lines for us. Refer to the [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) documentation for more details.

In [None]:
data=pd.read_csv('../input/used-car-auction-prices/car_prices.csv',on_bad_lines='warn')
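
To see what dropping malformed rows looks like in isolation, here is a minimal sketch on a made-up three-column CSV. The rows and values are invented for illustration and are not taken from the dataset. It assumes pandas 1.3 or later, where `read_csv` accepts `on_bad_lines`.

```python
import io

import pandas as pd

# Toy CSV, invented for illustration: the BMW row has one comma too many,
# so it splits into 4 fields while the header declares only 3.
raw_csv = (
    "make,model,sellingprice\n"
    "Ford,Fusion,12000\n"
    "BMW,3 Series,Extra,25000\n"
    "Kia,Sorento,21500\n"
)

# on_bad_lines="skip" drops rows with too many fields without any message;
# "warn" drops them as well but also reports each dropped line.
toy = pd.read_csv(io.StringIO(raw_csv), on_bad_lines="skip")
print(toy)
```

Only the two well-formed rows survive, which is why it is worth noting how many lines pandas reports as skipped before trusting the row count.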

In [None]:
data.shape

The dataset has about 558K rows and 16 columns. Let's look at the first few rows and then check for null values.

In [None]:
data.head()

In [None]:
(data.isnull().sum()/data.shape[0])*100

Except for transmission (11%), the share of null values is negligible in every column. For simplicity, let's remove those null values as we go and continue with the analysis. Let's check for duplicates.

In [None]:
data.duplicated().sum()

No rows are duplicated.

In [None]:
data.dtypes

## Selling Price

In [None]:
plt.figure(figsize=(15,8))
plt.hist(data['sellingprice'],bins=200,color='#dc2f02')
plt.title('Distribution of Selling Price',fontsize=15)
plt.xticks(np.arange(0,data['sellingprice'].max(),15000))
plt.xlabel('Selling Price',fontsize=12)
plt.ylabel('Freq',fontsize=12)

In [None]:
data['sellingprice'].describe()
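
Besides `describe()`, skewness and a simple outlier fence can put numbers on what a histogram like the one above shows. The sketch below uses a handful of invented prices, not the auction data, and Tukey's 1.5 x IQR rule, which is a common rule of thumb rather than anything this notebook relies on.

```python
import pandas as pd

# Toy prices, invented for illustration: mostly modest values plus one very high one.
prices = pd.Series([4000, 6500, 9000, 12000, 13500, 18000, 21000, 150000])

skewness = prices.skew()                  # > 0 means a longer right tail
q1, q3 = prices.quantile([0.25, 0.75])    # first and third quartiles
upper_fence = q3 + 1.5 * (q3 - q1)        # Tukey's rule of thumb for high outliers
outliers = prices[prices > upper_fence]

print(f"skewness: {skewness:.2f}, upper fence: {upper_fence:.0f}")
print(outliers)
```

On the real data the same lines would run with `data['sellingprice']` in place of `prices`.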

The distribution is skewed to the right, with most selling prices between 15000 and 30000. There are outliers above 60000. 75% of the data has a selling price < 200K. I assume the highest-priced cars are vintage cars or high-end makes such as Ferrari, BMW, etc. This should become clearer when we compare price with make. Let's take a rough look at the top-end makes.

In [None]:
data.loc[data['sellingprice']>60000,'make'].value_counts()

Our assumption was right. The list is dominated by BMW, Mercedes-Benz, Jaguar, and Ferrari. Another thing to note as we proceed is that the make column might need some cleaning, since a few makes appear more than once under different spellings (BMW and bmw, Land Rover and land rover, etc.). For simplicity, we are not going to do this here and will instead focus on the overall analysis.
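
For reference, a sketch of what that cleaning might look like, should it be needed later. It is not applied anywhere in this notebook. The toy series reuses the duplicated spellings mentioned above, while the leading space in " Ford" is an invented case.

```python
import pandas as pd

# Toy make values: case variants from the text above plus an invented whitespace variant.
makes = pd.Series(["BMW", "bmw", "Land Rover", "land rover", " Ford", None])

# Trim surrounding whitespace and lower-case everything so variants collapse into one label.
normalized = makes.str.strip().str.lower()

# value_counts() ignores missing values by default.
counts = normalized.value_counts()
print(counts)
```

Lower-casing is the simplest choice. Mapping back to a display form (for example with `.str.title()`) would turn "bmw" into "Bmw", so an explicit lookup table may be a better fit for brand names.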

In [None]:
data['make'].value_counts()[:10]

In [None]:
top_make=data['make'].value_counts()[:10].index

In [None]:
top_make

Since we have close to 96 makes, let's consider only the top 10 and visualize their selling price distributions.

In [None]:
#Inspiration -https://www.kaggle.com/nroman/eda-for-ashrae
fig,ax=plt.subplots(5,2,figsize=(14,20))
color_list=['#0a9396','#ca6702','#ae2012','#9b2226','#001219','#005f73','#94d2bd','#e9d8a6','#e5e5e5','#e07a5f'] #coolors.co
# One histogram per make, filled column by column; color_list[i] gives each make its own color
# (a random pick per iteration could repeat colors across panels)
for i,t in enumerate(top_make):
    cur_ax=ax[i%5][i//5]
    data.loc[data['make']==t,'sellingprice'].hist(ax=cur_ax,bins=100,color=color_list[i])
    cur_ax.set_xlabel('Selling Price',fontsize=10)
    cur_ax.set_ylabel('Frequency',fontsize=10)
    cur_ax.set_title(f'Selling price distribution for {t}',fontsize=15)
plt.subplots_adjust(hspace=0.45)

In [None]:
data.loc[data['make']=='Ford','sellingprice'].describe()

* Each of the top 10 makes has its own distribution, and the price ranges differ.
* Of these, Honda and Chrysler seem to have the widest range of selling prices.
* Ford's selling prices fall in a narrow range, capped at about 50000, with one outlier above 200,000.
* Almost all of the 10 makes peak around 10000. A few makes have bimodal distributions.


A similar analysis can be done for the different car body types. Let's do a quick check.

In [None]:
(data['body'].value_counts()[:10]/data.shape[0])*100


Let's check the selling price distribution by transmission. Since 11% of the values in that column are null, I will remove them for this analysis.

In [None]:
trans_df=data.loc[~(data['transmission'].isna()),]
trans_df.isna().sum() ## confirms that null values are removed.

In [None]:
trans_df['transmission'].value_counts()

In [None]:
plt.figure(figsize=(10,11))
# plt.hist(trans_df.loc[trans_df['transmission']=='automatic','sellingprice'],color='#b5179e',alpha=0.8,label='automatic',bins=100)
# plt.hist(trans_df.loc[trans_df['transmission']=='manual','sellingprice'],color='#480ca8',alpha=0.8,label='manual',bins=100)
# plt.legend()
sns.boxplot(x='transmission',y='sellingprice',data=trans_df,palette=['#b5179e','#480ca8'])
plt.title('Selling Price by Transmission',fontsize=15)
plt.xlabel('Transmission',fontsize=10)
plt.ylabel('Selling Price',fontsize=10)


There is a distinct difference in selling price between the transmission types. The automatic group shows a large number of outliers, and its median selling price is higher than that of manual transmission cars.
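
The comparison read off the boxplot can also be tabulated. Here is a small sketch with invented numbers (not the auction data) that puts the median next to the mean and the maximum, so the pull of high outliers on the mean is visible.

```python
import pandas as pd

# Toy data, invented for illustration; the 60000 row plays the role of an outlier.
toy = pd.DataFrame({
    "transmission": ["automatic", "automatic", "automatic", "manual", "manual", "manual"],
    "sellingprice": [9000, 14000, 60000, 5000, 8000, 11000],
})

# One row per transmission type, one column per statistic.
summary = toy.groupby("transmission")["sellingprice"].agg(["median", "mean", "max"])
print(summary)
```

With the real data, the same call on `trans_df` would give the figures behind the boxplot.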

## Odometer & Selling Price

Now that we know the selling price distribution, let's check its correlation with the odometer values. In general, I would expect that the higher the usage (odometer value), the lower the resale value. Let's confirm.

In [None]:
from matplotlib.ticker import FuncFormatter

plt.figure(figsize=(10,10))
g=sns.scatterplot(x='odometer',y='sellingprice',data=data,color='#0d3b66',alpha=0.8)
g.set_title('Odometer vs Selling Price Correlation',fontsize=12)
g.set_xlabel('Odometer',fontsize=10)
g.set_ylabel('Selling Price',fontsize=10)
# Label ticks in thousands ('k'): divide by 1e3 (note that 10e3 would be 10,000)
to_k=FuncFormatter(lambda v,pos:'{:,.0f}k'.format(v/1e3))
g.xaxis.set_major_formatter(to_k)
g.yaxis.set_major_formatter(to_k)

Although high odometer values (above roughly 450k) have brought selling prices down, there are also cars with lower odometer values whose selling price stayed above 50k. So the odometer value is not the only driver: other factors such as make, model, and state also play a part in deciding the selling price.
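
To put a single number on the relationship in the scatter plot, a correlation coefficient can be computed. The sketch below uses invented values, not the auction data, and the default Pearson method of `Series.corr`.

```python
import pandas as pd

# Toy data, invented for illustration: price falls as the odometer reading rises.
toy = pd.DataFrame({
    "odometer": [5000, 20000, 45000, 80000, 120000, 200000],
    "sellingprice": [32000, 27000, 21000, 15000, 9000, 3000],
})

# Pearson correlation: -1 is a perfect inverse linear relation, 0 is no linear relation.
r = toy["odometer"].corr(toy["sellingprice"])
print(f"Pearson r = {r:.2f}")
```

On the real data, `data['odometer'].corr(data['sellingprice'])` gives the equivalent figure. Pearson only captures the linear part of the relationship, so a modest value would not rule out a strong non-linear one.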

### Make & Model

For this analysis, let's remove the rows with null values in make and model.

In [None]:
mod_df=data.dropna(axis=0,subset=['make','model'])
mod_df.isna().sum()

Let's identify the top 10 make and model combinations.

In [None]:
ma_mo=list(zip(mod_df['make'],mod_df['model']))

In [None]:
Counter(ma_mo).most_common(10)

Ford clearly dominates the list of most sold cars. Nissan, Chevrolet, Honda, and BMW are the other makes.
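
As a small illustration of the counting step, here is how `Counter` treats (make, model) tuples. The pairs below are invented for the example and say nothing about the real ranking.

```python
from collections import Counter

# Toy (make, model) pairs, invented for illustration.
pairs = [
    ("Ford", "F-150"),
    ("Nissan", "Altima"),
    ("Ford", "F-150"),
    ("Ford", "Fusion"),
    ("Nissan", "Altima"),
    ("Ford", "F-150"),
]

# Tuples are hashable, so each distinct (make, model) pair gets its own count.
top2 = Counter(pairs).most_common(2)
print(top2)
```

The same counts can also be obtained in pandas with `mod_df.groupby(['make','model']).size()`, which avoids building the list of tuples.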

## Condition of Cars

In [None]:
plt.figure(figsize=(10,8))
plt.hist(data['condition'],bins=30,color='#023047')
plt.title('Distribution of Condition',fontsize=20)
plt.xlabel('Condition',fontsize=15)
plt.ylabel('Freq',fontsize=15)

The condition of the cars being auctioned is rated from 1 to 5. No clear distribution can be inferred from the plot. Close to 45000 cars are in very good condition, and there are also large numbers of cars rated 1.8 and 3.5.

Let's bin this column and look at the selling price in each bin.

In [None]:
data['condition_bin'],bins=pd.cut(data['condition'],bins=4,retbins=True)

In [None]:
plt.figure(figsize=(10,8))
sns.violinplot(y=data['condition_bin'],x=data['sellingprice'],palette=['#606c38','#283618','#dda15e','#bc6c25'])
plt.title('Condition of Cars Vs Selling Price',fontsize=12)
plt.ylabel('Condition',fontsize=12)
plt.xlabel('Selling Price',fontsize=12)

The median selling price clearly rises as the condition of the cars improves. Every condition bin also shows a lot of outliers in the plot.
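
The medians behind the violin plot can be listed directly with a group-by on the binned column. A minimal sketch on invented values (not the auction data), using the same equal-width `pd.cut` with 4 bins:

```python
import pandas as pd

# Toy data, invented for illustration: two cars per condition band.
toy = pd.DataFrame({
    "condition": [1.0, 1.5, 2.5, 2.9, 3.5, 3.9, 4.5, 5.0],
    "sellingprice": [2000, 3000, 6000, 8000, 12000, 14000, 20000, 26000],
})

# Four equal-width bins over the observed range of condition.
toy["condition_bin"] = pd.cut(toy["condition"], bins=4)

# observed=True keeps only the bins that actually contain rows.
median_by_bin = toy.groupby("condition_bin", observed=True)["sellingprice"].median()
print(median_by_bin)
```

Replacing `toy` with `data` would give the median selling price for each of the four condition bins created above.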

### Auction Years

In [None]:
data['year'].min(),data['year'].max()

We have data from 1982 to 2015.

There is also a column, saledate, which has some useful information. Let's clean it to extract the weekday, month, and day.

In [None]:
data['sale_dow']=data['saledate'].apply(lambda x:re.search(r'^(\w+)\s',x).group(1))
data['sale_month']=data['saledate'].apply(lambda x:re.search(r'(\w+)\s(\d+)',x).group(1))
data['sale_day']=data['saledate'].apply(lambda x:re.search(r'(\w+)\s(\d+)',x).group(2))
data['sale_year']=data['saledate'].apply(lambda x:re.search(r'(\w+)\s(\d{4})',x).group(2))
data['sale_date']=data['saledate'].apply(lambda x:re.search(r'(\w+\s\d{2}\s\d{4})',x).group(1))
data['sale_date']=pd.to_datetime(data['sale_date'],format='%b %d %Y')
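
As an aside, the same fields can be pulled out in one pass with `Series.str.extract` and named groups. This is only a sketch: the sample strings are assumptions about the saledate layout (weekday, month name, two-digit day, four-digit year, then a time part), inferred from the patterns above rather than copied from the data.

```python
import pandas as pd

# Assumed saledate layout, inferred from the regular expressions above.
saledate = pd.Series([
    "Tue Dec 16 2014 12:30:00 GMT-0800 (PST)",
    "Thu Jan 15 2015 04:30:00 GMT-0800 (PST)",
    None,  # a missing value yields NaN/NaT instead of raising an error
])

# Each named group becomes a column of the resulting DataFrame.
pattern = r"^(?P<sale_dow>\w+)\s(?P<sale_month>\w+)\s(?P<sale_day>\d{2})\s(?P<sale_year>\d{4})"
parts = saledate.str.extract(pattern)

# Rebuild "Dec 16 2014" style strings and parse them into timestamps.
parts["sale_date"] = pd.to_datetime(
    parts["sale_month"] + " " + parts["sale_day"] + " " + parts["sale_year"],
    format="%b %d %Y",
)
print(parts)
```

Compared with one `re.search` per column, this scans each string once and tolerates missing values.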

A quick check to see the summary of values in each of the columns we created...

In [None]:
data['sale_dow'].value_counts()

In [None]:
data['sale_month'].value_counts()

In [None]:
data['sale_day'].value_counts()

In [None]:
data['sale_year'].value_counts()

We have already seen that the year column has values from 1982 to 2015, whereas the sale year is either 2014 or 2015. We can therefore conclude that the year column is the model year.

#### Which makes were hottest in the auction?

Let us see which model years sold the most units and which fetched the highest selling prices. We will then look at the makes within those years.

In [None]:
sale_model=data.groupby('year')['make'].count().sort_values(ascending=False)[:10].reset_index().rename(columns={'make':'total_units'})

In [None]:
plt.figure(figsize=(10,8))
sns.barplot(x='year',y='total_units',data=sale_model,palette='Set1') # pass the palette name; sns.set_palette() returns None
plt.title('Number of Units Sold by year of make',fontsize=15)
plt.xlabel('Year',fontsize=10)
plt.ylabel('Total Units',fontsize=10)

From the chart, cars of model year 2012 had the most auctions, followed by 2013 and 2014. Vintage cars did not sell much, and buyers were more interested in recent model years. Interesting. Let's see which model year tops the list by average selling price.

In [None]:
sale_year=data.groupby('year')['sellingprice'].mean().sort_values(ascending=False)[:10].reset_index()

In [None]:
plt.figure(figsize=(8,8))
sns.barplot(x='year',y='sellingprice',data=sale_year)
plt.title('Top 10 Model Years by Average Selling Price')
plt.xlabel('Model Year')
plt.ylabel('Average Selling Price')

Model year 2012 may have had the most auctions by total units, but by average selling price model year 2015 tops the list, followed by 2014. Another interesting thing to note is that 1982, a vintage model year, also makes it into the top 10 by selling price.
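
Unit counts and average prices can be placed side by side in one table, which makes it easier to spot a model year whose high average rests on only a few cars. A sketch on invented values (not the auction data), using pandas named aggregation:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "year": [2012, 2012, 2012, 2014, 2014, 2015],
    "sellingprice": [11000, 12000, 13000, 19000, 21000, 25000],
})

# Named aggregation: one output column per keyword argument.
by_year = toy.groupby("year")["sellingprice"].agg(units="size", avg_price="mean")
print(by_year.sort_values("avg_price", ascending=False))
```

In this toy table the year with the most units (2012) is not the one with the highest average (2015), which mirrors the pattern described above.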

Merely looking at the model year might not tell us much, since each model year covers multiple makes. Let's concatenate the year and make columns and revisit this analysis.

In [None]:
data['year_make']=data['year'].astype('str')+'_'+data['make']

In [None]:
data['year_make'].value_counts().rename_axis('year_make').reset_index(name='units')[:10] # column names stay the same across pandas versions

In [None]:
sale_make=data.groupby('year_make')['sellingprice'].mean().sort_values(ascending=False)[:10].reset_index()
units_make=data['year_make'].value_counts().rename_axis('year_make').reset_index(name='units')[:10] # column names stay the same across pandas versions

In [None]:
plt.figure(figsize=(22,12))

plt.subplot(1,2,1)

a=sns.barplot(x='sellingprice',y='year_make',data=sale_make)
a.set_title('Top 10 Make&Edition by Selling Price(Avg)',fontsize=15)
a.set_ylabel('Year & Make',fontsize=12)
a.set_xlabel('Selling Price',fontsize=12)

plt.subplot(1,2,2)
b=sns.barplot(x='units',y='year_make',data=units_make)
b.set_title('Top 10 Make&Edition by Total Units Sold',fontsize=15)
b.set_ylabel('Year & Make',fontsize=12)
b.set_xlabel('Units Sold',fontsize=12)


plt.subplots_adjust(hspace=0.45)

* Among the year and make combinations sold at the highest average prices, Ferrari, Rolls-Royce, and Bentley dominate the list. Not surprising.
* By total units sold, Nissan, Ford, Hyundai, and Chevrolet dominate the list.

## Conclusion:

In this notebook, we have done an exploratory analysis of the car auction dataset. Briefly, we have seen what the distribution of selling price looks like, which makes and models of cars are sold, and how parameters such as odometer and transmission affect the selling price.