# Used car sales in the US
This notebook is an introduction to the US used car sales dataset on Kaggle.

## Take a look at the features

In [None]:
import pandas as pd
import os
print(os.listdir("../input"))

In [None]:
df = pd.read_csv('../input/us-used-car-sales-data/used_car_sales.csv')
zip_codes = pd.read_csv('../input/zipcodes-county-fips-crosswalk/ZIP-COUNTY-FIPS_2017-06.csv')
df.head()

For our purposes we only need the zip code and the state. If we simply copied these two columns we would end up with duplicate rows (presumably because the crosswalk lists a ZIP once per county), and those would duplicate sales rows later when we join the two dataframes. To be safe, let's create a dataframe of unique (zip, state) pairs.

In [None]:
zip_codes_clean = zip_codes.groupby(by=['ZIP','STATE'], as_index=False).first()[['ZIP','STATE']]
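A minimal sketch of the same de-duplication with `drop_duplicates`, plus a check for ZIPs that map to more than one state (those would still duplicate sales rows in a join). The toy dataframe and the `COUNTY_FIPS` column name are made up for illustration and are not taken from the crosswalk file.

```python
import pandas as pd

# Toy stand-in for the crosswalk; values and the COUNTY_FIPS name are hypothetical.
zip_codes_demo = pd.DataFrame({
    'ZIP':         [501, 501, 10001, 10001, 90210],
    'COUNTY_FIPS': [36103, 36059, 36061, 36047, 6037],
    'STATE':       ['NY', 'NY', 'NY', 'NY', 'CA'],
})

# Keep one row per (ZIP, STATE) pair.
zip_codes_clean_demo = (
    zip_codes_demo[['ZIP', 'STATE']]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Any ZIP that still appears more than once belongs to several states.
ambiguous_zips = zip_codes_clean_demo[zip_codes_clean_demo['ZIP'].duplicated(keep=False)]

print(zip_codes_clean_demo)
print("ZIPs mapped to more than one state:", len(ambiguous_zips))
```

If `ambiguous_zips` is not empty on the real data, you would have to pick one state per ZIP before joining.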

## Adding some features
Let's add some features that don't come with the dataset but are helpful for visualization and for building models on top of it.
### Age of the car

In [None]:
df['Age'] = df['yearsold'] - df['Year']
df.head()

### US States from ZIP Code
For this to work we first need to clean up the zip codes in the sales dataframe: remove the non-numeric ones and convert the column to integer. I'm converting to integer because the zip code dataframe stores them that way. Leading zeros are dropped on both sides, so the keys should still match.

In [None]:
# Keep only purely numeric zip codes; .copy() avoids a SettingWithCopyWarning below.
df = df[df['zipcode'].str.isdigit() == True].copy()
df['zipcode'] = df['zipcode'].astype(int)
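To see what happens to leading zeros, here is a small sketch on made-up zip codes. It also shows how to get a zero-padded string back with `str.zfill` in case you need the five-digit form later. The `sales_demo` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up zip codes: one with a leading zero, one masked, one missing.
sales_demo = pd.DataFrame({'zipcode': ['02134', '786**', '90210', None]})

# Keep only rows whose zip code consists of digits; missing values count as non-numeric.
is_numeric = sales_demo['zipcode'].str.fullmatch(r'\d+', na=False)
sales_demo = sales_demo[is_numeric].copy()

# Integer form, as used in this notebook: the leading zero is dropped.
sales_demo['zipcode_int'] = sales_demo['zipcode'].astype(int)

# String form: pad back to five digits if the leading zero matters.
sales_demo['zipcode_str'] = sales_demo['zipcode_int'].astype(str).str.zfill(5)

print(sales_demo)
```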

In [None]:
df.shape

In [None]:
df = pd.merge(df, zip_codes_clean, left_on='zipcode', right_on='ZIP', how='left')
df.drop('ZIP',axis=1,inplace=True)


In [None]:
df.shape

OK, we're good. The row count is the same before and after the join, so no sales rows were duplicated.
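Instead of comparing shapes by eye, pandas can enforce this for you. The sketch below uses the `validate` argument of `pd.merge` on toy data (the frames and values are made up): with `validate='many_to_one'` the merge raises a `MergeError` if a ZIP appears more than once in the lookup table.

```python
import pandas as pd

# Toy sales and zip lookup tables; all values are made up.
sales_demo = pd.DataFrame({
    'zipcode':   [10001, 90210, 10001, 99999],
    'pricesold': [5000, 12000, 7000, 3000],
})
zips_demo = pd.DataFrame({'ZIP': [10001, 90210], 'STATE': ['NY', 'CA']})

# many_to_one: many sales rows per ZIP are fine, but each ZIP must be unique on the right.
merged_demo = pd.merge(
    sales_demo, zips_demo,
    left_on='zipcode', right_on='ZIP',
    how='left', validate='many_to_one',
).drop(columns='ZIP')

# With a left join, zip codes missing from the lookup get a NaN state.
unmatched = int(merged_demo['STATE'].isna().sum())

print(merged_demo)
print("rows without a state:", unmatched)
```

Counting the unmatched rows is worth doing as well, since a left join silently keeps sales whose zip code is not in the lookup.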

## Initial Analysis & Cleanup
Let's start with a pairplot to get an overview of the data.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

In [None]:
sns.pairplot(df)
plt.show()

Oh boy, there's some cleanup to do.

### Selling price by Age
Let's take a look at the age/price scatter plot.

In [None]:
# x and y are passed as keywords; newer seaborn versions no longer accept them positionally.
g = sns.scatterplot(data=df, x='Age', y='pricesold')
g.set(xlabel='Age', ylabel='Selling Price')
plt.show()

There are some outliers, most likely due to a wrong model year in the dataset. Let's clean this up.

In [None]:
df[df['Age']>100]

Fix the samples that used YY instead of YYYY. The list above showed only 19xx cars, so you'll need to change the code below if you see cars that clearly were built in the 2000s.

In [None]:
# Drop rows without a usable model year, then expand two-digit years to 19xx.
df = df[df['Year']>0].copy()
df.loc[df['Year']<100, 'Year'] += 1900
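If two-digit years from the 2000s do show up, one possible approach is to decide per row using the sale year as a pivot. This is only a sketch under that assumption: `expand_two_digit_year` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

# Made-up rows mixing two-digit and four-digit model years.
cars_demo = pd.DataFrame({
    'Year':     [67, 99, 5, 18, 1965, 2012],
    'yearsold': [2019, 2019, 2019, 2020, 2018, 2019],
})

def expand_two_digit_year(year, year_sold):
    """Expand a YY model year to YYYY (hypothetical helper, not from the dataset)."""
    if year >= 100:
        return year  # already four digits
    candidate = 2000 + year
    # Allow next-year models: anything up to (sale year + 1) is read as 20xx.
    return candidate if candidate <= year_sold + 1 else 1900 + year

cars_demo['Year'] = [
    expand_two_digit_year(year, sold)
    for year, sold in zip(cars_demo['Year'], cars_demo['yearsold'])
]

print(cars_demo)
```

The pivot is a guess, not a fact: a listing with `Year` 18 sold in 2020 could in principle be a 1918 car, so it is worth eyeballing the affected rows afterwards.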

And recalculate the Age column again

In [None]:
df['Age'] = df['yearsold'] - df['Year']

Let's do the scatterplot again

In [None]:
g = sns.scatterplot(data=df, x='Age', y='pricesold')
g.set(xlabel='Age', ylabel='Selling Price')
plt.show()

There are still some odd-looking old cars with an age of over 100 years. Let's look at them.

In [None]:
df[df['Age']>100]

I'll just delete the wrong ones from the data set.

In [None]:
# Drop exactly the rows inspected above (Age > 100).
df = df[df['Age']<=100]

And let's do the scatterplot one more time.

In [None]:
g = sns.scatterplot(data=df, x='Age', y='pricesold')
g.set(xlabel='Age', ylabel='Selling Price')
plt.show()

Cars with a negative age? Some are next-year models and some are typos. For now I'll just delete them.

In [None]:
df = df[df['Age']>=0]
g = sns.scatterplot(data=df, x='Age', y='pricesold')
g.set(xlabel='Age', ylabel='Selling Price')
plt.show()

Cars for 200k+? Let's see them.

In [None]:
df[df['pricesold']>200000]

I'm not a car salesman but $200k+ for Porsche and Ferrari? Sounds realistic.

### Selling price by Mileage

In [None]:
g = sns.scatterplot(data=df, x='Mileage', y='pricesold')
g.set(xlabel='Mileage', ylabel='Selling Price')
plt.show()

Well, it looks like there are cars with over 1M miles on the clock, which is hard to believe. Let's fix this. We could either replace the mileage with something more reasonable (e.g. the maximum of the "clean" mileage in the remaining dataset) or simply delete those items. I chose the latter and delete all samples with a mileage of 300K or more.
The same applies to samples with a mileage of zero. We're looking at used cars, so I would expect some miles.

In [None]:
df = df[(df['Mileage']<300000) & (df['Mileage']>0)]
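For comparison, here is a sketch of the other option mentioned above: capping implausible mileage at the maximum of the "clean" rows instead of deleting them. The toy values and the `MAX_PLAUSIBLE` name are assumptions for illustration; the 300K threshold is the same one used in the cell above.

```python
import pandas as pd

# Made-up mileage values, including a zero and a clearly implausible one.
mileage_demo = pd.DataFrame({'Mileage': [0, 45000, 120000, 299000, 1500000]})
MAX_PLAUSIBLE = 300000  # same cut-off as in the notebook cell above

# Largest mileage among the rows considered clean.
clean_max = mileage_demo.loc[mileage_demo['Mileage'] < MAX_PLAUSIBLE, 'Mileage'].max()

# Replace implausible values with that maximum instead of dropping the rows.
capped = mileage_demo.copy()
capped.loc[capped['Mileage'] >= MAX_PLAUSIBLE, 'Mileage'] = clean_max

print(capped)
```

Capping keeps the sample but creates an artificial pile-up at the maximum, which is one reason deleting can be the simpler choice for a first pass.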

In [None]:
g = sns.scatterplot(data=df, x='Mileage', y='pricesold')
g.set(xlabel='Mileage', ylabel='Selling Price')
plt.show()

That looks much better. 

The pairplot also showed some issues with the NumCylinders feature. Let's take a closer look at this.

In [None]:
# distplot is deprecated in seaborn; histplot is its replacement for histograms.
sns.histplot(df['NumCylinders'], kde=False, bins=20)
plt.show()

OK, yeah, there's something off. Production cars top out at 16 cylinders. Let's print the outliers here.

In [None]:
df[df['NumCylinders'] > 16]

Just a few above 16. I'll just delete them.

In [None]:
df = df[df['NumCylinders'] <= 16]

And do the histogram again.

In [None]:
sns.histplot(df['NumCylinders'], kde=False, bins=16)
plt.show()

There are a lot of samples with zero cylinders. I'm not cleaning this up here, but you could try to fill in that number from other listings of the same car to correct this as much as possible.
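One way this mapping could look, as a rough sketch: for each Make/Model group, take the most common non-zero cylinder count and use it for the listings that report zero. The toy data and the `most_common_nonzero` helper are made up for illustration. Note that zero can be legitimate (electric cars), which is why groups without any non-zero value are left alone.

```python
import pandas as pd

# Made-up listings; zeros stand for missing cylinder counts.
cyl_demo = pd.DataFrame({
    'Make':  ['Ford', 'Ford', 'Ford', 'Honda', 'Honda', 'Tesla'],
    'Model': ['Mustang', 'Mustang', 'Mustang', 'Civic', 'Civic', 'Model S'],
    'NumCylinders': [8, 0, 8, 4, 0, 0],
})

def most_common_nonzero(values):
    """Most frequent non-zero value in a group, or 0 if there is none (hypothetical helper)."""
    nonzero = values[values > 0]
    return nonzero.mode().iloc[0] if not nonzero.empty else 0

# One typical value per Make/Model group, broadcast back to every row of the group.
typical = cyl_demo.groupby(['Make', 'Model'])['NumCylinders'].transform(most_common_nonzero)

# Keep reported values; fall back to the group's typical value where the count is zero.
cyl_demo['NumCylinders'] = cyl_demo['NumCylinders'].where(cyl_demo['NumCylinders'] > 0, typical)

print(cyl_demo)
```

On the real data you would probably want to group by model year or trim as well, since the same model is often sold with different engines.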

## Splitting dataset into Oldtimers and Newtimers
I think it makes sense to split the dataset into two: historical cars (older than 25 years) and "normal" cars that are 25 years old or younger.

In [None]:
oldtimers = df[df['Age'] > 25]
newtimers = df[df['Age'] <= 25]

Let's take a look at the age/selling price distribution side by side

In [None]:
plt.subplot(1, 2, 1)
plt.scatter(newtimers['Age'],newtimers['pricesold'])
plt.ylabel('Selling Price')
plt.xlabel('Age')
plt.title('Newtimers Selling Prices')
plt.subplot(1, 2, 2)
plt.scatter(oldtimers['Age'],oldtimers['pricesold'])
plt.ylabel('Selling Price')
plt.xlabel('Age')
plt.title('Oldtimers Selling Prices')
plt.tight_layout()
plt.show()

Looks reasonable to me. Newtimer prices go down with age, while oldtimer prices seem to go up with age.

## More Visualizations
### Sales by Car Makes
What's the Car Make breakdown in both groups?

In [None]:
import numpy as np

plt.rcParams["figure.figsize"] = [10,5]
plt.subplot(1, 2, 1)
# Top 10 makes by sales count, sorted ascending so the largest bar ends up on top.
top_makes = newtimers['Make'].value_counts(ascending=True).tail(10)
makes = top_makes.index
y_pos = np.arange(len(makes))
salescount = top_makes.values
plt.barh(y_pos, salescount, align='center', alpha=0.5)
plt.yticks(y_pos, makes)
plt.ylabel('Makes')
plt.title('Newtimers Top 10 Sales count')
plt.subplot(1, 2, 2)
top_makes = oldtimers['Make'].value_counts(ascending=True).tail(10)
makes = top_makes.index
y_pos = np.arange(len(makes))
salescount = top_makes.values
plt.barh(y_pos, salescount, align='center', alpha=0.5)
plt.yticks(y_pos, makes)
plt.ylabel('Makes')
plt.title('Oldtimers Top 10 Sales count')
plt.tight_layout()
plt.show()

Some differences in Makes between New- and Oldtimers. Nissan and Honda haven't made it to the oldtimer section, yet :)

### Sales by Region

In [None]:
# Count the sales per state once and reuse the result.
state_counts = df['STATE'].value_counts()
states = state_counts.index
salescount = state_counts.values

In [None]:
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

data = [ dict(
        type='choropleth',
        locations = states,
        z = salescount, 
        locationmode = 'USA-states',
        colorbar = dict(
            title = "Salescount")
        ) ]

layout = dict(
    title = 'US used car sales by states',
    geo = dict(
        scope = 'usa',
        projection=dict(type='albers usa')
    )
)

fig = dict(data=data, layout=layout)
iplot(fig, filename='d3-choropleth-map')

## Analysis on a specific model
Let's focus on newtimers and one of the top selling models, the 2007 Ford Mustang.

In [None]:
newtimers.groupby(by=['Make','Model','Year']).size().sort_values(ascending=False).head()

In [None]:
# Build the filter once and reuse it for both columns.
is_mustang_2007 = ((newtimers['Make'] == 'Ford')
                   & (newtimers['Model'] == 'Mustang')
                   & (newtimers['Year'] == 2007))
mileage = newtimers.loc[is_mustang_2007, 'Mileage']
salesprices = newtimers.loc[is_mustang_2007, 'pricesold']

In [None]:
plt.scatter(mileage,salesprices)
plt.ylabel('Selling Price')
plt.xlabel('Mileage')
plt.show()