# What about cars? (EDA)
<image>
<img src="https://m.eet.com/media/1161066/benz_070312.jpg">
 <figcaption>Source: https://www.edn.com/electronics-blogs/edn-moments/4376656/Karl-Benz-drives-the-first-automobile--July-3--1886</figcaption>    
</image>

In this kernel we explore some of the aspects that can be deduced from the Used Cars Dataset, which conveys information about cars on sale on Craigslist. Notice that further analysis remains to be performed, and this kernel is still being updated with new ideas. 

# Table of contents

* [1. Volume/Price comparison](#section1)
* [2. Age analysis](#section2)
* * [2.1. Volume/Age analysis](#section2.1)
* * [2.2. Fitting Gamma and Beta distributions](#section2.2)
* * [2.3. Price/Age analysis](#section2.3)
* [3. Manufacturer analysis](#section3)
* [4. USA vs the World](#section4)


# Data reading

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
data = pd.read_csv('/kaggle/input/craigslist-carstrucks-data/craigslistVehicles.csv');

<a id='#section1'> </a>

# 1. Volume/Price comparison

In a first instance it seems interesting to understand how prices and vehicles are distributed. That is, how many vehicles for a certain price range are on sale without further knowledge about manufacturer, location, etc.


In [None]:
# Count the vehicles in decade-wide price bins from 1e2 to 1e9.
[density, edges] = np.histogram(data['price'], np.logspace(2,9, 8));
fig = plt.figure(figsize=(10, 5), dpi= 80, facecolor='w', edgecolor='k')
plt.bar(x =range(7), height = np.log(density))
# Label each bar with the lower and upper edge of its price bin.
tick_labels = ['{:.0e} to {:.0e}'.format(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
plt.xticks(range(7), tick_labels, rotation=45)
plt.title('Number of vehicles (logarithmic) vs price range')
plt.xlabel('Price range')
plt.ylabel('# of vehicles (log)');


We observe that most of the vehicle prices concentrate in the three lowest bins, that is, between $100$\$ and $100000$\$. Not everybody needs a car that costs more than $100000$\$, and of course the price depends strongly on the manufacturer/make/condition of the car.


<a id='#section2'> </a>


# 2. Age analysis

It is interesting to correlate the price with the age of the car, a parameter that can be computed thanks to its 'year' label.
First, we notice that there are some cars whose 'year' is not available. We assign them the -1 value.


In [None]:
current_year = 2019;
from sklearn.impute import SimpleImputer
# Replace the missing 'year' values (NaN) with the placeholder -1.
imputer = SimpleImputer(strategy='constant', fill_value=-1);
data['year'] = imputer.fit_transform(np.reshape(data['year'].values, (len(data),1))).ravel()

Notice that cars with no 'year' will have an age of (current_year+1).

We also impute cars whose 'year' equals current_year+1 (2020), that is, cars listed as newer than the current year (2019).


In [None]:
# Treat year == current_year+1 as a missing value and replace it with -1 as well.
imputer = SimpleImputer(missing_values = current_year+1, strategy='constant', fill_value=-1);
data['year'] = imputer.fit_transform(np.reshape(data['year'].values, (len(data),1))).ravel()
data['Age'] = current_year - data['year'];
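The same flagging can be sketched with pandas alone. This is only an illustration on a hypothetical toy frame (`toy` is not part of the dataset), and it assumes that every year later than `current_year` should be flagged, which is slightly broader than the cell above, where only `current_year+1` is replaced.

```python
import numpy as np
import pandas as pd

current_year = 2019

# Hypothetical listings; NaN marks a missing 'year'.
toy = pd.DataFrame({'year': [2015.0, np.nan, 1999.0, 2020.0]})

# Flag years that are missing or later than current_year with the placeholder -1.
unknown = toy['year'].isna() | (toy['year'] > current_year)
toy['year'] = toy['year'].mask(unknown, -1)

# Flagged rows end up with Age == current_year + 1, so they are easy to filter out later.
toy['Age'] = current_year - toy['year']
print(toy)
```

Keeping the placeholder age far away from any real age is what lets the later cells skip these rows with a single comparison against `current_year+1`.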

<a id='#section2.1'> </a>

## 2.1. Volume/Age analysis
An interesting idea that can be explored is how old the cars on sale are. We design a plot that compares the volume of cars on sale against their age.

In [None]:
aux = data['Age'].value_counts().to_dict()
x = np.sort(list(aux.keys()));
y = [aux[i] for i in x]
plt.figure(figsize = (16,8));
# Skip the last entry: the largest 'Age' is the placeholder for cars with no valid 'year'.
plt.bar(x[:-1], y[:-1]);
plt.xlabel('Age (years)');
plt.ylabel('# of cars on sale');
plt.title('Volume of cars on sale considering their year of manufacturing');


<a id='#section2.2'> </a>


## 2.2. Fitting Gamma and Beta distributions
The distribution of cars by age shown above looks similar to a Gamma/Beta distribution, so we try to fit the data to see if they are actually that similar.


In [None]:
import scipy.stats as stats
# gamma.fit returns (shape, loc, scale); only plausible ages are used for the fit.
fit_alpha, fit_loc, fit_scale = stats.gamma.fit(data['Age'][data['Age'] < 140])
y_gamma = stats.gamma.rvs(fit_alpha, loc=fit_loc, scale=fit_scale, size=1000000, random_state=1492);
# One-year bins starting at 0, so that bin i lines up with age i.
y_gamma_values = plt.hist(y_gamma, bins=np.arange(122));
plt.close()

# beta.fit returns (a, b, loc, scale).
fit_alpha, fit_beta, fit_loc, fit_scale = stats.beta.fit(data['Age'][data['Age'] < 140])
y_beta = stats.beta.rvs(fit_alpha, fit_beta, loc=fit_loc, scale=fit_scale, size=1000000, random_state=1492);
y_beta_values = plt.hist(y_beta, bins=np.arange(122));
plt.close()

To compare, I convert the volume of cars per age into a histogram with 121 bins, filling the ages that have no cars with zeros. I then normalise it and plot it against the histograms of the random numbers drawn above from the fitted Gamma and Beta distributions.

In [None]:
# Number of cars for each age 0..120; ages with no cars count as zero.
y_data = np.array([aux.get(i, 0) for i in range(121)]);
plt.figure(figsize = (16,8))
# Normalise every histogram so that it sums to one.
est_data =  y_data/np.sum(y_data);
est_gamma = y_gamma_values[0]/np.sum(y_gamma_values[0]);
est_beta = y_beta_values[0]/np.sum(y_beta_values[0]);
plt.bar(range(121), est_data, label = 'Data')
plt.plot(range(121), est_gamma, alpha = 1, label = 'Gamma distribution', color='r',linewidth = 5)
plt.plot(range(121), est_beta, alpha = 1, label = 'Beta distribution', color='g',linewidth = 5)
plt.legend();
plt.xlabel('Age (years)');
plt.ylabel('Fraction of cars');
plt.title('Comparison of the fitted Gamma and Beta distributions against the age distribution of our dataset.');

We observe that both distributions seem to fit rather well the distribution of cars by age in our dataset. Other distributions may produce a better fit, and a distance/divergence analysis should be performed to quantify this.
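One possible form of such a divergence analysis is sketched below. It is not part of the original kernel: the ages are synthetic, the helper `binned_kl` is a hypothetical name, and the Beta support is pinned to $(-1, 121)$ by assumption to keep the fit simple. The Kullback-Leibler divergence is computed between the empirical one-year histogram and the probability that each fitted distribution assigns to the same bins.

```python
import numpy as np
import scipy.stats as stats

# Synthetic integer ages standing in for data['Age'] (an assumption of this sketch).
rng = np.random.default_rng(0)
ages = np.floor(rng.gamma(shape=2.0, scale=6.0, size=5000))
ages = ages[ages <= 120]

# Empirical probability of each one-year bin, ages 0..120.
edges = np.arange(122)
p_data = np.histogram(ages, bins=edges)[0] / len(ages)

def binned_kl(p, dist):
    """KL divergence D(p || q), where q comes from the CDF of a frozen scipy distribution."""
    q = np.diff(dist.cdf(edges))
    q = q / q.sum()              # renormalise over the 121 bins
    mask = p > 0                 # bins with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-300))))

gamma_fit = stats.gamma(*stats.gamma.fit(ages))
# Assumption: loc and scale of the Beta are fixed so that its support is (-1, 121).
beta_fit = stats.beta(*stats.beta.fit(ages, floc=-1, fscale=122))

kl_gamma = binned_kl(p_data, gamma_fit)
kl_beta = binned_kl(p_data, beta_fit)
print('KL(data || gamma) =', kl_gamma)
print('KL(data || beta)  =', kl_beta)
```

The smaller the divergence, the closer the fitted distribution is to the data, so the two numbers give a simple way to rank candidate distributions.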

<a id='#section2.3'> </a>

## 2.3. Price/Age analysis
Once we've got the age the 'price' vs 'Age' plot is obtained.

In [None]:
ages = np.unique(data['Age']);
fig = plt.figure(figsize=(16, 10), dpi= 80, facecolor='w', edgecolor='k')

for i in range(len(ages)):
    if ages[i]!= (current_year+1):
        y = data['price'][data['Age'] == ages[i]]
        x = data['Age'][data['Age'] == ages[i]]
        plt.scatter(x,(y))

plt.xlabel('Age (years)')
plt.ylabel(r'Price (\$)')
plt.yscale('symlog')
plt.ylim(0, 1e10);
plt.grid()
plt.title('Raw prices');

The following caveats should be taken into consideration. 
1. We assume that the vendors are confident about the year of the cars they are selling. Consequently, we consider authentic the examples where the age of the car is greater than $100$ years.
2. We are not removing prices as ridiculous as $0$\$ or $1$\$.
3. We are not removing car prices as big as $10e6$ and so on. Some true examples with these prices might exist.

Some **conclusions** that can be drawn are:
1. There are several cars whose price is near to 0 or 1, which might hint that some vendor wants to negotiate the price (perhaps). On the contrary, these examples might be simply incomplete. Additionally, some of these cars might be on the market just for spare parts.
2. The newer the car (age 0 and nearby), the wider its price range is. For example, for cars whose year is set to 2019 we have prices in the range $[10, 500000]$ \$, with even bigger prices that we will ignore in this analysis. This means that newer cars are offered at lower starting prices than older ones.
3. The previous conclusion does not hold for cars older than 20 years. For example, we observe that, outlier prices aside, the price range is $[100, 100000]$, and it keeps narrowing for older cars. The limit example that could be considered without relying on outlier values could be around 90-year-old cars, where the price range is $[5000, 50000]$.
4. From the previous two conclusions we can also deduce that newer cars can be sold for a higher price than older cars. This makes sense since newer cars shouldn't suffer as much deterioration as older cars. On the other hand, some of the older cars might be considered collector pieces, a scenario where their price might fluctuate depending on how popular each car is. Additionally, I don't know if Craigslist is used to sell brand-new cars, although that would help to explain why some prices are that big.


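We need to clean the data from outliers, so we compute the Z-score of each sample (see https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/z-score/):
\begin{equation}
Z_i = \frac{X_i-\overline x}{ s}
\end{equation}

where $\overline x$ is the sample mean and $s$ is the sample standard deviation. Then, we remove the points where $|Z_i| >3$, that is, the points that lie more than three standard deviations away from the mean, which we consider to be outliers. The procedure is repeated on the remaining points until no $|Z_i|$ exceeds the limit.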


In [None]:
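z_lim = 3;
def z_score_cleaning(y_vec, z_lim):
    """Iteratively drop the points with |Z| > z_lim.

    Returns the mean and standard deviation of the cleaned data,
    followed by the cleaned data itself.
    """
    while True:
        ave   = np.mean(y_vec);
        stdev = np.std(y_vec);
        Z = np.abs(y_vec - ave)/stdev;
        if np.max(Z) > z_lim:
            # Keep only the points within the limit and recompute the statistics.
            y_vec = y_vec[Z <= z_lim];
        else:
            break
    return ave, stdev, y_vec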
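To see why the cleaning is iterated, here is a small self-contained sketch on made-up prices. `iterative_z_clean` is a hypothetical re-implementation of the same idea (it is not the function used in the rest of the kernel), and the numbers are invented for illustration.

```python
import numpy as np

def iterative_z_clean(values, z_lim=3.0):
    """Repeatedly drop points with |Z| > z_lim until none remain (illustrative sketch)."""
    values = np.asarray(values, dtype=float)
    while True:
        mean, std = values.mean(), values.std()
        if std == 0:
            break  # all remaining values are identical, nothing left to remove
        keep = np.abs(values - mean) / std <= z_lim
        if keep.all():
            break
        values = values[keep]
    return mean, std, values

# 100 ordinary prices plus one absurd listing.
prices = np.array([100.0] * 50 + [110.0] * 50 + [1e6])
mean, std, cleaned = iterative_z_clean(prices)
print(len(prices), '->', len(cleaned), '| mean:', mean, '| std:', std)
```

A single extreme price inflates both the mean and the standard deviation, so the statistics have to be recomputed after each removal before the remaining points can be judged fairly.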

In [None]:
ages = np.unique(data['Age']);
fig = plt.figure(figsize=(16, 10), dpi= 80, facecolor='w', edgecolor='k')

for i in range(len(ages)):
    if ages[i]!= (current_year+1):
        y = data['price'][data['Age'] == ages[i]]
        dummy, dummy, y_vec = z_score_cleaning(y, z_lim)
        x = np.ones(len(y_vec))*ages[i];
        plt.scatter(x,y_vec)

plt.xlabel('Age (years)')
plt.ylabel(r'Price (\$)')
plt.grid()
plt.title('Prices after removing outliers using Z-scores');

After cleaning using the Z-score and suppressing the logarithmic Y-axis it can be observed that the global ranges are different than before. We observe that:
* The newer the car, the higher its price. The price decreases until the car becomes 20 years old, when it starts increasing again. This suggests again that some classic cars are quite valuable.
* When age increases significantly (Age > 45) we observe that prices become more sparse, and there are fewer and fewer cars available on sale.

<a id='#section3'> </a>

# 3. Manufacturer analysis
The dataset includes 43 manufacturers, and we start analyzing the most frequent ones.

In [None]:
manufacturers_counts = data['manufacturer'].value_counts().to_dict()
fig = plt.figure(figsize=(16, 10), dpi= 80, facecolor='w', edgecolor='k')
plt.bar(manufacturers_counts.keys(), manufacturers_counts.values())
plt.xticks(rotation = 45);
plt.xlabel('Manufacturer')
plt.ylabel('# of cars on sale')

It is no surprise that the most frequent manufacturers in the top-10 are actually from the USA (Ford, Chevrolet, Ram, etc.).  The top sellers by far are Ford and Toyota. I don't know if this is positive or not, since it might imply that these companies sell a lot of cars in the country (which makes sense) or that the owners simply want to get rid of them.


We will study the top-9 manufacturers individually using their price/age plots. The goal is to determine which cars hold their value for a longer period of time. Since I am not interested in prices that are in $[0,1]$\$ or extremely big (above $10^6$\$), I cut them out with the limits of the plot.

In [None]:
fig = plt.figure(figsize=(16, 10), dpi= 80, facecolor='w', edgecolor='k')
for j in range(9):
    manufacturer = list(manufacturers_counts.keys())[j];
    plt.subplot(3,3,j+1)
    for i in range(len(ages)):
        if ages[i]!= (current_year+1):
            y = data[data['manufacturer']==manufacturer]['price'][data['Age'] == ages[i]]
            x = np.ones(len(y))*ages[i];
            #x = data[data['manufacturer']==manufacturer]['Age'][data['Age'] == ages[i]]
            plt.scatter(x,(y))
    if j > 5:
        plt.xlabel('Age (years)')
    plt.ylabel(r'Price (\$)')
    plt.yscale('symlog')
    plt.ylim(2, 1e6);
    plt.xlim(0, 120);
    plt.grid();
    plt.title('{0}'.format(manufacturer));

From the previous charts we deduce:
* According to the dataset the USA companies that have been selling cars for a longer period of time are Ford, Chevrolet and Dodge. I discarded values close to 120 since some of the companies were not making cars so many years ago. 
* With respect to the Japanese companies (Toyota, Nissan and Honda), the oldest cars available in the dataset were manufactured by Toyota almost 50-60 years ago. This makes sense since, according to Wikipedia, Toyota was the first Japanese company to sell cars in the USA, starting in 1957. Nissan and Honda seem to have arrived in the States in the 70s.
* It is interesting to observe that for recent cars, let's say younger than 10 years, most of the manufacturers show a common top value around $10e5$, whereas some differences exist for the bottom prices. 

From these plots we design a new graph that shows the average and standard deviation of the prices of each manufacturer for each age. First, we need to clean the data from outliers, considering each age and manufacturer independently and using their Z-score as before. Notice that we require more than 10 samples to compute the sample mean and standard deviation.


In [None]:
fig = plt.figure(figsize=(16, 10), dpi= 80, facecolor='w', edgecolor='k')
for j in range(9):
    manufacturer = list(manufacturers_counts.keys())[j];
    plt.subplot(3,3,j+1)
    for i in range(len(ages)):
        if ages[i]!= (current_year+1):
            y = data[data['manufacturer']==manufacturer]['price'][data['Age'] == ages[i]]
            if len(y) > 10:
                ave, stdev, dummy = z_score_cleaning(y, z_lim)
                x = ages[i];
                plt.errorbar(x, ave, yerr=stdev)
    if j > 5:
        plt.xlabel('Age (years)')
    plt.ylabel(r'Price (\$)')
    plt.xlim(0, 120);
    plt.grid();
    plt.title('{0}'.format(manufacturer));


First of all, notice that there are years with no 'price' data. This occurs since we required more than 10 samples to estimate the Z-score. In addition, some of the prices show a huge standard deviation, which happens because the number of samples is bigger than 10, but still small. In that case the standard deviation is severely affected when at least one of the values is truly big or small.

Although this is not conclusive since we don't have enough data for some of the years, we observe that:
* There seems to be a pattern suggesting that during the first 2-3 years the value of the car remains similar and relatively high. Then, its value decreases to a minimum.
* In the particular case of **Ford**, we observe that from 20 to 40 years car prices remain minimum, and after 40 years they start increasing until they show similar values or even bigger than new Ford cars. Consequently, this plot would suggest that many Ford cars that are more than 40 years old are still very valued in the market (or at least they are as expensive as new Ford cars). **Chevrolet** cars show a similar trend.  Could we state that USA buyers love classic (maybe muscle) cars?
* About **Toyota**, we observe that the oldest cars on sale were produced 40 years ago. However, the trend seems to be similar to the one of Ford and Chevrolet, although it stops earlier.
* The other two Japanese manufacturers, **Nissan** and **Honda**, seem to reach their minimum values after 20 years. There is no clear sign of revaluation in their plots. This also occurs for Ram, although it seems to be a younger company and its cars might gain value again similarly to Ford and Chevrolet.
* About **Jeep**, **GMC** and **Dodge**, there is missing information about cars older than 40 years. I would not risk saying that classic Jeep cars show signs of revaluation, nor that GMC ones do. However, with the little information that we have about Dodge it seems that some cars produced 50 to 60 years ago are sold with prices similar to new models. 

We cannot conclude if this trend on the prices will hold in the following, let's say, 50 years. However, what seems to occur is that classic Ford and Chevrolet cars are quite appreciated and valued by car owners in the USA.
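The per-manufacturer, per-age statistics discussed in this section can also be sketched with a pandas `groupby`. The frame `toy` and the threshold `min_samples` below are hypothetical, and the sketch skips the Z-score cleaning step, so it only illustrates the minimum-sample filter. Note also that pandas computes the sample standard deviation (`ddof=1`), whereas `np.std` defaults to `ddof=0`.

```python
import pandas as pd

# Hypothetical listings standing in for the 'manufacturer', 'Age' and 'price' columns.
toy = pd.DataFrame({
    'manufacturer': ['ford'] * 4 + ['toyota'] * 2,
    'Age':          [1, 1, 1, 30, 1, 1],
    'price':        [30000, 32000, 28000, 9000, 25000, 27000],
})

min_samples = 3  # assumed threshold for this toy example only

# Count, mean and standard deviation of the price for each (manufacturer, age) pair.
stats_by_group = (
    toy.groupby(['manufacturer', 'Age'])['price']
       .agg(['count', 'mean', 'std'])
       .reset_index()
)

# Keep only the groups with enough listings for the statistics to be meaningful.
reliable = stats_by_group[stats_by_group['count'] >= min_samples]
print(reliable)
```

Groups that fall below the threshold simply disappear from `reliable`, which is the same reason why some ages have no error bar in the plots above.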






<a id='#section4'> </a>

# 4. USA vs the World
We study now the price evolution of cars that belong to the USA against foreign manufacturers.

In [None]:
# pd.isna(data['manufacturer'])
Europe_cars = ['alfa-romeo','aston-martin', 'audi','bmw','ferrari','fiat','jaguar','land rover', 'porche','mercedes-benz',  'morgan','volkswagen', 'volvo', 'rover', 'mini']
Asia_cars = [  'kia', 'infiniti',  'hyundai', 'acura', 'datsun', 'honda', 'lexus', 'mazda', 'mitsubishi','nissan', 'subaru', 'toyota']   
USA_cars = ['lincoln','hennessey','saturn','buick', 'cadillac', 'chevrolet', 'chrysler', 'dodge',  'ford', 'gmc', 'harley-davidson', 'jeep', 'pontiac', 'ram', 'mercury']
manufacturer_region = [];
for i in data['manufacturer']:
    if i in Europe_cars:
        manufacturer_region.append('EU');
    elif i in Asia_cars:
        manufacturer_region.append('AS');
    elif i in USA_cars:
        manufacturer_region.append('USA')
    else:
        manufacturer_region.append('NONE')       

data['Manufacturer_region'] = manufacturer_region;

A simple analysis that can be performed consists of showing how many cars belong to manufacturers of each region or, equivalently, the percentage of cars from each region.

We define four regions: 
* USA:  obviously
* AS:   Asia
* EU:   Europe
* NONE: It includes the data entries whose manufacturer is missing or does not appear in any of the three lists above.
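As an alternative to the explicit loop, the region labelling and the percentages can be sketched with a dictionary lookup and `value_counts`. The brand lists below are shortened and the manufacturer column is made up ('tesla' is only there as an example of an unlisted brand), so the numbers have nothing to do with the real dataset.

```python
import pandas as pd

# Shortened, illustrative brand lists; the full lists are defined in the cell above.
brands_by_region = {'EU': ['bmw', 'volvo'], 'AS': ['toyota', 'honda'], 'USA': ['ford', 'jeep']}

# Invert the lists into a brand -> region lookup table.
region_of = {brand: region for region, brands in brands_by_region.items() for brand in brands}

# Hypothetical manufacturer column with a missing entry and an unlisted brand.
manufacturers = pd.Series(['ford', 'toyota', None, 'bmw', 'tesla', 'ford'])

# Brands that are missing or not in the lookup table fall into 'NONE'.
regions = manufacturers.map(region_of).fillna('NONE')

# Percentage of listings per region, including 'NONE'.
shares = regions.value_counts(normalize=True) * 100
print(shares.round(2))
```

Because 'NONE' is kept in `shares`, the percentages add up to 100, which makes it easy to check how much of the data is left out of the regional comparison.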

In [None]:
aux = data['Manufacturer_region'].value_counts().to_dict();
radio_usa = aux['USA']/sum(aux.values());
eps = 0.025;
# radio_sum = radio*2;
circle1 = plt.Circle((radio_usa+eps, radio_usa), 
                     radio_usa,
                     linewidth=5,
                     #facecolor='r',
                     facecolor=((255/255,102/255,102/255,1)) ,
                     edgecolor = ((255/255,0/255,0/255,1))
                    )

radio_asia = aux['AS']/sum(aux.values());
circle2 = plt.Circle((2*radio_usa+radio_asia+eps, radio_asia), 
                     radio_asia,
                     linewidth=5,
                     facecolor= ((178/255,255/255,102/255,1)) ,
                     edgecolor =  ((0/255,255/255,0/255,1))
                    )

radio_europe = aux['EU']/sum(aux.values());
circle3 = plt.Circle((2*(radio_usa+radio_asia)+radio_europe+eps,radio_europe),
                     radio_europe, 
                     linewidth=5,
                     facecolor=((153/255,153/255,255/255,1)) ,
                     edgecolor = ((0/255,0/255,255/255,1)) 
                    )
fig, ax = plt.subplots(figsize = (16,8)) 
ax.add_artist(circle1)
ax.add_artist(circle2)
ax.add_artist(circle3)
plt.text(radio_usa-eps,radio_usa*2+eps,
         'USA\n{0:.2f}%'.format(radio_usa*100),
         fontsize=25)
plt.text(radio_usa*2+radio_asia*3/4,radio_asia*2+eps,
         'Asia\n{0:.2f}%'.format(radio_asia*100),
         fontsize=25)
plt.text(radio_usa*2+radio_asia*2+radio_europe/2,radio_europe*2+eps,
         'EU\n{0:.2f}%'.format(radio_europe*100),
         fontsize=25)

plt.xlim(0,2);
plt.ylim(0, 1.5);

Notice that the previous bubbles **DO NOT SUM** to $100\%$ because the class 'NONE' was not included. What seems clear is that the vehicles on sale were mostly manufactured by US companies ($57.67\%$). The second region is Asia, which holds $28\%$ of the market and seems to be dominated by Japanese companies. Finally, the smallest contribution comes from the European manufacturers, with less than $10\%$.

The 'NONE' class includes $4.89\%$ of the vehicles in the dataset, a clear minority. Consequently, we will not consider it in later regional analysis.

We complete this particular analysis studying the quartile prices for each of the available regions. First, we remove the outliers considering the Z-score as we explained in a previous section. Then, we split data into three Age ranges: 
* Cars with ages in [0, 10).
* Cars with ages in [10, 20).
* Cars older than 20 years.

In [None]:
#  z_score_cleaning(y_vec, z_lim)
y_out = {};
age_range = [0, 10, 20,140];

for i_range in range(len(age_range)-1):
    y_aux = [];
    for manufacturer in ['USA','AS','EU']:
        y = data['price'][(data['Manufacturer_region'] == manufacturer) & 
                          (data['Age']>=age_range[i_range]) 
                          & (data['Age']<age_range[i_range+1])]
        [dummy, dummy, y_clean] = z_score_cleaning(y, z_lim);
        y_aux.append(y_clean);
    y_out[i_range] = y_aux;
    

fig = plt.figure(figsize=(16,6))
for i_range in range(len(age_range)-1):
    plt.subplot(1,3,i_range+1)
    plt.boxplot(y_out[i_range],labels=['USA','Asia','Europe']);
    if i_range == 0:
        plt.ylabel('Price ($)');
        plt.title('Age between 0 and 10 years (young)')
    elif i_range == 1:
        plt.title('Age between 10 and 20 years (middle)')
    elif i_range == 2:
        plt.title('Age older than 20 years (old)')
    plt.ylim(0,60000)
    plt.grid()

From these boxplots we can extract the following conclusions:
* Median price tends to be bigger no matter the age range.
* For cars less than ten years old, every region shows its biggest quartile values. This means that newer cars are usually sold at a higher price. Do not get confused by the conclusions of previous sections, where we studied the whole price range for each year. Here, the boxplots show the prices after computing the quantiles, which take into consideration the concentration of price values in different ranges: 25% at the bottom of the box and 75% at the top. In addition, Z-score cleaning is not performed globally for all the regions, but individually for each region and age range.
* For the 'young' box plots, all the regions show a similar 0.25 quartile. On the contrary, the 0.75 quartile is clearly bigger for the USA cars, followed by the European cars. This would suggest that Asian cars are being sold for a slightly smaller price than cars from the other regions. This effect is not so noticeable for 'middle' aged cars.
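The quartiles that the boxplots display can also be tabulated directly. The sketch below uses a hypothetical toy frame and omits the Z-score cleaning, so it only shows how the three age ranges and the 0.25, 0.5 and 0.75 quantiles could be obtained with `pd.cut` and `groupby`.

```python
import pandas as pd

# Hypothetical listings standing in for the real columns.
toy = pd.DataFrame({
    'Manufacturer_region': ['USA', 'USA', 'USA', 'AS', 'AS', 'AS'],
    'Age':   [2, 5, 15, 3, 12, 25],
    'price': [30000, 20000, 8000, 18000, 6000, 3000],
})

# right=False makes the intervals [0, 10), [10, 20) and [20, 140).
toy['age_band'] = pd.cut(toy['Age'], bins=[0, 10, 20, 140], right=False,
                         labels=['young', 'middle', 'old'])

# One row per (age band, region), one column per quantile.
quartiles = (
    toy.groupby(['age_band', 'Manufacturer_region'], observed=True)['price']
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

Reading the table row by row gives the same comparison as the boxes: the 0.25 and 0.75 columns are the bottom and top of each box, and the 0.5 column is the median line.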