# Data analysis and hypothesis testing on Singapore used car data

### Singapore used car market background

Singapore has a unique car price structure: because land is scarce, the government uses several policies to curb the vehicle population. The allowed growth rate is determined by the ratio of car population to roads built, and it is enforced through a quota system under which only a certain number of cars can be registered each month. The system is called the Certificate of Entitlement (COE), a licence required to legally own a car in Singapore. A COE is valid for ten years, after which the owner may renew it for 5 or 10 years to keep the car. Otherwise, the vehicle must be scrapped or exported to make way for a new car to be registered by another owner.

Besides the COE, other taxes also discourage car ownership. To put this in context, a popular sedan such as a new Toyota Corolla may cost 28,000 to 40,000 (in Singapore dollars) elsewhere in the world, but 80,000 to 120,000 in Singapore, depending on the COE price, which fluctuates with the available quota and with supply and demand.

More information can be found at https://en.wikipedia.org/wiki/Driving_in_Singapore

In this analysis, we would like to find out more about the Singapore used car market. New car prices fluctuate with the COE quota, and used car prices are affected when new car prices change. Besides the COE, we would also like to know whether other factors affect a used car's resale price. Are some popular brands selling at higher prices, and does a car with lower mileage attract a better price?

### Dataset

This dataset was obtained from an online used car portal in May 2021.

### Information about the data set

<b>Brand</b> - Brand of the vehicle

<b>Type</b> - Model type classification, e.g. sedan, SUV

<b>Reg_date</b> - The registration date of the vehicle. The COE ends ten years from this date

<b>Coe_left</b> - The remaining lifespan of the vehicle before its COE must be renewed or the vehicle scrapped

<b>Dep</b> - The yearly depreciation amount. It differs between vehicles even when they are of the same age and model.

<b>Mileage</b> - The mileage clocked on the vehicle

<b>Road Tax</b> - The yearly taxable amount for usage of public roads

<b>Dereg Value</b> - The value of the vehicle if it is deregistered today, an amount which is given back for vehicle scrapped

<b>COE</b> - The price of COE paid when the car is registered

<b>Engine Cap</b> - The engine capacity size in CC

<b>Curb Weight</b> - The unladen weight of the vehicle

<b>Manufactured</b> - The year in which the vehicle is manufactured

<b>Transmission</b> - Gearbox type of vehicle

<b>OMV</b> - Open market value, the assessed value of the vehicle at import, roughly its cost price from the factory

<b>ARF</b> - Additional Registration Fee, a tax paid on top of the car price, COE and miscellaneous licence fees

<b>Power</b> - Power rating of vehicle in horsepower

<b>No. of Owners</b> - Number of previous owners

<b>Price</b> - The asking price of the vehicle


### Data analysis and hypothesis testing

Using the variables available in the data set, we will perform an EDA and some statistical testing to gain new insights. To keep the data as close to the original as possible, we will drop most rows with null values and impute only those values that can be calculated from other available variables. The analysis proceeds as follows:

* Load data
* Preprocess data and data cleaning
* EDA 
* Determine business question for testing to gain new insights
* Investigation to conclusion

# Data Loading

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import statistics
import scipy.stats as stats
import statsmodels.api as smi 
from sklearn.model_selection import StratifiedShuffleSplit


In [1]:
# load the web-scraped dataset
df = pd.read_csv('../input/singapore-used-car/SG_usedcar.csv')
df = df.drop(['Unnamed: 18'],axis=1)

In [1]:
df

In [1]:
# Replacing the missing values with np.nan
df = df.replace('N.A',np.nan)
df = df.replace('N.A.',np.nan)
df.info()

There are some null values in the data set, which we will preprocess and validate.

In [1]:
# get the number of missing data points per column
missing_values_count = df.isnull().sum()
missing_values_count

In [1]:
# total missing values
total_cells = np.prod(df.shape)  # np.product is deprecated in newer NumPy releases
total_missing = missing_values_count.sum()

# percentage of missing values
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)

In [1]:
# dropping the null value in some of the columns as without them the entry is not very useful in the analysis
df.dropna(subset=['Price','Mileage','Reg_date','No. of Owners','COE'],inplace=True)
df = df.reset_index(drop=True)

In [1]:
# removing cars aged above 10 years (renewed COE), off-peak cars (OPC) and commercial vehicles
brand_text = df['Brand'].astype(str)
df = df[~brand_text.str.contains('COE|OPC')]
df = df[~df['Type'].isin(['Van', 'Bus/Mini Bus', 'Truck', 'Others'])]
df = df.reset_index(drop=True)

Because of the ten-year limit on car lifespan, we decided to analyse only private cars less than ten years old.

In [1]:
df.info()

# Preprocessing and Data cleaning

The objective of the preliminary checks and cleaning is to retain as much of the original information as possible and to extract any calculated fields the data set allows.

In [1]:
df.Reg_date

In [1]:
# Change date column to datetime format
df.Reg_date = pd.to_datetime(df.Reg_date,format="%d-%b-%y")

In [1]:
# Successful change date type to datetime format
df.Reg_date

Next we create a calculated field for the COE balance in months, instead of years and months, for easier visualization.

In [1]:
import re
import warnings
warnings.simplefilter(action='ignore')

def coe_to_months(text):
    """Convert a COE balance such as '7yrs 2mths 5days' to whole months."""
    text = str(text)
    years = re.search(r'(\d+)\s*yrs?', text)
    mths = re.search(r'(\d+)\s*mths?', text)
    # Balances given only in days have neither part and count as 0 months
    return 12 * (int(years.group(1)) if years else 0) + (int(mths.group(1)) if mths else 0)

# Calculating the COE balance field; both parts start from zero for every row,
# so a month value is never carried over from the previous row
df['Coe_left_mths'] = df['Coe_left'].apply(coe_to_months)

We will remove the entries with two months or less left on their lifespan, as these cars are likely to be scrapped if no one buys them.

In [1]:
for i,v in enumerate(df['Coe_left_mths']):
    if v <= 2:
        df.drop([i],inplace=True)
df.reset_index(drop=True,inplace=True)

In [1]:
df.info()

In [1]:
# Impute missing yearly depreciation as asking price / remaining COE months * 12
df_dep = df[df['Dep'].isnull()].copy()
df_dep['Dep'] = df_dep['Price'].astype(int) / df_dep['Coe_left_mths'] * 12

In [1]:
index = df_dep.index
index

In [1]:
# Write the imputed values back to the main frame as strings, matching the scraped column
df.loc[df_dep.index, 'Dep'] = df_dep['Dep'].astype(int).astype(str)
    
df_dep

In [1]:
df.info()

We will try to fill in the missing curb weights by imputing them from the same model elsewhere in the data.

In [1]:
df['Curb Weight'] = df['Curb Weight'].replace(np.nan,'0')

In [1]:
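# find the curb weight of the same model within the data set,
# taking it only from rows whose weight is known (not the '0' placeholder)
known_weight = df[df['Curb Weight'] != '0'].groupby('Brand')['Curb Weight'].first()
missing = df['Curb Weight'] == '0'
df.loc[missing, 'Curb Weight'] = df.loc[missing, 'Brand'].map(known_weight).fillna('0')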

In [1]:
df[df['Curb Weight'] == '0']

The remaining missing values have no identical model in the data. These are newly registered cars sold as second-owner cars, probably because they were demo or test-drive vehicles. We will remove them since they are effectively new cars.

In [1]:
# will drop all remaining missing values
df['Curb Weight'] = df['Curb Weight'].replace('0',np.nan)
df.dropna(inplace=True)
df = df.reset_index(drop=True)
df.info()

We are left with 1885 private car entries between 1 and 9.8 years old, which is still a good sample size for testing. Next we will extract more detail from the data by separating brand and model.

In [1]:
# Renaming columns
df.rename({'Brand':'Model'},axis='columns',inplace=True)

# Create a new column for data
df['Brand'] = 0


In [1]:
# The brand is the first word of the model name
df['Brand'] = df['Model'].str.split().str[0]

In [1]:
df['Brand']

In [1]:
df['Brand'].value_counts()

In [1]:
df.columns

In [1]:
# create a list of fields to be converted to int
convert_dict = {'Dep': 'int64','Mileage': 'int64','Road Tax': 'int64',
               'COE': 'int64','OMV': 'int64','ARF': 'int64', 'Manufactured': 'int64',
               'No. of Owners': 'int64','Price': 'int64','Coe_left_mths': 'int64',
                'Dereg Value':'int64','Engine Cap':'int64','Curb Weight':'int64',
                'Power':'int64','Type':'category'}

In [1]:
df = df.astype(convert_dict)

In [1]:
df.info()

The data are now cleaned and ready for Exploratory data analysis.

# EDA

In [1]:
num_value_col = df.select_dtypes(include=['int64','float64']).columns.tolist()
fig, axes = plt.subplots(7,2, figsize=(16, 14))
fig.subplots_adjust(hspace=0.5)
fig.suptitle('Numerical value histogram',fontsize=18)

i=0
n=0

for x in num_value_col:
    sb.distplot(df[x],fit=stats.norm,ax=axes[i,n])

    if n < 1:
        n+=1
    else:
        n=0
        i+=1
plt.show()

The histograms show quite a number of fields with long tails, suggesting that a fair number of outliers are present.

In [1]:
plt.rcParams["figure.figsize"] = [10,6]
sb.set_style("darkgrid")
sb.set_context("notebook", font_scale=1.5, rc={"font.size":16,"axes.titlesize":16,"axes.labelsize":16})
p = sb.boxplot(x=df['Price'])  # pass the series once, as the x variable
p.set_xlabel('Price', fontsize = 15)
p.set_title("Numerical value boxplot")
plt.show()


A boxplot indicates outliers for car prices over 220000 dollars.
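
The outlier markers in a seaborn boxplot follow the usual 1.5 × IQR whisker rule (the default `whis=1.5`). The following is a minimal sketch of that calculation on a small made-up price list rather than the notebook's data; `upper_whisker_limit` is a hypothetical helper name.

```python
import numpy as np

def upper_whisker_limit(values, whis=1.5):
    """Return Q3 + whis * IQR, the value above which a boxplot draws points as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + whis * (q3 - q1)

# Illustrative prices only (not taken from the dataset)
prices = [40000, 60000, 80000, 100000, 120000, 400000]
limit = upper_whisker_limit(prices)
outliers = [p for p in prices if p > limit]
print(limit, outliers)
```

Applying the same function to `df['Price']` should give a limit close to the threshold read off the plot.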

In [1]:
sb.set_style("white")
num_value_col = df.select_dtypes(include=['int64','float64']).columns.tolist()
fig, axes = plt.subplots(5, 3, figsize=(18, 16))
fig.suptitle('Numerical value scatterplot',fontsize=18)

i=0
n=0

for x,y in enumerate(num_value_col):
    p = sb.scatterplot(data=df,x=num_value_col[x],y='Price',ax=axes[i,n])
    p.set(xlabel = num_value_col[x])

    if n < 2:
        n+=1
    else:
        n=0
        i+=1
plt.show()

The scatterplots above show a group of outliers scattered widely from the main cluster. Visually, this suggests that these cars are priced far above regular cars.

In [1]:
df[df['Price'] > 400000].head()

In [1]:
df[df['Dep'] > 25000].head()

A check shows that this group consists of supercars and collector cars.

In [1]:
df_super = df[df['Price'] > 400000]
num_value_col = df_super.select_dtypes(include=['int64','float64']).columns.tolist()
fig, axes = plt.subplots(5, 3, figsize=(18, 16))
fig.subplots_adjust(hspace=0.5)
fig.suptitle('Numerical value scatterplot')

i=0
n=0

for x,y in enumerate(num_value_col):
    p = sb.scatterplot(data=df_super,x=num_value_col[x],y='Price',ax=axes[i,n])
    p.set(xlabel = num_value_col[x])

    if n < 2:
        n+=1
    else:
        n=0
        i+=1
plt.show()

The outliers appear to be scattered without a clear pattern, which suggests they might belong to a separate distribution from the bulk of the data.

In [1]:
df_normal = df[df['Dep'] < 25000]
num_value_col = df_normal.select_dtypes(include=['int64','float64']).columns.tolist()
fig, axes = plt.subplots(5, 3, figsize=(18, 16))
fig.subplots_adjust(hspace=0.5)
fig.suptitle('Numerical value scatterplot')

i=0
n=0

for x,y in enumerate(num_value_col):
    p = sb.scatterplot(data=df_normal,x=num_value_col[x],y='Price',ax=axes[i,n])
    p.set(xlabel = num_value_col[x])

    if n < 2:
        n+=1
    else:
        n=0
        i+=1
plt.show()

With the outliers removed, the remaining data points appear more linear.

In [1]:
sb.set_style("darkgrid")
num_value_col = df_normal.select_dtypes(include=['int64','float64']).columns.tolist()
fig, axes = plt.subplots(5, 3, figsize=(18, 16))
fig.suptitle('Numerical value histogram')

i=0
n=0

for x in num_value_col:
    sb.distplot(df_normal[x],fit=stats.norm,ax=axes[i,n])

    if n < 2:
        n+=1
    else:
        n=0
        i+=1
plt.show()

With the outliers removed, the distributions appear much closer to normal.

In [1]:
df_normal.boxplot(column='Price',by='Type')
plt.title('Price vs Type',fontsize=12)
plt.xlabel('Type',fontsize=16)
plt.xticks(rotation=45)
plt.ylabel('Price',fontsize=16)
plt.show()

In [1]:
def color_negative_red(val):
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

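# correlate numeric columns only; newer pandas raises on text and categorical columns
df_normal.select_dtypes(include='number').corr().style.applymap(color_negative_red)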

In [1]:
df_normal = df_normal.reset_index(drop=True)
df_normal

In [1]:
(stat, p_value) = stats.spearmanr(df_normal['Mileage'],df_normal['Price'])

print("The correlation coefficient is: ", '{:.2f}'.format(stat))
print("The p-value is:",'{:.5f}'.format(p_value))

if p_value > 0.05:
    print('The two variables are likely independent')
else:
    print('The two variables are likely dependent')

After performing EDA, we have a question we would like to test.

Some of the variables are fixed from the day the vehicle is registered. Furthermore, variables such as ARF, Dep and Road Tax are calculated from OMV, COE months left and engine capacity, so they are similar across cars of the same make and model. Mileage is the only variable that no two cars share.

Mileage and price are moderately correlated, suggesting that price may be affected by the mileage of a car. In Singapore, each car has a lifespan of 10 years because of the Certificate of Entitlement, unlike countries that place no limit on how long a car can be used. Mileage is therefore an important factor for a used car, as it indicates how heavily the car has been used: higher mileage suggests more wear, and repair costs are likely to be higher. Thus we would like to find out whether there is a significant difference in price between mileage groups.



#  Running statistical test

Here we create a mileage category. According to Singapore Land Transport Authority data, a private vehicle in Singapore clocks an average of 20,000 km annually. We therefore divide the cars into three categories: low, normal and high.
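
The classification rule can be sketched as a small function. This is an illustration only: it assumes a 10-year (120-month) COE with no renewal, mirroring the `120 - Coe_left_mths` term in this notebook, and uses the notebook's cut-offs of 14,000 km and 24,000 km per year. The name `mileage_category` and the sample figures are hypothetical.

```python
def mileage_category(mileage_km, coe_left_mths, coe_term_mths=120):
    """Classify a car by annualised mileage, using 14,000 and 24,000 km per year as cut-offs."""
    months_on_road = coe_term_mths - coe_left_mths
    if months_on_road <= 0:
        raise ValueError("car must have been on the road for at least one month")
    km_per_year = mileage_km / months_on_road * 12
    if km_per_year <= 14000:
        return 'Low'
    if km_per_year <= 24000:
        return 'Normal'
    return 'High'

# A car with 84 months of COE left has been on the road for 36 months (3 years)
print(mileage_category(30000, 84))   # about 10,000 km per year
print(mileage_category(60000, 84))   # about 20,000 km per year
print(mileage_category(90000, 84))   # about 30,000 km per year
```

Raising an error for a car with no months on the road avoids a division by zero, which the row-by-row loop would otherwise hit for a car registered in the current month.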

In [1]:
df_normal['Mileage_cat'] = 0
for i,v in enumerate(df_normal['Mileage']):
    mileage = int(v)
    mths_left = df_normal['Coe_left_mths'].loc[i]
    avg_miles = mileage / (120-mths_left) * 12
    if avg_miles <= 14000: 
        df_normal['Mileage_cat'].loc[i] = 'Low'
    elif avg_miles > 14000 and avg_miles <= 24000:
        df_normal['Mileage_cat'].loc[i] = 'Normal'
    elif avg_miles > 24000:
        df_normal['Mileage_cat'].loc[i] = 'High'
df_normal

In [1]:
df_normal.boxplot(column='Price',by='Mileage_cat')
plt.title('Price vs Mileage_cat',fontsize=12)
plt.xlabel('Mileage_cat',fontsize=16)
plt.xticks(rotation=45)
plt.ylabel('Price',fontsize=16)
plt.show()

The boxplot above shows similar median prices for all three categories. However, there are more higher-priced vehicles in the low and normal classes. This is to be expected, as these cars are perceived to be better maintained.

In [1]:
import warnings
def changing_room(df,fit_val):
    # Code by Melquiades Ochoa from stack overflow
    # for more information please visit the site below
    # https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python

    plt.figure(figsize=(12,8))
    ax = df[fit_val].plot(kind='hist', bins=50, alpha=0.5)
    
    dataYLim = ax.get_ylim()

    best_fit = best_fit_distribution(df[fit_val], 200, ax)
    
    plt.show()

    plt.rcParams["figure.figsize"] = [10,6]
    sb.set_style("darkgrid")
    sb.set_context("notebook", font_scale=1.5, rc={"font.size":16,"axes.titlesize":16,"axes.labelsize":16})
    p = sb.distplot(df[fit_val],fit=best_fit,kde=True)
    p.set_xlabel(fit_val, fontsize = 15)
    p.set_title(best_fit.name+" distribution of "+fit_val)
    #plt.show()
    return 

def best_fit_distribution(data, bins=200, ax=None):
    # Code by Melquiades Ochoa from stack overflow
    # for more information please visit the site below
    # https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Distributions to check
    DISTRIBUTIONS = [        
        stats.alpha,stats.anglit,stats.arcsine,stats.beta,stats.betaprime,stats.bradford,stats.burr,stats.burr12,stats.cauchy,
        stats.chi,stats.chi2,stats.cosine,stats.dgamma,stats.dweibull,stats.erlang,stats.expon,stats.exponweib,stats.exponpow,
        stats.fatiguelife,stats.fisk,stats.foldcauchy,stats.foldnorm,stats.f,stats.gamma,stats.genlogistic,stats.genpareto,
        stats.genexpon, stats.genextreme, stats.gengamma, stats.genhalflogistic, stats.geninvgauss, stats.gennorm, stats.gilbrat,
        stats.gompertz, stats.gumbel_r,stats.gumbel_l,stats.halfcauchy,stats.halfnorm,stats.halflogistic,stats.hypsecant,
        stats.gausshyper, stats.invgamma, stats.invgauss, stats.invweibull,stats.johnsonsb,stats.johnsonsu,stats.ksone,
        stats.kstwo, stats.kstwobign, stats.laplace, stats.levy_l,stats.levy,stats.logistic,
        stats.loglaplace, stats.loggamma, stats.lognorm,stats.loguniform,stats.maxwell,stats.mielke,stats.nakagami,
        stats.ncx2,stats.nct,stats.norm,stats.norminvgauss,stats.pareto,stats.lomax,stats.powerlognorm,
        stats.powernorm,stats.powerlaw,stats.rdist,stats.rayleigh,stats.rice,stats.recipinvgauss,stats.semicircular,
        stats.t,stats.triang,stats.truncexpon,stats.truncnorm,stats.tukeylambda,stats.uniform,stats.vonmises,
        stats.wald,stats.weibull_max,stats.weibull_min,stats.wrapcauchy
    ] # excluded distribution due to slow speed in fitting : stats.ncf stats.laplace_asymmetric stats.trapezoid

    # Best holders
    best_distribution = stats.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf

    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:
        
        print('Trying out '+str(distribution.name)+' distribution')
        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')

                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]

                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))

                # if axis pass in add to plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass

                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse
                    print('This is a better fit')

        except Exception:
            pass
    return best_distribution

In [1]:
changing_room(df_normal,'Price')

We look for the theoretical distribution that fits the price data most closely. Here the Mielke distribution was found to be a good fit.
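
The selection criterion in `best_fit_distribution` is the sum of squared errors (SSE) between the histogram density and each fitted PDF. The sketch below reduces the idea to two candidate distributions and synthetic right-skewed data, both chosen purely for illustration; `sse_of_fit` is a hypothetical helper, not part of SciPy.

```python
import numpy as np
import scipy.stats as stats

def sse_of_fit(data, distribution, bins=50):
    """Fit `distribution` to `data` and return the SSE against the histogram density."""
    density, edges = np.histogram(data, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2.0
    params = distribution.fit(data)           # (loc, scale) for norm and expon
    pdf = distribution.pdf(centers, *params)
    return np.sum((density - pdf) ** 2)

# Synthetic right-skewed sample (an assumption, not the notebook's prices)
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=2000)

scores = {d.name: sse_of_fit(sample, d) for d in (stats.norm, stats.expon)}
best_name = min(scores, key=scores.get)
print(scores, best_name)
```

A lower SSE only means the curve tracks the histogram more closely at the chosen bin width; it is not a formal goodness-of-fit test.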

In [1]:
def normal_test(df,test_val):
    n_test_count = 0
    print('\n',' Normality test '.center(40,'*'))
    shap_stat,shap_p = stats.shapiro(df[test_val])
    DAgostino_stat, DAgostino_p = stats.normaltest(df[test_val])
    result = stats.anderson(df[test_val])
    print('\n',' Shapiro-Wilk test '.center(40,'*'))
    print((test_val).center(40,'*'))
    print('stat=%.3f' % (shap_stat))
    if shap_p > 0.05:
        print('P-value : ',shap_p)
        print('Distribution is Gaussian')
    else:
        print('P-value : ',shap_p)
        print('Distribution is not Gaussian')
        n_test_count += 1
    print('\n'," D'Agostino's test ".center(40,'*'))
    print((test_val).center(40,'*'))
    print('stat=%.3f' % (DAgostino_stat))
    if DAgostino_p > 0.05:
        print('P-value : ',DAgostino_p)
        print('Distribution is Gaussian')
    else:
        print('P-value : ',DAgostino_p)
        print('Distribution is not Gaussian')
        n_test_count += 1 
    print('\n'," Anderson-Darling test ".center(40,'*'))
    print((test_val).center(40,'*'))
    print('stat=%.2f' % (result.statistic))
    ad = 0
    for i in range(len(result.critical_values)):
        significance_level, critical_Value = result.significance_level[i], result.critical_values[i]
        if result.statistic < critical_Value:
            print('Approximately Normally Distributed at %.2f%% level' % (significance_level))
        else:
            print('Not Approximately Normally Distributed %.2f%% level' % (significance_level))
            if n_test_count < 3 and ad == 0:
                n_test_count += 1
                ad += 1
    
    if n_test_count == 3:
        print('Your distribution is not normal.')
        print('You might want to try other transformation.')
    if n_test_count == 2 or n_test_count == 1:
        print('Your distribution is approximately normal.')
        print('You might want to try other transformation or proceed to parametric test.')
    if n_test_count == 0:
        print('Your distribution is now normal.')
        print('You can proceed to parametric test.')
    

In [1]:
normal_test(df_normal,'Price')

A standard test of normality shows the current distribution is not normal. We will try to transform the data next.

In [1]:
price_transform = np.log10(df_normal.Price)

In [1]:
df_normal['log_10_Price'] = price_transform

In [1]:
print('\n',' Before Transformation '.center(40,'*'))
plt.rcParams["figure.figsize"] = [10,6]
sb.set_style("darkgrid")
sb.set_context("notebook", font_scale=1.5, rc={"font.size":16,"axes.titlesize":16,"axes.labelsize":16})
p = sb.distplot(df_normal['Price'],fit=stats.mielke,kde=True)
p.set_xlabel('Price', fontsize = 15)
p.set_title("Histogram distribution of Price")
plt.show()

In [1]:
skew_b4 = np.round(df_normal['Price'].skew(),2)
variance = statistics.variance(df_normal['Price'])
stdev = statistics.stdev(df_normal['Price'])
kurt = stats.kurtosis(df_normal['Price'], bias=False)

print('Skewness : ',skew_b4)
print('Variance : ',variance)
print('Standard deviation : ',stdev)
print('Kurtosis : ',kurt)

In [1]:
print('\n',' After log 10 Transformation '.center(40,'*'))
plt.rcParams["figure.figsize"] = [10,6]
sb.set_style("darkgrid")
sb.set_context("notebook", font_scale=1.5, rc={"font.size":16,"axes.titlesize":16,"axes.labelsize":16})
p = sb.distplot(df_normal['log_10_Price'],fit=stats.mielke,kde=True)
p.set_xlabel('Log_10_Price', fontsize = 15)
p.set_title("Histogram distribution of Log 10 Price")
plt.show()

In [1]:
skew_b4 = np.round(df_normal['log_10_Price'].skew(),2)
variance = statistics.variance(df_normal['log_10_Price'])
stdev = statistics.stdev(df_normal['log_10_Price'])
kurt = stats.kurtosis(df_normal['log_10_Price'], bias=False)

print('Skewness : ',skew_b4)
print('Variance : ',variance)
print('Standard deviation : ',stdev)
print('Kurtosis : ',kurt)

A skewness near zero, together with roughly 68 percent of the data falling within one standard deviation, suggests that the distribution may be normal after a log10 transformation.
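
The 68 percent figure can be checked directly. The following sketch computes the share of observations within one standard deviation of the mean, before and after a log10 transform. It uses synthetic log-normal data as a stand-in for prices, and `share_within_one_sd` is a hypothetical helper name.

```python
import numpy as np

def share_within_one_sd(values):
    """Fraction of observations lying within one standard deviation of the mean."""
    values = np.asarray(values, dtype=float)
    mean, sd = values.mean(), values.std(ddof=1)
    return np.mean(np.abs(values - mean) <= sd)

# Synthetic log-normal "prices" (illustrative only)
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=11.0, sigma=0.4, size=10000)

share_raw = share_within_one_sd(prices)
share_log = share_within_one_sd(np.log10(prices))
print(round(share_raw, 3), round(share_log, 3))
```

For a normal distribution the expected share is about 0.683, so a value near that for `df_normal['log_10_Price']` would be consistent with the statement above.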

In [1]:
normal_test(df_normal,'log_10_Price')

We did not get a perfectly normal distribution, but the tests suggest we may assume approximate normality here. A Q-Q plot will show us more.

In [1]:
fig = plt.figure()
fig.subplots_adjust(hspace=0.4)
ax1 = fig.add_subplot(211)
prob = stats.probplot(df_normal['Price'], plot=ax1)
ax1.set_xlabel('')
ax1.set_title('Probplot against normal distribution')
ax2 = fig.add_subplot(212)
prob = stats.probplot(df_normal['log_10_Price'], plot=ax2)
ax2.set_title('Probplot after log10 transformation')
plt.show()


Most of the data fall almost on a straight line, so we take normality as a reasonable assumption for the distribution. We will move to the next assumption for the parametric test.
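
The straightness of a probability plot can also be summarised as a number: with its default `fit=True`, `scipy.stats.probplot` returns the correlation coefficient `r` of the least-squares line through the points. This is a minimal sketch on synthetic log-normal data, used here only as a stand-in for prices.

```python
import numpy as np
import scipy.stats as stats

# Synthetic right-skewed "prices" (an assumption, not the notebook's data)
rng = np.random.default_rng(7)
prices = rng.lognormal(mean=11.0, sigma=0.5, size=2000)

# probplot returns (ordered quantiles), (slope, intercept, r)
_, (_, _, r_raw) = stats.probplot(prices)
_, (_, _, r_log) = stats.probplot(np.log10(prices))
print(round(r_raw, 4), round(r_log, 4))
```

An `r` closer to 1 after the transform indicates that the points lie closer to the reference line.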

In [1]:
df_high = df_normal[df_normal['Mileage_cat'] == 'High']
df_norm = df_normal[df_normal['Mileage_cat'] == 'Normal']
df_low = df_normal[df_normal['Mileage_cat'] == 'Low']

In [1]:
high_variance = statistics.variance(df_high['log_10_Price'])
norm_variance = statistics.variance(df_norm['log_10_Price'])
low_variance = statistics.variance(df_low['log_10_Price'])

print('High Variance : ',high_variance)
print('Norm Variance : ',norm_variance)
print('Low Variance : ',low_variance)

In [1]:
(test_statistic, p_value) = stats.levene(df_low['log_10_Price'], df_norm['log_10_Price'],df_high['log_10_Price'],center='mean')
print('\n',' Levene Test '.center(40,'*'))
print("The test statistic is: ", round(test_statistic,5))
print("The p-value is: ", round(p_value,5))
if p_value > 0.05:
    print('The group have approximate equal variance')
if p_value < 0.05:
    print('The group variance is not equal')
if p_value == 1:
    print('The group have equal variance')

The group variances are not homogeneous, so we will resample from the groups to try to achieve homogeneity.
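
For reference, here is a small sketch of how `scipy.stats.levene` responds to groups with clearly different spreads. The three groups are synthetic and their standard deviations are arbitrary assumptions.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=5.0, scale=0.10, size=500)
group_b = rng.normal(loc=5.0, scale=0.10, size=500)
group_c = rng.normal(loc=5.0, scale=0.30, size=500)   # three times the spread

# Same call form as in this notebook: mean-centred Levene test across three groups
stat, p_value = stats.levene(group_a, group_b, group_c, center='mean')
equal_variance = p_value > 0.05
print(round(stat, 2), p_value, equal_variance)
```

Because the test gains power with sample size, a smaller sample is less likely to reject equal variances even when the underlying spreads are unchanged. This is worth keeping in mind when comparing results before and after resampling.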

In [1]:
def stratified_samples(df,num,sV):
    num = num/100  
    stratifiedSampling = StratifiedShuffleSplit(n_splits=1, test_size=num,random_state=3)
    sort_value = df.select_dtypes(include=['object','int64','float64']).columns.tolist()
    try:
        for x, y in stratifiedSampling.split(df, df[sV]):
            stratified_random_sample = df.iloc[y].sort_values(by=sV)
    except:
        print('Error - Unable to run with chosen strata')
        return
    stratified_random_sample.info()
    return stratified_random_sample
    
def random_samples(df,num):
    sample = df.sample(num,random_state=1)
    sample = sample.reset_index(drop=True)
    return sample   

In [1]:
df_sample = stratified_samples(df_normal,60,'Mileage_cat')

In [1]:
df_high = df_sample[df_sample['Mileage_cat'] == 'High']
df_norm = df_sample[df_sample['Mileage_cat'] == 'Normal']
df_low = df_sample[df_sample['Mileage_cat'] == 'Low']

In [1]:
(test_statistic, p_value) = stats.levene(df_low['log_10_Price'], df_norm['log_10_Price'],df_high['log_10_Price'],center='mean')
print('\n',' Levene Test '.center(40,'*'))
print("The test statistic is: ", round(test_statistic,5))
print("The p-value is: ", round(p_value,5))
if p_value > 0.05:
    print('The group have approximate equal variance')
if p_value < 0.05:
    print('The group variance is not equal')
if p_value == 1:
    print('The group have equal variance')

We have homogeneous variance for our stratified samples.

In [1]:
high_variance = statistics.variance(df_high['log_10_Price'])
norm_variance = statistics.variance(df_norm['log_10_Price'])
low_variance = statistics.variance(df_low['log_10_Price'])

print('High Variance : ',high_variance)
print('Norm Variance : ',norm_variance)
print('Low Variance : ',low_variance)

In [1]:
df_Rsample = random_samples(df_normal,1000)

In [1]:
dfr_high = df_Rsample[df_Rsample['Mileage_cat'] == 'High']
dfr_norm = df_Rsample[df_Rsample['Mileage_cat'] == 'Normal']
dfr_low = df_Rsample[df_Rsample['Mileage_cat'] == 'Low']

In [1]:
(test_statistic, p_value) = stats.levene(dfr_low['log_10_Price'], dfr_norm['log_10_Price'],dfr_high['log_10_Price'],center='mean')
print('\n',' Levene Test '.center(40,'*'))
print("The test statistic is: ", round(test_statistic,5))
print("The p-value is: ", round(p_value,5))
if p_value > 0.05:
    print('The group have approximate equal variance')
if p_value < 0.05:
    print('The group variance is not equal')
if p_value == 1:
    print('The group have equal variance')

There is no homogeneity of variance in our random samples.

In [1]:
high_variance = statistics.variance(dfr_high['log_10_Price'])
norm_variance = statistics.variance(dfr_norm['log_10_Price'])
low_variance = statistics.variance(dfr_low['log_10_Price'])

print('High Variance : ',high_variance)
print('Norm Variance : ',norm_variance)
print('Low Variance : ',low_variance)

In [1]:
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import statsmodels.stats.multicomp as multi

print('\n',' One-way ANOVA '.center(40,'*'))
model = smf.ols('log_10_Price ~ C(Mileage_cat)', data=df_sample).fit()
aov_table = anova_lm(model, typ=2)
print(aov_table)
print('\n')
mcTreatment    = multi.MultiComparison(df_sample['log_10_Price'], df_sample['Mileage_cat'])
results_Treatment  = mcTreatment.tukeyhsd()
print(results_Treatment.summary())
    
df_sample.boxplot(column='log_10_Price',by='Mileage_cat')
plt.title('log_10_Price vs Mileage_cat',fontsize=12)
plt.xlabel('Mileage_cat',fontsize=16)
plt.xticks(rotation=45)
plt.ylabel('Log_10_Price',fontsize=16)
plt.show()

residuals = model.resid
#smi.qqplot(residuals, line='s')
fig = smi.qqplot(residuals, line='s')
plt.show()

We cross-check with a non-parametric test.

In [1]:
dfm_low = df_normal[df_normal['Mileage_cat'] == 'Low']
dfm_norm = df_normal[df_normal['Mileage_cat'] == 'Normal']
dfm_high = df_normal[df_normal['Mileage_cat'] == 'High']

In [1]:
# Group means, for reference before the Mann-Whitney U test
print('The Mean of low mileage cars:', dfm_low.Price.mean(axis=0))
print('The Mean of normal mileage cars:', dfm_norm.Price.mean(axis=0))
print('The Mean of high mileage cars:', dfm_high.Price.mean(axis=0))


In [1]:
import scipy.stats as stats
(test_statistic, p_value) = stats.mannwhitneyu(dfm_high['Price'], dfm_low['Price'], alternative='two-sided')
print("The test statistic is: ", '{:.5f}'.format(test_statistic))
print("The p-value between high and low mileage is:",'{:.5f}'.format(p_value))
(test_statistic, p_value) = stats.mannwhitneyu(dfm_low['Price'], dfm_norm['Price'], alternative='two-sided')
print("The test statistic is: ", '{:.5f}'.format(test_statistic))
print("The p-value between low and normal mileage is:",'{:.5f}'.format(p_value))
(test_statistic, p_value) = stats.mannwhitneyu(dfm_norm['Price'], dfm_high['Price'], alternative='two-sided')
print("The test statistic is: ", '{:.5f}'.format(test_statistic))
print("The p-value between normal and high mileage is:",'{:.5f}'.format(p_value))

The results above suggest that there is no significant difference between groups, except between low and normal mileage cars. Cars in the low mileage category may be in better driving condition and thus command a higher asking price, which is logical.

One point of interest is that there is no significant difference between low and high mileage cars, which is unexpected. We will investigate further.

# Investigation

In [1]:
df_normal[df_normal['Coe_left_mths']<24]

Higher mileage would be expected from older cars, but the check above shows that most cars eight years old and above have low mileage clocked.

In [1]:
df_normal[df_normal['Model']=='Toyota Camry 2.0A']

We compare one particular car model across different ages.
The oldest car in the group actually has a much higher depreciation even though its price is lower.

In [1]:
df_normal[df_normal['Mileage_cat'] == 'High']

Most high-mileage cars are relatively new, at 2 to 3 years old on average.

In [1]:
df_normal[df_normal['Model'] == 'Kia Cerato 1.6A EX']

For a popular mid-size sedan, depreciation across the mileage groups shows little gap in pricing.

In [1]:
df_normal[df_normal['Model'] == 'Mercedes-Benz E-Class E200 Avantgarde']

The price gap is much more apparent for a more expensive make: there is a 10,000 dollar difference for a car of similar age with low mileage clocked.

### How reliable is the mileage recorded in the listings?

The national average mileage of a private car in Singapore is estimated to be 20,000 km per year. From the data above, we found that most of our high mileage category is formed by newer cars. This does not make sense if older cars are actually clocking lower mileage. We will compare sub-categories of car age and see what the analysis shows.

In [1]:
# Creating a new segmentation of car age
df_normal['Age_seg'] = '0'
for i,v in enumerate(df_normal['Coe_left_mths']):
    if v >= 90:
        df_normal['Age_seg'].loc[i] = 'Near New'
    elif v < 90 and v >= 60:
        df_normal['Age_seg'].loc[i] = 'Upper Mid Age'
    elif v < 60 and v >= 30:
        df_normal['Age_seg'].loc[i] = 'Lower Mid Age'
    else:
        df_normal['Age_seg'].loc[i] = 'Old'

In [1]:
# Creating a new average_mileage_per_annum
df_normal['Avg_mileage'] = 0
for i,v in enumerate(df_normal['Mileage']):
    df_normal['Avg_mileage'].loc[i] = v/(120-df_normal['Coe_left_mths'].loc[i])*12

In [1]:
df_type = df_normal.groupby(['Type','Age_seg'])
df_type['Avg_mileage'].describe()

Points to note on the above analysis:
- Cars with less than 2.5 years of lifespan left appear to have the lowest average mileage in most car type categories
- The average for every vehicle group is at least 25 to 50 percent lower than the national average
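
The percentage gap against the national average is simple arithmetic, sketched below. The group averages in the loop are made-up inputs rather than the notebook's output, and `shortfall_pct` is a hypothetical helper name.

```python
NATIONAL_AVG_KM = 20000  # annual figure quoted in this notebook

def shortfall_pct(avg_km_per_year, benchmark=NATIONAL_AVG_KM):
    """Percentage by which an average annual mileage falls below the benchmark."""
    return (benchmark - avg_km_per_year) / benchmark * 100

# Illustrative group averages (not the notebook's results)
for avg in (15000, 12500, 10000):
    print(avg, shortfall_pct(avg))
```

The same function could be applied to the `mean` column of the `describe()` table above to quantify the gap for each type and age segment.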

# Conclusion

A car dealer we checked with confirmed that used car prices vary with the current COE trend. Older cars with less than 3 years of COE left usually have a higher depreciation because they are more affordable: they require a lower cash outlay, hence the higher average depreciation.

We conclude that in the Singapore used car market, the effect of mileage on price differs across car age segments. It may matter for newer cars or some specific higher-end models. For bread-and-butter cars, the difference is rather insignificant.

Alternatively, the mileage in the listings may not be accurate because of human error. We will need more factual data to ascertain this.