# The purpose of this project is to find the best age of used car to purchase in Ontario to minimize the value lost.

- I will be showcasing the use of pandas, seaborn, matplotlib, folium, and other libraries to analyze a dataset downloaded from Kaggle, which includes inventory from 65k dealerships across the US and Canada. For our purposes, we will only use the data from Ontario.
- The Ontario subset still contains over a hundred thousand used-car listings.

In [None]:

! pip install geocoder
! pip install folium
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
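# The 'seaborn' style was renamed 'seaborn-v0_8' in matplotlib 3.6; fall back to the old name on older versions
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')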
from numpy import median
import missingno as msno
import folium
from folium import plugins
import geocoder
import geopy
from geopy.geocoders import Nominatim

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Descriptive Analysis**

# Let's only work with Canadian data


In [None]:
df = pd.read_csv("/kaggle/input/marketcheck-automotive-data-us-canada/ca-dealers-used.csv")
df.head(5)

In [None]:
df.dtypes

In [None]:
#Columns 13 and 15 have ambiguous data types, so we reload the file with their dtype specified manually
df = pd.read_csv("/kaggle/input/marketcheck-automotive-data-us-canada/ca-dealers-used.csv", dtype={'fuel_type': 'object', 'engine_block': 'object'})
df.head(5)

In [None]:
#Working only with data from Ontario
df=df[df.state == "ON"]

In [None]:
#Dropping columns not needed for data analysis
df1 = df.drop(['stock_no', 'model', 'trim', 'drivetrain', 'transmission', 'fuel_type', 'engine_size', 'engine_block', 'seller_name'],axis=1)

In [None]:
#Checking null observations
df1.info()

In [None]:
# Checking max and mins
df1.describe().apply(lambda s: s.apply('{0:.5f}'.format))


In [None]:
# Many rows are missing a price; drop them, since rows without a price are not useful for our purposes
df1 = df1.dropna(subset=['price'])

In [None]:
df1.info()

# Checking for missing data

In [None]:
#Using missingno to check missing data
msno.matrix(df1)
plt.show()

In [None]:
#Filling in missing data: medians for float columns (e.g. miles), 'Unknown' for categorical columns
for i in df1.drop(['year'],axis=1).columns: 
    if df1[i].dtype=='float': 
        df1[i]=df1[i].fillna(df1[i].median()) #fill missing values with the column median
df1['body_type']=df1['body_type'].fillna('Unknown')
df1['vehicle_type']=df1['vehicle_type'].fillna('Unknown')

# Number of vehicles by build year

In [None]:
df1[df1.year >= 2006].year.value_counts().sort_index().plot(lw = 4)
plt.title("Number of vehicles in the dataset by build year")
plt.xlabel("Year")
plt.ylabel("Count")
plt.show()

The majority of cars being sold were built between 2016 and 2019.

# **Data Cleaning**

In [None]:
# Checking max price and min price
print(f"Maximum price: {df1.price.max()} $\nMinimum price: {df1.price.min()} $")


The maximum price is almost \$1.3 million and the minimum is \$0. Zero-dollar cars don't exist, so let's remove those.
We also remove rows priced over \$100k or under \$200, since they fall outside what the average car buyer would consider.

In [None]:
df1 = df1[(df1["price"] >= 200) & (df1["price"] <= 100000)]
print(f"Maximum price: {df1.price.max()} $\nMinimum price: {df1.price.min()} $")

# Boxplot of price after removing price points above 100k and under 200

In [None]:
# Pass the data by keyword; recent seaborn versions no longer accept it positionally as x
sns.boxplot(x=df1.price)
plt.title("Distribution of Price", fontsize = 15)
plt.show()

We can see that the majority of cars are priced between 10k and 40k, with a lot of outliers.
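The "outliers" in a boxplot are the points beyond the whiskers, which by default extend 1.5 times the interquartile range (IQR) past the quartiles. As a minimal sketch of that rule, the snippet below applies it to a short list of made-up prices; the numbers are hypothetical and not taken from the dataset.

```python
import statistics

# Hypothetical prices in dollars, made up for illustration only
prices = [9000, 12000, 15000, 18000, 21000, 25000, 30000, 38000, 95000]

# First quartile, median, and third quartile
q1, q2, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Points outside these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print(f"Q1={q1}, Q3={q3}, fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

With a right-skewed variable such as price, most flagged points sit above the upper fence, which matches what the boxplot shows.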

## Build year of the vehicles

In [None]:
print(f"Highest year: {df1.year.max()}\nLowest year: {df1.year.min()}")

The oldest vehicle is 40 years old. For our purposes, let's only consider "recent" cars, meaning those between 0 and 15 years old.

In [None]:
df1 = df1[df1.year.notnull()].copy()
# Age relative to 2021, computed on the whole column at once
df1["age"] = (2021 - df1.year).astype(int)
df1 = df1[(df1.age >= 0) & (df1.age <= 15)]
print(f"Maximum age: {df1.age.max()} \nMinimum age: {df1.age.min()} ")


In [None]:
sns.histplot(data=df1, x="age", binwidth =1)
plt.title("Number of Cars Being Sold by Age", fontsize = 15)
plt.show()

The majority of vehicles are between 0 and 10 years old.

# Mileage of Vehicles

In [None]:
print(f"Highest Mileage: {df1.miles.max()}\nLowest Mileage: {df1.miles.min()}")

In [None]:
#The highest mileage is 2.1 million miles, which is probably a mistake, so let's only consider cars between 0 and 150,000 miles.
df1 = df1[(df1["miles"] >= 0) & (df1["miles"] <= 150000)]
print(f"Maximum miles: {df1.miles.max()} \nMinimum miles: {df1.miles.min()} ")

# Mapping Car Dealership Location Distribution on Interactive Map

In [None]:
#Concatenate street, city, state in order to geocode into coordinates
# Excluding postal code because Nominatim doesn't seem to work when we include postal code
df1['address']=df1['street'].astype(str)+', '+df1['city']+', '+df1['state']
df1.head(5)

In [None]:
# Using Nominatim to locate coordinates of addresses provided in dataset
geolocator = Nominatim(user_agent="maximillan.ys.lau@gmail.com")

#Loop through the first 500 addresses only. Geocoding 100k addresses would take way too long.
for i in df1.index[:500]:
    try:
        #try to geocode the address with geopy
        location = geolocator.geocode(df1['address'][i])
        
        #write lat/long to the row using its index label
        df1.loc[i,'location_lat'] = location.latitude
        df1.loc[i,'location_long'] = location.longitude
    except Exception:
        #catches the case where no result is returned (location is None) or the request fails
        #stores 0 as a numeric placeholder coordinate
        df1.loc[i,'location_lat'] = 0.0
        df1.loc[i,'location_long'] = 0.0

df1.head(5)
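Many listings come from the same dealership and therefore share one address, so looking up each distinct address only once can cut the number of requests considerably. The sketch below illustrates the idea with a cache; `fake_geocode` is a hypothetical stand-in for `geolocator.geocode`, and the addresses and coordinates are made up.

```python
from functools import lru_cache

call_log = []  # records every lookup that actually reaches the "service"

def fake_geocode(address):
    """Hypothetical stand-in for geolocator.geocode: returns (lat, lon) or None."""
    call_log.append(address)
    known = {"1 Main St, Toronto, ON": (43.65, -79.38)}  # made-up coordinates
    return known.get(address)

@lru_cache(maxsize=None)
def geocode_once(address):
    # Repeated addresses are answered from the cache, including failed lookups
    return fake_geocode(address)

addresses = [
    "1 Main St, Toronto, ON",
    "1 Main St, Toronto, ON",     # duplicate: same dealership, another listing
    "9 Nowhere Rd, Timmins, ON",  # unknown address: lookup returns None
]
coords = [geocode_once(a) for a in addresses]

print(coords)
print(f"{len(addresses)} listings, {len(call_log)} lookups")
```

In the notebook, the same effect could be had by geocoding `df1['address'].unique()` and mapping the results back onto the rows. Nominatim's public service also asks clients to stay at or below one request per second, so a real loop would need some delay between calls.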

In [None]:
# Mapping the first 500 coordinates
map1 = folium.Map(
    location=[48.632909,-84.124552],
    tiles='cartodbpositron',
    zoom_start=4.5,
)
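# Skip rows where geocoding failed (stored as a 0 placeholder) so they are not drawn at (0, 0)
geocoded = df1[:500]
geocoded = geocoded[geocoded["location_lat"].astype(float) != 0]
geocoded.apply(lambda row:folium.CircleMarker(location=[row["location_lat"], row["location_long"]]).add_to(map1), axis=1)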
map1

The majority of the listings are in the GTA, with some in Ottawa as well, but we do have a good distribution of listings across Ontario, including some in Timmins and Thunder Bay.

# Correlation

In [None]:
#Graphing correlation between price, age, and miles.
cols_cor = ["price","age", "miles"]
sns.heatmap(df1[cols_cor].corr(), annot = True)
plt.title("Correlation:")
plt.show()

The strongest correlation is between age and price.
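By default, `DataFrame.corr()` computes the Pearson correlation coefficient. To make clear what the heatmap values mean, here is a minimal from-scratch sketch of that coefficient on hypothetical age and price values (not taken from the dataset); `pearson` is an illustrative helper, not a library function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical values for illustration only
age = [0, 1, 2, 3, 4]
price = [40000, 28000, 24000, 23500, 20000]

r = pearson(age, price)
print(f"r = {r:.3f}")  # negative: price falls as age rises
```

A value near -1 or +1 indicates a strong linear relationship, while the sign shows its direction, so the strength of a correlation should be judged by its absolute value.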

In [None]:
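# numeric_only=True skips text columns, which would raise an error in recent pandas versions
df2 = df1.groupby(['age']).median(numeric_only=True)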

df2['price_diff'] =  df2['price'] - df2['price'].shift(+1)
df2['price_diff_pct'] = df2['price'].pct_change()

df2.head(5)
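The two derived columns compare each age group with the one before it: `price_diff` is the dollar change and `price_diff_pct` is that change as a fraction of the previous age's median price. A small worked sketch in plain Python, using hypothetical median prices rather than values from the dataset:

```python
# Hypothetical median prices for ages 0, 1, 2 and 3, for illustration only
median_price = [40000, 28000, 24000, 23500]

# Pair each price with the previous one, mirroring shift(+1) and pct_change().
# The first age has no predecessor, so its entries are None (pandas would give NaN).
pairs = list(zip(median_price, median_price[1:]))
price_diff = [None] + [curr - prev for prev, curr in pairs]
price_diff_pct = [None] + [(curr - prev) / prev for prev, curr in pairs]

for age, (diff, pct) in enumerate(zip(price_diff, price_diff_pct)):
    if diff is None:
        print(f"age {age}: no previous age to compare with")
    else:
        print(f"age {age}: {diff:+d} dollars, {pct:+.1%}")
```

Note that the percentage is stored as a fraction (for example -0.3 for a 30% drop), which matters when formatting the axis of a plot.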

In [None]:
xticks = (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)

fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, sharex=True)

ax1.plot(df2.index.values, df2.price)
ax2.plot(df2.index.values, df2.price_diff_pct)

ax1.set_title('Median Car Price by Age')
ax1.set_xticks(xticks)

# pct_change() returns fractions (e.g. -0.3), so tell the formatter that 1.0 means 100%
ax2.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0))
ax2.set_title('Percentage Change in Median Car Price by Age')
ax2.set_xticks(xticks)

plt.show()

## Conclusion

- As we have seen above, the factor most strongly correlated with the price of a used car is the age of the vehicle. The question then becomes: what age of used car should you purchase to get the most value?
- The fast depreciation of a new vehicle is very apparent in the graphs above. Brand-new cars depreciate more than 30% in one year, cars that are 1 year old depreciate around 15%, and the yearly depreciation rate then drops to almost 0% for cars that are 2 years old.
- Such a low yearly depreciation rate is not seen at any other age between 0 and 12 years. It's also interesting that we see a positive appreciation rate for cars that are between 13 and 14 years old; this could be due to the desire for antique vehicles.
- In conclusion, buying a 2-year-old car seems to be a good choice. If the purchaser does not care for the modern technology and style of a brand-new car, then purchasing a car that is around 6 years old is also a good idea.