## Writer: Zhenjian(Tom) Wang on 08/13/2021

## Overview
Many college students, like me, want to buy a car so that we can travel around. Because our budget is limited, we want an inexpensive car that holds its value well. This project looks for the ten inexpensive car models with the lowest depreciation rate.

## Data and Model
Data citation: we use the used-car dataset that can be downloaded from https://www.kaggle.com/austinreese/craigslist-carstrucks-data. The dataset is updated regularly and contains recent used-car listings collected from Craigslist.

In [None]:
import pandas as pd
import numpy as np

dataframe_original = pd.read_csv('../input/craigslist-carstrucks-data/vehicles.csv')
print(dataframe_original.head(5))  # a bare head() is only displayed when it is the last line of a cell
dataframe_original.info()




The dataframe has 26 columns, but we only need the price, year, manufacturer, model, and condition of each car.


## Data Selection and Cleaning
To clean up the dataset, we keep only the columns we need and drop every row that contains a missing (NA) value.


In [None]:
df = dataframe_original[['price','year','manufacturer','model','condition']].dropna()
print(df.size)   # total number of cells (rows * columns), not the number of rows
print(df.shape)  # (rows, columns)
print(df.head(10))
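
One detail worth checking at this step: `DataFrame.size` counts cells (rows times columns), while `len(df)` or `df.shape[0]` counts rows. The sketch below shows the difference on a made-up frame; `toy` and its values are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings; the values are illustrative only.
toy = pd.DataFrame({
    "price": [12000, 15000, np.nan, 30000],
    "year": [2016, 2017, 2018, np.nan],
    "model": ["a", "b", "c", "d"],
})

# dropna() removes every row that has at least one missing value.
toy_clean = toy.dropna()

n_rows, n_cols = toy_clean.shape
print("rows:", n_rows, "columns:", n_cols)
print("cells:", toy_clean.size)  # rows * columns
```

When reporting how many cars remain after cleaning, the row count is the number to quote.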

Different manufacturers can use the same model name. To keep such models apart, we replace the model name with the combination of manufacturer name and model name.

In [None]:
# join with a space so the combined name stays readable
new_model_name = df["manufacturer"] + " " + df["model"]
df["model"] = new_model_name
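
To see why the combined name matters, here is a small sketch with made-up manufacturers (`acme`, `globex`) that both sell a model called `s1`. The space separator is an assumption made for readability.

```python
import pandas as pd

# Hypothetical rows: two manufacturers share the model name "s1".
toy = pd.DataFrame({
    "manufacturer": ["acme", "globex", "acme"],
    "model": ["s1", "s1", "x2"],
})

# Grouping on the raw model name would merge the two different "s1" cars.
print(toy["model"].nunique())  # distinct raw model names

# Prefixing the manufacturer makes every model name unambiguous.
toy["model"] = toy["manufacturer"] + " " + toy["model"]
print(toy["model"].tolist())
```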

Now we can take a look at our dataset. `df.size` reports 1198390, which counts every cell across the five columns (price, year, manufacturer, model, and condition) rather than the number of rows. The next step is to exclude cars that are in bad condition. As consumers, we do not want to buy such cars because fixing them costs a lot of extra money.

In [None]:
print(df.groupby('condition').size())

As we can see, 532 cars in our dataframe have the "salvage" condition, meaning the car was damaged and declared a total loss by an insurance company. We definitely want to exclude those from our dataset. We also exclude "fair" cars, because most sellers tend to overstate their car's condition so that they can sell it at a higher price. A car listed as fair is therefore usually in bad condition.

In addition, we only keep cars from model year 2015 or newer that are priced between 10000 and 50000 dollars. Cars that are too old do not depreciate at a normal rate, and inexpensive cars usually cost between 10000 and 50000 dollars.

In [None]:
df = df[df.condition != "salvage"]
df = df[df.condition != "fair"]
df = df[df.year >= 2015.0]
df = df[df["price"].between(10000, 50000)]
print(df.head(10))
print(df.size)
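
The same filters can also be written as one boolean mask, which makes the selection rule easier to read in one place. This is a sketch on a made-up frame (`toy` is hypothetical), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical listings; only the last row passes every filter.
toy = pd.DataFrame({
    "price": [9000, 12000, 25000, 60000, 18000],
    "year": [2016.0, 2014.0, 2018.0, 2020.0, 2019.0],
    "condition": ["good", "excellent", "fair", "good", "like new"],
})

mask = (
    ~toy["condition"].isin(["salvage", "fair"])  # drop bad conditions
    & (toy["year"] >= 2015)                      # model year 2015 or newer
    & toy["price"].between(10000, 50000)         # inclusive on both ends by default
)
filtered = toy[mask]
print(filtered)
```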


The mean of the price of different models:

In [None]:
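# numeric_only=True skips the text column "condition", which newer pandas versions cannot average
man_model_df = df.groupby(["manufacturer", "model", "year"]).mean(numeric_only=True)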
print(man_model_df)

The range of years:

In [None]:
print(df.sort_values(by = ["year"], ascending=False))
# we could see that the latest year is 2022

We keep only the models that have more than one data point, because a depreciation rate cannot be calculated from a single observation. For the same reason, we also drop the models whose cars all come from a single year.

In [None]:
df = df[df.duplicated(subset=["model"], keep=False)]
# average price per (model, year); numeric_only=True skips the text columns
group_by_data = df.groupby(["model","year"], as_index=False).mean(numeric_only=True)

# drop the models that only have cars from a single year
df_clean = group_by_data[group_by_data.duplicated(subset=["model"], keep=False)]
print(df_clean)
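
An equivalent way to express "at least two model years per model" is to count the distinct years of each model directly. The sketch below uses made-up data; `toy`, `yearly`, and `multi_year` are hypothetical names.

```python
import pandas as pd

# Hypothetical listings: only "acme s1" appears in two different years.
toy = pd.DataFrame({
    "model": ["acme s1", "acme s1", "acme s1", "acme x2", "acme x2", "globex z"],
    "year": [2015, 2015, 2017, 2016, 2016, 2018],
    "price": [11000.0, 13000.0, 15000.0, 20000.0, 22000.0, 30000.0],
})

# Average price per (model, year).
yearly = toy.groupby(["model", "year"], as_index=False)["price"].mean()

# Number of distinct years per model, aligned with the rows of `yearly`.
years_per_model = yearly.groupby("model")["year"].transform("nunique")

# Keep only the models observed in at least two different years.
multi_year = yearly[years_per_model >= 2]
print(multi_year)
```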

The data cleaning and selection process is now complete, leaving 7145 data points. Because `groupby` sorts its keys, the rows are already in ascending order of year within each model.

## Model
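We use the straight-line depreciation formula to calculate the depreciation rate. The formula is (latest year price - oldest year price) / oldest year price / year difference. With this definition, a positive value means that newer model years are listed at higher prices than older ones, and a negative value means the opposite. We use a while loop over the grouped dataframe above to calculate the rate for every model.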
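
As a worked example of the formula, the sketch below wraps it in a small helper. The function name `straight_line_rate` and the prices are hypothetical and only illustrate the arithmetic.

```python
def straight_line_rate(oldest_price, newest_price, oldest_year, newest_year):
    """Relative price change per year between the oldest and newest model year."""
    years = newest_year - oldest_year
    if years <= 0:
        raise ValueError("need prices from two different model years")
    return (newest_price - oldest_price) / oldest_price / years


# Hypothetical averages: 2015 cars list at 20000 and 2019 cars at 24000.
rate_up = straight_line_rate(20000, 24000, 2015, 2019)

# Hypothetical case where the older model year lists higher than the newer one.
rate_down = straight_line_rate(20000, 18000, 2015, 2019)

print(rate_up, rate_down)
```

The first call gives a positive rate (newer cars cost more), the second a negative one.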

In [None]:
df_clean = df_clean.reset_index(drop=True)  # reset_index returns a new frame, so assign it

row_num = df_clean.shape[0]  # number of rows in the clean dataframe
depreciation_ratio_lst = []
i = 0

# Rows are sorted by model and then by year, so each model occupies a contiguous
# block whose first row is the oldest year and whose last row is the newest year.
while i < row_num:
    model_name = df_clean.iat[i, 0]
    oldest_year = df_clean.iat[i, 1]
    oldest_price = df_clean.iat[i, 2]

    # move j to the last row that belongs to the same model;
    # the bounds check also handles the last model in the dataframe
    j = i
    while j + 1 < row_num and df_clean.iat[j + 1, 0] == model_name:
        j += 1

    newer_year = df_clean.iat[j, 1]
    model_new_price = df_clean.iat[j, 2]
    # straight-line formula: relative price change divided by the year difference
    depreciation_rate = (model_new_price - oldest_price) / oldest_price / \
                        (newer_year - oldest_year)
    depreciation_ratio_lst.append([model_name, depreciation_rate])
    i = j + 1  # jump to the first row of the next model

depreciation_ratio_lst.sort(key=lambda x: x[1])

# names of the 50 models with the lowest rates (slicing is safe with fewer than 50 models)
top_value_model_name = [name for name, rate in depreciation_ratio_lst[:50]]

result = df[df["model"].isin(top_value_model_name)]

# we require more than 10 listings per model to make the estimate more reliable
group_by_result = result.groupby("model").size()
top_ten_group_by = group_by_result[group_by_result > 10]

top_ten_list = list(top_ten_group_by.index)

final_df = df_clean[df_clean["model"].isin(top_ten_list)]
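
The per-model rate can also be computed without an explicit loop by taking the first and last row of each model with `groupby`. This is an alternative sketch on made-up data, not the method used above; `yearly`, `ends`, and `rates` are hypothetical names.

```python
import pandas as pd

# Hypothetical yearly average prices for two made-up models.
yearly = pd.DataFrame({
    "model": ["acme s1", "acme s1", "acme s1", "acme x2", "acme x2"],
    "year": [2015.0, 2016.0, 2019.0, 2016.0, 2018.0],
    "price": [20000.0, 21000.0, 24000.0, 30000.0, 27000.0],
})

# After sorting, "first" is the oldest year of each model and "last" the newest.
ends = yearly.sort_values(["model", "year"]).groupby("model").agg(
    first_year=("year", "first"),
    last_year=("year", "last"),
    first_price=("price", "first"),
    last_price=("price", "last"),
)

# Straight-line rate per model, sorted from lowest to highest.
rates = (
    (ends["last_price"] - ends["first_price"])
    / ends["first_price"]
    / (ends["last_year"] - ends["first_year"])
).sort_values()
print(rates)
```

This assumes every model has at least two distinct years, as guaranteed by the cleaning step; otherwise the year difference would be zero.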



## Results

In [None]:
for i in depreciation_ratio_lst:
    for j in top_ten_list:
        if i[0] == j:
            print(i)

These are the ten inexpensive cars with the lowest depreciation ratio.
The data for these models is shown below:


In [None]:
print(final_df)
print(final_df.groupby(["model", "year"]).mean())

## Conclusion

The ten inexpensive cars with the lowest depreciation ratio are:
1. gmc savana commercial cutaway
2. kia optima ex sedan 4d
3. ford super duty f-550 drw
4. gmc acadia slt-2 sport utility
5. lexus is 250
6. dodge durango limited
7. bmw 4 series 428i xdrive
8. bmw 4 series 430i xdrive
9. dodge dart
10. chrysler town & country

The finding is very interesting because all of the cars above have a negative depreciation rate. We checked that the calculation is correct by looking at final_df above and redoing some of the calculations by hand. We also went to car dealer websites such as CarMax to check whether a negative depreciation rate is possible. Surprisingly, some of the cars in our top-ten ranking do have a very low, and sometimes negative, depreciation ratio on CarMax.

The explanation for the negative depreciation ratio is the inflation we are currently experiencing in the US. Covid-19 hit the microchip industry hard and forced many factories worldwide to shut down, which also strongly affected the car industry because cars need microchips. To avoid waiting a long time for a new car, consumers shifted from buying new cars to buying used ones, which heated up the used-car market over the past year.