# Don't Get Kicked! 
This notebook performs some analysis on the *Don't get kicked!* data set from Kaggle. For this purpose I downloaded the [training.csv](https://www.kaggle.com/c/DontGetKicked/data) file, which contains 34 columns and 72,983 rows. One of those columns is *IsBadBuy*, which contains the target for our predictions.

To put this data set in context: one of the major risks of buying a car at auction is that the car turns out to be impossible to resell. This can happen for a number of reasons (tampered odometer, problematic parts, severe defects, etc.), but they all result in the dealership or buyer taking a large loss on the purchase. In this analysis, we aim to predict whether or not a car will be a bad buy.

We will go through some basic data cleaning, identify any particularly important features, create some visualizations, and finally run a couple of models. At the end, I'll describe any future steps that might help further our predictions.

In [1]:
# show all jupyter notebook output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot

# data science utility packages
import scikitplot as skplt
from pandas_profiling import ProfileReport

# scikit-learn imports

# xgboost imports

First, let's load up the data and run a profile report. This will give us a nice overview of all the data, and I'll pull any specific insights that we find into this notebook.

In [2]:
dgk_data = pd.read_csv("data/training.csv")
dgk_data.shape
dgk_data.head()

(72983, 34)

| | RefId | IsBadBuy | PurchDate | Auction | VehYear | VehicleAge | Make | Model | Trim | SubModel | ... | MMRCurrentRetailAveragePrice | MMRCurrentRetailCleanPrice | PRIMEUNIT | AUCGUART | BYRNO | VNZIP1 | VNST | VehBCost | IsOnlineSale | WarrantyCost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 12/7/2009 | ADESA | 2006 | 3 | MAZDA | MAZDA3 | i | 4D SEDAN I | ... | 11597.0 | 12409.0 | NaN | NaN | 21973 | 33619 | FL | 7100.0 | 0 | 1113 |
| 1 | 2 | 0 | 12/7/2009 | ADESA | 2004 | 5 | DODGE | 1500 RAM PICKUP 2WD | ST | QUAD CAB 4.7L SLT | ... | 11374.0 | 12791.0 | NaN | NaN | 19638 | 33619 | FL | 7600.0 | 0 | 1053 |
| 2 | 3 | 0 | 12/7/2009 | ADESA | 2005 | 4 | DODGE | STRATUS V6 | SXT | 4D SEDAN SXT FFV | ... | 7146.0 | 8702.0 | NaN | NaN | 19638 | 33619 | FL | 4900.0 | 0 | 1389 |
| 3 | 4 | 0 | 12/7/2009 | ADESA | 2004 | 5 | DODGE | NEON | SXT | 4D SEDAN | ... | 4375.0 | 5518.0 | NaN | NaN | 19638 | 33619 | FL | 4100.0 | 0 | 630 |
| 4 | 5 | 0 | 12/7/2009 | ADESA | 2005 | 4 | FORD | FOCUS | ZX3 | 2D COUPE ZX3 | ... | 6739.0 | 7911.0 | NaN | NaN | 19638 | 33619 | FL | 4000.0 | 0 | 1020 |


## Pandas Profiling
Pandas profiling is a great library that creates a basic summary report of a data set. It provides simple analysis and data set interpretation, which can be a great way to get an overview of your data. It also provides recommendations on which columns to drop, which columns might be correlated with each other, and which columns might otherwise be unusable.

In [7]:
profile = ProfileReport(dgk_data, title="Don't Get Kicked data", interactions={"targets": ["IsBadBuy"]})
profile.to_file("profiling_report.html")
profile.to_widgets()


This pandas profiling report is extremely handy when trying to examine a data set (just don't use it on massive data sets or especially sparse data sets). First, we see that about 6% of cells are missing; when we go to the missing values tab, we can see that PRIMEUNIT and AUCGUART contain most of the missing cells as they are only filled for around 5% of the data. 
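The missing-value figures quoted from the report can also be checked directly in pandas. The following is a minimal sketch: the helper name `missing_share` and the toy frame are illustrative assumptions, not part of the original notebook. On the real data you would pass `dgk_data` instead of `toy`.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing cells per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for dgk_data (values are made up for illustration).
toy = pd.DataFrame({
    "PRIMEUNIT": ["NO", np.nan, np.nan, np.nan],
    "AUCGUART": [np.nan, np.nan, np.nan, "GREEN"],
    "VehBCost": [7100.0, 7600.0, 4900.0, 4100.0],
})

per_column = missing_share(toy)
overall = toy.isna().to_numpy().mean()  # share of all cells that are missing

print(per_column)
print(f"overall missing share: {overall:.2f}")
```

Sorting the per-column shares makes the sparsest columns appear at the top, which is where columns like PRIMEUNIT and AUCGUART would be expected to show up.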


### Missing Data
According to the data set description on Kaggle, PRIMEUNIT identifies whether a car has higher-than-standard demand. When we look at it in the variables tab, we see mostly false values and a few true ones. If only true values were present, I might treat missing as false, but since explicit false values are present, a missing entry cannot be assumed to mean false, so I think it's best to simply drop the column.

AUCGUART is the level of guarantee provided by the auction, and it can be Red, Yellow, or Green. Unfortunately, this is only provided approximately 5% of the time as well. In a full investigation, I would try to find out why - my hunch is that only a certain auction site provides AUCGUART and PRIMEUNIT. For now, we will simply drop these columns.

There are a few other columns with missing information, but these are mostly filled. For these columns, we will use a simple imputation so as not to introduce too much bias in the final model.
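One way the plan above could look in code is sketched below. The text only says "a simple imputation", so the specific choices here (median for numeric columns, most frequent value for everything else) are assumptions, as is the helper name `drop_and_impute`. The toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def drop_and_impute(df, drop_cols=("PRIMEUNIT", "AUCGUART")):
    """Drop the sparse columns, then fill remaining gaps column by column.

    Assumed strategy: median for numeric columns, mode for the rest.
    Columns that are entirely missing are not handled by this sketch.
    """
    out = df.drop(columns=[c for c in drop_cols if c in df.columns])
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy frame standing in for dgk_data (values are illustrative only).
toy = pd.DataFrame({
    "PRIMEUNIT": [np.nan, "NO", np.nan],
    "AUCGUART": [np.nan, np.nan, np.nan],
    "Trim": ["SXT", None, "SXT"],
    "MMRCurrentRetailCleanPrice": [12409.0, np.nan, 8702.0],
})

clean = drop_and_impute(toy)
print(clean)
```

If the models are later evaluated on a held-out split, the fill values would ideally be computed on the training portion only, so that no information leaks from the evaluation rows.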


### Correlations
Pandas profiling also gives us a nice warning about highly correlated columns, which we likely want to get rid of when we look at linear methods like logistic regression. All of the price variables seem to be highly correlated (which makes sense), so we might need to address that either by dropping most of them or by transforming them in some way (like taking the difference between certain combinations of them). Additionally, we probably want to choose either year or age, but not both.
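As a rough illustration of both ideas, the sketch below computes pairwise correlations for three of the price columns and derives one difference feature. The five rows are the ones shown in the `head()` output earlier. The helper names and the derived column `RetailMinusCost` are hypothetical, and which price combinations are actually worth differencing is left open in the text.

```python
import pandas as pd

PRICE_COLS = [
    "MMRCurrentRetailAveragePrice",
    "MMRCurrentRetailCleanPrice",
    "VehBCost",
]


def price_correlations(df, cols=PRICE_COLS):
    """Pearson correlation matrix restricted to the given price columns."""
    return df[cols].corr()


def add_price_gap(df, retail="MMRCurrentRetailAveragePrice", cost="VehBCost"):
    """Add a hypothetical feature: retail price minus acquisition cost."""
    out = df.copy()
    out["RetailMinusCost"] = out[retail] - out[cost]
    return out


# The five rows displayed by dgk_data.head() above.
sample = pd.DataFrame({
    "MMRCurrentRetailAveragePrice": [11597.0, 11374.0, 7146.0, 4375.0, 6739.0],
    "MMRCurrentRetailCleanPrice": [12409.0, 12791.0, 8702.0, 5518.0, 7911.0],
    "VehBCost": [7100.0, 7600.0, 4900.0, 4100.0, 4000.0],
})

corr = price_correlations(sample)
with_gap = add_price_gap(sample)
print(corr.round(3))
print(with_gap["RetailMinusCost"])
```

A difference feature keeps the information about how far the purchase price sits from the market estimate while removing much of the shared "overall price level" that makes the raw columns move together.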

Ultimately, we might just use a strong L1 regularization penalty to choose features and see what happens.
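To make the L1 idea concrete, here is a minimal sketch using scikit-learn's `LogisticRegression` with an L1 penalty on standardized features. The data is synthetic (one informative column plus noise), and the helper name `l1_selected_features` and the value `C=0.05` are illustrative assumptions rather than settings from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def l1_selected_features(X, y, feature_names, C=0.05):
    """Fit an L1-penalized logistic regression and return the names of
    the features whose coefficients were not shrunk to exactly zero.

    Smaller C means a stronger penalty and therefore fewer survivors.
    """
    model = make_pipeline(
        StandardScaler(),  # put features on one scale so the penalty is fair
        LogisticRegression(penalty="l1", solver="liblinear", C=C),
    )
    model.fit(X, y)
    coefs = model[-1].coef_.ravel()
    return [name for name, w in zip(feature_names, coefs) if w != 0.0]


# Synthetic example: only the first column drives the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)
names = ["signal", "noise_1", "noise_2", "noise_3"]

selected = l1_selected_features(X, y, names)
print(selected)
```

Standardizing first matters here because the penalty acts on coefficient size, and unscaled columns such as prices and ages would otherwise be penalized very unevenly.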

### Target
Our target is the *IsBadBuy* column. The first thing I noticed is that we have an imbalanced data set: there are many more good buys than bad buys. This makes sense, as no one would be likely to make a 50/50 bet on a purchase as large as a car. However, we will need to keep this in mind as we move forward, as some methods don't work as well with an imbalanced data set.
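The degree of imbalance can be quantified with a quick count of the target values. This is a small sketch: the helper name `class_balance` and the toy series are illustrative, and on the real data you would pass `dgk_data["IsBadBuy"]`.

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.Series:
    """Share of rows in each class, indexed by class label."""
    return target.value_counts(normalize=True).sort_index()


# Toy target standing in for dgk_data["IsBadBuy"] (made-up values).
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 1], name="IsBadBuy")

balance = class_balance(toy_target)
print(balance)
```

Knowing the minority share up front helps when choosing evaluation metrics and options later on, for example stratified splits or the `class_weight="balanced"` setting that several scikit-learn classifiers accept.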

As for things that immediately jump out as correlated with the target, age and odometer are positively correlated while the price variables seem negatively correlated. Interestingly, based on Cramér's V, none of the categorical variables seem to have a high correlation with the target. However, I'll need to pull out that correlation specifically; I think the values might be overshadowed by the more highly correlated columns.
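For pulling out that association for a single categorical column, Cramér's V can be computed by hand from a contingency table. The sketch below uses the plain, uncorrected definition, V = sqrt(chi2 / (n * (min(rows, cols) - 1))), so its values may differ somewhat from the report if the report applies a bias correction. The helper name `cramers_v` and the toy series are illustrative.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Uncorrected Cramér's V between two categorical series (0 to 1)."""
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    if k == 0:  # one of the variables has a single level
        return 0.0
    return float(np.sqrt(chi2 / (n * k)))


# Toy series: the first pairing is perfectly associated, the second is not.
auction = pd.Series(["A", "A", "B", "B"])
perfect = cramers_v(auction, pd.Series([0, 0, 1, 1]))
independent = cramers_v(auction, pd.Series([0, 1, 0, 1]))
print(perfect, independent)
```

On the real data this could be called as `cramers_v(dgk_data["Auction"], dgk_data["IsBadBuy"])` and repeated for each categorical column to rank them against the target in isolation.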

Let's take a look at age and odometer specifically - I think those might provide some interesting visualizations.

## Exploratory Analysis
We will start by examining VehicleAge - this was one of the variables most highly correlated with the target. We can see from the pandas profiling report that it is roughly normally distributed and ranges from 0 to 10. I'm curious to see how this distribution interacts with our target, so let's plot the two distributions separately.