# EDA and price prediction of used vehicles

### Craigslist is the world's largest collection of used vehicles for sale. The dataset contains information on each listed vehicle, such as model, year, condition, cylinders, drive, size, paint color, and price.

**To predict the price, we first need to understand, clean, and scale the data.**

In [None]:
%matplotlib inline
# core data and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# statistics and modeling
import statsmodels.api as sm
from sklearn import metrics, preprocessing, tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

**Read in data**

In [None]:
vehicles = pd.read_csv(r"../input/craigslist-carstrucks-data/vehicles.csv") #reading the file

In [None]:
vehicles.head()

**Understand the data.**

In [None]:
vehicles.shape

In [None]:
vehicles.columns

**I want to remove some columns that are irrelevant to our prediction. These columns may be important for other kinds of prediction or analysis.**

In [None]:
vehicles.drop(['id', 'url','region_url', 'vin', 'image_url', 'lat', 'long', 'description'], axis=1, inplace=True)

In [None]:
vehicles.info()

In [None]:
print(vehicles.isnull().sum())

**As we can see, many columns have missing values. We need to handle those missing values and clean the data for accurate prediction.**

**Clean the data.**

**Because so many values are missing, we keep only the columns in which less than 55% of the values are missing.**

In [None]:
null_val = vehicles.isna().sum()
def na_filter(na, threshold=0.55):  # keep only columns whose missing fraction is below the threshold
    col_pass = []
    for i in na.keys():
        if na[i]/vehicles.shape[0]<threshold:
            col_pass.append(i)
    return col_pass
vehicles_cleaned = vehicles[na_filter(null_val)]
vehicles_cleaned.columns
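
**The same filter can be written without an explicit loop. This is a minimal sketch on a toy frame (the column values are made up for illustration), assuming the same 55% threshold as above.**

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `vehicles` (hypothetical values).
toy = pd.DataFrame({
    "price": [5000, 7000, 9000, 12000],
    "size": [np.nan, np.nan, np.nan, "full-size"],  # 75% missing
    "drive": ["fwd", np.nan, "4wd", "rwd"],         # 25% missing
})

# isna().mean() gives the fraction of missing values in each column.
missing_fraction = toy.isna().mean()

# Keep only the columns whose missing fraction is below the threshold.
kept = toy.loc[:, missing_fraction < 0.55]
print(list(kept.columns))
```

**Here `size` is dropped because 75% of its values are missing, while `drive` is kept.**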

**The following cells simply show the distinct categories of each categorical variable.**

In [None]:
vehicles_cleaned.manufacturer.unique()

In [None]:
vehicles_cleaned.model.unique()

In [None]:
vehicles_cleaned.cylinders.unique()

In [None]:
vehicles_cleaned.fuel.unique()

In [None]:
vehicles_cleaned.title_status.unique()

In [None]:
vehicles_cleaned.transmission.unique()

In [None]:
vehicles_cleaned.drive.unique()

In [None]:
vehicles_cleaned.type.unique()

In [None]:
vehicles_cleaned.paint_color.unique()

**price, year, and odometer are numerical variables.**

In [None]:
vehicles_cleaned.describe()

**To handle the rest of the missing values, we drop every row that contains one.**

In [None]:
vehicles_df = vehicles_cleaned.dropna()

In [None]:
vehicles_df.shape

In [None]:
print(vehicles_df.isnull().sum())

**We are left with 118898 rows and 14 columns, and the dataset no longer contains any missing values.**

**Handling outliers.**

In [None]:
plt.figure(figsize=(3,6))
sns.boxplot(y='price', data=vehicles_df,showfliers=False);

In [None]:
vehicles_df.price.min()

In [None]:
vehicles_df.price.max()

**A vehicle's price can never be zero, so we remove the rows with a price of 0.**

In [None]:
vehicles_df = vehicles_df[vehicles_df['price'] > 0].copy()  # copy so later in-place edits do not act on a slice

In [None]:
vehicles_df.shape

**We will remove price outliers by keeping only the prices between the 5th and 95th percentiles.**

In [None]:
y = vehicles_df['price']
removed_outliers = y.between(y.quantile(.05), y.quantile(.95))
removed_outliers

In [None]:
print(removed_outliers.value_counts())

In [None]:
index_names = vehicles_df[~removed_outliers].index  # invert the mask: rows outside the 5th to 95th percentile range
print(index_names)  # index labels of the 11027 rows to drop

In [None]:
vehicles_df.drop(index_names, inplace=True)
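
**The cells above trim prices to the range between the 5th and 95th percentiles. A fence based on the interquartile range (IQR) is a common alternative. The sketch below shows it on toy prices; the helper name `iqr_mask` and the multiplier 1.5 are assumptions for illustration, not part of the analysis above.**

```python
import pandas as pd

def iqr_mask(series, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Hypothetical prices with one extreme listing.
prices = pd.Series([4000, 5000, 6000, 7000, 8000, 250000])
mask = iqr_mask(prices)
filtered = prices[mask]
print(filtered.tolist())
```

**With these values Q1 is 5250 and Q3 is 7750, so the fence runs from 1500 to 11500 and only the 250000 listing is removed.**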

In [None]:
vehicles_df.describe()

In [None]:
plt.figure(figsize=(3,6))
sns.boxplot(y='odometer', data=vehicles_df,showfliers=False);

In [None]:
vehicles_df

**Let's reorder the columns and place the target column first for simplicity.**

In [None]:
vehicles_df = vehicles_df[['price','region','year','manufacturer','model','cylinders','fuel','odometer','title_status','transmission','drive','type','paint_color','state']]

**Let's plot each predictor against the target (price) for a general understanding.**

In [None]:
y = vehicles_df['price']
x = vehicles_df['odometer']
plt.scatter(x, y)
plt.xlabel('odometer')
plt.ylabel('price')
plt.show()

In [None]:
sns.catplot(y="manufacturer", x="price",kind="boxen", data=vehicles_df)

In [None]:
sns.catplot(x="drive", y="price",kind="bar", palette="ch:.25", data=vehicles_df)

In [None]:
sns.violinplot(x=vehicles_df.fuel, y=vehicles_df.price)

In [None]:
sns.catplot(y="type", x="price",kind="violin", data=vehicles_df)

In [None]:
sns.catplot(x="price", y="paint_color", kind="boxen",
            data=vehicles_df)

In [None]:
sns.catplot(x="title_status", y="price",kind="violin", palette="ch:.25", data=vehicles_df)

In [None]:
y = vehicles_df['price']
x = vehicles_df['year']
plt.scatter(x, y)
plt.xlabel('year')
plt.ylabel('price')
plt.show()

**To prepare the data for predictive modeling we use LabelEncoder, since the categorical variables have many categories. Label encoding simply converts each distinct value in a column to an integer.**

In [None]:
le = preprocessing.LabelEncoder()

In [None]:
categorical_cols = ['region', 'manufacturer', 'model', 'cylinders', 'fuel', 'title_status',
                    'transmission', 'drive', 'type', 'paint_color', 'state']
# apply() refits the encoder on each column in turn
vehicles_df[categorical_cols] = vehicles_df[categorical_cols].apply(le.fit_transform)
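
**To make the encoding concrete, here is a minimal sketch on a toy column (the values are made up). It shows which integer each category receives.**

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical fuel values standing in for one categorical column.
toy = pd.DataFrame({"fuel": ["gas", "diesel", "gas", "electric"]})

encoder = LabelEncoder()
toy["fuel_code"] = encoder.fit_transform(toy["fuel"])

# classes_ holds the categories in sorted order; a category's code is its position.
print(list(encoder.classes_))
print(toy["fuel_code"].tolist())
```

**The codes follow alphabetical order, so they carry no real ranking. A linear model will still treat them as ordered numbers, which is worth keeping in mind when reading the coefficients later. Note also that reusing one encoder across columns, as above, leaves it fitted only on the last column.**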

In [None]:
vehicles_df

**As we can see, odometer has a much larger magnitude than the other features. We need to reduce its scale so that it does not dominate the prediction model.**

**To view all variables through the same lens, I apply min-max scaling to odometer and then take the square root of the scaled values.**

In [None]:
# min-max scale odometer to [0, 1], then take the square root
vehicles_df["odometer"] = np.sqrt(preprocessing.minmax_scale(vehicles_df["odometer"]))
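
**As a worked example of this transformation, min-max scaling maps each value x to (x - min) / (max - min), and the square root is then applied to the result. The odometer readings below are hypothetical.**

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Hypothetical odometer readings.
odometer = np.array([0.0, 25000.0, 100000.0])

# (x - min) / (max - min) maps the readings onto [0, 1].
scaled = minmax_scale(odometer)

# The square root stretches small values and compresses large ones.
transformed = np.sqrt(scaled)
print(scaled, transformed)
```

**The reading 25000 becomes 0.25 after scaling and 0.5 after the square root.**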

In [None]:
vehicles_df

In [None]:
# vehicles_df.to_csv(r'C:/Users/Aneri/Desktop/Python datsets/vehicles_df.csv')

**Predictive Modeling.**

### Multiple Linear Regression.

**I am going to use three methods for variable selection in MLR and select the one that gives the best score.**

**1) Filter Method: As the name suggests, I filter the features and keep a subset of the relevant ones. The filtering is based on a Pearson correlation matrix.**

In [None]:
#Using Pearson Correlation
plt.figure(figsize=(12,10))
cor = vehicles_df.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

**As the heatmap shows, drive, odometer, cylinders, fuel, and year have relatively strong positive or negative correlations with price, so we keep these and drop the other features.**

**One of the assumptions of linear regression is that the independent variables are not highly correlated with each other. We therefore also need to check whether the selected variables are related to each other.**

In [None]:
print(vehicles_df[["drive","odometer"]].corr())
print(vehicles_df[["odometer","cylinders"]].corr())
print(vehicles_df[["cylinders","fuel"]].corr())
print(vehicles_df[["fuel","year"]].corr())
print(vehicles_df[["year","drive"]].corr())

**None of these pairs appears to be highly correlated.**
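
**The pairwise checks can also be done in one pass over the full correlation matrix. The sketch below uses synthetic data with one deliberately correlated pair; the 0.8 cutoff is an assumption chosen for illustration.**

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic predictors; `cylinders` is built to track `drive` closely.
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["drive", "odometer", "year"])
toy["cylinders"] = 0.9 * toy["drive"] + rng.normal(scale=0.1, size=200)

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Report the pairs whose absolute correlation exceeds the cutoff.
strong = pairs[pairs.abs() > 0.8]
print(strong)
```

**Any pair listed in `strong` is a candidate for dropping one of its two variables.**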

In [None]:
features_p = vehicles_df[["drive","odometer","cylinders","fuel","year"]]
target_p = vehicles_df[["price"]]

In [None]:
#splitting our dataset randomly with the test data containing 25% of the data,
x_train, x_test, y_train, y_test = train_test_split(features_p,target_p, 
                                                    test_size=0.25, 
                                                    random_state=0)

#view number of training and testing data
print('Our training prediction variable contains :',len(y_train) ,'rows')
print('Our training independent variable contains :',len(x_train) ,'rows')
print('Our testing prediction variable contains :',len(y_test) ,'rows')
print('Our testing independent variable contains :',len(x_test) ,'rows')

In [None]:
#run the regression model with Pearson Correlation method
reg_model_p = LinearRegression()

In [None]:
# fit the model on the training data
reg_model_p.fit(x_train, y_train)
# print the intercept and the coefficient of each feature
print('Intercept :', reg_model_p.intercept_[0], '\n')
print(pd.DataFrame({'features':x_train.columns,'coefficients':reg_model_p.coef_[0]}))


In [None]:
#prediction
lr_pred_p = reg_model_p.predict(x_test)

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, lr_pred_p))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, lr_pred_p))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, lr_pred_p)))

In [None]:
score = r2_score(y_test, lr_pred_p)
score
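
**For reference, the four metrics reported above can be computed by hand. This is a minimal sketch on made-up prices and predictions, not the values from this dataset.**

```python
import numpy as np

# Hypothetical true prices and model predictions.
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_pred = np.array([12000.0, 14000.0, 19000.0, 27000.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in price units

# R^2 compares the residual sum of squares with the variance around the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

**Here the MAE is 1500, the MSE is 2500000, and R² is 0.92. RMSE and MAE are in the same units as price, which makes them easier to interpret than MSE.**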

**2) Backward Elimination (Wrapper Method): This is an iterative and computationally expensive process, but it is often more accurate than filtering. As the name suggests, we first feed all candidate features to the model. We then check the model's performance and iteratively remove the worst-performing feature, one at a time, until the overall performance is in an acceptable range.**

**The metric used here to evaluate each feature is its p-value. If the p-value is above 0.05 we remove the feature; otherwise we keep it.**

In [None]:
features = vehicles_df.loc[:,vehicles_df.columns != 'price']
target = vehicles_df.loc[:,vehicles_df.columns == 'price']

**p values of all features.**

In [None]:
# add a constant column of ones; sm.OLS does not add an intercept by itself
X_1 = sm.add_constant(features)
# fit the OLS model and show the p-value of each term
model = sm.OLS(target, X_1).fit()
model.pvalues

In [None]:
# Backward elimination: repeatedly drop the feature with the largest p-value above 0.05
cols = list(features.columns)
while len(cols) > 0:
    X_1 = sm.add_constant(features[cols])
    model = sm.OLS(target, X_1).fit()
    # skip the first entry, which belongs to the constant term
    p = pd.Series(model.pvalues.values[1:], index=cols)
    pmax = p.max()
    feature_with_p_max = p.idxmax()
    if pmax > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features_BE = cols
print(selected_features_BE)

**Our final set of variables is shown above. It contains every variable in the dataset. Let's fit the model with them.**

In [None]:
features_be = vehicles_df[selected_features_BE]  # the features kept by backward elimination
target_be = vehicles_df[['price']]

In [None]:
#splitting our dataset randomly with the test data containing 25% of the data,
X_train, X_test, y_train, y_test = train_test_split(features_be,target_be, 
                                                    test_size=0.25, 
                                                    random_state=0)

#view number of training and testing data
print('Our training prediction variable contains :',len(y_train) ,'rows')
print('Our training independent variable contains :',len(X_train) ,'rows')
print('Our testing prediction variable contains :',len(y_test) ,'rows')
print('Our testing independent variable contains :',len(X_test) ,'rows')

In [None]:
# run the regression model with backward elimination
reg_model_be = LinearRegression()

In [None]:
# fit the model on the training data
reg_model_be.fit(X_train, y_train)
# print the intercept and the coefficient of each feature
print('Intercept :', reg_model_be.intercept_[0], '\n')
print(pd.DataFrame({'features':X_train.columns,'coefficients':reg_model_be.coef_[0]}))

In [None]:
lr_pred_be = reg_model_be.predict(X_test)

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, lr_pred_be))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, lr_pred_be))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, lr_pred_be)))

In [None]:
score = r2_score(y_test, lr_pred_be)
score

**3) Recursive Feature Elimination (Wrapper Method): This method recursively removes attributes and rebuilds the model on those that remain. It ranks the features by the importance the fitted model assigns to them (for linear regression, the size of the coefficients).**

In [None]:
features = vehicles_df.loc[:,vehicles_df.columns != 'price']
target = vehicles_df.loc[:,vehicles_df.columns == 'price']

In [None]:
# candidate numbers of features to keep
nof_list = np.arange(1, 13)
high_score = 0
# best number of features found so far
nof = 0
score_list = []
# the split does not depend on n, so it is done once outside the loop
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=0)
for n in nof_list:
    model = LinearRegression()
    # n_features_to_select must be passed by keyword in recent scikit-learn versions
    rfe = RFE(model, n_features_to_select=n)
    X_train_rfe = rfe.fit_transform(X_train, y_train.values.ravel())
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train.values.ravel())
    score = model.score(X_test_rfe, y_test)  # R^2 on the test set
    score_list.append(score)
    if score > high_score:
        high_score = score
        nof = n
print("Optimum number of features: %d" % nof)
print("Score with %d features: %f" % (nof, high_score))
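
**To see what RFE does on a small scale, the sketch below runs it on synthetic data in which only the first two of four columns influence the target. The data and coefficients are made up for illustration.**

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Four synthetic features; only the first two drive the target.
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Keep the two features with the largest coefficients, dropping one per round.
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)   # True for the features that were kept
print(selector.ranking_)   # 1 for kept features, larger numbers were dropped earlier
```

**Because RFE with a linear model compares coefficient sizes, the features should be on comparable scales for the ranking to be meaningful.**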

**Twelve features give the best score. Now let's find out which twelve they are.**

In [None]:
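cols = list(features.columns)
model = LinearRegression()
# initialize RFE to keep 12 features (keyword argument required in recent scikit-learn versions)
rfe = RFE(model, n_features_to_select=12)
# fit RFE and reduce the data to the selected features
X_rfe = rfe.fit_transform(features, target.values.ravel())
# fit the model on the reduced data
model.fit(X_rfe, target)
# rfe.support_ is a boolean mask over the original columns
temp = pd.Series(rfe.support_, index=cols)
selected_features_rfe = temp[temp].index
print(selected_features_rfe)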

In [None]:
features_rfe = vehicles_df[['region', 'year', 'manufacturer', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'paint_color','state']]
target_rfe = vehicles_df.loc[:,vehicles_df.columns == 'price']

In [None]:
#splitting our dataset randomly with the test data containing 25% of the data,
X_train, X_test, y_train, y_test = train_test_split(features_rfe,target_rfe, 
                                                    test_size=0.25, 
                                                    random_state=0)

#view number of training and testing data
print('Our training prediction variable contains :',len(y_train) ,'rows')
print('Our training independent variable contains :',len(X_train) ,'rows')
print('Our testing prediction variable contains :',len(y_test) ,'rows')
print('Our testing independent variable contains :',len(X_test) ,'rows')

In [None]:
# run the regression model for recursive feature elimination
reg_model_rfe = LinearRegression()

In [None]:
# fit the model on the training data
reg_model_rfe.fit(X_train, y_train)
# print the intercept and the coefficient of each feature
print('Intercept :', reg_model_rfe.intercept_[0], '\n')
print(pd.DataFrame({'features':X_train.columns,'coefficients':reg_model_rfe.coef_[0]}))

In [None]:
lr_pred_rfe = reg_model_rfe.predict(X_test)

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, lr_pred_rfe))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, lr_pred_rfe))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, lr_pred_rfe)))

In [None]:
score = r2_score(y_test, lr_pred_rfe)
score

**Backward elimination and recursive feature elimination both give the highest R² value.**

### Decision Tree (CART)

**Since CART performs variable selection and reduction automatically, let's apply the algorithm to all features.**

In [None]:
x_final = vehicles_df[['region','year','manufacturer','model','cylinders','fuel','odometer','title_status','transmission','drive'
             ,'type','paint_color','state']]
y_final = vehicles_df[['price']]

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_final,y_final, test_size = 0.25, random_state=0)
#view number of training and testing data
print('Our training prediction variable contains :',len(y_train) ,'rows')
print('Our training independent variable contains :',len(x_train) ,'rows')
print('Our testing prediction variable contains :',len(y_test) ,'rows')
print('Our testing independent variable contains :',len(x_test) ,'rows')

In [None]:
dtree = DecisionTreeRegressor()
model = dtree.fit(x_train, y_train)  #train parameters: features and target
pred = dtree.predict(x_test)
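
**One way to see the automatic variable selection is to inspect `feature_importances_` on the fitted tree. The sketch below uses synthetic data (the column names mirror this dataset, but the values and the price formula are made up) in which one column is pure noise.**

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic features: price depends on year and odometer, not on `noise`.
toy_X = pd.DataFrame({
    "year": rng.uniform(1990, 2020, 500),
    "odometer": rng.uniform(0, 1, 500),
    "noise": rng.normal(size=500),
})
toy_y = 500.0 * (toy_X["year"] - 1990) - 4000.0 * toy_X["odometer"]

tree_model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(toy_X, toy_y)

# Importances sum to 1; a value near 0 means the tree rarely or never split on that feature.
importances = pd.Series(tree_model.feature_importances_, index=toy_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

**Applied to `dtree` above, the same attribute would show which of the thirteen features the tree actually relies on.**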

In [None]:
text_representation = tree.export_text(dtree)
print(text_representation)

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, pred)))

**A fully grown tree tends to overfit, and this tree is also very deep. Let's change a couple of parameters to see whether they affect the error and to make the tree shallower.**

**Criterion: defines the function used to measure the quality of a split. The options compared here are "squared_error" (formerly "mse"), "absolute_error" (formerly "mae"), and "friedman_mse".**

**Max_depth: defines the maximum depth of the tree. If it is `None`, the tree grows until all leaves are pure or contain fewer than `min_samples_split` samples, which risks overfitting the model.**

In [None]:
# 'mse' and 'mae' were renamed to 'squared_error' and 'absolute_error' in scikit-learn 1.0
criteria = {'mse': 'squared_error', 'mae': 'absolute_error', 'friedman_mse': 'friedman_mse'}
max_depth = list(range(1, 30))
rmse = {label: [] for label in criteria}
for i in max_depth:
    for label, criterion in criteria.items():
        dtree = DecisionTreeRegressor(criterion=criterion, max_depth=i)
        dtree.fit(x_train, y_train)
        pred = dtree.predict(x_test)
        # the stored values are test-set RMSE, so lower is better
        rmse[label].append(np.sqrt(metrics.mean_squared_error(y_test, pred)))

d = pd.DataFrame({'acc_mse': rmse['mse'],
                  'acc_mae': rmse['mae'],
                  'acc_friedman_mse': rmse['friedman_mse'],
                  'max_depth': max_depth})

# visualizing changes in parameters
plt.plot('max_depth','acc_mse', data=d, label='mse')
plt.plot('max_depth','acc_mae', data=d, label='mae')
plt.plot('max_depth','acc_friedman_mse', data=d, label='friedman_mse')
plt.xlabel('max_depth')
plt.ylabel('RMSE')
plt.legend()
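
**The loop above picks the depth by looking at the test set. A cross-validated search on the training data is a common alternative that keeps the test set untouched. The sketch below runs on synthetic data; the grid of depths and the 5 folds are assumptions for illustration.**

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data standing in for x_train / y_train.
X = rng.uniform(0, 1, size=(400, 3))
y = 10000.0 * X[:, 0] + 3000.0 * np.sin(6 * X[:, 1]) + rng.normal(scale=500.0, size=400)

# Try depths 1 to 10 with 5-fold cross-validation, scored by (negated) RMSE.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_depth, best_rmse)
```

**The `criterion` parameter could be added to `param_grid` in the same way to compare split functions.**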

**We want the RMSE to be as low as possible. Depth 14 gives the lowest Root Mean Squared Error with the mse and friedman_mse criteria. Below, I fit the tree with the mae criterion at depth 14.**

In [None]:
# 'mae' is called 'absolute_error' in scikit-learn 1.0 and later
dtree_m = DecisionTreeRegressor(criterion='absolute_error', max_depth=14)
model = dtree_m.fit(x_train, y_train)  # fit on the training features and target
pred = dtree_m.predict(x_test)

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, pred)))

### Random Forest Regressor

In [None]:
rf = RandomForestRegressor(random_state=1).fit(x_train, y_train.values.ravel())
rf_pred = rf.predict(x_test)

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, rf_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, rf_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, rf_pred)))

**These are the models we applied to this dataset: multiple linear regression, a decision tree, and a random forest.**