## Capstone 1 - Machine Learning – Predicting Lemon titles (Kicks) in Car Auctions
#### Objective:
> To predict whether a car purchased at an online auction is a good or bad buy, among the thousands of cars bought this way. The goal is to build a machine learning model that predicts the condition of a vehicle being purchased at a car auction (good or bad buy), thereby reducing the buyer's risk.

#### Problem:
> Is the car being purchased at auction a good buy or a bad buy?

#### Outcome:
>One of the challenges for an auto dealership purchasing a used car at an auction is the risk that the vehicle might have serious issues that prevent it from being resold. These are referred to as “kicks” or unfortunate purchases, and they often result in a significant loss. Some examples of kicks are tampered odometers, mechanical issues the dealer is not able to address, issues with getting the vehicle title from the seller, or some other unforeseen problem. Using machine learning to predict which cars carry a higher risk can provide real value to dealerships, because likely kicks can be flagged before the dealership buys at auction.

#### Dataset:
>Source: https://www.kaggle.com/c/DontGetKicked/data

>Train set – 60%<br>
>Test set – 40%

>The data set contains information about each car, such as purchase price, make and model, trim level, odometer reading, date of purchase, state of origin, and so on. There are about 40 different variables (along with the lemon status indicator IsBadBuy) on around 72K cars; the test data set has the same information on around 40K cars. The target variable is “IsBadBuy”, a binary variable that records the post-purchase classification of each car as kicked or non-kicked.

#### Evaluation Metrics:
>The evaluation metrics for this problem are going to be the Gini Index, Classification Accuracy %, F1 Score, Precision, Recall, and Log Loss metrics.
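
As a rough illustration of how several of these metrics relate to each other, the sketch below computes accuracy, precision, recall, F1, and log loss by hand. The function name `binary_metrics`, the 0.5 threshold, and the sample labels and probabilities are all made up for illustration; the notebook's own cells rely on the scikit-learn metric functions instead.

```python
import math


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute basic binary-classification metrics from labels and probabilities."""
    # Turn probabilities into hard 0/1 predictions
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))

    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Log loss uses the probabilities, clipped away from 0 and 1
    eps = 1e-15
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, clipped)
    ) / len(y_true)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "log_loss": log_loss,
    }


# Hypothetical labels (1 = bad buy) and predicted probabilities of a bad buy
y_true = [1, 0, 0, 1, 0, 0]
y_prob = [0.8, 0.2, 0.6, 0.4, 0.1, 0.3]
scores = binary_metrics(y_true, y_prob)
print(scores)
```

Precision and recall are reported for the positive class (bad buys), which is the class of interest here.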

In [2]:
## Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling
from fancyimpute import KNN
from sklearn.preprocessing import OrdinalEncoder

Using TensorFlow backend.


In [3]:
pwd

'C:\\Users\\datawiz\\Documents\\Springboard\\carvana_lemons'

In [4]:
# Input training data
# Assumes training.csv lives in a 'data' subfolder of the working directory shown above
df = pd.read_csv('C:\\Users\\datawiz\\Documents\\Springboard\\carvana_lemons\\data\\training.csv')

In [None]:
df.head()
df.info()

In [None]:
print(df.columns)

In [None]:
df['IsBadBuy'].value_counts()

In [None]:
df.groupby(['Make'],sort=False)['IsBadBuy'].count()

In [None]:
df.hist(figsize=(15, 15));

##### Drop columns:
1. RefId
2. BYRNO
3. WheelTypeID

##### Numerical Columns:
1.  VehYear
2.  VehicleAge
3.  WarrantyCost
4.  VehOdo
5.  VehBCost
6.  VNZIP1
7.  MMRAcquisitionAuctionAveragePrice
8.  MMRAcquisitionAuctionCleanPrice
9.  MMRAcquisitionRetailAveragePrice
10. MMRAcquisitonRetailCleanPrice
11. MMRCurrentAuctionAveragePrice
12. MMRCurrentAuctionCleanPrice
13. MMRCurrentRetailAveragePrice
14. MMRCurrentRetailCleanPrice

##### Categorical Columns:
1. Auction
2. Transmission
3. WheelType
4. Nationality
5. Size
6. TopThreeAmericanName
7. IsOnlineSale

##### Fix NULLs
1. Trim
2. AUCGUART
3. PRIMEUNIT
4. ALL Price Cols

In [None]:
#Let's look at a data profiling report using the pandas_profiling API
#pandas_profiling.ProfileReport(df)

### Correlation analysis

In [None]:
df_new_corr = df.copy()
df_new_corr = df_new_corr.drop(['RefId','PurchDate','Auction','Make','Model','Size','TopThreeAmericanName','PRIMEUNIT','AUCGUART',
                              'Trim','SubModel','Color','Transmission','WheelType','Nationality','BYRNO','VNST'],axis=1)
correlations = df_new_corr.corr()['IsBadBuy'].sort_values()
print('Most Positive Correlations: \n', correlations.tail(5))
print('\nMost Negative Correlations: \n', correlations.head(5))

> Let's plot a heatmap to visualize the correlation between IsBadBuy and the other attributes.

In [None]:
# Calculate correlations
corr = df_new_corr.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
# Heatmap
plt.figure(figsize=(15, 10))
sns.heatmap(corr,
            vmax=.5,
            mask=mask,
            # annot=True, fmt='.2f',
            linewidths=.2, cmap="YlGnBu")

> As seen in the profile report above, the price attributes are highly correlated with each other.
>
> VehicleAge, WarrantyCost, and VehOdo are the attributes most correlated with IsBadBuy.
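
One way to make the "highly correlated" observation concrete is to list every pair of columns whose Pearson correlation exceeds a cutoff. The sketch below does this in plain Python on a few invented rows; the helper names, the 0.9 threshold, and the sample values are assumptions for illustration and are not taken from the data set.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    pairs = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Invented values for three of the notebook's columns
columns = {
    "MMRAcquisitionAuctionAveragePrice": [5000, 6200, 7100, 8300, 9000],
    "MMRAcquisitionAuctionCleanPrice": [6000, 7300, 8200, 9500, 10300],
    "VehOdo": [90000, 60000, 85000, 50000, 72000],
}
correlated = highly_correlated_pairs(columns)
print(correlated)
```

Pairs flagged this way are candidates for dropping or combining, since near-duplicate features add little new information to a model.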

### Data processing
1. Dropping columns

In [None]:
raw_df = df
# Dropping columns
df = df.drop(['PRIMEUNIT', 'AUCGUART'], axis=1)

#dropping the target variable
#df = df.drop('IsBadBuy', axis=1)

In [None]:
df.info()

>Has NULL Values:

>1. Trim
>2. AUCGUART
>3. PRIMEUNIT
>4. ALL Price Cols

>Numerical Variables:

>1. VehYear
>2. VehicleAge
>3. VehOdo
>4. VNZIP1
>5. VehBCost
>6. All Price Columns

>Categorical Variables:

>1. Auction
>2. Transmission
>3. WheelType
>4. Nationality
>5. Size
>6. TopThreeAmericanName
>7. IsOnlineSale

> The target variable IsBadBuy is a binary classification variable: it takes the value 1 if the purchased car is a bad buy and 0 if it is a good buy.

> It is important to note that false negatives carry a high cost in this prediction. A false negative means the dealership believes the car is a good buy that it will be able to sell, when in reality it is a bad buy and not sellable.

> A false positive has a cost as well: if a purchase is classified as a bad buy when the car is in fact sellable, the dealership might lose the opportunity to sell the car and make a profit on it.
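
To make the trade-off between the two error types concrete, the sketch below totals a cost for a set of predictions, with a different price on each kind of mistake. The dollar amounts and both prediction lists are hypothetical; real costs would have to come from the dealership.

```python
def total_error_cost(y_true, y_pred, fn_cost, fp_cost):
    """Sum the cost of false negatives and false positives (1 = bad buy)."""
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return false_negatives * fn_cost + false_positives * fp_cost


# Hypothetical costs: buying a kick hurts more than passing on a good car
FN_COST = 5000  # a bad buy predicted as good
FP_COST = 1000  # a good buy predicted as bad

y_true = [1, 1, 0, 0, 0, 1, 0, 0]
cautious = [1, 1, 1, 0, 1, 1, 0, 0]  # flags every kick, plus two good cars
lenient = [1, 0, 0, 0, 0, 0, 0, 0]   # misses two kicks, flags no good cars

cautious_cost = total_error_cost(y_true, cautious, FN_COST, FP_COST)
lenient_cost = total_error_cost(y_true, lenient, FN_COST, FP_COST)
print(cautious_cost, lenient_cost)
```

Both models make two mistakes, so their accuracy is identical, yet under these assumed costs the cautious one is far cheaper. This is why recall on the bad-buy class deserves attention alongside accuracy.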

>Questions:

>1. There is no target column (IsBadBuy) in the test data provided by the Kaggle competition. Does this mean that I have to use cross-validation to split the training data into 5 or 10 folds, and how do I decide how many folds to use?

>2. WheelType has 3174 (4.3%) missing values, while WheelTypeID has 3169 (4.3%) missing values.
>I expected both columns to have the same number of missing values, but they do not.
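
On the first question above, one possible approach is stratified k-fold cross-validation on the training data, so that each fold keeps roughly the same share of bad buys. The sketch below builds such folds in plain Python; the function name, the choice of 5 folds, and the toy labels are assumptions for illustration. scikit-learn provides `StratifiedKFold` in `sklearn.model_selection` for real use.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=5, seed=123):
    """Split row indices into n_splits folds with similar class proportions."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices round-robin across the folds
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Toy labels: 40 good buys and 10 bad buys
labels = [0] * 40 + [1] * 10
folds = stratified_kfold_indices(labels, n_splits=5)

for k, validation_idx in enumerate(folds):
    held_out = set(validation_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    bad_buys = sum(labels[i] for i in validation_idx)
    print(f"fold {k}: train={len(train_idx)} validation={len(validation_idx)} bad_buys={bad_buys}")
```

Five and ten folds are both common choices. More folds give each model more training data but take longer to run, so the number is usually a trade-off between compute time and the stability of the estimate.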



>Clarifications:
>1. Auction: This is the expected price of the car at an auction.
>2. MMR: This is the Manheim Market Report, an indicator of wholesale car prices published by a well-established company that provides statistically sound wholesale price estimates.
>3. Acquisition: This is the MMR price of the car at the time it was purchased at the auction.
>4. Retail: This is the price a customer is expected to be willing to pay for the car at the dealership.
>5. VNST and VNZIP1 are the state and zip code.
>6. TopThreeAmericanName: Whether the vehicle is from one of the top three American car manufacturers.

>Notes: use nearest neighbors for categoricals or look at similar cars; look at highly correlated variables; understand the missing data and then find solutions.

In [None]:
df['Make'].value_counts().plot(kind='barh', figsize=(10, 8))
plt.xlabel("Vehicle Counts", labelpad=14)
plt.ylabel("Car Makes", labelpad=14)
plt.title("Counts of Car Makes", y=1.02);

In [None]:
# Replace the inconsistent label 'Manual' with 'MANUAL'
df['Transmission'] = df['Transmission'].replace('Manual', 'MANUAL')

In [None]:
df['Transmission'].value_counts()

In [None]:
df.head()

In [None]:
import re

# Write a pattern to extract numbers and decimals
def return_eng_size(length):
    pattern = re.compile(r"\d+\.\d[lL]")
    
    # Search the text for matches
    size = re.search(pattern, length)
    
    # If a value is returned, use group(0) to return the found value
    if size is not None:
        return str(size.group(0))
    else:
        return str("Missing")
    
df["EngineSize"] = df['Model'].apply(lambda x: return_eng_size(x))

In [None]:
# Write a pattern to extract drivetrain and engine characteristics (e.g. 4WD, AWD, V6, DOHC)
def return_veh_char(length):
    pattern = re.compile(r"4WD|2WD|AWD|FWD|V8|V6|4C|6C|DOHC|MPI|SFI*|MFI|EFI*")
    
    # Search the text for matches
    veh_char = re.search(pattern, str(length))
    
    # If a value is returned, use group(0) to return the found value
    if veh_char is not None:
        return str(veh_char.group(0))
    else:
        return str("Missing")
    
df["VehileEngChar"] = df['Model'].apply(lambda x: return_veh_char(x))

In [None]:
df["VehileEngCharSub"] = df['SubModel'].apply(lambda x: return_veh_char(x))

In [None]:
df["VehileEngChar"].value_counts()

In [None]:
df['VehEngChar'] = df.apply(lambda x: 'AWD' if x['VehileEngChar'] == 'AWD' or x['VehileEngCharSub'] == 'AWD' else x['VehileEngChar'], axis=1)

In [None]:
df['VehEngChar'].value_counts()

In [None]:
df = df.drop(['VehileEngChar','VehileEngCharSub'],axis=1)
df.info()

In [None]:
# Write a pattern to extract the cylinder layout (e.g. V6, I4, "V 6", "I-4")
def return_eng_cylnd(length):
    pattern = re.compile(r"([vViI]\d|[vViI]\s\d|[vViI]\-\d)")
    
    # Search the text for matches
    eng_cylnd = re.search(pattern, length)
    
    # If a value is returned, use group(0) to return the found value
    if eng_cylnd is not None:
        return str(eng_cylnd.group(0))
    else:
        return str("Missing")
    
df["VehileEngCylinder"] = df['Model'].apply(lambda x: return_eng_cylnd(x))
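
To see what the three extraction patterns above pick up, the sketch below applies them to a few made-up strings. The patterns are copied from the cells above; the helper `first_match` and the sample strings are hypothetical and only mimic the style of the Model column.

```python
import re

# Patterns copied from the extraction cells above
ENGINE_SIZE = re.compile(r"\d+\.\d[lL]")
VEHICLE_CHAR = re.compile(r"4WD|2WD|AWD|FWD|V8|V6|4C|6C|DOHC|MPI|SFI*|MFI|EFI*")
CYLINDERS = re.compile(r"([vViI]\d|[vViI]\s\d|[vViI]\-\d)")


def first_match(pattern, text):
    """Return the first match of pattern in text, or "Missing" if there is none."""
    match = pattern.search(str(text))
    return match.group(0) if match else "Missing"


# Hypothetical model strings
samples = [
    "1500 RAM PICKUP 2WD V8 4.7L",
    "TAURUS 3.0L V6 EFI",
    "MALIBU 4C 2.2L I-4",
    "COBALT",
]

extracted = {
    text: (
        first_match(ENGINE_SIZE, text),
        first_match(VEHICLE_CHAR, text),
        first_match(CYLINDERS, text),
    )
    for text in samples
}

for text, (size, characteristic, cylinders) in extracted.items():
    print(f"{text!r}: size={size}, characteristic={characteristic}, cylinders={cylinders}")
```

Because `re.search` returns only the first match, a string such as "TAURUS 3.0L V6 EFI" yields "V6" as its characteristic and the later "EFI" is ignored.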

In [None]:
df.EngineSize.value_counts()

In [None]:
def fix_eng_vals(x):
    # Normalize labels such as "I-4" or "V 6" to "I4" / "V6" by removing
    # the space or hyphen between the letter and the digit.
    # This also covers digits the if/elif chain missed (e.g. "V-8");
    # values without a separator, including "Missing", are returned unchanged.
    return re.sub(r"[\s-]", "", x)

df["VehEngCylndr"] = df['VehileEngCylinder'].apply(lambda x: fix_eng_vals(x))

In [None]:
df["VehEngCylndr"].value_counts()
df.drop(['VehileEngCylinder'],axis=1,inplace=True)

In [None]:
df.head()

In [None]:
df['Model'].value_counts()

In [None]:
#The data set contains several numerical and categorical attributes
df.columns.to_series().groupby(df.dtypes).groups

In [None]:
null_df = df[['MMRCurrentRetailCleanPrice','MMRCurrentRetailAveragePrice','MMRCurrentAuctionCleanPrice',
            'MMRCurrentAuctionAveragePrice','MMRAcquisitionRetailAveragePrice', 'MMRAcquisitionAuctionCleanPrice',
             'MMRAcquisitonRetailCleanPrice','MMRAcquisitionAuctionAveragePrice','Color','SubModel',
             'TopThreeAmericanName','Nationality','Size']].isnull().sum().sort_values(ascending=False)
print(null_df)

### Regressing Current Prices based on Acquisition Prices

In [None]:
df_num = df.copy(deep=True)

In [None]:
from sklearn import linear_model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

clf = linear_model.LinearRegression()
degree = 3
polynomial_features = PolynomialFeatures(degree = degree, include_bias = True)
pipeline = Pipeline([("polynomial_features", polynomial_features), ("classifier", clf)])

non_null_train = df_num[(pd.notnull(df_num.MMRAcquisitionAuctionAveragePrice)) &
  (pd.notnull(df_num.MMRAcquisitionAuctionCleanPrice)) &
  (pd.notnull(df_num.MMRAcquisitionRetailAveragePrice)) &
  (pd.notnull(df_num.MMRAcquisitonRetailCleanPrice)) &
  (pd.notnull(df_num.MMRCurrentAuctionAveragePrice)) &
  (pd.notnull(df_num.MMRCurrentAuctionCleanPrice)) &
  (pd.notnull(df_num.MMRCurrentRetailAveragePrice)) &
  (pd.notnull(df_num.MMRCurrentRetailCleanPrice))]

null_mask = ((pd.notnull(df_num.MMRAcquisitionAuctionAveragePrice)) &
  (pd.notnull(df_num.MMRAcquisitionAuctionCleanPrice)) &
  (pd.notnull(df_num.MMRAcquisitionRetailAveragePrice)) &
  (pd.notnull(df_num.MMRAcquisitonRetailCleanPrice)) &
  (pd.isnull(df_num.MMRCurrentAuctionAveragePrice)) &
  (pd.isnull(df_num.MMRCurrentAuctionCleanPrice)) &
  (pd.isnull(df_num.MMRCurrentRetailAveragePrice)) &
  (pd.isnull(df_num.MMRCurrentRetailCleanPrice)))

null_predict = df_num[null_mask]

X = non_null_train[['MMRAcquisitionAuctionAveragePrice',
  'MMRAcquisitionAuctionCleanPrice',
  'MMRAcquisitionRetailAveragePrice',
  'MMRAcquisitonRetailCleanPrice',
  'VehicleAge',
  'VehOdo'
]]

X_predict = null_predict[['MMRAcquisitionAuctionAveragePrice',
  'MMRAcquisitionAuctionCleanPrice',
  'MMRAcquisitionRetailAveragePrice',
  'MMRAcquisitonRetailCleanPrice',
  'VehicleAge',
  'VehOdo'
]]

for target in ['MMRCurrentAuctionAveragePrice', 
               'MMRCurrentAuctionCleanPrice', 
               'MMRCurrentRetailAveragePrice', 
               'MMRCurrentRetailCleanPrice']:
    y = non_null_train[target]
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size= 0.3, random_state=123)
    pipeline.fit(X_train, y_train)
    print(pipeline.score(X_train, y_train))
    print(pipeline.score(X_test, y_test))
    result = pipeline.predict(X_predict)
    df_num.loc[null_mask, target] = result

In [None]:
df_num[['MMRCurrentAuctionAveragePrice', 
        'MMRCurrentAuctionCleanPrice', 
        'MMRCurrentRetailAveragePrice', 
        'MMRCurrentRetailCleanPrice']].isnull().sum()

### Imputing the remaining missing numerical values

In [None]:
from sklearn.impute import SimpleImputer

# Mean-impute the numerical values that are still missing.
# df_num is used here so that the regression-imputed current prices are kept.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

num_df = df_num.select_dtypes(include=[np.number])

X = imputer.fit_transform(num_df)

num_df = pd.DataFrame(X, columns=num_df.columns.values)

num_df.head()

num_df.info()

In [None]:
# All the price columns are related as per the report above.
avg_prices_cols = ['MMRAcquisitionAuctionAveragePrice',
              'MMRAcquisitionAuctionCleanPrice',
              'MMRAcquisitionRetailAveragePrice',
              'MMRAcquisitonRetailCleanPrice',              
              'MMRCurrentAuctionAveragePrice',
              'MMRCurrentAuctionCleanPrice',
              'MMRCurrentRetailAveragePrice',
              'MMRCurrentRetailCleanPrice']

#df['AvgAuctionPrice'] = sum(df[i] for i in avg_prices_cols) / len(avg_prices_cols) 
#df = df.drop(avg_prices_cols, axis=1)

In [None]:
empty_trim_df = df[df['Trim'].isnull()]
empty_trim_df['Make'].value_counts().plot(kind='barh',figsize=(10, 8))
plt.xlabel("Count of Car Makes with Nulls", labelpad=14)
plt.ylabel("Car Makes", labelpad=14)
plt.title("Count Null values by Car Manufacturer", y=1.02);

> Suzuki cars have a lot of their Trim values missing in the data set.
> Missing categorical values will be filled with the most common occurrence, using the fancyimpute package.
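
As a minimal sketch of what filling by most common occurrence can look like, the code below replaces a missing Trim with the most frequent Trim among cars of the same Make, and falls back to the overall most frequent value when a make has no observed Trim at all. This is plain Python on invented rows rather than the fancyimpute-based cells that follow; the function name and the grouping by Make are assumptions for illustration.

```python
from collections import Counter, defaultdict


def impute_by_group_mode(rows, group_key, target_key):
    """Fill missing target values with the most common value in the same group."""
    per_group = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        value = row[target_key]
        if value is not None:
            per_group[row[group_key]][value] += 1
            overall[value] += 1

    imputed = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row[target_key] is None:
            group_counts = per_group.get(new_row[group_key])
            # Fall back to the overall mode when the group has no observed values
            source = group_counts if group_counts else overall
            new_row[target_key] = source.most_common(1)[0][0]
        imputed.append(new_row)
    return imputed


# Invented rows; None marks a missing Trim
rows = [
    {"Make": "FORD", "Trim": "SE"},
    {"Make": "FORD", "Trim": "SE"},
    {"Make": "FORD", "Trim": "SEL"},
    {"Make": "FORD", "Trim": None},
    {"Make": "SUZUKI", "Trim": None},
    {"Make": "DODGE", "Trim": "SXT"},
]
filled = impute_by_group_mode(rows, "Make", "Trim")
print(filled)
```

The fallback matters for a make like Suzuki when every Trim is missing: borrowing another manufacturer's trim name is a crude guess, so a separate "Missing" category may be a reasonable alternative.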

In [None]:
# Take a copy of the data frame so far
clean_df = df.copy()

#Extract date features and drop the date column
clean_df['PurchDate'] = pd.to_datetime(clean_df['PurchDate'])

# Adding new features from the Purchase date
clean_df['PurchDay'] = clean_df['PurchDate'].dt.day
clean_df['PurchMon'] = clean_df['PurchDate'].dt.month
clean_df['PurchYear'] = clean_df['PurchDate'].dt.year

#Remove columns deemed not necessary since we already extracted its features
clean_df = clean_df.drop(['PurchDate'],axis=1)

In [None]:
print("Missing values in the dataset:")
print(clean_df.isnull().sum())
print("\n")
print("% of missing values:")
print(clean_df.isnull().mean()*100)

In [None]:
# Analysis of Missing Values using missingno package in Python
import missingno as msno

# Plot the missingno bar chart
msno.bar(clean_df)

In [None]:
# Plot the matrix to show where the missing values fall in each column
msno.matrix(clean_df)

In [None]:
# Plot the heatmap to show any correlation between missing values, for example, wheel type
msno.heatmap(clean_df)

In [None]:
# Plot the dendrogram for further analysis of how closely the missing values in different columns are related
msno.dendrogram(clean_df)

In [None]:
# Take a copy of the data frame for categorical imputation
cars_df = clean_df.copy(deep=True)

# Drop the columns that are not needed like numerical and categorical values that are fully existing in the data set.
cars_df = cars_df.drop(['Auction','VehYear','VehicleAge','Make','Model','VehOdo','BYRNO','VNZIP1','VNST',
                        'VehBCost','IsOnlineSale','WarrantyCost','WheelTypeID','IsBadBuy','MMRAcquisitionAuctionAveragePrice',
                       'MMRAcquisitionAuctionCleanPrice','MMRAcquisitionRetailAveragePrice','MMRAcquisitonRetailCleanPrice',
                        'MMRCurrentAuctionAveragePrice','MMRCurrentAuctionCleanPrice','MMRCurrentRetailAveragePrice',
                       'MMRCurrentRetailCleanPrice'],axis=1)

In [None]:
# Categorical variables selected for imputation
cars_df.info()

In [None]:
# Create an empty dictionary ordinal_enc_dict
ordinal_enc_dict = {}

for col_name in cars_df:
    # Create Ordinal encoder for col
    ordinal_enc_dict[col_name] = OrdinalEncoder()
    col = cars_df[col_name]
    
    # Select non-null values of col
    col_not_null = col[col.notnull()]
    reshaped_vals = col_not_null.values.reshape(-1, 1)
    encoded_vals = ordinal_enc_dict[col_name].fit_transform(reshaped_vals)
    
    # Store the encoded values back into the non-null positions of the column in cars_df
    cars_df.loc[col.notnull(), col_name] = np.squeeze(encoded_vals)

In [None]:
#clean_df.info()
from fancyimpute import SimpleFill
# Create SimpleFill imputer
Simple_imp = SimpleFill("mean")

# Impute the encoded cars_df and round to the nearest ordinal code
#clean_df.iloc[:, :] = np.round(KNN_imputer.fit_transform(clean_df))
cars_df.iloc[:, :] = np.round(Simple_imp.fit_transform(cars_df))


# Loop over the column names in cars_df
for col_name in cars_df:
    
    # Reshape the data
    reshaped = cars_df[col_name].values.reshape(-1, 1)
    
    # Perform inverse transform of the ordinally encoded columns
    cars_df[col_name] = ordinal_enc_dict[col_name].inverse_transform(reshaped)

In [None]:
# View the imputed data frame
cars_df.head(5)

# merging the numerical and categorical variables data frames
final_df = pd.merge(num_df, cars_df, on='RefId')

# Variables in the final data frame before addressing class imbalance
final_df.info()

In [None]:
final_df.head()

In [None]:
# Since over 95% values are missing in PRIMEUNIT and AUCGUART variables, removing these columns
#final_df = final_df.drop(['PRIMEUNIT','AUCGUART'],axis=1)

In [None]:
# Plot the class imbalance in the data set
df_grpd = final_df.groupby(['IsBadBuy']).size().plot.barh(x="IsBadBuy",y="Vehicle Counts")

In [None]:
#Import package
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = final_df[final_df.IsBadBuy==0]
df_minority = final_df[final_df.IsBadBuy==1]
 
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),    # to match majority class
                                 random_state=1234) # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
df_upsampled.IsBadBuy.value_counts()
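
Because upsampling duplicates minority rows, copies of the same car can land in both the training and test sets if the split happens afterwards, which tends to inflate test scores. One common precaution is to hold out the test rows first and upsample only the training portion. The sketch below shows that ordering in plain Python on invented rows; the function names, the 25% test fraction, and the data are assumptions for illustration.

```python
import random


def stratified_split(rows, label_key, test_fraction, rng):
    """Hold out test_fraction of each class so both sets keep the class ratio."""
    train, test = [], []
    for label in (0, 1):
        group = [row for row in rows if row[label_key] == label]
        rng.shuffle(group)
        n_test = int(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def upsample_minority(train, label_key, rng):
    """Resample minority rows with replacement until both classes are equal in size."""
    majority = [row for row in train if row[label_key] == 0]
    minority = [row for row in train if row[label_key] == 1]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra


rng = random.Random(1234)

# Invented data: 20 bad buys and 80 good buys
rows = [{"RefId": i, "IsBadBuy": 1 if i < 20 else 0} for i in range(100)]

train, test = stratified_split(rows, "IsBadBuy", test_fraction=0.25, rng=rng)
train_balanced = upsample_minority(train, "IsBadBuy", rng)

train_ids = {row["RefId"] for row in train_balanced}
test_ids = {row["RefId"] for row in test}
print(len(train_balanced), len(test), len(train_ids & test_ids))
```

With this ordering the test set keeps the original class imbalance, so its scores are closer to what the model would see on new auction data.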

In [None]:
df_grpd_upsam = df_upsampled.groupby(['IsBadBuy']).size().plot.barh(x="IsBadBuy",y="Vehicle Counts")

In [None]:
# Copy the upsampled DataFrame
data = df_upsampled.copy(deep=True)

data.info()

#### Feature Selection
1. Drop WheelTypeID, since WheelType already carries the same information.
2. Drop the ID variable RefId.

In [None]:
df_upsampled.info()

### Machine Learning

In [None]:
# Copy the upsampled DataFrame
data = df_upsampled.copy(deep=True)

# Drop the variables that are not needed, along with the target variable
X = data.drop(['RefId','WheelTypeID','IsBadBuy'], axis=1)

# target variable
y = data.IsBadBuy

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    random_state=123, 
                                                    stratify=y)

In [None]:
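# Label-encode a categorical feature, fitting the encoder on the values seen in both train and test
# Note: the function returns (test, train), in that order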
def make_dummies(train_in, test_in, feature):
    from sklearn import preprocessing
    le = preprocessing.LabelEncoder()
    train_set = list(set(train_in[feature].values))
    test_set = list(set(test_in[feature].values))
    encoder_set = list(set(train_set + test_set))
    le.fit(encoder_set)
    new_feature = le.transform(train_in[feature].values)
    train_in=train_in.drop(feature, axis=1)
    train_in[feature] = new_feature 
    new_feature = le.transform(test_in[feature].values)
    test_in=test_in.drop(feature, axis=1)
    test_in[feature] = new_feature
 
    return test_in, train_in

In [None]:
def gini(actual, pred):
    assert (len(actual) == len(pred))
    # Columns: actual value, predicted value, original row index
    # (np.float was removed from NumPy, so the built-in float is used)
    data = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    # Sort by prediction (descending), breaking ties by original row index
    data = data[np.lexsort((data[:, 2], -1 * data[:, 1]))]
    totalLosses = data[:, 0].sum()
    giniSum = data[:, 0].cumsum().sum() / totalLosses

    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)


def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)
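
For binary labels and scores without ties, the normalized Gini computed this way equals 2 × AUC − 1. The sketch below checks that identity on a tiny invented example, using a plain-Python port of the same sorting logic; the function names and the sample scores are assumptions for illustration.

```python
def gini_sorted(actual, scores):
    """Plain-Python port of the Gini calculation: sort by score, then accumulate."""
    n = len(actual)
    # Sort by score (descending), breaking ties by original position
    order = sorted(range(n), key=lambda i: (-scores[i], i))
    running, cumulative = 0.0, 0.0
    for i in order:
        running += actual[i]
        cumulative += running
    return (cumulative / sum(actual) - (n + 1) / 2.0) / n


def gini_normalized_sorted(actual, scores):
    """Gini of the scores divided by the Gini of a perfect ranking."""
    return gini_sorted(actual, scores) / gini_sorted(actual, actual)


def auc(actual, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in positives
        for q in negatives
    )
    return wins / (len(positives) * len(negatives))


# Invented labels (1 = bad buy) and predicted probabilities of a bad buy
actual = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]

normalized_gini = gini_normalized_sorted(actual, scores)
auc_value = auc(actual, scores)
print(normalized_gini, 2 * auc_value - 1)
```

The metric measures how well the scores rank the cars, so it is most informative when fed predicted probabilities (for example the positive-class column of `predict_proba`). Hard 0/1 predictions contain many ties, which this implementation breaks by row order.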

In [None]:
for feature in ['Trim','SubModel', 'Color' ,'Transmission', 'WheelType', 'Nationality',
                'Size', 'TopThreeAmericanName', 'EngineSize' ,'VehEngChar' ,'VehEngCylndr']:
    X_test, X_train = make_dummies(X_train, X_test, feature)

In [None]:
# Instantiate a DecisionTreeClassifier 'dt' with a maximum depth of 30
dt = DecisionTreeClassifier(criterion ='gini', max_depth=30, random_state=1)

# Fit dt to the training set
dt.fit(X_train,y_train)

# Predict test set labels
y_pred = dt.predict(X_test)

# Compute test set accuracy  
acc = accuracy_score(y_test, y_pred)
print("Test set accuracy: {:.2f}".format(acc))

In [None]:
# Import necessary modules
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("\n")

gini_predictions = gini(y_test, y_pred)
gini_max = gini(y_test, y_test)
ngini= gini_normalized(y_test, y_pred)
print('Gini: %.3f, Max. Gini: %.3f, Normalized Gini: %.3f' % (gini_predictions, gini_max, ngini))

In [None]:
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier 

# Create a k-NN classifier with 4 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=4)

# Fit the classifier to the data
knn.fit(X_train,y_train)

# Predict the labels for the test data X_test
y_pred_knn = knn.predict(X_test)

# Accuracy Scores
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_knn))

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))

f1_knn = metrics.f1_score(y_test, y_pred_knn)

print("F1 Score:"+str(f1_knn))

print("\n")

gini_predictions_knn = gini(y_test, y_pred_knn)
gini_max_knn = gini(y_test, y_test)
ngini_knn = gini_normalized(y_test, y_pred_knn)
print('Gini: %.3f, Max. Gini: %.3f, Normalized Gini: %.3f' % (gini_predictions_knn, gini_max_knn, ngini_knn))

In [None]:
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(n_estimators = 50, criterion = 'entropy', min_samples_split = 285)

RF.fit(X_train, y_train)

y_pred_rf = RF.predict(X_test)

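f1_rf = metrics.f1_score(y_test, y_pred_rf)

print("F1 Score:"+str(f1_rf))

print(metrics.classification_report(y_test, y_pred_rf))

# Pair importances with X_train's columns, whose order matches the fitted model
# (the encoding step moved the categorical columns to the end, so X's order differs)
importance = zip(RF.feature_importances_, X_train.columns)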

for rank in sorted(importance, key = lambda x: x[0], reverse = True):
    print(rank)
    
print("\n")

gini_predictions_rf = gini(y_test, y_pred_rf)
gini_max_rf = gini(y_test, y_test)
ngini_rf = gini_normalized(y_test, y_pred_rf)
print('Gini: %.3f, Max. Gini: %.3f, Normalized Gini: %.3f' % (gini_predictions_rf, gini_max_rf, ngini_rf))

In [None]:
import xgboost as xgb

# Instantiate the XGBoost random forest classifier
xg_cl = xgb.XGBRFClassifier(objective='binary:logistic', n_estimators=10, random_state=123)

# Fit on train
xg_cl.fit(X_train, y_train)

#Predict on Test
y_pred_xgb = xg_cl.predict(X_test)

#Accuracy Scores
accuracy_xgb = float(np.sum(y_pred_xgb==y_test))/y_test.shape[0]

print('accuracy: %f' % (accuracy_xgb))

f1_xgb = metrics.f1_score(y_test, y_pred_xgb)

print("F1 Score:"+str(f1_xgb))

print(metrics.classification_report(y_test, y_pred_xgb))

# Pair importances with X_train's columns, whose order matches the fitted model
importance = zip(xg_cl.feature_importances_, X_train.columns)

for rank in sorted(importance, key = lambda x: x[0], reverse = True):
    print(rank)

print("\n")

gini_predictions_xgb = gini(y_test, y_pred_xgb)
gini_max_xgb = gini(y_test, y_test)
ngini_xgb = gini_normalized(y_test, y_pred_xgb)
print('Gini: %.3f, Max. Gini: %.3f, Normalized Gini: %.3f' % (gini_predictions_xgb, gini_max_xgb, ngini_xgb))

#### Based on the different machine learning models applied so far for classifying good and bad car buys at the auction, DecisionTreeClassifier seems to be the most performant model of all.