# **INTRODUCTION**

## **Aim of the study:**

To create new features using a naive method which will be called 'Brute Force' feature engineering for the purpose of this exercise.

## **Operational definitions:**

Brute force method: Trying all combinations of the available features (here, products and ratios), limited only by the imagination of the user, and then applying feature selection to the results.

## **Rationale/Purpose of study:**

I remain data agnostic at this point in my career, so my functional knowledge of some of the areas the data deals with is shallow. 'Brute Force' feature engineering helps create new, more useful features without that knowledge. If the method remains relevant for most datasets across different functional areas, we can use it as a means of improving our models. Even before attempting 'Brute Force' on the features, we can see that it will work only with continuous variables.

# **METHODOLOGY**

## **Settings & Sample:**

813 records of the dataset remain after removing the records with Null values.

## **Procedure:**

Our aim is to find new useful features. To achieve that we will do the following:
1. Multiply every feature by all other features. Up to 3 features will be combined in each product.
1. Divide each feature by every other feature.
1. Split each new feature into groups, one per class of the target variable.
1. Run a Shapiro test to check normality, then perform the Kruskal-Wallis H-test between each pair of groups to check whether their medians are significantly different.
1. Record all the p-values in a separate table.
1. From the p-value table, select every combination (product or division) whose p-values are all less than 0.05. In other words, a combination is selected only when the medians of all its groups differ significantly from one another.
1. Create a new dataframe with all the selected feature combinations.

Note: The original features were dropped here. However, any original feature that passes the selection criteria set at the start becomes part of the final dataset.
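
The procedure above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own loop: the frame `df_toy`, its columns `a`, `b`, `c` and the dictionary `candidates` are hypothetical names, and `itertools` is used here in place of nested index loops.

```python
from itertools import combinations_with_replacement, permutations

import pandas as pd

# Toy data: three hypothetical continuous features (not the vehicle dataset)
df_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                       "b": [4.0, 5.0, 6.0],
                       "c": [7.0, 8.0, 9.0]})

features = list(df_toy.columns)
candidates = {}

# Products of 1, 2 or 3 features; repeats are allowed, so a*a*b is included
for degree in (1, 2, 3):
    for cols in combinations_with_replacement(features, degree):
        candidates["*".join(cols)] = df_toy[list(cols)].prod(axis=1)

# Ratios of every ordered pair of distinct features (a/b and b/a are both kept)
for num, den in permutations(features, 2):
    candidates[f"{num}/{den}"] = df_toy[num] / df_toy[den]

df_candidates = pd.DataFrame(candidates)
print(df_candidates.shape)
```

Degree-1 "products" are simply the original features, which is why an original feature can survive selection alongside the engineered ones.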

# **CODE**

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import shapiro, kruskal
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
# Load the dataset
df_vehicle = pd.read_csv("/kaggle/input/vehicle/vehicle.csv")

# Display basic information
df_vehicle.info()

## **Data Cleaning**

**After removing all the records with Null values we still have around 813 records to work with. So we will simply drop those records instead of imputing them.**

In [None]:
# Dropping records with any Null value and resetting the index
df_clean = df_vehicle.dropna().reset_index(drop=True)

In [None]:
# View the multivariate pairplot
sns.pairplot(data=df_clean, hue="class")

**As the graph above shows, a few columns are highly correlated with each other and the boundaries between the classes overlap considerably. Let's see if we can create new features with more distinct boundaries.**

**The following part is the main purpose of my work on this dataset. My functional knowledge in this area is shallow, so we will use something akin to brute force to create new features. We will try all products of the available features up to the 3rd degree, as well as the division of each feature by every other feature. Then we will let a significance test on the per-class medians of each new feature decide whether the feature should be used. This is to compensate for the lack of functional knowledge.**

**Although a p-value on the medians will not produce distinct decision boundaries, it can check whether the majority of each class lies significantly on the correct side of a decision boundary.**
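
To make the last point concrete, here is a minimal sketch on synthetic data, assuming two skewed groups whose distributions overlap but whose locations differ. The arrays `group_a` and `group_b` are illustrative and are not taken from the vehicle dataset.

```python
import numpy as np
from scipy.stats import kruskal, shapiro

rng = np.random.default_rng(0)

# Two synthetic, skewed groups with overlapping ranges but different locations
group_a = rng.exponential(scale=1.0, size=200)
group_b = rng.exponential(scale=1.0, size=200) + 2.0

# Shapiro-Wilk test: a small p-value means normality is rejected
p_norm_a = shapiro(group_a)[1]
p_norm_b = shapiro(group_b)[1]

# Kruskal-Wallis H-test: a small p-value means the groups differ in location
h_stat, p_kw = kruskal(group_a, group_b)

print("normality rejected:", p_norm_a < 0.05, p_norm_b < 0.05)
print("locations differ:", p_kw < 0.05)
print("share of group_a above min(group_b):", (group_a > group_b.min()).mean())
```

The final line shows that some of `group_a` still falls inside the range of `group_b`, so a significant test result indicates separation of the bulk of each group rather than a clean boundary.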

**Our aim is to find new useful features. To achieve that we will do the following:**
1. Multiply every feature by all other features. Up to 3 features will be combined in each product.
1. Divide each feature by every other feature.
1. Split each new feature into groups, one per class of the target variable.
1. Run a Shapiro test to check normality, then perform the Kruskal-Wallis H-test between each pair of groups to check whether their medians are significantly different.
1. Record all the p-values in a separate table.
1. From the p-value table, select every combination (product or division) whose p-values are all less than 0.05. In other words, a combination is selected only when the medians of all its groups differ significantly from one another.
1. Create a new dataframe with all the selected feature combinations.
    
**The number of features selected will generally be quite high. If that is the case we can use PCA or other dimension reduction techniques to reduce the cost of processing.**

**Using PCA will have another useful side effect, which is to remove correlation between the new features. Because many features are created from combinations of a few, most selected features will be correlated with others.**

**Please note that I will keep things simple in these runs and not use PCA unless I absolutely have to. I will also not do any further feature engineering or feature selection. This is because my sole aim is to check how the brute force method of creating new features performs.**
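
The decorrelation effect of PCA mentioned above can be illustrated with a minimal sketch. The base features `a`, `b`, `c` and the engineered matrix `X` are synthetic assumptions, and the choice of two components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical base features and three engineered products that share factors
a = rng.uniform(1.0, 2.0, size=500)
b = rng.uniform(1.0, 2.0, size=500)
c = rng.uniform(1.0, 2.0, size=500)
X = np.column_stack([a * b, a * c, a * b * c])

# The engineered features are correlated because they share the factor a
corr_before = np.corrcoef(X, rowvar=False)
print(corr_before.round(2))

# Standardise first so that no feature dominates purely through its scale
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# The principal components are uncorrelated with each other
corr_after = np.corrcoef(X_pca, rowvar=False)
print(corr_after.round(2))
```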

In [None]:
# Function that returns the Kruskal-Wallis p-value for two groups, gated by a Shapiro test
def significance_test(x, y):
    # Run the Kruskal-Wallis test only when the Shapiro test rejects normality
    # for both groups (95% confidence); otherwise return NaN
    if (shapiro(x)[1]<0.05) and (shapiro(y)[1]<0.05):
        # Kruskal-Wallis H-test
        _, p = kruskal(x, y)
    else:
        p = np.nan
    return p

In [None]:
# Creating analysis dictionary to store results
analysis = {}

# Inserting a column of ones (named '') just before the target so that products of two features and single features are also considered
df_clean.insert(df_clean.shape[1]-1, '', 1)

# Running the significance test between the class groups for every feature combination
# Running a for loop to extract each attribute name individually
for idx1 in range(len(df_clean.columns[:-2])):
    
    # Assign col1 identity
    col1 = df_clean.columns[idx1]
    for idx2 in range(len(df_clean.columns[:-1])):
        
        # Assign col2 identity
        col2 = df_clean.columns[idx2]
        
        # Skip the column of ones as denominator and skip dividing a feature by itself
        if not (idx2==df_clean.shape[1]-2 or idx1==idx2):
            # Creating 3 groups based on Dependent variable labels for division
            group1 = df_clean[col1][df_clean["class"]=="car"]/df_clean[col2][df_clean["class"]=="car"]
            group2 = df_clean[col1][df_clean["class"]=="bus"]/df_clean[col2][df_clean["class"]=="bus"]
            group3 = df_clean[col1][df_clean["class"]=="van"]/df_clean[col2][df_clean["class"]=="van"]
            
            # Storing results in analysis dictionary
            analysis[col1+'/'+col2] = []
            analysis[col1+'/'+col2].append(significance_test(group1, group2))
            analysis[col1+'/'+col2].append(significance_test(group3, group2))
            analysis[col1+'/'+col2].append(significance_test(group1, group3))

        for idx3 in range(len(df_clean.columns[:-1])):
            
            # Assign col3 identity
            col3 = df_clean.columns[idx3]
            
            if idx1<=idx2 and idx2<=idx3:
                
                # Creating 3 groups based on Dependent variable labels for product
                group1 = df_clean[col1][df_clean["class"]=="car"]*df_clean[col2][df_clean["class"]=="car"]*df_clean[col3][df_clean["class"]=="car"]
                group2 = df_clean[col1][df_clean["class"]=="bus"]*df_clean[col2][df_clean["class"]=="bus"]*df_clean[col3][df_clean["class"]=="bus"]
                group3 = df_clean[col1][df_clean["class"]=="van"]*df_clean[col2][df_clean["class"]=="van"]*df_clean[col3][df_clean["class"]=="van"]

                # Storing results in analysis dictionary
                analysis[col1+'*'+col2+'*'+col3] = []
                analysis[col1+'*'+col2+'*'+col3].append(significance_test(group1, group2))
                analysis[col1+'*'+col2+'*'+col3].append(significance_test(group3, group2))
                analysis[col1+'*'+col2+'*'+col3].append(significance_test(group1, group3))
                
# Create a dataframe from dictionary for convenience
df_analysis = pd.DataFrame(analysis, index=["CarvsBus", "VanvsBus", "CarvsVan"]).T

# Store the index of all feature combination with Null Hypothesis rejected (@ 95% confidence) on all groups in a list
ls_best = df_analysis[(df_analysis<0.05).all(axis=1)].index

# Initialise a new dataframe
df_new = pd.DataFrame()

# Loop through the best combination names from list to assess which column to operate on
for comb in ls_best:
    
    # Split names into required column names if division
    if '/' in comb:
        tmp = comb.split('/')
        # Create the new feature
        df_new[comb] = df_clean[tmp[0]]/df_clean[tmp[1]]
        
    # Split names into required column names if product
    elif '*' in comb:
        tmp = comb.split('*')
        # Create the new feature
        df_new[comb] = df_clean[tmp[0]]*df_clean[tmp[1]]*df_clean[tmp[2]]

# Drop the column of 1's as it is no longer required                        
df_clean.drop('', axis=1, inplace=True)

# Display stats of new dataset
df_new.describe().T
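
As a rough guide to how quickly the candidate set grows, the sketch below counts the ratio and product combinations that the nested loops above generate for a given number of features. The helper names `count_candidates` and `count_by_enumeration` are hypothetical, and the value 18 is only an illustrative size.

```python
from math import comb


def count_candidates(n_features):
    """Closed-form count of the ratio and product candidates."""
    # Ratios: every ordered pair of distinct features
    n_ratios = n_features * (n_features - 1)
    # Products: sorted index triples drawn from the features plus the column
    # of ones, excluding the single triple made of the ones column alone
    n_products = comb(n_features + 3, 3) - 1
    return n_ratios, n_products


def count_by_enumeration(n_features):
    """Same count, obtained by replaying the loop conditions."""
    n_cols = n_features + 1  # features plus the column of ones
    ratios = sum(1 for i in range(n_features) for j in range(n_cols)
                 if not (j == n_cols - 1 or i == j))
    products = sum(1 for i in range(n_features) for j in range(n_cols)
                   for k in range(n_cols) if i <= j <= k)
    return ratios, products


for n in (3, 18):
    print(n, count_candidates(n), count_by_enumeration(n))
```

The product count grows cubically with the number of features, which is one reason the selection step matters.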

## **Data Preprocessing**

In [None]:
# Separating the predictor variables: the selected feature combinations
X = df_new

# Separating the target variable, which is the "class" column
y = df_clean["class"]

# Label encoding the target variable
y = y.map({'car': 0, 'bus': 1, 'van': 2})

# Performing the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Verifying the proportions of y_train and y_test is the same as original dataset
print("y_train proportions")
print(np.unique(y_train, return_counts=True)[1]/len(y_train))

print("y_test proportions")
print(np.unique(y_test, return_counts=True)[1]/len(y_test))

## **Grid Search for best Parameters**

In [None]:
# parameter grid
param_grid = {"C": [0.001, 0.01, 0.1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 10, 100],
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
             }

In [None]:
# Creating the SVC model
model = svm.SVC(class_weight="balanced")

In [None]:
# run grid search on 3 folds
folds = 3
grid_search_SVC = GridSearchCV(model, 
                               cv = folds,
                               param_grid=param_grid,
                               return_train_score=True,                         
                               verbose = 1)

In [None]:
# fit 
grid_search_SVC.fit(X_train, y_train)

In [None]:
# cv results
cv_results = pd.DataFrame(grid_search_SVC.cv_results_)
cv_results.sort_values("rank_test_score")
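
Besides sorting `cv_results_`, a fitted `GridSearchCV` also exposes the winning configuration directly. The sketch below assumes synthetic data from `make_classification` in place of `X_train` and `y_train`, and a deliberately small grid (`small_grid`) so that it runs quickly.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=150, n_features=6,
                                     n_informative=4, n_redundant=0,
                                     n_classes=3, random_state=42)

# A reduced grid, for illustration only
small_grid = {"C": [0.1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(svm.SVC(class_weight="balanced"),
                      param_grid=small_grid, cv=3)
search.fit(X_demo, y_demo)

# Best hyperparameters and their mean cross-validated accuracy
print(search.best_params_)
print(round(search.best_score_, 3))

# With the default refit=True this is already refitted on all the data passed to fit
best_model = search.best_estimator_
```

Reading the parameters from `best_params_` avoids copying values by hand from the sorted results table.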

## **Final Model**

In [None]:
# Creating the SVC model with the best hyper parameters
model = svm.SVC(class_weight="balanced", random_state=42, C=0.001, gamma=0.001, kernel='linear')

# Fit the model to the entire train set
model.fit(X_train, y_train)

# Get Predictions
y_pred = model.predict(X_test)

# Create confusion matrix dataframe
df_cm = pd.crosstab(y_test, y_pred)

# Display confusion matrix dataframe
df_cm

## **Analysis**

In [None]:
# Creating a dictionary with the class information
dict_classes = {0: "car", 1: 'bus', 2: 'van'}

# Displaying the statistics of the classification Report
# Initialising precision and recall variables
prec_avg = 0
rec_avg = 0
for i in range(df_cm.shape[0]):
    # Calculating Precision and Recall
    prec = df_cm.iloc[i, i]/sum(df_cm.iloc[:, i])
    rec = df_cm.iloc[i, i]/sum(df_cm.iloc[i, :])
    
    #Totalling up precision and recall for each label
    prec_avg += prec
    rec_avg += rec
    print(dict_classes[i])
    print("Precision: ", prec)
    print("Recall: ", rec)
    print()
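# Macro-averaging precision and recall over the classes
prec_avg = prec_avg/df_cm.shape[0]
rec_avg = rec_avg/df_cm.shape[0]
# F1 score computed from the macro-averaged precision and recall
f1_avg = 2 * (prec_avg * rec_avg)/(prec_avg+rec_avg)
print("F1 Score of the entire model", f1_avg)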
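
The manual calculation above can be cross-checked against scikit-learn. The sketch below uses hypothetical labels and predictions (`y_true_demo`, `y_pred_demo`), not the notebook's results. Note that `f1_score(average="macro")` averages the per-class F1 scores, which in general differs from the harmonic mean of the macro-averaged precision and recall computed above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels and predictions (0=car, 1=bus, 2=van)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 0])

# Rows are actual classes, columns are predicted classes
cm = pd.crosstab(y_true_demo, y_pred_demo)
n_classes = cm.shape[0]

# Manual macro averages, mirroring the loop in the cell above
manual_prec = np.mean([cm.iloc[i, i] / cm.iloc[:, i].sum() for i in range(n_classes)])
manual_rec = np.mean([cm.iloc[i, i] / cm.iloc[i, :].sum() for i in range(n_classes)])
f1_from_averages = 2 * manual_prec * manual_rec / (manual_prec + manual_rec)

# scikit-learn equivalents
sk_prec = precision_score(y_true_demo, y_pred_demo, average="macro")
sk_rec = recall_score(y_true_demo, y_pred_demo, average="macro")
sk_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")

print(manual_prec, sk_prec)
print(manual_rec, sk_rec)
print(f1_from_averages, sk_f1)
```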

## **Results and Discussion:**

The number of features selected will generally be quite high. If that is the case we can use PCA or other dimension reduction techniques to reduce the cost of processing.

Using PCA will have another useful side effect, which is to remove correlation between the new features. Because many features are created from combinations of a few, most selected features will be correlated with others.

Please note that I kept things simple in this run and did not use PCA. I also did not do any further feature engineering or feature selection. This is because my sole aim is to check how the "Brute Force" method of creating new features performs. Results were quite impressive (in the high 90s). However, this dataset alone cannot decide the usefulness of the method. I will leave it to the reader to assess whether they want to use it.

# **CONCLUSION**

## **Limitations:** 

This dataset alone cannot decide the usefulness of the method. Further experimentation with the brute force method is necessary.

## **Recommendations:** 

It is also advisable to keep fewer feature combinations by selecting those whose median differences are the most significant. This can be achieved by sorting the analysis dataframe created during the run in ascending order of the sum of all p-values in each row, then selecting only the top 'n' feature combinations as independent variables.
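
A minimal sketch of this ranking, assuming a p-value table laid out like `df_analysis`. The table `df_p`, its values and the cut-off `n_top` are hypothetical.

```python
import pandas as pd

# Hypothetical p-value table in the same layout as df_analysis
df_p = pd.DataFrame(
    {"CarvsBus": [1e-8, 1e-3, 0.20, 1e-12],
     "VanvsBus": [1e-6, 1e-2, 1e-4, 1e-10],
     "CarvsVan": [1e-9, 4e-2, 1e-5, 1e-15]},
    index=["a/b", "a*b*", "b/c", "a*b*c"],
)

n_top = 2

# Keep only combinations that are significant for every pair of classes
significant = df_p[(df_p < 0.05).all(axis=1)]

# Rank by the row-wise sum of p-values (smallest first) and keep the top n
top_features = significant.sum(axis=1).sort_values().head(n_top).index.tolist()
print(top_features)
```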

## **Implications:**

The Brute Force Method can be used to combine continuous variables. Datasets dominated by continuous variables are the ideal candidates for checking whether the method improves performance. Functional knowledge of the domain in which the dataset resides should be used first, and the Brute Force Method should be used only where such knowledge is lacking. We should also consider applying it only when there is a chance that combining features will uncover more relevant information.