# **AI Project 1**
## **Surface Type Classification**

---

### Richanshu Jha - rj1469

## Importing Modules
* `pandas` and `numpy`: Pandas DataFrames are used to store the datasets. NumPy arrays are used alongside them because they support convenient array operations and work well with pandas DataFrames.
* `matplotlib.pyplot`: Used for plotting graphs.
* `sklearn.preprocessing`: `LabelEncoder` and `StandardScaler` are used for preprocessing the data. Their use is documented in detail at their respective cells.
* `sklearn.metrics`: Its `accuracy_score` is used to compute accuracy. While this could easily be done with a custom accuracy function, I thought it would be better to use a standard function.
* `seaborn` and `sklearn.metrics.confusion_matrix`: These are used twice in this notebook to create confusion matrices to better understand the testing results.
* `sklearn.neural_network.MLPClassifier` and `sklearn.ensemble.RandomForestClassifier`: These are the two classifiers that I have worked on in this notebook.
* `sklearn.model_selection.StratifiedKFold`: Used for cross-validation of the final model.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from sklearn import metrics

import seaborn as sn
from sklearn.metrics import confusion_matrix

from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold

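## Importing the data from Kaggle
* The dataset is loaded from Kaggle. `X_train` and `y_train` have been split into training and testing data according to the instructions provided in Slack: the last 20 series (20 × 128 rows of `X_train` and the last 20 rows of `y_train`) are held out for testing.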

In [None]:
X_train = pd.read_csv('/kaggle/input/career-con-2019/X_train.csv')
y_train = pd.read_csv('/kaggle/input/career-con-2019/y_train.csv')

# split X_train
samples = 20
time_series = 128

start_x = X_train.shape[0] - samples*time_series
X_train_new, X_test_new = X_train.iloc[:start_x], X_train.iloc[start_x:]

# split y_train
start_y = y_train.shape[0] - samples
y_train_new, y_test_new = y_train.iloc[:start_y], y_train.iloc[start_y:]

X_train.head()

## Approach
There are two ways of approaching this ML problem. Each series consists of 128 time-series data points. One option is to disregard the time-series structure entirely and train the model on individual rows without altering the size of the dataset. The other is to exploit the fact that the measurements within a series are fairly similar and collapse each series into a single row using aggregations. This allows a large number of relevant features to be generated if suitable aggregations are chosen. I have taken the second approach.
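
As a rough illustration of the aggregation approach, the sketch below builds a small synthetic frame and shows how grouping by `series_id` collapses 128 rows per series into one feature row. The column name `signal` and the sizes are made up for this example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sensor data: 3 series with 128 time steps each.
# `signal` is a hypothetical column used only for this illustration.
n_series, n_steps = 3, 128
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    'series_id': np.repeat(np.arange(n_series), n_steps),
    'signal': rng.normal(size=n_series * n_steps),
})

# One row per series after aggregation, one column per statistic.
aggregated = raw.groupby('series_id')['signal'].agg(['mean', 'std', 'min', 'max'])

print(raw.shape)         # (384, 2)
print(aggregated.shape)  # (3, 4)
```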

## Feature Engineering and Cleaning Data
The following has been done for engineering features for the model:
 1. Angular velocity and linear acceleration are given along X, Y, and Z. New features `velocityMagnitude` and `accelerationMagnitude` are created from these by taking the magnitude of their resultant vector.
 2. In order to create single values for each time series, the features have been grouped by `series_id` and aggregates of the measurements have been computed.
 3. Various statistical aggregates have been taken:
     * The mean, median, min, and max values, together with the 0.33, 0.66, and 0.99 quantiles, are a good representation of the features.
     * Standard deviation and 'max - min' (named `variation`) are useful metrics that should be able to separate surfaces based on their roughness.
     * Since this is time series data, it is essential to consider the differences between pairs of subsequent measurements. To do this, the mean of the absolute differences between consecutive measurements has been taken (`AbsMeanDelta`).
 4. After the training data is engineered, its number of rows equals the number of `series_id` values.
 5. Since the 128 rows of each time series have effectively been merged into a single row, there is no need to expand the test labels by a factor of 128. This would have been necessary had the first approach been followed.
 6. Engineered features that had very low importance scores have been removed. The final feature engineering results in 110 total features.
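
To make the less familiar aggregates concrete, here is a small worked example on hand-picked numbers (the values are hypothetical and chosen only so the arithmetic is easy to follow):

```python
import numpy as np

# Hypothetical readings for one axis of one series (values chosen by hand).
x = np.array([1.0, 3.0, 2.0, 6.0])

variation = x.max() - x.min()                 # 6 - 1 = 5
abs_mean_delta = np.mean(np.abs(np.diff(x)))  # mean(|2|, |-1|, |4|) = 7/3

# Magnitude of a 3-component vector, as used for the velocity and
# acceleration magnitude features.
vec = np.array([3.0, 4.0, 12.0])
magnitude = np.sqrt(np.sum(vec ** 2))         # sqrt(9 + 16 + 144) = 13

print(variation, abs_mean_delta, magnitude)
```

A rough surface should produce larger jumps between consecutive readings, which raises `AbsMeanDelta` even when the mean of the signal stays the same.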

In [None]:
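# Copy the slices so that adding columns later does not modify (or warn about)
# views of the original X_train / y_train DataFrames
xTrain = X_train_new.copy()
yTrain = y_train_new.copy()

xTest = X_test_new.copy()
yTest = y_test_new.copy()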
    
def engineerFeats(inputDf):
    df = pd.DataFrame()
    
    featureCols = ['orientation_X', 'orientation_Y', 'orientation_Z', \
                   'angular_velocity_X', 'angular_velocity_Y', 'angular_velocity_Z',   \
                   'linear_acceleration_X', 'linear_acceleration_Y', 'linear_acceleration_Z']
    
    #If we are not considering orientation features
    #featureCols = ['angular_velocity_X', 'angular_velocity_Y', 'angular_velocity_Z', \
    #               'linear_acceleration_X', 'linear_acceleration_Y', 'linear_acceleration_Z']
    
    inputDf['velocityMagnitude'] = ((inputDf['angular_velocity_X']**2) + (inputDf['angular_velocity_Y']**2) + (inputDf['angular_velocity_Z']**2)) ** 0.5
    featureCols.append('velocityMagnitude')
    
    inputDf['accelerationMagnitude'] = ((inputDf['linear_acceleration_X']**2) + (inputDf['linear_acceleration_Y']**2) + (inputDf['linear_acceleration_Z']**2)) ** 0.5
    featureCols.append('accelerationMagnitude')
        
    
    for col in featureCols:
        df[col + 'Min'] = inputDf.groupby('series_id')[col].min()
        df[col + 'Max'] = inputDf.groupby('series_id')[col].max()
        df[col + 'Mean'] = inputDf.groupby('series_id')[col].mean()
        df[col + 'Median'] = inputDf.groupby('series_id')[col].median()
        df[col + 'StdDev'] = inputDf.groupby('series_id')[col].std()
        df[col + 'q33'] = inputDf.groupby('series_id')[col].quantile(0.33)
        df[col + 'q66'] = inputDf.groupby('series_id')[col].quantile(0.66)
        df[col + 'q99'] = inputDf.groupby('series_id')[col].quantile(0.99)
        df[col + 'variation'] = inputDf.groupby('series_id')[col].max() \
                                - inputDf.groupby('series_id')[col].min()
        df[col + 'AbsMeanDelta'] = inputDf.groupby(['series_id'])[col] \
                                .apply(lambda x: np.mean(np.abs(np.diff(x))))
        
    # Cleaning the data: replace missing and infinite values with 0
    df = df.fillna(0).replace([np.inf, -np.inf], 0)

    return df

print('Engineering features for Training dataset')
xTrain = engineerFeats(xTrain)

print('Engineering features for Testing dataset')
xTest = engineerFeats(xTest)

print('xTrain shape -> ',xTrain.shape)

print('xTest shape -> ',xTest.shape)
#print(xTest.head())

xTrain.head()

## Encoding Labels
* Here, `sklearn.preprocessing.LabelEncoder` is used to encode the categorical surface outputs. It maps each unique surface name to an integer from 0 to n-1 (n being the number of unique surfaces).
* Label encoding is required so that the classifier receives discrete integer class labels (0 to n-1) in order to classify properly.
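
A minimal sketch of the mapping on a handful of surface names (the list here is illustrative, not the full set of classes in the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical surface names, used only to show the mapping.
surfaces = ['wood', 'carpet', 'tiled', 'carpet', 'wood']

encoder = LabelEncoder()
labels = encoder.fit_transform(surfaces)

# Classes are stored in sorted order, so carpet -> 0, tiled -> 1, wood -> 2.
print(encoder.classes_.tolist())   # ['carpet', 'tiled', 'wood']
print(labels.tolist())             # [2, 0, 1, 0, 2]

# inverse_transform recovers the original names from the integers.
decoded = encoder.inverse_transform([0, 2]).tolist()
print(decoded)                     # ['carpet', 'wood']
```

Fitting the encoder on the training labels only, then reusing it for the test labels, keeps the integer assignment identical across both sets.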

In [None]:
labelEncoder = LabelEncoder()

labelEncoder.fit(yTrain['surface'])

yTrain['label'] = labelEncoder.transform(yTrain['surface'])
yTest['label'] = labelEncoder.transform(yTest['surface'])

yTrain.head()

## Reviewing the Train and Test datasets
* This cell is primarily used for debugging.
* Note: The shape of the output dataframe is `(n,4)` because it stores `series_id`, `group_id`, `surface`, and the actual label: `label`. `DataFrame.label` will be passed as a numpy array to the model.

In [None]:
print('TRAINING FEATURES')
print('SHAPE: ',xTrain.shape)
#print(xTrain.head(), end='\n---------------\n')

print('\nTRAINING LABELS')
print('SHAPE: ',yTrain.shape)
#print(yTrain.head(), end='\n---------------\n')

print('\nTESTING FEATURES')
print('SHAPE: ',xTest.shape)
#print(xTest.head(), end='\n---------------\n')

print('\nTESTING LABELS')
print('SHAPE: ',yTest.shape)
#print(yTest.head(), end='\n---------------\n')

# Models
The models considered were `Logistic Regression`, `Support Vector Machines`, `Decision Trees`, `Random Forests`, `Deep Neural Networks` and `Convolutional Neural Networks`. The models that were implemented were optimized and tested; Logistic Regression and Support Vector Machines were eliminated first. I tried building a convolutional neural network using the `tflearn` module for TensorFlow, but Kaggle raises import errors when `import tflearn` is called, possibly due to version issues.

# Neural Network
Scikit-learn's Multi-Layer Perceptron was used for the neural network implementation. Initially, many models were built and tested on the same data. The parameters were kept more or less constant and only the shape of the network was altered to find the one giving the best relative results. `max_iter` was initially set to 100 so that testing many different and complex networks remained feasible. The final architecture has 2 to 3 hidden layers, starting with around 200 neurons in the first hidden layer and reducing by around half in each subsequent layer. `max_iter` was gradually increased in subsequent tests while tuning the hyperparameters with small variations.
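
The shape search described above could look roughly like the following sketch. It runs on synthetic data from `make_classification`, and the candidate shapes, sample sizes, and `random_state` values are assumptions made for illustration; they are not the configurations actually tried in this project.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the engineered features (an assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            stratify=y, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_val = scaler.transform(X_tr), scaler.transform(X_val)

# Candidate shapes: each layer has about half the neurons of the previous one.
candidate_shapes = [(40, 20), (40, 20, 10), (80, 40, 20)]

results = {}
with warnings.catch_warnings():
    # A low max_iter keeps the search cheap but may stop before convergence.
    warnings.simplefilter('ignore', category=ConvergenceWarning)
    for shape in candidate_shapes:
        clf = MLPClassifier(hidden_layer_sizes=shape, activation='tanh',
                            solver='adam', max_iter=100, random_state=0)
        clf.fit(X_tr, y_tr)
        results[shape] = clf.score(X_val, y_val)

for shape, acc in results.items():
    print(shape, round(acc, 3))
```

Holding everything but the shape fixed, including the random seed, makes the validation scores comparable between candidates.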

## Scaling
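Neural networks train more reliably when the input features are standardized to zero mean and unit variance. `sklearn.preprocessing.StandardScaler` is used for this. The scaler is fitted on the training data only and then applied to both the training and test sets.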
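
As a small illustration on toy numbers, `StandardScaler` computes $z = (x - \mu) / \sigma$ per feature, where $\mu$ and $\sigma$ come from the data passed to `fit`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: the test row is scaled with statistics learned from the
# training rows only.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Equivalent manual computation, using the population std (ddof=0).
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(), train_scaled.std())
print(np.allclose(test_scaled, manual))
```

Reusing the training statistics for the test set avoids leaking information about the test data into preprocessing.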

In [None]:
scaler = StandardScaler()

# FEATURE SCALING
scaler.fit(xTrain)
xTrainScaled = scaler.transform(xTrain)
xTestScaled = scaler.transform(xTest)

## Training

In [None]:
numLabels = len(yTrain.groupby('label').count())
print('Number of Labels: ', numLabels)

bSize = int(xTrain.shape[0]/6)
print('Batch Size: ', bSize)

hiddenLayersShape = (int(xTrain.shape[1]*2), int(xTrain.shape[1]*1), int(xTrain.shape[1]*0.5))
#hiddenLayersShape = (int(numLabels*numLabels), int(numLabels*numLabels/2))
print('Training Model with NN Hidden Layer Shape: ',hiddenLayersShape)

model = MLPClassifier(solver='adam', n_iter_no_change = 35, batch_size = bSize, max_iter = 1000 , hidden_layer_sizes=(hiddenLayersShape), activation = 'tanh', verbose = True)
model.fit(xTrainScaled, yTrain['label'])
print('Model Trained')

In [None]:
trueArr = np.array(yTrain['label'])
predArr = model.predict(xTrainScaled)

trainingAccuracy = metrics.accuracy_score(trueArr, predArr)
print('Training Accuracy: ',trainingAccuracy)

## Testing Accuracy 
This model was tested on the 20-sample held-out set (`y_test_new`) and showed high variance between runs, with scores ranging from `0.85` to `0.95`. The corresponding confusion matrix has been plotted.
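
When reading a confusion matrix, note that scikit-learn's `confusion_matrix` takes the true labels first and the predictions second, so rows index the true class and columns the predicted class. The labels below are made up by hand and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for three classes (hypothetical).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# labels= fixes the matrix size even if a class is missing from a small
# test set.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm.tolist())  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]

# Accuracy is the diagonal sum divided by the total count.
acc = accuracy_score(y_true, y_pred)
print(acc, cm.trace() / cm.sum())
```

With only 20 test samples, a single misclassified series changes the accuracy by 0.05, which contributes to the spread in scores.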

In [None]:
trueArr = np.array(yTest['label'])
predArr = model.predict(xTestScaled)

testingAccuracy = metrics.accuracy_score(trueArr, predArr)
'''
print('Confusion Matrix: \n')
print(confusion_matrix(trueArr, predArr))
print('\n')
print('Number of test samples: ',len(trueArr))
'''
print('Testing Accuracy', testingAccuracy)

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
dfConfusionMat = pd.DataFrame(confusion_matrix(trueArr, predArr))
plt.figure(figsize = (10,7))
sn.heatmap(dfConfusionMat, annot=True)

# Random Forest Classifier
Scikit-learn's `RandomForestClassifier` has been used to obtain a more reliable classification. It has consistently shown low variance in the training and testing accuracies, with a smaller gap between training and testing accuracy than the neural network.

## Training the Model

In [None]:
#model = RandomForestClassifier(n_estimators = 100, verbose = 2, max_depth = 25)
model = RandomForestClassifier(n_estimators = 600, max_depth = 25 ,n_jobs = -1, verbose = 2)

model.fit(xTrain, yTrain['label'])
print('Model Trained')

## Plotting important features
The feature importances from this model have been the main metric for selecting and engineering the features at the start of this notebook.
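
A hedged sketch of how importance scores can drive feature pruning is shown below. It uses synthetic data, and the cut-off (average importance) is a hypothetical choice for illustration, not the criterion used in this project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the 3 informative columns come first
# and the remaining columns are noise.
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X = pd.DataFrame(X, columns=[f'feat{i}' for i in range(X.shape[1])])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are non-negative and sum to 1 across all features.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Hypothetical cut-off: keep features with at least average importance.
threshold = 1.0 / X.shape[1]
kept = importances[importances >= threshold].index.tolist()
X_reduced = X[kept]
print(kept)
```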

In [None]:
### PLOTTING
df = pd.DataFrame()
df['x'] = [col for col in xTrain]
df['y'] = model.feature_importances_
df = df.sort_values(by = ['y'])

fig=plt.figure(figsize=(18, df.shape[0]/5))
plt.xticks(fontsize = 10)
plt.yticks(fontsize = 10)
plt.barh(df['x'],df['y'],0.8)
plt.show()

## Testing Accuracy
This model was tested on the 20-sample held-out set (`y_test_new`) and produces consistent scores, generally `0.95` on the given test data. The two models were also compared on larger data samples. The neural network sometimes achieves a better testing accuracy than the Random Forest; however, its accuracy varies unreliably between runs even with the same hyperparameters. For this reason, the Random Forest model has been selected for this project.

The confusion matrix corresponding to the 20-sample test has been plotted.

In [None]:
trueArr = np.array(yTest['label'])
predArr = model.predict(xTest)

testingAccuracy = metrics.accuracy_score(trueArr, predArr)

'''
print('Confusion Matrix: \n')
print(confusion_matrix(trueArr, predArr))
print('\n')
print('Number of test samples: ',len(trueArr))
'''
print('Testing Accuracy', testingAccuracy)

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
dfConfusionMat = pd.DataFrame(confusion_matrix(trueArr, predArr))
plt.figure(figsize = (10,7))
sn.heatmap(dfConfusionMat, annot=True)

## Stratified K Fold Cross validation
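K-Fold cross-validation splits the dataset into k folds. In each iteration, one fold is held out for validation and the model is trained on the remaining k-1 folds, so every sample is used for validation exactly once. Stratified K-Fold additionally tries to ensure that the ratio of classes in each fold matches the ratio in the full dataset. Averaging over the folds gives a much less biased estimate of the model's accuracy than a single small test set.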
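
The stratification can be seen on a toy label vector (an imbalanced two-class example made up for this sketch): every validation fold keeps the same 4:1 class ratio as the full set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 80 samples of class 0 and 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # features are irrelevant to how the split is made

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for fold, (train_idx, val_idx) in enumerate(folds.split(X, y)):
    counts = np.bincount(y[val_idx], minlength=2)
    fold_counts.append(counts.tolist())
    # Every fold holds 16 samples of class 0 and 4 of class 1.
    print(fold, counts.tolist())
```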

#### **PLEASE NOTE**: This algorithm for Stratified K-Fold cross-validation, while present in part or in full in most notebooks, has been directly adapted from https://www.kaggle.com/gpreda/robots-need-help

In [None]:
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1234)

# Size the probability array from the data instead of hard-coding the class count
numClasses = yTrain['label'].nunique()
sub_preds_rf = np.zeros((xTest.shape[0], numClasses))
oof_preds_rf = np.zeros((xTrain.shape[0]))
score = 0
for fold_, (trn_idx, val_idx) in enumerate(folds.split(xTrain, yTrain['label'])):
    #clf =  RandomForestClassifier(n_estimators = 500, n_jobs = -1)
    clf = RandomForestClassifier(n_estimators = 200 ,n_jobs = -1, verbose = 0)
    # split() yields positional indices, so use iloc for both features and labels
    clf.fit(xTrain.iloc[trn_idx], yTrain['label'].iloc[trn_idx])
    oof_preds_rf[val_idx] = clf.predict(xTrain.iloc[val_idx])
    sub_preds_rf += clf.predict_proba(xTest) / folds.n_splits
    # Score each validation fold once and reuse the value
    foldScore = clf.score(xTrain.iloc[val_idx], yTrain['label'].iloc[val_idx])
    score += foldScore
    print('Fold: {} score: {}'.format(fold_, foldScore))
    
print('\nAvg Accuracy', score / folds.n_splits)

## Conclusion
To conclude, two models have been developed for this project. The neural network model could be optimized further, but as it currently stands, the Random Forest provides a better classification, and thus its scores are the ones reported.

## Final Scores

* With Stratified K-Fold cross-validation: ~88%
* Accuracy on the held-out sample `y_test_new`: 95% (19/20 samples)