Individual challenge

*   Hackathon: HACK THAT STARTUP 3 
*   Autor: Francisco Manuel Mendoza Soto





In [None]:
# Imports
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler
from sklearn.utils import shuffle
import xgboost as xgb
from sklearn.metrics import f1_score
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier



# Data exploration

First, I load the dataset directly from GitHub. Then, I look at its characteristics and its first five rows.

In [None]:
url = "https://raw.githubusercontent.com/nuwe-io/HTS3-DataScience-Individual/main/asteroid_types.csv"
df = pd.read_csv(url)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Feature1        2000 non-null   float64
 1   Feature2        2000 non-null   float64
 2   Feature3        2000 non-null   float64
 3   Feature4        2000 non-null   float64
 4   Feature5        2000 non-null   float64
 5   Feature6        2000 non-null   float64
 6   Classification  2000 non-null   int64  
dtypes: float64(6), int64(1)
memory usage: 109.5 KB


Unnamed: 0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Classification
0,0.015724,-1.383637,0.821846,1.314887,-0.071768,2.342294,3
1,0.1467,0.778094,0.486682,-0.697206,0.047063,0.651647,2
2,0.012067,1.299313,0.047187,0.752812,0.898408,0.835497,3
3,-0.84786,0.262294,-0.162009,1.095407,0.549862,1.515246,3
4,1.286735,1.907767,-0.380351,-0.145083,0.11128,-0.076647,0


I can see that I am working with a dataset of size (2000, 7), which includes 6 features of floating type and one target variable of discrete type.

Next, I separate the features from the target variable and analyze them.

In [None]:
features = df.iloc[:,:6]
target = df.iloc[:,6]

features.describe()

Unnamed: 0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,0.001511,-0.01306,0.008257,0.707124,0.87352,0.008531
std,0.988149,0.982257,0.974719,1.191548,0.937349,1.31327
min,-3.424463,-3.671441,-3.196257,-4.686986,-3.479527,-3.420182
25%,-0.663821,-0.639263,-0.640503,0.053565,0.325542,-0.997193
50%,0.021062,-0.022352,0.003777,0.827173,0.901551,-0.415309
75%,0.671152,0.630646,0.665473,1.464697,1.470986,0.997545
max,3.39036,3.333229,3.375086,4.654234,4.631273,4.81934


In [None]:
target.value_counts()

3    1700
2     200
1      50
0      50
Name: Classification, dtype: int64

In [None]:
df.isna().sum()

Feature1          0
Feature2          0
Feature3          0
Feature4          0
Feature5          0
Feature6          0
Classification    0
dtype: int64

The exploratory study shows a few things. First, the features are not on a common scale: Feature4 and Feature5 have means around 0.7 to 0.9 while the others are centered near zero, and the standard deviations range from about 0.94 to 1.31.

Second, the target variable is heavily imbalanced. Eighty-five percent (85%) of the rows are classified as type three, ten percent (10%) as type two, and the remaining five percent is evenly split between types one and zero. This must be taken into account in the following steps.

Finally, there are no missing values in the data frame, so no data imputation is needed in the pre-processing step.
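To make the imbalance concrete, the class shares can be recomputed from the counts printed by `target.value_counts()`. This is a minimal sketch in plain Python; the dictionary simply copies those counts, and the variable names are only illustrative.

```python
# Class counts copied from the target.value_counts() output above.
class_counts = {3: 1700, 2: 200, 1: 50, 0: 50}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# Ratio between the most frequent and the least frequent class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in proportions.items():
    print(f"class {label}: {share:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.0f}:1")
```

With a 34:1 ratio between the largest and smallest classes, plain accuracy would be dominated by class three, which is one reason to prefer a macro-averaged metric.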

# Data pre-processing

For the pre-processing step, I apply maximum absolute scaling to the features, so that each one falls within the [-1, 1] range.

In [None]:
abs_scaler = MaxAbsScaler()
abs_scaler.fit(features)
processed_features = pd.DataFrame(abs_scaler.transform(features))
processed_features.describe()

Unnamed: 0,0,1,2,3,4,5
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,0.000441,-0.003557,0.002447,0.15087,0.188613,0.00177
std,0.288556,0.26754,0.288798,0.254225,0.202395,0.2725
min,-1.0,-1.0,-0.947015,-1.0,-0.751311,-0.709679
25%,-0.193847,-0.174118,-0.189774,0.011428,0.070292,-0.206915
50%,0.006151,-0.006088,0.001119,0.176483,0.194666,-0.086175
75%,0.195988,0.171771,0.197172,0.312503,0.31762,0.206988
max,0.990041,0.90788,1.0,0.993012,1.0,1.0
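Maximum absolute scaling divides every column by its largest absolute value, so the sign and the zero point of each feature are preserved. The sketch below reproduces that arithmetic in plain Python on the minimum and maximum of Feature1 and Feature4 taken from `features.describe()`; the helper `max_abs_scale` is a hypothetical name used only for illustration, not part of scikit-learn.

```python
def max_abs_scale(columns):
    """Scale each column by its largest absolute value (plain-Python sketch)."""
    scaled = []
    for column in columns:
        max_abs = max(abs(value) for value in column)
        # Guard against an all-zero column to avoid dividing by zero.
        divisor = max_abs if max_abs != 0 else 1.0
        scaled.append([value / divisor for value in column])
    return scaled

# Minimum and maximum of Feature1 and Feature4, from features.describe() above.
feature1 = [-3.424463, 3.39036]
feature4 = [-4.686986, 4.654234]

scaled_feature1, scaled_feature4 = max_abs_scale([feature1, feature4])
print(scaled_feature1)
print(scaled_feature4)
```

The scaled extremes agree with the `min` and `max` rows of the table above (for example, roughly 0.990041 for the maximum of the first feature).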


Then, I split the dataset into a training set and a testing set, using 25% of the data for testing. I usually split a dataset into three sets (training, validation and testing), but given the small number of rows classified as class zero or one, I have decided to use only two sets so that the testing set contains a meaningful number of rows of each class.

In [None]:
test_split = int(len(df.index) * 0.75)

processed_features, target = shuffle(processed_features, target, random_state = 0)

x_train = processed_features.iloc[:test_split]
x_test = processed_features.iloc[test_split:] 

y_train = target.iloc[:test_split]
y_test = target.iloc[test_split:]

In [None]:
y_train.value_counts()

3    1279
2     153
1      34
0      34
Name: Classification, dtype: int64

In [None]:
y_test.value_counts()

3    421
2     47
1     16
0     16
Name: Classification, dtype: int64

As can be seen, the testing set has at least 16 rows of each class, which would not have been possible if the dataset had been split into three sets. Having enough elements of each class is important for the evaluation to be meaningful.
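The per-class counts above depend on the random shuffle. A stratified split fixes the share of each class in both sets instead. The following is a sketch of that alternative in plain Python, not the method used in this notebook; `stratified_split` is a hypothetical helper and the labels are synthetic, with the same class counts as the dataset. scikit-learn offers the same idea through the `stratify` argument of `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) keeping class shares constant."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_indices, test_indices = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Rounded down, so small classes may lose a fraction of a row.
        n_test = int(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)

# Synthetic labels with the same class counts as the dataset.
labels = [3] * 1700 + [2] * 200 + [1] * 50 + [0] * 50
train_idx, test_idx = stratified_split(labels)

test_counts = {c: sum(1 for i in test_idx if labels[i] == c) for c in (0, 1, 2, 3)}
print(test_counts)  # {0: 12, 1: 12, 2: 50, 3: 425}
```

Note that stratification would give 12 test rows for each minority class rather than the 16 obtained here, in exchange for counts that no longer depend on the seed.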

Some common pre-processing steps are not necessary in this project, such as feature encoding (since all features are numerical) or data imputation (since there are no missing values).

Finally, I have decided that there is no need for feature selection, since the dataset is quite small and the xgBoost model that is going to be used is able to prioritize the features that are most related to the target anyway.

# Model training and parameter tuning

For this project, I have decided to test three different machine learning algorithms: extreme gradient boosting, support vector machine, and k-nearest neighbours.

For each algorithm, the hyperparameters are tuned in order to find the combination that produces the best results.

Each model is evaluated on the basis of its F1 score when classifying the test dataset, calculated using a macro average.
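The macro average computes one F1 score per class and then takes their unweighted mean, so a rare class counts as much as the dominant one. The sketch below works this out by hand in plain Python on a small made-up example; `macro_f1` is an illustrative helper that should agree with `f1_score(..., average='macro')`, and the label lists are invented for the demonstration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_class) / len(per_class)

# Made-up labels: the single class-0 row and one class-2 row are misclassified as 3.
y_true = [3, 3, 3, 3, 2, 2, 1, 0]
y_pred = [3, 3, 3, 3, 3, 2, 1, 3]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.3f}")                    # accuracy: 0.750
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")    # macro F1: 0.617
```

Missing the only class-0 row costs a full quarter of the macro score, while accuracy barely moves.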

## Training an xgBoost model

Extreme gradient boosting is an ensemble learning method that builds decision trees sequentially, with each new tree trained to correct the errors of the previous ones.

Four different hyper-parameters will be tuned: the learning rate (sometimes known as eta), the max depth of each decision tree, the percentage of features used by each tree, and the number of rounds of the training process.
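The hyper-parameters are tuned one at a time: each search keeps the winners of the previous searches fixed. The sketch below shows that procedure in isolation. Both `tune_sequentially` and `toy_score` are hypothetical; the toy scoring function merely stands in for "train a model and return its macro F1 on the test set", and the starting `max_depth` of 6 is XGBoost's default.

```python
def tune_sequentially(params, search_space, score_fn):
    """Tune one hyper-parameter at a time, keeping earlier winners fixed."""
    best_params = dict(params)
    for name, values in search_space.items():
        scores = {value: score_fn({**best_params, name: value}) for value in values}
        # On ties, max() keeps the first value tried.
        best_params[name] = max(scores, key=scores.get)
    return best_params

def toy_score(p):
    # Hypothetical stand-in for training and evaluating a model.
    return -abs(p["learning_rate"] - 0.05) - abs(p["max_depth"] - 5) / 10

search_space = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [4, 5, 6, 7, 8],
}
best = tune_sequentially({"max_depth": 6}, search_space, toy_score)
print(best)  # {'max_depth': 5, 'learning_rate': 0.05}
```

This needs far fewer training runs than a full grid, at the cost of possibly missing combinations where the parameters interact.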

In [None]:
dtrain = xgb.DMatrix(data=x_train,label=y_train)
dtest = xgb.DMatrix(data=x_test)

In [None]:
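# The number of boosting rounds is not a booster parameter;
# it is passed to xgb.train through num_boost_round.
param = {'objective': 'multi:softmax',  # predict class labels directly
         'num_class': 4}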

In [None]:
def tune_parameter(param_dict, parameter_to_calibrate, values_to_try, dtrain, dtest, y_train, y_test):
    """Train one model per candidate value and return the macro F1 scores on both sets."""
    training_scores = []
    testing_scores = []

    for value in values_to_try:
        # param_dict is modified in place and keeps the last value tried.
        param_dict[parameter_to_calibrate] = value
        model = xgb.train(param_dict, dtrain, num_boost_round=1000)

        train_predictions = model.predict(dtrain)
        training_scores.append(f1_score(y_train, train_predictions, average='macro'))

        predictions = model.predict(dtest)
        testing_scores.append(f1_score(y_test, predictions, average='macro'))

    return training_scores, testing_scores


### Tuning learning rate

In [None]:
training_scores, testing_scores = tune_parameter(param, 'learning_rate', [0.01, 0.05, 0.1, 0.2], dtrain, dtest, y_train, y_test)
pd.DataFrame({'Learning rate': [0.01, 0.05, 0.1, 0.2], 'Training score': training_scores, 'Testing scores': testing_scores})

Unnamed: 0,Learning rate,Training score,Testing scores
0,0.01,0.999083,0.793019
1,0.05,1.0,0.793019
2,0.1,1.0,0.766956
3,0.2,1.0,0.773698


In [None]:
param['learning_rate'] = 0.05

### Tuning max depth

In [None]:
training_scores, testing_scores = tune_parameter(param, 'max_depth', [4, 5, 6, 7, 8], dtrain, dtest, y_train, y_test)
pd.DataFrame({'Max depth': [4, 5, 6, 7, 8], 'Training score': training_scores, 'Testing scores': testing_scores})

Unnamed: 0,Max depth,Training score,Testing scores
0,4,1.0,0.754167
1,5,1.0,0.801809
2,6,1.0,0.793019
3,7,1.0,0.78123
4,8,1.0,0.78123


In [None]:
param['max_depth'] = 5

### Tuning percentage of features used by each tree

In [None]:
training_scores, testing_scores = tune_parameter(param, 'colsample_bytree', [0.5, 0.75, 0.9, 1], dtrain, dtest, y_train, y_test)
pd.DataFrame({'Percentage of features used by tree': [0.5, 0.75, 0.9, 1], 'Training score': training_scores, 'Testing scores': testing_scores})

Unnamed: 0,Percentage of features used by tree,Training score,Testing scores
0,0.5,1.0,0.647896
1,0.75,1.0,0.671455
2,0.9,1.0,0.737152
3,1.0,1.0,0.801809


In [None]:
param['colsample_bytree'] = 1

### Tuning number of rounds

In [None]:
def tune_nrounds(param_dict, values_to_try, dtrain, dtest, y_train, y_test):
    """Train one model per number of boosting rounds and return the macro F1 scores."""
    training_scores = []
    testing_scores = []

    for value in values_to_try:
        model = xgb.train(param_dict, dtrain, num_boost_round=value)

        train_predictions = model.predict(dtrain)
        training_scores.append(f1_score(y_train, train_predictions, average='macro'))

        predictions = model.predict(dtest)
        testing_scores.append(f1_score(y_test, predictions, average='macro'))

    return training_scores, testing_scores

In [None]:
training_scores, testing_scores = tune_nrounds(param, range(500, 2001, 500), dtrain, dtest, y_train, y_test)
pd.DataFrame({'nrounds': list(range(500, 2001, 500)), 'Training score': training_scores, 'Testing scores': testing_scores})

Unnamed: 0,nrounds,Training score,Testing scores
0,500,1.0,0.769224
1,1000,1.0,0.801809
2,1500,1.0,0.799634
3,2000,1.0,0.799634


After the tuning process, the best xgBoost model has the following parameters:

*   Learning rate: 0.05
*   Max depth: 5
*   Percentage of features used by each tree: 1.0
*   Number of rounds: 1000

And the final results obtained with this model are:

*   F1 score at training: 1.000000
*   F1 score at testing: 0.801809


## Training an SVM model

After the xgBoost model, a support vector machine model will be trained.

In this case, two different kernels will be tested: linear kernel and radial basis function kernel.

For the linear kernel, only one hyperparameter will be tuned: C.

For the RBF kernel, two hyperparameters will be tuned: C and gamma.
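The two searches described above amount to enumerating every combination inside each block of a parameter grid. A possible way to generate those combinations is sketched below with `itertools.product`. The helper `expand_grid` is hypothetical, and wrapping the kernel name in a list is an assumption made here so that every entry can be treated uniformly; scikit-learn's `ParameterGrid` provides similar behaviour.

```python
from itertools import product

# Same candidate values as in this section, with the kernel wrapped in a list.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
    {"kernel": ["rbf"], "C": [1e5, 1e6, 1e7], "gamma": [0.1, 0.01, 0.001]},
]

def expand_grid(grid):
    """Return one dict per hyper-parameter combination in every block."""
    combinations = []
    for block in grid:
        names = list(block)
        for values in product(*(block[name] for name in names)):
            combinations.append(dict(zip(names, values)))
    return combinations

candidates = expand_grid(param_grid)
print(len(candidates))  # 13
print(candidates[0])    # {'kernel': 'linear', 'C': 1}
```

Each dictionary could then be unpacked into the estimator, for example `svm.SVC(**candidate)`, which also guarantees that the kernel named in the grid is the one actually fitted.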

In [None]:
def tune_svm(param_grid, x_train, x_test, y_train, y_test):
  results = []
  for param_dict in param_grid:
    if param_dict['kernel'] == 'linear':
      for c_value in param_dict['C']:
        # Pass the kernel explicitly; SVC defaults to 'rbf' otherwise.
        svm_model = svm.SVC(kernel='linear', C=c_value)
        svm_model.fit(x_train, y_train)
       
        train_predictions = svm_model.predict(x_train)
        train_f_score = f1_score(y_train, train_predictions, average='macro')

        predictions = svm_model.predict(x_test)
        f_score = f1_score(y_test, predictions, average='macro')
        
        results.append({'kernel' : 'linear', 'C': c_value, 'Training score' : train_f_score, 'Testing score' : f_score})

    elif param_dict['kernel'] == 'rbf':
      for c_value in param_dict['C']:
        for gamma_value in param_dict['gamma']:
          svm_model = svm.SVC(C = c_value, gamma = gamma_value)
          svm_model.fit(x_train, y_train)
        
          train_predictions = svm_model.predict(x_train)
          train_f_score = f1_score(y_train, train_predictions, average='macro')

          predictions = svm_model.predict(x_test)
          f_score = f1_score(y_test, predictions, average='macro')
          
          results.append({'kernel' : 'rbf', 'C': c_value, 'gamma': gamma_value, 'Training score' : train_f_score, 'Testing score' : f_score})
    
  return results

In [None]:
param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': 'linear'},
  {'C': [1e5, 1e6, 1e7], 'gamma': [0.1, 0.01, 0.001], 'kernel': 'rbf'},
 ]
results = tune_svm(param_grid, x_train, x_test, y_train, y_test)

In [None]:
pd.DataFrame(results).iloc[:, [0, 1, 4, 2, 3]]

Unnamed: 0,kernel,C,gamma,Training score,Testing score
0,linear,1.0,,0.650875,0.547548
1,linear,10.0,,0.838173,0.684492
2,linear,100.0,,0.945956,0.636462
3,linear,1000.0,,0.988725,0.598043
4,rbf,100000.0,0.1,0.86457,0.68286
5,rbf,100000.0,0.01,0.720085,0.672591
6,rbf,100000.0,0.001,0.536843,0.563378
7,rbf,1000000.0,0.1,0.928805,0.645214
8,rbf,1000000.0,0.01,0.746826,0.684161
9,rbf,1000000.0,0.001,0.605998,0.643995


The best results were obtained with the following hyperparameters:

* Kernel: Radial basis function
* C: 1e7
* Gamma: 0.010

And the final results obtained with this model are:

*   F1 score at training: 0.748360
*   F1 score at testing: 0.734410

## Training a k-NN model

Finally, a k-nearest neighbors model is trained.

Only one hyperparameter is tuned for this algorithm, which is the number of neighbors.
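As a reminder of what the number of neighbors controls, the sketch below implements the majority vote of k-NN in plain Python on a tiny made-up dataset. `knn_predict` and the sample points are illustrative only; they are not the classifier or the data used in this notebook.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k):
    """Predict the most common label among the k nearest training points."""
    # Sort the training rows by Euclidean distance to the query point.
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up data: two rows of a minority class (2) and three of a majority class (3).
train_points = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
train_labels = [2, 2, 3, 3, 3]

print(knn_predict(train_points, train_labels, (0.2, 0.1), k=1))  # 2
print(knn_predict(train_points, train_labels, (0.2, 0.1), k=5))  # 3
```

With a large k the majority class outvotes nearby minority rows, which is one plausible reason why larger values of `n_neighbors` can hurt the macro F1 on an imbalanced dataset.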

In [None]:
def tune_knn(n_neighbors_values, x_train, x_test, y_train, y_test):
  results = []

  for value in n_neighbors_values:
    knn_model = KNeighborsClassifier(n_neighbors=value)
    knn_model.fit(x_train, y_train) 

    train_predictions = knn_model.predict(x_train)
    train_f_score = f1_score(y_train, train_predictions, average='macro')

    predictions = knn_model.predict(x_test)
    f_score = f1_score(y_test, predictions, average='macro')
  
    results.append({'n_neighbors' : value, 'Training score' : train_f_score, 'Testing score' : f_score})

  return results

In [None]:
results = tune_knn([1, 2, 3, 4, 5], x_train, x_test, y_train, y_test)
pd.DataFrame(results)

Unnamed: 0,n_neighbors,Training score,Testing score
0,1,1.0,0.61661
1,2,0.871035,0.683894
2,3,0.761917,0.623795
3,4,0.72403,0.604297
4,5,0.665124,0.524874


It can be seen that the best-performing model uses n_neighbors = 2, with a final f-score of 0.871035 on training and 0.683894 on testing.

# Evaluation and conclusion

At this point, all the models have been trained and tuned.

To put their performance in context, I compare all the trained models with a dummy classifier that always predicts the most common class, which for this dataset is class three (85% of the rows).

In [None]:
train_f_score = f1_score(y_train, [3 for i in range(len(y_train))], average='macro')

f_score = f1_score(y_test, [3 for i in range(len(y_test))], average='macro')

print("Dummy classifier, f-score at training: %f, f-score at testing: %f"%(train_f_score, f_score))

Dummy classifier, f-score at training: 0.230119, f-score at testing: 0.228556
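These two values can be checked by hand. When every row is predicted as the majority class, that class has recall 1 and precision equal to its share of the rows, while every other class has an F1 of 0. The sketch below redoes the arithmetic in plain Python from the class counts of the training and testing sets; `majority_macro_f1` is a hypothetical helper written only for this check.

```python
def majority_macro_f1(class_counts, majority_label):
    """Macro F1 of a classifier that always predicts majority_label."""
    total = sum(class_counts.values())
    tp = class_counts[majority_label]
    # FP = total - tp and FN = 0, so F1 = 2*tp / (2*tp + FP + FN) = 2*tp / (tp + total).
    majority_f1 = 2 * tp / (tp + total)
    # The remaining classes are never predicted, so each contributes an F1 of 0.
    return majority_f1 / len(class_counts)

# Counts copied from y_train.value_counts() and y_test.value_counts().
train_counts = {3: 1279, 2: 153, 1: 34, 0: 34}
test_counts = {3: 421, 2: 47, 1: 16, 0: 16}

print(f"{majority_macro_f1(train_counts, 3):.6f}")  # 0.230119
print(f"{majority_macro_f1(test_counts, 3):.6f}")   # 0.228556
```

The dummy baseline therefore sits near 0.23 under the macro average, even though the same classifier would reach roughly 85% accuracy.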


In [None]:
final_results = [{'model': 'xgBoost', 'f-score in training': 1.000000, 'f-score in testing': 0.801809},
                 {'model': 'SVM', 'f-score in training': 0.748360, 'f-score in testing': 0.734410},
                 {'model': 'k-NN', 'f-score in training': 0.871035, 'f-score in testing': 0.683894},
                 {'model': 'Dummy', 'f-score in training': 0.230119, 'f-score in testing': 0.228556}
]
pd.DataFrame(final_results)

Unnamed: 0,model,f-score in training,f-score in testing
0,xgBoost,1.0,0.801809
1,SVM,0.74836,0.73441
2,k-NN,0.871035,0.683894
3,Dummy,0.230119,0.228556


After the tuning process, the best of the trained models is the xgBoost model, which, besides correctly classifying every element of the training set, **has a final f-score of 0.801809** when evaluated on the test set.

All the trained models also score significantly higher than the dummy model, which only reaches a final f-score of 0.228556 on the test set.