
- Student name: Nguyen Quang Linh
- Student ID: 20162439
- Class: ICT.02 K61


**Task:**


- Formulate the learning problem
- Choose a way to deal with missing values
- Choose a Machine Learning model
- Do training / evaluation
- Make predictions for 10 testing samples

# Gathering Data & Data Preprocessing

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier

In [2]:
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
warnings.filterwarnings(action='ignore',category=FutureWarning)

In [3]:
# Load train & test files
training_file = '1-training-data.csv'
test_file = '20162439-test.csv'

# Read files
training_data = pd.read_csv (training_file)
test_data = pd.read_csv (test_file , header = None)

# Shape of training data
print (training_data.shape)

(1000, 9)


In [4]:
# Analyze the full training data 
print (training_data)

              A1           A2  A3            A4 A5           A6 A7   A8  y
0              ?  3.683393747   ?  -0.634417312  1  0.409611744  7   30  5
1              ?            ?  60   1.573617763  0  0.639813727  7   30  5
2              ?  3.096229013  67   0.249917163  0  0.089343498  ?   80  3
3    2.887677333  3.870994828  68  -1.347755064  ?  1.276985638  ?   60  5
4    2.731273335  3.945024383  79   1.967319655  1  2.487831092  ?  100  4
..           ...          ...  ..           ... ..          ... ..  ... ..
995  3.125917333  3.245429971  68  -0.142997786  ?  2.540562226  7    ?  4
996  2.566080318  3.567651314   ?             ?  1  2.414309121  7   70  4
997  1.783414232            ?   ?   0.411349173  0  1.234719984  7   60  3
998  1.633291266  4.130596422   ?   1.938253526  ?  -1.38920108  6    0  4
999            ?   4.13807089  65   2.107206276  0            ?  6    0  3

[1000 rows x 9 columns]


The data has 1000 rows and 9 columns: eight attributes (A1 to A8) and a target `y`. The target takes discrete integer values, so this is a classification problem. Some attribute values are missing and marked with `?`, so we need to convert them to NaN (Not a Number) before going further.

In [5]:
# Convert every column to numeric; non-numeric entries such as '?' become NaN
for column in training_data:
    training_data[column] = pd.to_numeric(training_data[column], errors='coerce')
training_data.head(10)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,y
0,,3.683394,,-0.634417,1.0,0.409612,7.0,30.0,5
1,,,60.0,1.573618,0.0,0.639814,7.0,30.0,5
2,,3.096229,67.0,0.249917,0.0,0.089343,,80.0,3
3,2.887677,3.870995,68.0,-1.347755,,1.276986,,60.0,5
4,2.731273,3.945024,79.0,1.96732,1.0,2.487831,,100.0,4
5,0.864607,,61.0,,0.0,-1.262565,7.0,40.0,4
6,,3.319786,58.0,-1.481795,0.0,-1.47069,,0.0,1
7,,3.628115,60.0,1.48731,0.0,,7.0,30.0,5
8,1.102136,3.386358,,1.644159,0.0,,7.0,25.0,4
9,1.250055,4.363551,66.0,2.327634,,-1.530729,7.0,,4
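As an alternative to converting column by column, pandas can treat `?` as missing while reading the file. Below is a minimal sketch, assuming a small in-memory CSV; the `sample_csv` string is made up for illustration and is not the real training file.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the layout of the training file.
sample_csv = """A1,A2,y
?,3.68,5
2.88,?,3
1.10,3.38,4
"""

# na_values="?" tells read_csv to parse every "?" as NaN,
# so the affected columns are numeric from the start.
sample = pd.read_csv(io.StringIO(sample_csv), na_values="?")

print(sample.dtypes)
print(sample.isna().sum())
```

With this option, the separate `pd.to_numeric` pass would probably not be needed, as long as `?` is the only non-numeric marker in the file.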


In [6]:
# Count null values in each column
null_data = training_data.isnull()
for column in training_data:
    print (column)
    print (null_data[column].value_counts())

A1
False    751
True     249
Name: A1, dtype: int64
A2
False    782
True     218
Name: A2, dtype: int64
A3
False    762
True     238
Name: A3, dtype: int64
A4
False    785
True     215
Name: A4, dtype: int64
A5
False    761
True     239
Name: A5, dtype: int64
A6
False    764
True     236
Name: A6, dtype: int64
A7
False    771
True     229
Name: A7, dtype: int64
A8
False    743
True     257
Name: A8, dtype: int64
y
False    1000
Name: y, dtype: int64
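The same counts can be obtained in a more compact form. This is a small sketch on a made-up frame (`toy` is hypothetical; in the notebook it would be `training_data`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real notebook would use training_data instead.
toy = pd.DataFrame({
    "A1": [np.nan, 2.9, 1.1, np.nan],
    "A2": [3.7, np.nan, 3.4, 4.1],
    "y": [5, 3, 4, 4],
})

missing_count = toy.isna().sum()    # number of NaN entries per column
missing_share = toy.isna().mean()   # fraction of NaN entries per column

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Seeing the share next to the count makes it easier to judge how much of each column would be filled in by imputation.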


In [7]:
# Generate descriptive statistics
training_data.describe()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,y
count,751.0,782.0,762.0,785.0,761.0,764.0,771.0,743.0,1000.0
mean,1.378244,3.618547,64.69685,-0.002154,0.257556,-0.170783,6.73022,25.881561,3.351
std,1.300856,0.473805,7.254793,1.452304,0.437576,1.41462,0.683655,28.263956,1.289753
min,-1.448874,2.484787,41.0,-1.722658,0.0,-1.642174,6.0,0.0,1.0
25%,0.511147,3.278746,61.0,-1.382425,0.0,-1.372719,6.0,0.0,2.0
50%,1.417296,3.596953,65.5,-0.618728,0.0,-0.751386,7.0,15.0,4.0
75%,2.549721,3.932959,69.0,1.432672,1.0,1.254511,7.0,45.0,4.0
max,3.983271,4.912296,79.0,2.481077,1.0,2.768033,9.0,100.0,6.0


Every attribute column now has between 215 and 257 NaN values out of 1000, which can cause errors during training. To overcome this, I will replace each NaN with the mean of its column.

In [8]:
# Replace each NaN with the mean of its column
training_data.fillna(training_data.mean(), inplace = True)
training_data

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,y
0,1.378244,3.683394,64.69685,-0.634417,1.000000,0.409612,7.00000,30.000000,5
1,1.378244,3.618547,60.00000,1.573618,0.000000,0.639814,7.00000,30.000000,5
2,1.378244,3.096229,67.00000,0.249917,0.000000,0.089343,6.73022,80.000000,3
3,2.887677,3.870995,68.00000,-1.347755,0.257556,1.276986,6.73022,60.000000,5
4,2.731273,3.945024,79.00000,1.967320,1.000000,2.487831,6.73022,100.000000,4
...,...,...,...,...,...,...,...,...,...
995,3.125917,3.245430,68.00000,-0.142998,0.257556,2.540562,7.00000,25.881561,4
996,2.566080,3.567651,64.69685,-0.002154,1.000000,2.414309,7.00000,70.000000,4
997,1.783414,3.618547,64.69685,0.411349,0.000000,1.234720,7.00000,60.000000,3
998,1.633291,4.130596,64.69685,1.938254,0.257556,-1.389201,6.00000,0.000000,4
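Filling with the mean of the full dataset lets rows that are later held out for evaluation influence the fill values. A possible alternative, sketched here under the assumption that scikit-learn's `SimpleImputer` is acceptable for this task, is to learn the column means from the training rows only. The toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy features; NaN marks a missing entry.
X_train_toy = np.array([[1.0, 10.0],
                        [3.0, np.nan],
                        [np.nan, 30.0]])
X_test_toy = np.array([[np.nan, 50.0]])

# Learn the column means from the training rows only...
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train_toy)

# ...and reuse those same means for the held-out rows.
X_test_filled = imputer.transform(X_test_toy)

print(imputer.statistics_)   # per-column means learned from X_train_toy
print(X_test_filled)
```

Whether this changes the accuracy on this dataset is not tested here; it mainly keeps the evaluation free of information from the held-out rows.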


# Train model

I will try two popular classifiers that suit this problem: Random Forest and SVM. First, I split the dataset into training and test sets with a 70/30 ratio using scikit-learn.

In [9]:
X = np.array(training_data[training_data.columns[:-1]])
Y = np.array(training_data[training_data.columns[-1]])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state=42)
print (X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(700, 8) (300, 8) (700,) (300,)
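When the classes are not equally frequent, a plain random split can leave the two parts with different class proportions. A minimal sketch of a stratified split, assuming made-up labels (`X_toy` and `y_toy` are hypothetical), could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 1, 20 of class 2.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [2] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Class counts (class 1, class 2) in the training and test parts.
print(np.bincount(y_tr)[1:], np.bincount(y_te)[1:])
```

In the notebook, the equivalent change would be passing `stratify=Y` to the existing `train_test_split` call.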


## Random Forest

For Random Forest, I will use `RandomForestClassifier`, an ensemble of decision trees in which each tree is trained on a bootstrap sample of the training set and the trees' predictions are combined. After training, I will predict on the test set.

### Parameter tuning: n_estimators

In [10]:
from sklearn.model_selection import GridSearchCV
param_grid = {
              "n_estimators":[50,100,200,300,400,500,600,700,800,900,1000]
             }

forest_grid = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                 param_grid = param_grid,
                 scoring="accuracy",  # evaluation metric
                 cv = 5,              # 5-fold cross-validation
                 n_jobs = 1)          # number of CPU cores to use

forest_grid.fit(X_train,Y_train)
forest_grid_best = forest_grid.best_estimator_ #best estimator
print("Best Model Parameter: ",forest_grid.best_params_)


Best Model Parameter:  {'n_estimators': 300}
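The best parameter alone does not show how close the other candidates were. The sketch below, run on synthetic stand-in data with a shortened grid (both are assumptions made to keep the example fast), reads the per-candidate scores from `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook would pass X_train, Y_train instead.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=42
)

demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 50]},  # shortened grid to keep the demo fast
    scoring="accuracy",
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ holds one row per candidate, with mean and std of the fold scores.
results = pd.DataFrame(demo_grid.cv_results_)
print(results[["param_n_estimators", "mean_test_score",
               "std_test_score", "rank_test_score"]])
```

If the mean scores differ by less than their standard deviations, the choice of `n_estimators` probably matters little, and a smaller forest may be preferable for speed.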


In [11]:
rf_model = RandomForestClassifier (n_estimators = 300 , criterion = 'gini',bootstrap = True)
rf_model.fit (X_train, Y_train)
rf_predict = rf_model.predict(X_test)

After training, I will check the accuracy using actual and predicted values.

In [12]:
from sklearn import metrics
print ("Accuracy: ", metrics.accuracy_score(Y_test, rf_predict))

Accuracy:  0.9033333333333333
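Accuracy is a single number and hides which classes get confused with each other. A confusion matrix shows this; here is a small sketch on made-up labels (`y_true` and `y_hat` are hypothetical, not the notebook's results).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, made up for illustration.
y_true = np.array([1, 1, 2, 2, 3, 3])
y_hat = np.array([1, 2, 2, 2, 3, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=[1, 2, 3])
print(cm)

# Accuracy is the share of samples on the diagonal.
print(accuracy_score(y_true, y_hat), np.trace(cm) / cm.sum())
```

In the notebook, the same call would be `metrics.confusion_matrix(Y_test, rf_predict)`.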


##  Support Vector Machine (SVM)

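The SVM uses the following parameters:
- Kernel: RBF (Radial Basis Function). The RBF kernel implicitly maps the input space into an infinite-dimensional feature space.
- gamma: the RBF kernel coefficient. Larger values make the influence of each training sample more local.
- C: the regularization parameter. Larger values penalize misclassified training samples more heavily.

First, I will import the SVM module, then fit the model and predict on the test set. Finally, I will evaluate the model.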
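To make the role of gamma concrete, the RBF kernel value for two points x and z is exp(-gamma * ||x - z||^2). The sketch below computes it by hand for two made-up points and compares it with scikit-learn's `rbf_kernel`; the helper `rbf` is written only for this illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two vectors."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Hypothetical points, chosen only for illustration.
a = np.array([0.0, 0.0])
b = np.array([1.0, 2.0])

for gamma in (0.1, 1.0):
    manual = rbf(a, b, gamma)
    library = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0]
    print(gamma, manual, library)
```

The larger gamma gives a much smaller similarity for the same pair of points, which is why a high gamma makes the decision boundary follow individual training samples more closely.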

### Parameter tuning: gamma & C

In [None]:
from sklearn.svm import SVC
# defining parameter range 
param_grid = {'C': [0.1, 1, 10, 100, 1000],  
              'gamma': [0.1 , 0.2 ,0.5, 1.0], 
              'kernel': ['rbf']
             }  
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3, cv = 5)

# fitting the model for grid search 
grid.fit(X_train, Y_train) 
# print best parameter after tuning 
print(grid.best_params_) 

After parameter tuning, I get C = 100 and gamma = 0.2.

In [14]:
# Import SVM model
from sklearn.svm import SVC
svm_model = SVC (kernel = 'rbf' , gamma = 0.2, C = 100.0)

svm_model.fit (X_train, Y_train)
Y_predict = svm_model.predict(X_test)

In [15]:
# Evaluate the model
from sklearn import metrics
print ("Accuracy: ", metrics.accuracy_score(Y_test, Y_predict))

Accuracy:  0.9066666666666666
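The descriptive statistics show attributes on quite different scales (for example, A3 has a mean of about 64.7, while A2 has a standard deviation of about 0.47). Because the RBF kernel depends on Euclidean distance, large-scale attributes can dominate. The sketch below shows one common way to handle this, a `StandardScaler` in a pipeline, on synthetic stand-in data. Whether it improves accuracy on this dataset is not tested here, and gamma and C would likely need to be tuned again after scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the notebook would use X_train, X_test, Y_train, Y_test.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)
X_demo[:, 0] *= 100.0  # put one feature on a much larger scale, on purpose

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler is fitted on the training split only, then applied before the SVM.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=0.2, C=100.0))
scaled_svm.fit(X_tr, y_tr)

print("Accuracy with scaling:", scaled_svm.score(X_te, y_te))
```

Passing the whole pipeline to `GridSearchCV` (with parameter names such as `svc__C` and `svc__gamma`) would tune the SVM with scaling applied inside each fold.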


As we can see, Random Forest and SVM give nearly the same accuracy (0.9033 vs. 0.9067, a difference of one sample out of 300). SVM is slightly better, so I will use it to make predictions on the test file.

# Predict the results

In [17]:
X_id = np.array (test_data[test_data.columns[:-1]])
Y_ori = np.array (test_data[test_data.columns[-1]])
Y_pred = svm_model.predict (X_id)
print (Y_ori)
print (Y_pred)

[3 5 2 5 1 0 0 0 0 0 0 0 0 0 0]
[3 5 2 5 1 2 1 5 1 4 4 4 2 2 3]


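The first 5 rows of the test file have known labels, and all 5 predictions match them (100% accuracy). This is encouraging, although 5 samples are too few to draw a firm conclusion. The remaining 10 rows have 0 in the label column, which appears to be a placeholder for an unknown class; their predictions are the last 10 values of the second array.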
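The labeled and unlabeled rows can be separated explicitly instead of being read off by eye. This is a minimal sketch using the two arrays printed above, under the assumption that a label of 0 marks a row whose true class is unknown.

```python
import numpy as np

# Values copied from the printed output of the prediction cell.
Y_ori = np.array([3, 5, 2, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Y_pred = np.array([3, 5, 2, 5, 1, 2, 1, 5, 1, 4, 4, 4, 2, 2, 3])

# Assumption: a label of 0 marks a row whose true class is unknown.
labeled = Y_ori != 0

# Accuracy on the labeled rows only.
labeled_accuracy = np.mean(Y_ori[labeled] == Y_pred[labeled])
print("Labeled rows:", labeled.sum(), "accuracy:", labeled_accuracy)

# Predictions for the remaining, unlabeled rows.
print("Predictions for unlabeled rows:", Y_pred[~labeled])
```

Computing the accuracy over all 15 rows would be misleading, since the placeholder rows would count as errors.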