# Computer Hardware Performance Prediction

In this notebook, we explore the Computer Hardware dataset from the UCI Machine Learning Repository. The dataset and its description are available at http://archive.ics.uci.edu/ml/datasets/Computer+Hardware

The task is to estimate the relative performance of a hardware model from attributes such as the vendor name, machine cycle time in nanoseconds, minimum and maximum main memory in kilobytes, cache memory in kilobytes, minimum and maximum number of channels, and published relative performance.

# Application

This kind of problem is interesting because a model that predicts machine performance can inform future hardware design, and it lets us estimate performance without benchmarking the actual machine. These few features are certainly not enough for a perfect prediction, but they are enough to illustrate the idea.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("computer-hardware.csv")

In [3]:
print(df.head())
print("Total number of values (rows x columns): {}".format(df.size))
print("Shape of the data: {}".format(df.shape))

    vendor    model  myct  mmin   mmax  cach  chmin  chmax  pep  erp
0  adviser    32/60   125   256   6000   256     16    128  198  199
1   amdahl   470v/7    29  8000  32000    32      8     32  269  253
2   amdahl  470v/7a    29  8000  32000    32      8     32  220  253
3   amdahl  470v/7b    29  8000  32000    32      8     32  172  253
4   amdahl  470v/7c    29  8000  16000    32      8     16  132  132
Total number of values (rows x columns): 2090
Shape of the data: (209, 10)


The vendor name matters because many vendors follow their own hardware standards, so we will use it. The model column is just a name assigned by the vendor and should not affect performance, so we will not use it as a feature. All other columns are used to predict the ERP (Estimated Relative Performance).
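
Because vendor is a text column, it has to be turned into numbers before a model can use it. The sketch below contrasts two common options on a small made-up list of vendor names (not the real column): integer codes, which follow the same sorted-names convention as `LabelEncoder`, and one-hot columns, which avoid implying an order between vendors. The one-hot variant is only an illustration of an alternative, not what this notebook does.

```python
import numpy as np

# Made-up sample of vendor names, for illustration only.
vendors = np.array(["amdahl", "adviser", "amdahl", "ibm", "adviser"])

# Integer codes: the sorted unique names are mapped to 0..n-1.
classes, codes = np.unique(vendors, return_inverse=True)

# One-hot encoding: one 0/1 column per vendor, so no ordering is implied.
one_hot = np.eye(len(classes), dtype=int)[codes]

print(classes)
print(codes)
print(one_hot)
```

With integer codes, a distance-based model treats vendor 0 and vendor 2 as further apart than vendor 0 and vendor 1, which is an artifact of alphabetical order; one-hot columns avoid that at the cost of more features.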

In [4]:
# Now let's normalize the data. The features span very different ranges,
# so normalizing is necessary. We also split the data into train and test sets for later use.
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

# Let's convert the vendor names into numbers. As vendor is a categorical variable, we cannot feed it to the model directly.
# LabelEncoder maps each name to an integer from 0 to (number of unique vendors - 1).
encoder = LabelEncoder()
df['vendor'] = encoder.fit_transform(df['vendor'])
# print(df['vendor'].head())

y = np.array(df["erp"])
X = np.array(df[["vendor","myct", "mmin", "mmax", "cach", "chmin", "chmax", "pep"]]) # vendor is included as its integer code; the model name is left out.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=33)

# Now we will normalize our data, as it is one of the most useful things to do before training.
# SVR is sensitive to feature scale, so without normalization the model may fit poorly.
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(np.reshape(y_train,(-1,1)))

X_train = scalerX.transform(X_train)
y_train = scalery.transform(np.reshape(y_train,(-1,1)))
X_test = scalerX.transform(X_test)
y_test = scalery.transform(np.reshape(y_test,(-1,1)))

print(np.max(X_train), np.min(X_train), np.mean(X_train), np.max(y_train), np.min(y_train), np.mean(y_train))
# Flatten the (n, 1) target arrays back into 1-D arrays
y_train=y_train.flatten(order='C')
y_test=y_test.flatten(order='C')

7.903174847721237 -2.0951086068167517 -2.846725704167068e-18 8.190563136424101 -0.5332953962881468 0.0
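
The printed means are (numerically) zero because the scaler subtracts the training mean and divides by the training standard deviation. A minimal NumPy sketch of that idea follows, on synthetic numbers rather than the real data; the variable names are illustrative. It also shows how to undo the scaling, which is needed to read predictions in the original units.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one feature column (e.g. memory in KB); not the real data.
X_train = rng.uniform(64, 32000, size=(150, 1))
X_test = rng.uniform(64, 32000, size=(50, 1))

# Learn the statistics from the training split only ...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

# ... and apply the same shift and scale to both splits.
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

# Undo the scaling to get back to the original units.
X_test_back = X_test_std * std + mean

print(X_train_std.mean(), X_train_std.std())
```

Fitting the statistics on the training split alone keeps information about the test set out of the preprocessing step; the test split is then only approximately zero-mean and unit-variance.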




For this problem we are going to use a support vector regressor (SVR).
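
An SVR compares pairs of samples through a kernel function. As a rough sketch, the four kernels that scikit-learn exposes under the names `linear`, `poly`, `rbf` and `sigmoid` can be written in NumPy as follows. The parameter names `gamma`, `coef0` and `degree` mirror SVR's own parameters; the sample vectors and the value of `gamma` are arbitrary choices for illustration.

```python
import numpy as np

def linear_kernel(x, z):
    # Plain dot product.
    return x @ z

def poly_kernel(x, z, gamma, coef0=0.0, degree=3):
    # (gamma * <x, z> + coef0) ** degree
    return (gamma * (x @ z) + coef0) ** degree

def rbf_kernel(x, z, gamma):
    # exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma, coef0=0.0):
    # tanh(gamma * <x, z> + coef0)
    return np.tanh(gamma * (x @ z) + coef0)

x = np.array([1.0, 0.0, -1.0])
z = np.array([0.5, 2.0, -0.5])
gamma = 0.5  # arbitrary illustrative value

print("linear :", linear_kernel(x, z))
print("poly   :", poly_kernel(x, z, gamma))
print("rbf    :", rbf_kernel(x, z, gamma))
print("sigmoid:", sigmoid_kernel(x, z, gamma))
```

The linear kernel yields a model that is linear in the features, while the other three allow non-linear fits and are more sensitive to their hyperparameters.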

In [5]:
from sklearn.svm import SVR
kernels = ['rbf', 'linear', 'poly', 'sigmoid'] 

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
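    # Fitting the model on the data
    clf.fit(X_train, y_train)
    
    # For a regressor, score() returns the coefficient of determination (R^2),
    # not a classification accuracy; it can be negative for a poor fit.
    print("Accuracy on training set: {}".format(clf.score(X_train, y_train)))
    print("Accuracy on testing set: {}".format(clf.score(X_test, y_test)))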

In [6]:
for kernel in kernels:
    clf = SVR(kernel=kernel)
    print("Results of SVR with kernel: {}".format(kernel))
    train_and_evaluate(clf, X_train, X_test, y_train, y_test)
    print("")

Results of SVR with kernel: rbf
Accuracy on training set: 0.605448057154983
Accuracy on testing set: 0.5524439594166559

Results of SVR with kernel: linear
Accuracy on training set: 0.9107637482045451
Accuracy on testing set: 0.90821888575205

Results of SVR with kernel: poly
Accuracy on training set: 0.9845253714732883
Accuracy on testing set: -0.12212087846534092

Results of SVR with kernel: sigmoid
Accuracy on training set: -17.42736885820057
Accuracy on testing set: -19.804107251858504



In the above results we can see that SVR with the linear kernel works best: its score is about 0.91 on both the training and the test set. Note that the "accuracy" printed for a regressor is the R² score, which can be negative. The poly kernel overfits (0.98 on training, -0.12 on testing), the sigmoid kernel scores far below zero on both sets, and the rbf kernel reaches only about 0.55 to 0.61 with its default settings.

Now let's make some predictions and see how the model does.
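
The negative scores are possible because a regressor's `score()` reports the coefficient of determination, R² = 1 - SS_res / SS_tot, rather than a fraction of correct answers. A small hand-written version (the function name `r2_score_manual` is made up for this sketch, and the numbers are toy values) shows the three regimes:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_score_manual(y_true, y_true)                      # perfect predictions
baseline = r2_score_manual(y_true, np.full(4, y_true.mean()))  # always predict the mean
reversed_ = r2_score_manual(y_true, y_true[::-1])              # worse than the mean

print(perfect, baseline, reversed_)
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and anything below 0 means worse than that baseline.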

In [7]:
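from sklearn.metrics import mean_squared_error
y_pred = SVR(kernel='linear').fit(X_train, y_train).predict(X_test)
# mean_squared_error returns the MSE (no square root), computed here on the standardized target.
print("Mean Squared Error is: {}".format(mean_squared_error(y_test,y_pred)))

Mean Squared Error is: 0.16383139792515305


The MSE is about 0.16 on the standardized target, which corresponds to an RMSE of roughly 0.40 training-set standard deviations of ERP. Together with the test score of about 0.91, this suggests the linear SVR has learned to predict ERP reasonably well from the features listed above.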
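
Note that `mean_squared_error` returns the mean of the squared errors; the root mean squared error is its square root, and an error measured on a standardized target has to be multiplied by the target's standard deviation to be read in original units. The sketch below walks through that arithmetic on made-up numbers (not values from the dataset), using the mean and standard deviation of `y_true` as a stand-in for the training statistics.

```python
import numpy as np

# Illustrative numbers only; not values from the dataset.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

# Standardize both with the same mean and std (stand-ins for the training statistics).
y_mean, y_std = y_true.mean(), y_true.std()
y_true_s = (y_true - y_mean) / y_std
y_pred_s = (y_pred - y_mean) / y_std

mse_s = np.mean((y_true_s - y_pred_s) ** 2)  # MSE in standardized units
rmse_s = np.sqrt(mse_s)                      # RMSE in standardized units
rmse_raw = rmse_s * y_std                    # RMSE back in the original units

print(mse_s, rmse_s, rmse_raw)
```

In the notebook itself, the same conversion could be done with `scalery.inverse_transform` on the predictions before computing the error.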