# Objective: Feature Subset Selection to Improve Software Cost Estimation

## Dataset
This is a PROMISE Software Engineering Repository data set, made publicly available to encourage repeatable, verifiable, refutable, and/or improvable predictive models of software engineering. The main objective is to estimate software cost (the actual effort, ACT_EFFORT) using feature subset selection techniques.

## Attributes
1.	RELY {Nominal,Very_High,High,Low} 
2.	DATA {High,Low,Nominal,Very_High} 
3.	CPLX {Very_High,High,Nominal,Extra_High,Low} 
4.	TIME {Nominal,Very_High,High,Extra_High} 
5.	STOR {Nominal,Very_High,High,Extra_High} 
6.	VIRT {Low,Nominal,High}
7.	TURN {Nominal,High,Low}
8.	ACAP {High,Very_High,Nominal} 
9.	AEXP {Nominal,Very_High,High} 
10.	PCAP {Very_High,High,Nominal}
11.	VEXP {Low,Nominal,High}
12.	LEXP {Nominal,High,Very_Low,Low} 
13.	MODP {High,Nominal,Very_High,Low}
14.	TOOL {Nominal,High,Very_High,Very_Low,Low} 
15.	SCED {Low,Nominal,High}
16.	LOC numeric 

## Target Class
17.	ACT_EFFORT numeric

### Source: http://promise.site.uottawa.ca/SERepository/datasets/cocomonasa_v1.arff

Tasks:
1.	Obtain the software cost estimation dataset
2.	Apply pre-processing techniques (if any)
3.	Apply feature subset selection techniques such as correlation analysis, forward selection, backward elimination, and recursive feature elimination. Find the best possible subset of features from each method.
4.	Divide the dataset into training and testing sets.
5.	Implement support vector regression (SVR), linear regression, and a decision tree.
6.	Ensemble SVR, linear regression, and the decision tree.
7.	Evaluate the coefficient of determination (R2) and root mean square error (RMSE) for all the models, including the ensemble.
8.	Conclude the results

Helpful links: https://scikit-learn.org/stable/modules/ensemble.html
https://www.analyticsvidhya.com/blog/2020/03/support-vector-regression-tutorial-for-machine-learning/
https://medium.com/pursuitnotes/support-vector-regression-in-6-steps-with-python-c4569acd062d
https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html


## Task 1: Implementation of regression models 

In [111]:
# Load the libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from scipy.io import arff
from sklearn.utils import shuffle
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.feature_selection import RFE

In [1]:
!wget http://promise.site.uottawa.ca/SERepository/datasets/cocomonasa_v1.arff

In [10]:
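# Load the dataset
# loadarff returns (records, metadata); only the records are used here
data, meta = arff.loadarff('cocomonasa_v1.arff')
df = pd.DataFrame(data)
# Nominal attributes are read as bytes; decode all but the last two (numeric) columns
nominal_cols = df.columns[:-2]
df[nominal_cols] = df[nominal_cols].apply(lambda col: col.str.decode('utf-8'))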
df.head()

Unnamed: 0,RELY,DATA,CPLX,TIME,STOR,VIRT,TURN,ACAP,AEXP,PCAP,VEXP,LEXP,MODP,TOOL,SCED,LOC,ACT_EFFORT
0,Nominal,High,Very_High,Nominal,Nominal,Low,Nominal,High,Nominal,Very_High,Low,Nominal,High,Nominal,Low,70.0,278.0
1,Very_High,High,High,Very_High,Very_High,Nominal,Nominal,Very_High,Very_High,Very_High,Nominal,High,High,High,Low,227.0,1181.0
2,Nominal,High,High,Very_High,High,Low,High,High,Nominal,High,Low,High,High,Nominal,Low,177.9,1248.0
3,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,115.8,480.0
4,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,29.5,120.0


In [11]:
# Shuffle the dataset
df = df.sample(frac=1).reset_index(drop=True)
df.head()

Unnamed: 0,RELY,DATA,CPLX,TIME,STOR,VIRT,TURN,ACAP,AEXP,PCAP,VEXP,LEXP,MODP,TOOL,SCED,LOC,ACT_EFFORT
0,Nominal,Low,High,Nominal,Nominal,High,Low,High,High,High,Low,Very_Low,Nominal,Nominal,Nominal,100.0,215.0
1,Nominal,Low,High,Nominal,Nominal,Low,Low,High,Very_High,Very_High,Nominal,High,Nominal,Nominal,Nominal,20.0,72.0
2,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,10.4,50.0
3,High,Nominal,High,Nominal,Nominal,Nominal,Nominal,Nominal,Nominal,Nominal,Nominal,Nominal,Nominal,Nominal,Nominal,6.5,42.0
4,Nominal,Low,High,Nominal,Extra_High,Low,Low,High,Very_High,Very_High,Nominal,High,Nominal,Nominal,Nominal,150.0,324.0


In [12]:
# Preprocessing
# Encoding categorical variables (if any)
# Feature Scaling
# Filling missing values (if any)
df.isna().sum()

RELY          0
DATA          0
CPLX          0
TIME          0
STOR          0
VIRT          0
TURN          0
ACAP          0
AEXP          0
PCAP          0
VEXP          0
LEXP          0
MODP          0
TOOL          0
SCED          0
LOC           0
ACT_EFFORT    0
dtype: int64

In [13]:
data = df.iloc[:,:-1]

In [14]:
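LE = LabelEncoder()
# Caution: exclude="int64" keeps every non-int64 column, so the float column LOC
# is selected (and label-encoded below) along with the 15 nominal attributes
CateList = data.select_dtypes(exclude="int64").columns
print(CateList)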

Index(['RELY', 'DATA', 'CPLX', 'TIME', 'STOR', 'VIRT', 'TURN', 'ACAP', 'AEXP',
       'PCAP', 'VEXP', 'LEXP', 'MODP', 'TOOL', 'SCED', 'LOC'],
      dtype='object')


In [16]:
for i in CateList:
    df[i] = LE.fit_transform(df[i])

In [17]:
df.head()

Unnamed: 0,RELY,DATA,CPLX,TIME,STOR,VIRT,TURN,ACAP,AEXP,PCAP,VEXP,LEXP,MODP,TOOL,SCED,LOC,ACT_EFFORT
0,2,1,1,2,2,0,1,0,0,0,1,3,2,2,2,41,215.0
1,2,1,1,2,2,1,1,0,2,2,2,0,2,2,2,23,72.0
2,0,1,1,2,2,1,1,1,1,1,2,0,0,2,1,11,50.0
3,0,2,1,2,2,2,2,1,1,1,2,2,2,2,2,4,42.0
4,2,1,1,2,0,1,1,0,2,2,2,0,2,2,2,44,324.0
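`LabelEncoder` assigns codes in alphabetical order of the labels, which is why `Nominal` becomes 2 and `High` becomes 0 in the table above. That order does not follow the rating scale. A minimal sketch of an order-preserving alternative is shown below; the rating order `Very_Low < Low < Nominal < High < Very_High < Extra_High` is an assumption (it is not stated in this notebook), and `RATING_ORDER`, `RATING_CODE` and the `toy` frame are illustrative names only.

```python
import pandas as pd

# Assumed rating order, lowest to highest (not stated in the notebook itself)
RATING_ORDER = ["Very_Low", "Low", "Nominal", "High", "Very_High", "Extra_High"]
RATING_CODE = {label: code for code, label in enumerate(RATING_ORDER)}

# Small illustrative frame; in the notebook this would be the nominal columns of df
toy = pd.DataFrame({
    "RELY": ["Nominal", "Very_High", "High", "Low"],
    "CPLX": ["Very_High", "High", "Extra_High", "Low"],
})

# Map every label to its position on the assumed scale
encoded = toy.apply(lambda col: col.map(RATING_CODE))
print(encoded)
```

With this mapping a higher code always means a higher rating, which distance-based and linear models can exploit; whether it improves the results here has not been tested.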


In [20]:
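data = df.iloc[:, :-1]
mm = MinMaxScaler()
# Scale every feature to [0, 1]; build a new frame instead of assigning into a slice of df
data = pd.DataFrame(mm.fit_transform(data), columns=data.columns, index=data.index)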

In [21]:
data.head()

Unnamed: 0,RELY,DATA,CPLX,TIME,STOR,VIRT,TURN,ACAP,AEXP,PCAP,VEXP,LEXP,MODP,TOOL,SCED,LOC
0,0.666667,0.333333,0.25,0.666667,0.666667,0.0,0.5,0.0,0.0,0.0,0.5,1.0,0.666667,0.5,1.0,0.759259
1,0.666667,0.333333,0.25,0.666667,0.666667,0.5,0.5,0.0,1.0,1.0,1.0,0.0,0.666667,0.5,1.0,0.425926
2,0.0,0.333333,0.25,0.666667,0.666667,0.5,0.5,0.5,0.5,0.5,1.0,0.0,0.0,0.5,0.5,0.203704
3,0.0,0.666667,0.25,0.666667,0.666667,1.0,1.0,0.5,0.5,0.5,1.0,0.666667,0.666667,0.5,1.0,0.074074
4,0.666667,0.333333,0.25,0.666667,0.0,0.5,0.5,0.0,1.0,1.0,1.0,0.0,0.666667,0.5,1.0,0.814815


In [117]:
X = data
y = df['ACT_EFFORT']
print(X.shape, y.shape)

(60, 16) (60,)
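Correlation analysis, the first technique named in the task list, can be sketched as ranking features by their absolute Pearson correlation with the target. The example below uses a small synthetic frame as a stand-in for `X` and `y`; `X_demo`, `y_demo` and the choice of keeping two features are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled feature frame X and the target y
X_demo = pd.DataFrame(rng.random((60, 4)), columns=["f0", "f1", "f2", "f3"])
y_demo = 4.0 * X_demo["f0"] - 4.0 * X_demo["f2"] + rng.normal(0, 0.1, 60)

# Absolute Pearson correlation of each feature with the target, strongest first
corr = X_demo.corrwith(y_demo).abs().sort_values(ascending=False)

# Keep the k most correlated features (k=2 is arbitrary here)
top_k = corr.index[:2].tolist()
print(corr)
print("Selected:", top_k)
```

On the notebook's data the same two lines would be applied to `X` and `y`. Note that this ranking only captures linear, one-feature-at-a-time relationships.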


In [183]:
# Apply feature subset selection techniques
# Keep the 8 features with the highest univariate F-statistic against the target
X_new = SelectKBest(f_regression, k=8).fit_transform(X, y)
X_new.shape

(60, 8)
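The notebook imports `RFE` but only applies `SelectKBest`. The other wrapper methods from the task list could be sketched as follows. The data is synthetic (`make_regression` stands in for `X` and `y`), the base estimator and the target of 8 features are assumptions, and `SequentialFeatureSelector` requires scikit-learn 0.24 or later, which is newer than the version that produced the outputs in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for (X, y): 60 rows, 16 features, 8 of them informative
X_demo, y_demo = make_regression(
    n_samples=60, n_features=16, n_informative=8, noise=5.0, random_state=0
)

# Recursive feature elimination: repeatedly drop the weakest coefficient
rfe = RFE(LinearRegression(), n_features_to_select=8).fit(X_demo, y_demo)

# Forward selection: start empty, add the feature that most improves the CV score
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=8, direction="forward", cv=5
).fit(X_demo, y_demo)

# Backward elimination: start with all features, remove the least useful one
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=8, direction="backward", cv=5
).fit(X_demo, y_demo)

print("RFE:     ", rfe.get_support(indices=True))
print("Forward: ", forward.get_support(indices=True))
print("Backward:", backward.get_support(indices=True))
```

Each selector exposes `transform`, so the reduced matrix can be obtained with, for example, `rfe.transform(X_demo)`. The three methods need not agree on the same subset.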

In [184]:
# Divide the dataset to training and testing set
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.25, random_state = 123)

In [185]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(45, 8) (45,)
(15, 8) (15,)
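In the cells above, `MinMaxScaler` and `SelectKBest` are fitted on all 60 rows before the split, so the 15 test rows influence both the scaling and the chosen features. One way to avoid this is to split first and wrap the steps in a `Pipeline`, which fits each step on the training rows only. The following is a sketch on synthetic data; `X_demo`, `y_demo` and the step names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 60 x 16 feature matrix and the effort target
X_demo, y_demo = make_regression(
    n_samples=60, n_features=16, n_informative=8, noise=5.0, random_state=0
)

# Split before any fitting so the test rows stay unseen
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123
)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                   # fitted on X_tr only
    ("select", SelectKBest(f_regression, k=8)),  # fitted on X_tr only
    ("model", LinearRegression()),
])
pipe.fit(X_tr, y_tr)

print("Test R2:", pipe.score(X_te, y_te))
```

The same pipeline object can be passed to `VotingRegressor` or to cross-validation helpers in place of a bare estimator.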


In [186]:
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [187]:
# Build regression models: SVR, linear regression and decision tree
clf1 = SVR(kernel='linear')
clf2 = LinearRegression()
# Note: criterion='mse' was renamed 'squared_error' in scikit-learn 1.0 and removed in 1.2
clf3 = DecisionTreeRegressor(criterion='mse')

In [188]:
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)
clf3.fit(X_train, y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [189]:
# Predict with the fitted models on the training (t1) and test (t2) sets
pred1_t1 = clf1.predict(X_train)
pred1_t2 = clf1.predict(X_test)
pred2_t1 = clf2.predict(X_train)
pred2_t2 = clf2.predict(X_test)
pred3_t1 = clf3.predict(X_train)
pred3_t2 = clf3.predict(X_test)

In [190]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

In [191]:
# Evaluate training and testing coefficient of determination and root mean square error
r2_1_t1 = r2_score(y_train, pred1_t1)
r2_1_t2 = r2_score(y_test, pred1_t2)
r2_2_t1 = r2_score(y_train, pred2_t1)
r2_2_t2 = r2_score(y_test, pred2_t2)
r2_3_t1 = r2_score(y_train, pred3_t1)
r2_3_t2 = r2_score(y_test, pred3_t2)

In [192]:
# squared=False returns the RMSE; scikit-learn 1.4+ provides root_mean_squared_error instead
rmse1_t1 = mean_squared_error(y_train, pred1_t1, squared=False)
rmse1_t2 = mean_squared_error(y_test, pred1_t2, squared=False)
rmse2_t1 = mean_squared_error(y_train, pred2_t1, squared=False)
rmse2_t2 = mean_squared_error(y_test, pred2_t2, squared=False)
rmse3_t1 = mean_squared_error(y_train, pred3_t1, squared=False)
rmse3_t2 = mean_squared_error(y_test, pred3_t2, squared=False)

In [193]:
print("Training R2 Score (SVR):",r2_1_t1)
print("Testing R2 Score (SVR):",r2_1_t2)
print("Training R2 Score (Linear Regression):",r2_2_t1)
print("Testing R2 Score (Linear Regression):",r2_2_t2)
print("Training R2 Score (Decision Tree):",r2_3_t1)
print("Testing R2 Score (Decision Tree):",r2_3_t2)

Training R2 Score (SVR): -0.17223976933988427
Testing R2 Score (SVR): -0.21958629537979868
Training R2 Score (Linear Regression): 0.5837204842907424
Testing R2 Score (Linear Regression): 0.49239916906649206
Training R2 Score (Decision Tree): 0.999985222455648
Testing R2 Score (Decision Tree): 0.6844886060607711


In [194]:
print("Training RMSE (SVR):",rmse1_t1)
print("Testing RMSE (SVR):",rmse1_t2)
print("Training RMSE (Linear Regression):",rmse2_t1)
print("Testing RMSE (Linear Regression):",rmse2_t2)
print("Training RMSE (Decision Tree):",rmse3_t1)
print("Testing RMSE (Decision Tree):",rmse3_t2)

Training RMSE (SVR): 712.5200288682679
Testing RMSE (SVR): 696.7488572097509
Training RMSE (Linear Regression): 424.6013509036133
Testing RMSE (Linear Regression): 449.50151466139135
Training RMSE (Decision Tree): 2.5298221281347035
Testing RMSE (Decision Tree): 354.3867868116229


## Task 2: Ensemble regression models


In [195]:
# Ensemble the regression models
estimators = []
estimators.append(('SVR', clf1))
estimators.append(('Linear Regression', clf2))
estimators.append(('Decision Tree', clf3))
estimators

[('SVR',
  SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
      kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False)),
 ('Linear Regression',
  LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)),
 ('Decision Tree',
  DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                        max_features=None, max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=None, splitter='best'))]

In [196]:
from sklearn.ensemble import VotingRegressor

In [197]:
ensemble = VotingRegressor(estimators)
ensemble.fit(X_train, y_train)

VotingRegressor(estimators=[('SVR',
                             SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
                                 epsilon=0.1, gamma='scale', kernel='linear',
                                 max_iter=-1, shrinking=True, tol=0.001,
                                 verbose=False)),
                            ('Linear Regression',
                             LinearRegression(copy_X=True, fit_intercept=True,
                                              n_jobs=None, normalize=False)),
                            ('Decision Tree',
                             DecisionTreeRegressor(ccp_alpha=0.0,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features=None,
                                                   max_leaf_nodes=None,
                                                   min_impurity_decrease=0.0,
            

In [198]:
pred_t1 = ensemble.predict(X_train)
pred_t2 = ensemble.predict(X_test)

In [199]:
# Evaluate Coefficient of determination and Root mean square error 
r2_t1 = r2_score(y_train, pred_t1)
r2_t2 = r2_score(y_test, pred_t2)
rmse_t1 = mean_squared_error(y_train, pred_t1, squared=False)
rmse_t2 = mean_squared_error(y_test, pred_t2, squared=False)

In [200]:
print("Training R2 Score (Ensemble):",r2_t1)
print("Testing R2 Score (Ensemble):",r2_t2)
print("Training RMSE (Ensemble):",rmse_t1)
print("Testing RMSE (Ensemble):",rmse_t2)

Training R2 Score (Ensemble): 0.7309830884233982
Testing R2 Score (Ensemble): 0.7372252159797692
Training RMSE (Ensemble): 341.3332866383362
Testing RMSE (Ensemble): 323.4162530061874



## Task 3: Conclude the results


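We can see that, on the held-out test set, the ensemble model (a VotingRegressor averaging clf1, clf2 and clf3) performs better than each individual model on both regression metrics. Its test R2 score is about 0.74, compared with 0.68 for the Decision Tree, 0.49 for Linear Regression and -0.22 for SVR, and its test RMSE is about 323, compared with 354, 450 and 697 respectively. On the training set the Decision Tree scores higher (R2 close to 1.0, RMSE about 2.5), but the large gap to its test performance indicates overfitting. The negative R2 of the linear-kernel SVR means it does worse than predicting the mean effort.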
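These figures come from a single 45/15 split of an unseeded shuffle, so they may vary noticeably between runs. A sketch of how the comparison could be repeated with 5-fold cross-validation is given below; it uses synthetic data, and the fold count, seeds and the `models` and `results` names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the selected features and the target
X_demo, y_demo = make_regression(
    n_samples=60, n_features=8, noise=5.0, random_state=0
)

svr = SVR(kernel="linear")
lr = LinearRegression()
dt = DecisionTreeRegressor(random_state=0)

models = {
    "SVR": svr,
    "Linear Regression": lr,
    "Decision Tree": dt,
    "Ensemble": VotingRegressor([("svr", svr), ("lr", lr), ("dt", dt)]),
}

# Same shuffled folds for every model so the scores are comparable
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    print(f"{name}: mean R2 = {results[name].mean():.3f} "
          f"(std {results[name].std():.3f})")
```

Reporting the mean and spread over folds gives a steadier basis for ranking the models than one test split of 15 rows.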