# Objective: Feature Subset Selection to Improve Software Cost Estimation

## Dataset
This dataset comes from the PROMISE Software Engineering Repository, which publishes data sets to encourage repeatable, verifiable, refutable, and/or improvable predictive models of software engineering. The main objective is to estimate software cost (actual effort) using feature subset selection techniques.

## Attributes
1.	RELY {Nominal,Very_High,High,Low} 
2.	DATA {High,Low,Nominal,Very_High} 
3.	CPLX {Very_High,High,Nominal,Extra_High,Low} 
4.	TIME {Nominal,Very_High,High,Extra_High} 
5.	STOR {Nominal,Very_High,High,Extra_High} 
6.	VIRT {Low,Nominal,High}
7.	TURN {Nominal,High,Low}
8.	ACAP {High,Very_High,Nominal} 
9.	AEXP {Nominal,Very_High,High} 
10.	PCAP {Very_High,High,Nominal}
11.	VEXP {Low,Nominal,High}
12.	LEXP {Nominal,High,Very_Low,Low} 
13.	MODP {High,Nominal,Very_High,Low}
14.	TOOL {Nominal,High,Very_High,Very_Low,Low} 
15.	SCED {Low,Nominal,High}
16.	LOC numeric 

## Target Class
17.	ACT_EFFORT numeric

### Source: http://promise.site.uottawa.ca/SERepository/datasets/cocomonasa_v1.arff

Tasks:
1.	Obtain the software cost estimation dataset
2.	Apply pre-processing techniques (if any)
3.	Apply feature subset selection techniques such as correlation analysis, forward selection, backward elimination, and recursive feature elimination. Find the best possible subset of features from each method.
4.	Divide the dataset into training and testing sets.
5.	Implement support vector regression (SVR), linear regression, and a decision tree.
6.	Ensemble SVR, linear regression, and the decision tree.
7.	Evaluate the coefficient of determination and root mean square error for all models, including the ensemble.
8.	Conclude the results

Helpful links:
1.	https://scikit-learn.org/stable/modules/ensemble.html
2.	https://www.analyticsvidhya.com/blog/2020/03/support-vector-regression-tutorial-for-machine-learning/
3.	https://medium.com/pursuitnotes/support-vector-regression-in-6-steps-with-python-c4569acd062d
4.	https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html


## Task 1: Implementation of regression models 

In [21]:
# Load the libraries
import pandas as pd
import numpy as np
from scipy.io import arff

from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [9]:
# Load the dataset 

darff = arff.loadarff('cocomonasa_v1.arff')
data = pd.DataFrame(darff[0])

# loadarff returns nominal values as bytes; decode them to str
# (the last two columns are numerical and are left untouched)
for col in data.columns[:-2]:
    data[col] = data[col].str.decode('utf-8')
            
data.head()

Unnamed: 0,RELY,DATA,CPLX,TIME,STOR,VIRT,TURN,ACAP,AEXP,PCAP,VEXP,LEXP,MODP,TOOL,SCED,LOC,ACT_EFFORT
0,Nominal,High,Very_High,Nominal,Nominal,Low,Nominal,High,Nominal,Very_High,Low,Nominal,High,Nominal,Low,70.0,278.0
1,Very_High,High,High,Very_High,Very_High,Nominal,Nominal,Very_High,Very_High,Very_High,Nominal,High,High,High,Low,227.0,1181.0
2,Nominal,High,High,Very_High,High,Low,High,High,Nominal,High,Low,High,High,Nominal,Low,177.9,1248.0
3,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,115.8,480.0
4,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,29.5,120.0


In [10]:
# Preprocessing
# Encoding categorical variables (if any)
# Feature Scaling
# Filling missing values (if any)
categories = [list(data[i].unique()) for i in data.columns[: -2]]
categories

[['Nominal', 'Very_High', 'High', 'Low'],
 ['High', 'Low', 'Nominal', 'Very_High'],
 ['Very_High', 'High', 'Nominal', 'Extra_High', 'Low'],
 ['Nominal', 'Very_High', 'High', 'Extra_High'],
 ['Nominal', 'Very_High', 'High', 'Extra_High'],
 ['Low', 'Nominal', 'High'],
 ['Nominal', 'High', 'Low'],
 ['High', 'Very_High', 'Nominal'],
 ['Nominal', 'Very_High', 'High'],
 ['Very_High', 'High', 'Nominal'],
 ['Low', 'Nominal', 'High'],
 ['Nominal', 'High', 'Very_Low', 'Low'],
 ['High', 'Nominal', 'Very_High', 'Low'],
 ['Nominal', 'High', 'Very_High', 'Very_Low', 'Low'],
 ['Low', 'Nominal', 'High']]

In [13]:
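# The ratings are ordinal: Very_Low < Low < Nominal < High < Very_High < Extra_High,
# so we use ordinal encoding for these categorical variables.
# Note: `categories` lists the labels in order of first appearance, so the codes
# produced here do not follow that rank order.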
oe = OrdinalEncoder(categories = categories)
data.iloc[:, :-2] = oe.fit_transform(data.iloc[:, :-2])
data.head()

Unnamed: 0,RELY,DATA,CPLX,TIME,STOR,VIRT,TURN,ACAP,AEXP,PCAP,VEXP,LEXP,MODP,TOOL,SCED,LOC,ACT_EFFORT
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,70.0,278.0
1,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,227.0,1181.0
2,0.0,0.0,1.0,1.0,2.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,177.9,1248.0
3,2.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,115.8,480.0
4,2.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,29.5,120.0
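
If the encoded values should reflect the rank of each rating rather than the order in which labels first appear, one option is to map every label through a single explicit ordering. The following is a minimal sketch, not part of the original notebook; the names `RATING_ORDER`, `RATING_RANK`, and `encode_ratings` are hypothetical, and the ordering is the usual COCOMO rating scale.

```python
# Assumed rank order of the rating labels (standard COCOMO scale).
RATING_ORDER = ["Very_Low", "Low", "Nominal", "High", "Very_High", "Extra_High"]
RATING_RANK = {label: rank for rank, label in enumerate(RATING_ORDER)}


def encode_ratings(labels):
    """Map rating labels to integer ranks; raises KeyError on an unknown label."""
    return [RATING_RANK[label] for label in labels]


# RELY values of the first five rows shown in data.head()
rely = ["Nominal", "Very_High", "Nominal", "High", "High"]
encoded = encode_ratings(rely)
print(encoded)  # [2, 4, 2, 3, 3]
```

With a DataFrame, the same mapping could be applied per column with `data[col].map(RATING_RANK)`.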


In [15]:
# there are no null values
data.isna().sum()

RELY          0
DATA          0
CPLX          0
TIME          0
STOR          0
VIRT          0
TURN          0
ACAP          0
AEXP          0
PCAP          0
VEXP          0
LEXP          0
MODP          0
TOOL          0
SCED          0
LOC           0
ACT_EFFORT    0
dtype: int64

In [16]:
data.describe()

Unnamed: 0,RELY,DATA,CPLX,TIME,STOR,VIRT,TURN,ACAP,AEXP,PCAP,VEXP,LEXP,MODP,TOOL,SCED,LOC,ACT_EFFORT
count,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0
mean,0.95,1.35,1.183333,0.483333,0.516667,0.3,1.283333,1.1,1.016667,1.5,0.833333,0.816667,1.0,0.766667,0.683333,74.588333,406.413333
std,1.015557,0.879619,0.676273,0.770025,0.892372,0.53043,0.804472,0.933374,0.929583,0.701089,0.418499,0.567231,1.073565,1.212459,0.650728,97.172089,656.96567
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.2,8.4
25%,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.75,0.0,0.0,0.0,12.95,60.0
50%,0.0,1.0,1.0,0.0,0.0,0.0,1.5,1.0,1.0,2.0,1.0,1.0,1.0,0.0,1.0,30.5,118.8
75%,2.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,100.0,377.5
max,3.0,3.0,4.0,3.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,4.0,2.0,423.0,3240.0


In [23]:
# Let's scale the LOC feature to the [0, 1] range.
scaler = MinMaxScaler()
data['LOC'] = scaler.fit_transform(np.array(data['LOC']).reshape(-1, 1))

In [24]:
data.head()

Unnamed: 0,RELY,DATA,CPLX,TIME,STOR,VIRT,TURN,ACAP,AEXP,PCAP,VEXP,LEXP,MODP,TOOL,SCED,LOC,ACT_EFFORT
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.161122,278.0
1,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.534221,1181.0
2,0.0,0.0,1.0,1.0,2.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.417538,1248.0
3,2.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,0.269962,480.0
4,2.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,0.064876,120.0


In [25]:
X = data.drop(columns = ['ACT_EFFORT'])  # `columns` already implies axis = 1
y = data['ACT_EFFORT']

In [26]:
print("X shape : ", X.shape)
print("y shape : ", y.shape)

X shape :  (60, 16)
y shape :  (60,)


In [28]:
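# Apply feature subset selection techniques
# ACT_EFFORT is continuous, so features are scored with f_regression
# (f_classif expects class labels as the target)
from sklearn.feature_selection import SelectKBest, f_regression
model_lin = LinearRegression()
sel = SelectKBest(f_regression, k = 13)
X = sel.fit_transform(X, y)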
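
The task list also names forward selection, backward elimination, and recursive feature elimination. A minimal sketch of these wrapper methods with scikit-learn follows. It runs on synthetic data from `make_regression` as a stand-in for the 60 x 16 feature matrix, and the choice of 8 features and 5-fold cross-validation is an arbitrary assumption for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the feature matrix (not the NASA data)
X_demo, y_demo = make_regression(
    n_samples=60, n_features=16, n_informative=5, noise=10.0, random_state=0
)

estimator = LinearRegression()
n_keep = 8  # assumed subset size, chosen only for the demo

# Recursive feature elimination: repeatedly drop the weakest coefficient
rfe = RFE(estimator, n_features_to_select=n_keep).fit(X_demo, y_demo)

# Forward selection: start empty, add the feature that most improves the CV score
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=n_keep, direction="forward", cv=5
).fit(X_demo, y_demo)

# Backward elimination: start with all features, remove the least useful one
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=n_keep, direction="backward", cv=5
).fit(X_demo, y_demo)

for name, selector in [("RFE", rfe), ("forward", forward), ("backward", backward)]:
    print(name, selector.get_support(indices=True))
```

Each selector exposes `transform`, so the chosen subset can be applied to the real feature matrix in the same way as `SelectKBest`.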


In [None]:
# Divide the dataset into training and testing sets
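
One possible way to fill this cell is sketched below. It uses synthetic data in place of `X` and `y`, and the 80/20 split and `random_state=42` are assumptions. The sketch also fits the `MinMaxScaler` on the training portion only; the cell above scales `LOC` on the full dataset, which lets test-set information influence the scaling.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: one LOC-like feature and an effort-like target
rng = np.random.default_rng(0)
X_demo = rng.uniform(2.0, 423.0, size=(60, 1))
y_demo = 4.0 * X_demo[:, 0] + rng.normal(0.0, 20.0, size=60)

# Hold out 20% of the rows for testing (assumed ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both parts
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)  # (48, 1) (12, 1)
```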



In [None]:
# Build regression models 
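
A sketch of the three regressors named in the task list is shown below, again on synthetic data. The imports at the top of the notebook bring in `LinearSVR`; this sketch uses `SVR` with an RBF kernel instead, and all hyperparameters (`C`, `epsilon`, `max_depth`) are untuned assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 13 selected features
X_demo, y_demo = make_regression(
    n_samples=60, n_features=13, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hyperparameters are illustrative, not tuned
models = {
    "Linear regression": LinearRegression(),
    "SVR": SVR(kernel="rbf", C=100.0, epsilon=0.1),
    "Decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# Fit every model on the training set and predict on the held-out set
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(name, predictions[name][:3])
```

Keeping the models in a dictionary makes it straightforward to loop over them again when computing metrics.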





In [None]:
# Evaluate the built models on the test dataset



In [None]:
# Evaluate training and testing coefficient of determination and root mean square error
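
For reference, both metrics can be written out directly from their definitions. The sketch below uses the first five `ACT_EFFORT` values from the table above as targets, while the predictions are made-up numbers for illustration only; the helper names `rmse` and `r2` are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [278.0, 1181.0, 1248.0, 480.0, 120.0]   # ACT_EFFORT, first five rows
y_pred = [300.0, 1100.0, 1200.0, 500.0, 150.0]   # illustrative predictions only

print("RMSE:", round(rmse(y_true, y_pred), 2))
print("R^2 :", round(r2(y_true, y_pred), 4))
```

In the notebook itself, `sklearn.metrics.r2_score` and the square root of `mean_squared_error` give the same quantities; calling them on both the training and the test predictions shows how much each model overfits.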



## Task 2: Ensemble regression models


In [None]:
# Ensemble the regression models
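
One way to combine the three regressors is scikit-learn's `VotingRegressor`, which averages the predictions of its fitted members. This is a sketch on synthetic data under assumed, untuned hyperparameters; stacking or weighted averaging would be equally valid choices for this task.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the selected features and effort target
X_demo, y_demo = make_regression(
    n_samples=60, n_features=13, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Unweighted average of the three base regressors
ensemble = VotingRegressor(
    estimators=[
        ("lin", LinearRegression()),
        ("svr", SVR(kernel="rbf", C=100.0)),
        ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
    ]
)
ensemble.fit(X_train, y_train)
y_pred = ensemble.predict(X_test)

# The ensemble output equals the mean of the members' predictions
member_mean = np.mean([m.predict(X_test) for m in ensemble.estimators_], axis=0)
print(np.allclose(y_pred, member_mean))
```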

In [None]:
# Evaluate Coefficient of determination and Root mean square error 


## Task 3: Conclude the results
