# DATA SELECTION

## Predicting Price of Diamonds

The aim of the project is to build a machine learning model to predict the price of diamonds based on different explanatory variables describing aspects of diamonds.


We aim to minimise the difference between the real price and the price estimated by our model. We will evaluate model performance using the mean squared error (MSE) and the root mean squared error (RMSE).
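
As a reminder of what the two metrics measure, here is a minimal sketch that computes them with NumPy. The prices in `y_true` and `y_pred` are made up for illustration and are not taken from the diamonds dataset.

```python
import numpy as np

# Hypothetical real and predicted prices, for illustration only
y_true = np.array([1000.0, 1500.0, 500.0, 800.0])
y_pred = np.array([1100.0, 1400.0, 450.0, 900.0])

# MSE: the mean of the squared differences between real and predicted price
mse = np.mean((y_true - y_pred) ** 2)

# RMSE: the square root of the MSE, expressed in the same units as price
rmse = np.sqrt(mse)

print('mse: {}'.format(mse))
print('rmse: {}'.format(rmse))
```

The RMSE is usually the easier of the two to interpret, because it is on the same scale as the target.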



====================================================================================================

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
%matplotlib inline

# to build the models
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# to display all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [2]:
# load dataset
# We load the datasets with the engineered values, which we built and saved in the previous notebook

X_train = pd.read_csv('xtrain.csv')
X_test = pd.read_csv('xtest.csv')

X_train.head()

Unnamed: 0.1,Unnamed: 0,price,carat,cut,color,clarity,depth,table,x,y,z
0,37975,1007,0.04158,0.25,0.5,0.571429,0.577778,0.444444,0.43203,0.077589,0.092453
1,45367,1665,0.072765,0.0,0.0,0.571429,0.513889,0.388889,0.488827,0.089813,0.101887
2,34602,470,0.014553,0.5,0.0,0.428571,0.483333,0.444444,0.389199,0.071307,0.07956
3,32114,783,0.010395,0.0,0.0,0.0,0.536111,0.277778,0.379888,0.069779,0.080189
4,22779,10800,0.168399,0.5,0.333333,0.142857,0.45,0.472222,0.603352,0.111885,0.121698


In [4]:
# capture the target
y_train = X_train['price']
y_test = X_test['price']

# drop unnecessary variables from our training and testing sets
X_train.drop(['Unnamed: 0', 'price'], axis=1, inplace=True)
X_test.drop(['Unnamed: 0', 'price'], axis=1, inplace=True)

### Feature Selection

In [5]:
# here I fit the model and select the features
# in a single step

# first, I specify the Lasso regression model and
# select a suitable alpha (the strength of the L1 penalty).
# The bigger the alpha, the fewer features will be selected.

# Then I use the SelectFromModel object from sklearn, which
# selects the features whose coefficients are non-zero

sel_ = SelectFromModel(Lasso(alpha=0.005, random_state=0))
sel_.fit(X_train, y_train)

SelectFromModel(estimator=Lasso(alpha=0.005, copy_X=True, fit_intercept=True,
                                max_iter=1000, normalize=False, positive=False,
                                precompute=False, random_state=0,
                                selection='cyclic', tol=0.0001,
                                warm_start=False),
                max_features=None, norm_order=1, prefit=False, threshold=None)

In [6]:
sel_.get_support()

array([ True,  True,  True,  True,  True,  True,  True,  True,  True])

In [7]:
selected_feat = X_train.columns[(sel_.get_support())]

# let's print some stats
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(sel_.estimator_.coef_ == 0)))

total features: 9
selected features: 9
features with coefficients shrunk to zero: 0
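
In this run all nine features were kept, so the Lasso did not remove anything at `alpha=0.005`. One plausible reason, which we have not verified on this dataset, is that this alpha is small relative to the scale of `price`. The sketch below uses synthetic data (not the diamonds data) to illustrate how the number of selected features tends to fall as alpha grows. The arrays `X_demo` and `y_demo` and the list of alphas are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 5 features, of which only the first two drive the target
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(200, 5))
y_demo = (10.0 * X_demo[:, 0] + 5.0 * X_demo[:, 1]
          + rng.normal(scale=0.1, size=200))

# Fit the selector for several penalty strengths and count the kept features
n_selected = {}
for alpha in [0.001, 0.1, 0.5]:
    sel_demo = SelectFromModel(Lasso(alpha=alpha, random_state=0))
    sel_demo.fit(X_demo, y_demo)
    n_selected[alpha] = int(sel_demo.get_support().sum())

print(n_selected)
```

On the real data, the same loop over a few candidate alphas would show whether any value actually shrinks coefficients to zero.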


In [8]:
# print the selected features
selected_feat

Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'x', 'y', 'z'], dtype='object')

### Identify the selected variables

In [9]:
# this is an alternative way of identifying the selected features,
# based on the non-zero Lasso coefficients:
selected_feats = X_train.columns[sel_.estimator_.coef_ != 0]
selected_feats

Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'x', 'y', 'z'], dtype='object')
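
To see how strongly each feature is weighted, and not only whether it survived, the coefficients can be paired with the column names. This is a minimal sketch on synthetic data: the column names `f0` to `f4`, the data, and the alpha value are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the training set (hypothetical columns f0..f4)
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.uniform(size=(200, 5)),
                      columns=['f0', 'f1', 'f2', 'f3', 'f4'])
# Only f0 and f1 influence the target
y_demo = (10.0 * X_demo['f0'] + 5.0 * X_demo['f1']
          + rng.normal(scale=0.1, size=200))

sel_demo = SelectFromModel(Lasso(alpha=0.1, random_state=0))
sel_demo.fit(X_demo, y_demo)

# Label each fitted coefficient with its feature name
coefs = pd.Series(sel_demo.estimator_.coef_, index=X_demo.columns)
kept = coefs[coefs != 0].index.tolist()

# The boolean mask from get_support() identifies the same features here
mask_kept = X_demo.columns[sel_demo.get_support()].tolist()

print(coefs)
print(kept, mask_kept)
```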

In [10]:
# now we save the list of selected features
pd.Series(selected_feats).to_csv('selected_features.csv', index=False)
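
A later notebook will need to read this list back. Depending on the pandas version, `Series.to_csv` may or may not write a header line by default, so the sketch below sets `header` explicitly on both the write and the read. The feature subset and the temporary directory are assumptions for illustration. The call above relies on the default instead.

```python
import os
import tempfile
import pandas as pd

# Hypothetical subset of feature names, for illustration only
selected_demo = pd.Index(['carat', 'cut', 'color'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'selected_features.csv')

    # header=False writes one feature name per line and nothing else
    pd.Series(selected_demo).to_csv(path, index=False, header=False)

    # header=None tells pandas that the first line is data, not a column name
    loaded = pd.read_csv(path, header=None)[0].tolist()

print(loaded)
```

The resulting list can then be used to subset a dataframe, for example `X_train[loaded]`.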

  
