## Feature Engineering: Feature Scaling, Data Cleaning, Train-Test Split

## Problem Statement: Build a model that predicts employee salaries based on their years of experience

## Step 1: Data Gathering

In [2]:
import pandas as pd
path = r"https://raw.githubusercontent.com/sindhura-nk/Datasets/refs/heads/main/Salary_dataset.csv"
df = pd.read_csv(path)
df.head()

,Unnamed: 0,YearsExperience,Salary
0,0,1.2,39344.0
1,1,1.4,46206.0
2,2,1.6,37732.0
3,3,2.1,43526.0
4,4,2.3,39892.0
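The `Unnamed: 0` column in the output above looks like a row index that was saved into the CSV file. A minimal sketch of how `index_col=0` would read it as the index instead, assuming an inline CSV string (`csv_text`, a hypothetical stand-in for the real file) so that it runs without network access:

```python
import io
import pandas as pd

# Hypothetical inline sample mimicking the layout of the CSV (not the full file).
csv_text = """,YearsExperience,Salary
0,1.2,39344.0
1,1.4,46206.0
2,1.6,37732.0
"""

# Default read: the unnamed first column becomes a regular column "Unnamed: 0".
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats that first column as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

Either approach works for this notebook, because only `YearsExperience` and `Salary` are selected later.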


## Step 2: Perform Basic Data Quality Checks

In [3]:
df.shape

(30, 3)

In [4]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       30 non-null     int64  
 1   YearsExperience  30 non-null     float64
 2   Salary           30 non-null     float64
dtypes: float64(2), int64(1)
memory usage: 852.0 bytes


In [5]:
# check for duplicated rows
df.duplicated().sum()

np.int64(0)

In [6]:
# no duplicates today, but drop them anyway so the step still works if future data contains some
df = df.drop_duplicates()
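`df.info()` already reports 30 non-null values per column, so nothing is missing here. For other datasets, an explicit count can be easier to read. A small sketch on a hypothetical frame (`sample` is made up for illustration and contains one missing value and one duplicated row):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one NaN in YearsExperience, and row 3 repeats row 1.
sample = pd.DataFrame({
    "YearsExperience": [1.2, 1.4, np.nan, 1.4],
    "Salary": [39344.0, 46206.0, 37732.0, 46206.0],
})

missing_per_column = sample.isna().sum()     # number of NaN values in each column
duplicate_rows = sample.duplicated().sum()   # rows identical to an earlier row

# Drop the repeated row and renumber the index.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print(duplicate_rows)
```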

## Step 3: Separate X and Y features

    X: YearsExperience
    Y: Salary

In [7]:
X = df[['YearsExperience']]
Y = df[['Salary']]

In [8]:
X.head()

,YearsExperience
0,1.2
1,1.4
2,1.6
3,2.1
4,2.3


In [9]:
Y.head()

,Salary
0,39344.0
1,46206.0
2,37732.0
3,43526.0
4,39892.0
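The double brackets in `df[['YearsExperience']]` matter: they return a two-dimensional DataFrame, which is the shape scikit-learn expects for features. A short sketch on a hypothetical three-row frame (`sample` is illustrative only):

```python
import pandas as pd

# Hypothetical three-row sample with the same column names as the dataset.
sample = pd.DataFrame({
    "YearsExperience": [1.2, 1.4, 1.6],
    "Salary": [39344.0, 46206.0, 37732.0],
})

X_2d = sample[["YearsExperience"]]  # DataFrame: 2-D, one column
x_1d = sample["YearsExperience"]    # Series: 1-D

print(X_2d.shape)
print(x_1d.shape)
```

Because `Y` is also selected with double brackets, it is a one-column DataFrame, which is why the predictions later in the notebook come back as a two-dimensional array.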


## Step 4: Feature Engineering

1. Data Cleaning
2. Feature Scaling (data pre-processing)

In [10]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [12]:
num_pipe = make_pipeline(
    # data cleaning
    SimpleImputer(strategy='mean'),
    # feature scaling
    StandardScaler()
).set_output(transform='pandas')

In [13]:
num_pipe

Pipeline(steps=[('simpleimputer', SimpleImputer()),
                ('standardscaler', StandardScaler())])

In [14]:
X_pre = num_pipe.fit_transform(X)
X_pre.head()

,YearsExperience
0,-1.510053
1,-1.438373
2,-1.366693
3,-1.187494
4,-1.115814
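The scaled values above are z-scores: each value minus the column mean, divided by the column standard deviation. The sketch below checks this by hand on the five experience values shown earlier. Only those five rows are used, so the numbers differ from the output above, which is based on all 30 rows. `StandardScaler` uses the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five experience values from the notebook output; the full data has 30 rows.
years = pd.DataFrame({"YearsExperience": [1.2, 1.4, 1.6, 2.1, 2.3]})

# Scaling done by scikit-learn.
scaled = StandardScaler().fit_transform(years)

# The same computation by hand: z = (x - mean) / std, with population std (ddof=0).
mean = years["YearsExperience"].mean()
std = years["YearsExperience"].std(ddof=0)
manual = ((years["YearsExperience"] - mean) / std).to_numpy()

print(np.allclose(scaled.ravel(), manual))
```

The `SimpleImputer` step changes nothing on this dataset because no values are missing. It only takes effect if later data contains gaps.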


## Split the data into training and testing sets

In [15]:
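# Random split: 80% train, 20% test. Note: X_pre was scaled on all 30 rows, so the scaler has already seen the test rows.
from sklearn.model_selection import train_test_split
# random_state fixes the shuffle for reproducibility
xtrain, xtest, ytrain, ytest = train_test_split(X_pre, Y, train_size=0.8, test_size=0.2, random_state=21)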

In [16]:
xtrain.head()

,YearsExperience
19,0.2461
7,-0.757416
27,1.536336
11,-0.470697
18,0.210261


In [17]:
xtest.head()

,YearsExperience
5,-0.864935
23,1.034577
22,0.927058
28,1.787215
1,-1.438373


In [18]:
ytrain.head()

,Salary
19,93941.0
7,54446.0
27,112636.0
11,55795.0
18,81364.0


In [19]:
ytest.head()

,Salary
5,56643.0
23,113813.0
22,101303.0
28,122392.0
1,46206.0


## Model Building

In [20]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(xtrain,ytrain)

LinearRegression()


In [21]:
# r2 score for training data
model.score(xtrain,ytrain)

0.9557540289043547
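After fitting, a `LinearRegression` model exposes the learned slope in `coef_` and the intercept in `intercept_`, and `score` returns the R² value. A minimal sketch on hypothetical noise-free data that follows y = 3x + 2 (the data and the name `demo_model` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data: y = 3x + 2.
x_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 3.0 * x_demo + 2.0  # 2-D target, like the one-column DataFrame Y

demo_model = LinearRegression().fit(x_demo, y_demo)

# With a 2-D target, coef_ has shape (1, 1) and intercept_ has shape (1,).
slope = demo_model.coef_[0][0]
intercept = demo_model.intercept_[0]
r2_demo = demo_model.score(x_demo, y_demo)

print(slope, intercept, r2_demo)
```

On the notebook's model, `model.coef_` would be the salary change per one standard deviation of experience, because the feature was standardized before fitting.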

## Evaluation of the model on testing (unseen) data

In [24]:
ypreds = model.predict(xtest)

In [25]:
ypreds[:5]

array([[ 52103.91433294],
       [102051.58797689],
       [ 99224.36116686],
       [121842.17564714],
       [ 37025.37134609]])

In [26]:
ytest.head()

,Salary
5,56643.0
23,113813.0
22,101303.0
28,122392.0
1,46206.0


In [27]:
ypreds_train = model.predict(xtrain)
ypreds_train[:5]

array([[ 81318.59136997],
       [ 54931.14114298],
       [115245.31309039],
       [ 62470.4126364 ],
       [ 80376.18243329]])

In [28]:
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

mse_tr = mean_squared_error(ytrain,ypreds_train)
rmse_tr = mse_tr**(1/2)
mae_tr = mean_absolute_error(ytrain,ypreds_train)
r2_tr = r2_score(ytrain,ypreds_train)
print("Training Results:")
print(f"MSE for training data: {mse_tr:.2f}")
print(f"RMSE for training data: {rmse_tr:.2f}")
print(f"MAE for training data: {mae_tr:.2f}")
print(f"r2 score for training data: {r2_tr*100:.2f}%")


print("========================================")
print("Testing Results:")
mse = mean_squared_error(ytest,ypreds)
rmse = mse**(1/2)
mae = mean_absolute_error(ytest,ypreds)
r2 = r2_score(ytest,ypreds)

# same two-decimal formatting as the training results
print(f"MSE for testing data: {mse:.2f}")
print(f"RMSE for testing data: {rmse:.2f}")
print(f"MAE for testing data: {mae:.2f}")
print(f"r2 score for testing data: {r2*100:.2f}%")


Training Results:
MSE for training data: 28631788.29
RMSE for training data: 5350.87
MAE for training data: 4392.34
r2 score for training data: 95.58%
========================================
Testing Results:
MSE for testing data: 48542473.24
RMSE for testing data: 6967.24
MAE for testing data: 5783.08
r2 score for testing data: 93.99%
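The four metrics follow directly from the prediction errors. The sketch below computes them by hand with NumPy and compares them against the scikit-learn functions. `y_true` and `y_pred` are hypothetical values chosen for easy arithmetic, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted salaries.
y_true = np.array([50000.0, 60000.0, 80000.0, 100000.0])
y_pred = np.array([52000.0, 58000.0, 83000.0, 97000.0])

errors = y_true - y_pred

mse_manual = np.mean(errors ** 2)     # mean squared error
rmse_manual = np.sqrt(mse_manual)     # root of MSE, in salary units
mae_manual = np.mean(np.abs(errors))  # mean absolute error
# R^2 = 1 - (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

RMSE and MAE are in the same units as salary, which makes them easier to interpret than MSE.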


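The model's R² score is above 80% on both the training data (95.58%) and the testing data (93.99%), and the gap between the two is small, so there is no sign of severe overfitting.
We can consider this model for final model building and deployment, keeping in mind that the test set contains only 6 rows.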
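For deployment, new experience values have to go through the same preprocessing as the training data before prediction. One way to guarantee this is to put the imputer, scaler and regressor into a single pipeline. The sketch below is one possible arrangement, not the notebook's method. `train` reuses the five rows shown earlier as a small illustrative subset, and `new_employees` is hypothetical input.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative subset: the five rows displayed at the top of the notebook.
train = pd.DataFrame({
    "YearsExperience": [1.2, 1.4, 1.6, 2.1, 2.3],
    "Salary": [39344.0, 46206.0, 37732.0, 43526.0, 39892.0],
})

# Preprocessing and model chained together, so predict() scales inputs automatically.
full_model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LinearRegression(),
)
full_model.fit(train[["YearsExperience"]], train["Salary"])

# Hypothetical new employees, given in raw years (not scaled).
new_employees = pd.DataFrame({"YearsExperience": [3.0, 5.5]})
predicted = full_model.predict(new_employees)

print(predicted)
```

With this layout the caller passes raw years of experience and never has to handle the scaled values directly.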

In [29]:
ypreds

array([[ 52103.91433294],
       [102051.58797689],
       [ 99224.36116686],
       [121842.17564714],
       [ 37025.37134609],
       [ 91685.08967343]])

In [30]:
xtest

,YearsExperience
5,-0.864935
23,1.034577
22,0.927058
28,1.787215
1,-1.438373
21,0.640339


In [31]:
xtest['Salary Predicted'] = ypreds
xtest

,YearsExperience,Salary Predicted
5,-0.864935,52103.914333
23,1.034577,102051.587977
22,0.927058,99224.361167
28,1.787215,121842.175647
1,-1.438373,37025.371346
21,0.640339,91685.089673


In [32]:
# save the results to a CSV file (no trailing space in the file name)
xtest.to_csv("Regression Model Salary Predictions.csv",index=False)
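The saved file contains the scaled experience values, which are hard to read on their own. One option is to look up the original values through the row index that `train_test_split` preserves. A sketch with hypothetical stand-ins (`X_demo`, `xtest_demo` and `ypreds_demo` play the roles of `X`, `xtest` and `ypreds`, and their values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the notebook's X, xtest and ypreds.
X_demo = pd.DataFrame({"YearsExperience": [1.2, 1.4, 1.6, 2.1, 2.3]})
xtest_demo = pd.DataFrame({"YearsExperience": [-1.44, -0.25]}, index=[1, 3])  # placeholder scaled values
ypreds_demo = np.array([[37000.0], [45000.0]])  # 2-D, like model.predict output

# Use the preserved row labels to fetch the unscaled experience values.
results = X_demo.loc[xtest_demo.index].copy()
results["Salary Predicted"] = ypreds_demo.ravel()

# to_csv without a path returns the CSV text; index=True keeps the row labels.
csv_text = results.to_csv(index=True)
print(csv_text)
```

Keeping the index in the file makes it possible to match each prediction back to its row in the source data.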