# Data Preprocessing

1. Acquire the dataset
2. Import the required libraries
3. Import the dataset
4. Identify and handle missing values
5. Encode categorical data
6. Split the dataset
7. Scale the features

* Data preprocessing improves the quality of the data so that meaningful insights can be extracted from it.
* For a machine learning model, data preprocessing is the first step of the workflow.

In [4]:
# Data preprocessing in machine learning
# step 1: import all libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

In [7]:
# step 2: load dataset
df = pd.read_csv(r'Data_1.csv')

In [9]:
df.head()

,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [8]:
df.describe()

,Age,Salary
count,9.0,9.0
mean,38.777778,63777.777778
std,7.693793,12265.579662
min,27.0,48000.0
25%,35.0,54000.0
50%,38.0,61000.0
75%,44.0,72000.0
max,50.0,83000.0


In [13]:
x = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

In [15]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [16]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


In [17]:
# identify missing values
df.isna().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64
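
The counts above can also be read as fractions, and the affected rows can be listed directly. The snippet below is a minimal sketch that rebuilds the ten rows printed earlier by hand (the name `df_demo` is only for illustration) so that it runs without `Data_1.csv`.

```python
import numpy as np
import pandas as pd

# Same ten rows as the dataset printed above, rebuilt by hand.
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Fraction of missing entries per column (mean of a boolean mask).
missing_share = df_demo.isna().mean()
print(missing_share)

# Rows that contain at least one missing value.
rows_with_gaps = df_demo[df_demo.isna().any(axis=1)]
print(rows_with_gaps)
```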

In [25]:
# create a copy
df1 = df.copy()

In [26]:
df1.shape

(10, 4)

In [31]:
# solution 1: drop every row that contains a missing value
df1.dropna(inplace=True)

In [32]:
df1

,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
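
Dropping rows removed two of the ten observations here. When only some columns matter, `dropna` can be restricted with `subset` or `thresh`. The sketch below uses a hypothetical three-row frame taken from the rows above.

```python
import numpy as np
import pandas as pd

# Three rows from the dataset above (illustrative subset).
df_demo = pd.DataFrame({
    "Country": ["Germany", "Spain", "France"],
    "Age": [40.0, np.nan, 35.0],
    "Salary": [np.nan, 52000.0, 58000.0],
})

# Drop a row only when Age is missing; the row with a missing Salary survives.
age_known = df_demo.dropna(subset=["Age"])

# Keep rows with at least 3 non-missing values (here: fully complete rows).
complete_rows = df_demo.dropna(thresh=3)

print(age_known)
print(complete_rows)
```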


In [72]:
# solution 2: fill missing values with the column mean
df2 = df.copy()
# assign the result back to the column; calling fillna(..., inplace=True)
# on df2['Age'] acts on an intermediate copy and raises a FutureWarning
# in recent pandas versions
df2['Age'] = df2['Age'].fillna(df2['Age'].mean())
df2['Salary'] = df2['Salary'].fillna(df2['Salary'].mean())

In [73]:
df2.isna().sum()

Country      0
Age          0
Salary       0
Purchased    0
dtype: int64
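
The mean is sensitive to extreme values, so the median is a common alternative. The following is a small sketch, not part of the original notebook, that fills several numeric columns in one call. The frame and the name `numeric_cols` are illustrative.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

numeric_cols = ["Age", "Salary"]

# median() returns a Series indexed by column name; fillna aligns on
# those names, so each column is filled with its own median.
filled = df_demo.fillna(df_demo[numeric_cols].median())
print(filled)
```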

In [74]:
# solution 3: impute with scikit-learn's SimpleImputer
df_array = df.iloc[:,:-1].values
df_array

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [42]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan,strategy='mean')

In [43]:
# fit computes the mean of each selected column
# and stores those means on the imputer
imputer.fit(df_array[:,1:3])



In [44]:
# transform replaces the missing values in those columns with the means computed during fit
df_array[:,1:3] = imputer.transform(df_array[:,1:3])

In [45]:
df_array

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
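
To check what the imputer learned, the fitted means can be read from its `statistics_` attribute. This sketch rebuilds the Age and Salary columns by hand and uses `fit_transform` to do both steps in one call.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the dataset above.
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(numeric)  # fit and transform in one call

# One learned mean per column, computed over the non-missing entries.
print(imputer.statistics_)
```

The two values should match the `mean` row of `df.describe()` shown earlier.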

In [46]:
# encoding the categorical data
# categorical data takes one of a fixed set of categories rather than a numeric value
# in the dataset above, Country and Purchased are the two categorical columns

In [48]:
# solution 1: LabelEncoder (mainly intended for the target label y)
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
df2['Country'] = label_encoder_x.fit_transform(df2['Country'])

In [49]:
df2

,Country,Age,Salary,Purchased
0,0,44.0,72000.0,No
1,2,27.0,48000.0,Yes
2,1,30.0,54000.0,No
3,2,38.0,61000.0,No
4,1,40.0,63777.777778,Yes
5,0,35.0,58000.0,Yes
6,2,38.777778,52000.0,No
7,0,48.0,79000.0,Yes
8,1,50.0,83000.0,No
9,0,37.0,67000.0,Yes
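
The integer codes follow the sorted order of the class names, which is why France became 0, Germany 1 and Spain 2. A minimal sketch of how to inspect that mapping is shown below. Note that a model may read these codes as an ordering (Spain > Germany > France), which is one reason one-hot encoding is usually preferred for feature columns.

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "Spain"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ is sorted; the position of each class is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)
```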


In [52]:
# solution 2: ColumnTransformer
# applies a transformer to the selected columns and passes the remaining columns through

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df4 = df.copy()

In [51]:
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough')

In [54]:
df4 = np.array(ct.fit_transform(df4))
df4

array([[1.0, 0.0, 0.0, 44.0, 72000.0, 'No'],
       [0.0, 0.0, 1.0, 27.0, 48000.0, 'Yes'],
       [0.0, 1.0, 0.0, 30.0, 54000.0, 'No'],
       [0.0, 0.0, 1.0, 38.0, 61000.0, 'No'],
       [0.0, 1.0, 0.0, 40.0, nan, 'Yes'],
       [1.0, 0.0, 0.0, 35.0, 58000.0, 'Yes'],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0, 'No'],
       [1.0, 0.0, 0.0, 48.0, 79000.0, 'Yes'],
       [0.0, 1.0, 0.0, 50.0, 83000.0, 'No'],
       [1.0, 0.0, 0.0, 37.0, 67000.0, 'Yes']], dtype=object)

In [55]:
# solution 3: pd.get_dummies()
# by default it one-hot encodes every string column, including the target Purchased
df6 = df.copy()
pd.get_dummies(df6)

,Age,Salary,Country_France,Country_Germany,Country_Spain,Purchased_No,Purchased_Yes
0,44.0,72000.0,True,False,False,True,False
1,27.0,48000.0,False,False,True,False,True
2,30.0,54000.0,False,True,False,True,False
3,38.0,61000.0,False,False,True,True,False
4,40.0,,False,True,False,False,True
5,35.0,58000.0,True,False,False,False,True
6,38.777778,52000.0,False,False,True,True,False
7,48.0,79000.0,True,False,False,False,True
8,50.0,83000.0,False,True,False,True,False
9,37.0,67000.0,True,False,False,False,True
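
In the output above the target `Purchased` was split into two columns as well. To encode only the feature column, `get_dummies` accepts a `columns` argument. The sketch below uses an illustrative three-row frame, and `dtype=int` yields 0/1 instead of booleans.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Purchased": ["No", "Yes", "No"],
})

# Encode only Country; Purchased is left untouched.
dummies = pd.get_dummies(df_demo, columns=["Country"], dtype=int)
print(dummies)
```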


In [56]:
# encode the target y with LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [57]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
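
Once a model predicts 0 or 1, the same encoder can translate the codes back to the original labels. A short sketch with the first five target values:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y_raw = np.array(["No", "Yes", "No", "No", "Yes"])

le_demo = LabelEncoder()
y_encoded = le_demo.fit_transform(y_raw)

# inverse_transform maps integer codes back to the original labels.
decoded = le_demo.inverse_transform(y_encoded)
print(y_encoded, decoded)
```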


In [58]:
# split the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=33)

In [59]:
print(x_train)

[['France' 35.0 58000.0]
 ['France' 37.0 67000.0]
 ['Spain' 27.0 48000.0]
 ['Spain' nan 52000.0]
 ['Germany' 30.0 54000.0]
 ['France' 44.0 72000.0]
 ['France' 48.0 79000.0]
 ['Germany' 40.0 nan]]


In [60]:
print(x_test)

[['Germany' 50.0 83000.0]
 ['Spain' 38.0 61000.0]]


In [61]:
print(y_train)

[1 1 1 0 0 0 1 1]


In [62]:
print(y_test)

[0 0]
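
The test set above contains only class 0. With classification targets, passing `stratify` asks `train_test_split` to keep the class proportions similar in both sets. The sketch below uses a placeholder feature matrix (`X_demo` is hypothetical) with the encoded target from above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)               # placeholder features
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])   # encoded target from above

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=33, stratify=y_demo
)

# Both classes now appear in the two-row test set.
print(y_tr, y_te)
```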


# Feature Scaling

In [68]:
# fit the scaler after splitting, on the training set only, so the test set does not influence it
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
x_train[:,1:] = mm.fit_transform(x_train[:,1:])
x_train


array([['France', 0.38095238095238093, 0.3225806451612905],
       ['France', 0.4761904761904763, 0.612903225806452],
       ['Spain', 0.0, 0.0],
       ['Spain', nan, 0.12903225806451624],
       ['Germany', 0.1428571428571428, 0.19354838709677424],
       ['France', 0.8095238095238093, 0.774193548387097],
       ['France', 1.0, 1.0],
       ['Germany', 0.6190476190476191, nan]], dtype=object)
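
Min-max scaling maps each value to `(x - min) / (max - min)`, where `min` and `max` come from the data the scaler was fitted on. The sketch below uses a few Age values from the split above (training minimum 27, maximum 48) and shows that test values are scaled with the training statistics, so they can fall outside the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_age = np.array([[27.0], [35.0], [48.0]])
test_age = np.array([[38.0], [50.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_age)  # learns min=27, max=48
test_scaled = scaler.transform(test_age)        # reuses the training min/max

# Same result by hand: (x - min) / (max - min)
manual = (test_age - 27.0) / (48.0 - 27.0)
print(test_scaled.ravel(), manual.ravel())
```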

In [71]:
# reuse the scaler fitted on x_train; refitting on x_test would scale the two sets differently
x_test[:,1:] = mm.transform(x_test[:,1:])
print(x_train)

[['France' 0.38095238095238093 0.3225806451612905]
 ['France' 0.4761904761904763 0.612903225806452]
 ['Spain' 0.0 0.0]
 ['Spain' nan 0.12903225806451624]
 ['Germany' 0.1428571428571428 0.19354838709677424]
 ['France' 0.8095238095238093 0.774193548387097]
 ['France' 1.0 1.0]
 ['Germany' 0.6190476190476191 nan]]
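
The scaled training set above still contains `nan` because `x` was taken from the raw frame before imputation. One way to keep the steps in order is to chain them, so that imputation, encoding and scaling are all fitted on the training set and then applied to the test set. The following is a sketch under that assumption, not the notebook's original flow. The names `df_demo`, `numeric_steps` and `preprocess` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Same ten rows as the dataset printed at the top, rebuilt by hand.
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})
X = df_demo[["Country", "Age", "Salary"]]
y = (df_demo["Purchased"] == "Yes").astype(int)

# Numeric columns: fill gaps with the mean, then scale to [0, 1].
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, ["Age", "Salary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=33
)
X_train_ready = preprocess.fit_transform(X_train)  # statistics learned here
X_test_ready = preprocess.transform(X_test)        # and only applied here
print(X_train_ready)
```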
