# Lab Experiment 04: Data Preprocessing

## 1. Introduction
Data preprocessing is a crucial step in machine learning: it turns raw data into a clean, consistent form that a model can consume. In this notebook, we will cover:
- Acquiring the dataset
- Importing libraries
- Handling missing values
- Encoding categorical data
- Splitting the dataset
- Feature scaling

In order, the workflow is:

1. Acquire the dataset
2. Import all required libraries
3. Import the dataset
4. Identify and handle missing and noisy values
5. Encode categorical data
6. Split the dataset
7. Apply feature scaling

## 2. Importing Libraries
We start by importing the necessary libraries for data manipulation and visualization.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import warnings
warnings.filterwarnings('ignore')


## 3. Creating the Dataset
We will create a sample dataset containing information about countries, age, salary, and purchase status.

In [None]:
df = pd.DataFrame(
    {
        "Country": [
            "France",
            "Spain",
            "Germany",
            "Spain",
            "Germany",
            "France",
            "Spain",
            "France",
            "Germany",
            "France",
            "Spain",
        ],
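        # np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
        "Age": [84.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0, np.nan],
        "Salary": [
            72000.0,
            48000.0,
            54000.0,
            61000.0,
            np.nan,
            58000.0,
            52000.0,
            79000.0,
            83000.0,
            67000.0,
            "-",  # placeholder string standing in for a missing salary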
        ],
        "Purchased": [
            "No",
            "Yes",
            "No",
            "No",
            "Yes",
            "Yes",
            "No",
            "Yes",
            "No",
            "Yes",
            "No",
        ],
    }
)
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,84.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
10,Spain,,-,No


In [15]:
df.dtypes

Country       object
Age          float64
Salary        object
Purchased     object
dtype: object

## 4. Handling Missing Values
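Real-world data often has missing values, and not all of them are stored as `NaN`: here the `Salary` column also uses `"-"` as a placeholder. We need to identify such entries and handle them.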
### Identifying Missing Values

In [None]:
# errors="coerce" turns anything that cannot be parsed as a number (here "-") into NaN
pd.to_numeric(df.Salary, errors="coerce")

0     72000.0
1     48000.0
2     54000.0
3     61000.0
4         NaN
5     58000.0
6     52000.0
7     79000.0
8     83000.0
9     67000.0
10        NaN
Name: Salary, dtype: float64
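
To locate placeholder entries before overwriting the column, one option is to compare the coerced result with the raw values. The following is a minimal sketch on a hypothetical three-value series (`salary`, `coerced`, and `placeholders` are illustrative names, not part of the lab):

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing a number, a real NaN, and a "-" placeholder.
salary = pd.Series([72000.0, np.nan, "-"], name="Salary")

# A column that mixes floats and strings is stored with dtype object.
print(salary.dtype)

# Entries that are present but cannot be parsed as numbers are placeholders.
coerced = pd.to_numeric(salary, errors="coerce")
placeholders = salary[coerced.isna() & salary.notna()]
print(placeholders.tolist())

# After coercion, both the NaN and the placeholder count as missing.
print(coerced.isna().sum())
```

Checking `placeholders` first shows which strings will be turned into `NaN`, so an unexpected value is not discarded silently.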

In [None]:
df.Salary = pd.to_numeric(df.Salary, errors="coerce")
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,84.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [18]:
df.isnull().sum()

Country      0
Age          2
Salary       2
Purchased    0
dtype: int64

In [19]:
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,84.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


### Imputing Missing Values
We can fill missing values using various strategies, such as the mean, median, or mode. Here, we will fill missing salaries with the mean salary of the respective country.

In [None]:
# Copy the slice so the fill does not act on a view of df
ger_df = df[df.Country == "Germany"].copy()
rest_df = df[df.Country != "Germany"]
# Fill only Salary; a frame-wide fillna would also write the salary mean into Age
ger_df["Salary"] = ger_df["Salary"].fillna(ger_df["Salary"].mean())
pd.concat([ger_df, rest_df])

Unnamed: 0,Country,Age,Salary,Purchased
2,Germany,30.0,54000.0,No
4,Germany,40.0,68500.0,Yes
8,Germany,50.0,83000.0,No
0,France,84.0,72000.0,No
1,Spain,27.0,48000.0,Yes
3,Spain,38.0,61000.0,No
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
9,France,37.0,67000.0,Yes


In [None]:
# Copy the slice so the fill does not act on a view of df
spa_df = df[df.Country == "Spain"].copy()
rest_df = df[df.Country != "Spain"]
# Fill only Salary; a frame-wide fillna would also write the salary mean into Age
spa_df["Salary"] = spa_df["Salary"].fillna(spa_df["Salary"].mean())
pd.concat([spa_df, rest_df])

Unnamed: 0,Country,Age,Salary,Purchased
1,Spain,27.0,48000.0,Yes
3,Spain,38.0,61000.0,No
6,Spain,,52000.0,No
10,Spain,,53666.666667,No
0,France,84.0,72000.0,No
2,Germany,30.0,54000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No


In [None]:
# Copy the slice so the fill does not act on a view of df
fra_df = df[df.Country == "France"].copy()
rest_df = df[df.Country != "France"]
# France has no missing salaries, so this fill leaves the values unchanged
fra_df["Salary"] = fra_df["Salary"].fillna(fra_df["Salary"].mean())
pd.concat([fra_df, rest_df])

Unnamed: 0,Country,Age,Salary,Purchased
0,France,84.0,72000.0,No
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
9,France,37.0,67000.0,Yes
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
6,Spain,,52000.0,No
8,Germany,50.0,83000.0,No


### Filling All Missing Values
The three per-country fills can be combined into a single result. Alternatively, we can fill all missing values in a column with the global mean of that column.

In [None]:
# Copy each slice so the fills do not act on views of df
ger_df = df[df.Country == "Germany"].copy()
spa_df = df[df.Country == "Spain"].copy()
fra_df = df[df.Country == "France"].copy()

# Fill only Salary, each country with its own mean
for part in (ger_df, spa_df, fra_df):
    part["Salary"] = part["Salary"].fillna(part["Salary"].mean())

pd.concat([spa_df, fra_df, ger_df])

Unnamed: 0,Country,Age,Salary,Purchased
1,Spain,27.0,48000.0,Yes
3,Spain,38.0,61000.0,No
6,Spain,,52000.0,No
10,Spain,,53666.666667,No
0,France,84.0,72000.0,No
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
9,France,37.0,67000.0,Yes
2,Germany,30.0,54000.0,No
4,Germany,40.0,68500.0,Yes
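
The per-country fill can also be written without splitting the frame by hand. The following is a minimal sketch using `groupby(...).transform("mean")` on a small hypothetical frame (`demo` and `group_mean` are illustrative names, and the values are a subset chosen for the example):

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with one missing salary per country group.
demo = pd.DataFrame(
    {
        "Country": ["Germany", "Germany", "Germany", "Spain", "Spain"],
        "Salary": [54000.0, np.nan, 83000.0, 48000.0, np.nan],
    }
)

# transform("mean") returns each row's group mean, aligned to the original index.
group_mean = demo.groupby("Country")["Salary"].transform("mean")

# Fill only the Salary column; row order and other columns are untouched.
demo["Salary"] = demo["Salary"].fillna(group_mean)
print(demo)
```

Because the group means stay aligned with the original index, the rows keep their order and no `pd.concat` is needed.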


In [None]:
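# Assign the result back; fillna(..., inplace=True) on a selected column may not update df in recent pandas versions
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())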
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,84.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
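
The mean is only one of the strategies mentioned earlier. As an illustration, the sketch below compares the mean and the median of the `Age` values from the dataset above; choosing the median for the fill is an assumption made for this example, and `age` and `age_filled` are illustrative names:

```python
import numpy as np
import pandas as pd

# Age values copied from the dataset above; 84 lies far from the rest.
age = pd.Series([84.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0, np.nan])

# mean() and median() skip NaN by default.
print(round(age.mean(), 2))  # 43.22
print(age.median())          # 38.0

# Assign the filled result to a new series instead of modifying in place.
age_filled = age.fillna(age.median())
print(age_filled.isna().sum())  # 0
```

The single large value pulls the mean above the median, which is why the median is often preferred when a column contains outliers.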


## 5. Verifying Data Integrity
After imputation, we should check which columns still contain missing values. `Salary` now has none, while the two missing `Age` values have not been imputed yet.

In [25]:
df.isnull().sum()

Country      0
Age          2
Salary       0
Purchased    0
dtype: int64

**For a specific column**

In [None]:
df["Salary"].isnull()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
Name: Salary, dtype: bool
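
For a compact check, the per-column counts can be reduced to just the columns that still need attention. The following is a minimal sketch; `columns_with_missing` is a hypothetical helper and `demo` is a small made-up frame:

```python
import numpy as np
import pandas as pd


def columns_with_missing(frame: pd.DataFrame) -> dict:
    """Return {column name: missing count} for columns that still contain NaN."""
    counts = frame.isna().sum()
    return {name: int(n) for name, n in counts.items() if n > 0}


# Hypothetical frame: Age still has gaps, Salary is complete.
demo = pd.DataFrame(
    {"Age": [84.0, np.nan, np.nan], "Salary": [72000.0, 48000.0, 54000.0]}
)
print(columns_with_missing(demo))  # {'Age': 2}
```

An empty dictionary means no column has missing values left.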