# Importing Libraries

In this section, we import the libraries, classes, and functions needed for data manipulation, preprocessing, and splitting the data for machine learning.

- `numpy`: A fundamental package for numerical computations in Python.
- `pandas`: A powerful data manipulation and analysis library.
- `SimpleImputer`: A class from `sklearn.impute` used for handling missing data by imputing missing values.
- `LabelEncoder`: A class from `sklearn.preprocessing` used to convert categorical labels into numerical values.
- `OneHotEncoder`: A class from `sklearn.preprocessing` used to convert a categorical variable into one binary indicator column per category, so that models do not read an artificial ordering into the category codes.
- `ColumnTransformer`: A class from `sklearn.compose` used for applying different preprocessing steps to specific columns of the dataset.
- `train_test_split`: A function from `sklearn.model_selection` used to split the dataset into training and testing sets.
- `StandardScaler`: A class from `sklearn.preprocessing` used for standardizing features by removing the mean and scaling to unit variance.

Let's import these libraries.


In [2]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split


# Loading the Dataset

Here, we load the dataset from a CSV file into a DataFrame using `pandas`. This DataFrame (`df`) will allow us to inspect, manipulate, and preprocess the data.

- `pd.read_csv("Data.csv")`: Reads the CSV file named `Data.csv` into a DataFrame.
- `df.head()`: Returns the first five rows of the DataFrame (the default), giving an overview of the dataset's structure and contents.

Let's preview the dataset.


In [3]:
df = pd.read_csv("Data.csv")
print("Dataset preview:")
print(df.head())


Dataset preview:
  Country  Age   Salary Purchased
0  Mumbai   34  55000.0       Yes
1  Nagpur   28  49000.0        No
2    Pune   45  75000.0       Yes
3  Nagpur   53      NaN        No
4  Mumbai   32  58000.0       Yes
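
The preview already shows a `NaN` in the `Salary` column, so it can help to count missing entries per column before preprocessing. The snippet below is a minimal sketch: `sample` is a hypothetical stand-in built from the five rows shown in the preview, not the full `Data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first five rows of Data.csv shown in the preview
sample = pd.DataFrame({
    "Country": ["Mumbai", "Nagpur", "Pune", "Nagpur", "Mumbai"],
    "Age": [34, 28, 45, 53, 32],
    "Salary": [55000.0, 49000.0, 75000.0, np.nan, 58000.0],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes"],
})

# Count the missing entries in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the real dataset, the same call would be `df.isna().sum()`.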


# Checking Data Types

Understanding the data types of each column is crucial for preprocessing:

- `object`: Typically represents categorical data or strings.
- `int64` and `float64`: Numeric data types, where `int64` represents integers and `float64` represents floating-point numbers.

By examining the data types, we can determine appropriate preprocessing steps for each column.

Let's check the data types of the columns.


In [4]:
print("\nData types of each column:")
print(df.dtypes)



Data types of each column:
Country       object
Age            int64
Salary       float64
Purchased     object
dtype: object
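
One way to turn the data types into a preprocessing plan is to group the columns by type. The following is a small sketch using `select_dtypes`; `sample` is again a hypothetical stand-in for the previewed rows, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first five rows of Data.csv
sample = pd.DataFrame({
    "Country": ["Mumbai", "Nagpur", "Pune", "Nagpur", "Mumbai"],
    "Age": [34, 28, 45, 53, 32],
    "Salary": [55000.0, 49000.0, 75000.0, np.nan, 58000.0],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes"],
})

# Numeric columns are candidates for imputation and scaling
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else is treated here as categorical and needs encoding
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```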


# Selecting Independent and Dependent Variables

In machine learning, we need to separate our features (independent variables) from the target variable (dependent variable):

- `X`: Contains the feature columns (`Country`, `Age`, `Salary`). We select all rows and all columns except the last one.
- `y`: Contains the target column. We select the column at index 3, which is the last column (`Purchased`).

These selections will be used for further preprocessing and model training.

Let's extract the independent and dependent variables.


In [6]:
X = df.iloc[:, :-1].values  # Independent variables (all rows, all columns except the last one)
y = df.iloc[:, 3].values    # Dependent variable (all rows, the column at index 3)
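
Positional indexing depends on the column order in the file. As an alternative sketch, the same split can be expressed by column name, which stays correct if columns are reordered. `sample` is a hypothetical stand-in for the previewed rows, and it assumes the target column is named `Purchased` as in the preview.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first five rows of Data.csv
sample = pd.DataFrame({
    "Country": ["Mumbai", "Nagpur", "Pune", "Nagpur", "Mumbai"],
    "Age": [34, 28, 45, 53, 32],
    "Salary": [55000.0, 49000.0, 75000.0, np.nan, 58000.0],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes"],
})

# Positional selection, as in the cell above
X_pos = sample.iloc[:, :-1].values
y_pos = sample.iloc[:, -1].values

# Name-based selection of the same data
X_named = sample.drop(columns="Purchased")
y_named = sample["Purchased"]

print(X_pos.shape, y_pos.shape)
print(list(X_named.columns))
```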

# Handling Missing Values

Missing values in the dataset can be imputed to ensure that our machine learning models can handle the data properly:

- `SimpleImputer`: This class allows us to replace missing values with a specific strategy.
- `strategy='mean'`: The mean of the column will replace missing values.

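In this case, we apply mean imputation to columns 1 and 2 of `X` (`Age` and `Salary`), the numeric columns that may contain missing values. The slice `1:3` excludes its end index, so it covers exactly these two columns.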

Let's handle missing values in our dataset.


In [8]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])  # Apply imputation to columns 1 and 2
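
To make the effect of mean imputation concrete, here is a small worked sketch on the `Age` and `Salary` values from the preview (a hypothetical five-row subset, not the full dataset). After fitting, the learned column means are available in `statistics_`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary from the five previewed rows; one Salary value is missing
numeric = np.array([
    [34, 55000.0],
    [28, 49000.0],
    [45, 75000.0],
    [53, np.nan],
    [32, 58000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

# Column means learned from the non-missing values
print(imputer.statistics_)

# The missing Salary is replaced by the mean of the other four salaries:
# (55000 + 49000 + 75000 + 58000) / 4 = 59250
print(filled[3, 1])
```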


# Encoding Categorical Data

Machine learning algorithms require numerical inputs, so categorical data must be converted to numerical format:

- `LabelEncoder`: Converts categorical labels into integers, assigned in sorted order of the labels (e.g., 'No' → 0, 'Yes' → 1).
- `OneHotEncoder`: Converts categorical variables into a binary matrix (e.g., 'Mumbai' → [1, 0, 0], 'Nagpur' → [0, 1, 0]).

`ColumnTransformer` applies the one-hot encoding to the specified columns. Because `remainder='passthrough'` is set, the other columns are kept unchanged and appended after the new one-hot columns.

Let's encode the categorical data.


In [10]:
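# Encode the categorical values in the first column as integers.
# Note: this step is likely redundant on scikit-learn 0.20 and later, where OneHotEncoder accepts string categories directly.
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])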

ct = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(categories='auto'), [0])], 
    remainder='passthrough'
)
X = ct.fit_transform(X)  # Apply one-hot encoding to the first column
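
The sketch below shows the one-hot step in isolation on the previewed rows, applied directly to the string values and without a prior `LabelEncoder` pass. It assumes a scikit-learn version in which `OneHotEncoder` accepts strings; `features` is a hypothetical stand-in for `X`, and `sparse_threshold=0` is set here only to force a dense result that is easy to print.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for X after imputation: Country, Age, Salary
features = np.array([
    ["Mumbai", 34, 55000.0],
    ["Nagpur", 28, 49000.0],
    ["Pune", 45, 75000.0],
    ["Nagpur", 53, 59250.0],
    ["Mumbai", 32, 58000.0],
], dtype=object)

ct = ColumnTransformer(
    [("one_hot_encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = np.asarray(ct.fit_transform(features), dtype=float)

# Categories are sorted, so the indicator columns follow this order
categories = ct.named_transformers_["one_hot_encoder"].categories_[0]
print(list(categories))
print(encoded)
```

The first three columns of `encoded` are the indicators for the sorted categories, followed by the untouched `Age` and `Salary` columns.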


# Encoding the Dependent Variable

The target variable `y` contains the categorical labels 'Yes' and 'No' from the `Purchased` column, which need to be converted into numerical values:

- `LabelEncoder`: Converts categorical labels into numerical values for easier processing by machine learning algorithms.

This step transforms labels such as 'Yes' and 'No' into numerical values (e.g., 1 and 0).

Let's encode the dependent variable.


In [13]:
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)  # Encode the categorical labels in y
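
As a quick check of the mapping, the fitted encoder exposes the learned labels in `classes_`, and `inverse_transform` recovers the original strings. The snippet is a minimal sketch on the five `Purchased` values from the preview; `labels` and `le` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

# Purchased values from the five previewed rows
labels = ["Yes", "No", "Yes", "No", "Yes"]

le = LabelEncoder()
encoded_labels = le.fit_transform(labels)

# Classes are stored in sorted order; the index of each class is its code
print(list(le.classes_))
print(list(encoded_labels))

# Map the integer codes back to the original labels
decoded = le.inverse_transform(encoded_labels)
print(list(decoded))
```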

# Splitting the Dataset

To evaluate the performance of a machine learning model, we need to split the dataset into training and testing sets:

- `train_test_split`: Splits the data into training and testing subsets.
- `test_size=0.2`: 20% of the data is used for testing, while 80% is used for training.
- `random_state=0`: Ensures reproducibility by using a fixed random seed.

Let's split the dataset into training and testing sets.


In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
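
The effect of `test_size` and `random_state` can be checked on a toy example. The arrays below are hypothetical placeholders (ten samples, two features), not the tutorial's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, and a binary target
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # 8 training rows, 2 test rows

# Repeating the call with the same seed yields the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))
```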

# Feature Scaling

Feature scaling is crucial for algorithms that rely on distance calculations or are sensitive to the scale of features:

- `StandardScaler`: Standardizes features by removing the mean and scaling to unit variance.
- `fit_transform(X_train)`: Computes the mean and standard deviation from the training set and applies scaling to it.
- `transform(X_test)`: Applies the same scaling to the test set, reusing the training-set mean and standard deviation so that no information from the test set leaks into preprocessing.

Let's scale the features in the training and testing sets.


In [18]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # Fit and transform the training set
X_test = sc_X.transform(X_test)        # Transform the test set
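
Standardization computes z = (x - mean) / std for each column, with the mean and standard deviation taken from the training set. The sketch below verifies this on a hypothetical four-row training set and one-row test set built from the previewed `Age` and `Salary` values; the split shown is illustrative, not the one produced by `train_test_split` above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows: Age, Salary
train = np.array([
    [34.0, 55000.0],
    [28.0, 49000.0],
    [45.0, 75000.0],
    [32.0, 58000.0],
])
test = np.array([[53.0, 59250.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# After scaling, each training column has mean 0 and standard deviation 1
print(train_scaled.mean(axis=0))
print(train_scaled.std(axis=0))

# The test row is scaled with the training mean and standard deviation
manual = (test - scaler.mean_) / scaler.scale_
print(np.allclose(test_scaled, manual))
```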