<a href="https://colab.research.google.com/github/yoseforaz0990/ML-templates/blob/main/data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

| Step                                    | Description                                                                                                          |
|-----------------------------------------|----------------------------------------------------------------------------------------------------------------------|
| **Step 1: Importing the libraries**     | Import the Python libraries used for preprocessing: pandas, NumPy, and the relevant scikit-learn modules.|
| **Step 2: Importing the dataset**       | Load the dataset from a CSV file or other data sources into a pandas DataFrame for further processing.                |
| **Step 3: Taking care of missing data** | Handle missing values in the dataset, for example with `SimpleImputer`, which replaces them with the column mean, median, most frequent value (mode), or a constant. The mean and median strategies apply to numeric columns only.|
| **Step 4: Encoding the categorical independent variables**   | Convert categorical feature columns to numeric form with `OneHotEncoder`, which creates one binary column per category. These binary columns replace the original categorical columns. |
| **Step 5: Encoding the Dependent Variable** | If the dependent variable (target) is categorical, use `LabelEncoder` to convert its categories into integer labels.|
| **Step 6: Splitting the dataset into Training set and Test set** | Split the dataset into a training set and a test set. The model is trained on the training set and evaluated on the test set, which it has not seen during training. |
| **Step 7: Feature Scaling**            | Scale the features to a similar range. Common techniques are standardization and min-max scaling. Fit the scaler on the training set only, then apply it to the test set.   |
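
The table names several options for Steps 3 and 7. As a rough illustration, the sketch below fills one gap in a made-up numeric column with three `SimpleImputer` strategies (`most_frequent` is scikit-learn's name for the mode), then scales the completed column with both `StandardScaler` and `MinMaxScaler`. The toy values are assumptions chosen only for demonstration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy column with one missing value (row index 3)
col = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

# Each strategy fills the gap with a different statistic of the observed values
filled = {
    strategy: SimpleImputer(strategy=strategy).fit_transform(col)[3, 0]
    for strategy in ("mean", "median", "most_frequent")
}
print(filled)

# Scale the mean-imputed column in two ways
complete = SimpleImputer(strategy="mean").fit_transform(col)
standardized = StandardScaler().fit_transform(complete)  # zero mean, unit variance
min_max = MinMaxScaler().fit_transform(complete)         # rescaled to [0, 1]
print(standardized.ravel())
print(min_max.ravel())
```

Standardization is less sensitive to the exact minimum and maximum, while min-max scaling guarantees a fixed output range; which one suits a dataset depends on the model being trained.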


In [None]:
# Step 1: Import the libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer

# Step 2: Import the dataset
dataset = pd.read_csv('dataset.csv')
X = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, -1].values   # target: the last column

# Step 3: Handle missing data
# The 'mean' strategy only works on numeric data, so impute the numeric columns only
# Specify the column index/indices of numeric columns in X
numeric_columns = [column_index1, column_index2, ...]
imputer = SimpleImputer(strategy='mean')
X[:, numeric_columns] = imputer.fit_transform(X[:, numeric_columns])

# Step 4: Encode categorical data
# Specify the column index/indices of categorical columns in X
categorical_columns = [column_index1, column_index2, ...]
# sparse_threshold=0 makes the transformer always return a dense array
ct = ColumnTransformer([("encoder", OneHotEncoder(), categorical_columns)], remainder='passthrough', sparse_threshold=0)
X = np.array(ct.fit_transform(X), dtype=float)  # cast to a float array

# Step 5: Encode the Dependent Variable (if it is categorical)
label_encoder_y = LabelEncoder()
y = label_encoder_y.fit_transform(y)

# Step 6: Split the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 7: Feature Scaling (if necessary)
# Fit the scaler on the training set only, then apply it to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
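
The template above depends on a `dataset.csv` file and on column indices that have to be filled in. To try the steps end to end without a file, here is a minimal sketch that swaps the CSV for a small in-memory DataFrame. The column names, values, and index lists are hypothetical stand-ins, not part of the original template.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Hypothetical stand-in for dataset.csv: one categorical feature, two numeric
# features with gaps, and a categorical target in the last column
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

X = dataset.iloc[:, :-1].values  # features
y = dataset.iloc[:, -1].values   # target

categorical_columns = [0]  # "Country" (assumed position)
numeric_columns = [1, 2]   # "Age", "Salary" (assumed positions)

# Impute missing numeric values with the column mean
imputer = SimpleImputer(strategy="mean")
X[:, numeric_columns] = imputer.fit_transform(X[:, numeric_columns])

# One-hot encode the categorical column; the encoded columns come first,
# followed by the untouched numeric columns
ct = ColumnTransformer(
    [("encoder", OneHotEncoder(), categorical_columns)],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X = np.array(ct.fit_transform(X), dtype=float)

# Encode the target labels as integers
y = LabelEncoder().fit_transform(y)

# Split, then scale using statistics from the training set only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

print(X_train.shape, X_test.shape)  # (8, 5) (2, 5)
```

Three one-hot columns plus two numeric columns give five features, and a 20% test split of ten rows leaves eight rows for training and two for testing.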
