## Feature Engineering

## 1. What is a parameter?

Parameter in Feature Engineering / Machine Learning

A parameter is a value that defines the behavior of a model or feature transformation and is learned from the data during training.

Parameters are internal to the model (like weights and biases in a neural network).

They are optimized automatically to minimize the loss function.
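
As a minimal sketch of what "learned from the data" means, the snippet below fits scikit-learn's LinearRegression on made-up points generated from y = 2x + 1. The data and variable names are illustrative assumptions; the point is only that `coef_` and `intercept_` are filled in by `fit()` rather than set by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative, noise-free data generated from y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)

# These values are parameters: the model learned them from X and y
learned_weight = model.coef_[0]
learned_bias = model.intercept_
print("weight:", round(learned_weight, 3), "bias:", round(learned_bias, 3))
```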

## 2. What is correlation?

In statistics and data analysis, correlation measures the strength and direction of a relationship between two variables.

Key Points:

Range:

Correlation values (denoted as r) range from -1 to +1.

r=+1 → Perfect positive correlation (both variables increase together)

r=−1 → Perfect negative correlation (one increases, the other decreases)

r=0 → No linear correlation

Types of Correlation:

Positive Correlation: Both variables move in the same direction.

Example: Height and weight

Negative Correlation: Variables move in opposite directions.

Example: Speed of a car and time taken to reach a destination

No Correlation: There is no linear relationship between the variables (r close to 0). They may still be related in a non-linear way.

Methods to Measure Correlation:

Pearson correlation: Measures linear relationship between continuous variables.

Spearman correlation: Measures monotonic relationship (does not require linearity).

Kendall correlation: Another rank-based correlation method.
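
The difference between the three methods can be sketched with a relationship that is monotonic but not linear. The example below uses SciPy's `pearsonr`, `spearmanr`, and `kendalltau` on made-up data where y = x³; the data is an assumption chosen only for illustration.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

# Monotonic but non-linear relationship (illustrative data)
x = np.arange(1, 11)
y = x ** 3

r, _ = pearsonr(x, y)      # linear association
rho, _ = spearmanr(x, y)   # rank-based, monotonic association
tau, _ = kendalltau(x, y)  # rank-based, pairwise agreement

print("Pearson: ", round(r, 3))
print("Spearman:", round(rho, 3))
print("Kendall: ", round(tau, 3))
```

Because y always increases with x, the two rank-based measures report a perfect relationship, while Pearson stays below 1 since the points do not lie on a straight line.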

## What does negative correlation mean?

Negative correlation means that two variables move in opposite directions.

When one variable increases, the other variable decreases.

The correlation coefficient r is less than 0 (−1 ≤ r < 0).

Key Points:

Range:

r=−1 → Perfect negative correlation (exact opposite relationship)

r=0 → No linear relationship

r=−0.5 → Moderate negative correlation

Interpretation:

Stronger negative values (closer to −1) → stronger inverse relationship

Closer to 0 → weaker inverse relationship

## 3. Define Machine Learning. What are the main components in Machine Learning?

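Machine Learning is a branch of artificial intelligence in which a model learns patterns from data and uses them to make predictions or decisions, instead of following rules that were explicitly programmed.

Main Components of Machine Learning: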

Data:

Raw information used to train and evaluate the model.

Can be structured (tables, numbers) or unstructured (images, text, audio).

Features:

Individual measurable properties or characteristics of the data.

Example: In predicting house prices, features could be number of rooms, area, location.

Model:

A mathematical representation or algorithm that learns patterns from data.

Examples: Linear regression, decision trees, neural networks.

Learning Algorithm:

The method used to train the model and adjust its parameters to fit the data.

Examples: Gradient descent, stochastic gradient descent (combined with backpropagation in neural networks).

Parameters:

Internal values learned by the model from data during training (e.g., weights and biases in neural networks).

Objective / Loss Function:

A function that measures how well the model is performing.

The learning algorithm tries to minimize the loss.

Evaluation / Metrics:

Methods to check the model’s performance.

Examples: Accuracy, precision, recall, mean squared error.

## 4. How does loss value help in determining whether the model is good or not?

What is a Loss Value?

The loss value (or loss function) measures how well a machine learning model’s predictions match the true values.

It quantifies the error of the model.

Lower loss → predictions are closer to actual values; higher loss → predictions are farther off.

How Loss Value Helps Evaluate a Model

Model Training:

During training, the learning algorithm tries to minimize the loss by adjusting the model’s parameters.

Example: In linear regression, minimizing Mean Squared Error (MSE) adjusts the weights to fit the line closely to the data points.

Model Comparison:

If you have multiple models, the one with lower loss on the validation set is usually better.

Example: Comparing two regression models:

Model A: MSE = 5

Model B: MSE = 12 → Model A is better.

Overfitting / Underfitting Detection:

Training loss decreases over time.

Validation loss helps detect:

Underfitting: Both training & validation loss are high → model too simple.

Overfitting: Training loss is low, but validation loss is high → model memorized training data, performs poorly on new data.

Guides Model Improvement:

High loss → need better features, more data, or a different model.

Low loss → model is performing well on the training/validation data.
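
The model comparison idea above can be sketched with a hand-computed Mean Squared Error. The true values and the two sets of predictions are made-up numbers for illustration, not results from real models.

```python
import numpy as np

# Illustrative validation targets and predictions from two hypothetical models
y_true = np.array([3.0, 5.0, 7.0, 9.0])
pred_a = np.array([2.5, 5.5, 7.0, 8.5])
pred_b = np.array([1.0, 7.0, 5.0, 12.0])

def mse(actual, predicted):
    """Mean Squared Error: average of the squared differences."""
    return np.mean((actual - predicted) ** 2)

mse_a = mse(y_true, pred_a)
mse_b = mse(y_true, pred_b)
print("Model A MSE:", mse_a)
print("Model B MSE:", mse_b)
print("Better model:", "A" if mse_a < mse_b else "B")
```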

## 5. What are continuous and categorical variables?

1. Continuous Variables

These are numeric variables that can take any value within a range.

They are measurable, and you can perform arithmetic operations on them (addition, subtraction, etc.).

Often real numbers with decimals.

2. Categorical Variables

These are variables that represent categories or groups.

They are qualitative, not numeric (though sometimes numbers are used as labels).

Usually indicate labels, types, or classes.

Types of Categorical Variables:

Nominal: Categories without any order (e.g., color, gender)

Ordinal: Categories with a natural order (e.g., education level: High School < Bachelor < Master)
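
One way to make these distinctions explicit in code is through pandas dtypes. The sketch below uses a small made-up DataFrame (column names and values are assumptions) with one continuous column, one nominal column, and one ordinal column whose order is declared explicitly.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150.5, 172.3, 165.0],                    # continuous
    "color": ["Red", "Blue", "Green"],                     # nominal
    "education": ["Bachelor", "High School", "Master"],    # ordinal
})

# Nominal: categories without an order
df["color"] = df["color"].astype("category")

# Ordinal: categories with an explicitly declared order
df["education"] = pd.Categorical(
    df["education"],
    categories=["High School", "Bachelor", "Master"],
    ordered=True,
)

print(df.dtypes)
education_codes = df["education"].cat.codes.tolist()
print(education_codes)  # integer position of each value in the declared order
```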

## 6. How do we handle categorical variables in Machine Learning? What are the common techniques?

Common Techniques to Handle Categorical Variables
1. Label Encoding

Assigns a unique integer to each category.

Useful for ordinal variables (where order matters).

Example:

Education Level: High School → 0, Bachelor → 1, Master → 2

Note: scikit-learn's LabelEncoder numbers the categories in alphabetical order, so the cell below actually produces Bachelor → 0, High School → 1, Master → 2. To keep a custom order, define the mapping explicitly (for example with OrdinalEncoder and an explicit category list).

Note: Not suitable for nominal variables with no order, as it may imply a hierarchy.




In [1]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
education = ["High School", "Bachelor", "Master"]
encoded = le.fit_transform(education)
print(encoded)  # Output: [1 0 2] (categories are numbered alphabetically)


[1 0 2]


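2. One-Hot Encoding

Creates one binary (0/1) column per category, so no order is implied. Suitable for nominal variables.

In [3]: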
from sklearn.preprocessing import OneHotEncoder
import numpy as np

colors = np.array([["Red"], ["Blue"], ["Green"]])
ohe = OneHotEncoder()
encoded = ohe.fit_transform(colors)
print(encoded)


  (0, 2)	1.0
  (1, 0)	1.0
  (2, 1)	1.0
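
The printed output above is a sparse matrix, listing only the positions of the 1s. A small sketch of how to read it: converting to a dense array with `.toarray()` and inspecting `categories_` shows which column belongs to which category. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["Red"], ["Blue"], ["Green"]])
ohe = OneHotEncoder()

# Sparse result converted to a regular NumPy array for readability
dense = ohe.fit_transform(colors).toarray()

# Column order follows the sorted categories
categories = ohe.categories_[0].tolist()

print(categories)
print(dense)
```

Each row has a single 1 in the column of its category, which matches the (row, column) pairs shown in the sparse printout.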


## 7. What do you mean by training and testing a dataset?

1. Training a Dataset

Training a dataset means using a portion of your data to teach the machine learning model how to make predictions or recognize patterns.

During training:

The model learns the relationships between input features (X) and target/output (Y).

Model parameters (like weights in linear regression or neural networks) are adjusted to minimize error (loss function).

Example:

Predicting house prices:

Features: Size, location, number of rooms

Target: Price

Model uses the training data to learn how these features affect price.

2. Testing a Dataset

Testing a dataset means using a separate portion of the data (not seen by the model during training) to evaluate the model’s performance.

It helps check if the model can generalize to new, unseen data.

Common metrics for testing: Accuracy, Mean Squared Error (MSE), Precision, Recall, F1-score, etc.

Example:

After training the house price model, you give it new houses from the testing dataset.

Compare predicted prices with actual prices to see how well the model performs.
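
A minimal sketch of the train-then-test workflow, assuming synthetic house data where price is roughly 3 times the size plus noise. The numbers, the noise level, and the variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: price ≈ 3 * size + noise (illustrative only)
rng = np.random.default_rng(0)
size = rng.uniform(50, 200, size=100).reshape(-1, 1)
price = 3.0 * size.ravel() + rng.normal(0, 5, size=100)

# Hold out 20% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    size, price, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)   # training
test_mse = mean_squared_error(y_test, model.predict(X_test))  # testing

print("Learned slope:", model.coef_[0])
print("Test MSE:", test_mse)
```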

## 8. What is sklearn.preprocessing?

sklearn.preprocessing

sklearn.preprocessing is a module in the scikit-learn library in Python.

It provides tools and functions to transform or scale your data before feeding it into a machine learning model.

Preprocessing is important because raw data often contains different scales, units, or categorical values, and many ML algorithms work better with normalized or standardized data.

Common Tasks in sklearn.preprocessing

1. Scaling / Normalization

Adjusting feature values to a common scale.

Examples:

StandardScaler → scales data to zero mean and unit variance

MinMaxScaler → scales data to a range between 0 and 1

In [4]:
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[10, 200],
                 [20, 300],
                 [30, 400]])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)


[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]


2. Encoding Categorical Variables

Convert categorical features into numeric representations.

Examples:

OneHotEncoder → converts categories into binary vectors

LabelEncoder → assigns unique integers to categories

In [6]:
from sklearn.preprocessing import OneHotEncoder

colors = [["Red"], ["Blue"], ["Green"]]
encoder = OneHotEncoder()
encoded_colors = encoder.fit_transform(colors)
print(encoded_colors)


  (0, 2)	1.0
  (1, 0)	1.0
  (2, 1)	1.0


3. Binarization

Convert values into 0 or 1 based on a threshold.

Useful for transforming features into binary indicators.

4. Polynomial Features

Generate new features by combining existing features (e.g., x1², x1*x2) for polynomial regression.
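
Binarization and polynomial features can be sketched with scikit-learn's `Binarizer` and `PolynomialFeatures`. The exam scores, the threshold of 50, and the input row are made-up values for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, PolynomialFeatures

# Binarization: values strictly greater than the threshold become 1, others 0
scores = np.array([[35.0], [50.0], [72.0]])
passed = Binarizer(threshold=50.0).fit_transform(scores)
print(passed)

# Polynomial features of degree 2 for one row [x1, x2] = [2, 3]
# Columns produced: x1, x2, x1^2, x1*x2, x2^2
x = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(x)
print(expanded)
```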

## 9. What is a Test set?

A test set in machine learning is a subset of the dataset that is kept separate from the training data and is used to evaluate the performance of a trained model.

Key Points:

Purpose:

To check how well the model can generalize to new, unseen data.

Helps identify if the model is overfitting or underfitting.

Characteristics:

Not used during training — the model never sees this data while learning.

Typically 20–30% of the total dataset, though the exact split can vary.

Evaluation Metrics:

Regression: Mean Squared Error (MSE), R² score

Classification: Accuracy, Precision, Recall, F1-score

## 10. How do we split data for model fitting (training and testing) in Python?
## How do you approach a Machine Learning problem?

Splitting Data for Training and Testing in Python

In machine learning, we usually split data into training and testing sets (sometimes also a validation set) to train the model and evaluate its performance.

Using scikit-learn:

In [7]:
from sklearn.model_selection import train_test_split
import pandas as pd

# Example dataset
data = pd.DataFrame({
    'feature1': [1,2,3,4,5,6,7,8,9,10],
    'feature2': [5,3,6,2,7,1,8,2,9,0],
    'target':   [0,1,0,1,0,1,0,1,0,1]
})

# Features and target
X = data[['feature1', 'feature2']]
y = data['target']

# Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training Features:\n", X_train)
print("Testing Features:\n", X_test)


Training Features:
    feature1  feature2
0         1         5
7         8         2
2         3         6
9        10         0
4         5         7
3         4         2
6         7         8
Testing Features:
    feature1  feature2
8         9         9
1         2         3
5         6         1


Approach to a Machine Learning Problem

Here’s a step-by-step approach commonly used in ML projects:

1. Define the Problem

Understand the goal: classification, regression, clustering, etc.

Identify inputs (features) and outputs (target).

2. Collect and Explore Data

Gather the dataset.

Perform exploratory data analysis (EDA): check distributions, missing values, correlations, and outliers.

3. Preprocess Data

Handle missing values.

Encode categorical variables.

Scale numeric features (normalization/standardization).

Split data into training and testing sets.

4. Choose Model(s)

Select algorithm(s) suitable for the problem:

Regression → Linear Regression, Random Forest Regressor

Classification → Logistic Regression, Decision Trees, SVM

Clustering → KMeans, DBSCAN

5. Train the Model

Fit the model on training data.

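Parameters are learned automatically during fitting; hyperparameters (e.g., learning rate, tree depth) are tuned separately, ideally on a validation set.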

6. Evaluate Model

Use test data to evaluate performance using metrics like accuracy, F1-score, MSE, R², etc.

Check for overfitting and underfitting.

7. Improve Model

Feature engineering (create better features)

Try different algorithms or hyperparameter tuning

Use techniques like cross-validation for better generalization

8. Deploy / Predict

Use the trained model to predict new unseen data in real-world applications.
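
The steps above can be sketched end to end on the Iris dataset that ships with scikit-learn. The choice of dataset, the scaler, and Logistic Regression are assumptions made to keep the example short; a real project would compare several options.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a classification problem with a ready-made dataset
X, y = load_iris(return_X_y=True)

# Step 3: split the data (scaling happens inside the pipeline)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Steps 4-5: choose and train a model
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

# Step 6: evaluate on unseen data
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print("Test accuracy:", accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so no information from the test set leaks into preprocessing.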

## 11. Why do we have to perform EDA before fitting a model to the data?

What is EDA?

Exploratory Data Analysis (EDA) is the process of analyzing and visualizing your dataset before building a machine learning model.
It helps you understand the structure, quality, and patterns in the data.

Why EDA is Important Before Fitting a Model

Identify Missing or Incorrect Data

Detect NaNs, null values, or erroneous entries.

Example: A dataset with missing age values or negative salaries.

Without handling missing/incorrect data → model may give wrong predictions.

Understand Feature Distributions

Check how numeric variables are distributed (normal, skewed, uniform).

Helps decide scaling, transformation, or feature engineering.

Detect Outliers

Outliers can bias the model and affect metrics like mean squared error.

EDA helps you spot and handle them.

Check Relationships and Correlations

Identify which features are strongly related to the target.

Helps select important features and avoid redundant ones.

Understand Data Types

Detect categorical vs continuous variables.

Helps decide encoding methods for ML models.

Visualize Patterns

Scatter plots, histograms, box plots, heatmaps help uncover hidden patterns.

Example: Linear relationship between house size and price.

Prevent Model Failures

ML models assume certain things about data (e.g., numeric inputs, no missing values).

EDA ensures data quality and suitability for the chosen algorithm.
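
A few of these checks can be sketched with pandas on a tiny made-up dataset containing one missing age and one negative salary. The column names, values, and the 1.5 × IQR outlier rule are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "salary": [40000, 52000, 48000, -1000, 45000],
})

# Missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Obviously incorrect entries
negative_salaries = int((df["salary"] < 0).sum())
print("Negative salaries:", negative_salaries)

# Summary statistics for the numeric columns
print(df.describe())

# Outliers by the common 1.5 * IQR rule
q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]
print(outliers)
```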

## 12. What is correlation?

In statistics and data analysis, correlation measures the strength and direction of a relationship between two variables.

Key Points:

Range:

Correlation values (denoted as r) range from -1 to +1.

r=+1 → Perfect positive correlation (both variables increase together)

r=−1 → Perfect negative correlation (one increases, the other decreases)

r=0 → No linear correlation

Types of Correlation:

Positive Correlation: Both variables move in the same direction.

Example: Height and weight

Negative Correlation: Variables move in opposite directions.

Example: Speed of a car and time taken to reach a destination

No Correlation: There is no linear relationship between the variables (r close to 0). They may still be related in a non-linear way.

Methods to Measure Correlation:

Pearson correlation: Measures linear relationship between continuous variables.

Spearman correlation: Measures monotonic relationship (does not require linearity).

Kendall correlation: Another rank-based correlation method.

## 13. What does negative correlation mean?

Negative correlation means that two variables move in opposite directions.

When one variable increases, the other variable decreases.

The correlation coefficient r is less than 0 (−1 ≤ r < 0).

Key Points:

Range:

r=−1 → Perfect negative correlation (exact opposite relationship)

r=0 → No linear relationship

r=−0.5 → Moderate negative correlation

Interpretation:

Stronger negative values (closer to −1) → stronger inverse relationship

Closer to 0 → weaker inverse relationship
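
As a small illustration, the speed and travel-time example can be checked with NumPy. The speeds and the 120 km distance are assumed values.

```python
import numpy as np

# Assumed trip of 120 km driven at different speeds
speed = np.array([20, 40, 60, 80, 100])   # km/h
time = 120 / speed                        # hours

r = np.corrcoef(speed, time)[0, 1]
print("Correlation between speed and time:", round(r, 3))
```

The coefficient is negative because time falls as speed rises. It is not exactly −1 because time is proportional to 1/speed, which is an inverse relationship but not a straight line.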

## 14. How can you find correlation between variables in Python?

You can find correlation between variables in Python using pandas and NumPy. The most common method is Pearson correlation, though other methods like Spearman or Kendall can also be used.

Here’s a detailed explanation:

In [9]:
import pandas as pd

# Example dataset
data = pd.DataFrame({
    'Height': [150, 160, 170, 180, 190],
    'Weight': [50, 60, 65, 80, 90],
    'Age': [20, 21, 19, 25, 23]
})

# Compute correlation matrix (default: Pearson)
correlation_matrix = data.corr()
print(correlation_matrix)


          Height    Weight       Age
Height  1.000000  0.990148  0.656532
Weight  0.990148  1.000000  0.734572
Age     0.656532  0.734572  1.000000


Each value shows the correlation coefficient r between two variables.

1 = perfect positive correlation, -1 = perfect negative correlation, 0 = no linear correlation.

In [11]:
#Using NumPy
import numpy as np

height = np.array([150, 160, 170, 180, 190])
weight = np.array([50, 60, 65, 80, 90])

correlation = np.corrcoef(height, weight)[0, 1]
print("Correlation between height and weight:", correlation)

Correlation between height and weight: 0.9901475429766743


## 15. What is causation? Explain difference between correlation and causation with an example.

1. What is Causation?

Causation (or causal relationship) occurs when one variable directly influences or causes a change in another variable.

In other words, a change in X leads to a change in Y.

Causation implies a cause-and-effect relationship, not just an association.

Example:

Smoking → Lung cancer

Smoking causes an increased risk of lung cancer.


Example to Illustrate the Difference

Scenario:

Data shows that ice cream sales and drowning incidents are positively correlated.

Correlation: High ice cream sales ↔ More drownings (r > 0)

Does this mean buying ice cream causes drowning?

No.

Reason: Both are influenced by a third variable (summer/temperature).

Summer → more ice cream sales & more swimming → more drownings

Key takeaway:

Correlation = Association

Causation = Cause-and-effect
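
The ice cream example can be sketched with simulated data in which temperature drives both quantities and neither affects the other. All numbers below are invented for the simulation. After removing the part of each variable explained by temperature, the remaining correlation should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated confounder and two variables that depend only on it
temperature = rng.uniform(10, 35, size=500)
ice_cream_sales = 5.0 * temperature + rng.normal(0, 10, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1, size=500)

# Raw correlation looks strong
raw_r = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Part of y that a straight-line fit on x does not explain."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after controlling for temperature
partial_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print("Raw correlation:", round(raw_r, 3))
print("After controlling for temperature:", round(partial_r, 3))
```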

## 16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

1. What is an Optimizer?

In machine learning and deep learning, an optimizer is an algorithm that adjusts the model’s parameters (weights and biases) to minimize the loss function during training.

The goal of the optimizer is to find the best set of parameters that reduces the error between predicted and actual values.

Optimizers are essential in training neural networks, especially when dealing with large datasets and complex models.

2. How Optimizers Work

The gradients of the loss function with respect to the model parameters are computed (by backpropagation in neural networks) and passed to the optimizer.

Then it updates the parameters in the direction that reduces the loss.

The learning rate determines how big each step is in updating parameters.

3. Common Types of Optimizers
A. Gradient Descent (GD)

Idea: Update parameters in the opposite direction of the gradient of the loss function.

Formula:

$$
\theta = \theta - \eta \cdot \nabla L(\theta)
$$


where 

θ = parameters, 

η = learning rate, 

∇L(θ) = gradient of loss.

Types of Gradient Descent:

Batch Gradient Descent: Uses entire dataset to compute gradient.

Accurate but slow for large datasets.

Stochastic Gradient Descent (SGD): Uses one sample at a time to update.

Fast, but updates are noisy.

Mini-batch Gradient Descent: Uses small batches of data (common in practice).

Balance between speed and stability.

B. Momentum

Adds memory of previous updates to accelerate convergence.

Can help the optimizer move through shallow local minima and flat regions, and smooths noisy updates.

Update formula:

$$
v = \beta v + \eta \nabla L(\theta)
$$

$$
\theta = \theta - v
$$


where 

v = velocity, 

β = momentum factor (e.g., 0.9)

C. AdaGrad (Adaptive Gradient)

Adjusts learning rate for each parameter individually based on historical gradients.

Parameters with large gradients → smaller learning rate, small gradients → larger learning rate.

Good for sparse data.

D. RMSProp (Root Mean Square Propagation)

Improves AdaGrad by using a moving average of squared gradients to prevent learning rate from decreasing too much.

Popular in RNNs and deep learning.

E. Adam (Adaptive Moment Estimation)

Combines Momentum + RMSProp.

Maintains moving averages of both gradients and squared gradients.

Widely used because it’s fast and works well in most cases.

Update formulas (simplified):

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$

$$
\theta = \theta - \eta \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}
$$
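
The update rules above can be sketched in plain Python on a toy loss L(θ) = (θ − 3)², whose minimum is at θ = 3. The loss, learning rates, and step counts are assumptions for illustration. The Adam version also includes the standard bias-correction terms, which the simplified formulas above leave out.

```python
import math

def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def gradient_descent(theta=0.0, lr=0.1, steps=500):
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

def momentum(theta=0.0, lr=0.1, beta=0.9, steps=500):
    v = 0.0
    for _ in range(steps):
        v = beta * v + lr * grad(theta)   # velocity accumulates past gradients
        theta -= v
    return theta

def adam(theta=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

theta_gd = gradient_descent()
theta_momentum = momentum()
theta_adam = adam()
print(theta_gd, theta_momentum, theta_adam)
```

All three should end up very close to 3; on harder, higher-dimensional losses their behaviour differs much more than on this toy problem.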

## 17. What is sklearn.linear_model ?

sklearn.linear_model is a module in the scikit-learn library in Python that provides classes and functions to implement linear models for regression and classification problems.

1. What is a Linear Model?

A linear model tries to predict a target variable (Y) as a linear combination of input features (X).

Formula for regression:

$$
Y = w_1 X_1 + w_2 X_2 + \cdots + w_n X_n + b
$$


w_i = weights (parameters learned from data)

b = bias/intercept

In [13]:
from sklearn.linear_model import LinearRegression
import numpy as np

# Example data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Create linear regression model
model = LinearRegression()

# Fit model on data
model.fit(X, y)

# Predict
predictions = model.predict(np.array([[6]]))
print("Prediction for X=6:", predictions)


Prediction for X=6: [5.8]
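
The prediction above can be reproduced by hand. For a single feature, the least-squares slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below uses only NumPy on the same data; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Least-squares slope and intercept for one feature
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()

# Y = w * X + b evaluated at X = 6
prediction = w * 6 + b
print("w =", w, "b =", b, "prediction =", prediction)
```

This gives w = 0.6 and b = 2.2, so the prediction for X = 6 is 0.6 × 6 + 2.2 = 5.8, matching the output of LinearRegression.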


## 18. What does model.fit() do? What arguments must be given?

model.fit():

In scikit-learn, the fit() method is used to train a machine learning model on a given dataset.

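It takes the training features X (shape [n_samples, n_features]) and, for supervised models, the target values y (shape [n_samples]), and learns the patterns in the data by adjusting the model’s parameters. Unsupervised models such as KMeans need only X.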

After calling fit(), the model is ready to make predictions using predict().
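
A minimal sketch of the expected shapes, using LinearRegression and made-up numbers. The arrays and names are assumptions; the point is that X is two-dimensional, y is one-dimensional, and new data passed to predict() must have the same number of feature columns as the training data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 4 samples, 2 features
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 15.0],
                    [4.0, 30.0]])
y_train = np.array([5.0, 9.0, 10.0, 16.0])   # one target per sample

model = LinearRegression()
fitted = model.fit(X_train, y_train)   # fit() returns the estimator itself

# New data: 2 samples, same 2 features
X_new = np.array([[5.0, 25.0],
                  [6.0, 35.0]])
predictions = model.predict(X_new)

print(X_train.shape, y_train.shape, X_new.shape, predictions.shape)
```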

## 19. What does model.predict() do? What arguments must be given?

model.predict():

In scikit-learn, the predict() method is used to make predictions on new data using a trained machine learning model.

You first train the model using model.fit() on your dataset.

Then, model.predict() uses the learned parameters (like weights, coefficients, or decision boundaries) to predict output values for new inputs.

predictions = model.predict(X_new)

X_new = new input data (features) you want predictions for. This is the only required argument.

The shape of X_new should match the features used during training: [n_samples, n_features], with the same number and order of features as the training data.

## 20. What are continuous and categorical variables?

Continuous Variables

Continuous variables are numeric variables that can take any value within a range. They are measurable, and you can perform arithmetic operations on them like addition, subtraction, or calculating the mean. Examples include height (like 150.5 cm or 172.3 cm), weight (55.2 kg, 70.8 kg), temperature (36.6°C, 37.4°C), or age (25 years, 30.5 years). Continuous variables can have fractional values and are suitable for statistical calculations like mean, variance, and correlation.

Categorical Variables

Categorical variables represent categories or groups. They are qualitative and often indicate labels, types, or classes. Examples include gender (Male, Female), color (Red, Blue, Green), vehicle type (Car, Bike, Truck), or blood group (A, B, AB, O). Categorical variables can be nominal, where categories have no order (like color or gender), or ordinal, where categories have a natural order (like education level: High School < Bachelor < Master).

## 21. What is feature scaling? How does it help in Machine Learning?

1. What is Feature Scaling?

Feature scaling is the process of standardizing or normalizing the range of independent variables (features) in a dataset so that they are on a similar scale.

Different features may have different units or ranges, e.g., height in centimeters (150–200) vs. income in dollars (1000–100000).

Feature scaling ensures that all features contribute equally to the learning process.

2. Common Feature Scaling Techniques

Min-Max Scaling (Normalization)

Scales data to a fixed range, usually 0 to 1.

Formula:

$$
X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
$$


Standardization (Z-score Scaling)

Centers data around zero and scales to unit variance.

Formula:

$$
X_{\text{scaled}} = \frac{X - \mu}{\sigma}
$$



where 

μ = mean, 

σ = standard deviation

MaxAbs Scaling

Scales values to the range [-1, 1] based on the maximum absolute value.

Robust Scaling

Uses median and interquartile range to scale features, useful for datasets with outliers.
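
The formulas above can be checked against scikit-learn's scalers. The sketch below computes standardization and min-max scaling by hand and compares them with StandardScaler and MinMaxScaler, then shows MaxAbsScaler on a small signed column. The data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

X = np.array([[150.0, 2000.0],
              [160.0, 3000.0],
              [170.0, 4000.0]])

# Standardization by hand: (X - mean) / std, using the population std (ddof=0)
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X)

# Min-max scaling by hand: (X - min) / (max - min)
mm_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
mm_sklearn = MinMaxScaler().fit_transform(X)

# MaxAbs scaling divides each column by its largest absolute value
signed = np.array([[-4.0], [2.0], [8.0]])
maxabs_scaled = MaxAbsScaler().fit_transform(signed)

print(np.allclose(z_manual, z_sklearn), np.allclose(mm_manual, mm_sklearn))
print(maxabs_scaled.ravel())
```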

3. How Feature Scaling Helps in Machine Learning

Faster Convergence:

Models trained with gradient descent (such as linear or logistic regression fitted by gradient descent, and neural networks) converge faster when features are on a similar scale.

Prevents Feature Dominance:

Features with larger numeric ranges can dominate the learning process if not scaled.

Improves Distance-Based Models:

Algorithms like K-Nearest Neighbors (KNN), K-Means, and SVM rely on distances.

Scaling ensures that all features contribute equally to distance calculations.

Handles Regularization Better:

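Models like Ridge, Lasso, ElasticNet perform better when features are scaled. Regularization penalizes large coefficients, and the size of a coefficient depends on the scale of its feature, so without scaling the penalty falls unevenly across features.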

In [15]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

# Sample data
X = np.array([[150, 2000],
              [160, 3000],
              [170, 4000]])

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Standardized Data:\n", X_scaled)

# Min-Max Scaling
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
print("Normalized Data:\n", X_normalized)


Standardized Data:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
Normalized Data:
 [[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]


## 22. How do we perform scaling in Python?

We can perform feature scaling in Python mainly using scikit-learn’s preprocessing module. The most common techniques are Standardization and Normalization (Min-Max Scaling). Here’s how to do it:

1. Using StandardScaler (Z-score Standardization)

Scales data to have mean = 0 and standard deviation = 1.

Useful for most ML algorithms like linear regression, logistic regression, SVM, neural networks.

In [16]:
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data: 3 samples, 2 features
X = np.array([[150, 2000],
              [160, 3000],
              [170, 4000]])

# Create a scaler object
scaler = StandardScaler()

# Fit the scaler and transform the data
X_scaled = scaler.fit_transform(X)

print("Standardized Data:\n", X_scaled)


Standardized Data:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]


2. Using MinMaxScaler (Normalization)

Scales features to a fixed range, usually 0 to 1.

Formula:
$$
X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
$$

Useful for distance-based algorithms like KNN and K-Means, and for neural networks that work best with inputs in a bounded range.

In [17]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

print("Normalized Data:\n", X_normalized)


Normalized Data:
 [[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]


3. Using RobustScaler

Scales using median and interquartile range, good for data with outliers.

In [19]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_robust = scaler.fit_transform(X)

print("Robust Scaled Data:\n", X_robust)


Robust Scaled Data:
 [[-1. -1.]
 [ 0.  0.]
 [ 1.  1.]]
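
When the data is split into training and testing sets, a common practice is to fit the scaler on the training rows only and reuse it on the test rows, so that test-set statistics do not influence preprocessing. A minimal sketch with randomly generated data (the data and names are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data: 50 samples, 2 features
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=20.0, size=(50, 2))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)         # reuse them, do not refit

print("Train means after scaling:", X_train_scaled.mean(axis=0))
print("Test means after scaling: ", X_test_scaled.mean(axis=0))
```

The training columns have mean zero by construction; the test columns are usually close to zero but not exactly, because they were scaled with the training statistics.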


## 23. What is sklearn.preprocessing?

sklearn.preprocessing is a module in the scikit-learn library that provides tools for preprocessing and transforming data before using it in machine learning models.

Purpose

Raw data often has different scales, units, or types, which can affect the performance of ML models. The preprocessing module helps to normalize, standardize, encode, or scale features so the model can learn effectively.

Common Tasks in sklearn.preprocessing

Scaling / Normalization

Ensures all features are on a similar scale.

Examples:

StandardScaler → scales features to zero mean and unit variance

MinMaxScaler → scales features to a range between 0 and 1

Encoding Categorical Variables

Converts text or categorical data into numbers.

Examples:

LabelEncoder → converts categories to integers

OneHotEncoder → converts categories to binary vectors

Binarization

Converts numerical values into 0 or 1 based on a threshold.

Example: transforming exam scores into pass/fail indicators

Polynomial Features

Generates new features by combining existing ones for polynomial regression.

In [20]:
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[150, 2000],
              [160, 3000],
              [170, 4000]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Scaled Data:\n", X_scaled)


Scaled Data:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]


## 24. How do we split data for model fitting (training and testing) in Python?

In Python, we typically split data into training and testing sets using scikit-learn’s train_test_split function. This is an essential step to train a model on one part of the data and evaluate it on unseen data.

In [21]:
#Using train_test_split

from sklearn.model_selection import train_test_split
import pandas as pd

# Example dataset
data = pd.DataFrame({
    'feature1': [1,2,3,4,5,6,7,8,9,10],
    'feature2': [5,3,6,2,7,1,8,2,9,0],
    'target':   [0,1,0,1,0,1,0,1,0,1]
})

# Features and target
X = data[['feature1', 'feature2']]
y = data['target']

# Split data: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("Training Features:\n", X_train)
print("Testing Features:\n", X_test)


Training Features:
    feature1  feature2
0         1         5
7         8         2
2         3         6
9        10         0
4         5         7
3         4         2
6         7         8
Testing Features:
    feature1  feature2
8         9         9
1         2         3
5         6         1


Key Parameters

X → Input features (2D array or DataFrame)

y → Target variable (1D array, Series)

test_size → Proportion of data for testing (e.g., 0.2 = 20%)

train_size → Proportion of data for training (optional, complementary to test_size)

random_state → Seed for reproducibility (ensures the same split every time)

shuffle → Whether to shuffle data before splitting (default: True)
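
Besides the parameters listed above, `train_test_split` also accepts `stratify`, which keeps the class proportions the same in both splits. The sketch below uses a made-up imbalanced target (16 zeros and 4 ones) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced data: 80% class 0, 20% class 1
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Both splits keep the 80/20 class balance
print("Train class counts:", np.bincount(y_train))
print("Test class counts: ", np.bincount(y_test))
```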

## 25. Explain data encoding

Data Encoding:

Data encoding is the process of transforming categorical (non-numeric) data into a numeric format so that machine learning algorithms can understand and use it.

Most ML models, especially scikit-learn algorithms, work only with numbers, not text or labels.

Encoding ensures categorical variables can be used as features in the model.

Why Data Encoding is Important:

ML models cannot process text or string data directly.

Encoding preserves the information in categories in a numeric format.

Helps algorithms like logistic regression, decision trees, SVM, neural networks to work with categorical data.



Common Data Encoding Techniques

1. Label Encoding

Converts each category into a unique integer.

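Example: Blue → 0, Green → 1, Red → 2 (scikit-learn's LabelEncoder numbers the categories in alphabetical order)

Best suited to ordinal categorical variables (with natural order), provided the assigned integers actually follow that order.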

In [23]:
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Green', 'Blue', 'Green']
le = LabelEncoder()
encoded = le.fit_transform(colors)
print(encoded)  # Output: [2 1 0 1]


[2 1 0 1]


2. One-Hot Encoding

Converts each category into a binary vector (0 or 1).

Avoids implying any order between categories.

Example (columns in alphabetical order: Blue, Green, Red): Red → [0,0,1], Green → [0,1,0], Blue → [1,0,0]

In [25]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

colors = np.array([['Red'], ['Green'], ['Blue'], ['Green']])
ohe = OneHotEncoder()
encoded = ohe.fit_transform(colors)
print(encoded)


  (0, 2)	1.0
  (1, 1)	1.0
  (2, 0)	1.0
  (3, 1)	1.0


3. Ordinal Encoding

Assigns integers to categories based on a defined order.

Example: Education Level → High School → 0, Bachelor → 1, Master → 2
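
A minimal sketch of this mapping with scikit-learn's `OrdinalEncoder`, passing the category order explicitly so that the integers follow it. The sample values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

education = np.array([["High School"], ["Master"], ["Bachelor"], ["Master"]])

# The list fixes the order: High School < Bachelor < Master
encoder = OrdinalEncoder(categories=[["High School", "Bachelor", "Master"]])
encoded = encoder.fit_transform(education)

print(encoded.ravel())
```

Without the `categories` argument the encoder would fall back to alphabetical order, which would place Bachelor before High School.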

4. Binary Encoding / Hashing

Converts categories into binary digits or hashes.

Useful when there are many categories.
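
The hashing variant can be sketched with scikit-learn's `FeatureHasher`, which maps each category to one of a fixed number of columns regardless of how many distinct categories exist. The city names and the choice of 8 output columns are assumptions for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of string tokens; here, one category per sample
cities = [["Paris"], ["Tokyo"], ["Paris"], ["Lima"]]

# Fixed output width of 8 columns, independent of the number of categories
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform(cities).toarray()

print(hashed.shape)
print(hashed)
```

The same category always lands in the same column, but different categories can collide when the number of columns is small, which is the trade-off for the fixed width.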