# Feature Engineering

# 1. What is a parameter?
- A parameter is an internal variable of a model that is learned from the training data during the model training process. Parameters determine how the model makes predictions.

- Example: Linear Regression

- The model is:

   - y = mx + b

- m (weight) is a parameter

- b (bias or intercept) is a parameter

- These values are calculated automatically from the data during training.
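
A minimal sketch of how the parameters m and b are learned, assuming scikit-learn is installed. The toy data below is made up for illustration and follows y = 2x + 1 exactly, so the fitted values should recover that slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (values chosen only for illustration)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X, y)

m = model.coef_[0]     # learned weight (slope)
b = model.intercept_   # learned bias (intercept)
print(f"m = {m:.2f}, b = {b:.2f}")
```

Neither m nor b is set by hand here; both are estimated from X and y when fit() runs.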

# 2. What is correlation?
- Correlation is a statistical measure that describes the strength and direction of the relationship between two variables.

- In simple words, it tells us how one variable changes when another variable changes.

- Types of Correlation

   - Positive Correlation: Both variables increase or decrease together.
      - Example: As study hours increase, exam scores increase.

   - Negative Correlation: One variable increases while the other decreases.
      - Example: As product price increases, demand decreases.

   - Zero Correlation: No linear relationship between the variables.
      - Example: Shoe size and intelligence.

# 3. What does negative correlation mean?
- Negative correlation means that when one variable increases, the other variable decreases, and vice versa. In other words, the two variables move in opposite directions.

- Examples:
  - As price increases, demand decreases
  - As exercise time increases, body weight decreases

- Correlation Coefficient: A negative correlation has a coefficient between −1 and 0 (excluding 0). A value of −1 indicates a perfect negative correlation.

- Graphical Meaning: On a scatter plot, negative correlation appears as a downward-sloping trend from left to right.
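
The coefficient can be computed directly. This is a small sketch using NumPy; the price and demand figures are invented for illustration and are not real market data.

```python
import numpy as np

# Made-up numbers: demand falls as price rises
price = np.array([10, 12, 14, 16, 18])
demand = np.array([100, 92, 85, 70, 61])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the Pearson r
r = np.corrcoef(price, demand)[0, 1]
print(f"r = {r:.3f}")  # negative and close to -1 for this data
```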

# 4. Define Machine Learning. What are the main components in Machine Learning?
- Machine Learning is a method of data analysis that allows systems to automatically learn patterns from data and make predictions or decisions with minimal human intervention.

- Main Components of Machine Learning

- Data: The most important component.
   - Can be structured or unstructured.
   - Used to train and test the model.

- Features
   - Input variables extracted from raw data.
   - Represent important characteristics used by the model.

- Model
   - A mathematical representation that learns patterns from data.
   - Examples: Linear Regression, Decision Tree, Neural Network.

- Algorithm

   - The procedure used to train the model.
   - Examples: Gradient Descent, K-Means, Backpropagation.

- Parameters: Values learned from data during training.
  - Examples: weights and bias.

- Hyperparameters: Values set before training to control learning.
  - Examples: learning rate, number of epochs, k in KNN.

- Evaluation Metric: Used to measure model performance.
  - Examples: Accuracy, Precision, Recall, RMSE.

# 5. How does loss value help in determining whether the model is good or not?
- The loss value measures how far the model’s predictions are from the actual (true) values. It helps us understand how well or poorly a model is performing.

- Role of Loss Value: A low loss value means the model’s predictions are close to the actual values, indicating a good model. A high loss value means there is a large error in predictions, indicating a poor model.

- During Model Training: The model tries to minimize the loss value using optimization algorithms (e.g., Gradient Descent).

- As training progresses:
  - Decreasing loss → the model is learning patterns from the training data
  - Increasing or stagnant loss → the model is not learning well, unless the loss has already levelled off at a low value

- Comparing Models: Loss values allow us to compare different models or versions of the same model. The model with the lower loss on validation data is generally considered better.
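
As a rough illustration of comparing models by loss, the sketch below computes mean squared error (one common loss for regression) for two sets of predictions. The helper `mse` and the "model A" and "model B" predictions are hypothetical, written only for this example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_val = [3.0, 5.0, 7.0, 9.0]     # actual values on a validation set
preds_a = [2.9, 5.2, 6.8, 9.1]   # hypothetical predictions from model A
preds_b = [4.0, 4.0, 9.0, 7.0]   # hypothetical predictions from model B

loss_a = mse(y_val, preds_a)
loss_b = mse(y_val, preds_b)
print(f"loss A = {loss_a:.3f}, loss B = {loss_b:.3f}")
```

Model A has the lower validation loss, so under this metric it would be preferred.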

# 6. What are continuous and categorical variables?
- Continuous Variables: Continuous variables can take any numerical value within a given range and usually have infinite possible values.

- Characteristics:
  - Measured, not counted
  - Can have decimals
  - Represent quantities

- Examples:
  - Height (170.5 cm)
  - Weight (65.2 kg)
  - Temperature (36.7°C)
  - Salary

- Categorical Variables: Categorical variables represent categories or groups rather than numeric values.

- Characteristics:
  - Descriptive in nature
  - Finite number of categories
  - No inherent numerical meaning (arithmetic on the categories is not meaningful)

- Examples:
    - Gender (Male, Female)
    - Blood Group (A, B, AB, O)
    - Color (Red, Blue, Green)
    - Marital Status

- Types of Categorical Variables
   - Nominal (no order)
      - Example: Gender, Color
   - Ordinal (ordered categories)
      - Example: Education level (High, Medium, Low)

# 7. How do we handle categorical variables in Machine Learning? What are the common techniques?
- Machine learning models work with numerical data, so categorical variables must be converted into numbers before training a model. This process is called categorical encoding.

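- Label Encoding: Assigns a unique integer to each category. Suitable for ordinal data (where order matters), provided the integers follow the category order; for nominal data the numbers can suggest a ranking that does not exist.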
   - Example: Low → 0, Medium → 1, High → 2

- One-Hot Encoding: Creates binary (0/1) columns for each category. Used for nominal data (no order).
  - Example: Color → Red, Blue, Green
  - → Red: (1,0,0), Blue: (0,1,0), Green: (0,0,1)

- Ordinal Encoding: Categories are encoded based on a meaningful order. Similar to label encoding, but the order is specified explicitly.
  - Example: Education Level: Primary < Secondary < Graduate

- Binary Encoding: Converts categories into binary digits. Reduces dimensionality compared to one-hot encoding. Useful for high-cardinality features.

- Target Encoding (Mean Encoding): Replaces each category with the mean of the target variable for that category. Useful for large datasets.
  - Must be used carefully to avoid overfitting, since the encoding is computed from the target itself.

- Frequency Encoding: Encodes categories based on their frequency or count in the dataset.
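
Several of these techniques can be sketched with plain pandas. The DataFrame, its column names, and the mapping `size_order` are made up for illustration. Note that the target encoding here is computed on the full data only for brevity; in practice it would be computed on the training data alone.

```python
import pandas as pd

# Hypothetical data: one ordinal column, one nominal column, one numeric target
df = pd.DataFrame({
    "size": ["Low", "High", "Medium", "Low"],
    "color": ["Red", "Blue", "Green", "Red"],
    "target": [10, 30, 20, 14],
})

# Ordinal encoding with an explicit, predefined order
size_order = {"Low": 0, "Medium": 1, "High": 2}
df["size_encoded"] = df["size"].map(size_order)

# One-hot encoding: one 0/1 column per color
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Frequency encoding: replace each color with how often it occurs
df["color_freq"] = df["color"].map(df["color"].value_counts())

# Target (mean) encoding: replace each color with the mean target for that color
df["color_target"] = df.groupby("color")["target"].transform("mean")

print(df)
print(one_hot)
```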

# 8. What do you mean by training and testing a dataset?
- In Machine Learning, a dataset is usually split into two parts: training data and testing data to evaluate model performance.

- Training Dataset: The training dataset is used to teach the model. The model learns patterns, relationships, and parameters from this data. It is used to fit the model.

  - Example: Using past house prices to learn how size and location affect price.

- Testing Dataset: The testing dataset is used to evaluate the model.
  - The model makes predictions on unseen data.
  - Helps check how well the model generalizes to new data.
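
The idea can be sketched with synthetic "house" data, assuming scikit-learn is installed. The sizes, prices, and the 80/20 slice are all assumptions made for this example: the model is fitted on the first 80 rows and scored on the remaining 20 rows that it never saw.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: price depends on size plus random noise (illustrative only)
rng = np.random.default_rng(0)
size = rng.uniform(50, 200, size=100).reshape(-1, 1)
price = 3000 * size.ravel() + 50000 + rng.normal(0, 10000, size=100)

# The rows are already in random order, so a simple slice gives a fair split
X_train, y_train = size[:80], price[:80]
X_test, y_test = size[80:], price[80:]

model = LinearRegression().fit(X_train, y_train)   # learn from training data only
train_r2 = model.score(X_train, y_train)           # R^2 on data the model has seen
test_r2 = model.score(X_test, y_test)              # R^2 on unseen data
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

A test score close to the training score suggests the model generalizes; a much lower test score would suggest overfitting.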

# 9. What is sklearn.preprocessing?
- In Python’s scikit-learn (sklearn) library, preprocessing is a module that provides tools for preparing and transforming data before feeding it into machine learning models.

- The main goal is to make data suitable for modeling, improve model performance, and speed up training.

- Key Functions of sklearn.preprocessing

- Scaling / Normalization: Adjusts features to a similar range so that no feature dominates. Common techniques:
  - StandardScaler → Mean = 0, Std = 1
  - MinMaxScaler → Scales values to [0, 1]
  - Normalizer → Scales rows to unit norm

- Encoding Categorical Variables: Converts non-numeric data to numeric.
  - LabelEncoder → Converts labels to integers (intended for the target variable y)
  - OneHotEncoder → Creates binary columns

- Binarization: Converts values into 0 or 1 based on a threshold.
  - Example: Binarizer(threshold=0.5)

- Polynomial Features: Creates polynomial combinations of features for models that can capture nonlinear relationships.
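
A short sketch of a few of these transformers, assuming scikit-learn is installed. The small array X and the threshold of 2.0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import (
    Binarizer,
    MinMaxScaler,
    PolynomialFeatures,
    StandardScaler,
)

# Two features on very different scales (made-up values)
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

standardized = StandardScaler().fit_transform(X)      # each column: mean 0, std 1
min_max = MinMaxScaler().fit_transform(X)             # each column scaled to [0, 1]
binary = Binarizer(threshold=2.0).fit_transform(X)    # 1 where value > 2.0, else 0
poly = PolynomialFeatures(degree=2).fit_transform(X)  # 1, x1, x2, x1^2, x1*x2, x2^2

print(standardized)
print(min_max)
print(binary)
print(poly)
```

Each transformer follows the same pattern: fit() learns the statistics it needs (for example the column means), and transform() applies them.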


# 10. What is a Test set?
- In Machine Learning, a test set is a portion of the dataset that is kept separate from the training data and is used to evaluate the performance of a trained model. It helps us understand how well the model will perform on unseen, real-world data.

- Key Points About Test Set
  - Not Used for Training
     - The model does not see the test data during training.
     - Reveals overfitting and gives an unbiased estimate of performance.
  - Used for Evaluation
     - Metrics like accuracy, precision, recall, RMSE are calculated on the test set.
     - Shows how well the model generalizes.
  - Split Ratios
     - Common splits: 70% train – 30% test, 80% train – 20% test.

- Example:
  - Suppose we have 1000 rows of data:
  - Training set: 800 rows → used to teach the model
  - Test set: 200 rows → used to check how accurate the predictions are
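
The 800/200 split above could be produced with scikit-learn's `train_test_split`. This is a sketch with placeholder data: the arrays simply hold the row numbers 0 to 999, and `random_state=0` is an arbitrary seed chosen to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # 1000 rows, one placeholder feature
y = np.arange(1000)                  # placeholder target (the row number)

# test_size=0.2 keeps 20% of the rows aside for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

The rows are shuffled before splitting by default, and no row appears in both sets.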

# 11. Why do we have to perform EDA before fitting a model to the data?
- EDA (Exploratory Data Analysis) is the process of analyzing and visualizing data before applying machine learning models. Performing EDA is crucial because it helps us understand the data and prepare it properly, which leads to better model performance.

- Key Reasons to Perform EDA

- Understand Data Distribution: Helps identify patterns, trends, and the range of values.
  - Example: Checking if a variable is normally distributed or skewed.

- Identify Missing Values: Detect missing or null values that can affect model training.
  - Decide whether to impute, remove, or ignore them.

- Detect Outliers: Outliers can distort model performance.
  - EDA helps spot and handle them appropriately.

- Understand Relationships Between Variables: Correlation analysis or scatter plots show which features are related to the target.
  - Helps in feature selection.

- Choose the Right Model: Certain models assume specific data characteristics (e.g., linearity, normality).
  - EDA guides the choice of algorithm.

- Feature Engineering: EDA helps create new meaningful features or transform existing ones.
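
A few of these checks can be sketched with pandas. The DataFrame is invented for illustration (it contains one missing salary and one extreme salary on purpose), and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value and one extreme value
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "salary": [50000, 60000, np.nan, 70000, 80000, 500000],
})

summary = df.describe()              # count, mean, std, min, quartiles, max
missing_counts = df.isnull().sum()   # number of missing values per column

# Flag outliers with the 1.5 * IQR rule
q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(summary)
print(missing_counts)
print(outliers)
```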

# 12. How can you find correlation between variables in Python?
- In Python, the correlation between variables can be measured using the pandas or numpy libraries. Correlation shows how strongly two variables are related and whether the relationship is positive or negative.

- Using pandas.DataFrame.corr(): The easiest way is with pandas. It calculates the correlation matrix (Pearson by default) for the numerical columns. If the DataFrame also contains non-numeric columns, recent pandas versions need numeric_only=True.

```python
import pandas as pd

# Example dataset
data = {
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 65000, 70000, 80000],
    'Experience': [2, 5, 7, 10, 12]
}

df = pd.DataFrame(data)

# Correlation matrix
correlation_matrix = df.corr()
print(correlation_matrix)
```

Output:

```
                 Age    Salary  Experience
Age         1.000000  0.989949    0.997609
Salary      0.989949  1.000000    0.987582
Experience  0.997609  0.987582    1.000000
```
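
For a single pair of variables, NumPy gives the same Pearson coefficient. The sketch below reuses the Age and Salary values from the example dataset and compares `np.corrcoef` with `Series.corr`.

```python
import numpy as np
import pandas as pd

age = [25, 30, 35, 40, 45]
salary = [50000, 60000, 65000, 70000, 80000]

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the Age-Salary correlation
r_numpy = np.corrcoef(age, salary)[0, 1]

# Series.corr computes the correlation between two columns directly
r_pandas = pd.Series(age).corr(pd.Series(salary))

print(f"numpy: {r_numpy:.6f}, pandas: {r_pandas:.6f}")
```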


# 13. What is causation? Explain difference between correlation and causation with an example.
- Causation (or causal relationship) means that one event directly causes another event to happen.
In other words, changes in one variable directly lead to changes in another variable.

| Aspect | Correlation | Causation |
|---|---|---|
| Meaning | Measures association between two variables | Shows a cause-and-effect relationship |
| Direction | Can be positive, negative, or zero | Always implies one variable affects another |
| Requirement | Variables may or may not influence each other | One variable must influence the other |
| Example | Ice cream sales ↑ correlate with drownings ↑ | Smoking causes lung cancer |

- Correlation does NOT imply causation.

- Two variables can move together due to:
  - Coincidence
  - A third variable (confounding factor)
  - Seasonal or external effects
- Causation requires evidence (experiments, controlled studies, or strong reasoning).


# 14. What is sklearn.linear_model ?
- In the scikit-learn (sklearn) library, the linear_model module provides linear models: algorithms that model the relationship between independent (input) variables and a dependent (target) variable using a linear equation.

- It is widely used in regression and classification problems.

- Linear Regression: Models the relationship between numerical inputs and output.
  - Example: Predicting house prices based on size, location, and age.
  - Class: LinearRegression()

- Ridge Regression: Linear regression with L2 regularization to reduce overfitting.
  - Class: Ridge()

- Lasso Regression: Linear regression with L1 regularization for feature selection.
  - Class: Lasso()

- Logistic Regression: Used for binary or multiclass classification problems.
  - Class: LogisticRegression()

- Elastic Net: Combination of L1 (Lasso) and L2 (Ridge) regularization.
  - Class: ElasticNet()
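
The effect of the two kinds of regularization can be sketched on synthetic data, assuming scikit-learn is installed. The data and the `alpha` values are arbitrary choices for illustration: only the first of three features actually influences y.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Only the first feature matters; the other two are irrelevant to y
y = 4.0 * X[:, 0] + rng.normal(0, 0.5, size=200)

ols = LinearRegression().fit(X, y)    # no regularization
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty shrinks all coefficients
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty can set coefficients to exactly 0

print("OLS:  ", ols.coef_)
print("Ridge:", ridge.coef_)
print("Lasso:", lasso.coef_)
```

With these settings Lasso drops the two irrelevant features by setting their coefficients to zero, which is why it is often described as performing feature selection.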

# 15. What does model.fit() do? What arguments must be given?
- fit() trains the model: it learns the model's parameters from the input features and the target variable so that the model can make predictions.

- Arguments Required for model.fit()

- The exact arguments depend on the type of model, but generally, the main arguments are:

- X (features/input data)
  - Usually a 2D array or DataFrame
  - Shape: (n_samples, n_features)
  - Example: [[1, 2], [3, 4], [5, 6]]

- y (target/output variable)
  - 1D array or Series (for regression or binary classification)
  - Shape: (n_samples,)
  - Example: [10, 20, 30]
  - Unsupervised models (e.g., KMeans) need only X.

- Some models accept extra arguments such as sample_weight, and deep learning libraries such as Keras also accept epochs or batch_size, but these are not required for basic usage.
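
A minimal sketch of a fit() call using the example X and y above, assuming scikit-learn is installed. The new sample [7, 8] is an arbitrary input chosen for illustration. The second part shows what happens when X is passed as a 1D array instead of the expected 2D shape.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2): 3 samples, 2 features
y = np.array([10, 20, 30])               # shape (3,): one target per sample

model = LinearRegression()
model.fit(X, y)                          # learns coef_ and intercept_ from X and y
prediction = model.predict([[7, 8]])     # predict for one new sample
print(prediction)

# Passing a 1D array as X is rejected, because X must be 2D
try:
    LinearRegression().fit(np.array([1, 3, 5]), y)
except ValueError as err:
    shape_error = str(err)
    print("ValueError:", shape_error.splitlines()[0])
```

A single feature can be reshaped to the required 2D form with `reshape(-1, 1)`.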




