# Feature Engineering Assignment.
---


## 1.  What is a parameter?
  - A parameter is a constant value that defines a population or a model.
      - e.g. In linear regression, the slope and intercept are model parameters.
  - In the feature engineering context, a parameter is a setting or configuration value used during data transformation or feature construction. Unlike model parameters (such as weights), which are learned from data, feature engineering parameters are chosen by the practitioner and define how features are manipulated or derived.
      - e.g. Binning age into groups
            - Parameter: bin edges, e.g. [0, 18, 35, 60, 100]
            - Result: transforms age into categories like teen, adult, senior
  - These parameters shape the feature space the model learns from, so tuning them carefully can improve both performance and interpretability.
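
The binning example can be sketched with `pandas.cut`. This is a minimal illustration, not a prescribed method: the sample ages and the four label names are assumptions made for the example, and only the bin edges come from the text above.

```python
import pandas as pd

# Hypothetical ages; the bin edges are the ones quoted above.
ages = pd.Series([5, 17, 25, 40, 70])
bin_edges = [0, 18, 35, 60, 100]

# Illustrative label names (an assumption), one per interval.
labels = ["minor", "young_adult", "adult", "senior"]

# right=False makes each interval closed on the left: [0, 18), [18, 35), ...
age_group = pd.cut(ages, bins=bin_edges, labels=labels, right=False)
print(age_group.tolist())
```

Changing `bin_edges` changes the feature the model sees, which is why the edges count as a feature engineering parameter.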

2. a & 12.  What is correlation? What does negative correlation mean?
  - Correlation measures the strength and direction of a relationship between two variables.
      - *eg: Height and weight often have positive correlation.*

  2.b & 13. - A **Negative correlation** means one variable tends to decrease as the other increases, and vice versa (an inverse relationship).
      - *eg: Study hours vs No. of failed subjects.*

3. Define Machine Learning. What are the main components in Machine Learning?
  - **Machine Learning** is the use of algorithms that enable computers to learn patterns from data without being explicitly programmed for the task.

|Main Components|
|---------------|
|Dataset|
|Model|
|Loss Function|
|Optimizer|
|Evaluation Metric|

4. How does loss value help in determining whether the model is good or not?
  - The loss value quantifies how far off the model's predictions are from the actual outcomes. It's like a penalty score for being wrong.
      - Why It Matters:
        - A low loss indicates that the model predicts well on the training data.
        - A high loss signals poor predictions, suggesting underfitting or bad feature choices.

    - Examples:
      - Mean Squared Error (MSE) in regression
        Measures average squared difference between predicted and actual values.
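        - MSE = 3.2 → closer fit than MSE = 92 on the same data
        - MSE is in squared units of the target, so judge it relative to the target's scale rather than in isolation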
      - Cross-Entropy Loss in classification
        Penalizes confident wrong predictions more harshly.
        - Lower value → better class separation

      Loss helps during:
      - Model training (to guide optimization)
      - Comparison of models
      - Tuning hyperparameters
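
To make the two losses concrete, here is a small sketch using `sklearn.metrics`. The target values and predicted probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss, mean_squared_error

# Regression: compare a close set of predictions with a poor one.
y_true = np.array([3.0, 5.0, 7.0])
good_pred = np.array([2.9, 5.2, 6.8])
bad_pred = np.array([6.0, 1.0, 12.0])
mse_good = mean_squared_error(y_true, good_pred)
mse_bad = mean_squared_error(y_true, bad_pred)

# Classification: predicted probability of class 1 for three samples.
labels = [1, 0, 1]
confident_right = [0.9, 0.1, 0.8]
confident_wrong = [0.1, 0.9, 0.2]
ce_right = log_loss(labels, confident_right)
ce_wrong = log_loss(labels, confident_wrong)

print(mse_good, mse_bad)
print(ce_right, ce_wrong)
```

In both cases the lower number belongs to the better set of predictions, and cross-entropy grows sharply when the model is confident and wrong.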

5. What are continuous and categorical variables?
  - Continuous: Numeric values that can take any value within a range (e.g. height).
  - Categorical: Values drawn from a limited set of categories (e.g. color: red, green).

6. How do we handle categorical variables in Machine Learning? What are the common techniques?
  - Label Encoding (Red = 0, Blue = 1)
  - One-Hot Encoding (creates binary columns for each category)
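
A minimal sketch of both techniques with scikit-learn, using a made-up color column. Note that scikit-learn documents `LabelEncoder` as intended for target labels; it is used here only to show the integer mapping.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical column.
colors = np.array(["Red", "Blue", "Green", "Blue"])

# Label encoding: categories are sorted, so Blue=0, Green=1, Red=2.
label_encoder = LabelEncoder()
color_codes = label_encoder.fit_transform(colors)

# One-hot encoding expects a 2D input: one row per sample, one column per feature.
one_hot_encoder = OneHotEncoder()
color_one_hot = one_hot_encoder.fit_transform(colors.reshape(-1, 1)).toarray()

print(color_codes)
print(color_one_hot)
```

Label encoding implies an order (Blue < Green < Red) that the data may not have, which is one reason one-hot encoding is often preferred for nominal features.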

7. What do you mean by training and testing a dataset?
  - Training Set:
      - Used to build and teach the model. The algorithm learns patterns, relationships, and feature importance from this data.
      - Model sees both input features and target labels.
      - It's like the textbook the model studies.

  - Testing Set:
      - Used to evaluate the model's performance on unseen data. This shows how well the model generalizes.
      - Model gets only input features and predicts outputs.
      - It's like the exam paper after training.

  - Comparing performance on the two sets helps detect:
      - Overfitting: Very good on training, poor on testing.
      - Underfitting: Poor performance on both sets.

  - e.g. A common train-test split is to train on 80% of the data and test on the remaining 20%.
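
The train/test comparison can be sketched as follows. The synthetic dataset and the unconstrained decision tree are assumptions chosen because such a tree tends to memorize its training data; they are not part of the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (flip_y), so memorizing it does not generalize.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A tree with no depth limit can fit the training set almost perfectly.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = deep_tree.score(X_train, y_train)
test_acc = deep_tree.score(X_test, y_test)

print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A large gap between the two scores is the usual sign of overfitting; low scores on both would point to underfitting.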

8. What is sklearn.preprocessing?
  - A module in **scikit-learn** with tools for preprocessing (transforming) data before feeding it to machine learning models.
  Includes: scaling, encoding, and normalization functions.


  |Task|Tool|Purpose|
  |:----|:----|:-----|
  |Scaling|StandardScaler, MinMaxScaler|Normalize feature ranges|
  |Encoding|LabelEncoder, OneHotEncoder|Convert categorical to numerical|
  |Polynomial Features|PolynomialFeatures|Add interaction / higher-order terms|
  |Binarization|Binarizer|Convert numeric to binary|
  |Normalization|Normalizer|Scale Individual samples|
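
The less familiar rows of the table can be illustrated on a tiny array. The input values and the threshold are arbitrary choices for the example.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, Normalizer, PolynomialFeatures

X = np.array([[3.0, 4.0], [1.0, 0.0]])

# Binarizer: values above the threshold become 1, the rest 0.
binary = Binarizer(threshold=2.0).fit_transform(X)

# Normalizer: rescales each row (sample) to unit L2 norm.
unit_rows = Normalizer(norm="l2").fit_transform(X)

# PolynomialFeatures (degree 2): columns are 1, x1, x2, x1^2, x1*x2, x2^2.
poly = PolynomialFeatures(degree=2).fit_transform(X)

print(binary)
print(unit_rows)
print(poly)
```

`Normalizer` works per row, whereas `StandardScaler` and `MinMaxScaler` work per column.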


9. What is a Test set?
   - Test set: The portion of the dataset held back to evaluate the model's performance after training.
      - Because the model has not seen this data, the result shows how well it generalizes.
      - Model gets only input features and predicts outputs, which are then compared with the true labels.
      - It's like sitting the exam after studying.

10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?
```
from sklearn.model_selection import train_test_split

# X is the feature matrix and y the target; both are assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# test_size=0.2 reserves 20% of the data for testing and leaves 80% for training
```
  - Approach to an ML problem:
      - EDA
      - Preprocessing
      - Model selection
      - Training
      - Evaluation
      - Tuning
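
The steps after EDA can be sketched end to end with a scikit-learn pipeline. The Iris dataset, the logistic regression model, and the small grid of `C` values are all assumptions picked to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Data and split (EDA would normally come before this).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and model selection, bundled so scaling is fit on training folds only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Training and tuning: 5-fold cross-validated search over the regularization strength.
search = GridSearchCV(pipeline, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Evaluation on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```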

11. Why do we have to perform EDA before fitting a model to the data?
  - We perform EDA first:
    - To understand data patterns, spot missing values, and uncover biases.
      - Example: Detect outliers, skewed distributions before modeling.
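
A few pandas calls cover the checks listed above. The small DataFrame is invented for the example, and the 1.5 × IQR rule is one common outlier heuristic among several.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one extreme income.
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, np.nan],
    "income": [30_000, 32_000, 35_000, 40_000, 42_000, 900_000],
})

missing_counts = df.isna().sum()     # missing values per column
summary = df.describe()              # count, mean, std, quartiles
income_skew = df["income"].skew()    # positive means a long right tail

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

print(missing_counts)
print(income_skew)
print(outliers)
```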

12. What is correlation? (duplicate)
  - Answered above under question 2.

13. What does negative correlation mean? (duplicate)
  - Answered above under question 2.


14. How can you find correlation between variables in Python?
```
df.corr()  # Pandas method to compute correlation matrix
```
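
A fuller sketch of the same call, assuming a small made-up DataFrame of numeric columns:

```python
import pandas as pd

# Hypothetical student data.
df = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 71, 80],
    "failed_subjects": [4, 3, 3, 1, 0],
})

corr_matrix = df.corr()                      # Pearson correlation by default
spearman_matrix = df.corr(method="spearman") # rank-based alternative

hours_vs_score = corr_matrix.loc["study_hours", "exam_score"]
hours_vs_failed = corr_matrix.loc["study_hours", "failed_subjects"]
print(corr_matrix.round(2))
```

Values near +1 or -1 indicate a strong linear relationship, and the sign gives the direction.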

15. What is causation? Explain difference between correlation and causation with an example.
  - Causation: One variable directly influences another; there is a cause-and-effect relationship.
      - e.g. Smoking increases the risk of lung disease. Here, smoking is a cause of the health outcome.
  - Correlation vs. causation: Ice cream sales and sunburns **correlate**, but neither causes the other. Hot weather is the common **cause** of both.
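
The ice cream and sunburn example can be simulated. In this sketch all numbers are invented: temperature drives both quantities, and the two are otherwise unrelated. Removing the effect of temperature (by correlating the residuals of two simple linear fits) makes the apparent relationship disappear.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Temperature is the common cause; the noise terms are independent.
temperature = rng.normal(30, 5, n)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, n)
sunburns = 2 * temperature + rng.normal(0, 5, n)

raw_corr = np.corrcoef(ice_cream_sales, sunburns)[0, 1]

def residuals(y, x):
    """Return what is left of y after subtracting a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation once the temperature effect has been removed from both variables.
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(sunburns, temperature),
)[0, 1]

print(round(raw_corr, 2), round(partial_corr, 2))
```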

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
  - An optimizer is an algorithm that adjusts the model's parameters (like weights and biases) to minimize the loss function during training. It guides the model to better predictions by taking steps in the direction that reduces error.

|Optimizer|How it Works|Example Use Case|
|:------:|:---:|:-------:|
|SGD (Stochastic <br> Gradient Descent)|Updates weights using one random sample (or a small batch) <br> at a time. Simple and memory-efficient.|Basic neural networks and linear models|
|Momentum|Adds a fraction of the previous update to the <br> current step, which speeds up and smooths training.|Deep networks prone to slow convergence|
|RMSprop|Scales each parameter's learning rate by a running average of recent squared gradients.<br> Good for non-stationary objectives.|RNNs and time-series problems|
|Adam (Adaptive <br> Moment Estimation)|Combines Momentum and RMSprop. Automatically adapts <br> the learning rate for each parameter.|CNNs and deep learning in general; fast and stable|

  - Adam is often the go-to for deep learning because it balances speed and stability.
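
The four update rules can be compared on a toy problem. This is a simplified sketch: the loss `f(w) = (w - 3)^2`, the hyperparameter values, and the function names are assumptions for illustration, and an exact gradient stands in for the noisy per-sample gradient that real SGD would use.

```python
import numpy as np

def grad(w):
    # Gradient of the toy loss f(w) = (w - 3)^2, which is minimized at w = 3.
    return 2.0 * (w - 3.0)

def run_sgd(steps=1000, lr=0.1):
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w)                      # plain gradient step
    return w

def run_momentum(steps=1000, lr=0.1, beta=0.9):
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)   # accumulate past gradients
        w -= lr * velocity
    return w

def run_rmsprop(steps=1000, lr=0.02, decay=0.9, eps=1e-8):
    # Steps are roughly lr-sized whatever the gradient magnitude,
    # so a smaller lr is used here to settle closer to the minimum.
    w, avg_sq = 0.0, 0.0
    for _ in range(steps):
        g = grad(w)
        avg_sq = decay * avg_sq + (1 - decay) * g * g   # running mean of g^2
        w -= lr * g / (np.sqrt(avg_sq) + eps)
    return w

def run_adam(steps=1000, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum part)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (RMSprop part)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

results = {
    "sgd": run_sgd(),
    "momentum": run_momentum(),
    "rmsprop": run_rmsprop(),
    "adam": run_adam(),
}
print(results)
```

All four should end close to `w = 3`; in practice the differences show up in how quickly and how smoothly they get there on harder, noisier losses.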

17. What is sklearn.linear_model?
  - It's a module in scikit-learn that contains models that predict from a linear combination of the input features.
      - Key Models:
      - ```LinearRegression```: Predicts continuous values using a linear equation.
      - ```LogisticRegression```: Handles binary/multiclass classification using a logistic function.
      - ```Ridge```, ```Lasso```: Regularized linear regression (L2 and L1 penalties respectively) to reduce overfitting.
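
A short sketch of these models on made-up data that follows `y = 2x + 1` exactly. The `alpha` values are arbitrary and chosen only to make the shrinkage visible.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge

# Noise-free line: y = 2x + 1.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty pulls the slope toward 0
lasso = Lasso(alpha=1.0).fit(X, y)    # L1 penalty does the same and can zero out weak features

# Binary target for classification: 1 when x >= 5.
y_class = (X.ravel() >= 5).astype(int)
clf = LogisticRegression().fit(X, y_class)

print(ols.coef_[0], ridge.coef_[0], lasso.coef_[0])
print(clf.predict(np.array([[0.0], [9.0]])))
```

Ordinary least squares recovers the slope of 2, while the regularized models report a slightly smaller slope because of their penalties.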

18. What does model.fit() do? What arguments must be given?
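    - Fits (trains) the model on the training data, learning its parameters.
        - Args: the training features X_train and, for supervised models, the training targets y_train
          ```model.fit(X_train, y_train)```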
          
19. What does model.predict() do? What arguments must be given?
    - Generates predictions for new data using the fitted model.
        - Args: the features to predict on, e.g. X_test
    - ```model.predict(X_test)```
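
Putting `fit` and `predict` together, as a sketch on synthetic data (the coefficients 3 and -2 and the noise level are invented for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data: y depends linearly on two features plus small noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
fitted = model.fit(X_train, y_train)   # learns coef_ and intercept_; returns the model itself
y_pred = model.predict(X_test)         # one prediction per row of X_test

print(model.coef_, y_pred.shape)
```

`predict` takes only features; the true `y_test` values are used afterwards to score the predictions.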

20. What are continuous and categorical variables? (duplicate)
  - Answered above under question 5.

21. What is feature scaling? How does it help in Machine Learning?
    - Feature scaling is the process of transforming input variables so they're on a comparable scale, typically in the range 0 to 1 or with mean 0 and standard deviation 1.
    - Why?
      - Many algorithms rely on distances or gradients, so features on very different scales can skew results or slow convergence.
    - How It Helps in ML
      - Speeds Up Training: Gradient-based optimizers converge faster.
      - Improves Accuracy: Prevents dominant features from overshadowing others.
      - Balances Inputs: Especially important for algorithms like KNN, SVM, or Logistic Regression.
  - e.g. Imagine the features:
      - Age: 25-60
      - Income: ₹1,00,000 - ₹15,00,000
      - Without scaling, income would dominate distance calculations.
      - After scaling (e.g. StandardScaler), both contribute on a comparable scale.
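
The age and income example can be checked numerically. The three rows below are invented: the first two people differ mainly in age, the first and third mainly in income.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age, income (hypothetical values within the ranges above).
X = np.array([
    [25, 100_000],
    [60, 110_000],
    [26, 1_500_000],
], dtype=float)

# Euclidean distances on the raw features.
raw_dist_age_gap = np.linalg.norm(X[0] - X[1])
raw_dist_income_gap = np.linalg.norm(X[0] - X[2])

# The same distances after standardizing each column.
X_scaled = StandardScaler().fit_transform(X)
scaled_dist_age_gap = np.linalg.norm(X_scaled[0] - X_scaled[1])
scaled_dist_income_gap = np.linalg.norm(X_scaled[0] - X_scaled[2])

print(raw_dist_age_gap, raw_dist_income_gap)
print(scaled_dist_age_gap, scaled_dist_income_gap)
```

On the raw data the 35-year age gap is almost invisible next to the income gap; after scaling the two distances are of similar size.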

22. How do we perform scaling in Python?
  - Feature scaling in Python is usually done with scikit-learn:
  - Scaling with StandardScaler and MinMaxScaler
      - StandardScaler: Centers each feature (mean = 0) and scales it to unit variance (standard deviation = 1).
      - *A common default for scale-sensitive models such as Logistic Regression and SVM.*
      ```
      from sklearn.preprocessing import StandardScaler
      scaler = StandardScaler()
      X_scaled = scaler.fit_transform(X)
      ```
    - MinMaxScaler: Scales each feature to a fixed range (0 to 1 by default).
    - *Useful when features need bounded scaling for neural networks or interpretable output.*

      ```
      from sklearn.preprocessing import MinMaxScaler
      scaler = MinMaxScaler()
      X_scaled = scaler.fit_transform(X)
      ```
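
Both snippets call `fit_transform` on the full `X`. When a held-out test set is used, a common practice is to fit the scaler on the training rows only and reuse its statistics on the test rows, so that no information from the test set leaks into preprocessing. A minimal sketch on random data (the data itself is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 200 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # apply the same mean and std to test rows

print(X_train_scaled.mean(axis=0).round(3), X_test_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 exactly, while the test columns are only approximately centered, which is expected.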

23. What is sklearn.preprocessing? (duplicate)
  - Answered above under question 8.

24. How do we split data for model fitting (training and testing) in Python? (duplicate)
  - Answered above under question 10.

25. Explain data encoding?
  - It's a preprocessing step to convert categorical values into a numerical format that ML models can understand.
  - Common Techniques:

|Method|Description|Example|
|:----:|:----:|:----:|
|Label Encoding|Assigns each category a unique integer|{'cat': 0, 'dog': 1}|
|One-Hot Encoding|Creates a separate binary column for each category|cat -> [1, 0], dog -> [0, 1]|

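```
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red']})  # hypothetical example data
pd.get_dummies(df['color'])  # One-Hot Encoding: one indicator column per category
```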
