# Answers to Assignment Questions (Feature Engineering / ML)

This notebook contains concise answers to the questions from the provided PDF. You can open and run this notebook in Google Colab.

## What is a parameter?

A **parameter** is an internal variable of a model whose value is learned from training data, in contrast to a hyperparameter, which is set before training. For example, in linear regression `y = wx + b`, the parameters are **w** (weight/slope) and **b** (bias/intercept). Model training adjusts parameters to minimize a loss function.
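
As a minimal sketch of parameters being learned, the snippet below fits a linear regression to toy data. The data (generated from y = 2x + 1) and the variable names are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy, noise-free data generated from y = 2x + 1 (illustrative only)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

w = model.coef_[0]      # learned weight (slope)
b = model.intercept_    # learned bias (intercept)
print(f"w = {w:.2f}, b = {b:.2f}")
```

Because the data is noise-free, the learned values should match the slope and intercept used to generate it.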

## What is correlation?

**Correlation** measures the strength and direction of a linear relationship between two variables. It is commonly quantified by the Pearson correlation coefficient (r), which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). A value of 0 indicates no linear relationship.

## What does negative correlation mean?

A **negative correlation** means that as one variable increases, the other tends to decrease. For example, if correlation r = -0.8 between hours spent watching TV and test score, higher TV time is associated with lower test scores.
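
The Pearson coefficient can be computed by hand and checked against NumPy. This is a small sketch; the TV-hours and test-score values are made-up numbers chosen to show a negative correlation.

```python
import numpy as np

# Hypothetical data: hours of TV per day and test scores for six students
tv_hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
test_score = np.array([90.0, 85.0, 80.0, 70.0, 65.0, 55.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy centered
x_centered = tv_hours - tv_hours.mean()
y_centered = test_score - test_score.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(tv_hours, test_score)[0, 1]
print(round(r_manual, 3), round(r_numpy, 3))
```

Both values should agree and be close to -1, since the scores fall almost linearly as TV time rises.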

## Define Machine Learning. What are the main components in Machine Learning?

**Machine Learning (ML)** is a field of computer science that builds algorithms that learn patterns from data to make predictions or decisions without being explicitly programmed for every case. Main components:

- **Data**: labeled or unlabeled examples used to train and evaluate models.
- **Features**: input variables derived from raw data.
- **Model/Algorithm**: the mathematical structure (e.g., linear regression, decision tree, neural network).
- **Loss/Objective function**: measures how well the model performs.
- **Optimizer**: method to update parameters to minimize loss (e.g., gradient descent).
- **Evaluation metrics**: accuracy, precision, recall, RMSE, etc.
- **Training/Validation/Test split**: to train and evaluate generalization.

## How does loss value help in determining whether the model is good or not?

The **loss value** quantifies how far model predictions are from true targets on the dataset used. Lower loss generally indicates better fit. However, absolute loss values depend on the task and loss type (e.g., MSE vs cross-entropy). You must monitor validation loss (not just training loss) to detect overfitting. Also combine loss with task-specific metrics (accuracy, F1, RMSE) for a fuller picture.
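
A rough illustration of comparing training and validation loss, assuming mean squared error as the loss. The targets and predictions are hypothetical numbers, and `mse` is a helper written for this sketch.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical targets and predictions on two data splits
y_train, pred_train = [3.0, 5.0, 7.0], [3.1, 4.9, 7.0]
y_val, pred_val = [4.0, 6.0, 8.0], [5.0, 4.5, 9.5]

train_loss = mse(y_train, pred_train)
val_loss = mse(y_val, pred_val)

# A validation loss far above the training loss is a common sign of overfitting
print(f"train MSE = {train_loss:.4f}, validation MSE = {val_loss:.4f}")
```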

## What are continuous and categorical variables?

- **Continuous variables** (numerical) can take on any real value within a range (e.g., height, weight, price).
- **Categorical variables** represent discrete groups or categories (e.g., color: red/green/blue; country names). Categorical variables can be nominal (no order) or ordinal (ordered).

## How do we handle categorical variables in Machine Learning? What are the common techniques?

Common techniques:

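- **Label Encoding**: map categories to integers (suitable for ordinal categories or tree-based models; on nominal categories it can introduce an artificial order). In scikit-learn, use `OrdinalEncoder` for input features; `LabelEncoder` is intended for target labels.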
- **One-Hot Encoding**: create binary columns for each category (works well with linear models and neural nets; increases dimensionality).
- **Target/Mean Encoding**: replace category with target mean (requires careful cross-validation to avoid leakage).
- **Binary/Hash Encoding**: useful when there are many categories (reduces dimensionality).
- **Embedding layers**: learned dense representations for categories (common in deep learning).
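
Two of the techniques above can be sketched with scikit-learn. The `colors` and `sizes` arrays are hypothetical features, and the explicit `small < medium < large` order is an assumption made for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical nominal feature (no order) and ordinal feature (ordered)
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# One-hot: one binary column per category (columns follow sorted category order)
onehot = OneHotEncoder()
color_encoded = onehot.fit_transform(colors).toarray()

# Ordinal: integers that respect an explicit, meaningful order
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = ordinal.fit_transform(sizes)

print(onehot.categories_[0])   # category order of the one-hot columns
print(color_encoded)
print(size_encoded.ravel())
```

Passing the category order explicitly to `OrdinalEncoder` avoids the default alphabetical ordering, which would not match the real size order here.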

## What do you mean by training and testing a dataset?

- **Training**: using a dataset to update model parameters (learning). The model sees inputs and targets and optimizes its parameters to minimize loss.
- **Testing**: evaluating the trained model on unseen data (the test set) to estimate how well it generalizes to new inputs. Test data must not influence training or hyperparameter tuning.

## What is sklearn.preprocessing?

`sklearn.preprocessing` is a scikit-learn module that provides utilities for transforming input data: scaling, normalization, encoding, and generating polynomial features. Examples include `StandardScaler`, `MinMaxScaler`, `LabelEncoder`, `OneHotEncoder`, and `PolynomialFeatures`. Missing-value imputation lives in the separate `sklearn.impute` module (e.g., `SimpleImputer`).

## What is a Test set?

The **test set** is a portion of the dataset held out from training and hyperparameter tuning and used only at the end to estimate final model performance. It simulates performance on truly unseen data; evaluating on it repeatedly while adjusting the model leaks information and makes the estimate optimistic.

## How do we split data for model fitting (training and testing) in Python?

A common approach is `train_test_split` from scikit-learn, assuming `X` holds the features and `y` the targets:

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
You can also create a validation set or use cross-validation (`KFold`, `StratifiedKFold`) for robust hyperparameter tuning.
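
One possible way to get a validation set is to call `train_test_split` twice; `KFold` is shown as the cross-validation alternative. The random dataset and the 60/20/20 proportions are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical dataset: 100 samples, 3 features, binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Hold out 20% as the test set first
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then take 25% of the remaining 80% as validation: 60/20/20 overall
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)
print(len(X_train), len(X_val), len(X_test))

# Alternatively, 5-fold cross-validation over the non-test data
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kfold.split(X_temp)]
print(fold_sizes)
```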

## How do you approach a Machine Learning problem?

A typical approach:

1. **Define objective**: what to predict and how success is measured.
2. **Collect/understand data**: gather required datasets.
3. **Exploratory Data Analysis (EDA)**: inspect distributions, missing values, correlations, outliers.
4. **Preprocess/clean**: handle missing values, encode categories, scale features.
5. **Feature engineering**: create or select informative features.
6. **Model selection & training**: choose algorithms, train models.
7. **Validate & tune**: cross-validation and hyperparameter tuning.
8. **Evaluate**: test on hold-out set and compute metrics.
9. **Deploy & monitor**: serve the model and track performance in production.
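
The modeling steps above can be sketched end to end on a small example. This is only an outline under stated assumptions: it uses scikit-learn's bundled iris dataset and a logistic regression, neither of which comes from the assignment, and it skips EDA, feature engineering, and deployment.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: objective = classify iris species; data = scikit-learn's bundled iris set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Steps 4-6: preprocessing and model combined in one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step 7: cross-validate on the training data only
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f}")

# Step 8: fit on all training data, then evaluate once on the hold-out set
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Wrapping the scaler and the model in one pipeline means the scaler is refit inside each cross-validation fold, which keeps validation data out of the preprocessing step.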

## Why do we have to perform EDA before fitting a model to the data?

**EDA (Exploratory Data Analysis)** helps to:
- Discover patterns, trends, and relationships
- Detect missing values and outliers
- Identify wrong data types or inconsistent entries
- Inform feature engineering and model selection
- Avoid garbage-in–garbage-out by improving data quality before modeling

## How can you find correlation between variables in Python?

Using pandas and seaborn:

```python
import pandas as pd
import seaborn as sns

# df is assumed to be an existing pandas DataFrame
corr_matrix = df.corr(numeric_only=True)  # Pearson correlation, numeric columns only
sns.heatmap(corr_matrix, annot=True)
```
For pairwise visual checks, use `sns.pairplot(df)`. For monotonic (not necessarily linear) relationships, use `df.corr(method='spearman')`. For categorical variables, use contingency tables (`pd.crosstab`) or Cramér's V.

## What is causation? Explain difference between correlation and causation with an example.

**Causation** means one variable directly affects another. **Correlation** means two variables move together but one may not cause the other.

Example: Ice cream sales and drowning rates are positively correlated (both rise in summer), but buying ice cream doesn't cause drowning—**temperature** is a confounder that causes both.
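
The confounding effect can be reproduced with simulated data. Everything below is synthetic: the coefficients, noise levels, and the `residuals` helper are assumptions made for this sketch, not real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated (not real) data: temperature drives both quantities independently
temperature = rng.uniform(10, 35, size=1000)
ice_cream_sales = 5.0 * temperature + rng.normal(0, 10, size=1000)
drownings = 0.3 * temperature + rng.normal(0, 1, size=1000)

# Raw correlation is strongly positive although neither causes the other
r_raw = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, confounder):
    """Remove the linear effect of the confounder from values."""
    slope, intercept = np.polyfit(confounder, values, deg=1)
    return values - (slope * confounder + intercept)

# Partial correlation: correlate what is left after controlling for temperature
r_partial = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]
print(f"raw r = {r_raw:.2f}, r after controlling for temperature = {r_partial:.2f}")
```

Once temperature is accounted for, the remaining correlation should be close to zero, which is what one would expect if the original association came entirely from the confounder.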

## What is an Optimizer? What are different types of optimizers? Explain each with an example.

An **optimizer** is an algorithm that updates model parameters to minimize the loss function. Types:

- **Gradient Descent (Batch GD)**: uses the full dataset to compute gradients and update params. Good for small datasets.
- **Stochastic Gradient Descent (SGD)**: updates parameters using one sample at a time—faster but noisier.
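- **Mini-batch Gradient Descent**: computes gradients on small batches of samples, trading off the stability of batch GD against the speed of SGD (the usual choice in deep learning).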
- **Momentum**: accelerates SGD by adding a fraction of the previous update to the current one.
- **AdaGrad**: adapts learning rates per parameter based on past squared gradients—good for sparse data.
- **RMSprop**: fixes AdaGrad's learning-rate decay by using moving averages of squared gradients.
- **Adam**: combines momentum and RMSprop ideas; widely used in training neural networks.

Example: `optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)` in TensorFlow.
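
To make the update rules concrete, here is a plain-Python sketch of three of the optimizers on a one-dimensional toy loss, f(w) = (w - 3)^2. The loss, learning rates, and step counts are assumptions chosen for illustration; library implementations differ in details.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, which is minimized at w = 3."""
    return 2.0 * (w - 3.0)

def gradient_descent(w=0.0, lr=0.1, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)                      # step against the gradient
    return w

def momentum(w=0.0, lr=0.1, beta=0.9, steps=500):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)   # accumulate past gradients
        w -= lr * velocity
    return w

def adam(w=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # moving average of gradients
        v = beta2 * v + (1 - beta2) * g ** 2   # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_gd, w_momentum, w_adam = gradient_descent(), momentum(), adam()
print(w_gd, w_momentum, w_adam)
```

All three should end up near the minimum at w = 3; they differ in how the step is computed from the gradient history.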

## What is sklearn.linear_model?

`sklearn.linear_model` is a scikit-learn module containing linear models like `LinearRegression`, `LogisticRegression`, `Ridge`, `Lasso`, and `ElasticNet`. These implement simple and regularized linear models for regression and classification.
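
A short sketch comparing a plain and two regularized linear models from this module. The synthetic data (one informative feature, one noise feature) and the `alpha` values are assumptions picked to make the effect of regularization visible.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: y depends on feature 0 only; feature 1 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)    # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)     # L1 penalty can set coefficients to zero

for name, est in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    print(name, np.round(est.coef_, 3))
```

With these settings, Ridge should shrink the coefficients slightly, while Lasso should drive the coefficient of the noise feature to zero.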

## What does model.fit() do? What arguments must be given?

`model.fit()` trains the model by adjusting its parameters to fit the training data. In scikit-learn, supervised estimators require `X` (features) and `y` (targets), while unsupervised estimators and transformers need only `X`. Some APIs accept extra arguments, such as `sample_weight` (many scikit-learn estimators) or `epochs` and `validation_data` (Keras).

## What does model.predict() do? What arguments must be given?

`model.predict()` returns predictions from the trained model for new input data. The only required argument is `X` (a feature matrix with the same columns used in training). Many scikit-learn classifiers also provide `predict_proba()` to get class probabilities.
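
A minimal sketch of `fit()`, `predict()`, and `predict_proba()` used together, assuming a logistic regression and a hypothetical one-dimensional dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D data: class 0 for small values, class 1 for large values
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)            # learn parameters from features and targets

X_new = np.array([[1.5], [8.5]])
labels = model.predict(X_new)          # hard class labels, shape (2,)
probs = model.predict_proba(X_new)     # class probabilities, shape (2, 2)
print(labels, probs.round(3))
```

Each row of `probs` sums to 1, and `predict()` returns the class with the highest probability.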

## What is feature scaling? How does it help in Machine Learning? How do we perform scaling in Python?

**Feature scaling** transforms numeric features to a comparable scale. It helps gradient-descent-based models converge faster and keeps distance-based or margin-based models (KNN, SVM) from being dominated by features with large numeric ranges.

Common methods:
- **Standardization (Z-score)**: subtract mean and divide by std (`StandardScaler`).
- **Min-Max Scaling**: scale values to [0,1] range (`MinMaxScaler`).
- **Robust Scaling**: uses median and IQR to reduce effect of outliers (`RobustScaler`).

In Python (scikit-learn):
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
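
To see how the three methods differ, the sketch below applies each scaler to the same hypothetical feature containing one outlier. The values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one outlier (100.0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(X)   # mean 0, standard deviation 1
min_maxed = MinMaxScaler().fit_transform(X)        # rescaled to the range [0, 1]
robust = RobustScaler().fit_transform(X)           # (x - median) / IQR

print(standardized.ravel().round(3))
print(min_maxed.ravel().round(3))
print(robust.ravel().round(3))
```

With min-max scaling the outlier squeezes the other values toward 0, whereas the robust scaler keeps them spread around the median.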

## Explain data encoding.

**Data encoding** transforms categorical or non-numeric data into numeric form suitable for ML models. Techniques include label encoding, one-hot encoding, ordinal encoding, target encoding, and embeddings. For textual data, encoding may also include TF-IDF or word embeddings (Word2Vec, BERT).
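
For text, a small TF-IDF sketch with scikit-learn's `TfidfVectorizer` is shown below. The three-sentence corpus is a made-up example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus of three documents
docs = ["the cat sat", "the dog sat", "the dog barked loudly"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse matrix: documents x vocabulary terms

vocab = vectorizer.vocabulary_           # maps each term to its column index
first_doc = tfidf[0].toarray().ravel()   # dense TF-IDF vector of the first document

print(sorted(vocab))
print(tfidf.shape)
print(first_doc[vocab["the"]], first_doc[vocab["cat"]])
```

A word that appears in every document ("the") gets a lower weight than a rarer word ("cat") within the same document.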

## Notes / Final tips

- Always avoid data leakage (information from test/validation leaking into training).
- Use cross-validation for robust evaluation.
- Document preprocessing steps and save any fitted transformers (scalers, encoders) to apply consistently to new data.
- For reproducibility, set `random_state`/`seed` when splitting data or training models.
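
One way to save and reuse a fitted transformer is `joblib`, sketched below. The file name `scaler.joblib` and the temporary directory are assumptions for the example; in practice the file would be stored alongside the trained model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler().fit(X_train)              # fit on training data only

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.joblib")   # hypothetical file name
    joblib.dump(scaler, path)                       # persist the fitted transformer
    restored = joblib.load(path)                    # reload later, e.g. at inference time

X_new = np.array([[2.0], [4.0]])
print(restored.transform(X_new).ravel())
```

The restored scaler applies the mean and standard deviation learned from the training data, so new data is transformed consistently.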