<a href="https://colab.research.google.com/github/shubhamvermapersonal/da_module_questions/blob/main/Machine_Learning_1_Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Theory Questions**

### 1. **What is a Parameter?**
In machine learning, a **parameter** is a value inside the model that is learned from the training data. For example, in linear regression the coefficients (slopes) and the intercept are parameters. Together they define the relationship between the input features and the output, and the trained model uses them to make predictions. Parameters differ from **hyperparameters** (such as the learning rate or the depth of a tree), which are set before training rather than learned.
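
As a minimal sketch of this idea, the snippet below fits a linear regression on toy data generated from `y = 3x + 2` (an assumption made purely for illustration) and reads back the learned parameters through scikit-learn's `coef_` and `intercept_` attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 3x + 2 (illustrative values, not from a real dataset)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3 * X.ravel() + 2

# fit_intercept is a hyperparameter: it is chosen before training
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# coef_ and intercept_ are parameters: they are learned from the data
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)  # approximately 3.0 and 2.0
```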

### 2. **What is Correlation? What Does Negative Correlation Mean?**
**Correlation** is a statistical measure of the strength and direction of the linear relationship between two variables. The (Pearson) correlation coefficient ranges from -1 to 1:
- A **positive correlation** (near 1) means that as one variable increases, the other tends to increase as well.
- A **negative correlation** (near -1) means that as one variable increases, the other tends to decrease.

For example, temperature and ice cream sales might be positively correlated (higher temperatures go with higher sales). On the other hand, the number of umbrellas sold and the amount of sunshine could be negatively correlated.
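
To make the coefficient concrete, here is a small sketch that computes Pearson's r by hand with NumPy. The sales and sunshine figures are made-up numbers and `pearson_r` is a hypothetical helper written for this example.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

# Made-up observations for illustration
temperature = [20, 24, 28, 32, 36]
ice_cream_sales = [110, 135, 160, 190, 220]
sunshine_hours = [2, 4, 6, 8, 10]
umbrellas_sold = [40, 33, 25, 18, 9]

r_pos = pearson_r(temperature, ice_cream_sales)   # close to +1
r_neg = pearson_r(sunshine_hours, umbrellas_sold)  # close to -1
print(r_pos, r_neg)
```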

### 3. **Define Machine Learning. What Are the Main Components in Machine Learning?**
**Machine Learning** (ML) is a subset of artificial intelligence that involves training algorithms to learn from and make predictions or decisions based on data, without being explicitly programmed for every situation. The main components in machine learning include:
- **Data**: The raw information that the model will use to learn.
- **Algorithm**: The method used to learn from the data (e.g., decision trees, neural networks, etc.).
- **Model**: The result of training the algorithm on data; it is used to make predictions.
- **Training Process**: The phase where the model learns patterns from the data.
- **Evaluation**: Assessing how well the model performs using metrics like accuracy or loss.

### 4. **How Does Loss Value Help in Determining Whether the Model Is Good or Not?**
The **loss value** is the output of a loss function (also called a cost function), which measures how far the model's predictions are from the actual outcomes. A **low loss** means that the predictions are close to the actual values, while a **high loss** indicates that the model is performing poorly. Training aims to minimize the loss. To judge whether a model is good, compare the loss on the training data with the loss on held-out (validation or test) data: a low training loss paired with a much higher held-out loss suggests overfitting. Common loss functions include mean squared error (MSE) for regression tasks and cross-entropy loss for classification tasks.
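
The two loss functions mentioned above can be sketched in a few lines of NumPy. The helper names and the sample predictions are assumptions for illustration; libraries such as scikit-learn ship their own implementations.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of the true class (binary case)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Regression: predictions close to the targets give a lower MSE
y_true_reg = [3.0, 5.0, 7.0]
mse_close = mean_squared_error(y_true_reg, [2.9, 5.1, 7.0])
mse_far = mean_squared_error(y_true_reg, [1.0, 8.0, 4.0])

# Classification: confident correct probabilities give a lower cross-entropy
y_true_clf = [1, 0, 1]
bce_good = binary_cross_entropy(y_true_clf, [0.9, 0.1, 0.8])
bce_bad = binary_cross_entropy(y_true_clf, [0.2, 0.7, 0.3])

print(mse_close, mse_far, bce_good, bce_bad)
```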

### 5. **What Are Continuous and Categorical Variables?**
- **Continuous variables** are numeric variables that can take any value within a given range (e.g., height, weight, temperature). They represent measurable quantities; when the target itself is continuous, the task is regression.
- **Categorical variables** represent categories or groups and take label values (e.g., gender, color, city). As input features they can appear in any task once encoded numerically; when the target itself is categorical, the task is classification.

### 6. **How Do We Handle Categorical Variables in Machine Learning? What Are the Common Techniques?**
Categorical variables need to be transformed into a format that can be understood by machine learning algorithms (which usually expect numerical input). Some common techniques for handling categorical variables include:
- **One-Hot Encoding**: Converts each category into a binary vector. For example, a "color" variable with values "red," "green," and "blue" would be transformed into three columns: one for red, one for green, and one for blue.
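- **Label Encoding**: Assigns an integer to each category. For example, "red" might become 0, "green" 1, and "blue" 2. Because the integers imply an order that nominal categories do not have, this is best suited to target labels or to tree-based models.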
- **Ordinal Encoding**: For ordinal variables (variables that have a specific order, like "small," "medium," "large"), this method assigns integers based on the order.
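
A short sketch of two of these techniques, assuming a small hypothetical DataFrame with a nominal `color` column and an ordinal `size` column: `pandas.get_dummies` handles the one-hot step, and scikit-learn's `OrdinalEncoder` is given the category order explicitly.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data for illustration
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["small", "large", "medium", "small"],
})

# One-hot encoding: one indicator column per color
color_dummies = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: pass the order explicitly so small < medium < large
size_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = size_encoder.fit_transform(df[["size"]]).ravel()

print(color_dummies)
print(size_codes)  # [0. 2. 1. 0.]
```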

### 7. **What Do You Mean by Training and Testing a Dataset?**
- **Training** means fitting the machine learning model on a subset of the available data (the training set). This data helps the model learn patterns and relationships.
- **Testing** means evaluating the trained model on a separate subset of the data (the test set) that the model has not seen during training. The test set is crucial to determine how well the model generalizes to new, unseen data.

### 8. **What is sklearn.preprocessing?**
**`sklearn.preprocessing`** is a module of the `scikit-learn` library in Python that provides classes and functions for transforming and scaling data. Some of the common preprocessing techniques include:
- **Standardization**: Scaling the data so that it has a mean of 0 and a standard deviation of 1 (`StandardScaler`).
- **Normalization**: Rescaling the data to a specific range, usually [0, 1] (`MinMaxScaler`).
- **Imputation**: Filling missing values with a specified value or method. Note that in current scikit-learn versions the imputers (e.g., `SimpleImputer`) live in the separate `sklearn.impute` module rather than in `sklearn.preprocessing`.
- **Encoding**: Converting categorical data into numerical form using techniques like one-hot encoding (`OneHotEncoder`).

### 9. **What is a Test Set?**
A **test set** is a portion of the dataset that is not used during the model training phase. It is reserved for evaluating how well the model performs on unseen data. Using a test set helps in assessing the generalization ability of the model. It’s important that the test set plays no part in training or model selection; otherwise information leaks from it and the performance estimate becomes overly optimistic.

### 10. **How Do We Split Data for Model Fitting (Training and Testing) in Python? How Do You Approach a Machine Learning Problem?**
To split data for model fitting in Python, the `train_test_split` function from the `sklearn.model_selection` module is commonly used:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
This splits the data into training data (80%) and testing data (20%), with `X` being the features and `y` being the target variable.

**Approaching a Machine Learning Problem**:
1. **Understand the Problem**: Define the problem clearly (e.g., classification, regression).
2. **Collect and Prepare Data**: Gather relevant data and preprocess it by handling missing values, scaling features, and encoding categorical variables.
3. **Choose a Model**: Select an appropriate machine learning model (e.g., decision tree, random forest, neural network).
4. **Train the Model**: Fit the model on the training dataset, and tune hyperparameters using a validation set or cross-validation rather than the test set.
5. **Evaluate the Model**: Test the model using the test set and assess its performance using evaluation metrics (accuracy, precision, recall, etc.).
6. **Iterate and Improve**: If necessary, refine the model by adjusting parameters, adding features, or trying different models.
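
The steps above can be sketched end to end as follows. This is one possible workflow under stated assumptions, not a definitive recipe: the Iris dataset, the `StandardScaler` plus `LogisticRegression` pipeline, and 5-fold cross-validation are all choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a classification problem with a ready-made dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 3: choose a model; the pipeline keeps scaling inside the training flow
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step 4: estimate performance with cross-validation on the training data only
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Step 5: fit on the full training set and evaluate once on the test set
pipeline.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, pipeline.predict(X_test))

print(cv_scores.mean(), test_accuracy)
```

Wrapping the scaler and the model in one pipeline means the scaler is refit on each cross-validation fold, so no statistics from held-out data leak into training.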

### 11. **Why Do We Have to Perform EDA Before Fitting a Model to the Data?**
**Exploratory Data Analysis (EDA)** is essential before fitting a model because it helps you understand the data’s structure, patterns, and potential issues. EDA allows you to:
- **Identify missing values**: These may need to be handled before training a model.
- **Visualize distributions**: It helps in checking if features are skewed or follow a normal distribution, which might affect certain models.
- **Detect outliers**: Outliers can skew model performance and might need to be removed or transformed.
- **Understand relationships**: EDA helps identify correlations and relationships between features, aiding in selecting relevant features for modeling.
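
A minimal EDA sketch with pandas, assuming a small made-up table with one missing value and one extreme income. The 1.5 × IQR rule used here is one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing age and one extreme income
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan, 38, 29],
    "income": [40_000, 52_000, 81_000, 90_000, 61_000, 1_000_000, 45_000],
})

# Missing values per column
missing_counts = df.isna().sum()

# Summary statistics and skewness describe each distribution
summary = df.describe()
skewness = df.skew()

# Flag outliers in income with the 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

# Pairwise correlations between numeric columns
correlations = df.corr()

print(missing_counts)
print(outliers)
```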

### 12. **What is Correlation?**
**Correlation** measures the statistical relationship between two or more variables. It shows the strength and direction of the relationship:
- A **positive correlation** means that as one variable increases, the other also tends to increase.
- A **negative correlation** means that as one variable increases, the other tends to decrease.
- A **correlation of 0** means no linear relationship exists between the variables.

### 13. **What Does Negative Correlation Mean?**
A **negative correlation** between two variables means that as one variable increases, the other tends to decrease. For example, there’s a negative correlation between the amount of time spent studying and the number of errors made in an exam. As study time increases, errors decrease.

### 14. **How Can You Find Correlation Between Variables in Python?**
To find the correlation between variables in Python, you can use the `pandas` library. The `corr()` method calculates pairwise correlation between columns in a DataFrame:
```python
import pandas as pd
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [5, 4, 3, 2, 1]
})
correlation = df.corr()  # Correlation matrix
print(correlation)
```
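This prints a 2-by-2 correlation matrix. The off-diagonal entries give the correlation between `x` and `y`, which is -1.0 here because `y` decreases exactly linearly as `x` increases.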
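
When only one coefficient is needed, `Series.corr` returns it directly, and the `method` argument of `DataFrame.corr` switches from the default Pearson coefficient to a rank-based one. A brief sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [5, 4, 3, 2, 1],
})

# Single Pearson coefficient between two columns
r_xy = df["x"].corr(df["y"])

# Spearman rank correlation captures monotonic (not only linear) relationships
rho_xy = df.corr(method="spearman").loc["x", "y"]

print(r_xy, rho_xy)  # both approximately -1.0
```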

### 15. **What is Causation? Explain the Difference Between Correlation and Causation with an Example.**
**Causation** means that one variable directly causes changes in another variable. Unlike correlation, which merely indicates a relationship, causation implies a cause-and-effect scenario.
- **Example of correlation without causation**: Ice cream sales and drowning incidents tend to rise together, but buying ice cream does not *cause* drowning. Both increase in hot weather, so temperature is a common cause (a confounder).
- **Example of causation**: Smoking causes lung cancer. Here the observed relationship is not just a pattern; extensive research has established a cause-and-effect link.
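
A small simulation can make the distinction concrete. The sketch below assumes a hypothetical world in which temperature drives both ice cream sales and drowning incidents (a standard textbook illustration); all coefficients and noise levels are invented. The two outcomes are strongly correlated, yet the correlation nearly vanishes once the effect of temperature is removed from each.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Temperature is the common cause (confounder) in this simulated world
temperature = rng.normal(25, 5, size=n)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 3, size=n)
drownings = 0.5 * temperature + rng.normal(0, 1, size=n)

# Raw correlation between the two outcomes is high
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(target, predictor):
    """Part of target left over after a straight-line fit on predictor."""
    slope, intercept = np.polyfit(predictor, target, deg=1)
    return target - (slope * predictor + intercept)

# Correlation after removing the effect of temperature from both variables
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(raw_corr, partial_corr)
```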

### 16. **What is an Optimizer? What Are Different Types of Optimizers? Explain Each with an Example.**
An **optimizer** is an algorithm that adjusts the model’s weights to minimize the loss function during training. The objective is to reduce prediction error, which typically improves metrics such as accuracy.
Common types of optimizers:
- **Gradient Descent**: A simple optimization technique where the model’s weights are updated based on the gradient (slope) of the loss function.
  - Example: In linear regression, gradient descent can be used to minimize the error between the predicted and actual values by adjusting the weights.
- **Stochastic Gradient Descent (SGD)**: A variation of gradient descent that updates the weights using a single randomly chosen example, or in practice a small random subset (a mini-batch), at each step instead of the full dataset, making it faster for large datasets.
- **Adam Optimizer**: Combines momentum with per-parameter adaptive learning rates. It’s widely used in deep learning for efficient training.
  - Example: In a neural network, Adam adapts the step size for each weight during training, which often speeds up convergence.
- **RMSProp**: An adaptive learning rate optimizer that divides the gradient by the square root of a moving average of recent squared gradients, which keeps step sizes stable when gradient magnitudes vary widely.
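
As a minimal sketch of the simplest of these, the snippet below runs plain (full-batch) gradient descent on a one-feature linear regression with an MSE loss. The toy data follows `y = 2x + 1`, and the learning rate and step count are assumptions picked so that this small example converges.

```python
import numpy as np

# Toy data following y = 2x + 1 (illustrative values)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0

w, b = 0.0, 0.0        # weights start at zero
learning_rate = 0.05   # step size (a hyperparameter)

for step in range(2000):
    error = (w * X + b) - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    # Move against the gradient to reduce the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # approximately 2.0 and 1.0
```

SGD would compute the same gradients on a random example or mini-batch at each step, while Adam and RMSProp additionally rescale each update using running statistics of past gradients.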

### 17. **What is sklearn.linear_model?**
**`sklearn.linear_model`** is a module in the `scikit-learn` library that provides linear models for regression and classification tasks. It includes:
- **LinearRegression**: For linear regression problems.
- **LogisticRegression**: For classification tasks (binary and multiclass).
- **Ridge** and **Lasso**: Linear regression models with L2 and L1 regularization, respectively, to prevent overfitting.
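
The effect of the two penalties can be sketched on synthetic data in which only the first two of five features influence the target (an assumption built into the example; the `alpha` values are arbitrary). Ridge shrinks all coefficients, while Lasso tends to set the irrelevant ones exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features matter; the other three are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)   # no regularization
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty can zero coefficients out

print("OLS:  ", ols.coef_.round(3))
print("Ridge:", ridge.coef_.round(3))
print("Lasso:", lasso.coef_.round(3))
```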

### 18. **What Does model.fit() Do? What Arguments Must Be Given?**
**`model.fit()`** trains the model on the training dataset. For supervised models it requires two arguments (unsupervised estimators such as clustering models need only the features):
- `X_train`: The input data (features) used to train the model.
- `y_train`: The target data (labels) corresponding to `X_train`.

Example:
```python
model.fit(X_train, y_train)
```
The model learns from the data, adjusting its parameters to minimize the error.
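
A self-contained sketch of `fit()` for a supervised and an unsupervised estimator. The Iris dataset and the two estimators are chosen only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: fit needs both features and targets
clf = LogisticRegression(max_iter=1000)
fitted = clf.fit(X, y)  # fit returns the estimator itself

# Learned parameters are stored in attributes ending with an underscore
print(clf.coef_.shape)  # (3, 4): one row per class, one column per feature

# Unsupervised: fit needs only the features
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)
print(km.cluster_centers_.shape)  # (3, 4)
```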

### 19. **What Does model.predict() Do? What Arguments Must Be Given?**
**`model.predict()`** is used to make predictions after the model has been trained. It requires the input data (`X_test`) as an argument:
```python
predictions = model.predict(X_test)
```
Here, `X_test` contains the features of the test dataset, and the model predicts the target values based on the learned patterns.
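
A runnable sketch of the prediction step, again assuming the Iris dataset and a logistic regression classifier. For classifiers that support it, `predict_proba` returns class probabilities instead of hard labels.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One predicted class label per test row
predictions = model.predict(X_test)

# One probability per class for each test row; each row sums to 1
probabilities = model.predict_proba(X_test)

print(predictions.shape, probabilities.shape)  # (30,) (30, 3)
```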

### 20. **What Are Continuous and Categorical Variables?**
- **Continuous variables** are numeric and can take any value within a range (e.g., height, weight, temperature).
- **Categorical variables** are discrete and represent categories or groups (e.g., gender, country, product type).

### 21. **What is Feature Scaling? How Does It Help in Machine Learning?**
**Feature scaling** refers to the process of standardizing or normalizing the range of independent variables or features in a dataset. Some machine learning algorithms are sensitive to the scale of data (e.g., K-Nearest Neighbors, SVM, gradient descent-based models). Feature scaling keeps features with large numeric ranges from dominating those with small ranges, and it often helps gradient-based training converge faster.
- **Standardization**: Centers the data around zero by subtracting the mean and dividing by the standard deviation.
- **Normalization**: Scales the data to a specific range (usually [0, 1]).
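
Both formulas can be written directly in NumPy. The sketch below uses a tiny made-up matrix whose two columns live on very different scales.

```python
import numpy as np

# Two features on very different scales (illustrative values)
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

# Standardization: subtract the column mean, divide by the column std
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: map each column onto [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_standardized)
print(X_minmax)  # rows: [0, 0], [0.5, 0.5], [1, 1]
```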

### 22. **How Do We Perform Scaling in Python?**
In Python, you can use `sklearn.preprocessing` to scale data:
- **Standardization** using `StandardScaler`:
  ```python
  from sklearn.preprocessing import StandardScaler
  scaler = StandardScaler()
  X_scaled = scaler.fit_transform(X)
  ```
- **Normalization** using `MinMaxScaler`:
  ```python
  from sklearn.preprocessing import MinMaxScaler
  scaler = MinMaxScaler()
  X_normalized = scaler.fit_transform(X)
  ```
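
When the data is split into training and test sets, a common practice is to fit the scaler on the training portion only and reuse its statistics on the test portion, so that no information from the test set leaks into preprocessing. A sketch on randomly generated data (the data itself is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same training statistics to the test data
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for every column
print(X_test_scaled.mean(axis=0))   # close to 0, but not exactly
```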

### 23. **What is sklearn.preprocessing?**
**`sklearn.preprocessing`** is a module in `scikit-learn` that provides tools to preprocess and transform data. Some commonly used classes include:
- **StandardScaler**: For standardization (z-score normalization).
- **MinMaxScaler**: For scaling features to a given range.
- **OneHotEncoder**: For encoding categorical variables into binary vectors.
- **LabelEncoder**: For encoding target labels (`y`) as integers.

### 24. **How Do We Split Data for Model Fitting (Training and Testing) in Python?**
You can use the `train_test_split` function from `sklearn.model_selection` to split your data into training and testing sets:
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
This splits the data, with 80% used for training and 20% for testing.
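
If a separate validation set is needed for tuning, `train_test_split` can be applied twice. The sketch below assumes the Iris dataset and a 60/20/20 split; `stratify` keeps the class proportions the same in every subset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split: hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: 25% of the remaining 80% becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```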

### 25. **Explain Data Encoding.**
**Data encoding** is the process of converting categorical variables into a format that can be understood by machine learning algorithms. Common encoding methods include:
- **One-Hot Encoding**: Converts each category into a binary vector. For example, if a column has categories `['red', 'green', 'blue']`, it would be converted to separate columns for each color with 1s and 0s indicating the presence of each category.
- **Label Encoding**: Assigns a unique integer to each category. For instance, `['red', 'green', 'blue']` might become `[0, 1, 2]`.
