# Feature Engineering
# Theory Questions

1. What is a parameter?
-> A parameter in machine learning refers to a configuration variable that is internal to the model and is learned from the training data. For example, in linear regression, the slope and intercept of the line are parameters that the model adjusts to minimize the error between predicted and actual values.
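
As a minimal sketch of learned parameters, the snippet below fits scikit-learn's `LinearRegression` to a small made-up dataset generated from y = 2x + 1 and reads the slope and intercept back from the fitted model. The data and variable names are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (assumed for illustration): y = 2x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

# The learned parameters are stored on the fitted model
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)  # approximately 2.0 and 1.0
```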

2. What is correlation?
-> Correlation measures the strength and direction of the linear relationship between two variables. It is quantified by a correlation coefficient (most commonly Pearson's r), which ranges from -1 to 1. A coefficient close to 1 indicates a strong positive correlation, a coefficient close to -1 indicates a strong negative correlation, and a coefficient close to 0 indicates little or no linear relationship.

3. What does negative correlation mean?
-> Negative correlation indicates that as one variable increases, the other variable tends to decrease. For example, in a study of exercise and weight, a negative correlation would suggest that higher levels of exercise are associated with lower body weight, reflecting an inverse relationship between the two variables.
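
The correlation coefficient from the two answers above can be computed with NumPy's `np.corrcoef`. This is a small sketch on invented data in which one series rises with `x` and the other falls; the arrays are assumptions chosen to make the sign of the coefficient obvious.

```python
import numpy as np

# Illustrative data (assumed): y_pos rises with x, y_neg falls with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2.0 * x + 1.0
y_neg = -3.0 * x + 10.0

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is r(x, y)
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]
print(r_pos, r_neg)  # approximately 1.0 and -1.0
```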

4. Define Machine Learning.
-> Machine Learning is a subset of artificial intelligence that enables systems to learn from data, identify patterns, and make decisions with minimal human intervention. It involves algorithms that improve their performance on a task as they are exposed to more data, such as predicting house prices based on features like size and location.

5. What are the main components in Machine Learning?
-> The main components of machine learning include data, algorithms, models, and evaluation metrics. Data is the foundation, algorithms are the methods used to learn from data, models are the representations of learned patterns, and evaluation metrics assess model performance, such as accuracy or F1 score.

6. How does loss value help in determining whether the model is good or not?
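-> The loss value quantifies how far a model's predictions are from the actual outcomes. A lower loss means the predictions fit the data better, while a higher loss means they fit worse. For instance, in regression, mean squared error (MSE) is commonly used; a decreasing MSE during training shows that the model is fitting the training data better. To judge whether the model is actually good, compare the loss on held-out validation or test data as well: a low training loss combined with a much higher validation loss is a sign of overfitting.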
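
To make the MSE comparison concrete, here is a small sketch that scores two hypothetical sets of predictions against the same targets. The numbers and the helper name `mse` are assumptions for illustration.

```python
import numpy as np

# Invented targets and two invented sets of predictions
y_true = np.array([3.0, 5.0, 7.0])
good_pred = np.array([3.0, 5.0, 8.0])
bad_pred = np.array([0.0, 9.0, 2.0])

def mse(actual, predicted):
    """Mean squared error: average of the squared differences."""
    return float(np.mean((actual - predicted) ** 2))

mse_good = mse(y_true, good_pred)  # (0 + 0 + 1) / 3
mse_bad = mse(y_true, bad_pred)    # (9 + 16 + 25) / 3
print(mse_good, mse_bad)
```

The predictions with the smaller MSE are the ones closer to the targets.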

7. What are continuous and categorical variables?
-> Continuous variables are numerical values that can take any value within a range, such as height or temperature. Categorical variables represent distinct categories or groups, such as gender or color. They can be nominal (no order) or ordinal (with order), influencing how data is analyzed and modeled.

8. How do we handle categorical variables in Machine Learning? What are the common techniques?
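-> Categorical variables can be handled using techniques like one-hot encoding, which creates one binary column per category, or label (integer) encoding, which assigns a unique integer to each category. For example, converting "red," "blue," and "green" into three binary columns lets algorithms process the color without implying any order among the categories. Integer encoding is more compact but imposes an ordering, so it suits ordinal categories or tree-based models better than nominal categories fed to linear or distance-based models.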
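
One way to one-hot encode the color example is scikit-learn's `OneHotEncoder`. This is a sketch on a made-up column; the encoder sorts categories alphabetically, so the columns come out as blue, green, red.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative single-column dataset (assumed)
colors = np.array([["red"], ["blue"], ["green"], ["blue"]])

encoder = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array to inspect it
one_hot = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # column order: blue, green, red
print(one_hot)
```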

9. What do you mean by training and testing a dataset?
-> Training and testing datasets are subsets of data used in machine learning. The training dataset is used to train the model, allowing it to learn patterns, while the testing dataset evaluates the model's performance on unseen data, ensuring it generalizes well to new inputs and does not overfit.

10. What is sklearn.preprocessing?
-> sklearn.preprocessing is a module in the Scikit-learn library that provides functions to preprocess data before training machine learning models. It includes tools for scaling features, encoding categorical variables, and normalizing data, which help improve model performance and ensure that different features contribute equally to the learning process.

11. What is a Test set?
-> A test set is a portion of the dataset that is reserved for evaluating the performance of a trained machine learning model. It is not used during the training phase, allowing for an unbiased assessment of how well the model generalizes to new, unseen data, ensuring its effectiveness in real-world applications.

12. How do we split data for model fitting (training and testing) in Python?
-> In Python, data can be split into training and testing sets using the train_test_split function from the Scikit-learn library. For example, train_test_split(X, y, test_size=0.2) shuffles the rows and splits the dataset into 80% training and 20% testing. Passing random_state makes the split reproducible, and passing stratify=y keeps the class proportions similar in both sets for classification problems.
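
A minimal runnable version of that call, using a made-up 10-row dataset so the resulting shapes are easy to check. The arrays and the `random_state` value are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset (assumed): 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 80% training, 20% testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```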

13. How do you approach a Machine Learning problem?
-> To approach a machine learning problem, first, define the objective and gather relevant data. Next, perform exploratory data analysis (EDA) to understand data characteristics, preprocess the data (cleaning, encoding, scaling), select appropriate algorithms, train the model, evaluate its performance, and iterate to improve results.
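
The steps above can be sketched end to end with scikit-learn. This example assumes a classification objective and uses the built-in iris dataset as a stand-in for real data; the choice of `StandardScaler` and `LogisticRegression` is illustrative, not a recommendation for every problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Gather data (a small built-in dataset stands in for real data here)
X, y = load_iris(return_X_y=True)

# 2. Hold out a test set before any fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 3. Preprocess and pick a model; the pipeline fits the scaler on training data only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 4. Train
pipeline.fit(X_train, y_train)

# 5. Evaluate on unseen data, then iterate if the result is not good enough
test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```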

14. Why do we have to perform EDA before fitting a model to the data?
-> Exploratory Data Analysis (EDA) is crucial before model fitting as it helps identify data patterns, distributions, and potential issues like missing values or outliers. EDA informs feature selection, guides preprocessing steps, and enhances understanding of relationships between variables, ultimately leading to better model performance and accuracy.
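
A few pandas calls cover the checks mentioned above. The sketch below uses an invented table containing one missing value and one suspicious outlier; the column names and numbers are assumptions.

```python
import numpy as np
import pandas as pd

# Invented data: one missing size and one implausibly large size
df = pd.DataFrame({
    "size_sqft": [850.0, 900.0, np.nan, 1200.0, 15000.0],
    "price": [100.0, 110.0, 120.0, 150.0, 160.0],
})

# Missing values per column
missing_per_column = df.isna().sum()

# Summary statistics (count, mean, std, min, quartiles, max) help spot outliers
summary = df.describe()

# Pairwise correlations between the numeric columns
correlations = df.corr()

print(missing_per_column)
print(summary)
print(correlations)
```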

15. What is causation?
-> Causation refers to a relationship where one event directly influences another. For example, smoking causes an increased risk of lung cancer. Establishing causation requires more rigorous analysis than correlation, often involving controlled experiments or longitudinal studies to demonstrate that changes in one variable lead to changes in another.

16. Explain the difference between correlation and causation with an example.
-> Correlation indicates a relationship between two variables, while causation implies that one variable directly affects the other. For example, ice cream sales and drowning incidents may correlate due to both increasing in summer, but ice cream sales do not cause drowning; they are influenced by the warmer weather.
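
The ice cream example can be illustrated with a small simulation. In this sketch, all numbers are invented: temperature drives both quantities, and neither affects the other. The raw correlation is strong, but it largely disappears once the straight-line effect of temperature is removed from each series.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated confounder (assumed numbers): daily temperature in Celsius
temperature = rng.uniform(10, 35, size=n)

# Both quantities depend on temperature, not on each other
ice_cream_sales = 5.0 * temperature + rng.normal(0, 10, size=n)
drownings = 0.3 * temperature + rng.normal(0, 1, size=n)

# Raw correlation is strong even though neither causes the other
r_raw = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, predictor):
    """Remove the best straight-line fit of `predictor` from `values`."""
    slope, intercept = np.polyfit(predictor, values, deg=1)
    return values - (slope * predictor + intercept)

# After removing the temperature effect, the correlation largely vanishes
r_controlled = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]
print(round(r_raw, 2), round(r_controlled, 2))
```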

17. What is an Optimizer?
-> An optimizer is an algorithm used to adjust the parameters of a machine learning model to minimize the loss function during training. Optimizers help improve model performance by finding the best parameter values that reduce prediction errors, ensuring the model learns effectively from the training data.

18. What are different types of optimizers? Explain each with an example.
-> Common types of optimizers include:
* Gradient Descent: Updates parameters in the direction opposite to the gradient of the loss function. Example: Stochastic Gradient Descent (SGD) updates parameters using a single data point (or a small mini-batch) at a time.
* Adam: Combines momentum with per-parameter adaptive learning rates. Example: Widely used in deep learning, where it often converges faster than plain SGD.
* RMSprop: Scales each parameter's learning rate by a moving average of recent squared gradients, which dampens oscillations. Example: Often used to train recurrent neural networks.
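
To show what an optimizer does at its simplest, here is a sketch of plain (full-batch) gradient descent fitting a line by minimizing MSE with NumPy. It is not Adam or RMSprop; the data, learning rate, and step count are assumptions chosen so the loop converges.

```python
import numpy as np

# Toy data (assumed): y = 2x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.05     # step size (assumed value)

for _ in range(2000):
    error = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient to reduce the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # approximately 2.0 and 1.0
```

Adam and RMSprop follow the same loop structure but rescale each step using running statistics of past gradients.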

19. What is sklearn.linear_model?
-> sklearn.linear_model is a module in Scikit-learn that provides linear models for regression and classification tasks. It includes algorithms like Linear Regression, Logistic Regression, and Ridge Regression, allowing users to fit linear models to data, making predictions based on linear relationships between features and target variables.

20. What does model.fit() do? What arguments must be given?
-> The model.fit() method trains a machine learning model on the provided training data. For supervised estimators it requires two arguments: the feature set (X) and the target variable (y). Unsupervised estimators and transformers, such as KMeans or StandardScaler, need only X. For example, model.fit(X_train, y_train) adjusts the model parameters to minimize the loss function on the training data.

21. What does model.predict() do? What arguments must be given?
-> The model.predict() method generates predictions for new data based on the trained model. It requires the feature set (X) as an argument. For example, predictions = model.predict(X_test) uses the test data to produce output values, allowing evaluation of the model's performance on unseen data.
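
The two calls fit together as in the sketch below, which trains a `LogisticRegression` classifier on a tiny, clearly separable made-up dataset and then predicts labels for two new points. The data and the choice of model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data (assumed): small values are class 0, large values are class 1
X_train = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)         # learn parameters from features and targets

X_new = np.array([[-1.0], [11.0]])  # unseen inputs; same number of columns as X_train
predictions = model.predict(X_new)  # only features are passed, no targets
print(predictions)  # [0 1]
```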

22. What is feature scaling? How does it help in Machine Learning?
-> Feature scaling brings the independent variables (features) in a dataset onto a comparable range. It helps distance-based algorithms like k-NN by preventing features with larger ranges from dominating the distance calculations, and it helps models trained with gradient descent converge faster and more reliably.

23. How do we perform scaling in Python?
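-> In Python, scaling can be performed using the StandardScaler or MinMaxScaler from the Scikit-learn library. For example, to standardize features, you can use:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small example feature matrix (rows = samples, columns = features)
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

scaler = StandardScaler()
# Learn each column's mean and standard deviation, then apply the scaling
X_scaled = scaler.fit_transform(X)
```

This scales each feature to have a mean of 0 and a standard deviation of 1. When the data is split into training and test sets, fit the scaler on the training set only and reuse it to transform the test set, so that no information from the test set leaks into training.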
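
The answer also names `MinMaxScaler`. A minimal sketch of it follows, assuming a made-up training set and a single test row; the scaler is fitted on the training data and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented data: two features on very different ranges
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 250.0]])

scaler = MinMaxScaler()
# Learn each column's min and max from the training data, then rescale to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training min and max on the test data (no refitting)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)  # [[0.5  0.75]]
```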

24. What is sklearn.preprocessing?
-> sklearn.preprocessing is a module in Scikit-learn that provides tools for transforming and preparing data for machine learning. It includes functions for scaling features, encoding categorical variables, normalizing data, and creating polynomial features, ensuring that the data is in a suitable format for model training.

25. Explain data encoding.
-> Data encoding is the process of converting categorical variables into a numerical format that machine learning algorithms can understand. Techniques include one-hot encoding, which creates binary columns for each category, and label encoding, which assigns unique integers to categories. This transformation is essential for effective model training and performance.
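
For categories that do have a natural order, integer encoding can be sketched with scikit-learn's `OrdinalEncoder`, passing the order explicitly. The size labels below are an assumed example.

```python
from sklearn.preprocessing import OrdinalEncoder

# Invented ordinal feature: small < medium < large
sizes = [["medium"], ["small"], ["large"], ["small"]]

# Passing the category order explicitly keeps the integers meaningful
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded = encoder.fit_transform(sizes)

print(encoded.ravel())  # [1. 0. 2. 0.]
```

Without the explicit `categories` argument the encoder would sort the labels alphabetically, which would not match the intended order here.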