## THEORY QUESTIONS – FEATURE ENGINEERING

**Q1. What is a parameter?**
A parameter is a value inside a model that determines how it maps inputs to outputs. In machine learning, parameters are learned from data during training, such as the weights and intercept in linear regression. They directly affect the predictions the model makes and are adjusted by an optimization algorithm to minimize error. Parameters differ from hyperparameters (for example, the learning rate), which are set before training begins and are not learned from the data.

**Q2. What is correlation?**
Correlation is a statistical measure of how strongly two variables move together. The most common version, the Pearson correlation coefficient, ranges from -1 to +1. A value close to +1 means a strong positive linear relationship, a value close to -1 means a strong negative linear relationship, and a value near 0 means no linear relationship (a nonlinear relationship may still exist). Correlation helps in understanding how features relate to each other and to the target during data analysis.

**Q3. What does negative correlation mean?**
Negative correlation means that as one variable increases, the other variable decreases. The correlation coefficient will be less than zero. For example, as the speed of a vehicle increases, the time taken to cover a fixed distance decreases. Negative correlation helps identify inverse relationships between variables in datasets.

**Q4. Define Machine Learning. What are the main components in Machine Learning?**
Machine Learning is a subset of artificial intelligence that enables systems to learn patterns from data and make predictions without being explicitly programmed with task-specific rules. The main components are data, features, model, loss function, and optimizer. Data is the raw material for learning, features are the input variables, the model learns the mapping from features to outputs, the loss function measures prediction error, and the optimizer adjusts the model to reduce that error.

**Q5. How does loss value help in determining whether the model is good or not?**
Loss value measures how far the model’s predictions are from the actual values. A lower loss indicates a better fit to the data it was computed on. During training, the model tries to minimize the loss. Comparing loss on training and test data helps detect problems: a low training loss with a much higher test loss suggests overfitting, while a high loss on both suggests underfitting. A loss that is low on both training and unseen data indicates that the model generalizes well.
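As a rough illustration of the answer above, the sketch below computes mean squared error (one common loss function, chosen here as an assumption) for two sets of predictions. The values and the helper name `mean_squared_error` are hypothetical and exist only for this example.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual values and two sets of predictions (illustrative only).
y_true = [3.0, 5.0, 7.0, 9.0]
close_predictions = [2.5, 5.5, 7.0, 9.5]
poor_predictions = [6.0, 2.0, 10.0, 5.0]

loss_close = mean_squared_error(y_true, close_predictions)
loss_poor = mean_squared_error(y_true, poor_predictions)

# The predictions that sit nearer the actual values produce the lower loss.
print(loss_close, loss_poor)
```

The same comparison, done between training loss and test loss for a single model, is what reveals overfitting or underfitting.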

**Q6. What are continuous and categorical variables?**
Continuous variables are numerical values that can take any value within a range, such as height or temperature. Categorical variables represent categories or labels, such as gender or city. Continuous variables can usually be used in models directly (often after scaling), while categorical variables generally need to be encoded as numbers before model training.

**Q7. How do we handle categorical variables in Machine Learning? What are common techniques?**
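Categorical variables are handled by converting them into numerical form. Common techniques include Label Encoding, One-Hot Encoding, and Ordinal Encoding. Label Encoding assigns an integer to each category (in Scikit-learn, `LabelEncoder` is intended for the target column), One-Hot Encoding creates one binary column per category and suits unordered categories, and Ordinal Encoding assigns integers that follow a meaningful order such as small < medium < large. Choosing an encoding that matches the nature of the variable avoids implying an order that does not exist.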
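A minimal sketch of two of the techniques above, using Scikit-learn's `OneHotEncoder` and `OrdinalEncoder`. The city and size values are hypothetical, and the size order passed to `categories` is an assumption made for the example.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: an unordered feature (city) and an ordered one (size).
cities = [["Delhi"], ["Mumbai"], ["Delhi"], ["Pune"]]
sizes = [["small"], ["large"], ["medium"], ["small"]]

# One-Hot Encoding: one binary column per category.
# Categories are sorted alphabetically: Delhi, Mumbai, Pune.
one_hot = OneHotEncoder()
city_encoded = one_hot.fit_transform(cities).toarray()

# Ordinal Encoding: pass the order explicitly so the integers respect it.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = ordinal.fit_transform(sizes)

print(one_hot.categories_[0])
print(city_encoded)
print(size_encoded.ravel())
```

`.toarray()` is used because `OneHotEncoder` returns a sparse matrix by default.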

**Q8. What do you mean by training and testing a dataset?**
Training means fitting the model on one portion of the data (the training set) so that it learns patterns and relationships. Testing means evaluating the trained model on data it has not seen (the test set). This measures how well the model generalizes. Splitting the data does not prevent overfitting by itself, but it makes overfitting visible and gives a more reliable performance estimate.

**Q9. What is sklearn.preprocessing?**
`sklearn.preprocessing` is a module in Scikit-learn used for data preprocessing. It provides tools for scaling, normalization, and encoding categorical data. Common classes include `StandardScaler`, `MinMaxScaler`, `LabelEncoder`, and `OneHotEncoder`. Missing-value imputation is handled by the separate `sklearn.impute` module (for example, `SimpleImputer`). Preprocessing ensures data is in a suitable format for machine learning models.

**Q10. What is a Test set?**
A test set is a portion of the dataset used to evaluate the final performance of a trained model. It is not used during training. The test set provides an unbiased assessment of how well the model performs on new, unseen data.

**Q11. How do we split data for model fitting in Python?**
Data is split using the `train_test_split()` function from `sklearn.model_selection`. It divides the features and labels into training and testing sets according to a ratio set with `test_size`, and `random_state` makes the split reproducible. Evaluating on the held-out portion gives a more trustworthy estimate of model performance.
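The split described above might look like the following sketch. The dataset is hypothetical, and the 80/20 ratio, the seed, and the use of `stratify` are illustrative choices rather than requirements.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 samples, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% for testing. random_state makes the split reproducible,
# and stratify=y keeps the class balance the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```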

**Q12. How do you approach a Machine Learning problem?**
Approaching a machine learning problem involves understanding the problem, collecting data, performing EDA, preprocessing data, selecting a model, training the model, evaluating performance, and fine-tuning. Following these steps in order makes results more reliable and easier to reproduce.

**Q13. Why do we perform EDA before fitting a model?**
EDA (exploratory data analysis) helps understand data patterns, detect outliers, find missing values, and identify relationships between variables. It informs feature selection and preprocessing decisions. The result is cleaner data, which usually leads to better model performance and fewer errors.

**Q14. What is correlation?**
Correlation measures the strength and direction of a relationship between two variables. It helps identify how features are related. Understanding correlation is important for feature selection and avoiding multicollinearity in machine learning models.

**Q15. What does negative correlation mean?**
Negative correlation indicates an inverse relationship between variables. When one variable increases, the other decreases. It is useful in understanding opposite trends in datasets and improving feature interpretation.

**Q16. How can you find correlation between variables in Python?**
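Correlation in Python can be found using the Pandas `DataFrame.corr()` method or NumPy's `numpy.corrcoef()` function. `corr()` returns a matrix of pairwise coefficients for the numerical columns (Pearson by default, with Spearman and Kendall available through the `method` argument), which helps analyze relationships between features.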
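A short sketch of both approaches follows. The column names and values are made up for illustration; the speed and time columns mimic a fixed 120 km trip so that they are negatively correlated.

```python
import numpy as np
import pandas as pd

# Hypothetical data: travel time falls as speed rises over a fixed distance.
df = pd.DataFrame({
    "speed_kmh": [40, 60, 80, 100, 120],
    "time_h": [3.0, 2.0, 1.5, 1.2, 1.0],
    "fuel_l": [8.0, 9.0, 11.0, 13.0, 16.0],  # made-up values
})

# Pairwise Pearson correlation between all numerical columns.
corr_matrix = df.corr()

# The same coefficient for a single pair of columns, computed with NumPy.
r_speed_time = np.corrcoef(df["speed_kmh"], df["time_h"])[0, 1]

print(corr_matrix.round(2))
print(round(r_speed_time, 2))
```

The speed/time coefficient comes out negative, matching the inverse relationship described under negative correlation.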

**Q17. What is causation? Difference between correlation and causation with example.**
Causation means a change in one variable directly produces a change in another. Correlation only shows association, not cause. For example, ice cream sales and drowning cases may be correlated because both rise in summer (hot weather is a confounding factor), but ice cream does not cause drowning. Keeping this distinction in mind prevents wrong conclusions.
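The ice cream example can be imitated with simulated data. In the sketch below every number is invented: temperature is generated first and then drives both variables, and neither variable depends on the other. The helper `residuals` is a hypothetical name for a simple straight-line adjustment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated (not real) data: temperature drives both quantities independently.
temperature = rng.uniform(10, 35, size=500)
ice_cream_sales = 20 * temperature + rng.normal(0, 50, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1, size=500)

# Correlation between the two variables, ignoring temperature.
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, predictor):
    """Remove the straight-line effect of predictor from values."""
    slope, intercept = np.polyfit(predictor, values, 1)
    return values - (slope * predictor + intercept)

# Correlation left after removing temperature's effect from both variables.
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(round(raw_corr, 2), round(partial_corr, 2))
```

The raw correlation is strongly positive, while the correlation that remains after accounting for temperature is close to zero, which is what a shared cause (rather than a causal link) would produce.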

**Q18. What is an Optimizer? Types of optimizers with examples.**
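An optimizer adjusts model parameters to minimize loss. Common optimizers include Gradient Descent, Stochastic Gradient Descent (SGD), and Adam. Batch Gradient Descent computes each update from the full dataset, SGD updates from a single sample at a time (or from a small batch, in the mini-batch variant), and Adam adapts the learning rate for each parameter using running estimates of the gradients.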
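To make the idea concrete, here is a minimal sketch of batch gradient descent fitting a line with a mean squared error loss. The data, learning rate, and iteration count are assumptions chosen so the example converges; this is not how any particular library implements its optimizers.

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1 (no noise), for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.05   # hyperparameter chosen by hand

for _ in range(2000):
    error = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b,
    # computed on the full dataset (batch gradient descent).
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient to reduce the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))
```

Replacing the full arrays with one randomly chosen sample per step would turn this into stochastic gradient descent.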

**Q19. What is sklearn.linear_model?**
`sklearn.linear_model` is a module that provides linear models such as Linear Regression (`LinearRegression`), Logistic Regression (`LogisticRegression`), and Ridge Regression (`Ridge`). These models are used for regression and classification tasks and are simple, efficient, and widely used.

**Q20. What does model.fit() do? What arguments are required?**
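`model.fit()` trains the model using input features and target values. It learns patterns from the data by adjusting the model's parameters. For supervised models, the required arguments are the training features (X) and the target values (y); unsupervised estimators such as clustering models or scalers need only X.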

**Q21. What does model.predict() do? What arguments are required?**
`model.predict()` generates predictions using the trained model. It takes new input data as an argument. The output is predicted values or class labels based on learned patterns.
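A minimal sketch of the `fit()` / `predict()` pattern with `LinearRegression`. The training data is hypothetical and follows y = 3x + 2 exactly, so the learned parameters are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data following y = 3x + 2 exactly.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # 2-D: (n_samples, n_features)
y_train = np.array([5.0, 8.0, 11.0, 14.0])

model = LinearRegression()
model.fit(X_train, y_train)         # learns coef_ and intercept_ from the data

X_new = np.array([[5.0], [6.0]])
predictions = model.predict(X_new)  # one prediction per row of X_new

print(model.coef_, model.intercept_, predictions)
```

Note that `X` must be two-dimensional even with a single feature, while `y` is one-dimensional.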

**Q22. What are continuous and categorical variables?**
Continuous variables are measurable quantities that can take any value within a range, while categorical variables represent distinct groups or labels. The two types are handled differently in preprocessing: continuous variables are typically scaled, and categorical variables are encoded.

**Q23. What is feature scaling? How does it help?**
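Feature scaling transforms features so that they share a comparable range, which stops a feature from dominating simply because of its units. It helps gradient descent-based models converge faster and improves results for distance-based algorithms such as KNN, where distances would otherwise be driven by the largest-valued feature.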

**Q24. How do we perform scaling in Python?**
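Scaling is done using Scikit-learn tools such as `StandardScaler`, `MinMaxScaler`, and `RobustScaler`. `StandardScaler` rescales each feature to zero mean and unit variance, `MinMaxScaler` maps each feature to a fixed range (0 to 1 by default), and `RobustScaler` uses the median and interquartile range, which makes it less sensitive to outliers. The scaler is fitted on the training data and then applied to both training and test data.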
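The sketch below applies two of these scalers to a small, hypothetical feature matrix (age and income are invented columns). It fits each scaler on the training data only, which is a common convention for avoiding leakage from the test set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: age (years) and income.
X_train = np.array([
    [20.0, 20000.0],
    [30.0, 40000.0],
    [40.0, 60000.0],
    [50.0, 80000.0],
])
X_test = np.array([[35.0, 50000.0]])

standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)  # learn mean/std on training data only
X_test_std = standard.transform(X_test)        # reuse those statistics on test data

min_max = MinMaxScaler()
X_train_mm = min_max.fit_transform(X_train)    # each column mapped to [0, 1]

print(X_train_std.round(2))
print(X_test_std)
print(X_train_mm.round(2))
```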

**Q25. What is sklearn.preprocessing?**
It is a Scikit-learn module used for data transformation tasks like scaling, normalization, and encoding. It prepares raw data for machine learning models; missing values are handled separately with `sklearn.impute`.

**Q26. How do we split data for model fitting in Python?**
Data splitting is done using `train_test_split()` from Scikit-learn. It separates features and labels into training and testing datasets to evaluate model performance properly.

**Q27. Explain data encoding.**
Data encoding converts categorical variables into numerical format. Techniques include Label Encoding, One-Hot Encoding, and Ordinal Encoding. Encoding is necessary because most machine learning models, including those in Scikit-learn's linear and distance-based families, expect numerical input.
