# ML module 1 (feature engineering)

Q1. What is a parameter?
ans-  A parameter is a value that defines how a statistical model, algorithm, or data transformation behaves. Parameters are usually learned from the data during training; values that are set manually to control how features are created or manipulated are more precisely called hyperparameters. For example, in a scaling transformation like StandardScaler, the mean and standard deviation learned from the data are parameters.
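
As a minimal sketch of the StandardScaler example (the toy array is made up for illustration), the learned parameters can be read from the fitted object:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): one feature, four samples.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()  # constructor arguments are set manually, not learned
scaler.fit(X)              # learns the per-feature mean and standard deviation

learned_mean = scaler.mean_[0]   # parameter learned from the data
learned_std = scaler.scale_[0]   # population standard deviation (ddof=0)

print("mean:", learned_mean)
print("std:", learned_std)
```

The trailing underscore in `mean_` and `scale_` is scikit-learn's convention for attributes that only exist after fitting.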

Q2. What is correlation?
What does negative correlation mean?

ans-  Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It's a common tool for describing simple relationships without making a statement about cause and effect.

Negative correlation (or inverse correlation) means that two variables move in opposite directions. When one variable increases, the other variable tends to decrease, and vice versa. For example, the more hours a student spends watching TV, the lower their test scores tend to be.
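
A small sketch of the TV example, using invented numbers purely for illustration, shows what a strongly negative Pearson coefficient looks like:

```python
import numpy as np

# Hypothetical data: hours of TV per day and test scores for six students.
tv_hours = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0])
test_scores = np.array([92.0, 88.0, 80.0, 74.0, 65.0, 60.0])

# Pearson correlation coefficient: covariance divided by the product of
# the two standard deviations. np.corrcoef returns a 2x2 matrix, and the
# off-diagonal entry is the coefficient between the two inputs.
r = np.corrcoef(tv_hours, test_scores)[0, 1]

print(f"Pearson r = {r:.3f}")  # a value near -1 means strong negative correlation
```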

Q3. Define Machine Learning. What are the main components in Machine Learning?

ans-  Machine Learning (ML) is a field of artificial intelligence that enables systems to learn from data to identify patterns and make decisions without explicit programming. Its main components are:

- Data: The input information for learning.
- Features: Measurable properties used by the model.
- Algorithm/Model: The method used to learn from data.
- Training: The process of the model learning from data.
- Evaluation: Assessing the model's performance.
- Prediction/Inference: Using the trained model to make new decisions.

Q4. How does loss value help in determining whether the model is good or not?

ans-  The loss value quantifies the error between a model's predictions and the actual values. A lower loss means the predictions are closer to the true outcomes, so it generally indicates a better fit. It should be compared on data the model was not trained on: a low training loss with a much higher validation loss signals overfitting, while a high loss on both signals underfitting. The loss is also the quantity that optimization minimizes during training.
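
To make this concrete, here is a minimal sketch that compares two sets of predictions with mean squared error. The helper `mse` and both prediction lists are hypothetical and chosen only for illustration:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average squared gap between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0, 9.0]
preds_model_a = [2.5, 5.5, 7.0, 8.0]   # hypothetical predictions close to the targets
preds_model_b = [0.0, 9.0, 3.0, 12.0]  # hypothetical predictions far from the targets

loss_a = mse(y_true, preds_model_a)
loss_b = mse(y_true, preds_model_b)

print("loss A:", loss_a)
print("loss B:", loss_b)
print("A is the better fit on this data:", loss_a < loss_b)
```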


Q5. What are continuous and categorical variables?

ans-  
- Continuous Variables: These can take any value within a given range (e.g., height, temperature, time). They are typically measured.
- Categorical Variables: These can take on a limited, fixed number of possible values, typically labels or categories (e.g., gender, color, type of car). They are typically counted or classified.

Q6. How do we handle categorical variables in Machine Learning? What are the common techniques?

ans-  
- One-Hot Encoding: Creates new binary columns for each category, indicating the presence or absence of that category. This is suitable for nominal (unordered) categories.
- Label Encoding: Assigns a unique integer to each category. This suits ordinal categories, but it can imply a false order when applied to nominal categories.
- Target Encoding: Replaces each category with the mean of the target variable for that category. Useful but can lead to data leakage if not done carefully.
- Frequency/Count Encoding: Replaces categories with their frequency or count in the dataset.
- Binary Encoding: Converts categories to binary code, reducing the number of new dimensions compared to one-hot encoding.
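
Several of these techniques can be sketched with plain pandas. The toy DataFrame, the column names, and the `size_order` mapping below are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical toy data: a nominal column, an ordinal column, and a target.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "size": ["small", "large", "medium", "small", "large", "medium"],
    "price": [10.0, 20.0, 12.0, 15.0, 22.0, 14.0],
})

# One-hot encoding: one binary column per category of the nominal feature.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal (label-style) encoding with an explicit order, so the integers
# reflect small < medium < large rather than alphabetical order.
size_order = {"small": 0, "medium": 1, "large": 2}
size_encoded = df["size"].map(size_order)

# Frequency encoding: replace each category with its relative frequency.
color_freq = df["color"].map(df["color"].value_counts(normalize=True))

# Target encoding: replace each category with the mean target for it.
# Computed on the full frame here for brevity; to avoid leakage, compute
# the means on the training split only.
color_target = df["color"].map(df.groupby("color")["price"].mean())

print(one_hot.join([size_encoded.rename("size_encoded"),
                    color_freq.rename("color_freq"),
                    color_target.rename("color_target")]))
```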

Q7. What do you mean by training and testing a dataset?
ans-  Training on a dataset means feeding the algorithm data (the 'training set') so it can learn patterns and relationships, adjusting its internal parameters to minimize errors. Testing means evaluating the trained model on new, unseen data (the 'test set') to assess how well it generalizes and to estimate its real-world accuracy.

Q8. What is sklearn.preprocessing?
ans-  sklearn.preprocessing is a module within the scikit-learn library in Python that provides functions and transformer classes for data preprocessing. Its main purpose is to transform raw feature vectors into a representation that is more suitable for machine learning algorithms. This includes tasks like scaling, normalization, and encoding categorical features. Imputing missing values is a related preprocessing step, but in current scikit-learn versions it is handled by the separate sklearn.impute module.


Q9. What is a Test set?
ans-  A Test set is a portion of the dataset used to evaluate the performance of a machine learning model after it has been trained. It consists of unseen data that the model has not encountered during training, providing an unbiased assessment of how well the model generalizes to new, real-world examples.

Q10.  How do we split data for model fitting (training and testing) in Python?
How do you approach a Machine Learning problem?

ans-  
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset: three numeric features and a balanced binary target.
data = pd.DataFrame({
    'feature_1': range(100),
    'feature_2': [i * 2 for i in range(100)],
    'feature_3': [i ** 2 for i in range(100)],
})
target = pd.Series([0 if i % 2 == 0 else 1 for i in range(100)])

print("Original data shape:", data.shape)
print("Original target shape:", target.shape)

# Hold out 20% of the rows for testing. stratify=target keeps the class
# ratio the same in both splits; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42, stratify=target
)

print("\nTraining features shape (X_train):", X_train.shape)
print("Testing features shape (X_test):", X_test.shape)
print("Training target shape (y_train):", y_train.shape)
print("Testing target shape (y_test):", y_test.shape)

# print() works outside notebooks; display() is only available in IPython/Jupyter.
print("\nFirst 5 rows of X_train:")
print(X_train.head())

print("\nFirst 5 values of y_train:")
print(y_train.head())
```

Q11.  Why do we have to perform EDA before fitting a model to the data?

ans-  We perform Exploratory Data Analysis (EDA) before fitting a model to data to understand the data's characteristics, identify patterns, detect anomalies, check assumptions, and inform feature engineering choices. This helps in selecting appropriate models and improving model performance by addressing data quality issues and gaining insights into relationships between variables.
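
A few pandas calls cover much of a first EDA pass. The sketch below uses a made-up DataFrame with one missing value and one extreme value, so the column names and numbers are purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with a missing age and an extreme income.
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 29],
    "income": [40_000, 52_000, 61_000, 58_000, 1_000_000, 45_000],
    "segment": ["a", "b", "a", "a", "c", "b"],
})

summary = df.describe()                         # count, mean, std, quartiles of numeric columns
missing_per_column = df.isna().sum()            # data quality: missing values per column
segment_counts = df["segment"].value_counts()   # balance of a categorical column
numeric_corr = df[["age", "income"]].corr()     # linear relationships between numeric columns

print(summary)
print(missing_per_column)
print(segment_counts)
print(numeric_corr)
```

Here the summary exposes the extreme income through a maximum far above the 75th percentile, and the missing-value count flags the `age` column for imputation or row removal before modeling.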

Q12.  What is correlation?
ans-  Correlation is a statistical measure that expresses the extent to which two variables are linearly related. It's a common tool for describing simple relationships without making a statement about cause and effect. It can be positive, negative, or zero.

Q13.  What does negative correlation mean?
ans-  Negative correlation means that two variables move in opposite directions. When one variable increases, the other variable tends to decrease, and vice versa. For example, the more hours a student spends watching TV, the lower their test scores tend to be.

Q14.  How can you find correlation between variables in Python?
ans-  
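```python
import numpy as np
import pandas as pd

np.random.seed(0)  # make the random column reproducible

data = {
    'Feature_A': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'Feature_B': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],        # perfectly positively correlated with Feature_A
    'Feature_C': [100, 90, 80, 70, 60, 50, 40, 30, 20, 10],   # perfectly negatively correlated with Feature_A
    'Feature_D': np.random.rand(10) * 100,                    # random values, no systematic relationship
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df.head())

# Pairwise Pearson correlation between all numeric columns.
correlation_matrix = df.corr()

print("\nCorrelation Matrix:")
print(correlation_matrix)

# Series.corr() compares two Series, so passing the whole DataFrame fails.
# Selecting the Feature_A column of the matrix gives its correlation with
# every feature.
print("\nCorrelation of Feature_A with other features:")
print(correlation_matrix['Feature_A'])
```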

Q15.  What is causation? Explain difference between correlation and causation with an example.

ans-  Causation means one event directly leads to another. It implies a cause-and-effect relationship.

Correlation means two variables move together, but one doesn't necessarily cause the other. "Correlation does not imply causation."

Example: Ice cream sales and shark attacks are correlated (both increase in summer), but neither causes the other; warm weather is the common cause for both.
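
The ice cream example can be simulated. The sketch below generates synthetic data in which temperature drives both quantities; all coefficients and noise levels are arbitrary assumptions, not real measurements:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 1000

# Synthetic data: temperature is the common cause (confounder).
temperature = rng.uniform(15, 35, size=n)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 1, size=n)
shark_attacks = 0.5 * temperature + rng.normal(0, 1, size=n)

# The two outcomes are strongly correlated even though neither one
# appears in the formula that generates the other.
raw_corr = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]

def residuals(y, x):
    """Remove the linear effect of x from y using a least-squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# After controlling for temperature, the correlation largely disappears.
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(shark_attacks, temperature),
)[0, 1]

print(f"raw correlation:           {raw_corr:.2f}")
print(f"controlling for temperature: {partial_corr:.2f}")
```

In real data the confounder is rarely known in advance, which is why a correlation alone cannot establish a causal claim.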

Q16.  What is an Optimizer? What are different types of optimizers? Explain each with an example.
ans-  An Optimizer is an algorithm that adjusts the parameters of a machine learning model to minimize the loss function and improve performance. It determines how the model learns from data: in which direction and by how much each parameter changes at every step.

- Gradient Descent (GD): Updates parameters using the gradient of the loss computed over the entire dataset. (Example: Slow but stable learning, like taking precise steps down a hill after surveying the whole slope).
- Stochastic Gradient Descent (SGD): Updates parameters using the gradient of the loss on a single, randomly chosen training example. (Example: Fast but noisy learning, like taking a step down after feeling the slope at just one point).
- Mini-batch Gradient Descent: Updates parameters using the gradient of the loss on a small batch of training examples. (Example: A balance between speed and stability, like taking a step after surveying a small group of points).
- Adam (Adaptive Moment Estimation): Combines momentum and adaptive learning rates, making it very efficient. (Example: A smart walker that adapts step size and direction based on past progress and current slope changes, reaching the bottom quickly).
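
The first three optimizers differ only in how many examples feed each update, which a single function can show. This is a minimal NumPy sketch on synthetic, noise-free data; the function name `fit_linear`, the learning rate, and the epoch count are assumptions chosen for illustration, and Adam is left out to keep it short:

```python
import numpy as np

def fit_linear(X, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y ~ X @ w by minimizing mean squared error with gradient steps.

    batch_size = len(X)  -> full-batch gradient descent
    batch_size = 1       -> stochastic gradient descent
    anything in between  -> mini-batch gradient descent
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle the data every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)  # gradient of the batch MSE
            w -= lr * grad
    return w

# Synthetic, noise-free data: y = 3x + 2, with a column of ones for the intercept.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=100)
X = np.column_stack([x, np.ones_like(x)])
y = 3.0 * x + 2.0

w_gd = fit_linear(X, y, batch_size=len(X))     # one update per epoch
w_sgd = fit_linear(X, y, batch_size=1)         # one update per example
w_minibatch = fit_linear(X, y, batch_size=16)  # one update per 16 examples

print("GD:        ", w_gd)
print("SGD:       ", w_sgd)
print("Mini-batch:", w_minibatch)
```

Because the data has no noise, all three variants recover a slope near 3 and an intercept near 2; with noisy data, SGD and mini-batch estimates would fluctuate around the solution unless the learning rate is decayed.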

Q17.  What is sklearn.linear_model ?
ans-  sklearn.linear_model is a module within the scikit-learn (sklearn) library in Python that provides algorithms for building linear models. These models are used for regression and classification tasks where the output is modeled as a linear function of the input features. It includes classic estimators such as LinearRegression, Ridge, Lasso, and LogisticRegression.
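
As a brief sketch, the module can be used for both task types. The toy arrays are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data with a single feature.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Regression: the target is an exact linear function, y = 2x + 1.
y_continuous = 2.0 * X.ravel() + 1.0
regressor = LinearRegression().fit(X, y_continuous)
slope = regressor.coef_[0]
intercept = regressor.intercept_

# Classification: the label is 1 for the three largest feature values.
y_class = np.array([0, 0, 0, 1, 1, 1])
classifier = LogisticRegression().fit(X, y_class)
predicted_labels = classifier.predict(np.array([[1.0], [6.0]]))

print("slope:", slope, "intercept:", intercept)
print("predicted labels:", predicted_labels)
```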

Q18. What does model.fit() do? What arguments must be given?

ans-  model.fit() trains a machine learning model on the data it is given, adjusting the model's internal parameters to minimize error.

The essential arguments are typically:

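- X: The input training features (e.g., your X_train), shaped (n_samples, n_features).
- y: The target variable or labels for training (e.g., your y_train). Unsupervised estimators such as KMeans and transformers such as StandardScaler do not need y.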

Q19.  What does model.predict() do? What arguments must be given?

ans-  The model.predict() method is used after a machine learning model has been trained (.fit()) to make predictions on new, unseen data.

The essential argument for model.predict() is typically:

X: The input features for which you want to make predictions. This should be an array-like object (e.g., NumPy array, pandas DataFrame) with the same number of features (columns) as the data used for training. It should not include the target variable.
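
A short sketch of fit() followed by predict(), on synthetic data generated only for this example, including what happens when the column count does not match:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: two features and a target that depends linearly on both.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)          # X: (n_samples, n_features), y: (n_samples,)
predictions = model.predict(X_test)  # features only, same column count as X_train

# Passing a different number of columns than were seen in fit() is an error.
try:
    model.predict(X_test[:, :1])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print("predictions shape:", predictions.shape)
print("feature mismatch raised ValueError:", mismatch_raised)
```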

Q20.  What are continuous and categorical variables?
ans-  
- Continuous Variables: These can take any value within a given range (e.g., height, temperature, time). They are typically measured.
- Categorical Variables: These can take on a limited, fixed number of possible values, typically labels or categories (e.g., gender, color, type of car). They are typically counted or classified.

Q21.  What is feature scaling? How does it help in Machine Learning?
ans-  Feature scaling is a data preprocessing technique used to standardize or normalize the range of independent variables (features) within a dataset. It involves transforming numerical features so they have a similar scale, without distorting differences in the ranges of values or losing information.

Feature scaling helps in Machine Learning by:

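1.  Improving Algorithm Performance: distance-based and gradient-based methods (e.g., k-nearest neighbors, SVMs, neural networks) are sensitive to the scale of their inputs.
2.  Faster Convergence: gradient descent needs fewer steps when all features are on a similar scale.
3.  Preventing Domination by Features with Larger Magnitudes: a feature measured in thousands would otherwise outweigh one measured in single digits in any distance or dot product.
4.  Ensuring Fair Contribution: regularization penalties then treat all coefficients on a comparable footing.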
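
The domination effect can be seen in a small distance calculation. The three customers below are hypothetical and the numbers were picked to make the effect visible:

```python
import numpy as np

# Hypothetical customers described by (age in years, income in dollars).
customers = np.array([
    [25.0, 50_000.0],   # customer 0
    [60.0, 50_500.0],   # customer 1: very different age, similar income
    [26.0, 50_800.0],   # customer 2: similar age, slightly higher income
])

def distances_from_first(points):
    """Euclidean distance from row 0 to every other row."""
    return np.linalg.norm(points[1:] - points[0], axis=1)

# Unscaled: income differences (hundreds of dollars) swamp age differences.
raw_d1, raw_d2 = distances_from_first(customers)

# Standardize each column to zero mean and unit variance.
scaled = (customers - customers.mean(axis=0)) / customers.std(axis=0)
scaled_d1, scaled_d2 = distances_from_first(scaled)

print("raw:   ", raw_d1, raw_d2)        # customer 1 looks closer to customer 0
print("scaled:", scaled_d1, scaled_d2)  # customer 2 looks closer to customer 0
```

Before scaling, the 35-year age gap barely registers next to a few hundred dollars of income; after scaling, both features influence which customer counts as the nearest neighbor.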

Q22. How do we perform scaling in Python?

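ans-  We can perform feature scaling in Python using the sklearn.preprocessing module. The two most common techniques are Standardization (StandardScaler), which rescales each feature to zero mean and unit variance, and Normalization (MinMaxScaler), which rescales each feature to a fixed range, [0, 1] by default.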
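
A minimal sketch of both scalers follows, with made-up feature values. The scaler is fitted on the training data only and then reused on the test data, so no test-set statistics leak into training:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical training and test features (age, income).
X_train = np.array([[20.0, 30_000.0], [30.0, 50_000.0], [40.0, 70_000.0]])
X_test = np.array([[25.0, 40_000.0]])

# Standardization: subtract the training mean, divide by the training std.
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)  # fit on training data only
X_test_std = std_scaler.transform(X_test)        # reuse the training statistics

# Min-max normalization: map the training range of each feature to [0, 1].
minmax_scaler = MinMaxScaler()
X_train_mm = minmax_scaler.fit_transform(X_train)
X_test_mm = minmax_scaler.transform(X_test)

print("standardized train:\n", X_train_std)
print("min-max train:\n", X_train_mm)
print("min-max test:\n", X_test_mm)
```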

Q23.  What is sklearn.preprocessing?
ans-  The sklearn.preprocessing module is a part of the scikit-learn library in Python. It provides functions and transformer classes designed for data preprocessing. Its main goal is to transform raw feature vectors into a representation that is more suitable for machine learning algorithms. This includes essential steps like scaling, normalization, and encoding categorical features. Imputing missing values is a related preparation step, but in current scikit-learn versions it is provided by the separate sklearn.impute module.

Q24.  How do we split data for model fitting (training and testing) in Python?
ans-  To split data for model fitting (training and testing) in Python, we typically use the train_test_split function from scikit-learn's model_selection module.

Q25.  Explain data encoding?
ans-  Data encoding in Machine Learning is the process of converting data from one form to another, specifically transforming non-numerical data (like text or categories) into numerical representations that machine learning algorithms can understand and process. It's essential because most ML models require numerical input. Common techniques include Label Encoding, One-Hot Encoding, and Target Encoding.
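
For use inside scikit-learn workflows, the encoders in sklearn.preprocessing do the same job as manual mappings. This is a sketch under the assumption of two made-up columns, one nominal and one ordinal:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical columns, shaped (n_samples, 1) as scikit-learn expects.
colors = np.array([["red"], ["blue"], ["green"], ["blue"]], dtype=object)
sizes = np.array([["small"], ["large"], ["medium"], ["small"]], dtype=object)

# Nominal feature -> one-hot columns, ordered alphabetically: blue, green, red.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
one_hot = OneHotEncoder(handle_unknown="ignore")
colors_encoded = one_hot.fit_transform(colors).toarray()  # default output is sparse

# Ordinal feature -> integers that follow the stated order.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes_encoded = ordinal.fit_transform(sizes)

# A category that was not present when the encoder was fitted.
unseen = one_hot.transform(np.array([["purple"]], dtype=object)).toarray()

print(colors_encoded)
print(sizes_encoded.ravel())
print(unseen)
```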

