Assignment Questions

In [12]:
"""
1. What is a parameter?

-> A parameter is an internal variable of a model whose value is learned from data during the training process.

Types of parameters:

a. Model Parameters:

-> Learned by the algorithm from training data.
-> Define how the model makes predictions.

Examples:
Weights and biases in a neural network
Coefficients in linear regression
Support vectors in SVM

b. Hyperparameters(related, but different):

-> Set before training and not learned from the data.

Examples: learning rate, number of layers, regularization strength

Example:
In linear regression:
The model predicts: output=weight*input+bias
Here, weight and bias are parameters that are learned by minimizing the error on training data.
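
As a minimal sketch of the distinction above, the snippet below fits scikit-learn's Ridge to made-up data generated from output=2*input+1. The data and the alpha value are assumptions chosen only for illustration: alpha is a hyperparameter set by hand, while coef_ and intercept_ are the parameters learned by fit().

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data that follows output = 2*input + 1
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# alpha is a hyperparameter: chosen before training, not learned
model = Ridge(alpha=0.001)
model.fit(X, y)

# coef_ and intercept_ are model parameters: learned from the data
weight = model.coef_[0]
bias = model.intercept_
print(weight, bias)
```

The learned weight and bias land very close to 2 and 1; the small gap comes from the regularization that alpha controls.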

2. What is correlation?
 What does negative correlation mean?

-> Correlation is a statistical measure that describes the strength and direction of a relationship between two variables.

It ranges from -1 to +1.
A correlation near +1 means a strong positive relationship.
A correlation near -1 means a strong negative relationship.
A correlation near 0 means little or no linear relationship(a nonlinear relationship may still exist).

A negative correlation means that as one variable increases, the other decreases.
Example: The more time spent watching TV, the lower the test scores might be.
If the correlation is -0.8, the relationship is strong and inverse.
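
To make the TV example concrete, here is a small sketch with invented numbers(the hours and scores are hypothetical, not real measurements). It uses NumPy's corrcoef to compute the Pearson correlation coefficient.

```python
import numpy as np

# Hypothetical data: daily TV hours and test scores for six students
tv_hours = np.array([1, 2, 3, 4, 5, 6])
test_scores = np.array([90, 85, 80, 70, 65, 50])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r = np.corrcoef(tv_hours, test_scores)[0, 1]
print(round(r, 2))
```

Because the scores fall steadily as the hours rise, r comes out strongly negative.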

3. Define Machine Learning. What are the main components in Machine Learning?

-> Machine Learning(ML) is a branch of artificial intelligence that focuses on 
building systems that can learn from data, identify patterns, and make decisions with minimal human intervention.

Main Components in Machine Learning:

a. Data:

-> Raw input used to train and evaluate the model.
-> Includes features(inputs) and labels(outputs) in supervised learning.

b. Model:

-> A mathematical representation or algorithm that maps inputs to outputs.

Examples: Linear regression, decision tree, neural network.

c. Algorithm:

-> The procedure used to train the model by adjusting parameters.

Example: Gradient Descent, K-Means Clustering, Backpropagation.

d. Loss Function(or Cost Function):

-> Measures how well the model's predictions match the actual results.

Common functions: Mean Squared Error, Cross-Entropy Loss.

e. Training:

-> The process of feeding data to the algorithm so it learns patterns.
-> Adjusts model parameters to minimize the loss.

f. Evaluation:

-> Assessing model performance using metrics like accuracy, precision, recall, or RMSE.
-> Typically done on unseen(test or validation) data.

g. Prediction(Inference):

-> Using the trained model to make predictions on new data.

4. How does loss value help in determining whether the model is good or not?

-> The loss value is a key indicator of how well a machine learning model is performing.

It helps determine model quality:

a. Lower Loss=Better Predictions:
-> A small loss value means the model's predictions are close to the true values.
-> A high loss indicates poor performance.

b. Tracks Learning Progress:
-> During training, we monitor the loss after each iteration or epoch.
-> A decreasing loss shows the model is learning.

c. Compare Models:
Loss allows for objective comparison between different models or configurations.

d. Avoid Overfitting:
If training loss is low but validation loss is high, the model may be overfitting 
(memorizing the training data rather than learning general patterns).
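
A short sketch of point (a) and point (c), assuming two hypothetical models whose predictions are written out by hand. Mean Squared Error is used as the loss, via scikit-learn's mean_squared_error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])

# Hand-written predictions from two hypothetical models
preds_model_a = np.array([2.5, 5.5, 7.0, 9.5])   # close to the truth
preds_model_b = np.array([1.0, 8.0, 4.0, 12.0])  # far from the truth

loss_a = mean_squared_error(y_true, preds_model_a)
loss_b = mean_squared_error(y_true, preds_model_b)
print(loss_a, loss_b)
```

Model A has the lower loss, so on this data it is the better of the two. The same comparison on training versus validation data is what reveals overfitting.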

5. What are continuous and categorical variables?

-> Continuous Variables:
a. These are quantitative variables that can take any numerical value within a range.
b. They are measurable and can include decimals or fractions.

Examples:
Height(e.g., 172.5 cm)
Temperature(e.g., 36.6 °C)
Salary(e.g., ₹55,000.75)

Categorical Variables:
a. These are qualitative variables that represent categories or groups.
b. They have discrete values and cannot be measured on a numerical scale.

Types:
Nominal: No inherent order(e.g., gender, color)
Ordinal: With an order or ranking(e.g., education level, satisfaction rating)

Examples:
Gender(Male, Female)
Marital Status(Single, Married)
Shirt Size(Small, Medium, Large)

6. How do we handle categorical variables in Machine Learning? What are the common techniques?

-> Handling categorical variables is essential in machine learning because most algorithms require numerical input. 

Common techniques include:
a. Label Encoding:
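-> Converts each category into a unique integer.
-> LabelEncoder assigns integers in sorted(alphabetical) order, not rank order, and is designed for target labels(y). For ordinal features, prefer Ordinal Encoding(below) so the integers follow the real ranking.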

Example:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
data['Size']=le.fit_transform(data['Size'])  # Small=2, Medium=1, Large=0

b. One-Hot Encoding:
-> Creates a binary column for each category.
-> Ideal for nominal variables (no order).

Example:
import pandas as pd
pd.get_dummies(data['Color'],prefix='Color')   # Columns are sorted: Color_Blue, Color_Green, Color_Red, so Red -> [0, 0, 1], Green -> [0, 1, 0], Blue -> [1, 0, 0]

c. Ordinal Encoding: 
-> Assigns ordered integers to categories manually or automatically.
-> Used when categories have ranked relationships.

Example:
data['Education']=data['Education'].map({
    'High School':1,
    'Bachelor':2,
    'Master':3,
    'PhD':4
})

d. Frequency or Count Encoding:
-> Replaces categories with frequency counts.
-> Useful for high-cardinality features.

Example:
data['Category']=data['Category'].map(data['Category'].value_counts())

e. Target Encoding:
-> Replaces categories with the mean of the target variable for that category.
-> Used in both regression and classification tasks; compute the means on the training data only to avoid leaking the target.

Example:
mean_target=data.groupby('Category')['Target'].mean()
data['Category_encoded']=data['Category'].map(mean_target)
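
The sketch below revisits two of the techniques above on a tiny made-up DataFrame(the column values are hypothetical). It assumes scikit-learn's OrdinalEncoder with an explicit category order for the ranked column and pandas get_dummies for the unranked one.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({
    'Size': ['Small', 'Large', 'Medium', 'Small'],
    'Color': ['Red', 'Green', 'Blue', 'Red'],
})

# Pass the rank order explicitly so that Small < Medium < Large
size_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
data['Size_encoded'] = size_encoder.fit_transform(data[['Size']]).ravel()

# One-hot encode the nominal column; the new columns are sorted by name
color_dummies = pd.get_dummies(data['Color'], prefix='Color')

print(data)
print(color_dummies)
```

Giving the order explicitly is the design choice that matters here: without it, the integers would follow alphabetical order rather than the real ranking.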

7. What do you mean by training and testing a dataset?

-> Training means using one portion of our data(the training set) to teach the machine learning model how to 
make predictions or identify patterns. The model learns by adjusting its parameters to minimize errors on this training data.

Testing means using a separate portion of the data(the test set, not seen by the model during training) to evaluate how 
well the trained model performs on new, unseen data. This checks whether the model can generalize 
beyond just memorizing the training examples.

8. What is sklearn.preprocessing?

-> sklearn.preprocessing is a module in the Scikit-learn(sklearn) library that provides tools 
for preparing(or preprocessing) data before feeding it into a machine learning model.

This module helps to:

a. Scale features(e.g., standardize values)
b. Encode categorical variables
c. Normalize data
d. Transform skewed features(e.g., PowerTransformer); missing values are handled by the separate sklearn.impute module

9. What is a Test set?

-> A test set is a portion of our dataset that is used to evaluate the final performance 
of a trained machine learning model.

Purpose of test set:
a. To measure how well the model generalizes to new, unseen data.
b. It gives an unbiased evaluation of the model's accuracy, precision, recall, etc.

10. How do we split data for model fitting (training and testing) in Python?
 How do you approach a Machine Learning problem?

-> We use the train_test_split function from Scikit-learn(sklearn) to divide the dataset:
from sklearn.model_selection import train_test_split
# Assuming X=features, y=labels/target
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

Parameters:
test_size=0.2 -> 20% data for testing, 80% for training.
random_state=42 -> ensures reproducibility.
shuffle=True by default -> shuffles before splitting.
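
As a quick check of what the split returns, the sketch below uses a made-up array of 10 samples(the data is hypothetical) and inspects the resulting shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features; row i is [2i, 2i+1]
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # 8 training rows, 2 test rows
```

The rows of X and the entries of y are shuffled together, so each feature row stays paired with its own target.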

We approach a machine learning problem as follows:

a. Understand the Problem:
-> Define the goal clearly.
-> Know the type: classification, regression, clustering, etc.

b. Collect Data:
Gather relevant data from databases, APIs, files, etc.

c. Explore and Preprocess Data:
-> Handle missing values.
-> Remove duplicates.
-> Treat outliers.
-> Convert categorical variables(e.g., one-hot encoding).
-> Feature scaling(standardization/normalization).

d. Split Data:
-> Split into training and test sets(e.g., 80/20).
-> Optionally, use a validation set or cross-validation.

e. Choose and Train Model:
-> Pick a suitable ML algorithm(e.g., Linear Regression, Random Forest).
-> Fit the model to the training data.

f. Evaluate Model:
-> Use metrics like accuracy, F1-score, RMSE, etc.
-> Evaluate on test data to check generalization.

g. Tune and Improve:
-> Hyperparameter tuning(GridSearchCV, RandomizedSearchCV).
-> Feature engineering or model stacking.
-> Retrain and re-evaluate.
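
Steps (d) to (f) can be sketched end to end as follows. This is only one possible arrangement: the Iris dataset, the pipeline of StandardScaler plus LogisticRegression, and the max_iter value are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Collect data (a built-in toy dataset stands in for real data)
X, y = load_iris(return_X_y=True)

# Split data, keeping class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocess and train: the pipeline fits the scaler on training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on unseen data
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```

Wrapping the scaler and the model in one pipeline keeps the preprocessing and training steps together, so the same transformation is applied at prediction time.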

11. Why do we have to perform EDA before fitting a model to the data?

-> Exploratory Data Analysis(EDA) is a crucial step before fitting a model because it helps us understand 
the structure, patterns, and quality of the data. 

EDA is important because:

a. Understand Data Distribution:
-> Helps us see how features are distributed.
-> Identify skewed data, imbalances in target variables, etc.

b. Detect Missing Values:
-> We can identify which columns have missing or null values.
-> Allows us to decide on imputation or removal.

c. Identify Outliers:
-> Outliers can distort model training and performance.
-> Visualization(box plots, scatter plots) helps detect them.

d. Discover Relationships:
-> Correlation analysis can show which features influence the target.
-> Helps in feature selection and engineering.

e. Choose the Right Model:
Knowing whether the target is categorical or continuous influences model choice(classification vs. regression).

f. Improve Model Performance:
-> Clean, well-understood data leads to more accurate and interpretable models.
-> Prevents garbage in, garbage out.

g. Guide Preprocessing Steps:
Suggests what encoding, scaling, or transformation is needed.
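
A few of these checks can be sketched with pandas. The DataFrame below is invented for illustration(including the missing values and the unusually large age), and the three calls shown are only a starting point for EDA.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values and one suspicious age
df = pd.DataFrame({
    'age': [22, 25, np.nan, 35, 40, 95],
    'salary': [30000, 32000, 35000, np.nan, 50000, 52000],
    'purchased': ['no', 'no', 'yes', 'no', 'yes', 'no'],
})

summary = df.describe()                           # distribution of numeric columns
missing_counts = df.isna().sum()                  # missing values per column
class_balance = df['purchased'].value_counts()    # imbalance in the target

print(summary)
print(missing_counts)
print(class_balance)
```

The summary's max for age exposes the possible outlier, the missing counts show which columns need imputation, and the class counts show the target is imbalanced.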

12. What is correlation?

-> Correlation is a statistical measure that describes the strength and direction of a relationship between two variables.

It ranges from -1 to +1.
A correlation near +1 means a strong positive relationship.
A correlation near -1 means a strong negative relationship.
A correlation near 0 means little or no linear relationship(a nonlinear relationship may still exist).

13. What does negative correlation mean?

-> A negative correlation means that as one variable increases, the other decreases.
Example: The more time spent watching TV, the lower the test scores might be.
If the correlation is -0.8, the relationship is strong and inverse.

14. How can you find correlation between variables in Python?

-> We can find the correlation between variables in Python using the pandas library 
and visualize it using seaborn or matplotlib.

a. Using pandas.corr():

import pandas as pd
# Sample DataFrame
data = {
    'height':[150,160,170,180,190],
    'weight':[50,60,70,80,90],
    'age':[22,25,30,35,40]
}
df=pd.DataFrame(data)
# Calculate correlation matrix
correlation_matrix=df.corr()
print(correlation_matrix)

This gives Pearson correlation by default(range: -1 to 1).

b. Visualize with Heatmap(seaborn):

import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(correlation_matrix,annot=True,cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

c. Other Correlation Methods: 

We can also specify methods like:

"pearson" - default, measures linear relationships
"kendall" - rank-based(Kendall's tau), suited to ordinal data and small samples with ties
"spearman" - rank-based, measures monotonic relationships

df.corr(method='spearman')
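
To see why the method matters, the sketch below compares Pearson and Spearman on a made-up relationship that is perfectly monotonic but not linear(y is the cube of x; the data is hypothetical).

```python
import pandas as pd

# Hypothetical data: y always rises with x, but along a curve
df = pd.DataFrame({'x': range(1, 11)})
df['y'] = df['x'] ** 3

pearson_r = df.corr(method='pearson').loc['x', 'y']
spearman_r = df.corr(method='spearman').loc['x', 'y']

print(pearson_r, spearman_r)
```

Spearman works on ranks, so it reports a perfect relationship, while Pearson is pulled below 1 because the points do not lie on a straight line.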

15. What is causation? Explain difference between correlation and causation with an example.

-> Causation means that one variable directly affects another - a change in one variable produces a change in the other.
causation=cause and effect.

The differences between correlation and causation are:

Correlation:
a. Shows a statistical relationship between two variables.
b. Change in one variable is associated with change in another.
c. Does not imply one variable causes the other to change.
d. Can be positive, negative, or zero.
e. May be due to a third(confounding) variable.

Example: Ice cream sales and drowning rates both increase in summer(due to heat).

Causation:
a. Indicates a cause-and-effect relationship.
b. Change in one variable directly causes change in another.
c. Requires controlled experiments or strong evidence.
d. Always has a directional influence(from cause to effect).

Example: Smoking causes lung cancer(proven by medical research).
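
The ice cream example can be simulated as a rough sketch. All numbers below are invented: temperature is generated first, and both other variables depend only on temperature plus random noise, never on each other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confounder: daily temperature drives both variables
temperature = rng.uniform(10, 35, size=500)
ice_cream_sales = 5.0 * temperature + rng.normal(0, 10, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1, size=500)

# Raw correlation between two variables that do not affect each other
raw_r = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, driver):
    # Remove the straight-line effect of the driver from the values
    slope, intercept = np.polyfit(driver, values, 1)
    return values - (slope * driver + intercept)

# Correlation after removing the effect of temperature from both
partial_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(raw_r, partial_r)
```

The raw correlation is strong, yet it largely disappears once temperature is accounted for, which is what a confounding variable looks like in data.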

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

-> An optimizer is an algorithm or method used in machine learning and deep learning to adjust the model's 
parameters(like weights and biases) during training, with the goal of minimizing the loss function. 
The loss function measures how well the model is performing - the optimizer tries to find the best parameters 
that reduce this error.

Different types of optimizers:
a. Gradient Descent(GD):
-> The most basic optimizer.
-> Updates model parameters by moving them in the direction of the negative gradient of the loss function.
-> Calculates gradients on the entire dataset for each update.

Example: Used in simple linear regression.

b. Stochastic Gradient Descent(SGD):
-> Similar to Gradient Descent but updates parameters using one training example at a time.
-> Faster updates but noisier(less stable).
-> Useful for large datasets.

Example: Used in neural networks where data is too big to process at once.

c. Mini-batch Gradient Descent:
-> A hybrid between GD and SGD.
-> Updates parameters based on a small random batch of data(mini-batch) instead of the whole dataset or a single example.
-> Balances speed and stability.

Example: Commonly used in deep learning frameworks like TensorFlow and PyTorch.

d. Momentum:
-> Accelerates SGD by adding a momentum term that helps the optimizer keep moving in the same direction.
-> Helps avoid getting stuck in local minima and speeds up convergence.

Example: Often combined with SGD in training deep neural networks.

e. RMSprop(Root Mean Square Propagation):
-> Adjusts the learning rate adaptively for each parameter based on recent gradients.
-> Divides the learning rate by a moving average of the magnitudes of recent gradients.
-> Works well for non-stationary objectives(changing loss landscape).

Example: Popular for recurrent neural networks(RNNs).

f. Adam(Adaptive Moment Estimation):
-> Combines ideas from Momentum and RMSprop.
-> Maintains moving averages of both gradients and squared gradients.
-> Automatically adjusts learning rates for each parameter.
-> Often works well in practice and is widely used.

Example: Default optimizer for many deep learning tasks.
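
The update rules of three of these optimizers can be sketched in plain Python on a toy loss. Everything here is an assumption chosen for illustration: the loss (w-3)^2, the learning rates, the step count, and the commonly used Adam constants. This is not how a deep learning framework implements them internally.

```python
import numpy as np

def grad(w):
    # Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr=0.1, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)                # step against the gradient
    return w

def momentum(w, lr=0.1, beta=0.9, steps=500):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)   # accumulate past gradients
        w -= lr * velocity
    return w

def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w_gd = gradient_descent(0.0)
w_momentum = momentum(0.0)
w_adam = adam(0.0)
print(w_gd, w_momentum, w_adam)
```

All three start at w=0 and move toward the minimum at w=3; they differ only in how each step is computed from the gradient.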

17. What is sklearn.linear_model?

-> sklearn.linear_model is a module in the scikit-learn Python library that provides classes 
and functions to implement various linear models for regression and classification tasks.

Key points:
a. It contains algorithms that model the relationship between input features and 
a target variable assuming a linear relationship.
b. It supports both regression(predicting continuous values) and classification(predicting categories) problems.
c. The module includes simple and advanced linear models, along with tools for regularization to prevent overfitting.

Example:
from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(X_train,y_train)  # Train the model
predictions=model.predict(X_test)  # Predict outcomes

18. What does model.fit() do? What arguments must be given?

-> The fit() method in machine learning models (including scikit-learn models) is used to train the model 
on the provided data. When we call model.fit(), the algorithm:
a. Learns the relationship between the input features and the target variable.
b. Finds the best parameters(e.g., weights in linear regression) that minimize the error or loss function.
c. Prepares the model to make predictions on new/unseen data.

The most common arguments passed to fit() are:

a. X — The input data(features):

-> Usually a 2D array-like structure(e.g., NumPy array, pandas DataFrame).
-> Shape: (number of samples, number of features).

b. y — The target variable (labels or values to predict):

-> 1D array-like for regression or binary/multiclass classification.
-> Shape: (number of samples,).

Example:
from sklearn.linear_model import LinearRegression
import numpy as np
# Sample data
X=np.array([[1,2],[2,3],[3,4],[4,5]])  # Features
y=np.array([3,5,7,9])                  # Target variable
model=LinearRegression()
model.fit(X,y)  # Train the model on X and y
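
A small sketch of what fit() changes on the model object, reusing the same sample data. The checks rely on scikit-learn's convention that learned attributes end with an underscore(such as coef_) and exist only after fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])  # Features
y = np.array([3, 5, 7, 9])                      # Target variable

model = LinearRegression()
fitted_before = hasattr(model, 'coef_')   # False: nothing learned yet

returned = model.fit(X, y)                # fit() returns the estimator itself
fitted_after = hasattr(model, 'coef_')    # True: parameters now exist

train_r2 = model.score(X, y)              # R^2 on the training data
print(fitted_before, fitted_after, train_r2)
```

Because fit() returns the model itself, calls can be chained, for example LinearRegression().fit(X, y).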

19. What does model.predict() do? What arguments must be given?

-> The predict() method is used to make predictions using the trained machine learning model. 
After we have trained our model with fit(), calling predict() applies the learned patterns 
to new input data to estimate the output(target values or classes).

The most common arguments passed to predict() are:
a. X - The input data(features) for which we want predictions:
-> Typically a 2D array-like structure(NumPy array, pandas DataFrame).
-> Shape: (number of samples, number of features).
b. No target(y) is needed because we want the model to predict these.

Example:
from sklearn.linear_model import LinearRegression
import numpy as np
# Training data
X_train=np.array([[1,2],[2,3],[3,4]])
y_train=np.array([3,5,7])
# Train the model
model=LinearRegression()
model.fit(X_train,y_train)
# New data for prediction
X_new=np.array([[4,5],[5,6]])
# Predict target values for new data
predictions=model.predict(X_new)
print(predictions)
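
One practical consequence of the 2D requirement is sketched below, using the same training data: a single sample passed as a 1D array is rejected, so it has to be reshaped into one row first.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1, 2], [2, 3], [3, 4]])
y_train = np.array([3, 5, 7])
model = LinearRegression().fit(X_train, y_train)

single_sample = np.array([4, 5])   # shape (2,): 1D, not a valid input

try:
    model.predict(single_sample)
    raised = False
except ValueError:
    raised = True                  # scikit-learn expects a 2D array

# Reshape to one row with two features: shape (1, 2)
prediction = model.predict(single_sample.reshape(1, -1))
print(raised, prediction)
```

The reshaped call returns an array with one prediction, one entry per input row.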

20. What are continuous and categorical variables?

-> Continuous Variables:
a. These are quantitative variables that can take any numerical value within a range.
b. They are measurable and can include decimals or fractions.

Examples:
Height(e.g., 172.5 cm)
Temperature(e.g., 36.6 °C)
Salary(e.g., ₹55,000.75)

Categorical Variables:
a. These are qualitative variables that represent categories or groups.
b. They have discrete values and cannot be measured on a numerical scale.

Types:
Nominal: No inherent order(e.g., gender, color)
Ordinal: With an order or ranking(e.g., education level, satisfaction rating)

Examples:
Gender(Male, Female)
Marital Status(Single, Married)
Shirt Size(Small, Medium, Large)

21. What is feature scaling? How does it help in Machine Learning?

-> Feature scaling is the process of normalizing or standardizing the range of independent variables(features) 
in our data. It transforms features so they have similar scales or distributions.

It helps in machine learning:
a. Many ML algorithms(like gradient descent, k-nearest neighbors, SVM, and neural networks) 
are sensitive to the scale of features.
b. Features with larger ranges can dominate the learning process, causing biased results.
c. Scaling ensures all features contribute equally, improving:
-> Model convergence speed(especially for gradient-based methods).
-> Model performance and accuracy.
-> Interpretability when comparing feature effects.
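
Point (b) can be illustrated with a rough sketch. The three people below are made up(age in years, income in currency units), and the helper income_share is a hypothetical function written for this example. It measures how much of the squared distance between two people comes from income.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: columns are age and income
people = np.array([[25.0, 50000.0],
                   [60.0, 50500.0],
                   [26.0, 58000.0]])

def income_share(a, b):
    # Fraction of the squared Euclidean distance contributed by income
    squared_diff = (a - b) ** 2
    return squared_diff[1] / squared_diff.sum()

# Persons 0 and 1 differ by 35 years of age but only 500 in income
share_raw = income_share(people[0], people[1])

scaled = StandardScaler().fit_transform(people)
share_scaled = income_share(scaled[0], scaled[1])

print(share_raw, share_scaled)
```

Before scaling, income accounts for almost all of the distance even though the age gap is the larger real difference; after scaling, age dominates as expected. Distance-based methods such as k-nearest neighbors are affected in exactly this way.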

22. How do we perform scaling in Python?

-> We can perform feature scaling easily in Python using scikit-learn's preprocessing module. 

The common methods are:

a. Min-Max Scaling:
Scales data to a range between 0 and 1.

from sklearn.preprocessing import MinMaxScaler
import numpy as np
data=np.array([[10, 200],
                 [20, 300],
                 [30, 400]])
scaler=MinMaxScaler()
scaled_data=scaler.fit_transform(data)
print(scaled_data)

b. Standardization(Z-score scaling):
Centers data to mean zero and scales to unit variance.

from sklearn.preprocessing import StandardScaler
import numpy as np
data=np.array([[10, 200],
                 [20, 300],
                 [30, 400]])
scaler=StandardScaler()
scaled_data=scaler.fit_transform(data)
print(scaled_data)

c. Normalizer:

Scales each sample(row) to have unit norm(length 1), converting it into a unit vector.

from sklearn.preprocessing import Normalizer
import numpy as np
data=np.array([[4, 1],
                 [1, 2],
                 [3, 3]])
scaler=Normalizer()
normalized_data=scaler.fit_transform(data)
print(normalized_data)

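Steps to perform scaling:
a. Create the scaler object.
b. Call .fit_transform() on the training data so the scaler learns its statistics from that data only.
c. Call .transform() on the test or new data to reuse the same statistics.
d. Use the scaled data for modeling.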
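
When the data has been split, a common practice is to fit the scaler on the training part and reuse it on the test part. The sketch below assumes a tiny made-up split and checks the result against the z-score formula computed by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical split: three training rows and one test row
X_train = np.array([[10.0, 200.0],
                    [20.0, 300.0],
                    [30.0, 400.0]])
X_test = np.array([[40.0, 500.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# Same calculation by hand: (value - training mean) / training std
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(X_test_scaled, manual)
```

Calling fit_transform on the test data instead would compute new statistics from it, which leaks test information into preprocessing and makes the evaluation less trustworthy.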

23. What is sklearn.preprocessing?

-> sklearn.preprocessing is a module in the Scikit-learn(sklearn) library that provides tools 
for preparing(or preprocessing) data before feeding it into a machine learning model.

This module helps to:

a. Scale features(e.g., standardize values)
b. Encode categorical variables
c. Normalize data
d. Transform skewed features(e.g., PowerTransformer); missing values are handled by the separate sklearn.impute module

24. How do we split data for model fitting (training and testing) in Python?

-> We use the train_test_split function from Scikit-learn(sklearn) to divide the dataset:
from sklearn.model_selection import train_test_split
# Assuming X=features, y=labels/target
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

Parameters:
test_size=0.2 -> 20% data for testing, 80% for training.
random_state=42 -> ensures reproducibility.
shuffle=True by default -> shuffles before splitting.
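
For classification with imbalanced classes, train_test_split also accepts a stratify argument that keeps the class proportions the same in both parts. The sketch below assumes made-up labels with an 80/20 class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 80 of class 0 and 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count of class 1 in each part
print(y_train.sum(), y_test.sum())
```

Both parts keep the original 80/20 balance, so the test set contains 4 samples of the minority class rather than a number left to chance.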

25. Explain data encoding?

-> Data encoding in machine learning is the process of converting categorical variables(non-numeric data) 
into a numeric format so that algorithms can process them effectively. Since most machine learning models 
work with numerical data, encoding is a crucial step in data preprocessing.

Encoding is important because:
a. Algorithms like linear regression, logistic regression, SVMs, etc., require numeric input.
b. Encoding allows models to interpret categories meaningfully.
"""
