Submission by Ritam Dutta
# **ML Feature Engineering Assignment Solutions**

## **1. What is a parameter?**

**Theory:** A parameter is a value that the machine learning model learns during training to make accurate predictions. Think of parameters as the model's "settings" that get automatically adjusted based on the training data. For example, in a linear equation y = mx + b, the values 'm' (slope) and 'b' (intercept) are parameters that the model learns.

**Code Example:**

In [None]:
from sklearn.linear_model import LinearRegression

# Create the model
model = LinearRegression()

# The model has no learned parameters until it is trained, so prepare some data first
X = [[1], [2], [3], [4]]  # Features (input data)
y = [2, 4, 6, 8]          # Target values (output data)

# Train the model; fitting is what sets coef_ and intercept_
model.fit(X, y)

# Now we can access the learned parameters
print(f"Parameters: slope={model.coef_}, intercept={model.intercept_}")

Parameters: slope=[2.], intercept=0.0


**Summary:** Parameters are the learned values that help the model make predictions by capturing patterns in the training data.


---

## **2. What is correlation?**

**Theory:** Correlation measures how strongly two variables are linearly related. It tells us whether, when one variable increases, the other tends to increase (positive correlation) or decrease (negative correlation). The correlation coefficient ranges from -1 to +1: values closer to -1 or +1 indicate stronger relationships, and values near 0 indicate little or no linear relationship.

**Code Example:**


In [None]:
import pandas as pd
df = pd.DataFrame({'height': [150, 160, 170, 180], 'weight': [50, 60, 70, 80]})
correlation = df['height'].corr(df['weight'])
print(f"Correlation: {correlation}")

Correlation: 1.0


**Summary:** Correlation helps us understand how two variables move together, with values from -1 to +1 indicating relationship strength.
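
To make the -1 to +1 scale more concrete, the sketch below computes the Pearson coefficient by hand with NumPy for the same height and weight values: the sum of the products of each variable's deviations from its mean, divided by the square root of the product of the summed squared deviations. The variable names are illustrative, and this is only a minimal sketch of the formula, not a replacement for `.corr()`.

```python
import numpy as np

height = np.array([150, 160, 170, 180], dtype=float)
weight = np.array([50, 60, 70, 80], dtype=float)

# Deviations of each value from its variable's mean
dh = height - height.mean()
dw = weight - weight.mean()

# Pearson r: co-variation divided by the product of the individual spreads
r = (dh * dw).sum() / np.sqrt((dh ** 2).sum() * (dw ** 2).sum())
print(f"Manual Pearson r: {r}")

# Cross-check against NumPy's built-in correlation matrix
print(f"np.corrcoef:      {np.corrcoef(height, weight)[0, 1]}")
```

Because weight rises by exactly 10 for every 10 units of height, both computations should give a coefficient of 1 (up to floating-point rounding).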


---

## **3. What does negative correlation mean?**

**Theory:** Negative correlation means that as one variable increases, the other variable tends to decrease. The closer the correlation value is to -1, the stronger this inverse relationship. For example, as the price of a product increases, the demand for it might decrease, showing negative correlation.

**Code Example:**

In [None]:
import pandas as pd
df = pd.DataFrame({'price': [10, 20, 30, 40], 'demand': [100, 80, 60, 40]})
correlation = df['price'].corr(df['demand'])
print(f"Negative correlation: {correlation}")

Negative correlation: -1.0


**Summary:** Negative correlation indicates an inverse relationship where one variable decreases as the other increases.

---

## **4. Define Machine Learning. What are the main components in Machine Learning?**

**Theory:** Machine Learning is a method where computers learn patterns from data to make predictions or decisions without being explicitly programmed for each task. The main components are: 1) Data (information to learn from), 2) Algorithm (the learning method), 3) Model (the trained result), and 4) Features (individual measurable properties of observed phenomena).

**Code Example:**


In [None]:
from sklearn.linear_model import LinearRegression
# Data, Algorithm, Model, Features example
X = [[1], [2], [3]]  # Features (input data)
y = [1, 2, 3]        # Target (what we want to predict)
model = LinearRegression().fit(X, y)  # Algorithm creates Model from Data


**Summary:** Machine Learning combines data, algorithms, models, and features to automatically learn patterns and make predictions.

---


## **5. How does loss value help in determining whether the model is good or not?**

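**Theory:** Loss value measures how wrong the model's predictions are compared to the actual correct answers. A lower loss value means the model is making better predictions, while a higher loss value indicates poor performance. Think of it like counting mistakes on a test: fewer mistakes mean a better grade. Loss is most informative when it is measured on data the model did not see during training, because a very low training loss on its own can simply mean the model memorized the training examples.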

**Code Example:**

In [None]:
from sklearn.metrics import mean_squared_error
actual = [1, 2, 3, 4]
predicted = [1.1, 2.2, 2.8, 4.1]
loss = mean_squared_error(actual, predicted)
print(f"Loss value: {loss}")  # Lower is better

Loss value: 0.025000000000000022


**Summary:** Loss value quantifies prediction errors - lower loss indicates a better performing model.
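
As a worked example of what the loss number contains, the following sketch recomputes the mean squared error for the same `actual` and `predicted` lists in plain Python: each error is squared, and the squares are averaged. The names `squared_errors` and `mse` are illustrative.

```python
actual = [1, 2, 3, 4]
predicted = [1.1, 2.2, 2.8, 4.1]

# Square each prediction error so that positive and negative errors do not cancel
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# The mean of the squared errors is the MSE loss
mse = sum(squared_errors) / len(squared_errors)

print(f"Squared errors: {[round(e, 2) for e in squared_errors]}")
print(f"MSE: {mse:.3f}")  # MSE: 0.025
```

The two predictions that are off by 0.2 contribute four times as much to the loss as the two that are off by 0.1, which shows how squaring penalizes larger errors more heavily.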

---


## **6. What are continuous and categorical variables?**

**Theory:** Continuous variables are numeric values that can take any value within a range (like height, weight, temperature). Categorical variables represent categories or groups with distinct labels (like color, gender, city names). Arithmetic such as averaging or subtracting is meaningful for continuous variables, while categorical values only identify which class an observation belongs to.

**Code Example:**


In [None]:
import pandas as pd
df = pd.DataFrame({
    'age': [25, 30, 35],      # Continuous variable
    'city': ['NYC', 'LA', 'Chicago']  # Categorical variable
})
print(df.dtypes)

age      int64
city    object
dtype: object


**Summary:** Continuous variables are numeric measurements, while categorical variables represent distinct categories or groups.

---


## **7. How do we handle categorical variables in Machine Learning? What are the common techniques?**

**Theory:** Most machine learning models work with numbers, so we need to convert categorical variables into numeric format. Common techniques include Label Encoding (assigning an integer to each category) and One-Hot Encoding (creating a separate binary column for each category). One-hot encoding is preferred when categories don't have a natural order, because integer labels would suggest an ordering that does not exist.

**Code Example:**

In [None]:
import pandas as pd
df = pd.DataFrame({'color': ['red', 'blue', 'green']})
# One-hot encoding
encoded = pd.get_dummies(df['color'])
print(encoded)

    blue  green    red
0  False  False   True
1   True  False  False
2  False   True  False


**Summary:** Categorical variables are converted to numbers using techniques like label encoding or one-hot encoding for machine learning models.
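
For categories that do have a natural order, integer codes can preserve that order. The sketch below uses scikit-learn's `OrdinalEncoder` with an explicitly supplied category order; the `sizes` data and the small/medium/large ordering are assumptions made up for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature: small < medium < large
sizes = [["small"], ["large"], ["medium"], ["small"]]

# Passing the categories explicitly fixes the integer assigned to each label
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = encoder.fit_transform(sizes)

print(codes.ravel())  # [0. 2. 1. 0.]
```

Without the `categories` argument the encoder would sort the labels alphabetically, which would not match the intended size order.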

---


## **8. What do you mean by training and testing a dataset?**

**Theory:** Training means showing the model examples with both input data and correct answers so it can learn patterns. Testing means checking how well the model performs on new, unseen data to evaluate if it learned properly. We split our dataset into training data (to teach the model) and testing data (to evaluate performance).

**Code Example:**


In [None]:
from sklearn.model_selection import train_test_split
X, y = [[1], [2], [3], [4]], [1, 2, 3, 4]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
print(f"Training size: {len(X_train)}, Testing size: {len(X_test)}")

Training size: 2, Testing size: 2


**Summary:** Training teaches the model with known examples, while testing evaluates performance on unseen data.
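
The split only becomes useful once a model is trained on one part and scored on the other. The following is a minimal sketch of that loop on a small synthetic dataset (y = 2x + 1, chosen here purely for illustration), using `LinearRegression` and its `score` method, which returns R².

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data following y = 2x + 1 (an assumption for illustration)
X = [[i] for i in range(1, 9)]
y = [2 * i + 1 for i in range(1, 9)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Training: the model sees both the inputs and the correct answers
model = LinearRegression().fit(X_train, y_train)

# Testing: the model is scored on examples it has never seen
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Train R^2: {train_score:.3f}, Test R^2: {test_score:.3f}")
```

On real data the two scores usually differ, and a test score far below the training score is a common sign of overfitting.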

---


## **9. What is sklearn.preprocessing?**

**Theory:** sklearn.preprocessing is a scikit-learn module that provides tools to prepare data before feeding it to machine learning models. It includes classes and functions for scaling and normalizing numeric features and for encoding categorical ones. (Imputing missing values is handled by the separate sklearn.impute module.) These preprocessing steps help models learn more effectively.

**Code Example:**


In [12]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = [[1], [2], [3], [4]]
scaled_data = scaler.fit_transform(data)
print(scaled_data)

[[-1.34164079]
 [-0.4472136 ]
 [ 0.4472136 ]
 [ 1.34164079]]


**Summary:** sklearn.preprocessing provides essential tools for cleaning and preparing data for machine learning models.

---


## **10. What is a Test set?**

**Theory:** A test set is a portion of your data that you keep separate and don't use during model training. It acts like a final exam for your model to see how well it performs on completely new, unseen data. This helps determine if the model will work well in real-world situations, not just on the training data.

**Code Example:**


In [13]:
from sklearn.model_selection import train_test_split
X, y = [[1], [2], [3], [4], [5], [6]], [1, 2, 3, 4, 5, 6]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(f"Test set size: {len(X_test)} samples")

Test set size: 2 samples



**Summary:** The test set is reserved data used to evaluate final model performance on unseen examples.

---


## **11. How do we split data for model fitting (training and testing) in Python?**

**Theory:** We use the train_test_split function from sklearn to randomly divide our dataset into training and testing portions. Typically, we use 70-80% of data for training and 20-30% for testing. The random_state parameter ensures we get the same split every time we run the code, making results reproducible.

**Code Example:**


In [15]:
from sklearn.model_selection import train_test_split
X, y = [[1], [2], [3], [4]], [1, 2, 3, 4]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)


**Summary:** train_test_split randomly divides data into training and testing portions for model development and evaluation.
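
The effect of `random_state` can be checked directly. This sketch splits the same toy data twice with the same seed and compares the resulting test sets; the data and the seed value 42 are arbitrary choices for illustration.

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = list(range(10))

# Two independent calls with the same seed
first = train_test_split(X, y, test_size=0.2, random_state=42)
second = train_test_split(X, y, test_size=0.2, random_state=42)

# train_test_split returns (X_train, X_test, y_train, y_test)
X_test_first, X_test_second = first[1], second[1]
print(f"Same test set both times: {X_test_first == X_test_second}")
print(f"Train size: {len(first[0])}, Test size: {len(first[1])}")
```

Omitting `random_state` would generally give a different shuffle on each run, which makes results harder to compare between experiments.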

---


## **12. How do you approach a Machine Learning problem?**

**Theory:** A systematic ML approach includes: 1) Define the problem and goal, 2) Collect and explore the data, 3) Clean and prepare the data, 4) Choose and train a model, 5) Evaluate performance, and 6) Deploy or improve the model. This step-by-step process helps ensure successful machine learning projects.

**Code Example:**


In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
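# Steps 1-3: load and prepare data (a tiny synthetic dataset stands in here)
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
# Steps 4-5: train on one split, then evaluate on the held-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(f"Test R^2: {model.score(X_test, y_test):.2f}")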

**Summary:** A structured approach from problem definition to model deployment ensures successful machine learning projects.

---



## **13. Why do we have to perform EDA before fitting a model to the data?**

**Theory:** EDA (Exploratory Data Analysis) helps us understand our data before building models. It reveals patterns, outliers, missing values, and relationships between variables. This knowledge helps us choose appropriate models, clean the data properly, and avoid common pitfalls that could lead to poor model performance.

**Code Example:**


In [17]:
import pandas as pd
df = pd.DataFrame({'age': [25, 30, 35], 'salary': [50000, 60000, 70000]})
print(df.describe())  # Basic statistics
print(df.corr())      # Correlation matrix

        age   salary
count   3.0      3.0
mean   30.0  60000.0
std     5.0  10000.0
min    25.0  50000.0
25%    27.5  55000.0
50%    30.0  60000.0
75%    32.5  65000.0
max    35.0  70000.0
        age  salary
age     1.0     1.0
salary  1.0     1.0


**Summary:** EDA reveals data insights and quality issues that guide proper model selection and data preparation.
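
Two of the checks mentioned above, missing values and outliers, can be sketched in a few lines of pandas. The DataFrame below is invented for illustration, and the 1.5 × IQR rule used to flag outliers is one common convention among several.

```python
import pandas as pd

# Hypothetical data with one missing age and one suspicious salary
df = pd.DataFrame({
    "age": [25, 30, None, 35, 40],
    "salary": [50000, 60000, 70000, 65000, 1000000],
})

# Count missing values per column
missing = df.isna().sum()
print(missing)

# Flag salaries outside 1.5 * IQR of the middle 50% of values
q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]
print(outliers)
```

Findings like these would normally be dealt with (for example by imputing, correcting, or removing rows) before a model is fitted, since a single extreme value can distort a linear model.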

---


## **14. How can you find correlation between variables in Python?**

**Theory:** Python provides several ways to calculate correlation. The pandas .corr() method calculates correlation between columns in a DataFrame. You can also use numpy.corrcoef() for arrays or scipy.stats functions for more advanced correlation measures. The most common is Pearson correlation coefficient.

**Code Example:**


In [18]:
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8]})
correlation_matrix = df.corr()
print(correlation_matrix)

     x    y
x  1.0  1.0
y  1.0  1.0


**Summary:** Pandas .corr() method easily calculates correlation coefficients between variables in a DataFrame.
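
Pearson correlation only captures linear relationships, so it can be useful to compare it with a rank-based measure. The sketch below uses `numpy.corrcoef` for Pearson and computes Spearman correlation as the Pearson correlation of the ranks; the cubic relationship is an invented example.

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3  # Always increasing, but not along a straight line

# Pearson: strength of the linear relationship
pearson = np.corrcoef(x.to_numpy(), y.to_numpy())[0, 1]

# Spearman: Pearson correlation computed on the ranks of the values
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Here Pearson falls below 1 because the points do not lie on a line, while the rank-based value reaches 1 because y always increases when x does.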

---


## **15. What is causation? Explain difference between correlation and causation with an example.**

**Theory:** Causation means one variable directly causes changes in another variable. Correlation only shows that two variables move together but doesn't prove one causes the other. Example: Ice cream sales and drowning deaths are correlated (both increase in summer) but ice cream doesn't cause drowning - hot weather causes both.

**Code Example:**


In [19]:
import pandas as pd
# Example: Temperature, ice cream sales, and drowning deaths
df = pd.DataFrame({
    'temp': [70, 80, 90, 100],
    'ice_cream': [10, 20, 30, 40],
    'drowning': [1, 2, 3, 4]
})
print(df.corr())  # Shows correlation but not causation

           temp  ice_cream  drowning
temp        1.0        1.0       1.0
ice_cream   1.0        1.0       1.0
drowning    1.0        1.0       1.0


**Summary:** Correlation shows that two variables move together, while causation means that one variable directly produces changes in the other. Correlation alone is not evidence of causation.
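
The role of hot weather as a common cause can be illustrated with simulated data. In this sketch, ice cream sales and drownings are both generated from temperature plus independent noise (all coefficients and noise levels are made-up assumptions), and the part explained by temperature is then removed from each before correlating what is left.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: temperature drives both variables; they never affect each other
temp = rng.uniform(60, 100, size=500)
ice_cream = 2.0 * temp + rng.normal(0, 5, size=500)
drowning = 0.1 * temp + rng.normal(0, 0.5, size=500)

def residuals(target, driver):
    """Return what is left of target after removing a straight-line fit on driver."""
    slope, intercept = np.polyfit(driver, target, 1)
    return target - (slope * driver + intercept)

# Raw correlation between the two outcomes
raw = np.corrcoef(ice_cream, drowning)[0, 1]

# Correlation after controlling for temperature
partial = np.corrcoef(residuals(ice_cream, temp), residuals(drowning, temp))[0, 1]

print(f"Raw correlation:            {raw:.2f}")
print(f"After removing temperature: {partial:.2f}")
```

The raw correlation is strong, yet it largely disappears once temperature is accounted for, which is what one would expect when a third variable causes both.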

---


## **16. What is an Optimizer? What are different types of optimizers? Explain each with an example.**

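**Theory:** An optimizer is an algorithm that adjusts model parameters to minimize the loss function during training. Common types include SGD (Stochastic Gradient Descent), which updates the parameters using the gradient computed on one sample or a small batch at a time; RMSprop, which scales each parameter's step size by a running average of its recent squared gradients; and Adam, which combines that adaptive scaling with momentum (a running average of the gradients themselves). Each follows a different strategy for finding good parameter values.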

**Code Example:**


In [20]:
from sklearn.linear_model import SGDRegressor
# SGD optimizer example
model = SGDRegressor(learning_rate='constant', eta0=0.01)
X, y = [[1], [2], [3]], [1, 2, 3]
model.fit(X, y)

**Summary:** Optimizers are algorithms that efficiently find the best model parameters by minimizing the loss function during training.
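
To show what an optimizer does at each step, here is a minimal sketch of plain (full-batch) gradient descent on a one-feature linear model, written with NumPy. It is deliberately simpler than SGD, RMSprop, or Adam; the data, learning rate, and iteration count are assumptions chosen so that the loop converges.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # True relationship: y = 2x

w, b = 0.0, 0.0          # Parameters start at arbitrary values
learning_rate = 0.05     # Step size (assumed; too large a value would diverge)

for _ in range(2000):
    error = (w * X + b) - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    # Move each parameter a small step against its gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w={w:.3f}, b={b:.3f}")
```

The learned slope should end up close to 2 and the intercept close to 0. Stochastic and adaptive optimizers follow the same loop but change how the gradient is estimated or how the step size is chosen.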

---


## **17. What is sklearn.linear_model?**

**Theory:** sklearn.linear_model is a module in scikit-learn that contains linear algorithms for regression and classification. It includes LinearRegression for predicting continuous values, LogisticRegression for classification (binary or multiclass), Ridge and Lasso for regularized regression, and other linear methods. These models assume that the prediction can be expressed as a weighted sum of the features.

**Code Example:**


In [21]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X, y = [[1], [2], [3]], [1, 2, 3]
model.fit(X, y)
prediction = model.predict([[4]])

**Summary:** sklearn.linear_model provides linear algorithms for regression and classification tasks with assumed linear relationships.

---


## **18. What does model.fit() do? What arguments must be given?**

**Theory:** model.fit() trains the machine learning model using the provided data. For supervised models it takes the training features (X) and target values (y) as required arguments, then adjusts the model's internal parameters to learn patterns from this data. Unsupervised estimators and transformers such as StandardScaler need only X. This is where the actual learning happens.

**Code Example:**


In [22]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X = [[1], [2], [3]]  # Features (required)
y = [1, 2, 3]        # Target values (required)
model.fit(X, y)      # Train the model

**Summary:** model.fit(X, y) trains the model using features X and targets y, adjusting parameters to learn data patterns.

---


## **19. What does model.predict() do? What arguments must be given?**

**Theory:** model.predict() uses the trained model to make predictions on new data. It requires only the features (X) as input - the same type of features the model was trained on. The model applies its learned parameters to these new features and returns predicted target values.

**Code Example:**


In [23]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit([[1], [2], [3]], [1, 2, 3])
predictions = model.predict([[4], [5]])  # New features to predict
print(predictions)

[4. 5.]


**Summary:** model.predict(X) applies the trained model to new features X and returns predicted target values.

---


## **20. What is feature scaling? How does it help in Machine Learning?**

**Theory:** Feature scaling adjusts all features to similar numerical ranges, preventing features with larger values from dominating the model. For example, if age ranges 20-80 and income ranges 20,000-80,000, income would overshadow age in algorithms that rely on distances or gradient descent (such as k-nearest neighbors, support vector machines, or neural networks). Scaling methods include standardization (mean=0, std=1) and normalization (range 0-1). Tree-based models are largely unaffected by feature scale.

**Code Example:**


In [24]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = [[1, 1000], [2, 2000], [3, 3000]]  # Different scales
scaled = scaler.fit_transform(data)
print(scaled)  # Now both features have similar ranges

[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]


**Summary:** Feature scaling equalizes feature ranges, preventing larger-valued features from dominating model training and improving performance.
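
The two scaling formulas mentioned above can be written out directly. This sketch applies standardization, (x - mean) / std, and min-max normalization, (x - min) / (max - min), column by column with NumPy on the same toy data; it uses the population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np

data = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])

# Standardization: subtract the column mean, divide by the column std
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Min-max normalization: rescale each column to the range 0-1
normalized = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

print(standardized)
print(normalized)
```

After either transformation the two columns become identical, even though the raw values differed by a factor of 1000.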

---

## **21. How do we perform scaling in Python?**

**Theory:** Python's sklearn.preprocessing provides several scaling methods. StandardScaler standardizes features to have mean=0 and standard deviation=1. MinMaxScaler scales features to a range like 0-1. RobustScaler is less sensitive to outliers. The process involves fitting the scaler on training data and transforming both training and test data.

**Code Example:**


In [25]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()  # or MinMaxScaler()
data = [[1, 100], [2, 200], [3, 300]]
scaled_data = scaler.fit_transform(data)

**Summary:** sklearn.preprocessing provides various scalers like StandardScaler and MinMaxScaler for normalizing feature ranges in datasets.
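
The fit-on-training, transform-both pattern described above might look like the following sketch. The train and test values are invented; the point is that the test data is scaled with the mean and standard deviation learned from the training data only.

```python
from sklearn.preprocessing import StandardScaler

X_train = [[1, 100], [2, 200], [3, 300]]
X_test = [[4, 400]]

scaler = StandardScaler()

# Learn the mean and std from the training data, then scale it
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics on the test data (transform only, no refit)
X_test_scaled = scaler.transform(X_test)

print(f"Training means: {scaler.mean_}")
print(f"Scaled test row: {X_test_scaled}")
```

Calling `fit_transform` on the test set instead would let information about the test data influence preprocessing, which tends to make evaluation results look better than they really are.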

---


## **22. Explain data encoding?**

**Theory:** Data encoding converts different data types into numerical formats that machine learning algorithms can process. This includes converting categorical variables (like colors, cities) into numbers, handling text data, and transforming dates into numerical features. Common methods include one-hot encoding for categories and label encoding for ordinal data.

**Code Example:**


In [26]:
import pandas as pd
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})
# One-hot encoding
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)

   color_blue  color_green  color_red
0       False        False       True
1        True        False      False
2       False         True      False
3       False        False       True


**Summary:** Data encoding transforms various data types into numerical formats suitable for machine learning algorithms to process effectively.
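
Dates are another data type that needs encoding before a model can use it. As a small sketch, the example below splits a date column into numeric year, month, and day-of-week features with pandas; the `signup` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical date column stored as text
df = pd.DataFrame({"signup": ["2024-01-15", "2024-07-04"]})

# Parse the strings into datetime values
df["signup"] = pd.to_datetime(df["signup"])

# Derive numeric features from each date
df["year"] = df["signup"].dt.year
df["month"] = df["signup"].dt.month
df["day_of_week"] = df["signup"].dt.dayofweek  # Monday=0 ... Sunday=6

print(df[["year", "month", "day_of_week"]])
```

Which date parts are worth extracting depends on the problem; for instance, day of week may matter for sales data but not for long-term trends.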

---