Machine Learning Assignment 1

1. What is a parameter?
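A parameter is a variable named in a function definition that receives an input value when the function is called. Parameters let the same function operate on different data. In Python they can be passed positionally or by keyword, and they can have default values. In machine learning, the term also refers to the values a model learns from data during training, such as the weights of a linear model.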

Example:
```python
def greet(name, message="Hello"):
    print(f"{message}, {name}!")

greet("Tanay")  # Uses default message
greet("Tanay", "Good morning")  # Custom message
```

Here, `name` and `message` are parameters.



2. What is correlation? What does negative correlation mean?
Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. It indicates whether and how strongly pairs of variables are related.

Key Points:
Positive Correlation: When one variable increases, the other also increases (e.g., height and weight).
Negative Correlation: When one variable increases, the other decreases (e.g., temperature and heating usage).
No Correlation: When changes in one variable do not predict changes in the other (e.g., shoe size and intelligence).
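
As a rough illustration of the negative case, the sketch below computes Pearson's correlation coefficient with NumPy. The temperature and heating figures are made-up values chosen only for this example.

```python
import numpy as np

# Hypothetical illustrative data: outdoor temperature (deg C) and heating usage (kWh).
temperature = np.array([-5.0, 0.0, 5.0, 10.0, 15.0, 20.0])
heating_kwh = np.array([42.0, 35.0, 30.0, 21.0, 14.0, 8.0])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r = np.corrcoef(temperature, heating_kwh)[0, 1]
print(f"Pearson r = {r:.3f}")  # close to -1: strong negative correlation
```

A value near -1 means heating usage falls almost linearly as temperature rises.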



3. Define Machine Learning. What are the main components in Machine Learning?
Machine Learning (ML) is a field of artificial intelligence (AI) that focuses on developing algorithms and models that allow computers to learn from and make predictions or decisions based on data, without being explicitly programmed.

Main Components:
1. Data: The input used to train models.
2. Algorithms: Methods that find patterns in the data (e.g., regression, decision trees).
3. Model: The output of the trained algorithm, used for predictions.
4. Training: The process of learning from data.
5. Testing/Validation: Evaluating the model on new data to check performance.
6. Evaluation Metrics: Measures like accuracy to assess model performance.



4. How does loss value help in determining whether the model is good or not?
The loss value quantifies how far a model's predictions are from the true targets. A lower loss means the predictions are closer to the actual values; a higher loss means the model is making larger errors and needs improvement. A low loss on the training data alone is not enough, though: the loss should also be low on validation or test data, otherwise the model may be overfitting.
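
A minimal sketch of this idea, assuming mean squared error (MSE) as the loss function. The target values and both sets of predictions are invented for illustration.

```python
import numpy as np

# Hypothetical true targets and two sets of predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
good_preds = np.array([2.9, 5.2, 6.8, 9.1])
poor_preds = np.array([5.0, 2.0, 10.0, 6.0])

def mse(y, y_hat):
    """Mean squared error: average of the squared differences."""
    return float(np.mean((y - y_hat) ** 2))

good_loss = mse(y_true, good_preds)
poor_loss = mse(y_true, poor_preds)
print(f"good model loss: {good_loss:.3f}")
print(f"poor model loss: {poor_loss:.3f}")
```

The model whose predictions sit closer to the targets gets the smaller loss.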


5. What are continuous and categorical variables?
Continuous Variables: These are numeric variables that can take any value within a range. They are measurable and can represent quantities.
Example: Age, temperature, income.
Categorical Variables: These are variables that represent categories or groups. Their values are labels rather than measured quantities, and they can take only a limited number of distinct values.
Example: Gender (Male, Female), color (Red, Blue), education level (High School, Bachelor's, Master's).



6. How do we handle categorical variables in Machine Learning?
 Categorical variables need to be converted into numerical form for ML algorithms to work:
    Label Encoding: Converts each category into a unique integer (e.g., Male = 0, Female = 1).
    One-Hot Encoding: Creates binary columns for each category, with 1 for presence and 0 for absence (e.g., Red = [1,0,0], Blue = [0,1,0]).
    Ordinal Encoding: Assigns numerical values to categories with a meaningful order (e.g., Low = 0, Medium = 1, High = 2).
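
The encodings above can be sketched as follows. The DataFrame and its column names (`color`, `level`) are hypothetical; `pd.get_dummies` and `OrdinalEncoder` are used here as one possible way to do this, not the only one.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one nominal column and one ordered column.
df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],
    "level": ["Low", "High", "Medium", "Low"],
})

# One-hot encoding: one 0/1 column per color.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Ordinal encoding: pass the category order explicitly so Low < Medium < High.
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df["level_encoded"] = encoder.fit_transform(df[["level"]]).ravel()

print(one_hot)
print(df)
```

Passing the order explicitly matters: without it, the encoder sorts categories alphabetically, which would not match Low < Medium < High.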

7. What do you mean by training and testing a dataset?
    Training Dataset: The data used to train the model. The model learns patterns and relationships from this data.
    Testing Dataset: The data used to evaluate how well the model generalizes to unseen data. It checks the model’s performance after training to ensure it works well in real-world scenarios.


8. What is sklearn.preprocessing?
sklearn.preprocessing is a module in Scikit-learn that provides tools to preprocess data for machine learning. It includes:
Scaling: Standardizes or rescales features (e.g., StandardScaler, MinMaxScaler).
Encoding: Converts categorical variables into numeric form (e.g., OneHotEncoder, OrdinalEncoder).
Imputation of missing data is a related preprocessing step, but SimpleImputer lives in the separate sklearn.impute module.


9. What is a Test Set?
 A Test Set is a subset of data set aside during the initial split, which is only used to assess the performance of the model after it has been trained. It helps evaluate how well the model generalizes to new, unseen data.






10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

In Python, the train_test_split function from sklearn.model_selection is commonly used:

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

This splits the dataset into 80% training data and 20% testing data, so that the model is trained on one part and evaluated on another.

Approach to a Machine Learning problem:
Define the Problem: Understand the problem, goal (classification, regression), and type of data.
Data Collection: Gather relevant data and ensure its quality.
Preprocessing: Clean and preprocess the data (handle missing values, scale, encode).
Model Selection: Choose an appropriate machine learning model (e.g., decision trees, neural networks).
Model Training: Train the model using the training data.
Evaluation: Evaluate the model on the test data using metrics like accuracy, precision, and recall.
Tuning and Optimization: Tune hyperparameters and refine the model.
Deployment: Deploy the model for real-world use once satisfied with its performance.
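
The steps above can be sketched end to end. This is only an outline under assumptions: it uses the iris dataset bundled with Scikit-learn and a logistic regression model as stand-ins for a real problem, and it skips tuning and deployment.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Problem and data: classify iris species from four flower measurements.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and model selection, chained so the scaler is fit on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Training.
model.fit(X_train, y_train)

# Evaluation on unseen data.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the model in one pipeline keeps test data out of the preprocessing step.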








11. Why do we have to perform EDA before fitting a model to the data?
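EDA (Exploratory Data Analysis) helps us understand the data before modeling: it reveals outliers and missing values, exposes patterns and relationships between variables, and guides the choice of features. Fitting a model without it risks training on flawed data or unsuitable features.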
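
A few typical first checks might look like the sketch below. The DataFrame is hypothetical, and the 1.5 x IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column and one suspicious age.
df = pd.DataFrame({
    "age": [23, 35, 31, np.nan, 29, 41, 120],
    "income": [32000, 54000, 47000, 51000, np.nan, 62000, 58000],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()

# Summary statistics (count, mean, std, min, quartiles, max).
summary = df.describe()

# Flag outliers in "age" with the 1.5 * IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(missing_per_column)
print(summary)
print(outliers)
```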

12. What is correlation?
Correlation measures the strength and direction of the linear relationship between two variables. The correlation coefficient ranges from -1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive).


13. What does negative correlation mean?
 Negative correlation means that as one variable increases, the other decreases, showing an inverse relationship.


14. How can you find correlation between variables in Python?
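Use pandas.DataFrame.corr() to compute the pairwise correlation matrix of a DataFrame's columns. It uses the Pearson coefficient by default. In recent pandas versions, pass numeric_only=True when the DataFrame also contains non-numeric columns:

```python
correlation = data.corr(numeric_only=True)
```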
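
A self-contained version of this call, using a small made-up DataFrame. The column names are hypothetical, and `numeric_only=True` assumes pandas 1.5 or later.

```python
import pandas as pd

# Hypothetical study data; "student" is a non-numeric column.
data = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 58, 63, 71, 78, 85],
    "hours_gaming": [9, 8, 6, 5, 3, 2],
    "student": list("ABCDEF"),
})

# numeric_only=True skips the text column instead of raising an error.
correlation = data.corr(numeric_only=True)
print(correlation.round(2))
```

Each cell of the resulting matrix holds the correlation between the row variable and the column variable, so the diagonal is always 1.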


15. What is causation? Explain the difference between correlation and causation with an example.
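Causation means that a change in one variable directly produces a change in another. Correlation only means that two variables move together; it does not show that one causes the other.

Correlation without causation: Ice cream sales and drowning incidents both rise in summer, but neither causes the other. Hot weather drives both.
Causation: Smoking directly increases the risk of lung cancer.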
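
The ice cream example can be simulated to show how a shared driver produces correlation. Everything here is synthetic: the coefficients and noise levels are arbitrary assumptions, not real statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Temperature drives both quantities; they do not influence each other.
temperature = rng.uniform(10, 35, size=n)
ice_cream_sales = 20.0 * temperature + rng.normal(0, 40, size=n)
drownings = 0.3 * temperature + rng.normal(0, 1, size=n)

# Raw correlation between the two outcomes.
raw_r = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, driver):
    """Remove the part of `values` explained by a straight-line fit on `driver`."""
    slope, intercept = np.polyfit(driver, values, 1)
    return values - (slope * driver + intercept)

# Correlation after removing the effect of temperature from both.
adjusted_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation: {raw_r:.2f}")
print(f"after removing temperature: {adjusted_r:.2f}")
```

The raw correlation is strong, but it largely disappears once temperature is accounted for, which is what we expect when neither variable causes the other.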


16. What is an Optimizer? What are different types of optimizers?
 An optimizer adjusts the parameters (e.g., weights) of a model to minimize the loss function, improving model performance during training.
Gradient Descent: Iteratively updates model parameters by moving in the direction of the negative gradient of the loss function.
Stochastic Gradient Descent (SGD): Similar to gradient descent but updates parameters using a single data point (or a small mini-batch) at a time, which makes each update much cheaper and suits large datasets.
Adam: An adaptive optimizer that combines momentum with per-parameter learning rates, scaling each update using running estimates of the first and second moments of the gradients.
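
To make the update rule concrete, here is a minimal sketch of plain (full-batch) gradient descent fitting a line with mean squared error. The data, learning rate, and step count are assumptions chosen so the example converges.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.05

for step in range(2000):
    error = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Move against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.4f}, b = {b:.4f}")
```

SGD would use the same update but compute the gradient from one sample or a mini-batch per step instead of the whole dataset.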




17. What is sklearn.linear_model?
 sklearn.linear_model is a module in Scikit-learn that provides various linear regression and classification models, including:
Linear Regression: Predicts continuous output using a linear relationship between features and target.
Logistic Regression: Used for classification problems (binary or multiclass), predicting the probability of each class.
Ridge Regression: A variant of linear regression with L2 regularization to prevent overfitting.
Lasso Regression: Another variant with L1 regularization that can also perform feature selection.
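
As a small illustration, the sketch below fits `LinearRegression` and `Ridge` to the same made-up data. The data and the `alpha=10.0` value are assumptions picked to make the effect of regularization easy to see.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical noise-free data generated from y = 3x + 2.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X.ravel() + 2.0

linear = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha sets the L2 penalty strength

print("LinearRegression slope:", linear.coef_[0])
print("Ridge slope:", ridge.coef_[0])
```

The L2 penalty shrinks the Ridge slope toward zero compared with the unregularized fit.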


18. What does model.fit() do?
 model.fit() trains the machine learning model by learning the relationships between input features (X) and target labels (y). The model’s parameters are adjusted to minimize the error on the training data.
Arguments:
X: Feature matrix (input data).
y: Target vector (output data).
Example:
```python
model.fit(X_train, y_train)
```



19. What does model.predict() do?
 model.predict() generates predictions for new, unseen data using the trained model. It applies the learned parameters to input data to predict the target value.
Argument:
X: The feature matrix of new input data.
Example:
```python
predictions = model.predict(X_test)
```
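
Putting fit() and predict() together, here is a runnable sketch. The tiny one-feature dataset and the choice of `LogisticRegression` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: small values are class 0, large values are class 1.
X_train = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)          # learn parameters from the training data

X_new = np.array([[2.5], [6.5]])     # unseen inputs
predictions = model.predict(X_new)   # predicted class labels
probabilities = model.predict_proba(X_new)  # class probabilities per row

print(predictions)
print(probabilities.round(3))
```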



20. What are continuous and categorical variables?
Continuous Variables: Numeric values that can take any value within a range and are measured, like age, salary, or temperature.
Categorical Variables: Non-numeric values representing categories or groups, such as gender, product type, or color. These can be nominal (no inherent order) or ordinal (with a defined order).




21. What is feature scaling? How does it help in Machine Learning?  
   Feature scaling brings all features onto a comparable range so that features with large numeric values do not dominate those with small ones. This speeds up gradient-based training and improves performance, especially for distance-based algorithms such as k-nearest neighbors.

22. How do we perform scaling in Python?
   Use `StandardScaler` for standardization (zero mean, unit variance) or `MinMaxScaler` for scaling features to a range (e.g., [0, 1]).
   ```python
   from sklearn.preprocessing import StandardScaler
   scaler = StandardScaler()
   X_scaled = scaler.fit_transform(X)
   ```
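
The sketch below shows both scalers on made-up arrays. It also fits each scaler on the training data only and reuses it on the test data, which is a common practice assumed here to avoid leaking test information into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[2.0, 400.0]])

# Min-max scaling: learn min and max from the training data, then reuse them.
minmax = MinMaxScaler()
X_train_mm = minmax.fit_transform(X_train)
X_test_mm = minmax.transform(X_test)

# Standardization: zero mean and unit variance per column.
standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)

print(X_train_mm)
print(X_test_mm)
print(X_train_std.round(3))
```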

23. What is sklearn.preprocessing?  
   It's a module in Scikit-learn with functions for scaling, encoding, and normalizing data, including `StandardScaler`, `MinMaxScaler`, `OneHotEncoder`, and `LabelEncoder`.

24. How do we split data for model fitting (training and testing) in Python?  
   Use `train_test_split` to split data into training and testing sets.
   ```python
   from sklearn.model_selection import train_test_split
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
   ```

25. Explain data encoding?
   Data encoding converts categorical data into numerical format.  
   - One-Hot Encoding: Converts categories to binary columns.
   - Label Encoding: Converts categories to unique numeric labels.  
   Example:
   ```python
   import pandas as pd
   df_encoded = pd.get_dummies(df, columns=['category_column'])
   ```