
# Feature Engineering
#### 1. Feature scaling
#### 2. Feature transformation
#### 3. Feature construction
#### 4. Feature extraction

## 1. Feature Scaling
Adjusting features to a common scale without distorting differences in the ranges of values.
#### Key Types:
#### Normalization: Scales values between 0 and 1.
#### Standardization: Scales values to have zero mean and unit variance.
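
As a rough illustration of the two formulas behind these scalers, the NumPy sketch below applies them by hand to a small made-up matrix. The array `X_demo` and its values are assumptions for demonstration only; scikit-learn's `StandardScaler` likewise uses the population standard deviation.

```python
import numpy as np

# Toy feature matrix (hypothetical values): two columns on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 900.0]])

# Min-max normalization: (x - min) / (max - min), computed per column.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_minmax = (X_demo - col_min) / (col_max - col_min)

# Standardization: (x - mean) / std, computed per column (population std, ddof=0).
X_zscore = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_minmax)
print(X_zscore)
```

After scaling, each column of `X_minmax` spans 0 to 1, and each column of `X_zscore` has mean 0 and standard deviation 1.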

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

## 2. Feature Transformation

#### Ordinal Encoding
Definition: Ordinal encoding is used to convert categorical data into numerical
format while preserving the order of categories. It is particularly useful
for ordinal data (data with an inherent order, e.g., "low", "medium", "high").

In [None]:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])  # Define order explicitly if needed
X_encoded = encoder.fit_transform(X)

### Label Encoding
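Definition: Label encoding converts categorical labels into numerical values by assigning each unique category an integer.
In scikit-learn, `LabelEncoder` is intended for the target variable `y`; for input features, use `OrdinalEncoder` or `OneHotEncoder`.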

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

### One-hot encoding
Definition: One-hot encoding converts categorical variables into binary columns, with each unique category represented as
a separate column containing 1 for the category it belongs to and 0 otherwise.

In [None]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # dense output; `sparse_output` replaced `sparse` in scikit-learn 1.2 (default True returns a sparse matrix)
X_encoded = encoder.fit_transform(X)

### Column Transformer
Definition: The ColumnTransformer applies different preprocessing transformations to specified columns of a dataset,
enabling efficient handling of mixed data types (e.g., numerical and categorical features).

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
transformer = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numerical_columns),
    ('cat', OneHotEncoder(), categorical_columns)
], remainder='passthrough')
X_transformed = transformer.fit_transform(X)

### Pipelines
Definition: A pipeline chains multiple steps, such as preprocessing, feature selection, and modeling, into a single object. The steps are applied in sequence, and because the preprocessing steps are fit only on the data passed to `fit` (the training data), a pipeline helps avoid data leakage from the test set.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Define the steps of the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
# Fit the pipeline
pipeline.fit(X_train, y_train)
# Predict using the pipeline
y_pred = pipeline.predict(X_test)


### Mathematical Transformation
Definition: Mathematical transformations modify feature values using mathematical functions like logarithms, square roots, or exponents to reduce skewness, stabilize variance, or make features more suitable for modeling.
### Log Transformation
Definition: Log transformation reduces the impact of extreme values by applying the logarithm function to the data, often used for positively skewed distributions.
### Power Transformation
Definition: Power transformation (e.g., Box-Cox or Yeo-Johnson) applies a mathematical power function to stabilize variance and make data more Gaussian-like.

In [None]:
from sklearn.preprocessing import PowerTransformer
import numpy as np

X_log_transformed = np.log1p(X)  # log(1 + x): defined at zero, requires x > -1
pt = PowerTransformer(method='yeo-johnson')  # 'box-cox' requires strictly positive data
X_power_transformed = pt.fit_transform(X)
pt.lambdas_  # fitted lambda value for each feature

### Discretization (Binning)
Definition: Discretization, or binning, is the process of transforming continuous features into discrete bins or intervals to simplify the data and capture important patterns.

#### Strategies:

#### Uniform Binning: Divides the range of data into equally spaced intervals.
#### Quantile Binning: Divides data into intervals containing an equal number of data points.
#### KMeans Binning: Uses clustering (K-means) to determine the bin edges.

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')  # 'quantile' or 'kmeans' are other strategies
X_binned = discretizer.fit_transform(X)

### Binarization
Definition: Binarization transforms numerical values into binary values (0 or 1) based on a specified threshold.
Values above the threshold are mapped to 1, and values below or equal to the threshold are mapped to 0.

In [None]:
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0)  # values > threshold become 1, all others 0
X_binarized = binarizer.fit_transform(X)

## Handle Missing Values
Definition: Handling missing values involves techniques to either fill in or remove missing data to prevent errors in model training.

### Strategies:

1. Imputation: Fill missing values with a specific strategy (mean, median, mode, constant value).
2. Deletion: Remove rows or columns containing missing values.

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean', add_indicator=True)  # Use 'median', 'most_frequent', or 'constant' as needed
X_imputed = imputer.fit_transform(X)

### Multivariate Imputer
Definition: A multivariate imputer uses relationships between features to predict and
impute missing values based on other available features.
### KNN Imputer
Definition: The KNN imputer fills missing values by considering the nearest neighbors of the missing data points,
    using K-nearest neighbors for imputation.

In [None]:
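from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before importing IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

imputer1 = IterativeImputer(max_iter=50, n_nearest_features=2)
X_imputed_iterative = imputer1.fit_transform(X)

imputer2 = KNNImputer(n_neighbors=5)
X_imputed_knn = imputer2.fit_transform(X)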

### Outlier Treatment Strategies
Outliers can significantly affect the performance of machine learning models. Below are strategies to deal with them:
#### How to detect outliers?
1. Normal distribution: flag values above μ + 3σ or below μ − 3σ.
2. Skewed distribution: use the interquartile range (IQR = Q3 − Q1) and flag values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
3. Other distributions: flag values outside the 2.5th to 97.5th percentile range.
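
The three detection rules above can be sketched as small NumPy helpers. The function names and the sample `values` array are hypothetical and chosen only for illustration.

```python
import numpy as np

def zscore_bounds(values, k=3.0):
    """Return (mean - k*std, mean + k*std) for roughly normal data."""
    mu, sigma = np.mean(values), np.std(values)
    return mu - k * sigma, mu + k * sigma

def iqr_bounds(values, k=1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR) for skewed data."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def percentile_bounds(values, lower_pct=2.5, upper_pct=97.5):
    """Return the values at the given lower and upper percentiles."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return lo, hi

# Hypothetical sample with one extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)

low, high = iqr_bounds(values)
outliers = values[(values < low) | (values > high)]
print(low, high, outliers)
```

On this small sample the IQR rule flags 95, while the 3σ rule does not, because the extreme value itself inflates the standard deviation. This is one reason the IQR rule is often preferred for small or skewed samples.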

### 1. Trimming
Definition: Trimming involves removing the outliers entirely from the dataset. This is done by defining thresholds and excluding any data points that fall outside of these boundaries.

In [None]:
import numpy as np

# Define upper and lower thresholds
lower_threshold = np.percentile(X, 5)
upper_threshold = np.percentile(X, 95)

X_trimmed = X[(X >= lower_threshold) & (X <= upper_threshold)]

### 2. Capping (Winsorizing)
Definition: Capping involves replacing outliers with the nearest valid data point within a defined range. This method is also known as Winsorizing.  

In [None]:
from scipy.stats import mstats
# Cap values at the 5th and 95th percentiles
X_capped = mstats.winsorize(X, limits=[0.05, 0.05])

### 3. Missing Value-like Treatment
Definition: Treat outliers as missing values and then apply imputation techniques to fill them. This strategy is helpful when the outliers are extreme and might distort the data too much.

In [None]:
import numpy as np
from sklearn.impute import SimpleImputer

# Set a threshold for outliers
outlier_threshold = np.percentile(X, 95)

# Replace outliers with NaN
X[X > outlier_threshold] = np.nan

# Impute missing values
imputer = SimpleImputer(strategy='mean')  # Use median or other strategies as needed
X_imputed = imputer.fit_transform(X)


### 4. Interquartile Range method
Definition: The IQR method identifies outliers by calculating the interquartile range (IQR), which is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). Data points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers.


In [None]:
import numpy as np
# Calculate Q1 and Q3
Q1 = np.percentile(X, 25)
Q3 = np.percentile(X, 75)
IQR = Q3 - Q1
# Define outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Option 1: filter out the outliers
X_filtered = X[(X >= lower_bound) & (X <= upper_bound)]
# Option 2: cap the outliers at the bounds instead of removing them
X_capped = np.where(X > upper_bound, upper_bound, np.where(X < lower_bound, lower_bound, X))

## 3. Feature Construction
Definition: Feature construction involves creating new features from the existing ones to improve the model's ability to learn patterns. This can be done through mathematical operations, aggregating features, or extracting domain-specific information.
### Interaction Features
Definition: Interaction features are created by combining two or more features to capture their combined effect on the target variable.

In [None]:
df['new_feature'] = df['feature1'] * df['feature2']

### Polynomial Features
Definition: Polynomial features are higher-order terms (e.g., squares, cubes) created from numerical features to capture non-linear relationships.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

### Date/Time-based Features
Definition: Extract meaningful features from date or time-related columns, such as year, month, day, day of the week, etc.

In [None]:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

### Binned Features
Definition: Create categorical features by grouping continuous values into bins or categories.

In [None]:
import pandas as pd
df['binned_feature'] = pd.cut(df['feature'], bins=5, labels=False)

## 4. Feature Extraction
### Principal Component Analysis (PCA)
Definition: PCA reduces the dimensionality of data by projecting it onto principal components that capture the most variance.
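
To make the definition concrete, here is a minimal NumPy sketch of PCA via an eigendecomposition of the covariance matrix. The synthetic data and all variable names are assumptions for illustration; scikit-learn's `PCA` uses an SVD internally, so this shows the idea rather than the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 2-D data: the second column is about 2x the first plus noise.
x1 = rng.normal(size=200)
X_demo = np.column_stack([x1, 2.0 * x1 + 0.1 * rng.normal(size=200)])

# 1. Center the data.
X_centered = X_demo - X_demo.mean(axis=0)

# 2. Eigendecomposition of the sample covariance matrix (ddof=1).
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues

# 3. Sort components by decreasing variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Fraction of variance per component, and projection onto the first component.
explained_variance_ratio = eigvals / eigvals.sum()
X_projected = X_centered @ eigvecs[:, :1]

print(explained_variance_ratio)
```

Here `eigvals` plays the role of `explained_variance_`, and the columns of `eigvecs` correspond to the rows of `components_` (up to sign).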

In [None]:
from sklearn.decomposition import PCA

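pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)  # fit before reading the attributes below
pca.explained_variance_  # eigenvalues of the covariance matrix
pca.components_  # eigenvectors (principal axes), one per row
pca.explained_variance_ratio_  # fraction of total variance explained by each component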

# Machine Learning Types

1. Supervised Learning: Trains on labeled data (e.g., classification, regression).
2. Unsupervised Learning: Trains on unlabeled data (e.g., clustering, dimensionality reduction).
3. Semi-supervised Learning: Mix of labeled and unlabeled data.
4. Reinforcement Learning: Learning through rewards and penalties.


# 1. Supervised Machine Learning

## Linear Models
### Linear Regression
Definition: Linear regression is a supervised learning algorithm that models the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation to the data.
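
As a sketch of what "fitting a linear equation" means, the example below solves the least-squares problem directly with NumPy on made-up data that follows y = 2x + 1 exactly. The arrays and names are illustrative assumptions.

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1 (no noise).
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([3.0, 5.0, 7.0, 9.0])

# Add a column of ones so the intercept is estimated with the slope.
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution: minimizes ||X_design @ beta - y_demo||^2.
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
intercept, slope = beta

print(intercept, slope)
```

The recovered `intercept` and `slope` correspond to `model.intercept_` and `model.coef_` in scikit-learn's `LinearRegression`.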

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

model.coef_
model.intercept_

### **Main Assumptions of Linear Regression**

1. **Linearity**: The relationship between the independent and dependent variables is linear.
2. **Independence**: Observations are independent of each other.
3. **Homoscedasticity**: The variance of residuals (errors) is constant across all levels of the independent variable.
4. **Normality of Errors**: Residuals are normally distributed.
5. **No Multicollinearity**: Independent variables are not highly correlated with each other.
6. **No Autocorrelation**: Residuals are not correlated (important in time series data).

## Regression Metrics
Definition: Regression metrics evaluate the performance of regression models by quantifying the error between predicted and actual values.
#### Mean Absolute Error (MAE)
Measures the average absolute difference between predicted and actual values.
#### Mean Squared Error (MSE)
Measures the average squared difference between predicted and actual values.
#### Root Mean Squared Error (RMSE)
Square root of MSE, provides error in the same units as the target variable.
#### R-Squared (Coefficient of Determination)
Represents the proportion of variance in the dependent variable that is predictable from the independent variables.
#### Adjusted R-Squared
Adjusts R² for the number of predictors in the model, penalizing features that do not improve the fit.
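
The metric definitions above translate directly into a few lines of NumPy. The arrays and the predictor count below are made-up values for illustration.

```python
import numpy as np

# Hypothetical targets and predictions from a model with one predictor.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 9.0])
n_predictors = 1

errors = y_true_demo - y_pred_demo

mae_demo = np.mean(np.abs(errors))        # mean absolute error
mse_demo = np.mean(errors ** 2)           # mean squared error
rmse_demo = np.sqrt(mse_demo)             # same units as the target

ss_res = np.sum(errors ** 2)                                # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)    # total sum of squares
r2_demo = 1 - ss_res / ss_tot

n_obs = len(y_true_demo)
adjusted_r2_demo = 1 - (1 - r2_demo) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(mae_demo, mse_demo, rmse_demo, r2_demo, adjusted_r2_demo)
```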

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error,r2_score
import numpy as np

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
# root mean squared error
rmse = np.sqrt(mse)
#r2_score
r2 = r2_score(y_true, y_pred)
# adjusted r2_score
n = len(y_true)  # Number of observations
p = X.shape[1]  # Number of predictors
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

## Gradient Descent
Definition: Gradient descent is an optimization algorithm used to minimize a function by iteratively updating its parameters in the direction of the negative gradient of the loss function with respect to the parameters.
#### Batch Gradient Descent
Definition: Batch Gradient Descent computes the gradient of the loss function using the entire dataset and updates the model parameters in each iteration. It provides stable updates but can be computationally expensive for large datasets.
#### Stochastic Gradient Descent (SGD)
Definition: Stochastic Gradient Descent (SGD) is a variant of gradient descent that updates model parameters using one training example at a time, rather than the entire dataset (the term is also used loosely for small batches). It is faster for large datasets but introduces noise in updates.
#### Mini-Batch Gradient Descent
Definition: Mini-Batch Gradient Descent splits the dataset into small batches and computes the gradient for each batch. It balances the stable updates of Batch Gradient Descent with the speed of Stochastic Gradient Descent.
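
A minimal NumPy sketch of mini-batch gradient descent for a one-feature linear regression is shown below. The synthetic data, learning rate, and batch size are assumptions chosen for illustration. Setting `batch_size = n_samples` would give batch gradient descent, and `batch_size = 1` would give stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y = 3x + 2 plus a little noise.
n_samples = 200
X_demo = rng.normal(size=(n_samples, 1))
y_demo = 3.0 * X_demo[:, 0] + 2.0 + 0.1 * rng.normal(size=n_samples)

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.1
batch_size = 20

for epoch in range(100):
    order = rng.permutation(n_samples)            # reshuffle every epoch
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        x_batch, y_batch = X_demo[idx, 0], y_demo[idx]
        error = w * x_batch + b - y_batch
        # Gradients of the batch mean squared error with respect to w and b.
        grad_w = 2.0 * np.mean(error * x_batch)
        grad_b = 2.0 * np.mean(error)
        # Step in the direction of the negative gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(w, b)
```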

In [None]:
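import numpy as np
from sklearn.linear_model import SGDRegressor

# Stochastic gradient descent; learning_rate can be 'constant', 'optimal', 'invscaling', or 'adaptive'
model = SGDRegressor(learning_rate='constant', eta0=0.01, max_iter=1000)
model.fit(X, y)

# Mini-batch gradient descent via partial_fit (assumes X_train and y_train are NumPy arrays)
sgd = SGDRegressor(learning_rate='constant', eta0=0.01)
rng = np.random.default_rng(42)
batch_size = 30
for i in range(100):
    idx = rng.choice(X_train.shape[0], size=batch_size, replace=False)
    sgd.partial_fit(X_train[idx], y_train[idx])

sgd.coef_
sgd.intercept_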

### Polynomial Features
Definition: Polynomial features generate new features by raising existing numerical features to a specified power and creating interaction terms, enabling models to capture non-linear relationships.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)


### Bias-Variance Trade-off
Definition: The bias-variance trade-off is the balance between two sources of error in machine learning models:

#### Bias: Error due to overly simplistic models that make strong assumptions and fail to capture the underlying data pattern.

#### Variance: Error due to models that are too complex and sensitive to the noise in the training data.

#### Low Bias, High Variance: Complex models (e.g., deep learning) can overfit the data, capturing noise as well as the signal.

#### High Bias, Low Variance: Simpler models (e.g., linear regression) may underfit, not capturing the complexity of the data.

#### Goal: Minimize both bias and variance to achieve good generalization, typically by selecting the right model complexity.
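
One way to see the trade-off is to fit polynomials of increasing degree to noisy data and compare training and test error. The sketch below uses made-up data (a noisy sine wave); the names and the degrees tried are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: noisy samples of sin(2*pi*x).
x_fit = np.linspace(0.0, 1.0, 30)
y_fit = np.sin(2 * np.pi * x_fit) + 0.2 * rng.normal(size=x_fit.size)

# Noise-free evaluation grid used as a stand-in for test data.
x_eval = np.linspace(0.0, 1.0, 200)
y_eval = np.sin(2 * np.pi * x_eval)

def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_fit, y_fit, deg=degree)
    train_mse = np.mean((np.polyval(coefs, x_fit) - y_fit) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_eval) - y_eval) ** 2)
    return train_mse, test_mse

results = {degree: fit_and_score(degree) for degree in (1, 3, 9)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree={degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

Training error can only go down as the degree grows, while test error is what reveals underfitting (the degree-1 line, high bias) and, for sufficiently flexible models, overfitting (high variance).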

### Regularization
Definition: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. This penalty discourages the model from learning overly complex patterns that might only capture noise in the data. Regularization methods aim to strike a balance between model complexity and accuracy.

**Key Types:**
#### L1 Regularization (Lasso):
Adds the absolute value of the coefficients as a penalty term. It can shrink some coefficients exactly to zero, so it also performs feature selection.
#### L2 Regularization (Ridge):
Adds the square of the coefficients as a penalty term. It helps to reduce the magnitude of coefficients but does not set them to zero.
#### ElasticNet:
Combines both L1 and L2 regularization, allowing for a mix of the benefits of both Lasso and Ridge.
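
To illustrate how an L2 penalty shrinks coefficients, the sketch below solves ridge regression in closed form with NumPy. It is a simplified version under stated assumptions: no intercept is fitted, the data are synthetic, and `ridge_closed_form` is a hypothetical helper name. L1 (Lasso) has no closed-form solution and is not shown.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data with known coefficients and small noise.
X_demo = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + 0.1 * rng.normal(size=50)

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y; no intercept is fitted."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = ridge_closed_form(X_demo, y_demo, alpha=0.0)     # ordinary least squares
w_ridge = ridge_closed_form(X_demo, y_demo, alpha=10.0)  # penalized solution

print(w_ols)
print(w_ridge)
```

Increasing `alpha` pulls the coefficient vector toward zero without setting individual coefficients exactly to zero.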


In [None]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# Ridge solvers: 'auto' (default), 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga', 'lbfgs'
# 'svd' and 'cholesky' solve the problem directly; 'lsqr', 'sparse_cg', 'sag', and 'saga' are iterative
# 'lbfgs' is only supported when positive=True
model = Ridge(alpha=1.0, solver='saga')  # alpha is the regularization strength

model = Lasso(alpha=1.0)  # Lasso has no solver parameter; tune alpha (too small: little effect, too large: underfitting)

model = ElasticNet(alpha=1.0, l1_ratio=0.5)  # l1_ratio controls the mix
model.fit(X, y)
y_pred = model.predict(X)

model.coef_
model.intercept_

### Logistic Regression
Definition: Logistic regression is a statistical method used for binary classification problems. It models the probability of a binary outcome (0 or 1) as a function of the input features by using the logistic function (sigmoid) to map predictions between 0 and 1.
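
The mapping from a linear score to a probability can be sketched in a few lines. The weights, bias, and input below are hypothetical values, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model parameters and one input sample.
weights = np.array([1.5, -2.0])
bias = 0.25
x_sample = np.array([1.0, 0.5])

z = weights @ x_sample + bias        # linear score
probability = sigmoid(z)             # estimated P(y = 1 | x)
label = int(probability >= 0.5)      # threshold at 0.5

print(z, probability, label)
```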

**Hyperparameters**

penalty{‘l1’, ‘l2’, ‘elasticnet’, None}, default=’l2’     **regularization type**

solver{‘lbfgs’, ‘liblinear’, ‘newton-cg’, ‘newton-cholesky’, ‘sag’, ‘saga’}, default=’lbfgs’   **optimization algorithm; not every solver supports every penalty**

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='saga',max_iter=500,random_state=42)
model.fit(X, y)
y_pred = model.predict(X)

##  Classification Metrics
Definition: Classification metrics evaluate the performance of a classification model by comparing predicted labels with actual labels.
### 1. Accuracy
Definition: The proportion of correctly classified instances to the total instances in the dataset.
### 2. Confusion Matrix
Definition: A table used to evaluate the performance of a classification algorithm by showing the counts of true positives, false positives, true negatives, and false negatives.
### 3. Precision
Definition: The proportion of correctly predicted positive instances to all instances predicted as positive. It measures the accuracy of positive predictions.
### 4. Recall (Sensitivity)
Definition: The proportion of correctly predicted positive instances to all actual positive instances. It measures the model's ability to identify all relevant positive cases.
### 5. F1-Score
Definition: The harmonic mean of precision and recall, providing a balance between them. It is useful when the class distribution is imbalanced.
### 6. Classification Report
Definition: A comprehensive summary of precision, recall, F1-score, and support (the number of occurrences of each class) for each class.
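
The metrics above can all be derived from the four confusion-matrix counts. The sketch below does this in plain Python for a small made-up set of binary labels.

```python
# Hypothetical true and predicted binary labels.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy_demo = (tp + tn) / len(pairs)
precision_demo = tp / (tp + fp)
recall_demo = tp / (tp + fn)
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(tp, tn, fp, fn)
print(accuracy_demo, precision_demo, recall_demo, f1_demo)
```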

In [None]:
from sklearn.metrics import confusion_matrix, precision_score,recall_score, f1_score,classification_report,accuracy_score

accuracy = accuracy_score(y_true, y_pred)  # accuracy
cm = confusion_matrix(y_true, y_pred)  # confusion matrix
precision = precision_score(y_true, y_pred) # precision
recall = recall_score(y_true, y_pred) # recall
f1 = f1_score(y_true, y_pred) # f1_score

report = classification_report(y_true, y_pred)  # classification_report

### Softmax Logistic Regression
Definition: Softmax logistic regression is an extension of logistic regression used for **multi-class** classification problems. It applies the softmax function to predict the probability distribution across multiple classes, ensuring the output probabilities sum to 1.
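
A minimal sketch of the softmax function itself, using NumPy. The logits are made-up scores; subtracting the row maximum is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Convert each row of scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # avoid overflow
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical class scores for two samples and three classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])

probs = softmax(logits)
predicted_class = probs.argmax(axis=1)

print(probs)
print(predicted_class)
```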

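#### Hyperparameter
multi_class{‘auto’, ‘ovr’, ‘multinomial’}, default=’auto’ (deprecated since scikit-learn 1.5; with ‘auto’, multi-class data is fit with the multinomial loss unless solver=’liblinear’)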

In [None]:
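import numpy as np
from sklearn.linear_model import LogisticRegression
from mlxtend.plotting import plot_decision_regions

# With the 'lbfgs' solver, multi-class problems are fit with a multinomial (softmax) loss.
# The multi_class parameter is deprecated since scikit-learn 1.5, so it is omitted here.
model = LogisticRegression(solver='lbfgs')
model.fit(X, y)
y_pred = model.predict(X)

# plot_decision_regions expects NumPy arrays, integer labels, and a fitted classifier (here with two features)
plot_decision_regions(np.asarray(X), np.asarray(y), clf=model, legend=2)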

### Polynomial Logistic Regression
Definition: Polynomial Logistic Regression extends logistic regression by adding polynomial features to capture non-linear relationships. This can be done within a pipeline by combining PolynomialFeatures and LogisticRegression.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# Define the pipeline with polynomial features and logistic regression
pipeline = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2,include_bias=True)),  # Add polynomial features
    ('log_reg', LogisticRegression())  # Logistic regression model
])
# Fit the pipeline
pipeline.fit(X_train, y_train)
# Predict using the pipeline
y_pred = pipeline.predict(X_test)

## Non-Linear Models
### Decision Tree
Definition: A decision tree is a non-parametric supervised learning algorithm used for classification and regression. It splits the data into subsets based on feature values, creating a tree-like structure to make predictions.

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree

model = DecisionTreeClassifier(criterion='gini',  # or 'entropy'
                               max_depth=12, min_samples_split=12,
                               min_samples_leaf=2, max_features=0.5,
                               splitter='best')  # or 'random'
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Plot the tree (assumes `data` is a scikit-learn dataset object with feature_names and target_names)
plt.figure(figsize=(10, 5))
plot_tree(model, filled=True, feature_names=data.feature_names, class_names=data.target_names)
plt.show()

### Ensemble Learning
Definition: Ensemble learning combines predictions from multiple models to improve overall performance, reduce overfitting, and enhance generalization.
#### 1. Voting Ensemble
Definition: Combines predictions from multiple models (classifiers or regressors) by voting (for classification) or averaging (for regression). Two types:
1. Hard Voting: Majority voting on predicted classes.
2. Soft Voting: Predicts the class with the highest (optionally weighted) average of predicted probabilities.
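
Hard and soft voting can disagree, as the NumPy sketch below shows for one sample. The probability table and the equal weights are hypothetical values for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities from three classifiers for one sample
# (columns are class 0 and class 1).
probas = np.array([[0.45, 0.55],
                   [0.40, 0.60],
                   [0.90, 0.10]])

# Hard voting: each classifier votes for its most likely class.
hard_votes = probas.argmax(axis=1)
hard_prediction = np.bincount(hard_votes, minlength=2).argmax()

# Soft voting: average the probabilities (optionally weighted), then take the argmax.
weights = np.array([1.0, 1.0, 1.0])
avg_proba = np.average(probas, axis=0, weights=weights)
soft_prediction = avg_proba.argmax()

print(hard_votes, hard_prediction)
print(avg_proba, soft_prediction)
```

Two of the three classifiers lean slightly toward class 1, so hard voting picks class 1, but the third classifier is very confident in class 0, so soft voting picks class 0.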

In [None]:
from sklearn.ensemble import VotingClassifier , VotingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Create individual models
model1 = LogisticRegression()
model2 = DecisionTreeClassifier()
model3 = SVC(probability=True)
estimators = [('lr', model1), ('dt', model2), ('svc', model3)]
# Combine models using VotingClassifier
voting_ensemble = VotingClassifier(estimators=estimators, voting='soft', weights=[3,2,1])  # Use 'hard' for majority voting
voting_ensemble.fit(X_train, y_train)
y_pred = voting_ensemble.predict(X_test)

#### 2. Bagging Ensemble
Definition: Combines predictions by training multiple models on different subsets of the data (sampled with replacement). Each model is trained independently, and their results are averaged (regression) or voted (classification).

1. n_estimators: int, default=10    **The number of base estimators in the ensemble.**
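2. max_samples: int or float, default=1.0 **The number (int) or fraction (float) of samples to draw from X to train each base estimator.**
3. max_features: int or float, default=1.0  **The number (int) or fraction (float) of features to draw from X to train each base estimator.**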
4. bootstrap: bool, default=True  **Whether samples are drawn with replacement. If False, sampling without replacement is performed.**
5. oob_score: bool, default=False **Whether to use out-of-bag samples to estimate the generalization error. Only available if bootstrap=True.**
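
The idea behind bootstrap sampling and out-of-bag (OOB) samples can be sketched with NumPy alone. The sample size and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Bootstrap: draw n_samples row indices with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows never drawn are "out of bag" for this estimator.
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[np.unique(bootstrap_idx)] = False
oob_fraction = oob_mask.mean()

print(oob_fraction)
```

For large datasets roughly 1/e (about 37%) of the rows are left out of each bootstrap sample, which is what makes an out-of-bag estimate of generalization error possible without a separate validation set.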

In [None]:
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
# Bagging with decision trees; oob_score=True is needed to read oob_score_ later
bagging_ensemble = BaggingClassifier(estimator=model, n_estimators=10, oob_score=True, random_state=42)
# Pasting: set bootstrap=False (out-of-bag scoring is then unavailable)

bagging_ensemble.fit(X_train, y_train)
y_pred = bagging_ensemble.predict(X_test)
bagging_ensemble.oob_score_  # out-of-bag accuracy estimate

##### Random Forest
Definition: Random forest is a bagging (bootstrap aggregation) ensemble technique that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.
For classification, the final prediction is based on majority voting.
For regression, the final prediction is the average of the predictions of all trees.

**Hyperparameters**: all the decision tree and bagging hyperparameters apply.

In [None]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
rf = RandomForestRegressor(n_estimators=40, max_depth=5,
                           n_jobs=-1,  # use all available CPU cores
                           min_samples_split=5,
                           bootstrap=True)  # set to False to train each tree on the full dataset
rf.fit(X_train, y_train)  # feature_importances_ is only available after fitting
rf.feature_importances_

#### 3. Boosting
Definition: Boosting is an ensemble technique that combines weak learners (models that perform slightly better than random guessing) sequentially. Each model corrects the errors of the previous one to improve overall performance.
##### AdaBoost (Adaptive Boosting)
Definition: AdaBoost is a boosting technique that combines multiple weak models (classifiers or regressors that perform slightly better than random guessing) into a strong model to improve overall accuracy.
AdaBoost focuses on correcting the errors made by previous classifiers by assigning higher weights to misclassified samples in each iteration.
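
The reweighting step can be sketched for a single boosting round. This follows the classic binary AdaBoost formulation with labels in {-1, +1}; scikit-learn's implementation differs in details, and the labels below are made up.

```python
import numpy as np

# Hypothetical true labels and one weak learner's predictions (one mistake at index 1).
y_true_demo = np.array([1, 1, -1, -1, 1])
y_weak = np.array([1, -1, -1, -1, 1])

# Start with uniform sample weights.
sample_weights = np.full(len(y_true_demo), 1.0 / len(y_true_demo))

misclassified = y_weak != y_true_demo
weighted_error = np.sum(sample_weights[misclassified])

# Learner weight: larger when the weighted error is smaller.
alpha = 0.5 * np.log((1.0 - weighted_error) / weighted_error)

# Increase weights of misclassified samples, decrease the rest, then renormalize.
new_weights = sample_weights * np.exp(np.where(misclassified, alpha, -alpha))
new_weights /= new_weights.sum()

print(alpha)
print(new_weights)
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.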

In [None]:
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor
from sklearn.tree import DecisionTreeClassifier
base_model = DecisionTreeClassifier(max_depth=1)  # decision stump as the weak learner
adaboost = AdaBoostClassifier(estimator=base_model, n_estimators=100, learning_rate=0.1, random_state=42)

##### Gradient Boosting
Definition: Gradient Boosting is a boosting technique that builds an ensemble of weak learners (typically shallow decision trees) sequentially. Each new learner is fit to the negative gradient of the loss with respect to the current predictions (for squared error, the residuals), so the ensemble performs gradient descent in function space. It is widely used for both regression and classification tasks.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting with hyperparameters
gb_model = GradientBoostingClassifier(
    n_estimators=100,  # Number of trees
    learning_rate=0.1,  # Shrinkage factor
    max_depth=3,  # Maximum depth of each tree
    subsample=1.0,  # Fraction of samples used per tree
    min_samples_split=2,  # Minimum samples to split a node
    min_samples_leaf=1,  # Minimum samples in a leaf node
    random_state=42  # Randomness control
)
gb_model.fit(X_train, y_train)
y_pred = gb_model.predict(X_test)



##### XGBoost
Definition: XGBoost (Extreme Gradient Boosting) is an efficient and scalable implementation of gradient boosting designed to optimize speed and performance. It supports regularization, tree pruning, and parallel processing for improved accuracy and reduced overfitting.


In [None]:
from xgboost import XGBClassifier

# XGBoost Classifier with hyperparameters
xgb_model = XGBClassifier(
    n_estimators=100,  # Number of boosting rounds
    learning_rate=0.1,  # Shrinkage factor
    max_depth=6,  # Maximum depth of trees
    min_child_weight=1,  # Minimum child weight
    gamma=0.0,  # Minimum loss reduction for split
    subsample=1.0,  # Fraction of samples per tree
    colsample_bytree=1.0,  # Fraction of features per tree
    reg_alpha=0.0,  # L1 regularization
    reg_lambda=1.0,  # L2 regularization
    random_state=42  # Random seed
)
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)


##### Ensemble Methods: Stacking and Blending
##### 1. Stacking
Definition: Stacking combines multiple base models (level-0) and trains a meta-model (level-1) on their predictions. The meta-model learns to predict the target based on the outputs of the base models.

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Base models
estimators = [
    ('dt', DecisionTreeClassifier(max_depth=3)),
    ('svc', SVC(probability=True))
]

# Stacking ensemble
stacking_model = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),  # Meta-model
    cv=5  # Cross-validation
)
stacking_model.fit(X_train, y_train)
y_pred = stacking_model.predict(X_test)


##### 2. Blending
Definition: Similar to stacking, blending combines base models and a meta-model. The difference is that blending uses a validation set (a hold-out subset of the training data) instead of cross-validation to train the meta-model.

**Steps**:
1. Split the training data into a training set and a validation set.
2. Train base models on the training set.
3. Generate predictions for the validation set and the test set.
4. Train the meta-model on the validation set predictions.
5. Make final predictions using the meta-model on test set predictions.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Split data into training and validation sets
X_train_base, X_val, y_train_base, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Train base models
model1 = DecisionTreeClassifier(max_depth=3).fit(X_train_base, y_train_base)
model2 = SVC(probability=True).fit(X_train_base, y_train_base)

# Generate validation set predictions
val_pred1 = model1.predict_proba(X_val)
val_pred2 = model2.predict_proba(X_val)

# Concatenate predictions to train meta-model
import numpy as np
meta_features = np.hstack([val_pred1, val_pred2])
meta_model = LogisticRegression()
meta_model.fit(meta_features, y_val)

# Generate test set predictions using the meta-model
test_pred1 = model1.predict_proba(X_test)
test_pred2 = model2.predict_proba(X_test)
test_meta_features = np.hstack([test_pred1, test_pred2])
y_pred = meta_model.predict(test_meta_features)


##### Agglomerative Hierarchical Clustering
Definition: Agglomerative hierarchical clustering is a bottom-up clustering method that starts with each data point as its own cluster and iteratively merges the closest clusters until a stopping criterion is met (e.g., a desired number of clusters).

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Agglomerative clustering with hyperparameters
agg_clustering = AgglomerativeClustering(
    n_clusters=3,  # Number of clusters
    metric='euclidean',  # Distance metric (this parameter was named `affinity` before scikit-learn 1.2)
    linkage='ward'  # Linkage criterion
)
agg_clustering.fit(X)
labels = agg_clustering.labels_


#### K-Nearest Neighbors (KNN)
Definition: KNN is a non-parametric, lazy learning algorithm used for classification and regression. It predicts the class or value of a sample based on the majority class or average of its nearest neighbors in the feature space.
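
The prediction rule can be sketched from scratch with NumPy: compute distances, take the k closest training points, and return the majority label. The helper name `knn_predict` and the toy data are assumptions for illustration.

```python
import numpy as np

def knn_predict(X_known, y_known, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_known - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]                     # indices of the k closest points
    labels, counts = np.unique(y_known[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical training data: two well-separated groups.
X_demo = np.array([[0, 0], [0, 1], [1, 0],
                   [5, 5], [5, 6], [6, 5]], dtype=float)
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_demo, y_demo, np.array([0.5, 0.5])))
print(knn_predict(X_demo, y_demo, np.array([5.5, 5.5])))
```

There is no training step: all the work happens at prediction time, which is why KNN is called a lazy learner.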

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# KNN with hyperparameters
knn_model = KNeighborsClassifier(
    n_neighbors=5,  # Number of neighbors
    weights='uniform',  # Weight function: uniform or distance
    algorithm='auto',  # Algorithm for nearest neighbors computation
    leaf_size=30,  # Leaf size for tree-based algorithms
    p=2,  # Power parameter for Minkowski distance
    metric='minkowski'  # Distance metric
)
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test)



##### Naive Bayes
Definition: Naive Bayes is a probabilistic classifier based on Bayes' Theorem. It assumes conditional independence between features given the class label. It is efficient and works well for high-dimensional data.

**Types**:

**Gaussian** Naive Bayes: For continuous data assuming a normal distribution.

**Multinomial** Naive Bayes: For discrete data like word counts in text classification.

**Bernoulli** Naive Bayes: For binary data.

In [None]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Gaussian Naive Bayes
gnb = GaussianNB(var_smoothing=1e-9)
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Multinomial Naive Bayes (requires non-negative feature values, e.g. counts)
mnb = MultinomialNB(alpha=1.0)
mnb.fit(X_train, y_train)
y_pred = mnb.predict(X_test)

# Bernoulli Naive Bayes
bnb = BernoulliNB(alpha=1.0, binarize=0.0)
bnb.fit(X_train, y_train)
y_pred = bnb.predict(X_test)


## Unsupervised ML
##### K-Means Clustering
Definition: K-Means is a centroid-based clustering algorithm that partitions the dataset into k clusters by minimizing the variance within each cluster. It iteratively assigns points to clusters and recalculates cluster centroids until convergence.
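
The assign-then-update loop can be sketched in NumPy as follows. This is a bare-bones version with random initialization (not k-means++); the function name `kmeans_sketch` and the toy data are assumptions.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated groups of three points.
X_demo = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)

cluster_labels, cluster_centroids = kmeans_sketch(X_demo, k=2)
print(cluster_labels)
print(cluster_centroids)
```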

In [None]:
from sklearn.cluster import KMeans

# K-Means clustering with hyperparameters
kmeans_model = KMeans(
    n_clusters=3,  # Number of clusters
    init='k-means++',  # Initialization method
    n_init=10,  # Number of initializations
    max_iter=300,  # Maximum iterations
    tol=1e-4,  # Convergence tolerance
    algorithm='lloyd'  # Algorithm to use
)
kmeans_model.fit(X)
labels = kmeans_model.labels_


#### DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Definition: DBSCAN is a density-based clustering algorithm that groups points closely packed together, marking points in low-density regions as noise. It does not require the number of clusters to be specified and can discover clusters of arbitrary shape.

In [None]:
from sklearn.cluster import DBSCAN

# DBSCAN clustering with hyperparameters
dbscan_model = DBSCAN(
    eps=0.5,  # Maximum distance for neighborhood
    min_samples=5,  # Minimum samples to form a core point
    metric='euclidean',  # Distance metric
    algorithm='auto',  # Nearest neighbor search algorithm
    leaf_size=30,  # Leaf size for tree-based algorithms
    p=2  # Power parameter for Minkowski metric
)
dbscan_model.fit(X)
labels = dbscan_model.labels_
