<a href="https://www.kaggle.com/code/kopkritsaikhiao/churn-eda-and-predictions?scriptVersionId=136022588" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Welcome 

Hey everyone, in this project, we'll be focusing on churn prediction. Our goal is to understand and predict churn using different methods. If you find this project interesting, please show your support by upvoting and sharing your feedback.

<div style="text-align: center;">
    <img src="https://img.freepik.com/free-icon/loss_318-699915.jpg?size=626&ext=jpg" alt="Loss Image">
</div>

# Load dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pandas_profiling import ProfileReport

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# numpy, pandas, and os are already imported in the previous cell
import seaborn as sns # For creating plots
import matplotlib.ticker as mtick # For specifying the axes tick format
import matplotlib.pyplot as plt
import plotly.express as px

sns.set(style='white')

## Preview the head of the dataset

In [None]:
df = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()

In [None]:
# Checking all columns names
df.columns

In [None]:
# Checking the data types of all the columns
df.dtypes

## Use pandas profiling for a quick understanding of the data

In [None]:
# Pandas profiling before data preprocessing
profile = ProfileReport(df, title='Pandas profiling before data preprocessing', minimal=True, progress_bar=False)
profile.to_notebook_iframe()

## Check for missing values and drop them

In [None]:
# Converting Total Charges to a numerical data type.
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors='coerce')
df.isnull().sum()

In [None]:
# Drop rows with missing values
df.dropna(inplace=True)

In [None]:
# Check that the rows with missing values were dropped
df.isnull().sum()

In [None]:
df.describe()

In [None]:
# Summarize each column: unique values, missing values, and dtype
def summary(df):
    print(f"Dataset has {df.shape[1]} features and {df.shape[0]} examples.")
    result = pd.DataFrame(index=df.columns)
    result["Unique"] = df.nunique().values
    result["Missing"] = df.isnull().sum().values
    # Number of fully duplicated rows in the whole dataset (same value repeated on every line)
    result["Duplicated"] = df.duplicated().sum()
    result["Types"] = df.dtypes
    return result

summary(df)

In [None]:
df.columns

# Explore relationships between each feature and churn

In [None]:
selected_columns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod']

for each_name in selected_columns:

    each_count = df.groupby([each_name, 'Churn']).size().reset_index(name='Count')

    fig = px.bar(each_count, x=f'{each_name}', y='Count', color='Churn', barmode='group')

    fig.show()

## What we see from these relationships
Based on the plots, we can observe the following patterns in churn:

* Senior citizens are more likely to churn.
* Customers without partners are more likely to churn.
* Customers without dependents are more likely to churn.
* Customers who use fiber optic service are more likely to churn.
* Customers without internet service are less likely to churn than customers with DSL or fiber optic service.
* Customers without online security, online backup, device protection, or tech support are more likely to churn.
* Customers on month-to-month contracts are more likely to churn.
* Customers who use paperless billing or pay by electronic check are more likely to churn.

Even without machine learning, we can see how churn differs across the values of each feature.
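The bar charts above show raw counts, which can be hard to compare when the groups differ in size. A minimal sketch of comparing churn rates instead, using made-up rows rather than values from the Telco dataset (the function name `churn_rate_by_group` is hypothetical):

```python
from collections import Counter

# Hypothetical toy rows of (Contract, Churn); NOT values from the Telco dataset.
rows = [
    ("Month-to-month", "Yes"),
    ("Month-to-month", "No"),
    ("Month-to-month", "Yes"),
    ("Month-to-month", "No"),
    ("Two year", "No"),
    ("Two year", "No"),
    ("Two year", "No"),
    ("Two year", "Yes"),
]

def churn_rate_by_group(rows):
    """Return {category: share of rows in that category with Churn == 'Yes'}."""
    totals = Counter(category for category, _ in rows)
    churned = Counter(category for category, churn in rows if churn == "Yes")
    return {category: churned[category] / totals[category] for category in totals}

rates = churn_rate_by_group(rows)
for category, rate in rates.items():
    print(f"{category}: {rate:.0%} churned")
```

With the notebook's DataFrame, `pd.crosstab(df['Contract'], df['Churn'], normalize='index')` should give the same kind of per-category rate table.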

In [None]:
fig = px.scatter(df, x='MonthlyCharges', y='TotalCharges')

fig.show()

In [None]:
fig = px.box(df, x='Churn', y='tenure')

fig.show()

* This plot shows that the distribution of tenure differs between customers who churn and those who do not.

In [None]:
fig = px.box(df, x='Churn', y='MonthlyCharges')

fig.show()

* This plot shows that the distribution of monthly charges differs between customers who churn and those who do not.

In [None]:
fig = px.box(df, x='Churn', y='TotalCharges')

fig.show()

* The plot shows that the distribution of total charges differs between customers who churn and those who do not. Customers with higher total charges appear less likely to churn than those with lower total charges, which suggests a negative relationship between total charges and churn.

## Based on my initial analysis before using machine learning, here are my simple conclusions and assumptions:

1. New users who do things online, like paying bills and receiving electronic statements, are more likely to stop using the service (churn). This is because they are already comfortable with online processes and may find it easy to switch to a different provider.

2. Customers who don't use online security or protection features tend to choose basic services with lower prices. They prioritize saving money over having additional security measures.

These simple conclusions and assumptions provide insights into potential factors that might affect churn behavior based on customer characteristics and behaviors.

# Machine learning part

In [None]:
df.head()

In [None]:
df.columns

# Compare two one-hot encoding approaches

In [None]:
df_no_id = df[['gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']]

df_no_id.head()

In [None]:
df_no_id.isna().sum()

## pd.get_dummies

In [None]:
# Perform one-hot encoding on selected columns
df_encoded_dummies = pd.get_dummies(df_no_id)

# Display the encoded dataframe
df_encoded_dummies.head()

In [None]:
df_object = df[['gender', 'Partner', 'Dependents',
       'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'Churn']]

df_object.head()

## OneHotEncoder with sklearn.preprocessing

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Select the columns you want to encode
columns_to_encode = df_object.columns

# Create an instance of the OneHotEncoder class
# (scikit-learn >= 1.2 names this argument sparse_output; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False, drop='first')

# Fit and transform the selected columns
encoded_data = encoder.fit_transform(df[columns_to_encode])

# Create a DataFrame with the encoded data, reusing df's index so the rows stay aligned
# after the earlier dropna() (a fresh RangeIndex would misalign the concat below)
df_encoded = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(columns_to_encode), index=df.index)

# Concatenate the encoded DataFrame with the original DataFrame
df_final = pd.concat([df.drop(columns=['customerID']+list(columns_to_encode)), df_encoded], axis=1)

# Display the encoded DataFrame
df_final.head()


In [None]:
len(df_encoded_dummies.columns), len(df_final.columns)

* Using pd.get_dummies makes it straightforward to encode categorical variables, but by default it creates one column per category, so one column in every group is redundant and the dataset gains extra dimensions.

* The scikit-learn encoder avoids this here because it was created with drop='first', which drops the first category of each feature. Passing drop_first=True to pd.get_dummies has the same effect.
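To make the redundancy concrete, here is a small pure-Python sketch of one-hot encoding with and without dropping the first category. The helper `one_hot` and the sample values are hypothetical and only illustrate the idea; they are not how pandas or scikit-learn implement it.

```python
def one_hot(values, prefix, drop_first=False):
    """One-hot encode a list of category labels into a list of row dicts."""
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    return [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]

# Hypothetical sample of contract types
contracts = ["Month-to-month", "One year", "Two year", "One year"]

full = one_hot(contracts, "Contract")
reduced = one_hot(contracts, "Contract", drop_first=True)

print(len(full[0]), "columns without dropping")
print(len(reduced[0]), "columns with the first category dropped")

# In the full encoding every row sums to 1, so any one column can be
# recovered from the others: that column carries no extra information.
print(all(sum(row.values()) == 1 for row in full))
```

A row of all zeros in the reduced encoding stands for the dropped category, so no information is lost.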

In [None]:
# Compare column sets
columns_diff = set(df_encoded_dummies.columns) - set(df_final.columns)

# Print the differing columns
print("Columns that are present in df_encoded_dummies but not in df_final:")
for column in sorted(columns_diff):
    print(column)

# Repeat the comparison for columns present in df_final but not in df_encoded_dummies
columns_diff = set(df_final.columns) - set(df_encoded_dummies.columns)
print("Columns that are present in df_final but not in df_encoded_dummies:")
for column in sorted(columns_diff):
    print(column)

The Shapiro-Wilk test statistic is typically denoted as W. It ranges between 0 and 1, where a value close to 1 indicates a good fit to a normal distribution and smaller values suggest a departure from normality. In the cell below, a column is treated as Gaussian when the p-value of the test exceeds 0.05.

In [None]:
import pandas as pd
from scipy.stats import shapiro

# Check for Gaussian distribution in each column
for column in df_encoded.columns:
    data = df_encoded[column]
    shapiro_stat, p_value = shapiro(data)
    is_gaussian = p_value > 0.05  # Set significance level as 0.05

    print(f"Column '{column}':")
    print(f"Shapiro-Wilk test statistic: {shapiro_stat}")
    print(f"P-value: {p_value}")
    print(f"Is Gaussian: {is_gaussian}")
    print("---------------------------------")

In [None]:
df_encoded_dummies.columns

## Train-test split

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Define the features and target variable
features = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges',
       'gender_Female', 'gender_Male', 'Partner_No', 'Partner_Yes',
       'Dependents_No', 'Dependents_Yes', 'PhoneService_No',
       'PhoneService_Yes', 'MultipleLines_No',
       'MultipleLines_No phone service', 'MultipleLines_Yes',
       'InternetService_DSL', 'InternetService_Fiber optic',
       'InternetService_No', 'OnlineSecurity_No',
       'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
       'OnlineBackup_No', 'OnlineBackup_No internet service',
       'OnlineBackup_Yes', 'DeviceProtection_No',
       'DeviceProtection_No internet service', 'DeviceProtection_Yes',
       'TechSupport_No', 'TechSupport_No internet service', 'TechSupport_Yes',
       'StreamingTV_No', 'StreamingTV_No internet service', 'StreamingTV_Yes',
       'StreamingMovies_No', 'StreamingMovies_No internet service',
       'StreamingMovies_Yes', 'Contract_Month-to-month', 'Contract_One year',
       'Contract_Two year', 'PaperlessBilling_No', 'PaperlessBilling_Yes',
       'PaymentMethod_Bank transfer (automatic)',
       'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check']

target = ['Churn_Yes']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df_encoded_dummies[features], np.array(df_encoded_dummies[target]).ravel(), test_size=0.2, random_state=42)
print("total features :", len(df_encoded_dummies[features].columns))
print("X \n", X_train.iloc[0], "\n y \n", y_train[0])

In [None]:
y_train

In [None]:
y_train = y_train.ravel()
y_test = y_test.ravel()

y_train

# Logistic Regression

## Logistic Regression with pd.get_dummies

In [None]:
# Create the pipeline
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),      # Apply feature scaling using MinMaxScaler
    ('classifier', LogisticRegression())    # Logistic Regression classifier
])

# Fit the pipeline on the training data

pipeline.fit(X_train, y_train)

In [None]:
# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)

In [None]:
df_final.columns

## Logistic Regression with sklearn one-hot-encoder

In [None]:
df_final.isna().sum()
df_final.dropna(inplace=True)

In [None]:
df_final.isna().sum()

In [None]:
# Define the features and target variable
features = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges',
       'gender_Male', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes',
       'MultipleLines_No phone service', 'MultipleLines_Yes',
       'InternetService_Fiber optic', 'InternetService_No',
       'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
       'OnlineBackup_No internet service', 'OnlineBackup_Yes',
       'DeviceProtection_No internet service', 'DeviceProtection_Yes',
       'TechSupport_No internet service', 'TechSupport_Yes',
       'StreamingTV_No internet service', 'StreamingTV_Yes',
       'StreamingMovies_No internet service', 'StreamingMovies_Yes',
       'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check']

target = ['Churn_Yes']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df_final[features], df_final[target], test_size=0.2, random_state=42)
print("total features :", len(df_final[features].columns))
print("X \n", X_train.iloc[0], "\n y \n", y_train.iloc[0])

In [None]:
y_train = np.array(y_train).ravel()
y_test = np.array(y_test).ravel()

y_train

In [None]:
# Create the pipeline
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),      # Apply feature scaling using MinMaxScaler
    ('classifier', LogisticRegression())    # Logistic Regression classifier
])

# Fit the pipeline on the training data

pipeline.fit(X_train, y_train)

In [None]:
# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)

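### Results

* With pandas pd.get_dummies, accuracy is 0.7874911158493249.
* With the sklearn one-hot encoder, accuracy is 0.801423487544484.

One possible explanation is that the pd.get_dummies version keeps redundant dimensions. The gap is small and comes from a single train/test split, so it may not be significant.

The formula for logistic regression is:

p(x) = 1 / (1 + e^(-z))

With many features, z is a linear combination of the inputs:

z = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ

This has the same form as multiple linear regression, where redundant, perfectly correlated columns make the coefficients harder to estimate and interpret.
So reducing the number of dimensions is preferable here.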
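The two formulas above can be sketched directly in Python. The coefficients and the input row below are hypothetical numbers chosen for illustration, not values learned by the pipeline in this notebook.

```python
import math

def predict_probability(x, weights, intercept):
    """Logistic regression: z = b0 + sum(b_i * x_i), then p = 1 / (1 + e^(-z))."""
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (b1, b2), intercept (b0), and one scaled input row
weights = [-1.5, 0.8]
intercept = 0.2
customer = [0.5, 0.25]

p = predict_probability(customer, weights, intercept)
print("Predicted churn probability:", p)
```

When z is 0 the predicted probability is exactly 0.5, which is the default decision boundary between the two classes.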

In [None]:
from sklearn.preprocessing import StandardScaler

# Create the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),      # Apply feature scaling using StandardScaler
    ('classifier', LogisticRegression())    # Logistic Regression classifier
])

# Fit the pipeline on the training data

pipeline.fit(X_train, y_train)

In [None]:
# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)

## Change from min-max scaling to standard scaling

* Min-max scaling, also known as normalization, is a data scaling technique that transforms the features to a specific range, typically between 0 and 1. It achieves this by subtracting the minimum value from each feature and then dividing it by the range (maximum value minus minimum value).

* Standard scaling, also known as z-score normalization, transforms the features to have a mean of 0 and a standard deviation of 1. It achieves this by subtracting the mean from each feature and then dividing it by the standard deviation.

In simpler terms, min-max scaling scales the features to a specific range, while standard scaling standardizes the features to have a mean of 0 and a standard deviation of 1.

In this case, the results with min-max scaling and standard scaling do not differ significantly.
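The two scaling rules described above can be written out by hand on a single feature. This is a sketch with hypothetical charge values. It uses the population standard deviation, which is what StandardScaler computes.

```python
from statistics import fmean, pstdev

def min_max_scale(values):
    """(x - min) / (max - min): maps the feature onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standard_scale(values):
    """(x - mean) / std: gives the feature mean 0 and standard deviation 1."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (ddof = 0)
    return [(v - mean) / std for v in values]

# Hypothetical monthly charges
monthly_charges = [20.0, 50.0, 80.0, 110.0]

mm = min_max_scale(monthly_charges)
st = standard_scale(monthly_charges)

print("min-max :", mm)
print("standard:", st)
```

Both transformations are linear in each feature, which is one reason a linear model such as logistic regression can end up with similar accuracy under either of them.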

In [None]:
import plotly.express as px

# Retrieve the logistic regression classifier from the pipeline
logreg = pipeline.named_steps['classifier']

# Get the coefficients of the logistic regression model
coef = logreg.coef_[0]

# Create a dataframe of variable names and corresponding coefficients
weights_df = pd.DataFrame({'Variable': features, 'Weight': coef})

# Sort the weights in descending order and select the top 10
top_weights = weights_df.sort_values('Weight', ascending=False).head(10)

# Create the bar chart using Plotly
fig = px.bar(top_weights, x='Variable', y='Weight', title='Top 10 Variable Weights')
fig.show()

In [None]:
import plotly.express as px

# Retrieve the logistic regression classifier from the pipeline
logreg = pipeline.named_steps['classifier']

# Get the coefficients of the logistic regression model
coef = logreg.coef_[0]

# Create a dataframe of variable names and corresponding coefficients
weights_df = pd.DataFrame({'Variable': features, 'Weight': coef})

# Sort the weights in ascending order and select the 10 most negative
top_weights = weights_df.sort_values('Weight', ascending=True).head(10)

# Create the bar chart using Plotly
fig = px.bar(top_weights, x='Variable', y='Weight', title='10 Most Negative Variable Weights')
fig.show()

The results indicate that certain features play a significant role in determining the outcome, such as total charges, fiber optic internet service, tenure, and monthly charges.

This is consistent with what we saw when exploring the data: several features influence the outcome variable.

# Random Forest

Random Forest is a machine learning algorithm that combines multiple decision trees to make predictions. It is called "random" because it trains each decision tree on a random sample of the training data and considers a random subset of the features when choosing splits.

Think of it like a "forest" of decision trees, where each tree independently predicts the outcome, and the final prediction is made by averaging or voting across all the individual tree predictions.

Random Forest is known for its simplicity and ease of understanding. It can handle both classification and regression tasks and is effective in handling large datasets with many features. It is also robust to outliers and can provide insights into feature importance.

In simple terms, Random Forest is like a team of decision trees working together to make accurate predictions. Its randomness and combination of trees make it a powerful and easy-to-understand algorithm for machine learning tasks.
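The sketch below illustrates only the two sources of randomness and the voting step described above. It does not fit any trees. All names and votes are hypothetical. As a simplification, the feature subset is drawn once per tree, whereas scikit-learn's RandomForestClassifier draws it at each split and averages predicted probabilities rather than counting hard votes.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features at random."""
    return rng.sample(feature_names, k)

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
feature_names = ["tenure", "MonthlyCharges", "TotalCharges", "SeniorCitizen"]

# Each "tree" would be trained on its own rows and features
for tree_id in range(3):
    rows = bootstrap_indices(6, rng)
    cols = random_feature_subset(feature_names, 2, rng)
    print(f"tree {tree_id}: rows={rows}, features={cols}")

# Hypothetical predictions from three trees for one customer
votes = ["Yes", "No", "Yes"]
final_prediction = majority_vote(votes)
print("Final prediction:", final_prediction)
```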

<div style="text-align: center;">
    <img src="https://img.freepik.com/free-photo/aerial-view-vibrant-green-trees-forest_181624-49828.jpg?size=626&ext=jpg" alt="Forest Image">
</div>

In [None]:
import pandas as pd
import plotly.express as px
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Create the pipeline with Random Forest Classifier
pipeline = Pipeline([
    ('classifier', RandomForestClassifier())    # Random Forest Classifier
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

In [None]:
# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)

In [None]:
# Retrieve the feature importances from the trained Random Forest model
importances = pipeline.named_steps['classifier'].feature_importances_

# Create a dataframe of variable names and corresponding feature importances
importance_df = pd.DataFrame({'Variable': features, 'Importance': importances})

# Sort the feature importances in descending order
importance_df = importance_df.sort_values('Importance', ascending=False)

# Create the bar chart using Plotly
fig = px.bar(importance_df, x='Variable', y='Importance', title='Feature Importance')
fig.show()

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Define the pipeline
pipeline = Pipeline([
    ('preprocessing', StandardScaler()),  # Preprocessing step
    ('classifier', RandomForestClassifier(n_estimators=1000, 
                                          oob_score=True, 
                                          n_jobs=-1, 
                                          random_state=50, 
                                          max_features='sqrt', 
                                          max_leaf_nodes=30))  # Random Forest classifier
])

# Perform cross-validation
scores = cross_val_score(estimator=pipeline, X=X_train, y=y_train, cv=5, scoring='roc_auc')

# Print scores and their mean
print("score each round :", scores)
print("Mean:", scores.mean())

In [None]:
# Defining a parameter grid for hyperparameter tuning with different values to be tested for 'n_estimators', 'max_depth', and 'max_features' hyperparameters
param_grid = [{'n_estimators': [100, 200, 300], 'max_depth': [None,2,3,10,20], 'max_features': ['sqrt',2,4,8,16,'log2', None]}]

### Tuning hyperparameters
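Before running a grid search it can help to estimate how many models will be trained. A rough sketch, assuming the parameter grid defined above and 5-fold cross-validation:

```python
from itertools import product

# Same values as the notebook's param_grid
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 2, 3, 10, 20],
    "max_features": ["sqrt", 2, 4, 8, 16, "log2", None],
}
cv_folds = 5

# Grid search tries every combination of the listed values
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds

print(len(combinations), "parameter combinations")
print(n_fits, "model fits during cross-validation")
```

With GridSearchCV's default refit=True, one more model is then trained on the full training set using the best combination.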

In [None]:
from sklearn.model_selection import GridSearchCV

# Random forest classifier with a fixed random state and parallel processing enabled
temp_rf = RandomForestClassifier(random_state=0, n_jobs=-1)
# Grid search over param_grid, scored by ROC AUC with 5-fold cross-validation
grid_search = GridSearchCV(estimator=temp_rf, param_grid=param_grid, scoring='roc_auc', cv=5, n_jobs=-1)
# Run the grid search on the training data to find the best hyperparameters
grid_search.fit(X_train, y_train)

In [None]:
# Best mean cross-validated ROC AUC score found by the grid search
grid_search.best_score_

In [None]:
# Retrieving the best parameter values found by the grid search
grid_search.best_params_

# Support Vector Machine (SVC)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Create the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),      # Apply feature scaling using StandardScaler
    ('classifier', SVC())              # Support Vector Machine classifier
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

In [None]:
# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)

# AdaBoost


AdaBoost, short for Adaptive Boosting, is a machine learning algorithm that combines multiple weak or simple models (often decision trees) to create a strong predictive model. It is an ensemble method that aims to improve the performance of individual models by sequentially training them on different weighted versions of the dataset.

The key idea behind AdaBoost is to give more importance to the misclassified instances in each iteration, allowing subsequent models to focus on the difficult cases. In each iteration, the algorithm assigns higher weights to misclassified instances and lower weights to correctly classified instances. This way, subsequent models will pay more attention to the challenging observations.

During the prediction phase, AdaBoost combines the predictions from all the weak models through a weighted voting or weighted averaging approach to make the final prediction.

AdaBoost is known for its ability to handle complex datasets and achieve high accuracy. It is especially effective when combined with weak models, as it can effectively learn from their collective knowledge and generalize well to new, unseen data.

In simple terms, AdaBoost is a technique that combines multiple weak models to create a strong and accurate model. It learns from mistakes and gives more emphasis to challenging instances, resulting in improved performance.
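The reweighting idea can be sketched as a single round of the textbook discrete AdaBoost update. The function name `adaboost_round` and the example inputs are hypothetical, and scikit-learn's AdaBoostClassifier may differ in detail.

```python
import math

def adaboost_round(weights, is_misclassified):
    """One reweighting step of discrete AdaBoost.

    weights: current sample weights, assumed to sum to 1
    is_misclassified: True where the weak model got the sample wrong
    Returns (alpha, new_weights), where alpha is the weak model's vote weight.
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, wrong in zip(weights, is_misclassified) if wrong)
    alpha = 0.5 * math.log((1 - error) / error)
    # Increase the weight of mistakes, decrease the weight of correct samples
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, is_misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Five samples with equal weight; the weak model misclassifies the third one
weights = [0.2] * 5
is_misclassified = [False, False, True, False, False]

alpha, new_weights = adaboost_round(weights, is_misclassified)
print("alpha:", alpha)
print("new weights:", new_weights)
```

After the update, the misclassified sample carries half of the total weight, so the next weak model is pushed to get it right.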


<div style="text-align: center;">
    <img src="https://static.packt-cdn.com/products/9781788295758/graphics/image_04_046-1.png" alt="adaboost Image">
</div>

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Create the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),              # Apply feature scaling using StandardScaler
    ('classifier', AdaBoostClassifier())       # AdaBoost Classifier
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

In [None]:
# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)

# XGBoost

XGBoost, short for eXtreme Gradient Boosting, is a powerful machine learning algorithm known for its high performance and efficiency. It belongs to the family of gradient boosting algorithms and is designed to optimize model performance through an iterative boosting process.

XGBoost utilizes an ensemble of decision trees to make predictions. It builds each tree in a sequential manner, with each subsequent tree focusing on correcting the mistakes of the previous trees. Rather than reweighting misclassified instances as AdaBoost does, each new tree is fit to the gradient of the loss with respect to the current predictions, so it concentrates on the instances where the current ensemble's error is largest.

What sets XGBoost apart is its advanced regularization techniques, which help prevent overfitting and improve generalization. It incorporates both L1 and L2 regularization to control the complexity of the trees and includes a term to penalize complex models.

XGBoost also offers an approximate split-finding algorithm that speeds up training by evaluating only a set of candidate split points instead of every possible one. It uses parallel processing capabilities and various optimization strategies to efficiently handle large datasets.

The algorithm offers flexibility with customizable parameters that allow fine-tuning for optimal performance. XGBoost is widely used across various domains and has achieved remarkable success in machine learning competitions.

In simple terms, XGBoost is an advanced gradient boosting algorithm that combines decision trees in an iterative process to make accurate predictions. It incorporates regularization techniques and optimization strategies to improve performance, making it a popular choice in machine learning tasks.
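The iterative idea can be sketched as plain gradient boosting with squared loss, where each new stump is fit to the residuals of the current predictions. This is not XGBoost itself: it has no regularization, no second-order terms, and no approximate split finding. The data and function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Sequentially add stumps, each fit to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - prediction
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical 1-D data: one feature value and a 0/1 target per row
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

base, stumps, preds = boost(xs, ys)
print("MSE before boosting:", mse(ys, [base] * len(ys)))
print("MSE after boosting :", mse(ys, preds))
```

The learning rate shrinks each stump's contribution, which trades slower fitting for less risk of overfitting.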

<div style="text-align: center;">
    <img src="https://www.researchgate.net/publication/345327934/figure/fig3/AS:1022810793209856@1620868504478/Flow-chart-of-XGBoost.png" alt="XGBoost Image">
</div>

In [None]:
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Create the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),              # Apply feature scaling using StandardScaler
    ('classifier', XGBClassifier())            # XGBoost Classifier
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)


In [None]:
# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)

# Differences between each model

Random Forest, AdaBoost, and XGBoost are all ensemble learning methods that combine many decision trees into a single, stronger predictive model.

However, there are some key differences between them:

1. Random Forest:

* It builds multiple decision trees independently and combines their predictions through voting or averaging.
* Each tree is trained on a random subset of the training data and a random subset of features.
* Random Forest is known for its simplicity, robustness against overfitting, and ability to handle high-dimensional data.
* It performs well in a wide range of tasks and is relatively easy to interpret.

2. AdaBoost (Adaptive Boosting):

* It trains weak models sequentially, with each subsequent model focusing on correcting the mistakes of the previous models.
* Instances that are misclassified by previous models are given higher weights, allowing subsequent models to pay more attention to them.
* AdaBoost adapts to difficult cases by boosting the importance of misclassified instances.
* It is effective in handling complex datasets and achieving high accuracy.

3. XGBoost (eXtreme Gradient Boosting):

* It is an optimized implementation of gradient boosting, designed for speed and performance.
* XGBoost uses a combination of gradient descent optimization and regularization techniques.
* It includes advanced regularization techniques, parallel processing capabilities, and optimization strategies.
* XGBoost achieves high accuracy, handles large datasets efficiently, and is widely used in machine learning competitions.


In summary, Random Forest focuses on combining independently trained decision trees, AdaBoost adapts to challenging cases by boosting the importance of misclassified instances, and XGBoost is an optimized implementation of gradient boosting with advanced regularization and optimization techniques.

# Thank you for your attention

Thank you for taking the time to read and engage with this information! If you found it helpful or enjoyable, please consider giving it an upvote. Positive feedback like this greatly motivates me to continue exploring new areas and sharing my knowledge.

This notebook provided me with many ideas for exploring data. I got inspiration from the techniques and insights shared in this Kaggle notebook: https://www.kaggle.com/code/arnabchaki/eda-on-data-science-salaries/notebook