In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
dataPath = '/content/drive/MyDrive/02 Jobs/04 President/Dataset/'
import os
os.listdir(dataPath)

['heart_disease_uci.csv']

# <p style="background-color: #F65058FF;font-family:Algerian;font-size:150%;text-align:center;color:#28334AFF;border-radius:1000px 50px;">About Data</p>

## Meta-Data (About Dataset):

### Context:
This is a multivariate dataset, meaning each record is described by several separate statistical variables. The full database includes 76 attributes, but all published studies use a subset of 14 of them: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels, and thal (thallium stress test result). The Cleveland database is the only one used by ML researchers to date. One major task on this dataset is to predict, from a patient's attributes, whether that person has heart disease. The other is exploratory: to diagnose and extract insights from the data that help in understanding the problem.

## Dataset Description:

| Variable   | Description                                                                                   |
|------------|-----------------------------------------------------------------------------------------------|
| age        | Age of the patient in years                                                                   |
| sex        | Sex of the patient (1 = male, 0 = female)                                                     |
| cp         | Chest pain type: 0: Typical angina, 1: Atypical angina, 2: Non-anginal pain, 3: Asymptomatic |
| trestbps   | Resting blood pressure in mm Hg                                                               |
| chol       | Serum cholesterol in mg/dl                                                                    |
| fbs        | Fasting blood sugar level, categorized as above 120 mg/dl (1 = true, 0 = false)               |
| restecg    | Resting electrocardiographic results: 0: Normal, 1: Having ST-T wave abnormality, 2: Showing probable or definite left ventricular hypertrophy |
| thalach    | Maximum heart rate achieved during a stress test                                              |
| exang      | Exercise-induced angina (1 = yes, 0 = no)                                                     |
| oldpeak    | ST depression induced by exercise relative to rest                                             |
| slope      | Slope of the peak exercise ST segment: 0: Upsloping, 1: Flat, 2: Downsloping                   |
| ca         | Number of major vessels (0-4) colored by fluoroscopy                                           |
| thal       | Thallium stress test result: 0: Normal, 1: Fixed defect, 2: Reversible defect, 3: Not described |
| target     | Heart disease status (0 = no disease, 1 = presence of disease)                                 |

### Acknowledgements:

## Creators:
- Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

## Relevant Papers:
- Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304-310.
- David W. Aha & Dennis Kibler. "Instance-based prediction of heart-disease presence with the Cleveland database."
- Gennari, J.H., Langley, P., & Fisher, D. (1989). Models of incremental concept formation. Artificial Intelligence, 40, 11-61.

## Citation Request:
The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the data collection at each institution.

They would be:
- Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.


# <p style="background-color: #F65058FF;font-family:Algerian;font-size:150%;text-align:center;color:#28334AFF;border-radius:1000px 50px;">Aims and Objectives</p>



1. **Exploratory Data Analysis (EDA):** Conduct an in-depth exploration of the dataset to gain valuable insights into its structure, patterns, and characteristics.

2. **Data Preprocessing for ML:** Implement robust preprocessing techniques to ensure data quality and prepare it for effective utilization in Machine Learning tasks, including handling missing values and scaling features.

3. **Random Forest and XGB Classifier Training:** Employ state-of-the-art algorithms, namely Random Forest and XGB Classifier, to train models that leverage the dataset's nuances, aiming for high accuracy and predictive power.

Through these steps, the objective is to enhance understanding, optimize data quality, and deploy advanced models for insightful analysis and informed decision-making.

# <p style="background-color: #F65058FF;font-family:Algerian;font-size:150%;text-align:center;color:#28334AFF;border-radius:1000px 50px;">Libraries and Utilities</p>

In [None]:
# import libraries

# 1. to handle the data
import pandas as pd
import numpy as np

# to visualize the dataset
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go

# To preprocess the data
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer, KNNImputer
# import iterative imputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# machine learning
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
#for classification tasks
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, RandomForestRegressor
from xgboost import XGBClassifier
from sklearn.preprocessing import KBinsDiscretizer
#metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_absolute_error, mean_squared_error, r2_score

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# <p style="background-color: #F65058FF;font-family:Algerian;font-size:150%;text-align:center;color:#28334AFF;border-radius:1000px 50px;">Data Overview</p>

In [None]:
# load the data from the csv file stored in the mounted Google Drive folder
# df = pd.read_csv('/kaggle/input/heart-disease-data/heart_disease_uci.csv')
df = pd.read_csv(dataPath + 'heart_disease_uci.csv')
# print the first 5 rows of the dataframe
df.head()

Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


- **Let's explore the gender-based distribution of the `age` column.**

- **Let's explore the `dataset` column.**

- **Let's explore the `cp` (chest pain) column:**

  **Types of chest pain:**

    1. Asymptomatic: No chest pain or discomfort.
    2. Non-Anginal: Chest pain not typical of heart-related issues; requires further investigation.
    3. Atypical Angina: Chest pain with characteristics different from typical heart-related chest pain.
    4. Typical Angina: Classic chest pain indicating potential insufficient blood supply to the heart.

- **Let's explore the `trestbps` (resting blood pressure) column:**

  Normal resting blood pressure is 120/80 mm Hg. High blood pressure increases the risk of heart disease and stroke and is often asymptomatic, while low blood pressure can lead to dizziness and fainting.

- **Let's explore the `chol` column:**

  Cholesterol is a fatty substance essential for body function, but elevated levels can contribute to heart disease.

- **Let's explore the `thal` column (thallium stress test result):**

    - Normal: Within expected or healthy parameters.
    - Reversible Defect: An abnormality that can potentially be corrected or improved.
    - Fixed Defect: An abnormality that is unlikely to change or be corrected.

- **Let's deal with `num`, the target variable:**
   * `0 = no heart disease`
   * `1 = mild heart disease`
   * `2 = moderate heart disease`
   * `3 = severe heart disease`
   * `4 = critical heart disease`


In [None]:
df['num'].value_counts()

In [None]:
# Count patients by disease level (num) and sex; print so the result is shown
print(df.groupby('num')['sex'].value_counts())
# Plot to visualize
sns.countplot(data=df, x='num', hue='sex')

In [None]:
# Count patients by disease level (num) and age; print so the result is shown
print(df.groupby('num')['age'].value_counts())
# Plot to visualize
sns.histplot(data=df, x='age', hue='num')

In [None]:
# Make Histplot using Plotly
px.histogram(data_frame=df, x='age', color='num')

# <p style="background-color: #F65058FF;font-family:Algerian;font-size:150%;text-align:center;color:#28334AFF;border-radius:1000px 50px;">Conclusions</p>

**1. Age**
1. The youngest patient in the dataset is 28 years old.
2. Most patients with heart disease are around 53-54 years old.
3. For both males and females, heart disease is most common around 54-55 years of age.
4. Male percentage in the data: 78.91%
5. Female percentage in the data: 21.09%
6. Males are 274.23% more numerous than females in the data.
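
The percentages above can be reproduced with a few lines of pandas. This is a minimal sketch on a hypothetical four-row frame (`toy` is not the real data); on the notebook's `df` the same expressions would be applied to `df['sex']`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `df`; only the `sex` column matters here.
toy = pd.DataFrame({"sex": ["Male", "Male", "Male", "Female"]})

counts = toy["sex"].value_counts()

# Share of each sex in the data, in percent
male_pct = counts["Male"] / len(toy) * 100
female_pct = counts["Female"] / len(toy) * 100

# Relative excess of males over females, in percent
excess_pct = (counts["Male"] - counts["Female"]) / counts["Female"] * 100

print(f"Male percentage in the data: {male_pct:.2f}%")
print(f"Female percentage in the data: {female_pct:.2f}%")
print(f"Males are {excess_pct:.2f}% more numerous than females")
```

Note that the "more numerous" figure is computed from the raw counts, so recomputing it from the rounded percentages can differ slightly in the last digits.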

**2. Dataset**
1. The highest number of people come from Cleveland (304) and the lowest from Switzerland (123).
2. The highest number of females in this dataset are from Cleveland (97) and the lowest from VA Long Beach (6).
3. The highest number of males in this dataset are from Hungary (212) and the lowest from Switzerland (113).

**2.1 Observations:**

Age statistics by dataset:

| Dataset       | Mean Age  | Median Age | Mode Age  |
|---------------|-----------|------------|-----------|
| Cleveland     | 54.351974 | 55.5       | 58        |
| Hungary       | 47.894198 | 49.0       | 54        |
| Switzerland   | 55.317073 | 56.0       | 61        |
| VA Long Beach | 59.350000 | 60.0       | 62 and 63 |
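
Per-dataset statistics like these can be computed with a single `groupby`. The sketch below uses a small hypothetical frame (`toy`), not the real data; it assumes the column names `dataset` and `age` shown in `df.head()`.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use `df` instead.
toy = pd.DataFrame({
    "dataset": ["Cleveland"] * 3 + ["VA Long Beach"] * 4,
    "age": [58, 58, 63, 62, 62, 63, 63],
})

by_site = toy.groupby("dataset")["age"]

# Mean and median in one pass
summary = by_site.agg(["mean", "median"])

# A group can have several modes, so collect them as a list per group
summary["mode"] = by_site.apply(lambda s: s.mode().tolist())

print(summary)
```

Collecting the mode as a list is what allows a bimodal group (such as VA Long Beach in the table above) to report both values.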
      
 ---  

**3. Chest Pain Output:**

   Disease levels (`num`):

   - 0 = no heart disease
   - 1 = mild heart disease
   - 2 = moderate heart disease
   - 3 = severe heart disease
   - 4 = critical heart disease

   1. A total of 104 individuals are asymptomatic (no chest pain) and have no heart disease.

   2. Only 23 asymptomatic individuals have critical heart disease.

   3. A group of 83 asymptomatic individuals have severe heart disease.

   4. In the dataset, 197 asymptomatic individuals have mild heart disease.

   5. Among the asymptomatic individuals, 89 have moderate heart disease.

**3.1 Results According to Group by cp and Num:**

| __CP__ | __Num__ | __Value-Count__ |
| :--- | :--- | :--- |
| __asymptomatic__ | 0 | 104 |
| __asymptomatic__ | 1 | 197 |
| __asymptomatic__ | 2 | 89 |
| __asymptomatic__ | 3 | 83 |
| __asymptomatic__ | 4 | 23 |
| __atypical angina__ | 0 | 150 |
| __atypical angina__ | 1 | 19 |
| __atypical angina__ | 2 | 2 |
| __atypical angina__ | 3 | 3 |
| __non-anginal__ | 0 | 131 |
| __non-anginal__ | 1 | 37 |
| __non-anginal__ | 2 | 14 |
| __non-anginal__ | 3 | 18 |
| __non-anginal__ | 4 | 4 |
| __typical angina__ | 0 | 26 |
| __typical angina__ | 1 | 12 |
| __typical angina__ | 2 | 3 |
| __typical angina__ | 3 | 4 |
| __typical angina__ | 4 | 1 |

In the table above, `Num` (0, 1, 2, 3, 4) is the disease level and `Value-Count` is the number of patients with that chest pain type and disease level.
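
A table like the one above can be produced either in long form with `groupby(...).value_counts()` or in wide form with `pd.crosstab`. This is a minimal sketch on hypothetical toy rows, assuming the column names `cp` and `num` from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use `df[['cp', 'num']]`.
toy = pd.DataFrame({
    "cp": ["asymptomatic", "asymptomatic", "typical angina", "non-anginal", "asymptomatic"],
    "num": [1, 1, 0, 0, 2],
})

# Long format: one row per (cp, num) pair that actually occurs
long_counts = toy.groupby("cp")["num"].value_counts()

# Wide format: cp as rows, num as columns, 0 where a pair never occurs
wide_counts = pd.crosstab(toy["cp"], toy["num"])

print(long_counts)
print(wide_counts)
```

The wide form makes missing combinations explicit as zeros, which the long form silently omits.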

---     

**4. Missing Values Imputation:**

   We impute missing values using Iterative Imputer and Random Forest. Some columns in this dataset have a high ratio of missing values, so we use these more advanced methods instead of simple fills. We define functions for imputing missing values: each takes a column name and returns that column with its missing values filled.

**4.1 Methods:**
   1. Random Forest Classifier
   2. Random Forest Regressor
   3. Iterative Imputer
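
As an illustration of the third method, here is a minimal sketch of scikit-learn's `IterativeImputer` on a hypothetical numeric frame. The values in `toy` are made up for the example, and the settings (`max_iter=10`, `random_state=42`) are assumptions, not the configuration used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical numeric columns with a few gaps
toy = pd.DataFrame({
    "trestbps": [145.0, 160.0, np.nan, 130.0, 130.0, 120.0],
    "chol": [233.0, 286.0, 229.0, np.nan, 204.0, 236.0],
    "thalch": [150.0, 108.0, 129.0, 187.0, np.nan, 178.0],
})

# Each column with gaps is modeled from the other columns, round-robin
imputer = IterativeImputer(max_iter=10, random_state=42)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(filled.isnull().sum())
```

Observed values are left unchanged; only the `NaN` cells are replaced by model estimates.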
   
---

**5. Outliers**

   From my observations, there is only one true outlier in the dataset, which I removed. The other extreme values carry meaningful information, so we do not remove them and leave them in the dataset.

---

**6. Thal Output**

   - Normal: Within expected or healthy parameters.
   - Reversible Defect: An abnormality that can potentially be corrected or improved.
   - Fixed Defect: An abnormality that is unlikely to change or be corrected.

   1. Among the individuals, 110 males and 86 females are classified as normal.
   2. A total of 42 males and 4 females exhibit a fixed defect.
   3. In the dataset, 171 males and 21 females are identified with a reversible defect.
   4. The higher number of males compared to females is attributed to the dataset's male predominance.
   5. Chest pain occurs both in individuals with a thal defect and in those with a normal result.
   6. Individuals with a normal result are more often free from heart disease, although some still have heart-related conditions.
   7. Those with a thal defect generally have an increased likelihood of heart disease, yet some of them do not develop such health issues.

---

**7. Num**
   1. In absolute numbers, more men than women are disease-free, which partly reflects the male predominance in the dataset.

   2. At the same time, based on the dataset, men are more affected by heart disease than women.


# <p style="background-color: #F65058FF;font-family:Algerian;font-size:150%;text-align:center;color:#28334AFF;border-radius:1000px 50px;">Handling Missing Values</p>

**Why It Is Important to Deal With Missing Values**

Handling missing values is crucial in data analysis and modeling for several reasons:

1. **Maintaining Data Integrity:** Missing values can lead to inaccuracies and distort the overall integrity of the dataset, affecting the quality of analysis and resulting in flawed conclusions.

2. **Preventing Biased Results:** Ignoring missing values may lead to biased results, as the available data may not be representative of the entire population. This bias can impact the validity of statistical inferences and machine learning models.

3. **Enhancing Model Performance:** Many machine learning algorithms cannot handle missing data directly. Imputing or addressing missing values ensures that the model is trained on a complete dataset, improving its performance and generalizability.

4. **Preserving Statistical Power:** Missing values reduce the sample size, potentially diminishing the statistical power of analyses. Addressing missing data helps preserve the representativeness and reliability of the study.

5. **Avoiding Misinterpretation:** Incomplete data can mislead analysts, leading to incorrect assumptions or interpretations. Addressing missing values ensures that insights drawn from the data are more accurate and trustworthy.

6. **Supporting Decision-Making:** In business and research, decisions based on incomplete or inaccurate data can have significant consequences. Handling missing values ensures that decisions are informed by reliable and complete information.

Overall, dealing with missing values is essential for maintaining the quality, accuracy, and reliability of data, which is fundamental to sound data analysis and modeling practices.
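
Before imputing, it helps to see how much is missing in each column. The sketch below is an assumption about how lists such as `missing_data_cols`, `numeric_cols`, and `categorical_cols` could be built; the frame `toy` is hypothetical, and splitting columns purely by dtype is a simplification.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in one numeric and one categorical column
toy = pd.DataFrame({
    "age": [63, 67, 37, 41],
    "chol": [233.0, np.nan, 250.0, np.nan],
    "thal": ["normal", None, "fixed defect", "normal"],
})

# Percentage of missing values per column
missing_pct = (toy.isnull().mean() * 100).round(2)

# Columns that need imputation
missing_data_cols = missing_pct[missing_pct > 0].index.tolist()

# Split by dtype (assumption: every non-numeric column is treated as categorical)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_pct)
```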

---

- Here I define functions for imputing null values: I pass in a column name, and the function imputes the null values in that column.
- The methods used for imputation are Random Forest Classifier, Random Forest Regressor, and Iterative Imputer.
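
To illustrate the idea behind Random Forest imputation, here is a minimal sketch for a numeric column. The helper `impute_numeric_with_rf`, its `predictors` argument, and the toy values are all hypothetical; the notebook's own imputation functions take only a column name and may work differently.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_numeric_with_rf(frame, col, predictors):
    """Fill NaNs in `col` with predictions from a forest trained on the rows where `col` is known."""
    known = frame[frame[col].notnull()]
    unknown = frame[frame[col].isnull()]
    if unknown.empty:
        return frame[col]
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(known[predictors], known[col])
    filled = frame[col].copy()
    filled.loc[unknown.index] = model.predict(unknown[predictors])
    return filled

# Hypothetical toy frame: `chol` has gaps, the predictor columns are complete
toy = pd.DataFrame({
    "age": [63, 67, 67, 37, 41, 56, 62, 57],
    "trestbps": [145.0, 160.0, 120.0, 130.0, 130.0, 120.0, 140.0, 120.0],
    "chol": [233.0, 286.0, np.nan, 250.0, 204.0, 236.0, np.nan, 354.0],
})
toy["chol"] = impute_numeric_with_rf(toy, "chol", ["age", "trestbps"])
print(toy)
```

For a categorical column the same pattern would apply with `RandomForestClassifier` in place of the regressor.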




In [None]:
# suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Impute missing values column by column using our functions.
# missing_data_cols, categorical_cols, numeric_cols and the two impute_* helpers
# are expected to be defined in earlier cells.
for col in missing_data_cols:
    missing_pct = round((df[col].isnull().sum() / len(df)) * 100, 2)
    print("Missing Values", col, ":", str(missing_pct) + "%")
    if col in categorical_cols:
        df[col] = impute_categorical_missing_data(col)
    elif col in numeric_cols:
        df[col] = impute_continuous_missing_data(col)

In [None]:
# Check the missing values again
df.isnull().sum()

- We are now done imputing missing values using advanced methods such as Random Forest and Iterative Imputer, which are generally more accurate than filling with the mean, median, or mode. We defined functions for imputing missing values: each takes a column name and returns that column with its missing values filled.
- Methods:
  1. Random Forest Classifier
  2. Random Forest Regressor
  3. Iterative Imputer

# <p style="background-color: #F65058FF;font-family:Algerian;font-size:150%;text-align:center;color:#28334AFF;border-radius:1000px 50px;">Handling Outliers</p>
<div style="border-radius:10px; padding: 15px; background-color: #F65058FF; font-size:120%; text-align:left">

<h3 style="color:Black;font-family:newtimeroman;font-size:100%;text-align:center;">

Outliers, statistical anomalies lying outside the norm, wield significant influence on data analysis. These exceptional data points, while occasionally dismissed as errors, often unveil critical insights. Identifying outliers is pivotal for robust analysis, ensuring data integrity and model accuracy. Outliers can illuminate hidden patterns, detect anomalies, and refine predictive models. Their nuanced exploration adds depth to statistical understanding, fostering more informed decision-making. Thus, treating outliers with thoughtful consideration enhances the resilience and reliability of data-driven insights, providing a comprehensive and nuanced perspective on the underlying patterns within a dataset.
</h3>
</div>

In [None]:
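# create box plots for all numeric columns using a for loop and subplots
# (numeric_cols is expected to be defined in an earlier cell)
plt.figure(figsize=(20, 20))

colors = ['red', 'green', 'blue', 'orange', 'purple']

for i, col in enumerate(numeric_cols):
    plt.subplot(3, 2, i+1)
    # cycle through the colors so the loop cannot run past the end of the list
    sns.boxplot(x=df[col], color=colors[i % len(colors)])
    plt.title(col)
plt.show()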

In [None]:
# print the row from df where trestbps value is 0
print(df[df['trestbps'] == 0])
# remove this row from the data
df = df[df['trestbps'] != 0]

- From my observations, there is only one true outlier in the dataset (the row with `trestbps` equal to 0), which I removed. The other extreme values carry meaningful information, so we leave them in the dataset.
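
The points drawn beyond the whiskers of a box plot follow the 1.5 × IQR rule by default. A minimal sketch of listing such points for one column is shown below; the helper `iqr_outliers` and the values in `toy` are hypothetical, and a flagged value is only a candidate for review, not automatically an error.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical resting blood pressure values, including an impossible 0
toy = pd.Series([0.0, 120.0, 125.0, 130.0, 135.0, 140.0, 145.0], name="trestbps")

flagged = iqr_outliers(toy)
print(flagged)
```

Here only the value 0 is flagged, which mirrors the decision above to drop the row where `trestbps` is 0 and keep the rest.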


# <p style="background-color: #F65058FF;font-family:Algerian;font-size:150%;text-align:center;color:#28334AFF;border-radius:1000px 50px;">Machine Learning ( Model Building )</p>


 The target is derived from `num`, the predicted attribute. The raw `num` column takes the values [0, 1, 2, 3, 4]; for modeling we collapse it into a binary `target` column.
 The unique values in this column are: [0, 1].

- 0 = no heart disease
- 1 = heart disease

We use the following classifiers to predict heart disease on this binary target:

1. Random Forest
2. XGB Classifier


- Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Train test Split
from sklearn.model_selection import train_test_split
# Models
from sklearn.naive_bayes import GaussianNB , BernoulliNB , MultinomialNB
from sklearn.tree import DecisionTreeClassifier , DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier , AdaBoostClassifier ,RandomForestRegressor , AdaBoostRegressor
from xgboost import XGBClassifier , XGBRegressor
from sklearn.linear_model import LinearRegression ,LogisticRegression
from sklearn.neighbors import KNeighborsRegressor , KNeighborsClassifier
from sklearn.ensemble import GradientBoostingRegressor , GradientBoostingClassifier
from sklearn.svm import SVC , SVR
#metrics
from sklearn.metrics import mean_squared_error , mean_absolute_error , r2_score , classification_report , accuracy_score , f1_score , precision_score
#import grid search cv for cross validation
from sklearn.model_selection import GridSearchCV , RandomizedSearchCV
# import preprocessors
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder , OneHotEncoder
from sklearn.preprocessing import QuantileTransformer , PowerTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Remove Warning
import warnings
warnings.filterwarnings('ignore')
# Saving Model
import pickle

# <p style="background-color: #F65058FF;font-family:Algerian;font-size:150%;text-align:center;color:#28334AFF;border-radius:1000px 50px;">Process</p>

- Rename the columns and drop the irrelevant ones.
- We drop irrelevant columns such as `id` and keep the columns that matter for prediction.
- Some values in `thal` and `cp` contain spaces, which could cause problems later on, so we replace those spaces with underscores.
- In the `target` column, 0 means 'No disease' and 1 means 'Disease'. The original `num` column has 5 categories, of which 1, 2, 3, and 4 all represent disease, so we create a new column with only two categories, one for disease and one for no disease:

  `data_1['target'] = ((data['num'] > 0)*1).copy()`

- `(data['sex'] == 'Male')*1`: the comparison yields boolean values (True/False), which are then multiplied by 1. In Python, True is equivalent to 1 and False is equivalent to 0 in arithmetic operations, so this converts the boolean values into numerical values (1 for 'Male' and 0 for 'Female').
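
The boolean-to-integer trick can be checked on a tiny example. This is a minimal sketch on a hypothetical three-row frame; `.astype(int)` is shown as an equivalent alternative to multiplying by 1.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the dataset
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "num": [0, 2, 1]})

# Comparison gives True/False; multiplying by 1 turns that into 1/0
toy["sex_encoded"] = (toy["sex"] == "Male") * 1

# Any disease level above 0 counts as disease; astype(int) does the same conversion
toy["target"] = (toy["num"] > 0).astype(int)

print(toy)
```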


In [None]:
# Load Again The Clean Data
data = df.copy()
data.head()

In [None]:
# Some category values contain spaces, which can create problems later on,
# so we replace them with underscore-separated names.
# Assigning the result back (instead of inplace=True on a column) also works on recent pandas versions.
data['thal'] = data['thal'].replace({'fixed defect': 'fixed_defect', 'reversable defect': 'reversable_defect'})
data['cp'] = data['cp'].replace({'typical angina': 'typical_angina', 'atypical angina': 'atypical_angina'})
data['restecg'] = data['restecg'].replace({'normal': 'normal', 'st-t abnormality': 'ST-T_wave_abnormality', 'lv hypertrophy': 'left_ventricular_hypertrophy'})

# Generate a new dataset with only the necessary columns.
data_1 = data[['age','sex','cp','dataset', 'trestbps', 'chol', 'fbs','restecg' , 'thalch', 'exang', 'oldpeak', 'slope', 'ca', 'thal']].copy()
# Binarize the target variable: 0 for no disease, 1 for disease
data_1['target'] = (data['num'] > 0)*1
# Encode sex: 1 for Male, 0 for Female
data_1['sex'] = (data['sex'] == 'Male')*1
# Encode fbs and exang as 1/0
data_1['fbs'] = (data['fbs'])*1
data_1['exang'] = (data['exang'])*1
# Rename the columns.
data_1.columns = ['age', 'sex', 'chest_pain_type','country' ,'resting_blood_pressure',
              'cholesterol', 'fasting_blood_sugar','Restecg',
              'max_heart_rate_achieved', 'exercise_induced_angina',
              'st_depression', 'st_slope_type', 'num_major_vessels',
              'thalassemia_type', 'target']
# Show a sample of the data
data_1.head()

# <p style="background-color: #F65058FF;font-family:Algerian;font-size:150%;text-align:center;color:#28334AFF;border-radius:1000px 50px;">Random Forest</p>

Random Forest is an ensemble learning technique used for both classification and regression tasks. It builds multiple decision trees during training and combines their predictions to improve accuracy and reduce overfitting.

Key advantages:

1. High accuracy
2. Robust to overfitting
3. Can handle missing values (depending on the implementation)

Random Forest is a versatile and powerful algorithm, especially effective in scenarios with high-dimensional data and complex relationships. It performs well where high accuracy is crucial, and its resistance to overfitting makes it a popular choice in machine learning applications.
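
As a rough illustration of how a Random Forest can be tuned and evaluated, here is a minimal sketch. The function `train_random_forest_sketch`, the parameter grid, and the synthetic data from `make_classification` are assumptions for the example; this is not the implementation of the `train_random_forest` function used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

def train_random_forest_sketch(frame, target):
    """Grid-search a small Random Forest and report test accuracy (hypothetical helper)."""
    # One-hot encode any string columns so the forest receives numeric input
    X = pd.get_dummies(frame.drop(columns=[target]))
    y = frame[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Small illustrative grid; a real search would usually be wider
    param_grid = {
        "n_estimators": [50, 100],
        "max_depth": [5, 10],
        "min_samples_leaf": [1, 4],
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="accuracy"
    )
    search.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, search.predict(X_test))
    return search.best_params_, accuracy

# Synthetic stand-in for the prepared dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(6)])
demo["target"] = y_demo

best_params, accuracy = train_random_forest_sketch(demo, "target")
print(best_params, accuracy)
```

The accuracy is measured on a held-out split that the grid search never sees, so it is a fairer estimate than the cross-validation score of the best configuration.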


In [None]:
train_random_forest(data_1, 'target')

# <p style="background-color: #F65058FF;font-family:Algerian;font-size:150%;text-align:center;color:#28334AFF;border-radius:1000px 50px;">XGBoost</p>

1. Definition:

   XGBoost is a scalable and efficient machine learning algorithm that belongs to the ensemble learning category. Specifically, it's a gradient boosting framework designed for speed and performance, utilizing decision trees as base learners.

2. Key Characteristics:

    1. Gradient Boosting:
       - Builds an ensemble of weak learners (usually decision trees) sequentially, each correcting the errors of its predecessor.

    2. Regularization:
       - Implements regularization techniques to prevent overfitting.

    3. Parallel Processing:
       - Allows parallelization of tree construction, making it computationally efficient.

3. Handling Missing Values:
   - Handles missing values natively by learning a default branch direction for them at each split.

4. Advantages:

    1. High Performance:
       - Achieves high accuracy and efficiency, often outperforming other algorithms.

    2. Feature Importance:
       - Provides insights into feature importance, aiding interpretability.

    3. Flexibility:
       - Can handle various types of data and problems, both regression and classification.



In [None]:
train_xgb_classifier(data_1,'target')

# <p style="background-color: #F65058FF;font-family:Algerian;font-size:150%;text-align:center;color:#28334AFF;border-radius:1000px 50px;">Final Conclusion</p>

<h3 style="color:Black;font-family:newtimeroman;font-size:100%;">

<div style="border-radius:10px; padding: 15px; background-color: #F65058FF; font-size:120%; text-align:left">

In this study, we trained two machine learning models, Random Forest and XGB Classifier, for our binary classification task (heart disease vs. no heart disease). After hyperparameter tuning, we obtained the best configuration for each model.

The best Random Forest configuration used a maximum depth of 10, a minimum of 4 samples per leaf, a minimum of 2 samples per split, and 100 estimators. This resulted in an accuracy of 84% on the test set.

The best XGB Classifier configuration used a colsample by tree of 0.8, a gamma value of 2, a learning rate of 0.1, a maximum depth of 3, 50 estimators, and a subsample ratio of 1.0. This configuration yielded a slightly higher accuracy of 86% on the test set.

Both models handled the classification task well. The Random Forest model reached high accuracy and is robust to overfitting, making it a reliable choice. The XGB Classifier reached a slightly higher accuracy, which may reflect its ability to capture more complex relationships within the data.

Ultimately, the choice between these models depends on specific requirements and preferences. With only a two-point gap in test accuracy, either model is a reasonable choice: Random Forest where robustness and simpler tuning matter most, and XGB Classifier where squeezing out additional accuracy is the priority.

</div>
</h3>
