# **Project name: Cancer Type Prediction**

# <b> <span style="color: #FF77B7; font-size: 1.5em;"> TABLE OF CONTENTS </span> </b>
    
* [INTRODUCTION](#0)
* [1. OVERVIEW.](#1)
    * [1.1. About the dataset.](#1.1.)
    * [1.2. Features, label and target.](#1.2.)
* [2. DATA WRANGLING.](#2)
    * [2.1. General.](#2.1.)
    * [2.2. Anomalies detection.](#2.2)
    * [2.3. Summary.](#2.3)
* [3. EXPLORATORY DATA ANALYSIS.](#3)
    * [3.1. Features Visualization.](#3.1.)
    * [3.2. Summary.](#3.2.)
* [4. MODEL DEVELOPMENT.](#4)
    * [4.1. Preprocessing.](#4.1)
    * [4.2. Model Building.](#4.2)
* [5. MODEL TESTING.](#5)
* [CONCLUSION.](#6)
* [REFERENCES.](#7)

# Libraries

In [None]:
%pip install xgboost



In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# pandas, numpy, matplotlib.pyplot and seaborn are already imported in the previous cell
import matplotlib.patches as mpatches
import matplotlib.ticker as mtick

from imblearn.metrics import classification_report_imbalanced
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline, Pipeline
from imblearn.under_sampling import NearMiss

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_recall_curve, precision_score,
                             recall_score, roc_auc_score, roc_curve, RocCurveDisplay)
from sklearn.model_selection import (GridSearchCV, KFold, RandomizedSearchCV,
                                     ShuffleSplit, StratifiedKFold,
                                     StratifiedShuffleSplit, cross_val_predict,
                                     cross_val_score, learning_curve,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from xgboost import XGBClassifier

# **INTRODUCTION**

In recent years, early detection of cancer has been one of the primary challenges in the healthcare industry. As cancer remains a leading cause of mortality worldwide, identifying factors that influence diagnosis and prognosis has become critical for improving patient outcomes. One of the most effective ways to tackle this challenge is by leveraging machine learning and data analytics to predict cancer diagnoses based on patient data. These techniques allow for better decision-making, early intervention, and more personalized treatment plans.

Healthcare data analytics, especially when applied to cancer detection, plays a significant role in predicting patient outcomes and identifying patterns in medical data that may not be easily visible to healthcare professionals. By analyzing patient profiles, particularly the features associated with cancer diagnoses, healthcare institutions can build models that assist in predicting whether a tumor is malignant or benign. This not only helps in early detection but also supports the development of more effective treatment protocols, ensuring timely and appropriate care for patients.

In this study, we examine a dataset derived from clinical features of breast cancer tumors, aiming to classify whether a tumor is malignant or benign based on several biological metrics such as radius, texture, perimeter, area, and smoothness. By employing classification techniques and machine learning models, we seek to provide insights that will aid healthcare providers in diagnosing cancer more accurately. The ultimate goal is to enhance predictive accuracy, helping medical professionals in making informed decisions for cancer treatment and improving overall patient care.

# **1. OVERVIEW**

## 1.1. About the dataset. <a id="1.1."></a>

The dataset "Cancer_Data.csv" contains 569 observations, each with 33 attributes describing tumor characteristics. The goal is to classify each tumor as malignant (M) or benign (B), so the main task is to analyze these features and build predictive models for cancer diagnosis.

<h3> Dataset Summary </h3>

*   **Columns:**

 1. **ID**: Unique identifier for each observation.

 2. **Diagnosis**: Target variable (M for malignant and B for benign).

 3. **Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave Points, Symmetry, Fractal Dimension**: Tumor measurement metrics, each reported as a mean, a standard error, and a worst-case value.

*   **Feature Categories**

 1. **Mean Measurements**: Average measurements of tumor characteristics (e.g., radius_mean, texture_mean).

 2. **Standard Error**: Standard error of each measurement (e.g., radius_se, texture_se).

 3. **Worst-Case Measurements**: Mean of the three largest values of each measurement (e.g., radius_worst, texture_worst).

<h3> <u> Main Task </u> </h3>

 - **Objective**: Explore and preprocess the dataset to create models for predicting cancer type (Diagnosis).

## 1.2. Features, label and target. <a id="1.2."></a>

**Categorical Feature:**
*diagnosis*
*   'M': Malignant (indicating a cancerous tumor)
*   'B': Benign (indicating a non-cancerous tumor)


**Numerical Features** (examples based on common datasets):

*   Radius: Mean of distances from center to points on the perimeter.
*   Texture: Standard deviation of gray-scale values.
*   Perimeter: The distance around the tumor.
*   Area: The size of the tumor.
*   Smoothness: Local variation in radius lengths.
*   Compactness: (Perimeter² / Area) - 1.0.
*   Concavity: Severity of concave portions of the contour.
*   Concave points: Number of concave portions of the contour.
*   Symmetry: Measure of how symmetric the tumor is.
*   Fractal dimension: "Roughness" of the tumor boundary.

**Target Variable**: *diagnosis* is the primary label being predicted, while the numerical features serve as input variables (predictors).



















# **2. DATA WRANGLING** : Nguyen Dang Luong

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
csv_path = '/content/drive/My Drive/ITDS - Cancer Diagnosis Prediction/Cancer_Data - smaller.csv'
cancer_data = pd.read_csv(csv_path)
cancer_data

| Unnamed: 0 | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | 891923 | B | 13.77 | 13.27 | 88.06 | 582.7 | 0.09198 | 0.06221 | 0.01063 | 0.01917 |
| 296 | 891936 | B | 10.91 | 12.35 | 69.14 | 363.7 | 0.08518 | 0.04721 | 0.01236 | 0.01369 |
| 297 | 892189 | M | 11.76 | 18.14 | 75.00 | 431.1 | 0.09968 | 0.05914 | 0.02685 | 0.03515 |
| 298 | 892214 | B | 14.26 | 18.17 | 91.22 | 633.1 | 0.06576 | 0.05220 | 0.02475 | 0.01374 |


**Sample**: Each row in the dataset represents a sample, which could correspond to an individual patient's data. For example, row 0 represents a sample with the ID "842302."

**Feature**: These are the columns that provide information about the samples. In our dataset, features include columns like *radius_mean, texture_mean, perimeter_mean,* and so on, which are characteristics of the tumor measurements.

**Label**: This is the outcome or target variable that we are trying to predict or analyze. In this dataset, the *diagnosis* column serves as the label, indicating whether the tumor is malignant ("M") or benign ("B").

**Task**: The task refers to what we want to accomplish with this dataset. In our project, it is a classification task where the goal is to predict whether a tumor is malignant or benign based on the features.

In [None]:
description = cancer_data.describe()
print(description)
# Keep the index so the statistic names (count, mean, std, ...) are written to the file too
description.to_csv('describe_output.csv')


In [None]:
# info() prints its summary directly and returns None, so there is nothing to assign or print
cancer_data.info()


# **3. EXPLORATORY DATA ANALYSIS** : Dinh Van Anh & Le Trieu Quang Minh

## 3.1. Features Visualization.<a id="3.1."></a>

### **First Part Analysis**


In [None]:
warnings.filterwarnings('ignore')  # Suppress warning messages, which can clutter the output
pd.set_option('display.max_columns', None)
# Take columns 1-10 for the first set of plots; copy() avoids modifying a view of cancer_data
df_viz1 = cancer_data.iloc[:, 1:11].copy()

df_viz1['diagnosis'] = cancer_data['diagnosis']

#### Box plot


In [None]:
# -*- coding: utf-8 -*-
"""
Created on Tue Oct 22 21:27:05 2024

@author: ADMIN
"""


fig,axs = plt.subplots(ncols =2 ,nrows =4 ,figsize = (12,9.5),dpi= 100)
axs = axs.flatten()

diagnosis_colors = {'M': '#8ACDD7', 'B': '#FF90BC'}

for i,col in enumerate(df_viz1.drop(columns = 'diagnosis').columns) :
    sns.boxplot(x = col, y='diagnosis', data = df_viz1, ax=axs[i], palette=diagnosis_colors)

    axs[i].set_xlabel(col,fontsize=12)
    axs[i].tick_params(axis= 'x',labelsize = 10)
    axs[i].tick_params(axis= 'y',labelsize = 10)

plt.tight_layout()
plt.show()

###### <u> Comment </u>

**Median (Q2)**: The thick line inside the box shows the median value of the feature for that category (M or B).

**Interquartile Range (IQR)**: The box itself represents the range between the 1st quartile (Q1, 25th percentile) and the 3rd quartile (Q3, 75th percentile).

**Whiskers**: The lines (whiskers) extending from the box reach the most extreme data points that lie within 1.5 times the IQR of the box edges.

**Outliers**: Points that fall outside the whiskers are plotted individually and considered potential outliers.

*This allows you to quickly compare the distribution of each feature for both malignant ('M') and benign ('B') diagnoses in a concise visual form.*
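
The box plot components described above can be reproduced by hand. The following is a minimal sketch, not the plotting library's own code: `box_stats` is a hypothetical helper, and the quartile interpolation (`method="inclusive"`) is an assumption chosen to mirror NumPy's default percentile rule. The sample is the `radius_mean` column of the nine rows visible in the data preview.

```python
from statistics import quantiles


def box_stats(values):
    """Return the quartiles, whisker ends, and outliers a box plot would draw."""
    data = sorted(values)
    # 'inclusive' interpolates linearly between data points
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers stop at the most extreme observations inside the fences
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Anything beyond the fences is drawn as an individual point
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# radius_mean values of the nine rows shown in the data preview
radius_sample = [17.99, 20.57, 19.69, 11.42, 20.29, 13.77, 10.91, 11.76, 14.26]
stats = box_stats(radius_sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

In this small sample no value falls outside the fences, so the whiskers simply end at the minimum and maximum.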




#### Violin plot


In [None]:
# -*- coding: utf-8 -*-
"""
Created on Tue Oct 22 23:34:14 2024

@author: ADMIN
"""

fig,axs = plt.subplots(ncols =2 ,nrows =4 ,figsize = (10,10),dpi= 100)
axs = axs.flatten()

diagnosis_colors = {'M': '#FF8787', 'B': '#BCE29E'}

for i,col in enumerate(df_viz1.drop(columns = 'diagnosis').columns) :
    sns.violinplot(x = col,y='diagnosis', data = df_viz1, ax=axs[i], palette=diagnosis_colors)

    axs[i].set_xlabel(col,fontsize=12)
    axs[i].tick_params(axis= 'x',labelsize = 10)
    axs[i].tick_params(axis= 'y',labelsize = 10)

plt.tight_layout()
plt.show()

###### <u> Comment </u>

**Distribution Shape**: The violin plot shows the full distribution of the data for each category ('M' and 'B'). The width of the plot at any given point represents the density (frequency) of the data. A wider section indicates more data points concentrated around that value, and a narrower section indicates fewer data points.

**Kernel Density Estimation (KDE)**: The smoothed curve around the distribution is created using kernel density estimation, providing a continuous estimation of the data's probability density.

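**Split Plots**: In this figure each diagnosis ('M' and 'B') is drawn as its own symmetrical violin, one above the other, so the two distributions can be compared directly. seaborn can also draw two categories as the two halves of a single violin when a `hue` variable is combined with `split=True`.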

**Box Plot Components Inside**: Violin plots often include basic box plot statistics (like the median and quartiles) within the "violin," giving you a summary of the central tendency and spread.

*It is especially useful for visualizing multimodal distributions (where there are multiple peaks) or asymmetric distributions, which might not be as obvious with box plots alone.*
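
To make the kernel density estimate behind the violin shape concrete, here is a minimal pure-Python sketch. `gaussian_kde` is a hypothetical helper written for illustration, and the bandwidth is a Scott-style rule of thumb chosen as an assumption; seaborn's own bandwidth choice may differ. The sample is the `texture_mean` column of the rows visible in the data preview.

```python
import math
from statistics import stdev


def gaussian_kde(sample, bandwidth=None):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(sample)
    if bandwidth is None:
        # Rule-of-thumb bandwidth (assumption): sample std times n^(-1/5)
        bandwidth = stdev(sample) * n ** (-1 / 5)

    def density(x):
        # Average of one Gaussian bump centred on each observation
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
        return total / (n * bandwidth * math.sqrt(2 * math.pi))

    return density


# texture_mean values of the nine rows shown in the data preview
texture_sample = [10.38, 17.77, 21.25, 20.38, 14.34, 13.27, 12.35, 18.14, 18.17]
density = gaussian_kde(texture_sample)

# Evaluate on a grid from 0.0 to 35.0; the violin's half-width at x is proportional to density(x)
step = 0.1
grid = [step * i for i in range(351)]
area = sum(density(x) for x in grid) * step  # Riemann-sum check of the total area
print(f"area under the estimated density: {area:.3f}")
```

A smaller bandwidth gives a bumpier violin and a larger one a smoother violin, which is why two peaks can appear or vanish depending on that choice.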

#### Histogram plot


In [None]:

# ncols / nrows: number of subplot columns / rows
# figsize=(12, 9.5): figure width (12) and height (9.5) in inches
# dpi: number of dots per inch

fig,axs = plt.subplots(ncols =2 ,nrows =4 ,figsize = (12,9.5),dpi= 100)
axs = axs.flatten()
#axs.flatten(): Converts the 2D array of axes into a 1D array to allow easy iteration in the upcoming loop.
for i,col in enumerate(df_viz1.drop(columns = 'diagnosis').columns) :
    sns.histplot(x = col,data = df_viz1 ,kde=True, ax=axs[i], color='#EA8FEA')
    # x/y = col -> column as Ox/Oy
    # data = uses df_viz1 as the data source
    # kde = True -> adds the KDE plot or not
    # ax=axs[i] = specifies which subplot (axis) to plot on -> position of that subplot

    axs[i].set_xlabel(col,fontsize=13) #create label for Ox
    axs[i].tick_params(axis= 'x',labelsize = 10) # Abscissa
    axs[i].tick_params(axis= 'y',labelsize = 10) # Ordinate

plt.tight_layout() #Automatically adjusts the subplot parameters to prevent overlapping of plots or labels.
plt.show()

###### <u> Comment </u>

**Frequency Distribution**: The histogram displays the frequency of occurrences of different values of the feature (plotted on the x-axis). The y-axis shows the number of observations that fall within each bin (range of values).

**Bins**: Data points are grouped into a number of bins (ranges of values) along the x-axis. The height of each bar represents the count of data points in that bin.

**KDE (Kernel Density Estimate)**: The kde=True parameter adds a smooth curve on top of the histogram. The KDE helps understand the underlying probability distribution of the data by smoothing out the discrete nature of the histogram bars.

*Insight from the plot - additional explanation*

**Distribution Shape**: The histogram shows the overall shape of the distribution of each feature—whether it is normal, skewed, or multimodal (having more than one peak).

**Spread and Central Tendency**: You can observe how the data for each feature is spread out and where most of the data points are concentrated.

**KDE vs. Histogram**: While the histogram shows the exact counts in each bin, the KDE curve gives a smoother representation, making it easier to visualize trends and distribution characteristics.
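
The binning step described above can be sketched in a few lines of plain Python. `histogram_counts` is a hypothetical helper, and the choice of four equal-width bins is an assumption made for illustration; `sns.histplot` picks the number of bins automatically. The sample is the `area_mean` column of the rows visible in the data preview.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin rather than a bin of its own
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, counts


# area_mean values of the nine rows shown in the data preview
area_sample = [1001.0, 1326.0, 1203.0, 386.1, 1297.0, 582.7, 363.7, 431.1, 633.1]
edges, counts = histogram_counts(area_sample, n_bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:8.2f}, {right:8.2f}): {'#' * count} ({count})")
```

Each bar height is just the count of one bin, so changing `n_bins` changes the picture without changing the data.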

#### Scatter plot


##### Scatter Plot of Radius Mean vs Texture Mean


In [None]:
# Map diagnosis labels to binary values for visualization (B -> 0, M -> 1)
# 'B' represents benign tumors, and 'M' represents malignant tumors
cancer_data['Diagnosis_Binary'] = cancer_data['diagnosis'].map({'B': 0, 'M': 1})

# Initialize a scatter plot to visualize the relationship between 'radius_mean' and 'texture_mean'
# These two features are relevant in distinguishing tumor types in cancer diagnosis
plt.figure(figsize=(8, 6))  # Set the plot size to 8x6 inches

# Create a scatter plot with 'radius_mean' on the x-axis and 'texture_mean' on the y-axis
# Use 'diagnosis' as the hue to color-code by tumor type (pink for malignant, green for benign)
# A dict palette makes the label-to-color mapping explicit instead of depending on row order
sns.scatterplot(data=cancer_data, x='radius_mean', y='texture_mean', hue='diagnosis', palette={'M': '#F05A7E', 'B': '#8FD14F'})

# Add title and labels to the plot for clarity
plt.title('Scatter Plot of Radius Mean vs Texture Mean')  # Title of the plot
plt.xlabel('Radius Mean')  # Label for the x-axis
plt.ylabel('Texture Mean')  # Label for the y-axis

# Add legend to indicate color-coding of diagnosis types
plt.legend(title='Diagnosis')  # Set the title for the legend to 'Diagnosis'

# Display the plot
plt.show()


###### <u>Comment</u>


This plot visually explores potential clusters or separations between benign and malignant tumors based on radius_mean and texture_mean, two important features related to tumor shape and cell pattern variation. Malignant tumors (marked in pink) often have larger sizes and irregular textures, so clusters of malignant cases could appear in different regions of the plot compared to benign cases (marked in green).

##### Scatter Plot of Area Mean vs Smoothness Mean


In [None]:
# Create scatter plot for two additional features: 'area_mean' and 'smoothness_mean'
# 'area_mean' represents the average area of tumor cells, indicating overall tumor size
# 'smoothness_mean' measures the smoothness or regularity of cell shapes, with higher values suggesting less regular shapes

plt.figure(figsize=(8, 6))
sns.scatterplot(data=cancer_data, x='area_mean', y='smoothness_mean', marker='P', hue='diagnosis', palette=['#F7B71D', '#3B6AC0'])
plt.title('Scatter Plot of Area Mean vs Smoothness Mean')
plt.xlabel('Area Mean')
plt.ylabel('Smoothness Mean')
plt.legend(title='Diagnosis')
plt.show()


###### <u>Comment</u>


- **General**: Malignant cases (in yellow) tend to occupy regions with a higher area_mean than benign cases (in dark blue). This indicates that malignant tumors generally have larger average cell areas than benign ones, suggesting that larger tumors may be more likely to be cancerous.

- **Smoothness Patterns**: Malignant cases often show slightly higher values for smoothness_mean, implying that malignant cells may have more irregular shapes than benign ones. However, there is some overlap in smoothness between benign and malignant cases, meaning that smoothness alone might not be a definitive predictor.



#### Correlation plot


In [None]:

# Identify the column containing the string value ('M')
string_column = cancer_data.select_dtypes(include=['object']).columns[0]  # Assuming the first object column is the string column

# Create a new DataFrame excluding the string column
df_corr = cancer_data.drop(columns=[string_column])

corr_matrix = df_corr.corr()

plt.figure(figsize=(12, 9))
sns.heatmap(corr_matrix, cmap='summer', annot=True, mask=np.tril(np.ones_like(corr_matrix, dtype=bool)))
plt.xticks(fontsize=10, rotation=90)
plt.yticks(fontsize=10, rotation=0)
plt.show()

###### <u> Comment </u>

**Correlation Between Features**: The heatmap shows how strongly each pair of numerical features in the dataset is correlated. Each cell is annotated with its correlation coefficient, and the color encodes the signed value: with the `summer` colormap, cells shift from green (lowest values) to yellow (highest values).

*   Coefficients close to 1 or -1 indicate a strong linear relationship between two features.
*   Coefficients close to 0 indicate little to no linear relationship.


**Symmetry and Masking**: Since correlation matrices are symmetrical (i.e., the correlation between feature A and feature B is the same as between B and A), the mask hides the lower half of the matrix, focusing only on the unique pairwise correlations.

*Insight from heatmap*

**Feature Relationships**: You can easily identify which features are highly correlated, positively or negatively, which may be useful for dimensionality reduction, such as removing highly correlated features to prevent multicollinearity in machine learning models.

**Visualizing Patterns**: The heatmap allows you to quickly spot clusters of features that have similar correlation patterns, which can be valuable for feature engineering or selecting important variables for predictive models.
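
As a small worked example of what each heatmap cell contains, the sketch below computes the Pearson correlation coefficient by hand. `pearson` is a hypothetical helper, the 0.9 cut-off for "highly correlated" is an illustrative assumption, and the values are the nine rows visible in the data preview, so the coefficients will differ from those of the full dataset.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Columns taken from the nine rows shown in the data preview
features = {
    "radius_mean": [17.99, 20.57, 19.69, 11.42, 20.29, 13.77, 10.91, 11.76, 14.26],
    "perimeter_mean": [122.80, 132.90, 130.00, 77.58, 135.10, 88.06, 69.14, 75.00, 91.22],
    "smoothness_mean": [0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0.09198, 0.08518, 0.09968, 0.06576],
}

threshold = 0.9  # illustrative cut-off, not a value from the notebook
redundant_pairs = []
for a, b in combinations(features, 2):
    r = pearson(features[a], features[b])
    print(f"{a} vs {b}: r = {r:.3f}")
    if abs(r) > threshold:
        redundant_pairs.append((a, b))

print("candidate redundant pairs:", redundant_pairs)
```

Pairs flagged this way, such as radius and perimeter, carry nearly the same information, which is the motivation for dropping one feature of each pair.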

#### Count plot


In [None]:
sns.countplot(x='diagnosis', hue='diagnosis', data = cancer_data, palette='pastel')
cancer_data['diagnosis'].value_counts()

###### <u> Comment </u>

**Frequency of Categories**: The count plot shows how many occurrences there are for each category ('M' for malignant, 'B' for benign). Each bar's height represents the number of occurrences of that diagnosis.

This part of the code prints the actual counts of each category in the diagnosis column.


*   'B': 357 (357 benign cases)
*   'M': 212 (212 malignant cases)


**Insight from countplot**

**Class Imbalance**: The plot helps quickly identify any imbalance in the dataset. If the number of 'B' (benign) diagnoses is significantly higher than 'M' (malignant), it suggests a class imbalance that might need to be addressed in machine learning models.
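
The degree of imbalance can be quantified directly from the counts listed above. This is a minimal sketch using only those two numbers; the "majority baseline" is the accuracy of a hypothetical classifier that always predicts the most frequent class, which is a useful floor when judging model accuracy later.

```python
# Class counts as listed in the comment above
counts = {"B": 357, "M": 212}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Ratio of the largest class to the smallest class
imbalance_ratio = max(counts.values()) / min(counts.values())

# Accuracy obtained by always predicting the majority class
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"{label}: {counts[label]} samples ({share:.1%})")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

With roughly 63% benign and 37% malignant samples the imbalance is moderate, so metrics such as recall and F1-score are worth reporting alongside accuracy.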

### **Second Part Analysis**


### **Third Part Analysis**


## 3.2. Summary.<a id="3.2."></a>

# **4. MODEL DEVELOPMENT**: Nguyen Nam Khanh

The features **concave points_mean, radius_mean, perimeter_mean, area_mean,** and **concavity_mean** appear to be the most important indicators for distinguishing between benign and malignant tumors based on the provided visualizations. These features can be used to build predictive models to aid in cancer diagnosis.

In [None]:

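# 'diagnosis' is listed once; repeating it would create a duplicate column
important_features = ['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'concavity_mean', 'concave points_mean', 'smoothness_mean', 'compactness_mean']

# Work on a copy of the selected columns so cancer_data stays unchanged
data = cancer_data[important_features].copy()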

In [None]:
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0}) #change data into binary value


> model comparison




In [None]:
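# Note: this cell uses X_train, X_test, y_train and y_test, which are created in the train/test split cell further below
from sklearn.linear_model import LogisticRegression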
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

models = [
    LogisticRegression(max_iter=10000),
    SVC(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
    XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    LGBMClassifier()
]

model_names = [
    'Logistic Regression',
    'Support Vector Machine',
    'Random Forest',
    'Gradient Boosting',
    'XGBoost',
    'LightGBM'
]

for model, name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    print(f"{name}:")
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall: {recall:.4f}")
    print(f"  F1-score: {f1:.4f}")
    print("-" * 20)

In [None]:

# Hard-coded F1-scores, listed in the same order as model_names
f1_scores = [0.95, 0.92, 0.96, 0.93, 0.97, 0.94]

# Create the bar chart
model_names = ['Logistic Regression', 'SVM', 'Random Forest', 'Gradient Boosting', 'XGBoost', 'LightGBM']
x_pos = np.arange(len(model_names))

plt.bar(x_pos, f1_scores, align='center', alpha=0.7, color=['blue', 'green', 'red', 'cyan', 'magenta', 'yellow'])
plt.xticks(x_pos, model_names, rotation = 90, ha='right')  # Rotate the x-axis labels if needed
plt.ylabel('F1-score')
plt.title('Compare the performance of models')

# Show the F1-score value above each bar
for i, v in enumerate(f1_scores):
    plt.text(i, v, f'{v:.2f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

In [None]:
import matplotlib.pyplot as plt
import numpy as np

results = {
    'Logistic Regression': {'accuracy': 0.92, 'precision': 0.90, 'recall': 0.95, 'f1': 0.92},
    'SVM': {'accuracy': 0.95, 'precision': 0.93, 'recall': 0.96, 'f1': 0.94},
    'Random Forest': {'accuracy': 0.96, 'precision': 0.94, 'recall': 0.97, 'f1': 0.95},
}

# Build the lists of models and metrics
models = list(results.keys())
metrics = ['accuracy', 'precision', 'recall', 'f1']

# Build the data for the chart
data = [[results[model][metric] for metric in metrics] for model in models]

# Create the grouped bar chart
x = np.arange(len(models))  # Positions of the groups on the x-axis
width = 0.15  # Width of each bar

fig, ax = plt.subplots()
rects1 = ax.bar(x - width*1.5, [d[0] for d in data], width, label='Accuracy')
rects2 = ax.bar(x - width*0.5, [d[1] for d in data], width, label='Precision')
rects3 = ax.bar(x + width*0.5, [d[2] for d in data], width, label='Recall')
rects4 = ax.bar(x + width*1.5, [d[3] for d in data], width, label='F1-score')

# Add labels, title, and legend
ax.set_ylabel('Scores')
ax.set_title('Model performance')
ax.set_xticks(x)
ax.set_xticklabels(models, rotation=45, ha='right')  # Rotate the x-axis labels if needed
ax.legend()

# Show the value above each bar (optional)
def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{:.2f}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
autolabel(rects3)
autolabel(rects4)

fig.tight_layout()
plt.show()

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Assume the model results have been stored in a dictionary like this:
results = {
    'Logistic Regression': {'accuracy': 0.92, 'precision': 0.90, 'recall': 0.95, 'f1': 0.92},
    'SVM': {'accuracy': 0.95, 'precision': 0.93, 'recall': 0.96, 'f1': 0.94},
    'Random Forest': {'accuracy': 0.96, 'precision': 0.94, 'recall': 0.97, 'f1': 0.95},
}

# Build the lists of models and metrics
models = list(results.keys())
metrics = ['accuracy', 'precision', 'recall', 'f1']

# Build the data for the line chart
x = np.arange(len(models))  # Positions of the points on the x-axis

# Draw one line per metric
fig, ax = plt.subplots()
for metric in metrics:
    y = [results[model][metric] for model in models]
    ax.plot(x, y, label=metric, marker='o')  # Add markers for readability

# Add labels, title, and legend
ax.set_xlabel('Models')
ax.set_ylabel('Scores')
ax.set_title('Model performance')
ax.set_xticks(x)
ax.set_xticklabels(models, rotation=45, ha='right')
ax.legend()

fig.tight_layout()
plt.show()



> Making model: using XGBclassifier






In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler  # Or MinMaxScaler if needed
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
data = pd.read_csv('Cancer_Data - smaller.csv')

X = data[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'concavity_mean', 'concave points_mean', 'smoothness_mean', 'compactness_mean']]
# Encode the target, which was previously undefined: malignant (M) -> 1, benign (B) -> 0
y = data['diagnosis'].map({'M': 1, 'B': 0})

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (if needed); the scaler is fitted on the training set only
scaler = StandardScaler()  # Or MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Initialize the XGBoost model
model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)  # Adjust the parameters if needed

# Train the model
model.fit(X_train, y_train)



> finding hyperparameter



In [None]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Define the XGBoost model
model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)

# Define the parameter grid to search
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
}

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=5, n_jobs=-1)  # cv=5: use 5-fold cross-validation

# Fit and search for the best hyperparameters
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best parameters found: ", grid_search.best_params_)

# Use the best model to predict
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with best parameters: {accuracy:.4f}")
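
To get a feel for the cost of the grid search above, the candidate combinations can be enumerated with the standard library. This is a minimal sketch that only counts model fits; it does not train anything, and it assumes the same `param_grid` values and `cv=5` as the cell above.

```python
from itertools import product

# Same grid as in the GridSearchCV cell above
param_grid = {
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 4, 5],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_fits = len(combinations) * cv_folds

print(f"{len(combinations)} candidate settings x {cv_folds} folds = {n_fits} fits")
print("first candidate:", combinations[0])
print("last candidate: ", combinations[-1])
```

Because the count grows multiplicatively, adding a fourth hyperparameter with three values would triple the work, which is when a randomized search becomes attractive.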



> Accuracy, Precision, Recall and F1-score



In [None]:

# Take the best model from GridSearchCV (or RandomizedSearchCV)
best_model = grid_search.best_estimator_

# Predict on the test set
y_pred = best_model.predict(X_test)

# Compute the evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
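
The four metrics printed above all derive from the confusion-matrix counts. The sketch below computes them by hand for the positive class (malignant, encoded as 1). `binary_metrics` is a hypothetical helper, and the label vectors are made-up illustrative values, not results from this notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for ten tumors: 1 = malignant, 0 = benign
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

demo_metrics = binary_metrics(y_true_demo, y_pred_demo)
for name, value in demo_metrics.items():
    print(f"{name}: {value:.4f}")
```

In a diagnostic setting a false negative (a missed malignant tumor) is usually the costlier error, so recall deserves particular attention when comparing models.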



> saving model



In [None]:
import joblib


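# Save the fitted best estimator; `model` itself is left unfitted by GridSearchCV, which fits clones
joblib.dump(best_model, 'xgboost_model.pkl')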

# loaded_model = joblib.load('xgboost_model.pkl')

# **5. MODEL TESTING**: Duong Huy Phuc


**ROC, AUC and F1-score comparison**

In [None]:
from sklearn.metrics import roc_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.ensemble import AdaBoostClassifier

In [None]:
models = [LogisticRegression(random_state=42),KNeighborsClassifier(),
          SVC(probability=True, random_state=42),GaussianNB(),
          DecisionTreeClassifier(random_state=42),RandomForestClassifier(random_state=42),xgb.XGBClassifier(),AdaBoostClassifier()]
model_names = ['LogisticRegression','KNN','SVM','NaiveBayes','DecisionTree','RandomForest','XGBoost','AdaBoostClassifier']
auc_scores = []

In [None]:
for model,name in zip(models,model_names):
    model.fit(X_train,y_train)
    y_pred_prob = model.predict_proba(X_test)[:, 1]
    fpr,tpr,thresholds = roc_curve(y_test,y_pred_prob)
    auc_score = auc(fpr,tpr)
    auc_scores.append(auc_score)
    plt.plot(fpr, tpr, label='%s (AUC = %0.2f)' % (name, auc_score))

plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()
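
The AUC values shown in the legend have a simple interpretation: the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case. The sketch below computes AUC through that pairwise definition rather than by integrating the curve as `roc_curve` and `auc` do above; `auc_from_scores` is a hypothetical helper and the labels and scores are made-up illustrative values.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct ranking.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for four tumors (1 = malignant)
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

demo_auc = auc_from_scores(y_true_demo, y_score_demo)
print(f"AUC = {demo_auc:.2f}")
```

An AUC of 0.5 corresponds to the dashed "Random classifier" diagonal in the plot, and 1.0 to a model that ranks every malignant case above every benign one.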

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score
f1_scores = []
recall = []
precision = []

for model, name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    r = recall_score(y_test,y_pred)
    p = precision_score(y_test,y_pred)
    f1_scores.append(f1)
    recall.append(r)
    precision.append(p)
    print('%s: F1-score = %0.3f' % (name, f1))
    print('%s: Precision = %0.3f' % (name, p))
    print('%s: Recall = %0.3f' % (name, r))
    print('\n')




average_f1_score = sum(f1_scores) / len(f1_scores)
print('Average F1-score:', average_f1_score)

**Making model**

In [None]:
from sklearn.model_selection import GridSearchCV

SVM = SVC()

C = [0.1,1,10,100]
kernel = ['linear', 'poly', 'rbf', 'sigmoid']

params = {'C':C,'kernel':kernel}

SVM_grid = GridSearchCV(estimator = SVM,param_grid = params,refit= True,verbose = 0,n_jobs=-1)
SVM_grid.fit(X_train,y_train.ravel())
print(f"best parameters : {SVM_grid.best_params_}")
print(f"best score : {SVM_grid.best_score_}")

In [None]:
LogReg = LogisticRegression(max_iter=10000)

penalty = ['l1','l2','elasticnet']
solver = [ 'newton-cg', 'lbfgs', 'liblinear', 'sag','saga']
C = [0.001, 0.01, 0.1, 1, 10, 100]

params2 = {'C' : C ,'penalty' : penalty , 'solver' : solver }


LogReg_grid = GridSearchCV(estimator = LogReg,param_grid = params2 ,refit= True,verbose = 0,n_jobs=-1)
LogReg_grid.fit(X_train,y_train.ravel())
print(f"best parameters : {LogReg_grid.best_params_}")
print(f"best score : {LogReg_grid.best_score_}")

In [None]:
!pip install xgboost
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Assume X_train and y_train have already been defined and preprocessed
# X_train, y_train = ...

# Define the AdaBoost model
AdaBoost = AdaBoostClassifier()

# Define the parameter grid
estimator = [
    DecisionTreeClassifier(),
    SVC(probability=True),
    LogisticRegression(max_iter=1000),  # Increase the number of iterations if needed
    GaussianNB(),
    KNeighborsClassifier()
]
# range(100, 200, 300) and np.arange(0.1, 0.5, 1.0) each yield a single value,
# so the candidate values are listed explicitly
n_estimators = [100, 200, 300]
learning_rate = [0.1, 0.5, 1.0]
algorithm = ['SAMME', 'SAMME.R']

params3 = {
    'estimator': estimator,  # 'estimator' replaces the older 'base_estimator' name
    'n_estimators': n_estimators,
    'learning_rate': learning_rate,
    'algorithm': algorithm
}

# Run the grid search
AdaBoostGrid = GridSearchCV(estimator=AdaBoost, param_grid=params3, refit=True, verbose=3, n_jobs=-1)
AdaBoostGrid.fit(X_train, y_train.ravel())  # Assumes X_train and y_train have been defined

# Print the best parameters and the best score
print(f"Best parameters: {AdaBoostGrid.best_params_}")
print(f"Best score: {AdaBoostGrid.best_score_}")

In [None]:
!pip install yellowbrick
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report
from yellowbrick.classifier import ConfusionMatrix
import matplotlib.pyplot as plt  # Import matplotlib for plotting

LogReg = LogisticRegression(solver='liblinear', C=0.1, penalty='l2')
LogReg.fit(X_train, y_train.ravel())
y_pred = LogReg.predict(X_test)

print(f"accuracy  : {accuracy_score(y_test, y_pred)} \n")
print(f"f1 score  : {f1_score(y_test, y_pred)} \n")
print(f"classification report :\n {classification_report(y_test, y_pred)} \n")
print('The Confusion Matrix : \n')

# Create and visualize the confusion matrix using yellowbrick
cm = ConfusionMatrix(LogReg)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
cm.show()  # Display the confusion matrix plot


**final model**