# Machine Learning Workflow

#### Step 1: Loading and Understanding Data
#### Step 2: Preprocessing Data
#### Step 3: Training Model
#### Step 4: Evaluate Model

#### Common Machine Learning Models:
- Decision Tree
- Random Forest
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)
- Gradient Boosting
- Neural Networks (Multi-layer Perceptron)
- Naive Bayes

# Example: Breast Cancer Dataset

## Model: Decision Tree
#### Step 1: Loading and Understanding Data

In [2]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

breast_cancer = load_breast_cancer()
data = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
data['target'] = breast_cancer.target

In [3]:
# Display first few rows
print(data.head())

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0             

In [4]:
# Basic Statistics
print(data.describe())

       mean radius  mean texture  mean perimeter    mean area  \
count   569.000000    569.000000      569.000000   569.000000   
mean     14.127292     19.289649       91.969033   654.889104   
std       3.524049      4.301036       24.298981   351.914129   
min       6.981000      9.710000       43.790000   143.500000   
25%      11.700000     16.170000       75.170000   420.300000   
50%      13.370000     18.840000       86.240000   551.100000   
75%      15.780000     21.800000      104.100000   782.700000   
max      28.110000     39.280000      188.500000  2501.000000   

       mean smoothness  mean compactness  mean concavity  mean concave points  \
count       569.000000        569.000000      569.000000           569.000000   
mean          0.096360          0.104341        0.088799             0.048919   
std           0.014064          0.052813        0.079720             0.038803   
min           0.052630          0.019380        0.000000             0.000000   
25%      

In [6]:
# Check for missing values
print(data.isnull().sum())

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64


#### Step 2: Preprocessing Data

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define features and target variable
X = data.drop('target', axis=1)
y = data['target']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

#### Step 3: Training Model

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Initialize Model
model = DecisionTreeClassifier()

# Train Model
model.fit(X_train, y_train)

#### Steo 4: Model Evaluation

In [9]:
# Predict on test set
y_pred = model.predict(X_test)

# Evaluation Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Decision Tree Accuracy: {accuracy}')

Decision Tree Accuracy: 0.9385964912280702


## Model: Random Forest

In [10]:
from sklearn.ensemble import RandomForestClassifier

# Intitialize Model
model = RandomForestClassifier()

# Train Model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Random Forest Accuracy: {accuracy}')

Random Forest Accuracy: 0.956140350877193


## Model: Support Vector Machine (SVM)

In [11]:
from sklearn.svm import SVC

# Initialize Model
model = SVC()

# Train Model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f'SVM Accuracy: {accuracy}')

SVM Accuracy: 0.9912280701754386


## Model: K-Nearest Neighbors (KNN)

In [12]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize Model 
model = KNeighborsClassifier()

# Train Model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f'KNN Accuracy: {accuracy}')

KNN Accuracy: 0.9824561403508771


## Model: Gradient Boosting

In [13]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize Model
model = GradientBoostingClassifier()

# Train Model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Gradient Boosting Accuracy: {accuracy}')

Gradient Boosting Accuracy: 0.956140350877193


## Model: Neural Networks

In [14]:
from sklearn.neural_network import MLPClassifier

# Initialize Model
model = MLPClassifier()

# Train Model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Neural Network Accuracy: {accuracy}')

Neural Network Accuracy: 0.9736842105263158




## Model: Naive Bayes

In [15]:
from sklearn.naive_bayes import GaussianNB

# Initialize Model
model = GaussianNB()

# Train Model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Naive Bayes Accuracy: {accuracy}')

Naive Bayes Accuracy: 0.9649122807017544


# Results

### The best performing model is SVM with an accuracy score of 99.12%