**Types of Classification:**

1. Binary Classification
- Predicts between two classes/outcomes
- Examples: 
  - Spam vs. Non-spam email
  - Pass/Fail
  - Fraudulent vs. Legitimate transactions

2. Multi-class Classification
- Predicts one class from three or more possible classes
- Examples:
  - Digit recognition (0-9)
  - Animal species classification
  - Language detection

3. Multi-label Classification
- Can assign multiple labels to each instance
- Examples:
  - Image tagging (photo can be both "sunset" and "beach")
  - Movie genre classification
  - News article topics
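In practice, the three task types differ mainly in the shape of the target. The sketch below is a minimal illustration, assuming scikit-learn's `MultiLabelBinarizer`; all label values and the names `y_binary`, `y_multiclass` and `photo_tags` are made up for the example and do not come from the dataset used later.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Binary: one target column with two possible values (illustrative labels)
y_binary = np.array([0, 1, 1, 0])        # e.g. 0 = legitimate, 1 = fraudulent

# Multi-class: one target column with three or more mutually exclusive values
y_multiclass = np.array([3, 7, 0, 9])    # e.g. handwritten digits

# Multi-label: each instance carries a set of labels,
# usually encoded as a 0/1 indicator matrix with one column per label
photo_tags = [{"sunset", "beach"}, {"beach"}, {"city"}]
mlb = MultiLabelBinarizer()
y_multilabel = mlb.fit_transform(photo_tags)

print(mlb.classes_)    # column order of the indicator matrix
print(y_multilabel)    # one row per photo, one column per tag
```

A row of the indicator matrix can contain several ones, which is what separates multi-label from multi-class targets.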

**Types by learning approach:**

4. Supervised Classification
- Uses labeled training data
- Common algorithms:
  - Logistic Regression
  - Decision Trees
  - Random Forests
  - Support Vector Machines (SVM)
  - K-Nearest Neighbors (KNN)

5. Semi-supervised Classification
- Uses both labeled and unlabeled data
- Useful when labeled data is scarce/expensive
- Examples:
  - Text classification with partially labeled documents
  - Medical image classification

6. Unsupervised Classification (Clustering)
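- Finds natural groupings in unlabeled data; because no predefined classes are predicted, this is usually called clustering rather than classification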
- Algorithms:
  - K-means
  - Hierarchical clustering
  - DBSCAN
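As a rough illustration of grouping unlabeled data, here is a small K-means sketch on synthetic points. The two blobs, the random seed and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic, well-separated blobs; no labels are passed to the algorithm
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

# Cluster ids are arbitrary (0 and 1 may be swapped); only the grouping matters
print(np.bincount(cluster_ids))
```

Unlike a classifier, the model never sees a target column, so the cluster ids carry no meaning beyond "these points belong together".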
 

y -> Label variable, Dependent variable, Response variable

In [20]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

In [2]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
breast_cancer_wisconsin_original = fetch_ucirepo(id=15)

# data (as pandas dataframes)
X = breast_cancer_wisconsin_original.data.features
y = breast_cancer_wisconsin_original.data.targets

# metadata
print(breast_cancer_wisconsin_original.metadata)

# variable information
print(breast_cancer_wisconsin_original.variables)



{'uci_id': 15, 'name': 'Breast Cancer Wisconsin (Original)', 'repository_url': 'https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original', 'data_url': 'https://archive.ics.uci.edu/static/public/15/data.csv', 'abstract': 'Original Wisconsin Breast Cancer Database', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 699, 'num_features': 9, 'feature_types': ['Integer'], 'demographics': [], 'target_col': ['Class'], 'index_col': ['Sample_code_number'], 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1990, 'last_updated': 'Sun Mar 10 2024', 'dataset_doi': '10.24432/C5HP4Z', 'creators': ['WIlliam Wolberg'], 'intro_paper': None, 'additional_info': {'summary': "Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. This grouping information appears immediately below, having been removed fro

In [3]:
breast_cancer_df = pd.concat([X,y], axis=1)
breast_cancer_df.head()

,Clump_thickness,Uniformity_of_cell_size,Uniformity_of_cell_shape,Marginal_adhesion,Single_epithelial_cell_size,Bare_nuclei,Bland_chromatin,Normal_nucleoli,Mitoses,Class
0,5,1,1,1,2,1.0,3,1,1,2
1,5,4,4,5,7,10.0,3,2,1,2
2,3,1,1,1,2,2.0,3,1,1,2
3,6,8,8,1,3,4.0,3,7,1,2
4,4,1,1,3,2,1.0,3,1,1,2


In [4]:
breast_cancer_df.isnull().sum()

Clump_thickness                 0
Uniformity_of_cell_size         0
Uniformity_of_cell_shape        0
Marginal_adhesion               0
Single_epithelial_cell_size     0
Bare_nuclei                    16
Bland_chromatin                 0
Normal_nucleoli                 0
Mitoses                         0
Class                           0
dtype: int64

In [5]:
breast_cancer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Clump_thickness              699 non-null    int64  
 1   Uniformity_of_cell_size      699 non-null    int64  
 2   Uniformity_of_cell_shape     699 non-null    int64  
 3   Marginal_adhesion            699 non-null    int64  
 4   Single_epithelial_cell_size  699 non-null    int64  
 5   Bare_nuclei                  683 non-null    float64
 6   Bland_chromatin              699 non-null    int64  
 7   Normal_nucleoli              699 non-null    int64  
 8   Mitoses                      699 non-null    int64  
 9   Class                        699 non-null    int64  
dtypes: float64(1), int64(9)
memory usage: 54.7 KB


In [6]:
# Bare_nuclei is the only column with missing values (16 NaNs); drop it for simplicity
breast_cancer_df = breast_cancer_df.drop(columns='Bare_nuclei')
breast_cancer_df.columns

Index(['Clump_thickness', 'Uniformity_of_cell_size',
       'Uniformity_of_cell_shape', 'Marginal_adhesion',
       'Single_epithelial_cell_size', 'Bland_chromatin', 'Normal_nucleoli',
       'Mitoses', 'Class'],
      dtype='object')
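Dropping `Bare_nuclei` discards a whole feature because of 16 missing values. One possible alternative, sketched here with the `Pipeline` and `SimpleImputer` classes imported at the top, is to fill the gaps and keep the column. The tiny `X_demo`/`y_demo` frame is synthetic and only stands in for the real data, and the median strategy is an assumption, not the notebook's choice.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for the real features (values are made up)
X_demo = pd.DataFrame({
    "Clump_thickness": [5, 3, 8, 1, 10, 2, 9, 4],
    "Bare_nuclei": [1.0, np.nan, 10.0, 1.0, 8.0, np.nan, 10.0, 2.0],
})
y_demo = pd.Series([0, 0, 1, 0, 1, 0, 1, 0])

# Step 1 fills NaNs with each column's median; step 2 fits the classifier
impute_then_classify = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("classifier", LogisticRegression()),
])
impute_then_classify.fit(X_demo, y_demo)
demo_pred = impute_then_classify.predict(X_demo)
print(demo_pred)
```

Because the imputer sits inside the pipeline, its medians are learned only from whatever data is passed to `fit`, so fitting on a training split keeps test rows out of the imputation statistics.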

In [7]:
breast_cancer_df['Class'].value_counts()

2    458
4    241
Name: Class, dtype: int64

In [8]:
# Recode the target: 2 (benign) -> 0, 4 (malignant) -> 1
breast_cancer_df['Class'] = breast_cancer_df['Class'].replace({2: 0, 4: 1})

In [9]:
X = breast_cancer_df.drop('Class', axis=1)
y = breast_cancer_df['Class']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=24)
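The classes are imbalanced (458 vs. 241), so a purely random split can shift the class proportions slightly between train and test. A hedged sketch of a stratified split follows; the `stratify` argument is a real `train_test_split` parameter, while `X_demo` and `y_demo` are synthetic placeholders that only reproduce the class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the notebook (458 vs. 241)
y_demo = np.array([0] * 458 + [1] * 241)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=24, stratify=y_demo
)

# Both subsets keep roughly the original share of class 1
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```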

In [11]:
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)

In [12]:
logistic_regression.coef_

array([[0.4686736 , 0.19424696, 0.61144381, 0.19875579, 0.05170231,
        0.54513631, 0.12542504, 0.43632838]])
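Raw logistic regression coefficients are on the log-odds scale, which is hard to read directly. A common way to interpret them, sketched below, is to exponentiate each one into an odds ratio. The values are copied from the `coef_` output above, and the pairing with feature names assumes the coefficients follow the column order of `X`.

```python
import math

# Feature order after dropping Bare_nuclei and Class (assumed to match coef_)
feature_names = [
    "Clump_thickness", "Uniformity_of_cell_size", "Uniformity_of_cell_shape",
    "Marginal_adhesion", "Single_epithelial_cell_size", "Bland_chromatin",
    "Normal_nucleoli", "Mitoses",
]
# Coefficients copied from the coef_ output above
coefficients = [0.4686736, 0.19424696, 0.61144381, 0.19875579,
                0.05170231, 0.54513631, 0.12542504, 0.43632838]

# exp(coefficient) = multiplicative change in the odds of class 1
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = {name: math.exp(c) for name, c in zip(feature_names, coefficients)}

for name, ratio in sorted(odds_ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<28} {ratio:.3f}")
```

All eight coefficients are positive, so every odds ratio is above 1: higher values of any feature push the prediction toward class 1.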

In [24]:
# Rows are the true classes, columns are the predicted classes
print(pd.crosstab(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

col_0    0   1
Class         
0      135   3
1        4  68
[[135   3]
 [  4  68]]


In [25]:
accuracy_score(y_test, y_pred)

0.9666666666666667
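The accuracy can be checked by hand from the confusion matrix above, and the same four counts give precision and recall for class 1 (malignant after recoding). This is a plain-arithmetic sketch; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above
# (rows = true class, columns = predicted class)
tn, fp = 135, 3   # true class 0: predicted 0, predicted 1
fn, tp = 4, 68    # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted positives that are correct
recall = tp / (tp + fn)               # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Here accuracy is 203 correct predictions out of 210, which matches the `accuracy_score` result. In a medical setting recall is often watched closely, since the 4 false negatives are malignant cases predicted as benign.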
