**Types of Classification:**

1. Binary Classification
- Predicts between two classes/outcomes
- Examples: 
  - Spam vs. Non-spam email
  - Pass/Fail
  - Fraudulent vs. Legitimate transactions

2. Multi-class Classification
- Predicts one class from three or more possible classes
- Examples:
  - Digit recognition (0-9)
  - Animal species classification
  - Language detection

3. Multi-label Classification
- Can assign multiple labels to each instance
- Examples:
  - Image tagging (photo can be both "sunset" and "beach")
  - Movie genre classification
  - News article topics
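
The three problem types above differ mainly in the shape of the target. A minimal sketch with made-up labels (the toy data and variable names are illustrative assumptions, not part of this notebook), using scikit-learn's `MultiLabelBinarizer` for the multi-label case:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Binary: one label per sample, two possible values
y_binary = np.array([0, 1, 1, 0])

# Multi-class: one label per sample, three or more possible values
y_multiclass = np.array([3, 7, 0, 9])

# Multi-label: a set of labels per sample, encoded as a 0/1 indicator matrix
photo_tags = [{"sunset", "beach"}, {"beach"}, {"sunset"}, set()]
mlb = MultiLabelBinarizer()
y_multilabel = mlb.fit_transform(photo_tags)

print(mlb.classes_)        # column order of the indicator matrix
print(y_multilabel.shape)  # (n_samples, n_labels)
```

Each row of `y_multilabel` can contain several ones (or none), whereas the binary and multi-class targets hold exactly one value per sample.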

Based on learning approach:

4. Supervised Classification
- Uses labeled training data
- Common algorithms:
  - Logistic Regression
  - Decision Trees
  - Random Forests
  - Support Vector Machines (SVM)
  - K-Nearest Neighbors (KNN)

5. Semi-supervised Classification
- Uses both labeled and unlabeled data
- Useful when labeled data is scarce/expensive
- Examples:
  - Text classification with partially labeled documents
  - Medical image classification

6. Unsupervised Classification (Clustering)
- Finds natural groupings in unlabeled data
- Algorithms:
  - K-means
  - Hierarchical clustering
  - DBSCAN
 

y -> Label variable, Dependent variable, Response variable

In [1]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

In [2]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
breast_cancer_wisconsin_original = fetch_ucirepo(id=15) 
  
# data (as pandas dataframes) 
X = breast_cancer_wisconsin_original.data.features 
y = breast_cancer_wisconsin_original.data.targets 
  
# metadata 
print(breast_cancer_wisconsin_original.metadata) 
  
# variable information 
print(breast_cancer_wisconsin_original.variables) 



{'uci_id': 15, 'name': 'Breast Cancer Wisconsin (Original)', 'repository_url': 'https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original', 'data_url': 'https://archive.ics.uci.edu/static/public/15/data.csv', 'abstract': 'Original Wisconsin Breast Cancer Database', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 699, 'num_features': 9, 'feature_types': ['Integer'], 'demographics': [], 'target_col': ['Class'], 'index_col': ['Sample_code_number'], 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1990, 'last_updated': 'Sun Mar 10 2024', 'dataset_doi': '10.24432/C5HP4Z', 'creators': ['WIlliam Wolberg'], 'intro_paper': None, 'additional_info': {'summary': "Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. This grouping information appears immediately below, having been removed fro

In [3]:
breast_cancer_df = pd.concat([X,y], axis=1)
breast_cancer_df.head()

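| | Clump_thickness | Uniformity_of_cell_size | Uniformity_of_cell_shape | Marginal_adhesion | Single_epithelial_cell_size | Bare_nuclei | Bland_chromatin | Normal_nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |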


In [4]:
breast_cancer_df.isnull().sum()

Clump_thickness                 0
Uniformity_of_cell_size         0
Uniformity_of_cell_shape        0
Marginal_adhesion               0
Single_epithelial_cell_size     0
Bare_nuclei                    16
Bland_chromatin                 0
Normal_nucleoli                 0
Mitoses                         0
Class                           0
dtype: int64

In [5]:
breast_cancer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Clump_thickness              699 non-null    int64  
 1   Uniformity_of_cell_size      699 non-null    int64  
 2   Uniformity_of_cell_shape     699 non-null    int64  
 3   Marginal_adhesion            699 non-null    int64  
 4   Single_epithelial_cell_size  699 non-null    int64  
 5   Bare_nuclei                  683 non-null    float64
 6   Bland_chromatin              699 non-null    int64  
 7   Normal_nucleoli              699 non-null    int64  
 8   Mitoses                      699 non-null    int64  
 9   Class                        699 non-null    int64  
dtypes: float64(1), int64(9)
memory usage: 54.7 KB


In [6]:
breast_cancer_df.drop('Bare_nuclei', axis=1, inplace=True)
breast_cancer_df.columns

Index(['Clump_thickness', 'Uniformity_of_cell_size',
       'Uniformity_of_cell_shape', 'Marginal_adhesion',
       'Single_epithelial_cell_size', 'Bland_chromatin', 'Normal_nucleoli',
       'Mitoses', 'Class'],
      dtype='object')
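
Dropping `Bare_nuclei` discards a whole feature because of 16 missing values. An alternative is to impute the gaps inside a `Pipeline`. The sketch below is not what this notebook does; it runs on a small synthetic frame (the column names `feature_a` and `feature_b` and the median strategy are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data: two integer-valued features in 1..10
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "feature_a": rng.integers(1, 11, size=100).astype(float),
    "feature_b": rng.integers(1, 11, size=100).astype(float),
})
y_toy = (X_toy["feature_a"] + X_toy["feature_b"] > 10).astype(int)

# Inject missing values into every 10th row of one feature
X_toy.loc[::10, "feature_b"] = np.nan

# Fill gaps with the column median, then fit the classifier
impute_then_classify = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("classifier", LogisticRegression()),
])
impute_then_classify.fit(X_toy, y_toy)
train_accuracy = impute_then_classify.score(X_toy, y_toy)
print(train_accuracy)
```

Because the imputer lives inside the pipeline, the median is learned from the training data only and reused unchanged when predicting on new data.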

In [7]:
breast_cancer_df['Class'].value_counts()

2    458
4    241
Name: Class, dtype: int64

In [8]:
# Recode the target: 2 (benign) -> 0, 4 (malignant) -> 1
breast_cancer_df['Class'] = breast_cancer_df['Class'].replace({2: 0, 4: 1})

In [9]:
X = breast_cancer_df.drop('Class', axis=1)
y = breast_cancer_df['Class']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=24)

In [11]:
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)

In [12]:
logistic_regression.coef_

array([[0.4686736 , 0.19424696, 0.61144381, 0.19875579, 0.05170231,
        0.54513631, 0.12542504, 0.43632838]])
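
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. A small sketch that reuses the coefficient values printed above, paired with the feature columns in order (the variable names are illustrative):

```python
import numpy as np

# Feature order matches the columns of X after dropping 'Bare_nuclei'
feature_names = [
    "Clump_thickness", "Uniformity_of_cell_size", "Uniformity_of_cell_shape",
    "Marginal_adhesion", "Single_epithelial_cell_size", "Bland_chromatin",
    "Normal_nucleoli", "Mitoses",
]
# Values copied from logistic_regression.coef_ above
coefficients = np.array([0.4686736, 0.19424696, 0.61144381, 0.19875579,
                         0.05170231, 0.54513631, 0.12542504, 0.43632838])

# exp(coef) = multiplicative change in the odds of class 1 per one-unit increase
odds_ratios = np.exp(coefficients)

for name, ratio in sorted(zip(feature_names, odds_ratios), key=lambda pair: -pair[1]):
    print(f"{name:30s} {ratio:.2f}")
```

All coefficients are positive, so a higher score on any feature raises the predicted odds of class 1 (malignant), with the other features held fixed.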

In [13]:
print(pd.crosstab(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

col_0    0   1
Class         
0      135   3
1        4  68
[[135   3]
 [  4  68]]
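
scikit-learn's confusion matrix puts true classes in rows and predicted classes in columns, so for a binary problem it reads `[[TN, FP], [FN, TP]]`. A short sketch that derives accuracy, precision and recall from the matrix printed above, treating class 1 (malignant) as the positive class:

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[135, 3],
               [4, 68]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)       # of predicted malignant, how many really are
recall = tp / (tp + fn)          # of truly malignant, how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

In a medical setting the 4 false negatives (malignant cases predicted as benign) are usually the costlier error, which is why recall is worth tracking alongside accuracy.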


In [14]:
accuracy_score(y_test, y_pred)

0.9666666666666667

In [15]:
y_pred_zero = np.zeros(len(y_test))
accuracy_score(y_test, y_pred_zero)

0.6571428571428571

### Majority/Naive Rule
Any ML model should score higher than a baseline model.
Here the baseline (naive) model is one that always predicts the majority class.

`baseline model score = 0.6571428571428571`

and 

`ml model score = 0.9666666666666667`


The model clearly beats the baseline, so it is a useful model.
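
The majority rule can also be expressed with scikit-learn's `DummyClassifier`, which avoids hard-coding which class is the majority. A minimal sketch on toy labels with the same class balance as the test split above (138 of class 0, 72 of class 1); the arrays are stand-ins, and in a real workflow the baseline would be fitted on `X_train, y_train` and scored on the test split:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels mirroring the test split: 138 benign (0), 72 malignant (1)
y_toy = np.array([0] * 138 + [1] * 72)
# The dummy model ignores the features, so a placeholder column is enough
X_placeholder = np.zeros((len(y_toy), 1))

# 'most_frequent' always predicts the majority class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_toy)

baseline_score = accuracy_score(y_toy, baseline.predict(X_placeholder))
print(baseline_score)
```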

---

### Performing the same operation on the Diagnostic dataset

In [16]:
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17) 
  
# data (as pandas dataframes) 
X = breast_cancer_wisconsin_diagnostic.data.features 
y = breast_cancer_wisconsin_diagnostic.data.targets 
  
# metadata 
print(breast_cancer_wisconsin_diagnostic.metadata) 
  
# variable information 
print(breast_cancer_wisconsin_diagnostic.variables) 


{'uci_id': 17, 'name': 'Breast Cancer Wisconsin (Diagnostic)', 'repository_url': 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic', 'data_url': 'https://archive.ics.uci.edu/static/public/17/data.csv', 'abstract': 'Diagnostic Wisconsin Breast Cancer Database.', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 569, 'num_features': 30, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['Diagnosis'], 'index_col': ['ID'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1993, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C5DW2B', 'creators': ['William Wolberg', 'Olvi Mangasarian', 'Nick Street', 'W. Street'], 'intro_paper': {'ID': 230, 'type': 'NATIVE', 'title': 'Nuclear feature extraction for breast tumor diagnosis', 'authors': 'W. Street, W. Wolberg, O. Mangasarian', 'venue': 'Electronic imaging', 'year': 1993, 'journal': None, 'DOI': '1

In [17]:
breast_cancer_diagnostic_df = pd.concat([X,y],axis=1)
breast_cancer_diagnostic_df.head()

| | radius1 | texture1 | perimeter1 | area1 | smoothness1 | compactness1 | concavity1 | concave_points1 | symmetry1 | fractal_dimension1 | ... | texture3 | perimeter3 | area3 | smoothness3 | compactness3 | concavity3 | concave_points3 | symmetry3 | fractal_dimension3 | Diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | M |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | M |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | M |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | M |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | M |


In [18]:
breast_cancer_diagnostic_df.isna().sum().sum()

0

In [19]:
breast_cancer_diagnostic_df['Diagnosis'].value_counts()

B    357
M    212
Name: Diagnosis, dtype: int64

In [20]:
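# Recode the target: M (malignant) -> 0, B (benign) -> 1
# Assign the result back: replace(..., inplace=True) on a selected column
# may not update the DataFrame in recent pandas versions
breast_cancer_diagnostic_df['Diagnosis'] = breast_cancer_diagnostic_df['Diagnosis'].map({
    'M': 0,
    'B': 1,
})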

In [21]:
X = breast_cancer_diagnostic_df.drop('Diagnosis', axis=1)
y = breast_cancer_diagnostic_df['Diagnosis']

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=24)

In [23]:
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
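
The warning above suggests two remedies: scale the features or raise `max_iter`. A hedged sketch that applies both in a `Pipeline`. To stay runnable offline it uses scikit-learn's bundled copy of the same Diagnostic dataset (`load_breast_cancer`) instead of `fetch_ucirepo`; the variable names and `max_iter=1000` are assumptions chosen for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundled Wisconsin Diagnostic data: 569 samples, 30 features
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=24
)

# Standardise each feature to zero mean and unit variance before fitting,
# so the solver sees features on comparable scales
scaled_logreg = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
scaled_logreg.fit(X_tr, y_tr)

scaled_score = scaled_logreg.score(X_te, y_te)
iterations_used = scaled_logreg.named_steps["classifier"].n_iter_[0]
print(scaled_score, iterations_used)
```

Scaling also makes the coefficients comparable across features, since features such as `area1` and `smoothness1` otherwise differ by several orders of magnitude.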


In [24]:
logistic_regression.coef_

array([[ 1.25042911,  0.52400177,  0.16354035, -0.01831261, -0.04837595,
        -0.2289746 , -0.30551111, -0.131887  , -0.06766531, -0.0134385 ,
         0.0552828 ,  0.53518136,  0.1829223 , -0.1286827 , -0.0026744 ,
        -0.03879791, -0.05621525, -0.01588397, -0.01386848, -0.00237516,
         1.43801936, -0.60629193, -0.203309  , -0.0169482 , -0.08257276,
        -0.67612234, -0.81357364, -0.25621037, -0.17806145, -0.05851047]])

In [25]:
print(confusion_matrix(y_test, y_pred))

[[ 61   4]
 [  5 101]]


In [26]:
model_score = accuracy_score(y_test, y_pred)
model_score

0.9473684210526315

In [27]:
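# Class 0 (malignant) is the minority here, so the naive rule must predict
# the majority class, which is 1 (benign)
y_pred_majority = np.ones(len(y_test))
naive_score = accuracy_score(y_test, y_pred_majority)
naive_score

0.6198830409356725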

In [28]:
if model_score <= naive_score:
    print("Model is useless")
else:
    print("Model is useful")

Model is useful
