<a href="https://colab.research.google.com/github/sukhijapiyush/Logistic-Regression-Project/blob/master/Breast%20Cancer%20Classification%20LR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Breast Cancer Classification using Logistic Regression
### Problem Statement
Breast cancer is the most common cancer among women worldwide. It accounts for 25% of all cancer cases and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen on an X-ray or felt as lumps in the breast area.

The key challenge in its detection is classifying tumors as malignant (cancerous) or benign (non-cancerous).

### Objective
 - Understand the dataset and clean it up (if required).
 - Build classification models to predict whether the cancer type is malignant or benign.

### Dataset Details
Dataset URL: https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset

### Steps involved
 - Data Load and Analysis
 - Data Wrangling
 - Exploratory Data Analysis
 - Splitting the dataset
 - Scaling of the variables
 - Modelling
 - Hyperparameter Tuning
 - Model Evaluation

### Importing Libraries

In [20]:
# Loading Libraries
# Data Manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Scaling
from sklearn.preprocessing import MinMaxScaler

# Feature Selection
from sklearn.feature_selection import RFE

# Model Creation
from sklearn.linear_model import LogisticRegression

# Model Selection
from sklearn.model_selection import train_test_split, GridSearchCV, KFold

# Metrics
from sklearn.metrics import classification_report,accuracy_score,precision_score,recall_score

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

### Data Loading and information

In [21]:
# Loading Data
b_cancer_data = pd.read_csv('breast-cancer.csv')
# First 5 rows of the data
b_cancer_data.head()


In [None]:
# Removing display limit of dataframe (optional cell to run)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Setting style for seaborn (sns.set() goes first because it resets any style set earlier)
sns.set()
sns.set_style("whitegrid", {'axes.grid' : False})

In [None]:
#Basic information about the data
## Number of rows and columns
print('Number of Columns:',b_cancer_data.shape[1])
print('Number of Rows:',b_cancer_data.shape[0])
## Number of missing values
print('Number of missing values:',b_cancer_data.isnull().sum().sum())
## Number of unique values
print('Number of unique values:',b_cancer_data.nunique().sum())
## Number of duplicates
print('Number of duplicates:',b_cancer_data.duplicated().sum())

In [None]:
# Basic information about the dataframe
b_cancer_data.info()

In [None]:
# Describing the dataframe
b_cancer_data.describe([0.25,0.50,0.75,0.95,0.99])

In [None]:
# Columns in the dataframe
print(b_cancer_data.columns)

As there are no null values present, no cleaning of the data is required.

### Data Cleaning

In [None]:
# Removing the ID column as it is a duplicate index
b_cancer_data=b_cancer_data.drop('id',axis=1)
# Printing dataframe
b_cancer_data.head()

In [None]:
num_cols=b_cancer_data.drop('diagnosis',axis=1)
target_cols=b_cancer_data['diagnosis']

In [None]:
# Violin plots of every feature to check for outliers
k=0
fig, axes = plt.subplots(10,3, figsize=(20, 40), sharey=True)
for i in range(0,10):
  for j in range(0,3):
    sns.violinplot(ax=axes[i,j],x=b_cancer_data[num_cols.columns[k]])
    k=k+1
plt.show()

In [None]:
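# Removing outliers with an IQR-style rule (5th and 95th percentiles are used in place of the quartiles)
# Only the numeric feature columns are checked; 'diagnosis' is a string column
feature_cols = num_cols.columns
Q1 = b_cancer_data[feature_cols].quantile(0.05)
Q3 = b_cancer_data[feature_cols].quantile(0.95)
IQR = Q3 - Q1
is_outlier = ((b_cancer_data[feature_cols] < (Q1 - 1.5 * IQR)) | (b_cancer_data[feature_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
b_cancer_data = b_cancer_data[~is_outlier]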
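
To illustrate how the choice of percentiles changes the filter, here is a minimal sketch on a toy dataframe. The helper `drop_outliers` and the toy columns are hypothetical and not part of the notebook. With the classic quartiles (0.25 and 0.75) the extreme value is dropped, while the wider 0.05 and 0.95 percentiles used above are far more permissive and keep it.

```python
import pandas as pd

def drop_outliers(df, cols, lower_q=0.25, upper_q=0.75, k=1.5):
    """Keep rows whose values in `cols` all lie inside the percentile fences."""
    low = df[cols].quantile(lower_q)
    high = df[cols].quantile(upper_q)
    spread = high - low
    # A row is kept only if every checked column is inside [low - k*spread, high + k*spread]
    inside = df[cols].ge(low - k * spread) & df[cols].le(high + k * spread)
    return df[inside.all(axis=1)]

# Toy data (hypothetical): one obviously extreme radius value
toy = pd.DataFrame({
    "radius": [10.0, 11.0, 12.0, 13.0, 14.0, 100.0],
    "label": ["B", "B", "M", "B", "M", "M"],
})

strict = drop_outliers(toy, ["radius"])                 # quartile-based fences
permissive = drop_outliers(toy, ["radius"], 0.05, 0.95)  # fences as in the cell above

print(len(strict), len(permissive))
```

On real data it is worth comparing the dataframe shape before and after filtering to see how many rows each setting removes.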

In [None]:
# Shape of dataset
b_cancer_data.shape

In [None]:
# Bar plots of every feature against the diagnosis
k=0
fig, axes = plt.subplots(10,3, figsize=(20, 40))
for i in range(0,10):
  for j in range(0,3):
    sns.barplot(ax=axes[i,j],x=b_cancer_data[num_cols.columns[k]],y=b_cancer_data['diagnosis'])
    k=k+1
plt.show()

In [None]:
## Checking correlation
plt.figure(figsize=(20,20))
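# Correlation is computed on the numeric feature columns only ('diagnosis' is a string column)
sns.heatmap(b_cancer_data[num_cols.columns].corr(),annot=True,cmap='Greens')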
plt.title('Correlation Matrix for Breast Cancer')
plt.show()
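
A heatmap with many features can be hard to read, so it may help to list the most strongly correlated pairs as a table. The sketch below is one possible way to do that; the helper `top_correlated_pairs` and the synthetic columns are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Synthetic data (hypothetical): perimeter is almost a linear function of radius
rng = np.random.default_rng(0)
radius = rng.normal(14, 3, 200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0, 0.5, 200),
    "smoothness": rng.normal(0.1, 0.01, 200),
})

pairs = top_correlated_pairs(toy, n=3)
print(pairs)
```

Applied to the notebook's data, this would be called on the numeric feature columns only.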

### Feature selection using GridSearchCV

In [None]:
# Dividing the dataset into train and test sets
df_train, df_test = train_test_split(b_cancer_data,train_size=0.7,random_state = 79,shuffle=True)

In [None]:
y_train = df_train.pop('diagnosis')
X_train = df_train
y_test = df_test.pop('diagnosis')
X_test = df_test
# Scaling the dataset
scaler=MinMaxScaler()
X_train=scaler.fit_transform(X_train[num_cols.columns])
X_test=scaler.transform(X_test[num_cols.columns])
# Mapping output to binary
target_map={'M':1,'B':0}
y_train=y_train.map(target_map)
y_test=y_test.map(target_map)
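
The scaler above is fitted on the training data only and then reused on the test data. A small sketch with made-up numbers shows the consequence: test values outside the training range land outside [0, 1], which is expected and avoids leaking test information into the scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: the training range is [1, 3], the test set goes beyond it
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[0.0], [4.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # min and max are learned from train only
test_scaled = scaler.transform(test)        # the same min and max are reused

print(train_scaled.ravel())
print(test_scaled.ravel())
```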

In [None]:
# step-1: create a cross-validation scheme
folds = KFold(n_splits = 10, shuffle = True, random_state = 79)

# step-2: specify range of hyperparameters to tune
hyper_params = [{'n_features_to_select': list(range(1, 31))}]


# step-3: perform grid search
# 3.1 specify model (RFE fits its own clone of the estimator, so no separate fit is needed)
lm = LogisticRegression()
rfe=RFE(lm)
# 3.2 call GridSearchCV()
model_cv = GridSearchCV(estimator = rfe, 
                        param_grid = hyper_params, 
                        scoring= 'accuracy', 
                        cv = folds, 
                        verbose = 1,
                        n_jobs=-1,
                        return_train_score=True)      

# fit the model
model_cv.fit(X_train, y_train)         
model_cv.best_params_
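
Besides `best_params_`, the fitted search object stores the score for every candidate in `cv_results_`, which helps judge whether a smaller feature count performs almost as well. The sketch below assumes synthetic data from `make_classification` and a smaller grid than the notebook's; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption: 8 features, 3 of them informative)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=2, random_state=79
)

folds = KFold(n_splits=5, shuffle=True, random_state=79)
search = GridSearchCV(
    estimator=RFE(LogisticRegression(max_iter=1000)),
    param_grid={"n_features_to_select": list(range(1, 9))},
    scoring="accuracy",
    cv=folds,
)
search.fit(X_demo, y_demo)

# One row per candidate value of n_features_to_select
results = pd.DataFrame(search.cv_results_)[
    ["param_n_features_to_select", "mean_test_score", "std_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False))
print(search.best_params_)
```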

Refit RFE with the tuned number of features to identify which columns are most important for prediction.

In [None]:
# Running RFE with the number of selected features set to 15
# (RFE fits its own clone of the estimator, so no separate fit is needed)
lm = LogisticRegression()
rfe = RFE(estimator=lm,n_features_to_select=15)
rfe = rfe.fit(X_train, y_train)

In [None]:
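# The scaler returns a NumPy array, so the column names are taken from num_cols
col = num_cols.columns[rfe.support_]
X_train = pd.DataFrame(X_train, columns=num_cols.columns)[col]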
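
The remaining steps listed at the top (modelling and evaluation on the held-out test set) could look roughly like the sketch below. It is not the notebook's own code: it uses scikit-learn's bundled breast cancer dataset as a stand-in for the Kaggle CSV, flips the target so that malignant = 1 as in the mapping above, and all variable names are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): scikit-learn's bundled dataset, not the Kaggle CSV
data = load_breast_cancer()
X = data.data
y = 1 - data.target  # scikit-learn codes malignant as 0; flip so malignant = 1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=79)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = MinMaxScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

# RFE keeps 15 features and refits the logistic regression on them
rfe_demo = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=15)
rfe_demo.fit(X_tr, y_tr)

# RFE.predict reduces X to the selected features before predicting
y_pred = rfe_demo.predict(X_te)

acc = accuracy_score(y_te, y_pred)
prec = precision_score(y_te, y_pred)
rec = recall_score(y_te, y_pred)
print("Accuracy:", acc, "Precision:", prec, "Recall:", rec)
print(classification_report(y_te, y_pred, target_names=["benign", "malignant"]))
```

For this problem, recall on the malignant class is usually the metric to watch, since a missed malignant tumor is the costlier error.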