**Objective: Predict whether the cancer is benign or malignant**

## Part-1: Data Pre-Processing

> Dataset link: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

#### Importing requisite libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Ignore warnings

In [None]:
import warnings
warnings.filterwarnings("ignore")

#### Importing dataset

In [None]:
dataset = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")

#### Data Exploration

In [None]:
dataset.shape

The dataframe has 569 observations (rows) and 33 columns.

In [None]:
dataset.info()

- We can observe that 32 columns are numeric, and the target variable `diagnosis` is an object.
- The column `Unnamed: 32` has no observations.
- None of the other columns has null values.

##### Analyzing categorical variables

In [None]:
dataset.select_dtypes(include = "object").columns

In [None]:
# Checking number of categorical variables
len(dataset.select_dtypes(include = "object").columns)

Only one column, `diagnosis`, is a categorical variable.

##### Checking numerical variables

In [None]:
dataset.select_dtypes(include=["int64","float64"]).columns

In [None]:
# Checking number of numerical variables
len(dataset.select_dtypes(include=["int64","float64"]).columns)

There are 32 numerical variables

##### Get statistical summary for numerical variables

In [None]:
dataset.describe()

In [None]:
# Get the list of columns
dataset.columns

### Deal with missing values

Check if we have any null values in the dataset

In [None]:
dataset.isnull().values.any()

Get the total number of null values in the dataset

In [None]:
dataset.isnull().values.sum()

Getting the column(s) which have null values

In [None]:
dataset.columns[dataset.isnull().any()]

In [None]:
len(dataset.columns[dataset.isnull().any()])

Counting the non-null values in the column `'Unnamed: 32'` (`count()` excludes nulls, so a result of 0 means the column is entirely empty)

In [None]:
dataset['Unnamed: 32'].count()

We drop the column `'Unnamed: 32'` because it contains only null values

In [None]:
dataset = dataset.drop(columns = 'Unnamed: 32')
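
As a minimal sketch of the same clean-up on a toy frame (the three-row `toy` frame and its values are made up for illustration, not taken from the dataset), the following shows how `count()` and `isnull().sum()` relate, and how `dropna(axis=1, how="all")` removes every all-null column without naming it:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# count() returns the number of non-null entries; 0 means the column is empty.
non_null = toy["Unnamed: 32"].count()
# isnull().sum() returns the number of null entries in the column.
null_total = toy["Unnamed: 32"].isnull().sum()

# Drop every column whose values are all null, whatever it is called.
cleaned = toy.dropna(axis=1, how="all")

print(non_null, null_total, list(cleaned.columns))
```

Dropping by rule rather than by name is a design choice: it keeps working if the empty column is exported under a different header.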

Check the shape of dataset after dropping column

In [None]:
dataset.shape

Checking to see if there are any NULL values in the dataset

In [None]:
dataset.isnull().values.any()

### Dealing with Categorical Data

In [None]:
dataset.select_dtypes(include = "object").columns

Getting unique values in `diagnosis` column

In [None]:
dataset["diagnosis"].unique()

In [None]:
# Total number of unique values
dataset["diagnosis"].nunique()

##### Perform one-hot encoding to convert `diagnosis` to a numerical variable

In [None]:
dataset = pd.get_dummies(data = dataset, drop_first = True)
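
An alternative to `get_dummies` for a two-class target is an explicit mapping, which makes the 0/1 coding visible in the code. This is only a sketch on a hypothetical four-row frame; the column name `diagnosis_M` is reused so that it lines up with the rest of the notebook:

```python
import pandas as pd

# Hypothetical miniature of the target column.
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# Map benign to 0 and malignant to 1, then drop the original text column.
toy["diagnosis_M"] = toy["diagnosis"].map({"B": 0, "M": 1})
toy = toy.drop(columns="diagnosis")

print(toy)
```

Any label missing from the mapping would become `NaN`, which is a quick way to spot unexpected categories.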

In [None]:
dataset.head()

### Create a CountPlot to check `diagnosis_M`

In [None]:
sns.countplot(data = dataset, x = "diagnosis_M")
plt.show()

In [None]:
# Benign (0) values
(dataset["diagnosis_M"] == 0).sum()

In [None]:
# Malignant (1) values
(dataset["diagnosis_M"] == 1).sum()
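
Both counts can also be read in one call. The sketch below uses a made-up eight-value series rather than the real column, and shows `value_counts` with and without `normalize=True` to get counts and class shares:

```python
import pandas as pd

# Made-up labels: 0 = benign, 1 = malignant.
labels = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name="diagnosis_M")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction per class

print(counts)
print(shares)
```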

### Correlation Matrix and HeatMap

Dropping Target variable `diagnosis_M`

In [None]:
dataset_2 = dataset.drop(columns = "diagnosis_M")

In [None]:
dataset_2.head()

Plotting the correlation between the target variable `diagnosis_M` and the other variables

In [None]:
dataset_2.corrwith(dataset['diagnosis_M']).plot.bar(
figsize = (20,10), title = "Correlation with diagnosis_M", rot = 45, grid = True)
plt.show()

Creating a correlation matrix `corr` to view the results better

In [None]:
corr = dataset_2.corr()

In [None]:
corr

Creating a HeatMap to view the correlations

In [None]:
plt.figure(figsize = (20,10))
sns.heatmap(data = corr, annot = True, cmap = "RdYlGn")
plt.show()
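
A large annotated heatmap can be hard to read, so it may help to list the strongly correlated pairs directly. The sketch below is an assumption-laden illustration: the three synthetic columns and the 0.9 threshold are chosen for the example and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic data: "perimeter" is built from "radius", "texture" is independent.
rng = np.random.default_rng(0)
radius = rng.normal(size=200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(scale=0.05, size=200),
    "texture": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs whose absolute correlation exceeds the illustrative threshold.
strong_pairs = pairs[pairs > 0.9]
print(strong_pairs)
```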

### Splitting the dataset into train and test set

In [None]:
dataset.head()

#### Matrix of Features / Independent Variables

Dropping `id` and `diagnosis_M` fields from the dataset

In [None]:
x = dataset.iloc[:, 1:-1].values

In [None]:
x.shape

#### Target Variable / Dependent Variable

In [None]:
y = dataset.iloc[:,-1].values

In [None]:
y.shape

#### Split data using scikit-learn `train_test_split`

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
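
When the classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions similar in both parts. The sketch below uses synthetic arrays (100 rows with a 63/37 class mix picked for illustration), not the notebook's `x` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 63 rows of class 0 and 37 rows of class 1.
x_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 63 + [1] * 37)

# stratify=y_demo asks for the same class mix in the train and test parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("train share of class 1:", y_tr.mean())
print("test share of class 1:", y_te.mean())
```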

In [None]:
x_train.shape

In [None]:
x_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

### Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
sc = StandardScaler()

In [None]:
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
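
The reason for calling `fit_transform` on the training set and only `transform` on the test set is that the test rows are scaled with the training mean and standard deviation. A small sketch with made-up numbers shows this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows with two features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```

The scaled training columns have mean 0 and unit standard deviation; the test row generally does not, which is expected.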

In [None]:
x_train

In [None]:
x_test

## Part-2: Building the model

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
classifier_lr = LogisticRegression(random_state=0)

In [None]:
classifier_lr.fit(x_train,y_train)

In [None]:
y_pred = classifier_lr.predict(x_test)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

In [None]:
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)

In [None]:
results = pd.DataFrame([['Logistic Regression', acc, f1, prec, rec]],
               columns = ['Model', 'Accuracy', 'F1 Score', 'Precision', 'Recall'])

In [None]:
results

Creating a Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)
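
For a binary problem, scikit-learn lays out the confusion matrix as `[[TN, FP], [FN, TP]]`, so the four cells can be unpacked with `ravel()` and tied back to precision and recall. The labels below are invented for the example:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented true labels and predictions (0 = benign, 1 = malignant).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted malignant that are malignant
recall = tp / (tp + fn)     # share of malignant cases that were found

print(tn, fp, fn, tp, precision, recall)
```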

### Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
accuracies = cross_val_score(estimator=classifier_lr, X=x_train, y=y_train, cv=10)

In [None]:
print("Accuracy is {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation is {:.2f} %".format(accuracies.std()*100))
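
One caveat with cross-validating on data that was scaled beforehand is that each validation fold has already influenced the scaler. A possible way around this, sketched here on synthetic data from `make_classification` rather than on the notebook's arrays, is to put the scaler and the classifier in one pipeline so the scaler is refit inside every fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled data used only for illustration.
x_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The pipeline fits StandardScaler on the training part of each fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0, max_iter=1000))
scores = cross_val_score(pipe, x_demo, y_demo, cv=10)

print("Accuracy is {:.2f} %".format(scores.mean() * 100))
print("Standard Deviation is {:.2f} %".format(scores.std() * 100))
```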

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
classifier_rm = RandomForestClassifier(random_state=0)
classifier_rm.fit(x_train, y_train)

In [None]:
y_pred = classifier_rm.predict(x_test)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

In [None]:
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)

In [None]:
model_results = pd.DataFrame([['Random forest', acc, f1, prec, rec]],
               columns = ['Model', 'Accuracy', 'F1 Score', 'Precision', 'Recall'])

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
results = pd.concat([results, model_results], ignore_index=True)

In [None]:
results

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

### Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator=classifier_rm, X=x_train, y=y_train, cv=10)

print("Accuracy is {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation is {:.2f} %".format(accuracies.std()*100))

## Part-3: Randomized Search to find the best Parameters

**We will be using `Logistic Regression` as that seems to be the best model for our scenario**

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
parameters = {'penalty':['l1', 'l2', 'elasticnet', 'none'],
              'C':[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0],
              'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
              }

In [None]:
parameters

In [None]:
random_search = RandomizedSearchCV(estimator=classifier_lr,param_distributions=parameters, n_iter=5, 
                                   scoring='roc_auc', n_jobs = -1, cv=5, verbose=3)

In [None]:
random_search.fit(x_train, y_train)

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_score_

In [None]:
random_search.best_params_
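
Not every `penalty` and `solver` combination is supported by `LogisticRegression`, so a flat grid can sample candidates whose fits fail. `RandomizedSearchCV` also accepts a list of dictionaries, which allows grouping only compatible values. The grouping below is a sketch run on synthetic data; it covers the `l1` and `l2` penalties only, and the exact split is an assumption made for the example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data used only for illustration.
x_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

c_values = [0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]

# Each dictionary lists solvers together with penalties they support.
param_sets = [
    {"solver": ["newton-cg", "lbfgs", "sag"], "penalty": ["l2"], "C": c_values},
    {"solver": ["liblinear", "saga"], "penalty": ["l1", "l2"], "C": c_values},
]

search = RandomizedSearchCV(
    estimator=LogisticRegression(random_state=0, max_iter=5000),
    param_distributions=param_sets,
    n_iter=5,
    scoring="roc_auc",
    cv=5,
    random_state=0,  # fixes which candidates are sampled
)
search.fit(x_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Setting `random_state` on the search makes the sampled candidates reproducible between runs.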

## Part-4: Final Model (Logistic Regression)

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(C=0.75, random_state=0, solver='saga',penalty='l2')

In [None]:
classifier.fit(x_train, y_train)

In [None]:
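# Predict with the final model; otherwise y_pred still holds the Random Forest output
y_pred = classifier.predict(x_test)
acc = accuracy_score(y_test, y_pred)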
f1 = f1_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)

model_results = pd.DataFrame([['Final Logistic Regression', acc, f1, prec, rec]],
               columns = ['Model', 'Accuracy', 'F1 Score', 'Precision', 'Recall'])


# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
results = pd.concat([results, model_results], ignore_index=True)
results

### Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator=classifier, X=x_train, y=y_train, cv=10)

print("Accuracy is {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation is {:.2f} %".format(accuracies.std()*100))

## Part-5: Predicting a single observation

In [None]:
dataset.head()

In [None]:
# Take the 30 feature columns of the first row in training order (skip `id` and `diagnosis_M`)
out_list = dataset.iloc[0, 1:-1].to_list()

In [None]:
np_single_observation = np.array(out_list,ndmin=2)

In [None]:
np_single_observation

In [None]:
classifier.predict(sc.transform(np_single_observation))
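
Besides the hard label, `predict_proba` returns the estimated probability of each class, which can be more informative for a single patient. The sketch below trains a throwaway model on synthetic data; `x_demo`, `scaler`, and `model` are stand-ins, not the notebook's `dataset`, `sc`, and `classifier`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data and a throwaway model, for illustration only.
x_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
scaler = StandardScaler().fit(x_demo)
model = LogisticRegression(random_state=0).fit(scaler.transform(x_demo), y_demo)

# A single observation must be 2-D: one row, n_features columns.
single = x_demo[0].reshape(1, -1)
single_scaled = scaler.transform(single)

proba = model.predict_proba(single_scaled)  # columns follow model.classes_
label = model.predict(single_scaled)

print(model.classes_, proba, label)
```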

***For this sample observation, the model is predicting that the cancer is `malignant`***

In [None]:
# Checking the actual label: `id` and `diagnosis_M` of the first row
dataset.iloc[0, [0, -1]].to_list()

***For this sample observation, the actual label is also `malignant`, so the prediction is correct***