Welcome to this notebook, which walks you through a machine learning project from start to finish!

Let us hop on and do some coding:

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df=pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')
df.head(5)

Before we dive further into the code, let us understand the problem statement:

We are provided with data on different patients, each identified by an id. Their diagnosis is also given, where M means malignant (cancerous and harmful) and B means benign (non-cancerous and not harmful).

### Why is this project important? 

Breast cancer (BC) is one of the most common cancers among women worldwide and accounts for a large share of new cancer cases and cancer-related deaths according to global statistics, making it a significant public health problem in today’s society.

The early diagnosis of BC can improve the prognosis and chance of survival significantly, as it can promote timely clinical treatment to patients. Further, accurate classification of benign tumors can prevent patients undergoing unnecessary treatments. Thus, **the correct diagnosis of BC and classification of patients into malignant or benign groups is the subject of much research.**

## Risk factors

The following are some of the known risk factors for breast cancer:

**Age**: The chance of getting breast cancer increases as women age. Nearly 80 percent of breast cancers are found in women over the age of 50.

**Personal history of breast cancer**: A woman who has had breast cancer in one breast is at an increased risk of developing cancer in her other breast.

**Family history of breast cancer**: A woman has a higher risk of breast cancer if her mother, sister or daughter had breast cancer, especially at a young age (before 40). Having other relatives with breast cancer may also raise the risk.

**Genetic factors**: Women with certain genetic mutations, including changes to the BRCA1 and BRCA2 genes, are at higher risk of developing breast cancer during their lifetime. Other gene changes may raise breast cancer risk as well.

**Childbearing and menstrual history**: The older a woman is when she has her first child, the greater her risk of breast cancer. Also at higher risk are:

1. Women who menstruate for the first time at an early age (before 12)
2. Women who go through menopause late (after age 55)
3. Women who’ve never had children

Let us see the statistical parameters of our dataset: 

In [None]:
df.describe()

In [None]:
df.shape

It is always important to check whether there are any missing values in our dataset, as they may affect the model we will be training later.

In [None]:
df.isna().sum()

Our dataset does not contain any missing value, which is a good thing for us as we won't have to invest in cleaning our data. Cleaning our data is a very important step in any project, even if the dataset looks clean prima facie.

Now, let us look at the value counts of our diagnosis column and check whether this is an imbalanced classification problem.

**What is an imbalanced dataset?**

Suppose you are working on identifying patients with skin cancer from a number of available features. Your target variable will be 0 (it is not skin cancer) or 1 (it is skin cancer). Cancer cases are rare compared to the overall population, and a sample drawn from that population will reflect this. The target variable may have 9900 0's and only 100 1's.

If you train a model on this data, it will learn that not having skin cancer is normal while having skin cancer is very unlikely. Worse, it may start treating the 100 rows with 1 as the target variable as noise. We surely do not want this to happen!
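
To make the danger concrete, here is a minimal sketch using the hypothetical 9900/100 split from the example above. The "model" is an assumption for illustration only: it ignores its input and always predicts the majority class.

```python
# Hypothetical labels matching the example above: 9900 healthy (0), 100 cancer (1).
y_true = [0] * 9900 + [1] * 100

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Fraction of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Of the actual cancer cases, how many did the model catch?
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"accuracy = {accuracy:.2%}")
print(f"recall   = {recall:.2%}")
```

The accuracy looks excellent even though not a single cancer case is detected, which is why accuracy alone can mislead on imbalanced data.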

In [None]:
df["diagnosis"].value_counts()

In [None]:
sns.countplot(x='diagnosis', data=df) #recent seaborn versions expect the column to be passed by keyword

Our dataset is fairly balanced, so we do not need to rebalance it.

In [None]:
df.dtypes

Before moving on to other steps, we will learn an important concept called label encoding. While navigating through the data, you will have seen that the diagnosis is recorded as the letters B and M for benign and malignant cancer cells. We need to convert these letters to numbers so that this column can also be used statistically.
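
As a small illustration of what label encoding does, the sketch below runs scikit-learn's `LabelEncoder` on a few made-up labels (they are not read from the dataset). The encoder stores the classes in sorted order, so B comes before M.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only (not read from the dataset).
diagnosis = ["M", "B", "B", "M", "B"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(diagnosis)

print(encoder.classes_)  # the classes, stored in sorted order
print(encoded)           # one integer per input label

# The mapping can be reversed to recover the original letters.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

Each integer is the position of the label in `classes_`, so B is encoded as 0 and M as 1.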

In [None]:
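from sklearn.preprocessing import LabelEncoder
labelencoder_Y=LabelEncoder()
#encode the diagnosis column: B becomes 0 and M becomes 1
#assigning by column name replaces the column, so it ends up with an integer dtype
df['diagnosis']=labelencoder_Y.fit_transform(df['diagnosis'])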


In the above code, we have applied label encoding to every row of a single column, the diagnosis, so that B becomes 0 and M becomes 1. Now let us create a correlation matrix to understand the relationship between the various variables. We will use the Pearson correlation.
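
For readers who want to see what the Pearson correlation measures, here is a minimal hand-written sketch. The function name `pearson` and the sample values are assumptions made for illustration; in the notebook itself pandas does this computation for every pair of columns.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (covariance, unnormalized).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances, unnormalized).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

radius = [1.0, 2.0, 3.0, 4.0]  # made-up values
area = [2.0, 4.0, 6.0, 8.0]    # rises in step with radius

print(pearson(radius, area))                  # perfectly positive relationship
print(pearson(radius, [8.0, 6.0, 4.0, 2.0]))  # perfectly negative relationship
```

The coefficient always lies between -1 and 1: values near 1 mean two features rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is no linear relationship.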

In [None]:
df.iloc[:,1:].corr(method="pearson")

We have a lot of features in our dataset, so visualizing them all together would make the plot virtually illegible. So we will first draw a correlation heatmap for the diagnosis and the first 13 features, and then one for the rest. For other comparisons, the above table can be used.

In [None]:
plt.figure(figsize=(15,15))
sns.heatmap(df.iloc[:,1:15].corr(), annot=True,fmt="0.0%")

In [None]:
plt.figure(figsize=(15,15))
sns.heatmap(df.iloc[:,15:].corr(), annot=True,fmt="0.0%")

Now it is time to split our dataset into the dependent variable and the independent variables.

The dependent variable in our machine learning problem is the one we want to predict. Our problem statement is solved if we are able to predict whether the cancer cell is benign or malignant, so the diagnosis is our dependent variable.

All the other variables are independent variables.

In [None]:
X=df.iloc[:,2:31].values #features that help us determine if patient has cancer or not
Y=df.iloc[:,1].values #this is the dataset containing our target variable which indicates diagnosis

The data we use is usually split into training data and test data. The training set contains a known output and the model learns on this data in order to be generalized to other data later on. We have the test dataset (or subset) in order to test our model’s prediction on this subset.

From the scikit-learn submodule model_selection, we will import train_test_split so we can split the data into training and test sets. The test_size argument indicates the fraction of the data that should be held out for testing. The train/test ratio is usually around 80/20 or 70/30. Keeping a separate test set lets us check whether the model overfits or underfits. Let us first understand what overfitting and underfitting mean:

## Overfitting

Overfitting means that the model we trained has trained “too well” and now fits too closely to the training dataset. This usually happens when the model is too complex, i.e. it has too many features/variables compared to the number of observations. Such a model will be very accurate on the training data but will probably be much less accurate on unseen or new data. This is because the model is not generalized, meaning you cannot generalize the results and cannot make inferences on other data, which is, ultimately, what you are trying to do. Basically, when this happens, the model learns or describes the “noise” in the training data instead of the actual relationships between variables in the data. This noise, obviously, is not part of any new dataset and cannot be applied to it.

## Underfitting

In contrast to overfitting, when a model is underfitted, it does not fit the training data and therefore misses the trends in the data. It also means the model cannot be generalized to new data. This is usually the result of a very simple model which does not have enough predictors/independent variables. It can also happen when, for example, we fit a linear model, like linear regression, to data that is not linear. It almost goes without saying that such a model will have poor predictive ability on the training data and cannot be generalized to other data.
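
The two failure modes can be sketched on synthetic data. Everything below is an assumption made for illustration (a noisy quadratic trend, NumPy polynomial fits of degree 1, 2 and 9) and has nothing to do with the cancer dataset. The degree-1 line is too simple for the curve, while the degree-9 polynomial has enough freedom to chase the noise in 15 training points.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic data (an assumption for illustration): quadratic trend plus noise.
    x = rng.uniform(-1, 1, n)
    y = x ** 2 + rng.normal(0, 0.1, n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with these coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {errors[degree][0]:.4f}, "
          f"test MSE {errors[degree][1]:.4f}")
```

The training error can only go down as the degree increases. Whether the test error follows is what separates a good fit from overfitting, so comparing the two columns is the thing to look at.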

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test= train_test_split(X,Y, test_size=0.25, random_state=0)

## Fit Transform
To standardize the data (make it have zero mean and unit standard deviation), you subtract the mean and then divide the result by the standard deviation.

x′=(x−μ)/σ.

You do that on the training set. But then you have to apply the same transformation to your test set (e.g. in cross-validation), or to newly obtained examples before making predictions, and you have to use the same two parameters μ and σ that you computed on the training set.

Hence, the fit() method of every scikit-learn transformer just calculates the parameters (e.g. μ and σ in the case of StandardScaler) and saves them as the object's internal state. Afterwards, you can call its transform() method to apply the transformation to a particular set of examples.

fit_transform() joins these two steps and is used for the initial fitting of parameters on the training set x, but it also returns a transformed x′. Internally, it just calls first fit() and then transform() on the same data.
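
A minimal sketch of this pattern on made-up numbers is shown below. The arrays `demo_train` and `demo_test` are assumptions for illustration. The point is that μ and σ are learned once from the training data and then reused unchanged on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data, made up for illustration.
demo_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
demo_test = np.array([[3.0], [6.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(demo_train)  # learns mu and sigma, then scales
test_scaled = scaler.transform(demo_test)        # reuses the training mu and sigma

print(scaler.mean_, scaler.scale_)  # the stored mu and sigma
print(test_scaled.ravel())

# The same result by hand, using the parameters learned from the training set.
manual = (demo_test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, test_scaled))
```

Calling `fit_transform` on the test set instead would recompute μ and σ from the test data, so the two sets would no longer be on the same scale.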

In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test) #reuse the mean and standard deviation learned from the training set

We have now scaled our training and test data. Now it is time to read about the various models that can be used to predict whether a cancer cell is benign or malignant.

Model selection is an important part of solving a machine learning problem, but not as important as data cleaning! Your model will only ever be as good as your data, so the focus should always be on getting high-quality data and cleaning it properly.

There are a number of machine learning models available which can be employed to reach meaningful conclusions, and selecting the right model depends on a variety of factors such as:

1. The accuracy of the model.
2. The interpretability of the model.
3. The complexity of the model.
4. The scalability of the model.
5. How long does it take to build, train, and test the model?
6. How long does it take to make predictions using the model?
7. Does the model meet the business goal?


In this notebook, we will mainly focus on three algorithms which can be used to model our dataset:

### 1. Logistic regression
### 2. Decision tree classifier 
### 3. Random Forest classifier 

We will understand these algorithms one by one. 

## Logistic regression

Logistic regression is named for the function used at the core of the method, the logistic function.

The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

1 / (1 + e^-value)

Where e is the base of the natural logarithms (Euler’s number or the EXP() function in your spreadsheet) and value is the actual numerical value that you want to transform. Below is a plot of the numbers between -5 and 5 transformed into the range 0 and 1 using the logistic function.

![](https://miro.medium.com/max/2400/1*RqXFpiNGwdiKBWyLJc_E7g.png)

Logistic regression uses an equation as the representation, very much like **linear regression**.

Input values (x) are combined linearly using weights or coefficient values (referred to by the Greek letter beta) to predict an output value (y). A key difference from linear regression is that the output value being modeled is a binary value (0 or 1) rather than a numeric value.
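
The two pieces described above, a linear combination followed by the logistic function, can be sketched in a few lines. The weights, bias and feature values are hypothetical and chosen only for illustration, and the 0.5 cut-off is a common convention rather than something fixed by the method.

```python
import math

def sigmoid(value):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    # Linear combination of the inputs, exactly as in linear regression.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Squash the result into a probability.
    return sigmoid(z)

# Hypothetical coefficients and one scaled observation, for illustration only.
weights = [0.8, -0.4]
bias = 0.1
features = [1.5, 0.5]

p = predict_proba(features, weights, bias)
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```

Training a logistic regression model amounts to finding the weights and bias that make these probabilities match the known labels as closely as possible.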

If you wish to read more about linear regression, please refer to my notebook: https://www.kaggle.com/srtpan/beginner-friendly-linear-regression-guide

## Decision tree classifier

A Decision Tree is a simple representation for classifying examples. It is a supervised machine learning method in which the data is repeatedly split according to a certain parameter.

A Decision Tree consists of:

1. Nodes : Test for the value of a certain attribute.
2. Edges/ Branch : Correspond to the outcome of a test and connect to the next node or leaf.
3. Leaf nodes : Terminal nodes that predict the outcome (represent class labels or class distribution).

![](https://static.javatpoint.com/tutorial/machine-learning/images/decision-tree-classification-algorithm.png)

In this notebook, we will be using classification trees. Such a tree is built through a process known as binary recursive partitioning: an iterative process of splitting the data into partitions, and then splitting it up further on each of the branches. Classification trees are used for classification problems; decision trees can also be used for regression problems.
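
How does the tree decide where to split? One common criterion, and the one selected with `criterion="entropy"` in this notebook's code, is the reduction in entropy, often called information gain. The sketch below computes that textbook quantity on made-up labels. The helper names are assumptions, and scikit-learn's own implementation is more involved.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    return entropy(parent) - (weight_left * entropy(left)
                              + weight_right * entropy(right))

parent = ["B", "B", "B", "M", "M", "M"]
perfect_split = (["B", "B", "B"], ["M", "M", "M"])  # each side is pure
poor_split = (["B", "M", "B"], ["M", "B", "M"])     # each side is still mixed

print(information_gain(parent, *perfect_split))
print(information_gain(parent, *poor_split))
```

At each node the tree tries candidate splits and keeps the one with the highest gain, then repeats the process on each branch.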

## Random forest classifier

It is an ensemble tree-based learning algorithm. The Random Forest Classifier is a set of decision trees, each built from a randomly selected subset of the training set. It aggregates the votes from the different decision trees to decide the final class of the test object.

Ensemble algorithm: Ensemble algorithms are those which combine more than one algorithm of the same or a different kind for classifying objects. An example is running predictions over Naive Bayes, SVM and Decision Tree models and then taking a vote to decide the final class of the test object.

![](https://miro.medium.com/max/718/0*a8KgF1IINziv7KIQ.png)
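
The voting step can be sketched as follows. The per-tree predictions are hypothetical, and the helper `majority_vote` is an assumption for illustration. scikit-learn's RandomForestClassifier combines its trees by averaging predicted class probabilities, which can be thought of as a soft version of this vote.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the individual model predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees (rows) for three patients (columns).
tree_predictions = [
    ["M", "B", "B"],
    ["M", "B", "M"],
    ["B", "B", "M"],
    ["M", "M", "M"],
    ["M", "B", "B"],
]

# zip(*...) regroups the votes per patient instead of per tree.
final = [majority_vote(votes) for votes in zip(*tree_predictions)]
print(final)
```

Using an odd number of voters for a two-class problem avoids ties.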


Now that we have covered all three algorithms, it is time to apply them to our dataset! We will write a function to do so:

In [None]:
#create a function for the models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def models(X_train, Y_train):
    #logistic regression model
    log=LogisticRegression(random_state=0)
    log.fit(X_train, Y_train)

    #decision tree model
    tree=DecisionTreeClassifier(criterion="entropy", random_state=0)
    tree.fit(X_train, Y_train)

    #random forest classifier
    forest=RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
    forest.fit(X_train, Y_train)

    #print each model's accuracy on the training data
    print("[0]Logistic Regression Training Accuracy:", log.score(X_train, Y_train))
    print("[1]Decision Tree Classifier Training Accuracy:", tree.score(X_train, Y_train))
    print("[2]Random Forest Classifier Training Accuracy:", forest.score(X_train, Y_train))

    return log, tree, forest

You will see random_state in the code. What is random_state used for?

If no random_state is provided, the system will use a random state that is generated internally. So, when you run the program multiple times, you might see different train/test data points and the behavior will be unpredictable. If you then have an issue with your model, you will not be able to recreate it, as you do not know the random number that was generated when you ran the program. Fixing random_state makes every run reproducible.
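
The principle can be illustrated with Python's own random module. The helper `shuffled_split` is hypothetical and is not how train_test_split is implemented; it only shows that a fixed seed reproduces the same shuffle.

```python
import random

def shuffled_split(items, test_fraction, seed=None):
    """Shuffle a copy of items and split it into (train, test) lists."""
    rng = random.Random(seed)  # seed=None lets Python pick a fresh seed
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

patients = list(range(20))

first = shuffled_split(patients, 0.25, seed=0)
second = shuffled_split(patients, 0.25, seed=0)
print(first == second)  # same seed, same split

# Without a seed, two calls will usually produce different splits.
print(shuffled_split(patients, 0.25) == shuffled_split(patients, 0.25))
```

With a fixed seed, anyone who reruns the notebook gets exactly the same training and test sets, so results can be compared and bugs reproduced.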

In [None]:
#getting all of the models
model=models(X_train, Y_train)

The accuracy above is measured on the same data we used for training, so it is not surprising that it is very close to 100%. We will measure the accuracy on the testing data with the help of the confusion matrix and the metrics derived from it.

Now we will apply these models to our testing data, as we earlier applied them to the training set.

## Evaluating our classifier

We will be applying the above-mentioned algorithms to our machine learning problem, but how do we check how well these algorithms perform on our test data?

#### We can use classification performance metrics such as Log-Loss, Accuracy, AUC (Area Under the Curve) etc. Other examples of evaluation metrics are precision and recall, which are also used to evaluate the ranking algorithms primarily used by search engines.

credit: https://medium.com/@MohammedS/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b#:~:text=We%20can%20use%20classification%20performance,primarily%20used%20by%20search%20engines.

## 1. Confusion Matrix

A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix. The confusion matrix shows the ways in which your classification model is confused when it makes predictions. It gives us insight not only into the errors being made by a classifier but, more importantly, into the types of errors that are being made.

### Definition of the Terms:

**Positive (P) :** Observation is positive (for example: is an apple). 

**Negative (N) :** Observation is not positive (for example: is not an apple). 

**True Positive (TP) :** Observation is positive, and is predicted to be positive. 

**False Negative (FN) :** Observation is positive, but is predicted negative. 

**True Negative (TN) :** Observation is negative, and is predicted to be negative. 

**False Positive (FP) :** Observation is negative, but is predicted positive.


![](https://qph.fs.quoracdn.net/main-qimg-d1f717f321a4ebc8e6d2c1d1cc19f9fc)
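
The four terms defined above can be counted directly from the true and predicted labels. This is a minimal sketch on made-up labels, and the function name `confusion_counts` is an assumption for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, TN and FP for a binary classification problem."""
    tp = fn = tn = fp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # positive, predicted positive
            else:
                fn += 1  # positive, predicted negative
        else:
            if predicted == positive:
                fp += 1  # negative, predicted positive
            else:
                tn += 1  # negative, predicted negative
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}

# Made-up labels: 1 = malignant (positive), 0 = benign (negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))
```

In a cancer setting the two kinds of error are not equally costly: a false negative is a missed malignant case, while a false positive is a benign case flagged for further checks.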

## 2. Accuracy:
Accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions made.

**Accuracy:** Overall, how often is the classifier correct? (TP+TN)/total
Accuracy is a good measure when the target variable classes in the data are nearly balanced.

## 3. Precision:

Precision is a measure that tells us what proportion of the patients we diagnosed as having cancer actually had cancer. The predicted positives (people predicted as having cancer) are TP and FP, and of those, the people actually having cancer are TP.

**Precision** = TP/(TP+FP)

## 4. Recall:

Recall is a measure that tells us what proportion of the patients who actually had cancer were diagnosed by the algorithm as having cancer. The actual positives (people having cancer) are TP and FN, and of those, the people diagnosed by the model as having cancer are TP.

**Recall** = TP/(TP+FN)

When to use precision and when to use recall?

**Precision** is about being precise. So even if we managed to capture only one cancer case, and we captured it correctly, then we are 100% precise.
**Recall** is not so much about capturing cases correctly but more about capturing all cases that have “cancer” with the answer as “cancer”. So if we simply label every case as “cancer”, we have 100% recall.
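
The two extreme cases described above can be checked with a few lines of arithmetic. The counts below are hypothetical (a test set with 50 malignant and 93 benign cases), and returning 0.0 when a denominator is zero is a convention assumed here for simplicity.

```python
def accuracy(tp, fn, tn, fp):
    return (tp + tn) / (tp + fn + tn + fp)

def precision(tp, fp):
    # Returns 0.0 when nothing was predicted positive (an assumed convention).
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Returns 0.0 when there are no actual positives (an assumed convention).
    return tp / (tp + fn) if (tp + fn) else 0.0

# Model A flags a single case, and that case is truly malignant.
print("A:", precision(tp=1, fp=0), recall(tp=1, fn=49))

# Model B labels every case as malignant.
print("B:", precision(tp=50, fp=93), recall(tp=50, fn=0))
print("B accuracy:", accuracy(tp=50, fn=0, tn=0, fp=93))
```

Model A is perfectly precise but misses almost every cancer case, while model B catches every case but is wrong about most of its alarms. Neither number is enough on its own.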

## 5. F1 score 

We don’t really want to carry both Precision and Recall in our pockets every time we make a model for solving a classification problem. So it’s best if we can get a single score that kind of represents both Precision(P) and Recall(R).

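One way to do that is simply taking their arithmetic mean, i.e. (P + R) / 2, where P is Precision and R is Recall. However, the arithmetic mean can look good even when one of the two is very poor, so the F1 score uses their harmonic mean instead: F1 = 2 * P * R / (P + R).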
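
A short comparison shows why the choice of mean matters. The F1 score as conventionally defined, and as reported by classification_report, is the harmonic mean of precision and recall. The precision and recall values below are made up for illustration.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def f1(p, r):
    """Harmonic mean of precision p and recall r (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# A classifier that labels almost everything positive:
# perfect recall, very poor precision.
p, r = 0.02, 1.0
print(arithmetic_mean(p, r))  # looks respectable
print(f1(p, r))               # stays close to the weaker of the two

# When precision and recall agree, both means agree as well.
print(arithmetic_mean(0.5, 0.5), f1(0.5, 0.5))
```

The harmonic mean is dominated by the smaller value, so a model only gets a high F1 score when both precision and recall are high.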

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
print(classification_report(Y_test, model[0].predict(X_test)))
print(accuracy_score(Y_test, model[0].predict(X_test)))

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
print(classification_report(Y_test, model[1].predict(X_test)))
print(accuracy_score(Y_test, model[1].predict(X_test)))

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
print(classification_report(Y_test, model[2].predict(X_test)))
print(accuracy_score(Y_test, model[2].predict(X_test)))

Thanks for following the notebook. Please upvote if you found this notebook useful. :)