# Predicting Breast Cancer using Random Forest Classifier

**What will you learn in this notebook?**  
You will learn some basic pandas operations that help you process your data, such as iloc, isnull() and head(). You will learn about encoding: why it is needed and two common ways to do it, LabelEncoder and One Hot Encoder. We will also see how a Random Forest Classifier can be trained and how a confusion matrix helps us judge the accuracy of our model. We will be using sklearn throughout the notebook.  

Dataset : [Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/downloads/breast-cancer-wisconsin-data.zip/2)

# The Data Set

In [23]:
import pandas as pd
data = pd.read_csv('./data/breast_cancer_data.csv')
data.shape #To find the dimensions of the dataset
data.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [4]:
data.shape

(569, 33)

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

# Data Preparation/Preprocessing

## Missing values
Let’s check if our dataset has any null or empty values.  
The call returns the number of missing values per column, so a column with nothing missing shows zero.

In [6]:
# Missing values
data.isnull().sum()

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed:

In [7]:
data.isna().sum()

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed:
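To make the counting concrete, here is a minimal sketch on a toy frame. The values and the all-empty trailing column are made up for illustration (the `head()` output above shows an empty `Unnamed: 32` column in the first rows, which is the situation the toy column mimics). It shows that `isnull()` and `isna()` give the same counts, and one possible way to drop a column that contains nothing but missing values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one complete column, one with a gap,
# and one that is entirely empty, like a trailing "Unnamed" column.
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 19.69],
    "texture_mean": [10.38, np.nan, 21.25],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Number of missing values per column; isna() is an alias of isnull()
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every column in which all values are missing
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

In this notebook the empty column is simply left out later by the column slice, so dropping it is optional.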

## Input and Output  
Consider our dataset. Given a list of attributes, we are trying to predict whether the tumor/lump is malignant or benign. So the second column, diagnosis (position 1 when counting from zero), is the output to be predicted, and the 30 measurement columns that follow it are our inputs. The id column has no relevance to our problem, so we leave it out.  
  
Let’s split it.

In [8]:
X = data.iloc[:,2:32].values
Y = data.iloc[:,1].values

*iloc in a pandas DataFrame is used for integer-location based indexing, i.e. selection by position. The iloc indexer syntax is data.iloc[<row selection>, <column selection>].*

*In simple words, “iloc” selects rows and columns by number, in the order that they appear in the data frame. Positions start at 0 and the end of a slice is excluded, so 2:32 selects positions 2 to 31.*
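A small sketch of the same kind of selection on a toy frame may help. The column names mirror the dataset, but the rows are made up and the frame has only four columns, so the slice is `2:4` instead of `2:32`.

```python
import pandas as pd

# Toy frame with the same leading column layout: id, diagnosis, then features
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "M"],
    "radius_mean": [17.99, 13.54, 20.57],
    "texture_mean": [10.38, 14.36, 17.77],
})

# All rows, columns at positions 2 and 3 (the slice end, 4, is excluded)
features = toy.iloc[:, 2:4].values
# All rows, the single column at position 1
labels = toy.iloc[:, 1].values

print(features.shape)
print(labels)
```

`.values` turns the selection into a NumPy array, which is the form sklearn works with in the rest of the notebook.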

## Encoding  
Now, look at what Y contains.

In [9]:
Y

array(['M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M',
       'M', 'M', 'M', 'M', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'M', 'M',
       'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'B', 'M',
       'M', 'M', 'M', 'M', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'B',
       'B', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'M', 'B', 'M',
       'M', 'B', 'B', 'B', 'B', 'M', 'B', 'M', 'M', 'B', 'M', 'B', 'M',
       'M', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'M', 'M', 'B', 'B', 'B',
       'M', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'B',
       'B', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'M', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'B', 'M',
       'B', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'M', 'B',
       'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'B', 'B', 'M', 'M',
       'B', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'M

In a very basic sense, machine learning models don’t understand text as it is. So we convert the text into the language that they can understand: numbers.  

Converting text values to numbers like this is called encoding in ML.  

Two common types of encoding in ML are:  
1. Label Encoding
2. One-Hot Encoding

## Label Encoding
This is a simple, straightforward encoding. We convert categorical text data into model-understandable numerical data by giving each category an integer, in this case using the LabelEncoder class of sklearn.  
So all we have to do to label encode a column is import the LabelEncoder class from the sklearn library, fit and transform the column, and then replace the existing text data with the new encoded data.
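As a minimal sketch of those steps, here is LabelEncoder applied to a short, made-up array of diagnosis labels. It also prints `classes_`, which shows which integer each label received, and uses `inverse_transform` to get the text back.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels in the same format as the diagnosis column
diagnosis = np.array(["M", "B", "B", "M", "B"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(diagnosis)

# classes_ is sorted, and a label's position in it is its integer code
print(encoder.classes_)
print(encoded)

# The mapping can be reversed when the original text is needed again
print(encoder.inverse_transform(encoded))
```

Because the classes are sorted, 'B' is encoded as 0 and 'M' as 1.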

## One-Hot Encoding
Label encoding introduces a new problem. Sometimes we encode categorical data whose categories have no order or relation of any kind between them. Since there are different numbers in the same column, the model may misread the data as being in some kind of order, e.g. 3 > 2 > 1 > 0. But this isn’t the case at all. To overcome this problem, we use a One Hot Encoder.
What it does is as follows.
It takes a column of categorical data (in older versions of sklearn, one that has already been label encoded) and splits it into multiple columns, as many as there are categories. Each value in these new columns is 1 or 0, depending on whether the row belongs to that column’s category or not.
So instead of one column answering, let’s say, what each fruit is, we have one column per fruit. The column for Apple has 1 for all rows that are apples and 0 otherwise.
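The fruit example can be sketched with sklearn's OneHotEncoder. The fruit names are made up for illustration, and the encoder is left at its defaults, which return a sparse matrix, so the sketch converts the result to a dense array for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column; the encoder expects 2-D input (rows x columns)
fruits = np.array([["apple"], ["banana"], ["apple"], ["cherry"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(fruits).toarray()  # dense copy of the sparse result

# One output column per category, in this order
print(encoder.categories_[0])
# Each row has a single 1, in the column of its own category
print(one_hot)
```

No column is "larger" than another here, so the model cannot read an order into the categories.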

For our case, where the output column has only two classes, a simple LabelEncoder will do.

In [12]:
# Encode the target labels as integers
from sklearn.preprocessing import LabelEncoder  # encodes target labels as integers from 0 to n_classes - 1
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)  # classes are sorted, so 'B' becomes 0 and 'M' becomes 1

# Test and Train Data Set
As in any supervised ML problem, we split our dataset into two parts: one to learn from and the other to test on. We will hold out 25% of the whole data for testing.

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
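To see what a 25% hold-out means in row counts, here is a small sketch with placeholder arrays of the same length as the dataset (569 rows). The array contents are arbitrary; only the shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 569 rows, matching the row count of data.shape
X_demo = np.arange(569 * 3).reshape(569, 3)
y_demo = np.arange(569) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# The test share is rounded up, so 25% of 569 rows becomes 143 test rows
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, so the same rows end up in the test set on every run.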

## Feature Scaling
Most of the time, our data contains features whose values vary widely in scale, anything from nanometers to kilometers, for example. The problem is that many algorithms (distance-based ones in particular) look only at the magnitude and ignore the units, so features with large magnitudes weigh more than others.
Look at our data values: area_mean is in the hundreds or thousands while smoothness_mean is around 0.1. So we scale all features to a comparable range.

In [14]:
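# Feature scaling: standardize each feature to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn mean and std from the training set only
X_test = sc.transform(X_test)        # reuse the training statistics on the test set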
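The effect of StandardScaler can be checked on a tiny example. The two columns below are made-up numbers on very different scales. After scaling, both training columns have mean 0 and standard deviation 1, and the test row is transformed with the statistics learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
train = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0]])
test = np.array([[2.0, 4000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # z = (x - mean) / std, per column
test_scaled = scaler.transform(test)        # uses the mean and std of `train`

print(train_scaled.mean(axis=0))
print(train_scaled.std(axis=0))
print(test_scaled)
```

Fitting the scaler on the training set alone keeps information about the test set out of the preprocessing step.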

# Train your Model
We are going to use the Random Forest Classifier. Which model to use and when is a discussion for another detailed blog in itself.
Random Forest Classifier is an ensemble algorithm, i.e. it combines multiple models of the same type, here decision trees, to classify objects.
In simple words, a Random Forest Classifier builds a set of decision trees, each from a randomly selected subset of the training set. It then aggregates the votes from the different decision trees to decide the final class of the test object.
Let’s see how it’s implemented in our case.

In [16]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)

RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0)

Here n_estimators is the number of decision trees, and criterion is the function used to measure the quality of a split: “gini” for Gini impurity and “entropy” for information gain. random_state, if an int, is the seed used by the random number generator; if None, the global random state of np.random is used, so results can differ from run to run.
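To make the aggregation step concrete, the sketch below trains a small forest on synthetic data from `make_classification` (an assumption made for illustration, not the breast cancer data) and rebuilds the forest's prediction from its individual trees. Note that sklearn's implementation averages the class probabilities of the trees rather than counting hard votes, so the sketch does the same.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data, used only to illustrate the mechanics
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted decision tree is available in estimators_
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average the per-tree class probabilities and pick the most likely class
mean_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

# Fraction of samples where the manual aggregate matches forest.predict
agreement = (manual_pred == forest.predict(X_demo)).mean()
print(len(forest.estimators_), agreement)
```

Averaging many trees that each saw a different random sample is what makes the forest more stable than any single tree.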

# Test your Model
Now that you have created and trained your model, let’s test to see how well it performs.
Let’s store all the predictions in Y_pred.

In [18]:
Y_pred = classifier.predict(X_test)

Now, to see how well our model has predicted, I am going to look at its confusion matrix.

In [19]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)

In [20]:
cm

array([[89,  1],
       [ 1, 52]])

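It has actually performed well! The rows of the matrix are the true classes and the columns are the predicted classes, and LabelEncoder assigned 0 to B (benign) and 1 to M (malignant). So 89 benign and 52 malignant samples were correctly predicted, while one benign sample was predicted as malignant and one malignant sample was predicted as benign. That’s a pretty good result.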
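Here is a minimal sketch of how the four cells of a binary confusion matrix are laid out. The label arrays are constructed by hand so that they produce the same counts as the matrix above; they are not the notebook's actual Y_test and Y_pred. With LabelEncoder, 0 stands for B and 1 for M.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-built labels (0 = B, 1 = M) chosen to reproduce the counts above
y_true = np.array([0] * 90 + [1] * 53)
y_hat = np.array([0] * 89 + [1] * 1 + [0] * 1 + [1] * 52)

cm_demo = confusion_matrix(y_true, y_hat)

# Rows are true classes, columns are predicted classes:
# [[true 0 / pred 0, true 0 / pred 1],
#  [true 1 / pred 0, true 1 / pred 1]]
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print(tn, fp, fn, tp)
```

With malignant as the positive class, `fn` counts malignant tumors predicted as benign, which is usually the more costly kind of mistake in a setting like this.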
Let’s see the accuracy.

In [21]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(Y_test,Y_pred)

In [22]:
accuracy

0.986013986013986

About 98.6% accuracy, that is 141 of 143 test samples classified correctly, is a pretty great result.
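The accuracy can also be read straight off the confusion matrix: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total. A short sketch, with the matrix values copied from the output above:

```python
import numpy as np

# Confusion matrix values copied from the notebook output
cm_demo = np.array([[89, 1],
                    [1, 52]])

correct = np.trace(cm_demo)   # diagonal: 89 + 52 correctly classified samples
total = cm_demo.sum()         # all test samples
accuracy_from_cm = correct / total

print(correct, total, accuracy_from_cm)
```

This gives the same value as accuracy_score, since both count the share of test samples whose predicted label equals the true label.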