## 2.2 Random Forest
In this task you are supposed to implement two tree-based classifiers: 1. A decision tree using only Python. 2. A random forest using a library such as sklearn.
The classification task is to predict whether the cancer recurs.
* Download the Breast Cancer dataset https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data
* Predict the two classes: Not-recurrent and recurrent
* Implement and test a decision tree from scratch using Python and standard libraries.
* Implement and test random forest with a library such as sklearn.
* Choose the network architecture with care.
* Train and validate all algorithms.
* Make the necessary assumptions.
* You can be in groups of up to 3.
* Hand-in: A one-page report, to be delivered at the end of the semester.
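
The from-scratch decision tree asked for in the list above could be approached roughly as follows. This is only a minimal sketch using the standard library: it assumes categorical features, equality splits chosen by Gini impurity, and a fixed depth limit. The names `gini`, `best_split`, `build_tree`, `predict` and the toy rows are made up for illustration and are not part of the assignment.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows, labels):
    """Return (column, value) of the equality split with the lowest weighted
    Gini impurity, or None if no split improves on the unsplit node."""
    best, best_score = None, gini(labels)
    for col in range(len(rows[0])):
        for value in sorted({row[col] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[col] == value]
            right = [y for row, y in zip(rows, labels) if row[col] != value]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (col, value), score
    return best


def build_tree(rows, labels, depth=0, max_depth=5):
    """Recursively build a tree; a leaf is simply the majority class label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    col, value = split
    match = [(r, y) for r, y in zip(rows, labels) if r[col] == value]
    other = [(r, y) for r, y in zip(rows, labels) if r[col] != value]
    return {
        "col": col,
        "value": value,
        "match": build_tree([r for r, _ in match], [y for _, y in match], depth + 1, max_depth),
        "other": build_tree([r for r, _ in other], [y for _, y in other], depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk down the tree until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["match"] if row[tree["col"]] == tree["value"] else tree["other"]
    return tree


# Made-up toy rows (node-caps, deg-malig), not taken from the real dataset
toy_rows = [["yes", "3"], ["yes", "2"], ["no", "1"],
            ["no", "2"], ["yes", "3"], ["no", "1"]]
toy_labels = ["recurrence", "recurrence", "no-recurrence",
              "no-recurrence", "recurrence", "no-recurrence"]
tree = build_tree(toy_rows, toy_labels)
print(predict(tree, ["yes", "1"]))
```

Equality splits keep the sketch short for categorical data; threshold splits on ordered codes would be an equally reasonable choice.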


## Get and prepare data
* Number of Instances: 286
* Number of Attributes: 9 + the class attribute
* Attribute Information:<br>
   1\. Class: no-recurrence-events, recurrence-events<br>
   2\. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.<br>
   3\. menopause: lt40, ge40, premeno.<br>
   4\. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.<br>
   5\. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.<br>
   6\. node-caps: yes, no.<br>
   7\. deg-malig: 1, 2, 3.<br>
   8\. breast: left, right.<br>
   9\. breast-quad: left_up, left_low, right_up, right_low, central.<br>
  10\. irradiat: yes, no.<br>
* Missing Attribute Values (denoted by "?"):<br>
   Attribute 6 (node-caps): 8 instances with missing values<br>
   Attribute 9 (breast-quad): 1 instance with missing values<br>
* Class Distribution:<br>
    1\. no-recurrence-events: 201 instances<br>
    2\. recurrence-events: 85 instances<br>

In [2]:
import random

import pandas as pd
import requests

# Get data
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data"
r = requests.get(data_url)
r.raise_for_status()  # fail early if the download did not succeed

# Split the response into a list of rows, dropping blank lines and rows with missing values ('?')
dataset = [line.split(',') for line in r.text.splitlines() if line.strip() and '?' not in line]

# Shuffle data
random.seed(2411)
random.shuffle(dataset)

# Column names, in the order in which they appear in the data file
columns = ['class', 'age', 'menopause', 'tumor-size', 'inv-nodes',
           'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']
pd_data = pd.DataFrame(dataset, columns=columns)

pd_data.head()

,class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
0,no-recurrence-events,50-59,ge40,20-24,0-2,no,1,right,left_low,no
1,recurrence-events,30-39,premeno,0-4,0-2,no,2,right,central,no
2,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,left,left_low,no
3,no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
4,no-recurrence-events,60-69,ge40,45-49,6-8,yes,3,left,central,no


In [3]:
# Preprocessing (convert categories to numbers)
from sklearn import preprocessing

# Fit one LabelEncoder per column and keep it, so that the codes can be mapped
# back to the original category names with encoders[col].inverse_transform(...).
# Note: LabelEncoder assigns codes in sorted (lexicographic) order, so for
# range-valued columns such as 'tumor-size' the codes do not follow numeric order.
encoders = {}
for col in pd_data.columns:
    encoders[col] = preprocessing.LabelEncoder()
    pd_data[col] = encoders[col].fit_transform(pd_data[col])

pd_data.head()

,class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
0,0,3,0,3,0,0,0,1,1,0
1,1,1,2,0,0,0,1,1,0,0
2,0,4,0,2,0,0,1,0,1,0
3,0,1,2,5,0,0,2,0,1,0
4,0,4,0,8,5,1,2,0,0,0
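
One thing to be aware of in the encoded table above: a label encoder sorts the category strings lexicographically, so a range such as `5-9` ends up after `45-49` instead of right after `0-4`. For tree models this may matter little, but if an ordered encoding is wanted, something like the sketch below could be used. The helper name `ordinal_codes` is hypothetical, and the sketch assumes every value has the form `low-high`.

```python
def ordinal_codes(values):
    """Map range strings such as '5-9' to integer codes ordered by lower bound."""
    def lower_bound(value):
        return int(value.split("-")[0])

    ordered = sorted(set(values), key=lower_bound)
    return {value: code for code, value in enumerate(ordered)}


# A few of the tumor-size categories from the attribute list
tumor_sizes = ["0-4", "5-9", "10-14", "45-49", "50-54"]

print(sorted(tumor_sizes))         # lexicographic order: '5-9' lands after '45-49'
print(ordinal_codes(tumor_sizes))  # codes that follow the numeric order of the ranges
```

On the raw string columns (before label encoding), this could be applied with, for example, `pd_data['tumor-size'].map(ordinal_codes(pd_data['tumor-size']))`; the same idea would fit `age` and `inv-nodes`.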


In [13]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

X = pd_data[['age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']]  # Features
y = pd_data['class']  # Labels

# Split dataset into training set and test set.
# No random_state is passed, so the split (and the accuracy reported below) changes from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)  # 70% training and 30% test
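
Because the classes are imbalanced (201 vs. 85 instances), a purely random split can leave the test set with a different class ratio than the training set. `train_test_split` has `stratify` and `random_state` parameters for this; for the from-scratch part of the task, a stratified and repeatable split could be sketched with the standard library as below. The function name `stratified_split` and the toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, labels, test_size=0.3, seed=2411):
    """Split rows/labels so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)  # local generator: a fixed seed gives a repeatable split
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Shuffle again so that the classes are not grouped together
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Made-up data: 20 rows of class 0 and 10 rows of class 1
rows = [[i] for i in range(30)]
labels = [0] * 20 + [1] * 10
X_tr, X_te, y_tr, y_te = stratified_split(rows, labels)
print(len(y_tr), len(y_te), y_te.count(1))
```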

## Training

In [14]:
# Import the random forest model
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)

# Train the model on the training set
clf.fit(X_train, y_train)

# Predict the classes of the test set
y_pred = clf.predict(X_test)

## Accuracy

In [15]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.702380952381
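
To put this number in context: 201 of the 286 instances (about 70%) belong to the no-recurrence class, so a model that always predicts "no recurrence" would already score around 0.70 on the full dataset. The exact baseline on a given test split depends on the random split. A rough way to check this, and to see how many recurrence cases are actually found, is sketched below using only the standard library. The function names and the toy labels are made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label in y_true."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


# Made-up labels: 1 = recurrence-events, 0 = no-recurrence-events
y_true_toy = [0] * 7 + [1] * 3
y_pred_toy = [0] * 7 + [0, 0, 1]

print("Baseline accuracy:", majority_baseline_accuracy(y_true_toy))
tp, fp, fn, tn = confusion_counts(y_true_toy, y_pred_toy)
print("Recall for class 1:", tp / (tp + fn))
```

With the notebook's variables this could be called as `majority_baseline_accuracy(list(y_test))` and `confusion_counts(list(y_test), list(y_pred))`.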


Adapted from the tutorial at: https://www.datacamp.com/community/tutorials/random-forests-classifier-python
With input from: https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree
