In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets

In [2]:
# load breast cancer dataset
data = datasets.load_breast_cancer()

# convert to pandas data frame
features = data.feature_names
dataDF = pd.DataFrame(data.data, columns=features)

# add binary target variable
target = 'target'
dataDF['target'] = data.target

# display 
dataDF.sample(5)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
163,12.34,22.22,79.85,464.5,0.1012,0.1015,0.0537,0.02822,0.1551,0.06761,...,28.68,87.36,553.0,0.1452,0.2338,0.1688,0.08194,0.2268,0.09082,1
439,14.02,15.66,89.59,606.5,0.07966,0.05581,0.02087,0.02652,0.1589,0.05586,...,19.31,96.53,688.9,0.1034,0.1017,0.0626,0.08216,0.2136,0.0671,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
494,13.16,20.54,84.06,538.7,0.07335,0.05275,0.018,0.01256,0.1713,0.05888,...,28.46,95.29,648.3,0.1118,0.1646,0.07698,0.04195,0.2687,0.07429,1
336,12.99,14.23,84.08,514.3,0.09462,0.09965,0.03738,0.02098,0.1652,0.07238,...,16.91,87.38,576.0,0.1142,0.1975,0.145,0.0585,0.2432,0.1009,1


In [8]:
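# shuffle, then split into train (80%), validation (10%) and test (10%)
n = dataDF.shape[0]
train_end, validation_end = int(0.8 * n), int(0.9 * n)
shuffledDF = dataDF.sample(frac=1, random_state=10)
trainDF = shuffledDF.iloc[:train_end]
validationDF = shuffledDF.iloc[train_end:validation_end]
testDF = shuffledDF.iloc[validation_end:]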

print("Train: ", trainDF.shape)
print("Validation: ", validationDF.shape)
print("Test: ", testDF.shape)



Train:  (455, 31)
Validation:  (57, 31)
Test:  (57, 31)


# Simple Logistic Regression

In [9]:
from sklearn.linear_model import LogisticRegression

In [10]:
model = LogisticRegression(solver='lbfgs', max_iter=5000)
model.fit(trainDF[features], trainDF[target])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=5000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [11]:
predicted = model.predict(testDF[features])

In [12]:
from sklearn.metrics import accuracy_score
print(accuracy_score(testDF[target], predicted))

0.9122807017543859


# Using AdaBoost

The idea behind AdaBoost, short for "Adaptive Boosting", is simple. Think back to your school days preparing for an exam. Once you had worked through and practised the whole syllabus, you would spend more time practising the hard problems. Boosting is based on the same principle. Once we have an initial classifier that treats every data point equally, we start focusing more on the data points that the previous classifier got wrong. The idea is that the direction of gradient descent should be influenced more by the data points our model has trouble with. One simple way to achieve this is to increase the weight of those samples. Recall how, in chapter [4](04-linearregression.html), we used the mean squared error (MSE) as the objective function to minimize, computed as follows:

$$MSE = \frac{1}{m} \sum_{i=1}^{m}\left[y_i - \sum_{j=1}^{n}\theta_j X_{ij}\right]^2$$

Here $i$ indexes the $m$ samples and $j$ indexes the $n$ features. We can generalize the above function to incorporate a per-sample weight, say $w_i$, as:

$$MSE = \frac{1}{m} \sum_{i=1}^{m} w_i\left[y_i - \sum_{j=1}^{n}\theta_j X_{ij}\right]^2$$

Now, if we take the partial derivative of the above equation with respect to each $\theta_j$, we get a gradient descent direction that is influenced by the sample weights. Thus, we can assign higher weights to difficult data points and steer gradient descent towards them.
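To make the effect of the weights concrete, here is a minimal NumPy sketch. It assumes the weight $w_i$ scales each sample's squared error, so the loss is $\frac{1}{m}\sum_i w_i r_i^2$ with residual $r_i$. The function name `weighted_mse_gradient` and the toy numbers are made up for illustration and do not come from this notebook.

```python
import numpy as np

def weighted_mse_gradient(X, y, theta, w):
    """Gradient w.r.t. theta of (1/m) * sum_i w_i * (y_i - x_i . theta)^2."""
    m = X.shape[0]
    residual = y - X @ theta              # r_i for every sample
    return (-2.0 / m) * (X.T @ (w * residual))

# Toy data: 3 samples, 2 features (illustrative values only)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 5.0])
theta = np.zeros(2)

uniform = np.ones(3)                      # every sample counts equally
boosted = np.array([1.0, 1.0, 5.0])       # up-weight the third ("hard") sample

grad_uniform = weighted_mse_gradient(X, y, theta, uniform)
grad_boosted = weighted_mse_gradient(X, y, theta, boosted)
print(grad_uniform, grad_boosted)
```

With the third sample up-weighted, the gradient is larger in magnitude and dominated by that sample's residual, which is the behaviour described above.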

The idea makes sense, but the remaining problem is how to find the optimal weight for the difficult data points. If it is too low, the gradient descent direction might not change at all; if it is too high, gradient descent might be influenced too much by the hard data points and start having trouble with other data points that the model could previously handle. AdaBoost solves this problem.
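As a rough illustration of how AdaBoost picks the weights, the sketch below applies one round of the textbook discrete AdaBoost update. It is not code from this notebook: the helper name `adaboost_reweight` and the toy labels are assumptions made for the example. The weighted error rate of the current classifier determines a factor $\alpha$. Misclassified samples are scaled up by $e^{\alpha}$, correctly classified ones are scaled down by $e^{-\alpha}$, and the weights are then renormalized.

```python
import numpy as np

def adaboost_reweight(weights, y_true, y_pred):
    """One round of the discrete AdaBoost sample-weight update."""
    miss = (y_true != y_pred)
    # Weighted error rate of the current classifier
    err = np.sum(weights[miss]) / np.sum(weights)
    # Guard against an error of exactly 0 or 1 (log / division blow-up)
    err = np.clip(err, 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1.0 - err) / err)
    # Misclassified samples go up by exp(alpha), the rest go down by exp(-alpha)
    new_weights = weights * np.exp(np.where(miss, alpha, -alpha))
    return new_weights / new_weights.sum(), alpha

# Toy example: 5 samples, the classifier gets index 2 wrong
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])
w0 = np.repeat(1.0 / 5, 5)                # uniform starting weights

w1, alpha = adaboost_reweight(w0, y_true, y_pred)
print(w1, alpha)
```

After the update, the single misclassified sample carries half of the total weight, so the next classifier cannot ignore it, while the weights of the other samples shrink without vanishing.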

There is still one challenge, 


* AdaBoost --> adaptive boosting

**References**:
1. [Boosting with Adaboost and Gradient Boosting](https://medium.com/diogo-menezes-borges/boosting-with-adaboost-and-gradient-boosting-9cbab2a1af81)
2. [A comprehensive guide to ensemble learning](https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/)

In [16]:
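# AdaBoost starts from uniform sample weights (1/m each).
# NOTE: sample_weights is not passed to fit() yet, so model1 is a plain, unweighted baseline.
sample_weights = np.repeat(1./trainDF.shape[0], trainDF.shape[0])
model1 = LogisticRegression(solver='lbfgs', max_iter=5000)
model1.fit(trainDF[features], trainDF[target])
# accuracy on the training set
predicted = model1.predict(trainDF[features])
print(accuracy_score(trainDF[target], predicted))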

0.9626373626373627
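
For comparison, scikit-learn ships a ready-made `AdaBoostClassifier`. The sketch below is an assumption about how one might try it on the same dataset and is not part of the original notebook. It uses `train_test_split` rather than the split above, and `n_estimators=50` and `random_state=10` are illustrative choices. By default the weak learner is a depth-1 decision tree rather than logistic regression.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the same breast cancer data and hold out 20% for testing
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)

# Boost 50 weak learners; each round reweights the training samples internally
booster = AdaBoostClassifier(n_estimators=50, random_state=10)
booster.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, booster.predict(X_test))
print(test_accuracy)
```

The resulting accuracy is not directly comparable to the numbers above, since both the split and the base learner differ.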
