# Getting benchmarks for our models

### Concept: we create very simplistic "dummy" models that we need to beat

In [1]:
from sklearn.dummy import DummyClassifier, DummyRegressor

### Don't forget about the train/test split! We're setting up an experiment after all...

By default, DummyRegressor always predicts the mean of the training targets. If we can't beat DummyRegressor, we don't have a good model!

In [2]:
#dr = DummyRegressor()
#dr.fit(X, y)
#should we use MSE or MAE?
#cross_val_score(dr, X, y, cv=10)
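
As a minimal sketch of the commented-out cell above, the baseline can be cross-validated under both MAE and MSE. The data here is synthetic and purely a stand-in (an assumption, since X and y are not defined at this point in the lab); only the scikit-learn calls are meant to carry over.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; substitute your own X and y)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(size=200)

# Predicts the mean of y from each training fold, ignoring X entirely
dr = DummyRegressor(strategy="mean")

# scikit-learn maximizes scores, so error metrics are exposed negated
mae_scores = -cross_val_score(dr, X, y, cv=10, scoring="neg_mean_absolute_error")
mse_scores = -cross_val_score(dr, X, y, cv=10, scoring="neg_mean_squared_error")

baseline_mae = mae_scores.mean()
baseline_mse = mse_scores.mean()
print(f"baseline MAE: {baseline_mae:.3f}, baseline MSE: {baseline_mse:.3f}")
```

Any real model should be scored with the same cv and scoring arguments so that its numbers are directly comparable to these baselines.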

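DummyClassifier with strategy='most_frequent' always predicts the most common class (older scikit-learn releases defaulted to 'stratified', which guesses at random in proportion to the class frequencies). Depending on our metrics, we want to beat this to have a good model!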

In [3]:
#dc = DummyClassifier()
#dc.fit(X, y)
#what should we use to measure success? precision? recall? etc...
#cross_val_score(dc, X, y, cv=10)
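
The choice of metric changes how hard the dummy is to beat. The sketch below uses made-up labels (an assumption: 65% class 0, 35% class 1) to show that a majority-class predictor gets a respectable accuracy but zero recall on the minority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Made-up stand-in data (an assumption): 195 negatives, 105 positives
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))          # features are ignored by the dummy
y = np.array([0] * 195 + [1] * 105)

dc = DummyClassifier(strategy="most_frequent")

# For classifiers, an integer cv uses stratified folds
accuracy = cross_val_score(dc, X, y, cv=10, scoring="accuracy").mean()
recall = cross_val_score(dc, X, y, cv=10, scoring="recall").mean()

print(f"accuracy: {accuracy:.3f}, recall on class 1: {recall:.3f}")
```

Under accuracy the bar to clear is the majority share; under recall this particular dummy sets no bar at all, which is worth keeping in mind when reading the results.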

# Dataset 1: Predicting Benign vs Malignant Breast Cancer

File name: breast-cancer.csv

Attribute Information:

- ID is the FIRST COLUMN
- Diagnosis is the LAST COLUMN (2 for benign, 4 for malignant)
 

Nine real-valued features are computed for each cell nucleus: 

- radius (mean of distances from center to points on the perimeter) 
- texture (standard deviation of gray-scale values) 
- perimeter 
- area 
- smoothness (local variation in radius lengths) 
- compactness (perimeter^2 / area - 1.0) 
- concavity (severity of concave portions of the contour) 
- concave points (number of concave portions of the contour) 
- fractal dimension ("coastline approximation" - 1)

## Guiding questions

In [4]:
# importing process

In [5]:
!head breast-cancer.csv

1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
1018561,2,1,2,1,2,1,3,1,1,2
1033078,2,1,1,1,2,1,1,1,5,2
1033078,4,2,1,1,2,1,2,1,1,2


In [6]:
import pandas as pd
names = ['id', 'radius', 'texture', 'perimeter', 'area', 'smoothness', 'compact', 'concave', 'concave_points',
        'fractal', 'diagnosis']
cancer = pd.read_csv('breast-cancer.csv', names=names, na_values=['?'])
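# convert the y values into 0 (benign, coded 2) and 1 (malignant, coded 4)
# an explicit mapping does not depend on which class appears first in the file
cancer['diagnosis'] = cancer['diagnosis'].map({2: 0, 4: 1})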

In [7]:
cancer.head()

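| | id | radius | texture | perimeter | area | smoothness | compact | concave | concave_points | fractal | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 0 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 0 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 0 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 0 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 0 |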


In [8]:
# catching it early is very important
# we don't want to mistakenly tell someone they don't have cancer when they do
# what metrics should we use to evaluate our model?

There are 2 good answers here that depend on understanding the situation surrounding the problem.

- Recall is useful here because we want to detect as many instances of cancer as possible
- F1 might be a stronger answer because it balances recall against precision: we want to "not miss" cancer, but false positives also carry a cost (further testing, and possibly treatment, since chemotherapy is no joke)
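
To make the trade-off between the two answers concrete, here is a small worked example on hypothetical labels (not taken from breast-cancer.csv). The imagined model flags aggressively, so it catches every malignant case at the price of three false alarms.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# An aggressive model: finds all 4 cancers but also flags 3 benign cases
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 4 / 4
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 4 / 7
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 8 / 11

print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```

Judged on recall alone this model looks perfect; F1 pulls the score down because of the false positives.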

In [9]:
# which model performs best?

In [10]:
cancer.corr()

| | id | radius | texture | perimeter | area | smoothness | compact | concave | concave_points | fractal | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 1.0 | -0.055308 | -0.041603 | -0.041576 | -0.064878 | -0.045528 | -0.099248 | -0.060051 | -0.052072 | -0.034901 | -0.080226 |
| radius | -0.055308 | 1.0 | 0.644913 | 0.654589 | 0.486356 | 0.521816 | 0.593091 | 0.558428 | 0.535835 | 0.350034 | 0.716001 |
| texture | -0.041603 | 0.644913 | 1.0 | 0.906882 | 0.705582 | 0.751799 | 0.691709 | 0.755721 | 0.722865 | 0.458693 | 0.817904 |
| perimeter | -0.041576 | 0.654589 | 0.906882 | 1.0 | 0.683079 | 0.719668 | 0.713878 | 0.735948 | 0.719446 | 0.438911 | 0.818934 |
| area | -0.064878 | 0.486356 | 0.705582 | 0.683079 | 1.0 | 0.599599 | 0.670648 | 0.666715 | 0.603352 | 0.417633 | 0.6968 |
| smoothness | -0.045528 | 0.521816 | 0.751799 | 0.719668 | 0.599599 | 1.0 | 0.585716 | 0.616102 | 0.628881 | 0.479101 | 0.682785 |
| compact | -0.099248 | 0.593091 | 0.691709 | 0.713878 | 0.670648 | 0.585716 | 1.0 | 0.680615 | 0.58428 | 0.33921 | 0.822696 |
| concave | -0.060051 | 0.558428 | 0.755721 | 0.735948 | 0.666715 | 0.616102 | 0.680615 | 1.0 | 0.665878 | 0.344169 | 0.756616 |
| concave_points | -0.052072 | 0.535835 | 0.722865 | 0.719446 | 0.603352 | 0.628881 | 0.58428 | 0.665878 | 1.0 | 0.428336 | 0.712244 |
| fractal | -0.034901 | 0.350034 | 0.458693 | 0.438911 | 0.417633 | 0.479101 | 0.33921 | 0.344169 | 0.428336 | 1.0 | 0.42317 |


## Reasons for and against certain learning algorithms 

This is meant to get you thinking about "analysis". Here are a few thought starters:

- kNN: Many features are correlated and the ratio of features / examples is rather high, which could hurt its performance. However, it seems reasonable that instances of cancer will "look alike" so using the nearest neighbor makes sense in this context.
- NB: The assumption that all the features are conditionally independent probably rules out the learning algorithm in this case. The features are most likely dependent on one another since they are different measurements of the same area.
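
One way to test these thought starters rather than argue them is to cross-validate both candidates under the same metric. The sketch below is only an illustration: it uses make_classification with redundant (correlated) features as a stand-in for the cancer data, and the choices of k=5, standard scaling, and recall as the metric are assumptions, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 9 features, 5 of them linear combinations of the other 4
X, y = make_classification(
    n_samples=600, n_features=9, n_informative=4, n_redundant=5,
    weights=[0.65, 0.35], random_state=0,
)

models = {
    # kNN is distance based, so features are scaled inside the pipeline
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    # Gaussian NB treats features as conditionally independent given the class
    "NB": GaussianNB(),
}

recall_by_model = {
    name: cross_val_score(model, X, y, cv=10, scoring="recall").mean()
    for name, model in models.items()
}

for name, score in recall_by_model.items():
    print(f"{name}: mean recall = {score:.3f}")
```

Results on synthetic data say nothing about the real dataset; the point is the comparison harness, which can be rerun with the actual features and labels.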

In [11]:
# how do you determine what's "good" performance?

In [12]:
# dummyclassifier
dc = DummyClassifier()
# or we can calculate the majority by ourselves
# remember to do this on the train data
cancer['diagnosis'].value_counts() / cancer.shape[0]

0    0.655222
1    0.344778
Name: diagnosis, dtype: float64
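
The reminder to compute the majority on the train data can be sketched as follows. The labels are a hypothetical stand-in for cancer['diagnosis'] (65 benign, 35 malignant), and the 80/20 stratified split is an assumption.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for cancer['diagnosis']: 65 benign (0), 35 malignant (1)
y = pd.Series([0] * 65 + [1] * 35, name="diagnosis")
X = pd.DataFrame({"feature": range(len(y))})  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Class shares measured on the training portion only
train_shares = y_train.value_counts(normalize=True)
majority_class = train_shares.idxmax()

# The dummy learns the majority from the train set and is scored on the test set
dc = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
test_accuracy = dc.score(X_test, y_test)

print(train_shares)
print(f"majority class: {majority_class}, test accuracy: {test_accuracy:.2f}")
```

Keeping the test rows out of the baseline calculation means the dummy is evaluated under the same conditions as any real model.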

In [13]:
# don't forget all the ways we discussed to evaluate performance

Does speed matter in this case? Most likely not, but it is still important to think about. Ensemble models like Random Forest will generally take longer to train than simpler models such as NB.

# Dataset 2: Predicting forest fires

Filename: forestfires.csv

Attribute Information:

- X - x-axis spatial coordinate within the Montesinho park map: 1 to 9 
- Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9 
- month - month of the year: 'jan' to 'dec' 
- day - day of the week: 'mon' to 'sun' 
- FFMC - FFMC index from the FWI system: 18.7 to 96.20 
- DMC - DMC index from the FWI system: 1.1 to 291.3 
- DC - DC index from the FWI system: 7.9 to 860.6 
- ISI - ISI index from the FWI system: 0.0 to 56.10 
- temp - temperature in Celsius degrees: 2.2 to 33.30 
- RH - relative humidity in %: 15.0 to 100 
- wind - wind speed in km/h: 0.40 to 9.40 
- rain - outside rain in mm/m2 : 0.0 to 6.4 
- area (our prediction goal) - the burned area of the forest (in ha): 0.00 to 1090.84 
(this output variable is very skewed towards 0.0, so it may make 
sense to model with the logarithm transform)
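
A minimal sketch of the logarithm transform mentioned for area. Because many areas are exactly 0.0 and log(0) is undefined, this uses log(1 + area) via np.log1p, which is an assumption about how the transform would be applied. The values are illustrative points within the stated 0.00 to 1090.84 range.

```python
import numpy as np

# Illustrative burned areas in ha, from 0.0 up to the stated maximum
area = np.array([0.0, 0.0, 0.52, 6.57, 1090.84])

log_area = np.log1p(area)       # log(1 + area), maps 0.0 to 0.0
recovered = np.expm1(log_area)  # inverse transform, back to ha

print(log_area.round(3))
```

A model trained on the transformed target predicts on the log scale, so its predictions need np.expm1 applied before errors are reported in hectares.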

In [14]:
!head forestfires.csv

X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0,0
7,4,oct,tue,90.6,35.4,669.1,6.7,18,33,0.9,0,0
7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0,0
8,6,mar,fri,91.7,33.3,77.5,9,8.3,97,4,0.2,0
8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0,0
8,6,aug,sun,92.3,85.3,488,14.7,22.2,29,5.4,0,0
8,6,aug,mon,92.3,88.9,495.6,8.5,24.1,27,3.1,0,0
8,6,aug,mon,91.5,145.4,608.2,10.7,8,86,2.2,0,0
8,6,sep,tue,91,129.5,692.6,7,13.1,63,5.4,0,0


In [15]:
fire = pd.read_csv('forestfires.csv')

In [16]:
fire.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 517 entries, 0 to 516
Data columns (total 13 columns):
X        517 non-null int64
Y        517 non-null int64
month    517 non-null object
day      517 non-null object
FFMC     517 non-null float64
DMC      517 non-null float64
DC       517 non-null float64
ISI      517 non-null float64
temp     517 non-null float64
RH       517 non-null int64
wind     517 non-null float64
rain     517 non-null float64
area     517 non-null float64
dtypes: float64(8), int64(3), object(2)
memory usage: 56.5+ KB


In [17]:
fire.head()

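| | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0 |
| 1 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0 |
| 2 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0 |
| 3 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0 |
| 4 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | 0 |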


In [18]:
fire.describe()

| | X | Y | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 |
| mean | 4.669246 | 4.299807 | 90.644681 | 110.87234 | 547.940039 | 9.021663 | 18.889168 | 44.288201 | 4.017602 | 0.021663 | 12.847292 |
| std | 2.313778 | 1.2299 | 5.520111 | 64.046482 | 248.066192 | 4.559477 | 5.806625 | 16.317469 | 1.791653 | 0.295959 | 63.655818 |
| min | 1.0 | 2.0 | 18.7 | 1.1 | 7.9 | 0.0 | 2.2 | 15.0 | 0.4 | 0.0 | 0.0 |
| 25% | 3.0 | 4.0 | 90.2 | 68.6 | 437.7 | 6.5 | 15.5 | 33.0 | 2.7 | 0.0 | 0.0 |
| 50% | 4.0 | 4.0 | 91.6 | 108.3 | 664.2 | 8.4 | 19.3 | 42.0 | 4.0 | 0.0 | 0.52 |
| 75% | 7.0 | 5.0 | 92.9 | 142.4 | 713.9 | 10.8 | 22.8 | 53.0 | 4.9 | 0.0 | 6.57 |
| max | 9.0 | 9.0 | 96.2 | 291.3 | 860.6 | 56.1 | 33.3 | 100.0 | 9.4 | 6.4 | 1090.84 |


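Notice the potential outliers above! Rain has a max of 6.4 against a 75th percentile of 0.0, ISI peaks at 56.1 against a 75th percentile of 10.8, and the target area reaches 1090.84 against a 75th percentile of 6.57. Wind is milder by comparison (max 9.4, about three standard deviations above the mean).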

In [19]:
# large fires are very rare, so we care more about getting most of our predictions right
# not as concerned with finding that one major fire
# which metrics should we use to evaluate our model?

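We should use MAE in this case since we do not need to penalize large errors more heavily. MSE squares each error, so a few rare, very large fires would dominate the score; MAE weights every error in proportion to its size and gives us an idea of how accurate our typical fire prediction is.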
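
A small worked example, on made-up numbers, of how differently the two metrics treat one badly missed large fire among several small errors:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up burned areas: four small fires and one large one that is badly missed
y_true = np.array([0.0, 0.0, 0.0, 0.0, 100.0])
y_pred = np.array([1.0, 1.0, 1.0, 1.0, 10.0])

mae = mean_absolute_error(y_true, y_pred)  # (1 + 1 + 1 + 1 + 90) / 5 = 18.8
mse = mean_squared_error(y_true, y_pred)   # (1 + 1 + 1 + 1 + 8100) / 5 = 1620.8

# Share of each total that comes from the single large miss
abs_err = np.abs(y_true - y_pred)
share_of_mae = abs_err[-1] / abs_err.sum()               # 90 / 94
share_of_mse = abs_err[-1] ** 2 / (abs_err ** 2).sum()   # 8100 / 8104

print(f"MAE={mae:.1f}  MSE={mse:.1f}")
print(f"large miss share: MAE {share_of_mae:.1%}, MSE {share_of_mse:.2%}")
```

The large miss matters under both metrics, but under MSE the four small errors contribute almost nothing to the total.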

In [20]:
# which model performs best?

- Random Forest (kinda ironic, but no pun intended) is a good choice in this situation. We are concerned about outliers, and RF tends to be robust to outliers in the predictors because tree splits depend on the ordering of values rather than their magnitude.
- What about the features? Are all of them relevant? We can use RF or Lasso to test out some theories.
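
A hedged sketch of how RF and Lasso could each be used to probe feature relevance. The data is synthetic (make_regression with only the first two of six features informative), and the hyperparameters n_estimators=200 and alpha=1.0 are arbitrary assumptions rather than tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: with shuffle=False, only features 0 and 1 drive the target
X, y = make_regression(
    n_samples=300, n_features=6, n_informative=2,
    noise=5.0, shuffle=False, random_state=0,
)

# Random Forest: impurity-based importances, normalized to sum to 1
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importances = rf.feature_importances_

# Lasso: the L1 penalty can shrink coefficients of weak features to exactly zero
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=1.0).fit(X_scaled, y)
kept_by_lasso = np.flatnonzero(lasso.coef_)

print("RF importances:", importances.round(3))
print("features with non-zero Lasso coefficient:", kept_by_lasso)
```

On the fire data the categorical month and day columns would need encoding first, and agreement between the two methods is a stronger signal than either one alone.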

In [21]:
# how do you determine what's "good" performance?

In [22]:
# we can use DummyRegressor()
# or we can calculate the mean ourselves
# remember to do this on the train data
fire['area'].mean()

12.847292069632491
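
Since the evaluation metric here is MAE, it may be worth checking a median baseline alongside the mean: the median is the constant that minimizes MAE on the data it is computed from, so for a skewed target it is usually the harder dummy to beat. The sketch below uses a synthetic right-skewed target as a stand-in for area (an assumption) and fits both baselines on the train split only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic right-skewed target standing in for burned area (an assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                      # placeholder features
y = rng.lognormal(mean=0.0, sigma=2.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Both baselines learn a single constant from the training targets
mean_baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
median_baseline = DummyRegressor(strategy="median").fit(X_train, y_train)

mae_mean = mean_absolute_error(y_test, mean_baseline.predict(X_test))
mae_median = mean_absolute_error(y_test, median_baseline.predict(X_test))

print(f"test MAE, mean baseline:   {mae_mean:.3f}")
print(f"test MAE, median baseline: {mae_median:.3f}")
```

Whichever baseline scores better on the chosen metric is the one a real model should be compared against.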

In [23]:
# don't forget all the ways we discussed to evaluate performance

# Note: this lab is meant to be a gentle thought-starter for HW2

- Please move on to HW2 once you have thought about the questions asked above