# Getting benchmarks for our models

### Concept: we create very simplistic "dummy" models that we need to beat

In [7]:
from sklearn.dummy import DummyClassifier, DummyRegressor

### Don't forget about the train/test split! We're setting up an experiment, after all...

By default, DummyRegressor always predicts the mean of the training targets. If we can't beat DummyRegressor, we don't have a good model!
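As a minimal sketch of the idea above, the cell below fits a DummyRegressor on a training split and checks it on a held-out test split. The data is synthetic and purely illustrative (the `X`, `y` arrays and the 75/25 split are assumptions, not part of the lab datasets).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(size=200)

# Hold out a test set so the baseline is judged on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

dr = DummyRegressor()  # default strategy is "mean"
dr.fit(X_train, y_train)

# Every prediction is the mean of the training targets, whatever X_test holds
baseline_pred = dr.predict(X_test)

# R^2 of a constant prediction on held-out data is never above 0
baseline_r2 = dr.score(X_test, y_test)
print(baseline_pred[:3], baseline_r2)
```

Any real model should reach an R^2 clearly above this baseline on the same test split.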

In [None]:
#from sklearn.model_selection import cross_val_score
#dr = DummyRegressor()
#dr.fit(X, y)
#should we use MSE or MAE? (pass it via the scoring argument)
#cross_val_score(dr, X, y, cv=10)
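One possible way to fill in the cell above, sketched on synthetic data (the arrays are made up for illustration). scikit-learn's `scoring` strings for errors are negated so that "higher is better" holds, which is why the sign is flipped at the end.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(size=200)

dr = DummyRegressor()

# cross_val_score returns one score per fold; error metrics come back negated
neg_mae = cross_val_score(dr, X, y, cv=10, scoring="neg_mean_absolute_error")
neg_mse = cross_val_score(dr, X, y, cv=10, scoring="neg_mean_squared_error")

baseline_mae = -neg_mae.mean()
baseline_mse = -neg_mse.mean()
print(f"baseline MAE: {baseline_mae:.3f}, baseline MSE: {baseline_mse:.3f}")
```

Whichever metric you choose, report the same metric for the dummy model and for your real models so the comparison is fair.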

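DummyClassifier (with `strategy="most_frequent"`, or the default strategy in recent scikit-learn versions) always predicts the most common class in the training data. Depending on our metrics, we want to beat this to have a good model!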

In [None]:
#dc = DummyClassifier()
#dc.fit(X, y)
#what should we use to measure success? precision? recall? etc...
#cross_val_score(dc, X, y, cv=10)
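A minimal sketch of why the choice of metric matters for the classification baseline. The labels here are synthetic (80% class 0, 20% class 1, chosen only for illustration), and `strategy="most_frequent"` is set explicitly so the behaviour does not depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced labels: 240 negatives and 60 positives (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.array([0] * 240 + [1] * 60)

dc = DummyClassifier(strategy="most_frequent")

# cross_validate can report several metrics at once
scores = cross_validate(dc, X, y, cv=10, scoring=["accuracy", "recall"])

print("accuracy per fold:", scores["test_accuracy"])
print("recall per fold:  ", scores["test_recall"])
```

The dummy model looks respectable on accuracy simply because the majority class is common, yet it never finds a single positive case, so its recall is zero.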

# Dataset 1: Predicting Benign vs Malignant Breast Cancer

File name: breast-cancer.csv

Attribute Information:

- ID is the FIRST COLUMN
- Diagnosis is the LAST COLUMN (2 for benign, 4 for malignant)
 

Nine real-valued features are computed for each cell nucleus: 

- radius (mean of distances from center to points on the perimeter) 
- texture (standard deviation of gray-scale values) 
- perimeter 
- area 
- smoothness (local variation in radius lengths) 
- compactness (perimeter^2 / area - 1.0) 
- concavity (severity of concave portions of the contour) 
- concave points (number of concave portions of the contour) 
- fractal dimension ("coastline approximation" - 1)
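A sketch of how the layout described above could be turned into features and a target. It assumes the file has no header row, which may not be true of `breast-cancer.csv`. The three rows below are made-up stand-ins so the cell runs without the file. Swap `toy_csv` for the real file path once you have checked its layout.

```python
import io
import pandas as pd

# Made-up rows (ID, nine features, diagnosis) standing in for breast-cancer.csv
toy_csv = io.StringIO(
    "1001,5,1,1,1,2,1,3,1,1,2\n"
    "1002,8,10,10,8,7,10,9,7,1,4\n"
    "1003,1,1,1,1,2,1,3,1,1,2\n"
)

# Assumption: no header row in the file
df = pd.read_csv(toy_csv, header=None)

# Drop the ID (first column) and the diagnosis (last column) from the features
X = df.iloc[:, 1:-1]

# Recode the diagnosis so that malignant (4) is the positive class
y = df.iloc[:, -1].map({2: 0, 4: 1})

print(X.shape, y.tolist())
```

Leaving the ID column in the features would let a model memorise rows instead of learning from the measurements, so it is dropped here.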

## Guiding questions

In [1]:
# catching it early is very important
# we don't want to mistakenly tell someone they don't have cancer when they do
# what metrics should we use to evaluate our model?
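To make the question above concrete, here is a small sketch with hand-written toy labels (1 = malignant, 0 = benign; not taken from the dataset). It compares accuracy, precision and recall for a model that misses half of the malignant cases.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Toy labels, made up for illustration: 4 malignant cases, 6 benign cases
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# The model finds only 2 of the 4 malignant cases
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
print(f"false negatives (missed cancers): {fn}")
```

Accuracy and precision both look fine here, but the two false negatives are exactly the mistake we most want to avoid, and only recall exposes them.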

In [2]:
# which model performs best?

In [3]:
# how do you determine what's "good" performance?

In [1]:
# don't forget all the ways we discussed to evaluate performance

# Dataset 2: Predicting forest fires

File name: forestfires.csv

Attribute Information:

- X - x-axis spatial coordinate within the Montesinho park map: 1 to 9 
- Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9 
- month - month of the year: 'jan' to 'dec' 
- day - day of the week: 'mon' to 'sun' 
- FFMC - FFMC index from the FWI system: 18.7 to 96.20 
- DMC - DMC index from the FWI system: 1.1 to 291.3 
- DC - DC index from the FWI system: 7.9 to 860.6 
- ISI - ISI index from the FWI system: 0.0 to 56.10 
- temp - temperature in Celsius degrees: 2.2 to 33.30 
- RH - relative humidity in %: 15.0 to 100 
- wind - wind speed in km/h: 0.40 to 9.40 
- rain - outside rain in mm/m2 : 0.0 to 6.4 
- area (our prediction goal) - the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is heavily skewed towards 0.0, so it may make sense to model it with a logarithm transform)
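A minimal sketch of the logarithm transform mentioned for `area`. Because the target contains exact zeros, this uses `log(1 + area)` via `np.log1p`, which is one common choice and an assumption here. The sample values are illustrative, apart from the 0.0 and 1090.84 endpoints listed above.

```python
import numpy as np

# Illustrative burned areas in ha; only the endpoints come from the range above
area = np.array([0.0, 0.0, 0.5, 2.0, 10.0, 1090.84])

# log1p computes log(1 + x), so zero stays at zero instead of becoming -inf
log_area = np.log1p(area)

# expm1 inverts log1p, mapping predictions back to hectares
recovered = np.expm1(log_area)

print(log_area.round(3))
print(recovered.round(2))
```

If you train on `log_area`, remember to apply `np.expm1` to the predictions before computing errors in hectares.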

In [4]:
# large fires are very rare, so we care more about getting most of our predictions right
# not as concerned with finding that one major fire
# which metrics should we use to evaluate our model?
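One way to explore the question above: compare how different error metrics react to a single missed large fire. The numbers are made up for illustration. Four predictions are off by 0.5 ha and one large fire is badly underestimated.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
)

# Toy burned areas in ha (illustrative only); the last one is a rare large fire
y_true = np.array([0.0, 0.0, 1.0, 2.0, 500.0])
y_pred = np.array([0.5, 0.5, 1.5, 1.5, 2.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
medae = median_absolute_error(y_true, y_pred)

print(f"MAE={mae:.1f}  MSE={mse:.1f}  median AE={medae:.1f}")
```

Squaring makes MSE almost entirely about the one large fire, MAE is still pulled up by it, and the median absolute error reflects only the typical prediction. Which of these matches "getting most of our predictions right" is the judgment call the question asks for.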

In [5]:
# which model performs best?

In [6]:
# how do you determine what's "good" performance?

In [2]:
# don't forget all the ways we discussed to evaluate performance

# Note: this lab is meant to be a gentle thought-starter for HW2

- Please move on to HW2 once you have thought about the questions asked above