## Experiments in Local Mode
This notebook requires scikit-learn. Install it with `pip install scikit-learn` or the equivalent for your environment.

In [1]:
%run ./00_setup.ipynb

Python version: 3.6.7
Pandas version: 0.23.4
Numpy version: 1.15.4
Cortex SDK version: 5.5.4


In [2]:
# Import from the public `sklearn.datasets` namespace; the private submodule path is deprecated
from sklearn.datasets import fetch_california_housing
houses = fetch_california_housing()

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /Users/narcher/scikit_learn_data


In [3]:
print(houses.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

In [4]:
df = pd.DataFrame(data=houses.data, columns=houses.feature_names)
df.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.325,41.0,6.984,1.024,322.0,2.556,37.88,-122.23
1,8.301,21.0,6.238,0.972,2401.0,2.11,37.86,-122.22
2,7.257,52.0,8.288,1.073,496.0,2.802,37.85,-122.24
3,5.643,52.0,5.817,1.073,558.0,2.548,37.85,-122.25
4,3.846,52.0,6.282,1.081,565.0,2.181,37.85,-122.25


In [5]:
cortex = Cortex.local()
builder = cortex.builder()

In [6]:
ds = builder.dataset('c12e/cal-housing').title('California Housing dataset').from_df(df).build()
print('{} v{}'.format(ds.name, ds.version))

c12e/cal-housing v1


In [7]:
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [8]:
def train(x, y, **kwargs):
    """Fit the regression model named by `model_type`; defaults to plain linear regression."""
    alphas = kwargs.get('alphas', [1, 0.1, 0.001, 0.0001])
    # Select algorithm
    mtype = kwargs.get('model_type')
    if mtype == 'Lasso':
        model = LassoCV(alphas=alphas)
    elif mtype == 'Ridge':
        model = RidgeCV(alphas=alphas)
    elif mtype == 'ElasticNet':
        model = ElasticNetCV(alphas=alphas)
    else:
        model = LinearRegression()

    # Train model
    model.fit(x, y)
    
    return model

In [9]:
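def predict_and_score(model, x, y):
    """Return [predictions, rmse] for `model` evaluated on features `x` and targets `y`."""
    # `np` is assumed to be imported by 00_setup.ipynb
    predictions = model.predict(x)
    rmse = np.sqrt(mean_squared_error(y, predictions))  # argument order: (y_true, y_pred)
    return [predictions, rmse]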
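
`predict_and_score` reports the root mean squared error (RMSE): the square root of the mean of the squared differences between targets and predictions. As a minimal sketch of that arithmetic, the pure-Python snippet below computes it by hand. The `rmse` helper and the four sample values are hypothetical, chosen only for illustration; they are not part of the notebook or of scikit-learn.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error computed by hand (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up targets and predictions, for illustration only
y_true = [3.0, 1.0, 2.0, 4.0]
y_pred = [2.0, 1.0, 4.0, 4.0]

# Squared errors are 1, 0, 4, 0, so the mean is 1.25
example_rmse = rmse(y_true, y_pred)
print(example_rmse)
```

Because each difference is squared, swapping the two arguments gives the same value. RMSE is expressed in the same units as the target.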

In [10]:
X_train, X_test, y_train, y_test = train_test_split(df, houses.target, test_size=0.30, random_state=10)

In [11]:
%%time

best_model = None
best_model_type = None
best_rmse = float('inf')  # start at infinity so the first model is always accepted

exp = cortex.experiment('c12e/cal-housing-regression')
exp.reset()
exp.set_meta('style', 'supervised')
exp.set_meta('function', 'regression')

with exp.start_run() as run:
    alphas = [1, 0.1, 0.001, 0.0005]
    for model_type in ['Linear', 'Lasso', 'Ridge', 'ElasticNet']:
        print('---'*30)
        print('Training model using {} regression algorithm'.format(model_type))
        model = train(X_train, y_train, model_type=model_type, alphas=alphas)
        [predictions, rmse] = predict_and_score(model, X_train, y_train)
        print('Training error:', rmse)
        [predictions, rmse] = predict_and_score(model, X_test, y_test)
        print('Testing error:', rmse)

        if rmse < best_rmse:
            best_rmse = rmse
            best_model = model
            best_model_type = model_type

    r2 = best_model.score(X_test, y_test)
    run.log_metric('r2', r2)
    run.log_metric('rmse', best_rmse)
    run.log_param('model_type', best_model_type)
    run.log_param('alphas', alphas)
    run.log_artifact('model', best_model)

print('---'*30)
print('Best model: ' + best_model_type)
print('Best testing error: %.6f' % best_rmse)
print('R2 score: %.6f' % r2)

------------------------------------------------------------------------------------------
Training model using Linear regression algorithm
Training error: 0.7168015879496298
Testing error: 0.7443413397146885
------------------------------------------------------------------------------------------
Training model using Lasso regression algorithm
Training error: 0.7168404173291649
Testing error: 0.7435061368178243
------------------------------------------------------------------------------------------
Training model using Ridge regression algorithm
Training error: 0.7168016634345166
Testing error: 0.7443005648600199
------------------------------------------------------------------------------------------
Training model using ElasticNet regression algorithm
Training error: 0.7168267314398832
Testing error: 0.7436480525313524
------------------------------------------------------------------------------------------
Best model: Lasso
Best testing error: 0.743506
R2 score: 0.592222
CPU t
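
The selection step in this cell keeps whichever model has the lowest test RMSE. The sketch below replays that comparison in isolation, using the four testing errors printed above. It assumes the running minimum starts at infinity so that the first candidate is always accepted; the `test_rmse` dictionary is introduced here only for illustration.

```python
# Test-set RMSE values copied from the run output above
test_rmse = {
    'Linear': 0.7443413397146885,
    'Lasso': 0.7435061368178243,
    'Ridge': 0.7443005648600199,
    'ElasticNet': 0.7436480525313524,
}

best_model_type = None
best_rmse = float('inf')  # any finite RMSE is lower than this
for model_type, rmse in test_rmse.items():
    if rmse < best_rmse:
        best_rmse = rmse
        best_model_type = model_type

print('Best model: ' + best_model_type)
print('Best testing error: %.6f' % best_rmse)
```

The four errors differ only in the third decimal place, so the choice of Lasso here reflects a small margin rather than a clear win.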

In [12]:
exp

ID,Date,Took,Params: alphas,Params: model_type,Metrics: r2,Metrics: rmse
dx181e9,"Fri, 15 Feb 2019 16:42:48 GMT",0.10 s,"[1, 0.1, 0.001, 0.0005]",Lasso,0.592222,0.743506
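
The `r2` metric logged for this run comes from `best_model.score(X_test, y_test)`. For scikit-learn regressors, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. As a minimal sketch of that formula, the pure-Python snippet below evaluates it on a tiny made-up example. The `r_squared` helper and the sample values are hypothetical and serve only to illustrate the arithmetic.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination computed by hand (hypothetical helper)."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean of y_true
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

# ss_res = 0.5 and ss_tot = 5.0, so R^2 = 1 - 0.1
example_r2 = r_squared(y_true, y_pred)
print(example_r2)
```

Read this way, the logged `r2` of 0.592222 means the selected model's squared error on the test set is about 59% lower than that of always predicting the test-set mean.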
