# Running Kaggle Bike Sharing using Azure ML
Here I'm applying what I learned in the Azure DP-100 training (LP 2, MODs 1-3) about doing ML and storing data in Azure ML Service. I've taken a model I trained to predict hourly bicycle rental counts for a Kaggle learning competition on bike demand. That model used what I learned from the book *Principles of Data Science*, namely how to split nominal columns into dummy variables.
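
As a quick illustration of the dummy-variable idea, here is a minimal sketch on a made-up `season` column. The toy frame and the label mapping are assumptions for the example, not the competition data.

```python
import pandas as pd

# Toy stand-in for the nominal 'season' column (integer codes 1-4).
toy = pd.DataFrame({'season': [1, 2, 3, 4, 1]})

# Map the integer codes to labels, then expand into one indicator column per label.
labels = toy['season'].map({1: 'spring', 2: 'summer', 3: 'fall', 4: 'winter'})
season_dummies = pd.get_dummies(labels, prefix='season')

print(season_dummies.columns.tolist())
```

Each row ends up with exactly one "on" column, so the model no longer treats the season codes as if 4 were "more" than 1.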

Since I'm doing this in my local JupyterLab, this is also a test of using the Azure ML Python SDK on my local machine, and not on the notebooks provided by the ```myCIWorkstationRyan``` Compute Instance. Apparently I can store notebooks independently in workspace storage and open them in any VM, so I suspect I can get these files from there as well. I will have to look into that!

Perhaps next I will try to finish what I've started with house price prediction data.

**For this experiment, I'll use what was learned in MOD 1**

Update at 4:21 July 29: After a full day of working with this, when I plug my test data into my model, I get ridiculous predictions. I'm trying to debug, but it's hard to get the script to do anything since the compute runs in a container. I think I need to double-check my dataset first.


In [1]:
#load the workspace from the config file
from azureml.core import Workspace

ws = Workspace.from_config()

In [2]:
type(ws)

azureml.core.workspace.Workspace

In [3]:
for compute_name in ws.compute_targets:
    compute = ws.compute_targets[compute_name]
    print(compute.name, ":", compute.type)

myCIWorkstationRyan : ComputeInstance
myCClusterRyan : AmlCompute


In [4]:
import os, shutil

# Create a folder for the experiment files
training_folder = 'bike-share-training'
os.makedirs(training_folder, exist_ok=True)

# Copy the data file into the experiment folder
shutil.copy('train.csv', os.path.join(training_folder, "train.csv"))
shutil.copy('test.csv', os.path.join(training_folder, "test.csv"))

'bike-share-training\\test.csv'

In [5]:
%%writefile $training_folder/bike-share-training.py
# Import libraries
from azureml.core import Run
import pandas as pd
import numpy as np
import joblib
import os
from sklearn import preprocessing
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


def when_is_it(hour):
    if hour >=5 and hour < 11:
        return 'morning'
    elif hour >=11 and hour < 16:
        return 'afternoon'
    elif hour >=16 and hour < 18:
        return 'rush_hour'
    else:
        return 'off_hours'

def season_is_it(season_int):
    if season_int == 1:
        return 'spring'
    elif season_int == 2:
        return 'summer'
    elif season_int == 3:
        return 'fall'
    else:
        return 'winter'
    
def weather_is_it(weather_int):
    if weather_int == 1:
        return 'nice'
    elif weather_int == 2:
        return 'misty'
    elif weather_int == 3:
        return 'ugly'
    else:
        return 'stormy'
    
#function takes all nominal features and converts to dummy features
def dummy_conversion(data): 
    #create the rules for each dummy variable
    df_when_is_it = data['when_is_it'].apply(when_is_it)
    df_season_is_it = data['season'].apply(season_is_it)
    df_weather_is_it = data['weather'].apply(weather_is_it)
    
    
    when_dummies = pd.get_dummies(df_when_is_it, prefix = 'when_')    
    season_dummies = pd.get_dummies(df_season_is_it, prefix = 'season_')
    weather_dummies = pd.get_dummies(df_weather_is_it, prefix = 'weather_')
    
    #drop the old nominal variables
    data=data.drop('datetime', axis = 1 )
    data=data.drop('season', axis = 1)
    data=data.drop('weather', axis = 1)
    
    data[list(when_dummies.columns)] = when_dummies
    data[list(season_dummies.columns)] = season_dummies
    data[list(weather_dummies.columns)] = weather_dummies
    return data       

#imports train.csv
bikes = pd.read_csv('train.csv')

feature_cols = ['datetime', 'season', 'holiday', 'workingday', 'weather', 'temp',
       'atemp', 'humidity', 'windspeed']

#saves the labels and the features of the training data separately
y = bikes['count']
#copy so the column added below goes onto an independent frame, not a slice of the original
bikes = bikes[feature_cols].copy()

#extracts the hour (characters 11-12 of 'YYYY-MM-DD HH:MM:SS') as an integer
bikes['when_is_it'] = bikes['datetime'].apply(lambda x: int(x[11:13]))

                      


#converts datetime, season and weather nominal variables into dummy variable columns
X=dummy_conversion(bikes)



# Get the experiment run context
run = Run.get_context()

#initiates preprocessing StandardScaler (zero mean, unit variance), unlike the MinMaxScaler (rescales to a fixed range) I used last time.
#if I decide to use argparse, here's a place I could use it.
scaler = preprocessing.StandardScaler()

#fits scaler to X
scaler.fit(X)

#sets X to normalized values
X=scaler.transform(X)

# Split data into training set and test set
#X_train and y_train will be used to train the model
#X_test and y_test will be used to test the model
#remember that all four of these variables are just subsets of the overall X and y
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Train a linear regression model (plain LinearRegression has no regularization hyperparameter to set)
print('Training a linear regression model')
model = LinearRegression().fit(X_train, y_train)

# predict on the held-out split
y_hat = model.predict(X_test)


#ensure negative values predict 0, since negatives make no sense here
y_hat[y_hat <0] = 0


# calculate our metric: root mean squared log error
error = np.sqrt(metrics.mean_squared_log_error(y_test, y_hat))

#logs the error, it's extra cool to see this on my Run Metrics in the ML Service
#(the builtin float is used because np.float was removed in newer NumPy releases)
run.log('Root Mean Squared Log Error', float(error))


# calculate AUC - not relevant since this is regression, just commenting it out to remember this later
#y_scores = model.predict_proba(X_test)
#auc = roc_auc_score(y_test,y_scores[:,1])
#print('AUC: ' + str(auc))
#run.log('AUC', np.float(auc))

# Save the trained model in the outputs folder
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/bike_share_model_StandardScale_LinRegression.pkl')

run.complete()

Overwriting bike-share-training/bike-share-training.py
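
For reference, the metric the script logs can be written out by hand. This is a minimal NumPy sketch of root mean squared log error on made-up counts. The helper name `rmsle` is hypothetical, and the numbers are illustrative only.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error; both inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so a count of zero is handled without special cases.
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up hourly counts, only to show how the metric behaves.
actual = [16, 40, 32]
perfect = rmsle(actual, actual)
all_zero = rmsle(actual, [0, 0, 0])
print(perfect, all_zero)
```

Because the error is taken on a log scale, predicting 0 for a busy hour is penalized heavily, and negative predictions are not allowed at all. That is why the script clips them to 0 before scoring.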


## From lab: Use an Estimator to Run the Script as an Experiment

You can run experiment scripts using a **RunConfiguration** and a **ScriptRunConfig**, or you can use an **Estimator**, which abstracts both of these configurations in a single object.

In this case, we'll use a generic **Estimator** object to run the training experiment. Note that the default environment for this estimator does not include the **scikit-learn** package, so you need to explicitly add that to the configuration. The conda environment is built on-demand the first time the estimator is used, and cached for future runs that use the same configuration; so the first run will take a little longer. On subsequent runs, the cached environment can be re-used so they'll complete more quickly.

From Ryan: The following script produced errors, which highlights the need to have a template for these things:
- The first error was fixed by changing compute_target to 'myCIWorkstationRyan' instead of local, but I think I should try local after this, as it can take a long time when running an experiment.
- The second error took a long time to surface: the run failed because I hadn't put a ```test.csv``` file in the folder.

Lesson learned here: getting things wrong can lead to a lot of frustration because it takes so long to spoo
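
One cheap guard against the missing-file error is to check the experiment folder before submitting anything. A minimal sketch, assuming the required inputs are known up front; the helper name `missing_inputs` is hypothetical, and the demo uses a throwaway directory rather than the real training folder.

```python
import os
import tempfile

def missing_inputs(folder, required_files):
    """Return the required file names that are not present in folder."""
    return [name for name in required_files
            if not os.path.isfile(os.path.join(folder, name))]

# Demo in a temporary directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as demo_folder:
    # Create only train.csv, leaving test.csv absent on purpose.
    open(os.path.join(demo_folder, 'train.csv'), 'w').close()
    missing = missing_inputs(demo_folder, ['train.csv', 'test.csv'])

print(missing)
```

In this notebook the same check could be run against `training_folder` right before `experiment.submit`, raising an error if the returned list is not empty.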

In [6]:
from azureml.train.estimator import Estimator
from azureml.core import Experiment
from azureml.train.sklearn import SKLearn
from azureml.widgets import RunDetails

# Create an estimator. SKLearn already includes scikit-learn; a generic Estimator would need conda_packages=['scikit-learn']
estimator = SKLearn(source_directory=training_folder,
                      entry_script='bike-share-training.py',
                      compute_target='myCIWorkstationRyan',
                      )

# Create an experiment
experiment_name = 'bike-share-training-StandardScaler-LinearRegression'
experiment = Experiment(workspace = ws, name = experiment_name)

# Run the experiment based on the estimator
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)

RunId: bike-share-training-StandardScaler-LinearRegression_1596061501_41506abb
Web View: https://ml.azure.com/experiments/bike-share-training-StandardScaler-LinearRegression/runs/bike-share-training-StandardScaler-LinearRegression_1596061501_41506abb?wsid=/subscriptions/72656d3a-fc47-4387-8c69-13f86d502d86/resourcegroups/myMLResourceGroup/workspaces/myMLWorkspace1

Streaming azureml-logs/55_azureml-execution-tvmps_9f154f53a68892668b6ff741dfaeba7c8c6ff30ec94f3ec0fb58959269f5ec51_d.txt

2020-07-29T22:25:14Z Executing 'Copy ACR Details file' on 10.0.0.5
2020-07-29T22:25:15Z Copy ACR Details file succeeded on 10.0.0.5. Output: 
>>>   
>>>   
2020-07-29T22:25:15Z Starting output-watcher...
2020-07-29T22:25:15Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
Login Succeeded
Using default tag: latest
latest: Pulling from azureml/azureml_2a0e99d2b6c0b56ef3ce1012e5647b1d
Digest: sha256:3b950264a3ba472a2cd7fb3891a7b4dfb679e6db9b1e8741270a243911b17633
Status: Image is up to date for

ActivityFailedException: ActivityFailedException:
	Message: Activity Failed:
{
    "error": {
        "code": "UserError",
        "message": "User program failed with AttributeError: 'numpy.ndarray' object has no attribute 'to_csv'",
        "detailsUri": "https://aka.ms/azureml-known-errors",
        "details": [],
        "debugInfo": {
            "type": "AttributeError",
            "message": "'numpy.ndarray' object has no attribute 'to_csv'",
            "stackTrace": "  File \"/mnt/batch/tasks/shared/LS_root/jobs/mymlworkspace1/azureml/bike-share-training-standardscaler-linearregression_1596061501_41506abb/mounts/workspaceblobstore/azureml/bike-share-training-StandardScaler-LinearRegression_1596061501_41506abb/azureml-setup/context_manager_injector.py\", line 148, in execute_with_context\n    runpy.run_path(sys.argv[0], globals(), run_name=\"__main__\")\n  File \"/azureml-envs/azureml_ba9520bf386d662001eeb9523395794e/lib/python3.6/runpy.py\", line 263, in run_path\n    pkg_name=pkg_name, script_name=fname)\n  File \"/azureml-envs/azureml_ba9520bf386d662001eeb9523395794e/lib/python3.6/runpy.py\", line 96, in _run_module_code\n    mod_name, mod_spec, pkg_name, script_name)\n  File \"/azureml-envs/azureml_ba9520bf386d662001eeb9523395794e/lib/python3.6/runpy.py\", line 85, in _run_code\n    exec(code, run_globals)\n  File \"bike-share-training.py\", line 130, in <module>\n    y_hat.to_csv('test_prediction')\n"
        },
        "messageParameters": {}
    },
    "time": "0001-01-01T00:00:00.000Z"
}
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Activity Failed:\n{\n    \"error\": {\n        \"code\": \"UserError\",\n        \"message\": \"User program failed with AttributeError: 'numpy.ndarray' object has no attribute 'to_csv'\",\n        \"detailsUri\": \"https://aka.ms/azureml-known-errors\",\n        \"details\": [],\n        \"debugInfo\": {\n            \"type\": \"AttributeError\",\n            \"message\": \"'numpy.ndarray' object has no attribute 'to_csv'\",\n            \"stackTrace\": \"  File \\\"/mnt/batch/tasks/shared/LS_root/jobs/mymlworkspace1/azureml/bike-share-training-standardscaler-linearregression_1596061501_41506abb/mounts/workspaceblobstore/azureml/bike-share-training-StandardScaler-LinearRegression_1596061501_41506abb/azureml-setup/context_manager_injector.py\\\", line 148, in execute_with_context\\n    runpy.run_path(sys.argv[0], globals(), run_name=\\\"__main__\\\")\\n  File \\\"/azureml-envs/azureml_ba9520bf386d662001eeb9523395794e/lib/python3.6/runpy.py\\\", line 263, in run_path\\n    pkg_name=pkg_name, script_name=fname)\\n  File \\\"/azureml-envs/azureml_ba9520bf386d662001eeb9523395794e/lib/python3.6/runpy.py\\\", line 96, in _run_module_code\\n    mod_name, mod_spec, pkg_name, script_name)\\n  File \\\"/azureml-envs/azureml_ba9520bf386d662001eeb9523395794e/lib/python3.6/runpy.py\\\", line 85, in _run_code\\n    exec(code, run_globals)\\n  File \\\"bike-share-training.py\\\", line 130, in <module>\\n    y_hat.to_csv('test_prediction')\\n\"\n        },\n        \"messageParameters\": {}\n    },\n    \"time\": \"0001-01-01T00:00:00.000Z\"\n}"
    }
}

The first successful run (Run 7) with the StandardScaler has me at 1.1718 RMSLE, which is worse than 1.1271, our last submission with MinMaxScaler. But I had to run my algorithm a ton of times to get 1.1271, so as I run this more, things might get better. Let's run this a few times to see what we can get. It's also cool that all this is logged in my Experiments in ```myMLWorkspace1``` under ```bike-share-training-StandardScaler-LinearRegression``` at time of writing.
- Run 7 1.1718
- Run 8 1.2096
- Run 9 1.1829
- Run 10 1.2231
- Run 11 1.2107

Average = 1.19962

I compared this to the MinMaxScaler, and averaged over 5 runs this one seems to be better. It could be highly variable though. I'll stick with this one since Standard is a nice name.
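
The average can be reproduced from the five logged values, and the standard deviation gives a rough feel for how noisy a single run is. A minimal sketch using only the numbers listed in this notebook:

```python
import statistics

# RMSLE per run, copied from the list of StandardScaler runs in this notebook.
standard_scaler_runs = {7: 1.1718, 8: 1.2096, 9: 1.1829, 10: 1.2231, 11: 1.2107}

values = list(standard_scaler_runs.values())
mean_rmsle = statistics.mean(values)
spread = statistics.stdev(values)  # sample standard deviation across runs

print(f"mean={mean_rmsle:.5f}, stdev={spread:.4f}")
```

The run-to-run differences come from the random train/test split. Passing a fixed `random_state` to `train_test_split` would make individual runs repeatable and the scaler comparison easier to read.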

The first time I ran with ```SKLearn``` instead of ```Estimator```, it took a long time.

Get run details below

In [7]:
from azureml.widgets import RunDetails

RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

Print Metrics below

In [8]:
# Get logged metrics
metrics = run.get_metrics()
for key in metrics.keys():
    print(key, metrics.get(key))
print('\n')

for file in run.get_file_names():
    print(file)

Root Mean Squared Log Error 1.1870601523227546


azureml-logs/55_azureml-execution-tvmps_9f154f53a68892668b6ff741dfaeba7c8c6ff30ec94f3ec0fb58959269f5ec51_d.txt
azureml-logs/65_job_prep-tvmps_9f154f53a68892668b6ff741dfaeba7c8c6ff30ec94f3ec0fb58959269f5ec51_d.txt
azureml-logs/70_driver_log.txt
azureml-logs/75_job_post-tvmps_9f154f53a68892668b6ff741dfaeba7c8c6ff30ec94f3ec0fb58959269f5ec51_d.txt
azureml-logs/process_info.json
azureml-logs/process_status.json
logs/azureml/94_azureml.log
logs/azureml/job_prep_azureml.log
logs/azureml/job_release_azureml.log
outputs/bike_share_model_StandardScale_LinRegression.pkl


Below I register it as a new version in the workspace. I've run it a number of times and can't get much better than 1.17-1.18, even though I got 1.12 before with MinMax after a number of runs. However, this isn't a real competition. I'm just practicing, so I will have to be satisfied with getting Azure experience instead of Azure experience and a better model...

In [9]:
from azureml.core import Model

# Register the model
run.register_model(model_path='outputs/bike_share_model_StandardScale_LinRegression.pkl', model_name='bike_share_model',
                   tags={'Training context':'SKLearn Estimator'},
                   properties={'Root Mean Squared Log Error': run.get_metrics()}
                  )

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

bike_share_model version: 3
	 Training context : SKLearn Estimator
	 Root Mean Squared Log Error : {'Root Mean Squared Log Error': 1.1870601523227546}


bike_share_model version: 2
	 Training context : SKLearn Estimator
	 Root Mean Squared Log Error : {'Root Mean Squared Log Error': 1.1761471172549591}


bike_share_model version: 1
	 Training context : SKLearn Estimator
	 Root Mean Squared Log Error : {'Root Mean Squared Log Error': 1.1761471172549591}


diabetes_model version: 3
	 Training context : Parameterized SKLearn Estimator
	 AUC : 0.8483904671874223
	 Accuracy : 0.7736666666666666


diabetes_model version: 2
	 Training context : Estimator
	 AUC : 0.8483377282451863
	 Accuracy : 0.774


diabetes_model version: 1
	 Training context : Estimator
	 AUC : 0.8483377282451863
	 Accuracy : 0.774


amlstudio-predict-iris-cluster version: 1
	 CreatedByAMLStudio : true


amlstudio-predict-diabetes version: 1
	 CreatedByAMLStudio : true


amlstudio-predict-auto-price version: 1
	 CreatedBy
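
The `bike_share_model` entries above show the metric nested inside a second dict, because the whole result of `run.get_metrics()` was stored under one property name. A minimal sketch of the difference, using a plain dict as a stand-in for the metrics returned by the run:

```python
# Stand-in for the dict that run.get_metrics() returned for version 3 above.
logged_metrics = {'Root Mean Squared Log Error': 1.1870601523227546}

metric_name = 'Root Mean Squared Log Error'

# Storing the whole dict under the name gives the nested form seen in the listing.
nested = {metric_name: logged_metrics}

# Picking out the single value gives a flat property instead.
flat = {metric_name: logged_metrics[metric_name]}

print(nested)
print(flat)
```

Passing something like `flat` as `properties` when registering would presumably give a listing in the same flat style as the `diabetes_model` entries.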

In [10]:
import json
import os

import joblib
import numpy as np
import pandas as pd

def when_is_it(hour):
    if hour >=5 and hour < 11:
        return 'morning'
    elif hour >=11 and hour < 16:
        return 'afternoon'
    elif hour >=16 and hour < 18:
        return 'rush_hour'
    else:
        return 'off_hours'

def season_is_it(season_int):
    if season_int == 1:
        return 'spring'
    elif season_int == 2:
        return 'summer'
    elif season_int == 3:
        return 'fall'
    else:
        return 'winter'
    
def weather_is_it(weather_int):
    if weather_int == 1:
        return 'nice'
    elif weather_int == 2:
        return 'misty'
    elif weather_int == 3:
        return 'ugly'
    else:
        return 'stormy'
    
#function takes all nominal features and converts to dummy features
def dummy_conversion(data): 
    #create the rules for each dummy variable
    df_when_is_it = data['when_is_it'].apply(when_is_it)
    df_season_is_it = data['season'].apply(season_is_it)
    df_weather_is_it = data['weather'].apply(weather_is_it)
    
    
    when_dummies = pd.get_dummies(df_when_is_it, prefix = 'when_')    
    season_dummies = pd.get_dummies(df_season_is_it, prefix = 'season_')
    weather_dummies = pd.get_dummies(df_weather_is_it, prefix = 'weather_')
    
    #drop the old nominal variables
    data=data.drop('datetime', axis = 1 )
    data=data.drop('season', axis = 1)
    data=data.drop('weather', axis = 1)
    
    data[list(when_dummies.columns)] = when_dummies
    data[list(season_dummies.columns)] = season_dummies
    data[list(weather_dummies.columns)] = weather_dummies
    return data   


In [11]:
print(Model.get_model_path('bike_share_model', version=3, _workspace=ws))

azureml-models\bike_share_model\3\bike_share_model_StandardScale_LinRegression.pkl


In [12]:
from sklearn import preprocessing

#suggests the test_data and scaler code in my script is not useful - which makes sense and I can remove
test_data = pd.read_csv('test.csv')
temp_datetime = test_data['datetime']  #keep the datetime column for the submission file

In [17]:
print(test_data)

[[-0.17315012  0.67684424 -1.23596641 ...  0.73366278 -0.01755332
  -0.3072252 ]
 [-0.17315012  0.67684424 -1.23596641 ...  0.73366278 -0.01755332
  -0.3072252 ]
 [-0.17315012  0.67684424 -1.23596641 ...  0.73366278 -0.01755332
  -0.3072252 ]
 ...
 [-0.17315012  0.67684424 -1.23596641 ...  0.73366278 -0.01755332
  -0.3072252 ]
 [-0.17315012  0.67684424 -1.23596641 ...  0.73366278 -0.01755332
  -0.3072252 ]
 [-0.17315012  0.67684424 -1.23596641 ...  0.73366278 -0.01755332
  -0.3072252 ]]


In [12]:
test_data['when_is_it'] = test_data['datetime'].apply(lambda x: int(x[11:13]))
test_data=dummy_conversion(test_data)

# Get the path to the registered model file and load it
model_path = Model.get_model_path('bike_share_model')
model = joblib.load(model_path)

#NOTE: this fits a new scaler on the test data rather than reusing the one fitted in training,
#so the test features are scaled differently from what the model was trained on
scaler = preprocessing.StandardScaler() #initializes a scaler
scaler.fit(test_data)#fits scaler to test data
test_data=scaler.transform(test_data) #transforms test data

predictions = model.predict(test_data)
#ensure all negative values predict 0
predictions[predictions <0] = 0 

#ensure all counts are rounded
predictions = np.rint(predictions)
submission = pd.DataFrame({'datetime': temp_datetime, 'count': predictions})



The above is quite a problem: my SKLearn is not of the same version, and when I look at the submission, its numbers are crazy off. I have to diagnose this later, but not right now.
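
One thing worth ruling out while diagnosing (an assumption, not a confirmed cause) is that the dummy columns differ between training and test data. `pd.get_dummies` only creates columns for categories that actually occur, so a category missing from one file changes the layout. A minimal sketch with made-up weather labels:

```python
import pandas as pd

# Made-up labels: the "test" series happens to contain no 'stormy' rows.
train_weather = pd.Series(['nice', 'misty', 'ugly', 'stormy'])
test_weather = pd.Series(['nice', 'misty', 'ugly'])

train_dummies = pd.get_dummies(train_weather, prefix='weather')
test_dummies = pd.get_dummies(test_weather, prefix='weather')

# Force the test frame into the training layout; absent categories become 0.
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
print(list(aligned.columns))
```

Doing this for real would require saving the training column list alongside the model, which the training script does not currently do.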

In [13]:
submission.head()

|   | datetime | count |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 0.0 |
| 1 | 2011-01-20 01:00:00 | 0.0 |
| 2 | 2011-01-20 02:00:00 | 0.0 |
| 3 | 2011-01-20 03:00:00 | 0.0 |
| 4 | 2011-01-20 04:00:00 | 0.0 |


In [15]:
submission.to_csv('linReg_submission3.csv', index=False)

I tried this with the Azure ML notebook and it didn't seem to fix the problem with bad values, oh no! I'll have to look closer into why it's predicting poorly.
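
A likely suspect (not verified against this workspace) is the scoring cell fitting a fresh `StandardScaler` on the test data, so the model sees features scaled differently from training. A common way around this is to bundle scaler and model in one scikit-learn `Pipeline` and save that. The sketch below uses synthetic data and a hypothetical file name to show the effect:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
weights = np.array([2.0, -1.0, 0.5])

# Synthetic stand-ins: training and scoring features on different scales.
X_train = rng.normal(loc=20.0, scale=5.0, size=(200, 3))
y_train = X_train @ weights + 10.0
X_new = rng.normal(loc=30.0, scale=2.0, size=(5, 3))
y_new = X_new @ weights + 10.0

# The scaler is fitted on the training data only and travels with the model.
pipeline = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bike_share_pipeline.pkl')  # hypothetical file name
    joblib.dump(pipeline, path)
    reloaded = joblib.load(path)

# Scoring through the pipeline reuses the training-time scaling.
good = reloaded.predict(X_new)

# What the scoring cell in this notebook does instead: refit a scaler on the new data.
refit_scaled = StandardScaler().fit_transform(X_new)
bad = reloaded.named_steps['linearregression'].predict(refit_scaled)

print(np.abs(good - y_new).max(), np.abs(bad - y_new).max())
```

With the pipeline, the raw test features go straight into `predict` and no separate scaler object has to be recreated at scoring time.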