# Running Kaggle Bike Sharing using Azure ML
Here I'm applying what I learned in the Azure DP-100 training (LP 2, MODs 1-3) about running ML experiments and storing data in the Azure ML Service. I've taken a model I trained to predict hourly bicycle rental counts for a Kaggle learning competition on bike demand. It also uses something I learned from the book *Principles of Data Science*: how to split nominal columns into dummy variables.
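
As a quick illustration of the dummy-variable idea, here is a minimal sketch on made-up data (the toy frame is hypothetical; only the integer-to-label mapping mirrors `season_is_it` in the training script further down):

```python
import pandas as pd

# Hypothetical toy data; the real notebook uses the Kaggle train.csv columns.
toy = pd.DataFrame({"season": [1, 2, 3, 4, 1]})

# Map integer codes to readable labels (same mapping as season_is_it in the script).
labels = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
season_labels = toy["season"].map(labels)

# One indicator column per category; each row has exactly one column set.
dummies = pd.get_dummies(season_labels, prefix="season")
print(dummies.columns.tolist())
```

Each nominal value becomes its own 0/1 column, so a linear model no longer treats season 4 as "four times" season 1.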

Since I'm doing this in my local JupyterLab, this is also a test of using the Azure ML Python SDK on my local machine rather than in the notebooks provided by the ```myCIWorkstationRyan``` Compute Instance. Apparently notebooks can be stored independently in workspace storage and opened in any VM, so I suspect I can get these files from there as well. I will have to look into that!

Perhaps next I will try to finish what I've started with house price prediction data.

**For this experiment, I'll use what was learned in MOD 1**

In [9]:
#import the workspace from config file
from azureml.core import Workspace  # Experiment is imported later, when the run is submitted

ws = Workspace.from_config()
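
For a local setup like this one, `Workspace.from_config()` needs a `config.json` describing the workspace. As far as I know it looks in the current directory, an `.azureml` subfolder, or parent directories. A minimal sketch of that file, using placeholder values (not the real IDs) and writing to a temporary folder so nothing in the project is touched:

```python
import json
import os
import tempfile

# Placeholder values only; substitute the details of your own workspace.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Written to a temp directory here so the sketch has no side effects.
project_dir = tempfile.mkdtemp()
config_path = os.path.join(project_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Show the file contents that from_config() would read.
with open(config_path) as f:
    print(f.read())
```

The same file can usually be downloaded from the workspace's overview page in the Azure portal, which avoids typing the IDs by hand.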

In [10]:
type(ws)

azureml.core.workspace.Workspace

In [11]:
for compute_name in ws.compute_targets:
    compute = ws.compute_targets[compute_name]
    print(compute.name, ":", compute.type)

myCIWorkstationRyan : ComputeInstance
myCClusterRyan : AmlCompute


In [12]:
import os, shutil

# Create a folder for the experiment files
training_folder = 'bike-share-training'
os.makedirs(training_folder, exist_ok=True)

# Copy the data file into the experiment folder
shutil.copy('train.csv', os.path.join(training_folder, "train.csv"))
shutil.copy('test.csv', os.path.join(training_folder, "test.csv"))

'bike-share-training\\test.csv'

In [14]:
%%writefile $training_folder/bike-share-training.py
# Import libraries
from azureml.core import Run
import pandas as pd
import numpy as np
import joblib
import os
from sklearn import preprocessing
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


def when_is_it(hour):
    if hour >=5 and hour < 11:
        return 'morning'
    elif hour >=11 and hour < 16:
        return 'afternoon'
    elif hour >=16 and hour < 18:
        return 'rush_hour'
    else:
        return 'off_hours'

def season_is_it(season_int):
    if season_int == 1:
        return 'spring'
    elif season_int == 2:
        return 'summer'
    elif season_int == 3:
        return 'fall'
    else:
        return 'winter'
    
def weather_is_it(weather_int):
    if weather_int == 1:
        return 'nice'
    elif weather_int == 2:
        return 'misty'
    elif weather_int == 3:
        return 'ugly'
    else:
        return 'stormy'
    
#function takes all nominal features and converts to dummy features
def dummy_conversion(data): 
    #create the rules for each dummy variable
    df_when_is_it = data['when_is_it'].apply(when_is_it)
    df_season_is_it = data['season'].apply(season_is_it)
    df_weather_is_it = data['weather'].apply(weather_is_it)
    
    
    when_dummies = pd.get_dummies(df_when_is_it, prefix = 'when_')    
    season_dummies = pd.get_dummies(df_season_is_it, prefix = 'season_')
    weather_dummies = pd.get_dummies(df_weather_is_it, prefix = 'weather_')
    
    #drop the old nominal variables (the numeric 'when_is_it' hour column is kept as a feature)
    data = data.drop(['datetime', 'season', 'weather'], axis=1)
    
    data[list(when_dummies.columns)] = when_dummies
    data[list(season_dummies.columns)] = season_dummies
    data[list(weather_dummies.columns)] = weather_dummies
    return data       

#imports train.csv    
bikes = pd.read_csv('train.csv')
#imports test data and saves a temporary record of the datetime column
test_data = pd.read_csv('test.csv')
temp_datetime = pd.read_csv('test.csv')['datetime']

feature_cols = ['datetime', 'season', 'holiday', 'workingday', 'weather', 'temp',
       'atemp', 'humidity', 'windspeed']    

#saves the labels and the features of the training data separately
y = bikes['count']
bikes = bikes[feature_cols].copy()  # copy so the new column below is not set on a slice

#extracts the hour (characters 11-12 of 'YYYY-MM-DD HH:MM:SS') from datetime
bikes['when_is_it'] = bikes['datetime'].apply(lambda x: int(x[11:13]))
test_data['when_is_it'] = test_data['datetime'].apply(lambda x: int(x[11:13]))
                      


#converts datetime, season and weather nominal variables into dummy variable columns
X=dummy_conversion(bikes)
test_data=dummy_conversion(test_data)


# Get the experiment run context
run = Run.get_context()

# Data is already loaded into `bikes` and `test_data` above
print("Preprocessing data...")

#initiates the MinMaxScaler preprocessor (an earlier version of this script used StandardScaler)
#if I decide to use argparse, the choice of scaler is something I could pass in here
scaler = preprocessing.MinMaxScaler()

#fits scaler to X
scaler.fit(X)

#sets X to normalized values
X=scaler.transform(X)

# Split data into training set and test set
#X_train and y_train will be used to train the model
#X_test and y_test will be used to test the model
#remember that all four of these variables are just subsets of the overall X and y
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Train a linear regression model
print('Training a linear regression model')
model = LinearRegression().fit(X_train, y_train)

# predict on the held-out test split
y_hat = model.predict(X_test)

#ensure negative values predict 0, since negatives make no sense here
y_hat[y_hat < 0] = 0

# calculate our metric: root mean squared log error (RMSLE)
error = np.sqrt(metrics.mean_squared_log_error(y_test, y_hat))

#logs the error, it's extra cool to see this in my Run Metrics in the ML Service
#(np.float was removed in NumPy 1.24, so the built-in float is used)
run.log('Root Mean Squared Log Error', float(error))


# calculate AUC - not relevant since this is regression, just commenting it out to remember this later
#y_scores = model.predict_proba(X_test)
#auc = roc_auc_score(y_test,y_scores[:,1])
#print('AUC: ' + str(auc))
#run.log('AUC', np.float(auc))

# Save the trained model in the outputs folder
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/bike_share_model_MinMaxScale_LinRegression.pkl')

run.complete()

Overwriting bike-share-training/bike-share-training.py
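
The script above logs RMSLE via `np.sqrt(metrics.mean_squared_log_error(...))`. To make the metric concrete, here is a standard-library sketch of the same formula. The helper name `rmsle` and the sample counts are illustrative, not part of the script:

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared log error: sqrt(mean((log(1 + pred) - log(1 + true))^2))."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical hourly counts, for illustration only.
actual = [0, 10, 100]
predicted = [0, 10, 100]
print(rmsle(actual, predicted))  # identical inputs give 0.0
```

Because the metric takes `log(1 + value)`, it is undefined for predictions at or below -1 and scikit-learn rejects negative inputs, which is why the script clips negative predictions to 0 first.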


## From lab: Use an Estimator to Run the Script as an Experiment

You can run experiment scripts using a **RunConfiguration** and a **ScriptRunConfig**, or you can use an **Estimator**, which abstracts both of these configurations in a single object.

In this case, we'll use a generic **Estimator** object to run the training experiment. Note that the default environment for this estimator does not include the **scikit-learn** package, so you need to explicitly add that to the configuration. The conda environment is built on-demand the first time the estimator is used, and cached for future runs that use the same configuration; so the first run will take a little longer. On subsequent runs, the cached environment can be re-used so they'll complete more quickly.

From Ryan: The following script produced errors, which highlights the need for a template for these things:
- The first error was fixed by changing compute_target to 'myCIWorkstationRyan' instead of local, but I think I should try local after this, as running an experiment remotely can take a long time.
- The second error only appeared after a long run, because I hadn't copied ```test.csv``` into the training folder.

Lesson learned here: getting things wrong can lead to a lot of frustration because it takes so long to spoo

In [19]:
from azureml.train.estimator import Estimator
from azureml.core import Experiment

# Create an estimator
estimator = Estimator(source_directory=training_folder,
                      entry_script='bike-share-training.py',
                      compute_target='myCIWorkstationRyan',
                      conda_packages=['scikit-learn']
                      )

# Create an experiment
experiment_name = 'bike-share-training-MinMaxScaler-LinearRegression'
experiment = Experiment(workspace = ws, name = experiment_name)

# Run the experiment based on the estimator
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)

RunId: bike-share-training-MinMaxScaler-LinearRegression_1596052335_4e6c3490
Web View: https://ml.azure.com/experiments/bike-share-training-MinMaxScaler-LinearRegression/runs/bike-share-training-MinMaxScaler-LinearRegression_1596052335_4e6c3490?wsid=/subscriptions/72656d3a-fc47-4387-8c69-13f86d502d86/resourcegroups/myMLResourceGroup/workspaces/myMLWorkspace1

Streaming azureml-logs/55_azureml-execution-tvmps_9f154f53a68892668b6ff741dfaeba7c8c6ff30ec94f3ec0fb58959269f5ec51_d.txt

2020-07-29T19:52:25Z Executing 'Copy ACR Details file' on 10.0.0.5
2020-07-29T19:52:25Z Starting output-watcher...
2020-07-29T19:52:25Z Copy ACR Details file succeeded on 10.0.0.5. Output: 
>>>   
>>>   
2020-07-29T19:52:25Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
Login Succeeded
Using default tag: latest
latest: Pulling from azureml/azureml_18a2c352852de1e0e7ad8b589dd0927b
Digest: sha256:e5fe3e923746f15f646bde32cd52f130f67f9aae3a78019062e698b5e423424f
Status: Image is up to date for mymlw

{'runId': 'bike-share-training-MinMaxScaler-LinearRegression_1596052335_4e6c3490',
 'target': 'myCIWorkstationRyan',
 'status': 'Completed',
 'startTimeUtc': '2020-07-29T19:52:29.230189Z',
 'endTimeUtc': '2020-07-29T19:52:50.433216Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '30bdec04-e9c1-4b13-9801-3d887c2aedb3',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'runDefinition': {'script': 'bike-share-training.py',
  'scriptType': None,
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'myCIWorkstationRyan',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'jobName': None,
  'maxRunDurationSeconds': None,
  'nodeCount': 1,
  'environment': {'name': 'Experiment bike-share-training-MinMaxScaler-LinearRegression Environment',
   'version': 'Autosave_2020-

Now using MinMaxScaler again to see what sort of metrics I get with the new experiment ```bike-share-training-MinMaxScaler-LinearRegression```. RMSLE per run:
- Run 1: 1.1645
- Run 2: 1.1800
- Run 3: 1.2214
- Run 4: 1.2879
- Run 5: 1.2072

The average is 1.21, so the StandardScaler model may be better on average.
So I won't do the model registration here.
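
For reference, the average of the five runs listed above can be checked with a few lines of standard-library Python (the list simply copies the logged values):

```python
from statistics import mean, stdev

# RMSLE values copied from the five runs listed above.
rmsle_runs = [1.1645, 1.1800, 1.2214, 1.2879, 1.2072]

print(f"mean  = {mean(rmsle_runs):.4f}")
print(f"stdev = {stdev(rmsle_runs):.4f}")
```

The run-to-run spread is expected: `train_test_split` is called without a `random_state`, so every run trains and evaluates on a different random split. Passing a fixed `random_state` would make scaler comparisons more like-for-like.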

In [20]:
ws.models

{'AutoMLd9d5ac31d0': Model(workspace=Workspace.create(name='myMLWorkspace1', subscription_id='72656d3a-fc47-4387-8c69-13f86d502d86', resource_group='myMLResourceGroup'), name=AutoMLd9d5ac31d0, id=AutoMLd9d5ac31d0:1, version=1, tags={}, properties={}),
 'AutoML50b56166512': Model(workspace=Workspace.create(name='myMLWorkspace1', subscription_id='72656d3a-fc47-4387-8c69-13f86d502d86', resource_group='myMLResourceGroup'), name=AutoML50b56166512, id=AutoML50b56166512:1, version=1, tags={}, properties={}),
 'amlstudio-predict-auto-price': Model(workspace=Workspace.create(name='myMLWorkspace1', subscription_id='72656d3a-fc47-4387-8c69-13f86d502d86', resource_group='myMLResourceGroup'), name=amlstudio-predict-auto-price, id=amlstudio-predict-auto-price:1, version=1, tags={'CreatedByAMLStudio': 'true'}, properties={}),
 'amlstudio-predict-diabetes': Model(workspace=Workspace.create(name='myMLWorkspace1', subscription_id='72656d3a-fc47-4387-8c69-13f86d502d86', resource_group='myMLResourceGroup'

In [12]:
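#note: `scaler` and `linreg` are not defined in the cells shown here; presumably they come from
#an earlier local run (in the training script the fitted model is called `model`)
#note: this refits the scaler on the test data instead of reusing the fit from the training data
scaler.fit(test_data)
test_data = scaler.transform(test_data)

predictions = linreg.predict(test_data)
#ensure all negative values predict 0
predictions[predictions < 0] = 0

#ensure all counts are rounded
predictions = np.rint(predictions)
submission = pd.DataFrame({'datetime': temp_datetime, 'count': predictions})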
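
The cell above calls `scaler.fit(test_data)`, so the test set is scaled with its own min and max rather than the ones the model was trained with. One alternative, which is not what this notebook did, is to learn the scaling statistics from the training data only and reuse them. A minimal NumPy sketch of what min-max scaling to the [0, 1] range does, on hypothetical toy matrices:

```python
import numpy as np

# Hypothetical toy feature matrices (rows = samples, columns = features).
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 15.0], [12.0, 30.0]])

# "Fit": learn per-column min and range from the training data only.
# (Assumes no column is constant, otherwise the range would be zero.)
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# "Transform": apply the same training statistics to both sets.
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(test_scaled)
```

With scikit-learn this corresponds to calling `scaler.transform(test_data)` on the already-fitted scaler and skipping the second `fit`. Test values outside the training range then land outside [0, 1], which is expected.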

In [13]:
submission.head()

| | datetime | count |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 0.0 |
| 1 | 2011-01-20 01:00:00 | 0.0 |
| 2 | 2011-01-20 02:00:00 | 0.0 |
| 3 | 2011-01-20 03:00:00 | 0.0 |
| 4 | 2011-01-20 04:00:00 | 6.0 |


In [14]:
submission.to_csv('linReg_submission2.csv', index=False)

https://www.kaggle.com/c/bike-sharing-demand

Above is the link to the competition I submit to. I officially got 1.46 RMSLE on the site when I submitted with scaling but without dummy variables.

My new submission got 1.199984, so roughly 1.2 RMSLE. WOOOOOOOOO