## Training and Running My First Model

- Models are computer code that processes information to make a prediction or a decision. 
- I'll attempt to train a model to guess a comfortable boot size for a dog, based on the size of the harness that fits them.

## Preparing data

### Loading the data

In [1]:
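import pandas

# Download the plotting helper and the dataset used by the course.
# Note: wget is not available by default on Windows, so these two
# commands fail there; the data is also defined inline below.
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/doggy-boot-harness.csv
!pip install statsmodels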


# Make a dictionary of data for boot sizes
# and harness size in cm
data = {
    'boot_size': [
        39, 38, 37, 39, 38, 35, 37, 36, 35, 40,
        40, 36, 38, 39, 42, 42, 36, 36, 35, 41,
        42, 38, 37, 35, 40, 36, 35, 39, 41, 37,
        35, 41, 39, 41, 42, 42, 36, 37, 37, 39,
        42, 35, 36, 41, 41, 41, 39, 39, 35, 39
    ],
    'harness_size': [
        58, 58, 52, 58, 57, 52, 55, 53, 49, 54,
        59, 56, 53, 58, 57, 58, 56, 51, 50, 59,
        59, 59, 55, 50, 55, 52, 53, 54, 61, 56,
        55, 60, 57, 56, 61, 58, 53, 57, 57, 55,
        60, 51, 52, 56, 55, 57, 58, 57, 51, 59
    ]
}

# Convert it into a table using pandas
dataset = pandas.DataFrame(data)

# Print the data
# In normal Python we would write
# print(dataset)
# but in Jupyter notebooks, if we simply write the name
# of the variable, it is printed nicely
dataset

'wget' is not recognized as an internal or external command,
operable program or batch file.
'wget' is not recognized as an internal or external command,
operable program or batch file.




| | boot_size | harness_size |
|---|---|---|
| 0 | 39 | 58 |
| 1 | 38 | 58 |
| 2 | 37 | 52 |
| 3 | 39 | 58 |
| 4 | 38 | 57 |
| 5 | 35 | 52 |
| 6 | 37 | 55 |
| 7 | 36 | 53 |
| 8 | 35 | 49 |
| 9 | 40 | 54 |


- The sizes of boots and harnesses are for 50 avalanche dogs.

- Harness size will be used to estimate boot size.
- harness_size is our input. 
- I want a model that will process the input and make its own estimations of the boot size (output).
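
As a quick sanity check before modelling, here is a minimal sketch that summarises the input and output columns. It uses only the ten rows displayed above rather than the full dataset, and the helper name `summarise` is my own invention, not part of pandas or the course code.

```python
from statistics import mean

# First ten rows of the dataset, copied from the table above
boot_size = [39, 38, 37, 39, 38, 35, 37, 36, 35, 40]
harness_size = [58, 58, 52, 58, 57, 52, 55, 53, 49, 54]

def summarise(values):
    """Return (smallest, largest, mean) for a list of numbers (hypothetical helper)."""
    return min(values), max(values), mean(values)

print("harness_size (input):", summarise(harness_size))
print("boot_size (output):  ", summarise(boot_size))
```

Knowing the range of the input is useful later, because it shows which harness sizes the model has actually seen.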

## Selecting a model

- First, select a model.
- I'm starting with a very simple model called OLS (ordinary least squares). With a single input, this is just a straight line (sometimes called a trendline).
- I'm using an existing library to create the model, without training it yet.
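
To make the "straight line" idea concrete, here is a small sketch of what such a model computes. The function name `straight_line` and the parameter values are made up for illustration; they are not fitted values and not part of statsmodels.

```python
def straight_line(harness_size, slope, intercept):
    """Estimate boot size from harness size using a straight line."""
    return slope * harness_size + intercept

# Two made-up parameter settings (NOT fitted values) give different estimates
print(straight_line(55, slope=0.5, intercept=10))   # 37.5
print(straight_line(55, slope=1.0, intercept=-15))  # 40.0
```

Training is the step that picks the slope and intercept so that the line matches the data as closely as possible.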

In [2]:
# Load a library to do the hard work :)
import statsmodels.formula.api as smf

# define a formula using a special syntax
# This says that boot_size is explained by harness_size
formula = "boot_size ~ harness_size"

# Create the model, but don't train it yet
model = smf.ols(formula = formula, data = dataset)

# We have created the model, but its internal parameters are not set yet
if not hasattr(model, 'params'):
    print("Model selected but it does not have parameters set. We need to train it!")

Model selected but it does not have parameters set. We need to train it!


## Training the model

- This OLS model has two parameters (a slope and an offset, also called the intercept), which haven't been set in the model yet.
- I need to train (fit) the model to find these values so that it can reliably estimate dogs' boot size based on their harness size.
- The code below fits the model to the data.

In [4]:
!pip install graphing

Collecting graphing
  Downloading graphing-0.0.8-py3-none-any.whl (68 kB)
Installing collected packages: graphing
Successfully installed graphing-0.0.8


In [5]:
# Load some libraries to do the hard work for us
import graphing 

# Train (fit) the model so that it creates a line that 
# fits our data. This method does the hard work for
# us. We will look at how this method works in a later unit.
fitted_model = model.fit()

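# Print information about our model now it has been fit.
# Parameters are looked up by name; positional access such as
# params[1] is deprecated in recent pandas versions.
print("The following model parameters have been found:\n" +
        f"Line slope: {fitted_model.params['harness_size']}\n"+
        f"Line Intercept: {fitted_model.params['Intercept']}")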

The following model parameters have been found:
Line slope: 0.5859254167382707
Line Intercept: 5.719109812682602
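
To see where these two numbers could come from, here is a sketch of the textbook closed-form least-squares fit for a single input, written in plain Python. I'm not claiming this is how statsmodels computes the fit internally; the function name `fit_line` is my own, and the data is copied from the dictionary at the top of the notebook.

```python
from statistics import mean

# Same 50 dogs as in the data dictionary at the top of the notebook
boot_size = [39, 38, 37, 39, 38, 35, 37, 36, 35, 40,
             40, 36, 38, 39, 42, 42, 36, 36, 35, 41,
             42, 38, 37, 35, 40, 36, 35, 39, 41, 37,
             35, 41, 39, 41, 42, 42, 36, 37, 37, 39,
             42, 35, 36, 41, 41, 41, 39, 39, 35, 39]
harness_size = [58, 58, 52, 58, 57, 52, 55, 53, 49, 54,
                59, 56, 53, 58, 57, 58, 56, 51, 50, 59,
                59, 59, 55, 50, 55, 52, 53, 54, 61, 56,
                55, 60, 57, 56, 61, 58, 53, 57, 57, 55,
                60, 51, 52, 56, 55, 57, 58, 57, 51, 59]

def fit_line(x, y):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    x_mean, y_mean = mean(x), mean(y)
    # Sum of products of deviations from the means
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    # Sum of squared deviations of the input
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

slope, intercept = fit_line(harness_size, boot_size)
print(f"Line slope: {slope:.4f}")          # Line slope: 0.5859
print(f"Line intercept: {intercept:.4f}")  # Line intercept: 5.7191
```

The results agree with the parameters reported by `model.fit()` above, up to rounding.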


- Training the model sets its parameters automatically.
- Interpreting the model using a graph:

In [16]:
import matplotlib.pyplot as plt

In [None]:
import graphing

# Show a graph of the result
graphing.scatter_2D(dataset,    label_x="harness_size", 
                                label_y="boot_size",
                                trendline=lambda x: fitted_model.params['harness_size'] * x + fitted_model.params['Intercept']
                                )

- The graph should show the original data as circles, with a red line through them. The red line shows the model.
- Looking at the line helps us understand the model. For example, we can see that as harness size increases, so does the estimated boot size.
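
The same reading of the line can be done with numbers. This sketch uses the fitted parameters printed earlier, rounded to four decimals, and a helper name of my own (`estimate_boot_size`) to show how much the estimate changes per centimetre of harness.

```python
# Fitted parameters as printed by the training cell (rounded)
slope = 0.5859
intercept = 5.7191

def estimate_boot_size(harness_size):
    """Read the estimated boot size off the fitted line."""
    return slope * harness_size + intercept

# Compare two harness sizes that differ by exactly 1 cm
difference = estimate_boot_size(56) - estimate_boot_size(55)
print(f"One extra cm of harness adds about {difference:.4f} to the estimated boot size")
```

The difference equals the slope, which is why the slope is the number to look at when asking how strongly the input affects the output.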

## Using the model

Now that we've finished training, we can use our model to predict a dog's boot size from their harness size.

For example, by looking at the red line, we can see that a harness size of 52.5 (x axis) corresponds to a boot size of about 36.5 (y axis).

We don't have to do this by eye, though. We can use the model in our program to predict the boot size for any harness size we like. Run the code below to see how we can use our model now that it is trained.

In [8]:
# harness_size states the size of the harness we are interested in
harness_size = { 'harness_size' : [52.5] }

# Use the model to predict what size of boots the dog will fit
approximate_boot_size = fitted_model.predict(harness_size)

# Print the result
print("Estimated approximate_boot_size:")
print(approximate_boot_size[0])

Estimated approximate_boot_size:
36.48019419144181
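
As a cross-check, the same estimate can be reproduced by hand from the slope and intercept printed during training. This is only a sketch of the arithmetic that the straight-line model implies, not a replacement for `fitted_model.predict`.

```python
# Parameters copied from the training output above
slope = 0.5859254167382707
intercept = 5.719109812682602

harness_size = 52.5

# A straight-line model is just: slope * input + intercept
approximate_boot_size = slope * harness_size + intercept
print(round(approximate_boot_size, 2))  # 36.48
```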


In [10]:
### trial: change the value of 52.5 in harness_size to a new value and run the block above to see the model in action.

In [11]:
# harness_size states the size of the harness we are interested in
harness_size = { 'harness_size' : [40.3] }

# Use the model to predict what size of boots the dog will fit
approximate_boot_size = fitted_model.predict(harness_size)

# Print the result
print("Estimated approximate_boot_size:")
print(approximate_boot_size[0])

Estimated approximate_boot_size:
29.331904107234912
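
When trying several values, it can help to note which ones fall outside the harness sizes in the training data (49 to 61 in the dictionary at the top). The sketch below does this by hand with the fitted parameters; the helper `predict_boot_size` and the constant `TRAINING_RANGE` are hypothetical names of mine, not part of statsmodels.

```python
# Parameters copied from the training output above
slope = 0.5859254167382707
intercept = 5.719109812682602

# Smallest and largest harness size in the training data
TRAINING_RANGE = (49, 61)

def predict_boot_size(harness_size):
    """Return (estimate, in_range) for one harness size (hypothetical helper)."""
    estimate = slope * harness_size + intercept
    in_range = TRAINING_RANGE[0] <= harness_size <= TRAINING_RANGE[1]
    return estimate, in_range

for size in [40.3, 52.5, 60]:
    estimate, in_range = predict_boot_size(size)
    note = "" if in_range else " (outside the harness sizes seen in training)"
    print(f"harness {size}: boot size {estimate:.2f}{note}")
```

The line still produces a number for 40.3, but no dog in the data had a harness that small, so that estimate deserves less trust than the ones inside the range.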
