[Console](https://www.tight.ai/login) | [SDK](https://www.tight.ai/sdk) | [Docs](https://www.tight.ai/docs)

In [None]:
%reload_ext autoreload
%autoreload 2

Machine learning is simply a computer learning from data instead of following a recipe. It's meant to mimic how people (and perhaps other animals) learn while still being grounded in mathematics.

This post is meant to get you started with a basic machine learning model. 

A chatbot.

Now, we're not re-creating Alexa, Siri, Cortana, or Google Assistant but we are going to create a brand new machine learning program from scratch. 

This tutorial is meant to be easy, assuming you know a bit of Python programming.

### Step 1: What's our data?

Machine learning needs data to actually, well, learn. Machines don't yet learn like you and I do, but they do learn by finding patterns in things that may seem non-obvious to you and me. We'll see a lot of that throughout this post.

Before we define our data, let's talk about the goal of this ML (machine learning) project:
>  To answer somewhat "random" questions with pre-defined responses.


Here's what we'll try and solve:

__Scenario 1__

Bill: `Hi there, what time do you open tomorrow for lunch?`

Bot: `Our hours are 9am-10pm every day.`


__Scenario 2__

Karen: `Can I speak to your manager?`

Bot: `You can contact our customer support at 555-555-5555.`


__Scenario 3__

Wade: `What type of products do you have?`

Bot: `We carry various food items including tacos, nachos, burritos, and salads.`

Let's put this into a Python format:

In [None]:
conversations = [
    {
        "customer": "Hi there, what time do you open tomorrow for lunch?",
        "response": "Our hours are 9am-10pm every day."
    },
     {
        "customer": "Can I speak to your manager?",
        "response": "You can contact our customer support at 555-555-5555."
    },
     {
        "customer": "What type of products do you have?",
        "response": "We carry various food items including tacos, nachos, burritos, and salads."
    }  
    
]

Without machine learning our bot would look like this:

In [None]:
while True:
    customer_input = input("What is your question?\n")
    response = None
    # look for a stored question that matches the input exactly
    for convo in conversations:
        if convo['customer'] == customer_input:
            response = convo['response']
            break
    if response is not None:
        print(response)
        break
    # no exact match: loop around and ask again

Right away, you should see the huge flaws in this recipe; if a customer doesn't ask a question in a specific pre-defined way, the bot fails and ultimately really sucks. 

A few examples:

- What if a customer says, _when do you open?_ What do you already know the response to be?
- What if a customer says, _Do you sell burgers?_
- What if a customer says, _How do I reach you on the phone?_
   
I'm sure you could come up with many many more examples of where this really falls apart.
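
To make the failure concrete, here is a minimal sketch of the exact-match approach as a function. `exact_match_bot` and `known_conversations` are hypothetical names used only for this illustration; the point is simply that a paraphrased question finds no match.

```python
# Hypothetical data and helper, for illustration only.
known_conversations = [
    {
        "customer": "Hi there, what time do you open tomorrow for lunch?",
        "response": "Our hours are 9am-10pm every day.",
    },
]


def exact_match_bot(question, conversations):
    """Return the stored response only if the question matches exactly."""
    for convo in conversations:
        if convo["customer"] == question:
            return convo["response"]
    return None  # no exact match, so the bot has nothing to say


# The pre-defined wording works...
print(exact_match_bot("Hi there, what time do you open tomorrow for lunch?", known_conversations))
# ...but a shorter version of the same question does not.
print(exact_match_bot("when do you open?", known_conversations))
```

The second call prints `None` even though a person would know the answer right away.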

So let's clean up our conversations data a bit more by adding `tags` that describe the initial question.

In [None]:
conversations_tagged = [
    {
        "customer": "Hi there, what time do you open tomorrow for lunch?",
        "tags": ['opening', 'hours'],
    },
     {
        "customer": "Can I speak to your manager?",
        "tags": ['contact', 'customer_support'],
    },
     {
        "customer": "What type of products do you have?",
        "tags": ['products', 'inventory'],
    }     
]

Now, I removed the responses on purpose. Machine learning needs "input" data and "output" data. In this case, we're interested in the "essence" of what a customer is asking. Yes, bots can get MUCH more complex than this but we just want to essentially "auto-tag" when a customer asks a question.

A few examples:

- What if a customer says, _when do you open?_ We want our ML app to guess at least `opening` as a tag.
- What if a customer says, _Do you sell burgers?_ We want our ML app to guess at least `inventory` as a tag.
- What if a customer says, _How do I reach you on the phone?_ We want our ML app to guess at least `contact` as a tag.
 
 
Once we know a tag, we can write "recipes" for how to handle that tag. Something like:

- If the tag is `opening`, then we can respond with `We're open from 9am-10pm Monday-Sunday` OR `Our hours are 9am-10pm every day.`

Notice that I added an `OR` option to the potential response? This ability will give the bot a bit more of a natural feel.
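
Here is one way such a recipe could look in code. This is only a sketch: `RESPONSES_BY_TAG` and `respond_to_tags` are made-up names, and picking the first tag that has a recipe is just one simple design choice among many.

```python
import random

# Hypothetical recipes: each tag maps to one or more canned responses.
RESPONSES_BY_TAG = {
    "opening": [
        "We're open from 9am-10pm Monday-Sunday.",
        "Our hours are 9am-10pm every day.",
    ],
    "contact": ["You can contact our customer support at 555-555-5555."],
}

FALLBACK = "Sorry, I'm not sure how to help with that yet."


def respond_to_tags(tags):
    """Pick a random response for the first tag we have a recipe for."""
    for tag in tags:
        options = RESPONSES_BY_TAG.get(tag)
        if options:
            # the `OR` from the recipe: choose any one of the options
            return random.choice(options)
    return FALLBACK


print(respond_to_tags(["opening", "hours"]))
print(respond_to_tags(["feedback"]))
```

Running the first call a few times should return either of the two `opening` responses, which is what gives the bot a less robotic feel.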

Okay. Now let's create a few more tagged conversations:

In [None]:
convos_two = [
    {
        "customer": "How late is your kitchen open?",
        "tags": ['opening', 'hours'],
    },
     {
        "customer": "My order was prepared incorrectly, how can I get this fixed?",
        "tags": ['customer_support'],
    },
    {
        "customer": "The food was amazing. Thank you!",
        "tags": ['feedback', 'customer_support'],
    },
    {
        "customer": "What kind of meats do you have?",
        "tags": ['menu', 'products', 'inventory'],
    }
]

Do you see a trend happening here? It's really easy to come up with all kinds of questions for a restaurant bot. It's also easy to see how challenging this would be to try and hard-code conditions to handle all the kinds of queries/questions customers could have.

I'm sure you've heard you need a LOT of data for machine learning. I'll just add one thing to that, you need a lot of data to have *awe-inspiring* machine learning projects. A simple bot for a mom-and-pop store down the street doesn't need *awe-inspiring* just yet. They need simple, approachable, easy to explain. That's exactly what this is. It's not a black box of *millions* of lines of data points. It's like 20 questions with made up on the spot tags.

In so many ways, machine learning today (in the 2020s) is like the internet of the 1990s. People have heard about it and "sort of get it" and feel like it's just this magical gimmick that only super nerds know how to do. Ha. Super nerds.

Now that we have our starting data, let's prepare for machine learning.

First, let's combine all conversations:

In [None]:
final_convos = conversations_tagged + convos_two

For good measure, let's add a few more:

In [None]:
convos_three = [
    {
        "customer": "When does your dining room open?",
        "tags": ['opening', 'hours'],
    },
     {
        "customer": "When do you open for dinner?",
        "tags": ['opening', 'hours'],
    },
    {
        "customer": "How do I contact you?",
        "tags": ["contact", "customer_support"]
    }
]
final_convos += convos_three

final_convos

Our conversations have the keys `customer` and `tags`. These are arbitrary names for this project and you can change them at will. Just remember that `customer` equals `input` and `tags` equals `output`. This makes sense because in the future, we want a random customer input such as `What's the menu specials today` and a predicted tags output like `menu` or something similar.


Machine learning has all kinds of terms and acronyms that often make it a bit confusing. In general, just remember that you have some `inputs` and some target `outputs`. Here's what I mean by that:

- `customer`: These values are really the `input` values for our ML project. Input values are sometimes called `source`, `feature`, `training`, `X`, `X_train`/`X_test`/`X_valid`, and a few others.
- `tags`: These values are really the `output` values for our ML project. Output values are sometimes called `target`, `labels`, `y`, `y_train`/`y_test`/`y_valid`, `classes`/`class`, and a few others.

> We're using a machine learning technique known as `supervised learning`, which means we provide both the `inputs` and `outputs` to the model. Both are known data that we came up with. As you know, the `tags` (or `labels`/`outputs`) have been decided by a human (i.e. you and me) but can, eventually, be decided by an ML model itself and then verified by a human. Doing so would make the model better and better. There are many other techniques but `supervised learning` is by far the most approachable for beginners.


### Prepare for ML

Now that we have our data, it's time to put it into a format that works well for computers. As you may know, computers are great at numbers and not so great at text. In this case, we have to convert our text into numbers.


This is made simple by using the [scikit-learn](https://scikit-learn.org/stable/index.html) library. So let's install it below by uncommenting the cell.

In [None]:
# !pip install scikit-learn

First up, let's turn our `customer` and `tag` data into 2 separate lists where the index of each item corresponds to the index of the other.

```
X = [customer_convo_1, customer_convo_2, ...]
y = [convo_1_tags, convo_2_tags, ...]
```

This is very standard practice so that `X[0]` is the `input` that corresponds to the `y[0]` `output`, `X[1]` is the `input` that corresponds to the `y[1]` `output` and so on. 

In [None]:
inputs = [x['customer'] for x in final_convos]
outputs = [x['tags'] for x in final_convos]

assert len(inputs) == len(outputs)

> If you have an `AssertionError` above, that means your `inputs` and `outputs` are not balanced. Check your data source(s) to ensure every `input` has a corresponding `output` value.
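
If you do need to track down a bad record, a small check like the one below can help. It is only a sketch: `find_incomplete` and `sample_convos` are hypothetical names, and it assumes every record should carry both a `customer` key and a `tags` key. (A record missing one of those keys would actually raise a `KeyError` in the list comprehensions above before the `assert` is reached.)

```python
def find_incomplete(convos):
    """Return the indexes of records missing a `customer` or `tags` key."""
    problems = []
    for index, convo in enumerate(convos):
        if "customer" not in convo or "tags" not in convo:
            problems.append(index)
    return problems


# Made-up sample data: the second record has no tags.
sample_convos = [
    {"customer": "When do you open?", "tags": ["opening", "hours"]},
    {"customer": "Do you sell burgers?"},
]

print(find_incomplete(sample_convos))
```

An empty list means every record has both keys.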

Now we need to turn each `inputs` list and `outputs` list into a list of `numbers` so our machine learning can do machine learning. 

`scikit-learn` has a simple way to do this. First, let's focus on the `inputs` (aka `customer` conversations) as they are the most simple.

#### Prepare Inputs (`features`)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize our vectorizer.
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(inputs)


> Technical note: `scikit-learn` converted our text into a matrix of word counts, with one row per conversation and one column per word. Because most of those counts are zero, `X` is stored as a SciPy sparse matrix rather than a regular `numpy` array. Numeric matrices like this are what machine learning models work with under the hood. If you want to see the actual vectors created, check out `X.toarray()` and you'll see them.

In [None]:
X.shape

`X.shape` is useful to describe our data. 

`X.shape[0]` refers to the number of conversations from our `final_convos` list. So, `X.shape[0] == len(final_convos)` and `X.shape[0] == len(inputs)`


`X.shape[1]` refers to the number of unique `words` our data has. The `CountVectorizer` did this for us. The machine learning term for these is `features`. You can see all of the `features` (`words` minus punctuation) with:

In [None]:
# `get_feature_names()` was removed in scikit-learn 1.2; use `get_feature_names_out()` instead
print(vectorizer.get_feature_names_out())

The vectorizer has a very limited vocabulary as you can see. Naturally, this means our ML project will *always* misunderstand some key conversations and that's okay. The goal for our project is to get it working first, then get customers (or ourselves) using it so we can *improve* it with new data right away (and keep improving it).
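
To see why a small vocabulary matters, here is a simplified, pure-Python sketch of the word-counting idea. It is not scikit-learn's implementation; `tokenize`, `build_vocabulary`, and `count_vector` are hypothetical helpers that only imitate the default behavior (lowercase everything, keep words of two or more characters).

```python
import re


def tokenize(text):
    """Lowercase the text and keep words of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(sentences):
    """Collect every distinct word, sorted alphabetically."""
    return sorted({word for sentence in sentences for word in tokenize(sentence)})


def count_vector(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    words = tokenize(text)
    return [words.count(term) for term in vocabulary]


training = ["When do you open for dinner?", "How do I contact you?"]
vocabulary = build_vocabulary(training)

print(vocabulary)
# A question made of known words gets non-zero counts...
print(count_vector("When do you open?", vocabulary))
# ...while words never seen during training are silently ignored.
print(count_vector("Any vegan burgers?", vocabulary))
```

The last vector is all zeros: as far as the model is concerned, that question contains nothing it recognizes, which is why adding more example conversations helps so much.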

#### Prepare Outputs (`labels`)

Every one of our inputs has a list of tags, not just one tag. Let's look at what I mean:

In [None]:
print(inputs[0], outputs[0])

In machine learning, this is called `multi-label` classification because there are multiple `output` values for each `input` value. This is a more challenging problem than a `single` label but definitely necessary for a chatbot project.

A single label dataset would look like the following:
```
Input: Hi there, how are you doing today?
Output: not_spam

Input: Free CELL phones just text 3ED#2
Output: spam
```

Notice that the output is a single `str` and not a `list` of `str` values. If we continued down this path, our data would *always* fall into 2 categories: `spam` or `not_spam`.


In our project, each `input` value *can* have multiple tags, one tag, or no tags.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(outputs)

In [None]:
list(mlb.classes_)

Calling `mlb.classes_` gives us the exact order of how our classes are defined in `y`. So `y[0]` corresponds to `outputs[0]` but in numbers instead of words. It's pretty cool. To see this technically, run the following code:

```
print(y[0])
# map to classes with `zip`
y0_mapped_to_classes = dict(zip(mlb.classes_, y[0]))
print(y0_mapped_to_classes)
```

Then compare:
```
sorted(outputs[0]) == sorted([k for k,v in y0_mapped_to_classes.items() if v == 1])
```


In [None]:
y.shape

`y.shape` is useful to describe our data in a similar way to `X.shape`

`y.shape[0]` refers to the number of conversations from our `final_convos` list. So, `y.shape[0] == len(final_convos)` and `y.shape[0] == len(outputs)` and `y.shape[0] == X.shape[0]`


`y.shape[1]` refers to the number of unique `tags` across all conversations; the `MultiLabelBinarizer` gives each tag exactly one column, so no tag is ever repeated.
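
If the 0s and 1s in `y` feel abstract, the idea can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's code; `binarize` and `sample_outputs` are hypothetical names.

```python
def binarize(tag_lists):
    """Return (classes, rows): one 0/1 column per unique tag, sorted by name."""
    classes = sorted({tag for tags in tag_lists for tag in tags})
    rows = [[1 if cls in tags else 0 for cls in classes] for tags in tag_lists]
    return classes, rows


# Made-up sample tags, including one conversation with no tags at all.
sample_outputs = [
    ["opening", "hours"],
    ["contact", "customer_support"],
    ["opening", "hours"],
    [],
]

demo_classes, demo_rows = binarize(sample_outputs)
print(demo_classes)
for row in demo_rows:
    print(row)
```

There are four rows (one per conversation) and four columns (one per unique tag), even though `opening` and `hours` each appear twice in the sample data.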

In [None]:
assert y.shape[0] == X.shape[0]
assert y.shape[0] == len(inputs)

If you see an `AssertionError` here, it's the same exact error as `assert len(inputs) == len(outputs)` from above. Your data is not balanced.

### Training with `scikit-learn`

In [None]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(random_state=1)
model = MultiOutputClassifier(forest, n_jobs=-1)

In [None]:
# actual training
model.fit(X, y)

In [None]:
classes = mlb.classes_
def label_predictor(text="Hello World", model=None, vectorizer=None):
    # both the trained model and the fitted vectorizer are required
    assert model is not None
    assert vectorizer is not None
    x_test = vectorizer.transform([text])
    target = model.predict(x_test)
    preds = {}
    for i, val in enumerate(target[0]):
        preds[classes[i]] = val
    return preds

In [None]:
label_predictor("How do I contact your manager?",  model, vectorizer)

In [None]:
label_predictor("Hello world",  model, vectorizer)

### Exporting Model for Re-use

In [None]:
import pickle
import pathlib

NB_ROOT = pathlib.Path("").resolve()
PROJECTS_ROOT = NB_ROOT.parent / "projects"
PROJECT_NAME = 'hello-world'
PROJECT_PATH = PROJECTS_ROOT / PROJECT_NAME
PROJECT_DATA = PROJECT_PATH / "data"
if not PROJECT_DATA.exists():
    # project does not exist locally, make all folders
    PROJECT_DATA.mkdir(parents=True, exist_ok=True)

pickle_dest = PROJECT_DATA / 'model.pkl'
pickle_object = {
    "model": model,
    "vectorizer": vectorizer,
    "classes": mlb.classes_
}

with open(pickle_dest, 'wb') as f:
    pickle.dump(pickle_object, f)

### Re-use Exported Model

In [None]:
import pickle
import pathlib

NB_ROOT = pathlib.Path("").resolve()
PROJECTS_ROOT = NB_ROOT.parent / "projects"
PROJECT_NAME = 'hello-world'
PROJECT_PATH = PROJECTS_ROOT / PROJECT_NAME
PROJECT_DATA = PROJECT_PATH / "data"
if not PROJECT_DATA.exists():
    # project does not exist locally, make all folders
    PROJECT_DATA.mkdir(parents=True, exist_ok=True)

pickle_source = PROJECT_DATA / 'model.pkl'

loaded_pickle_obj = None

with open(pickle_source, 'rb') as f:
    loaded_pickle_obj = pickle.loads(f.read())
    
classes = loaded_pickle_obj['classes']
label_predictor("I did enjoy my meal", loaded_pickle_obj['model'], loaded_pickle_obj['vectorizer'])

### Prepare for Production

In [None]:
PROJECT_ENTRY = PROJECT_PATH/"entry.py"

In [None]:
%%writefile $PROJECT_ENTRY

import sklearn
import pickle
import pathlib

BASE_DIR = pathlib.Path(__file__).parent.absolute()
DATA_DIR = BASE_DIR / "data"
SOURCE_PKL = DATA_DIR / 'model.pkl'

classes = []
model = None
vectorizer = None

def load_pickle_data():
    global classes
    global model
    global vectorizer
    with open(SOURCE_PKL, 'rb') as f:
        loaded_pickle_obj = pickle.loads(f.read())
        classes = loaded_pickle_obj['classes']
        model = loaded_pickle_obj['model']
        vectorizer = loaded_pickle_obj['vectorizer']
        
        

load_pickle_data()

def label_predictor(text="Hello World"):
    global classes
    global model
    global vectorizer
    x_test = vectorizer.transform([text])
    target = model.predict(x_test) # target is an array of numpy.int64, we need a python `int` instead
    preds = {}
    for i, val in enumerate(target[0]):
        key_label = classes[i]
        preds[key_label] = int(val) # convert numpy.int64 into a python `int`
    return preds

def run(json_data={}, *args, **kwargs):
    '''
    Required method for tight.ai serving
    Returns a dictionary that is `json.dumps` ready.
    '''
    
    if 'question' not in json_data:
        return {'message': "a question is required", 'status': 400}
    input_question = json_data.get('question')
    tags = label_predictor(text=input_question)
    return {
        "question": input_question,
        "tags": tags
    }
    

### Local Server via Tight.ai

In [None]:
LOCAL_PORT = 5008
print("Copy the result of the following into your terminal / powershell:\n\n")
print(f"tight local run --path {PROJECT_PATH} --port {LOCAL_PORT}")

> You can also run this command within jupyter by adding a `!` in front of it like `!tight local run `

In [None]:
import requests

data = {
    "question": "When do you open tomorrow?"
}

# reuse LOCAL_PORT from above so the URL matches the running local server
r = requests.post(f"http://localhost:{LOCAL_PORT}", json=data)
print(
    r.json()
)
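
Assuming the local server returns the dictionary from `run` as JSON unchanged, a small helper can turn the 0/1 flags back into a plain list of tag names. This is a sketch: `predicted_tags` is a hypothetical helper, and the payloads below are made-up examples shaped like the return values of `run`, not real model output.

```python
def predicted_tags(payload):
    """Return the tag names flagged with 1, or an empty list for an error payload."""
    tags = payload.get("tags", {})
    return [name for name, flag in tags.items() if flag == 1]


# Made-up payloads shaped like the two possible return values of `run`.
example_ok = {
    "question": "When do you open tomorrow?",
    "tags": {"contact": 0, "hours": 1, "opening": 1},
}
example_error = {"message": "a question is required", "status": 400}

print(predicted_tags(example_ok))
print(predicted_tags(example_error))
```

With a real response, you could call `predicted_tags(r.json())` in the same way.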

### Push to Production

Get an [API KEY](https://www.tight.ai/developer/tokens/) from https://www.tight.ai

In [None]:
# add the only requirement we need `scikit-learn`

REQUIREMENTS_OUTPUT = PROJECT_PATH / "requirements.txt"
!echo "scikit-learn" > $REQUIREMENTS_OUTPUT

In [None]:
import getpass
my_api_key = getpass.getpass("Enter your api key from https://www.tight.ai/developer/tokens/\n\n")

In [None]:
import tightai
tightai.api_key = my_api_key or "<your_api_key>"

In [None]:
from tightai.projects import Project
Project.get_http_headers()
projects = Project.objects.all()
projects

##### Create Your Project

In [None]:
from tightai.projects import Project

project_name = "hello-world"
# project_obj = Project.objects.create(project_id=project_name)

##### Push Into Production

In [None]:
from tightai.projects import Project

project_name = "hello-world" # or whatever you choose

# Get our just-created Project
project_obj = Project.objects.get(project_id=project_name)

In [None]:
# grab the latest project version or by number
version_obj = project_obj.latest()

# or
# version_obj = project_obj.get_version(version=1)

# or

# from tightai.projects import Version
# version_obj = Version.objects.get(project_id='hello-world', version=1)

In [None]:
# push your code
version_obj.push(PROJECT_PATH)

In [None]:
# get deployment status
version_obj.status(latest=True)

In [None]:
# run predictions

# directly on version object.
version_obj.predict(json={'question': "What time do you open?"})

# or with latest project version
# project_obj.predict(json={'question': "What time do you open?"}, use_latest=True)

# or with Project version number
# project_obj.predict(json={'question': "What time do you open?"}, version=1)