## Data Generation

To train a model that classifies faults with metrobuses, we first need some data. In this notebook we take a small data set of roughly 100 samples and use [Markovify](https://github.com/jsvine/markovify) to simulate a larger data set based on the original samples.

We obtained the small data set by asking our colleagues to describe issues they currently have, or have had in the past, with their vehicles.

**Note: you do not need to run this notebook in order to execute the remainder of the workshop. This notebook is solely for data generation. We have already generated the data for you, and it can be found in the `/dataset` folder.**

Let's take a look at that data: 

In [1]:
import pandas as pd

In [2]:
pd.set_option('display.max_colwidth', None)
df = pd.read_csv('dataset/response.csv') 
df.sample(10)

Unnamed: 0,Timestamp,What is the main issue you're having,Please pick one of these three symptoms:,Please pretend you're a customer. In the space below tell us in your own words what's going wrong with your car:,Please pick one of these three symptoms:.1,Please pretend you're a customer. In the space below tell us in your own words what's going wrong with your car:.1,Please pretend you're a customer. In the space below tell us in your own words what's going wrong with your car:.2
87,2021/03/05 2:24:33 AM CST,Other,,,,,"When I drive, I feel like the car is leaning on left"
26,2021/03/04 12:40:35 PM CST,Brakes,,,"Car brakes, but then brakes disengage","When coming to a stop at a stop light, I press on the brake and the car comes to a complete stop, but then starts rolling again even if I keep my foot firmly pressing down on the brake pedal.",
2,2021/03/04 11:47:13 AM CST,Other,,,,,I can't open the damn door to my car
63,2021/03/04 2:01:06 PM CST,Other,,,,,"When turning, the car is making a clicking noise."
27,2021/03/04 12:41:26 PM CST,Other,,,,,AC doesn't work
40,2021/03/04 1:19:20 PM CST,Starter,Car doesn't start,I turn the key and hear funny sounds but nothing happens. Cannot get the car to turn over. Please help ASAP!,,,
33,2021/03/04 12:51:04 PM CST,Starter,Car doesn't start,"Sometimes the car doesn't start and it seems like the battery is dead, but I have replaced the battery three times over the past year, so I think maybe either I'm buying defective batteries or something about the car is wrecking perfectly good batteries that should be lasting years.",,,
67,2021/03/04 2:04:59 PM CST,Starter,Car doesn't start,It won't start and doesn't respond when trying to jump start it.,,,
56,2021/03/04 1:54:16 PM CST,Other,,,,,"When I put the car in reverse, the hatchback latch opens so that the backup camera can engage. That's all good. But sometimes, the latch stays permanently opened while I'm driving and when I stop and park, I cannot open the hatchback due to the latch being stuck in the open position (but not engaged)."
44,2021/03/04 1:21:03 PM CST,Brakes,,,Car makes grinding noise,It is more squealing or screeching sounds than grinding. Plus it seems it takes more force to stop and sometimes very bumpy.,


As you can see, the data contains a range of information, including the time the response was recorded, the type of issue, and a description of the problem. The information we are really interested in is the type of issue and the description. Let's go ahead and process this data, extracting only the info we need!

In [3]:
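# Replace NaN with empty strings so the text columns can be concatenated
df = df.fillna('')
# Combine the three free-text description columns into a single response
df['response'] = df.iloc[:, 3] + df.iloc[:, 5] + df.iloc[:, 6]
# Issue type chosen by the respondent
df['issue'] = df.iloc[:, 1]
# Combine the two symptom columns into a single symptom
df['symptom'] = df.iloc[:, 2] + df.iloc[:, 4]
# Keep only the three columns we just created
subset = df.iloc[:, -3:]
subset.sample(10)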

Unnamed: 0,response,issue,symptom
34,"I press down on the brake pedal all the way to the floor, and the car does slow, but it takes a long time to come to a complete stop.",Brakes,Car doesn't stop in timely manner
1,"super frustrating every time I start my car it just stops again, what is wrong!",Starter,Car starts then stops
16,"I came out from work at the end of the day and the car would not start. I have a keyless start system and when I step on the brake and pushed the button, the lights would come one and the car would make a click click clicking sound but would not start. I'm not sure if it is the battery since the lights and radio worked but it definitely was not starting. Just that odd clicking noise....",Starter,Car doesn't start
45,"Starter makes clicking noise, won't start the engine unless you hit the starter with a hammer.",Starter,Car doesn't start
92,"The car won't turn on, but it won't turn off either. I'm on the side of the road, the lights, radio, electronic door locks, etc are all working fine, but the car won't shift out of park. When I hit the start button to turn everything off, everything stays on. It won't go, but it kind of won't stop either.",Other,
41,I'm hearing a noise and feeling a vibration when I turn the wheel.,Other,
30,Customer states breaks make a terrible sound when stopping.,Brakes,Car makes grinding noise
3,I turn the key and nothing happens,Starter,Car doesn't start
33,"Sometimes the car doesn't start and it seems like the battery is dead, but I have replaced the battery three times over the past year, so I think maybe either I'm buying defective batteries or something about the car is wrecking perfectly good batteries that should be lasting years.",Starter,Car doesn't start
17,"It was really cold out and I could not get the emergency brake to disengage. When I tried to drive, the car made horrible grinding noises.",Brakes,Car makes grinding noise


This looks much nicer! Let's explore our data:

In [4]:
subset.issue.unique()

array(['Brakes', 'Starter', 'Other'], dtype=object)

We have only three categories of issue: 'Brakes', 'Starter' and 'Other'.

How much data do we have?

In [5]:
subset.shape

(109, 3)

In [6]:
subset['issue'].value_counts() 

Other      56
Brakes     28
Starter    25
Name: issue, dtype: int64

With only 109 rows, we are unlikely to have enough data to train a reliable classification model.

Just over half of the data (56 of 109 rows) falls into the 'Other' category.
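
To put numbers on that imbalance, here is a small sketch that recomputes the class shares from the counts printed above. The `counts` Series is typed in by hand for illustration; in the notebook itself, `subset['issue'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied by hand from the value_counts() output above.
counts = pd.Series({"Other": 56, "Brakes": 28, "Starter": 25}, name="issue")

# Share of each class among the 109 original rows.
shares = counts / counts.sum()
print(shares.round(3))
```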

In the rest of this notebook, we use Markovify, a Markov chain generator, to simulate more responses for each type of issue. A Markov chain text model records which words tend to follow which in the original text and samples new sentences from those transitions.

First we install Markovify, which is listed in our `nb-requirements.txt` file:
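
For intuition about how a Markov chain generator produces text, the sketch below builds a tiny word-level chain by hand. It is not Markovify's implementation: the helper names (`build_chain`, `walk`) and the two toy sentences are made up for illustration, and it uses a state size of one word for brevity, whereas the models in this notebook use `state_size=2`.

```python
import random
from collections import defaultdict

BEGIN, END = "__BEGIN__", "__END__"

def build_chain(sentences):
    """Map each word to the list of words observed directly after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [BEGIN] + sentence.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def walk(chain, rng, max_words=30):
    """Generate one sentence by repeatedly sampling a follower word."""
    word, output = BEGIN, []
    while len(output) < max_words:
        word = rng.choice(chain[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)

# Toy corpus, invented for this example.
toy_sentences = [
    "the car makes a grinding noise",
    "the brake makes a squealing sound",
]
chain = build_chain(toy_sentences)
print(walk(chain, random.Random(0)))
```

Because both toy sentences share the words "the", "makes" and "a", the walk can switch between them at those points and produce a sentence such as "the brake makes a grinding noise" that appears in neither input.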

In [8]:
!pip install -r nb-requirements.txt



In [9]:
import markovify
import csv

In [10]:
# Train a Markov chain model on the responses for a single issue type.
def train_markov_type(data, issue):
    return markovify.Text(data[data["issue"] == issue].response, retain_original=False, state_size=2)

# Take one of the 'issue' models and generate a random sentence of at most `length` characters.
def make_sentence(model, length=100):
    return model.make_short_sentence(length, max_overlap_ratio=.7, max_overlap_total=15)

# Build one model per issue type
other_model = train_markov_type(subset, "Other")
brakes_model = train_markov_type(subset, "Brakes")
starter_model = train_markov_type(subset, "Starter")

We can combine these models with relative weights:

In [11]:
import numpy

def generate_cases(models, weights=None):
    # `models` is a list of (markovify model, category) tuples.
    if weights is None:
        weights = [1] * len(models)

    choices = []  # List of (cumulative weight fraction, (model, category)) tuples

    total_weight = float(sum(weights))

    for i in range(len(weights)):
        choices.append((float(sum(weights[0:i+1])) / total_weight, models[i]))

    # Return a (model, category) tuple selected at random according to the given weights.
    def choose_model():
        r = numpy.random.uniform()
        for (model_weight, model) in choices:
            if r <= model_weight:
                return model
        return choices[-1][1]

    while True:
        local_model = choose_model()
        # local_model[0] is the markovify model, local_model[1] is the category
        sentence = make_sentence(local_model[0])
        # make_short_sentence returns None when it cannot build a sentence; skip those
        if sentence is not None:
            yield sentence, local_model[1]
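
The selection step above turns the relative weights into cumulative thresholds on the interval from 0 to 1 and picks the first threshold that a uniform random number does not exceed. The sketch below shows the same idea in vectorised form with NumPy. The weights and the fixed seed are illustrative assumptions, and this is an alternative formulation rather than the code the notebook runs.

```python
import numpy as np

weights = [14, 7, 7]  # illustrative relative weights

# Cumulative thresholds: 0.5, 0.75 and 1.0 for these weights.
thresholds = np.cumsum(weights) / np.sum(weights)

rng = np.random.default_rng(0)  # seed fixed only for reproducibility
r = rng.uniform(size=10_000)

# For each draw, find the index of the first threshold that is >= r.
indices = np.searchsorted(thresholds, r, side="left")

# Empirical selection frequencies, which should be close to 0.50, 0.25, 0.25.
frequencies = np.bincount(indices, minlength=len(weights)) / r.size
print(frequencies)
```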
   


We can now use this code to generate new sentences:

In [12]:
# The relative weights 14:7:7 (2:1:1) roughly mirror the class balance of the original data
generated_cases = generate_cases([(other_model,'other'), (brakes_model,'brakes'), (starter_model,'starter')], [14,7,7])

# Tuples with sentence and category
sentence_tuples = [next(generated_cases) for _ in range(1000)]  # create 1000 sentence/category tuples

In [13]:
sentence_tuples

[('This is the third one, and I hear a grinding noise as I drive.', 'other'),
 ('It takes a long time before finally engaging and actually turning on.',
  'starter'),
 ("If I knew what the problem was, I wouldn't be on some website, I'd be fixing it!!",
  'other'),
 ('My battery keeps dying.', 'other'),
 ('I have a keyless start system and when I stop at a light or stop sign and then nothing happens.',
  'starter'),
 ('I can hear the starter with a hammer.', 'starter'),
 ('Front right corner is dented from a collision.', 'other'),
 ('The battery is getting old so my range is a bit short these days.',
  'other'),
 ("AC doesn't work", 'other'),
 ('Brake pads appear less than ¼ inch thick', 'brakes'),
 ('I try to start the engine doesn’t power up', 'starter'),
 ('When I tried to drive, the car', 'brakes'),
 ('It sputters for a very long time to start, and if I push the key and nothing happens',
  'starter'),
 ('Just that odd clicking noise....', 'starter'),
 ('My battery keeps dying.', 'o

In [14]:
len(sentence_tuples)

1000
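
Before saving, it can be worth checking how the generated cases are spread across categories and whether any sentence is missing or empty (Markovify's `make_short_sentence` returns `None` when it cannot build a sentence that satisfies its constraints). The sketch below uses a small hand-written `sample_tuples` list as a stand-in for `sentence_tuples`; the rows are illustrative only.

```python
from collections import Counter

# Stand-in for sentence_tuples; these rows are illustrative only.
sample_tuples = [
    ("My battery keeps dying.", "other"),
    ("AC doesn't work", "other"),
    ("Brake pads appear less than ¼ inch thick", "brakes"),
    (None, "starter"),  # what a failed generation would look like
]

# Number of generated cases per category.
category_counts = Counter(category for _, category in sample_tuples)

# Number of cases whose sentence is None or an empty string.
missing = sum(1 for sentence, _ in sample_tuples if not sentence)

print(category_counts)
print(f"missing sentences: {missing}")
```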

We can save these new responses and their issue categories to a CSV file, which we will use later to train our model.

In [15]:
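# Write to a CSV file (no header row): one generated sentence and its category per line.
# newline='' lets the csv module control line endings; utf-8 covers characters such as '¼'.
with open('dataset/testdata1.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, delimiter=',', lineterminator='\n')
    writer.writerows(sentence_tuples)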
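
The rows are written without a header, so anything that reads the file back has to supply its own column names. Here is a minimal round-trip sketch, assuming pandas is available; the temporary file, the two example rows, and the column names `response` and `issue` are hypothetical choices for illustration.

```python
import csv
import os
import tempfile

import pandas as pd

# Two example rows in the same (sentence, category) shape as sentence_tuples.
rows = [
    ("My battery keeps dying.", "other"),
    ("I can hear the starter with a hammer.", "starter"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")

    # Write the rows the same way as the cell above.
    with open(path, "w", newline="", encoding="utf-8") as file:
        csv.writer(file, delimiter=",", lineterminator="\n").writerows(rows)

    # Read them back, naming the columns explicitly because the file has no header.
    loaded = pd.read_csv(path, header=None, names=["response", "issue"])

print(loaded)
```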

We have now created a new data set that we can transform and use to train a classification model.

Let's head to notebook [01-Create-Claims-Classification.ipynb](01-Create-Claims-Classification.ipynb), where we process the data and train a classifier. 