<a href="https://colab.research.google.com/github/rdhingra001/buddiey/blob/master/Buddiey%20Brains.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Buddiey - The Brains

Developed by the makers of Project: Buddiey (more specifically, [Ronit](mailto:ronit@buddiey.live))

If you are viewing this as a PDF, the runnable code can be found at https://www.buddiey.live/brain

## How It Was Made

The "Brains" of Project: Buddiey was developed with two popular deep learning frameworks: TensorFlow and Keras.


In [None]:
# Installing TensorFlow
import sys
!{sys.executable} -m pip install tensorflow
!{sys.executable} -m pip install keras

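# Colab-only magic that selects TensorFlow 2.x (remove this line when running outside Google Colab)
%tensorflow_version 2.x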
import tensorflow as tf # A required module
from keras.preprocessing.text import Tokenizer # An example module from Keras's text preprocessing libraries



These two modules allowed the backend developer for Project: Buddiey (Ronit) to create a Recurrent Neural Network (RNN), which is better suited to text and speech analysis than the more widely used Convolutional Neural Network (CNN). He took the existing RNN architecture and built a model dedicated to Natural Language Processing (NLP), able to extract the "true" message and emotion of a sentence (a message thread, in Buddiey's case). From there, the model uses the extracted data to create a human-like sentence in response to the message that was sent.





### Natural Language Processing
Natural Language Processing (NLP) is a field of computing that deals with analyzing human, or "natural", language so that computers can work with it. Common examples of NLP include today's spellcheckers, autocompleters, and even some Telegram or Discord bots. These often run on a type of neural network known as an RNN, or Recurrent Neural Network.

### Recurrent Neural Networks
A Recurrent Neural Network (RNN) is a specialized type of neural network. RNNs are mostly used for text and speech analysis, and typically appear alongside Natural Language Processing libraries and models. Unlike CNNs (Convolutional Neural Networks), which are better for image and video analysis, RNNs are mostly implemented in conversational AI, where they can analyze the sentiment and message of a sentence.

## Bag of Words

If you've done anything with Natural Language Processing and/or Recurrent Neural Networks, you have most likely heard the phrase "bag of words". If not, it's essentially the process of tokenizing a sentence, storing its words in a dictionary, and keeping count of how often each word appears in the sentence. It can be very helpful for simple queries, but the quality starts to deteriorate once you feed the model complex inputs.

### Downsides of the Bag of Words & Why Buddiey Doesn't Use It
Let's look at two sentences to expose some downsides:



```
I thought the movie was going to be bad, but it was great!
```

```
I thought the movie was going to be great, but it was bad!
```

A typical bag of words model would split both of these sentences into the same dictionary structure:


In [None]:
vocab = {}
word_encode = 1
def bag_of_words(text):
  global word_encode # It is not recommended to use global variables; I'm only using one here for demonstration

  words = text.lower().split(" ")
  bag = {}

  # Iterate through the words and check whether each one already exists in the model vocabulary
  for word in words:
    if word in vocab:
      encoding = vocab[word]
    else:
      vocab[word] = word_encode
      encoding = word_encode
      word_encode += 1
    
    if encoding in bag:
      bag[encoding] += 1
    else:
      bag[encoding] = 1
  
  return bag

movie_bad_great = "I thought the movie was going to be bad but it was great"
movie_great_bad = "I thought the movie was going to be great but it was bad"
bag_1 = bag_of_words(movie_great_bad)
bag_2 = bag_of_words(movie_bad_great)
print(f"Model Vocabulary: {vocab}")
print(f"First Review: {bag_1}")
print(f"Second Review: {bag_2}")

Model Vocabulary: {'i': 1, 'thought': 2, 'the': 3, 'movie': 4, 'was': 5, 'going': 6, 'to': 7, 'be': 8, 'great': 9, 'but': 10, 'it': 11, 'bad': 12}
First Review: {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1}
Second Review: {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 8: 1, 12: 1, 10: 1, 11: 1, 9: 1}


When you run the code cell, you can see that these two sentences have completely different meanings, but the Bag of Words function encodes them the same way. We can confirm that the two bags are equal with the script below.

In [None]:
# Function to see if two elements equal each other
def element_equal(bag1, bag2):
  return bag1 == bag2

equal = element_equal(bag_1, bag_2)
print(equal)

True

As you can see when running the code cell above, both bags returned from the function are equivalent, even though the sentences have opposite intentions: one says that the movie was bad, while the other says that it was great. This small flaw can ruin the user's experience with Buddiey, and we can't risk that. So, keep reading to see how our team optimized our Recurrent Neural Network and Natural Language Processing model with newer and more accurate techniques.


## Buddiey's Optimizations

This section covers the replacements that the Buddiey team has implemented in its model. We chose different neural network layers and NLP algorithms over the traditional "Bag of Words" algorithm. Of course, I'm not going to say what they are just yet (that's why each one has a dedicated section), so keep reading to see what we changed and why.

### Buddiey's NLP: Word Embeddings

Luckily for us, there is another method that is more accurate and scalable, and that removes the need for the classic "Bag of Words". Rather than taking the words from a sentence and assigning each one an arbitrary index in the model's vocabulary, this method keeps the order of the words intact and encodes each word into a dense vector that represents its meaning and its context in the sentence (or message thread, in our scenario).
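To make the contrast with the bag of words concrete, here is a minimal sketch of an embedding lookup in plain Python. It is only an illustration: the helper names `build_embedding_table` and `embed_sentence` are hypothetical, and the vectors are random placeholders, whereas in a trained model they would be learned so that related words end up close together.

```python
import random

def build_embedding_table(vocabulary, dim=4, seed=0):
    """Assign every word a dense vector (random placeholders, not learned values)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1, 1) for _ in range(dim)] for word in vocabulary}

def embed_sentence(sentence, table):
    """Replace each word with its vector while keeping the word order intact."""
    return [table[word] for word in sentence.lower().split()]

review_a = "I thought the movie was going to be bad but it was great"
review_b = "I thought the movie was going to be great but it was bad"

vocabulary = sorted(set(review_a.lower().split()))
table = build_embedding_table(vocabulary)

seq_a = embed_sentence(review_a, table)
seq_b = embed_sentence(review_b, table)

print(len(seq_a), len(seq_a[0]))  # 13 words, 4 numbers per word
print(seq_a == seq_b)             # False: the order of the words is preserved
```

Because each sentence becomes an ordered sequence of vectors rather than a set of counts, the two reviews no longer collapse into the same representation.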




Think of it this way. We have four words: good, bad, happy, and sad.

In a traditional bag of words model, the index of each word carries no meaning: it simply depends on the order in which the words first appear in a given phrase or sentence.

For example:



In [None]:
sentence = "Am I good or bad, or am I happy or sad?"

So we created a basic sentence that incorporates all four words. Now, let's process it with our existing "Bag of Words" function and see what we get.

In [None]:
vocab = {}

bag_sentence = bag_of_words(sentence)

print(vocab)
print(bag_sentence)

{'am': 13, 'i': 14, 'good': 15, 'or': 16, 'bad,': 17, 'happy': 18, 'sad?': 19}
{13: 2, 14: 2, 15: 1, 16: 3, 17: 1, 18: 1, 19: 1}


As you can see, the vocab dictionary indexes each word based on its order in the sentence, not on its meaning or its connection with the other words in that sentence. This can prove fatal in production, since the model would have a very high probability of making mistakes in its NLP core and confusing the user, whether it's a chatbot, autocorrect, or even a text-to-speech program.

### SimpleRNN - Simple Recurrent Layer

As you can tell from the name, a SimpleRNN is a layer that belongs to a Recurrent Neural Network. It is the simplest recurrent layer, and things only get more complex from here.



Rather than taking in all of the data at once, the SimpleRNN works through the sentence one word at a time, from left to right: the sentence is split into words, each word is turned into a vector by the "Word Embeddings" algorithm, and that vector is converted into numeric values that the model can use.

Let's use an example of:
```
sentence = "Hi I am Buddiey"
```

Since this sentence has four words, the layer will assign 4 cells to process it. To better understand this process, let's walk through its two crucial steps.

#### Split The Sentence

This is typically done with a simple algorithm that grabs each word from a sentence and saves it in a list for reference.

```
words = ["hi", "i", "am", "buddiey"]
```

It is important that this is done first, because the neural network layer takes a single word as input at each step and doesn't accept an entire sentence at once. Also, we are using the "Word Embeddings" algorithm instead of "Bag of Words", which requires the input to be split up into words that are later stored as vectors for finding the intent, or meaning, of a query.

In [None]:
# How it's done
sentence = "Hi I am Buddiey"
words = sentence.lower().split(" ") # ["hi", "i", "am", "buddiey"]
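
A plain split on spaces keeps punctuation attached to the words, which is why the earlier bag of words output contained tokens such as `'bad,'` and `'sad?'`. The following is a minimal sketch of one way to avoid that, assuming a simple regular expression is good enough for the input; the `tokenize` helper is hypothetical and not part of Buddiey's code.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words without surrounding punctuation."""
    # Keep runs of letters, digits, and apostrophes; everything else acts as a separator
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Am I good or bad, or am I happy or sad?"))
# ['am', 'i', 'good', 'or', 'bad', 'or', 'am', 'i', 'happy', 'or', 'sad']

print(tokenize("Hi I am Buddiey"))
# ['hi', 'i', 'am', 'buddiey']
```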

#### Process Each Word

From here, all the SimpleRNN needs to do is process each word of the sentence and return the query as a combination of numerical computations, which are obtained by flattening the vectors retrieved from the "Word Embeddings" algorithm.

Since there are 4 words in the given sentence, the RNN layer will assign 4 cells to compute the intent. The first cell takes in the first word as input and compiles it into a vector, which is then converted to a numerical value for further use. The second cell receives the output of the first cell along with the second word, and the same procedure happens. The process keeps repeating like this for every cell until every word has been processed.

In [None]:
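# Note: The following code is a simplified illustration and has no real computational value. It only shows how each word gets indexed.

def process_word(word: str) -> int:
  """Takes in a word and returns a numerical value."""
  # Placeholder (an assumption for this demo): sum the character codes of the word
  return sum(ord(char) for char in word)

# Each cell combines the previous cell's output with the current word
cell1 = process_word(words[0])
cell2 = cell1 + process_word(words[1])
cell3 = cell2 + process_word(words[2])
cell4 = cell3 + process_word(words[3])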

```
(Example) Output: Cells successfully created and ready to use
```
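
The cell-by-cell description above can be sketched a little more faithfully with the usual recurrent update, where each new state is `tanh(W_in * x + W_rec * previous_state)`. This is a toy version in plain Python, not Buddiey's model: the function name `simple_rnn`, the random weights, the hidden size, and the 2-number word vectors are all assumptions made for illustration, and bias terms are left out for brevity.

```python
import math
import random

def simple_rnn(inputs, hidden_size=3, seed=0):
    """Run a toy recurrent layer over a list of word vectors and return every state."""
    rng = random.Random(seed)
    input_size = len(inputs[0])
    # Random placeholder weights; a real layer learns these during training
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]

    hidden = [0.0] * hidden_size  # the state before the first word
    states = []
    for x in inputs:
        # Each step combines the current word vector with the previous state
        hidden = [
            math.tanh(
                sum(w_in[i][j] * x[j] for j in range(input_size))
                + sum(w_rec[i][k] * hidden[k] for k in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
        states.append(hidden)
    return states

# Four made-up 2-number vectors standing in for "hi", "i", "am", "buddiey"
word_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.3]]

states = simple_rnn(word_vectors)
reversed_states = simple_rnn(word_vectors[::-1])

print(len(states))                          # one state per word
print(states[-1] != reversed_states[-1])    # word order changes the final state
```

Unlike the bag of words, feeding the same words in a different order produces a different final state, which is what lets a recurrent layer tell the two movie reviews apart.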

### LSTM - Long Short-Term Memory

The LSTM, or Long Short-Term Memory layer, is another layer that can be added to a Recurrent Neural Network. However, it differs greatly from a SimpleRNN layer because, as the name suggests, the layer has a memory of sorts and can retrieve data from previous time steps. In the SimpleRNN layer, earlier data gradually fades away as the layer prioritizes the step currently in progress. The LSTM, in contrast, can retrieve data from further in the past, which is valuable when Buddiey refers to data from previous discussions and remembers those conversations.

**Note:** Because the blueprints of LSTM and SimpleRNN are so similar, I'm not going to supply separate code snippets for the LSTM in action.
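
That said, the difference in memory is easy to see with a toy example. The sketch below feeds a single "important" input followed by 20 empty steps into two scalar update rules. Both functions are hypothetical simplifications: the gates here are fixed numbers chosen for illustration, whereas a real LSTM learns its gates and computes them from the current input and the previous state.

```python
import math

def simple_rnn_trace(inputs, recurrent_weight=0.5):
    """Scalar SimpleRNN-style update: the old state is squashed at every step."""
    hidden = 0.0
    for x in inputs:
        hidden = math.tanh(recurrent_weight * hidden + x)
    return hidden

def gated_memory_trace(inputs, forget_gate=0.99, input_gate=1.0):
    """Scalar LSTM-style cell state: the forget gate decides how much is kept."""
    cell = 0.0
    for x in inputs:
        cell = forget_gate * cell + input_gate * x
    return cell

# One meaningful input followed by 20 steps with nothing new
inputs = [1.0] + [0.0] * 20

rnn_memory = simple_rnn_trace(inputs)
lstm_memory = gated_memory_trace(inputs)

print(rnn_memory < 0.001)  # True: the first input has almost completely faded
print(lstm_memory > 0.8)   # True: most of the first input is still remembered
```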

## How it Works

I could have put this at the beginning of this Jupyter Notebook (or Google Colab or PDF), but I decided that I should explain what tools are used in depth, then explain how they all come together. Sorry for making you read all of this 😛

## Buddiey - Making the Brains

Ok, so now you know how everything fits into place, and why I wrote in such thorough detail about all of the concepts, from the "Bag of Words" to the LSTM. Now I'll start writing the code and explain how it relates to my initial explanation.

### Importing Our Datasets

The first step in any RNN- or NLP-based solution is to import the datasets.

In [None]:
import pandas as pd

link = "https://app.buddiey.live/resources/"
df = pd.read_csv(link + "buddiey-training.tsv", sep='\t', header=None)
df

Unnamed: 0,0,1
0,Florida doesn't have earthquakes or fires.,"No, florida just has hurricanes every year fro..."
1,What's your name ?,My name is marfa and i can be your best friend
2,"Would you put suntan lotion on my back, please?",Sure.
3,What is the purpose of living ?,To live forever .
4,My pants have a hole in the front pocket.,You shouldn't carry pens in your pocket.
...,...,...
504,The only thing in my backpack is used books.,But thieves don't know that.
505,I hate you,Why? what have i done to derserve this? can't ...
506,"I am happy, but marriage is a lot of responsib...","Yes, you have to take care of your wife."
507,What is your best childhood memory ?,"When we were kids , we had our own a secret la..."
