# Adding AI to Your App

There are several ways to bring AI and ML into your business idea:

1. Building your own proprietary model
2. Using a pre-trained model within your application
3. Leveraging cloud providers to support AI functions in your application

Each of these approaches has pros and cons. Approach 1 gives your business the greatest degree of control, in terms of ***taking ownership*** and ***being independent***.

Part 1 of this notebook provides code examples for building your own proprietary ML model. Part 2 then looks at an example of how a model can be evaluated. Model evaluation is critical for a business regardless of whether the model is built in-house or provided by an external vendor.

# Part 1: Building your Proprietary AI model

In this section, we look at how an ML model can be built from scratch for your organisation. As the example shows, this is the more complex way of leveraging ML, because it entails the cost of sourcing the relevant data and the expertise required to build and maintain an in-house model. From a strategic perspective, however, this approach is preferable: it adds real value to the business, because the ML model becomes the intellectual property of your business and can give it a competitive edge. It also frees your business from relying on third-party licensing restrictions and terms of service agreements governed by the owners of AI/ML models that may become critical to your business.

## Installing the Python Libraries

The first step is to install the Python libraries needed to train our own ML model. Off-the-shelf machine learning libraries are available in the majority of programming languages, and they provide well-tested ML algorithms for training models with your own data. These libraries often come with permissive licenses (e.g. Apache 2.0, MIT and BSD, to name a few) that give you the freedom to use these tools without compromising the legal ownership of the models you train with them.

In this example, we use Python, a programming language with a very rich ecosystem for data science and machine learning. For this implementation, we need [scikit-learn](https://scikit-learn.org/stable/) and [pandas](https://pandas.pydata.org/). We can use the [pip](https://pypi.org/project/pip/) package manager to install these two libraries.

In [None]:
!pip install -U scikit-learn
!pip install pandas

## Loading the Data

We use a popular publicly available labelled [sentiment analysis dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/) to demonstrate the different approaches. We use a star rating dataset from the famous movie review website, [IMDB](https://www.imdb.com/).

First, we load the data from the local disk using the `load_data` function defined below. Then we convert the dataset into a `pandas.DataFrame` object, which works well with the `scikit-learn` machine learning library.

In [None]:
# import required functions
from os.path import join
from os import listdir

data_dir = "data"
pos_data_dir = join(data_dir, "pos")
neg_data_dir = join(data_dir, "neg")

In [None]:
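def load_data(filepath, label):
    """Read every .txt file in `filepath` and tag each one with `label`."""
    # Collect the full paths of all text files in the directory
    files = [join(filepath, filename) for filename in listdir(filepath) if filename.endswith(".txt")]
    records = []

    for file in files:
        with open(file) as f:
            text = f.read()

        # One record per review: the raw text plus its sentiment label
        records.append({"text": text, "label": label})

    return records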

In [None]:
import pandas as pd
from sklearn.utils import shuffle


pos = load_data(pos_data_dir, 1)
neg = load_data(neg_data_dir, 0)

records_df = shuffle(pd.DataFrame(pos + neg)).reset_index(drop=True)

In [None]:
records_df

## Train-Test Split

When training a machine learning model, we need to make sure that there is an ***unseen*** set of examples that we can use to evaluate the true performance of the trained model. A popular approach is to set aside a percentage of the full dataset for testing the trained model and to avoid using that data during the training process.

`scikit-learn` already provides functions that can easily create this test-data allocation for us. In the following step, we create the train-test data split.

In [None]:
from sklearn.model_selection import train_test_split

X = records_df["text"]
y = records_df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
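
To see what the split produces, here is a minimal sketch on toy data. The names `texts`, `labels` and the `train_*`/`test_*` variables are hypothetical, and the `stratify` argument is an optional extra that the notebook itself does not use; it asks `train_test_split` to keep the label ratio similar in both parts.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): ten short "reviews" with balanced labels.
texts = ["review {}".format(i) for i in range(10)]
labels = [1, 0] * 5

# Hold out 30% of the examples for testing.
# stratify=labels is an optional assumption, not part of the notebook's split.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# Sizes of the two parts, and a check that no example appears in both.
print(len(train_texts), len(test_texts))
print(set(train_texts) & set(test_texts))
```

Fixing `random_state` makes the split repeatable, which helps when comparing models trained at different times.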

### Data Vectorisation

Machine learning models predominantly work with ***numerical representations***. This means that the input we provide to the machine learning algorithm should contain numbers rather than letters and symbols. In this example, we are working with movie reviews, which are text. The process of transforming non-numerical data into numerical vectors in a sensible manner is called ***data vectorisation*** in data science.

To create numerical representations of our text, we can use well-tested methods such as extracting the [TF-IDF representation](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) of each document. We use pre-built functions available in `scikit-learn` to vectorise the text. The `scikit-learn` library provides different vectorisation methods (also known as feature extraction) for different modalities such as [text](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text) and [images](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.image).



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectoriser = TfidfVectorizer(stop_words="english")

In [None]:
# Learn the vocabulary and weights from the training text only, then vectorise it
x_train = vectoriser.fit_transform(X_train)
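
The result of vectorisation is easier to picture on a tiny corpus. The sketch below uses three made-up sentences (`docs`, `toy_vectoriser` and `toy_matrix` are hypothetical names) and shows that the output is a matrix with one row per document and one column per vocabulary word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical), standing in for the movie reviews.
docs = [
    "a great and moving film",
    "a dull and boring film",
    "great acting, great story",
]

toy_vectoriser = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectoriser.fit_transform(docs)

# One row per document, one column per word kept in the vocabulary.
print(toy_matrix.shape)

# Common English words such as "a" and "and" are dropped by stop_words="english".
print(sorted(toy_vectoriser.vocabulary_))

# Dense view of the TF-IDF weights (only sensible for tiny examples).
print(toy_matrix.toarray().round(2))
```

The real training matrix has the same layout, only with far more rows and columns, which is why `scikit-learn` stores it in a sparse format.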

### Model Training

After vectorising the data, we choose an appropriate machine learning model and train it by ***fitting the model*** to the training data.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(x_train, y_train)

In [None]:
y_train_pred = model.predict(x_train)
print(list(y_train_pred))
print(list(y_train))

### Model Testing

Once the model is trained, we need to see how robust this model is in making predictions on data that it hasn't seen before. We can use the pre-allocated test data to serve this purpose.

In [None]:
x_test = vectoriser.transform(X_test)
y_test_pred = model.predict(x_test)
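
Note that the test text goes through `transform`, not `fit_transform`, so the vocabulary learned from the training data is reused rather than relearned. A minimal sketch of the consequence, on made-up sentences (`toy_vectoriser` and `new_vec` are hypothetical names): words that were not present when the vectoriser was fitted have no column and are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_vectoriser = TfidfVectorizer()
toy_vectoriser.fit(["good film", "bad film"])

# "plot" was not in the fitting data, so it has no column and is ignored.
new_vec = toy_vectoriser.transform(["good plot"])

# The new vector has as many columns as the fitted vocabulary has words.
print(sorted(toy_vectoriser.vocabulary_))
print(new_vec.shape)
```

Keeping the columns identical between training and test data is what allows the trained model to make predictions on the test vectors.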

## Example Prediction

In [None]:
example_text = list(X_test)[1]
example_label = list(y_test)[1]

print("Text: {}\n Actual Label: {}".format(example_text, example_label))

### Predicting on the example with our proprietary model

In [None]:
x_vect = vectoriser.transform([example_text])
y_pred = model.predict(x_vect)
print("Predicted Label: {}".format(y_pred[0]))
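
The two steps used for a prediction (vectorise, then predict) can also be bundled into a single object. The sketch below is one possible way to do this with `make_pipeline` from `scikit-learn`, on made-up data; `toy_texts`, `toy_labels` and `toy_pipeline` are hypothetical names and are not part of the notebook's model. It also uses `predict_proba` to show the estimated probability of each class rather than only the predicted label.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical): 1 = positive, 0 = negative.
toy_texts = ["great film", "loved it", "terrible film", "hated it"]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorises the text and then feeds it to the classifier.
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_pipeline.fit(toy_texts, toy_labels)

# Raw text goes in directly; the pipeline applies the fitted vectoriser for us.
proba = toy_pipeline.predict_proba(["great film, loved it"])

# Columns of `proba` follow the order of `classes_`.
print(toy_pipeline.classes_)
print(proba)
```

A probability close to 0.5 suggests the model is unsure, which can be a useful signal when deciding whether to trust an individual prediction.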

# Part 2: Evaluation

## Evaluating Accuracy on the Training and Test Data

Evaluating a trained machine learning model is critical to establishing the value it can bring to your business. In this section we look at how we can evaluate the performance of the trained sentiment classification model. 

The [accuracy score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) can be used here to evaluate classification accuracy. 
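
Accuracy is the fraction of predictions that match the actual labels. A minimal sketch on made-up labels (`actual`, `predicted` and `score` are hypothetical names, unrelated to the notebook's variables):

```python
from sklearn.metrics import accuracy_score

# Toy labels (hypothetical): four of the five predictions match.
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 0]

score = accuracy_score(actual, predicted)
print(score)  # 4 correct out of 5, so the score is 0.8
```

The first argument holds the actual labels and the second the predicted labels.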

### Exercise

Using the `accuracy_score` function in the `sklearn.metrics` module, calculate the classification accuracy of the trained model on both the training data and the testing data (using actual and predicted labels).

In [None]:
# insert code here

In [None]:
train_accuracy = None  # insert code here

In [None]:
test_accuracy = None  # insert code here

In [None]:
print("Accuracy of the model on training data: {}".format(train_accuracy))
print("Accuracy of the model on test data: {}".format(test_accuracy))