# Revisiting the NLP Pipeline: Deploying NLP Software

Some of the questions we should ask ourselves in this process are:
- What kind of data do we need for training the NLP system? Where do we get this data from? These questions are important at the start and also later as the model matures.
- How much data is available? If it’s not enough, what data augmentation techniques can we try?
- How will we label the data, if necessary?
- How will we quantify the performance of our model? What metrics will we use to do that?
- How will we deploy the system? Using API calls over the cloud, or a monolith system, or an embedded module on an edge device?
- How will the predictions be served: streaming or batch process?
- Would we need to update the model? If yes, what will the update frequency be: daily, weekly, monthly?
- Do we need a monitoring and alerting mechanism for model performance? If yes, what kind of mechanism do we need and how will we put it in place?

We can then start to focus on building version 1 of the model with strong baselines, implementing the pipeline, deploying the model, and from there, iteratively improving our solution.

Typical steps in deployment of a model include:
1. **Model packaging:** If the model is large, it might need to be saved in persistent cloud storage, such as AWS S3, Azure Blob Storage, or Google Cloud Storage, for easy access. 
2. **Model serving:** The model can be made available as a web service for other services to consume.
3. **Model scaling:** Models that are hosted as web services should be able to scale with respect to request traffic.
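
The packaging and serving steps above can be sketched with the standard library alone. This is a minimal illustration, not a production setup: the keyword-based "model", the `predict` and `call_app` helpers, and the JSON request shape are all hypothetical, and the pickled bytes stand in for an artifact that would normally be written to cloud storage.

```python
import json
import pickle
from io import BytesIO

# Hypothetical "trained model": plain parameters for a keyword-based classifier.
trained_model = {"negative_words": ["awful", "bad", "terrible"]}

# Model packaging: serialize the model to bytes. In practice these bytes would
# be uploaded to persistent storage (e.g., a cloud bucket) and versioned.
packaged = pickle.dumps(trained_model)

# Model serving: load the packaged model once, when the service starts.
model = pickle.loads(packaged)


def predict(model, text):
    """Label a text as negative if it contains any known negative word."""
    tokens = text.lower().split()
    return "negative" if any(t in model["negative_words"] for t in tokens) else "positive"


def app(environ, start_response):
    """Minimal WSGI app: accepts a JSON body {"text": ...}, returns {"label": ...}."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    body = json.dumps({"label": predict(model, payload.get("text", ""))}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_app(text):
    """Invoke the WSGI app in-process, without opening a network socket."""
    raw = json.dumps({"text": text}).encode("utf-8")
    environ = {"REQUEST_METHOD": "POST",
               "CONTENT_LENGTH": str(len(raw)),
               "wsgi.input": BytesIO(raw)}
    response = app(environ, lambda status, headers: None)
    return json.loads(b"".join(response))
```

To expose `app` over HTTP locally, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` is enough. Scaling with request traffic would typically mean running several such workers behind a load balancer, which is outside the scope of this sketch.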

AWS Cloud and SageMaker to serve text classification
![alt text](https://learning.oreilly.com/library/view/practical-natural-language/9781492054047/assets/pnlp_1101.png)

# Building and Maintaining a Mature System

We need to manage the complexity of a mature NLP model while making sure it’s also maintainable. Some of the issues we need to consider in this process are:
- Finding better features
- Iterating existing models
- Code and model reproducibility
- Troubleshooting and testing
- Minimizing technical debt
- Automating the ML process

## Finding better features

> We’ve repeatedly stressed the importance of building a simple model first.

There are plenty of statistical methods that can be used to fine-tune our feature sets by removing redundant or irrelevant features. This broad area is called feature selection.

Two popular techniques for feature selection are wrapper methods and filter methods.
- Wrapper methods use an ML model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set and then used to identify the best features based on the error rate of the model. Wrapper methods are computationally expensive, but they often provide the best set of features.
- Filter methods use some sort of proxy measure instead of the error rate to rank and score features (e.g., correlation among the features and correlation with the output predictions). Such measures are fast to compute while still capturing the usefulness of the feature set. Filter methods are usually less computationally expensive than wrappers, but they produce a feature set that’s not as well optimized to a specific type of predictive model.
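
The contrast between the two families can be sketched in plain Python. The filter below ranks features by absolute Pearson correlation with the label, and the wrapper does greedy forward selection using hold-out error. The nearest-centroid classifier and the toy data are assumptions chosen to keep the example small; any model and any proxy measure could be substituted.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def filter_rank(X, y):
    """Filter method: rank feature indices by |correlation with the label|."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def centroid_error(X_tr, y_tr, X_ho, y_ho, feats):
    """Hold-out error rate of a nearest-centroid classifier restricted to `feats`."""
    centroids = {}
    for label in sorted(set(y_tr)):
        rows = [r for r, l in zip(X_tr, y_tr) if l == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    errors = 0
    for row, label in zip(X_ho, y_ho):
        point = [row[j] for j in feats]
        pred = min(centroids,
                   key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
        errors += pred != label
    return errors / len(y_ho)


def wrapper_forward_select(X_tr, y_tr, X_ho, y_ho):
    """Wrapper method: greedily add the feature that most lowers hold-out error."""
    selected, best_err = [], 1.0
    remaining = list(range(len(X_tr[0])))
    while remaining:
        err, j = min((centroid_error(X_tr, y_tr, X_ho, y_ho, selected + [j]), j)
                     for j in remaining)
        if err >= best_err:  # stop when no remaining feature improves the error
            break
        selected.append(j)
        remaining.remove(j)
        best_err = err
    return selected, best_err


# Toy data: feature 0 is informative, feature 1 is noise, feature 2 is constant.
X_train = [[0.1, 5.0, 1.0], [0.2, 1.0, 1.0], [0.9, 4.0, 1.0], [1.0, 2.0, 1.0]]
y_train = [0, 0, 1, 1]
X_hold = [[0.15, 3.0, 1.0], [0.95, 3.0, 1.0]]
y_hold = [0, 1]
```

Note the cost difference: the filter computes one statistic per feature, while the wrapper retrains the model once per candidate feature in every round.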

## Iterating existing models

An NLP model is seldom a static entity. We’re often required to update our models even in production systems. There are several reasons for this. We may get more (and newer) data that differs from previous training data. If we don’t update our model to reflect this change, it will soon become stale and churn out poor predictions. We may get some user feedback on where the model predictions are going wrong. This will then require us to reflect on the model and its features and make amendments accordingly. In both cases, we need to set up a process to periodically retrain and update the existing model and deploy the new model in production.
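
One possible shape for the "retrain, then update" step is a simple comparison on freshly labeled data before the new model replaces the old one. The sketch below is an assumption about how such a check could look, not a prescribed process: the two toy models, the `min_gain` threshold, and the example texts are hypothetical.

```python
def accuracy(predict, examples):
    """Fraction of (text, label) examples the model labels correctly."""
    return sum(predict(text) == label for text, label in examples) / len(examples)


def choose_model(production, candidate, fresh_examples, min_gain=0.01):
    """Deploy the retrained candidate only if it beats production by `min_gain`."""
    prod_acc = accuracy(production, fresh_examples)
    cand_acc = accuracy(candidate, fresh_examples)
    winner = "candidate" if cand_acc >= prod_acc + min_gain else "production"
    return winner, prod_acc, cand_acc


def production_model(text):
    """Older model: only knows the word 'good'."""
    return "pos" if "good" in text.split() else "neg"


def candidate_model(text):
    """Retrained model: has also learned the newer slang word 'lit'."""
    return "pos" if {"good", "lit"} & set(text.split()) else "neg"


# Newer data that differs from what the production model was trained on.
fresh_examples = [("good movie", "pos"), ("this is lit", "pos"),
                  ("so boring", "neg"), ("totally lit", "pos")]
```

Here `choose_model(production_model, candidate_model, fresh_examples)` picks the candidate, because the production model misses the two examples that use the newer vocabulary.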

## Code and Model Reproducibility

Maintaining separation between code, data, and model(s) is always a good strategy. Separating code and data is generally a best practice in software engineering, and it becomes even more critical for AI systems.

It’s good practice to name model and data versions systematically so that we can revert easily if needed. When storing models, try to keep all model parameters, along with other variables, in a separate file. Similarly, avoid hardcoded parameter values in the model. If you have to use arbitrary numbers in the training process (e.g., a seed value), explain them in code comments.

A keystone for improving reproducibility is to make sure to note all steps explicitly. This is especially necessary in the exploratory phase of data analysis.
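
As a rough illustration of these practices, the sketch below keeps parameters in a separate JSON document, derives a version tag from hashes of the config and the data, and uses a documented seed. The config keys, the tag format, and the helper names are all hypothetical.

```python
import hashlib
import json
import random

# Parameters live outside the model code. In practice this would be a file
# (e.g., a JSON or YAML config) stored and versioned next to the model.
CONFIG_JSON = """
{"model_name": "sentiment-clf", "learning_rate": 0.01, "epochs": 5, "seed": 42}
"""


def load_config(text):
    """Parse the parameter file contents into a dict."""
    return json.loads(text)


def fingerprint(obj):
    """Short, stable hash of any JSON-serializable object."""
    canonical = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:8]


def version_tag(config, training_rows):
    """Name a model version after the exact config and data that produced it."""
    return (f"{config['model_name']}"
            f"-cfg{fingerprint(config)}"
            f"-data{fingerprint(training_rows)}")


def shuffled(rows, seed):
    """Shuffle with a local RNG so the same seed always gives the same order."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    return rows


config = load_config(CONFIG_JSON)
training_rows = ["good movie", "bad plot", "great cast"]
# The seed value 42 is arbitrary; it is fixed only so that runs are repeatable.
train_order = shuffled(training_rows, config["seed"])
tag = version_tag(config, training_rows)
```

Because the tag changes whenever the config or the data changes, two models with the same tag were trained from the same inputs, which makes reverting to an earlier version less error-prone.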

## Troubleshooting and Interpretability

Considering the probabilistic nature of ML models, how to test ML models is not obvious. When it comes to testing the model, the following steps are helpful:
- Run the model on train, validation, and test datasets used during the model-building phase. K-fold cross validation is often used to verify model performance.
- Test the model for edge cases. For example, for sentiment classification, test with sentences with double or triple negation.
- Analyze the mistakes the model is making. For NLP, packages and techniques like TensorFlow Model Analysis, LIME, SHAP, and attention networks can give a deeper understanding of what the model is doing deep down.
- Another good practice is to build a subsystem that keeps track of key statistics of the features. Since all features are numerical, we can maintain statistics like mean, median, standard deviation, distribution plots, etc. Any deviation in these statistics is a red flag, and we’re likely to see the system churning out wrong predictions. The reason could be as simple as a bug in the pipeline or as complex as a covariate shift in the underlying data. Packages like TensorFlow Model Analysis can track these metrics.
- Create dashboards for tracking model metrics and create an alerting mechanism on them in case there are any deviations in the metrics.
- It’s always good to know what a model is doing inside. This goes a long way toward understanding why a model is behaving in a certain way. A key question in AI has been how to create intelligent systems where we can explain why the model is doing what it is doing. This is called interpretability. It’s the degree to which a human can understand the cause of a decision. While many algorithms in machine learning (such as decision trees, random forests, XGBoost, etc.) and computer vision have been very interpretable, this is not true for NLP, especially DL algorithms. With recent techniques such as attention networks, LIME, and SHAP, we have greater interpretability in NLP models. Interested readers can look at [Interpretable Machine Learning by Christoph Molnar](https://christophm.github.io/interpretable-ml-book/).
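
The feature-statistics subsystem mentioned in the list above could start as something as small as the following sketch. It compares the mean of each live feature against a baseline and flags large shifts. The feature names, the sample values, and the `max_shift` threshold of three baseline standard deviations are assumptions made for illustration.

```python
import statistics


def summarize(values):
    """Key statistics to track for one numerical feature."""
    return {"mean": statistics.mean(values),
            "median": statistics.median(values),
            "stdev": statistics.stdev(values)}


def drifted_features(baseline, live, max_shift=3.0):
    """Flag features whose live mean moved more than `max_shift` baseline
    standard deviations away from the baseline mean."""
    flagged = []
    for name, ref_values in baseline.items():
        ref = summarize(ref_values)
        cur = summarize(live[name])
        scale = ref["stdev"] or 1.0  # avoid dividing by zero for constant features
        if abs(cur["mean"] - ref["mean"]) / scale > max_shift:
            flagged.append(name)
    return flagged


# Hypothetical feature values seen at training time and in production.
baseline = {"doc_length": [100, 110, 90, 105, 95], "num_urls": [0, 1, 0, 1, 0]}
live = {"doc_length": [400, 420, 390, 410], "num_urls": [0, 1, 1, 0]}
```

With this data, `drifted_features(baseline, live)` flags only `doc_length`. Such a flag does not say whether the cause is a pipeline bug or a genuine covariate shift; it only says that someone should look.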

## Monitoring

We need to monitor the model for a range of things and trigger alerts at the right points:
- Model performance has to be monitored regularly. For a web service–based model, it can be the mean and various percentiles—50th (median), 90th, 95th, and 99th (or deeper)—for response time. If the model is deployed as a batch service, statistics on the batch processing and task times have to be monitored.
- Similarly, it helps to store and monitor model parameters, behavior, and KPIs. Model KPIs for the abusive comments example would be the percentage of comments that were reported by users but not flagged by the model. For a text classification service, it could be the distribution of classes that are classified each day.
- For all the metrics we’re monitoring, we need to periodically run them through an anomaly detection system that can alert changes in normal behavior. This could be a sudden spike in the response rate of a web service or a sudden drop in retraining times. In the worst case, when the performance drops substantially, we may also want to hit circuit breakers (i.e., move to a more stable model or a default approach).
- If our overall engineering pipeline is using a logging framework, there’s a good chance it also has support for monitoring anomalies over time for any metric. For instance, ELK stack by Elastic offers built-in anomaly detection. Sumo Logic also flags outliers that can be queried as needed. Microsoft also offers anomaly detection as a service.
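
The monitoring ideas above (percentiles of response time, anomaly detection, and a circuit breaker) can be sketched as follows. The nearest-rank percentile definition, the z-score rule with a threshold of 3, and the model names `primary` and `fallback` are assumptions; a real deployment would more likely rely on the anomaly detection built into its logging stack.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(latencies_ms):
    """Mean and the 50th, 90th, 95th, and 99th percentiles of response time."""
    report = {"mean": statistics.mean(latencies_ms)}
    for pct in (50, 90, 95, 99):
        report[f"p{pct}"] = percentile(latencies_ms, pct)
    return report


def is_anomalous(history, new_value, z_threshold=3.0):
    """Simple z-score check of a new metric value against its recent history."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1e-9  # guard against a flat history
    return abs(new_value - mean) / stdev > z_threshold


def pick_model(history, new_value):
    """Circuit breaker: route to a stable fallback when the metric looks abnormal."""
    return "fallback" if is_anomalous(history, new_value) else "primary"


# Hypothetical daily p95 response times in milliseconds.
p95_history = [120, 125, 118, 122, 121]
```

The same `is_anomalous` check can be pointed at any tracked metric, such as the daily class distribution or retraining time, not only latency.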

## Minimizing Technical Debt

A good rule of thumb is to look at the coverage a feature provides. If a feature is present in only a few data points, say, 1%, then maybe it’s not worth keeping. But even something like this can’t be applied blindly. For example, if the same feature covers just 1% of the data but gives 95% classification accuracy just based on that feature, then it’s really effective and most certainly worth continuing to use.
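
The coverage rule of thumb, and its exception, can be checked with a few lines of code. In this sketch the feature name `mentions_refund`, the labels, and the 100-row dataset are made up, and "accuracy based on the feature alone" is read as: how often does the label match when the feature is present.

```python
def coverage_and_accuracy(rows, feature, label_when_present):
    """Return (coverage, accuracy) for a single binary feature.

    coverage: fraction of rows where the feature is present.
    accuracy: among those rows, how often predicting `label_when_present`
              from this feature alone would be correct.
    """
    covered = [label for features, label in rows if features.get(feature)]
    coverage = len(covered) / len(rows)
    if not covered:
        return coverage, None
    accuracy = sum(label == label_when_present for label in covered) / len(covered)
    return coverage, accuracy


# Hypothetical dataset: the feature fires on only 2 of 100 rows,
# but both of those rows are complaints.
rows = [({"mentions_refund": True}, "complaint")] * 2 + [({}, "other")] * 98
coverage, accuracy = coverage_and_accuracy(rows, "mentions_refund", "complaint")
```

A feature like this one has low coverage but high accuracy where it applies, which is the case the paragraph above argues is worth keeping.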

> TIP: Opt for a simpler model that has performance comparable to a much more complex model if you want to minimize technical debt.

Besides these recommendations, we’d also like to share some landmark work on building mature ML systems:
- [“A Few Useful Things to Know About Machine Learning”](https://sites.astro.caltech.edu/~george/ay122/cacm12.pdf) by Pedro Domingos of the University of Washington.
- [“Machine Learning: The High-Interest Credit Card of Technical Debt”](https://research.google/pubs/pub43146/) by a team at Google AI.
- [“Hidden Technical Debt in Machine Learning Systems”](https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf) by a team at Google AI.
- [Feature Engineering for Machine Learning](https://learning.oreilly.com/library/view/feature-engineering-for/9781491953235/), a book written by Alice Zheng and Amanda Casari
- [“Ad Click Prediction: A View from the Trenches,”](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41159.pdf) a work by a Google Search team on the issues faced by a large online ML system [37]
- [“Rules of Machine Learning,”](https://developers.google.com/machine-learning/guides/rules-of-ml) an online guide created by Martin Zinkevich of Google.
- [“The Unreasonable Effectiveness of Data,”](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf) a report by Alon Halevy, Peter Norvig, and Fernando Pereira of Google.
- [“Revisiting Unreasonable Effectiveness of Data in Deep Learning Era,”](https://arxiv.org/abs/1707.02968) a more recent look at the previous report by a team from Google Research and Carnegie Mellon University.

## Automating Machine Learning

One of the holy grails of machine learning is to automate more and more of the feature engineering process. This has led to the creation of a subarea called AutoML (automated machine learning), which aims to make machine learning more accessible. In most cases, it generates a data analysis pipeline that can include data pre-processing, feature selection, and feature engineering methods. This pipeline essentially selects ML methods and parameter settings that are optimized for a specific problem and data. As all of these steps can be time consuming for the ML expert and may be intractable for a beginner, AutoML can be a much-needed bridge for a gap in the world of machine learning. AutoML is itself essentially “doing machine learning using machine learning,” making this powerful and complex technology more widely accessible for those hoping to make use of massive amounts of data.

AutoML is the cutting edge of machine learning. One should only build it from the bottom up when more traditional methods for improving performance are exhausted. Doing it from scratch often requires substantial computing and GPU resources and a higher level of technical skill.

Some existing AutoML tools and services are:
- Auto-Sklearn
- Google Cloud AutoML:
    - Text classification
    - Entity extraction
    - Sentiment analysis
    - Machine translation
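
To make the idea of "selecting ML methods and parameter settings that are optimized for a specific problem and data" concrete, here is a toy search over pipeline settings. It is not how Auto-Sklearn or Google Cloud AutoML work internally; it is only an exhaustive grid search over a pre-processing choice (`lowercase`) and a hyperparameter (`alpha`) for a small Naive Bayes-style scorer with uniform class priors. All names and data are hypothetical.

```python
import itertools
import math
from collections import Counter


def tokenize(text, lowercase):
    """Pre-processing step with one tunable choice: lowercase or not."""
    return (text.lower() if lowercase else text).split()


def train_nb(examples, lowercase, alpha):
    """Train a tiny Naive Bayes-style classifier and return its predict function."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(tokenize(text, lowercase))
    vocab = {word for counter in counts.values() for word in counter}

    def predict(text):
        def score(label):
            # Log-likelihood with additive smoothing controlled by alpha.
            total = sum(counts[label].values()) + alpha * len(vocab)
            return sum(math.log((counts[label][w] + alpha) / total)
                       for w in tokenize(text, lowercase))
        return max(sorted(counts), key=score)

    return predict


def auto_search(train, valid, grid):
    """Try every pipeline configuration and keep the best one on validation data."""
    best = None
    for lowercase, alpha in itertools.product(grid["lowercase"], grid["alpha"]):
        predict = train_nb(train, lowercase, alpha)
        acc = sum(predict(text) == label for text, label in valid) / len(valid)
        if best is None or acc > best[0]:
            best = (acc, {"lowercase": lowercase, "alpha": alpha})
    return best


train = [("Great phone", "pos"), ("great battery", "pos"),
         ("Terrible screen", "neg"), ("terrible battery life", "neg")]
valid = [("GREAT screen", "pos"), ("TERRIBLE phone", "neg")]
grid = {"lowercase": [False, True], "alpha": [0.1, 1.0]}
best_acc, best_config = auto_search(train, valid, grid)
```

On this data the search settles on a configuration with lowercasing enabled, since the validation texts differ from the training texts only in case. Real AutoML systems search far larger spaces and use smarter strategies than exhaustive enumeration.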

An Auto-Sklearn example is available as a [Google Colab notebook](https://colab.research.google.com/drive/1saaEu1GpK11KRMuf-I1gLqxWPHFoU3Qv?usp=sharing).

# The Data Science Process

Two popular processes in the industry are the KDD process and the Microsoft Team Data Science Process.

## The KDD Process

The KDD process was introduced in the late ’90s.

![alt text](https://learning.oreilly.com/library/view/practical-natural-language/9781492054047/assets/pnlp_1106.png)

These steps are ordered as follows:
1. **Understanding the domain:** This includes learning about the application and understanding the goals of a problem. It also involves getting deeper into the problem domain and extracting relevant domain knowledge.
2. **Target dataset creation:** This includes selecting a subset of data and variables the problem will focus on. We may have a plethora of data sources at our disposal, but we focus on the subset we need to work on.
3. **Data pre-processing:** This encompasses all activities needed so that the data can be treated coherently. This includes filling missing values, noise reduction, and removing outliers.
4. **Data reduction:** If the data has a lot of dimensions, this step can be used to make it easier to work with. This includes steps like dimensionality reduction and projecting the data into another space. This step is optional depending on the data.
5. **Choosing the data mining task:** Various classes of algorithms can be applied to a problem. They may be regression, classification, or clustering. It’s important to select the right task based on our understanding from Step 1.
6. **Choosing the data mining algorithm:** Based on the selected data mining task, we need to select the right algorithm. For instance, for classification, we can choose algorithms such as SVM, random forests, CNNs, etc., as we saw in Chapter 4.
7. **Data mining:** This is the core step of applying the algorithm selected in Step 6 to the given dataset and creating predictive models. Tuning with respect to parameters and hyperparameters also happens here.
8. **Interpretation:** Once the algorithm is applied, the user has to interpret the results. This can be done partially by visualizing various components of results.
9. **Consolidation:** This is the final step where we deploy the built model into an existing system, document the approach, and generate reports.
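
As a small illustration of Step 3 (data pre-processing), the sketch below fills missing values and removes outliers for a single numerical column. Using the median for imputation and the 1.5 × IQR rule for outliers are assumptions made for this example; the KDD process itself does not prescribe specific techniques.

```python
import statistics


def fill_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical raw column: two missing entries and one obvious outlier (400).
raw = [12, 15, None, 14, 13, 400, 16, None, 11]
filled = fill_missing(raw)
cleaned = remove_outliers(filled)
```

The order matters here: filling missing values first means the quartiles are computed over a complete column, and a robust statistic like the median keeps the outlier from distorting the imputed values.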

## Microsoft Team Data Science Process

The Microsoft Team Data Science Process (TDSP) is an agile, iterative data science process for executing and delivering advanced analytics solutions. It’s designed to improve the collaboration and efficiency of data science teams in enterprise organizations. The main features of TDSP are:
- A data science life cycle definition
- A standardized project structure, which includes project documentation and reporting templates
- An infrastructure for project execution
- Tools for data science, like version control, data exploration, and modeling

![alt text](https://learning.oreilly.com/library/view/practical-natural-language/9781492054047/assets/pnlp_1107.png)

# Making AI Succeed at Your Organization

- Team

While there’s no fixed recipe, in our experience, the right blend comes with having:
1. Scientists who build models,
2. Engineers who operationalize and maintain models, and
3. Leaders who manage AI teams and strategize.

It’s good to have scientists who have worked in industry after graduate school, engineers who understand scale and data pipelines, and leaders who have also been individual contributor scientists in the past.

- Right Problem and Right Expectations

In many cases, either the problem at hand is ill defined or AI teams set the wrong expectations.

Another common problem is stakeholders having wrong expectations of AI technology. This often happens because of articles in popular media that tend to compare AI to the human brain. While the brain is a valid motivation for the field of AI, the comparison is far from the truth about what current systems can do.

![alt text](https://learning.oreilly.com/library/view/practical-natural-language/9781492054047/assets/pnlp_1108.png)

- Data and Timing
    - Quality of data: AI systems need high-quality data for both training and prediction.
    - Quantity of data: not having enough data that’s a true representation of the data the model will see in production is a big reason for models not performing well.
    - Data labeling: data labeling is often a continuous process.
    
Currently, AI talent comes at a high cost. Without the right data, it will be futile to hire AI talent; having the right data is a prerequisite for AI teams to deliver well and fast. By this, we don’t mean that we must have all of the prerequisites in place before bringing in AI talent. What we mean is that we must be fully aware of other prerequisites, such as the right data, and have realistic expectations in the absence of it.

- A Good Process
    - Set up the right metrics: both business metrics and AI metrics such as precision and recall.
    - Start simple, establish strong baselines
    - Make it work, make it better
    - Keep shorter turnaround cycles
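
For the AI metrics mentioned in the list above, precision and recall for one class can be computed directly from predictions. The labels and predictions below are made up for illustration, with `abusive` treated as the positive class.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the `positive` class.

    precision = TP / (TP + FP): how many flagged items were truly positive.
    recall    = TP / (TP + FN): how many truly positive items were flagged.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels and model predictions for a comment moderation task.
y_true = ["abusive", "ok", "abusive", "ok", "abusive"]
y_pred = ["abusive", "abusive", "ok", "ok", "abusive"]
precision, recall = precision_recall(y_true, y_pred, positive="abusive")
```

Which of the two to favor is a business decision: a moderation team may accept lower precision to catch more abusive comments, while a spam filter for important mail may want the opposite.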
    
- Other Aspects
    - Cost of compute
    - Blindly following SOTA
    - ROI
    - Full automation is hard

![alt text](https://learning.oreilly.com/library/view/practical-natural-language/9781492054047/assets/pnlp_1109.png)

Others discuss rules of thumb for building AI systems:
- [“Why Is Machine Learning ‘Hard’?,”](http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html) a blog post by S. Zayd Enam, a Stanford researcher.
- [“Software 2.0,”](https://medium.com/@karpathy/software-2-0-a64152b37c35) a blog post on AI as a different way of writing software by Andrej Karpathy, a well-known researcher, educator, and scientist at Tesla.
- [“NLP’s Clever Hans Moment Has Arrived,”](https://thegradient.pub/nlps-clever-hans-moment-has-arrived/) an article by Benjamin Heinzerling that questions the validity of SOTA results obtained on certain popular datasets.
- [“Closing the AI Accountability Gap,”](https://arxiv.org/pdf/2001.00973.pdf) a report by a team at Google AI and the nonprofit Partnership on AI.
- [“The Twelve Truths of Machine Learning for the Real World,”](http://deliprao.com/archives/227) a blog post by Delip Rao, researcher and O’Reilly author.
- [“What I’ve Learned Working with 12 Machine Learning Startups,”](https://towardsdatascience.com/what-ive-learned-working-with-12-machine-learning-startups-a9a3026d2419) an article by Daniel Shenfeld, a startup veteran and ML consultant.

![alt text](https://learning.oreilly.com/library/view/practical-natural-language/9781492054047/assets/pnlp_1110.png)