# <font color=blue> Course3 - Structuring ML Projects

## Week 1 ML Strategy

![image.png](attachment:image.png)

When trying to improve a DL system, you often have a lot of ideas or things you could try. The problem is that if you choose poorly, it is entirely possible that you end up spending six months charging in some direction only to realize that it didn't do any good. For example, I've seen some teams spend literally six months collecting more data only to realize that it barely improved the performance of their system. So, assuming you don't have six months to waste on your problem, wouldn't it be nice if you had quick and effective ways to figure out which of all of these ideas, and maybe even other ideas, are worth pursuing and which ones you can safely discard?

It also turns out that machine learning strategy is changing in the era of deep learning, because the things you could do are now different with deep learning algorithms than with the previous generation of machine learning algorithms.

One of the challenges with building machine learning systems is that there are so many things you could try, so many things you could change, including, for example, so many hyperparameters you could tune. One of the things I've noticed about the most effective machine learning people is that they're very clear-eyed about what to tune in order to try to achieve what effect. This is a process we call orthogonalization.

### Orthogonalization

![image.png](attachment:image.png)

And finally, if it does well on the test set, but it isn't delivering to you a happy cat picture app user, then what that means is that you want to go back and change either the dev set or the cost function. If doing well on the test set according to some cost function doesn't correspond to your algorithm doing what you need it to do in the real world, then it means that either your dev/test set distribution isn't set correctly, or your cost function isn't measuring the right thing.

### Evaluation Metric

You'll find that your progress will be much faster if you have a single real number evaluation metric that lets you quickly tell if the new thing you just tried is working better or worse than your last idea. So when teams are starting on a machine learning project, I often recommend that you set up a single real number evaluation metric for your problem.

### Precision and Recall

![image.png](attachment:image.png)

But it turns out that there's often a tradeoff between precision and recall, and you care about both. You want that, when the classifier says something is a cat, there's a high chance it really is a cat (precision). But of all the images that are cats, you also want it to pull a large fraction of them as cats (recall). So it might be reasonable to try to evaluate the classifiers in terms of both precision and recall.

### F1 Score

The problem with using precision and recall as your evaluation metric is that if classifier A does better on recall, which it does here, and classifier B does better on precision, then you're not sure which classifier is better. And if you're trying out a lot of different ideas, a lot of different hyperparameters, you want to rather quickly try out not just two classifiers, but maybe a dozen classifiers and quickly pick out the, quote, "best ones", so you can keep on iterating from there. And with two evaluation metrics, it is difficult to know how to quickly pick one of the two or quickly pick one of the ten. So what I recommend is, rather than using two numbers, precision and recall, to pick a classifier, you find a new evaluation metric that combines precision and recall. In the ML literature, the standard way to combine precision and recall is something called an F1 score.

Informally, you can think of F1 Score as the average of precision, P, and recall, R. 

![image.png](attachment:image.png)
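The "average" mentioned above is, more formally, the harmonic mean of precision and recall: $F_1 = \frac{2}{1/P + 1/R} = \frac{2PR}{P + R}$. A minimal sketch in Python follows; the classifier names and the precision/recall values are made-up illustrative numbers, not taken from the figure.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both precision and recall are 0
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifiers as (precision, recall):
# A is better on recall, B is better on precision.
classifiers = {"A": (0.80, 0.95), "B": (0.90, 0.85)}

scores = {name: f1_score(p, r) for name, (p, r) in classifiers.items()}
best = max(scores, key=scores.get)  # a single number makes the choice immediate
```

With one number per classifier, picking among a dozen candidates is a single `max` call rather than a judgment call over two columns.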

So what I found for a lot of machine learning teams is that having a well-defined dev set, which is how you're measuring precision and recall, plus a single number evaluation metric (sometimes I'll call it a single real number evaluation metric) allows you to quickly tell if classifier A or classifier B is better. Therefore, having a dev set plus a single number evaluation metric tends to speed up iterating. It speeds up this iterative process of improving your machine learning algorithm.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

So it might be reasonable to keep track of how well your classifiers do in these different markets or these different geographies. But by tracking four numbers, it's very difficult to look at these numbers and quickly decide if algorithm A or algorithm B is superior. And if you're testing a lot of different classifiers, then it's just difficult to look at all these numbers and quickly pick one. So what I recommend in this example is, in addition to tracking your performance in the four different geographies, to also compute the average. And assuming that average performance is a reasonable single real number evaluation metric, by computing the average, you can quickly tell that algorithm C has the lowest average error. And you might then go ahead with that one, if you have to pick an algorithm to keep on iterating from.

So your workflow in machine learning is often that you have an idea, you implement it and try it out, and you want to know whether your idea helped. So what we've seen in this video is that having a single number evaluation metric can really improve your efficiency, or the efficiency of your team, in making those decisions. The next question is how to set up evaluation metrics effectively.
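As a small sketch of the averaging idea above: the per-geography error rates below are hypothetical placeholders (the real numbers are in the figure), chosen only so that algorithm C comes out lowest, as in the text.

```python
from statistics import mean

# Hypothetical dev-set error rates (%) in four geographies, per algorithm.
errors_by_geography = {
    "A": [3.0, 7.0, 5.0, 9.0],
    "B": [5.0, 6.0, 5.0, 10.0],
    "C": [2.0, 3.0, 4.0, 5.0],
}

# Collapse four numbers per algorithm into one single-number metric.
average_error = {name: mean(errs) for name, errs in errors_by_geography.items()}
best = min(average_error, key=average_error.get)  # lowest average error wins
```

The per-geography numbers are still worth tracking; the average is only what drives the quick comparison.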

### Optimizing and Satisficing metric

![image.png](attachment:image.png)

And by defining optimizing as well as satisficing metrics, this gives you a clear way to pick the, quote, "best classifier", which in this case would be classifier B because of all the ones with a running time better than 100 milliseconds, it has the best accuracy. So more generally, if you have N metrics that you care about, it's sometimes reasonable to pick one of them to be optimizing. So you want to do as well as is possible on that one. And then the other N minus 1 are satisficing, meaning that so long as they reach some threshold, such as running times faster than 100 milliseconds, you don't care how much better they are beyond that threshold, but they have to reach that threshold.

The workflow in machine learning is that you try a lot of ideas, train up different models on the training set, and then use the dev set to evaluate the different ideas and pick one. And you keep innovating to improve dev set performance until, finally, you have one classifier that you're happy with, which you then evaluate on your test set.
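The selection rule above can be sketched as a filter followed by a max. This is only an illustration: the `Candidate` type, the `pick_best` helper, and the accuracy and running-time values are assumptions made for the example, with accuracy as the optimizing metric and running time as the satisficing one.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    accuracy: float    # optimizing metric: higher is better
    runtime_ms: float  # satisficing metric: only has to meet the threshold


def pick_best(candidates: Sequence[Candidate],
              max_runtime_ms: float = 100.0) -> Optional[Candidate]:
    """Among candidates meeting the runtime threshold, return the most accurate."""
    feasible = [c for c in candidates if c.runtime_ms <= max_runtime_ms]
    if not feasible:
        return None  # nothing satisfies the constraint
    return max(feasible, key=lambda c: c.accuracy)


# Illustrative values only.
candidates = [
    Candidate("A", accuracy=0.90, runtime_ms=80.0),
    Candidate("B", accuracy=0.92, runtime_ms=95.0),
    Candidate("C", accuracy=0.95, runtime_ms=1500.0),
]
chosen = pick_best(candidates)
```

Note that C is rejected despite having the highest accuracy, and that A being faster than B earns it nothing once both are under the threshold.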

![image.png](attachment:image.png)

Your dev and test sets should come from the same distribution, but how large should they be?

![image.png](attachment:image.png)

In the era of big data, I think the old rule of thumb of a 70/30 split no longer applies. The trend has been to use more data for training and less for dev and test, especially when you have very large data sets. And the rule of thumb is really to set the dev set to be big enough for its purpose, which is to help you evaluate different ideas and pick whether A or B is better. And the purpose of the test set is to help you evaluate your final classifier. You just have to set your test set big enough for that purpose, and that could be much less than 30% of the data.

But the overall guideline is that if your current metric and the data you are evaluating on don't correspond to doing well on what you actually care about, then change your metric and/or your dev/test set to better capture what you need your algorithm to actually do well on. Having an evaluation metric and a dev set allows you to decide much more quickly whether Algorithm A or Algorithm B is better. It really speeds up how quickly you or your team can iterate.

So my recommendation is, even if you can't define the perfect evaluation metric and dev set, just set something up quickly and use that to drive the speed of your team iterating. And if later down the line you find out that it wasn't a good one and you have a better idea, change it at that time; it's perfectly okay.

![image.png](attachment:image.png)

### Bayes Error - human-level error as a proxy for Bayes error

For our cat classification example, think of human-level error as a proxy or an estimate for Bayes error, also called Bayes optimal error. For computer vision tasks, this is a pretty reasonable proxy because humans are actually very good at computer vision, so whatever a human can do is maybe not too far from Bayes error. By definition, human-level error is worse than Bayes error because nothing could be better than Bayes error, but human-level error might not be too far from Bayes error.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Surpassing human-level performance

![image.png](attachment:image.png)

The examples above are not natural perception problems, so these are not computer vision, or speech recognition, or natural language processing tasks. Humans tend to be very good at natural perception tasks. So it is possible, but it's just a bit harder, for computers to surpass human-level performance on natural perception tasks.

And finally, all of these are problems where there are teams that have access to huge amounts of data. So for example, the best systems for all four of these applications have probably looked at far more data of that application than any human could possibly look at. And so, that's also made it relatively easy for a computer to surpass human-level performance. Because there's so much data that a computer can examine, it can find statistical patterns better than even the human mind.

*Other than these problems, today there are speech recognition systems that can surpass human-level performance. And there are also some computer vision, some image recognition tasks, where computers have surpassed human-level performance. But because humans are very good at these natural perception tasks, I think it was harder for computers to get there. And then there are some medical tasks, for example, reading ECGs or diagnosing skin cancer, or certain narrow radiology tasks, where computers are getting really good and maybe surpassing a single human's performance. And I guess one of the exciting things about recent advances in deep learning is that even for these tasks we can now surpass human-level performance in some cases, but it has been a bit harder because humans tend to be very good at these natural perception tasks. So surpassing human-level performance is often not easy, but given enough data there have been lots of deep learning systems that have surpassed human-level performance on a single supervised learning problem.*

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Quiz**

Consider airplane pilots whose training involves time spent in flight simulators. These flight simulators accelerate the pilots’ learning by allowing them to experience a volume and variety of scenarios that they otherwise may have needed a much longer time to acquire.

The following exercise is a “flight simulator” for machine learning. Rather than you needing to spend years working on a machine learning project before you get to experience certain scenarios, you’ll get to experience them right here.



![image.png](attachment:image.png)

And kind of, what was interesting is that I was very interested in artificial intelligence, and so I took classes in artificial intelligence. But a lot of what I was seeing there was just not very satisfying. It was a lot of depth-first search, breadth-first search, alpha-beta pruning, and all these things. And I was not satisfied. And then I was seeing neural networks for the first time in machine learning, which is a term that I think is more technical and not as well known, since most people talk about artificial intelligence. Machine learning was more a technical term, I would almost say. And so I was dissatisfied with artificial intelligence. When I saw machine learning, I was like, this is the AI that I want to spend time on, this is what's really interesting. And what took me down those directions is that this is almost a new computing paradigm. Because normally, humans write code, but here in this case, the optimization writes code. And so you're creating the input/output specification, and then you have lots of examples of it, and then the optimization writes code, and sometimes it can write code better than you.

## Week 2   Error Analysis

If you're trying to get a learning algorithm to do a task that humans can do, and your learning algorithm is not yet at the performance of a human, then manually examining the mistakes that your algorithm is making can give you insights into what to do next. This process is called error analysis.

![image.png](attachment:image.png)

The conclusion of this process gives you an estimate of how worthwhile it might be to work on each of these different categories of errors. For example, clearly in this example, a lot of the mistakes were made on blurry images, and quite a lot were made on great cat images. The outcome of this analysis is not that you must work on blurry images. It doesn't give you a rigid mathematical formula that tells you what to do, but it gives you a sense of the best options to pursue. It also tells you, for example, that no matter how much better you do on dog images, or on Instagram images, you at most improve performance by maybe 8%, or 12%, in these examples. Whereas if you do better on great cat images, or blurry images, the potential improvement, the ceiling in terms of how much you could improve performance, is much higher. So depending on how many ideas you have for improving performance on great cats or on blurry images, maybe you could pick one of the two, or if you have enough personnel on your team, maybe you can have two different teams: one working on improving errors on great cats, and a different team working on improving errors on blurry images.

So to summarize, to carry out error analysis, you should find a set of mislabeled examples (here mislabeled examples refers to examples that were incorrectly predicted by your algorithm) in your dev (development) set. Look at the mislabeled examples for false positives and false negatives, and just count up the number of errors that fall into the various categories.
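The counting step can be sketched in a few lines. This is only an illustration: the `error_analysis` helper and the hand-assigned tags below are hypothetical, standing in for the spreadsheet you would fill in while looking through misclassified dev-set images.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def error_analysis(tagged_errors: Iterable[Set[str]]) -> Dict[str, float]:
    """Return, per category, the percentage of misclassified examples carrying that tag."""
    tagged_errors = list(tagged_errors)
    total = len(tagged_errors)
    if total == 0:
        return {}
    counts = Counter(tag for tags in tagged_errors for tag in tags)
    return {tag: 100.0 * n / total for tag, n in counts.items()}


# Hypothetical tags for eight misclassified dev-set images.
# One image can carry several tags, or none at all.
tagged_errors = [
    {"blurry"}, {"great cat"}, {"blurry", "instagram"}, {"dog"},
    {"great cat"}, {"blurry"}, set(), {"blurry", "great cat"},
]
ceilings = error_analysis(tagged_errors)
```

Each percentage is a ceiling on how much of the current error could be removed by fully fixing that category, which is what makes the table useful for prioritizing.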

Mislabeled examples ---> by algorithm (predicted labels)

Incorrectly labelled -----> by humans (ground truth)

![image.png](attachment:image.png)

Remember we said at the start of this video that it's actually less important to correct the labels in your training set. It's quite possible you decide to correct only the labels in your dev and test sets, which are often smaller than the training set, and not invest all the extra effort needed to correct the labels in a much larger training set. This is actually okay. There are processes for handling the case where your training data is different in distribution from your dev and test data, and learning algorithms are quite robust to that. It's super important that your dev and test sets come from the same distribution. But if your training set comes from a slightly different distribution, often that's a pretty reasonable thing to do.

![image.png](attachment:image.png)

When I'm leading a machine learning team and I want to understand what mistakes it is making, I would actually go in and look at the data myself and try to count up the fraction of errors. Because these minutes, or maybe a small number of hours, of counting data can really help you prioritize where to go next, I find this a very good use of your time, and I urge you to consider doing it if you've built a machine learning system and you're trying to decide what ideas or what directions to prioritize.

### Building ML System

And more generally, for almost any machine learning application, there could be 50 different directions you could go in and each of these directions is reasonable and would make your system better. But the challenge is, how do you pick which of these to focus on. And even though I've worked in speech recognition for many years, if I'm building a new system for a new application domain, I would still find it maybe a little bit difficult to pick without spending some time thinking about the problem. 

If you're starting on building a brand new machine learning application, my advice is to build your first system quickly and then iterate. What I mean by that is I recommend that you first quickly set up a dev/test set and metric. So this is really deciding where to place your target. And if you get it wrong, you can always move it later, but just set up a target somewhere. And then I recommend you build an initial machine learning system quickly. Find the training set, train it, and start to see and understand how well you're doing against your dev/test set and your evaluation metric. When you build your initial system, you will then be able to use bias/variance analysis, which we talked about earlier, as well as error analysis, which we talked about in the last several videos, to prioritize the next steps.

**The value of building this initial system, which can be a quick and dirty implementation (you know, don't overthink it), is that having some learned system, some trained system, allows you to localize bias/variance and prioritize what to do next, and allows you to do error analysis: look at some mistakes and figure out, of all the different directions you can go in, which ones are actually the most worthwhile.**

So to recap, what I recommend you do is build your first system quickly, then iterate. This advice applies less strongly if you're working on an application area in which you have significant prior experience. It also applies a bit less strongly if there's a significant body of academic literature that you can draw on for pretty much the exact same problem you're building. So, for example, there's a large academic literature on face recognition. But if you are tackling a new problem for the first time, then I would encourage you to really not overthink or not make your first system too complicated. But, just build something quick and dirty and then use that to help you prioritize how to improve your system.

![image.png](attachment:image.png)

### Why DL Algorithms need more data than traditional algorithms

![image.png](attachment:image.png)

![image.png](attachment:image.png)


In the production (live) environment you get images from the mobile app, so you should tune your model on app images. Hence the dev set should consist of the app images you are aiming at, rather than web images, which should be used for training only.

Option 1

So it turns out that of your total amount of data, 200,000 (I'll abbreviate that as 200k) out of 210,000 (210k) comes from web pages. So of these 2,500 examples, in expectation about 2,381 will come from web pages. This is in expectation; the exact number will vary depending on how the random shuffle operation went. But on average, only 119 will come from mobile app uploads. Remember that setting up your dev set is telling your team where to aim the target. And the way you're aiming your target here, you're saying spend most of the time optimizing for the web page distribution of images, which is really not what you want. So I would recommend against option one, because this is setting up the dev set to tell your team to optimize for a different distribution of data than what you actually care about.
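The expected counts quoted above follow directly from the proportions of the shuffled pool. A quick check in Python, using the 200k web and 10k mobile figures from the text (the variable names are just for illustration):

```python
web_images = 200_000
mobile_images = 10_000
total = web_images + mobile_images  # 210,000 images shuffled together
dev_size = 2_500

# Under a uniform random shuffle, each source appears in proportion to its share.
expected_web = dev_size * web_images / total
expected_mobile = dev_size * mobile_images / total

print(round(expected_web), round(expected_mobile))  # 2381 119
```

So under this option fewer than 5% of the dev set examples come from the distribution you actually care about.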

Option 2

Option 2 is better than Option 1. In Option 2, the training set will include 200,000 images from the web and 5,000 from the mobile app. The dev set will be 2,500 images from the mobile app, and the test set will be 2,500 images also from the mobile app. The advantage of this way of splitting up your data into train, dev, and test is that you're now aiming the target where you want it to be. You're telling your team, my dev set has data uploaded from the mobile app and that's the distribution of images you really care about, so let's try to build a machine learning system that does really well on the mobile app distribution of images. The disadvantage, of course, is that now your training distribution is different from your dev and test set distributions. But it turns out that this split of your data into train, dev and test will get you better performance over the long term.
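A minimal sketch of the Option 2 split, assuming the images are held in two plain Python lists. The `split_option_2` helper, its parameters, and the stand-in identifiers are hypothetical; only the 200,000 / 5,000 / 2,500 / 2,500 sizes come from the text.

```python
import random


def split_option_2(web_images, mobile_images, n_dev=2_500, n_test=2_500, seed=0):
    """Dev and test sets are mobile-only; training uses all web images plus leftover mobile ones."""
    mobile = list(mobile_images)
    random.Random(seed).shuffle(mobile)  # shuffle only within the mobile data
    dev = mobile[:n_dev]
    test = mobile[n_dev:n_dev + n_test]
    train = list(web_images) + mobile[n_dev + n_test:]
    return train, dev, test


# Stand-in identifiers instead of real images.
web = [("web", i) for i in range(200_000)]
mobile = [("mobile", i) for i in range(10_000)]

train, dev, test = split_option_2(web, mobile)
print(len(train), len(dev), len(test))  # 205000 2500 2500
```

The key design choice is that web images never reach the dev or test sets, so the target stays on the mobile app distribution.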

You've seen a couple examples of when allowing your training set data to come from a different distribution than your dev and test set allows you to have much more training data. And in these examples, it will cause your learning algorithm to perform better. Now one question you might ask is, should you always use all the data you have? The answer is subtle, it is not always yes.

### Bias & Variance with mismatched data distributions.

![image.png](attachment:image.png)

So let's say in this example that your training error is 1%. And let's say the error on the training-dev set is 9%, and the error on the dev set is 10%, same as before. What you can conclude from this is that when you went from training data to training-dev data the error really went up a lot. And the only difference between the training data and the training-dev data is that your neural network got to see the first part of this. It was trained explicitly on this, but it wasn't trained explicitly on the training-dev data. So this tells you that you have a variance problem.

Let's look at a different example. Let's say the training error is 1%, and the training-dev error is 1.5%, but when you go to the dev set your error is 10%. So now you actually have pretty low variance, because when you went from training data that you've seen to the training-dev data that the neural network has not seen, the error increases only a little bit, but then it really jumps when you go to the dev set. So this is a data mismatch problem. Your learning algorithm was not trained explicitly on data from training-dev or dev, but these two data sets come from different distributions. Whatever your algorithm is learning, it works great on training-dev but it doesn't work well on dev. So somehow your algorithm has learned to do well on a different distribution than what you really care about, and we call that a data mismatch problem.
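The two examples above can be read off as differences between consecutive error numbers. A rough sketch follows; the function names are illustrative, and the rule of simply calling the larger gap the dominant problem is an assumption made for the example, not a rule from the course.

```python
def error_gaps(train_err: float, train_dev_err: float, dev_err: float):
    """Return (variance gap, data-mismatch gap) in percentage points."""
    variance_gap = train_dev_err - train_err   # same distribution, unseen data
    mismatch_gap = dev_err - train_dev_err     # both unseen, different distribution
    return variance_gap, mismatch_gap


def dominant_problem(train_err: float, train_dev_err: float, dev_err: float) -> str:
    # Assumption for illustration: whichever gap is larger is the one to work on first.
    variance_gap, mismatch_gap = error_gaps(train_err, train_dev_err, dev_err)
    return "variance" if variance_gap >= mismatch_gap else "data mismatch"


print(dominant_problem(1.0, 9.0, 10.0))   # variance
print(dominant_problem(1.0, 1.5, 10.0))   # data mismatch
```

In practice both gaps can be large at once, so the numbers returned by `error_gaps` are more informative than the single label.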

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

So for example, comparing these two numbers in this case tells us that for humans, the rearview mirror speech data is actually harder than for general speech recognition, because humans get 6% error, rather than 4% error. 

### Apart from bias variance problem now addressing data mismatch (different data distribution) problem.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Transfer Learning

And the reason this can be helpful is that a lot of the low-level features, such as detecting edges, detecting curves, and detecting parts of objects, learned from a very large image recognition data set, might help your learning algorithm do better in radiology diagnosis. It has learned a lot about the structure and the nature of what images look like, and some of that knowledge will be useful.

What are pre-training and fine-tuning in transfer learning?

In the context of machine learning and specifically transfer learning, "pre-training" and "fine-tuning" refer to the two main stages of the transfer learning process.

1. **Pre-training**: This phase involves training a machine learning model on a large dataset, often on a different but related task. For instance, in the case of deep learning, a neural network might be pre-trained on a large dataset like ImageNet, which contains millions of images across thousands of categories. During this stage, the model learns a lot of underlying patterns or features in the data that can be useful for other tasks. This is particularly important when dealing with deep learning models, which have a high capacity and need a lot of data to learn effectively. The goal is to leverage the patterns learned from this large dataset to bootstrap the learning process for a new task.

2. **Fine-tuning**: After pre-training, the model is then fine-tuned on the target task. This is typically a smaller, specific dataset that we're actually interested in. During fine-tuning, some or all of the parameters of the model are updated to better suit the specific task at hand. This process can involve freezing the earlier layers of the model (which have learned more general features) and training only the later layers (which learn more task-specific features), or it could involve training all layers but with a smaller learning rate to avoid large changes to the pre-trained weights.

These two stages allow us to leverage the power of large datasets when training models for tasks where we might not have as much data. They also allow us to build models for specific tasks more quickly and effectively, by bootstrapping the learning process with relevant patterns learned from related tasks.
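To make the "freeze the earlier layers, train only the later ones" option concrete, here is a framework-free sketch in plain Python. Everything in it is hypothetical: the `sgd_step` helper, the layer names `features` and `head`, and the weight and gradient values are placeholders, and a real project would use the parameter-freezing mechanism of its deep learning library instead.

```python
def sgd_step(params, grads, frozen, lr=0.01):
    """One gradient-descent step that leaves every layer listed in `frozen` untouched."""
    updated = {}
    for layer, weights in params.items():
        if layer in frozen:
            updated[layer] = list(weights)  # pre-trained weights are kept as-is
        else:
            updated[layer] = [w - lr * g for w, g in zip(weights, grads[layer])]
    return updated


# Hypothetical two-layer model: "features" was pre-trained on the large dataset,
# "head" is a freshly initialized output layer for the small target task.
params = {"features": [0.5, -0.2], "head": [0.0, 0.0]}
grads = {"features": [1.0, 1.0], "head": [2.0, -4.0]}

new_params = sgd_step(params, grads, frozen={"features"}, lr=0.1)
```

Passing an empty `frozen` set together with a smaller `lr` corresponds to the other option described above, where all layers are updated but only gently.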




So, when does transfer learning make sense? Transfer learning makes sense when you have a lot of data for the problem you're transferring from and usually relatively less data for the problem you're transferring to.


So for example, let's say you have a million examples for an image recognition task. So that's a lot of data to learn a lot of low-level features, or to learn a lot of useful features in the earlier layers of the neural network. But for the radiology task, maybe you have only a hundred examples. So you have very little data for the radiology diagnosis problem, maybe only 100 x-ray scans. So a lot of the knowledge you learn from image recognition can be transferred and can really help you get going with radiology recognition even if you don't have much data for radiology.

For speech recognition, maybe you've trained the speech recognition system on 10,000 hours of data. So, you've learned a lot about what human voices sound like from that 10,000 hours of data, which really is a lot. But for your trigger word detection, maybe you have only one hour of data. So, that's not a lot of data to fit a lot of parameters. So in this case, a lot of what you learn about what human voices sound like, what the components of human speech are, and so on, can be really helpful for building a good wake word detector, even though you have a relatively small dataset, or at least a much smaller dataset, for the wake word detection task.

![image.png](attachment:image.png)

So to summarize, when does transfer learning make sense? If you're trying to learn from some Task A and transfer some of the knowledge to some Task B, then transfer learning makes sense when Task A and B have the same input X. In the first example, A and B both have images as input. In the second example, both have audio clips as input. It tends to make sense when you have a lot more data for Task A than for Task B. All this is under the assumption that what you really want to do well on is Task B. Because each example from Task A is less valuable for Task B than each example from Task B, you usually need a lot more data for Task A for it to help. And then finally, transfer learning will tend to make more sense if you suspect that low-level features from Task A could be helpful for learning Task B. In both of the earlier examples, maybe learning image recognition teaches you enough about images to help with radiology diagnosis, and maybe learning speech recognition teaches you enough about human speech to help you with trigger word or wake word detection.

So to summarize, transfer learning has been most useful if you're trying to do well on some Task B, usually a problem where you have relatively little data. So for example, in radiology, you know it's difficult to get that many x-ray scans to build a good radiology diagnosis system. So in that case, you might find a related but different task, such as image recognition, where you can get maybe a million images and learn a lot of low-level features from that, so that you can then try to do well on Task B, your radiology task, despite not having that much data for it. When transfer learning makes sense, it does help the performance of your learning task significantly.

### Multi-task Learning

There's another version of learning from multiple tasks which is called multitask learning, which is when you try to learn from multiple tasks at the same time rather than learning from one and then sequentially, or after that, trying to transfer to a different task.

In multi-task learning, you start off simultaneously, trying to have one neural network do several things at the same time. And then each of these tasks hopefully helps all of the other tasks.

![image.png](attachment:image.png)

And one other thing you could have done is just train four separate neural networks, instead of training one network to do four things. But if some of the earlier features in the neural network can be shared between these different types of objects, then you find that training one neural network to do four things results in better performance than training four completely separate neural networks to do the four tasks separately. So that's the power of multi-task learning.



It turns out that multi-task learning also works even if some of the images are labeled with only some of the objects. So with a data set like this, you can still train your learning algorithm to do four tasks at the same time, even when some images have only a subset of the labels and the others are sort of question marks or don't-cares. And the way you train your algorithm, even when some of these labels are question marks or really unlabeled, is that in this sum over j from 1 to 4, you would sum only over values of j with a 0 or 1 label. So whenever there's a question mark, you just omit that term from the summation and sum over only the values where there is a label. And so that allows you to use datasets like this as well.
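The "omit the question marks from the sum" rule can be sketched for a single image as follows. This is an illustration under stated assumptions: the per-task loss is taken to be the usual logistic (binary cross-entropy) loss, unknown labels are represented by `None`, and the function name and example values are hypothetical.

```python
import math


def masked_multitask_loss(predictions, labels):
    """Sum the logistic loss over tasks that have a 0/1 label; None marks a '?' label.

    Predictions are assumed to be probabilities strictly between 0 and 1.
    """
    loss = 0.0
    for y_hat, y in zip(predictions, labels):
        if y is None:
            continue  # unlabeled task: omit this term from the summation
        loss += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return loss


# One image, four tasks; the third and fourth labels are unknown.
predictions = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, None, None]
loss = masked_multitask_loss(predictions, labels)
```

Here only the first two tasks contribute, so the result equals `-(log(0.9) + log(0.8))`; whatever the network predicts for the two unlabeled tasks has no effect on the loss for this image.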

![image.png](attachment:image.png)

But the key really is that if you already have 1,000 examples for one task, then for all of the other tasks you'd better have a lot more than 1,000 examples if those other tasks are meant to help you do better on this final task. And finally, multi-task learning tends to make more sense when you can train a big enough neural network to do well on all the tasks.

But multi-task learning is used less often, because it is rarer to have a huge set of tasks that you want to do well on and that you can train at the same time. Maybe the one example is computer vision. In object detection I see more applications of multi-task learning, where one neural network trying to detect a whole bunch of objects at the same time works better than different neural networks trained separately to detect objects. But I would say that on average transfer learning is used much more today than multi-task learning, and both are useful tools to have in your arsenal.

*So to summarize, multi-task learning enables you to train one neural network to do many tasks, and this can give you better performance than if you were to do the tasks in isolation. It is especially applicable to object detection in computer vision.*

### End to end Deep Learning

![image.png](attachment:image.png)

Take speech recognition as an example, where your goal is to take an input X such an audio clip, and map it to an output Y, which is a transcript of the audio clip. So traditionally, speech recognition required many stages of processing. First, you will extract some features, some hand-designed features of the audio. So if you've heard of MFCC, that's an algorithm for extracting a certain set of hand designed features for audio. And then having extracted some low level features, you might apply a machine learning algorithm, to find the phonemes in the audio clip. And then you string together phonemes to form individual words. And then you string those together to form the transcripts of the audio clip. So, in contrast to this pipeline with a lot of stages, what end-to-end deep learning does, is you can train a huge neural network to just input the audio clip, and have it directly output the transcript.

And when end-to-end deep learning just took the training set and learned the function mapping from x to y directly, really bypassing a lot of these intermediate steps, it was challenging for some disciplines to come around to accepting this alternative way of building AI systems, because in some cases it really made obsolete many years of research on some of the intermediate components.

It turns out that one of the challenges of end-to-end deep learning is that you might need a lot of data before it works well. So for example, if you're training on 3,000 hours of data to build a speech recognition system, then the full traditional pipeline works really well. It's only when you have a very large data set, say 10,000 hours of data, going up to maybe 100,000 hours of data, that the end-to-end approach suddenly starts to work really well. So when you have a smaller data set, the more traditional pipeline approach actually works just as well, and often even better, and you need a large data set before the end-to-end approach really shines. And if you have a medium amount of data, then there are also intermediate approaches, where maybe you input the audio, bypass the hand-designed features, and just learn to output the phonemes with the neural network, and then keep some of the other stages as well. So this would be a step toward end-to-end learning, but not all the way there.

![image.png](attachment:image.png)

So why is it that the two-step approach works better? There are actually two reasons for that. One is that each of the two problems you're solving is actually much simpler. The second is that you have a lot of data for each of the two sub-tasks. In particular, there is a lot of data you can obtain for face detection, for task one over here, where the task is to look at an image and figure out where the person's face is in the image. So there is a lot of labeled data (X, Y), where X is a picture and Y shows the position of the person's face, and you could build a neural network to do task one quite well. And then separately, there's a lot of data for task two as well. Given a closely cropped image, like this red image or this one down here, today's leading face recognition teams have at least hundreds of millions of images of people's faces that they could use to look at two images and try to figure out the identity, or to figure out if it's the same person or not. So there's also a lot of data for task two. But in contrast, if you were to try to learn everything at the same time, there is much less data of the form (X, Y), where X is an image like this taken from the turnstile and Y is the identity of the person. So because you don't have enough data to solve this end-to-end learning problem, but you do have enough data to solve sub-problems one and two, in practice breaking this down into two sub-problems results in better performance than a pure end-to-end deep learning approach. If you had enough data for the end-to-end approach, maybe the end-to-end approach would work better, but that's not actually what works best in practice today.
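
As a rough illustration of the two-step structure described above, the sketch below chains a detector and a recognizer, each of which could be trained on its own data set. Everything in it is a hypothetical stand-in: `toy_detector`, `toy_recognizer`, the identity string, and the tiny 0/1 "image" are placeholders for real trained models and real camera frames.

```python
def crop(image, box):
    """Return the sub-image inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def two_step_identify(image, detect_face, recognize):
    """Step 1 finds the face, step 2 identifies the cropped face."""
    box = detect_face(image)   # task one: where is the face?
    face = crop(image, box)    # zoom in on the detected region
    return recognize(face)     # task two: whose face is it?

# Toy stand-ins so the sketch runs without any trained model.
def toy_detector(image):
    """Bounding box around all non-zero pixels."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

def toy_recognizer(face):
    """Look the cropped face up in a tiny table of known faces."""
    known = {((1, 1), (1, 1)): "employee_42"}
    return known.get(tuple(tuple(r) for r in face), "unknown")

image = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
identity = two_step_identify(image, toy_detector, toy_recognizer)
```

The point of passing the two stages in as arguments is that each one can be trained, evaluated, and replaced separately, which is what lets each sub-task use its own large labeled data set.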

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

But given data availability and the types of things we can learn with neural networks today, end-to-end deep learning is actually not the most promising approach, or at least it is not an approach that I think teams have gotten to work best. It can sometimes work really well, but you also have to be mindful of where you apply end-to-end deep learning.

Finally, thank you and congrats on making it this far with me. If you finish last week's videos and this week's videos then I think you will already be much smarter and much more strategic and much more able to make good prioritization decisions in terms of how to move forward on your machine learning project. Look at this week's homework problems which should give you another opportunity to practice these ideas and make sure that you're mastering them.

![image.png](attachment:image.png)

**Quiz**

In a standard machine learning project, you might split your data into three sets: training, dev (or validation), and test. The model is trained on the training set, tuned based on the dev set, and then finally evaluated on the test set.
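
A minimal sketch of such a split, using only the standard library. The 80/10/10 proportions, the fixed seed, and the function name `split_dataset` are assumptions chosen for illustration; with very large data sets the dev and test fractions are often much smaller.

```python
import random

def split_dataset(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then cut into train / dev / test partitions."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_dev = int(len(items) * dev_frac)
    n_test = int(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test

# Hypothetical data set of 1,000 example ids.
train, dev, test = split_dataset(range(1000))
```

Shuffling before cutting means the dev and test sets are drawn from the same pool of examples, so a model tuned on the dev set is aiming at the same target it will be tested on.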



Class Probabilities: For the output layer that predicts the class of the object, a softmax or sigmoid activation function is often used, depending on whether the classes are mutually exclusive or not. If each bounding box can contain only one class of object (i.e., the classes are mutually exclusive), then a softmax activation function would be used. If each bounding box can contain multiple classes of objects (i.e., the classes are not mutually exclusive), then a sigmoid activation function would be used.
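
The difference between the two activations can be seen in a short sketch. The logits below are made-up numbers for three hypothetical classes, and `sigmoid_each` is an illustrative helper name.

```python
import math

def softmax(logits):
    """Mutually exclusive classes: probabilities sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_each(logits):
    """Non-exclusive classes: one independent probability per class."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

logits = [2.0, 1.0, -1.0]
exclusive = softmax(logits)         # classes compete with each other
independent = sigmoid_each(logits)  # each class is judged on its own
```

With softmax, raising one class's score necessarily lowers the others. With sigmoid, several classes can be confidently "on" at once, so the outputs do not need to sum to 1.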



So that was actually a very important insight: in an earlier era of deep learning, when computers were just slower, the Restricted Boltzmann Machine and Deep Boltzmann Machine were needed for initializing the neural network weights, but as computers got faster, straight backprop started to work much better.

Neural nets are highly non-convex systems that are hard to optimize.

But I would encourage people to just try and not be afraid to try to tackle hard problems.

Thank you, Rus, for sharing all the comments and insights. That was interesting, hearing the story of your early days doing this as well.

Thanks, Andrew, yeah. Thanks for having me.