## Introductions

* Importance of getting to know people in the class
* Review syllabus
* Give out Slack information
* When are office hours?
* Participation: ML news / paper for the day / discuss homework ideas / class discussions
* Homework: Target 5 homeworks, which will be fairly open-ended, almost like mini-projects
* Quizzes: Around 3, given in class. They will test knowledge of the subject, not coding
* This is the first time this class is being taught, and I am very open to feedback. Also, feel free to submit pull requests to correct or add to the content.
* Review first homework

## What is Machine Learning?

[Wikipedia](https://en.wikipedia.org/wiki/Machine_learning) tells us that machine learning is "a field of computer science that gives computers **the ability to learn without being explicitly programmed**." It goes on to say, "machine learning explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through **building a model from sample inputs**."


### Learning from inputs

What does it mean to learn from inputs without being explicitly programmed? Let us consider a classic machine learning problem: spam filtering.

Let us imagine that we know nothing about machine learning, but are tasked with determining whether an email is spam or not. How might we do this?

What do we like and not like about our approach? How would we keep it up to date with new types of spam?

Now imagine we had a black box that, given a great many examples of spam and non-spam emails, could learn from their text what spam looks like. For this class, we will call that black box (though it won't be black for long) machine learning. The pieces of data we feed it to describe a problem are often called features.
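
To make the idea of learning from examples concrete, here is a minimal sketch of one possible black box: a tiny naive Bayes classifier that counts words in labeled emails. The training emails, the function names `train` and `predict`, and the choice of naive Bayes are all assumptions made for illustration; a real spam filter would need far more data and care.

```python
import math
from collections import Counter

# Tiny, made-up training set of (email text, label) pairs. Purely illustrative.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on monday with the team", "not spam"),
    ("project agenda and notes", "not spam"),
]

def train(examples):
    """Count how often each word appears in each class."""
    word_counts = {"spam": Counter(), "not spam": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(text, word_counts, label_counts):
    """Return the label with the higher naive Bayes log-score."""
    vocab = set(word_counts["spam"]) | set(word_counts["not spam"])
    total_examples = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Start from the log prior, then add the log-likelihood of each word.
        score = math.log(label_counts[label] / total_examples)
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, label_counts = train(TRAIN)
print(predict("free prize inside", word_counts, label_counts))
print(predict("agenda for the team meeting", word_counts, label_counts))
```

Notice that nothing in the code says "the word free means spam." That association comes entirely from the labeled examples, which is what "without being explicitly programmed" refers to.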

What does learning from inputs look like? Let us consider [flappy bird](https://www.youtube.com/watch?v=79BWQUN_Njc).


## Why machine learning?

From our discussion, why do you think machine learning might be valuable?


## Types of machine learning problems

**Supervised** machine learning problems are ones for which you have labeled data. Labeled data means each example comes paired with its solution; these solutions are called labels. For example, with spam classification the labels would be "spam" or "not spam." Linear regression would be considered a supervised problem.

**Unsupervised** machine learning is the opposite: the algorithm is not given any labels. These algorithms are often less powerful because they don't get the benefit of labels, but they can be extremely valuable when getting labeled data is expensive or impossible. An example would be clustering.
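
As a rough illustration of clustering, the sketch below runs k-means on one-dimensional data. The data (say, the number of links in each email), the starting centers, and the function name `kmeans_1d` are hypothetical choices for this example. Note that no labels are supplied anywhere.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D data: assign points to the nearest center, then move the centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical feature: number of links per email. No labels are given.
links_per_email = [1, 2, 3, 20, 21, 22]
centers, clusters = kmeans_1d(links_per_email, centers=[0.0, 10.0])
print(centers)
print(clusters)
```

The algorithm finds two groups on its own, but it cannot tell us which group, if either, is spam. That interpretation is left to us.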

**Regression** problems are a class of problems for which you are trying to predict a real number. For example, linear regression outputs a real number and could be used to predict housing prices.
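
Here is a minimal sketch of linear regression with a single feature, fit by ordinary least squares. The house sizes and prices are invented for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: house size in thousands of square feet, price in thousands of dollars.
sizes = [1.0, 1.5, 2.0, 2.5]
prices = [200, 290, 410, 500]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 3.0 + intercept  # prediction for a 3,000 sq ft house
print(slope, intercept, predicted_price)
```

The output is a real number (a price), which is what makes this a regression problem rather than a classification problem.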

**Classification** problems are problems for which you are predicting a class. For example, spam prediction is a classification problem because you want to know which of two classes your input falls into. Logistic regression is an algorithm used for classification.
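
The following sketch fits a one-feature logistic regression with plain gradient descent. The feature (number of exclamation marks in an email), the data, the learning rate, and the step count are all assumptions chosen to keep the example small.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Difference between predicted probability and the true label, per example.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

# Made-up data: exclamation marks per email, and whether the email was spam (1) or not (0).
exclamations = [0, 1, 2, 6, 7, 8]
is_spam = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(exclamations, is_spam)

def spam_probability(n_exclamations):
    return sigmoid(w * n_exclamations + b)

print(spam_probability(0), spam_probability(8))
```

Despite its name, logistic regression is used here for classification: the model outputs a probability, and we pick a class by thresholding it (for example at 0.5).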

**Ranking** problems are very popular in eCommerce. These models try to rank items by how valuable they are to a user; Netflix's movie recommendations are one example. An example model is collaborative filtering.

**Reinforcement Learning** is when you have an agent in an environment that gets to perform actions and receives rewards for those actions. The model here learns the best actions to take to maximize rewards. The flappy bird video is an example of reinforcement learning. An example model is a deep Q-network.
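
To show the agent, action, and reward loop in code, here is a small sketch of tabular Q-learning, a much simpler relative of deep Q-networks that stores values in a table instead of a neural network. The environment (a five-cell corridor with a reward at the right end) and all hyperparameters are invented for this example; this is not the method used in the flappy bird video.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)    # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

def step(state, action):
    """Move within the corridor; reward 1.0 only on reaching the last cell."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)  # explore
            else:
                # Exploit the best known action, breaking ties at random.
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward = step(state, action)
            done = next_state == N_STATES - 1
            future = 0.0 if done else GAMMA * max(q[(next_state, a)] for a in ACTIONS)
            # Nudge the estimate toward reward plus discounted future value.
            q[(state, action)] += ALPHA * (reward + future - q[(state, action)])
            state = next_state
            if done:
                break
    return q

q = train()
# Greedy action in each non-terminal cell after training.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

The agent is never told that moving right is good. It discovers this from the rewards it receives while acting in the environment.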


## Machine Learning and Econometrics

How are they different?

For one, they use different lingo. 

For another, econometrics is often more interested in understanding why things happen while machine learning often cares more about just the actual prediction being correct.

Economic theory is often a driver in the development of econometric models, while machine learning often relies on the data to deliver insights.

The two worlds have a lot of overlap and continue to grow closer. Machine learning is getting better at providing both predictions and understanding, and econometrics is finding value in the scalability and accuracy of some machine learning models.


## Challenges of Machine Learning

Perhaps my favorite part from the Wikipedia page on machine learning is, "As of 2016, **machine learning is a buzzword**, and according to the Gartner hype cycle of 2016, at its peak of inflated expectations. Effective machine learning is difficult because finding patterns is hard and often not enough training data is available; as a result, **machine-learning programs often fail to deliver**."

* There isn't a clear problem to solve

Some executive heard machine learning is the next big thing, so they hired a data science team. Unfortunately, there isn't a clear idea on what problems to solve, so the team flounders for a year.

* Labeled data can be extremely important to building machine learning models, but can also be extremely costly.

First off, you often need a lot of data. [Google](https://research.googleblog.com/2017/07/revisiting-unreasonable-effectiveness.html) found for representation learning, performance increases logarithmically based on volume of training data:

![image.png](attachment:image.png)

Secondly, you need to get data that represents the full distribution of the problem you are trying to solve. For example, for our spam classification problem, what kinds of emails might we want to gather? What if we only had emails that came from US IP addresses?

Lastly, just getting data labeled can be time-consuming and cost a lot of money. Who is going to label 1 million emails as spam or not spam?

* Data can be very messy

Often data in the real world has errors, outliers, missing data, and noise. How you handle these can greatly influence the outcome of your model.
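
As a small illustration of how handling choices matter, the sketch below fills in missing values in a made-up column that also contains one extreme outlier. The data and the helper name `fill_missing` are hypothetical, and median imputation is only one of many possible strategies.

```python
import statistics

def fill_missing(values, fill_value):
    """Replace missing entries (None) with the given fill value."""
    return [fill_value if v is None else v for v in values]

# Made-up column: links per email, with two missing entries and one extreme outlier (250).
links = [3, 5, None, 4, 250, 6, None, 5]
observed = [v for v in links if v is not None]

mean_fill = statistics.mean(observed)      # pulled upward by the outlier
median_fill = statistics.median(observed)  # robust to the outlier

print(mean_fill, median_fill)
print(fill_missing(links, median_fill))
```

Filling with the mean would insert values far larger than almost every observed email, while the median stays close to a typical value. Neither choice is automatically right; it depends on why the data is missing and whether the outlier is an error.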

* Feature engineering

Once you have your data and labels, deciding on how to represent it to your model can be very challenging. For example, for spam classification would you just feed it the raw text? What about the origin of the IP address? What about a timestamp?
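
One possible way to turn an email into model inputs is sketched below. The email fields (`text`, `origin_country`, `timestamp`) and the particular features are assumptions made for illustration, not a recommended feature set.

```python
from datetime import datetime

def extract_features(email):
    """Turn one raw email record into a dictionary of numeric features."""
    text = email["text"]
    sent_at = datetime.fromisoformat(email["timestamp"])
    return {
        "n_words": len(text.split()),
        "n_exclamations": text.count("!"),
        "mentions_free": int("free" in text.lower()),
        "from_us": int(email["origin_country"] == "US"),
        "hour_sent": sent_at.hour,
    }

# Hypothetical email record.
example = {
    "text": "FREE prize!!! Click now",
    "origin_country": "BR",
    "timestamp": "2017-09-01T03:15:00",
}
features = extract_features(example)
print(features)
```

Each of these choices throws information away (for instance, the exact words) in exchange for a compact numeric representation, which is why feature engineering involves so much judgment.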

* Your model might not generalize

After all of this, you might still end up with a model that either is too simple to be effective (underfitting) or too complex to generalize well (overfitting). You have to develop a model that is just right. :)

* Evaluation is non-trivial

Let's say we develop a machine learning model for spam classification. How do we evaluate it? Do we care more about precision or recall? How do we tie our scientific metrics to business metrics?
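
To make precision and recall concrete, here is a short sketch that computes both for a handful of made-up predictions. The labels and the helper name `precision_recall` are illustrative assumptions.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Precision: share of predicted positives that are correct.
    Recall: share of actual positives that were found."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    false_pos = sum(t != positive and p == positive for t, p in pairs)
    false_neg = sum(t == positive and p != positive for t, p in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Made-up ground truth and model predictions for ten emails.
y_true = ["spam"] * 4 + ["not spam"] * 6
y_pred = ["spam", "spam", "not spam", "not spam",
          "spam", "not spam", "not spam", "not spam", "not spam", "not spam"]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Low precision means legitimate email lands in the spam folder; low recall means spam reaches the inbox. Which error is worse is a business question, not a purely scientific one.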

* Getting into production can be hard

You have a beautiful model built in Python, only to discover that the back-end is in Java and has to run in under 5 ms, so micro-services are not an option. So you convert your model to PMML, but the engineers won't let you push code, so you are now blocked, and putting your model in production isn't high on their list of priorities.


## There is hope

While many machine learning initiatives do fail, many also succeed and are running some of the most valuable companies in the world. Companies like Google, Facebook, Amazon, Airbnb, and Netflix have all found successful ways to leverage machine learning and are reaping large rewards.

Google CEO Sundar Pichai even recently described "an important shift from a mobile first world to an AI first world."

And Mark Cuban said, "Artificial Intelligence, deep learning, machine learning — whatever you’re doing if you don’t understand it — learn it. Because otherwise you’re going to be a dinosaur within 3 years."

And lastly, [Harvard Business Review](https://hbr.org/2012/10/big-data-the-management-revolution) found, "companies in the top third of their industry in the use of data-driven decision making were, on average, 5% more productive and 6% more profitable than their competitors."

The goal of this course is to prepare you for this world, so that you will not only know how to build machine learning models to predict the future, but also understand the key ingredients of a successful machine learning initiative and how to overcome the challenges.