# 20-05-02: Daily Data Practice

---

### Daily Practices

* Meta Data: Review and write
  * Focus on a topic, review notes and resources, write a blog post about it
* HackerRank SQL or Packt SQL Data Analytics
* Practice with the common DS/ML tools and processes
  * Try to hit benchmark accuracies with [UCI ML datasets](https://archive.ics.uci.edu/ml/index.php)
  * Hands-on ML with sklearn, Keras, and TensorFlow
    * Read, write notes, and test yourself
  * [fast.ai course](https://course.fast.ai/)
  * Kaggle
* Interviewing
  * "Tell me a bit about yourself"
  * "Tell me about a project you've worked on and are proud of"
  * Business case walk-throughs
  * Hot-seat DS-related topics for recall practice (under pressure)
* Job sourcing
  * LinkedIn

---

### DS + ML Practice

* Pick a dataset and try to do X with it
  * Try to hit benchmark accuracies with [UCI ML datasets](https://archive.ics.uci.edu/ml/index.php)
  * Kaggle
* Practice with the common DS/ML tools and processes
  * Hands-on ML with sklearn, Keras, and TensorFlow
  * Machine learning flashcards

#### _The goal is to be comfortable explaining the entire process._

* Data access / sourcing, cleaning
  * SQL
  * Pandas
  * Exploratory data analysis
  * Data wrangling techniques and processes
* Inference
  * Statistics
  * Probability
  * Visualization
* Modeling
  * Implement + justify choice of model / algorithm
  * Track performance + justify choice of metrics
    * Communicate results as relevant to the goal

## Hands-on ML (reading)

### Chapter 1 Exercises (test yourself)

1. How would you define Machine Learning?
2. Can you name four types of problems where it shines?
3. What is a labeled training set?
4. What are the two most common supervised tasks?
5. Can you name four common unsupervised tasks?
6. What type of ML algorithm would you use to allow a robot to walk in various unknown terrains?
7. What type of algorithm would you use to segment your customers into groups?
8. Would you frame the problem of spam detection as supervised or unsupervised?
9. What is an online learning system?
10. What is out-of-core learning?
11. What type of learning algorithm relies on a similarity measure to make predictions?
12. What is the difference between a model parameter and a learning algo’s hyperparameter?
13. What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
14. Can you name four of the main challenges in ML?
15. If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?
16. What is a test set and why would you want to use it?
17. What is the purpose of a validation set?
18. What can go wrong if you tune hyperparameters using the test set?
19. What is repeated cross-validation and why would you prefer it to using a single validation set?

1. How would you define Machine Learning?

Machine Learning is the practice of creating and utilizing algorithms that find patterns in ("learn" from) data in order to solve problems, without being explicitly programmed for that purpose.

> ML is about making machines better at some task by learning from data rather than from explicitly programming rules.

2. Can you name four types of problems where it shines?

* Problems whose existing solutions require lots of hand-tuning or long lists of rules
* Complex problems for which traditional methods and algorithms cannot produce good solutions
* Dynamic environments that require the algorithms to react / adapt to new data
* Gaining insight into complex problems that involve large amounts of data

3. What is a labeled training set?

A labeled training set is a training set that includes the desired outcome (the label) for each instance, which is what supervised learning requires.

4. What are the two most common supervised tasks?

* Regression
* Classification

5. Can you name four common unsupervised tasks?

* Clustering
* Association rule learning
* Anomaly detection
* Visualization / Dimensionality reduction

6. What type of ML algorithm would you use to allow a robot to walk in various unknown terrains?

Reinforcement learning.

7. What type of algorithm would you use to segment your customers into groups?

To segment customers into groups, I would go with a clustering algorithm such as k-means clustering, which is unsupervised.
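A minimal sketch of that segmentation with k-means, assuming scikit-learn and NumPy are available. The customer features (annual spend, visits per month) and the three synthetic groups are made up for illustration, and the choice of `n_clusters=3` is an assumption that would normally need justifying.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic "customers": columns are (annual spend, visits per month).
# Both the features and the three groups are made up for illustration.
rng = np.random.default_rng(0)
centers = np.array([[200.0, 2.0], [1000.0, 8.0], [3000.0, 20.0]])
customers = np.vstack([
    center + rng.normal(scale=[20.0, 0.5], size=(50, 2))
    for center in centers
])

# Standardize features so spend does not dominate the distance computation.
scaled = (customers - customers.mean(axis=0)) / customers.std(axis=0)

# No labels are passed in: the algorithm groups customers on its own.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(scaled)

# Number of customers assigned to each segment.
segment_sizes = np.bincount(segments, minlength=3)
print(segment_sizes)
```

The cluster ids themselves are arbitrary; what matters is which customers end up grouped together.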

8. Would you frame the problem of spam detection as supervised or unsupervised?

I would consider spam detection to be a supervised problem because the algorithm needs to be trained with examples labeled as spam.

9. What is an online learning system?

An online learning system is one that learns incrementally, either from individual instances or mini-batches.
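One way to see incremental learning in practice is scikit-learn's `partial_fit`, sketched below. The data stream is simulated with a hypothetical `make_batch` helper and a toy labeling rule, so this is an illustration under those assumptions rather than a real streaming setup.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)

def make_batch(n):
    """Return a mini-batch of toy data: the label is 1 when x0 + x1 > 0."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

model = SGDClassifier(random_state=42)
classes = np.array([0, 1])  # partial_fit needs the full label set on the first call

# Simulate a stream: each mini-batch is used for one update, then discarded.
for _ in range(50):
    X_batch, y_batch = make_batch(32)
    model.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on fresh instances the model has never seen.
X_new, y_new = make_batch(500)
accuracy = model.score(X_new, y_new)
print(f"accuracy on fresh data: {accuracy:.2f}")
```

The same loop also covers the mini-batch case: only the batch size changes.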

10. What is out-of-core learning?

Out-of-core learning is used when the dataset is too large to fit in a single machine's main memory (the "core"). The algorithm loads the data in chunks, runs a training step on each chunk, and repeats until it has seen all of the data, typically using online learning techniques.

11. What type of learning algorithm relies on a similarity measure to make predictions?

Instance-based learning algorithms, such as k-Nearest Neighbors, rely on similarity measures to make predictions.
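A short sketch of instance-based prediction with k-Nearest Neighbors, assuming scikit-learn. The iris dataset, `n_neighbors=5`, and the Euclidean metric are my choices for illustration, not something the exercise prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# "Training" mostly stores the instances. The work happens at prediction
# time, when the 5 closest stored instances (Euclidean distance) vote.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

# Inspect the neighbors behind a single prediction.
distances, indices = knn.kneighbors(X_test[:1])
neighbor_labels = y_train[indices[0]]
prediction = knn.predict(X_test[:1])[0]
print(neighbor_labels, prediction)

test_accuracy = knn.score(X_test, y_test)
```

The similarity measure here is the distance metric: swapping it changes which stored instances count as "near".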

12. What is the difference between a model parameter and a learning algo's hyperparameter?

Model parameters are the values the learning algorithm finds during training, typically by minimizing a cost function (e.g. the weights of a linear model). They determine what the model predicts for a new instance. Hyperparameters are parameters of the learning algorithm itself rather than of the model (e.g. the amount of regularization to apply). As such, they are not affected by training: they are set beforehand and stay constant throughout.
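To make the distinction concrete, here is a small sketch using ridge regression from scikit-learn. The synthetic data and the two `alpha` values are assumptions picked for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# alpha is a hyperparameter: chosen before training and fixed during it.
weak = Ridge(alpha=0.01).fit(X, y)
strong = Ridge(alpha=1000.0).fit(X, y)

# coef_ and intercept_ are model parameters: learned from the data.
print("alpha=0.01 ->", weak.coef_)
print("alpha=1000 ->", strong.coef_)
```

Changing `alpha` does not get "learned"; it changes which `coef_` values training ends up with (stronger regularization shrinks them toward zero).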

13. What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?

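Model-based learning algorithms search for the model parameter values that let the model generalize well to new instances. The most common strategy is to minimize a cost function that measures how far off the model's predictions are on the training examples, sometimes with an added penalty for model complexity when the model is regularized. To make a prediction, the model feeds the new instance's features into its prediction function using the learned parameter values.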
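The search, the cost function, and the prediction step can be sketched with a one-feature linear model trained by batch gradient descent. Everything here (the synthetic line `y = 4 + 3x`, the learning rate, the iteration count, and the `cost` / `predict` helpers) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
y = 4.0 + 3.0 * x + rng.normal(scale=0.2, size=200)  # true line plus noise

def cost(t0, t1):
    """Mean squared error of the line t0 + t1 * x on the training data."""
    return np.mean((t0 + t1 * x - y) ** 2)

theta0, theta1 = 0.0, 0.0  # model parameters, arbitrary starting point
learning_rate = 0.1        # hyperparameter of the training procedure
initial_cost = cost(theta0, theta1)

# Search: repeatedly step the parameters against the gradient of the cost.
for _ in range(2000):
    error = theta0 + theta1 * x - y
    theta0 -= learning_rate * 2 * np.mean(error)
    theta1 -= learning_rate * 2 * np.mean(error * x)

def predict(x_new):
    """Prediction: plug new inputs into the model with the learned parameters."""
    return theta0 + theta1 * x_new

final_cost = cost(theta0, theta1)
print(theta0, theta1, predict(1.5))
```

In practice a library such as scikit-learn does this search for you; the loop is only there to show what "minimizing a cost function" means.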

14. Can you name four of the main challenges in ML?

* Bad data
  * Not enough data
  * Non-representative
  * Biased (sampling bias)
  * Poor quality
  * Irrelevant features
* Bad algorithm
  * Overfitting / underfitting

15. If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?

If a model performs great on the training data but generalizes poorly, it is overfitting the training data.

Solutions:

* Simplify the model by reducing the number of model parameters, reducing the number of attributes / features, or constraining the model (regularization)
* Gather more training data
* Reduce the noise in the training data
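A rough sketch of what overfitting looks like and how constraining a model changes the picture, assuming scikit-learn. The noisy `make_moons` data and the `max_depth=3` constraint are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class toy data (synthetic, chosen only for illustration).
X, y = make_moons(n_samples=500, noise=0.35, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Unconstrained tree: free to grow until it memorizes the training set.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: max_depth limits complexity (a form of regularization).
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [("unconstrained", deep_tree), ("max_depth=3", shallow_tree)]:
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f}")
```

A large gap between training and test accuracy for the unconstrained tree is the signature of overfitting. Whether the constrained tree actually scores higher on the test set depends on the data, so it is worth checking rather than assuming.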

16. What is a test set and why would you want to use it?

A test set is a portion of the dataset held out from training and used to measure the model's performance "out of sample", i.e. how well it generalizes to instances it has never seen. Evaluating on a test set gives the engineer an estimate of the generalization error before the model is deployed, and so is essential for creating a robust model.

17. What is the purpose of a validation set?

A validation set is used to compare candidate models while tuning their hyperparameters. The purpose is to evaluate many models with different hyperparameters and select the best one, without touching the test set.

18. What can go wrong if you tune hyperparameters using the test set?

If the hyperparameters are tuned on the test set, the model risks overfitting the test data, and the measured generalization error will be optimistic. In other words, the result is the best model / hyperparameter combination for that particular set, which may not perform as well on new instances.
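Putting the last three answers together, here is a sketch of a train / validation / test workflow in scikit-learn. The dataset, the candidate values of `C`, and the 60/20/20 split are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out the test set first and do not touch it until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
# Split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

# Compare candidate hyperparameter values on the validation set only.
val_scores = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    val_scores[C] = model.score(X_val, y_val)

best_C = max(val_scores, key=val_scores.get)

# Retrain the chosen configuration on train + validation, then test once.
final_model = make_pipeline(
    StandardScaler(), LogisticRegression(C=best_C, max_iter=1000)
)
final_model.fit(X_rest, y_rest)
test_score = final_model.score(X_test, y_test)
print(best_C, round(test_score, 3))
```

Because the test set played no part in choosing `best_C`, `test_score` is a fair estimate of performance on new instances.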

19. What is repeated cross-validation and why would you prefer it to using a single validation set?

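Cross-validation splits the training data into several complementary subsets (folds); each model is trained on all but one fold and validated on the held-out fold, rotating until every fold has served as the validation set. Repeated cross-validation runs this whole procedure several times with different splits. Averaging the evaluations gives a more accurate measure of a model's performance than a single validation set, at the cost of more training time.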
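A minimal sketch of repeated cross-validation with scikit-learn's `RepeatedStratifiedKFold`. The dataset, the kNN model, and the 5 folds x 3 repeats are assumptions picked to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5 folds, repeated 3 times with different shuffles: 15 train/validate runs.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)

# The mean is the performance estimate; the spread shows how much it varies.
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The spread across runs is the extra information a single validation set cannot give you.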

---

### Writing

> Focus on a topic or project, learn/review the concepts, and write a blog post about it



---

### SQL

> Work through practice problems on HackerRank or Packt

---

### Interviewing

> Practice answering the most common interview questions

* "Tell me a bit about yourself"
* "Tell me about a project you've worked on and are proud of"
* Business case walk-throughs
* Hot-seat DS-related topics for recall practice (under pressure)

---

### Job sourcing

> Browse LinkedIn, Indeed, and connections for promising leads