# 20-XX-XX: Daily Data Practice

---

### Daily Practices

* Meta Data: Review and write
  * Focus on a topic, review notes and resources, write a blog post about it
* HackerRank SQL or Packt SQL Data Analytics
* Practice with the common DS/ML tools and processes
  * Try to hit benchmark accuracies with [UCI ML datasets](https://archive.ics.uci.edu/ml/index.php)
  * Hands-on ML with sklearn, Keras, and TensorFlow
    * Read, code along, take notes
    * _test yourself on the concepts_ — i.e. do all the chapter exercises
  * [fast.ai course](https://course.fast.ai/)
  * Kaggle
* Interviewing
  * "Tell me a bit about yourself"
  * "Tell me about a project you've worked on and are proud of"
  * Business case walk-throughs
  * Hot-seat DS-related topics for recall practice (under pressure)
* Job sourcing
  * LinkedIn

---

### Writing

> Focus on a topic or project, learn/review the concepts, and write a blog post about it



### The Data

As seems to be the case with most, if not all, machine learning projects, we spent the
vast majority of the time gathering and labeling our dataset.

In an ideal world, our model would be able to recognize any object that anyone would
ever want to throw away. But the reality is that this is practically impossible,
particularly within the 8 weeks we had to work on Trash Panda.

We were granted an API key from Earth911 to utilize their recycling center search
database. When we were working with it, the database held information on around 300
items—how they should be recycled based on location, and facilities that accept them if
they are not curbside recyclable.

We had our starting point for the list of items our system should be able to
recognize. However, the documentation for the neural network architecture we'd decided
to use suggested that to create a robust model, it should be trained with at least
1,000 instances (in this case, images) of each of the classes we wanted it to detect.

With around 300 items, that meant gathering roughly 300,000 images, which was well
outside the scope of the project at that point. So the DS team spent many hours reducing
the size of that list to something a little more manageable and realistic.

The main method of doing so was to group the items based primarily on visual
similarity. We knew it was also out of the scope of our time with the project to train
a model that could tell the difference between #2 plastic bottles and #3 plastic
bottles, or motor oil bottles and brake fluid bottles.

We also considered the items that 1) users would be throwing away on a somewhat
regular basis, and 2) users would usually either be unsure of how to dispose of properly
or would dispose of improperly.

---

## Statistics and Probability

* Training kit
* Lecture and assignment notebooks
* Books
  * Practical Statistics for Data Scientists
* Video
  * [StatQuest Statistics Fundamentals](https://www.youtube.com/playlist?list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9)

#### Sets

A set is a collection of unique entities. A set is a subset of another set if all of the first set's members are also members of the second set.

The empty set is the set without any members. It is a subset of every set, and every set is a subset of itself.
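
A minimal sketch of these set relationships using Python's built-in `set` type. The example sets (`evens`, `numbers`) are made up for illustration.

```python
# Hypothetical example sets; the names and values are illustrative only.
evens = {2, 4, 6}
numbers = {1, 2, 3, 4, 5, 6}
empty = set()

# Members are unique, so repeated values collapse into one.
deduplicated = {1, 1, 2}                 # same set as {1, 2}

is_subset = evens.issubset(numbers)      # every member of evens is in numbers
empty_is_subset = empty <= numbers       # the empty set is a subset of any set
self_is_subset = numbers <= numbers      # every set is a subset of itself
is_proper_subset = numbers < numbers     # but no set is a *proper* subset of itself
```

The `<=` operator is equivalent to `issubset`, while `<` checks for a proper subset (a subset that is not equal to the other set).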

### Random sampling and sample bias

A _sample_ is a subset of data from a larger dataset, the _population_.

* `N` (`n`) : The size of the population (sample)
* Random sampling : Drawing elements into a sample at random
  * Each available member of the population has an equal chance of being chosen for the sample at each draw
* Stratified sampling : Dividing the population into strata and randomly sampling from each stratum
  * The intuition here is that stratifying helps the sample follow the distribution of the population, particularly when that distribution is skewed or imbalanced
* Simple random sample : random sample without stratifying the population
* Sample bias : a sample that misrepresents the population
  * Samples will always be somewhat non-representative of the population
  * Sampling bias occurs when that difference is meaningful
  * An unbiased process will produce error, but it is random and does not tend strongly in any direction
* With replacement : observations are put back in the population after each draw
* Without replacement : once selected, observations can't be drawn again
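
The sampling schemes above can be sketched with the standard library's `random` module. This is an illustration under assumptions, not a canonical implementation: the population, the stratum labels, and the `stratified_sample` helper are all hypothetical.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # seeded so the draws are reproducible

# Hypothetical imbalanced population: 90 members in stratum "a", 10 in stratum "b".
population = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

# Simple random sample WITHOUT replacement: no member can appear twice.
srs = rng.sample(population, k=20)

# Random sample WITH replacement: the same member may be drawn more than once.
with_replacement = rng.choices(population, k=20)


def stratified_sample(items, key, fraction, rng):
    """Draw the same fraction from each stratum (proportional allocation)."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


stratified = stratified_sample(population, key=lambda m: m[0], fraction=0.2, rng=rng)
counts = {label: sum(1 for m in stratified if m[0] == label) for label in ("a", "b")}
```

With proportional allocation, the stratified sample keeps the population's 90/10 split exactly, whereas the simple random sample only matches it on average.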

Data quality is often more important than data quantity. Random sampling can reduce bias and facilitate quality improvement that would otherwise be prohibitively expensive.

To minimize bias, specify a hypothesis first, then collect data using randomization and random sampling.

* Regression to the mean : when taking successive measurements on a given variable, extreme observations tend to be followed by more central ones
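
A small simulation can make regression to the mean concrete. The sketch below assumes each measurement is a stable "true" level plus independent noise. That model and all of the variable names are assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(7)  # seeded for reproducibility

# Hypothetical model: a stable true level plus independent noise per measurement.
n_subjects = 10_000
true_level = [rng.gauss(0, 1) for _ in range(n_subjects)]
first = [t + rng.gauss(0, 1) for t in true_level]
second = [t + rng.gauss(0, 1) for t in true_level]

# Select the subjects with the most extreme (top 10%) first measurements.
cutoff = sorted(first)[int(0.9 * n_subjects)]
extreme = [i for i in range(n_subjects) if first[i] >= cutoff]

# Compare the same subjects across the two measurements.
mean_first = statistics.fmean(first[i] for i in extreme)
mean_second = statistics.fmean(second[i] for i in extreme)
overall_mean = statistics.fmean(first)
```

The selected group's second measurement falls between its extreme first measurement and the overall mean, because part of what made the first measurement extreme was noise that does not repeat.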

#### Sampling distribution of a statistic

* Sample statistic : a metric calculated for a sample of data drawn from a larger population
* Data distribution : the frequency distribution of individual values in a dataset
* Sampling distribution : the frequency distribution of a sample statistic over many samples or resamples
* Central Limit Theorem : the tendency of the sampling distribution to take on a normal shape as sample size increases
* Standard error : the variability (stdev) of a sample statistic over many samples
  * Standard deviation : variability of individual data values
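
As a rough illustration of these definitions, the sketch below draws many samples from a skewed population and looks at the spread of their means. The exponential population, the sample sizes, and the `sample_means` helper are assumptions chosen for the example.

```python
import random
import statistics

rng = random.Random(0)  # seeded for reproducibility

# Hypothetical skewed population: exponential with mean 1 (far from normal).
population = [rng.expovariate(1.0) for _ in range(50_000)]


def sample_means(data, n, n_samples, rng):
    """Sampling distribution of the mean: one mean per random sample of size n."""
    return [statistics.fmean(rng.sample(data, n)) for _ in range(n_samples)]


means_5 = sample_means(population, n=5, n_samples=2000, rng=rng)
means_50 = sample_means(population, n=50, n_samples=2000, rng=rng)

# Standard deviation: variability of individual values in the data distribution.
sigma = statistics.pstdev(population)

# Standard error: variability of the sample statistic across samples.
se_5 = statistics.stdev(means_5)
se_50 = statistics.stdev(means_50)

# For the mean, the standard error is approximately sigma / sqrt(n).
expected_se_50 = sigma / 50 ** 0.5
```

The standard error shrinks as the sample size grows. Plotting `means_50` as a histogram should show a shape much closer to normal than the population itself, as the Central Limit Theorem suggests.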
  
#### The bootstrap

* Bootstrap sample : a sample taken with replacement from an observed dataset
* Resampling : the process of taking repeated samples from observed data
  * Includes bootstrap and permutation (shuffling)
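
A minimal bootstrap sketch, assuming a made-up observed dataset and a hypothetical `bootstrap_statistic` helper. It estimates the standard error of the mean by resampling with replacement and compares it to the usual formula.

```python
import random
import statistics

rng = random.Random(1)  # seeded for reproducibility

# Hypothetical observed dataset of 200 values.
observed = [rng.gauss(100, 15) for _ in range(200)]


def bootstrap_statistic(data, stat, n_resamples, rng):
    """Compute `stat` on many bootstrap samples (drawn with replacement, same size as data)."""
    n = len(data)
    return [stat(rng.choices(data, k=n)) for _ in range(n_resamples)]


boot_means = bootstrap_statistic(observed, statistics.fmean, n_resamples=1000, rng=rng)

# Bootstrap estimate of the standard error of the mean.
boot_se = statistics.stdev(boot_means)

# Formula-based estimate for comparison: s / sqrt(n).
formula_se = statistics.stdev(observed) / len(observed) ** 0.5

# Approximate 90% percentile interval from the sorted bootstrap means.
sorted_means = sorted(boot_means)
ci_90 = (sorted_means[50], sorted_means[949])
```

The same helper works for statistics with no simple standard error formula, for example by passing `statistics.median` as `stat`.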

---

### Interviewing

> Practice answering the most common interview questions

* "Tell me a bit about yourself"
* "Tell me about a project you've worked on and are proud of"
* "What is your greatest strength / weakness?"
* "Tell me about a time when you had conflict with someone and how you handled it"
* "Tell me about a mistake you made and how you handled it"
* Business case walk-throughs
* Hot-seat DS-related topics for recall practice (under pressure)

> "Where do you see yourself in 3-5 years?"

Ideally working on the cutting edge of deep learning, whether it is doing research or
developing applications and products for users. As company X does X, I can see myself
working deeply on the research or machine learning engineering team here.

> "What brought you to data science? What interested you about data science?"

My background in Economics gave me my first real taste of programmatically gathering
and utilizing data. Data science is mostly a continuation of that into the modern age
of big data. I know that data can make big improvements in people's lives, and data
science ...

> "What sort of compensation are you looking for for this position?"

I have an idea of a salary range based on the position, the work, and my experience.
It's flexible and depends heavily on the whole package, such as PTO and other benefits.
First, I wanted to hear what general range you would offer for this position.

> "Tell me a bit about yourself"

* Homeschooled until seventh grade
  * Learning is a life-long endeavor
  * Jack of many trades, master of some
    * I have a wide range of interests, skills and experiences
    * Go deep on things I'm fascinated with (maybe better as strength/weakness)
  * I like the free flow of creativity
  * Also need something technical to sink my teeth into
* Always been fascinated by technology
* Econ undergrad - first taste of harnessing the power of data
* After college
  * On-site implementation consultant for an ERP software company
  * Trained in manipulating Oracle RDBMS
  * Wrote Crystal Reports using SQL (lots and lots of joins)
* Worked as a professional DJ for a couple of years
  * Wasn't scratching my technical itch
  * Found my way back to data

> "Tell me about a mistake you made and how you handled it"

* First solo implementation project
  * Tried to do everything myself
    * Instantiating the system
    * Migrating the data
    * Teaching the users
    * Writing reports
* Resolution
  * Started building my delegation muscle
  * Decided to backtrack to be sure all of the bases were hit
  * Delegated work to those who specialize in it (teachers to teach)

---

### SQL

> Work through practice problems on HackerRank or Packt

---

### DS + ML Practice

* Pick a dataset and try to do X with it
  * Try to hit benchmark accuracies with [UCI ML datasets](https://archive.ics.uci.edu/ml/index.php)
  * Kaggle
* Practice with the common DS/ML tools and processes
  * Hands-on ML with sklearn, Keras, and TensorFlow
  * Machine learning flashcards

#### _The goal is to be comfortable explaining the entire process._

* Data access / sourcing, cleaning
  * SQL
  * Pandas
  * Exploratory data analysis
  * Data wrangling techniques and processes
* Inference
  * Statistics
  * Probability
  * Visualization
* Modeling
  * Implement + justify choice of model / algorithm
  * Track performance + justify choice of metrics
    * Communicate results as relevant to the goal

---

### Job sourcing

> Browse LinkedIn, Indeed, and connections for promising leads