# Study Guide: Training Kit

Topics for DS interview prep, based on the Lambda Training Kit.

Notes:

* Not every topic is listed, only the ones I think I should review
  * That still covers most topics, since there are some I would like to review anyway

---
---

## Unit 1: Statistics Fundamentals



---

### 1. Data Wrangling and Storytelling

Exploratory Data Analysis (EDA)

* Use basic Pandas functions for EDA
* Generate basic graphs with Pandas

Make Features

* The purpose of feature engineering
* Modify or create features using `df.apply()`
* Work with dates and times in Pandas
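
A minimal sketch of the feature-making objectives above, assuming a tiny made-up DataFrame. The column names (`price`, `sqft`, `listed`) and values are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical listings data, invented for illustration only.
df = pd.DataFrame({
    "price": [300000, 450000, 250000],
    "sqft": [1500, 1800, 1000],
    "listed": ["2019-01-15", "2019-03-02", "2019-07-20"],
})

# Create a feature row by row with df.apply (axis=1 passes each row to the function).
df["price_per_sqft"] = df.apply(lambda row: row["price"] / row["sqft"], axis=1)

# Parse strings into datetimes, then pull out date parts with the .dt accessor.
df["listed"] = pd.to_datetime(df["listed"])
df["listed_month"] = df["listed"].dt.month
df["listed_dayofweek"] = df["listed"].dt.dayofweek  # Monday=0, Sunday=6

print(df[["price_per_sqft", "listed_month", "listed_dayofweek"]])
```

For simple arithmetic like this, the vectorized form `df["price"] / df["sqft"]` is usually faster; `df.apply()` earns its keep when the per-row logic is more involved.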

Join and Reshape Data

* Concatenate and merge data with Pandas
* Understand tidy data formatting
* Melt and pivot data with Pandas
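
The reshape and join objectives above can be sketched on a small example. The tables and column names here (`student`, `exam_1`, `exam_2`, `cohort`) are hypothetical.

```python
import pandas as pd

# Hypothetical "wide" table: one row per student, one column per exam.
wide = pd.DataFrame({
    "student": ["ana", "ben"],
    "exam_1": [90, 70],
    "exam_2": [85, 75],
})

# Melt to tidy (long) format: one row per student-exam observation.
tidy = wide.melt(id_vars="student", var_name="exam", value_name="score")

# Pivot back to wide format to confirm the round trip.
back = tidy.pivot(index="student", columns="exam", values="score")

# Merge in a second hypothetical table on the shared key column.
cohorts = pd.DataFrame({"student": ["ana", "ben"], "cohort": ["DS1", "DS2"]})
merged = tidy.merge(cohorts, on="student", how="left")

# Concatenate stacks rows from tables that share the same columns.
stacked = pd.concat([wide, wide], ignore_index=True)

print(tidy)
print(back)
```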

Make Explanatory Visualizations

* Identify misleading visualizations and fix them
* Matplotlib
  * Name the parts of a Matplotlib plot
  * Differentiate between plt syntaxes (the state-based `pyplot` interface vs. the object-oriented `Figure`/`Axes` interface)
  * Control basic visual aspects of a plt plot to mimic popular plotting styles
    * plot, plot stylesheet, title, subtitle
    * axis labels, axis tick marks, background colors, text annotations

---

### 2. Statistical Tests and Experiments

Probability, Statistics, and Inference

* Understand fundamental concepts of set theory and probability:
  * Permutations
  * Combinations
  * Expected value
  * Variance
  * Binomial distribution
  * Bernoulli trial
* Explain how statistics can inform a practical and reliable understanding of data
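
A short sketch tying the counting and distribution concepts above together, using only the standard library. The choice of `n = 10` and `p = 0.5` is arbitrary.

```python
import math

# Counting: ordered arrangements vs. unordered selections of 2 items from 5.
n_perm = math.perm(5, 2)   # 5 * 4 = 20
n_comb = math.comb(5, 2)   # 20 / 2! = 10

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Expected value and variance computed directly from the distribution.
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(n_perm, n_comb, mean, var)
```

The computed mean and variance should agree with the closed forms `n * p` and `n * p * (1 - p)` for a binomial distribution.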

Sampling, Confidence Intervals, and Hypothesis Testing

* Understand how sampling works
  * Fundamental concept(s)
    * Sample vs. population
  * Common pitfalls
* Use hypothesis testing to determine if a result is significant
* Determine the level of confidence for a possible outcome
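
One way to practice the testing and confidence ideas above is a coin-flip experiment. This sketch assumes a hypothetical result of 61 heads in 100 flips, runs an exact binomial test against a fair coin, and builds a normal-approximation confidence interval for the proportion.

```python
import math
from statistics import NormalDist

n, heads = 100, 61   # hypothetical sample, invented for illustration
p_null = 0.5         # null hypothesis: the coin is fair

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Exact two-sided p-value. Doubling one tail is valid here because
# the null distribution (p = 0.5) is symmetric.
upper_tail = sum(binom_pmf(k, n, p_null) for k in range(heads, n + 1))
p_value = min(1.0, 2 * upper_tail)
significant = p_value < 0.05

# 95% confidence interval for the true proportion (normal approximation).
p_hat = heads / n
z = NormalDist().inv_cdf(0.975)   # about 1.96
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)

print(p_value, significant, ci)
```

The normal approximation is a simplification; for small samples or proportions near 0 or 1, other interval methods are usually preferred.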

Introduction to Bayesian Inference

* Discuss differences between Bayesian and Frequentist statistics
* Derive Bayes' theorem from the definition of conditional probability
* Calculate Bayesian confidence intervals (more commonly called credible intervals)
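
A small sketch of the Bayes' theorem derivation and a Bayesian update. The derivation lives in the comments; the numbers (1% prior, 95% true-positive rate, 5% false-positive rate) are hypothetical.

```python
# Derivation from the definition of conditional probability:
#   P(A and B) = P(A | B) * P(B) = P(B | A) * P(A)
#   =>  P(A | B) = P(B | A) * P(A) / P(B)

def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) for a binary hypothesis H given evidence E."""
    # P(E) by the law of total probability over H and not-H.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical diagnostic-test numbers, invented for illustration.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)

# A second independent positive result: the old posterior becomes the new prior.
p2 = posterior(prior=p, likelihood=0.95, false_positive_rate=0.05)

print(p, p2)
```

Feeding the first posterior back in as the prior is the core of the Bayesian view: beliefs are updated as evidence arrives.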

---

### 3. Linear Algebra

Vectors and Matrices

* Explain why linear algebra is important for data science
* Vectors
  * Graph them
  * Identify dimensionality
  * Calculate length
  * Calculate dot product of two vectors
* Matrices
  * Identify dimensionality
  * Multiply them
  * Identify when matmul works
  * Transpose
  * Identify special matrices:
    * Identity matrix
  * Calculate the determinant and inverse
* Use numpy to perform basic linear algebra operations
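
The vector and matrix operations above, sketched with NumPy on small arbitrary arrays.

```python
import numpy as np

v = np.array([3.0, 4.0])
w = np.array([1.0, 2.0])

length = np.linalg.norm(v)   # sqrt(3^2 + 4^2) = 5
dot = v @ w                  # 3*1 + 4*2 = 11

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])              # shape (2, 2)
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])         # shape (2, 3)

# Matmul works when the inner dimensions match: (2, 2) @ (2, 3) -> (2, 3).
AB = A @ B
B_T = B.T                    # transpose swaps the dimensions: (3, 2)

det = np.linalg.det(A)       # 2*3 - 1*1 = 5
A_inv = np.linalg.inv(A)     # exists because det != 0
I = np.eye(2)                # identity matrix: A @ A_inv should equal I

print(length, dot, AB.shape, det)
```

Note that `B @ A` would raise an error, since the inner dimensions (3 and 2) do not match.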

Intermediate Linear Algebra

* Visualize orthogonal projections in R^2
* Understand how variance, standard deviation, covariance, and correlation differ in how they are standardized
* Identify when two vectors or matrices are orthogonal
  * Explain the intuitive implications of orthogonality
* Unit vectors
  * Rewrite any vector as a linear combination of scalars and unit vectors
  * Explain what makes a vector a "unit" vector
  * Know how to turn any vector into a unit vector
* Identify linearly dependent vectors in both graphical and numeric representations
* Calculate the rank of a matrix
  * Use it to determine the span and basis of component vectors
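
A quick NumPy sketch of the intermediate topics above: unit vectors, orthogonality, projection in R^2, and rank. The vectors are arbitrary examples.

```python
import numpy as np

# Unit vector: divide a vector by its own length.
v = np.array([3.0, 4.0])
unit_v = v / np.linalg.norm(v)

# Orthogonal vectors have a dot product of zero.
a = np.array([1.0, 2.0])
b = np.array([-2.0, 1.0])
is_orthogonal = bool(np.isclose(a @ b, 0.0))

# Orthogonal projection of w onto v: (w.v / v.v) * v
w = np.array([2.0, 0.0])
proj = (w @ v) / (v @ v) * v
residual = w - proj          # the leftover part is orthogonal to v

# Rank counts linearly independent rows/columns; row 2 here is 2 * row 1.
M = np.array([[1.0, 2.0],
              [2.0, 4.0]])
rank = np.linalg.matrix_rank(M)

print(unit_v, is_orthogonal, proj, rank)
```

A rank of 1 for a 2x2 matrix means its rows are linearly dependent and span only a line, not the whole plane.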
  
Dimensionality Reduction Techniques

* Recognize when `p > n` and explain why this leads to failure of certain ML models
* Principal Component Analysis (PCA)
  * Identify high-dimensionality data
  * Use PCA to improve model performance
* Understand limitations of projecting data onto an eigenvector subspace
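
To make the eigenvector view of PCA concrete, here is a from-scratch sketch using NumPy's eigendecomposition on synthetic 2-D data. In practice a library implementation such as scikit-learn's `PCA` would be the usual choice; this is just to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data stretched along one direction (illustration only).
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

Xc = X - X.mean(axis=0)                  # center each feature
cov = np.cov(Xc, rowvar=False)           # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues

order = np.argsort(eigvals)[::-1]        # re-sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()      # share of variance per component
X_1d = Xc @ eigvecs[:, :1]               # project onto the first component

print(explained)
```

Projecting onto the first component keeps almost all of the variance here only because the data is nearly one-dimensional; information along the dropped eigenvectors is lost, which is the limitation noted above.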

Clustering

* Identify situations when unsupervised learning is necessary
  * Select and apply appropriate clustering techniques
* Use K-Means clustering and other centroid-based clustering algorithms
* Articulate the "No Free Lunch Principle"
  * Use it to guide the search for an appropriate ML algorithm for a situation

---

### 4. Unit 1 Build Week

---
---

## Unit 2: Predictive Modeling

---

### 1. Linear Models

Regression 1

* Begin with baselines for regression
* Use scikit-learn for linear regression
* Explain the coefficients from a linear regression

Regression 2

* Use scikit-learn for multiple regression
* Understand how OLS minimizes SSE
* Define overfitting/underfitting and the bias/variance tradeoff
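
A plain-Python sketch of how OLS relates to SSE, compared against a mean baseline. The data points are made up, and the closed-form slope and intercept shown apply to the single-feature case only.

```python
# Hypothetical data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def sse(y_true, y_pred):
    """Sum of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

# Baseline: always predict the mean of y.
y_mean = sum(y) / len(y)
baseline_sse = sse(y, [y_mean] * len(y))

# Closed-form OLS for one feature: slope = S_xy / S_xx.
x_mean = sum(x) / len(x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean
ols_sse = sse(y, [intercept + slope * xi for xi in x])

# Nudging the slope away from the OLS solution can only raise SSE.
nudged_sse = sse(y, [intercept + (slope + 0.1) * xi for xi in x])

print(slope, intercept, baseline_sse, ols_sse, nudged_sse)
```

The slope reads as the expected change in `y` for a one-unit increase in `x`, which is the interpretation to practice when explaining coefficients.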

Ridge Regression

* One-hot encoding of categorical features
* Univariate feature selection
* Explain intuition and interpretation of Ridge Regression
* Use sklearn to fit and interpret Ridge Regression models

Logistic Regression

* Train/validate/test split
* Begin with baselines for classification
* Explain intuition and interpretation of Logistic Regression
* Use sklearn to fit and interpret Logistic Regression models

---

### 2. Kaggle Challenge

Decision Trees

* Clean data with outliers and missing values
* Use scikit-learn pipelines
* Use scikit-learn for decision trees
* Get and interpret feature importances of a tree-based model
* Understand why decision trees are useful to model non-linear, non-monotonic relationships and feature interactions

Random Forests

* Use scikit-learn for random forests
* Ordinal encoding with high-cardinality categorical variables
* Understand how categorical encodings affect trees differently than linear models
* Understand how tree ensembles reduce overfitting compared to a single tree with unlimited depth

Cross-Validation

* Do cross-validation with an independent test set
* Use scikit-learn for hyperparameter optimization
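
A sketch of cross-validated hyperparameter search with a held-out test set, assuming synthetic data from `make_classification` and a logistic regression model chosen only to keep the example small. The grid of `C` values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# make_pipeline names each step after its lowercased class name.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training data only.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Final, one-time check on the independent test set.
test_accuracy = search.score(X_test, y_test)

print(search.best_params_, search.best_score_, test_accuracy)
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking validation-fold statistics into preprocessing.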

Classification Metrics

* Get and interpret a confusion matrix for classification models
* Use classification metrics of precision and recall
* Understand relationships between precision, recall, thresholds, and predicted probabilities
* Use and understand the classification metric ROC AUC

---

### 3. Applied Modeling

Define ML Problems

* Choose a target to predict and check distribution
* Choose an appropriate evaluation metric
* Avoid leakage
  * Test data leaking into training data
  * Target information leaking into features

Wrangle ML Datasets

* Explore and join tabular data for supervised machine learning

Permutation and Boosting

* Get permutation importances for model interpretation and feature selection
* Use XGBoost for gradient boosting

Model Interpretation

* Visualize and interpret partial dependence plots
* Explain individual predictions with Shapley value plots

---

### 4. Unit 2 Build: `print(fiction)`

* Define a regression or classification problem, choose evaluation metrics, and begin with baselines
* Fit and evaluate a linear model
* Fit and evaluate a tree-based model
* Communicate the modeling process and insights in writing
* Make visualizations to explain predictive models

---
---

## Unit 3: Data Engineering

Don't really need to review anything in Unit 3!

---
---

## Unit 4: Machine Learning

---

### 1. Natural Language Processing

Introduction to NLP

* Tokenize text
* Remove stop words
* Stem or lemmatize text
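
A bare-bones sketch of tokenizing and stop-word removal using only the standard library. The stop-word set is a tiny hypothetical list; stemming and lemmatization are left out here since they are normally handled by an NLP library.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "on", "in", "to"}

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The cats are sitting on the mat, and the dog is barking!")
content_tokens = remove_stop_words(tokens)

print(tokens)
print(content_tokens)
```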

Vector Representations

* Represent a document as a vector
* Query documents by similarity
* Apply word embedding models to create document vectors
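
Document vectors and similarity queries can be sketched with simple bag-of-words counts and cosine similarity. The documents and query are made up, and word embeddings are not covered by this example.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, invented for illustration.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "stock prices fell sharply today",
}

def vectorize(text):
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = {name: vectorize(text) for name, text in docs.items()}
query = vectorize("cat on a mat")

# Rank documents from most to least similar to the query.
ranked = sorted(vectors, key=lambda name: cosine(query, vectors[name]), reverse=True)

print(ranked)
```

Raw counts let common words like "the" dominate the vectors, which is one motivation for TF-IDF weighting.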

Document Classification

* Extract text features and use them in classification pipelines
* Apply latent semantic indexing (LSI) to a document classification problem
* Benchmark and compare various vectorization methods in doc classification tasks

Topic Modeling

* Describe the Latent Dirichlet Allocation (LDA) process
* Implement a topic model using the gensim library
* Interpret document topic distributions and summarize findings

---

### 2. Neural Network Foundations

Perceptrons

* Understand what a perceptron is and how it works
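
A minimal perceptron sketch, assuming a step activation and the classic perceptron update rule, trained on the logical AND function. The learning rate and epoch count are arbitrary choices.

```python
def step(z):
    """Step activation: fire (1) when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0

def predict(w, b, x):
    return step(w[0] * x[0] + w[1] * x[1] + b)

def train_perceptron(samples, epochs=20, lr=1.0):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(w, b, x)   # -1, 0, or +1
            # Shift the weights toward the target only when the prediction is wrong.
            w[0] += lr * error * x[0]
            w[1] += lr * error * x[1]
            b += lr * error
    return w, b

# Logical AND: linearly separable, so a single perceptron can learn it.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
predictions = [predict(w, b, x) for x, _ in and_data]

print(w, b, predictions)
```

The same code would never settle on XOR, which is not linearly separable and needs at least one hidden layer.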

Gradient Descent and Backpropagation

* Compute backpropagation through hidden layers
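
Backpropagation through a hidden layer is the chain rule applied layer by layer. This sketch assumes the smallest possible network (one input, one sigmoid hidden unit, one linear output, squared-error loss, no biases) and checks the hand-derived gradients against finite differences. All values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    h = sigmoid(w1 * x)   # hidden activation
    y_hat = w2 * h        # linear output
    return h, y_hat

def loss(x, y, w1, w2):
    _, y_hat = forward(x, w1, w2)
    return 0.5 * (y_hat - y) ** 2

def backprop(x, y, w1, w2):
    h, y_hat = forward(x, w1, w2)
    d_yhat = y_hat - y                 # dL/dy_hat
    d_w2 = d_yhat * h                  # dL/dw2
    d_h = d_yhat * w2                  # error pushed back to the hidden unit
    d_w1 = d_h * h * (1.0 - h) * x     # sigmoid'(z) = h * (1 - h), dz/dw1 = x
    return d_w1, d_w2

x, y, w1, w2 = 1.5, 0.5, 0.8, -0.4    # arbitrary illustrative values
g1, g2 = backprop(x, y, w1, w2)

# Numerical gradient check with central differences.
eps = 1e-6
n1 = (loss(x, y, w1 + eps, w2) - loss(x, y, w1 - eps, w2)) / (2 * eps)
n2 = (loss(x, y, w1, w2 + eps) - loss(x, y, w1, w2 - eps)) / (2 * eps)

print(g1, n1)
print(g2, n2)
```

A gradient check like this is a handy way to catch sign or indexing mistakes when writing backpropagation by hand.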

Neural Network Frameworks

Hyperparameter Tuning

---

### 3. Major Neural Network Architectures

Recurrent Neural Networks and LSTM

* Train a basic LSTM with TensorFlow
* Use an LSTM to generate text

Convolutional Neural Networks

* Understand and run a basic CNN with Keras / TensorFlow
* Use transfer learning to run a CNN for high-accuracy image classification

AutoEncoders and Recommendation Systems

AGI and the Future