<a href="https://colab.research.google.com/github/yexf308/AppliedStatistics/blob/main/12_Feature_extraction_and_ML_in_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pylab inline 

$\def\m#1{\mathbf{#1}}$
$\def\mm#1{\boldsymbol{#1}}$
$\def\mb#1{\mathbb{#1}}$
$\def\c#1{\mathcal{#1}}$
$\def\mr#1{\mathrm{#1}}$
$\newenvironment{rmat}{\left[\begin{array}{rrrrrrrrrrrrr}}{\end{array}\right]}$
$\newcommand\brm{\begin{rmat}}$
$\newcommand\erm{\end{rmat}}$
$\newenvironment{cmat}{\left[\begin{array}{ccccccccc}}{\end{array}\right]}$
$\newcommand\bcm{\begin{cmat}}$
$\newcommand\ecm{\end{cmat}}$


# Feature Extraction
Machine Learning = Data + **Feature** + Model

Data comes in all forms:

- Real, continuous features: $\m{x}\in \mb{R}^d$.

- Categorical data: $\m{x} = [\text{Red, 12203, Finished basement}].$

- Structured data: tree-style data.

- Text data.

- Image data.

- Audio data.

- Time-series data.


What should we do when some data entries are missing?

### 1. Real vectors

**Q1:**
If many features are uninformative:

- Feature selection (LASSO).
- Dimensionality reduction, e.g., with an auto-encoder.
- Transform the features.
- Collect new data.


**Q2:**
If many features are incomparable (on very different scales):

- Standardization
- Normalization

**Q3:**
If many features are superfluous or correlated with each other:

- Use PCA to decorrelate them.


Common pre-processing pipeline:
1. Standardize the data (de-mean, divide by the standard deviation).
2. Project down to a lower-dimensional representation using PCA.
3. Apply the exact same transformation, with the statistics learned on the training set, to both the training and test data.
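
The three steps above can be sketched with NumPy as follows. This is a minimal illustration under assumptions: the helper names `fit_preprocessor` and `apply_preprocessor` and the toy data are hypothetical, not part of the course code. The point is that the mean, standard deviation, and principal directions are learned from the training split only and then reused unchanged on the test split.

```python
import numpy as np

def fit_preprocessor(X_train, n_components):
    """Learn standardization statistics and PCA directions from training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    Z = (X_train - mean) / std
    # Rows of Vt are the principal directions of the standardized training data.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:n_components].T

def apply_preprocessor(X, mean, std, components):
    """Apply the stored transformation to any split (training or test)."""
    return ((X - mean) / std) @ components

# Toy data (assumption): 5 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X_train, X_test = X[:80], X[80:]

params = fit_preprocessor(X_train, n_components=2)
Z_train = apply_preprocessor(X_train, *params)
Z_test = apply_preprocessor(X_test, *params)
print(Z_train.shape, Z_test.shape)
```

Recomputing the mean or the PCA directions on the test set would give the two splits different coordinate systems, which is why step 3 reuses the training-set parameters.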



### 2. Categorical data
Many machine learning algorithms (e.g., linear predictors) require real-valued vectors
to make predictions, and we want those real-valued numbers to be correlated with the label.

- One-hot encoding: assign a canonical basis vector to each value of a categorical variable. For example, $\text{color}\in \{\text{red, yellow, blue}\}$ is encoded as $[1,0,0]$, $[0,1,0]$, or $[0,0,1]$.

- Zip code: 12203. With one dimension per zip code, the dimension is too large, so we can group them, say, as 122xx.
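
A minimal sketch of both ideas in plain Python. The function names `one_hot` and `zip_prefix` are hypothetical helpers introduced only for illustration.

```python
def one_hot(value, categories):
    """Return the canonical basis vector for `value` within `categories`."""
    return [1.0 if value == c else 0.0 for c in categories]

def zip_prefix(zip_code, keep=3):
    """Coarsen a zip code by keeping only its first `keep` digits."""
    return zip_code[:keep] + "x" * (len(zip_code) - keep)

colors = ["red", "yellow", "blue"]
print(one_hot("yellow", colors))  # [0.0, 1.0, 0.0]
print(zip_prefix("12203"))        # 122xx
```

After grouping, the zip prefix is itself a categorical variable with far fewer values, so it can be one-hot encoded in the same way.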

### 3. Structured data



<img src="https://github.com/yexf308/AppliedStatistics/blob/main/image/structured.png?raw=true" width="600" />


### 4. Image data
Find a feature vector for the image:

- Recognition
- Identification
- Detection
- Image classification

**Computer Vision**: CNN, ResNet, ... I didn't cover it in this course. This is a separate course in itself.


### 5. Text data
Can we embed words into a latent space? This embedding came from directly querying for relationships.

**Natural Language Processing**: this again is a separate course.

- **word2vec** is a popular unsupervised learning approach that just uses a text corpus.

- **Bag of Words** and TF-IDF (see the BBC clustering HW problem).

- **BERT**

- Document clustering based on **Non-negative Matrix Factorization**.
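
As a small illustration of the bag-of-words and TF-IDF features listed above, here is a sketch in plain Python. The three toy documents are made up, and the weighting used, term frequency times $\log(N/\mathrm{df})$, is only one of several common TF-IDF variants; libraries often apply additional smoothing.

```python
import math
from collections import Counter

# Toy corpus (assumption, for illustration only).
docs = [
    "the market fell as stocks dropped",
    "the team won the match",
    "stocks rose as the market recovered",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bag_of_words(tokens):
    """Count vector over the vocabulary; word order is ignored."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Document frequency: in how many documents does each word appear?
df = Counter(w for doc in tokenized for w in set(doc))
idf = {w: math.log(len(docs) / df[w]) for w in vocab}

def tfidf(tokens):
    """Term frequency (normalized by document length) times inverse document frequency."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf[w] for w in vocab]

bow = [bag_of_words(t) for t in tokenized]
weights = [tfidf(t) for t in tokenized]

i_the = vocab.index("the")
print(bow[1][i_the], weights[1][i_the])  # "the" is frequent but carries zero TF-IDF weight
```

Words that occur in every document, such as "the" here, get an inverse document frequency of $\log 1 = 0$, so TF-IDF suppresses them while bag of words does not.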

### 6. Audio data, Time-series data
Hidden Markov Models and Recurrent Neural Networks are covered in my Applied Stochastic Processes course.

# Machine Learning in Practice

### What is Machine Learning
- “Learning is any process by which a system improves performance from
experience.” - Herbert Simon

- A popular definition of **machine learning**, due to Tom Mitchell:
"A computer program is said to learn from experience $E$ with respect to some class of tasks $T$, and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$."

- "Machine learning is driving by looking at the rearview mirror." - one of my friends

### Your Test Data is Sacred!
- Split the data into a training set (80%), a validation set (10%), and a test set (10%).
- You may or may not perform $k$-fold cross-validation, depending on the size of the data.
- Be cautious with trajectory (sequential) data and groupable data (partial exchangeability).
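
The following sketch shows three ways of splitting by index, in plain Python. All function names are hypothetical. The random split matches the 80/10/10 proportions above; the group split keeps every sample of a group on the same side; the time-ordered split avoids shuffling sequential data. These are simple illustrations, not the only reasonable schemes.

```python
import random

def split_indices(n, frac_train=0.8, frac_val=0.1, seed=0):
    """Random train/validation/test split of the indices 0..n-1."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(frac_train * n)
    n_val = int(frac_val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def group_split(groups, frac_train=0.8, seed=0):
    """Split so that all samples sharing a group label land on the same side."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    train_groups = set(unique[:int(frac_train * len(unique))])
    train = [i for i, g in enumerate(groups) if g in train_groups]
    test = [i for i, g in enumerate(groups) if g not in train_groups]
    return train, test

def time_ordered_split(n, frac_train=0.8):
    """For sequential data: train on the past, test on the future (no shuffling)."""
    cut = int(frac_train * n)
    return list(range(cut)), list(range(cut, n))

train, val, test = split_indices(100)
groups = [i // 5 for i in range(50)]  # 10 groups of 5 samples each (toy example)
g_train, g_test = group_split(groups)
past, future = time_ordered_split(100)
print(len(train), len(val), len(test), len(g_train), len(g_test))
```

With grouped data (for example, several measurements per patient), a plain random split would place near-duplicates of test samples in the training set and make the test error look better than it is.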

### Understand Your Data
- Typical problems: some features are all zero or constant; some features have very large values (due to noise, etc.).
- Visualize the data:
  - Histogram each feature.
  - Scatter plot pairs or triplets of features.
  - Perform PCA first and then scatter plot the projected data.

- Don't use category labels (e.g., integer codes) as numeric features! Use one-hot encodings!

### Normalize and Standardize Data
- Mean-center each feature and scale it to unit variance.
- Min-max scaling to $[-1,1]$.
- Whiten the data (zero mean, identity covariance).
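
A minimal NumPy sketch of the three transformations, on made-up data. The whitening shown here is PCA whitening via an eigendecomposition of the covariance matrix, which is one of several valid choices, and it assumes the covariance matrix has full rank.

```python
import numpy as np

# Toy data (assumption): 3 correlated features with very different scales.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.1]])
X = rng.normal(size=(200, 3)) @ mixing + 5.0

# Standardization: zero mean and unit variance per feature.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma

# Min-max scaling of each feature to [-1, 1].
lo, hi = X.min(axis=0), X.max(axis=0)
X_minmax = 2.0 * (X - lo) / (hi - lo) - 1.0

# PCA whitening: rotate onto the eigenvectors of the covariance,
# then rescale each direction to unit variance.
Xc = X - mu
cov = Xc.T @ Xc / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)
X_white = Xc @ eigvecs / np.sqrt(eigvals)

print(np.round(X_white.T @ X_white / len(X), 3))  # approximately the identity
```

Standardization fixes the scale of each feature separately, while whitening also removes the correlations between features.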

### More Data and More Features
- More data is better!
- More features may not be; you need to understand which features matter.
  - Naively: use all features (in principle, adding features should never hurt), unless this causes computational, memory, or overfitting problems.

- Select features:
   - LASSO
   - Sequential Backward Selection: measure performance on the development set with each single feature left out, remove the least important feature, and iterate.
   - Sequential Forward Selection: measure performance of each feature individually, include the most important one, and iterate.
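
The forward variant of sequential selection can be sketched as follows, using ordinary least squares as the predictor and the mean squared error on a held-out development set as the performance measure. Both choices, the function names, and the synthetic data are assumptions made for illustration; any model and metric could be substituted.

```python
import numpy as np

def validation_error(X_tr, y_tr, X_va, y_va, cols):
    """Mean squared development-set error of least squares on the chosen columns."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, cols]])
    w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_va @ w - y_va) ** 2))

def forward_selection(X_tr, y_tr, X_va, y_va, n_select):
    """Greedily add the feature that most reduces the development-set error."""
    selected = []
    remaining = list(range(X_tr.shape[1]))
    while remaining and len(selected) < n_select:
        errors = {j: validation_error(X_tr, y_tr, X_va, y_va, selected + [j])
                  for j in remaining}
        best = min(errors, key=errors.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (assumption): only features 1 and 4 influence the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=300)

chosen = forward_selection(X[:200], y[:200], X[200:], y[200:], n_select=2)
print(chosen)
```

Backward selection follows the same pattern in reverse: start from all features and repeatedly drop the one whose removal hurts the development-set error the least.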

### Never Underestimate the Power of a Linear Predictor
Linear Regression, Logistic Regression, Linear SVMs, etc.

### Unbalanced classes
<img src="https://github.com/yexf308/AppliedStatistics/blob/main/image/unbalanced.png?raw=true" width="500" />


- Downsample the overrepresented classes, but this doesn't use all the data.
- Upsample (duplicate data), but this may cause overfitting and computational overhead.
- Weight the samples (scale the loss), but this might not help if using stochastic gradient descent.
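
A small sketch of the weighting and upsampling options in plain Python, on a made-up label vector with a 90/10 imbalance. The weight formula $n / (k \cdot n_c)$, with $n$ samples, $k$ classes, and $n_c$ samples in class $c$, is one common heuristic and is an assumption here, not something prescribed in this course.

```python
import random
from collections import Counter

# Toy labels (assumption): 90 samples of class 0, 10 samples of class 1.
labels = [0] * 90 + [1] * 10

counts = Counter(labels)
n, k = len(labels), len(counts)

# Weighting: rarer classes get larger per-sample weights in the loss.
class_weight = {c: n / (k * counts[c]) for c in counts}
sample_weight = [class_weight[y] for y in labels]

# Upsampling: duplicate minority samples (with replacement) until classes match.
rng = random.Random(0)
target = max(counts.values())
upsampled = []
for c in counts:
    idx = [i for i, y in enumerate(labels) if y == c]
    upsampled += idx + rng.choices(idx, k=target - len(idx))

print(class_weight)
print(Counter(labels[i] for i in upsampled))
```

With these weights each class contributes the same total weight to the loss, which is the same balance that upsampling achieves by duplication.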

### Overfitting vs Underfitting
**Overfitting:** You do great on training data, but relatively poorly on test data.

- Test on validation set
- Regularize more
- Feature selection (fewer features). 
- Dimensionality reduction
- Get more data
- Use simpler classifier

**Underfitting:** You perform poorly on both training and test data.
- Use more features.
- More sophisticated classifier.
- Kernelize (infinite features!!)
- Regularize less
- Optimize better

The rule of thumb: “Always start by overfitting” - Cris Dima 
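
One way to see both regimes is to fit polynomials of increasing degree and compare training and validation errors. The sketch below uses synthetic data; the target function, the noise level, and the degrees are arbitrary assumptions.

```python
import numpy as np

# Synthetic regression problem (assumption): noisy samples of sin(3x) on [-1, 1].
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=30)
x_val = rng.uniform(-1, 1, size=200)
y_train = np.sin(3 * x_train) + 0.2 * rng.normal(size=30)
y_val = np.sin(3 * x_val) + 0.2 * rng.normal(size=200)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with coefficients `coeffs`."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 4, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(degree, errors[degree])
```

The training error can only decrease as the degree grows, because each model contains the previous one. A model that is too simple typically shows high error on both sets (underfitting), while a very flexible one typically shows a low training error with a noticeably larger validation error (overfitting).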

### Building Large Learning Systems
- Avoid premature statistical optimization:
  - Spend time on the parts that matter
  - Think a lot about features and data
  - Start with part of the data

- Don’t buy the Hype!
  - Always start with simple models
  - Combinations of simple models (Ensemble Methods). 


### Data Science is NOT Easy! 
- Be prepared.
- Be ready to learn every day. Always stay open-minded.
- Be able to test new ideas and implement new algorithms.
- Mathematics + Statistics + Computer Science 



