# Chapter 1

- Remove features with zero variance: a constant column carries no information.
- Remove features with very high or very low cardinality: a column with a distinct value in nearly every row, or with almost a single value, adds little information.
- Remove irrelevant columns: they add noise to the model.
- Inspect features with a seaborn pairplot to spot and remove duplicate columns.
- Drop highly correlated features if you are confident that they add bias to the model.
    - Look at the pairplot, the heatmap, and Pearson's correlation coefficient.
- Drop features whose scaled (normalized) variance is below a threshold: very low variance may add noise to the dataset.
- Drop columns whose share of missing values exceeds a threshold (generally 30%).
- Extract new features from groups of seemingly correlated features:
    - Use PCA.
- Visualize the contribution of features with t-SNE.
    - Run t-SNE on the numeric features and visualize them in 2D.
    - Use categorical features as the `hue` of a scatterplot of the t-SNE output to identify driver features.

```
import numpy as np
import seaborn as sns
from sklearn.feature_selection import VarianceThreshold

# Find features with 0 variance (std == 0 in the summary)
df.describe()
# Find features with very high or very low cardinality
df['col'].value_counts()
df.nunique()  # number of distinct values per column
# Remove an irrelevant column (drop returns a new DataFrame, so reassign)
df = df.drop('col', axis=1)
# Inspect features with a seaborn pairplot to spot duplicate columns
sns.pairplot(df, hue="cat_col", diag_kind='hist')
# For t-SNE, see the t-SNE section below

# Correlation and variance only make sense for numeric columns
num_df = df.select_dtypes(include='number')

# Drop highly correlated features if you are confident that they add bias to the model
corr_df = num_df.corr()
# Hide the upper triangle and the diagonal, which only repeat the lower triangle
mask = np.triu(np.ones_like(corr_df, dtype=bool))
sns.heatmap(corr_df, mask=mask, center=0, linewidths=1, annot=True, fmt=".2f")
# Alternative approach: drop one feature from every pair with |r| > 0.95
corr_df = num_df.corr().abs()
mask = np.triu(np.ones_like(corr_df, dtype=bool))
tri_df = corr_df.mask(mask)
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.95)]
reduced_df = df.drop(to_drop, axis=1)

# Drop columns that have missing values beyond a threshold (generally 30%)
mask = df.isna().sum() / len(df) < 0.3
reduced_df = df.loc[:, mask]

# Drop features whose scaled variance is below a threshold (very low or close to 0)
sel = VarianceThreshold(threshold=0.005)
# Dividing by the mean normalizes each feature so the variances are comparable
sel.fit(num_df / num_df.mean())
mask = sel.get_support()
reduced_df = num_df.loc[:, mask]
```
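
The checklist above suggests PCA for extracting features from correlated columns, but the snippet does not show it. The following is a minimal sketch on synthetic data, not the method used in these notes: the column names (`height_cm`, `height_in`, `weight_kg`), the name `demo_df`, and the 95% explained-variance target are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two nearly redundant columns plus one independent column
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
demo_df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54 + rng.normal(0, 0.1, size=200),
    "weight_kg": rng.normal(70, 12, size=200),
})

# PCA is sensitive to scale, so standardize the features first
scaled = StandardScaler().fit_transform(demo_df)

# A float n_components keeps the fewest components that explain at least
# that fraction of the variance (0.95 is an assumed target, tune as needed)
pca = PCA(n_components=0.95, svd_solver="full")
components = pca.fit_transform(scaled)

print(pca.explained_variance_ratio_)
print(components.shape)
```

Because the two height columns carry almost the same information, two components are enough here to pass the assumed 95% target, so three original columns collapse to two extracted features.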

### t-SNE

```
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

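# Tune learning_rate, typically between 50 and 200
model = TSNE(learning_rate=50)

# Reduce to 2D (n_components defaults to 2)
# NOTE: TSNE has no stand-alone transform method; fitting and transforming happen together in fit_transform
# Assumption: df_numeric holds only the numeric columns of df
df_numeric = df.select_dtypes(include='number')
tsne_features = model.fit_transform(df_numeric)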

# Visualize the t-sne clusters
df['x'] = tsne_features[:,0]
df['y'] = tsne_features[:,1]
sns.scatterplot(x="x", y="y", hue="cat_col", data=df)
plt.show()
```

# Chapter 2

### Curse of dimensionality


- Consider a dataset with 3 columns and 100 rows (observations).
- Suppose you add one more feature (column) to increase accuracy:
    - Each added feature requires more observations.
    - Otherwise the model will overfit to the small sample (100 observations).
    - How many observations to add: the number needed grows exponentially with each new feature.
- In high dimensions, this becomes a serious problem.
- This phenomenon is called the curse of dimensionality.
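
To make the exponential growth concrete, here is a small sketch under one simplifying assumption that is not from these notes: each feature is split into a fixed number of bins, and we want a fixed number of observations in every cell of the resulting grid. The helper name `required_observations` and the defaults of 10 bins and 1 observation per cell are hypothetical.

```python
def required_observations(n_features, bins_per_feature=10, per_cell=1):
    """Observations needed to keep `per_cell` samples in every grid cell."""
    # Each new feature multiplies the number of grid cells by bins_per_feature
    return per_cell * bins_per_feature ** n_features


for n_features in range(1, 7):
    print(n_features, required_observations(n_features))
```

With these assumed settings, going from 3 to 4 features raises the requirement from 1,000 to 10,000 observations, which shows how quickly 100 rows become too few.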


