# Chapter 1

- Remove features with 0 variance : they add no information
- Remove features with very high or very low cardinality : ID-like columns and near-constant columns add little information
- Remove irrelevant columns : they add noise into the model
- Inspect features with seaborn pairplot to remove duplicate columns
- Drop highly correlated features if you are confident that they may add bias into the model
    - look at the pairplot, the heatmap, and Pearson's correlation coefficient
- Drop features whose scaled variance falls below a threshold : very low variance may add noise to the dataset
- Drop columns that have missing values beyond a threshold (generally 30%)
- Extract features from seemingly similar, correlated features :
    - Use PCA
- Visualize the contribution of features with t-SNE
    - Use t-SNE on numeric features and visualize them in 2D
    - use categorical features as `hue` of the scatterplot of the t-SNE output to identify driver features
- Discard less important features of a model by filtering on coefficient magnitude:
    - Recursive feature elimination (RFE)
    - train the model, drop the feature with the lowest coefficient
    - train the model again, drop the next feature with the lowest coefficient
    - continue until a desired number of features remain
- Voting from many models:
    - Perform RFE on many models.
    - Do votes on existing features for all models
    - The features that survive most of the time are the desired features
    - Note : make sure the features are standardized, and that the models are regularized and cross-validated
- Use trees to find out important features
- Generate new features from existing features:
    - eg: average arm length column from left arm column and right arm column
    - eg: generate bmi column from weight and height
    

```
# Remove feature with 0 variance
df.describe()
# Remove features with very high or very low cardinality
df['col'].value_counts()
# Remove irrelevant column 
df = df.drop('col', axis=1)
# Inspect features with seaborn pairplot to remove duplicate columns
sns.pairplot(df, hue="cat_col", diag_kind='hist')
# USE T-SNE
# DO RFE
# DO PCA

# Drop columns that have missing values beyond a threshold (generally 30%)
mask = df.isna().sum() / len(df) < 0.3
reduced_df = df.loc[:, mask]

# Generate new features (eg: average)
df["avg_col"]=( df["col1"] + df["col2"])/2

# Drop highly correlated features if you are confident that they may add bias into the model (Use heatmap on dataframe)
# Drop scaled variance up to a threshold (features with very low variance or close to 0)
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.005)
sel.fit(df / df.mean())
mask = sel.get_support()
reduced_df = df.loc[:, mask]

# Tree based models to filter out features (You can use RFE with it)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
mask = rf.feature_importances_ > 0.1
# See surviving features
X_reduced = X.loc[:, mask]
print(X_reduced.columns)
```
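
The zero-variance and cardinality checks from the list above can also be automated instead of read off `describe()` and `value_counts()`. The following is a minimal sketch, not a definitive rule: the helper name `drop_uninformative`, the `max_unique_ratio` cut-off, and the toy columns are all assumptions made for illustration. It treats a column as constant when it has at most one distinct value, and as ID-like when it is non-numeric and almost every row has its own value.

```python
import pandas as pd

def drop_uninformative(df, max_unique_ratio=0.95):
    """Drop constant columns and ID-like (very high cardinality) columns.

    `max_unique_ratio` is an illustrative threshold, not a standard value.
    """
    n_unique = df.nunique()
    # Zero variance: a single distinct value (or none at all)
    constant = set(n_unique[n_unique <= 1].index)
    # Very high cardinality: only checked on non-numeric columns, because
    # continuous numeric features are naturally almost all distinct
    categorical = df.select_dtypes(exclude="number").columns
    id_like = {c for c in categorical if n_unique[c] / len(df) > max_unique_ratio}
    return df.drop(columns=list(constant | id_like))

# Hypothetical toy data
df = pd.DataFrame({
    "id": [f"u{i}" for i in range(6)],          # one distinct value per row
    "country": ["US"] * 6,                      # constant
    "group": ["a", "b", "a", "b", "a", "b"],
    "height": [1.6, 1.7, 1.7, 1.8, 1.6, 1.8],
})
reduced = drop_uninformative(df)
print(list(reduced.columns))
```

The threshold usually needs adjusting per dataset, and a column flagged as ID-like is worth a manual look before it is dropped.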

### T-SNE

```
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

# Tune learning_rate between 50 and 200
model = TSNE(learning_rate=50)

# Transform into 2D
# NOTE : TSNE has no stand-alone transform method. Fitting and transforming are done together in fit_transform
tsne_features = model.fit_transform(df_numeric)

# Visualize the t-sne clusters
df['x'] = tsne_features[:,0]
df['y'] = tsne_features[:,1]
sns.scatterplot(x="x", y="y", hue="cat_col", data=df)
plt.show()
```

# Chapter 2

### Curse of dimensionality


- Think of a dataset with 3 columns and 100 rows (observations)
- If you add one more feature/column to improve accuracy:
    - For each added feature (column), you have to add more observations
    - Otherwise the model will overfit on the small amount of data (100 samples)
    - How many observations should be added : the number required grows exponentially with each new feature
- In high dimensions, this becomes really problematic
- This phenomenon is called the curse of dimensionality
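
To make the exponential growth concrete, here is a rough sketch under a simplifying assumption: each feature is cut into a fixed number of bins, and we want at least one observation in every combination of bins. The function name `observations_needed` and the default of 10 bins are illustrative choices, not part of any library.

```python
def observations_needed(n_features, bins_per_feature=10):
    """Observations required to place one sample in every cell of a grid
    that has `bins_per_feature` bins along each feature axis."""
    return bins_per_feature ** n_features

# Each extra feature multiplies the requirement by bins_per_feature
for n_features in range(1, 6):
    print(n_features, observations_needed(n_features))
```

Under this assumption, 100 rows cover a grid over 2 features, while a third feature already calls for ten times as many.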




### Heatmap

```
import numpy as np
import seaborn as sns
# Drop highly correlated features if you are confident that they may add bias into the model
corr_df = df.corr()
mask = np.triu(np.ones_like(corr_df, dtype=bool))
sns.heatmap(corr_df, mask=mask, center=0, linewidths=1, annot=True, fmt=".2f")
# Alternative approach
corr_df = df.corr().abs()
mask = np.triu(np.ones_like(corr_df, dtype=bool))
tri_df = corr_df.mask(mask)
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.95)]
reduced_df = df.drop(to_drop, axis=1)
```

# Chapter 3

### RFE

- Discard less important features of a model by filtering on coefficient magnitude:
    - Recursive feature elimination (RFE)
    - train the model, drop the feature with the lowest coefficient
    - train the model again, drop the next feature with the lowest coefficient
    - continue until a desired number of features remain
- Voting from many models:
    - Perform RFE on many models.
    - Do votes on existing features for all models
    - The features that survive most of the time are the desired features
    - Note : make sure the features are standardized, and that the models are regularized and cross-validated

```
# Recursive feature elimination during training 
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2, step=1, verbose=1) # drop 1 feature at each step
rfe.fit(X_train_std, y_train)
# See the surviving features
X.columns[rfe.support_]
# See ranking (higher ranking = weakest features that were eliminated first)
print(dict(zip(X.columns, rfe.ranking_)))
print(accuracy_score(y_test, rfe.predict(X_test_std)))

# Voting from RFE
mask1 = rfe_model1.support_
mask2 = rfe_model2.support_
votes = np.sum([mask1, mask2], axis=0)
final_mask = votes >= 2
reduced_X = X.loc[:, final_mask]
```
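
The elimination loop described in the bullets (train, drop the weakest coefficient, repeat) can be sketched by hand to see what `RFE` does conceptually. This is a simplified illustration, not scikit-learn's implementation: it swaps in ordinary least squares from NumPy for the estimator, and the function name `manual_rfe` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def manual_rfe(X, y, n_features_to_keep):
    """Repeatedly fit least squares and drop the feature with the smallest
    absolute coefficient until `n_features_to_keep` features remain."""
    # Standardize so coefficient magnitudes are comparable across features
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    y_centered = y - y.mean()
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_keep:
        coefs, *_ = np.linalg.lstsq(X_std[:, remaining], y_centered, rcond=None)
        weakest = int(np.argmin(np.abs(coefs)))
        remaining.pop(weakest)  # drop one feature per round, like step=1
    return remaining

# Synthetic data: only features 0 and 3 drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

selected = manual_rfe(X, y, n_features_to_keep=2)
print(selected)
```

Refitting after every removal matters because coefficients shift once a correlated feature is gone, which is why a single ranking from one fit can differ from the RFE result.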

# Chapter 4

### Data loss during feature generation

<center><img src="images/04.01.png"  style="width: 400px; height: 300px;"/></center>

### PCA

```
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
std_df = scaler.fit_transform(df)
pca = PCA()
print(pca.fit_transform(std_df))
# See how much of the variance each principal component explains
print(pca.explained_variance_ratio_)
# See equation of principal components, e.g. : PC 1 = 0.71 x Feature 1 + 0.71 x Feature 2
print(pca.components_)

### Alternative approach: Pipeline
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
pipe = Pipeline([('scaler', StandardScaler()),
                ('reducer', PCA(n_components=3)), # n_components=0.9 for capturing 90% variance
                ('classifier', RandomForestClassifier())])
# Fit the pipeline before reading any fitted attribute (X, y : assumed feature matrix and target)
pipe.fit(X, y)
# principal components
pipe['reducer'].components_
# No of principal components
pipe['reducer'].n_components_
# See PCA explained variance
pipe['reducer'].explained_variance_ratio_
# Visualize elbow plot for PCA tuning
plt.plot(pipe['reducer'].explained_variance_ratio_)
# Visualize PCA plot (pipe[:-1] applies the fitted scaler and reducer without refitting)
pc = pipe[:-1].transform(X)
df['PC 1'] = pc[:,0]
df['PC 2'] = pc[:,1]
sns.scatterplot(data=df, x='PC 1', y='PC 2', hue='cat_col', alpha=0.4)
# Rebuilding back to original Data (inverse_transform undoes the reducer, then the scaler)
pc = pipe[:-1].transform(X)
X_rebuilt = pipe[:-1].inverse_transform(pc)
```
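
Rebuilding the data from fewer components than original features is lossy, which is the data loss this chapter refers to. The sketch below measures that loss with plain NumPy: it performs PCA through an SVD of the centered data rather than through scikit-learn, and the function name `pca_reconstruction_error` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca_reconstruction_error(X, n_components):
    """Mean squared error after projecting X onto its top principal
    components and mapping the result back to the original space."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    projected = X_centered @ components.T      # equivalent of transform
    rebuilt = projected @ components + mean    # equivalent of inverse_transform
    return float(np.mean((X - rebuilt) ** 2))

# Synthetic data: the last feature is nearly a copy of the first
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=100)

errors = [pca_reconstruction_error(X, k) for k in range(1, 5)]
print(errors)
```

The error shrinks as components are added and is essentially zero once all of them are kept, so a plot of these values is another way to pick `n_components`.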

### PCA : Alternative Approach

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
samples = df.drop("target", axis = 1).values
model = PCA(n_components=2)
model.fit(samples)
# Get the mean of the samples: mean
mean = model.mean_
# Get the first principal component: first_pc
first_pc = model.components_[0,:]
# Visualize direction of the component (drawn on the first two feature axes only)
plt.arrow(mean[0] , mean[1], first_pc[0], first_pc[1], color='red', width=0.01)
plt.axis('equal')
plt.show()

transformed = model.transform(samples)
# Principal Components
principal_components = model.components_
# Visualize principal components contribution
features = range(model.n_components_)
plt.bar(features, model.explained_variance_)
plt.show()
# Visualize how the dataset looks after transformation
xs = transformed[:,0]
ys = transformed[:,1]
plt.scatter(xs, ys, c=df["target"].values)
plt.show()
# Contribution of Original Features to Principal Components
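feature_names = list(df.drop("target", axis=1).columns)  # same columns as `samples`, without the target
for i, pc in enumerate(principal_components):
    plt.bar(feature_names, np.abs(pc), label=f'PC {i + 1}', alpha=0.7)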
plt.xlabel('Original Features')
plt.ylabel('Absolute Loadings')
plt.legend()
plt.show()


#### TruncatedSVD : PCA-like reduction for sparse datasets (most entries are zero; sparse formats store only the non-zero entries to save space)
from sklearn.decomposition import TruncatedSVD
# Apply TruncatedSVD
svd_model = TruncatedSVD(n_components=2)
transformed_svd = svd_model.fit_transform(samples) # here samples is a scipy.sparse.csr_matrix, passed as-is
# Visualize how the dataset looks after transformation
xs_svd = transformed_svd[:, 0]
ys_svd = transformed_svd[:, 1]
plt.scatter(xs_svd, ys_svd)
plt.show()
# Principal Components
svd_principal_component = svd_model.components_
# Visualize principal components contribution
plt.bar(range(1, svd_model.n_components + 1), svd_model.explained_variance_ratio_)
plt.show()
# Contribution of Original Features to Principal Components (x axis = feature index of the sparse matrix)
for i, loading in enumerate(svd_principal_component):
    plt.bar(range(len(loading)), np.abs(loading), label=f'PC {i + 1}', alpha=0.7)
plt.legend()
plt.show()
```