## Introduction
Greetings! Welcome to this lesson on feature selection and dimensionality reduction, a foundational topic in machine learning and data science. Today, we will look at a variance-based approach to feature selection for high-dimensional data. We will cover why feature selection matters, what variance measures, and how to apply scikit-learn's VarianceThreshold to a synthetic dataset.

## Understanding Variance and VarianceThreshold
The variance of a feature measures how far its values spread around their mean: it is the average squared deviation from the mean. It is one of the most widely used summary statistics in data analysis.
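As a minimal sketch of that definition, the snippet below computes the variance of two made-up features by hand and compares the result with `np.var`. The arrays `tight` and `wide` and the helper `population_variance` are illustrative names, not part of the lesson's dataset.

```python
import numpy as np

# Two hypothetical features with the same mean (5.0) but different spread.
tight = np.array([4.9, 5.0, 5.1, 5.0, 5.0])
wide = np.array([1.0, 9.0, 5.0, 2.0, 8.0])


def population_variance(x):
    """Mean squared deviation from the mean (divides by n, not n - 1)."""
    return np.mean((x - np.mean(x)) ** 2)


tight_var = population_variance(tight)
wide_var = population_variance(wide)

print("tight:", tight_var, "wide:", wide_var)
# np.var uses the same population formula by default (ddof=0).
print(np.isclose(tight_var, np.var(tight)), np.isclose(wide_var, np.var(wide)))
```

Both features are centered on 5, yet the second one has a far larger variance because its values sit much further from the mean.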

In the context of feature selection, a feature with low variance (close to zero) takes nearly the same value for every observation, so it gives a model little to distinguish one observation from another. For instance, consider a dataset of students with a 'nationality' variable where 99% of the students come from India. Once encoded numerically, this feature has very low variance because almost all observations are 'India'; it is near-constant and is therefore unlikely to improve the model's performance.
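To put a number on the nationality example, here is a small sketch that assumes the feature has been encoded as a 0/1 indicator (1 for 'India', 0 otherwise). The cohort size of 1000 and the name `is_india` are made up for illustration.

```python
import numpy as np

# Hypothetical cohort: 990 of 1000 students share one nationality.
is_india = np.array([1] * 990 + [0] * 10)

p = is_india.mean()             # share of students from India
indicator_var = is_india.var()  # population variance of the 0/1 column

# For a 0/1 feature the variance equals p * (1 - p).
print(p, indicator_var, p * (1 - p))
```

With 99% of the values identical, the variance is roughly 0.0099, well below a threshold such as 0.1.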

Variance-based feature selection is useful when you suspect that some features are near-constant and may not be informative for the model.

Scikit-learn provides the VarianceThreshold transformer, which removes every feature whose variance does not meet a given threshold. Removing these low-variance features decreases the number of input dimensions.

## Generating Synthetic Data in Python
To demonstrate our feature selection and dimensionality reduction concepts, let's start by generating a synthetic dataset. For many machine learning concepts, especially those related to data preprocessing and manipulation, synthetic datasets can be a useful tool for learning and exploration.

First, we'll need to import pandas, numpy and VarianceThreshold from sklearn.feature_selection. We are going to use pandas and numpy to create a DataFrame with ten distinct features, each composed of random numbers.

```python
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Fix the seed so the random data is reproducible
np.random.seed(36)
```

Next, we generate a DataFrame with ten features.

```python
data = pd.DataFrame(data={
    "feature_1": np.random.rand(1000),
    "feature_2": np.random.rand(1000) * 10,
    "feature_3": np.random.rand(1000),
    "feature_4": np.random.rand(1000) * 100,
    "feature_5": np.random.rand(1000),
    "feature_6": np.random.rand(1000) * 0.1,
    "feature_7": np.random.rand(1000),
    "feature_8": np.random.rand(1000) * 0.01,
    "feature_9": np.random.rand(1000),
    "feature_10": np.random.rand(1000) * 50,
})

print("Original data shape: ", data.shape)  # (1000, 10)
```

```text
Original data shape:  (1000, 10)
```


The output confirms that the DataFrame has 1000 rows and 10 columns.

Here, we assume that all features in our data are numerical and that there is no missing data.
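On your own data these two assumptions are worth checking before fitting the selector. The sketch below shows one way to do so with pandas; the stand-in frame `df` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the column names are made up for this check.
df = pd.DataFrame({
    "a": np.arange(5, dtype=float),
    "b": np.linspace(0.0, 1.0, 5),
})

# True only if every column has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes)
# Total count of missing entries across the whole frame.
missing_total = int(df.isna().sum().sum())

print("All numeric:", all_numeric)
print("Missing values:", missing_total)
```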

## Applying VarianceThreshold on Generated Data
After generating the data, let's apply VarianceThreshold and see how it impacts the dimensionality of our data.

```python
# We use the VarianceThreshold to perform the feature selection.
# We set the threshold to 0.1, meaning that if the variance of a feature is less than 0.1, it will be removed.
selector = VarianceThreshold(threshold=0.1)

# Fit the selector on the data and drop the low-variance columns
data_values = data.values
data_values_reduced = selector.fit_transform(data_values)

# Print the shape of the reduced data
print("Reduced data shape: ", data_values_reduced.shape)  # (1000, 3)
```

```text
Reduced data shape:  (1000, 3)
```


The output of the above code shows that the shape of the reduced data is (1000, 3) after applying the variance threshold.

This indicates that the dimensionality of our dataset has been reduced from 10 features to 3, suggesting that only three features met the variance threshold and therefore were kept.
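The reason only the scaled-up features survive is that a uniform random number on [0, 1) has variance 1/12 (about 0.083), and multiplying it by a constant c multiplies the variance by c squared. The sketch below illustrates this on freshly generated data, using the selector's fitted `variances_` attribute. It uses its own generator and an arbitrary seed, so the numbers will not match the lesson's DataFrame exactly.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)  # separate generator; seed chosen arbitrarily
scales = np.array([1, 10, 100, 0.1, 0.01, 50])

# One uniform column per scale, mirroring how the lesson's features are built.
X = rng.random((1000, scales.size)) * scales

selector = VarianceThreshold(threshold=0.1).fit(X)

expected = scales ** 2 / 12  # variance of scale * Uniform(0, 1)
for s, est, exp, keep in zip(scales, selector.variances_, expected, selector.get_support()):
    print(f"scale={s:<6} estimated={est:10.5f} expected={exp:10.5f} kept={keep}")
```

Only the columns scaled by 10, 100, and 50 clear the 0.1 threshold. Note that this makes the result depend on each feature's units, which is worth keeping in mind when features are measured on very different scales.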

## Identifying Retained Features
Now, it would also be beneficial to know which features have been retained after the feature selection process. For this, we can utilize the get_support method of the VarianceThreshold object.

```python
# Get the names of the features that were kept. By default, get_support returns a boolean mask
# (True for selected features, False for removed ones); with indices=True it returns the
# integer positions of the selected features instead.
kept_features = data.columns[selector.get_support(indices=True)]
print("Kept Features: ", kept_features)
```

```text
Kept Features:  Index(['feature_2', 'feature_4', 'feature_10'], dtype='object')
```


This shows the names of the features that were kept after applying the variance threshold. It provides insight into which features contain enough variance to possibly improve the performance of a machine learning model.
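The two forms of `get_support` can be compared side by side. This is a small sketch on a hypothetical three-column frame (`flat`, `small`, `large`), not the lesson's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Tiny hypothetical frame: "flat" is constant, the other two vary.
df = pd.DataFrame({
    "flat": [1.0, 1.0, 1.0, 1.0],
    "small": [0.0, 1.0, 0.0, 1.0],
    "large": [0.0, 10.0, 20.0, 30.0],
})

selector = VarianceThreshold(threshold=0.1).fit(df)

mask = selector.get_support()                 # boolean array, one entry per column
indices = selector.get_support(indices=True)  # integer positions of kept columns

print(mask)
print(indices)
# Either form can index the column labels and yields the same names.
print(list(df.columns[mask]), list(df.columns[indices]))
```

The boolean mask is handy when you also want to list the removed columns (`df.columns[~mask]`), while the integer form is convenient for positional indexing.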

## Lesson Summary and Practice
You've now learned how to implement VarianceThreshold for feature selection and dimensionality reduction. We've established the importance of dimensionality reduction, introduced feature selection, walked you through the concept of variance, and performed variance-based feature selection with VarianceThreshold using a synthetic dataset.

Remember, practice is the key to mastering these concepts! I recommend experimenting with different variance thresholds and observing how they affect the number and choice of retained features. This will strengthen your understanding of feature selection in your own data science and machine learning projects. Happy learning!
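One possible way to run that experiment is to loop over a few thresholds and record which columns survive each one. The frame below, its column names, and the threshold values are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(36)  # seed is arbitrary
df = pd.DataFrame({
    "unit": rng.random(1000),           # variance near 1/12
    "times_3": rng.random(1000) * 3,    # variance near 9/12
    "times_10": rng.random(1000) * 10,  # variance near 100/12
})

kept_by_threshold = {}
for threshold in [0.01, 0.1, 1.0]:
    selector = VarianceThreshold(threshold=threshold).fit(df)
    kept_by_threshold[threshold] = list(df.columns[selector.get_support()])
    print(threshold, kept_by_threshold[threshold])
```

Raising the threshold removes features one by one. As far as I know, a threshold above every feature's variance makes `fit` raise a `ValueError` rather than return an empty result, so the sweep above stops short of that point.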



## Unveiling High Variance Features in Synthetic Data

Curious about which features in our dataset have enough variance to be considered useful for a machine learning model? The given code employs the VarianceThreshold to filter out features with low variance from synthetic data. Let's embark on this space mission to identify the high-variance features that might significantly contribute to the predictive power of our models. Are you ready to see the results?

```python
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Setting a fixed seed for reproducibility
np.random.seed(36)

# Generating synthetic data
data = pd.DataFrame({
    "feature_1": np.random.rand(1000),
    "feature_2": np.random.rand(1000) * 10,
    "feature_3": np.random.rand(1000),
    "feature_4": np.random.rand(1000) * 100,
    "feature_5": np.random.rand(1000),
    "feature_6": np.random.rand(1000) * 0.1,
    "feature_7": np.random.rand(1000),
    "feature_8": np.random.rand(1000) * 0.01,
    "feature_9": np.random.rand(1000),
    "feature_10": np.random.rand(1000) * 50,
})

# Apply VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
data_values = data.values
data_values_reduced = selector.fit_transform(data_values)

# Identifying retained features
kept_features = data.columns[selector.get_support(indices=True)]

# Displaying shapes of original and reduced data
print("Original data shape: ", data.shape)
print("Reduced data shape: ", data_values_reduced.shape)

# Displaying the names of the kept features
print("Kept Features: ", kept_features)
```

```text
Original data shape:  (1000, 10)
Reduced data shape:  (1000, 3)
Kept Features:  Index(['feature_2', 'feature_4', 'feature_10'], dtype='object')
```


## Adjusting the Variance Threshold

Ready to tweak some code, Space Wanderer? Based on what you learned about reducing input dimensions, adjust the threshold passed to VarianceThreshold and observe the outcome. Will more or fewer features make the cut? Try changing the threshold from 0.15 to 0.5 in the starter code and watch the magic unfold!

```python
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Initial data setup
np.random.seed(10)

# Generate synthetic data with modified feature values
data = pd.DataFrame({
    "feature_1": np.random.normal(0, 1, 1000),
    "feature_2": np.random.normal(5, 2, 1000),
    "feature_3": 0.5 * np.ones(1000),  # This feature has variance 0
    "feature_4": np.random.binomial(1, 0.2, 1000),  # Binary feature will have low variance if p is close to 0 or 1
    "feature_5": np.random.normal(10, 5, 1000),
})

# Apply VarianceThreshold method to filter features
selector = VarianceThreshold(threshold=0.50)  # The change should be made on this line
filtered_data = selector.fit_transform(data)

# Output the results
print("Original data shape: ", data.shape)  # Should print (1000, 5)
print("Reduced data shape: ", filtered_data.shape)  # Output will vary based on threshold
print("Kept Features: ", data.columns[selector.get_support(indices=True)])
```

```text
Original data shape:  (1000, 5)
Reduced data shape:  (1000, 3)
Kept Features:  Index(['feature_1', 'feature_2', 'feature_5'], dtype='object')
```
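The binary feature is the interesting case here. A 0/1 feature that equals 1 with probability p has variance p * (1 - p), which is 0.16 for p = 0.2 and can never exceed 0.25. The sketch below checks this on a fresh sample; it uses its own generator and an arbitrary seed, so the sample differs from the one in the exercise.

```python
import numpy as np

rng = np.random.default_rng(10)  # seed is arbitrary
p = 0.2
binary_feature = rng.binomial(1, p, 1000)

theoretical_var = p * (1 - p)      # 0.16 for p = 0.2
sample_var = binary_feature.var()  # fluctuates around the theoretical value

print("theoretical:", theoretical_var, "sample:", sample_var)
print("survives threshold 0.5:", sample_var > 0.5)
```

Because the theoretical value sits just above 0.15, whether the binary feature passes that lower threshold depends on the particular sample, whereas at 0.5 it is always removed.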


## Setting the Variance Threshold
Stellar work, Space Voyager! Now, to weld the armor of knowledge even tighter, there's a piece of code you need to add. Set the threshold for the variance in our machine learning feature selection technique and print the kept features list at the end. Stand ready with your decision!

```python
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold

np.random.seed(36)  # Ensures that the random numbers are the same each time we run.

# Generating synthetic data with ten features
data = pd.DataFrame({
    "feature_1": np.random.normal(0, 1, 1000),
    "feature_2": np.random.normal(0, 2, 1000),
    "feature_3": np.full(1000, 1),
    "feature_4": np.random.normal(0, 4, 1000),
    "feature_5": np.random.normal(0, 1, 1000),
    "feature_6": np.full(1000, 0),
    "feature_7": np.random.normal(0, 1, 1000),
    "feature_8": np.random.normal(0, 1, 1000),
    "feature_9": np.random.normal(0, 2, 1000),
    "feature_10": np.full(1000, 3)
})

print("Original data shape: ", data.shape)

# Initialize the VarianceThreshold with threshold 0.1
threshold = 0.1
selector = VarianceThreshold(threshold=threshold)

# Fit the selector and transform the data
reduced_data = selector.fit_transform(data)
print("Reduced data shape: ", reduced_data.shape)

# Identify and print the names of the features that have been kept after variance thresholding
kept_features = data.columns[selector.get_support()]
print("Kept features: ", list(kept_features))
```

```text
Original data shape:  (1000, 10)
Reduced data shape:  (1000, 7)
Kept features:  ['feature_1', 'feature_2', 'feature_4', 'feature_5', 'feature_7', 'feature_8', 'feature_9']
```
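The three removed features are constant columns, which have a variance of exactly zero. As a small sketch, the snippet below shows that `VarianceThreshold` with its default `threshold=0.0` is already enough to drop such columns; the frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(36)  # seed is arbitrary
df = pd.DataFrame({
    "noise": rng.normal(0, 1, 200),
    "all_ones": np.full(200, 1),
    "all_zeros": np.full(200, 0),
})

# threshold defaults to 0.0, which drops only zero-variance columns.
selector = VarianceThreshold().fit(df)
kept = list(df.columns[selector.get_support()])
print(kept)
```

A positive threshold such as 0.1 goes one step further and also removes features that vary only slightly.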


## Cosmic Code Crafting: Feature Selection with Variance Threshold

Ready for liftoff on your solo mission, Space Voyager? Here's your chance to prove your prowess with dimensionality reduction. Your task is to recreate the code to apply VarianceThreshold, and find out which features are left. Remember, the final destination should resemble the solution we previously crafted together. Have a stellar journey through the code cosmos!

```python
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold

np.random.seed(36)  # Set the seed for reproducibility

# Generating synthetic data
data = pd.DataFrame({
    "feature_1": np.random.rand(1000),
    "feature_2": np.random.rand(1000) * 10,
    "feature_3": np.random.rand(1000),
    "feature_4": np.random.rand(1000) * 100,
    "feature_5": np.random.rand(1000),
    "feature_6": np.random.rand(1000) * 0.1,
    "feature_7": np.random.rand(1000),
    "feature_8": np.random.rand(1000) * 0.01,
    "feature_9": np.random.rand(1000),
    "feature_10": np.random.rand(1000) * 50,
})

print("Original data shape: ", data.shape)

# Select features using VarianceThreshold with a certain threshold
threshold = 0.1
selector = VarianceThreshold(threshold=threshold)

# Transform the data using the fit_transform method of VarianceThreshold
reduced_data = selector.fit_transform(data)
print("Reduced data shape: ", reduced_data.shape)

# Determine which features were kept after variance threshold
kept_features = data.columns[selector.get_support()]
print("Kept features: ", list(kept_features))
```
