# Assignment_9

Question 1. What is feature engineering, and how does it work? Explain the various aspects of feature
engineering in depth.

Feature engineering is the process of creating new features or modifying existing ones in a dataset to improve the performance of machine learning models. It is a crucial and often creative step in the data preprocessing pipeline that can significantly impact the model's ability to learn patterns and make accurate predictions. Feature engineering involves extracting relevant information from raw data, transforming it, and creating informative features that better represent the underlying relationships in the data. 
Here, we'll delve into various aspects of feature engineering:

a)Feature Extraction:

Feature extraction involves converting raw data into a more suitable format for machine learning. It often involves techniques like:
Text Feature Extraction: Converting text data into numerical representations (e.g., TF-IDF, word embeddings like Word2Vec or GloVe).
Image Feature Extraction: Extracting features from images (e.g., using convolutional neural networks or handcrafted features like color histograms).

b)Feature Transformation:
Feature transformation techniques aim to make the data more amenable to modeling. Common transformations include:
Logarithmic or Power Transformations: Useful for handling skewed data distributions.

Scaling and Normalization: Scaling features to a standard range (e.g., Min-Max scaling or z-score normalization).

Encoding Categorical Variables: Converting categorical variables into numerical representations, such as one-hot encoding or label encoding.
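The transformations above can be sketched with NumPy. This is a minimal illustration, not a prescribed pipeline; the `income` and `colors` arrays are made-up example data.

```python
import numpy as np

# Hypothetical right-skewed feature (illustrative values only)
income = np.array([20_000.0, 35_000.0, 50_000.0, 120_000.0, 1_000_000.0])

# Log transform compresses the long right tail (log1p also handles zeros)
income_log = np.log1p(income)

# Min-Max scaling maps the values to the range [0, 1]
income_minmax = (income - income.min()) / (income.max() - income.min())

# Z-score normalization gives mean 0 and standard deviation 1
income_z = (income - income.mean()) / income.std()

# Label encoding and one-hot encoding of a hypothetical categorical column
colors = np.array(["red", "green", "blue", "green"])
categories, codes = np.unique(colors, return_inverse=True)  # codes = label encoding
one_hot = np.eye(len(categories), dtype=int)[codes]         # one row per sample

print(categories)  # columns of the one-hot matrix, in sorted order
print(one_hot)
```

Label encoding imposes an arbitrary order on the categories, so one-hot encoding is usually the safer choice for nominal variables used in linear or distance-based models.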

c)Feature Creation:

Feature creation involves generating new features from existing ones to capture additional information. Examples include:
Polynomial Features: Creating higher-order polynomial features to capture non-linear relationships.
Interaction Features: Combining two or more features to capture interactions between them.
Domain-Specific Features: Creating features based on domain knowledge or business insights.
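As a small sketch of feature creation, the snippet below builds degree-2 polynomial terms, an interaction term, and a ratio feature from two made-up columns. The names `x1` and `x2` are assumptions for illustration, and the ratio only stands in for a domain-specific feature.

```python
import numpy as np

# Two hypothetical numeric features for four samples
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([10.0, 20.0, 40.0, 50.0])

# Polynomial features: squares of the original columns
x1_sq = x1 ** 2
x2_sq = x2 ** 2

# Interaction feature: product of the two columns
x1_x2 = x1 * x2

# Stand-in for a domain-specific feature: a ratio of the two columns
ratio = x2 / x1

# Assemble the expanded feature matrix (one row per sample)
features = np.column_stack([x1, x2, x1_sq, x2_sq, x1_x2, ratio])
print(features.shape)
```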

d)Handling Missing Data:

Dealing with missing data is an essential part of feature engineering. Strategies include:
Imputation: Replacing missing values with reasonable estimates (e.g., mean, median, or interpolation).
Creating Indicator Features: Adding binary indicators to denote missing values in specific columns.

e)Feature Deletion: Removing features with a high percentage of missing values if they are not informative.
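The missing-data strategies above can be sketched as follows. The `age` column and the 50% deletion threshold are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical column with missing entries encoded as NaN
age = np.array([25.0, np.nan, 40.0, 31.0, np.nan])

# Indicator feature: 1 where the original value was missing
age_missing = np.isnan(age).astype(int)

# Median imputation, computed only from the observed values
median_age = np.nanmedian(age)
age_imputed = np.where(np.isnan(age), median_age, age)

# Feature deletion: flag the column if too many values are missing
missing_fraction = age_missing.mean()
drop_column = missing_fraction > 0.5  # threshold is an arbitrary example

print(age_imputed)
print(age_missing)
print(missing_fraction, drop_column)
```

Keeping the indicator alongside the imputed column lets a model learn from the fact that a value was missing, which is sometimes informative in itself.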

f)Feature Selection:
Feature selection is the process of choosing a subset of the most relevant features to improve model simplicity and reduce overfitting. Techniques include:
Univariate Feature Selection: Selecting features based on statistical tests (e.g., chi-squared, ANOVA).
Feature Importance: Using model-specific metrics to rank and select important features.
Recursive Feature Elimination (RFE): Iteratively removing the least important features.

g)Feature Scaling and Normalization:

Ensuring that features are on similar scales can be crucial for many machine learning algorithms. Techniques include Min-Max scaling, z-score normalization, or robust scaling.

h)Feature Engineering for Time Series Data:

Time series data often require special treatment, including creating lag features, rolling statistics, and handling seasonality and trends.
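A minimal sketch of lag features, rolling statistics, and differencing, using a made-up series `y` and an assumed window of 3:

```python
import numpy as np

# Hypothetical daily series
y = np.array([10.0, 12.0, 11.0, 15.0, 14.0, 18.0])

# Lag-1 feature: the previous value (the first entry has no history)
lag_1 = np.concatenate([[np.nan], y[:-1]])

# Rolling mean over a window of 3, aligned to the last element of each window
window = 3
rolling_mean = np.full(len(y), np.nan)
rolling_mean[window - 1:] = np.convolve(y, np.ones(window) / window, mode="valid")

# First difference: a simple way to remove a linear trend
diff_1 = np.diff(y, prepend=np.nan)

print(lag_1)
print(rolling_mean)
print(diff_1)
```

Each feature at time t uses only values up to time t, which avoids leaking future information into the model.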

i)Dimensionality Reduction:

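In cases of high-dimensional data, dimensionality reduction techniques such as Principal Component Analysis (PCA) can reduce the number of features while preserving most of the variance in the data. t-distributed Stochastic Neighbor Embedding (t-SNE) is used mainly to visualize high-dimensional data in two or three dimensions rather than to produce features for a downstream model.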
Validation and Iteration:


Question 2. What is feature selection, and how does it work? What is the aim of it? What are the various
methods of feature selection?

Answer 2)
Feature selection is the process of choosing a subset of the most relevant features (variables or attributes) from a larger set of available features in a dataset. The aim of feature selection is to improve the performance of machine learning models by reducing dimensionality, removing irrelevant or redundant features, and enhancing model interpretability. 
The primary goal of feature selection is to retain the most informative features while discarding those that do not contribute significantly to the predictive power of the model. By selecting a subset of features, we aim to:

Simplify Models: Reducing the number of features simplifies the model and makes it easier to interpret, which can be essential for understanding the factors influencing predictions.

Improve Model Generalization: Removing irrelevant or noisy features can help prevent overfitting, allowing the model to generalize better to unseen data.

Reduce Computational Cost: Fewer features mean lower computational resources and faster model training and inference.

Enhance Model Robustness: Removing redundant or correlated features can improve model robustness and stability.

There are various methods of feature selection, which can be broadly categorized into three main types:

Filter Methods:

Filter methods evaluate the relevance of features independently of the machine learning algorithm to be used. They typically rely on statistical measures or scoring techniques to rank or score features.

Common filter methods include:

Chi-squared test: Measures the independence between categorical features and the target variable.

Correlation coefficient: Measures the linear relationship between continuous features and the target variable.

Information gain or mutual information: Measures the information shared between features and the target variable.
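As a rough sketch of a filter method, the snippet below ranks features by the absolute Pearson correlation with the target and keeps the top `k`. The data is synthetic (an assumption for illustration): only columns 0 and 2 influence the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Synthetic data: only columns 0 and 2 drive the target
X = rng.normal(size=(n_samples, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n_samples)

# Score each feature independently of any model
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Keep the k highest-scoring features
k = 2
selected = np.sort(np.argsort(scores)[::-1][:k])

print(scores.round(3))
print(selected)
```

No model is trained at any point, which is what makes filter methods cheap; the price is that each feature is judged in isolation.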

Wrapper Methods:
Wrapper methods assess the quality of features by directly using them to train and evaluate a machine learning model. They use a search algorithm to explore different feature subsets and select the subset that yields the best model performance.
Common wrapper methods include:

Forward Selection: Starts with an empty set of features and iteratively adds the most relevant features based on model performance.

Backward Elimination: Begins with all features and iteratively removes the least relevant features.
Recursive Feature Elimination (RFE): Repeatedly trains the model and eliminates the least important features until the desired number is reached.
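Forward selection can be sketched as below. This is one possible setup, not a reference implementation: it assumes an ordinary least squares model, a single hold-out validation split, and synthetic data in which only columns 0 and 2 matter. The helper names are hypothetical.

```python
import numpy as np

def validation_mse(X_train, y_train, X_val, y_val, cols):
    """Fit least squares on the chosen columns and return the validation MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, cols]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ coef - y_val) ** 2))

def forward_selection(X_train, y_train, X_val, y_val, n_features):
    selected = []
    remaining = list(range(X_train.shape[1]))
    while len(selected) < n_features:
        # Try adding each remaining feature; keep the one with the lowest error
        errors = {
            j: validation_mse(X_train, y_train, X_val, y_val, selected + [j])
            for j in remaining
        }
        best = min(errors, key=errors.get)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=300)

chosen = forward_selection(X[:200], y[:200], X[200:], y[200:], n_features=2)
print(chosen)
```

Backward elimination follows the same pattern in reverse: start from all columns and, at each step, drop the one whose removal increases the validation error the least.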

Embedded Methods:
Embedded methods incorporate feature selection into the model training process. They optimize feature selection as part of the model training and selection.
Common embedded methods include:

L1 Regularization (Lasso): Adds a penalty term to the model's loss function that encourages feature sparsity by setting some feature coefficients to zero.

Tree-based methods: Some decision tree-based algorithms like Random Forests and XGBoost have built-in feature selection mechanisms that rank or score features based on their importance.
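To show how an L1 penalty sets coefficients to exactly zero, here is a small Lasso sketch using coordinate descent. It assumes centered features and target (no intercept), the objective (1/2n)·||y − Xw||² + alpha·||w||₁, and synthetic data in which only columns 0 and 2 matter. The function names and `alpha=0.5` are illustrative choices.

```python
import numpy as np

def soft_threshold(value, threshold):
    # Shrinks value toward zero; returns exactly 0 when |value| <= threshold
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        for j in range(n_features):
            # Residual with feature j's current contribution added back
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            w[j] = soft_threshold(rho, alpha) / z
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

w = lasso_coordinate_descent(X, y, alpha=0.5)
kept = np.flatnonzero(w)

print(w.round(3))
print(kept)
```

The surviving coefficients are shrunk toward zero relative to the true values 3 and -2; that bias is the cost of the sparsity.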

Question 3. Describe the filter and wrapper approaches to feature selection. State the pros and cons of each
approach.

Answer 3)
Filter methods score each feature with a statistic computed from the data alone, without training the final model, and keep the features with the highest scores. The statistic depends on the data type: for a numerical feature, the correlation coefficient with the target measures the strength of a linear relationship; for a categorical feature, the chi-square test checks whether the feature and the target are independent. Mutual information can be used when the relationship may be non-linear.

Advantage: Low computational cost, since no model has to be trained.
Disadvantage: Usually less accurate than wrapper methods. Features are scored one at a time, so redundant (multicollinear) features and feature interactions go undetected.

Wrapper methods take an iterative approach: a model is trained on different subsets of features and evaluated with a performance metric. A feature is kept or discarded by comparing the score of the current subset with the score from the previous iteration.

Advantage: Usually more accurate, because features are chosen by their effect on the actual model.
Disadvantage: Computationally expensive, since a model is retrained for every candidate subset.

Question 4 (i). Describe the overall feature selection process.

First, a feature selection approach (filter, wrapper, or embedded) is chosen based on the problem statement, keeping in mind the advantages and disadvantages of each. The data is cleaned thoroughly before any features are scored.

Next, a specific technique is chosen within that approach. Features are then included in the main model training based either on univariate statistics or on an iterative procedure that measures each feature's contribution to model performance.

Question 4(ii). Explain the key underlying principle of feature extraction using an example. What are the most
widely used feature extraction algorithms?

The key principle of feature extraction is to build a new, smaller set of features by combining and transforming the original ones, so that the new features give a machine learning model more concise information than the original high-dimensional data.

The MNIST database contains 60,000 training images of handwritten digits, each 28 × 28 pixels, so every image is a vector of 784 raw features. Reducing this dimensionality can shorten training time and may improve accuracy by discarding noisy or redundant pixels. Feature extraction performs this reduction by replacing the pixels with a smaller number of combined variables that still describe most of the information in the data. Principal Component Analysis and non-linear dimensionality reduction methods are among the most widely used techniques.

Question 5. Describe the feature engineering process in the sense of a text categorization issue.


Answer 5) 
Text categorization, or classifying text into tags, requires feature engineering because a machine learning model works on numbers and cannot use raw words directly.

The corpus is first converted to numerical data using Bag of Words or TF-IDF. Further feature engineering can then be applied on top of these vectors, for example adding n-grams or part-of-speech tags, or selecting the most informative terms, before the classifier is trained.
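The conversion from text to numbers can be sketched with the standard library alone. The three documents are made up, tokenization is a plain whitespace split, and the weighting is the textbook form tf × log(N / df); libraries typically use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus
docs = [
    "the match was a great game",
    "the election was a close vote",
    "great game great goal",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})

# Document frequency: in how many documents each term appears
doc_freq = Counter(token for doc in tokenized for token in set(doc))
n_docs = len(docs)

def bag_of_words(tokens):
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def tfidf(tokens):
    counts = Counter(tokens)
    return [
        (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term in vocab
    ]

bow_vectors = [bag_of_words(doc) for doc in tokenized]
tfidf_vectors = [tfidf(doc) for doc in tokenized]

print(vocab)
print(bow_vectors[2])
print([round(v, 3) for v in tfidf_vectors[2]])
```

In the third document, "goal" occurs once and "great" twice, yet "goal" receives the higher TF-IDF weight because it appears in fewer documents.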

Question 6. What makes cosine similarity a good metric for text categorization? A document-term matrix has
two rows with values of (2, 3, 2, 0, 2, 3, 3, 0, 1) and (2, 1, 0, 0, 3, 2, 1, 3, 1). Find the cosine
similarity.

Cosine similarity is a measure of similarity that is used for its good performance and computational efficiency. Two similar documents can be far apart in Euclidean distance when their lengths, and therefore their word frequencies, differ; the angle between their vectors is still small, so cosine similarity still detects that they are similar. Repetition of words within a text matters for analysis, and because cosine similarity works on term counts or weights rather than on mere presence or absence, it takes that repetition into account.

Cosine similarity is a widely used metric for text categorization and document similarity tasks for several reasons:

Scale Invariance: Cosine similarity is scale-invariant, which means it doesn't depend on the magnitude of the vectors being compared. It only considers the direction of the vectors. In text analysis, this is valuable because the frequency of words (term frequencies) can vary significantly, and cosine similarity still works effectively.

Angle Measure: Cosine similarity measures the cosine of the angle between two vectors. When the angle is small (cosine value close to 1), the vectors have similar orientations, implying similarity. In general the cosine ranges from -1 to 1, but term-frequency and TF-IDF vectors have no negative entries, so for documents it ranges from 0 to 1, where 0 (orthogonal vectors) means the documents share no terms. This aligns well with the concept of document similarity: documents with similar word usage should have vectors with smaller angles between them.

Efficiency: Cosine similarity is computationally efficient, especially when working with large document collections. It only requires dot products and vector norms, making it fast to calculate.

Now, let's calculate the cosine similarity between two vectors using the given values:

Vector A = (2, 3, 2, 0, 2, 3, 3, 0, 1)
Vector B = (2, 1, 0, 0, 3, 2, 1, 3, 1)

To calculate the cosine similarity, you can follow these steps:

Calculate the dot product of the two vectors.
Calculate the magnitude (Euclidean norm) of each vector.
Apply the cosine similarity formula.

In [1]:
import numpy as np

# Define the two vectors
vector_a = np.array([2, 3, 2, 0, 2, 3, 3, 0, 1])
vector_b = np.array([2, 1, 0, 0, 3, 2, 1, 3, 1])

# Calculate the dot product of the two vectors
dot_product = np.dot(vector_a, vector_b)

# Calculate the magnitudes (Euclidean norms) of the vectors
magnitude_a = np.linalg.norm(vector_a)
magnitude_b = np.linalg.norm(vector_b)

# Calculate the cosine similarity
cosine_similarity = dot_product / (magnitude_a * magnitude_b)

# Print the result
print("Cosine Similarity:", cosine_similarity)


Cosine Similarity: 0.6753032524419089
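The same result can be checked by hand from the two vectors given in the question:

```latex
\begin{aligned}
A \cdot B &= 2\cdot 2 + 3\cdot 1 + 2\cdot 0 + 0\cdot 0 + 2\cdot 3 + 3\cdot 2 + 3\cdot 1 + 0\cdot 3 + 1\cdot 1 = 23 \\
\lVert A \rVert &= \sqrt{4 + 9 + 4 + 0 + 4 + 9 + 9 + 0 + 1} = \sqrt{40} \approx 6.325 \\
\lVert B \rVert &= \sqrt{4 + 1 + 0 + 0 + 9 + 4 + 1 + 9 + 1} = \sqrt{29} \approx 5.385 \\
\cos(A, B) &= \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{23}{\sqrt{1160}} \approx 0.675
\end{aligned}
```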


Question 7(i). What is the formula for calculating Hamming distance? Calculate the Hamming distance
between 10001011 and 11001111.

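The Hamming distance between two bit strings of equal length is the number of positions at which they differ: d(x, y) = Σ [x_i ≠ y_i], summed over all positions i.

The strings 10001011 and 11001111 differ only in the second and sixth positions, so the Hamming distance between them is 2.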
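A short sketch that counts the differing positions directly; the function name is an illustrative choice.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

distance = hamming_distance("10001011", "11001111")
print(distance)  # 2
```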

Question 7(ii). Compare the Jaccard index and simple matching coefficient of two features with values (1, 1, 0,
0, 1, 0, 1, 1) and (1, 1, 0, 0, 0, 1, 1, 1), respectively (1, 0, 0, 1, 1, 0, 0, 1).

In [2]:
import numpy as np

# Define the two binary feature vectors as NumPy arrays
vector_a = np.array([1, 1, 0, 0, 1, 0, 1, 1])
vector_b = np.array([1, 1, 0, 0, 0, 1, 1, 1])
vector_c = np.array([1, 0, 0, 1, 1, 0, 0, 1])

# Calculate Jaccard Index
def jaccard_index(a, b):
    intersection = np.sum(np.logical_and(a, b))
    union = np.sum(np.logical_or(a, b))
    return intersection / union

jaccard_ab = jaccard_index(vector_a, vector_b)
jaccard_ac = jaccard_index(vector_a, vector_c)

# Calculate Simple Matching Coefficient (SMC)
def smc(a, b):
    # Unlike the Jaccard index, SMC counts 1-1 and 0-0 agreements alike
    matches = np.sum(a == b)
    return matches / len(a)

smc_ab = smc(vector_a, vector_b)
smc_ac = smc(vector_a, vector_c)

# Print the results
print("Jaccard Index (A, B):", jaccard_ab)
print("Simple Matching Coefficient (A, B):", smc_ab)
print("Jaccard Index (A, C):", jaccard_ac)
print("Simple Matching Coefficient (A, C):", smc_ac)


Jaccard Index (A, B): 0.6666666666666666
Simple Matching Coefficient (A, B): 0.75
Jaccard Index (A, C): 0.5
Simple Matching Coefficient (A, C): 0.625


Question 8. State what is meant by "high-dimensional data set"? Could you offer a few real-life examples?
What are the difficulties in using machine learning techniques on a data set with many dimensions?
What can be done about it?

Answer 8)

A high-dimensional data set is one with a very large number of features, so large that the dimensionality itself becomes detrimental to the learning phase of a machine learning model. In high-dimensional data, the number of features can be greater than or equal to the number of observations.

Real life examples:

Gene sequencing data, spatio-temporal data, sociology based data.

Time and space complexity differ between machine learning algorithms, but all of them suffer on high-dimensional data because of the curse of dimensionality. High-dimensional data slows down training, since for some algorithms the number of computations grows quadratically or cubically with the number of features. It also hurts accuracy: with many features and comparatively few observations, spurious correlations appear by chance and the model overfits.

High dimensional data can be processed using feature selection procedures, or dimensionality reduction techniques.
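One effect of the curse of dimensionality can be illustrated numerically: as the number of dimensions grows, the nearest and farthest points from a query become almost equally far away, which weakens distance-based methods. The sketch below uses uniformly random points; the sample size and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}

for n_dims, contrast in contrasts.items():
    print(n_dims, round(float(contrast), 3))
```

The printed contrast shrinks as the dimension increases, which is one reason feature selection or dimensionality reduction helps before applying nearest-neighbour or clustering methods.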

Question 9. Make a few quick notes on:


Q(i) PCA is an acronym for Personal Computer Analysis.
PCA is not an acronym for Personal Computer Analysis, but for Principal Component Analysis. Principal Component Analysis is a dimensionality reduction technique. It computes the eigenvalues and eigenvectors of the data's covariance matrix and uses the eigenvectors with the largest eigenvalues as new features (principal components), which capture most of the variance in the data with fewer dimensions and so can reduce training time.
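A minimal PCA sketch using NumPy is shown below. It uses the singular value decomposition of the centered data, which yields the same directions as the eigenvectors of the covariance matrix. The function name and the synthetic data (three columns that mostly follow one hidden variable) are assumptions for illustration.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the eigenvectors of the covariance matrix
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Eigenvalues of the covariance matrix = variance along each component
    explained_variance = singular_values[:n_components] ** 2 / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Three correlated columns that lie close to a single line, plus small noise
X = np.column_stack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(200, 3))

Z, components, explained_variance = pca(X, n_components=1)
total_variance = X.var(axis=0, ddof=1).sum()
explained_ratio = explained_variance[0] / total_variance

print(Z.shape)
print(round(float(explained_ratio), 4))
```

Here a single component replaces three columns while retaining nearly all of the variance, because the columns are almost perfectly correlated.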


ii) Use of Vectors
Vectors find high usage in linear algebra and linear algebra based applications like the Support Vector Machines for classification problems. Vectors are often used to understand machine learning algorithms via geometric intuition. Vectors are also used in Exploratory Data Analysis and handling of data.

iii) Embedded Techniques
Embedded techniques are one of the feature selection approaches, in which feature selection takes place while the model is being trained. The algorithm handles feature selection itself as part of training. Algorithms like Random Forest provide feature importances and can identify which features are used most in training the model.

Regularization methods such as L1 regularization are also used for feature selection: a penalty is imposed on the coefficients, and the coefficients of features that do not contribute to training can be driven to exactly 0, effectively removing those features.

Question 10. Make a comparison between:

i) Sequential backward exclusion vs. sequential forward selection


Sequential backward elimination (SBE) and sequential forward selection (SFS) are two greedy feature selection algorithms. They work by iteratively adding or removing features to a feature set until a desired criterion is met.

SBE starts with all features in the feature set and iteratively removes the least important feature until a desired number of features remains. The least important feature is typically the one whose removal hurts a chosen criterion the least, most often the model's cross-validated performance; a statistical score can also serve as the criterion.

SFS starts with no features in the feature set and iteratively adds the most important feature at each step, judged by the same kind of criterion used in SBE.

Direction: The primary difference is the direction of the search. Sequential backward exclusion starts with all features and removes them one by one, while sequential forward selection starts with no features and adds them one by one.

Suitability: The choice between the two methods depends on the problem and on how many features are expected to matter. If only a small subset of a large feature set is expected to be useful, forward selection reaches it in few steps. If most features are expected to be kept, backward exclusion is the more natural choice, and because it starts from the full set it can also retain features that are only useful in combination with others.

Computational Complexity: The cost depends on how many steps are needed and on how large the models trained at each step are. Forward selection trains small models and is cheaper when the target subset is small. Backward exclusion starts by training models on nearly all features, which is expensive for high-dimensional datasets, but it is cheaper when only a few features have to be removed.

Risk of Overfitting: Both are greedy searches guided by a performance score, so both can overfit if the score is not estimated carefully (for example, with cross-validation). Forward selection can keep adding features that improve the score only by chance, so a stopping rule is needed.
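The computational comparison can be made concrete by counting model fits. The sketch assumes the plain greedy versions of both algorithms, where each step trains one model per candidate feature; `d` is the total number of features and `k` the number to keep.

```python
def forward_fits(d: int, k: int) -> int:
    """Model fits needed to grow from 0 to k features out of d."""
    # Step i evaluates each of the (d - i) features not yet selected
    return sum(d - i for i in range(k))

def backward_fits(d: int, k: int) -> int:
    """Model fits needed to shrink from d to k features."""
    # Step i evaluates removing each of the (d - i) features still present
    return sum(d - i for i in range(d - k))

for k in (3, 10, 17):
    print(k, forward_fits(20, k), backward_fits(20, k))
```

With 20 features, keeping 3 takes 57 fits going forward and 204 going backward, while keeping 17 reverses the numbers. The count ignores that backward steps also train larger, slower models.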

ii) Feature selection methods: filter vs. wrapper

Filter methods score each feature with a statistic computed from the data alone, without training the final model, and keep the features with the highest scores. The statistic depends on the data type: for a numerical feature, the correlation coefficient with the target measures the strength of a linear relationship; for a categorical feature, the chi-square test checks whether the feature and the target are independent.

The wrapper approach is an iterative approach to feature selection: the model is trained on different subsets of features, and a performance metric is used to check whether the score improves over the previous iteration, in which a different subset of features was used. Examples of this technique are forward feature selection and backward feature elimination.

iii) SMC vs. Jaccard coefficient

The SMC (Simple Matching Coefficient) and the Jaccard coefficient are both similarity measures used to compare sets or binary feature vectors. However, they treat shared absences differently and are therefore suited to different types of data and tasks. Here's a comparison between the two:

Jaccard Coefficient:

Definition:

The Jaccard coefficient, also known as the Jaccard index, measures the similarity between two sets by calculating the size of their intersection divided by the size of their union.
Formula: J(A, B) = |A ∩ B| / |A ∪ B|

Applicability:

The Jaccard coefficient is primarily used for comparing sets or binary vectors. It is commonly used in tasks involving set-based data, such as document similarity, recommendation systems, and clustering.

Range:

The Jaccard coefficient ranges from 0 to 1, where 0 indicates no similarity (completely dissimilar sets) and 1 indicates perfect similarity (identical sets).

Use Cases:

Common use cases include measuring the similarity between two sets of items, such as words in documents, products in a shopping cart, or user preferences in recommendation systems.
Handling Imbalanced Data:

The Jaccard coefficient ignores positions where both vectors are 0, so it is well suited to sparse or asymmetric binary data, where a shared absence (for example, a word that appears in neither document) carries little information.

SMC (Simple Matching Coefficient):

Definition:

The SMC measures the similarity between two binary vectors as the number of positions where they agree, counting both 1-1 and 0-0 matches, divided by the total number of positions.
Formula: SMC(A, B) = (f11 + f00) / (f00 + f01 + f10 + f11), where fxy is the number of positions where A has value x and B has value y. In the same notation, J(A, B) = f11 / (f01 + f10 + f11).
Applicability:

The SMC is also used for comparing binary vectors. It is appropriate for symmetric binary attributes, where 0 and 1 are equally informative (for example, answers to yes/no questions).
Range:

The SMC also ranges from 0 to 1, with 0 indicating no agreement at any position and 1 indicating identical vectors.
Use Cases:

The SMC is applied where agreement on absence is as meaningful as agreement on presence. Because 0-0 matches never lower it, the SMC is always greater than or equal to the Jaccard coefficient for the same pair of vectors, whenever both are defined.
Handling Imbalanced Data:

The SMC is less suitable for sparse data than the Jaccard coefficient: when most entries are 0, the many 0-0 matches dominate and make almost any two vectors look similar.
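The difference between the two coefficients is easiest to see on sparse vectors. In this sketch the SMC is computed as (f11 + f00) / n and the Jaccard coefficient as f11 / (f01 + f10 + f11); the two 20-element vectors are made up so that they share one 1 and seventeen 0s.

```python
import numpy as np

def jaccard(a, b):
    both = np.sum((a == 1) & (b == 1))    # f11
    either = np.sum((a == 1) | (b == 1))  # f01 + f10 + f11
    return both / either

def smc(a, b):
    return np.mean(a == b)                # (f11 + f00) / n

# Two sparse binary vectors: mostly zeros, one shared 1
a = np.zeros(20, dtype=int)
b = np.zeros(20, dtype=int)
a[[0, 1]] = 1
b[[1, 2]] = 1

jaccard_ab = jaccard(a, b)
smc_ab = smc(a, b)

print("Jaccard:", jaccard_ab)
print("SMC:", smc_ab)
```

The SMC comes out at 0.9 because of the seventeen shared zeros, while the Jaccard coefficient is 1/3, reflecting that only one of the three active positions is shared.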