### 1. What is feature engineering, and how does it work? Explain the various aspects of feature engineering in depth.

`Feature engineering` 

Feature engineering is a machine learning technique that leverages data to create new variables that aren’t in the training set. It can produce new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also enhancing model accuracy.

Feature engineering involves several processes. Feature selection, construction, transformation, and extraction are some key aspects of feature engineering. Let’s understand what each process involves:

- Feature selection

involves choosing a set of features from a large collection. Selecting the important features and reducing the size of the feature set makes computation in machine learning and data analytic algorithms more feasible. Feature selection also improves the quality of the output obtained from algorithms. 


- Feature transformation

involves creating features from existing data by applying mathematical operations. For example, to ascertain the body type of a person, a feature called BMI (Body Mass Index) is needed. If the dataset captures the person’s weight and height, BMI can be derived as weight in kilograms divided by the square of height in metres.


- Feature construction 

is the process of building new features, beyond those produced by feature transformation, that are meaningful variables for the process under study.


- Feature extraction

is a process of reducing the dimensionality of a dataset. Feature extraction involves combining the existing features into new ones thereby reducing the number of features in the dataset. This reduces the amount of data into manageable sizes for algorithms to process, without distorting the original relationships or relevant information.
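As a minimal sketch of the BMI example above, the snippet below derives a new feature from two existing ones. The records and the field names (`weight_kg`, `height_m`, `bmi`) are assumptions made up for illustration.

```python
# Hypothetical records; the field names are assumptions for illustration.
people = [
    {"name": "A", "weight_kg": 70.0, "height_m": 1.75},
    {"name": "B", "weight_kg": 90.0, "height_m": 1.80},
]


def add_bmi(record):
    """Return a copy of the record with a derived 'bmi' feature."""
    bmi = record["weight_kg"] / record["height_m"] ** 2
    return {**record, "bmi": round(bmi, 1)}


enriched = [add_bmi(p) for p in people]
for row in enriched:
    print(row["name"], row["bmi"])  # A 22.9, then B 27.8
```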

### 2. What is feature selection, and how does it work? What is the aim of it? What are the various methods of feature selection?

Feature selection is the method of reducing the number of input variables to your model by using only relevant data and getting rid of noise in the data. It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve.

There are three types of feature selection: 

- Wrapper methods (forward, backward, and stepwise selection), 
- Filter methods (ANOVA, Pearson correlation, variance thresholding), 
- Embedded methods (Lasso, Ridge, Decision Tree).

### 3. Describe the filter and wrapper approaches to feature selection. State the pros and cons of each approach.

`Filter Method:`
    
Features are selected on the basis of their scores in statistical tests of their relationship with the outcome variable, such as Pearson correlation, chi-square, ANOVA, and LDA.
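A filter method can be sketched as scoring each feature against the target and keeping those above a cutoff. The example below uses Pearson correlation; the toy data, the feature names, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical dataset: two candidate features and one target.
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [42, 38, 44, 39, 41],
}
target = [52, 58, 61, 70, 74]

# Score every feature independently of any model, then apply the cutoff.
scores = {name: abs(pearson(col, target)) for name, col in features.items()}
selected = [name for name, score in scores.items() if score >= 0.5]
print(selected)  # ['hours_studied']
```

No model is trained at any point, which is why filter methods are cheap.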

`Wrapper Method:`

In wrapper methods, we train a model on a subset of features. Based on the inferences that we draw from that model, we decide to add features to or remove features from the subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive.

Some common examples of wrapper methods are forward feature selection, backward feature elimination, recursive feature elimination, etc.

- Forward Selection: 

Forward selection is an iterative method in which we start with having no feature in the model. In each iteration, we keep adding the feature which best improves our model till an addition of a new variable does not improve the performance of the model.

- Backward Elimination: 

In backward elimination, we start with all the features and, at each iteration, remove the least significant feature, that is, the one whose removal most improves the performance of the model. We repeat this until removing a feature no longer yields an improvement.

- Recursive Feature elimination: 

It is a greedy optimization algorithm which aims to find the best performing feature subset. It repeatedly creates models and sets aside the best or the worst performing feature at each iteration. It constructs the next model with the remaining features until all the features are exhausted. It then ranks the features based on the order of their elimination.
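The forward selection loop described above can be sketched as follows. The scoring function is passed in as a parameter; in practice it would train a model and return a validation score. Here `toy_score` and the `GAIN` table are hypothetical stand-ins, so the example runs without any model or dataset.

```python
def forward_selection(candidates, score_fn):
    """Greedily add the feature that most improves score_fn; stop when none does."""
    selected = []
    best_score = score_fn(selected)
    remaining = list(candidates)
    while remaining:
        # Score every remaining feature when added to the current subset.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored)
        if top_score <= best_score:
            break  # no candidate improves the model
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical per-feature gains with a small penalty per feature used.
GAIN = {"age": 0.50, "income": 0.30, "noise": 0.00}


def toy_score(subset):
    return sum(GAIN[f] for f in subset) - 0.05 * len(subset)


chosen, score = forward_selection(GAIN, toy_score)
print(chosen)  # ['age', 'income']
```

Each pass evaluates one model per remaining feature, which illustrates why wrapper methods cost far more than filter methods.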

Difference between Filter and Wrapper methods

The main differences between the filter and wrapper methods for feature selection are:

- Filter methods measure the relevance of features by their correlation with the dependent variable, while wrapper methods measure the usefulness of a subset of features by actually training a model on it.

- Filter methods are much faster than wrapper methods because they do not involve training models. Wrapper methods, on the other hand, are computationally very expensive.

- Filter methods use statistical tests to evaluate a subset of features, while wrapper methods use cross-validation.

- Filter methods may fail to find the best subset of features on many occasions. Wrapper methods often find a better subset because they evaluate features with the model itself, but greedy searches such as forward or backward selection are not guaranteed to find the optimal subset.

- Using the subset of features from a wrapper method makes the model more prone to overfitting than using the subset of features from a filter method.

### 4.

i. Describe the overall feature selection process.

ii. Explain the key underlying principle of feature extraction using an example. What are the most widely used feature extraction algorithms?


Feature selection is the method of reducing the number of input variables to your model by using only relevant data and getting rid of noise in the data. It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve.

There are three types of feature selection:

- Wrapper methods (forward, backward, and stepwise selection),
- Filter methods (ANOVA, Pearson correlation, variance thresholding),
- Embedded methods (Lasso, Ridge, Decision Tree).

`Feature extraction`

It is a process of reducing the dimensionality of a dataset. Feature extraction involves combining the existing features into new ones thereby reducing the number of features in the dataset. This reduces the amount of data into manageable sizes for algorithms to process, without distorting the original relationships or relevant information.

`principle of feature extraction`

The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized.

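Widely used dimensionality-reduction techniques include the following. The first three build new features from combinations of the originals (feature extraction proper); the last three discard existing features and are closer to feature selection.

- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Generalized Discriminant Analysis (GDA)
- Low Variance Filter
- High Correlation Filter
- Backward Feature Elimination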
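To make the variance-maximizing projection behind PCA concrete, here is a minimal sketch for two-dimensional data only. It uses the closed-form principal-axis angle of a 2x2 covariance matrix rather than a general eigendecomposition, and the sample points are made up for illustration.

```python
import math

# Hypothetical 2-D data lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
n = len(points)

# Centre the data so the covariance terms are simple sums of products.
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2x2 sample covariance matrix.
cxx = sum(x * x for x, _ in centered) / (n - 1)
cyy = sum(y * y for _, y in centered) / (n - 1)
cxy = sum(x * y for x, y in centered) / (n - 1)

# Angle of the leading eigenvector (valid for the 2x2 symmetric case only).
theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
direction = (math.cos(theta), math.sin(theta))

# Project each point onto the first principal component: 2 features become 1.
projected = [x * direction[0] + y * direction[1] for x, y in centered]
var_projected = sum(p * p for p in projected) / (n - 1)
explained_ratio = var_projected / (cxx + cyy)
print(round(explained_ratio, 3))
```

For higher-dimensional data one would compute eigenvectors of the covariance matrix with a numerical library instead of this closed form.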

### 5. Describe the feature engineering process in the sense of a text categorization issue.

1) Tokenizing the text into words

2) Removing stop words

3) Converting the words to embeddings
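A minimal sketch of such a pipeline is shown below. Tokenizing runs before stop-word removal because the filter operates on individual tokens. Learned embeddings need a trained model, so this sketch swaps in a simple bag-of-words count vector as the numeric representation; the stop-word list and the documents are assumptions for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "is", "on", "and", "of"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def preprocess(text):
    """Tokenize, then drop stop words."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]


docs = ["The cat sat on the mat", "The dog is on the log and the mat"]
tokens = [preprocess(d) for d in docs]

# Bag-of-words stand-in for embeddings: one count per vocabulary word.
vocab = sorted({t for doc in tokens for t in doc})
vectors = [[Counter(doc)[w] for w in vocab] for doc in tokens]

print(vocab)    # ['cat', 'dog', 'log', 'mat', 'sat']
print(vectors)  # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 0]]
```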


### 6. What makes cosine similarity a good metric for text categorization? A document-term matrix has two rows with values of (2, 3, 2, 0, 2, 3, 3, 0, 1) and (2, 1, 0, 0, 3, 2, 1, 3, 1). Find the cosine similarity.

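Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis. It suits text categorization because it depends only on the direction of the term-count vectors, not their magnitude, so a long document and a short document with the same mix of words score as similar.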



In [None]:
from scipy import spatial

dataSetI = [2, 3, 2, 0, 2, 3, 3, 0, 1]
dataSetII = [2, 1, 0, 0, 3, 2, 1, 3, 1]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)

result
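The same value can be checked by hand from the definition: the dot product of the two rows divided by the product of their Euclidean norms. The sketch below uses only the standard library; the function and variable names are my own.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


row1 = [2, 3, 2, 0, 2, 3, 3, 0, 1]
row2 = [2, 1, 0, 0, 3, 2, 1, 3, 1]

# dot = 23, |row1|^2 = 40, |row2|^2 = 29, so similarity = 23 / sqrt(1160)
similarity = cosine_similarity(row1, row2)
print(round(similarity, 4))  # 0.6753
```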

### 7.

i. What is the formula for calculating Hamming distance? Calculate the Hamming distance between 10001011 and 11001111.

ii. Compare the Jaccard index and simple matching coefficient of two features with values (1, 1, 0, 0, 1, 0, 1, 1) and (1, 1, 0, 0, 0, 1, 1, 1), and additionally (1, 0, 0, 1, 1, 0, 0, 1).


`Hamming Distance`

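The Hamming distance between two equal-length strings of symbols is the number of positions at which the corresponding symbols are different. For strings $x$ and $y$ of length $n$, $d(x, y) = \sum_{i=1}^{n} [x_i \neq y_i]$, where the bracket equals 1 when the symbols differ and 0 otherwise.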

In [2]:
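def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length inputs")
    # Each comparison is True (1) or False (0), so the sum is the count.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

# Bit strings from the question
str1 = "10001011"
str2 = "11001111"

# The strings differ at the 2nd and 6th positions
print(hamming_distance(str1, str2))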

2


`Jaccard similarity index`

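The Jaccard similarity index, sometimes called the Jaccard similarity coefficient, compares the members of two sets to see which members are shared and which are distinct. It is a measure of similarity for the two sets of data, with a range from 0% to 100%. The higher the percentage, the more similar the two populations. For two binary feature vectors, it is the number of positions where both are 1 divided by the number of positions where at least one is 1; positions where both are 0 are ignored.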



In [3]:
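def jaccard_similarity(list1, list2):
    """Jaccard index for two equal-length binary vectors.

    Positions where both values are 0 are ignored.
    """
    # Positions where both vectors are 1
    both = sum(1 for a, b in zip(list1, list2) if a == 1 and b == 1)
    # Positions where at least one vector is 1
    either = sum(1 for a, b in zip(list1, list2) if a == 1 or b == 1)
    return both / either

In [20]:
list1 = [1, 1, 0, 0, 1, 0, 1, 1]
list2 = [1, 1, 0, 0, 0, 1, 1, 1]


jaccard_similarity(list1, list2)

0.6666666666666666

In [13]:
list1 = [1, 0, 0, 1, 1, 0, 0, 1]
list2 = [1, 1, 0, 0, 0, 1, 1, 1]

jaccard_similarity(list1, list2)

0.2857142857142857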

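The `simple matching coefficient (SMC) or Rand similarity coefficient` is a statistic used for comparing the similarity and diversity of sample sets. Given two objects, A and B, each with n binary attributes, SMC is defined as:

$$\mathrm{SMC} = \frac{M_{00} + M_{11}}{M_{00} + M_{01} + M_{10} + M_{11}}$$

where:

- $M_{00}$ is the total number of attributes where A and B both have a value of 0.
- $M_{11}$ is the total number of attributes where A and B both have a value of 1.
- $M_{01}$ is the total number of attributes where A is 0 and B is 1.
- $M_{10}$ is the total number of attributes where A is 1 and B is 0.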
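As a sketch of the SMC for the vectors in the question, the function below counts the positions where the two vectors agree (both 0 or both 1) and divides by the total number of attributes. The function and variable names are my own; the pairing of the third vector with the second follows the Jaccard cells above.

```python
def simple_matching_coefficient(a, b):
    """Fraction of positions at which two equal-length binary vectors agree."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


f1 = [1, 1, 0, 0, 1, 0, 1, 1]
f2 = [1, 1, 0, 0, 0, 1, 1, 1]
f3 = [1, 0, 0, 1, 1, 0, 0, 1]

smc_12 = simple_matching_coefficient(f1, f2)
smc_32 = simple_matching_coefficient(f3, f2)
print(smc_12)  # 0.75  (6 of 8 positions agree)
print(smc_32)  # 0.375 (3 of 8 positions agree)
```

Because mutual zeros count as matches here, the SMC for each pair is higher than the Jaccard index for the same pair.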

### 8. State what is meant by  "high-dimensional data set"? Could you offer a few real-life examples? What are the difficulties in using machine learning techniques on a data set with many dimensions? What can be done about it?

High dimensional data refers to a dataset in which the number of features p is larger than the number of observations N, often written as p >> N.

Eg:

1) Healthcare Data

High dimensional data is common in healthcare datasets where the number of features for a given individual can be massive (e.g. blood pressure, resting heart rate, immune system status, surgery history, height, weight, existing conditions, etc.).

2) Financial Data

High dimensional data is also common in financial datasets where the number of features for a given stock can be quite large (e.g. PE Ratio, Market Cap, Trading Volume, Dividend Rate, etc.).


`Challenges with High Dimensional Data :`
    
The curse of dimensionality arises in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining and databases. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse.

- If we have more features than observations, then we run the risk of massively overfitting our model, which would generally result in terrible out-of-sample performance.

- When we have too many features, observations become harder to cluster. Too many dimensions cause every observation in the dataset to appear roughly equidistant from all the others. Because clustering uses a distance measure such as Euclidean distance to quantify the similarity between observations, this is a big problem: if the distances are all approximately equal, then all the observations appear equally alike (as well as equally different), and no meaningful clusters can be formed.
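The equidistance effect described above can be observed with a small experiment. The sketch below draws uniformly random points and measures the relative contrast, (farthest - nearest) / nearest, from one query point. The point count, dimensions, and seed are arbitrary choices for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of Euclidean distances from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        squared = sum((q - p) ** 2 for q, p in zip(query, point))
        distances.append(math.sqrt(squared))
    return (max(distances) - min(distances)) / min(distances)


low_dim = relative_contrast(2)
high_dim = relative_contrast(1000)
print(f"2 dimensions:    {low_dim:.2f}")
print(f"1000 dimensions: {high_dim:.2f}")
```

In two dimensions the nearest point is far closer than the farthest one, so the contrast is large; in 1000 dimensions the contrast shrinks toward zero, which is what makes distance-based clustering unreliable.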

Solutions are:

1) Feature selection
2) Dimensionality reduction
3) Regularization

### 9. Make a few quick notes on:

1. PCA (Principal Component Analysis)

2. Use of vectors

3. Embedded technique


`Principal Component Analysis:`
    
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.


`Vectors`

Vectors are used to represent numeric characteristics, called features, of an object in a mathematical and easily analyzable way. Vectors are essential for many different areas of machine learning and pattern processing.

In NLP, vectors are used to represent embeddings.

`Embedded technique`

An embedding is a low-dimensional translation of a high-dimensional vector. Embedding is the process of converting high-dimensional data to low-dimensional data in the form of a vector in such a way that the two are semantically similar. In Natural Language Processing, word embeddings allow us to capture relationships in language that are very difficult to capture otherwise. However, embedding layers can be used to embed many more things than just words.

### 10. Make a comparison between:

1. Sequential backward exclusion vs. sequential forward selection

2. Feature selection methods: filter vs. wrapper

3. SMC vs. Jaccard coefficient


`Forward Selection:` Forward selection is an iterative method in which we start with having no feature in the model. In each iteration, we keep adding the feature which best improves our model till an addition of a new variable does not improve the performance of the model.

`Backward elimination:` Backward elimination is a feature selection technique used while building a machine learning model. It removes features that do not have a significant effect on the dependent variable or the predicted output. It starts with a (usually complete) set of variables and then excludes variables from that set, one at a time, until some stopping criterion is met.
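For contrast with forward selection, backward elimination can be sketched as the mirror-image loop: start from the full set and drop the feature whose removal helps most. As an assumption for illustration, `toy_score` and the `GAIN` table stand in for training and validating a real model.

```python
def backward_elimination(features, score_fn):
    """Greedily drop the feature whose removal most improves score_fn."""
    selected = list(features)
    best_score = score_fn(selected)
    while len(selected) > 1:
        # Score the subset obtained by dropping each feature in turn.
        scored = [
            (score_fn([f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        top_score, drop = max(scored)
        if top_score <= best_score:
            break  # stopping criterion: no removal improves the score
        selected.remove(drop)
        best_score = top_score
    return selected, best_score


# Hypothetical per-feature gains with a small penalty per feature used.
GAIN = {"age": 0.50, "income": 0.30, "noise": 0.00}


def toy_score(subset):
    return sum(GAIN[f] for f in subset) - 0.05 * len(subset)


kept, score = backward_elimination(GAIN, toy_score)
print(kept)  # ['age', 'income']
```

On this toy scorer both directions reach the same subset, but with a real model they can disagree because each search is greedy.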

`filter vs. wrapper`

The main difference between the filter and wrapper methods for feature selection is that filter methods measure the relevance of features by their correlation with the dependent variable, while wrapper methods measure the usefulness of a subset of features by actually training a model on it.

`SMC vs. Jaccard coefficient`

The SMC is very similar to the more popular Jaccard index. The SMC counts both mutual presence (when an attribute is present in both sets) and mutual absence (when an attribute is absent in both sets) as matches and compares the total to the number of attributes in the universe, whereas the Jaccard index counts only mutual presence as a match and compares it to the number of attributes that are present in at least one of the two sets.