# **Feature Selection**

## **Agenda**

In this lesson, we will cover the following concepts with the help of a business use case:
* Feature Selection
* Dimensionality Reduction:
  * Dimensionality Reduction Techniques
  * Pros and Cons of Dimensionality Reduction
  * Factor Analysis
    
    

## **What Is Feature Selection?**

Feature selection is the process of identifying and keeping the significant variables that help build a model with good predictive power.

Redundant or irrelevant features can degrade the performance of the model, so they should be removed.

### **Benefits of Feature Selection**

* It reduces overfitting as the unwanted variables are removed, and the focus is on the significant variables.

* It removes irrelevant information, which helps to improve the accuracy of the model’s predictions.

* It reduces the computation time needed to get the model's predictions.

## **Dimensionality Reduction**

Dimensionality reduction is the process of transforming data with a large number of dimensions into data with fewer dimensions while ensuring that it still conveys similar information concisely.

### **Dimensionality Reduction Techniques**

Some of the techniques used for dimensionality reduction are:

1. Dropping variables with a high ratio of missing values
2. Dropping low-variance variables
3. Decision trees (DT)
4. Random forest (RF)
5. Reducing highly correlated variables
6. Backward feature elimination
7. Factor analysis
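
Two of the techniques listed above, dropping low-variance variables and reducing highly correlated variables, can be sketched with pandas. This is a minimal illustration on synthetic data; the column names and both thresholds are assumptions chosen for the example, not values from this lesson.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative only, not the lesson's dataset).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "b": base * 2.0 + rng.normal(scale=0.01, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                            # independent feature
    "d": np.full(200, 3.0),                               # constant, zero variance
})

# Step 1: drop columns whose variance is below a (hypothetical) threshold.
variance_threshold = 1e-3
variances = demo.var()
kept = demo.loc[:, variances > variance_threshold]

# Step 2: for each highly correlated pair, drop the later column.
corr_threshold = 0.95
abs_corr = kept.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
reduced = kept.drop(columns=to_drop)

print(list(reduced.columns))
```

Here the constant column is removed in the first step and the near-duplicate column in the second, so only the two informative columns remain.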


### **Pros of Dimensionality Reduction**

- It helps to compress data, reducing the storage space needed.
- It cuts down on computing time.
- It also aids in the removal of redundant features.

### **Cons of Dimensionality Reduction**

- Some information may be lost as a result.
- There is no exact rule for how many components or factors to keep; in practice, the number is chosen using rules of thumb.

## **Gist of Factor Analysis**

* Factor analysis is used to:
  * Explain variance among the observed variables
  * Condense the set of observed variables into the factors 

  ![FA](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/Applied_Machine_Learning/Images/0.6_Feature_Selection/Trainer_PPT_and_IPYNB/FA.JPG)

* Each factor explains a certain amount of the variance in the observed variables.

* In other words, factor analysis is a method that investigates the linear relation between a number of variables of interest V1, V2, ..., Vl and a smaller number of unobservable factors F1, F2, ..., Fk.
<br><br>
  ![FA1](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/Applied_Machine_Learning/Images/0.6_Feature_Selection/Trainer_PPT_and_IPYNB/FA1.JPG)
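
The linear relation between the observed variables and the factors is commonly written as shown below. The loading symbols $\lambda_{ij}$ and the unique error terms $\varepsilon_i$ are notation introduced here for illustration; they do not appear in the original lesson.

```latex
V_i = \lambda_{i1} F_1 + \lambda_{i2} F_2 + \dots + \lambda_{ik} F_k + \varepsilon_i,
\qquad i = 1, 2, \dots, l, \qquad k < l
```

Each loading $\lambda_{ij}$ describes how strongly the variable $V_i$ depends on the factor $F_j$, and $\varepsilon_i$ captures the part of $V_i$ that the factors do not explain.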




### **Types of FA**

![FA2](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/Applied_Machine_Learning/Images/0.6_Feature_Selection/Trainer_PPT_and_IPYNB/FA2.JPG)

### **Work Process of FA**

The objective of factor analysis is to reduce the number of observed variables and to find the unobservable variables behind them.

It is a two-step process.

![FA3](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/Applied_Machine_Learning/Images/0.6_Feature_Selection/Trainer_PPT_and_IPYNB/FA3.JPG)

### **Choosing Factors**

Note: Let us get familiar with the term eigenvalue before selecting the number of factors.

**Eigenvalues:**

An eigenvalue represents the variance explained by a factor out of the total variance. Eigenvalues are also known as characteristic roots.

**Ways to Choose Factors:**

* Eigenvalues are a good measure for identifying the significant factors. A common rule (the Kaiser criterion) is to retain only the factors whose eigenvalue is greater than 1.

* Besides inspecting the values directly, you can plot the factors' eigenvalues in descending order. This visualization is known as a scree plot, and the number of factors is read off at the point where the curve makes an elbow.
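
The eigenvalue rule can be sketched with NumPy alone. The example below is a minimal illustration on synthetic data in which six observed variables are generated from two hidden factors; the loadings, noise level, and sample size are assumptions made for the example.

```python
import numpy as np

# Synthetic data: six observed variables driven by two hidden factors
# (illustrative only, not the lesson's dataset).
rng = np.random.default_rng(42)
n_samples = 500
factors = rng.normal(size=(n_samples, 2))
loadings = np.array([
    [0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
    [0.0, 0.9], [0.0, 0.8], [0.0, 0.7],
])
noise = rng.normal(scale=0.4, size=(n_samples, 6))
observed = factors @ loadings.T + noise

# Eigenvalues of the correlation matrix, sorted from largest to smallest.
corr_matrix = np.corrcoef(observed, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr_matrix))[::-1]

# Kaiser criterion: keep factors whose eigenvalue is greater than 1.
n_factors = int((eigenvalues > 1).sum())

print(np.round(eigenvalues, 3))
print("Factors with eigenvalue > 1:", n_factors)
```

Plotting `eigenvalues` against their rank (for example with `plt.plot`) gives the scree plot; with data built this way, the elbow is expected right after the second eigenvalue.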



### **Use Case: Feature Selection in Cancer Dataset Using FA**

**Problem Statement**

John Cancer Hospital (JCH) is a leading cancer hospital in the USA. It specializes in preventing breast cancer. 

Over the last few years, JCH has collected breast cancer data from patients who came for screening or treatment. 

However, this data has 32 attributes, which makes it difficult to run models on and to interpret the results. As an ML expert, you have to reduce the number of attributes so that the results are meaningful and accurate.

Use FA for feature selection.

#### **Dataset**

Features of the dataset are computed from a digitized image of a Fine-Needle Aspirate (FNA) of a breast mass. 

They describe the characteristics of the cell nuclei present in the image.

#### **Data Dictionary**

**Dimensions:**
* 32 variables
* 569 observations

**Attribute Information:**

1. ID number 

2. Diagnosis (M = malignant, B = benign)

3. Attributes with mean values: <br>
10 real-valued features are computed for each cell nucleus:
  * radius_mean (mean of distances from center to points on the perimeter)
  * texture_mean (standard deviation of gray-scale values) 
  * perimeter_mean
  * area_mean
  * smoothness_mean (local variation in radius lengths)
  * compactness_mean (perimeter$^2$ / area - 1.0)
  * concavity_mean (severity of concave portions of the contour) 
  * concave points_mean (number of concave portions of the contour) 
  * symmetry_mean
  * fractal_dimension_mean ("coastline approximation" - 1)

4. Attributes with standard error and worst/largest:
  * radius_se	
  * texture_se
  * perimeter_se	
  * area_se	
  * smoothness_se	
  * compactness_se	
  * concavity_se	
  * concave points_se	
  * symmetry_se	
  * fractal_dimension_se	
  * radius_worst	
  * texture_worst	
  * perimeter_worst	
  * area_worst	
  * smoothness_worst	
  * compactness_worst	
  * concavity_worst	
  * concave points_worst	
  * symmetry_worst
  * fractal_dimension_worst


#### **Solution**

##### **Import Libraries**     





NumPy is a Python package that provides multidimensional array objects as well as a number of derived objects.
Matplotlib is a Python visualization library for 2D plots of arrays.
Pandas is used for data manipulation and analysis.
These are the core libraries used for the EDA process.

They are loaded with the import keyword.

In [14]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline

##### **Import and Check the Data**

Before reading data from a CSV file, download the "breast-cancer-data.csv" dataset from the resource section and upload it to the Lab.
Use the Up arrow icon, shown on the left side under the View icon: click it and upload the file from wherever it was downloaded on your system.

After this, the uploaded file will be visible on the left side of your lab along with all the .ipynb files.

In [15]:
df = pd.read_csv('breast-cancer-data.csv')
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


The pd.read_csv function reads the "breast-cancer-data.csv" file, and df.head() shows the top 5 rows of the dataset.

* df is the variable that stores the DataFrame read from the CSV file.
* head() shows the first rows; called without an argument, it returns the top 5 rows.
* For example, df.head(3) shows the top 3 rows.

#### **shape Attribute**

In [16]:
df.shape

(569, 32)

df.shape returns the number of rows and columns in the DataFrame as a tuple: 569 rows and 32 columns here.

#### **info Function**

In [17]:
# Check the data; there should be no missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

* The info() function prints a summary of the DataFrame.
* The summary includes the range index, the number of columns, the column labels, the number of non-null values in each column, the column data types, and the memory usage.

In [18]:
# Define the feature names as a NumPy array.
# The commas matter: adjacent string literals without commas are
# concatenated by Python into one single string.
feature_names = np.array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
 'mean smoothness', 'mean compactness', 'mean concavity',
 'mean concave points', 'mean symmetry', 'mean fractal dimension',
 'radius error', 'texture error', 'perimeter error', 'area error',
 'smoothness error', 'compactness error', 'concavity error',
 'concave points error', 'symmetry error', 'fractal dimension error',
 'worst radius', 'worst texture', 'worst perimeter', 'worst area',
 'worst smoothness', 'worst compactness', 'worst concavity',
 'worst concave points', 'worst symmetry', 'worst fractal dimension'])

In [22]:
# Convert diagnosis column to 1/0
from sklearn.preprocessing import LabelEncoder


* The sklearn.preprocessing package contains a number of useful utility methods and transformer classes for converting raw feature vectors into a format that is suitable for downstream estimators.
* LabelEncoder encodes labels with a value between 0 and n_classes-1, where n_classes is the number of distinct labels.
* The class is loaded with the import keyword.
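
As a small sketch of how LabelEncoder behaves, the example below encodes a toy list of labels (the list is an assumption for illustration, not the lesson's data). LabelEncoder stores the classes in sorted order, so B is encoded as 0 and M as 1.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["M", "B", "B", "M", "B"]  # toy labels, not the lesson's data

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_.tolist())  # classes are stored in sorted order
print(encoded.tolist())
```

The resulting codes agree with the manual M -> 1, B -> 0 mapping used in this lesson.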

In [23]:
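# Encode label diagnosis
# M -> 1
# B -> 0

# Convert diagnosis to a numerical variable in df
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0}).astype(int)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

Note: This error most likely appears because the cell was run a second time. Once the column already holds 1 and 0, map() finds no 'M' or 'B' keys and returns NaN for every row, which astype(int) cannot convert. Run the cell only once after loading the data.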

The above code encodes the diagnosis column: M becomes 1 and B becomes 0.
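
One way to make the encoding step safe to run repeatedly is to skip the mapping when the column is already numeric. The sketch below uses a toy DataFrame, and the helper name `encode_diagnosis` is hypothetical; it is not part of the lesson's code.

```python
import pandas as pd

# Toy frame standing in for df (illustrative only).
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

def encode_diagnosis(frame):
    """Map M/B to 1/0; leave the column untouched if it is already numeric."""
    if not pd.api.types.is_numeric_dtype(frame["diagnosis"]):
        frame["diagnosis"] = frame["diagnosis"].map({"M": 1, "B": 0}).astype(int)
    return frame

toy = encode_diagnosis(toy)
toy = encode_diagnosis(toy)  # second call is a no-op instead of raising
print(toy["diagnosis"].tolist())
```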

## **Factor Analysis**

* Factor analysis is a linear statistical model.
* It is used to condense a group of observed variables into a smaller set of unobserved variables, termed factors, and to explain the variance among the observed variables.

#### **Adequacy Test**


Before performing factor analysis, you need to evaluate the “factorability” of the dataset, that is, whether factors can be found in it. Factorability, or sampling adequacy, can be checked in two ways:

1. Bartlett's test of sphericity

2. Kaiser-Meyer-Olkin (KMO) test



#### **Bartlett's Test**

Bartlett’s test of sphericity checks whether the observed variables intercorrelate at all by comparing the observed correlation matrix against the identity matrix. If the test is statistically insignificant, you should not employ factor analysis.

 

In [None]:
!pip install factor_analyzer

* The above code installs the factor_analyzer package.
* pip is used to install Python packages.
* Factor analysis is an exploratory data analysis method used to search for influential underlying factors or latent variables from a set of observed variables.

Now, perform the adequacy tests by using the factor_analyzer module. Use the code below to run calculate_bartlett_sphericity.

In [8]:
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value, p_value = calculate_bartlett_sphericity(df)
chi_square_value, p_value

(40196.16116300879, 0.0)

* The above code imports FactorAnalyzer and calculate_bartlett_sphericity, then runs the test on df.
* The function returns the chi-square statistic (about 40196.16) and the p-value (reported as 0.0).

**Inference:**

The p-value is reported as 0.0, that is, too small to distinguish from zero. The test is therefore statistically significant, which indicates that the observed correlation matrix is not an identity matrix.
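
To show what the test measures, the chi-square statistic can be computed by hand from the determinant of the correlation matrix. The sketch below uses the standard formula for Bartlett's test of sphericity on synthetic data; the function name `bartlett_sphericity_statistic` and the data are assumptions for illustration, and the p-value step (a chi-square tail probability) is left out.

```python
import numpy as np

def bartlett_sphericity_statistic(data):
    """Return (chi_square, degrees_of_freedom) for Bartlett's test of sphericity."""
    data = np.asarray(data, dtype=float)
    n_samples, n_vars = data.shape
    corr_matrix = np.corrcoef(data, rowvar=False)
    # slogdet is more stable than log(det(...)) when the determinant is tiny.
    _, log_det = np.linalg.slogdet(corr_matrix)
    chi_square = -(n_samples - 1 - (2 * n_vars + 5) / 6) * log_det
    degrees_of_freedom = n_vars * (n_vars - 1) / 2
    return chi_square, degrees_of_freedom

rng = np.random.default_rng(1)
shared = rng.normal(size=(300, 1))
correlated = shared + rng.normal(scale=0.5, size=(300, 4))  # columns share a signal
independent = rng.normal(size=(300, 4))                     # columns are unrelated

chi_corr, dof = bartlett_sphericity_statistic(correlated)
chi_ind, _ = bartlett_sphericity_statistic(independent)
print(chi_corr, dof)
print(chi_ind, dof)
```

A correlation matrix close to the identity has a determinant close to 1, so the statistic stays small; strongly correlated variables push the determinant toward 0 and the statistic up.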

#### **Kaiser-Meyer-Olkin Test**

* The Kaiser-Meyer-Olkin (KMO) test determines if data is suitable for factor analysis.
* It assesses the suitability of each observed variable as well as the entire model.

In [9]:
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all,kmo_model=calculate_kmo(df)



In [10]:
kmo_model

0.26062906217559106

KMO = (Sum of squared correlations between variables) / (Sum of squared correlations between variables + Sum of squared partial correlations between variables)

The result will be a value between 0 and 1. A KMO value of 1 indicates that the data is perfectly suited for factor analysis, while a value of 0 indicates that factor analysis is not appropriate for the data.
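
The formula above can be sketched directly in NumPy, with the partial correlations taken from the inverse of the correlation matrix. This is a minimal illustration on synthetic data; the function name `kmo_overall` is hypothetical, and the sketch is not the factor_analyzer implementation.

```python
import numpy as np

def kmo_overall(data):
    """Overall KMO: squared correlations vs. squared partial correlations."""
    data = np.asarray(data, dtype=float)
    corr_matrix = np.corrcoef(data, rowvar=False)
    inv_corr = np.linalg.inv(corr_matrix)
    # Partial correlation between i and j, controlling for all other variables.
    scale = np.sqrt(np.outer(np.diag(inv_corr), np.diag(inv_corr)))
    partial_corr = -inv_corr / scale
    # Only off-diagonal entries (pairs of distinct variables) enter the formula.
    off_diagonal = ~np.eye(corr_matrix.shape[0], dtype=bool)
    sum_sq_corr = np.sum(corr_matrix[off_diagonal] ** 2)
    sum_sq_partial = np.sum(partial_corr[off_diagonal] ** 2)
    return sum_sq_corr / (sum_sq_corr + sum_sq_partial)

# Five variables that all share one underlying signal (illustrative only).
rng = np.random.default_rng(3)
shared = rng.normal(size=(300, 1))
data = shared + rng.normal(scale=0.5, size=(300, 5))

kmo_value = kmo_overall(data)
print(round(float(kmo_value), 3))
```

When the variables share common factors, the partial correlations are small relative to the plain correlations, which pushes the ratio toward 1.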

In general, the following guidelines can be used to interpret KMO values:

* KMO < 0.5: unacceptable for factor analysis
* 0.5 ≤ KMO < 0.6: poor for factor analysis
* 0.6 ≤ KMO < 0.7: mediocre for factor analysis
* 0.7 ≤ KMO < 0.8: middling for factor analysis
* 0.8 ≤ KMO < 0.9: meritorious for factor analysis
* KMO ≥ 0.9: marvelous for factor analysis

**Inference:**

The overall KMO value (about 0.26) is less than 0.5, so the full dataset is not suitable for factor analysis as it stands. The insignificant variables need to be removed first.

**Finding Significant Variables**

In [11]:
corr = df.corr()
corr.style.background_gradient()  # optionally pass cmap='coolwarm'

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
id,1.0,0.039769,0.074626,0.09977,0.073159,0.096893,-0.012968,9.6e-05,0.05008,0.044158,-0.022114,-0.052511,0.143048,-0.007526,0.137331,0.177742,0.096781,0.033961,0.055239,0.078768,-0.017306,0.025725,0.082405,0.06472,0.079986,0.107187,0.010338,-0.002968,0.023203,0.035174,-0.044224,-0.029866
diagnosis,0.039769,1.0,0.730029,0.415185,0.742636,0.708984,0.35856,0.596534,0.69636,0.776614,0.330499,-0.012838,0.567134,-0.008303,0.556141,0.548236,-0.067016,0.292999,0.25373,0.408042,-0.006522,0.077972,0.776454,0.456903,0.782914,0.733825,0.421465,0.590998,0.65961,0.793566,0.416294,0.323872
radius_mean,0.074626,0.730029,1.0,0.323782,0.997855,0.987357,0.170581,0.506124,0.676764,0.822529,0.147741,-0.311631,0.67909,-0.097317,0.674172,0.735864,-0.2226,0.206,0.194204,0.376169,-0.104321,-0.042641,0.969539,0.297008,0.965137,0.941082,0.119616,0.413463,0.526911,0.744214,0.163953,0.007066
texture_mean,0.09977,0.415185,0.323782,1.0,0.329533,0.321086,-0.023389,0.236702,0.302418,0.293464,0.071401,-0.076437,0.275869,0.386358,0.281673,0.259845,0.006614,0.191975,0.143293,0.163851,0.009127,0.054458,0.352573,0.912045,0.35804,0.343546,0.077503,0.27783,0.301025,0.295316,0.105008,0.119205
perimeter_mean,0.073159,0.742636,0.997855,0.329533,1.0,0.986507,0.207278,0.556936,0.716136,0.850977,0.183027,-0.261477,0.691765,-0.086761,0.693135,0.744983,-0.202694,0.250744,0.228082,0.407217,-0.081629,-0.005523,0.969476,0.303038,0.970387,0.94155,0.150549,0.455774,0.563879,0.771241,0.189115,0.051019
area_mean,0.096893,0.708984,0.987357,0.321086,0.986507,1.0,0.177028,0.498502,0.685983,0.823269,0.151293,-0.28311,0.732562,-0.06628,0.726628,0.800086,-0.166777,0.212583,0.20766,0.37232,-0.072497,-0.019887,0.962746,0.287489,0.95912,0.959213,0.123523,0.39041,0.512606,0.722017,0.14357,0.003738
smoothness_mean,-0.012968,0.35856,0.170581,-0.023389,0.207278,0.177028,1.0,0.659123,0.521984,0.553695,0.557775,0.584792,0.301467,0.068406,0.296092,0.246552,0.332375,0.318943,0.248396,0.380676,0.200774,0.283607,0.21312,0.036072,0.238853,0.206718,0.805324,0.472468,0.434926,0.503053,0.394309,0.499316
compactness_mean,9.6e-05,0.596534,0.506124,0.236702,0.556936,0.498502,0.659123,1.0,0.883121,0.831135,0.602641,0.565369,0.497473,0.046205,0.548905,0.455653,0.135299,0.738722,0.570517,0.642262,0.229977,0.507318,0.535315,0.248133,0.59021,0.509604,0.565541,0.865809,0.816275,0.815573,0.510223,0.687382
concavity_mean,0.05008,0.69636,0.676764,0.302418,0.716136,0.685983,0.521984,0.883121,1.0,0.921391,0.500667,0.336783,0.631925,0.076218,0.660391,0.617427,0.098564,0.670279,0.69127,0.68326,0.178009,0.449301,0.688236,0.299879,0.729565,0.675987,0.448822,0.754968,0.884103,0.861323,0.409464,0.51493
concave points_mean,0.044158,0.776614,0.822529,0.293464,0.850977,0.823269,0.553695,0.831135,0.921391,1.0,0.462497,0.166917,0.69805,0.02148,0.71065,0.690299,0.027653,0.490424,0.439167,0.615634,0.095351,0.257584,0.830318,0.292752,0.855923,0.80963,0.452753,0.667454,0.752399,0.910155,0.375744,0.368661




If the goal is simply to inspect the correlation matrix rather than to create a plot, the built-in pandas styling options shown in the above code are a convenient solution.

**Creating Dataset of Significant Variables**

In [12]:
df_corr = df[['radius_mean','perimeter_mean', 'area_mean','radius_worst','perimeter_worst',
              'area_worst','concavity_mean','concave points_mean', 'concavity_worst',
              'concave points_worst','diagnosis']]
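
The lesson does not state the rule used to pick these columns. One plausible approach, sketched below on synthetic data, is to keep the variables whose absolute correlation with the target exceeds a cut-off. The column names and the 0.65 threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative only, not the lesson's dataset).
rng = np.random.default_rng(7)
target = rng.integers(0, 2, size=300)
toy = pd.DataFrame({
    "diagnosis": target,
    "strong_1": target + rng.normal(scale=0.3, size=300),
    "strong_2": target * 2 + rng.normal(scale=0.5, size=300),
    "weak": rng.normal(size=300),
})

threshold = 0.65  # hypothetical cut-off, not stated in the lesson
corr_with_target = toy.corr()["diagnosis"].drop("diagnosis").abs()
selected = corr_with_target[corr_with_target > threshold].index.tolist()

# Keep the selected variables together with the target column.
toy_corr = toy[selected + ["diagnosis"]]
print(selected)
```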

**Running KMO Test**

In [24]:
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all,kmo_model=calculate_kmo(df_corr)
kmo_model

0.8260787423549438

* The above code calculates the KMO value for the reduced dataset. KMO estimates the proportion of variance among the observed variables that may be common variance.
* The overall KMO for the data is about 0.83, which falls in the meritorious range of the guidelines above. This value indicates that you can proceed with the planned factor analysis.

**Inference:**

The value of the KMO model is greater than 0.5. Therefore, the dataset is good enough for factor analysis.

**Note: In this lesson, we saw the use of feature selection methods. In the next lesson, we will use one of these methods as a sub-component of "Supervised Learning - Regression and Classification".**

![Simplilearn_Logo](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/Logo_Powered_By_Simplilearn/SL_Logo_1.png)