<h1 align = center> 100 Days of Machine Learning - Day 6 </h1>

100 Days of Machine Learning is a tech challenge where the participants spend 100 consecutive days studying, learning and coding machine learning concepts. It involves dedicating a certain amount of time each day to ML-related activities, such as reading books, watching tutorials, completing online courses, working on projects, or participating in coding exercises. The goal is to develop a consistent learning habit and make significant progress in ML skills over the course of 100 days.

# Table of Contents
- Feature Selection
- Accuracy
- Precision
- Recall
- F1 score
- Precision vs Recall Tradeoff

# Feature Selection

Any real-life dataset comprises various features and fields that describe how one entity relates to another. In most datasets, especially large ones, we observe that not all features or fields are strongly correlated (positively or negatively) with each other, or at least with the quantity we are trying to measure or predict (the target variable). Including every feature present in a dataset poses a few problems:

- High Dimensionality : Dealing with a large number of features means that the dimensionality of the dataset is increased. A dataset being "high-dimensional" means that there are a large number of input features and variables that the model has to consider, analyse and optimize its predictions against. When a dataset becomes overly high-dimensional, the model becomes very complex and computationally expensive. Therefore it is important to select only the features that are correlated with the target variable.

    A frequently observed instance of high-dimensional data is image data. Images consist of thousands or even millions of pixels, and each pixel can be treated as an input feature. Processing all of that data in a final model can be very computationally expensive, hence it is important to first detect and separate features from the images using techniques like filters.

- Model Performance : If a dataset contains many features, and a number of those features are poorly correlated with the target variable or contain noisy/faulty data, then the model trained on the dataset can be subject to performance problems like overfitting, which occurs when the model learns irrelevant noise and patterns, leading to very good performance on training data but poor generalization to unseen data.

    Selecting specific features that are relevant, mathematically correlated or otherwise appropriate for training the model reduces the amount of noise and faults present in the input dataset. This reduces overfitting and improves the performance of the model.

    Moreover, if the data is extremely high-dimensional because of a large number of features, the computational cost of the model can grow very quickly. This slows down training and makes iterative learning harder. Many features that may or may not be interrelated have to be computed, and with each added data point the cost of resolving distances, relations or densities increases.

- Interpretability and Complexity : A large number of features makes a model very complex because of high dimensionality, which also means it may be difficult to interpret and understand the model's behaviour. Most unmodified and uncleaned datasets contain irrelevant features and noise, which, when used to train a model, may produce misleading results as the model becomes more and more susceptible to overfitting.

    When we do feature selection, we understand directly which features are correlated with the target variable and how they may be related to it. By isolating the relevant features we gain more insight into how the model operates internally, making the black box system of an ML model more transparent and easier to interpret.

    Since feature selection reduces complexity, representing the workings of a model also becomes simpler. This allows for easier and more intuitive explanations and understanding of models.

    If no feature selection is done on the input dataset, the result is often *a loss in detail*: we lose track of which features are directly correlated with the target variable(s). The model may pay less attention to the important features, and its interpretability decreases because irrelevant features can contribute misleading information.

## Methods in feature selection

There are several feature selection methods that can be employed for various scenarios in machine learning. Feature selection aims to find the best representative features for a given problem or for dealing with a target variable. Feature selection techniques exist for both supervised and unsupervised methods.

## Filter Methods

Filter methods are a series of statistics-based feature selection methods which gauge the relevance of each feature based on its individual characteristics or statistical properties. They are computationally efficient and can be applied while preprocessing the data.

There are various types of filter methods:

### Correlation Methods
This method measures the correlation between the target variable and the features. This method was explored on [Day 2](https://github.com/snowclipsed/100daysofml/blob/main/Day%202/Day%202.ipynb) of 100DoMLC. 
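
As a minimal sketch of the idea, the snippet below ranks features by the absolute value of their Pearson correlation with the target. The helper `pearson`, the feature names `size` and `noise`, and all the numbers are hypothetical and chosen only for illustration; this is not the code from Day 2.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: "size" moves with the target, "noise" does not.
features = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, 1.0, 2.0, 1.0, 2.0],
}
target = [2.0, 4.0, 6.0, 8.0, 10.0]

# Score every feature against the target, then rank by correlation strength.
scores = {name: pearson(col, target) for name, col in features.items()}
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
print(ranked)
```

Ranking by the absolute value keeps strongly negative correlations as well, since they are just as informative as strongly positive ones.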

--- 
### Information Gain Method

Information Gain measures the reduction in entropy obtained by splitting or transforming the data on a feature, i.e. it tells us the amount of information that a feature provides about a target variable. Entropy measures uncertainty, so when a feature supplies information about the target, the entropy is lowered. Information in the context of a machine learning dataset can also be described as how "surprising" an encountered data point is to the model: if it is a high-probability event and therefore less surprising, the model learns that what it already knows still stands; if it is a low-probability event and therefore more surprising than expected, the model may have to adjust to account for this newly gained information about the data point.

Measuring the amount of information gained tells us how strongly a feature affects the target variable, especially in the context of decision tree algorithms, where the feature with the highest information gain is chosen as the feature to split on at each node.

In a binary classification problem with outputs 0 and 1, we can measure entropy with the formula: $$ Entropy = -\left( p(0) \log p(0) + p(1) \log p(1) \right) $$









If a dataset has a perfectly balanced probability distribution, it has a high entropy because the surprise factor of getting any given value is the highest in such a distribution. However, if the probability distribution is skewed in one direction, more values are concentrated in the region where the probability is highest, so the outcome is easier to predict and the entropy is lower.
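
To make the balanced-versus-skewed comparison concrete, here is a small sketch of the binary entropy, written as -(p(0) log p(0) + p(1) log p(1)). The function name `binary_entropy` and the choice of a base-2 logarithm (entropy in bits) are assumptions made for this illustration.

```python
import math

def binary_entropy(p1):
    """Entropy in bits of a binary variable where P(1) = p1 and P(0) = 1 - p1."""
    if p1 in (0.0, 1.0):
        # A certain outcome carries no surprise; also avoids log(0).
        return 0.0
    p0 = 1.0 - p1
    return -(p0 * math.log2(p0) + p1 * math.log2(p1))

balanced = binary_entropy(0.5)  # both classes equally likely
skewed = binary_entropy(0.9)    # one class dominates

print(balanced, skewed)
```

The balanced case gives exactly 1 bit, the maximum for two classes, while the skewed case gives roughly 0.47 bits.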

Information gain uses entropy to measure the purity, or skewedness, of a dataset. A purer dataset is one whose data points are concentrated in a single class, so its probability distribution is more skewed and its entropy is lower.

Suppose we have a dataset "S" and a random variable "a"

then Information Gain :

$$ IG (S,a) = H(S) - H(S | a) $$

where IG(S, a) is the information gain for the dataset S from the variable a, and H(S) is the entropy of the dataset before it is split. Meanwhile, H(S | a) is the conditional entropy of the dataset once the variable "a" is taken into consideration. H(S | a) is calculated by taking, for each value v of a, the ratio of the number of examples where a has the value v to the total number of data points, multiplying it by the entropy of those examples, and summing the result over all values v.

$$ H(S | a) = \sum_{v} \frac{|S_a(v)|}{|S|} * H(S_a(v)) $$
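
The two formulas can be sketched in a few lines of plain Python. The function names `entropy` and `information_gain`, the base-2 logarithm, and the toy lists `sex`, `coin` and `survived` are all assumptions made for this illustration, not values from a real dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S): entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(S, a) = H(S) - H(S | a) for a categorical feature a."""
    total = len(labels)
    # Group the labels by the value v that the feature takes.
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    # H(S | a): entropy of each subset, weighted by the subset's share of S.
    conditional = sum(len(sub) / total * entropy(sub)
                      for sub in subsets.values())
    return entropy(labels) - conditional

# Hypothetical toy data.
survived = [0, 0, 1, 1]
sex = ["m", "m", "f", "f"]    # separates the classes perfectly
coin = ["h", "t", "h", "t"]   # tells us nothing about the classes

ig_sex = information_gain(sex, survived)
ig_coin = information_gain(coin, survived)
print(ig_sex, ig_coin)
```

A feature that splits the labels into pure subsets recovers the whole entropy of the dataset, while a feature unrelated to the labels gains nothing.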




---
### Chi Squared Test Method

This method is valid for selecting categorical features in categorical datasets. We apply the Chi-Square method to each categorical feature and pick the features which have high chi-squared scores.

The Chi-Squared test is based on two hypotheses: the null hypothesis, which assumes that a feature has no association with the target variable, and the alternate hypothesis, which states that the feature has some association with the target feature/variable. The Chi-square formula works on the frequencies, or occurrences, of data points in the dataset, so it works well with frequency-based data.


After defining our two hypotheses, we draw a contingency table with the observed frequencies/values of the features. Next, we find the expected value of each cell according to our first hypothesis, the null hypothesis.

$$
P(A \cap B) = P(A) * P(B)
$$
or
$$
\text{Expected Frequency} = \frac{\text{Row Total} * \text{Column Total}}{\text{Grand Total}}
$$
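
As a rough sketch of the expected-frequency formula, the function below fills in the expected count of every cell from the row totals, column totals and grand total of a contingency table. The function name `expected_frequencies` and the 2x2 table of counts are hypothetical and used only for illustration.

```python
def expected_frequencies(observed):
    """Expected counts under the null hypothesis of independence.

    Each cell is row_total * column_total / grand_total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical observed counts: rows are feature categories,
# columns are target categories.
observed = [
    [30, 10],
    [20, 40],
]

expected = expected_frequencies(observed)
print(expected)
```

Here the row totals are 40 and 60, the column totals are 50 and 50, and the grand total is 100, so the expected table is `[[20.0, 20.0], [30.0, 30.0]]`.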




Now let's import a dataset to perform the chi-squared method on :

In [1]:
import numpy as np 
import pandas as pd 
import scipy.stats as stats

In [14]:
chi_data = pd.read_csv("data/titanic.csv")

In [15]:
chi_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [16]:
chi_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [28]:
p_first_saved = chi_data.loc[(chi_data['Pclass'] == 1) & (chi_data['Survived'] == 1)]
p_first_lost = chi_data.loc[(chi_data['Pclass'] == 1) & (chi_data['Survived'] == 0)]
p_second_saved = chi_data.loc[(chi_data['Pclass'] == 2) & (chi_data['Survived'] == 1)]
p_second_lost = chi_data.loc[(chi_data['Pclass'] == 2) & (chi_data['Survived'] == 0)]
p_third_saved = chi_data.loc[(chi_data['Pclass'] == 3) & (chi_data['Survived'] == 1)]
p_third_lost = chi_data.loc[(chi_data['Pclass'] == 3) & (chi_data['Survived'] == 0)]

In [34]:
len(p_first_saved.index), len(p_first_lost)

(50, 57)
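
The six boolean filters above build the observed counts one cell at a time. As a possible shortcut, `pd.crosstab` can tabulate every class/outcome pair in one call. The sketch below runs on a tiny stand-in DataFrame named `toy` with made-up rows, so its counts are not the Titanic counts; on the real data the same call would take `chi_data['Pclass']` and `chi_data['Survived']`.

```python
import pandas as pd

# Hypothetical stand-in for chi_data, with only the two columns we need.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1],
})

# Rows are passenger classes, columns are outcomes (0 = lost, 1 = saved).
observed = pd.crosstab(toy["Pclass"], toy["Survived"])
print(observed)

# A single cell, equivalent to len(p_first_saved) on the toy data.
first_saved = observed.loc[1, 1]
```

The resulting table has the same layout as the contingency table described earlier, which makes it a convenient input for the expected-frequency step.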