# Missing data, Feature selection

Real-world data is often damaged or incomplete. We need to handle this before modeling, because most data science models cannot work with values that are missing or not numeric. Dealing with missing data properly is crucial for accurate analysis and modeling.
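Before repairing anything, it helps to see where the gaps are. The following is a minimal sketch on a small made-up frame; the column names and values are illustrative assumptions and are not taken from `Data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, np.nan, np.nan],
})

# Count the missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value.
rows_with_gaps = toy[toy.isna().any(axis=1)]
print(rows_with_gaps)
```

Here `Age` has one gap and `Salary` has two, so three of the four rows are affected.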

Feature selection involves choosing a subset of relevant features (variables) from a larger set. The goal is to retain the most informative and impactful features while eliminating redundant or less significant ones. This process helps improve model performance, reduce complexity, and enhance interpretability in data science tasks.
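As a rough illustration of the idea, the sketch below keeps the two highest-scoring features of the iris dataset using `SelectKBest`. The choice of the ANOVA F-test (`f_classif`) and of `k=2` is an assumption made for this example, not a recommendation from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X_iris, y_iris = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score against the class label.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_iris, y_iris)

print(X_selected.shape)        # (150, 2)
print(selector.get_support())  # boolean mask over the 4 original features
```

The boolean mask returned by `get_support()` tells you which of the original columns were retained.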

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

## Imputing missing values using Imputer

Here we impute missing values with the [simplest imputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) in the scikit-learn library, `SimpleImputer`.

In [9]:
df1 = pd.read_csv('Data.csv')
df1.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [10]:
# Replace each missing value (np.nan) with a statistic chosen by `strategy`:
# 'mean', 'median', 'most_frequent' or 'constant'. The statistic is computed per column.

df = df1.copy()  # work on a copy so that df1 keeps the original missing values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df.iloc[:, 1:3] = imputer.fit_transform(df1.iloc[:, 1:3])  # columns 1 and 2: Age and Salary
df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes


**Exercise**: Try other strategies in the `SimpleImputer` and comment on your results.

In [11]:
# Fill each gap with the most frequent value of its column (the mode).
# The imputer is fitted on df1, which still contains the original missing values.

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df.iloc[:, 1:3] = imputer.fit_transform(df1.iloc[:, 1:3])
df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,48000.0,Yes
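
To compare strategies side by side, here is a small sketch on a made-up salary column containing one gap and one outlier. The numbers are hypothetical and only serve to show how the choice of strategy changes the imputed value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical salary column with one missing value and one outlier.
salaries = np.array([[48000.0], [52000.0], [54000.0], [np.nan], [250000.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(salaries)
median_filled = SimpleImputer(strategy="median").fit_transform(salaries)
constant_filled = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(salaries)

print(mean_filled[3, 0])      # 101000.0
print(median_filled[3, 0])    # 53000.0
print(constant_filled[3, 0])  # 0.0
```

The mean is pulled up by the outlier, while the median stays close to the typical salaries, which is one reason to prefer the median for skewed columns.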


## Encoding categorical data  

Encoding is the conversion of categorical data into a numerical format suitable for data science algorithms. This ensures models can process and analyze the data effectively, enhancing their performance. Common methods include one-hot encoding, label encoding, and ordinal encoding.
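
Ordinal encoding is mentioned above but not demonstrated in the cells that follow, so here is a minimal sketch. The `Size` column and its ordering are hypothetical and not part of `Data.csv`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered category; made up for illustration.
sizes = pd.DataFrame({"Size": ["small", "large", "medium", "small"]})

# Passing the categories explicitly fixes the order small < medium < large.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes["Size_Code"] = encoder.fit_transform(sizes[["Size"]]).ravel()
print(sizes)
```

Without the explicit `categories` argument the encoder would sort the values alphabetically, which would not reflect the intended order.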

In [25]:
# LabelEncoder maps each distinct category to an integer (e.g. 'No' -> 0, 'Yes' -> 1).
# OneHotEncoder creates one column per category and puts a 1 in the column of the category that is present.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

**Exercise**: Use the `LabelEncoder()` to transform the `'Country'` column into labels.

In [33]:
label_encoder = LabelEncoder()
df['Country_Labels'] = label_encoder.fit_transform(df['Country'])
df.head()

Unnamed: 0,Country,Age,Salary,Purchased,Country_Labels
0,France,44.0,72000.0,No,0
1,Spain,27.0,48000.0,Yes,2
2,Germany,30.0,54000.0,No,1
3,Spain,38.0,61000.0,No,2
4,Germany,40.0,48000.0,Yes,1


We now use the `OneHotEncoder()` to transform the `'Country'` column into one-hot vectors, with one column per country.

In [34]:
# OneHotEncoder expects a 2-D input, so reshape the Country column to shape (n_samples, 1)
one_hot_encoder = OneHotEncoder()
country = df.iloc[:, 0].values.reshape(-1, 1)
# fit_transform returns a sparse matrix by default; toarray() makes it dense
onehotlabels = one_hot_encoder.fit_transform(country).toarray()

onehotlabels

array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])
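
To see which output column corresponds to which category, you can inspect the fitted encoder's `categories_` attribute. A small sketch on a made-up list of countries (the sample values are an assumption, chosen to mirror the first rows above):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of the Country column.
countries = np.array(["France", "Spain", "Germany", "Spain"]).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(countries).toarray()

# categories_ holds one array per input column, in output-column order.
print(encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(encoded)
```

The categories are sorted, so the first column stands for France, the second for Germany and the third for Spain.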

pandas also has a built-in function, `get_dummies`, that achieves a similar result more concisely.

In [36]:
# get_dummies one-hot encodes every non-numeric column (here Country and Purchased).
# df.iloc[:, :-1] drops the Country_Labels column added earlier;
# dtype=int yields 0/1 integers instead of booleans.
one_hot_df = pd.get_dummies(df.iloc[:, :-1], dtype=int)
one_hot_df

Unnamed: 0,Age,Salary,Country_France,Country_Germany,Country_Spain,Purchased_No,Purchased_Yes
0,44.0,72000.0,1,0,0,1,0
1,27.0,48000.0,0,0,1,0,1
2,30.0,54000.0,0,1,0,1,0
3,38.0,61000.0,0,0,1,1,0
4,40.0,48000.0,0,1,0,0,1
5,35.0,58000.0,1,0,0,0,1
6,27.0,52000.0,0,0,1,1,0
7,48.0,79000.0,1,0,0,0,1
8,50.0,83000.0,0,1,0,1,0
9,37.0,67000.0,1,0,0,0,1


## Binarizing

Sometimes we need to convert a continuous feature into discrete values. Binarizing is the simplest case: each value becomes 0 or 1 depending on whether it exceeds a given threshold.

In [37]:
from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset.data
y = iris_dataset.target
feature_names = iris_dataset.feature_names


Now we'll binarize the sepal width, which is the second column of the `X` matrix: each value becomes 1 if it is above the mean and 0 otherwise.

In [38]:
X[:, 1]

array([3.5, 3. , 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3. ,
       3. , 4. , 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.6, 3.3, 3.4, 3. ,
       3.4, 3.5, 3.4, 3.2, 3.1, 3.4, 4.1, 4.2, 3.1, 3.2, 3.5, 3.6, 3. ,
       3.4, 3.5, 2.3, 3.2, 3.5, 3.8, 3. , 3.8, 3.2, 3.7, 3.3, 3.2, 3.2,
       3.1, 2.3, 2.8, 2.8, 3.3, 2.4, 2.9, 2.7, 2. , 3. , 2.2, 2.9, 2.9,
       3.1, 3. , 2.7, 2.2, 2.5, 3.2, 2.8, 2.5, 2.8, 2.9, 3. , 2.8, 3. ,
       2.9, 2.6, 2.4, 2.4, 2.7, 2.7, 3. , 3.4, 3.1, 2.3, 3. , 2.5, 2.6,
       3. , 2.6, 2.3, 2.7, 3. , 2.9, 2.9, 2.5, 2.8, 3.3, 2.7, 3. , 2.9,
       3. , 3. , 2.5, 2.9, 2.5, 3.6, 3.2, 2.7, 3. , 2.5, 2.8, 3.2, 3. ,
       3.8, 2.6, 2.2, 3.2, 2.8, 2.8, 2.7, 3.3, 3.2, 2.8, 3. , 2.8, 3. ,
       2.8, 3.8, 2.8, 2.8, 2.6, 3. , 3.4, 3.1, 3. , 3.1, 3.1, 3.1, 2.7,
       3.2, 3.3, 3. , 2.5, 3. , 3.4, 3. ])

The mean has the following value:

In [39]:
X[:, 1].mean()

3.0573333333333337

In [40]:
from sklearn.preprocessing import Binarizer
X[:, 1:2] = Binarizer(threshold=X[:, 1].mean()).fit_transform(X[:, 1].reshape(-1, 1))
X[:, 1]

array([1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       1., 1., 0., 1., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0.])
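
For a single threshold, the same result can be obtained with a plain NumPy comparison. The sketch below reloads the iris data (since `X` was modified in place above) and checks that both approaches agree; the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Binarizer

# Reload the data to get the original, unbinarized sepal widths.
sepal_width = load_iris().data[:, 1].reshape(-1, 1)
threshold = sepal_width.mean()

# Binarizer maps values strictly greater than the threshold to 1, the rest to 0.
with_sklearn = Binarizer(threshold=threshold).fit_transform(sepal_width)
with_numpy = (sepal_width > threshold).astype(float)

print(np.array_equal(with_sklearn, with_numpy))  # True
```

`Binarizer` is mainly convenient when the step has to sit inside a scikit-learn pipeline; for a one-off transformation the comparison is enough.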