# Feature Selection with Categorical Data

## The two most commonly used feature selection methods for categorical input data when the target variable is also categorical (e.g. classification predictive modeling) are the chi-squared statistic and the mutual information statistic.

https://machinelearningmastery.com/feature-selection-with-categorical-data/
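
As a rough preview of where this notebook is heading, the sketch below scores two already-encoded toy features with both statistics. The arrays `X_demo` and `y_demo` are made up for illustration and are not taken from the breast cancer data; `discrete_features=True` is passed on the assumption that every column is categorical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# hypothetical ordinal-encoded inputs: column 0 follows the target, column 1 mostly does not
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
y_demo = np.array([0, 0, 1, 1, 0, 1])

# keep the single feature with the highest chi-squared score
selector = SelectKBest(score_func=chi2, k=1)
selector.fit(X_demo, y_demo)
print('chi2 scores:', selector.scores_)
print('selected mask:', selector.get_support())

# mutual information, treating every column as discrete
mi_scores = mutual_info_classif(X_demo, y_demo, discrete_features=True)
print('mutual information:', mi_scores)
```

Both scorers expect numeric input, which is why the rest of the notebook concentrates on encoding the categorical columns first.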

In [27]:
import pandas as pd 
# load the dataset as a pandas DataFrame
data = pd.read_csv('D:\\Python\\Data Science\\scikit-learn-videos-master\\breast_cancer_data1980.csv', header=None)
# retrieve the underlying numpy array
dataset = data.values

print(dataset)
data.head(50)

[["'40-49'" "'premeno'" "'15-19'" ... "'left_up'" "'no'"
  "'recurrence-events'"]
 ["'50-59'" "'ge40'" "'15-19'" ... "'central'" "'no'"
  "'no-recurrence-events'"]
 ["'50-59'" "'ge40'" "'35-39'" ... "'left_low'" "'no'"
  "'recurrence-events'"]
 ...
 ["'30-39'" "'premeno'" "'30-34'" ... "'right_up'" "'no'"
  "'no-recurrence-events'"]
 ["'50-59'" "'premeno'" "'15-19'" ... "'left_low'" "'no'"
  "'no-recurrence-events'"]
 ["'50-59'" "'ge40'" "'40-44'" ... "'right_up'" "'no'"
  "'no-recurrence-events'"]]


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | '40-49' | 'premeno' | '15-19' | '0-2' | 'yes' | '3' | 'right' | 'left_up' | 'no' | 'recurrence-events' |
| 1 | '50-59' | 'ge40' | '15-19' | '0-2' | 'no' | '1' | 'right' | 'central' | 'no' | 'no-recurrence-events' |
| 2 | '50-59' | 'ge40' | '35-39' | '0-2' | 'no' | '2' | 'left' | 'left_low' | 'no' | 'recurrence-events' |
| 3 | '40-49' | 'premeno' | '35-39' | '0-2' | 'yes' | '3' | 'right' | 'left_low' | 'yes' | 'no-recurrence-events' |
| 4 | '40-49' | 'premeno' | '30-34' | '3-5' | 'yes' | '2' | 'left' | 'right_up' | 'no' | 'recurrence-events' |
| 5 | '50-59' | 'premeno' | '25-29' | '3-5' | 'no' | '2' | 'right' | 'left_up' | 'yes' | 'no-recurrence-events' |
| 6 | '50-59' | 'ge40' | '40-44' | '0-2' | 'no' | '3' | 'left' | 'left_up' | 'no' | 'no-recurrence-events' |
| 7 | '40-49' | 'premeno' | '10-14' | '0-2' | 'no' | '2' | 'left' | 'left_up' | 'no' | 'no-recurrence-events' |
| 8 | '40-49' | 'premeno' | '0-4' | '0-2' | 'no' | '2' | 'right' | 'right_low' | 'no' | 'no-recurrence-events' |
| 9 | '40-49' | 'ge40' | '40-44' | '15-17' | 'yes' | '2' | 'right' | 'left_up' | 'yes' | 'no-recurrence-events' |


In [43]:
# split into input (X) and output (y) variables
X = dataset[:, :-1]
y = dataset[:,-1]

Finally, we force all fields in the input data to be strings, in case pandas inferred a numeric type for any column while loading the file (it does attempt this).

In [12]:
# format all fields as string
X = X.astype(str)
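
To see why the cast can matter, here is a small sketch using a made-up two-column CSV (`csv_text` is hypothetical, not the notebook's file). pandas infers an integer type for the second column, and `astype(str)` turns every field back into text.

```python
import io
import pandas as pd

# hypothetical CSV: the second column looks numeric, so pandas infers an integer dtype
csv_text = "low,1\nhigh,2\nlow,3\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(df.dtypes)

# cast the whole array to strings so every field is treated as a category label
values = df.values.astype(str)
print(values.dtype, values[0])
```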

In [15]:
from sklearn.model_selection import train_test_split

# split the data into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize
print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)


Train (190, 9) (190,)
Test (95, 9) (95,)


### We can use the OrdinalEncoder from scikit-learn to encode each variable as integers. It is a flexible class that allows the order of the categories to be specified through its `categories` argument when such an order is known.
The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.
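
A minimal sketch of both points, using made-up size ranges rather than the notebook's data: the encoder is fitted on the training rows only, and an explicit `categories` list replaces the default alphabetical ordering. The names `size_order`, `X_train_demo`, and `X_test_demo` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# hypothetical single-column data with a natural numeric ordering
size_order = ['0-4', '5-9', '10-14']
X_train_demo = np.array([['10-14'], ['0-4'], ['5-9']])
X_test_demo = np.array([['5-9'], ['0-4']])

# default behaviour: categories are sorted as strings, so '10-14' comes before '5-9'
default_enc = OrdinalEncoder().fit(X_train_demo)
print(default_enc.categories_)

# explicit order: fit on the training data only, then transform both splits
ordered_enc = OrdinalEncoder(categories=[size_order])
ordered_enc.fit(X_train_demo)
train_enc = ordered_enc.transform(X_train_demo)
test_enc = ordered_enc.transform(X_test_demo)
print(train_enc.ravel(), test_enc.ravel())
```

With the explicit list, the integer codes follow the numeric order of the ranges instead of their string sort order.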

In [19]:
from sklearn.preprocessing import OrdinalEncoder
# example of encoding (a separate variable name avoids overwriting the dataset's X):
enc = OrdinalEncoder()
X_example = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X_example)
print(enc.categories_)
enc.transform([['Female', 3], ['Male', 1]])


[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]


array([[0., 2.],
       [1., 0.]])

In [20]:
from sklearn.preprocessing import OrdinalEncoder   # ordinal: the categories have a meaningful order (e.g. years of schooling)
# prepare input data
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()     # instantiate OrdinalEncoder to encode categorical features as an integer array
    oe.fit(X_train)           # learn the categories from the training data only
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

## Prepare the target variable: map the two class labels to 0 and 1 using LabelEncoder
This is a type of ordinal encoding, and scikit-learn provides the LabelEncoder class specifically for this purpose. We could just as easily use the OrdinalEncoder and get the same result, although the LabelEncoder is designed for encoding a single variable.
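
The equivalence can be checked on a small made-up target vector (`y_demo` is hypothetical). The main practical difference in this sketch is shape: LabelEncoder takes a 1-D array, while OrdinalEncoder expects a 2-D array and returns floats.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# hypothetical target with the same two class labels as the dataset
y_demo = np.array(['recurrence-events', 'no-recurrence-events',
                   'no-recurrence-events', 'recurrence-events'])

# LabelEncoder works directly on the 1-D target
le = LabelEncoder()
y_le = le.fit_transform(y_demo)

# OrdinalEncoder needs a column vector, so reshape going in and flatten coming out
oe = OrdinalEncoder()
y_oe = oe.fit_transform(y_demo.reshape(-1, 1)).ravel()

print(le.classes_)
print(y_le, y_oe)
```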

In [57]:
import numpy as np
np.unique(X_train[: ,2], return_counts=True)

(array(["'0-4'", "'10-14'", "'15-19'", "'20-24'", "'25-29'", "'30-34'",
        "'35-39'", "'40-44'", "'45-49'", "'5-9'", "'50-54'"], dtype='<U11'),
 array([ 6, 23, 20, 26, 37, 46, 10, 11,  3,  4,  4], dtype=int64))

In [None]:
ordi = ["'0-4'", "'5-9'", "'10-14'", "'15-19'", "'20-24'", "'25-29'", "'30-34'",
        "'35-39'", "'40-44'", "'45-49'", "'50-54'"]  # categories of column 2 listed in their natural numeric order

In [60]:
# list the unique values of each column to see which ones need an explicit category order before encoding
for a in range(9):
    print(a, ":", np.unique(X_train[:, a]))


0 : ["'20-29'" "'30-39'" "'40-49'" "'50-59'" "'60-69'" "'70-79'"]
1 : ["'ge40'" "'lt40'" "'premeno'"]
2 : ["'0-4'" "'10-14'" "'15-19'" "'20-24'" "'25-29'" "'30-34'" "'35-39'"
 "'40-44'" "'45-49'" "'5-9'" "'50-54'"]
3 : ["'0-2'" "'12-14'" "'15-17'" "'3-5'" "'6-8'" "'9-11'"]
4 : ["'no'" "'yes'" 'nan']
5 : ["'1'" "'2'" "'3'"]
6 : ["'left'" "'right'"]
7 : ["'central'" "'left_low'" "'left_up'" "'right_low'" "'right_up'" 'nan']
8 : ["'no'" "'yes'"]


In [21]:
from sklearn.preprocessing import LabelEncoder
# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

In [56]:

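# encode column 2 using the explicit category order stored in the list `ordi`
col2_encoder = OrdinalEncoder(categories=[ordi])
col2_encoder.fit(X_train[:, [2]])   # double brackets keep the column 2-D, as the encoder requires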

In [47]:
np.unique(X_train[: ,3])

array(["'0-2'", "'12-14'", "'15-17'", "'3-5'", "'6-8'", "'9-11'"],
      dtype='<U11')

In [54]:
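# note: without quotes, 24-26 is evaluated as integer subtraction (-2), not as the category "'24-26'"
np.unique(24-26, return_index=True)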

(array([-2]), array([0], dtype=int64))

In [55]:
# call these functions to prepare our data:
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

ValueError: Found unknown categories ["'24-26'"] in column 3 during transform
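
The error arises because the value `'24-26'` appears in the test split but was never seen when the encoder was fitted on the training split. One possible way around it, sketched below on made-up data, is to ask OrdinalEncoder to map unseen categories to a placeholder. This assumes scikit-learn 0.24 or later, where the `handle_unknown='use_encoded_value'` option is available; the arrays `train_col` and `test_col` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# hypothetical column: the test split contains a category missing from the training split
train_col = np.array([["'0-2'"], ["'3-5'"], ["'6-8'"]])
test_col = np.array([["'3-5'"], ["'24-26'"]])

# unseen categories are encoded as -1 instead of raising a ValueError
tolerant_enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
tolerant_enc.fit(train_col)
test_encoded = tolerant_enc.transform(test_col)
print(test_encoded)
```

A negative placeholder would be rejected by scikit-learn's `chi2` scorer, which requires non-negative inputs, so a different placeholder or a different handling strategy may be needed before the chi-squared step.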