## Count or frequency encoding

### High Cardinality

Variables that have a large number of distinct categories are said to have high cardinality.

If a categorical variable has many labels, that is, high cardinality, then one-hot encoding it expands the feature space dramatically, because every label becomes its own column.
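
To make the expansion concrete, here is a minimal sketch on a toy column (the data and the `df` and `one_hot` names are made up for illustration and are not part of the dataset used below). One-hot encoding yields one column per distinct label:

```python
import pandas as pd

# Hypothetical toy column with 5 distinct labels spread over 6 rows.
df = pd.DataFrame({"Country": ["US", "Mexico", "Cuba", "India", "Peru", "US"]})

# One-hot encoding creates one indicator column per distinct label.
one_hot = pd.get_dummies(df["Country"])

print("distinct labels:", df["Country"].nunique())
print("one-hot columns:", one_hot.shape[1])
```

With a few dozen labels the same call would add a few dozen columns for a single variable.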

One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable with its count, that is, the number of times the label appears in the dataset, or with its frequency, that is, the proportion of observations that carry that label.
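
The difference between the two variants can be sketched on a toy column (the data and the `count_map` and `freq_map` names are illustrative assumptions). `value_counts()` gives the counts, and `value_counts(normalize=True)` gives the proportions:

```python
import pandas as pd

# Toy column standing in for a categorical feature (illustrative data only).
df = pd.DataFrame({"Country": ["US", "US", "US", "Mexico", "Mexico", "Cuba"]})

# Count encoding: number of rows in which each label appears.
count_map = df["Country"].value_counts().to_dict()

# Frequency encoding: share of rows in which each label appears.
freq_map = df["Country"].value_counts(normalize=True).to_dict()

# Replace each label by its count or by its frequency.
df["Country_count"] = df["Country"].map(count_map)
df["Country_freq"] = df["Country"].map(freq_map)
print(df)
```

Both variants keep a single column per variable; the frequency version is simply the count divided by the number of rows.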

In [1]:
import pandas as pd

In [2]:
# the file has no header row, so pandas numbers the columns 0..14
train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None, index_col=None)
train_set.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [3]:
# get all categorical features
cat_features = train_set.select_dtypes(include=['object'])
cat_features.head()

| | 1 | 3 | 5 | 6 | 7 | 8 | 9 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | United-States | <=50K |
| 1 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | United-States | <=50K |
| 2 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | United-States | <=50K |
| 3 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | United-States | <=50K |
| 4 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | Cuba | <=50K |


In [4]:
# give each categorical feature a descriptive name
cat_features.columns = ['Employment', 'Degree', 'Status', 'Designation', 'family_job', 'Race', 'Sex', 'Country', 'Salary']
cat_features.head()

| | Employment | Degree | Status | Designation | family_job | Race | Sex | Country | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | United-States | <=50K |
| 1 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | United-States | <=50K |
| 2 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | United-States | <=50K |
| 3 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | United-States | <=50K |
| 4 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | Cuba | <=50K |


In [5]:
# replace the '?' placeholder literally (regex=False); assign() returns a new frame, which avoids SettingWithCopyWarning
cat_features = cat_features.assign(Country=cat_features['Country'].str.replace('?', 'Unknown', regex=False))


In [6]:
# count the distinct labels in each categorical feature
for col in cat_features.columns:
    print(col, ':', cat_features[col].nunique(), ' labels')

Employment : 9  labels
Degree : 16  labels
Status : 7  labels
Designation : 15  labels
family_job : 6  labels
Race : 5  labels
Sex : 2  labels
Country : 42  labels
Salary : 2  labels


In [7]:
# build a dictionary that maps each Country label to the number of rows in which it appears
country_map = cat_features.Country.value_counts().to_dict()
country_map

{' United-States': 29170,
 ' Mexico': 643,
 ' Unknown': 583,
 ' Philippines': 198,
 ' Germany': 137,
 ' Canada': 121,
 ' Puerto-Rico': 114,
 ' El-Salvador': 106,
 ' India': 100,
 ' Cuba': 95,
 ' England': 90,
 ' Jamaica': 81,
 ' South': 80,
 ' China': 75,
 ' Italy': 73,
 ' Dominican-Republic': 70,
 ' Vietnam': 67,
 ' Guatemala': 64,
 ' Japan': 62,
 ' Poland': 60,
 ' Columbia': 59,
 ' Taiwan': 51,
 ' Haiti': 44,
 ' Iran': 43,
 ' Portugal': 37,
 ' Nicaragua': 34,
 ' Peru': 31,
 ' Greece': 29,
 ' France': 29,
 ' Ecuador': 28,
 ' Ireland': 24,
 ' Hong': 20,
 ' Trinadad&Tobago': 19,
 ' Cambodia': 19,
 ' Thailand': 18,
 ' Laos': 18,
 ' Yugoslavia': 16,
 ' Outlying-US(Guam-USVI-etc)': 14,
 ' Honduras': 13,
 ' Hungary': 13,
 ' Scotland': 12,
 ' Holand-Netherlands': 1}

In [8]:
# replace each Country label with its count from country_map
cat_features['Country'] = cat_features['Country'].map(country_map)
cat_features.head(10)


| | Employment | Degree | Status | Designation | family_job | Race | Sex | Country | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | 29170 | <=50K |
| 1 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | 29170 | <=50K |
| 2 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | 29170 | <=50K |
| 3 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 29170 | <=50K |
| 4 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 95 | <=50K |
| 5 | Private | Masters | Married-civ-spouse | Exec-managerial | Wife | White | Female | 29170 | <=50K |
| 6 | Private | 9th | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 81 | <=50K |
| 7 | Self-emp-not-inc | HS-grad | Married-civ-spouse | Exec-managerial | Husband | White | Male | 29170 | >50K |
| 8 | Private | Masters | Never-married | Prof-specialty | Not-in-family | White | Female | 29170 | >50K |
| 9 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | 29170 | >50K |
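
When the same mapping is applied to data that was not used to build it, such as a test set, labels that never occurred in the training data have no entry in the dictionary and `map` returns `NaN` for them. The sketch below uses hypothetical toy frames named `train` and `test`; filling unseen labels with 0 is one possible convention and an assumption here, not something the notebook prescribes.

```python
import pandas as pd

# Hypothetical train/test frames; the values are illustrative only.
train = pd.DataFrame({"Country": ["US", "US", "Mexico", "Cuba"]})
test = pd.DataFrame({"Country": ["US", "Cuba", "Japan"]})  # "Japan" never appears in train

# Learn the counts from the training data only.
country_map = train["Country"].value_counts().to_dict()

# Labels missing from the dictionary become NaN; here they are filled with 0 (an assumed convention).
test["Country_count"] = test["Country"].map(country_map).fillna(0).astype(int)
print(test)
```

Learning the counts on the training data alone keeps information from the test set out of the encoding.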


#### Advantages

<li>It is very simple to implement.</li>
<li>It does not increase the dimensionality of the feature space.</li>

#### Disadvantages

<li>Labels that occur equally often are replaced with the same number, so the distinction between them is lost. In the mapping above, for example, Cambodia and Trinadad&amp;Tobago both become 19.</li>
<li>It assigns somewhat arbitrary numbers, and therefore weights, to the labels; these numbers may be unrelated to the labels' predictive power.</li>
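
The first disadvantage can be checked before encoding by looking for labels that share a count. The sketch below uses a handful of entries copied from the `country_map` output above (leading spaces stripped for readability); the `labels_by_count` and `collisions` names are illustrative.

```python
from collections import defaultdict

# A few entries taken from the country_map output shown earlier (leading spaces stripped).
country_map = {
    "France": 29, "Greece": 29, "Ecuador": 28,
    "Cambodia": 19, "Trinadad&Tobago": 19,
    "Thailand": 18, "Laos": 18,
}

# Group the labels by their count.
labels_by_count = defaultdict(list)
for label, count in country_map.items():
    labels_by_count[count].append(label)

# Keep only the counts shared by more than one label: these labels become indistinguishable.
collisions = {count: labels for count, labels in labels_by_count.items() if len(labels) > 1}
print(collisions)
```

Every group in `collisions` is a set of categories that the encoded feature can no longer tell apart.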