# Introduction:
### Most datasets contain categorical (non-numeric) variables, and these variables often hold a lot of useful information. Common challenges when dealing with categorical variables:
### 1. A categorical variable may have too many levels, e.g. postal codes, cities, age groups.
### 2. A categorical variable may have many levels that rarely occur.
### 3. A few levels may account for almost all observations.
### 4. Categorical variables are sometimes masked or encoded, which makes their actual meaning hard to decipher.
### 5. Categorical variables cannot be used directly in regressions. They need to be transformed to numbers first.
### 6. Most Python ML libraries expect categorical variables to be converted to a numeric form before modeling.

# Methods to deal with Categorical Variable:
### 1. Convert to number:
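      1.1 Label Encoder: 
          It is used to transform non-numerical labels to numeric labels. For e.g. Male/Female ---> (0/1) or 
          Ratings[good/average/bad] ----> [2/1/0]. Note that scikit-learn's LabelEncoder assigns codes in sorted 
          (alphabetical) label order, so a meaningful order such as bad < average < good is not preserved automatically.
          A common drawback of this approach is that the integer codes imply an order between levels that may not exist. 
          This can degrade model performance, especially when the number of levels is large. For e.g. a "City" variable, 
          which will have many labels. 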
          
      1.2 Convert numeric bins to number: 
          Some variables contain values in the form of bins. For e.g. 'Age' with values like [0-17, 18-25...].
          To work with such range data, we need to convert it to a numerical equivalent. Ways to do that:
          1. Use the label encoder for conversion.
          2. Create a new feature using a representative value (e.g. the mean or mode) of each age bucket. 
          3. Create 2 new features, one for the lower bound of the age bucket and one for the upper bound. 
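
Options 2 and 3 above can be sketched with pandas as follows. The data, the column names (`age`, `age_lower`, `age_upper`, `age_mid`) and the use of the bucket midpoint as a stand-in for the mean are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical age buckets; values and column names are illustrative assumptions.
ages = pd.DataFrame({'age': ['0-17', '18-25', '26-35', '18-25']})

# Split "lower-upper" into two integer columns (option 3).
bounds = ages['age'].str.split('-', expand=True).astype(int)
ages['age_lower'] = bounds[0]
ages['age_upper'] = bounds[1]

# Use the midpoint of each bucket as a single numeric feature (a simple variant of option 2).
ages['age_mid'] = (ages['age_lower'] + ages['age_upper']) / 2
print(ages)
```

Open-ended buckets such as "55+" would need separate handling, since they have no upper bound to parse.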
      
### 2. Combine levels:
      2.1 Using a custom business logic:
          This is usually the most reliable way to deal with levels of a categorical variable that rarely occur. It is 
          wise to combine seemingly similar levels into one group. For e.g. we can group the different age levels into 
          categories for infants/ kids/ teenagers/ adults/ aged, or group zipcodes at the State/ District level. 
          We need to write some business logic for this grouping. 
     
      2.2 Using frequency:
          Combining the levels using a custom business logic is a good strategy if we know the business domain. In 
          situations where we do not have enough domain knowledge, the safest bet is to look at the frequency 
          distribution of the levels and combine all levels that each account for less than 5% of the total observations 
          into a single level. This is effective in handling rare levels.
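
A minimal sketch of the frequency-based approach, assuming a made-up `city` variable and the 5% threshold mentioned above. The level name `'Other'` is also an assumption.

```python
import pandas as pd

# Hypothetical data: 100 observations across four levels.
cities = pd.Series(['A'] * 50 + ['B'] * 45 + ['C'] * 3 + ['D'] * 2, name='city')

# Relative frequency of each level.
freq = cities.value_counts(normalize=True)

# Levels that each account for less than 5% of the observations.
rare_levels = freq[freq < 0.05].index

# Keep frequent levels as they are and replace every rare level with 'Other'.
cities_grouped = cities.where(~cities.isin(rare_levels), 'Other')
print(cities_grouped.value_counts())
```

Here levels `C` and `D` are merged, so the grouped variable has three levels instead of four.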

### 3. Dummy coding:
          It is one of the most common methods for variable encoding, where we assign the value 1 if a specific level 
          is present and 0 otherwise. For e.g. Female(1), Male(0). If the categorical variable has many levels, it is 
          advisable to first reduce the number of levels with the combining methods above and then apply dummy coding. 
          This procedure is commonly referred to as "One Hot encoding". Strictly speaking, one-hot encoding creates one 
          column per level, while dummy coding drops one level as the reference. 
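
As an illustration, both variants can be produced with `pandas.get_dummies`. The small `wind` sample below is made up for this sketch.

```python
import pandas as pd

# Hypothetical sample with three levels.
wind = pd.DataFrame({'wind': ['strong', 'mild', 'normal', 'strong']})

# One 0/1 column per level (one-hot encoding).
one_hot = pd.get_dummies(wind['wind'], prefix='wind', dtype=int)

# drop_first=True omits the first level, leaving k-1 columns (dummy coding).
dummy = pd.get_dummies(wind['wind'], prefix='wind', drop_first=True, dtype=int)

print(one_hot)
print(dummy)
```

With `drop_first=True`, a row of all zeros stands for the dropped reference level (`mild` in this sample).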
        
        
        
        


#  In this notebook, we focus on the Label Encoder. 

# Prerequisites

In [29]:
# Import required packages
from sklearn import preprocessing
import pandas as pd


# Create custom data frame

In [28]:
data = {'day': ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'],
        'obs': [12, 20, 7, 11, 15, 9, 15],
        'wind': ['strong', 'mild', 'normal', 'strong', 'strong', 'normal', 'mild']
       }

df = pd.DataFrame(data, columns=['day', 'obs', 'wind'])
print(df)


   day  obs    wind
0  Sun   12  strong
1  Mon   20    mild
2  Tue    7  normal
3  Wed   11  strong
4  Thu   15  strong
5  Fri    9  normal
6  Sat   15    mild


# Create an instance of the label encoder

In [32]:
lencoder = preprocessing.LabelEncoder()   # Encode labels with value between 0 and n_classes-1.
print(lencoder)

LabelEncoder()


# Fit the Label encoder on the specific column of the pandas dataframe

In [34]:
lencoder.fit(df.wind)   # [OR] lencoder.fit(df['wind'])

LabelEncoder()

# View the labels

In [35]:
list(lencoder.classes_)

['mild', 'normal', 'strong']
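
The position of each label in `classes_` is the integer code it will receive, because the encoder stores the labels in sorted order. A small sketch that makes this mapping explicit (the variable name `mapping` is just for illustration):

```python
from sklearn import preprocessing

lencoder = preprocessing.LabelEncoder()
lencoder.fit(['strong', 'mild', 'normal', 'strong', 'strong', 'normal', 'mild'])

# Each label's index in the sorted classes_ array is its integer code.
mapping = {str(label): code for code, label in enumerate(lencoder.classes_)}
print(mapping)
```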

# Transform the categorical variable into numeric format - transform()

In [40]:
df['wind_transformed'] = lencoder.transform(df['wind'])   # Create a new column with the encoded values

# Check the dataframe

In [41]:
print(df)   # The classes are encoded as 0, 1, 2

   day  obs    wind  wind_transformed
0  Sun   12  strong                 2
1  Mon   20    mild                 0
2  Tue    7  normal                 1
3  Wed   11  strong                 2
4  Thu   15  strong                 2
5  Fri    9  normal                 1
6  Sat   15    mild                 0
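
The codes above follow the sorted order of the labels, which here happens to match mild < normal < strong. When the intended order is not alphabetical, as in the ratings example from the introduction, one option is to state the order explicitly. The sketch below uses `pandas.Categorical` instead of the label encoder. The data and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ratings with a meaningful order: bad < average < good.
ratings = pd.Series(['good', 'average', 'bad', 'good'])
order = ['bad', 'average', 'good']

# Codes follow the position of each label in `order`, not alphabetical order.
ratings_encoded = pd.Categorical(ratings, categories=order, ordered=True).codes.tolist()
print(ratings_encoded)
```

A label encoder fitted on the same data would instead assign average=0, bad=1, good=2.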


# Reverse transformation - i.e. numeric to categorical - inverse_transform()

In [44]:
import warnings
warnings.filterwarnings("ignore")   # Suppress warnings to keep the output clean (note: this hides all warnings)
df['wind_reverse_transformed'] = lencoder.inverse_transform(df['wind_transformed'])
print(df)

   day  obs    wind  wind_transformed wind_reverse_transformed
0  Sun   12  strong                 2                   strong
1  Mon   20    mild                 0                     mild
2  Tue    7  normal                 1                   normal
3  Wed   11  strong                 2                   strong
4  Thu   15  strong                 2                   strong
5  Fri    9  normal                 1                   normal
6  Sat   15    mild                 0                     mild
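
One point to keep in mind when reusing a fitted encoder: `transform()` only knows the labels seen during `fit()`, and raises a `ValueError` for anything else. A short sketch, where the label `'gusty'` and the flag `unseen_ok` are made up for the demonstration:

```python
from sklearn import preprocessing

lencoder = preprocessing.LabelEncoder()
lencoder.fit(['strong', 'mild', 'normal'])

try:
    lencoder.transform(['gusty'])   # 'gusty' was not present during fit
    unseen_ok = True
except ValueError as err:
    unseen_ok = False
    print('Cannot encode unseen label:', err)
```

In practice this means the encoder should be fitted on data that contains every level expected later, or rare and new levels should first be combined into a common level.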
