# 4 - Feature Construction

## Introduction:

This project focuses on constructing new features for the dataset, working primarily with categorical data, so that the models have more useful information to learn from. The aim is to build features that improve the model's performance over the baseline.


## Breakdown of this Project:
- Examining the dataset.
- Applying imputation techniques to categorical features.
- Applying encoding techniques to categorical variables.
- Extending numerical features.
- Text-specific feature construction.


## Requirements:



## 1 - Taking a look at the Dataset:

The dataset here is self-created to show a variety of data levels and types.

In this section, the dataset is explored using the attributes and methods of the Pandas DataFrame.

In [1]:
# Import the Required Libraries:
import pandas as pd

In [2]:
# Set up the Dataset:
X_data = pd.DataFrame({'city':['tokyo', None, 'london', 'seattle', 'san francisco', 'tokyo'],
                       'boolean':['yes', 'no', None, 'no', 'no', 'yes'], 
                       'ordinal_column':['somewhat like', 'like', 'somewhat like', 'like', 
                                         'somewhat like', 'dislike'], 
                       'quantitative_column':[1, 11, -.5, 10, None, 20]})

In [3]:
X_data.head()

| | city | boolean | ordinal_column | quantitative_column |
|---|---|---|---|---|
| 0 | tokyo | yes | somewhat like | 1.0 |
| 1 | | no | like | 11.0 |
| 2 | london | | somewhat like | -0.5 |
| 3 | seattle | no | like | 10.0 |
| 4 | san francisco | no | somewhat like | |


The table above consists of the following columns:
- boolean: binary categorical data (e.g. yes or no), at the nominal level.
- city: categorical data, at the nominal level.
- ordinal_column: ordinal data, at the ordinal level.
- quantitative_column: numeric values (stored as floats, since the column holds -0.5 and a missing value), at the ratio level.
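As a quick check of the column types described above, the following is a minimal sketch that separates the text columns from the numeric ones and lists the distinct categories in each. The variable names `categorical_cols`, `numeric_cols` and `distinct_values` are introduced here for illustration only.

```python
import pandas as pd

# Same toy dataset as above, rebuilt so the snippet runs on its own.
X_data = pd.DataFrame({
    'city': ['tokyo', None, 'london', 'seattle', 'san francisco', 'tokyo'],
    'boolean': ['yes', 'no', None, 'no', 'no', 'yes'],
    'ordinal_column': ['somewhat like', 'like', 'somewhat like', 'like',
                       'somewhat like', 'dislike'],
    'quantitative_column': [1, 11, -.5, 10, None, 20],
})

# Split the columns into non-numeric (categorical) and numeric groups:
categorical_cols = X_data.select_dtypes(exclude='number').columns.tolist()
numeric_cols = X_data.select_dtypes(include='number').columns.tolist()

# Distinct non-missing values in each categorical column:
distinct_values = {col: sorted(X_data[col].dropna().unique())
                   for col in categorical_cols}

print(categorical_cols)
print(numeric_cols)
print(distinct_values)
```

Note that pandas cannot tell nominal from ordinal data by itself; that distinction (for example, the ordering of dislike, somewhat like, like) has to come from knowledge of the data.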

## 2 - Perform Imputation on the Categorical Features:

With the understanding of the dataset outlined above, this section will go through the imputation process.

### 2.1 - Dealing with Missing Values:

In [4]:
# Find missing values in the dataset:
X_data.isnull().sum()

city                   1
boolean                1
ordinal_column         0
quantitative_column    1
dtype: int64

The output shows 3 missing values in total, one each in the city, boolean and quantitative_column columns. These values will require imputation.
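Beyond the raw counts, it can help to see the share of missing values per column and which rows are affected. The following is a small sketch under that assumption; `missing_share` and `rows_with_missing` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Same toy dataset as above, rebuilt so the snippet runs on its own.
X_data = pd.DataFrame({
    'city': ['tokyo', None, 'london', 'seattle', 'san francisco', 'tokyo'],
    'boolean': ['yes', 'no', None, 'no', 'no', 'yes'],
    'ordinal_column': ['somewhat like', 'like', 'somewhat like', 'like',
                       'somewhat like', 'dislike'],
    'quantitative_column': [1, 11, -.5, 10, None, 20],
})

# Fraction of missing entries per column (the mean of a boolean mask):
missing_share = X_data.isnull().mean()

# Rows that contain at least one missing value:
rows_with_missing = X_data[X_data.isnull().any(axis=1)]

print(missing_share)
print(rows_with_missing.index.tolist())
```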

Next, a custom transformer will be implemented to impute the missing values in a column.

#### For the "City" Column:

As mentioned, this column is categorical, so the imputation strategy is to fill the missing values with the most common category.

In [5]:
# Find the most common category in this column:
X_data['city'].value_counts().index[0]

'tokyo'

Here, Tokyo is the most frequent category. The next step is to impute the missing row.

In [6]:
X_data['city'].fillna(value=X_data['city'].value_counts().index[0])

0            tokyo
1            tokyo
2           london
3          seattle
4    san francisco
5            tokyo
Name: city, dtype: object
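An equivalent way to find the most common category is `Series.mode()`. The sketch below rebuilds only the city column (an assumption made to keep it self-contained) and assigns the filled result to a new variable, since `fillna` returns a new Series rather than changing the original.

```python
import pandas as pd

# The city column from the dataset above:
city = pd.Series(['tokyo', None, 'london', 'seattle', 'san francisco', 'tokyo'],
                 name='city')

# mode() ignores missing values by default and returns a Series
# (there can be several modes), so take the first entry:
most_common = city.mode()[0]

# fillna returns a new Series; keep the result by assigning it:
city_filled = city.fillna(most_common)

print(most_common)
print(city_filled.tolist())
```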

Now that the imputation has worked, the next step is to handle the remaining categorical columns. Note that fillna returns a new Series here, so X_data itself is unchanged. To impute several columns at once, a custom imputer will be built.

### 2.2 - Build Custom Imputer:

A pipeline is an assembly of steps (transformations) that can be cross-validated together while allowing different parameters to be set.

Building a pipeline allows for the following:
1. It enables a sequential list of transformations to be applied before a final estimator.
2. Each intermediate step of the pipeline is a "transform" (an object with fit and transform methods).
3. The last step is the final estimator (which only needs a fit method).

The pipeline will hold a transformer for each of the columns that require imputing, so that the dataset can be passed in and transformed in one go.
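The three points above can be sketched with built-in scikit-learn components. This is only an illustration on hypothetical toy data (the arrays `X` and `y` and the step names are assumptions, not part of this project): two transformers are followed by a final estimator, and a step's parameter is changed through the `<step name>__<parameter>` convention.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature with a missing value, and a target.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

pipe = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),  # transformer: fit + transform
    ('scale', StandardScaler()),                 # transformer: fit + transform
    ('model', LinearRegression()),               # final estimator: fit (+ predict)
])

# Parameters of a step are addressed as <step name>__<parameter>:
pipe.set_params(impute__strategy='median')

# fit runs fit_transform on each transformer in order, then fits the estimator:
pipe.fit(X, y)
predictions = pipe.predict(X)
print(predictions)
```

The same naming convention is what allows a grid search to tune the parameters of any step while cross-validating the whole pipeline.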

#### Custom Category Imputer:

Here, the "TransformerMixin" class from scikit-learn will be utilised to build the custom categorical imputer. Note that this transformer is only one of the elements in the pipeline; in this case, it deals with categorical data.

In [7]:
# Import the required library:
from sklearn.base import TransformerMixin

In [8]:
# Define the Custom category imputer Class:

class CustomCategoryImputer(TransformerMixin):
    """ Custom Category Imputer that inherits from the TransformerMixin class.
        TransformerMixin supplies the .fit_transform method, which calls .fit and then .transform.
    
    """
    # Initialise one instance attribute, the columns:
    def __init__(self, cols=None):
        self.cols = cols
    
    # Fill the missing column values:
    def transform(self, dataFrame):
        X = dataFrame.copy()
        
        for col in self.cols:
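            # Assign the result back; in-place fillna on a selected column
            # is deprecated in recent pandas versions and may not update X.
            X[col] = X[col].fillna(value=X[col].value_counts().index[0])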
            
        return X
    
    # Fit method, that follows the fit method from scikit-learn:
    def fit(self, *_):
        return self
    

With the above imputer completed, it can be used on the "city" and "boolean" (categorical) columns.

In [9]:
# Apply the custom imputer, instantiate:
cci = CustomCategoryImputer(cols=['city', 'boolean'])

# Fit and transform on the dataset:
cci.fit_transform(X_data)

| | city | boolean | ordinal_column | quantitative_column |
|---|---|---|---|---|
| 0 | tokyo | yes | somewhat like | 1.0 |
| 1 | tokyo | no | like | 11.0 |
| 2 | london | no | somewhat like | -0.5 |
| 3 | seattle | no | like | 10.0 |
| 4 | san francisco | no | somewhat like | |
| 5 | tokyo | yes | dislike | 20.0 |


The missing values in both columns have now been filled.
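The imputer above recomputes the most common category from whatever data is passed to transform. A possible variant, sketched here under the assumption that the fill values should come from the training data only, learns them in fit and reuses them in transform. The class name `FittedCategoryImputer`, the attribute `fill_values_` and the two small frames are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FittedCategoryImputer(TransformerMixin, BaseEstimator):
    """Hypothetical variant: learn fill values in fit, apply them in transform."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y=None):
        # Store the most frequent category of each column seen during fit:
        self.fill_values_ = {col: X[col].value_counts().index[0]
                             for col in self.cols}
        return self

    def transform(self, X):
        X = X.copy()
        # Reuse the stored values, whatever the new data looks like:
        for col, value in self.fill_values_.items():
            X[col] = X[col].fillna(value)
        return X


# Hypothetical training data and unseen data:
train = pd.DataFrame({'city': ['tokyo', 'tokyo', 'london', None]})
new_data = pd.DataFrame({'city': [None, 'seattle']})

imputer = FittedCategoryImputer(cols=['city']).fit(train)
filled = imputer.transform(new_data)
print(filled['city'].tolist())
```

Inheriting from BaseEstimator as well adds get_params and set_params, which cross-validation tools rely on when cloning a pipeline step.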

### 2.3 - Build Custom Quantitative Imputer:

The Custom Quantitative Imputer will be similar to the previous one, but will take an added "strategy" parameter that selects how the fill value for quantitative data is computed (e.g. the mean or the median).

In [10]:
# Import the required package:
from sklearn.impute import SimpleImputer

In [13]:
# Define the Custom Quantitative imputer Class:

class CustomQuantitativeImputer(TransformerMixin):
    """ Custom Quantitative Imputer that inherits from the TransformerMixin class.
        TransformerMixin supplies the .fit_transform method, which calls .fit and then .transform.
    Note:
        - requires SimpleImputer from sklearn.impute.
    """
    # Initialise the instance attributes, the columns and the imputation strategy:
    def __init__(self, cols=None, strategy='mean'):
        self.cols = cols
        self.strategy = strategy
    
    # Fill the missing column values:
    def transform(self, dataFrame):
        X = dataFrame.copy()
        
        imputer = SimpleImputer(strategy=self.strategy)
        
        for col in self.cols:
            # Double brackets keep X[[col]] two-dimensional, as SimpleImputer expects:
            X[col] = imputer.fit_transform(X[[col]])
            
        return X
    
    # Fit method, that follows the fit method from scikit-learn:
    def fit(self, *_):
        return self
    

With the above imputer completed, it can be used on the "quantitative_column".

In [14]:
# Apply the custom imputer, instantiate:
cqi = CustomQuantitativeImputer(cols=['quantitative_column'], strategy='mean')

# Fit and transform on the dataset:
cqi.fit_transform(X_data)

| | city | boolean | ordinal_column | quantitative_column |
|---|---|---|---|---|
| 0 | tokyo | yes | somewhat like | 1.0 |
| 1 | | no | like | 11.0 |
| 2 | london | | somewhat like | -0.5 |
| 3 | seattle | no | like | 10.0 |
| 4 | san francisco | no | somewhat like | 8.3 |
| 5 | tokyo | yes | dislike | 20.0 |
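To illustrate what the "strategy" parameter changes, the sketch below applies SimpleImputer directly to the values of quantitative_column with two strategies. The array `values` is rebuilt here so the snippet stands alone. The mean of the five observed values is 8.3, matching the output above, while the median is 10.0.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The quantitative column from above, as a 2-D array with one column:
values = np.array([[1.0], [11.0], [-0.5], [10.0], [np.nan], [20.0]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(values)
median_filled = SimpleImputer(strategy='median').fit_transform(values)

# Row 4 held the missing value:
print(mean_filled[4, 0])    # mean of the five observed values
print(median_filled[4, 0])  # median of the five observed values
```

The median is less sensitive to extreme values, so the two strategies can differ noticeably on skewed columns.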


### 2.4 - Implement Imputers in the Pipeline:

Use both the CustomCategoryImputer and the CustomQuantitativeImputer in a pipeline to impute the missing values in the categorical and quantitative columns.

In [15]:
# Import the required library:
from sklearn.pipeline import Pipeline

In [16]:
# Instantiate the custom category imputer:
cci = CustomCategoryImputer(cols=['city', 'boolean'])

# Instantiate the custom quantitative imputer:
cqi = CustomQuantitativeImputer(cols=['quantitative_column'], strategy='mean')

# Pipeline:
pipeline_imputer = Pipeline(steps=[('quant', cqi), 
                                   ('category', cci)] 
                           )

# Fit and transform on the dataset:
pipeline_imputer.fit_transform(X_data)

| | city | boolean | ordinal_column | quantitative_column |
|---|---|---|---|---|
| 0 | tokyo | yes | somewhat like | 1.0 |
| 1 | tokyo | no | like | 11.0 |
| 2 | london | no | somewhat like | -0.5 |
| 3 | seattle | no | like | 10.0 |
| 4 | san francisco | no | somewhat like | 8.3 |
| 5 | tokyo | yes | dislike | 20.0 |
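For comparison, a similar result can be obtained with scikit-learn's built-in components instead of the custom classes. The following is a sketch, not the approach used in this project: it relies on ColumnTransformer and SimpleImputer, and it assumes the missing entries are written as np.nan rather than None, since SimpleImputer looks for np.nan by default. The output columns are reordered, with the imputed columns first and the passthrough column last.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Same toy dataset, with np.nan marking the missing entries (assumption):
X_data = pd.DataFrame({
    'city': ['tokyo', np.nan, 'london', 'seattle', 'san francisco', 'tokyo'],
    'boolean': ['yes', 'no', np.nan, 'no', 'no', 'yes'],
    'ordinal_column': ['somewhat like', 'like', 'somewhat like', 'like',
                       'somewhat like', 'dislike'],
    'quantitative_column': [1, 11, -.5, 10, np.nan, 20],
})

ct = ColumnTransformer(
    transformers=[
        ('category', SimpleImputer(strategy='most_frequent'), ['city', 'boolean']),
        ('quant', SimpleImputer(strategy='mean'), ['quantitative_column']),
    ],
    remainder='passthrough',  # keep ordinal_column untouched
)

# The result is an array: transformer outputs in order, then the remainder.
imputed = pd.DataFrame(
    ct.fit_transform(X_data),
    columns=['city', 'boolean', 'quantitative_column', 'ordinal_column'],
)
print(imputed)
```

Unlike a Pipeline, which passes the whole dataset through each step in turn, a ColumnTransformer applies each transformer to its own subset of columns and concatenates the results.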


## 3 - Applying encoding techniques to categorical variables:

### 3.1 - 