# Chapter 1

### Feature Engineering

- Create new features
- Transform existing features
- Normalize features
- Encoding: convert categories into numeric data
    - One-hot encoding: explainable features; creates N columns for N categories
    - Dummy encoding: the same information without redundancy; creates N-1 columns for N categories (the dropped category acts as the baseline)
- Merge low-frequency categorical values (uncommon categories) into a single category (e.g. `Other`)
- Binarize numeric values (e.g. from `num_violations` to `violation_boolean`)

```
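import numpy as np
import pandas as pd

# One-hot encoding: one column per category
pd.get_dummies(df, columns=['cat'], prefix='C')
# Dummy encoding: drop the first category to avoid a redundant column
pd.get_dummies(df, columns=['cat'], drop_first=True, prefix='C')

# Merging low-frequency categories (fewer than 5 occurrences) into 'Other'
counts = df['cat'].value_counts()
mask = df['cat'].isin(counts[counts < 5].index)
# Use .loc instead of chained indexing so the assignment reliably modifies df
df.loc[mask, 'cat'] = 'Other'

# Binarizing numeric variables: 1 if the value is positive, else 0
df['Binary_col'] = 0
df.loc[df['Number_col'] > 0, 'Binary_col'] = 1

# Binning a numeric variable into three groups: (-inf, 0], (0, 2], (2, inf)
df['Binned_Group'] = pd.cut(df['Number_col'], bins=[-np.inf, 0, 2, np.inf], labels=[1, 2, 3])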
```
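
To make the difference between the two encodings and the rare-category merge concrete, here is a minimal sketch on toy data. The `demo` frame, its values, and the threshold of 2 are assumptions chosen purely for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'a' appears 3 times, 'b' twice, 'c' and 'd' once each
demo = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'b', 'c', 'd']})

# Merge categories seen fewer than 2 times into 'Other'
counts = demo['cat'].value_counts()
rare = counts[counts < 2].index
demo.loc[demo['cat'].isin(rare), 'cat'] = 'Other'

# One-hot encoding keeps one column per remaining category (3 here)
onehot = pd.get_dummies(demo, columns=['cat'], prefix='C')
# Dummy encoding drops the first category, leaving N-1 columns (2 here)
dummy = pd.get_dummies(demo, columns=['cat'], drop_first=True, prefix='C')

print(list(onehot.columns))  # ['C_Other', 'C_a', 'C_b']
print(list(dummy.columns))   # ['C_a', 'C_b']
```

The dropped column (`C_Other` here) is implied whenever all remaining dummy columns are zero, which is why N-1 columns carry the same information.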

### Binning

```
# Create edges for 3 equal-width bins (4 edges define 3 intervals, one per label)
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
custom_labels = ["low", "medium", "high"]
df["price_bin"] = pd.cut(df["price"], bins, labels=custom_labels, include_lowest=True)

# Alternative approach: quantile-based bins, each holding roughly the same number of rows
df['price_bin'] = pd.qcut(df['price'], q=3)
```
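
The two approaches can give very different groupings on skewed data. The sketch below compares them; the `prices` frame and its values are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed prices: eight small values and one large outlier
prices = pd.DataFrame({'price': [1, 2, 3, 4, 5, 6, 7, 8, 100]})
labels = ['low', 'medium', 'high']

# Equal-width bins: 4 edges spanning min..max define 3 intervals
edges = np.linspace(prices['price'].min(), prices['price'].max(), 4)
prices['width_bin'] = pd.cut(prices['price'], edges, labels=labels, include_lowest=True)

# Quantile bins: each bin holds roughly the same number of rows
prices['quantile_bin'] = pd.qcut(prices['price'], q=3, labels=labels)

# Equal-width: 8 rows land in 'low', 0 in 'medium', 1 in 'high'
print(prices['width_bin'].value_counts())
# Quantile-based: 3 rows in each bin
print(prices['quantile_bin'].value_counts())
```

With an outlier present, equal-width bins put almost every row in one bin, while quantile bins stay balanced at the cost of uneven interval widths.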

### DataFrame columns

```
# See column names
df.columns
# Set column names
df.columns = ['A', 'B', 'C']
# Data type of columns
df.dtypes
# Select only the columns of specific types
df_ints = df.select_dtypes(include=['int'])
# Set the type of a column
df['num_col'] = df['num_col'].astype(int)
# See column description
df.describe()
# See column information
df.info()
# See frequencies in categorical column
df['cat'].value_counts()
```

### One-hot and label encoding

```
# One-hot encoding of a categorical variable (N columns)
df_onehot = pd.get_dummies(df, columns=['cat'], prefix='C')
# Dummy encoding of the same variable (N-1 columns)
df_dummy = pd.get_dummies(df, columns=['cat'], drop_first=True, prefix='C')

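# Alternative approach: scikit-learn's OneHotEncoder
from sklearn import preprocessing
encoder = preprocessing.OneHotEncoder()
# The encoder expects 2-D input, hence the reshape
onehot_transformed = encoder.fit_transform(df['cat_col'].values.reshape(-1, 1))
# Convert the sparse result into a DataFrame; reuse df's index so concat aligns rows correctly
# (get_feature_names_out requires scikit-learn >= 1.0)
onehot_df = pd.DataFrame(onehot_transformed.toarray(),
                         columns=encoder.get_feature_names_out(['cat_col']),
                         index=df.index)
# Add the encoded columns to the original dataset
df = pd.concat([df, onehot_df], axis=1)
# Drop the original column that was used for encoding
df = df.drop('cat_col', axis=1)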

# Label encoding: turning string labels into integer codes
from sklearn import preprocessing
encoder_lvl = preprocessing.LabelEncoder()
# Specify the unique categories in the column; LabelEncoder sorts them,
# so the codes are HIGH=0, LOW=1, NORMAL=2
encoder_lvl.fit(['LOW', 'NORMAL', 'HIGH'])
# Apply label encoding to the third column of the DataFrame (selected by position)
df[df.columns[2]] = encoder_lvl.transform(df.iloc[:, 2])
```
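
`LabelEncoder` assigns codes in sorted order of the class labels, so `HIGH`, `LOW`, `NORMAL` become 0, 1, 2, which does not follow the natural LOW < NORMAL < HIGH ordering. When that order matters, one option (not the only one) is an ordered pandas categorical. A minimal sketch follows; the `levels` frame and its `level` column are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column with an inherent order
levels = pd.DataFrame({'level': ['LOW', 'HIGH', 'NORMAL', 'LOW']})

# LabelEncoder sorts class labels alphabetically: HIGH=0, LOW=1, NORMAL=2
alpha_codes = LabelEncoder().fit_transform(levels['level'])

# An ordered categorical lets us state the order explicitly: LOW=0, NORMAL=1, HIGH=2
ordered = pd.Categorical(levels['level'],
                         categories=['LOW', 'NORMAL', 'HIGH'],
                         ordered=True)
levels['level_code'] = ordered.codes

print(alpha_codes.tolist())           # [1, 0, 2, 1]
print(levels['level_code'].tolist())  # [0, 2, 1, 0]
```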