
# Data Preparation

<u>Context</u>

Superheroes have long been part of popular culture, now more than ever. Since their creation, superheroes have not been very diverse, but this is changing rapidly. The two datasets aim to provide an overview of heroes and their physical and power characteristics, helping curious people identify trends and patterns. In this case, we want to understand how physical attributes and powers define a superhero's alignment (superhero or supervillain).

    
<u>Inspiration</u>

What are the characteristics of your favorite superheroes? Do these characteristics affect a superhero's alignment? Let's shed some light on this important business question.

## Process steps

1. **Feature Selection and Engineering:**  
   Create a feature matrix `X` and a target array `y` (using the `Alignment` variable). Drop any irrelevant columns and explain your reasoning for each column you choose to exclude. If you find it relevant, consider combining existing columns or creating new ones based on the dataset's features.

2. **Encoding Categorical and Ordinal Features:**  
   Identify the categorical and ordinal columns, and encode them using a `ColumnTransformer` to apply the transformations in parallel.

3. **Handling Missing Data:**  
   If there are missing values in your feature matrix, decide on an appropriate method to handle them (e.g., imputation).

4. **Standardizing Features:**  
   Assess whether standardization is necessary for your numerical features, and apply it if needed.

5. **Building a Pipeline:**  
   Create a `Pipeline` that integrates all the preprocessing steps you have applied.

6. **Documentation:**
   Remember that thoroughly documenting your code and clearly explaining why certain decisions were made, while also considering and justifying why other options were not chosen, will be highly valued.

In [27]:
import pandas as pd
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Make all transformers return pandas DataFrames, so that later steps can select columns by name
set_config(transform_output="pandas")

In [None]:
# Data import

df = pd.read_csv('https://raw.githubusercontent.com/jnin/information-systems/refs/heads/main/data/superheroes_complete.csv')

### 1. Feature Selection

In [29]:
y = df['Alignment']
X = df.drop(columns = ['Alignment'])

We decided to drop the **name**, **skin color**, **hair color** and **eye color** columns because we consider them irrelevant for determining the alignment of a hero.

We specifically decided to keep **publisher**, since some publishers might be biased towards certain alignments (e.g. Marvel preferably publishing "good" superheroes). This could be a relevant factor for later prediction.
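One way to check whether the publisher actually carries signal is to tabulate the alignment shares per publisher. The sketch below uses a small made-up DataFrame: the publisher names and the alignment labels `good`/`bad` are assumptions for illustration, not values read from the dataset. On the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical toy data; the rows are invented for illustration only
toy = pd.DataFrame({
    "Publisher": ["Marvel Comics", "Marvel Comics", "Marvel Comics", "DC Comics", "DC Comics"],
    "Alignment": ["good", "good", "bad", "good", "bad"],
})

# Share of each alignment within each publisher (every row sums to 1)
alignment_share = pd.crosstab(toy["Publisher"], toy["Alignment"], normalize="index")
print(alignment_share)
```

If the shares differ clearly between publishers, keeping the column is easier to justify; if they are nearly identical, it adds little.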

In [30]:
X.drop(columns = [
    "Name",
    "Skin color", 
    "Hair color", 
    "Eye color"], inplace = True)

We also checked whether the vision columns could be combined into a single feature and ordinally encoded later. The following checks show that this is **not possible**, since several superheroes have more than one type of vision.

In [31]:
# Step 1: Filter columns that contain the word "vision" (case-insensitive)

vision_columns = [col for col in df.columns if 'vision' in col.lower()]
print(f'There are {len(vision_columns)} different kinds of vision.')

# Step 2: Create a new column v_sum to store the sum of all vision powers

df['v_sum'] = df[vision_columns].sum(axis = 1)

multiple_vision = len(df[df['v_sum'] > 1])
print(f'There are {multiple_vision} superheroes with at least 2 types of vision.')


There are 8 different kinds of vision.
There are 45 superheroes with at least 2 types of vision.

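Although the vision columns cannot be collapsed into one ordinal feature, the number of vision powers per hero could still be kept as an engineered count. The following is only a sketch of that option on invented data, assuming the vision columns are boolean; the column names and the feature name `vision_count` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with boolean power columns (names and values invented)
toy = pd.DataFrame({
    "Vision - Night": [True, False, True],
    "Vision - Heat": [True, False, False],
    "Vision - X-Ray": [True, False, False],
    "Flight": [False, True, True],
})

# Select the vision columns by name, case-insensitively
vision_columns = [col for col in toy.columns if "vision" in col.lower()]

# Number of vision powers per hero; True counts as 1, False as 0
toy["vision_count"] = toy[vision_columns].sum(axis=1)
print(toy["vision_count"].tolist())  # [3, 0, 1]
```

A count keeps the information that a hero has several vision powers without implying an order between the individual types.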

### 2. Handling of missing values

**Note:** We switched the order of execution and handle missing data BEFORE encoding, so that the encoder does not create separate "NaN" columns for missing values.

2.1 Identifying columns with missing values

In [32]:
# Count of missing values for each column
missing_per_column = X.isnull().sum()

# Filter columns where missing values are greater than 0
columns_with_missing = missing_per_column[missing_per_column > 0]

# Display the columns with missing values
print(columns_with_missing)

Gender        18
Race         247
Publisher     13
Weight         2
dtype: int64


2.2 Imputing missing values for **weight**, **publisher**, **race** and **gender**

In [33]:
# Weight will be imputed using the median
imputer_median = SimpleImputer(strategy='median')

# Publisher, race and gender will be imputed using the most frequent occurrence
imputer_mf = SimpleImputer(strategy='most_frequent')

# Setting up the transformer. We are turning off the column prefixes to not have to deal with the "remainder" prefixes later.

imputing_transformer = ColumnTransformer([('median', imputer_median, ['Weight']),
                                 ('most_freq', imputer_mf, ['Publisher', 'Race', 'Gender'])],
                                 remainder= "passthrough",
                                 verbose_feature_names_out=False)

# Transforming the data
X_imputed = imputing_transformer.fit_transform(X)
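Most-frequent imputation is simple, but with 247 missing values in `Race` it assigns all of those heroes to the single most common race. An alternative worth weighing is a constant placeholder category, which keeps "unknown" distinguishable from observed values. This is a sketch on invented data; the label `"Unknown"` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column (values invented)
toy = pd.DataFrame({"Race": ["Human", np.nan, "Mutant", np.nan, "Human"]})

# Replace missing entries with an explicit placeholder category
imputer_constant = SimpleImputer(strategy="constant", fill_value="Unknown")
race_imputed = imputer_constant.fit_transform(toy[["Race"]])

print(race_imputed.ravel().tolist())
# ['Human', 'Unknown', 'Mutant', 'Unknown', 'Human']
```

With this variant, the later one-hot encoding would produce one additional `Race_Unknown` column instead of inflating the most frequent race.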


2.3 Checking if there are any remaining missing values

In [34]:
check = pd.DataFrame(X_imputed)
total_missing = check.isnull().sum().sum()

print(total_missing)

0


### 3. Variable Encodings

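We set up a second `ColumnTransformer` that one-hot encodes the nominal columns (`Gender`, `Race`) and ordinally encodes `Power Level` and `Intelligence Level` using an explicit category order. All remaining columns are passed through unchanged.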

In [35]:
categorical_cols = ['Gender', 'Race']
ordinal_cols = ['Power Level', 'Intelligence Level']

# Specifying the order of the ordinal columns
categories = [
    ['Weak', 'Below Average', 'Average', 'Above Average', 'Extremely Powerful'],  # Power Level
    ['Low Intelligence', 'Average Intelligence', 'Smart', 'Genius', 'Super-Genius']  # Intelligence Level
]

encoding_transformer = ColumnTransformer([
        ('ohe', OneHotEncoder(sparse_output=False), categorical_cols),  # Using One-Hot Encoder for categorical variables
        ('ord_enc', OrdinalEncoder(categories=categories), ordinal_cols)  # Using OrdinalEncoder for ordinal variables
    ],
    remainder="passthrough",
    verbose_feature_names_out=False
)

X_encoded = encoding_transformer.fit_transform(X_imputed)
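The explicit `categories` argument matters because `OrdinalEncoder` otherwise sorts the labels alphabetically. The sketch below compares both behaviours on three invented rows, using the `Power Level` labels from the cell above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

power_order = ['Weak', 'Below Average', 'Average', 'Above Average', 'Extremely Powerful']

# Hypothetical toy rows (invented), using the labels defined above
toy = pd.DataFrame({"Power Level": ["Average", "Weak", "Extremely Powerful"]})

# Explicit order: 'Weak' -> 0.0, ..., 'Extremely Powerful' -> 4.0
codes = OrdinalEncoder(categories=[power_order]).fit_transform(toy)
print(codes.ravel().tolist())  # [2.0, 0.0, 4.0]

# Default behaviour: alphabetical order of the labels seen during fit
default_codes = OrdinalEncoder().fit_transform(toy)
print(default_codes.ravel().tolist())  # [0.0, 2.0, 1.0]
```

In the default case `Weak` would rank above `Extremely Powerful`, which would distort the meaning of the feature.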


### 4. Feature Standardization

We decided to standardize the height and weight columns. Therefore, we copy the encoded DataFrame into `X_scaled` and overwrite its weight and height columns with the standardized values.

In [36]:
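scaler = StandardScaler()

# Work on a copy so that X_encoded itself is not modified
X_scaled = X_encoded.copy()

# Fitting the scaler on the Weight and Height columns and overwriting them with the scaled values

X_scaled[['Weight', 'Height']] = scaler.fit_transform(X_encoded[['Weight', 'Height']])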
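For reference, `StandardScaler` replaces each value $x$ of a column by $z = (x - \mu) / \sigma$, where $\mu$ is the column mean and $\sigma$ the population standard deviation. The sketch below verifies this on three invented weights.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical weights (values invented)
weights = np.array([[50.0], [70.0], [90.0]])

scaled = StandardScaler().fit_transform(weights)

# Manual z-score with the population standard deviation (ddof=0)
manual = (weights - weights.mean(axis=0)) / weights.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))  # True
```

After scaling, each column has mean 0 and standard deviation 1, so height and weight end up on a comparable scale.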

### 5. Building the Pipeline

Here, we combine all the column transformation steps in a `Pipeline`, so that they can be fitted and applied in a single call.

- **Imputer**: the column-wise imputation from **2.2** for weight, publisher, race and gender
- **Encoder**: the One-Hot Encoder and Ordinal Encoder from **3**
- **Scaler**: the `StandardScaler` instantiated in **4**, wrapped in a `ColumnTransformer` so that it is applied only to the weight and height columns

In [37]:
from sklearn.pipeline import Pipeline

pipeline_steps = [('imputer', imputing_transformer),
                  ('encoder', encoding_transformer),
                  ('scaler', ColumnTransformer([
                      ('scaler', scaler, ['Weight', 'Height'])], 
                      remainder='passthrough',
                      verbose_feature_names_out= False))]

pipe = Pipeline(pipeline_steps)  

X_pipe = pipe.fit_transform(X)
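A practical benefit of a pipeline is that it can be fitted on training data and then reused on unseen data, so that medians, means and standard deviations are learned from the training rows only. The sketch below shows the pattern on an invented single-column frame. The train/test split and the name `toy_pipe` are assumptions for illustration and not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature with one missing value (values invented)
toy = pd.DataFrame({"Weight": [50.0, 70.0, np.nan, 90.0, 110.0, 65.0]})

# Hold out two rows as a stand-in for unseen data
X_train, X_test = train_test_split(toy, test_size=2, random_state=0)

toy_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Learn the median, mean and standard deviation from the training rows only
X_train_prepared = toy_pipe.fit_transform(X_train)

# Reuse the learned parameters on the held-out rows (no refitting)
X_test_prepared = toy_pipe.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)  # (4, 1) (2, 1)
```

With categorical columns, the same pattern would additionally require deciding how the encoder handles categories that appear only in the held-out rows.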

This last step checks whether the two approaches ("manual" vs. Pipeline) produce results of the same shape:

In [40]:
X_pipe.shape == X_scaled.shape

True
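Equal shapes are a necessary but not a sufficient condition for equal results. A stricter check compares the values, after aligning the column order, because a `ColumnTransformer` places the transformed columns before the passed-through ones. The sketch below uses two invented frames; on the real data one would pass `X_scaled` and `X_pipe` instead, possibly with `check_dtype=False` if the column dtypes differ.

```python
import pandas as pd

# Hypothetical results of two preprocessing routes (values invented):
# same content, different column order
manual = pd.DataFrame({"Gender_Male": [1.0, 0.0], "Weight": [0.5, -0.5]})
piped = pd.DataFrame({"Weight": [0.5, -0.5], "Gender_Male": [1.0, 0.0]})

# Shapes match, but that alone does not prove the values agree
print(manual.shape == piped.shape)  # True

# Align the column order, then compare values (raises AssertionError on mismatch)
pd.testing.assert_frame_equal(manual, piped[manual.columns])
```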

In [39]:
X_pipe

| | Weight | Height | Gender_Female | Gender_Male | Race_Alien | Race_Alpha | Race_Amazon | Race_Android | Race_Animal | Race_Asgardian | ... | Web Creation | Reality Warping | Odin Force | Symbiote Costume | Speed Force | Phoenix Force | Molecular Dissipation | Vision - Cryo | Omnipresent | Omniscient |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.956541 | 0.659970 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 0.096699 | 0.571472 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 0.286848 | 0.527224 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 2.956541 | 0.659970 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | -1.150678 | -1.567212 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 655 | -0.002178 | 0.379728 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 656 | -1.150678 | 1.410721 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 657 | -0.268387 | -0.350374 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 658 | 0.035852 | 0.416602 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | False | False | False | False | False | False | False | False | False | False |
