## 1. Merging DataFrames

This cell demonstrates how to combine two different data sources (Customers and Orders) using the `pd.merge()` function. We use an **Outer Join** to ensure no data is lost and an **Indicator** to see which table provided the data.

# Chapter 3: Data Manipulation & Preprocessing

This notebook focuses on advanced data cleaning, reshaping, and preparing data for machine learning using Pandas and Scikit-learn.

| Cell # | Purpose | Key Details |
| :--- | :--- | :--- |
| **1** | **Merging DataFrames** | Uses `pd.merge()` to perform an Outer Join and track sources with an indicator. |
| **2** | **Joining Data** | Demonstrates index-based joining using `.join()`. |
| **3** | **Pivoting Data** | Transforms 'long' format to 'wide' format using `pivot_table()`. |
| **4** | **Melting Data** | Reverses pivoting using `pd.melt()` for data normalization. |
| **5** | **Missing Data Analysis** | Calculates counts and percentages of null values. |
| **6** | **Basic Imputation** | Demonstrates Median and Forward Fill strategies. |
| **7** | **Advanced Imputation** | Uses Scikit-learn's `SimpleImputer` for mean-based filling. |
| **8** | **One-Hot Encoding** | Converts categorical names into binary columns. |
| **9** | **Ordinal Encoding** | Maps hierarchical categories to specific integers. |
| **10** | **Feature Scaling** | Demonstrates Normalization and Standardization. |
| **11** | **Memory Optimization** | Downcasts numeric types to save RAM. |
| **12** | **ML Pipelines** | Automates the entire preprocessing workflow. |
| **13** | **Sales Data Loading** | Imports CSV data and calculates total Revenue per order. |
| **14** | **Pivot Table Reporting** | Summarizes sales by category and city for better insight. |
| **15** | **Data Reshaping (Melting)** | Converts wide summarized data back to long format. |

In [2]:
import pandas as pd

# Example DataFrames
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'region': ['East', 'West', 'East', 'South']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 5, 3],
    'amount': [50.0, 120.0, 80.0, 30.0, 90.0]
})


merged_df = pd.merge(customers, orders,
                     on='customer_id',
                     how='outer',
                     indicator=True) 

print(merged_df)

   customer_id     name region  order_id  amount      _merge
0            1    Alice   East     101.0    50.0        both
1            2      Bob   West     102.0   120.0        both
2            2      Bob   West     103.0    80.0        both
3            3  Charlie   East     105.0    90.0        both
4            4    David  South       NaN     NaN   left_only
5            5      NaN    NaN     104.0    30.0  right_only


## 2. Joining Data (Index-based)

Unlike merging on a column, `.join()` is used to combine DataFrames based on their indexes. This is often faster and cleaner for primary-key associations.

In [3]:
from os.path import join
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "David"],
    "region": ["East", "West", "East", "South"]
})

orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105],
    "customer_id": [1, 2, 2, 5, 3],
    "amount": [50.0, 120.0, 80.0, 30.0, 90.0]
})

join_df = customers.set_index("customer_id").join(
    orders.set_index("customer_id"),
    how="right"
).reset_index()

print(join_df)


   customer_id     name region  order_id  amount
0            1    Alice   East       101    50.0
1            2      Bob   West       102   120.0
2            2      Bob   West       103    80.0
3            5      NaN    NaN       104    30.0
4            3  Charlie   East       105    90.0


## 3. Pivoting Data

Pivoting allows us to reshape data by turning unique values from one column into multiple new columns. Here, we turn 'Product' rows into columns to see monthly sales at a glance.

In [4]:
# Data prepared for pivoting: Sales by Product and Month
data = {
    'Month': [1, 1, 2, 2, 3, 3],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 50, 110, 60, 150, 70]
}
df_sales = pd.DataFrame(data)

# Pivot: Turn products into columns, aggregated by month
pivoted_df = df_sales.pivot_table(
    index='Month',
    columns='Product',
    values='Sales',
    aggfunc='sum'
)

print(pivoted_df)

Product    A   B
Month           
1        100  50
2        110  60
3        150  70


## 4. Melting Data

Melting is the inverse of pivoting. It unpivots a DataFrame from 'wide' format back to a 'long' format, which is often required for statistical modeling or plotting with libraries like Seaborn.

In [5]:
# Melting the pivoted_df back into a long format
melted_df = pd.melt(pivoted_df.reset_index(),
                    id_vars=['Month'],
                    value_vars=['A', 'B'],
                    var_name='Product',
                    value_name='Sales')

print(melted_df)

   Month Product  Sales
0      1       A    100
1      2       A    110
2      3       A    150
3      1       B     50
4      2       B     60
5      3       B     70


## 5. Identifying Missing Data

Before cleaning data, we must identify where information is missing. This cell calculates the sum and percentage of `NaN` values per column.

In [6]:
import pandas as pd
import numpy as np

data = {'A': [1, 2, np.nan, 4], 
        'B': [5, np.nan, 7, 8], 
        'C': [9, 10, 11, np.nan]}
df = pd.DataFrame(data)

# 1. Total count of missing values per column
print('Missing Counts:\n', df.isnull().sum())

# 2. Percentage of missing values
print('\nMissing Percentage:\n', (df.isnull().sum() / len(df)) * 100)

Missing Counts:
 A    1
B    1
C    1
dtype: int64

Missing Percentage:
 A    25.0
B    25.0
C    25.0
dtype: float64


## 6. Basic Imputation

We handle missing data using simple Pandas methods. Here we use the **Median** to fill missing numbers in column 'A' and **Forward Fill** for column 'C'.

In [7]:
from sklearn.impute import SimpleImputer
import pandas as pd

# 1. Impute Column A (Numeric) with the Median
median_A = df['A'].median()
df['A_imputed_median'] = df['A'].fillna(median_A)

# 2. Impute Column C (Time Series) using Forward Fill (FFill)
df['C_imputed_ffill'] = df['C'].ffill()

print(df)


     A    B     C  A_imputed_median  C_imputed_ffill
0  1.0  5.0   9.0               1.0              9.0
1  2.0  NaN  10.0               2.0             10.0
2  NaN  7.0  11.0               2.0             11.0
3  4.0  8.0   NaN               4.0             11.0


## 7. SimpleImputer (Scikit-learn)

Scikit-learn's `SimpleImputer` provides a standardized way to fill missing values, making it easier to integrate into automated machine learning pipelines.

In [8]:
# Using Scikit-learn to handle multiple columns with different strategies
imputer_mean = SimpleImputer(strategy='mean')

# Fit and transform (A and B are imputed with their respective means)
df[['A', 'B']] = imputer_mean.fit_transform(df[['A', 'B']])

print('\nDataFrame after sklearn mean imputation:')
print(df)


DataFrame after sklearn mean imputation:
          A         B     C  A_imputed_median  C_imputed_ffill
0  1.000000  5.000000   9.0               1.0              9.0
1  2.000000  6.666667  10.0               2.0             10.0
2  2.333333  7.000000  11.0               2.0             11.0
3  4.000000  8.000000   NaN               4.0             11.0


## 8. One-Hot Encoding

Machine learning models cannot work with text directly. One-Hot Encoding converts categorical names (like Cities) into new columns with 0s and 1s.

In [9]:
import pandas as pd

data = {
    'City': ['NYC', 'London', 'Paris', 'NYC'],
    'Price': [100, 200, 300, 150]
}
df = pd.DataFrame(data)

# Perform OHE
df_ohe = pd.get_dummies(df, columns=['City'], prefix='is')

print('One-Hot Encoded Data:')
print(df_ohe)

One-Hot Encoded Data:
   Price  is_London  is_NYC  is_Paris
0    100      False    True     False
1    200       True   False     False
2    300      False   False      True
3    150      False    True     False


## 9. Ordinal Encoding

When data has a clear order (e.g., Small < Medium < Large), we use Ordinal Encoding to convert categories into a meaningful sequence of numbers.

In [10]:
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

data = {
    'Size': ['Small', 'Medium', 'Large', 'Small'],
    'Value': [10, 20, 30, 15]
}
df_ord = pd.DataFrame(data)

# Define the explicit order
size_order = ['Small', 'Medium', 'Large']

encoder = OrdinalEncoder(categories=[size_order])

# Fit and transform the 'Size' column
df_ord['Size_Encoded'] = encoder.fit_transform(df_ord[['Size']])

print('\nOrdinal Encoded Data:')
print(df_ord)


Ordinal Encoded Data:
     Size  Value  Size_Encoded
0   Small     10           0.0
1  Medium     20           1.0
2   Large     30           2.0
3   Small     15           0.0


## 10. Feature Scaling

This cell compares **Normalization** (scaling values to a 0-1 range) and **Standardization** (scaling to a mean of 0 and standard deviation of 1).

In [11]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

data = np.array([[10, 1], [20, 5], [30, 10]])

# 1. Normalization (MinMaxScaler)
scaler_norm = MinMaxScaler()
data_normalized = scaler_norm.fit_transform(data)
print('Normalized (Min-Max):')
print(data_normalized.round(2))

# 2. Standardization (StandardScaler)
scaler_std = StandardScaler()
data_standardized = scaler_std.fit_transform(data)
print('\nStandardized (Z-Score):')
print(data_standardized.round(2))

Normalized (Min-Max):
[[0.   0.  ]
 [0.5  0.44]
 [1.   1.  ]]

Standardized (Z-Score):
[[-1.22 -1.18]
 [ 0.   -0.09]
 [ 1.22  1.27]]


## 11. Memory Optimization

Handling big data requires efficiency. This custom function reduces memory usage by converting large numeric types to smaller byte sizes and using the `category` type for repetitive strings.

In [12]:
# 1. Optimizing Integer and Float Types

def downcast_numeric(df):
    for col in df.select_dtypes(include=['int64', 'float64']).columns:
        # Check limits and downcast to the smallest fit
        if 'int' in str(df[col].dtype):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif 'float' in str(df[col].dtype):
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df

# Create a sample DataFrame for demonstration
import pandas as pd
data = {
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1],
    'C': ['apple', 'banana', 'apple', 'orange', 'banana', 'apple', 'grape', 'orange', 'banana', 'apple'],
    'D': ['long string data', 'more long string data', 'long string data', 'unique string', 'more long string data', 'long string data', 'some other text', 'unique string', 'more long string data', 'long string data']
}
df = pd.DataFrame(data)

print("Original Dtypes:")
print(df.dtypes)
print(f"Original Memory Usage: {df.memory_usage(deep=True).sum() / (1024**2):.2f} MB")

# 2. Optimizing String Types (Critical for low-cardinality data)

# Column 'C' has only 5 unique values (low cardinality)
df['C'] = df['C'].astype('category')

df_optimized = downcast_numeric(df.copy())

# Re-check memory usage
optimized_mem = df_optimized.memory_usage(deep=True).sum() / (1024**2)

print('Optimized Dtypes:')
print(df_optimized.dtypes)
print(f"Optimized Memory Usage: {optimized_mem:.2f} MB")

Original Dtypes:
A      int64
B    float64
C        str
D        str
dtype: object
Original Memory Usage: 0.00 MB
Optimized Dtypes:
A        int8
B     float32
C    category
D         str
dtype: object
Optimized Memory Usage: 0.00 MB


## 12. End-to-End ML Pipelines

This complex pipeline automatically handles missing data, scales numerical features, encodes categories, and trains a Logistic Regression modelâ€”all in one unified workflow.

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

# 1. Sample Data (Simulating raw input)
data = {
    'Age': [30, 45, np.nan, 22, 60],
    'Income': [50000, 120000, 80000, 30000, 150000],
    'City': ['NYC', 'London', 'Paris', 'NYC', 'London'],
    'Target': [0, 1, 0, 1, 1]
}
df = pd.DataFrame(data)
X = df.drop('Target', axis=1)
y = df['Target']

# --- Step 1: Define Column Groups ---

numerical_features = ['Age', 'Income']
categorical_features = ['City']

# --- Step 2: Define Sub-Pipelines ---

# Pipeline for Numerical Data (Impute missing, then scale)
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Pipeline for Categorical Data (Handle missing, then OHE)
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))

])

# --- Step 3: Combine Pipelines using ColumnTransformer ---

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ],
    remainder='passthrough'
)

# --- Step 4: Final Model Pipeline ---

# The final pipeline integrates preprocessing and the model
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear'))
])

# --- Step 5: Training ---

# The entire cleaning/scaling/training process is run in one line
full_pipeline.fit(X, y)

print("Pipeline training successful.")

# --- Step 6: Prediction on New Data ---

new_data = pd.DataFrame({
    'Age': [40, np.nan],
    'Income': [60000, 95000],
    'City': ['Paris', 'Berlin']
})

# The exact scaling and imputation rules learned from the training data are applied
predictions = full_pipeline.predict(new_data)

print(f"\nPredictions on new data: {predictions}")

Pipeline training successful.

Predictions on new data: [0 1]


## 13. Sales Data Analysis: Loading & Exploration

This section identifies how to load raw transactional data from a CSV file and perform initial inspection to understand the product categories and regional distribution.

In [14]:
import pandas as pd

df = pd.read_csv("sales_data.csv")

print(df.head())


   order_id customer_id customer_name   product     category  price  quantity  \
0       101        C001         Alice    Laptop  Electronics    800         1   
1       102        C002           Bob     Phone  Electronics    500         2   
2       103        C001         Alice     Mouse  Accessories     20         3   
3       104        C003       Charlie  Keyboard  Accessories     50         1   
4       105        C002           Bob    Laptop  Electronics    800         1   

        city  order_date  
0  Kathmandu  2024-01-05  
1   Lalitpur  2024-01-06  
2  Kathmandu  2024-01-10  
3  Bhaktapur  2024-02-01  
4   Lalitpur  2024-02-05  


## 14. Feature Engineering: Calculating Revenue

Derived features are often necessary for analysis. Here, we calculate the total Revenue for each order by multiplying the unit price with the quantity.

In [15]:
df["Revenue"] = df["price"] * df["quantity"]

print(df.head())


   order_id customer_id customer_name   product     category  price  quantity  \
0       101        C001         Alice    Laptop  Electronics    800         1   
1       102        C002           Bob     Phone  Electronics    500         2   
2       103        C001         Alice     Mouse  Accessories     20         3   
3       104        C003       Charlie  Keyboard  Accessories     50         1   
4       105        C002           Bob    Laptop  Electronics    800         1   

        city  order_date  Revenue  
0  Kathmandu  2024-01-05      800  
1   Lalitpur  2024-01-06     1000  
2  Kathmandu  2024-01-10       60  
3  Bhaktapur  2024-02-01       50  
4   Lalitpur  2024-02-05      800  


In [16]:
cateegory_summary = df.groupby('category')["Revenue"].sum()

print(cateegory_summary)

category
Accessories     200
Electronics    3100
Name: Revenue, dtype: int64


## 15. Pivot Table Reporting

This cell creates a cross-tabulation of total Revenue indexed by Product Category and columned by City. This multi-dimensional view quickly highlights which regions dominate specific markets.

In [17]:
pivoted_df = df.pivot_table(
    values='Revenue',
    index='category',
    columns='city',    
    aggfunc='sum',
    fill_value=0
)
print(pivoted_df)

city         Bhaktapur  Kathmandu  Lalitpur
category                                   
Accessories         90        110         0
Electronics          0       1300      1800


## 16. Data Reshaping: Melting the Pivot Table

The `.melt()` method is the inverse of pivoting. It converts the wide-format summarized table back into a long-format DataFrame, which is essential for certain visualization and database operations.

In [18]:
melted_df = pivoted_df.reset_index().melt(
    id_vars='category',
    var_name='city',
    value_name='Revenue'
)
print(melted_df.head())

      category       city  Revenue
0  Accessories  Bhaktapur       90
1  Electronics  Bhaktapur        0
2  Accessories  Kathmandu      110
3  Electronics  Kathmandu     1300
4  Accessories   Lalitpur        0


In [19]:

customers = df[["customer_id", "customer_name", "city"]].drop_duplicates()
products = df[["product", "category", "price"]].drop_duplicates()
orders = df[["order_id", "customer_id", "order_date"]].drop_duplicates()
order_details = df[["order_id", "product", "quantity"]].drop_duplicates()

merged_df = order_details.merge(products, on="product", how="left") \
                         .merge(orders, on="order_id", how="left") \
                         .merge(customers, on="customer_id", how="left")

print(merged_df.head())

   order_id   product  quantity     category  price customer_id  order_date  \
0       101    Laptop         1  Electronics    800        C001  2024-01-05   
1       102     Phone         2  Electronics    500        C002  2024-01-06   
2       103     Mouse         3  Accessories     20        C001  2024-01-10   
3       104  Keyboard         1  Accessories     50        C003  2024-02-01   
4       105    Laptop         1  Electronics    800        C002  2024-02-05   

  customer_name       city  
0         Alice  Kathmandu  
1           Bob   Lalitpur  
2         Alice  Kathmandu  
3       Charlie  Bhaktapur  
4           Bob   Lalitpur  


## Log Transformation

In [20]:
import pandas as pd
import numpy as np

data = {
    'Values': [1, 2, 3, 4, 5]
}

df = pd.DataFrame(data)

df['Log_Values'] = np.log10(df['Values'])
print(df)


   Values  Log_Values
0       1    0.000000
1       2    0.301030
2       3    0.477121
3       4    0.602060
4       5    0.698970


In [21]:
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([
    [100, 1],
    [200, 2],
    [300, 3]
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Original Means (Col A, Col B):", np.mean(X, axis=0))
print("Scaled Means (should be close to 0):", np.mean(X_scaled, axis=0).round(2))
print("Scaled Data:\n", X_scaled)

Original Means (Col A, Col B): [200.   2.]
Scaled Means (should be close to 0): [0. 0.]
Scaled Data:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]


## One-Hot Encoding (OHE)

In [22]:

import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Apple', 'Orange']})

df_encoded = pd.get_dummies(df, columns=['Fruit'], drop_first=True)
print(df_encoded)


   Fruit_Banana  Fruit_Orange
0         False         False
1          True         False
2         False         False
3         False          True


In [36]:
df = pd.read_csv("employee.csv")

print(df.head(20))

    employee_id         department     region        education gender  \
0          8724         Technology  region_26        Bachelors      m   
1         74430                 HR   region_4        Bachelors      f   
2         72255  Sales & Marketing  region_13        Bachelors      m   
3         38562        Procurement   region_2        Bachelors      f   
4         64486            Finance  region_29        Bachelors      m   
5         46232        Procurement   region_7        Bachelors      m   
6         54542            Finance   region_2        Bachelors      m   
7         67269          Analytics  region_22        Bachelors      m   
8         66174         Technology   region_7  Masters & above      m   
9         76303         Technology  region_22        Bachelors      m   
10        60245  Sales & Marketing  region_16        Bachelors      m   
11        42639  Sales & Marketing  region_17  Masters & above      m   
12        30963  Sales & Marketing   region_4  Mast

In [37]:
df = pd.read_csv("employee.csv")

print(df.head(20))

    employee_id         department     region        education gender  \
0          8724         Technology  region_26        Bachelors      m   
1         74430                 HR   region_4        Bachelors      f   
2         72255  Sales & Marketing  region_13        Bachelors      m   
3         38562        Procurement   region_2        Bachelors      f   
4         64486            Finance  region_29        Bachelors      m   
5         46232        Procurement   region_7        Bachelors      m   
6         54542            Finance   region_2        Bachelors      m   
7         67269          Analytics  region_22        Bachelors      m   
8         66174         Technology   region_7  Masters & above      m   
9         76303         Technology  region_22        Bachelors      m   
10        60245  Sales & Marketing  region_16        Bachelors      m   
11        42639  Sales & Marketing  region_17  Masters & above      m   
12        30963  Sales & Marketing   region_4  Mast

In [24]:
df.describe()

Unnamed: 0,employee_id,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met_more_than_80,awards_won,avg_training_score
count,17417.0,17417.0,17417.0,16054.0,17417.0,17417.0,17417.0,17417.0
mean,39083.491129,1.250732,34.807774,3.345459,5.80186,0.358845,0.023368,63.176322
std,22707.024087,0.595692,7.694046,1.265386,4.175533,0.479675,0.151074,13.418179
min,3.0,1.0,20.0,1.0,1.0,0.0,0.0,39.0
25%,19281.0,1.0,29.0,3.0,3.0,0.0,0.0,51.0
50%,39122.0,1.0,33.0,3.0,5.0,0.0,0.0,60.0
75%,58838.0,1.0,39.0,4.0,7.0,1.0,0.0,75.0
max,78295.0,9.0,60.0,5.0,34.0,1.0,1.0,99.0


In [26]:
df.shape

(17417, 13)

In [28]:
print(df.isnull().sum())

employee_id                 0
department                  0
region                      0
education                 771
gender                      0
recruitment_channel         0
no_of_trainings             0
age                         0
previous_year_rating     1363
length_of_service           0
KPIs_met_more_than_80       0
awards_won                  0
avg_training_score          0
dtype: int64


In [31]:
print((df.isnull().sum()/len(df))*100)

employee_id              0.000000
department               0.000000
region                   0.000000
education                4.426710
gender                   0.000000
recruitment_channel      0.000000
no_of_trainings          0.000000
age                      0.000000
previous_year_rating     7.825688
length_of_service        0.000000
KPIs_met_more_than_80    0.000000
awards_won               0.000000
avg_training_score       0.000000
dtype: float64


In [32]:
df['previous_year_rating'].fillna(df['previous_year_rating'].median(), inplace=True)

/var/folders/nb/_rv6f3v52rq4wx3ktv25ltvr0000gn/T/ipykernel_77547/3909758819.py:1: ChainedAssignmentError: A value is being set on a copy of a DataFrame or Series through chained assignment using an inplace method.
Such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy (due to Copy-on-Write).

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' instead, to perform the operation inplace on the original object, or try to avoid an inplace operation using 'df[col] = df[col].method(value)'.

See the documentation for a more detailed explanation: https://pandas.pydata.org/pandas-docs/stable/user_guide/copy_on_write.html
  df['previous_year_rating'].fillna(df['previous_year_rating'].median(), inplace=True)


0        3.0
1        3.0
2        1.0
3        2.0
4        4.0
        ... 
17412    5.0
17413    1.0
17414    1.0
17415    1.0
17416    5.0
Name: previous_year_rating, Length: 17417, dtype: float64

In [33]:
df['education'].fillna(df['education'].mode()[0], inplace=True)

/var/folders/nb/_rv6f3v52rq4wx3ktv25ltvr0000gn/T/ipykernel_77547/2668662719.py:1: ChainedAssignmentError: A value is being set on a copy of a DataFrame or Series through chained assignment using an inplace method.
Such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy (due to Copy-on-Write).

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' instead, to perform the operation inplace on the original object, or try to avoid an inplace operation using 'df[col] = df[col].method(value)'.

See the documentation for a more detailed explanation: https://pandas.pydata.org/pandas-docs/stable/user_guide/copy_on_write.html
  df['education'].fillna(df['education'].mode()[0], inplace=True)


0        Bachelors
1        Bachelors
2        Bachelors
3        Bachelors
4        Bachelors
           ...    
17412    Bachelors
17413    Bachelors
17414    Bachelors
17415    Bachelors
17416    Bachelors
Name: education, Length: 17417, dtype: str

In [34]:
categorical_columns = ['department', 'region', 'education', 'gender', 'recruitment_channel']


In [35]:
numerical_columns = ['employee_id', 'no_of_trainings', 'age', 'previous_year_rating', 
                     'length_of_service', 'KPIs_met_more_than_80', 'awards_won', 'avg_training_score']


In [41]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
import pandas as pd

scaler = StandardScaler()

df_processed = df.copy()


df_processed[numerical_columns] = scaler.fit_transform(df[numerical_columns])

print("Numerical columns scaled with StandardScaler:")
print(df_processed[numerical_columns].head())
print("\nScaled data statistics:")
print(df_processed[numerical_columns].describe())

Numerical columns scaled with StandardScaler:
   employee_id  no_of_trainings       age  previous_year_rating  \
0    -1.337047        -0.420921 -1.404733                   NaN   
1     1.556678        -0.420921 -0.494913             -0.273015   
2     1.460890        -0.420921 -0.494913             -1.853610   
3    -0.022967         2.936614 -0.494913             -1.063313   
4     1.118739        -0.420921 -0.624887              0.517282   

   length_of_service  KPIs_met_more_than_80  awards_won  avg_training_score  
0          -1.150032               1.336682   -0.154684            1.030250  
1          -0.192043              -0.748121   -0.154684           -0.907476  
2          -0.431541              -0.748121   -0.154684           -1.205587  
3           0.765946              -0.748121   -0.154684            0.135915  
4           0.286951              -0.748121   -0.154684           -0.162197  

Scaled data statistics:
        employee_id  no_of_trainings           age  previo

In [42]:

label_encoders = {}

for col in categorical_columns:
    le = LabelEncoder()
    df_processed[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le

print("Categorical columns encoded with LabelEncoder:")
print(df_processed[categorical_columns].head())
print("\nEncoding mappings:")
for col, encoder in label_encoders.items():
    print(f"\n{col}: {dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))}")

Categorical columns encoded with LabelEncoder:
   department  region  education  gender  recruitment_channel
0           8      18          0       1                    2
1           2      28          0       0                    0
2           7       4          0       1                    0
3           5      11          0       0                    0
4           1      21          0       1                    2

Encoding mappings:

department: {'Analytics': np.int64(0), 'Finance': np.int64(1), 'HR': np.int64(2), 'Legal': np.int64(3), 'Operations': np.int64(4), 'Procurement': np.int64(5), 'R&D': np.int64(6), 'Sales & Marketing': np.int64(7), 'Technology': np.int64(8)}

region: {'region_1': np.int64(0), 'region_10': np.int64(1), 'region_11': np.int64(2), 'region_12': np.int64(3), 'region_13': np.int64(4), 'region_14': np.int64(5), 'region_15': np.int64(6), 'region_16': np.int64(7), 'region_17': np.int64(8), 'region_18': np.int64(9), 'region_19': np.int64(10), 'region_2': np.int64(11)

In [43]:
print("Final Preprocessed DataFrame:")
print(df_processed.head())
print(f"\nShape: {df_processed.shape}")
print(f"\nData Types:\n{df_processed.dtypes}")
print("\nSummary Statistics (All Columns):")
print(df_processed.describe())

Final Preprocessed DataFrame:
   employee_id  department  region  education  gender  recruitment_channel  \
0    -1.337047           8      18          0       1                    2   
1     1.556678           2      28          0       0                    0   
2     1.460890           7       4          0       1                    0   
3    -0.022967           5      11          0       0                    0   
4     1.118739           1      21          0       1                    2   

   no_of_trainings       age  previous_year_rating  length_of_service  \
0        -0.420921 -1.404733                   NaN          -1.150032   
1        -0.420921 -0.494913             -0.273015          -0.192043   
2        -0.420921 -0.494913             -1.853610          -0.431541   
3         2.936614 -0.494913             -1.063313           0.765946   
4        -0.420921 -0.624887              0.517282           0.286951   

   KPIs_met_more_than_80  awards_won  avg_training_score  
0  