# Transforming and Enriching Data

If you have not already done so, install the required Python libraries. See
[Installing Required Python Libraries](../00_Installing_Required_Python_Libraries.md).

Begin by importing the required packages.

In [1]:
import pandas as pd

## Run the data access notebooks

In [2]:
%run '../03_01_Accessing_and_Exploring_Data/01_Accessing_and_Reading_Local_Files.ipynb'
%run '../03_01_Accessing_and_Exploring_Data/02_Accessing_and_Reading_Data_Lake_Files.ipynb'
%run '../03_01_Accessing_and_Exploring_Data/03_Accessing_and_Reading_Database-Data_Lakehouse_Data.ipynb'

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customerSubscrCode  3 non-null      int64 
 1   customerSubscrStat  3 non-null      object
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ID                      5000 non-null   float64
 1   LostCustomer            5000 non-null   float64
 2   regionPctCustomers      5000 non-null   float64
 3   numOfTotalReturns       5000 non-null   float64
 4   wksSinceLastPurch       5000 non-null   float64
 5   basktPurchCount12Month  5000 non-null   float64
 6   LastPurchaseAmount      5000 non-null   float64
 7   AvgPurchaseAmount12     5000 non-null   float64
 8   AvgPurchase

## Run the data joining notebook

In [3]:
%run './01_Combining_Data.ipynb'

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customerSubscrCode  3 non-null      int64 
 1   customerSubscrStat  3 non-null      object
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ID                      5000 non-null   float64
 1   LostCustomer            5000 non-null   float64
 2   regionPctCustomers      5000 non-null   float64
 3   numOfTotalReturns       5000 non-null   float64
 4   wksSinceLastPurch       5000 non-null   float64
 5   basktPurchCount12Month  5000 non-null   float64
 6   LastPurchaseAmount      5000 non-null   float64
 7   AvgPurchaseAmount12     5000 non-null   float64
 8   AvgPurchase

## Feature Engineering

### Replace codes with labels for demHomeOwner

In [4]:
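# Map the one-letter codes to readable labels; codes not in the dictionary become NaN
df['demHomeOwner'] = df['DemHomeOwnerCode'].map({'U': 'Unknown', 'H': 'HomeOwner'})

# Drop the original code column now that the labeled column exists
df.drop(columns='DemHomeOwnerCode', inplace=True)

df['demHomeOwner'].head()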

0      Unknown
1    HomeOwner
2    HomeOwner
3      Unknown
4    HomeOwner
Name: demHomeOwner, dtype: object
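
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a code other than `U` or `H` would silently become missing. The sketch below uses a small made-up frame to show one way to count such codes and give them an explicit label. The `X` code and the `Other` label are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Toy data; the 'X' code is hypothetical and only illustrates an unmapped value
toy = pd.DataFrame({'DemHomeOwnerCode': ['U', 'H', 'X', None]})

labels = {'U': 'Unknown', 'H': 'HomeOwner'}

# Codes missing from the dictionary (and missing values) map to NaN
mapped = toy['DemHomeOwnerCode'].map(labels)

# Count real codes that had no label, ignoring values that were already missing
unmapped = int((mapped.isna() & toy['DemHomeOwnerCode'].notna()).sum())

# Give unmapped codes an explicit label ('Other' is an assumed name);
# rows whose code was missing to begin with stay missing
toy['demHomeOwner'] = mapped.where(
    mapped.notna() | toy['DemHomeOwnerCode'].isna(), 'Other'
)
```

Checking `unmapped` before dropping the code column is a cheap way to confirm that the dictionary covers every code present in the data.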

### Compute customer age

In [5]:
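import numpy as np

# Approximate age in years: elapsed days divided by the average year length (365.25 days)
df['customerAge'] = (pd.Timestamp.now() - pd.to_datetime(df['birthDate'])).dt.days / 365.25

# Truncate to whole years; missing birth dates stay NaN, so the column remains float64
df['customerAge'] = df['customerAge'].apply(lambda x: int(x) if pd.notnull(x) else np.nan)

df['customerAge'].head()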

0    19.0
1    50.0
2    18.0
3    18.0
4    22.0
Name: customerAge, dtype: float64
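
Because the calculation uses `pd.Timestamp.now()`, the ages change depending on when the notebook is run. As a minimal sketch, assuming a fixed reference date is acceptable for the analysis, the same day-count approach can be made reproducible and stored as whole numbers with the nullable `Int64` dtype. The `as_of` date and the toy birth dates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy birth dates; values are made up for illustration
toy = pd.DataFrame({'birthDate': ['2000-06-15', '1975-01-01', None]})

# Assumed fixed reference date so the result does not change between runs
as_of = pd.Timestamp('2024-06-14')

birth = pd.to_datetime(toy['birthDate'])

# Elapsed days divided by the average year length, as in the cell above
age_years = (as_of - birth).dt.days / 365.25

# Floor to whole years; the nullable Int64 dtype keeps missing values as <NA>
toy['customerAge'] = np.floor(age_years).astype('Int64')
```

Dividing by 365.25 is an approximation, so a result can be off by one for someone very close to a birthday. The nullable integer dtype avoids the `float64` column seen in the output above.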

### Compute average purchase amount per ad

In [6]:
# Average 12-month purchase amount per ad exposure; a zero exposure count yields inf or NaN
df['AvgPurchasePerAd'] = df['AvgPurchaseAmount12'] / df['intAdExposureCount12']

df['AvgPurchasePerAd'].head()

0    0.000000
1    4.807692
2    0.000000
3    1.290323
4    0.000000
Name: AvgPurchasePerAd, dtype: float64
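
If `intAdExposureCount12` can be zero, the division produces `inf` (or `NaN` when the purchase amount is also zero), which can distort later summaries. The sketch below, on made-up values, shows one possible guard: treating a zero exposure count as missing before dividing. Whether missing is the right meaning for such rows is an assumption that depends on the analysis.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; the last row has no ad exposures
toy = pd.DataFrame({
    'AvgPurchaseAmount12': [0.0, 125.0, 40.0],
    'intAdExposureCount12': [10.0, 26.0, 0.0],
})

# Treat a zero exposure count as missing so the ratio is NaN rather than inf
exposures = toy['intAdExposureCount12'].replace(0, np.nan)

toy['AvgPurchasePerAd'] = toy['AvgPurchaseAmount12'] / exposures
```

Rows with no exposures then show up as missing values, which `isna()` can find and which most pandas aggregations skip by default.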