In [1]:
#############################################################
#       Predicting Red Hat Business Value 
#############################################################
"""
Reference: https://www.kaggle.com/c/predicting-red-hat-business-value

The organization is an American multinational software company that provides open source software products to the 
enterprise community. Their primary product is Red Hat Enterprise Linux, the most popular distribution of Linux OS, 
used by various large enterprises. In its services, it helps organizations align their IT strategies by providing 
enterprise-grade solutions through an open business model and an affordable, predictable subscription model. 
These subscriptions from large enterprise customers create a substantial part of their revenue, and therefore it is 
of paramount importance for them to understand their valuable customers and serve them better by prioritizing 
resources and strategies to drive improved business value.

How Can We Identify a Potential Customer?
Red Hat has been in existence for over 25 years. Over this long stint in business, they have accumulated 
a vast amount of data from customer interactions and their descriptive attributes. This rich source of data could be 
a gold mine of patterns that help in identifying a potential customer by studying the complex historical
interaction data.
With the ever-growing popularity and prowess of deep learning (DL), we can develop a deep neural network (DNN) that 
learns from historical customer attributes and operational interaction data and predicts whether a new customer is
likely to be a high-value customer for various business services.
We will therefore develop and train a DNN to estimate the chance that a customer is a potential high-value 
customer, using various customer attributes and operational interaction attributes.
"""
# Exploring the Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Import the 2 datasets provided in the Zip Folder
act_train = pd.read_csv('D:\\ml-data\\predicting-red-hat-business-value\\act_train.csv')
people = pd.read_csv('D:\\ml-data\\predicting-red-hat-business-value\\people.csv')

In [3]:
# Explore the shape of the datasets
print('Shape of DF:', act_train.shape)
print('Shape of People DF:', people.shape)

Shape of DF: (2197291, 15)
Shape of People DF: (189118, 41)


In [4]:
# Explore the contents of the first dataset
act_train.head()

Unnamed: 0,people_id,activity_id,date,activity_category,char_1,char_2,char_3,char_4,char_5,char_6,char_7,char_8,char_9,char_10,outcome
0,ppl_100,act2_1734928,2023-08-26,type 4,,,,,,,,,,type 76,0
1,ppl_100,act2_2434093,2022-09-27,type 2,,,,,,,,,,type 1,0
2,ppl_100,act2_3404049,2022-09-27,type 2,,,,,,,,,,type 1,0
3,ppl_100,act2_3651215,2023-08-04,type 2,,,,,,,,,,type 1,0
4,ppl_100,act2_4109017,2023-08-26,type 2,,,,,,,,,,type 1,0


In [5]:
"""
Exploring the contents of the training dataset, we can see that it mostly has customer interaction data. To protect 
the confidentiality of customers and their attributes, the entire dataset is anonymized, which leaves us with little 
knowledge about its true nature. This is a common problem in data science. Quite often, the team that develops DL 
models has to respect the data confidentiality of the end customer and is therefore provided only anonymized and 
sometimes encrypted data. This still shouldn’t be a roadblock. It is 
definitely best to have a data dictionary and a complete understanding of the dataset, but we can still
develop models with the provided information.
"""
# Count the null values in each column of the activity data
act_train.isnull().sum()  # number of nulls per column

people_id                  0
activity_id                0
date                       0
activity_category          0
char_1               2039676
char_2               2039676
char_3               2039676
char_4               2039676
char_5               2039676
char_6               2039676
char_7               2039676
char_8               2039676
char_9               2039676
char_10               157615
outcome                    0
dtype: int64

In [6]:
act_train.shape[0]  # show total row count

2197291

In [7]:
# Calculate the fraction of nulls per column by dividing the null count by the total row count
act_train.isnull().sum() / act_train.shape[0]

people_id            0.000000
activity_id          0.000000
date                 0.000000
activity_category    0.000000
char_1               0.928268
char_2               0.928268
char_3               0.928268
char_4               0.928268
char_5               0.928268
char_6               0.928268
char_7               0.928268
char_8               0.928268
char_9               0.928268
char_10              0.071732
outcome              0.000000
dtype: float64
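
The null fractions above can also be turned into a list of sparse columns programmatically. The following is a minimal sketch on a toy frame, not the real activity data; the helper name `mostly_null_columns` and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the activity data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    'people_id': ['ppl_1', 'ppl_2', 'ppl_3', 'ppl_4'],
    'char_1': [np.nan, np.nan, np.nan, np.nan],
    'char_10': ['type 1', np.nan, 'type 76', 'type 1'],
})

def mostly_null_columns(df, threshold=0.9):
    """Return the names of columns whose null fraction exceeds `threshold`."""
    # The mean of a boolean mask is the fraction of True values, i.e. of nulls
    null_fraction = df.isnull().mean()
    return null_fraction[null_fraction > threshold].index.tolist()

sparse_columns = mostly_null_columns(toy)
print('Columns above the null threshold:', sparse_columns)
```

Applied to `act_train`, a helper like this would be expected to return the nine columns whose null fraction is 0.928268 in the output above.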

In [8]:
"""
Nine features (char_1 through char_9) have about 93% null values. We can’t do much to fix these features. Let’s move 
ahead and have a look at the people dataset.
"""
# Explore the contents of People dataset
people.head()

Unnamed: 0,people_id,char_1,group_1,char_2,date,char_3,char_4,char_5,char_6,char_7,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
0,ppl_100,type 2,group 17304,type 2,2021-06-29,type 5,type 5,type 5,type 3,type 11,...,False,True,True,False,False,True,True,True,False,36
1,ppl_100002,type 2,group 8688,type 3,2021-01-06,type 28,type 9,type 5,type 3,type 11,...,False,True,True,True,True,True,True,True,False,76
2,ppl_100003,type 2,group 33592,type 3,2022-06-10,type 4,type 8,type 5,type 2,type 5,...,False,False,True,True,True,True,False,True,True,99
3,ppl_100004,type 2,group 22593,type 3,2022-07-20,type 40,type 25,type 9,type 4,type 16,...,True,True,True,True,True,True,True,True,True,76
4,ppl_100006,type 2,group 6534,type 3,2022-07-27,type 40,type 25,type 9,type 3,type 8,...,False,False,True,False,False,False,True,True,False,84


In [9]:
"""
Let’s check how many missing data points the customer dataset has. Since the customer dataset has 40+ 
features, we can sum the missing-value counts across all columns with the following code, 
instead of looking at each column individually.
"""
# Calculate the total number of null values in the entire dataset
people.isnull().sum().sum()

0

In [10]:
"""
And we see that none of the columns in the customer dataset has missing values.

To create a consolidated dataset, we need to join the activity and customer data on the people_id key. But before
we do that, we need to take care of a few things. First, we need to drop the columns in the activity data that have
more than 90% missing values, as they cannot be fixed. Second, the “date” and “char_10” columns are present in both
datasets. To avoid a name clash, let us rename the “date” column in the activity dataset to “activity_date” and 
“char_10” in the activity data to “activity_type.” Third, we need to fill the missing values in the 
“activity_type” column. Once these tasks are accomplished, we will join the two datasets and explore the
consolidated data.
"""
# Create the list of columns to drop from activity data (char_1 through char_9)
columns_to_remove = ['char_' + str(x) for x in range(1, 10)]
print('Columns to remove:', columns_to_remove)

# Remove the columns from the activity data; drop() keeps the remaining columns in their original order
act_train = act_train.drop(columns=columns_to_remove)

# Rename the 2 columns to avoid name clashes in merged data
act_train = act_train.rename(columns={'date': 'activity_date', 'char_10': 'activity_type'})

# Replace nulls in the activity_type column with the mode
act_train['activity_type'] = act_train['activity_type'].fillna(act_train['activity_type'].mode()[0])

# Print the shape of the final activity dataset
print('Shape of DF:', act_train.shape)

Columns to remove: ['char_1', 'char_2', 'char_3', 'char_4', 'char_5', 'char_6', 'char_7', 'char_8', 'char_9']
Shape of DF: (2197291, 6)
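
The name clash described in this cell can be checked rather than read off by eye. Below is a small sketch that lists the columns two frames share apart from the join key. The toy frames only mimic the column layout and hold no data, and the helper `clashing_columns` is a hypothetical name introduced here.

```python
import pandas as pd

# Hypothetical empty frames that mimic the column layout described above
toy_activity = pd.DataFrame(
    columns=['people_id', 'activity_id', 'date', 'activity_category', 'char_10', 'outcome'])
toy_people = pd.DataFrame(
    columns=['people_id', 'char_1', 'group_1', 'date', 'char_10', 'char_38'])

def clashing_columns(left, right, key):
    """Return the sorted column names present in both frames, excluding the join key."""
    return sorted((set(left.columns) & set(right.columns)) - {key})

clashes = clashing_columns(toy_activity, toy_people, key='people_id')
print('Columns that would clash on merge:', clashes)
```

Without the rename, pandas would disambiguate such columns with its default `_x` and `_y` suffixes, which is harder to read than explicit names like `activity_date`.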


In [11]:
# We can now join the two datasets to create a consolidated activity and customer attributes dataset.
# Merge the 2 datasets on 'people_id' key
merged_df = act_train.merge(people, on=['people_id'], how='inner')
print('Shape before merging:', act_train.shape)
print('Shape after merging :', merged_df.shape)

Shape before merging: (2197291, 6)
Shape after merging : (2197291, 46)
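
The unchanged row count suggests every activity row found exactly one matching person. One way to make that check explicit is sketched below on toy data, using the `validate` and `indicator` parameters of `DataFrame.merge`. The frames and their values are made up for illustration, and a left join is used here (an assumption, unlike the inner join above) so that unmatched rows stay visible.

```python
import pandas as pd

# Toy frames (hypothetical values): several activities per person, one row per person
toy_activity = pd.DataFrame({
    'people_id': ['ppl_1', 'ppl_1', 'ppl_2'],
    'activity_id': ['act_1', 'act_2', 'act_3'],
    'outcome': [0, 0, 1],
})
toy_people = pd.DataFrame({
    'people_id': ['ppl_1', 'ppl_2', 'ppl_3'],
    'group_1': ['group 1', 'group 2', 'group 3'],
})

checked = toy_activity.merge(
    toy_people,
    on='people_id',
    how='left',              # keep every activity row, matched or not
    validate='many_to_one',  # raises MergeError if people_id repeats in toy_people
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Activity rows with no matching person would be flagged 'left_only'
unmatched = int((checked['_merge'] == 'left_only').sum())
print('Rows after merge:', len(checked), '| unmatched activity rows:', unmatched)
```

If `unmatched` is zero and the row count equals that of the activity frame, an inner join and a left join give the same result.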


In [12]:
"""
Let us now study the target (i.e., the variable we want to predict), named “outcome” in the dataset. We can check 
the distribution between potential vs. nonpotential customers.
"""
print('Unique values for outcome:', merged_df['outcome'].unique())
print('\nPercentage of distribution for outcome-')
print(merged_df['outcome'].value_counts() / merged_df.shape[0])

Unique values for outcome: [0 1]

Percentage of distribution for outcome-
0    0.556046
1    0.443954
Name: outcome, dtype: float64
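
The same class shares can be obtained in one step with `value_counts(normalize=True)`. A minimal sketch on a made-up outcome column follows; the values and variable names are illustrative assumptions, not the real data.

```python
import pandas as pd

# Hypothetical outcome column: three negatives, two positives
toy_outcome = pd.Series([0, 0, 0, 1, 1], name='outcome')

# normalize=True returns each class's share of the rows instead of raw counts
class_share = toy_outcome.value_counts(normalize=True)

# Share of the most frequent class; this is the accuracy of always predicting that class
majority_share = class_share.max()
print(class_share)
print('Majority-class baseline accuracy:', majority_share)
```

The majority share is a useful baseline: a model is only informative if its accuracy clearly exceeds it.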


In [None]:
"""
We can see that the two classes are reasonably balanced: around 44% of the rows are potential customers and 
around 56% are not.
"""