# EDA 

### Goals

* EDA on Red Hat business data.

### Comments



In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 500)
sns.set_context("talk", font_scale=1.4)
sns.set_style('whitegrid')

from tqdm.notebook import tqdm
tqdm.pandas()

import missingno as mso

import re

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Step I: Business Goal

* Business Goal: Identify who, when and how (activity) to approach a potential customer to derive the most business value for Red Hat.
* Objective: Create a classification algorithm that accurately identifies which customers have the most potential business value for Red Hat based on their characteristics and activities.

* Classification performance measured in AUC.


* Initial Hypotheses:
    * I. There are some activities which bring a higher business value than other activities.
    * II. During certain times of the year chances are higher to derive business value from customers.
    * III. Some group of people allow for higher business value.
    * IV. Characteristics of people and activities are indicative of business value.


# Step II: Data Extraction

* This competition uses two separate data files that may be joined together to create a single, unified data table: a people file and an activity file.
* people.csv: Each row in the people file represents a unique person. Each person has a unique people_id. Contains characteristics of people.
* activity.csv: 
    * The activity file contains all of the unique activities (and the corresponding activity characteristics) that each person has performed over time. Each row in the activity file represents a unique activity performed by a person on a certain date. Each activity has a unique activity_id. **Unique in the sense of (who, how, when), not actual unique activity characteristics?**
    * The activity file contains several different categories of activities. Type 1 activities are different from type 2-7 activities because there are more known characteristics associated with type 1 activities (nine in total) than type 2-7 activities (which have only one associated characteristic).
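The join described above can be sketched on toy data. The frames `people_toy` and `activity_toy` and their values are made up for illustration; only the key `people_id` and the `char_*` naming follow the competition files. This is a minimal sketch, not the join used later in this notebook.

```python
import pandas as pd

# Toy stand-ins for people.csv and the activity file (values are invented).
people_toy = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2", "ppl_3"],
    "char_1": ["type 1", "type 2", "type 2"],
})
activity_toy = pd.DataFrame({
    "activity_id": ["act_1", "act_2", "act_3", "act_4"],
    "people_id": ["ppl_1", "ppl_1", "ppl_2", "ppl_3"],
    "char_1": ["type 3", None, None, "type 5"],
})

# Both files carry columns named char_*, so suffixes keep them apart.
# validate="many_to_one" raises if people_id is not unique on the people side.
unified = activity_toy.merge(
    people_toy,
    on="people_id",
    how="left",
    suffixes=("_act", "_ppl"),
    validate="many_to_one",
)
print(unified)
```

A left join keeps one row per activity, which matches the unit of prediction (an activity outcome).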

In [None]:
people = pd.read_csv("/kaggle/input/predicting-red-hat-business-value/people.csv.zip")
people.shape

In [None]:
activities = pd.read_csv("/kaggle/input/predicting-red-hat-business-value/act_train.csv.zip")
activities.shape

In [None]:
activities_test = pd.read_csv("/kaggle/input/predicting-red-hat-business-value/act_test.csv.zip")
activities_test.shape

# Step III: Meet and Greet Data


* There are 189k potential customers and 2.1M customer activities in the training set. 
* The test set contains 498k customer activities (18.5% of all activities are in the test set)


* potential typos/mistakes in mixed-type fields: 
     * people: ppl_group, ppl_char_1 - 9
     * activity: act_category, act_char_10


* Missing data
    * ppl: No missing data detected by pandas
    * activities: act_char_1-9 have the same number of missing values. This agrees with the documentation,
    as these are the 9 characteristics only available for activity type 1. **However, act_char_10 also has**
    **missing values. Why?**



* Critical questions

    * people characteristics: char_1 until 38. group_1 meaning? date could be the first contact with the person
    * date fields contain timestamps from future dates! Why? 
    

* data types
    * categorical
        * nominal: ppl_id, act_id, act_outcome (encoded numeric), ppl_char_10-ppl_char_38 (booleans)
        * ordinal: act_category, act_char_1-act_char_10 (types), ppl_group, ppl_char_1-ppl_char_9 (types) , 
    * numeric: 
        * discrete: ppl_char_38
    * date
        * ppl_date, act_date
        
* Variables & Assumptions
   * ppl_id: unique ID of the user
   * act_id: unique ID of activity
   * act_date: date of activity
   * act_category: assume these are the types of activities. Actual types are not known. Only type 1 has the act_char_1-9 variables
   * act_outcome: dependent variable. The business value is encoded in this variable. Exact meaning is not known, and business value could mean many things.
   * ppl_group: specific group of people. Nothing more is known
   * ppl_char_1-ppl_char_9: each variable has multiple types, labeled by number. This does not necessarily mean there is an order!
   * ppl_char_38: surprisingly, this is a numerical variable; meaning not known
   * act_date: dates are in the future! The variable was likely modified to anonymize it. Assume that the modification was only of the year, hence the order in time is still correct.
   * ppl_date: dates in future. Assume this was the date of customer acquisition. Validate possibility in multivariate analysis.
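One way to check the assumption that `ppl_date` is the acquisition date is to compare it with each person's earliest `act_date`. The sketch below uses invented toy frames (`ppl_toy`, `activ_toy`); only the column names follow the renamed data. It is a possible check under that assumption, not a result.

```python
import pandas as pd

# Toy data: dates are invented for illustration.
ppl_toy = pd.DataFrame({
    "ppl_id": ["ppl_1", "ppl_2"],
    "ppl_date": pd.to_datetime(["2022-01-10", "2022-06-01"]),
})
activ_toy = pd.DataFrame({
    "ppl_id": ["ppl_1", "ppl_1", "ppl_2"],
    "act_date": pd.to_datetime(["2022-03-01", "2022-02-15", "2022-05-20"]),
})

# Earliest activity per person.
first_activity = (
    activ_toy.groupby("ppl_id")["act_date"]
    .min()
    .rename("first_act_date")
    .reset_index()
)

# If ppl_date were the acquisition date, it should not be later than the first activity.
check = ppl_toy.merge(first_activity, on="ppl_id", how="left")
check["ppl_date_first"] = check["ppl_date"] <= check["first_act_date"]
share_consistent = check["ppl_date_first"].mean()
print(check)
print("share of people consistent with the assumption:", share_consistent)
```

A share close to 1 on the real data would support the assumption; a low share would argue against it.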

In [None]:
# simplify column naming
ppl = people.rename(columns=dict({name: '_'.join(['ppl',name]) for name in people.columns if 'char' in name}, 
                               **{'date': 'ppl_date', 'group_1': 'ppl_group', 'people_id': 'ppl_id'}))
activ = activities.rename(columns=dict({name: '_'.join(['act',name]) for name in activities.columns if 'char' in name}, 
                               **{'activity_category': 'act_category', 'date':'act_date', 'activity_id': 'act_id', 'outcome': 'act_outcome', 'people_id': 'ppl_id'}))
activ_test = activities_test.rename(columns=dict({name: '_'.join(['act',name]) for name in activities_test.columns if 'char' in name}, 
                               **{'activity_category': 'act_category', 'date':'act_date', 'activity_id': 'act_id', 'outcome': 'act_outcome', 'people_id': 'ppl_id'}))

In [None]:
ppl.head(2)

In [None]:
# sorting of columns
ppl = ppl[['ppl_id', 'ppl_date',  'ppl_group', 'ppl_char_1', 'ppl_char_2'] + ppl.columns[5:].to_list()]

In [None]:
ppl.head(10)

In [None]:
ppl.sample(10, random_state=42)

In [None]:
activ.head(2)

In [None]:
activ.sample(10, random_state=42)

In [None]:
activ_test.head(2)

Ensure same variables in activity train file and activity test file.

In [None]:
assert (activ_test.columns == activ.drop('act_outcome', axis=1).columns).all()
assert (activ_test.dtypes == activ.drop('act_outcome', axis=1).dtypes).all()

In [None]:
ppl.info()

In [None]:
activ.info(show_counts=True)  # null_counts was renamed to show_counts in newer pandas

Convert into more usable data types

In [None]:
for activ_tmp in [activ, activ_test]:
    # vectorized parsing; much faster than an element-wise progress_apply
    activ_tmp['act_date'] = pd.to_datetime(activ_tmp['act_date'])

In [None]:
ppl['ppl_date'] = pd.to_datetime(ppl['ppl_date'])

In [None]:
ppl_char_1_9 = ppl.columns.to_list()[3:12]
ppl_char_1_9

In [None]:
ppl[['ppl_group']+ppl_char_1_9] = ppl[['ppl_group']+ppl_char_1_9].astype('category')

In [None]:
act_cat = activ.columns.to_list()[3:14]
act_cat

In [None]:
activ[act_cat] = activ[act_cat].astype('category')
activ_test[act_cat] = activ_test[act_cat].astype('category')

In [None]:
ppl.dtypes

In [None]:
activ.dtypes

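Merge activities with people. The helper columns `weekday`, `monthday` and `month` are only created further down, so they are dropped if present and then re-added with a prefix:

In [None]:
# errors='ignore' keeps this cell runnable top to bottom, before the helper columns exist
helper_cols = ["weekday", "monthday", "month"]
characts = pd.merge(activ.drop(columns=helper_cols, errors='ignore'),
                    ppl.drop(columns=helper_cols, errors='ignore'), how='left', on='ppl_id')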
characts.shape

In [None]:
# .dt.day_name() avoids the datetime module (imported only later as dt) and a slow row-wise apply
characts['ppl_weekday'] = characts['ppl_date'].dt.day_name()
characts['ppl_monthday'] = characts.ppl_date.dt.day
characts["ppl_month"] = characts.ppl_date.dt.month

characts['act_weekday'] = characts['act_date'].dt.day_name()
characts['act_monthday'] = characts.act_date.dt.day
characts["act_month"] = characts.act_date.dt.month

In [None]:
activ.shape, ppl.shape

# Step IV: Univariate Analysis

* act_outcome: surprisingly, the classes are fairly balanced.
* act_date: activities are spread fairly evenly over time, spanning roughly 1 year and 1 month. There are extreme values, with a maximum of 48174 activities on a single day.
* ppl_id: users. For a significant share of people (20% of people.csv) no activities are recorded in the training set. Can we discard those?


### Business Value

In [None]:
activ['act_outcome'].astype('bool').value_counts(normalize=True)

In [None]:
ax = activ['act_outcome'].astype('bool').value_counts(normalize=True).mul(100).plot(kind='bar')
ax.set_xlabel('business value'); ax.set_ylabel('% of activities'); plt.xticks(rotation=0)

### Activity Dates

How are the activities distributed over time?

In [None]:
activities_per_day = activ.groupby([pd.Grouper(key='act_date', freq='1D')])['act_id'].count().reset_index()
activities_per_day_test = activ_test.groupby([pd.Grouper(key='act_date', freq='1D')])['act_id'].count().reset_index()

In [None]:
activ['act_date'].agg({'min': 'min', 'max': 'max'})

In [None]:
activ_test['act_date'].agg({'min': 'min', 'max': 'max'})

In [None]:
fig, ax = plt.subplots(figsize=(16,4))
sns.lineplot(data=activities_per_day, x='act_date', y='act_id', ax=ax, label='train')
sns.lineplot(data=activities_per_day_test, x='act_date', y='act_id', ax=ax, label='test')

ax.set_ylabel('# of activities'); ax.set_xlabel('date')



* Days with the most and the least activity.

* Activities in the test set span the same time range as activities in the train set. Danger of data leakage! This is not a good train/test split for time series data!
* Also, plotting and analyzing the test set here introduces a positive bias, in particular for feature analysis along the lines I am doing here.


In [None]:
activ.act_date.isin(activ_test.act_date).value_counts()

In [None]:
activ_test.act_date.isin(activ.act_date).value_counts()

> Train and test set cover the same days of activity. One could hence use `act_date` as a feature to get predictions on the test set. However, this might defeat the purpose of the business application, which might be to apply the model to future data.
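If the model is meant for future data, a time-based holdout gives a more honest estimate than a split that shares dates. The following is a minimal sketch on invented data; the 80% cutoff and the names `train_part` and `valid_part` are assumptions for illustration.

```python
import pandas as pd

# Toy activities, one per day (values are invented).
activ_toy = pd.DataFrame({
    "act_id": [f"act_{i}" for i in range(10)],
    "act_date": pd.date_range("2022-07-01", periods=10, freq="D"),
    "act_outcome": [0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
})

# Pick the cutoff so that roughly the last 20% of distinct days are held out.
dates = activ_toy["act_date"].sort_values().unique()
cutoff = dates[int(len(dates) * 0.8)]

# Everything before the cutoff is used for fitting, the rest for validation.
train_part = activ_toy[activ_toy["act_date"] < cutoff]
valid_part = activ_toy[activ_toy["act_date"] >= cutoff]
print(len(train_part), len(valid_part))
```

With such a split, `act_date` itself carries no direct information about the validation rows, which mirrors the situation of scoring future activities.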

In [None]:
activities_per_day.head()

In [None]:
pd.concat([activities_per_day[activities_per_day['act_id'] == activities_per_day['act_id'].max()],
activities_per_day[activities_per_day['act_id'] == activities_per_day['act_id'].min()]],axis=0)

Users: 189118 in people.csv, but only 151295 of them appear in the training activities (`characts`).

In [None]:
import datetime as dt
activ['weekday'] = activ[['act_date']].apply(lambda x: dt.datetime.strftime(x['act_date'], '%A'), axis=1)
activ['monthday'] = activ.act_date.dt.day
activ["month"] = activ.act_date.dt.month


Towards the end of the week the number of customer activities increases, peaking on Friday. Surprisingly, Monday is the lowest. It could be that on Monday, as the first day of the week, people are less willing to engage with a company such as Red Hat.

In [None]:
activ['weekday'].value_counts(normalize=True).plot(kind='bar')

In [None]:
fig , ax = plt.subplots(figsize=(18,4))
activ['monthday'].value_counts(normalize=True).sort_index().plot(kind='bar', ax=ax)
ax.set_ylabel('fraction of activities'); ax.set_xlabel('day of month')

In [None]:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]
map_month_year = {i+1:m for i, m in enumerate(months)}
fract_act_month = activ['month'].value_counts(normalize=True).sort_index()
fract_act_month.index =fract_act_month.index.map(map_month_year)
fract_act_month.plot(kind='bar')

### Number of activities per customer 

Potential outliers

* The median is 5 activities per customer, with a heavy right skew (mean 15).
* Some customers have just one recorded activity.
* About 14k customers are flagged as potential outliers by their activity count, based on the IQR. Some customers have 1000s of activities recorded, which seems odd.

> How to deal with the outliers? Flag with variable? Discard?
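The "flag with variable" option could look like the sketch below. The toy data and the column name `ppl_is_heavy_user` are hypothetical; the threshold uses the same rule as in this notebook (Q3 + 1.5 * IQR on activities per user).

```python
import pandas as pd

# Toy activity table: users a-d have few activities, user e has many.
activ_toy = pd.DataFrame({
    "ppl_id": ["a"] * 2 + ["b"] * 3 + ["c"] * 4 + ["d"] * 5 + ["e"] * 40,
})
activ_toy["act_id"] = [f"act_{i}" for i in range(len(activ_toy))]

# Activities per user and the IQR-based threshold.
acts_per_user = activ_toy.groupby("ppl_id")["act_id"].count()
q25, q75 = acts_per_user.quantile([0.25, 0.75])
threshold = q75 + 1.5 * (q75 - q25)

# Flag instead of discarding, so a model can still use the rows.
heavy_users = acts_per_user.index[acts_per_user >= threshold]
activ_toy["ppl_is_heavy_user"] = activ_toy["ppl_id"].isin(heavy_users)
print(threshold, list(heavy_users))
```

Flagging keeps all rows and leaves the decision to the model, whereas discarding would remove a large number of activities from very active users.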

In [None]:
fig, ax = plt.subplots()
sns.boxplot(x=activ.groupby('ppl_id').count()['act_id'], ax=ax)  # x= makes the horizontal orientation explicit
ax.set_xlabel('# of activities by user')
ax.set_xscale('log')

Derive an outlier threshold from the IQR.

In [None]:
q75=activ.groupby('ppl_id').count()['act_id'].quantile(0.75)
q25=activ.groupby('ppl_id').count()['act_id'].quantile(0.25)
IQR = q75-q25
threshold_outlier = q75 + 1.5*IQR
threshold_outlier

In [None]:
outlier_activ = activ.groupby('ppl_id').count()['act_id'][activ.groupby('ppl_id').count()['act_id']>=threshold_outlier]
outlier_activ.shape

In [None]:
fig, ax = plt.subplots(figsize=(16,4))
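# histplot replaces the deprecated distplot; stat='density' and kde=True approximate its defaults
sns.histplot(activ.groupby('ppl_id').count()['act_id'][activ.groupby('ppl_id').count()['act_id']<threshold_outlier], ax=ax, bins=np.arange(0, 35), stat='density', kde=True)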
ax.set_xlabel('number of activities per user')
ax.set_xlim(1,35)
_ = ax.set_xticks(range(1,35, 2))

In [None]:
activ.groupby('ppl_id').count()['act_id'].describe()

### Activity Uniqueness

In [None]:
activ['act_id'].nunique() == activ['act_id'].shape[0]

### Categories of activities

* Type 2 is the most prevalent category with 41%.
* Type 1, the category with the 9 characteristics, accounts for only 7% of the data.
* Types 6 and 7 are rare and occur in less than 0.1% of the data.

In [None]:
activ['act_category'].value_counts()

In [None]:
activ['act_category'].value_counts(normalize=True)

In [None]:
fig, ax = plt.subplots(figsize=(16,4))
sns.barplot(x=activ.act_category.value_counts().index.to_list(), y=activ.act_category.value_counts(normalize=True), ax=ax)

### Activity Characteristics

In [None]:
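# explicit names; positional slicing breaks once helper columns such as weekday are added
activity_chars_type1 = [f'act_char_{i}' for i in range(1, 10)]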
activity_chars_type1

Validate that only when category type 1 activity is present, characteristics 1-9 are present. This is indeed the case:

In [None]:
activ[activ['act_category']=='type 1'][activity_chars_type1].isnull().sum().sum()

In [None]:
(~activ[activ['act_category']!='type 1'][activity_chars_type1].isnull()).sum().sum()

`act_char_1` and `act_char_2` have the most distinct types.

In [None]:
activ[activity_chars_type1].describe().T.sort_values('unique')

The special characteristic `act_char_10` has a very large number of distinct types:

In [None]:
activ[['act_char_10']].nunique()

In [None]:
activ[['act_char_10']].value_counts(normalize=True).head()

### People ID

People ID is unique in the people table as expected.

In [None]:
ppl['ppl_id'].nunique() == ppl['ppl_id'].shape[0]

### Date of Customer Acquisition

In [None]:
ppl.head()

In [None]:
customer_acquisition_per_day = ppl.groupby([pd.Grouper(key='ppl_date', freq='1D')])['ppl_id'].count().reset_index()

In [None]:
ppl['ppl_date'].agg({'min': 'min', 'max': 'max'})

In [None]:
fig, ax = plt.subplots(figsize=(20,4))
sns.lineplot(data=customer_acquisition_per_day, x='ppl_date', y='ppl_id', ax=ax)
ax.set_ylabel('# of people'); ax.set_xlabel('date')

In [None]:
ppl['weekday'] = ppl[['ppl_date']].apply(lambda x: dt.datetime.strftime(x['ppl_date'], '%A'), axis=1)
ppl['monthday'] = ppl.ppl_date.dt.day
ppl["month"] = ppl.ppl_date.dt.month

In [None]:
ppl['weekday'].value_counts(normalize=True).plot(kind='bar')

In [None]:
fig , ax = plt.subplots(figsize=(18,4))
ppl['monthday'].value_counts(normalize=True).sort_index().plot(kind='bar', ax=ax)
ax.set_ylabel('fraction of people date'); ax.set_xlabel('day of month')

In [None]:
fract_ppl_month = ppl['month'].value_counts(normalize=True).sort_index()
fract_ppl_month.index =fract_ppl_month.index.map(map_month_year)
fract_ppl_month.plot(kind='bar')

### People Group

We find one dominant group with 41%, and all other 34223 groups are <1%. 

In [None]:
ppl['ppl_group'].nunique()

In [None]:
ppl['ppl_group'].value_counts(normalize=True).head()

### People Characteristics

* character types: the variables have between 2 and 43 distinct types. This indicates that some features might carry significantly more information than others.
* character boolean features: each value accounts for at least ~20% of the rows, hence there is no extreme imbalance.
* the special numerical feature (ppl_char_38): could be a percentage. The large number of zero values is suspicious.

In [None]:
ppl_char_types = ppl.columns[3:12].to_list()

In [None]:
ppl[ppl_char_types].describe().T

In [None]:
for col in ppl_char_types:
    print("col : ", col)
    print(ppl[col].value_counts(normalize=True).head(5))


In [None]:
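# explicit names; slicing by position would also pick up ppl_char_38 and the date helper columns
ppl_char_bool = [f'ppl_char_{i}' for i in range(10, 38)]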
ppl[ppl_char_bool].describe()

In [None]:
ppl[ppl_char_bool].apply(lambda s: s.value_counts(normalize=True))  # pd.value_counts is deprecated

In [None]:
ppl['ppl_char_38'].describe().to_frame().T

In [None]:
fig, ax = plt.subplots(figsize=(16,4))
sns.histplot(ppl['ppl_char_38'], kde=True, stat='density', ax=ax)  # replaces the deprecated distplot

# Step V: Multivariate Analysis

* What is the relationship between the ID variables in the files? Are all ppl_id's in the people.csv in the activities csv?


* Answer the initial hypotheses:

    * I. There are some activities which bring a higher business value than other activities.
    * II. During certain times of the year chances are higher to derive business value from customers.
    * III. Some group of people allow for higher business value.
    * IV. Characteristics of people and activities are indicative of business value.

* Is there a relationship between missing values?



### Customer ID relationships in files

Confirm that all user IDs in the activity files are also in the people data.

In [None]:
ppl_set = set(ppl['ppl_id'].unique())
activ_ppl_set = set(activ['ppl_id'].unique())
activ_ppl_set_test = set(activ_test['ppl_id'].unique())

In [None]:
activ_ppl_set.issubset(ppl_set), activ_ppl_set_test.issubset(ppl_set)

Users in test set are not in training set:

In [None]:
activ_ppl_set_test.intersection(activ_ppl_set)

There are no users who are in people.csv but in neither activity file (we could have discarded those if they existed).

In [None]:
ppl_set - activ_ppl_set - activ_ppl_set_test

No activity IDs from the training file appear in the test file (e.g. by accident).

In [None]:
set(activ['act_id']).intersection(set(activ_test['act_id']))


### Is there a relationship between missing values?


It looks like type 1 activities have no `act_char_10` value, which I confirm below. The missingness of one variable thus depends on the value of another.
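A compact way to see such a dependency is the share of missing values per column, grouped by activity category. The toy frame below is invented; only the column names follow the renamed activity data.

```python
import numpy as np
import pandas as pd

# Toy activities: type 1 rows carry act_char_1, all other rows carry act_char_10.
activ_toy = pd.DataFrame({
    "act_category": ["type 1", "type 1", "type 2", "type 3", "type 2"],
    "act_char_1": ["type 3", "type 5", np.nan, np.nan, np.nan],
    "act_char_10": [np.nan, np.nan, "type 76", "type 1", "type 1"],
})

# Fraction of missing values per column within each activity category.
missing_share = (
    activ_toy.drop(columns="act_category")
    .isna()
    .groupby(activ_toy["act_category"])
    .mean()
)
print(missing_share)
```

Values of exactly 0 or 1 in every cell indicate that missingness is fully determined by the category, rather than occurring at random.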

In [None]:
mso.matrix(activ)

In [None]:
activ[activ['act_category']=='type 1']['act_char_10'].nunique()

In [None]:
activ[activ['act_category']!='type 1']['act_char_10'].isnull().sum()

### Hypothesis I.: There are some activities which bring a higher business value than other activities.

Type 6 activities have the highest share of successful business outcomes, at 55%.

In [None]:
ax = activ.groupby('act_category')['act_outcome'].mean().sort_values(ascending=False).plot(kind='barh', figsize=(16,4), color=['r','b', 'y', 'k', 'grey'])
ax.set_title('fraction of activities with business value, by activity category')
ax.set_xlabel('fraction')  # barh: the fraction is on the x-axis
_=plt.xticks([0.1,0.2,0.3,0.4,0.5, 0.6])

In [None]:
activ.head()

In [None]:
with sns.plotting_context("talk", font_scale=1):
    fig, ax = plt.subplots(len(activ.columns[4:13].to_list()), 1, figsize=(20,20), sharex=True)
    axes = ax.flatten()
    for i, activ_col in enumerate(activ.columns[4:13].to_list()):
        sns.barplot(data=activ.groupby(activ_col)['act_outcome'].mean().reset_index(), x='act_outcome', y=activ_col, ax=axes[i], 
                    order=activ.groupby(activ_col)['act_outcome'].mean().sort_values().index) #, palette=sns.color_palette("Blues_d", n_colors=60))
        axes[i].set_ylabel(activ_col); axes[i].set_xlabel('')
        axes[i].get_yaxis().set_ticks([])
        #axes[i].set_title(activ_col)
axes[i].set_xlabel('fraction')

### Time dependence of customer outcomes

Day of the week matters, and it appears (!) that on days with a higher chance of customer success, more activities take place.

In [None]:
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
fig, axes = plt.subplots(2, figsize=(16,8), sharex=True)
activ.groupby("weekday")["act_outcome"].value_counts(normalize=True).sort_index().unstack().reindex(cats).plot(kind='bar', ax=axes[0] )
axes[0].legend(loc=4)
activ["weekday"].value_counts(normalize=True).reindex(cats).plot(kind='bar', ax=axes[1])

In [None]:
fig , ax = plt.subplots(figsize=(18,4))
activ.groupby('monthday')['act_outcome'].mean().sort_index().plot(kind='bar', ax=ax)
plt.title('fraction of positive activity outcome'); ax.set_xlabel('day of month')

Success of business activities appears to depend on the month.

In [None]:
# fract_act_month = activ['month'].value_counts(normalize=True).sort_index()

fract_act_month_act = activ.groupby('month')['act_outcome'].mean()
fract_act_month_act.index =fract_act_month_act.index.map(map_month_year)
fract_act_month_act.plot(kind='bar'); plt.title('fraction of successful business outcomes')

### Group and date

In [None]:
characts.head()

In [None]:
outcome_by_grp_actdate = characts.groupby(["act_date", "ppl_group"])['act_outcome'].mean()

Because `ppl_group` is categorical, the groupby returns every (date, group) combination, so combinations without any activity show up as missing values.

In [None]:
outcome_by_grp_actdate.dropna().head()

In [None]:
outcome_by_grp_actdate.dropna().value_counts()

> This implies that on any given day in the training set, all activities of a `ppl_group` have the same outcome: either act_outcome=0 or act_outcome=1. There is no ppl_group for which some activities on a day had outcome 0 and others 1.

Simple classifier scheme: `act_date` > `ppl_group` category > fixed outcome.

As all dates in the test set are also in the training set, one can use the `act_date` as feature. 
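The scheme above can be sketched as a lookup table keyed on (`act_date`, `ppl_group`). The toy frames are invented, and the fallback to the global outcome rate for unseen combinations is an assumption made here for illustration, not something derived from the data.

```python
import pandas as pd

# Toy training and test activities (values are invented).
train_toy = pd.DataFrame({
    "act_date": pd.to_datetime(["2022-07-01"] * 3 + ["2022-07-02"] * 2),
    "ppl_group": ["group 1", "group 1", "group 2", "group 1", "group 2"],
    "act_outcome": [1, 1, 0, 0, 1],
})
test_toy = pd.DataFrame({
    "act_date": pd.to_datetime(["2022-07-01", "2022-07-02", "2022-07-02"]),
    "ppl_group": ["group 2", "group 1", "group 3"],
})

# Mean outcome per (date, group); observed=True skips unused category combinations.
lookup = (
    train_toy.groupby(["act_date", "ppl_group"], observed=True)["act_outcome"]
    .mean()
    .rename("pred")
    .reset_index()
)

# Assumption: unseen (date, group) pairs fall back to the global outcome rate.
global_rate = train_toy["act_outcome"].mean()
scored = test_toy.merge(lookup, on=["act_date", "ppl_group"], how="left")
scored["pred"] = scored["pred"].fillna(global_rate)
print(scored)
```

The quality of such a lookup depends on how many test rows hit a known (date, group) pair, so the hit rate is worth checking before relying on it.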


In [None]:
fig, axes = plt.subplots(2, figsize=(16,8), sharex=True)
characts.groupby("ppl_weekday")["act_outcome"].value_counts(normalize=True).sort_index().unstack().reindex(cats).plot(kind='bar', ax=axes[0] )
axes[0].legend(loc=4)
characts["ppl_weekday"].value_counts(normalize=True).reindex(cats).plot(kind='bar', ax=axes[1])

### Categorical variables with many values

How useful are these variables?

* Extreme values for act_char_10 and ppl_group, and to a far lesser extent ppl_char_38

In [None]:
characts.nunique()[characts.nunique() > 10]

### Correlations between features

In [None]:
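import scipy.stats as ss  # provides chi2_contingency

def cramers_v(x, y):
    """Bias-corrected Cramér's V between two categorical series."""
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # bias correction of phi^2 and of the table dimensions
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))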

In [None]:
characts.info()

In [None]:
characts.nunique()

In [None]:
corr_features = characts.nunique()[characts.nunique() < 5].index

In [None]:
corr_dummies = pd.get_dummies(characts[corr_features],drop_first=True)

In [None]:
corr_dummies.columns

In [None]:
corr_pearson = corr_dummies.corr() 

Some categories are highly correlated! Could be removed in modeling.

* Above 90% correlation only: ppl_char_28 and ppl_char_21
* Some features are above 80% correlation.
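Removing one feature of each highly correlated pair could be sketched as follows. The toy features are constructed so that `feat_b` is a near copy of `feat_a`; the 0.9 threshold mirrors the 90% mentioned above, and keeping the first feature of each pair is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy binary features: feat_b differs from feat_a in only 4 of 200 rows.
feat_a = np.tile([0, 1], 100)
feat_b = feat_a.copy()
feat_b[:4] = 1 - feat_b[:4]
feat_c = np.tile([0, 0, 1, 1], 50)
features_toy = pd.DataFrame({"feat_a": feat_a, "feat_b": feat_b, "feat_c": feat_c})

# Absolute Pearson correlation, keeping only the part above the diagonal
# so that each pair is considered once.
corr = features_toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the second feature of every pair above the threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = features_toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Which feature of a pair to keep is a modeling decision; the correlation with `act_outcome` could serve as a tie-breaker.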

In [None]:
with sns.plotting_context("talk", font_scale=0.6):
    fig, ax = plt.subplots(figsize=(22,14))
    lower_triangle = np.tril(corr_pearson, k = -1)
    mask = lower_triangle == 0
    sns.heatmap(corr_pearson, annot=True, ax=ax, fmt=".2f", mask=mask)

Sort to identify the most highly correlated feature pairs:

In [None]:
high_corr_features = [[pair[0], pair[1]] for pair in corr_pearson[corr_pearson > 0.8].stack().index.tolist() if not pair[1]==pair[0]]
high_corr_features = [[pair[1], pair[0], corr_pearson.loc[pair[1], pair[0]]] for pair in high_corr_features]
pd.DataFrame(sorted(high_corr_features, key=lambda x: x[2])[::-1], columns=['feature1', 'feature2', 'corr_coef']).drop_duplicates(subset=['corr_coef'])

In [None]:
corr_pearson['act_outcome'].sort_values(ascending=False)