# <center> Predicting Business Value with LightGBM</center>
## <center> From the Red Hat competition database </center>

![red-hat](https://www.channelpartnerinsight.com/w-images/794d78dd-c155-45a5-af67-59b64ad64873/3/IBMRedHat-580x358.jpg)

### Introduction

Hi Kagglers,

In this notebook, I will solve a binary classification problem with LightGBM. The procedure is deliberately simple and works as a tour of the main phases of a machine learning project.

The competition provides two separate data files that can be joined into a single table: a people file and an activity file. This two-file structure is probably the most distinctive feature of the dataset.

As stated in the Kaggle data description:

The people file contains all of the unique people (and the corresponding characteristics) that have performed activities over time. Each row in the people file represents a unique person. Each person has a unique people_id.

The activity file contains all of the unique activities (and the corresponding activity characteristics) that each person has performed over time. Each row in the activity file represents a unique activity performed by a person on a certain date. Each activity has a unique activity_id.

The challenge of this competition is to predict the potential business value of a person who has performed a specific activity. The business value outcome is defined by a yes/no field attached to each unique activity in the activity file. The outcome field indicates whether or not each person has completed the outcome within a fixed window of time after each unique activity was performed.

I would like to acknowledge the work done in these two notebooks, which were a true inspiration:

- Jay Speidell's Red Hat - Exploratory Data Analysis: https://www.kaggle.com/jayspeidell/red-hat-exploratory-data-analysis
    
- M.J Wu's LightGBM with Sklearn interface: https://www.kaggle.com/wwu651/lightgbm-with-sklearn-interface

### Index

1. [Import the necessary libraries](#section1)
2. [Load the data](#section2)
3. [Merging the two datasets](#section3)
4. [Basic exploration](#section4)
5. [Missing values imputation](#section5)
6. [Date variables manipulation](#section6)
7. [Categorical columns treatment](#section7)
8. [Model implementation](#section8)
9. [Prediction and submission](#section9)

#### <a id='section1'>1. Import the necessary libraries</a>

In [None]:
# For processing the data
import pandas as pd
import numpy as np
import datetime
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Visualization tools
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Machine learning
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.metrics import roc_auc_score

# Others
import warnings
warnings.filterwarnings("ignore")

#### <a id='section2'> 2. Load the data </a>

In [None]:
people = pd.read_csv("../input/predicting-red-hat-business-value/people.csv.zip")
activity = pd.read_csv("../input/predicting-red-hat-business-value/act_train.csv.zip")
test = pd.read_csv("../input/predicting-red-hat-business-value/act_test.csv.zip")


#### <a id='section3'>3. Merging the two datasets</a>

In [None]:
df = pd.merge(people, activity, left_on="people_id", right_on="people_id")
df.isnull().sum()[df.isnull().sum() > 0]

In [None]:
test = pd.merge(people, test, left_on="people_id", right_on="people_id")
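
Both files contain columns with the same names (for instance `date` and `char_1`), so the merge renames them with the `_x` (people) and `_y` (activity) suffixes used in the rest of the notebook. A minimal sketch on made-up rows, where `people_toy` and `activity_toy` are hypothetical stand-ins for the real files:

```python
import pandas as pd

# Toy stand-ins for the people and activity files (all values are made up)
people_toy = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2"],
    "date": ["2021-06-29", "2021-01-06"],
    "char_1": ["type 2", "type 1"],
})
activity_toy = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_1", "ppl_2"],
    "activity_id": ["act_1", "act_2", "act_3"],
    "date": ["2023-08-26", "2022-09-27", "2022-10-15"],
    "char_1": [None, "type 5", None],
})

# Inner join on the shared key; overlapping column names
# receive the suffixes _x (left frame) and _y (right frame)
merged_toy = pd.merge(people_toy, activity_toy, on="people_id")

print(merged_toy.columns.tolist())
print(merged_toy.shape)
```

Each activity row is repeated with the characteristics of the person who performed it, so the merged table has one row per activity.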

#### <a id='section4'>4. Basic exploration</a>

**A)** We start with the `people` dataset:

In [None]:
people.head(3)

In [None]:
print("People's shape: ", people.shape, "\n")
people.info()

We can see that `char_38` is the only numeric feature. Let us further analyze this variable:

In [None]:
people.describe().transpose()

In [None]:
f, ax = plt.subplots(figsize=(12,8))

sns.distplot(df[df['outcome']==0]['char_38'], color='#ff8492', ax=ax)
sns.distplot(df[df['outcome']==1]['char_38'], color='#84fff1', ax=ax)
plt.show()

print('Number of 0 value:', df[df.char_38==0]['char_38'].count())
print('Number of 1 value:', df[df.char_38==1]['char_38'].count())

Alright, it seems to be a variable with high variance and good predictive potential for our target. We will definitely keep it.

Let us continue with the exploration of the rest of the variables, the categorical ones:

In [None]:
people.iloc[:, :40].astype("object").describe()

Some features, like `group_1`, are categorical but have too many distinct values. This will be problematic if we want to one-hot encode them later. The `date` column should also be converted to a datetime dtype, or used for some sort of feature engineering.

On the other hand, some features are highly unbalanced:

In [None]:
sns.countplot(x=people["char_1"],
            palette=('#ff8492', '#84fff1'))
plt.show()

**B)** We continue by exploring the `activity` dataset:

In [None]:
activity.head(3)

In [None]:
print("Activity's shape: ", activity.shape, "\n" )
activity.info()

Unlike the `people` dataset, this one has a significant number of missing values, so we will need to impute them.

In [None]:
activity.isnull().sum() 

From `char_1` to `char_9` there are a lot of missing values. They occur in the same observations, so there could be a pattern or reason behind them. The best way to proceed here is to treat these `NaN`s as a category of their own, that is, to create a new category for them. The same happens with `char_10`, on a smaller scale. We will perform this categorization in the next section.

With that many missing values, a common imputation, such as the median or the mean, would introduce too much noise and greatly reduce the variability of the data.
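
To illustrate the difference, here is a small sketch on a made-up categorical column. Since the `char_*` columns are categorical, the mode is used as the analogue of mean or median imputation; that choice is an assumption made for this example only.

```python
import numpy as np
import pandas as pd

# Made-up column where half of the values are missing
toy = pd.Series(["type 1", np.nan, np.nan, "type 2", np.nan, "type 1"], name="char_1")

# Option 1: mode imputation collapses every gap onto the most frequent category
mode_filled = toy.fillna(toy.mode()[0])

# Option 2: keep missingness as a category of its own
flagged = toy.fillna("missing")

print(mode_filled.value_counts())
print(flagged.value_counts())
```

With the second option the model can still tell which rows had no value, which matters when the gaps follow a pattern.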

In [None]:
activity.describe(include="object")

Let us focus now on our target or dependent variable:

In [None]:
activity["outcome"].value_counts()

In [None]:
activity["outcome"] = activity["outcome"].astype('object')

sns.countplot(x=activity["outcome"],
            palette=('#ff8492', '#84fff1'))
plt.show()

Our target is balanced enough not to cause any problems. Its original dtype is `int64`, even though it is clearly a binary variable.

#### <a id='section5'>5. Missing values imputation</a>

There are far too many missing values in some columns. As introduced before, we will treat them as a category of their own:

In [None]:
df.loc[:, df.columns != 'char_38'] = df.loc[:, df.columns != 'char_38'].fillna("missing")
df.isnull().any().sum()

#### <a id='section6'>6. Date variables manipulation</a>
We have two options here. The first is to express each date as a numerical value relative to the minimum date, such as the minutes or seconds elapsed since it. The second is to derive categorical values such as the year, day of the year, weekday, hour, or minute. We will follow the second option:

In [None]:
def get_date_features(df, original_date):
    """Add calendar features derived from `original_date` to `df` in place."""
    features = ["year", "month", "day", "is_month_end", "is_month_start", 
                "is_quarter_end", "is_quarter_start", "is_year_end", "is_year_start"]
    df[original_date] = pd.to_datetime(df[original_date])
    for n in features:
        # The .dt accessor computes each attribute for the whole column at once
        df["{}_{}".format(n, original_date)] = getattr(df[original_date].dt, n)
    df["weekday_{}".format(original_date)] = ["weekday" if x < 5 else "weekend" 
                                              for x in df[original_date].dt.weekday]
    # The original date column is kept here; it is dropped later, when the feature matrix is built

In [None]:
get_date_features(df, "date_x")
get_date_features(df, "date_y")

In [None]:
df.head()
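
For comparison, the first option (a numerical value relative to the minimum date) could look roughly like the sketch below. The dates and the column name `days_since_min_date_x` are made up for illustration and are not used elsewhere in the notebook.

```python
import pandas as pd

# Made-up dates standing in for the date_x column
toy = pd.DataFrame({"date_x": ["2022-01-01", "2022-01-15", "2022-03-01"]})
toy["date_x"] = pd.to_datetime(toy["date_x"])

# Days elapsed since the earliest date in the column
toy["days_since_min_date_x"] = (toy["date_x"] - toy["date_x"].min()).dt.days

print(toy)
```

This yields a single ordered feature per date column instead of several calendar components.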

#### <a id='section7'>7. Categorical columns treatment</a>

The way to proceed here is to one-hot encode the categorical variables, dropping the first level of each. The only problem is that, as we said before, some features have a lot of distinct values, which would greatly increase the dimensionality of our dataset. So we will only create dummies for variables with fewer than 13 distinct values.

The rest of the features will be label-encoded.

In [None]:
categorical = df.select_dtypes(include=['object'])
column_names = categorical.columns

embed_feats = categorical.nunique() > 12
onehot_feats = categorical.nunique() <= 12

In [None]:
# get_dummies returns a new DataFrame (with the original columns already removed), so assign the result
onehot_cols = onehot_feats[onehot_feats].index
df = pd.get_dummies(df, columns=onehot_cols, drop_first=True)

Label encode those variables with more distinct values:

In [None]:
df[embed_feats[embed_feats == True].index]= df[embed_feats[embed_feats == True].index].apply(
    LabelEncoder().fit_transform)

In [None]:
df.head()
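
The split between one-hot and label encoding can be seen more easily on a tiny made-up frame. This is only a sketch: the values are invented and `threshold` is a hypothetical cut-off chosen for the toy data, whereas the notebook uses 12.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "char_1": ["type 1", "type 2", "type 1", "type 2"],           # low cardinality
    "group_1": ["group 10", "group 27", "group 33", "group 41"],  # high cardinality
})

threshold = 3  # hypothetical cut-off for this toy example
n_unique = toy.nunique()
onehot_cols = n_unique[n_unique <= threshold].index.tolist()
label_cols = n_unique[n_unique > threshold].index.tolist()

# get_dummies returns a new frame, so the result has to be assigned
encoded = pd.get_dummies(toy, columns=onehot_cols, drop_first=True)

# Label-encode each high-cardinality column independently
for col in label_cols:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

print(encoded)
```

The low-cardinality column becomes a single dummy column (the first level is dropped), while the high-cardinality column stays as one integer-coded column.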

#### <a id='section8'>8. Model implementation</a>

In [None]:
y = df["outcome"]
X = df.drop(["people_id", "activity_id", "date_x", "date_y", "outcome"], axis=1)
X = X.iloc[:500000, :]
y = y.iloc[:500000]

In [None]:
kfold=StratifiedKFold(n_splits=5)
print(X.shape)
print(y.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2666)

In [None]:
# I've narrowed down the range of the parameters after some testing
lgbm = LGBMClassifier(random_state=2666)

param_grid = {'num_leaves': [5,10,15, 25],
              'learning_rate': [0.001,0.005],
              'n_estimators': [50,100,500,1000]}

lgbm = GridSearchCV(lgbm, param_grid=param_grid, cv=kfold, scoring="accuracy", n_jobs=3, verbose=2)

lgbm.fit(X_train, y_train)

print(lgbm.best_score_)
print(lgbm.best_params_)
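
The hold-out split created above (`X_test`, `y_test`) can be used to check the tuned model on unseen rows. Since the submission contains probabilities rather than labels, a ranking metric such as ROC AUC (already imported as `roc_auc_score`) may be informative. The sketch below uses synthetic data and a scikit-learn `LogisticRegression` as a stand-in, so that it runs on its own; in the notebook, the fitted `lgbm` object would take the place of `model`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the notebook these would be X_train, X_test, y_train, y_test
X_toy, y_toy = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=0, stratify=y_toy)

# Stand-in estimator: any fitted classifier exposing predict_proba is used the same way
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC is computed from the predicted probability of the positive class
holdout_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(round(holdout_auc, 3))
```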

#### <a id='section9'>9. Prediction and submission</a>

We first have to apply the same preprocessing to the test set:

In [None]:
test_c = test.copy()
test_c.loc[:, test_c.columns != 'char_38'] = test_c.loc[:, test_c.columns != 'char_38'].fillna("missing")

In [None]:
get_date_features(test_c, "date_x")
get_date_features(test_c, "date_y")

In [None]:
categorical = test_c.select_dtypes(include=['object'])
column_names = categorical.columns

embed_feats = categorical.nunique() > 12
onehot_feats = categorical.nunique() <= 12

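# get_dummies returns a new DataFrame (with the original columns already removed), so assign the result
onehot_cols = onehot_feats[onehot_feats].index
test_c = pd.get_dummies(test_c, columns=onehot_cols, drop_first=True)

test_c[embed_feats[embed_feats == True].index]= test_c[embed_feats[embed_feats == True].index].apply(
    LabelEncoder().fit_transform)

In [None]:
# Take the IDs from `test`: `activity_id` in `test_c` has been label-encoded above
test_id = test["activity_id"].astype("object")
test_c = test_c.drop(["people_id", "activity_id", "date_x", "date_y"], axis=1)
# Align with the training features; dummy columns absent from the test set are filled with 0
test_c = test_c.reindex(columns=X.columns, fill_value=0)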

In [None]:
# Check feature number of train and test
missingfeatures = list(set(test_c.columns.tolist()) - set(X.columns.tolist()))
print(missingfeatures)

print(len(X.columns))
print(len(test_c.columns))
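
Matching column names is only part of the check. A `LabelEncoder` fitted separately on the test set can assign a different integer to the same category than the one used in training. One possible way to keep the codes consistent, sketched here on made-up values, is to learn the category list from the training column and reuse it for the test column:

```python
import pandas as pd

# Made-up values standing in for a high-cardinality column
train_toy = pd.Series(["group 10", "group 27", "group 33"], name="group_1")
test_toy = pd.Series(["group 33", "group 99", "group 10"], name="group_1")

# Learn the category list once, from the training column
categories = pd.Categorical(train_toy).categories

# Reuse it for both sets; values unseen in training receive the code -1
train_codes = pd.Categorical(train_toy, categories=categories).codes
test_codes = pd.Categorical(test_toy, categories=categories).codes

print(train_codes.tolist())
print(test_codes.tolist())
```

Here `group 33` and `group 10` keep the same codes in both sets, and the unseen `group 99` is flagged explicitly instead of silently reusing an existing code.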

Prediction time:

In [None]:
results = pd.DataFrame({'activity_id': test_id.values,
                        'outcome': lgbm.predict_proba(test_c)[:,1]})

In [None]:
results.to_csv('redhat_LightGBM.csv', index=False)