# Data Analysis and Feature Encoding

Before moving on to the prediction step, the data needs to be analyzed. This document presents the following stages of this case study:
1. Examining the data
2. Managing the missing values
3. Visualizing individual data columns
4. Cleaning the data
5. Checking correlations
6. Feature encoding
7. Separating the known and unknown 'hits' data

Before moving on, let's import all the packages we'll be needing here.

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, Imputer, LabelEncoder, MinMaxScaler
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns
from sklearn.feature_extraction import FeatureHasher

# Examining the Data

First, let's have a look at the data set as a whole to understand its schema.

In [2]:
nRowsRead = None  # This can be changed into an integer in order to load a smaller chunk of the data.
df = pd.read_csv('ML Data Scientist Case Study Data.csv', delimiter= ';', nrows = nRowsRead)
print(df.shape)
df.info()

(988681, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 988681 entries, 0 to 988680
Data columns (total 10 columns):
row_num              988681 non-null int64
locale               988681 non-null object
day_of_week          988681 non-null object
hour_of_day          988681 non-null int64
agent_id             988681 non-null int64
entry_page           988681 non-null int64
path_id_set          983792 non-null object
traffic_type         988681 non-null int64
session_durantion    988681 non-null object
hits                 988681 non-null object
dtypes: int64(5), object(5)
memory usage: 75.4+ MB


There is a typo in one of the column names, so let's correct it to avoid confusion later. Also, let's replace the unknown values (stored as '\N') with numpy NaNs so that they are easier to handle.

In [6]:
# fixing the typo in the column name
df = df.rename(columns={'session_durantion': 'session_duration'})
# replacing the '\N' placeholder for missing values with NaN
df = df.replace('\\N', np.nan)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 988681 entries, 0 to 988680
Data columns (total 10 columns):
row_num             988681 non-null int64
locale              988681 non-null object
day_of_week         988681 non-null object
hour_of_day         988681 non-null int64
agent_id            988681 non-null int64
entry_page          988681 non-null int64
path_id_set         983792 non-null object
traffic_type        988681 non-null int64
session_duration    988013 non-null object
hits                619235 non-null object
dtypes: int64(5), object(5)
memory usage: 75.4+ MB
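
As an alternative to replacing the placeholder after loading, `read_csv` can treat it as missing while parsing, through its `na_values` parameter. The sketch below uses a small made-up CSV string (the values are illustrative and not taken from the data set) to show the effect.

```python
import io
import pandas as pd

# Tiny made-up sample that mimics the file's ';' delimiter and '\N' placeholder.
sample_csv = "session_duration;hits\n12;3\n\\N;5\n40;\\N\n"

# na_values tells the parser to read '\N' as NaN straight away.
sample = pd.read_csv(io.StringIO(sample_csv), delimiter=';', na_values='\\N')

print(sample.dtypes)          # both columns are parsed as numeric
print(sample.isnull().sum())  # one missing value per column
```

With this approach the affected columns are parsed as numeric types directly, so no later type conversion would be needed.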


In [7]:
df.isnull().sum()

row_num                  0
locale                   0
day_of_week              0
hour_of_day              0
agent_id                 0
entry_page               0
path_id_set           4889
traffic_type             0
session_duration       668
hits                369446
dtype: int64

We can see here that a couple of the feature columns also have missing values. Let's tackle those in the following section.

# Fill in the missing values 

Let's start by filling in the missing values in session_duration using a simple imputer with a mean strategy. We could have used an iterative imputer, but that seemed like overkill for such a small number of missing values.

In [8]:
# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer

# changing to a numeric type first so that the mean can be computed
df['session_duration'] = pd.to_numeric(df['session_duration'])
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df['session_duration'] = imputer.fit_transform(df[['session_duration']]).ravel()
df.isnull().sum()

row_num                  0
locale                   0
day_of_week              0
hour_of_day              0
agent_id                 0
entry_page               0
path_id_set           4889
traffic_type             0
session_duration         0
hits                369446
dtype: int64

Values filled in. Now, let's deal with the missing path_id_set values.

In [9]:
# analyzing the entries with missing path ids
df_no_path = df[df['path_id_set'].isna()]
print(df_no_path['session_duration'].describe())
print(df_no_path['hits'].describe())

count     4889.000000
mean       905.935570
std       6441.779028
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max      82462.000000
Name: session_duration, dtype: float64
count     3153
unique      44
top          1
freq      2743
Name: hits, dtype: object


For the rows with a missing path_id_set, session_duration is zero for at least three quarters of them, and the most frequent hits value is 1. This suggests that a missing path_id_set can be treated as an 'empty set' of values rather than as a missing value. Also, now that '\N' has been replaced with np.nan, we can convert the 'hits' column to a numeric type.

In [10]:
df['hits'] = df['hits'].astype(float) # changing to numeric type; float because of numpy nans

# Visualizing individual columns

Here, let's visualize the individual features of the data to get a better understanding of what we are dealing with.

In [None]:
df['locale'].value_counts().plot(kind='bar')

In [None]:
df['day_of_week'].value_counts().plot(kind='bar')

In [None]:
df['hour_of_day'].value_counts().plot(kind='bar')

In [None]:
df['agent_id'].value_counts().plot(kind='bar')

In [None]:
df['entry_page'].plot(kind='hist')

In [None]:
df['traffic_type'].value_counts().plot(kind='bar')

In [None]:
print(df['session_duration'].describe())
df['session_duration'].plot(kind='hist', bins=100)


The description of this column and its distribution tell us that a very small fraction of the values may be skewing the picture and are not representative of the data in general. Let's deal with this in the next section.

In [None]:
df['path_id_set'].describe()

In [None]:
print(df['hits'].describe())
df['hits'].plot(kind='hist', bins=100)

Here again, we see a small number of outliers skewing the overall representation disproportionately.

# Data cleaning

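Let's remove the extreme outliers using quantiles: among the rows with known hits, we keep only those whose session_duration and hits are both below the 98th percentile. Rows with unknown hits are kept unchanged. Model training improved drastically with this simple yet effective filtering on the two columns from the previous section.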

In [None]:
q1 = df['session_duration'].quantile(0.98)
q2 = df['hits'].quantile(0.98)

df_clean = df.copy()
df_empty = df_clean[df_clean['hits'].isna()]
df_not_empty = df_clean[df_clean['hits'].notna()]

df_clean = df_not_empty[df_not_empty['session_duration'] < q1]
df_clean = df_clean[df_clean['hits'] < q2]
df_clean = pd.concat([df_clean, df_empty], axis=0)
df_clean.info()  # info() prints its summary itself and returns None
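
To make the effect of the quantile cut-off concrete, here is a minimal sketch on made-up numbers (99 ordinary values plus one extreme value). The frame `toy` and its contents are assumptions for illustration only.

```python
import pandas as pd

# Made-up values: 99 ordinary sessions plus one extreme outlier.
toy = pd.DataFrame({'session_duration': list(range(1, 100)) + [50000]})

# Same idea as above: keep only rows strictly below the 98th percentile.
threshold = toy['session_duration'].quantile(0.98)
trimmed = toy[toy['session_duration'] < threshold]

print(threshold)
print(len(toy), len(trimmed))
# The mean drops sharply once the outlier is gone.
print(toy['session_duration'].mean(), trimmed['session_duration'].mean())
```

Note that a quantile cut always removes the top slice of the data, whether or not those rows are true outliers, so the threshold is worth checking against the histograms.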

Furthermore, the correlation analysis (described later in this notebook) revealed that the number of hits has a higher correlation with the number of locations in the path_id_set than with the actual locations themselves. So we'll add an additional column, path_length, to help us build a better model.

In [None]:
# mapping function to get the path_length from path_id_set
def get_path_length(x):
    # a missing path_id_set is treated as an empty set;
    # pd.isna is more robust than an identity check against np.nan
    if pd.isna(x):
        return 0
    return len(str(x).split(';'))

In [None]:
df_clean['path_length'] = df_clean['path_id_set'].apply(get_path_length)

We now have a new path_length column in our data.
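
The same column could also be computed without a Python-level mapping function. The sketch below is one possible vectorized variant, shown on made-up path_id_set values; it assumes the entries are ';'-separated strings or NaN.

```python
import numpy as np
import pandas as pd

# Made-up path_id_set values, including one missing entry.
paths = pd.Series(['31672;0', '0', np.nan, '1;2;3'])

# Number of IDs = number of ';' separators + 1; missing sets count as length 0.
path_length = paths.str.count(';').add(1).fillna(0).astype(int)

print(path_length.tolist())  # [2, 1, 0, 3]
```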

# Correlations

Before checking correlations, we'll separate out the rows with known hits so that the NaNs don't interfere with the correlation calculations.

In [None]:
df_wo_hits = df_clean[df_clean['hits'].isna()]
df_hits = df_clean[df_clean['hits'].notna()]

First, let's look at the correlation among the continuous variables, including the new path_length column.

In [None]:
cont_col = ['session_duration', 'path_length', 'hits']
df_cont = df_hits[cont_col] 
corr = df_cont.corr(method='spearman')
corr.style.background_gradient()

Then, let's look at the correlation ratio between hits and the categorical features.

In [None]:
def correlation_ratio(categories, measurements):
    fcat, _ = pd.factorize(categories)
    cat_num = np.max(fcat)+1
    y_avg_array = np.zeros(cat_num)
    n_array = np.zeros(cat_num)
    for i in range(0,cat_num):
        cat_measures = measurements[np.argwhere(fcat == i).flatten()]
        n_array[i] = len(cat_measures)
        y_avg_array[i] = np.average(cat_measures)
    y_total_avg = np.sum(np.multiply(y_avg_array,n_array))/np.sum(n_array)
    numerator = np.sum(np.multiply(n_array,np.power(np.subtract(y_avg_array,y_total_avg),2)))
    denominator = np.sum(np.power(np.subtract(measurements,y_total_avg),2))
    if numerator == 0:
        eta = 0.0
    else:
        eta = np.sqrt(numerator/denominator)
    return eta
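
The function above computes the correlation ratio: the square root of the between-category sum of squares (each category's size times the squared distance of its mean from the overall mean) divided by the total sum of squares. As a cross-check, here is a sketch of the same quantity using a pandas groupby. The name `correlation_ratio_groupby` and the toy inputs are made up for illustration.

```python
import numpy as np
import pandas as pd

def correlation_ratio_groupby(categories, measurements):
    """Correlation ratio (eta) via pandas groupby; an illustrative variant."""
    data = pd.DataFrame({'cat': categories, 'y': measurements})
    overall_mean = data['y'].mean()
    # Per-category mean and size.
    groups = data.groupby('cat')['y'].agg(['mean', 'count'])
    # Between-category sum of squares, weighted by category size.
    between = (groups['count'] * (groups['mean'] - overall_mean) ** 2).sum()
    # Total sum of squares around the overall mean.
    total = ((data['y'] - overall_mean) ** 2).sum()
    return 0.0 if between == 0 else float(np.sqrt(between / total))

# Made-up data: the category fully determines the value, so eta is 1.
eta_full = correlation_ratio_groupby(['a', 'a', 'b', 'b'], [1.0, 1.0, 3.0, 3.0])
# Made-up data: both categories have the same mean, so eta is 0.
eta_none = correlation_ratio_groupby(['a', 'a', 'b', 'b'], [1.0, 3.0, 1.0, 3.0])
print(eta_full, eta_none)
```

A value near 1 means the category explains most of the variance in the measurement, while a value near 0 means the category means are nearly identical.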


In [None]:
for col in df_hits.columns:
    if col not in ('session_duration', 'path_length', 'hits', 'row_num', 'path_id_set'):
        cor_ratio = correlation_ratio(df_hits[col].tolist(), np.array(df_hits['hits'].tolist()))
        print(f"Correlation ratio between {col} & hits is = {cor_ratio}, "
              f"whereas total # of categories is: {len(df_hits[col].unique())}")

In [None]:
plot_col = ['day_of_week', 'locale', 'agent_id', 'entry_page', 'traffic_type', 'hour_of_day', 'hits']

df_plot = df_hits[plot_col]
sns.pairplot(df_plot)


The figure above shows the pairwise plot of the different features of our data. Now that we understand the data, the next step is to encode it so that it can be fed into a machine learning model.

Before encoding, let's see what our cleaned DataFrame looks like now.

In [None]:
df_clean.describe()

# Feature encoding

Feature encoding is a very important part of preparing the data for model training and prediction, so it matters to choose the right kind of encoding for each feature. This dataset is particularly challenging in that regard because it has features that are
1) nominal categorical
2) categorical with a large number of categories
3) continuous numerical
4) cyclical

First, let's convert all the nominal categorical variables to a one-hot encoding.

In [None]:
df_encoded = df_clean.copy() # getting a new DataFrame instance for encodings

for i in ('locale', 'agent_id', 'traffic_type'):
    dummies = pd.get_dummies(df_encoded[i], drop_first=True,  prefix=i.split('_')[0])
    df_encoded = pd.concat([df_encoded, dummies], axis=1)
    df_encoded = df_encoded.drop(columns=[i]) 
df_encoded.info()
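
To illustrate what `drop_first=True` does in the loop above, here is a small sketch with made-up locale values: the first category gets no column of its own and is represented by all the remaining dummy columns being zero.

```python
import pandas as pd

# Made-up locale values for illustration.
toy = pd.DataFrame({'locale': ['L1', 'L2', 'L3', 'L1']})

dummies = pd.get_dummies(toy['locale'], drop_first=True, prefix='locale')

print(dummies.columns.tolist())  # ['locale_L2', 'locale_L3']
print(dummies)
```

Dropping one column per feature avoids a redundant, perfectly collinear column, which matters most for linear models.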

Next, we have two categorical features with a very large number of discrete categories. One-hot encoding them would lead us straight into the curse of dimensionality.
So instead, we'll limit the encoding of entry_page to 10 columns by using FeatureHasher, while path_id_set has already been replaced by path_length.

In [None]:
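hasher = FeatureHasher(n_features=10, input_type='string')
# each sample must be an iterable of strings, so wrap every page ID in a list;
# a bare string would be hashed character by character
hashed_features = hasher.transform(df_encoded['entry_page'].astype(str).apply(lambda s: [s]))
# reuse the index of df_encoded so that the rows stay aligned in the concat below
df_hashed = pd.DataFrame(hashed_features.toarray(),
                         columns=[f'ep_{i}' for i in range(10)],
                         index=df_encoded.index)
df_encoded = pd.concat([df_encoded.drop(columns=['entry_page']), df_hashed], axis=1)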
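
As a quick illustration of how the hashing trick behaves, the sketch below hashes a few made-up page IDs, each passed as a one-element list of string tokens. Identical IDs always land in the same column, while different IDs may collide because only 10 columns are available.

```python
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=10, input_type='string')

# Made-up page IDs; each sample is a list holding a single string token.
samples = [['2111'], ['2113'], ['2111']]
hashed = hasher.transform(samples).toarray()

print(hashed.shape)  # (3, 10)
print(hashed)        # one non-zero entry (+1 or -1) per row
```

The sign of each entry comes from the hasher's default `alternate_sign=True` setting, which is meant to reduce the effect of collisions.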



In [None]:
df_encoded.drop(columns=['path_id_set'], inplace=True)
df_encoded.info()

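After adding the hashed features, we're done with the categorical features.
Next come the continuous features. Various scaling options were explored, but the simplest one, MinMaxScaler, worked as well as any other. Note that the scaling code in the cell below is commented out, so the saved files contain the unscaled values.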

In [None]:
#scaler  = StandardScaler()
#minmax = MinMaxScaler()
#df_encoded[['path_length', 'session_duration']] = scaler.fit_transform(df_encoded[['path_length', 'session_duration']].apply(lambda x: np.log(1+x)))
#df_encoded[['path_length', 'session_duration']] = minmax.fit_transform(df_clean[['path_length', 'session_duration']])

#df_encoded[['path_length', 'session_duration']].describe()
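
For reference, the two options in the commented-out cell can be sketched on a small made-up frame: min-max scaling of the raw values, and standard scaling after a log(1 + x) transform. The numbers below are illustrative assumptions, not values from the data set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up values with a long right tail in session_duration.
toy = pd.DataFrame({'path_length': [1, 2, 5, 10],
                    'session_duration': [0.0, 30.0, 600.0, 7200.0]})

# Option 1: rescale each column to the [0, 1] range.
minmax_scaled = MinMaxScaler().fit_transform(toy)

# Option 2: compress the tail with log(1 + x), then standardize.
log_standard_scaled = StandardScaler().fit_transform(np.log1p(toy))

print(minmax_scaled.min(axis=0), minmax_scaled.max(axis=0))
print(log_standard_scaled.mean(axis=0).round(6))
```

In a modeling pipeline, a scaler would normally be fitted on the training split only and then applied to the other splits.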


For days and hours, we have to use cyclic encoding because they repeat periodically. The following sine/cosine encoding was used:

In [None]:
day_mapper  = dict(zip(['Sunday','Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'], 
                       [0, 1, 2, 3 ,4, 5, 6]))

df_encoded['day_of_week_enc'] = df_encoded['day_of_week'].map(day_mapper)



In [None]:

df_encoded['day_sin'] =  np.sin(2*np.pi*df_encoded.day_of_week_enc/7)
df_encoded['day_cos'] =  np.cos(2*np.pi*df_encoded.day_of_week_enc/7)

df_encoded['hour_sin'] =  np.sin(2*np.pi*df_encoded.hour_of_day/24)
df_encoded['hour_cos'] =  np.cos(2*np.pi*df_encoded.hour_of_day/24)

df_encoded[['day_cos', 'day_sin']].describe()
df_encoded[['hour_cos', 'hour_sin']].describe()

df_encoded = df_encoded.drop(columns= ['day_of_week', 'day_of_week_enc', 'hour_of_day'])
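
The point of the sine/cosine pair is that values at the two ends of the cycle end up close together. The sketch below checks this for hours. The helper names `encode_hour` and `hour_distance` are made up for illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def hour_distance(h1, h2):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

# 23:00 and 00:00 are as close as 00:00 and 01:00, unlike in the raw encoding.
print(hour_distance(23, 0), hour_distance(0, 1))
# Opposite sides of the clock are the farthest apart.
print(hour_distance(0, 12))
```

With the raw integer encoding, 23 and 0 would be 23 units apart, which would wrongly suggest that late evening and midnight are very different.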

In [None]:
df_encoded.info()

Finally, let's save the encoded data to CSV files.

In [None]:
df_encoded.to_csv('feature_engineered_data_2.csv')

In [None]:
df_encoded[df_encoded['hits'].notna()].to_csv('feature_engineered_data_with_hits_2.csv')

In [None]:
df_encoded[df_encoded['hits'].isna()].to_csv('feature_engineered_data_without_hits_2.csv')

In [None]:
df_encoded[df_encoded['hits'].isna()].describe()