# Classification of the Q1 dataset into clusters
In this notebook, the Q1 dataset created for iteration 2 is updated so that every row carries a cluster label.

In [1]:
# Suppress warnings for presentation of the notebook only; they were not ignored during development.
import warnings
warnings.filterwarnings('ignore')

First, import pandas and the scikit-learn classes, then load both datasets.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import sklearn as sk

In [3]:
df = pd.read_csv('final-Q4.csv')
df1 = pd.read_csv('cluster-Q1.csv')
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101101 entries, 0 to 101100
Data columns (total 12 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Unnamed: 0           101101 non-null  int64  
 1   userID               101101 non-null  int64  
 2   companyID            101101 non-null  int64  
 3   country              101101 non-null  int64  
 4   activity             101101 non-null  int64  
 5   event                101101 non-null  float64
 6   timestamp            101101 non-null  object 
 7   page                 101101 non-null  int64  
 8   hour                 101101 non-null  int64  
 9   day                  101101 non-null  int64  
 10  week                 101101 non-null  int64  
 11  userCountPerCompany  101101 non-null  int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 9.3+ MB


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Unnamed: 0           100000 non-null  int64  
 1   userID               100000 non-null  int64  
 2   companyID            100000 non-null  int64  
 3   country              100000 non-null  object 
 4   activity             100000 non-null  object 
 5   event                100000 non-null  float64
 6   page                 100000 non-null  object 
 7   timestamp            100000 non-null  object 
 8   hour                 100000 non-null  int64  
 9   day                  100000 non-null  int64  
 10  week                 100000 non-null  int64  
 11  userCountPerCompany  100000 non-null  int64  
 12  cluster              100000 non-null  int64  
dtypes: float64(1), int64(8), object(4)
memory usage: 9.9+ MB


Then, add the cluster from the Q4 data to the Q1 dataset by mapping on userID.

In [5]:
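# Build a userID -> cluster lookup from Q4 (if a user appears more than once, the last row wins).
mapping = dict(df[['userID', 'cluster']].values)
# Look up each Q1 user's cluster.
df1['cluster'] = df1['userID'].map(mapping)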
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101101 entries, 0 to 101100
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Unnamed: 0           101101 non-null  int64  
 1   userID               101101 non-null  int64  
 2   companyID            101101 non-null  int64  
 3   country              101101 non-null  int64  
 4   activity             101101 non-null  int64  
 5   event                101101 non-null  float64
 6   timestamp            101101 non-null  object 
 7   page                 101101 non-null  int64  
 8   hour                 101101 non-null  int64  
 9   day                  101101 non-null  int64  
 10  week                 101101 non-null  int64  
 11  userCountPerCompany  101101 non-null  int64  
 12  cluster              101101 non-null  int64  
dtypes: float64(1), int64(11), object(1)
memory usage: 10.0+ MB
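
The userID lookup can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own code: the frames `q4` and `q1` and their values are made up, and it assumes each user has a single cluster in the Q4 table. It shows that users missing from Q4 would come out as NaN, which is why the non-null count of `cluster` in the `info()` output above is worth checking.

```python
import pandas as pd

# Hypothetical toy stand-ins for the Q4 and Q1 frames.
q4 = pd.DataFrame({"userID": [1, 1, 2, 3], "cluster": [0, 0, 3, 1]})
q1 = pd.DataFrame({"userID": [1, 2, 2, 4]})

# One cluster per user: drop repeated rows, then index by userID.
user_cluster = q4.drop_duplicates("userID").set_index("userID")["cluster"]

# Series.map looks each userID up in the index; unknown users become NaN.
q1["cluster"] = q1["userID"].map(user_cluster)

# Collect the users that did not receive a cluster.
unmapped = q1.loc[q1["cluster"].isna(), "userID"].tolist()
print(q1)
print("users without a cluster:", unmapped)
```

In the notebook itself all 101101 rows received a cluster, so no Q1 user was missing from the Q4 data.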


Count the rows per cluster in the Q1 dataset to see the initial distribution.

In [6]:
df1['cluster'].value_counts()

0    48282
3    31728
1    16046
2     5045
Name: cluster, dtype: int64

### Training the algorithm

We will use these variables to predict which cluster each user belongs to in Q1: country, activity, page, hour, day, week, event, userCountPerCompany.

In the cell below, the data is split into 70% training data and 30% test data. The built-in *train_test_split* function from scikit-learn splits the dataset randomly into a train set and a test set.

In [7]:
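from sklearn.preprocessing import normalize

X = df1[['country', 'activity', 'page', 'hour', 'day', 'week', 'event', 'userCountPerCompany']]
# normalize() rescales each row (sample) to unit L2 norm by default.
X = normalize(X)
y = df1['cluster']
# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)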
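
The 70/30 split can be illustrated on a small array. This is only a sketch: `X_toy` and `y_toy` are made-up data, not the Q1 dataset. It prints the resulting shapes and checks that repeating the call with the same random_state gives the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features, binary labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.3 keeps 30% of the rows for testing; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)

# A second call with the same random_state selects the same rows.
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```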

#### Random Forest model

The random forest model below is the built-in *RandomForestClassifier* from scikit-learn. Because a random forest relies on randomness, random_state is set to a fixed value: the same random numbers are then generated each time the code is run, so the result is always the same. This is helpful when verifying the output. Traditionally, the literature suggests using 10 or more trees; in this project we use 100 (see n_estimators below).

In [8]:
rf = RandomForestClassifier(random_state=1, n_estimators=100)
rf = rf.fit(X_train, y_train)
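
The effect of a fixed random_state can be checked with a small experiment. The sketch below uses synthetic data from `make_classification` (an assumption made for illustration, not the Q1 data) and a hypothetical helper `fit_predict_proba`; it trains two forests with the same seed and compares their predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 4-class data, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=4, random_state=0
)

def fit_predict_proba(seed):
    """Train a 100-tree forest with the given seed and return class probabilities."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    return model.fit(X_demo, y_demo).predict_proba(X_demo)

# Two runs with the same seed should give identical outputs.
same_seed_equal = np.array_equal(fit_predict_proba(1), fit_predict_proba(1))
print(same_seed_equal)
```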

### Evaluating the model

Below, we evaluate the model using a confusion matrix and by calculating accuracy, precision and recall, which is typical for a classification problem.

In [9]:
y_test_pred = rf.predict(X_test)
cm = confusion_matrix(y_test, y_test_pred)
cm

array([[14124,    55,     1,   358],
       [    5,  4803,     0,     3],
       [    0,     2,  1518,     6],
       [  320,    34,     2,  9100]])

In [10]:
y_pred = rf.predict(X_test)
# Label the rows (actual cluster) and columns (predicted cluster) for readability.
conf_matrix = pd.DataFrame(confusion_matrix(y_test, y_pred), index=['0 actual', '1 actual', '2 actual', '3 actual'], columns=['0 prediction', '1 prediction', '2 prediction', '3 prediction'])
conf_matrix

| | 0 prediction | 1 prediction | 2 prediction | 3 prediction |
|---|---|---|---|---|
| 0 actual | 14124 | 55 | 1 | 358 |
| 1 actual | 5 | 4803 | 0 | 3 |
| 2 actual | 0 | 2 | 1518 | 6 |
| 3 actual | 320 | 34 | 2 | 9100 |


The random forest model reaches 97% accuracy on the test set, with per-cluster precision between 0.96 and 1.00.

In [11]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.97      0.97     14538
           1       0.98      1.00      0.99      4811
           2       1.00      0.99      1.00      1526
           3       0.96      0.96      0.96      9456

    accuracy                           0.97     30331
   macro avg       0.98      0.98      0.98     30331
weighted avg       0.97      0.97      0.97     30331
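
The figures in the report can be recomputed by hand from the confusion matrix shown earlier. The sketch below copies that matrix into a NumPy array (the variable names are chosen for illustration): accuracy is the diagonal sum over the total, precision divides each diagonal entry by its column sum, and recall divides it by its row sum.

```python
import numpy as np

# Confusion matrix from the test set: rows are actual clusters, columns are predictions.
cm_values = np.array([
    [14124,   55,    1,  358],
    [    5, 4803,    0,    3],
    [    0,    2, 1518,    6],
    [  320,   34,    2, 9100],
])

correct = np.diag(cm_values)                 # correctly classified rows per cluster
accuracy = correct.sum() / cm_values.sum()   # share of all test rows classified correctly
precision = correct / cm_values.sum(axis=0)  # per predicted cluster (column sums)
recall = correct / cm_values.sum(axis=1)     # per actual cluster (row sums)

print(round(float(accuracy), 4))
print(np.round(precision, 2))
print(np.round(recall, 2))
```

The row sums (14538, 4811, 1526, 9456) match the support column of the report.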



In [12]:
y_pred2 = rf.predict(X)
df1['predictions'] = y_pred2
df1.head()

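| | Unnamed: 0 | userID | companyID | country | activity | event | timestamp | page | hour | day | week | userCountPerCompany | cluster | predictions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 32 | 37 | 1 | 1 | 1.0 | 2021-02-05 08:17:01 | 5 | 11 | 4 | 5 | 2466 | 3 | 3 |
| 1 | 1 | 140 | 41 | 0 | 1 | 3.0 | 2021-01-05 22:56:21 | 5 | 5 | 2 | 1 | 3075 | 1 | 1 |
| 2 | 2 | 195 | 8 | 1 | 0 | 0.0 | 2021-02-11 12:56:44 | 5 | 15 | 3 | 6 | 1027 | 0 | 0 |
| 3 | 3 | 19 | 23 | 1 | 1 | 1.0 | 2021-03-17 19:24:07 | 4 | 22 | 2 | 11 | 2040 | 3 | 3 |
| 4 | 4 | 84 | 17 | 0 | 1 | 2.0 | 2021-01-05 13:55:09 | 2 | 20 | 1 | 1 | 1487 | 0 | 0 |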


Above, the predictions were added back to the Q1 dataframe. Next, the most common prediction is determined for each user. When it differs from the user's original cluster, the most common prediction becomes the user's new cluster.

In [13]:
getCluster = df1[['userID', 'predictions']]
df1['Most common predictions'] = getCluster.groupby('userID')['predictions'].transform(lambda x: x.value_counts().idxmax())

In [14]:
# The new cluster is the user's most common prediction (equal to the original cluster whenever the two agree).
df1['new cluster'] = df1['Most common predictions']
df1['cluster'].value_counts()

0    48282
3    31728
1    16046
2     5045
Name: cluster, dtype: int64

In [15]:
df1['new cluster'].value_counts()

0    48282
3    31728
1    16046
2     5045
Name: new cluster, dtype: int64
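
The per-user majority vote can be illustrated on a toy frame. The data and the column name `most_common` below are made up for this sketch; the groupby and transform pattern is the same as in the cells above. User 1 keeps cluster 0, while user 2 would move from cluster 3 to cluster 1.

```python
import pandas as pd

# Hypothetical toy data: two users with three rows each.
toy = pd.DataFrame({
    "userID":      [1, 1, 1, 2, 2, 2],
    "cluster":     [0, 0, 0, 3, 3, 3],
    "predictions": [0, 0, 3, 1, 1, 3],
})

# Most frequent predicted label per user, broadcast to every row of that user.
toy["most_common"] = toy.groupby("userID")["predictions"].transform(
    lambda s: s.value_counts().idxmax()
)

# Users whose majority prediction disagrees with their original cluster.
changed = toy.loc[toy["most_common"] != toy["cluster"], "userID"].unique().tolist()
print(toy)
print("users that change cluster:", changed)
```

In the notebook the counts of `cluster` and `new cluster` are identical, which is consistent with no user changing cluster.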

Replace the numeric codes of the country, activity and page variables with their labels, then save the dataset to CSV for further processing.

In [16]:
df1['country'] = df1['country'].replace({0:'Indonesia', 1:'Turkey'})
df1['activity'] = df1['activity'].replace({0:'Page', 1:'Event'})
df1['page'] = df1['page'].replace({0:'https://en.wikipedia.org', 
                                 1: 'https://en.wikipedia.org/wiki/Main_Page',
                                 2: 'https://en.wikipedia.org/wiki/Accounting',
                                 3: 'https://en.wikipedia.org/wiki/Bookkeeping',
                                 4: 'https://en.wikipedia.org/wiki/Financial_technology',
                                 5: 'https://en.wikipedia.org/wiki/Financial_services'})
df1.to_csv('final-Q1.csv')

## KNN model

To check whether the random forest model has the best accuracy, this section repeats the classification with the *KNeighborsClassifier* class from sklearn.

In [17]:
from sklearn.neighbors import KNeighborsClassifier 

knn = KNeighborsClassifier(n_neighbors=3)
knn = knn.fit(X_train, y_train) 

The score below shows that the KNN model's accuracy (about 0.65) is well below that of the random forest (0.97). Thus, using random forest was indeed the better choice.

In [18]:
knn.score(X_test, y_test)

0.6517424417262866