# TASK 2: User Engagement analysis

The aim of this notebook is to analyze user engagement, that is, how engaged users are with the services of the TellCo company.
The results can help improve the Quality of Service (QoS), make better use of the mobile platforms, and attract more users to the business.

For this task, we track user engagement using the following metrics:
* session frequency
* session duration
* total session traffic (download and upload, in bytes)
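
As a rough illustration of how these three metrics can be derived per user, the sketch below aggregates a small made-up session table with pandas. The input column names follow the dataset; the toy values and the output names (`session_frequency`, `total_duration_s`, `total_traffic_bytes`) are assumptions chosen for this example.

```python
import pandas as pd

# Toy session records (made-up values); column names follow the dataset.
sessions = pd.DataFrame({
    "MSISDN/Number": ["A", "A", "B"],
    "Bearer Id": ["s1", "s2", "s3"],
    "Dur. (s)": [100.0, 50.0, 30.0],
    "Total DL (Bytes)": [1000.0, 2000.0, 500.0],
    "Total UL (Bytes)": [100.0, 200.0, 50.0],
})

# Total traffic of a session = download + upload
sessions["total_traffic"] = sessions["Total DL (Bytes)"] + sessions["Total UL (Bytes)"]

# One row per user: session count, total duration, total traffic
engagement = sessions.groupby("MSISDN/Number").agg(
    session_frequency=("Bearer Id", "count"),
    total_duration_s=("Dur. (s)", "sum"),
    total_traffic_bytes=("total_traffic", "sum"),
)
print(engagement)
```

Named aggregation keeps the three metrics in a single pass over the grouped data and gives each output column an explicit name.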

In [1]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

In [2]:
import sys
sys.path.append('../scripts')
from Extract_data import extract_data

In [3]:
# Import the dataset
df = pd.read_csv("../data/Cleaned_Data.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150001 entries, 0 to 150000
Data columns (total 55 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   Bearer Id                                 149010 non-null  float64
 1   Start                                     150000 non-null  object 
 2   Start ms                                  150001 non-null  float64
 3   End                                       150000 non-null  object 
 4   End ms                                    150001 non-null  float64
 5   Dur. (s)                                  150001 non-null  float64
 6   IMSI                                      149431 non-null  float64
 7   MSISDN/Number                             148935 non-null  float64
 8   IMEI                                      149429 non-null  float64
 9   Last Location Name                        148848 non-null  object 
 10  Avg RTT DL (ms)     

Let's extract the important variables to perform this task. The imported dataset has already been cleaned and the cleaning process can be found in the notebook [here](Data_Preprocessing.ipynb).

In [35]:
# Extract the variables needed for Task 2
# (.copy() makes dfTask2 independent of df, so the assignment below is safe)
dfTask2 = df.loc[:,['Bearer Id','Dur. (s)','MSISDN/Number','Social Media DL (Bytes)','Social Media UL (Bytes)',
                                  'Google DL (Bytes)', 'Google UL (Bytes)', 'Email DL (Bytes)',
                                  'Email UL (Bytes)', 'Youtube DL (Bytes)', 'Youtube UL (Bytes)',
                                  'Netflix DL (Bytes)', 'Netflix UL (Bytes)', 'Gaming DL (Bytes)',
                                  'Gaming UL (Bytes)', 'Other DL (Bytes)', 'Other UL (Bytes)',
                                  'Total UL (Bytes)', 'Total DL (Bytes)']].copy()
# Treat the identifiers as strings rather than floats, and restore the missing values
dfTask2[['MSISDN/Number','Bearer Id']] = dfTask2[['MSISDN/Number','Bearer Id']].astype(str).replace('nan',np.nan)
dfTask2.head(5)

Unnamed: 0,Bearer Id,Dur. (s),MSISDN/Number,Social Media DL (Bytes),Social Media UL (Bytes),Google DL (Bytes),Google UL (Bytes),Email DL (Bytes),Email UL (Bytes),Youtube DL (Bytes),Youtube UL (Bytes),Netflix DL (Bytes),Netflix UL (Bytes),Gaming DL (Bytes),Gaming UL (Bytes),Other DL (Bytes),Other UL (Bytes),Total UL (Bytes),Total DL (Bytes)
0,1.31144834608449e+19,86399.0,33664962239.0,1545765.0,24420.0,1634479.0,1271433.0,3563542.0,137762.0,15854611.0,2501332.0,8198936.0,9656251.0,278082303.0,14344150.0,171744450.0,8814393.0,36749741.0,308879636.0
1,1.31144834828789e+19,86399.0,33681854413.0,1926113.0,7165.0,3493924.0,920172.0,629046.0,308339.0,20247395.0,19111729.0,18338413.0,17227132.0,608750074.0,1170709.0,526904238.0,15055145.0,53800391.0,653384965.0
2,1.31144834840805e+19,86399.0,33760627129.0,1684053.0,42224.0,8535055.0,1694064.0,2690151.0,672973.0,19725661.0,14699576.0,17587794.0,6163408.0,229584621.0,395630.0,410692588.0,4215763.0,27883638.0,279807335.0
3,1.31144834854428e+19,86399.0,33750343200.0,644121.0,13372.0,9023734.0,2788027.0,1439754.0,631229.0,21388122.0,15146643.0,13994646.0,1097942.0,799538153.0,10849722.0,749039933.0,12797283.0,43324218.0,846028530.0
4,1.31144834994807e+19,86399.0,33699795932.0,862600.0,50188.0,6248284.0,1500559.0,1936496.0,173853.0,15259380.0,18962873.0,17124581.0,415218.0,527707248.0,3529801.0,550709500.0,13910322.0,38542814.0,569138589.0


In [36]:
# Check for missing values
dfTask2.isna().sum().sort_values(ascending=False)

MSISDN/Number              1066
Bearer Id                   991
Youtube UL (Bytes)            0
Total UL (Bytes)              0
Other UL (Bytes)              0
Other DL (Bytes)              0
Gaming UL (Bytes)             0
Gaming DL (Bytes)             0
Netflix UL (Bytes)            0
Netflix DL (Bytes)            0
Youtube DL (Bytes)            0
Dur. (s)                      0
Email UL (Bytes)              0
Email DL (Bytes)              0
Google UL (Bytes)             0
Google DL (Bytes)             0
Social Media UL (Bytes)       0
Social Media DL (Bytes)       0
Total DL (Bytes)              0
dtype: int64
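
Rows without an `MSISDN/Number` cannot be attributed to a customer. As a small sketch on made-up data, pandas `groupby` excludes missing keys by default (`dropna=True`), so such rows drop out of a per-customer aggregation unless they are handled explicitly. The toy frame below is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with one missing customer id
toy = pd.DataFrame({
    "MSISDN/Number": ["A", np.nan, "A", "B"],
    "Dur. (s)": [10.0, 20.0, 30.0, 40.0],
})

# Default behaviour: the row with a missing key is excluded from the result
per_user = toy.groupby("MSISDN/Number")["Dur. (s)"].sum()

# dropna=False keeps the missing key as its own group
per_user_with_nan = toy.groupby("MSISDN/Number", dropna=False)["Dur. (s)"].sum()

print(per_user)
print(per_user_with_nan)
```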

>## Task 2.1

### Aggregate the metrics per customer id (MSISDN)
To do so, we use the `extract_data` helper that we wrote in the script [Extract_data.py](../scripts/Extract_data.py).
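
The script itself is not reproduced in this notebook. As a rough idea of what such an aggregation helper could do, the sketch below sums download and upload bytes per application and then groups by customer. The function name `aggregate_per_customer` and its internals are hypothetical; they are not the contents of `Extract_data.py`.

```python
import pandas as pd

def aggregate_per_customer(df, key, apps):
    """Hypothetical helper: one row per `key` with the session count,
    the total duration and the total (DL + UL) bytes per application."""
    out = df[[key, "Dur. (s)"]].copy()
    for app in apps:
        # Combine both directions into a single traffic column per application
        out[app] = df[f"{app} DL (Bytes)"] + df[f"{app} UL (Bytes)"]
    grouped = out.groupby(key)
    result = grouped.sum()
    # Number of rows (sessions) per customer, placed as the first column
    result.insert(0, "Number of session", grouped.size())
    return result

# Toy data (made-up values); column names follow the dataset.
toy = pd.DataFrame({
    "MSISDN/Number": ["A", "A", "B"],
    "Dur. (s)": [10.0, 20.0, 5.0],
    "Google DL (Bytes)": [100.0, 200.0, 50.0],
    "Google UL (Bytes)": [10.0, 20.0, 5.0],
})
agg = aggregate_per_customer(toy, "MSISDN/Number", ["Google"])
print(agg)
```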

In [31]:
dfForAgg = extract_data(dfTask2)
dfAgg = dfForAgg.merge_data('MSISDN/Number')
dfAgg.drop(columns= ['Total UL (Bytes)','Total DL (Bytes)'],inplace=True)
dfAgg

MSISDN/Number,Number of session,Dur. (s),Social Media,Google,Email,Youtube,Gaming,Other,Total
3197020876596.0,1,86399.0,7.152240e+05,1.043866e+07,1.520771e+06,1.195990e+07,1.780487e+08,4.705265e+08,2.321240e+08
33601001722.0,1,116720.0,2.232135e+06,4.389005e+06,1.331362e+06,2.162455e+07,8.124587e+08,3.865709e+08,8.786906e+08
33601001754.0,1,181230.0,2.660565e+06,5.334863e+06,3.307781e+06,1.243222e+07,1.197501e+08,2.817101e+08,1.568596e+08
33601002511.0,1,134969.0,3.195623e+06,3.443126e+06,3.205380e+06,2.133357e+07,5.388277e+08,5.016937e+08,5.959665e+08
33601007832.0,1,49878.0,2.802940e+05,9.678493e+06,2.284670e+06,6.977321e+06,3.911261e+08,3.527970e+07,4.223207e+08
...,...,...,...,...,...,...,...,...,...
33789980299.0,2,210389.0,4.250312e+06,1.024647e+07,5.315327e+06,3.801281e+07,9.723450e+08,1.075140e+09,1.094693e+09
33789996170.0,1,8810.0,3.001830e+05,7.531269e+06,1.006915e+06,2.664784e+07,6.603614e+08,2.952828e+08,7.146416e+08
33789997247.0,1,140988.0,4.985690e+05,5.429705e+06,2.514097e+06,1.985157e+07,4.370033e+08,2.111151e+08,4.803073e+08
882397108489451.0,1,86399.0,1.546088e+06,9.218647e+06,3.330974e+06,4.094071e+07,4.307026e+07,4.013605e+08,1.391536e+08


In [29]:
# Function to convert seconds into days
def convert_sec_to_day(df, duration_col):
    """Return the values of `duration_col` (in seconds) converted to days.

    The input DataFrame is left unchanged; the caller decides where to store the result.
    """
    seconds_per_day = 3600 * 24
    return df.loc[:, duration_col] / seconds_per_day

In [30]:
dfAggDay = dfAgg.copy()
dfAggDay['Dur. (s)'] = convert_sec_to_day(dfAggDay,'Dur. (s)')
dfAggDay

MSISDN/Number,Number of session,Dur. (s),Social Media,Google,Email,Youtube,Gaming,Other,Total
3197020876596.0,1,0.999988,7.152240e+05,1.043866e+07,1.520771e+06,1.195990e+07,1.780487e+08,4.705265e+08,2.321240e+08
33601001722.0,1,1.350926,2.232135e+06,4.389005e+06,1.331362e+06,2.162455e+07,8.124587e+08,3.865709e+08,8.786906e+08
33601001754.0,1,2.097569,2.660565e+06,5.334863e+06,3.307781e+06,1.243222e+07,1.197501e+08,2.817101e+08,1.568596e+08
33601002511.0,1,1.562141,3.195623e+06,3.443126e+06,3.205380e+06,2.133357e+07,5.388277e+08,5.016937e+08,5.959665e+08
33601007832.0,1,0.577292,2.802940e+05,9.678493e+06,2.284670e+06,6.977321e+06,3.911261e+08,3.527970e+07,4.223207e+08
...,...,...,...,...,...,...,...,...,...
33789980299.0,2,2.435058,4.250312e+06,1.024647e+07,5.315327e+06,3.801281e+07,9.723450e+08,1.075140e+09,1.094693e+09
33789996170.0,1,0.101968,3.001830e+05,7.531269e+06,1.006915e+06,2.664784e+07,6.603614e+08,2.952828e+08,7.146416e+08
33789997247.0,1,1.631806,4.985690e+05,5.429705e+06,2.514097e+06,1.985157e+07,4.370033e+08,2.111151e+08,4.803073e+08
882397108489451.0,1,0.999988,1.546088e+06,9.218647e+06,3.330974e+06,4.094071e+07,4.307026e+07,4.013605e+08,1.391536e+08
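
As a quick sanity check of the conversion, one day is 24 × 3600 = 86400 seconds, so a session of 86399 s lasts just under one day. The sketch below applies the same division to three duration values from the aggregated table and also renames the column so that its label matches its unit; the name `Duration (day)` is an illustrative choice, not one used elsewhere in this notebook.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 3600  # 86400

# Three durations (in seconds) taken from the aggregated table
toy = pd.DataFrame({"Dur. (s)": [86399.0, 116720.0, 49878.0]})

# Convert without mutating the input, then rename to reflect the new unit
converted = (
    toy.assign(**{"Dur. (s)": toy["Dur. (s)"] / SECONDS_PER_DAY})
       .rename(columns={"Dur. (s)": "Duration (day)"})
)
print(converted.round(6))
```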
