# DATA PRE-PROCESSING
This notebook performs the data pre-processing step and saves the cleaned data to a CSV file.

In [1]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Import the class to clean the data
import sys
sys.path.append('../scripts')
from Clean_data import clean_data

In [3]:
# Import the data set
df = pd.read_csv("../data/Week1_challenge_data_source(CSV).csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150001 entries, 0 to 150000
Data columns (total 55 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   Bearer Id                                 149010 non-null  float64
 1   Start                                     150000 non-null  object 
 2   Start ms                                  150000 non-null  float64
 3   End                                       150000 non-null  object 
 4   End ms                                    150000 non-null  float64
 5   Dur. (ms)                                 150000 non-null  float64
 6   IMSI                                      149431 non-null  float64
 7   MSISDN/Number                             148935 non-null  float64
 8   IMEI                                      149429 non-null  float64
 9   Last Location Name                        148848 non-null  object 
 10  Avg RTT DL (ms)     

In [4]:
print(f'The dataset has {df.shape[0]} observations/rows and {df.shape[1]} columns.')

The dataset has 150001 observations/rows and 55 columns.


In [5]:
DataClass = clean_data(df)

## Missing values
Let's check the missing values in our data set.

In [6]:
missingCountCol, missingCountTot, missingPerc = DataClass.missing_values()

The dataset contains 1031392 missing values in total
The missing values represent 12.5% of the values contained in the set
These values are distributed as follows:
Nb of sec with 37500B < Vol UL              130254
Nb of sec with 6250B < Vol UL < 37500B      111843
Nb of sec with 125000B < Vol DL              97538
TCP UL Retrans. Vol (Bytes)                  96649
Nb of sec with 31250B < Vol DL < 125000B     93586
Nb of sec with 1250B < Vol UL < 6250B        92894
Nb of sec with 6250B < Vol DL < 31250B       88317
TCP DL Retrans. Vol (Bytes)                  88146
HTTP UL (Bytes)                              81810
HTTP DL (Bytes)                              81474
Avg RTT DL (ms)                              27829
Avg RTT UL (ms)                              27812
Last Location Name                            1153
MSISDN/Number                                 1066
Bearer Id                                      991
Nb of sec with Vol UL < 1250B                  793
UL TP < 10 Kbps (%) 
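
As a rough illustration of what such a summary computes, the sketch below derives the same three figures (per-column counts, total count, overall percentage) with plain pandas. The function `summarize_missing` and the toy frame are hypothetical; the actual `clean_data.missing_values` method lives in `../scripts` and may be implemented differently.

```python
import numpy as np
import pandas as pd


def summarize_missing(frame: pd.DataFrame):
    """Return per-column NaN counts (descending), the total NaN count,
    and the share of missing cells as a percentage."""
    per_column = frame.isnull().sum().sort_values(ascending=False)
    total_missing = int(per_column.sum())
    total_cells = frame.shape[0] * frame.shape[1]
    percent = round(100 * total_missing / total_cells, 2)
    return per_column, total_missing, percent


# Toy data standing in for the real data set (assumption, not the source data)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

per_column, total_missing, percent = summarize_missing(toy)
print(f"The dataset contains {total_missing} missing values in total")
print(f"The missing values represent {percent}% of the values contained in the set")
print(per_column[per_column > 0])
```

The percentage is taken over all cells (rows times columns), which is why a few heavily incomplete columns can dominate the overall figure.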

Before starting, note that for each application we only have the upload and download volumes, not the total. One could simply decide to remove these columns, but since the upload and download information is available, we'll keep them.

Let's explore whether we can find any relationship between the data used per application and the total.

In [18]:
DataClass.df.iloc[-1,:]

Bearer Id                                                NaN
Start                                                    NaN
Start ms                                                 NaN
End                                                      NaN
End ms                                                   NaN
Dur. (ms)                                                NaN
IMSI                                                     NaN
MSISDN/Number                                            NaN
IMEI                                                     NaN
Last Location Name                                       NaN
Avg RTT DL (ms)                                          NaN
Avg RTT UL (ms)                                          NaN
Avg Bearer TP DL (kbps)                                  NaN
Avg Bearer TP UL (kbps)                                  NaN
TCP DL Retrans. Vol (Bytes)                              NaN
TCP UL Retrans. Vol (Bytes)                              NaN
DL TP < 50 Kbps (%)     

About $12.5\%$ of the values in our data are missing. One could decide to impute them, but first let's dig into the missing values to understand them.
* Firstly, since `IMSI`, `MSISDN/Number` and `IMEI` can be considered unique identifiers, it makes no sense to impute them. An imputation method would assign the same identifier to several records, which could bias the analysis.
* Secondly, we cannot impute the missing values of categorical variables such as `Handset Manufacturer` and `Handset Type`.
* Thirdly, notice that we have the same number of NAs across the `Duration ratio when Bearer Downlink` columns, and likewise across the `Duration ratio when Bearer Uplink` columns. This leads us to look at the data contained in the columns `Avg Bearer TP UL (kbps)` and `Avg Bearer TP DL (kbps)`.
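
One way to spot columns that share the same number of NAs, as in the third point, is to group the columns by their missing count. The sketch below is a minimal example on a toy frame; the column names `x`, `y`, `z`, `w` are placeholders, not columns of the real data set.

```python
import numpy as np
import pandas as pd

# Toy frame: "x" and "y" are missing on the same rows, "z" on another row
toy = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, 4.0],
    "y": [0.5, np.nan, np.nan, 2.0],
    "z": [np.nan, 1.0, 2.0, 3.0],
    "w": [1.0, 2.0, 3.0, 4.0],
})

# Keep only the columns that actually contain missing values
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

# Map each distinct NaN count to the columns that share it
columns_by_count = {
    int(count): list(cols.index)
    for count, cols in missing_counts.groupby(missing_counts)
}
print(columns_by_count)

# Equal counts are only a hint; confirm the NaNs fall on the same rows
same_rows = toy["x"].isnull().equals(toy["y"].isnull())
print(same_rows)
```

Equal counts alone do not prove that the values are missing together, hence the row-level comparison at the end.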

In [15]:
columnName = DataClass.df.columns.to_list()

In [48]:
# Explore the relation between the duration ratio and the average bearer TP UL
print(DataClass.df.loc[DataClass.df["Avg Bearer TP UL (kbps)"]==0,columnName[20:24]].median())
DataClass.df.loc[DataClass.df["Avg Bearer TP UL (kbps)"]==0,columnName[20:24]].isnull().sum()

UL TP < 10 Kbps (%)               100.0
10 Kbps < UL TP < 50 Kbps (%)       0.0
50 Kbps < UL TP < 300 Kbps (%)      0.0
UL TP > 300 Kbps (%)                0.0
dtype: float64


UL TP < 10 Kbps (%)               791
10 Kbps < UL TP < 50 Kbps (%)     791
50 Kbps < UL TP < 300 Kbps (%)    791
UL TP > 300 Kbps (%)              791
dtype: int64

In [56]:
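# Skewness of the numeric columns only (object columns such as Start/End are skipped)
DataClass.df.skew(numeric_only=True)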

Bearer Id                                     0.026666
Start ms                                      0.000968
End ms                                       -0.001163
Dur. (ms)                                     3.952609
IMSI                                         41.045956
MSISDN/Number                               332.155856
IMEI                                          1.071470
Avg RTT DL (ms)                              62.907828
Avg RTT UL (ms)                              28.457415
Avg Bearer TP DL (kbps)                       2.589437
Avg Bearer TP UL (kbps)                       4.503413
TCP DL Retrans. Vol (Bytes)                  15.951809
TCP UL Retrans. Vol (Bytes)                  84.113393
DL TP < 50 Kbps (%)                          -2.297803
50 Kbps < DL TP < 250 Kbps (%)                3.271453
250 Kbps < DL TP < 1 Mbps (%)                 4.566158
DL TP > 1 Mbps (%)                            5.370351
UL TP < 10 Kbps (%)                          -8.985016
10 Kbps < 

In [52]:
# Explore the relation between the duration ratio and the average bearer TP DL
print(DataClass.df.loc[:,columnName[16:20]].median())
DataClass.df.loc[DataClass.df["Avg Bearer TP UL (kbps)"]==0,columnName[16:20]].isnull().sum()

DL TP < 50 Kbps (%)               100.0
50 Kbps < DL TP < 250 Kbps (%)      0.0
250 Kbps < DL TP < 1 Mbps (%)       0.0
DL TP > 1 Mbps (%)                  0.0
dtype: float64


DL TP < 50 Kbps (%)               173
50 Kbps < DL TP < 250 Kbps (%)    173
250 Kbps < DL TP < 1 Mbps (%)     173
DL TP > 1 Mbps (%)                173
dtype: int64

We notice that, in each case, the duration ratios are NA where `Avg Bearer TP UL (kbps)` and `Avg Bearer TP DL (kbps)` are equal to 0. This is expected: if the average bearer throughput is 0, no duration ratio can be computed.

Let's now fill those missing values with the respective values, i.e. the medians observed above. In the case of `Avg Bearer TP UL (kbps)`, only `UL TP < 10 Kbps (%)` will be filled with $100$ and the remaining columns with $0$.
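
The fill described above can be sketched as follows. This is a minimal example on a toy frame, not the notebook's final implementation: the names `toy`, `fill_values`, `ratio_cols` and `zero_tp` are introduced here for illustration, and only rows whose uplink throughput is 0 are filled.

```python
import numpy as np
import pandas as pd

ul_col = "Avg Bearer TP UL (kbps)"
fill_values = {
    "UL TP < 10 Kbps (%)": 100,
    "10 Kbps < UL TP < 50 Kbps (%)": 0,
    "50 Kbps < UL TP < 300 Kbps (%)": 0,
    "UL TP > 300 Kbps (%)": 0,
}
ratio_cols = list(fill_values)

# Toy data (assumption): row 1 has zero throughput and missing ratios,
# row 2 has missing ratios but non-zero throughput and must stay untouched
toy = pd.DataFrame({
    ul_col: [52.0, 0.0, 13.0],
    "UL TP < 10 Kbps (%)": [90.0, np.nan, np.nan],
    "10 Kbps < UL TP < 50 Kbps (%)": [10.0, np.nan, np.nan],
    "50 Kbps < UL TP < 300 Kbps (%)": [0.0, np.nan, np.nan],
    "UL TP > 300 Kbps (%)": [0.0, np.nan, np.nan],
})

# Select the zero-throughput rows and fill each ratio column with its own value
zero_tp = toy[ul_col] == 0
toy.loc[zero_tp, ratio_cols] = toy.loc[zero_tp, ratio_cols].fillna(value=fill_values)
print(toy[ratio_cols])
```

Passing a dictionary to `fillna` fills each column with its own value, and restricting the assignment to the masked rows leaves NAs with a non-zero throughput for a separate decision.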

In [40]:
# Values used to fill the missing UL duration ratios
useToFill = {'UL TP < 10 Kbps (%)': 100, '10 Kbps < UL TP < 50 Kbps (%)': 0,
             '50 Kbps < UL TP < 300 Kbps (%)': 0, 'UL TP > 300 Kbps (%)': 0}
useToFill.keys()

dict_keys(['UL TP < 10 Kbps (%)', '10 Kbps < UL TP < 50 Kbps (%)', '50 Kbps < UL TP < 300 Kbps (%)', 'UL TP > 300 Kbps (%)'])

In [49]:
test = DataClass.df.loc[DataClass.df["Avg Bearer TP UL (kbps)"]==0,columnName[20:24]]
# test = test.fillna(useToFill)
test.isnull().sum()

UL TP < 10 Kbps (%)               791
10 Kbps < UL TP < 50 Kbps (%)     791
50 Kbps < UL TP < 300 Kbps (%)    791
UL TP > 300 Kbps (%)              791
dtype: int64

In [50]:
test.median()

UL TP < 10 Kbps (%)               100.0
10 Kbps < UL TP < 50 Kbps (%)       0.0
50 Kbps < UL TP < 300 Kbps (%)      0.0
UL TP > 300 Kbps (%)                0.0
dtype: float64