# TASK 1: User Overview analysis

In [1]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import sys
sys.path.append('../scripts')
from Clean_data import clean_data

In [3]:
# Import the dataset
df = pd.read_csv("../data/Week1_challenge_data_source(CSV).csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150001 entries, 0 to 150000
Data columns (total 55 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   Bearer Id                                 149010 non-null  float64
 1   Start                                     150000 non-null  object 
 2   Start ms                                  150000 non-null  float64
 3   End                                       150000 non-null  object 
 4   End ms                                    150000 non-null  float64
 5   Dur. (ms)                                 150000 non-null  float64
 6   IMSI                                      149431 non-null  float64
 7   MSISDN/Number                             148935 non-null  float64
 8   IMEI                                      149429 non-null  float64
 9   Last Location Name                        148848 non-null  object 
 10  Avg RTT DL (ms)     

## Sub-tasks

>## Identifying the top 10 handsets used by the customers

The customers' handset types are stored in the `Handset Type` column. Before extracting information from this column, we should identify each unique user/customer. A customer can be identified by their IMSI, MSISDN/Number, or IMEI.

In [4]:
# Find the column with the fewest missing values
IdVariable = clean_data(df[['IMSI','MSISDN/Number', 'IMEI','Handset Type']])
missingCol,_,_ = IdVariable.missing_values(verbose=False)
missingCol

MSISDN/Number    1066
IMEI              572
Handset Type      572
IMSI              570
dtype: int64

`IMSI` has the fewest missing values (570), so it retains the most rows if we use it to identify each customer, ahead of `IMEI` (572) and `MSISDN/Number` (1066). Nevertheless, `IMEI` is the better choice here. Since we are looking for the handset type, we can focus on the **IMEI**, a unique number that identifies a device on a mobile network. Furthermore, without the IMEI we cannot identify the handset type, which explains why both columns have the same number of missing values (572). Even though IMSI has slightly fewer missing values, that alone does not guarantee the most complete information about the handset type.
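The claim that `IMEI` and `Handset Type` are missing together can be checked directly. The following is a minimal sketch using plain pandas instead of the `clean_data` helper; the `toy` DataFrame and its values are made up for illustration, and on the real data one would apply the same two expressions to `df`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real table (values are invented for illustration)
toy = pd.DataFrame({
    "IMSI": [1.0, 2.0, np.nan, 4.0],
    "IMEI": [10.0, np.nan, 30.0, np.nan],
    "Handset Type": ["A", np.nan, "B", np.nan],
})

# Number of missing values per column, from most to least
missing_counts = toy.isna().sum().sort_values(ascending=False)

# True only if IMEI and Handset Type are missing on exactly the same rows
same_rows_missing = bool((toy["IMEI"].isna() == toy["Handset Type"].isna()).all())

print(missing_counts)
print(same_rows_missing)
```

Equal missing counts alone do not prove the two columns are missing on the same rows, which is why the row-wise comparison is included.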

In [5]:
# Extract the unique customer from the table
UniqueUser = df.loc[:,['IMEI','Handset Manufacturer','Handset Type']].dropna(how="all")
# Drop the duplicates
UniqueUser = UniqueUser.drop_duplicates()
# Count the number of each handset type and identify the top 10
UniqueUser.loc[:,'Handset Type'].value_counts()[:10]

Huawei B528S-23A                10638
Apple iPhone 6S (A1688)          6765
undefined                        6716
Apple iPhone 6 (A1586)           6271
Apple iPhone 7 (A1778)           4721
Apple iPhone Se (A1723)          3764
Apple iPhone 8 (A1905)           3550
Samsung Galaxy S8 (Sm-G950F)     3275
Apple iPhone Xr (A2105)          3077
Samsung Galaxy J5 (Sm-J530)      2760
Name: Handset Type, dtype: int64

In [6]:
# Count the number of each handset type and identify the top 10 (without undefined handset type)
UniqueUser.query("`Handset Type`!='undefined'").loc[:,'Handset Type'].value_counts()[:10]

Huawei B528S-23A                10638
Apple iPhone 6S (A1688)          6765
Apple iPhone 6 (A1586)           6271
Apple iPhone 7 (A1778)           4721
Apple iPhone Se (A1723)          3764
Apple iPhone 8 (A1905)           3550
Samsung Galaxy S8 (Sm-G950F)     3275
Apple iPhone Xr (A2105)          3077
Samsung Galaxy J5 (Sm-J530)      2760
Samsung Galaxy A5 Sm-A520F       2721
Name: Handset Type, dtype: int64
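As an alternative to `drop_duplicates` followed by `value_counts`, the number of distinct devices per handset type can be counted with a `groupby`. This is only a sketch on invented toy data. It also differs slightly from the cells above: rows without an IMEI are dropped here, whereas `dropna(how="all")` keeps a row unless all three selected columns are missing.

```python
import numpy as np
import pandas as pd

# Toy data (invented): device 1 appears in two sessions, one row has no IMEI
toy = pd.DataFrame({
    "IMEI": [1.0, 1.0, 2.0, 3.0, np.nan],
    "Handset Type": ["X", "X", "X", "Y", "Y"],
})

# Count distinct IMEIs per handset type and keep the 10 most common types
devices_per_type = (
    toy.dropna(subset=["IMEI"])
       .groupby("Handset Type")["IMEI"]
       .nunique()
       .sort_values(ascending=False)
       .head(10)
)

print(devices_per_type)
```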

>## Identify the top three handset manufacturers

In [7]:
def topNManufacturer(df,topn=3):
    # Return the topn most frequent manufacturers with their counts
    topNMan = df.loc[:,'Handset Manufacturer'].value_counts()[:topn]
    return topNMan

In [8]:
# Identify the top 3 manufacturers
topNManufacturer(UniqueUser,topn=3)

Apple      42687
Samsung    30981
Huawei     21743
Name: Handset Manufacturer, dtype: int64

>## Identify the top 5 handsets for each of the top 3 handset manufacturers

In [9]:
# Function to extract the top n handset types for each of the top m manufacturers
def topTypeManufact(df=UniqueUser,nmanufact=3,ntype=5):
    topNManufact = df.loc[:,'Handset Manufacturer'].value_counts()[:nmanufact]
    res = pd.DataFrame(columns=['Manufacturer','Type','Count'])
    for manufacturer in topNManufact.index:
        # Filter on the df argument (not the global UniqueUser) so any input table works
        temp = df.loc[df['Handset Manufacturer']==manufacturer,'Handset Type'].value_counts()[:ntype]
        # Use len(temp): a manufacturer may have fewer than ntype handset types
        temp = pd.DataFrame({'Manufacturer':[manufacturer]*len(temp),'Type':temp.index,'Count':temp.to_list()})
        res = pd.concat([res,temp])
    return res.reset_index(drop=True)

In [10]:
# Identify the top 5 handsets per top 3 handset manufacturer
topTypeManufact(df=UniqueUser,nmanufact=3,ntype=5)

|   | Manufacturer | Type | Count |
|---|--------------|------|-------|
| 0 | Apple | Apple iPhone 6S (A1688) | 6765 |
| 1 | Apple | Apple iPhone 6 (A1586) | 6271 |
| 2 | Apple | Apple iPhone 7 (A1778) | 4721 |
| 3 | Apple | Apple iPhone Se (A1723) | 3764 |
| 4 | Apple | Apple iPhone 8 (A1905) | 3550 |
| 5 | Samsung | Samsung Galaxy S8 (Sm-G950F) | 3275 |
| 6 | Samsung | Samsung Galaxy J5 (Sm-J530) | 2760 |
| 7 | Samsung | Samsung Galaxy A5 Sm-A520F | 2721 |
| 8 | Samsung | Samsung Galaxy J3 (Sm-J330) | 2606 |
| 9 | Samsung | Samsung Galaxy S7 (Sm-G930X) | 2310 |
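The same "top n types per top m manufacturers" table can also be built without an explicit loop. The sketch below is one possible approach, not the notebook's method: the function name `top_types_per_manufacturer` and the toy data are hypothetical, and the result keeps the original column names rather than renaming them to `Manufacturer` and `Type`.

```python
import pandas as pd

def top_types_per_manufacturer(data, n_manufacturers=3, n_types=5):
    """Return the n_types most common handset types for each of the
    n_manufacturers most common manufacturers (hypothetical helper)."""
    # Names of the most frequent manufacturers
    top = data["Handset Manufacturer"].value_counts().head(n_manufacturers).index
    subset = data[data["Handset Manufacturer"].isin(top)]
    # Count rows per (manufacturer, type) pair
    counts = (
        subset.groupby(["Handset Manufacturer", "Handset Type"])
        .size()
        .reset_index(name="Count")
    )
    # Sort so that, within each manufacturer, the largest counts come first
    counts = counts.sort_values(
        ["Handset Manufacturer", "Count"], ascending=[True, False]
    )
    # Keep the first n_types rows of each manufacturer group
    return counts.groupby("Handset Manufacturer").head(n_types).reset_index(drop=True)

# Toy data (invented) to illustrate the call
toy = pd.DataFrame({
    "Handset Manufacturer": ["A", "A", "A", "B", "B", "C"],
    "Handset Type": ["a1", "a1", "a2", "b1", "b1", "c1"],
})
result = top_types_per_manufacturer(toy, n_manufacturers=2, n_types=1)
print(result)
```

Note that manufacturers appear in alphabetical order in this version, not in order of their overall rank.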


>## Task 1.1: Get an overview of the users’ behavior on those applications