# Conference Call Bandwidth Predictive Modelling

In this project, we seek to predict the data usage of several online conference tools, including Zoom, Google Meet, Hangouts, Mixlr and Free Conference Call.

In [None]:
!conda install -y gdown

In [None]:
!gdown https://drive.google.com/uc?id=1VJYssibEIC1TnOGuoGscZ0rnOciaW8J9

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('/kaggle/working/online_meetings_v1.csv')
df

### Dataset Description
Each datapoint in the dataset describes a single online meeting: the platform used, the start and end times, and the duration, along with other details. Those details include:

1. **Participant Video on** - Whether the observed participant had their video on
2. **Participant Mic on** - Whether the observed participant had their microphone on
3. **Participant Screen Share** - Whether the observed participant shared their screen
4. **Others Video on** - Whether the other participants in the call (excluding the observed one) had their video on
5. **Others Screen Share** - Whether the other participants shared their screens
6. **Window Minimized** - Whether the application window was minimized
7. **Group** - Whether the call was a group call
8. **Download** - Total data downloaded (in megabytes) over the duration of the meeting
9. **Upload** - Total data uploaded (in megabytes) over the duration of the meeting
10. **Total** - Total data transferred (in megabytes)

## Data Cleaning

Checking for null data

In [None]:
df.isnull().sum()

In [None]:
df.dtypes

In [None]:
df.Platform.value_counts()

Normalizing the platform names

In [None]:
df.replace({'Zoom ':'zoom', 'Zoom':'zoom', 
            'Free Conference Call':'free_conf_call', 
            'Google Meet':'google_meet', 'Mixlr (audio)':'mixlr', 
            'Hangouts':'hangouts', 'Google Duo': 'google_duo'}, inplace=True)
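
The explicit mapping above covers the label variants found in this dataset. As a hedged alternative, the sketch below trims whitespace and lower-cases the labels before applying a shorter mapping, so that variants such as `'Zoom '` and `'Zoom'` collapse automatically. The sample labels and the `short_names` dictionary are illustrative assumptions, not values read from the CSV.

```python
import pandas as pd

# Hypothetical sample of raw platform labels, including a trailing-space variant
raw = pd.Series(['Zoom ', 'Zoom', 'Google Meet', 'Mixlr (audio)'], name='Platform')

# Illustrative mapping from cleaned labels to short identifiers
short_names = {'google meet': 'google_meet', 'mixlr (audio)': 'mixlr'}

# Trim whitespace and lower-case first, so 'Zoom ' and 'Zoom' become one label
cleaned = raw.str.strip().str.lower().replace(short_names)
print(cleaned.tolist())
```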

Extracting the hour and minute from the start and end times, and the duration in seconds

In [None]:
start_hr = pd.DataFrame({'start_hr':[pd.Timestamp(i).hour for i in df['Start Time']]})
start_min = pd.DataFrame({'start_min':[pd.Timestamp(i).minute for i in df['Start Time']]})

In [None]:
end_hr = pd.DataFrame({'end_hr':[pd.Timestamp(i).hour for i in df['End Time']]})
end_min = pd.DataFrame({'end_min':[pd.Timestamp(i).minute for i in df['End Time']]})

In [None]:
# total_seconds() covers the whole duration; .seconds would drop any full days
duration_sec = pd.DataFrame({'duration_sec':[pd.Timedelta(i).total_seconds() for i in df['Duration']]})
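
The same features can also be derived without Python-level loops. The following is a minimal sketch using the pandas `.dt` accessors; the `sample` frame and its timestamp format are assumptions made for illustration, since the real CSV's format is not shown here.

```python
import pandas as pd

# Hypothetical rows; the real CSV's timestamp format may differ
sample = pd.DataFrame({
    'Start Time': ['2020-05-01 12:15:00', '2020-05-01 09:05:30'],
    'Duration': ['00:26:00', '01:00:30'],
})

start = pd.to_datetime(sample['Start Time'])
features = pd.DataFrame({
    'start_hr': start.dt.hour,
    'start_min': start.dt.minute,
    # total_seconds() counts the whole duration, including hours and days
    'duration_sec': pd.to_timedelta(sample['Duration']).dt.total_seconds(),
})
print(features)
```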

Adding average upload and download speeds (megabytes per second multiplied by 8, giving megabits per second)

In [None]:
df_dummy = df

In [None]:
upload_speed = pd.DataFrame({'avg_upload_speed': [(b/s)*8 for b, s in zip(df.Upload, duration_sec.duration_sec)]})
download_speed = pd.DataFrame({'avg_download_speed': [(b/s)*8 for b, s in zip(df.Download, duration_sec.duration_sec)]})
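
To make the unit conversion explicit, here is a small worked sketch of the same formula. The function name `avg_speed_mbps` and the guard for zero-length meetings are additions for illustration and are not part of the notebook.

```python
def avg_speed_mbps(megabytes, duration_sec):
    """Average speed in megabits per second for a transfer of `megabytes` MB."""
    if duration_sec <= 0:
        # Guard against zero-length meetings, which would divide by zero
        return float('nan')
    return (megabytes / duration_sec) * 8  # 1 byte = 8 bits

# Illustrative numbers: 195 MB over a 26-minute (1560 s) call
print(avg_speed_mbps(195, 1560))
```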

Concatenating with the dataset

In [None]:
df_dummy = pd.concat([df, start_hr, start_min, end_hr, end_min, duration_sec, download_speed, upload_speed], axis=1)
df_dummy.head()

## Data Visualization

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style="whitegrid")

Distribution of Conference Call Platforms


In [None]:
_ = sns.countplot(x=df_dummy.Platform)

Total usage against meeting duration

1. Platforms

In [None]:
_ = sns.scatterplot(x=df_dummy.duration_sec, y=df_dummy.Total, hue=df_dummy.Platform, palette='Set2')

2. Participant Video On

In [None]:
_ = sns.scatterplot(x=df_dummy.duration_sec, y=df_dummy.Upload, hue=df_dummy['Participant Video on'], palette='Set2')

### Upload

In [None]:
_ = sns.scatterplot(x=df_dummy.duration_sec, y=df_dummy.Upload, hue=df_dummy['Participant Screen Share'], palette='Set2')

### Download

In [None]:
_ = sns.scatterplot(x=df_dummy.duration_sec, y=df_dummy.Download, hue=df_dummy['Others Video on'], palette='Set2')

In [None]:
_ = sns.scatterplot(x=df_dummy.duration_sec, y=df_dummy.Download, hue=df_dummy['Others Screen Share'], palette='Set2')

In [None]:
_ = sns.scatterplot(x=df_dummy.duration_sec, y=df_dummy.Total, hue=df_dummy['Window Minimized'], palette='Set2')

Considering individual platforms.

In [None]:
def scatter(x, y, **kwargs):
    # Forward the color and label that FacetGrid.map supplies for each hue level
    sns.scatterplot(x=x, y=y, **kwargs)

g = sns.FacetGrid(df_dummy, col='Platform', row='Others Video on', hue='Participant Video on', aspect=1, palette='Set2')
_ = g.map(scatter, "duration_sec", "Total")
plt.subplots_adjust(hspace=0.5, wspace=0.3)
_ = g.add_legend()
_ = g.fig.set_size_inches(15,5)
_ = g.fig.suptitle("Video", y=1.08)
_ = g.set_titles(row_template = 'Others Vid {row_name}', col_template = '{col_name}')

In [None]:
def scatter(x, y, **kwargs):
    # Forward the color and label that FacetGrid.map supplies for each hue level
    sns.scatterplot(x=x, y=y, **kwargs)

g = sns.FacetGrid(df_dummy, col='Platform', row='Others Screen Share', hue='Participant Screen Share', aspect=1, palette='Set2')
_ = g.map(scatter, "duration_sec", "Total")
plt.subplots_adjust(hspace=0.5, wspace=0.3)
_ = g.fig.set_size_inches(15,5)
_ = g.add_legend()
_ = g.fig.suptitle("Screen Share", y=1.08)
_ = g.set_titles(row_template = 'Others Screen {row_name}', col_template = '{col_name}')

Everything checks out. Viewing the correlations

In [None]:
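# Restrict to numeric columns; the text columns (e.g. Platform) cannot be correlated
df_dummy.corr(numeric_only=True)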

Dummy Coding Platforms

In [None]:
platform_dummy = pd.get_dummies(df_dummy.Platform, drop_first=True)
platform_dummy.head()
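
With `drop_first=True`, pandas omits the indicator column for the first category in sorted order, and that category becomes the implicit baseline (all remaining indicators are zero for it). The sketch below shows this on a few illustrative labels; the `platforms` series is made up for the example.

```python
import pandas as pd

# Illustrative labels; 'free_conf_call' sorts first, so its column is dropped
platforms = pd.Series(['zoom', 'google_meet', 'free_conf_call', 'zoom'], name='Platform')
dummies = pd.get_dummies(platforms, drop_first=True)
print(dummies.columns.tolist())
```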

In [None]:
df_dummy = pd.concat([df_dummy, platform_dummy], axis=1)

Confirming datatypes

In [None]:
df_dummy.dtypes

Dropping the original time columns and the platform column

In [None]:
df_dummy.drop(['Start Time', 'End Time', 'Duration', 'Platform'], axis=1, inplace=True)
df_dummy.head()

## Predictive Modelling for Zoom Bandwidth Consumption

Creating X and y variables

In [None]:
X = df_dummy[df_dummy.zoom == 1].drop(['Date', 'Download', 'Upload', 'Total', 'google_meet', 'mixlr', 'zoom', 'hangouts', 'google_duo', 'Window Minimized'], axis=1)
X.head()

In [None]:
y = df_dummy.Total[df_dummy.zoom == 1]
y.shape

Splitting the dataset

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=92)

Preparing the model

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
_ = regressor.fit(X_train, y_train)

In [None]:
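# Score on the held-out test set only; scoring on all of X would include the training rows
regressor.score(X_test, y_test)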
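
For a `LinearRegression`, `score` returns the coefficient of determination, R². As a rough illustration of what that number means, the sketch below computes R² by hand with NumPy; the helper name `r_squared` and the sample values are assumptions for the example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))
```

A value of 1 means perfect predictions, while a value of 0 means the model does no better than always predicting the mean of the targets.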

In [None]:
y_pred = regressor.predict(X_test)
# Data usage cannot be negative, so negative predictions are replaced with a floor of 5
y_pred = [5 if i < 0 else i for i in y_pred]

In [None]:
df_result = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df_result.head()

In [None]:
df_result.head(30).plot(kind='bar', figsize=(20,8))

In [None]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
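
The three error metrics above are simple functions of the prediction errors. The sketch below recomputes them by hand with NumPy on made-up values, to show what each one measures; `actual` and `predicted` are illustrative and not taken from the dataset.

```python
import numpy as np

# Illustrative values, in megabytes
actual = np.array([10.0, 20.0, 30.0])
predicted = np.array([12.0, 18.0, 33.0])

errors = predicted - actual
mae = np.mean(np.abs(errors))   # mean absolute error, same units as the target
mse = np.mean(errors ** 2)      # mean squared error, penalizes large misses more
rmse = np.sqrt(mse)             # root of MSE, back in the target's units
print(mae, mse, rmse)
```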

### Predicting Bandwidth Consumption of this Zoom Lecture

In [None]:
print(X.avg_upload_speed.mean())
print(X.avg_download_speed.mean())

In [None]:
is_group = 1
participant_video = 0
participant_mic = 1
participant_screen = 1
others_video = 0
others_screen = 0
start_hr = 12
start_min = 15
end_hr = 12
end_min = 41
avg_download_speed = 0.34
avg_upload_speed = 0.032

eie_523_class = pd.DataFrame({
    'Participant Video on': participant_video,
    'Participant Mic On': participant_mic,
    'Participant Screen Share': participant_screen,
    'Others Video on': others_video,
    'Others Screen Share': others_screen,
    'Group': is_group,
    'start_hr': start_hr,
    'start_min': start_min,
    'end_hr': end_hr,
    'end_min': end_min,
    'duration_sec': (end_min - start_min + (end_hr - start_hr) * 60) * 60,
    'avg_download_speed': avg_download_speed,
    'avg_upload_speed': avg_upload_speed,
}, index=[0])
eie_523_class.head()

In [None]:
prediction = regressor.predict(eie_523_class)
prediction = [3 if i < 3 else i for i in prediction]
print('Total bandwidth to be consumed is: ' + str(prediction))
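
Recent scikit-learn versions check that the column names passed to `predict` match those seen during `fit`, so a hand-built row needs the same names in the same order as the training features. The sketch below shows one possible way to enforce that; `align_features`, `training_columns`, and `new_row` are hypothetical names introduced for the example.

```python
import pandas as pd

def align_features(row, feature_names):
    """Return `row` with exactly the columns in `feature_names`, in that order."""
    missing = [name for name in feature_names if name not in row.columns]
    if missing:
        # Fail early with a readable message instead of a silent mismatch
        raise KeyError(f'Missing feature columns: {missing}')
    return row[list(feature_names)]

# Hypothetical training columns and a new row built in a different order
training_columns = ['Group', 'duration_sec']
new_row = pd.DataFrame({'duration_sec': 1560, 'Group': 1}, index=[0])
aligned = align_features(new_row, training_columns)
print(aligned.columns.tolist())
```

In this notebook, the call would look something like `regressor.predict(align_features(eie_523_class, X.columns))`.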