### Problem Statement

The goal is to predict an organization's total daily data consumption from the following usage features:

Date: Specific date

Day of Week: Day number, from Monday (1) to Sunday (7).

Weekend: Whether the day falls on a weekend.
    1 - weekend.
    0 - not a weekend.

Holiday: All holidays, including weekends. 1 - holiday. 0 - not a holiday.

Internal Events: Whether an internal event takes place in the organization on that day. 1 - event. 0 - no event. When there is an event, many employees may attend it.

Offline Job Count: Number of servers or cron jobs running offline.

Network Down: Number of network interruptions on that day.

Employees Count: Number of employees who came to the office.

Social Media Count: Number of employees who have access to social media pages, YouTube, etc.

Guest Count: Number of guests who came to the office. Guests may be clients, interview candidates, etc.

Mobile Usage Count: Number of employees who have access to the office network on their personal phones/tablets.

Total Swipe Hours: Total swipe hours of employees and guests.

total_usage: Target (to be predicted). Total data consumption on that day.
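The calendar features described above can be derived from the date itself. The following is a minimal sketch, assuming the Monday = 1 ... Sunday = 7 encoding that the sample rows suggest; the frame `sample` is hypothetical and only reuses the first five dates of the dataset.

```python
import pandas as pd

# Hypothetical mini-frame holding the first five dates of the dataset.
sample = pd.DataFrame({
    "Date": pd.to_datetime([
        "2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-05",
    ])
})

# pandas numbers Monday as 0, so add 1 to get Monday = 1 ... Sunday = 7.
sample["Day of Week"] = sample["Date"].dt.dayofweek + 1

# Saturday (6) and Sunday (7) are flagged as weekend days.
sample["Weekend"] = (sample["Day of Week"] >= 6).astype(int)

print(sample)
```

Holiday cannot be derived this way, because it also depends on the organization's holiday calendar.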

### Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np

#### Reading the data

In [2]:
train = pd.read_csv("data_consumption.csv")

In [3]:
train.head()

Unnamed: 0,Date,Day of Week,Weekend,Holiday,Internal Events,Offline Job Count,Network Down,Employees Count,Social Media Count,Guest Count,Mobile Usage Count,Total Swipe Hours,total_usage
0,2015-01-01,4,0,1,0,18,1,1,0,0,0,8.101913,38.049178
1,2015-01-02,5,0,0,1,116,3,1020,693,8,94,8321.136018,5638.464285
2,2015-01-03,6,1,1,0,31,0,46,19,0,0,372.688014,607.002069
3,2015-01-04,7,1,1,0,29,0,31,11,0,0,251.159313,417.100325
4,2015-01-05,1,0,1,0,27,0,5,0,0,0,40.509567,97.364743


In [4]:
train.shape

(1886, 13)

### Extracting Categorical and Numerical Columns

In [5]:
categorical_columns = ['Day of Week', 'Weekend', 'Holiday', 'Internal Events']
numerical_columns = ['Offline Job Count', 'Network Down', 'Employees Count', 'Social Media Count', 'Guest Count', 'Mobile Usage Count', 'Total Swipe Hours', 'total_usage']

### Data Normalization

##### 1. Offline Job Count

In [6]:
train['Offline Job Count'].skew()

-0.26842670367831006

##### 2. Employees Count

In [7]:
train['Employees Count'].skew()

-0.38711970858523664

##### 3. Social Media Count

In [8]:
train['Social Media Count'].skew()

-0.3390035681010024

##### 4. Mobile Usage Count

In [9]:
train['Mobile Usage Count'].skew()

-0.1264552148264044

##### 5. Total Swipe Hours

In [10]:
train['Total Swipe Hours'].skew()

-0.38724780576889356

##### 6. Total Usage

In [11]:
train['total_usage'].skew()

-0.19148987953549765

The skewness of every numerical column checked above is small (between roughly -0.39 and -0.13), so the distributions are close to symmetric and we do not need to transform these columns.

Weekend and weekday values form clearly separate groups.
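The per-column checks above can also be run in a single call, and the weekend/weekday comparison can be made with a group-by. This is a sketch on synthetic data, not on the real file: the frame `demo` and its values are invented for illustration, with one roughly symmetric column and one deliberately right-skewed column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (values are invented).
demo = pd.DataFrame({
    "Weekend": rng.integers(0, 2, size=500),
    "Employees Count": rng.normal(500, 50, size=500),       # roughly symmetric
    "Total Swipe Hours": rng.exponential(100, size=500),    # right-skewed
})

# Skewness of several columns at once; values near 0 indicate symmetry.
skews = demo[["Employees Count", "Total Swipe Hours"]].skew()
print(skews)

# Summary statistics per group, to compare weekends (1) with weekdays (0).
print(demo.groupby("Weekend")["Employees Count"].describe())
```

A column whose skewness is far from 0, like the exponential one here, would be a candidate for a transform such as a logarithm; the columns in this dataset did not need one.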

### Separating features and target variable, and dropping the Date column

In [12]:
target = train['total_usage']
train.drop(columns=['Date','total_usage'], inplace=True)

### Splitting the whole data into training and testing data

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.2, random_state=9)
print(X_train.shape)
print(X_test.shape)

(1508, 11)
(378, 11)


### Defining the metric to evaluate the model - RMSE

In [14]:
from sklearn.metrics import mean_squared_error
def rmse(expected, predicted):
    return np.sqrt(mean_squared_error(expected, predicted))
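RMSE is the square root of the mean of the squared differences between expected and predicted values. A small worked example may help to read the numbers reported later; `rmse_numpy` is a hypothetical NumPy-only equivalent written for this illustration.

```python
import numpy as np

def rmse_numpy(expected, predicted):
    """Root mean squared error computed with NumPy only."""
    expected = np.asarray(expected, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((expected - predicted) ** 2))

# Errors are 0, 0 and -2, so the squared errors are 0, 0 and 4.
# Their mean is 4/3, and the square root of 4/3 is about 1.1547.
value = rmse_numpy([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(value)
```

Because RMSE is expressed in the same units as the target, it should be judged against the scale of `total_usage`.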

### Defining Model - Polynomial Regression - degree 3

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Expand the features into all polynomial terms up to degree 3
polynomial_features = PolynomialFeatures(degree=3)
polyDf = polynomial_features.fit_transform(X_train)
# Ordinary least squares on the expanded features
lrPoly2 = LinearRegression()
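It is worth knowing how many columns the degree-3 expansion produces. With n input features and degree d, the number of polynomial terms (including the constant) is C(n + d, d). The sketch below checks this for the 11 features in `X_train`; the all-zeros input is only a placeholder used to fit the transformer.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 11   # number of feature columns, as reported by X_train.shape
degree = 3

# Closed-form count of monomials of total degree <= 3 in 11 variables.
expected_terms = comb(n_inputs + degree, degree)

# Fit on a placeholder row just to let scikit-learn count the output columns.
poly = PolynomialFeatures(degree=degree)
poly.fit(np.zeros((1, n_inputs)))

print(expected_terms, poly.n_output_features_)
```

The model therefore fits 364 coefficients on 1508 training rows, which is one reason to compare training and test error carefully.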

### Training the Model

In [16]:
lrPoly2.fit(polyDf, y_train)
train_pred = lrPoly2.predict(polyDf)
print('RMSE:', rmse(y_train, train_pred))

RMSE: 0.00824474564178414


### Testing the Model

In [17]:
# Reuse the transformer fitted on the training data; do not refit it on the test set
polyDfTest = polynomial_features.transform(X_test)
y_pred = lrPoly2.predict(polyDfTest)
print('RMSE:', rmse(y_test, y_pred))

RMSE: 0.030544667671774176
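One way to guarantee that the feature expansion is only ever fitted on the training split is to wrap both steps in a scikit-learn pipeline. This is a sketch under stated assumptions, not the notebook's own method: the data is synthetic (an exact cubic with no noise) and names such as `model`, `X_tr`, and `test_rmse` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(9)

# Synthetic data: the target is an exact cubic polynomial of three features.
X = rng.uniform(-1, 1, size=(200, 3))
y = 2.0 * X[:, 0] ** 3 - X[:, 1] * X[:, 2] + 0.5

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=9)

# fit() fits the expansion and the regression on the training data only;
# predict() then applies the already fitted expansion to new data.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X_tr, y_tr)

test_rmse = np.sqrt(np.mean((y_te - model.predict(X_te)) ** 2))
print(test_rmse)
```

Since the synthetic target is itself a cubic, the test error here is close to zero; on real data the gap between training and test RMSE is the number to watch.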


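Both the training RMSE (about 0.008) and the test RMSE (about 0.031) are very small compared with the scale of total_usage, which runs into the thousands in the sample rows above. We can therefore finalize this model as the trained model.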