
# Problem Statement:
The goal of this challenge is to build a predictive model that forecasts the number of incoming calls the Childline center receives during each hour of each day.
Accurate forecasts let Childline allocate resources and staff the call center more efficiently, so that it can respond to more children.
The data are split into a training set and a test set. The training set contains all calls (over 135,000) received from 1 January 2022 to 12 July 2022. The task is to estimate the number of incoming calls per hour per day from 13 July 2022 to 6 September 2022.

In [1]:
# Importing needed libraries for the project

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from sklearn.ensemble import RandomForestRegressor

# Preparing the datasets:
### 1. Public Holiday

In [2]:
holidays = pd.read_csv('/content/drive/MyDrive/Soulpage/PublicHolidays.csv', parse_dates = ['Date'])
holidays.drop(['Unnamed: 2'], axis = 1, inplace = True)  # This column contained only null values
holidays.head(2)

Unnamed: 0,Date,Holiday
0,2022-01-01,New Years Day
1,2022-03-25,Good Friday


### 2. School Dates

In [3]:
schoolDates = pd.read_csv('/content/drive/MyDrive/Soulpage/SchoolDates.csv', parse_dates = ['Opening', 'Closing'])
schoolDates.drop(['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5'], axis = 1, inplace = True)  # These columns contained only null values
schoolDates.head()

Unnamed: 0,Term,Opening,Closing
0,1,2022-01-04,2022-04-08
1,2,2022-05-02,2022-08-05
2,3,2022-09-05,2022-11-18


### 3. Weather Dataset

In [4]:
weather = pd.read_excel('/content/drive/MyDrive/Soulpage/Weather.xlsx')

In [5]:
cols = ['Po', 'P', 'Pa','ff10', 'ff3','W1', 'W2', 'Tn', 'Tx', 'RRR', 'tR', 'E', 'Tg', 'E\'', 'sss']  # Columns containing only null values

weather.drop(cols, axis = 1, inplace = True)

# Row 1139 originally held '29.02.2022 21:00', which is invalid: 2022 is not a leap year, so February has no 29th.
# Assigning through a single .iloc call avoids chained assignment, which may silently modify a copy.
time_col = weather.columns.get_loc('Local time ')
weather.iloc[1139, time_col] = '28.02.2022 21:00'
weather.iloc[1140, time_col] = '28.02.2022 15:00'
weather.iloc[1141, time_col] = '28.02.2022 09:00'
weather.iloc[1142, time_col] = '28.02.2022 03:00'

weather['Local time '] = pd.to_datetime(weather['Local time '], format='%d.%m.%Y %H:%M')

weather.head(2)

Unnamed: 0,Local time,T,U,DD,Ff,N,WW,Cl,Nh,H,Cm,Ch,VV,Td
0,2022-12-31 21:00:00,20.2,68.0,Wind blowing from the east-northeast,8.0,70 – 80%.,,Stratocumulus other than Stratocumulus cumulog...,60%.,600-1000,Altocumulus translucidus at a single level.,"No Cirrus, Cirrocumulus or Cirrostratus.",30.0,14.0
1,2022-12-31 15:00:00,26.0,40.0,Wind blowing from the north-east,9.0,40%.,,"Cumulonimbus capillatus (often with an anvil),...",40%.,600-1000,"No Altocumulus, Altostratus or Nimbostratus.","No Cirrus, Cirrocumulus or Cirrostratus.",30.0,11.2
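
The invalid dates above were corrected by row position. As a minimal sketch of how such rows could be located programmatically, `pd.to_datetime` with `errors='coerce'` turns values that cannot be parsed into `NaT`, so their positions can be listed. The sample values below are hypothetical and only mimic the `dd.mm.YYYY HH:MM` layout of the weather file.

```python
import pandas as pd

# Hypothetical sample in the same 'dd.mm.YYYY HH:MM' layout as the weather file
raw = pd.Series(['01.03.2022 03:00', '29.02.2022 21:00', '28.02.2022 15:00'])

# errors='coerce' converts values that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(raw, format='%d.%m.%Y %H:%M', errors='coerce')

# Positions of the values that failed to parse (here: the non-existent 29 February 2022)
invalid_positions = parsed[parsed.isna()].index.tolist()
print(invalid_positions)
```

Running a check like this before the conversion shows which rows need a manual correction, rather than relying on hard-coded positions.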


### 4. Training Dataset

In [6]:
train_df = pd.read_csv('/content/drive/MyDrive/Soulpage/Train.csv')

train_df = train_df[['calldate', 'cc_status', 'maincat', 'subcat1', 'casepriority']]   # The remaining columns were mostly null, so only these features are kept
train_df.calldate = pd.to_datetime(train_df.calldate)

train_df['hour'] = train_df['calldate'].dt.hour       # Extracting Hour from datetime
train_df['day'] = train_df['calldate'].dt.day         # Extracting day from datetime
train_df['weekday'] = train_df['calldate'].dt.weekday # Extracting weekday from datetime
train_df['month'] = train_df['calldate'].dt.month     # Extracting month from datetime
train_df['year'] = train_df['calldate'].dt.year       # Extracting Year from datetime

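# Note: each merge matches on the exact timestamp, so a call only picks up a holiday, school-term or weather row
# when calldate equals the key to the minute; unmatched values become NaN and are filled with 0 below.
train_df = pd.merge(train_df, holidays, how='left', left_on='calldate', right_on='Date')
train_df = pd.merge(train_df, schoolDates, how='left', left_on='calldate', right_on='Opening')
train_df = pd.merge(train_df, weather, how='left', left_on='calldate', right_on='Local time ')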

train_df.fillna(0, inplace=True)
train_df.head(2)

Unnamed: 0,calldate,cc_status,maincat,subcat1,casepriority,hour,day,weekday,month,year,...,Ff,N,WW,Cl,Nh,H,Cm,Ch,VV,Td
0,2022-01-01 07:26:00,Closed,non-interventional,Blank call,Non Critical,7,1,5,1,2022,...,0.0,0,0,0,0,0,0,0,0.0,0.0
1,2022-01-01 07:32:00,Closed,non-interventional,Blank call,Non Critical,7,1,5,1,2022,...,0.0,0,0,0,0,0,0,0,0.0,0.0
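
The weather columns in the preview are all 0, most likely because the joins compare full timestamps: a call at 07:26 has no weather observation or holiday entry at exactly that minute. The sketch below shows one possible alternative on small hypothetical frames: holidays joined on the calendar date, school terms checked as date ranges, and weather attached with `pd.merge_asof`, which picks the latest observation at or before each call. The frames, the `call_day`, `is_holiday` and `in_term` columns, and the temperatures are illustrative assumptions, not part of the original notebook; the term dates are those shown in the School Dates output.

```python
import pandas as pd

# Hypothetical miniature versions of the notebook's frames
calls = pd.DataFrame({'calldate': pd.to_datetime(
    ['2022-01-01 07:26', '2022-01-01 16:05', '2022-01-05 09:10'])})
holidays = pd.DataFrame({'Date': pd.to_datetime(['2022-01-01']),
                         'Holiday': ['New Years Day']})
terms = pd.DataFrame({'Opening': pd.to_datetime(['2022-01-04', '2022-05-02']),
                      'Closing': pd.to_datetime(['2022-04-08', '2022-08-05'])})
weather = pd.DataFrame({
    'Local time ': pd.to_datetime(['2022-01-01 03:00', '2022-01-01 09:00',
                                   '2022-01-01 15:00', '2022-01-05 09:00']),
    'T': [18.0, 21.5, 26.0, 20.0]})

# Holidays: join on the calendar date (midnight-normalized timestamp)
calls['call_day'] = calls['calldate'].dt.normalize()
merged = calls.merge(holidays, how='left', left_on='call_day', right_on='Date')
merged['is_holiday'] = merged['Holiday'].notna()

# School terms: a call is in term if its date lies between any Opening and Closing
merged['in_term'] = merged['call_day'].apply(
    lambda d: bool(((terms['Opening'] <= d) & (terms['Closing'] >= d)).any()))

# Weather: latest observation at or before each call (both sides must be sorted by key)
merged = pd.merge_asof(merged.sort_values('calldate'),
                       weather.sort_values('Local time '),
                       left_on='calldate', right_on='Local time ',
                       direction='backward')
print(merged[['calldate', 'is_holiday', 'in_term', 'T']])
```

Note that these merged features only influence the forecast if they are aggregated to the hourly level and passed to the model.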


In [7]:
hourly_calls = train_df.resample('H', on='calldate').size().reset_index(name='calls')  # Count the calls in each 1-hour interval (hourly resampling)
hourly_calls.head(2)

Unnamed: 0,calldate,calls
0,2022-01-01 07:00:00,8
1,2022-01-01 08:00:00,41
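
As a small illustration of what the hourly resampling produces, the sketch below runs the same operation on three hypothetical calls. Hours between the first and last call that received no calls appear with a count of 0, while hours before the first call (such as 00:00 to 06:00 on 1 January in the output above) are absent unless the result is reindexed onto a full hourly grid. The `'60min'` alias is used here as an equivalent of `'H'` that is accepted across pandas versions.

```python
import pandas as pd

# Three hypothetical calls: two in the 07:00 hour, none at 08:00, one at 09:00
calls = pd.DataFrame({'calldate': pd.to_datetime(
    ['2022-01-01 07:26', '2022-01-01 07:32', '2022-01-01 09:05'])})

# Same operation as in the notebook: count rows per 1-hour bin
hourly = calls.resample('60min', on='calldate').size().reset_index(name='calls')
print(hourly)

# Optional: extend to a full day so hours outside the observed range count as 0
full_grid = pd.date_range('2022-01-01 00:00', '2022-01-01 23:00', freq='60min')
full = hourly.set_index('calldate').reindex(full_grid, fill_value=0)
print(len(full), full['calls'].sum())
```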


### 5. Testing Dataset

In [8]:
test_df = pd.read_csv("/content/drive/MyDrive/Soulpage/Test.csv")

In [9]:
test_df.rename(columns={'time_index': 'calldate'}, inplace=True)
test_df.calldate = pd.to_datetime(test_df.calldate, format = '%Y%m%d%H')
test_df.head(2)

Unnamed: 0,calldate,calls
0,2022-07-13 00:00:00,28.0
1,2022-07-13 01:00:00,98.0
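
The `format = '%Y%m%d%H'` argument above implies that `time_index` encodes year, month, day and hour as one number, for example `2022071301` for 13 July 2022 at 01:00. A minimal sketch of that conversion on hypothetical values, with an explicit cast to string in case the column is read as integers:

```python
import pandas as pd

# Hypothetical time_index values in YYYYMMDDHH layout
time_index = pd.Series([2022071300, 2022071301, 2022090623])

# Cast to string first so the format is applied character by character
parsed = pd.to_datetime(time_index.astype(str), format='%Y%m%d%H')
print(parsed)
```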


In [10]:
x_test = test_df.drop(['calls'], axis=1)

# Model Development

In [11]:
x_train = hourly_calls.drop(['calls'], axis=1)   # The test data has only calldate and calls, so the training features are limited to calldate
y_train = hourly_calls['calls']

model = RandomForestRegressor(n_estimators=100, random_state=42)

model.fit(x_train, y_train)

In [12]:
y_pred = model.predict(x_test)
y_pred = y_pred.round()
y_pred = pd.DataFrame(y_pred, columns=['calls'])
y_pred.head()

Unnamed: 0,calls
0,49.0
1,49.0
2,49.0
3,49.0
4,49.0
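
Every prediction is 49.0. A likely explanation is that the only feature is the raw timestamp: all test timestamps lie after the training period, and a tree-based model cannot extrapolate beyond the range it was trained on, so every test row ends up in the same leaves. One possible remedy, sketched below on synthetic data, is to replace the timestamp with calendar features (hour, weekday, day, month) whose values recur in the test period. The `calendar_features` helper and the synthetic call counts are assumptions made for illustration; they are not the notebook's method or data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def calendar_features(timestamps):
    """Hypothetical helper: turn timestamps into recurring calendar features."""
    ts = pd.Series(pd.to_datetime(timestamps)).reset_index(drop=True)
    return pd.DataFrame({
        'hour': ts.dt.hour,
        'weekday': ts.dt.weekday,
        'day': ts.dt.day,
        'month': ts.dt.month,
    })

# Synthetic hourly training series with a daily cycle (not the Childline data)
train_times = pd.date_range('2022-01-01 00:00', '2022-07-12 23:00', freq='60min')
rng = np.random.default_rng(0)
hours = train_times.hour.to_numpy()
train_calls = 30 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, len(hours))

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(calendar_features(train_times), train_calls)

# Predict two days after the training period ends
test_times = pd.date_range('2022-07-13 00:00', '2022-07-14 23:00', freq='60min')
preds = model.predict(calendar_features(test_times))

# The predictions now follow the hourly pattern instead of being constant
print(preds.round(1)[:6], preds.max() - preds.min())
```

With the real data, the same features would be derived from `hourly_calls['calldate']` for training and from `test_df['calldate']` for prediction.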


# Saving the predictions: submission.csv

In [13]:
final_df = pd.concat([x_test, y_pred], axis=1)
final_df.head()

Unnamed: 0,calldate,calls
0,2022-07-13 00:00:00,49.0
1,2022-07-13 01:00:00,49.0
2,2022-07-13 02:00:00,49.0
3,2022-07-13 03:00:00,49.0
4,2022-07-13 04:00:00,49.0


In [14]:
final_df.tail()

Unnamed: 0,calldate,calls
1179,2022-09-06 19:00:00,49.0
1180,2022-09-06 20:00:00,49.0
1181,2022-09-06 21:00:00,49.0
1182,2022-09-06 22:00:00,49.0
1183,2022-09-06 23:00:00,49.0


In [15]:
final_df.to_csv('submission.csv', index=True)