# Telstra Network Disruptions

Telstra is the largest telecommunications provider in Australia. A few years ago the company posted this challenge on Kaggle as a recruitment exercise for data scientists. We are given a data set drawn from Telstra’s service logs and asked to predict the severity of service disruptions: whether a disruption is a temporary glitch or a critical fault that results in total loss of service. The challenge was designed to simulate the kind of problem a member of Telstra’s data science team might encounter.

# Data Preprocessing

**Making the initial imports and suppressing unwanted warning messages**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style("whitegrid")
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import warnings
warnings.filterwarnings('ignore')

**Reading the files**

In [None]:
train = pd.read_csv('../input/telstra-recruiting-network/train.csv.zip')
test = pd.read_csv('../input/telstra-recruiting-network/test.csv.zip')
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines/warn_bad_lines, which were removed in pandas 2.0
severity_type = pd.read_csv('../input/telstra-recruiting-network/severity_type.csv.zip', on_bad_lines='skip')
resource_type = pd.read_csv('../input/telstra-recruiting-network/resource_type.csv.zip', on_bad_lines='skip')
log_failure = pd.read_csv('../input/telstra-recruiting-network/log_feature.csv.zip', on_bad_lines='skip')
event_type = pd.read_csv('../input/telstra-recruiting-network/event_type.csv.zip', on_bad_lines='skip')

**Printing the shape of all given files**

In [None]:
print('The shape of test set is: {}\n'.format(test.shape))
print('The shape of train set is: {}\n'.format(train.shape))
print('The shape of severity_type is: {}\n'.format(severity_type.shape))
print('The shape of resource_type is: {}\n'.format(resource_type.shape))
print('The shape of log_failure is: {}\n'.format(log_failure.shape))
print('The shape of event_type is: {}'.format(event_type.shape))

**Checking the head of training file before merging it with other files**

In [None]:
train.head()

**Merging the data sets to have all the available info**

In [None]:
train_1 = train.merge(severity_type, how = 'left', left_on='id', right_on='id')
train_2 = train_1.merge(resource_type, how = 'left', left_on='id', right_on='id')
train_3 = train_2.merge(log_failure, how = 'left', left_on='id', right_on='id')
train_4 = train_3.merge(event_type, how = 'left', left_on='id', right_on='id')

**Checking the head after merging**

In [None]:
train_4.head()

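**As we can see, the merged table contains several rows per `id`, because the lookup tables can hold more than one row for the same `id`. Let's keep only the first row for each `id`.**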

In [None]:
train_4.drop_duplicates(subset= 'id', keep= 'first', inplace = True)
train_4.head()
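
The repeated rows come from the joins themselves: a left merge repeats a row from the left table once for every matching row on the right. The toy frames below (made-up values, not the Telstra data) sketch this. They also show one possible alternative to `drop_duplicates`, which is to collapse a lookup table to one row per `id` before merging. The aggregation used here (a simple count, stored in a hypothetical `n_events` column) is only an illustration, not what this notebook does.

```python
import pandas as pd

# Toy stand-ins for train and a one-to-many lookup table (hypothetical values)
train_toy = pd.DataFrame({'id': [1, 2], 'location': ['location 1', 'location 2']})
events_toy = pd.DataFrame({'id': [1, 1, 1, 2],
                           'event_type': ['event_type 11', 'event_type 15',
                                          'event_type 20', 'event_type 11']})

# A left merge repeats each train row once per matching lookup row
merged = train_toy.merge(events_toy, how='left', on='id')
print(merged.shape)  # (4, 3)

# Keeping the first row per id restores one row per id but drops the other events
first_only = merged.drop_duplicates(subset='id', keep='first')

# Alternative: aggregate the lookup table to one row per id before merging
event_counts = events_toy.groupby('id').size().rename('n_events').reset_index()
aggregated = train_toy.merge(event_counts, how='left', on='id')
print(aggregated)
```

With `keep='first'`, `id` 1 retains only `event_type 11`; the aggregated version keeps the fact that it had three events.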

# Exploratory Data Analysis (EDA)

**Count plot for fault severity**

In [None]:
plt.figure(figsize = (8,6))
sns.countplot(x=train_4['fault_severity'])  # pass the data as x=; recent seaborn versions no longer accept it positionally
plt.show()

The data set is imbalanced: rows with a `fault_severity` of 0 (indicating no fault) far outnumber the other classes, so ML models may be biased towards predicting 0.
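
One way to put numbers on the imbalance is to look at class proportions rather than raw counts. The sketch below uses a made-up label column (not the real `fault_severity` values) and also derives inverse-frequency weights, which is one common way to counteract imbalance. Whether weighting helps on this data set is not something this notebook tests.

```python
import pandas as pd

# Made-up labels standing in for train_4['fault_severity']
labels = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 2], name='fault_severity')

# Share of each class
proportions = labels.value_counts(normalize=True).sort_index()
print(proportions)

# Inverse-frequency weights: n_samples / (n_classes * class_count)
counts = labels.value_counts().sort_index()
class_weights = len(labels) / (len(counts) * counts)
print(class_weights.round(3).to_dict())
```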

**Count plot for severity type**

In [None]:
plt.figure(figsize = (8,6))
sns.countplot(x=train_4['severity_type'])
plt.show()

Severity types 1 and 2 are far more frequent than the others.

**Count plot for resource type**

In [None]:
plt.figure(figsize = (14,6))
sns.countplot(x=train_4['resource_type'])
plt.tight_layout()
plt.show()

Most of the resource types are either type_2 or type_8.

# CatBoost

CatBoost is an open-source gradient boosting library from Yandex, the Russian search company. It works with a wide range of data types and handles categorical features natively, without manual encoding. It also tends to give reasonable results without extensive hyperparameter tuning.

In [None]:
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split

In [None]:
X = train_4[['id', 'location', 'severity_type', 'resource_type',
       'log_feature', 'volume', 'event_type']]
y = train_4.fault_severity

In [None]:
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.20, random_state=101)

In [None]:
categorical_features_indices = np.where(X_train.dtypes == object)[0]
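
The line above collects the positions of the string-valued columns, which is the form the `cat_features` argument accepts. It relies on strings being stored with the `object` dtype, an assumption that may not hold in every pandas version. The sketch below, on a toy frame with made-up values, shows an alternative that treats every non-numeric column as categorical; `X_toy`, `cat_cols` and `cat_idx` are illustrative names.

```python
import pandas as pd

# Toy frame with the same mix of numeric and string columns (hypothetical values)
X_toy = pd.DataFrame({'id': [10, 11],
                      'location': ['location 1', 'location 2'],
                      'severity_type': ['severity_type 1', 'severity_type 2'],
                      'volume': [3, 7]})

# Select every non-numeric column, then look up its integer position
cat_cols = X_toy.select_dtypes(exclude='number').columns
cat_idx = X_toy.columns.get_indexer(cat_cols)
print(list(cat_cols), cat_idx.tolist())
```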

In [None]:
train_dataset = Pool(data=X_train,
                     label=y_train,
                     cat_features=categorical_features_indices)

eval_dataset = Pool(data=X_validation,
                    label=y_validation,
                    cat_features=categorical_features_indices)

In [None]:
model = CatBoostClassifier(iterations=1000,
                           learning_rate=1,
                           depth=2,
                           loss_function='MultiClass',
                           random_seed=1,
                           bagging_temperature=22,
                           od_type='Iter',
                           metric_period=100,
                           od_wait=100)

In [None]:
model.fit(train_dataset, eval_set= eval_dataset, plot= True)

**The model began to overfit after the first iterations, so CatBoost's overfitting detector (`od_type='Iter'`, `od_wait=100`) stopped training early.**
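
With `od_type='Iter'`, training stops once the validation metric has gone `od_wait` iterations without improving on its best value. The function below is a simplified sketch of that rule on a made-up loss curve, not CatBoost's actual implementation; `stop_iteration` is a hypothetical helper name.

```python
def stop_iteration(val_losses, od_wait):
    """Return (stop_index, best_index) under a wait-since-best rule."""
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            # New best validation loss: remember where it happened
            best_idx = i
        elif i - best_idx >= od_wait:
            # No improvement for od_wait iterations: stop here
            return i, best_idx
    return len(val_losses) - 1, best_idx

# Made-up validation losses: improve for three iterations, then get worse
losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.71, 0.75]
stopped_at, best_at = stop_iteration(losses, od_wait=3)
print(stopped_at, best_at)
```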

In [None]:
# Get predicted classes
preds_class = model.predict(eval_dataset)

# Get predicted probabilities for each class
preds_proba = model.predict_proba(eval_dataset)
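
The validation predictions above are computed but not scored. The Kaggle competition ranked submissions by multi-class logarithmic loss, so a natural check is to compute that metric on the held-out split. The sketch below writes it out in NumPy on made-up probabilities and assumes the labels are 0, 1, 2 in the same order as the probability columns. On the real data one would pass `y_validation` and `preds_proba`; `sklearn.metrics.log_loss` should give the same number. `multiclass_log_loss` is an illustrative helper, not part of any library.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)  # renormalise after clipping
    y_true = np.asarray(y_true, dtype=int)
    # Pick, for each row, the probability given to its true class
    true_class_proba = proba[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_proba)))

# Made-up labels and probabilities for classes 0, 1, 2
y_toy = [0, 2, 1]
proba_toy = [[0.7, 0.2, 0.1],
             [0.1, 0.3, 0.6],
             [0.2, 0.5, 0.3]]
toy_loss = multiclass_log_loss(y_toy, proba_toy)
print(toy_loss)
```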

# Getting the test set ready to feed into the model

In [None]:
test.head()

In [None]:
test_1 = test.merge(severity_type, how = 'left', left_on='id', right_on='id')
test_2 = test_1.merge(resource_type, how = 'left', left_on='id', right_on='id')
test_3 = test_2.merge(log_failure, how = 'left', left_on='id', right_on='id')
test_4 = test_3.merge(event_type, how = 'left', left_on='id', right_on='id')

In [None]:
test_4.head()

**Remove duplicate values**

In [None]:
test_4.drop_duplicates(subset= 'id', keep= 'first', inplace = True)
test_4.head()

In [None]:
test_4.isnull().sum()

# Making predictions on test set

In [None]:
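# Class probabilities, one column per fault_severity class
predict_test = model.predict_proba(test_4)
pred_df = pd.DataFrame(predict_test, columns=['predict_0', 'predict_1', 'predict_2'])
# Take ids from test_4 and reset the index so the rows line up with pred_df
submission_cat = pd.concat([test_4[['id']].reset_index(drop=True), pred_df], axis=1)
submission_cat.to_csv('sub_cat_1.csv', index=False, header=True)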

In [None]:
submission_cat.head()

**These are the predicted probabilities. The column with the highest value is the predicted severity class.**
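
To turn the probabilities into a single predicted class, take the column with the largest value in each row. A minimal sketch on made-up rows with the same column names follows; `predicted_severity` is a hypothetical column added for illustration.

```python
import pandas as pd

# Made-up rows in the submission layout
sub_toy = pd.DataFrame({'id': [101, 102],
                        'predict_0': [0.80, 0.10],
                        'predict_1': [0.15, 0.30],
                        'predict_2': [0.05, 0.60]})

proba_cols = ['predict_0', 'predict_1', 'predict_2']
# idxmax(axis=1) gives the name of the largest column; strip the prefix to get the class
sub_toy['predicted_severity'] = (sub_toy[proba_cols].idxmax(axis=1)
                                 .str.replace('predict_', '', regex=False)
                                 .astype(int))
print(sub_toy)
```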