<a href="https://colab.research.google.com/github/simontirvine/msc/blob/main/EDA_Call_Centre_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import and preprocess the data

This cell loads the data with pd.read_csv(), then converts the 'Answer Rate' and 'Service Level (20 Seconds)' columns from percentage strings to floats by stripping the % sign and dividing by 100. It converts the time columns ('Answer Speed (AVG)', 'Talk Duration (AVG)', 'Waiting Time (AVG)') from strings to timedelta with pd.to_timedelta(). Finally, it prints the data type and the missing-value count of each column to confirm that the dataframe was preprocessed correctly.

Add further preprocessing steps as your analysis requires.

In [None]:
import pandas as pd

# load the dataframe
df = pd.read_csv('call_center_records.csv')

# convert the 'Answer Rate' column from string to float
df['Answer Rate'] = df['Answer Rate'].str.rstrip('%').astype('float') / 100

# convert the 'Answer Speed (AVG)' column from string to timedelta
df['Answer Speed (AVG)'] = pd.to_timedelta(df['Answer Speed (AVG)'])

# convert the 'Talk Duration (AVG)' column from string to timedelta
df['Talk Duration (AVG)'] = pd.to_timedelta(df['Talk Duration (AVG)'])

# convert the 'Waiting Time (AVG)' column from string to timedelta
df['Waiting Time (AVG)'] = pd.to_timedelta(df['Waiting Time (AVG)'])

# convert the 'Service Level (20 Seconds)' column from string to float
df['Service Level (20 Seconds)'] = df['Service Level (20 Seconds)'].str.rstrip('%').astype('float') / 100

# check the data types of each column to ensure they are correct
print(df.dtypes)

# check for missing values
print(df.isnull().sum())

Index                                   int64
Incoming Calls                          int64
Answered Calls                          int64
Answer Rate                           float64
Abandoned Calls                         int64
Answer Speed (AVG)            timedelta64[ns]
Talk Duration (AVG)           timedelta64[ns]
Waiting Time (AVG)            timedelta64[ns]
Service Level (20 Seconds)            float64
dtype: object
Index                         0
Incoming Calls                0
Answered Calls                0
Answer Rate                   0
Abandoned Calls               0
Answer Speed (AVG)            0
Talk Duration (AVG)           0
Waiting Time (AVG)            0
Service Level (20 Seconds)    0
dtype: int64
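
The conversions above depend on the string formats in the CSV, which this notebook does not show. As a minimal sketch, assuming percentages look like '94.01%' and durations look like '0:00:17', the same steps can be tried on a small made-up frame. The sample values and the helper name percent_to_fraction are hypothetical; errors="coerce" is an optional hardening step that turns unparseable entries into NaN/NaT instead of raising.

```python
import pandas as pd

# Hypothetical sample rows; the real file's exact string formats are assumed.
sample = pd.DataFrame({
    "Answer Rate": ["94.01%", "100.00%", "87.50%"],
    "Answer Speed (AVG)": ["0:00:17", "0:00:03", "0:01:05"],
})


def percent_to_fraction(series):
    """Convert strings such as '94.01%' to floats between 0 and 1."""
    # errors="coerce" maps anything that is not a number to NaN
    return pd.to_numeric(series.str.rstrip("%"), errors="coerce") / 100


sample["Answer Rate"] = percent_to_fraction(sample["Answer Rate"])

# Parse 'H:MM:SS' strings; unparseable entries become NaT
sample["Answer Speed (AVG)"] = pd.to_timedelta(
    sample["Answer Speed (AVG)"], errors="coerce"
)

# Plain seconds are often easier to summarise and plot than timedeltas
sample["Answer Speed (s)"] = sample["Answer Speed (AVG)"].dt.total_seconds()

print(sample.dtypes)
print(sample)
```

With coercion enabled, the missing-value check that follows the conversions also reveals any rows whose strings did not match the assumed format.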


# Perform data quality checks

This cell checks the shape of the dataframe, its column names, the data type of each column and the missing-value counts using shape, columns, dtypes and isnull(). It then counts duplicate rows with duplicated(), prints summary statistics for the numerical columns with describe(), counts the distinct values in each column with nunique(), and shows the distribution of the target variable 'Answer Rate' as relative frequencies with value_counts(normalize=True). Together these checks help surface issues such as missing values, outliers and duplicate rows.

You can also use other libraries, such as missingno to visualize missing values, scipy.stats for outlier detection, and pandas_profiling (since renamed ydata-profiling) for a comprehensive report on the dataframe and its variables.
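
As one illustration of the outlier detection mentioned above, here is a minimal sketch that flags values with scipy.stats.zscore. The call counts are made up for the example, and the cut-off of 3 standard deviations is a common rule of thumb rather than anything this notebook prescribes.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical daily call counts with one obviously extreme value at the end
calls = pd.Series(
    [95, 102, 98, 110, 105, 99, 101, 97, 103, 100,
     96, 104, 108, 94, 107, 99, 101, 102, 98, 1000],
    name="Incoming Calls",
)

# Absolute z-score: distance from the mean in units of standard deviation
z = np.abs(stats.zscore(calls.to_numpy()))

# Keep the rows whose z-score exceeds the assumed threshold of 3
outliers = calls[z > 3]
print(outliers)
```

Z-scores assume roughly symmetric data and need a reasonably large sample, because a single extreme value inflates the standard deviation it is measured against. For skewed columns, a rule based on the interquartile range may be a better fit.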

In [None]:
# check the shape of the dataframe
print(df.shape)

# check the column names
print(df.columns)

# check the data types of each column
print(df.dtypes)

# check for missing values
print(df.isnull().sum())

# check for duplicate rows
print(df.duplicated().sum())

# check the summary statistics of the numerical columns
print(df.describe())

# count the unique values in each column
for col in df.columns:
    print(f"{col}: {df[col].nunique()}")

# check the distribution of the target variable
print(df['Answer Rate'].value_counts(normalize=True))


(1251, 9)
Index(['Index', 'Incoming Calls', 'Answered Calls', 'Answer Rate',
       'Abandoned Calls', 'Answer Speed (AVG)', 'Talk Duration (AVG)',
       'Waiting Time (AVG)', 'Service Level (20 Seconds)'],
      dtype='object')
Index                                   int64
Incoming Calls                          int64
Answered Calls                          int64
Answer Rate                           float64
Abandoned Calls                         int64
Answer Speed (AVG)            timedelta64[ns]
Talk Duration (AVG)           timedelta64[ns]
Waiting Time (AVG)            timedelta64[ns]
Service Level (20 Seconds)            float64
dtype: object
Index                         0
Incoming Calls                0
Answered Calls                0
Answer Rate                   0
Abandoned Calls               0
Answer Speed (AVG)            0
Talk Duration (AVG)           0
Waiting Time (AVG)            0
Service Level (20 Seconds)    0
dtype: int64
0
             Index  Incoming Calls  Ans

# Perform some simple statistical analysis