# Milestone 1: Data-Driven Optimization of IT Support Team Performance


---

## Module 1: Project Initialization and Dataset Setup

### Objectives
- Define objectives, KPIs, and workflow 
- Load the CSV dataset using pandas 
- Explore schema, data types, and missing values 
- Calculate initial ticket distribution by Type, Priority, and Category 


In [1]:
import pandas as pd
import numpy as np

### Load Dataset

In [4]:
df = pd.read_csv('customer_support_tickets.csv')
df.head()

Unnamed: 0,Ticket ID,Customer Name,Customer Email,Customer Age,Customer Gender,Product Purchased,Date of Purchase,Ticket Type,Ticket Subject,Ticket Description,Ticket Status,Resolution,Ticket Priority,Ticket Channel,First Response Time,Time to Resolution,Customer Satisfaction Rating
0,1,Marisa Obrien,carrollallison@example.com,32,Other,GoPro Hero,2021-03-22,Technical issue,Product setup,I'm having an issue with the {product_purchase...,Pending Customer Response,,Critical,Social media,2023-06-01 12:15:36,,
1,2,Jessica Rios,clarkeashley@example.com,42,Female,LG Smart TV,2021-05-22,Technical issue,Peripheral compatibility,I'm having an issue with the {product_purchase...,Pending Customer Response,,Critical,Chat,2023-06-01 16:45:38,,
2,3,Christopher Robbins,gonzalestracy@example.com,48,Other,Dell XPS,2020-07-14,Technical issue,Network problem,I'm facing a problem with my {product_purchase...,Closed,Case maybe show recently my computer follow.,Low,Social media,2023-06-01 11:14:38,2023-06-01 18:05:38,3.0
3,4,Christina Dillon,bradleyolson@example.org,27,Female,Microsoft Office,2020-11-13,Billing inquiry,Account access,I'm having an issue with the {product_purchase...,Closed,Try capital clearly never color toward story.,Low,Social media,2023-06-01 07:29:40,2023-06-01 01:57:40,3.0
4,5,Alexander Carroll,bradleymark@example.com,67,Female,Autodesk AutoCAD,2020-02-04,Billing inquiry,Data loss,I'm having an issue with the {product_purchase...,Closed,West decision evidence bit.,Low,Email,2023-06-01 00:12:42,2023-06-01 19:53:42,1.0


### Dataset Shape

In [5]:
df.shape

(8469, 17)

### Dataset Information

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8469 entries, 0 to 8468
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticket ID                     8469 non-null   int64  
 1   Customer Name                 8469 non-null   object 
 2   Customer Email                8469 non-null   object 
 3   Customer Age                  8469 non-null   int64  
 4   Customer Gender               8469 non-null   object 
 5   Product Purchased             8469 non-null   object 
 6   Date of Purchase              8469 non-null   object 
 7   Ticket Type                   8469 non-null   object 
 8   Ticket Subject                8469 non-null   object 
 9   Ticket Description            8469 non-null   object 
 10  Ticket Status                 8469 non-null   object 
 11  Resolution                    2769 non-null   object 
 12  Ticket Priority               8469 non-null   object 
 13  Tic

### Missing Values Analysis

In [7]:
df.isnull().sum()

Ticket ID                          0
Customer Name                      0
Customer Email                     0
Customer Age                       0
Customer Gender                    0
Product Purchased                  0
Date of Purchase                   0
Ticket Type                        0
Ticket Subject                     0
Ticket Description                 0
Ticket Status                      0
Resolution                      5700
Ticket Priority                    0
Ticket Channel                     0
First Response Time             2819
Time to Resolution              5700
Customer Satisfaction Rating    5700
dtype: int64
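
The raw counts above can be easier to read alongside percentages. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper name, and the `demo` frame contains made-up values that only imitate the dataset's column names.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "Ticket Status": ["Open", "Closed", "Closed", "Open"],
    "Resolution": [None, "Fixed", "Refunded", None],
    "Time to Resolution": [None, "2023-06-01 18:05:38", None, None],
})
report = missing_summary(demo)
print(report)
```

On the real data the same call would be `missing_summary(df)`. In the output above, `Resolution`, `Time to Resolution`, and `Customer Satisfaction Rating` share the same missing count (5700), which suggests they may be empty for the same tickets, although the notebook does not verify this.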

### Define Key KPIs

In [8]:
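kpis = {
    'Total Tickets': len(df),
    # Column names still contain spaces here; they are normalized in the next cell.
    # The dataset has no 'Priority' or 'Issue Category' column, so the actual
    # 'Ticket Priority' and 'Ticket Channel' columns are used instead.
    'Unique Priorities': df['Ticket Priority'].nunique() if 'Ticket Priority' in df.columns else None,
    'Unique Channels': df['Ticket Channel'].nunique() if 'Ticket Channel' in df.columns else None
}
kpis

{'Total Tickets': 8469, 'Unique Priorities': 4, 'Unique Channels': 4}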

In [9]:
df.columns = df.columns.str.strip().str.replace(' ', '_')
df.columns

Index(['Ticket_ID', 'Customer_Name', 'Customer_Email', 'Customer_Age',
       'Customer_Gender', 'Product_Purchased', 'Date_of_Purchase',
       'Ticket_Type', 'Ticket_Subject', 'Ticket_Description', 'Ticket_Status',
       'Resolution', 'Ticket_Priority', 'Ticket_Channel',
       'First_Response_Time', 'Time_to_Resolution',
       'Customer_Satisfaction_Rating'],
      dtype='object')

### Initial Ticket Distribution

In [10]:
# Ticket distribution by Type
df['Ticket_Type'].value_counts()

Ticket_Type
Refund request          1752
Technical issue         1747
Cancellation request    1695
Product inquiry         1641
Billing inquiry         1634
Name: count, dtype: int64

In [11]:
# Ticket distribution by Priority
df['Ticket_Priority'].value_counts()

Ticket_Priority
Medium      2192
Critical    2129
High        2085
Low         2063
Name: count, dtype: int64

In [12]:
# Ticket distribution by Channel (the dataset has no Issue Category column)
df['Ticket_Channel'].value_counts()

Ticket_Channel
Email           2143
Phone           2132
Social media    2121
Chat            2073
Name: count, dtype: int64
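
The three distributions above report counts only. As a hedged sketch, counts and percentage shares can be produced together with `value_counts(normalize=True)`. The helper name `distribution` and the `demo` rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def distribution(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return counts and percentage shares for one categorical column."""
    counts = frame[column].value_counts()
    shares = frame[column].value_counts(normalize=True).mul(100).round(1)
    # Both Series are indexed by category, so they align row by row
    return pd.DataFrame({"count": counts, "share_pct": shares})

# Illustrative rows (made-up, not the real dataset)
demo = pd.DataFrame({
    "Ticket_Type": ["Refund request", "Technical issue", "Refund request", "Billing inquiry"],
    "Ticket_Priority": ["Low", "High", "Low", "Critical"],
})

tables = {col: distribution(demo, col) for col in ["Ticket_Type", "Ticket_Priority"]}
for col, table in tables.items():
    print(f"\n{col}")
    print(table)
```

With the real frame, the column list would be `["Ticket_Type", "Ticket_Priority", "Ticket_Channel"]`.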

---
## Module 2: Data Cleaning and Feature Engineering

### Objectives
- Handle missing or incorrect data in text fields 
- Create new features: Resolution_Duration and Priority_Score 
- Save processed data for analysis 
- Deliverables: Cleaned dataset, Feature engineering summary, Data dictionary


### Handling Missing Values

In [13]:
df.columns

Index(['Ticket_ID', 'Customer_Name', 'Customer_Email', 'Customer_Age',
       'Customer_Gender', 'Product_Purchased', 'Date_of_Purchase',
       'Ticket_Type', 'Ticket_Subject', 'Ticket_Description', 'Ticket_Status',
       'Resolution', 'Ticket_Priority', 'Ticket_Channel',
       'First_Response_Time', 'Time_to_Resolution',
       'Customer_Satisfaction_Rating'],
      dtype='object')

In [14]:
text_columns = df.select_dtypes(include="object").columns.tolist()
text_columns

['Customer_Name',
 'Customer_Email',
 'Customer_Gender',
 'Product_Purchased',
 'Date_of_Purchase',
 'Ticket_Type',
 'Ticket_Subject',
 'Ticket_Description',
 'Ticket_Status',
 'Resolution',
 'Ticket_Priority',
 'Ticket_Channel',
 'First_Response_Time',
 'Time_to_Resolution']

In [15]:
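# Fill missing values in every object-dtype column.
# Note: this also covers Date_of_Purchase, First_Response_Time, and
# Time_to_Resolution, which hold date/timestamp strings.
df[text_columns] = df[text_columns].fillna("Unknown")

# Standardize text formatting: trim, title-case, collapse repeated whitespace.
# Title-casing is applied to emails and free text as well
# (for example, "I'm" becomes "I'M" in Ticket_Description).
for col in text_columns:
    df[col] = (
        df[col]
        .astype(str)
        .str.strip()
        .str.title()
        .str.replace(r"\s+", " ", regex=True)
    )

# Replace leftover placeholder values.
# This runs after title-casing, so an original "NA" is already "Na"
# at this point and is not matched by the list below.
invalid_values = ["", "N/A", "NA", "None", "Null", "-", "--"]

df[text_columns] = df[text_columns].replace(invalid_values, "Unknown")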
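
The cell above treats every object column the same way, including emails, descriptions, and timestamp strings. A more conservative variant, sketched here under the assumption that only short categorical fields should be normalized, checks placeholders case-insensitively before title-casing. `clean_categorical`, `PLACEHOLDERS`, and the `demo` frame are hypothetical names and values introduced for illustration.

```python
import pandas as pd

# Placeholder spellings, compared in lower case (illustrative list)
PLACEHOLDERS = {"", "n/a", "na", "none", "null", "-", "--"}

def clean_categorical(series: pd.Series) -> pd.Series:
    """Trim, collapse whitespace, title-case, and map placeholders to 'Unknown'."""
    cleaned = (
        series.astype("string")
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
    )
    # Flag real gaps and placeholder spellings before the case is changed
    is_placeholder = cleaned.isna() | cleaned.str.lower().isin(PLACEHOLDERS)
    return cleaned.str.title().mask(is_placeholder, "Unknown")

# Illustrative frame (made-up values)
demo = pd.DataFrame({
    "Ticket_Channel": ["  social   media ", "N/A", None, "chat"],
    "Customer_Email": ["a@example.com", "b@example.org", "c@example.com", "d@example.com"],
})

# Only the columns listed here are touched; emails stay as they were
categorical_cols = ["Ticket_Channel"]
for col in categorical_cols:
    demo[col] = clean_categorical(demo[col])

print(demo)
```

Leaving date and timestamp columns out of the fill step also keeps their missing values as real gaps, which matters for any later datetime parsing.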


In [16]:
# Check remaining missing values
df[text_columns].isnull().sum()

Customer_Name          0
Customer_Email         0
Customer_Gender        0
Product_Purchased      0
Date_of_Purchase       0
Ticket_Type            0
Ticket_Subject         0
Ticket_Description     0
Ticket_Status          0
Resolution             0
Ticket_Priority        0
Ticket_Channel         0
First_Response_Time    0
Time_to_Resolution     0
dtype: int64

### Feature: Resolution Duration

In [17]:
df.columns

Index(['Ticket_ID', 'Customer_Name', 'Customer_Email', 'Customer_Age',
       'Customer_Gender', 'Product_Purchased', 'Date_of_Purchase',
       'Ticket_Type', 'Ticket_Subject', 'Ticket_Description', 'Ticket_Status',
       'Resolution', 'Ticket_Priority', 'Ticket_Channel',
       'First_Response_Time', 'Time_to_Resolution',
       'Customer_Satisfaction_Rating'],
      dtype='object')

In [18]:
for col in df.columns:
    if "time" in col.lower() or "created" in col.lower() or "resolved" in col.lower() or "closed" in col.lower():
        print(col)

First_Response_Time
Time_to_Resolution


In [19]:
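# Create resolution_duration from the existing Time_to_Resolution column
df["resolution_duration"] = df["Time_to_Resolution"]

In [20]:
# Time_to_Resolution holds timestamp strings (or "Unknown" after the fill
# step), so this numeric coercion turns every value into NaN.
df["resolution_duration"] = pd.to_numeric(
    df["resolution_duration"], errors="coerce"
)

# Remove negative values
df.loc[df["resolution_duration"] < 0, "resolution_duration"] = None

# Fill missing with median.
# The median of an all-NaN column is itself NaN, so nothing is filled here.
df["resolution_duration"] = df["resolution_duration"].fillna(
    df["resolution_duration"].median()
)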


In [21]:
df[["resolution_duration"]].head(10)

Unnamed: 0,resolution_duration
0,
1,
2,
3,
4,
5,
6,
7,
8,
9,
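
The column above is empty because `Time_to_Resolution` stores timestamps rather than numeric durations. One possible way to obtain a duration, sketched here under the assumption that it should be measured from `First_Response_Time` to `Time_to_Resolution`, is to parse both columns as datetimes and subtract. The column name `resolution_duration_hours` is hypothetical, and the three demo rows reuse timestamps shown in the dataset preview.

```python
import pandas as pd

# Rows taken from the dataset preview; "Unknown" stands for a filled gap
demo = pd.DataFrame({
    "First_Response_Time": ["2023-06-01 11:14:38", "2023-06-01 07:29:40", "2023-06-01 12:15:36"],
    "Time_to_Resolution": ["2023-06-01 18:05:38", "2023-06-01 01:57:40", "Unknown"],
})

fmt = "%Y-%m-%d %H:%M:%S"
# Values that do not match the format (such as "Unknown") become NaT
first = pd.to_datetime(demo["First_Response_Time"], format=fmt, errors="coerce")
resolved = pd.to_datetime(demo["Time_to_Resolution"], format=fmt, errors="coerce")

# Elapsed time in hours (assumption: resolution minus first response)
hours = (resolved - first).dt.total_seconds() / 3600

# Negative durations are treated as invalid and left missing
demo["resolution_duration_hours"] = hours.where(hours >= 0).round(2)

print(demo)
```

In this sketch the second row is left missing because its resolution timestamp precedes its first response, and the third because the ticket has no resolution timestamp. Whether such gaps should then be imputed with a median, as the original cell intends, is a separate modelling choice.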


### Feature: Priority Score

In [22]:
priority_cols = [
    col for col in df.columns
    if any(k in col.lower() for k in ["priority", "severity", "urgency"])
]

if not priority_cols:
    raise ValueError("No priority-related column found in dataset.")

priority_col = priority_cols[0]
print("Detected priority column:", priority_col)

Detected priority column: Ticket_Priority


In [23]:
if pd.api.types.is_numeric_dtype(df[priority_col]):
    df["priority_score"] = df[priority_col]

else:
    # Clean priority text
    df[priority_col] = (
        df[priority_col]
        .astype(str)
        .str.strip()
        .str.lower()
    )

    # Flexible mapping (covers variations)
    priority_mapping = {
        "critical": 4,
        "highest": 4,
        "high": 3,
        "medium": 2,
        "moderate": 2,
        "low": 1,
        "lowest": 1
    }

    df["priority_score"] = df[priority_col].map(priority_mapping)

In [24]:
df["priority_score"] = df["priority_score"].fillna(0).astype(int)
df[[priority_col, "priority_score"]].head(10)

Unnamed: 0,Ticket_Priority,priority_score
0,critical,4
1,critical,4
2,low,1
3,low,1
4,low,1
5,low,1
6,critical,4
7,critical,4
8,low,1
9,critical,4
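
The cells above lower-case `Ticket_Priority` in place, which is why the saved column reads `critical` rather than `Critical`. If the original labels should be preserved, the score can be derived from a normalized copy instead. This is a sketch only: `priority_score` as a function name and `PRIORITY_SCORES` are illustrative, and the mapping mirrors the core values used in the notebook.

```python
import pandas as pd

# Same core mapping as the notebook cell; unmapped labels score 0
PRIORITY_SCORES = {"critical": 4, "high": 3, "medium": 2, "low": 1}

def priority_score(series: pd.Series) -> pd.Series:
    """Map priority labels to integer scores without changing the input."""
    normalized = series.astype(str).str.strip().str.lower()
    return normalized.map(PRIORITY_SCORES).fillna(0).astype(int)

# Illustrative labels (made-up values)
labels = pd.Series(["Critical", " high ", "Low", None, "urgent"])
scores = priority_score(labels)

print(pd.DataFrame({"Ticket_Priority": labels, "priority_score": scores}))
```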


### Cleaned Dataset Preview

In [25]:
df.head()

Unnamed: 0,Ticket_ID,Customer_Name,Customer_Email,Customer_Age,Customer_Gender,Product_Purchased,Date_of_Purchase,Ticket_Type,Ticket_Subject,Ticket_Description,Ticket_Status,Resolution,Ticket_Priority,Ticket_Channel,First_Response_Time,Time_to_Resolution,Customer_Satisfaction_Rating,resolution_duration,priority_score
0,1,Marisa Obrien,Carrollallison@Example.Com,32,Other,Gopro Hero,2021-03-22,Technical Issue,Product Setup,I'M Having An Issue With The {Product_Purchase...,Pending Customer Response,Unknown,critical,Social Media,2023-06-01 12:15:36,Unknown,,,4
1,2,Jessica Rios,Clarkeashley@Example.Com,42,Female,Lg Smart Tv,2021-05-22,Technical Issue,Peripheral Compatibility,I'M Having An Issue With The {Product_Purchase...,Pending Customer Response,Unknown,critical,Chat,2023-06-01 16:45:38,Unknown,,,4
2,3,Christopher Robbins,Gonzalestracy@Example.Com,48,Other,Dell Xps,2020-07-14,Technical Issue,Network Problem,I'M Facing A Problem With My {Product_Purchase...,Closed,Case Maybe Show Recently My Computer Follow.,low,Social Media,2023-06-01 11:14:38,2023-06-01 18:05:38,3.0,,1
3,4,Christina Dillon,Bradleyolson@Example.Org,27,Female,Microsoft Office,2020-11-13,Billing Inquiry,Account Access,I'M Having An Issue With The {Product_Purchase...,Closed,Try Capital Clearly Never Color Toward Story.,low,Social Media,2023-06-01 07:29:40,2023-06-01 01:57:40,3.0,,1
4,5,Alexander Carroll,Bradleymark@Example.Com,67,Female,Autodesk Autocad,2020-02-04,Billing Inquiry,Data Loss,I'M Having An Issue With The {Product_Purchase...,Closed,West Decision Evidence Bit.,low,Email,2023-06-01 00:12:42,2023-06-01 19:53:42,1.0,,1


### Save Cleaned Dataset

In [26]:
df.to_csv('customer_support_tickets_cleaned_milestone1.csv', index=False)
print('Cleaned dataset saved successfully.')

Cleaned dataset saved successfully.


### Data Dictionary (Summary)
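- **resolution_duration**: Intended to hold the time taken to resolve a ticket, in hours. In this notebook it is produced by numeric coercion of `Time_to_Resolution`, which stores timestamps, so every value is NaN.
- **priority_score**: Numerical encoding of `Ticket_Priority` (critical = 4, high = 3, medium = 2, low = 1; unmapped values = 0)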
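
A data dictionary can also be generated from the frame itself so that names and dtypes stay in sync with the data. The following is a minimal sketch: `build_data_dictionary` is a hypothetical helper, and the `demo` frame and descriptions are illustrative.

```python
import pandas as pd

def build_data_dictionary(frame: pd.DataFrame, descriptions: dict) -> pd.DataFrame:
    """Return one row per column: name, dtype, non-null count, description."""
    return pd.DataFrame({
        "column": list(frame.columns),
        "dtype": [str(dtype) for dtype in frame.dtypes],
        "non_null": frame.notna().sum().to_numpy(),
        # Columns without a supplied description get an empty string
        "description": [descriptions.get(col, "") for col in frame.columns],
    })

# Illustrative frame (made-up values)
demo = pd.DataFrame({
    "Ticket_Priority": ["critical", "low", "low"],
    "priority_score": [4, 1, 1],
    "resolution_duration": [None, 6.85, None],
})
descriptions = {
    "priority_score": "Numerical encoding of ticket priority",
    "resolution_duration": "Time taken to resolve a ticket",
}

data_dictionary = build_data_dictionary(demo, descriptions)
print(data_dictionary)
```

The result could be written next to the cleaned dataset with `data_dictionary.to_csv(...)`, using whatever file name suits the project.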

---
**End of Milestone 1 Notebook**