Data Preprocessing & Cleaning
1. Handle Missing Values

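Impute missing dates (e.g., follow_up_date) with a strategy such as the median or mode date, a forward fill, or custom logic. A literal "Unknown" placeholder only suits columns that stay as text, since it cannot be stored in a datetime column.

For other NaNs, fill numerical columns with the mean or median and categorical columns with the mode or an "Unknown" category, as appropriate.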
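
As a rough illustration of these strategies, the sketch below imputes one numerical, one categorical, and one date column. The toy frame, its values, and the follow_up_missing flag column are assumptions made for the example; they are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the service dataset (values are illustrative only).
toy = pd.DataFrame({
    "last_service_cost": [9884.0, np.nan, 3213.0, 6601.0],
    "fuel_type": ["Petrol", None, "Diesel", "Petrol"],
    "follow_up_date": ["11-04-2024", None, "19-11-2024", None],
})

# Numerical column: fill with the median, which is less sensitive to outliers than the mean.
toy["last_service_cost"] = toy["last_service_cost"].fillna(toy["last_service_cost"].median())

# Categorical column: fill with the most frequent value (the mode).
toy["fuel_type"] = toy["fuel_type"].fillna(toy["fuel_type"].mode().iloc[0])

# Date column: parse as day-first, remember which rows were missing
# (hypothetical helper column), then fill the gaps with the median date.
toy["follow_up_date"] = pd.to_datetime(toy["follow_up_date"], format="%d-%m-%Y", errors="coerce")
toy["follow_up_missing"] = toy["follow_up_date"].isna().astype(int)
toy["follow_up_date"] = toy["follow_up_date"].fillna(toy["follow_up_date"].median())
```

Keeping an explicit "was missing" flag is one way to avoid losing the information that no follow-up was scheduled, which may itself be predictive.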

2. Convert Date Columns

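Convert all date columns (last_service_date, next_service_due_date, follow_up_date, etc.) to datetime using pd.to_datetime(). The raw values are day-first (e.g., 19-11-2024), so pass dayfirst=True or an explicit format; otherwise an ambiguous value such as 11-04-2024 can be read as 4 November.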

Extract year, month, day, and calculate intervals (e.g., days since last service).
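
The feature extraction described here could look roughly like the sketch below. The toy dates, the new column names, and the fixed reference_date are assumptions for illustration; in practice the reference point might be the data export date or the current date.

```python
import pandas as pd

# Toy frame with day-first date strings (values are illustrative only).
toy = pd.DataFrame({"last_service_date": ["04-10-2024", "01-05-2025"]})
toy["last_service_date"] = pd.to_datetime(
    toy["last_service_date"], format="%d-%m-%Y", errors="coerce"
)

# Calendar components via the .dt accessor.
toy["last_service_year"] = toy["last_service_date"].dt.year
toy["last_service_month"] = toy["last_service_date"].dt.month
toy["last_service_day"] = toy["last_service_date"].dt.day

# Interval feature: whole days elapsed since the last service,
# measured against an assumed fixed reference date.
reference_date = pd.Timestamp("2025-06-01")
toy["days_since_last_service"] = (reference_date - toy["last_service_date"]).dt.days
```

Raw datetime columns generally cannot be fed to a model directly, so numeric features like these are usually what ends up in the training matrix.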

3. Encode Categorical Variables

Use OneHotEncoder or pd.get_dummies for variables like location, customer_type, preferred_language, make, model, etc.

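Map binary columns (warranty_status, insurance_status, pending_service, etc.) to 0/1, either with an explicit mapping or with LabelEncoder. Note that scikit-learn documents LabelEncoder as intended for target labels; OrdinalEncoder is its counterpart for feature columns.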
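
A minimal sketch of both encodings follows, using a toy frame whose values are illustrative. The explicit dictionary is just one possible way to handle a binary column, chosen here because it makes the meaning of 0 and 1 visible.

```python
import pandas as pd

# Toy frame with one multi-category and one binary column (illustrative values).
toy = pd.DataFrame({
    "customer_type": ["Retail", "Corporate", "Fleet", "Retail"],
    "warranty_status": ["Active", "Expired", "Expired", "Active"],
})

# Binary column: an explicit mapping documents which value becomes 1.
toy["warranty_status"] = toy["warranty_status"].map({"Expired": 0, "Active": 1})

# Multi-category column: one indicator column per category.
# dtype=int gives 0/1 rather than booleans on newer pandas versions.
encoded = pd.get_dummies(toy, columns=["customer_type"], dtype=int)
```

pd.get_dummies is convenient for a one-off table; if the same encoding has to be replayed on new rows later, a fitted OneHotEncoder may be the safer choice because it remembers the category list.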

4. Drop Sensitive Identifiers

Remove privacy-sensitive or unnecessary columns: customer_id, name, email, mobile_number.

5. Scale/Normalize Numerical Features

Scale features like odometer_reading, age_of_vehicle, last_service_cost, etc., using StandardScaler or MinMaxScaler.
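
To show the difference between the two scalers, the sketch below applies both to a small toy frame (the frame and the variable names standardized and rescaled are assumptions for the example). StandardScaler centers each column to zero mean and unit variance, while MinMaxScaler squeezes each column into the range [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numerical columns (illustrative values).
toy = pd.DataFrame({
    "odometer_reading": [59174, 32365, 49576, 83890],
    "age_of_vehicle": [6, 6, 5, 6],
})
num_cols = ["odometer_reading", "age_of_vehicle"]

# Zero mean, unit (population) variance per column.
standardized = pd.DataFrame(
    StandardScaler().fit_transform(toy[num_cols]), columns=num_cols
)

# Each column rescaled linearly so its minimum is 0 and its maximum is 1.
rescaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy[num_cols]), columns=num_cols
)
```

If the data is later split into training and test sets, fitting the scaler on the training portion only and reusing it on the rest avoids leaking test statistics into training.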

In [1]:
import pandas as pd

In [3]:
df = pd.read_csv('service_dataset.csv')

In [4]:
df

Unnamed: 0,customer_id,name,mobile_number,email,location,customer_type,preferred_language,make,model,year_of_purchase,...,Whats_delivered,Whats_clicked,sent_email,clicked_email,email_opened,last_service_cost,feedback_score,pickup_drop_required,customer_feedback,feedback_date
0,CUST0116,Sabrina Barnes,8945450816,averyshane@stanton.com,South Christianchester,Retail,Tamil,Ford,Aspire,2019,...,No,No,No,No,No,9884,4,No,Poor Service,11-04-2024
1,CUST0368,Steven Johnston,9815728359,tyler23@walsh-mccarthy.com,North Jacksonfort,Corporate,Tamil,Toyota,Yaris,2019,...,Yes,Yes,Yes,No,Yes,10996,4,Yes,Satisfied,09-01-2025
2,CUST0422,Matthew Gonzalez,8078635658,woodsmelvin@yahoo.com,Halefort,Retail,English,Ford,Figo,2020,...,No,No,No,No,No,3213,4,No,Average,08-03-2025
3,CUST0413,Lisa Thomas,6756649998,tschroeder@yahoo.com,Larryhaven,Corporate,English,Honda,City,2019,...,No,No,Yes,Yes,No,6601,5,No,Smooth Process,19-11-2024
4,CUST0451,Jessica Abbott,6364076742,heatherfarmer@carlson.com,Christinaburgh,Fleet,Hindi,Honda,City,2015,...,Yes,Yes,Yes,Yes,No,14907,3,No,Unresponsive,30-12-2024
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,CUST0366,Shawna Palmer,9460106831,josephalvarez@hotmail.com,Bellville,Retail,Hindi,Hyundai,i20,2015,...,Yes,Yes,No,No,No,11536,3,No,Delayed Pickup,20-02-2025
996,CUST0005,Theresa Martin,6329366782,chapmanjerry@gmail.com,North Jeanton,Corporate,Tamil,Hyundai,Creta,2016,...,Yes,Yes,Yes,No,No,6999,4,Yes,Smooth Process,15-10-2024
997,CUST0350,Melinda Nelson,9675801524,jacobsonroy@martin-bailey.biz,Alvarezburgh,Retail,Tamil,Toyota,Innova,2021,...,Yes,Yes,No,No,No,3215,4,No,Delayed Pickup,15-04-2025
998,CUST0315,Wanda Todd,9752893160,scottolson@clark.com,West Jessicashire,Fleet,Tamil,Hyundai,i10,2015,...,Yes,Yes,No,No,No,9804,1,No,Unresponsive,27-08-2024


In [5]:
df.isnull().sum()

customer_id                0
name                       0
mobile_number              0
email                      0
location                   0
customer_type              0
preferred_language         0
make                       0
model                      0
year_of_purchase           0
age_of_vehicle             0
fuel_type                  0
transmission               0
odometer_reading           0
warranty_status            0
insurance_status           0
last_service_date          0
last_service_type          0
service_center             0
number_of_services         0
last_service_kms           0
avg_kms_per_month          0
next_service_due_kms       0
next_service_due_date      0
next_service_due_days      0
AMC_status                 0
pending_service            0
last_call_date             0
response                   0
follow_up_required         0
follow_up_date           494
telecaller_name            0
service_booked             0
call_duration_sec          0
remark        

In [6]:
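# Fill missing values example: parse follow_up_date, then forward-fill the gaps.
# dayfirst=True assumes this column uses the same DD-MM-YYYY layout as feedback_date.
df['follow_up_date'] = pd.to_datetime(df['follow_up_date'], dayfirst=True, errors='coerce')
# Assign the result back instead of calling fillna(method='ffill', inplace=True) on a
# column selection (deprecated). Leading rows with no earlier value stay missing.
df['follow_up_date'] = df['follow_up_date'].ffill()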

In [7]:
df.isnull().sum()

customer_id              0
name                     0
mobile_number            0
email                    0
location                 0
customer_type            0
preferred_language       0
make                     0
model                    0
year_of_purchase         0
age_of_vehicle           0
fuel_type                0
transmission             0
odometer_reading         0
warranty_status          0
insurance_status         0
last_service_date        0
last_service_type        0
service_center           0
number_of_services       0
last_service_kms         0
avg_kms_per_month        0
next_service_due_kms     0
next_service_due_date    0
next_service_due_days    0
AMC_status               0
pending_service          0
last_call_date           0
response                 0
follow_up_required       0
follow_up_date           4
telecaller_name          0
service_booked           0
call_duration_sec        0
remark                   0
eligible_offer_code      0
offer_description        0
o

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 52 columns):
customer_id              1000 non-null object
name                     1000 non-null object
mobile_number            1000 non-null int64
email                    1000 non-null object
location                 1000 non-null object
customer_type            1000 non-null object
preferred_language       1000 non-null object
make                     1000 non-null object
model                    1000 non-null object
year_of_purchase         1000 non-null int64
age_of_vehicle           1000 non-null int64
fuel_type                1000 non-null object
transmission             1000 non-null object
odometer_reading         1000 non-null int64
warranty_status          1000 non-null object
insurance_status         1000 non-null object
last_service_date        1000 non-null object
last_service_type        1000 non-null object
service_center           1000 non-null object
number_of_services      

In [10]:
df.shape

(1000, 52)

In [11]:
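# Convert dates. The raw values are day-first (e.g. 19-11-2024), so pass dayfirst=True;
# without it, an ambiguous value such as 11-04-2024 is read as 4 November.
date_cols = ['last_service_date', 'next_service_due_date', 'feedback_date']
for col in date_cols:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')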

In [12]:
df

Unnamed: 0,customer_id,name,mobile_number,email,location,customer_type,preferred_language,make,model,year_of_purchase,...,Whats_delivered,Whats_clicked,sent_email,clicked_email,email_opened,last_service_cost,feedback_score,pickup_drop_required,customer_feedback,feedback_date
0,CUST0116,Sabrina Barnes,8945450816,averyshane@stanton.com,South Christianchester,Retail,Tamil,Ford,Aspire,2019,...,No,No,No,No,No,9884,4,No,Poor Service,2024-11-04
1,CUST0368,Steven Johnston,9815728359,tyler23@walsh-mccarthy.com,North Jacksonfort,Corporate,Tamil,Toyota,Yaris,2019,...,Yes,Yes,Yes,No,Yes,10996,4,Yes,Satisfied,2025-09-01
2,CUST0422,Matthew Gonzalez,8078635658,woodsmelvin@yahoo.com,Halefort,Retail,English,Ford,Figo,2020,...,No,No,No,No,No,3213,4,No,Average,2025-08-03
3,CUST0413,Lisa Thomas,6756649998,tschroeder@yahoo.com,Larryhaven,Corporate,English,Honda,City,2019,...,No,No,Yes,Yes,No,6601,5,No,Smooth Process,2024-11-19
4,CUST0451,Jessica Abbott,6364076742,heatherfarmer@carlson.com,Christinaburgh,Fleet,Hindi,Honda,City,2015,...,Yes,Yes,Yes,Yes,No,14907,3,No,Unresponsive,2024-12-30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,CUST0366,Shawna Palmer,9460106831,josephalvarez@hotmail.com,Bellville,Retail,Hindi,Hyundai,i20,2015,...,Yes,Yes,No,No,No,11536,3,No,Delayed Pickup,2025-02-20
996,CUST0005,Theresa Martin,6329366782,chapmanjerry@gmail.com,North Jeanton,Corporate,Tamil,Hyundai,Creta,2016,...,Yes,Yes,Yes,No,No,6999,4,Yes,Smooth Process,2024-10-15
997,CUST0350,Melinda Nelson,9675801524,jacobsonroy@martin-bailey.biz,Alvarezburgh,Retail,Tamil,Toyota,Innova,2021,...,Yes,Yes,No,No,No,3215,4,No,Delayed Pickup,2025-04-15
998,CUST0315,Wanda Todd,9752893160,scottolson@clark.com,West Jessicashire,Fleet,Tamil,Hyundai,i10,2015,...,Yes,Yes,No,No,No,9804,1,No,Unresponsive,2024-08-27


In [13]:
# Drop sensitive identifier columns
df = df.drop(columns=['customer_id', 'name', 'email', 'mobile_number'])

In [14]:
df

Unnamed: 0,location,customer_type,preferred_language,make,model,year_of_purchase,age_of_vehicle,fuel_type,transmission,odometer_reading,...,Whats_delivered,Whats_clicked,sent_email,clicked_email,email_opened,last_service_cost,feedback_score,pickup_drop_required,customer_feedback,feedback_date
0,South Christianchester,Retail,Tamil,Ford,Aspire,2019,6,Electric,Automatic,59174,...,No,No,No,No,No,9884,4,No,Poor Service,2024-11-04
1,North Jacksonfort,Corporate,Tamil,Toyota,Yaris,2019,6,Electric,Automatic,32365,...,Yes,Yes,Yes,No,Yes,10996,4,Yes,Satisfied,2025-09-01
2,Halefort,Retail,English,Ford,Figo,2020,5,Diesel,Manual,49576,...,No,No,No,No,No,3213,4,No,Average,2025-08-03
3,Larryhaven,Corporate,English,Honda,City,2019,6,Petrol,Manual,83890,...,No,No,Yes,Yes,No,6601,5,No,Smooth Process,2024-11-19
4,Christinaburgh,Fleet,Hindi,Honda,City,2015,10,Electric,Manual,77667,...,Yes,Yes,Yes,Yes,No,14907,3,No,Unresponsive,2024-12-30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Bellville,Retail,Hindi,Hyundai,i20,2015,10,Diesel,Manual,101418,...,Yes,Yes,No,No,No,11536,3,No,Delayed Pickup,2025-02-20
996,North Jeanton,Corporate,Tamil,Hyundai,Creta,2016,9,Electric,Automatic,93748,...,Yes,Yes,Yes,No,No,6999,4,Yes,Smooth Process,2024-10-15
997,Alvarezburgh,Retail,Tamil,Toyota,Innova,2021,4,Electric,Automatic,52739,...,Yes,Yes,No,No,No,3215,4,No,Delayed Pickup,2025-04-15
998,West Jessicashire,Fleet,Tamil,Hyundai,i10,2015,10,Petrol,Automatic,66442,...,Yes,Yes,No,No,No,9804,1,No,Unresponsive,2024-08-27


In [15]:
# Encode categoricals
df = pd.get_dummies(df, columns=['location', 'customer_type', 'make', 'model', 'fuel_type', 
                                 'transmission', 'service_center', 'warranty_status', 
                                 'insurance_status', 'AMC_status', 'preferred_language'])

In [16]:
df

Unnamed: 0,year_of_purchase,age_of_vehicle,odometer_reading,last_service_date,last_service_type,number_of_services,last_service_kms,avg_kms_per_month,next_service_due_kms,next_service_due_date,...,service_center_Velachery,warranty_status_Active,warranty_status_Expired,insurance_status_Active,insurance_status_Expired,AMC_status_No,AMC_status_Yes,preferred_language_English,preferred_language_Hindi,preferred_language_Tamil
0,2019,6,59174,2024-10-04,Minor,7,56063,811,66063,2025-02-04,...,0,0,1,0,1,1,0,0,0,1
1,2019,6,32365,2025-05-01,Major,3,30554,443,40554,2025-10-28,...,0,0,1,1,0,1,0,0,0,1
2,2020,5,49576,2025-04-03,Minor,8,46881,813,56881,2025-09-30,...,0,0,1,0,1,1,0,1,0,0
3,2019,6,83890,2024-11-18,Minor,6,82509,1149,92509,2025-05-17,...,0,0,1,0,1,1,0,1,0,0
4,2015,10,77667,2024-12-27,Major,4,73001,642,83001,2025-06-25,...,0,0,1,1,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,2015,10,101418,2025-02-16,Major,3,99377,838,109377,2025-08-15,...,0,0,1,1,0,0,1,0,1,0
996,2016,9,93748,2024-10-14,Minor,3,90539,860,100539,2025-12-04,...,1,0,1,1,0,0,1,0,0,1
997,2021,4,52739,2025-04-14,Minor,3,47837,1076,57837,2025-11-10,...,0,0,1,1,0,1,0,0,0,1
998,2015,10,66442,2024-08-26,Minor,9,64525,549,74525,2025-02-22,...,0,0,1,1,0,1,0,0,0,1


In [17]:
from sklearn.preprocessing import StandardScaler

# Scale numerical features to zero mean and unit variance
scaler = StandardScaler()
num_cols = ['age_of_vehicle', 'odometer_reading', 'last_service_cost', 'number_of_services']
df[num_cols] = scaler.fit_transform(df[num_cols])

In [18]:
df

Unnamed: 0,year_of_purchase,age_of_vehicle,odometer_reading,last_service_date,last_service_type,number_of_services,last_service_kms,avg_kms_per_month,next_service_due_kms,next_service_due_date,...,service_center_Velachery,warranty_status_Active,warranty_status_Expired,insurance_status_Active,insurance_status_Expired,AMC_status_No,AMC_status_Yes,preferred_language_English,preferred_language_Hindi,preferred_language_Tamil
0,2019,-0.247899,-0.206041,2024-10-04,Minor,0.560456,56063,811,66063,2025-02-04,...,0,0,1,0,1,1,0,0,0,1
1,2019,-0.247899,-1.082204,2025-05-01,Major,-0.907667,30554,443,40554,2025-10-28,...,0,0,1,1,0,1,0,0,0,1
2,2020,-0.692162,-0.519720,2025-04-03,Minor,0.927486,46881,813,56881,2025-09-30,...,0,0,1,0,1,1,0,1,0,0
3,2019,-0.247899,0.601719,2024-11-18,Minor,0.193425,82509,1149,92509,2025-05-17,...,0,0,1,0,1,1,0,1,0,0
4,2015,1.529153,0.398341,2024-12-27,Major,-0.540636,73001,642,83001,2025-06-25,...,0,0,1,1,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,2015,1.529153,1.174563,2025-02-16,Major,-0.907667,99377,838,109377,2025-08-15,...,0,0,1,1,0,0,1,0,1,0
996,2016,1.084890,0.923895,2024-10-14,Minor,-0.907667,90539,860,100539,2025-12-04,...,1,0,1,1,0,0,1,0,0,1
997,2021,-1.136425,-0.416348,2025-04-14,Minor,-0.907667,47837,1076,57837,2025-11-10,...,0,0,1,1,0,1,0,0,0,1
998,2015,1.529153,0.031489,2024-08-26,Minor,1.294517,64525,549,74525,2025-02-22,...,0,0,1,1,0,1,0,0,0,1


In [19]:
# Save the preprocessed data
df.to_csv('preprocessed_train_data.csv', index=False)

In [1]:
!pip install fastapi

Collecting fastapi
  Downloading fastapi-0.103.2-py3-none-any.whl (66 kB)
     ---------------------------------------- 66.3/66.3 kB ? eta 0:00:00
Collecting starlette<0.28.0,>=0.27.0
  Downloading starlette-0.27.0-py3-none-any.whl (66 kB)
     ---------------------------------------- 67.0/67.0 kB ? eta 0:00:00
Collecting anyio<4.0.0,>=3.7.1
  Downloading anyio-3.7.1-py3-none-any.whl (80 kB)
     ---------------------------------------- 80.9/80.9 kB ? eta 0:00:00
Collecting pydantic!=1.8,!=1.8.1,!=2.0.0,!=2.0.1,!=2.1.0,<3.0.0,>=1.7.4
  Downloading pydantic-2.5.3-py3-none-any.whl (381 kB)
     ------------------------------------- 381.9/381.9 kB 12.0 MB/s eta 0:00:00
Collecting exceptiongroup
  Downloading exceptiongroup-1.3.0-py3-none-any.whl (16 kB)
Collecting annotated-types>=0.4.0
  Downloading annotated_types-0.5.0-py3-none-any.whl (11 kB)
Collecting pydantic-core==2.14.6
  Downloading pydantic_core-2.14.6-cp37-none-win_amd64.whl (1.9 MB)
     --------------------------------------