# 🏗️ Advanced Feature Engineering for Customer Churn

In this notebook, we engineer additional features from the cleaned dataset:
- 📄 Textual features (lengths, sentiment)
- 🔤 TF‑IDF vectors
- ⏰ Time‑based features
- 👤 Interaction features

The goal is a richer feature set that the churn prediction model can learn from.


## 📥 Load Cleaned Data
We start with the cleaned dataset prepared in the previous step.


In [2]:
import pandas as pd
import numpy as np

# Load cleaned data from Step 1
df = pd.read_csv("../data/processed/cleaned_twcs.csv")

print("✅ Data Loaded")
print(df.head())
df.info()  # info() prints its summary directly and returns None


✅ Data Loaded
   tweet_id   author_id  inbound                 created_at  \
0         1  sprintcare    False  2017-10-31 22:10:47+00:00   
1         4  sprintcare    False  2017-10-31 21:54:49+00:00   
2         6  sprintcare    False  2017-10-31 21:46:24+00:00   
3        11  sprintcare    False  2017-10-31 22:10:35+00:00   
4        15  sprintcare    False  2017-10-31 20:03:31+00:00   

                                                text response_tweet_id  \
0  @115712 I understand. I would like to assist y...                 2   
1  @115712 Please send us a Private Message so th...                 3   
2  @115712 Can you please send us a private messa...               5,7   
3  @115713 This is saddening to hear. Please shoo...               NaN   
4  @115713 We understand your concerns and we'd l...                12   

   customer_tweet_id        customer_created_at  \
0                3.0  2017-10-31 22:08:27+00:00   
1                5.0  2017-10-31 21:49:35+00:00   
2        

## ✨ Textual Features
We extract a few lightweight features from the `cleaned_text` column:
- Text length (characters, words)
- Average word length
- Sentiment polarity (using VADER)
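
As a quick illustration of the length features listed above, here is a minimal sketch on a single string. The helper name `text_features` and the sample sentence are hypothetical. It also shows that dividing the character count by the word count counts spaces too, so it slightly overstates the true mean word length.

```python
def text_features(text: str) -> dict:
    """Length features for one string (illustrative helper, not from the notebook)."""
    words = text.split()
    char_count = len(text)                # includes spaces
    word_count = len(words)
    safe_words = max(word_count, 1)       # avoid division by zero for empty text
    return {
        "char_count": char_count,
        "word_count": word_count,
        # Ratio of all characters to words (spaces included)
        "avg_word_len_approx": char_count / safe_words,
        # Mean length of the words themselves (spaces excluded)
        "avg_word_len_exact": sum(len(w) for w in words) / safe_words,
    }

feats = text_features("please send us a dm")
print(feats)
```

For short tweets the two averages differ by a little under one character per word, which is unlikely to matter for tree models but is worth knowing when interpreting the feature.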


In [3]:
# Ensure cleaned_text is string and handle missing values
df['cleaned_text'] = df['cleaned_text'].fillna('').astype(str)

# Character count
df['char_count'] = df['cleaned_text'].apply(len)

# Word count
df['word_count'] = df['cleaned_text'].apply(lambda x: len(x.split()))

# Average word length (approximate: char_count includes spaces;
# rows with zero words divide by 1 instead of 0)
df['avg_word_len'] = df['char_count'] / df['word_count'].replace(0, 1)


In [4]:
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
df['sentiment'] = df['cleaned_text'].apply(lambda x: sia.polarity_scores(x)['compound'])

print("✅ Sentiment scores added.")
df[['cleaned_text','sentiment']].head()


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


✅ Sentiment scores added.


                                        cleaned_text  sentiment
0  i understand i would like to assist you we wou...     0.6369
1  please send us a private message so that we ca...     0.4767
2  can you please send us a private message so th...     0.7152
3  this is saddening to hear please shoot us a dm...    -0.5106
4  we understand your concerns and wed like for y...     0.5859
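
The compound score ranges from -1 to 1. If a categorical sentiment feature is wanted, a common convention from the VADER documentation is to treat scores of at least 0.05 as positive and at most -0.05 as negative. The sketch below applies that convention to the five scores shown above; the function name and the idea of adding such a label are assumptions, not part of the original notebook.

```python
def sentiment_label(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse label (threshold is a convention, not a rule)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the output above
scores = [0.6369, 0.4767, 0.7152, -0.5106, 0.5859]
labels = [sentiment_label(s) for s in scores]
print(labels)
```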


## 🕒 Time-Based Features
From `created_at`, extract:
- Hour of day
- Day of week
- Weekend flag


In [5]:
df['created_at'] = pd.to_datetime(df['created_at'], errors='coerce', utc=True)

# Hour of day
df['hour'] = df['created_at'].dt.hour

# Day of week (0=Monday, 6=Sunday)
df['day_of_week'] = df['created_at'].dt.dayofweek

# Weekend flag
df['is_weekend'] = df['day_of_week'].isin([5,6]).astype(int)

print("✅ Time features created.")
df[['created_at','hour','day_of_week','is_weekend']].head()


✅ Time features created.


                  created_at  hour  day_of_week  is_weekend
0  2017-10-31 22:10:47+00:00    22            1           0
1  2017-10-31 21:54:49+00:00    21            1           0
2  2017-10-31 21:46:24+00:00    21            1           0
3  2017-10-31 22:10:35+00:00    22            1           0
4  2017-10-31 20:03:31+00:00    20            1           0
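
To sanity-check the output, the same three features can be reproduced for a single timestamp with the standard library alone. This is a minimal sketch: `time_features` is a hypothetical helper, and the Saturday timestamp is made up for illustration. Because the notebook parses with `utc=True`, the hour is in UTC, not the customer's local time.

```python
from datetime import datetime

def time_features(timestamp: str) -> dict:
    """Hour, weekday, and weekend flag for one ISO-style timestamp (illustrative helper)."""
    dt = datetime.fromisoformat(timestamp)
    day_of_week = dt.weekday()  # 0=Monday ... 6=Sunday, same convention as pandas dayofweek
    return {
        "hour": dt.hour,
        "day_of_week": day_of_week,
        "is_weekend": int(day_of_week >= 5),
    }

first_row = time_features("2017-10-31 22:10:47+00:00")  # first row of the output above
saturday = time_features("2017-11-04 09:00:00+00:00")   # hypothetical weekend timestamp
print(first_row, saturday)
```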


## 👤 Interaction Features
Aggregate author-level information (each `author_id` can be a customer or a support account such as `sprintcare`):
- Number of tweets per author
- Mean response time per author


In [6]:
# Number of tweets by author
author_counts = df['author_id'].value_counts().to_dict()
df['author_tweet_count'] = df['author_id'].map(author_counts)

# Mean response time by author
author_mean_resp = df.groupby('author_id')['response_time'].mean().to_dict()
df['author_mean_response_time'] = df['author_id'].map(author_mean_resp)

print("✅ Interaction features created.")
df[['author_id','author_tweet_count','author_mean_response_time']].head()


✅ Interaction features created.


    author_id  author_tweet_count  author_mean_response_time
0  sprintcare               22209                  61.400065
1  sprintcare               22209                  61.400065
2  sprintcare               22209                  61.400065
3  sprintcare               22209                  61.400065
4  sprintcare               22209                  61.400065
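
An equivalent way to build these two columns is `groupby(...).transform(...)`, which returns values already aligned to the original rows and skips the intermediate dictionaries. The sketch below uses a toy frame with hypothetical values; the column names match the notebook.

```python
import pandas as pd

# Toy data with hypothetical authors and response times
toy = pd.DataFrame({
    "author_id": ["a", "a", "b"],
    "response_time": [10.0, 30.0, 5.0],
})

by_author = toy.groupby("author_id")

# Number of rows per author, broadcast back to every row of that author
toy["author_tweet_count"] = by_author["author_id"].transform("count")

# Mean response time per author, broadcast the same way
toy["author_mean_response_time"] = by_author["response_time"].transform("mean")

print(toy)
```

Both approaches compute the aggregates over the whole dataset, so every row of an author receives the same value.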


## 🔤 TF‑IDF Features
We vectorize the `cleaned_text` column to capture word importance.


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000, # keep top 5000 features
    ngram_range=(1,2), # unigrams and bigrams
    stop_words='english'
)

X_tfidf = tfidf.fit_transform(df['cleaned_text'])
print("✅ TF‑IDF matrix shape:", X_tfidf.shape)


✅ TF‑IDF matrix shape: (1261888, 5000)
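
To make "word importance" concrete, here is a small pure-Python sketch of the weighting that scikit-learn documents for its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It deliberately ignores stop words, bigrams, and `max_features`, and the two-document corpus is hypothetical, so treat it as an illustration of the formula and not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized (illustrative helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

toy_docs = ["send dm", "send message"]  # hypothetical mini-corpus
weights = tfidf_rows(toy_docs)
print(weights[0])
```

In the first document, "dm" gets a higher weight than "send" because "send" appears in every document and therefore carries less information.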


## 🏗️ Combine All Features
We concatenate:
- Engineered numerical features
- TF‑IDF sparse matrix


In [8]:
from scipy.sparse import hstack

num_features = df[['char_count','word_count','avg_word_len','sentiment',
                   'hour','day_of_week','is_weekend',
                   'response_time','author_tweet_count','author_mean_response_time']].fillna(0)

# Convert to CSR so the combined matrix supports efficient row slicing
X_final = hstack([X_tfidf, num_features.values]).tocsr()

print("✅ Final feature matrix shape:", X_final.shape)

# Target variable
y = df['churn_label']


✅ Final feature matrix shape: (1261888, 5010)
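
The stacking step can be checked on a tiny example. The sketch below uses hypothetical values: it appends three dense numeric columns to a two-column sparse block and converts the result to CSR explicitly, since the format returned by `hstack` depends on the inputs and the SciPy version.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical stand-ins: 2 rows, 2 sparse text columns, 3 dense numeric columns
X_text = csr_matrix(np.array([[0.0, 0.5],
                              [0.7, 0.0]]))
X_num = np.array([[19.0, 5.0, 0.6369],
                  [12.0, 3.0, -0.5106]])

# Wrap the dense block as sparse, stack column-wise, and fix the output format
X_combined = hstack([X_text, csr_matrix(X_num)]).tocsr()

print(X_combined.shape)   # rows are preserved, columns are added together
print(X_combined[1].toarray())
```

Note that the numeric columns keep their original scales (character counts in the tens, sentiment between -1 and 1), which can matter for scale-sensitive models such as logistic regression.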


## 💾 Save Engineered Features
Save the enriched DataFrame, the sparse feature matrix, and the target for the modeling step.


In [9]:
# Save TWCS with all engineered columns
# (note: this overwrites the cleaned input file loaded at the top of the notebook)
df.to_csv('../data/processed/cleaned_twcs.csv', index=False)
print("✅ Saved TWCS with all features to '../data/processed/cleaned_twcs.csv'")

✅ Saved TWCS with all features to '../data/processed/cleaned_twcs.csv'


In [10]:
import joblib

# Save features and labels
joblib.dump(X_final, "../data/features/X_features.pkl")
joblib.dump(y, "../data/features/y_labels.pkl")

print("✅ Features and labels saved in data/features/ folder.")


✅ Features and labels saved in data/features/ folder.


## Next Steps
Proceed to the modeling notebook:
- Load `X_features.pkl` and `y_labels.pkl`
- Train multiple models (Logistic Regression, Random Forest, XGBoost)
- Perform hyperparameter tuning and evaluation
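
The loading step could look roughly like the sketch below. So that it runs without the real files, it round-trips a small stand-in list through a temporary directory; in the modeling notebook the paths would be the `../data/features/` files saved above.

```python
import os
import tempfile

import joblib

with tempfile.TemporaryDirectory() as tmp_dir:
    labels_path = os.path.join(tmp_dir, "y_labels.pkl")
    joblib.dump([0, 1, 1, 0], labels_path)  # stand-in for the real label Series
    y_loaded = joblib.load(labels_path)     # same call the modeling notebook would use

print(y_loaded)
```

Files written by `joblib.dump` are pickles, so they are best loaded in an environment with the same library versions that produced them.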


In [11]:
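import joblib

# Persist the fitted TF-IDF vectorizer so new text can be transformed
# with the same vocabulary and IDF weights at prediction time
joblib.dump(tfidf, "../models/tfidf_vectorizer.pkl")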

['../models/tfidf_vectorizer.pkl']