**🧪 Week 2 Lab: Advanced Feature Engineering & Architectures**

**Topic:** Polars, UMAP, and LLM-based Feature Extraction

**🎯 Learning Objectives**
* Benchmark modern data architectures (Polars vs. Pandas).
* Implement rigorous preprocessing (Split-then-Scale) to avoid data leakage.
* Visualize high-dimensional manifolds using UMAP.
* Extract structured features from unstructured text using Zero-Shot LLMs.


---







**🛠️ Section 1: Environment Setup**

We will install `polars` for speed, `umap-learn` for visualization, and `transformers` for our AI text extraction.

In [1]:
# Install necessary libraries
%pip install polars umap-learn plotly transformers accelerate -q

import pandas as pd
import polars as pl
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import umap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from transformers import pipeline
import time

print("✅ Libraries Installed & Ready!")

✅ Libraries Installed & Ready!




---

**📥 Section 2: Data Loading & Synthetic Enrichment**

We are using the standard Telco Customer Churn dataset. However, to practice NLP, we will synthetically generate customer comments based on their churn status. This simulates a real-world scenario where you have tabular data + unstructured notes.

In [9]:
# Load standard dataset
url = "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"
df_pd = pd.read_csv(url)

# Data Cleaning: TotalCharges has empty strings
df_pd['TotalCharges'] = pd.to_numeric(df_pd['TotalCharges'], errors='coerce').fillna(0)

# --- SYNTHETIC TEXT GENERATION ---
# We simulate customer feedback to use for our NLP section later.
import random

churn_reasons = [
    "I am unhappy with the service quality.",
    "The pricing is too high compared to competitors.",
    "I experienced frequent outages and slow internet.",
    "Customer support was unhelpful and difficult to reach.",
    "I found a better deal elsewhere.",
    "The contract terms are inflexible.",
    "I moved to an area not covered by your service."
]

stay_reasons = [
    "I am very satisfied with the service and reliability.",
    "The customer support has been excellent.",
    "I appreciate the competitive pricing and bundled offers.",
    "I've been a long-time customer and am happy with the loyalty rewards.",
    "The internet speed is consistently fast.",
    "I have no issues and find the service meets my needs.",
    "It's convenient and easy to manage my account."
]

def generate_comment(row):
    if row['Churn'] == 'Yes':
        return random.choice(churn_reasons)
    else:
        return random.choice(stay_reasons)

df_pd['Customer_Feedback'] = df_pd.apply(generate_comment, axis=1)

# Save to CSV to test loading speeds
df_pd.to_csv("telco_enriched.csv", index=False)

print(f"Dataset Shape: {df_pd.shape}")
df_pd.head(3)

Dataset Shape: (7043, 22)


| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | Customer_Feedback |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | The internet speed is consistently fast. |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No | I am very satisfied with the service and relia... |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | Customer support was unhelpful and difficult t... |
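
The comment generation above uses the unseeded `random` module row by row, so the comments change on every run. A minimal sketch of a seeded, vectorized variant follows. The shortened reason lists and the `toy` frame are made up for illustration, and the seed value is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so reruns give the same comments

# Shortened stand-ins for the churn_reasons / stay_reasons lists
churn_reasons = ["Pricing is too high.", "Frequent outages."]
stay_reasons = ["Service is reliable.", "Support is excellent."]

toy = pd.DataFrame({"Churn": ["Yes", "No", "No", "Yes"]})

# Draw one candidate from each list per row, then pick by churn status
churn_pick = rng.choice(churn_reasons, size=len(toy))
stay_pick = rng.choice(stay_reasons, size=len(toy))
toy["Customer_Feedback"] = np.where(toy["Churn"] == "Yes", churn_pick, stay_pick)
print(toy)
```

Drawing from both lists and selecting with `np.where` avoids a Python-level function call per row, which matters more once the dataset is scaled up.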


**Lab Checkoff 1**  
Scale the dataset to 100,000 rows and save it to a separate CSV file for downstream use.

In [10]:
base_rows = len(df_pd)
target_rows = 100_000

# factor -> how many times do we need to concat it by
factor = target_rows // base_rows
remainder = target_rows % base_rows

# we build a list of dfs to concat up to the target_rows then append it one shot
dfs = []

# add full copies of the original dataframe
if factor > 0:
  dfs.append(pd.concat([df_pd] * factor, ignore_index=True))
  print(f"Added {factor} copies of the original dataframe.")

# add the partial copy to hit the target rows
if remainder > 0:
  dfs.append(df_pd.iloc[:remainder].reset_index(drop=True))
  print(f"Added {remainder} rows of the original dataframe.")

# final scaled df
df_pd_scaled = pd.concat(dfs, ignore_index=True)

print(f"Scaled Dataset Shape: {df_pd_scaled.shape}")

# save to csv for downstream usage
df_pd_scaled.to_csv("telco_enriched_scaled.csv", index=False)
print("dataset saved to csv")


Added 14 copies of the original dataframe.
Added 1398 rows of the original dataframe.
Scaled Dataset Shape: (100000, 22)
dataset saved to csv
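
The same exact row count can also be reached in a single step by selecting a repeating positional index. This is only a sketch, assuming plain cyclic repetition of rows is acceptable; the helper name `tile_to_rows` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def tile_to_rows(df: pd.DataFrame, target_rows: int) -> pd.DataFrame:
    """Repeat the rows of df cyclically until it has exactly target_rows rows."""
    if len(df) == 0:
        raise ValueError("cannot tile an empty DataFrame")
    # Positions 0..n-1 repeated: 0, 1, ..., n-1, 0, 1, ...
    positions = np.arange(target_rows) % len(df)
    return df.iloc[positions].reset_index(drop=True)

toy = pd.DataFrame({"customerID": ["a", "b", "c"], "tenure": [1, 34, 2]})
scaled = tile_to_rows(toy, 8)
print(scaled.shape)  # (8, 2)
```

With the lab data this would be called as `tile_to_rows(df_pd, 100_000)`, which handles the full copies and the partial remainder together.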


**Benchmarking**

In [11]:
# 1. Benchmark Reading Speed
print("--- Reading Benchmark ---")
start = time.time()
df_pd_test = pd.read_csv("telco_enriched_scaled.csv")
pd_time = time.time() - start
print(f"Pandas Read: {pd_time:.4f}s")

start = time.time()
df_pl_test = pl.read_csv("telco_enriched_scaled.csv")
pl_time = time.time() - start
print(f"Polars Read: {pl_time:.4f}s")

print(f"🚀 Speedup: {pd_time/pl_time:.2f}x")

# 2. Benchmark Aggregation (Group By)
# Task: Group by 'PaymentMethod' and calculate mean 'MonthlyCharges'
print("\n--- Aggregation Benchmark ---")

# Pandas
start = time.time()
res_pd = df_pd_test.groupby("PaymentMethod")["MonthlyCharges"].mean()
pd_agg = time.time() - start

# Polars
start = time.time()
res_pl = df_pl_test.group_by("PaymentMethod").agg(pl.col("MonthlyCharges").mean())
pl_agg = time.time() - start

print(f"Pandas Agg: {pd_agg:.4f}s")
print(f"Polars Agg: {pl_agg:.4f}s")
print(f"🚀 Speedup: {pd_agg/pl_agg:.2f}x")

# 3. Another Benchmark Aggregation (Filter, then Group By)
# Task: Filter to gender == 'Female', then group by 'Contract' and count
print("\n--- Another Aggregation Benchmark ---")

# Pandas
start = time.time()
res_pd = df_pd_test[df_pd_test['gender'] == 'Female'].groupby('Contract').count()
pd_filter = time.time() - start

# Polars
# GroupBy.count() is deprecated in recent Polars releases; .len() is the replacement
start = time.time()
res_pl = df_pl_test.filter(pl.col('gender') == 'Female').group_by('Contract').len()
pl_filter = time.time() - start

print(f"Pandas Filter: {pd_filter:.4f}s")
print(f"Polars Filter: {pl_filter:.4f}s")
print(f"🚀 Speedup: {pd_filter/pl_filter:.2f}x")

--- Reading Benchmark ---
Pandas Read: 0.4009s
Polars Read: 0.0956s
🚀 Speedup: 4.19x

--- Aggregation Benchmark ---
Pandas Agg: 0.0105s
Polars Agg: 0.0055s
🚀 Speedup: 1.93x

--- Another Aggregation Benchmark ---
Pandas Filter: 0.0719s
Polars Filter: 0.0126s
🚀 Speedup: 5.73x




**Observations**  

On the scaled CSV (100,000 rows), the run recorded above shows **Polars** reading the file in 0.0956s against 0.4009s for **Pandas**, a speedup of about 4.19x. On the original 7k-row file (Section 3) the gap was smaller: 0.0529s against 0.1550s, or about 2.93x. The read advantage therefore grows with dataset size, which is consistent with Polars' multi-threaded, columnar CSV reader. (`pl.read_csv` is eager; lazy evaluation only applies to `pl.scan_csv` and `.lazy()` queries.)

Aggregation shows the same trend. On the 7k-row dataset **Pandas** was quicker (0.0093s against 0.0223s for **Polars**, a ratio of 0.42x), whereas on the scaled dataset **Polars** was quicker (0.0055s against 0.0105s, about 1.93x), and the filter plus group-by ran about 5.73x faster in **Polars** (0.0126s against 0.0719s). These are single wall-clock measurements and will vary from run to run, but the pattern supports the conclusion that **Polars** becomes more efficient relative to **Pandas** as the dataset grows.
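
Because each benchmark is timed once with `time.time()`, small differences can be noise. A minimal sketch of a more stable measurement takes the median of several runs with `time.perf_counter()`. The helper name `median_runtime` and the toy workload are assumptions made for illustration.

```python
import statistics
import time

def median_runtime(fn, repeats=5):
    """Call fn() `repeats` times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Toy workload standing in for a read or aggregation call
elapsed = median_runtime(lambda: sum(range(100_000)), repeats=5)
print(f"median: {elapsed:.6f}s")
```

In the lab this could wrap either library, for example `median_runtime(lambda: pd.read_csv("telco_enriched_scaled.csv"))`.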



---

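**🏎️ Section 3: The Race (Pandas vs. Polars)**

**Theory:** Pandas executes eagerly and largely on a single thread. Polars stores data in the Apache Arrow columnar format, runs queries across multiple cores, and offers Lazy Evaluation, which lets it optimize a query before running it. Let's measure the difference.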

In [5]:
# 1. Benchmark Reading Speed
print("--- Reading Benchmark ---")
start = time.time()
df_pd_test = pd.read_csv("telco_enriched.csv")
pd_time = time.time() - start
print(f"Pandas Read: {pd_time:.4f}s")

start = time.time()
df_pl_test = pl.read_csv("telco_enriched.csv")
pl_time = time.time() - start
print(f"Polars Read: {pl_time:.4f}s")

print(f"🚀 Speedup: {pd_time/pl_time:.2f}x")

# 2. Benchmark Aggregation (Group By)
# Task: Group by 'PaymentMethod' and calculate mean 'MonthlyCharges'
print("\n--- Aggregation Benchmark ---")

# Pandas
start = time.time()
res_pd = df_pd_test.groupby("PaymentMethod")["MonthlyCharges"].mean()
pd_agg = time.time() - start

# Polars
start = time.time()
res_pl = df_pl_test.group_by("PaymentMethod").agg(pl.col("MonthlyCharges").mean())
pl_agg = time.time() - start

print(f"Pandas Agg: {pd_agg:.4f}s")
print(f"Polars Agg: {pl_agg:.4f}s")
print(f"🚀 Speedup: {pd_agg/pl_agg:.2f}x")

--- Reading Benchmark ---
Pandas Read: 0.1550s
Polars Read: 0.0529s
🚀 Speedup: 2.93x

--- Aggregation Benchmark ---
Pandas Agg: 0.0093s
Polars Agg: 0.0223s
🚀 Speedup: 0.42x


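*Note: On small datasets (7k rows), the difference is small, and fixed overhead can even make Polars slower, as in the aggregation above. On larger datasets (1M+ rows), Polars' advantage typically becomes much more pronounced.*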



---


**⚙️ Section 4: Rigorous Preprocessing (No Leakage!)**

**Theory:** We must split our data before we scale it. If we fit the scaler on the whole dataset, statistics from the test set (its mean and standard deviation) leak into the transformation applied to the training set.



In [6]:
# 1. Define Features
numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
text_feature = 'Customer_Feedback'
target = 'Churn'

X = df_pd[numeric_features + categorical_features + [text_feature]]
y = df_pd[target].apply(lambda x: 1 if x == 'Yes' else 0)

# 2. Train / Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training Samples: {len(X_train)} | Test Samples: {len(X_test)}")

# 3. Create Preprocessing Pipeline
# We use ColumnTransformer to apply different logic to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='drop' # Drop text column for now (handled later)
)

# 4. Fit on TRAIN, Transform on TEST
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print("✅ Data Processed. Mean of Train Numeric Cols should be ~0.")
print(f"Mean of Scaled Tenure: {X_train_processed[:, 0].mean():.4f}")

Training Samples: 5634 | Test Samples: 1409
✅ Data Processed. Mean of Train Numeric Cols should be ~0.
Mean of Scaled Tenure: 0.0000
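
To make the leakage concrete, the sketch below compares a scaler fitted on train plus test with one fitted on train only. The data is synthetic and deliberately exaggerated (the two splits are drawn from different ranges), so the numbers are illustrative rather than taken from the Telco dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical feature: train and test drawn from different ranges
train = rng.normal(loc=10.0, scale=2.0, size=(800, 1))
test = rng.normal(loc=20.0, scale=2.0, size=(200, 1))

# Leaky: statistics computed on train + test together
leaky = StandardScaler().fit(np.vstack([train, test]))
# Correct: statistics computed on the training split only
clean = StandardScaler().fit(train)

print(f"leaky mean: {leaky.mean_[0]:.2f}")  # pulled toward the test data
print(f"clean mean: {clean.mean_[0]:.2f}")  # reflects the training split only
```

The leaky scaler's mean depends on rows the model is never supposed to see during training, which is exactly what fitting on `X_train` alone prevents.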


**Lab Checkoff 2**  

Replace StandardScaler with MinMaxScaler and view the change in mean and variance. How does it affect the UMAP visualization?

In [12]:
from sklearn.preprocessing import MinMaxScaler

In [23]:
# 1. Define Features
numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
text_feature = 'Customer_Feedback'
target = 'Churn'

X = df_pd[numeric_features + categorical_features + [text_feature]]
y = df_pd[target].apply(lambda x: 1 if x == 'Yes' else 0)

# 2. Train / Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training Samples: {len(X_train)} | Test Samples: {len(X_test)}")

# 3. Create Preprocessing Pipeline
# We use ColumnTransformer to apply different logic to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='drop' # Drop text column for now (handled later)
)

# 4. Fit on TRAIN, Transform on TEST
X_train_scaled_processed = preprocessor.fit_transform(X_train)
X_test_scaled_processed = preprocessor.transform(X_test)

print("✅ Data Processed.")
print(f"Mean of Scaled Tenure: {X_train_scaled_processed[:, 0].mean():.4f}")
print(f"Variance of Scaled Tenure: {X_train_scaled_processed[:, 0].var():.4f}")

Training Samples: 5634 | Test Samples: 1409
✅ Data Processed.
Mean of Scaled Tenure: 0.4496
Variance of Scaled Tenure: 0.1151
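
The change in mean and variance follows directly from the two formulas. The sketch below applies both by hand with NumPy to a few made-up tenure values; it mirrors what `StandardScaler` and `MinMaxScaler` compute with their default settings.

```python
import numpy as np

tenure = np.array([1.0, 34.0, 2.0, 45.0, 72.0])  # made-up sample values

# StandardScaler: z = (x - mean) / std  -> mean 0, variance 1
z = (tenure - tenure.mean()) / tenure.std()

# MinMaxScaler: m = (x - min) / (max - min)  -> values in [0, 1]
m = (tenure - tenure.min()) / (tenure.max() - tenure.min())

print(f"standard: mean={z.mean():.4f}, var={z.var():.4f}")
print(f"min-max:  mean={m.mean():.4f}, var={m.var():.4f}, "
      f"range=[{m.min():.1f}, {m.max():.1f}]")
```

Min-max scaling fixes the range rather than the moments, so the mean and variance depend on where the data sits between its minimum and maximum.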


Impact on UMAP visualisation

In [24]:
# Initialize UMAP reducer
# n_neighbors: Controls local vs global structure (15 is standard)
# min_dist: Controls how tightly points pack together
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)

print("Projecting data to 2D... (This uses algebraic topology!)")
embedding = reducer.fit_transform(X_train_scaled_processed)

# Create DataFrame for Plotting
df_umap = pd.DataFrame(embedding, columns=['UMAP_1', 'UMAP_2'])
df_umap['Churn'] = y_train.values
df_umap['Churn'] = df_umap['Churn'].map({1: 'Churn', 0: 'Retain'})

# Interactive Plot
fig = px.scatter(df_umap, x='UMAP_1', y='UMAP_2', color='Churn',
                 title='UMAP Projection of Telco Customer Churn',
                 color_discrete_map={'Churn': 'red', 'Retain': 'blue'},
                 opacity=0.5, width=800, height=600)
fig.show()

Projecting data to 2D... (This uses algebraic topology!)



n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



UMAP is invariant to global shifts but sensitive to relative scales between features. Both StandardScaler and MinMaxScaler keep all numeric features on a similar scale, so:

- The overall qualitative structure of the UMAP embedding (clusters, separation of churn vs. retain) usually stays similar.

- You might see subtle differences in layout (slightly different shapes or rotations of clusters), but not a dramatic change like “no clusters” vs. “strong clusters”.

- The big benefit is avoiding features with wildly different magnitudes; both scalers achieve that, just with different mean/variance statistics.

In short: mean/variance numbers change (not centered at 0, not unit variance anymore), but the UMAP visualization is largely qualitatively similar because distances are still reasonably balanced across features.
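
The shift-invariance claim can be checked on the pairwise distances that UMAP starts from. This is a small sketch assuming the Euclidean metric (UMAP's default); the points and the helper name `pairwise_distances` are made up for illustration.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows in X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy points

shifted = X + 100.0                     # add a constant to every feature
stretched = X * np.array([1.0, 50.0])   # rescale only the second feature

print(np.allclose(pairwise_distances(X), pairwise_distances(shifted)))    # True
print(np.allclose(pairwise_distances(X), pairwise_distances(stretched)))  # False
```

Shifting leaves every distance unchanged, while stretching one feature lets it dominate the neighbor graph, which is why relative scale matters more than centering.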

Include Text Feature: Experiment with including the Customer_Feedback text feature in the ColumnTransformer by using a TfidfVectorizer (from sklearn.feature_extraction.text). Note: This will require some additional setup in the ColumnTransformer for text processing. How does adding text features affect the dimensionality and potentially the UMAP projection?

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


In [26]:
text_transformer = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('text', text_transformer, 'Customer_Feedback'),
    ],
    remainder='drop'
)

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
print(X_train_processed.shape)


(5634, 117)


Adding the `Customer_Feedback` feature to the `ColumnTransformer` increases the dimensionality of the input, which changes what UMAP sees.

UMAP itself will still project to 2D (so the output of `fit_transform` stays `(N, 2)`), but the structure of that 2D embedding can change a lot:

- With text included, UMAP can separate customers not only by numeric/categorical churn drivers but also by similarity in the wording of their feedback.

- Clusters may correspond more strongly to complaint themes (“price”, “service quality”, “support”) or satisfaction themes than before.

- The higher dimensionality gives UMAP more signal but also more noise; the quality of your text preprocessing (stopwords, `max_features`, etc.) affects how meaningful the 2D layout is.

- Because the feedback here was generated from the churn label, the text columns encode the target almost directly, so any sharper churn/retain separation is partly an artifact of the synthetic data.

In short, the dimensionality of the input grows (from 46 columns to 117 in this run; the synthetic feedback contains only 71 distinct tokens, far below `max_features=1000`), but the final UMAP plot stays 2D; you should expect visually different clusters, often more driven by the text patterns.
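
The number of columns TF-IDF adds is set by the vocabulary, not by the number of rows. The sketch below shows this on three comments, one of them a duplicate; the choice of comments is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "The pricing is too high compared to competitors.",
    "The internet speed is consistently fast.",
    "The pricing is too high compared to competitors.",  # duplicate row
]

vectorizer = TfidfVectorizer(max_features=1000)
tfidf = vectorizer.fit_transform(comments)

# One column per distinct token, regardless of how many rows there are
print(tfidf.shape)  # (3, 12)
print(sorted(vectorizer.vocabulary_))
```

`max_features` only caps the vocabulary; with a small fixed set of template sentences the cap is never reached.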



---

**🌌 Section 5: Visualizing the Manifold (UMAP)**

**Theory:** Our data has about 46 dimensions after One-Hot Encoding (3 numeric columns plus 43 one-hot columns, before any text features are added). We cannot see 46D. We use UMAP to project this down to 2D while preserving the "shape" of the data.

In [5]:
# Initialize UMAP reducer
# n_neighbors: Controls local vs global structure (15 is standard)
# min_dist: Controls how tightly points pack together
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)

print("Projecting data to 2D... (This uses algebraic topology!)")
embedding = reducer.fit_transform(X_train_processed)

# Create DataFrame for Plotting
df_umap = pd.DataFrame(embedding, columns=['UMAP_1', 'UMAP_2'])
df_umap['Churn'] = y_train.values
df_umap['Churn'] = df_umap['Churn'].map({1: 'Churn', 0: 'Retain'})

# Interactive Plot
fig = px.scatter(df_umap, x='UMAP_1', y='UMAP_2', color='Churn',
                 title='UMAP Projection of Telco Customer Churn',
                 color_discrete_map={'Churn': 'red', 'Retain': 'blue'},
                 opacity=0.5, width=800, height=600)
fig.show()

Projecting data to 2D... (This uses algebraic topology!)




**Lab Checkoff 3**

UMAP Parameter Tuning: Experiment with different n_neighbors (e.g., 5, 50) and min_dist (e.g., 0.0, 0.5) values for UMAP. How do these changes affect the clustering and separation of churn vs. retain customers in the UMAP plot? Describe what you observe.
Color by Other Features: Change the color aesthetic in the px.scatter plot to another categorical feature like 'Contract' or 'InternetService'. Do you see new patterns or clusters related to these features?

In [30]:
# Initialize UMAP reducer
# n_neighbors: Controls local vs global structure (15 is standard)
# min_dist: Controls how tightly points pack together
reducer = umap.UMAP(n_neighbors=5, min_dist=0.1, random_state=42)

print("Projecting data to 2D... (This uses algebraic topology!)")
embedding = reducer.fit_transform(X_train_processed)

# Create DataFrame for Plotting
df_umap = pd.DataFrame(embedding, columns=['UMAP_1', 'UMAP_2'])
df_umap['Churn'] = y_train.values
df_umap['Churn'] = df_umap['Churn'].map({1: 'Churn', 0: 'Retain'})

# Interactive Plot
fig = px.scatter(df_umap, x='UMAP_1', y='UMAP_2', color='Churn',
                 title='UMAP Projection of Telco Customer Churn',
                 color_discrete_map={'Churn': 'red', 'Retain': 'blue'},
                 opacity=0.5, width=800, height=600)
fig.show()

Projecting data to 2D... (This uses algebraic topology!)



n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



In [31]:
# Initialize UMAP reducer
# n_neighbors: Controls local vs global structure (15 is standard)
# min_dist: Controls how tightly points pack together
reducer = umap.UMAP(n_neighbors=15, min_dist=0.5, random_state=42)

print("Projecting data to 2D... (This uses algebraic topology!)")
embedding = reducer.fit_transform(X_train_processed)

# Create DataFrame for Plotting
df_umap = pd.DataFrame(embedding, columns=['UMAP_1', 'UMAP_2'])
df_umap['Churn'] = y_train.values
df_umap['Churn'] = df_umap['Churn'].map({1: 'Churn', 0: 'Retain'})

# Interactive Plot
fig = px.scatter(df_umap, x='UMAP_1', y='UMAP_2', color='Churn',
                 title='UMAP Projection of Telco Customer Churn',
                 color_discrete_map={'Churn': 'red', 'Retain': 'blue'},
                 opacity=0.5, width=800, height=600)
fig.show()

Projecting data to 2D... (This uses algebraic topology!)



n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



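Reducing `n_neighbors` to 5 while keeping `min_dist=0.1` makes UMAP focus on local structure, so the plot breaks into smaller, tighter groups and the churn and retain customers appear more separated. Keeping `n_neighbors=15` and raising `min_dist` to 0.5 instead spreads the points further apart within each group, which also makes individual churn and retain points easier to tell apart, although the clusters themselves become looser.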
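
Rather than editing one cell per setting, the combinations from the prompt can be generated and looped over. The sketch below is one possible arrangement; the helper name `embed_all` is hypothetical, and `umap` is imported inside the function only so the grid itself can be inspected without `umap-learn` installed.

```python
from itertools import product

# Parameter values suggested in the checkoff prompt
n_neighbors_values = [5, 50]
min_dist_values = [0.0, 0.5]
param_grid = list(product(n_neighbors_values, min_dist_values))

def embed_all(X, grid, seed=42):
    """Return {(n_neighbors, min_dist): 2D embedding} for each setting in grid."""
    import umap  # imported here so the grid can be built without umap-learn
    embeddings = {}
    for n_neighbors, min_dist in grid:
        reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                            random_state=seed)
        embeddings[(n_neighbors, min_dist)] = reducer.fit_transform(X)
    return embeddings

print(param_grid)  # [(5, 0.0), (5, 0.5), (50, 0.0), (50, 0.5)]
```

Calling `embed_all(X_train_processed, param_grid)` would give one embedding per setting, each of which can be plotted with the same `px.scatter` code to compare layouts side by side.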



In [33]:
df_umap['Contract'] = X_train['Contract'].values  # align with same index

fig = px.scatter(
    df_umap,
    x='UMAP_1',
    y='UMAP_2',
    color='Contract',
    title='UMAP Projection Colored by Contract',
    opacity=0.5,
    width=800,
    height=600
)
fig.show()


Yes, the pattern matches the churn plots above. Comparing this graph with the previous one, month-to-month customers sit in the regions where churn is concentrated, so they are more likely to churn, whereas customers on one-year or two-year contracts are likely to be retained.

In [32]:
df_umap['InternetService'] = X_train['InternetService'].values

fig = px.scatter(
    df_umap,
    x='UMAP_1',
    y='UMAP_2',
    color='InternetService',
    title='UMAP Projection Colored by InternetService',
    opacity=0.5,
    width=800,
    height=600
)
fig.show()

In terms of InternetService, customers with no internet service and customers on DSL are likely to be retained, whereas fiber optic customers are more likely to churn.
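
Patterns read off a UMAP plot are worth confirming with a direct count. A minimal sketch using `pd.crosstab` follows; the rows are made up and only stand in for `X_train` joined with the churn label.

```python
import pandas as pd

# Made-up rows standing in for X_train joined with the churn label
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "Two year"],
    "Churn": ["Churn", "Retain", "Retain", "Retain", "Churn", "Retain"],
})

# Share of churn vs. retain within each contract type (each row sums to 1)
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(rates)
```

The same call with `"InternetService"` in place of `"Contract"` would check the second observation.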

*Discussion: Do you see distinct islands? The red points (Churn) often cluster together in UMAP space, indicating that churners share underlying structural similarities.*


---


**🤖 Section 6: AI-Powered Feature Extraction**

**Theory:** Traditional NLP (Bag of Words) loses context. We will use a Zero-Shot Classification model from Hugging Face. This treats the LLM as a function that takes text and outputs the probability of it belonging to a specific category.

In [6]:
# Initialize the zero-shot pipeline
# "facebook/bart-large-mnli" works well for Zero-Shot classification,
# but it is not small: the weights are about 1.63 GB and download on first use
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      device=-1) # Run on CPU (set to 0 for GPU if available)

# Define labels we want to hunt for
candidate_labels = ['service quality', 'pricing', 'technical issues', 'customer support', 'contract flexibility', 'better offers elsewhere', 'relocation', 'loyalty', 'internet speed', 'convenience']

# Let's test on 5 random churn comments
churn_samples = X_train[y_train == 1]['Customer_Feedback'].sample(5).tolist()

print("--- 🧠 AI Feature Extraction ---")
for comment in churn_samples:
    result = classifier(comment, candidate_labels)
    top_topic = result['labels'][0]
    confidence = result['scores'][0]

    print(f"📝 Comment: '{comment}'")
    print(f"🏷️ Extracted Topic: {top_topic} (Confidence: {confidence:.2f})\n")

# In a production pipeline, you would run this on all rows to create a new column: 'Reason_Category'




--- 🧠 AI Feature Extraction ---
📝 Comment: 'The pricing is too high compared to competitors.'
🏷️ Extracted Topic: pricing (Confidence: 0.87)

📝 Comment: 'I experienced frequent outages and slow internet.'
🏷️ Extracted Topic: technical issues (Confidence: 0.65)

📝 Comment: 'The pricing is too high compared to competitors.'
🏷️ Extracted Topic: pricing (Confidence: 0.87)

📝 Comment: 'I am unhappy with the service quality.'
🏷️ Extracted Topic: service quality (Confidence: 0.90)

📝 Comment: 'Customer support was unhelpful and difficult to reach.'
🏷️ Extracted Topic: customer support (Confidence: 0.75)
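
Running the classifier on every row is slow on CPU, but the synthetic feedback contains only a handful of distinct sentences. The sketch below classifies each distinct comment once and reuses the result. The helper name `extract_top_topics` is hypothetical, and `fake_classifier` is a stand-in that only imitates the `labels`/`scores` output shape so the example runs without downloading a model.

```python
def extract_top_topics(comments, classifier, candidate_labels):
    """Return the top label for each comment, classifying each distinct text once."""
    cache = {}
    for text in dict.fromkeys(comments):  # distinct texts, original order
        result = classifier(text, candidate_labels)
        cache[text] = result["labels"][0]  # labels come back sorted by score
    return [cache[text] for text in comments]

# Stand-in classifier for demonstration only; swap in the Hugging Face pipeline
def fake_classifier(text, labels):
    ranked = sorted(labels, key=lambda label: label not in text.lower())
    return {"labels": ranked, "scores": [1.0] + [0.0] * (len(ranked) - 1)}

comments = ["The pricing is too high.", "Slow internet speed.",
            "The pricing is too high."]
topics = extract_top_topics(comments, fake_classifier,
                            ["internet speed", "pricing"])
print(topics)  # ['pricing', 'internet speed', 'pricing']
```

With the real pipeline, the call would look like `extract_top_topics(X_train['Customer_Feedback'].tolist(), classifier, candidate_labels)`, and the returned list could be stored as a new `Reason_Category` column.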



---
✅ **Lab Checkoffs**

Here are some ideas to play around with the code and deepen your understanding. Complete at least 3 of these:

1.  **Polars vs. Pandas (Section 3):**
    *   **Scale Up:** Modify to generate a larger synthetic dataset (e.g., 100,000 or 1,000,000 rows) by repeating the existing data. Avoid looping `df_pd.append`: `DataFrame.append` was removed in pandas 2.0. Re-run the benchmarks and observe the speedup changes. *Hint: Use `pd.concat` with `ignore_index=True` for Pandas and `pl.concat` for Polars for efficient data replication.* What are your observations?
    *   **Different Operations:** Benchmark another common data operation (e.g., filtering a column, performing a complex `groupby` with multiple aggregations) using both Pandas and Polars. Report the speedup.

2.  **Rigorous Preprocessing (Section 4):**
    *   **Alternative Scaler:** Replace `StandardScaler` with `MinMaxScaler` in the `preprocessor` pipeline. How does this change the mean and variance of the scaled numeric features? Does it impact the UMAP visualization?
    *   **Include Text Feature:** Experiment with including the `Customer_Feedback` text feature in the `ColumnTransformer` by using a `TfidfVectorizer` (from `sklearn.feature_extraction.text`). *Note: This will require some additional setup in the `ColumnTransformer` for text processing.* How does adding text features affect the dimensionality and potentially the UMAP projection?

3.  **Visualizing the Manifold (Section 5):**
    *   **UMAP Parameter Tuning:** Experiment with different `n_neighbors` (e.g., 5, 50) and `min_dist` (e.g., 0.0, 0.5) values for UMAP. How do these changes affect the clustering and separation of churn vs. retain customers in the UMAP plot? Describe what you observe.
    *   **Color by Other Features:** Change the `color` aesthetic in the `px.scatter` plot to another categorical feature like `'Contract'` or `'InternetService'`. Do you see new patterns or clusters related to these features?

4.  **AI-Powered Feature Extraction (Section 6):**
    *   **New Candidate Labels:** Add more specific `candidate_labels` to the list (e.g., 'technical support quality', 'billing issues', 'promotional offers'). How do these new labels influence the extracted topics and confidence scores for the sample comments?
    *   **Full Feature Extraction:** Apply the `zero-shot-classification` pipeline to *all* customer feedback in `X_train` and create a new DataFrame or Series containing the `top_topic` for each customer. You might want to consider adding a `batch_size` argument to the `pipeline` for efficiency on larger datasets, or just apply it to a smaller subset for demonstration.