# Logistic Regression Model


Can I predict whether a client has a mobile number (yes or no) based on their scores, arrears, and residency info?

Yes, we can, using a logistic regression (LR) model.

But why does predicting contact info (like a mobile number) actually make sense?

We don’t always have full data for everyone:

- In debt datasets, some clients have only partial info

- Others have mismatched, outdated, or missing contact data

- Some have names and addresses but no verified mobile/email

So instead of waiting for external partners to "send back updates" (which costs money and time), we can use the data we already have to predict who’s likely to be contactable.

Let’s say we are working with 10,000 client records.

We want to buy contact info from Experian or Equifax, but they charge £0.10 per lookup.

That’s £1,000 to enrich all 10,000 records, but what if half of them are probably dead-ends?

So instead, we predict which clients are most likely to be contactable and only enrich the top 30%, the ones with the highest probability of success.

I see this as a cost-saving strategy, not just an LR model prediction.
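
The arithmetic behind this example can be sketched in a few lines. The record count, price per lookup, and 30% cut-off are the figures quoted above; the variable names are illustrative only.

```python
# Figures from the example above; the variable names are illustrative.
n_records = 10_000
price_per_lookup = 0.10   # GBP per lookup
target_share = 0.30       # enrich only the top 30% by predicted probability

# Cost of enriching every record versus only the targeted share.
full_cost = n_records * price_per_lookup
n_targeted = round(n_records * target_share)
targeted_cost = n_targeted * price_per_lookup

print(f"Enrich everyone: £{full_cost:,.2f}")
print(f"Enrich top {target_share:.0%} ({n_targeted:,} records): £{targeted_cost:,.2f}")
print(f"Lookup spend avoided: £{full_cost - targeted_cost:,.2f}")
```

The lower lookup spend is only a net win if the model ranks contactable clients near the top, so the saving should be read alongside the model's evaluation.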

In [None]:
import pandas as pd


df = pd.read_csv("/Users/rg/ACADEMICS/Interview/Connected Data Comapany/MAY/Dataset/Modified/cleaned_connected_data_with_zones.csv")

# Clean mobile/email flags
df['Mobile Flag'] = df['Mobile Flag'].str.strip().str.upper()
df['Email Flag'] = df['Email Flag'].str.strip().str.upper()
df['dp2 Council Tax Band'] = df['dp2 Council Tax Band'].str.strip().str.upper()
df['dp2 Occupancy Style'] = df['dp2 Occupancy Style'].str.strip().str.title()

# Create the target variable
df['Has_Mobile'] = (df['Mobile Flag'] == 'Y').astype(int)

# Create features
features = df[['dp1 Score', 'dp3 Score', 'Arrears Balance']]
zone_dummies = pd.get_dummies(df['Residency Zone'], prefix='Zone')
occupancy_dummies = pd.get_dummies(df['dp2 Occupancy Style'], prefix='Occupancy')
taxband_dummies = pd.get_dummies(df['dp2 Council Tax Band'], prefix='TaxBand')

# Combine all features
X = pd.concat([features, zone_dummies, occupancy_dummies, taxband_dummies], axis=1)

# Drop rows with missing values
X_cleaned = X.dropna()
y_cleaned = df.loc[X_cleaned.index, 'Has_Mobile']  # Align target

# Optional: Save the cleaned dataset
X_cleaned.to_csv("/Users/rg/ACADEMICS/Interview/Connected Data Comapany/MAY/Dataset/Modified/cleaned_features_no_missing.csv", index=False)
y_cleaned.to_csv("/Users/rg/ACADEMICS/Interview/Connected Data Comapany/MAY/Dataset/Modified/cleaned_target_no_missing.csv", index=False)

print(" Missing values dropped and files saved!")


 Missing values dropped and files saved!


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


X_cleaned = pd.read_csv("/Users/rg/ACADEMICS/Interview/Connected Data Comapany/MAY/Dataset/LR/cleaned_features_no_missing.csv")
y_cleaned = pd.read_csv("/Users/rg/ACADEMICS/Interview/Connected Data Comapany/MAY/Dataset/LR/cleaned_target_no_missing.csv")

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X_cleaned, y_cleaned, test_size=0.2, random_state=42)

# Logistic Regression Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train.values.ravel())

# Predict and Evaluate
y_pred = model.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)
print("\n Confusion Matrix:\n", conf_matrix)

class_report = classification_report(y_test, y_pred)
print("\n Classification Report:\n", class_report)

accuracy = accuracy_score(y_test, y_pred)
print("\n Accuracy Score:", round(accuracy * 100, 2), "%")



 Confusion Matrix:
 [[48 20]
 [15 30]]

 Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.71      0.73        68
           1       0.60      0.67      0.63        45

    accuracy                           0.69       113
   macro avg       0.68      0.69      0.68       113
weighted avg       0.70      0.69      0.69       113


 Accuracy Score: 69.03 %
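
As a sanity check, the headline numbers in the report can be recomputed by hand from the confusion matrix. This is a minimal sketch that copies the matrix from the output above; it assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
conf_matrix = np.array([[48, 20],
                        [15, 30]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
precision_1 = tp / (tp + fp)   # of clients predicted to have a mobile, the share that do
recall_1 = tp / (tp + fn)      # of clients with a mobile, the share the model found

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision_1:.2f}")
print(f"recall:    {recall_1:.2f}")
```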


That’s pretty decent, considering:

- The model predicts from behavioral features, not from phone records directly
- Missing values were dropped (which reduces the training size)
- This is a real-world, noisy problem where some mobile info may be incomplete or unrecorded

What this tells us:

- The model is solid for a basic proof-of-concept.
- It found 2 out of 3 clients who had mobile numbers (30 of 45, recall 0.67) just from their other info, although only 60% of the clients it flagged actually had one (precision 0.60).
- It could be used for enrichment targeting, deciding where to invest lookup efforts, or choosing fallback channels (like mail or tracing).
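
For enrichment targeting, the predicted probabilities matter more than the hard 0/1 labels. Below is a minimal sketch of ranking records by `predict_proba` and keeping the top 30%. It runs on synthetic stand-in data from `make_classification` (an assumption, not the client dataset), and the 30% cut-off is simply the figure from the cost example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): not the client dataset used above.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Probability of class 1 ("has a mobile") for each held-out record.
proba = model.predict_proba(X_test)[:, 1]

# Keep the 30% of records with the highest predicted probability.
top_share = 0.30  # cut-off from the cost example; tune to the lookup budget
n_top = int(round(top_share * len(proba)))
top_idx = np.argsort(proba)[::-1][:n_top]

# Hit rate inside the targeted group versus the overall base rate.
hit_rate_top = y_test[top_idx].mean()
base_rate = y_test.mean()
print(f"Targeted records: {n_top} of {len(proba)}")
print(f"Hit rate in top {top_share:.0%}: {hit_rate_top:.2f}")
print(f"Base rate overall: {base_rate:.2f}")
```

Comparing the hit rate in the targeted group with the base rate gives a rough sense of how much the ranking would improve on buying lookups at random.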