# Improving Conversion on Marketplaces

This notebook follows the CRISP-DM framework to explore, model, and evaluate conversion optimization strategies in online marketplaces.

## Step 1: Business Understanding
Goal: Improve conversion rates in a two-sided marketplace (e.g., Airbnb, Getmyboat).

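Conversion: a user completing a meaningful action (e.g., a booking or an inquiry). In this notebook, which uses the Olist e-commerce dataset, an order counts as converted when its status is `delivered`.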

Objective: Identify patterns in user and listing behavior to predict conversion likelihood.

In [1]:
# Load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load raw CSV files from the Olist dataset.
# The notebook runs from the notebooks/ folder, so the original relative path
# "data/raw/..." raised FileNotFoundError. Assumption: the data/ folder sits
# one level up, at the repository root.
DATA_DIR = "../data/raw"
orders = pd.read_csv(f"{DATA_DIR}/olist_orders_dataset.csv")
customers = pd.read_csv(f"{DATA_DIR}/olist_customers_dataset.csv")
order_items = pd.read_csv(f"{DATA_DIR}/olist_order_items_dataset.csv")
products = pd.read_csv(f"{DATA_DIR}/olist_products_dataset.csv")
reviews = pd.read_csv(f"{DATA_DIR}/olist_order_reviews_dataset.csv")
payments = pd.read_csv(f"{DATA_DIR}/olist_order_payments_dataset.csv")

# Preview one of the tables to check that loading worked
orders.head()

## Step 2: Data Understanding

In [None]:
# Define conversion: 1 = delivered, 0 = otherwise (vectorized comparison)
orders['converted'] = (orders['order_status'] == 'delivered').astype(int)

# Check distribution
sns.countplot(x='converted', data=orders)
plt.title('Order Conversion Distribution')
plt.show()
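The count plot shows the class balance visually; the same information can be read off as a number. A minimal sketch on a small made-up sample (the rows in `toy_orders` are invented for illustration and are not taken from the Olist data):

```python
import pandas as pd

# Toy stand-in for the orders table; these rows are invented for illustration.
toy_orders = pd.DataFrame(
    {"order_status": ["delivered", "delivered", "canceled", "delivered", "shipped"]}
)

# Same rule as above: 1 = delivered, 0 = anything else.
toy_orders["converted"] = (toy_orders["order_status"] == "delivered").astype(int)

# The mean of a 0/1 column is the conversion rate.
conversion_rate = toy_orders["converted"].mean()

# Share of each class, useful for spotting imbalance before modeling.
class_share = toy_orders["converted"].value_counts(normalize=True)

print(f"Conversion rate: {conversion_rate:.1%}")
print(class_share)
```

If one class dominates on the real data, accuracy alone becomes a weak metric, which is one reason to also report ROC AUC in the modeling step.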

## Step 3: Data Preparation

In [None]:
# Merge key tables into a unified dataset for modeling

# Merge orders with customers
df = orders.merge(customers, on='customer_id', how='left')

# Merge with order_items
df = df.merge(order_items, on='order_id', how='left')

# Merge with products
df = df.merge(products, on='product_id', how='left')

# Merge with reviews
df = df.merge(reviews, on='order_id', how='left')

# Merge with payments
df = df.merge(payments, on='order_id', how='left')

# Preview the merged dataset
df.head()


# Drop rows with missing critical values
df = df.dropna(subset=['product_id', 'review_score', 'payment_type'])

# Create modeling subset
df_model = df[['converted', 'price', 'freight_value', 'review_score', 'payment_value', 'payment_type']]

# Check structure
df_model.info()
df_model.head()
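One thing worth checking in the merges above: tables such as order items and payments can contain several rows per order, so chaining row-level merges may repeat the same order many times. The sketch below uses invented toy tables (`toy_orders`, `toy_items`, `toy_payments` are hypothetical) to show the effect and one possible remedy, aggregating to one row per order before merging. Summing prices and payments is an assumption about what the features should mean, not the notebook's established method.

```python
import pandas as pd

# Invented toy tables that mimic the one-to-many shape of items and payments.
toy_orders = pd.DataFrame({"order_id": ["A", "B"], "converted": [1, 0]})
toy_items = pd.DataFrame(
    {
        "order_id": ["A", "A", "B"],
        "price": [10.0, 20.0, 5.0],
        "freight_value": [2.0, 3.0, 1.0],
    }
)
toy_payments = pd.DataFrame(
    {"order_id": ["A", "A", "B"], "payment_value": [25.0, 10.0, 6.0]}
)

# Chained row-level merges: order A appears 2 items x 2 payments = 4 times.
row_level = toy_orders.merge(toy_items, on="order_id", how="left").merge(
    toy_payments, on="order_id", how="left"
)

# Aggregate each child table to one row per order before merging.
items_per_order = toy_items.groupby("order_id", as_index=False).agg(
    price=("price", "sum"), freight_value=("freight_value", "sum")
)
payments_per_order = toy_payments.groupby("order_id", as_index=False).agg(
    payment_value=("payment_value", "sum")
)

# validate="one_to_one" makes pandas raise if a key is duplicated on either side.
order_level = toy_orders.merge(
    items_per_order, on="order_id", how="left", validate="one_to_one"
).merge(payments_per_order, on="order_id", how="left", validate="one_to_one")

print(len(row_level), "rows after row-level merges")
print(len(order_level), "rows after aggregating first")
```

Keeping one row per order also keeps the train/test split honest, since copies of the same order cannot land on both sides of the split.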

## Step 4: Modeling

In [None]:
# Encode categorical variable: payment_type
df_model_encoded = pd.get_dummies(df_model, columns=['payment_type'], drop_first=True)

# Split data into train/test
from sklearn.model_selection import train_test_split

X = df_model_encoded.drop('converted', axis=1)
y = df_model_encoded['converted']

# Stratify on y so both splits keep the same share of converted orders
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print("📊 Classification Report:")
print(classification_report(y_test, y_pred))
print(f"🔁 ROC AUC Score: {roc_auc_score(y_test, y_proba):.4f}")

# Feature importance (impurity-based, from the fitted forest)

feature_importances = pd.Series(model.feature_importances_, index=X.columns)
feature_importances.nlargest(10).plot(kind='barh')
plt.title("Top 10 Feature Importances")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
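The `feature_importances_` attribute plotted above is computed from the training data. As a cross-check, permutation importance measures how much a held-out score drops when one feature is shuffled. The sketch below is self-contained and runs on synthetic data from `make_classification`; the names `demo_model`, `X_demo`, and `feature_names` are placeholders, not part of this notebook's pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 features, 2 of them informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
demo_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the held-out set and record the drop in ROC AUC.
result = permutation_importance(
    demo_model, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=42
)

perm_importances = pd.Series(result.importances_mean, index=feature_names)
print(perm_importances.sort_values(ascending=False))
```

On the real data the same call would take `model`, `X_test`, and `y_test` in place of the demo objects.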

## Step 5: Evaluation & Scenario

In [None]:
# Simulated prediction scenario
# Pick a random example from the test set
scenario = X_test.sample(1, random_state=1)

# Predict conversion probability and class
conversion_prob = model.predict_proba(scenario)[:, 1][0]
converted_class = model.predict(scenario)[0]

print("Prediction Scenario Features:")
display(scenario)

print(f"Predicted Conversion Probability: {conversion_prob:.2%}")
print(f"Predicted Class: {'Converted (1)' if converted_class == 1 else 'Not Converted (0)'}")

# ------------------------------------------------------
# What-If Analysis: Test impact of changing features

# Create a version with lower price (simulate a discount)
scenario_low_price = scenario.copy()
scenario_low_price['price'] *= 0.8  # 20% discount

# Create a version with higher review score
scenario_high_review = scenario.copy()
if 'review_score' in scenario_high_review.columns:
    scenario_high_review['review_score'] = 5.0

# Run predictions
low_price_prob = model.predict_proba(scenario_low_price)[:, 1][0]
high_review_prob = model.predict_proba(scenario_high_review)[:, 1][0]

# Display comparison
print("\nWhat-If Analysis:")
print(f"Lower Price (20% off) → Conversion Probability: {low_price_prob:.2%}")
print(f"Review Score = 5.0     → Conversion Probability: {high_review_prob:.2%}")
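The what-if comparison above changes one feature at a time for a single row. The same idea can be wrapped in a small helper that sweeps several values at once. This is a sketch under stated assumptions: `what_if` is a hypothetical helper, and `toy_X`, `toy_y`, and `toy_model` are synthetic stand-ins with an invented labeling rule, not the notebook's data or model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with an invented labeling rule, for illustration only.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    {
        "price": rng.uniform(10, 200, 300),
        "review_score": rng.integers(1, 6, 300).astype(float),
    }
)
toy_y = ((toy_X["price"] < 100) | (toy_X["review_score"] >= 4)).astype(int)
toy_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy_X, toy_y)


def what_if(model, row, column, values):
    """Return the predicted positive-class probability for each replacement value."""
    values = list(values)
    # Repeat the single-row frame once per candidate value, then overwrite the column.
    variants = pd.concat([row] * len(values), ignore_index=True)
    variants[column] = values
    probs = model.predict_proba(variants)[:, 1]
    return pd.Series(probs, index=values, name="conversion_prob")


base_row = toy_X.iloc[[0]]  # double brackets keep it a one-row DataFrame
base_price = float(base_row["price"].iloc[0])

# Sweep from no discount down to 30% off.
prices = [base_price * factor for factor in (1.0, 0.9, 0.8, 0.7)]
sweep = what_if(toy_model, base_row, "price", prices)
print(sweep)
```

A tree-based model returns step-shaped responses, so neighboring discounts may yield identical probabilities; that is a property of the model rather than evidence that price has no effect.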

## Step 6: Conclusion

### Summary of Key Insights
- The model predicts the likelihood that an order converts (reaches `delivered` status, per the definition in Step 2) based on features like **price**, **freight cost**, **review score**, and **payment type**.
- Orders with **lower prices** and **higher review scores** tend to have a higher predicted chance of conversion.
- Payment type also played a role: some payment methods were slightly more associated with conversions than others.

### Practical Impact
This model could be used to:
- **Prioritize listings** shown to users based on likelihood of conversion
- Help sellers **optimize pricing or invest in customer experience**
- Provide **product recommendations** that align with buyer behavior

### Next Steps
- Perform **hyperparameter tuning** and try other models like XGBoost or Logistic Regression
- Add **session behavior data** (e.g., time spent on product pages, number of views)
- Use SHAP values for better **explainability**
- Consider building a **recommender system** that incorporates conversion predictions
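The first next step could be sketched as follows: a cross-validated Logistic Regression baseline next to a small grid search over Random Forest settings. The data here is synthetic (`make_classification` with an imbalanced class weight chosen arbitrarily), and the parameter grid is an illustrative assumption rather than a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 80% positives), for illustration only.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, weights=[0.2, 0.8], random_state=42
)

# Stratified folds keep the class ratio stable across splits.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Baseline: scaled Logistic Regression scored by ROC AUC.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg_auc = cross_val_score(logreg, X_demo, y_demo, cv=cv, scoring="roc_auc").mean()

# Small, illustrative grid over two Random Forest hyperparameters.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=cv, scoring="roc_auc"
)
search.fit(X_demo, y_demo)

print(f"Logistic Regression mean ROC AUC: {logreg_auc:.3f}")
print(f"Best Random Forest params: {search.best_params_}")
print(f"Best Random Forest mean ROC AUC: {search.best_score_:.3f}")
```

Comparing both models on the same folds and the same metric keeps the comparison fair; on the real data, `X` and `y` from Step 4 would replace the demo arrays.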

---

This project demonstrates how data science can drive marketplace growth by focusing on measurable outcomes like conversion rates.