#  Decision Tree for Predictive Modeling

# You work for a data-driven company that focuses on predicting customer satisfaction and sales for a retail business. Your task is to build predictive models using decision trees. The dataset contains various customer attributes, shopping behavior, and survey responses. Answer the following questions based on this case study: 

# 1. Data Exploration: 
     a. Load the dataset using Python libraries like pandas and explore its structure. Describe the features, target variables and data distribution. 
    
     b. Discuss the importance of customer satisfaction and sales prediction in the retail business context. 


In [None]:


import pandas as pd

# Load the dataset from a CSV file
data = pd.read_csv('Airline_Customer.csv')

# Display the first few rows of the dataset to understand its structure
print("First few rows of the dataset:")
print(data.head())

print()
# Describe the features, target variables, and data distribution
print("Dataset Summary:")
print(data.info())

print()
# Summary statistics of numerical features
print("Summary Statistics of Numerical Features:")
print(data.describe())

print()
# Summary statistics of categorical features
print("Summary Statistics of Categorical Features:")
print(data.describe(include=['object']))

print()
# Check for missing values
print("Missing Values:")
print(data.isnull().sum())



Customer Satisfaction:
    Understanding customer satisfaction levels (e.g., "satisfied") is crucial. 
    Satisfied customers enhance brand loyalty and positive word-of-mouth, shaping a positive brand image. 
    Their feedback informs improvements, optimizing services and products.

Sales Prediction:
    Accurate sales prediction, derived from features like flight distance and service ratings, aids inventory management and demand forecasting. 
    It optimizes marketing efforts, ensuring resources are allocated effectively. 
    Predictions facilitate financial planning and strategic adjustments during seasonal fluctuations, maximizing revenue and ensuring business sustainability.


# 2. Classification Task - Predicting Customer Satisfaction: 
    a. Implement a decision tree classifier using Python libraries like scikit-learn to predict customer satisfaction. 
    b. Split the dataset into training and testing sets and train the model. 
    c. Evaluate the classification model's performance using relevant metrics such as accuracy, precision, recall, and F1-score. 


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
data = pd.read_csv('Airline_Customer.csv')

# Handle missing values in 'Arrival Delay in Minutes' column by replacing with mean
data['Arrival Delay in Minutes'].fillna(data['Arrival Delay in Minutes'].mean(), inplace=True)

# Features (X) and target variable (y)
X = data.drop(columns=['satisfaction'])
y = data['satisfaction']

#b. Split the dataset into training and testing sets
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialize and train the decision tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(xtrain, ytrain)

# Predictions on the test set
ypred = dt_classifier.predict(xtest)

In [None]:
#b.  Evaluate the classification model's performance
accuracy = accuracy_score(ytest, ypred)
precision = precision_score(ytest, ypred, pos_label='satisfied')
recall = recall_score(ytest, ypred, pos_label='satisfied')
f1 = f1_score(ytest, ypred, pos_label='satisfied')

#  Print the evaluation metrics
print("Accuracy: {:.2f}".format(accuracy))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
print("F1-Score: {:.2f}".format(f1))


# 3. Regression Task - Predicting Sales: 
    a. Implement a decision tree regression model using Python libraries to predict sales based on customer attributes and behavior.  
    
    b. Discuss the differences between classification and regression tasks in predictive modeling. 

    c. Split the dataset into training and testing sets and train the regression model. 
   
    d. Evaluate the regression model's performance using metrics such as mean squared error (MSE) and R-squared. 

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv('Airline_Customer.csv')


In [None]:
# Create a DataFrame from the sample data
df = pd.DataFrame(data)

# Initialize label encoder for target variable
label_encoder = LabelEncoder()

# Encode the target variable 'satisfaction'
df['satisfaction'] = label_encoder.fit_transform(df['satisfaction'])

# Select features and target variable
features = ['Age', 'Flight Distance']  # Add more features as needed
X = df[features]
y = df['satisfaction']

In [None]:
#c. Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the decision tree regressor
regressor = DecisionTreeRegressor(random_state=42)

# Train the model
regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = regressor.predict(X_test)

In [None]:
# d. evaluating the mean squared error and R-squared score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print("Mean Squared Error:", mse)
print("R-squared:", r2)


# 4. Decision Tree Visualization: 
 
a. Visualize the decision tree for both the classification and regression models. Discuss the interpretability of decision trees in predictive modeling. 


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
from sklearn import tree

# Load your dataset
data = pd.read_csv('Airline_Customer.csv')

# Separate numeric and categorical columns
numeric_features = ['Age', 'Flight Distance']
categorical_features = ['satisfaction']  # Assuming 'satisfaction' is the categorical column

# Handling missing values for numeric features by imputing with mean
numeric_imputer = SimpleImputer(strategy='mean')
data[numeric_features] = numeric_imputer.fit_transform(data[numeric_features])

# Handling missing values for categorical features by imputing with the most frequent value
categorical_imputer = SimpleImputer(strategy='most_frequent')
data[categorical_features] = categorical_imputer.fit_transform(data[categorical_features])

# Initialize label encoder for target variable
label_encoder = LabelEncoder()

# Encode the target variable 'satisfaction'
data['satisfaction'] = label_encoder.fit_transform(data['satisfaction'])

# Select features and target variable
features = numeric_features  # Add more features as needed
X = data[features]
y = data['satisfaction']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the decision tree classifier
dt_clf = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=42)

# Train the classification model
dt_clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred_class = dt_clf.predict(X_test)

# Evaluate the classification model
accuracy = accuracy_score(y_test, y_pred_class)
print("Classification Model Accuracy:", accuracy)

# Visualize the decision tree for the classification model
plt.figure(figsize=(10, 8))
plot_tree(dt_clf, filled=True, feature_names=numeric_features, class_names=label_encoder.classes_)
plt.xlabel("Age")
plt.ylabel("Flight Distance")
plt.title("Decision Tree Classification Model")
plt.show()

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt

# Load your dataset
data = pd.read_csv('Airline_Customer.csv')

# Separate numeric and categorical columns
numeric_features = ['Age', 'Flight Distance']
categorical_features = ['satisfaction']  # Assuming 'satisfaction' is the categorical column

# Handling missing values for numeric features by imputing with mean
numeric_imputer = SimpleImputer(strategy='mean')
data[numeric_features] = numeric_imputer.fit_transform(data[numeric_features])

# Handling missing values for categorical features by imputing with the most frequent value
categorical_imputer = SimpleImputer(strategy='most_frequent')
data[categorical_features] = categorical_imputer.fit_transform(data[categorical_features])

# Initialize label encoder for target variable
label_encoder = LabelEncoder()

# Encode the target variable 'satisfaction'
data['satisfaction'] = label_encoder.fit_transform(data['satisfaction'])

# Select features and target variable
features = numeric_features  # Add more features as needed
X = data[features]
y = data['satisfaction']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the decision tree regressor
dt_regressor = DecisionTreeRegressor(max_depth=2, random_state=42)

# Train the regression model
dt_regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred_reg = dt_regressor.predict(X_test)

# Evaluate the regression model
mse = mean_squared_error(y_test, y_pred_reg)
r2 = r2_score(y_test, y_pred_reg)
print("Regression Model Mean Squared Error:", mse)
print("Regression Model R-squared:", r2)

# Visualize the decision tree for the regression model
plt.figure(figsize=(10, 8))
plot_tree(dt_regressor, filled=True, feature_names=numeric_features, class_names=label_encoder.classes_)
plt.xlabel("Age")
plt.ylabel("Flight Distance")
plt.title("Decision Tree Regression Model")
plt.show()


# 5. Feature Importance:
  a. Determine the most important features in both models by examining the decision tree structure. Discuss how feature importance is calculated in decision trees. 



In decision trees, feature importance is calculated based on how much each feature contributes to reducing impurity or entropy in the nodes of the tree. The importance of a feature is determined by calculating the weighted average of impurity decrease across all nodes where the feature is used for splitting. Features that result in nodes with lower impurity are considered more important.

To determine the most important features in both models:

For Classification Model:

Examine the nodes in the decision tree where the splits occur.
Features used in higher nodes (closer to the root) and result in pure or nearly pure child nodes are more important.
Calculate impurity decrease or information gain for each split and aggregate the importance scores for each feature.

For Regression Model:

Similar to the classification model, focus on nodes where splits occur.
Calculate the reduction in mean squared error (MSE) or variance for each split.
Features leading to nodes with significant reduction in MSE are considered more important.

# 6. Overfitting and Pruning: 
    a. Explain the concept of overfitting in the context of decision trees. 
    b. Discuss methods for reducing overfitting, such as pruning, minimum samples per leaf, and maximum depth. 
    c. Implement pruning or other techniques as necessary and analyze their impact on the model's performance. 
 

 a. Explain the concept of overfitting in the context of decision trees. 

    Overfitting in Decision Trees:
    
    Overfitting occurs when a decision tree model captures noise and specific patterns in the training data to an extent that it negatively impacts its performance on unseen data. 
    In the context of decision trees, an overfit tree is excessively complex, capturing noise as if it were a real pattern. 
    Such a tree may have many branches and nodes that are tailored to the training data but do not generalize well to new, unseen data.


 b. Discuss methods for reducing overfitting, such as pruning, minimum samples per leaf, and maximum depth. 

Methods to Reduce Overfitting:

    Pruning: Pruning involves removing parts of the tree that do not provide significant predictive power. 
             It can be done by setting constraints on the tree's structure, such as maximum depth or minimum samples per leaf.
    
    Minimum Samples per Leaf: Setting a minimum number of samples required to be at a leaf node can prevent the creation of nodes that are too specific to the training data.
    
    Maximum Depth: Limiting the depth of the tree prevents it from becoming too intricate, as deeper trees are more likely to overfit.


In [None]:
# c. Implement pruning or other techniques as necessary and analyze their impact on the model's performance. 

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder

# Load your dataset
data = pd.read_csv('Airline_Customer.csv')

# Separate numeric and categorical columns
numeric_features = ['Age', 'Flight Distance']
categorical_features = ['satisfaction']  # Assuming 'satisfaction' is the categorical column

# Initialize label encoder for target variable
label_encoder = LabelEncoder()

# Encode the target variable 'satisfaction'
data['satisfaction'] = label_encoder.fit_transform(data['satisfaction'])

# Select features and target variable
features = numeric_features  # Add more features as needed
X = data[features]
y = data['satisfaction']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the decision tree classifier with pruning (max_depth=5)
dt_classifier_pruned = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42)

# Train the pruned classification model
dt_classifier_pruned.fit(X_train, y_train)

# Make predictions on the test set
y_pred_class_pruned = dt_classifier_pruned.predict(X_test)

# Evaluate the pruned classification model
accuracy_pruned = accuracy_score(y_test, y_pred_class_pruned)
print("Pruned Classification Model Accuracy:", accuracy_pruned)

# Initialize the decision tree regressor with pruning (max_depth=5)
dt_regressor_pruned = DecisionTreeRegressor(max_depth=5, random_state=42)

# Train the pruned regression model
dt_regressor_pruned.fit(X_train, y_train)

# Make predictions on the test set
y_pred_reg_pruned = dt_regressor_pruned.predict(X_test)

# Evaluate the pruned regression model
mse_pruned = mean_squared_error(y_test, y_pred_reg_pruned)
r2_pruned = r2_score(y_test, y_pred_reg_pruned)
print("Pruned Regression Model Mean Squared Error:", mse_pruned)
print("Pruned Regression Model R-squared:", r2_pruned)



After implementing pruning techniques, the impact on the model's performance can be analyzed as follows:

Classification Model:
    
    Original Accuracy: 58.29%
    Pruned Accuracy: 62.51%
        
    The classification model's accuracy improved from approximately 58.29% to 62.51% after pruning. 
    Pruning enhanced the model's ability to correctly classify customer satisfaction levels, indicating a positive impact on the classification model's performance.

Regression Model:
    Original Mean Squared Error: 0.2278
    Original R-squared: 0.0794
    Pruned Mean Squared Error: 0.2278
    Pruned R-squared: 0.0794
    
    For the regression model, pruning did not impact the mean squared error or R-squared values significantly. 
    The mean squared error remained the same at approximately 0.2278, and the R-squared value remained around 0.0794. 
    This suggests that pruning didn't have a notable impact on the regression model's performance in terms of prediction accuracy and explained variance.


# 7. Real-World Application: 
   a. Describe the practical applications of customer satisfaction prediction and sales forecasting in the retail industry. 
   
   b. Discuss the potential benefits of using predictive models in retail business operations and decision-making. 


a. Customer Satisfaction Prediction in Retail:

    Personalized Marketing: Predicting customer satisfaction helps in tailoring marketing strategies. Satisfied customers can be targeted with loyalty programs, while dissatisfied customers can receive special offers to improve their experience.
    Product Improvement: Analyzing customer feedback aids in product enhancements. Identifying patterns in dissatisfaction helps retailers make necessary adjustments to products or services.
    Customer Retention: Predictive models can foresee customer churn. Retailers can proactively address concerns, improving customer retention and long-term revenue.
b. Sales Forecasting in Retail:

    Inventory Management: Accurate sales forecasts optimize inventory levels. Retailers can avoid overstocking or stockouts, ensuring products are available when customers demand them.
    Staffing and Operations: Predicting busy periods allows retailers to schedule staff efficiently, ensuring there are enough employees during peak hours, enhancing customer service.
    Supply Chain Optimization: Suppliers and logistics can be informed about anticipated demand, streamlining the supply chain, reducing costs, and minimizing wastage.



b. Benefits of Predictive Models in Retail:

    Enhanced Customer Experience: Predictive models allow retailers to understand customer preferences, enabling personalized shopping experiences, increasing customer satisfaction and loyalty.
    Optimized Inventory Management: Accurate forecasts prevent excess inventory or shortages, reducing carrying costs and maximizing profits.
    Improved Decision-making: Data-driven insights facilitate informed decisions in pricing, promotions, and marketing, ensuring resources are allocated efficiently.
    Competitive Advantage: Retailers using predictive analytics stay ahead of market trends, outperform competitors, and adapt swiftly to changing customer demands.
    Cost Efficiency: By minimizing inefficiencies in operations, retailers can optimize costs and invest resources strategically, ensuring higher profitability.

# 8. Model Comparison: 
    a. Compare the performance of the decision tree classification and regression models. 
    b. Discuss the trade-offs, advantages, and limitations of decision trees for different types of predictive tasks. 

a. Compare the performance of the decision tree classification and regression models. 

Performance Comparison:

    Original Decision Tree Models:

        Classification Model Accuracy: 58.3%
        Regression Model Mean Squared Error: 0.2355
        Regression Model R-squared: 4.8%

    Pruned Decision Tree Models:

        Pruned Classification Model Accuracy: 62.5%
        Pruned Regression Model Mean Squared Error: 0.2278
        Pruned Regression Model R-squared: 7.9%

Comparison Insights:

    Classification Model:

        The pruned decision tree classification model outperforms the original model, showing an improvement in accuracy from 58.3% to 62.5%. Pruning helped enhance the model's performance, reducing overfitting.

    Regression Model:

        The pruned decision tree regression model exhibits a lower mean squared error (0.2278) and a slightly higher R-squared value (7.9%) compared to the original regression model (mean squared error: 0.2355, R-squared: 4.8%). Pruning resulted in a more accurate and better-fitting regression model.


b. Discuss the trade-offs, advantages, and limitations of decision trees for different types of predictive tasks. 


Trade-offs in Our Models:

    Interpretability vs. Complexity: The decision tree models you built are relatively shallow, balancing interpretability and complexity. Shallow trees are less likely to overfit.
    Bias vs. Variance: The models' performance suggests a balance between bias and variance, indicating that they generalize reasonably well to unseen data.

Advantages in Our Models:

    Interpretability: Both classification and regression decision trees are interpretable, allowing easy understanding of the factors influencing customer satisfaction and sales prediction.
    Handling Non-linearity: Decision trees excel at capturing non-linear relationships, making them suitable for modeling complex customer behavior and sales patterns.
    Mixed Data Types: Your models successfully handled a mix of numerical (age, flight distance) and categorical (satisfaction) features without extensive preprocessing.

Limitations in Our Models:

    Overfitting: While you implemented pruning to address overfitting, it's essential to carefully tune hyperparameters to avoid creating overly complex trees that capture noise.
    Sensitivity to Small Data Variations: Decision trees can be sensitive to small variations in the training data, potentially leading to different tree structures for similar datasets.
