# 03 Statistical Testing & Regression
This notebook tests key hypotheses using t-tests, ANOVA, chi-square tests, and regression analysis.

### Statistical Analysis and Hypothesis Testing
#### - T-test for difference in return rates between low and high review scores
**Hypotheses**<br>  
H₀: No difference in return rates between low and high score groups<br>  
H₁: Low score group has higher return rates

In [8]:
from scipy.stats import ttest_ind

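# Split the return flags into low (score <= 2) and high (score >= 4) review groups
low = df[df['review_score'] <= 2]['return_status']
high = df[df['review_score'] >= 4]['return_status']

# Welch's t-test: equal_var=False does not assume equal group variances; two-sided by default
t_stat, p_value = ttest_ind(low, high, equal_var=False)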
print("t-statistic:", t_stat, "p-value:", p_value)

t-statistic: 20.571767865353408 p-value: 5.514690293693966e-93


#### Explanation<br>
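p-value < 0.05: Reject H₀. The difference in return rates between the two review-score groups is statistically significant, and the positive t-statistic (low group minus high group) indicates that the low-score group has the higher return rate.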
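
H₁ above is directional (the low-score group returns more), while `ttest_ind` runs a two-sided test by default. A minimal sketch of the one-sided version follows, assuming SciPy 1.6 or later (where the `alternative` argument is available). The arrays are synthetic stand-ins for `low` and `high`, and the 5% and 1% return rates are made up for illustration, not taken from the dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic 0/1 return flags standing in for the notebook's `low` and `high` series
rng = np.random.default_rng(0)
low = rng.binomial(1, 0.05, size=2000)    # hypothetical 5% return rate
high = rng.binomial(1, 0.01, size=8000)   # hypothetical 1% return rate

# Two-sided Welch test, same call as in the cell above
t_two, p_two = ttest_ind(low, high, equal_var=False)

# One-sided Welch test matching H1: mean(low) > mean(high)
t_one, p_one = ttest_ind(low, high, equal_var=False, alternative='greater')

print("two-sided p-value:", p_two)
print("one-sided p-value:", p_one)
```

When the t-statistic is positive, the one-sided p-value is half the two-sided one, so the conclusion drawn above would not change.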

#### - ANOVA test for return rate differences by product category
**Hypotheses**<br>  
H₀: No difference in return rates across product categories<br>  
H₁: At least one category has a different return rate

In [9]:
from scipy.stats import f_oneway

# Attach English category names to df, then list the available categories
merged_df = pd.merge(products, category_translation, on='product_category_name', how='left')
df = pd.merge(df, merged_df[['product_id', 'product_category_name_english']], on='product_id', how='left')
print("Product Category Name in English:\n", df['product_category_name_english'].unique())

# Groups for ANOVA: return flags for three selected categories
electronics = df[df['product_category_name_english'] == 'electronics']['return_status']
furniture = df[df['product_category_name_english'] == 'furniture_decor']['return_status']
fashion = df[df['product_category_name_english'] == 'fashion_male_clothing']['return_status']

# Sample size
# print("\nThe number of data in each category")
# print("Electronics:", len(electronics))
# print("Furniture:", len(furniture))
# print("Fashion:", len(fashion))

# ANOVA
anova_result = f_oneway(electronics, furniture, fashion)
print("\nF-statistic:", anova_result.statistic, "p-value:", anova_result.pvalue)

Product Category Name in English:
 ['housewares' 'perfumery' 'auto' 'pet_shop' 'stationery' nan
 'furniture_decor' 'office_furniture' 'garden_tools'
 'computers_accessories' 'bed_bath_table' 'toys' 'telephony'
 'health_beauty' 'electronics' 'baby' 'cool_stuff' 'watches_gifts'
 'air_conditioning' 'sports_leisure' 'books_general_interest'
 'construction_tools_construction' 'small_appliances' 'food'
 'luggage_accessories' 'fashion_underwear_beach' 'christmas_supplies'
 'fashion_bags_accessories' 'musical_instruments'
 'construction_tools_lights' 'books_technical' 'costruction_tools_garden'
 'home_appliances' 'market_place' 'agro_industry_and_commerce'
 'party_supplies' 'home_confort' 'cds_dvds_musicals'
 'industry_commerce_and_business' 'consoles_games' 'furniture_bedroom'
 'construction_tools_safety' 'fixed_telephony' 'drinks'
 'kitchen_dining_laundry_garden_furniture' 'fashion_shoes'
 'home_construction' 'audio' 'home_appliances_2' 'fashion_male_clothing'
 'cine_photo' 'furniture_living

#### Explanation<br>  
p-value = 0.8192 > 0.05: Do not reject H₀. There is no statistically significant difference in return rates among the three tested categories (electronics, furniture_decor, fashion_male_clothing).
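
The ANOVA above compares three hand-picked categories, and the sample-size check is commented out. One possible way to inspect group sizes and test every sufficiently large category at once is sketched below. The `toy` frame is synthetic, and the minimum group size of 30 is an arbitrary threshold chosen for illustration; neither comes from the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic stand-in for df with the two columns the test needs
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "product_category_name_english": rng.choice(
        ["electronics", "furniture_decor", "fashion_male_clothing"], size=3000
    ),
    "return_status": rng.binomial(1, 0.02, size=3000),  # hypothetical 2% return rate
})

grouped = toy.groupby("product_category_name_english")["return_status"]

# Group size and return rate per category
print(grouped.agg(["size", "mean"]))

# One array of return flags per category, skipping very small groups
min_size = 30  # assumed threshold, not from the notebook
samples = [g.to_numpy() for _, g in grouped if len(g) >= min_size]

f_stat, p_val = f_oneway(*samples)
print("F-statistic:", f_stat, "p-value:", p_val)
```

Rows with a missing category name are dropped by `groupby` by default, so the `nan` entry in the category list would not form its own group.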

#### - Chi-Square Test for independence between category and return status
**Hypotheses**<br>  
H₀: Product category and return status are independent<br>  
H₁: Product category and return status are related

In [10]:
from scipy.stats import chi2_contingency

# Create contingency table
contingency_table = pd.crosstab(df['product_category_name'], df['return_status'])

# Chi-Square
chi2, p, dof, expected = chi2_contingency(contingency_table)
print("Chi-square statistic:", chi2, "p-value:", p)

Chi-square statistic: 207.70715362160462 p-value: 4.294503050782418e-15


#### Explanation<br>  
p-value < 0.05: Reject H₀. Product category and return status are not independent.
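
With more than 100,000 rows, even a weak association produces a very small p-value, so an effect size helps judge how strong the relationship is. The sketch below computes Cramér's V from the chi-square statistic and checks how many expected counts fall below 5. The data is synthetic, with made-up category labels and a made-up 3% return rate, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in for df: four hypothetical categories, 3% return rate
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "product_category_name": rng.choice(["a", "b", "c", "d"], size=5000),
    "return_status": rng.binomial(1, 0.03, size=5000),
})

table = pd.crosstab(toy["product_category_name"], toy["return_status"])
chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranges from 0 to 1
n = table.to_numpy().sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))

# Share of cells with an expected count below 5 (the chi-square approximation weakens there)
share_small = (expected < 5).mean()

print("Cramér's V:", cramers_v)
print("Share of cells with expected count < 5:", share_small)
```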

#### - Linear Regression: Does review score influence return rate?
**Hypotheses**<br>  
H₀: Review score has no impact on return rate<br>  
H₁: Lower review score increases return rate

In [11]:
import statsmodels.api as sm

# Set the independent variable (x) and dependent variable (y)
X = df[['review_score']]
y = df['return_status']

# Add a constant term so the model includes an intercept.
# Without it, the fitted line is forced through the origin (0, 0).
X = sm.add_constant(X)

# Regression model
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:          return_status   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                  0.015
Method:                 Least Squares   F-statistic:                     1666.
Date:                Mon, 30 Jun 2025   Prob (F-statistic):               0.00
Time:                        22:34:20   Log-Likelihood:             1.4272e+05
No. Observations:              112372   AIC:                        -2.854e+05
Df Residuals:                  112370   BIC:                        -2.854e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            0.0287      0.001     46.149   

#### Explanation<br>  
- R² = 0.015 (very low): review score explains only 1.5% of the variance in return status
- Coefficients:<br>
Intercept = 0.0287 (predicted return rate at score = 0, an extrapolation since review scores run from 1 to 5)<br>
review_score = -0.0060 (each 1-point increase in score lowers the predicted return rate by 0.6 percentage points)
- P>|t| = 0.000 (p-value < 0.05): reject H₀ (statistically significant) → The lower the review score, the higher the return rate.

**Interpretation**<br>  
The explanatory power is low → consider multiple regression with more variables.
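
To make the coefficients concrete, the fitted line can be evaluated at each review score. This is plain arithmetic on the rounded values reported above (intercept 0.0287, slope -0.0060), so the results are approximate.

```python
# Rounded coefficients as reported in the explanation above
intercept = 0.0287
slope = -0.0060

# Predicted return rate for each review score from 1 to 5
predicted = {score: intercept + slope * score for score in range(1, 6)}

for score, rate in predicted.items():
    print(f"score {score}: predicted return rate = {rate:.4f}")
```

With these rounded values the prediction is about 0.0227 at score 1 and slightly below zero at score 5. A straight line fitted to a 0/1 outcome is not constrained to stay between 0 and 1, which is one reason a model designed for binary outcomes is a better fit for `return_status`.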

#### - Multiple Linear Regression: Analyze other influencing factors

In [12]:
import numpy as np
import statsmodels.api as sm

# Create delivery delay variable
# Convert to a date type
orders['order_delivered_customer_date'] = pd.to_datetime(orders['order_delivered_customer_date'])
orders['order_estimated_delivery_date'] = pd.to_datetime(orders['order_estimated_delivery_date'])

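# Flag late deliveries (1 = delivered after the estimated date).
# Orders with a missing delivery date compare as False and are coded 0.
orders['delivery_late'] = (orders['order_delivered_customer_date'] > orders['order_estimated_delivery_date']).astype(int)

# Keep only the columns needed for the merge
orders_late = orders[['order_id', 'delivery_late']]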

# Total price: sum all payments made for each order
order_price = payments.groupby("order_id")["payment_value"].sum().reset_index()

# Rename payment_value to total_price
order_price.rename(columns={"payment_value": "total_price"}, inplace=True)

# Merge with main df
df = pd.merge(df, orders_late, on='order_id', how='left')
df = pd.merge(df, order_price, on='order_id', how='left')

# Check if it was merged into df
print(df.columns)

# Multiple Regression
# Select features
features = ['review_score', 'delivery_late', 'total_price']

# Drop rows with missing (NaN) or infinite values in the model columns
df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=features + ['return_status'])

# Define the independent variable (x) and dependent variable (y)
X = df[features]
y = df['return_status']

# Add a constant (intercept) term
X = sm.add_constant(X)

# Regression model
model = sm.OLS(y, X).fit()

# Print Result
print(model.summary())

Index(['order_id', 'customer_id', 'order_status', 'review_score',
       'return_status', 'product_id', 'product_category_name',
       'product_category_name_english', 'delivery_late', 'total_price'],
      dtype='object')
                            OLS Regression Results                            
Dep. Variable:          return_status   R-squared:                       0.018
Model:                            OLS   Adj. R-squared:                  0.018
Method:                 Least Squares   F-statistic:                     698.4
Date:                Mon, 30 Jun 2025   Prob (F-statistic):               0.00
Time:                        22:34:20   Log-Likelihood:             1.4293e+05
No. Observations:              112369   AIC:                        -2.858e+05
Df Residuals:                  112365   BIC:                        -2.858e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                            

#### Explanation<br>  
- R² = 0.018 (still low): the added variables barely improve on the single-predictor model (R² = 0.015)
- P>|t| = 0.000 (p-value < 0.05): reject H₀ → Lower review scores and late deliveries are both associated with a higher probability of return.
- Interpretation: Price has a small impact<br>
→ Add more variables (category, region, payment type, etc.) for improved prediction<br>
→ Move to modeling stage for prediction



**Interpretation**<br>  
To initially examine the influence of variables, I used OLS regression and found that review score, price, and delivery delay had statistically significant effects. Since `return_status` is a binary variable, I then applied logistic regression and classification models (Random Forest, XGBoost) to compare performance.