# Q1
Briefly describe the sample data (descriptive statistics). Develop a multiple linear regression equation that describes the relationship between the cost of delivery and the other variables. Do these four variables explain a reasonable amount of variation in the dependent variable? Estimate the delivery cost for a kit delivered to a house that takes 20 minutes to prepare, 30 minutes to deliver, and covers a distance of 25 kilometers.

In [1]:
import pandas as pd
import statsmodels.api as sm

In [2]:
# Load the data
data = pd.read_csv("data.csv")

In [3]:
data

Unnamed: 0,Sample Number,Cost ($),Prepare,Delivery,Kilometer,Resident
0,1,32,10,51,20,1
1,2,24,11,33,12,1
2,3,32,16,47,19,1
3,4,20,9,18,8,0
4,5,29,8,88,17,1
5,6,23,9,20,11,0
6,7,23,9,39,11,0
7,8,22,10,23,10,1
8,9,21,13,20,8,1
9,10,22,10,32,10,1


In [4]:
# 1. Descriptive statistics
print(data.describe())

       Sample Number   Cost ($)    Prepare   Delivery  Kilometer   Resident
count      60.000000  60.000000  60.000000  60.000000  60.000000  60.000000
mean       30.500000  23.000000  10.566667  28.500000  10.816667   0.516667
std        17.464249   3.459205   4.951859  13.704509   3.833358   0.503939
min         1.000000  16.000000   2.000000  11.000000   4.000000   0.000000
25%        15.750000  20.000000   8.000000  19.000000   8.000000   0.000000
50%        30.500000  23.000000   9.000000  25.500000  10.500000   1.000000
75%        45.250000  25.000000  13.000000  35.000000  14.000000   1.000000
max        60.000000  32.000000  25.000000  88.000000  20.000000   1.000000


In [5]:
# 2. Develop a multiple linear regression equation
X = data[['Prepare', 'Delivery', 'Kilometer', 'Resident']]
Y = data['Cost ($)']

# Adding a constant to the model (intercept)
X = sm.add_constant(X)

model = sm.OLS(Y, X).fit()

# Display the regression results
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:               Cost ($)   R-squared:                       0.960
Model:                            OLS   Adj. R-squared:                  0.957
Method:                 Least Squares   F-statistic:                     328.8
Date:                Thu, 14 Sep 2023   Prob (F-statistic):           1.09e-37
Time:                        18:16:54   Log-Likelihood:                -62.634
No. Observations:                  60   AIC:                             135.3
Df Residuals:                      55   BIC:                             145.7
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         13.2090      0.302     43.779      0.0

In [6]:
# 3. Checking the amount of variation explained
# The R-squared value from the model summary gives us this information

In [7]:
# 4. Estimate the delivery cost using the regression equation
# Create a new DataFrame with the values for prediction
new_data = pd.DataFrame({'const': [1], 'Prepare': [20], 'Delivery': [30], 'Kilometer': [25], 'Resident': [1]})

# Use the model to make a prediction
prediction = model.predict(new_data)

print("Estimated delivery cost: $", round(prediction[0], 2))

Estimated delivery cost: $ 35.24


Explanation:

In step 1, we use the describe method to get the descriptive statistics of the dataset.

In step 2, we create a multiple linear regression model using the statsmodels library, including an intercept in the model.

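In step 3, we read the R-squared value from the model summary. At 0.960 (adjusted 0.957), the four independent variables together explain about 96% of the variation in delivery cost, and the F-statistic of 328.8 (p = 1.09e-37) shows the model fits far better than an intercept-only model.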

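In step 4, we create a new DataFrame with the given values (20 minutes of preparation, 30 minutes of delivery, 25 kilometers) and use the fitted model to predict the cost. 'Resident' is set to 1 on the assumption that a delivery to a house counts as a resident delivery. The resulting estimate of $35.24 should be read with caution: 25 kilometers exceeds the largest distance in the sample (20 kilometers), so the prediction extrapolates beyond the observed data.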

# Q2
Test whether one or more regression coefficients differ from zero. Also test whether any of the variables can be dropped, and rerun the regression until only significant variables are included. (10 points)

In [8]:
# Setting a significance level
alpha = 0.05

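# Iteratively removing insignificant predictors.
# The intercept ('const') is excluded from the check so it is never dropped.
while model.pvalues.drop('const').max() > alpha:
    # Drop the predictor with the highest p-value
    drop_variable = model.pvalues.drop('const').idxmax()
    X = X.drop(columns=[drop_variable])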

    # Re-run the regression with the reduced set of variables
    model = sm.OLS(Y, X).fit()

    # Display the regression results
    print(model.summary())

# The final model will only include significant variables (based on the specified alpha level)

                            OLS Regression Results                            
Dep. Variable:               Cost ($)   R-squared:                       0.959
Model:                            OLS   Adj. R-squared:                  0.957
Method:                 Least Squares   F-statistic:                     440.7
Date:                Thu, 14 Sep 2023   Prob (F-statistic):           6.64e-39
Time:                        18:19:38   Log-Likelihood:                -63.001
No. Observations:                  60   AIC:                             134.0
Df Residuals:                      56   BIC:                             142.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         13.1765      0.298     44.179      0.0

Explanation:

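- We initially run the regression with all variables included. Its F-statistic of 328.8 (p = 1.09e-37) rejects the hypothesis that all regression coefficients are zero, so at least one coefficient differs from zero.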
- We then enter a while loop, which continues until all variables in the model are significant at the specified alpha level.
- Inside the while loop, we identify and remove the most insignificant variable (the one with the highest p-value) and rerun the regression.
- This process repeats, removing one variable at a time, until only significant variables remain in the model.
- A summary is printed after each refit, so the last summary shown is that of the final model, which includes only significant variables.

# Q3
Write a brief report comparing each regression with different variables in the equation and interpret your findings. (10 points)

## Answer
This analysis applies multiple linear regression to the relationship between four independent variables ('Prepare', 'Delivery', 'Kilometer', and 'Resident') and the dependent variable ('Cost ($)').

### Descriptive Statistics

From the descriptive statistics, it's observed that:
- The mean cost is 23 USD with a standard deviation of approximately 3.46.
- The average preparation time is approximately 10.57 minutes, and the delivery time averages to 28.5 minutes.
- On average, the delivery covers a distance of about 10.82 kilometers.
- Approximately 51.67% of the data points are for residents.

### Regression Analysis

#### First Regression (All variables included)

In the first regression model where all variables are included, we have a high R-squared value of 0.960, indicating that the model explains 96% of the variability in the dependent variable. However, looking at the p-values for individual predictors, 'Prepare' and 'Delivery' have p-values greater than the chosen significance level of 0.05, suggesting that they are not statistically significant in predicting the cost.

#### Second Regression ('Delivery' variable removed)

After removing the 'Delivery' variable (which had the highest p-value), we ran the regression again. R-squared slipped only from 0.960 to 0.959, and the adjusted R-squared held at 0.957, so the model loses essentially no explanatory power. The 'Prepare' variable remains insignificant with a p-value of 0.219.

#### Third Regression ('Prepare' variable removed)

In the next iteration, the 'Prepare' variable, having the highest p-value, is removed, leaving us with 'Kilometer' and 'Resident' variables. The model still retains a high R-squared value of 0.958, explaining a substantial amount of variation in the delivery cost. Both variables in this model are significant, with p-values less than 0.05.

### Interpretation and Findings

- **'Kilometer' Variable**: This variable remained highly significant in every model, with p-values far below 0.05. Its coefficient is positive: holding the other predictors fixed, each additional kilometer raises the predicted delivery cost.

- **'Resident' Variable**: This variable is also statistically significant. It indicates that deliveries to residents are associated with a higher cost, although the effect is smaller than that of the distance covered.

- **Model Selection**: The final model, which includes only 'Kilometer' and 'Resident' as predictors, is preferable as it retains a high R-squared value while using fewer variables, adhering to the principle of parsimony. Moreover, both predictors in this model are statistically significant, helping in a reliable prediction of the delivery cost.

In conclusion, the 'Kilometer' and 'Resident' variables alone explain about 96% of the variability in the delivery cost (R-squared of 0.958). These two factors are therefore the ones to focus on when predicting, and possibly reducing, delivery costs. The 'Prepare' and 'Delivery' variables, although plausible predictors, do not contribute significantly once the other variables are in the model, so removing them through backward elimination costs almost nothing in explanatory power.