# STAT 301 Project - Tyler Yih (Group 31)

## Project Stage 1: Data Description & Exploratory Data Analysis

### 0. TA feedback

Stage 1 Score: 28/30

*Pre-selection of variables: `room_type` provides more information than `room_shared` and `room_private` combined. Why choose to keep `room_shared` and `room_private` instead?*
- I made a typo in Stage 1 of my project. I did in fact keep `room_type` instead of `room_shared` and `room_private`.

*Scientific Question: longitude and latitude may not be relevant as we don't cover spatial analysis in this course.*
- Noted, I will remove them.

*Visualization: the legend for Day type is missing.*
- Noted, I will add one to my plot.

*Visualization: Interpretation: The word "cause" can be misleading. Association vs causation.*
- Noted, I changed the wording to "associated with..."

*Visualization: Interpretation: The visualization suggests that `day_type` is a promising predictors to include?*
- It does not suggest that `day_type` is a promising predictor to include. I will remove that statement.

### 1. Data Description

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(broom)
library(cowplot)

In [None]:
# Airbnb data in Athens on weekdays
weekdays <- read_csv("data/athens_weekdays.csv")
head(weekdays)

In [None]:
# Airbnb data in Athens on weekends
weekends <- read_csv("data/athens_weekends.csv")
head(weekends)

#### Provide a descriptive summary

- The two datasets provide information on Airbnb prices in Athens, Greece on weekends and weekdays, including room type, cleanliness rating, guest satisfaction score, number of bedrooms, and distance from the city centre.
- The weekdays dataset consists of 2,653 entries, and the weekends dataset consists of 2,627 entries.
- Both datasets have 20 variables, whose names, descriptions, and types are provided in the table below.

| Variable Name                 | Description                                                     | Type        |
|-------------------------------|-----------------------------------------------------------------|-------------|
| `(index)`                     | Index of the row in the table                                   | Numeric     |
| `realSum`                     | Full price of accommodation for two people and two nights (EUR) | Numeric     |
| `room_type`                   | Type of the accommodation                                       | Categorical |
| `room_shared`                 | Whether the room is shared                                      | Boolean     |
| `room_private`                | Whether the room is private                                     | Boolean     |
| `person_capacity`             | Maximum number of guests                                        | Numeric     |
| `host_is_superhost`           | Whether the host is a superhost                                 | Boolean     |
| `multi`                       | Whether the listing belongs to hosts with 2-4 offers            | Boolean     |
| `biz`                         | Whether the listing belongs to hosts with more than 4 offers    | Boolean     |
| `cleanliness_rating`          | Cleanliness rating of the listing                               | Numeric     |
| `guest_satisfaction_overall`  | Overall guest satisfaction rating                               | Numeric     |
| `bedrooms`                    | Number of bedrooms in the listing (0 for studios)               | Numeric     |
| `dist`                        | Distance from the city centre (km)                              | Numeric     |
| `metro_dist`                  | Distance from the nearest metro station (km)                    | Numeric     |
| `attr_index`                  | Attraction index of the listing location                        | Numeric     |
| `attr_index_norm`             | Normalized attraction index (0-100)                             | Numeric     |
| `rest_index`                  | Restaurant index of the listing location                        | Numeric     |
| `rest_index_norm`             | Normalized restaurant index (0-100)                             | Numeric     |
| `lng`                         | Longitude of the listing location                               | Numeric     |
| `lat`                         | Latitude of the listing location                                | Numeric     |


#### Source and information

- The data was collected by web scraping with a web-automation framework (Selenium WebDriver) as part of a study by Kristóf Gyódi and Łukasz Nawaro. The attractiveness indices are based on TripAdvisor data.
- The scraping was set up to collect the Airbnb offers that would be presented to a real user. The dataset is provided by the University of Warsaw.
- *Sources:*

https://doi.org/10.5281/zenodo.4446043

https://doi.org/10.1016/j.tourman.2021.104319

#### Pre-selection of variables

In [None]:
unique(weekdays$room_type)

`room_type`, `room_shared`, and `room_private` carry overlapping information, so we only need to keep one of these features, `room_type`.

The `...1` (index) column is just a row number and carries no useful information either.

Lastly, both `attr_index` and `rest_index` overlap heavily with their normalized counterparts. We only need one of each pair, most likely the normalized version.
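
The overlap described above can be checked directly. The following is a minimal sketch on a hypothetical toy tibble (the values and the rescaling used for `attr_index_norm` are assumptions for illustration, not the actual Athens data); the same two checks could be run on `weekdays` or `weekends`.

```r
library(dplyr)

# Hypothetical toy rows that mimic the structure of the Athens data
toy <- tibble(
  room_type       = c("Entire home/apt", "Private room", "Shared room", "Private room"),
  room_shared     = c(FALSE, FALSE, TRUE, FALSE),
  room_private    = c(FALSE, TRUE, FALSE, TRUE),
  attr_index      = c(100, 250, 50, 400),
  attr_index_norm = attr_index / max(attr_index) * 100 # assumed rescaling
)

# If each room_type maps to exactly one (room_shared, room_private) pair,
# the two Boolean columns add nothing beyond room_type
flag_combos <- toy |>
  distinct(room_type, room_shared, room_private) |>
  count(room_type, name = "n_combos")

# A correlation near 1 means the raw and normalized indices are redundant
index_cor <- cor(toy$attr_index, toy$attr_index_norm)

flag_combos
index_cor
```

If `n_combos` is 1 for every room type and the correlation is close to 1, dropping the redundant columns loses no information.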

## 2. Scientific Question

#### Clearly state the question you want to try to answer using the dataset

- We want to examine the association between the price of the accommodation for two people staying two nights (response) and predictors related to room characteristics (e.g. room type, number of bedrooms), location information (e.g. distance from the city centre, distance from the nearest metro station), and the attraction index.
- The response is `real_sum`, the price of the accommodation for two people staying two nights in Euros.
- Our question is primarily focused on inference, as we are identifying which features are associated with price and estimating their effects. We may build a predictive model later and keep our options open.

## 3. Exploratory Data Analysis and Visualization

#### Data cleaning and wrangling

In [None]:
library(tidyverse)

weekdays <- read_csv("data/athens_weekdays.csv")
weekends <- read_csv("data/athens_weekends.csv")

athens <- bind_rows(
  # bind the two datasets together into a single tibble
  mutate(weekdays, day_type = "weekday"),
  mutate(weekends, day_type = "weekend")
) |>
  mutate(
    day_type = factor(day_type, levels = c("weekday", "weekend"))
  ) |> # make day_type a factor
  select(-room_shared, -room_private, -`...1`, -lng, -lat) |> # drop columns discussed in Section 1; NOTE: longitude and latitude dropped now
  rename(real_sum = realSum) |>
  mutate(
    room_type = str_squish(room_type),
    room_type = factor(
      room_type,
      levels = c("Entire home/apt", "Private room", "Shared room")
    ),
    room_type = fct_recode(
      room_type,
      "entire" = "Entire home/apt",
      "private" = "Private room",
      "shared" = "Shared room"
    )
  ) |>
  mutate(
    num_host_offers = case_when(
      biz == 1 ~ "moreThanFour",
      multi == 1 ~ "twoToFour",
      TRUE ~ "one"
    ),
    num_host_offers = factor(
      num_host_offers,
      levels = c("moreThanFour", "twoToFour", "one")
    ) # combine biz and multi into a single categorical column
  ) |>
  select(-multi, -biz)

head(athens)
summary(athens$real_sum)

#### Provide a visualization that you consider relevant to address your question or to explore the data

In [None]:
p1 <- ggplot(athens, aes(x = room_type, y = real_sum, fill = day_type)) +
  geom_boxplot(position = position_dodge(width = 0.8), alpha = 0.95) +
  scale_y_log10() +
  scale_x_discrete(labels = function(x) str_to_title(x)) +
  scale_fill_discrete(labels = function(x) str_to_title(x)) +
  labs(
    x = "Room Type",
    y = "Price for 2 Nights (log-scale Euros)",
    fill = "Day Type"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "top",
    strip.text = element_text(face = "bold"),
    plot.title = element_text(hjust = 0.5, size = 14, margin = margin(b = 10)),
    axis.text.x = element_text(angle = 20, vjust = 1, hjust = 1, size = 10)
  )

p2 <- ggplot(athens, aes(x = num_host_offers, y = real_sum, fill = day_type)) +
  geom_boxplot(position = position_dodge(width = 0.8), alpha = 0.95) +
  scale_y_log10() +
  scale_x_discrete(
    labels = function(x) case_when(
      x == "moreThanFour" ~ "More Than Four",
      x == "one" ~ "One",
      x == "twoToFour" ~ "Two to Four",
      TRUE ~ str_to_title(x)
    )) +
  scale_fill_discrete(labels = function(x) str_to_title(x)) +
  labs(
    x = "Host Offers",
    y = NULL,
    fill = "Day Type"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "top",
    strip.text = element_text(face = "bold"),
    plot.title = element_text(hjust = 0.5, size = 14, margin = margin(b = 10)),
    axis.text.x = element_text(angle = 20, vjust = 1, hjust = 1, size = 10)
  )

legend <- get_legend(p1 + theme(legend.position = "right"))

plot_grid(
  ggdraw() + 
    draw_label(
      "Price Distribution", 
      fontface = 'bold', 
      x = 0.5, 
      hjust = 0.5,
      size = 16
    ),
  plot_grid(
    plot_grid(
      p1 + labs(subtitle = "By Room Type", title = NULL) + theme(legend.position = "none"),
      p2 + labs(subtitle = "By Host Offer Count", title = NULL) + theme(legend.position = "none"),
      ncol = 2,
      align = "hv",
      axis = "tblr",
      rel_widths = c(1, 1)
    ),
    legend,
    ncol = 2,
    rel_widths = c(1, 0.15)
  ),
  ncol = 1,
  rel_heights = c(0.12, 1)
)

#### Provide the following Interpretations

##### Explain why you consider this plot relevant to address your question or to explore the data.
- This plot is relevant because it shows how the response (`real_sum`) varies across two key categorical predictors (`room_type` and `num_host_offers`), and how that relationship shifts by `day_type`.
- Using boxplots on a log10 y-scale makes the skewed price distribution easier to compare and helps spot group-level differences and heteroscedasticity that are important for later inference.
##### Interpret briefly the results obtained.
- `entire` Airbnb offers show substantially higher medians and much wider variability than `private` or `shared` offerings, and the `entire` category contains extreme high-price outliers that extend far above the rest of the distribution.
- `shared` appears very narrow and has few observations.
- Lastly, `day_type` is **associated with** only small within-group shifts in medians, suggesting that weekday/weekend differences exist but are not large in these raw comparisons.
##### Describe what you learn from the visualization and the next steps.
- The visualization suggests `room_type` and `num_host_offers` are promising predictors to include (and possibly interact) in a regression for inference, and it justifies the log-transform of price.
- It also reveals various problems to address in modelling, such as strong right skew and outliers, unequal group sizes, and confounding from variables that we haven't explored yet.
- The next steps are to check sample sizes per cell, examine covariate balance, and build a model using relevant features.
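
As a rough illustration of the "sample sizes per cell" check mentioned above, here is a minimal sketch using `dplyr::count()`. The `toy` tibble is a hypothetical stand-in for the cleaned `athens` data, with made-up rows.

```r
library(dplyr)

# Hypothetical toy data standing in for the cleaned `athens` tibble
toy <- tibble(
  room_type = factor(c("entire", "entire", "private", "shared", "entire", "private")),
  day_type  = factor(c("weekday", "weekend", "weekday", "weekday", "weekday", "weekend"))
)

# Count observations in every room_type x day_type cell;
# .drop = FALSE keeps empty factor combinations as n = 0
cell_counts <- toy |>
  count(room_type, day_type, .drop = FALSE)

cell_counts
```

Cells with very few (or zero) observations, which is likely for `shared`, would make interaction terms involving that level hard to estimate.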

## 4. Method and Plan 

### Proposed method (one-line)
- **Multiple linear regression (MLR)** on `log(real_sum)` using `room_type`, `person_capacity`, `host_is_superhost`, `cleanliness_rating`, `guest_satisfaction_overall`, `bedrooms`, `dist`, `metro_dist`, `attr_index_norm`, `rest_index_norm` and `num_host_offers`.

### Why is this method appropriate?
- MLR directly estimates associations between multiple predictors and the (transformed) price while producing interpretable effect estimates and hypothesis tests.  
- Log-transforming `real_sum` stabilizes variance and turns multiplicative effects into additive ones, so coefficients approximate percent changes (this is convenient for skewed price data).
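
To make the percent-change reading concrete, the sketch below uses simulated data (not the Athens data) in which a hypothetical binary `feature` multiplies price by 1.25. For a coefficient $\beta$ on the log scale, the implied percent change is $100\,(e^{\beta} - 1)$, which is close to $100\,\beta$ only when $\beta$ is small.

```r
set.seed(301)

# Simulated data (not the Athens data): a binary feature multiplies price by 1.25
n <- 2000
feature <- rbinom(n, size = 1, prob = 0.5)
price <- 100 * 1.25^feature * exp(rnorm(n, mean = 0, sd = 0.1))

fit <- lm(log(price) ~ feature)

beta <- coef(fit)[["feature"]]
pct_change <- 100 * (exp(beta) - 1) # percent change implied by beta
approx_pct <- 100 * beta            # small-beta approximation

c(beta = beta, pct_change = pct_change, approx_pct = approx_pct)
```

Here `pct_change` should land near 25, while `approx_pct` is noticeably lower, so the exact form is the safer one to report for larger coefficients.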

### Which assumptions are required?
- Linearity: the predictors relate linearly to `log(real_sum)`, and the errors are independent with mean zero. 
- We are assuming homoskedasticity and normality of residuals for valid standard inference, as well as low multicollinearity among predictors, for stable coefficient estimates.
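
A minimal sketch of how these assumptions could be checked after fitting, using base R only. The data are simulated (the names `x1`, `x2`, and `y` are hypothetical), so this shows the mechanics rather than results for the Athens model.

```r
set.seed(301)

# Simulated predictors and response (not the Athens data)
n <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n) # mildly correlated with x1
y <- 1 + 0.3 * x1 - 0.2 * x2 + rnorm(n, sd = 0.5)
fit <- lm(y ~ x1 + x2)

# Residuals vs fitted: a funnel shape would suggest heteroskedasticity,
# a curved pattern would suggest non-linearity
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals
qqnorm(resid(fit))
qqline(resid(fit))

# Variance inflation factor for x1, from its definition 1 / (1 - R^2),
# where R^2 comes from regressing x1 on the other predictor(s)
r2_x1 <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1
```

A VIF close to 1 indicates little collinearity; values that are much larger would flag predictors whose coefficients are unstable.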

### Potential limitations or weaknesses
- While the log-transform helps, outliers and heavy right skew can still influence estimates.
- Heteroskedasticity or non-normal errors would invalidate the usual standard errors, and with them the resulting tests and confidence intervals.
- Omitted confounders (e.g. spatial clustering) can make some coefficients unstable or biased.

## 5. Computational Code and Output

In [None]:
mlr_fit <- lm(
  formula = log(real_sum) ~
    room_type +
    person_capacity +
    host_is_superhost +
    cleanliness_rating +
    guest_satisfaction_overall +
    bedrooms +
    dist +
    metro_dist +
    attr_index_norm +
    rest_index_norm +
    num_host_offers,
  data = athens
)
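
The cell above fits the model but does not display any output. One way to inspect a fit like this is with `broom`, which is already loaded in Section 1. The sketch below runs on a simulated stand-in (`toy`, with hypothetical values and only two predictors) so that it is self-contained; the same `tidy()` and `glance()` calls could be applied to `mlr_fit`.

```r
library(broom)

set.seed(301)

# Simulated stand-in for `athens` (hypothetical values, two predictors only)
n <- 300
toy <- data.frame(
  room_type = factor(sample(c("entire", "private"), n, replace = TRUE)),
  bedrooms  = sample(0:3, n, replace = TRUE)
)
toy$real_sum <- exp(
  4.5 + 0.4 * (toy$room_type == "entire") + 0.2 * toy$bedrooms + rnorm(n, sd = 0.3)
)

toy_fit <- lm(log(real_sum) ~ room_type + bedrooms, data = toy)

# Coefficient table with 95% confidence intervals
toy_coefs <- tidy(toy_fit, conf.int = TRUE)
toy_coefs

# Model-level summaries such as R^2 and the residual standard error
glance(toy_fit)
```

Because the response is on the log scale, the estimates and interval endpoints can be passed through `exp()` to read them as multiplicative effects on price.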