# Ski Resort Price Analysis

This notebook explores ski resort pricing data along with weather information to identify patterns, trends, and correlations. We'll work through loading, cleaning, and analyzing CSV data to extract meaningful insights about how factors like weather and location impact ski resort prices.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization styles
plt.style.use("ggplot")
sns.set_palette("viridis")
sns.set_context("notebook")

## 1. Working with CSV Data

### 1.1 Reading and Exploring CSV Data

First, we'll load our datasets and perform some initial exploration to understand what we're working with.

In [None]:
# Load CSV files
df_prices = pd.read_csv("../data/01_ski-prices/prices.csv")
df_weather = pd.read_csv("../data/01_ski-prices/weather.csv")

print("Ski Resort Prices - First 5 rows:")
df_prices.head()

In [None]:
# Get basic information about the price dataset
print("Price dataset info:")
df_prices.info()

In [None]:
# Statistical summary of the price data
df_prices.describe()

In [None]:
# Check for missing values in price data
print("Missing values in price data:")
df_prices.isna().sum()

In [None]:
# Examine the weather data
print("Weather data - First 5 rows:")
df_weather.head()

In [None]:
# Get basic information about the weather dataset
print("Weather dataset info:")
df_weather.info()

In [None]:
# Check for missing values in weather data
print("Missing values in weather data:")
df_weather.isna().sum()

### 1.2 Data Distribution Analysis

Let's examine the distribution of our key variables before proceeding with the analysis.

In [None]:
# Distribution of price, temperature, and precipitation
fig, ax = plt.subplots(3, 1, figsize=(15, 15))

# Each histogram is drawn on its own axis via ax=...
sns.histplot(df_prices["price"], kde=True, ax=ax[0])
ax[0].set_title("Distribution of Ski Resort Prices")
ax[0].set_xlabel("Price (€)")

sns.histplot(df_weather["temperature"], kde=True, ax=ax[1])
ax[1].set_title("Temperature Distribution")
ax[1].set_xlabel("Temperature (°C)")

sns.histplot(df_weather["precipitation"], kde=True, ax=ax[2])
ax[2].set_title("Precipitation Distribution")
ax[2].set_xlabel("Precipitation (mm)")

plt.tight_layout()
plt.show()

### 1.3 Handling Missing Values

Now we'll handle missing values in our datasets. For the general analysis we drop incomplete rows up front. For the correlation analysis we keep unmodified copies of both datasets and drop incomplete rows only after merging them.

In [None]:
# Create cleaned copies for general analysis (dropping missing values)
df_prices_cleaned = df_prices.dropna()
df_weather_cleaned = df_weather.dropna()

# Create copies for correlation analysis (keeping the original structure)
df_prices_for_correlation = df_prices.copy()
df_weather_for_correlation = df_weather.copy()

# Count records before and after cleaning
print(
    f"Price data: {len(df_prices)} rows before cleaning, {len(df_prices_cleaned)} after cleaning"
)
print(
    f"Weather data: {len(df_weather)} rows before cleaning, {len(df_weather_cleaned)} after cleaning"
)
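
To make the effect of `dropna()` concrete, here is a minimal sketch on a made-up three-row frame. The `toy_*` names and all values are hypothetical; only the column names follow the price dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names mirror the notebook, values are made up
toy_prices = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
        "region": ["A", "A", "A"],
        "price": [50.0, np.nan, 54.0],
    }
)

# Share of missing values per column, as a percentage
missing_pct = toy_prices.isna().mean() * 100

# dropna() removes every row that has at least one missing value
toy_dropped = toy_prices.dropna()

# subset= restricts the check to the columns an analysis actually needs
toy_dropped_price = toy_prices.dropna(subset=["price"])

print(missing_pct.round(1))
print(f"{len(toy_prices)} rows before cleaning, {len(toy_dropped)} after")
```

Checking the missing share first helps judge whether dropping rows is acceptable or whether too much data would be lost.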

### 1.4 Merging Datasets

Let's combine our price and weather data to analyze how weather conditions might affect pricing.

In [None]:
# Merge on date and region
df_merged = df_prices_cleaned.merge(
    df_weather_cleaned, on=["date", "region"], how="inner"
)

print(f"Merged dataset has {len(df_merged)} rows")
df_merged.head()
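
An inner merge silently discards rows that have no partner in the other dataset. One way to see how many rows are affected is an outer merge with `indicator=True`. The sketch below uses made-up frames (`toy_prices`, `toy_weather`), so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical toy frames: only the column names follow the notebook
toy_prices = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
        "region": ["A", "A", "B"],
        "price": [50.0, 52.0, 61.0],
    }
)
toy_weather = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-02", "2024-01-04"],
        "region": ["A", "A", "B"],
        "temperature": [-3.0, -1.5, 0.5],
    }
)

# An outer merge with indicator=True adds a "_merge" column that labels
# each row as "both", "left_only", or "right_only"
audit = toy_prices.merge(
    toy_weather, on=["date", "region"], how="outer", indicator=True
)
match_counts = audit["_merge"].value_counts()

# The inner merge keeps only the rows labelled "both"
toy_merged = toy_prices.merge(toy_weather, on=["date", "region"], how="inner")

print(match_counts)
print(f"Inner merge keeps {len(toy_merged)} of {len(toy_prices)} price rows")
```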

### 1.5 Filtering and Aggregation

Let's examine specific regions and calculate aggregated statistics.

In [None]:
# Convert date column to datetime format
df_merged["date"] = pd.to_datetime(df_merged["date"])

# List unique regions
print("Available regions:")
df_merged["region"].unique()

In [None]:
# Filter for a specific region
kitzbuehl_df = df_merged[df_merged["region"] == "Kitzbuehl"]
print(f"Kitzbuehl dataset has {len(kitzbuehl_df)} rows")
kitzbuehl_df.head()

In [None]:
# Calculate monthly average price per region
df_monthly = (
    df_merged.groupby([df_merged["date"].dt.to_period("M"), "region"])["price"]
    .mean()
    .reset_index()
)

# Convert period to string for better display
df_monthly["date"] = df_monthly["date"].astype(str)

print("Monthly average prices by region:")
df_monthly.head(10)

In [None]:
# Calculate price statistics by region
region_stats = (
    df_merged.groupby("region")["price"]
    .agg(["mean", "median", "min", "max", "std"])
    .round(2)
)
print("Price statistics by region:")
region_stats
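
As an alternative to passing a list of function names to `agg`, pandas also supports named aggregation, which gives each output column an explicit name. This is a small sketch on made-up data; `toy`, `mean_price`, `price_range`, and `n_rows` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data with two regions
toy = pd.DataFrame(
    {
        "region": ["A", "A", "B", "B"],
        "price": [50.0, 54.0, 60.0, 70.0],
    }
)

# Named aggregation: output_name=(input_column, function)
toy_stats = toy.groupby("region").agg(
    mean_price=("price", "mean"),
    price_range=("price", lambda s: s.max() - s.min()),
    n_rows=("price", "size"),
)

print(toy_stats)
```

Including a row count next to each statistic makes it easier to spot regions whose averages rest on very few observations.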

### 1.6 Visualizing Trends

Let's visualize price trends over time for different regions.

In [None]:
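# Sort by region and date so the rolling window follows time order
df_merged = df_merged.sort_values(["region", "date"]).reset_index(drop=True)

# Calculate a 7-row rolling average per region to smooth trends
# (this equals a 7-day average only if each region has one row per day)
df_merged["rolling_avg"] = df_merged.groupby("region")["price"].transform(
    lambda x: x.rolling(7, min_periods=1).mean()
)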

# Plot the smoothed trends
plt.figure(figsize=(12, 6))
for region in df_merged["region"].unique():
    subset = df_merged[df_merged["region"] == region]
    plt.plot(subset["date"], subset["rolling_avg"], label=f"{region} (7-day avg)")

plt.xlabel("Date")
plt.ylabel("Rolling Avg Price (€)")
plt.title("Smoothed Ski Resort Price Trends (7-day Rolling Average)")
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
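
A window of 7 rows and a window of 7 calendar days only coincide when every region has exactly one row per day. The sketch below contrasts the two on a made-up series with a gap; the frame `toy` and the column names `rolling_rows` and `rolling_days` are hypothetical.

```python
import pandas as pd

# Hypothetical toy series with a gap: no observations for 4-9 January
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-01-10", "2024-01-01", "2024-01-02", "2024-01-03"]
        ),
        "region": ["A", "A", "A", "A"],
        "price": [80.0, 50.0, 52.0, 54.0],
    }
)

# Sort by region and date so the window runs forward in time
toy = toy.sort_values(["region", "date"]).reset_index(drop=True)

# Row-based window: averages up to the last 7 rows, however far apart they are
toy["rolling_rows"] = toy.groupby("region")["price"].transform(
    lambda s: s.rolling(7, min_periods=1).mean()
)

# Time-based window: averages only rows from the last 7 calendar days.
# The result comes back grouped by region and ordered by date, which matches
# the sorted frame, so the values can be assigned positionally.
toy["rolling_days"] = (
    toy.set_index("date")
    .groupby("region")["price"]
    .rolling("7D")
    .mean()
    .to_numpy()
)

print(toy)
```

On the last row the two columns differ: the row-based average still includes the early January prices, while the time-based one only sees the observation from 10 January.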

### 1.7 Correlation Analysis

Let's examine the relationships between price and weather variables.

In [None]:
# Merge datasets on date and region to analyze correlations
df_correlation = pd.merge(
    df_weather_for_correlation,
    df_prices_for_correlation,
    on=["date", "region"],
    how="inner",  # Keep only rows that match in both datasets
)

# Remove rows with missing values for clean correlation analysis
df_correlation = df_correlation.dropna()

# Add a day of week column to analyze weekday patterns
df_correlation["day_of_week"] = pd.to_datetime(df_correlation["date"]).dt.dayofweek

regions = df_correlation["region"].unique()

# Create a grid of subplots: one row per region, one column per weather variable
fig, axes = plt.subplots(
    len(regions), 2, figsize=(15, 5 * len(regions)), squeeze=False
)
fig.suptitle("Weather Effects on Ski Prices by Region", fontsize=20, y=1.01)

# Temperature and precipitation vs. price: one row of plots per region
for i, region in enumerate(regions):
    region_data = df_correlation[df_correlation["region"] == region]

    # Temperature plot (left column)
    sns.scatterplot(
        data=region_data,
        x="temperature",
        y="price",
        alpha=0.7,
        color=f"C{i}",
        ax=axes[i, 0],
    )
    axes[i, 0].set_title(f"{region}")
    axes[i, 0].set_xlabel("Temperature (°C)")
    axes[i, 0].set_ylabel("Price (€)")

    # Add regression line
    sns.regplot(
        x="temperature",
        y="price",
        data=region_data,
        scatter=False,
        ax=axes[i, 0],
        color=f"C{i}",
        line_kws={"linestyle": "--"},
    )

    # Precipitation plot (right column)
    sns.scatterplot(
        data=region_data,
        x="precipitation",
        y="price",
        alpha=0.7,
        color=f"C{i}",
        ax=axes[i, 1],
    )
    axes[i, 1].set_title(f"{region}")
    axes[i, 1].set_xlabel("Precipitation (mm)")
    axes[i, 1].set_ylabel("Price (€)")

    # Add regression line
    sns.regplot(
        x="precipitation",
        y="price",
        data=region_data,
        scatter=False,
        ax=axes[i, 1],
        color=f"C{i}",
        line_kws={"linestyle": "--"},
    )

plt.tight_layout()
plt.show()
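
Scatter plots with regression lines show the direction of a relationship, but not its strength as a number. One way to quantify it is the Pearson correlation coefficient per region. This is a minimal sketch on made-up values, not results from the datasets; `toy`, `weather_cols`, and `corr_by_region` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data: values are made up so the result is easy to check
toy = pd.DataFrame(
    {
        "region": ["A", "A", "A", "B", "B", "B"],
        "temperature": [-4.0, -2.0, 0.0, -4.0, -2.0, 0.0],
        "precipitation": [1.0, 3.0, 2.0, 0.0, 5.0, 1.0],
        "price": [60.0, 55.0, 50.0, 40.0, 45.0, 50.0],
    }
)

weather_cols = ["temperature", "precipitation"]

# Pearson correlation of price with each weather variable, region by region
rows = {}
for region, group in toy.groupby("region"):
    rows[region] = group[weather_cols].corrwith(group["price"])

# One row per region, one column per weather variable
corr_by_region = pd.DataFrame(rows).T

print(corr_by_region.round(2))
```

Values near -1 or 1 indicate a strong linear relationship and values near 0 a weak one. A coefficient on its own says nothing about causation.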

In [None]:
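# Box plots showing the weekday effect, one panel per region
fig, axes = plt.subplots(1, len(regions), figsize=(15, 5), squeeze=False)
axes = axes.ravel()  # Flatten to 1-D so axes[i] works for any number of regions
fig.suptitle("Weekday Effect on Ski Prices by Region", fontsize=16)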

for i, region in enumerate(regions):
    sns.boxplot(
        data=df_correlation[df_correlation["region"] == region],
        x="day_of_week",
        y="price",
        order=list(range(7)),  # Fix positions 0-6 so tick labels match even if a weekday is missing
        ax=axes[i],
        color=f"C{i}",
    )
    axes[i].set_xlabel("Day of Week")
    axes[i].set_ylabel("Price (€)")
    axes[i].set_title(f"{region}")
    axes[i].set_xticks(range(7))
    axes[i].set_xticklabels(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

plt.tight_layout()
plt.show()

### 1.8 Saving Processed Data

Finally, let's save our processed dataset for future use.

In [None]:
# Save the processed dataset
df_merged.to_csv("01_ski-prices.csv", index=False)
print("Processed data saved")
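
CSV files store plain text, so column types such as datetimes are not preserved when the file is read back. The sketch below shows a round trip on a made-up frame, using an in-memory buffer instead of a real file; `toy`, `buffer`, and `reloaded` are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical toy frame with a datetime column, as in the merged dataset
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
        "region": ["A", "A"],
        "price": [50.0, 52.0],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the listed columns back to datetimes on load;
# without it, the date column would come back as plain strings
reloaded = pd.read_csv(buffer, parse_dates=["date"])

print(reloaded.dtypes)
```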

## Conclusion

In this analysis, we explored ski resort pricing data alongside weather conditions. We examined:

1. Price trends over time for different regions
2. Relationships between temperature, precipitation, and resort prices
3. Day-of-week patterns in pricing
4. Regional differences in price levels and spread