# Exploratory Data Analysis

**Objective:** To explore the member and match datasets, understand the key drivers of match attendance, and generate hypotheses for the modeling phase.

**Methodology:**
1.  **Setup & Initial Inspection**: Load libraries and both datasets, perform high-level checks.
2.  **Data Merging & Cleaning**: Combine datasets and handle any inconsistencies.
3.  **Univariate Analysis**: Analyze the distribution of individual features.
4.  **Bivariate Analysis**: Investigate relationships between the target variable (attendance) and key predictors.
5.  **Multivariate & Segment Analysis**: Explore temporal patterns, correlations, and differences across key segments.
6.  **Summary & Next Steps**: Consolidate findings and outline next steps for feature engineering and modeling.

---

## 1. Setup & Initial Inspection

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 7)


In [None]:
# Load the datasets
members_df = pd.read_csv('../data/members_data.csv')
matches_df = pd.read_csv('../data/match_data_sync.csv')

# --- Members Data Inspection ---
print("--- Members Dataset ---")
print("Shape:", members_df.shape)
print("\nFirst 5 Rows:")
display(members_df.head())
print("\nData Types and Non-Null Counts:")
members_df.info()
print("\nSummary Statistics:")
display(members_df.describe())

# --- Matches Data Inspection ---
print("\n--- Matches Dataset ---")
print("Shape:", matches_df.shape)
print("\nFirst 5 Rows:")
display(matches_df.head())
print("\nData Types and Non-Null Counts:")
matches_df.info()
print("\nSummary Statistics:")
display(matches_df.describe(include='all'))

**Initial Observations:**
* **Members Data:** Contains member demographics like age, gender, postal code, and member category. `member_id` seems to be the primary key.
* **Matches Data:** Contains match-specific information like opponent, competition, date, and attendance. Together, `match_id` and `member_id` likely identify each row.
* **Data Quality:** Check for missing values (e.g., `postal_code` in members) and data types (e.g., `match_date` should be a datetime object).
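The data-quality checks mentioned above could be collected in a small helper. This is a minimal sketch: `summarize_quality` and the toy frame are hypothetical names and made-up values, standing in for `members_df` or `matches_df`.

```python
import pandas as pd

def summarize_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing share (in %) per column."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": frame.isna().mean().mul(100).round(1),
    })
    # Columns with the most missing values come first
    return summary.sort_values("n_missing", ascending=False)

# Toy stand-in for members_df; values are made up for illustration only.
toy_members = pd.DataFrame({
    "member_id": [1, 2, 3, 4],
    "age": [34, 51, 27, 45],
    "postal_code": ["1011", None, "3511", None],
})

quality = summarize_quality(toy_members)
print(quality)
```

In the notebook, the same helper would be called as `summarize_quality(members_df)` and `summarize_quality(matches_df)`.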

---

## 2. Data Merging & Cleaning

In [None]:
# Convert match_date to datetime objects
matches_df['match_date'] = pd.to_datetime(matches_df['match_date'])

# Merge the two dataframes on member_id
df = pd.merge(matches_df, members_df, on='member_id', how='left')

print("Merged Dataset Shape:", df.shape)
print("\nData Types and Non-Null Counts after Merge:")
df.info()
display(df.head())

**Note:** A left merge keeps every match attendance record, even when the corresponding member is missing from the members dataset (which should not happen with clean data). After the merge, check the member columns for any nulls it introduced.
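One way to run that check is to let pandas flag unmatched rows during the merge itself. The sketch below uses toy frames with made-up values in place of `matches_df` and `members_df`. It assumes `member_id` is unique in the members data, which `validate="many_to_one"` enforces.

```python
import pandas as pd

# Toy stand-ins for matches_df and members_df (made-up values).
toy_matches = pd.DataFrame({
    "match_id": [10, 10, 11, 11],
    "member_id": [1, 2, 1, 99],   # member 99 has no member record
    "attended": [1, 0, 1, 0],
})
toy_members = pd.DataFrame({"member_id": [1, 2], "age": [34, 51]})

merged = pd.merge(
    toy_matches,
    toy_members,
    on="member_id",
    how="left",
    validate="many_to_one",  # raises MergeError if member_id repeats in members
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows whose member_id was not found in the members data
unmatched = merged[merged["_merge"] == "left_only"]
print(f"Unmatched match rows: {len(unmatched)} of {len(merged)}")
print("Unmatched member_ids:", unmatched["member_id"].unique())
```

Dropping the `_merge` column afterwards keeps the merged frame identical to the one produced by the plain merge.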

---

## 3. Univariate Analysis

### Target Variable: `attended`

In [None]:
sns.countplot(data=df, x='attended')
plt.title('Distribution of Target Variable: Match Attendance')
plt.xlabel('Attended (1 = Yes, 0 = No)')
plt.ylabel('Frequency')
plt.show()

print(df['attended'].value_counts(normalize=True))

**Observation:** Check how balanced the classes are. A highly imbalanced target may call for resampling (e.g., over-sampling with SMOTE, applied to the training split only) or for evaluation metrics that are robust to imbalance (e.g., F1-score, AUC-PR) rather than plain accuracy.
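To put a number on the imbalance, the class shares can be turned into an imbalance ratio and a set of class weights. This is a sketch on a made-up target column. The weight formula `n_samples / (n_classes * class_count)` is the common "balanced" heuristic and is one option among several, not a recommendation derived from the real data.

```python
import pandas as pd

# Toy target column (made-up values): 7 non-attendances, 3 attendances.
attended = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="attended")

counts = attended.value_counts().sort_index()
shares = counts / counts.sum()

# Ratio of the majority class to the minority class
imbalance_ratio = counts.max() / counts.min()

# "Balanced" class weights: n_samples / (n_classes * class_count)
class_weights = (counts.sum() / (len(counts) * counts)).to_dict()

print(shares)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
print("Class weights:", class_weights)
```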

### Key Categorical & Numerical Features

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(18, 12))
fig.suptitle('Distribution of Key Features', fontsize=16)

# Member Gender
sns.countplot(ax=axes[0, 0], data=df, y='gender', order=df['gender'].value_counts().index)
axes[0, 0].set_title('Frequency of Member Gender')

# Competition
sns.countplot(ax=axes[0, 1], data=df, y='competition', order=df['competition'].value_counts().index)
axes[0, 1].set_title('Frequency of Competition')

# Member Age
sns.histplot(ax=axes[1, 0], data=df, x='age', bins=30, kde=True)
axes[1, 0].set_title('Distribution of Member Age')

# Ticket Price
sns.histplot(ax=axes[1, 1], data=df, x='ticket_price', bins=30, kde=True)
axes[1, 1].set_title('Distribution of Ticket Price')

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

---

## 4. Bivariate Analysis

### Attendance vs. Key Numeric Predictors

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18, 7))
fig.suptitle('Attendance by Key Numeric Predictors', fontsize=16)

sns.boxplot(ax=axes[0], data=df, x='attended', y='age')
axes[0].set_title('Attendance by Member Age')
axes[0].set_xlabel('Attended (1 = Yes, 0 = No)')

sns.boxplot(ax=axes[1], data=df, x='attended', y='ticket_price')
axes[1].set_title('Attendance by Ticket Price')
axes[1].set_xlabel('Attended (1 = Yes, 0 = No)')

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

**Observation:** Look for differences in the distribution of numeric features for attendees vs. non-attendees. For example, do younger members attend more? Is there a price sensitivity?
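The box plots can be backed by a short numeric summary. The sketch below compares medians by attendance and computes the attendance rate per age band. The data and the band edges (30 and 50) are arbitrary, illustrative choices.

```python
import pandas as pd

# Toy stand-in for the merged df (made-up values).
toy = pd.DataFrame({
    "age": [19, 23, 31, 38, 44, 52, 58, 63, 70, 27],
    "ticket_price": [20, 25, 40, 30, 45, 50, 20, 55, 60, 25],
    "attended": [1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
})

# Median of each numeric predictor for non-attendees (0) and attendees (1)
medians = toy.groupby("attended")[["age", "ticket_price"]].median()

# Attendance rate per age band; bins are right-inclusive: (0, 30], (30, 50], (50, 120]
toy["age_band"] = pd.cut(toy["age"], bins=[0, 30, 50, 120], labels=["<=30", "31-50", "51+"])
rate_by_band = toy.groupby("age_band", observed=True)["attended"].agg(["mean", "count"])

print(medians)
print(rate_by_band)
```

Reporting the count next to each rate helps avoid over-reading bands that contain only a few records.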

### Attendance vs. Key Categorical Predictors

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18, 7))
fig.suptitle('Attendance Rate by Key Categories', fontsize=16)

# errorbar=None replaces the deprecated ci=None (requires seaborn >= 0.12)
sns.barplot(ax=axes[0], data=df, x='member_category', y='attended', errorbar=None)
axes[0].set_title('Attendance Rate by Member Category')
axes[0].tick_params(axis='x', rotation=45)

sns.barplot(ax=axes[1], data=df, x='competition', y='attended', errorbar=None)
axes[1].set_title('Attendance Rate by Competition')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

**Hypothesis Check:** Do different categories show different attendance rates? Clear differences suggest the feature carries predictive signal, provided each category has enough records for its rate to be reliable.
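A Pearson chi-square statistic is one way to quantify whether attendance differs across categories. The sketch below computes it by hand with NumPy on made-up data. If SciPy is available, `scipy.stats.chi2_contingency` returns the same statistic for a table of this shape, together with a p-value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged df (made-up values).
toy = pd.DataFrame({
    "member_category": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "attended": [1, 1, 1, 0,  1, 0, 0, 0,  1, 1, 0, 0],
})

# Observed counts: rows are categories, columns are attended = 0 / 1
observed = pd.crosstab(toy["member_category"], toy["attended"])
obs = observed.to_numpy()

# Expected counts under independence: row_total * col_total / n
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()

chi2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)

print(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}")
```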

---

## 5. Multivariate & Segment Analysis

### Correlation Analysis

In [None]:
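# Select numeric columns, excluding identifier columns whose correlations are not meaningful
numeric_df = df.select_dtypes(include=np.number).drop(columns=['member_id', 'match_id'], errors='ignore')

plt.figure(figsize=(14, 10))
# Fix the colour scale to [-1, 1] so positive and negative correlations are comparable
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1)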
plt.title('Correlation Matrix of Numeric Features')
plt.show()

**Key Insights from Correlation:**
* **Target Correlation:** Identify features most correlated with attendance.
* **Multicollinearity:** Look for high correlations between predictor variables (e.g., `ticket_price` and `seat_category`). This might be an issue for linear models but less so for tree-based models.
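Reading strongly correlated predictor pairs off a large heatmap is error-prone, so they can also be listed programmatically. In this sketch, `high_corr_pairs`, the 0.8 threshold, and the toy columns (including `seat_rank`, a hypothetical numeric encoding of `seat_category`) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, target: str, threshold: float = 0.8) -> pd.Series:
    """List predictor pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.drop(columns=[target]).corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (masked cells) never pass the comparison, so they are dropped here
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy numeric frame (made-up values).
toy = pd.DataFrame({
    "ticket_price": [10, 22, 29, 41, 50, 62],
    "seat_rank": [1, 2, 3, 4, 5, 6],
    "tenure_years": [3, 1, 4, 1, 5, 2],
    "attended": [1, 0, 1, 0, 0, 1],
})

result = high_corr_pairs(toy, target="attended")
print(result)
```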

### Time-Series Analysis

In [None]:
# Plot average attendance over time
time_analysis = df.groupby(df['match_date'].dt.to_period('M'))['attended'].mean().reset_index()
time_analysis['match_date'] = time_analysis['match_date'].dt.to_timestamp()

plt.figure(figsize=(14, 7))
sns.lineplot(data=time_analysis, x='match_date', y='attended')
plt.title('Average Attendance Over Time (Monthly)')
plt.xlabel('Match Date')
plt.ylabel('Average Attendance Rate')
plt.show()

**Observation:** Identify trends, seasonality, or cyclical patterns. For example, is attendance higher at the beginning or end of the season? Clear patterns would support including time-based features in the forecasting models.
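A monthly time series mixes trend and seasonality. Pooling all seasons by calendar month is a simple way to isolate the seasonal part. The sketch below does this on made-up dates and reports the number of records behind each rate.

```python
import pandas as pd

# Toy stand-in for the merged df (made-up values, two seasons).
toy = pd.DataFrame({
    "match_date": pd.to_datetime([
        "2022-08-13", "2022-08-27", "2022-12-10",
        "2022-12-17", "2023-08-12", "2023-12-09",
    ]),
    "attended": [1, 1, 0, 1, 1, 0],
})

# Group by calendar month (1-12), pooling the same month across years
by_month = toy.groupby(toy["match_date"].dt.month)["attended"].agg(rate="mean", n="count")
by_month.index.name = "month"

print(by_month)
```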

---

## 6. Summary & Next Steps

### Summary of Key Findings
1.  **Finding 1:** [e.g., The target variable `attended` is slightly imbalanced, with more non-attendees than attendees.]
2.  **Finding 2:** [e.g., Predictors like `competition` and `member_category` show a significant relationship with attendance.]
3.  **Finding 3:** [e.g., There is a clear seasonal pattern, with attendance peaking mid-season.]
4.  **Finding 4:** [e.g., `ticket_price` and `seat_category` are highly correlated, as expected.]

### Proposed Next Steps
* **Data Cleaning**:
    * [e.g., Impute or remove rows with missing `postal_code` if it's deemed an important feature.]
* **Feature Engineering**:
    * [e.g., Create features from `match_date`, such as `day_of_week`, `month`, `is_weekend`.]
    * [e.g., Create interaction features, such as `age` combined with `member_category`.]
    * [e.g., Encode categorical variables (like `opponent`, `competition`) for modeling.]
* **Modeling Strategy**:
    * [e.g., The findings support using a tree-based model like XGBoost or LightGBM, which can handle non-linearities and feature interactions well.]
    * [e.g., A logistic regression model could serve as a good, interpretable baseline.]
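
The date-derived features and categorical encoding proposed above could be sketched as follows. `add_date_features` is a hypothetical helper and the toy rows are made up. One-hot encoding via `pd.get_dummies` is one possible encoding, and other schemes may suit high-cardinality columns such as `opponent` better.

```python
import pandas as pd

def add_date_features(frame: pd.DataFrame, date_col: str = "match_date") -> pd.DataFrame:
    """Return a copy of frame with day_of_week, month, and is_weekend columns added."""
    out = frame.copy()
    dates = pd.to_datetime(out[date_col])
    out["day_of_week"] = dates.dt.dayofweek              # Monday = 0 ... Sunday = 6
    out["month"] = dates.dt.month
    out["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)  # Saturday or Sunday
    return out

# Toy rows (made-up values): a Saturday and a Wednesday.
toy = pd.DataFrame({
    "match_date": ["2023-03-04", "2023-03-08"],
    "competition": ["League", "Cup"],
})

features = add_date_features(toy)
# One-hot encode the categorical column for modeling
features = pd.get_dummies(features, columns=["competition"], prefix="competition")

print(features)
```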