import polars as pl
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

# Set default plotly template for better aesthetics
pio.templates.default = "plotly_white"

# %% [markdown]
# ## 2. Load and Prepare Data
# Here, we load the training data, rename the target column for clarity, drop the first 2,000 rows, and cast every column except `date_id` to a numeric type (`Float64`).

# %%
# Load the dataset using Polars
full_train_df = pl.read_csv("./kaggle/train.csv")

# Rename the target column
full_train_df = full_train_df.rename({'market_forward_excess_returns': 'target'})

# Drop the first 2,000 rows (keep everything from row index 2000 onward)
full_train_df = full_train_df.slice(2000)

# Explicitly cast all columns except date_id to Float64.
# strict=False turns values that cannot be converted into nulls instead of raising an error.
feature_cols = [col for col in full_train_df.columns if col not in ['date_id', 'target']]

full_train_df = full_train_df.with_columns(
    pl.col(feature_cols + ['target']).cast(pl.Float64, strict=False)
)

print("Data loaded and prepared successfully.")
print(f"Shape of the dataframe: {full_train_df.shape}")
full_train_df.head()

# %% [markdown]
# ## 3. Null Rate Analysis
# A crucial step in any data analysis is understanding missing data. We'll calculate the percentage of null values for every column and visualize the result. This helps identify features that might require imputation or be unsuitable for modeling.

# %%
# Calculate null counts and rates
null_counts = full_train_df.null_count()
total_rows = len(full_train_df)

# Create a DataFrame for null rates
null_info_df = pl.DataFrame({
    'column': null_counts.columns,
    'null_count': null_counts.row(0),
}).with_columns(
    (pl.col('null_count') / total_rows * 100).alias('null_rate_pct')
).sort('null_rate_pct', descending=True)

# Filter out columns with zero nulls for a cleaner plot
null_info_to_plot = null_info_df.filter(pl.col('null_rate_pct') > 0)

# Display the null rate table
print("Null Value Rates per Column:")
print(null_info_df)
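
# %% [markdown]
# As a quick sanity check on the null-rate formula used above, the sketch below recomputes two rates by hand in plain Python. The helper name `null_rate_percent` is illustrative and not part of the original analysis; the counts are the `E7` and `M1` rows and the total row count from this notebook's printed output.

# %%
def null_rate_percent(null_count: int, n_rows: int) -> float:
    """Return the share of missing values as a percentage of n_rows."""
    return null_count / n_rows * 100


# Counts taken from the printed null-rate table (6990 rows in total)
e7_rate = null_rate_percent(4969, 6990)
m1_rate = null_rate_percent(3547, 6990)
print(f"E7: {e7_rate:.6f}%  M1: {m1_rate:.6f}%")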

# %%
# Visualize the null rates
fig = px.bar(
    null_info_to_plot,
    x='column',
    y='null_rate_pct',
    title='Percentage of Missing Values per Feature (Features with >0% Missing)',
    labels={'column': 'Feature', 'null_rate_pct': 'Missing Value Rate (%)'},
    height=600
)
fig.update_layout(xaxis_tickangle=-90)
fig.show()


# %% [markdown]
# ## 4. Target Variable Visualization
# Understanding the target variable is the most important part of EDA for a predictive model. We'll look at its distribution and how it behaves over time.

# %%
# Plot the distribution of the target variable
fig_dist = px.histogram(
    full_train_df,
    x='target',
    nbins=100,
    title='Distribution of Target (Daily Forward Excess Returns)',
    labels={'target': 'Excess Return'}
)
fig_dist.show()

# Plot the target variable over time (using date_id as the time axis)
fig_ts = px.line(
    full_train_df,
    x='date_id',
    y='target',
    title='Target (Daily Forward Excess Returns) Over Time',
    labels={'date_id': 'Date ID', 'target': 'Excess Return'}
)
fig_ts.show()


# %% [markdown]
# ## 5. Feature Correlation Analysis
# To get a first look at which features might be predictive, we can calculate the Pearson correlation between each feature and the target. A higher absolute correlation suggests a stronger linear relationship.

# %%
# Calculate correlation of each feature with the target
correlations = []
for col in feature_cols:
    # Skip forward_returns: judging by its name it is a forward-looking return, not a usable feature
    if col == 'forward_returns':
        continue
    # .item() extracts the single value from the one-cell result DataFrame
    corr_value = full_train_df.select(pl.corr('target', col)).item()
    correlations.append((col, corr_value))

# Create a DataFrame of correlations
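# orient='row' states explicitly that each tuple is one (feature, correlation) row
corr_df = pl.DataFrame(
    correlations,
    schema=['feature', 'correlation'],
    orient='row',
).sort('correlation', descending=True, nulls_last=True)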

# Separate top positive and top negative correlations for visualization
top_n = 20
top_positive_corr = corr_df.head(top_n)
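# Nulls are sorted last, so drop them before taking the tail to avoid picking them up
top_negative_corr = corr_df.drop_nulls('correlation').tail(top_n).sort('correlation', descending=False)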

# Combine for plotting
top_corr_to_plot = pl.concat([top_positive_corr, top_negative_corr])

print(f"Top {top_n} Most Positively Correlated Features with Target:")
print(top_positive_corr)
print(f"\nTop {top_n} Most Negatively Correlated Features with Target:")
print(top_negative_corr)  # already sorted ascending above
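
# %% [markdown]
# For reference, the sketch below spells out the Pearson coefficient in plain Python on made-up toy data. The helper `pearson_r` and the `toy_x` values are illustrative only. The sketch assumes complete data with no missing values and makes no claim about how `pl.corr` is implemented internally.

# %%
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences without missing values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations (the 1/n factors cancel)
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)


toy_x = [1.0, 2.0, 3.0, 4.0]
print(pearson_r(toy_x, [2.0, 4.0, 6.0, 8.0]))  # perfectly linear, increasing
print(pearson_r(toy_x, [8.0, 6.0, 4.0, 2.0]))  # perfectly linear, decreasing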


# %%
# Visualize the top correlations
fig_corr = px.bar(
    top_corr_to_plot,
    x='correlation',
    y='feature',
    orientation='h',
    title=f'Top {top_n} Positive and Top {top_n} Negative Feature Correlations with Target',
    labels={'feature': 'Feature', 'correlation': 'Pearson Correlation'},
    color='correlation',
    color_continuous_scale='RdBu_r',
    height=800
)
fig_corr.update_layout(yaxis={'categoryorder':'total ascending'})
fig_corr.show()

Data loaded and prepared successfully.
Shape of the dataframe: (6990, 98)
Null Value Rates per Column:
shape: (98, 3)
┌─────────────────┬────────────┬───────────────┐
│ column          ┆ null_count ┆ null_rate_pct │
│ ---             ┆ ---        ┆ ---           │
│ str             ┆ i64        ┆ f64           │
╞═════════════════╪════════════╪═══════════════╡
│ E7              ┆ 4969       ┆ 71.087268     │
│ V10             ┆ 4049       ┆ 57.925608     │
│ S3              ┆ 3733       ┆ 53.404864     │
│ M1              ┆ 3547       ┆ 50.74392      │
│ M13             ┆ 3540       ┆ 50.643777     │
│ …               ┆ …          ┆ …             │
│ V7              ┆ 0          ┆ 0.0           │
│ V8              ┆ 0          ┆ 0.0           │
│ forward_returns ┆ 0          ┆ 0.0           │
│ risk_free_rate  ┆ 0          ┆ 0.0           │
│ target          ┆ 0          ┆ 0.0           │
└─────────────────┴────────────┴───────────────┘


Top 20 Most Positively Correlated Features with Target:
shape: (20, 2)
┌─────────┬─────────────┐
│ feature ┆ correlation │
│ ---     ┆ ---         │
│ str     ┆ f64         │
╞═════════╪═════════════╡
│ V13     ┆ 0.059044    │
│ M1      ┆ 0.046339    │
│ S5      ┆ 0.040966    │
│ D1      ┆ 0.036755    │
│ D2      ┆ 0.036755    │
│ …       ┆ …           │
│ V9      ┆ 0.016725    │
│ M3      ┆ 0.016725    │
│ D7      ┆ 0.016519    │
│ E9      ┆ 0.015674    │
│ V6      ┆ 0.015013    │
└─────────┴─────────────┘

Top 20 Most Negatively Correlated Features with Target:
shape: (20, 2)
┌────────────────┬─────────────┐
│ feature        ┆ correlation │
│ ---            ┆ ---         │
│ str            ┆ f64         │
╞════════════════╪═════════════╡
│ M4             ┆ -0.063344   │
│ S2             ┆ -0.034      │
│ P8             ┆ -0.033159   │
│ E7             ┆ -0.032476   │
│ E11            ┆ -0.032356   │
│ …              ┆ …           │
│ P11            ┆ -0.015763   │
│ risk_free_rate ┆ 



