## Analyzing Steel Industry Data
---
### What I intend to have in this notebook
* Minimal analysis of the Steel Industry dataset from the UCI Machine Learning Repository
* scikit-learn training loop with wandb logging?
* Preprocessing for training
* Dataset class for Steel Industry

### After which, I should
* Create a script with torchmetrics or wandb



In [1]:
%load_ext watermark
%watermark  -v -p numpy,pandas,matplotlib,scikit-learn

Python implementation: CPython
Python version       : 3.9.5
IPython version      : 8.16.1

numpy       : 1.26.0
pandas      : 2.1.1
matplotlib  : 3.8.0
scikit-learn: 1.3.1



### Setup

In [48]:
import torch as t
import torch.nn.functional as F 
from torch.autograd import grad
from torch.utils.data import Dataset, DataLoader
from torch.utils.data import random_split

from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

import numpy as np
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import iplot
from plotly.subplots import make_subplots

from dataclasses import dataclass
from pathlib import Path
from collections import Counter
from util.util import compute_total_loss, compute_accuracy
import pickle

from rich import print

### Analysis

In [20]:
BASE_DIR = "/home/therealmolf/model-board/data"
FILENAME = "steel_industry.csv"

path = Path(BASE_DIR) / FILENAME

if path.is_absolute():
    print("Absolute!")
else:
    print("Not Absolute :(")

df = pd.read_csv(path)

In [None]:
df.head(10)

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35040 entries, 0 to 35039
Data columns (total 11 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   date                                  35040 non-null  object 
 1   Usage_kWh                             35040 non-null  float64
 2   Lagging_Current_Reactive.Power_kVarh  35040 non-null  float64
 3   Leading_Current_Reactive_Power_kVarh  35040 non-null  float64
 4   CO2(tCO2)                             35040 non-null  float64
 5   Lagging_Current_Power_Factor          35040 non-null  float64
 6   Leading_Current_Power_Factor          35040 non-null  float64
 7   NSM                                   35040 non-null  int64  
 8   WeekStatus                            35040 non-null  object 
 9   Day_of_week                           35040 non-null  object 
 10  Load_Type                             35040 non-null  object 
dtypes: float64(6), int64(1), object(4)

In [25]:
# Check for nulls
df.isnull().sum()

date                                    0
Usage_kWh                               0
Lagging_Current_Reactive.Power_kVarh    0
Leading_Current_Reactive_Power_kVarh    0
CO2(tCO2)                               0
Lagging_Current_Power_Factor            0
Leading_Current_Power_Factor            0
NSM                                     0
WeekStatus                              0
Day_of_week                             0
Load_Type                               0
dtype: int64

In [32]:
# Number of Seconds from Midnight
df['NSM'].tail()

35035    82800
35036    83700
35037    84600
35038    85500
35039        0
Name: NSM, dtype: int64
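
Since NSM counts seconds from midnight, it can be turned into time-of-day features. The snippet below is only a sketch on a toy Series that mirrors the tail shown above; the `hour` and `minute` names are illustrative and not part of the dataset.

```python
import pandas as pd

# Toy NSM values copied from the tail above (15-minute steps, in seconds).
nsm = pd.Series([82800, 83700, 84600, 85500, 0], name="NSM")

# Treat NSM as an offset from midnight and split it into components.
offset = pd.to_timedelta(nsm, unit="s")
hour = offset.dt.components["hours"]
minute = offset.dt.components["minutes"]

print(pd.DataFrame({"NSM": nsm, "hour": hour, "minute": minute}))
```

The last row wraps back to 0 because the final reading of a day is stamped at midnight.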

In [34]:
df["WeekStatus"].nunique()

2

In [36]:
df["Day_of_week"].nunique()

7

In [None]:
df["date"].head(20)

#### Graph of Usage and Date 

In [118]:
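def create_month_df(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create a per-month summary from the Steel Industry DataFrame.

    Note: this converts df['date'] to datetime and adds a
    'month_name' column to df in place.

    Arguments:
        df (pd.DataFrame): Steel Industry DataFrame

    Returns:
        pd.DataFrame: Mean Usage_kWh and CO2(tCO2) per month,
            indexed by month name
    """

    # Convert date column objects to datetime.
    # NOTE: format="mixed" infers the format of each value separately, and
    # ambiguous dates such as 01/02/2018 are read month-first unless
    # dayfirst=True is passed. If the file stores day-first dates, an
    # explicit format such as "%d/%m/%Y %H:%M" would be the safer choice.
    df['date'] = pd.to_datetime(df['date'], format="mixed")

    # Get month names from the datetime column
    df['month_name'] = df['date'].dt.month_name()

    # Group by month name and average each column per month
    # (the resulting index is sorted alphabetically, not chronologically)
    cols = ['month_name', 'Usage_kWh', 'CO2(tCO2)']
    month_df = df[cols].groupby(['month_name']).mean()

    return month_df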



In [117]:
month_df = create_month_df(df)

month_df

            Usage_kWh  CO2(tCO2)
month_name                      
April       25.923153   0.010878
August      28.021788   0.011912
December    23.312893   0.009550
February    29.330588   0.011741
January     33.876300   0.014758
July        27.497762   0.011697
June        25.909760   0.010903
March       27.107282   0.011475
May         28.636166   0.012218
November    30.867705   0.013087
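
The grouped index above comes out in alphabetical order, so a plot over it would not follow the calendar. One possible way to reorder it is sketched below on a toy frame with three rows copied from the table; `MONTHS`, `toy_month_df` and `ordered` are illustrative names, and the English month names are assumed to match what `dt.month_name()` returns by default.

```python
import pandas as pd

# Calendar order of the English month names.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Toy stand-in for month_df (three rows taken from the table above).
toy_month_df = pd.DataFrame(
    {"Usage_kWh": [25.923153, 29.330588, 33.8763]},
    index=pd.Index(["April", "February", "January"], name="month_name"),
)

# Keep only the months that are present, in calendar order.
month_order = [m for m in MONTHS if m in toy_month_df.index]
ordered = toy_month_df.loc[month_order]

print(ordered)
```

Filtering `MONTHS` against the index first avoids a `KeyError` when some months are missing from the grouped frame.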


In [119]:
def usage_co2_scatter(month_df: pd.DataFrame) -> go.Figure:
    """
    Create two graphs for usage and CO2 emission per month.

    Arguments:
        month_df (pd.DataFrame): A dataframe grouped by month
            with usage and CO2 columns

    Returns:
        go.Figure: A plotly Figure object with two subplots.
    """

    fig = make_subplots(
        rows=1,
        cols=2,
        subplot_titles=(
            'Average Usage (kWh) per Month in 2018',
            'Average CO2 per Month in 2018',
        ))

    # Set the template for the figure
    fig['layout']['template'] = 'ggplot2'

    usage_graph = go.Scatter(
        x=month_df.index,
        y=month_df['Usage_kWh'],
        fill='tozeroy',
        # fillcolor='rgba(255, 0, 0, 0.5)'
        )

    co_graph = go.Scatter(
        x=month_df.index,
        y=month_df['CO2(tCO2)'],
        fill='tozeroy',
    )

    # Add the Scatter plots to the subplot object
    # (add_trace replaces the deprecated append_trace)
    fig.add_trace(usage_graph, row=1, col=1)
    fig.add_trace(co_graph, row=1, col=2)

    # Adding titles to xaxis and yaxis of the subplots
    fig['layout']['xaxis'].update(title='Month')
    fig['layout']['xaxis2'].update(title="Month")
    fig['layout']['yaxis'].update(title='Usage (kWh)')
    fig['layout']['yaxis2'].update(title="CO2(tCO2)")

    # Changing the fonts of the tick labels and the title
    fig['layout']['title']['font'].update(
        family='Times New Roman Bold',
        size=30,
        color='black',
    )
    fig['layout']['font'].update(
        family='Times New Roman',
        size=10,
        color='gray'
    )

    # title_text updates only the title text and leaves the font
    # settings above in place
    fig.update_layout(title_text="Steel Industry")

    return fig

In [166]:
fig = usage_co2_scatter(month_df)

In [171]:
iplot(fig)

In [64]:
fig['layout']

Layout({
    'template': '...',
    'xaxis': {'anchor': 'y', 'domain': [0.0, 0.45], 'title': {'text': 'Usage Graph'}},
    'xaxis2': {'anchor': 'y2', 'domain': [0.55, 1.0]},
    'yaxis': {'anchor': 'x', 'domain': [0.0, 1.0]},
    'yaxis2': {'anchor': 'x2', 'domain': [0.0, 1.0]}
})

#### Checking for Correlation
- Feature importance is usually checked after training, as a basis for
    - feature selection (fewer features -> smaller models -> faster inference)
- It can also be estimated before training the final model (e.g. with RFE), which helps with
    - avoiding the curse of dimensionality
    - creating a heuristic baseline
    - evaluating assumptions in general
- There are multiple ways to check how features relate to each other and to the target. These include
    - df.corr(), which computes pairwise correlation between numeric columns
    - Permutation importance, from scikit-learn
    - SHAP
    - RFE (recursive feature elimination)
    - Random forest's built-in feature importance (or that of decision trees in general)
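
To make two of the options above concrete, here is a minimal sketch that compares a random forest's built-in importance with scikit-learn's permutation importance. It runs on synthetic data, not on the Steel Industry dataset: the features `f0`, `f1`, `f2` are made up, and only `f0` is constructed to carry signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): only f0 determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (X[:, 0] > 0).astype(int)
feature_names = ["f0", "f1", "f2"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Built-in (impurity-based) importance, computed from the training data.
builtin = model.feature_importances_

# Permutation importance: drop in held-out accuracy when a column is shuffled.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)

for name, b, p in zip(feature_names, builtin, result.importances_mean):
    print(f"{name}: built-in={b:.3f} permutation={p:.3f}")
```

Permutation importance is measured on held-out data, so it is less prone than the impurity-based score to rewarding features the model merely memorised.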

In [37]:
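# Compute pairwise correlation on the numeric columns only.
# Calling df.corr() on the full frame fails with
# "ValueError: could not convert string to float: '01/01/2018 00:15'"
# because of the object columns (date, WeekStatus, Day_of_week, Load_Type).
corr = df.corr(numeric_only=True)

# Create a figure object, labelling both axes with the column names
fig = go.Figure(data=[go.Heatmap(z=corr.values, x=corr.columns, y=corr.index)])

# Customize the layout
fig.update_layout(title='Correlation Matrix', xaxis_title='Feature 1', yaxis_title='Feature 2')

# Display the heatmap
iplot(fig)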

#### Check Class Distribution

In [172]:
class_dist = df['Load_Type'].value_counts()
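
If the classes turn out to be imbalanced, one option is to look at relative shares and to stratify the train/test split so that both parts keep the same class mix. The sketch below uses a toy frame with made-up counts (60/30/10); these are not the real counts from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with made-up class counts (not the real Load_Type distribution).
toy = pd.DataFrame({
    "Usage_kWh": range(100),
    "Load_Type": ["Light_Load"] * 60 + ["Medium_Load"] * 30 + ["Maximum_Load"] * 10,
})

# Relative class shares instead of raw counts.
shares = toy["Load_Type"].value_counts(normalize=True)
print(shares)

# Stratify on the label so the test split keeps the same class proportions.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["Load_Type"], random_state=0
)
test_counts = test["Load_Type"].value_counts()
print(test_counts)
```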

In [176]:
class_dist.index

Index(['Light_Load', 'Medium_Load', 'Maximum_Load'], dtype='object', name='Load_Type')

In [187]:
fig = go.Figure()

fig['layout']['template'] = 'ggplot2'
fig['layout']['font'].update(
    family='Times New Roman',
    size=20  # font size must be a number, not a string
)
fig.update_layout(
    title="Class Distribution of Steel Industry",
    xaxis_title="Number of samples",
    yaxis_title="Load_Type",
)

# Horizontal bar chart: counts on the x-axis, class names on the y-axis
bar = go.Bar(
    x=class_dist,
    y=class_dist.index,
    orientation='h'
    )

fig.add_trace(bar)

iplot(fig)