# This notebook implements a baseline linear regression model

## Import and prepare dataset
Load the dataset from the .pkl file we prepared in "DataPreparation.ipynb".

In [15]:
# import all needed libs
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error, r2_score

# read training data into a dataframe
train_set = pd.read_pickle('../train_val_test_data/train_set.pkl')

## Try the first version of the model

In [16]:
# import all needed libs (if needed pip install)
import statsmodels.formula.api as smf

mod = smf.ols('Umsatz ~ Temperatur + monthly_mean_temp_diff + C(Warengruppe)', data=train_set).fit()

print(mod.summary())

                            OLS Regression Results                            
Dep. Variable:                 Umsatz   R-squared:                       0.696
Model:                            OLS   Adj. R-squared:                  0.695
Method:                 Least Squares   F-statistic:                     2443.
Date:                Fri, 21 Jun 2024   Prob (F-statistic):               0.00
Time:                        19:45:16   Log-Likelihood:                -43608.
No. Observations:                7493   AIC:                         8.723e+04
Df Residuals:                    7485   BIC:                         8.729e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                 59

## Second version of the baseline model with more parameters

In [17]:
# import all needed libs (if needed pip install)
import statsmodels.formula.api as smf

mod = smf.ols('Umsatz ~ C(Wind_Kategorie) + rainday + monthly_mean_temp_diff + sunny + snowday \
              + C(DayOfWeek) +weekend \
              + KielerWoche + BinElec + SchholCode +  BinHoly \
              + BinSchhol \
              + Salesindex \
              + C(Warengruppe)', data=train_set).fit()

print(mod.summary())

                            OLS Regression Results                            
Dep. Variable:                 Umsatz   R-squared:                       0.728
Model:                            OLS   Adj. R-squared:                  0.727
Method:                 Least Squares   F-statistic:                     869.1
Date:                Fri, 21 Jun 2024   Prob (F-statistic):               0.00
Time:                        19:45:20   Log-Likelihood:                -43186.
No. Observations:                7493   AIC:                         8.642e+04
Df Residuals:                    7469   BIC:                         8.659e+04
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept   

Terms such as BinElec and C(Wind_Kategorie)[T.sturm] have high p-values and should be removed.
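
One way to read the high-p-value terms off a fitted model instead of scanning the summary by eye is to filter the `pvalues` attribute of the results object. The sketch below runs on synthetic stand-in data, not on `train_set`; the column names `signal`, `noise`, `group` and the 0.05 threshold are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the notebook's train_set).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "signal": rng.normal(size=n),               # truly related to y
    "noise": rng.normal(size=n),                # unrelated to y
    "group": rng.choice(["a", "b", "c"], size=n),
})
demo["y"] = 3.0 * demo["signal"] + 2.0 * (demo["group"] == "b") + rng.normal(size=n)

demo_mod = smf.ols("y ~ signal + noise + C(group)", data=demo).fit()

# Keep only the terms whose p-value exceeds the chosen threshold.
alpha = 0.05  # assumed threshold, adjust as needed
high_p = demo_mod.pvalues[demo_mod.pvalues > alpha].sort_values(ascending=False)
print(high_p)
```

For a categorical term such as `C(Wind_Kategorie)`, the p-values are reported per level, so a single weak level does not by itself show that the whole variable is uninformative.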

### Further improved baseline model

Remove some variables that appear to have little impact (BinElec, C(DayOfWeek)) and try an interaction between sunny and weekend.
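
In the formula syntax used by statsmodels, `a * b` is shorthand for `a + b + a:b`, where `a:b` is the elementwise product of the two columns. The sketch below shows what that product column looks like, assuming `sunny` and `weekend` are 0/1 indicators (the notebook does not state their encoding); the helper `add_interaction` is hypothetical.

```python
# Hypothetical 0/1 indicator values; the real columns live in train_set.
rows = [
    {"sunny": 0, "weekend": 0},
    {"sunny": 1, "weekend": 0},
    {"sunny": 0, "weekend": 1},
    {"sunny": 1, "weekend": 1},
]

def add_interaction(row, a, b):
    """Return a copy of row with the product column 'a:b' added."""
    out = dict(row)
    out[f"{a}:{b}"] = row[a] * row[b]
    return out

expanded = [add_interaction(r, "sunny", "weekend") for r in rows]
for r in expanded:
    print(r)
```

Under that assumption the interaction column is 1 only on sunny weekend days, so its coefficient measures the extra effect of both conditions occurring together. Because `sunny * weekend` already contains both main effects, listing `sunny` and `weekend` separately in the same formula should be redundant rather than harmful.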

In [18]:
# import all needed libs (if needed pip install)
import statsmodels.formula.api as smf

mod = smf.ols('Umsatz ~ C(Wind_Kategorie) + rainday + monthly_mean_temp_diff \
              + sunny + (sunny * weekend) + snowday \
              + weekend \
              + KielerWoche + SchholCode +  BinHoly \
              + BinSchhol \
              + Salesindex \
              + C(Warengruppe)', data=train_set).fit()

print(mod.summary())

                            OLS Regression Results                            
Dep. Variable:                 Umsatz   R-squared:                       0.728
Model:                            OLS   Adj. R-squared:                  0.727
Method:                 Least Squares   F-statistic:                     1110.
Date:                Fri, 21 Jun 2024   Prob (F-statistic):               0.00
Time:                        19:45:25   Log-Likelihood:                -43190.
No. Observations:                7493   AIC:                         8.642e+04
Df Residuals:                    7474   BIC:                         8.655e+04
Df Model:                          18                                         
Covariance Type:            nonrobust                                         
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept   
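
Since R-squared is 0.728 for both the second and the improved version, the information criteria are a more useful way to compare them. The sketch below recomputes AIC and BIC from the log-likelihood, Df Model and observation count printed in the three summaries, using the usual definitions with k = Df Model + 1 (the intercept). The printed log-likelihoods are rounded, so the results are approximate.

```python
import math

N_OBS = 7493  # "No. Observations" in all three summaries

# (log-likelihood, Df Model) as printed in the summaries above
reported = {
    "first version": (-43608.0, 7),
    "second version": (-43186.0, 23),
    "improved version": (-43190.0, 18),
}

def aic(log_lik, df_model):
    """AIC = 2k - 2*logL, with k = df_model + 1 for the intercept."""
    k = df_model + 1
    return 2 * k - 2 * log_lik

def bic(log_lik, df_model, n_obs):
    """BIC = k*ln(n) - 2*logL, with k = df_model + 1 for the intercept."""
    k = df_model + 1
    return k * math.log(n_obs) - 2 * log_lik

scores = {
    name: (aic(ll, df), bic(ll, df, N_OBS))
    for name, (ll, df) in reported.items()
}
for name, (a, b) in scores.items():
    print(f"{name}: AIC={a:.0f}  BIC={b:.0f}")
```

The recomputed values agree with the AIC and BIC lines in the summaries. AIC is nearly identical for the second and the improved version, while BIC is lower for the improved version by about 37 points, which suggests that dropping the five parameters cost very little fit.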

## Validate the model 

In [19]:
# read validation data into a dataframe
validation_set = pd.read_pickle('../train_val_test_data/validation_set.pkl')

# Remove rows with NaN values in 'Umsatz' from the validation_set
# Potential TODO: investigate why these 8 rows contain NaN values
validation_set = validation_set.dropna(subset=['Umsatz'])

# Make predictions on the validation data
validation_set['Umsatz_predictions'] = mod.predict(validation_set)

# Calculate evaluation metrics
mse = mean_squared_error(validation_set['Umsatz'], validation_set['Umsatz_predictions'])
r2 = r2_score(validation_set['Umsatz'], validation_set['Umsatz_predictions'])

print("Validation results")
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

Validation results
Mean Squared Error: 5004.9225953146415
R^2 Score: 0.7041951530223045
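
The MSE is in squared Umsatz units, which makes it hard to interpret directly. A small sketch of two follow-up numbers derived from the values printed above: the root mean squared error (in the same units as Umsatz) and the gap between training and validation R-squared. The training value 0.728 is taken from the summary of the improved model.

```python
import math

mse_val = 5004.9225953146415   # validation MSE printed above
r2_val = 0.7041951530223045    # validation R^2 printed above
r2_train = 0.728               # R-squared from the training summary

rmse_val = math.sqrt(mse_val)  # back in Umsatz units
r2_gap = r2_train - r2_val     # positive gap = worse on unseen data

print(f"Validation RMSE: {rmse_val:.2f}")
print(f"R^2 gap (train - validation): {r2_gap:.3f}")
```

The RMSE comes out at roughly 70.75, and the R-squared gap at about 0.024, so the model generalizes to the validation set with only a modest loss.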


## Make predictions based on model above 

In [10]:
# load test set
test_set = pd.read_pickle('../train_val_test_data/test_set.pkl')

# calculate predictions for later upload 
test_set['Umsatz'] = mod.predict(test_set)

test_set.head()

# reduce to id and Umsatz columns 
submission_set = test_set[['id','Umsatz']]

# Check that the row count is correct for the Kaggle upload
if submission_set.shape[0] == 1830:
    print("OK : DataFrame has exact 1830 Entries!")
else:
    print(f"ERROR Dataframe has wrong number of {submission_set.shape[0]} Entries!")

# store the submission data
submission_set.to_csv('../prediction_data/submission.csv', index=False)


OK : DataFrame has exact 1830 Entries!
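
The row count is only one of the things that can go wrong with an upload. As a minimal sketch, the hypothetical helper `check_submission` below also looks for duplicate ids and missing predictions, which could appear if rows of the test set have missing feature values. It works on CSV text with the standard library only; the two sample strings are made up for illustration and are not real submission data.

```python
import csv
import io
import math

def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (empty list = OK)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    problems = []

    if reader.fieldnames != ["id", "Umsatz"]:
        # Without the expected columns the remaining checks make no sense.
        return [f"unexpected columns: {reader.fieldnames}"]

    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    ids = [r["id"] for r in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")

    # pandas writes NaN as an empty field by default, so check for both forms.
    missing = [
        r["id"] for r in rows
        if r["Umsatz"] == "" or math.isnan(float(r["Umsatz"]))
    ]
    if missing:
        problems.append(f"missing predictions for ids: {missing}")

    return problems

# Made-up sample data for illustration only.
good_csv = "id,Umsatz\n1,10.5\n2,20.0\n"
bad_csv = "id,Umsatz\n1,10.5\n1,\n"

print(check_submission(good_csv, expected_rows=2))
print(check_submission(bad_csv, expected_rows=2))
```

To apply the same idea to the real file, the text of `../prediction_data/submission.csv` could be read and passed in with `expected_rows=1830`.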
