# Data Selection Algorithms

**Libraries**<br>
pandas:     data processing<br>
os:         operating-system functions for relative path referencing<br>
tbd ...<br>
tbd ...

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.feature_selection import SelectKBest, f_classif, f_regression, mutual_info_regression, r_regression
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

**Importing and inspecting the training data**

In [3]:
data_df = pd.read_csv('2019_Trainingsdaten_1h.csv')

# Check for missing values; isna() and isnull() are aliases, so one call suffices
# (adding both results would double every count)
missing_values = data_df.isna().sum()

print("Number of missing values in each column:")
print(missing_values)

Number of missing values in each column:
MESS_DATUM     0
RWS_DAU_10     0
RWS_10         0
DS_10          0
GS_10          0
SD_10          0
FF_10          0
DD_10          0
PP_10          0
TT_10          0
TM5_10         0
RF_10          0
load           0
Weekday        0
Weekend        0
Month          0
Hour_of_Day    0
dtype: int64
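
The check above can be tried on a small frame that does contain gaps. This is a minimal sketch on made-up data; `toy_df` and its columns are hypothetical and not part of the training set. It illustrates that `isna()` and `isnull()` produce the same mask, so a single call gives the per-column counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with known gaps (not the training data)
toy_df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 1.0],
})

# isnull() is an alias of isna(): both return the same boolean mask
same_mask = toy_df.isna().equals(toy_df.isnull())

# Per-column count of missing values
missing_counts = toy_df.isna().sum()
print(missing_counts)
```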


**Separating the independent and dependent variables for evaluation**

In [5]:
x = data_df.drop(['load', 'MESS_DATUM'], axis=1)  # Keep x as a DataFrame
y = data_df['load'].values

feature_names = x.columns.tolist()
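
Once features and target are separated, a feature subset can be scored on held-out data. The following is only a sketch of one way to do this with the already imported `train_test_split`, `LinearRegression` and metric functions. The helper `evaluate_feature_subset`, the synthetic `demo_x`/`demo_y` data and the 80/20 split are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split


def evaluate_feature_subset(x, y, columns, random_state=0):
    """Fit a linear model on the given columns and score it on a hold-out set.

    Hypothetical helper for illustration only.
    """
    x_train, x_test, y_train, y_test = train_test_split(
        x[columns], y, test_size=0.2, random_state=random_state
    )
    model = LinearRegression().fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return {
        'r2': r2_score(y_test, y_pred),
        'mae': mean_absolute_error(y_test, y_pred),
    }


# Synthetic stand-in data: only 'f0' carries information about the target
rng = np.random.default_rng(0)
demo_x = pd.DataFrame(rng.normal(size=(300, 3)), columns=['f0', 'f1', 'f2'])
demo_y = 2.0 * demo_x['f0'].values + rng.normal(scale=0.1, size=300)

informative = evaluate_feature_subset(demo_x, demo_y, ['f0'])
uninformative = evaluate_feature_subset(demo_x, demo_y, ['f2'])
print(informative, uninformative)
```

With the real data, the same helper could be called as `evaluate_feature_subset(x, y, selected_features)` after a selection step has produced a list of column names.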

**Select-K-Best**

In [6]:
# Convert x to a NumPy array for fitting SelectKBest
x_array = x.values

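# Instantiate SelectKBest to select top k features
# Note: r_regression returns the signed Pearson correlation, and SelectKBest keeps
# the k highest scores, so strongly negatively correlated features rank last
num_features_to_select = 4
feature_selector = SelectKBest(score_func=r_regression, k=num_features_to_select)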

# Fit feature selector to data
feature_selector.fit(x_array, y)

# Get the indices of the selected features
selected_features_indices = feature_selector.get_support(indices=True)

# Get the names of the selected features from the original feature names
selected_features = [feature_names[i] for i in selected_features_indices]

# Get the scores for each feature
feature_scores = feature_selector.scores_
feature_scores_df = pd.DataFrame({'Feature': feature_names, 'Score': feature_scores})
feature_scores_df = feature_scores_df.sort_values(by='Score', ascending=False)

print("Feature scores:")
print(feature_scores_df)

print("Selected features:")
print(selected_features)

Feature scores:
        Feature     Score
14  Hour_of_Day  0.413215
9        TM5_10  0.043928
0    RWS_DAU_10  0.043607
4         SD_10  0.042691
12      Weekend  0.040385
5         FF_10  0.039451
8         TT_10  0.038819
11      Weekday  0.034881
6         DD_10  0.016704
3         GS_10 -0.001890
2         DS_10 -0.006605
1        RWS_10 -0.009223
7         PP_10 -0.033966
13        Month -0.034154
10        RF_10 -0.129935
Selected features:
['RWS_DAU_10', 'SD_10', 'TM5_10', 'Hour_of_Day']


**Recursive Feature Elimination (RFE)**<br>
In [RFE (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html), the model is first trained on the full set of available features. The feature with the lowest importance is then removed, and the procedure is repeated on the remaining features until the desired number of features is left.

In [7]:
# No random_state is set, so the importances and the resulting ranking can vary between runs
model = RandomForestRegressor()

# Instantiate RFE to select top k features
num_features_to_select = 8
rfe_selector = RFE(model, n_features_to_select=num_features_to_select)

# Fit feature selector to data
rfe_selector = rfe_selector.fit(x, y)

# Get the mask of selected features
selected_features_mask = rfe_selector.support_

# Get the names of the selected features from the original feature names
selected_features = [feature_names[i] for i, selected in enumerate(selected_features_mask) if selected]

# Get the ranking of features (optional)
feature_ranking = rfe_selector.ranking_
feature_ranking_df = pd.DataFrame({'Feature': feature_names, 'Ranking': feature_ranking})
feature_ranking_df = feature_ranking_df.sort_values(by='Ranking')

print("Feature ranking:")
print(feature_ranking_df)

print("Selected features:")
print(selected_features)

Feature ranking:
        Feature  Ranking
5         FF_10        1
6         DD_10        1
7         PP_10        1
8         TT_10        1
9        TM5_10        1
10        RF_10        1
11      Weekday        1
14  Hour_of_Day        1
2         DS_10        2
13        Month        3
3         GS_10        4
12      Weekend        5
0    RWS_DAU_10        6
4         SD_10        7
1        RWS_10        8
Selected features:
['FF_10', 'DD_10', 'PP_10', 'TT_10', 'TM5_10', 'RF_10', 'Weekday', 'Hour_of_Day']
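
To make the elimination loop behind RFE explicit, it can be written out by hand. This is a simplified sketch, not scikit-learn's implementation: it removes one feature per round based on `feature_importances_`. The function `manual_rfe`, the synthetic `X_demo`/`y_demo` data, and the settings `n_estimators=50` and `random_state=0` are assumptions chosen for a small, reproducible illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor


def manual_rfe(estimator, X, y, n_features_to_select):
    """Repeatedly refit the model and drop the least important feature.

    Returns the indices of the kept features and the elimination order.
    """
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_features_to_select:
        # Train a fresh copy of the model on the features still in play
        fitted = clone(estimator).fit(X[:, remaining], y)
        # Position (within `remaining`) of the least important feature
        weakest = int(np.argmin(fitted.feature_importances_))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated


# Synthetic data: only columns 0 and 1 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 5.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

kept, dropped = manual_rfe(
    RandomForestRegressor(n_estimators=50, random_state=0), X_demo, y_demo, 2
)
print("kept:", kept)
print("elimination order:", dropped)
```

The features eliminated first correspond to the highest `ranking_` values reported by `RFE`, while all kept features share rank 1.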
