## 1. Data Exploration

### 1.1 Load dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('synthetic_data.csv')
df.head()

Our dataset contains 56 features and 1 target. 

### 1.2 EDA and Data Preparation

In [None]:
df.info()

All columns are numerical (float), and each contains 2000 non-null values.

In [None]:
df.describe()

`Feature_1` can be dropped because all of its values are zero.

In [None]:
df.drop('Feature_1', axis=1, inplace=True)
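
Constant columns such as `Feature_1` can also be found programmatically instead of by reading the `describe()` output. The following is a minimal sketch on a small made-up frame; `demo`, its column values, and `constant_cols` are illustrative assumptions, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's df (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Feature_1": np.zeros(100),          # constant column, like Feature_1 above
    "Feature_2": rng.normal(size=100),
    "Target": rng.normal(size=100),
})

# A column with at most one distinct value carries no information
constant_cols = [col for col in demo.columns if demo[col].nunique(dropna=False) <= 1]

# Drop the flagged columns without mutating the original frame
demo_clean = demo.drop(columns=constant_cols)
print(constant_cols)
```

The same list comprehension could be run on `df` before the drop to confirm that `Feature_1` is the only constant column.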

In [None]:
# Plot histogram of each feature
df.hist(bins=50, figsize=(20,15))
plt.show()

Observations:
- Features 2, 3, 4, 5, 6 are clean
- Features 7, 12, 16, 19, 21 are skewed to the right
- Most of the features contain outliers
- The features are unscaled
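
The observations above come from reading the histograms by eye. As a rough cross-check, skewness and the share of outliers can be quantified per column. The sketch below assumes the common 1.5 × IQR rule for outliers, which is a swapped-in convention and not something the notebook specifies; `demo` and its two columns are made-up illustrative data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one symmetric and one right-skewed column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(size=2000),
    "right_skewed": rng.exponential(size=2000),
})

# Sample skewness per column; positive values indicate a longer right tail
skewness = demo.skew()

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers (assumed rule)
q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1
is_outlier = (demo < q1 - 1.5 * iqr) | (demo > q3 + 1.5 * iqr)

# Fraction of rows flagged as outliers in each column
outlier_share = is_outlier.mean()

summary = pd.DataFrame({"skewness": skewness, "outlier_share": outlier_share})
print(summary)
```

Running the same steps on `df` would give one row per feature, which makes it easier to list the skewed features than scanning a large grid of histograms.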

In [None]:
corr_matrix = df.corr()
corr_matrix['Target'].sort_values(ascending=False)

In [None]:
# Drop features whose absolute correlation with the target is below the threshold
threshold = 0.1
target_corr = corr_matrix['Target'].abs()
features_to_drop = target_corr[target_corr < threshold].index
df.drop(features_to_drop, axis=1, inplace=True)
df.columns

In [None]:
# Set the correlation threshold
threshold = 0.8

# Filter the correlation matrix for highly correlated features,
# using absolute values so strong negative correlations are caught too
corr_matrix = df.corr()
highly_correlated = corr_matrix[corr_matrix.abs() > threshold]

# Print the highly correlated features
for column in highly_correlated.columns:
    correlated_features = highly_correlated[column].dropna().index.tolist()
    correlated_features.remove(column)  # Remove self-correlation
    if correlated_features:
        print(f"Highly correlated features to {column}: {correlated_features}")

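No pair of features is highly correlated (> 0.8), so there is little linear redundancy among them. Note that low correlation rules out strong linear relationships only; it does not prove that the features are independent.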
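
A more compact way to run this kind of check is to list each correlated pair once, using the upper triangle of the absolute correlation matrix. This is a sketch under stated assumptions: `demo` is made-up data in which column `b` is deliberately built to be strongly negatively correlated with `a`, and `high_pairs` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical data: b is almost exactly -a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": -a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

threshold = 0.8

# Absolute correlations, so strong negative relationships count as well
corr = demo.corr().abs()

# Keep only entries strictly above the diagonal: each pair appears once
# and self-correlations are excluded
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (feature, feature) and filter by threshold
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

On this made-up frame only the pair (`a`, `b`) is reported. An empty result on `df` would support the conclusion above.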

In [None]:
from sklearn.preprocessing import StandardScaler

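# Standardize every column (including Target) to zero mean and unit variance,
# keeping the column names and index so the result is a labelled DataFrame
scaled_values = StandardScaler().fit_transform(df)
df_scaled = pd.DataFrame(scaled_values, columns=df.columns, index=df.index)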

In [None]:
# Plot histogram of the scaled features
df_scaled.hist(bins=50, figsize=(20,15))
plt.show()

Now our data is prepared for preprocessing and model training.

## 2. Data Preprocessing

### 2.1 Split features and target

In [None]:
X = df.drop('Target', axis=1)
y = df['Target']
X.shape, y.shape

### 2.2 Split training and testing dataset

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
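
Since the split above uses the unscaled `df`, one common option is to fit the scaler on the training split only and reuse it on the test split, so that test-set statistics do not leak into training. The sketch below illustrates that pattern on made-up data; `X_demo`, `y_demo`, and the `*_tr` / `*_te` names are hypothetical, and this is one possible approach, not necessarily the one the notebook intends.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for X and y (values are made up)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(loc=5.0, scale=3.0, size=(200, 3)),
    columns=["f1", "f2", "f3"],
)
y_demo = pd.Series(rng.normal(size=200), name="Target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training split only, then apply the same transform to the test split
scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(
    scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_scaled = pd.DataFrame(
    scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this pattern the target is left unscaled, and the training features have zero mean and unit variance by construction while the test features are only approximately standardized.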