In [None]:
import pandas as pd
import numpy as np

## Read the dataset

- Use `pd.read_csv` to read tabular data from CSV files.
- The function returns a pandas `DataFrame` object that can be explored interactively. For more details, see the [`DataFrame` documentation](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.html?highlight=dataframe#).
- `read_csv` returns an iterable reader object instead of a `DataFrame` when called with `chunksize=<int>` or `iterator=True`.
- With `chunksize` set, calling `next` on that object returns the next `chunksize` rows as a `DataFrame`.
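
The chunked-reading pattern from the list above can be sketched as follows. This is only an illustration: it reads a tiny in-memory CSV (an assumption, so the snippet runs without the competition file), and the column names `f0`, `f1` and `target` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file on disk
csv_text = (
    "id,f0,f1,target\n"
    "0,0.1,1.0,0\n"
    "1,0.2,2.0,1\n"
    "2,0.3,3.0,0\n"
    "3,0.4,4.0,1\n"
    "4,0.5,5.0,0\n"
)

# With chunksize set, read_csv returns an iterator of DataFrames
reader = pd.read_csv(io.StringIO(csv_text), index_col=0, chunksize=2)

# next() pulls one chunk of (at most) `chunksize` rows
first_chunk = next(reader)

# The remaining chunks can be consumed with a normal loop or list()
remaining = list(reader)
chunk_sizes = [len(first_chunk)] + [len(chunk) for chunk in remaining]
print(chunk_sizes)
```

The last chunk can be shorter than `chunksize` when the number of rows is not an exact multiple of it.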

In [None]:
df = pd.read_csv("../input/tabular-playground-series-nov-2021/train.csv", header=0, index_col=0)
df.head(10)

## Many features: let's analyse which ones carry the most information about the data

### Applying Principal Component Analysis(PCA)

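- PCA is applied to analyse how much of the information (variance) in the data is captured along each direction of the feature space.
- PCA is essentially an eigenvalue decomposition (EVD) of the covariance matrix: each eigenvalue gives the variance captured by its principal component, which is a linear combination of the original features rather than a single feature.
- EVD is sometimes replaced with singular value decomposition (SVD) of the centred data. The squared singular values then play the role of the eigenvalues.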
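
The link between the two decompositions can be checked numerically. The sketch below uses randomly generated toy data (an assumption, not the competition dataset) and compares the eigenvalues of the covariance matrix with the squared singular values of the centred data divided by `n - 1`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 4 correlated features (hypothetical)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)  # centre each feature

# Route 1: eigenvalues of the covariance matrix (np.cov divides by n - 1)
cov = np.cov(Xc, rowvar=False)
evd_vals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh is ascending, so reverse

# Route 2: singular values of the centred data matrix
sing_vals = np.linalg.svd(Xc, compute_uv=False)  # already descending
svd_vals = sing_vals**2 / (Xc.shape[0] - 1)

print(np.allclose(evd_vals, svd_vals))
```

Both routes yield the same variances per principal component. SVD avoids forming the covariance matrix explicitly, which is one reason it is often preferred.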

In [None]:
# Get the feature and target columns
feature_cols = df.columns[:-1]
target_cols = df.columns[-1]

# Get the data as a numpy matrix
features = df[feature_cols].to_numpy()
target = df[target_cols].to_numpy()
print(f"Data shape: features -> {features.shape}, and Target -> {target.shape}") 

### Normalize the features

- PCA requires normalized features.
- PCA is susceptible to differences in variance between features.
- If features have significantly different scales, those on a smaller scale can get suppressed during decomposition.
- We therefore transform each feature to have mean 0 and standard deviation 1.
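
To illustrate why scale matters, here is a small NumPy-only sketch on two made-up features with very different scales (both the data and the variable names are assumptions). It applies the z-score transform by hand, using the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales
X = np.column_stack([
    rng.normal(0.0, 1.0, 500),      # small-scale feature
    rng.normal(50.0, 1000.0, 500),  # large-scale feature
])

# Before scaling: share of the total variance held by each feature
raw_var = np.var(X, axis=0)
raw_share = raw_var / raw_var.sum()

# z-score: subtract the column mean, divide by the column standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# After scaling: every feature contributes the same variance
scaled_var = np.var(X_std, axis=0)
scaled_share = scaled_var / scaled_var.sum()

print(raw_share, scaled_share)
```

Before scaling, the large-scale feature holds nearly all of the variance, so it would dominate the leading principal component. After scaling, both features contribute equally.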

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# Create the StandardScaler object
scaler = StandardScaler()

# Transform the features
features = scaler.fit_transform(features)

In [None]:
# Check if it worked or not
features_mean = np.mean(features, axis=0)
features_std = np.std(features, axis=0)

print(f"Feature mean: {features_mean} \n Features standard deviation: {features_std}")

### Apply EVD

In [None]:
# Get the covariance matrix
features_cov = np.cov(features.T)

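# Apply eigen value decomposition; eigh is intended for symmetric matrices
# such as a covariance matrix and always returns real eigen values
eigen_vals, eigen_vecs = np.linalg.eigh(features_cov)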

# normalize the eigen values to compare in %
eigen_vals /= np.sum(eigen_vals)
eigen_vals *= 100

### Visualize the eigen values

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize = [35, 7])
ep = sns.scatterplot(x = feature_cols, y = eigen_vals, hue = np.log(eigen_vals), legend = False)

#### Sort eigen values and features for clear visualization

In [None]:
# Get the indices that sort the eigen values in descending order
idx = np.flip(np.argsort(eigen_vals))

# Plot the sorted eigen values and corresponding features
plt.figure(figsize = [35, 7])
ep = sns.scatterplot(x = feature_cols[idx], y = eigen_vals[idx], hue = np.log(eigen_vals[idx]), legend = False)

#### Let's look at the cumulative behaviour of the sorted eigen values

In [None]:
eigen_vals_cumulative = np.cumsum(eigen_vals[idx])
plt.figure(figsize = [35, 7])
graph = sns.scatterplot(x = feature_cols[idx], y = eigen_vals_cumulative, hue = np.log(eigen_vals_cumulative), legend = False)
graph.axhline(99)
plt.show()

> The previous figure shows that the cumulative sum of the eigen values grows gradually across the whole feature space. Put simply, retaining 99% of the variance lets us drop only a single feature. PCA-based feature reduction is therefore not useful here: applying it would lose information without removing a meaningful number of features.
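
The same reading can be made numerically instead of from the plot. The sketch below counts how many components are needed to reach a given threshold. The percentages in `explained` are made up for illustration and are not the values from this dataset.

```python
import numpy as np

# Hypothetical explained-variance percentages, sorted in descending order
explained = np.array([40.0, 30.0, 15.0, 10.0, 4.5, 0.5])
cumulative = np.cumsum(explained)

threshold = 99.0

# searchsorted gives the index of the first cumulative value >= threshold;
# adding one turns that index into a count of components
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(f"{n_components} of {len(explained)} components reach {threshold}%")
```

When `n_components` is close to the total number of components, as in the plot above, dimensionality reduction saves very little.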

In [None]:
sns.histplot(x = target)