
# Wine Dataset: EDA and PCA

This exercise uses a slightly adapted version of the Wine dataset and covers the following steps:

1. Load and inspect the Wine dataset
2. Basic exploratory data analysis
3. Visualizations
4. Correlation analysis
5. Principal Component Analysis (PCA)


In [None]:

# Imports & display settings
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Seaborn style
sns.set_theme(style="whitegrid", context="notebook")


## 1. Load Data

In [None]:
df = pd.read_csv("wine_adapted.csv")
df

## 2. Basic Exploration

Basic information (e.g. `head`, `info`, `shape`):

In [None]:
df.info()

Summary statistics

In [None]:

print("Class distribution:")
df["target"].value_counts()


In [None]:
df.describe()

In [None]:

# Mean of features per target
print("Mean feature values per target:")
means_per_target = df.groupby('target').mean(numeric_only=True)
means_per_target


In [None]:

# Standard deviation of features per target
print("Standard deviation for feature values per target:")
std_per_target = df.groupby('target').std(numeric_only=True)
std_per_target


In [None]:

# Variance of features per target
print("Variance of feature values per target:")
var_per_target = df.groupby('target').var(numeric_only=True)
var_per_target


Deal with missing values and duplicates:

In [None]:

print("Missing values per column:")
df.isna().sum()
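
The cell above only counts missing values. A minimal sketch of how missing values and duplicates could be handled is shown below on a small toy DataFrame. The values and the name `toy` are hypothetical stand-ins for `df`, and dropping rows is only one option (imputation is another), so treat this as an illustration rather than the intended solution.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine DataFrame (hypothetical values):
# row 2 repeats row 0, and row 3 has a missing alcohol value
toy = pd.DataFrame({
    "alcohol": [13.2, 12.4, 13.2, np.nan],
    "ash": [2.1, 2.3, 2.1, 2.5],
    "target": [0, 1, 0, 2],
})

n_missing = toy.isna().sum().sum()      # total number of missing cells
n_duplicates = toy.duplicated().sum()   # rows identical to an earlier row

# Drop incomplete rows first, then exact duplicates
toy_clean = toy.dropna().drop_duplicates().reset_index(drop=True)

print(f"Missing cells: {n_missing}, duplicate rows: {n_duplicates}")
print(f"Shape before: {toy.shape}, after: {toy_clean.shape}")
```

The same calls (`df.duplicated()`, `df.dropna()`, `df.drop_duplicates()`) can be applied to `df` directly.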

## 3. Visualizations

Try different plots, e.g. a histogram, a box or violin plot, or a scatter plot (pick one feature):

In [None]:
plt.figure(figsize=(8, 5))
sns.boxplot(x="target", y="flavanoids", data=df)
plt.title("Flavonoid Content by Wine Class")
plt.tight_layout()
# plt.savefig("boxplot.png", dpi=150)
plt.show()

In [None]:
plt.figure(figsize=(8, 5))
sns.scatterplot(
    data=df,
    x="alcohol",
    y="ash",
    hue="target",
    palette="viridis",
)
plt.title("Scatterplot: alcohol vs. ash (Wine dataset)")
plt.tight_layout()
plt.show()

## 4. Correlation Analysis

Pairplot to spot correlations:

In [None]:
# Naive pairplot over every column, without hue: crowded and slow to render
bad_plot = sns.pairplot(data=df)

In [None]:

selected_features = ["alcohol", "flavanoids", "malic_acid", "color_intensity", "proline"]
pp = sns.pairplot(df, vars=selected_features, hue="target", palette="Set1")
# pp.savefig("pairplot.png", dpi=150)
plt.tight_layout()
plt.show()

In [None]:
corr = df.corr(numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap - Wine Dataset")
plt.tight_layout()
# plt.savefig("heatmap.png", dpi=150)
plt.show()
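
A heatmap shows the overall pattern, but it can help to list the strongest correlations explicitly. The sketch below does this on synthetic data (the columns `a`, `b`, `c` are made up for illustration); the same steps should work on `corr` from the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),  # built to correlate strongly with a
    "c": rng.normal(size=200),            # independent noise
})

corr_toy = toy.corr()

# Take each feature pair once: upper triangle, diagonal excluded (k=1)
rows, cols = np.triu_indices(len(corr_toy), k=1)
pairs = pd.Series(
    corr_toy.to_numpy()[rows, cols],
    index=[(corr_toy.index[i], corr_toy.columns[j]) for i, j in zip(rows, cols)],
)

# Sort by absolute correlation, strongest first
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```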


## 5. PCA

1) Separate features (X) and labels (y)

In [None]:
X = df.drop("target", axis=1)
y = df["target"]

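2) Scale the features. The features are measured on different scales, and variance depends on scale, so without standardization the features with the largest numeric ranges would dominate the principal components. Scaling gives every feature the same weight.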

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
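
To see why scaling matters, the following sketch compares variances before and after standardization. The two columns are hypothetical values chosen to sit on very different scales; they are not taken from the Wine data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
X_toy = np.array([
    [12.0, 500.0],
    [13.0, 1000.0],
    [14.0, 1500.0],
])

# Before scaling, the second feature's variance dwarfs the first
print("Variance before scaling:", X_toy.var(axis=0))

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# After scaling, each feature has mean 0 and standard deviation 1
print("Mean after scaling:", X_toy_scaled.mean(axis=0).round(6))
print("Std after scaling: ", X_toy_scaled.std(axis=0).round(6))
```

The same check can be run on `X_scaled` to confirm that every column has mean 0 and standard deviation 1.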

3) Apply PCA: `fit_transform` combines `fit` (learning from the data: covariance, eigenvalues, eigenvectors, PC directions) and `transform` (projecting the data onto the learned components):

In [None]:
# learn from X_scaled and transform X_scaled
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# write PCs into df and add the label y
pca_df = pd.DataFrame(X_pca, columns=["PC1", "PC2"]).assign(target=y.values) # assign adds a new column to the df
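
To make the `fit` and `transform` steps more concrete, here is a sketch that reproduces PCA by hand with NumPy and compares it with scikit-learn. It uses synthetic data (the array `X_toy` is hypothetical) and is meant only as an illustration of the idea; scikit-learn computes the decomposition internally in its own way.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
# Hypothetical data: two correlated features plus one independent feature
X_toy = np.hstack([
    base + 0.1 * rng.normal(size=(100, 1)),
    2 * base + 0.1 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# "fit": center the data, then eigen-decompose the covariance matrix
X_centered = X_toy - X_toy.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort them descending
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# "transform": project the centered data onto the top two directions
manual_scores = X_centered @ eigvecs[:, :2]

pca_toy = PCA(n_components=2)
sk_scores = pca_toy.fit_transform(X_toy)

# Component signs are arbitrary, so compare absolute values
print(np.allclose(np.abs(manual_scores), np.abs(sk_scores)))
print(np.allclose(eigvals[:2], pca_toy.explained_variance_))
```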


Scatter plot of the first two principal components:

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x="PC1", y="PC2", hue="target", data=pca_df, palette="Set1")
plt.title("PCA (2 Components) - Wine Dataset")
plt.tight_layout()
# plt.savefig("pca_scatter.png", dpi=150)
plt.show()

Assess the information loss of the dimensionality reduction:

In [None]:
print("Explained variance ratio (PCA):")
print(pca.explained_variance_ratio_)
print(f"Cumulative variance explained: {pca.explained_variance_ratio_.sum():.3f}")
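
Two components are convenient for plotting, but it is also useful to see how the explained variance accumulates over all components. The sketch below assumes scikit-learn's bundled Wine dataset (`load_wine`) as a stand-in for `wine_adapted.csv`, and the 90% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled Wine dataset
X_wine = load_wine().data
X_wine_scaled = StandardScaler().fit_transform(X_wine)

# Without n_components, PCA keeps all components
pca_full = PCA()
pca_full.fit(X_wine_scaled)

cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.90  # illustrative choice, not a fixed rule
# First index where the cumulative ratio reaches the threshold (1-based count)
n_needed = int(np.argmax(cumulative >= threshold) + 1)

for i, value in enumerate(cumulative, start=1):
    print(f"PC1..PC{i}: {value:.3f}")
print(f"Components needed for {threshold:.0%} variance: {n_needed}")
```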

The PCA result could then be used as input for clustering or for predictive models.
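
As one possible follow-up, the sketch below runs k-means on the first two principal components. It again assumes scikit-learn's bundled Wine dataset as a stand-in for `wine_adapted.csv`; the choice of three clusters and of k-means itself is an assumption made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled Wine dataset
X_wine = load_wine().data

# Same pipeline as above: scale, then reduce to two components
X_wine_pca = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(X_wine)
)

# Cluster in PCA space; n_clusters=3 is an illustrative assumption
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_wine_pca)

print("Cluster sizes:", np.bincount(labels))
```

The cluster labels could then be compared with `target`, for example with `pd.crosstab`.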