Objective: 
The objective of this assignment is to implement PCA on a given dataset and analyse the results. 
Instructions: 
Download the wine dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Wine).
Load the dataset into a pandas dataframe.
Split the dataset into features and target variables.
Perform data preprocessing (e.g., scaling, normalisation, missing value imputation) as necessary.
Implement PCA on the preprocessed dataset using the scikit-learn library.
Determine the optimal number of principal components to retain based on the explained variance ratio.
Visualise the results of PCA using a scatter plot.
Perform clustering on the PCA-transformed data using the K-Means clustering algorithm.
Interpret the results of PCA and clustering analysis.
Deliverables: 
Jupyter notebook containing the code for the PCA implementation. 
A report summarising the results of PCA and clustering analysis. 
Scatter plot showing the results of PCA. 
A table showing the performance metrics for the clustering algorithm. 
Additional Information: 
You can use the Python programming language.
You can use any other machine learning libraries or tools as necessary. 
You can use any visualisation libraries or tools as necessary. 




1. Load the wine dataset into a pandas dataframe.

import pandas as pd

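# Read the data straight from the UCI repository (swap in a local path if you have downloaded the file).
# The file has no header row, so header=None stops pandas from treating the first sample as column names.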
wine_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)

# Rename the columns
wine_data.columns = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']

2. Split the dataset into features and target variables.

from sklearn.model_selection import train_test_split

X = wine_data.drop('Class', axis=1)
y = wine_data['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

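3. Perform data preprocessing if necessary. PCA is sensitive to feature scale, so standardise each feature to a mean of 0 and a standard deviation of 1. The scaler is fitted on the training set only and then applied to the test set.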

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
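
The preprocessing choice above can be checked with a short sketch: confirm there are no missing values to impute, then confirm the scaled training data has mean 0 and standard deviation 1. This is an illustration under assumptions, not part of the assignment code: it uses `load_wine` (scikit-learn's bundled copy of the UCI wine data) so it runs without a download, and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the UCI file.
wine_demo = load_wine(as_frame=True).frame  # 13 features plus a 'target' column

# Count missing values across the whole frame; zero means no imputation is needed.
missing_total = int(wine_demo.isna().sum().sum())
print(f'Missing values: {missing_total}')

X_demo = wine_demo.drop(columns='target')
X_tr_demo, X_te_demo = train_test_split(X_demo, test_size=0.2, random_state=42)

# Fit on the training split only, then reuse the same statistics for the test split.
scaler_demo = StandardScaler().fit(X_tr_demo)
X_tr_scaled = scaler_demo.transform(X_tr_demo)
X_te_scaled = scaler_demo.transform(X_te_demo)

train_mean_ok = bool(np.allclose(X_tr_scaled.mean(axis=0), 0))
train_std_ok = bool(np.allclose(X_tr_scaled.std(axis=0), 1))
print(f'Train means are 0: {train_mean_ok}, train stds are 1: {train_std_ok}')
```

The test split will generally not have exactly zero mean or unit variance, because it is transformed with the training statistics; that is expected.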

4. Implement PCA on the preprocessed dataset using the scikit-learn library.

from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train)

5. Determine the optimal number of principal components to retain based on the explained variance ratio.

import numpy as np

# Running total of the variance explained by the first k components
cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)

# Retain the smallest number of principal components whose cumulative explained variance reaches 95%
n_components = np.argmax(cumulative_explained_variance >= 0.95) + 1

print(f'Optimal number of principal components: {n_components}')
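
As a cross-check on the threshold rule above, scikit-learn's `PCA` also accepts a float `n_components` between 0 and 1 and then selects the number of components itself. The sketch below compares the two approaches. It makes some assumptions: it uses `load_wine` (scikit-learn's bundled copy of the data) and the full dataset rather than the training split, and the `*_demo`, `k_manual` and `k_auto` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: bundled wine data, standardised over all samples for illustration.
X_demo, _ = load_wine(return_X_y=True)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

threshold = 0.95

# Manual rule: smallest k whose cumulative explained variance reaches the threshold.
cumulative = np.cumsum(PCA().fit(X_demo_scaled).explained_variance_ratio_)
k_manual = int(np.argmax(cumulative >= threshold) + 1)

# Shortcut: a float n_components asks PCA to choose k for that variance target.
pca_auto = PCA(n_components=threshold, svd_solver='full').fit(X_demo_scaled)
k_auto = int(pca_auto.n_components_)

print(f'Manual rule: {k_manual} components, PCA(n_components={threshold}): {k_auto} components')
```

The 95% cut-off is a convention rather than a rule; plotting `cumulative` against the component index is a common way to judge whether a lower threshold would lose much.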

6. Visualise the results of PCA using a scatter plot.

import matplotlib.pyplot as plt

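# Project the training data and keep the first n_components columns.
# The scatter plot below shows only the first two of them.
X_train_pca = pca.transform(X_train)[:, :n_components]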

plt.figure(figsize=(8, 6))
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_train, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA Scatter Plot')
plt.colorbar(label='Class')
plt.show()

7. Perform clustering on the PCA-transformed data using K-Means clustering algorithm.

from sklearn.cluster import KMeans

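# Three clusters to match the three wine classes; n_init and random_state are fixed so results are reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)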
kmeans.fit(X_train_pca)

y_train_pred = kmeans.predict(X_train_pca)
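
The choice of three clusters above follows the known number of wine classes. Where that number is not known in advance, one common approach is to compare the Silhouette Score across several values of k. The following is a minimal sketch of that idea, not the method used in this notebook. It assumes `load_wine` (scikit-learn's bundled copy of the data), two principal components, and a candidate range of k from 2 to 6; all names in it are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: bundled wine data, standardised and reduced to two components.
X_demo, _ = load_wine(return_X_y=True)
X_demo_pca = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X_demo))

# Fit K-Means for each candidate k and record the silhouette score.
scores_by_k = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo_pca)
    scores_by_k[k] = silhouette_score(X_demo_pca, labels_k)

best_k = max(scores_by_k, key=scores_by_k.get)
for k, s in scores_by_k.items():
    print(f'k={k}: silhouette={s:.3f}')
print(f'Highest silhouette at k={best_k}')
```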

8. Interpret the results of PCA and clustering analysis. For example, you can evaluate the performance of the clustering algorithm using metrics such as Silhouette Score.

from sklearn.metrics import silhouette_score

# Store the result under a different name so the imported silhouette_score function is not overwritten
score = silhouette_score(X_train_pca, y_train_pred)
print(f'Silhouette Score: {score:.3f}')
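
The deliverables ask for a table of clustering performance metrics, and a single Silhouette Score can be extended into one. The sketch below is one possible layout, under assumptions: it uses `load_wine` (scikit-learn's bundled copy of the data), two principal components, and the full dataset rather than the training split. The choice of metrics and the `*_demo` and `metrics_table` names are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.preprocessing import StandardScaler

# Assumption: bundled wine data, standardised and reduced to two components.
X_demo, y_demo = load_wine(return_X_y=True)
X_demo_pca = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X_demo))

labels_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo_pca)

# Internal metrics use only the data and cluster labels;
# the adjusted Rand index also compares against the true classes.
metrics_table = pd.DataFrame({
    'Metric': ['Silhouette score', 'Davies-Bouldin index', 'Adjusted Rand index'],
    'Value': [
        silhouette_score(X_demo_pca, labels_demo),
        davies_bouldin_score(X_demo_pca, labels_demo),
        adjusted_rand_score(y_demo, labels_demo),
    ],
    'Uses true labels': [False, False, True],
})
print(metrics_table.to_string(index=False))
```

A higher Silhouette Score and a lower Davies-Bouldin index both indicate tighter, better separated clusters. The adjusted Rand index is 1 when the clusters match the true classes exactly and close to 0 for a random assignment.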