Theoretical Questions

---

**1. What is K-Nearest Neighbors (KNN) and how does it work?**  
-> KNN is a supervised learning algorithm used for classification and regression. It predicts the label of a new data point by looking at the 'K' closest points (neighbors) in the training set. The majority vote (classification) or average (regression) of neighbors determines the output. For example, to classify an email as spam or not, KNN checks the most similar emails.
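
The voting step can be sketched in a few lines of NumPy. This is a minimal illustration and not how scikit-learn implements it; the helper `knn_predict` and the toy arrays `X_toy` and `y_toy` are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters (hypothetical values)
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))  # query near the first cluster
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0])))  # query near the second cluster
```

There is no training step beyond storing the data: all the work happens at prediction time.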

---

**2. What is the difference between KNN Classification and KNN Regression?**  
-> KNN Classification predicts a **category** (e.g., "spam" or "not spam") by majority voting among neighbors. KNN Regression predicts a **continuous value** (e.g., house prices) by averaging the neighbors’ values. The fundamental process is similar, but the output type differs.
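
As a small sketch of that difference, the same three neighbors can produce a class vote or a numeric average. The toy arrays `X_demo`, `labels`, and `values` are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy 1-D inputs that carry both a class label and a numeric target (made up)
X_demo = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
values = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

query = np.array([[2.5]])  # its 3 nearest neighbors are the points 1.0, 2.0 and 3.0

clf = KNeighborsClassifier(n_neighbors=3).fit(X_demo, labels)
reg = KNeighborsRegressor(n_neighbors=3).fit(X_demo, values)

predicted_class = clf.predict(query)[0]  # majority vote of the neighbors' labels
predicted_value = reg.predict(query)[0]  # mean of the neighbors' values

print("Class:", predicted_class)
print("Value:", predicted_value)
```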

---

**3. What is the role of the distance metric in KNN?**  
-> The distance metric (like Euclidean or Manhattan distance) determines how 'closeness' is measured between points. The choice of metric impacts which neighbors are considered closest and thus affects prediction accuracy. For example, Euclidean distance works well when features are equally scaled.
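
A quick sketch shows how the metric can change which point counts as the nearest neighbor. The query and candidate points below are made up to make the effect visible.

```python
import numpy as np

query = np.array([0.0, 0.0])
candidates = np.array([[3.0, 3.0],   # diagonal offset
                       [0.0, 5.0]])  # offset along one axis only

# Euclidean: square root of the summed squared differences
euclidean = np.sqrt(((candidates - query) ** 2).sum(axis=1))
# Manhattan: sum of absolute differences
manhattan = np.abs(candidates - query).sum(axis=1)

nearest_euclidean = int(np.argmin(euclidean))
nearest_manhattan = int(np.argmin(manhattan))

print("Euclidean distances:", euclidean, "-> nearest index", nearest_euclidean)
print("Manhattan distances:", manhattan, "-> nearest index", nearest_manhattan)
```

The diagonal point wins under Euclidean distance, while the axis-aligned point wins under Manhattan distance, so the two metrics would hand KNN different neighbors.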

---

**4. What is the Curse of Dimensionality in KNN?**  
-> As the number of features (dimensions) increases, distances between points become less meaningful, making KNN less effective. In high dimensions, all points tend to look similarly distant. For instance, in image recognition with thousands of pixels, KNN might struggle without dimensionality reduction.
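
One way to see this effect is to compare the nearest and farthest distances from a query point as the dimension grows. The sketch below uses uniformly random synthetic points; the helper `nearest_to_farthest_ratio` and the chosen dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Ratio of the smallest to the largest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()


ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for n_dims, ratio in ratios.items():
    print(f"{n_dims:>4} dimensions: nearest/farthest = {ratio:.3f}")
```

A ratio close to 1 means the nearest neighbor is barely closer than the farthest point, which is why neighbor-based predictions lose meaning in very high dimensions.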

---

**5. How can we choose the best value of K in KNN?**  
-> We usually use **cross-validation** to test several values of K and pick the one with the best validation performance. A small K follows noise in the training data (high variance), while a large K smooths over real structure (high bias). For binary classification, odd values of K are often preferred because they avoid tied votes.
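
A minimal sketch of this search, assuming the Iris dataset, 5-fold cross-validation, and odd K values from 1 to 15 (all three are choices made for this example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

cv_scores = {}
for k in range(1, 16, 2):
    # Scaling lives inside the pipeline so it is refit on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_iris, y_iris, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("Mean CV accuracy per K:", cv_scores)
print("Best K:", best_k)
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation fold into training.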

---

**6. What are KD Tree and Ball Tree in KNN?**  
-> KD Tree and Ball Tree are data structures that partition the data to make neighbor searches faster. KD Tree splits data based on axis-aligned cuts; Ball Tree groups data into hyperspheres. These trees speed up finding nearest neighbors, especially for large datasets.
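
Both structures are available directly in scikit-learn as `KDTree` and `BallTree`. The sketch below builds each on the same random points (synthetic data, sizes chosen arbitrarily) and checks that they return the same neighbor distances.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 3))   # synthetic 3-D points
queries = rng.random((5, 3))

# Each tree is built once, then queried for the 3 nearest neighbors
kd_dist, kd_idx = KDTree(points, leaf_size=30).query(queries, k=3)
ball_dist, ball_idx = BallTree(points, leaf_size=30).query(queries, k=3)

print("Same neighbor distances:", np.allclose(kd_dist, ball_dist))
print("Neighbor indices for the first query:", kd_idx[0])
```

Both searches are exact, so the trees change how quickly neighbors are found, not which neighbors are found.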

---

**7. When should you use KD Tree vs. Ball Tree?**  
-> KD Trees are efficient for low-dimensional data (roughly up to 20 features), while Ball Trees tend to cope better as dimensionality grows. In very high dimensions both lose much of their advantage, and brute-force search can be just as fast. For example, a KD Tree works well for a 2D geographic dataset; for data with dozens of features, a Ball Tree is often the better choice.

---

**8. What are the disadvantages of KNN?**  
->  
- **Computationally expensive** during prediction (slow with large datasets).  
- **Sensitive to irrelevant features** and feature scaling.  
- **Curse of Dimensionality** issues.  
- **Memory-intensive**, as it needs to store the entire dataset.  
Example: KNN can become very slow for millions of records in real-time applications.

---

**9. How does feature scaling affect KNN?**  
-> KNN relies on distance metrics, so features with larger numeric ranges dominate the distance. **Feature scaling** (e.g., Min-Max or Standardization) puts all features on a comparable scale so each contributes to the distance calculation. For instance, income (values in the tens of thousands) and age (values below 100) need scaling; otherwise the distance is driven almost entirely by income.
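
The income and age example can be sketched with made-up numbers. The array `people` and the helper `nearest_to_first` are hypothetical; standardization is done by hand with NumPy.

```python
import numpy as np

# Columns: [annual income, age]; values are invented for illustration
people = np.array([
    [50_000, 25],   # query person (row 0)
    [50_500, 60],   # similar income, very different age
    [52_000, 26],   # slightly different income, almost the same age
    [90_000, 40],
    [30_000, 35],
], dtype=float)


def nearest_to_first(data):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1


# Standardization: zero mean and unit variance per column
standardized = (people - people.mean(axis=0)) / people.std(axis=0)

nearest_raw = nearest_to_first(people)
nearest_scaled = nearest_to_first(standardized)

print("Nearest neighbor on raw data:", nearest_raw)
print("Nearest neighbor after scaling:", nearest_scaled)
```

On raw data the 60-year-old is the nearest neighbor because only income matters; after scaling, the 26-year-old is.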

---

**10. What is PCA (Principal Component Analysis)?**  
-> PCA is a dimensionality reduction technique that transforms data into fewer dimensions (principal components) while preserving maximum variance. It helps in compressing data without losing significant information, making models faster and sometimes more accurate.

---

**11. How does PCA work?**  
-> PCA identifies the directions (principal components) in which the data varies the most. It centers the data, computes the covariance matrix, and takes that matrix's eigenvectors as the new directions, ordered by their eigenvalues. The data is then projected onto the top few directions, so complex datasets can be simplified while keeping the most important information.
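
These steps can be sketched with NumPy and checked against scikit-learn's `PCA`. The synthetic data `X_demo` and its mixing matrix are assumptions for illustration; component signs are arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated 3-D toy data built from 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 0.1],
                   [0.0, 1.0, 0.3]])
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 3))

# 1. Center the data
X_centered = X_demo - X_demo.mean(axis=0)
# 2. Covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)
# 3. Eigen-decomposition (eigh suits symmetric matrices), sorted by decreasing eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# 4. Project onto the top 2 eigenvectors
X_projected = X_centered @ eigenvectors[:, :2]

pca_ref = PCA(n_components=2).fit(X_demo)
print(np.allclose(eigenvalues[:2], pca_ref.explained_variance_))
print(np.allclose(np.abs(X_projected), np.abs(pca_ref.transform(X_demo))))
```

The eigenvalues match `explained_variance_`, which is the sense in which each eigenvalue measures the variance captured by its component.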

---

**12. What is the geometric intuition behind PCA?**  
-> Geometrically, PCA finds new axes (directions) that best capture the variance in the data. Imagine drawing a line through a cloud of points along the direction in which the cloud is most spread out: that line is the first principal component. Each subsequent component is orthogonal (perpendicular) to the previous ones and captures the largest remaining variance.

---

**13. What is the difference between Feature Selection and Feature Extraction?**  
->  
- **Feature Selection** picks a subset of original features (e.g., dropping irrelevant columns).  
- **Feature Extraction** creates new features from existing ones (e.g., PCA generating principal components).  
Example: Selecting only 'age' and 'income' vs. combining features into 'spending power.'

---

**14. What are Eigenvalues and Eigenvectors in PCA?**  
-> In PCA, the eigenvectors of the data's covariance matrix define the directions (principal components), and the corresponding eigenvalues give the amount of variance captured along each direction. Components with larger eigenvalues capture more variance, so PCA keeps the eigenvectors with the largest eigenvalues to form the reduced feature space.

---

**15. How do you decide the number of components to keep in PCA?**  
-> We choose the number of components that retain a significant percentage of variance (e.g., 95%). A **Scree plot** or **explained variance ratio** helps visualize how many components are enough to capture the majority of information.
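
A sketch of the 95% rule, assuming the Wine dataset with standardized features. scikit-learn's `PCA` accepts a float between 0 and 1 as `n_components` and then keeps just enough components to reach that share of variance.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine = StandardScaler().fit_transform(load_wine().data)

# Manual route: cumulative explained variance, then the first count reaching 95%
cumulative = np.cumsum(PCA().fit(X_wine).explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

# Built-in route: let PCA pick the number of components for a 95% target
pca_95 = PCA(n_components=0.95).fit(X_wine)

print("Components needed (manual):", n_needed)
print("Components kept by PCA(n_components=0.95):", pca_95.n_components_)
```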

---

**16. Can PCA be used for classification?**  
-> Not by itself: PCA is unsupervised and does not predict labels. It is, however, often used as a preprocessing step before classification to reduce dimensionality, noise, and training time, with a separate classifier trained on the transformed data. For instance, we can apply PCA to image data and then fit a classifier on the resulting components.

---

**17. What are the limitations of PCA?**  
->  
- Assumes linear relationships.  
- Sensitive to outliers.  
- Hard to interpret principal components.  
- Information loss if too few components are chosen.  
Example: Reducing 1000 features to 2 might oversimplify and hurt model performance.

---

**18. How do KNN and PCA complement each other?**  
-> PCA reduces dimensions and noise, which makes KNN's distance computations cheaper and often more meaningful in high-dimensional data. For instance, using PCA to compress image features before applying KNN speeds up the neighbor search and can improve prediction quality, although discarding too many components can also hurt accuracy.
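
One way to combine the two is a scikit-learn pipeline. The sketch below assumes the digits dataset (64 pixel features) and 20 components; both are arbitrary choices for illustration, not tuned values.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_digits, y_digits = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_digits, y_digits, test_size=0.3, random_state=42, stratify=y_digits
)

# Scale, compress 64 features to 20 components, then classify with KNN
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    KNeighborsClassifier(n_neighbors=5),
)
pca_knn.fit(Xd_train, yd_train)

pipeline_accuracy = pca_knn.score(Xd_test, yd_test)
print("Test accuracy with PCA + KNN:", pipeline_accuracy)
```

The pipeline fits the scaler and PCA on the training data only and reuses them at prediction time.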

---

**19. How does KNN handle missing values in a dataset?**  
-> KNN doesn't inherently handle missing values, because distances cannot be computed on incomplete rows. A preprocessing step such as **imputation** (filling missing values with the mean or median) is required first. Alternatively, a KNN-based imputer such as scikit-learn's `KNNImputer` estimates each missing value from the known values of the nearest neighbors.
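
A tiny sketch of neighbor-based imputation with `KNNImputer`. The array `X_gaps` is a toy example; each gap is filled with the mean of that feature over the 2 nearest rows, where nearness is measured on the features both rows have.

```python
import numpy as np
from sklearn.impute import KNNImputer

X_gaps = np.array([[1.0, 2.0, np.nan],
                   [3.0, 4.0, 3.0],
                   [np.nan, 6.0, 5.0],
                   [8.0, 8.0, 7.0]])

imputer_demo = KNNImputer(n_neighbors=2)
X_filled = imputer_demo.fit_transform(X_gaps)

print(X_filled)
```

Observed entries are left unchanged; only the `nan` cells are replaced.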

---

**20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?**  
->  
- PCA is **unsupervised** (ignores labels); LDA is **supervised** (uses class labels).  
- PCA maximizes **variance**; LDA maximizes **class separability**.  
- PCA is good for compression; LDA is better for classification tasks.  
Example: PCA for visualizing data; LDA for building a classifier on reduced data.
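
The difference shows up directly in the API: PCA is fit on the features alone, while LDA also needs the labels. A minimal sketch on the Iris dataset (chosen here for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_iris, y_iris = load_iris(return_X_y=True)

# PCA: unsupervised, so only X is passed
X_pca = PCA(n_components=2).fit_transform(X_iris)

# LDA: supervised, so the class labels are required;
# with 3 classes it can produce at most 2 components
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_iris, y_iris)

print("PCA projection shape:", X_pca.shape)
print("LDA projection shape:", X_lda.shape)
```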

---

In [None]:
#21. Train a KNN Classifier on the Iris dataset and print model accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Train KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


In [None]:
#22. Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE)
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Create synthetic data
X, y = make_regression(n_samples=200, n_features=2, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train KNN Regressor
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)

# Predict and evaluate
y_pred = knn_reg.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))


In [None]:
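#23. Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy
# Re-split Iris first: the previous cell overwrote X_train/y_train with regression data
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Euclidean Distance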
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test))

# Manhattan Distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test))

print(f"Euclidean Accuracy: {acc_euclidean}")
print(f"Manhattan Accuracy: {acc_manhattan}")


In [None]:
#24. Train a KNN Classifier with different values of K and visualize decision boundaries
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap

# Reduce to 2D
X_small, y_small = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X_small, y_small, test_size=0.3, random_state=42)

# Function to plot decision boundaries
def plot_decision_boundary(k):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    h = .02
    x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
    y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, cmap=ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF']))
    plt.scatter(X_small[:, 0], X_small[:, 1], c=y_small, edgecolors='k', marker='o')
    plt.title(f"Decision Boundary (k={k})")
    plt.show()

plot_decision_boundary(3)
plot_decision_boundary(7)


In [None]:
#25. Apply Feature Scaling before training a KNN model and compare results with unscaled data
from sklearn.preprocessing import StandardScaler

# Without Scaling
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
acc_unscaled = accuracy_score(y_test, knn_unscaled.predict(X_test))

# With Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))

print(f"Accuracy without scaling: {acc_unscaled}")
print(f"Accuracy with scaling: {acc_scaled}")


In [None]:
#26. Train a PCA model on synthetic data and print the explained variance ratio for each component
from sklearn.decomposition import PCA

# Create synthetic dataset
X, _ = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=42)

# Apply PCA
pca = PCA()
pca.fit(X)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)


In [None]:
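#27. Apply PCA before training a KNN Classifier and compare accuracy with and without PCA
# Use all four Iris features: on the 2-feature split from #24, a 2-component PCA only rotates the data
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Without PCA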
knn_no_pca = KNeighborsClassifier(n_neighbors=5)
knn_no_pca.fit(X_train_scaled, y_train)
acc_no_pca = accuracy_score(y_test, knn_no_pca.predict(X_test_scaled))

# With PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
acc_with_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print(f"Accuracy without PCA: {acc_no_pca}")
print(f"Accuracy with PCA: {acc_with_pca}")


In [None]:
#28. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_neighbors': [3,5,7,9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)


In [None]:
#29. Train a KNN Classifier and check the number of misclassified samples
# Train
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict
y_pred = knn.predict(X_test_scaled)

# Misclassified samples
misclassified = (y_test != y_pred).sum()
print("Number of misclassified samples:", misclassified)


In [None]:
#30. Train a PCA model and visualize the cumulative explained variance
import matplotlib.pyplot as plt

# PCA
pca = PCA()
pca.fit(X_train_scaled)

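# Plot cumulative explained variance against the component count (starting at 1)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')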
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance vs. Components')
plt.grid()
plt.show()


In [None]:
#31. Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy
# Uniform Weights
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_uniform.fit(X_train_scaled, y_train)
acc_uniform = accuracy_score(y_test, knn_uniform.predict(X_test_scaled))

# Distance Weights
knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_distance.fit(X_train_scaled, y_train)
acc_distance = accuracy_score(y_test, knn_distance.predict(X_test_scaled))

print(f"Uniform Accuracy: {acc_uniform}")
print(f"Distance Accuracy: {acc_distance}")


In [None]:
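#32. Train a KNN Regressor and analyze the effect of different K values on performance
# Build a regression dataset: y_train currently holds Iris class labels, not a continuous target
X_reg, y_reg = make_regression(n_samples=200, n_features=2, noise=0.2, random_state=42)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)
reg_scaler = StandardScaler()
X_reg_train_scaled = reg_scaler.fit_transform(X_reg_train)
X_reg_test_scaled = reg_scaler.transform(X_reg_test)

k_values = [1,3,5,7,9]
mse_scores = []

for k in k_values:
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_reg_train_scaled, y_reg_train)
    y_pred = knn_reg.predict(X_reg_test_scaled)
    mse = mean_squared_error(y_reg_test, y_pred)
    mse_scores.append(mse)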

print("K values:", k_values)
print("MSE scores:", mse_scores)


In [None]:
#33. Implement KNN Imputation for handling missing values in a dataset
from sklearn.impute import KNNImputer
import numpy as np

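# Create data with missing values (seeded so the result is reproducible)
rng = np.random.default_rng(42)
X_missing = X_train_scaled.copy()
rows = rng.integers(0, X_missing.shape[0], 10)
cols = rng.integers(0, X_missing.shape[1], 10)
X_missing[rows, cols] = np.nan
print("Missing values before imputation:", np.isnan(X_missing).sum())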

# KNN Imputer
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X_missing)

print("Missing values after imputation:", np.isnan(X_imputed).sum())


In [None]:
#34. Train a PCA model and visualize the data projection onto the first two principal components
# Apply PCA with 2 components
pca_2 = PCA(n_components=2)
X_train_pca2 = pca_2.fit_transform(X_train_scaled)
X_test_pca2 = pca_2.transform(X_test_scaled)

# Plot the data projection
plt.scatter(X_train_pca2[:, 0], X_train_pca2[:, 1], c=y_train, cmap='viridis')
plt.title("Projection onto First Two Principal Components")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar()
plt.show()


In [None]:
#35. Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance
# KD Tree Algorithm
knn_kd_tree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_kd_tree.fit(X_train_scaled, y_train)
acc_kd_tree = accuracy_score(y_test, knn_kd_tree.predict(X_test_scaled))

# Ball Tree Algorithm
knn_ball_tree = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
knn_ball_tree.fit(X_train_scaled, y_train)
acc_ball_tree = accuracy_score(y_test, knn_ball_tree.predict(X_test_scaled))

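# Both algorithms perform an exact neighbor search, so the accuracies normally match;
# the trees differ in build and query time, not in the predictions they produce
print(f"KD Tree Accuracy: {acc_kd_tree}")
print(f"Ball Tree Accuracy: {acc_ball_tree}")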


In [None]:
#36. Train a PCA model on a high-dimensional dataset and visualize the Scree plot
# Create high-dimensional data
X_high_dim, _ = make_regression(n_samples=200, n_features=15, noise=0.1, random_state=42)

# Apply PCA
pca_high_dim = PCA()
pca_high_dim.fit(X_high_dim)

# Scree Plot
plt.plot(range(1, len(pca_high_dim.explained_variance_ratio_)+1),
         pca_high_dim.explained_variance_ratio_, marker='o')
plt.title("Scree Plot")
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.show()


In [None]:
#37. Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score
from sklearn.metrics import precision_score, recall_score, f1_score

# Train KNN
knn_class = KNeighborsClassifier(n_neighbors=5)
knn_class.fit(X_train_scaled, y_train)

# Predict
y_pred = knn_class.predict(X_test_scaled)

# Metrics
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")


In [None]:
#38. Train a PCA model and analyze the effect of different numbers of components on accuracy
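# n_components cannot exceed the number of features, so derive the range from the data
components_range = list(range(1, X_train_scaled.shape[1] + 1))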
accuracy_by_components = []

for n in components_range:
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    knn_pca = KNeighborsClassifier(n_neighbors=5)
    knn_pca.fit(X_train_pca, y_train)
    accuracy_by_components.append(accuracy_score(y_test, knn_pca.predict(X_test_pca)))

plt.plot(components_range, accuracy_by_components, marker='o')
plt.title("Effect of Number of Components on Accuracy")
plt.xlabel("Number of PCA Components")
plt.ylabel("Accuracy")
plt.show()


In [None]:
#39. Train a KNN Classifier with different leaf_size values and compare accuracy
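# leaf_size tunes tree build/query speed and memory, not which neighbors are found,
# so the accuracy curve is expected to be flat
leaf_sizes = [10, 20, 30, 40, 50]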
accuracy_by_leaf_size = []

for leaf_size in leaf_sizes:
    knn = KNeighborsClassifier(n_neighbors=5, leaf_size=leaf_size)
    knn.fit(X_train_scaled, y_train)
    accuracy_by_leaf_size.append(accuracy_score(y_test, knn.predict(X_test_scaled)))

plt.plot(leaf_sizes, accuracy_by_leaf_size, marker='o')
plt.title("Effect of Leaf Size on Accuracy")
plt.xlabel("Leaf Size")
plt.ylabel("Accuracy")
plt.show()


In [None]:
#40. Train a PCA model and visualize how data points are transformed before and after PCA
# Apply PCA with 2 components
pca_2 = PCA(n_components=2)
X_train_pca2 = pca_2.fit_transform(X_train_scaled)
X_test_pca2 = pca_2.transform(X_test_scaled)

# Plot before PCA (first two scaled features); a wider figure keeps the panels readable
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, cmap='viridis')
plt.title("Before PCA")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

# Plot after PCA
plt.subplot(1, 2, 2)
plt.scatter(X_train_pca2[:, 0], X_train_pca2[:, 1], c=y_train, cmap='viridis')
plt.title("After PCA")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

plt.show()


In [None]:
#41. Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report

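# Load Wine dataset
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3, random_state=42)

# Scale the features: Wine columns have very different ranges, and the cells below
# expect X_train_scaled/X_test_scaled to match this split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN Classifier
knn_wine = KNeighborsClassifier(n_neighbors=5)
knn_wine.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = knn_wine.predict(X_test_scaled)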
print(classification_report(y_test, y_pred))


In [None]:
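#42. Train a KNN Regressor and analyze the effect of different distance metrics on prediction error
# Build a regression dataset: y_train currently holds Wine class labels, not a continuous target
X_reg, y_reg = make_regression(n_samples=200, n_features=2, noise=0.2, random_state=42)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)
reg_scaler = StandardScaler()
X_reg_train_scaled = reg_scaler.fit_transform(X_reg_train)
X_reg_test_scaled = reg_scaler.transform(X_reg_test)

# Manhattan Distance
knn_reg_manhattan = KNeighborsRegressor(n_neighbors=5, metric='manhattan')
knn_reg_manhattan.fit(X_reg_train_scaled, y_reg_train)
mse_manhattan = mean_squared_error(y_reg_test, knn_reg_manhattan.predict(X_reg_test_scaled))

# Euclidean Distance
knn_reg_euclidean = KNeighborsRegressor(n_neighbors=5, metric='euclidean')
knn_reg_euclidean.fit(X_reg_train_scaled, y_reg_train)
mse_euclidean = mean_squared_error(y_reg_test, knn_reg_euclidean.predict(X_reg_test_scaled))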

print(f"MSE (Manhattan): {mse_manhattan}")
print(f"MSE (Euclidean): {mse_euclidean}")


In [None]:
#43. Train a KNN Classifier and evaluate using ROC-AUC score
from sklearn.metrics import roc_auc_score

# Train KNN
knn_class_roc = KNeighborsClassifier(n_neighbors=5)
knn_class_roc.fit(X_train_scaled, y_train)

# Predict probabilities
y_pred_prob = knn_class_roc.predict_proba(X_test_scaled)

# ROC-AUC Score (multi-class)
roc_auc = roc_auc_score(y_test, y_pred_prob, multi_class='ovr')
print(f"ROC-AUC Score: {roc_auc}")


In [None]:
#44. Train a PCA model and visualize the variance captured by each principal component
# PCA
pca_var = PCA()
pca_var.fit(X_train_scaled)

# Plot variance captured
plt.bar(range(1, len(pca_var.explained_variance_ratio_)+1), pca_var.explained_variance_ratio_)
plt.title("Variance Captured by Each Principal Component")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.show()


In [None]:
#45. Train a KNN Classifier and perform feature selection before training
from sklearn.feature_selection import SelectKBest, f_classif

# Feature selection
selector = SelectKBest(f_classif, k=2)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

# Train KNN on selected features
knn_selected = KNeighborsClassifier(n_neighbors=5)
knn_selected.fit(X_train_selected, y_train)

# Evaluate
acc_selected = accuracy_score(y_test, knn_selected.predict(X_test_selected))
print(f"Accuracy with Feature Selection: {acc_selected}")


In [None]:
#46. Train a PCA model and visualize the data reconstruction error after reducing dimensions
# Apply PCA
pca_reconstruct = PCA(n_components=2)
X_train_pca = pca_reconstruct.fit_transform(X_train_scaled)
X_test_pca = pca_reconstruct.transform(X_test_scaled)

# Inverse transform to get back the original space
X_train_reconstructed = pca_reconstruct.inverse_transform(X_train_pca)
X_test_reconstructed = pca_reconstruct.inverse_transform(X_test_pca)

# Compute reconstruction error
train_reconstruction_error = np.mean((X_train_scaled - X_train_reconstructed) ** 2)
test_reconstruction_error = np.mean((X_test_scaled - X_test_reconstructed) ** 2)

print(f"Train Reconstruction Error: {train_reconstruction_error}")
print(f"Test Reconstruction Error: {test_reconstruction_error}")


In [None]:
#47. Train a KNN Classifier and visualize the decision boundary
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X = data.data[:, :2]  # Take only the first two features for visualization
y = data.target

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Create a mesh grid to plot decision boundaries
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

# Predict on the grid points
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot the decision boundary
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, edgecolors='k', marker='o', cmap='coolwarm')
plt.title("KNN Classifier Decision Boundary")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()


In [None]:
# 48. Train a PCA model and analyze the effect of different numbers of components on data variance
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler # Import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Reload and rescale the Iris dataset so this cell runs on its own
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the data using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Define and assign X_train_scaled

explained_variance_ratios = []
n_components_range = range(1, X_train_scaled.shape[1] + 1)

for n_components in n_components_range:
    pca = PCA(n_components=n_components)
    pca.fit(X_train_scaled)
    explained_variance_ratios.append(np.sum(pca.explained_variance_ratio_))

plt.plot(n_components_range, explained_variance_ratios, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Effect of Number of Components on Explained Variance')
plt.grid(True)
plt.show()