In [1]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [3]:
df = pd.read_csv('diabetes.csv')
df.head(5)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
df.shape

(768, 9)

In [5]:
# Selected 7 variables for PCA
selected_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction']
df_selected = df[selected_columns]

# Standardize the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_selected)

In [6]:
# Apply PCA
pca = PCA(n_components=3)
principal_components = pca.fit_transform(df_scaled)

# Create a DataFrame with the principal components
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2', 'PC3'])

# Display the explained variance ratios
explained_variance = pca.explained_variance_ratio_
print("Explained Variance Ratios:", explained_variance)

Explained Variance Ratios: [0.29545516 0.17228178 0.14522548]


### Conclusion:

### 1. **0.29545516 (29.55%)**:
   - The **first principal component (PC1)** explains **29.55%** of the total variance in the dataset. This component captures the largest amount of variance compared to the other components, and therefore is the most significant in terms of describing the dataset's variability.

### 2. **0.17228178 (17.23%)**:
   - The **second principal component (PC2)** explains **17.23%** of the total variance. While it contributes less than PC1, it still captures a substantial portion of the dataset's variance.

### 3. **0.14522548 (14.52%)**:
   - The **third principal component (PC3)** explains **14.52%** of the total variance. It captures less than PC1 or PC2 individually, but the three components together account for about 61% of the variability.

### Key Insights:
- **Total explained variance** by the first three principal components is:
  \[
  29.55\% + 17.23\% + 14.52\% = 61.30\%
  \]
  - So, the first three components together explain **61.30%** of the total variance in the dataset.
  
- PCA is often used for **dimensionality reduction**. Here, the seven standardized features can be replaced by three components while retaining most of the variance in the original data.
  
- The trade-off is explicit: the three components keep **61.30%** of the variance and discard the remaining 38.70%, so they give a simplified representation of the dataset rather than a lossless one.
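The running total can be checked directly from the printed ratios. A minimal sketch, with the three values copied by hand from the output of cell 6 (the variable names are illustrative only):

```python
import numpy as np

# Explained variance ratios as printed by cell 6 (copied by hand)
explained_variance = np.array([0.29545516, 0.17228178, 0.14522548])

# Running total after each component
cumulative = np.cumsum(explained_variance)

for i, (ratio, total) in enumerate(zip(explained_variance, cumulative), start=1):
    print(f"PC{i}: {ratio:.2%} (cumulative {total:.2%})")
```

In the notebook itself, the same calculation would be `np.cumsum(pca.explained_variance_ratio_)`.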

If a higher percentage of variance retention is needed (e.g., 80-90%), you might consider including additional principal components.
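One way to act on this is to let scikit-learn choose the number of components for a target variance. When `n_components` is a float between 0 and 1 and `svd_solver="full"`, `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below uses random placeholder data in place of `df_scaled`, so its numbers will not match the diabetes results; the 0.90 target and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data: 200 rows, 7 correlated columns (NOT the diabetes data)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 7))
X_demo = latent @ mixing + 0.5 * rng.normal(size=(200, 7))

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components asks for enough components to pass that variance fraction
pca_90 = PCA(n_components=0.90, svd_solver="full")
scores_90 = pca_90.fit_transform(X_demo_scaled)

print("Components kept:", pca_90.n_components_)
print("Variance retained:", pca_90.explained_variance_ratio_.sum())
```

With the real data, replacing `X_demo_scaled` with `df_scaled` should report how many of the seven components are needed to reach the chosen threshold.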

In [7]:
# Output the transformed data
print(pca_df.head())

        PC1       PC2       PC3
0  0.708054  0.631703  0.145194
1 -0.951638 -0.979558 -1.014125
2 -0.686428  1.608650  2.271100
3 -0.776792 -0.903345 -0.841062
4  2.719940 -2.809888  2.196097


### Conclusion

### Breakdown of each column:
- **PC1**: The values in this column represent the projections of the data onto the first principal component. Since PC1 captures the largest amount of variance (29.55%), these values are the most significant in terms of describing the variability of the data.
  
- **PC2**: This column contains the projections of the data onto the second principal component, which captures 17.23% of the variance. These values describe the second most significant variation in the data.
  
- **PC3**: The values in this column represent the projections onto the third principal component, capturing 14.52% of the variance.
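To make "projection" concrete: each score is the dot product of a centered data row with a component's direction vector, so the output of `fit_transform` agrees with a plain matrix product against `pca.components_`. A sketch on random placeholder data standing in for `df_scaled` (all names here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for df_scaled (NOT the diabetes data)
rng = np.random.default_rng(1)
X_demo = StandardScaler().fit_transform(rng.normal(size=(100, 7)))

pca_demo = PCA(n_components=3)
scores = pca_demo.fit_transform(X_demo)

# Each row of components_ is a unit-length direction in the original feature space.
# Projecting the centered data onto those directions reproduces the scores.
manual_scores = (X_demo - pca_demo.mean_) @ pca_demo.components_.T

print(np.allclose(scores, manual_scores))
```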

### Sample Data Interpretation:
- **Row 0**: [0.708054, 0.631703, 0.145194]
  - All three scores are positive but small (each below 1 in magnitude), so this point lies close to the origin of the transformed space.
  
- **Row 1**: [-0.951638, -0.979558, -1.014125]
  - All three scores are close to -1. The point sits on the negative side of PC1, PC2, and PC3, but at a moderate distance from the origin, much like Row 3.
  
- **Row 2**: [-0.686428, 1.608650, 2.271100]
  - The point has a modest negative score on PC1 but strong positive scores on PC2 and PC3, so it is distinguished mainly by the variation that the second and third components capture.
  
- **Row 3**: [-0.776792, -0.903345, -0.841062]
  - All three scores are moderately negative (between about -0.78 and -0.90), placing this point on the negative side of every component.
  
- **Row 4**: [2.719940, -2.809888, 2.196097]
  - This point has strong positive scores on PC1 and PC3 and a strong negative score on PC2. With every score above 2 in magnitude, it is the farthest from the origin of these five rows.
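Words like "strong" or "far from the origin" are easier to judge with a number attached. A small sketch that computes each sample's Euclidean distance from the origin in PC space, using the five rows printed by cell 7 (copied by hand; variable names are illustrative):

```python
import numpy as np

# First five rows of pca_df, copied from the output of cell 7
rows = np.array([
    [ 0.708054,  0.631703,  0.145194],
    [-0.951638, -0.979558, -1.014125],
    [-0.686428,  1.608650,  2.271100],
    [-0.776792, -0.903345, -0.841062],
    [ 2.719940, -2.809888,  2.196097],
])

# Euclidean distance of each sample from the origin of the 3-D PC space
distances = np.linalg.norm(rows, axis=1)

for i, d in enumerate(distances):
    print(f"Row {i}: distance = {d:.2f}")
```

Of these five samples, Row 4 lies farthest from the origin and Row 0 nearest.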

### Key Insights:
- **Dimensionality Reduction**: The dataset has been reduced from the seven selected features to three principal components, which together retain 61.30% of the variance. These transformed features can now be used for further analysis, such as clustering, regression, or visualization, with reduced complexity.
  
- **Interpretation**: The values in the transformed space (PC1, PC2, and PC3) no longer directly correspond to the original features but instead represent new axes that capture the most significant directions of variability in the data.
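One practical consequence of these new axes is that the component scores are mutually uncorrelated, and their variances appear in decreasing order. A sketch that checks this on random placeholder data (not the diabetes data; every name below is an assumption made for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data with correlated columns (NOT the diabetes data)
rng = np.random.default_rng(3)
latent = rng.normal(size=(300, 2))
X_demo = latent @ rng.normal(size=(2, 7)) + rng.normal(size=(300, 7))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=3)
scores = pca_demo.fit_transform(X_demo_scaled)

# Covariance between the score columns: off-diagonal entries are ~0 (uncorrelated),
# and the diagonal holds each component's variance in decreasing order.
score_cov = np.cov(scores, rowvar=False)
print(np.round(score_cov, 3))
```

The diagonal of `score_cov` should match `pca_demo.explained_variance_`, which is the unnormalized counterpart of `explained_variance_ratio_`.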

In [None]:
# Principal Component Analysis

# Import required packages
# Load the dataset
# Define required helper functions

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from sklearn.feature_selection import mutual_info_regression

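# "seaborn-whitegrid" was renamed "seaborn-v0_8-whitegrid" in Matplotlib 3.6
# and the old name was later removed, so use whichever one is available.
_style = "seaborn-v0_8-whitegrid"
plt.style.use(_style if _style in plt.style.available else "seaborn-whitegrid")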
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)

# NOTE: this plotting helper is a bit tricky, so feel free to skip its details
def plot_variance(pca, width=8, dpi=100):
    # Create figure
    fig, axs = plt.subplots(1, 2)
    n = pca.n_components_
    grid = np.arange(1, n + 1)
    # Explained variance
    evr = pca.explained_variance_ratio_
    axs[0].bar(grid, evr)
    axs[0].set(
        xlabel="Component", title="% Explained Variance", ylim=(0.0, 1.0)
    )
    # Cumulative Variance
    cv = np.cumsum(evr)
    axs[1].plot(np.r_[0, grid], np.r_[0, cv], "o-")
    axs[1].set(
        xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0)
    )
    # Set up figure using the caller's width and dpi instead of hard-coded values
    fig.set(figwidth=width, dpi=dpi)
    return axs

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores


df = pd.read_csv(r"G:\storage 4\drive_2\00_vit vellore\0_teaching assignments\multivariate data analysis\lab\data sets\autos.csv")
df

# We've selected four features that cover a range of properties. Each of these
# features also has a high MI score with the target, price. We'll standardize
# the data since these features aren't naturally on the same scale.

features = ["highway_mpg", "engine_size", "horsepower", "curb_weight"]

X = df.copy()
y = X.pop('price')
X = X.loc[:, features]

# Standardize
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

#Now we can fit scikit-learn's PCA estimator and create the principal components. You can see here the first few rows of the transformed dataset.

from sklearn.decomposition import PCA

# Create principal components
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Convert to dataframe
component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
X_pca = pd.DataFrame(X_pca, columns=component_names)

X_pca.head()


# After fitting, the PCA instance contains the loadings in its components_ attribute.
# (Terminology for PCA is inconsistent, unfortunately. We're following the convention
# that calls the transformed columns in X_pca the components, which otherwise don't
# have a name.) We'll wrap the loadings up in a dataframe.

loadings = pd.DataFrame(
    pca.components_.T,  # transpose the matrix of loadings
    columns=component_names,  # so the columns are the principal components
    index=X.columns,  # and the rows are the original features
)
loadings

# Recall that the signs and magnitudes of a component's loadings tell us what kind
# of variation it has captured. The first component (PC1) shows a contrast between
# large, powerful vehicles with poor gas mileage and smaller, more economical
# vehicles with good gas mileage. We might call this the "Luxury/Economy" axis.
# The next figure shows that our four chosen features mostly vary along the
# Luxury/Economy axis.

# Look at explained variance
plot_variance(pca);
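For readers who want to see where loadings and explained variance come from, they can be reproduced with an eigen-decomposition of the correlation matrix of the features. The following is a minimal NumPy sketch on random placeholder data, not `autos.csv`; names such as `X_demo` and `loadings_demo` are hypothetical. The sign of each eigenvector is arbitrary, so a column may come out flipped relative to scikit-learn's `components_`.

```python
import numpy as np

# Placeholder data standing in for the four selected features (NOT autos.csv)
rng = np.random.default_rng(2)
base = rng.normal(size=(150, 1))
X_demo = np.hstack([base + 0.3 * rng.normal(size=(150, 1)) for _ in range(4)])

# Standardize with the sample standard deviation (ddof=1), as X.std() does in pandas
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=1)

# Covariance of the standardized data = correlation matrix of the original data
corr = np.cov(X_std, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder to put PC1 first
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
loadings_demo = eigenvectors[:, order]  # columns are PC1, PC2, ...

# Each eigenvalue's share of the total is that component's explained variance ratio
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratios:", explained_variance_ratio)
print("PC1 loadings:", loadings_demo[:, 0])
```

Because the four placeholder columns share one underlying signal, most of the variance should land on the first component, which mirrors the situation described for the Luxury/Economy axis.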