In [2]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler



## IMPORTING THE DATASET

In [3]:
# Relative path assumes Universities.csv sits next to the notebook; adjust as needed.
uni_df = pd.read_csv('Universities.csv')

In [None]:
uni_df.head()

In [None]:
uni_df.info()

## DATA PREPROCESSING - Removing Categorical Columns and Rows with Null Values

In [4]:
# Keep the non-object (numeric) columns, then drop every row that still has a missing value.
uni_df = uni_df.select_dtypes(exclude=['object']).dropna()
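The effect of this preprocessing step can be illustrated on a tiny table. The sketch below uses made-up values (not rows from `Universities.csv`) and `select_dtypes(include="number")`, an alternative spelling that keeps only numeric columns. Note that `dropna()` removes a whole row as soon as any one of its values is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from Universities.csv.
toy_df = pd.DataFrame({
    "name": ["A", "B", "C"],               # text column, not usable by PCA
    "tuition": [12000.0, np.nan, 9000.0],  # numeric column with one missing value
    "grad_rate": [80, 65, 70],             # complete numeric column
})

numeric_only = toy_df.select_dtypes(include="number")  # drops the text column
cleaned = numeric_only.dropna()                         # drops the row with the NaN

print(cleaned)
print(f"rows kept: {len(cleaned)} of {len(toy_df)}")
```

Comparing `len(uni_df)` before and after the same step on the real data shows how many universities are lost to missing values.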

## PCA with the Regular Data

In [5]:
pca_original = PCA()

In [6]:
pca_original.fit(uni_df)

In [7]:
# Loadings table: one row per feature, one column per principal component.
uni_df_components = pd.DataFrame(pca_original.components_.T, index=uni_df.columns)

In [8]:
uni_df_components.iloc[:,:5]

### In-Depth Interpretation and Recommendation

Upon examining the loading weights associated with each principal component, a clear pattern emerges in the structure of the first principal component (PC1):

- **Key Insight**: The features *in-state tuition* and *out-of-state tuition* stand out with the highest loading weights, approximately **0.67** and **0.45** respectively. This implies that **PC1 is heavily influenced by these two tuition-related variables**, and they contribute disproportionately to the variance captured in this component.

- **Explanation**: The strong influence most likely comes from the **scale** of the tuition variables. Tuition is measured in dollars, so its values, and therefore its variance, are numerically much larger than those of other variables in the dataset, such as faculty qualifications, student enrollment numbers, or graduation rates. PCA looks for the directions of maximum variance, and scikit-learn's `PCA` centers the data but does not rescale it. On unstandardized data, the features with the largest variance therefore dominate the leading components, whether or not they reflect the underlying structure of the data.

- **Impact**: If not addressed, this imbalance can lead to **biased interpretations**, where the components appear to highlight only the high-variance features while potentially ignoring meaningful patterns hidden in lower-range variables.

- **Recommendation**: To let all features contribute on an equal footing and to avoid a distorted dimensionality reduction, it is **critical to normalize the data** before applying PCA. Normalization (typically z-score standardization) rescales every feature to a mean of 0 and a standard deviation of 1, so that no variable dominates the components purely because of its units.

By normalizing the dataset, we enable PCA to provide a more holistic and balanced view of the variance structure, leading to more accurate insights and interpretations.
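The scale effect described above can be reproduced on synthetic data. The sketch below uses two made-up, correlated features (the names `tuition` and `grad_rate` and all numbers are illustrative assumptions, not values from the dataset) and compares the PC1 loadings with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
z1 = rng.standard_normal(500)
z2 = rng.standard_normal(500)

# Synthetic features on very different scales: dollars vs. percentage points.
tuition = 10_000 + 3_000 * z1
grad_rate = 65 + 15 * (0.6 * z1 + 0.8 * z2)  # correlated with tuition
X = np.column_stack([tuition, grad_rate])

# First principal component on the raw data and on the standardized data.
raw_pc1 = PCA().fit(X).components_[0]
scaled_pc1 = PCA().fit(StandardScaler().fit_transform(X)).components_[0]

print("PC1 loadings, raw data:   ", np.round(raw_pc1, 3))
print("PC1 loadings, scaled data:", np.round(scaled_pc1, 3))
```

On the raw data, the loading on the dollar-valued feature is close to 1 in magnitude and the other is close to 0. After standardization, the two loadings have equal magnitude, so both features contribute to PC1.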

In [9]:
# Standardize every column to zero mean and unit variance.
scaler = StandardScaler()
scaled_data = scaler.fit_transform(uni_df)
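As a sanity check on what `StandardScaler` does, the sketch below compares it with the z-score computed by hand, `z = (x - mean) / std`, on a small made-up array (the values are illustrative only). `StandardScaler` uses the population standard deviation, which corresponds to `ddof=0` in NumPy.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features on very different scales.
X = np.array([[12000.0, 80.0],
              [9000.0, 65.0],
              [15000.0, 70.0],
              [7000.0, 55.0]])

scaled = StandardScaler().fit_transform(X)

# Same computation by hand; ddof=0 matches StandardScaler's standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Note that `fit_transform` returns a NumPy array, so the column names of `uni_df` have to be reattached when the loadings table is built.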

In [10]:
pca_scaled = PCA()
pca_scaled.fit(scaled_data)

In [11]:
df_scaled_components = pd.DataFrame(pca_scaled.components_.T, index=uni_df.columns).iloc[:, :5]

In [12]:
df_scaled_components

### Observations Before and After Normalization (PCA Analysis)

#### Before Normalization (Unscaled Data)

Prior to normalization, the PCA results were heavily influenced by variables with large numerical scales. Specifically:

- **Principal Component 1 (PC1)** was dominated by features such as `in-state tuition` and `out-of-state tuition`, which had the highest loading weights. These variables skewed the analysis due to their high magnitude relative to other features.
- As a result, the principal components reflected the variance of large-scale variables rather than the underlying structure or patterns in the dataset.
- This led to a biased dimensionality reduction, where important but lower-magnitude features, such as graduation rate or student-faculty ratio, were underrepresented.

#### After Normalization (Scaled Data)

After standardizing the data (zero mean and unit variance), the PCA produced more balanced and interpretable components. The key observations are as follows:

- **PC1**: Influenced by a combination of `in-state tuition` (0.397), `out-of-state tuition` (0.371), `room`, `board`, `% new students from top 10%`, and `graduation rate`. This component appears to capture a mix of financial investment, institutional cost, and academic quality.

- **PC2**: Dominated by `# of applications received`, `# accepted`, `# enrolled`, and `# full-time undergraduates`. These are all counts, so PC2 mainly represents the overall size of the institution and the volume of applicants it attracts.

- **PC3**: Characterized by high weights for `estimated book costs` (0.593) and `estimated personal expenses` (0.415), suggesting it reflects the financial burden on students.

- **PC4**: Strong contributions from `room`, `board`, and `additional fees`, with negative weights for academic indicators like `% from top 10%`. This may represent a trade-off between student living costs and academic selectivity.

- **PC5**: Features such as `book costs` and `additional fees` again dominate, reinforcing a dimension related to student expenditure.
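Reading the dominant features off a loadings table by eye is error-prone, so a small helper can rank them by absolute weight. The sketch below is one possible way to do this; `top_loadings` is a hypothetical helper, and the table uses made-up feature names and values rather than the notebook's actual output.

```python
import pandas as pd

def top_loadings(loadings: pd.DataFrame, pc: str, n: int = 3) -> pd.Series:
    """Return the n features with the largest absolute loading on component `pc`."""
    column = loadings[pc]
    order = column.abs().sort_values(ascending=False).index[:n]
    return column.loc[order]  # signed values, ordered by magnitude

# Made-up loadings table: rows are features, columns are components.
loadings = pd.DataFrame(
    {"PC1": [0.40, -0.10, 0.37], "PC2": [0.05, 0.60, -0.02]},
    index=["feature_a", "feature_b", "feature_c"],
)

print(top_loadings(loadings, "PC1", n=2))
print(top_loadings(loadings, "PC2", n=1))
```

To use it on a table like `df_scaled_components`, the columns would first need names, for example `PC1` through `PC5`; by default they are the integers 0 to 4.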

#### Conclusion

The normalization of the data prior to applying PCA was essential to mitigate the influence of variables with disproportionately large scales. After scaling, the principal components offered a more comprehensive and interpretable understanding of the dataset, revealing distinct dimensions such as institutional cost, student body size, academic performance, and student living expenses. This allows for more accurate dimensionality reduction and better-informed analysis.
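One way to make sure the scaling step is never skipped is to chain it with PCA in a scikit-learn pipeline. The sketch below runs on synthetic data standing in for the cleaned numeric table; the number of components (2) and the column scales are arbitrary assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in for the numeric data: 200 rows, 4 columns on different scales.
X = rng.standard_normal((200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

# Standardization always runs before PCA when the pipeline is fitted.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
scores = pipeline.fit_transform(X)

pca_step = pipeline.named_steps["pca"]
print(scores.shape)
print(pca_step.explained_variance_ratio_.round(3))
```

The fitted `PCA` object is still reachable through `named_steps`, so its `components_` can be placed in a loadings table exactly as in the cells above.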
