# Dimensionality Reduction with PCA: EDA on Student Performance

#### **Problem: The guidance counsellor would like to understand the types of students in Year 12 at our school in order to recommend appropriate colleges.**

#### We would like an intuitive way to summarise grades from 8 subjects (8 dimensions) and find patterns to inform the counsellor.

#### Question: Can we capture more than 80% of the variation in our data with 2 principal components (2 dimensions)?

#### If so, we can project the students onto those 2 components and explore the student clusters in our visualisation.

## 1. Applying PCA

In [1]:
# Import statements
from sklearn.decomposition import PCA

import pandas as pd
import plotly.graph_objects as go
import plotly.io as pio

In [3]:
# 1. Read in the student grades data set
df = pd.read_csv('Data/student_grades.csv')
df.head()

Unnamed: 0,student_id,math,science,cs,band,english,history,spanish,physed
0,1,46,48,50,74,34,44,39,73
1,2,66,65,65,66,74,80,75,63
2,3,55,53,50,76,71,72,76,71
3,4,53,57,53,80,77,77,85,82
4,5,55,62,58,67,82,77,78,60


In [4]:
# 2. Drop the student_id column, which is an identifier rather than a grade
grades = df.drop('student_id', axis=1)
grades.head()

Unnamed: 0,math,science,cs,band,english,history,spanish,physed
0,46,48,50,74,34,44,39,73
1,66,65,65,66,74,80,75,63
2,55,53,50,76,71,72,76,71
3,53,57,53,80,77,77,85,82
4,55,62,58,67,82,77,78,60


In [5]:
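# 3. Center the data around 0
# All features are on the same scale, so there is no need for standardisation (if it were needed, sklearn's StandardScaler class could be used)
# Note: sklearn's PCA also centers the data internally, so this step mainly makes the centering explicit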

grades_centered = grades - grades.mean()
grades_centered.head()

Unnamed: 0,math,science,cs,band,english,history,spanish,physed
0,-7.05,-5.22,-2.75,3.96,-26.97,-19.6,-26.56,5.39
1,12.95,11.78,12.25,-4.04,13.03,16.4,9.44,-4.61
2,1.95,-0.22,-2.75,5.96,10.03,8.4,10.44,3.39
3,-0.05,3.78,0.25,9.96,16.03,13.4,19.44,14.39
4,1.95,8.78,5.25,-3.04,21.03,13.4,12.44,-7.61
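
To make the difference between centering and standardising concrete, here is a minimal standard-library sketch on the five `math` grades visible in the `df.head()` preview. It is a toy illustration only: the notebook's means are computed over all students, so the centered values below differ from the output above. The helper names `center` and `zscore` are illustrative; `zscore` mirrors the default behaviour of `StandardScaler` (subtract the mean, divide by the population standard deviation).

```python
from statistics import mean, pstdev

# The five math grades visible in df.head(). The full data set has more rows,
# so these toy means differ from the ones used in the notebook.
math_grades = [46, 66, 55, 53, 55]


def center(values):
    """Subtract the mean so the values are centered on 0."""
    m = mean(values)
    return [v - m for v in values]


def zscore(values):
    """Center, then divide by the population standard deviation."""
    sd = pstdev(values)
    return [c / sd for c in center(values)]


centered = center(math_grades)
standardised = zscore(math_grades)

print(centered)                               # same units as the grades
print([round(z, 2) for z in standardised])    # unitless, spread rescaled to 1
```

Centering keeps the original units, which is what we want when every subject is graded on the same scale. Standardising would give every subject equal variance, which matters only when features are measured in different units.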


In [6]:
# 4. Fit a PCA model with 2 components
pca = PCA(n_components=2)
pca.fit(grades_centered)


In [7]:
# 5. View and interpret the explained variance ratios
pca.explained_variance_ratio_

# INTERPRETATION: together, these two principal components account for about 92% of the variance
# observed in our data (roughly 82% from PC1 and 10% from PC2), which clears the 80% target


array([0.81844937, 0.09778153])
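
The check against the 80% target can be made explicit with a running total. The sketch below copies the two ratios from the output above instead of refitting the model, so it runs on its own; the variable names are illustrative.

```python
from itertools import accumulate

# Ratios copied from the pca.explained_variance_ratio_ output above.
explained_variance_ratio = [0.81844937, 0.09778153]
threshold = 0.80  # target from the question at the top of the notebook

# Running total of the variance explained by the first k components.
cumulative = list(accumulate(explained_variance_ratio))

for k, total in enumerate(cumulative, start=1):
    status = "meets" if total > threshold else "is below"
    print(f"First {k} component(s): {total:.1%} of variance, {status} the 80% target")
```

With the fitted model in memory, the same running total could be obtained with `pca.explained_variance_ratio_.cumsum()`.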

## 2. Interpreting PCA 

In [8]:
# 1. View and interpret the components of the PCA model
pca.components_

array([[ 0.34433892,  0.34662586,  0.32733313, -0.00417335,  0.45552196,
         0.46095972,  0.48354746,  0.01773586],
       [ 0.45069884,  0.44923506,  0.47433583,  0.096715  , -0.33260806,
        -0.31949261, -0.35110809,  0.15725648]])

In [9]:
df.columns

Index(['student_id', 'math', 'science', 'cs', 'band', 'english', 'history',
       'spanish', 'physed'],
      dtype='object')

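**Component interpretations:**

The loadings in `pca.components_` follow the column order of `grades`: math, science, cs, band, english, history, spanish, physed.
* PC1: higher = better grades across the six academic subjects (band and physed have loadings close to 0, so they barely affect this score)
* PC2: higher = relatively stronger in math / science / cs, lower = relatively stronger in english / history / spanish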
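
One way to arrive at these interpretations is to pair each loading with its subject and sort by absolute size. The sketch below copies the values from the `pca.components_` output above; `ranked_loadings` is an illustrative helper, not part of scikit-learn.

```python
# Subject names in the column order of `grades` (student_id already dropped).
subjects = ["math", "science", "cs", "band", "english", "history", "spanish", "physed"]

# Loadings copied from the pca.components_ output above (one row per component).
components = [
    [0.34433892, 0.34662586, 0.32733313, -0.00417335,
     0.45552196, 0.46095972, 0.48354746, 0.01773586],
    [0.45069884, 0.44923506, 0.47433583, 0.096715,
     -0.33260806, -0.31949261, -0.35110809, 0.15725648],
]


def ranked_loadings(component):
    """Pair each subject with its loading, largest absolute value first."""
    return sorted(zip(subjects, component), key=lambda pair: abs(pair[1]), reverse=True)


for name, component in zip(["PC1", "PC2"], components):
    print(name)
    for subject, loading in ranked_loadings(component):
        print(f"  {subject:<8} {loading:+.3f}")
```

In the printed ranking, band and physed sit at the bottom of PC1 with loadings near zero, and the sign split in PC2 separates the STEM subjects from english, history and spanish.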

In [10]:
# 2. Project the students onto the two components, ready for a scatter plot with PC 1 on the x-axis and PC 2 on the y-axis

grades_2d = pd.DataFrame(pca.transform(grades_centered), columns=['PC1', 'PC2'])
grades_2d.head()

Unnamed: 0,PC1,PC2
0,-39.221331,18.961672
1,30.547249,2.93545
2,13.219471,-9.100237
3,24.467554,-11.41809
4,27.082869,-9.821059
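
Each score in this table is a weighted sum: the student's centered grades multiplied by the loadings of the component. The sketch below reproduces the first row by hand, using values copied from the `grades_centered.head()` and `pca.components_` outputs above; `project` is an illustrative helper.

```python
# Centered grades of the first student, copied from grades_centered.head().
student_0 = [-7.05, -5.22, -2.75, 3.96, -26.97, -19.6, -26.56, 5.39]

# Loadings copied from the pca.components_ output (one row per component).
components = [
    [0.34433892, 0.34662586, 0.32733313, -0.00417335,
     0.45552196, 0.46095972, 0.48354746, 0.01773586],
    [0.45069884, 0.44923506, 0.47433583, 0.096715,
     -0.33260806, -0.31949261, -0.35110809, 0.15725648],
]


def project(centered_row, component):
    """Dot product of one centered row with one component's loadings."""
    return sum(x * w for x, w in zip(centered_row, component))


pc1, pc2 = (project(student_0, c) for c in components)
print(f"PC1 = {pc1:.2f}, PC2 = {pc2:.2f}")
```

The result should agree with the first row of `grades_2d` up to rounding, because `pca.transform` subtracts the fitted mean (approximately zero here, since the data was already centered) and then applies the same dot products.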


In [11]:
# Define plotting function for PCA interpretation, using plotly
def pca_2dplot_interactive(data_pca, data, x_col, y_col, label_col, title, xaxis_title, yaxis_title, 
                         marker_color='blue', marker_size=10, annotation_color='black', annotation_size=10):
    """
    Create a Plotly scatter plot with labels for PCA analysis.

    Parameters:
    data_pca (pd.DataFrame): DataFrame containing the principal component scores for the scatter plot.
    data (pd.DataFrame): Original DataFrame containing the label column, with rows in the same order as data_pca.
    x_col (str): Column name for the x-axis values.
    y_col (str): Column name for the y-axis values.
    label_col (str): Column name for the labels to be shown on hover and as annotations.
    title (str): Title of the plot.
    xaxis_title (str): Title for the x-axis.
    yaxis_title (str): Title for the y-axis.
    marker_color (str): Color of the markers. Default is 'blue'.
    marker_size (int): Size of the markers. Default is 10.
    annotation_color (str): Color of the annotation text. Default is 'black'.
    annotation_size (int): Size of the annotation text. Default is 10.

    Returns:
    go.Figure: A Plotly figure object with the scatter plot.

    Example usage:
        fig = pca_2dplot_interactive(grades_2d, df, 'PC1', 'PC2', 'student_id',
                                     'PCA Analysis of Student Performance',
                                     'Better grades -->',
                                     '<-- Better in Humanities     Better in STEM -->')
        fig.show()
    """
    # Create a Plotly figure
    fig = go.Figure()

    # Scatter plot trace
    scatter_trace = go.Scatter(
        x=data_pca[x_col],
        y=data_pca[y_col],
        mode='markers',
        marker=dict(color=marker_color, size=marker_size),
        text=data[label_col],  # Labels for hover
    )

    # Add trace to the figure
    fig.add_trace(scatter_trace)

    # Add data labels using annotations
    for i, label in enumerate(data[label_col]):
        fig.add_annotation(
            x=data_pca[x_col].iloc[i],
            y=data_pca[y_col].iloc[i],
            text=str(label),
            showarrow=False,
            xanchor='left',
            xshift=5,
            font=dict(color=annotation_color, size=annotation_size),
        )

    # Update layout with axis labels
    fig.update_layout(
        title=title,
        xaxis=dict(title=xaxis_title),
        yaxis=dict(title=yaxis_title),
        height=650,
        width=800,
    )

    return fig


In [12]:
figure_PCA = pca_2dplot_interactive(data_pca=grades_2d, data=df,
                                    x_col='PC1', y_col='PC2', label_col='student_id',
                                    title='PCA Analysis of Student Performance',
                                    xaxis_title='Better grades -->',
                                    yaxis_title='<-- Better in Humanities     Better in STEM -->')

In [None]:
figure_PCA.show()

**Recommendations:**
* The students at the top right have high grades overall and are stronger in STEM - **recommend top technical universities for them**
* The students at the top left lean towards STEM but have lower grades overall - **encourage them to pursue STEM majors**
* The students at the bottom left lean towards humanities but have lower grades overall - **encourage them to pursue humanities majors**
* For the remaining students in the middle - **work with them to help them figure out what types of careers they are interested in**
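
As a rough sketch, the grouping above could be turned into a rule on the two scores. The `margin` cut-off of 10 and the `recommend` helper are hypothetical and not taken from the notebook; in practice the boundaries would be chosen by looking at the plot. Any student outside the three named corners falls through to the last recommendation.

```python
def recommend(pc1, pc2, margin=10.0):
    """Map a student's (PC1, PC2) position to one of the four recommendations.

    margin is a hypothetical cut-off that defines how far from the origin
    a student must be to count as belonging to a corner of the plot.
    """
    if pc1 > margin and pc2 > margin:
        return "recommend top technical universities"
    if pc1 < -margin and pc2 > margin:
        return "encourage STEM majors"
    if pc1 < -margin and pc2 < -margin:
        return "encourage humanities majors"
    return "discuss career interests"


# (PC1, PC2) scores of the first five students, copied from grades_2d.head(),
# keyed by the student_id values shown in df.head().
scores = {
    1: (-39.221331, 18.961672),
    2: (30.547249, 2.93545),
    3: (13.219471, -9.100237),
    4: (24.467554, -11.41809),
    5: (27.082869, -9.821059),
}

for student_id, (pc1, pc2) in scores.items():
    print(student_id, recommend(pc1, pc2))
```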