
# IS-4100: PCA and Clustering with NFL Data

**Objective:**

This lab introduces principal component analysis (PCA) and clustering as tools for analyzing and interpreting NFL data. You will explore how dimensionality reduction can simplify a dataset and how clustering can reveal patterns in team or player performance metrics.

---

## Section 1: Data Preparation and Exploration

### Load the Data
- Use either `nflfastR` (R) or `nfl_data_py` (Python) to load play-by-play data or season stats for a specific range of seasons.
- Filter the data to focus on key columns such as `yards_gained`, `pass_attempt`, `rush_attempt`, `touchdown`, `interception`, `sack`, etc.
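
The filtering step can be sketched in Python as follows. This is a minimal sketch, not a required implementation: the helper `select_key_columns` is a hypothetical name, the tiny frame is made-up stand-in data, and the `game_id` and `posteam` columns are assumed to be present as they are in nflfastR-style play-by-play data.

```python
import pandas as pd

# Columns named in this lab, plus identifiers assumed to exist in the data.
KEY_COLUMNS = ["game_id", "posteam", "yards_gained", "pass_attempt",
               "rush_attempt", "touchdown", "interception", "sack"]

def select_key_columns(pbp: pd.DataFrame, columns=KEY_COLUMNS) -> pd.DataFrame:
    """Keep the requested columns that exist, for rows with a possessing team."""
    present = [c for c in columns if c in pbp.columns]
    has_team = pbp["posteam"].notna()
    return pbp.loc[has_team, present].reset_index(drop=True)

# With nfl_data_py installed and a network connection, real data could be
# loaded along these lines (not executed in this sketch):
#   import nfl_data_py as nfl
#   pbp = nfl.import_pbp_data([2022, 2023])

# Made-up rows standing in for play-by-play data.
toy_pbp = pd.DataFrame({
    "game_id": ["g1", "g1", "g1", "g2"],
    "posteam": ["KC", "KC", None, "BUF"],
    "yards_gained": [5, 12, 0, -3],
    "pass_attempt": [1, 0, 0, 1],
    "rush_attempt": [0, 1, 0, 0],
    "touchdown": [0, 1, 0, 0],
    "interception": [0, 0, 0, 0],
    "sack": [0, 0, 0, 1],
    "desc": ["pass", "run", "timeout", "sack"],
})

plays = select_key_columns(toy_pbp)
print(plays)
```

Keeping only the columns that are actually present makes the helper tolerant of datasets that lack one of the listed fields.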

### Feature Engineering
- Create aggregated metrics for each team or player, such as:
  - Average yards per game
  - Touchdowns per game
  - Passing and rushing attempts per game
  - Average turnovers per game
- Ensure the final dataset has all numeric columns necessary for PCA and clustering.
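
One way to build the per-game metrics listed above is to total each statistic within a game and then average those totals per team. The sketch below assumes team-level aggregation, made-up input rows, and a `fumble_lost` column (as in nflfastR-style data) for counting turnovers; the function name `team_per_game_features` is hypothetical.

```python
import pandas as pd

def team_per_game_features(plays: pd.DataFrame) -> pd.DataFrame:
    """Average per-game totals for each offensive team."""
    # Assumption: a turnover is an interception or a lost fumble.
    plays = plays.assign(turnover=plays["interception"] + plays["fumble_lost"])
    # Step 1: total each statistic within every (team, game) pair.
    per_game = plays.groupby(["posteam", "game_id"]).agg(
        yards=("yards_gained", "sum"),
        touchdowns=("touchdown", "sum"),
        pass_attempts=("pass_attempt", "sum"),
        rush_attempts=("rush_attempt", "sum"),
        turnovers=("turnover", "sum"),
    )
    # Step 2: average the game totals for each team.
    return per_game.groupby(level="posteam").mean().add_suffix("_per_game")

# Made-up plays: KC appears in two games, BUF in one.
toy_plays = pd.DataFrame({
    "posteam": ["KC", "KC", "KC", "KC", "BUF", "BUF"],
    "game_id": ["g1", "g1", "g2", "g2", "g1", "g1"],
    "yards_gained": [10, 20, 5, 15, 7, 3],
    "touchdown": [1, 0, 0, 0, 0, 1],
    "pass_attempt": [1, 1, 0, 1, 1, 0],
    "rush_attempt": [0, 0, 1, 0, 0, 1],
    "interception": [0, 1, 0, 0, 0, 0],
    "fumble_lost": [0, 0, 0, 1, 0, 0],
})

features = team_per_game_features(toy_plays)
print(features)
```

The result has one row per team and only numeric columns, which is the shape PCA and clustering expect.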

### Data Cleaning
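- Check for missing values and handle them appropriately (for example, by dropping or imputing them).
- Standardize or normalize the features so they are comparable; PCA and distance-based clustering are both sensitive to feature scale.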
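
A minimal cleaning sketch, assuming median imputation is acceptable for your data and using scikit-learn's `StandardScaler` for z-scoring. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up team metrics with two missing entries.
features = pd.DataFrame(
    {
        "yards_per_game": [350.0, 310.0, np.nan, 290.0],
        "touchdowns_per_game": [2.8, 2.1, 2.4, 1.9],
        "turnovers_per_game": [1.0, 1.4, 1.2, np.nan],
    },
    index=["T01", "T02", "T03", "T04"],
)

# Count missing values per column before deciding how to handle them.
missing_counts = features.isna().sum()
print(missing_counts)

# One simple option (an assumption, not the only choice): fill with the median.
cleaned = features.fillna(features.median())

# Rescale every column to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(cleaned),
    index=cleaned.index,
    columns=cleaned.columns,
)
print(scaled.round(2))
```

Dropping incomplete rows with `dropna()` is a reasonable alternative when only a few observations are affected.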

---

## Section 2: Principal Component Analysis (PCA)

### Perform PCA
- Apply PCA to the standardized dataset and reduce it to two or three principal components.
- Use a scree plot (explained variance per component) to check how many components capture the majority of the variance.
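
The numbers behind a scree plot can be computed with scikit-learn's `PCA`, as in the sketch below. The data is synthetic (random, not real NFL statistics), and the 80% cutoff is an assumed threshold chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 32-team x 5-metric table.
rng = np.random.default_rng(0)
base = rng.normal(size=(32, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.3 * rng.normal(size=32),  # correlated with column 0
    base[:, 1],
    base[:, 1] + 0.3 * rng.normal(size=32),  # correlated with column 2
    rng.normal(size=32),                     # independent noise
])
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to see how variance is distributed.
pca = PCA().fit(X_scaled)
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

threshold = 0.80  # assumed cutoff for "majority of variance"
n_keep = int(np.argmax(cumulative >= threshold) + 1)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.1%} of variance, {c:.1%} cumulative")
print(f"Components needed to reach {threshold:.0%}: {n_keep}")
```

Plotting `ratios` against the component number gives the scree plot itself; the point where the curve flattens is the usual visual cue for how many components to keep.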

### Interpret PCA Results
- Examine the component loadings to understand which metrics contribute most to each principal component.
- Discuss how PCA has simplified the data and retained the most critical information.
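
Component loadings can be read from the fitted model's `components_` attribute. The sketch below uses random data with illustrative feature names, so the specific loadings carry no football meaning. Note that `components_` holds unit-length directions; some texts instead define loadings as these directions scaled by the square root of each component's variance.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative feature names matching the metrics suggested in this lab.
feature_names = [
    "yards_per_game", "touchdowns_per_game", "pass_attempts_per_game",
    "rush_attempts_per_game", "turnovers_per_game",
]

# Random stand-in data: 32 rows, one column per feature.
rng = np.random.default_rng(1)
X_scaled = StandardScaler().fit_transform(
    rng.normal(size=(32, len(feature_names)))
)

pca = PCA(n_components=2).fit(X_scaled)

# Rows are original features, columns are principal components.
loadings = pd.DataFrame(
    pca.components_.T, index=feature_names, columns=["PC1", "PC2"]
)
print(loadings.round(2))

# Rank features by the size of their contribution to PC1 (sign ignored).
pc1_ranked = loadings["PC1"].abs().sort_values(ascending=False)
print("Largest contributor to PC1:", pc1_ranked.index[0])
```

Features with large absolute loadings on the same component and the same sign move together along that component; opposite signs indicate a trade-off.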

### Visualization
- Plot the data points in a 2D or 3D scatterplot using the principal components as axes.
- Label data points by team or player for better insights.
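
A labeled 2D scatterplot can be sketched with matplotlib as shown here. The team labels (`T01`, `T02`, ...) and the data are placeholders, and the `Agg` backend is selected only so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder labels and random stand-in metrics.
rng = np.random.default_rng(2)
team_labels = [f"T{i:02d}" for i in range(1, 13)]
X = rng.normal(size=(len(team_labels), 5))

# Project the standardized data onto the first two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(X))
var = pca.explained_variance_ratio_

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(scores[:, 0], scores[:, 1])

# Annotate each point with its label, offset slightly from the marker.
for label, (x, y) in zip(team_labels, scores):
    ax.annotate(label, (x, y), xytext=(4, 4),
                textcoords="offset points", fontsize=8)

ax.set_xlabel(f"PC1 ({var[0]:.0%} of variance)")
ax.set_ylabel(f"PC2 ({var[1]:.0%} of variance)")
ax.set_title("Teams in principal component space")
fig.tight_layout()
```

Call `fig.savefig(...)` to write the figure to a file, or drop the `matplotlib.use("Agg")` line and call `plt.show()` in an interactive session.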

---

## Section 3: Clustering

### Choosing a Clustering Algorithm
- Select either K-means or hierarchical clustering for this analysis.
- Determine an appropriate number of clusters by using techniques like the elbow method or silhouette score.
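
Both the elbow method and the silhouette score can be computed in one loop over candidate values of k. The sketch below assumes K-means and uses three deliberately well-separated synthetic blobs, so the answer is easy to verify; real NFL data will usually be far less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs in 2D (stand-in for PCA scores).
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(12, 2)) for c in centers])

inertia, silhouette = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = km.inertia_                          # elbow method input
    silhouette[k] = silhouette_score(X, km.labels_)   # higher is better
    print(f"k={k}: inertia={inertia[k]:.1f}, silhouette={silhouette[k]:.3f}")

# Pick the k with the highest average silhouette score.
best_k = max(silhouette, key=silhouette.get)
print("Best k by silhouette:", best_k)
```

For the elbow method, plot `inertia` against k and look for the point where further increases in k stop producing large drops. On these blobs both criteria should point to k = 3.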

### Run Clustering Algorithm
- Apply the chosen clustering algorithm to the PCA-reduced dataset.
- Assign a cluster label to each team or player based on the results.
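
The full chain of scaling, PCA, and clustering might look like the sketch below. It runs both K-means and Ward-linkage hierarchical clustering so the two options can be compared; the data is random, and k = 3 is an assumed value rather than a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for a team-by-metric feature table.
rng = np.random.default_rng(4)
teams = [f"T{i:02d}" for i in range(1, 17)]
features = pd.DataFrame(
    rng.normal(size=(len(teams), 5)),
    index=teams,
    columns=["yards", "touchdowns", "pass_attempts",
             "rush_attempts", "turnovers"],
)

# Standardize, then reduce to two principal components.
scaled = StandardScaler().fit_transform(features)
scores = PCA(n_components=2).fit_transform(scaled)

k = 3  # assumed number of clusters; choose it from your own diagnostics

# Option 1: K-means on the PCA scores.
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scores)

# Option 2: hierarchical (agglomerative) clustering with Ward linkage.
hier_labels = AgglomerativeClustering(
    n_clusters=k, linkage="ward"
).fit_predict(scores)

# Attach the labels to each team alongside its PCA coordinates.
result = pd.DataFrame(scores, index=features.index, columns=["PC1", "PC2"])
result["kmeans_cluster"] = kmeans.labels_
result["hier_cluster"] = hier_labels
print(result.head())
```

Cluster numbers are arbitrary identifiers, so cluster 0 from K-means need not correspond to cluster 0 from the hierarchical method.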

### Visualize Clusters
- Create a scatter plot showing the clusters with different colors, and label key data points.
- Discuss the composition of each cluster (e.g., are certain teams or players consistently high-performing in specific metrics?).
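
A colored cluster plot with a legend and a few labeled points can be sketched as follows. The coordinates and cluster labels are synthetic stand-ins for your PCA scores and clustering output, and labeling the most extreme point in each cluster is just one assumed way to pick "key" data points.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic 2D scores with known group membership (stand-in for real output).
rng = np.random.default_rng(5)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
scores = np.vstack([c + 0.6 * rng.normal(size=(8, 2)) for c in centers])
clusters = np.repeat([0, 1, 2], 8)
names = [f"T{i:02d}" for i in range(1, 25)]

fig, ax = plt.subplots(figsize=(7, 5))
points = ax.scatter(scores[:, 0], scores[:, 1], c=clusters, cmap="tab10")

# Build one legend entry per distinct cluster value.
handles, legend_labels = points.legend_elements()
legend = ax.legend(handles, legend_labels, title="Cluster")

# Label only the point farthest from the origin in each cluster
# to keep the plot readable.
distance = np.linalg.norm(scores, axis=1)
for c in np.unique(clusters):
    members = np.flatnonzero(clusters == c)
    idx = members[np.argmax(distance[members])]
    ax.annotate(names[idx], scores[idx], xytext=(4, 4),
                textcoords="offset points")

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
fig.tight_layout()
```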

---

## Section 4: Interpretation and Analysis

### Analyze Cluster Characteristics
- Compare clusters to identify patterns, such as clusters of teams with strong passing vs. rushing stats or players with high-risk, high-reward play styles.
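
Comparing clusters usually starts with a profile table of per-cluster means on the original (unscaled) metrics. The sketch below uses six made-up teams and assumed cluster labels; it also expresses each cluster mean as a distance from the overall mean in standard deviation units, which makes pass-heavy and run-heavy groups easier to spot.

```python
import pandas as pd

# Made-up metrics and cluster assignments for six placeholder teams.
features = pd.DataFrame(
    {
        "pass_attempts_per_game": [40, 38, 28, 26, 33, 34],
        "rush_attempts_per_game": [20, 22, 32, 34, 27, 26],
        "cluster": [0, 0, 1, 1, 2, 2],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

# Mean of each metric within each cluster.
profile = features.groupby("cluster").mean()
print(profile)

# Cluster means relative to the overall mean, in standard deviation units.
metrics = features.drop(columns="cluster")
relative = (profile - metrics.mean()) / metrics.std(ddof=0)
print(relative.round(2))
```

In this toy table, cluster 0 sits above the overall average in pass attempts and cluster 1 above it in rush attempts, which is the kind of contrast to describe in your write-up.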

### Discuss Findings
- Write a summary explaining how PCA helped to reduce data complexity and what the clusters reveal about NFL team or player performance.


---

## Questions for Reflection
1. How did PCA simplify the dataset, and what insights were preserved?
2. Were there any clusters that grouped similar types of players or teams? Describe these patterns.
3. If you were to adjust the clustering parameters, what changes might you explore?