# Lecture 24 – Data 100, Fall 2024


[Acknowledgments Page](https://ds100.org/fa24/acks/)

In [None]:
import pandas as pd
import numpy as np
import scipy as sp
import plotly.express as px
import seaborn as sns

## Working with High Dimensional Data

In the following cells, we use visualization tools to see how far we can push plotting the MPG dataset as the number of dimensions grows:

In [None]:
mpg = sns.load_dataset("mpg").dropna()
mpg.head()

## Visualizing 1 Dimensional Data

In [None]:
px.histogram(mpg, x="displacement")

## Visualizing 2 Dimensional Data

In [None]:
px.scatter(mpg, x="displacement", y="horsepower")

## Visualizing 3 Dimensional Data

In [None]:
fig = px.scatter_3d(mpg, x="displacement", y="horsepower", z="weight",
                    width=800, height=800)
fig.update_traces(marker=dict(size=3))

## Visualizing 4 Dimensional Data

In [None]:
fig = px.scatter_3d(mpg, x="displacement", 
                    y="horsepower", 
                    z="weight", 
                    color="model_year",
                    width=800, height=800, 
                    opacity=.7)
fig.update_traces(marker=dict(size=5))

## Visualizing 6 Dimensional Data

Try clicking on the origin symbols in the legend to see how the plot changes. 

In [None]:
fig = px.scatter_3d(mpg, x="displacement", 
                    y="horsepower", 
                    z="weight", 
                    color="model_year",
                    size="mpg",
                    symbol="origin",
                    width=900, height=800, 
                    opacity=.7)
# Hide the color bar and fix the axis ranges so they do not rescale when traces are toggled
fig.update_layout(coloraxis_showscale=False,
                  scene=dict(xaxis_range=[50, 500],
                             yaxis_range=[40, 250],
                             zaxis_range=[1000, 5000]))


Visualizing data in high-dimensional space is challenging. Plots like the ones above can be helpful as interactive visualizations, but they are often difficult to interpret in static form.


### Dimensionality Reduction 

One common approach to visualizing high-dimensional data is to use dimensionality reduction techniques. These techniques aim to find a lower-dimensional representation of the data that preserves as much of its structure as possible. PCA, used in the next cell, keeps the directions along which the data varies the most.
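To make "lower-dimensional representation" concrete, here is a minimal sketch of one such technique, PCA, computed directly from the singular value decomposition. The helper name `pca_svd` and the synthetic data are assumptions made for illustration; they are not part of the lecture code, and this is not how `sklearn` is required to implement it internally.

```python
import numpy as np

def pca_svd(X, k):
    """Project the rows of X onto the top-k principal components (illustrative helper)."""
    X_centered = X - X.mean(axis=0)            # PCA works on mean-centered columns
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    Z = X_centered @ Vt[:k].T                  # latent coordinates, shape (n, k)
    var_ratio = S[:k] ** 2 / np.sum(S ** 2)    # share of total variance per component
    return Z, var_ratio

# Synthetic data (made up for this sketch): 3 columns that lie close to a 2-D plane.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, -0.3],
                   [0.2, -1.0, 0.8]])
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 3))

Z_demo, ratio_demo = pca_svd(X_demo, k=2)
print(Z_demo.shape)       # two coordinates per row
print(ratio_demo.sum())   # fraction of total variance kept by the two components
```

Because the synthetic points were generated from two latent variables plus a small amount of noise, two components should retain nearly all of the variance. Real data rarely compresses this cleanly.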

In [None]:
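from sklearn.decomposition import PCA

# Keep the two directions of greatest variance
pca = PCA(n_components=2)

# One-hot encode the categorical "origin" column; numeric columns pass through unchanged
X = pd.get_dummies(mpg[["displacement", "horsepower", "weight", "model_year", "origin", "mpg"]])
zs = pca.fit_transform(X)

# Store the two principal component scores alongside the original columns
mpg[["z1", "z2"]] = zs
mpg.head()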
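The cell above passes the columns to PCA in their original units. Since PCA looks for directions of greatest variance, a column recorded on a much larger numeric scale (here, likely `weight`) can dominate the first component. The sketch below illustrates that effect on synthetic stand-in data, not the real `mpg` columns. The helper `first_component` and the scales chosen are assumptions for illustration only.

```python
import numpy as np

def first_component(X):
    """Return the unit-length loadings of the first principal component (illustrative helper)."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return Vt[0]

# Synthetic stand-ins (not the real mpg columns): two correlated features,
# one recorded on a scale roughly 1000x larger than the other.
rng = np.random.default_rng(1)
base = rng.normal(size=500)
small = base + 0.5 * rng.normal(size=500)
large = 1000 * (base + 0.5 * rng.normal(size=500))
X_raw = np.column_stack([small, large])

# Standardize each column to mean 0 and standard deviation 1.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

raw_loadings = np.abs(first_component(X_raw))
std_loadings = np.abs(first_component(X_std))
print("raw:", raw_loadings.round(3))            # large-scale column carries almost all the weight
print("standardized:", std_loadings.round(3))   # both columns contribute equally
```

Whether to standardize is a modeling choice. Leaving the units as they are, as the lecture cell does, keeps the components in the original scale of the data, while standardizing treats every feature as equally important.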

In [None]:
fig = px.scatter(mpg, x="z1", y="z2", color="model_year", symbol="origin", 
                 hover_data=["displacement", "horsepower", "weight", "name"])
fig.update_layout(legend=dict(x=.92, y=1), xaxis_range=[-1500, 2500], yaxis_range=[-200, 300])

**Return to lecture.**