# Visualization & Analysis of Chemical Data
This week we will look at a few ways to visualize and analyze chemical data. We already learned some ways to analyze chemical data in weeks 5 and 6, when we got familiar with RDKit. Plotting data well is an equally important part of data analysis: it gives you a feel for your data and helps you identify patterns.

## Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It is often used to make data easier to explore and visualize, and for dimensionality reduction, which is useful when many features describe each data point. PCA is a linear transformation that finds the directions (principal components) along which the variance of the data is largest. These directions are orthogonal to each other and form a new coordinate system in which the data can be represented. The first principal component is the direction in which the data varies the most, the second principal component is the direction, orthogonal to the first, in which the data varies the second most, and so on. We will plot the first two principal components and thus reduce the dimensionality of the data (going from n to 2 dimensions). For more details have a look at [this blog post](https://towardsdatascience.com/principal-component-analysis-pca-explained-visually-with-zero-math-1cbf392b9e7d).
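
To make the description above concrete, here is a minimal sketch of PCA written with plain NumPy: center the data, take a singular value decomposition, and project onto the first two directions. The data is synthetic and made up for illustration, and all variable names are our own. It is meant to show the idea, not to replace a library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (an assumption for illustration): 200 samples, 5 features
# whose spreads differ strongly, so a few directions dominate the variance.
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

# 1. Center each feature at zero.
Xc = X - X.mean(axis=0)

# 2. SVD of the centered data; the rows of Vt are the principal directions,
#    ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]

# 3. Project onto the first two directions (n = 5 -> 2 dimensions).
X_2d = Xc @ components.T

# Variance captured by each direction, and its share of the total.
explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

print(X_2d.shape)
print(np.round(components @ components.T, 6))  # identity matrix: orthonormal directions
print(explained_ratio)
```

The printed ratios are sorted from largest to smallest, which is exactly the "varies the most, then the second most" ordering described above.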

We will perform PCA with Morgan fingerprints as features and make use of the `molplotly` package, which allows plotting data interactively!

In [None]:
!pip install molplotly
!pip install dash==2.10

In [1]:
# TODO: maybe first let them plot the components without molplotly and then add molplotly for the wow effect

In [None]:
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from sklearn.decomposition import PCA


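def smi_to_fp(smi):
    """Convert a SMILES string into a 1024-bit Morgan fingerprint (radius 2) as a NumPy array."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:  # RDKit returns None for SMILES it cannot parse
        raise ValueError(f"Could not parse SMILES: {smi}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    arr = np.zeros((0,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)  # resizes arr and fills it with the bits
    return arr

# `df` is the DataFrame loaded earlier; it needs a 'smiles' column
df['fp'] = df['smiles'].apply(smi_to_fp)
fps = np.array(df['fp'].tolist())  # shape: (n_molecules, 1024)

# Project the 1024-dimensional fingerprints onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(fps)
df['PCA-1'] = components[:, 0]
df['PCA-2'] = components[:, 1]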
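
Two components rarely capture all the variation in a 1024-bit fingerprint, so it is worth checking how much variance the 2D picture actually retains. The following is a self-contained sketch of that check. The SMILES list is a made-up example (it is not the course dataset), and one entry is deliberately invalid to show how unparsable strings can be skipped.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from sklearn.decomposition import PCA

# Hypothetical example molecules; "C1CC" is deliberately invalid (unclosed ring).
smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "c1ccccc1O", "C1CC"]

fps, kept = [], []
for smi in smiles:
    mol = Chem.MolFromSmiles(smi)  # None if the SMILES cannot be parsed
    if mol is None:
        continue  # skip invalid entries instead of failing
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    arr = np.zeros((0,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    fps.append(arr)
    kept.append(smi)

X = np.stack(fps)  # shape: (n_valid_molecules, 1024)
pca = PCA(n_components=2)
coords = pca.fit_transform(X)

# Fraction of the total variance captured by each of the two components.
ratio = pca.explained_variance_ratio_
print(kept)
print(X.shape, coords.shape)
print(ratio, ratio.sum())
```

If the summed ratio is small, points that look close in the 2D plot may still be quite different molecules, so treat apparent clusters with some caution.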

In [None]:
import plotly.express as px
import molplotly
fig_pca = px.scatter(df,
                     x="PCA-1",
                     y="PCA-2",
                     color='cluster_str',
                     title='PCA of Morgan fingerprints',
                     labels={'cluster_str': 'cluster_str'},
                     width=1200,
                     height=800)

app_pca = molplotly.add_molecules(fig=fig_pca,
                                  df=df,
                                  smiles_col='smiles',
                                  title_col='hash',
                                  caption_cols=['cluster_str'],
                                  color_col='cluster_str',
                                  show_coords=False)

app_pca.run_server(mode='inline', port=8705, height=850)