Hi, I would like to share with the community the simple PCA analysis I performed on this dataset. The results I get are different from what the author obtained (shown in his [blog post](http://rpubs.com/burakh/robobohr)), and I am wondering why. Can anyone help?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from collections import defaultdict
from datetime import datetime
from scipy import stats
from statsmodels.formula.api import ols
import seaborn
import sklearn
# RandomizedPCA was removed from scikit-learn; PCA(svd_solver='randomized') replaces it
from sklearn.decomposition import PCA, SparsePCA
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from scipy import linalg as LA


First we import the data and drop the "pubchem_id" and "Eat" columns from the dataframe. Then we standardize the features (zero mean, unit variance) with StandardScaler.

In [None]:
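m = pd.read_csv('../input/roboBohr.csv')
# Features: every column except the target ('Eat') and the identifier ('pubchem_id')
X = m.drop(['Eat', 'pubchem_id'], axis=1).values
# Standardize each feature to zero mean and unit variance
X = StandardScaler().fit_transform(X)
# Target: the energy column
Y = m['Eat']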

Now it is time to perform PCA with the sklearn library.

In [None]:
pca = PCA()
Xt = pca.fit_transform(X)
pca_score = pca.explained_variance_ratio_
V = pca.components_

We then plot the explained variance ratio, which looks OK. The first two principal components explain nearly 80% of the variance. This is great! We can look at the PCA result on a simple 2D scatter plot!
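
To put a number on that visual impression, here is a minimal sketch of how the cumulative explained variance could be computed. It runs on synthetic data, not on the roboBohr features, and the names `X_demo`, `pca_demo` and `n_for_80` are made up for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a feature matrix (hypothetical data, NOT roboBohr):
# 200 samples, 10 features driven by 2 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

pca_demo = PCA().fit(X_demo)
ratios = pca_demo.explained_variance_ratio_

# Running total: entry k-1 is the share of variance kept by the first k components.
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches 80%.
n_for_80 = int(np.searchsorted(cumulative, 0.80) + 1)
print(cumulative[:3], n_for_80)
```

On the real data the same two lines (`np.cumsum` and `np.searchsorted`) could be applied to `pca_score` directly.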

In [None]:
fig = plt.figure(figsize=(16, 6))
ax1 = fig.add_subplot(111)
ax1.set_ylim([0,1])
lin1 = ax1.scatter(range(len(pca_score)), pca_score, c='b', label='no random noise')
plt.show()

So let's plot the first two principal components against one another. We also color the dots according to their energy value.

In [None]:
fig = plt.figure(figsize=(16, 6))
ax2 = fig.add_subplot(111)
ax2.scatter(Xt[:,0], Xt[:, 1], c=Y)
plt.show()

The plot looks great, but it looks nothing like the beautiful conchoid-shaped plot the author of this dataset presents in his [blog post](http://rpubs.com/burakh/robobohr). This is somewhat disappointing... Maybe something is wrong with the PCA package of sklearn? Let's try implementing PCA ourselves with linear algebra tools.

In [None]:
# Eigendecomposition of the covariance matrix (eigh is meant for symmetric matrices)
cov = np.cov(X, rowvar=False)
evals, evecs = LA.eigh(cov)

# Sort the eigenvectors by decreasing eigenvalue
idx = np.argsort(evals)[::-1]
evecs = evecs[:, idx]
evals = evals[idx]

# Project the data onto the eigenvectors (X is already centered by StandardScaler)
Xt2 = np.dot(X, evecs)

In [None]:
fig = plt.figure(figsize=(16, 6))
ax2 = fig.add_subplot(111)
ax2.scatter(Xt2[:,0], Xt2[:, 1], c=Y)
plt.show()

The plot looks similar, but with the sign of one of the components inverted. That's not the end of the world: the sign of an eigenvector is arbitrary, so two correct PCA implementations can disagree on it.
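
To convince myself that a sign flip is the only difference between the two approaches, a check along these lines can help. It is a sketch on synthetic data with well-separated variances (the names `X_demo`, `scores_sklearn`, `scores_eigh` and `signs` are made up here), not a run on the roboBohr features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data with clearly different variances along each axis,
# so that every principal direction is well defined (up to its sign).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4)) * np.array([5.0, 3.0, 1.5, 0.5])
X_demo = X_demo - X_demo.mean(axis=0)

# Projection from sklearn
scores_sklearn = PCA().fit_transform(X_demo)

# Projection from an eigendecomposition of the covariance matrix
evals, evecs = np.linalg.eigh(np.cov(X_demo, rowvar=False))
order = np.argsort(evals)[::-1]
scores_eigh = X_demo @ evecs[:, order]

# For each component, pick the sign (+1 or -1) that aligns the two projections
signs = np.sign(np.sum(scores_sklearn * scores_eigh, axis=0))
aligned = scores_eigh * signs

print(signs)
print(np.allclose(aligned, scores_sklearn))
```

If the two projections match after this alignment, the manual implementation and sklearn agree, and the mismatch with the blog post has to come from somewhere else.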

What is more problematic is not being able to reproduce the results shown on the [blog post](http://rpubs.com/burakh/robobohr). 

Does anyone have any idea what is going wrong? I am totally mystified here... I do not feel comfortable going further with the analysis before being sure everything is all right with the dataset.
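
One difference I have not ruled out is preprocessing. I do not know whether the blog post scales the features before PCA, so this is only a guess, but scaling can change the result a lot. The sketch below uses made-up data (`X_demo`, `pca_raw` and `pca_std` are hypothetical names) to show how strongly the explained variance depends on it.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical, independent features on very different scales (NOT roboBohr)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3)) * np.array([100.0, 1.0, 0.01])

# PCA on the raw data: sklearn centers the data but does not rescale it
pca_raw = PCA().fit(X_demo)

# PCA on standardized data: every feature has unit variance beforehand
pca_std = PCA().fit(StandardScaler().fit_transform(X_demo))

print(pca_raw.explained_variance_ratio_)
print(pca_std.explained_variance_ratio_)
```

Without scaling, the large-scale feature dominates the first component. With scaling, the variance is spread much more evenly. Rerunning the analysis above without the StandardScaler step would show whether this explains the difference.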

In [None]:
# I tried dropping the first column, but it didn't help much
X = X[:,1:]
pca = PCA()
Xt = pca.fit_transform(X)
fig = plt.figure(figsize=(16, 6))
ax2 = fig.add_subplot(111)
ax2.scatter(Xt[:,0], Xt[:, 1], c=Y)
plt.show()