> ### In this guide, we'll walk through the steps involved in principal component analysis (PCA) using plain old Python--without scikit-learn. Afterward, we'll do it the 'easy' way with Scikit-learn's PCA algorithm.


# 1) Import libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# 2) Load the dataset
To load the dataset we will use the pandas library, which offers a wide range of tools for reading and manipulating tabular data.

In [None]:
df = pd.read_csv('../input/hr-comma-sepcsv/HR_comma_sep.csv')

In [None]:
df.rename(columns={'sales':'department','Work_accident':'work_accident','time_spend_company':'years_with_company','left':'left_company'},inplace=True)
columns_names=df.columns.tolist()
print("Columns names:")
print(columns_names)
df.describe()
# df.info()

df.columns.tolist() returns the column names as a Python list. This step simply lets us check every column name in the data. Columns are also called the features of the dataset.

In [None]:
df.shape

In [None]:
df.head(20)

In [None]:
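# Correlation is only defined for numeric columns, so skip 'department' and 'salary'
df.select_dtypes(include='number').corr()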

**df.corr()** computes the pairwise correlation of columns. Correlation shows how each pair of variables is related. Positive values indicate positive correlation; negative values indicate negative correlation.

The magnitude of the value indicates the level/strength of correlation (0 <= |x| <= 1).
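
To make the idea concrete, here is a minimal sketch that computes the Pearson correlation by hand and compares it with pandas. The `toy` frame and the `pearson` helper are hypothetical and only serve as an illustration; they are not part of the HR dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the HR dataset
toy = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0],
    "score": [2.0, 4.1, 5.9, 8.2, 9.8],    # rises with hours
    "errors": [9.0, 7.5, 6.1, 3.9, 2.0],   # falls with hours
})

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_manual = pearson(toy["hours"], toy["score"])
r_pandas = toy.corr().loc["hours", "score"]
r_negative = toy.corr().loc["hours", "errors"]

print(r_manual, r_pandas)   # the two values should agree
print(r_negative)           # close to -1: errors fall as hours rise
```

A value near +1 or -1 means the two columns move almost in lockstep, which is exactly the kind of redundancy PCA exploits later on.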


## Visualising correlation using Seaborn library

In [None]:
correlation = df.select_dtypes(include='number').corr()  # numeric columns only
plt.figure(figsize=(10,10))
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='cubehelix')

plt.title('Correlation between different features')

**Doing some visualisation before moving onto PCA**

In [None]:
df['department'].unique()

Here we are printing all the unique values in the **department** column (named **sales** in the original CSV, before the rename above).

In [None]:
# 'salary' is a text column, so average the numeric columns only
satisfaction_by_dept=df.groupby('department').mean(numeric_only=True)
satisfaction_by_dept.sort_values(by="satisfaction_level", ascending=True, inplace=True)
satisfaction_by_dept

In [None]:
y_pos = np.arange(len(satisfaction_by_dept.index))

plt.barh(y_pos, satisfaction_by_dept['satisfaction_level'], align='center', alpha=0.8)
plt.yticks(y_pos, satisfaction_by_dept.index)

plt.xlabel('Satisfaction level')
plt.title('Mean Satisfaction Level of each department')

# 3) Principal Component Analysis

In [None]:
df.head()

In [None]:
df_drop=df.drop(labels=['department','salary'],axis=1)
df_drop.head()

**df.drop()**  is the method to drop the columns in our dataframe

Now we need to bring the "left_company" column to the front, as it is the label and not a feature.

In [None]:
cols = df_drop.columns.tolist()
cols

Here we convert the columns of the dataframe to a list so that they are easier to reorder. We are going to use the cols.insert method.


In [None]:
cols.insert(0, cols.pop(cols.index('left_company')))

In [None]:
cols

In [None]:
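df_drop = df_drop.reindex(columns=cols)

**df_drop.reindex(columns=cols)** applies the reordered list back to the dataframe, so that left_company becomes the first column. Without this step the slicing below would pick the wrong columns.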
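
The reordering pattern can be sketched on a tiny frame. The `toy` dataframe and its column names are made up for illustration only.

```python
import pandas as pd

# Hypothetical frame with the label column in the middle
toy = pd.DataFrame({"f1": [0.1, 0.2], "f2": [3, 4], "label": [0, 1], "f3": [7, 8]})

order = toy.columns.tolist()
order.insert(0, order.pop(order.index("label")))  # move 'label' to position 0
toy = toy.reindex(columns=order)                  # reindex returns a new frame

print(toy.columns.tolist())
```

Note that `reindex` does not modify the frame in place, so the result has to be assigned back for the new order to take effect.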

Now we are separating features of our dataframe from the labels.

In [None]:
X = df_drop.iloc[:,1:8].values
y = df_drop.iloc[:,0].values
X

In [None]:
y

In [None]:
np.shape(X)

Thus X is now a matrix with 14999 rows and 7 columns

In [None]:
np.shape(y)

y is now a one-dimensional array with 14999 entries, one label per row

# 4) Data Standardisation
Standardization shifts and rescales each attribute so that it has a mean of zero and a standard deviation of one (unit variance). This matters for PCA because the components follow the directions of largest variance, so a feature measured on a large scale would otherwise dominate the result.
Standardization of datasets is also a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data.

In [None]:
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
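
As a rough sketch of what the scaler does, the same transformation can be written by hand in NumPy. The array `X_toy` is synthetic and hypothetical; the only assumption carried over from scikit-learn is that StandardScaler divides by the population standard deviation (`ddof=0`).

```python
import numpy as np

# Synthetic features on very different scales (hypothetical data)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[10.0, 200.0, -3.0], scale=[2.0, 50.0, 0.1], size=(500, 3))

# Subtract the column mean and divide by the column standard deviation (ddof=0)
X_toy_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_toy_std.mean(axis=0).round(6))  # each column mean is ~0
print(X_toy_std.std(axis=0).round(6))   # each column std is ~1
```

After this step every column contributes on the same scale, regardless of its original units.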

# 5) Computing Eigenvectors and Eigenvalues:
Before computing the eigenvectors and eigenvalues, we need to calculate the covariance matrix.

## Covariance matrix

In [None]:
mean_vec = np.mean(X_std, axis=0)
cov_matrix = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix \n%s' % cov_matrix)

In [None]:
print('NumPy covariance matrix: \n%s' % np.cov(X_std.T))

Equivalently, we could have used NumPy's np.cov to calculate the covariance matrix.

In [None]:
plt.figure(figsize=(8,8))
sns.heatmap(cov_matrix, vmax=1, square=True, annot=True, cmap='cubehelix')

plt.title('Covariance between standardized features')

## Eigendecomposition of the covariance matrix

In [None]:
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)

print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
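
The following sketch checks two properties of an eigendecomposition on a small synthetic covariance matrix: each eigenpair satisfies C v = λ v, and the eigenvalues add up to the total variance (the trace of C). The matrix `C_demo` is hypothetical. The sketch also shows `np.linalg.eigh`, NumPy's routine for symmetric matrices, which returns eigenvalues in ascending order; `np.linalg.eig` makes no promise about order.

```python
import numpy as np

# Synthetic data with some correlation between columns (hypothetical)
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 4))
A[:, 1] += 0.8 * A[:, 0]
C_demo = np.cov(A.T)

vals, vecs = np.linalg.eig(C_demo)

# Each column vecs[:, i] is an eigenvector: C v = lambda v
pairs_ok = all(
    np.allclose(C_demo @ vecs[:, i], vals[i] * vecs[:, i])
    for i in range(len(vals))
)

# The eigenvalues sum to the trace, i.e. the total variance
trace_ok = np.isclose(vals.sum(), np.trace(C_demo))

# eigh is meant for symmetric matrices and sorts eigenvalues ascending
vals_h, vecs_h = np.linalg.eigh(C_demo)
same_values = np.allclose(np.sort(vals), vals_h)

print(pairs_ok, trace_ok, same_values)
```

Because the order returned by `np.linalg.eig` is not guaranteed, the eigenpairs should always be sorted explicitly before picking the "top" components.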


# 6) Selecting Principal Components

In order to decide which eigenvector(s) can be dropped without losing too much information when constructing the lower-dimensional subspace, we need to inspect the corresponding eigenvalues. The eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data; those are the ones that can be dropped.

In [None]:
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])

**Explained Variance**
After sorting the eigenpairs, the next question is "how many principal components are we going to choose for our new feature subspace?" A useful measure is the so-called "explained variance," which can be calculated from the eigenvalues. The explained variance tells us how much information (variance) can be attributed to each of the principal components.

In [None]:
tot = sum(eig_vals)
# Sort the eigenvalues so the bars run from the largest component to the smallest
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]

In [None]:
# Principal components are combinations of all features, so label them PC1..PC7
pc_labels = ['PC%d' % (i + 1) for i in range(len(var_exp))]
x_pos = list(range(len(var_exp)))

In [None]:
with plt.style.context('dark_background'):
    plt.figure(figsize=(10, 4))

    plt.barh(x_pos, var_exp, alpha=0.5, align='center',
            label='individual explained variance')
    plt.xlabel('Explained variance (%)')
    plt.ylabel('Principal components')
    plt.yticks(x_pos, pc_labels)
    plt.legend(loc='best')
    plt.tight_layout()

The plot above shows that the largest principal component alone explains roughly 26% of the variance. Four further components explain about 12-15% each. The two smallest components carry the least information, but they cannot be ignored, since together they contribute almost 17% of the variance. Keep in mind that each principal component is a linear combination of all seven standardized features, so these shares belong to the components rather than to individual columns such as satisfaction_level.
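
One common way to turn explained variance into a decision is to keep the smallest number of components whose cumulative share reaches a chosen threshold. The sketch below uses made-up eigenvalues (`eigvals_demo`) and a hypothetical 90% threshold; neither comes from the HR data.

```python
import numpy as np

# Hypothetical eigenvalues, deliberately unsorted
eigvals_demo = np.array([0.4, 2.5, 1.1, 0.7, 0.3])

order = np.argsort(eigvals_demo)[::-1]        # indices from largest to smallest
sorted_vals = eigvals_demo[order]

ratio = sorted_vals / sorted_vals.sum()       # share of variance per component
cumulative = np.cumsum(ratio)                 # running total of those shares

threshold = 0.90
# First position where the running total reaches the threshold, as a count
k = int(np.searchsorted(cumulative, threshold) + 1)

print(ratio.round(3))
print(cumulative.round(3))
print(k)
```

With these example values the first three components reach 86% and the fourth brings the total to 94%, so four components are kept.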

### Projection Matrix

Next we construct the projection matrix that will be used to transform the Human Resources Analytics data onto the new feature subspace. **For illustration, suppose that the 1st and 2nd principal components together carried most of the information, say around 90%** (in this dataset they carry considerably less, as the plot above shows). We could then drop the other components. Here, we are reducing the 7-dimensional feature space to a 2-dimensional feature subspace by choosing the “top 2” eigenvectors with the highest eigenvalues to construct our d×k-dimensional eigenvector matrix W (d = 7, k = 2).


In [None]:
matrix_w = np.hstack((eig_pairs[0][1].reshape(7,1), 
                      eig_pairs[1][1].reshape(7,1)
                    ))
print('Matrix W:\n', matrix_w)

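**Projection Onto the New Feature Space**

In this last step we will use the 7×2-dimensional projection matrix W to transform our samples onto the new subspace via the equation
**Y = X×W**, where X is the standardized data matrix (X_std) and Y holds the projected samples, one row per employee and one column per retained component.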

In [None]:
Y = X_std.dot(matrix_w)
Y
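
A useful sanity check on a projection is that the projected columns are uncorrelated and that their variances equal the retained eigenvalues. The sketch below verifies this on synthetic data; `X_demo`, `W_demo` and `Y_demo` are hypothetical names, and `np.linalg.eigh` is used in place of `np.linalg.eig` because the covariance matrix is symmetric.

```python
import numpy as np

# Synthetic standardized data with one correlated pair of columns (hypothetical)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 5))
X_demo[:, 2] += 0.9 * X_demo[:, 0]
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Eigendecomposition, then sort from largest to smallest eigenvalue
vals, vecs = np.linalg.eigh(np.cov(X_demo.T))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

k = 2
W_demo = vecs[:, :k]          # d x k projection matrix
Y_demo = X_demo @ W_demo      # n x k projected samples

# Covariance of the projection: diagonal, with the top-k eigenvalues on it
cov_Y = np.cov(Y_demo.T)
is_diagonal = np.allclose(cov_Y, np.diag(vals[:k]))

print(Y_demo.shape)
print(is_diagonal)
```

If the off-diagonal entries were not close to zero, the eigenvectors would most likely have been picked in the wrong order or from the wrong axis of the eigenvector matrix.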

# PCA in scikit-learn

In [None]:
from sklearn.decomposition import PCA
pca = PCA().fit(X_std)
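n_comp = np.arange(1, len(pca.explained_variance_ratio_) + 1)  # 1..7 components
plt.plot(n_comp, np.cumsum(pca.explained_variance_ratio_))
plt.xlim(1, 7)  # xlim takes only the left and right limits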
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')

The plot above shows that the first 6 components explain almost 90% of the variance. Therefore we can drop the 7th component.

In [None]:
from sklearn.decomposition import PCA 
sklearn_pca = PCA(n_components=6)
Y_sklearn = sklearn_pca.fit_transform(X_std)

In [None]:
print(Y_sklearn)

In [None]:
Y_sklearn.shape
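
The manual route and scikit-learn should agree, apart from one detail: the sign of each component is arbitrary, so a column may come out flipped. The sketch below compares both approaches on synthetic data. `X_chk` is hypothetical, and `svd_solver="full"` is chosen here only to make the comparison deterministic.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic centered data with correlated columns (hypothetical)
rng = np.random.default_rng(3)
X_chk = rng.normal(size=(400, 4)) @ rng.normal(size=(4, 4))
X_chk = X_chk - X_chk.mean(axis=0)

# Manual PCA: eigenvectors of the covariance matrix, largest eigenvalues first
vals, vecs = np.linalg.eigh(np.cov(X_chk.T))
order = np.argsort(vals)[::-1]
W_chk = vecs[:, order[:2]]
Y_manual = X_chk @ W_chk

# scikit-learn PCA with the same number of components
pca_chk = PCA(n_components=2, svd_solver="full")
Y_sk = pca_chk.fit_transform(X_chk)

# Compare absolute values, since each column may differ by a sign flip
same_up_to_sign = np.allclose(np.abs(Y_manual), np.abs(Y_sk))
same_variances = np.allclose(pca_chk.explained_variance_, vals[order[:2]])

print(same_up_to_sign, same_variances)
```

The attribute `explained_variance_` holds the eigenvalues of the covariance matrix, which is why it matches the manually computed values.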

Thus Principal Component Analysis reduces the dimensionality of a dataset by replacing redundant, correlated features with a smaller set of components, without losing much information. The components are ordered by variance: the first has the highest variance, followed by the second, the third, and so on. PCA is most useful on datasets with 3 or more dimensions, because with higher dimensions it becomes increasingly difficult to interpret the resulting cloud of data directly.

You can find my notebook on Github: 
(https://github.com/nirajvermafcb/Data-Science-with-python)

Here is my notebook for Principal Component Analysis with Scikit-learn:
(https://www.kaggle.com/nirajvermafcb/d/nsrose7224/crowdedness-at-the-campus-gym/principal-component-analysis-with-scikit-learn)