<a href="https://colab.research.google.com/github/rkn2/factorAnalysisExample/blob/master/FactorAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background on the problem


A wall in a building next to a river is deteriorating. What is causing it?

The image below shows how the wall is situated in the building, where the doors are, and where the current sensors are located.

**Your job is to look at the data coming from the sensors, and figure out what they are telling you.**

![Wall layout showing doors and sensor locations](https://drive.google.com/uc?id=1TWhPuLFF7pNcXlycMDmdDWlFxWQJq3DQ)



# Looking at our data


This step just installs any packages we will need to run the analysis code. You don't have to worry about any specifics here.

In [0]:
# Install the pinned factor_analyzer release; the analyze() call used below comes from this older API
!pip install factor_analyzer==0.2.3

# Import required libraries
import pandas as pd
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
import numpy as np


!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Now we will load the data. The cell below reads in the comma-separated values (CSV) sheet that I made in Excel.

If you are given sensor data in Excel and want to export it to CSV in the future, see this link:
https://www.ablebits.com/office-addins-blog/2014/04/24/convert-excel-csv/

In [0]:
url = 'https://raw.githubusercontent.com/rkn2/factorAnalysisExample/master/testDataSmall.csv'
df = pd.read_csv(url)
#df= pd.read_csv("testDataSmall.csv")

Now we can start to process our data.

In [0]:
# List the column names
df.columns

# Drop columns that are not needed for the analysis
df.drop(['corSet', 'notCor', 'true ground water', 'wind driven rain'], axis=1, inplace=True)

# Drop any rows with missing values
df.dropna(inplace=True)

# View summary information if you want
df.info()
#df.head()


Factor analysis code based on: https://www.datacamp.com/community/tutorials/introduction-factor-analysis

**Check to make sure factor analysis makes sense**

1) Bartlett's test of sphericity checks whether the observed variables intercorrelate at all by comparing the observed correlation matrix against the identity matrix. If the test is not statistically significant, you should not use factor analysis.

In [0]:
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value,p_value=calculate_bartlett_sphericity(df)
chi_square_value, p_value

In this Bartlett's test, the p-value is reported as 0. The test is statistically significant, indicating that the observed correlation matrix is not an identity matrix.
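
To make the test less of a black box, here is a minimal sketch of the usual form of Bartlett's statistic, chi-square = -(n - 1 - (2p + 5)/6) * ln|R| with p(p - 1)/2 degrees of freedom, where n is the number of rows, p is the number of variables, and R is the correlation matrix. It runs on synthetic data rather than the sensor data, and `bartlett_statistic` is a hypothetical helper name, not part of `factor_analyzer`.

```python
import numpy as np

def bartlett_statistic(data):
    """Return (chi_square, degrees_of_freedom) for Bartlett's test of sphericity."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # slogdet returns the log-determinant directly, which is safer for near-singular matrices
    _, log_det = np.linalg.slogdet(corr)
    chi_square = -(n - 1 - (2 * p + 5) / 6.0) * log_det
    dof = p * (p - 1) // 2
    return chi_square, dof

# Synthetic data (an assumption for illustration, not the sensor readings)
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))
correlated = shared + 0.5 * rng.normal(size=(200, 4))   # columns share one signal
independent = rng.normal(size=(200, 4))                 # columns are unrelated

chi_corr, dof = bartlett_statistic(correlated)
chi_ind, _ = bartlett_statistic(independent)
print("degrees of freedom:", dof)
print("correlated columns:", round(chi_corr, 1))
print("independent columns:", round(chi_ind, 1))
```

The correlated columns should give a much larger statistic than the independent ones. The p-value is the upper-tail probability of a chi-square distribution with `dof` degrees of freedom at that statistic.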


2) The Kaiser-Meyer-Olkin (KMO) test measures the suitability of the data for factor analysis. It determines the sampling adequacy for each observed variable and for the complete model. KMO estimates the proportion of variance among the observed variables that might be common variance. KMO values range between 0 and 1; higher values indicate data better suited to factor analysis, and a value below 0.6 is considered inadequate.

In [0]:
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all,kmo_model=calculate_kmo(df)

kmo_model

The overall KMO for our data is 0.79, which is comfortably above the 0.6 threshold. This value indicates that you can proceed with your planned factor analysis.
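
As a rough sketch of where the overall KMO number comes from: it compares the squared correlations between variables with the squared partial correlations, which can be obtained from the inverse of the correlation matrix. The example below uses synthetic data, and `kmo_overall` is a hypothetical helper name for illustration, not the library's implementation.

```python
import numpy as np

def kmo_overall(data):
    """Overall KMO: sum(r^2) / (sum(r^2) + sum(partial^2)) over off-diagonal entries."""
    data = np.asarray(data, dtype=float)
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations come from the scaled, negated inverse correlation matrix
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    p2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + p2)

# Synthetic data (an assumption for illustration, not the sensor readings)
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))
correlated = shared + 0.5 * rng.normal(size=(200, 4))
independent = rng.normal(size=(200, 4))

kmo_corr = kmo_overall(correlated)
kmo_ind = kmo_overall(independent)
print("KMO, correlated columns:", round(kmo_corr, 2))
print("KMO, independent columns:", round(kmo_ind, 2))
```

Columns that share a common signal should score closer to 1 than unrelated columns, which is why higher KMO values point towards factor analysis being worthwhile.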

Now we need to choose the number of factors. For this, we can use the Kaiser criterion and a scree plot. Both are based on eigenvalues.

In [0]:
# Create factor analysis object and perform factor analysis
fa = FactorAnalyzer()
fa.analyze(df, 21, rotation=None)
# Check Eigenvalues
ev, v = fa.get_eigenvalues()
ev

Here, you can see that only 6 factors have eigenvalues greater than one. By the Kaiser criterion, this means we should choose only 6 factors (or unobserved variables).
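
The Kaiser criterion itself can be sketched with plain NumPy: take the eigenvalues of the correlation matrix and count how many exceed one. The data below is synthetic (six variables built from two hidden factors, an assumption made purely for illustration), so the count it prints has nothing to do with the sensor data.

```python
import numpy as np

# Synthetic data (an assumption for illustration): six variables driven by two hidden factors
rng = np.random.default_rng(1)
n_rows = 500
factor_a = rng.normal(size=(n_rows, 1))
factor_b = rng.normal(size=(n_rows, 1))
block_a = factor_a + 0.5 * rng.normal(size=(n_rows, 3))
block_b = factor_b + 0.5 * rng.normal(size=(n_rows, 3))
data = np.hstack([block_a, block_b])

# Eigenvalues of the correlation matrix, largest first
corr = np.corrcoef(data, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: keep factors whose eigenvalue is greater than one
n_kaiser = int(np.sum(eigenvalues > 1.0))
print(np.round(eigenvalues, 2))
print("Factors kept by the Kaiser criterion:", n_kaiser)
```

Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue above one means that factor accounts for more variance than a single observed variable would.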

In [0]:
# Create scree plot using matplotlib
plt.scatter(range(1,df.shape[1]+1),ev)
plt.plot(range(1,df.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()

A scree plot draws each factor's eigenvalue in descending order and joins the points with a line. Under the Kaiser criterion, the number of eigenvalues greater than one is taken as the number of factors, which gives 6 factors here.

However, the scree plot also shows that after the first two factors there really isn't much difference between the eigenvalues. Only the first two factors stand out, so we will choose two factors.
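
One simple way to put a number on "where the curve levels off" is to look at the drop between consecutive eigenvalues. The sketch below uses made-up eigenvalues (not this notebook's output) and a largest-drop heuristic, which is only one of several possible elbow rules and is offered here as an assumption rather than the method used above.

```python
import numpy as np

# Illustrative eigenvalues (made up, NOT the notebook's output), sorted in descending order
eigenvalues = np.array([5.0, 3.2, 1.3, 1.2, 1.1, 1.05, 0.9, 0.8])

# Kaiser criterion: count eigenvalues greater than one
n_kaiser = int(np.sum(eigenvalues > 1.0))

# Drop from each eigenvalue to the next; the elbow sits just before the largest drop
drops = -np.diff(eigenvalues)
n_elbow = int(np.argmax(drops)) + 1

print("Kaiser criterion:", n_kaiser)
print("Largest-drop heuristic:", n_elbow)
```

With these illustrative values the two rules disagree in the same way as in the text: several eigenvalues sit just above one, but only the first two are clearly separated from the rest.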

**Perform factor analysis**

In [0]:
# Create factor analysis object and perform factor analysis
fa = FactorAnalyzer()
numFactors = 2
fa.analyze(df, numFactors, rotation="varimax")
fa.loadings

In [0]:
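# Convert the loadings table to an array (rows: variables, columns: factors)
L = np.array(fa.loadings)
# Variable names, in the same order as the rows of L
headings = list(fa.loadings.transpose().keys())
# Only report variables whose absolute loading exceeds this threshold
factor_threshold = 0.25
for i, factor in enumerate(L.transpose()):
  # Sort variables by absolute loading, largest first
  descending = np.argsort(np.abs(factor))[::-1]
  contributions = [(np.round(factor[x], 2), headings[x]) for x in descending if np.abs(factor[x]) > factor_threshold]
  print('Factor %d:' % (i + 1), contributions)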

Factor 1 has high factor loadings for soil moisture, interior and exterior RH up to a certain height, and ground water for the first sensor only.

Factor 2 has high factor loadings only for external RH values all the way up and down the wall, as well as rain duration, intensity, and wind direction.


In [0]:
# Get the variance explained by each factor
fa.get_factor_variance()

In total, the 2 factors explain 41% of the variance (cumulative).
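
For intuition about where a cumulative variance figure comes from, it can be rebuilt from a loadings matrix: square the loadings, sum them per factor, and divide by the number of variables. The loadings below are made up for this sketch and are not the values produced by the notebook.

```python
import numpy as np

# Illustrative loadings (made up for this sketch): 4 variables x 2 factors
loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.6],
    [0.2, 0.5],
])
n_variables = loadings.shape[0]

# Sum of squared loadings per factor
ss_loadings = np.sum(loadings ** 2, axis=0)
# Share of the total (standardized) variance explained by each factor
proportion_var = ss_loadings / n_variables
# Running total across factors
cumulative_var = np.cumsum(proportion_var)

print("SS loadings:   ", np.round(ss_loadings, 3))
print("Proportion var:", np.round(proportion_var, 3))
print("Cumulative var:", np.round(cumulative_var, 3))
```

The last entry of the cumulative row is the total share of variance explained by all retained factors, which is the kind of figure quoted above.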

**Pros and Cons of Factor Analysis**

Factor analysis explores large datasets and finds interlinked associations. It reduces the observed variables into a few unobserved variables or identifies groups of inter-related variables, which helps market researchers compress market situations and find hidden relationships among consumer taste, preference, and cultural influence. It also helps improve questionnaires for future surveys. Factors make for more natural data interpretation.

Results of factor analysis can be controversial. Its interpretations are debatable because more than one interpretation can be made of the same data factors. Identifying and naming the factors also requires domain knowledge.