## Performing Multivariate Analysis

### Implementing Cluster Analysis on Multiple Variables using Kmeans

### Focus: The most commonly used clustering algorithm, the *Kmeans clustering algorithm*

##### This is a centroid-based algorithm that splits the data into K clusters, where K is usually predefined by the user. The goal is to minimize the variance of the data points within their corresponding clusters.
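To make the assign-and-update idea concrete, here is a minimal NumPy sketch of the basic iteration (often called Lloyd's algorithm). It is an illustration on synthetic data, not what `sklearn.cluster.KMeans` does internally: the function name `kmeans_sketch`, the toy array `X_demo`, and the plain random initialization (rather than k-means++) are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Basic assign/update loop; returns (labels, centroids, wcss)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (not k-means++).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids, then the within-cluster sum of squares.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss

# Two synthetic, well-separated groups of points (illustrative only).
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids, wcss = kmeans_sketch(X_demo, k=2)
total_ss = float(((X_demo - X_demo.mean(axis=0)) ** 2).sum())
```

Splitting the points into clusters can only reduce the sum of squares relative to a single cluster centred on the overall mean, which is why `wcss` ends up below `total_ss`.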

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [None]:
data = pd.read_csv('marketing_campaign.csv')
data = data[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds', 
             'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth']]
data.head()

In [None]:
data.isnull().sum()

In [None]:
data = data.dropna()  # reassign rather than modify a column slice in place
data.shape

##### Scale the data using the StandardScaler class

In [None]:
scaler  = StandardScaler()
data_scaled = scaler.fit_transform(data)
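Scaling matters here because Kmeans relies on distances, so a variable with a large range would otherwise dominate. As a small sketch of what standardization computes, the snippet below reproduces it by hand with NumPy on a hypothetical three-row array (`X_demo` is made up for illustration): each column is centred on its mean and divided by its population standard deviation.

```python
import numpy as np

# Hypothetical spend values for three customers on two products.
X_demo = np.array([[10.0, 200.0],
                   [20.0, 400.0],
                   [30.0, 600.0]])

# StandardScaler uses the population standard deviation (ddof=0).
means = X_demo.mean(axis=0)
stds = X_demo.std(axis=0, ddof=0)
X_standardized = (X_demo - means) / stds
```

After this step every column has mean 0 and standard deviation 1, so both products contribute on the same scale.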

##### Build Kmeans model

In [None]:
kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 1)
kmeans.fit(data_scaled)

##### Visualizing the Kmeans clusters using *seaborn* and *matplotlib*

In [None]:
label = kmeans.fit_predict(data_scaled)
marketing_data_test = data.copy()
marketing_data_test['label'] = label
marketing_data_test['label'] = marketing_data_test['label'].astype(str)

In [None]:
plt.figure(figsize= (18,10))
sns.scatterplot(x= marketing_data_test['MntWines'], y= marketing_data_test['MntFruits'], hue = marketing_data_test['label'])

### Choosing the optimal number of clusters in Kmeans


##### One of the major drawbacks of the Kmeans clustering algorithm is that the number of clusters, K, must be predefined by the user. One of the commonly used techniques to address this is the elbow method. It uses the *Within-Cluster Sum of Squares (WCSS)*, also called inertia, to find a suitable number of clusters: WCSS is plotted against K, and the point where the curve stops dropping sharply (the elbow) suggests a reasonable K.
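As a small worked example of what WCSS measures, the sketch below computes it by hand for four hypothetical points, once with a single cluster and once with two. The helper `wcss` and the toy arrays are assumptions for illustration; in the notebook the same quantity is read from the fitted model's `inertia_` attribute.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# K = 1: every point belongs to one cluster centred on the overall mean.
labels_k1 = np.zeros(4, dtype=int)
centroids_k1 = X_demo.mean(axis=0, keepdims=True)

# K = 2: the left pair and the right pair form separate clusters.
labels_k2 = np.array([0, 0, 1, 1])
centroids_k2 = np.array([[0.0, 1.0], [10.0, 1.0]])

wcss_k1 = wcss(X_demo, labels_k1, centroids_k1)  # 104.0
wcss_k2 = wcss(X_demo, labels_k2, centroids_k2)  # 4.0
```

The large drop from K = 1 to K = 2 is the kind of bend the elbow method looks for; adding further clusters to these four points could only reduce WCSS by a small remaining amount.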

In [None]:
# Continue with the marketing data loaded earlier
data.head()

In [None]:
data.shape

In [None]:
data = data[['MntWines','MntFruits', 'MntMeatProducts', 'MntFishProducts', 
                                 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 
                                 'NumWebPurchases','NumCatalogPurchases', 'NumStorePurchases', 
                                 'NumWebVisitsMonth']]
data.head()

In [None]:
data.isnull().sum()

In [None]:
data = data.dropna()  # reassign rather than modify a column slice in place
data.shape

##### Scaling

In [None]:
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
data.head()

##### Building the model

In [None]:
kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 1)
kmeans.fit(data_scaled)

##### Examine the Kmeans cluster output:

In [None]:
label = kmeans.fit_predict(data_scaled)
data_output = data.copy()
data_output['cluster'] = label
data_output['cluster'].value_counts()

##### Using the elbow method to find optimal number of K clusters

In [None]:
distance_values = []
for cluster in range(1,14):
    # Fix random_state so the inertia values are reproducible between runs
    kmeans = KMeans(n_clusters = cluster, init='k-means++', random_state = 1)
    kmeans.fit(data_scaled)
    distance_values.append(kmeans.inertia_)

cluster_output = pd.DataFrame({'Cluster':range(1,14), 'distance_values':distance_values})
plt.figure(figsize=(12,6))
plt.plot(cluster_output['Cluster'], cluster_output['distance_values'], marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')

### Profiling Kmeans Clusters

##### Profiling gives us a sense of what each cluster looks like. For numerical fields, the approach is to find the mean of each field per cluster. For categorical fields, we can find the percentage occurrence of each category per cluster.
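The notebook only profiles numerical fields, so here is a brief sketch of both cases on a made-up frame. The DataFrame `demo` and its `Education` column are hypothetical and used purely to illustrate the categorical percentage calculation with `pd.crosstab`.

```python
import pandas as pd

# Hypothetical clustered output; 'Education' is an illustrative categorical column.
demo = pd.DataFrame({
    'cluster':   [0, 0, 1, 1, 1],
    'MntWines':  [100, 300, 10, 20, 30],
    'Education': ['PhD', 'Master', 'Master', 'Master', 'PhD'],
})

# Numerical field: mean per cluster.
wine_mean = demo.groupby('cluster')['MntWines'].mean()

# Categorical field: percentage of each category within each cluster (rows sum to 100).
education_pct = pd.crosstab(demo['cluster'], demo['Education'], normalize='index') * 100
```

Here cluster 0 averages 200 on wine against 20 for cluster 1, and two thirds of cluster 1 hold a Master's degree, which is the kind of contrast profiling is meant to surface.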

In [None]:
data.head()

##### Get the overall mean per variable to profile the clusters:

In [None]:
cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
       'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases',
       'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
       'NumWebVisitsMonth']

In [None]:
# Overall mean of each variable across all customers
overall_mean = data_output[cols].mean().to_frame(name = 'overall_average')
overall_mean

##### Mean of each variable per cluster

In [None]:
cluster_mean = data_output.groupby('cluster')[cols].mean().T
cluster_mean

##### Concatenate two datasets to get the final output

In [None]:
pd.concat([cluster_mean,overall_mean],axis =1)


### Implementing Principal Component Analysis on multiple variables

#### PCA is a popular dimensionality reduction method used to reduce the dimension of very large datasets. It does this by combining multiple variables into new variables called principal components.
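One way to see how variables are combined is to carry out the computation by hand. The sketch below standardizes a synthetic array and eigen-decomposes its covariance matrix; it is a textbook-style illustration under the assumption of made-up data (`X_demo`), not a description of how scikit-learn's `PCA` class is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 3 columns, with the second column correlated with the first.
base = rng.normal(size=(200, 1))
X_demo = np.hstack([base,
                    0.8 * base + 0.2 * rng.normal(size=(200, 1)),
                    rng.normal(size=(200, 1))])

# Standardize first, since PCA is sensitive to the scale of the variables.
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Eigen-decompose the covariance matrix of the standardized data.
cov = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns ascending order; sort descending so component 1 explains the most variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Each principal component is a weighted sum of the original (standardized) variables.
scores = Z @ eigenvectors
explained_variance_ratio = eigenvalues / eigenvalues.sum()
```

Each column of `eigenvectors` holds the weights of one component, and the variance of each column of `scores` equals the matching eigenvalue.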

In [None]:
## Recall Marketing Data
data

##### Scale the data

In [None]:
marketing_data = data
marketing_data

In [None]:
x = marketing_data.values
marketing_data_scaled = StandardScaler().fit_transform(x)

##### Apply PCA to dataset using the PCA class.

In [None]:
from sklearn.decomposition import PCA


In [None]:
pca_marketing = PCA(n_components=6,random_state = 1)
principalComponents_marketing = pca_marketing.fit_transform(marketing_data_scaled)

In [None]:
principal_marketing_data = pd.DataFrame(data = principalComponents_marketing
             , columns = ['principal component 1', 'principal component 2',
                          'principal component 3','principal component 4'
                         ,'principal component 5','principal component 6'])
principal_marketing_data

### Choosing the Number of Principal Components

#### We can use a scree plot to get a sense of the most useful components

In [None]:
data.head()

In [None]:
data.dtypes

In [None]:
data.isnull().sum()

In [None]:
x = marketing_data.values
marketing_data_scaled = StandardScaler().fit_transform(x)


In [None]:
pca_marketing = PCA(n_components = 6, random_state = 1)
principalComponents_marketing = pca_marketing.fit_transform(marketing_data_scaled)

##### Check for explained variance for each component:

In [None]:
# Print the share of variance explained by each component (numbered from 1, as in the plot)
for i, ratio in enumerate(pca_marketing.explained_variance_ratio_, start=1):
    print(f"Component {i}: {ratio:.4f}")

##### Create a scree plot to check the number of optimal components.

In [None]:
plt.figure(figsize= (12,6))

PC_values = np.arange(pca_marketing.n_components_) + 1
cumulative_variance = np.cumsum(pca_marketing.explained_variance_ratio_)
plt.plot(PC_values, cumulative_variance, 'o-', linewidth = 2, color = 'blue')
plt.title('Scree Plot')
plt.xlabel('Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()
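Reading the number of components off the plot can also be done numerically. The sketch below assumes a target share of variance to retain (80% here, an arbitrary choice) and uses hypothetical explained-variance ratios rather than values from the marketing data.

```python
import numpy as np

# Hypothetical explained-variance ratios for six components (not from the marketing data).
explained_variance_ratio = np.array([0.40, 0.20, 0.15, 0.10, 0.08, 0.07])
cumulative = np.cumsum(explained_variance_ratio)

threshold = 0.80  # assumed target share of variance to retain
# First position where the cumulative share reaches the threshold, converted to a count.
n_components_needed = int(np.argmax(cumulative >= threshold)) + 1
```

With these made-up ratios the cumulative share reaches 0.85 at the fourth component, so four components would be kept.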

### Analyzing principal components

##### Principal components are constructed as linear combinations of the original variables, which makes them less interpretable and devoid of inherent meaning. To determine the meaning of the components, we analyze the relationship between the original variables and the principal components.
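One common way to quantify that relationship is to look at the correlation between each variable and each component. The sketch below, on synthetic data, shows that for standardized variables this correlation equals the component weight multiplied by the square root of the component's eigenvalue. The data and variable names are assumptions for illustration; note that the cells in this section examine the raw weights from `components_`, which are not rescaled in this way.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: the first two columns share a common signal, the third is independent.
base = rng.normal(size=(300, 1))
X_demo = np.hstack([base + 0.3 * rng.normal(size=(300, 1)),
                    base + 0.3 * rng.normal(size=(300, 1)),
                    rng.normal(size=(300, 1))])
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=1)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
scores = Z @ eigenvectors

# Scaling each eigenvector (column) by the square root of its eigenvalue gives the
# correlation between each original variable (rows) and each component (columns).
correlation_loadings = eigenvectors * np.sqrt(eigenvalues)

# Cross-check one entry against a directly computed correlation.
direct = np.corrcoef(Z[:, 0], scores[:, 0])[0, 1]
```

A variable with a large absolute value in a column of `correlation_loadings` moves closely with that component, which is the basis for naming components after the variables that dominate them.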

##### Extract the loadings and examine them (continued from the PCA application)

In [None]:
x = marketing_data.values
marketing_data_scaled = StandardScaler().fit_transform(x)


In [None]:
pca_marketing = PCA(n_components = 6, random_state = 1)
principalComponents_marketing = pca_marketing.fit_transform(marketing_data_scaled)

In [None]:
data

In [None]:
loadings_df = pd.DataFrame(pca_marketing.components_).T
loadings_df = loadings_df.set_index(marketing_data.columns)
loadings_df

##### Filter out loadings below a specific threshold

In [None]:
loadings_df.where(abs(loadings_df) >= 0.35)

### Factor Analysis


##### Factor analysis can be used for dimensionality reduction. It condenses multiple variables into a smaller set of variables, called factors, that are easier to analyze and understand.

In [None]:
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo

In [None]:
data = pd.read_csv('website_survey.csv')
data = data[['q1', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9', 'q10', 
             'q11', 'q12', 'q13', 'q14', 'q15', 'q16', 'q17', 'q18', 'q19', 'q20', 
             'q21', 'q22', 'q23', 'q24', 'q25', 'q26']]
data.head()

In [None]:
satisfaction_data = data

In [None]:
satisfaction_data


In [None]:
satisfaction_data.isnull().sum()

##### Check for multicollinearity

In [None]:
satisfaction_data.corr()[(satisfaction_data.corr() > 0.9) & (satisfaction_data.corr() < 1)]

##### Test the suitability of the dataset for factor analysis using the Kaiser-Meyer-Olkin (KMO) measure

In [None]:
kmo_all, kmo_model = calculate_kmo(satisfaction_data)
kmo_model
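For intuition about what the overall KMO value summarizes, the sketch below computes it from first principles with NumPy: it compares squared correlations with squared partial correlations, where the partial correlations come from the inverse of the correlation matrix. The function `kmo_overall` and the synthetic `items` array are assumptions for illustration and are not part of `factor_analyzer`. Values close to 1 suggest the variables share enough common variance for factor analysis, and values below roughly 0.6 are often treated as a warning sign.

```python
import numpy as np

def kmo_overall(data):
    """Overall Kaiser-Meyer-Olkin measure for a 2-D array (rows = observations)."""
    corr = np.corrcoef(data, rowvar=False)
    inv_corr = np.linalg.inv(corr)
    # Partial correlations derived from the inverse correlation matrix.
    d = np.sqrt(np.diag(inv_corr))
    partial = -inv_corr / np.outer(d, d)
    # Only off-diagonal entries enter the statistic.
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = (corr[off_diag] ** 2).sum()
    p2 = (partial[off_diag] ** 2).sum()
    return float(r2 / (r2 + p2))

rng = np.random.default_rng(0)
# Synthetic survey-like data: six items driven by one shared latent factor plus noise.
latent = rng.normal(size=(500, 1))
items = latent + 0.5 * rng.normal(size=(500, 6))
kmo_demo = kmo_overall(items)
```

Because the six synthetic items are all driven by the same latent variable, their partial correlations are small relative to their raw correlations and `kmo_demo` comes out high.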

In [None]:
# Create the model and fit it so that fa.loadings_ is available below
fa = FactorAnalyzer(n_factors=6, rotation='varimax')
fa.fit(satisfaction_data)

##### Check loadings 

In [None]:
loadings_output = pd.DataFrame(fa.loadings_, index = satisfaction_data.columns)
loadings_output


### Determining the number of factors

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo

In [None]:
satisfaction_data = pd.read_csv("website_survey.csv")
satisfaction_data = satisfaction_data[['q1', 'q2', 'q3','q4', 'q5', 'q6', 'q7', 'q8', 
                                       'q9', 'q10', 'q11', 'q12', 'q13', 'q14','q15', 'q16', 'q17', 
                                       'q18', 'q19', 'q20', 'q21', 'q22', 'q23', 'q24','q25', 'q26']]


In [None]:
satisfaction_data.head()

In [None]:
satisfaction_data.shape


In [None]:
satisfaction_data.dtypes


##### Check for Multicollinearity


In [None]:

satisfaction_data.corr()


In [None]:
satisfaction_data.corr()[(satisfaction_data.corr()>0.8) & (satisfaction_data.corr()<1)]


##### Applying factor Analysis

In [None]:
# Create the model and fit it so the factor variances can be inspected
fa = FactorAnalyzer(n_factors=6, rotation='varimax')
fa.fit(satisfaction_data)

In [None]:
# Get the variance explained by each factor
pd.DataFrame(fa.get_factor_variance(),
             index=['Variance','Proportional Var','Cumulative Var'])
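A complementary check for the number of factors is to look at the eigenvalues of the correlation matrix and keep those above 1 (the Kaiser criterion). The sketch below does this with NumPy on synthetic data built from two latent factors; the data and names are assumptions for illustration. In `factor_analyzer`, a fitted model exposes similar values through `get_eigenvalues()`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two latent factors, each driving three of six items.
f1, f2 = rng.normal(size=(500, 1)), rng.normal(size=(500, 1))
items = np.hstack([f1 + 0.5 * rng.normal(size=(500, 3)),
                   f2 + 0.5 * rng.normal(size=(500, 3))])

# Eigenvalues of the correlation matrix, largest first.
eigenvalues = np.sort(np.linalg.eigvalsh(np.corrcoef(items, rowvar=False)))[::-1]

# Kaiser criterion: keep factors whose eigenvalue exceeds 1.
n_factors_kaiser = int((eigenvalues > 1).sum())
```

Plotting `eigenvalues` against their rank gives a scree plot; here two eigenvalues stand well above 1 and the rest fall far below it, matching the two factors used to build the data.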

##### Remove near-zero-variance features and handle missing values

In [None]:
from sklearn.feature_selection import VarianceThreshold

# Remove features with near-zero variance
selector = VarianceThreshold(threshold=0.01)  # Threshold can be adjusted
filtered_data = selector.fit_transform(satisfaction_data)

print("Original shape:", satisfaction_data.shape)
print("Filtered shape:", filtered_data.shape)


In [None]:
print(satisfaction_data.isnull().sum())

# Handle missing values (choose one option)
satisfaction_data = satisfaction_data.fillna(satisfaction_data.mean())  # Option 1: fill with the column mean
# satisfaction_data = satisfaction_data.dropna()  # Option 2: drop rows with missing values


### Analyzing Factors

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo

In [None]:
satisfaction_data = pd.read_csv("website_survey.csv")


In [None]:
satisfaction_data = satisfaction_data[['q1', 'q2', 'q3','q4', 'q5', 'q6', 'q7', 'q8', 'q9', 
                                       'q10', 'q11', 'q12', 'q13', 'q14','q15', 'q16', 'q17', 
                                       'q18', 'q19', 'q20', 'q21', 'q22', 'q23', 'q24','q25', 'q26']]

In [None]:
satisfaction_data.head()


In [None]:
satisfaction_data.shape


In [None]:
satisfaction_data.dtypes


##### Check for multicollinearity

In [None]:
satisfaction_data.corr()


In [None]:
satisfaction_data.corr()[(satisfaction_data.corr()>0.9) & (satisfaction_data.corr()<1)]


##### Apply Factor Analysis to dataset


In [None]:
fa = FactorAnalyzer(n_factors = 5, rotation="varimax")
fa.fit(satisfaction_data)

##### Analyze the factors

In [None]:
loadings_output = pd.DataFrame(fa.loadings_,index=satisfaction_data.columns)
loadings_output

In [None]:
loadings_output.where(abs(loadings_output) > 0.5)


In [None]:
pd.DataFrame(fa.get_communalities(),index=satisfaction_data.columns,columns=['Communalities'])
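To show what a communality represents, the sketch below computes it by hand from a small, hypothetical loading matrix (`loadings_demo` is made up, not output from the survey data): each item's communality is the sum of its squared loadings across the factors, and the remainder is its uniqueness.

```python
import numpy as np

# Hypothetical loadings for three survey items (rows) on two factors (columns).
loadings_demo = np.array([[0.8, 0.1],
                          [0.7, 0.2],
                          [0.1, 0.6]])

# Communality of each item: sum of its squared loadings across factors.
communalities_demo = (loadings_demo ** 2).sum(axis=1)  # 0.65, 0.53, 0.37
# Uniqueness: the share of each item's variance the factors leave unexplained.
uniqueness_demo = 1 - communalities_demo
```

An item with a low communality, such as the third one here, is only weakly explained by the retained factors and may be a candidate for review.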
