### Erwin Antepuesto 
BSCS - 3<br>
Dec 2, 2023

### Instructions:
1. The prefinals can be done at home.
2. The deadline is Saturday (Dec 2, 2023) at exactly 11:59 pm; no extensions will be given.
3. Dataset: https://archive.ics.uci.edu/dataset/830/visegrad+group+companies+data-2
4. Program Principal Component Analysis and Singular Value Decomposition from scratch.
5. Implement your work on the given dataset using JupyterHub. Compare your results with those of the sklearn library in Python. Explain why your results are similar or dissimilar, and cite the specific part of your code that causes any deviation.
6. Push your .ipynb to your GitHub account.
7. Along with the necessary information, paste your GitHub link here: https://docs.google.com/spreadsheets/d/1FdyiTSDGZbkf-VpIn2n_XHjUDzzBjj97fk4t20DWl3Y/edit?usp=sharing

### Guidelines:
1. No libraries at all, just raw Python: no library should be used.
2. Minimal libraries are allowed, such as those for plotting, if the project involves plotting graphs: no plotting is involved here; matching between your solution and the one obtained from sklearn can be done programmatically.
3. Any library is allowed as long as it's not sklearn: no library should be used.

## Part 1  
### Dataset Imports and Preprocessing for Own Solution and <i>sklearn</i> Solution

In [11]:
# List of file paths
file_paths = [
    r'C:\Users\ACER\Documents\Python\cs3101-prefinals\dataset\2017.arff',
    r'C:\Users\ACER\Documents\Python\cs3101-prefinals\dataset\2018.arff',
    r'C:\Users\ACER\Documents\Python\cs3101-prefinals\dataset\2019.arff',
    r'C:\Users\ACER\Documents\Python\cs3101-prefinals\dataset\2020.arff',
    r'C:\Users\ACER\Documents\Python\cs3101-prefinals\dataset\2021 Q1.arff',
]

In [14]:
import arff
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import cmath

def load_and_preprocess_data(file_paths):
    # Load datasets into a list of DataFrames
    dfs = [pd.DataFrame(list(arff.load(file_path))) for file_path in file_paths]

    # Concatenate DataFrames along rows
    merged_df = pd.concat(dfs, axis=0, ignore_index=True)

    # Count missing values per column (for inspection only; not used below)
    missing_values = merged_df.isnull().sum()

    # Handle missing values by dropping every row that contains one
    merged_df = merged_df.dropna()

    # Apply LabelEncoder to object columns after converting values to strings
    le = LabelEncoder()
    for column in merged_df.columns:
        if merged_df[column].dtype == 'object':
            merged_df[column] = merged_df[column].astype(str)
            merged_df[column] = le.fit_transform(merged_df[column])

    return merged_df

## Part 2  
### Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) using <i>sklearn</i> Library

In [28]:
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

data = load_and_preprocess_data(file_paths)

# Apply PCA
pca = PCA(n_components=2)
principalComponents_pca = pca.fit_transform(data)

# Apply Truncated SVD
svd = TruncatedSVD(n_components=2)
principalComponents_svd = svd.fit_transform(data)

# Display the output
# PCA
explained_variance_ratio_pca = pca.explained_variance_ratio_
eigenvalues_pca = pca.explained_variance_
cumulative_explained_variance_pca = explained_variance_ratio_pca.cumsum()

print("PCA Results:")
print("Explained Variance Ratio (PCA):\t", explained_variance_ratio_pca)
print("Eigenvalues (PCA):\t\t\t", eigenvalues_pca)
print("Cumulative Explained Variance (PCA):\t", cumulative_explained_variance_pca)

# SVD
explained_variance_ratio_svd = svd.explained_variance_ratio_
singular_values_svd = svd.singular_values_
cumulative_explained_variance_svd = explained_variance_ratio_svd.cumsum()

print("\nSVD Results:")
print("Explained Variance Ratio (SVD):\t", explained_variance_ratio_svd)
print("Singular Values (SVD):\t\t\t", singular_values_svd)
print("Cumulative Explained Variance (SVD):\t", cumulative_explained_variance_svd)

PCA Results:
Explained Variance Ratio (PCA):	 [0.4172533  0.08504119]
Eigenvalues (PCA):			 [1405229.4284625   286402.49043666]
Cumulative Explained Variance (PCA):	 [0.4172533 0.5022945]

SVD Results:
Explained Variance Ratio (SVD):	 [0.36837609 0.08561444]
Singular Values (SVD):			 [133109.6694485   26344.70205895]
Cumulative Explained Variance (SVD):	 [0.36837609 0.45399053]


## Part 3.1  
### Principal Component Analysis (PCA) Manual Solution

In [103]:
def normalize(vector):
    norm = sum(x**2 for x in vector)**0.5
    return [x / norm for x in vector]

def pca(data, num_components=2):
    # Calculate means for each column
    means = [mean(column) for column in zip(*data)]

    # Center the data by subtracting means
    centered_data = [[x - mean for x, mean in zip(row, means)] for row in data]

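    # Build the matrix passed to power iteration using the covariance helper.
    # Note: centered_data is a list of rows, so these zips pair rows of the
    # centered data (not feature columns) with the column means.
    cov_matrix = [[covariance(col_x, col_y, mean_x, mean_y)
                   for col_y, mean_y in zip(centered_data, means)]
                  for col_x, mean_x in zip(centered_data, means)]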

    # Calculate eigenvalues and eigenvectors
    eigenvalues, eigenvectors = [], []
    for _ in range(num_components):
        # Power iteration method with normalization
        eigenvector = [1.0] * len(cov_matrix)
        for _ in range(100):
            next_eigenvector = [sum(cov * x for cov, x in zip(row, eigenvector))
                                for row in cov_matrix]
            # Normalize the eigenvector
            eigenvector = normalize(next_eigenvector)

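        # Eigenvalue estimate: diagonal of cov_matrix weighted by the eigenvector
        # components (note: this is not the Rayleigh quotient v^T C v)
        eigenvalue = sum(row[i] * eigenvector[i] for i, row in enumerate(cov_matrix))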

        eigenvalues.append(abs(eigenvalue))
        eigenvectors.append(eigenvector)

        # Deflate the covariance matrix
        for i in range(len(cov_matrix)):
            for j in range(len(cov_matrix)):
                cov_matrix[i][j] -= eigenvalue * eigenvector[i] * eigenvector[j]

    # Sort eigenvalues and corresponding eigenvectors in descending order
    sorted_indices = sorted(range(len(eigenvalues)), key=lambda k: eigenvalues[k], reverse=True)
    eigenvalues = [eigenvalues[i] for i in sorted_indices]
    eigenvectors = [eigenvectors[i] for i in sorted_indices]

    return eigenvalues, eigenvectors

# Example usage:
data_values = data.values  # Assuming 'data' is your DataFrame
eigenvalues_custom, eigenvectors_custom = pca(data_values)

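# Display the output
# Ratios are relative to the sum of the returned eigenvalues only,
# so they always add up to 1.0 regardless of the total variance.
explained_variance_ratio_custom = [ev / sum(eigenvalues_custom) for ev in eigenvalues_custom]
cumulative_explained_variance_custom = [sum(explained_variance_ratio_custom[:i+1])
                                        for i in range(len(explained_variance_ratio_custom))]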

print("Custom PCA Results:")
print("Explained Variance Ratio:\t", explained_variance_ratio_custom)
print("Eigenvalues:\t\t\t", eigenvalues_custom)
print("Cumulative Explained Variance:\t", cumulative_explained_variance_custom)


Custom PCA Results:
Explained Variance Ratio:	 [0.5591340500663632, 0.4408659499336368]
Eigenvalues:			 [170926974.74379313, 134772480.91179776]
Cumulative Explained Variance:	 [0.5591340500663632, 1.0]


In [47]:
#PCA Results:
#Explained Variance Ratio (PCA):	 [0.4172533  0.08504119]
#Eigenvalues (PCA):			 [1405229.4284625   286402.49043675]
#Cumulative Explained Variance (PCA):	 [0.4172533 0.5022945]
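
The comparison against the sklearn values does not have to be done by eye. A minimal sketch of a programmatic check in plain Python follows; `vectors_match` and its tolerances are illustrative choices that are not part of the original notebook, and the numbers are copied from the outputs above.

```python
import math

def vectors_match(ours, reference, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if both sequences have the same length and close values."""
    if len(ours) != len(reference):
        return False
    return all(math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
               for a, b in zip(ours, reference))

# Explained variance ratios copied from the outputs above.
sklearn_ratio = [0.4172533, 0.08504119]
custom_ratio = [0.5591340500663632, 0.4408659499336368]

print("PCA ratios match:", vectors_match(custom_ratio, sklearn_ratio))
```

The same helper can be applied to eigenvalues and singular values; a looser `rel_tol` may be appropriate for iterative methods.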

## Part 3.2  
### Singular Value Decomposition (SVD) Manual Solution

In [102]:
data = load_and_preprocess_data(file_paths)

def manual_svd(matrix):
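    # Intended as A^T * A. As indexed, this takes dot products between the
    # first `cols` rows of the matrix rather than summing over all rows.
    rows, cols = len(matrix), len(matrix[0])
    ata = [[sum(matrix[i][k] * matrix[j][k] for k in range(cols)) for j in range(cols)] for i in range(cols)]

    # Intended as A * A^T. As indexed, this sums column products over the
    # first `cols` rows only, and the result is not used below.
    aat = [[sum(matrix[i][k] * matrix[i][j] for i in range(cols)) for j in range(cols)] for k in range(cols)]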

    # Eigenvalue decomposition for A^T * A
    eigenvalues_ata, eigenvectors_ata = eigenvalue_decomposition(ata)

    # Sort eigenvalues and eigenvectors in descending order
    idx = sorted(range(len(eigenvalues_ata)), key=lambda k: eigenvalues_ata[k], reverse=True)
    eigenvalues_ata = [eigenvalues_ata[i] for i in idx]
    eigenvectors_ata = [[eigenvectors_ata[j][i] for j in range(len(eigenvectors_ata))] for i in idx]

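    # Singular values are the square roots of the (already sorted) eigenvalues;
    # cmath.sqrt returns complex numbers, hence the "+0j" in the output
    singular_values = [cmath.sqrt(value) for value in eigenvalues_ata]

    # Store the eigenvectors of `ata` as u (for a true A^T * A these would
    # conventionally be the right singular vectors, V)
    u = eigenvectors_ata

    # Project each data row onto those eigenvectors and divide by the singular
    # value (A v_j / sigma_j, conventionally the left singular vectors, U)
    v = [[sum(u[j][i] * matrix[k][i] for i in range(cols)).real / singular_values[j] for j in range(cols)] for k in range(rows)]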

    return u, singular_values, v

# Apply manual SVD
u_manual, singular_values_manual, v_manual = manual_svd(data.values.tolist())

# Sort singular values in descending order
idx_sort = sorted(range(len(singular_values_manual)), key=lambda k: singular_values_manual[k].real, reverse=True)
singular_values_manual = [singular_values_manual[i] for i in idx_sort]
u_manual = [[u_manual[j][i] for j in range(len(u_manual))] for i in idx_sort]
v_manual = [[v_manual[j][i] for j in range(len(v_manual))] for i in idx_sort]

# Calculate explained variance ratio for manual SVD
explained_variance_ratio_manual = [(singular_value.real ** 2) / sum(sv.real ** 2 for sv in singular_values_manual) for singular_value in singular_values_manual]

# Calculate cumulative explained variance for manual SVD
cumulative_explained_variance_manual = [sum(explained_variance_ratio_manual[:i + 1]) for i in range(len(explained_variance_ratio_manual))]

# Display the results
print("\nManual SVD Results:")
print("Explained Variance Ratio (SVD):\t", explained_variance_ratio_manual[:2])
print("Singular Values (Manual SVD):\t\t", singular_values_manual[:2])  # Display only the first two singular values for comparison
print("Cumulative Explained Variance (SVD):\t", cumulative_explained_variance_manual[:2])


Manual SVD Results:
Explained Variance Ratio (SVD):	 [0.782912432611862, 0.041141589703449395]
Singular Values (Manual SVD):		 [(23633.322046648704+0j), (5417.622156598941+0j)]
Cumulative Explained Variance (SVD):	 [0.782912432611862, 0.8240540223153114]


sklearn SVD Results (repeated from Part 2 for comparison):
Explained Variance Ratio (SVD):	 [0.36837609 0.08561444]
Singular Values (SVD):			 [133109.6694485   26344.70205895]
Cumulative Explained Variance (SVD):	 [0.36837609 0.45399053]
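
One way to see where the manual singular values diverge is to compare the matrix produced by the `ata` expression with A^T A summed over every row. The sketch below uses a small made-up matrix; `gram_all_rows` and `gram_first_rows_only` are hypothetical helper names, and the second one reproduces the indexing of `ata` in `manual_svd`.

```python
def gram_all_rows(matrix):
    """A^T A: entry (i, j) sums column i times column j over every row."""
    rows, cols = len(matrix), len(matrix[0])
    return [[sum(matrix[k][i] * matrix[k][j] for k in range(rows))
             for j in range(cols)]
            for i in range(cols)]

def gram_first_rows_only(matrix):
    """Same indexing as `ata` in manual_svd: dot products of the first `cols` rows."""
    cols = len(matrix[0])
    return [[sum(matrix[i][k] * matrix[j][k] for k in range(cols))
             for j in range(cols)]
            for i in range(cols)]

# Made-up 4 x 2 matrix (more rows than columns, like the real dataset).
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full = gram_all_rows(toy)
partial = gram_first_rows_only(toy)
print("A^T A over all rows:   ", full)
print("`ata`-style indexing:  ", partial)

# The sum of squared singular values equals the trace of A^T A,
# which is also the sum of all squared entries of A.
trace_full = sum(full[i][i] for i in range(len(full)))
squared_entries = sum(x * x for row in toy for x in row)
print(trace_full, squared_entries)
```

The trace identity gives a check on manual singular values that needs no eigenvalue solver: their squares should never sum to more than the sum of squared entries of the data.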

## Part 4
### Conclusion

#### sklearn PCA and SVD Results:

**PCA Results:**

| Metric                                | PCA Values                             |
|---------------------------------------|----------------------------------------|
| Explained Variance Ratio (PCA)         | [0.4172533, 0.08504119]               |
| Eigenvalues (PCA)                     | [1405229.4284625, 286402.49043666]    |
| Cumulative Explained Variance (PCA)   | [0.4172533, 0.5022945]                |

**SVD Results:**

| Metric                                | SVD Values                             |
|---------------------------------------|----------------------------------------|
| Explained Variance Ratio (SVD)         | [0.36837609, 0.08561444]              |
| Singular Values (SVD)                 | [133109.6694485, 26344.70205895]      |
| Cumulative Explained Variance (SVD)   | [0.36837609, 0.45399053]              |

#### Manual Method PCA and SVD Results:

**Manual PCA Results:**

| Metric                                | Manual PCA Values                     |
|---------------------------------------|----------------------------------------|
| Explained Variance Ratio              | [0.5591340500663632, 0.4408659499336368]|
| Eigenvalues                           | [170926974.74379313, 134772480.91179776]|
| Cumulative Explained Variance         | [0.5591340500663632, 1.0]             |

**Manual SVD Results:**

| Metric                                | Manual SVD Values                     |
|---------------------------------------|----------------------------------------|
| Explained Variance Ratio (SVD)         | [0.782912432611862, 0.041141589703449395]|
| Singular Values (Manual SVD)          | [(23633.322046648704+0j), (5417.622156598941+0j)]|
| Cumulative Explained Variance (SVD)   | [0.782912432611862, 0.8240540223153114]|


#### Possible reasons why the sklearn results and my manual results differ:

Normalization and Scaling:

sklearn's `PCA` centers each feature by subtracting its column mean before the decomposition, but it does not scale features to unit variance; standardization would need a separate step such as `StandardScaler`. Neither the sklearn run in Part 2 nor my manual `pca` function standardizes the data, so a missing normalization step does not explain the mismatch. What does differ is how the centered data is used. In my covariance step, `zip(centered_data, means)` iterates over the rows of `centered_data` rather than its feature columns, so the matrix handed to power iteration is not the feature covariance matrix that sklearn effectively decomposes.
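
Centering and standardizing are different preprocessing steps, and the distinction matters when comparing against sklearn, whose `PCA` centers the features but leaves their scale unchanged. The sketch below illustrates the difference in plain Python on made-up data; the helper names are hypothetical, and it assumes every column has non-zero variance.

```python
def column_means(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def center(rows):
    """Subtract each column's mean; the spread of each column is unchanged."""
    means = column_means(rows)
    return [[x - m for x, m in zip(row, means)] for row in rows]

def sample_variances(rows):
    """Per-column sample variance (divisor n - 1)."""
    n = len(rows)
    return [sum(x * x for x in col) / (n - 1) for col in zip(*center(rows))]

def standardize(rows):
    """Center, then divide each column by its sample standard deviation."""
    stds = [v ** 0.5 for v in sample_variances(rows)]
    return [[x / s for x, s in zip(row, stds)] for row in center(rows)]

# Made-up data with two features on very different scales.
toy = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]

print("variances after centering:    ", sample_variances(center(toy)))
print("variances after standardizing:", sample_variances(standardize(toy)))
```

After centering alone, the large-scale feature still dominates the total variance; only standardizing puts both features on equal footing.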

Algorithm Differences:

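sklearn's `PCA` typically obtains the components from an SVD of the centered data and reports `explained_variance_` as the squared singular values divided by n - 1. My manual `pca` instead runs 100 steps of power iteration with deflation. Two parts of that code deviate from the standard method. First, the eigenvalue is computed as `sum(row[i] * eigenvector[i] for i, row in enumerate(cov_matrix))`, which weights the diagonal of the matrix by the eigenvector instead of evaluating the Rayleigh quotient v^T C v. Second, `abs(eigenvalue)` is stored while the signed value is used for deflation. In the manual SVD, the expression for `ata` takes dot products between the first `cols` rows of the data, so it is not A^T A summed over all rows, and its eigenvalues are not the squared singular values of the full matrix. `TruncatedSVD` also uses a randomized solver by default, which can introduce small differences of its own.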
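
For reference, here is a minimal sketch of covariance-based PCA in plain Python, assuming rows are samples and columns are features. The helper names (`covariance_matrix`, `mat_vec`, `top_eigenpairs`) and the toy data are illustrative and not taken from the notebook. Like the manual solution it uses power iteration with deflation, but it takes the covariance over columns and reads the eigenvalue from the Rayleigh quotient.

```python
import math

def covariance_matrix(rows):
    """Sample covariance (divisor n - 1) between feature columns."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    centered = [[x - m for x in col] for col, m in zip(columns, means)]
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]

def mat_vec(matrix, vec):
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]

def top_eigenpairs(sym, count, iterations=200):
    """Power iteration with deflation on a symmetric matrix."""
    work = [row[:] for row in sym]  # copy so the input is left untouched
    pairs = []
    for _ in range(count):
        vec = [1.0] * len(work)
        for _ in range(iterations):
            nxt = mat_vec(work, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient v^T M v for the (unit-length) vector v
        value = sum(x * y for x, y in zip(vec, mat_vec(work, vec)))
        pairs.append((value, vec))
        # Deflate: remove the component that was just found
        for i in range(len(work)):
            for j in range(len(work)):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs

# Made-up data: variance 8/3 along the first axis, 2/3 along the second.
toy = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
eigenpairs = top_eigenpairs(covariance_matrix(toy), count=2)
print([value for value, _ in eigenpairs])
```

A fixed iteration count is kept for simplicity; a start vector orthogonal to the dominant eigenvector would not converge, so a random start is a common alternative.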

Numerical Stability:

My manual code runs a fixed 100 power-iteration steps with no convergence check, so an eigenvector may not have converged if two eigenvalues are close together. Deflation then subtracts an inexact eigenpair, and that error carries into the second component. The manual SVD uses `cmath.sqrt`, which is why the singular values print as complex numbers such as `(23633.322046648704+0j)`; a Gram matrix has non-negative eigenvalues, so a real square root (with small negative round-off clamped to zero) would suffice. These effects are likely small relative to the size of the mismatch in the tables above.
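
Whether an iterative method has actually converged can be checked directly. A minimal sketch, assuming a symmetric matrix and a candidate eigenpair: the residual norm of C v - lambda v is close to zero only when the pair is a genuine eigenpair. The function name and the toy values are hypothetical.

```python
import math

def eigen_residual(matrix, value, vec):
    """Euclidean norm of (matrix @ vec - value * vec)."""
    mv = [sum(a * x for a, x in zip(row, vec)) for row in matrix]
    return math.sqrt(sum((m - value * x) ** 2 for m, x in zip(mv, vec)))

toy = [[2.0, 0.0], [0.0, 1.0]]

print(eigen_residual(toy, 2.0, [1.0, 0.0]))  # true eigenpair
print(eigen_residual(toy, 1.5, [1.0, 0.0]))  # wrong eigenvalue
```

Stopping the power iteration once this residual falls below a tolerance is a common alternative to a fixed number of steps.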

Parameter Differences:

Both approaches request two components, so `n_components` is aligned. The explained variance ratios, however, are computed on different bases. sklearn's `PCA` divides each eigenvalue by the total variance of all features, so its two ratios sum to about 0.50. My manual PCA divides by the sum of only the two eigenvalues it computed, which forces the cumulative value to exactly 1.0. My manual SVD divides each squared singular value by the sum of all squared singular values of the uncentered matrix, whereas `TruncatedSVD` reports the variance of each transformed component divided by the total variance of the features. The ratios are therefore not directly comparable even where the underlying decomposition agrees.
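
The effect of the denominator alone can be illustrated with sklearn's own numbers from Part 2. In the sketch below the total variance is back-computed from the reported eigenvalue and ratio, which is an assumption made for illustration, and the function names are hypothetical.

```python
# Eigenvalues and ratios copied from the sklearn PCA output in Part 2.
sklearn_eigenvalues = [1405229.4284625, 286402.49043666]
sklearn_ratios = [0.4172533, 0.08504119]

# Total variance implied by those numbers (eigenvalue / ratio).
implied_total_variance = sklearn_eigenvalues[0] / sklearn_ratios[0]

def ratios_over_total(eigenvalues, total_variance):
    """Share of the total variance of all features."""
    return [ev / total_variance for ev in eigenvalues]

def ratios_over_kept(eigenvalues):
    """Share of the kept eigenvalues only; always sums to 1."""
    kept = sum(eigenvalues)
    return [ev / kept for ev in eigenvalues]

print("over total variance:  ", ratios_over_total(sklearn_eigenvalues, implied_total_variance))
print("over kept eigenvalues:", ratios_over_kept(sklearn_eigenvalues))
```

Even with identical eigenvalues, dividing by the kept eigenvalues alone produces ratios that cannot match sklearn's unless every component is retained.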