# Unsupervised Models with Gridsearch

In [1]:
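import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder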



# Initialize the LSA dataframe from previous steps



In [2]:
lsa_category = pd.read_csv("SVD_reuters_df.csv", index_col=0)
#lsa_category

## Split the dataframe for Unsupervised Learning

In [3]:
# Establish outcome and predictors
y = lsa_category['category']
X = lsa_category.drop(columns=['category'])

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=0,
                                                    stratify=y)

# Check that the stratified split kept the class proportions in both sets
print(y_test.value_counts())
print(y_train.value_counts())

category
earn            981
acq             573
crude            93
trade            82
money-fx         77
interest         68
money-supply     38
ship             36
sugar            31
coffee           28
gold             23
cpi              18
gnp              18
cocoa            15
grain            13
reserves         12
jobs             12
alum             12
ipi              11
copper           11
rubber           10
iron-steel        9
nat-gas           9
bop               8
veg-oil           8
Name: count, dtype: int64
category
earn            2942
acq             1719
crude            281
trade            244
money-fx         232
interest         204
money-supply     113
ship             108
sugar             91
coffee            84
gold              67
gnp               56
cpi               53
cocoa             46
grain             38
alum              38
reserves          37
jobs              37
ipi               34
copper            33
rubber            30
iron-steel     

# Unsupervised Learning Techniques

In our project, we also explored unsupervised learning models. Unlike supervised models, these are fitted on data without labels or targets. Their aim is to identify patterns, structures, or relationships within the data that are not immediately evident.

Unsupervised models can perform tasks such as clustering, where data points are grouped by similarity, or dimensionality reduction, where complex data is simplified while preserving its key structure. Here we used two clustering algorithms: K-means and Gaussian Mixture Models (GMM).

These models helped us uncover hidden patterns and structures in our data, which improved our understanding of it and provided useful inputs for our supervised models. Although they did not contribute directly to the predictive power of our system, they proved valuable in the exploratory analysis and feature engineering stages of the project.

In [23]:
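def process_unsupervised(model, param_grid, X):
    """Grid-search `model` over `param_grid` on X (no labels) and predict clusters."""
    # Without `scoring`, GridSearchCV ranks candidates by the estimator's own
    # score() method (for KMeans: the negative inertia on the held-out fold)
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
    grid_search.fit(X)
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    best_estimator = grid_search.best_estimator_  # refit on all of X
    cv_scores = grid_search.cv_results_
    y_predicted = best_estimator.predict(X)

    return y_predicted, best_params, best_score, cv_scores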

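def get_acc_cm(labels_encoded, y_pred):
    # Caveat: cluster IDs are arbitrary integers, so accuracy against encoded
    # class labels is only meaningful once clusters are matched to classes
    acc_score = accuracy_score(labels_encoded, y_pred)
    cm = confusion_matrix(labels_encoded, y_pred)
    return acc_score, cm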
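
Cluster IDs returned by `predict` are arbitrary integers, so a raw accuracy against encoded class labels can be close to zero even when the clusters line up well with the classes. As a minimal sketch (the helper `majority_vote_accuracy` and the toy data are hypothetical and not part of the original notebook), one way to get a number that does not depend on the cluster numbering is to map each cluster to its most frequent true class before scoring:

```python
from collections import Counter, defaultdict

def majority_vote_accuracy(true_labels, cluster_ids):
    """Map each cluster to its most frequent true label, then score that mapping."""
    members = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster][label] += 1
    # For every cluster, count the points that carry its majority label
    matched = sum(counts.most_common(1)[0][1] for counts in members.values())
    return matched / len(true_labels)

# Toy data: the grouping is perfect, but the cluster IDs are permuted
true_labels = [0, 0, 0, 1, 1, 2, 2, 2]
cluster_ids = [2, 2, 2, 0, 0, 1, 1, 1]

raw_accuracy = sum(t == c for t, c in zip(true_labels, cluster_ids)) / len(true_labels)
print(raw_accuracy)                                      # 0.0
print(majority_vote_accuracy(true_labels, cluster_ids))  # 1.0
```

This purity-style score tends to rise as the number of clusters grows, so it is best read alongside permutation-invariant metrics such as `adjusted_rand_score` from `sklearn.metrics`.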

In [24]:
le = LabelEncoder()
labels_encoded = le.fit_transform(y)

## K-means

K-means is a simple and widely used clustering algorithm that partitions the dataset into K distinct, non-overlapping clusters based on the distance between data points. The algorithm alternates between two steps: it assigns each data point to the nearest cluster centroid, then moves each centroid to the mean of the points assigned to it. The process repeats until the centroids stop moving or a predefined number of iterations is reached. K-means is computationally efficient and works well with large datasets. However, it assumes that clusters are roughly spherical and of similar size, which may not always hold.
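
The two alternating steps can be sketched in a few lines of NumPy. This is an illustrative toy, not the scikit-learn implementation: the function name `lloyd_kmeans`, the random initialization, and the synthetic two-blob data are all assumptions made for the example, and `KMeans` additionally offers `k-means++` initialization and multiple restarts via `n_init`.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=50, seed=0):
    """Toy K-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and inertia (sum of squared distances to nearest centroid)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    inertia = (distances.min(axis=1) ** 2).sum()
    return labels, centroids, inertia

# Synthetic data: two Gaussian blobs in 2-D (hypothetical, for illustration only)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

labels, centroids, inertia = lloyd_kmeans(X, k=2)
print("Centroids:\n", centroids)
print("Negative inertia:", -inertia)
```

The inertia is the quantity K-means minimizes. scikit-learn's `KMeans.score` returns its negative, which is why grid-search scores for K-means are negative numbers.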

In [25]:
# Create a K-means model
kmeans = KMeans()

# Define the hyperparameter grid to search over
param_grid = {
    'n_clusters': [y.nunique()],
    'init': ['k-means++', 'random'],
    'n_init': [10, 20, 30, 40, 50],
    'max_iter': [10, 20, 30, 40, 50]
}
y_predicted, best_params, best_score, cv_scores = process_unsupervised(kmeans, param_grid, X)
acc_score, cm = get_acc_cm(labels_encoded, y_predicted)

print("Best parameters:", best_params)
print("Best score:", best_score)
print("Accuracy score: ", acc_score)
print("Confusion matrix:\n", cm)

Best parameters: {'init': 'k-means++', 'max_iter': 40, 'n_clusters': 25, 'n_init': 40}
Best score: -1353.1186145429328
Accuracy score:  0.038706739526411654
Confusion matrix:
 [[  6   0   0   0   1   3  13  11 396   0   0 888   1   4 366   0 378   0
    0 176  34   0  14   1   0]
 [  0   0   0   0   0   0   0   0  34   0   1   0   0  15   0   0   0   0
    0   0   0   0   0   0   0]
 [  1   0   0   0   0   3   0   0   4   0  22   0   0   0   0   0   0   0
    0   0   1   0   0   0   0]
 [  0   0   0   0   0   0   0   1  45   0   0   0   0  15   0   0   0   0
    0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   6   0   0   0   0   2   0   0   0   0
    0   0   1   0   0 103   0]
 [  0   0   0   0   0   0   0   0  31   0   0   9   0   4   0   0   0   0
    0   0   0   0   0   0   0]
 [  0   0   0   0   0  62   0   0   6   0   0   0   0   0   0   0   0   0
    1   0   1   0   1   0   0]
 [  1   0   0   0   0   0   1   2  56   0   0   3   0   1   1   0   1  77
    0   0   0  

## Gaussian Mixture Model (GMM)

GMM is a probabilistic model that assumes that the data points are generated from a mixture of several Gaussian distributions. The algorithm estimates the parameters of these distributions, such as means, covariances, and the mixture weights, using an iterative process called Expectation-Maximization (EM). GMM is more flexible than K-means, as it can model clusters with different shapes, sizes, and orientations. However, it is more computationally expensive and may not scale well to large datasets.
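
To make the EM iteration concrete, here is a minimal one-dimensional sketch in NumPy. It is a simplified illustration under stated assumptions, not what `GaussianMixture` does internally: the function name `em_gmm_1d`, the initialization, the fixed iteration count, and the synthetic data are all hypothetical, and the real estimator also supports multivariate data and the `full`, `tied`, `diag`, and `spherical` covariance types.

```python
import numpy as np

def em_gmm_1d(x, n_components=2, n_iter=100, seed=0):
    """Toy EM for a 1-D Gaussian mixture."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initial guesses: random data points as means, pooled variance, equal weights
    means = rng.choice(x, size=n_components, replace=False)
    variances = np.full(n_components, x.var())
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = (np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
                / np.sqrt(2 * np.pi * variances))
        weighted = weights * dens
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from soft assignments
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        # Small constant keeps variances strictly positive
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    # resp holds the responsibilities from the final E-step
    return weights, means, variances, resp

# Synthetic data: two 1-D Gaussians with different spreads (illustration only)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, size=200),
                    rng.normal(3.0, 1.0, size=100)])

weights, means, variances, resp = em_gmm_1d(x)
print("Weights:", weights)
print("Means:", means)
print("Variances:", variances)
```

Unlike the hard assignments of K-means, each row of `resp` is a probability distribution over components, which is what lets a GMM represent overlapping clusters of different spread.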

In [26]:
# Create a GMM model
gmm = GaussianMixture(random_state=42)

# define the hyperparameter grid to search over
param_grid = {'n_components': [y.nunique()], 
              'covariance_type': ['full', 'tied', 'diag', 'spherical']}
y_predicted, best_params, best_score, cv_scores = process_unsupervised(gmm, param_grid, X)
acc_score, cm = get_acc_cm(labels_encoded, y_predicted)

print("Best parameters:", best_params)
print("Best score:", best_score)
print("Accuracy score: ", acc_score)
print("Confusion matrix:\n", cm)

ValueError: Invalid parameter 'covariance_type' for estimator KMeans(). Valid parameters are: ['algorithm', 'copy_x', 'init', 'max_iter', 'n_clusters', 'n_init', 'random_state', 'tol', 'verbose'].

In [None]:
scores_map = {}
cross_validation = 5

In [None]:
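kmeans = KMeans(n_clusters=y.nunique())

# No labels are passed, so a supervised scorer such as 'neg_mean_squared_error'
# cannot be used; fall back to KMeans.score (negative inertia on each fold)
scores = cross_val_score(kmeans, X, cv=cross_validation)
print(f"Negative inertia: {scores.mean()} (+/- {scores.std()})")

scores_map['KMeans'] = scores
scores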

In [None]:
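gmm = GaussianMixture(random_state=42, n_components=y.nunique())

# No labels are passed, so use GaussianMixture.score instead of a supervised
# scorer: the average per-sample log-likelihood on each held-out fold
scores = cross_val_score(gmm, X, cv=cross_validation)
print(f"Mean log-likelihood: {scores.mean()} (+/- {scores.std()})")

scores_map['GMM'] = scores
scores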