Principal Component Analysis (PCA) is a dimensionality reduction technique widely used in machine learning and data analysis, particularly for high-dimensional datasets. It transforms the original features into a new set of uncorrelated variables, called principal components, each of which is a linear combination of the original features. The components are ordered so that the first captures the largest share of the variance in the data and each subsequent one captures the largest share of what remains.

Here's a brief overview of how PCA works in the context of machine learning:
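In outline, PCA centers the data, computes the covariance matrix of the features, finds the eigenvectors of that matrix, and projects the data onto the eigenvectors with the largest eigenvalues. The following is a minimal NumPy sketch of those steps, not how scikit-learn implements PCA internally. The function name `pca_numpy` and the synthetic data are hypothetical and exist only for illustration.

```python
import numpy as np

def pca_numpy(X, n_components):
    """Project X (n_samples, n_features) onto its leading principal components.

    Hypothetical helper for illustration only.
    """
    # 1. Center each feature at zero mean.
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the features (n_features x n_features).
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigen-decomposition; eigh is meant for symmetric matrices and
    #    returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort by decreasing variance and keep the leading directions.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # 5. Project the centered data onto those directions.
    return X_centered @ components, eigvals[order]

# Synthetic data with two strongly correlated features (assumed example).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 1] = 2 * X_demo[:, 0] + 0.1 * X_demo[:, 1]

Z, variances = pca_numpy(X_demo, n_components=2)
print(Z.shape)  # (200, 2)
```

The eigenvalues returned alongside the projection are the variances of the corresponding components, which is why sorting by eigenvalue is the same as sorting by captured variance.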

PCA has various applications in machine learning:

- **Dimensionality Reduction:** PCA reduces the number of features while retaining most of the variance in the data. This helps models that suffer from the curse of dimensionality and can lead to faster training and, in some cases, better performance.

- **Visualization:** PCA can project high-dimensional data onto two or three dimensions, making it easier to plot and interpret.

- **Noise Reduction:** By keeping only the principal components with the highest variance and discarding the rest, PCA can filter out noise, provided the noise contributes less variance than the signal.

- **Data Compression:** PCA can be used for compressing data while retaining most of its essential information.
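Several of these uses can be sketched with scikit-learn's `PCA` on the digits dataset. This is a minimal sketch under stated assumptions: the 95% variance target and the variable names (`pca_95`, `X_restored`, `X_2d`) are illustrative choices, not values taken from this notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data  # shape (1797, 64)

# Dimensionality reduction: a float n_components in (0, 1) keeps just enough
# components to explain that fraction of the variance (needs the full solver).
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X_digits)
print(X_reduced.shape[1], "of 64 components kept")
print("variance explained:", pca_95.explained_variance_ratio_.sum())

# Compression: map the reduced data back to pixel space and measure
# how much information the round trip lost.
X_restored = pca_95.inverse_transform(X_reduced)
mse = np.mean((X_digits - X_restored) ** 2)
print("mean squared reconstruction error:", mse)

# Visualization: two components are enough for a scatter plot.
X_2d = PCA(n_components=2).fit_transform(X_digits)
print(X_2d.shape)  # (1797, 2)
```

Lowering the variance target keeps fewer components and raises the reconstruction error, which is the trade-off behind both compression and noise reduction.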

In [13]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_digits

In [14]:
dig = load_digits()

In [15]:
file = pd.DataFrame(dig.data, columns=dig.feature_names)
file['target'] = dig.target

In [16]:
file.head()

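| | pixel_0_0 | pixel_0_1 | pixel_0_2 | pixel_0_3 | pixel_0_4 | pixel_0_5 | pixel_0_6 | pixel_0_7 | pixel_1_0 | pixel_1_1 | ... | pixel_6_7 | pixel_7_0 | pixel_7_1 | pixel_7_2 | pixel_7_3 | pixel_7_4 | pixel_7_5 | pixel_7_6 | pixel_7_7 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 5.0 | 13.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 6.0 | 13.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 0.0 | 0.0 | 0.0 | 12.0 | 13.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 16.0 | 10.0 | 0.0 | 0.0 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 4.0 | 15.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 16.0 | 9.0 | 0.0 | 2 |
| 3 | 0.0 | 0.0 | 7.0 | 15.0 | 13.0 | 1.0 | 0.0 | 0.0 | 0.0 | 8.0 | ... | 0.0 | 0.0 | 0.0 | 7.0 | 13.0 | 13.0 | 9.0 | 0.0 | 0.0 | 3 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 16.0 | 4.0 | 0.0 | 0.0 | 4 |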


In [17]:
X = file.drop('target', axis = 1)
y = file['target']

In [18]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

In [19]:
# Rescale every pixel feature to the [0, 1] range; X becomes a NumPy array
X = scaler.fit_transform(X)

In [20]:
from sklearn.decomposition import PCA

In [21]:
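# Keep the first 40 principal components of the 64 scaled pixel features.
# X is overwritten with the PCA scores, which are centered and can be negative.
pc = PCA(n_components=40)
X = pc.fit_transform(X)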
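It can be worth checking what the 40 components actually retain and what the transformed data looks like. The sketch below repeats the scaling and PCA steps in a self-contained form; the names `X_scaled`, `pca_40`, and `X_scores` are hypothetical and chosen so they do not clash with the notebook's variables.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Same preprocessing as above: scale pixels to [0, 1], then fit a 40-component PCA.
X_scaled = MinMaxScaler().fit_transform(load_digits().data)
pca_40 = PCA(n_components=40).fit(X_scaled)

# Cumulative share of the total variance captured by the first k components.
cumulative = np.cumsum(pca_40.explained_variance_ratio_)
print(f"variance retained by 40 components: {cumulative[-1]:.3f}")

# PCA centers the data before projecting, so every component has mean ~0
# and the scores take both positive and negative values.
X_scores = pca_40.transform(X_scaled)
print(np.allclose(X_scores.mean(axis=0), 0.0))
print(X_scores.min() < 0)
```

The sign of the scores matters for any downstream estimator that expects non-negative inputs.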

In [22]:
from sklearn.naive_bayes import BernoulliNB

In [23]:
clf = BernoulliNB()

In [24]:
from sklearn.model_selection import train_test_split

# 80/20 train/test split; no random_state is set, so the score varies between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

In [25]:
clf.fit(X_train, y_train)

In [26]:
clf.score(X_test, y_test)

0.8444444444444444
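In the cells above, the scaler and PCA are fitted on the full dataset before the split, so the test rows influence the preprocessing. One possible alternative, sketched here rather than taken from the notebook, is to split first and wrap the steps in a scikit-learn pipeline so that they are fitted on the training rows only. The `random_state=0` and `stratify` settings are assumptions added for reproducibility, and the resulting score will not necessarily match the one shown above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_raw, y_raw = load_digits(return_X_y=True)

# Split before any fitting so the test rows stay unseen.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, train_size=0.8, random_state=0, stratify=y_raw
)

# fit() learns the scaling range and the principal axes from X_tr only;
# score() then applies those fitted transforms to X_te.
pipe = make_pipeline(MinMaxScaler(), PCA(n_components=40), BernoulliNB())
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
print(acc)
```

The same pipeline object can be passed to cross-validation utilities, which refit every step inside each fold.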

In [28]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

In [31]:
model_selection = {
    'svm': {
        'model': SVC(),
        'params': {
            'C': [1, 10, 20],
            'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'gamma': ['scale', 'auto']
        }
    },
    'RandomForestClassifier': {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': [10, 100, 1000],
            'criterion': ["gini", "entropy", "log_loss"]
        }
    },
    'LogisticRegression': {
        'model': LogisticRegression(),
        'params': {
            'C': [1, 10, 20],
            'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
        }
    },
    'GaussianNB': {
        'model': GaussianNB(),
        'params': {}
    },
    'MultinomialNB': {
        'model': MultinomialNB(),
        'params': {
            'alpha': [1, 10, 20]
        }
    },
    'DecisionTreeClassifier': {
        'model': DecisionTreeClassifier(),
        'params': {
            'criterion': ["gini", "entropy", "log_loss"],
            'splitter': ["best", "random"]
        }
    }
}

In [32]:
from sklearn.model_selection import GridSearchCV

In [33]:
best_ones = []

# Grid-search each candidate with 5-fold cross-validation and record its best result
for model_name, model_param in model_selection.items():
    scorer = GridSearchCV(model_param['model'], model_param['params'], cv = 5, return_train_score=False)
    scorer.fit(X, y)
    best_ones.append(
        {
            'Model Name' : model_name, 
            "Best Score" : scorer.best_score_, 
            "Best Parameters" : scorer.best_params_
        }
    )



ValueError: 
All the 15 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\naive_bayes.py", line 726, in fit
    self._count(X, Y)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\naive_bayes.py", line 851, in _count
    check_non_negative(X, "MultinomialNB (input X)")
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py", line 1372, in check_non_negative
    raise ValueError("Negative values in data passed to %s" % whom)
ValueError: Negative values in data passed to MultinomialNB (input X)
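
The traceback points at the cause: `MultinomialNB` rejects negative feature values, and `X` holds PCA scores at this point, which are centered and therefore contain negatives. The 15 failed fits are the 3 `alpha` values times 5 folds. The simplest remedy is to drop `MultinomialNB` from `model_selection`. If it should stay in the comparison, one possible workaround is to rescale the PCA scores back into [0, 1] inside a pipeline, as in the sketch below. The step names (`scale`, `pca`, `rescale`, `nb`) are hypothetical labels. `MultinomialNB` is designed for count-like features, so this only makes the search run and does not make the model a natural fit for PCA output.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_raw, y_raw = load_digits(return_X_y=True)

nb_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=40)),
    # Map the PCA scores into [0, 1]; clip=True keeps validation-fold values
    # that fall outside the training range from going negative.
    ("rescale", MinMaxScaler(clip=True)),
    ("nb", MultinomialNB()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
search = GridSearchCV(nb_pipe, {"nb__alpha": [1, 10, 20]}, cv=5)
search.fit(X_raw, y_raw)
print(search.best_params_, search.best_score_)
```

Because the whole pipeline is refitted inside each fold, this also keeps the scaler and PCA from seeing the validation rows.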
