# Evaluate Logistic Regression Model on raw data

In [3]:
import sklearn
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression


In [4]:
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)

# summarize the dataset
print(X.shape, y.shape)

(1000, 20) (1000,)


This is a binary classification task. We will evaluate a LogisticRegression model on the raw data first, and then after each dimensionality reduction transform.

The model will be evaluated using the gold standard of repeated stratified 10-fold cross-validation. The mean and standard deviation of the classification accuracy across all folds and repeats will be reported.
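To make the evaluation procedure concrete, the short sketch below counts how many train/test splits the procedure generates and checks that the class balance is preserved in each test fold. It assumes the same dataset and cross-validation settings used in this notebook; the variable names `splits` and `positive_rates` are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold

# same synthetic dataset as in the notebook
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=7)

# 10 folds repeated 3 times gives 10 x 3 = 30 train/test splits
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
splits = list(cv.split(X, y))
print('Number of evaluations:', len(splits))

# stratification keeps the share of class 1 roughly equal in every test fold
positive_rates = [y[test_idx].mean() for _, test_idx in splits]
print('Class 1 share per test fold: min=%.2f max=%.2f'
      % (min(positive_rates), max(positive_rates)))
```

Each of these splits produces one accuracy score, which is why `cross_val_score` returns an array of scores rather than a single number.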

In [13]:
# define the model
model1 = LogisticRegression()

# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model1, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.824 (0.034)



# Evaluate PCA with logistic regression algorithm for classification

We will use a Pipeline to combine the data transform and model into an atomic unit that can be evaluated using the cross-validation procedure.
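As a rough illustration of what "atomic unit" means here, the sketch below runs the cross-validation loop by hand: for each split, a fresh copy of the pipeline is fit on the training rows only (so the PCA projection never sees the test rows) and then scored on the held-out rows. This is a simplified picture of what `cross_val_score` does, not its actual implementation; the names `pipeline`, `fold_model`, and `manual_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=7)
pipeline = Pipeline(steps=[('pca', PCA(n_components=10)), ('m', LogisticRegression())])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# manual loop: fit transform + model on the training fold, score on the test fold
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    fold_model = clone(pipeline)  # unfitted copy for this fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# the same evaluation done by cross_val_score
cv_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
print('Manual loop matches cross_val_score:', np.allclose(manual_scores, cv_scores))
```

Fitting the transform inside each fold is the design choice that avoids data leakage; fitting PCA once on all rows before cross-validation would let information from the test folds influence the projection.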


In [6]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

In [14]:
# define the pipeline
steps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())]
model2 = Pipeline(steps=steps)

# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model2, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.824 (0.034)
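One plausible reason a 10-component projection can score the same as the raw data is that this dataset only has 10 independent directions: the 10 redundant features produced by `make_classification` are linear combinations of the 10 informative ones. The sketch below checks this by measuring the matrix rank and the share of variance kept by the first 10 principal components. It is a diagnostic under that assumption, not part of the evaluation itself, and the names `rank` and `retained` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=7)

# number of linearly independent columns in the 20-feature matrix
rank = np.linalg.matrix_rank(X)

# fraction of total variance captured by the first 10 principal components
pca = PCA(n_components=10).fit(X)
retained = pca.explained_variance_ratio_.sum()

print('Rank of X:', rank)
print('Variance retained by 10 components: %.4f' % retained)
```

If nearly all the variance is retained, the projection discards almost no information, so a similar accuracy is to be expected. Choosing fewer components than the number of informative features would be a more revealing test of the transform.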


# Evaluate SVD with logistic regression algorithm for classification

The complete example of evaluating a model with SVD dimensionality reduction is listed below.

In [1]:
from sklearn.decomposition import TruncatedSVD

In [7]:
# define the pipeline
steps = [('svd', TruncatedSVD(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)

# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.824 (0.034)
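SVD and PCA are closely related: `TruncatedSVD` does not center the data before factorizing it, whereas `PCA` does. The sketch below illustrates the relationship by centering the features manually and comparing the singular values found by both transforms. The explicit `svd_solver='full'` and `algorithm='arpack'` settings are choices made here for numerical precision, not settings used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, TruncatedSVD

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=7)

# PCA centers each feature internally before the decomposition
pca = PCA(n_components=10, svd_solver='full').fit(X)

# TruncatedSVD works on the data as given, so center it by hand first
X_centered = X - X.mean(axis=0)
svd = TruncatedSVD(n_components=10, algorithm='arpack').fit(X_centered)

same = np.allclose(pca.singular_values_, svd.singular_values_)
print('Singular values match after centering:', same)
```

On uncentered data the two transforms generally produce different projections, which matters most for sparse inputs where centering is impractical and `TruncatedSVD` is the usual choice.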


**Task:** Choose any 2 dimensionality reduction techniques (different from the ones shown above) and train the above pipeline model on the make_classification data transformed by each of them. Create a plot using pyplot or pandas comparing the results on the original data with the results from all 4 dimensionality reduction techniques (PCA, SVD, and your 2 choices).
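A possible starting point for the comparison plot is sketched below. It collects the accuracy scores for the raw data and for the two transforms already shown, then draws one box per model with pyplot. The helper `get_models` and the dictionary `results` are hypothetical names introduced for this sketch; the two additional techniques are deliberately left for you to add.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=7)

def get_models():
    """Hypothetical helper: map a label to each model to compare."""
    models = {'raw': LogisticRegression()}
    reducers = {
        'pca': PCA(n_components=10),
        'svd': TruncatedSVD(n_components=10),
        # add the two techniques chosen for the task here
    }
    for name, reducer in reducers.items():
        models[name] = Pipeline(steps=[(name, reducer), ('m', LogisticRegression())])
    return models

# evaluate every model with the same cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
results = {name: cross_val_score(model, X, y, scoring='accuracy', cv=cv)
           for name, model in get_models().items()}

# one box per model; the marker inside each box shows the mean accuracy
names = list(results.keys())
fig, ax = plt.subplots()
ax.boxplot([results[name] for name in names], showmeans=True)
ax.set_xticks(range(1, len(names) + 1))
ax.set_xticklabels(names)
ax.set_ylabel('Accuracy')
ax.set_title('Logistic regression accuracy by transform')
# in a script, call plt.show() to display the figure
```

A box plot is used here because each model yields 30 scores, so the spread is visible alongside the mean; a bar chart of the means with error bars would be a reasonable alternative.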