<a href="https://colab.research.google.com/github/smartwatch11/ML/blob/main/HSE_HW1_EgorRybin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
#import the necessary libraries
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt
import sklearn.linear_model
import sklearn.model_selection
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.datasets import fetch_20newsgroups

#download dataset
data = fetch_20newsgroups() 

#creating an array of penalties to run all 3 variants of penalty (regularization) in one loop
penalties = ['l1', 'l2', 'elasticnet']

#loop over penalties
for penalty in penalties:
    
    #creating a model for training
    #use linear model with Stochastic Gradient Descent Classifier
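    #the loss function is the logistic loss (logistic regression)
    #it is named 'log_loss' since scikit-learn 1.1; older versions call it 'log'
    model = sklearn.linear_model.SGDClassifier(loss='log_loss', penalty=penalty)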
    
    #creating a K-fold cross-validation splitter that yields the train and test indices
    #n_splits - number of folds; shuffle - shuffle the data before splitting into folds; random_state - seed that makes the shuffling reproducible
    xval = sklearn.model_selection.KFold(n_splits=3, shuffle=True, random_state=7)
    
    #counter for the current fold (from 1 to n_splits)
    step = 0
    
    #splitting the dataset into training and test parts
    for train, test in xval.split(data['data']):
        #print(train, test)
        step = step + 1
        
        #source array for the training part
        X_train = []
        #target array for the training part
        Y_train = []
        for i in train:
            X_train.append(data['data'][i])
            Y_train.append(data['target_names'][data['target'][i]])
        
        
        #source array for the test part
        X_test = []
        #target array for the test part so that after predicting the model, the accuracy of the model can be calculated
        Y_test = []
        for i in test:
            X_test.append(data['data'][i])
            Y_test.append(data['target_names'][data['target'][i]])
        
        #encoding the text into sparse features in training part
        #n_features - the number of features (columns) in the output matrices, i.e. the number of hash buckets the words are mapped to (here 1000000)
        #binary - if True, all non-zero counts are set to 1; norm - None for no normalization of term vectors
        vectorizer = HashingVectorizer(n_features=1000000, binary=True, norm=None)
        train_vect = vectorizer.fit_transform(X_train)
    
        #model training
        model.fit(train_vect, np.ravel(Y_train))
        #print(model.coef_ )   
        
        #encoding the text into sparse features in test part
        #HashingVectorizer is stateless, so transform is enough; the test part must never be used for fitting
        test_vect = vectorizer.transform(X_test)
        
        #run the model on the whole test part in one call and count the predictions that match the target class
        count = int(np.sum(model.predict(test_vect) == np.asarray(Y_test)))
        
        #output the prediction accuracy of our trained model as a percentage
        print('Accuracy of penalty', penalty, 'after', step, 'iteration', '=', round((count/len(Y_test))*100, 2),'%')
    
    print('\n')
    #clear model variables for next penalty
    del model
    del xval

Accuracy of penalty l1 after 1 iteration = 81.52 %
Accuracy of penalty l1 after 2 iteration = 79.98 %
Accuracy of penalty l1 after 3 iteration = 80.43 %


Accuracy of penalty l2 after 1 iteration = 87.35 %
Accuracy of penalty l2 after 2 iteration = 86.48 %
Accuracy of penalty l2 after 3 iteration = 87.51 %


Accuracy of penalty elasticnet after 1 iteration = 86.48 %
Accuracy of penalty elasticnet after 2 iteration = 85.44 %
Accuracy of penalty elasticnet after 3 iteration = 86.37 %




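**The best model is l2**: it reaches the highest accuracy on every fold (86.48 to 87.51 %), ahead of elasticnet (85.44 to 86.48 %) and l1 (79.98 to 81.52 %). A likely explanation is that the L2 penalty adds the sum of squared weights to the loss, so each gradient step shrinks all weights smoothly toward zero without removing any of them, and the model keeps the many weakly informative word features. The L1 penalty pushes many weights to exactly zero, which here appears to discard useful features.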
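To compare the penalties on a single number, the fold accuracies printed above can be averaged. This is a minimal sketch that only re-uses the printed values; the names `fold_accuracy`, `mean_accuracy` and `best_penalty` are illustrative and do not appear in the notebook code.

```python
from statistics import mean

# Fold accuracies (in percent) copied from the printed output above.
fold_accuracy = {
    "l1": [81.52, 79.98, 80.43],
    "l2": [87.35, 86.48, 87.51],
    "elasticnet": [86.48, 85.44, 86.37],
}

# Average the three folds for each penalty.
mean_accuracy = {penalty: mean(scores) for penalty, scores in fold_accuracy.items()}

# Penalty with the highest mean accuracy.
best_penalty = max(mean_accuracy, key=mean_accuracy.get)

for penalty, acc in mean_accuracy.items():
    print(f"{penalty}: mean accuracy = {acc:.2f} %")
print("best penalty:", best_penalty)
```

With only three folds the averages are rough estimates, but the gap between l1 and the other two penalties is much larger than the spread between folds.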

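The main intuitive difference between L1 and L2 regularization is sparsity: L1 penalizes the sum of the absolute values of the weights and tends to produce sparse models, while L2 penalizes the sum of the squared weights and tends to keep all weights small but non-zero. Both reduce overfitting. The median/mean intuition belongs to loss functions rather than to penalties: minimizing the absolute error estimates the median of the data, and minimizing the squared error estimates the mean. Elasticnet combines both penalties, which is consistent with its accuracy falling between the other two.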
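How the two penalties act on a weight vector can be illustrated with their textbook proximal steps. This is a simplified sketch and not the update that `SGDClassifier` performs internally; the functions `l1_step` and `l2_step`, the example weights and the value of `strength` are assumptions made for illustration.

```python
def l1_step(weights, strength):
    """Soft-thresholding: proximal step for the penalty strength * sum(|w|)."""
    shrunk = []
    for w in weights:
        # Reduce the magnitude by `strength`; anything smaller becomes zero.
        magnitude = max(abs(w) - strength, 0.0)
        shrunk.append(magnitude if w >= 0 else -magnitude)
    return shrunk


def l2_step(weights, strength):
    """Proximal step for the penalty (strength / 2) * sum(w ** 2)."""
    # Every weight is rescaled by the same factor and stays non-zero.
    return [w / (1.0 + strength) for w in weights]


# Hypothetical weights: two large ones and three small ones.
weights = [2.0, -0.3, 0.05, -1.5, 0.2]
strength = 0.25

l1_weights = l1_step(weights, strength)
l2_weights = l2_step(weights, strength)

print("L1:", l1_weights)
print("L2:", l2_weights)
print("zeros after L1 step:", sum(1 for w in l1_weights if w == 0))
print("zeros after L2 step:", sum(1 for w in l2_weights if w == 0))
```

In this example the L1 step sets the two smallest weights to exactly zero, while the L2 step only scales every weight down, which is the sparsity difference described above.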