# Model fit

<img alt="" class="ce kx ky c" width="700" height="249" loading="lazy" role="presentation" src="https://miro.medium.com/max/1400/1*dpUnwfXqnU5Kd-gfafgIgQ.png">

**Overfit Model:** Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data. Intuitively, overfitting occurs when the model or the algorithm fits the data too well.

Overfitting a model results in good accuracy on the training dataset but poor results on new datasets. Such a model is of little use in the real world because it cannot predict outcomes for new cases.

**Underfit Model:** Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough. Underfitting is often a result of an excessively simple model. By "simple" we mean that data preparation was skipped: missing data is not handled properly, outliers are not treated, and irrelevant features, or features that contribute little to predicting the target variable, are not removed.

source:https://towardsdatascience.com/cross-validation-explained-evaluating-estimator-performance-e51e5430ff85
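
To make the two failure modes concrete, here is a minimal sketch that fits polynomials of increasing degree to noisy samples of a sine curve. The synthetic data, the degrees (1, 4, 12), and all variable names are assumptions chosen for illustration; they do not come from the source article.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_sine(n):
    """Draw n noisy samples of sin(2*pi*x) on [0, 1] (made-up data)."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = noisy_sine(20)    # small training set
x_test, y_test = noisy_sine(200)     # unseen data from the same process

train_mse, test_mse = {}, {}
for degree in (1, 4, 12):
    # Least-squares polynomial fit on the training data only
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = np.mean((model(x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((model(x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.3f}  test MSE={test_mse[degree]:.3f}")
```

The straight line (degree 1) typically shows a high error on both sets, which is the underfitting pattern. The degree-12 polynomial pushes the training error down but usually does worse on the unseen data, which is the overfitting pattern.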

## Split the data into TRAIN, TEST, and VALIDATION to check the model fit

**Definition of Train-Valid-Test Split**

Train-Valid-Test split is a technique to evaluate the performance of your machine learning model — classification or regression alike. You take a given dataset and divide it into three subsets. A brief description of the role of each of these datasets is below.

>**Train Dataset**
Set of data used for learning (by the model), that is, to fit the parameters of the machine learning model.

>**Valid Dataset**
Set of data used to provide an unbiased evaluation of a model fitted on the training dataset while tuning model hyperparameters.
It also plays a role in other forms of model preparation, such as feature selection and threshold cut-off selection.
It is used to validate the generalisation ability of the model or for early stopping, during the training process.

>**Test Dataset**
Set of data used to provide an unbiased evaluation of a final model fitted on the training dataset.

### Split validation (Randomly split the data into three datasets)

<img width="500" height="219" src="https://miro.medium.com/max/1400/1*f2KznlrIdj1MeobprVGBtg.png">

source:https://towardsdatascience.com/how-to-split-data-into-three-sets-train-validation-and-test-and-why-e50d22d3e54c

### *Hyperparameter tuning*

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.

The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data. The objective function takes a tuple of hyperparameters and returns the associated loss. **Cross-validation is often used to estimate this generalization performance.**

source: https://en.wikipedia.org/wiki/Hyperparameter_optimization


<img width="500" height="530" src="https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/61568656a13218cdde7f6166_training-data-validation-test.png" alt="Training, test, and validation data">

source:https://www.v7labs.com/blog/train-validation-test-set
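
As a rough sketch of how the three datasets work together during hyperparameter tuning, the example below tries several values of one hyperparameter, scores each on the validation set, and touches the test set only once at the end. The synthetic data, the decision tree model, and the candidate `max_depth` values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

# 80% train, then split the remaining 20% evenly into validation and test
X_tr, X_rem, y_tr, y_rem = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rem, y_rem, test_size=0.5, random_state=0)

candidates = [1, 3, 5, 10]   # hypothetical grid for the hyperparameter max_depth
val_scores = {}
for depth in candidates:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    val_scores[depth] = tree.score(X_val, y_val)   # tune on validation data only

best_depth = max(val_scores, key=val_scores.get)
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)      # the test set is used once, at the end
print(val_scores, best_depth, test_accuracy)
```

Because the test set plays no part in choosing `max_depth`, its accuracy remains an unbiased estimate for the final model.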

> **Method 1: conduct train_test_split twice**

In [2]:
import pandas as pd

SMS = pd.read_csv('SpamSMStraining.txt', sep = '\t', header=None, names=["label", "sms"])
SMS.head()

| | label | sms |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
# defining X and Y (Y is what you want to predict)
X = SMS['sms']
y = SMS["label"]

In [4]:
from sklearn.model_selection import train_test_split

# In the first step we will split the data in training and remaining dataset
X_train1, X_rem, y_train1, y_rem = train_test_split(X,y, train_size=0.8) 

# We want the valid and test sets to be equal in size (10% each of the overall data),
# so we set test_size=0.5 (that is, 50% of the remaining data)
X_valid1, X_test1, y_valid1, y_test1 = train_test_split(X_rem,y_rem, test_size=0.5)

# print each shape on its own line
for part in (X_train1, y_train1, X_valid1, y_valid1, X_test1, y_test1):
    print(part.shape)

(4457,)
(4457,)
(557,)
(557,)
(558,)
(558,)

> **Method 2: use train_valid_test_split**

Install the `fast_ml` package first:

! pip install fast_ml

https://anaconda.org/bioconda/fastml

In [5]:
from fast_ml.model_development import train_valid_test_split

X_train2, y_train2, X_valid2, y_valid2, X_test2, y_test2 = train_valid_test_split(
    SMS, target='label',  # target is the column to predict (y)
    train_size=0.8, valid_size=0.1, test_size=0.1)

# print each shape on its own line
for part in (X_train2, y_train2, X_valid2, y_valid2, X_test2, y_test2):
    print(part.shape)

(4457, 1)
(4457,)
(557, 1)
(557,)
(558, 1)
(558,)

### Model fit for Naive Bayes model

In [6]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

# Creating a model based on Multinomial Naive Bayes using make_pipeline
modelnb = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Training the model with the train data
modelnb.fit(X_train1, y_train1) 

# Predicting labels for the test data
y_prednb = modelnb.predict(X_test1)

acc_NB_test = accuracy_score(y_test1, y_prednb)
acc_NB_test

0.953405017921147

In [7]:
# accuracy for validation data
y_val_prednb = modelnb.predict(X_valid1)

acc_NB_val = accuracy_score(y_valid1, y_val_prednb)
acc_NB_val

0.9676840215439856

>**Numeric ML Training and Loss**

Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples. For example, Figure 3 shows a high loss model on the left and a low loss model on the right.

<img src="https://developers.google.com/machine-learning/crash-course/images/LossSideBySide.png" height="200" alt="Two Cartesian plots, each showing a line and some data points. In the first plot, the line is a terrible fit for the data, so the loss is high. In the second plot, the line is a a better fit for the data, so the loss is low.">

**Figure 3. High loss in the left model; low loss in the right model.**

**Mean square error (MSE)** is the average squared loss per example over the whole dataset.
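
As a small worked example of this definition, the sketch below computes the squared loss for each example and then averages it. The four observed values and predictions are made-up numbers.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # observed values (made-up numbers)
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # model predictions (made-up numbers)

squared_loss = (y_true - y_pred) ** 2       # loss for each single example
mse = squared_loss.mean()                   # average squared loss over the dataset
print(mse)  # 0.375
```

The per-example losses are 0.25, 0.25, 0 and 1, so the mean is 1.5 / 4 = 0.375.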

source:https://developers.google.com/machine-learning/crash-course/descending-into-ml/training-and-loss


# Model Validation -- Split the dataset

When building a predictive model, we split the original data into two datasets: a **training dataset and a testing (validation) dataset**. This is called **"split validation"**, a type of **"model validation"**.
- A predictive model is built using the **training dataset**, and **the model quality** is assessed by applying the model to the **testing (validation) dataset** (see Appendix for more details).

> **Two types of model validation**: 
 1. **split validation** (70% of the original data as training and the other 30% as testing dataset)
 2. **cross validation**  

This concept is the same as **taking the ACT exam**. There are two periods: **prep / practice test** and **actual testing**. The **prep / practice test** is like the **training data**, and the **actual testing** is like the **testing data**. Your final ACT score is based on the **actual test**, not the practice test. Likewise, the **accuracy of a predictive (regression) model** is based on the **testing dataset**.

> **What is k-Fold Cross-Validation?**

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

source:https://machinelearningmastery.com/k-fold-cross-validation/

<img alt="" class="ce kx ky c" width="700" height="314" loading="lazy" role="presentation" src="https://miro.medium.com/max/1400/1*AAwIlHM8TpAVe4l2FihNUQ.png">

source:https://towardsdatascience.com/cross-validation-k-fold-vs-monte-carlo-e54df2fc179b
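
To see how the folds are formed, here is a minimal sketch that runs scikit-learn's `KFold` on ten toy samples. The sample count, the shuffling, and the seed are assumptions chosen to keep the printout short.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)   # ten toy samples, identified by their index
kf = KFold(n_splits=5, shuffle=True, random_state=0)

held_out = []
for fold, (train_idx, test_idx) in enumerate(kf.split(samples), start=1):
    print(f"fold {fold}: train={train_idx} test={test_idx}")
    held_out.extend(int(i) for i in test_idx)

# Every sample is held out for testing exactly once across the k folds
print(sorted(held_out))
```

Each model is trained on k-1 folds and scored on the remaining one, and the k scores are then averaged.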

In [8]:
# convert label to dummy variables
SMS["Dummy"] = SMS["label"].map({'ham': 1, 'spam': 0})
SMS.head()

| | label | sms | Dummy |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 1 |
| 1 | ham | Ok lar... Joking wif u oni... | 1 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 0 |
| 3 | ham | U dun say so early hor... U c already then say... | 1 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 1 |


In [9]:
#using StratifiedKFold for cross-validation
from sklearn.model_selection import StratifiedKFold

In [10]:
#5-fold cross-validation

skf = StratifiedKFold(n_splits=5) # split the data into 5 folds that preserve the ham/spam ratio
for train_index, test_index in skf.split(X, y):
    # skf.split returns positional indices, so select rows with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    print(len(X_train), len(y_train), len(X_test), len(y_test))

4457 4457 1115 1115
4457 4457 1115 1115
4458 4458 1114 1114
4458 4458 1114 1114
4458 4458 1114 1114


# Classifier 2 -- Support Vector Machine

The objective of the support vector machine algorithm is to find **a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.**

<table>
<tr>
<td> <img alt="" class="ce kx ky c" width="210" height="206" loading="eager" role="presentation" src="https://miro.medium.com/max/600/0*9jEWNXTAao7phK-5.png" /> </td>
<td> <img alt="" class="ce kx ky c" width="210" height="206" loading="eager" role="presentation" src="https://miro.medium.com/max/600/0*0o8xIA4k3gXUDCFU.png" /> </td>
</tr>
</table>

To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin, **i.e., the maximum distance between data points of both classes**. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.

<img alt="" class="ce kx ky c" width="700" height="297" loading="eager" role="presentation" src="https://miro.medium.com/max/1400/1*ZpkLQf2FNfzfH4HXeMw4MQ.png">

<img alt="" class="ce kx ky c" width="700" height="362" loading="lazy" role="presentation" src="https://miro.medium.com/max/1400/0*ecA4Ls8kBYSM5nza.jpg">

source:https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47
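
The margin idea can be checked numerically on a toy problem. In the sketch below, the four points and the very large `C` (used to approximate a hard margin) are assumptions; for a linear SVM with normal vector `w`, the width of the margin is `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Four toy points: class 0 sits on x=1, class 1 sits on x=3 (made-up data)
X_toy = np.array([[1, 1], [1, 3], [3, 1], [3, 3]], dtype=float)
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
toy_svm = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = toy_svm.coef_[0]                     # normal vector of the separating hyperplane
margin_width = 2 / np.linalg.norm(w)     # distance between the two margin boundaries
print("w =", w, "b =", toy_svm.intercept_[0])
print("margin width =", margin_width)
print("support vectors:\n", toy_svm.support_vectors_)
```

For these points the widest possible gap lies between the lines x=1 and x=3, so the reported margin width should be close to 2, with the separating hyperplane near x=2.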

## It is not always linear 

When the classes are not linearly separable, change the `kernel` parameter of the `SVC()` function.

`kernel: {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} or callable, default='rbf'`
Specifies the kernel type to be used in the algorithm. If none is given, 'rbf' will be used. If a callable is given, it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).

rbf: radial basis function. For more, see https://en.wikipedia.org/wiki/Radial_basis_function_kernel

<img src="https://media.geeksforgeeks.org/wp-content/uploads/Circles.png" alt="" width="383" height="252" class="aligncenter size-full wp-image-824428">

<img alt="" class="ce ok ve c" width="491" height="387" loading="lazy" role="presentation" src="https://miro.medium.com/max/982/1*J0k7TxTLoL5ZG-Hq6v34Jg.png">

source:
1. https://www.geeksforgeeks.org/ml-using-svm-to-perform-classification-on-a-non-linear-dataset/
2. https://linguisticmaz.medium.com/support-vector-machines-explained-ii-f2688fbf02ae
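
To illustrate why the kernel matters, the sketch below compares a linear and an RBF kernel on two concentric circles, a layout that no straight line can separate. The dataset comes from scikit-learn's `make_circles`; its parameters and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings of points (synthetic data)
X_c, y_c = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_c_train, X_c_test, y_c_train, y_c_test = train_test_split(
    X_c, y_c, test_size=0.25, random_state=0)

kernel_accuracy = {}
for kernel in ("linear", "rbf"):
    circle_svm = SVC(kernel=kernel).fit(X_c_train, y_c_train)
    kernel_accuracy[kernel] = circle_svm.score(X_c_test, y_c_test)
    print(kernel, kernel_accuracy[kernel])
```

On this kind of data the linear kernel tends to stay near chance level, while the RBF kernel separates the inner ring from the outer one almost perfectly.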




In [11]:
from sklearn import svm
import numpy as np

### Cross validation

In [12]:
from sklearn.model_selection import StratifiedKFold

#SVM using 5-fold cross-validation
skf = StratifiedKFold(n_splits=5) # split the data into 5 folds that preserve the ham/spam ratio

accuracy = []

for train_index, test_index in skf.split(X, y):
    # skf.split returns positional indices, so select rows with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # tokenize the words and give each word a TF-IDF weight
    tfIdfVectorizer = TfidfVectorizer(decode_error='ignore', use_idf=True, stop_words='english') # we removed stopwords here
    X_train_tfidf = tfIdfVectorizer.fit_transform(X_train) # fit the vocabulary on the training fold only
    X_test_tfidf = tfIdfVectorizer.transform(X_test)
    SVM = svm.SVC() # Build the SVM classifier (default RBF kernel)
    SVM.fit(X_train_tfidf, y_train) # Train it on this fold's training data
    y_pred = SVM.predict(X_test_tfidf)
    print(y_pred[:10])
    
    accuracy.append(accuracy_score(y_test, y_pred))

accuracy = np.array(accuracy)
print('Mean accuracy: ', np.mean(accuracy, axis=0))
print('Std for accuracy: ', np.std(accuracy, axis=0))

['ham' 'ham' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham' 'spam' 'spam']
['spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam']
['spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham']
['ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham']
['spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham']
Mean accuracy:  0.9782837268840924
Std for accuracy:  0.0023036860904803442


In [13]:
# use cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn import svm

svm_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(decode_error='ignore', stop_words='english', use_idf=True)),
    # kernel: {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} or callable, default='rbf'
    ('clf', svm.SVC(kernel='linear', probability=True)),
])

# Equivalent pipeline with automatically generated step names:
#svm_Mpipline = make_pipeline(TfidfVectorizer(decode_error='ignore', stop_words='english', use_idf=True), svm.SVC(kernel='linear', probability=True))

SVMcv = cross_val_score(svm_pipeline, X, y, scoring='accuracy', cv=5) # 5-fold cross-validation
print(SVMcv)
print(SVMcv.mean())

[0.98295964 0.9838565  0.98384201 0.97845601 0.98473968]
0.9827707690945247


<table><thead><tr><th><p style="text-align:center"><strong>Pipeline</strong></p></th><th><p style="text-align:center"><strong>make_pipeline</strong></p></th></tr></thead><tbody>
<tr><td>Pipeline requires you to name the steps manually.</td><td>make_pipeline names the steps automatically.</td></tr>
<tr><td>Names are defined explicitly, without rules.</td><td>Names are generated automatically using a straightforward rule (the lowercased class name of the estimator).</td></tr>
<tr><td>Names stay the same when a different transformer or estimator is swapped in.</td><td>Names are readable, short, and easy to understand, and they change with the estimator used.</td></tr>
</tbody></table>

source:https://www.geeksforgeeks.org/what-is-the-difference-between-pipeline-and-make_pipeline-in-scikit/
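
The naming difference is easy to inspect directly. The sketch below builds the same two-step pipeline both ways and lists the step names; the variable names `named` and `auto` are made up for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC

named = Pipeline([("tfidf", TfidfVectorizer()), ("clf", SVC())])   # names chosen by hand
auto = make_pipeline(TfidfVectorizer(), SVC())                     # names generated

print(list(named.named_steps))   # ['tfidf', 'clf']
print(list(auto.named_steps))    # ['tfidfvectorizer', 'svc']

# Step names are used to address parameters as <step>__<parameter>
named.set_params(clf__kernel="linear")
auto.set_params(svc__kernel="linear")
```

The step names matter mostly when you set or tune parameters, because each parameter is addressed through the name of its step.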

##  Feature engineering (Words to Vectors)

### Bag of words

- Tokenization
- Word Frequency
- Stemming
- Lemmatization
- Remove stopwords


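>**TF-IDF**
TF-IDF (term frequency-inverse document frequency) weights each word by how often it appears in a document and discounts words that appear in many documents. This offsets a weakness of plain Bag of Words counts, where very common words dominate, and it works well for text classification or, more generally, for helping a machine read words as numbers.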

**You also can use CountVectorizer**

In [14]:
# Documentation:https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

vectorizer = CountVectorizer()
vec = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

### Split validation

In [15]:
# Creating training and test sets (80-20): X = corpus; y = classifications
SX_train, SX_test, Sy_train, Sy_test = train_test_split(X, y, test_size=0.2, random_state=10)
len(SX_train), len(Sy_train), len(SX_test), len(Sy_test)

(4457, 4457, 1115, 1115)

In [16]:
modelsvm = make_pipeline(TfidfVectorizer(), svm.SVC()) # we did not remove stopwords here
modelsvm.fit(SX_train,Sy_train)

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()), ('svc', SVC())])

In [17]:
Sy_pred = modelsvm.predict(SX_test)
print(Sy_pred[:5])

['ham' 'ham' 'ham' 'ham' 'ham']


In [18]:
# calculate the accuracy 
acc_SVM = accuracy_score(Sy_test, Sy_pred)
acc_SVM

0.9748878923766816

In [20]:
# let's try some new data

docs_new = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question",
            "Even my brother is not like to speak with me. They treat me like aids patent.",
            "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9",
            "hello, thank you",
            "To claim txt DIS to 87121",
            "SNAP for free pics",
            "Call 30303 for free prizes!!"]

predicted = modelsvm.predict(docs_new)

for doc, category in zip(docs_new, predicted):
    print(f'{doc!r} => {category}')

'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question' => spam
'Even my brother is not like to speak with me. They treat me like aids patent.' => ham
"As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9" => ham
'hello, thank you' => ham
'To claim txt DIS to 87121' => spam
'SNAP for free pics' => ham
'Call 30303 for free prizes!!' => spam


### Changing parameters to improve the model accuracy

e.g., removing stopwords, using stemming words, using ngrams, removing too frequent words, removing too rare words

- `TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern=r'(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)`

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

- `max_df`: float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a **document frequency** strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, an absolute count.
    - For example, max_df = 0.7 ==> This removes words that appear in more than 70% of the documents in the corpus (**removing frequent words**).

- `min_df`: float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, an absolute count.
    - For example, min_df = 5 ==> This removes words that appear in fewer than five documents (**removing rare words**).
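
The effect of these two thresholds can be seen on a tiny corpus. The sketch below uses four short sentences and, because the corpus is so small, `min_df=2` instead of 5; the helper `vocabulary` is a hypothetical convenience function written for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

def vocabulary(**kwargs):
    """Return the sorted vocabulary learned with the given vectorizer settings."""
    return sorted(TfidfVectorizer(**kwargs).fit(corpus).vocabulary_)

all_terms = vocabulary()
no_frequent = vocabulary(max_df=0.7)   # drop words found in more than 70% of the documents
no_rare = vocabulary(min_df=2)         # drop words found in fewer than 2 documents

print(all_terms)
print(no_frequent)   # ['and', 'first', 'one', 'second', 'third']
print(no_rare)       # ['document', 'first', 'is', 'the', 'this']
```

Here `document` appears in 3 of 4 documents (75%), so `max_df=0.7` removes it along with `is`, `the`, and `this`, which appear in all four.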

In [21]:
# remove stopwords
modelsvm2 = make_pipeline(TfidfVectorizer(stop_words='english'), svm.SVC()) 
modelsvm2.fit(SX_train,Sy_train)
Sy_pred2 = modelsvm2.predict(SX_test)
acc_SVM2 = accuracy_score(Sy_test, Sy_pred2)

acc_SVM2

0.968609865470852

# Classifier 3 -- k-nearest neighbors (KNN) 


<img src="https://www.simplilearn.com/ice9/free_resources_article_thumb/legnth-ears.JPG" alt="" width="276" height="187" class="blend-mode">

<img src="https://www.simplilearn.com/ice9/free_resources_article_thumb/knn.JPG" alt="knn" width="402" height="148" class="blend-mode">


>**How does it actually work**

<img loading="lazy" class="aligncenter" src="https://editor.analyticsvidhya.com/uploads/369941_-pMkFM7U6GX22WUCLG5g2g.png" alt="K even or odd" width="547" height="378">

**Larger k value:** Underfitting tends to occur as the value of k increases. Each prediction is a vote over many neighbors, so the model becomes too smooth to learn the structure in the training data.

**Smaller k value:** Overfitting tends to occur when the value of k is small. The model captures all of the training data, including noise, and therefore performs poorly on the test data.

<img loading="lazy" class="aligncenter" src="https://editor.analyticsvidhya.com/uploads/34077images-1.png" alt="K large or small | KNN" width="393" height="162">
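
The effect of k can be explored with a quick experiment. The sketch below uses synthetic data with 10% label noise and a handful of k values; the dataset, the k values, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with some randomly flipped labels to act as noise
X_k, y_k = make_classification(n_samples=500, n_features=10, n_informative=4,
                               flip_y=0.1, random_state=0)
X_k_train, X_k_test, y_k_train, y_k_test = train_test_split(
    X_k, y_k, test_size=0.3, random_state=0)

train_acc, test_acc = {}, {}
for k in (1, 5, 25, 151):
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_k_train, y_k_train)
    train_acc[k] = knn_k.score(X_k_train, y_k_train)
    test_acc[k] = knn_k.score(X_k_test, y_k_test)
    print(f"k={k:3d}  train accuracy={train_acc[k]:.3f}  test accuracy={test_acc[k]:.3f}")
```

With k=1 every training point is its own nearest neighbor, so training accuracy is perfect even though the labels contain noise, which is the overfitting pattern. A very large k moves the prediction toward a majority vote over a big share of the data, which is the underfitting pattern.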


>**Cons:**
The KNN algorithm does not work well with large datasets. The cost of calculating the distance between the new point and each existing point is huge, which degrades performance.
Feature scaling (standardization or normalization) is required before applying the KNN algorithm to a dataset whose features are on different scales. Otherwise, features with large numeric ranges dominate the distance, and KNN may generate wrong predictions.

source:
1. https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
2. https://www.analyticsvidhya.com/blog/2021/05/knn-the-distance-based-machine-learning-algorithm/
3. https://rstudio-pubs-static.s3.amazonaws.com/188798_4c808643569c44a3ad06c04f74a32943.html
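
The point about feature scaling can be demonstrated with two made-up features: one small-scale feature that separates the classes and one large-scale feature that is pure noise. Everything in this sketch (data, names, sizes) is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, n)
informative = rng.normal(loc=3.0 * labels, scale=1.0)    # small scale, separates the classes
noise = rng.normal(loc=0.0, scale=1000.0, size=n)        # large scale, carries no signal
features = np.column_stack([informative, noise])

F_train, F_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.25, random_state=0)

raw_knn = KNeighborsClassifier().fit(F_train, l_train)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(F_train, l_train)

raw_acc = raw_knn.score(F_test, l_test)
scaled_acc = scaled_knn.score(F_test, l_test)
print("without scaling:", raw_acc)
print("with scaling:   ", scaled_acc)
```

Without scaling, distances are driven almost entirely by the noisy large-scale feature. Note that `TfidfVectorizer` already normalizes each document vector by default (`norm='l2'`), so the text pipelines in this notebook do not need a separate scaling step.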

In [22]:
# documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
from sklearn.neighbors import KNeighborsClassifier

In [23]:
knn = KNeighborsClassifier()

modelknn = make_pipeline(TfidfVectorizer(), knn) # we did not remove stopwords here
modelknn.fit(SX_train,Sy_train)

Sy_predknn = modelknn.predict(SX_test)
acc_knn = accuracy_score(Sy_test, Sy_predknn)

acc_knn

0.9130044843049328

In [24]:
# using cross_val_score to conduct a 10-fold cross validation 
KNNcv = cross_val_score(modelknn, X, y, scoring='accuracy', cv=10)
print(KNNcv)
print(KNNcv.mean())

[0.90860215 0.91577061 0.91741472 0.91023339 0.91741472 0.91202873
 0.91202873 0.91921005 0.91023339 0.92459605]
0.9147532544416773


In [25]:
# let's try some new data with knn classifier

docs_new = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question",
            "Even my brother is not like to speak with me. They treat me like aids patent.",
            "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9",
            "hello, thank you",
            "To claim txt DIS to 87121",
            "SNAP for free pics",
            "Call 30303 for free prizes!!"]

predicted_knn = modelknn.predict(docs_new)

for doc, category in zip(docs_new, predicted_knn):
    print(f'{doc!r} => {category}')

'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question' => ham
'Even my brother is not like to speak with me. They treat me like aids patent.' => ham
"As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9" => ham
'hello, thank you' => ham
'To claim txt DIS to 87121' => ham
'SNAP for free pics' => ham
'Call 30303 for free prizes!!' => ham


# Classifier 4 -- Logistic Regression

## Binary classification

<img alt="" class="ce mb mc c" width="700" height="311" loading="lazy" role="presentation" src="https://miro.medium.com/max/1400/1*dm6ZaX5fuSmuVvM4Ds-vcg.jpeg">

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Examples of classification problems include emails (spam or not spam), online transactions (fraud or not fraud), and tumors (malignant or benign).

source:
1. https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148
2. **For more information, please read:** https://developers.google.com/machine-learning/crash-course/logistic-regression/video-lecture
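
Under the hood, logistic regression turns a linear score into a probability with the sigmoid function and then applies a threshold to pick a class. The sketch below shows only that step; the three scores and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-4.0, 0.0, 4.0])                     # hypothetical linear scores w·x + b
probabilities = sigmoid(scores)                         # probability of the positive class
predicted_class = (probabilities >= 0.5).astype(int)    # threshold at 0.5

print(np.round(probabilities, 3))
print(predicted_class)
```

A score of 0 maps to a probability of exactly 0.5, strongly negative scores map close to 0, and strongly positive scores map close to 1.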

In [26]:
# Action 1: using the same logic to build a logistic regression classifier  (Split validation data)
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()


In [27]:
# Action 2: using the cross-validation method to build the model



## Action 3 -- build SVM and KNN models on the fetch_20newsgroups dataset, then test each model using articles1.csv. Which one performs best when you compare all three models (including the Naive Bayes model)?

## What about using the cross-validation method? Will it make the model perform better?