# Introduction

The task in this project was to predict, from the headline of an article alone, whether a news article comes from a reputable media outlet or from an opinionated source. The training data consisted of headlines from Reuters, labeled as reputable, and headlines from opinionated sources whose articles were deemed false by Politifact. Two different preprocessing techniques were applied to the headlines, and multiple machine learning models were trained on the preprocessed headlines to see whether the classification was possible with reasonable accuracy.


# Results

As the task was to classify whether or not the title of a news article came from a reputable source, the accuracy of the models should not fall below $0.5$; otherwise we might as well guess the outcome at random and achieve, on average, better results. Throughout the model training process, three-fold cross-validation was used to estimate model accuracy by taking the mean of the validation scores.
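The evaluation procedure can be sketched in plain Python. This is a minimal illustration rather than the project's code: the helper names (`k_fold_splits`, `majority_baseline`, `cross_val_accuracy`) and the toy data are assumptions made for the example, and the "model" is a constant-prediction baseline that shows why $0.5$ is the score to beat on balanced data.

```python
from statistics import mean


def k_fold_splits(n_samples, k=3):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def majority_baseline(train_labels):
    """Hypothetical baseline: always predict the most common training label."""
    most_common = max(set(train_labels), key=train_labels.count)
    return lambda _headline: most_common


def cross_val_accuracy(headlines, labels, fit, k=3):
    """Return the mean validation accuracy and the per-fold accuracies."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(labels), k):
        model = fit([labels[i] for i in train_idx])
        correct = sum(model(headlines[i]) == labels[i] for i in val_idx)
        scores.append(correct / len(val_idx))
    return mean(scores), scores


# Toy, perfectly balanced data (placeholder headlines, not the real data set).
labels = [1, 0, 1, 0, 1, 0]
headlines = [f"headline {i}" for i in range(len(labels))]
mean_acc, fold_scores = cross_val_accuracy(headlines, labels, majority_baseline)
```

On this balanced toy set every fold contains one sample of each class, so a constant prediction is right exactly half of the time in each fold.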

## BOW

Many different algorithms were trained on the data set, but ultimately logistic regression, SVM and gradient boosting were chosen for the official results. The XGBoost implementation came from a third-party library that did not work too well with `sklearn`. Source code for the different models can still be found in the project repository.

### Logistic regression

**Logistic regression** is a special case of a generalized linear model that models a Bernoulli distribution by applying a sigmoid function to a linear combination of the sample features. The algorithm performs maximum likelihood estimation by minimizing the negative log-likelihood loss function. This can be done in many ways, for example by stochastic gradient descent or iteratively reweighted least squares. This project instead used the SAGA solver, which optimizes a sum of convex functions and should be faster than traditional SGD. L2 regularization was also used. The trained model predicts the class of new data points and gives the probability of each point belonging to either class. A probability threshold of 0.5 determines whether a data point is assigned to class 0 or class 1. The trained model reached a mean cross-validation accuracy of $0.770$ and a test accuracy of $0.776$.
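The prediction rule and the loss described above can be written out in a few lines of plain Python. This is only a sketch: the function names and the `l2_strength` parameter are assumptions for illustration, and the exact scaling of the penalty term in `sklearn` differs from this generic form.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of class 1: sigmoid of a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict(weights, bias, features, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


def regularized_nll(weights, bias, X, y, l2_strength=1.0):
    """Negative log-likelihood plus an L2 penalty on the weights.

    Assumes no predicted probability is exactly 0 or 1 (true for the toy
    inputs used here), so the logarithms are well defined.
    """
    nll = 0.0
    for features, label in zip(X, y):
        p = predict_proba(weights, bias, features)
        nll -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return nll + 0.5 * l2_strength * sum(w * w for w in weights)
```

A solver such as SAGA searches for the `weights` and `bias` that minimize `regularized_nll`; the sketch only evaluates the objective and does not implement the solver.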

### SVM

A **support vector machine** tries to maximize the margin between (in this case) two classes by minimizing the norm of the weight vector, to which the margin width is inversely proportional. To handle misclassification and margin violations, it assigns slack variables greater than one to misclassified samples, between zero and one to correctly classified samples that lie inside the margin, and zero to correctly classified samples outside the margin. The sum of the slack variables is added to the minimization problem, multiplied by the regularization parameter $C$, which was set to one. The SVM model in this project also uses a radial basis function (RBF) kernel, which makes a nonlinear decision boundary possible. The model had a test accuracy of $0.789$.
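For reference, the textbook soft-margin problem that this description corresponds to can be written as follows. The symbols are introduced here for illustration and do not come from the project code: $w$ and $b$ are the weight vector and bias, $\xi_i$ the slack variables, $y_i \in \{-1, +1\}$ the labels and $\phi$ the feature map implied by the kernel.

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\bigl(w^\top\phi(x_i) + b\bigr) \ge 1 - \xi_i,\quad \xi_i \ge 0
$$

The RBF kernel replaces inner products $\phi(x)^\top\phi(x')$ with

$$
k(x, x') = \exp\bigl(-\gamma\lVert x - x'\rVert^2\bigr),
$$

where $\gamma > 0$ is a kernel width parameter whose value in this project is not stated here.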

### Gradient boosting

**Gradient boosting**, like any boosting algorithm, combines multiple weak learners and makes the final decision based on how each weak learner "voted" to classify the data. It is quite similar to AdaBoost and can be regarded as a generalized version of it. Instead of giving larger weights to previously misclassified samples or, in other words, minimizing one specialized loss function, it can use many different loss functions. Gradient boosting can also use many different weak learners and split criteria, along with a large number of hyperparameters. While testing different parameters, exponential loss gave the best results, which technically makes the algorithm AdaBoost. Decision trees were used as the weak learners with a maximum tree depth of 3, and the split criterion was MSE. The model had a test accuracy of $0.783$.
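The link between exponential loss and AdaBoost can be illustrated with a short sketch. With labels encoded as $-1$ and $+1$ (an assumption of this example; library internals may use a different encoding), the negative gradient that each new tree is fitted to has magnitude $e^{-y f(x)}$, which matches the AdaBoost sample weight up to normalization. The sample values below are made up for illustration.

```python
import math


def exponential_loss(y, score):
    """Exponential loss for a label y in {-1, +1} and a real-valued ensemble score."""
    return math.exp(-y * score)


def negative_gradient(y, score):
    """Negative gradient of the exponential loss with respect to the score."""
    return y * math.exp(-y * score)


# (label, current ensemble score) pairs; made-up values for illustration.
samples = [(+1, 2.0), (+1, -0.5), (-1, -1.0), (-1, 1.5)]

# Pseudo-residuals that the next weak learner would be fitted to.
pseudo_residuals = [negative_gradient(y, s) for y, s in samples]

# Their magnitudes act like AdaBoost sample weights (before normalization).
weights = [abs(r) for r in pseudo_residuals]
```

The second and fourth samples are misclassified (the sign of the score disagrees with the label), so they receive the largest weights and dominate the fit of the next tree.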

## LDA

With the BOW representation of the headlines, accuracy seems to have an upper limit at around $0.78$. To get around this issue, another preprocessing approach was needed. The chosen method was *latent Dirichlet allocation* (LDA), a probabilistic topic model that extracts topics from the headlines and assigns a weight per topic to each headline. [MALLET](http://mallet.cs.umass.edu/) was used for the topic extraction as it gave the best results for the data set. The number of topics was set to $25$ as it gave the best coherence score, $0.36$, for the data set. The coherence score is still relatively low, as everything under $0.6$ seems to be poor, so better results are probably possible with different NLP methods.

Training the models on the LDA data set gave significantly better test accuracies. **Logistic regression** had a test accuracy of $0.865$ and **SVM** had $0.866$. With the default hyperparameters, the **gradient boosting** model already gave the best results, so its hyperparameters were fine-tuned further using grid search cross-validation. The resulting model gave a test accuracy of $0.887$.
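Grid search itself is simple to sketch: every combination of candidate hyperparameter values is scored, and the best-scoring combination is kept. The example below is a plain-Python illustration under stated assumptions: the grid values are hypothetical and `placeholder_score` is a stand-in for the mean cross-validation accuracy of a trained model, not a real result from this project.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Return (best_params, best_score) over every combination in param_grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the values searched in the project are not listed here.
param_grid = {"learning_rate": [0.05, 0.1, 0.2], "max_depth": [2, 3, 4]}


def placeholder_score(params):
    """Stand-in for a mean cross-validation accuracy; purely synthetic."""
    return -((params["max_depth"] - 3) ** 2) - (params["learning_rate"] - 0.1) ** 2


best_params, best_score = grid_search(param_grid, placeholder_score)
```

In practice `score_fn` would train the gradient boosting model with the given parameters and return its mean cross-validation accuracy, which makes the search cost grow with the product of the grid sizes.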


# Conclusions

The results show that data preprocessing methods have a great impact on performance in NLP classification problems. Just by changing the preprocessing method from BOW to LDA, albeit a considerably more sophisticated method, the logistic regression model's mean cross-validation accuracy went from $0.77$ to $0.86$. With a gradient boosting model the accuracy was pushed to $0.887$. As the LDA topic extraction gave quite poor coherence with very similar topics, the training data was far from optimal. With small changes to the words used, it could be quite easy to trick the classifier. Furthermore, there might be a bias towards "well written" headlines, as Reuters has professionals writing and crafting good headlines, whereas "opinionated sources" won't necessarily possess such craftsmanship. Therefore, writing headlines in a "professional" manner could make them more likely to be classified as reputable.
