# News Classification 

In this project, we used 6500 news articles from [Mashable](http://mashable.com/).
We applied preprocessing to tokenize and vectorize the dataset, then trained multiple machine learning models. For each model, we recorded the running time and the accuracy and compared the results.

## Loading the news data
This dataset contains 6500 news articles, each with a URL, a label, and the article text.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('articles.csv', header=None)
df = df.dropna(axis=0,how='any')
df.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | http://mashable.com/2013/01/07/ap-samsung-spon... | business | The Associated Press is the latest news organi... |
| 1 | http://mashable.com/2013/01/07/apple-40-billio... | business | It looks like 2012 was a pretty good year for ... |
| 2 | http://mashable.com/2013/01/07/astronaut-notre... | entertainment | When it comes to college football, NASA astron... |
| 3 | http://mashable.com/2013/01/07/att-u-verse-apps/ | tech | LAS VEGAS — Sharing photos and videos on your ... |
| 4 | http://mashable.com/2013/01/07/beewi-smart-toys/ | tech | LAS VEGAS — RC toys have traded in their bulky... |


In [3]:
dataset = np.array(df)
dataset.shape

(6500, 3)

## Preprocessing data
In our project, we used TF-IDF features to train the models. However, punctuation, numbers, stop words, and verb tense (for example, "do" and "did") would affect the TF-IDF features. To address this problem, we preprocessed the articles.

In [4]:
labels = dataset[:, 1]
raw_articles = dataset[:, 2]

Before preprocessing, we print the first article as an example of what the raw text looks like.

In [5]:
print(raw_articles[0])

The Associated Press is the latest news organization to experiment with trying to make money from Twitter by using its feed to advertise for other companies. The AP announced Monday that it will share sponsored tweets from Samsung throughout this week for the International CES taking place in Las Vegas. The news service will let Samsung post two tweets per day to the AP's Twitter account, which has more than 1.5 million users, and each of these tweets will be labeled "SPONSORED TWEETS."This marks the first time that the AP has sold advertising on its Twitter feed, and the company says it spent months developing guidelines to pave the way for this and other new media business models. For this particular promotion, Samsung will provide the sponsored tweets and non-editorial staff at the AP will handle the publishing side. In this way, the company hopes to maintain a clear dividing line between its editorial and advertising operations on Twitter."We are thrilled to be taking this next ste

We use the nltk package to preprocess the data.

In [6]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

# Build the stop-word set and the stemmer once, outside the loop
stop_words = set(stopwords.words('english'))
porter = PorterStemmer()

clean_articles = []
for text in raw_articles:
    # replace runs of non-ASCII characters with a space
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    tokens = word_tokenize(text)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # drop tokens that are not purely alphabetic (punctuation, numbers)
    words = [word for word in tokens if word.isalpha()]
    # filter out stop words
    words = [w for w in words if w not in stop_words]
    # reduce each word to its stem
    stemmed = [porter.stem(word) for word in words]
    clean_articles.append(" ".join(stemmed))

The articles are preprocessed into the format below. Numbers, punctuation, stop words, case, and tense are removed from the articles. This makes the text harder for a human to read, but it makes the data cleaner.
Below is the same first article shown above, after preprocessing.

In [7]:
print(clean_articles[0])

associ press latest news organ experi tri make money twitter use feed advertis compani ap announc monday share sponsor tweet samsung throughout week intern ce take place la vega news servic let samsung post two tweet per day ap twitter account million user tweet label sponsor tweet mark first time ap sold advertis twitter feed compani say spent month develop guidelin pave way new media busi model particular promot samsung provid sponsor tweet staff ap handl publish side way compani hope maintain clear divid line editori advertis oper twitter thrill take next step social media said lou ferrara ap manag editor overse social media effort statement industri must look new way develop revenu provid good experi advertis consum time advertis audienc expect ap without compromis core mission break news publish dabbl twitter ad includ atlant nation journal courtesi flickr nan palmero


## Tokenization and Vectorizing
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (term frequency–inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. This provides a way to classify articles based on how frequently important words occur in them.
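
To make the definition above concrete, here is a minimal sketch of TF-IDF on a toy corpus. It assumes whitespace tokenization, the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector, which is how scikit-learn documents the default behavior of `TfidfVectorizer`. The function name `tfidf` and the toy documents are made up for illustration and are not part of the project code.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized (illustrative helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # Guard against an empty document, where the norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Hypothetical, already-stemmed toy documents
toy_docs = [
    "samsung tweet sponsor tweet",
    "apple profit year",
    "samsung apple phone",
]
toy_vectors = tfidf(toy_docs)
print(sorted(toy_vectors[0].items(), key=lambda kv: -kv[1]))
```

In the first toy document, "tweet" gets the largest weight because it occurs twice and appears in no other document, while "samsung" is down-weighted because it also appears in the third document.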

In this step, we tokenize the whole dataset and keep the top 12000 words. Each article is then converted to a vector of TF-IDF weights over those 12000 words.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.5, max_features=12000,
                            min_df=2, use_idf=True, lowercase=True)
X = vectorizer.fit_transform(clean_articles)

Next, we use y to represent the labels and convert them to integers that can be used by the neural network.

In [9]:
y = labels
print(y[:10])
numy = pd.factorize(y)[0]
print(numy[:10])

['business' 'business' 'entertainment' 'tech' 'tech' 'lifestyle' 'tech'
 'tech' 'world' 'world']
[0 0 1 2 2 3 2 2 4 4]
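
The integer codes above are assigned in order of first appearance. The following is a minimal pure-Python sketch of that encoding, not the pandas implementation; the helper name `factorize` simply mirrors the pandas function for readability. It also keeps the list of unique labels so that predicted integers can be mapped back to label names.

```python
def factorize(values):
    """Encode values as integers in order of first appearance (illustrative helper)."""
    codes, uniques = [], []
    index = {}  # label -> integer code
    for v in values:
        if v not in index:
            index[v] = len(uniques)
            uniques.append(v)
        codes.append(index[v])
    return codes, uniques

# The first ten labels printed in the cell above
sample = ['business', 'business', 'entertainment', 'tech', 'tech',
          'lifestyle', 'tech', 'tech', 'world', 'world']
codes, uniques = factorize(sample)
print(codes)                            # [0, 0, 1, 2, 2, 3, 2, 2, 4, 4]
print([uniques[c] for c in codes[:3]])  # ['business', 'business', 'entertainment']
```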


## Multi-logistic Regression with Cross Validation
Multi-logistic (multinomial logistic) regression is a basic way to classify a dataset. Here, we trained this model on our data and used cross validation to evaluate it.
Note that the cross-validation implementation is adapted from "Breast Cancer Diagnosis via Logistic Regression".
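
As a rough illustration of what K-fold cross validation does with the sample indices, the sketch below splits them into consecutive, unshuffled folds, which is the behavior scikit-learn documents for `KFold` with its default `shuffle=False`. The function name `kfold_indices` is hypothetical and only serves this example.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop

folds = list(kfold_indices(6500, 4))
for train, test in folds:
    print(len(train), len(test))  # 4875 1625 for every fold
```

Each sample is used for testing exactly once and for training in the other three folds, so the averaged metrics cover the whole dataset.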

In [10]:
from sklearn import linear_model
multilogreg = linear_model.LogisticRegression(C=1e4, solver = 'newton-cg')

In [12]:
import time
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support
nfold = 4
kf = KFold(n_splits=nfold)
prec = []
rec = []
f1 = []
acc = []
t = []
i = 1
for train, test in kf.split(X):  
    start = time.time()
    # Get training and test data
    Xtr = X[train,:]
    ytr = numy[train]
    Xts = X[test,:]
    yts = numy[test]
    
    # Fit a model
    multilogreg.fit(Xtr, ytr)
    yhat = multilogreg.predict(Xts)
    
    # Measure performance
    preci,reci,f1i,_= precision_recall_fscore_support(yts,yhat,average = 'weighted') 
    prec.append(preci)
    rec.append(reci)
    f1.append(f1i)
    acci = np.mean(yhat == yts)
    acc.append(acci)
    end = time.time()
    t.append(end-start)
    print('Running Time of Group %d : %.3f seconds' %(i, end-start))
    i = i+1

# Take average values of the metrics
precm = np.mean(prec)
recm = np.mean(rec)
f1m = np.mean(f1)
accm= np.mean(acc)

# Compute the standard errors
prec_se = np.std(prec)/np.sqrt(nfold-1)
rec_se = np.std(rec)/np.sqrt(nfold-1)
f1_se = np.std(f1)/np.sqrt(nfold-1)
acc_se = np.std(acc)/np.sqrt(nfold-1)
total_time = sum(t)
print('Total running time = %.3f seconds' % total_time)
print('================summary==================')
print('Precision = {0:.4f}, SE={1:.4f}'.format(precm,prec_se))
print('Recall =    {0:.4f}, SE={1:.4f}'.format(recm, rec_se))
print('f1 =        {0:.4f}, SE={1:.4f}'.format(f1m, f1_se))
print('Accuracy =  {0:.4f}, SE={1:.4f}'.format(accm, acc_se))

Running Time of Group 1 : 3.030 seconds
Running Time of Group 2 : 2.873 seconds
Running Time of Group 3 : 2.945 seconds
Running Time of Group 4 : 3.130 seconds
Total running time = 11.979 seconds
Precision = 0.7061, SE=0.0105
Recall =    0.7085, SE=0.0095
f1 =        0.7057, SE=0.0100
Accuracy =  0.7085, SE=0.0095
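
The standard errors reported above are computed as `np.std(x)/np.sqrt(nfold-1)`. Since `np.std` defaults to the population standard deviation, this is the same quantity as the usual sample standard deviation divided by the square root of the number of folds. A small sketch using only the standard library checks this; the fold scores are hypothetical values, not results from this notebook.

```python
import math
import statistics

def standard_error(scores):
    """Population std divided by sqrt(n - 1), as in the notebook's formula."""
    n = len(scores)
    return statistics.pstdev(scores) / math.sqrt(n - 1)

fold_scores = [0.70, 0.72, 0.69, 0.71]  # hypothetical per-fold accuracies

se_notebook = standard_error(fold_scores)
# Textbook form: sample std (n - 1 in the denominator) over sqrt(n)
se_textbook = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))

print(math.isclose(se_notebook, se_textbook))
```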


## Support Vector Machine with Cross Validation

Support Vector Machine is another useful model for classifying a dataset. We again used K-fold cross validation.

In [13]:
from sklearn import svm
svc = svm.SVC(kernel='linear')

In [15]:
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support
import time

t = []
nfold = 4
kf = KFold(n_splits=nfold)
prec = []
rec = []
f1 = []
acc = []
i = 1
for train, test in kf.split(X): 
    start = time.time()
    # Get training and test data
    Xtr = X[train,:]
    ytr = numy[train]
    Xts = X[test,:]
    yts = numy[test]
    
    # Fit a model
    svc.fit(Xtr, ytr)
    yhat = svc.predict(Xts)
    
    # Measure performance
    preci,reci,f1i,_= precision_recall_fscore_support(yts,yhat,average = 'weighted') 
    prec.append(preci)
    rec.append(reci)
    f1.append(f1i)
    acci = np.mean(yhat == yts)
    acc.append(acci)
    end = time.time()
    t.append(end-start)
    print('Running Time of Group %d : %.3f seconds' %(i, end-start))
    i = i+1

# Take average values of the metrics
precm = np.mean(prec)
recm = np.mean(rec)
f1m = np.mean(f1)
accm= np.mean(acc)

# Compute the standard errors
prec_se = np.std(prec)/np.sqrt(nfold-1)
rec_se = np.std(rec)/np.sqrt(nfold-1)
f1_se = np.std(f1)/np.sqrt(nfold-1)
acc_se = np.std(acc)/np.sqrt(nfold-1)
total_time = sum(t)
print('Total running time = %.3f seconds' % total_time)
print('================summary==================')
print('Precision = {0:.4f}, SE={1:.4f}'.format(precm,prec_se))
print('Recall =    {0:.4f}, SE={1:.4f}'.format(recm, rec_se))
print('f1 =        {0:.4f}, SE={1:.4f}'.format(f1m, f1_se))
print('Accuracy =  {0:.4f}, SE={1:.4f}'.format(accm, acc_se))

Running Time of Group 1 : 29.168 seconds
Running Time of Group 2 : 27.713 seconds
Running Time of Group 3 : 30.667 seconds
Running Time of Group 4 : 29.773 seconds
Total running time = 117.321 seconds
Precision = 0.7397, SE=0.0093
Recall =    0.7398, SE=0.0086
f1 =        0.7361, SE=0.0095
Accuracy =  0.7398, SE=0.0086
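
Recall and accuracy are identical in the summary above. This is expected with `average='weighted'`: weighting each class's recall by its share of the samples adds up to the overall fraction of correct predictions. The sketch below demonstrates this on hypothetical labels; the helper names `weighted_recall` and `accuracy` are illustrative, not scikit-learn functions.

```python
from collections import Counter

def weighted_recall(y_true, y_pred):
    """Per-class recall, weighted by each class's share of the true labels."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum((support[c] / total) * (correct[c] / support[c]) for c in support)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels and predictions for four classes
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 3]

print(weighted_recall(y_true, y_pred), accuracy(y_true, y_pred))
```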


## Naive Bayes with Cross Validation

In [17]:
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
import time

t = []
nfold = 4
kf = KFold(n_splits=nfold)
prec = []
rec = []
f1 = []
acc = []
i = 1
for train, test in kf.split(X): 
    start = time.time()
    # Get training and test data
    Xtr = X[train,:]
    ytr = numy[train]
    Xts = X[test,:]
    yts = numy[test]
    
    # Fit a model
    clf = MultinomialNB(alpha = 0.01)
    clf.fit(Xtr, ytr)
    yhat = clf.predict(Xts)
    
    # Measure performance
    preci,reci,f1i,_= precision_recall_fscore_support(yts,yhat,average = 'weighted') 
    prec.append(preci)
    rec.append(reci)
    f1.append(f1i)
    acci = np.mean(yhat == yts)
    acc.append(acci)
    end = time.time()
    t.append(end-start)
    print('Running Time of Group %d : %.3f seconds' %(i, end-start))
    i = i+1

# Take average values of the metrics
precm = np.mean(prec)
recm = np.mean(rec)
f1m = np.mean(f1)
accm= np.mean(acc)

# Compute the standard errors
prec_se = np.std(prec)/np.sqrt(nfold-1)
rec_se = np.std(rec)/np.sqrt(nfold-1)
f1_se = np.std(f1)/np.sqrt(nfold-1)
acc_se = np.std(acc)/np.sqrt(nfold-1)
total_time = sum(t)
print('Total running time = %.3f seconds' % total_time)
print('================summary==================')
print('Precision = {0:.4f}, SE={1:.4f}'.format(precm,prec_se))
print('Recall =    {0:.4f}, SE={1:.4f}'.format(recm, rec_se))
print('f1 =        {0:.4f}, SE={1:.4f}'.format(f1m, f1_se))
print('Accuracy =  {0:.4f}, SE={1:.4f}'.format(accm, acc_se))

Running Time of Group 1 : 0.026 seconds
Running Time of Group 2 : 0.020 seconds
Running Time of Group 3 : 0.019 seconds
Running Time of Group 4 : 0.018 seconds
Total running time = 0.083 seconds
Precision = 0.7070, SE=0.0069
Recall =    0.7102, SE=0.0062
f1 =        0.7057, SE=0.0062
Accuracy =  0.7102, SE=0.0062


## Neural Network

### Permutation
We shuffle the 6500 samples with a random permutation.

In [18]:
permutation = np.random.permutation(6500)

### Group the data
We separate the data into 2 groups for training and testing.

In [19]:
Xtr = X[permutation[:6000]].todense()
Xts = X[permutation[6000:6500]].todense()
ytr = numy[permutation[:6000]]
yts = numy[permutation[6000:6500]]

In [35]:
from keras.models import Model, Sequential 
from keras.layers import Dense, Activation
import keras.backend as K
K.clear_session()

Now we create a network. The features are:
* We have one hidden layer with nh=500 units.
* One output layer with nout=6 units, one for each of the 6 possible classes
* The output activation is softmax, which is used for multi-class targets
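
For intuition about the softmax output layer, here is a minimal sketch of the function itself: it turns one raw score per class into probabilities that sum to 1, and the predicted class is the one with the highest probability. The scores below are hypothetical and the helper is written for this example rather than taken from Keras.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1, -1.0, 0.5, 0.0]  # hypothetical scores for 6 classes
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs], predicted_class)
```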


In [36]:
nin = X.shape[1]  # dimension of input data
nh = 500     # number of hidden units
nout = 6    # number of outputs = 6 since there are 6 classes
model = Sequential()
model.add(Dense(nh, input_shape=(nin,), activation='relu', name='hidden'))
model.add(Dense(nout, activation='softmax', name='output'))

In [37]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
hidden (Dense)               (None, 500)               6000500   
_________________________________________________________________
output (Dense)               (None, 6)                 3006      
Total params: 6,003,506
Trainable params: 6,003,506
Non-trainable params: 0
_________________________________________________________________
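
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 12000 input features, which is what the hidden layer's count of 6,000,500 implies; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(12000, 500)  # TF-IDF features -> hidden layer
output = dense_params(500, 6)      # hidden layer -> 6 classes
print(hidden, output, hidden + output)  # 6000500 3006 6003506
```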


### Keep track of the loss history and validation accuracy
This part is adapted from "Lab 7: Neural Networks for Music Classification".

In [38]:
import keras
class LossHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        # Start with empty lists for the per-batch loss and per-epoch validation accuracy
        self.loss = []
        self.val_acc = []

    def on_batch_end(self, batch, logs=None):
        # Called at the end of each batch: record the training loss
        logs = logs or {}
        self.loss.append(logs.get('loss'))

    def on_epoch_end(self, epoch, logs=None):
        # Called at the end of each epoch: record the validation accuracy
        # (the key is 'val_acc' in the Keras version used here; newer releases log 'val_accuracy')
        logs = logs or {}
        self.val_acc.append(logs.get('val_acc'))

# Create an instance of the history callback
history_cb = LossHistory()

### Create an optimizer and compile the model

In [39]:
from keras import optimizers

opt = optimizers.Adam(lr=0.001) # beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(optimizer=opt,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
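
The `sparse_categorical_crossentropy` loss works directly with integer labels such as `numy`, so no one-hot encoding is needed. As a rough sketch of what it measures, the function below averages the negative log-probability that the model assigns to the true class. It ignores details such as the probability clipping Keras applies, and the labels and probabilities are hypothetical.

```python
import math

def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean of -log(probability assigned to the true class)."""
    losses = [-math.log(probs[label]) for label, probs in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

y_true = [0, 2]                  # integer class labels
y_prob = [[0.7, 0.2, 0.1],       # predicted probabilities, sample 1
          [0.1, 0.3, 0.6]]       # predicted probabilities, sample 2
loss = sparse_categorical_crossentropy(y_true, y_prob)
print(round(loss, 4))
```

The loss is 0 only when the model puts probability 1 on every true class, and it grows as that probability shrinks.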

### Fit the model for 5 epochs

In [40]:
batch_size = 100
start = time.time()
model.fit(Xtr, ytr, epochs=5, batch_size=batch_size, validation_data=(Xts,yts), callbacks = [history_cb])
end = time.time()
print('Total running time = %.3f seconds' % (end - start))

Train on 6000 samples, validate on 500 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Total running time = 26.912 seconds
