The challenge is to classify each news item into one of 46 categories.

In [1]:
#pip install contractions

In [2]:
#Importing required libraries
import pandas as pd
import re
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
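sns.set_theme()  #Apply seaborn's default style; the 'seaborn' matplotlib style name is deprecated in recent matplotlib releases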
import warnings
warnings.filterwarnings('ignore')
import contractions

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.pipeline import Pipeline

import spacy
nlp = spacy.load('en_core_web_sm')

In [3]:
#pip install openpyxl

In [4]:
#Read the dataset
data = pd.read_excel('../input/reuters/train.xlsx')
#View the top rows
data.head()

In [5]:
#Print a concise summary of a DataFrame
data.info()

In [6]:
#Return a Series containing counts of unique values
data['class'].value_counts()

In [7]:
#Show the counts of observations using bars
plt.figure(figsize=(12,8))
sns.countplot(x='class',data=data);

Text Normalization: Normalizing text reduces its variability, so that different surface forms of the same word are treated alike. This shrinks the vocabulary the model has to handle and therefore improves efficiency. \
Operations: text cleaning, tokenization, expanding contractions, case conversion, stopword removal, lemmatization and stemming
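
The operations listed above can be sketched on a single sentence. This is a minimal illustration using only the standard library: the `CONTRACTIONS` and `STOPWORDS` tables are tiny hypothetical stand-ins for the `contractions` package and spaCy's stop-word list used in this notebook, and lemmatization is left out.

```python
import re

# Hypothetical lookup tables, for illustration only
CONTRACTIONS = {"don't": "do not", "it's": "it is", "won't": "will not"}
STOPWORDS = {"the", "is", "it", "and", "will"}

def normalize(text):
    # Expand contractions word by word
    words = [CONTRACTIONS.get(w.lower(), w) for w in text.split()]
    text = " ".join(words)
    # Replace anything that is not a letter with a space, then lowercase
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    # Splitting on whitespace also collapses repeated spaces
    tokens = text.split()
    # Drop words shorter than 3 characters and stop words
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]

sample = "It's 2021 and the markets don't   react!"
normalized = normalize(sample)
print(normalized)  # ['markets', 'not', 'react']
```

Note that "not" survives because it is deliberately absent from the stop-word table, which mirrors the choice made later in this notebook.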

In [8]:
#Expand Contractions
data.text = data.text.apply(lambda item: ' '.join([contractions.fix(word) for word in item.split()]) )
data.head()

In [9]:
#Remove punctuation and numbers (every non-letter character becomes a space)
data.text = data.text.apply(lambda item: re.sub(r'[^a-zA-Z]',' ',str(item)))
data.head()

In [10]:
#Collapse repeated whitespace into a single space
data.text = data.text.apply(lambda item: re.sub(r"\s+", " ", item, flags=re.UNICODE))
data.head()

In [11]:
#Remove words shorter than 3 characters
data.text = data.text.apply(lambda item: ' '.join([word for word in item.split() if len(word)>=3]) )
data.head()

In [12]:
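#Remove stopwords (keeping 'not', which carries negation) and lemmatize
stopwords = set(nlp.Defaults.stop_words)  #Copy so that spaCy's shared defaults are not mutated
stopwords.discard('not')

data.text = data.text.apply(lambda item: nlp(str(item)))
#Compare the token's lowercased text, not the Token object itself, against the stopword strings
data.text = data.text.apply(lambda item: [token.lemma_ for token in item if token.text.lower() not in stopwords])
data.text = data.text.apply(lambda item: ' '.join(item))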
data.head()

In [13]:
x = data['text']
y = data['class']

In [14]:
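#Split the raw text first so that the vectorizer learns its vocabulary from the training portion only
xtrain_text,xtest_text,ytrain,ytest = train_test_split(x,y,test_size=0.3,random_state=101)

In [15]:
#TfidfVectorizer tokenizes the documents, learns the vocabulary and inverse document frequency weights, and can then encode new documents
vector = TfidfVectorizer()
xtrain = vector.fit_transform(xtrain_text)
xtest = vector.transform(xtest_text)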
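
To make the TF-IDF weighting less of a black box, here is a rough standard-library sketch of the computation. It assumes the smoothed formula documented for `TfidfVectorizer` with its default settings, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation. The three-document corpus is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already normalized
docs = ["oil price rise", "oil export fall", "wheat price fall"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # Term frequency times smoothed inverse document frequency
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
    # L2-normalise so that each document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first document, "rise" appears in only one document and so receives a larger weight than "oil", which appears in two.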

In [16]:
accuracy = []
f1 = []

Random Forest: As the word ‘Forest’ suggests, the algorithm builds a large number of decision trees and combines their outputs to make a classification. Each tree reaches its decision by branching from a root on conditions found in the data until it arrives at an output known as a leaf. \
Pros:
1. It reduces the overfitting seen in single decision trees, which helps to improve accuracy
2. It can be used for both classification and regression problems
3. It works well with both categorical and continuous values

Cons:
1. It requires considerable computational power and memory, as it builds numerous trees and combines their outputs.
2. It also takes longer to train, as many decision trees have to be built to determine the class.
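
The step that combines the trees' outputs can be pictured as a majority vote. The sketch below assumes hard votes over hypothetical per-tree predictions with made-up labels. scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead of counting votes, so treat this as an illustration of the idea rather than of that implementation.

```python
from collections import Counter

def majority_vote(tree_predictions):
    # Return the label predicted by the largest number of trees
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical outputs of five trees for one news item
votes = ["earn", "acq", "earn", "crude", "earn"]
forest_prediction = majority_vote(votes)
print(forest_prediction)  # earn
```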

In [17]:
random_forest = RandomForestClassifier(n_estimators=150)
random_forest.fit(xtrain,ytrain)
predictions = random_forest.predict(xtest)

accuracy.append(accuracy_score(ytest,predictions))
f1.append(f1_score(ytest,predictions,average='micro'))

print('Accuracy Score: ', accuracy_score(ytest,predictions))
print('F1 Score: ', f1_score(ytest,predictions,average='micro'))
print(classification_report(ytest,predictions))

In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that best separates the two classes. With more than two classes, as here, scikit-learn's SVC combines several such two-class classifiers (one-vs-one). \
Pros:
1. It works really well with a clear margin of separation
2. It is effective in high dimensional spaces.

Cons:
1. It does not scale well to large data sets, because training time grows quickly with the number of samples
2. It also does not perform well when the data set is noisy, i.e. when the target classes overlap
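
For the two-class case described above, a linear SVM's prediction comes down to which side of the hyperplane w·x + b = 0 a point falls on. The following sketch uses hypothetical values for `w` and `b` on a two-feature problem. A trained model would learn these from the data.

```python
import math

# Hypothetical hyperplane parameters for a 2-feature problem: w.x + b = 0
w = [2.0, -1.0]
b = -1.0

def decision_function(x):
    # Signed score: positive on one side of the hyperplane, negative on the other
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x):
    return 1 if decision_function(x) >= 0 else -1

def distance_to_hyperplane(x):
    # Geometric distance from the point to the separating hyperplane
    return abs(decision_function(x)) / math.sqrt(sum(wi * wi for wi in w))

points = [(2.0, 1.0), (0.0, 1.0)]
labels = [predict(p) for p in points]
print(labels)  # [1, -1]
```

The margin that an SVM maximises is the distance from the hyperplane to the closest training points, which is what `distance_to_hyperplane` measures for a single point.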

In [18]:
svm = SVC(kernel='linear')
svm.fit(xtrain,ytrain)
predictions = svm.predict(xtest)

accuracy.append(accuracy_score(ytest,predictions))
f1.append(f1_score(ytest,predictions,average='micro'))

print('Accuracy Score: ', accuracy_score(ytest,predictions))
print('F1 Score: ', f1_score(ytest,predictions,average='micro'))
print(classification_report(ytest,predictions))

In [19]:
#alpha=0 disables smoothing and can cause numeric errors, so use a tiny positive value instead
naive = MultinomialNB(alpha=1e-10)
naive.fit(xtrain,ytrain)
predictions = naive.predict(xtest)

accuracy.append(accuracy_score(ytest,predictions))
f1.append(f1_score(ytest,predictions,average='micro'))

print('Accuracy Score: ', accuracy_score(ytest,predictions))
print('F1 Score: ', f1_score(ytest,predictions,average='micro'))
print(classification_report(ytest,predictions))

In [20]:
mlp = MLPClassifier(hidden_layer_sizes=(200,200),activation='relu')
mlp.fit(xtrain,ytrain)
predictions = mlp.predict(xtest)

accuracy.append(accuracy_score(ytest,predictions))
f1.append(f1_score(ytest,predictions,average='micro'))

print('Accuracy Score: ', accuracy_score(ytest,predictions))
print('F1 Score: ', f1_score(ytest,predictions,average='micro'))
print(classification_report(ytest,predictions))

In [21]:
fig, ax = plt.subplots()
ax.barh(('Random Forest','SVM','Naive Bayes','MLP'),accuracy);
plt.title('Accuracy Score Comparison')
plt.xlabel('Accuracy Score')
plt.ylabel('Algorithm')

In [22]:
fig, ax = plt.subplots()
ax.barh(('Random Forest','SVM','Naive Bayes','MLP'),f1,color=['r','g','b','black'],alpha=0.7)
plt.title('F1 Score Comparison')
plt.xlabel('F1 Score')
plt.ylabel('Algorithm')
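
A note on reading the two charts together: in a single-label multiclass problem such as this one, micro-averaged F1 works out to the same number as accuracy, so the two comparisons are expected to look identical. The sketch below checks this on a small hypothetical set of labels using hand-written metric functions, not scikit-learn's, and shows that macro averaging, which weights every class equally, gives a different picture.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_counts(y_true, y_pred, label):
    # True positives, false positives and false negatives for one class
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_f1(y_true, y_pred):
    # Pool the counts over all classes, then compute a single F1
    labels = sorted(set(y_true) | set(y_pred))
    counts = [per_class_counts(y_true, y_pred, l) for l in labels]
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1_from_counts(tp, fp, fn)

def macro_f1(y_true, y_pred):
    # Compute F1 per class, then take the unweighted mean
    labels = sorted(set(y_true) | set(y_pred))
    scores = [f1_from_counts(*per_class_counts(y_true, y_pred, l)) for l in labels]
    return sum(scores) / len(scores)

# Hypothetical labels: class 1 is rare and is always missed
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]

print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred), macro_f1(y_true, y_pred))
```

With 46 classes of very different sizes, passing `average='macro'` or `average='weighted'` to `f1_score` may therefore be more informative than `'micro'`.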