In this notebook we will see how a simple bag-of-words approach (simple compared with distributional word embeddings or contextualized language models), combined with classic machine learning models, performs on the task of disaster classification of tweets.

In [None]:
import numpy as np 
import pandas as pd 
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import string
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


Reading training and testing data from the csv files.

In [None]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
train_df.head()

In this notebook we will only be using data from the text field as features.

In [None]:
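# Pass the column names by keyword; the positional axis argument was removed in pandas 2.0
train_df = train_df.drop(columns=["keyword", "location"])
test_df = test_df.drop(columns=["keyword", "location"])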
train_df.head()

To preprocess the text we will transform all letters to lower case, remove punctuation, tokenize, filter out stopwords and numbers and stem the remaining tokens.

In [None]:
ps=PorterStemmer()
stop_words = set(stopwords.words("english"))
translate_table = dict((ord(char), None) for char in string.punctuation) 

def preprocess(text):
    text = text.lower()
    text = text.translate(translate_table)
    tokens = word_tokenize(text)
    tokens = [ps.stem(token) for token in tokens if token.isalpha() and token not in stop_words]
    tokens = " ".join(tokens)
        
    return tokens
    

In [None]:
#example of the desired preprocessed tokens
preprocess("Oh my god there was a 7.2 #earthquake in Lisbon")
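
As a rough illustration of the first steps only, the sketch below applies lower-casing, punctuation removal and the alphabetic filter to the same sentence using just the standard library. It uses a plain whitespace split in place of `word_tokenize` and skips stopword removal and stemming, so its output is not what `preprocess` returns; the helper name `strip_and_filter` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character
punct_table = str.maketrans("", "", string.punctuation)

def strip_and_filter(text):
    """Lower-case, delete punctuation, split on whitespace, keep alphabetic tokens.

    Hypothetical helper: no stopword removal and no stemming is done here.
    """
    text = text.lower().translate(punct_table)
    return [token for token in text.split() if token.isalpha()]

tokens = strip_and_filter("Oh my god there was a 7.2 #earthquake in Lisbon")
print(tokens)
```

Removing punctuation first turns `7.2` into `72`, which the `isalpha()` filter then drops, and `#earthquake` into `earthquake`.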

In [None]:
#preprocessing the text data
train_df['text'] = train_df['text'].apply(preprocess) 
test_df['text'] = test_df['text'].apply(preprocess) 

In [None]:
#resulting data
train_df['text']

At this stage we will use a bag-of-words approach with TF-IDF weighting to turn each tweet into a vector. We limit the vocabulary to the 4000 terms with the highest frequency across the corpus.

In [None]:
vectorizer = TfidfVectorizer(max_features=4000)
X = vectorizer.fit_transform(train_df['text'])
X.shape
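
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three made-up documents. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The `tfidf` helper and the toy documents are ours for illustration, and whitespace splitting stands in for the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) where each row holds L2-normalised TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed to match sklearn's default)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

docs = ["forest fire near town", "fire alarm test", "love this town"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:>7s} {weight:.3f}")
```

In the first document, `fire` and `town` each appear in two of the three documents, so they receive a lower weight than `forest` and `near`, which appear in only one.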

To decide which model to submit, we will compare a few by splitting the training data into a training subset (80%) and a testing subset (20%).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, train_df['target'], test_size = 0.2)



We will train Logistic Regression, Support Vector Machine and Random Forest models with sklearn's default parameters.

In [None]:
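# Fit each classifier with scikit-learn's default hyperparameters
LR_model = LogisticRegression().fit(X_train, y_train)
SVM_model = svm.SVC().fit(X_train, y_train)
RF_model = RandomForestClassifier().fit(X_train, y_train)
# Predict on the held-out subset (the notebook only displays the last expression)
LR_model.predict(X_test)
SVM_model.predict(X_test)
RF_model.predict(X_test)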

Now we can test their performance.

In [None]:
# score() returns the mean accuracy on the held-out subset
LR_score = LR_model.score(X_test, y_test)
SVM_score = SVM_model.score(X_test, y_test)
RF_score = RF_model.score(X_test, y_test)
print(LR_score, SVM_score, RF_score)
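
For scikit-learn classifiers, `score` returns the mean accuracy. To our knowledge the competition ranks submissions by F1 rather than accuracy, so it may be worth looking at both. The sketch below computes the two metrics by hand on made-up labels; the `accuracy_and_f1` helper and the label lists are hypothetical and are not results from this notebook.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1 for the positive class 1) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Made-up labels: 1 = disaster tweet, 0 = not a disaster
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)  # 0.75 0.5
```

Here accuracy is 0.75 while F1 is only 0.5, because one of the two disaster tweets is missed and one false alarm is raised. The same comparison on the real split could be made with `sklearn.metrics.f1_score(y_test, model.predict(X_test))`.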

In most tests the results were very close, but the SVM tended to achieve the best score.

Once we have decided which model to use, we retrain it on all the training samples.

We may now transform the competition's test set into a document-term matrix, to be able to make new predictions.

In [None]:
SVM_model = svm.SVC().fit(X, train_df["target"])
test_X= vectorizer.transform(test_df['text'])
predictions = SVM_model.predict(test_X)
predictions

Now we can create a dataframe containing each testing sample's id and predicted target, and put it into a csv file to submit.

In [None]:
submission_df = pd.DataFrame(test_df["id"], columns=["id"])
submission_df["target"] = predictions
submission_df

In [None]:
submission_df.to_csv('submission.csv', header=True, index=False)