This problem is taken from Kaggle, where it is known as NLP with Disaster Tweets. The task is to predict whether a tweet describes a real disaster (target 1) or not (target 0).
Link to the problem: https://www.kaggle.com/c/nlp-getting-started/overview


First, import the required libraries.

In [64]:
import numpy as np
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

Let's load the data.

In [65]:
train_data = pd.read_csv("/Users/rohankilledar/Documents/projects/Kaggle/NLP_with_disaster_tweets/train.csv")

# The labels of the test set are unknown, as part of the competition.
test_data = pd.read_csv("/Users/rohankilledar/Documents/projects/Kaggle/NLP_with_disaster_tweets/test.csv")

Let's have a look at the data at hand.

In [66]:
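# Print the second non-disaster tweet (target 0), the second disaster tweet (target 1), and the row count.
print(train_data[train_data["target"]==0]["text"].values[1])
print(train_data[train_data["target"]==1]["text"].values[1])
print(len(train_data))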


I love fruits
Forest fire near La Ronge Sask. Canada
7613


In [67]:
import string

from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords

# Lowercase every tweet, then split each tweet into tokens.
all_text = [tweet.lower() for tweet in train_data["text"]]
tokens = [word_tokenize(tweet) for tweet in all_text]

# Drop English stopwords and punctuation tokens.
# A set keeps the membership test fast.
en_stopwords = set(stopwords.words("english"))

clean_tokens = []

for tweet in tokens:
    token_wo_sw = [word for word in tweet if word not in en_stopwords and word not in string.punctuation]
    clean_tokens.extend(token_wo_sw)

# Count how often each remaining token occurs across the training tweets.
word_freq = FreqDist(clean_tokens)
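
In NLTK 3, FreqDist is a subclass of collections.Counter, so the most frequent tokens can be inspected with most_common. Below is a minimal sketch of that idea; sample_tokens is a made-up stand-in for clean_tokens, and plain Counter is used so the snippet runs without the NLTK corpora.

```python
from collections import Counter

# Hypothetical stand-in for clean_tokens; the real list comes from the cell above.
sample_tokens = ["fire", "forest", "fire", "flood", "fire", "flood", "love"]

# Counter exposes the same most_common() interface that FreqDist inherits.
sample_freq = Counter(sample_tokens)

# The two most frequent tokens with their counts.
top_two = sample_freq.most_common(2)
print(top_two)

# The number of distinct tokens, analogous to len(dictFD) in the notebook.
print(len(sample_freq))
```

On the real data, word_freq.most_common(20) would be the equivalent call.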



In [68]:
dictFD = dict(word_freq)
print("total number of unique tokens in the training set are: "+ str(len(dictFD)))


total number of unique tokens in the training set are: 22891


We need to split the training data into a training set and a validation set.
I'll split them in a 3:1 ratio, taking the first 75% of the rows for training and the remaining 25% for validation.


In [69]:
def divide_ratio(data,ratio):
    leng = len(data)
    splitter = int(leng*ratio)
    return data[:splitter],data[splitter:]

ratio = 0.75
train_text,validation_text = divide_ratio(all_text,ratio)
train_label, validation_label = divide_ratio(train_data["target"].tolist(),ratio)
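
divide_ratio cuts the data in its original order, so the validation set is simply the last quarter of the file. If the rows are ordered in some way, a shuffled, stratified split may be more representative. Here is a sketch using scikit-learn's train_test_split; the toy texts and labels are invented for illustration, and the scores reported in this notebook come from the sequential split, not from this one.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for all_text and train_data["target"].
toy_texts = ["fire downtown", "i love fruits", "flood warning", "nice day",
             "earthquake hits", "what a goal", "wildfire spreads", "new phone"]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Shuffle before splitting and keep the class proportions equal in both parts.
tr_text, val_text, tr_label, val_label = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

print(len(tr_text), len(val_text))
print(sorted(val_label))
```

Fixing random_state keeps the split reproducible between runs.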



We can use TF-IDF to vectorize the text, keeping at most 1000 features (max_feature_number).

In [70]:
max_feature_number = 1000
train_vectorizer = TfidfVectorizer(max_features=max_feature_number)
train_vecs = train_vectorizer.fit_transform(train_text)
validation_vecs = TfidfVectorizer(max_features=max_feature_number,vocabulary=train_vectorizer.vocabulary_).fit_transform(validation_text)
test_vecs = TfidfVectorizer(max_features=max_feature_number,vocabulary=train_vectorizer.vocabulary_).fit_transform(test_data["text"].tolist())
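
The cell above builds new vectorizers for the validation and test text. They share the training vocabulary, but calling fit_transform on them recomputes the IDF weights from the validation or test tweets. A common alternative is to fit a single vectorizer on the training text and call transform on everything else, so that the IDF weights come from the training data only. Below is a minimal sketch with invented toy corpora; the scores reported in this notebook were produced by the cell above, not by this variant.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for train_text and validation_text.
toy_train = ["forest fire near the town", "i love fruits", "flood in the town"]
toy_validation = ["fire and flood reported", "fruits are great"]

toy_vectorizer = TfidfVectorizer(max_features=1000)

# fit_transform learns the vocabulary and the IDF weights from the training text.
toy_train_vecs = toy_vectorizer.fit_transform(toy_train)

# transform reuses both unchanged; words unseen during fitting are ignored.
toy_validation_vecs = toy_vectorizer.transform(toy_validation)

print(toy_train_vecs.shape, toy_validation_vecs.shape)
```

Both matrices have the same number of columns, so a classifier trained on the first can score the second directly.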

Let's try logistic regression to predict the label: 0 or 1.

In [71]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Train the model on the training vectors.
clf = LogisticRegression().fit(train_vecs, train_label)
# Predict on the held-out validation set (the test labels are unknown).
validation_pred = clf.predict(validation_vecs)

acc = accuracy_score(validation_label, validation_pred)
# Macro averaging gives both classes equal weight.
pre, rec, f1, _ = precision_recall_fscore_support(validation_label, validation_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)

acc 0.7547268907563025
precision 0.7612253148326893
rec 0.7433568662058145
f1 0.7456856965937189
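
With average='macro', precision, recall and F1 are computed separately for each class and then averaged with equal weight. The sketch below reproduces that arithmetic in plain Python on a handful of invented labels; per_class_scores is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
def per_class_scores(y_true, y_pred, cls):
    """Precision, recall and F1 when `cls` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels, not taken from the competition data.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# One (precision, recall, f1) tuple per class, then an unweighted mean per metric.
scores = [per_class_scores(toy_true, toy_pred, cls) for cls in (0, 1)]
macro_precision, macro_recall, macro_f1 = (sum(col) / len(col) for col in zip(*scores))

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(toy_accuracy, macro_precision, macro_recall, macro_f1)
```

Note that the macro F1 is the mean of the per-class F1 values, not the F1 of the macro precision and macro recall.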


Add the predictions for the test.csv dataset to the sample submission file.

In [75]:
sample_submission = pd.read_csv("/Users/rohankilledar/Documents/projects/Kaggle/NLP_with_disaster_tweets/sample_submission.csv")
sample_submission["target"] = clf.predict(test_vecs)
sample_submission.head()

   id  target
0   0       1
1   2       0
2   3       1
3   9       0
4  11       1

In [74]:
sample_submission.to_csv("submission.csv", index=False)