# Disaster Predict
You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.

Introduction and References
This kernel walks through a few baseline approaches to NLP text classification. It includes code and ideas from the kernels listed below. If this kernel helps you, please upvote their work as well.

References: 
[Simple Exploration Notebook - QIQC](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) by [@sudalairajkumar](https://www.kaggle.com/sudalairajkumar)


# Load

In [64]:
! pip install eli5



In [65]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
import eli5
from scipy import sparse

In [66]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Analyze Data

In [67]:
train_df.shape

(7613, 5)

In [68]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [69]:
print("samples with disaster tweet:", sum(train_df['target']==1))
print("samples without disaster tweet:", sum(train_df['target']==0))

samples with disaster tweet: 3271
samples without disaster tweet: 4342
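The two counts above indicate a mild class imbalance. As a minimal sketch, the same information can be read as proportions with `value_counts(normalize=True)`; the series below is rebuilt from the printed counts rather than loaded from `train.csv`, so it is only a stand-in for `train_df['target']`.

```python
import pandas as pd

# Stand-in for train_df["target"], rebuilt from the counts printed above
# (3271 disaster tweets, 4342 non-disaster tweets).
target = pd.Series([1] * 3271 + [0] * 4342, name="target")

# normalize=True returns class proportions instead of raw counts.
class_share = target.value_counts(normalize=True)
print(class_share)
```

Roughly 43% of the training tweets are labelled as disasters, which is one reason F1 is a more informative score here than plain accuracy.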


In [70]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


#1 Base Model using BOW (Bag of Words)/Count Vectorizer

In [71]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [72]:
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]
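Each column of the vector above corresponds to one vocabulary word, and each value is how often that word occurs in the tweet. A minimal sketch on two made-up sentences (not taken from the dataset) shows the mapping between columns and words:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, used only for illustration.
docs = ["forest fire near the lake", "fire fire everywhere"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

The second row has a 2 in the `fire` column because the word appears twice in that sentence.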


In [73]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])
test_vectors = count_vectorizer.transform(test_df["text"])

In [74]:
clf = linear_model.RidgeClassifier()

Cross Validation

In [75]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.59453669, 0.56498283, 0.64082434])
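The scores above are F1 values, the harmonic mean of precision and recall for the positive (disaster) class. As a minimal sketch with made-up labels and predictions, the metric can be computed by hand:

```python
import numpy as np

# Made-up labels and predictions, for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```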

Fit the model

In [76]:
clf.fit(train_vectors, train_df["target"])

RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, random_state=None,
                solver='auto', tol=0.001)

Submission

In [77]:
sample_submission = pd.read_csv("sample_submission.csv")

In [78]:
sample_submission["target"] = clf.predict(test_vectors)

In [79]:
sample_submission.head()

Unnamed: 0,id,target
0,0,0
1,2,1
2,3,1
3,9,0
4,11,1


In [80]:
sample_submission.to_csv("submission_cv.csv", index=False)

#2 TF-IDF Approach

In [81]:
tfidf_vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.5, use_idf=True)

## let's get TF-IDF weights for the first 5 tweets in the data
example_train_vectors = tfidf_vectorizer.fit_transform(train_df["text"][0:5])

In [82]:
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 7)
[[0.57735027 0.57735027 0.         0.         0.         0.57735027
  0.        ]]
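The 0.57735027 entries above equal 1/sqrt(3), which is what L2 normalization produces for a row with three equally weighted terms. The sketch below reproduces the weighting by hand on a made-up count matrix, assuming scikit-learn's documented defaults (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization); it is an illustration, not the library's actual implementation.

```python
import numpy as np

# Made-up term-count matrix: 3 documents x 4 terms.
counts = np.array([[1, 1, 0, 0],
                   [0, 2, 1, 0],
                   [0, 1, 0, 3]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)             # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency

tfidf = counts * idf                             # weight raw counts by idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row
print(np.round(tfidf, 3))
```

A term that occurs in every document (the second column here) gets the minimum IDF of 1, so rarer terms dominate each row after normalization.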


In [83]:
train_vectors = tfidf_vectorizer.fit_transform(train_df["text"])
test_vectors = tfidf_vectorizer.transform(test_df["text"])

In [84]:
clf = linear_model.RidgeClassifier()

Cross Validation

In [85]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.62604807, 0.60780287, 0.66869301])
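In the cells above the vectorizer is fitted on the full training text before cross-validation, so each validation fold contributes to the vocabulary and IDF statistics. One common alternative, sketched here on a made-up toy corpus, is to wrap the vectorizer and classifier in a pipeline so that both are refitted inside every fold. The texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up toy corpus: 6 disaster-like and 6 everyday texts.
texts = [
    "forest fire spreading fast", "flood warning for the town",
    "earthquake hits the city", "wildfire evacuation ordered",
    "storm damage and flood", "fire crews fight wildfire",
    "lovely sunny day today", "great coffee this morning",
    "my new phone is on fire sale", "watching a movie tonight",
    "sunny walk in the park", "great day for coffee",
]
labels = [1] * 6 + [0] * 6

# The vectorizer is refitted on the training part of each fold.
pipeline = make_pipeline(TfidfVectorizer(), RidgeClassifier())
scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="f1")
print(scores)
```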

Fit the model

In [86]:
clf.fit(train_vectors, train_df["target"])

RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, random_state=None,
                solver='auto', tol=0.001)

Submission

In [87]:
sample_submission = pd.read_csv("sample_submission.csv")

In [88]:
sample_submission["target"] = clf.predict(test_vectors)

In [89]:
sample_submission.head()

Unnamed: 0,id,target
0,0,1
1,2,0
2,3,1
3,9,0
4,11,1


In [90]:
sample_submission.to_csv("submission_tfidf.csv", index=False)

#3 TF-IDF Using RF Classifier

In [91]:
train_vectors = tfidf_vectorizer.fit_transform(train_df["text"])
test_vectors = tfidf_vectorizer.transform(test_df["text"])

In [92]:
clf = RandomForestClassifier(class_weight="balanced")
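With `class_weight="balanced"`, scikit-learn documents the per-class weight as `n_samples / (n_classes * count_of_class)`. A minimal sketch applying that formula to the class counts printed earlier in this notebook:

```python
import numpy as np

# Class counts from the training data: index 0 = not disaster, 1 = disaster.
class_counts = np.array([4342, 3271])

n_samples = class_counts.sum()
n_classes = len(class_counts)

# "balanced" weights: inversely proportional to class frequency.
weights = n_samples / (n_classes * class_counts)
print(weights)
```

The minority (disaster) class receives the larger weight, so errors on it cost more during training.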

Cross Validation using RF

In [93]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.57523148, 0.54974123, 0.64640324])

In [94]:
clf.fit(train_vectors, train_df["target"])

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Submission

In [95]:
sample_submission = pd.read_csv("sample_submission.csv")
sample_submission["target"] = clf.predict(test_vectors)
sample_submission.to_csv("submission_tfidf_rf.csv", index=False)

#4 TF-IDF -> LSA

In [96]:
tfidf_vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1,2), use_idf=True, max_features= 2000)

In [97]:
train_vectors = tfidf_vectorizer.fit_transform(train_df["text"])

In [98]:
# Project the TF-IDF vectors onto the first 1000 SVD components (LSA).
# This halves the 2000 TF-IDF features while keeping most of their variance
# (see the explained-variance figure printed below).
svd = TruncatedSVD(1000)
lsa = make_pipeline(svd, Normalizer(copy=False))

# Run SVD on the training data, then project the training data.
train_vectors = lsa.fit_transform(train_vectors)

explained_variance = svd.explained_variance_ratio_.sum()
print("  Explained variance of the SVD step: {}%".format(int(explained_variance * 100)))

  Explained variance of the SVD step: 90%


In [99]:
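# Reuse the vectorizer and LSA projection fitted on the training data;
# refitting them on the test set would put it in a different feature space.
test_vectors = tfidf_vectorizer.transform(test_df["text"])
test_vectors = lsa.transform(test_vectors)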
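For the classifier's coefficients to be meaningful on the test set, the test vectors need to live in the same feature space as the training vectors. A minimal sketch of that fit-on-train, transform-on-test pattern, using made-up toy texts and only 2 components:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Made-up toy corpora standing in for the train and test tweets.
train_texts = ["forest fire near the lake", "flood warning for the town",
               "lovely sunny day today", "great day at the lake"]
test_texts = ["fire warning today", "sunny day at the town"]

vectorizer = TfidfVectorizer()
lsa = make_pipeline(TruncatedSVD(n_components=2, random_state=0),
                    Normalizer(copy=False))

# Learn the vocabulary and the SVD components from the training texts only.
train_lsa = lsa.fit_transform(vectorizer.fit_transform(train_texts))

# Reuse both fitted steps on the test texts.
test_lsa = lsa.transform(vectorizer.transform(test_texts))
print(train_lsa.shape, test_lsa.shape)
```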

In [100]:
clf = linear_model.RidgeClassifier()

Cross Validation

In [101]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.63259817, 0.58575727, 0.64936337])

Fit the model

In [102]:
clf.fit(train_vectors, train_df["target"])

RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, random_state=None,
                solver='auto', tol=0.001)

Submission

In [103]:
sample_submission = pd.read_csv("sample_submission.csv")

In [104]:
sample_submission["target"] = clf.predict(test_vectors)

In [105]:
sample_submission.head()

Unnamed: 0,id,target
0,0,0
1,2,1
2,3,0
3,9,0
4,11,1


In [106]:
sample_submission.to_csv("submission_tfidf_lsa.csv", index=False)

#5 Adding meta data points along with tf-idf using logistic regression

tf-idf vectorisation

In [107]:
tfidf_vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.5, use_idf=True)

In [108]:
tfidf_vectorizer.fit_transform(train_df['text'].values.tolist() + test_df['text'].values.tolist())
train_vectors = tfidf_vectorizer.transform(train_df['text'].values.tolist())
test_vectors = tfidf_vectorizer.transform(test_df['text'].values.tolist())

In [109]:
train_vectors

<7613x24543 sparse matrix of type '<class 'numpy.float64'>'
	with 149792 stored elements in Compressed Sparse Row format>

In [110]:
train_vectors=train_vectors.toarray()
test_vectors=test_vectors.toarray()

In [111]:
train_vectors.shape

(7613, 24543)

Adding Meta Features

In [112]:
from wordcloud import WordCloud, STOPWORDS
import string
stopwords = set(STOPWORDS)
more_stopwords = {'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown', '-', '...', '|', '&', '?', '??', 'via', '2'}
stopwords = stopwords.union(more_stopwords)
## Number of words in the text ##
train_df["num_words"] = train_df["text"].apply(lambda x: len(str(x).split()))
test_df["num_words"] = test_df["text"].apply(lambda x: len(str(x).split()))

## Number of unique words in the text ##
train_df["num_unique_words"] = train_df["text"].apply(lambda x: len(set(str(x).split())))
test_df["num_unique_words"] = test_df["text"].apply(lambda x: len(set(str(x).split())))

## Number of characters in the text ##
train_df["num_chars"] = train_df["text"].apply(lambda x: len(str(x)))
test_df["num_chars"] = test_df["text"].apply(lambda x: len(str(x)))

## Number of stopwords in the text ##
train_df["num_stopwords"] = train_df["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in stopwords]))
test_df["num_stopwords"] = test_df["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in stopwords]))

## Number of punctuations in the text ##
train_df["num_punctuations"] =train_df['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
test_df["num_punctuations"] =test_df['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )

## Number of upper case words in the text ##
train_df["num_words_upper"] = train_df["text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
test_df["num_words_upper"] = test_df["text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

## Number of title case words in the text ##
train_df["num_words_title"] = train_df["text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
test_df["num_words_title"] = test_df["text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))

## Average length of the words in the text ##
train_df["mean_word_len"] = train_df["text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test_df["mean_word_len"] = test_df["text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
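To make the meta features concrete, the sketch below computes most of them for a single tweet from the training sample shown earlier. The helper name `meta_features` is hypothetical, the stopword count is left out to avoid the `wordcloud` dependency, and returning 0.0 for an empty text is a choice made here (the `np.mean` call above would yield `nan` instead).

```python
import string

def meta_features(text):
    """Hypothetical helper: per-tweet meta features mirroring the cell above."""
    text = str(text)
    words = text.split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "num_chars": len(text),
        "num_punctuations": sum(c in string.punctuation for c in text),
        "num_words_upper": sum(w.isupper() for w in words),
        "num_words_title": sum(w.istitle() for w in words),
        # Guard against empty text to avoid dividing by zero.
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

features = meta_features("Forest fire near La Ronge Sask. Canada")
print(features)
```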

In [113]:
for col in ['num_words', 'num_unique_words', 'num_chars', 'num_stopwords', 'num_punctuations', 'num_words_upper', 'num_words_title', 'mean_word_len']:
  train_vectors = np.append(train_vectors, train_df[col].values.reshape(-1,1), axis=1)
  test_vectors = np.append(test_vectors, test_df[col].values.reshape(-1,1), axis=1)

In [114]:
train_vectors = sparse.csr_matrix(train_vectors)
test_vectors = sparse.csr_matrix(test_vectors)

In [115]:
train_vectors

<7613x24551 sparse matrix of type '<class 'numpy.float64'>'
	with 204615 stored elements in Compressed Sparse Row format>
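The cells above convert the TF-IDF matrix to a dense array, append the meta columns, and convert back to sparse. A possible alternative, sketched here with tiny made-up matrices, is `scipy.sparse.hstack`, which appends columns without ever materializing the dense matrix:

```python
import numpy as np
from scipy import sparse

# Tiny made-up stand-ins for the TF-IDF matrix and one meta-feature column.
tfidf_part = sparse.csr_matrix(np.array([[0.0, 1.0],
                                         [2.0, 0.0]]))
meta_part = np.array([[3.0],
                      [4.0]])

# Stack column-wise and keep the result in CSR format.
combined = sparse.hstack([tfidf_part, sparse.csr_matrix(meta_part)], format="csr")
print(combined.shape)
print(combined.toarray())
```

For a 7613 x 24543 matrix this avoids allocating well over a gigabyte of mostly zero float64 values.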

Cross Validation

In [116]:
from sklearn import metrics
train_y = train_df["target"].values

def runModel(train_X, train_y, test_X, test_y, test_X2):
    model = linear_model.LogisticRegression(C=5., solver='sag')
    model.fit(train_X, train_y)
    pred_test_y = model.predict_proba(test_X)[:,1]
    pred_test_y2 = model.predict_proba(test_X2)[:,1]
    return pred_test_y, pred_test_y2, model

print("Building model.")
cv_scores = []
pred_full_test = 0
pred_train = np.zeros([train_df.shape[0]])
kf = model_selection.KFold(n_splits=3, shuffle=True, random_state=2017)
for dev_index, val_index in kf.split(train_df):
    dev_X, val_X = train_vectors[dev_index], train_vectors[val_index]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    pred_val_y, pred_test_y, model = runModel(dev_X, dev_y, val_X, val_y, test_vectors)
    pred_full_test = pred_full_test + pred_test_y
    pred_train[val_index] = pred_val_y
    cv_scores.append(metrics.log_loss(val_y, pred_val_y))

Building model.




In [117]:
cv_scores

[0.6235709233315329, 0.6285653804503707, 0.6439376119911367]
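The fold scores above are log losses, where lower is better. For reference, a model that always predicts a probability of 0.5 scores ln 2 (about 0.693), so values of 0.62 to 0.64 are only modestly better than that baseline. A minimal sketch of the metric with made-up predictions:

```python
import numpy as np

def binary_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    proba = np.clip(proba, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(proba) + (1 - y_true) * np.log(1 - proba)))

# Made-up labels and two sets of predicted probabilities.
y_true = np.array([1, 0, 1, 0])
uninformative = np.full(4, 0.5)
confident = np.array([0.9, 0.1, 0.8, 0.2])

baseline_loss = binary_log_loss(y_true, uninformative)
confident_loss = binary_log_loss(y_true, confident)
print(baseline_loss, confident_loss)
```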

In [118]:
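# Note: val_y and pred_val_y hold only the last CV fold, so this scan covers about a third of the training data.
for thresh in np.arange(0.3, 0.55, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_val_y>thresh).astype(int))))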

F1 score at threshold 0.3 is 0.6342592592592593
F1 score at threshold 0.31 is 0.6372218476062036
F1 score at threshold 0.32 is 0.6388984509466438
F1 score at threshold 0.33 is 0.6404770256050508
F1 score at threshold 0.34 is 0.6385714285714287
F1 score at threshold 0.35 is 0.6432452010141253
F1 score at threshold 0.36 is 0.6440677966101696
F1 score at threshold 0.37 is 0.6436349981224183
F1 score at threshold 0.38 is 0.6434583014537107
F1 score at threshold 0.39 is 0.6345852895148669
F1 score at threshold 0.4 is 0.633306645316253
F1 score at threshold 0.41 is 0.6217105263157895
F1 score at threshold 0.42 is 0.610482180293501
F1 score at threshold 0.43 is 0.6034997865983781
F1 score at threshold 0.44 is 0.6005221932114883
F1 score at threshold 0.45 is 0.5890778871978514
F1 score at threshold 0.46 is 0.5796439981743495
F1 score at threshold 0.47 is 0.5640785781103836
F1 score at threshold 0.48 is 0.54519368723099
F1 score at threshold 0.49 is 0.523410547067521
F1 score at threshold 0.5 i
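Rather than reading the best threshold off the printed list, it can be picked programmatically. The sketch below does this on made-up labels and probabilities with a hand-written F1 helper (`f1_at` is a hypothetical name); in the notebook, the out-of-fold predictions in `pred_train` together with `train_y` could presumably play the same role.

```python
import numpy as np

def f1_at(y_true, proba, thresh):
    """F1 of the positive class when predicting 1 for proba > thresh."""
    pred = (proba > thresh).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1, 1, 0])
proba = np.array([0.10, 0.355, 0.405, 0.80, 0.455, 0.20])

thresholds = np.round(np.arange(0.3, 0.55, 0.01), 2)
f1_scores = np.array([f1_at(y_true, proba, t) for t in thresholds])

best = int(np.argmax(f1_scores))  # first threshold reaching the maximum F1
best_thresh, best_f1 = thresholds[best], f1_scores[best]
print(best_thresh, best_f1)
```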

Submission

In [119]:
sample_submission = pd.read_csv("sample_submission.csv")

In [120]:
# Predict probabilities for all test rows at once, then apply the 0.44 threshold.
sample_submission['target'] = (model.predict_proba(test_vectors)[:, 1] > 0.44).astype(int)

In [121]:
sample_submission.head()

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0


In [122]:
sample_submission.to_csv("submission_tfidf_logistic.csv", index=False)

In [123]:
eli5.show_weights(model, vec=tfidf_vectorizer, top=100, feature_filter=lambda x: x != '<BIAS>')

Weight?,Feature
+0.180,in
+0.096,fire
+0.083,california
+0.076,suicide
+0.075,hiroshima
+0.071,at
+0.061,crash
+0.060,disaster
+0.060,http co
+0.060,mh370


# Conclusion

TF-IDF with logistic regression worked well, reaching a leaderboard (LB) score of 0.80416. As it turned out, the meta features did not help the model improve.
We will work on a BERT model in another notebook.