<a href="https://www.kaggle.com/code/mkubina/disaster-tweets-1?scriptVersionId=168050297" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import numpy as np # linear algebra

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/nlp-getting-started/sample_submission.csv
/kaggle/input/nlp-getting-started/train.csv
/kaggle/input/nlp-getting-started/test.csv


# Predicting disaster tweets

This is my first attempt at predicting which tweets are about actual disasters and which are not, for the [competition](http://), using models I am learning here on Kaggle and on Codecademy.

For now, just a basic random forest.

## Loading and inspecting data

In [2]:
import pandas as pd

train_df = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test_df = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')

print(train_df.head())
print(test_df.head())

   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  
0       1  
1       1  
2       1  
3       1  
4       1  
   id keyword location                                               text
0   0     NaN      NaN                 Just happened a terrible car crash
1   2     NaN      NaN  Heard about #earthquake is different cities, s...
2   3     NaN      NaN  there is a forest fire at spot pond, geese are...
3   9     NaN      NaN           Apocalypse lighting. #Spokane #wildfires
4  11     NaN      NaN      Typhoon Soudelor kills 28 in China and Taiwan


In [3]:
print(train_df.shape)
print('Any nulls\n', train_df.isnull().sum())

(7613, 5)
Any nulls
 id             0
keyword       61
location    2533
text           0
target         0
dtype: int64


Let's keep things simple at first and keep only the ids, texts and targets.

In [4]:
train_df = train_df[['id', 'text', 'target']].copy()  # explicit copy, so later column assignments are unambiguous
train_df.head()

| | id | text | target |
|---|---|---|---|
| 0 | 1 | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [5]:
print('Examples of disaster tweets:')
for text in train_df[train_df['target'] == 1]['text'][:5]:
    print(text)

print('\nExamples of non-disaster tweets:')
for text in train_df[train_df['target'] == 0]['text'][:5]:
    print(text)

Examples of disaster tweets:
Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Forest fire near La Ronge Sask. Canada
All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
13,000 people receive #wildfires evacuation orders in California 
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 

Examples of non-disaster tweets:
What's up man?
I love fruits
Summer is lovely
My car is so fast
What a goooooooaaaaaal!!!!!!


## Pre-processing the text column

In [6]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag

import re

### We will clean our text data here.
- Get rid of URLs.
- Tokenize on the word level.
- Remove common English stopwords to get rid of noise.
- Keep only alphanumeric tokens and lowercase everything.
- Join the list of tokens back into a string.

In [7]:
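# Build the stopword set once; set lookup is faster than scanning a list for every token.
STOPS = set(stopwords.words('english'))

def clean_words(text):
    """Remove hyperlinks, tokenize, drop stopwords and non-alphanumeric tokens, lowercase."""
    # Note: 'http.*' removes everything from the first URL to the end of the line,
    # including any words that follow the link.
    text = re.sub(r'http.*', '', text)
    tokens = word_tokenize(text)
    clean_tokens = [token.lower() for token in tokens
                    if token.lower() not in STOPS and token.isalnum()]
    return ' '.join(clean_tokens)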
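
The URL-removal step is worth a closer look. The sketch below uses a made-up tweet (not from the dataset) to compare the `http.*` pattern used in `clean_words` with an alternative, `http\S+`, that stops at the first whitespace. The alternative is only a suggestion; switching to it would change the cleaned text and therefore the scores reported later in this notebook.

```python
import re

# Hypothetical tweet with words on both sides of a link.
tweet = 'Flood warning http://t.co/abc123 stay away from the river'

# Pattern used in clean_words: matches from 'http' to the end of the line.
greedy = re.sub(r'http.*', '', tweet)

# Alternative: match only the URL itself, up to the next whitespace.
url_only = re.sub(r'http\S+', '', tweet)

print(repr(greedy))
print(repr(url_only))
```

With the first pattern the words after the link are lost; with the second they survive and only the link is dropped.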


In [8]:
# Pass the function directly; wrapping it in a lambda adds nothing.
train_df['text'] = train_df['text'].apply(clean_words)
test_df['text'] = test_df['text'].apply(clean_words)

train_df.head(10)

| | id | text | target |
|---|---|---|---|
| 0 | 1 | deeds reason earthquake may allah forgive us | 1 |
| 1 | 4 | forest fire near la ronge sask canada | 1 |
| 2 | 5 | residents asked place notified officers evacua... | 1 |
| 3 | 6 | people receive wildfires evacuation orders cal... | 1 |
| 4 | 7 | got sent photo ruby alaska smoke wildfires pou... | 1 |
| 5 | 8 | rockyfire update california hwy 20 closed dire... | 1 |
| 6 | 10 | flood disaster heavy rain causes flash floodin... | 1 |
| 7 | 13 | top hill see fire woods | 1 |
| 8 | 14 | emergency evacuation happening building across... | 1 |
| 9 | 15 | afraid tornado coming area | 1 |


## Encoding: turning the text column into numbers

As ML models need numerical data, we need to vectorize our strings.

One-hot encoding / getting dummies is not a good fit here, as it would produce an absurd number of new columns (one per distinct tweet).

The simplest suitable vectorization is bag of words, via scikit-learn's CountVectorizer.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

bag = CountVectorizer()

vectors_train = bag.fit_transform(train_df['text'])
vectors_test = bag.transform(test_df['text'])
print('this is first five strings:\n\n', train_df['text'][:5])
print('\n and this their vector representation:\n\n', vectors_train[:5])

this is first five strings:

 0         deeds reason earthquake may allah forgive us
1                forest fire near la ronge sask canada
2    residents asked place notified officers evacua...
3    people receive wildfires evacuation orders cal...
4    got sent photo ruby alaska smoke wildfires pou...
Name: text, dtype: object

 and this their vector representation:

   (0, 3765)	1
  (0, 11075)	1
  (0, 4462)	1
  (0, 8562)	1
  (0, 800)	1
  (0, 5461)	1
  (0, 14332)	1
  (1, 5452)	1
  (1, 5273)	1
  (1, 9297)	1
  (1, 7756)	1
  (1, 11572)	1
  (1, 11817)	1
  (1, 2386)	1
  (2, 11320)	1
  (2, 1177)	1
  (2, 10338)	2
  (2, 9517)	1
  (2, 9666)	1
  (2, 4851)	1
  (2, 12172)	1
  (2, 9827)	1
  (2, 4940)	1
  (3, 4851)	1
  (3, 9827)	1
  (3, 10155)	1
  (3, 11093)	1
  (3, 14871)	1
  (3, 2345)	1
  (4, 14871)	1
  (4, 5947)	1
  (4, 12037)	1
  (4, 10241)	1
  (4, 11646)	1
  (4, 756)	1
  (4, 12477)	1
  (4, 10515)	1
  (4, 11889)	1


## Applying simple ML model: random forest

We define our features (X, the vectorized tweets) and target (y) and split the set into train and validation subsets.

In [10]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

y = train_df['target']
X = vectors_train

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

We train a basic random forest and look at the F1 score. We have to round the predictions because `RandomForestRegressor` outputs continuous values (the average of its trees' predictions) rather than class labels.

In [11]:
from sklearn.metrics import f1_score

my_model = RandomForestRegressor(random_state=1)

my_model.fit(X_train, y_train)
print('trained')
predictions = my_model.predict(X_valid).round()
print('predicted')
f1 = f1_score(y_valid, predictions)
print(f'F1: {f1}')

trained
predicted
F1: 0.7433489827856025
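
As an aside, scikit-learn also provides `RandomForestClassifier`, which predicts class labels directly, so no rounding is needed. The sketch below is not what this notebook ran: it uses a tiny made-up count matrix (`X_toy`, `y_toy`) as a stand-in for the real bag-of-words features, and its results say nothing about the score on the actual data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Made-up stand-in for the bag-of-words matrix: 6 "tweets", 3 word-count columns.
X_toy = np.array([
    [2, 0, 0],
    [1, 0, 1],
    [3, 1, 0],
    [0, 2, 0],
    [0, 1, 1],
    [0, 3, 0],
])
y_toy = np.array([1, 1, 1, 0, 0, 0])

clf = RandomForestClassifier(n_estimators=50, random_state=1)
clf.fit(X_toy, y_toy)

# predict() returns the class labels themselves, with the integer dtype of y_toy.
toy_predictions = clf.predict(X_toy)
print(toy_predictions.dtype, toy_predictions)
print('F1 on the toy training data:', f1_score(y_toy, toy_predictions))
```

Because the predictions are already integer labels, they could be written to a submission file without the rounding step.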


In [12]:
print(y_valid[:10], predictions[:10])

1141    1
5444    0
3983    0
6944    1
5196    0
1246    0
44      0
2732    0
7069    0
699     0
Name: target, dtype: int64 [1. 1. 0. 1. 0. 1. 0. 0. 0. 0.]
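
To see what the F1 score measures, it can be recomputed by hand on just the ten labels and predictions printed above. This is a small illustration of the formula; the notebook's reported score comes from `f1_score` on the whole validation set, so the number here is different.

```python
# First ten validation labels and rounded predictions, copied from the output above.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

# Count true positives, false positives and false negatives for class 1.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # share of predicted disasters that were real
recall = tp / (tp + fn)     # share of real disasters that were found
f1_small = 2 * precision * recall / (precision + recall)

print(f'precision={precision:.2f} recall={recall:.2f} F1={f1_small:.3f}')
```

On these ten rows the model finds both real disasters but also raises two false alarms, so recall is perfect while precision is only one half.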


## Who knows? Let's predict on test data

Even though the rounded predictions were 1s and 0s, they were still floats (1.0 and 0.0), and the competition gave that submission a neat score of 0.00000. So we convert them to integers.

In [13]:
X_test = vectors_test
test_predictions = my_model.predict(X_test).round()
test_predictions = test_predictions.astype(int)
print(test_predictions[:100])

[1 1 1 0 1 1 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 0 0 1 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0]
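
The float-versus-integer difference is easy to miss because it only shows up in the written file. The sketch below uses made-up values (`raw`, and three ids) to show what lands in the CSV in each case.

```python
import numpy as np
import pandas as pd

# Hypothetical regressor outputs for three test tweets.
raw = np.array([0.8, 0.1, 0.6])
rounded = raw.round()  # values are 1., 0., 1. but the dtype is still float64

# Same predictions written as floats and as integers.
float_csv = pd.DataFrame({'id': [0, 2, 3], 'target': rounded}).to_csv(index=False)
int_csv = pd.DataFrame({'id': [0, 2, 3], 'target': rounded.astype(int)}).to_csv(index=False)

print(float_csv)
print(int_csv)
```

The first file contains `1.0` and `0.0` in the target column, the second contains `1` and `0`, which is the form the sample submission uses.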


## Composing submission

In [14]:
submission = pd.DataFrame({'id': test_df['id'],
                        'target': test_predictions})

submission.to_csv('submission_6_bag_forest_integerTargets.csv', index=False)