### Read data

In [79]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [80]:
import os
for dirname, _, filenames in os.walk(r'.\kaggle\input'):  # raw string: backslashes are not escape sequences
    for filename in filenames:
        print(os.path.join(dirname, filename))

.\kaggle\input\sample_submission.csv
.\kaggle\input\test.csv
.\kaggle\input\train.csv


In [81]:
train_df = pd.read_csv("./kaggle/input/train.csv")
test_df = pd.read_csv("./kaggle/input/test.csv")

### A quick look at our data


In [82]:
len(train_df)

7613

In [83]:
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


The `keyword` and `location` columns have some missing values and may need further preprocessing.

In [84]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
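
From the non-null counts above, `keyword` is missing in 7613 - 7552 = 61 rows and `location` in 7613 - 5080 = 2533 rows. A minimal sketch of how those counts could be obtained directly, using a small hypothetical frame (`toy_df`) in place of `train_df`:

```python
import pandas as pd

# Hypothetical miniature frame standing in for train_df (values are made up)
toy_df = pd.DataFrame({
    "keyword": ["fire", None, "flood"],
    "location": [None, None, "Canada"],
    "text": ["a", "b", "c"],
})

# Number of missing values per column
missing_counts = toy_df.isna().sum()
# Fraction of missing values per column
missing_share = toy_df.isna().mean()

print(missing_counts)
print(missing_share)
```

Running the same two calls on `train_df` would show how much of each column would need imputing or dropping.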


Let's look at some tweets that are not about disasters:

In [85]:
train_df[train_df["target"]==0]["text"].values

array(["What's up man?", 'I love fruits', 'Summer is lovely', ...,
       'These boxes are ready to explode! Exploding Kittens finally arrived! gameofkittens #explodingkittens\x89Û_ https://t.co/TFGrAyuDC5',
       'Sirens everywhere!',
       'I just heard a really loud bang and everyone is asleep great'],
      dtype=object)

In [86]:
train_df[train_df["target"] == 0]["text"].values[1]

'I love fruits'

Now let's look at some disaster tweets:

In [87]:
train_df[train_df["target"]==1]["text"].values

array(['Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
       'Forest fire near La Ronge Sask. Canada',
       "All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected",
       ...,
       'M1.94 [01:04 UTC]?5km S of Volcano Hawaii. http://t.co/zDtoyd8EbJ',
       'Police investigating after an e-bike collided with a car in Little Portugal. E-bike rider suffered serious non-life threatening injuries.',
       'The Latest: More Homes Razed by Northern California Wildfire - ABC News http://t.co/YmY4rSkQ3d'],
      dtype=object)

In [88]:
train_df[train_df["target"] == 1]["text"].values[1]

'Forest fire near La Ronge Sask. Canada'

### Building vectors

To kick-start a baseline model, let's use the count of each word in each tweet.
Below we use `CountVectorizer` to build the word-count matrix.

In [89]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 2 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:2])
print(train_df["text"][0:2].values)
print(count_vectorizer.get_feature_names_out())
print(count_vectorizer.vocabulary_)
print(example_train_vectors.toarray())
print(example_train_vectors.toarray().shape)

['Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'
 'Forest fire near La Ronge Sask. Canada']
['all' 'allah' 'are' 'canada' 'deeds' 'earthquake' 'fire' 'forest'
 'forgive' 'la' 'may' 'near' 'of' 'our' 'reason' 'ronge' 'sask' 'the'
 'this' 'us']
{'our': 13, 'deeds': 4, 'are': 2, 'the': 17, 'reason': 14, 'of': 12, 'this': 18, 'earthquake': 5, 'may': 10, 'allah': 1, 'forgive': 8, 'us': 19, 'all': 0, 'forest': 7, 'fire': 6, 'near': 11, 'la': 9, 'ronge': 15, 'sask': 16, 'canada': 3}
[[1 1 1 0 1 1 0 0 1 0 1 0 1 1 1 0 0 1 1 1]
 [0 0 0 1 0 0 1 1 0 1 0 1 0 0 0 1 1 0 0 0]]
(2, 20)


The output above tells us that there are 20 unique words (or "tokens") in the first two tweets.

Now let's create vectors for all of our tweets.

In [90]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])

## Note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that only the tokens learned from the train vectors are mapped to the test vectors,
# that is, the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test_df["text"])

### Our model

As we mentioned above, we think the words contained in each tweet are a good indicator of whether it is about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet describes a real disaster.

What we're assuming here is a _linear_ connection. So let's build a linear model and see!
Since our matrix is quite sparse (not every word will appear in every tweet), we're going to use `RidgeClassifier`. It is a linear classifier that uses L2 regularization to keep the weights from growing too large.
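
As a rough illustration of what the regularization does, the sketch below fits `RidgeClassifier` on synthetic data (generated with `make_classification`, an assumption made only for this example) and records the size of the weight vector for a few values of `alpha`. Larger `alpha` means a stronger penalty and smaller weights:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# Synthetic data used only for illustration, not the tweet vectors
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

norms = {}
for alpha in [0.1, 10.0, 1000.0]:
    model = RidgeClassifier(alpha=alpha).fit(X, y)
    # L2 norm of the learned weights; it shrinks as alpha grows
    norms[alpha] = float(np.linalg.norm(model.coef_))

print(norms)
```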

A walkthrough of building a model with hyperparameter tuning and cross-validation can be found at this link: https://www.youtube.com/watch?v=jY2v4q3TPbs

In [91]:
## Our vectors are really sparse, so we want to push our model's weights
## toward 0 without completely discounting different words - ridge regression 
## is a good way to do this.
from sklearn.linear_model import RidgeClassifier
clf = linear_model.RidgeClassifier()

Let's test our model and see how well it does on the training data. For this we'll use `cross-validation`, where we train on a portion of the known data and then validate on the rest. If we do this several times (with different portions), we get a good idea of how a particular model or method performs.

The metric for this competition is F1, so let's use that here.
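
When `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds without shuffling. If the rows happen to be stored in some order (an assumption that is not checked here), the folds can differ systematically, so it may be worth comparing against shuffled folds. A minimal sketch on synthetic data, with the fold count and `random_state` chosen arbitrarily:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the tweet vectors and targets
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Shuffled, stratified 3-fold split with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

results = cross_validate(
    RidgeClassifier(), X, y, cv=cv, scoring="f1", return_train_score=True
)
print("train F1 per fold:", results["train_score"])
print("validation F1 per fold:", results["test_score"])
```

A large gap between the train and validation scores per fold is the usual sign of overfitting.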

In [92]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(clf, train_vectors, train_df["target"], cv=3, scoring="f1", return_train_score=True)

In [93]:
print('train_score', cv_results['train_score'])
print('test_score', cv_results['test_score'])

train_score [0.99448529 0.99678899 0.99425683]
test_score [0.59453669 0.5642787  0.64082434]


The test scores above are around 0.6, which is not bad but can definitely be improved with more preprocessing and other models.

The train scores are also much higher than the test scores, which points to overfitting.

In [94]:
clf.fit(train_vectors, train_df["target"])

If we fit on the full training data and predict on that same data, the F1 score is nearly 1, which again points to overfitting.

In [95]:
from sklearn.metrics import confusion_matrix, f1_score

y_train_pred = clf.predict(train_vectors)
# Calculate the F1 score on the training data
f1_train = f1_score(train_df["target"], y_train_pred, average='binary')
# Print the F1 scores
print("F1 Score on Training Data:", f1_train)

# Confusion matrix
pd.DataFrame(confusion_matrix(train_df["target"], y_train_pred))

F1 Score on Training Data: 0.9946425838052962


| | 0 | 1 |
|---|---|---|
| 0 | 4329 | 13 |
| 1 | 22 | 3249 |
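
The F1 score printed above can be reproduced by hand from the confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the four counts read as follows:

```python
# Counts copied from the confusion matrix above
# (rows: true label, columns: predicted label)
tn, fp = 4329, 13    # true class 0
fn, tp = 22, 3249    # true class 1

precision = tp / (tp + fp)  # share of predicted disasters that are real
recall = tp / (tp + fn)     # share of real disasters that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.9946
```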


To address this, we can hold out validation and test splits and tune the regularization strength `alpha` so that the model generalizes better.

In [96]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

X_train, X_temp, y_train, y_temp = train_test_split(train_vectors, train_df["target"], test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Define the hyperparameter grid
param_grid = {'alpha': [0.1, 1, 10]}

clf = linear_model.RidgeClassifier()

# Create the GridSearchCV object
grid_search = GridSearchCV(clf, param_grid, cv=3, scoring='f1')

# Fit the GridSearchCV on the training split only (it cross-validates internally) to find the best hyperparameters
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_alpha = grid_search.best_params_['alpha']

# Train the model with the best hyperparameters on the entire training split
best_clf = RidgeClassifier(alpha=best_alpha)
best_clf.fit(X_train, y_train)

# Evaluate the model on the validation set
y_val_pred = best_clf.predict(X_val)
validation_f1 = f1_score(y_val, y_val_pred)

# Finally, evaluate the model on the test set
y_test_pred = best_clf.predict(X_test)
test_f1 = f1_score(y_test, y_test_pred)


In [97]:
print(validation_f1,test_f1)

0.7590759075907589 0.7651006711409395


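The tuned model reaches an F1 score of about 0.76 on both the validation and test splits. That is well below the near-perfect score on the training data, but it is a more realistic estimate of how the model generalizes.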
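
In the cells above, `count_vectorizer` is fitted on all of `train_df["text"]` before the split, so its vocabulary also covers the validation and test rows. One possible refinement, sketched here on a handful of made-up sentences, is to put the vectorizer and classifier in a `Pipeline` so that the vocabulary is learned only from the training folds during the grid search. The step names `vectorizer` and `classifier` are arbitrary choices for this example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up examples standing in for the tweets (1 = disaster, 0 = not)
texts = [
    "forest fire near the town", "flood warning for the valley",
    "earthquake shakes the city", "wildfire evacuation ordered",
    "storm damage reported downtown", "bridge collapsed after flood",
    "i love fruits", "summer is lovely", "what a great movie",
    "my new phone is fire", "this song is a bomb", "lunch was amazing",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", RidgeClassifier()),
])

# "<step name>__<parameter>" addresses a parameter inside the pipeline
param_grid = {"classifier__alpha": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(texts, labels)

# With the default refit=True, best_estimator_ is already refitted on all the data passed to fit()
best_model = search.best_estimator_
print(search.best_params_)
print(best_model.predict(["fire near the town"]))
```

Because `GridSearchCV` refits the best configuration by default, a separate retraining step would not be needed with this setup.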

Let's make predictions on the test set and build a submission for the competition.

In [98]:
sample_submission = pd.read_csv("./kaggle/input/sample_submission.csv")

In [99]:
sample_submission["target"] = best_clf.predict(test_vectors)

In [100]:
sample_submission.head()

| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 0 |
| 4 | 11 | 1 |


In [101]:
os.makedirs("./submission", exist_ok=True)  # make sure the output folder exists
sample_submission.to_csv("./submission/baseline.csv", index=False)