<a href="https://colab.research.google.com/github/robmaz22/Kaggle-competitions/blob/main/Natural_Language_Processing_with_Disaster_Tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Disaster tweets classification - project
##Competition [link](https://www.kaggle.com/c/nlp-getting-started)

###1. Download dataset

In [None]:
!pip install -q kaggle
from google.colab import files

files.upload()

# -p avoids the "File exists" error when this cell is re-run
!mkdir -p ~/.kaggle
# cp expects the file to be named exactly kaggle.json; a repeated upload is saved as "kaggle (1).json"
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c nlp-getting-started

Saving kaggle.json to kaggle (1).json
Downloading sample_submission.csv to /content
  0% 0.00/22.2k [00:00<?, ?B/s]
100% 22.2k/22.2k [00:00<00:00, 21.6MB/s]
Downloading test.csv to /content
  0% 0.00/411k [00:00<?, ?B/s]
100% 411k/411k [00:00<00:00, 56.0MB/s]
Downloading train.csv to /content
  0% 0.00/965k [00:00<?, ?B/s]
100% 965k/965k [00:00<00:00, 63.7MB/s]
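
The credential setup above can also be done in Python, which makes the step safe to re-run. This is a minimal sketch and not part of the original notebook: `install_kaggle_token` is a hypothetical helper, and it assumes the uploaded token is available locally as `kaggle.json`.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(source, config_dir=None):
    """Copy a Kaggle API token into config_dir (default: ~/.kaggle).

    Hypothetical helper for illustration; mirrors mkdir -p / cp / chmod 600.
    """
    source = Path(source)
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    # exist_ok=True plays the role of `mkdir -p`: no error on re-runs
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "kaggle.json"
    shutil.copyfile(source, target)
    # Owner read/write only, same as `chmod 600`
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

In the notebook this would be called as `install_kaggle_token("kaggle.json")` after the upload step.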


###2. Dataset analysis

In [None]:
import pandas as pd

In [None]:
train_df = pd.read_csv('train.csv')
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [None]:
len(train_df)

7613

In [None]:
train_df.isna().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64
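
To put the missing-value counts in proportion, they can be divided by the number of rows. A small sketch using only the figures printed above (7613 rows, with 61 and 2533 missing values):

```python
# Counts copied from the outputs above
n_rows = 7613
missing = {"keyword": 61, "location": 2533}

# Share of rows with a missing value, per column
missing_share = {col: count / n_rows for col, count in missing.items()}

for col, share in missing_share.items():
    print(f"{col}: {share:.1%}")
```

Under 1% of the keywords are missing, but about a third of the locations are. On the DataFrame itself, `train_df.isna().mean()` should give the same shares.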

###3. Preparation for training

In [None]:
data = train_df[['text', 'target']]
data.head()

| | text | target |
|---|---|---|
| 0 | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | Just got sent this photo from Ruby #Alaska as ... | 1 |


* Cleaning text

In [None]:
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources used by clean_text
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
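# Build these once; calling stopwords.words() for every token is very slow.
# With no language argument, stopwords.words() returns the lists for all languages.
STOP_WORDS = set(stopwords.words())
lemmatizer = WordNetLemmatizer()

def clean_text(sentence):
  tokens = nltk.word_tokenize(sentence)

  words = []
  for token in tokens:
    # The stop-word check is case-sensitive and runs before lowercasing,
    # so capitalized stop words such as "Our" are kept.
    if token not in STOP_WORDS and token not in string.punctuation:
      words.append(lemmatizer.lemmatize(token).lower())

  return ' '.join(words)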

In [None]:
print('Sentence:')
print(data['text'][0])
print('Cleaned:')
print(clean_text(data['text'][0]))

Sentence:
Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Cleaned:
our deeds reason earthquake may allah forgive u
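
The cleaned sentence still contains "our" because the stop-word check runs on the original casing and the lowercasing happens afterwards. A minimal sketch of the effect, using a toy stop-word set and plain whitespace splitting as stand-ins for NLTK's lists and tokenizer (both are assumptions made for illustration):

```python
# Toy stop-word set; NLTK's real lists are much longer
TOY_STOP_WORDS = {"our", "are", "the", "of", "this", "all"}


def filter_then_lower(sentence):
    """Check stop words on the original casing, lowercase afterwards."""
    return [w.lower() for w in sentence.split() if w not in TOY_STOP_WORDS]


def lower_then_filter(sentence):
    """Lowercase first, then check stop words."""
    return [w for w in sentence.lower().split() if w not in TOY_STOP_WORDS]


sentence = "Our Deeds are the Reason of this earthquake"
print(filter_then_lower(sentence))  # "our" survives the filter
print(lower_then_filter(sentence))  # "our" is removed
```

Whether lowercasing first is preferable depends on the task; it would change the cleaned text and therefore the scores reported later.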


In [None]:
X = data['text'].apply(clean_text)

In [None]:
X

0         our deeds reason earthquake may allah forgive u
1                   forest fire near la ronge sask canada
2       all resident asked 'shelter place notified off...
3       13,000 people receive wildfire evacuation orde...
4       just got sent photo ruby alaska smoke wildfire...
                              ...                        
7608    two giant crane holding bridge collapse nearby...
7609    aria_ahrary thetawniest the control wild fire ...
7610    m1.94 01:04 utc 5km s volcano hawaii http //t....
7611    police investigating e-bike collided car littl...
7612    the latest more homes razed northern californi...
Name: text, Length: 7613, dtype: object

In [None]:
y = data['target']

* Vectorizing and splitting data

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(X)
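
For intuition about what `TfidfVectorizer` produces, here is a pure-Python sketch of the weighting, assuming scikit-learn's default settings: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function `tfidf_rows` and the toy documents are hypothetical, and tokenization is simplified to whitespace splitting, so results on real text will differ from scikit-learn's.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, rows = tfidf_rows(["forest fire near town", "fire evacuation order"])
print(vocab)
print([round(w, 3) for w in rows[0]])
```

A term that appears in every document ("fire" here) gets the smallest weight in its row, which is the point of the IDF factor.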

In [None]:
from sklearn.model_selection import train_test_split

# Default split: 75% train / 25% validation; no random_state, so the split differs between runs
x_train, x_val, y_train, y_val = train_test_split(X, y)
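
Fitting the vectorizer on all rows before splitting lets the IDF weights see the validation texts. One possible alternative, sketched here on made-up toy sentences rather than the competition data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that the vectorizer is fitted on the training part only. The `toy_` names, `random_state`, and `stratify` are additions for this example and are not in the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up toy data (1 = disaster, 0 = not a disaster)
toy_texts = [
    "forest fire near the town", "flood destroys the bridge",
    "earthquake hits the city", "wildfire evacuation ordered",
    "great pizza for lunch", "lovely weather at the beach",
    "new phone arrived today", "watching a movie tonight",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps both classes in each part
toy_train, toy_val, toy_y_train, toy_y_val = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The vectorizer is fitted inside the pipeline, on the training texts only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(toy_train, toy_y_train)
toy_predictions = pipeline.predict(toy_val)
print(pipeline.score(toy_val, toy_y_val))
```

A fitted pipeline also accepts raw text at prediction time, so the separate `vectorizer.transform` call would no longer be needed for the test set.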

###4. Train model and evaluate

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(x_train, y_train)
model.score(x_val, y_val)

0.8040966386554622

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_val, model.predict(x_val)))

              precision    recall  f1-score   support

           0       0.78      0.92      0.85      1105
           1       0.86      0.64      0.73       799

    accuracy                           0.80      1904
   macro avg       0.82      0.78      0.79      1904
weighted avg       0.81      0.80      0.80      1904
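
The f1-score column is the harmonic mean of precision and recall. As a quick check, this sketch recomputes it for class 1 from the rounded values in the report; the helper name `f1_score_from` is made up for this example.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values for class 1, copied from the report above
f1_class_1 = f1_score_from(0.86, 0.64)
print(round(f1_class_1, 2))  # 0.73, matching the report

# The macro average is the unweighted mean of the per-class scores
macro_f1 = (0.85 + 0.73) / 2
print(round(macro_f1, 2))  # 0.79
```

The recall of 0.64 for class 1 means that roughly a third of the real disaster tweets in the validation set are classified as non-disasters.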



###5. Create submission

In [None]:
test_df = pd.read_csv('test.csv')
test_data = test_df['text']
X_test = test_data.apply(clean_text)

X_test = vectorizer.transform(X_test)
y_pred = model.predict(X_test)
y_pred

array([1, 1, 1, ..., 1, 1, 0])

In [None]:
submission_dict = {'id': test_df.id,
                   'target': y_pred}

submission_df = pd.DataFrame(submission_dict)
submission_df.head()

| | id | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 1 |
| 4 | 11 | 1 |


In [None]:
submission_df.to_csv('submission.csv', index=False)
print('Submission created')
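
Before uploading, it can help to check that the file has the expected shape. The following sketch uses only the standard library; `validate_submission` is a hypothetical helper, and the rules it checks (an `id,target` header and 0/1 targets) are taken from the submission frame shown above.

```python
import csv


def validate_submission(path):
    """Return the number of data rows, raising ValueError on a malformed file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header != ["id", "target"]:
            raise ValueError(f"unexpected header: {header}")
        n_rows = 0
        for row in reader:
            # Each row must be an id plus a 0/1 prediction
            if len(row) != 2 or row[1] not in {"0", "1"}:
                raise ValueError(f"bad row: {row}")
            n_rows += 1
    return n_rows
```

Calling `validate_submission('submission.csv')` should return the number of predictions, which can be compared with `len(test_df)`.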