#Tweet classification task

This notebook presents a text classification task on tweets. It is a binary classification of negative and positive tweets.

##Setting up and exploring the dataset

In [1]:
!unzip 'tweet_classif_data.zip'

Archive:  tweet_classif_data.zip
   creating: tweet_classif_data/
  inflating: tweet_classif_data/info.txt  
  inflating: tweet_classif_data/test.csv  
  inflating: tweet_classif_data/train.csv  


In [2]:
import pandas as pd
df = pd.read_csv("tweet_classif_data/train.csv")

In [3]:
df.shape

(7613, 5)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [5]:
df.head(10)

   id  keyword  location                                               text  target
0   1      NaN       NaN  Our Deeds are the Reason of this #earthquake M...       1
1   4      NaN       NaN             Forest fire near La Ronge Sask. Canada       1
2   5      NaN       NaN  All residents asked to 'shelter in place' are ...       1
3   6      NaN       NaN  13,000 people receive #wildfires evacuation or...       1
4   7      NaN       NaN  Just got sent this photo from Ruby #Alaska as ...       1
5   8      NaN       NaN  #RockyFire Update => California Hwy. 20 closed...       1
6  10      NaN       NaN  #flood #disaster Heavy rain causes flash flood...       1
7  13      NaN       NaN  I'm on top of the hill and I can see a fire in...       1
8  14      NaN       NaN  There's an emergency evacuation happening now ...       1
9  15      NaN       NaN  I'm afraid that the tornado is coming to our a...       1


In [6]:
df.drop(columns=['id'], inplace=True)

In [7]:
df['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64
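
These counts also give a reference point for the accuracy figures later on: a model that always predicts the majority class would already be right on a bit more than half of the training rows. A minimal sketch of that arithmetic, with the counts copied from the output above (the variable names are mine, not part of the notebook):

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4342, 1: 3271}

total = sum(counts.values())
# Label with the most rows.
majority_label = max(counts, key=counts.get)
# Accuracy of a trivial model that always predicts the majority label.
baseline_accuracy = counts[majority_label] / total

print(majority_label)              # 0
print(f"{baseline_accuracy:.3f}")  # 0.570
```

Any classifier trained below should therefore be judged against roughly 57% accuracy rather than 50%.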

In [8]:
print("Number of empty values per column:")
for col in df.columns:
  print(col + " " + str(df[col].isna().sum()))

Number of empty values per column:
keyword 61
location 2533
text 0
target 0


In [9]:
df['text'].head(10)

0    Our Deeds are the Reason of this #earthquake M...
1               Forest fire near La Ronge Sask. Canada
2    All residents asked to 'shelter in place' are ...
3    13,000 people receive #wildfires evacuation or...
4    Just got sent this photo from Ruby #Alaska as ...
5    #RockyFire Update => California Hwy. 20 closed...
6    #flood #disaster Heavy rain causes flash flood...
7    I'm on top of the hill and I can see a fire in...
8    There's an emergency evacuation happening now ...
9    I'm afraid that the tornado is coming to our a...
Name: text, dtype: object

##Text pre-processing

I apply the text pre-processing steps one at a time and inspect the result after each step.

####Lowercase

Lowercasing standardizes the text by converting all characters to lowercase. This ensures that words with the same characters but different case are treated as the same word.

In [10]:
df['text'] = df['text'].str.lower()

In [11]:
df['text'].head(10)

0    our deeds are the reason of this #earthquake m...
1               forest fire near la ronge sask. canada
2    all residents asked to 'shelter in place' are ...
3    13,000 people receive #wildfires evacuation or...
4    just got sent this photo from ruby #alaska as ...
5    #rockyfire update => california hwy. 20 closed...
6    #flood #disaster heavy rain causes flash flood...
7    i'm on top of the hill and i can see a fire in...
8    there's an emergency evacuation happening now ...
9    i'm afraid that the tornado is coming to our a...
Name: text, dtype: object

####Special characters and numbers removal

Special characters and numbers often contribute little to the meaning of the text and can add noise. Removing them simplifies the text and keeps the focus on the words themselves. Note that this step also strips the `#` from hashtags and the apostrophe from contractions, so `#earthquake` becomes `earthquake` and `i'm` becomes `im`.

In [12]:
import string
# regex=True makes the pattern a regular expression; newer pandas versions treat it as a literal string by default.
df['text'] = df['text'].str.replace('[{}]'.format(string.punctuation + string.digits), '', regex=True)


In [13]:
df['text'].head(10)

0    our deeds are the reason of this earthquake ma...
1                forest fire near la ronge sask canada
2    all residents asked to shelter in place are be...
3     people receive wildfires evacuation orders in...
4    just got sent this photo from ruby alaska as s...
5    rockyfire update  california hwy  closed in bo...
6    flood disaster heavy rain causes flash floodin...
7    im on top of the hill and i can see a fire in ...
8    theres an emergency evacuation happening now i...
9     im afraid that the tornado is coming to our area
Name: text, dtype: object
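
As a quick check of what this step does to a single string, here is a minimal sketch using only the standard library. The helper name `strip_punct_digits` is my own and not part of the notebook; the sample string is the visible part of row 5 from the lowercased output above.

```python
import re
import string

# Character class covering every ASCII punctuation mark and digit.
# re.escape keeps characters such as ']' and '\' literal inside the class.
PUNCT_DIGITS = re.compile("[{}]".format(re.escape(string.punctuation + string.digits)))


def strip_punct_digits(text):
    """Delete punctuation and digits (hypothetical helper, for illustration only)."""
    return PUNCT_DIGITS.sub("", text)


sample = "#rockyfire update => california hwy. 20 closed"
cleaned = strip_punct_digits(sample)
print(repr(cleaned))  # 'rockyfire update  california hwy  closed'

# Deleting characters leaves double spaces behind; split/join collapses them.
normalized = " ".join(cleaned.split())
print(repr(normalized))  # 'rockyfire update california hwy closed'
```

The double spaces are harmless here because the later steps split the text on whitespace, but collapsing them can make intermediate outputs easier to read.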

####Stop words removal

Stop words are commonly used words such as "the," "is," or "and" that appear frequently in a language but often do not carry significant meaning in specific contexts. Removing them helps to reduce noise and focus on more meaningful and informative words.

In [14]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    words_without_stopwords = [word for word in words if word.lower() not in stop_words]
    text_without_stopwords = ' '.join(words_without_stopwords)
    
    return text_without_stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [15]:
df['text'] = df['text'].apply(remove_stopwords)

In [16]:
df['text'].head(10)

0         deeds reason earthquake may allah forgive us
1                forest fire near la ronge sask canada
2    residents asked shelter place notified officer...
3    people receive wildfires evacuation orders cal...
4    got sent photo ruby alaska smoke wildfires pou...
5    rockyfire update california hwy closed directi...
6    flood disaster heavy rain causes flash floodin...
7                           im top hill see fire woods
8    theres emergency evacuation happening building...
9                        im afraid tornado coming area
Name: text, dtype: object
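
The filtering logic itself does not depend on NLTK. The sketch below shows the mechanics with a tiny hand-picked stop-word set and a made-up sentence; `DEMO_STOP_WORDS` and `remove_stopwords_demo` are assumptions for illustration, and the set is not NLTK's English list.

```python
# Hand-picked subset for the demo only; NOT the NLTK stop-word list.
DEMO_STOP_WORDS = {"the", "is", "and", "a", "in", "there"}


def remove_stopwords_demo(text, stop_words=DEMO_STOP_WORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


# Made-up sentence, not taken from the dataset.
example = "there is a fire in the forest and the road is closed"
print(remove_stopwords_demo(example))  # fire forest road closed
```

Using a `set` for the stop words keeps each membership check constant-time, which matters when the function is applied to thousands of tweets.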

####Lemmatization

Lemmatization standardizes words by reducing them to their base form (lemma), so that different forms of the same word are treated as a single feature. `WordNetLemmatizer.lemmatize` treats each word as a noun unless a part-of-speech tag is passed, so the code below mainly reduces plural nouns (`deeds` becomes `deed`), leaves verb forms such as `asked` and `coming` unchanged, and also turns `us` into `u`.

In [17]:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # Split the text into individual words
    words = text.split()
    
    # Lemmatize each word in the text
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    
    # Join the lemmatized words back into a single string
    lemmatized_text = ' '.join(lemmatized_words)
    
    return lemmatized_text

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [18]:
df['text'] = df['text'].apply(lemmatize_text)

In [19]:
df['text'].head(10)

0           deed reason earthquake may allah forgive u
1                forest fire near la ronge sask canada
2    resident asked shelter place notified officer ...
3    people receive wildfire evacuation order calif...
4    got sent photo ruby alaska smoke wildfire pour...
5    rockyfire update california hwy closed directi...
6    flood disaster heavy rain cause flash flooding...
7                            im top hill see fire wood
8    there emergency evacuation happening building ...
9                        im afraid tornado coming area
Name: text, dtype: object

####Tokenization

Tokenization is a fundamental step in text preprocessing that involves breaking down a text into individual units called tokens. Each token typically represents a word.

In [20]:
nltk.download('punkt')
df['text_tokens'] = df['text'].apply(nltk.word_tokenize)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [21]:
df['text_tokens'].head(10)

0    [deed, reason, earthquake, may, allah, forgive...
1        [forest, fire, near, la, ronge, sask, canada]
2    [resident, asked, shelter, place, notified, of...
3    [people, receive, wildfire, evacuation, order,...
4    [got, sent, photo, ruby, alaska, smoke, wildfi...
5    [rockyfire, update, california, hwy, closed, d...
6    [flood, disaster, heavy, rain, cause, flash, f...
7                     [im, top, hill, see, fire, wood]
8    [there, emergency, evacuation, happening, buil...
9                  [im, afraid, tornado, coming, area]
Name: text_tokens, dtype: object

####Train test split

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['text_tokens'], df['target'], test_size=0.2, random_state=42)

####Vectorization (TF-IDF)

Vectorization converts text into numerical feature vectors so that machine learning models can process it.<br>In this notebook, I use the **TF-IDF** (Term Frequency-Inverse Document Frequency) technique, which gives higher weight to terms that are frequent in a tweet but rare across the dataset. The vectorizer is fitted on the training set only and then applied to the test set.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_joined = [' '.join(text) for text in X_train]
X_train_tfidf = vectorizer.fit_transform(X_train_joined)

In [24]:
X_test_joined = [' '.join(text) for text in X_test]
X_test_tfidf = vectorizer.transform(X_test_joined)
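
To make the weighting concrete, here is a rough pure-Python sketch of the formula that, to my understanding, `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function name `tfidf` and the three toy documents are assumptions for illustration. The sketch splits on whitespace, whereas the real vectorizer's default token pattern drops single-character tokens.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows); rows[i][j] is the weight of vocabulary[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # L2-normalize the row so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Toy documents, not taken from the dataset.
docs = ["forest fire near canada", "fire in the wood", "heavy rain flood"]
vocab, rows = tfidf(docs)
for tok, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{tok}: {weight:.3f}")
```

In the first toy document, `fire` gets a lower weight than `forest` because it also appears in the second document, which is the behavior TF-IDF is designed to produce.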

##Models

After pre-processing the text and splitting the dataset into a training set and a test set comes the classification step. I try several machine learning models on this task and print the classification report for each.

####Logistic regression

In [25]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train_tfidf, y_train)

In [26]:
y_pred = classifier.predict(X_test_tfidf)

In [27]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.88      0.84       874
           1       0.82      0.69      0.74       649

    accuracy                           0.80      1523
   macro avg       0.80      0.79      0.79      1523
weighted avg       0.80      0.80      0.80      1523
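
The precision, recall, and f1-score columns in the report above follow from simple counts of true positives, false positives, and false negatives. The sketch below recomputes them by hand on made-up labels (not the notebook's data); `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 4 positives, 6 negatives.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

precision, recall, f1 = binary_metrics(y_true_demo, y_pred_demo)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.50 f1=0.57
```

The toy labels mirror the pattern in the reports for class 1: precision is higher than recall, meaning the model misses a fair share of the positive tweets but is usually right when it does predict one.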



####SVM

In [28]:
from sklearn.svm import SVC

svm_classifier = SVC()
svm_classifier.fit(X_train_tfidf, y_train)

In [29]:
y_pred = svm_classifier.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.90      0.84       874
           1       0.83      0.67      0.74       649

    accuracy                           0.80      1523
   macro avg       0.81      0.78      0.79      1523
weighted avg       0.80      0.80      0.80      1523



####Naive Bayes

In [30]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tfidf, y_train)

In [31]:
y_pred = naive_bayes_classifier.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.90      0.84       874
           1       0.83      0.66      0.73       649

    accuracy                           0.80      1523
   macro avg       0.81      0.78      0.79      1523
weighted avg       0.80      0.80      0.79      1523



####Random Forest

In [32]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train_tfidf, y_train)

In [33]:
y_pred = rfc.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.86      0.82       874
           1       0.78      0.67      0.73       649

    accuracy                           0.78      1523
   macro avg       0.78      0.77      0.77      1523
weighted avg       0.78      0.78      0.78      1523

