# Identifying the intent of tweets using a Logistic Regression model and NLP (Natural Language Processing)

Each tweet in the dataset is labelled with one of nine intents: Appreciation, Community, Done,
Giveaway, Interested, Launching Soon, pinksale, Presale and Whitelist. The aim is
to run some basic data analysis (checking for null values and duplicates), clean the
data by removing less important features, split the dataset into training and
testing parts, and finally train a classifier and evaluate how accurately it predicts
the intent of new, unseen tweets.

Importing Required Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Loading the dataset and inspecting the raw data

In [2]:
df = pd.read_excel('Tweet_NFT.xlsx')

df.head()

Unnamed: 0,id,tweet_text,tweet_created_at,tweet_intent
0,1212762,@crypto_brody @eCoLoGy1990 @MoonrunnersNFT @It...,2022-08-06T16:56:36.000Z,Community
1,1212763,Need Sick Character artâ“#art #artist #Artist...,2022-08-06T16:56:36.000Z,Giveaway
2,1212765,@The_Hulk_NFT @INagotchiNFT @Tesla @killabears...,2022-08-06T16:56:35.000Z,Appreciation
3,1212766,@CryptoBatzNFT @DarekBTW The first project in ...,2022-08-06T16:56:35.000Z,Community
4,1212767,@sashadysonn The first project in crypto with ...,2022-08-06T16:56:34.000Z,Community


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96364 entries, 0 to 96363
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                96364 non-null  int64 
 1   tweet_text        96364 non-null  object
 2   tweet_created_at  96364 non-null  object
 3   tweet_intent      96364 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.9+ MB


In [4]:
df.tweet_intent.unique()

array(['Community', 'Giveaway', 'Appreciation', 'Presale', 'Whitelist',
       'pinksale', 'Done', 'Interested', 'Launching Soon'], dtype=object)
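
Accuracy can be misleading when some intents are much rarer than others, so it may help to check how many tweets carry each label. The sketch below uses a small invented frame (the rows are illustrative, not taken from the dataset); on the real data the equivalent call would be `df.tweet_intent.value_counts()`.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the rows are invented for illustration.
toy = pd.DataFrame({
    "tweet_intent": ["Community", "Giveaway", "Community",
                     "Whitelist", "Community", "Giveaway"]
})

# Count how many tweets carry each intent label, most frequent first.
counts = toy["tweet_intent"].value_counts()
print(counts)

# Relative frequencies make class imbalance easier to spot.
shares = toy["tweet_intent"].value_counts(normalize=True)
print(shares.round(2))
```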

Checking for the null values

In [5]:
df.isnull()

Unnamed: 0,id,tweet_text,tweet_created_at,tweet_intent
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
...,...,...,...,...
96359,False,False,False,False
96360,False,False,False,False
96361,False,False,False,False
96362,False,False,False,False


In [7]:
df.isnull().sum()

id                  0
tweet_text          0
tweet_created_at    0
tweet_intent        0
dtype: int64

Dropping the features that are not needed, so that the classification task is simpler. The input (feature) is only
the tweet text and the output (target) is the tweet intent.

In [8]:
df_mod = df.drop(['id', 'tweet_created_at'], axis=1)

df_mod.head(10)

Unnamed: 0,tweet_text,tweet_intent
0,@crypto_brody @eCoLoGy1990 @MoonrunnersNFT @It...,Community
1,Need Sick Character artâ“#art #artist #Artist...,Giveaway
2,@The_Hulk_NFT @INagotchiNFT @Tesla @killabears...,Appreciation
3,@CryptoBatzNFT @DarekBTW The first project in ...,Community
4,@sashadysonn The first project in crypto with ...,Community
5,🎉 Just registered for the saphire on @PREMI...,Presale
6,🚨 THE BRIDGED #4660/9999 SOLD!!! =&gt; PRIC...,Giveaway
7,@mtnDAO PROJECT 21 - THE BEST GAMEFI PROJECT O...,Whitelist
8,@Ra8bitsNFT Feature it on @Globalnft07\nWe hav...,Community
9,@SpaceBrosBSC PROJECT 21 - THE BEST GAMEFI PRO...,Whitelist


Dropping duplicate rows

In [9]:
df_mod.shape

(96364, 2)

In [10]:
df_mod.drop_duplicates(inplace=True)

In [11]:
df_mod.shape

(84908, 2)
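
The two shapes show that 96364 - 84908 = 11456 rows were exact duplicates. As a minimal sketch on invented data, `duplicated()` can count such rows before they are dropped:

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "tweet_text": ["gm frens", "gm frens", "mint is live", "gm frens"],
    "tweet_intent": ["Community", "Community", "Launching Soon", "Giveaway"],
})

# A row counts as a duplicate only if every column matches an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates keeps the first occurrence; reset_index gives a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(deduped.shape)
```

Note that the last toy row survives: the same text with a different label is not an exact duplicate, so conflicting labels for identical text would need a separate check.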

Cleaning the input tweets: converting the text to lower case, removing a
fixed set of special (non-alphanumeric) characters, and removing common
English words (like 'the', 'for', 'of', 'a', 'an') that carry little
information for text classification, using the NLTK stopwords corpus.
The total word count is printed before and after cleaning to show the effect.

In [12]:
print(df_mod['tweet_text'].apply(lambda x: len(x.split(' '))).sum())

1968630


In [13]:
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer,TfidfTransformer
from sklearn.metrics import accuracy_score

In [14]:
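# Punctuation and Twitter markers (@, #) to replace with a space.
special_char_remover = re.compile(r'[/{}|@:#,;]')
# Requires the NLTK stopwords corpus: nltk.download('stopwords')
Stopwords = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                          # normalise case
    text = special_char_remover.sub(' ', text)   # replace special characters with spaces
    text = ' '.join(word for word in text.split() if word not in Stopwords)  # drop stopwords
    return text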

In [15]:
df_mod['tweet_text'] = df_mod['tweet_text'].apply(clean_text) 

In [16]:
print(df_mod['tweet_text'].apply(lambda x: len(x.split(' '))).sum())

1661859
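
To make the effect of the cleaning step concrete, here is a self-contained sketch that applies the same three operations to one made-up tweet. To avoid the NLTK download it uses a tiny hand-written stopword set (`TOY_STOPWORDS`, a hypothetical name introduced for illustration), so the result differs from what the full English list would give.

```python
import re

# Same character class as in the notebook, written as a raw string.
special_char_remover = re.compile(r'[/{}|@:#,;]')

# Hypothetical, tiny stopword set; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"the", "for", "of", "a", "an", "is", "in"}

def clean_text(text):
    text = text.lower()
    text = special_char_remover.sub(' ', text)
    return ' '.join(w for w in text.split() if w not in TOY_STOPWORDS)

sample = "@SomeNFT The #Presale for a new drop is LIVE; join in!"
print(clean_text(sample))  # somenft presale new drop live join in!
```

The trailing `in!` survives because `!` is not in the character class, so the token never matches the stopword `in`. Characters outside the listed set (such as `!`, `.` or `?`) are left untouched by this cleaner.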


Splitting the dataset into training and testing parts

In [17]:
X = df_mod.tweet_text
y = df_mod.tweet_intent

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

print(X_train.shape)
print(X_test.shape)

(67926,)
(16982,)
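
The split above is purely random. If some intents are rare, passing `stratify=y` keeps the label proportions the same in both parts; whether that matters here depends on the class distribution, which the notebook does not show. A minimal sketch on invented data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 8 "Community" tweets and 2 "Whitelist" tweets.
X = pd.Series([f"tweet {i}" for i in range(10)])
y = pd.Series(["Community"] * 8 + ["Whitelist"] * 2)

# stratify=y preserves the 80/20 label ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```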


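Building the ML model and measuring its accuracy on the test set. The final accuracy is
about 96.4%, which suggests the model predicts the intent of unseen tweets well. The solver
also reports that it reached its iteration limit before converging (see the warning in the output).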

In [19]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

model = Pipeline([('vectorizer', CountVectorizer()), ('tfidf', TfidfTransformer()), ('lr', LogisticRegression())])

model.fit(X_train, y_train)
prediction = model.predict(X_test)
accuracy = accuracy_score(y_test, prediction)  # signature is (y_true, y_pred)

print(accuracy)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9643740431044635
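
The warning in the output indicates that the solver stopped at its default iteration limit before converging. One remedy, suggested by the warning itself, is to raise `max_iter`. The sketch below does that on a handful of invented tweets and then predicts the intent of new text, which is the stated goal of the notebook. The training sentences, the labels and the value `max_iter=1000` are illustrative assumptions, and `TfidfVectorizer` is used as a shorthand for `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented training data; the real notebook trains on X_train / y_train.
texts = [
    "giveaway follow retweet win free nft",
    "free nft giveaway tag three friends",
    "whitelist spots open join discord",
    "get whitelist spot join discord now",
]
labels = ["Giveaway", "Giveaway", "Whitelist", "Whitelist"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # A higher max_iter gives the solver more room to converge (assumed value).
    ("lr", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

# Predict the intent of unseen (already cleaned) tweets.
new_tweets = ["retweet win free giveaway", "join discord whitelist"]
predicted = model.predict(new_tweets)
print(list(predicted))
```

On the real data, new tweets should pass through the same `clean_text` function as the training text before being handed to `model.predict`.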
