# 1. Basic Feature Extraction

We can use the text data to extract a number of features even if we do not have sufficient knowledge of Natural Language Processing.

Before starting, let's quickly read the training file from the dataset so that we can perform different tasks on it.

I am using the Twitter Sentiment Dataset.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    print(dirname)

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('/kaggle/input/twitter-sentiment-analysis-hatred-speech/train.csv')
df.head()

Note: Here we are only working with textual data, but we can also apply the same methods to numerical features.

# 1.1 Number of words
One of the most basic features we can extract is the number of words in each tweet.  

To do this, we simply use Python's `split` function.

In [None]:
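# split() with no argument splits on any run of whitespace, so repeated spaces are not counted as words.
df['word_count'] = df['tweet'].apply(lambda x: len(str(x).split()))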
# df.head()
df[['tweet', 'word_count']].head()
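One detail about `split` is worth checking on a toy string (the text below is made up for illustration): `split(" ")` breaks on every single space, so consecutive or trailing spaces produce empty strings that inflate the count, whereas `split()` with no argument collapses runs of whitespace.

```python
# Made-up tweet with a double space and a trailing space.
text = "good  morning world "

# Splitting on a literal space keeps empty strings between adjacent spaces.
tokens_literal = text.split(" ")

# Splitting with no argument treats any run of whitespace as one separator.
tokens_whitespace = text.split()

print(len(tokens_literal), tokens_literal)
print(len(tokens_whitespace), tokens_whitespace)
```

If the raw tweets contain irregular spacing, the two variants will give different word counts, so it helps to pick one deliberately.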

# 1.2 Number of Characters
Here we calculate the number of characters in each tweet. This is done by calculating the length of the tweet.

In [None]:
df['char_count'] = df['tweet'].str.len() # This will include spaces / white space.
# df.head()
df[['tweet', 'char_count']].head()

# 1.3 Average Word Length
We will also extract another feature, the average word length of each tweet. This can also potentially help us in improving our model.

Here we simply take the sum of the lengths of all the words and divide it by the number of words in the tweet.

In [None]:
def avg_word(sentence):
    words = sentence.split()
    return (sum(len(word) for word in words) / len(words))

df['avg_word'] = df['tweet'].apply(lambda x: avg_word(x))

# df.head()
df[['tweet', 'avg_word']].head()
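As a quick sanity check of the arithmetic, here is a small standalone sketch. The function name `avg_word_length` and the guard for empty text are my own additions for illustration; the guard simply avoids a division by zero when a tweet contains no words.

```python
def avg_word_length(sentence):
    """Mean number of characters per whitespace-separated word."""
    words = str(sentence).split()
    if not words:  # empty or whitespace-only text has no words
        return 0.0
    return sum(len(word) for word in words) / len(words)

# "I love NLP" has word lengths 1, 4 and 3, so the average is 8 / 3.
example_avg = avg_word_length("I love NLP")
empty_avg = avg_word_length("   ")

print(example_avg)
print(empty_avg)
```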

# 1.4 Number of Stopwords
Generally while solving an NLP problem, the first thing we do is to remove the stopwords. But sometimes calculating the number of stopwords can also give us some extra information which we might have been losing before.

Here we import the stopword list from NLTK, which is a basic NLP library in Python.

In [None]:
from nltk.corpus import stopwords

stop = stopwords.words('english')

In [None]:
df['stopwords'] = df['tweet'].apply(lambda x: len([x for x in x.split() if x in stop]))

# df.head()
df[['tweet', 'stopwords']].head()
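The NLTK English stopword list is lowercase, so a case-sensitive comparison will miss words such as `The` at the start of a sentence. The sketch below illustrates the difference with a tiny hand-picked stopword set (`toy_stopwords` is an assumption used only so the example runs without downloading the NLTK corpus); `count_stopwords` is a hypothetical helper, not part of NLTK.

```python
# Tiny hand-picked stopword set, used only for illustration.
toy_stopwords = {"the", "is", "a", "and", "to"}

def count_stopwords(text, stopword_set, ignore_case=True):
    """Count whitespace-separated tokens that appear in stopword_set."""
    words = text.split()
    if ignore_case:
        words = [word.lower() for word in words]
    return sum(1 for word in words if word in stopword_set)

tweet = "The weather is nice and THE sun is out"

case_sensitive = count_stopwords(tweet, toy_stopwords, ignore_case=False)
case_insensitive = count_stopwords(tweet, toy_stopwords, ignore_case=True)

print(case_sensitive, case_insensitive)
```

Using a `set` instead of a list also makes each membership test faster, which can matter on a large dataset.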

# 1.5 Number of Special Characters
One more interesting feature we can extract from a tweet is the number of hashtags or mentions present in it. This also helps in extracting extra information from our text data.

Here we make use of the `startswith` function because hashtags always appear at the beginning of a word.

In [None]:
df['hashtags'] = df['tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))

# df.head()
df[['tweet', 'hashtags']].head()
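The same idea extends to mentions by checking for `@` instead of `#`. A minimal sketch on a made-up tweet (`count_prefixed` is a hypothetical helper name):

```python
def count_prefixed(text, prefix):
    """Count whitespace-separated tokens that start with the given prefix."""
    return sum(1 for token in text.split() if token.startswith(prefix))

# Made-up example tweet.
tweet = "@user thanks for the #lyft credit #disappointed"

hashtags = count_prefixed(tweet, "#")
mentions = count_prefixed(tweet, "@")

print(hashtags, mentions)
```

On the DataFrame, the same helper could be applied with something like `df['tweet'].apply(lambda x: count_prefixed(x, '@'))`.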

# 1.6 Number of numerics
Just like we calculated the number of words, we can also calculate the number of numerics present in the tweets. It does not have a lot of use in our example, but it is still a useful feature to compute when doing similar exercises.

In [None]:
df['numerics'] = df['tweet'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))

# df.head()
df[['tweet', 'numerics']].head()
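It is worth knowing what `str.isdigit` does and does not count. The short sketch below (with made-up tokens) shows that only tokens made up entirely of digits pass the test, so decimals, ordinals and negative numbers are skipped.

```python
# Made-up tokens covering a few common numeric-looking forms.
tokens = ["2016", "3.5", "2nd", "-7", "42"]

# isdigit() is True only when every character in the token is a digit.
digit_only = [token for token in tokens if token.isdigit()]

print(digit_only)
```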

# 1.7 Number of Uppercase Words
Anger or rage is quite often expressed by writing in UPPERCASE, which makes identifying such words a useful operation.

In [None]:
df['upper'] = df['tweet'].apply(lambda x: len([x for x in x.split() if x.isupper()]))

# df.head()
df[['tweet', 'upper']].head()
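Note that `isupper` is also true for one-letter words such as `I` and for uppercase hashtags. If that is not desired, a minimum length can be required. The sketch below uses a made-up tweet, and the `min_length` parameter is my own assumption, not something the dataset calls for.

```python
def count_upper(text, min_length=1):
    """Count uppercase tokens that have at least min_length characters."""
    return sum(
        1 for token in text.split()
        if token.isupper() and len(token) >= min_length
    )

# Made-up example tweet.
tweet = "I am SO ANGRY about #THIS"

all_upper = count_upper(tweet)                 # counts I, SO, ANGRY, #THIS
longer_upper = count_upper(tweet, min_length=2)  # drops the single-letter I

print(all_upper, longer_upper)
```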

# 2. Basic Pre-Processing
So far we have learned how to extract basic features from the text data. Before diving into text and feature extraction, our first step should be cleaning the data in order to obtain better features. We will achieve this by applying some basic pre-processing steps to our training data.

# 2.1 Lower Case
The first pre-processing step which we will do is transform our tweets into lower case. This avoids having multiple copies of the same words.

For instance, while calculating the word count, 'Analytics' and 'analytics' will be taken as different words.

In [None]:
df['tweet_lower'] = df['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# df.head()
df[['tweet', 'tweet_lower']].head()

# 2.2 Removing Punctuation
The next step is to remove punctuation, as it usually doesn't add much extra information when treating text data. Removing all instances of it will therefore help us reduce the size of the training data.

In [None]:
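# regex=True is needed because recent pandas versions treat the pattern as a literal string by default.
df['tweet_lower'] = df['tweet_lower'].str.replace(r'[^\w\s]', '', regex=True)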
# df.head()
df['tweet_lower'].head()

Note: As you can see in the above output, all the punctuation, including `#` and `@`, has been removed from the `tweet_lower` column.
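To see exactly what the pattern `[^\w\s]` removes, here is a standalone sketch using Python's built-in `re` module on a made-up tweet. The pattern deletes every character that is neither a word character (letters, digits, underscore) nor whitespace.

```python
import re

def strip_punctuation(text):
    """Remove every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)

# Made-up example tweet.
cleaned = strip_punctuation("@user #love, this is great!!!")

print(cleaned)
```

Because `\w` includes the underscore, tokens such as `snake_case` keep their underscores.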

# 2.3 Removal of Stop Words
As we discussed earlier, stop words (or commonly occurring words) should be removed from the text data.
For this purpose we can either create a list of stopwords ourselves or use predefined libraries.

In [None]:
df['tweet_lower'] = df['tweet_lower'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

df.tweet_lower.head()

# 2.4 Common word removal
Previously, we removed commonly occurring words in a general sense. We can also remove the words that occur most often in our own text data. First, let's check the 10 most frequent words in our text data, then decide whether to remove or retain them.

In [None]:
freq_words = pd.Series(' '.join(df['tweet_lower']).split()).value_counts()[:10]
freq_words

Now let's remove these words, as their presence will not be of any use in classifying our text data.

In [None]:
freq_words = list(freq_words.index)

df['tweet_lower'] = df['tweet_lower'].apply(lambda x: " ".join(x for x in x.split() if x not in freq_words))

df['tweet_lower'].head()
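The same frequency-based removal can be sketched without pandas using `collections.Counter`. The three documents below are made up, and the example removes only the single most frequent word so that the effect is easy to follow.

```python
from collections import Counter

# Made-up, already-cleaned documents.
docs = ["love happy day", "happy day today", "sad day"]

# Count every token across the whole corpus.
counts = Counter(word for doc in docs for word in doc.split())

# Take the single most frequent word and remove it everywhere.
top_words = {word for word, _ in counts.most_common(1)}
cleaned = [
    " ".join(word for word in doc.split() if word not in top_words)
    for doc in docs
]

print(top_words)
print(cleaned)
```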

# 2.5 Rare words removal
Similar to the most common words, this time let's remove rarely occurring words from the text. Because they are so rare, the association between them and other words is dominated by noise. Alternatively, you can replace rare words with a more general form, which will then have a higher count.

In [None]:
rare_words = pd.Series(' '.join(df['tweet_lower']).split()).value_counts()[-10:]
rare_words

Now let's remove these rare words.

In [None]:
rare_words = list(rare_words.index)

df['tweet_lower'] = df['tweet_lower'].apply(lambda x: " ".join(x for x in x.split() if x not in rare_words))

df['tweet_lower'].head()
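Taking the last 10 entries of the frequency table picks only 10 of the possibly many words that share the lowest count. One alternative, sketched here on made-up documents, is to drop every word below a minimum count. The threshold `min_count = 2` is an arbitrary assumption for the example and would need tuning on real data.

```python
from collections import Counter

# Made-up, already-cleaned documents.
docs = ["love happy day", "happy day today", "sad day"]

counts = Counter(word for doc in docs for word in doc.split())

min_count = 2  # hypothetical threshold; words seen fewer times are dropped
rare_words = {word for word, count in counts.items() if count < min_count}

cleaned = [
    " ".join(word for word in doc.split() if word not in rare_words)
    for doc in docs
]

print(sorted(rare_words))
print(cleaned)
```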

All these pre-processing steps are essential and help us reduce vocabulary clutter so that the features produced in the end are more effective.

# 2.6 Spelling Correction
We have all seen tweets with a plethora of spelling mistakes. Our timelines are often filled with hastily sent tweets that are barely legible at times.

In that regard, spelling correction is a useful pre-processing step because it also helps us reduce multiple copies of the same word.

To achieve this we will use the textblob library.

In [None]:
from textblob import TextBlob

In [None]:
df['tweet_lower'][:5].apply(lambda x: str(TextBlob(x).correct()))

Note that it will actually take a lot of time to make these corrections. Therefore, just for the purpose of learning, I have shown this technique by applying it to only the first 5 rows.

Moreover, we cannot always expect it to be accurate, so some care should be taken before applying it.

We should also keep in mind that words are often used in their abbreviated form; for instance, `your` is written as `ur`. We should handle such abbreviations before the spelling correction step, otherwise they may be transformed into a different word altogether. In the output above, `model take urð ðððð ððð` became `model take or ðððð ððð`.
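One simple way to handle abbreviations before spelling correction is a lookup table. The sketch below is only an illustration: the `abbreviations` mapping is a made-up, hand-written example, and a real project would need a much larger list suited to its data.

```python
# Hypothetical, hand-written mapping used only for illustration.
abbreviations = {"ur": "your", "u": "you", "thx": "thanks"}

def expand_abbreviations(text, mapping):
    """Replace each whitespace-separated token found in mapping."""
    return " ".join(mapping.get(word, word) for word in text.split())

expanded = expand_abbreviations("thx for ur help", abbreviations)

print(expanded)
```

On the DataFrame this could be applied with `df['tweet_lower'].apply(lambda x: expand_abbreviations(x, abbreviations))` before calling the spelling corrector.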

# 2.7 Tokenization
Tokenization refers to dividing the text into a sequence of words or sentences. In our example, we have used the textblob library to first transform our tweets into a blob and then converted them into a series of words.

In [None]:
TextBlob(df['tweet_lower'][1]).words 

# 2.8 Stemming
Stemming refers to the removal of suffixes, like `ing`, `ly`, `s`, etc., by a simple rule-based approach. For this purpose we will use `PorterStemmer` from the NLTK library.

In [None]:
from nltk.stem import PorterStemmer

In [None]:
st = PorterStemmer()

In [None]:
df['tweet_lower'][:5].apply(lambda x:" ".join([st.stem(word) for word in x.split()]))

In the above output, we can see that the word `dysfunctional` has been transformed into `dysfunct`, and `kids` into `kid`, among other changes.

# 2.9 Lemmatization
Lemmatization is often a more effective option than stemming because it converts the word into its root word, rather than just stripping the suffixes. It makes use of a vocabulary and performs a morphological analysis to obtain the root word. Therefore we usually prefer lemmatization over stemming.

In [None]:
from textblob import Word
df['tweet_lower'] = df['tweet_lower'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

df.tweet_lower.head()

# 3. Advanced Text Processing

Up to this point we have done all the basic pre-processing steps in order to clean our data.

Now we can finally move on to extracting features using NLP techniques.

# 3.1 N-grams
N-grams are the combination of multiple words used together.
N-grams with N = 1 are called unigrams. Similarly, bigrams when N = 2; trigrams when N = 3 and so on.

Unigrams usually carry less information than bigrams and trigrams.

The basic principle behind n-grams is that they capture the language structure, like what letter or word is likely to follow the given one. The longer the n-gram (the higher the n), the more context you have to work with.

The optimum length really depends on the application: if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the "general knowledge" and only stick to particular cases.

So let's quickly extract bigrams from our tweet using the `ngrams` function of the textblob library.

In [None]:
TextBlob(df['tweet_lower'][0]).ngrams(2)

In [None]:
TextBlob(df['tweet'][0]).ngrams(2)
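To make the idea concrete without relying on textblob, here is a minimal sketch of an n-gram extractor that slides a window of size `n` over a list of tokens. The helper name `ngrams` and the example sentence are my own, for illustration only.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up example sentence.
tokens = "thanks for the lyft credit".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence with 5 tokens yields 4 bigrams and 3 trigrams, which shows how longer n-grams become sparser on short texts such as tweets.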

# 3.2 Term frequency
Term frequency is simply the ratio of the count of a word in a sentence to the length of the sentence.

Therefore we can generalize term frequency as

`TF = (Number of times term T appears in the particular row) / (Number of terms in that row)`



In [None]:
# tf1 = (df['tweet'][1:2]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0).reset_index()
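# pd.Series(...).value_counts() replaces the deprecated top-level pd.value_counts; it gives the raw count of each word in the row.
tf1 = (df['tweet_lower'][1:2]).apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis = 0).reset_index()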

tf1.columns = ['words', 'tf']

tf1
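The formula above divides each count by the number of terms in the row. A small standalone sketch of that ratio, using a made-up sentence and a hypothetical helper `term_frequency`:

```python
from collections import Counter

def term_frequency(text):
    """Map each term to (count of term) / (number of terms in the text)."""
    tokens = text.split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# "happy" appears 2 times out of 4 terms, the others once each.
tf = term_frequency("happy day happy life")

print(tf)
```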

# 3.3 Inverse Document Frequency
The intuition behind inverse document frequency (IDF) is that a word is not of much use to us if it appears in all the documents.

Therefore the IDF of each word is the log of the ratio of the total number of rows to the number of rows in which that word is present.

`IDF = log(N/n)` where N is the total number of rows and n is the number of rows in which the word is present.



In [None]:
for i, word in enumerate(tf1['words']):
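    # regex=False treats the word as plain text; note that this still matches the word inside longer words.
    tf1.loc[i, 'idf'] = np.log(df.shape[0] / (len(df[df['tweet_lower'].str.contains(word, regex=False)])))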
    
tf1

The higher the IDF value, the rarer (and hence more distinctive) the word.
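Here is a minimal standalone sketch of `IDF = log(N/n)` on made-up documents. It matches on whole tokens, so `day` is not counted inside `today`; the helper name `inverse_document_frequency` is hypothetical.

```python
import math

def inverse_document_frequency(term, documents):
    """log(N / n), where n counts documents containing term as a whole token."""
    n_containing = sum(1 for doc in documents if term in doc.split())
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(documents) / n_containing)

# Made-up corpus of four short documents.
docs = ["happy day", "sad day", "happy life", "see you today"]

idf_day = inverse_document_frequency("day", docs)    # in 2 of 4 documents
idf_life = inverse_document_frequency("life", docs)  # in 1 of 4 documents

print(idf_day, idf_life)
```

The rarer word `life` gets the larger value, which is the behaviour the formula is designed to produce.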

# 3.4 Term Frequency - IDF (TF-IDF)
TF-IDF is the product of the TF and IDF which we calculated above.

In [None]:
tf1['tf_idf'] = tf1['tf'] * tf1['idf']
tf1

Alternatively we can use sklearn's `TfidfVectorizer` as below.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features = 1000, lowercase = True, analyzer = 'word', stop_words = 'english', ngram_range = (1,1))

df_vect = tfidf.fit_transform(df['tweet'])

df_vect
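The values produced by `TfidfVectorizer` will generally not match a manual `tf * idf` calculation. As far as I understand its documented defaults (`smooth_idf=True`, `norm='l2'`), it uses raw counts for TF, a smoothed IDF of the form `ln((1 + N) / (1 + n)) + 1`, and then rescales each row to unit length. The sketch below compares only the two IDF formulas, in plain Python; the function names are my own.

```python
import math

def plain_idf(n_docs, doc_freq):
    """The textbook form: log(N / n)."""
    return math.log(n_docs / doc_freq)

def smoothed_idf(n_docs, doc_freq):
    """Smoothed form described in the scikit-learn docs: ln((1 + N) / (1 + n)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 4
for doc_freq in (1, 2, 4):
    print(doc_freq, round(plain_idf(n_docs, doc_freq), 3),
          round(smoothed_idf(n_docs, doc_freq), 3))
```

With the plain formula a word that appears in every document gets an IDF of 0 and vanishes entirely, while the smoothed form keeps it with a weight of 1.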


# 3.5 Bag of Words
Bag of Words (BoW) refers to a representation of text which describes the presence of words within the text data. The intuition behind this is that two similar text fields will contain similar kinds of words, and will therefore have a similar bag of words.

Furthermore, from the text alone we can learn something about the meaning of the document.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer



In [None]:
bow = CountVectorizer(max_features=1000, lowercase = True, ngram_range= (1,1), analyzer = 'word')

df_bow = bow.fit_transform(df['tweet'])

df_bow
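To show what the resulting matrix contains, here is a small plain-Python sketch that builds a bag-of-words count matrix for three made-up documents. It mirrors the idea behind `CountVectorizer` but is not its implementation (for example, it does no lowercasing or token filtering).

```python
# Made-up documents.
docs = ["happy day", "sad day", "happy happy life"]

# The vocabulary is the sorted set of all distinct tokens.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each row holds the count of every vocabulary term in one document.
matrix = [
    [doc.split().count(term) for term in vocabulary]
    for doc in docs
]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is a document and each column a word; word order within a document is discarded, which is where the name "bag" comes from.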

# 3.6 Sentiment Analysis


In [None]:
df['tweet'][:5].apply(lambda x: TextBlob(x).sentiment )

Above we can see that it returns a tuple representing the polarity and subjectivity of each tweet.

Here we extract only the polarity, as it indicates the sentiment: a value nearer to 1 means a positive sentiment and a value nearer to -1 means a negative one. This can also serve as a feature for building an ML model.

In [None]:
df['sentiment'] = df['tweet'].apply(lambda x: TextBlob(x).sentiment[0])

df[['tweet','sentiment']].head()

In [None]:
df['sentiment2'] = df['tweet_lower'].apply(lambda x: TextBlob(x).sentiment[0])

df[['tweet_lower','sentiment2']].head()

# 3.7 Word Embeddings
Word embedding is the representation of text in the form of vectors. The underlying idea here is that similar words will have a small distance between their vectors.

Word2Vec models require a lot of text, so we can either train one on our own training data or use pre-trained word vectors (for example, ones released by Google or trained on Wikipedia).
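The notion of "distance between vectors" is commonly measured with cosine similarity. The sketch below uses tiny made-up 3-dimensional vectors purely for illustration; real embeddings have many more dimensions and their values come from a trained model, not from hand-picked numbers like these.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, NOT taken from any real embedding model.
toy_vectors = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "car": [0.1, 0.9, 0.0],
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_car = cosine_similarity(toy_vectors["good"], toy_vectors["car"])

print(round(sim_good_great, 3), round(sim_good_car, 3))
```

In this toy setup `good` and `great` point in nearly the same direction, so their similarity is close to 1, while `good` and `car` are far apart.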


In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec