# Basic Text Features

Text features are measurable characteristics of a body of text, such as its word count or average word length, that can be analyzed to extract information or gain insights. In natural language processing (NLP) and text mining, identifying such features is a common first step for tasks such as text classification.

In [1]:
# sample text string
text = "Dark matter is one of the greatest enigmas of astrophysics and cosmology"

We will split the string into individual words, or tokens. This process is known as __tokenization__: breaking a sequence of text into smaller units called tokens. Tokens can be words, phrases, symbols, or other meaningful elements, depending on the context of the analysis.

In [2]:
# split words of the text
text.split()

['Dark',
 'matter',
 'is',
 'one',
 'of',
 'the',
 'greatest',
 'enigmas',
 'of',
 'astrophysics',
 'and',
 'cosmology']

In [3]:
# store the individual words in a variable
words = text.split()
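
Whitespace splitting works well for the sample string because it contains no punctuation. As a minimal sketch of how the choice of tokenizer changes the tokens, the example below compares `str.split()` with a simple regular expression. The `sample` string is a hypothetical input, not part of the notebook, and `\w+` is only one of many possible token patterns.

```python
import re

# hypothetical sample with punctuation (not from the notebook)
sample = "Dark matter, it seems, is an enigma!"

# whitespace split keeps punctuation attached to the words
whitespace_tokens = sample.split()

# \w+ matches runs of letters, digits, or underscores, so punctuation is dropped
regex_tokens = re.findall(r"\w+", sample)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `'matter,'` and `'enigma!'` keep their punctuation, which also affects any word-length feature computed from the tokens.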

### Commonly used features

### 1. Word Count

In [4]:
# word count
len(words)

12

### 2. Spaces Count

In [5]:
# spaces count
text.count(' ')

11

### 3. Character Count

In [6]:
# character count
len(text)

72

Note that this count includes the 11 spaces.

In [7]:
# character count (excluding spaces)
len(text)-text.count(' ')

61

So, the text string has 61 characters excluding spaces.

### 4. Average Word Length

In [8]:
# empty list to store the length of each word
word_lengths = []

for i in text.split():
    word_lengths.append(len(i))
    
print(word_lengths)

[4, 6, 2, 3, 2, 3, 8, 7, 2, 12, 3, 9]


In [9]:
# average word length
sum(word_lengths)/len(word_lengths)

5.083333333333333
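
The four features above can also be collected in one helper, which is convenient before applying them to many strings. This is a minimal sketch: the function name `basic_text_features` and the choice to return 0.0 for a string without words are assumptions, and `character_count` here excludes spaces.

```python
def basic_text_features(text):
    """Return the four basic features for a single string."""
    words = text.split()
    space_count = text.count(' ')
    # guard against strings with no words to avoid dividing by zero
    average = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        'word_count': len(words),
        'space_count': space_count,
        'character_count': len(text) - space_count,  # excludes spaces
        'average_word_length': average,
    }


features = basic_text_features(
    "Dark matter is one of the greatest enigmas of astrophysics and cosmology"
)
print(features)
```

For the sample string this reproduces the values computed cell by cell above: 12 words, 11 spaces, and 61 non-space characters.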

---

# Create Features for Twitter Dataset

Let's create the above-mentioned features for a real-life dataset.

In [10]:
import pandas as pd


In [11]:
tweets = pd.read_csv("tweets.csv")

Take a quick look at the data.

In [12]:
tweets.head()

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | #fingerprint #Pregnancy Test https://goo.gl/h1... |
| 1 | 2 | 0 | Finally a transparant silicon case ^^ Thanks t... |
| 2 | 3 | 0 | We love this! Would you go? #talk #makememorie... |
| 3 | 4 | 0 | I'm wired I know I'm George I was made that wa... |
| 4 | 5 | 1 | What amazing service! Apple won't even talk to... |


This dataset has three columns right now.

1. __id:__ tweet id number, unique for every tweet
2. __label:__ 1 for a negative tweet and 0 for a positive or neutral tweet
3. __tweet:__ the text of the tweet

We will create new features from the "tweet" column.


### 1. Word Count Feature

In [13]:
# number of words/terms in the tweets
tweets['word_count'] = [len(i.split()) for i in tweets['tweet']]

In [14]:
tweets.head()

| | id | label | tweet | word_count |
|---|---|---|---|---|
| 0 | 1 | 0 | #fingerprint #Pregnancy Test https://goo.gl/h1... | 13 |
| 1 | 2 | 0 | Finally a transparant silicon case ^^ Thanks t... | 17 |
| 2 | 3 | 0 | We love this! Would you go? #talk #makememorie... | 15 |
| 3 | 4 | 0 | I'm wired I know I'm George I was made that wa... | 17 |
| 4 | 5 | 1 | What amazing service! Apple won't even talk to... | 23 |


As you can see, we have a new feature, __word_count__. Now let's create a feature for the number of spaces in each tweet.

### 2. Space Count Feature

In [15]:
# number of spaces in each tweet
tweets['space_count'] = [i.count(' ') for i in tweets['tweet']]

In [16]:
tweets.head()

| | id | label | tweet | word_count | space_count |
|---|---|---|---|---|---|
| 0 | 1 | 0 | #fingerprint #Pregnancy Test https://goo.gl/h1... | 13 | 12 |
| 1 | 2 | 0 | Finally a transparant silicon case ^^ Thanks t... | 17 | 16 |
| 2 | 3 | 0 | We love this! Would you go? #talk #makememorie... | 15 | 14 |
| 3 | 4 | 0 | I'm wired I know I'm George I was made that wa... | 17 | 16 |
| 4 | 5 | 1 | What amazing service! Apple won't even talk to... | 23 | 22 |


### 3. Character Count Feature

In [17]:
# number of characters in each tweet (excluding spaces)
tweets['character_count'] = [len(i) - i.count(' ') for i in tweets['tweet']]

In [18]:
tweets.head()

| | id | label | tweet | word_count | space_count | character_count |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | #fingerprint #Pregnancy Test https://goo.gl/h1... | 13 | 12 | 116 |
| 1 | 2 | 0 | Finally a transparant silicon case ^^ Thanks t... | 17 | 16 | 115 |
| 2 | 3 | 0 | We love this! Would you go? #talk #makememorie... | 15 | 14 | 109 |
| 3 | 4 | 0 | I'm wired I know I'm George I was made that wa... | 17 | 16 | 96 |
| 4 | 5 | 1 | What amazing service! Apple won't even talk to... | 23 | 22 | 102 |


### 4. Average Word Length Feature

In [19]:
avg_word_length = []

# outer loop over tweets, inner loop over the words of each tweet
for tweet in tweets['tweet']:
    word_lengths = []
    for word in tweet.split():
        # length of each word in the tweet
        word_lengths.append(len(word))

    # average word length of the tweet
    avg = sum(word_lengths) / len(word_lengths)

    avg_word_length.append(avg)

In [20]:
# create new feature 
tweets['average_word_length'] = avg_word_length
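
The list comprehensions and the nested loop above can also be written with pandas string methods. The sketch below is one possible equivalent, run on a small hypothetical DataFrame named `demo` that stands in for `tweets`. The second row contains a double space on purpose, to show that the space count is not always the word count minus one.

```python
import pandas as pd

# hypothetical mini dataset standing in for tweets.csv
demo = pd.DataFrame({'tweet': ["I love this phone", "Worst  update ever"]})

# split on whitespace and count the resulting words
demo['word_count'] = demo['tweet'].str.split().str.len()

# count literal space characters
demo['space_count'] = demo['tweet'].str.count(' ')

# total length minus spaces
demo['character_count'] = demo['tweet'].str.len() - demo['space_count']

# non-whitespace characters divided by the number of words
non_space_chars = demo['tweet'].str.replace(r'\s+', '', regex=True).str.len()
demo['average_word_length'] = non_space_chars / demo['word_count']

print(demo)
```

Unlike the loop version, a row with no words yields `NaN` here instead of raising `ZeroDivisionError`, so such rows would still need handling before modelling.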

# Build Model

In [21]:
tweets.head()

| | id | label | tweet | word_count | space_count | character_count | average_word_length |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | #fingerprint #Pregnancy Test https://goo.gl/h1... | 13 | 12 | 116 | 8.923077 |
| 1 | 2 | 0 | Finally a transparant silicon case ^^ Thanks t... | 17 | 16 | 115 | 6.764706 |
| 2 | 3 | 0 | We love this! Would you go? #talk #makememorie... | 15 | 14 | 109 | 7.266667 |
| 3 | 4 | 0 | I'm wired I know I'm George I was made that wa... | 17 | 16 | 96 | 5.647059 |
| 4 | 5 | 1 | What amazing service! Apple won't even talk to... | 23 | 22 | 102 | 4.434783 |


In [22]:
X = tweets[['word_count', 'space_count', 'character_count', 'average_word_length']]
y = tweets['label']

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler # for standardization

In [24]:
# standardize the features, then split the dataset into train and test sets
xtrain, xtest, ytrain, ytest = train_test_split(StandardScaler().fit_transform(X), y, 
                                                test_size=0.33, random_state=42)

In [25]:
xtrain.shape, xtest.shape

((5306, 4), (2614, 4))
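
The split above standardizes the full feature matrix before dividing it, so the scaler's mean and variance are computed partly from test rows. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows one way to do this with a scikit-learn pipeline. It uses synthetic data (`X_demo`, `y_demo`) as a stand-in for the tweet features, so its numbers are not comparable with the results in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the four text features and the label (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# split the raw, unscaled features first
xtr, xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.33, random_state=42)

# the pipeline fits the scaler on the training rows only,
# then reuses those statistics when transforming the test rows
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(xtr, ytr)

# predicted probability of class 1 for each test row
probs = model.predict_proba(xte)[:, 1]
print(probs[:5])
```

Wrapping both steps in one estimator also means the same scaling is applied automatically whenever the model predicts on new data.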

In [26]:
# fit model
lr = LogisticRegression()
lr.fit(xtrain, ytrain)

In [27]:
# predict class probabilities on the test set
# column 0 is P(label = 0), column 1 is P(label = 1)
preds = lr.predict_proba(xtest)

In [28]:
preds

array([[0.92292913, 0.07707087],
       [0.59968502, 0.40031498],
       [0.95161997, 0.04838003],
       ...,
       [0.22803203, 0.77196797],
       [0.57412551, 0.42587449],
       [0.85135054, 0.14864946]])

A Receiver Operating Characteristic (ROC) curve is a graphical representation of how a binary classification model performs across different decision thresholds. It plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) as the threshold is varied. The area under this curve (AUC) summarizes the curve in a single number: 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes.

In [29]:
roc_auc_score(ytest, preds[:,1])

0.8635027062917578
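
To make the ROC idea concrete, the sketch below computes the true and false positive rates at a few thresholds by hand. It then computes the AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The labels and scores are hypothetical toy values, and the helper `tpr_fpr` is an illustrative name, not a scikit-learn function.

```python
# hypothetical labels and predicted probabilities of class 1 (not from the notebook)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]


def tpr_fpr(y_true, y_score, threshold):
    """TPR and FPR when scores >= threshold are predicted as class 1."""
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, fp / n_neg


# each threshold gives one point on the ROC curve
for threshold in (0.8, 0.4, 0.35, 0.1):
    tpr, fpr = tpr_fpr(y_true, y_score, threshold)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")

# AUC as the share of positive/negative pairs ranked correctly (ties count half)
pos_scores = [s for t, s in zip(y_true, y_score) if t == 1]
neg_scores = [s for t, s in zip(y_true, y_score) if t == 0]
auc = sum((p > n) + 0.5 * (p == n)
          for p in pos_scores for n in neg_scores) / (len(pos_scores) * len(neg_scores))
print(auc)
```

Lowering the threshold moves the point up and to the right on the curve: more positives are caught, at the cost of more false alarms.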