# Data Cleaning and Vectorization For NLP

**Data Preparation With NLP**

In this lesson, you will study text pre-processing. Text pre-processing means bringing your text into a form that is analyzable and predictable for your task. You will learn how to:

- Remove punctuation,
- Remove stopwords,
- Tokenize the text,
- Vectorize the words.

1. Removing Punctuation:

Raw text often contains punctuation such as commas, quotes, apostrophes, and question marks. We cannot feed raw text directly to a machine learning model; we need to clean it first. Removing punctuation is usually one of the first cleaning steps, because for many tasks these characters carry little information for the model.
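As a minimal sketch of this step, the snippet below deletes every character listed in Python's built-in `string.punctuation`. The function name `strip_punctuation` and the example sentence are made up for illustration; they are not part of the lesson's notebook.

```python
import string


def strip_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character found in string.punctuation."""
    # str.maketrans with a third argument builds a table that deletes those characters
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)


cleaned = strip_punctuation("Hello, world! Isn't this (fairly) simple?")
print(cleaned)  # Hello world Isnt this fairly simple
```

Note that the apostrophe is removed as well, so "Isn't" becomes "Isnt". Whether that is desirable depends on the task.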

2. Removing Stopwords: 

Stopwords contribute little to the meaning of a text. Because they appear far more frequently than other words, they introduce a lot of noise. Some examples of stopwords are given below:

"and", "the", "how", "all", "about", "on", "under", "up", "after", "i", "me", "myself", "we", "our", "ours", "your", "yours"...

We filter out these stopwords before doing any statistical analysis or creating a model.
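The filtering itself can be sketched in a few lines. The stopword set below is a deliberately tiny, hypothetical one chosen for illustration; the notebook cells later in this lesson use NLTK's full English list instead.

```python
# Hypothetical, deliberately tiny stopword set (illustration only).
STOPWORDS = {"and", "the", "on", "we", "our", "is", "a"}


def remove_stopwords(tokens):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


tokens = ["We", "parked", "the", "car", "on", "the", "street"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['parked', 'car', 'street']
```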

3. Tokenization: 

Tokenization means splitting a sentence, paragraph, phrase, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token. You can consider each word as a token.

![image-7.png](attachment:image-7.png)

Tokenization is significant because the meaning of the text can be interpreted by analyzing the words present in it. After tokenization, we can also count the number of words in the text.

You can find more information about tokenization methods in Python at this link:
https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
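To make the idea concrete, here is a rough regex-based sketch of word and sentence tokenization using only the standard library. It is a simplification, not how NLTK's `word_tokenize` or `sent_tokenize` work internally, and the helper names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical.

```python
import re


def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


def simple_sent_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = "Tokenization is simple. Or is it?"
word_tokens = simple_word_tokenize(text)
sentences = simple_sent_tokenize(text)

print(word_tokens)       # ['Tokenization', 'is', 'simple', '.', 'Or', 'is', 'it', '?']
print(sentences)         # ['Tokenization is simple.', 'Or is it?']
print(len(word_tokens))  # 8
```

A rule this simple breaks on abbreviations such as "e.g." or "Dr.", which is one reason to prefer a library tokenizer in practice.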

4. Vectorization:

Word vectorization is a methodology in NLP that maps words from a vocabulary to corresponding numeric vectors. In other words, it is the process of converting words into numbers.


There are several methods to implement vectorization. The most widely used are listed below:
- Count Vectors (CountVectorizer)
- TF-IDF Vectors (TfidfVectorizer)
- Word Embeddings (Word2Vec, GloVe, etc.)
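The simplest of these, count vectors, can be sketched by hand to show what "converting words into numbers" means. The two toy documents below are invented for illustration, and the sketch splits on whitespace only, which is cruder than what `CountVectorizer` does.

```python
from collections import Counter

# Toy corpus, invented for illustration.
docs = ["the cat sat", "the cat saw the dog"]

# The vocabulary is the sorted set of all distinct words; each word gets one column.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def count_vector(doc):
    """Return how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vectors = [count_vector(doc) for doc in docs]

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```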

## Install and Import

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 50)

In [2]:
#!pip install nltk

## Tokenization

In [3]:
import nltk

In [4]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\Emre_Celik\AppData\Roaming\nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     C:\Users\Emre_Celik\AppData\Roaming\nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     C:\Users\Emre_Celik\AppData\Roaming\nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     C:\Users\Emre_Celik\AppData\Roaming\nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     C:\Users\Emre_Celik\AppData\Roaming\nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_

True

In [5]:
# or download only the packages used in this lesson
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Emre_Celik\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Emre_Celik\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Emre_Celik\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Emre_Celik\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [6]:
sample_text = "Oh man!, this is #pretty!? cool_. We will do #more such_ *things*. 2 ½ % ()"

In [7]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [8]:
sentence_token = sent_tokenize(sample_text.lower())
sentence_token

['oh man!, this is #pretty!?',
 'cool_.',
 'we will do #more such_ *things*.',
 '2 ½ % ()']

In [9]:
word_token = word_tokenize(sample_text.lower())
word_token

['oh',
 'man',
 '!',
 ',',
 'this',
 'is',
 '#',
 'pretty',
 '!',
 '?',
 'cool_',
 '.',
 'we',
 'will',
 'do',
 '#',
 'more',
 'such_',
 '*',
 'things',
 '*',
 '.',
 '2',
 '½',
 '%',
 '(',
 ')']

## Removing Punctuation and Numbers

In [10]:
# import string
# string.punctuation

In [11]:
# import re
# regex = re.compile('[%s]' % re.escape(string.punctuation))
# regex

In [12]:
# new_token = regex.sub(u'', "_!matrix_.?")
# new_token

In [13]:
tokens_without_punc = [w for w in word_token if w.isalpha()]  # keep alphabetic tokens only; use .isalnum() to also keep numeric tokens
tokens_without_punc

['oh', 'man', 'this', 'is', 'pretty', 'we', 'will', 'do', 'more', 'things']
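Notice that `'cool_'` and `'such_'` disappeared from the result: `str.isalpha()` rejects a whole token if it contains even one non-letter. If you would rather keep such words, one option is to strip punctuation from inside each token first, which is what the commented-out regex cells above hint at. The sketch below assumes a shortened, hand-copied version of the token list so that it runs without NLTK.

```python
import re
import string

# Shortened copy of the word_tokenize output shown earlier (hand-copied for this sketch).
tokens = ["oh", "man", "!", "#", "pretty", "cool_", ".", "such_", "*", "things", "2", "½"]

# Character class that matches any single ASCII punctuation character.
punct_pattern = re.compile("[%s]" % re.escape(string.punctuation))

# First delete punctuation inside every token, then keep purely alphabetic tokens.
stripped = [punct_pattern.sub("", t) for t in tokens]
kept = [t for t in stripped if t.isalpha()]

print(kept)  # ['oh', 'man', 'pretty', 'cool', 'such', 'things']
```

Tokens that consisted only of punctuation become empty strings and are dropped by the `isalpha()` check, as are `'2'` and `'½'`.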

## Removing Stopwords

In [14]:
from nltk.corpus import stopwords

In [15]:
stop_words = stopwords.words("english")
print(stop_words,'\n\n')
print("The number of stopwords:",len(stop_words))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [16]:
token_without_sw = [t for t in tokens_without_punc if t not in stop_words] # for tasks other than sentiment analysis,
                                                                           # negative auxiliary verbs can be removed as well
token_without_sw

['oh', 'man', 'pretty', 'things']

## Data Normalization-Lemmatization

In [17]:
from nltk.stem import WordNetLemmatizer

In [18]:
WordNetLemmatizer().lemmatize("drivers")

'driver'

In [19]:
lemma = [WordNetLemmatizer().lemmatize(t) for t in token_without_sw]

In [20]:
lemma

['oh', 'man', 'pretty', 'thing']

## Data Normalization-Stemming

In [21]:
from nltk.stem import PorterStemmer

In [22]:
PorterStemmer().stem("drivers")

'driver'

In [23]:
stem = [PorterStemmer().stem(t) for t in token_without_sw]

In [24]:
stem

['oh', 'man', 'pretti', 'thing']

## Joining

In [25]:
" ".join(lemma)

'oh man pretty thing'

## Cleaning Function - for classification (NOT for sentiment analysis)

In [26]:
def cleaning(data):
    
    #1. Tokenize and lower
    text_tokens = word_tokenize(data.lower()) 
    
    #2. Remove punctuation and numbers
    tokens_without_punc = [w for w in text_tokens if w.isalpha()]
    
    #3. Removing Stopwords
    tokens_without_sw = [t for t in tokens_without_punc if t not in stop_words]
    
    #4. lemma
    text_cleaned = [WordNetLemmatizer().lemmatize(t) for t in tokens_without_sw]
    
    #joining
    return " ".join(text_cleaned)

In [27]:
pd.Series(sample_text).apply(cleaning) #df["text"].apply(cleaning)

0    oh man pretty thing
dtype: object

## Cleaning Function - for sentiment analysis

In [28]:
sample_text= "Oh man, this is pretty cool. We will do more such things. don't aren't are not. no problem"

In [29]:
s = sample_text.replace("'",'')
s

'Oh man, this is pretty cool. We will do more such things. dont arent are not. no problem'

In [30]:
word = word_tokenize(s.lower())
word 

['oh',
 'man',
 ',',
 'this',
 'is',
 'pretty',
 'cool',
 '.',
 'we',
 'will',
 'do',
 'more',
 'such',
 'things',
 '.',
 'dont',
 'arent',
 'are',
 'not',
 '.',
 'no',
 'problem']

In [31]:
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [32]:
cleaning_words = [t for t in word if t not in stop_words]
cleaning_words

['oh',
 'man',
 ',',
 'pretty',
 'cool',
 '.',
 'things',
 '.',
 'dont',
 'arent',
 '.',
 'problem']

In [33]:
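# Keep the negations "not" and "no" in the text by dropping them from the stopword list.
# Note: list.remove raises ValueError if this cell is run a second time.
for i in ["not", "no"]:
    stop_words.remove(i)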

def cleaning_fsa(data):
    
    #1. Remove apostrophes so negative auxiliary verbs (don't -> dont) no longer match the stopword list and stay in the text
    text = data.replace("'",'')
         
    #2. Tokenize
    text_tokens = word_tokenize(text.lower()) 
    
    #3. Remove punctuation and numbers
    tokens_without_punc = [w for w in text_tokens if w.isalpha()]
    
    #4. Removing Stopwords     
    tokens_without_sw = [t for t in tokens_without_punc if t not in stop_words]
    
    #5. lemma
    text_cleaned = [WordNetLemmatizer().lemmatize(t) for t in tokens_without_sw]
    
    #joining
    return " ".join(text_cleaned)

In [34]:
np.array(pd.Series(sample_text).apply(cleaning_fsa))

array(['oh man pretty cool thing dont arent not no problem'], dtype=object)
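The output keeps `not`, `no`, `dont`, and `arent`, which is the point of this variant: negations often flip the sentiment of a sentence. As a side note, removing items from the shared `stop_words` list in place changes it for every later cell. A non-mutating alternative is sketched below; the short `base_stop_words` list is a hypothetical stand-in for `stopwords.words("english")` so that the snippet runs on its own.

```python
# Hypothetical short list standing in for stopwords.words("english").
base_stop_words = ["i", "is", "this", "not", "no", "do", "are"]

# Words we want to keep in the text for sentiment analysis.
negations = {"not", "no"}

# Build a new set instead of mutating the original list,
# so running this code repeatedly gives the same result.
sentiment_stop_words = {w for w in base_stop_words if w not in negations}

tokens = ["this", "is", "not", "cool", "no", "problem"]
kept = [t for t in tokens if t not in sentiment_stop_words]

print(kept)  # ['not', 'cool', 'no', 'problem']
```

Using a set also makes each membership test faster than scanning a list, which can matter on large corpora.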

## CountVectorization and TF-IDF Vectorization

In [35]:
df = pd.read_csv("airline_tweets.csv")

In [36]:
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [37]:
df = df[['airline_sentiment','text']]
df

| | airline_sentiment | text |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |
| ... | ... | ... |
| 14635 | positive | @AmericanAir thank you we got on a different f... |
| 14636 | negative | @AmericanAir leaving over 20 minutes Late Flig... |
| 14637 | neutral | @AmericanAir Please bring American Airlines to... |
| 14638 | negative | @AmericanAir you have my money, you change my ... |


In [38]:
df = df.head(8)
df

| | airline_sentiment | text |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |
| 5 | negative | @VirginAmerica seriously would pay $30 a fligh... |
| 6 | positive | @VirginAmerica yes, nearly every time I fly VX... |
| 7 | neutral | @VirginAmerica Really missed a prime opportuni... |


In [39]:
df2 = df.copy()

In [40]:
df2["text"] = df2["text"].apply(cleaning_fsa)

In [41]:
df2

| | airline_sentiment | text |
|---|---|---|
| 0 | neutral | virginamerica dhepburn said |
| 1 | positive | virginamerica plus youve added commercial expe... |
| 2 | neutral | virginamerica didnt today must mean need take ... |
| 3 | negative | virginamerica really aggressive blast obnoxiou... |
| 4 | negative | virginamerica really big bad thing |
| 5 | negative | virginamerica seriously would pay flight seat ... |
| 6 | positive | virginamerica yes nearly every time fly vx ear... |
| 7 | neutral | virginamerica really missed prime opportunity ... |


## CountVectorization

In [42]:
X = df2["text"]
y = df2["airline_sentiment"]

In [43]:
from sklearn.model_selection import train_test_split

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, stratify = y, random_state = 42)

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

In [46]:
X_train

6    virginamerica yes nearly every time fly vx ear...
0                          virginamerica dhepburn said
2    virginamerica didnt today must mean need take ...
4                   virginamerica really big bad thing
Name: text, dtype: object

In [47]:
vectorizer = CountVectorizer()
X_train_count = vectorizer.fit_transform(X_train)
X_test_count = vectorizer.transform(X_test)

In [48]:
vectorizer.get_feature_names_out()

array(['another', 'away', 'bad', 'big', 'dhepburn', 'didnt', 'ear',
       'every', 'fly', 'go', 'mean', 'must', 'nearly', 'need', 'really',
       'said', 'take', 'thing', 'time', 'today', 'trip', 'virginamerica',
       'vx', 'worm', 'yes'], dtype=object)

In [49]:
X_train_count.toarray()

array([[0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
        1, 1, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
        0, 0, 0],
       [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
        0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
        0, 0, 0]], dtype=int64)

In [50]:
df_train_count = pd.DataFrame(X_train_count.toarray(), columns = vectorizer.get_feature_names_out(), index=X_train.index)
df_train_count

Unnamed: 0,another,away,bad,big,dhepburn,didnt,ear,every,fly,go,mean,must,nearly,need,really,said,take,thing,time,today,trip,virginamerica,vx,worm,yes
6,0,1,0,0,0,0,1,1,1,1,0,0,1,0,0,0,0,0,1,0,0,1,1,1,1
0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0
2,1,0,0,0,0,1,0,0,0,0,1,1,0,1,0,0,1,0,0,1,1,1,0,0,0
4,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0


In [51]:
X_train

6    virginamerica yes nearly every time fly vx ear...
0                          virginamerica dhepburn said
2    virginamerica didnt today must mean need take ...
4                   virginamerica really big bad thing
Name: text, dtype: object

In [52]:
X_train[6]

'virginamerica yes nearly every time fly vx ear worm go away'

In [53]:
df_train_count.loc[6]

another          0
away             1
bad              0
big              0
dhepburn         0
didnt            0
ear              1
every            1
fly              1
go               1
mean             0
must             0
nearly           1
need             0
really           0
said             0
take             0
thing            0
time             1
today            0
trip             0
virginamerica    1
vx               1
worm             1
yes              1
Name: 6, dtype: int64

In [54]:
df_test_count = pd.DataFrame(X_test_count.toarray(), columns = vectorizer.get_feature_names_out(), index = X_test.index)
df_test_count

Unnamed: 0,another,away,bad,big,dhepburn,didnt,ear,every,fly,go,mean,must,nearly,need,really,said,take,thing,time,today,trip,virginamerica,vx,worm,yes
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
5,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0


In [55]:
X_test

3    virginamerica really aggressive blast obnoxiou...
5    virginamerica seriously would pay flight seat ...
1    virginamerica plus youve added commercial expe...
7    virginamerica really missed prime opportunity ...
Name: text, dtype: object

In [56]:
X_test[3]

'virginamerica really aggressive blast obnoxious entertainment guest face amp little recourse'

In [57]:
vectorizer.vocabulary_

{'virginamerica': 21,
 'yes': 24,
 'nearly': 12,
 'every': 7,
 'time': 18,
 'fly': 8,
 'vx': 22,
 'ear': 6,
 'worm': 23,
 'go': 9,
 'away': 1,
 'dhepburn': 4,
 'said': 15,
 'didnt': 5,
 'today': 19,
 'must': 11,
 'mean': 10,
 'need': 13,
 'take': 16,
 'another': 0,
 'trip': 20,
 'really': 14,
 'big': 3,
 'bad': 2,
 'thing': 17}
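The vocabulary is learned only from the training texts, which is why the test row for `X_test[3]` has non-zero counts only for `really` and `virginamerica`: words such as `aggressive` or `blast` were never seen during `fit` and are silently ignored by `transform`. The hand-rolled sketch below imitates that behavior on two of the training texts and a shortened test text. It splits on whitespace only, so it is an approximation of `CountVectorizer`, not a reimplementation.

```python
from collections import Counter

# Two training texts from the example above and a shortened test text.
train_docs = ["virginamerica dhepburn said", "virginamerica really big bad thing"]
test_doc = "virginamerica really aggressive blast"

# "fit": map every distinct training word to a column index, in alphabetical order.
all_words = sorted({word for doc in train_docs for word in doc.split()})
vocabulary = {term: idx for idx, term in enumerate(all_words)}


def transform(doc):
    """Count vocabulary words in doc; words outside the vocabulary are skipped."""
    row = [0] * len(vocabulary)
    for term, count in Counter(doc.split()).items():
        if term in vocabulary:
            row[vocabulary[term]] = count
    return row


test_row = transform(test_doc)

print(all_words)  # ['bad', 'big', 'dhepburn', 'really', 'said', 'thing', 'virginamerica']
print(test_row)   # [0, 0, 0, 1, 0, 0, 1]
```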

## TF-IDF

How sklearn's TF-IDF differs from the standard TF-IDF:
https://towardsdatascience.com/how-sklearns-tf-idf-is-different-from-the-standard-tf-idf-275fa582e73d

In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [59]:
tf_idf_vectorizer = TfidfVectorizer()
X_train_tf_idf = tf_idf_vectorizer.fit_transform(X_train)
X_test_tf_idf = tf_idf_vectorizer.transform(X_test)

In [60]:
tf_idf_vectorizer.get_feature_names_out()

array(['another', 'away', 'bad', 'big', 'dhepburn', 'didnt', 'ear',
       'every', 'fly', 'go', 'mean', 'must', 'nearly', 'need', 'really',
       'said', 'take', 'thing', 'time', 'today', 'trip', 'virginamerica',
       'vx', 'worm', 'yes'], dtype=object)

In [61]:
X_train_tf_idf.toarray()

array([[0.        , 0.31200802, 0.        , 0.        , 0.        ,
        0.        , 0.31200802, 0.31200802, 0.31200802, 0.31200802,
        0.        , 0.        , 0.31200802, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.31200802, 0.        ,
        0.        , 0.16281873, 0.31200802, 0.31200802, 0.31200802],
       [0.        , 0.        , 0.        , 0.        , 0.66338461,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.66338461, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.34618161, 0.        , 0.        , 0.        ],
       [0.34768534, 0.        , 0.        , 0.        , 0.        ,
        0.34768534, 0.        , 0.        , 0.        , 0.        ,
        0.34768534, 0.34768534, 0.        , 0.34768534, 0.        ,
        0.        , 0.34768534, 0.        , 0.        , 0.34768534,
        0.34768534, 0.18143663, 0.        , 0.

In [62]:
df_train_tfidf = pd.DataFrame(X_train_tf_idf.toarray(), columns = tf_idf_vectorizer.get_feature_names_out(), 
                              index= X_train.index)
df_train_tfidf

Unnamed: 0,another,away,bad,big,dhepburn,didnt,ear,every,fly,go,mean,must,nearly,need,really,said,take,thing,time,today,trip,virginamerica,vx,worm,yes
6,0.0,0.312008,0.0,0.0,0.0,0.0,0.312008,0.312008,0.312008,0.312008,0.0,0.0,0.312008,0.0,0.0,0.0,0.0,0.0,0.312008,0.0,0.0,0.162819,0.312008,0.312008,0.312008
0,0.0,0.0,0.0,0.0,0.663385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.663385,0.0,0.0,0.0,0.0,0.0,0.346182,0.0,0.0,0.0
2,0.347685,0.0,0.0,0.0,0.0,0.347685,0.0,0.0,0.0,0.0,0.347685,0.347685,0.0,0.347685,0.0,0.0,0.347685,0.0,0.0,0.347685,0.347685,0.181437,0.0,0.0,0.0
4,0.0,0.0,0.483803,0.483803,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.483803,0.0,0.0,0.483803,0.0,0.0,0.0,0.252468,0.0,0.0,0.0


In [63]:
X_train[6]

'virginamerica yes nearly every time fly vx ear worm go away'

In [64]:
df_train_tfidf.loc[0].sort_values(ascending=False)

dhepburn         0.663385
said             0.663385
virginamerica    0.346182
another          0.000000
need             0.000000
worm             0.000000
vx               0.000000
trip             0.000000
today            0.000000
time             0.000000
thing            0.000000
take             0.000000
really           0.000000
nearly           0.000000
away             0.000000
must             0.000000
mean             0.000000
go               0.000000
fly              0.000000
every            0.000000
ear              0.000000
didnt            0.000000
big              0.000000
bad              0.000000
yes              0.000000
Name: 0, dtype: float64
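These weights can be reproduced by hand. With its default settings (`smooth_idf=True`, `norm="l2"`), `TfidfVectorizer` computes idf as `ln((1 + n) / (1 + df)) + 1` and then scales each row to unit Euclidean length. The sketch below applies that to row 0; the document frequencies are read off the count matrix above (4 training documents, `virginamerica` in all 4, `dhepburn` and `said` in 1 each). The function name `sklearn_style_tfidf` is made up for this illustration.

```python
import math


def sklearn_style_tfidf(term_counts, doc_freq, n_docs):
    """Smoothed tf-idf followed by L2 normalization, for a single document."""
    weights = {}
    for term, tf in term_counts.items():
        # Smoothed idf with natural logarithm, plus 1 so no term gets weight 0.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        weights[term] = tf * idf
    # Divide by the Euclidean norm so the row has length 1.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row0 = sklearn_style_tfidf(
    term_counts={"virginamerica": 1, "dhepburn": 1, "said": 1},
    doc_freq={"virginamerica": 4, "dhepburn": 1, "said": 1},
    n_docs=4,
)

for term, weight in row0.items():
    print(term, round(weight, 6))
# virginamerica 0.346182
# dhepburn 0.663385
# said 0.663385
```

The word that appears in every document, `virginamerica`, receives the lowest weight, while the words unique to this tweet receive the highest.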

In [65]:
df_test_tfidf = pd.DataFrame(X_test_tf_idf.toarray(), columns = tf_idf_vectorizer.get_feature_names_out(), index = X_test.index)
df_test_tfidf

Unnamed: 0,another,away,bad,big,dhepburn,didnt,ear,every,fly,go,mean,must,nearly,need,really,said,take,thing,time,today,trip,virginamerica,vx,worm,yes
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.886548,0.0,0.0,0.0,0.0,0.0,0.0,0.462637,0.0,0.0,0.0
5,0.0,0.0,0.483803,0.0,0.0,0.483803,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.483803,0.0,0.0,0.483803,0.0,0.0,0.0,0.252468,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.886548,0.0,0.0,0.0,0.0,0.0,0.0,0.462637,0.0,0.0,0.0


In [66]:
X_test[3]

'virginamerica really aggressive blast obnoxious entertainment guest face amp little recourse'

**A Short Review of tf-idf**

In [67]:
np.log10(3/100)  # logarithm provides a kind of scaling

-1.5228787452803376

In [68]:
np.log10(100/3)

1.5228787452803376

In [69]:
np.log10(10000000/3)

6.522878745280337
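The numbers above can be read as textbook inverse document frequencies, `idf = log10(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. The sketch below uses that textbook form, which differs from sklearn's smoothed, natural-log variant. The corpus sizes are the ones from the cells above, and the "common term" case (90 of 100 documents) is an added, made-up comparison point.

```python
import math


def idf(n_docs, doc_freq):
    """Textbook inverse document frequency with a base-10 logarithm."""
    return math.log10(n_docs / doc_freq)


common = idf(100, 90)        # term found in 90 of 100 documents (made-up case)
rare = idf(100, 3)           # term found in 3 of 100 documents
huge = idf(10_000_000, 3)    # term found in 3 of 10,000,000 documents

print(round(common, 4))  # 0.0458
print(round(rare, 4))    # 1.5229
print(round(huge, 4))    # 6.5229
```

Rare terms score higher than common ones, and the logarithm keeps the score manageable: growing the corpus by a factor of 100,000 only adds 5 to the idf.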
