# Day 2: 2.1

1. N-Grams
2. Bag-of-Words (CountVectorizer)
3. Email Spam Detection

### N-Grams in One_NLP

## Bag-of-Words
1. Used to perform document-level (corpus-level) tasks.
2. Is a vectorization technique for representing text data.
3. Ignores grammar and the order of words in a sentence.

**Example usage:** Sentiment Analysis and Spam Detection
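
As a small illustration of point 3, the sketch below counts the words of two made-up sentences that differ only in word order. The sentences and variable names are assumptions chosen for the example, not part of the email dataset.

```python
from collections import Counter

# Two toy sentences with the same words in a different order (made-up data).
first = "dog bites man"
second = "man bites dog"

# A bag of words keeps only how often each word occurs.
first_bag = Counter(first.split())
second_bag = Counter(second.split())

# The two bags are equal even though the sentences mean different things.
same_bag = first_bag == second_bag
print(first_bag)
print(same_bag)
```

Because the two bags are equal, any model trained on them cannot tell the sentences apart, which is the price paid for this simple representation.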

- The bag-of-words model is a way of extracting features from text and representing text data so that it can be modeled with a machine learning algorithm
    - Tokenization
        - While creating the BOW, the tokenized words of each observation are used.
    - Process
        - Collect Data
        - Create a vocabulary by listing all unique words
        - Create document vectors after scoring
    - Scoring Mechanism
        - Word hashing
        - TF-IDF
        - Boolean value
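
The process above can be sketched in plain Python. This is a minimal illustration on a made-up three-document corpus, assuming whitespace tokenization; it shows count scoring and Boolean scoring only, and it is not how `CountVectorizer` is implemented internally.

```python
# Toy corpus (made-up data for illustration).
corpus = [
    "free money free offer",
    "meeting at noon",
    "free meeting offer",
]

# 1. Tokenization: split each document into lowercase words.
tokenized = [doc.lower().split() for doc in corpus]

# 2. Vocabulary: every unique word, sorted so each word gets a fixed column.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# 3. Scoring: one vector per document, here using raw counts ...
count_vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

# ... or using Boolean presence/absence instead of counts.
boolean_vectors = [[int(word in tokens) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in count_vectors:
    print(row)
```

Each row has one entry per vocabulary word, so every document becomes a vector of the same length regardless of how long the original text was.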

In [1]:
import pandas as pd
import numpy as np

In [2]:
emails = pd.read_csv('02_emails.csv')
emails.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [3]:
emails.spam.value_counts()/emails.shape[0]*100

0    76.117318
1    23.882682
Name: spam, dtype: float64

In [4]:
# Lowercase first, then remove every 'subject:' marker regardless of its original case
emails['text'] = emails['text'].str.lower().str.replace('subject:', '', regex=False)

In [5]:
emails.head()

Unnamed: 0,text,spam
0,naturally irresistible your corporate identit...,1
1,the stock trading gunslinger fanny is merril...,1
2,unbelievable new homes made easy im wanting ...,1
3,4 color printing special request additional ...,1
4,"do not have money , get software cds from her...",1


In [6]:
from sklearn.feature_extraction.text import CountVectorizer # CountVectorizer -> BagOfWords

In [7]:
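# Despite its name, `matrix` holds the vectorizer; X is the document-term count matrix.
matrix = CountVectorizer(lowercase=True,
                         stop_words='english',  # drop common English stop words
                         min_df=0.2,            # keep words found in at least 20% of emails
                         max_df=0.95)           # drop words found in more than 95% of emails
X = matrix.fit_transform(emails['text']).toarray()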

In [8]:
X[0:5]

array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)

In [9]:
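# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
matrix.get_feature_names_out().tolist()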



['10',
 '2000',
 'cc',
 'com',
 'ect',
 'enron',
 'group',
 'hou',
 'information',
 'kaminski',
 'know',
 'let',
 'like',
 'need',
 'pm',
 'research',
 'subject',
 'thanks',
 'time',
 'vince']

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
x_train, x_test, y_train, y_test = train_test_split(X, emails['spam'], train_size=0.8, random_state=50)

In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [13]:
tree_model = DecisionTreeClassifier()
tree_model.fit(x_train, y_train)

DecisionTreeClassifier()

In [14]:
pred = tree_model.predict(x_test)

In [15]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [16]:
confusion_matrix(y_true=y_test, y_pred=pred)

array([[815,  63],
       [ 26, 242]], dtype=int64)

In [17]:
accuracy_score(y_pred=pred, y_true=y_test)

0.9223385689354275
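
The accuracy above can be recovered by hand from the decision tree's confusion matrix, which also gives precision and recall for the spam class. The sketch below hard-codes the four counts from the output above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are chosen for the example.

```python
# Counts copied from the decision tree's confusion matrix above
# (rows = true class, columns = predicted class; class 1 = spam).
tn, fp = 815, 63   # true non-spam: predicted non-spam / predicted spam
fn, tp = 26, 242   # true spam:     predicted non-spam / predicted spam

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all test emails classified correctly
precision = tp / (tp + fp)     # share of predicted spam that really is spam
recall = tp / (tp + fn)        # share of real spam that was caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

With roughly 76% of the emails being non-spam, accuracy alone can look good even when spam is handled poorly, so precision and recall on the spam class are worth reading alongside it.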

In [18]:
rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)

RandomForestClassifier()

In [19]:
rf_pred = rf_model.predict(x_test)

In [20]:
confusion_matrix(y_true=y_test, y_pred=rf_pred)

array([[819,  59],
       [ 26, 242]], dtype=int64)

In [21]:
accuracy_score(y_true=y_test, y_pred=rf_pred)

0.9258289703315882

## TF-IDF (Term Frequency-Inverse Document Frequency)
- **Bag of Words assumes that each word is equally important.**
- **In real-world scenarios, each word carries its own weight depending on the context.**
    - **Example:**
        - The word "cost" occurs frequently in economy-related documents, so a high count alone says little about any one of them. To overcome this limitation, TF-IDF is used, which assigns weights to words based on their relevance to the document.
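
The weighting idea can be sketched with the textbook formula, term frequency multiplied by log(N / document frequency), on a made-up corpus in which "cost" appears in every document. The corpus and the helper `tf_idf` are assumptions for illustration; scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant by default, so its numbers would differ.

```python
import math

# Toy corpus (made-up data): "cost" appears in every document, "inflation" in one.
corpus = [
    "cost of inflation",
    "cost of labour",
    "cost of transport",
]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

def tf_idf(term, tokens):
    """Textbook TF-IDF weight of `term` within one tokenized document."""
    # Term frequency: share of this document's tokens that are `term`.
    tf = tokens.count(term) / len(tokens)
    # Document frequency: number of documents that contain `term`.
    df = sum(term in doc for doc in tokenized)
    # Inverse document frequency: zero for a term found in every document.
    idf = math.log(n_docs / df)
    return tf * idf

cost_weight = tf_idf("cost", tokenized[0])
inflation_weight = tf_idf("inflation", tokenized[0])
print(cost_weight, inflation_weight)
```

Both words occur once in the first document, yet "cost" receives no weight because it appears everywhere, while "inflation" is weighted up because it distinguishes that document from the others.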