
## CountVectorizer

In [None]:
corpus = ["Facebook is a social network.", "Twitter is another social network.", "WeChat is too."]

In [None]:
# Import CountVectorizer and build the document-term matrix
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X = vect.fit_transform(corpus)

In [None]:
vect.get_feature_names_out()

array(['another', 'facebook', 'is', 'network', 'social', 'too', 'twitter',
       'wechat'], dtype=object)

In [None]:
print(X.toarray())  # document-term matrix (DTM)

[[0 1 1 1 1 0 0 0]
 [1 0 1 1 1 0 1 0]
 [0 0 1 0 0 1 0 1]]


What do you see from the matrix above?

## TfidfVectorizer
The Term Frequency-Inverse Document Frequency vectorizer, or TfidfVectorizer, weights each term by how often it appears in a document, scaled down by how many documents in the corpus contain it. A word that occurs in every document (such as "is") therefore receives a lower weight than a word that is specific to one document.

In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()

In [None]:
dtm = vect.fit_transform(corpus)
print(dtm)

  (0, 3)	0.4804583972923858
  (0, 4)	0.4804583972923858
  (0, 2)	0.3731188059313277
  (0, 1)	0.6317450542765208
  (1, 0)	0.5340933749435834
  (1, 6)	0.5340933749435834
  (1, 3)	0.4061917781433947
  (1, 4)	0.4061917781433947
  (1, 2)	0.31544415103177975
  (2, 5)	0.652490884512534
  (2, 7)	0.652490884512534
  (2, 2)	0.3853716274664007


What do you see in the document-term matrix (dtm) above? Each line shows a (document index, term index) pair followed by that term's TF-IDF weight.
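
The weights above can be reproduced by hand. The sketch below assumes scikit-learn's default settings (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row); the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["Facebook is a social network.", "Twitter is another social network.", "WeChat is too."]

# Raw term counts: one row per document, one column per vocabulary term
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents that contain each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale each row to unit Euclidean length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with what TfidfVectorizer produces on the same corpus
reference = TfidfVectorizer().fit_transform(corpus).toarray()
matches = np.allclose(manual_tfidf, reference)
print(matches)
print(manual_tfidf[0].round(4))
```

In document 0, "facebook" appears in only one document and gets the largest weight (about 0.63), while "is" appears in all three documents and gets the smallest (about 0.37).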

In [None]:
vect.get_feature_names_out()

array(['another', 'facebook', 'is', 'network', 'social', 'too', 'twitter',
       'wechat'], dtype=object)

# Classifier

## Read data

In [None]:
# Import libraries and load the dataset:
import numpy as np
import pandas as pd

df = pd.read_csv('smsspamcollection.tsv', sep='\t')
df.head()

| | label | message | length | punct |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |


In [None]:
df['label'].value_counts()

label
ham     4825
spam     747
Name: count, dtype: int64
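
The classes are imbalanced, which matters when reading accuracy figures. As a rough illustration using only the counts printed above, the snippet below computes the spam share and the accuracy of a hypothetical baseline that always predicts ham on the full dataset; the variable names are illustrative.

```python
# Label counts copied from the value_counts() output
label_counts = {"ham": 4825, "spam": 747}

total = sum(label_counts.values())
spam_share = label_counts["spam"] / total

# Accuracy of a hypothetical classifier that always predicts the majority class
baseline_accuracy = max(label_counts.values()) / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"always-ham baseline accuracy: {baseline_accuracy:.3f}")
```

A useful classifier therefore has to beat roughly 87% accuracy, not 50%.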

In [None]:
from sklearn.model_selection import train_test_split

X = df['message']  # this time we want to look at the text
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Approach 1: CountVectorizer + TfidfTransformer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(3733, 7082)

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(3733, 7082)

## Approach 2: TfidfVectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train)  # fit on the raw text in X_train, not on X_train_counts
X_train_tfidf.shape

(3733, 7082)
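
Both approaches give a matrix of the same shape, and with default settings they should also give the same values. The sketch below checks this on a small made-up corpus (the docs list is an assumption for illustration, not the SMS data).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only to compare the two approaches
docs = [
    "Free entry to win a prize",
    "Are you coming home for dinner",
    "Win cash now, free entry",
]

# Approach 1: count the terms, then reweight the counts
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Approach 2: do both in a single step
one_step = TfidfVectorizer().fit_transform(docs)

same_values = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same_values)
```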

## Train a classifier

In [None]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf, y_train)
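
To use a classifier trained this way on new messages, the messages first have to be converted with the already fitted vectorizer, using transform rather than fit_transform, so that the columns line up with the training vocabulary. A minimal sketch on a made-up toy dataset (the texts, labels, and toy_ names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy training data
toy_texts = [
    "Win a free prize now",
    "Free entry to win cash",
    "Are we meeting for lunch today",
    "See you at home tonight",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = TfidfVectorizer()
toy_train_tfidf = toy_vectorizer.fit_transform(toy_texts)  # learns the vocabulary

toy_clf = LinearSVC()
toy_clf.fit(toy_train_tfidf, toy_labels)

# New text: reuse the fitted vocabulary with transform(), do not refit
new_texts = ["Win free cash now"]
new_tfidf = toy_vectorizer.transform(new_texts)
predicted = toy_clf.predict(new_tfidf)
print(new_tfidf.shape, predicted)
```

Keeping these two steps in sync by hand is easy to get wrong, which is the problem a pipeline addresses.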

## Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC())])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)
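
Once fitted, a pipeline accepts raw strings directly, because it applies the TF-IDF step before the classifier. A small self-contained sketch, on made-up toy messages rather than the SMS data (all toy_ names and texts are assumptions for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy training data
toy_texts = [
    "Win a free prize now",
    "Free entry to win cash",
    "Are we meeting for lunch today",
    "See you at home tonight",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                         ('clf', LinearSVC())])
toy_pipeline.fit(toy_texts, toy_labels)

# Raw strings go straight in; vectorizing happens inside the pipeline
new_messages = ["Free cash prize, reply now", "Lunch at home today?"]
toy_predictions = toy_pipeline.predict(new_messages)
print(list(zip(new_messages, toy_predictions)))

# Individual steps remain accessible by the names given above
vocab_size = len(toy_pipeline.named_steps['tfidf'].vocabulary_)
print(vocab_size)
```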

## Test the classifier

In [None]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [None]:
# Report the confusion matrix (rows: true labels, columns: predicted labels)
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score
print(confusion_matrix(y_test, predictions))

[[1586    7]
 [  12  234]]


In [None]:
# Print a classification report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839
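
The figures in the report can be recomputed from the confusion matrix printed earlier. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with the classes in sorted order (ham, then spam); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[1586, 7],
               [12, 234]])

true_ham, false_spam = cm[0]   # actual ham: predicted ham, predicted spam
false_ham, true_spam = cm[1]   # actual spam: predicted ham, predicted spam

accuracy = (true_ham + true_spam) / cm.sum()
spam_precision = true_spam / (true_spam + false_spam)
spam_recall = true_spam / (true_spam + false_ham)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy:       {accuracy:.2f}")
print(f"spam precision: {spam_precision:.2f}")
print(f"spam recall:    {spam_recall:.2f}")
print(f"spam f1-score:  {spam_f1:.2f}")
```

Read this way, 7 legitimate messages were flagged as spam and 12 spam messages were classified as ham.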

