# Tutorial: Machine Learning with Text in scikit-learn

# Agenda
   1. Model building in scikit-learn (refresher)
   2. Representing text as numerical data
   3. Reading a text-based dataset into pandas
   4. Vectorizing our dataset
   5. Building and evaluating a model
   6. Comparing models
   7. Examining a model for further insight
   8. Tuning the vectorizer (discussion)

## Part 1: Model building in scikit-learn (refresher)

In [93]:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()

In [94]:
# store the features matrix (X) and response vector (y)
X = iris.data
y = iris.target

**"Features"** are also known as predictors, inputs, or attributes. The **"response"** is also known as the target, label, or output. **"Observations"** are also known as samples, instances, or records.

In [101]:
# check the shapes of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)


In [102]:
# examine the first 5 rows of the feature matrix (including the feature names)
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [103]:
# examine the response vector
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**.

In [104]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model (with the default parameters)
knn = KNeighborsClassifier()

# fit the model with data (occurs in-place)
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [122]:
# predict the response for a new observation
knn.predict([[6.3, 3.3, 6. , 2.5]])

array([2])

## Part 2: Representing text as numerical data

In [2]:
import pandas as pd
import sklearn

In [3]:
simple_train = ['call you tonight', 'Call me  a cab', 'please call me...PLEASE!']

In [4]:
# import and instantiate CountVectorizer 
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# CountVectorizer can only handle one-dimensional objects (e.g. a list or Series of strings)

In [5]:
# learn the vocabulary of the training data
vect.fit(simple_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [6]:
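# examine the fitted vocabulary
# (note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions)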
vect.get_feature_names()

['cab', 'call', 'me', 'please', 'tonight', 'you']

In [7]:
# transform training data into a document-term matrix
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [8]:
# the 3x6 document-term matrix has 3 rows and 6 columns: the 3 rows are the 3 documents, and the 6 columns are the 6 terms (features/tokens) learned during fitting.

In [9]:
# convert the sparse matrix to a dense array
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]], dtype=int64)

In [10]:
# show the terms
pd.DataFrame(simple_train_dtm.toarray(), columns = vect.get_feature_names())

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,0,0,1,1
1,1,1,1,0,0,0
2,0,1,1,2,0,0


In [11]:
# a corpus of documents can be represented as a matrix with one row per document and one column per token
# vectorization is the process (just shown) of converting text documents to numerical feature vectors
# the term 'bag of words' simply means that word order is not tracked,
# so you cannot reconstruct the original documents from the document-term matrix
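
To see why the original text cannot be reconstructed, here is a minimal sketch. The two example strings are made up for illustration: they contain the same words in a different order, and they end up with identical rows.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two made-up documents: same words, different order
docs = ['call me tonight', 'tonight call me']

bow = CountVectorizer().fit_transform(docs).toarray()
print(bow)

# both rows are identical, so the word order is lost
rows_identical = bool((bow[0] == bow[1]).all())
print(rows_identical)
```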

In [12]:
# check the type of the document-term matrix
type(simple_train_dtm)

scipy.sparse.csr.csr_matrix

In [13]:
# examine the sparse matrix contents
print(simple_train_dtm)
# the coordinates indicate the locations of the non-zero values

  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2


In [14]:
# as most documents typically use only a very small subset of the words in the
# corpus, the matrix will have many feature values that are zero.
# the number of columns in the matrix is the number of unique tokens in the corpus.
# to store such a matrix efficiently in memory and to speed up operations,
# a sparse representation such as scipy.sparse is used.
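
As a rough illustration of sparsity, the sketch below computes the share of non-zero cells (the "density") in the small document-term matrix from above. The variable names `n_cells` and `density` are only for illustration. Even in this tiny example half of the cells are zero; with thousands of tokens the share of zeros is usually far larger.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me  a cab', 'please call me...PLEASE!']
dtm = CountVectorizer().fit_transform(simple_train)

# nnz is the number of stored (non-zero) elements in the sparse matrix
n_rows, n_cols = dtm.shape
n_cells = n_rows * n_cols
density = dtm.nnz / n_cells
print(dtm.nnz, n_cells, density)
```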

In [15]:
# example text for model testing
simple_test = ["please don't call me"]

In [16]:
# in order to make a prediction, the new observation needs to have the same
# features as the training observations, both in number and in meaning.
# thus, we use the transform method (without refitting)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 1, 1, 0, 0]], dtype=int64)

In [17]:
# vect.fit(train) learns the vocabulary of the training data
# vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
# vect.transform(test) uses the same fitted vocabulary to build a document-term matrix from the testing data,
# ignoring any tokens it has not seen before
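
One way to check which test tokens get ignored is sketched below. It uses `build_analyzer()` to tokenize the test message the same way the vectorizer does, then looks each token up in the fitted `vocabulary_`. The names `analyze`, `test_tokens` and `unseen` are just for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me  a cab', 'please call me...PLEASE!']
simple_test = ["please don't call me"]

vect = CountVectorizer()
vect.fit(simple_train)

# build_analyzer() returns the function the vectorizer uses to tokenize a document
analyze = vect.build_analyzer()
test_tokens = analyze(simple_test[0])

# tokens that are not in the fitted vocabulary are dropped by transform()
unseen = [t for t in test_tokens if t not in vect.vocabulary_]
print(test_tokens)
print(unseen)
```

With the default token pattern, "don't" is split at the apostrophe and the single-character "t" is discarded, so the only unseen token here is "don".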

In [18]:
print(vect.transform(simple_test))

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1


In [19]:
pd.DataFrame(simple_test_dtm.toarray(), columns = vect.get_feature_names())

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,1,1,0,0


## Part 3: Reading a text-based dataset into pandas

In [123]:
path = 'C:/Users/zhangxinhua/Desktop/Python/data/sms.tsv'

In [21]:
sms = pd.read_table(path, header = None, names = ['label', 'message'])

In [22]:
# convert label to a numerical variable
sms['label'] = sms.label.map({'ham':0, 'spam':1})

In [23]:
sms.head(10)

Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
5,1,FreeMsg Hey there darling it's been 3 week's n...
6,0,Even my brother is not like to speak with me. ...
7,0,As per your request 'Melle Melle (Oru Minnamin...
8,1,WINNER!! As a valued network customer you have...
9,1,Had your mobile 11 months or more? U R entitle...


In [24]:
sms.shape

(5572, 2)

In [25]:
# examine the class distribution
sms.label.value_counts()

0    4825
1     747
Name: label, dtype: int64

In [26]:
# how to define X and y for the use of CountVectorizer
X = sms.message
y = sms.label
print(X.shape)
print(y.shape)
# usually X is two-dimensional. here it is one-dimensional for now, but it will be turned into a 2-D matrix by CountVectorizer
# *CountVectorizer can only handle one-dimensional input (e.g. it cannot handle a (5572, 1) object)

(5572,)
(5572,)


In [27]:
# split X and y into training and test sets (*before vectorizing them)
# we need to split before vectorizing because:
# 1. if we vectorize first, the vocabulary is built from the whole corpus (train and test together)
# 2. to simulate the real world, where new messages contain tokens that the training set has never seen
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4179,)
(1393,)
(4179,)
(1393,)
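
The reason for splitting first can be sketched with a toy corpus (the three messages below are made up). If the vectorizer is fitted on train and test together, tokens that only occur in the test message leak into the vocabulary, which would not happen with truly unseen future data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up toy messages, for illustration only
train_docs = ['free prize now', 'see you at lunch']
test_docs = ['claim your free voucher']

# vocabulary learned from the training messages only
vocab_train_only = set(CountVectorizer().fit(train_docs).vocabulary_)

# vocabulary learned from train + test together (what happens if we vectorize before splitting)
vocab_leaky = set(CountVectorizer().fit(train_docs + test_docs).vocabulary_)

leaked = sorted(vocab_leaky - vocab_train_only)
print(len(vocab_train_only), len(vocab_leaky))
print(leaked)
```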


## Part 4: Vectorizing our dataset

In [28]:
# instantiate the vectorizer
vect = CountVectorizer()

In [29]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

In [30]:
# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)

In [31]:
# examine the dtm
X_train_dtm

<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

In [32]:
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1393x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>

## Part 5: Building and evaluating a model

In [33]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [34]:
# train the model using X_train_dtm (timing it with the magic function %time)
# X_train_dtm is used instead of X_train because model building requires numbers
%time nb.fit(X_train_dtm, y_train)

Wall time: 2.99 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [35]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [36]:
# calculate accuracy of class prediction
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.9885139985642498

In [37]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
# tn,fp 
# fn,tp

array([[1203,    5],
       [  11,  174]], dtype=int64)
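
As a quick sanity check, the accuracy reported earlier can be recomputed from the four cells of this confusion matrix. The sketch below hard-codes the matrix printed above; sensitivity and specificity are added only as extra illustrations of what can be read off the same four numbers.

```python
import numpy as np

# confusion matrix copied from the output above: rows = actual, columns = predicted
cm = np.array([[1203, 5],
               [11, 174]])

# ravel() flattens row by row, giving tn, fp, fn, tp for a binary problem
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()    # share of all messages classified correctly
sensitivity = tp / (tp + fn)       # share of actual spam that was caught
specificity = tn / (tn + fp)       # share of actual ham that was left alone
print(accuracy, sensitivity, specificity)
```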

In [38]:
# inspect the texts of the false positives and false negatives to see how the model could be improved to get them right

In [39]:
# print message text for the false positives (ham incorrectly predicted as spam)
# meaning y_test = ham (0) and y_pred_class = spam (1)
print(y_test)
print(y_pred_class)
X_test

1078    0
4028    0
958     0
4642    0
4674    0
       ..
3207    0
4655    0
1140    0
1793    1
1710    0
Name: label, Length: 1393, dtype: int64
[0 0 0 ... 0 1 0]


1078                         Yep, by the pretty sculpture
4028        Yes, princess. Are you going to make me moan?
958                            Welp apparently he retired
4642                                              Havent.
4674    I forgot 2 ask ü all smth.. There's a card on ...
                              ...                        
3207                                        At home also.
4655                     Hope you are having a great day.
1140    Message:some text missing* Sender:Name Missing...
1793    WIN: We have a winner! Mr. T. Foley won an iPo...
1710    U meet other fren dun wan meet me ah... Muz b ...
Name: message, Length: 1393, dtype: object

In [40]:
# print out messages for the false positives (predicted spam, actually ham)
X_test[(y_pred_class ==  1) & (y_test == 0)]
# y_pred_class and y_test are in the same order (the index is preserved)

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

In [41]:
# or a more elegant expression, since the classes are numeric (1 > 0)
X_test[y_pred_class>y_test]

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

In [42]:
# with the same logic, these are the false negatives (predicted ham, actually spam)
X_test[y_pred_class<y_test]

3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object

In [43]:
y_test

1078    0
4028    0
958     0
4642    0
4674    0
       ..
3207    0
4655    0
1140    0
1793    1
1710    0
Name: label, Length: 1393, dtype: int64

In [44]:
# example of false negative
X_test[3132]

"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

In [45]:
# nb.predict_proba returns a NumPy array with the predicted probabilities of class 0 and class 1
nb.predict_proba(X_test_dtm)

array([[9.97122551e-01, 2.87744864e-03],
       [9.99981651e-01, 1.83488846e-05],
       [9.97926987e-01, 2.07301295e-03],
       ...,
       [9.99998910e-01, 1.09026171e-06],
       [1.86697467e-10, 1.00000000e+00],
       [9.99999996e-01, 3.98279868e-09]])

In [46]:
# calculate predicted probabilities of spam for X_test_dtm (poorly calibrated: Naive Bayes produces extreme values, so they
# should not be interpreted as actual probabilities, i.e. when it says the probability is 1, it is not really 1)
y_pred_prob = nb.predict_proba(X_test_dtm)[:,1]
y_pred_prob

array([2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,
       1.09026171e-06, 1.00000000e+00, 3.98279868e-09])

In [47]:
# y_pred_prob (not y_pred_class) is needed for calculating AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.9866431000536962
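
Why probabilities rather than class predictions? AUC measures how well the model ranks spam above ham, and thresholding throws most of that ranking away. A minimal sketch with made-up labels and scores (not taken from the SMS data):

```python
from sklearn import metrics

# made-up true labels and predicted probabilities, for illustration only
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.45, 0.9]

# class predictions using a 0.5 threshold
y_class = [int(p > 0.5) for p in y_prob]

# every positive is scored above every negative, so the ranking is perfect
auc_from_prob = metrics.roc_auc_score(y_true, y_prob)

# after thresholding, two of the positives are tied with the negatives
auc_from_class = metrics.roc_auc_score(y_true, y_class)
print(auc_from_prob, auc_from_class)
```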

## Part 6: Comparing models

In [49]:
# import and instantiate logistic regression from sklearn
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [50]:
# train the model using X_train_dtm (slower than Naive Bayes)
%time logreg.fit(X_train_dtm, y_train)

Wall time: 78.8 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [51]:
# make class prediction
y_pred_class = logreg.predict(X_test_dtm)

In [52]:
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:,1]
y_pred_prob

array([0.00959377, 0.00295662, 0.00452424, ..., 0.031302  , 0.99748962,
       0.00119521])

In [53]:
# calculate accuracy 
metrics.accuracy_score(y_test, y_pred_class)

0.9877961234745154

In [54]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.9936280651512441

## Part 7: Examining a model for further insight

In [56]:
# store the vocabulary of X_train 
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)

7456

In [57]:
# examine the first 50 tokens (from numeric 0 to abc...z to symbols)
print(X_train_tokens[0:50])

['00', '000', '008704050406', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07008009200', '07090201529', '07090298926', '07123456789', '07732584351', '07734396839', '07742676969', '0776xxxxxxx', '07781482378', '07786200117', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705']


In [58]:
# examine the last 50 tokens
print(X_train_tokens[-50:])

['yer', 'yes', 'yest', 'yesterday', 'yet', 'yetunde', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'youphone', 'your', 'youre', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yoyyooo', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'zac', 'zaher', 'zealand', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'èn', '〨ud']


In [59]:
# Naive bayes counts the number of times each token appears in each class
nb.feature_count_
# the ending '_' after feature_count is a sklearn convention for attributes that are learnt during fitting

array([[ 0.,  0.,  0., ...,  1.,  1.,  1.],
       [ 5., 23.,  2., ...,  0.,  0.,  0.]])

In [60]:
# interpretation of the output above:
# the first token '00' appeared 0 times in ham and 5 times in spam
# the way Naive Bayes works for text is that it learns the 'spamminess' vs. the 'hamminess' of each token
# and uses them to make the prediction: for each token, it calculates the conditional probability of that token given each class,
# and then uses Bayes' theorem to get the probability of each class given the tokens (A given B from B given A)
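
For reference, here is a hedged sketch of how the per-class token probabilities inside `MultinomialNB` relate to `feature_count_`. It uses a made-up four-message corpus and the default `alpha=1.0`: each count is smoothed by adding alpha and then divided by the smoothed total for that class. The cells that follow in this notebook use a simpler normalization (dividing by the number of messages per class), which is fine for ranking tokens but is not the same quantity.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# made-up toy corpus: 1 = spam, 0 = ham
docs = ['win a free prize', 'free prize claim now', 'see you at lunch', 'call me at lunch']
labels = [1, 1, 0, 0]

X = CountVectorizer().fit_transform(docs)
nb = MultinomialNB(alpha=1.0).fit(X, labels)

# add-one (Laplace) smoothing, then normalize within each class
smoothed = nb.feature_count_ + 1.0
manual_log_prob = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# compare with the log P(token | class) that the fitted model stores
matches = np.allclose(manual_log_prob, nb.feature_log_prob_)
print(matches)
```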

In [61]:
nb.feature_count_.shape 

(2, 7456)

In [62]:
# number of times each token appears across all ham messages
ham_token_count = nb.feature_count_[0,:]
ham_token_count

array([0., 0., 0., ..., 1., 1., 1.])

In [63]:
# number of times each token appears across all spam messages
spam_token_count = nb.feature_count_[1,:]
spam_token_count

array([ 5., 23.,  2., ...,  0.,  0.,  0.])

In [64]:
# create a dataframe of tokens with their separate ham and spam counts
tokens = pd.DataFrame({'token': X_train_tokens, 'ham': ham_token_count, 'spam': spam_token_count}).set_index('token')
# pass a dictionary to pd.DataFrame -> Keys are the column names and the values becomes the columns. Set the index as token.
tokens.head()

token,ham,spam
00,0.0,5.0
000,0.0,23.0
008704050406,0.0,2.0
0121,0.0,1.0
01223585236,0.0,1.0


In [65]:
len(tokens)
# every token in the training vocabulary

7456

In [66]:
# examine 5 random DataFrame rows (tokens)
tokens.sample(5, random_state = 6)
# 5 -> number of rows, 6 -> random seed
# *the counts are the number of times the token appeared, not the number of messages it appeared in, i.e. a token
# can repeat several times in the same message
# 'nasty' is still a more spammy word even though both counts are 1, because there are far fewer spam messages than ham messages

token,ham,spam
very,64.0,2.0
nasty,1.0,1.0
villa,0.0,1.0
beloved,1.0,0.0
textoperator,0.0,2.0


In [67]:
# Naive bayes counts the number of observations in each class
nb.class_count_

array([3617.,  562.])

In [68]:
# calculate the spamminess and hamminess of each token
# *add 1 to the ham and spam counts to avoid dividing by 0 (this also solves the conceptual issue that a count of 0,
# e.g. the spam count of 'beloved', would otherwise imply that the word's spamminess is exactly 0)
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state = 6)

token,ham,spam
very,65.0,3.0
nasty,2.0,2.0
villa,1.0,2.0
beloved,2.0,1.0
textoperator,1.0,3.0


In [69]:
# convert the ham and spam counts into frequencies (divide by the number of messages in each class)
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state = 6)

token,ham,spam
very,0.017971,0.005338
nasty,0.000553,0.003559
villa,0.000276,0.003559
beloved,0.000553,0.001779
textoperator,0.000276,0.005338


In [70]:
# calculate the ratio of spam-to-ham for each token
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5, random_state = 6)
# spam_ratio only gives a sense of the relative ranking and shouldn't be interpreted as a precise numeric quantity

token,ham,spam,spam_ratio
very,0.017971,0.005338,0.297044
nasty,0.000553,0.003559,6.435943
villa,0.000276,0.003559,12.871886
beloved,0.000553,0.001779,3.217972
textoperator,0.000276,0.005338,19.307829
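
The three steps above (add 1, divide by the class size, take the ratio) can be checked by hand. The sketch below wraps them in a small helper, `spam_ratio`, which is a hypothetical name used only here. The class sizes come from `nb.class_count_` and the raw counts from the sampled rows shown earlier.

```python
# class sizes from nb.class_count_ above
ham_messages = 3617
spam_messages = 562

def spam_ratio(ham_count, spam_count):
    """Spam-to-ham ratio of a token, with add-one smoothing on the raw counts."""
    ham_freq = (ham_count + 1) / ham_messages
    spam_freq = (spam_count + 1) / spam_messages
    return spam_freq / ham_freq

# raw counts from the sampled rows: 'nasty' (ham=1, spam=1), 'very' (ham=64, spam=2)
nasty_ratio = spam_ratio(1, 1)
very_ratio = spam_ratio(64, 2)
print(round(nasty_ratio, 6), round(very_ratio, 6))
```

With equal raw counts, 'nasty' still gets a ratio of 3617 / 562, which is simply the ham-to-spam class imbalance.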


In [71]:
# Naive Bayes looks at a message as a set of individual words and uses the spamminess of each word, via conditional
# probabilities, to make its prediction

In [72]:
# examine the DataFrame sorted by spam_ratio
tokens.sort_values('spam_ratio', ascending = False)

token,ham,spam,spam_ratio
claim,0.000276,0.158363,572.798932
prize,0.000276,0.135231,489.131673
150p,0.000276,0.087189,315.361210
tone,0.000276,0.085409,308.925267
guaranteed,0.000276,0.076512,276.745552
...,...,...,...
da,0.032900,0.001779,0.054084
she,0.035665,0.001779,0.049891
he,0.047000,0.001779,0.037858
lt,0.064142,0.001779,0.027741


In [73]:
# look up the spam_ratio for a given token
tokens

token,ham,spam,spam_ratio
00,0.000276,0.010676,38.615658
000,0.000276,0.042705,154.462633
008704050406,0.000276,0.005338,19.307829
0121,0.000276,0.003559,12.871886
01223585236,0.000276,0.003559,12.871886
...,...,...,...
zoom,0.000553,0.001779,3.217972
zouk,0.000276,0.003559,12.871886
zyada,0.000553,0.001779,3.217972
èn,0.000553,0.001779,3.217972


In [83]:
# to examine the spam ratio of words
tokens.loc['sms', 'spam_ratio']

11.799228944246737

In [84]:
tokens.loc['dating','spam_ratio']

83.66725978647686

## Part 8: Tuning the vectorizer (discussion)

In [126]:
# show default parameters for CountVectorizer
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

**Tuning CountVectorizer:** sklearn tries to provide sensible defaults; however, it's worth tuning the vectorizer, just like a model. Below are some parameters that are useful and relatively easy/effective to tune.

**stop_words:** string {'english'}, list, or None (default) -> to remove stop words

In [129]:
# remove English stop words (built-in stop word list)
vect = CountVectorizer(stop_words = 'english')
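
A small sketch of the effect, using a made-up sentence: with the built-in English list, common words such as "the" and "on" disappear from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up sentence, for illustration only
docs = ['the cat sat on the mat']

vocab_default = sorted(CountVectorizer().fit(docs).vocabulary_)
vocab_no_stop = sorted(CountVectorizer(stop_words='english').fit(docs).vocabulary_)

print(vocab_default)
print(vocab_no_stop)
```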

**ngram_range:** the lower and upper bounds of n for the n-grams to extract. The default is (1, 1), meaning only 1-grams -> widening it helps identify important word pairs

In [130]:
# include 1-gram and 2-grams
vect = CountVectorizer(ngram_range = (1, 2))
# the danger of using 2-grams is that the number of features grows very quickly, which may introduce more noise than signal
# (e.g. a lot of 2-grams appear only once in the dataset)
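
To make the feature growth concrete, here is a minimal sketch on a single made-up message: three 1-grams become five features once 2-grams are included.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up message, for illustration only
docs = ['call me tonight']

unigrams = sorted(CountVectorizer(ngram_range=(1, 1)).fit(docs).vocabulary_)
uni_and_bigrams = sorted(CountVectorizer(ngram_range=(1, 2)).fit(docs).vocabulary_)

print(unigrams)
print(uni_and_bigrams)
```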

**max_df:** max document frequency -> ignore terms that have a document frequency higher than the threshold, given as a proportion from 0 to 1 (float) or as a document count (int). It works similarly to stop words, but these are corpus-specific stop words.

In [132]:
# ignore terms that appear in more than 20% of the documents
vect = CountVectorizer(max_df = 0.2)

**min_df:** min document frequency -> ignore terms that have a document frequency lower than the threshold, given as a proportion from 0 to 1 (float) or as a document count (int). This removes terms that are too rare to be informative.

In [None]:
vect = CountVectorizer(min_df = 2)
# min_df = 2 means a term must appear in at least two documents to be kept
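
Both thresholds can be tried on a made-up four-message corpus, as a rough sketch: "call" appears in every message, "now" in two, and every other token in one.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up messages, for illustration only
docs = ['free call now', 'call me now', 'call you later', 'call home']

# max_df=0.5 drops terms found in more than 50% of the documents ('call')
vocab_max_df = sorted(CountVectorizer(max_df=0.5).fit(docs).vocabulary_)

# min_df=2 drops terms found in fewer than 2 documents (everything except 'call' and 'now')
vocab_min_df = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

print(vocab_max_df)
print(vocab_min_df)
```

Note that "now" sits exactly on the 50% threshold and is kept, since only terms strictly above max_df are ignored.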

## Guidelines for tuning CountVectorizer:
* Use your knowledge of the problem and the text, and your understanding of the tuning parameters, to help you decide what parameters to tune and how to tune them.
* Experiment, and let the data tell you the best approach.

# Summary: small tricks I have learnt
1. Green box: edit mode; blue box: command mode. Press Esc to go to command mode and Enter to go to edit mode.
2. In command mode, use A to add a cell above, B to add a cell below, and X to cut a cell. Use S to save the file.
3. Use Shift + Tab to view the parameters of a function/method.
4. An attribute has no (); a function/method has ().
5. Modelling with sklearn: import -> instantiate -> fit -> predict.
6. Use %time to get a rough sense of the time needed.
7. Naive Bayes is fast. For big datasets with cross-validation, we can sometimes time Naive Bayes with %time and use that to estimate the time needed for other models, e.g. when logistic regression is too slow, Naive Bayes may take only 1/4 of the time.
8. In this text analysis, sklearn doesn't know what we are analysing. We converted all documents to a numeric representation (word counts). Thus, any classification model can be used for text problems; Naive Bayes is just a popular choice.
9. The index is like row labels. It can be something like 1, 2, 3, 4... or names, and it can have duplicates too. set_index() can be used to set a column as the index. reset_index() can be used to reset the index when there is a multi-index issue (and when the format goes all weird).
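
A short sketch of the last trick, using a made-up three-row table (the counts are illustrative, not from the SMS data): `set_index()` moves a column into the index so rows can be looked up by name, and `reset_index()` moves it back into a regular column.

```python
import pandas as pd

# made-up counts, for illustration only
tokens = pd.DataFrame({'token': ['alpha', 'beta', 'gamma'],
                       'ham': [10, 0, 3],
                       'spam': [1, 7, 3]})

# use the 'token' column as the row labels
indexed = tokens.set_index('token')
beta_spam = indexed.loc['beta', 'spam']

# move the index back into an ordinary column
restored = indexed.reset_index()
print(beta_spam)
print(list(restored.columns))
```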