## Text classification with bag of words
Outline:
1. Download and explore the data
2. Apply text preprocessing techniques
3. Implement the bag of words model
4. Train ML models for text classification
5. Make predictions and submit to Kaggle

## Download and explore the data

1. Download the dataset from Kaggle
2. Explore the data using pandas
3. Create a small working sample

### Download the data

In [1]:
import pandas as pd

In [2]:
raw_df=pd.read_csv('quora_train.csv')

In [3]:
raw_df

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0
...,...,...,...
1306117,ffffcc4e2331aaf1e41e,What other technical skills do you need as a c...,0
1306118,ffffd431801e5a2f4861,Does MS in ECE have good job prospects in USA ...,0
1306119,ffffd48fb36b63db010c,Is foam insulation toxic?,0
1306120,ffffec519fa37cf60c78,How can one start a research project based on ...,0


In [4]:
sincere_df=raw_df[raw_df.target==0]

In [5]:
sincere_df.question_text.values[:10]

array(['How did Quebec nationalists see their province as a nation in the 1960s?',
       'Do you have an adopted dog, how would you encourage people to adopt and not shop?',
       'Why does velocity affect time? Does velocity affect space geometry?',
       'How did Otto von Guericke used the Magdeburg hemispheres?',
       'Can I convert montra helicon D to a mountain bike by just changing the tyres?',
       'Is Gaza slowly becoming Auschwitz, Dachau or Treblinka for Palestinians?',
       'Why does Quora automatically ban conservative opinions when reported, but does not do the same for liberal views?',
       'Is it crazy if I wash or wipe my groceries off? Germs are everywhere.',
       'Is there such a thing as dressing moderately, and if so, how is that different than dressing modestly?',
       'Is it just me or have you ever been in this phase wherein you became ignorant to the people you once loved, completely disregarding their feelings/lives so you get to have something g

In [6]:
insincere_df=raw_df[raw_df.target==1]

In [7]:
insincere_df.question_text.values[:10]

array(['Has the United States become the largest dictatorship in the world?',
       'Which babies are more sweeter to their parents? Dark skin babies or light skin babies?',
       "If blacks support school choice and mandatory sentencing for criminals why don't they vote Republican?",
       'I am gay boy and I love my cousin (boy). He is sexy, but I dont know what to do. He is hot, and I want to see his di**. What should I do?',
       'Which races have the smallest penis?',
       'Why do females find penises ugly?',
       'How do I marry an American woman for a Green Card? How much do they charge?',
       "Why do Europeans say they're the superior race, when in fact it took them over 2,000 years until mid 19th century to surpass China's largest economy?",
       'Did Julius Caesar bring a tyrannosaurus rex on his campaigns to frighten the Celts into submission?',
       "In what manner has Republican backing of 'states rights' been hypocritical and what ways have they actually r

In [8]:
raw_df.target.value_counts(normalize=True)

0    0.93813
1    0.06187
Name: target, dtype: float64

In [9]:
test_df=pd.read_csv('test.csv')

In [10]:
test_df

Unnamed: 0,qid,question_text
0,0000163e3ea7c7a74cd7,Why do so many women become so rude and arroga...
1,00002bd4fb5d505b9161,When should I apply for RV college of engineer...
2,00007756b4a147d2b0b3,What is it really like to be a nurse practitio...
3,000086e4b7e1c7146103,Who are entrepreneurs?
4,0000c4c3fbe8785a3090,Is education really making good people nowadays?
...,...,...
375801,ffff7fa746bd6d6197a9,How many countries listed in gold import in in...
375802,ffffa1be31c43046ab6b,Is there an alternative to dresses on formal p...
375803,ffffae173b6ca6bfa563,Where I can find best friendship quotes in Tel...
375804,ffffb1f7f1a008620287,What are the causes of refraction of light?


In [11]:
sub_df=pd.read_csv('sample_submission.csv')

In [12]:
sub_df.prediction.value_counts()

0    375806
Name: prediction, dtype: int64

In [13]:
sample_size=100_000
sample_df=raw_df.sample(sample_size,random_state=42)

In [14]:
sample_df

Unnamed: 0,qid,question_text,target
443046,56d324bb1e2c29f43b12,What is the most effective classroom managemen...,0
947549,b9ad893dc78c577f8a63,Can I study abroad after 10th class from Bangl...,0
523769,6689ebaeeb65b209a412,How can I make friends as a college junior?,0
949821,ba1e2c4a0fef09671516,How do I download free APK Minecraft: Pocket E...,0
1030397,c9ea2b69bf0d74626f46,"Like Kuvera, is ""Groww"" also a free online inv...",0
...,...,...,...
998930,c3c03a307a29c69971b4,How do I research list of reliable charcoal im...,0
66641,0d119aba95ee6684f506,"What are petroleum products, and what is petro...",0
90024,11a46cd148a104b271cf,What are some services that will let you quick...,0
130113,1973e6e2111a0c93193a,What credit card processors do online marketpl...,0


## Apply text preprocessing techniques

Outline
1. Understand the bag of words model
2. Tokenization
3. Stop word removal
4. Stemming
5. Lemmatization

Bag of words intuition
1. Create a list of all the words that occur across all the text documents (the vocabulary)
2. Convert each document into a vector holding the count of each vocabulary word

Limitations
1. There may be too many words in the dataset
2. Some words occur too frequently to be informative
3. Some words may occur very rarely, or only once
4. A single word may have many forms (go, going, gone)
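
The two steps of the intuition above can be sketched in plain Python. This is a minimal illustration on two made-up sentences (`docs`, `vocabulary` and `to_vector` are names chosen here, not part of the notebook), using whitespace splitting instead of a real tokenizer.

```python
from collections import Counter

# Two made-up documents, used only for illustration.
docs = ["the cat sat", "the cat saw the dog"]

# Step 1: the vocabulary is the sorted list of distinct words across all documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Step 2: each document becomes a vector of counts, one slot per vocabulary word.
def to_vector(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in docs]

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Even in this toy case the very common word "the" gets the largest count, which is the second limitation listed above.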

### Tokenization
Splitting a document into words and separators (punctuation)

In [15]:
q0=sincere_df.question_text.values[1]

In [16]:
q0

'Do you have an adopted dog, how would you encourage people to adopt and not shop?'

In [17]:
q1=insincere_df.question_text.values[2]

In [18]:
q1

"If blacks support school choice and mandatory sentencing for criminals why don't they vote Republican?"

In [19]:
import nltk

In [20]:
from nltk.tokenize import word_tokenize

In [21]:
word_tokenize(q0)

['Do',
 'you',
 'have',
 'an',
 'adopted',
 'dog',
 ',',
 'how',
 'would',
 'you',
 'encourage',
 'people',
 'to',
 'adopt',
 'and',
 'not',
 'shop',
 '?']

In [22]:
q0_toke=word_tokenize(q0)
q1_toke=word_tokenize(q1)

### Stop word removal
Removing commonly occurring words

In [23]:
from nltk.corpus import stopwords

In [24]:
english_stopwords=stopwords.words('english')

In [25]:
", ".join(english_stopwords)

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [26]:
def removal_stopwords(tokens):
    return [word for word in tokens if word.lower() not in english_stopwords]

In [27]:
q0_stp=removal_stopwords(q0_toke)

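### Stemming
Reducing a word to its root form by stripping suffixes:
- go, going -> go
- bird, birds -> bird

Irregular forms such as "gone" are generally not reduced by suffix stripping; handling them is one motivation for lemmatization.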


In [28]:
from nltk.stem.snowball import SnowballStemmer

In [29]:
stemmer=SnowballStemmer(language='english')

In [30]:
stemmer.stem('going')

'go'

In [31]:
stemmer.stem('supposedingly')

'supposed'

In [32]:
q0_stm=[stemmer.stem(word) for word in q0_stp]

In [33]:
q0_stm

['adopt', 'dog', ',', 'would', 'encourag', 'peopl', 'adopt', 'shop', '?']

### Lemmatization
- It addresses a drawback of stemming: stems such as 'encourag' and 'peopl' are not real words
- It maps each word to a meaningful dictionary form (its lemma), e.g. "loves" -> love, "loving" -> love, "loved" -> love
- It takes more time than stemming
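
The notebook does not run a lemmatizer, so the following is only a toy sketch of the difference. `toy_stem` and `toy_lemmatize` are hypothetical helpers written for this example: the first strips suffixes blindly, the second looks words up in a tiny hand-written table. In practice a dictionary-backed tool such as NLTK's `WordNetLemmatizer` would take the place of the lookup table.

```python
# Toy lookup table standing in for a real dictionary of lemmas (assumption:
# a real lemmatizer also uses part-of-speech information).
TOY_LEMMAS = {
    "loving": "love",
    "loves": "love",
    "loved": "love",
    "mice": "mouse",
    "better": "good",
}

def toy_stem(word):
    """Crude suffix stripping, in the spirit of a stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned lowercased and unchanged."""
    return TOY_LEMMAS.get(word.lower(), word.lower())

words = ["loving", "loves", "mice", "better"]
comparison = [(w, toy_stem(w), toy_lemmatize(w)) for w in words]

for word, stem, lemma in comparison:
    print(f"{word:>7}  stem={stem:<7} lemma={lemma}")
```

The stem "lov" is not a word, while the lemma "love" is, and only the lookup handles irregular forms such as "mice". The lookup is also the reason lemmatization tends to be slower than stemming.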

## Implement bag of words
Outline
- Create a vocabulary using CountVectorizer
- Transform text to vectors using CountVectorizer
- Configure text preprocessing in CountVectorizer

In [34]:
small_df=sample_df[:5]

In [35]:
small_df.question_text.values

array(['What is the most effective classroom management skill/technique to create a good learning environment?',
       'Can I study abroad after 10th class from Bangladesh?',
       'How can I make friends as a college junior?',
       'How do I download free APK Minecraft: Pocket Edition for iOS (iPhone)?',
       'Like Kuvera, is "Groww" also a free online investment platform where I can invest in direct mutual funds?'],
      dtype=object)

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

In [37]:
small_vect=CountVectorizer()

In [38]:
small_vect.fit(small_df.question_text)

CountVectorizer()

In [39]:
small_vect.get_feature_names_out()

array(['10th', 'abroad', 'after', 'also', 'apk', 'as', 'bangladesh',
       'can', 'class', 'classroom', 'college', 'create', 'direct', 'do',
       'download', 'edition', 'effective', 'environment', 'for', 'free',
       'friends', 'from', 'funds', 'good', 'groww', 'how', 'in', 'invest',
       'investment', 'ios', 'iphone', 'is', 'junior', 'kuvera',
       'learning', 'like', 'make', 'management', 'minecraft', 'most',
       'mutual', 'online', 'platform', 'pocket', 'skill', 'study',
       'technique', 'the', 'to', 'what', 'where'], dtype=object)

### Transform documents into vectors

In [40]:
vectors=small_vect.transform(small_df.question_text)

In [41]:
vectors

<5x51 sparse matrix of type '<class 'numpy.int64'>'
	with 56 stored elements in Compressed Sparse Row format>

In [42]:
vectors.shape

(5, 51)

In [43]:
vectors.toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
        1, 0, 1, 1, 1, 1, 0],
       [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0,
        0, 0, 0, 0, 0, 0, 1]], dtype=int64)

### Configure CountVectorizer

In [44]:
def tokenize(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]

In [45]:
tokenize("what is really happening ( here )?")

['what', 'is', 'realli', 'happen', '(', 'here', ')', '?']

In [46]:
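vectorizer=CountVectorizer(lowercase=True,
                           tokenizer=tokenize,            # tokenize with NLTK, then stem each token
                           stop_words=english_stopwords,  # filtering is applied to the stemmed tokens
                           max_features=1000)             # keep only the 1000 most frequent tokens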

In [47]:
%%time
vectorizer.fit(sample_df.question_text)



CPU times: total: 34.5 s
Wall time: 34.8 s


CountVectorizer(max_features=1000,
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                tokenizer=<function tokenize at 0x000002423C4C50D0>)

In [48]:
len(vectorizer.vocabulary_)

1000

In [49]:
vectorizer.get_feature_names_out()[:100]

array(['!', '$', '%', '&', "'", "''", "'m", "'s", '(', ')', ',', '-', '.',
       '1', '10', '100', '12', '12th', '15', '2', '20', '2017', '2018',
       '3', '4', '5', '6', '7', '8', ':', '?', '[', ']', '``', 'abl',
       'abroad', 'abus', 'accept', 'access', 'accomplish', 'accord',
       'account', 'achiev', 'act', 'action', 'activ', 'actor', 'actual',
       'ad', 'add', 'address', 'admiss', 'adult', 'advanc', 'advantag',
       'advic', 'affect', 'africa', 'african', 'age', 'agre', 'air',
       'allow', 'almost', 'alon', 'alreadi', 'also', 'altern', 'alway',
       'amazon', 'america', 'american', 'amount', 'analysi', 'android',
       'ani', 'anim', 'anoth', 'answer', 'anxieti', 'anyon', 'anyth',
       'apart', 'app', 'appear', 'appl', 'appli', 'applic', 'approach',
       'arab', 'area', 'armi', 'around', 'art', 'asian', 'ask', 'associ',
       'atheist', 'attack', 'attend'], dtype=object)

In [50]:
%%time
inputs=vectorizer.transform(sample_df.question_text)

CPU times: total: 42.7 s
Wall time: 43.5 s


In [51]:
inputs

<100000x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 548298 stored elements in Compressed Sparse Row format>

In [52]:
inputs.shape

(100000, 1000)

In [53]:
%%time
test_inputs=vectorizer.transform(test_df.question_text)

CPU times: total: 2min 13s
Wall time: 2min 14s


## Train ML models

### ML models for text classification
Outline 
- Create a training and validation set
- Train a logistic regression model
- Make predictions on the training, validation and test sets

#### Splitting into training and validation

In [54]:
from sklearn.model_selection import train_test_split

In [55]:
train_inputs,val_inputs,train_targets,val_targets=train_test_split(inputs,sample_df.target,test_size=0.3,random_state=42)

In [56]:
train_inputs.shape

(70000, 1000)

In [57]:
test_inputs.shape

(375806, 1000)

### Train a logistic regression model

In [58]:
from sklearn.linear_model import LogisticRegression

In [59]:
model=LogisticRegression(max_iter=1000,solver='sag')

In [60]:
model.fit(train_inputs,train_targets)



LogisticRegression(max_iter=1000, solver='sag')

In [61]:
train_preds=model.predict(train_inputs)

In [62]:
pd.Series(train_preds).value_counts()

0    67957
1     2043
dtype: int64

In [63]:
pd.Series(train_targets).value_counts()

0    65784
1     4216
Name: target, dtype: int64

In [64]:
from sklearn.metrics import accuracy_score

In [65]:
accuracy_score(train_targets,train_preds)

0.9504428571428571

In [66]:
import numpy as np

In [67]:
accuracy_score(train_targets,np.zeros(len(train_targets)))

0.9397714285714286

In [68]:
from sklearn.metrics import f1_score

In [69]:
f1_score(train_targets,train_preds)

0.4457581083240134

In [70]:
f1_score(train_targets,np.zeros(len(train_targets)))

0.0
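
The all-zeros baseline above scores about 94% accuracy but an F1 of 0, because it never finds a single insincere question. The sketch below reproduces that effect with made-up labels (94 negatives and 6 positives, chosen to resemble the class balance) and computes F1 by hand; `f1_from_counts` is a helper written for this example, not a scikit-learn function.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 6% positives, and a "model" that always predicts 0.
y_true = [0] * 94 + [1] * 6
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
baseline_f1 = f1_from_counts(tp, fp, fn)

# A model that finds 3 of the 6 positives with 1 false alarm.
better_f1 = f1_from_counts(tp=3, fp=1, fn=3)

print(accuracy, baseline_f1, better_f1)
```

Accuracy rewards predicting the majority class, while F1 depends only on how the positive class is handled, which is why F1 is the more informative number for this dataset.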

In [71]:
val_preds=model.predict(val_inputs)

In [72]:
accuracy_score(val_targets,val_preds)

0.9467

In [73]:
f1_score(val_targets,val_preds)

0.40843507214206437

In [74]:
sincere_df.question_text.values[:10]

array(['How did Quebec nationalists see their province as a nation in the 1960s?',
       'Do you have an adopted dog, how would you encourage people to adopt and not shop?',
       'Why does velocity affect time? Does velocity affect space geometry?',
       'How did Otto von Guericke used the Magdeburg hemispheres?',
       'Can I convert montra helicon D to a mountain bike by just changing the tyres?',
       'Is Gaza slowly becoming Auschwitz, Dachau or Treblinka for Palestinians?',
       'Why does Quora automatically ban conservative opinions when reported, but does not do the same for liberal views?',
       'Is it crazy if I wash or wipe my groceries off? Germs are everywhere.',
       'Is there such a thing as dressing moderately, and if so, how is that different than dressing modestly?',
       'Is it just me or have you ever been in this phase wherein you became ignorant to the people you once loved, completely disregarding their feelings/lives so you get to have something g

In [75]:
sincere_df.target.values[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

In [76]:
model.predict(vectorizer.transform(sincere_df.question_text.values[:10]))

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

In [77]:
insincere_df.question_text.values[:10]

array(['Has the United States become the largest dictatorship in the world?',
       'Which babies are more sweeter to their parents? Dark skin babies or light skin babies?',
       "If blacks support school choice and mandatory sentencing for criminals why don't they vote Republican?",
       'I am gay boy and I love my cousin (boy). He is sexy, but I dont know what to do. He is hot, and I want to see his di**. What should I do?',
       'Which races have the smallest penis?',
       'Why do females find penises ugly?',
       'How do I marry an American woman for a Green Card? How much do they charge?',
       "Why do Europeans say they're the superior race, when in fact it took them over 2,000 years until mid 19th century to surpass China's largest economy?",
       'Did Julius Caesar bring a tyrannosaurus rex on his campaigns to frighten the Celts into submission?',
       "In what manner has Republican backing of 'states rights' been hypocritical and what ways have they actually r

In [78]:
insincere_df.target.values[:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

In [79]:
model.predict(vectorizer.transform(insincere_df.question_text.values[:10]))

array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0], dtype=int64)

## Make predictions and submit to Kaggle

In [80]:
test_df

Unnamed: 0,qid,question_text
0,0000163e3ea7c7a74cd7,Why do so many women become so rude and arroga...
1,00002bd4fb5d505b9161,When should I apply for RV college of engineer...
2,00007756b4a147d2b0b3,What is it really like to be a nurse practitio...
3,000086e4b7e1c7146103,Who are entrepreneurs?
4,0000c4c3fbe8785a3090,Is education really making good people nowadays?
...,...,...
375801,ffff7fa746bd6d6197a9,How many countries listed in gold import in in...
375802,ffffa1be31c43046ab6b,Is there an alternative to dresses on formal p...
375803,ffffae173b6ca6bfa563,Where I can find best friendship quotes in Tel...
375804,ffffb1f7f1a008620287,What are the causes of refraction of light?


In [81]:
test_inputs.shape

(375806, 1000)

In [82]:
test_preds=model.predict(test_inputs)

In [83]:
sub_df

Unnamed: 0,qid,prediction
0,0000163e3ea7c7a74cd7,0
1,00002bd4fb5d505b9161,0
2,00007756b4a147d2b0b3,0
3,000086e4b7e1c7146103,0
4,0000c4c3fbe8785a3090,0
...,...,...
375801,ffff7fa746bd6d6197a9,0
375802,ffffa1be31c43046ab6b,0
375803,ffffae173b6ca6bfa563,0
375804,ffffb1f7f1a008620287,0


In [84]:
sub_df.prediction=test_preds

In [86]:
sub_df.to_csv('submission.csv',index=False)
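
Assigning predictions to the submission frame relies on its rows being in the same order as the test set. A quick sanity check before writing the file might look like the sketch below; the ids and predictions are made-up stand-ins for `test_df.qid`, `sub_df` and `test_preds`.

```python
import pandas as pd

# Hypothetical stand-ins for test_df.qid, sub_df and test_preds.
test_ids = pd.Series(["a1", "b2", "c3"], name="qid")
sub = pd.DataFrame({"qid": ["a1", "b2", "c3"], "prediction": 0})
preds = [0, 1, 0]

# One prediction per row, and rows in the same order as the test set.
assert len(preds) == len(sub)
assert (sub["qid"].values == test_ids.values).all()

sub["prediction"] = preds
csv_text = sub.to_csv(index=False)  # index=False keeps the row index out of the file
print(csv_text)
```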