# Text Classification with Bag of Words

Outline
- Download and explore the data
- Apply text processing techniques
- Implement bag of words model
- Train ML models for text classification
- Make predictions and publish to Kaggle


## Download and explore the data

In [None]:
!pip3 install kaggle

In [63]:
import os


In [64]:
os.environ['KAGGLE_CONFIG_DIR'] = '.'

In [None]:
!kaggle competitions download -c quora-insincere-questions-classification -f train.csv -p data

In [66]:
train_fname = 'data/train.csv.zip'
test_fname = 'data/test.csv.zip'
sample_fname = 'data/sample_submission.csv.zip'

In [67]:
import pandas as pd


In [None]:
raw_df = pd.read_csv(train_fname)
raw_df
# Insincere questions have target 1; sincere questions have target 0.

In [None]:
sincere_df = raw_df[raw_df['target'] == 0]
insincere_df = raw_df[raw_df['target'] == 1]
insincere_df['question_text'].values[:10]

In [None]:
raw_df['target'].value_counts(normalize=True)

In [None]:
raw_df['target'].value_counts(normalize=True).plot(kind='bar')

In [None]:
test_df = pd.read_csv(test_fname)
test_df

In [None]:
sub_df = pd.read_csv(sample_fname)
sub_df

In [74]:
SAMPLE_SIZE = 100_000

In [None]:
# Create a working sample of the data
sample_df = raw_df.sample(SAMPLE_SIZE, random_state=42)
sample_df

## Apply text processing techniques
Outline:
1. Understand the Bag of Words model
2. Tokenisation
3. Stop word removal
4. Stemming

#### Bag of Words Intuition
1. Create a list of all the words across all the text documents
2. Convert each document into a vector containing the counts of each word

Limitations:
1. There may be too many words
2. Some words may occur too frequently
3. Some words may occur very rarely or only once
4. A single word can have many forms (e.g. go, gone, going or bird, birds)
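
The two steps above can be sketched with the standard library alone. This is a minimal illustration, not the vectorizer used later in the notebook; the documents and the helper name `to_vector` are hypothetical.

```python
from collections import Counter

# Hypothetical toy documents, used only to illustrate the two steps.
docs = [
    "the bird sings",
    "the birds sing and the bird flies",
]

# Step 1: build the vocabulary, i.e. every distinct word across all documents.
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Step 2: represent each document as a vector of word counts over that vocabulary.
def to_vector(tokens, vocabulary):
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)  # ['and', 'bird', 'birds', 'flies', 'sing', 'sings', 'the']
print(vectors)     # [[0, 1, 0, 0, 0, 1, 1], [1, 1, 1, 1, 1, 0, 2]]
```

Even this tiny case shows two of the limitations: "bird" and "birds" get separate columns, and "the" has the largest count in the second document while saying little about its content.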


#### Tokenisation
Splitting a document into words and separators (such as punctuation)

In [None]:
q0 = sincere_df['question_text'].values[1]
q0

In [None]:
q1 = insincere_df['question_text'].values[0]
q1

In [None]:
!pip3 install nltk


In [None]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize


In [None]:
q0_tok = word_tokenize(q0)
q0_tok

In [None]:
q1_tok = word_tokenize(q1)
q1_tok
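
As a rough, dependency-free illustration of what tokenisation does, the sketch below splits text into words and punctuation with a regular expression. It is a simplification: `simple_tokenize` is a hypothetical helper, and NLTK's `word_tokenize` applies additional rules (for example when splitting contractions), so its output can differ.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits or underscores (a "word");
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace, i.e. a punctuation mark or symbol.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Is this a test? Yes, it is.")
print(tokens)
# ['Is', 'this', 'a', 'test', '?', 'Yes', ',', 'it', 'is', '.']
```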

#### Stop Word Removal

Removing commonly occurring words (such as "the", "is" and "of") that carry little signal for classification


In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

In [None]:
english_stopwords = stopwords.words('english')
english_stopwords

In [120]:
def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in english_stopwords]

In [None]:
q0_stop = remove_stopwords(q0_tok)
q0_stop


In [None]:
q1_stop = remove_stopwords(q1_tok)
q1_stop
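
The same filtering idea can be sketched without NLTK. The stop-word set below is a hypothetical, deliberately tiny stand-in for NLTK's English list, and `remove_toy_stopwords` is an illustrative name. A `set` is used because membership tests on a set take constant time on average, whereas `stopwords.words('english')` returns a list.

```python
# Hypothetical, deliberately tiny stop-word set; NLTK's English list is much longer.
toy_stopwords = {"is", "a", "the", "of", "to", "in"}

def remove_toy_stopwords(tokens, stop_set=toy_stopwords):
    # Compare in lowercase so that "The" and "the" are treated alike.
    return [tok for tok in tokens if tok.lower() not in stop_set]

filtered = remove_toy_stopwords(["What", "is", "the", "capital", "of", "France", "?"])
print(filtered)
# ['What', 'capital', 'France', '?']
```

Punctuation tokens survive this step, since they are not in the stop-word set.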

#### Stemming
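Reducing words to a common root by stripping suffixes, e.g. going -> go, birds -> bird. Irregular forms such as gone are typically left unchanged by a suffix-stripping stemmer.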

In [126]:
from nltk.stem import SnowballStemmer

In [None]:
stemmer = SnowballStemmer(language='english')

In [None]:
stemmer.stem('going')

In [None]:
q0_stem = [stemmer.stem(word) for word in q0_stop]
q0_stem

In [None]:
q1_stem = [stemmer.stem(word) for word in q1_stop]
q1_stem

In [135]:
# Lemmatization is an alternative to stemming that maps each word to a dictionary form,
# so its output is more readable. It is used less often here because it relies on
# dictionary lookups, which can make it slower.

## Implement bag of words model

Outline:
- Create a vocabulary using Count Vectorizer
- Transform text to Vectors using Count Vectorizer
- Configure Text Preprocessing in Count Vectorizer


#### Create a Vocabulary

In [None]:
small_df = sample_df[:5]
small_df['question_text']

In [137]:
from sklearn.feature_extraction.text import CountVectorizer


In [138]:
small_vect = CountVectorizer()

In [None]:
small_vect.fit(small_df['question_text'])

In [None]:
small_vect.vocabulary_

In [None]:
small_vect.get_feature_names_out()
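
With default settings, `CountVectorizer` lowercases the text and, per the scikit-learn documentation, tokenizes with the pattern `(?u)\b\w\w+\b`, which drops punctuation and single-character tokens. The sketch below approximates that behaviour with the standard library only; `default_like_tokens` is a hypothetical helper, not part of scikit-learn, and the example sentence is made up.

```python
import re

# Same regular expression as CountVectorizer's documented default `token_pattern`.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_like_tokens(text):
    # Lowercase first (mirroring lowercase=True), then keep only runs of
    # two or more word characters.
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

example_tokens = default_like_tokens("Is a U.S. visa required? I don't know.")
print(example_tokens)
# ['is', 'visa', 'required', 'don', 'know']
```

This is why words such as "a" or "I" do not appear in a vocabulary built with the defaults.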

#### Transform documents to vector

In [None]:
vectors = small_vect.transform(small_df['question_text']).toarray()
vectors

#### Configure Count Vectorizer Parameters

In [145]:
def tokenize(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]

In [153]:
vectorizer = CountVectorizer(lowercase=True, tokenizer=tokenize, stop_words=english_stopwords, max_features=1000)

In [None]:
%%time
vectorizer.fit(sample_df['question_text'])

In [None]:
len(vectorizer.vocabulary_)

In [None]:
vectorizer.get_feature_names_out()[:100]
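
The `max_features=1000` argument limits the vocabulary to the terms with the highest total frequency across the corpus. A minimal sketch of that selection, assuming a hypothetical pre-tokenized toy corpus and an illustrative helper `top_k_vocabulary` (how scikit-learn breaks ties between equally frequent terms may differ):

```python
from collections import Counter

# Hypothetical tokenized corpus, used only for illustration.
tokenized_docs = [
    ["why", "is", "the", "sky", "blue"],
    ["why", "is", "the", "sea", "blue"],
    ["is", "the", "moon", "made", "of", "cheese"],
]

def top_k_vocabulary(tokenized_docs, k):
    # Count every term across the whole corpus, then keep the k most frequent.
    term_counts = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = [term for term, _ in term_counts.most_common(k)]
    # Assign column indices in alphabetical order of the kept terms.
    return {term: idx for idx, term in enumerate(sorted(kept))}

toy_vocab = top_k_vocabulary(tokenized_docs, k=4)
print(toy_vocab)
# {'blue': 0, 'is': 1, 'the': 2, 'why': 3}
```

Terms outside the kept vocabulary are simply ignored when a document is transformed, so rare words contribute nothing to the resulting vectors.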

In [158]:
inputs = vectorizer.transform(sample_df['question_text'])

In [None]:
inputs.shape

In [160]:
test_inputs = vectorizer.transform(test_df['question_text'])

## Train ML models for text classification

- Create a training and validation set
- Train a logistic regression model
- Make predictions on training, validation and test data

#### Create Training and Validation Set

In [161]:
from sklearn.model_selection import train_test_split

In [162]:
train_inputs, val_inputs, train_targets, val_targets = train_test_split(inputs, sample_df['target'], test_size=0.3, random_state=42)

In [163]:
train_inputs.shape, val_inputs.shape

((70000, 1000), (30000, 1000))

#### Train Logistic Regression Model

In [164]:
from sklearn.linear_model import LogisticRegression

In [165]:
model = LogisticRegression()

In [166]:
model.fit(train_inputs, train_targets)

#### Make Predictions

In [167]:
train_preds = model.predict(train_inputs)
train_preds

array([0, 0, 0, ..., 0, 0, 0])

In [168]:
pd.Series(train_preds).value_counts()

0    68235
1     1765
Name: count, dtype: int64

In [169]:
pd.Series(train_targets).value_counts()

target
0    65784
1     4216
Name: count, dtype: int64

In [170]:
from sklearn.metrics import accuracy_score

In [171]:
accuracy_score(train_targets, train_preds)

0.9484142857142858

In [172]:
from sklearn.metrics import f1_score

In [173]:
f1_score(train_targets, train_preds)

np.float64(0.3962548068884802)
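
The gap between the training accuracy (about 0.948) and the F1 score (about 0.396) reported above reflects class imbalance: only 4,216 of the 70,000 training rows are insincere. As a rough worked example, the confusion-matrix counts below are not printed by the notebook; they are inferred from the reported prediction counts, target counts and accuracy, so treat them as a reconstruction.

```python
# Counts taken from the training-set outputs above.
n_samples = 70_000
actual_pos = 4_216       # rows with target == 1
predicted_pos = 1_765    # rows predicted as 1

# Confusion-matrix cells inferred from those counts and the reported
# accuracy; they are derived here, not printed by the notebook.
tp = 1_185
fp = predicted_pos - tp          # 580
fn = actual_pos - tp             # 3031
tn = n_samples - tp - fp - fn    # 65204

accuracy = (tp + tn) / n_samples
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Baseline that labels every question as sincere (always predicts 0).
baseline_accuracy = (n_samples - actual_pos) / n_samples

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"always-0 baseline accuracy={baseline_accuracy:.4f}")
```

Under this reconstruction the model finds only about 28% of the insincere training questions, and a baseline that always predicts 0 would already reach roughly 94% accuracy. That is why F1 is the more informative metric here.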

In [174]:
val_preds = model.predict(val_inputs)
val_preds

array([0, 0, 0, ..., 0, 0, 0])

In [175]:
accuracy_score(val_targets, val_preds)

0.9459333333333333

In [176]:
f1_score(val_targets, val_preds)

np.float64(0.3732612055641422)

In [177]:
sincere_df['question_text'].values[:10]

array(['How did Quebec nationalists see their province as a nation in the 1960s?',
       'Do you have an adopted dog, how would you encourage people to adopt and not shop?',
       'Why does velocity affect time? Does velocity affect space geometry?',
       'How did Otto von Guericke used the Magdeburg hemispheres?',
       'Can I convert montra helicon D to a mountain bike by just changing the tyres?',
       'Is Gaza slowly becoming Auschwitz, Dachau or Treblinka for Palestinians?',
       'Why does Quora automatically ban conservative opinions when reported, but does not do the same for liberal views?',
       'Is it crazy if I wash or wipe my groceries off? Germs are everywhere.',
       'Is there such a thing as dressing moderately, and if so, how is that different than dressing modestly?',
       'Is it just me or have you ever been in this phase wherein you became ignorant to the people you once loved, completely disregarding their feelings/lives so you get to have something g

In [179]:
sincere_df['target'].values[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [178]:
# Let's test our model on some of the sincere questions above
model.predict(vectorizer.transform(sincere_df['question_text'].values[:10]))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [180]:
model.predict(vectorizer.transform(insincere_df['question_text'].values[:10]))

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

## Make predictions and publish to Kaggle
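
A minimal, self-contained sketch of this last step is shown below. It uses hypothetical toy data and names (`toy_train`, `toy_test`, `toy_sub`) rather than the notebook's variables, and it assumes the submission file needs one row per test question with `qid` and `prediction` columns, as in the sample submission loaded earlier.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the notebook's training sample and test set.
toy_train = pd.DataFrame({
    "question_text": ["how do plants grow", "why are they all so dumb",
                      "what is a good book", "why are people so dumb"],
    "target": [0, 1, 0, 1],
})
toy_test = pd.DataFrame({
    "qid": ["a1", "b2"],
    "question_text": ["how do birds fly", "why are they so dumb"],
})

# Fit the vocabulary on training text only, then reuse it for the test text.
toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_train["question_text"]),
              toy_train["target"])
toy_test_preds = toy_model.predict(toy_vectorizer.transform(toy_test["question_text"]))

# Assumed submission layout: `qid` plus a 0/1 `prediction` column.
toy_sub = pd.DataFrame({"qid": toy_test["qid"], "prediction": toy_test_preds})
toy_sub.to_csv("submission.csv", index=False)
print(toy_sub)
```

In the notebook itself, the analogous steps would presumably be to assign `model.predict(test_inputs)` to the `prediction` column of `sub_df` and write it out with `to_csv(..., index=False)` before uploading the file to the competition.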