# Text Classification with Bag of Words

Outline:
- Download and explore the data
- Apply text processing techniques
- Implement bag of words model
- Train ML models for text classification
- Make predictions and publish to Kaggle


## Download and explore the data

In [None]:
!pip3 install kaggle

In [63]:
import os


In [64]:
os.environ['KAGGLE_CONFIG_DIR'] = '.'

In [None]:
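!kaggle competitions download -c quora-insincere-questions-classification -f train.csv -p data
!kaggle competitions download -c quora-insincere-questions-classification -f test.csv -p data
!kaggle competitions download -c quora-insincere-questions-classification -f sample_submission.csv -p data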

In [66]:
train_fname = 'data/train.csv.zip'
test_fname = 'data/test.csv.zip'
sample_fname = 'data/sample_submission.csv.zip'

In [67]:
import pandas as pd


In [None]:
raw_df = pd.read_csv(train_fname)
raw_df
# Insincere questions have target 1; sincere questions have target 0.

In [None]:
sincere_df = raw_df[raw_df['target'] == 0]
insincere_df = raw_df[raw_df['target'] == 1]
insincere_df['question_text'].values[:10]

In [None]:
raw_df['target'].value_counts(normalize=True)

In [None]:
raw_df['target'].value_counts(normalize=True).plot(kind='bar')

In [None]:
test_df = pd.read_csv(test_fname)
test_df

In [None]:
sub_df = pd.read_csv(sample_fname)
sub_df

In [74]:
SAMPLE_SIZE = 100_000

In [None]:
# Create a working sample of the data
sample_df = raw_df.sample(SAMPLE_SIZE, random_state=42)
sample_df

## Apply text processing techniques
Outline:
1. Understand the Bag of Words model
2. Tokenisation
3. Stop word removal
4. Stemming

#### Bag of Words Intuition
1. Create list of all the words across all the text documents
2. Convert each document into a vector containing the counts of each word

Limitations:
1. There may be too many words
2. Some words may occur too frequently
3. Some words may occur very rarely or only once
4. A single word can have many forms (e.g. go, gone, going or bird, birds)
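
The two steps above can be sketched in plain Python. This is only an illustration on two made-up sentences (they are not from the Quora data), and it splits on whitespace rather than using a proper tokenizer.

```python
from collections import Counter

# Hypothetical toy documents, not taken from the Quora dataset.
docs = [
    "the cat sat on the mat",
    "the dog sat",
]

# Step 1: build a sorted vocabulary of every distinct word across all documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Step 2: turn each document into a vector of counts, one slot per vocabulary word.
def to_vector(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in docs]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Even on this tiny example the very frequent word "the" dominates the counts, which is the second limitation listed above.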


#### Tokenisation
Splitting the document into words and separators

In [None]:
q0 = sincere_df['question_text'].values[1]
q0

In [None]:
q1 = raw_df[raw_df['target'] == 1].question_text.values[0]
q1

In [None]:
!pip3 install nltk


In [None]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize


In [None]:
q0_tok = word_tokenize(q0)
q0_tok

In [None]:
q1_tok = word_tokenize(q1)
q1_tok
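
To get a rough feel for what a tokenizer does, the sketch below splits a made-up sentence with a regular expression. It is not how `word_tokenize` works internally; NLTK's tokenizer handles contractions, abbreviations and other cases that this simple pattern does not. The function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    # Match either a run of word characters or a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Is this a question, or not?")
print(tokens)  # ['Is', 'this', 'a', 'question', ',', 'or', 'not', '?']
```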

#### Stop Word Removal

Removing commonly occurring words (e.g. the, is, of) that carry little meaning on their own


In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

In [None]:
english_stopwords = stopwords.words('english')
english_stopwords

In [120]:
def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in english_stopwords]

In [None]:
q0_stop = remove_stopwords(q0_tok)
q0_stop


In [None]:
q1_stop = remove_stopwords(q1_tok)
q1_stop
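
One possible refinement, shown here as a sketch: `english_stopwords` is a list, so every `not in` check scans it from the start. Converting the stop words to a set makes each lookup constant time on average, which can matter when filtering many questions. The small stop word set and the name `remove_stopwords_fast` below are hypothetical stand-ins so the example runs without NLTK.

```python
# Hypothetical mini stop word set; the notebook uses NLTK's full English list instead.
toy_stopwords = {"is", "a", "the", "of", "or", "this"}

def remove_stopwords_fast(tokens, stopword_set=toy_stopwords):
    # Set membership is a hash lookup, unlike the linear scan needed for a list.
    return [word for word in tokens if word.lower() not in stopword_set]

tokens = ["Is", "this", "a", "question", ",", "or", "not", "?"]
filtered = remove_stopwords_fast(tokens)
print(filtered)  # ['question', ',', 'not', '?']
```

Punctuation tokens survive this step because they are not in the stop word set; they would need to be filtered separately if they are unwanted.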

#### Stemming
Reducing words to a root form, e.g. go, going -> go and bird, birds -> bird. Irregular forms such as gone may be left unchanged.

In [126]:
from nltk.stem import SnowballStemmer

In [None]:
stemmer = SnowballStemmer(language='english')

In [None]:
stemmer.stem('going')

In [None]:
q0_stem = [stemmer.stem(word) for word in q0_stop]
q0_stem

In [None]:
q1_stem = [stemmer.stem(word) for word in q1_stop]
q1_stem
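
Stemmers such as Snowball work by applying suffix-stripping rules rather than looking words up in a dictionary. The toy function below is not the Snowball algorithm, only a minimal sketch of the idea with three hand-picked suffixes; `toy_stem` is a hypothetical name.

```python
def toy_stem(word):
    # Try a few common English suffixes in order and strip the first one that
    # matches, as long as at least two characters of the word remain.
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["going", "birds", "walked", "go"]]
print(stems)  # ['go', 'bird', 'walk', 'go']
```

A rule set this small breaks quickly (for example, "running" becomes "runn"), which is why a real stemmer carries many more rules and exceptions.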

In [135]:
# Lemmatization can be used instead of stemming. It returns meaningful dictionary words,
# but it relies on dictionary lookups and can be slower, so stemming is used here.

## Implement bag of words model

Outline:
- Create a vocabulary using CountVectorizer
- Transform text to vectors using CountVectorizer
- Configure text preprocessing in CountVectorizer


#### Create a Vocabulary

In [None]:
small_df = sample_df[:5]
small_df['question_text']

In [137]:
from sklearn.feature_extraction.text import CountVectorizer


In [138]:
small_vect = CountVectorizer()

In [None]:
small_vect.fit(small_df['question_text'])

In [None]:
small_vect.vocabulary_

In [None]:
small_vect.get_feature_names_out()

#### Transform documents to vectors

In [None]:
vectors = small_vect.transform(small_df['question_text']).toarray()
vectors

#### Configure CountVectorizer Parameters

In [145]:
def tokenize(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]

In [153]:
vectorizer = CountVectorizer(lowercase=True, tokenizer=tokenize, stop_words=english_stopwords, max_features=1000)

In [None]:
%%time
vectorizer.fit(sample_df['question_text'])

In [None]:
len(vectorizer.vocabulary_)

In [None]:
vectorizer.get_feature_names_out()[:100]

In [158]:
inputs = vectorizer.transform(sample_df['question_text'])

In [None]:
inputs.shape

In [160]:
test_inputs = vectorizer.transform(test_df['question_text'])

## Train ML models for text classification

## Make predictions and publish to Kaggle