# Bag of words/N-grams and Tf-idf models with simple logistic regression can get you to a score of 0.80

> ***In this notebook we will see how to build a bag-of-words/n-grams model with CountVectorizer and a TF-IDF model with TfidfVectorizer from the scikit-learn library.***

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score

In [None]:
train=pd.read_csv('../input/nlp-getting-started/train.csv')
test=pd.read_csv('../input/nlp-getting-started/test.csv')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
print(train.shape)
print(test.shape)

> ***There are null values in the keyword and location columns, but as you will see I haven't used those columns for model building because they don't seem to be necessary.***

In [None]:
print(train.info())
print(test.info())

In [None]:
train.target.value_counts()

> ***Loading necessary modules for cleaning text***

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
!pip install contractions
import contractions
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
!pip install pyspellchecker
from spellchecker import SpellChecker

>  * **Removing URLs (anything starting with http://, https:// or www.)**
>  * **Removing HTML tags**
>  * **Removing everything except letters and whitespace (digits, punctuation, emojis, etc.)**
>  * **Lemmatizing each word (stemming and spell checking can also be used if required)**
>  * **Expanding contractions with the contractions package**
>  * **Removing stop words, if there are any.**
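
To make the regex steps concrete, here is a minimal sketch that applies only the three regular-expression clean-ups to a made-up tweet. The helper name `strip_noise` and the sample string are illustrative assumptions, not part of the notebook, and lemmatization and stop-word removal are left out so the snippet runs without any NLTK downloads.

```python
import re

def strip_noise(text):
    """Apply the three regex clean-up steps to a single string."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'<.*?>', '', text)                   # drop HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)             # drop digits, punctuation, emojis
    return ' '.join(text.lower().split())               # lowercase and collapse whitespace

# Hypothetical sample tweet, written only for this illustration
sample = "Forest <b>fire</b> near La Ronge!! \U0001F525 http://t.co/abc123 #wildfire"
print(strip_noise(sample))  # forest fire near la ronge wildfire
```

Note that the hashtag text survives while the `#` itself is removed, which is usually what you want for a bag-of-words model.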

In [None]:
stop_words=set(nltk.corpus.stopwords.words('english'))  # set gives fast membership tests
#sc=SpellChecker()
#data=pd.concat([train,test])
wnl=WordNetLemmatizer()
stemmer=SnowballStemmer('english')

def clean_text(doc):
    # Remove URLs, then HTML tags
    doc=re.sub(r'https?://\S+|www\.\S+','',doc)
    doc=re.sub(r'<.*?>','',doc)
    # Keep only letters and whitespace; flags are passed by keyword because
    # the fourth positional argument of re.sub is count, not flags
    doc=re.sub(r'[^a-zA-Z\s]','',doc,flags=re.I|re.A)
    #doc=' '.join([stemmer.stem(w) for w in doc.lower().split()])
    doc=' '.join([wnl.lemmatize(w) for w in doc.lower().split()])
    #doc=' '.join([sc.correction(w) for w in doc.split()])
    doc=contractions.fix(doc)
    # Tokenize and drop stop words
    tokens=nltk.word_tokenize(doc)
    return ' '.join([token for token in tokens if token not in stop_words])

# Apply the same cleaning to both datasets; assigning the whole column
# avoids chained assignment such as train.text[i]=doc
train['text']=train['text'].apply(clean_text)
test['text']=test['text'].apply(clean_text)

In [None]:
train.head()

In [None]:
test.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1,1)) 

#    ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, 
#    and (2, 2) means only bigrams.

cv_matrix=cv.fit_transform(train.text).toarray()
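# get_feature_names_out() requires scikit-learn >= 1.0 (older releases used get_feature_names())
train_df=pd.DataFrame(cv_matrix,columns=cv.get_feature_names_out())
test_df=pd.DataFrame(cv.transform(test.text).toarray(),columns=cv.get_feature_names_out())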
train_df.head()
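
The `ngram_range` comment in the cell above can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not scikit-learn's implementation: the helper `ngrams` is a hypothetical name, and it splits on whitespace, whereas `CountVectorizer` uses its own tokenizer (which, by default, ignores single-character tokens).

```python
from collections import Counter

def ngrams(tokens, n_min, n_max):
    """Return every n-gram with n_min <= n <= n_max, joined by spaces."""
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

tokens = "forest fire near forest".split()

print(Counter(ngrams(tokens, 1, 1)))  # unigrams only
print(Counter(ngrams(tokens, 1, 2)))  # unigrams and bigrams
print(Counter(ngrams(tokens, 2, 2)))  # bigrams only
```

Each distinct n-gram becomes one column of the count matrix, so widening the range grows the vocabulary quickly.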

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer(ngram_range=(1,1),use_idf=True)
mat=tfidf.fit_transform(train.text).toarray()
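# get_feature_names_out() requires scikit-learn >= 1.0 (older releases used get_feature_names())
train_df=pd.DataFrame(mat,columns=tfidf.get_feature_names_out())
test_df=pd.DataFrame(tfidf.transform(test.text).toarray(),columns=tfidf.get_feature_names_out())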
train_df.head()
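
For intuition about what the TF-IDF values mean, here is a small hand computation. It is a sketch that assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), where idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is scaled to unit length. The toy documents and the helper name `tfidf_weights` are made up for illustration.

```python
import math
from collections import Counter

# Three tiny, already tokenized documents (hypothetical data)
docs = [["fire", "forest", "fire"], ["forest", "walk"], ["fire", "alarm"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tfidf_weights(doc):
    """TF-IDF weights of one document: raw count * smoothed idf, then L2-normalized."""
    counts = Counter(doc)
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf_weights(docs[1])
print(weights)
```

In the second document, "walk" occurs in only one document while "forest" occurs in two, so "walk" ends up with the larger weight even though both words appear once.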

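**You can try any one of the BOW, bag-of-n-grams or TF-IDF representations and use it for model building. Note that running both cells above leaves the TF-IDF features in train_df and test_df, because the second cell overwrites the first.**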

In [None]:
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(train_df,train.target)
# F1 on the training data itself (y_true first, then y_pred); this is not a validation score
print(f1_score(train.target,model.predict(train_df)))
pred=model.predict(test_df)

**I haven't split the dataset into training and validation sets; the model is trained directly on the whole training set, so the F1 score printed above is a training score.**
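
If you do want a validation estimate, one possible approach is a stratified hold-out split. The sketch below is not what this notebook does: the helper name `holdout_f1`, the 25% split, `max_iter=1000` and the toy texts are all assumptions for illustration. The vectorizer is fitted on the training part only, so the held-out tweets do not leak into the vocabulary or idf values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def holdout_f1(texts, labels, test_size=0.25, seed=42):
    """Fit TF-IDF + logistic regression on one part, report F1 on the held-out part."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed)
    vec = TfidfVectorizer(ngram_range=(1, 1))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_tr), y_tr)          # fit vectorizer on training part only
    val_pred = clf.predict(vec.transform(X_val))    # reuse the fitted vocabulary
    return f1_score(y_val, val_pred, zero_division=0)

# Hypothetical toy data standing in for the cleaned tweets
toy_texts = [
    "forest fire spreading fast", "earthquake hit city", "flood water rising street",
    "storm destroyed house", "wildfire evacuation ordered", "building collapsed earthquake",
    "love new song", "great day beach", "movie night friend",
    "coffee morning work", "happy birthday friend", "new phone great",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

print(holdout_f1(toy_texts, toy_labels))
# On the real data this would be called as: holdout_f1(train.text, train.target)
```

The number from twelve toy sentences means nothing in itself; the point is that the score is computed on rows the model never saw during fitting.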

In [None]:
pd.DataFrame({
    'id':test.id,
    'target':pred
}).to_csv('submission.csv',index=False)

**Do upvote if you find this helpful, and comment if you have any doubts.**

Here is my next notebook: [NLP disaster tweets-Glove,Word2Vec & BiLSTM](https://www.kaggle.com/urstrulysai/nlp-disaster-tweets-glove-word2vec-bilstm)