In a [previous notebook](https://www.kaggle.com/riyadmorshedshoeb/using-tfidf-model-comparison), I used only the text feature, vectorized with Tf-Idf, to classify the tweets.

In this notebook and its future versions, I intend to use all three features (keyword, location, and text) and try a different feature extraction method on them in each version. If a method improves performance, I will use it in my other notebook for the actual classification task and the competition submission.

# Versions

## Version 1
Tf-Idf on the text feature and Bag-of-Words (BOW) on the rest.

## Version 2
Tf-Idf with unigrams and bigrams on the text feature and BOW on the rest.

# Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Data

## Reading the data

In [None]:
train = pd.read_csv('../input/nlp-getting-started/train.csv')
test = pd.read_csv('../input/nlp-getting-started/test.csv')

## Handling the missing values

In [None]:
train.isna().sum()

In [None]:
test.isna().sum()

In [None]:
# Replace NaN with the placeholder string 'missing' so the vectorizers receive strings
train.fillna('missing', inplace=True)
test.fillna('missing', inplace=True)
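
As a small illustration of the step above, here is a sketch on a made-up two-row frame (the `toy` DataFrame and the `text_cols` list are hypothetical, not part of the competition data). It limits the fill to the string columns, which is one possible variation on filling the whole frame. Note that the placeholder `missing` later becomes an ordinary token in the Bag-of-Words vocabulary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the keyword/location/text columns
toy = pd.DataFrame({
    "keyword": ["fire", np.nan],
    "location": [np.nan, "Dhaka"],
    "text": ["Forest fire near town", "All good here"],
})

# Fill only the string columns that can contain NaN
text_cols = ["keyword", "location"]
toy[text_cols] = toy[text_cols].fillna("missing")

# Count the remaining NaN values per column
remaining_nans = toy.isna().sum()
print(remaining_nans)
```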

In [None]:
train.iloc[0]

### Splitting the data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train[['keyword', 'location', 'text']], train['target'],
                                                   test_size=0.2, random_state=42)
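
One optional variation, not used in this notebook, is to pass `stratify` so that both splits keep the same class ratio as the full data. The sketch below uses a made-up frame (`toy`, with 30 negative and 20 positive rows) purely to show the effect; whether it helps on the real data is untested here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 30 rows of class 0 and 20 rows of class 1
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(50)],
    "target": [0] * 30 + [1] * 20,
})

# stratify keeps the 60/40 class ratio in both the training and validation splits
X_tr, X_va, y_tr, y_va = train_test_split(
    toy[["text"]], toy["target"],
    test_size=0.2, random_state=42, stratify=toy["target"],
)

print(y_tr.value_counts())
print(y_va.value_counts())
```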

### Creating the vectors and Transforming the features

In [None]:
bow_key = CountVectorizer()

X_train_key = bow_key.fit_transform(X_train['keyword']).toarray()
X_test_key = bow_key.transform(X_test['keyword']).toarray()

test_key = bow_key.transform(test['keyword']).toarray()

In [None]:
bow_loc = CountVectorizer()

X_train_loc = bow_loc.fit_transform(X_train['location']).toarray()
X_test_loc = bow_loc.transform(X_test['location']).toarray()

test_loc = bow_loc.transform(test['location']).toarray()

In [None]:
# ngram_range=(1, 2) extracts both unigrams and bigrams from the tweet text
tfidf = TfidfVectorizer(ngram_range=(1, 2))

X_train_text = tfidf.fit_transform(X_train['text']).toarray()
X_test_text = tfidf.transform(X_test['text']).toarray()

test_text = tfidf.transform(test['text']).toarray()

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; the fitted vocabulary gives the same count
len(tfidf.vocabulary_)
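
To see why the feature count grows when bigrams are added, here is a minimal sketch on two made-up sentences (`toy_docs` is hypothetical). It compares the vocabulary learned with unigrams only against the vocabulary learned with unigrams plus bigrams, using the default tokenizer settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the competition data
toy_docs = ["forest fire near town", "fire drill at school"]

# Unigrams only (the default) versus unigrams plus bigrams
unigram = TfidfVectorizer(ngram_range=(1, 1)).fit(toy_docs)
uni_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(toy_docs)

unigram_terms = sorted(unigram.vocabulary_)
uni_bi_terms = sorted(uni_bi.vocabulary_)

print(len(unigram_terms), unigram_terms)
print(len(uni_bi_terms), uni_bi_terms)
```

Every pair of adjacent words becomes an extra column, so on real tweets the bigram vocabulary is typically much larger than the unigram one.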

### Combining the vectors together

In [None]:
X_train_vec = np.concatenate((X_train_key, X_train_loc, X_train_text), axis=1)
X_test_vec = np.concatenate((X_test_key, X_test_loc, X_test_text), axis=1)
test_vec = np.concatenate((test_key, test_loc, test_text), axis=1)

In [None]:
X_train_vec.shape
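
Calling `.toarray()` and then `np.concatenate` produces dense matrices, which can use a lot of memory once bigrams are included. A possible alternative, sketched below on made-up inputs (`keywords` and `texts` are hypothetical), is to keep the vectorizer outputs sparse and join them with `scipy.sparse.hstack`. `MultinomialNB` accepts sparse input, so the combined matrix could be passed to it directly.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy inputs standing in for the keyword and text columns
keywords = ["fire", "missing", "flood"]
texts = ["forest fire near town", "all good here", "flood warning issued"]

# fit_transform returns scipy sparse matrices; no .toarray() needed
key_sparse = CountVectorizer().fit_transform(keywords)
text_sparse = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# Column-wise join that stays sparse
combined_sparse = hstack([key_sparse, text_sparse]).tocsr()

# Dense equivalent, as done in this notebook, for comparison only
combined_dense = np.concatenate(
    (key_sparse.toarray(), text_sparse.toarray()), axis=1
)

print(combined_sparse.shape, combined_dense.shape)
```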

# Test it on a model

In [None]:
# Fit Naive Bayes on the combined features; score() returns mean accuracy on the held-out split
clf = MultinomialNB().fit(X_train_vec, y_train)
clf.score(X_test_vec, y_test)
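
`metrics` is imported above but not used. As a sketch of how it could complement the accuracy from `score()`, the example below fits `MultinomialNB` on a tiny made-up count matrix (`X_toy` and `y_toy` are hypothetical) and reports the F1 score and a per-class report. In the notebook, the same calls would presumably take `y_test` and `clf.predict(X_test_vec)`.

```python
import numpy as np
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count features: column 0 signals class 1, column 1 signals class 0
X_toy = np.array([
    [2, 0, 1],
    [3, 0, 0],
    [0, 2, 1],
    [0, 3, 0],
    [1, 0, 0],
    [0, 1, 0],
])
y_toy = np.array([1, 1, 0, 0, 1, 0])

toy_clf = MultinomialNB().fit(X_toy, y_toy)
y_pred = toy_clf.predict(X_toy)

# F1 balances precision and recall for the positive class
f1 = metrics.f1_score(y_toy, y_pred)
print(f1)
print(metrics.classification_report(y_toy, y_pred))
```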

If you have any ideas that I should try, please comment below.