# Word Vectors and SVC Model on NLP with Disaster Tweets Competition
In this notebook, I will create an SVC model and train it on the word vectors created from the dataset in the NLP with Disaster Tweets competition.

First, here are the basic imports.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import spacy # NLP library used to create the word vectors
from sklearn.model_selection import train_test_split # required to split data into training and validation data
from sklearn.svm import SVC # model to be used --> separates vectors into regions for classification

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Creating Word Vectors
Word vectors (or word embeddings) encode the meaning of a word as a vector, that is, a fixed-length sequence of numbers.

This allows us to train a model on the sequence of numbers rather than the word itself.

Here is a more formal definition (and an example) from the Kaggle "Natural Language Processing" course:
> Word embeddings (also called word vectors) represent each word numerically in such a way that the vector corresponds to how that word is used or what it means. Vector encodings are learned by considering the context in which the words appear. Words that appear in similar contexts will have similar vectors. For example, vectors for "leopard", "lion", and "tiger" will be close together, while they'll be far away from "planet" and "castle".
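"Close together" is commonly measured with cosine similarity. The sketch below is only an illustration: the three-dimensional vectors are invented for this example (real spaCy vectors have hundreds of dimensions), and `cosine_similarity` is a hypothetical helper, not part of any library used in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, invented for this illustration only
toy_vectors = {
    "lion": [0.9, 0.8, 0.1],
    "tiger": [0.85, 0.75, 0.2],
    "planet": [0.1, 0.2, 0.95],
}

sim_cats = cosine_similarity(toy_vectors["lion"], toy_vectors["tiger"])
sim_far = cosine_similarity(toy_vectors["lion"], toy_vectors["planet"])
print(f"lion vs tiger:  {sim_cats:.3f}")
print(f"lion vs planet: {sim_far:.3f}")
```

With these made-up numbers, the two animal vectors score much higher than the animal and planet pair, which is the behaviour the quote describes.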

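In this case, we represent each tweet by the average of the word vectors of its tokens, which is what spaCy's `doc.vector` returns by default. Averaging discards word order, but it is a simple baseline that often works well.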
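To make the averaging step concrete, here is a minimal sketch in plain Python. The two-dimensional word vectors and the `average_vector` helper are hypothetical and exist only for this illustration; the notebook itself relies on spaCy to do the same thing with real vectors.

```python
def average_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical 2-dimensional word vectors, invented for illustration
word_vectors = {
    "forest": [1.0, 0.0],
    "fire": [0.0, 1.0],
    "nearby": [0.5, 0.5],
}

tweet = "forest fire nearby"
# One vector per word, then a single averaged vector for the whole tweet
tweet_vector = average_vector([word_vectors[w] for w in tweet.split()])
print(tweet_vector)
```

However long the tweet is, the result has the same length as a single word vector, so every tweet becomes a fixed-size input for the model.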

In [None]:
# Load the large spaCy English model, which includes the word vectors
nlp = spacy.load('en_core_web_lg')

# Read the data
train_data = pd.read_csv('../input/nlp-getting-started/train.csv')
test_data = pd.read_csv('../input/nlp-getting-started/test.csv')

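# Disable all pipeline components (tagger, parser, NER, ...) to speed up the vector transformation.
# Calling disable_pipes() without any names disables nothing; doc.vector only needs the tokenizer and the static word vectors.
with nlp.disable_pipes(*nlp.pipe_names):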
    train_vecs = np.array([nlp(text).vector for text in train_data.text]) # doc vectors for training set
    test_vecs = np.array([nlp(text).vector for text in test_data.text]) # doc vectors for testing set

## Using the SVC Model
As explained on https://scikit-learn.org/stable/modules/svm.html#svm:
> Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.

An SVC classifies by dividing the feature space into regions, one per class, as the diagrams below show (image from the same page as above):

![SVC decision regions](https://scikit-learn.org/stable/_images/sphx_glr_plot_iris_svc_0011.png)

After trying several types of models, I found that an SVC with the RBF kernel works best for this competition, so we will use it to predict whether each tweet is about a real disaster.
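For intuition, the RBF kernel scores how similar two vectors are as `exp(-gamma * ||x - y||^2)`: identical vectors score 1, and the score decays toward 0 as they move apart. The sketch below is a plain-Python illustration, not scikit-learn's implementation; `rbf_kernel` is a hypothetical helper and the value of `gamma` is chosen arbitrarily.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# gamma is picked arbitrarily for this illustration
gamma = 0.5
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma)  # identical points
near = rbf_kernel([1.0, 2.0], [1.5, 2.0], gamma)  # close points
far = rbf_kernel([1.0, 2.0], [4.0, 6.0], gamma)   # distant points
print(same, near, far)
```

Because similarity depends on distance rather than on a single straight line, the resulting class regions can be curved, unlike those of a linear kernel.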

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train_vecs, train_data.target, test_size=0.2, random_state=1)

# helper that returns a fresh, untrained model each time it is called
def my_model():
    # the RBF kernel allows curved boundaries between the classes
    model = SVC(random_state=1, kernel='rbf', max_iter=20000)
    return model

model_val = my_model()
model_val.fit(X_train, y_train)
print(f'Accuracy: {model_val.score(X_valid, y_valid)}')

On the held-out validation set, this gives an accuracy of about 82.3%.
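For a classifier, scikit-learn's `score` method reports mean accuracy, meaning the fraction of predictions that match the true labels. A minimal sketch of that calculation follows; the labels are invented for illustration and `accuracy` is a hypothetical helper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels, invented for illustration (1 = disaster, 0 = not)
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
acc = accuracy(y_true, y_pred)
print(f"Accuracy: {acc}")
```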

## Final Training and Submission
Now, we will do the final training with all of the training data, and we will use the model to predict the outcomes for the testing set.

In [None]:
# model definition and training
model = my_model()
model.fit(train_vecs, train_data.target)

# create predictions
predictions = model.predict(test_vecs)

# arrange into the two-column format required by the competition: id, target
final_preds = pd.DataFrame({'id': test_data.id, 'target': predictions})

final_preds.to_csv('submission.csv', index=False)

print("Your submission was successfully saved!")
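
Before uploading, it can be worth sanity-checking the file. The sketch below assumes the submission should contain exactly the columns `id` and `target`, with each target being 0 or 1, as in the cell above. `check_submission` is a hypothetical helper, and the sample text stands in for the contents of `submission.csv`.

```python
import csv
import io

def check_submission(csv_text):
    """Return the row count if the CSV has 'id' and 'target' columns with 0/1 targets."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "target"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = list(reader)
    for row in rows:
        if row["target"] not in ("0", "1"):
            raise ValueError(f"unexpected target value: {row['target']!r}")
    return len(rows)

# Hypothetical file contents, invented for illustration
sample = "id,target\n0,1\n2,0\n3,1\n"
n_rows = check_submission(sample)
print(f"{n_rows} rows look valid")
```

To check the real file, one could pass the text read from `submission.csv` instead of `sample`.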