# Predicting Fake News 

## Dataset: Datacamp dataset for fake news challenge 
### URL : https://s3.amazonaws.com/assets.datacamp.com/blog_assets/fake_or_real_news.csv

### Goal:
Learn and explore the fake news dataset using natural language processing techniques
and word embeddings. This notebook covers two approaches: bag-of-words and word2vec. It follows the
tutorial at https://www.kaggle.com/c/word2vec-nlp-tutorial , which explains both clearly.
The first task is to understand the 'Bag of Words' approach, which makes the
'Word2vec' model easier to learn afterwards. 

### Bag-of-words: 
For a given document, you extract only the unigram words (aka terms) to create an unordered list of words.
No POS tag, no syntax, no semantics, no position, no bigrams, no trigrams. Only the unigram words themselves,
making for a bunch of words to represent the document. Thus: Bag-of-words. 
Source: "Speech and Language Processing" by Jurafsky and Martin, 2009, in section 23.1 
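
As a minimal sketch of the definition above, the snippet below counts unigrams with Python's `collections.Counter`. The function name `bag_of_words` and the two short documents are made up for illustration; they are not taken from the dataset.

```python
from collections import Counter

def bag_of_words(document):
    """Return unigram counts for a document, ignoring word order."""
    # Lower-case and split on whitespace; a fuller pipeline would also
    # strip punctuation and stop words.
    return Counter(document.lower().split())

# Two hypothetical documents with the same words in a different order
doc_a = "the fbi attacked the campaign"
doc_b = "the campaign attacked the fbi"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

print(bag_a)
print(bag_a == bag_b)  # word order is discarded, so the two bags are equal
```

Because position is thrown away, the two documents get the same representation even though they say different things.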

### Word2vec:
Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set
of vectors: feature vectors for words in that corpus. 

### Outline: 
The notebook focuses mainly on text analysis techniques and compares the performance of the two approaches 
described above. Each implementation step is accompanied by a short description.  

### Models: 
### 1. GaussianNB
### 2. SVM
### 3. Random Forest Classifier

In [201]:
import pandas as pd
import numpy as np

In [202]:
df = pd.read_csv("C:\\Users\\Prajakta\\fake_or_real_news.csv")
df.head(5)

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [203]:
df['label'] = np.where(df['label'] == 'FAKE', 0 , 1)

In [204]:
df.shape

(6335, 4)

In [205]:
df.columns.values

array(['Unnamed: 0', 'title', 'text', 'label'], dtype=object)

Checking the first news.....

In [206]:
print (df["text"][0])

Daniel Greenfield, a Shillman Journalism Fellow at the Freedom Center, is a New York writer focusing on radical Islam. 
In the final stretch of the election, Hillary Rodham Clinton has gone to war with the FBI. 
The word “unprecedented” has been thrown around so often this election that it ought to be retired. But it’s still unprecedented for the nominee of a major political party to go war with the FBI. 
But that’s exactly what Hillary and her people have done. Coma patients just waking up now and watching an hour of CNN from their hospital beds would assume that FBI Director James Comey is Hillary’s opponent in this election. 
The FBI is under attack by everyone from Obama to CNN. Hillary’s people have circulated a letter attacking Comey. There are currently more media hit pieces lambasting him than targeting Trump. It wouldn’t be too surprising if the Clintons or their allies were to start running attack ads against the FBI. 
The FBI’s leadership is being warned that the entire left

#### The BeautifulSoup Package
Beautiful Soup strips HTML markup from text. We use it here and test it on the first news text.

In [207]:
from bs4 import BeautifulSoup     

In [208]:
example1 = BeautifulSoup(df["text"][0],"lxml")  

In [209]:
print (df["text"][0])

Daniel Greenfield, a Shillman Journalism Fellow at the Freedom Center, is a New York writer focusing on radical Islam. 
In the final stretch of the election, Hillary Rodham Clinton has gone to war with the FBI. 
The word “unprecedented” has been thrown around so often this election that it ought to be retired. But it’s still unprecedented for the nominee of a major political party to go war with the FBI. 
But that’s exactly what Hillary and her people have done. Coma patients just waking up now and watching an hour of CNN from their hospital beds would assume that FBI Director James Comey is Hillary’s opponent in this election. 
The FBI is under attack by everyone from Obama to CNN. Hillary’s people have circulated a letter attacking Comey. There are currently more media hit pieces lambasting him than targeting Trump. It wouldn’t be too surprising if the Clintons or their allies were to start running attack ads against the FBI. 
The FBI’s leadership is being warned that the entire left

#### Calling get_text() gives you the text of the news, without tags or markup.

In [210]:
print (example1.get_text())

Daniel Greenfield, a Shillman Journalism Fellow at the Freedom Center, is a New York writer focusing on radical Islam. 
In the final stretch of the election, Hillary Rodham Clinton has gone to war with the FBI. 
The word “unprecedented” has been thrown around so often this election that it ought to be retired. But it’s still unprecedented for the nominee of a major political party to go war with the FBI. 
But that’s exactly what Hillary and her people have done. Coma patients just waking up now and watching an hour of CNN from their hospital beds would assume that FBI Director James Comey is Hillary’s opponent in this election. 
The FBI is under attack by everyone from Obama to CNN. Hillary’s people have circulated a letter attacking Comey. There are currently more media hit pieces lambasting him than targeting Trump. It wouldn’t be too surprising if the Clintons or their allies were to start running attack ads against the FBI. 
The FBI’s leadership is being warned that the entire left

#### Data Preprocessing Techniques
Use a regular expression to keep letters only, then convert everything to lower case.

In [211]:
import re
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      example1.get_text() )  # The text to search

In [212]:
print (letters_only)



In [213]:
lower_case = letters_only.lower()        # Convert to lower case
words = lower_case.split()               # Split into words
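
To make the effect of the two steps above concrete, here is a small sketch on a single hand-written sentence. The helper name `letters_only_words` and the sample string are assumptions for illustration.

```python
import re

def letters_only_words(text):
    """Keep letters only, lower-case the text, and split it into words."""
    # Every character that is not a-z or A-Z becomes a space
    letters = re.sub("[^a-zA-Z]", " ", text)
    return letters.lower().split()

sample = "It's primary day in New York, 2016!"
tokens = letters_only_words(sample)
print(tokens)
```

Note that the apostrophe is replaced too, so "It's" becomes the two tokens "it" and "s". Digits and punctuation disappear entirely.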

#### NLTK : 

Frequently occurring words that don't carry much meaning are called "stop words"; in English they include words such as "a", "and", "is", and "the". Conveniently, there are Python packages that come with stop word lists built in. Let's import a stop word list from the Python Natural Language Toolkit (NLTK). 

In [214]:
import nltk
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

#### Text without stopwords 

In [215]:
words = [w for w in words if not w in stopwords.words("english")]
print (words)
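
The list comprehension above calls stopwords.words("english") once per word. A common alternative, sketched here with a small hypothetical stop list standing in for the NLTK one, is to build a set once and test membership against it.

```python
# A small, hypothetical stop-word list standing in for stopwords.words("english")
stop_list = ["a", "and", "is", "the", "in", "with"]
stops = set(stop_list)  # build the set once, outside the loop

sample_words = ["hillary", "is", "in", "a", "war", "with", "the", "fbi"]
meaningful = [w for w in sample_words if w not in stops]
print(meaningful)
```

Set membership tests do not scan the whole collection, which matters when the same filter runs over thousands of documents.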



As the example above shows, we need to clean every news text in our data. Below is a function that combines the cleaning steps. 

In [216]:
def news_text_to_words(raw_text):
    # Function to convert a raw news text to a string of words
    # The input is a single string (a raw news text), and 
    # the output is a single string (a preprocessed news text)
    # 1. Remove HTML
    # Name the parser explicitly so Beautiful Soup does not warn about guessing it
    news_text = BeautifulSoup(raw_text, "lxml").get_text() 
    
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", news_text) 
    
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                 
    
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))                  
    
    # 5. Remove stop words
    meaningful_words = [w for w in words if not w in stops]   
    
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( meaningful_words ))  

In [217]:
# Checking our function for first news text
clean_text = news_text_to_words(df["text"][0])
print(clean_text)







#### Get the number of news texts based on the dataframe column size

In [218]:
num_news_text = df["text"].size

#### Initialize an empty list to hold the clean news text

In [219]:
clean_df_news = []

#### Loop over each news text, with index i running from 0 to num_news_text - 1

In [220]:
for i in range( 0, num_news_text ):
    # Call our function for each one, and add the result to the list of clean news text
    clean_df_news.append( news_text_to_words(df["text"][i] ) )





In [221]:
len(clean_df_news)

6335

#### Train-test split using a single holdout split: X contains the news text and y contains the label to be predicted
#### Train Data - 67%
#### Test Data - 33%

In [222]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(clean_df_news, df['label'], test_size=0.33, random_state=53)
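
For intuition, a holdout split shuffles the rows once and cuts them into two disjoint parts. The sketch below does this in plain Python; `holdout_split` is a hypothetical helper and not scikit-learn's implementation, so it will not reproduce the exact rows that train_test_split selects for the same seed.

```python
import random

def holdout_split(items, labels, test_size=0.33, seed=53):
    """Shuffle indices once, then cut them into a test part and a train part."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

# 100 made-up documents with alternating labels
docs = ["doc %d" % i for i in range(100)]
toy_labels = [i % 2 for i in range(100)]

X_tr, X_te, y_tr, y_te = holdout_split(docs, toy_labels)
print(len(X_tr), len(X_te))
```

Each document lands in exactly one of the two parts, and fixing the seed makes the split repeatable.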

#### The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. Below, we use the 1000 most frequent words (remembering that stop words have already been removed).

In [223]:
print ("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
vectorizer = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,   
                             max_features = 1000) 

X_train_data_features = vectorizer.fit_transform(X_train)
X_train_data_features = X_train_data_features.toarray()

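# Reuse the vocabulary learned from the training data; calling fit_transform
# here would learn a different vocabulary from the test set
X_test_data_features = vectorizer.transform(X_test)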
X_test_data_features = X_test_data_features.toarray()

Creating the bag of words...
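
The vocabulary should come from the training documents only and then be reused unchanged for the test documents, so that column j means the same word in both arrays. The sketch below illustrates this in plain Python; `fit_vocabulary` and `transform` are hypothetical stand-ins for the fit and transform steps of CountVectorizer, and the documents are made up.

```python
from collections import Counter

def fit_vocabulary(train_docs, max_features):
    """Learn the most frequent words from the training documents only."""
    totals = Counter(word for doc in train_docs for word in doc.split())
    top = [word for word, _ in totals.most_common(max_features)]
    return sorted(top)

def transform(docs, vocabulary):
    """Count vocabulary words in each document; unseen words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return rows

train_docs = ["fbi attack fbi", "election attack"]
test_docs = ["fbi election senate"]

toy_vocab = fit_vocabulary(train_docs, max_features=3)
train_rows = transform(train_docs, toy_vocab)
test_rows = transform(test_docs, toy_vocab)
print(toy_vocab)
print(train_rows)
print(test_rows)
```

The word "senate" never appeared in training, so it has no column and is dropped from the test row.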



In [224]:
print(X_train_data_features.shape)
print(X_test_data_features.shape)

(4244, 1000)
(2091, 1000)


#### Thus the training array has 4244 rows and the test array has 2091 rows, each with 1000 features. 

#### To check the feature names, we use the get_feature_names() method

In [225]:
vocab = vectorizer.get_feature_names()  # scikit-learn 1.0+: use get_feature_names_out()
print(vocab)

['abedin', 'ability', 'able', 'abortion', 'access', 'according', 'account', 'accused', 'across', 'act', 'action', 'actions', 'actually', 'added', 'address', 'administration', 'african', 'age', 'agency', 'agenda', 'ago', 'agreement', 'ahead', 'air', 'al', 'allies', 'allow', 'allowed', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'america', 'american', 'americans', 'among', 'announced', 'another', 'answer', 'anti', 'anyone', 'anything', 'appear', 'appeared', 'appears', 'approach', 'april', 'arab', 'area', 'areas', 'argument', 'army', 'around', 'article', 'ask', 'asked', 'assault', 'associated', 'attack', 'attacks', 'attempt', 'attention', 'attorney', 'august', 'author', 'authorities', 'away', 'back', 'bad', 'ballot', 'bank', 'barack', 'base', 'based', 'battle', 'became', 'become', 'began', 'beginning', 'behind', 'believe', 'believes', 'bernie', 'best', 'better', 'beyond', 'biden', 'big', 'biggest', 'bill', 'billion', 'black', 'board', 'body', 'boehner', 'book', 'b

#### Printing count of each word in the vocabulary

In [227]:
import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(X_train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

424 abedin
859 ability
398 able
598 abortion
2088 access
413 according
977 account
1112 accused
826 across
529 act
1161 action
854 actions
557 actually
1503 added
497 address
395 administration
608 african
376 age
392 agency
1055 agenda
365 ago
582 agreement
485 ahead
777 air
957 al
461 allies
497 allow
413 allowed
864 almost
406 alone
786 along
1455 already
5508 also
591 although
906 always
2551 america
3663 american
2170 americans
1631 among
652 announced
2066 another
501 answer
1079 anti
724 anyone
841 anything
382 appear
421 appeared
436 appears
395 approach
374 april
373 arab
586 area
408 areas
375 argument
511 army
1615 around
749 article
471 ask
1196 asked
392 assault
461 associated
1257 attack
1127 attacks
357 attempt
516 attention
471 attorney
452 august
495 author
360 authorities
364 away
418 back
1147 bad
2628 ballot
369 bank
712 barack
360 base
503 based
391 battle
680 became
479 become
999 began
402 beginning
525 behind
1205 believe
657 believes
362 bernie
1024 best
1416 b

In [228]:
from sklearn.naive_bayes import GaussianNB 
clf = GaussianNB() 
clf = clf.fit(X_train_data_features, y_train)

In [229]:
result = clf.predict(X_train_data_features)

In [230]:
from sklearn.metrics import accuracy_score
from sklearn import metrics
score = metrics.accuracy_score(y_train, result)
print("Accuracy of GaussianNB:   %0.3f" % score)

Accuracy of GaussianNB:   0.826


In [231]:
from sklearn.svm import SVC
svc = SVC()
svc = svc.fit(X_train_data_features, y_train)

In [232]:
result = svc.predict(X_train_data_features)

In [234]:
from sklearn.metrics import accuracy_score
from sklearn import metrics
score = metrics.accuracy_score(y_train, result)
print("Accuracy of SVC:   %0.3f" % score)

Accuracy of SVC:   0.920


### Observations :

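On the training data, SVC reaches an accuracy of 92.0% while GaussianNB reaches 82.6%. Both scores are computed on the same data the models were fit on (X_train_data_features), so they are likely optimistic; accuracy on X_test_data_features would be a fairer estimate. With that caveat, the bag-of-words approach combined with the preprocessing above (HTML removal, letter-only filtering, lower-casing and stop word removal) appears to give the classifiers a clean, consistent signal to learn from. 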
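
The accuracies above come from predicting on X_train_data_features, the same rows the models were fit on. The toy sketch below shows why a score on held-out rows is the safer number to report. `MemorizingClassifier` and all of the data are invented for illustration and have nothing to do with GaussianNB or SVC internals.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

class MemorizingClassifier:
    """Toy model: remembers training examples, guesses 0 for anything unseen."""
    def fit(self, X, y):
        self.seen = dict(zip(X, y))
        return self

    def predict(self, X):
        return [self.seen.get(x, 0) for x in X]

# Made-up examples: single letters as "documents", 0/1 as labels
X_seen, y_seen = ["a", "b", "c", "d"], [0, 1, 1, 0]
X_new, y_new = ["e", "f", "a", "g"], [1, 0, 0, 1]

toy_model = MemorizingClassifier().fit(X_seen, y_seen)
train_acc = accuracy(y_seen, toy_model.predict(X_seen))
test_acc = accuracy(y_new, toy_model.predict(X_new))
print("train accuracy: %0.3f" % train_acc)
print("test accuracy:  %0.3f" % test_acc)
```

A model that only memorizes scores perfectly on the rows it has seen and much worse on new ones, which is the gap a training-set accuracy cannot reveal.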

# Word2vec

#### Word2Vec does not need labels in order to create meaningful representations. 
This is useful, since most data in the real world is unlabeled. If the network is given enough training data (tens of billions of words), it produces word vectors with intriguing characteristics. Words with similar meanings appear in clusters, and clusters are spaced such that some word relationships, such as analogies, can be reproduced using vector math. 
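
The "vector math" mentioned above can be sketched with a few hand-picked 2-D vectors. These numbers are invented purely for illustration; real word2vec vectors are learned from text and have many more dimensions.

```python
import numpy as np

# Hypothetical 2-D vectors chosen by hand, not learned from any corpus
toy_word_vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([-0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king" - "man" + "woman" should land near "queen"
target = toy_word_vectors["king"] - toy_word_vectors["man"] + toy_word_vectors["woman"]

# Exclude the query words themselves, as analogy lookups usually do
candidates = [w for w in toy_word_vectors if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(toy_word_vectors[w], target))
print(best)
```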

This part of the notebook works with unlabeled data and trains a model using the word2vec implementation from the gensim package in Python. 

In [235]:
import pandas as pd
df_new = pd.read_csv("C:\\Users\\Prajakta\\fake_or_real_news.csv")
df_new.head(5)

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


#### Deleting column 'label' so we can work with unlabeled data

In [236]:
del df_new['label']

In [237]:
df_new.head(5)

| | Unnamed: 0 | title | text |
|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... |


In [238]:
print ("Read %d unlabeled news text" % (df_new["text"].size ))

Read 6335 unlabeled news text


To train Word2Vec it is better not to remove stop words, because the algorithm relies on the broader context of the sentence in order to produce high-quality word vectors. For this reason, we make stop word removal optional in the functions below.

In [239]:
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

def news_to_wordlist( news_text, remove_stopwords=False ):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words.  Returns a list of words.
    
    # 1. Remove HTML
    # Name the parser explicitly so Beautiful Soup does not warn about guessing it
    news_text = BeautifulSoup(news_text, "lxml").get_text()
     
    # 2. Remove non-letters
    news_text = re.sub("[^a-zA-Z]"," ",  news_text)
    
    # 3. Convert words to lower case and split them
    words =  news_text.lower().split()
    
    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]
    
    # 5. Return a list of words
    return(words)

#### Word2Vec expects single sentences, each one as a list of words. In other words, the input format is a list of lists.

It is not at all straightforward to split a paragraph into sentences. English sentences can end with "?", "!", a closing quotation mark, or ".", among other things, and spacing and capitalization are not reliable guides either. For this reason, we'll use NLTK's punkt tokenizer for sentence splitting.
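
To see why a simple rule is not enough, the sketch below splits on sentence-ending punctuation followed by whitespace. The sentence is a hypothetical one modeled on the kind of text in this dataset, not a quotation from it.

```python
import re

toy_text = "U.S. Secretary of State John F. Kerry said Monday he will stop in Paris."

# Naive rule: a sentence ends at ".", "!" or "?" followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", toy_text)
for piece in naive_sentences:
    print(repr(piece))
```

The single sentence is cut into three pieces because the abbreviation "U.S." and the initial "F." both look like sentence endings to the naive rule.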

In [240]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [241]:
def news_to_sentences( news_text, tokenizer, remove_stopwords=False ):
    # Function to split a news text into parsed sentences. Returns a 
    # list of sentences, where each sentence is a list of words
    
    # 1. Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(news_text.strip())
    
    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call news_to_wordlist to get a list of words
            sentences.append( news_to_wordlist( raw_sentence, remove_stopwords ))
    
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences

#### Applying the above function to prepare data for Word2vec

In [242]:
sentences = []  # Initialize an empty list of sentences

print ("Parsing sentences from news data set")
for news_text in df["text"]:
    sentences += news_to_sentences(news_text, tokenizer)

Parsing sentences from news data set





In [243]:
print(len(sentences))

217376


In [244]:
print(sentences[0])

['daniel', 'greenfield', 'a', 'shillman', 'journalism', 'fellow', 'at', 'the', 'freedom', 'center', 'is', 'a', 'new', 'york', 'writer', 'focusing', 'on', 'radical', 'islam']


In [245]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

#### Training the model

Source: https://radimrehurek.com/gensim/models/word2vec.html
The parameter values below are a starting point; they can be changed to experiment with other settings. 

In [246]:
num_features = 100    # Word vector dimensionality                    
min_word_count = 5   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

from gensim.models import word2vec
print("Training model...")
# Note: gensim 4.0 renamed the `size` argument to `vector_size`
model = word2vec.Word2Vec(sentences, workers=num_workers, 
            size=num_features, min_count = min_word_count,
            window = context, sample = downsampling)



Training model...


#### Now that our model is trained, we use vector operations to combine the words in each news text. One simple method is to average the word vectors in a given news text. The following code averages the feature vectors. 

In [247]:
import numpy as np 
def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given paragraph
    
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")

    nwords = 0.
    
    # index2word is a list that contains the names of the words in 
    # the model's vocabulary. Convert it to a set, for speed.
    # (gensim 4.0 renamed this attribute to index_to_key)
    index2word_set = set(model.wv.index2word)
    
    # Loop over each word in the news text and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set: 
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model.wv[word])
    
    # Divide the result by the number of words to get the average
    # (if no word was in the vocabulary, nwords is 0 and the result is NaN)
    featureVec = np.divide(featureVec,nwords)
    return featureVec
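
When a text contains no in-vocabulary word, makeFeatureVec divides by zero and returns a vector of NaN values. One possible variant, sketched here, returns a zero vector in that case instead. The name `average_word_vectors` and the plain dictionary `toy_vectors` are assumptions standing in for the trained model's lookup.

```python
import numpy as np

def average_word_vectors(words, word_vectors, num_features):
    """Average the vectors of in-vocabulary words; return zeros if none match."""
    total = np.zeros(num_features, dtype="float32")
    n_found = 0
    for word in words:
        if word in word_vectors:
            total += word_vectors[word]
            n_found += 1
    if n_found == 0:
        # No known word: skip the division so no NaN values are produced
        return total
    return total / n_found

# Hypothetical 2-D vectors standing in for a trained model
toy_vectors = {
    "fbi": np.array([1.0, 0.0], dtype="float32"),
    "election": np.array([0.0, 1.0], dtype="float32"),
}

avg_known = average_word_vectors(["fbi", "election", "unknownword"], toy_vectors, 2)
avg_empty = average_word_vectors([], toy_vectors, 2)
print(avg_known)
print(avg_empty)
```

Whether a zero vector or an imputed value is the better placeholder depends on the classifier; this sketch only shows where the NaN values come from and one way to avoid them.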

In [248]:
def getAvgFeatureVecs(news_texts, model, num_features):
    # Given a set of news texts(each one a list of words), calculate 
    # the average feature vector for each one and return a 2D numpy array 
    
    # Initialize a counter
    counter = 0.
    
    # Preallocate a 2D numpy array, for speed
    NewsFeatureVecs = np.zeros((len(news_texts),num_features),dtype="float32")
    
    # Loop through the news texts
    for news_text in news_texts:
        if counter%1000. == 0.:
            print ("News_text %d of %d" % (counter, len(news_texts)))
        NewsFeatureVecs[int(counter)] = makeFeatureVec(news_text, model, num_features)
        counter = counter + 1.
    return NewsFeatureVecs

In [249]:
clean_news_text = []

In [250]:
for i in range( 0, num_news_text ):
    # Call function for each one, and add the result to the list of clean news text
    clean_news_text.append( news_to_wordlist(df["text"][i] ) )





In [251]:
len(clean_news_text)

6335

In [252]:
df['label'].shape

(6335,)

#### Train-test split using a single holdout split: X contains the news text and y contains the label to be predicted
#### Train Data - 67%
#### Test Data - 33%

In [174]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(clean_news_text, df['label'], test_size=0.33, random_state=53)

#### Now that the model is trained on unlabeled data, we can apply it to the labeled data to check how useful the learned vectors are for classification.

Thus, we get average feature vectors for X_train and X_test using the model. 

In [175]:
trainDataVecs = getAvgFeatureVecs(X_train, model, num_features )

News_text 0 of 4244
News_text 1000 of 4244
News_text 2000 of 4244
News_text 3000 of 4244
News_text 4000 of 4244


In [176]:
testDataVecs = getAvgFeatureVecs(X_test, model, num_features )

News_text 0 of 2091
News_text 1000 of 2091
News_text 2000 of 2091


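These train and test data vectors contain NaN values, which arise when a news text contains no word from the model's vocabulary (nwords is 0 in makeFeatureVec, so the average is 0/0). They can be replaced using the median, the mean or other methods. For simplicity, we use the median of all non-NaN values. 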

In [253]:
print(trainDataVecs)

[[ -1.82002056e-02  -1.82002056e-02  -1.82002056e-02 ...,  -1.82002056e-02
   -1.82002056e-02  -1.82002056e-02]
 [  4.86127764e-01  -1.00027084e+00  -8.80418625e-03 ...,  -3.30448538e-01
    5.47815621e-01   2.05120534e-01]
 [  1.07551277e+00   6.76691532e-02  -6.43887103e-01 ...,   4.64854449e-01
    7.53573596e-01  -7.27285385e-01]
 ..., 
 [  7.42553651e-01  -2.80313224e-01  -7.56411674e-03 ...,  -2.11087927e-01
    5.93109906e-01  -5.66880889e-02]
 [  3.14573288e-01  -3.17868173e-01  -8.75919908e-02 ...,  -9.44647640e-02
    3.01275730e-01   1.79786637e-01]
 [  7.17018366e-01  -3.27937454e-01  -3.52707505e-01 ...,  -8.91063188e-04
    2.27979913e-01   2.00643107e-01]]


In [254]:
trainDataVecs[np.isnan(trainDataVecs)] = np.median(trainDataVecs[~np.isnan(trainDataVecs)])
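
The same fill can be sketched on a tiny made-up array. `np.nanmedian` computes the median while ignoring NaN entries, which matches masking with `~np.isnan(...)` before calling `np.median`.

```python
import numpy as np

# A made-up 3 x 2 array whose middle row is entirely NaN
toy_vecs = np.array([[1.0, 2.0],
                     [np.nan, np.nan],
                     [3.0, 4.0]], dtype="float32")

# Median over all non-NaN entries of the whole array
fill_value = np.nanmedian(toy_vecs)
toy_vecs[np.isnan(toy_vecs)] = fill_value
print(fill_value)
print(toy_vecs)
```

Every NaN entry receives the same global value, so a row that was entirely NaN becomes a constant row.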

In [255]:
print(trainDataVecs)

[[ -1.82002056e-02  -1.82002056e-02  -1.82002056e-02 ...,  -1.82002056e-02
   -1.82002056e-02  -1.82002056e-02]
 [  4.86127764e-01  -1.00027084e+00  -8.80418625e-03 ...,  -3.30448538e-01
    5.47815621e-01   2.05120534e-01]
 [  1.07551277e+00   6.76691532e-02  -6.43887103e-01 ...,   4.64854449e-01
    7.53573596e-01  -7.27285385e-01]
 ..., 
 [  7.42553651e-01  -2.80313224e-01  -7.56411674e-03 ...,  -2.11087927e-01
    5.93109906e-01  -5.66880889e-02]
 [  3.14573288e-01  -3.17868173e-01  -8.75919908e-02 ...,  -9.44647640e-02
    3.01275730e-01   1.79786637e-01]
 [  7.17018366e-01  -3.27937454e-01  -3.52707505e-01 ...,  -8.91063188e-04
    2.27979913e-01   2.00643107e-01]]


In [256]:
print(testDataVecs)

[[ 0.63023293 -0.33161122 -0.37748441 ..., -0.41339785  0.1987765
   0.07789288]
 [ 0.53071845  0.10510877 -0.15914772 ..., -0.12847476  0.17196661
  -0.20268564]
 [ 0.56169784 -0.57044953 -0.38548091 ..., -0.2903015   0.2906279
   0.21123165]
 ..., 
 [ 0.42608303 -0.06368953 -0.12225134 ..., -0.38583004 -0.05151556
   0.11681455]
 [ 0.73321557 -0.71283627  0.2015903  ..., -0.34714261  0.97302938
   0.55668175]
 [ 0.53274626 -0.85694373 -0.23410724 ..., -0.49146745  0.53223646
   0.25593287]]


In [257]:
testDataVecs[np.isnan(testDataVecs)] = np.median(testDataVecs[~np.isnan(testDataVecs)])

In [258]:
from sklearn.svm import SVC
svc1 = SVC()
svc1 = svc1.fit(trainDataVecs, y_train)

In [259]:
result = svc1.predict(testDataVecs)

In [260]:
from sklearn.metrics import accuracy_score
from sklearn import metrics
score = metrics.accuracy_score(y_test, result)
print("Accuracy of SVC using Word2vec:   %0.3f" % score)

Accuracy of SVC using Word2vec:   0.867


In [262]:
from sklearn.naive_bayes import GaussianNB 
clf1 = GaussianNB() 
clf1 = clf1.fit(trainDataVecs, y_train)

In [263]:
result = clf1.predict(testDataVecs)

In [264]:
score = metrics.accuracy_score(y_test, result)
print("Accuracy of GaussianNB using Word2vec:   %0.3f" % score)

Accuracy of GaussianNB using Word2vec:   0.780


### Observations: 

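Using Word2vec vectors learned without labels, SVC gives 86.7% accuracy and GaussianNB gives 78.0% accuracy on the test set. The word vectors themselves are trained on unlabeled text; the classifiers on top of them still use the labels. These results are close to the bag-of-words results above, although those were measured on the training data, so the two sets of numbers are not directly comparable. 
This suggests that the approach is useful for real-life data, much of which is unlabeled, because the embeddings can be learned without labels and then used in prediction tasks. 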