# Detecting Fake News From Twitter
Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

This problem is a Kaggle competition in which the goal is to build a machine learning model that predicts which tweets are about real disasters and which ones aren’t.
The dataset includes 10,000 hand-classified tweets. It is originally from https://appen.com/open-source-datasets/

We start with a very simple model, following the hands-on tutorial provided on Kaggle.

# First Approach: Quick Start

Let's look at our data: first an example of what is NOT a disaster tweet, then an example of one that is.

In [10]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print("non disaster example:", train_df[train_df["target"] == 0]["text"].values[1])
print("disaster example:", train_df[train_df["target"] == 1]["text"].values[1])

non disaster example: I love fruits
disaster example: Forest fire near La Ronge Sask. Canada


## Building vectors

The theory behind the model we'll build in this notebook is pretty simple: the words contained in each tweet are a good indicator of whether they're about a real disaster or not (this is not entirely correct, but it's a great place to start).

We'll use scikit-learn's CountVectorizer to count the words in each tweet and turn them into data our machine learning model can process.
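
To make the counting step concrete, here is a minimal plain-Python sketch of the bag-of-words idea. It is not scikit-learn's implementation (which returns sparse matrices and has many more options); it only mimics the default tokenization (lowercasing, tokens of two or more word characters) and the alphabetical column order. The helper names `tokenize`, `fit_vocabulary`, and `transform`, and the two example tweets, are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: same default token rule as CountVectorizer
# (lowercase, tokens of 2+ word characters).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(texts):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(texts, vocabulary):
    """Return one count vector per text; tokens outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        # dicts preserve insertion order, so columns follow the vocabulary indices
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


# Hypothetical toy tweets, not taken from the dataset
toy_tweets = ["Forest fire near the forest", "I love fruits"]
vocabulary = fit_vocabulary(toy_tweets)
vectors = transform(toy_tweets, vocabulary)

print(list(vocabulary))  # ['fire', 'forest', 'fruits', 'love', 'near', 'the']
print(vectors[0])        # [1, 2, 0, 0, 1, 1]
print(vectors[1])        # [0, 0, 1, 1, 0, 0]
```

Note that the single-character token "I" is dropped by the token rule, and that each vector has one column per vocabulary entry regardless of how many words the tweet itself contains.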

In [11]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

# The above tells us that:
# There are 54 unique words (or "tokens") in the first five tweets.
# The first tweet contains only some of those unique tokens - 
# all of the non-zero counts above are the tokens that DO exist in the first tweet.

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


Now let's create vectors for all of our tweets.

In [12]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])

## note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that the tokens in the train vectors are the only ones mapped to the test vectors - 
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test_df["text"])
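
The comment above can be illustrated with a small plain-Python sketch. It assumes a vocabulary that was already learned from training text and uses simple whitespace tokenization, so it is a simplification of what `CountVectorizer.transform()` does. The names `train_vocabulary` and `to_vector`, and the example sentence, are hypothetical.

```python
# Assumption: a vocabulary learned from the training tweets only
train_vocabulary = {"fire": 0, "forest": 1, "near": 2}


def to_vector(text, vocabulary):
    """Count only the tokens that already exist in the given vocabulary."""
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


# "the" and "harbour" were never seen during fitting, so they are dropped
test_vector = to_vector("Fire near the harbour", train_vocabulary)
print(test_vector)  # [1, 0, 1]
```

The resulting vector has exactly as many columns as the training vectors, which is what lets a model fitted on the training data make predictions on the test data.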

## Our model

As we mentioned above, we think the words contained in each tweet are a good indicator of whether they're about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.
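
One way to formalize this intuition is a linear model with one weight per token. The sketch below uses standard notation that is an assumption here, not taken from the notebook: $x$ is a tweet's count vector, $w$ the weight vector, $b$ an intercept, $y_i \in \{0, 1\}$ the label of training tweet $i$, and $\alpha$ the regularization strength. A ridge classifier, the model used in the next cell, recodes the labels as $\pm 1$ and fits the weights by penalized least squares.

```latex
\hat{s}(x) = w^\top x + b,
\qquad
\hat{y} =
\begin{cases}
1 & \text{if } \hat{s}(x) > 0 \\
0 & \text{otherwise}
\end{cases}

\min_{w,\,b} \; \sum_{i=1}^{n} \left( w^\top x_i + b - t_i \right)^2
  + \alpha \lVert w \rVert_2^2,
\qquad
t_i = 2 y_i - 1 \in \{-1, +1\}
```

Tokens with a positive weight push a tweet toward the disaster class and tokens with a negative weight push it away. The penalty term $\alpha \lVert w \rVert_2^2$ keeps all weights small without forcing any of them to exactly zero.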

In [14]:
## Our vectors are really big, so we want to push our model's weights
## toward 0 without completely discounting different words - ridge regression 
## is a good way to do this.
clf = linear_model.RidgeClassifier()
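## estimate performance with 3-fold cross-validation, scored by F1
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
## note: only the last expression in a cell is displayed, so this line
## shows nothing here; use print(scores) to see the three fold scores
scores
## fit on the full training set before predicting on the test set
clf.fit(train_vectors, train_df["target"])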


RidgeClassifier()
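
The cross-validation above is scored with `scoring="f1"`. As a reminder of what that metric measures, here is a minimal plain-Python sketch of the binary F1 score, the harmonic mean of precision and recall. The function name `f1_score_binary` and the toy labels are assumptions for illustration; in practice `sklearn.metrics.f1_score` computes the same quantity.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = disaster tweet, 0 = not a disaster tweet
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# 2 true positives, 1 false positive, 1 false negative
# -> precision = recall = 2/3, so F1 = 2/3
score = f1_score_binary(y_true, y_pred)
print(round(score, 3))  # 0.667
```

Unlike plain accuracy, F1 ignores true negatives, so a model cannot score well just by labelling every tweet as "not a disaster".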

In [16]:
sample_submission = pd.read_csv("submission.csv")
sample_submission["target"] = clf.predict(test_vectors)
sample_submission.to_csv("quick_submission.csv", index=False)