In [66]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

First we'll read in our dataset. The `dropna` method removes every row that contains a missing value (shown as `NaN`).
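To make the effect of `dropna` concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from `tweets.csv`). Only the row with no missing values survives.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one row is missing its tweet, one is missing its score.
toy = pd.DataFrame({
    "tweet": ["feeling fine", None, "so anxious today"],
    "score": [0.0, 1.0, np.nan],
})

# dropna() with default arguments drops any row that has at least one missing value.
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned))
```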

In [72]:
data = pd.read_csv('tweets.csv').dropna()
data

Unnamed: 0,tweet,score
0,idk whats wrong with my i still feel this cons...,1.0
1,"really needs to update "" the game of life "" to...",1.0
2,general pregnancy counselling now available at...,1.0
3,anxiety is the biggest bitch ☹ ️,1.0
4,"#benzodiazepines once rarely prescribed , alwa...",1.0
5,fun fact i just found out earthquakes make anx...,1.0
6,"lol can't even finish a cup , anxiety trips ov...",1.0
7,hello everyone how are we all doing i need to ...,1.0
8,i love random anxiety attack : ),1.0
9,seriously though - “ islamophobia ” is a bulls...,1.0


There are about a million tweets in this dataset, so let's train on a smaller subset. We'll randomly select 100,000 rows using the Pandas `sample` function.

In [74]:
model_data = data.sample(n = 100000)

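We'll now use the _bag of words_ model to represent each tweet as a row in a _matrix_.
The columns of the matrix correspond to words, and each entry records whether that word is in the tweet. (`CountVectorizer` actually stores how many times the word appears; `BernoulliNB` then treats any nonzero count as 1 through its default `binarize=0.0` setting.)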
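As a small illustration of the representation, the sketch below vectorizes two made-up tweets. The names `toy_tweets`, `toy_bow`, and `toy_X` are hypothetical, and `binary=True` is an optional `CountVectorizer` argument used here only so the cells come out as 0/1 instead of counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up tweets, purely for illustration.
toy_tweets = ["anxiety is awful", "i love love coffee"]

# binary=True records presence/absence instead of word counts.
toy_bow = CountVectorizer(binary=True)
toy_X = toy_bow.fit_transform(toy_tweets)

# Column order follows the indices stored in the fitted vocabulary.
vocab = sorted(toy_bow.vocabulary_, key=toy_bow.vocabulary_.get)
print(vocab)
print(toy_X.toarray())  # one row per tweet, one column per word
```

Note that the default tokenizer ignores single-character tokens, so the word "i" does not get a column.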

In [70]:
bow = CountVectorizer()
X = bow.fit_transform(model_data['tweet'])
y = model_data['score']  # labels must come from the same sampled rows as X

We split our data into training and test sets. We'll use 70% of the data for training and 30% for testing.

In [77]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

Here, we initialize and train the classifier using the `fit` function. The `BernoulliNB` classifier in Scikit-Learn models each _feature_ as either 0 or 1, which matches the word-present or word-absent view of the bag of words model.

In [81]:
clf = BernoulliNB()
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

Let's now see how well the classifier performs on our test dataset. To evaluate a classification model we can use accuracy: the fraction of items it classified correctly out of the total number of items.

In [84]:
preds = clf.predict(X_test)
accuracy_score(y_test, preds)  # signature is (y_true, y_pred)

0.9686508296309729
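Accuracy can also be worked out by hand, which may help make the number above less of a black box. This is a minimal sketch on made-up labels (`toy_true` and `toy_pred` are hypothetical), not the notebook's actual predictions.

```python
import numpy as np

# Made-up true labels and predictions for five items.
toy_true = np.array([1, 0, 0, 1, 0])
toy_pred = np.array([1, 0, 1, 1, 0])

# Accuracy is the fraction of positions where prediction and truth agree.
manual_accuracy = np.mean(toy_true == toy_pred)
print(manual_accuracy)  # 4 of the 5 predictions match
```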

That's a high accuracy score, so let's look at how often each class occurs. It's possible that tweets mentioning depression and anxiety are a small portion of the dataset.

In [85]:
data['score'].value_counts()

0.0    798765
1.0    230606
Name: score, dtype: int64

About 78% of the tweets have a score of 0, so a model that always predicted 0 would already be about 78% accurate. Against that baseline, the roughly 97% accuracy from Naive Bayes is a real improvement!
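The majority-class baseline follows directly from the class counts printed above. A short sketch of the arithmetic (the variable names are just for illustration):

```python
# Class counts copied from the value_counts() output above.
counts = {0.0: 798765, 1.0: 230606}

total = sum(counts.values())

# Accuracy of a model that always predicts the most common class.
baseline_accuracy = max(counts.values()) / total
print(round(baseline_accuracy, 3))
```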

Extension: What if we got lucky with our train/test split? This is where cross-validation comes in: you measure how the model performs over several different splits of the data.
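One possible way to try this is Scikit-Learn's `cross_val_score`, sketched below on a tiny made-up dataset so that it runs on its own. The `toy_tweets` and `toy_scores` lists are hypothetical; on the real data you would pass `model_data['tweet']` and `model_data['score']` instead, and probably use more folds. Wrapping the vectorizer and classifier in a pipeline is a design choice made here so that the vocabulary is rebuilt from the training portion of each fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset: 1 = mentions anxiety or depression, 0 = does not.
toy_tweets = [
    "my anxiety is bad today",
    "depression makes everything hard",
    "anxiety attack again tonight",
    "lovely walk in the park",
    "coffee with friends this morning",
    "watching the game tonight",
]
toy_scores = [1, 1, 1, 0, 0, 0]

# Vectorizer and classifier are refit together on each training fold.
pipeline = make_pipeline(CountVectorizer(), BernoulliNB())

# cv=3 gives three train/test splits; each returns one accuracy value.
cv_scores = cross_val_score(pipeline, toy_tweets, toy_scores, cv=3, scoring="accuracy")
print(cv_scores.mean(), cv_scores.std())
```

A small standard deviation across folds would suggest the single-split result was not just luck.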